Correlated Subqueries: Row-by-Row Comparisons

The Problem: Personalized Comparisons

What if each row needs its OWN answer? (e.g., "Mickey's average is 4.2, Minnie's is 4.4")

Imagine: "Find users whose latest rating is above THEIR personal average" - Each user has a different average!

Why Basic Subqueries Aren't Enough

-- WRONG: Regular subquery - whose average is this?
SELECT u.name, l.rating
FROM Users u
JOIN Listens l ON u.user_id = l.user_id
WHERE l.rating > (                    -- Compare each rating to...
    SELECT AVG(rating)                -- GLOBAL average (not personalized!)
    FROM Listens                      -- No correlation to outer query
)                                     -- Everyone gets same threshold: 4.2
-- Problem: Compares everyone to the SAME global average (4.2)

What we really need: Each user compared to THEIR OWN average!

Solution: Correlated Subqueries

Good news: Add one extra line inside basic subquery to personalize (correlate)

    WHERE l2.user_id = u.user_id      -- CORRELATED: References outer u.user_id!

Full Query

-- CORRECT: Each user compared to THEIR personal average
SELECT u.name, l.rating
FROM Users u                          -- OUTER SCOPE: u is visible to inner query
JOIN Listens l ON u.user_id = l.user_id
WHERE l.rating > (                    -- For EACH row, compare to...
    -- INNER SCOPE: Can access outer variables (u.user_id)
    SELECT AVG(l2.rating)             -- This specific user's average
    FROM Listens l2                   -- l2 is LOCAL to this subquery
    WHERE l2.user_id = u.user_id      -- CORRELATED: u.user_id from OUTER scope!
)                                     -- Runs once per outer row
-- Now each row gets its personalized comparison!

# Conceptual execution model:
for user in Users:  # Outer loop
    for listen in Listens:
        if listen.user_id == user.user_id and listen.rating is not None:
            # Inner query: Calculate THIS user's average
            sum = 0
            count = 0
            for l in Listens:  # Inner loop
                if l.user_id == user.user_id and l.rating is not None:
                    sum += l.rating
                    count += 1
            avg = sum / count

            # Check condition
            if listen.rating > avg:
                output(user.name, listen.rating)

Key Insight: The inner query can "see" the outer query's current row!

Example: Users Above Their Own Average

-- Find users who have at least one rating above THEIR personal average
SELECT DISTINCT u.name                -- DISTINCT: Each user appears once
FROM Users u
JOIN Listens l ON u.user_id = l.user_id
WHERE l.rating > (                    -- Check if this rating exceeds...
    -- Calculate THIS user's personal average
    SELECT AVG(l2.rating)             -- Inner query can "see" outer u.user_id
    FROM Listens l2
    WHERE l2.user_id = u.user_id      -- CORRELATED to outer u.user_id
)

Step-by-Step Execution

User	Their Ratings	Their AVG	Ratings > Their AVG	Result
Mickey	4.5, 4.2, 3.9	4.2	4.5 > 4.2 ✓	Include
Minnie	4.7, 4.6, 3.9	4.4	4.7 > 4.4 ✓, 4.6 > 4.4 ✓	Include
Daffy	2.9, 4.9, NULL	3.9	4.9 > 3.9 ✓	Include
Pluto	(no ratings)	NULL	(no comparisons)	Exclude

Output

name	Notes
Mickey	Has rating 4.5 above their 4.2 average
Minnie	Has ratings 4.7, 4.6 above their 4.4 average
Daffy	Has rating 4.9 above their 3.9 average

Each user gets their OWN personalized average
Different threshold for each row
This is impossible with basic subqueries!

Example: Users with High Ratings (EXISTS Pattern)

-- Find users with at least one high rating (>4.5)
SELECT u.name                         -- For each user...
FROM Users u                          -- OUTER SCOPE: u variable is accessible to subquery
WHERE EXISTS (                        -- Check if ANY rows exist where...
    -- INNER SCOPE: This subquery can "see" the outer u variable
    -- Inner query: Check if THIS user has any rating > 4.5
    SELECT 1                          -- SELECT 1 is shorthand convention (don't care about value for EXISTs). 
                                      -- You could also SELECT user_id
    FROM Listens l                    -- l is LOCAL to subquery (different scope than outer)
    WHERE l.user_id = u.user_id       -- CORRELATED: u.user_id from OUTER scope!
      AND l.rating > 4.5              -- Additional filter on rating
)                                     -- Runs once per user (4 times for 4 users)

Execution: Like Nested Loops with Scoped Variables

# Conceptual execution model:
for user in Users:  # Outer loop
    found = False
    for listen in Listens:  # Inner loop 
        if listen.user_id == user.user_id and listen.rating > 4.5:
            found = True
            break  # EXISTS stops at first match
    if found:
        output(user.name)

Row-by-Row Execution

Outer Row	user_id	name	Inner Query Check	EXISTS?	Result
Row 1	1	Mickey	Check ratings for user 1: {4.5, 4.2, 3.9}	✗ 4.5 not > 4.5	Exclude
Row 2	2	Minnie	Check ratings for user 2: {4.7, 4.6, 3.9}	✓ 4.7 > 4.5	Include
Row 3	3	Daffy	Check ratings for user 3: {2.9, 4.9, NULL}	✓ 4.9 > 4.5	Include
Row 4	4	Pluto	Check ratings for user 4: {}	✗ No listens	Exclude

Output

	name
	Minnie
	Daffy

Example: Users with NO High Ratings (NOT EXISTS Pattern)

-- Find users who have NO ratings > 4.5
SELECT u.name
FROM Users u
WHERE NOT EXISTS (                    -- Check if NO rows exist where...
    SELECT 1
    FROM Listens l
    WHERE l.user_id = u.user_id
      AND l.rating > 4.5
)                                     -- NOT inverts the result: TRUE if no matches

Row-by-Row Execution

Outer Row	user_id	name	Inner Query Check	NOT EXISTS?	Result
Row 1	1	Mickey	Check ratings for user 1: {4.5, 4.2, 3.9}	✓ None > 4.5	Include
Row 2	2	Minnie	Check ratings for user 2: {4.7, 4.6, 3.9}	✗ Has 4.7, 4.6 > 4.5	Exclude
Row 3	3	Daffy	Check ratings for user 3: {2.9, 4.9, NULL}	✗ Has 4.9 > 4.5	Exclude
Row 4	4	Pluto	Check ratings for user 4: {}	✓ No listens	Include

Output

	name	Notes
	Mickey	Highest rating is 4.5 (not > 4.5)
	Pluto	Has no ratings at all

Correlated vs Regular: Key Differences

Aspect	Regular Subquery	Correlated Subquery
Execution	Runs ONCE total	Runs ONCE PER outer row
Independence	Standalone query	Depends on outer query
Use Case	Global comparisons	Row-specific comparisons
Performance	Usually faster	N+1 query problem
Example	"Above global average"	"Above YOUR average"

When to Use Correlated Subqueries

✅ Good Use Cases:

Personal comparisons (above YOUR average)
Relative rankings (best in YOUR category)
Row-specific existence checks (do YOU have...)
Self-referential queries (employees earning more than their manager)

❌ Avoid When:

Simple global comparisons suffice
Performance is critical (consider JOIN alternatives)
Query complexity becomes unreadable

Common Patterns

Personal Aggregates: Compare to your own AVG/MAX/MIN
Existence per Entity: EXISTS with correlation
Ranking within Groups: Find top N per category
Self-Joins Alternative: Sometimes clearer than complex JOINs

Common Mistakes with Correlated Subqueries

Forgetting Correlation

-- WRONG: Not correlated - returns all or nothing
WHERE EXISTS (
    SELECT 1 FROM Listens WHERE rating > 4.5  -- No correlation!
)                                             -- If ANY user has >4.5, ALL pass

-- CORRECT: Correlate with outer query's user_id
WHERE EXISTS (
    SELECT 1 FROM Listens l 
    WHERE l.user_id = u.user_id               -- MUST correlate!
      AND rating > 4.5                        -- Check THIS user's ratings
)

Problem: Without correlation, EXISTS checks globally instead of per-row

Performance Anti-Pattern (N+1 Problem)

-- INEFFICIENT: Correlated subquery runs once per user
SELECT name,
       (SELECT COUNT(*) FROM Listens l WHERE l.user_id = u.user_id) AS COUNT
FROM Users u                          -- N+1 query problem: 1 + 4 = 5 queries!

-- BETTER: JOIN executes once for all users
SELECT u.name, COUNT(l.listen_id) AS COUNT
FROM Users u
LEFT JOIN Listens l ON u.user_id = l.user_id  -- Single pass through data
GROUP BY u.user_id, u.name                    -- Much more efficient!

Problem: Subquery runs once per row - with 1000 users, that's 1001 queries total!

Note: Modern query optimizers may recognize this pattern and automatically rewrite it as a JOIN or use other optimization techniques like subquery caching. However, explicit JOINs are still preferred for clarity and guaranteed performance.