Correlated Subqueries: Row-by-Row Comparisons

The Problem: Personalized Comparisons

Imagine: "Find users whose latest rating is above THEIR personal average" - Each user has a different average!

Why Basic Subqueries Aren't Enough

-- WRONG: Regular subquery - whose average is this?
SELECT u.name, l.rating
FROM Users u
JOIN Listens l ON u.user_id = l.user_id
WHERE l.rating > (                    -- Compare each rating to...
    SELECT AVG(rating)                -- GLOBAL average (not personalized!)
    FROM Listens                      -- No correlation to outer query
)                                     -- Everyone gets same threshold: 4.2
-- Problem: Compares everyone to the SAME global average (4.2)

What we really need: Each user compared to THEIR OWN average!

Basic Subquery: One Answer for All SELECT AVG(rating) FROM Listens 4.2 Mickey Minnie Daffy Find rows where rating for: AVG: 4.2 Mickey > 4.2? AVG: 4.2 Minnie > 4.2? AVG: 4.2 Daffy > 4.2? Runs ONCE total Same threshold for everyone Correlated: Personalized Answers For each user: Calculate THEIR AVG Find rows where rating for: Mickey AVG: 4.2 Mickey > 4.2? Minnie AVG: 4.4 Minnie > 4.4? Daffy AVG: 3.9 Daffy > 3.9? Runs N times (once per user) Personalized threshold per user Basic: Global Comparison vs Correlated: Personal Comparison

Solution: Correlated Subqueries

Good news: Add one extra line inside basic subquery to personalize (correlate)

    WHERE l2.user_id = u.user_id      -- CORRELATED: References outer u.user_id!

Full Query

-- CORRECT: Each user compared to THEIR personal average
SELECT u.name, l.rating
FROM Users u
JOIN Listens l ON u.user_id = l.user_id
WHERE l.rating > (                    -- For EACH row, compare to...
    SELECT AVG(l2.rating)             -- This specific user's average
    FROM Listens l2
    WHERE l2.user_id = u.user_id      -- CORRELATED: References outer u.user_id!
)                                     -- Runs once per outer row
-- Now each row gets its personalized comparison!
# Conceptual execution model:
for user in Users:  # Outer loop
    for listen in Listens:
        if listen.user_id == user.user_id and listen.rating is not None:
            # Inner query: Calculate THIS user's average
            sum = 0
            count = 0
            for l in Listens:  # Inner loop
                if l.user_id == user.user_id and l.rating is not None:
                    sum += l.rating
                    count += 1
            avg = sum / count

            # Check condition
            if listen.rating > avg:
                output(user.name, listen.rating)

Key Insight: The inner query can "see" the outer query's current row!


Example: Users Above Their Own Average

-- Find users who have at least one rating above THEIR personal average
SELECT DISTINCT u.name                -- DISTINCT: Each user appears once
FROM Users u
JOIN Listens l ON u.user_id = l.user_id
WHERE l.rating > (                    -- Check if this rating exceeds...
    -- Calculate THIS user's personal average
    SELECT AVG(l2.rating)             -- Inner query can "see" outer u.user_id
    FROM Listens l2
    WHERE l2.user_id = u.user_id      -- CORRELATED to outer u.user_id
)

Step-by-Step Execution

UserTheir RatingsTheir AVGRatings > Their AVGResult
Mickey4.5, 4.2, 3.94.24.5 > 4.2 ✓Include
Minnie4.7, 4.6, 3.94.44.7 > 4.4 ✓, 4.6 > 4.4 ✓Include
Daffy2.9, 4.9, NULL3.94.9 > 3.9 ✓Include
Pluto(no ratings)NULL(no comparisons)Exclude

Output

nameNotes
MickeyHas rating 4.5 above their 4.2 average
MinnieHas ratings 4.7, 4.6 above their 4.4 average
DaffyHas rating 4.9 above their 3.9 average

Example: Users with High Ratings (EXISTS Pattern)

-- Find users with at least one high rating (>4.5)
SELECT u.name                         -- For each user...
FROM Users u
WHERE EXISTS (                        -- Check if ANY rows exist where...
    -- Inner query: Check if THIS user has any rating > 4.5
    SELECT 1                          -- SELECT 1 is shorthand convention (don't care about value for EXISTs). 
                                      -- You could also SELECT user_id
    FROM Listens l
    WHERE l.user_id = u.user_id       -- CORRELATED: References outer u.user_id
      AND l.rating > 4.5              -- Additional filter on rating
)                                     -- Runs once per user (4 times for 4 users)

Execution: Like Nested Loops with Scoping

# Conceptual execution model:
for user in Users:  # Outer loop
    found = False
    for listen in Listens:  # Inner loop 
        if listen.user_id == user.user_id and listen.rating > 4.5:
            found = True
            break  # EXISTS stops at first match
    if found:
        output(user.name)

Row-by-Row Execution

Outer Rowuser_idnameInner Query CheckEXISTS?Result
Row 11MickeyCheck ratings for user 1: {4.5, 4.2, 3.9}✗ 4.5 not > 4.5Exclude
Row 22MinnieCheck ratings for user 2: {4.7, 4.6, 3.9}✓ 4.7 > 4.5Include
Row 33DaffyCheck ratings for user 3: {2.9, 4.9, NULL}✓ 4.9 > 4.5Include
Row 44PlutoCheck ratings for user 4: {}✗ No listensExclude

Output

name
Minnie
Daffy

Example: Users with NO High Ratings (NOT EXISTS Pattern)

-- Find users who have NO ratings > 4.5
SELECT u.name
FROM Users u
WHERE NOT EXISTS (                    -- Check if NO rows exist where...
    SELECT 1
    FROM Listens l
    WHERE l.user_id = u.user_id
      AND l.rating > 4.5
)                                     -- NOT inverts the result: TRUE if no matches

Row-by-Row Execution

Outer Rowuser_idnameInner Query CheckNOT EXISTS?Result
Row 11MickeyCheck ratings for user 1: {4.5, 4.2, 3.9}✓ None > 4.5Include
Row 22MinnieCheck ratings for user 2: {4.7, 4.6, 3.9}✗ Has 4.7, 4.6 > 4.5Exclude
Row 33DaffyCheck ratings for user 3: {2.9, 4.9, NULL}✗ Has 4.9 > 4.5Exclude
Row 44PlutoCheck ratings for user 4: {}✓ No listensInclude

Output

nameNotes
MickeyHighest rating is 4.5 (not > 4.5)
PlutoHas no ratings at all

Correlated vs Regular: Key Differences

Aspect Regular Subquery Correlated Subquery
Execution Runs ONCE total Runs ONCE PER outer row
Independence Standalone query Depends on outer query
Use Case Global comparisons Row-specific comparisons
Performance Usually faster N+1 query problem
Example "Above global average" "Above YOUR average"

When to Use Correlated Subqueries

Good Use Cases:

Avoid When:

Common Patterns

  1. Personal Aggregates: Compare to your own AVG/MAX/MIN

  2. Existence per Entity: EXISTS with correlation

  3. Ranking within Groups: Find top N per category

  4. Self-Joins Alternative: Sometimes clearer than complex JOINs

Common Mistakes with Correlated Subqueries

Forgetting Correlation

-- WRONG: Not correlated - returns all or nothing
WHERE EXISTS (
    SELECT 1 FROM Listens WHERE rating > 4.5  -- No correlation!
)                                             -- If ANY user has >4.5, ALL pass

-- CORRECT: Correlate with outer query's user_id
WHERE EXISTS (
    SELECT 1 FROM Listens l 
    WHERE l.user_id = u.user_id               -- MUST correlate!
      AND rating > 4.5                        -- Check THIS user's ratings
)

Problem: Without correlation, EXISTS checks globally instead of per-row

Performance Anti-Pattern (N+1 Problem)

-- INEFFICIENT: Correlated subquery runs once per user
SELECT name,
       (SELECT COUNT(*) FROM Listens l WHERE l.user_id = u.user_id) AS COUNT
FROM Users u                          -- N+1 query problem: 1 + 4 = 5 queries!

-- BETTER: JOIN executes once for all users
SELECT u.name, COUNT(l.listen_id) AS COUNT
FROM Users u
LEFT JOIN Listens l ON u.user_id = l.user_id  -- Single pass through data
GROUP BY u.user_id, u.name                    -- Much more efficient!

Problem: Subquery runs once per row - with 1000 users, that's 1001 queries total!

Note: Modern query optimizers may recognize this pattern and automatically rewrite it as a JOIN or use other optimization techniques like subquery caching. However, explicit JOINs are still preferred for clarity and guaranteed performance.