Correlated Subqueries: Row-by-Row Comparisons
The Problem: Personalized Comparisons
- What if each row needs its OWN answer? (e.g., "Mickey's average is 4.2, Minnie's is 4.4")
Imagine: "Find users whose latest rating is above THEIR personal average" - Each user has a different average!
Why Basic Subqueries Aren't Enough
-- WRONG: Regular subquery - whose average is this?
SELECT u.name, l.rating
FROM Users u
JOIN Listens l ON u.user_id = l.user_id
WHERE l.rating > ( -- Compare each rating to...
SELECT AVG(rating) -- GLOBAL average (not personalized!)
FROM Listens -- No correlation to outer query
) -- Everyone gets same threshold: 4.2
-- Problem: Compares everyone to the SAME global average (4.2)
What we really need: Each user compared to THEIR OWN average!
Solution: Correlated Subqueries
Good news: Add one extra line inside basic subquery to personalize (correlate)
WHERE l2.user_id = u.user_id -- CORRELATED: References outer u.user_id!
Full Query
-- CORRECT: Each user compared to THEIR personal average
SELECT u.name, l.rating
FROM Users u
JOIN Listens l ON u.user_id = l.user_id
WHERE l.rating > ( -- For EACH row, compare to...
SELECT AVG(l2.rating) -- This specific user's average
FROM Listens l2
WHERE l2.user_id = u.user_id -- CORRELATED: References outer u.user_id!
) -- Runs once per outer row
-- Now each row gets its personalized comparison!
# Conceptual execution model:
for user in Users: # Outer loop
for listen in Listens:
if listen.user_id == user.user_id and listen.rating is not None:
# Inner query: Calculate THIS user's average
sum = 0
count = 0
for l in Listens: # Inner loop
if l.user_id == user.user_id and l.rating is not None:
sum += l.rating
count += 1
avg = sum / count
# Check condition
if listen.rating > avg:
output(user.name, listen.rating)
Key Insight: The inner query can "see" the outer query's current row!
Example: Users Above Their Own Average
-- Find users who have at least one rating above THEIR personal average
SELECT DISTINCT u.name -- DISTINCT: Each user appears once
FROM Users u
JOIN Listens l ON u.user_id = l.user_id
WHERE l.rating > ( -- Check if this rating exceeds...
-- Calculate THIS user's personal average
SELECT AVG(l2.rating) -- Inner query can "see" outer u.user_id
FROM Listens l2
WHERE l2.user_id = u.user_id -- CORRELATED to outer u.user_id
)
Step-by-Step Execution
User | Their Ratings | Their AVG | Ratings > Their AVG | Result | |
---|---|---|---|---|---|
Mickey | 4.5, 4.2, 3.9 | 4.2 | 4.5 > 4.2 ✓ | Include | |
Minnie | 4.7, 4.6, 3.9 | 4.4 | 4.7 > 4.4 ✓, 4.6 > 4.4 ✓ | Include | |
Daffy | 2.9, 4.9, NULL | 3.9 | 4.9 > 3.9 ✓ | Include | |
Pluto | (no ratings) | NULL | (no comparisons) | Exclude |
Output
name | Notes |
---|---|
Mickey | Has rating 4.5 above their 4.2 average |
Minnie | Has ratings 4.7, 4.6 above their 4.4 average |
Daffy | Has rating 4.9 above their 3.9 average |
-
Each user gets their OWN personalized average
-
Different threshold for each row
-
This is impossible with basic subqueries!
Example: Users with High Ratings (EXISTS Pattern)
-- Find users with at least one high rating (>4.5)
SELECT u.name -- For each user...
FROM Users u
WHERE EXISTS ( -- Check if ANY rows exist where...
-- Inner query: Check if THIS user has any rating > 4.5
SELECT 1 -- SELECT 1 is shorthand convention (don't care about value for EXISTs).
-- You could also SELECT user_id
FROM Listens l
WHERE l.user_id = u.user_id -- CORRELATED: References outer u.user_id
AND l.rating > 4.5 -- Additional filter on rating
) -- Runs once per user (4 times for 4 users)
Execution: Like Nested Loops with Scoping
# Conceptual execution model:
for user in Users: # Outer loop
found = False
for listen in Listens: # Inner loop
if listen.user_id == user.user_id and listen.rating > 4.5:
found = True
break # EXISTS stops at first match
if found:
output(user.name)
Row-by-Row Execution
Outer Row | user_id | name | Inner Query Check | EXISTS? | Result | |
---|---|---|---|---|---|---|
Row 1 | 1 | Mickey | Check ratings for user 1: {4.5, 4.2, 3.9} | ✗ 4.5 not > 4.5 | Exclude | |
Row 2 | 2 | Minnie | Check ratings for user 2: {4.7, 4.6, 3.9} | ✓ 4.7 > 4.5 | Include | |
Row 3 | 3 | Daffy | Check ratings for user 3: {2.9, 4.9, NULL} | ✓ 4.9 > 4.5 | Include | |
Row 4 | 4 | Pluto | Check ratings for user 4: {} | ✗ No listens | Exclude |
Output
name | |
---|---|
Minnie | |
Daffy |
Example: Users with NO High Ratings (NOT EXISTS Pattern)
-- Find users who have NO ratings > 4.5
SELECT u.name
FROM Users u
WHERE NOT EXISTS ( -- Check if NO rows exist where...
SELECT 1
FROM Listens l
WHERE l.user_id = u.user_id
AND l.rating > 4.5
) -- NOT inverts the result: TRUE if no matches
Row-by-Row Execution
Outer Row | user_id | name | Inner Query Check | NOT EXISTS? | Result | |
---|---|---|---|---|---|---|
Row 1 | 1 | Mickey | Check ratings for user 1: {4.5, 4.2, 3.9} | ✓ None > 4.5 | Include | |
Row 2 | 2 | Minnie | Check ratings for user 2: {4.7, 4.6, 3.9} | ✗ Has 4.7, 4.6 > 4.5 | Exclude | |
Row 3 | 3 | Daffy | Check ratings for user 3: {2.9, 4.9, NULL} | ✗ Has 4.9 > 4.5 | Exclude | |
Row 4 | 4 | Pluto | Check ratings for user 4: {} | ✓ No listens | Include |
Output
name | Notes | |
---|---|---|
Mickey | Highest rating is 4.5 (not > 4.5) | |
Pluto | Has no ratings at all |
Correlated vs Regular: Key Differences
Aspect | Regular Subquery | Correlated Subquery |
---|---|---|
Execution | Runs ONCE total | Runs ONCE PER outer row |
Independence | Standalone query | Depends on outer query |
Use Case | Global comparisons | Row-specific comparisons |
Performance | Usually faster | N+1 query problem |
Example | "Above global average" | "Above YOUR average" |
When to Use Correlated Subqueries
✅ Good Use Cases:
-
Personal comparisons (above YOUR average)
-
Relative rankings (best in YOUR category)
-
Row-specific existence checks (do YOU have...)
-
Self-referential queries (employees earning more than their manager)
❌ Avoid When:
-
Simple global comparisons suffice
-
Performance is critical (consider JOIN alternatives)
-
Query complexity becomes unreadable
Common Patterns
-
Personal Aggregates: Compare to your own AVG/MAX/MIN
-
Existence per Entity: EXISTS with correlation
-
Ranking within Groups: Find top N per category
-
Self-Joins Alternative: Sometimes clearer than complex JOINs
Common Mistakes with Correlated Subqueries
Forgetting Correlation
-- WRONG: Not correlated - returns all or nothing
WHERE EXISTS (
SELECT 1 FROM Listens WHERE rating > 4.5 -- No correlation!
) -- If ANY user has >4.5, ALL pass
-- CORRECT: Correlate with outer query's user_id
WHERE EXISTS (
SELECT 1 FROM Listens l
WHERE l.user_id = u.user_id -- MUST correlate!
AND rating > 4.5 -- Check THIS user's ratings
)
Problem: Without correlation, EXISTS checks globally instead of per-row
Performance Anti-Pattern (N+1 Problem)
-- INEFFICIENT: Correlated subquery runs once per user
SELECT name,
(SELECT COUNT(*) FROM Listens l WHERE l.user_id = u.user_id) AS COUNT
FROM Users u -- N+1 query problem: 1 + 4 = 5 queries!
-- BETTER: JOIN executes once for all users
SELECT u.name, COUNT(l.listen_id) AS COUNT
FROM Users u
LEFT JOIN Listens l ON u.user_id = l.user_id -- Single pass through data
GROUP BY u.user_id, u.name -- Much more efficient!
Problem: Subquery runs once per row - with 1000 users, that's 1001 queries total!
Note: Modern query optimizers may recognize this pattern and automatically rewrite it as a JOIN or use other optimization techniques like subquery caching. However, explicit JOINs are still preferred for clarity and guaranteed performance.