Query Equivalence: Multiple Paths, Same Result
SQL is flexible - most problems have multiple solutions. Let's compare approaches side-by-side.
Important: Query Equivalence vs Output Equivalence
🔄 Query Equivalence
Same logic, different syntax
These queries are semantically identical - they'll produce the same result for ANY valid input data.
LEFT JOIN + WHERE col IS NULL
→ Always same result!
📊 Output Equivalence
Same result, different logic
Works today by pure luck. Tomorrow's data will break it. Don't be that person who ships "it worked on my machine!" 🎲
WHERE rating >= 4.6
→ Same until someone rates 4.55!
Pattern 1: Users Above Their Personal Average
Problem: Find users who have at least one rating above their personal average
Correlated Subquery
-- Each user's avg calculated per row
SELECT DISTINCT u.name
FROM Users u
JOIN Listens l ON u.user_id = l.user_id
WHERE l.rating > (
SELECT AVG(l2.rating)
FROM Listens l2
WHERE l2.user_id = u.user_id -- Correlate
)
✅ Intuitive for "per-user" logic
❌ Can be slow (N+1 queries)
CTE + JOIN
-- Pre-calculate all averages once
WITH UserAvgs AS (
SELECT user_id,
AVG(rating) AS avg_rating
FROM Listens
GROUP BY user_id
)
SELECT DISTINCT u.name
FROM Users u
JOIN Listens l ON u.user_id = l.user_id
JOIN UserAvgs ua ON u.user_id = ua.user_id
WHERE l.rating > ua.avg_rating
✅ Single pass through data
❌ More verbose
Window Function
-- Calculate avg alongside each row
SELECT DISTINCT name
FROM (
SELECT u.name, l.rating,
AVG(l.rating) OVER (
PARTITION BY u.user_id
) AS user_avg
FROM Users u
JOIN Listens l ON u.user_id = l.user_id
) t
WHERE rating > user_avg
✅ Elegant, modern SQL
❌ Harder to debug
All produce: Mickey, Minnie, Daffy
Pattern 2: Users with NO High Ratings (> 4.5)
Problem: Find users who have no ratings greater than 4.5
NOT EXISTS
-- Check non-existence directly
SELECT u.name
FROM Users u
WHERE NOT EXISTS (
SELECT 1
FROM Listens l
WHERE l.user_id = u.user_id
AND l.rating > 4.5
)
✅ Clear intent
✅ Handles NULLs correctly
LEFT JOIN + NULL
-- Join and check for no matches
SELECT u.name
FROM Users u
LEFT JOIN Listens l
ON u.user_id = l.user_id
AND l.rating > 4.5 -- Condition in ON
WHERE l.listen_id IS NULL -- No match found
✅ Often fastest
❌ Less intuitive
NOT IN
-- List users not in high-raters
SELECT name
FROM Users
WHERE user_id NOT IN (
SELECT user_id
FROM Listens
WHERE rating > 4.5
AND user_id IS NOT NULL -- Critical!
)
⚠️ NULL handling tricky
❌ Fails if subquery returns NULL
All produce: Mickey, Pluto
Pattern 3: Top 3 Songs per User
Problem: Find each user's top 3 highest-rated songs
Window Function (Modern)
-- Rank within each user's songs
WITH RankedListens AS (
SELECT user_id, song_id, rating,
ROW_NUMBER() OVER (
PARTITION BY user_id -- Per user
ORDER BY rating DESC -- Highest first
) AS rn
FROM Listens
WHERE rating IS NOT NULL
)
SELECT u.name, rl.song_id, rl.rating
FROM Users u
JOIN RankedListens rl ON u.user_id = rl.user_id
WHERE rn <= 3 -- Top 3 only
ORDER BY u.name, rl.rating DESC
✅ Standard approach for top-N per group
✅ Handles ties consistently (ROW_NUMBER assigns unique ranks)
Correlated Subquery (Classical)
-- Count better songs for same user
SELECT u.name, l.song_id, l.rating
FROM Users u
JOIN Listens l ON u.user_id = l.user_id
WHERE l.rating IS NOT NULL
AND (
SELECT COUNT(*)
FROM Listens l2
WHERE l2.user_id = l.user_id
AND l2.rating > l.rating -- Count higher ratings
) < 3 -- Less than 3 songs rated higher
ORDER BY u.name, l.rating DESC
✅ Works in older SQL versions
❌ Complex tie handling
❌ May return > 3 rows if ties
Pattern 4: Users Who Listen to Multiple Genres
Problem: Find users who have listened to songs from at least 2 different genres
GROUP BY + HAVING
-- Direct aggregation approach
SELECT u.name
FROM Users u
JOIN Listens l ON u.user_id = l.user_id
JOIN Songs s ON l.song_id = s.song_id
GROUP BY u.user_id, u.name
HAVING COUNT(DISTINCT s.genre) >= 2
✅ Most straightforward
✅ Single pass
✅ Clear intent
CTE for Clarity
-- Break into logical steps
WITH UserGenreCounts AS (
-- Step 1: Count genres per user
SELECT l.user_id,
COUNT(DISTINCT s.genre) AS genre_count
FROM Listens l
JOIN Songs s ON l.song_id = s.song_id
GROUP BY l.user_id
)
-- Step 2: Filter and get names
SELECT u.name
FROM Users u
JOIN UserGenreCounts ugc ON u.user_id = ugc.user_id
WHERE ugc.genre_count >= 2
✅ Easy to debug (can test CTE separately)
✅ Reusable if genre_count needed elsewhere
❌ More verbose for simple case
All produce: Mickey, Minnie, Daffy
When to Use Each Approach?
Approach | Best For | Avoid When |
---|---|---|
Correlated Subquery | • Conceptual clarity • Row-specific logic • Small datasets |
• Large datasets • Complex aggregations |
CTEs | • Multi-step logic • Reusable calculations • Debugging complex queries • Team readability |
• Simple one-off queries • Older SQL versions (pre-2005) |
Window Functions | • Rankings (top-N per group) • Running totals • Peer comparisons • Modern SQL environments |
• Simple filters • When debugging needed • Older SQL versions |
LEFT JOIN + NULL | • Finding non-matches • Performance critical • Complex exclusion logic |
• When readability important • Team unfamiliar with pattern |
GROUP BY + HAVING | • Simple aggregations • Clear business logic • Direct approach |
• Need intermediate results • Complex multi-step logic |
Key Insights
🌙 The 3 AM Test
"Which version would I prefer debugging at 3 AM?"
- Readability beats cleverness
- A "slower" query everyone understands is better than an optimization nobody can fix
- Use CTEs to break complex logic into named steps
⚠️ NULL Handling Differs
Each approach treats NULLs differently
- NOT IN: Fails completely with NULL
- NOT EXISTS: Handles NULLs correctly
- LEFT JOIN: NULLs work as expected
⚡ Performance Surprises
The "obvious" solution isn't always fastest
- Modern optimizers rewrite queries
- CTEs can help or hurt performance
- Test with real data volumes