Query Equivalence
Concept. Two queries are equivalent if they return the same result for ANY input data. They are merely output-equivalent if they happen to agree on the data at hand.
Intuition. Two queries that both return "Mickey, Minnie, Daffy" against today's Listens look identical. Add Pluto with one rating tomorrow, and they may diverge. Real equivalence holds for any input — output equivalence is just today's coincidence.
Important: Query Equivalence vs Output Equivalence
Query Equivalence
Same logic, different syntax
These queries are semantically identical—they'll produce the same result for ANY valid input data.
WHERE NOT EXISTS (...)
LEFT JOIN + WHERE col IS NULL
→ Always same result!
Output Equivalence
Same result, different logic
Queries produce the same result for THIS data, but not for OTHER data.
WHERE rating > 4.5
WHERE rating >= 4.6
→ Same until we get a new row when someone rates 4.55!
This page shows query equivalence: Different syntactic approaches that are semantically identical.
Example 1: Users Above Their Personal Average
Problem: Find users who have at least one rating above their personal average
Correlated Subquery
-- Each user's avg calculated per row
SELECT DISTINCT u.name
FROM Users u
JOIN Listens l ON u.user_id = l.user_id
WHERE l.rating > (
SELECT AVG(l2.rating)
FROM Listens l2
WHERE l2.user_id = u.user_id -- Correlate
)
- Intuitive for "per-user" logic
- Can be slow (N+1 queries)
CTE + JOIN
-- Pre-calculate all averages once
WITH UserAvgs AS (
SELECT user_id,
AVG(rating) AS avg_rating
FROM Listens
GROUP BY user_id
)
SELECT DISTINCT u.name
FROM Users u
JOIN Listens l ON u.user_id = l.user_id
JOIN UserAvgs ua ON u.user_id = ua.user_id
WHERE l.rating > ua.avg_rating
- Single pass through data
- More verbose
Window Function
-- Calculate avg alongside each row
SELECT DISTINCT name
FROM (
SELECT u.name, l.rating,
AVG(l.rating) OVER (
PARTITION BY u.user_id
) AS user_avg
FROM Users u
JOIN Listens l ON u.user_id = l.user_id
) t
WHERE rating > user_avg
- Modern SQL with Window Functions
- Harder to debug
All produce: Mickey, Minnie, Daffy
Example 2: Users with NO High Ratings (> 4.5)
Problem: Find users who have no ratings greater than 4.5
NOT EXISTS
-- Check non-existence directly
SELECT u.name
FROM Users u
WHERE NOT EXISTS (
SELECT 1
FROM Listens l
WHERE l.user_id = u.user_id
AND l.rating > 4.5
)
- Clear intent
- Handles NULLs correctly
LEFT JOIN + NULL
-- Join and check for no matches
SELECT u.name
FROM Users u
LEFT JOIN Listens l
ON u.user_id = l.user_id
AND l.rating > 4.5 -- Condition in ON
WHERE l.listen_id IS NULL -- No match found
- Often fastest
- Less intuitive
NOT IN
-- List users not in high-raters
SELECT name
FROM Users
WHERE user_id NOT IN (
SELECT user_id
FROM Listens
WHERE rating > 4.5
AND user_id IS NOT NULL -- Critical!
)
- NULL handling tricky
All produce: Mickey, Pluto
⚠️ Warning about NOT IN: If the subquery returns ANY NULL values, NOT IN returns no results! Always add "AND column IS NOT NULL" to the subquery. See the NULL in NOT IN Death Trap →
Example 3: Top 3 Songs per User
Problem: Find each user's top 3 highest-rated songs
Window Function (Modern)
-- Rank within each user's songs
WITH RankedListens AS (
SELECT user_id, song_id, rating,
ROW_NUMBER() OVER (
PARTITION BY user_id -- Per user
ORDER BY rating DESC -- Highest first
) AS rn
FROM Listens
WHERE rating IS NOT NULL
)
SELECT u.name, rl.song_id, rl.rating
FROM Users u
JOIN RankedListens rl ON u.user_id = rl.user_id
WHERE rn <= 3 -- Top 3 only
ORDER BY u.name, rl.rating DESC
- Standard approach for top-N per group
- Handles ties consistently (ROW_NUMBER assigns unique ranks)
Correlated Subquery (Classical)
-- Count better songs for same user
SELECT u.name, l.song_id, l.rating
FROM Users u
JOIN Listens l ON u.user_id = l.user_id
WHERE l.rating IS NOT NULL
AND (
SELECT COUNT(*)
FROM Listens l2
WHERE l2.user_id = l.user_id
AND l2.rating > l.rating -- Count higher ratings
) < 3 -- Less than 3 songs rated higher
ORDER BY u.name, l.rating DESC
- Works in older SQL versions
- Complex tie handling
- May return > 3 rows if ties
Example 4: Users Who Listen to Multiple Genres
Problem: Find users who have listened to songs from at least 2 different genres
GROUP BY + HAVING
-- Direct aggregation approach
SELECT u.name
FROM Users u
JOIN Listens l ON u.user_id = l.user_id
JOIN Songs s ON l.song_id = s.song_id
GROUP BY u.user_id, u.name
HAVING COUNT(DISTINCT s.genre) >= 2
- Most straightforward
- Single pass
- Clear intent
CTE for Clarity
-- Break into logical steps
WITH UserGenreCounts AS (
-- Step 1: Count genres per user
SELECT l.user_id,
COUNT(DISTINCT s.genre) AS genre_count
FROM Listens l
JOIN Songs s ON l.song_id = s.song_id
GROUP BY l.user_id
)
-- Step 2: Filter and get names
SELECT u.name
FROM Users u
JOIN UserGenreCounts ugc ON u.user_id = ugc.user_id
WHERE ugc.genre_count >= 2
- Easy to debug (can test CTE separately)
- Reusable if genre_count needed elsewhere
- More verbose for simple case
All produce: Mickey, Minnie, Daffy
When to Use Each Approach?
| Approach | Best For | Avoid When |
|---|---|---|
| Correlated Subquery | • Conceptual clarity • Row-specific logic • Small datasets |
• Large datasets • Complex aggregations |
| CTEs | • Multi-step logic • Reusable calculations • Debugging complex queries • Team readability |
• Simple one-off queries • Older SQL versions (pre-2005) |
| Window Functions | • Rankings (top-N per group) • Running totals • Peer comparisons • Modern SQL environments |
• Simple filters • When debugging needed • Older SQL versions |
| LEFT JOIN + NULL | • Finding non-matches • Performance critical • Complex exclusion logic |
• When readability important • Team unfamiliar with pattern |
| GROUP BY + HAVING | • Simple aggregations • Clear logic • Direct approach |
• Need intermediate results • Complex multi-step logic |
Remember: Prioritize human readability over syntactic cleverness. Trust the query optimizer to handle execution performance.