Query Equivalence: Multiple Paths, Same Result

SQL is flexible - most problems have multiple solutions. Let's compare approaches side-by-side.

Important: Query Equivalence vs Output Equivalence

🔄 Query Equivalence

Same logic, different syntax

These queries are semantically identical - they'll produce the same result for ANY valid input data.

      WHERE NOT EXISTS (...)  

      LEFT JOIN + WHERE col IS NULL

      → Always same result!

📊 Output Equivalence

Same result, different logic

Works today by pure luck. Tomorrow's data will break it. Don't be that person who ships "it worked on my machine!" 🎲

      WHERE rating > 4.5

      WHERE rating >= 4.6

      → Same until someone rates 4.55!

✨ This page shows query equivalence: Different syntactic approaches that are semantically identical.

Pattern 1: Users Above Their Personal Average

Problem: Find users who have at least one rating above their personal average

Correlated Subquery

-- Each user's avg calculated per row
SELECT DISTINCT u.name
FROM Users u
JOIN Listens l ON u.user_id = l.user_id
WHERE l.rating > (
    SELECT AVG(l2.rating)
    FROM Listens l2
    WHERE l2.user_id = u.user_id  -- Correlate
)

✅ Intuitive for "per-user" logic
❌ Can be slow (N+1 queries)

CTE + JOIN

-- Pre-calculate all averages once
WITH UserAvgs AS (
    SELECT user_id, 
           AVG(rating) AS avg_rating
    FROM Listens
    GROUP BY user_id
)
SELECT DISTINCT u.name
FROM Users u
JOIN Listens l ON u.user_id = l.user_id
JOIN UserAvgs ua ON u.user_id = ua.user_id
WHERE l.rating > ua.avg_rating

✅ Single pass through data
❌ More verbose

Window Function

-- Calculate avg alongside each row
SELECT DISTINCT name
FROM (
    SELECT u.name, l.rating,
           AVG(l.rating) OVER (
               PARTITION BY u.user_id
           ) AS user_avg
    FROM Users u
    JOIN Listens l ON u.user_id = l.user_id
) t
WHERE rating > user_avg

✅ Elegant, modern SQL
❌ Harder to debug

All produce: Mickey, Minnie, Daffy

Pattern 2: Users with NO High Ratings (> 4.5)

Problem: Find users who have no ratings greater than 4.5

NOT EXISTS

-- Check non-existence directly
SELECT u.name
FROM Users u
WHERE NOT EXISTS (
    SELECT 1
    FROM Listens l
    WHERE l.user_id = u.user_id
      AND l.rating > 4.5
)

✅ Clear intent
✅ Handles NULLs correctly

LEFT JOIN + NULL

-- Join and check for no matches
SELECT u.name
FROM Users u
LEFT JOIN Listens l 
    ON u.user_id = l.user_id
    AND l.rating > 4.5  -- Condition in ON
WHERE l.listen_id IS NULL  -- No match found

✅ Often fastest
❌ Less intuitive

NOT IN

-- List users not in high-raters
SELECT name
FROM Users
WHERE user_id NOT IN (
    SELECT user_id
    FROM Listens
    WHERE rating > 4.5
      AND user_id IS NOT NULL  -- Critical!
)

⚠️ NULL handling tricky
❌ Fails if subquery returns NULL

All produce: Mickey, Pluto

⚠️ Warning about NOT IN: If the subquery returns ANY NULL values, NOT IN returns no results! Always add "AND column IS NOT NULL" to the subquery. See the NULL in NOT IN Death Trap →

Pattern 3: Top 3 Songs per User

Problem: Find each user's top 3 highest-rated songs

Window Function (Modern)

-- Rank within each user's songs
WITH RankedListens AS (
    SELECT user_id, song_id, rating,
           ROW_NUMBER() OVER (
               PARTITION BY user_id  -- Per user
               ORDER BY rating DESC  -- Highest first
           ) AS rn
    FROM Listens
    WHERE rating IS NOT NULL
)
SELECT u.name, rl.song_id, rl.rating
FROM Users u
JOIN RankedListens rl ON u.user_id = rl.user_id
WHERE rn <= 3  -- Top 3 only
ORDER BY u.name, rl.rating DESC

✅ Standard approach for top-N per group
✅ Handles ties consistently (ROW_NUMBER assigns unique ranks)

Correlated Subquery (Classical)

-- Count better songs for same user
SELECT u.name, l.song_id, l.rating
FROM Users u
JOIN Listens l ON u.user_id = l.user_id
WHERE l.rating IS NOT NULL
  AND (
    SELECT COUNT(*)
    FROM Listens l2
    WHERE l2.user_id = l.user_id
      AND l2.rating > l.rating  -- Count higher ratings
) < 3  -- Less than 3 songs rated higher
ORDER BY u.name, l.rating DESC

✅ Works in older SQL versions
❌ Complex tie handling
❌ May return > 3 rows if ties

Pattern 4: Users Who Listen to Multiple Genres

Problem: Find users who have listened to songs from at least 2 different genres

GROUP BY + HAVING

-- Direct aggregation approach
SELECT u.name
FROM Users u
JOIN Listens l ON u.user_id = l.user_id
JOIN Songs s ON l.song_id = s.song_id
GROUP BY u.user_id, u.name
HAVING COUNT(DISTINCT s.genre) >= 2

✅ Most straightforward
✅ Single pass
✅ Clear intent

CTE for Clarity

-- Break into logical steps
WITH UserGenreCounts AS (
    -- Step 1: Count genres per user
    SELECT l.user_id,
           COUNT(DISTINCT s.genre) AS genre_count
    FROM Listens l
    JOIN Songs s ON l.song_id = s.song_id
    GROUP BY l.user_id
)
-- Step 2: Filter and get names
SELECT u.name
FROM Users u
JOIN UserGenreCounts ugc ON u.user_id = ugc.user_id
WHERE ugc.genre_count >= 2

✅ Easy to debug (can test CTE separately)
✅ Reusable if genre_count needed elsewhere
❌ More verbose for simple case

All produce: Mickey, Minnie, Daffy

When to Use Each Approach?

Approach	Best For	Avoid When
Correlated Subquery	• Conceptual clarity • Row-specific logic • Small datasets	• Large datasets • Complex aggregations
CTEs	• Multi-step logic • Reusable calculations • Debugging complex queries • Team readability	• Simple one-off queries • Older SQL versions (pre-2005)
Window Functions	• Rankings (top-N per group) • Running totals • Peer comparisons • Modern SQL environments	• Simple filters • When debugging needed • Older SQL versions
LEFT JOIN + NULL	• Finding non-matches • Performance critical • Complex exclusion logic	• When readability important • Team unfamiliar with pattern
GROUP BY + HAVING	• Simple aggregations • Clear business logic • Direct approach	• Need intermediate results • Complex multi-step logic

Key Insights

🌙 The 3 AM Test

"Which version would I prefer debugging at 3 AM?"

Readability beats cleverness
A "slower" query everyone understands is better than an optimization nobody can fix
Use CTEs to break complex logic into named steps

⚠️ NULL Handling Differs

Each approach treats NULLs differently

NOT IN: Fails completely with NULL
NOT EXISTS: Handles NULLs correctly
LEFT JOIN: NULLs work as expected

⚡ Performance Surprises

The "obvious" solution isn't always fastest

Modern optimizers rewrite queries
CTEs can help or hurt performance
Test with real data volumes