Query Equivalence

Concept. Two queries are equivalent if they return the same result for ANY input data. They are merely output-equivalent if they happen to agree on the data at hand.

Intuition. Two queries that both return "Mickey, Minnie, Daffy" against today's Listens look identical. Add Pluto with one rating tomorrow, and they may diverge. Real equivalence holds for any input — output equivalence is just today's coincidence.

Important: Query Equivalence vs Output Equivalence

Query Equivalence

Same logic, different syntax

These queries are semantically identical—they'll produce the same result for ANY valid input data.

WHERE NOT EXISTS (...)  

LEFT JOIN + WHERE col IS NULL

→ Always same result!

Output Equivalence

Same result, different logic

Queries produce the same result for THIS data, but not for OTHER data.

WHERE rating > 4.5

WHERE rating >= 4.6

→ Same until we get a new row when someone rates 4.55!

This page shows query equivalence: Different syntactic approaches that are semantically identical.

Example 1: Users Above Their Personal Average

Problem: Find users who have at least one rating above their personal average

Correlated Subquery

-- Each user's avg calculated per row
SELECT DISTINCT u.name
FROM Users u
JOIN Listens l ON u.user_id = l.user_id
WHERE l.rating > (
    SELECT AVG(l2.rating)
    FROM Listens l2
    WHERE l2.user_id = u.user_id  -- Correlate
)

Intuitive for "per-user" logic
Can be slow (N+1 queries)

CTE + JOIN

-- Pre-calculate all averages once
WITH UserAvgs AS (
    SELECT user_id,
           AVG(rating) AS avg_rating
    FROM Listens
    GROUP BY user_id
)
SELECT DISTINCT u.name
FROM Users u
JOIN Listens l ON u.user_id = l.user_id
JOIN UserAvgs ua ON u.user_id = ua.user_id
WHERE l.rating > ua.avg_rating

Single pass through data
More verbose

Window Function

-- Calculate avg alongside each row
SELECT DISTINCT name
FROM (
    SELECT u.name, l.rating,
           AVG(l.rating) OVER (
               PARTITION BY u.user_id
           ) AS user_avg
    FROM Users u
    JOIN Listens l ON u.user_id = l.user_id
) t
WHERE rating > user_avg

Modern SQL with Window Functions
Harder to debug

All produce: Mickey, Minnie, Daffy

Example 2: Users with NO High Ratings (> 4.5)

Problem: Find users who have no ratings greater than 4.5

NOT EXISTS

-- Check non-existence directly
SELECT u.name
FROM Users u
WHERE NOT EXISTS (
    SELECT 1
    FROM Listens l
    WHERE l.user_id = u.user_id
      AND l.rating > 4.5
)

Clear intent
Handles NULLs correctly

LEFT JOIN + NULL

-- Join and check for no matches
SELECT u.name
FROM Users u
LEFT JOIN Listens l
    ON u.user_id = l.user_id
    AND l.rating > 4.5  -- Condition in ON
WHERE l.listen_id IS NULL  -- No match found

Often fastest
Less intuitive

NOT IN

-- List users not in high-raters
SELECT name
FROM Users
WHERE user_id NOT IN (
    SELECT user_id
    FROM Listens
    WHERE rating > 4.5
      AND user_id IS NOT NULL  -- Critical!
)

NULL handling tricky

All produce: Mickey, Pluto

⚠️ Warning about NOT IN: If the subquery returns ANY NULL values, NOT IN returns no results! Always add "AND column IS NOT NULL" to the subquery. See the NULL in NOT IN Death Trap →

Example 3: Top 3 Songs per User

Problem: Find each user's top 3 highest-rated songs

Window Function (Modern)

-- Rank within each user's songs
WITH RankedListens AS (
    SELECT user_id, song_id, rating,
           ROW_NUMBER() OVER (
               PARTITION BY user_id  -- Per user
               ORDER BY rating DESC  -- Highest first
           ) AS rn
    FROM Listens
    WHERE rating IS NOT NULL
)
SELECT u.name, rl.song_id, rl.rating
FROM Users u
JOIN RankedListens rl ON u.user_id = rl.user_id
WHERE rn <= 3  -- Top 3 only
ORDER BY u.name, rl.rating DESC

Standard approach for top-N per group
Handles ties consistently (ROW_NUMBER assigns unique ranks)

Correlated Subquery (Classical)

-- Count better songs for same user
SELECT u.name, l.song_id, l.rating
FROM Users u
JOIN Listens l ON u.user_id = l.user_id
WHERE l.rating IS NOT NULL
  AND (
    SELECT COUNT(*)
    FROM Listens l2
    WHERE l2.user_id = l.user_id
      AND l2.rating > l.rating  -- Count higher ratings
) < 3  -- Less than 3 songs rated higher
ORDER BY u.name, l.rating DESC

Works in older SQL versions
Complex tie handling
May return > 3 rows if ties

Example 4: Users Who Listen to Multiple Genres

Problem: Find users who have listened to songs from at least 2 different genres

GROUP BY + HAVING

-- Direct aggregation approach
SELECT u.name
FROM Users u
JOIN Listens l ON u.user_id = l.user_id
JOIN Songs s ON l.song_id = s.song_id
GROUP BY u.user_id, u.name
HAVING COUNT(DISTINCT s.genre) >= 2

Most straightforward
Single pass
Clear intent

CTE for Clarity

-- Break into logical steps
WITH UserGenreCounts AS (
    -- Step 1: Count genres per user
    SELECT l.user_id,
           COUNT(DISTINCT s.genre) AS genre_count
    FROM Listens l
    JOIN Songs s ON l.song_id = s.song_id
    GROUP BY l.user_id
)
-- Step 2: Filter and get names
SELECT u.name
FROM Users u
JOIN UserGenreCounts ugc ON u.user_id = ugc.user_id
WHERE ugc.genre_count >= 2

Easy to debug (can test CTE separately)
Reusable if genre_count needed elsewhere
More verbose for simple case

All produce: Mickey, Minnie, Daffy

When to Use Each Approach?

Approach	Best For	Avoid When
Correlated Subquery	• Conceptual clarity • Row-specific logic • Small datasets	• Large datasets • Complex aggregations
CTEs	• Multi-step logic • Reusable calculations • Debugging complex queries • Team readability	• Simple one-off queries • Older SQL versions (pre-2005)
Window Functions	• Rankings (top-N per group) • Running totals • Peer comparisons • Modern SQL environments	• Simple filters • When debugging needed • Older SQL versions
LEFT JOIN + NULL	• Finding non-matches • Performance critical • Complex exclusion logic	• When readability important • Team unfamiliar with pattern
GROUP BY + HAVING	• Simple aggregations • Clear logic • Direct approach	• Need intermediate results • Complex multi-step logic

Remember: Prioritize human readability over syntactic cleverness. Trust the query optimizer to handle execution performance.