SQL Problem Solving: Writing Queries

Transform problems into SQL solutions using patterns and references

Master Reference: Problem → Pattern → SQL Tool

Problem Words/Phrases What You Need SQL Tool Example Pattern
"total", "average", "count" Calculate aggregate values SUM(), AVG(), COUNT() SELECT SUM(sales), AVG(price)
"for each", "per category", "by department" Group results GROUP BY GROUP BY department
"having more than", "with at least" Filter after aggregation HAVING HAVING COUNT(*) > 5
"above/below average", "more than max" Single calculated value Subquery WHERE salary > (SELECT AVG...)
"who never", "doesn't exist", "except" Find exclusions NOT IN / NOT EXISTS WHERE id NOT IN (SELECT...) 🔻 #1
"step-by-step", "first...then", "funnel" Build in steps CTEs (WITH) WITH step1 AS (...) 🔻 #1
"top N", "rank", "nth best", "percentile" Rank within groups ROW_NUMBER() OVER ROW_NUMBER() OVER (PARTITION BY...) 🪜 #2
"running total", "cumulative", "up to" Calculate running values SUM() OVER (ORDER BY) SUM(x) OVER (ORDER BY date) 📈 #3
"compared to average", "vs peers" Compare to group metrics AVG() OVER value - AVG(value) OVER (...) 📊 #4
"change from", "growth", "vs previous" Detect changes over time LAG() / LEAD() LAG(value) OVER (ORDER BY date) 🎯 #5

Pattern Recognition: The 5 Essential Patterns

Master these 5 patterns and you'll solve 80% of SQL problems

  • The Funnel finds exclusions, The Ladder ranks items, The Timeline tracks changes, The Comparison measures performance, The Exception finds outliers
  • Experiment and get comfortable with these patterns in your BigQuery Project1 and Project2

Pattern 1: The Funnel 🔻

Finding users who progress through stages (or don't)

Played Created Shared 1K 450 87

📍 When to use:

  • Users who did A but never B
  • Conversion funnel analysis
  • Multi-step user journeys
  • Drop-off analysis

🔧 SQL:

-- Users: played but never shared
SELECT user_id, name
FROM users
WHERE user_id IN (
  SELECT user_id FROM listens  -- Step 1: Who listened
)
AND user_id NOT IN (
  SELECT user_id FROM shares   -- Step 2: Exclude sharers
                               -- Experiment with NULL in NOT IN death traps
                               -- For our Spotify DB, user_id is not NULL. 
                               -- But in other schemas, it may be  
)

Pattern 2: The Ladder 🪜

Ranking items within groups

Top N per Group 1 2 3

📍 When to use:

  • Top 3 items per category
  • Rank within departments
  • Find nth best performer
  • Percentile rankings

🔧 SQL:

-- Top 3 per genre
SELECT * FROM (
  SELECT song, genre, plays,
    ROW_NUMBER() OVER (
      PARTITION BY genre  -- Reset per genre
      ORDER BY plays DESC -- High plays = #1
    ) AS RANK
  FROM songs
) ranked
WHERE RANK <= 3  -- Top 3 only

Pattern 3: The Timeline 📈

Cumulative and moving calculations

Running Total Time →

📍 When to use:

  • Running total over time
  • Moving averages
  • Month-to-date calculations
  • Cumulative growth metrics

🔧 SQL:

-- Running total: hours accumulated
SELECT 
  date,
  hours,  -- Daily value
  SUM(hours) OVER (
    ORDER BY date  -- Sum: all up to now
  ) AS cumulative_hours
FROM daily_listening

Pattern 4: The Comparison 📊

Individual vs Group Performance

You vs Average You Avg

📍 When to use:

  • Compare to department average
  • Percentile within group
  • Above/below benchmark
  • Peer comparisons

🔧 SQL:

-- User vs tier average
SELECT 
  user_id,
  tier,
  hours,
  hours - AVG(hours) OVER (
    PARTITION BY tier  -- Tier avg
  ) AS vs_avg  -- +ve = above avg
FROM user_stats

Pattern 5: The Exception Finder 🎯

Detecting outliers and anomalies

Outlier Detection -80% +200%

📍 When to use:

  • Sudden drops in activity
  • Unusual spikes
  • Behavior changes
  • Anomaly detection

🔧 SQL:

-- Detect >50% changes
WITH changes AS (
  SELECT *,
    LAG(value) OVER (ORDER BY date) AS prev_value,  -- Previous value
    -- % change = (new - old) / old × 100
    ((value - LAG(value) OVER (ORDER BY date)) * 100.0 / 
     NULLIF(LAG(value) OVER (ORDER BY date), 0)) AS pct_change  -- Prevent div/0
  FROM metrics
)
SELECT * FROM changes 
WHERE ABS(pct_change) > 50  -- Big changes only

Speed Run: Solving a Real Problem

Problem: "Find breakout artists with strong growth in past 3 months (>50% month-over-month)"

Breaking It Down in English First

🧩 Problem Decomposition

"artists" → Analyze each artist separately 📍 Row 2: GROUP BY
">50% growth" → Compare this month vs last month 📍 Row 10: LAG()
"past 3 months" → Filter to Sep, Aug, Jul 2025 📍 Row 3: WHERE filter
"strong" → Only keep artists above threshold 📍 Row 3: WHERE filter
📊 Example: "The Weeknd" Monthly Plays
Sept: 1,000 plays
Oct:  1,600 plays (+60% growth) ✅
Nov:  2,500 plays (+56% growth) ✅
    

This artist shows consistent 50%+ growth → They're a breakout artist!

The SQL Implementation

Step 1: Count Monthly Plays

📍 Uses rows 1-2

Why: Need raw numbers before we can calculate growth
What: Count plays for each artist per month

WITH monthly_plays AS (  -- CTE: named result set
  SELECT 
    artist_id,                                -- Keep artist ID
    DATE_TRUNC('month', play_date) AS month, -- Round to month start
    COUNT(*) AS play_count                   -- Total plays
  FROM listens
  GROUP BY artist_id, month  -- One row per artist-month combo
)
-- Result preview:
-- artist_id | month    | play_count
-- 1234      | 2024-09  | 1000       -- Sept baseline
-- 1234      | 2024-10  | 1600       -- Oct growth  
-- 1234      | 2024-11  | 2500       -- Nov spike

Step 2: Calculate Growth Rate

📍 Uses row 10

Why: Need to know if artist is growing
What: Compare each month to previous using LAG()

, artist_growth AS (      -- Chain another CTE
  SELECT 
    artist_id,            -- Pass through
    month,                -- Pass through  
    play_count,           -- Current month
    LAG(play_count, 1) OVER (         -- Look back 1 row
      PARTITION BY artist_id          -- Reset for each artist
      ORDER BY month                  -- Time sequence
    ) AS prev_month_plays,            -- Previous month value
    -- Growth % = (new - old) / old × 100
    ROUND(                            -- Clean decimals
      (play_count - LAG(play_count, 1) OVER (PARTITION BY artist_id ORDER BY month)) 
      * 100.0 / NULLIF(               -- Percent formula
        LAG(play_count, 1) OVER (PARTITION BY artist_id ORDER BY month), 
        0                             -- Avoid divide by 0
      ),
      1                               -- 1 decimal place
    ) AS growth_pct
  FROM monthly_plays      -- Use CTE from Step 1
)
-- Result preview:
-- artist_id | month    | play_count | prev_month | growth_pct
-- 1234      | 2024-09  | 1000       | NULL       | NULL      -- No prior
-- 1234      | 2024-10  | 1600       | 1000       | 60.0      -- +60%
-- 1234      | 2024-11  | 2500       | 1600       | 56.3      -- +56%

Step 3: Find Breakout Artists

📍 Uses row 3

Why: Filter to only high-growth artists
What: Apply >50% filter + recency filter

SELECT                    -- Final query uses both CTEs
  a.name AS artist_name,  -- Human-readable name
  g.month,                -- Keep month
  g.play_count,           -- Show absolute numbers
  g.growth_pct            -- Show growth %
FROM artist_growth g      -- CTE from Step 2
JOIN artists a            -- Get artist names
  ON g.artist_id = a.artist_id       -- Match IDs
WHERE g.growth_pct > 50              -- Filter: >50% growth
  AND g.month >= DATE_TRUNC('month', CURRENT_DATE) - INTERVAL '3 months'  -- Last 3 months
ORDER BY g.growth_pct DESC,          -- Highest growth first
         g.month DESC                 -- Then by recency

-- Final output:
-- artist_name | month    | play_count | growth_pct
-- The Weeknd  | 2024-11  | 2500       | 56.3      -- Breakout!
-- The Weeknd  | 2024-10  | 1600       | 60.0      -- Consistent
-- Dua Lipa    | 2024-11  | 3200       | 78.0      -- Viral

Key Takeaway: We decomposed "strong recent growth" into 3 clear steps:
1️⃣ Aggregate data (COUNT + GROUP BY) → 2️⃣ Calculate change (LAG) → 3️⃣ Filter results (WHERE)

Pro Tips

Rules Of Thumb

✅ Do This

  • Start simple, add complexity gradually
  • Test with small data first
  • Use CTEs for readability
  • Name things clearly (not t1, t2)
  • Check for NULLs explicitly

❌ Avoid This

  • Deeply nested subqueries (>3 levels)
  • SELECT * in production
  • Forgetting GROUP BY with aggregates
  • Not handling division by zero
  • Ignoring the execution order