GROUP BY: Grouping and Aggregating Data

Core Concept

GROUP BY: Transform Rows into Summary Groups 9 Listen Rows Mickey, song_id=1, rating=4.5 Mickey, song_id=2, rating=4.2 Mickey, song_id=6, rating=3.9 Minnie, song_id=2, rating=4.7 Minnie, song_id=7, rating=4.6 Minnie, song_id=8, rating=3.9 Daffy, song_id=1, rating=2.9 Daffy, song_id=2, rating=4.9 Daffy, song_id=6, rating=NULL GROUP BY user_id 3 User Groups Group: Mickey (user_id=1) COUNT(*) = 3 3 rows grouped AVG(rating) = 4.2 (4.5+4.2+3.9)/3 Group: Minnie (user_id=2) COUNT(*) = 3 3 rows grouped AVG(rating) = 4.4 (4.7+4.6+3.9)/3 Group: Daffy (user_id=3) COUNT(*) = 3 3 rows grouped AVG(rating) = 3.9 (2.9+4.9)/2 (NULL excluded) Each group calculates its own aggregates Key Insight: 9 rows → 3 groups → 3 output rows

Key Rules

  1. Group Formation: Rows with identical values in GROUP BY columns form a group

  2. Aggregate Scope: Functions like COUNT() operate within each group

  3. Execution Order: FROM → WHERE → GROUP BY → SELECT (SELECT always happens last!)

  4. NULL Grouping: NULL values form their own group ⚠️

  5. SELECT Restriction: Only GROUP BY columns and aggregates allowed in SELECT (else UNSAFE) ⚠️

Item Allowed? Example
GROUP BY columns user_id (it's grouped)
Aggregates COUNT(*), AVG(rating)
Other columns song_id (which one?)

Simple GROUP BY

-- Count how many songs each user has listened to
SELECT user_id, COUNT(*) AS listen_count
FROM Listens
GROUP BY user_id  -- One group per unique user_id value

Step 1: Form Groups

listen_iduser_idsong_idratingNotes: GroupsNotes
1114.5Group user1First user_id=1
2124.2Group user1Same user_id=1
3163.9Group user1Same user_id=1
4224.7Group user2First user_id=2
5274.6Group user2Same user_id=2
6283.9Group user2Same user_id=2
7312.9Group user3First user_id=3
8324.9Group user3Same user_id=3
936NULLGroup user3Same user_id=3

Output

user_idlisten_countNotes: RowsNotes
13listen_id 1,2,3Count of Group user1
23listen_id 4,5,6Count of Group user2
33listen_id 7,8,9Count of Group user3

GROUP BY with Multiple Aggregates

-- Get listening stats per user: count, average, max, min ratings
SELECT 
    user_id,
    COUNT(*) AS listen_count,      -- Counts ALL rows in group (including NULLs)
    AVG(rating) AS avg_rating,      -- AVG ignores NULL values
    MAX(rating) AS max_rating,      -- Finds highest non-NULL rating
    MIN(rating) AS min_rating       -- Finds lowest non-NULL rating
FROM Listens
GROUP BY user_id                    -- Each user gets their own stats

Aggregates Per Group

user_idCOUNT(*)AVG CalculationMAXMINNotes
13(4.5+4.2+3.9)/34.53.9Group for User 1
23(4.7+4.6+3.9)/34.73.9Group for User 2
33(2.9+4.9)/24.92.9Group for User 3
NULL not in AVG

Output

user_idlisten_countavg_ratingmax_ratingmin_rating
134.24.53.9
234.44.73.9
333.94.92.9

GROUP BY Multiple Columns

-- Count how many times each user played each song
SELECT 
    user_id,
    song_id,
    COUNT(*) AS play_count,
    AVG(rating) AS avg_rating
FROM Listens
GROUP BY user_id, song_id    -- Multi-column GROUP BY: unique (user,song) pairs
ORDER BY user_id, song_id    -- Sort output: first by user, then by song

Step 1: Form (user_id, song_id) Groups

listen_iduser_idsong_idratingNotes: Group Key
1114.5Group for (1,1)
2124.2Group for (1,2)
3163.9Group for (1,6)
4224.7Group for (2,2)
5274.6Group for (2,7)
6283.9Group for (2,8)
7312.9Group for (3,1)
8324.9Group for (3,2)
936NULLGroup for (3,6)

Output

user_idsong_idplay_countavg_ratingNotes
1114.5Single row group
1214.2Single row group
1613.9Single row group
2214.7Single row group
2714.6Single row group
2813.9Single row group
3112.9Single row group
3214.9Single row group
361NULLNULL in AVG

GROUP BY with WHERE

-- Count high-rated songs (>4.0) per user
SELECT 
    user_id,
    COUNT(*) AS high_rating_count,
    AVG(rating) AS avg_high_rating
FROM Listens
WHERE rating > 4.0            -- WHERE filters BEFORE grouping (NULL > 4.0 = false)
GROUP BY user_id              -- Groups only the filtered rows
ORDER BY user_id              -- Sort by user_id ascending (default)

Step 1: WHERE Filter

listen_iduser_idratingAction: WHERE CheckAction
114.54.5 > 4.0 ✓Keep
214.24.2 > 4.0 ✓Keep
313.93.9 > 4.0 ✗Remove
424.74.7 > 4.0 ✓Keep
524.64.6 > 4.0 ✓Keep
623.93.9 > 4.0 ✗Remove
732.92.9 > 4.0 ✗Remove
834.94.9 > 4.0 ✓Keep
93NULLNULL > 4.0 ?Remove

Step 2: GROUP BY on Filtered Data

user_idCOUNT(*)AVG(rating)Notes: Post-Filtered Rows
12(4.5+4.2)/2 = 4.35Ratings from listen_ids=1,2
22(4.7+4.6)/2 = 4.65Ratings from listen_ids=4,5
314.9/1 = 4.9Ratings from listen_id=8

Output

user_idhigh_rating_countavg_high_rating
124.35
224.65
314.9

GROUP BY with NULL Values

-- Count how many listens for each rating value (including NULL)
SELECT 
    rating,
    COUNT(*) AS COUNT
FROM Listens
GROUP BY rating               -- NULL values form their own group
ORDER BY rating               -- NULLs appear first or last (DB-dependent)

NULL Grouping Behavior

ratingNotes: Groups
2.9Group: 2.9; listen_id = 7
3.9Group: 3.9; listen_id = 3,6
4.2Group: 4.2 ; listen_id = 2
4.5Group: 4.5 ; listen_id = 1
4.6Group: 4.6 ; listen_id = 5
4.7Group: 4.7 ; listen_id = 4
4.9Group: 4.9 ; listen_id = 8
NULLGroup: NULL ; NULLs group together
listen_id = 9

Output

ratingcountNotes
NULL1NULL forms its own group
2.91Single occurrence
3.92Two occurrences
4.21Single occurrence
4.51Single occurrence
4.61Single occurrence
4.71Single occurrence
4.91Single occurrence

COUNT(*) vs COUNT(column)

-- Compare total listens vs rated vs timed listens per user
SELECT 
    user_id,
    COUNT(*) AS total_listens,           -- COUNT(*) includes NULL rows
    COUNT(rating) AS rated_listens,      -- COUNT(column) skips NULL values
    COUNT(listen_time) AS timed_listens  -- Different NULLs = different counts
FROM Listens
GROUP BY user_id
ORDER BY user_id
user_idtotal_listensrated_listenstimed_listensNotes
13321: time=✓
2: time=NULL
3: time=✓
23324: time=NULL
5: time=✓
6: time=✓
33217: time=NULL
8: time=✓
9: time=NULL

COUNT Variants:


Common Patterns


Common Mistakes

The Non-Grouped Column Error; UNSAFE queries

-- WRONG: song_id not in GROUP BY (unsafe query)
SELECT user_id, song_id, COUNT(*)
FROM Listens
GROUP BY user_id

-- CORRECT: Include all non-aggregate columns in GROUP BY
SELECT user_id, song_id, COUNT(*)
FROM Listens
GROUP BY user_id, song_id

Problem: Which song_id to show when user has multiple songs?

WHERE Can't See Aggregates

-- WRONG: WHERE can't filter on aggregates (they don't exist yet)
SELECT user_id, AVG(rating)
FROM Listens
WHERE AVG(rating) > 4.0
GROUP BY user_id

-- CORRECT: Use HAVING to filter groups after aggregation
SELECT user_id, AVG(rating)
FROM Listens
GROUP BY user_id
HAVING AVG(rating) > 4.0

Problem: WHERE executes before grouping; aggregates don't exist yet

The COUNT NULL Trap

-- Misleading: COUNT(rating) skips NULL ratings
SELECT user_id, 
       COUNT(rating) AS total_ratings  -- Skips NULLs!
FROM Listens
GROUP BY user_id

-- Better: Show both total rows and non-NULL ratings
SELECT user_id, 
       COUNT(*) AS total_listens,
       COUNT(rating) AS rated_listens
FROM Listens
GROUP BY user_id

Problem: COUNT(column) behavior with NULL often unexpected

Redundant DISTINCT

-- REDUNDANT: GROUP BY already ensures unique user_id values
SELECT DISTINCT user_id, COUNT(*)
FROM Listens
GROUP BY user_id

-- SIMPLER: GROUP BY alone creates unique groups
SELECT user_id, COUNT(*)
FROM Listens
GROUP BY user_id

Problem: GROUP BY already ensures unique group values