GROUP BY: Grouping and Aggregating Data

Core Concept

Example: Count how many songs each user listened to (transforms 9 rows into 3 user groups)

Key Rules

Group Formation: Rows with identical values in GROUP BY columns form a group
Aggregate Scope: Functions like COUNT() operate within each group
Execution Order: FROM → WHERE → GROUP BY → SELECT (SELECT always happens last!)
NULL Grouping: NULL values form their own group ⚠️
SELECT Restriction: Only GROUP BY columns and aggregates allowed in SELECT (else UNSAFE) ⚠️

Item	Allowed?	Example
GROUP BY columns	✅	`user_id` (it's grouped)
Aggregates	✅	`COUNT(*)`, `AVG(rating)`
Other columns	❌	`song_id` (which one?)

Simple GROUP BY

-- Count how many songs each user has listened to
SELECT user_id, COUNT(*) AS listen_count
FROM Listens
GROUP BY user_id  -- One group per unique user_id value

Step 1: Form Groups

listen_id	user_id	song_id	rating	Notes: Groups	Notes
1	1	1	4.5	Group user1	First user_id=1
2	1	2	4.2	Group user1	Same user_id=1
3	1	6	3.9	Group user1	Same user_id=1
4	2	2	4.7	Group user2	First user_id=2
5	2	7	4.6	Group user2	Same user_id=2
6	2	8	3.9	Group user2	Same user_id=2
7	3	1	2.9	Group user3	First user_id=3
8	3	2	4.9	Group user3	Same user_id=3
9	3	6	NULL	Group user3	Same user_id=3

Output

user_id	listen_count	Notes: Rows	Notes
1	3	listen_id 1,2,3	Count of Group user1
2	3	listen_id 4,5,6	Count of Group user2
3	3	listen_id 7,8,9	Count of Group user3

One group per unique value
Aggregates work per group
Groups are isolated from each other

GROUP BY with Multiple Aggregates

-- Get listening stats per user: count, average, max, min ratings
SELECT 
    user_id,
    COUNT(*) AS listen_count,      -- Counts ALL rows in group (including NULLs)
    AVG(rating) AS avg_rating,      -- AVG ignores NULL values
    MAX(rating) AS max_rating,      -- Finds highest non-NULL rating
    MIN(rating) AS min_rating       -- Finds lowest non-NULL rating
FROM Listens
GROUP BY user_id                    -- Each user gets their own stats

Aggregates Per Group

user_id	COUNT(*)	AVG Calculation	MAX	MIN	Notes
1	3	(4.5+4.2+3.9)/3	4.5	3.9	Group for User 1
2	3	(4.7+4.6+3.9)/3	4.7	3.9	Group for User 2
3	3	(2.9+4.9)/2	4.9	2.9	Group for User 3 NULL not in AVG

Output

user_id	listen_count	avg_rating	max_rating	min_rating
1	3	4.2	4.5	3.9
2	3	4.4	4.7	3.9
3	3	3.9	4.9	2.9

All aggregates in one pass
Each sees only its group
NULLs follow aggregate rules

GROUP BY Multiple Columns

-- Count how many times each user played each song
SELECT 
    user_id,
    song_id,
    COUNT(*) AS play_count,
    AVG(rating) AS avg_rating
FROM Listens
GROUP BY user_id, song_id    -- Multi-column GROUP BY: unique (user,song) pairs
ORDER BY user_id, song_id    -- Sort output: first by user, then by song

Step 1: Form (user_id, song_id) Groups

listen_id	user_id	song_id	rating	Notes: Group Key
1	1	1	4.5	Group for (1,1)
2	1	2	4.2	Group for (1,2)
3	1	6	3.9	Group for (1,6)
4	2	2	4.7	Group for (2,2)
5	2	7	4.6	Group for (2,7)
6	2	8	3.9	Group for (2,8)
7	3	1	2.9	Group for (3,1)
8	3	2	4.9	Group for (3,2)
9	3	6	NULL	Group for (3,6)

Output

user_id	song_id	play_count	avg_rating	Notes
1	1	1	4.5	Single row group
1	2	1	4.2	Single row group
1	6	1	3.9	Single row group
2	2	1	4.7	Single row group
2	7	1	4.6	Single row group
2	8	1	3.9	Single row group
3	1	1	2.9	Single row group
3	2	1	4.9	Single row group
3	6	1	NULL	NULL in AVG

Combinations create groups
Often get one-row groups

GROUP BY with WHERE

-- Count high-rated songs (>4.0) per user
SELECT 
    user_id,
    COUNT(*) AS high_rating_count,
    AVG(rating) AS avg_high_rating
FROM Listens
WHERE rating > 4.0            -- WHERE filters BEFORE grouping (NULL > 4.0 = false)
GROUP BY user_id              -- Groups only the filtered rows
ORDER BY user_id              -- Sort by user_id ascending (default)

Step 1: WHERE Filter

listen_id	user_id	rating	Action: WHERE Check	Action
1	1	4.5	4.5 > 4.0 ✓	Keep
2	1	4.2	4.2 > 4.0 ✓	Keep
3	1	3.9	3.9 > 4.0 ✗	Remove
4	2	4.7	4.7 > 4.0 ✓	Keep
5	2	4.6	4.6 > 4.0 ✓	Keep
6	2	3.9	3.9 > 4.0 ✗	Remove
7	3	2.9	2.9 > 4.0 ✗	Remove
8	3	4.9	4.9 > 4.0 ✓	Keep
9	3	NULL	NULL > 4.0 ?	Remove

Step 2: GROUP BY on Filtered Data

user_id	COUNT(*)	AVG(rating)	Notes: Post-Filtered Rows
1	2	(4.5+4.2)/2 = 4.35	Ratings from listen_ids=1,2
2	2	(4.7+4.6)/2 = 4.65	Ratings from listen_ids=4,5
3	1	4.9/1 = 4.9	Ratings from listen_id=8

Output

user_id	high_rating_count	avg_high_rating
1	2	4.35
2	2	4.65
3	1	4.9

WHERE happens first
Form Groups for kept rows
Aggregates on kept data

GROUP BY with NULL Values

-- Count how many listens for each rating value (including NULL)
SELECT 
    rating,
    COUNT(*) AS COUNT
FROM Listens
GROUP BY rating               -- NULL values form their own group
ORDER BY rating               -- NULLs appear first or last (DB-dependent)

NULL Grouping Behavior

rating	Notes: Groups
2.9	Group: 2.9; listen_id = 7
3.9	Group: 3.9; listen_id = 3,6
4.2	Group: 4.2 ; listen_id = 2
4.5	Group: 4.5 ; listen_id = 1
4.6	Group: 4.6 ; listen_id = 5
4.7	Group: 4.7 ; listen_id = 4
4.9	Group: 4.9 ; listen_id = 8
NULL	Group: NULL ; NULLs group together listen_id = 9

Output

rating	count	Notes
NULL	1	NULL forms its own group
2.9	1	Single occurrence
3.9	2	Two occurrences
4.2	1	Single occurrence
4.5	1	Single occurrence
4.6	1	Single occurrence
4.7	1	Single occurrence
4.9	1	Single occurrence

All NULLs = one group
NULLs match each other here
Sort puts NULLs at edges

COUNT(*) vs COUNT(column)

-- Compare total listens vs rated vs timed listens per user
SELECT 
    user_id,
    COUNT(*) AS total_listens,           -- COUNT(*) includes NULL rows
    COUNT(rating) AS rated_listens,      -- COUNT(column) skips NULL values
    COUNT(listen_time) AS timed_listens  -- Different NULLs = different counts
FROM Listens
GROUP BY user_id
ORDER BY user_id

user_id	total_listens	rated_listens	timed_listens	Notes
1	3	3	2	1: time=✓ 2: time=NULL 3: time=✓
2	3	3	2	4: time=NULL 5: time=✓ 6: time=✓
3	3	2	1	7: time=NULL 8: time=✓ 9: time=NULL

COUNT Variants:

COUNT(*) = all rows (including NULLs)
COUNT(column) = non-NULL values only
COUNT(DISTINCT column) = unique non-NULL values

Common Patterns

Simple Grouping: GROUP BY user_id creates one group per user
Multiple Columns: GROUP BY user_id, song_id groups by combinations
With Aggregates: SELECT user_id, COUNT(*), AVG(rating)
Filter Groups: Use HAVING to filter after grouping
All Columns: GROUP BY with all non-aggregate SELECT columns

Common Mistakes

The Non-Grouped Column Error; UNSAFE queries

-- WRONG: song_id not in GROUP BY (unsafe query)
SELECT user_id, song_id, COUNT(*)
FROM Listens
GROUP BY user_id

-- CORRECT: Include all non-aggregate columns in GROUP BY
SELECT user_id, song_id, COUNT(*)
FROM Listens
GROUP BY user_id, song_id

Problem: Which song_id to show when user has multiple songs?

WHERE Can't See Aggregates

-- WRONG: WHERE can't filter on aggregates (they don't exist yet)
SELECT user_id, AVG(rating)
FROM Listens
WHERE AVG(rating) > 4.0
GROUP BY user_id

-- CORRECT: Use HAVING to filter groups after aggregation
SELECT user_id, AVG(rating)
FROM Listens
GROUP BY user_id
HAVING AVG(rating) > 4.0

Problem: WHERE executes before grouping; aggregates don't exist yet

The COUNT NULL Trap

-- Misleading: COUNT(rating) skips NULL ratings
SELECT user_id, 
       COUNT(rating) AS total_ratings  -- Skips NULLs!
FROM Listens
GROUP BY user_id

-- Better: Show both total rows and non-NULL ratings
SELECT user_id, 
       COUNT(*) AS total_listens,
       COUNT(rating) AS rated_listens
FROM Listens
GROUP BY user_id

Problem: COUNT(column) behavior with NULL often unexpected

Redundant DISTINCT

-- REDUNDANT: GROUP BY already ensures unique user_id values
SELECT DISTINCT user_id, COUNT(*)
FROM Listens
GROUP BY user_id

-- SIMPLER: GROUP BY alone creates unique groups
SELECT user_id, COUNT(*)
FROM Listens
GROUP BY user_id

Problem: GROUP BY already ensures unique group values