GROUP BY: Grouping and Aggregating Data
Core Concept
- Example: Count how many songs each user listened to (transforms 9 rows into 3 user groups)
Key Rules
-
Group Formation: Rows with identical values in GROUP BY columns form a group
-
Aggregate Scope: Functions like COUNT() operate within each group
-
Execution Order: FROM → WHERE → GROUP BY → SELECT (SELECT always happens last!)
-
NULL Grouping: NULL values form their own group ⚠️
-
SELECT Restriction: Only GROUP BY columns and aggregates allowed in SELECT (else UNSAFE) ⚠️
Item | Allowed? | Example |
---|---|---|
GROUP BY columns | ✅ | user_id (it's grouped) |
Aggregates | ✅ | COUNT(*) , AVG(rating) |
Other columns | ❌ | song_id (which one?) |
Simple GROUP BY
-- Count how many songs each user has listened to
SELECT user_id, COUNT(*) AS listen_count
FROM Listens
GROUP BY user_id -- One group per unique user_id value
Step 1: Form Groups
listen_id | user_id | song_id | rating | Notes: Groups | Notes | |
---|---|---|---|---|---|---|
1 | 1 | 1 | 4.5 | Group user1 | First user_id=1 | |
2 | 1 | 2 | 4.2 | Group user1 | Same user_id=1 | |
3 | 1 | 6 | 3.9 | Group user1 | Same user_id=1 | |
4 | 2 | 2 | 4.7 | Group user2 | First user_id=2 | |
5 | 2 | 7 | 4.6 | Group user2 | Same user_id=2 | |
6 | 2 | 8 | 3.9 | Group user2 | Same user_id=2 | |
7 | 3 | 1 | 2.9 | Group user3 | First user_id=3 | |
8 | 3 | 2 | 4.9 | Group user3 | Same user_id=3 | |
9 | 3 | 6 | NULL | Group user3 | Same user_id=3 |
Output
user_id | listen_count | Notes: Rows | Notes | |
---|---|---|---|---|
1 | 3 | listen_id 1,2,3 | Count of Group user1 | |
2 | 3 | listen_id 4,5,6 | Count of Group user2 | |
3 | 3 | listen_id 7,8,9 | Count of Group user3 |
-
One group per unique value
-
Aggregates work per group
-
Groups are isolated from each other
GROUP BY with Multiple Aggregates
-- Get listening stats per user: count, average, max, min ratings
SELECT
user_id,
COUNT(*) AS listen_count, -- Counts ALL rows in group (including NULLs)
AVG(rating) AS avg_rating, -- AVG ignores NULL values
MAX(rating) AS max_rating, -- Finds highest non-NULL rating
MIN(rating) AS min_rating -- Finds lowest non-NULL rating
FROM Listens
GROUP BY user_id -- Each user gets their own stats
Aggregates Per Group
user_id | COUNT(*) | AVG Calculation | MAX | MIN | Notes | |
---|---|---|---|---|---|---|
1 | 3 | (4.5+4.2+3.9)/3 | 4.5 | 3.9 | Group for User 1 | |
2 | 3 | (4.7+4.6+3.9)/3 | 4.7 | 3.9 | Group for User 2 | |
3 | 3 | (2.9+4.9)/2 | 4.9 | 2.9 | Group for User 3 NULL not in AVG |
Output
user_id | listen_count | avg_rating | max_rating | min_rating | |
---|---|---|---|---|---|
1 | 3 | 4.2 | 4.5 | 3.9 | |
2 | 3 | 4.4 | 4.7 | 3.9 | |
3 | 3 | 3.9 | 4.9 | 2.9 |
-
All aggregates in one pass
-
Each sees only its group
-
NULLs follow aggregate rules
GROUP BY Multiple Columns
-- Count how many times each user played each song
SELECT
user_id,
song_id,
COUNT(*) AS play_count,
AVG(rating) AS avg_rating
FROM Listens
GROUP BY user_id, song_id -- Multi-column GROUP BY: unique (user,song) pairs
ORDER BY user_id, song_id -- Sort output: first by user, then by song
Step 1: Form (user_id, song_id) Groups
listen_id | user_id | song_id | rating | Notes: Group Key | |
---|---|---|---|---|---|
1 | 1 | 1 | 4.5 | Group for (1,1) | |
2 | 1 | 2 | 4.2 | Group for (1,2) | |
3 | 1 | 6 | 3.9 | Group for (1,6) | |
4 | 2 | 2 | 4.7 | Group for (2,2) | |
5 | 2 | 7 | 4.6 | Group for (2,7) | |
6 | 2 | 8 | 3.9 | Group for (2,8) | |
7 | 3 | 1 | 2.9 | Group for (3,1) | |
8 | 3 | 2 | 4.9 | Group for (3,2) | |
9 | 3 | 6 | NULL | Group for (3,6) |
Output
user_id | song_id | play_count | avg_rating | Notes | |
---|---|---|---|---|---|
1 | 1 | 1 | 4.5 | Single row group | |
1 | 2 | 1 | 4.2 | Single row group | |
1 | 6 | 1 | 3.9 | Single row group | |
2 | 2 | 1 | 4.7 | Single row group | |
2 | 7 | 1 | 4.6 | Single row group | |
2 | 8 | 1 | 3.9 | Single row group | |
3 | 1 | 1 | 2.9 | Single row group | |
3 | 2 | 1 | 4.9 | Single row group | |
3 | 6 | 1 | NULL | NULL in AVG |
-
Combinations create groups
-
Often get one-row groups
GROUP BY with WHERE
-- Count high-rated songs (>4.0) per user
SELECT
user_id,
COUNT(*) AS high_rating_count,
AVG(rating) AS avg_high_rating
FROM Listens
WHERE rating > 4.0 -- WHERE filters BEFORE grouping (NULL > 4.0 = false)
GROUP BY user_id -- Groups only the filtered rows
ORDER BY user_id -- Sort by user_id ascending (default)
Step 1: WHERE Filter
listen_id | user_id | rating | Action: WHERE Check | Action | |
---|---|---|---|---|---|
1 | 1 | 4.5 | 4.5 > 4.0 ✓ | Keep | |
2 | 1 | 4.2 | 4.2 > 4.0 ✓ | Keep | |
3 | 1 | 3.9 | 3.9 > 4.0 ✗ | Remove | |
4 | 2 | 4.7 | 4.7 > 4.0 ✓ | Keep | |
5 | 2 | 4.6 | 4.6 > 4.0 ✓ | Keep | |
6 | 2 | 3.9 | 3.9 > 4.0 ✗ | Remove | |
7 | 3 | 2.9 | 2.9 > 4.0 ✗ | Remove | |
8 | 3 | 4.9 | 4.9 > 4.0 ✓ | Keep | |
9 | 3 | NULL | NULL > 4.0 ? | Remove |
Step 2: GROUP BY on Filtered Data
user_id | COUNT(*) | AVG(rating) | Notes: Post-Filtered Rows | |
---|---|---|---|---|
1 | 2 | (4.5+4.2)/2 = 4.35 | Ratings from listen_ids=1,2 | |
2 | 2 | (4.7+4.6)/2 = 4.65 | Ratings from listen_ids=4,5 | |
3 | 1 | 4.9/1 = 4.9 | Ratings from listen_id=8 |
Output
user_id | high_rating_count | avg_high_rating | |
---|---|---|---|
1 | 2 | 4.35 | |
2 | 2 | 4.65 | |
3 | 1 | 4.9 |
-
WHERE happens first
-
Form Groups for kept rows
-
Aggregates on kept data
GROUP BY with NULL Values
-- Count how many listens for each rating value (including NULL)
SELECT
rating,
COUNT(*) AS COUNT
FROM Listens
GROUP BY rating -- NULL values form their own group
ORDER BY rating -- NULLs appear first or last (DB-dependent)
NULL Grouping Behavior
rating | Notes: Groups |
---|---|
2.9 | Group: 2.9; listen_id = 7 |
3.9 | Group: 3.9; listen_id = 3,6 |
4.2 | Group: 4.2 ; listen_id = 2 |
4.5 | Group: 4.5 ; listen_id = 1 |
4.6 | Group: 4.6 ; listen_id = 5 |
4.7 | Group: 4.7 ; listen_id = 4 |
4.9 | Group: 4.9 ; listen_id = 8 |
NULL | Group: NULL ; NULLs group together listen_id = 9 |
Output
rating | count | Notes |
---|---|---|
NULL | 1 | NULL forms its own group |
2.9 | 1 | Single occurrence |
3.9 | 2 | Two occurrences |
4.2 | 1 | Single occurrence |
4.5 | 1 | Single occurrence |
4.6 | 1 | Single occurrence |
4.7 | 1 | Single occurrence |
4.9 | 1 | Single occurrence |
-
All NULLs = one group
-
NULLs match each other here
-
Sort puts NULLs at edges
COUNT(*) vs COUNT(column)
-- Compare total listens vs rated vs timed listens per user
SELECT
user_id,
COUNT(*) AS total_listens, -- COUNT(*) includes NULL rows
COUNT(rating) AS rated_listens, -- COUNT(column) skips NULL values
COUNT(listen_time) AS timed_listens -- Different NULLs = different counts
FROM Listens
GROUP BY user_id
ORDER BY user_id
user_id | total_listens | rated_listens | timed_listens | Notes | |
---|---|---|---|---|---|
1 | 3 | 3 | 2 | 1: time=✓ 2: time=NULL 3: time=✓ | |
2 | 3 | 3 | 2 | 4: time=NULL 5: time=✓ 6: time=✓ | |
3 | 3 | 2 | 1 | 7: time=NULL 8: time=✓ 9: time=NULL |
COUNT Variants:
-
COUNT(*) = all rows (including NULLs)
-
COUNT(column) = non-NULL values only
-
COUNT(DISTINCT column) = unique non-NULL values
Common Patterns
-
Simple Grouping:
GROUP BY user_id
creates one group per user -
Multiple Columns:
GROUP BY user_id, song_id
groups by combinations -
With Aggregates:
SELECT user_id, COUNT(*), AVG(rating)
-
Filter Groups: Use HAVING to filter after grouping
-
All Columns: GROUP BY with all non-aggregate SELECT columns
Common Mistakes
The Non-Grouped Column Error; UNSAFE queries
-- WRONG: song_id not in GROUP BY (unsafe query)
SELECT user_id, song_id, COUNT(*)
FROM Listens
GROUP BY user_id
-- CORRECT: Include all non-aggregate columns in GROUP BY
SELECT user_id, song_id, COUNT(*)
FROM Listens
GROUP BY user_id, song_id
Problem: Which song_id to show when user has multiple songs?
WHERE Can't See Aggregates
-- WRONG: WHERE can't filter on aggregates (they don't exist yet)
SELECT user_id, AVG(rating)
FROM Listens
WHERE AVG(rating) > 4.0
GROUP BY user_id
-- CORRECT: Use HAVING to filter groups after aggregation
SELECT user_id, AVG(rating)
FROM Listens
GROUP BY user_id
HAVING AVG(rating) > 4.0
Problem: WHERE executes before grouping; aggregates don't exist yet
The COUNT NULL Trap
-- Misleading: COUNT(rating) skips NULL ratings
SELECT user_id,
COUNT(rating) AS total_ratings -- Skips NULLs!
FROM Listens
GROUP BY user_id
-- Better: Show both total rows and non-NULL ratings
SELECT user_id,
COUNT(*) AS total_listens,
COUNT(rating) AS rated_listens
FROM Listens
GROUP BY user_id
Problem: COUNT(column) behavior with NULL often unexpected
Redundant DISTINCT
-- REDUNDANT: GROUP BY already ensures unique user_id values
SELECT DISTINCT user_id, COUNT(*)
FROM Listens
GROUP BY user_id
-- SIMPLER: GROUP BY alone creates unique groups
SELECT user_id, COUNT(*)
FROM Listens
GROUP BY user_id
Problem: GROUP BY already ensures unique group values