Verification: Prove the Result
Concept. Step three: verification. The model generates SQL, and you trust it like any code, handwritten or not, with three tools from software: a unit test proves one query is right, a regression test proves it stays right, and a library lets verified queries compose. All three need code. None of them work in English.
Intuition. Verification lives in code, like the spec did. The debug table is the unit test, the tests you keep are the regression suite, and the verified queries are the library.
The spec is Query B: return only Pluto. Execution got the query to run, but runs ≠ correct, as you've seen. You know the answer to expect, so confirm the query reproduces it: start with one query, by hand, on both readings.
Unit Test: One Query
The two readings, in SQL:
Query A: LEFT JOIN Approach
-- Find users who listened to non-Taylor songs (or no songs)
SELECT u.user_id, u.name
FROM Users u
LEFT JOIN Listens l ON u.user_id = l.user_id
LEFT JOIN Songs s ON l.song_id = s.song_id
WHERE s.artist != 'Taylor Swift' OR s.artist IS NULL
GROUP BY u.user_id, u.name
Query B: NOT IN Subquery Approach
-- Find users who never listened to any Taylor Swift song
SELECT u.user_id, u.name
FROM Users u
WHERE u.user_id NOT IN (
SELECT DISTINCT l.user_id
FROM Listens l
JOIN Songs s ON l.song_id = s.song_id
WHERE s.artist = 'Taylor Swift'
)
Reminder. Beware the NULL in NOT IN trap. Here we are safe because l.user_id cannot be NULL per the Spotify schema; adding WHERE l.user_id IS NOT NULL is still good defensive practice.
A debug table traces a query the way the engine runs it, one clause at a time, writing the rows after each step. Trace both readings and they diverge, exactly as the spec predicted: Query A keeps all four users, Query B keeps only Pluto.
Both queries, traced step by step:
Query A Execution Trace
Step 1: LEFT JOIN Users with Listens
| user_id | name | listen_id | song_id | rating | Notes | |
|---|---|---|---|---|---|---|
| 1 | Mickey | 1 | 1 | 4.5 | ||
| 1 | Mickey | 2 | 2 | 4.2 | ||
| 1 | Mickey | 3 | 6 | 3.9 | ||
| 2 | Minnie | 4 | 2 | 4.7 | ||
| 2 | Minnie | 5 | 7 | 4.6 | ||
| 2 | Minnie | 6 | 8 | 3.9 | ||
| 3 | Daffy | 7 | 1 | 2.9 | ||
| 3 | Daffy | 8 | 2 | 4.9 | ||
| 3 | Daffy | 9 | 6 | NULL | ||
| 4 | Pluto | NULL | NULL | NULL | No listens |
Step 2: LEFT JOIN with Songs (Add artist)
| user_id | name | song_id | title | artist | Notes | |
|---|---|---|---|---|---|---|
| 1 | Mickey | 1 | Evermore | Taylor Swift | ||
| 1 | Mickey | 2 | Willow | Taylor Swift | ||
| 1 | Mickey | 6 | Yesterday | Beatles | ||
| 2 | Minnie | 2 | Willow | Taylor Swift | ||
| 2 | Minnie | 7 | Yellow Submarine | Beatles | ||
| 2 | Minnie | 8 | Hey Jude | Beatles | ||
| 3 | Daffy | 1 | Evermore | Taylor Swift | ||
| 3 | Daffy | 2 | Willow | Taylor Swift | ||
| 3 | Daffy | 6 | Yesterday | Beatles | ||
| 4 | Pluto | NULL | NULL | NULL | No songs |
Step 3: WHERE filter (artist != 'Taylor Swift' OR artist IS NULL)
| user_id | name | song_id | title | artist | Notes | |
|---|---|---|---|---|---|---|
| 1 | Mickey | 6 | Yesterday | Beatles | Kept: artist != 'Taylor Swift' | |
| 2 | Minnie | 7 | Yellow Submarine | Beatles | Kept: artist != 'Taylor Swift' | |
| 2 | Minnie | 8 | Hey Jude | Beatles | Kept: artist != 'Taylor Swift' | |
| 3 | Daffy | 6 | Yesterday | Beatles | Kept: artist != 'Taylor Swift' | |
| 4 | Pluto | NULL | NULL | NULL | Kept: artist IS NULL |
Step 4: GROUP BY and Final Output
| user_id | name | |
|---|---|---|
| 1 | Mickey | |
| 2 | Minnie | |
| 3 | Daffy | |
| 4 | Pluto |
Query B Execution Trace
Step 1: Find Taylor Swift listeners
| user_id | Notes | |
|---|---|---|
| 1 | Listened to songs 1, 2 (Taylor Swift) | |
| 2 | Listened to song 2 (Taylor Swift) | |
| 3 | Listened to songs 1, 2 (Taylor Swift) |
Step 2: NOT IN filter
| user_id | name | Notes | |
|---|---|---|---|
| 1 | Mickey | Excluded: user_id IN {1,2,3} | |
| 2 | Minnie | Excluded: user_id IN {1,2,3} | |
| 3 | Daffy | Excluded: user_id IN {1,2,3} | |
| 4 | Pluto | Included: user_id NOT IN {1,2,3} |
Final Output
| user_id | name | |
|---|---|---|
| 4 | Pluto |
That is a unit test: one query, run against the answer you expected. "Did this LEFT JOIN keep Pluto? Did the GROUP BY count once?" You settle those by running the query and comparing the result, not by re-reading the English.
Figure 1. A query is a stack of exact relational operations (blue): a schema and its definitions, then joins and filters, then a GROUP BY and COUNT, then the result. Each one fixes a piece of the meaning, so each is one thing you can test: is this the governed metric, do the right rows survive (the Pluto case), does it count the right thing, does the number match what you expected. You settle each by running the query, not by arguing in English.
Regression Test: It Stays Right
A unit test proves the query is right today. Keep it and re-run it on every change, and it proves the query is still right tomorrow. That is a regression test: if a definition silently shifts, the stored test fails that day, not months later.
Library: Verified Queries Compose
A verified query is a function, and functions compose. You proved "non-Taylor listeners" returns {Pluto}, so the next question, "weekly active non-Taylor listeners," builds on it, and you verify only the new layer.
Figure 2. Each block is a verified query (blue, green check). Base metrics (active user, Taylor listener) compose into bigger queries (weekly actives, non-Taylor listeners), which compose into a complex one. Every block is verified once and reused, so a new query verifies only its own layer. That is a library: verified SQL that builds on itself, which is exactly what the semantic layer becomes.
Key Takeaways
-
Unit test. One query against an expected result. You cannot assert about a join in English.
-
Regression test. Keep the tests and re-run them, so a definition cannot silently drift.
-
Library. Verified queries stack into bigger ones, each reused; English never could. That library is the semantic layer.
Next
Case Study 1.3: Claude's Agentic Stack → You now have all three steps: precision, execution, verification. Next, a company that runs all three at once, taking a bare model from 21% to 95%.