Verification: Prove the Result

Concept. Step three: verification. The model generates SQL, and you trust it like any code, handwritten or not, with three tools from software: a unit test proves one query is right, a regression test proves it stays right, and a library lets verified queries compose. All three need code. None of them work in English.

Intuition. Verification lives in code, like the spec did. The debug table is the unit test, the tests you keep are the regression suite, and the verified queries are the library.

The spec is Query B: return only Pluto. Execution got the query to run, but runs ≠ correct, as you've seen. You know the answer to expect, so confirm the query reproduces it: start with one query, by hand, on both readings.

Unit Test: One Query

The two readings, in SQL:

Query A: LEFT JOIN Approach

-- Find users who listened to non-Taylor songs (or no songs)
SELECT u.user_id, u.name
FROM Users u
LEFT JOIN Listens l ON u.user_id = l.user_id
LEFT JOIN Songs s ON l.song_id = s.song_id
WHERE s.artist != 'Taylor Swift' OR s.artist IS NULL
GROUP BY u.user_id, u.name

Query B: NOT IN Subquery Approach

-- Find users who never listened to any Taylor Swift song
SELECT u.user_id, u.name
FROM Users u
WHERE u.user_id NOT IN (
    SELECT DISTINCT l.user_id
    FROM Listens l
    JOIN Songs s ON l.song_id = s.song_id
    WHERE s.artist = 'Taylor Swift'
)

Reminder. Beware the NULL in NOT IN trap. Here we are safe because l.user_id cannot be NULL per the Spotify schema; adding WHERE l.user_id IS NOT NULL is still good defensive practice.

A debug table traces a query the way the engine runs it, one clause at a time, writing the rows after each step. Trace both readings and they diverge, exactly as the spec predicted: Query A keeps all four users, Query B keeps only Pluto.

Both queries, traced step by step:

Query A Execution Trace

Step 1: LEFT JOIN Users with Listens

user_idnamelisten_idsong_idratingNotes
1Mickey114.5
1Mickey224.2
1Mickey363.9
2Minnie424.7
2Minnie574.6
2Minnie683.9
3Daffy712.9
3Daffy824.9
3Daffy96NULL
4PlutoNULLNULLNULLNo listens

Step 2: LEFT JOIN with Songs (Add artist)

user_idnamesong_idtitleartistNotes
1Mickey1EvermoreTaylor Swift
1Mickey2WillowTaylor Swift
1Mickey6YesterdayBeatles
2Minnie2WillowTaylor Swift
2Minnie7Yellow SubmarineBeatles
2Minnie8Hey JudeBeatles
3Daffy1EvermoreTaylor Swift
3Daffy2WillowTaylor Swift
3Daffy6YesterdayBeatles
4PlutoNULLNULLNULLNo songs

Step 3: WHERE filter (artist != 'Taylor Swift' OR artist IS NULL)

user_idnamesong_idtitleartistNotes
1Mickey6YesterdayBeatlesKept: artist != 'Taylor Swift'
2Minnie7Yellow SubmarineBeatlesKept: artist != 'Taylor Swift'
2Minnie8Hey JudeBeatlesKept: artist != 'Taylor Swift'
3Daffy6YesterdayBeatlesKept: artist != 'Taylor Swift'
4PlutoNULLNULLNULLKept: artist IS NULL

Step 4: GROUP BY and Final Output

user_idname
1Mickey
2Minnie
3Daffy
4Pluto

Query B Execution Trace

Step 1: Find Taylor Swift listeners

user_idNotes
1Listened to songs 1, 2 (Taylor Swift)
2Listened to song 2 (Taylor Swift)
3Listened to songs 1, 2 (Taylor Swift)

Step 2: NOT IN filter

user_idnameNotes
1MickeyExcluded: user_id IN {1,2,3}
2MinnieExcluded: user_id IN {1,2,3}
3DaffyExcluded: user_id IN {1,2,3}
4PlutoIncluded: user_id NOT IN {1,2,3}

Final Output

user_idname
4Pluto

That is a unit test: one query, run against the answer you expected. "Did this LEFT JOIN keep Pluto? Did the GROUP BY count once?" You settle those by running the query and comparing the result, not by re-reading the English.

The answer is built by a pipeline of relational operations on the left (blue): schema and definitions, then joins and filters, then GROUP BY and COUNT, then the result. On the right, each layer has a verification question: is this the governed metric, do the right rows survive (Pluto), does it count the right thing, does the number match the result you expected.

Figure 1. A query is a stack of exact relational operations (blue): a schema and its definitions, then joins and filters, then a GROUP BY and COUNT, then the result. Each one fixes a piece of the meaning, so each is one thing you can test: is this the governed metric, do the right rows survive (the Pluto case), does it count the right thing, does the number match what you expected. You settle each by running the query, not by arguing in English.

Regression Test: It Stays Right

A unit test proves the query is right today. Keep it and re-run it on every change, and it proves the query is still right tomorrow. That is a regression test: if a definition silently shifts, the stored test fails that day, not months later.

Library: Verified Queries Compose

A verified query is a function, and functions compose. You proved "non-Taylor listeners" returns {Pluto}, so the next question, "weekly active non-Taylor listeners," builds on it, and you verify only the new layer.

Three columns of verified SQL blocks (blue, each with a green check). Base metrics on the left (active user, Taylor listener) compose into composed queries in the middle (weekly actives, non-Taylor listeners), which compose into a complex query on the right (weekly active non-Taylor listeners). Arrows run left to right, and each block is verified once and reused.

Figure 2. Each block is a verified query (blue, green check). Base metrics (active user, Taylor listener) compose into bigger queries (weekly actives, non-Taylor listeners), which compose into a complex one. Every block is verified once and reused, so a new query verifies only its own layer. That is a library: verified SQL that builds on itself, which is exactly what the semantic layer becomes.

Key Takeaways

  1. Unit test. One query against an expected result. You cannot assert about a join in English.

  2. Regression test. Keep the tests and re-run them, so a definition cannot silently drift.

  3. Library. Verified queries stack into bigger ones, each reused; English never could. That library is the semantic layer.


Next

Case Study 1.3: Claude's Agentic Stack → You now have all three steps: precision, execution, verification. Next, a company that runs all three at once, taking a bare model from 21% to 95%.