Verification: Prove the Result

Concept. Step three: verification. The model generates SQL, and you trust it like any code, handwritten or not, with three tools from software: a unit test proves one query is right, a regression test proves it stays right, and a library lets verified queries compose. All three need code. None of them work in English.

Intuition. Verification lives in code, like the spec did. The debug table is the unit test, the tests you keep are the regression suite, and the verified queries are the library.

The spec is Query B: return only Pluto. Execution got the query to run, but runs ≠ correct, as you've seen. You know the answer to expect, so confirm the query reproduces it: start with one query, by hand, on both readings.

Unit Test: One Query

The two readings, in SQL:

Query A: LEFT JOIN Approach

-- Find users who listened to non-Taylor songs (or no songs)
SELECT u.user_id, u.name
FROM Users u
LEFT JOIN Listens l ON u.user_id = l.user_id
LEFT JOIN Songs s ON l.song_id = s.song_id
WHERE s.artist != 'Taylor Swift' OR s.artist IS NULL
GROUP BY u.user_id, u.name

Query B: NOT IN Subquery Approach

-- Find users who never listened to any Taylor Swift song
SELECT u.user_id, u.name
FROM Users u
WHERE u.user_id NOT IN (
    SELECT DISTINCT l.user_id
    FROM Listens l
    JOIN Songs s ON l.song_id = s.song_id
    WHERE s.artist = 'Taylor Swift'
)

Reminder. Beware the NULL in NOT IN trap. Here we are safe because l.user_id cannot be NULL per the Spotify schema; adding WHERE l.user_id IS NOT NULL is still good defensive practice.

A debug table traces a query the way the engine runs it, one clause at a time, writing the rows after each step. Trace both readings and they diverge, exactly as the spec predicted: Query A keeps all four users, Query B keeps only Pluto.

Both queries, traced step by step:

Query A Execution Trace

Step 1: LEFT JOIN Users with Listens

user_id	name	listen_id	song_id	rating	Notes
1	Mickey	1	1	4.5
1	Mickey	2	2	4.2
1	Mickey	3	6	3.9
2	Minnie	4	2	4.7
2	Minnie	5	7	4.6
2	Minnie	6	8	3.9
3	Daffy	7	1	2.9
3	Daffy	8	2	4.9
3	Daffy	9	6	NULL
4	Pluto	NULL	NULL	NULL	No listens

Step 2: LEFT JOIN with Songs (Add artist)

user_id	name	song_id	title	artist	Notes
1	Mickey	1	Evermore	Taylor Swift
1	Mickey	2	Willow	Taylor Swift
1	Mickey	6	Yesterday	Beatles
2	Minnie	2	Willow	Taylor Swift
2	Minnie	7	Yellow Submarine	Beatles
2	Minnie	8	Hey Jude	Beatles
3	Daffy	1	Evermore	Taylor Swift
3	Daffy	2	Willow	Taylor Swift
3	Daffy	6	Yesterday	Beatles
4	Pluto	NULL	NULL	NULL	No songs

Step 3: WHERE filter (artist != 'Taylor Swift' OR artist IS NULL)

user_id	name	song_id	title	artist	Notes
1	Mickey	6	Yesterday	Beatles	Kept: artist != 'Taylor Swift'
2	Minnie	7	Yellow Submarine	Beatles	Kept: artist != 'Taylor Swift'
2	Minnie	8	Hey Jude	Beatles	Kept: artist != 'Taylor Swift'
3	Daffy	6	Yesterday	Beatles	Kept: artist != 'Taylor Swift'
4	Pluto	NULL	NULL	NULL	Kept: artist IS NULL

Step 4: GROUP BY and Final Output

	user_id	name
	1	Mickey
	2	Minnie
	3	Daffy
	4	Pluto

Query B Execution Trace

Step 1: Find Taylor Swift listeners

	user_id	Notes
	1	Listened to songs 1, 2 (Taylor Swift)
	2	Listened to song 2 (Taylor Swift)
	3	Listened to songs 1, 2 (Taylor Swift)

Step 2: NOT IN filter

user_id	name	Notes
1	Mickey	Excluded: user_id IN {1,2,3}
2	Minnie	Excluded: user_id IN {1,2,3}
3	Daffy	Excluded: user_id IN {1,2,3}
4	Pluto	Included: user_id NOT IN {1,2,3}

Final Output

	user_id	name
	4	Pluto

That is a unit test: one query, run against the answer you expected. "Did this LEFT JOIN keep Pluto? Did the GROUP BY count once?" You settle those by running the query and comparing the result, not by re-reading the English.

The answer is built by a pipeline of relational operations on the left (blue): schema and definitions, then joins and filters, then GROUP BY and COUNT, then the result. On the right, each layer has a verification question: is this the governed metric, do the right rows survive (Pluto), does it count the right thing, does the number match the result you expected.

Figure 1. A query is a stack of exact relational operations (blue): a schema and its definitions, then joins and filters, then a GROUP BY and COUNT, then the result. Each one fixes a piece of the meaning, so each is one thing you can test: is this the governed metric, do the right rows survive (the Pluto case), does it count the right thing, does the number match what you expected. You settle each by running the query, not by arguing in English.

Regression Test: It Stays Right

A unit test proves the query is right today. Keep it and re-run it on every change, and it proves the query is still right tomorrow. That is a regression test: if a definition silently shifts, the stored test fails that day, not months later.

Library: Verified Queries Compose

A verified query is a function, and functions compose. You proved "non-Taylor listeners" returns {Pluto}, so the next question, "weekly active non-Taylor listeners," builds on it, and you verify only the new layer.

Three columns of verified SQL blocks (blue, each with a green check). Base metrics on the left (active user, Taylor listener) compose into composed queries in the middle (weekly actives, non-Taylor listeners), which compose into a complex query on the right (weekly active non-Taylor listeners). Arrows run left to right, and each block is verified once and reused.

Figure 2. Each block is a verified query (blue, green check). Base metrics (active user, Taylor listener) compose into bigger queries (weekly actives, non-Taylor listeners), which compose into a complex one. Every block is verified once and reused, so a new query verifies only its own layer. That is a library: verified SQL that builds on itself, which is exactly what the semantic layer becomes.

Key Takeaways

Unit test. One query against an expected result. You cannot assert about a join in English.
Regression test. Keep the tests and re-run them, so a definition cannot silently drift.
Library. Verified queries stack into bigger ones, each reused; English never could. That library is the semantic layer.

Case Study 1.3: Claude's Agentic Stack → You now have all three steps: precision, execution, verification. Next, a company that runs all three at once, taking a bare model from 21% to 95%.