LLMs and SQL: Syntax and Semantics

Concept. LLMs have effectively solved SQL syntax. The open problem is semantics: does the query answer the question you meant. You close that gap two ways, and they compose: let the model run the query and self-correct, and make "correct" checkable with a precise spec and a small expected result. Both depend on understanding SQL and how the engine executes it. That understanding is the key skill now, more than ever, not less.

Intuition. "Users who don't listen to Taylor Swift" has two reasonable readings, and an LLM, even one that can execute SQL, will silently pick one. The query that runs is not the same as the query that is right. The LLM raises the floor on syntax; the ceiling is still you. Pinning the meaning, deciding the edge cases, and judging whether the output is correct all require knowing SQL and how it runs.

Query Equivalence showed that two queries can look identical yet diverge on some data: truly equivalent only if they agree for any input, not just today's. An LLM hands you exactly that problem. It returns a query that runs; you decide whether it means what you meant, not just whether it matches the rows in front of you.

Syntax Is Solved, Semantics Isn't

The BIRD benchmark evaluates LLMs on text-to-SQL:

95%+

Syntax accuracy

16–77%

Semantic accuracy

runs ≠ correct

The gap

The low scorers are blind single-shot prompts. The high scorers let the model run the query and fix it. So the gap is the method, not the model. A query that runs is still not a query that is right. The model never sees your data, so it guesses your domain logic unless you state it.

Browse the live benchmark below; scroll through it to see how involved a real question and its gold SQL get:

Strategy: Give the Model an Execution Environment

Figure 1. A model that only emits text guesses (grey). Wired to a database through MCP (Model Context Protocol) or a CLI, it runs the SQL, reads the error or result, and iterates (blue), turning a guess into a checked answer. Execution only catches what you let it check: crashes for free, a wrong-but-running query only if you gave it a spec and an expected result. That spec defines the right answer; you need it whether the model writes the SQL or you do.

Precision Has to Live Somewhere

You still have to write the precision down. The only choice is where: in the prompt, or in the SQL you write yourself. A vague prompt writes none of it, so the model picks one of two readings and you cannot tell which.

The same request as two prompts. Left, grey: a vague one-liner, so the model can write Query A (LEFT JOIN, all four users) or Query B (NOT IN, only Pluto) and you can't tell which. Right, blue: a precise prompt with five numbered sections (engine, schema with join paths, definitions, edge cases, show your work) whose Definitions line fixes exactly Query B and {Pluto}.

Figure 2. A vague prompt (grey) says almost nothing, so the model can write Query A (LEFT JOIN, all four users) or Query B (NOT IN, only Pluto) and you cannot tell which. The precise prompt (blue) numbers what the model must not guess: engine, schema and join paths, definition, edge case, and a show-your-work trace. Its [Definitions] line fixes Query B and {Pluto}.

Every numbered line blocks a guess the model would otherwise make. Drop one and you hand it room to hallucinate. That keeps the prompt minimal and sufficient, not merely long:

[Engine] pins the SQL dialect (!= vs <>, NULL and string behavior). Unspecified, the model picks one.
[Schema] gives exact table and column names so it cannot invent one. The join paths matter most: the key is often implicit, and the model will assume a foreign key or join on the wrong column. Writing ON Listens.user_id = Users.user_id removes that guess.
[Definitions] is the actual ask, the one line only you can supply.
[Edge cases] keep the no-listen user (Pluto), whom a careless query silently drops.
[Show your work] forces the step-by-step trace you then check. That trace is the debug table below.

Two is the easy case. One ambiguous phrase over two tables gives two readings. A real query joins four or five tables and carries several independent choices: inner vs left join, whether a NULL counts, dedup, what a COUNT counts, date boundaries. Each is its own fork, so k independent ambiguities give on the order of 2^k queries that all run and disagree. The model still picks one silently. Specify every fork, in the prompt or the SQL.

The Two Readings, in SQL

Both run. They return different answers.

Query A: LEFT JOIN Approach

-- Find users who listened to non-Taylor songs (or no songs)
SELECT u.user_id, u.name
FROM Users u
LEFT JOIN Listens l ON u.user_id = l.user_id
LEFT JOIN Songs s ON l.song_id = s.song_id
WHERE s.artist != 'Taylor Swift' OR s.artist IS NULL
GROUP BY u.user_id, u.name

Query B: NOT IN Subquery Approach

-- Find users who never listened to any Taylor Swift song
SELECT u.user_id, u.name
FROM Users u
WHERE u.user_id NOT IN (
    SELECT DISTINCT l.user_id
    FROM Listens l
    JOIN Songs s ON l.song_id = s.song_id
    WHERE s.artist = 'Taylor Swift'
)

Reminder. Beware the NULL in NOT IN trap. Here we are safe because l.user_id cannot be NULL per the Spotify schema; adding WHERE l.user_id IS NOT NULL is still good defensive practice.

The [Definitions] line chooses Query B's reading over Query A's. Write that precision into the prompt, or into the SQL yourself. Either way, you write it.

Verify with a Debug Table

The spec says which query is right. A debug table proves it. Execution catches crashes; it does not catch runs-but-wrong. Build a debug table for the query: trace it the way the engine runs it, one clause at a time, writing the rows after each step. Do it for both queries and they diverge. Query A returns all four users; Query B returns only Pluto. The two tables show the semantic gap directly. Query A keeps Mickey, Minnie, and Daffy because each has a non-Taylor listen, and keeps Pluto only because OR s.artist IS NULL rescues the row the LEFT JOIN NULL-padded; Query B drops anyone who heard Taylor even once, so only Pluto survives. Same English, two debug tables, two answers. The spec decides which is right.

Both queries, traced step by step:

Query A Execution Trace

Step 1: LEFT JOIN Users with Listens

user_id	name	listen_id	song_id	rating	Notes
1	Mickey	1	1	4.5
1	Mickey	2	2	4.2
1	Mickey	3	6	3.9
2	Minnie	4	2	4.7
2	Minnie	5	7	4.6
2	Minnie	6	8	3.9
3	Daffy	7	1	2.9
3	Daffy	8	2	4.9
3	Daffy	9	6	NULL
4	Pluto	NULL	NULL	NULL	No listens

Step 2: LEFT JOIN with Songs (Add artist)

user_id	name	song_id	title	artist	Notes
1	Mickey	1	Evermore	Taylor Swift
1	Mickey	2	Willow	Taylor Swift
1	Mickey	6	Yesterday	Beatles
2	Minnie	2	Willow	Taylor Swift
2	Minnie	7	Yellow Submarine	Beatles
2	Minnie	8	Hey Jude	Beatles
3	Daffy	1	Evermore	Taylor Swift
3	Daffy	2	Willow	Taylor Swift
3	Daffy	6	Yesterday	Beatles
4	Pluto	NULL	NULL	NULL	No songs

Step 3: WHERE filter (artist != 'Taylor Swift' OR artist IS NULL)

user_id	name	song_id	title	artist	Notes
1	Mickey	6	Yesterday	Beatles	Kept: artist != 'Taylor Swift'
2	Minnie	7	Yellow Submarine	Beatles	Kept: artist != 'Taylor Swift'
2	Minnie	8	Hey Jude	Beatles	Kept: artist != 'Taylor Swift'
3	Daffy	6	Yesterday	Beatles	Kept: artist != 'Taylor Swift'
4	Pluto	NULL	NULL	NULL	Kept: artist IS NULL

Step 4: GROUP BY and Final Output

	user_id	name
	1	Mickey
	2	Minnie
	3	Daffy
	4	Pluto

Query B Execution Trace

Step 1: Find Taylor Swift listeners

	user_id	Notes
	1	Listened to songs 1, 2 (Taylor Swift)
	2	Listened to song 2 (Taylor Swift)
	3	Listened to songs 1, 2 (Taylor Swift)

Step 2: NOT IN filter

user_id	name	Notes
1	Mickey	Excluded: user_id IN {1,2,3}
2	Minnie	Excluded: user_id IN {1,2,3}
3	Daffy	Excluded: user_id IN {1,2,3}
4	Pluto	Included: user_id NOT IN {1,2,3}

Final Output

	user_id	name
	4	Pluto

Takeaways

Syntax is solved; semantics is the job. A query that runs is not a query that is right.
Execution turns a guess into a checked answer, but only checks what your spec defines. Crashes are free; runs-but-wrong needs a pinned definition and an expected result.
Understanding SQL and debugging is the key skill, more than ever. You can only pin the definition, decide the edge cases, and trace the debug table if you know how the engine runs the query. The LLM removes the boilerplate, not the need to understand it.

Intermediate SQL Quiz → That completes the intermediate-SQL toolkit: CTEs, window functions, query equivalence, and verifying machine-written SQL. The quiz checks all of it.