LLMs and SQL: Syntax and Semantics
Concept. LLMs have effectively solved SQL syntax. The open problem is semantics: does the query answer the question you meant. You close that gap two ways, and they compose: let the model run the query and self-correct, and make "correct" checkable with a precise spec and a small expected result. Both depend on understanding SQL and how the engine executes it. That understanding is the key skill now, more than ever, not less.
Intuition. "Users who don't listen to Taylor Swift" has two reasonable readings, and an LLM, even one that can execute SQL, will silently pick one. The query that runs is not the same as the query that is right. The LLM raises the floor on syntax; the ceiling is still you. Pinning the meaning, deciding the edge cases, and judging whether the output is correct all require knowing SQL and how it runs.
Query Equivalence showed that two queries can look identical yet diverge on some data: truly equivalent only if they agree for any input, not just today's. An LLM hands you exactly that problem. It returns a query that runs; you decide whether it means what you meant, not just whether it matches the rows in front of you.
Syntax Is Solved, Semantics Isn't
The BIRD benchmark evaluates LLMs on text-to-SQL:
95%+
Syntax accuracy
16–77%
Semantic accuracy
runs ≠ correct
The gap
The low scorers are blind single-shot prompts. The high scorers let the model run the query and fix it. So the gap is the method, not the model. A query that runs is still not a query that is right. The model never sees your data, so it guesses your domain logic unless you state it.
Browse the live benchmark below; scroll through it to see how involved a real question and its gold SQL get:
Strategy: Give the Model an Execution Environment
Figure 1. A model that only emits text guesses (grey). Wired to a database through MCP (Model Context Protocol) or a CLI, it runs the SQL, reads the error or result, and iterates (blue), turning a guess into a checked answer. Execution only catches what you let it check: crashes for free, a wrong-but-running query only if you gave it a spec and an expected result. That spec defines the right answer; you need it whether the model writes the SQL or you do.
Precision Has to Live Somewhere
You still have to write the precision down. The only choice is where: in the prompt, or in the SQL you write yourself. A vague prompt writes none of it, so the model picks one of two readings and you cannot tell which.
Figure 2. A vague prompt (grey) says almost nothing, so the model can write Query A (LEFT JOIN, all four users) or Query B (NOT IN, only Pluto) and you cannot tell which. The precise prompt (blue) numbers what the model must not guess: engine, schema and join paths, definition, edge case, and a show-your-work trace. Its [Definitions] line fixes Query B and {Pluto}.
Every numbered line blocks a guess the model would otherwise make. Drop one and you hand it room to hallucinate. That keeps the prompt minimal and sufficient, not merely long:
[Engine]pins the SQL dialect (!=vs<>, NULL and string behavior). Unspecified, the model picks one.[Schema]gives exact table and column names so it cannot invent one. The join paths matter most: the key is often implicit, and the model will assume a foreign key or join on the wrong column. WritingON Listens.user_id = Users.user_idremoves that guess.[Definitions]is the actual ask, the one line only you can supply.[Edge cases]keep the no-listen user (Pluto), whom a careless query silently drops.[Show your work]forces the step-by-step trace you then check. That trace is the debug table below.
Two is the easy case. One ambiguous phrase over two tables gives two readings. A real query joins four or five tables and carries several independent choices: inner vs left join, whether a NULL counts, dedup, what a COUNT counts, date boundaries. Each is its own fork, so k independent ambiguities give on the order of 2^k queries that all run and disagree. The model still picks one silently. Specify every fork, in the prompt or the SQL.
The Two Readings, in SQL
Both run. They return different answers.
Query A: LEFT JOIN Approach
-- Find users who listened to non-Taylor songs (or no songs)
SELECT u.user_id, u.name
FROM Users u
LEFT JOIN Listens l ON u.user_id = l.user_id
LEFT JOIN Songs s ON l.song_id = s.song_id
WHERE s.artist != 'Taylor Swift' OR s.artist IS NULL
GROUP BY u.user_id, u.name
Query B: NOT IN Subquery Approach
-- Find users who never listened to any Taylor Swift song
SELECT u.user_id, u.name
FROM Users u
WHERE u.user_id NOT IN (
SELECT DISTINCT l.user_id
FROM Listens l
JOIN Songs s ON l.song_id = s.song_id
WHERE s.artist = 'Taylor Swift'
)
Reminder. Beware the NULL in NOT IN trap. Here we are safe because l.user_id cannot be NULL per the Spotify schema; adding WHERE l.user_id IS NOT NULL is still good defensive practice.
The [Definitions] line chooses Query B's reading over Query A's. Write that precision into the prompt, or into the SQL yourself. Either way, you write it.
Verify with a Debug Table
The spec says which query is right. A debug table proves it. Execution catches crashes; it does not catch runs-but-wrong. Build a debug table for the query: trace it the way the engine runs it, one clause at a time, writing the rows after each step. Do it for both queries and they diverge. Query A returns all four users; Query B returns only Pluto. The two tables show the semantic gap directly. Query A keeps Mickey, Minnie, and Daffy because each has a non-Taylor listen, and keeps Pluto only because OR s.artist IS NULL rescues the row the LEFT JOIN NULL-padded; Query B drops anyone who heard Taylor even once, so only Pluto survives. Same English, two debug tables, two answers. The spec decides which is right.
Both queries, traced step by step:
Query A Execution Trace
Step 1: LEFT JOIN Users with Listens
| user_id | name | listen_id | song_id | rating | Notes | |
|---|---|---|---|---|---|---|
| 1 | Mickey | 1 | 1 | 4.5 | ||
| 1 | Mickey | 2 | 2 | 4.2 | ||
| 1 | Mickey | 3 | 6 | 3.9 | ||
| 2 | Minnie | 4 | 2 | 4.7 | ||
| 2 | Minnie | 5 | 7 | 4.6 | ||
| 2 | Minnie | 6 | 8 | 3.9 | ||
| 3 | Daffy | 7 | 1 | 2.9 | ||
| 3 | Daffy | 8 | 2 | 4.9 | ||
| 3 | Daffy | 9 | 6 | NULL | ||
| 4 | Pluto | NULL | NULL | NULL | No listens |
Step 2: LEFT JOIN with Songs (Add artist)
| user_id | name | song_id | title | artist | Notes | |
|---|---|---|---|---|---|---|
| 1 | Mickey | 1 | Evermore | Taylor Swift | ||
| 1 | Mickey | 2 | Willow | Taylor Swift | ||
| 1 | Mickey | 6 | Yesterday | Beatles | ||
| 2 | Minnie | 2 | Willow | Taylor Swift | ||
| 2 | Minnie | 7 | Yellow Submarine | Beatles | ||
| 2 | Minnie | 8 | Hey Jude | Beatles | ||
| 3 | Daffy | 1 | Evermore | Taylor Swift | ||
| 3 | Daffy | 2 | Willow | Taylor Swift | ||
| 3 | Daffy | 6 | Yesterday | Beatles | ||
| 4 | Pluto | NULL | NULL | NULL | No songs |
Step 3: WHERE filter (artist != 'Taylor Swift' OR artist IS NULL)
| user_id | name | song_id | title | artist | Notes | |
|---|---|---|---|---|---|---|
| 1 | Mickey | 6 | Yesterday | Beatles | Kept: artist != 'Taylor Swift' | |
| 2 | Minnie | 7 | Yellow Submarine | Beatles | Kept: artist != 'Taylor Swift' | |
| 2 | Minnie | 8 | Hey Jude | Beatles | Kept: artist != 'Taylor Swift' | |
| 3 | Daffy | 6 | Yesterday | Beatles | Kept: artist != 'Taylor Swift' | |
| 4 | Pluto | NULL | NULL | NULL | Kept: artist IS NULL |
Step 4: GROUP BY and Final Output
| user_id | name | |
|---|---|---|
| 1 | Mickey | |
| 2 | Minnie | |
| 3 | Daffy | |
| 4 | Pluto |
Query B Execution Trace
Step 1: Find Taylor Swift listeners
| user_id | Notes | |
|---|---|---|
| 1 | Listened to songs 1, 2 (Taylor Swift) | |
| 2 | Listened to song 2 (Taylor Swift) | |
| 3 | Listened to songs 1, 2 (Taylor Swift) |
Step 2: NOT IN filter
| user_id | name | Notes | |
|---|---|---|---|
| 1 | Mickey | Excluded: user_id IN {1,2,3} | |
| 2 | Minnie | Excluded: user_id IN {1,2,3} | |
| 3 | Daffy | Excluded: user_id IN {1,2,3} | |
| 4 | Pluto | Included: user_id NOT IN {1,2,3} |
Final Output
| user_id | name | |
|---|---|---|
| 4 | Pluto |
Takeaways
-
Syntax is solved; semantics is the job. A query that runs is not a query that is right.
-
Execution turns a guess into a checked answer, but only checks what your spec defines. Crashes are free; runs-but-wrong needs a pinned definition and an expected result.
-
Understanding SQL and debugging is the key skill, more than ever. You can only pin the definition, decide the edge cases, and trace the debug table if you know how the engine runs the query. The LLM removes the boilerplate, not the need to understand it.
Next
Intermediate SQL Quiz → That completes the intermediate-SQL toolkit: CTEs, window functions, query equivalence, and verifying machine-written SQL. The quiz checks all of it.