LLMs and SQL: Syntax and Semantics

Core Concept

Example: Same question, different logic and queries - one returns 4 users, another returns 1. Both are syntactically correct, yet semantically distinct.

Myth: "LLMs Will Replace SQL Skills"

LLMs: Quick with syntax, shaky on semantics, prone to hallucinations.
Context: Feeding a full database into an AI is impractical. You pass the schema and maybe a few sample rows. Anything more is a financial black hole.

The Sweet Spot: LLMs draft the boilerplate. You verify the logic with Debug Tables. Ship faster, avoid costly errors.

Real-World Performance Data

The BIRD benchmark (https://bird-bench.github.io/) evaluates LLMs on SQL generation:

Syntax Accuracy: Over 95%—most LLMs ace SQL syntax.
Semantic Accuracy: Ranges from 16% to 77%—a glaring gap.
The Gap: Queries that run without errors but deliver incorrect results.

Why you get paid the big bucks: Master semantics. Know SQL and the business inside out.

Generating Syntax with LLMs

Before debugging logic, ensure a syntactically correct foundation. Supply the right context:

The Exact Schema Constraints: Include precise CREATE TABLE statements or schema definitions.
The Relational Paths: Clearly state table joins. Don't let the LLM guess.
The Execution Engine: Specify the SQL dialect—PostgreSQL, BigQuery, SQLite.

Semantic Traps to Avoid: Even with perfect context, LLMs stumble on semantics. Always check for:

Silent Aggregation Errors: Misused joins can inflate results.
NULL Handling: LLMs often mishandle NULL in subqueries or joins.
Domain Logic Assumptions: LLMs might misinterpret business logic. Be explicit.

An Example of a Good Prompt Template

Reduce hallucinations with a structured prompt.

Example Prompt:

[Engine] I am using PostgreSQL.

[Schema] Here are my tables: Users (user_id INT, name VARCHAR, join_date DATE) Purchases (purchase_id INT, user_id INT, amount DECIMAL, status VARCHAR) Purchases.user_id joins to Users.user_id.

[Definitions & Logic] - "Active Users": Users who joined before '2025-01-01'. - "Top Users": Users with a total purchase amount > $500 where status = 'COMPLETED'.

[The Request] Write a query to find the name and total revenue of all Active, Top Users. Exclude users without completed purchases.

Query Comparison: "Users who don't listen to Taylor Swift"

Query A: LEFT JOIN Approach

Query B: NOT IN Subquery Approach

Reminder: Beware the NULL in NOT IN death trap! Here we're safe since l.user_id can't be NULL per our Spotify schema. Adding WHERE l.user_id IS NOT NULL is still fine defensive practice.

Step-by-Step Debug Tables

Query A Execution Trace

Step 1: LEFT JOIN Users with Listens

user_id	name	listen_id	song_id	rating	Notes
1	Mickey	1	1	4.5
1	Mickey	2	2	4.2
1	Mickey	3	6	3.9
2	Minnie	4	2	4.7
2	Minnie	5	7	4.6
2	Minnie	6	8	3.9
3	Daffy	7	1	2.9
3	Daffy	8	2	4.9
3	Daffy	9	6	NULL
4	Pluto	NULL	NULL	NULL	No listens

Step 2: LEFT JOIN with Songs (Add artist)

user_id	name	song_id	title	artist	Notes
1	Mickey	1	Evermore	Taylor Swift
1	Mickey	2	Willow	Taylor Swift
1	Mickey	6	Yesterday	Beatles
2	Minnie	2	Willow	Taylor Swift
2	Minnie	7	Yellow Submarine	Beatles
2	Minnie	8	Hey Jude	Beatles
3	Daffy	1	Evermore	Taylor Swift
3	Daffy	2	Willow	Taylor Swift
3	Daffy	6	Yesterday	Beatles
4	Pluto	NULL	NULL	NULL	No songs

Step 3: WHERE filter (artist != 'Taylor Swift' OR artist IS NULL)

user_id	name	song_id	title	artist	Notes
1	Mickey	6	Yesterday	Beatles	Kept: artist != 'Taylor Swift'
2	Minnie	7	Yellow Submarine	Beatles	Kept: artist != 'Taylor Swift'
2	Minnie	8	Hey Jude	Beatles	Kept: artist != 'Taylor Swift'
3	Daffy	6	Yesterday	Beatles	Kept: artist != 'Taylor Swift'
4	Pluto	NULL	NULL	NULL	Kept: artist IS NULL

Step 4: GROUP BY and Final Output

	user_id	name
	1	Mickey
	2	Minnie
	3	Daffy
	4	Pluto

Query B Execution Trace

Step 1: Find Taylor Swift listeners

	user_id	Notes
	1	Listened to songs 1, 2 (Taylor Swift)
	2	Listened to song 2 (Taylor Swift)
	3	Listened to songs 1, 2 (Taylor Swift)

Step 2: NOT IN filter

user_id	name	Notes
1	Mickey	Excluded: user_id IN {1,2,3}
2	Minnie	Excluded: user_id IN {1,2,3}
3	Daffy	Excluded: user_id IN {1,2,3}
4	Pluto	Included: user_id NOT IN {1,2,3}

Final Output

	user_id	name
	4	Pluto

The Semantic Gap

Same Question, Different Answers: English is ambiguous. SQL is not.

English: Users who don't listen to Taylor Swift
Query A: "Users who listened to non-Taylor songs" (4 users)
Query B: "Users who never listened to Taylor" (1 user)