LLMs and SQL: Syntax and Semantics

Core Concept

Myth: "LLMs Will Replace SQL Skills"

The Sweet Spot: LLMs draft the boilerplate. You verify the logic with Debug Tables. Ship faster, avoid costly errors.

Real-World Performance Data

The BIRD benchmark (https://bird-bench.github.io/) evaluates LLMs on SQL generation:

Why you get paid the big bucks: Master semantics. Know SQL and the business inside out.


Generating Syntax with LLMs

Before debugging logic, ensure a syntactically correct foundation. Supply the right context:

  1. The Exact Schema Constraints: Include precise CREATE TABLE statements or schema definitions.

  2. The Relational Paths: Clearly state table joins. Don't let the LLM guess.

  3. The Execution Engine: Specify the SQL dialect—PostgreSQL, BigQuery, SQLite.

Semantic Traps to Avoid: Even with perfect context, LLMs stumble on semantics. Always check for:

An Example of a Good Prompt Template

Reduce hallucinations with a structured prompt.

Example Prompt:

[Engine] I am using PostgreSQL.

[Schema] Here are my tables: Users (user_id INT, name VARCHAR, join_date DATE) Purchases (purchase_id INT, user_id INT, amount DECIMAL, status VARCHAR) Purchases.user_id joins to Users.user_id.

[Definitions & Logic] - "Active Users": Users who joined before '2025-01-01'. - "Top Users": Users with a total purchase amount > $500 where status = 'COMPLETED'.

[The Request] Write a query to find the name and total revenue of all Active, Top Users. Exclude users without completed purchases.


Query Comparison: "Users who don't listen to Taylor Swift"

Query A: LEFT JOIN Approach

Query B: NOT IN Subquery Approach

Reminder: Beware the NULL in NOT IN death trap! Here we're safe since l.user_id can't be NULL per our Spotify schema. Adding WHERE l.user_id IS NOT NULL is still fine defensive practice.

Step-by-Step Debug Tables

Query A Execution Trace

Step 1: LEFT JOIN Users with Listens

user_idnamelisten_idsong_idratingNotes
1Mickey114.5
1Mickey224.2
1Mickey363.9
2Minnie424.7
2Minnie574.6
2Minnie683.9
3Daffy712.9
3Daffy824.9
3Daffy96NULL
4PlutoNULLNULLNULLNo listens

Step 2: LEFT JOIN with Songs (Add artist)

user_idnamesong_idtitleartistNotes
1Mickey1EvermoreTaylor Swift
1Mickey2WillowTaylor Swift
1Mickey6YesterdayBeatles
2Minnie2WillowTaylor Swift
2Minnie7Yellow SubmarineBeatles
2Minnie8Hey JudeBeatles
3Daffy1EvermoreTaylor Swift
3Daffy2WillowTaylor Swift
3Daffy6YesterdayBeatles
4PlutoNULLNULLNULLNo songs

Step 3: WHERE filter (artist != 'Taylor Swift' OR artist IS NULL)

user_idnamesong_idtitleartistNotes
1Mickey6YesterdayBeatlesKept: artist != 'Taylor Swift'
2Minnie7Yellow SubmarineBeatlesKept: artist != 'Taylor Swift'
2Minnie8Hey JudeBeatlesKept: artist != 'Taylor Swift'
3Daffy6YesterdayBeatlesKept: artist != 'Taylor Swift'
4PlutoNULLNULLNULLKept: artist IS NULL

Step 4: GROUP BY and Final Output

user_idname
1Mickey
2Minnie
3Daffy
4Pluto

Query B Execution Trace

Step 1: Find Taylor Swift listeners

user_idNotes
1Listened to songs 1, 2 (Taylor Swift)
2Listened to song 2 (Taylor Swift)
3Listened to songs 1, 2 (Taylor Swift)

Step 2: NOT IN filter

user_idnameNotes
1MickeyExcluded: user_id IN {1,2,3}
2MinnieExcluded: user_id IN {1,2,3}
3DaffyExcluded: user_id IN {1,2,3}
4PlutoIncluded: user_id NOT IN {1,2,3}

Final Output

user_idname
4Pluto

The Semantic Gap

Same Question, Different Answers: English is ambiguous. SQL is not.