Execution: Run It on Real Data

Concept. Step two: execution. A one-shot query gets no feedback, so a wrong table name or a bad join goes uncorrected, and the model never finds out. Give the model an execution environment, a real database it reaches through a tool like MCP or a CLI, and it writes, runs, reads the error, and corrects. That loop is the difference between one-shot and multi-shot, and it is most of the gap in the BIRD numbers.

Intuition. It is the loop you run by hand: write a query, run it, read the error, fix it. The model just does it in seconds, many times over. Execution catches mechanical failures, because the database either rejects the query or returns a result that is visibly empty. It cannot catch a query that runs cleanly and answers the wrong question.

You defined the spec on the last page. Now the model writes SQL against it. A single attempt rarely lands: the first query references a column that moved, or joins on the wrong key. With nowhere to run it, the model ships that query blind. With a database to run against, it iterates.

The Loop

An English question goes to an LLM, which takes one of two paths. Grey: one blind shot with no feedback, so the query may crash, return nothing, or be wrong. Blue: with a database reached through a tool like MCP or a CLI, it writes, runs, reads the error, and fixes, repeating until the guess becomes a query that runs. A query that runs is not yet a query that is right.

Figure 1. A one-shot model emits SQL blind (grey) and ships whatever it wrote. Wired to a database through MCP (Model Context Protocol) or a CLI, it runs the query, reads the error, and corrects (blue), so a guess becomes a query that runs. The loop is only possible because the output is SQL: you cannot run an English sentence. This is multi-shot, and it accounts for most of the difference between BIRD's low scores and its high ones.
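The loop in Figure 1 can be sketched in a few lines of Python. This is a minimal illustration, not a reference setup: it assumes SQLite through the standard-library `sqlite3` module in place of a real database behind MCP or a CLI, and `fake_model` is a hypothetical stand-in for the LLM call that returns a broken query first and a corrected one once it has seen an error. The tables and column names are invented for the example.

```python
import sqlite3
from typing import Optional


def make_db() -> sqlite3.Connection:
    """Build a tiny in-memory database to run queries against (illustrative schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
        INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
        INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 15.0), (12, 2, 40.0);
        """
    )
    return conn


def fake_model(question: str, feedback: Optional[str]) -> str:
    """Hypothetical stand-in for an LLM call: a bad first guess, then a fix after an error."""
    if feedback is None:
        # First attempt references a column that does not exist.
        return "SELECT customer, SUM(total) FROM orders GROUP BY customer"
    # Second attempt, written after reading the error, joins to get the name.
    return (
        "SELECT c.name, SUM(o.total) FROM orders o "
        "JOIN customers c ON c.id = o.customer_id "
        "GROUP BY c.name ORDER BY c.name"
    )


def run_with_feedback(conn: sqlite3.Connection, question: str, max_attempts: int = 3):
    """Write, run, read the error, correct; stop at the first query that runs."""
    feedback = None
    for attempt in range(1, max_attempts + 1):
        sql = fake_model(question, feedback)
        try:
            rows = conn.execute(sql).fetchall()
        except sqlite3.Error as exc:
            # The database's error message is the feedback for the next attempt.
            feedback = f"{type(exc).__name__}: {exc}"
            continue
        return sql, rows, attempt
    raise RuntimeError(f"no runnable query after {max_attempts} attempts: {feedback}")


if __name__ == "__main__":
    sql, rows, attempts = run_with_feedback(make_db(), "Total spend per customer?")
    print(f"attempts={attempts}")
    print(rows)
```

In this sketch the first attempt fails because `orders` has no `customer` column, the error text is passed back as feedback, and the second attempt runs. The `max_attempts` cap is a design choice worth keeping in any real loop, so a model that never converges does not iterate forever.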

What It Catches, What It Misses

Execution catches mechanical failures for free, because the database does the checking: a missing column or a type mismatch raises an error, and a join on the wrong key comes back empty. The model sees the error or the empty result and fixes the query without you.
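A small sketch of the signals the model reads back, again assuming SQLite via `sqlite3` and an invented two-table schema. The `observe` helper is hypothetical, not part of any tool named here. SQLite is permissive about column types, so this example shows only a missing column and an empty join; a stricter engine would typically also raise on a type mismatch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
    INSERT INTO customers VALUES (1, 'Ada');
    INSERT INTO orders VALUES (10, 1, 25.0);
    """
)


def observe(sql: str) -> str:
    """Run a query and return the signal a model would read back (hypothetical helper)."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error as exc:
        return f"error: {exc}"
    if not rows:
        return "empty result: 0 rows"
    return f"ok: {len(rows)} row(s)"


# A column that does not exist: the database rejects the query.
missing_column = observe("SELECT customer_name FROM orders")

# A join on the wrong key: the query runs but matches nothing.
wrong_join_key = observe("SELECT c.name FROM customers c JOIN orders o ON c.id = o.id")

# The corrected join returns the expected row.
fixed = observe("SELECT c.name FROM customers c JOIN orders o ON c.id = o.customer_id")

if __name__ == "__main__":
    for label, signal in [
        ("missing column", missing_column),
        ("wrong join key", wrong_join_key),
        ("fixed join", fixed),
    ]:
        print(f"{label}: {signal}")
```

Note that the empty join is not an error from the database's point of view. It is a signal only because the loop treats zero rows as suspicious, which is a choice made in `observe`, and a query that legitimately returns nothing would need different handling.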

It cannot catch a query that runs cleanly and returns a plausible but wrong number. The database has no opinion about what you meant; it runs what it is given. You defined the meaning in the spec, but the database never checks the query against it: runs ≠ correct. Checking the result against the spec is the next step.
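One common way this happens is a join that fans out and double-counts. The sketch below is an illustrative assumption, using SQLite via `sqlite3` and an invented `orders` / `order_items` schema. Both queries run without error, and only one answers the question "what is total revenue?".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL);
    CREATE TABLE order_items (order_id INTEGER, sku TEXT);
    INSERT INTO orders VALUES (1, 100.0), (2, 50.0);
    INSERT INTO order_items VALUES (1, 'A'), (1, 'B'), (2, 'C');
    """
)

# Plausible but wrong: joining to items repeats order 1 once per item,
# so its total is counted twice. The database raises nothing.
plausible = conn.execute(
    "SELECT SUM(o.total) FROM orders o JOIN order_items i ON i.order_id = o.id"
).fetchone()[0]

# Intended: sum each order's total exactly once.
intended = conn.execute("SELECT SUM(total) FROM orders").fetchone()[0]

if __name__ == "__main__":
    print(f"plausible-looking answer: {plausible}")
    print(f"intended answer:          {intended}")
```

Nothing in the execution loop distinguishes the two: both return a single non-empty number. Telling them apart takes a check against the intended meaning, which execution alone does not provide.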

Key Takeaways

  1. One-shot has no feedback; multi-shot iterates. An execution environment turns a single guess into a write-run-read-correct loop.

  2. Execution catches mechanical failures for free. Crashes, type errors, and empty results all surface the moment the database runs the query.

  3. It cannot catch a wrong-but-running query. The database has no view on your intent. That gap is the next step, verification.


Next

Verification: Prove the Result → The model iterated to a query that runs. But runs ≠ correct. Third step: verification. Test the result the way software engineers come to trust code, with unit tests and regression tests.