Execution: Run It on Real Data
Concept. Step two: execution. A one-shot query gets no feedback: a wrong table name or a bad join just fails, and the model never finds out. Give the model an execution environment, a real database it reaches through MCP or a CLI, and it writes the query, runs it, reads the error, and corrects. That loop is the difference between one-shot and multi-shot, and it accounts for most of the gap in the BIRD numbers.
Intuition. It is the loop you run by hand: write a query, run it, read the error, fix it. The model just does it in seconds, many times over. Execution catches everything mechanical, because the database rejects it. It cannot catch a query that runs cleanly and answers the wrong question.
You defined the spec on the last page. Now the model writes SQL against it. A single attempt rarely lands: the first query references a column that moved, or joins on the wrong key. With nowhere to run it, the model ships that query blind. With a database to run against, it iterates.
The Loop
Figure 1. A one-shot model emits SQL blind (grey) and ships whatever it wrote. Wired to a database through MCP (Model Context Protocol) or a CLI, it runs the query, reads the error, and corrects (blue), so a guess becomes a query that runs. The loop is possible only because the output is SQL: you cannot run an English sentence. This is multi-shot, the difference between BIRD's low scores and its high ones.
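The write-run-read-correct loop can be sketched in a few lines. This is a minimal illustration, not a real MCP or CLI integration: it assumes an in-memory SQLite database stands in for the execution environment, and `run_with_feedback`, `scripted_model`, and the `orders` table are hypothetical names invented for the example. The "model" here just replays two fixed guesses; a real one would use the error text to write its next attempt.

```python
import sqlite3


def run_with_feedback(conn, propose_sql, max_attempts=3):
    """Write, run, read the error, correct. Returns at the first query that executes."""
    feedback = None  # nothing to report before the first attempt
    for attempt in range(1, max_attempts + 1):
        sql = propose_sql(feedback)                  # write
        try:
            rows = conn.execute(sql).fetchall()      # run
        except sqlite3.Error as exc:
            feedback = f"{sql} failed: {exc}"        # read the error
            continue                                 # correct on the next pass
        return sql, rows, attempt
    raise RuntimeError(f"no runnable query after {max_attempts} attempts ({feedback})")


def scripted_model(candidates):
    """Hypothetical stand-in for a model: replays fixed guesses and ignores feedback."""
    remaining = iter(candidates)
    return lambda feedback: next(remaining)


# Toy schema, assumed for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
    INSERT INTO orders VALUES (1, 10, 25.0), (2, 10, 75.0), (3, 11, 40.0);
""")

model = scripted_model([
    "SELECT SUM(amount) FROM orders",  # first guess: the column is not called amount
    "SELECT SUM(total) FROM orders",   # corrected guess
])

sql, rows, attempts = run_with_feedback(conn, model)
print(f"attempt {attempts}: {sql} -> {rows}")
```

With `max_attempts=1` the same call behaves like one-shot: the first guess fails and nothing gets a chance to correct it.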
What It Catches, What It Misses
Execution catches everything mechanical, for free, because the database does the checking: a missing column, a type mismatch, a join that returns nothing. The model sees the error, or the empty result, and fixes it without you.
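A small sketch of what those mechanical signals look like from the model's side, again assuming SQLite as a stand-in database with a hypothetical `customers`/`orders` schema and a hypothetical `observe` helper. It shows two of the three cases: a missing column, which the database rejects outright, and a join on the wrong key, which runs but comes back empty. A type mismatch is left out because SQLite is permissive about types and often would not raise one; a stricter engine would.

```python
import sqlite3

# Toy schema, assumed for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
    INSERT INTO customers VALUES (10, 'Ada'), (11, 'Grace');
    INSERT INTO orders VALUES (1, 10, 25.0), (2, 11, 40.0);
""")


def observe(conn, sql):
    """Run a query and describe what came back, as the feedback a model would read."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error as exc:
        return f"error: {exc}"
    if not rows:
        return "ran, but returned no rows"
    return f"ok: {len(rows)} row(s)"


# Missing column: the database refuses to run the query.
missing_column = observe(conn, "SELECT amount FROM orders")

# Wrong join key: customers.id never equals orders.id in this data, so nothing matches.
wrong_join_key = observe(
    conn, "SELECT c.name, o.total FROM customers c JOIN orders o ON c.id = o.id"
)

# Right join key: every order finds its customer.
right_join_key = observe(
    conn, "SELECT c.name, o.total FROM customers c JOIN orders o ON c.id = o.customer_id"
)

for label, outcome in [("missing column", missing_column),
                       ("wrong join key", wrong_join_key),
                       ("right join key", right_join_key)]:
    print(f"{label}: {outcome}")
```

Note that the empty join is not an error. It only counts as a signal because the loop treats "no rows" as suspicious and reports it back.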
It cannot catch a query that runs cleanly and returns a plausible but wrong number. The database has no opinion about what you meant; it runs what it is given. You defined the meaning with the spec, but runs ≠ correct. Verifying correctness is the next step.
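One common way this happens is a join that silently duplicates rows. The sketch below is an assumed example, with a hypothetical `orders`/`order_items` schema in SQLite: the question is "what is total revenue?", and both queries execute without complaint, but the second counts an order once per line item.

```python
import sqlite3

# Toy schema, assumed for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL);
    CREATE TABLE order_items (id INTEGER PRIMARY KEY, order_id INTEGER, sku TEXT);
    INSERT INTO orders VALUES (1, 100.0), (2, 50.0);
    INSERT INTO order_items VALUES (1, 1, 'A'), (2, 1, 'B'), (3, 2, 'C');
""")

# What the question means: each order counted once.
intended_sql = "SELECT SUM(total) FROM orders"

# What a model might plausibly write: the join repeats order 1 for each of its two items.
fan_out_sql = """
    SELECT SUM(o.total)
    FROM orders o
    JOIN order_items i ON i.order_id = o.id
"""

intended_total = conn.execute(intended_sql).fetchone()[0]
fan_out_total = conn.execute(fan_out_sql).fetchone()[0]

# Neither query raises; only the first answers the question that was asked.
print(f"intended: {intended_total}")  # 150.0
print(f"fan-out:  {fan_out_total}")   # 250.0 (100 + 100 + 50)
```

Nothing in the execution loop distinguishes these two results. Telling them apart takes a check that knows the expected answer.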
Key Takeaways
- One-shot has no feedback; multi-shot iterates. An execution environment turns a single guess into a write-run-read-correct loop.
- Execution catches mechanical failures for free. Crashes, type errors, and empty results all surface the moment the database runs the query.
- It cannot catch a wrong-but-running query. The database has no view on your intent. That gap is the next step, verification.
Next
Verification: Prove the Result → The model iterated to a query that runs. But runs ≠ correct. Third step: verification. Test the result the way software earns trust in code, with unit tests and regression tests.