Case Study 1.3: Claude's Agentic Stack
Concept. Without context, Claude answers Anthropic's internal business questions at 21%. Give it precision as infrastructure, a governed semantic layer and skills that route it to the right definitions, and the same model reaches 95%. This is the section at company scale: precision does the heavy lifting, execution runs the query, verification gates it, and SQL is the spine. The numbers are Anthropic's own, on their own data.
Intuition. One analyst runs the three steps by hand: precision, execution, verification. A company wires them into infrastructure so thousands of questions a week get the same treatment. The model still only generates; the steps stay yours.
A stakeholder asks in Slack: "How many active users did we have last week?" Claude writes SQL, runs it, and returns a confident, wrong number, because "active user" has a dozen definitions in the warehouse and the model picked one. That is runs ≠ correct across a company. Anthropic's data team wrote up how they closed it in "How Anthropic enables self-service data analytics with Claude".
21% to 95%
21%
Bare model, no context
95%
With the stack (precision)
<1 pt
From raw SQL access alone
The 21% is the bare model: it can run and retry against the warehouse, so multi-shot execution is already there, but with no context it guesses what "active user" means. Precision, the definitions and the routing to them, is what lifts it to 95%. The third number is the punchline: handing the model the company's entire history of past SQL, raw, moved accuracy less than a point. Structure does the work, not access.
These are Anthropic's internal numbers: their own warehouse, their own questions, scored on their own offline evals. That is one company's data, not BIRD, the general benchmark that spans many varied, complex external databases. 95% on your own warehouse is a different, narrower test than BIRD's 16 to 77%, and the blog never runs on BIRD. What carries over is the method, not the number: context and checks lift a bare model, at one desk or across a company.
The Stack Is the Three Steps
The same three steps you ran by hand, now built as infrastructure, one person versus a company:
| Step | You, one query | Anthropic, every query |
|---|---|---|
| Precision | the spec in your prompt | a governed semantic layer; skills route the model to the ~30 docs that matter |
| Execution | run it in a CLI | the agent loop, automated |
| Verification | a debug table | the same table as a regression suite of evals, gating each domain at ~90% |
Both scales refuse the same wrong answer, the model's guess. The model writes the query; precision and verification stay yours, and SQL is the spine.
Takeaway: The bare model scores 21% on Anthropic's own data; the three steps lift it to 95%. The numbers are internal, not BIRD, but the method is the one you run by hand: precision, execution, verification, all on SQL.
Next
Intermediate SQL Quiz → That closes the Agentic Data Stacks section: one-shot versus multi-shot, the three steps, and a company that runs all of it at 95%. The quiz tests the intermediate toolkit underneath.