Claude & OpenAI: Deconstructing Agent Memory

Take a coding agent apart and you find a database.

Concept. An AI coding agent with no memory hallucinates. A persistent SQL database is the fix: first one file on your laptop, then the same primitive, one per user, scaled to millions of users.

Intuition. The context window is short-term memory, wiped the moment the session ends. Anything that has to outlast it lives in a database. That single idea is the spine of the rest of this course.

Case Study Reading Time: 6 mins

Real facts: store and retrieve, local then global Three panels left to right. Panel one, red: a grey de-emphasized model circle with only a context window; it hallucinates. Panel two, green: the same model plus a local database, SQLite running claude-mem, a file on disk on one machine, queried with SQL; it looks the answer up. Panel three, green: the same model plus the same database scaled out, ChatGPT's user data, Postgres on disk with read replicas, also queried with SQL, one private memory per user across 800 million users. The model talks to each database using SQL. Models keep upgrading and databases keep scaling; the database is the core infrastructure that holds their memory and context, and that is what is in focus. Color key: red is no memory and the ephemeral context window, grey is the model which is a separate topic out of focus, green is persistent memory. Real facts: store and retrieve, local then global Models keep upgrading and databases keep scaling. The database holds their memory and context. 1 · Model alone Model context window Short-term working memory. Wiped when the session ends. ✗ Hallucinates a function 2 · + Local memory Model SQL SQLite · claude-mem On disk, one machine. Survives every session. ✓ Looks it up · per laptop 3 · + Global memory Model SQL Postgres ChatGPT's user database. On disk, many machines · × 800M ✓ One per user · ×800M RAM forgets · disk remembers · one machine, then many Color key  red = no memory, the ephemeral context window  ·  grey = the model, a separate topic, out of focus  ·  green = persistent memory

Figure 1. Giving the model real facts, in two steps. On its own the model (grey, a separate topic) makes facts up. First, a database lets it store and retrieve real facts, read and write instead of invent. Second, the same database scaled across machines holds one memory per user for millions of them. The database is the constant infrastructure that manages the facts. Local memory is Modules 1 to 4; global memory is Modules 5 and 6.

No memory, and the hallucination

On Monday you and Claude Code build a custom auth module together. You agree on a few function names, fix a tricky edge case, ship it. On Tuesday you open a new session and ask Claude to extend it. The agent has no idea who you are. It confidently calls a function verifyJWTWithScope() that does not exist anywhere in your codebase, inventing a name that sounds right.

That is a hallucination: a confident, fluent output that happens to be wrong. The agent is doing its best with no context, and without memory, "its best" includes inventing things that sound plausible. The second failure mode is quieter: rediscovery, where the agent re-derives the same pattern, the same bug fix, the same architectural decision every session. Your time and your tokens both pay for it.

Hollywood made a movie about exactly this problem twenty years before AI went mainstream. 50 First Dates. Lucy has amnesia, so each morning she wakes up thinking she has just met her boyfriend Henry. Henry's fix is a recorded videotape. Each night he records what happened that day, their inside jokes, the state of the relationship. He plays it back every morning before saying hello. By breakfast she is caught up. By bedtime the recording is wiped, and he records the next one. Henry is doing what every AI engineer now does for a living. The recording is a database. It is also the cure for hallucination.

The context window is the agent's working-memory limit

Remember the RAM limit from the previous teardown: a fast working memory that cannot hold the whole problem. An agent hits the same limit, with a different name.

Modern agents work inside a context window, the chunk of text the model can see at once. Claude's is roughly 200,000 to 1,000,000 tokens. That sounds enormous until you compare it to a real project: a mid-sized codebase plus a few weeks of design discussion runs into the tens of millions of tokens. Worse, every new session starts empty. Yesterday's plans, decisions, and bug fixes evaporate the moment the window closes.

The context window is short-term memory: it forgets the moment the session ends. Anything that has to outlast a session needs long-term memory, a database. Your brain splits the same way: a phone number lasts five seconds, a friend's name lasts decades. The model is Lucy. The database does the remembering for it.

So what kills hallucination is whatever survives on disk after the process exits. The rest of this course is about how many machines hold that disk. Local memory is the smallest case: one database on one machine.

Local memory: a database on your machine

Local memory: write and read are just SQL A grey, de-emphasized model on the left (an LLM plus context extraction, for example pulling function signatures from code, a separate topic that is out of focus) talks to one green embedded database on your machine using SQL. A top arrow is SQL write, the model into the database. A bottom arrow is SQL read, the database back out through a small rerank box and into the model, a closed loop. The green database, claude-mem, is the focus: it holds two cylinders, SQLite with FTS5 for keyword and structured lookup, what you said, and a sqlite-vec vector index for semantic lookup, what you meant. Color key: grey is the model and your machine, a separate topic out of focus; green is the persistent embedded database, what we care about. Local memory: write and read are just SQL claude-mem on your machine: SQLite for what you said, vectors for what you meant. Model LLM + context extraction e.g. code → signatures SQL write SQL read rerank YOUR MACHINE One embedded database · claude-mem SQLite · FTS5 keyword · what you said Vector · sqlite-vec semantic · what you meant Color key  grey = the model and your machine, out of focus  ·  green = the persistent database

Figure 2. Local memory is a closed SQL loop: the model (grey, out of focus) writes with SQL and reads back through a rerank step. The database is the focus (green): one embedded claude-mem file, SQLite/FTS5 for keyword and sqlite-vec for semantic recall. One file, no server.

In late 2025, Tobi LΓΌtke (CEO of Shopify, a ~$150B company) hit exactly this pain on his own coding agent. Over a weekend he built QMD, an open-source local search engine that gives agents persistent memory across sessions (github.com/tobi/qmd). It indexes whatever you point it at (docs, notes, codebases) and retrieves the few snippets that matter, exposed over MCP so agents like Claude Code can call it mid-session. The premise: an agent with persistent searchable memory does not hallucinate the name of a function it could just look up.

QMD is one of several tools converging on the same answer. The claude-mem plugin passively logs observations during your sessions for later recall. Aider tackles understanding the current codebase. All of them store their memory in SQLite, for the same four reasons:

  • Zero config. The agent runs locally; the database has to start the moment the agent does, with no server to launch.

  • Embedded. It runs inside the agent's process. No network hop, no auth surface.

  • Transactional. A write mid-task must not corrupt the database if the laptop sleeps or the process dies.

  • One file. You can copy it, version it, or open it in any SQLite client and query it.

This is the cure for hallucination, and it is a single file on one machine.

Global memory: the same need, at scale

Global memory: one Postgres primary, many read replicas, 800 million users On the left, two grey de-emphasized boxes that are a separate topic out of focus: 800 million ChatGPT users, and the GPT-5 model, OpenAI's LLM. They send SQL writes to one green Postgres primary inside a green shared-database region; the primary streams replication to about fifty read replicas that serve the SQL reads. Same SQL write and SQL read pattern as the local database, one machine versus many. Color key: grey is the users and the GPT-5 model, a separate topic out of focus; green is the shared Postgres database, what we care about. Global memory: one logical database, many machines One primary takes every write; ~50 replicas fan the reads out to 800 million users. ChatGPT users × 800M writes and reads GPT-5 model OpenAI's LLM a separate topic Shared database · Postgres / SQL Primary takes all writes ~50 read replicas serve all reads replicate SQL write SQL read · fanned out from ~50 replicas Color key  grey = the users and the GPT-5 model, out of focus  ·  green = the shared Postgres database

Figure 3. Global memory is the same database, copied. The users and GPT-5 model (grey, a separate topic) issue the same SQL as in Figure 2; only the database differs (green): one Postgres primary takes every write and streams replication to about fifty read replicas serving 800 million users. The database just keeps scaling, one machine versus many.

One laptop holds one engineer's memory. Production holds everyone's. The product pattern is the same at both ends: the model writes the reply, the database remembers your name, your conversations, your billing. Going from local to global is pure scale. Each user keeps their own memory, and the same primitive grows to hold all of them. OpenAI scaled that to 800 million users by adding read replicas around a single primary.

The database does not change as it grows; it migrates. The same observations, the same retrieval, move from a per-laptop SQLite file to a shared cloud database read by every agent and every engineer on the project. Next question: what happens when the database outgrows any single machine and hundreds of users hit it at once? The next teardown makes it concrete with UberEats, a planet-scale service where three parties must agree on every order.

Key Takeaways

  1. Short-term versus long-term. The context window is short-term memory, wiped every session. The database is long-term memory. Drop the long-term half and the model hallucinates.

  2. One teardown, the whole course. A database on one machine (a language, storage, speed, transactions) is Modules 1 through 4. A database across many machines (replication, sharding) is Modules 5 and 6. This single example just walked the syllabus.

  3. The fundamentals compound. The simplest SQL database, one file, is the load-bearing primitive the most senior engineer in commerce reached for to give his agent a memory. Choose the simplest database that fits the problem; that is usually enough, even at the frontier.


Going deeper

How the database and the retrieval work, in detail.

The observations table

A simplified agent-memory database fits in one schema:

CREATE TABLE observations (
    id           INTEGER PRIMARY KEY,  -- unique row id; no two observations share one
    session_id   TEXT,                 -- which working session produced this
    project      TEXT,                 -- which codebase/project it belongs to
    created_at   TIMESTAMP,            -- when it was written; used to sort by recency
    kind         TEXT,                 -- 'decision' | 'pattern' | 'bug-fix' | 'todo'
    content      TEXT,                 -- the actual remembered text the agent searches
    embedding    BLOB                  -- the content as a vector, for semantic search
);

-- Indexes make the common lookups fast (more on indexes in Module 3).
CREATE INDEX idx_session ON observations(session_id);  -- "everything from this session"
CREATE INDEX idx_project ON observations(project);     -- "everything in this project"

QMD's index lives at ~/.cache/qmd/index.sqlite with the sqlite-vec extension and SQLite's built-in FTS5 module. claude-mem keeps sessions and observations in SQLite paired with a vector sidecar. Aider's repo-map cache uses diskcache, itself SQLite under the hood.

Three ranking signals

Once the database is set, agent memory becomes a question of what to rank by. Three signals show up across every serious tool, and a single agent often combines all three.

1. Keyword search (BM25, "Best Matching 25", a standard text-ranking algorithm). Classical full-text search weighted by term frequency and inverse document frequency, built into SQLite via FTS5. Best for exact recall: "find where we mentioned verifyJWTWithScope."

-- "What did we decide about auth in this project?"
-- observations_fts is an FTS5 virtual table over observations.content.
SELECT o.content FROM observations o
JOIN observations_fts f ON f.rowid = o.id
WHERE observations_fts MATCH 'auth'
  AND o.project = 'cs145'
  AND o.kind = 'decision'
ORDER BY bm25(observations_fts)
LIMIT 10;

2. Vector similarity. Embedding-based semantic search. The question is converted to a vector and the agent retrieves observations whose vectors point the same way. Best for paraphrase: "what are we doing about token refresh?" surfaces any observation about session lifetime even if "refresh" never appears.

-- :query_embedding is computed beforehand by an external embedding model.
SELECT content FROM observations
WHERE project = 'cs145'
ORDER BY vec_distance_cosine(embedding, :query_embedding) ASC
LIMIT 10;

3. Graph centrality (PageRank). Code-aware retrieval rather than session memory. Parse the codebase with tree-sitter, treat files as nodes and symbol references as edges, run personalized PageRank seeded on what the user is asking about. Aider's RepoMap is the canonical implementation; claude-mem ships its own as the smart-explore skill. The PageRank runs in application code; the symbol graph sits in the same SQLite database.

The three are orthogonal. BM25 catches what you said. Vectors catch what you meant. PageRank catches what is structurally relevant in your code right now. The right mix is workload-specific, and because the database speaks SQL, tuning the retrieval is a query edit, not a rebuild: weight BM25 twice as heavily by editing the ORDER BY; filter by recency by adding a WHERE. The same declarative power you have all course (say what you want, let the engine decide how) is what lets an agent builder iterate on memory by editing one line of SQL.