Pandas: Deconstructing Small Analytics

Pandas, Polars, and the limit at the edge of RAM.

Concept. Pandas and Polars assume your whole dataset fits in your machine's RAM, and work on it there. Fast, familiar, and bounded by one machine.

Intuition. A laptop holds maybe sixteen gigabytes of RAM. Spotify holds billions of song listens. The bridge between those two numbers is the rest of this course.

Case Study Reading Time: 5 mins

Pandas: the whole dataset has to fit in RAM. One process, one machine. Everything lives in memory, until it does not fit. A blue-striped box is the single machine, with about 16 gigabytes of RAM, holding a DataFrame (every row, one process) that is grouped, sorted, and joined entirely in memory before the result is handed back fast. Grey shows the de-emphasized CSV or Parquet file on disk that is read in. On the right, a red-striped box is the failure state: when the data exceeds RAM, the program crashes or thrashes. Color key: blue is the one machine in focus, red is the out-of-RAM failure, grey is out of focus.

Figure 1. The part you already know. Pandas and Polars (the same idea in Rust, faster on big files) load your whole dataset into RAM and work on it there: one process, one machine, fast until the data exceeds RAM (right), where the program crashes or thrashes. This is a database in the small, short-lived and bounded by one machine. Removing that limit is the rest of the course.

A familiar example

One schema runs through the whole course: Songs (about a hundred million of them on real Spotify), Listens (billions of rows, one per play), and Users. The task: find the top 10 songs by play count.

In pure Python you would write nested for loops over the rows, build a dictionary of counts, and sort it at the end. That is imperative code: you spell out how to compute the answer, step by step.
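A minimal sketch of that imperative version, assuming each table is a small list of dictionaries. The sample rows and the helper name `top_songs` are made up for illustration and are not part of the course schema:

```python
from collections import defaultdict

# Hypothetical sample rows; the real tables would be far larger.
songs = [
    {"song_id": 1, "title": "Alpha", "artist": "Ann"},
    {"song_id": 2, "title": "Beta", "artist": "Bob"},
    {"song_id": 3, "title": "Gamma", "artist": "Cy"},
]
listens = [
    {"user_id": 10, "song_id": 1},
    {"user_id": 11, "song_id": 2},
    {"user_id": 12, "song_id": 1},
    {"user_id": 10, "song_id": 1},
    {"user_id": 11, "song_id": 3},
    {"user_id": 12, "song_id": 2},
]

def top_songs(listens, songs, n=10):
    """Count plays per (title, artist) with explicit loops, then sort."""
    counts = defaultdict(int)
    for listen in listens:          # outer loop: every play
        for song in songs:          # inner loop: find the matching song
            if song["song_id"] == listen["song_id"]:
                counts[(song["title"], song["artist"])] += 1
    # Sort by play count, highest first, and keep the first n.
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:n]

print(top_songs(listens, songs))
```

Every decision here is spelled out by hand: which table to loop over first, how to match rows, where to keep the counts, and how to sort them.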

Pandas and Polars give you something better, and so does SQL. Same intent, two surfaces, side by side:

Polars (Python)

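import polars as pl

# listens_df and songs_df are assumed to be DataFrames already loaded in memory.
popular = (
    listens_df
    .join(songs_df, on='song_id', how='inner')  # attach title and artist to each listen
    .group_by(['title', 'artist'])
    .agg(pl.len().alias('plays'))               # one row per song, with its play count
    .sort('plays', descending=True)
    .head(10)
)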

SQL

SELECT title, artist, COUNT(*) AS plays
FROM listens
JOIN songs ON listens.song_id = songs.song_id
GROUP BY title, artist
ORDER BY plays DESC
LIMIT 10;

Read them together. Neither one says how: no loops, no counters, no sort algorithm. You declare the join, the grouping, and the order; the engine works out how to walk the rows. That is the declarative style, and SQL is its canonical form.

The SQL version has one extra property: it is a program that does not change with scale. It runs unchanged against a thousand rows on your laptop or a hundred billion across two thousand machines; only the execution plan changes. You do not rewrite the query when the data grows. You point it at a bigger engine.
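To see the small end of that claim, here is a sketch that runs the same query text against a tiny in-memory database using Python's built-in `sqlite3` module. The sample rows, the column types, and the helper name `run_top_songs` are assumptions for illustration. A larger engine would take the same query string and choose a different plan:

```python
import sqlite3

# The query from above, unchanged.
QUERY = """
SELECT title, artist, COUNT(*) AS plays
FROM listens
JOIN songs ON listens.song_id = songs.song_id
GROUP BY title, artist
ORDER BY plays DESC
LIMIT 10;
"""

def run_top_songs(conn):
    """Run the query on whatever connection is passed in."""
    return conn.execute(QUERY).fetchall()

# Hypothetical miniature dataset in an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE songs (song_id INTEGER PRIMARY KEY, title TEXT, artist TEXT)")
conn.execute("CREATE TABLE listens (user_id INTEGER, song_id INTEGER)")
conn.executemany(
    "INSERT INTO songs VALUES (?, ?, ?)",
    [(1, "Alpha", "Ann"), (2, "Beta", "Bob"), (3, "Gamma", "Cy")],
)
conn.executemany(
    "INSERT INTO listens VALUES (?, ?)",
    [(10, 1), (11, 2), (12, 1), (10, 1), (11, 3), (12, 2)],
)

rows = run_top_songs(conn)
print(rows)
```

The function never inspects how many rows exist or how they are stored. That decision belongs to the engine behind the connection.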

Pandas, Polars, and a database

That last property is the whole story, because Spotify's Listens table does not fit in a laptop's sixteen gigabytes of RAM, or in a workstation's sixty-four. What each tool does when the data gets that big is what separates them.
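A back-of-envelope check makes the gap concrete. The figures below are assumptions, not Spotify's real numbers: ten billion rows stands in for "billions", and 24 bytes per row assumes three 8-byte fields (user, song, timestamp) with no overhead. The function names are hypothetical:

```python
def table_size_gb(rows, bytes_per_row):
    """Raw size of a table in decimal gigabytes (10**9 bytes)."""
    return rows * bytes_per_row / 1e9

def fits_in_ram(rows, bytes_per_row, ram_gb):
    """True if the raw table fits in the given amount of RAM."""
    return table_size_gb(rows, bytes_per_row) <= ram_gb

# Assumed figures for illustration only.
ASSUMED_ROWS = 10_000_000_000   # "billions of rows"
ASSUMED_BYTES_PER_ROW = 24      # three 8-byte fields

size = table_size_gb(ASSUMED_ROWS, ASSUMED_BYTES_PER_ROW)
print(size)                                                    # raw size in GB
print(fits_in_ram(ASSUMED_ROWS, ASSUMED_BYTES_PER_ROW, 16))    # laptop
print(fits_in_ram(ASSUMED_ROWS, ASSUMED_BYTES_PER_ROW, 64))    # workstation
```

Under these assumptions the raw table is 240 GB, well past both machines, before counting the extra memory a join or sort needs.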

Pandas, Polars, and a database as the data grows. Same query; the data keeps growing. Three panels, left to right, show what each engine does differently. Pandas (blue) loads every row into RAM in one process on one core and makes a full copy at each step, so it dies when the data passes RAM. Polars (blue) builds a lazy plan first, streams from disk in chunks, uses every CPU core, and drops work it does not need, which pushes the limit far out but still on one machine. A database (green) ships the query to the data, plans across many machines and disks, and returns just your rows, with no single-machine limit. Color key: blue is one machine, green is many machines, red is the out-of-RAM failure.

Figure 2. Pandas loads every row into RAM and copies at each step, so it stops the moment the data passes RAM. Polars keeps almost the same API but runs a lazy, streaming, all-cores plan that only materializes what it needs, which pushes that point far out, though it is still one machine with a finite disk and CPU. A database flips the problem: you send the query to the data, and the engine spreads the scan and join across many machines, returning just your rows. The first two make one machine go further. The third stops the data from having to fit on one machine at all.
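The streaming idea in the Polars panel can be sketched in plain Python. This is not how Polars is implemented; it only illustrates why reading in chunks bounds memory by the chunk size plus the number of distinct songs, rather than by the number of rows. The helper name `count_plays_streaming` and the sample data are hypothetical:

```python
import csv
import io
from collections import Counter
from itertools import islice

def count_plays_streaming(rows, chunk_size=2):
    """Count plays per song_id, holding at most chunk_size rows at a time."""
    totals = Counter()
    rows = iter(rows)
    while True:
        chunk = list(islice(rows, chunk_size))   # only this chunk is in memory
        if not chunk:
            break
        totals.update(row["song_id"] for row in chunk)
    return totals

# Hypothetical listens file, stood in for by an in-memory string.
listens_csv = io.StringIO(
    "user_id,song_id\n10,1\n11,2\n12,1\n10,1\n11,3\n12,2\n"
)
plays = count_plays_streaming(csv.DictReader(listens_csv))
print(plays.most_common(2))
```

The running totals still have to fit on one machine, which is the same single-machine ceiling the figure describes, only pushed further out.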

Where this leaves Pandas in 2026

Pandas and Polars are not going away. A lot of data science work happens at small scale, and a DataFrame on a laptop is the right tool for those jobs. The Rust rewrite trend (Polars and friends) keeps pushing the single-machine ceiling higher every year, squeezing more out of one box before forcing the jump to a cluster.

But the moment your dataset crosses the single-machine line, you need something different. The next teardown shows the same RAM-ceiling failure in a place you would not expect: inside an AI coding agent, when its context window fills.

Takeaway

Pandas on a laptop is the small case of a database, one you already know. The RAM limit you just saw is the same limit an AI agent hits when its context window fills. The rest of CS145 is about what happens when small is no longer enough.