Pandas: Deconstructing Small Analytics

Pandas, Polars, and the limit at the edge of RAM.

Concept. Pandas and Polars assume your whole dataset fits in your machine's RAM, and work on it there. Fast, familiar, and bounded by one machine.

Intuition. A laptop holds maybe sixteen gigabytes of RAM. Spotify holds billions of song listens. The bridge between those two numbers is the rest of this course.

Case Study Reading Time: 5 mins

Figure 1. The part you already know. Pandas and Polars (the same idea in Rust, faster on big files) load your whole dataset into RAM and work on it there: one process, one machine, fast until the data exceeds RAM (right), at which point the process crashes. This is a database in the small, short-lived and bounded by one machine. Removing that limit is the rest of the course.

A familiar example

One schema runs through the whole course: Songs (about a hundred million of them on real Spotify), Listens (billions of rows, one per play), and Users. The task: find the top 10 songs by play count.

In pure Python you would write nested for loops over the rows, build a dictionary of counts, and sort it at the end. That is imperative code: you spell out how to compute the answer, step by step.
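As a minimal sketch of that imperative version, the snippet below uses tiny made-up tables (the titles, artists, and the function name top_songs_imperative are hypothetical, chosen only for illustration). The inner loop plays the role of the join, the dictionary holds the counts, and the sort comes last.

```python
# Hypothetical miniature tables; the real Songs and Listens are far larger.
songs = [
    {"song_id": 1, "title": "Alpha", "artist": "Ann"},
    {"song_id": 2, "title": "Beta", "artist": "Bo"},
    {"song_id": 3, "title": "Gamma", "artist": "Cy"},
]
listens = [
    {"user_id": 10, "song_id": 1},
    {"user_id": 11, "song_id": 2},
    {"user_id": 10, "song_id": 1},
    {"user_id": 12, "song_id": 3},
    {"user_id": 11, "song_id": 1},
    {"user_id": 12, "song_id": 2},
]


def top_songs_imperative(listens, songs, n=10):
    """Spell out every step: match rows, count plays, sort, cut to n."""
    counts = {}
    for listen in listens:
        for song in songs:  # nested loop: find the song this play refers to
            if song["song_id"] == listen["song_id"]:
                key = (song["title"], song["artist"])
                counts[key] = counts.get(key, 0) + 1
    # Sort the (key, plays) pairs by play count, highest first.
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]


top = top_songs_imperative(listens, songs)
for (title, artist), plays in top:
    print(title, artist, plays)
```

Every decision here (loop order, the dictionary, the sort call) is yours to make and to get wrong, which is the burden the declarative versions remove.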

Pandas and Polars give you something better, and so does SQL. Same intent, two surfaces, side by side:

Polars (Python)

import polars as pl

# listens_df and songs_df are Polars DataFrames loaded elsewhere.
popular = (
    listens_df
    .join(songs_df, on='song_id', how='inner')  # match each play to its song
    .group_by(['title', 'artist'])
    .agg(pl.len().alias('plays'))               # count rows per group
    .sort('plays', descending=True)
    .head(10)
)

SQL

SELECT title, artist, COUNT(*) AS plays
FROM listens
JOIN songs ON listens.song_id = songs.song_id
GROUP BY title, artist
ORDER BY plays DESC
LIMIT 10;

Read them together. Neither one says how: no loops, no counters, no sort algorithm. You declare the join, the grouping, and the order; the engine works out how to walk the rows. That is the declarative style, and SQL is its canonical form.

The SQL version has one extra property: it is a program that does not change with scale. It runs unchanged against a thousand rows on your laptop or a hundred billion across two thousand machines; only the execution plan changes. You do not rewrite the query when the data grows. You point it at a bigger engine.
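To make the small end of that claim concrete, here is a sketch that runs the query text from above, unchanged, against an in-memory SQLite database from Python's standard library. The table contents and column types are invented for illustration, and SQLite stands in for "a small engine on your laptop"; a distributed engine would take the same query text and produce a different plan.

```python
import sqlite3

# A throwaway in-memory database with hypothetical rows.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE songs (song_id INTEGER PRIMARY KEY, title TEXT, artist TEXT);
    CREATE TABLE listens (user_id INTEGER, song_id INTEGER);
    """
)
conn.executemany(
    "INSERT INTO songs VALUES (?, ?, ?)",
    [(1, "Alpha", "Ann"), (2, "Beta", "Bo"), (3, "Gamma", "Cy")],
)
conn.executemany(
    "INSERT INTO listens VALUES (?, ?)",
    [(10, 1), (11, 2), (10, 1), (12, 3), (11, 1), (12, 2)],
)

# The same query text as in the SQL listing above.
QUERY = """
SELECT title, artist, COUNT(*) AS plays
FROM listens
JOIN songs ON listens.song_id = songs.song_id
GROUP BY title, artist
ORDER BY plays DESC
LIMIT 10;
"""

rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)

# The plan is the engine's business, not the query's: ask SQLite how it
# intends to execute the statement. The exact text varies by SQLite version.
plan = conn.execute("EXPLAIN QUERY PLAN " + QUERY).fetchall()
for step in plan:
    print(step)
```

The query string never mentions indexes, loop order, or machines. Those choices live in the plan, which is the part that changes when the engine or the data size changes.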

Pandas, Polars, and a database

That last property is the whole story, because Spotify's Listens table does not fit in a laptop's sixteen gigabytes, or in a sixty-four-gigabyte workstation. What each tool does when the data gets that big is what separates them.
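A back-of-the-envelope check shows why. Both inputs below are assumptions made for illustration, not Spotify figures: five billion rows stands in for "billions", and each row is taken to be three 8-byte integers (a user id, a song id, and a timestamp), ignoring any per-row overhead.

```python
# ASSUMPTIONS, not measured values: row count and row width are illustrative.
ASSUMED_ROWS = 5_000_000_000
ASSUMED_BYTES_PER_ROW = 24  # three 8-byte integers: user_id, song_id, timestamp
GB = 10**9  # decimal gigabytes

table_gb = ASSUMED_ROWS * ASSUMED_BYTES_PER_ROW / GB

# How many times larger than RAM is the raw table on each machine?
ratios = {}
for name, ram_gb in (("laptop", 16), ("workstation", 64)):
    ratios[name] = table_gb / ram_gb
    print(f"{name}: table is {table_gb:.0f} GB, {ratios[name]:.1f}x its {ram_gb} GB of RAM")
```

Under these assumptions the raw table alone is several times larger than the laptop's RAM and still larger than the workstation's, before counting any intermediate copies made during the join and the grouping.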

Figure 2. Pandas loads every row into RAM and often copies data at each step, so it stops the moment the data passes RAM. Polars keeps a similar DataFrame API but can run a lazy, streaming, all-cores plan that materializes only what it needs, which pushes that point far out, though it is still one machine with finite disk and CPU. A database flips the problem: you send the query to the data, and the engine spreads the scan and join across many machines, returning just your rows. The first two make one machine go further. The third stops the data from having to fit on one machine at all.
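The difference between "load everything, then compute" and "materialize only what you need" can be sketched in plain Python. This is an illustration of the idea only, not how Pandas or Polars are implemented; listen_stream, count_eager, and count_streaming are hypothetical names, and the generator stands in for a scan over a file.

```python
from collections import Counter
from typing import Iterable, Iterator


def listen_stream(n_rows: int) -> Iterator[int]:
    """Yield song_ids one at a time, standing in for a file scan (made-up data)."""
    for i in range(n_rows):
        yield i % 3 + 1  # song_ids 1, 2, 3 in rotation


def count_eager(song_ids: Iterable[int]) -> Counter:
    """Load-everything style: memory grows with the number of rows."""
    rows = list(song_ids)  # every row is held in memory before any counting
    return Counter(rows)


def count_streaming(song_ids: Iterable[int]) -> Counter:
    """Streaming style: memory grows with the number of distinct songs."""
    counts: Counter = Counter()
    for song_id in song_ids:  # only one row is in flight at a time
        counts[song_id] += 1
    return counts


eager = count_eager(listen_stream(9_000))
streamed = count_streaming(listen_stream(9_000))
print(eager == streamed, streamed.most_common(3))
```

Both functions return the same counts. The eager one must hold all rows at once, while the streaming one holds only the running totals, which is why a streaming plan survives inputs larger than RAM as long as the aggregated state stays small.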

Where this leaves Pandas in 2026

Pandas and Polars are not going away. A lot of data science work happens at small scale, and a DataFrame on a laptop is the right tool for those jobs. The Rust rewrite trend (Polars and friends) keeps pushing the single-machine ceiling higher every year, squeezing more out of one box before forcing the jump to a cluster.

But the moment your dataset crosses the single-machine line, you need something different. The next teardown shows the same RAM-ceiling failure in a place you would not expect: inside an AI coding agent, when its context window fills.

Takeaway

Pandas on a laptop is the small case of a database, one you already know. The RAM limit you just saw is the same limit an AI agent hits when its context window fills. The rest of CS145 is about what happens when small is no longer enough.