6.1 ยท Case Study: SQL vs Python, Pandas, and Polars

The Analytical Sandbox (1 - 10 GB)

Python is your go-to for quick prototyping with small data before you scale up with SQL pipelines.

How does SQL compare with alternatives like Python and Pandas?

Goal: Explore Python and Pandas as alternatives to SQL

SQL dominates data at scale, but procedural languages have their place. Let's dissect the alternatives and their tradeoffs.

Procedural Languages: Python, C++, and Java

Each has its own angle on big data processing:

  • Python: Simple and perfect for interactive data manipulation and quick prototyping.

  • C++: Offers performance-critical capabilities with low-level control, though it demands complex development.

  • Java: Known for portability and a robust ecosystem for scalability and parallelism.

Trade-offs: Consider Python's memory consumption, C++'s complexity, and Java's performance overhead in custom applications.

Python and the Pandas Library

Pandas is a versatile data-analysis library. Its DataFrame structure simplifies handling structured data, with over 600 functions for indexing, aggregation, and handling missing values. It's a must-have for data scientists working on a single machine.
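A minimal sketch of those DataFrame operations (the column names and values here are hypothetical):

```python
import pandas as pd

# A small sample of structured data with one missing value
df = pd.DataFrame({
    'artist': ['X', 'X', 'Y'],
    'plays': [10, None, 5],
})

df['plays'] = df['plays'].fillna(0)           # handle missing values
totals = df.groupby('artist')['plays'].sum()  # aggregation

print(totals.to_dict())  # {'X': 10.0, 'Y': 5.0}
```

Indexing, aggregation, and missing-value handling each take one line; that conciseness is what makes Pandas the default for single-machine exploration.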

Custom SQL Derivatives

Industry players often bridge the gap with custom tools:

  • Apache Spark: Uses SparkSQL for SQL operations on large, distributed datasets.

  • Cassandra: Adopts CQL (Cassandra Query Language), tailored for its distributed architecture.


Example: Computing Popular Songs

Consider our tables: Songs, Listens, and Users. We aim to find the top 10 most popular songs by listens.

The Procedural Way vs The Declarative Way

In pure Python, you'd write manual for loops and dictionaries to iterate over rows. This is imperative: you are telling the computer how to do it, step by step.
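Here is what that imperative version looks like (a sketch with hypothetical sample rows standing in for the Songs and Listens tables):

```python
# Manual join + aggregation with loops and dictionaries
songs = {1: ('Song A', 'Artist X'), 2: ('Song B', 'Artist Y')}
listens = [{'song_id': 1}, {'song_id': 1}, {'song_id': 2}]

counts = {}
for listen in listens:
    key = songs[listen['song_id']]       # "join": look up the song by ID
    counts[key] = counts.get(key, 0) + 1 # "group by + count": tally per song

# "order by + limit": sort by count descending, keep the top 10
top_10 = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:10]
print(top_10)  # [(('Song A', 'Artist X'), 2), (('Song B', 'Artist Y'), 1)]
```

Every step a SQL engine would plan for you (join order, aggregation, sorting) is spelled out by hand here.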

Instead, we use library APIs like Pandas or Polars that give us SQL-like declarative commands (tell it what to do):

import polars as pl

# Sample rows standing in for the Songs and Listens tables
songs = [
    {'id': 1, 'title': 'Song A', 'artist': 'Artist X'},
    {'id': 2, 'title': 'Song B', 'artist': 'Artist Y'},
]
listens = [{'song_id': 1}, {'song_id': 1}, {'song_id': 2}]

# Declarative Polars logic
songs_df = pl.DataFrame(songs)
listens_df = pl.DataFrame(listens)

popular = (
    listens_df.join(songs_df, left_on='song_id', right_on='id', how='inner')
    .group_by(['title', 'artist'])
    .agg(pl.len().alias('count'))
    .sort('count', descending=True)
    .head(10)
)

Logic: Convert to DataFrames, join on the song ID, group by title/artist, count the rows in each group, sort descending, and take the top 10.
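For comparison, the fully declarative SQL version of the same query, run here through Python's built-in sqlite3 module with hypothetical sample rows:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript("""
    CREATE TABLE Songs (id INTEGER, title TEXT, artist TEXT);
    CREATE TABLE Listens (song_id INTEGER);
    INSERT INTO Songs VALUES (1, 'Song A', 'Artist X'), (2, 'Song B', 'Artist Y');
    INSERT INTO Listens VALUES (1), (1), (2);
""")

# One declarative statement: the engine decides how to join, group, and sort
rows = conn.execute("""
    SELECT s.title, s.artist, COUNT(*) AS cnt
    FROM Listens l JOIN Songs s ON l.song_id = s.id
    GROUP BY s.title, s.artist
    ORDER BY cnt DESC
    LIMIT 10
""").fetchall()
print(rows)  # [('Song A', 'Artist X', 2), ('Song B', 'Artist Y', 1)]
```

The query describes only the result we want; the same statement would run unchanged on a distributed SQL engine over billions of rows.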


When to use Pandas versus SQL?

The "RAM Wall"

A common question: "Why use SQL for big data when I have Pandas in Python?"

  • Pandas (How): You tell Python how to process the data. Great for data science, but it must fit into your RAM. If UberEats has 100GB of orders and you have 16GB of RAM, Pandas fails.

  • SQL (What): You tell the database what you want. It handles the scale, even if the data spans 1,000 servers.

The Why: Declarative commands let the database optimize your query while you focus on features.
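The same top-songs query in Pandas illustrates the in-memory approach: the whole dataset must be materialized as DataFrames before the chain runs (a sketch with hypothetical sample rows):

```python
import pandas as pd

songs_df = pd.DataFrame({'id': [1, 2],
                         'title': ['Song A', 'Song B'],
                         'artist': ['Artist X', 'Artist Y']})
listens_df = pd.DataFrame({'song_id': [1, 1, 2]})

popular = (
    listens_df.merge(songs_df, left_on='song_id', right_on='id')  # join in RAM
    .groupby(['title', 'artist'])
    .size()                        # count listens per (title, artist)
    .reset_index(name='count')
    .sort_values('count', ascending=False)
    .head(10)
)
print(popular)
```

This works beautifully at gigabyte scale, but every intermediate table above lives in RAM, which is exactly where the 100GB-of-orders scenario breaks down.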

The "Scale" Limit (SQL Strengths)

Consider the Spotify library: 50 million songs and billions of user listens.

  • Pandas falters because billions of rows won't fit into a laptop's memory.

  • SQL excels as distributed engines coordinate across thousands of machines to process petabytes.

When to use Polars versus Pandas?

Want Pandas-like syntax but with larger-than-RAM data or better performance?

Enter Polars: A fast DataFrame library built in Rust.

The "Default Choice" & The Rust Rewrite Trend

Why not always use Polars over Pandas? For new projects, Polars is increasingly the default. The industry is in the middle of a broad trend of rewriting core big-data tooling natively in Rust (Polars and Apache DataFusion, for example) to extract every drop of multi-threaded performance from a single node before making the expensive jump to distributed clusters.

Out-of-Core Processing

  • Pandas Limitation: Needs the entire dataset in memory. Larger than RAM? Expect an out-of-memory error.

  • Polars Solution: Supports "out-of-core" processing, streaming data from disk in chunks and using lazy evaluation.

  • Multi-threaded: Automatically parallelizes operations across all CPU cores.

Remember, even Polars hits the limits of a single machine's disk and CPU speed. For true petabyte-scale analysis, distributed SQL databases are essential.

Future Directions: Pandas/SQL Compilers

Another approach is writing Python code that translates into SQL.

Modin and Snowpark

  • Modin offers a drop-in replacement for the Pandas API that distributes work across cores and nodes. Snowflake acquired Ponder, Modin's maintainers, in 2023, and builds on it in Snowpark to translate Pandas calls into distributed SQL: write logic in Python and let the engine handle execution across nodes. Ponder reported scaling to a terabyte of data and speeding workloads by 16x without changing Python code.

Takeaway for CS145:

Use Pandas/Polars for fast data exploration on your laptop. Use SQL for declarative, distributed production apps and large-scale data.