Case Study 1A.1: SQL vs Python, Pandas, and Polars

The Analytical Sandbox (1 - 10 GB)

Python is your go-to for quick prototyping with small data before you scale up with SQL pipelines.

How does SQL compare with alternatives like Python and Pandas?

Goal: Explore Python and Pandas as alternatives to SQL

SQL dominates data at scale, but procedural languages have their place. Let's dissect the alternatives and their tradeoffs.

Procedural Languages: Python, C++, and Java

Each has its own angle on big data processing:

Trade-offs: Consider Python's memory consumption, C++'s complexity, and Java's performance overhead in custom applications.

Python and the Pandas Library

Pandas is versatile for data analysis. Its DataFrame structure simplifies handling structured data with over 600 functions for indexing, aggregation, and missing values. It's a must-have for data scientists on a single machine.

Custom SQL Derivatives

Industry players often bridge the gap with custom tools:


Example: Computing Popular Songs

Consider our tables: Songs, Listens, and Users. We aim to find the top 10 most popular songs by listens.

The Procedural Way (Pure Python)

In pure Python, represent these as arrays and use loops to calculate the result.

# Pure Python Logic
def get_popular_songs(songs, listens):
    counts = {}
    for listen in listens:
        song_id = listen['song_id']
        counts[song_id] = counts.get(song_id, 0) + 1

    # Merge titles and sort
    popular = []
    for s_id, count in counts.items():
        song = next(s for s in songs if s['id'] == s_id)
        popular.append({'title': song['title'], 'artist': song['artist'], 'count': count})

    return sorted(popular, key=lambda x: x['count'], reverse=True)[:10]

Logic: Iterate through each listen, count them in a dictionary, find the song title, and sort at the end.

The Library Way (Pandas)

Pandas simplifies this with DataFrame operations that resemble SQL but are written in Python.

import pandas as pd

# Pandas Logic
songs_df = pd.DataFrame(songs)
listens_df = pd.DataFrame(listens)

merged = pd.merge(listens_df, songs_df, on='id')
popular = merged.groupby(['title', 'artist']).size().reset_index(name='count')
top_10 = popular.sort_values('count', ascending=False).head(10)

Logic: Convert to DataFrames, merge on ID, group by title/artist, count, sort, and take the head.

The Modern Way (Polars)

Polars offers a DataFrame API like Pandas, but it's built in Rust for multi-threaded performance and lazy evaluation.

import polars as pl

# Polars Logic
songs_df = pl.DataFrame(songs)
listens_df = pl.DataFrame(listens)

popular = (
    listens_df.join(songs_df, left_on='song_id', right_on='id', how='inner')
    .group_by(['title', 'artist'])
    .agg(pl.len().alias('count'))
    .sort('count', descending=True)
    .head(10)
)

Logic: Convert to DataFrames, join on ID, group by title/artist, count the length, sort descending, and take the head.


When to use Pandas versus SQL?

The "RAM Wall"

A common question: "Why use SQL for big data when I have Pandas in Python?"

The Why: Declarative commands let the database optimize your query while you focus on features.

The "Scale" Limit (SQL Strengths)

Consider the Spotify library: 50 million songs and billions of user listens.

When to use Polars versus Pandas?

Want Pandas-like syntax but with larger-than-RAM data or better performance?

Enter Polars: A fast DataFrame library in Rust.

The "Default Choice"

Why not always use Polars over Pandas? For new projects, Polars is becoming the norm. Choose Polars unless maintaining legacy code or needing specific Pandas-only features. Polars is faster, more memory-efficient, and has a robust API.

Out-of-Core Processing

Remember, even Polars hits the limits of a single machine's disk and CPU speed. For true petabyte-scale analysis, distributed SQL databases are essential.

Future Directions: Pandas/SQL Compilers

Another approach is writing Python code that translates into SQL.

Modin and Snowpark pandas

Takeaway for CS145:

Use Pandas/Polars for fast data exploration on your laptop. Use SQL for declarative, distributed production apps and large-scale data.