Dear builders, we need a columnar DB, pronto! You wrote SQL from the user side in Project 1. In Project 2 you build the engine that runs it. You'll store data in Parquet, hand-parse a target query, implement Hash-Partition and Sort-Merge joins from first principles, plan around a 12 GB memory envelope, and benchmark what you built.
This is a from-scratch engineering project, not an extension of Project 1. Your dataset is generated by a Colab we provide, and your target query is fixed (see below). If you're partnering, the natural divide is one person on storage and parsing, the other on join algorithms, and both of you on planning and benchmarking.
You'll submit one Colab notebook (.ipynb) with all output visible. Proposal due Oct 30. Final due Nov 20. Five weeks of work total, about 7 hours a week. Jot down AI surprises as they happen so you can write them up later.
Oct 17 through Oct 23, spend 6 to 7 hours.
Run the data-generation Colab. Store the three tables as Parquet files. Measure compression ratios and read/write performance. Hand-parse the target query: pull out the join conditions, aggregations, GROUP BY, and ORDER BY.
Oct 24 through Oct 30, spend 6 to 7 hours.
Pick either HPJ or SMJ and get a correct 10-row version working. Write the proposal: which join you implemented first, your memory strategy for 1 GB, and any bonus tier you're chasing. Submit by Oct 30.
Oct 31 through Nov 6, spend 8 to 10 hours.
Finish the second join algorithm. Implement GROUP BY and COUNT DISTINCT on the joined result. Verify both algorithms produce identical output on small inputs.
Nov 7 through Nov 13, spend 7 to 8 hours.
Read Parquet metadata without loading data. Build the planner that picks HPJ vs. SMJ based on memory. Add column pruning so you only load the columns you need. Keep noting AI surprises, especially the times AI got something wrong and you caught it.
Nov 14 through Nov 20, spend 5 to 7 hours.
Run the 100 MB and 1 GB benchmarks with both algorithms. Write the Writeup & Reflection section. Turn your scratch notes into 2 or 3 AI anecdotes plus one plain-English walkthrough of a piece of your engine. Make sure all output cells are visible and submit.
Your engine must parse, optimize, and execute this three-table join. Favor algorithmic efficiency over brute-force loading.
SELECT s.song_id, AVG(u.age) AS avg_age,
       COUNT(DISTINCT l.user_id)
FROM Songs s
JOIN Listens l ON s.song_id = l.song_id
JOIN Users u ON l.user_id = u.user_id
GROUP BY s.song_id, s.title
ORDER BY COUNT(DISTINCT l.user_id) DESC, s.song_id;
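To make "hand-parse" concrete, here is a minimal sketch of one representation your parser could emit. The dict layout and key names are our own illustration, not a required format:

```python
# One possible shape for the hand-parsed query. The structure and
# key names here are illustrative choices, not a required format.
parsed_query = {
    "tables": ["Songs", "Listens", "Users"],
    "joins": [
        ("Songs.song_id", "Listens.song_id"),
        ("Listens.user_id", "Users.user_id"),
    ],
    "aggregations": [("AVG", "Users.age"), ("COUNT_DISTINCT", "Listens.user_id")],
    "group_by": ["Songs.song_id", "Songs.title"],
    "order_by": [("COUNT_DISTINCT(Listens.user_id)", "DESC"), ("Songs.song_id", "ASC")],
}
```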
Each table includes 10 extra "noise" string columns. These simulate real-world data bloat and force you to prove your column pruning actually works. Ignoring the noise columns during joins should be one of your biggest I/O wins.
Schemas (the padding columns are the noise columns described above):

- Songs: song_id (int64), title, plus 10 padding columns
- Listens: listen_id, song_id, user_id (int64), plus 10 padding columns
- Users: user_id, age (int64), plus 10 padding columns

No pd.merge() as the join itself: you'll implement the binary join logic of Hash and Sort-Merge joins from first principles. Core rule: you must implement the partitioning logic, hash-table probe, and merge-intersection yourself. Using pd.merge() as a black box for the primary join is an automatic failure of the technical objective. Python's built-in sorting (sorted(), .sort()) is fair game.

Flow: Songs JOIN Listens → Result1 JOIN Users → Final → GROUP BY → ORDER BY. Aggregation happens after both joins complete. You may emit intermediate Parquet files between stages.
A short reflection at the end of your notebook. Keep it informal. We're looking for evidence that you understood what you built and used AI as a collaborator.
Quality work gets 95/100: complete all sections with working code and clear comments. This is a high A and what most students should target.
Above 95 is rare. It's reserved for genuine engineering insight: a benchmark result that changes how you'd design the engine, or an optimization that measurably moves the numbers. We don't reward more lines of code for their own sake.
Use these exact headers in your Colab notebook:
# Section 0: Data Generation
# Section 1: Parquet Storage Analysis
# Section 2: SQL Parsing
# Section 3: Join Algorithms Implementation & Aggregation
# Section 4: Query Planning & Optimization
# Section 5: Performance Benchmarking
# Section 6: Writeup, Reflection & AI Takeaways
Two phases: data preparation (Sections 0 and 1) and query execution (Sections 2 through 5). The key intuition: read metadata first, then load only the columns you need, then pick HPJ vs. SMJ based on what fits in memory.
On 1 GB compressed, naive pd.read_parquet('listens.parquet') loads all 13 columns
and blows up to about 8 GB in RAM. Crash likely. Columnar pruning with
pq.read_table(path, columns=['song_id', 'user_id']) keeps you under 1 GB.
# For 1 GB compressed total:
listens.parquet: 750 MB compressed → load only (song_id, user_id)  → ~150 MB in memory
songs.parquet:    50 MB compressed → load only (song_id, title)    → ~20 MB in memory
users.parquet:   200 MB compressed → load only (user_id, age)      → ~80 MB in memory
# Total active memory: ~250 MB vs. 8+ GB if loading everything!
Additional strategies: save intermediate Parquet files between stages;
del dataframe + gc.collect() after each join; start with smaller tables to
minimize intermediate results; batch-process aggregations if needed.
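A minimal sketch of the spill-and-free pattern, assuming hash_partition_join, songs, and listens are your own objects from earlier sections (hypothetical names):

```python
import gc
import pandas as pd

# Spill-and-free pattern. hash_partition_join is your own Section 3
# join; the name here is a hypothetical stand-in.
result1 = hash_partition_join(songs, listens, key="song_id")
result1.to_parquet("intermediate_result1.parquet")  # spill stage output to disk
del songs, listens, result1                         # drop the references
gc.collect()                                        # reclaim the memory now

# Reload only when the next stage needs it
result1 = pd.read_parquet("intermediate_result1.parquet")
```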
Short. Aim for one page total, as a markdown cell at the top of your project notebook. Cover which join you implemented first, your memory strategy for the 1 GB tier, and any bonus tier you're chasing.
The proposal is a checkpoint, not a separately graded deliverable. We'll give feedback so you can catch scope problems before you've sunk too many hours in.
Parquet is a columnar storage format with excellent compression and query performance. See the Storage Layout lesson for the theory. Practical benefits for this project:
Column pruning above all: SELECT song_id doesn't touch the other 12 columns.

import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd
# Write with compression
df.to_parquet('songs.parquet',
              compression='snappy',  # or 'gzip', 'lz4'
              row_group_size=50000)
# Read only specific columns (columnar advantage!)
df = pd.read_parquet('songs.parquet', columns=['song_id', 'title'])
# Inspect metadata WITHOUT reading data
parquet_file = pq.ParquetFile('songs.parquet')
print(f"Num row groups: {parquet_file.num_row_groups}")
print(f"Schema: {parquet_file.schema}")
print(f"Metadata: {parquet_file.metadata}")
# Read in chunks for large files
for batch in parquet_file.iter_batches(batch_size=10000):
    df_chunk = batch.to_pandas()
    # Process chunk
Use Snappy. Best balance of speed and compression for this project. Gzip compresses more but is slow; LZ4 is slightly faster but weaker.
Check sizes on disk with ls -lh *.parquet or
os.path.getsize('file.parquet'). For the 100 MB benchmark, the total Parquet footprint should
be about 100 MB, with Listens around 75% of it (it's the activity log). Parquet
typically achieves 3 to 5 times compression vs. CSV.
Store them in your Parquet files (they simulate real-world data bloat), but ignore them during joins. Only join on IDs. Use the noise columns to measure the impact of columnar pruning.
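One way to quantify that win is a before/after read, sketched below. Run it at the 100 MB tier, since the unpruned read may not fit in RAM at 1 GB:

```python
import time
import pandas as pd

# Rough before/after measurement of column pruning.
t0 = time.time()
full = pd.read_parquet("listens.parquet")  # all 13 columns
t_full = time.time() - t0

t0 = time.time()
pruned = pd.read_parquet("listens.parquet", columns=["song_id", "user_id"])
t_pruned = time.time() - t0

print(f"full:   {t_full:.2f}s, {full.memory_usage(deep=True).sum() / 1e6:.0f} MB")
print(f"pruned: {t_pruned:.2f}s, {pruned.memory_usage(deep=True).sum() / 1e6:.0f} MB")
```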
Why not pd.merge()? This is a systems class. The objective is to understand the mechanics of hash-table and sort-merge
algorithms. pd.merge() is a high-level black box that bypasses the learning objective.
You must implement the partitioning logic, hash-table probe, and merge-intersection yourself.
# OK: your own partitioning, with pd.merge() only inside each partition
partitions = my_hash_partition(left, right, key)
for left_part, right_part in partitions:
    result = pd.merge(left_part, right_part, on=key)

# NOT OK: pd.merge() as your entire algorithm
return pd.merge(left_df, right_df, on=key)
def choose_algorithm(table1_size, table2_size, available_memory):
    smaller = min(table1_size, table2_size)
    # Pandas needs ~5x file size in RAM
    if smaller * 5 < available_memory:
        return "HPJ"  # hash join is faster when it fits
    else:
        return "SMJ"  # sort-merge for limited memory

# Also consider:
# - Data already sorted? → SMJ
# - Need sorted output for GROUP BY? → SMJ
# - Very skewed join keys? → SMJ handles better
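And one way you might call it, feeding on-disk sizes and currently free RAM into the heuristic above (a sketch; the exact wiring is up to you):

```python
import os
import psutil

# Hypothetical wiring for the planner. Compressed file size understates
# in-memory size, which is what the 5x factor above accounts for.
left_size = os.path.getsize("songs.parquet")
right_size = os.path.getsize("listens.parquet")
algo = choose_algorithm(left_size, right_size, psutil.virtual_memory().available)
print(f"Planner chose {algo}")
```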
# Option 1: pandas groupby (simple)
result = joined_df.groupby(['song_id', 'title']).agg({
    'age': 'mean',
    'user_id': lambda x: x.nunique()
})

# Option 2: manual (shows understanding)
groups = {}
for row in joined_data:
    key = (row['song_id'], row['title'])
    if key not in groups:
        groups[key] = {'ages': [], 'users': set()}
    groups[key]['ages'].append(row['age'])
    groups[key]['users'].add(row['user_id'])
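Option 2 still needs a finishing pass that reduces each group to its final aggregates and applies the ORDER BY; a sketch continuing from the groups dict above:

```python
# Reduce each group to (song_id, title, avg_age, distinct_listeners) ...
rows = [
    (song_id, title, sum(g["ages"]) / len(g["ages"]), len(g["users"]))
    for (song_id, title), g in groups.items()
]
# ... then ORDER BY COUNT(DISTINCT user_id) DESC, song_id ASC
rows.sort(key=lambda r: (-r[3], r[0]))
```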
COUNT DISTINCT approaches: Python set() (simple), sort + count
transitions (works with limited memory), or HyperLogLog for approximate results on big data.
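A sketch of the sort-and-count-transitions approach; it pairs naturally with an external sort, since it needs only one sequential pass:

```python
def count_distinct_sorted(values):
    # Sort, then count the positions where the value changes.
    values = sorted(values)
    if not values:
        return 0
    distinct = 1
    for prev, cur in zip(values, values[1:]):
        if cur != prev:
            distinct += 1
    return distinct
```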
| Metric | What to measure | How |
|---|---|---|
| Execution time | Total query time | time.time() or %%time |
| I/O time | Parquet read/write | Measure separately from compute |
| Memory | Peak consumption | memory_profiler / psutil |
| Compression ratio | Parquet vs. raw | File sizes on disk |
| Join selectivity | Output rows / input rows | Count at each stage |
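A minimal harness that follows the warm-up-and-average advice in the pitfalls table below (run_query is a hypothetical stand-in for your own query entry point):

```python
import time
import psutil

def benchmark(run_query, n_runs=3):
    run_query()  # warm-up run: populates the file-system cache, then discarded
    times = []
    for _ in range(n_runs):
        t0 = time.time()
        run_query()
        times.append(time.time() - t0)
    # Current RSS, not a true peak; use memory_profiler if you need peaks.
    rss_mb = psutil.Process().memory_info().rss / 1e6
    print(f"avg {sum(times) / len(times):.2f}s over {n_runs} runs, RSS ~{rss_mb:.0f} MB")
```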
| Pitfall | Do this instead |
|---|---|
| Using pd.merge() as the whole join (or using Polars) | Implement partitioning / hash table / merge yourself; use pd.merge() only inside your partitions. |
| Assuming keys appear once (1:1 join) | Handle many-to-many correctly. Test with duplicate keys in both tables. |
| Loading full Parquet then selecting columns | Analyze metadata first (pq.ParquetFile(path).metadata), then load only the columns you need. |
| Implementing joins in Section 4 (planning) | Section 3 is joins + aggregation. Section 4 is planning and algorithm selection. |
| Comparing HPJ and SMJ on different queries, no warm-up | Same query, both algorithms. Warm the file-system cache, average multiple runs. |
OK: validating your output against pd.merge() for correctness; you just can't ship pd.merge() as your algorithm. Use AI freely for Parquet and PyArrow semantics, benchmarking harnesses, performance analysis scaffolding, and identifying bottlenecks after you've profiled. The real bar is whether you can talk through your own engine in plain English.
Not OK: AI-written join algorithms or query planning. HPJ and SMJ were covered in detail in class. Implement them from your lectures, not from Claude. If you can't explain your partitioning strategy or merge-intersection without pulling up a chat window, you used too much. See the AI Policy for the full rule.
Short and informal. Think half a page, not a report. You need 2 or 3 anecdotes (a few sentences each) and one plain-English walkthrough of a piece of your engine.
Good anecdotes capture a moment. A few shapes that tend to work:
"My first SMJ sorted everything in memory with sorted(). For the 1 GB case I had to switch to chunked external sort because the full table didn't fit in RAM."

For the walkthrough, pick one piece of your engine (HPJ, SMJ, the planner, or the aggregation). Write 2 or 3 sentences saying what it does in plain English. Aim for an explanation a reader who hasn't taken the class could still follow.
Sanity-check your join output against pd.merge() on small datasets before submitting. Proposal due Oct 30. Final notebook due Nov 20. Submit both to Gradescope.
Submit a single .ipynb notebook. You must follow the Stanford Honor Code. Original engineering is required.