Project 2: NanoQuery Columnar Mogrifier

CS145: Stanford University, Fall 2026. Systems Track.

Technical Memo: Building a Columnar Execution Engine

Dear builders, we need a columnar DB, pronto! You wrote SQL from the user side in Project 1. In Project 2 you build the engine that runs it. You'll store data in Parquet, hand-parse a target query, implement Hash-Partition and Sort-Merge joins from first principles, plan around a 12 GB memory envelope, and benchmark what you built.

Mission: build a columnar database that processes queries faster than a freeze ray.

Where to start

This is a from-scratch engineering project, not an extension of Project 1. Your dataset is generated by a Colab we provide, and your target query is fixed (see below). If you're partnering, the natural divide is one person on storage and parsing, the other on join algorithms, and both of you on planning and benchmarking.

Project Scope

Estimated Work-Back Schedule

Proposal due Oct 30. Final due Nov 20. Five weeks of work total, about 7 hours a week. Jot down AI surprises as they happen so you can write them up later.

Week 1 · Storage & Parsing

Oct 17 through Oct 23, spend 6 to 7 hours.

Run the data-generation Colab. Store the three tables as Parquet files. Measure compression ratios and read/write performance. Hand-parse the target query: pull out the join conditions, aggregations, GROUP BY, and ORDER BY.

Week 2 · Proposal & First Join

Oct 24 through Oct 30, spend 6 to 7 hours.

Pick either HPJ or SMJ and get a correct 10-row version working. Write the proposal: which join you implemented first, your memory strategy for 1 GB, and any bonus tier you're chasing. Submit by Oct 30.

Week 3 · Both Joins + Aggregation

Oct 31 through Nov 6, spend 8 to 10 hours.

Finish the second join algorithm. Implement GROUP BY and COUNT DISTINCT on the joined result. Verify both algorithms produce identical output on small inputs.

Week 4 · Query Planning

Nov 7 through Nov 13, spend 7 to 8 hours.

Read Parquet metadata without loading data. Build the planner that picks HPJ vs. SMJ based on memory. Add column pruning so you only load the columns you need. Keep noting AI surprises, especially the times AI got something wrong and you caught it.

Week 5 · Benchmark, Writeup, Submit

Nov 14 through Nov 20, spend 5 to 7 hours.

Run the 100 MB and 1 GB benchmarks with both algorithms. Write the Writeup & Reflection section. Turn your scratch notes into 2 or 3 AI anecdotes plus one plain-English walkthrough of a piece of your engine. Make sure all output cells are visible and submit.

Expected total: 30 to 40 hours (about 7 hours a week).

The Target Analytical Query

Your engine must parse, optimize, and execute this three-table join. Favor algorithmic efficiency over brute-force loading.

SELECT s.song_id, AVG(u.age) AS avg_age,
       COUNT(DISTINCT l.user_id)
FROM Songs s
JOIN Listens l ON s.song_id = l.song_id
JOIN Users u ON l.user_id = u.user_id
GROUP BY s.song_id, s.title
ORDER BY COUNT(DISTINCT l.user_id) DESC, s.song_id;
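Because the query is fixed, your Section 2 parser doesn't need to handle general SQL. A regex sketch along these lines (it assumes this exact clause order and trailing semicolon; parse_query is a hypothetical name, not starter code) can extract everything the later stages need:

import re

def parse_query(sql):
    """Hand-parse the fixed target query with regexes; not a general SQL parser."""
    select_clause = sql.split('FROM')[0]
    return {
        # (table, alias, left_key, right_key) for each JOIN ... ON clause
        'joins': re.findall(r'JOIN\s+(\w+)\s+(\w+)\s+ON\s+([\w.]+)\s*=\s*([\w.]+)', sql),
        # (function, optional DISTINCT, column) from the SELECT list only
        'aggregations': re.findall(r'(AVG|COUNT)\s*\(\s*(DISTINCT\s+)?([\w.]+)\s*\)',
                                   select_clause, re.I),
        'group_by': re.search(r'GROUP BY\s+(.+?)\s*ORDER BY', sql, re.S).group(1).split(', '),
        'order_by': re.search(r'ORDER BY\s+(.+?);', sql, re.S).group(1),
    }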

Dataset

Each table includes 10 extra "noise" string columns. These simulate real-world data bloat and force you to prove your column pruning actually works. Ignoring the noise columns during joins should be one of your biggest I/O wins.

Songs: song_id (int64), title, plus 10 padding columns

Listens: listen_id, song_id, user_id (int64) plus 10 padding columns

Users: user_id, age (int64) plus 10 padding columns

Why This Project Matters

Library Policy

Core rule: you must implement the partitioning logic, hash-table probe, and merge-intersection yourself. Using pd.merge() as a black box for the primary join is an automatic failure of the technical objective.

You CAN use
  • Pandas DataFrames as data containers only.
  • NumPy arrays for efficient array ops.
  • PyArrow for Parquet I/O only.
  • Python dicts for hash tables.
  • Python sorting (sorted(), .sort()).
  • pd.merge() only as the final merge step inside your own partitions.
  • Basic Python loops and list comprehensions.
You CANNOT use
  • pd.merge() as your entire join algorithm.
  • Polars. Has built-in columnar joins.
  • DuckDB. Complete SQL engine.
  • SQLite or PostgreSQL. Any SQL database.
  • Apache Arrow compute. Has join operations.
  • Any library that replaces your algorithm design.

Rubric (100 pts, plus up to 10 bonus)

Section 1: Parquet Storage 15 pts

Section 2: SQL Hand Parsing 10 pts

Section 3: Join Algorithms + Aggregation 20 pts

Flow: Songs JOIN Listens → Result1 JOIN Users → Final → GROUP BY → ORDER BY. Aggregation happens after both joins complete. You may emit intermediate Parquet files between stages.
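As a sketch of that staging (your_join and aggregate_and_order stand in for your own Section 3 code; column pruning shown for context):

import pyarrow.parquet as pq

# Stage 1: Songs JOIN Listens (prune to the columns the query touches)
songs = pq.read_table('songs.parquet', columns=['song_id', 'title']).to_pandas()
listens = pq.read_table('listens.parquet', columns=['song_id', 'user_id']).to_pandas()
stage1 = your_join(songs, listens, key='song_id')   # your HPJ or SMJ, not pd.merge()
stage1.to_parquet('stage1.parquet')                 # optional intermediate file
del songs, listens

# Stage 2: Result1 JOIN Users, then GROUP BY / ORDER BY on the final result
users = pq.read_table('users.parquet', columns=['user_id', 'age']).to_pandas()
final = aggregate_and_order(your_join(stage1, users, key='user_id'))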

Section 4: Query Planning & Optimization 20 pts

Section 5: Performance Benchmarking 20 pts

Section 6: Writeup, Reflection & AI Takeaways 15 pts

A short reflection at the end of your notebook. Keep it informal. We're looking for evidence that you understood what you built and used AI as a collaborator.

Base Grading Model

Quality work gets 95/100: complete all sections with working code and clear comments. This is a high A and what most students should target.

Above 95 is rare. It's reserved for genuine engineering insight: a benchmark result that changes how you'd design the engine, or an optimization that measurably moves the numbers. We don't reward more lines of code for their own sake.

Required Section Headers

Use these exact headers in your Colab notebook:

# Section 0: Data Generation
# Section 1: Parquet Storage Analysis
# Section 2: SQL Parsing
# Section 3: Join Algorithms Implementation & Aggregation
# Section 4: Query Planning & Optimization
# Section 5: Performance Benchmarking
# Section 6: Writeup, Reflection & AI Takeaways

Adjustments +5 max (top 10%)

  • +3 Value. Measurable speedups vs. a stated baseline (numbers released on Ed by Nov 1).
  • +2 Clarity. Crisp, compelling insights with clear diagrams and concise explanations.

Bonus: World Domination Tier Up to +10

  • Top 3 Performance (+5 pts). Place among the class's three fastest average query times on 1 GB compressed. Must work in free Colab (12 GB RAM).
  • Infinite Scalable Storage (+5 pts). Handle 10 GB+ compressed in free Colab. Correctness matters, not speed. Document your approach.

System Architecture & Implementation Pipeline

Two phases: data preparation (Sections 0 and 1) and query execution (Sections 2 through 5). The key intuition: read metadata first, then load only the columns you need, then pick HPJ vs. SMJ based on what fits in memory.

Phase 1: Data Preparation (Sections 0 and 1)

Phase 1 pipeline: Section 0 generates Songs, Users, and Listens tables. Parquet compression produces songs.parquet (about 5 MB), listens.parquet (about 75 MB), and users.parquet (about 20 MB) for Section 1 to analyze.

Phase 2: Memory-Aware Execution

On 1 GB compressed, naive pd.read_parquet('listens.parquet') loads all 13 columns and blows up to about 8 GB in RAM. Crash likely. Columnar pruning with pq.read_table(path, columns=['song_id', 'user_id']) keeps you under 1 GB.

# For 1GB compressed total:
listens.parquet: 750MB compressed → load only (song_id, user_id) → ~150MB in memory
songs.parquet:    50MB compressed → load only (song_id)          → ~20MB in memory
users.parquet:   200MB compressed → load only (user_id, age)     → ~80MB in memory

# Total active memory: ~250MB vs 8GB+ if loading everything!

Additional strategies: save intermediate Parquet files between stages; del dataframe + gc.collect() after each join; start with smaller tables to minimize intermediate results; batch-process aggregations if needed.
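As a sketch of the batch-processing idea ('stage2.parquet' stands in for your intermediate join result):

import pyarrow.parquet as pq

# Stream the joined result in row batches instead of loading it whole.
groups = {}
for batch in pq.ParquetFile('stage2.parquet').iter_batches(batch_size=100_000):
    chunk = batch.to_pandas()
    for key, part in chunk.groupby(['song_id', 'title']):
        g = groups.setdefault(key, {'age_sum': 0.0, 'n': 0, 'users': set()})
        g['age_sum'] += part['age'].sum()
        g['n'] += len(part)
        g['users'].update(part['user_id'])
# Per group: AVG(age) = age_sum / n, COUNT(DISTINCT user_id) = len(users)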

Frequently Asked Questions

Getting Started & Proposal

Q: What does the Oct 30 proposal look like?

Short. Aim for one page total, as a markdown cell at the top of your project notebook. Cover:

  • Which join algorithm you implemented first (HPJ or SMJ) and a 10-row correctness check.
  • Your memory strategy for the 1 GB benchmark, given the 12 GB free-Colab envelope.
  • Any bonus tier you're planning to pursue (top 3 performance, or 10 GB+ scale).
  • If partnering: who owns which section.

The proposal is a checkpoint, not a separately graded deliverable. We'll give feedback so you can catch scope problems before you've sunk too many hours in.

Parquet & PyArrow

Q: What is Parquet and why use it?

Parquet is a columnar storage format with excellent compression and query performance. See the Storage Layout lesson for the theory. Practical benefits for this project:

  • Compress repeated values efficiently (e.g., the same artist across songs).
  • Read only needed columns. SELECT song_id doesn't touch the other 12.
  • Built-in statistics for query optimization.
  • Row groups allow chunked processing.

Q: How do I work with Parquet using PyArrow?

import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd

# Write with compression
df.to_parquet('songs.parquet',
              compression='snappy',     # or 'gzip', 'lz4'
              row_group_size=50000)

# Read only specific columns (columnar advantage!)
df = pd.read_parquet('songs.parquet', columns=['song_id', 'title'])

# Inspect metadata WITHOUT reading data
parquet_file = pq.ParquetFile('songs.parquet')
print(f"Num row groups: {parquet_file.num_row_groups}")
print(f"Schema: {parquet_file.schema}")
print(f"Metadata: {parquet_file.metadata}")

# Read in chunks for large files
for batch in parquet_file.iter_batches(batch_size=10000):
    df_chunk = batch.to_pandas()
    # Process chunk

Q: Which compression should I use, and how do I check my dataset sizes?

Use Snappy. Best balance of speed and compression for this project. Gzip compresses more but is slow; LZ4 is slightly faster but weaker.

Check sizes on disk: ls -lh *.parquet or os.path.getsize('file.parquet'). For the 100 MB benchmark, total parquet should be about 100 MB. Listens should be about 75% of the total (it's the activity log). Parquet typically achieves 3 to 5 times compression vs. CSV.

Q: How do I handle the 10 noise columns?

Store them in your Parquet files (they simulate real-world data bloat), but ignore them during joins. Only join on IDs. Use the noise columns to measure the impact of columnar pruning.
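One way to quantify that impact (a sketch; timed_read is a hypothetical helper and your numbers will differ):

import time
import pandas as pd

def timed_read(path, columns=None):
    """Time a Parquet read and report the resulting in-memory footprint in MB."""
    start = time.time()
    df = pd.read_parquet(path, columns=columns)
    return time.time() - start, df.memory_usage(deep=True).sum() / 1e6

full_s, full_mb = timed_read('listens.parquet')            # all 13 columns, noise included
pruned_s, pruned_mb = timed_read('listens.parquet', columns=['song_id', 'user_id'])
print(f"full: {full_s:.2f}s / {full_mb:.0f} MB  pruned: {pruned_s:.2f}s / {pruned_mb:.0f} MB")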

The pd.merge() Rule

Q: Why can't I just use pd.merge()?

This is a systems class. The objective is to understand the mechanics of hash tables and sort-merge algorithms. pd.merge() is a high-level black box that bypasses the learning objective. You must implement the partitioning logic, hash-table probe, and merge-intersection yourself.

Allowed
# Your partitioning + pd.merge() on each partition
partitions = my_hash_partition(left, right, key)
results = []
for left_part, right_part in partitions:
    results.append(pd.merge(left_part, right_part, on=key))
result = pd.concat(results, ignore_index=True)

Forbidden
# pd.merge() as your entire algorithm
return pd.merge(left_df, right_df, on=key)

Technical Details

Q: How should my query planner choose between algorithms?

def choose_algorithm(table1_size, table2_size, available_memory):
    smaller = min(table1_size, table2_size)
    # Pandas needs ~5x file size in RAM
    if smaller * 5 < available_memory:
        return "HPJ"   # hash join is faster when it fits
    else:
        return "SMJ"   # sort-merge for limited memory

# Also consider:
#   - Data already sorted?          → SMJ
#   - Need sorted output for GROUP BY?  → SMJ
#   - Very skewed join keys?       → SMJ handles better
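To feed choose_algorithm without loading any data, estimate sizes from the Parquet footer. A sketch (the 8-bytes-per-int64 figure is an assumption; string columns would need per-column statistics):

import pyarrow.parquet as pq

def pruned_size_bytes(path, n_int64_cols):
    """Estimate the in-memory size of the needed int64 columns from metadata alone."""
    meta = pq.ParquetFile(path).metadata      # reads only the footer, not the data
    return meta.num_rows * 8 * n_int64_cols

listens_bytes = pruned_size_bytes('listens.parquet', 2)   # song_id + user_id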

Q: How do I implement GROUP BY and COUNT DISTINCT after the joins?

# Option 1: pandas groupby (simple)
result = joined_df.groupby(['song_id', 'title']).agg(
    avg_age=('age', 'mean'),
    distinct_users=('user_id', 'nunique'),
)

# Option 2: manual (shows understanding)
groups = {}
for row in joined_data:
    key = (row['song_id'], row['title'])
    if key not in groups:
        groups[key] = {'ages': [], 'users': set()}
    groups[key]['ages'].append(row['age'])
    groups[key]['users'].add(row['user_id'])

# Finalize: AVG(age) and COUNT(DISTINCT user_id) per group
final = {key: (sum(g['ages']) / len(g['ages']), len(g['users']))
         for key, g in groups.items()}

COUNT DISTINCT approaches: Python set() (simple), sort + count transitions (works with limited memory), or HyperLogLog for approximate results on big data.
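The sort-based variant is a few lines of logic (a sketch; works on any already-sorted sequence):

def count_distinct_sorted(sorted_values):
    """Count distinct values in a sorted sequence by counting value transitions."""
    count, prev = 0, object()          # sentinel that never equals real data
    for v in sorted_values:
        if v != prev:
            count, prev = count + 1, v
    return count

assert count_distinct_sorted([1, 1, 2, 5, 5, 5]) == 3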

Q: When should I use intermediate Parquet files?

  • After sorting: save sorted tables to Parquet so a query retry doesn't re-sort.
  • After partitioning: save each partition as a separate file.
  • After joins: save the intermediate join result before aggregation.

Benchmarking & Common Pitfalls

Q: What metrics should I report?

  • Execution time: total query time, via time.time() or %%time.
  • I/O time: Parquet read/write, measured separately from compute.
  • Memory: peak consumption, via memory_profiler or psutil.
  • Compression ratio: Parquet vs. raw, from file sizes on disk.
  • Join selectivity: output rows / input rows, counted at each stage.
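A minimal harness covering warm-up, averaging, and memory (benchmark is a hypothetical name; assumes psutil is available, as it is in Colab):

import gc
import time
import psutil

def benchmark(run_query, n_runs=3):
    """Average wall time over n_runs after one warm-up run; report current RSS in MB."""
    run_query()                        # warm-up: populate the file-system cache
    times = []
    for _ in range(n_runs):
        gc.collect()
        start = time.time()
        run_query()
        times.append(time.time() - start)
    rss_mb = psutil.Process().memory_info().rss / 1e6
    return sum(times) / len(times), rss_mb

Note that RSS at the end of a run is not a true peak; use memory_profiler if you need peak consumption.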

Q: What are the most common mistakes, and how do I avoid them?

  • Using pd.merge() as the whole join (or using Polars). Implement the partitioning, hash table, and merge yourself; use pd.merge() only inside your own partitions.
  • Assuming keys appear once (1:1 join). Handle many-to-many correctly; test with duplicate keys in both tables.
  • Loading full Parquet and then selecting columns. Analyze metadata first (pq.ParquetFile(path).metadata), then load only the columns you need.
  • Implementing joins in Section 4 (planning). Section 3 is joins + aggregation; Section 4 is planning and algorithm selection.
  • Comparing HPJ and SMJ on different queries with no warm-up. Run the same query with both algorithms, warm the file-system cache, and average multiple runs.

Q: How do I debug my join algorithms?

  1. Start small. Test with 10-row tables first.
  2. Print intermediates. Hash partitions, sorted data, merge pointers.
  3. Ground truth. Compare with pd.merge() for correctness (see the sketch after this list). You just can't ship pd.merge() as your algorithm.
  4. Edge cases. Empty tables, no matching keys, all keys match, duplicates, NULLs.
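For step 3, a ground-truth check might look like this (a sketch; my_join is your implementation):

import pandas as pd

def check_join(my_join, left, right, key):
    """Compare a hand-written join against pd.merge() on small inputs."""
    expected = pd.merge(left, right, on=key)
    cols = sorted(expected.columns)
    mine = my_join(left, right, key)[cols].sort_values(cols).reset_index(drop=True)
    expected = expected[cols].sort_values(cols).reset_index(drop=True)
    pd.testing.assert_frame_equal(mine, expected)   # sorting makes row order irrelevant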

AI Takeaways & Verification

Q: How much AI is too much?

Use AI freely for Parquet and PyArrow semantics, benchmarking harnesses, performance analysis scaffolding, and identifying bottlenecks after you've profiled. The real bar is whether you can talk through your own engine in plain English.

Not OK: AI-written join algorithms or query planning. HPJ and SMJ were covered in detail in class. Implement them from your lectures, not from Claude. If you can't explain your partitioning strategy or merge-intersection without pulling up a chat window, you used too much. See the AI Policy for the full rule.

Q: What should the AI Takeaways section look like?

Short and informal. Think half a page, not a report. You need 2 or 3 anecdotes (a few sentences each) and one plain-English walkthrough of a piece of your engine.

Good anecdotes capture a moment. A few shapes that tend to work:

  • "AI suggested using a single giant hash table for HPJ. That would have blown memory on 1 GB. I switched to per-partition hash tables after looking at my memory profile."
  • "AI wrote a benchmarking harness that measured wall time. I added a cold-cache warm-up because my first few runs were 3× faster than the later ones."
  • "AI recommended sorting with Python's sorted(). For the 1 GB case I had to switch to chunked external sort because the full table didn't fit in RAM."

For the walkthrough, pick one piece of your engine (HPJ, SMJ, the planner, or the aggregation). Write 2 or 3 sentences saying what it does in plain English. Aim for an explanation a reader who hasn't taken the class could still follow.

Q: What should I double-check when debugging?

  • Join correctness. Compare with pd.merge() on small datasets.
  • Edge cases. Empty tables, no matches, all matches, NULLs.
  • Scaling. Time should scale appropriately with data size.
  • Duplicate handling. Many-to-many joins should produce the correct cartesian product.

Submission

Proposal due Oct 30. Final notebook due Nov 20. Submit both to Gradescope.

Honor Code

You must follow the Stanford Honor Code. Original engineering is required.