Project 2: NanoQuery Columnar Mogrifier

CS145: Stanford University, Fall 2025 - Systems Track

Technical Memo: Building a Columnar Execution Engine

Subject: Dear builders, Need Columnar DB, pronto!

Dr. Gru here with a plan so advanced, it'll make those big tech companies look like they're still using stone tablets and chisels! We're going to build our own Parquet-based columnar database system – one so efficient, so blazingly fast, that it'll make rows beg for mercy!

⚡ Mission: Build a columnar database that processes queries faster than a freeze ray!

Core Engineering Pillars

1. Columnar Layout (Parquet)

Master the Parquet format's compression (Snappy) and encoding. You'll build a system that reads specific columns directly from disk, drastically reducing I/O overhead.
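For example, a column-pruned read with PyArrow might look like the following minimal sketch (the file name assumes the dataset you generate in Section 0):

import pyarrow.parquet as pq

# Read a single column; the other columns' data pages are never touched on disk.
song_ids = pq.read_table('listens.parquet', columns=['song_id'])
print(song_ids.num_rows, song_ids.nbytes)  # rows read and approximate in-memory bytes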

2. Join Algorithm Design

Implement high-fidelity Hash-Partition and Sort-Merge Joins. You must manage multi-gigabyte datasets within a strict 12GB memory envelope.

3. Query Planning

Plan and optimize queries. Planning is the difference between success and a MemoryError.

Project Scope

Required Resources (see Ed for links)

Systems Work-Back Schedule

Phase 1: Storage Layer (W1-2)

~5 hours

Parquet I/O analysis, metadata extraction, and basic hand-parsing implementation.

Phase 2: Join Design (W3)

~12 hours

Core algorithm implementation: Hash-Partition Join and Sort-Merge Join logic.

Phase 3: Query Planning & Execution (W4)

~12 hours

Building the query optimizer, column pruning, and memory-constrained execution.

Phase 4: Benchmarking (W5)

~8 hours

I/O cost analysis, performance profiling, and final optimization pass.

Expected Total: 35-40 hours (approx. 7-8 hours/week).

The Target Analytical Query

Your engine must be capable of parsing, optimizing, and executing the following three-table join. Your implementation should favor algorithmic efficiency over brute-force loading.

SELECT s.song_id, AVG(u.age) AS avg_age,
       COUNT(DISTINCT l.user_id)
FROM Songs s
JOIN Listens l ON s.song_id = l.song_id
JOIN Users u ON l.user_id = u.user_id
GROUP BY s.song_id, s.title
ORDER BY COUNT(DISTINCT l.user_id) DESC, s.song_id;

Dataset

Your engine will operate on the following schema. Each table includes 10 additional "noise" columns (strings) to simulate real-world data bloat and to test your columnar pruning capabilities.

Songs: song_id (int64), title (string) + 10 padding columns

Listens: listen_id (int64), song_id (int64), user_id (int64) + 10 padding columns

Users: user_id (int64), age (int64) + 10 padding columns

Why This Project Matters

The 12GB Constraint: Engineering Under Pressure

This project is designed for Free Colab! The 12GB memory constraint teaches you real-world optimization techniques used by columnar databases like BigQuery and Snowflake.

Both 100MB and 1GB compressed datasets are completely achievable in Free Colab with smart columnar techniques:

Smart Memory Management for 1GB Dataset

❌ Naive Loading

pd.read_parquet('listens.parquet')

Loads all 13 columns (including 10 noise columns). Result: ~8GB RAM consumption for one table. System crash likely.

✅ Columnar Pruning

pq.read_table(..., columns=['song_id'])

Loads only the requested column's data pages, skipping the other 12 columns. Result: < 1 GB RAM consumption.

Memory Calculation Example:

# For 1GB compressed total:
listens.parquet: 750MB compressed → Load only (song_id, user_id) → ~150MB in memory
songs.parquet:    50MB compressed → Load only (song_id) → ~20MB in memory  
users.parquet:   200MB compressed → Load only (user_id, age) → ~80MB in memory

# Total active memory: ~250MB vs 5GB if loading everything!
# That's a 95% reduction through columnar optimization!
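If you want to sanity-check these estimates before loading anything, the Parquet footer already stores per-column sizes. A minimal sketch (file name and column set follow the example above):

import pyarrow.parquet as pq

def estimate_column_bytes(path, columns):
  # Sum the uncompressed sizes of the requested columns across all row groups
  meta = pq.ParquetFile(path).metadata
  total = 0
  for rg in range(meta.num_row_groups):
    row_group = meta.row_group(rg)
    for col in range(row_group.num_columns):
      chunk = row_group.column(col)
      if chunk.path_in_schema in columns:
        total += chunk.total_uncompressed_size
  return total

print(estimate_column_bytes('listens.parquet', {'song_id', 'user_id'}) / 1e6, 'MB')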

Additional Strategies

Remember: The constraint is the lesson! Working within 12GB teaches you techniques that scale to petabyte-scale systems.

Engineering Constraints & Library Policy

You CAN Use

  • Pandas DataFrames - as data containers only
  • NumPy arrays - for efficient array operations
  • PyArrow - for Parquet file I/O only
  • Python dictionaries - for hash tables
  • Python sorting - sorted(), .sort()
  • pd.merge() - for final merge step only (see FAQ)
  • Basic Python - loops, list comprehensions

Use libraries for data structures and I/O

You CANNOT Use

  • pd.merge() as your entire join algorithm - implement the logic yourself
  • Polars - has built-in columnar joins
  • DuckDB - complete SQL engine
  • SQLite/PostgreSQL - any SQL database
  • Apache Arrow compute - has join operations
  • Any library that replaces your algorithm design

Core Rule: You must implement the partitioning logic, hash table probe, and merge-intersection yourself. Using pd.merge() as a black box for the primary join is an automatic failure of the technical objective.

System Architecture: Two Phases

Data Preparation: Sections 0-1

Section 0: Generate Test Data
  Songs: 10K-100K records | Users: 50K-500K records | Listens: 1M-10M records with foreign keys
  ↓ Automatic Parquet compression (Snappy) | 75-90% size reduction

Section 1: Store & Analyze Parquet Files
  songs.parquet   ~5MB   (10K rows × 11 cols)
  listens.parquet ~75MB  (1M rows × 13 cols)
  users.parquet   ~20MB  (50K rows × 12 cols)

⚙️ Query Execution Pipeline: Sections 2-5

Target query: SELECT s.song_id, AVG(u.age), COUNT(DISTINCT l.user_id) FROM Songs s JOIN Listens l ON ... JOIN Users u ON ...

Section 2: Parse SQL Query
  Extract tables, joins, and aggregations from the query text.

Analyze metadata first (don't load data yet!)
  On disk: songs.parquet (5MB, 10K rows × 11 cols) | listens.parquet (75MB, 1M rows × 13 cols) | users.parquet (20MB, 50K rows × 12 cols)
  In RAM if fully loaded: songs_df ~25MB | listens_df ~375MB | users_df ~100MB (uncompressed DataFrames)

Section 4: Query Planning
  Use metadata stats to decide: join order, HPJ vs SMJ, and which columns to read.

Section 3: Execute Join Algorithm
  Hash Join: 1. build hash table → 2. probe & match → 3. output joined_df
  Sort-Merge Join: 1. sort both tables → 2. merge sorted runs → 3. output joined_df

Aggregation (after joins): GROUP BY, AVG, COUNT DISTINCT

System Implementation Roadmap

Follow this execution order to maintain memory efficiency. Data generation is provided; your engineering begins at Section 1.

Section 0: Generate Test Data

Copy and run the complete data generation code from the Ed Colab (no implementation needed).

Section 1: Parquet Storage Analysis

Write the generated tables to Snappy-compressed Parquet and analyze what lands on disk: file sizes, compression ratios, row-group counts, and per-column metadata.
Section 2: The SQL Parser

Deconstruct the target query. Extract join keys, projection columns, and aggregation predicates.
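A minimal hand-parsing sketch, hardcoded to the shape of the target query (the regex patterns and the returned dictionary layout are illustrative assumptions, not a required interface):

import re

def parse_target_query(sql):
  plan = {}
  # JOIN <table> <alias> ON <left_col> = <right_col>
  plan['joins'] = re.findall(r'JOIN\s+(\w+)\s+(\w+)\s+ON\s+(\w+\.\w+)\s*=\s*(\w+\.\w+)', sql, re.IGNORECASE)
  # AVG(...) and COUNT(...) expressions (COUNT appears again in ORDER BY; dedupe as needed)
  plan['aggregates'] = re.findall(r'(AVG|COUNT)\s*\(([^)]*)\)', sql, re.IGNORECASE)
  # GROUP BY columns: everything between GROUP BY and ORDER BY
  plan['group_by'] = re.search(r'GROUP BY\s+(.*?)\s+ORDER BY', sql, re.IGNORECASE | re.DOTALL).group(1)
  return plan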

Section 3: Join Kernels (HPJ & SMJ)

The core of the engine. Implement the partitioning and merge logic from first principles. Handle many-to-many relationships correctly.
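To make the idea concrete, here is a minimal sketch of the hash-partition pattern: partition both inputs on the join key, then build a plain Python dict per partition and probe it. It assumes integer keys (true for this schema), handles duplicate keys on both sides, and is deliberately unoptimized; it is a sketch, not a drop-in solution.

import pandas as pd

def hash_partition_join(left, right, key, num_partitions=8):
  out = []
  for p in range(num_partitions):
    lpart = left[left[key] % num_partitions == p]
    rpart = right[right[key] % num_partitions == p]
    # Build: key -> list of row positions (lists handle duplicate keys)
    table = {}
    for pos, k in enumerate(lpart[key].to_numpy()):
      table.setdefault(k, []).append(pos)
    # Probe: emit one output row per (left, right) match, so many-to-many works
    for pos, k in enumerate(rpart[key].to_numpy()):
      for lpos in table.get(k, []):
        out.append({**lpart.iloc[lpos].to_dict(), **rpart.iloc[pos].to_dict()})
  return pd.DataFrame(out)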

Section 4: The Optimizer

Analyze Parquet metadata (row counts, file sizes) to decide join order and algorithm selection (e.g., HPJ if building the hash table fits in memory, else SMJ).
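A sketch of what metadata-driven planning can look like (file names are assumptions; the 5x blow-up factor mirrors the cost model in the FAQ below):

import os
import pyarrow.parquet as pq

def table_stats(path):
  meta = pq.ParquetFile(path).metadata
  return {'rows': meta.num_rows, 'bytes_on_disk': os.path.getsize(path)}

stats = {name: table_stats(f'{name}.parquet') for name in ('songs', 'listens', 'users')}

# Pick the smaller dimension table as the hash-table build side; fall back to SMJ
# if even that side is unlikely to fit in RAM after pandas overhead.
build_side = min('songs', 'users', key=lambda t: stats[t]['bytes_on_disk'])
algorithm = 'HPJ' if stats[build_side]['bytes_on_disk'] * 5 < 12e9 else 'SMJ'
print(stats, build_side, algorithm)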

Section 5: Performance Benchmarking

import time

# Benchmark at the 100MB and 1GB scales.
# Use time.time() or time.perf_counter() to time each query stage and record the results.

💡 Key Implementation Points

See FAQs below for: Parquet compression options, dataset generation, performance metrics, query optimization strategies, and more implementation details.

Rubric (100 pts + 10 bonus)

Section 1: Parquet-based Columnar Storage 20 points

→ See Architecture: Data Generation & Storage (Phase 1)

Section 2: SQL Hand Parsing 15 points

→ See Architecture: Query Processing (Phase 2)

Section 3: Join Algorithms Implementation & Aggregation 20 points

→ This section: Implement the execution operators (HOW to join) and aggregation after joins
Important: Aggregation happens AFTER both joins complete. The flow is: Songs JOIN Listens → Result1 JOIN Users → Final Join Result → GROUP BY → ORDER BY
Tip: You may write intermediate results to a single Parquet file or to multiple smaller ones. For joins, you can hold data in memory after reading the Parquet files (we're working with GBs, not TBs).

Section 4: Query Planning & Optimization 20 points

→ This section: Use Parquet metadata to plan BEFORE loading data

Section 5: Performance Benchmarking 20 points

→ See Implementation Pipeline: Section 4 for planning, Section 0 for data generation

Base Grading Model

Quality work gets 95/100: Complete all sections with working code and clear comments. This is a high A and what most students should target.

Higher credit requires exceptional work: Only pursue this if you're genuinely excited about data systems. The extra tens of hours won't significantly impact your grade, but they will deepen your data systems skills and improve the quality of your artifact for technical interviews.

Readability & Organization Up to -15 points

  • Have clearly marked sections in your Colab notebook
  • Use these exact section headers:
    # Section 0: Data Generation
    # Section 1: Parquet Storage Analysis
    # Section 2: SQL Parsing
    # Section 3: Join Algorithms Implementation & Aggregation
    # Section 4: Query Planning & Optimization
    # Section 5: Performance Benchmarking
  • Specific deductions:
    • -5 points: Missing one or more required sections
    • -5 points: Poor code quality (no comments, unclear variable names, messy structure)
    • -5 points: Missing or unclear output (benchmark results not visible, no performance tables)

Adjustments +5 points max (top 10% of projects)

  • +3 Value: Demonstrates measurable speedups vs a stated baseline (numbers released on Ed by Nov 1st)
  • +2 Clarity: Crisp, compelling insights with clear diagrams and concise explanations

BONUS: World Domination Tier 10 points

  • Top 3 Performance (5 points): Achieve top 3 query performance in the class for the three-table join on 1GB compressed Parquet files
    • Fastest average query time wins
    • Must work in Free Colab (12GB RAM)
  • Infinitely Scalable Storage (5 points): Handle 10GB+ compressed datasets in Free Colab
    • Hint: Think about how distributed systems handle data that doesn't fit in memory
    • Document your approach and show it works
    • Performance doesn't need to be fast, just correct

Frequently Asked Questions

Parquet Format & Implementation

Q: What is Parquet and why use it?

Parquet is a columnar storage format that provides excellent compression and query performance. For columnar storage concepts, see the Storage Layout lesson.

Key Parquet benefits for this project:

  • Compress repeated values efficiently (e.g., same artist across songs)
  • Read only needed columns (SELECT song_id doesn't touch other columns)
  • Built-in statistics for query optimization
  • Row groups allow chunked processing

Q: How do I know if my dataset sizes are correct?

All sizes refer to compressed Parquet files on disk:

  • 100MB total: Check that songs.parquet + users.parquet + listens.parquet ≈ 100MB
  • 1GB total: Check that all three Parquet files sum to ≈ 1GB
  • Use os.path.getsize('file.parquet') or ls -lh *.parquet to verify (see the sketch after this list)
  • Listens should be ~75% of your total data (it's the activity log)
  • Parquet typically achieves 3-5x compression vs CSV
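A quick way to run that check in the notebook (file names as generated in Section 0):

import os

files = ['songs.parquet', 'users.parquet', 'listens.parquet']
for f in files:
  print(f, round(os.path.getsize(f) / 1e6, 1), 'MB')
print('total:', round(sum(os.path.getsize(f) for f in files) / 1e6, 1), 'MB')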

Q: How do I work with Parquet files using PyArrow?

# Essential Parquet operations with PyArrow
import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd

# 1. Write with compression options
df.to_parquet('songs.parquet',
              compression='snappy',   # or 'gzip', 'brotli', 'lz4'
              row_group_size=50000)   # rows per group

# 2. Read only specific columns (columnar advantage!)
columns = ['song_id', 'title']
df = pd.read_parquet('songs.parquet', columns=columns)

# 3. Inspect metadata without reading data
parquet_file = pq.ParquetFile('songs.parquet')
print(f"Num row groups: {parquet_file.num_row_groups}")
print(f"Schema: {parquet_file.schema}")
print(f"Metadata: {parquet_file.metadata}")

# 4. Read in chunks for large files
parquet_file = pq.ParquetFile('large_table.parquet')
for batch in parquet_file.iter_batches(batch_size=10000):
  df_chunk = batch.to_pandas()
  # Process chunk

Q: Which compression should I use?

Compression   Speed       Ratio    Best For
Snappy        Very fast   ~2-4x    Default, balanced
Gzip          Slow        ~4-8x    Max compression
LZ4           Fastest     ~2-3x    Speed critical

Recommendation: Use Snappy (default) for this project - good balance of speed and compression.
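If you want to verify the trade-off on your own data, one simple check is to write the same table with each codec and compare the resulting file sizes (df below stands for any of your generated tables, an assumption of this sketch):

import os

for codec in ('snappy', 'gzip', 'lz4'):
  path = f'songs_{codec}.parquet'
  df.to_parquet(path, compression=codec)  # df: one of your generated tables
  print(codec, round(os.path.getsize(path) / 1e6, 1), 'MB')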

Q: How should I handle the 10 additional string columns?

These columns simulate real-world data bloat. They should be:

  • Stored in your Parquet files to test compression
  • Ignored during join operations (only join on IDs)
  • Used to measure the impact of columnar storage

Q: Do I need to implement a full SQL parser?

No! "Hand parsing" means extracting components from the specific query provided. Hardcode the parsing for this query structure - extract joins, group by, aggregations, and order by.

Performance & Benchmarking

Q: When can I use pandas DataFrames in memory vs streaming?

Memory usage guidelines:

  • 100MB compressed: Can load all tables in memory (~500MB uncompressed)
  • 1GB compressed: Load one table at a time, stream the 750MB listens table if needed
  • 10GB+ (bonus): Must use chunked processing, never load full table

Rule of thumb: Pandas uses ~5-10x the file size in RAM. A 1GB Parquet file might need 5-10GB of RAM when loaded as a DataFrame.
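You can measure this directly rather than guessing. A small check (the path and column list are assumptions):

import os
import pandas as pd

path = 'listens.parquet'
df = pd.read_parquet(path, columns=['song_id', 'user_id'])
print('on disk  :', round(os.path.getsize(path) / 1e6), 'MB (all columns, compressed)')
print('in memory:', round(df.memory_usage(deep=True).sum() / 1e6), 'MB (two columns)')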

Q: What metrics should I report for performance analysis?

Metric             What to Measure               How to Measure
Execution Time     Total query time              time.time() or %%time in Jupyter
I/O Time           Parquet read/write time       Measure separately from compute
Memory Usage       Peak memory consumption       memory_profiler or psutil
Compression Ratio  Parquet vs raw size           File sizes on disk
Join Selectivity   Output rows / input rows      Count rows at each stage

Q: How can I optimize for the top 3 performance bonus?

Consider these optimizations:

  • Predicate pushdown: Filter early in Parquet reads (see the sketch after this list)
  • Column pruning: Only read needed columns
  • Join order optimization: Join smaller tables first
  • Partition elimination: Skip irrelevant partitions
  • Memory management: Use chunking for large operations
  • Parallel processing: Use multiprocessing for independent tasks
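As an example of the first two items, PyArrow lets you combine column pruning with a pushed-down filter in a single read call. The target query has no WHERE clause, so the age filter below is purely illustrative:

import pyarrow.parquet as pq

users = pq.read_table(
  'users.parquet',
  columns=['user_id', 'age'],     # column pruning
  filters=[('age', '>=', 18)],    # predicate pushdown (illustrative filter)
).to_pandas()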

Technical Details

Q: How should my query planner choose between algorithms?

Simple cost model approach:

def choose_algorithm(table1_size, table2_size, available_memory):
  # Check if smaller table fits in memory for HPJ
  smaller_size = min(table1_size, table2_size)
  if smaller_size * 5 < available_memory: # 5x for pandas overhead
    return "HPJ" # Hash join is faster
  else:
    return "SMJ" # Sort-merge for limited memory

# Also consider:
# - Is data already sorted? → SMJ
# - Need sorted output for GROUP BY? → SMJ
# - Very skewed join keys? → SMJ handles better

Q: When should I use intermediate Parquet files?

Use intermediate files when:

  • After sorting: Save sorted tables to Parquet for reuse
  • After partitioning: Save each partition as separate file
  • Join results: Save intermediate join results before aggregation

Example: For SMJ, save sorted tables as intermediate Parquet files so you don't re-sort if query fails.
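A small sketch of that caching pattern (paths and the column list are assumptions):

import os
import pandas as pd

def load_sorted(path, key, columns, cache_path):
  # Sort the pruned table once and cache it; later runs reuse the sorted file.
  if not os.path.exists(cache_path):
    df = pd.read_parquet(path, columns=columns)
    df.sort_values(key, kind='mergesort').to_parquet(cache_path, index=False)
  return pd.read_parquet(cache_path)

listens_sorted = load_sorted('listens.parquet', 'song_id',
                             ['song_id', 'user_id'], 'listens_sorted_by_song.parquet')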

Q: How do I implement GROUP BY after joins?

After your join completes:

# Option 1: Use pandas groupby (simple)
result = joined_df.groupby(['song_id', 'title']).agg({
  'age': 'mean',
  'user_id': lambda x: x.nunique()
})

# Option 2: Manual aggregation (shows understanding)
# joined_data below is the row-wise output of your join (e.g., a list of dicts)
groups = {}
for row in joined_data:
  key = (row['song_id'], row['title'])
  if key not in groups:
    groups[key] = {'ages': [], 'users': set()}
  groups[key]['ages'].append(row['age'])
  groups[key]['users'].add(row['user_id'])

Q: How do I handle COUNT(DISTINCT) efficiently?

Several approaches:

  • Hash Set: Use Python set() to track unique values
  • Sort-based: Sort, then count transitions between adjacent values (the first two approaches are sketched after this list)
  • Approximate: Consider HyperLogLog for big datasets
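A quick sketch of the hash-set and sort-based approaches on a plain list of user ids:

def count_distinct_hash(values):
  return len(set(values))  # hash-set approach

def count_distinct_sorted(values):
  # sort-based approach: count transitions between adjacent values
  ordered = sorted(values)
  return sum(1 for i, v in enumerate(ordered) if i == 0 or v != ordered[i - 1])

assert count_distinct_hash([1, 2, 2, 3]) == count_distinct_sorted([1, 2, 2, 3]) == 3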

Q: When can I use pd.merge()?

⚡ Rule: Use pd.merge() ONLY after you've implemented your own partitioning logic

✅ ALLOWED
# Your partitioning + pd.merge() for the final per-partition step
partitions = my_hash_partition(left_df, right_df, key)
results = []
for left_part, right_part in partitions:
  results.append(pd.merge(left_part, right_part, on=key))
result = pd.concat(results)

❌ FORBIDDEN
# pd.merge() as your entire algorithm
return pd.merge(left_df, right_df, on=key)

⚠️ Common Pitfalls to Avoid

Q: What are the most common mistakes students make?

1. Misusing pd.merge():

  • ❌ Using pd.merge() as your entire join algorithm
  • ❌ Using Polars (it has columnar joins built-in)
  • ✅ Use pd.merge() within your partitions after implementing partitioning logic
  • ✅ Implement hash partitioning, memory management, and algorithm design yourself

2. Not Handling Duplicate Keys in Joins:

  • ❌ Assuming each key appears once (1:1 join)
  • ✅ Handle many-to-many relationships correctly
  • ✅ Test with duplicate keys in both tables

3. Not Using Metadata Before Loading:

  • ❌ Loading full Parquet files THEN selecting columns
  • ❌ pd.read_parquet('1gb_table.parquet') without columns parameter
  • ✅ Analyze metadata FIRST: pq.ParquetFile(path).metadata
  • ✅ Plan which columns needed based on query
  • ✅ Load ONLY those columns: pd.read_parquet(file, columns=['id', 'name'])

4. Wrong Section Organization:

  • ❌ Implementing joins in Section 4 (planning)
  • ✅ Section 3: Join algorithms AND aggregation after joins
  • ✅ Section 4: Query planning and optimization decisions
  • ✅ Aggregation happens at the END of Section 3, after both joins complete

5. Incorrect Performance Comparisons:

  • ❌ Comparing HPJ and SMJ on different queries
  • ❌ Not warming up file system cache
  • ✅ Run same query with both algorithms
  • ✅ Average multiple runs to account for variance

Q: How do I debug my join algorithms?

Debugging strategy:

  1. Start small: Test with 10-row tables first
  2. Print intermediate results: Show hash partitions, sorted data
  3. Compare with pd.merge(): Use it as ground truth for correctness (see the harness sketched after this list)
  4. Check edge cases:
    • Empty tables
    • No matching keys
    • All keys match
    • Duplicate keys
    • NULL values in join columns
  5. Visualize: Draw your hash partitions or merge pointers
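For step 3, a small harness like the following can compare your join against pd.merge() on tiny inputs, ignoring row order (my_join stands for your own implementation and is an assumption of this sketch):

import pandas as pd

def same_result(mine, reference, sort_cols):
  mine = mine[reference.columns].sort_values(sort_cols).reset_index(drop=True)
  reference = reference.sort_values(sort_cols).reset_index(drop=True)
  return mine.equals(reference)

left = pd.DataFrame({'song_id': [1, 1, 2], 'title': ['a', 'a', 'b']})
right = pd.DataFrame({'song_id': [1, 1, 3], 'user_id': [10, 11, 12]})
expected = pd.merge(left, right, on='song_id')  # ground truth, duplicate keys included
print(same_result(my_join(left, right, 'song_id'), expected, ['song_id', 'user_id']))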

AI Usage & Verification

Q: How can I use AI for this systems project?

AI can help with specific technical tasks (see AI Policy):

  • Understanding the Parquet file format, columnar storage concepts, and Pandas DataFrames
  • Generating benchmarking code and performance analysis
  • Identifying performance bottlenecks in your implementation
  • DO NOT use AI for join algorithms and query planning - use class knowledge from lectures

Important: The join algorithms (HPJ and SMJ) were covered in detail in class. Implement these based on your lectures and understanding, not AI assistance.

Q: What should I verify when debugging my implementation?

Test these critical cases:

  • Join correctness: Compare output with pandas merge for small datasets
  • Edge cases: Empty tables, no matching keys, all matching keys
  • Performance scaling: Time should scale appropriately with data size
  • Duplicate handling: Many-to-many joins should produce correct cartesian products

Partnership Policy

Q: Can I work with a partner?

Yes! Teams of 2 are allowed, or you can work solo. See our comprehensive Partnership Policy for details on collaboration, shared repositories, Gradescope submission, late day policies, and conflict resolution.

Additional FAQs

Q: Why can't I use pd.merge() for everything?

This is a Systems class. The objective is to understand the mechanics of hash tables and sort-merge algorithms. pd.merge() is a high-level black box that bypasses the learning objectives.

Q: What should I submit?

Submit one .ipynb notebook to Gradescope. Important: Do not clear the output cells. We must see your execution times and final query results tables in the notebook itself.

Honor Code

You are expected to follow the Stanford Honor Code. Original engineering is required. Do not consult existing columnar engine source code (e.g., DuckDB, Polars) to copy algorithms. You may discuss high-level concepts with classmates, but all implementation must be yours.