Project 2: NanoQuery Columnar Mogrifier

CS145: Stanford University, Fall 2025 - Systems Track

Memo from Dr. Gru's Lair

Subject: Dear builders, Need Columnar DB, pronto!

Dr. Gru here with a plan so advanced, it'll make those big tech companies look like they're still using stone tablets and chisels! We're going to build our own Parquet-based columnar database system – one so efficient, so blazingly fast, that it'll make rows beg for mercy!

⚡ Mission: Build a columnar database that processes queries faster than a freeze ray!

🚀 The Three Pillars of Power

🗜️ Parquet Power: Compression Meets Speed

Parquet is the secret weapon of data wizards everywhere! I want a system that can compress data like a black hole, query it faster than light, and join tables so quickly it'll break the laws of physics!

Faster Than a Freeze Ray

Every nanosecond counts when you need to optimize that three-table join. We're going BIG - from 100MB to GIGABYTES of compressed data without breaking a sweat!

🔧 Cracking the SQL

Parse those queries, optimize them, make them bow before the might of our NanoQuery engine!

Project Scope

Required Resources (see Ed for links)

Rough Time Investment Guide

Phase | Hours | Focus
Week 1-2: Parquet & SQL Parsing | 2-5 | Parquet storage & analysis, hand-parsing (data generation already implemented)
Week 3: Joins & SQL | 10-12 | Hash Join, Sort-Merge Join implementation
Week 4: Query Planning & Execution | 10-12 | Query optimization, algorithm selection, memory management, query execution
Week 5: Benchmarking | 6-8 | Performance testing, analysis, final polishing

Total: 30-40 hours over 5 weeks = manageable 6-8 hours/week

Query of Fun and Three Tables of Terrificity

Your mission is to parse and optimize this query. Fail, and you might be a test subject for my latest invention: the "Mandatory Volunteer Motivator 9000"!

SELECT s.song_id, AVG(u.age) AS avg_age,
       COUNT(DISTINCT l.user_id)
FROM Songs s
JOIN Listens l ON s.song_id = l.song_id
JOIN Users u ON l.user_id = u.user_id
GROUP BY s.song_id, s.title
ORDER BY COUNT(DISTINCT l.user_id) DESC, s.song_id;

Table Schemas

Each table has the following columns, plus 10 additional string columns of k characters each:

Songs Table

song_id: int64 + 10 string columns

Listens Table

listen_id: int64, song_id: int64, user_id: int64 + 10 string columns

Users Table

user_id: int64, age: int64 + 10 string columns

Why This Project Matters

Working Within Free Colab (12GB RAM)

✅ This project is designed for Free Colab! The 12GB memory constraint teaches you real-world optimization techniques used by columnar databases like BigQuery and Snowflake.

Both 100MB and 1GB compressed datasets are completely achievable in Free Colab with smart columnar techniques:

Smart Memory Management for 1GB Dataset

🎯 The Key Insight: Column Pruning

The query only needs 2-3 columns per table, not all 11-13 columns!

❌ Naive Approach:
  • Load all columns: 1GB → 5-10GB in RAM
  • Out of memory error!
✅ Smart Approach:
  • Load only needed columns: 1GB → 1-2GB in RAM
  • Works perfectly in Free Colab!

Memory Calculation Example:

# For 1GB compressed total:
listens.parquet: 750MB compressed → Load only (song_id, user_id) → ~150MB in memory
songs.parquet:    50MB compressed → Load only (song_id) → ~20MB in memory  
users.parquet:   200MB compressed → Load only (user_id, age) → ~80MB in memory

# Total active memory: ~250MB vs 5GB if loading everything!
# That's a 95% reduction through columnar optimization!
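
A quick way to sanity-check these estimates in your own notebook is to compare the on-disk Parquet size with the memory actually held by a column-pruned DataFrame. This is only a measurement sketch; the file names and column lists assume this project's layout.

import os
import pandas as pd

def report_memory_savings(path, needed_columns):
    """Compare on-disk Parquet size vs. in-RAM size for a column-pruned load."""
    disk_mb = os.path.getsize(path) / 1e6
    df = pd.read_parquet(path, columns=needed_columns)      # load only what the query needs
    ram_mb = df.memory_usage(deep=True).sum() / 1e6         # actual bytes held by the DataFrame
    print(f"{path}: {disk_mb:.0f} MB on disk -> {ram_mb:.0f} MB in RAM for {needed_columns}")
    return df

# Example usage (adjust paths/columns to your setup):
# listens_df = report_memory_savings('listens.parquet', ['song_id', 'user_id'])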

Additional Strategies

⚠️ Remember: The constraint is the lesson! Working within 12GB teaches you techniques that scale to petabyte-scale systems.

Architecture & Implementation Guide

Library Usage Policy

✅ You CAN Use

  • Pandas DataFrames - as data containers only
  • NumPy arrays - for efficient array operations
  • PyArrow - for Parquet file I/O only
  • Python dictionaries - for hash tables
  • Python sorting - sorted(), .sort()
  • pd.merge() - for final merge step only (see FAQ)
  • Basic Python - loops, list comprehensions

Use libraries for data structures and I/O

❌ You CANNOT Use

  • pd.merge() as your entire join algorithm - implement the logic yourself
  • Polars - has built-in columnar joins
  • DuckDB - complete SQL engine
  • SQLite/PostgreSQL - any SQL database
  • Apache Arrow compute - has join operations
  • Any library that replaces your algorithm design

Build the algorithms yourself for learning

⚠️ The Rule

You must implement:

  • ✅ Hash partitioning logic
  • ✅ Memory management
  • ✅ Algorithm selection (HPJ vs SMJ)
  • ✅ Overall join strategy

See pd.merge() FAQ below for exact rules!

System Architecture: Two Phases

📊 Data Preparation: Sections 0-1

Section 0: Generate Test Data
  • Songs: 10K-100K records | Users: 50K-500K records | Listens: 1M-10M records with foreign keys
  • Automatic Parquet compression: Snappy, 75-90% size reduction

Section 1: Store & Analyze Parquet Files
  • songs.parquet: ~5MB (10K rows × 11 cols)
  • listens.parquet: ~75MB (1M rows × 13 cols)
  • users.parquet: ~20MB (50K rows × 12 cols)

⚙️ Query Execution Pipeline: Sections 2-5

Input query: SELECT s.song_id, AVG(u.age), COUNT(DISTINCT l.user_id) FROM Songs s JOIN Listens l ON ... JOIN Users u ON ...

Section 2: Parse SQL Query
  • Extract: tables, joins, aggregations

Section 4: Query Planning (analyze metadata, don't load data yet!)
  • Disk storage: songs.parquet (5MB, 10K rows × 11 columns), listens.parquet (75MB, 1M rows × 13 columns), users.parquet (20MB, 50K rows × 12 columns)
  • Use metadata stats to decide: join order, HPJ vs SMJ, which columns to load
  • Memory (RAM) after loading: songs_df (~25MB uncompressed), listens_df (~375MB), users_df (~100MB), each a DataFrame in RAM

Section 3: Execute Join Algorithm
  • Hash Join: 1. build hash table, 2. probe & match, 3. output joined_df
  • Sort-Merge Join: 1. sort both tables, 2. merge sorted tables, 3. output joined_df

Aggregation (after joins): GROUP BY, AVG, COUNT DISTINCT

Implementation Pipeline

Here's exactly how to build your NanoQuery system using Pandas and Parquet:

Section 0: Generate Test Data
# Copy and run the complete data generation code from Ed Colab (no implementation needed)

Section 1: Load Tables from Parquet
# Read Parquet files into Pandas DataFrames
import pandas as pd
import pyarrow.parquet as pq

songs_df = pd.read_parquet('songs.parquet')
listens_df = pd.read_parquet('listens.parquet', columns=['song_id', 'user_id'])  # Only needed columns
users_df = pd.read_parquet('users.parquet')
Section 2: Parse SQL Query
def parse_sql(query):
    """YOUR TASK: Extract tables, joins, and aggregations"""
    # Parse SQL string to identify:
    # - Tables involved
    # - Join conditions
    # - GROUP BY columns
    # - Aggregation functions
    pass  # Your implementation here
Section 3: Implement Join Algorithms
def hash_partition_join(left_df, right_df, join_key):
    """YOUR IMPLEMENTATION - Build hash table and probe"""
    # 1. Hash partition both tables
    # 2. Build hash table from smaller partition
    # 3. Probe with larger partition
    # 4. Return joined results
    return joined_df
    
def sort_merge_join(left_df, right_df, join_key):
    """YOUR IMPLEMENTATION - Sort and merge"""
    # 1. Sort both tables by join key
    # 2. Merge sorted sequences
    # 3. Handle duplicates
    return joined_df
Section 4: Query Planning & Optimization
import pyarrow.parquet as pq

def analyze_metadata_before_loading():
    """YOUR TASK: Get table statistics WITHOUT loading data
    
    Hints:
    - Use pq.ParquetFile() to access metadata
    - Extract: num_rows, column names, file sizes
    - DON'T use pd.read_parquet() here - that loads data!
    """
    # TODO: For each table ('songs', 'users', 'listens'):
    #   - Open the Parquet file (but don't load data)
    #   - Extract metadata like row count, columns, sizes
    #   - Store in a dictionary
    pass  # Your implementation here

def plan_query_execution(metadata, parsed_query):
    """YOUR TASK: Use metadata to make smart decisions
    
    Questions to answer:
    - Which table is smallest? Largest?
    - Will a hash table fit in memory?
    - Which columns does the query actually need?
    - What's the optimal join order?
    """
    # TODO: Based on metadata, decide:
    #   1. Join order (smallest first? or different strategy?)
    #   2. Algorithm choice (HPJ if fits in memory, else SMJ)
    #   3. Which columns to load for each table
    pass  # Your implementation here

# After planning, load ONLY what you need:
# Example (you implement the actual logic):
# columns_needed = ['song_id', 'artist']  # From your planning
# df = pd.read_parquet('songs.parquet', columns=columns_needed)
Section 4a: Save Intermediate Results (optional)
# For 1GB+ datasets, save intermediate joins to avoid recomputation
result1.to_parquet('intermediate_join1.parquet', compression='snappy')

# Can read back if memory constrained
del result1  # Free memory
result1 = pd.read_parquet('intermediate_join1.parquet')
Section 5: Performance Benchmarking
import time

# Benchmark at the 100MB and 1GB scales.
# Use time.time() (or time.perf_counter()) to measure the time taken by each query run, and record the results.
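
Below is a minimal sketch of how one benchmark run might be structured so that I/O time and join time are measured separately; run_join stands in for your own HPJ or SMJ pipeline and is not provided code.

import time
import pandas as pd

def benchmark(algorithm_name, run_join):
    """Time one query run, separating Parquet I/O from join/aggregation compute."""
    t0 = time.perf_counter()
    # I/O phase: load only the columns the query needs
    songs = pd.read_parquet('songs.parquet', columns=['song_id'])
    listens = pd.read_parquet('listens.parquet', columns=['song_id', 'user_id'])
    users = pd.read_parquet('users.parquet', columns=['user_id', 'age'])
    io_time = time.perf_counter() - t0

    t1 = time.perf_counter()
    result = run_join(songs, listens, users)   # your HPJ or SMJ pipeline goes here
    join_time = time.perf_counter() - t1

    print(f"{algorithm_name}: I/O {io_time:.2f}s, join {join_time:.2f}s, total {io_time + join_time:.2f}s")
    return result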

💡 Key Implementation Points

See FAQs below for: Parquet compression options, dataset generation, performance metrics, query optimization strategies, and more implementation details.

Rubric (100 pts + 10 bonus)

Section 1: Parquet-based Columnar Storage 20 points

→ See Architecture: Data Generation & Storage (Phase 1)
  • Implement Parquet-based storage for the tables
  • For simplicity, store all data for a table in a single Parquet file and use a single DataFrame object as a buffer
  • Analyze and report on:
    • Space efficiency compared to row storage
    • Compression ratios achieved with Parquet
    • Read/write performance characteristics

Section 2: SQL Hand Parsing 15 points

→ See Architecture: Query Processing (Phase 2)
  • What "hand parsing" means: Manually extract the components from the provided query (you don't need a general SQL parser, just handle this specific query)
  • Your code should identify and extract:
    • Join conditions
    • Group by columns
    • Aggregation functions
    • Order by requirements

Section 3: Join Algorithms Implementation & Aggregation 20 points

→ This section: Implement the execution operators (HOW to join) and aggregation after joins
  • Hash Partition Join (9 points):
    • Implement hash partitioning for data distribution
    • Build Hash Partition Join (HPJ) algorithm
    • Handle memory constraints and hash collisions
  • Sort-Merge Join (9 points):
    • Implement external sorting if needed
    • Build Sort-Merge Join (SMJ) algorithm
    • Properly handle duplicate keys in merge phase
  • Aggregation After Joins (2 points):
    • Implement GROUP BY on the joined result
    • Calculate AVG(u.age) for each group
    • Calculate COUNT(DISTINCT l.user_id) for each group
    • Apply ORDER BY to final results
Important: Aggregation happens AFTER both joins complete. The flow is: Songs JOIN Listens → Result1 JOIN Users → Final Join Result → GROUP BY → ORDER BY
Tip: You may write intermediate results to a single Parquet file or to several smaller ones. For joins, you can keep the data in memory after reading the Parquet files (we're working with GBs, not TBs).

Section 4: Query Planning & Optimization 20 points

→ This section: Use Parquet metadata to plan BEFORE loading data
  • Metadata Analysis (6 points):
    • Read Parquet metadata WITHOUT loading data
    • Extract row counts, column names, file sizes
    • Use statistics to inform decisions
  • Algorithm Selection (6 points):
    • Choose HPJ vs SMJ based on metadata stats
    • Consider memory requirements from file sizes
  • Query Optimization (8 points):
    • Optimize join order using row counts
    • Implement column pruning - load ONLY needed columns

Section 5: Performance Benchmarking 20 points

→ See Implementation Pipeline: Section 4 for planning, Section 0 for data generation
  • 100MB Benchmark (4 points):
    • Generate 100MB total compressed Parquet files
    • Run the query with both algorithms
    • Report execution times
  • 1GB Benchmark (8 points):
    • Scale to 1GB total compressed Parquet files
    • Compare algorithm performance
    • Analyze memory usage
  • Performance Analysis (8 points):
    • Detailed breakdown of where time is spent
    • I/O vs CPU analysis
    • Bottleneck identification and optimization suggestions

Base Grading Model

Quality work gets 95/100: Complete all sections with working code and clear comments. This is a high A and what most students should target.

Higher credit requires exceptional work: Only pursue it if you're genuinely excited about data systems. The extra tens of hours won't significantly change your grade, but they will deepen your data systems skills and improve the quality of your artifact for technical interviews.

Readability & Organization Up to -15 points

  • Have clearly marked sections in your Colab notebook
  • Use these exact section headers:
    # Section 0: Data Generation
    # Section 1: Parquet Storage Analysis
    # Section 2: SQL Parsing
    # Section 3: Join Algorithms Implementation & Aggregation
    # Section 4: Query Planning & Optimization
    # Section 5: Performance Benchmarking
  • Specific deductions:
    • -5 points: Missing one or more required sections
    • -5 points: Poor code quality (no comments, unclear variable names, messy structure)
    • -5 points: Missing or unclear output (benchmark results not visible, no performance tables)

Adjustments +5 points max (top 10% of projects)

  • +3 Value: Demonstrates measurable speedups vs a stated baseline (numbers released on Ed by Nov 1st)
  • +2 Clarity: Crisp, compelling insights with clear diagrams and concise explanations

BONUS: World Domination Tier 10 points

  • Top 3 Performance (5 points): Achieve top 3 query performance in the class for the three-table join on 1GB compressed Parquet files
    • Fastest average query time wins
    • Must work in Free Colab (12GB RAM)
  • Infinite Scalable Storage (5 points): Handle 10GB+ compressed datasets in Free Colab
    • Hint: Think about how distributed systems handle data that doesn't fit in memory
    • Document your approach and show it works
    • Performance doesn't need to be fast, just correct

Frequently Asked Questions

Parquet Format & Implementation

Q: What is Parquet and why use it?

Parquet is a columnar storage format that provides excellent compression and query performance. For columnar storage concepts, see Storage Layout lesson.

Key Parquet benefits for this project:

  • Compress repeated values efficiently (e.g., same artist across songs)
  • Read only needed columns (SELECT song_id doesn't touch other columns)
  • Built-in statistics for query optimization
  • Row groups allow chunked processing

Q: How do I know if my dataset sizes are correct?

All sizes refer to compressed Parquet files on disk:

  • 100MB total: Check that songs.parquet + users.parquet + listens.parquet ≈ 100MB
  • 1GB total: Check that all three Parquet files sum to ≈ 1GB
  • Use os.path.getsize('file.parquet') or ls -lh *.parquet to verify (a quick sketch follows this list)
  • Listens should be ~75% of your total data (it's the activity log)
  • Parquet typically achieves 3-5x compression vs CSV
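
For example, a few lines like the following (assuming the three files sit in the working directory) will confirm the total:

import os

files = ['songs.parquet', 'listens.parquet', 'users.parquet']
sizes_mb = {f: os.path.getsize(f) / 1e6 for f in files}
total_mb = sum(sizes_mb.values())

for f, mb in sizes_mb.items():
    print(f"{f}: {mb:.1f} MB ({mb / total_mb:.0%} of total)")   # Listens should dominate (~75%)
print(f"Total compressed size: {total_mb:.1f} MB")               # target ~100MB or ~1GB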

Q: How do I work with Parquet files using PyArrow?

# Essential Parquet operations with PyArrow
import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd

# 1. Write with compression options
df.to_parquet('songs.parquet',
              compression='snappy',    # or 'gzip', 'brotli', 'lz4'
              row_group_size=50000)    # rows per group

# 2. Read only specific columns (columnar advantage!)
columns = ['song_id', 'title']
df = pd.read_parquet('songs.parquet', columns=columns)

# 3. Inspect metadata without reading data
parquet_file = pq.ParquetFile('songs.parquet')
print(f"Num row groups: {parquet_file.num_row_groups}")
print(f"Schema: {parquet_file.schema}")
print(f"Metadata: {parquet_file.metadata}")

# 4. Read in chunks for large files
parquet_file = pq.ParquetFile('large_table.parquet')
for batch in parquet_file.iter_batches(batch_size=10000):
    df_chunk = batch.to_pandas()
    # Process chunk

Q: Which compression should I use?

Compression | Speed | Ratio | Best For
Snappy | Very Fast | ~2-4x | Default, balanced
Gzip | Slow | ~4-8x | Max compression
LZ4 | Fastest | ~2-3x | Speed critical

Recommendation: Use Snappy (default) for this project - good balance of speed and compression.

Q: How should I handle the 10 additional string columns?

These columns simulate real-world data bloat. They should be:

  • Stored in your Parquet files to test compression
  • Ignored during join operations (only join on IDs)
  • Used to measure the impact of columnar storage

Q: Do I need to implement a full SQL parser?

No! "Hand parsing" means extracting components from the specific query provided. Hardcode the parsing for this query structure - extract joins, group by, aggregations, and order by.
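
To make this concrete, here is one possible shape for the hand-parsed output, written out for the provided query. The dictionary layout and field names are purely illustrative, not something the rubric prescribes.

# Illustrative only: one possible shape for the hand-parsed query components
parsed_query = {
    'tables': {'s': 'Songs', 'l': 'Listens', 'u': 'Users'},
    'joins': [('Songs.song_id', 'Listens.song_id'),
              ('Listens.user_id', 'Users.user_id')],
    'group_by': ['song_id', 'title'],
    'aggregations': [('AVG', 'age'), ('COUNT_DISTINCT', 'user_id')],
    'order_by': [('COUNT_DISTINCT(user_id)', 'DESC'), ('song_id', 'ASC')],
}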

Performance & Benchmarking

Q: When can I use pandas DataFrames in memory vs streaming?

Memory usage guidelines:

  • 100MB compressed: Can load all tables in memory (~500MB uncompressed)
  • 1GB compressed: Load one table at a time, stream the 750MB listens table if needed
  • 10GB+ (bonus): Must use chunked processing, never load full table

Rule of thumb: Pandas uses ~5-10x the file size in RAM. A 1GB Parquet file might need 5-10GB of RAM when loaded as a DataFrame.

Q: What metrics should I report for performance analysis?

Metric | What to Measure | How to Measure
Execution Time | Total query time | time.time() or %%time in Jupyter
I/O Time | Parquet read/write time | Measure separately from compute
Memory Usage | Peak memory consumption | memory_profiler or psutil
Compression Ratio | Parquet vs raw size | File sizes on disk
Join Selectivity | Output rows / input rows | Count rows at each stage
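
For memory, a rough snapshot-based approach is to read this process's resident set size before and after the join. This assumes psutil is available (it usually is in Colab); memory_profiler can give finer-grained peak measurements.

import psutil

process = psutil.Process()

def rss_gb():
    """Resident memory of the current Python process, in GB."""
    return process.memory_info().rss / 1e9

before = rss_gb()
# ... run your join here ...
after = rss_gb()
print(f"Memory before: {before:.2f} GB, after: {after:.2f} GB, delta: {after - before:.2f} GB")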

Q: How can I optimize for the top 3 performance bonus?

Consider these optimizations (a small read-path sketch follows the list):

  • Predicate pushdown: Filter early in Parquet reads
  • Column pruning: Only read needed columns
  • Join order optimization: Join smaller tables first
  • Partition elimination: Skip irrelevant partitions
  • Memory management: Use chunking for large operations
  • Parallel processing: Use multiprocessing for independent tasks
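
As a sketch of the first two ideas: pandas will push both the column list and (with the pyarrow engine) simple filters down into the Parquet reader. The age filter below is purely illustrative, since the project query has no WHERE clause.

import pandas as pd

users_df = pd.read_parquet(
    'users.parquet',
    columns=['user_id', 'age'],        # column pruning: skip the 10 filler string columns
    filters=[('age', '>=', 0)],        # predicate pushdown: row groups are skipped using Parquet statistics
)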

Technical Details

Q: How should my query planner choose between algorithms?

Simple cost model approach:

def choose_algorithm(table1_size, table2_size, available_memory):
  # Check if smaller table fits in memory for HPJ
  smaller_size = min(table1_size, table2_size)
  if smaller_size * 5 < available_memory: # 5x for pandas overhead
    return "HPJ" # Hash join is faster
  else:
    return "SMJ" # Sort-merge for limited memory

# Also consider:
# - Is data already sorted? → SMJ
# - Need sorted output for GROUP BY? → SMJ
# - Very skewed join keys? → SMJ handles better
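
One way to call a cost model like this, using on-disk Parquet sizes as a rough stand-in for table size and psutil (assumed available) for free memory:

import os
import psutil

available = psutil.virtual_memory().available
algo = choose_algorithm(os.path.getsize('songs.parquet'),
                        os.path.getsize('listens.parquet'),
                        available)
print(f"Planner chose: {algo}")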

Q: When should I use intermediate Parquet files?

Use intermediate files when:

  • After sorting: Save sorted tables to Parquet for reuse
  • After partitioning: Save each partition as separate file
  • Join results: Save intermediate join results before aggregation

Example: For SMJ, save the sorted tables as intermediate Parquet files so you don't need to re-sort if the query fails partway through.

Q: How do I implement GROUP BY after joins?

After your join completes:

# Option 1: Use pandas groupby (simple)
result = joined_df.groupby(['song_id', 'title']).agg({
  'age': 'mean',
  'user_id': lambda x: x.nunique()
})

# Option 2: Manual aggregation (shows understanding)
groups = {}
for row in joined_data:
  key = (row['song_id'], row['title'])
  if key not in groups:
    groups[key] = {'ages': [], 'users': set()}
  groups[key]['ages'].append(row['age'])
  groups[key]['users'].add(row['user_id'])
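
If you go the manual route, the groups dictionary still has to be turned into the final result. A sketch of that finishing step, matching the query's aggregates and ORDER BY, might look like:

# Finish the manual aggregation from Option 2 above
rows = []
for (song_id, title), g in groups.items():
    rows.append({
        'song_id': song_id,
        'avg_age': sum(g['ages']) / len(g['ages']),      # AVG(u.age)
        'count_distinct_users': len(g['users']),          # COUNT(DISTINCT l.user_id)
    })

# ORDER BY COUNT(DISTINCT l.user_id) DESC, s.song_id
rows.sort(key=lambda r: (-r['count_distinct_users'], r['song_id']))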

Q: How do I handle COUNT(DISTINCT) efficiently?

Several approaches:

  • Hash Set: Use Python set() to track unique values
  • Sort-based: Sort, then count transitions (sketched below)
  • Approximate: Consider HyperLogLog for big datasets
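
A minimal sketch of the sort-based idea (a new distinct value starts wherever the sorted array changes):

import numpy as np

def count_distinct_sorted(values):
    """Count distinct values by sorting and counting transitions."""
    arr = np.sort(np.asarray(values))
    if arr.size == 0:
        return 0
    return 1 + int(np.count_nonzero(arr[1:] != arr[:-1]))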

Q: When can I use pd.merge()?

⚡ Rule: Use pd.merge() ONLY after you've implemented your own partitioning logic

✅ ALLOWED
# Your partitioning + pd.merge() for final step
partitions = my_hash_partition(left_df, right_df, key)
for left_part, right_part in partitions:
  result = pd.merge(left_part, right_part, on=key)
✗ FORBIDDEN
# pd.merge() as your entire algorithm
return pd.merge(left_df, right_df, on=key)

⚠️ Common Pitfalls to Avoid

Q: What are the most common mistakes students make?

1. Misusing pd.merge():

  • ❌ Using pd.merge() as your entire join algorithm
  • ❌ Using Polars (it has columnar joins built-in)
  • ✅ Use pd.merge() within your partitions after implementing partitioning logic
  • ✅ Implement hash partitioning, memory management, and algorithm design yourself

2. Not Handling Duplicate Keys in Joins:

  • ❌ Assuming each key appears once (1:1 join)
  • ✅ Handle many-to-many relationships correctly
  • ✅ Test with duplicate keys in both tables

3. Not Using Metadata Before Loading:

  • ❌ Loading full Parquet files THEN selecting columns
  • ❌ pd.read_parquet('1gb_table.parquet') without columns parameter
  • ✅ Analyze metadata FIRST: pq.ParquetFile(path).metadata
  • ✅ Plan which columns needed based on query
  • ✅ Load ONLY those columns: pd.read_parquet(file, columns=['id', 'name'])

4. Wrong Section Organization:

  • ❌ Implementing joins in Section 4 (planning)
  • ✅ Section 3: Join algorithms AND aggregation after joins
  • ✅ Section 4: Query planning and optimization decisions
  • ✅ Aggregation happens at the END of Section 3, after both joins complete

5. Incorrect Performance Comparisons:

  • ❌ Comparing HPJ and SMJ on different queries
  • ❌ Not warming up file system cache
  • ✅ Run same query with both algorithms
  • ✅ Average multiple runs to account for variance

Q: How do I debug my join algorithms?

Debugging strategy:

  1. Start small: Test with 10-row tables first
  2. Print intermediate results: Show hash partitions, sorted data
  3. Compare with pd.merge(): Use as ground truth for correctness (see the sketch after this list)
  4. Check edge cases:
    • Empty tables
    • No matching keys
    • All keys match
    • Duplicate keys
    • NULL values in join columns
  5. Visualize: Draw your hash partitions or merge pointers
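
A small correctness-check sketch for step 3, assuming your join returns the same columns as pd.merge:

import pandas as pd

def check_join(my_join_fn, left, right, key):
    """Compare your join output against pd.merge on a small sample."""
    mine = my_join_fn(left, right, key)
    truth = pd.merge(left, right, on=key)
    cols = sorted(truth.columns)
    # Sort both results the same way so row order doesn't matter
    mine_sorted = mine[cols].sort_values(cols).reset_index(drop=True)
    truth_sorted = truth[cols].sort_values(cols).reset_index(drop=True)
    pd.testing.assert_frame_equal(mine_sorted, truth_sorted, check_dtype=False)
    print("Join output matches pd.merge")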

AI Usage & Verification

Q: How can I use AI for this systems project?

AI can help with specific technical tasks (see AI Policy):

  • Understanding the Parquet file format, columnar storage concepts, and Pandas DataFrames
  • Generating benchmarking code and performance analysis
  • Identifying performance bottlenecks in your implementation
  • DO NOT use AI for join algorithms and query planning - use class knowledge from lectures

Important: The join algorithms (HPJ and SMJ) were covered in detail in class. Implement these based on your lectures and understanding, not AI assistance.

Q: What should I verify when debugging my implementation?

Test these critical cases:

  • Join correctness: Compare output with pandas merge for small datasets
  • Edge cases: Empty tables, no matching keys, all matching keys
  • Performance scaling: Time should scale appropriately with data size
  • Duplicate handling: Many-to-many joins should produce correct cartesian products

Partnership Policy

Q: Can I work with a partner?

Yes! Teams of 2 are allowed, or you can work solo. See our comprehensive Partnership Policy for details on collaboration, shared repositories, Gradescope submission, late day policies, and conflict resolution.

📊 Expected Final Output Format

Your query must produce results in this exact format:

# Final Query Results
song_id | avg_age | count_distinct_users
--------|---------|--------------------
12345 | 28.50 | 1523
67890 | 31.25 | 1245
54321 | 26.75 | 987
... | ... | ...

Requirements:

Performance Results Table (Required in summary):

Dataset Size | Algorithm | Total Time (s) | I/O Time (s) | Join Time (s) | Peak Memory (GB)
100MB | HPJ | [Your time] | [Your time] | [Your time] | [Your memory]
100MB | SMJ | [Your time] | [Your time] | [Your time] | [Your memory]
1GB | HPJ | [Your time] | [Your time] | [Your time] | [Your memory]
1GB | SMJ | [Your time] | [Your time] | [Your time] | [Your memory]
Note: Both algorithms must produce identical results. Use assertion checks to verify correctness.

Submission Instructions

Submit to Gradescope:

Honor Code

As in all Stanford CS classes, you are expected to follow the Stanford Honor Code. Remember, honor comes from your original ideas, not stolen ones!