Project 2: NanoQuery Columnar Mogrifier

CS145: Stanford University, Fall 2025 - Systems Track

Memo from Dr. Gru's Lair

Subject: Dear builders, Need Columnar DB, pronto!

Dr. Gru here with a plan so advanced, it'll make those big tech companies look like they're still using stone tablets and chisels! We're going to build our own Parquet-based columnar database system – one so efficient, so blazingly fast, that it'll make rows beg for mercy!

⚡ Mission: Build a columnar database that processes queries faster than a freeze ray!

🚀 The Three Pillars of Power

🗜️ Parquet Power: Compression Meets Speed

Parquet is the secret weapon of data wizards everywhere! I want a system that can compress data like a black hole, query it faster than light, and join tables so quickly it'll break the laws of physics!

Faster Than a Freeze Ray

Every nanosecond counts when you need to optimize that three-table join. We're going BIG - from 100MB to GIGABYTES of compressed data without breaking a sweat!

🔧 Cracking the SQL

Parse those queries, optimize them, make them bow before the might of our NanoQuery engine!

Project Scope

Required Resources (see Ed for links)

Rough Time Investment Guide

Phase | Hours | Focus
Week 1-2: Parquet & SQL Parsing | 2-5 | Parquet storage & analysis, hand-parsing (data generation already implemented)
Week 3: Joins & SQL | 10-12 | Hash Join, Sort-Merge Join implementation
Week 4: Query Planning & Execution | 10-12 | Query optimization, algorithm selection, memory management, query execution
Week 5: Benchmarking | 6-8 | Performance testing, analysis, final polishing

Total: 30-40 hours over 5 weeks = manageable 6-8 hours/week

Query of Fun and Three Tables of Terrificity

Your mission is to parse and optimize this query. Fail, and you might be a test subject for my latest invention: the "Mandatory Volunteer Motivator 9000"!

SELECT s.song_id, AVG(u.age) AS avg_age,
       COUNT(DISTINCT l.user_id)
FROM Songs s
JOIN Listens l ON s.song_id = l.song_id
JOIN Users u ON l.user_id = u.user_id
GROUP BY s.song_id, s.title
ORDER BY COUNT(DISTINCT l.user_id) DESC, s.song_id;

Table Schemas

Each table has the following columns, plus 10 additional string columns of k characters each:

Songs Table

song_id: int64 + 10 string columns

Listens Table

listen_id: int64, song_id: int64, user_id: int64 + 10 string columns

Users Table

user_id: int64, age: int64 + 10 string columns

Why This Project Matters

Working Within Free Colab (12GB RAM)

✅ This project is designed for Free Colab! The 12GB memory constraint teaches you real-world optimization techniques used by columnar databases like BigQuery and Snowflake.

Both 100MB and 1GB compressed datasets are completely achievable in Free Colab with smart columnar techniques:

Smart Memory Management for 1GB Dataset

🎯 The Key Insight: Column Pruning

The query only needs 2-3 columns per table, not all 11-13 columns!

❌ Naive Approach:
  • Load all columns: 1GB → 5-10GB in RAM
  • Out of memory error!
✅ Smart Approach:
  • Load only needed columns: 1GB → 1-2GB in RAM
  • Works perfectly in Free Colab!

Memory Calculation Example:

# For 1GB compressed total:
listens.parquet: 750MB compressed → Load only (song_id, user_id) → ~150MB in memory
songs.parquet:    50MB compressed → Load only (song_id) → ~20MB in memory  
users.parquet:   200MB compressed → Load only (user_id, age) → ~80MB in memory

# Total active memory: ~250MB vs 5GB if loading everything!
# That's a 95% reduction through columnar optimization!
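
A quick way to sanity-check these estimates in your own notebook is to compare the on-disk Parquet size with the memory actually held by a column-pruned DataFrame. This is only a measurement sketch; the file names and column lists assume this project's layout.

import os
import pandas as pd

def report_memory_savings(path, needed_columns):
    """Compare on-disk Parquet size vs. in-RAM size for a column-pruned load."""
    disk_mb = os.path.getsize(path) / 1e6
    df = pd.read_parquet(path, columns=needed_columns)      # load only what the query needs
    ram_mb = df.memory_usage(deep=True).sum() / 1e6         # actual bytes held by the DataFrame
    print(f"{path}: {disk_mb:.0f} MB on disk -> {ram_mb:.0f} MB in RAM for {needed_columns}")
    return df

# Example usage (adjust paths/columns to your setup):
# listens_df = report_memory_savings('listens.parquet', ['song_id', 'user_id'])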

Additional Strategies

⚠️ Remember: The constraint is the lesson! Working within 12GB teaches you techniques that scale to petabyte-scale systems.

Architecture & Implementation Guide

Library Usage Policy

✅ You CAN Use

  • Pandas DataFrames - as data containers only
  • NumPy arrays - for efficient array operations
  • PyArrow - for Parquet file I/O only
  • Python dictionaries - for hash tables
  • Python sorting - sorted(), .sort()
  • pd.merge() - for final merge step only (see FAQ)
  • Basic Python - loops, list comprehensions

Use libraries for data structures and I/O

❌ You CANNOT Use

  • pd.merge() as your entire join algorithm - implement the logic yourself
  • Polars - has built-in columnar joins
  • DuckDB - complete SQL engine
  • SQLite/PostgreSQL - any SQL database
  • Apache Arrow compute - has join operations
  • Any library that replaces your algorithm design

Build the algorithms yourself for learning

⚠️ The Rule

You must implement:

  • ✅ Hash partitioning logic
  • ✅ Memory management
  • ✅ Algorithm selection (HPJ vs SMJ)
  • ✅ Overall join strategy

See pd.merge() FAQ below for exact rules!

System Architecture: Two Phases

📊 Data Preparation: Sections 0-1

Section 0: Generate Test Data
  • Songs: 10K-100K records | Users: 50K-500K records | Listens: 1M-10M records with foreign keys
  • Automatic Parquet compression: Snappy, 75-90% size reduction

Section 1: Store & Analyze Parquet Files
  • songs.parquet: ~5MB (10K rows × 11 cols)
  • listens.parquet: ~75MB (1M rows × 13 cols)
  • users.parquet: ~20MB (50K rows × 12 cols)

⚙️ Query Execution Pipeline: Sections 2-5

Input query: SELECT s.song_id, AVG(u.age), COUNT(DISTINCT l.user_id) FROM Songs s JOIN Listens l ON ... JOIN Users u ON ...

Section 2: Parse SQL Query
  • Extract: tables, joins, aggregations

Section 4: Query Planning (analyze metadata, don't load data yet!)
  • Disk storage: songs.parquet (5MB, 10K rows × 11 columns), listens.parquet (75MB, 1M rows × 13 columns), users.parquet (20MB, 50K rows × 12 columns)
  • Use metadata stats to decide: join order, HPJ vs SMJ, which columns to load
  • Memory (RAM) after loading: songs_df (~25MB uncompressed), listens_df (~375MB), users_df (~100MB), each a DataFrame in RAM

Section 3: Execute Join Algorithm
  • Hash Join: 1. build hash table, 2. probe & match, 3. output joined_df
  • Sort-Merge Join: 1. sort both tables, 2. merge sorted tables, 3. output joined_df

Aggregation (after joins): GROUP BY, AVG, COUNT DISTINCT

Implementation Pipeline

Here's exactly how to build your NanoQuery system using Pandas and Parquet:

Section 0: Generate Test Data
# Copy and run the complete data generation code from Ed Colab (no implementation needed)

Section 1: Load Tables from Parquet
# Read Parquet files into Pandas DataFrames
import pandas as pd
import pyarrow.parquet as pq

songs_df = pd.read_parquet('songs.parquet')
listens_df = pd.read_parquet('listens.parquet', columns=['song_id', 'user_id'])  # Only needed columns
users_df = pd.read_parquet('users.parquet')
Section 2: Parse SQL Query
def parse_sql(query):
    """YOUR TASK: Extract tables, joins, and aggregations"""
    # Parse SQL string to identify:
    # - Tables involved
    # - Join conditions
    # - GROUP BY columns
    # - Aggregation functions
    pass  # Your implementation here
Section 3: Implement Join Algorithms
def hash_partition_join(left_df, right_df, join_key):
    """YOUR IMPLEMENTATION - Build hash table and probe"""
    # 1. Hash partition both tables
    # 2. Build hash table from smaller partition
    # 3. Probe with larger partition
    # 4. Return joined results
    return joined_df
    
def sort_merge_join(left_df, right_df, join_key):
    """YOUR IMPLEMENTATION - Sort and merge"""
    # 1. Sort both tables by join key
    # 2. Merge sorted sequences
    # 3. Handle duplicates
    return joined_df
Section 4: Query Planning & Optimization
import pyarrow.parquet as pq

def analyze_metadata_before_loading():
    """YOUR TASK: Get table statistics WITHOUT loading data
    
    Hints:
    - Use pq.ParquetFile() to access metadata
    - Extract: num_rows, column names, file sizes
    - DON'T use pd.read_parquet() here - that loads data!
    """
    # TODO: For each table ('songs', 'users', 'listens'):
    #   - Open the Parquet file (but don't load data)
    #   - Extract metadata like row count, columns, sizes
    #   - Store in a dictionary
    pass  # Your implementation here

def plan_query_execution(metadata, parsed_query):
    """YOUR TASK: Use metadata to make smart decisions
    
    Questions to answer:
    - Which table is smallest? Largest?
    - Will a hash table fit in memory?
    - Which columns does the query actually need?
    - What's the optimal join order?
    """
    # TODO: Based on metadata, decide:
    #   1. Join order (smallest first? or different strategy?)
    #   2. Algorithm choice (HPJ if fits in memory, else SMJ)
    #   3. Which columns to load for each table
    pass  # Your implementation here

# After planning, load ONLY what you need:
# Example (you implement the actual logic):
# columns_needed = ['song_id', 'artist']  # From your planning
# df = pd.read_parquet('songs.parquet', columns=columns_needed)
Section 4a: Save Intermediate Results (optional)
# For 1GB+ datasets, save intermediate joins to avoid recomputation
result1.to_parquet('intermediate_join1.parquet', compression='snappy')

# Can read back if memory constrained
del result1  # Free memory
result1 = pd.read_parquet('intermediate_join1.parquet')
Section 5: Performance Benchmarking
import time

# Benchmark at the 100MB and 1GB scales.
# Use time.time() (or time.perf_counter()) to measure the time taken by each query run, and record the results.
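
Below is a minimal sketch of how one benchmark run might be structured so that I/O time and join time are measured separately; run_join stands in for your own HPJ or SMJ pipeline and is not provided code.

import time
import pandas as pd

def benchmark(algorithm_name, run_join):
    """Time one query run, separating Parquet I/O from join/aggregation compute."""
    t0 = time.perf_counter()
    # I/O phase: load only the columns the query needs
    songs = pd.read_parquet('songs.parquet', columns=['song_id'])
    listens = pd.read_parquet('listens.parquet', columns=['song_id', 'user_id'])
    users = pd.read_parquet('users.parquet', columns=['user_id', 'age'])
    io_time = time.perf_counter() - t0

    t1 = time.perf_counter()
    result = run_join(songs, listens, users)   # your HPJ or SMJ pipeline goes here
    join_time = time.perf_counter() - t1

    print(f"{algorithm_name}: I/O {io_time:.2f}s, join {join_time:.2f}s, total {io_time + join_time:.2f}s")
    return result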

💡 Key Implementation Points

See FAQs below for: Parquet compression options, dataset generation, performance metrics, query optimization strategies, and more implementation details.

Rubric (100 pts + 10 bonus)

Section 1: Parquet-based Columnar Storage 20 points

→ See Architecture: Data Generation & Storage (Phase 1)
  • Implement Parquet-based storage for the tables
  • For simplicity, store all data for a table in a single Parquet file and use a single DataFrame object as a buffer
  • Analyze and report on:
    • Space efficiency compared to row storage
    • Compression ratios achieved with Parquet
    • Read/write performance characteristics

Section 2: SQL Hand Parsing 15 points

→ See Architecture: Query Processing (Phase 2)
  • What "hand parsing" means: Manually extract the components from the provided query (you don't need a general SQL parser, just handle this specific query)
  • Your code should identify and extract:
    • Join conditions
    • Group by columns
    • Aggregation functions
    • Order by requirements

Section 3: Join Algorithms Implementation & Aggregation 20 points

→ This section: Implement the execution operators (HOW to join) and aggregation after joins
  • Hash Partition Join (9 points):
    • Implement hash partitioning for data distribution
    • Build Hash Partition Join (HPJ) algorithm
    • Handle memory constraints and hash collisions
  • Sort-Merge Join (9 points):
    • Implement external sorting if needed
    • Build Sort-Merge Join (SMJ) algorithm
    • Properly handle duplicate keys in merge phase
  • Aggregation After Joins (2 points):
    • Implement GROUP BY on the joined result
    • Calculate AVG(u.age) for each group
    • Calculate COUNT(DISTINCT l.user_id) for each group
    • Apply ORDER BY to final results
Important: Aggregation happens AFTER both joins complete. The flow is: Songs JOIN Listens → Result1 JOIN Users → Final Join Result → GROUP BY → ORDER BY
Tip: You may write intermediate results to a single Parquet file or to several smaller ones. For joins, you can keep the data in memory after reading the Parquet files (we're working with GBs, not TBs).

Section 4: Query Planning & Optimization 20 points

→ This section: Use Parquet metadata to plan BEFORE loading data
  • Metadata Analysis (6 points):
    • Read Parquet metadata WITHOUT loading data
    • Extract row counts, column names, file sizes
    • Use statistics to inform decisions
  • Algorithm Selection (6 points):
    • Choose HPJ vs SMJ based on metadata stats
    • Consider memory requirements from file sizes
  • Query Optimization (8 points):
    • Optimize join order using row counts
    • Implement column pruning - load ONLY needed columns

Section 5: Performance Benchmarking 20 points

→ See Implementation Pipeline: Section 4 for planning, Section 0 for data generation
  • 100MB Benchmark (4 points):
    • Generate 100MB total compressed Parquet files
    • Run the query with both algorithms
    • Report execution times
  • 1GB Benchmark (8 points):
    • Scale to 1GB total compressed Parquet files
    • Compare algorithm performance
    • Analyze memory usage
  • Performance Analysis (8 points):
    • Detailed breakdown of where time is spent
    • I/O vs CPU analysis
    • Bottleneck identification and optimization suggestions

Base Grading Model

Quality work gets 95/100: Complete all sections with working code and clear comments. This is a high A and what most students should target.

Higher credit requires exceptional work: Only pursue it if you're genuinely excited about data systems. The extra tens of hours won't significantly change your grade, but they will deepen your data systems skills and improve the quality of your artifact for technical interviews.

Readability & Organization Up to -15 points

  • Have clearly marked sections in your Colab notebook
  • Use these exact section headers:
    # Section 0: Data Generation
    # Section 1: Parquet Storage Analysis
    # Section 2: SQL Parsing
    # Section 3: Join Algorithms Implementation & Aggregation
    # Section 4: Query Planning & Optimization
    # Section 5: Performance Benchmarking
  • Specific deductions:
    • -5 points: Missing one or more required sections
    • -5 points: Poor code quality (no comments, unclear variable names, messy structure)
    • -5 points: Missing or unclear output (benchmark results not visible, no performance tables)

Adjustments +5 points max (top 10% of projects)

  • +3 Value: Demonstrates measurable speedups vs a stated baseline (numbers released on Ed by Nov 1st)
  • +2 Clarity: Crisp, compelling insights with clear diagrams and concise explanations

BONUS: World Domination Tier 10 points

  • Top 3 Performance (5 points): Achieve top 3 query performance in the class for the three-table join on 1GB compressed Parquet files
    • Fastest average query time wins
    • Must work in Free Colab (12GB RAM)
  • Infinite Scalable Storage (5 points): Handle 10GB+ compressed datasets in Free Colab
    • Hint: Think about how distributed systems handle data that doesn't fit in memory
    • Document your approach and show it works
    • Performance doesn't need to be fast, just correct

Frequently Asked Questions

Parquet Format & Implementation

Q: What is Parquet and why use it?

Parquet is a columnar storage format that provides excellent compression and query performance. For columnar storage concepts, see Storage Layout lesson.

Key Parquet benefits for this project:

  • Compress repeated values efficiently (e.g., same artist across songs)
  • Read only needed columns (SELECT song_id doesn't touch other columns)
  • Built-in statistics for query optimization
  • Row groups allow chunked processing

Q: How do I know if my dataset sizes are correct?

All sizes refer to compressed Parquet files on disk:

  • 100MB total: Check that songs.parquet + users.parquet + listens.parquet ≈ 100MB
  • 1GB total: Check that all three Parquet files sum to ≈ 1GB
  • Use os.path.getsize('file.parquet') or ls -lh *.parquet to verify (a quick sketch follows this list)
  • Listens should be ~75% of your total data (it's the activity log)
  • Parquet typically achieves 3-5x compression vs CSV
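
For example, a few lines like the following (assuming the three files sit in the working directory) will confirm the total:

import os

files = ['songs.parquet', 'listens.parquet', 'users.parquet']
sizes_mb = {f: os.path.getsize(f) / 1e6 for f in files}
total_mb = sum(sizes_mb.values())

for f, mb in sizes_mb.items():
    print(f"{f}: {mb:.1f} MB ({mb / total_mb:.0%} of total)")   # Listens should dominate (~75%)
print(f"Total compressed size: {total_mb:.1f} MB")               # target ~100MB or ~1GB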

Q: How do I work with Parquet files using PyArrow?

# Essential Parquet operations with PyArrow
import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd

# 1. Write with compression options
df.to_parquet('songs.parquet',
              compression='snappy',    # or 'gzip', 'brotli', 'lz4'
              row_group_size=50000)    # rows per group

# 2. Read only specific columns (columnar advantage!)
columns = ['song_id', 'title']
df = pd.read_parquet('songs.parquet', columns=columns)

# 3. Inspect metadata without reading data
parquet_file = pq.ParquetFile('songs.parquet')
print(f"Num row groups: {parquet_file.num_row_groups}")
print(f"Schema: {parquet_file.schema}")
print(f"Metadata: {parquet_file.metadata}")

# 4. Read in chunks for large files
parquet_file = pq.ParquetFile('large_table.parquet')
for batch in parquet_file.iter_batches(batch_size=10000):
    df_chunk = batch.to_pandas()
    # Process chunk

Q: Which compression should I use?

Compression | Speed | Ratio | Best For
Snappy | Very Fast | ~2-4x | Default, balanced
Gzip | Slow | ~4-8x | Max compression
LZ4 | Fastest | ~2-3x | Speed critical

Recommendation: Use Snappy (default) for this project - good balance of speed and compression.

Q: How should I handle the 10 additional string columns?

These columns simulate real-world data bloat. They should be:

  • Stored in your Parquet files to test compression
  • Ignored during join operations (only join on IDs)
  • Used to measure the impact of columnar storage

Q: Do I need to implement a full SQL parser?

No! "Hand parsing" means extracting components from the specific query provided. Hardcode the parsing for this query structure - extract joins, group by, aggregations, and order by.
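
To make this concrete, here is one possible shape for the hand-parsed output, written out for the provided query. The dictionary layout and field names are purely illustrative, not something the rubric prescribes.

# Illustrative only: one possible shape for the hand-parsed query components
parsed_query = {
    'tables': {'s': 'Songs', 'l': 'Listens', 'u': 'Users'},
    'joins': [('Songs.song_id', 'Listens.song_id'),
              ('Listens.user_id', 'Users.user_id')],
    'group_by': ['song_id', 'title'],
    'aggregations': [('AVG', 'age'), ('COUNT_DISTINCT', 'user_id')],
    'order_by': [('COUNT_DISTINCT(user_id)', 'DESC'), ('song_id', 'ASC')],
}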

Performance & Benchmarking

Q: When can I use pandas DataFrames in memory vs streaming?

Memory usage guidelines:

  • 100MB compressed: Can load all tables in memory (~500MB uncompressed)
  • 1GB compressed: Load one table at a time, stream the 750MB listens table if needed
  • 10GB+ (bonus): Must use chunked processing, never load full table

Rule of thumb: Pandas uses ~5-10x the file size in RAM. A 1GB Parquet file might need 5-10GB of RAM when loaded as a DataFrame.

Q: What metrics should I report for performance analysis?

Metric | What to Measure | How to Measure
Execution Time | Total query time | time.time() or %%time in Jupyter
I/O Time | Parquet read/write time | Measure separately from compute
Memory Usage | Peak memory consumption | memory_profiler or psutil
Compression Ratio | Parquet vs raw size | File sizes on disk
Join Selectivity | Output rows / input rows | Count rows at each stage
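
For memory, a rough snapshot-based approach is to read this process's resident set size before and after the join. This assumes psutil is available (it usually is in Colab); memory_profiler can give finer-grained peak measurements.

import psutil

process = psutil.Process()

def rss_gb():
    """Resident memory of the current Python process, in GB."""
    return process.memory_info().rss / 1e9

before = rss_gb()
# ... run your join here ...
after = rss_gb()
print(f"Memory before: {before:.2f} GB, after: {after:.2f} GB, delta: {after - before:.2f} GB")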

Q: How can I optimize for the top 3 performance bonus?

Consider these optimizations (a small read-path sketch follows the list):

  • Predicate pushdown: Filter early in Parquet reads
  • Column pruning: Only read needed columns
  • Join order optimization: Join smaller tables first
  • Partition elimination: Skip irrelevant partitions
  • Memory management: Use chunking for large operations
  • Parallel processing: Use multiprocessing for independent tasks
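
As a sketch of the first two ideas: pandas will push both the column list and (with the pyarrow engine) simple filters down into the Parquet reader. The age filter below is purely illustrative, since the project query has no WHERE clause.

import pandas as pd

users_df = pd.read_parquet(
    'users.parquet',
    columns=['user_id', 'age'],        # column pruning: skip the 10 filler string columns
    filters=[('age', '>=', 0)],        # predicate pushdown: row groups are skipped using Parquet statistics
)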

Technical Details

Q: How should my query planner choose between algorithms?

Simple cost model approach:

def choose_algorithm(table1_size, table2_size, available_memory):
  # Check if smaller table fits in memory for HPJ
  smaller_size = min(table1_size, table2_size)
  if smaller_size * 5 < available_memory: # 5x for pandas overhead
    return "HPJ" # Hash join is faster
  else:
    return "SMJ" # Sort-merge for limited memory

# Also consider:
# - Is data already sorted? → SMJ
# - Need sorted output for GROUP BY? → SMJ
# - Very skewed join keys? → SMJ handles better
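
One way to call a cost model like this, using on-disk Parquet sizes as a rough stand-in for table size and psutil (assumed available) for free memory:

import os
import psutil

available = psutil.virtual_memory().available
algo = choose_algorithm(os.path.getsize('songs.parquet'),
                        os.path.getsize('listens.parquet'),
                        available)
print(f"Planner chose: {algo}")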

Q: When should I use intermediate Parquet files?

Use intermediate files when:

  • After sorting: Save sorted tables to Parquet for reuse
  • After partitioning: Save each partition as separate file
  • Join results: Save intermediate join results before aggregation

Example: For SMJ, save the sorted tables as intermediate Parquet files so you don't need to re-sort if the query fails partway through.

Q: How do I implement GROUP BY after joins?

After your join completes:

# Option 1: Use pandas groupby (simple)
result = joined_df.groupby(['song_id', 'title']).agg({
  'age': 'mean',
  'user_id': lambda x: x.nunique()
})

# Option 2: Manual aggregation (shows understanding)
groups = {}
for row in joined_data:
  key = (row['song_id'], row['title'])
  if key not in groups:
    groups[key] = {'ages': [], 'users': set()}
  groups[key]['ages'].append(row['age'])
  groups[key]['users'].add(row['user_id'])
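
If you go the manual route, the groups dictionary still has to be turned into the final result. A sketch of that finishing step, matching the query's aggregates and ORDER BY, might look like:

# Finish the manual aggregation from Option 2 above
rows = []
for (song_id, title), g in groups.items():
    rows.append({
        'song_id': song_id,
        'avg_age': sum(g['ages']) / len(g['ages']),      # AVG(u.age)
        'count_distinct_users': len(g['users']),          # COUNT(DISTINCT l.user_id)
    })

# ORDER BY COUNT(DISTINCT l.user_id) DESC, s.song_id
rows.sort(key=lambda r: (-r['count_distinct_users'], r['song_id']))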

Q: How do I handle COUNT(DISTINCT) efficiently?

Several approaches:

  • Hash Set: Use Python set() to track unique values
  • Sort-based: Sort, then count transitions (sketched below)
  • Approximate: Consider HyperLogLog for big datasets
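
A minimal sketch of the sort-based idea (a new distinct value starts wherever the sorted array changes):

import numpy as np

def count_distinct_sorted(values):
    """Count distinct values by sorting and counting transitions."""
    arr = np.sort(np.asarray(values))
    if arr.size == 0:
        return 0
    return 1 + int(np.count_nonzero(arr[1:] != arr[:-1]))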

Q: When can I use pd.merge()?

⚡ Rule: Use pd.merge() ONLY after you've implemented your own partitioning logic

✅ ALLOWED
# Your partitioning + pd.merge() for final step
partitions = my_hash_partition(left_df, right_df, key)
for left_part, right_part in partitions:
  result = pd.merge(left_part, right_part, on=key)
✗ FORBIDDEN
# pd.merge() as your entire algorithm
return pd.merge(left_df, right_df, on=key)

⚠️ Common Pitfalls to Avoid

Q: What are the most common mistakes students make?

1. Misusing pd.merge():

  • ❌ Using pd.merge() as your entire join algorithm
  • ❌ Using Polars (it has columnar joins built-in)
  • ✅ Use pd.merge() within your partitions after implementing partitioning logic
  • ✅ Implement hash partitioning, memory management, and algorithm design yourself

2. Not Handling Duplicate Keys in Joins:

  • ❌ Assuming each key appears once (1:1 join)
  • ✅ Handle many-to-many relationships correctly
  • ✅ Test with duplicate keys in both tables

3. Not Using Metadata Before Loading:

  • ❌ Loading full Parquet files THEN selecting columns
  • ❌ pd.read_parquet('1gb_table.parquet') without columns parameter
  • ✅ Analyze metadata FIRST: pq.ParquetFile(path).metadata
  • ✅ Plan which columns needed based on query
  • ✅ Load ONLY those columns: pd.read_parquet(file, columns=['id', 'name'])

4. Wrong Section Organization:

  • ❌ Implementing joins in Section 4 (planning)
  • ✅ Section 3: Join algorithms AND aggregation after joins
  • ✅ Section 4: Query planning and optimization decisions
  • ✅ Aggregation happens at the END of Section 3, after both joins complete

5. Incorrect Performance Comparisons:

  • ❌ Comparing HPJ and SMJ on different queries
  • ❌ Not warming up file system cache
  • ✅ Run same query with both algorithms
  • ✅ Average multiple runs to account for variance

Q: How do I debug my join algorithms?

Debugging strategy:

  1. Start small: Test with 10-row tables first
  2. Print intermediate results: Show hash partitions, sorted data
  3. Compare with pd.merge(): Use as ground truth for correctness (see the sketch after this list)
  4. Check edge cases:
    • Empty tables
    • No matching keys
    • All keys match
    • Duplicate keys
    • NULL values in join columns
  5. Visualize: Draw your hash partitions or merge pointers
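
A small correctness-check sketch for step 3, assuming your join returns the same columns as pd.merge:

import pandas as pd

def check_join(my_join_fn, left, right, key):
    """Compare your join output against pd.merge on a small sample."""
    mine = my_join_fn(left, right, key)
    truth = pd.merge(left, right, on=key)
    cols = sorted(truth.columns)
    # Sort both results the same way so row order doesn't matter
    mine_sorted = mine[cols].sort_values(cols).reset_index(drop=True)
    truth_sorted = truth[cols].sort_values(cols).reset_index(drop=True)
    pd.testing.assert_frame_equal(mine_sorted, truth_sorted, check_dtype=False)
    print("Join output matches pd.merge")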

AI Usage & Verification

Q: How can I use AI for this systems project?

AI can help with specific technical tasks (see AI Policy):

  • Understanding the Parquet file format, columnar storage concepts, and Pandas DataFrames
  • Generating benchmarking code and performance analysis
  • Identifying performance bottlenecks in your implementation
  • DO NOT use AI for join algorithms and query planning - use class knowledge from lectures

Important: The join algorithms (HPJ and SMJ) were covered in detail in class. Implement these based on your lectures and understanding, not AI assistance.

Q: What should I verify when debugging my implementation?

Test these critical cases:

  • Join correctness: Compare output with pandas merge for small datasets
  • Edge cases: Empty tables, no matching keys, all matching keys
  • Performance scaling: Time should scale appropriately with data size
  • Duplicate handling: Many-to-many joins should produce correct cartesian products

Partnership Policy

Q: Can I work with a partner?

Yes! Teams of 2 are allowed, or you can work solo. See our comprehensive Partnership Policy for details on collaboration, shared repositories, Gradescope submission, late day policies, and conflict resolution.

📊 Expected Final Output Format

Your query must produce results in this exact format:

# Final Query Results
song_id | avg_age | count_distinct_users
--------|---------|--------------------
12345 | 28.50 | 1523
67890 | 31.25 | 1245
54321 | 26.75 | 987
... | ... | ...

Requirements:

Performance Results Table (Required in summary):

Dataset Size | Algorithm | Total Time (s) | I/O Time (s) | Join Time (s) | Peak Memory (GB)
100MB | HPJ | [Your time] | [Your time] | [Your time] | [Your memory]
100MB | SMJ | [Your time] | [Your time] | [Your time] | [Your memory]
1GB | HPJ | [Your time] | [Your time] | [Your time] | [Your memory]
1GB | SMJ | [Your time] | [Your time] | [Your time] | [Your memory]
Note: Both algorithms must produce identical results. Use assertion checks to verify correctness.

Submission Instructions

Submit to Gradescope:

Honor Code

As in all Stanford CS classes, you are expected to follow the Stanford Honor Code. Remember, honor comes from your original ideas, not stolen ones!