Subject: Dear builders, Need Columnar DB, pronto!
Dr. Gru here with a plan so advanced, it'll make those big tech companies look like they're still using stone tablets and chisels! We're going to build our own Parquet-based columnar database system – one so efficient, so blazingly fast, that it'll make rows beg for mercy!
Parquet is the secret weapon of data wizards everywhere! I want a system that can compress data like a black hole, query it faster than light, and join tables so quickly it'll break the laws of physics!
Every nanosecond counts when you need to optimize that three-table join. We're going BIG - from 100MB to GIGABYTES of compressed data without breaking a sweat!
Parse those queries, optimize them, make them bow before the might of our NanoQuery engine!
| Hours | Milestone |
|---|---|
| 2-5 | Parquet storage & analysis, hand-parsing (data generation already implemented) |
| 10-12 | Hash Join, Sort-Merge Join implementation |
| 10-12 | Query optimization, algorithm selection, memory management, query execution |
| 6-8 | Performance testing, analysis, final polishing |

Total: 30-40 hours over 5 weeks, a manageable 6-8 hours/week.
Your mission is to parse and optimize this query. Fail, and you might be a test subject for my latest invention: the "Mandatory Volunteer Motivator 9000"!
SELECT s.song_id, AVG(u.age) AS avg_age,
COUNT(DISTINCT l.user_id)
FROM Songs s
JOIN Listens l ON s.song_id = l.song_id
JOIN Users u ON l.user_id = u.user_id
GROUP BY s.song_id, s.title
ORDER BY COUNT(DISTINCT l.user_id) DESC, s.song_id;
Each table has the following columns, plus 10 additional string columns of k characters each:
| Table | Columns |
|---|---|
| Songs | song_id: int64 + 10 string columns |
| Listens | listen_id: int64, song_id: int64, user_id: int64 + 10 string columns |
| Users | user_id: int64, age: int64 + 10 string columns |
✅ This project is designed for Free Colab! The 12GB memory constraint teaches you real-world optimization techniques used by columnar databases like BigQuery and Snowflake.
Both 100MB and 1GB compressed datasets are completely achievable in Free Colab with smart columnar techniques:
The query only needs 3-4 columns per table, not all 13 columns!
# For 1GB compressed total:
listens.parquet: 750MB compressed → Load only (song_id, user_id) → ~150MB in memory
songs.parquet: 50MB compressed → Load only (song_id) → ~20MB in memory
users.parquet: 200MB compressed → Load only (user_id, age) → ~80MB in memory
# Total active memory: ~250MB vs 5GB if loading everything!
# That's a 95% reduction through columnar optimization!
del each DataFrame after every join stage and run gc.collect().

⚠️ Remember: The constraint is the lesson! Working within 12GB teaches you techniques that scale to petabyte-scale systems.
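For example, a minimal sketch of the del + gc.collect() pattern mentioned above (the intermediate DataFrame names are hypothetical placeholders for your own join stages):

import gc

# Hypothetical intermediates: songs_listens is the result of the Songs ⋈ Listens stage
songs_listens_users = hash_partition_join(songs_listens, users_df, 'user_id')
del songs_listens   # the intermediate has been consumed, so release it
gc.collect()        # prompt Python to return the memory promptly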
Use libraries for data structures and I/O.
Build the algorithms yourself for learning.
You must implement the SQL parsing, the metadata analysis and query planning, and both join algorithms (HPJ and SMJ) yourself.
See the pd.merge() FAQ below for the exact rules!
Here's exactly how to build your NanoQuery system using Pandas and Parquet:
# Copy and run the complete data generation code from Ed Colab (no implementation needed)
# Read Parquet files into Pandas DataFrames
import pandas as pd
import pyarrow.parquet as pq
songs_df = pd.read_parquet('songs.parquet')
listens_df = pd.read_parquet('listens.parquet', columns=['song_id', 'user_id']) # Only needed columns
users_df = pd.read_parquet('users.parquet')
def parse_sql(query):
"""YOUR TASK: Extract tables, joins, and aggregations"""
# Parse SQL string to identify:
# - Tables involved
# - Join conditions
# - GROUP BY columns
# - Aggregation functions
pass # Your implementation here
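If you want a concrete starting point, here is a minimal regex-based sketch for this one fixed query (the returned dictionary layout is an illustrative assumption, not a required format):

import re

def parse_sql_sketch(query):
    """Illustrative hand-parser for the fixed NanoQuery query (not general SQL)."""
    parsed = {}
    # Tables and their aliases, e.g. FROM Songs s, JOIN Listens l
    parsed['tables'] = dict(re.findall(r'(?:FROM|JOIN)\s+(\w+)\s+(\w+)', query))
    # Join conditions, e.g. ON s.song_id = l.song_id
    parsed['joins'] = re.findall(r'ON\s+(\w+\.\w+)\s*=\s*(\w+\.\w+)', query)
    # GROUP BY columns
    match = re.search(r'GROUP BY\s+(.*?)\s*ORDER BY', query, re.S)
    parsed['group_by'] = [c.strip() for c in match.group(1).split(',')] if match else []
    # Aggregation functions, e.g. AVG(u.age), COUNT(DISTINCT l.user_id)
    parsed['aggregations'] = re.findall(r'(AVG|COUNT)\s*\(([^)]*)\)', query)
    return parsed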
def hash_partition_join(left_df, right_df, join_key):
"""YOUR IMPLEMENTATION - Build hash table and probe"""
# 1. Hash partition both tables
# 2. Build hash table from smaller partition
# 3. Probe with larger partition
# 4. Return joined results
    pass  # Your implementation here (build, probe, and return the joined DataFrame)
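To make the shape of the algorithm concrete, here is a hedged sketch using plain Python dictionaries and integer keys (the structure and helper names are illustrative, not a required design):

import pandas as pd

def hash_partition_join_sketch(left_df, right_df, join_key, num_partitions=8):
    """Illustrative hash partition join: partition, build, probe."""
    # 1. Hash-partition both tables (key % N is a simple hash for integer keys)
    left_parts = [left_df[left_df[join_key] % num_partitions == p] for p in range(num_partitions)]
    right_parts = [right_df[right_df[join_key] % num_partitions == p] for p in range(num_partitions)]

    output_rows = []
    for left_part, right_part in zip(left_parts, right_parts):
        # 2. Build a hash table from one partition (ideally the smaller side)
        build_table = {}
        for row in left_part.to_dict('records'):
            build_table.setdefault(row[join_key], []).append(row)
        # 3. Probe with the matching partition from the other table
        for row in right_part.to_dict('records'):
            for match in build_table.get(row[join_key], []):
                output_rows.append({**match, **row})
    # 4. Return joined results
    return pd.DataFrame(output_rows)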
def sort_merge_join(left_df, right_df, join_key):
"""YOUR IMPLEMENTATION - Sort and merge"""
# 1. Sort both tables by join key
# 2. Merge sorted sequences
# 3. Handle duplicates
    pass  # Your implementation here (sort, merge, and return the joined DataFrame)
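And a matching sketch of the sort-merge version, including duplicate-key handling (again illustrative, assuming the join key exists in both DataFrames):

import pandas as pd

def sort_merge_join_sketch(left_df, right_df, join_key):
    """Illustrative sort-merge join that emits all pairs for duplicate keys."""
    # 1. Sort both tables by the join key
    left = left_df.sort_values(join_key).to_dict('records')
    right = right_df.sort_values(join_key).to_dict('records')

    output_rows, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        left_key, right_key = left[i][join_key], right[j][join_key]
        if left_key < right_key:
            i += 1
        elif left_key > right_key:
            j += 1
        else:
            # 2./3. Merge the two runs of equal keys: emit every matching pair
            run_start = j
            while i < len(left) and left[i][join_key] == left_key:
                j = run_start
                while j < len(right) and right[j][join_key] == left_key:
                    output_rows.append({**left[i], **right[j]})
                    j += 1
                i += 1
    return pd.DataFrame(output_rows)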
import pyarrow.parquet as pq
def analyze_metadata_before_loading():
"""YOUR TASK: Get table statistics WITHOUT loading data
Hints:
- Use pq.ParquetFile() to access metadata
- Extract: num_rows, column names, file sizes
- DON'T use pd.read_parquet() here - that loads data!
"""
# TODO: For each table ('songs', 'users', 'listens'):
# - Open the Parquet file (but don't load data)
# - Extract metadata like row count, columns, sizes
# - Store in a dictionary
pass # Your implementation here
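A minimal sketch of that metadata pass, assuming the files are named songs.parquet, users.parquet, and listens.parquet (the dictionary layout is illustrative):

import os
import pyarrow.parquet as pq

def analyze_metadata_sketch(table_names=('songs', 'users', 'listens')):
    """Illustrative: gather statistics from Parquet footers without loading rows."""
    stats = {}
    for name in table_names:
        path = f'{name}.parquet'
        pf = pq.ParquetFile(path)            # reads only the footer metadata
        stats[name] = {
            'num_rows': pf.metadata.num_rows,
            'num_row_groups': pf.num_row_groups,
            'columns': pf.schema_arrow.names,
            'file_size_bytes': os.path.getsize(path),
        }
    return stats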
def plan_query_execution(metadata, parsed_query):
"""YOUR TASK: Use metadata to make smart decisions
Questions to answer:
- Which table is smallest? Largest?
- Will a hash table fit in memory?
- Which columns does the query actually need?
- What's the optimal join order?
"""
# TODO: Based on metadata, decide:
# 1. Join order (smallest first? or different strategy?)
# 2. Algorithm choice (HPJ if fits in memory, else SMJ)
# 3. Which columns to load for each table
pass # Your implementation here
# After planning, load ONLY what you need:
# Example (you implement the actual logic):
# columns_needed = ['song_id', 'artist'] # From your planning
# df = pd.read_parquet('songs.parquet', columns=columns_needed)
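A hedged planning sketch, assuming `metadata` has the per-table shape gathered above; the column lists follow the query, and the 5x multiplier is the pandas-overhead rule of thumb from the FAQ:

def plan_query_execution_sketch(metadata, available_memory_bytes=8e9):
    """Illustrative plan: column pruning, join order, and algorithm choice."""
    # Columns this query needs, per its SELECT / JOIN / GROUP BY clauses
    columns_needed = {
        'songs':   ['song_id', 'title'],
        'listens': ['song_id', 'user_id'],
        'users':   ['user_id', 'age'],
    }
    # Join order: start with the smallest table by row count
    join_order = sorted(metadata, key=lambda t: metadata[t]['num_rows'])

    # Algorithm choice: HPJ if the smaller side (x5 pandas overhead) fits in memory
    smallest_file = min(metadata[t]['file_size_bytes'] for t in metadata)
    algorithm = 'HPJ' if smallest_file * 5 < available_memory_bytes else 'SMJ'

    return {'columns': columns_needed, 'join_order': join_order, 'algorithm': algorithm}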
# For 1GB+ datasets, save intermediate joins to avoid recomputation
result1.to_parquet('intermediate_join1.parquet', compression='snappy')
del result1  # Free memory
# Read it back later if/when you need it again
result1 = pd.read_parquet('intermediate_join1.parquet')
import time
# Benchmark at different scales
# Use time.time() to measure the time taken for each query and record the results
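A small timing harness along these lines keeps I/O and join time separate, as the performance metrics table below asks for (the function and parameter names here are illustrative):

import time
import pandas as pd

def benchmark_join_sketch(parquet_path, columns, join_fn, other_df, join_key):
    """Illustrative benchmark separating Parquet I/O time from join time."""
    t0 = time.time()
    df = pd.read_parquet(parquet_path, columns=columns)    # I/O phase
    t1 = time.time()
    joined = join_fn(df, other_df, join_key)                # compute phase
    t2 = time.time()
    return {'io_time_s': t1 - t0, 'join_time_s': t2 - t1,
            'total_time_s': t2 - t0, 'output_rows': len(joined)}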
See FAQs below for: Parquet compression options, dataset generation, performance metrics, query optimization strategies, and more implementation details.
Quality work gets 95/100: Complete all sections with working code and clear comments. This is a high A and what most students should target.
Higher credit requires exceptional work: only pursue this if you're genuinely excited about data systems. The extra tens of hours won't significantly impact your grade, but they will deepen your data systems skills and the quality of your artifact for technical interviews.
Parquet is a columnar storage format that provides excellent compression and query performance. For columnar storage concepts, see Storage Layout lesson.
Key Parquet benefits for this project:
- Read only the columns you need (column pruning)
- Built-in compression (Snappy, Gzip, Brotli, LZ4)
- Metadata (row counts, schema, file sizes) available without loading any data
- Chunked reads via iter_batches() for files larger than memory
All sizes refer to compressed Parquet files on disk; use os.path.getsize('file.parquet') or ls -lh *.parquet to verify.
# Essential Parquet operations with PyArrow
import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd
# 1. Write with compression options
df.to_parquet('songs.parquet',
compression='snappy', # or 'gzip', 'brotli', 'lz4'
row_group_size=50000) # rows per group
# 2. Read only specific columns (columnar advantage!)
columns = ['song_id', 'title']
df = pd.read_parquet('songs.parquet', columns=columns)
# 3. Inspect metadata without reading data
parquet_file = pq.ParquetFile('songs.parquet')
print(f"Num row groups: {parquet_file.num_row_groups}")
print(f"Schema: {parquet_file.schema}")
print(f"Metadata: {parquet_file.metadata}")
# 4. Read in chunks for large files
parquet_file = pq.ParquetFile('large_table.parquet')
for batch in parquet_file.iter_batches(batch_size=10000):
df_chunk = batch.to_pandas()
# Process chunk
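For instance, a hedged sketch of streaming a single column and keeping only a running count, so the full table is never held in memory (the per-song listen count is purely illustrative):

import collections
import pyarrow.parquet as pq

listen_counts = collections.Counter()
parquet_file = pq.ParquetFile('listens.parquet')
for batch in parquet_file.iter_batches(batch_size=100_000, columns=['song_id']):
    # Only ~100k values of one column are in memory at any time
    listen_counts.update(batch.to_pandas()['song_id'])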
| Compression | Speed | Ratio | Best For |
|---|---|---|---|
| Snappy | Very Fast | ~2-4x | Default, balanced |
| Gzip | Slow | ~4-8x | Max compression |
| LZ4 | Fastest | ~2-3x | Speed critical |
Recommendation: Use Snappy (default) for this project - good balance of speed and compression.
These columns simulate real-world data bloat. They should be random strings of k characters each, produced by the provided data generation code and written to your Parquet files, but never loaded when executing the query (that's the columnar payoff).
No! "Hand parsing" means extracting components from the specific query provided. Hardcode the parsing for this query structure - extract joins, group by, aggregations, and order by.
Memory usage guidelines:
Rule of thumb: Pandas uses ~5-10x the file size in RAM. A 1GB Parquet file might need 5-10GB of RAM when loaded as a DataFrame.
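A quick sanity check of that rule against your own files (the 5-10x multipliers are just the heuristic above, not a guarantee):

import os

file_gb = os.path.getsize('listens.parquet') / 1e9
print(f"On disk: {file_gb:.2f} GB, "
      f"estimated as a DataFrame: {5 * file_gb:.1f}-{10 * file_gb:.1f} GB")
# Compare with the real footprint after loading: df.memory_usage(deep=True).sum() / 1e9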
| Metric | What to Measure | How to Measure |
|---|---|---|
| Execution Time | Total query time | time.time() or %%time in Jupyter |
| I/O Time | Parquet read/write time | Measure separately from compute |
| Memory Usage | Peak memory consumption | memory_profiler or psutil |
| Compression Ratio | Parquet vs raw size | File sizes on disk |
| Join Selectivity | Output rows / input rows | Count rows at each stage |
Consider these optimizations: load only the columns each table needs, start joins from the smallest table, and pick HPJ vs. SMJ based on table sizes and available memory.
Simple cost model approach:
def choose_algorithm(table1_size, table2_size, available_memory):
# Check if smaller table fits in memory for HPJ
smaller_size = min(table1_size, table2_size)
if smaller_size * 5 < available_memory: # 5x for pandas overhead
return "HPJ" # Hash join is faster
else:
return "SMJ" # Sort-merge for limited memory
# Also consider:
# - Is data already sorted? → SMJ
# - Need sorted output for GROUP BY? → SMJ
# - Very skewed join keys? → SMJ handles better
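For example, plugging in the on-disk sizes from the 1GB scenario above and an assumed ~8GB of usable Colab memory:

# smaller side is songs (~50MB); 50MB * 5 = 250MB < 8GB, so this returns "HPJ"
print(choose_algorithm(table1_size=50e6, table2_size=750e6, available_memory=8e9))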
Use intermediate files when memory is tight or when an expensive intermediate result (a sort, a large join) may be needed again.
Example: For SMJ, save the sorted tables as intermediate Parquet files so you don't have to re-sort if the query fails.
After your join completes:
# Option 1: Use pandas groupby (simple)
result = joined_df.groupby(['song_id', 'title']).agg({
'age': 'mean',
'user_id': lambda x: x.nunique()
})
# Option 2: Manual aggregation (shows understanding)
groups = {}
for row in joined_df.to_dict('records'):  # iterate the join output as plain dicts
key = (row['song_id'], row['title'])
if key not in groups:
groups[key] = {'ages': [], 'users': set()}
groups[key]['ages'].append(row['age'])
groups[key]['users'].add(row['user_id'])
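Continuing the manual version, a short sketch of turning `groups` into the final ordered result (the column names match the required output format):

# Compute AVG(age) and COUNT(DISTINCT user_id) per group, then apply the ORDER BY
final_rows = [
    {'song_id': key[0],
     'avg_age': sum(g['ages']) / len(g['ages']),
     'count_distinct_users': len(g['users'])}
    for key, g in groups.items()
]
# ORDER BY COUNT(DISTINCT l.user_id) DESC, s.song_id ASC
final_rows.sort(key=lambda r: (-r['count_distinct_users'], r['song_id']))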
Several approaches:
⚡ Rule: Use pd.merge() ONLY after you've implemented your own partitioning logic
# ✅ Allowed: your own partitioning logic, with pd.merge() only for the per-partition step
partitions = my_hash_partition(left_df, right_df, key)
for left_part, right_part in partitions:
    result = pd.merge(left_part, right_part, on=key)

# ❌ Not allowed: pd.merge() as your entire algorithm
return pd.merge(left_df, right_df, on=key)
1. Misusing pd.merge(): using it as your entire join instead of only inside your own partitioning logic.
2. Not Handling Duplicate Keys in Joins: many listens share the same song_id and user_id, so your join must emit every matching pair.
3. Not Using Metadata Before Loading: calling pd.read_parquet() on full tables instead of inspecting pq.ParquetFile() metadata first.
4. Wrong Section Organization:
5. Incorrect Performance Comparisons:
Debugging strategy:
AI can help with specific technical tasks (see AI Policy):
Important: The join algorithms (HPJ and SMJ) were covered in detail in class. Implement these based on your lectures and understanding, not AI assistance.
Test these critical cases:
Yes! Teams of 2 are allowed, or you can work solo. See our comprehensive Partnership Policy for details on collaboration, shared repositories, Gradescope submission, late day policies, and conflict resolution.
Your query must produce results in this exact format:
Requirements:
- Output columns: song_id, avg_age, count_distinct_users
- Sorted by COUNT(DISTINCT l.user_id) DESC, then by song_id

Performance Results Table (Required in summary):
| Dataset Size | Algorithm | Total Time (s) | I/O Time (s) | Join Time (s) | Peak Memory (GB) |
|---|---|---|---|---|---|
| 100MB | HPJ | [Your time] | [Your time] | [Your time] | [Your memory] |
| 100MB | SMJ | [Your time] | [Your time] | [Your time] | [Your memory] |
| 1GB | HPJ | [Your time] | [Your time] | [Your time] | [Your memory] |
| 1GB | SMJ | [Your time] | [Your time] | [Your time] | [Your memory] |
Submit to Gradescope:
- Your .ipynb file

As in all Stanford CS classes, you are expected to follow the Stanford Honor Code. Remember, honor comes from your original ideas, not stolen ones!