Subject: Dear builders, Need Columnar DB, pronto!
Dr. Gru here with a plan so advanced, it'll make those big tech companies look like they're still using stone tablets and chisels! We're going to build our own Parquet-based columnar database system – one so efficient, so blazingly fast, that it'll make rows beg for mercy!
Master the Parquet format's compression (Snappy) and encoding. You'll build a system that reads specific columns directly from disk, drastically reducing I/O overhead.
Implement high-fidelity Hash-Partition and Sort-Merge Joins. You must manage multi-gigabyte datasets within a strict 12GB memory envelope.
Plan and optimize queries. Planning is the difference between success and a MemoryError.
| Phase | Estimated Time |
|---|---|
| Parquet I/O analysis, metadata extraction, and basic hand-parsing implementation | ~5 hours |
| Core algorithm implementation: Hash-Partition Join and Sort-Merge Join logic | ~12 hours |
| Building the query optimizer, column pruning, and memory-constrained execution | ~12 hours |
| I/O cost analysis, performance profiling, and final optimization pass | ~8 hours |
Expected Total: 35-40 hours (approx. 7-8 hours/week).
Your engine must be capable of parsing, optimizing, and executing the following three-table join. Your implementation should favor algorithmic efficiency over brute-force loading.
SELECT s.song_id, AVG(u.age) AS avg_age,
COUNT(DISTINCT l.user_id)
FROM Songs s
JOIN Listens l ON s.song_id = l.song_id
JOIN Users u ON l.user_id = u.user_id
GROUP BY s.song_id, s.title
ORDER BY COUNT(DISTINCT l.user_id) DESC, s.song_id;
Your engine will operate on the following schema. Each table includes 10 additional "noise" columns (strings) to simulate real-world data bloating and test your columnar pruning capabilities.
| Table | Columns |
|---|---|
| Songs | song_id (int64), title (string), + 10 padding columns |
| Listens | listen_id (int64), song_id (int64), user_id (int64), + 10 padding columns |
| Users | user_id (int64), age (int64), + 10 padding columns |
Do not lean on pd.merge() as your join: you will implement the logic of Hash Joins and Sort-Merge Joins from first principles. This project is designed for Free Colab! The 12GB memory constraint teaches you real-world optimization techniques used by columnar databases like BigQuery and Snowflake.
Both 100MB and 1GB compressed datasets are completely achievable in Free Colab with smart columnar techniques:
pd.read_parquet('listens.parquet')
Loads all 13 columns (including 10 noise columns). Result: ~8GB RAM consumption for one table. System crash likely.
pq.read_table(..., columns=['song_id'])
Reads only the byte ranges (column chunks) for the requested column. Result: < 1 GB RAM consumption.
# For 1GB compressed total:
listens.parquet: 750MB compressed → Load only (song_id, user_id) → ~150MB in memory
songs.parquet: 50MB compressed → Load only (song_id) → ~20MB in memory
users.parquet: 200MB compressed → Load only (user_id, age) → ~80MB in memory
# Total active memory: ~250MB vs 5GB if loading everything!
# That's a 95% reduction through columnar optimization!
del each DataFrame after every join stage, then run gc.collect().
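A minimal sketch of that cleanup pattern between stages (variable names and the my_join placeholder are illustrative, not part of the assignment API):
import gc
import pandas as pd

# Load only the columns this stage needs (column pruning).
songs = pd.read_parquet('songs.parquet', columns=['song_id'])
listens = pd.read_parquet('listens.parquet', columns=['song_id', 'user_id'])

stage1 = my_join(songs, listens, key='song_id')  # your own join implementation (placeholder name)

# Drop references to inputs you no longer need, then force a collection so
# the memory is released before the next stage starts.
del songs, listens
gc.collect()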
Remember: The constraint is the lesson! Working within 12GB teaches you techniques that scale to petabyte-scale systems.
Use libraries for data structures and I/O.
Core Rule: You must implement the partitioning logic, hash table probe, and merge-intersection yourself. Using pd.merge() as a black box for the primary join is an automatic failure of the technical objective.
Follow this execution order to maintain memory efficiency. Data generation is provided; your engineering begins at Section 1.
# Copy and run the complete data generation code from Ed Colab (no implementation needed)
Deconstruct the target query. Extract join keys, projection columns, and aggregation predicates.
The core of the engine. Implement the partitioning and merge logic from first principles, and handle many-to-many relationships correctly (see the sort-merge sketch after this list for how equal-key runs pair up).
Analyze Parquet metadata (row counts, file sizes) to decide join order and algorithm selection (e.g., HPJ if the build-side hash table fits in memory, otherwise SMJ).
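A simplified sketch of the merge-intersection step referenced above, shown only to illustrate how duplicate keys on both sides (many-to-many) expand into the cross product of each equal-key run; it assumes plain Python lists of dicts already sorted by the join key and is not a drop-in solution:
def sort_merge_join(left, right, key):
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i][key] < right[j][key]:
            i += 1
        elif left[i][key] > right[j][key]:
            j += 1
        else:
            k = left[i][key]
            # Find the run of rows sharing this key on each side...
            i_end = i
            while i_end < len(left) and left[i_end][key] == k:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][key] == k:
                j_end += 1
            # ...and emit every pairing of the two runs (many-to-many).
            for li in range(i, i_end):
                for rj in range(j, j_end):
                    out.append({**left[li], **right[rj]})
            i, j = i_end, j_end
    return out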
import time

# Benchmark at different scales (e.g., 100MB and 1GB) and record the results.
start = time.time()
# ... run your query pipeline here ...
print(f"Query time: {time.time() - start:.2f}s")
See FAQs below for: Parquet compression options, dataset generation, performance metrics, query optimization strategies, and more implementation details.
Quality work gets 95/100: Complete all sections with working code and clear comments. This is a high A and what most students should target.
Higher credit requires exceptional work: Only pursue it if you're genuinely excited about data systems. The extra tens of hours won't significantly raise your grade, but they will deepen your data systems skills and strengthen the artifact you can show in technical interviews.
Parquet is a columnar storage format that provides excellent compression and query performance. For columnar storage concepts, see Storage Layout lesson.
Key Parquet benefits for this project: columnar layout (read only the columns you need), built-in compression (Snappy by default), and self-describing metadata (schema, row counts, row-group sizes) you can inspect without loading any data.
All sizes refer to compressed Parquet files on disk: use os.path.getsize('file.parquet') or ls -lh *.parquet to verify.
# Essential Parquet operations with PyArrow
import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd
# 1. Write with compression options
df.to_parquet('songs.parquet',
              compression='snappy',  # or 'gzip', 'brotli', 'lz4'
              row_group_size=50000)  # rows per group
# 2. Read only specific columns (columnar advantage!)
columns = ['song_id', 'title']
df = pd.read_parquet('songs.parquet', columns=columns)
# 3. Inspect metadata without reading data
parquet_file = pq.ParquetFile('songs.parquet')
print(f"Num row groups: {parquet_file.num_row_groups}")
print(f"Schema: {parquet_file.schema}")
print(f"Metadata: {parquet_file.metadata}")
# 4. Read in chunks for large files
parquet_file = pq.ParquetFile('large_table.parquet')
for batch in parquet_file.iter_batches(batch_size=10000):
    df_chunk = batch.to_pandas()
    # Process chunk
| Compression | Speed | Ratio | Best For |
|---|---|---|---|
| Snappy | Very Fast | ~2-4x | Default, balanced |
| Gzip | Slow | ~4-8x | Max compression |
| LZ4 | Fastest | ~2-3x | Speed critical |
Recommendation: Use Snappy (default) for this project - good balance of speed and compression.
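If you want to see these trade-offs on your own data, here is a quick experiment (codec availability depends on your pyarrow build):
import os
import pandas as pd

df = pd.read_parquet('songs.parquet')
for codec in ['snappy', 'gzip', 'lz4']:
    out_path = f'songs_{codec}.parquet'
    df.to_parquet(out_path, compression=codec)
    print(f"{codec}: {os.path.getsize(out_path) / 1e6:.1f} MB on disk")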
These columns simulate real-world data bloat. They should be generated by the provided data generation code but never read by your query engine: prune them by passing an explicit columns= list when loading.
No! "Hand parsing" means extracting components from the specific query provided. Hardcode the parsing for this query structure - extract joins, group by, aggregations, and order by.
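For example, a hardcoded plan for the target query might look like the dictionary below (field names are illustrative choices, not a required format):
QUERY_PLAN = {
    'tables': ['Songs', 'Listens', 'Users'],
    'joins': [
        ('Songs', 'Listens', 'song_id'),   # s.song_id = l.song_id
        ('Listens', 'Users', 'user_id'),   # l.user_id = u.user_id
    ],
    'projection': ['song_id', 'title', 'user_id', 'age'],  # the only columns you need to read
    'group_by': ['song_id', 'title'],
    'aggregations': {'avg_age': ('age', 'mean'),
                     'unique_listeners': ('user_id', 'nunique')},
    'order_by': [('unique_listeners', 'desc'), ('song_id', 'asc')],
}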
Memory usage guidelines:
Rule of thumb: Pandas uses ~5-10x the file size in RAM. A 1GB Parquet file might need 5-10GB RAM when loaded as DataFrame.
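One way to sanity-check that ratio on your own files (assumes the file is available locally):
import os
import pandas as pd

on_disk = os.path.getsize('songs.parquet')
in_ram = pd.read_parquet('songs.parquet').memory_usage(deep=True).sum()  # bytes, incl. string data
print(f"disk: {on_disk / 1e6:.1f} MB, RAM: {in_ram / 1e6:.1f} MB, ratio: {in_ram / on_disk:.1f}x")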
| Metric | What to Measure | How to Measure |
|---|---|---|
| Execution Time | Total query time | time.time() or %%time in Jupyter |
| I/O Time | Parquet read/write time | Measure separately from compute |
| Memory Usage | Peak memory consumption | memory_profiler or psutil |
| Compression Ratio | Parquet vs raw size | File sizes on disk |
| Join Selectivity | Output rows / input rows | Count rows at each stage |
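For memory, a lightweight option is to sample the process RSS with psutil around each stage (assumes psutil is available in your environment):
import psutil

def rss_mb():
    # Resident set size of the current process, in megabytes.
    return psutil.Process().memory_info().rss / 1e6

before = rss_mb()
# ... run a join stage here ...
print(f"Memory delta: {rss_mb() - before:.0f} MB (current RSS: {rss_mb():.0f} MB)")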
Consider these optimizations:
Simple cost model approach:
def choose_algorithm(table1_size, table2_size, available_memory):
    # Check if the smaller table fits in memory for HPJ
    smaller_size = min(table1_size, table2_size)
    if smaller_size * 5 < available_memory:  # 5x for pandas overhead
        return "HPJ"  # Hash join is faster
    else:
        return "SMJ"  # Sort-merge for limited memory

# Also consider:
# - Is data already sorted? → SMJ
# - Need sorted output for GROUP BY? → SMJ
# - Very skewed join keys? → SMJ handles better
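As an example of feeding this cost model from Parquet metadata (a rough heuristic: total_byte_size reports the uncompressed size of each row group's column data):
import pyarrow.parquet as pq

def estimated_in_memory_bytes(path):
    # Sum the uncompressed byte size reported for every row group.
    md = pq.ParquetFile(path).metadata
    return sum(md.row_group(i).total_byte_size for i in range(md.num_row_groups))

algo = choose_algorithm(estimated_in_memory_bytes('songs.parquet'),
                        estimated_in_memory_bytes('listens.parquet'),
                        available_memory=12 * 1024**3)  # ~12GB Colab budget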
Use intermediate files when a partitioned or sorted intermediate result won't fit comfortably in memory, or when it would be expensive to recompute. Example: For SMJ, save sorted tables as intermediate Parquet files so you don't have to re-sort if the query fails.
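A small sketch of that caching idea (file paths and the helper name are illustrative):
import os
import pandas as pd

def load_sorted(path, key, cache_path, columns=None):
    # Reuse the sorted copy if an earlier run already produced it.
    if os.path.exists(cache_path):
        return pd.read_parquet(cache_path)
    df = pd.read_parquet(path, columns=columns).sort_values(key)
    df.to_parquet(cache_path, compression='snappy')  # persist for later runs
    return df

listens_sorted = load_sorted('listens.parquet', 'song_id',
                             'listens_sorted_by_song.parquet',
                             columns=['song_id', 'user_id'])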
After your join completes:
# Option 1: Use pandas groupby (simple)
result = joined_df.groupby(['song_id', 'title']).agg({
    'age': 'mean',
    'user_id': lambda x: x.nunique()
})

# Option 2: Manual aggregation (shows understanding)
groups = {}
for row in joined_data:
    key = (row['song_id'], row['title'])
    if key not in groups:
        groups[key] = {'ages': [], 'users': set()}
    groups[key]['ages'].append(row['age'])
    groups[key]['users'].add(row['user_id'])
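One possible way to finish Option 2, turning the groups dict into the query's output order (COUNT(DISTINCT) descending, then song_id ascending):
rows = [
    {'song_id': song_id, 'title': title,
     'avg_age': sum(g['ages']) / len(g['ages']),   # AVG(u.age)
     'unique_listeners': len(g['users'])}          # COUNT(DISTINCT l.user_id)
    for (song_id, title), g in groups.items()
]
rows.sort(key=lambda r: (-r['unique_listeners'], r['song_id']))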
Several approaches:
⚡ Rule: Use pd.merge() ONLY after you've implemented your own partitioning logic
# Allowed: your partitioning + pd.merge() for the final per-partition step
partitions = my_hash_partition(left_df, right_df, key)
results = []
for left_part, right_part in partitions:
    results.append(pd.merge(left_part, right_part, on=key))

# NOT allowed: pd.merge() as your entire algorithm
return pd.merge(left_df, right_df, on=key)
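For reference, a minimal sketch of what the my_hash_partition helper above could look like (illustrative only; assumes integer join keys and an arbitrary partition count):
import pandas as pd

def my_hash_partition(left_df, right_df, key, num_partitions=8):
    # Rows with equal keys land in the same partition on both sides,
    # so each partition pair can be joined independently within the memory budget.
    left_ids = left_df[key] % num_partitions
    right_ids = right_df[key] % num_partitions
    for p in range(num_partitions):
        yield left_df[left_ids == p], right_df[right_ids == p]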
1. Misusing pd.merge():
2. Not Handling Duplicate Keys in Joins:
3. Not Using Metadata Before Loading:
4. Wrong Section Organization:
5. Incorrect Performance Comparisons:
Debugging strategy:
AI can help with specific technical tasks (see AI Policy):
Important: The join algorithms (HPJ and SMJ) were covered in detail in class. Implement these based on your lectures and understanding, not AI assistance.
Test these critical cases:
Yes! Teams of 2 are allowed, or you can work solo. See our comprehensive Partnership Policy for details on collaboration, shared repositories, Gradescope submission, late day policies, and conflict resolution.
This is a Systems class. The objective is to understand the mechanics of hash tables and sort-merge algorithms. pd.merge() is a high-level black box that bypasses the learning objectives.
Submit one .ipynb notebook to Gradescope. Important: Do not clear the output cells. We must see your execution times and final query results tables in the notebook itself.
You are expected to follow the Stanford Honor Code. Original engineering is required. Do not consult existing columnar engine source code (e.g., DuckDB, Polars) to copy algorithms. You may discuss high-level concepts with classmates, but all implementation must be yours.