Project 1: BigQuery's First Heist

CS145: Stanford University, Fall 2025

Mission Briefing

Welcome to your first major challenge at Startup, Inc.! Dr. Gru is looking for the next big investment opportunity.

Your Mission

Pick Your Favorite Topic

This project is largely open-ended. Choose a dataset about something you're genuinely curious about!

Past favorites: COVID-19 spread • StackOverflow users • GitHub repos • NYC taxi tips • Crypto blockchain • Climate change • Election ads

Bonus-points example: "When is the best time to get away with murder?" – One student analyzed Chicago crime (1.32GB), EPA temperature (1.8GB), and GDELT news (204GB)!

Requirements Summary

Why This Project

Detailed Rubric

0. Readability & Organization Up to -15 points if poor

Section 1: Project Overview 5 points

# Section 1: Project Overview

Section 2: Dataset Analysis 10 points

# Section 2: Dataset Analysis
## Detailed Overview
-- SQL: Show table sizes, row counts
## Table Relationships
-- SQL: Explore keys and joins
## Data Issues
-- SQL: Check for NULLs, duplicates
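A data-issues check can be as simple as grouping on a candidate key. Here is a minimal sketch using the StackOverflow public dataset; swap in your own table and key column:

```sql
-- Sketch: duplicate check on a candidate key.
-- An empty result means id is unique in this table.
SELECT id, COUNT(*) AS copies
FROM `bigquery-public-data.stackoverflow.posts_questions`
GROUP BY id
HAVING COUNT(*) > 1
LIMIT 10;
```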

Section 3: Get Your Feet Wet 35 points

# Section 3: Get Your Feet Wet
## Subqueries
-- 2 queries with scoped variables
## CTEs
-- 2 queries with WITH clauses
## Window Functions
-- 3 queries with OVER, including RANK vs ROW_NUMBER
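One of the three window-function queries might look like this sketch, which ranks each user's questions by score. Table and column names are from the public StackOverflow schema; adapt them to your dataset:

```sql
-- Sketch: per-user ranking with PARTITION BY (columns assumed).
SELECT
  owner_user_id,
  id AS question_id,
  score,
  ROW_NUMBER() OVER (PARTITION BY owner_user_id ORDER BY score DESC) AS nth_best
FROM `bigquery-public-data.stackoverflow.posts_questions`
WHERE owner_user_id IS NOT NULL
LIMIT 20;
```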

Section 4: Exploring Central Questions 30 points

# Section 4: Exploring Central Questions
## Question 1: [Your question]
-- SQL query and analysis
## Question 2: [Your question]
-- SQL query and analysis
## Question 3: [Your question]
-- SQL query and analysis

Section 5: Takeaways 10 points

# Section 5: Takeaways

Bonus Up to 5 points

Frequently Asked Questions


Questions & Analysis

Q: What makes a good question?

Follow this formula: [ACTION VERB] + [SPECIFIC METRIC] + [COMPARISON/CONDITION] + [TIMEFRAME/SEGMENT]

GOOD Questions
  • IDENTIFY drivers WHO earn less on rainy days DESPITE driving more hours
  • FIND neighborhoods WHERE crime drops WHEN temperature > 90°F
  • DETECT StackOverflow users WHO ask 80% of questions BUT never answer
BAD Questions
  • How many crimes happened?
  • What is the average trip distance?
  • List all users with >100 posts
Each question scores 0-3 points:
  • 1 point: Uses 2+ tables with proper JOIN
  • 1 point: Includes WHERE/HAVING with 2+ conditions
  • 1 point: Output could change a decision/action
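As an illustration, a query that would earn all three points might look like the sketch below: it joins two tables, filters on two conditions, and produces an actionable list of prolific high-reputation askers. Column names are assumed from the public StackOverflow schema:

```sql
-- Sketch: a 3-point question query (JOIN + 2 conditions + actionable output).
SELECT u.id, u.display_name, COUNT(q.id) AS questions_asked
FROM `bigquery-public-data.stackoverflow.users` AS u
JOIN `bigquery-public-data.stackoverflow.posts_questions` AS q
  ON q.owner_user_id = u.id
WHERE u.reputation > 1000
  AND q.creation_date >= TIMESTAMP '2022-01-01'
GROUP BY u.id, u.display_name
HAVING COUNT(q.id) >= 50
ORDER BY questions_asked DESC
LIMIT 20;
```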

Q: What's the difference between key and supplementary questions?

Key questions are specific and directly answerable with SQL queries. Supplementary questions provide broader context but may not be directly queryable.

Example for "When's the best time to get away with murder?":

  • Key: When do most murders happen? How fast do cops respond? Which neighborhoods have more murders at certain times?
  • Supplementary: How do murder rates change with seasons? What weather sees most murders? What social factors impact rates?

Q: Can one query count for multiple requirements?

No. You need 12 distinct queries minimum. Each technique requirement needs its own query.

Q: How much output should I show?

Show 10-20 rows per query using LIMIT. For aggregations, show the complete summary.

Q: Can I use Python/Pandas for analysis?

No. This is a SQL-only project; no Python analysis or visualizations are required.

Datasets & BigQuery

Q: What datasets do other students typically use?

Popular BigQuery public datasets that meet our requirements (500MB+, 3+ tables):

StackOverflow

Dataset: bigquery-public-data.stackoverflow

Tables: posts, users, comments, badges

Size: 50GB+

Good for: User behavior, expertise patterns

NYC Taxi Trips

Dataset: bigquery-public-data.new_york_taxi_trips

Tables: tlc_yellow_trips, tlc_green_trips

Size: 100GB+

Good for: Travel patterns, tip analysis

Chicago Crime

Dataset: bigquery-public-data.chicago_crime

Tables: crime, socioeconomic_data

Size: 1.5GB+

Good for: Crime patterns, correlations

GitHub Activity

Dataset: bigquery-public-data.github_repos

Tables: commits, files, contents, languages

Size: 3TB+ (use samples!)

Good for: Code patterns, language trends

COVID-19 Data

Dataset: bigquery-public-data.covid19_*

Tables: Multiple regional datasets

Size: Varies by region

Good for: Spread patterns, policy impacts

NOAA Weather

Dataset: bigquery-public-data.noaa_*

Tables: gsod, stations, hurricanes

Size: 100GB+

Good for: Climate trends, weather impacts

Remember: Choose what excites YOU! These are just popular starting points. The best projects come from genuine curiosity about the data.

Q: Does the project have to be villainous/criminal themed?

No! The "heist" theme is just for fun. Past students have analyzed COVID, cancer research, crypto, climate change, and many other serious topics. Choose a dataset that genuinely interests you.

Q: Should I copy BigQuery tables to my personal dataset?

No - always query directly using dataset references! You can't realistically copy massive production datasets (100TB-PB scale). Get comfortable working with dataset references like bigquery-public-data.stackoverflow - this is how you'll work with production data in industry. Star the datasets in the console for easy access.

Q: How do I check my dataset size?

In BigQuery console: Dataset → Details → Size. Make sure total is ≥ 500MB.

Q: How much will BigQuery queries cost? Will I run out of credits?

Don't worry - you have plenty of credits! Here's what you need to know:

  • Cost: ~$5 per TB (1000 GB) processed
  • Your credits: CS145 provides $50 - enough for 10TB of queries (plenty for Project 1 & 2)
  • Check before running: Look at top-right of query editor for "This query will process X GB"
  • Avoid surprises: For bigger datasets like GitHub (3TB+), always use TABLESAMPLE or date filters. Don't use SELECT *; project only the columns you need, since BigQuery bills by the columns it scans. Use LIMIT 100 to keep output readable (note: LIMIT reduces output rows, not bytes billed).
  • Monitor usage: BigQuery Console → Administration → Billing → View detailed charges
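Putting those tips together, a cheap exploratory query filters early and selects only what it needs. This sketch uses the NYC taxi dataset; the year-suffixed table name and columns are illustrative, so check the schema in the console first:

```sql
-- Sketch: date filter + narrow column projection to limit bytes scanned.
SELECT pickup_datetime, tip_amount
FROM `bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2018`
WHERE pickup_datetime >= TIMESTAMP '2018-06-01'
LIMIT 100;
```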

Q: What if my dataset has no formal keys?

Explain the implicit relationships between tables, discuss why formal keys might be missing, and describe how you handle potential data quality issues.

SQL Technical

Q: What exactly are "scoped variables" in subqueries?

Variables from the outer query that are referenced in the inner query's WHERE or JOIN clause. See Correlated Subqueries for detailed examples. For instance, WHERE l2.user_id = u.user_id correlates the inner query to the outer query's current user.
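A minimal sketch of a correlated subquery, with columns assumed from the public StackOverflow schema:

```sql
-- Sketch: the inner query references u.creation_date from the outer query,
-- so it is re-evaluated per outer row (a "scoped variable").
SELECT u.id, u.display_name
FROM `bigquery-public-data.stackoverflow.users` AS u
WHERE u.reputation >
  (SELECT AVG(u2.reputation)
   FROM `bigquery-public-data.stackoverflow.users` AS u2
   WHERE u2.creation_date > u.creation_date)  -- scoped variable from outer query
LIMIT 10;
```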

Q: My queries are expensive/slow. What should I do?

Use date filters like WHERE date > '2024-01-01', add LIMIT clauses, or use table sampling: FROM table TABLESAMPLE SYSTEM (10 PERCENT)
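For example, block sampling on a large table looks like this (the GitHub commits table is used for illustration):

```sql
-- Sketch: TABLESAMPLE reads ~10% of the table's blocks,
-- cutting bytes scanned while exploring.
SELECT COUNT(*) AS sampled_commits
FROM `bigquery-public-data.github_repos.commits` TABLESAMPLE SYSTEM (10 PERCENT);
```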

Q: How do I make RANK and ROW_NUMBER differ?

ORDER BY a non-unique column to create ties. RANK will skip numbers after ties (1,1,3), while ROW_NUMBER won't (1,2,3).
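A tiny self-contained sketch makes the difference visible:

```sql
-- Sketch: ties on score make RANK and ROW_NUMBER diverge.
WITH t AS (
  SELECT 'a' AS name, 10 AS score UNION ALL
  SELECT 'b', 10 UNION ALL
  SELECT 'c', 8
)
SELECT
  name,
  RANK()       OVER (ORDER BY score DESC) AS rnk,  -- 1, 1, 3
  ROW_NUMBER() OVER (ORDER BY score DESC) AS rn    -- 1, 2, 3
FROM t;
```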

Q: What counts as a "window function" for the requirements?

You must use OVER or PARTITION BY keywords. Simple aggregations like AVG(col) don't count. Valid examples: ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY date), RANK() OVER (ORDER BY score DESC). See the window functions guide.

Q: How do I count non-null values in a column?

Use COUNT(column_name) which only counts non-NULL values, unlike COUNT(*) which counts all rows including those with NULLs.
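For example (column assumed from the public StackOverflow schema, where owner_user_id can be NULL for deleted users):

```sql
-- Sketch: COUNT(*) vs COUNT(col) on a nullable column.
SELECT
  COUNT(*) AS all_rows,
  COUNT(owner_user_id) AS rows_with_owner,         -- NULLs excluded
  COUNT(*) - COUNT(owner_user_id) AS null_owners
FROM `bigquery-public-data.stackoverflow.posts_questions`;
```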

AI Usage & Debug Tables

Q: How much AI use is too much?

Remember our dual competency goal (see AI Policy). Use AI to:

  • Accelerate learning and explore more complex queries
  • Debug tricky syntax errors quickly
  • Brainstorm creative approaches to problems

But: If you can't explain your query on a whiteboard without AI, you used too much. Always verify you understand by recreating key queries without AI assistance.

Build specific anecdotes with evidence! Surface-level answers like "I used AI to help" won't cut it in interviews. Go deeper - explain the semantic issue you caught, why the AI's approach failed, how you debugged it. These detailed stories show how comfortable and productive you'll be, and your ability to ask the right questions. Remember: jobs don't need expensive copy-pasters!

Q: How should I document my AI usage in the Takeaways?

Focus on your debugging process for verifying correctness! Document how you caught and fixed AI mistakes. See Learning Outcomes for sample starter anecdotes.

Note: AI may or may not make errors on your first simple queries. That's OK! The point is to build the verification habit now, so you're ready when tackling complex queries where AI is more likely to hallucinate or miss edge cases.

What to document:

  • Speedup: "3-level CTE with window functions: 45 mins by hand vs 3 mins with Claude + 10 mins debugging"
  • Verification methods: "Used debug tables to trace JOIN logic, found AI's LEFT JOIN kept NULLs in COUNT(*)"
  • Hallucinations caught: "AI referenced CTE before it was defined" or "Window function in WHERE clause"
  • Prompt refinements: "First prompt gave global average, had to specify 'per-user average' for correlated subquery"
  • Debugging process: "Query returned duplicates, traced through GROUP BY logic on paper, discovered missing DISTINCT"

Q: What are "debug tables" and why do I need them on paper?

You've already seen debug tables in SELECT-FROM-WHERE and LLM Debug lessons – those row-by-row traces showing how SQL executes.

For your project, create similar debug tables on paper to verify your complex queries work correctly. This is EXACTLY what you'll do in interviews and exams where you can't run queries. Practice this now while you can verify your work against actual results.

Q: How do I draw debug tables when my query processes 100,000+ rows?

Use shorthand notation! You don't trace every row - use representative samples:

  • Sample approach: Show 3-5 representative rows, then notation like "... (10K more rows)"
  • GROUP BY shorthand: "genre='Rock': 3 rows shown + ...(497 more) = 500 total"
  • Edge cases focus: Include MIN, MAX, NULL, and tie values that affect your logic
  • Aggregation notation: Write "AVG(user_1): (4.5+4.2+3.9)/3 = 4.2" instead of listing all
  • Pattern indicators: Use "↓" or "..." to show patterns continue

Example: For RANK() on 1000 users, show top 3, ties if any, and bottom 2. That's enough to verify logic!

Q: Which LLMs should I use? Do you provide credits?

No credits provided – use free tiers! Recommended options:

  • ChatGPT (free): Good for SQL basics and debugging
  • Claude.ai (free): Strong at explaining complex queries
  • Google Gemini (free): Helpful for BigQuery-specific syntax
  • GitHub Copilot: Free with student account

Tips for free tiers:

  • Ask focused questions, not "write my whole project"
  • Use AI for learning concepts, then write code yourself
  • If you hit limits, that's a feature! Time to practice without AI

Submission Instructions

Submit to Gradescope:

Honor Code

You must follow the Stanford Honor Code: