Welcome to your first major challenge at Startup, Inc.! Dr. Gru is looking for the next big investment opportunity.
This project is largely open-ended. Choose a dataset about something you're genuinely curious about!
Past favorites: COVID-19 spread • StackOverflow users • GitHub repos • NYC taxi tips • Crypto blockchain • Climate change • Election ads
(Bonus points example): "When is the best time to get away with murder?" – One student analyzed Chicago crime (1.32GB), EPA temperature (1.8GB), and GDELT news (204GB)!
# Section 1: Project Overview
# Section 2: Dataset Analysis
# Section 3: Get Your Feet Wet
# Section 4: Exploring Central Questions
# Section 5: Takeaways
Follow this formula: [ACTION VERB] + [SPECIFIC METRIC] + [COMPARISON/CONDITION] + [TIMEFRAME/SEGMENT]
Key questions are specific and directly answerable with SQL queries. Supplementary questions provide broader context but may not be directly queryable.
Example for "When's the best time to get away with murder?":
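Purely as an illustration (not the handout's original example), a key question following the formula might be: "Compare homicide counts across hours of the day over the past decade." A minimal query sketch, assuming the `bigquery-public-data.chicago_crime.crime` table and its `date` and `primary_type` columns:

```sql
-- Sketch only: homicides by hour of day since 2014 (column names are assumptions).
SELECT
  EXTRACT(HOUR FROM date) AS hour_of_day,
  COUNT(*) AS homicides
FROM `bigquery-public-data.chicago_crime.crime`
WHERE primary_type = 'HOMICIDE'
  AND date >= TIMESTAMP '2014-01-01'
GROUP BY hour_of_day
ORDER BY homicides DESC
LIMIT 24;
```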
No. You need 12 distinct queries minimum. Each technique requirement needs its own query.
Show 10-20 rows per query using `LIMIT`. For aggregations, show the complete summary.
No. This is a SQL-only project. No Python analysis or visualizations needed.
Popular BigQuery public datasets that meet our requirements (500MB+, 3+ tables):
| Dataset | Tables | Size | Good for |
|---------|--------|------|----------|
| `bigquery-public-data.stackoverflow` | posts, users, comments, badges | 50GB+ | User behavior, expertise patterns |
| `bigquery-public-data.new_york_taxi_trips` | tlc_yellow_trips, tlc_green_trips | 100GB+ | Travel patterns, tip analysis |
| `bigquery-public-data.chicago_crime` | crime, socioeconomic_data | 1.5GB+ | Crime patterns, correlations |
| `bigquery-public-data.github_repos` | commits, files, contents, languages | 3TB+ (use samples!) | Code patterns, language trends |
| `bigquery-public-data.covid19_*` | Multiple regional datasets | Varies by region | Spread patterns, policy impacts |
| `bigquery-public-data.noaa_*` | gsod, stations, hurricanes | 100GB+ | Climate trends, weather impacts |
Remember: Choose what excites YOU! These are just popular starting points. The best projects come from genuine curiosity about the data.
No! The "heist" theme is just for fun. Past students have analyzed COVID, cancer research, crypto, climate change, and many other serious topics. Choose a dataset that genuinely interests you.
No - always query directly using dataset references! You can't realistically copy massive production datasets (100TB-PB scale). Get comfortable working with dataset references like `bigquery-public-data.stackoverflow` - this is how you'll work with production data in industry. Star the datasets in the console for easy access.
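For example, a minimal query against a dataset reference, run in place with no copying (the table and columns below are from the public StackOverflow dataset and are assumptions about what you might explore):

```sql
-- Query the public dataset where it lives; you pay only for the bytes scanned.
SELECT display_name, reputation
FROM `bigquery-public-data.stackoverflow.users`
ORDER BY reputation DESC
LIMIT 10;
```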
In BigQuery console: Dataset → Details → Size. Make sure total is ≥ 500MB.
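If you prefer to check with SQL, here is a sketch using the `__TABLES__` metadata view (availability can vary by dataset; the console path above is the reliable route):

```sql
-- Sum per-table sizes for one dataset and convert bytes to MiB.
SELECT SUM(size_bytes) / POW(2, 20) AS total_mib
FROM `bigquery-public-data.stackoverflow.__TABLES__`;
```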
Don't worry - you have plenty of credits! Here's what you need to know:
Explain the implicit relationships between tables, discuss why formal keys might be missing, and describe how you handle potential data quality issues.
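One way to surface such issues is a quick orphan-key check; a sketch assuming the StackOverflow `comments` and `users` tables:

```sql
-- Count comments whose user_id has no matching users row
-- (an implicit foreign key with no enforced constraint).
SELECT COUNT(*) AS orphan_comments
FROM `bigquery-public-data.stackoverflow.comments` AS c
LEFT JOIN `bigquery-public-data.stackoverflow.users` AS u
  ON c.user_id = u.id
WHERE u.id IS NULL;
```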
Variables from the outer query that are referenced in the inner query's WHERE or JOIN clause. See Correlated Subqueries for detailed examples. For instance, `WHERE l2.user_id = u.user_id` correlates the inner query to the outer query's current user.
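A minimal correlated-subquery sketch in the same spirit, assuming the StackOverflow `users` and `posts_questions` tables:

```sql
-- For each user, the inner query re-runs with that user's id (the correlation).
SELECT
  u.id AS user_id,
  u.display_name,
  (
    SELECT COUNT(*)
    FROM `bigquery-public-data.stackoverflow.posts_questions` AS p
    WHERE p.owner_user_id = u.id   -- references the outer query's current row
      AND p.score > 10
  ) AS high_score_questions
FROM `bigquery-public-data.stackoverflow.users` AS u
LIMIT 20;
```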
Use date filters like `WHERE date > '2024-01-01'`, add `LIMIT` clauses, or use table sampling: `FROM table TABLESAMPLE SYSTEM (10 PERCENT)`.
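Putting those together, a sketch that samples and filters a large table (assuming the Chicago crime table):

```sql
-- Sample ~10% of blocks and filter by date to keep the bytes scanned small while exploring.
SELECT
  DATE(date) AS day,
  COUNT(*) AS incidents
FROM `bigquery-public-data.chicago_crime.crime` TABLESAMPLE SYSTEM (10 PERCENT)
WHERE date >= TIMESTAMP '2024-01-01'
GROUP BY day
ORDER BY day
LIMIT 20;
```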
ORDER BY a non-unique column to create ties. RANK will skip numbers after ties (1,1,3), while ROW_NUMBER won't (1,2,3).
You must use the `OVER` or `PARTITION BY` keywords. Simple aggregations like `AVG(col)` don't count. Valid examples: `ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY date)`, `RANK() OVER (ORDER BY score DESC)`. See the window functions guide.
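For instance, a sketch combining both (assuming the StackOverflow `badges` table); any ties on the same date also illustrate the RANK vs ROW_NUMBER behavior described above:

```sql
-- Rank each user's badges by date: RANK skips after ties (1,1,3), ROW_NUMBER does not (1,2,3).
SELECT
  user_id,
  name AS badge_name,
  date,
  RANK()       OVER (PARTITION BY user_id ORDER BY date) AS badge_rank,
  ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY date) AS badge_row
FROM `bigquery-public-data.stackoverflow.badges`
LIMIT 20;
```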
Use `COUNT(column_name)`, which only counts non-NULL values, unlike `COUNT(*)`, which counts all rows including those with NULLs.
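A quick sketch of the difference, assuming the StackOverflow questions table (where some posts may have a NULL owner):

```sql
-- COUNT(*) counts every row; COUNT(owner_user_id) skips rows where that column is NULL.
SELECT
  COUNT(*) AS all_rows,
  COUNT(owner_user_id) AS rows_with_owner,
  COUNT(*) - COUNT(owner_user_id) AS null_owner_rows
FROM `bigquery-public-data.stackoverflow.posts_questions`;
```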
Remember our dual competency goal (see AI Policy). Use AI to:
But: If you can't explain your query on a whiteboard without AI, you used too much. Always verify you understand by recreating key queries without AI assistance.
Build specific anecdotes with evidence! Surface-level answers like "I used AI to help" won't cut it in interviews. Go deeper - explain the semantic issue you caught, why the AI's approach failed, how you debugged it. These detailed stories show how comfortable and productive you'll be, and your ability to ask the right questions. Remember: jobs don't need expensive copy-pasters!
Focus on your debugging process for verifying correctness! Document how you caught and fixed AI mistakes. See Learning Outcomes for sample starter anecdotes.
Note: AI may or may not make errors on your first simple queries. That's OK! The point is to build the verification habit now, so you're ready when tackling complex queries where AI is more likely to hallucinate or miss edge cases.
What to document:
You've already seen debug tables in SELECT-FROM-WHERE and LLM Debug lessons – those row-by-row traces showing how SQL executes.
For your project, create similar debug tables on paper to verify your complex queries work correctly. This is EXACTLY what you'll do in interviews and exams where you can't run queries. Practice this now while you can verify your work against actual results.
Use shorthand notation! You don't trace every row - use representative samples:
Example: For RANK() on 1000 users, show top 3, ties if any, and bottom 2. That's enough to verify logic!
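For example, a shorthand trace for `RANK()` vs `ROW_NUMBER()` over 1000 users ordered by score might look like this (values invented purely for illustration):

| user_id | score | RANK | ROW_NUMBER |
|---------|-------|------|------------|
| u_17    | 980   | 1    | 1          |
| u_04    | 955   | 2    | 2          |
| u_88    | 955   | 2    | 3          |
| ...     | ...   | ...  | ...        |
| u_51    | 12    | 999  | 999        |
| u_09    | 3     | 1000 | 1000       |

The middle rows are elided; the tie at 955 is enough to confirm that RANK skips a value after the tie while ROW_NUMBER does not.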
No credits provided – use free tiers! Recommended options:
Tips for free tiers:
Submit to Gradescope: your `.ipynb` file.
You must follow the Stanford Honor Code: