Project 1: BigQuery's First Heist

CS145: Stanford University, Fall 2026

Mission Briefing

Welcome to your first major challenge at Startup, Inc.! Dr. Gru needs the next big opportunity to invest in. You're going to pick a big public dataset, pose three sharp questions a PM would ask, and answer them with SQL in a Colab notebook. Along the way you'll also hand-draw two debug tables on paper to trace how your queries actually execute, row by row. More on that below.

Dataset Inspiration

Past favorites: COVID-19 spread, StackOverflow users, GitHub repos, NYC taxi tips, crypto blockchain, climate change, election ads.

Ambitious example: one student asked "When is the best time to get away with murder?" and joined Chicago crime (1.3 GB), EPA temperature (1.8 GB), and GDELT news (204 GB). Dataset scale on its own isn't graded, but that level of curiosity tends to produce the best projects.

Requirements at a Glance

Estimated Work-Back Schedule

Due Oct 16. Three weeks of work, roughly paced to the class lectures. Basic SQL lands in Week 1. Intermediate SQL (CTEs, window functions, subqueries) arrives by Week 2. Jot down AI surprises as they happen. You'll turn those into the Takeaways section at the end.

Week 1 · Pick & Explore

Sep 29 through Oct 5, spend 4 to 5 hours.

Pick a dataset that actually excites you. Confirm it meets the 500 MB / 3-table bar. Star it in the BigQuery console. Sketch 3 draft questions. Run a handful of SELECT … LIMIT 20 probes to learn the schema. Start a short scratch note for AI surprises as you hit them.
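
A first probe can be tiny. Here's a sketch against the StackOverflow public dataset (column names match the current public schema, but verify them in the console before leaning on them):

```sql
-- Peek at a few rows to learn the schema before committing to questions.
SELECT id, display_name, reputation, creation_date
FROM `bigquery-public-data.stackoverflow.users`
LIMIT 20;
```

Run one of these per table you might use; five minutes of probing saves hours of rewriting later.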

Week 2 · Core SQL

Oct 6 through Oct 12, spend 7 to 8 hours.

Keep refining your dataset pick if the early queries aren't landing. Answer your 3 questions with JOINs, CTEs, and correlated subqueries. That's most of the 10-query requirement done by Sunday. Keep noting any AI surprises, especially the times AI got something wrong and you caught it.
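
A typical Week-2 query shape, sketched on the StackOverflow dataset (the question itself is illustrative, not a required one): a JOIN plus a correlated subquery, where the inner query references the outer row.

```sql
-- Users whose answers score above their own average question score.
SELECT u.display_name, a.score
FROM `bigquery-public-data.stackoverflow.users` AS u
JOIN `bigquery-public-data.stackoverflow.posts_answers` AS a
  ON a.owner_user_id = u.id
WHERE a.score > (
  SELECT AVG(q.score)
  FROM `bigquery-public-data.stackoverflow.posts_questions` AS q
  WHERE q.owner_user_id = u.id  -- correlated: re-evaluated per outer user
)
LIMIT 20;
```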

Week 3 · Windows, Debug Table, Polish

Oct 13 through Oct 16, spend 5 to 7 hours.

Add the 3 window queries, including the RANK vs. ROW_NUMBER comparison. Hand-draw your two debug tables (one window function, one correlated subquery), photograph them, and embed them. Write the AI Takeaways section. Turn your scratch notes into 2 or 3 anecdotes plus one plain-English query walkthrough. Make sure every cell has run with output and the TOC is visible, then submit to Gradescope.

Expected Total: 15–20 hours solo.

Why This Project

Rubric (100 pts)

Dataset & Questions 20 pts

SQL Queries 45 pts

Paper Debug Tables 15 pts

Two hand-drawn traces, photographed and embedded in the notebook next to the queries they trace. Shorthand is fine (3 to 5 rows plus "…N more rows"). Show the edge cases that matter.

AI Takeaways 10 pts

A short reflection at the end of your notebook. Keep it informal. We're looking for evidence that you used AI as a collaborator and understood what ended up in your notebook.

Notebook Hygiene 10 pts

Base Grading Model

Quality work gets 95/100: meet every requirement with working, commented SQL and a clean notebook. This is a high A and what most students should target.

Above 95 is rare. It's reserved for genuine insight: a query result that changes how you think about the data, or a writeup that stands on its own. We don't reward bigger datasets or more queries for their own sake.

Frequently Asked Questions

Questions & SQL

Q: What makes a good question?

Follow this formula: [ACTION VERB] + [SPECIFIC METRIC] + [COMPARISON/CONDITION] + [TIMEFRAME/SEGMENT]. Each question scores 0–3: (1) uses 2+ tables with a JOIN, (1) has WHERE/HAVING with 2+ conditions, (1) output could change a decision.

Good
  • IDENTIFY drivers WHO earn less on rainy days DESPITE driving more hours
  • FIND neighborhoods WHERE crime drops WHEN temperature > 90°F
  • DETECT StackOverflow users WHO ask 80% of questions BUT never answer
Bad
  • How many crimes happened?
  • What is the average trip distance?
  • List all users with >100 posts

Q: Can one query count for multiple requirements?

No. 10 distinct queries minimum. One query can combine techniques (a CTE + a window function in the same SELECT) but it only counts once toward the query total.
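
For example, this sketch (a hypothetical query on the StackOverflow dataset) uses both a CTE and a window function, yet it is still one query toward the total of 10:

```sql
-- CTE aggregates per day; the window computes a trailing 7-day average.
WITH daily AS (
  SELECT DATE(creation_date) AS day, COUNT(*) AS n_posts
  FROM `bigquery-public-data.stackoverflow.posts_questions`
  GROUP BY day
)
SELECT
  day,
  n_posts,
  AVG(n_posts) OVER (
    ORDER BY day
    ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
  ) AS trailing_7day_avg
FROM daily
ORDER BY day
LIMIT 20;
```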

Q: What counts as a "window function"?

Anything with an OVER(...) clause. Plain AVG(col) without OVER does not count. Example: ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY date). To make RANK vs. ROW_NUMBER differ, order by a non-unique column so ties exist: RANK skips numbers after ties (1,1,3), ROW_NUMBER doesn't (1,2,3).
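
A minimal side-by-side sketch (on the StackOverflow users table, where reputation values are non-unique so ties occur):

```sql
-- Same ORDER BY, two numbering schemes: RANK shares a rank across ties
-- and skips the next numbers; ROW_NUMBER breaks ties arbitrarily.
SELECT
  display_name,
  reputation,
  RANK()       OVER (ORDER BY reputation DESC) AS rep_rank,
  ROW_NUMBER() OVER (ORDER BY reputation DESC) AS rep_rownum
FROM `bigquery-public-data.stackoverflow.users`
LIMIT 10;
```

Comparing the two columns on any tied reputation value is exactly the kind of edge case your paper debug table should show.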

Q: How much output should I show?

10 to 20 rows per query with LIMIT. Show the full summary for aggregations. This is a SQL-only project. No Python or pandas analysis needed.

Datasets & BigQuery

Q: Which datasets do students typically use?

Popular BigQuery public datasets that meet the 500 MB / 3-table bar. These are starting points. The best projects come from genuine curiosity.

StackOverflow

bigquery-public-data.stackoverflow

Tables: posts, users, comments, badges. About 50 GB.

NYC Taxi Trips

bigquery-public-data.new_york_taxi_trips

Tables: yellow and green trips. About 100 GB.

Chicago Crime

bigquery-public-data.chicago_crime

Tables: crime and socioeconomic data. About 1.5 GB.

GitHub Activity

bigquery-public-data.github_repos

Tables: commits, files, contents, languages. About 3 TB (use samples).

COVID-19

bigquery-public-data.covid19_*

Regional datasets; size varies by region.

NOAA Weather

bigquery-public-data.noaa_*

Tables: gsod, stations, hurricanes. About 100 GB.

Q: Should I copy BigQuery tables into my own dataset?

No. Always query public datasets directly by reference (bigquery-public-data.stackoverflow). Star them in the console for easy access. This is how you'll work with production data in industry.
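
Concretely, just fully qualify the table in your FROM clause; no copy step is needed:

```sql
-- Query the shared public dataset in place, by its fully qualified name.
SELECT COUNT(*) AS n_users
FROM `bigquery-public-data.stackoverflow.users`;
```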

Q: How much will BigQuery cost? Will I run out of credits?

CS145 provides a $50 credit. That's plenty for Project 1 and Project 2 combined (about $5 per TB processed, giving you 10 TB of headroom). Before running a query, check "This query will process X GB" in the top-right of the console. On big datasets like GitHub (3 TB), always filter by date, use TABLESAMPLE SYSTEM (10 PERCENT), and avoid SELECT *. Monitor billing via Administration → Billing.
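
The three habits combine naturally in one query. A sketch on the Chicago crime table (column names match the public schema, but confirm in the console; the date cutoff is arbitrary):

```sql
-- Named columns instead of SELECT *, a date filter, and a 10% sample.
SELECT primary_type, date
FROM `bigquery-public-data.chicago_crime.crime` TABLESAMPLE SYSTEM (10 PERCENT)
WHERE date >= TIMESTAMP('2020-01-01')
LIMIT 20;
```

Note that LIMIT alone does not reduce bytes scanned; the column list, filter, and sample are what shrink the "This query will process X GB" estimate.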

AI Usage & Debug Tables

Q: How much AI is too much?

Use AI freely. The real bar is whether you can talk through your own query in plain English. That skill is what data jobs actually lean on, even more than typing the SQL yourself. See the AI Policy for the full rule.

Q: What should the AI Takeaways section look like?

Short and informal. Think half a page, not a report. You need 2 or 3 anecdotes (a few sentences each) and one plain-English walkthrough of a complex query you built.

Good anecdotes capture a moment. A few shapes that tend to work:

  • "AI's first draft of the correlated subquery joined on the wrong key. It returned duplicates until I traced it with a COUNT(*) on each side."
  • "Asking 'why is this slow' vs 'make this faster' gave me totally different answers. The first explained the plan, the second just rewrote my query."
  • "Claude suggested a window function where a GROUP BY would have worked fine. I went with the window because it read more clearly, not because I needed it."

For the walkthrough, pick one CTE, window function, or correlated subquery. Write 2 or 3 sentences saying what it does in plain English. Aim for an explanation a reader who doesn't know SQL yet could still follow.

Q: What are paper debug tables and how do I draw one for 100K+ rows?

A paper debug table is a row-by-row trace of how your SQL executes. It's the same format you saw in the SELECT-FROM-WHERE and LLM Debug lessons. You don't trace every row. Use shorthand:

  • Sample rows. Show 3 to 5 representative rows plus "…(10K more rows)".
  • GROUP BY shorthand. "genre='Rock': 3 rows shown + …(497 more) = 500 total".
  • Edge cases. Include MIN, MAX, NULL, and any ties that affect your logic.
  • Aggregation notation. "AVG(user_1) = (4.5+4.2+3.9)/3 = 4.2".

For RANK() on 1000 users, show top 3, ties if any, bottom 2. That's enough to verify logic.

Q: Which LLMs should I use? Do you provide credits?

No LLM credits are provided. Use free tiers: ChatGPT, Claude.ai, Gemini, or GitHub Copilot (free with a student account). Ask focused questions, not "write my whole project." If you hit rate limits, that's a feature. Practice without AI.

Submission

Due Oct 16. Submit to Gradescope:

Honor Code

You must follow the Stanford Honor Code: