Welcome to your first major challenge at Startup, Inc.! Dr. Gru needs the next big opportunity to invest in. You're going to pick a big public dataset, ask three sharp questions a PM would ask, and answer them with SQL in a Colab notebook. Along the way you'll also hand-draw two debug tables on paper to trace how your queries actually execute, row by row. More on that below.
Past favorites: COVID-19 spread, StackOverflow users, GitHub repos, NYC taxi tips, crypto blockchain, climate change, election ads.
Ambitious example: one student asked "When is the best time to get away with murder?" and joined Chicago crime (1.3 GB), EPA temperature (1.8 GB), and GDELT news (204 GB). Dataset scale on its own isn't graded, but that level of curiosity tends to produce the best projects.
Submit your .ipynb to Gradescope, due Oct 16. That's three weeks of work, roughly paced to the class lectures. Basic SQL lands in Week 1. Intermediate SQL (CTEs, window functions, subqueries) arrives by Week 2. Jot down AI surprises as they happen. You'll turn those into the Takeaways section at the end.
Sep 29 through Oct 5, spend 4 to 5 hours.
Pick a dataset that actually excites you. Confirm it meets the 500 MB / 3-table bar. Star it
in the BigQuery console. Sketch 3 draft questions. Run a handful of
SELECT … LIMIT 20 probes to learn the schema. Start a short scratch note for AI
surprises as you hit them.
Oct 6 through Oct 12, spend 7 to 8 hours.
Keep refining your dataset pick if the early queries aren't landing. Answer your 3 questions with JOINs, CTEs, and correlated subqueries. That's most of the 10-query requirement done by Sunday. Keep noting any AI surprises, especially the times AI got something wrong and you caught it.
Oct 13 through Oct 16, spend 5 to 7 hours.
Add the 3 window queries, including the RANK vs. ROW_NUMBER comparison. Hand-draw your two debug tables (one window function, one correlated subquery), photograph them, and embed them. Write the AI Takeaways section. Turn your scratch notes into 2 or 3 anecdotes plus one plain-English query walkthrough. Make sure every cell has run with output, TOC is visible, and submit to Gradescope.
Three window queries using OVER(...), including a RANK vs. ROW_NUMBER comparison.
Two hand-drawn traces, photographed and embedded in the notebook next to the queries they trace. Shorthand is fine (3 to 5 rows plus "…N more rows"). Show the edge cases that matter.
A short reflection at the end of your notebook. Keep it informal. We're looking for evidence that you used AI as a collaborator and understood what ended up in your notebook.
Quality work gets 95/100: meet every requirement with working, commented SQL and a clean notebook. This is a high A and what most students should target.
Above 95 is rare. It's reserved for genuine insight: a query result that changes how you think about the data, or a writeup that stands on its own. We don't reward bigger datasets or more queries for their own sake.
Follow this formula: [ACTION VERB] + [SPECIFIC METRIC] + [COMPARISON/CONDITION] + [TIMEFRAME/SEGMENT]. For example: "Compare average tip percentage between airport and non-airport pickups, by month, over the last full year of data." Each question scores 0–3: one point if it uses 2+ tables with a JOIN, one if it has a WHERE/HAVING with 2+ conditions, and one if the output could change a decision.
No. 10 distinct queries minimum. One query can combine techniques (a CTE + a window function in the same SELECT) but it only counts once toward the query total.
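To see what "combining techniques in one query" looks like, here is a minimal sketch that chains a CTE into a window function. It runs locally on SQLite (a stand-in for BigQuery SQL here), and the table and column names are made up for illustration:

```python
import sqlite3

# Hypothetical mini-table. One query that feeds a CTE into a window
# function is still a single query toward the total of 10.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE trips (driver TEXT, fare REAL);
INSERT INTO trips VALUES ('x', 10), ('x', 30), ('y', 25);
""")
query = """
WITH totals AS (                        -- CTE: total fare per driver
    SELECT driver, SUM(fare) AS total
    FROM trips
    GROUP BY driver
)
SELECT driver, total,
       RANK() OVER (ORDER BY total DESC) AS rnk  -- window over the CTE
FROM totals
"""
rows = conn.execute(query).fetchall()
```

The same shape (CTE feeding a window function) works unchanged in BigQuery; only the dataset reference in FROM would differ.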
Anything with an OVER(...) clause. Plain AVG(col) without OVER
does not count. Example:
ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY date). To make RANK vs. ROW_NUMBER
differ, order by a non-unique column so ties exist: RANK skips numbers after ties (1,1,3),
ROW_NUMBER doesn't (1,2,3).
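You can check the tie behavior described above locally before writing the BigQuery version. This sketch uses SQLite (whose window-function syntax matches here) with made-up scores that tie on the ORDER BY column:

```python
import sqlite3

# Ties on the ORDER BY column are what make RANK and ROW_NUMBER diverge.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE scores (name TEXT, pts INTEGER);
INSERT INTO scores VALUES ('a', 90), ('b', 90), ('c', 80);
""")
rows = conn.execute("""
SELECT name,
       RANK()       OVER (ORDER BY pts DESC) AS rnk,
       ROW_NUMBER() OVER (ORDER BY pts DESC) AS rn
FROM scores
""").fetchall()
# RANK yields 1, 1, 3 (skips a number after the tie);
# ROW_NUMBER yields 1, 2, 3 (breaks the tie arbitrarily).
```

If you order by a unique column instead, the two functions return identical output and the comparison requirement isn't demonstrated.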
10 to 20 rows per query with LIMIT. Show the full summary for aggregations. This is
a SQL-only project. No Python or pandas analysis needed.
Popular BigQuery public datasets that meet the 500 MB and 3-tables bar. These are starting points. The best projects come from genuine curiosity.
bigquery-public-data.stackoverflow
Tables: posts, users, comments, badges. About 50 GB.
bigquery-public-data.new_york_taxi_trips
Tables: yellow and green trips. About 100 GB.
bigquery-public-data.chicago_crime
Tables: crime and socioeconomic data. About 1.5 GB.
bigquery-public-data.github_repos
Tables: commits, files, contents, languages. About 3 TB (use samples).
bigquery-public-data.covid19_*
Regional datasets; varies by region.
bigquery-public-data.noaa_*
Tables: gsod, stations, hurricanes. About 100 GB.
No. Always query public datasets directly by reference
(bigquery-public-data.stackoverflow). Star them in the console for easy access. This is
how you'll work with production data in industry.
CS145 provides a $50 credit. That's plenty for Project 1 and Project 2 combined (about $5 per TB
processed, giving you 10 TB of headroom). Before running a query, check "This query will process
X GB" in the top-right of the console. On big datasets like GitHub (3 TB), always filter by date,
use TABLESAMPLE SYSTEM (10 PERCENT), and avoid SELECT *. Monitor
billing via Administration → Billing.
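The arithmetic behind that headroom is worth internalizing. A rough back-of-envelope helper, assuming the ~$5 per TB on-demand rate quoted above and 1 TB = 1024 GB:

```python
PRICE_PER_TB = 5.00  # approximate on-demand rate assumed above

def query_cost(gb_processed: float) -> float:
    """Dollars charged for a query that scans gb_processed gigabytes."""
    return gb_processed / 1024 * PRICE_PER_TB

# A full scan of the 1.5 GB Chicago crime dataset costs under a cent;
# a full scan of the 3 TB GitHub dataset burns $15 of the $50 credit.
print(round(query_cost(1.5), 4))
print(round(query_cost(3 * 1024), 2))
```

This is why the date filters and TABLESAMPLE matter on the big datasets: a handful of careless full scans of GitHub would exhaust the credit.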
Use AI freely. The real bar is whether you can talk through your own query in plain English. That skill is what data jobs actually lean on, even more than typing the SQL yourself. See the AI Policy for the full rule.
Short and informal. Think half a page, not a report. You need 2 or 3 anecdotes (a few sentences each) and one plain-English walkthrough of a complex query you built.
Good anecdotes capture a moment. A few shapes that tend to work:
…COUNT(*) on each side."

For the walkthrough, pick one CTE, window function, or correlated subquery. Write 2 or 3 sentences saying what it does in plain English. Aim for an explanation a reader who doesn't know SQL yet could still follow.
A paper debug table is a row-by-row trace of how your SQL executes. It's the same format you saw in the SELECT-FROM-WHERE and LLM Debug lessons. You don't trace every row. Use shorthand:
For RANK() on 1000 users, show top 3, ties if any, bottom 2. That's enough to verify logic.
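For the correlated-subquery trace, it helps to run a tiny version first and confirm your hand trace matches. A sketch with made-up users/posts tables, run on SQLite as a local stand-in:

```python
import sqlite3

# The subquery re-executes once per outer row, with u.id bound each
# time. That per-row re-execution is exactly what your paper trace
# should show, one outer row at a time.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER, name TEXT);
CREATE TABLE posts (user_id INTEGER, title TEXT);
INSERT INTO users VALUES (1, 'ana'), (2, 'bo');
INSERT INTO posts VALUES (1, 'p1'), (1, 'p2'), (2, 'p3');
""")
rows = conn.execute("""
SELECT u.name,
       (SELECT COUNT(*) FROM posts p
        WHERE p.user_id = u.id) AS n_posts  -- correlated on u.id
FROM users u
""").fetchall()
```

Trace it by writing one line per outer row: the current u.id, the posts rows the subquery matches, and the count it returns. Then compare against `rows`.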
No LLM credits are provided. Use free tiers: ChatGPT, Claude.ai, Gemini, or GitHub Copilot (free with a student account). Ask focused questions, not "write my whole project." If you hit rate limits, that's a feature. Practice without AI.
Due Oct 16. Submit to Gradescope:
.ipynb file, all cells run with output visible

You must follow the Stanford Honor Code: