CS145: Stanford University, Fall 2026. Data Science Track.
Mission Briefing
Dear Data Scientists, dig deeper! Dr. Gru was impressed by your first round of insights,
but the real treasure is buried. Project 2 is where you push past P1's surface-level answers with new
data, new questions, a predictive model, and a scale analysis.
Where to start
Most students pick up their own Project 1 notebook and keep going. If you're partnering, choose one of
the two P1 notebooks as your team's starting point. If neither feels compelling, you're free to start
fresh with a new dataset, as long as it still meets the Project 1 bar (3+ tables, at least 500 MB).
Whichever route you take, the P1 work you carry over is scaffolding. Every P2 query, visualization, and ML model needs to be new.
What's new in Project 2
One new supplementary table (> 50 MB) joined to your BigQuery data.
2 or 3 new questions that specifically require the new table to answer.
6+ new non-simple queries, and 3+ new visualizations in matplotlib or seaborn.
One BigQuery ML model (linear or logistic regression).
A query analysis section reasoning about I/O cost at scale.
Proposal due Oct 30. Final due Nov 20. Five weeks of work total, about
7 hours a week. Jot down AI surprises as they happen so you can write them up later.
Week 1 · Pick Starting Point & Upload
Oct 17 through Oct 23, spend 6 to 7 hours.
Decide whether you're extending your own P1, your partner's, or starting fresh. Identify a new
supplementary dataset (> 50 MB) and upload it to BigQuery. Map how its keys relate to your
existing tables. Sketch your 2 or 3 new questions.
Week 2 · Proposal
Oct 24 through Oct 30, spend 6 to 7 hours.
Draft your first two or three queries joining the new table. Write the proposal: dataset choice,
new table, new questions, planned ML model. Submit by Oct 30.
Week 3 · ML Warm-up & Model
Oct 31 through Nov 6, spend 7 to 8 hours.
Complete the ML warm-up Colab. Build your BigQuery ML model. Evaluate it with the metrics that fit
your problem (R² for regression, precision/recall for imbalanced classification).
Week 4 · Queries, Viz, Query Analysis
Nov 7 through Nov 13, spend 7 to 8 hours.
Finish the 6+ non-simple queries and 3+ visualizations. Write the query analysis section with
BigQuery query plans and scaling estimates. Keep noting AI surprises, especially the times AI got
something wrong and you caught it.
Week 5 · Polish & Submit
Nov 14 through Nov 20, spend 4 to 5 hours.
Write the Writeup & Reflection section. Turn your scratch notes into 2 or 3 AI anecdotes plus
one plain-English query walkthrough. Make sure every cell has run with output, section headers
are in place, and submit to Gradescope.
Expected total: 30 to 35 hours (about 7 hours a week).
Why This Project Matters
Prep for Data Science Internships: most interns are asked to expand an existing
analysis, join new datasets, build a simple predictive model, and tell a compelling data story. This
project mirrors that workflow.
Real-world skills: you'll practice joining messy datasets, running BigQuery ML, and
optimizing SQL for cost. These are core to real data science roles.
Portfolio piece: former students have shared this project in interviews as a
conversation starter and evidence of design judgment.
Rubric (100 pts, plus up to 10 bonus)
Section 1: Project Overview 15 pts
5 pts. Declare your starting point. State whether you're extending your own P1,
your partner's P1, or starting fresh with a new dataset.
5 pts. Core analytical questions clearly stated. Name exactly which new
questions or dimensions you're adding in P2.
Section 2: Data Upload 10 pts
Load a CSV as a new table (at least 50 MB) into your dataset so it can join your existing tables.
5 pts. Upload succeeds. Schema and types are correct, and the table is queryable.
5 pts. Explain how the supplementary table enriches the analysis. Specify how
its keys and columns relate to your original tables.
Section 3: SQL + Visualizations 25 pts
Extend your analysis with visualizations. Save SQL outputs to Python, then use matplotlib or seaborn.
15 pts. At least 6 non-simple SQL queries. Non-simple means combining 2+ of
{JOINs, CTEs, window functions, subqueries, complex CASE}.
10 pts. At least 3 meaningful visualizations (scatterplots, histograms, maps,
heatmaps, etc.). Each should directly support an insight from a specific query.
Section 4: BigQuery ML 20 pts
Prerequisite: complete the ML warm-up Colab before you start this section.
10 pts. At least one ML model using BigQuery ML (linear or logistic regression),
predicting something interesting from your data.
10 pts. Report performance metrics (R², accuracy, precision/recall, etc.) and
discuss what the results mean for your analysis.
Section 5: Query Analysis 15 pts
This section bridges data science into systems thinking. You're learning napkin-math so you can
argue for an index in a real production review.
6 pts. Pull query plans for two of your complex SQL queries. Use BigQuery Query
History → Execution details. Report stage number, timing, rows processed, and bytes.
9 pts. Discuss efficiency at 10× and 100× scale. Compute I/O costs for BigSort,
SMJ, and HPJ using the equation sheet. Compare tradeoffs and suggest one optimization you'd try.
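To make the scaling discussion concrete, here is a minimal napkin-math sketch, assuming the standard textbook cost formulas (2B I/Os per sort pass, one partitioning pass for the hash join); if the course equation sheet defines BigSort, SMJ, or HPJ differently, its formulas win. The page counts are made up.
# Hedged napkin math: textbook I/O formulas; defer to the course equation sheet.
import math

def sort_io(b, m):
    # External merge sort: every pass reads and writes all b pages;
    # pass 0 makes ceil(b/m) runs, then (m-1)-way merges until one run remains.
    runs = math.ceil(b / m)
    passes = 1 if runs <= 1 else 1 + math.ceil(math.log(runs, m - 1))
    return 2 * b * passes

def smj_io(b_r, b_s, m):
    # Sort-merge join: sort both inputs, then one merging scan of each.
    return sort_io(b_r, m) + sort_io(b_s, m) + b_r + b_s

def hpj_io(b_r, b_s):
    # Hash-partition join: partition both (read + write), then probe (read),
    # assuming a single partitioning pass fits in memory.
    return 3 * (b_r + b_s)

for scale in (1, 10, 100):  # today, 10x, 100x
    b_r, b_s, m = 2_000 * scale, 10_000 * scale, 100  # hypothetical page counts
    print(f"{scale:>3}x  SMJ={smj_io(b_r, b_s, m):>12,}  HPJ={hpj_io(b_r, b_s):>12,}")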
Section 6: Writeup, Reflection & AI Takeaways 15 pts
A short reflection at the end of your notebook. Keep it informal. We're looking for evidence that
you understood what landed in your notebook and used AI as a collaborator.
5 pts. Writeup. Key learnings from the project, limitations of your analysis,
and one or two ideas for future work.
6 pts. AI anecdotes. Share 2 or 3 short anecdotes about using AI on this
project, a few sentences each. At least one should be a surprise. That could be
something AI got wrong and you caught, something it nailed unexpectedly, or a prompt that
worked way better than another.
4 pts. Explain one query in plain English. Pick a complex query you built
(CTE, window function, or the joined analysis over the new table). In 2 or 3 sentences,
describe what it does in a way a reader who doesn't know SQL could follow.
Base Grading Model
Quality work gets 95/100: complete all sections with working code and clear comments.
This is a high A and what most students should target.
Above 95 is rare. It's reserved for genuine insight: a finding that changes how you think about the
data, or a writeup that stands on its own. We don't reward bigger datasets or more queries for their
own sake.
Adjustments +5 max (top 10%)
+3 Value. Novel integration of more than 1 GB of new data that
materially deepens insights, with joins and rationale clearly explained.
+2 Clarity. Crisp, compelling insights with clear visualizations and
concise explanations.
Bonus: Top Projects Up to +10
+5 pts. Exceptional data integration (more than 10 GB) with
well-justified joins and story.
+5 pts. Exceptional insights and storytelling (top 3 projects only).
Frequently Asked Questions
Getting Started & Scope
Q: What exactly carries over from Project 1?
You can reuse: your project overview, table descriptions, and setup code.
Must be new: all 6 queries, all 3 visualizations, and the ML model. If your P1
already had visualizations, the P2 ones should still be new and more advanced, using the
uploaded table or the ML outputs.
Partners choose one of the two P1 notebooks to extend. You don't merge both. If you're starting
fresh with a new dataset, it still needs to meet the P1 bar (3+ tables, at least 500 MB).
Q: How many questions and tables will I end up with?
Aim for 4+ tables total: your original 3+ from P1 (or equivalent in a fresh start) plus at least
one new uploaded table (at least 50 MB). Add 2 or 3 new questions that specifically require
the new table to answer. The goal is deeper insights from combining datasets, not more of the
same.
Q: What does the Oct 30 proposal look like?
Short. Aim for one page total. The easiest format is a single markdown cell at the top of your
project notebook that covers:
Which starting point you're using (your P1, your partner's P1, or a fresh dataset).
The new supplementary table you plan to upload, including where it's from and its size.
Your 2 or 3 new questions that specifically need the new table.
The ML model you plan to train (linear or logistic regression) and what it predicts.
The proposal is a checkpoint, not a separately graded deliverable. We'll give you feedback so you
can catch scope problems before you've sunk too many hours in.
Q: Will queries cost more than Project 1?
Your $50 BigQuery credit is enough for both projects. Check the estimated bytes scanned before running (top-right of the console), select only the columns you need instead of SELECT * on large tables, and cache results in Python variables so you don't re-run the same query. Note that LIMIT caps the rows returned, not the bytes scanned, so on its own it won't reduce cost.
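You can also check a query's footprint from inside the notebook. A minimal sketch, assuming the standard google-cloud-bigquery client; the table name is a placeholder:
from google.cloud import bigquery

client = bigquery.Client()  # reuses the auth from your setup cell
config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
job = client.query("SELECT col1 FROM `your_dataset.your_table`", job_config=config)
print(f"Would scan {job.total_bytes_processed / 1e9:.2f} GB")  # dry runs bill nothing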
Datasets & Upload
Q: Where can I find datasets > 50 MB?
Kaggle.com
Curated, clean datasets. Sort by size to find > 50 MB files quickly.
Data.gov
Government statistics covering Census, health, and education. Great for enriching
location-based analyses.
BigQuery Public Datasets
Query a public dataset, export to CSV, re-upload to your dataset. Already BigQuery-formatted.
Q: How do I upload a CSV to BigQuery, and what if it fails?
Upload steps: BigQuery Console → Your dataset → Create Table → Source: Upload →
Select CSV. Auto-detect schema usually works.
If it fails, check:
File size: split if > 10 GB
Encoding: use UTF-8
Headers: first row should be column names
Dates: prefer YYYY-MM-DD
Special characters: escape or remove; try "Allow jagged rows" in Advanced options
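If the console upload keeps failing, loading through the Python client exposes the same options programmatically. A minimal sketch; the file name, table name, and options are placeholders to adapt:
from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,     # first row is column names
    autodetect=True,         # infer schema and types
    allow_jagged_rows=True,  # same fix as the Advanced option above
)
with open("supplement.csv", "rb") as f:
    job = client.load_table_from_file(f, "your_dataset.supplement", job_config=job_config)
job.result()  # waits for the load; raises with row-level errors if it fails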
Q: What if my supplementary table doesn't have matching keys?
Common strategies: match on date ranges (same week/month), geographic regions (zip, city), fuzzy
string matching on names, or aggregated groupings that align between tables.
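As one concrete shape for the first two strategies, a minimal sketch that joins on a month bucket plus a city column; every table and column name here is hypothetical:
query = """
SELECT DATE_TRUNC(a.event_date, MONTH) AS month,
       a.city,
       COUNT(*) AS events,
       ANY_VALUE(b.population) AS population
FROM `your_dataset.events` AS a
JOIN `your_dataset.city_stats` AS b
  ON a.city = b.city
 AND DATE_TRUNC(a.event_date, MONTH) = DATE_TRUNC(b.stat_date, MONTH)
GROUP BY month, a.city
"""
df = client.query(query).to_dataframe()  # client from your setup cell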
Q: What if I can't find a 50 MB+ related dataset?
Aggregate multiple smaller related tables into one
Export a filtered subset of a BigQuery public dataset (sketch after this list)
Generate synthetic data that complements your analysis
Use time-series expansions of existing data
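A minimal sketch of that export option: materialize a filtered slice of a public table straight into your own dataset with CREATE TABLE AS SELECT. The public table name and filter are illustrative, so verify the actual schema before running:
query = """
CREATE OR REPLACE TABLE `your_dataset.citibike_2018` AS
SELECT *
FROM `bigquery-public-data.new_york_citibike.citibike_trips`
WHERE EXTRACT(YEAR FROM starttime) = 2018
"""
client.query(query).result()  # runs the CTAS and waits for it to finish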
SQL & Visualizations
Q: What counts as "one query" and as a "non-simple" query?
One query: even a SELECT with multiple CTEs counts as one query. A
CREATE TABLE ... AS (SELECT ...) also counts as one (the SELECT portion is what counts
toward your 6).
Non-simple: combines 2+ of {JOINs, CTEs, window functions, subqueries, complex
CASE}. Think funnels plus JOINs, ladders plus CTEs, or timeline comparisons. Anything you'd see
on a final-round SQL interview.
-- ONE query (even with multiple CTEs):
WITH t1 AS (SELECT * FROM table1),
     t2 AS (SELECT * FROM table2)
SELECT * FROM t1 JOIN t2 ON ...;
Q: What makes a good join query and a good visualization?
Joins must demonstrate a new insight, not just technical ability. Each chart should directly
illustrate findings from a specific query. Include a caption like "This chart shows results from
Query #3, revealing that…"
Good visualization
Clear trend or pattern
Directly answers your question
Labeled axes, title, legend
Bad visualization
Random scatter with no pattern
Unlabeled axes
Too cluttered or unrelated to the analysis
Q: How do I save BigQuery results for visualization?
query = """SELECT ... FROM ..."""
df = client.query(query).to_dataframe()
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 6))
plt.plot(df['x'], df['y'])
plt.show()
Stay with matplotlib/seaborn so everything lives in the notebook. If you have a compelling reason to
use Tableau, include screenshots and explain the choice.
BigQuery ML
Q: How do I create and evaluate a BigQuery ML model?
-- Train
CREATE OR REPLACE MODEL `your_dataset.model_name`
OPTIONS(
  model_type='linear_reg',
  input_label_cols=['target_column']
) AS
SELECT feature1, feature2, feature3, target_column
FROM `your_dataset.your_table`
WHERE training_condition;

-- Evaluate (the eval data must include every feature the model was trained on)
SELECT * FROM ML.EVALUATE(
  MODEL `your_dataset.model_name`,
  (SELECT feature1, feature2, feature3, target_column
   FROM `your_dataset.test_table`)
);
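To close the loop with Section 3, you can chart the model's predictions. A minimal sketch, assuming the placeholders above; BigQuery ML names the output column predicted_<label>:
query = """
SELECT predicted_target_column, feature1
FROM ML.PREDICT(
  MODEL `your_dataset.model_name`,
  (SELECT feature1, feature2, feature3 FROM `your_dataset.test_table`))
"""
df = client.query(query).to_dataframe()  # ready for matplotlib/seaborn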
Q: Can we use AI for this section?
Use AI freely. The real bar is whether you can talk through your own model choices, optimizations, and visualization decisions in plain English. That skill is what data jobs actually lean on, even more than typing the SQL or Python yourself. See the AI Policy for the full rule.
Q: What should the AI Takeaways section look like?
Short and informal. Think half a page, not a report. You need 2 or 3 anecdotes (a few sentences
each) and one plain-English walkthrough of a complex query you built.
Good anecdotes capture a moment. A few shapes that tend to work:
"AI suggested logistic regression for a continuous target. I caught it when ML.EVALUATE
returned accuracy instead of R², and switched to linear regression."
"AI's scatter plot had the axes swapped. I spotted it when the correlation direction didn't
match my intuition from the raw data."
"AI suggested a feature that was almost perfectly correlated with the target. That would have
been data leakage. I dropped it after running a correlation matrix."
Q: What should I double-check on AI-generated ML code?
Model type matches your target (regression vs. classification).
Features make logical sense. No data leakage from features that "know" the target.
Train and test splits are correctly implemented (see the sketch after this list).
Metrics are appropriate for the problem (R² for regression, precision and recall for
imbalanced classification).
Results are reasonable. If your model has 99.9% accuracy, you probably have data leakage.
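For the splits item above, one way to verify (or generate) a leak-free split is to hash a unique key so each row lands deterministically on one side. A minimal sketch; the id column and table names are hypothetical:
train_sql = """
CREATE OR REPLACE TABLE `your_dataset.train_table` AS
SELECT * FROM `your_dataset.your_table`
WHERE MOD(ABS(FARM_FINGERPRINT(CAST(id AS STRING))), 10) < 8   -- ~80%
"""
test_sql = """
CREATE OR REPLACE TABLE `your_dataset.test_table` AS
SELECT * FROM `your_dataset.your_table`
WHERE MOD(ABS(FARM_FINGERPRINT(CAST(id AS STRING))), 10) >= 8  -- ~20%
"""
for sql in (train_sql, test_sql):
    client.query(sql).result()  # deterministic hash means the sides never overlap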
Submission
Proposal due Oct 30. Final notebook due Nov 20. Submit both to
Gradescope.
Your Colab notebook as an .ipynb file. Start from
project2_template.ipynb if you like.
All cells run, outputs visible, no broken queries.
All six section headers present and in order.
Teams: include both partners' names at the top.
Do not submit the ML warm-up notebook. Just complete it for practice.
Honor Code
You must follow the Stanford Honor Code:
Submit only code you wrote personally (or with your partner)
Do not consult pre-existing solutions
Do not post your solutions online or share them with other students
You may discuss general approaches with classmates