Project 2: BigQuery Part Deux

CS145: Stanford University, Fall 2026. Data Science Track.

Mission Briefing

Dear Data Scientists, dig deeper! Dr. Gru was impressed by your first round of insights, but the real treasure is buried. Project 2 is where you push past P1's surface-level answers with new data, new questions, a predictive model, and a scale analysis.

Where to start

Most students pick up their own Project 1 notebook and keep going. If you're partnering, choose one of the two P1 notebooks as your team's starting point. If neither feels compelling, you're free to start fresh with a new dataset, as long as it still meets the Project 1 bar (3+ tables, at least 500 MB). Whichever route you take, the P1 work you carry over is scaffolding: every query, every visualization, and the ML model in P2 must be new.

What's new in Project 2

Project Scope

Estimated Work-Back Schedule

Proposal due Oct 30. Final due Nov 20. Five weeks of work total, about 7 hours a week. Jot down AI surprises as they happen so you can write them up later.

Week 1 · Pick Starting Point & Upload

Oct 17 through Oct 23, spend 6 to 7 hours.

Decide whether you're extending your own P1, your partner's, or starting fresh. Identify a new supplementary dataset (> 50 MB) and upload it to BigQuery. Map how its keys relate to your existing tables. Sketch your 2 or 3 new questions.

Week 2 · Proposal

Oct 24 through Oct 30, spend 6 to 7 hours.

Draft your first two or three queries joining the new table. Write the proposal: dataset choice, new table, new questions, planned ML model. Submit by Oct 30.

Week 3 · ML Warm-up & Model

Oct 31 through Nov 6, spend 7 to 8 hours.

Complete the ML warm-up Colab. Build your BigQuery ML model. Evaluate it with the metrics that fit your problem (R² for regression, precision/recall for imbalanced classification).

Week 4 · Queries, Viz, Query Analysis

Nov 7 through Nov 13, spend 7 to 8 hours.

Finish the 6+ non-simple queries and 3+ visualizations. Write the query analysis section with BigQuery query plans and scaling estimates. Keep noting AI surprises, especially the times AI got something wrong and you caught it.

Week 5 · Polish & Submit

Nov 14 through Nov 20, spend 4 to 5 hours.

Write the Writeup & Reflection section. Turn your scratch notes into 2 or 3 AI anecdotes plus one plain-English query walkthrough. Make sure every cell has run with output, section headers are in place, and submit to Gradescope.

Expected total: 30 to 35 hours (about 7 hours a week).

Why This Project Matters

Rubric (100 pts, plus up to 10 bonus)

Section 1: Project Overview 15 pts

Section 2: Upload a New Table 10 pts

Load a CSV as a new table (at least 50 MB) into your dataset so it can join your existing tables.

Section 3: SQL + Visualizations 25 pts

Extend your analysis with visualizations. Save SQL outputs to Python, then use matplotlib or seaborn.

Section 4: Predictive Modeling (BigQuery ML) 20 pts

Prerequisite: complete the ML warm-up Colab before you start this section.

Section 5: Query Analysis 15 pts

This section bridges data science into systems thinking. You're learning napkin-math so you can argue for an index in a real production review.

Section 6: Writeup, Reflection & AI Takeaways 15 pts

A short reflection at the end of your notebook. Keep it informal. We're looking for evidence that you understood what landed in your notebook and used AI as a collaborator.

Base Grading Model

Quality work gets 95/100: complete all sections with working code and clear comments. This is a high A and what most students should target.

Above 95 is rare. It's reserved for genuine insight: a finding that changes how you think about the data, or a writeup that stands on its own. We don't reward bigger datasets or more queries for their own sake.

Adjustments +5 max (top 10%)

  • +3 Value. Novel integration of more than 1 GB of new data that materially deepens insights, with joins and rationale clearly explained.
  • +2 Clarity. Crisp, compelling insights with clear visualizations and concise explanations.

Bonus: Top Projects Up to +10

  • +5 pts. Exceptional data integration (more than 10 GB) with well-justified joins and story.
  • +5 pts. Exceptional insights and storytelling (top 3 projects only).

Frequently Asked Questions

Getting Started & Scope

Q: What exactly carries over from Project 1?

You can reuse: your project overview, table descriptions, and setup code.

Must be new: all 6 queries, all 3 visualizations, and the ML model. If your P1 already had visualizations, the P2 ones should still be new and more advanced, using the uploaded table or the ML outputs.

Partners choose one of the two P1 notebooks to extend. You don't merge both. If you're starting fresh with a new dataset, it still needs to meet the P1 bar (3+ tables, at least 500 MB).

Q: How many questions and tables will I end up with?

Aim for 4+ tables total: your original 3+ from P1 (or equivalent in a fresh start) plus at least one new uploaded table (at least 50 MB). Add 2 or 3 new questions that specifically require the new table to answer. The goal is deeper insights from combining datasets, not more of the same.

Q: What does the Oct 30 proposal look like?

Short. Aim for one page total. The easiest format is a single markdown cell at the top of your project notebook that covers:

  • Which starting point you're using (your P1, your partner's P1, or a fresh dataset).
  • The new supplementary table you plan to upload, including where it's from and its size.
  • Your 2 or 3 new questions that specifically need the new table.
  • The ML model you plan to train (linear or logistic regression) and what it predicts.

The proposal is a checkpoint, not a separately graded deliverable. We'll give you feedback so you can catch scope problems before you've sunk too many hours in.

Q: Will queries cost more than Project 1?

Your $50 BigQuery credit is enough for both projects. Check query size before running (top-right of the console), use LIMIT while testing, cache results in Python variables, and don't SELECT * on large tables.
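
If you'd rather check cost programmatically than eyeball the console, a dry run makes BigQuery estimate the bytes a query would scan without executing (or billing) it. A minimal sketch, assuming the authenticated client from your setup cell; the table name is a placeholder:

from google.cloud import bigquery

client = bigquery.Client()

# Dry run: estimate bytes scanned without executing or billing the query.
config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
job = client.query("SELECT feature1 FROM `your_dataset.your_table`", job_config=config)

print(f"This query would process {job.total_bytes_processed / 1e9:.2f} GB")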

Datasets & Upload

Q: Where can I find datasets > 50 MB?

Kaggle.com

Curated, clean datasets. Sort by size to find > 50 MB files quickly.

Data.gov

Government statistics covering Census, health, and education. Great for enriching location-based analyses.

BigQuery Public Datasets

Query a public dataset, export to CSV, re-upload to your dataset. Already BigQuery-formatted.

Q: How do I upload a CSV to BigQuery, and what if it fails?

Upload steps: BigQuery Console → Your dataset → Create Table → Source: Upload → Select CSV. Auto-detect schema usually works.

If it fails, check:

  • File size: split if > 10 GB
  • Encoding: use UTF-8
  • Headers: first row should be column names
  • Dates: prefer YYYY-MM-DD
  • Special characters: escape or remove; try "Allow jagged rows" in Advanced options
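
If the console upload keeps failing, loading the file from Python tends to produce more specific error messages. A sketch using the BigQuery client library; the file and table names are placeholders:

from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,  # first row is column names
    autodetect=True,      # let BigQuery infer the schema
)

with open("supplementary.csv", "rb") as f:
    load_job = client.load_table_from_file(
        f, "your_project.your_dataset.new_table", job_config=job_config
    )
load_job.result()  # blocks until done; raises with row-level detail on failure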

Q: What if my supplementary table doesn't have matching keys?

Common strategies: match on date ranges (same week/month), geographic regions (zip, city), fuzzy string matching on names, or aggregated groupings that align between tables.
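
As a concrete sketch of the date-range strategy, here's a toy pandas version that rolls both tables down to month granularity before joining. All column names here are hypothetical:

import pandas as pd

# Toy stand-ins for two tables with no shared key but a date column each.
trips = pd.DataFrame({"pickup_date": ["2024-01-03", "2024-02-10"], "fare": [12.5, 8.0]})
weather = pd.DataFrame({"obs_date": ["2024-01-15", "2024-02-20"], "avg_temp": [4.0, 6.5]})

# Truncate both dates to the month, then join on that shared granularity.
trips["month"] = pd.to_datetime(trips["pickup_date"]).dt.to_period("M")
weather["month"] = pd.to_datetime(weather["obs_date"]).dt.to_period("M")

joined = trips.merge(weather[["month", "avg_temp"]], on="month", how="left")
print(joined)

The same idea works directly in SQL with DATE_TRUNC on both sides of the join.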

Q: What if I can't find a 50 MB+ related dataset?

  • Aggregate multiple smaller related tables into one
  • Export a filtered subset of a BigQuery public dataset
  • Generate synthetic data that complements your analysis
  • Use time-series expansions of existing data

SQL & Visualizations

Q: What counts as "one query" and as a "non-simple" query?

One query: even a SELECT with multiple CTEs counts as one query. A CREATE TABLE ... AS (SELECT ...) also counts as one (the SELECT portion is what counts toward your 6).

Non-simple: combines 2+ of {JOINs, CTEs, window functions, subqueries, complex CASE}. Think funnels plus JOINs, ladders plus CTEs, or timeline comparisons. Anything you'd see on a final-round SQL interview.

-- ONE query (even with multiple CTEs):
WITH t1 AS (SELECT * FROM table1),
     t2 AS (SELECT * FROM table2)
SELECT * FROM t1 JOIN t2 ON ...;

Q: What makes a good join query and a good visualization?

Joins must demonstrate a new insight, not just technical ability. Each chart should directly illustrate findings from a specific query. Include a caption like "This chart shows results from Query #3, revealing that…"

Good visualization
  • Clear trend or pattern
  • Directly answers your question
  • Labeled axes, title, legend
Bad visualization
  • Random scatter with no pattern
  • Unlabeled axes
  • Too cluttered or unrelated to the analysis

Q: How do I save BigQuery results for visualization?

query = """SELECT ... FROM ..."""
df = client.query(query).to_dataframe()

import matplotlib.pyplot as plt
plt.figure(figsize=(10, 6))
plt.plot(df['x'], df['y'])
plt.show()

Stay with matplotlib/seaborn so everything lives in the notebook. If you have a compelling reason to use Tableau, include screenshots and explain the choice.

BigQuery ML

Q: How do I create and evaluate a BigQuery ML model?

-- Train
CREATE OR REPLACE MODEL `your_dataset.model_name`
OPTIONS(
  model_type='linear_reg',
  input_label_cols=['target_column']
) AS
SELECT feature1, feature2, feature3, target_column
FROM `your_dataset.your_table`
WHERE training_condition;

-- Evaluate
SELECT * FROM ML.EVALUATE(
  MODEL `your_dataset.model_name`,
  (SELECT feature1, feature2, target_column
   FROM `your_dataset.test_table`)
);

Model types: linear_reg (continuous), logistic_reg (binary classification), kmeans (clustering, optional), boosted_tree_regressor (advanced, optional).
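
Both statements run fine from the notebook through the Python client, which keeps training and evaluation next to your charts. A sketch reusing the client from earlier; model and table names are placeholders:

# Train, then evaluate, from the notebook; result() blocks until each job finishes.
client.query("""
CREATE OR REPLACE MODEL `your_dataset.model_name`
OPTIONS(model_type='linear_reg', input_label_cols=['target_column']) AS
SELECT feature1, feature2, feature3, target_column
FROM `your_dataset.your_table`
WHERE training_condition
""").result()

metrics = client.query(
    "SELECT * FROM ML.EVALUATE(MODEL `your_dataset.model_name`)"
).to_dataframe()
print(metrics)  # for linear_reg: r2_score, mean_squared_error, etc.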

Q: How do I split data for training/testing?

-- 80/20 reproducible split (two separate RAND() filters won't give
-- complementary sets, since RAND() is re-evaluated on every query)
WHERE MOD(ABS(FARM_FINGERPRINT(CAST(id_column AS STRING))), 10) < 8   -- training
WHERE MOD(ABS(FARM_FINGERPRINT(CAST(id_column AS STRING))), 10) >= 8  -- testing

-- Time-based split
WHERE date < '2024-01-01'   -- training
WHERE date >= '2024-01-01'  -- testing

Q: What makes a good ML prediction target?

Something product-relevant that you can't easily calculate. "Predict tip amount from trip features" is good; "Predict trip_id" is not.

Good targets
  • Predicting tip amount from trip features
  • Classifying high-value customers
  • Forecasting crime rates by area
  • Predicting user engagement metrics
Bad targets
  • Predicting unique IDs
  • Predicting constants
  • Predicting from perfectly correlated features
  • Predicting random noise

Red flag: if your model has 99.9% accuracy, you probably have data leakage.

Query Analysis & Optimization

Q: How do I access the Query Explainer and what should I analyze?

Access: run your query → Query History → click query → Execution details tab.

Analyze:

  • Stage breakdown: which stage takes longest?
  • Data shuffled: can you reduce it with better partitioning?
  • Join type: would SMJ beat BNLJ at 100×?
  • Filter pushdown: are WHERE clauses applied early?
  • Bytes processed: can column selection reduce it?
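
The same numbers are available programmatically from the finished job object, which is handy for pasting stage stats into your analysis. A sketch using attributes from the google-cloud-bigquery library; the query is a placeholder:

# Inspect the execution plan of a finished query job.
job = client.query("SELECT feature1 FROM `your_dataset.your_table`")
job.result()  # wait for completion so statistics are populated

print(f"Bytes processed: {job.total_bytes_processed:,}")
for stage in job.query_plan:
    print(stage.name, stage.records_read, stage.records_written, stage.shuffle_output_bytes)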

Q: How do I calculate I/O costs at 10× and 100× scale?

Use the formulas from the equation sheet. P(R) = number of pages in table R, B = number of buffer pages.

  • BigSort: see the Big Sort lesson for the full I/O cost derivation. The short form: 2 × P(R) × (1 + ⌈log_(B−1)(P(R)/B)⌉).
  • BNLJ: P(R) + P(R) × P(S) / (B − 1)
  • SMJ: 3 × (P(R) + P(S))
  • HPJ: 3 × (P(R) + P(S))

Example (current 1 GB → 100×):

Current (P(R) = 1,000 pages, P(S) = 2,000 pages, B = 100):
  BNLJ = 1,000 + 1,000 × 2,000 / (100 − 1) ≈ 21,000 I/Os

At 100× scale (P(R) = 100,000, P(S) = 200,000):
  BNLJ = 100,000 + 100,000 × 200,000 / 99 ≈ 202M I/Os
  SMJ  = 3 × (100,000 + 200,000) = 900,000 I/Os
  → SMJ becomes ~200× more efficient at scale.
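
The arithmetic is simple enough to script, which makes it easy to tabulate several scales at once. A sketch using the same page counts and B = 100 buffer pages:

# Napkin-math I/O costs for the join example above (B = 100 buffer pages).
B = 100

def bnlj(p_r, p_s, b=B):
    # Block nested loop join: read R once, scan S once per block of R.
    return p_r + p_r * p_s / (b - 1)

def smj(p_r, p_s):
    # Sort-merge join, simplified course formula.
    return 3 * (p_r + p_s)

for scale in (1, 10, 100):
    p_r, p_s = 1_000 * scale, 2_000 * scale
    print(f"{scale:>3}x: BNLJ = {bnlj(p_r, p_s):>15,.0f} I/Os, SMJ = {smj(p_r, p_s):>9,.0f} I/Os")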

AI Takeaways & Verification

Q: How much AI is too much?

Use AI freely. The real bar is whether you can talk through your own model choices, optimizations, and visualization decisions in plain English. That skill is what data jobs actually lean on, even more than typing the SQL or Python yourself. See the AI Policy for the full rule.

Q: What should the AI Takeaways section look like?

Short and informal. Think half a page, not a report. You need 2 or 3 anecdotes (a few sentences each) and one plain-English walkthrough of a complex query you built.

Good anecdotes capture a moment. A few shapes that tend to work:

  • "AI suggested logistic regression for a continuous target. I caught it when ML.EVALUATE returned accuracy instead of R², and switched to linear regression."
  • "AI's scatter plot had the axes swapped. I spotted it when the correlation direction didn't match my intuition from the raw data."
  • "AI suggested a feature that was almost perfectly correlated with the target. That would have been data leakage. I dropped it after running a correlation matrix."

Q: What should I double-check on AI-generated ML code?

  • Model type matches your target (regression vs. classification).
  • Features make logical sense. No data leakage from features that "know" the target.
  • Train and test splits are correctly implemented.
  • Metrics are appropriate for the problem (R² for regression, precision and recall for imbalanced classification).
  • Results are reasonable. If your model has 99.9% accuracy, you probably have data leakage.
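
For the leakage item in particular, a quick correlation pass usually surfaces the problem before you train. A sketch assuming df is the feature/target DataFrame you pulled from BigQuery; the 0.95 cutoff is an illustrative threshold, not a rule:

# Flag features suspiciously correlated with the target (possible leakage).
corr = df.corr(numeric_only=True)["target_column"].drop("target_column")
print(corr[corr.abs() > 0.95])  # investigate anything listed before training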

Submission

Proposal due Oct 30. Final notebook due Nov 20. Submit both to Gradescope.

Honor Code

You must follow the Stanford Honor Code: