CS145: Stanford University, Fall 2025 - Data Science Track
Project 2: Data Science - BigQuery Part Deux

Memo from Dr. Gru's Office

Subject: Dear Data Scientists, Dig Deeper!

Dr. Gru has been following your progress closely, and while he's impressed with your initial insights, he knows that real treasure isn't just on the surface — it's buried deep. In true Gru fashion, he's got big plans for the future, and that's where you come in. He's challenging you to go beyond the initial analysis to uncover valuable insights. Failure is not an option.

Your Extended Mission

Project Scope

Rough Time Investment Guide

  • Week 1-2: Data & Joins (8-10 hours): Find and upload new table, explore relationships, write join queries
  • Week 3: ML Models (6-8 hours): Complete ML warm-up, build first BigQuery ML model
  • Week 4: Viz & Analysis (8-10 hours): Create visualizations, analyze query performance
  • Week 5: Polish (4-6 hours): Optimize queries, finalize insights, clean notebook

Total: 30-40 hours over 5 weeks (a manageable 6-8 hours/week)

Why This Project Matters

Rubric (100 pts + 10 bonus)

Section 1: Project Overview (15 points)

Section 2: Pick and Upload a New Table (15 points)

Load a CSV file (>50 MB) as a new table in your dataset so it can interact with your existing tables.

Data Sources: Look for related data from kaggle.com, data.gov, datacommons.org. Or download a table from another BigQuery dataset and reupload as CSV.

Section 3: SQL + Visualizations and Data Exploration (25 points)

Extend your Project 1 analysis with visualizations. Save SQL query results into Python dataframes and plot them with libraries like matplotlib/seaborn.

Section 4: ML Predictions in BigQuery (20 points)

Section 5: Query Analysis (20 points)

Section 6: Conclusion (unscored guidance)

This section provides guidance for a strong conclusion but is not scored separately. Quality will be reflected in the Clarity adjustment below.

Base Grading Model

Quality work gets 95/100: Complete all sections with working code and clear comments. This is a high A and what most students should target.

Higher credit requires exceptional work: Only pursue it if you're genuinely excited about your dataset. The extra tens of hours won't significantly impact your grade, but they will deepen your data science skills and strengthen the artifact you can share in data science interviews.

Readability & Organization (up to -15 points)

  • Have clearly marked sections and crisp, clear comments in your Colab for Sections 1 through 6 below
  • This will help with grading consistency
  • Use these exact section headers:
    # Section 1: Project Overview
    # Section 2: Pick and Upload a New Table
    # Section 3: SQL + Visualizations and Data Exploration
    # Section 4: ML Predictions in BigQuery
    # Section 5: Query Analysis
    # Section 6: Conclusion
  • Specific deductions:
    • -5 points: Missing one or more required sections
    • -5 points: Poor code quality (no comments, unclear variable names, messy structure)
    • -5 points: Missing or unclear output (results not visible, no performance tables)

Adjustments (+5 points max; top 10% of projects)

  • +3 Value: Novel integration of >1GB new data that materially deepens insights; joins and rationale explained
  • +2 Clarity: Crisp, compelling insights with clear visualizations and concise explanations

BONUS: Top Projects (up to 10 points)

  • +5 points: Exceptional data integration of new data (>10 GB) with well-justified joins and story
  • +5 points: Exceptional insights and storytelling (for top 3 projects only)

Frequently Asked Questions

Datasets & Upload

Q: Where can I find datasets >50MB for upload?

Popular sources for supplementary datasets:

  • Kaggle.com
    • Best for: Curated, clean datasets
    • Examples: Weather, economic indicators, demographics
    • Tip: Sort by size to find >50MB files quickly
  • Data.gov
    • Best for: Government statistics
    • Examples: Census, health, education data
    • Tip: Great for enriching location-based analyses
  • BigQuery Public Datasets
    • Best for: Export and re-upload strategy
    • How: Query → Export to CSV → Upload to your dataset
    • Benefit: Already formatted for BigQuery

Q: How do I upload a CSV to BigQuery?

Steps to upload via the console (a scripted alternative is sketched after this list):

  1. BigQuery Console → Your dataset → Create Table
  2. Source: Upload → Select your CSV file
  3. Table name: Choose descriptive name
  4. Schema: Auto-detect usually works, or specify manually
  5. Advanced options: Check "Allow jagged rows" if rows have inconsistent column counts
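If you'd rather script the load than click through the console, BigQuery's LOAD DATA statement does the same job. A minimal sketch, assuming your CSV is first staged in a Cloud Storage bucket (bucket, file, and table names are placeholders):

LOAD DATA INTO `your_dataset.new_table`
FROM FILES (
  format = 'CSV',
  skip_leading_rows = 1,  -- skip the header row
  uris = ['gs://your-bucket/your_file.csv']
);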

Q: My CSV upload fails. What should I check?

  • File size: Split if >10GB
  • Encoding: Use UTF-8
  • Headers: First row should be column names
  • Special characters: Remove or escape properly
  • Date formats: Use YYYY-MM-DD for best results (a conversion sketch follows this list)
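If your source file uses a different date format, one workaround is to load the column as a STRING and convert it after upload. A minimal sketch, assuming a hypothetical raw_upload table with a US-style date_string column:

SELECT
  PARSE_DATE('%m/%d/%Y', date_string) AS event_date  -- e.g., '12/31/2024' -> 2024-12-31
FROM `your_dataset.raw_upload`;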

Machine Learning in BigQuery

Q: How do I create a BigQuery ML model?

Basic template for linear regression:

CREATE OR REPLACE MODEL `your_dataset.model_name`
OPTIONS(
  model_type='linear_reg',
  input_label_cols=['target_column']
) AS
SELECT
  feature1, feature2, feature3,
  target_column
FROM `your_dataset.your_table`
WHERE training_condition
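Once the model is trained, ML.PREDICT generates predictions. A minimal sketch reusing the placeholder names from the template above (the WHERE clause simply selects rows held out of training):

SELECT *
FROM ML.PREDICT(MODEL `your_dataset.model_name`,
  (
    SELECT feature1, feature2, feature3
    FROM `your_dataset.your_table`
    WHERE NOT training_condition  -- hold-out rows
  ));

The output repeats your input columns and adds a predicted_<label> column (here, predicted_target_column).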

Q: What ML models are available in BigQuery?

  • linear_reg: For predicting continuous values
  • logistic_reg: For binary classification (see the sketch after this list)
  • kmeans: For clustering (optional for project)
  • boosted_tree_regressor: Advanced regression (optional)
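For classification, the earlier template changes only in the model type and label. A minimal sketch with placeholder names, assuming is_high_value is a boolean (or two-class) column:

CREATE OR REPLACE MODEL `your_dataset.classifier_name`
OPTIONS(
  model_type='logistic_reg',
  input_label_cols=['is_high_value']
) AS
SELECT
  feature1, feature2,
  is_high_value
FROM `your_dataset.your_table`
WHERE training_condition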

Q: How do I evaluate my model?

Use BigQuery's ML.EVALUATE function:

SELECT * FROM ML.EVALUATE(MODEL `your_dataset.model_name`,
  (
    SELECT feature1, feature2, target_column
    FROM `your_dataset.test_table`
  ))

This returns metrics suited to the model type: R² and mean absolute error for regression, accuracy and precision/recall for classification, and so on.

Visualizations

Q: How do I save BigQuery results for visualization?

In your Colab notebook:

# Create a BigQuery client (assumes you have already authenticated,
# as in your Project 1 setup; the project ID is a placeholder)
from google.cloud import bigquery
client = bigquery.Client(project='your-project-id')

# Run query and save to dataframe
query = """SELECT ... FROM ..."""
df = client.query(query).to_dataframe()

# Now use matplotlib/seaborn
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 6))
plt.plot(df['x'], df['y'])
plt.show()

Q: Can I use Tableau?

Try to stay with matplotlib so everything is contained in the Colab. This makes your notebook self-contained and easier to grade. If you have a compelling reason to use Tableau, include screenshots in your notebook and explain your visualization choices.

Q: What makes a "meaningful" visualization?

✓ GOOD Visualizations
  • Shows clear trend or pattern
  • Directly answers your question
  • Has labeled axes and title
  • Includes legend if needed
✗ BAD Visualizations
  • Random scatter with no pattern
  • Unlabeled axes
  • Too cluttered to read
  • Doesn't relate to analysis

Query Analysis & Optimization

Q: How do I access the Query Explainer?

  1. Run your query in BigQuery Console
  2. Click "Query History" in left sidebar
  3. Click on your query
  4. Go to "Execution details" tab
  5. Look at stages, timing, and bytes processed

Q: What should I analyze for query optimization?

  • Stage breakdown: Which stage takes longest?
  • Data shuffled: Can you reduce shuffling with better partitioning?
  • Join type: Would a sort-merge join (SMJ) beat a block nested loop join (BNLJ) at 100x data?
  • Filter pushdown: Are WHERE clauses applied early?
  • Bytes processed: Can you reduce with column selection?

Q: How do I calculate I/O costs for different algorithms?

Use formulas from the lectures/equation sheet, e.g.:

  • BNLJ: P(R) + P(R) × P(S) / (B-1)
  • SMJ: 3 × (P(R) + P(S))
  • HPJ: 3 × (P(R) + P(S))

Where P(R) = number of pages in table R and B = number of buffer pages

Q: Will queries cost more than Project 1?

Your $50 credit is enough for both projects. Tips to manage costs:

  • Check query size before running (top-right in console)
  • Use LIMIT for testing
  • Cache results in Python variables
  • Don't SELECT * on large tables

Project Relationship & Requirements

Q: How many questions do we add?

Add 2-3 new questions that specifically require your uploaded table to answer. These should build on your Project 1 questions.

Q: How many tables do we have now?

Total of 4+ tables: Your original 3+ tables from Project 1 + 1 new uploaded table (>50MB).

Q: Can I extend Project 1?

Yes! Please copy the corresponding cells to the NEW notebook. If you are working with a partner, pick one of your Project 1s to extend (if you want). Alternatively, you can choose a completely new dataset - it still needs to meet the 500MB+ requirement with 3+ tables.

Q: What is the point of the new dataset?

To enrich your analysis! Knowing that you are adding an additional dataset, you should direct your queries/visualizations to include that dataset. The goal is to gain deeper insights by combining your original data with complementary external data.

Q: What if my Project 1 already had visualizations?

Project 2 visualizations should be NEW and more advanced, using insights from your uploaded table and ML models. Don't reuse Project 1 visualizations.

Q: Can I reuse Project 1 content?

✅ REUSE: Project overview, table descriptions, setup code
❌ MUST BE NEW: All 6 queries, all 3 visualizations, ML model

Q: What counts as a "non-simple" query?

Rule of thumb: Combine 1-3 patterns from SQL Problem Solving for deeper insights.

Why: Practice for tests and interviews where you'll combine multiple SQL techniques.

Examples: Funnel + JOINs, Ladder + CTEs, Timeline + Comparison patterns, or Exception Finder techniques.

Minimum requirement: Any query using 2+ of {JOINs, CTEs, window functions, subqueries, complex CASE} counts as non-simple.
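For instance, this hypothetical query clears that bar by combining a CTE, a JOIN, and a window function (all table and column names are made up for illustration):

-- Hypothetical schema: trips(station_id, trip_date), stations(station_id, name)
WITH daily AS (
  SELECT station_id, trip_date, COUNT(*) AS num_trips
  FROM `your_dataset.trips`
  GROUP BY station_id, trip_date
)
SELECT
  s.name,
  d.trip_date,
  d.num_trips,
  -- 7-day moving average of daily trips per station
  AVG(d.num_trips) OVER (
    PARTITION BY d.station_id
    ORDER BY d.trip_date
    ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
  ) AS rolling_7day_avg
FROM daily AS d
JOIN `your_dataset.stations` AS s USING (station_id);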

Q: Do I submit the ML warm-up notebook?

No, just complete it for practice. Only submit your main project notebook with all 6 sections.

Query & Code Specifics

Q: What counts as "one query"?

Even complex queries with multiple CTEs count as one query:

-- This is ONE query (even with multiple CTEs):
WITH temp1 AS (SELECT * FROM table1),
     temp2 AS (SELECT * FROM table2),
     temp3 AS (SELECT * FROM table3)
SELECT *
FROM temp1
JOIN temp2 USING (id)  -- join keys (`id`) are illustrative
JOIN temp3 USING (id);

Q: Can I use CREATE statements?

Yes, CREATE TABLE/VIEW statements are allowed. Any SELECT queries you use inside your CREATE statements also count toward your 6 required queries.

Example: CREATE TABLE top_users AS (SELECT...) - the SELECT portion counts as 1 of your 6 queries.
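A minimal sketch of that pattern, with hypothetical table and column names:

-- The SELECT below counts as one of your 6 required queries
CREATE TABLE `your_dataset.top_users` AS
SELECT user_id, COUNT(*) AS num_orders
FROM `your_dataset.orders`
GROUP BY user_id
ORDER BY num_orders DESC
LIMIT 100;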

Q: What makes a good join query?

Your joins must demonstrate new insights, not just technical ability.

✗ BAD

"Join just to show I can join"

✓ GOOD

"Join weather data to show how rain affects ride patterns"

Q: How should visualizations connect to queries?

Each chart must directly illustrate findings from a specific query. Include captions like: "This chart shows results from Query #3, revealing that..."

Data & Model Specifics

Q: What if my supplementary table doesn't have matching keys?

You may need to create synthetic keys. Common approaches (one is sketched after this list):

  • Match on date ranges (e.g., same week/month)
  • Geographic regions (zip codes, cities)
  • Fuzzy string matching for names
  • Aggregated groupings that align between tables
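For example, this sketch aligns two tables at month granularity when they share no key (all names are hypothetical):

-- Join rides to monthly weather via a synthetic month key
SELECT
  DATE_TRUNC(r.ride_date, MONTH) AS ride_month,
  w.avg_rainfall,
  COUNT(*) AS num_rides
FROM `your_dataset.rides` AS r
JOIN `your_dataset.monthly_weather` AS w
  ON DATE_TRUNC(r.ride_date, MONTH) = w.month
GROUP BY ride_month, w.avg_rainfall;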

Q: How do I split data for training/testing?

Common approaches in BigQuery:

-- 80/20 random split (caveat: RAND() is re-evaluated on every query,
-- so these two sets are not stable across runs; materialize them,
-- or use the deterministic hash split below)
WHERE RAND() < 0.8 -- for training
WHERE RAND() >= 0.8 -- for testing

-- Deterministic 80/20 split (`id` is a placeholder key column)
WHERE MOD(ABS(FARM_FINGERPRINT(CAST(id AS STRING))), 10) < 8 -- for training
WHERE MOD(ABS(FARM_FINGERPRINT(CAST(id AS STRING))), 10) >= 8 -- for testing

-- Time-based split
WHERE date < '2024-01-01' -- for training
WHERE date >= '2024-01-01' -- for testing

Q: What makes a good ML prediction target?

Pick something product-relevant that you can't trivially compute from existing columns.

✓ GOOD Prediction Targets
  • Predicting tip amount from trip features
  • Classifying high-value customers
  • Forecasting crime rates by area
  • Predicting user engagement metrics
✗ BAD Prediction Targets
  • Predicting unique IDs
  • Predicting constants
  • Predicting from perfectly correlated features
  • Predicting random noise

Q: How detailed should my query analysis be?

Include actual calculations with numbers. Example:

Current (1GB tables; P(R) = 1,000 pages, P(S) = 2,000 pages, B = 100 buffer pages):
BNLJ cost = 1,000 + 1,000 × 2,000 / (100-1) ≈ 21,000 I/Os

At 100x scale (100GB):
BNLJ cost = 100,000 + 100,000 × 200,000 / 99 ≈ 202 million I/Os
SMJ cost = 3 × (100,000 + 200,000) = 900,000 I/Os
Conclusion: SMJ becomes roughly 200x more efficient at scale

Q: What if I can't find a 50MB+ related dataset?

Several strategies:

  • Aggregate multiple smaller related tables into one
  • Export a subset of a BigQuery public dataset, filtered to keep relevant data (see the EXPORT DATA sketch after this list)
  • Generate synthetic data that complements your analysis
  • Use time-series expansions of existing data
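For the export route, BigQuery's EXPORT DATA statement writes query results straight to Cloud Storage as CSV, ready to re-upload to your dataset. A minimal sketch, assuming a bucket you own (source table, columns, and filter are placeholders):

EXPORT DATA OPTIONS (
  uri = 'gs://your-bucket/exports/subset-*.csv',  -- wildcard lets BigQuery shard large outputs
  format = 'CSV',
  overwrite = true,
  header = true
) AS
SELECT column1, column2
FROM `bigquery-public-data.some_dataset.some_table`
WHERE relevant_filter;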

AI Usage & Verification

Q: How much AI use is appropriate for Project 2?

Remember our dual competency goal (see AI Policy). Use AI to:

  • Accelerate ML model creation and debugging
  • Generate visualization code templates
  • Help with query optimization calculations
  • Debug BigQuery ML syntax errors

But: If you can't explain your model choices, query optimizations, or visualization insights without AI, you used too much. Always verify you understand the underlying concepts.

Build specific anecdotes! Document how you caught AI errors in ML predictions, fixed visualization logic, or corrected optimization calculations. These stories demonstrate real understanding.

Q: How should I document AI usage for this project?

Focus on verification and debugging! Examples of good documentation:

  • ML debugging: "AI suggested logistic regression for continuous target; corrected to linear regression"
  • Visualization fixes: "AI's scatter plot had swapped axes; verified with sample data and fixed"
  • Query optimization: "AI's I/O calculation missed memory factor; recalculated with proper formula"
  • Feature engineering: "AI suggested perfect correlate as feature; removed after checking correlation matrix"

Q: Which AI tools are best for data science tasks?

Use free tiers strategically:

  • ChatGPT/Claude: Good for ML concepts and debugging
  • Google Gemini: Excellent for BigQuery ML syntax
  • GitHub Copilot: Great for Python visualization code
  • Perplexity: Helpful for finding relevant datasets

Remember: The goal is to learn with AI assistance, not to have AI do the project for you. Use it as a teaching assistant, not a ghostwriter.

Q: How do I verify AI-generated ML models are correct?

Always double-check:

  • Model type matches your target (regression vs classification)
  • Features make logical sense (no data leakage)
  • Train/test split is properly implemented
  • Metrics are appropriate for your problem
  • Results are reasonable (not too good to be true)

If your model has 99.9% accuracy, you probably have data leakage!
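One quick leakage check you can run directly in SQL: correlations near +/-1 between a feature and the label are a red flag. A minimal sketch with placeholder names:

-- Values near +/-1 suggest the feature leaks (or trivially encodes) the label
SELECT
  CORR(feature1, target_column) AS corr_feature1,
  CORR(feature2, target_column) AS corr_feature2
FROM `your_dataset.your_table`;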

Partnership Policy

Q: Can I work with a partner?

Yes! Teams of 2 are allowed, or you can work solo. See our comprehensive Partnership Policy for details on collaboration, Google Cloud permissions, Gradescope submission, late day policies, and conflict resolution.

Submission Instructions

Submit to Gradescope:

Honor Code

You are expected to follow the Stanford Honor Code. You are allowed to discuss general approaches and issues with other students in the class, but any code you write must be original. Violations include, but are not limited to: