Subject: Dear Data Scientists, Dig Deeper!
Dr. Gru has been following your progress closely, and while he's impressed with your initial insights, he knows that real treasure isn't just on the surface — it's buried deep. In true Gru fashion, he's got big plans for the future, and that's where you come in. He's challenging you to go beyond the initial analysis to uncover valuable insights. Failure is not an option.
8-10 hours: Find and upload new table, explore relationships, write join queries
6-8 hours: Complete ML warm-up, build first BigQuery ML model
8-10 hours: Create visualizations, analyze query performance
4-6 hours: Optimize queries, finalize insights, clean notebook
Total: 30-40 hours over 5 weeks, a manageable 6-8 hours/week
Load a CSV file (>50 MB) as a new table in your dataset so you can join it with your existing tables.
Extend your Project 1 analysis with visualizations. Save SQL query results into Python dataframes and plot them with libraries like matplotlib/seaborn.
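If you would rather load the file programmatically from your Colab notebook than through the console UI, here is a minimal sketch using the BigQuery Python client (the project, dataset, table, and file names below are placeholders, not values from this handout):

from google.cloud import bigquery

client = bigquery.Client(project="your-project-id")       # placeholder project id
table_id = "your-project-id.your_dataset.new_table"       # placeholder destination table

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,    # skip the CSV header row
    autodetect=True,        # let BigQuery infer the schema
)

with open("new_data.csv", "rb") as f:                     # placeholder local file name
    load_job = client.load_table_from_file(f, table_id, job_config=job_config)
load_job.result()                                         # wait for the load to finish

print(client.get_table(table_id).num_rows, "rows loaded")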
Quality work gets 95/100: Complete all sections with working code and clear comments. This is a high A and what most students should target.
Higher credit requires exceptional work: only pursue it if you're genuinely excited about your dataset. The extra tens of hours won't significantly change your grade, but they will deepen your data science skills and improve the quality of the artifact you can share in data science interviews.
Popular sources for supplementary datasets:
Best for: Curated, clean datasets
Examples: Weather, economic indicators, demographics
Tip: Sort by size to find >50MB files quickly
Best for: Government statistics
Examples: Census, health, education data
Tip: Great for enriching location-based analyses
Best for: Export and re-upload strategy
How: Query → Export to CSV → Upload to your dataset
Benefit: Already formatted for BigQuery
Steps to upload:
Basic template for linear regression:
CREATE OR REPLACE MODEL `your_dataset.model_name`
OPTIONS(
  model_type='linear_reg',               -- linear regression
  input_label_cols=['target_column']     -- the column the model should predict
) AS
SELECT
  feature1, feature2, feature3,          -- input features
  target_column                          -- the label
FROM `your_dataset.your_table`
WHERE training_condition                 -- e.g., a filter that selects your training rows
Use BigQuery's ML.EVALUATE function:
SELECT * FROM ML.EVALUATE(MODEL `your_dataset.model_name`,
(
SELECT feature1, feature2, target_column
FROM `your_dataset.test_table`
))
This returns metrics like R², mean absolute error, accuracy, etc.
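To run the evaluation from Colab and inspect those metrics as a dataframe, a minimal sketch (the project, dataset, model, and column names are placeholders):

from google.cloud import bigquery

client = bigquery.Client(project="your-project-id")   # placeholder project id

eval_sql = """
SELECT *
FROM ML.EVALUATE(MODEL `your_dataset.model_name`,
  (SELECT feature1, feature2, target_column
   FROM `your_dataset.test_table`))
"""
metrics = client.query(eval_sql).to_dataframe()   # one row of evaluation metrics
print(metrics.T)                                  # transpose for easier reading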
In your Colab notebook:
# Run query and save to a dataframe
# (client is the BigQuery client created in your setup cells)
query = """SELECT ... FROM ..."""
df = client.query(query).to_dataframe()

# Now use matplotlib/seaborn
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 6))
plt.plot(df['x'], df['y'])
plt.xlabel('x')    # label your axes so the chart is readable on its own
plt.ylabel('y')
plt.show()
Try to stay with matplotlib so everything is contained in the Colab. This makes your notebook self-contained and easier to grade. If you have a compelling reason to use Tableau, include screenshots in your notebook and explain your visualization choices.
Use the formulas from lectures/the equation sheet, i.e., the I/O cost formulas where P(R) = the number of pages in table R and B = the buffer size in pages.
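As one illustration (this assumes the standard block nested loop join cost covered in most database courses; check your own equation sheet for the exact formula you are expected to use):

$\text{Cost}(R \bowtie S) \approx P(R) + \left\lceil \frac{P(R)}{B - 2} \right\rceil \cdot P(S)$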
Your $50 credit is enough for both projects. Tips to manage costs:
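One concrete tactic, as a sketch (the project and table names are placeholders): dry-run a query first to see how many bytes it would scan before you actually run it.

from google.cloud import bigquery

client = bigquery.Client(project="your-project-id")            # placeholder project id
dry_cfg = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)

job = client.query(
    "SELECT * FROM `your_dataset.your_table`",                 # placeholder query
    job_config=dry_cfg,
)
# A dry run never executes (and never bills); it only reports the bytes it would scan.
print(f"Would scan about {job.total_bytes_processed / 1e9:.2f} GB")

Since BigQuery bills by bytes scanned, selecting only the columns you need instead of SELECT * is usually the biggest single saving.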
Add 2-3 new questions that specifically require your uploaded table to answer. These should build on your Project 1 questions.
Total of 4+ tables: Your original 3+ tables from Project 1 + 1 new uploaded table (>50MB).
Yes! Please copy the corresponding cells to the NEW notebook. If you are working with a partner, pick one of your Project 1s to extend (if you want). Alternatively, you can choose a completely new dataset - it still needs to meet the 500MB+ requirement with 3+ tables.
To enrich your analysis! Since you are adding an additional table, direct your queries and visualizations to incorporate it. The goal is to gain deeper insights by combining your original data with complementary external data.
Project 2 visualizations should be NEW and more advanced, using insights from your uploaded table and ML models. Don't reuse Project 1 visualizations.
✅ REUSE: Project overview, table descriptions, setup code
❌ MUST BE NEW: All 6 queries, all 3 visualizations, ML model
Rule of thumb: Combine 1-3 patterns from SQL Problem Solving for deeper insights.
Why: Practice for tests and interviews where you'll combine multiple SQL techniques.
Examples: Funnel + JOINs, Ladder + CTEs, Timeline + Comparison patterns, or Exception Finder techniques.
Minimum requirement: Any query using 2+ of {JOINs, CTEs, window functions, subqueries, complex CASE} counts as non-simple.
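For instance, a minimal sketch of a non-simple query combining a CTE, a JOIN, and a window function (the table and column names are hypothetical, and client is the BigQuery client from your setup cells):

query = """
WITH daily_rides AS (                        -- CTE: rides per station per day
  SELECT station_id, DATE(start_time) AS ride_date, COUNT(*) AS rides
  FROM `your_dataset.rides`
  GROUP BY station_id, ride_date
)
SELECT
  s.station_name,
  d.ride_date,
  d.rides,
  AVG(d.rides) OVER (                        -- window function: 7-day moving average
    PARTITION BY d.station_id
    ORDER BY d.ride_date
    ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
  ) AS rides_7day_avg
FROM daily_rides AS d
JOIN `your_dataset.stations` AS s            -- JOIN: attach station names
  ON d.station_id = s.station_id
ORDER BY s.station_name, d.ride_date
"""
df = client.query(query).to_dataframe()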
No, just complete it for practice. Only submit your main project notebook with all 6 sections.
Even complex queries with multiple CTEs count as one query:
-- This is ONE query (even with multiple CTEs):
WITH temp1 AS (SELECT * FROM table1),
     temp2 AS (SELECT * FROM table2),
     temp3 AS (SELECT * FROM table3)
SELECT *
FROM temp1
JOIN temp2 USING (id)   -- the join key `id` is illustrative
JOIN temp3 USING (id);
Yes, CREATE TABLE/VIEW statements are allowed. Any SELECT queries you use inside your CREATE statements also count toward your 6 required queries.
Example: CREATE TABLE top_users AS (SELECT...) - the SELECT portion counts as 1 of your 6 queries.
Your joins must demonstrate new insights, not just technical ability.
"Join just to show I can join"
"Join weather data to show how rain affects ride patterns"
Each chart must directly illustrate findings from a specific query. Include captions like: "This chart shows results from Query #3, revealing that..."
You may need to create synthetic keys. Common approaches:
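One such approach, as a sketch (the table and column names are hypothetical, and client is the BigQuery client from your setup cells), is to build a matching key from columns both tables share, for example a calendar date and a city:

query = """
SELECT
  r.*,
  w.precipitation
FROM `your_dataset.rides` AS r
JOIN `your_dataset.weather` AS w
  ON DATE(r.start_time) = w.weather_date    -- synthetic key part 1: same calendar day
 AND LOWER(r.city) = LOWER(w.city)          -- synthetic key part 2: normalized city name
"""
df = client.query(query).to_dataframe()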
Common approaches to a train/test split in BigQuery:
-- 80/20 random split
WHERE RAND() < 0.8 -- for training
WHERE RAND() >= 0.8 -- for testing
-- Time-based split
WHERE date < '2024-01-01' -- for training
WHERE date >= '2024-01-01' -- for testing
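Note that RAND() is re-evaluated on every run, so the two random-split queries above will not carve out exactly the same rows each time. A sketch of a repeatable alternative that hashes a stable id column (the column name is hypothetical, and client is the BigQuery client from your setup cells):

query = """
SELECT *
FROM `your_dataset.your_table`
-- FARM_FINGERPRINT gives a deterministic hash, so each row always lands on the
-- same side of the split; < 8 keeps roughly 80% of rows for training.
WHERE MOD(ABS(FARM_FINGERPRINT(CAST(trip_id AS STRING))), 10) < 8
"""
training_df = client.query(query).to_dataframe()
# Use >= 8 in the WHERE clause for the held-out ~20% test set.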
Something product-relevant you can't easily calculate. GOOD: "Predict tip amount from trip features." BAD: "Predict trip_id."
Include actual calculations with numbers. Example:
Several strategies:
Remember our dual competency goal (see AI Policy). Use AI to:
But: If you can't explain your model choices, query optimizations, or visualization insights without AI, you used too much. Always verify you understand the underlying concepts.
Build specific anecdotes! Document how you caught AI errors in ML predictions, fixed visualization logic, or corrected optimization calculations. These stories demonstrate real understanding.
Focus on verification and debugging! Examples of good documentation:
Use free tiers strategically:
Remember: The goal is to learn with AI assistance, not to have AI do the project for you. Use it as a teaching assistant, not a ghostwriter.
Always double-check:
If your model has 99.9% accuracy, you probably have data leakage!
Yes! Teams of 2 are allowed, or you can work solo. See our comprehensive Partnership Policy for details on collaboration, Google Cloud permissions, Gradescope submission, late day policies, and conflict resolution.
Submit to Gradescope:
Your completed .ipynb file (based on project2_template.ipynb)
You are expected to follow the Stanford Honor Code. You are allowed to discuss general approaches and issues with other students in the class, but any code you write must be original. Violations include, but are not limited to: