CS145: Stanford University, Fall 2025 - Data Science Track
Project 2: Data Science - BigQuery Part Deux

Memo from Dr. Gru's Office

Subject: Dear Data Scientists, Dig Deeper!

Dr. Gru has been following your progress closely, and while he's impressed with your initial insights, he knows that real treasure isn't just on the surface — it's buried deep. In true Gru fashion, he's got big plans for the future, and that's where you come in. He's challenging you to go beyond the initial analysis to uncover valuable insights. Failure is not an option.

Your Extended Mission

Project Scope

Rough Time Investment Guide

  • Week 1-2: Data & Joins (8-10 hours): Find and upload new table, explore relationships, write join queries
  • Week 3: ML Models (6-8 hours): Complete ML warm-up, build first BigQuery ML model
  • Week 4: Viz & Analysis (8-10 hours): Create visualizations, analyze query performance
  • Week 5: Polish (4-6 hours): Optimize queries, finalize insights, clean notebook

Total: 30-40 hours over 5 weeks (a manageable 6-8 hours/week)

Why This Project Matters

Rubric (100 pts + 10 bonus)

Section 1: Project Overview (15 points)

Section 2: Pick and Upload a New Table (15 points)

Load a CSV file (>50 MB) as a new table in your dataset so it can interact with your existing tables.

Data Sources: Look for related data from kaggle.com, data.gov, datacommons.org. Or download a table from another BigQuery dataset and reupload as CSV.

Section 3: SQL + Visualizations and Data Exploration (25 points)

Extend your Project 1 analysis with visualizations. Save SQL query results into Python dataframes and plot them with libraries like matplotlib/seaborn.

Section 4: ML Predictions in BigQuery (20 points)

Section 5: Query Analysis (20 points)

Section 6: Conclusion (unscored guidance)

This section provides guidance for a strong conclusion but is not scored separately. Quality will be reflected in the Clarity adjustment below.

Base Grading Model

Quality work gets 95/100: Complete all sections with working code and clear comments. This is a high A and what most students should target.

Higher credit requires exceptional work: Only pursue it if you're genuinely excited about your dataset. The extra tens of hours won't significantly impact your grade, but they will deepen your data science skills and strengthen the artifact you can share in data science interviews.

Readability & Organization (up to -15 points)

  • Have clearly marked sections and crisp, clear comments in your Colab for Sections 1 through 6 below
  • This will help with grading consistency
  • Use these exact section headers:
    # Section 1: Project Overview
    # Section 2: Pick and Upload a New Table
    # Section 3: SQL + Visualizations and Data Exploration
    # Section 4: ML Predictions in BigQuery
    # Section 5: Query Analysis
    # Section 6: Conclusion
  • Specific deductions:
    • -5 points: Missing one or more required sections
    • -5 points: Poor code quality (no comments, unclear variable names, messy structure)
    • -5 points: Missing or unclear output (results not visible, no performance tables)

Adjustments (+5 points max; top 10% of projects)

  • +3 Value: Novel integration of >1GB new data that materially deepens insights; joins and rationale explained
  • +2 Clarity: Crisp, compelling insights with clear visualizations and concise explanations

BONUS: Top Projects (up to 10 points)

  • +5 points: Exceptional data integration of new data (>10 GB) with well-justified joins and story
  • +5 points: Exceptional insights and storytelling (for top 3 projects only)

Frequently Asked Questions

Datasets & Upload

Q: Where can I find datasets >50MB for upload?

Popular sources for supplementary datasets:

  • Kaggle.com
    • Best for: Curated, clean datasets
    • Examples: Weather, economic indicators, demographics
    • Tip: Sort by size to find >50MB files quickly
  • Data.gov
    • Best for: Government statistics
    • Examples: Census, health, education data
    • Tip: Great for enriching location-based analyses
  • BigQuery Public Datasets
    • Best for: Export and re-upload strategy
    • How: Query → Export to CSV → Upload to your dataset
    • Benefit: Already formatted for BigQuery

Q: How do I upload a CSV to BigQuery?

Steps to upload via the console (a scripted alternative is sketched after this list):

  1. BigQuery Console → Your dataset → Create Table
  2. Source: Upload → Select your CSV file
  3. Table name: Choose descriptive name
  4. Schema: Auto-detect usually works, or specify manually
  5. Advanced options: Check "Allow jagged rows" if rows have inconsistent column counts
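If you'd rather script the load than click through the console, BigQuery's LOAD DATA statement does the same job. A minimal sketch, assuming your CSV is first staged in a Cloud Storage bucket (bucket, file, and table names are placeholders):

LOAD DATA INTO `your_dataset.new_table`
FROM FILES (
  format = 'CSV',
  skip_leading_rows = 1,  -- skip the header row
  uris = ['gs://your-bucket/your_file.csv']
);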

Q: My CSV upload fails. What should I check?

  • File size: Split if >10GB
  • Encoding: Use UTF-8
  • Headers: First row should be column names
  • Special characters: Remove or escape properly
  • Date formats: Use YYYY-MM-DD for best results (a conversion sketch follows this list)
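If your source file uses a different date format, one workaround is to load the column as a STRING and convert it after upload. A minimal sketch, assuming a hypothetical raw_upload table with a US-style date_string column:

SELECT
  PARSE_DATE('%m/%d/%Y', date_string) AS event_date  -- e.g., '12/31/2024' -> 2024-12-31
FROM `your_dataset.raw_upload`;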

Machine Learning in BigQuery

Q: How do I create a BigQuery ML model?

Basic template for linear regression:

CREATE OR REPLACE MODEL `your_dataset.model_name`
OPTIONS(
  model_type='linear_reg',
  input_label_cols=['target_column']
) AS
SELECT
  feature1, feature2, feature3,
  target_column
FROM `your_dataset.your_table`
WHERE training_condition
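Once the model is trained, ML.PREDICT generates predictions. A minimal sketch reusing the placeholder names from the template above (the WHERE clause simply selects rows held out of training):

SELECT *
FROM ML.PREDICT(MODEL `your_dataset.model_name`,
  (
    SELECT feature1, feature2, feature3
    FROM `your_dataset.your_table`
    WHERE NOT training_condition  -- hold-out rows
  ));

The output repeats your input columns and adds a predicted_<label> column (here, predicted_target_column).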

Q: What ML models are available in BigQuery?

  • linear_reg: For predicting continuous values
  • logistic_reg: For binary classification (see the sketch after this list)
  • kmeans: For clustering (optional for project)
  • boosted_tree_regressor: Advanced regression (optional)
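For classification, the earlier template changes only in the model type and label. A minimal sketch with placeholder names, assuming is_high_value is a boolean (or two-class) column:

CREATE OR REPLACE MODEL `your_dataset.classifier_name`
OPTIONS(
  model_type='logistic_reg',
  input_label_cols=['is_high_value']
) AS
SELECT
  feature1, feature2,
  is_high_value
FROM `your_dataset.your_table`
WHERE training_condition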

Q: How do I evaluate my model?

Use BigQuery's ML.EVALUATE function:

SELECT * FROM ML.EVALUATE(MODEL `your_dataset.model_name`,
  (
    SELECT feature1, feature2, target_column
    FROM `your_dataset.test_table`
  ))

This returns metrics suited to the model type: R² and mean absolute error for regression, accuracy and precision/recall for classification, and so on.

Visualizations

Q: How do I save BigQuery results for visualization?

In your Colab notebook:

# Create a BigQuery client (assumes you have already authenticated,
# as in your Project 1 setup; the project ID is a placeholder)
from google.cloud import bigquery
client = bigquery.Client(project='your-project-id')

# Run query and save to dataframe
query = """SELECT ... FROM ..."""
df = client.query(query).to_dataframe()

# Now use matplotlib/seaborn
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 6))
plt.plot(df['x'], df['y'])
plt.show()

Q: Can I use Tableau?

Try to stay with matplotlib so everything is contained in the Colab. This makes your notebook self-contained and easier to grade. If you have a compelling reason to use Tableau, include screenshots in your notebook and explain your visualization choices.

Q: What makes a "meaningful" visualization?

✓ GOOD Visualizations
  • Shows clear trend or pattern
  • Directly answers your question
  • Has labeled axes and title
  • Includes legend if needed
✗ BAD Visualizations
  • Random scatter with no pattern
  • Unlabeled axes
  • Too cluttered to read
  • Doesn't relate to analysis

Query Analysis & Optimization

Q: How do I access the Query Explainer?

  1. Run your query in BigQuery Console
  2. Click "Query History" in left sidebar
  3. Click on your query
  4. Go to "Execution details" tab
  5. Look at stages, timing, and bytes processed

Q: What should I analyze for query optimization?

  • Stage breakdown: Which stage takes longest?
  • Data shuffled: Can you reduce shuffling with better partitioning?
  • Join type: Would a sort-merge join (SMJ) beat a block nested loop join (BNLJ) at 100x data?
  • Filter pushdown: Are WHERE clauses applied early?
  • Bytes processed: Can you reduce with column selection?

Q: How do I calculate I/O costs for different algorithms?

Use formulas from the lectures/equation sheet, e.g.:

  • BNLJ: P(R) + P(R) × P(S) / (B-1)
  • SMJ: 3 × (P(R) + P(S))
  • HPJ: 3 × (P(R) + P(S))

Where P(R) = number of pages in table R and B = number of buffer pages

Q: Will queries cost more than Project 1?

Your $50 credit is enough for both projects. Tips to manage costs:

  • Check query size before running (top-right in console)
  • Use LIMIT for testing
  • Cache results in Python variables
  • Don't SELECT * on large tables

Project Relationship & Requirements

Q: How many questions do we add?

Add 2-3 new questions that specifically require your uploaded table to answer. These should build on your Project 1 questions.

Q: How many tables do we have now?

Total of 4+ tables: Your original 3+ tables from Project 1 + 1 new uploaded table (>50MB).

Q: Can I extend Project 1?

Yes! Please copy the corresponding cells to the NEW notebook. If you are working with a partner, pick one of your Project 1s to extend (if you want). Alternatively, you can choose a completely new dataset - it still needs to meet the 500MB+ requirement with 3+ tables.

Q: What is the point of the new dataset?

To enrich your analysis! Knowing that you are adding an additional dataset, you should direct your queries/visualizations to include that dataset. The goal is to gain deeper insights by combining your original data with complementary external data.

Q: What if my Project 1 already had visualizations?

Project 2 visualizations should be NEW and more advanced, using insights from your uploaded table and ML models. Don't reuse Project 1 visualizations.

Q: Can I reuse Project 1 content?

✅ REUSE: Project overview, table descriptions, setup code
❌ MUST BE NEW: All 6 queries, all 3 visualizations, ML model

Q: What counts as a "non-simple" query?

Rule of thumb: Combine 1-3 patterns from SQL Problem Solving for deeper insights.

Why: Practice for tests and interviews where you'll combine multiple SQL techniques.

Examples: Funnel + JOINs, Ladder + CTEs, Timeline + Comparison patterns, or Exception Finder techniques.

Minimum requirement: Any query using 2+ of {JOINs, CTEs, window functions, subqueries, complex CASE} counts as non-simple.
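For instance, this hypothetical query clears that bar by combining a CTE, a JOIN, and a window function (all table and column names are made up for illustration):

-- Hypothetical schema: trips(station_id, trip_date), stations(station_id, name)
WITH daily AS (
  SELECT station_id, trip_date, COUNT(*) AS num_trips
  FROM `your_dataset.trips`
  GROUP BY station_id, trip_date
)
SELECT
  s.name,
  d.trip_date,
  d.num_trips,
  -- 7-day moving average of daily trips per station
  AVG(d.num_trips) OVER (
    PARTITION BY d.station_id
    ORDER BY d.trip_date
    ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
  ) AS rolling_7day_avg
FROM daily AS d
JOIN `your_dataset.stations` AS s USING (station_id);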

Q: Do I submit the ML warm-up notebook?

No, just complete it for practice. Only submit your main project notebook with all 6 sections.

Query & Code Specifics

Q: What counts as "one query"?

Even complex queries with multiple CTEs count as one query:

-- This is ONE query (even with multiple CTEs):
WITH temp1 AS (SELECT * FROM table1),
     temp2 AS (SELECT * FROM table2),
     temp3 AS (SELECT * FROM table3)
SELECT *
FROM temp1
JOIN temp2 USING (id)  -- join keys (`id`) are illustrative
JOIN temp3 USING (id);

Q: Can I use CREATE statements?

Yes, CREATE TABLE/VIEW statements are allowed. Any SELECT queries you use inside your CREATE statements also count toward your 6 required queries.

Example: CREATE TABLE top_users AS (SELECT...) - the SELECT portion counts as 1 of your 6 queries.
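A minimal sketch of that pattern, with hypothetical table and column names:

-- The SELECT below counts as one of your 6 required queries
CREATE TABLE `your_dataset.top_users` AS
SELECT user_id, COUNT(*) AS num_orders
FROM `your_dataset.orders`
GROUP BY user_id
ORDER BY num_orders DESC
LIMIT 100;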

Q: What makes a good join query?

Your joins must demonstrate new insights, not just technical ability.

✗ BAD

"Join just to show I can join"

✓ GOOD

"Join weather data to show how rain affects ride patterns"

Q: How should visualizations connect to queries?

Each chart must directly illustrate findings from a specific query. Include captions like: "This chart shows results from Query #3, revealing that..."

Data & Model Specifics

Q: What if my supplementary table doesn't have matching keys?

You may need to create synthetic keys. Common approaches (one is sketched after this list):

  • Match on date ranges (e.g., same week/month)
  • Geographic regions (zip codes, cities)
  • Fuzzy string matching for names
  • Aggregated groupings that align between tables
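For example, this sketch aligns two tables at month granularity when they share no key (all names are hypothetical):

-- Join rides to monthly weather via a synthetic month key
SELECT
  DATE_TRUNC(r.ride_date, MONTH) AS ride_month,
  w.avg_rainfall,
  COUNT(*) AS num_rides
FROM `your_dataset.rides` AS r
JOIN `your_dataset.monthly_weather` AS w
  ON DATE_TRUNC(r.ride_date, MONTH) = w.month
GROUP BY ride_month, w.avg_rainfall;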

Q: How do I split data for training/testing?

Common approaches in BigQuery:

-- 80/20 random split (caveat: RAND() is re-evaluated on every query,
-- so these two sets are not stable across runs; materialize them,
-- or use the deterministic hash split below)
WHERE RAND() < 0.8 -- for training
WHERE RAND() >= 0.8 -- for testing

-- Deterministic 80/20 split (`id` is a placeholder key column)
WHERE MOD(ABS(FARM_FINGERPRINT(CAST(id AS STRING))), 10) < 8 -- for training
WHERE MOD(ABS(FARM_FINGERPRINT(CAST(id AS STRING))), 10) >= 8 -- for testing

-- Time-based split
WHERE date < '2024-01-01' -- for training
WHERE date >= '2024-01-01' -- for testing

Q: What makes a good ML prediction target?

Pick something product-relevant that you can't trivially compute from existing columns.

✓ GOOD Prediction Targets
  • Predicting tip amount from trip features
  • Classifying high-value customers
  • Forecasting crime rates by area
  • Predicting user engagement metrics
✗ BAD Prediction Targets
  • Predicting unique IDs
  • Predicting constants
  • Predicting from perfectly correlated features
  • Predicting random noise

Q: How detailed should my query analysis be?

Include actual calculations with numbers. Example:

Current (1GB tables; P(R) = 1,000 pages, P(S) = 2,000 pages, B = 100 buffer pages):
BNLJ cost = 1,000 + 1,000 × 2,000 / (100-1) ≈ 21,000 I/Os

At 100x scale (100GB):
BNLJ cost = 100,000 + 100,000 × 200,000 / 99 ≈ 202 million I/Os
SMJ cost = 3 × (100,000 + 200,000) = 900,000 I/Os
Conclusion: SMJ becomes roughly 200x more efficient at scale

Q: What if I can't find a 50MB+ related dataset?

Several strategies:

  • Aggregate multiple smaller related tables into one
  • Export a subset of a BigQuery public dataset, filtered to keep relevant data (see the EXPORT DATA sketch after this list)
  • Generate synthetic data that complements your analysis
  • Use time-series expansions of existing data
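For the export route, BigQuery's EXPORT DATA statement writes query results straight to Cloud Storage as CSV, ready to re-upload to your dataset. A minimal sketch, assuming a bucket you own (source table, columns, and filter are placeholders):

EXPORT DATA OPTIONS (
  uri = 'gs://your-bucket/exports/subset-*.csv',  -- wildcard lets BigQuery shard large outputs
  format = 'CSV',
  overwrite = true,
  header = true
) AS
SELECT column1, column2
FROM `bigquery-public-data.some_dataset.some_table`
WHERE relevant_filter;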

AI Usage & Verification

Q: How much AI use is appropriate for Project 2?

Remember our dual competency goal (see AI Policy). Use AI to:

  • Accelerate ML model creation and debugging
  • Generate visualization code templates
  • Help with query optimization calculations
  • Debug BigQuery ML syntax errors

But: If you can't explain your model choices, query optimizations, or visualization insights without AI, you used too much. Always verify you understand the underlying concepts.

Build specific anecdotes! Document how you caught AI errors in ML predictions, fixed visualization logic, or corrected optimization calculations. These stories demonstrate real understanding.

Q: How should I document AI usage for this project?

Focus on verification and debugging! Examples of good documentation:

  • ML debugging: "AI suggested logistic regression for continuous target; corrected to linear regression"
  • Visualization fixes: "AI's scatter plot had swapped axes; verified with sample data and fixed"
  • Query optimization: "AI's I/O calculation missed memory factor; recalculated with proper formula"
  • Feature engineering: "AI suggested perfect correlate as feature; removed after checking correlation matrix"

Q: Which AI tools are best for data science tasks?

Use free tiers strategically:

  • ChatGPT/Claude: Good for ML concepts and debugging
  • Google Gemini: Excellent for BigQuery ML syntax
  • GitHub Copilot: Great for Python visualization code
  • Perplexity: Helpful for finding relevant datasets

Remember: The goal is to learn with AI assistance, not to have AI do the project for you. Use it as a teaching assistant, not a ghostwriter.

Q: How do I verify AI-generated ML models are correct?

Always double-check:

  • Model type matches your target (regression vs classification)
  • Features make logical sense (no data leakage)
  • Train/test split is properly implemented
  • Metrics are appropriate for your problem
  • Results are reasonable (not too good to be true)

If your model has 99.9% accuracy, you probably have data leakage!
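One quick leakage check you can run directly in SQL: correlations near +/-1 between a feature and the label are a red flag. A minimal sketch with placeholder names:

-- Values near +/-1 suggest the feature leaks (or trivially encodes) the label
SELECT
  CORR(feature1, target_column) AS corr_feature1,
  CORR(feature2, target_column) AS corr_feature2
FROM `your_dataset.your_table`;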

Partnership Policy

Q: Can I work with a partner?

Yes! Teams of 2 are allowed, or you can work solo. See our comprehensive Partnership Policy for details on collaboration, Google Cloud permissions, Gradescope submission, late day policies, and conflict resolution.

Submission Instructions

Submit to Gradescope:

Honor Code

You are expected to follow the Stanford Honor Code. You are allowed to discuss general approaches and issues with other students in the class, but any code you write must be original. Violations include, but are not limited to: