Project 1: BigQuery's First Heist

CS145: Stanford University, Fall 2025

Mission Briefing

Welcome to your first major challenge at Startup, Inc.! Dr. Gru is looking for the next big investment opportunity.

Your Mission

Pick Your Favorite Topic

This project is largely open-ended. Choose a dataset about something you're genuinely curious about!

Past favorites: COVID-19 spread • StackOverflow users • GitHub repos • NYC taxi tips • Crypto blockchain • Climate change • Election ads

Bonus-points example: "When is the best time to get away with murder?" – One student analyzed Chicago crime (1.32GB), EPA temperature (1.8GB), and GDELT news (204GB)!

Requirements Summary

Why This Project

Detailed Rubric

0. Readability & Organization Up to -15 points if poor

Section 1: Project Overview 5 points

# Section 1: Project Overview

Section 2: Dataset Analysis 10 points

# Section 2: Dataset Analysis
## Detailed Overview
-- SQL: Show table sizes, row counts
## Table Relationships
-- SQL: Explore keys and joins
## Data Issues
-- SQL: Check for NULLs, duplicates
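A data-issues check can be as simple as grouping on a candidate key. Here is a minimal sketch using the StackOverflow public dataset; swap in your own table and key column:

```sql
-- Sketch: duplicate check on a candidate key.
-- An empty result means id is unique in this table.
SELECT id, COUNT(*) AS copies
FROM `bigquery-public-data.stackoverflow.posts_questions`
GROUP BY id
HAVING COUNT(*) > 1
LIMIT 10;
```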

Section 3: Get Your Feet Wet 35 points

# Section 3: Get Your Feet Wet
## Subqueries
-- 2 queries with scoped variables
## CTEs
-- 2 queries with WITH clauses
## Window Functions
-- 3 queries with OVER, including RANK vs ROW_NUMBER
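One of the three window-function queries might look like this sketch, which ranks each user's questions by score. Table and column names are from the public StackOverflow schema; adapt them to your dataset:

```sql
-- Sketch: per-user ranking with PARTITION BY (columns assumed).
SELECT
  owner_user_id,
  id AS question_id,
  score,
  ROW_NUMBER() OVER (PARTITION BY owner_user_id ORDER BY score DESC) AS nth_best
FROM `bigquery-public-data.stackoverflow.posts_questions`
WHERE owner_user_id IS NOT NULL
LIMIT 20;
```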

Section 4: Exploring Central Questions 30 points

# Section 4: Exploring Central Questions
## Question 1: [Your question]
-- SQL query and analysis
## Question 2: [Your question]
-- SQL query and analysis
## Question 3: [Your question]
-- SQL query and analysis

Section 5: Takeaways 10 points

# Section 5: Takeaways

Bonus Up to 5 points

Frequently Asked Questions


Questions & Analysis

Q: What makes a good question?

Follow this formula: [ACTION VERB] + [SPECIFIC METRIC] + [COMPARISON/CONDITION] + [TIMEFRAME/SEGMENT]

GOOD Questions
  • IDENTIFY drivers WHO earn less on rainy days DESPITE driving more hours
  • FIND neighborhoods WHERE crime drops WHEN temperature > 90°F
  • DETECT StackOverflow users WHO ask 80% of questions BUT never answer
BAD Questions
  • How many crimes happened?
  • What is the average trip distance?
  • List all users with >100 posts
Each question scores 0-3 points:
  • 1 point: Uses 2+ tables with proper JOIN
  • 1 point: Includes WHERE/HAVING with 2+ conditions
  • 1 point: Output could change a decision/action
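As an illustration, a query that would earn all three points might look like the sketch below: it joins two tables, filters on two conditions, and produces an actionable list of prolific high-reputation askers. Column names are assumed from the public StackOverflow schema:

```sql
-- Sketch: a 3-point question query (JOIN + 2 conditions + actionable output).
SELECT u.id, u.display_name, COUNT(q.id) AS questions_asked
FROM `bigquery-public-data.stackoverflow.users` AS u
JOIN `bigquery-public-data.stackoverflow.posts_questions` AS q
  ON q.owner_user_id = u.id
WHERE u.reputation > 1000
  AND q.creation_date >= TIMESTAMP '2022-01-01'
GROUP BY u.id, u.display_name
HAVING COUNT(q.id) >= 50
ORDER BY questions_asked DESC
LIMIT 20;
```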

Q: What's the difference between key and supplementary questions?

Key questions are specific and directly answerable with SQL queries. Supplementary questions provide broader context but may not be directly queryable.

Example for "When's the best time to get away with murder?":

  • Key: When do most murders happen? How fast do cops respond? Which neighborhoods have more murders at certain times?
  • Supplementary: How do murder rates change with seasons? What weather sees most murders? What social factors impact rates?

Q: Can one query count for multiple requirements?

No. You need 12 distinct queries minimum. Each technique requirement needs its own query.

Q: How much output should I show?

Show 10-20 rows per query using LIMIT. For aggregations, show the complete summary.

Q: Can I use Python/Pandas for analysis?

No. This is a SQL-only project; no Python analysis or visualizations are required.

Datasets & BigQuery

Q: What datasets do other students typically use?

Popular BigQuery public datasets that meet our requirements (500MB+, 3+ tables):

StackOverflow

Dataset: bigquery-public-data.stackoverflow

Tables: posts, users, comments, badges

Size: 50GB+

Good for: User behavior, expertise patterns

NYC Taxi Trips

Dataset: bigquery-public-data.new_york_taxi_trips

Tables: tlc_yellow_trips, tlc_green_trips

Size: 100GB+

Good for: Travel patterns, tip analysis

Chicago Crime

Dataset: bigquery-public-data.chicago_crime

Tables: crime, socioeconomic_data

Size: 1.5GB+

Good for: Crime patterns, correlations

GitHub Activity

Dataset: bigquery-public-data.github_repos

Tables: commits, files, contents, languages

Size: 3TB+ (use samples!)

Good for: Code patterns, language trends

COVID-19 Data

Dataset: bigquery-public-data.covid19_*

Tables: Multiple regional datasets

Size: Varies by region

Good for: Spread patterns, policy impacts

NOAA Weather

Dataset: bigquery-public-data.noaa_*

Tables: gsod, stations, hurricanes

Size: 100GB+

Good for: Climate trends, weather impacts

Remember: Choose what excites YOU! These are just popular starting points. The best projects come from genuine curiosity about the data.

Q: Does the project have to be villainous/criminal themed?

No! The "heist" theme is just for fun. Past students have analyzed COVID, cancer research, crypto, climate change, and many other serious topics. Choose a dataset that genuinely interests you.

Q: Should I copy BigQuery tables to my personal dataset?

No - always query directly using dataset references! You can't realistically copy massive production datasets (100TB-PB scale). Get comfortable working with dataset references like bigquery-public-data.stackoverflow - this is how you'll work with production data in industry. Star the datasets in the console for easy access.

Q: How do I check my dataset size?

In BigQuery console: Dataset → Details → Size. Make sure total is ≥ 500MB.

Q: How much will BigQuery queries cost? Will I run out of credits?

Don't worry - you have plenty of credits! Here's what you need to know:

  • Cost: ~$5 per TB (1000 GB) processed
  • Your credits: CS145 provides $50 - enough for 10TB of queries (plenty for Project 1 & 2)
  • Check before running: Look at top-right of query editor for "This query will process X GB"
  • Avoid surprises: For bigger datasets like GitHub (3TB+), always use TABLESAMPLE or date filters. Don't use SELECT *; project only the columns you need, since BigQuery bills by the columns it scans. Use LIMIT 100 to keep output readable (note: LIMIT reduces output rows, not bytes billed).
  • Monitor usage: BigQuery Console → Administration → Billing → View detailed charges
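Putting those tips together, a cheap exploratory query filters early and selects only what it needs. This sketch uses the NYC taxi dataset; the year-suffixed table name and columns are illustrative, so check the schema in the console first:

```sql
-- Sketch: date filter + narrow column projection to limit bytes scanned.
SELECT pickup_datetime, tip_amount
FROM `bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2018`
WHERE pickup_datetime >= TIMESTAMP '2018-06-01'
LIMIT 100;
```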

Q: What if my dataset has no formal keys?

Explain the implicit relationships between tables, discuss why formal keys might be missing, and describe how you handle potential data quality issues.

SQL Technical

Q: What exactly are "scoped variables" in subqueries?

Variables from the outer query that are referenced in the inner query's WHERE or JOIN clause. See Correlated Subqueries for detailed examples. For instance, WHERE l2.user_id = u.user_id correlates the inner query to the outer query's current user.
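A minimal sketch of a correlated subquery, with columns assumed from the public StackOverflow schema:

```sql
-- Sketch: the inner query references u.creation_date from the outer query,
-- so it is re-evaluated per outer row (a "scoped variable").
SELECT u.id, u.display_name
FROM `bigquery-public-data.stackoverflow.users` AS u
WHERE u.reputation >
  (SELECT AVG(u2.reputation)
   FROM `bigquery-public-data.stackoverflow.users` AS u2
   WHERE u2.creation_date > u.creation_date)  -- scoped variable from outer query
LIMIT 10;
```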

Q: My queries are expensive/slow. What should I do?

Use date filters like WHERE date > '2024-01-01', add LIMIT clauses, or use table sampling: FROM table TABLESAMPLE SYSTEM (10 PERCENT)
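For example, block sampling on a large table looks like this (the GitHub commits table is used for illustration):

```sql
-- Sketch: TABLESAMPLE reads ~10% of the table's blocks,
-- cutting bytes scanned while exploring.
SELECT COUNT(*) AS sampled_commits
FROM `bigquery-public-data.github_repos.commits` TABLESAMPLE SYSTEM (10 PERCENT);
```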

Q: How do I make RANK and ROW_NUMBER differ?

ORDER BY a non-unique column to create ties. RANK will skip numbers after ties (1,1,3), while ROW_NUMBER won't (1,2,3).
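A tiny self-contained sketch makes the difference visible:

```sql
-- Sketch: ties on score make RANK and ROW_NUMBER diverge.
WITH t AS (
  SELECT 'a' AS name, 10 AS score UNION ALL
  SELECT 'b', 10 UNION ALL
  SELECT 'c', 8
)
SELECT
  name,
  RANK()       OVER (ORDER BY score DESC) AS rnk,  -- 1, 1, 3
  ROW_NUMBER() OVER (ORDER BY score DESC) AS rn    -- 1, 2, 3
FROM t;
```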

Q: What counts as a "window function" for the requirements?

You must use OVER or PARTITION BY keywords. Simple aggregations like AVG(col) don't count. Valid examples: ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY date), RANK() OVER (ORDER BY score DESC). See the window functions guide.

Q: How do I count non-null values in a column?

Use COUNT(column_name) which only counts non-NULL values, unlike COUNT(*) which counts all rows including those with NULLs.
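For example (column assumed from the public StackOverflow schema, where owner_user_id can be NULL for deleted users):

```sql
-- Sketch: COUNT(*) vs COUNT(col) on a nullable column.
SELECT
  COUNT(*) AS all_rows,
  COUNT(owner_user_id) AS rows_with_owner,         -- NULLs excluded
  COUNT(*) - COUNT(owner_user_id) AS null_owners
FROM `bigquery-public-data.stackoverflow.posts_questions`;
```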

AI Usage & Debug Tables

Q: How much AI use is too much?

Remember our dual competency goal (see AI Policy). Use AI to:

  • Accelerate learning and explore more complex queries
  • Debug tricky syntax errors quickly
  • Brainstorm creative approaches to problems

But: If you can't explain your query on a whiteboard without AI, you used too much. Always verify you understand by recreating key queries without AI assistance.

Build specific anecdotes with evidence! Surface-level answers like "I used AI to help" won't cut it in interviews. Go deeper - explain the semantic issue you caught, why the AI's approach failed, how you debugged it. These detailed stories show how comfortable and productive you'll be, and your ability to ask the right questions. Remember: jobs don't need expensive copy-pasters!

Q: How should I document my AI usage in the Takeaways?

Focus on your debugging process for verifying correctness! Document how you caught and fixed AI mistakes. See Learning Outcomes for sample starter anecdotes.

Note: AI may or may not make errors on your first simple queries. That's OK! The point is to build the verification habit now, so you're ready when tackling complex queries where AI is more likely to hallucinate or miss edge cases.

What to document:

  • Speedup: "3-level CTE with window functions: 45 mins by hand vs 3 mins with Claude + 10 mins debugging"
  • Verification methods: "Used debug tables to trace JOIN logic, found AI's LEFT JOIN kept NULLs in COUNT(*)"
  • Hallucinations caught: "AI referenced CTE before it was defined" or "Window function in WHERE clause"
  • Prompt refinements: "First prompt gave global average, had to specify 'per-user average' for correlated subquery"
  • Debugging process: "Query returned duplicates, traced through GROUP BY logic on paper, discovered missing DISTINCT"

Q: What are "debug tables" and why do I need them on paper?

You've already seen debug tables in SELECT-FROM-WHERE and LLM Debug lessons – those row-by-row traces showing how SQL executes.

For your project, create similar debug tables on paper to verify your complex queries work correctly. This is EXACTLY what you'll do in interviews and exams where you can't run queries. Practice this now while you can verify your work against actual results.

Q: How do I draw debug tables when my query processes 100,000+ rows?

Use shorthand notation! You don't trace every row - use representative samples:

  • Sample approach: Show 3-5 representative rows, then notation like "... (10K more rows)"
  • GROUP BY shorthand: "genre='Rock': 3 rows shown + ...(497 more) = 500 total"
  • Edge cases focus: Include MIN, MAX, NULL, and tie values that affect your logic
  • Aggregation notation: Write "AVG(user_1): (4.5+4.2+3.9)/3 = 4.2" instead of listing all
  • Pattern indicators: Use "↓" or "..." to show patterns continue

Example: For RANK() on 1000 users, show top 3, ties if any, and bottom 2. That's enough to verify logic!

Q: Which LLMs should I use? Do you provide credits?

No credits provided – use free tiers! Recommended options:

  • ChatGPT (free): Good for SQL basics and debugging
  • Claude.ai (free): Strong at explaining complex queries
  • Google Gemini (free): Helpful for BigQuery-specific syntax
  • GitHub Copilot: Free with student account

Tips for free tiers:

  • Ask focused questions, not "write my whole project"
  • Use AI for learning concepts, then write code yourself
  • If you hit limits, that's a feature! Time to practice without AI

Submission Instructions

Submit to Gradescope:

Honor Code

You must follow the Stanford Honor Code: