SQL for Complex Types

Good For Interviews; Syntax Optional For Tests

Core Concept

SQL isn't just for INT/STRING tables. It handles maps, genomes, documents, and AI.

SQL Powers Everything: From Maps to DNA to AI Geographic Data Stanford Mickey Minnie Daffy ST_DWITHIN(location, point, 5000) Genomic Data Reference: ACTGACTGACTG Sample 1: ACTCACTGACTG 1 mutation Sample 2: ACTGACTCACTC 2 mutations Sample 3: ACTGACTGACTG Perfect match = ACTG EDIT_DISTANCE(seq1, seq2) < 3 JSON/Document Data { "user_id": 1, "preferences": { "theme": "dark", "premium": true } } → $.preferences.theme JSON_EXTRACT_SCALAR(data, '$.path') Vector/AI Data Song A Song B Song C User taste Vectors: [0.2, -0.5, 0.8, ...] [0.3, -0.4, 0.7, ...] ML.DISTANCE(v1, v2, 'COSINE') All Still SQL! SELECT ... FROM ... WHERE ... ORDER BY ...

1. Geographic Queries: Finding Things Nearby

-- Find all Spotify users within 5km of Stanford (BigQuery)
SELECT 
  u.name,
  u.email,
  -- ST_DISTANCE: Calculate distance between two geographic points in meters
  ST_DISTANCE(u.location, ST_GEOGPOINT(-122.1697, 37.4275)) AS distance_meters
FROM Users u
WHERE 
  -- ST_DWITHIN: Check if two points are within a specified distance
  ST_DWITHIN(
    u.location, 
    ST_GEOGPOINT(-122.1697, 37.4275),  -- Create a point from longitude, latitude
    5000  -- Distance in meters
  )
ORDER BY distance_meters;

2. JSON/Document Queries: Nested Data

-- Query nested user preferences stored as JSON
SELECT 
  user_id,
  -- JSON_EXTRACT_SCALAR: Extract a scalar value from JSON using JSONPath
  JSON_EXTRACT_SCALAR(preferences, '$.theme') AS theme,
  JSON_EXTRACT_SCALAR(preferences, '$.premium') AS is_premium,
  -- Access array elements with [index] notation
  JSON_EXTRACT_SCALAR(preferences, '$.playlists[0].name') AS first_playlist,
  -- JSON_ARRAY_LENGTH: Count elements in a JSON array  
  JSON_ARRAY_LENGTH(preferences, '$.playlists') AS playlist_count
FROM Users
WHERE 
  -- Filter based on JSON field value
  JSON_EXTRACT_SCALAR(preferences, '$.premium') = 'true';

3. Vector/AI Queries: Similarity Search

-- Find songs similar to user's taste using embeddings (BigQuery ML)
WITH user_taste AS (
  -- Get a representative embedding from highly-rated songs
  SELECT ARRAY_AGG(s.embedding ORDER BY l.rating DESC LIMIT 1)[OFFSET(0)] AS taste_vector
  FROM Listens l
  JOIN Songs s ON l.song_id = s.song_id
  WHERE l.user_id = 1 AND l.rating > 4
SELECT 
  s.title,
  s.artist,
  -- ML.DISTANCE: Calculate distance between two vectors
  -- COSINE distance (0=identical, 1=opposite, common for text/audio)
  ML.DISTANCE(s.embedding, ut.taste_vector, 'COSINE') AS similarity_score
FROM Songs s, user_taste ut
WHERE s.song_id NOT IN (
  SELECT song_id FROM Listens WHERE user_id = 1  -- Exclude already heard songs
)
ORDER BY similarity_score  -- Lower score = more similar
LIMIT 10;

4. User-Defined Functions: Custom Logic

Create custom functions to transform your data: Add business logic directly in SQL.

Use Custom Functions

-- Score artists with custom functions
SELECT 
  artist,

  -- Simple SQL function for scoring
  scorePopularity(
    COUNT(*), 
    AVG(rating)
  ) AS popularity_score,

  -- JavaScript function for image analysis  
  getDominantColor(
    ANY_VALUE(album_cover_rgb)
  ) AS mood_color,

  COUNT(*) AS total_plays
FROM Songs s
JOIN Listens l ON s.song_id = l.song_id
WHERE release_year = 2024
GROUP BY artist
ORDER BY popularity_score DESC;

Define Functions

-- Simple SQL function
CREATE TEMP FUNCTION scorePopularity(
  plays INT64, avg_rating FLOAT64
)
RETURNS FLOAT64
AS (
  plays * 0.3 + avg_rating * 100
);

-- JavaScript for complex logic
CREATE TEMP FUNCTION getDominantColor(
  rgb ARRAY<STRUCT<r INT64, g INT64, b INT64>>
)
RETURNS STRING
LANGUAGE js AS r"""
  // Average RGB VALUES
  let r = 0, g = 0, b = 0;
  for (let pixel of rgb) {
    r += pixel.r;
    g += pixel.g; 
    b += pixel.b;
  }
  r /= rgb.length;
  g /= rgb.length;
  b /= rgb.length;

  // Return dominant color
  if (r > g && r > b) return 'Warm';
  if (b > r && b > g) return 'Cool';
  return 'Neutral';
""";

Key Concepts:


5. Biological Analysis: DNA Pattern Matching

Custom functions for genomic data: Find patterns, calculate GC content, translate codons.

Analyze DNA Sequences

-- Analyze genetic sequences for research
SELECT 
  sample_id,
  organism,
  sequence,

  -- GC content: 40-60% is typical
  -- Extremes indicate special adaptations
  calculateGCContent(sequence) AS gc_percent,

  -- Get complement for replication studies
  getDNAComplement(
    SUBSTR(sequence, 1, 12)
  ) AS complement_preview,

  -- Find potential genes
  findORF(sequence) AS gene_location,

  -- Classify organism by GC content
  CASE 
    WHEN calculateGCContent(sequence) < 30 
      THEN 'AT-rich (parasitic?)'
    WHEN calculateGCContent(sequence) > 70 
      THEN 'GC-rich (thermophile?)'
    ELSE 'Normal range'
  END AS classification

FROM genetic_samples
WHERE LENGTH(sequence) > 100
ORDER BY gc_percent DESC;

Define Bio Functions

-- Calculate GC content (% of G+C bases)
CREATE TEMP FUNCTION calculateGCContent(
  sequence STRING
)
RETURNS FLOAT64
AS (
  LENGTH(REGEXP_REPLACE(sequence, '[^GC]', '')) 
  * 100.0 / LENGTH(sequence)
);

-- Get complement strand (A↔T, G↔C)
CREATE TEMP FUNCTION getDNAComplement(
  sequence STRING
)
RETURNS STRING
LANGUAGE js AS r"""
  // Apply Watson-Crick pairing rules
  return sequence
    .replace(/A/g, 't')
    .replace(/T/g, 'a')
    .replace(/G/g, 'c')
    .replace(/C/g, 'g')
    .toUpperCase();
""";

-- Find genes (ATG start → stop codon)
CREATE TEMP FUNCTION findORF(dna STRING)
RETURNS STRUCT<start INT64, len INT64>
LANGUAGE js AS r"""
  // Find ATG start codon
  let start = dna.indexOf('ATG');
  if (start === -1) 
    return {start: -1, len: 0};

  // Find nearest stop codon
  let stops = ['TAA', 'TAG', 'TGA'];
  // ... loop to find closest stop ...

  return {start: start, len: geneLength};
""";

Key Concepts:


Same SQL, New Superpowers

The Pattern Stays the Same

Notice how every query above still uses the SQL you know: - SELECT columns or expressions - FROM tables (even if they contain exotic data) - WHERE conditions (using special functions) - ORDER BY to sort results

The core SQL structure never changes - we just add new functions for new data types!


Real-World Applications

Uber/Lyft 23andMe Spotify
PostGIS queries match riders with nearby drivers in real-time DNA similarity queries find your genetic relatives Vector queries power "songs like this" recommendations

Try It Yourself: BigQuery Public Datasets

-- Real queries you can run today (free tier)!

-- 1. Geographic: Find all Starbucks in San Francisco
SELECT name, address, geom
FROM `bigquery-public-data.geo_openstreetmap.planet_features`
WHERE feature_type = 'amenity' 
  AND tags['amenity'] = 'cafe'
  AND tags['name'] LIKE '%Starbucks%'
  AND ST_DWITHIN(geom, ST_GEOGPOINT(-122.4194, 37.7749), 10000);

-- 2. Genomic: Analyze COVID variants
SELECT variant_name, country, date, mutation_count
FROM `bigquery-public-data.covid19_genome_sequence.variants`
WHERE mutation_count > 10
ORDER BY date DESC;

-- 3. Weather: Temperature patterns
SELECT 
  name AS station_name,
  MAX(temp) AS max_temp,
  MIN(temp) AS min_temp,
  AVG(wind) AS avg_wind
FROM `bigquery-public-data.noaa_gsod.gsod2024`
WHERE mo = '01' AND da = '01'
GROUP BY name
LIMIT 10;

Different Databases, Different Syntax

Each Database Has Its Own "App Store"

The examples above use BigQuery. Other databases have similar features with slightly different syntax:

Database Geographic JSON Vectors Genomics
PostgreSQL PostGIS extension Native JSON/JSONB pgvector BioSQL
MySQL Spatial functions JSON functions - -
Snowflake Geospatial VARIANT type Vector functions -
SQLite SpatiaLite JSON1 extension - -

Core SQL stays the same, special functions vary. Check your database docs!


The Evolution Continues

SQL Through the Decades:
1970s: Numbers and text
1990s: + Dates and binary data
2000s: + XML and spatial
2010s: + JSON and arrays  
2020s: + Vectors and ML
2030s: + Quantum data?

Your SQL skills adapt to whatever comes next!