SQL's Complex Types

Good For Interviews; Syntax Optional For Tests

Core Concept

Example: Find users within 5km, match DNA sequences, query JSON, find similar songs - all with SQL!

SQL isn't just for INT/STRING tables. It handles maps, genomes, documents, and AI.

1. Geographic Queries: Finding Things Nearby

-- Find all Spotify users within 5km of Stanford (BigQuery)
SELECT 
  u.name,
  u.email,
  -- ST_DISTANCE: Calculate distance between two geographic points in meters
  ST_DISTANCE(u.location, ST_GEOGPOINT(-122.1697, 37.4275)) AS distance_meters
FROM Users u
WHERE 
  -- ST_DWITHIN: Check if two points are within a specified distance
  ST_DWITHIN(
    u.location, 
    ST_GEOGPOINT(-122.1697, 37.4275),  -- Create a point from longitude, latitude
    5000  -- Distance in meters
  )
ORDER BY distance_meters;

2. Vector/AI Queries: Similarity Search

-- Find songs similar to user's taste using embeddings (BigQuery ML)
WITH user_taste AS (
  -- Get a representative embedding from highly-rated songs
  SELECT ARRAY_AGG(s.embedding ORDER BY l.rating DESC LIMIT 1)[OFFSET(0)] AS taste_vector
  FROM Listens l
  JOIN Songs s ON l.song_id = s.song_id
  WHERE l.user_id = 1 AND l.rating > 4
)
SELECT 
  s.title,
  s.artist,
  -- ML.DISTANCE: Calculate distance between two vectors
  -- COSINE distance (0=identical, 1=opposite, common for text/audio)
  ML.DISTANCE(s.embedding, ut.taste_vector, 'COSINE') AS similarity_score
FROM Songs s, user_taste ut
WHERE s.song_id NOT IN (
  SELECT song_id FROM Listens WHERE user_id = 1  -- Exclude already heard songs
)
ORDER BY similarity_score  -- Lower score = more similar
LIMIT 10;

3. User-Defined Functions: Custom Logic

Create custom functions to transform your data: Add business logic directly in SQL.

Use Custom Functions

-- Score individual songs with custom functions
SELECT 
  s.title,
  s.artist,

  -- Secret algorithm for viral potential
  calculateViralScore(
    s.play_count, 
    s.skip_rate,
    s.share_count
  ) AS viral_potential,

  -- Analyze album art mood  
  getDominantColor(
    s.album_cover_rgb
  ) AS mood_color,

  s.play_count
FROM Songs s
WHERE s.release_year = 2024
  AND s.play_count > 1000
ORDER BY viral_potential DESC
LIMIT 100;

Define Functions

-- Secret viral scoring algorithm
CREATE TEMP FUNCTION calculateViralScore(
  plays INT64, skip_rate FLOAT64, shares INT64
)
RETURNS FLOAT64
AS (
  plays * 0.3 + shares * 10 - skip_rate * 500
);

-- JavaScript for complex logic
CREATE TEMP FUNCTION getDominantColor(
  rgb ARRAY<STRUCT<r INT64, g INT64, b INT64>>
)
RETURNS STRING
LANGUAGE js AS r"""
  // Average RGB VALUES
  let r = 0, g = 0, b = 0;
  for (let pixel of rgb) {
    r += pixel.r;
    g += pixel.g; 
    b += pixel.b;
  }
  r /= rgb.length;
  g /= rgb.length;
  b /= rgb.length;

  // Return dominant color
  if (r > g && r > b) return 'Warm';
  if (b > r && b > g) return 'Cool';
  return 'Neutral';
""";

Key Concepts:

UDFs: Process individual rows with custom logic
UDAFs: Also possible - create custom aggregations like weighted averages
Languages: Use SQL for simple math, JavaScript for complex operations
Reusability: Define once, use everywhere in your queries

4. Biological Analysis: DNA Pattern Matching

Custom functions for genomic data: Find patterns, calculate GC content, translate codons.

Analyze DNA Sequences

-- Analyze genetic sequences for research
SELECT 
  sample_id,
  organism,
  sequence,

  -- GC content: 40-60% is typical
  -- Extremes indicate special adaptations
  calculateGCContent(sequence) AS gc_percent,

  -- Get complement for replication studies
  getDNAComplement(
    SUBSTR(sequence, 1, 12)
  ) AS complement_preview,

  -- Find potential genes
  findORF(sequence) AS gene_location,

  -- Classify organism by GC content
  CASE 
    WHEN calculateGCContent(sequence) < 30 
      THEN 'AT-rich (parasitic?)'
    WHEN calculateGCContent(sequence) > 70 
      THEN 'GC-rich (thermophile?)'
    ELSE 'Normal range'
  END AS classification

FROM genetic_samples
WHERE LENGTH(sequence) > 100
ORDER BY gc_percent DESC;

Define Bio Functions

-- Calculate GC content (% of G+C bases)
CREATE TEMP FUNCTION calculateGCContent(
  sequence STRING
)
RETURNS FLOAT64
AS (
  LENGTH(REGEXP_REPLACE(sequence, '[^GC]', '')) 
  * 100.0 / LENGTH(sequence)
);

-- Get complement strand (A↔T, G↔C)
CREATE TEMP FUNCTION getDNAComplement(
  sequence STRING
)
RETURNS STRING
LANGUAGE js AS r"""
  // Apply Watson-Crick pairing rules
  return sequence
    .replace(/A/g, 't')
    .replace(/T/g, 'a')
    .replace(/G/g, 'c')
    .replace(/C/g, 'g')
    .toUpperCase();
""";

-- Find genes (ATG start → stop codon)
CREATE TEMP FUNCTION findORF(dna STRING)
RETURNS STRUCT<start INT64, len INT64>
LANGUAGE js AS r"""
  // Find ATG start codon
  let start = dna.indexOf('ATG');
  if (start === -1) 
    return {start: -1, len: 0};

  // Find nearest stop codon
  let stops = ['TAA', 'TAG', 'TGA'];
  // ... loop to find closest stop ...

  return {start: start, len: geneLength};
""";

Key Concepts:

GC Content: Ratio of G+C bases - affects DNA stability and melting temperature
Complementary Strands: DNA double helix with A-T and G-C base pairs
Open Reading Frames: Regions between start (ATG) and stop codons encoding proteins
Biological insights: Simple calculations reveal organism adaptations

Same SQL, New Superpowers

The Pattern Stays the Same

Notice how every query above still uses the SQL you know: - SELECT columns or expressions - FROM tables (even if they contain exotic data) - WHERE conditions (using special functions) - ORDER BY to sort results

The core SQL structure never changes - we just add new functions for new data types!

Real-World Applications

Uber/Lyft	23andMe	Spotify
PostGIS queries match riders with nearby drivers in real-time	DNA similarity queries find your genetic relatives	Vector queries power "songs like this" recommendations

Try It Yourself: BigQuery Public Datasets

-- Real queries you can run today (free tier)!

-- 1. Geographic: Find all Starbucks in San Francisco
SELECT name, address, geom
FROM `bigquery-public-data.geo_openstreetmap.planet_features`
WHERE feature_type = 'amenity' 
  AND tags['amenity'] = 'cafe'
  AND tags['name'] LIKE '%Starbucks%'
  AND ST_DWITHIN(geom, ST_GEOGPOINT(-122.4194, 37.7749), 10000);

-- 2. Genomic: Analyze COVID variants
SELECT variant_name, country, date, mutation_count
FROM `bigquery-public-data.covid19_genome_sequence.variants`
WHERE mutation_count > 10
ORDER BY date DESC;

-- 3. Weather: Temperature patterns
SELECT 
  name AS station_name,
  MAX(temp) AS max_temp,
  MIN(temp) AS min_temp,
  AVG(wind) AS avg_wind
FROM `bigquery-public-data.noaa_gsod.gsod2024`
WHERE mo = '01' AND da = '01'
GROUP BY name
LIMIT 10;

Different Databases, Different Syntax

Each Database Has Its Own "App Store"

The examples above use BigQuery. Other databases have similar features with slightly different syntax:

Database Geographic JSON Vectors Genomics

PostgreSQL PostGIS extension Native JSON/JSONB pgvector BioSQL

MySQL Spatial functions JSON functions - -

Snowflake Geospatial VARIANT type Vector functions -

SQLite SpatiaLite JSON1 extension - -

Core SQL stays the same, special functions vary. Check your database docs!

Database	Geographic	JSON	Vectors	Genomics
PostgreSQL	PostGIS extension	Native JSON/JSONB	pgvector	BioSQL
MySQL	Spatial functions	JSON functions	-	-
Snowflake	Geospatial	VARIANT type	Vector functions	-
SQLite	SpatiaLite	JSON1 extension	-	-

The Evolution Continues

SQL Through the Decades:
1970s: Numbers and text
1990s: + Dates and binary data
2000s: + XML and spatial
2010s: + JSON and arrays  
2020s: + Vectors and ML
2030s: + Quantum data?

Your SQL skills adapt to whatever comes next!