SQL for Complex Types
Good For Interviews; Syntax Optional For Tests
Core Concept
- Example: Find users within 5km, match DNA sequences, query JSON, find similar songs - all with SQL!
SQL isn't just for INT/STRING tables. It handles maps, genomes, documents, and AI.
1. Geographic Queries: Finding Things Nearby
-- Find all Spotify users within 5km of Stanford (BigQuery)
SELECT
u.name,
u.email,
-- ST_DISTANCE: Calculate distance between two geographic points in meters
ST_DISTANCE(u.location, ST_GEOGPOINT(-122.1697, 37.4275)) AS distance_meters
FROM Users u
WHERE
-- ST_DWITHIN: Check if two points are within a specified distance
ST_DWITHIN(
u.location,
ST_GEOGPOINT(-122.1697, 37.4275), -- Create a point from longitude, latitude
5000 -- Distance in meters
)
ORDER BY distance_meters;
2. JSON/Document Queries: Nested Data
-- Query nested user preferences stored as JSON
SELECT
user_id,
-- JSON_EXTRACT_SCALAR: Extract a scalar value from JSON using JSONPath
JSON_EXTRACT_SCALAR(preferences, '$.theme') AS theme,
JSON_EXTRACT_SCALAR(preferences, '$.premium') AS is_premium,
-- Access array elements with [index] notation
JSON_EXTRACT_SCALAR(preferences, '$.playlists[0].name') AS first_playlist,
-- JSON_ARRAY_LENGTH: Count elements in a JSON array
JSON_ARRAY_LENGTH(preferences, '$.playlists') AS playlist_count
FROM Users
WHERE
-- Filter based on JSON field value
JSON_EXTRACT_SCALAR(preferences, '$.premium') = 'true';
3. Vector/AI Queries: Similarity Search
-- Find songs similar to user's taste using embeddings (BigQuery ML)
WITH user_taste AS (
-- Get a representative embedding from highly-rated songs
SELECT ARRAY_AGG(s.embedding ORDER BY l.rating DESC LIMIT 1)[OFFSET(0)] AS taste_vector
FROM Listens l
JOIN Songs s ON l.song_id = s.song_id
WHERE l.user_id = 1 AND l.rating > 4
SELECT
s.title,
s.artist,
-- ML.DISTANCE: Calculate distance between two vectors
-- COSINE distance (0=identical, 1=opposite, common for text/audio)
ML.DISTANCE(s.embedding, ut.taste_vector, 'COSINE') AS similarity_score
FROM Songs s, user_taste ut
WHERE s.song_id NOT IN (
SELECT song_id FROM Listens WHERE user_id = 1 -- Exclude already heard songs
)
ORDER BY similarity_score -- Lower score = more similar
LIMIT 10;
4. User-Defined Functions: Custom Logic
Create custom functions to transform your data: Add business logic directly in SQL.
Use Custom Functions
-- Score artists with custom functions
SELECT
artist,
-- Simple SQL function for scoring
scorePopularity(
COUNT(*),
AVG(rating)
) AS popularity_score,
-- JavaScript function for image analysis
getDominantColor(
ANY_VALUE(album_cover_rgb)
) AS mood_color,
COUNT(*) AS total_plays
FROM Songs s
JOIN Listens l ON s.song_id = l.song_id
WHERE release_year = 2024
GROUP BY artist
ORDER BY popularity_score DESC;
Define Functions
-- Simple SQL function
CREATE TEMP FUNCTION scorePopularity(
plays INT64, avg_rating FLOAT64
)
RETURNS FLOAT64
AS (
plays * 0.3 + avg_rating * 100
);
-- JavaScript for complex logic
CREATE TEMP FUNCTION getDominantColor(
rgb ARRAY<STRUCT<r INT64, g INT64, b INT64>>
)
RETURNS STRING
LANGUAGE js AS r"""
// Average RGB VALUES
let r = 0, g = 0, b = 0;
for (let pixel of rgb) {
r += pixel.r;
g += pixel.g;
b += pixel.b;
}
r /= rgb.length;
g /= rgb.length;
b /= rgb.length;
// Return dominant color
if (r > g && r > b) return 'Warm';
if (b > r && b > g) return 'Cool';
return 'Neutral';
""";
Key Concepts:
-
UDFs: Process individual rows with custom logic
-
UDAFs: Also possible - create custom aggregations like weighted averages
-
Languages: Use SQL for simple math, JavaScript for complex operations
-
Reusability: Define once, use everywhere in your queries
5. Biological Analysis: DNA Pattern Matching
Custom functions for genomic data: Find patterns, calculate GC content, translate codons.
Analyze DNA Sequences
-- Analyze genetic sequences for research
SELECT
sample_id,
organism,
sequence,
-- GC content: 40-60% is typical
-- Extremes indicate special adaptations
calculateGCContent(sequence) AS gc_percent,
-- Get complement for replication studies
getDNAComplement(
SUBSTR(sequence, 1, 12)
) AS complement_preview,
-- Find potential genes
findORF(sequence) AS gene_location,
-- Classify organism by GC content
CASE
WHEN calculateGCContent(sequence) < 30
THEN 'AT-rich (parasitic?)'
WHEN calculateGCContent(sequence) > 70
THEN 'GC-rich (thermophile?)'
ELSE 'Normal range'
END AS classification
FROM genetic_samples
WHERE LENGTH(sequence) > 100
ORDER BY gc_percent DESC;
Define Bio Functions
-- Calculate GC content (% of G+C bases)
CREATE TEMP FUNCTION calculateGCContent(
sequence STRING
)
RETURNS FLOAT64
AS (
LENGTH(REGEXP_REPLACE(sequence, '[^GC]', ''))
* 100.0 / LENGTH(sequence)
);
-- Get complement strand (AāT, GāC)
CREATE TEMP FUNCTION getDNAComplement(
sequence STRING
)
RETURNS STRING
LANGUAGE js AS r"""
// Apply Watson-Crick pairing rules
return sequence
.replace(/A/g, 't')
.replace(/T/g, 'a')
.replace(/G/g, 'c')
.replace(/C/g, 'g')
.toUpperCase();
""";
-- Find genes (ATG start ā stop codon)
CREATE TEMP FUNCTION findORF(dna STRING)
RETURNS STRUCT<start INT64, len INT64>
LANGUAGE js AS r"""
// Find ATG start codon
let start = dna.indexOf('ATG');
if (start === -1)
return {start: -1, len: 0};
// Find nearest stop codon
let stops = ['TAA', 'TAG', 'TGA'];
// ... loop to find closest stop ...
return {start: start, len: geneLength};
""";
Key Concepts:
-
GC Content: Ratio of G+C bases - affects DNA stability and melting temperature
-
Complementary Strands: DNA double helix with A-T and G-C base pairs
-
Open Reading Frames: Regions between start (ATG) and stop codons encoding proteins
-
Biological insights: Simple calculations reveal organism adaptations
Same SQL, New Superpowers
The Pattern Stays the Same
Notice how every query above still uses the SQL you know: - SELECT columns or expressions - FROM tables (even if they contain exotic data) - WHERE conditions (using special functions) - ORDER BY to sort results
The core SQL structure never changes - we just add new functions for new data types!
Real-World Applications
Uber/Lyft | 23andMe | Spotify |
---|---|---|
PostGIS queries match riders with nearby drivers in real-time | DNA similarity queries find your genetic relatives | Vector queries power "songs like this" recommendations |
Try It Yourself: BigQuery Public Datasets
-- Real queries you can run today (free tier)!
-- 1. Geographic: Find all Starbucks in San Francisco
SELECT name, address, geom
FROM `bigquery-public-data.geo_openstreetmap.planet_features`
WHERE feature_type = 'amenity'
AND tags['amenity'] = 'cafe'
AND tags['name'] LIKE '%Starbucks%'
AND ST_DWITHIN(geom, ST_GEOGPOINT(-122.4194, 37.7749), 10000);
-- 2. Genomic: Analyze COVID variants
SELECT variant_name, country, date, mutation_count
FROM `bigquery-public-data.covid19_genome_sequence.variants`
WHERE mutation_count > 10
ORDER BY date DESC;
-- 3. Weather: Temperature patterns
SELECT
name AS station_name,
MAX(temp) AS max_temp,
MIN(temp) AS min_temp,
AVG(wind) AS avg_wind
FROM `bigquery-public-data.noaa_gsod.gsod2024`
WHERE mo = '01' AND da = '01'
GROUP BY name
LIMIT 10;
Different Databases, Different Syntax
Each Database Has Its Own "App Store"
The examples above use BigQuery. Other databases have similar features with slightly different syntax:
Database Geographic JSON Vectors Genomics PostgreSQL PostGIS extension Native JSON/JSONB pgvector BioSQL MySQL Spatial functions JSON functions - - Snowflake Geospatial VARIANT type Vector functions - SQLite SpatiaLite JSON1 extension - - Core SQL stays the same, special functions vary. Check your database docs!
The Evolution Continues
SQL Through the Decades:
1970s: Numbers and text
1990s: + Dates and binary data
2000s: + XML and spatial
2010s: + JSON and arrays
2020s: + Vectors and ML
2030s: + Quantum data?
Your SQL skills adapt to whatever comes next!