SQL's Complex Types

Advanced Data Types

Before you jump into the deep end with a new system like Vector or Spatial databases, consider extending what you already have. PostgreSQL is more versatile than you think, with native support for JSONB, AI Vectors, and Geo-Spatial coordinates.

Good For Interviews; Syntax Optional For Tests

Core Concept

SQL isn't just for your basic INT or STRING tables. It's capable of handling maps, genomes, documents, and even AI.

SQL Powers Everything: From Maps to DNA to AI Geographic Data Stanford Mickey Minnie Daffy ST_DWITHIN(location, point, 5000) Genomic Data Reference: ACTGACTGACTG Sample 1: ACTCACTGACTG 1 mutation Sample 2: ACTGACTCACTC 2 mutations Sample 3: ACTGACTGACTG Perfect match = ACTG EDIT_DISTANCE(seq1, seq2) < 3 Special Data Types Images Time Series [1,2,3,4] Arrays {...} JSON Documents • UUID: a1b2c3d4-e5f6-... • Intervals: [2024-01-01, 2024-12-31] • Binary: 0x4D5A9000... UNNEST(array) • IMAGE_SIZE(blob) Generative AI (LLMs) PROMPT: "Explain these lyrics..." GEMINI RESPONSE: "This song depicts a journey of resilience..." "Categorized as: 'Inspirational'" AI.GENERATE(prompt, lyrics) All Still SQL! SELECT ... FROM ... WHERE ... ORDER BY ...

1. JSONB or NoSQL

Avoid the NoSQL Trap for the Wrong Reasons

Key idea:

  1. Your startup doesn't need a sprawling NoSQL database just because there's some JSON involved.
  2. One PostgreSQL node with JSONB can handle your MVP efficiently, providing ACID compliance long before NoSQL becomes necessary. Modern SQL offers the speed and reliability of relational data while natively supporting JSON's flexibility.
// DocumentDBs OR NoSQL DBs: Great for MVPs AND real-time (e.g., Firebase, MongoDB)
// But Harder for transactions & joins
// PostgreSQL JSON: Best of both worlds

// Instead of MongoDB's documents:
CREATE TABLE orders (
  id SERIAL PRIMARY KEY,
  user_id INTEGER REFERENCES users(id),
  items JSONB,  -- Flexible schema for prototyping where needed
  created_at TIMESTAMP
);
// Get consistency + flexibility

2. Geographic Queries: Finding Things Nearby

-- Find all Spotify users within 5km of Stanford (BigQuery)
SELECT 
  u.name,
  u.email,
  -- ST_DISTANCE: Calculate distance between two geographic points in meters
  ST_DISTANCE(u.location, ST_GEOGPOINT(-122.1697, 37.4275)) AS distance_meters
FROM Users u
WHERE 
  -- ST_DWITHIN: Check if two points are within a specified distance
  ST_DWITHIN(
    u.location, 
    ST_GEOGPOINT(-122.1697, 37.4275),  -- Create a point from longitude, latitude
    5000  -- Distance in meters
  )
ORDER BY distance_meters;

3. Generative AI Queries: Semantic Understanding

-- Analyze song lyrics and reviews with Gemini Pro (BigQuery)
SELECT 
  title,
  artist,
  -- AI.CLASSIFY: Automatically categorize the song vibe
  AI.CLASSIFY(lyrics, ["Happy", "Melancholic", "Energetic", "Chill"]) AS vibe,

  -- AI.GENERATE: Use an LLM to write a marketing summary
  AI.GENERATE(
    "Summarize the emotional journey of these lyrics IN one sentence", 
    lyrics
  ) AS ai_summary,

  -- AI.TRANSLATE: Localize the title for global markets
  AI.TRANSLATE(title, "Spanish") AS title_es
FROM Songs
-- AI.IF: Semantic filtering based on natural language meaning
WHERE AI.IF("The song IS about a summer romance", lyrics)
  AND play_count > 100000;

4. User-Defined Functions: Custom Logic

Embed custom functions to tailor your data transformations: Implement business logic directly in SQL.

Use Custom Functions

-- Score individual songs with custom functions
SELECT 
  s.title,
  s.artist,

  -- Secret algorithm for viral potential
  calculateViralScore(
    s.play_count, 
    s.skip_rate,
    s.share_count
  ) AS viral_potential,

  -- Analyze album art mood  
  getDominantColor(
    s.album_cover_rgb
  ) AS mood_color,

  s.play_count
FROM Songs s
WHERE s.release_year = 2024
  AND s.play_count > 1000
ORDER BY viral_potential DESC
LIMIT 100;

Define Functions

-- Secret viral scoring algorithm
CREATE TEMP FUNCTION calculateViralScore(
  plays INT64, skip_rate FLOAT64, shares INT64
)
RETURNS FLOAT64
AS (
  plays * 0.3 + shares * 10 - skip_rate * 500
);

-- JavaScript for complex logic
CREATE TEMP FUNCTION getDominantColor(
  rgb ARRAY<STRUCT<r INT64, g INT64, b INT64>>
)
RETURNS STRING
LANGUAGE js AS r"""
  // Average RGB VALUES
  let r = 0, g = 0, b = 0;
  for (let pixel of rgb) {
    r += pixel.r;
    g += pixel.g; 
    b += pixel.b;
  }
  r /= rgb.length;
  g /= rgb.length;
  b /= rgb.length;

  // Return dominant color
  if (r > g && r > b) return 'Warm';
  if (b > r && b > g) return 'Cool';
  return 'Neutral';
""";

Key Concepts:


Same SQL, New Superpowers

The Pattern Stays the Same

The core SQL structure remains unchanged: - SELECT columns or expressions - FROM tables (even those with complex data) - WHERE conditions (leveraging special functions) - ORDER BY to sort results

We just introduce new functions to handle new data types.


Try It Yourself: BigQuery Public Datasets

-- Real queries you can run today (free tier)!

-- 1. Geographic: Find all Starbucks in San Francisco
SELECT name, address, geom
FROM `bigquery-public-data.geo_openstreetmap.planet_features`
WHERE feature_type = 'amenity' 
  AND tags['amenity'] = 'cafe'
  AND tags['name'] LIKE '%Starbucks%'
  AND ST_DWITHIN(geom, ST_GEOGPOINT(-122.4194, 37.7749), 10000);

-- 2. Genomic: Analyze COVID variants
SELECT variant_name, country, date, mutation_count
FROM `bigquery-public-data.covid19_genome_sequence.variants`
WHERE mutation_count > 10
ORDER BY date DESC;

-- 3. Weather: Temperature patterns
SELECT 
  name AS station_name,
  MAX(temp) AS max_temp,
  MIN(temp) AS min_temp,
  AVG(wind) AS avg_wind
FROM `bigquery-public-data.noaa_gsod.gsod2024`
WHERE mo = '01' AND da = '01'
GROUP BY name
LIMIT 10;

The Evolution Continues

SQL Through the Decades:
1970s: Numbers and text
1990s: + Dates and binary data
2000s: + XML and spatial
2010s: + JSON and arrays  
2020s: + GenAI, Vectors, and ML
2030s: + Quantum data?

Your SQL skills adapt to whatever comes next!