Apache Spark: MapReduce Evolved
Concept. Spark generalizes MapReduce by representing a dataset as a Resilient Distributed Dataset (RDD): a partitioned, lazily evaluated, lineage-tracked collection. You chain many transformations into a single in-memory pipeline that recomputes lost partitions from lineage instead of re-reading from disk.
Intuition. When a data scientist explores last year's Listens with filter(genre == 'pop').groupBy(user_id).count().orderBy(desc('count')).limit(100), Spark builds a single in-memory pipeline across thousands of cores, holds intermediate results in RAM (so the next exploratory query takes seconds, not minutes), and if a machine dies mid-job, Spark recomputes only that machine's partition from the lineage.
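A minimal PySpark sketch of that query (the listens table and its column names are assumed here for illustration):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, desc

spark = SparkSession.builder.appName("top-pop-listeners").getOrCreate()

top_pop = (
    spark.read.table("listens")        # assumed table of last year's plays
    .filter(col("genre") == "pop")     # narrow transformation: no shuffle
    .groupBy("user_id").count()        # wide transformation: one shuffle
    .orderBy(desc("count"))
    .limit(100)
)
top_pop.show()  # nothing executes until this action

Every step before show() merely extends the query plan; Spark schedules actual work only when an action asks for results.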
RDDs: Why MapReduce Pipelines Got 100× Faster
The core innovation is keeping intermediate data in memory between stages. Word count makes the contrast concrete: MapReduce writes every map output to disk before the reduce phase pulls it back, while Spark runs the whole workload as a single chained pipeline.
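As a rough sketch (assuming an existing SparkContext sc and placeholder HDFS paths), the entire job is a handful of chained transformations with exactly one shuffle:

counts = (
    sc.textFile("hdfs://.../words.txt")         # placeholder path: read input
    .flatMap(lambda line: line.split())         # the "map" side: emit words
    .map(lambda word: (word, 1))
    .reduceByKey(lambda a, b: a + b)            # the "reduce" side: one shuffle
)
counts.saveAsTextFile("hdfs://.../counts")      # the action that triggers execution

Nothing is materialized between flatMap, map, and reduceByKey. In MapReduce, the map output is written to local disk before reducers fetch it, and a multi-step pipeline becomes a chain of separate jobs with HDFS writes in between.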
Spark SQL: DataFrames at Distributed Scale
Most teams do not write low-level RDD code anymore. They write SQL or DataFrame transformations, and Spark's Catalyst optimizer compiles them into the underlying RDD operations.
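A small sketch of the two surfaces, reusing the assumed listens table from above: the SQL string and the DataFrame chain below compile to equivalent Catalyst plans, which explain() lets you inspect.

by_sql = spark.sql(
    "SELECT user_id, COUNT(*) AS plays FROM listens "
    "WHERE genre = 'pop' GROUP BY user_id"
)

by_api = (
    spark.table("listens")
    .filter("genre = 'pop'")
    .groupBy("user_id")
    .count()
    .withColumnRenamed("count", "plays")
)

by_sql.explain()  # both print equivalent physical plans
by_api.explain()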
Fault Tolerance: Lineage Instead of Replication
RAM is volatile. If a node crashes mid-job, MapReduce-style replication would either re-read from HDFS (slow) or keep three copies of every intermediate (expensive).
Spark does neither. It records the lineage of each RDD: which earlier RDD produced it, with which transformation. On crash, Spark recomputes only the lost partitions by replaying the lineage. Cost is computation time, not storage.
Traditional Approach: Replication
Node A: [data] → replicate to Node B, C
If Node A fails → read FROM Node B
Cost: 3× storage overhead
Problems:
- High memory cost (3× replication)
- Network overhead for sync
- Complex consistency management
Spark Approach: Lineage
RDD3 = RDD1.filter().map()
If RDD3 lost → recompute FROM RDD1
Cost: just computation time
Benefits:
- No storage overhead
- Automatic recovery
- Leverages deterministic operations
This trick only works because Spark's transformations are deterministic. Same input, same operator, same output every time. That deterministic-operator constraint is what lets lineage replace replication.
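You can inspect the lineage Spark records. toDebugString() prints the chain of transformations it would replay to rebuild a lost partition; a sketch assuming an existing SparkContext sc (exact output varies by Spark version):

rdd1 = sc.parallelize(range(1_000_000))
rdd3 = rdd1.filter(lambda x: x % 2 == 0).map(lambda x: x * 10)

print(rdd3.toDebugString().decode())
# e.g. PythonRDD[1] ... <- ParallelCollectionRDD[0] ...
# Lose a partition of rdd3, and Spark re-runs filter + map on the
# matching partition of rdd1 -- and nothing else.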
When Spark vs. When MapReduce
Spark wins for:
- Iterative algorithms. Machine learning, graph processing, anything that touches the same data repeatedly.
- Interactive analytics. Cached intermediates make follow-up queries return in seconds.
- Multi-step pipelines. One end-to-end job instead of a chain of separate MR jobs.
- Mixed workloads. Batch + streaming + ML in one framework.
MapReduce still wins for:
- Datasets larger than cluster memory. Spark spills to disk when it has to (see the sketch after this list), but if every stage spills, MR's design is just a better fit.
- Single-pass batch jobs. A nightly ETL that reads once and writes once does not benefit from Spark's in-memory pipelining.
- Cost-sensitive clusters. RAM is more expensive than disk; if latency is not the constraint, MR is cheaper.
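There is also a middle ground when data only somewhat exceeds memory: persist with an explicit storage level so overflow partitions spill to local disk instead of being recomputed. A minimal sketch, reusing the table from the example below:

from pyspark import StorageLevel

ratings = spark.read.table("user_ratings")
ratings.persist(StorageLevel.MEMORY_AND_DISK)  # keep what fits in RAM, spill the rest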
Spark at Real-World Scale: Netflix Recommendations
# Simplified Spark ML pipeline
from pyspark.sql.functions import collect_list

user_ratings = spark.read.table("user_ratings")      # 1 TB
movie_features = spark.read.table("movie_features")  # 100 GB

# Traditional MapReduce would require multiple separate jobs.
# Spark expresses it as one in-memory pipeline:
recommendations = (
    user_ratings
    .join(movie_features, "movie_id")
    .groupBy("user_id")
    .agg(collect_list("rating").alias("ratings"))
    .withColumn("predictions", ml_model_udf("ratings"))  # ml_model_udf: a registered model-scoring UDF
)

# Cache for interactive exploration
recommendations.cache()
recommendations.show()              # first computation materializes the pipeline
recommendations.filter(...).show()  # follow-up queries reuse the cached result
Numbers from Netflix's production reports:
- Data: 1 TB user ratings, 100M+ users
- Wall clock: ~15 minutes (vs. 3+ hours on MapReduce)
- Workflow: data scientists iterate interactively instead of waiting overnight
Key Takeaways
- Same primitives, better execution. Spark uses the same hash partitioning and worker model as MapReduce. The change is keeping intermediate data in RAM and tracking lineage.
- Memory-centric design unlocks new workloads. Iterative ML, interactive analytics, and large-scale graph processing become tractable when intermediate state lives in memory.
- One framework, many APIs. Spark Core (RDDs) → Spark SQL (DataFrames + the Catalyst optimizer) → Spark Streaming → MLlib. Same execution engine, different surfaces.
- Determinism enables lineage. Spark's fault-tolerance trick only works because every transformation is deterministic. Replay produces the same result.
Next
BigQuery → Spark is the workhorse for ELT-style transformations: you take raw data, run a multi-stage in-memory pipeline, and write the result somewhere. The next case study covers the query layer for that result: BigQuery, the serverless OLAP engine that pairs naturally with Spark in the modern data stack.