Apache Spark: MapReduce Evolved

MapReduce was fine for those who liked their data cold and slow. But for data scientists who prefer their insights hot and fast, Spark is the answer. It keeps intermediate data in memory, enabling rapid iteration and fast recovery when tasks fail.


RDDs: The Core Innovation

Same word count task: MapReduce vs Spark RDDs.

MapReduce word count (3 separate jobs, with a JVM startup for each):
  1. Split & Map → (word, 1) pairs → written to HDFS
  2. Shuffle & Sort → group by word → written to HDFS
  3. Reduce (sum) → final counts → written to HDFS
Total time: ~3 minutes, with 3 disk writes across 3 separate jobs.

Spark RDD word count (a single pipeline, all in memory):
  textFile() → .flatMap() → .map() → .reduceByKey() → .collect()
  • Transformations are lazy (no execution until an action)
  • The action triggers the entire pipeline
  • Data flows through memory between steps
Total time: ~2 seconds, with 0 disk writes in 1 pipelined job.

Why RDDs can be up to 100x faster:
  • In-memory processing: RAM (10+ GB/s) vs disk (~100 MB/s), and data stays in memory between operations
  • Lazy evaluation & immutability: Spark builds the execution plan before running; immutable RDDs enable optimization
  • Lineage-based recovery: each RDD tracks its transformation history (lineage) and recomputes lost partitions from it
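The semantics of that RDD pipeline can be sketched in plain Python, with no cluster required. This is only an illustration of what each transformation computes (the sample lines are made up), not how Spark executes it:

```python
from itertools import chain

lines = ["spark is fast", "spark is lazy"]

# .flatMap(): split each line into a stream of words
words = chain.from_iterable(line.split() for line in lines)

# .map(): emit (word, 1) pairs
pairs = ((word, 1) for word in words)

# .reduceByKey(add): sum the counts per word
# (Spark does this per partition first, then across the shuffle)
counts = {}
for word, n in pairs:
    counts[word] = counts.get(word, 0) + n

print(counts)  # {'spark': 2, 'is': 2, 'fast': 1, 'lazy': 1}
```

In Spark, the first three steps build a lazy plan and nothing runs until an action like `.collect()`; here the generators are similarly lazy until the final loop consumes them.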

Spark SQL: Bringing Dataframes to Big Data

Forget about the tedious low-level RDD operations. Spark SQL offers the familiarity of SQL syntax, but on a distributed scale.

Spark SQL: familiar syntax, distributed scale. Here is the Spotify analytics query from the OLAP section:

SELECT s.genre,
       DATE_TRUNC('week', l.timestamp) AS week,
       COUNT(DISTINCT l.user_id) AS unique_listeners
FROM listens l
JOIN songs s ON l.song_id = s.song_id
GROUP BY s.genre, week

Spark's execution plan:
  1. Hash-partition both tables by the join key
  2. Broadcast the smaller table (songs) to all partitions
  3. Join locally on each partition
  4. Group by genre + week
  5. Count distinct users per group

Performance on 1 TB of data:
  • PostgreSQL (single machine): ~2 hours, and risks running out of memory
  • Spark (100 machines): ~3 minutes, with near-linear scaling
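The broadcast join at the heart of that plan can be mimicked in plain Python (the sample rows are made up for illustration): the small songs table becomes a hash map sent to every partition, and each partition of the large listens table joins locally with a dictionary lookup, so the big table never shuffles.

```python
# Broadcast side: the small songs table as a hash map (song_id -> genre)
songs = {101: "rock", 102: "jazz"}

# Large side: (user_id, song_id) rows, split across partitions
listens = [
    [("u1", 101), ("u2", 102)],   # partition 0
    [("u1", 102), ("u3", 101)],   # partition 1
]

# Each partition joins locally against the broadcast map -- no shuffle
joined = [
    (user_id, songs[song_id])
    for partition in listens
    for user_id, song_id in partition
    if song_id in songs
]

print(joined)  # [('u1', 'rock'), ('u2', 'jazz'), ('u1', 'jazz'), ('u3', 'rock')]
```

Broadcasting trades a one-time copy of the small table for avoiding a network shuffle of the large one, which is why Spark prefers it when one side fits in memory.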

Fault Tolerance: Lineage Graphs

RAM is expensive and volatile. Instead of triplicating data across nodes, Spark tracks each RDD's lineage: the chain of transformations that produced it. If a node crashes, Spark simply retraces those steps and rebuilds the lost partitions.

Traditional Approach: Replication

Node A: [data] → replicate to Node B, C
If Node A fails → read from Node B
Cost: 3x storage overhead
Problems:
  • High memory cost (3x replication)
  • Network overhead for sync
  • Complex consistency management

Spark Approach: Lineage

RDD3 = RDD1.filter().map()
If RDD3 lost → recompute from RDD1
Cost: Just computation time
Benefits:
  • No storage overhead
  • Automatic recovery
  • Leverages deterministic operations
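The recovery idea can be sketched with a toy lineage model (an illustrative stand-in, not Spark's internals): each dataset remembers its parent and the transformation that produced it, so a lost result can be recomputed on demand.

```python
class ToyRDD:
    """Toy model of lineage: a dataset that remembers how to rebuild itself."""
    def __init__(self, data=None, parent=None, fn=None):
        self._data = data        # materialized data, or None if lost/uncomputed
        self._parent = parent    # upstream ToyRDD
        self._fn = fn            # transformation applied to the parent's data

    def map(self, fn):
        return ToyRDD(parent=self, fn=lambda rows: [fn(r) for r in rows])

    def filter(self, pred):
        return ToyRDD(parent=self, fn=lambda rows: [r for r in rows if pred(r)])

    def collect(self):
        if self._data is None:   # lost (or never computed): replay the lineage
            self._data = self._fn(self._parent.collect())
        return self._data

rdd1 = ToyRDD(data=[1, 2, 3, 4])
rdd3 = rdd1.filter(lambda x: x % 2 == 0).map(lambda x: x * 10)

print(rdd3.collect())   # [20, 40]
rdd3._data = None       # simulate losing the data on a crashed node
print(rdd3.collect())   # recomputed from rdd1 via lineage: [20, 40]
```

Because the transformations are deterministic, replaying the lineage yields exactly the same data, which is what makes recomputation a safe substitute for replication.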

When to Use Spark vs MapReduce

Spark Wins For:

  • Iterative algorithms (machine learning, graph processing) that reuse the same data
  • Interactive, ad-hoc analysis, where cached data makes repeated queries fast
  • Pipelines of many chained transformations, executed as one job

MapReduce Still Has Benefits:

  • Huge one-pass batch jobs where the data far exceeds cluster memory
  • Simple single-pass ETL where in-memory caching buys little
  • Mature, battle-tested tooling in existing Hadoop deployments


Real-World Spark at Scale

Netflix: Recommendation Engine

# Simplified Spark ML pipeline
# (ml_model_udf stands in for a real prediction UDF)
from pyspark.sql.functions import collect_list

user_ratings = spark.read.table("user_ratings")      # 1TB
movie_features = spark.read.table("movie_features")  # 100GB

# Traditional ML would require multiple MapReduce jobs.
# Spark does it in one pipeline:
recommendations = user_ratings \
    .join(movie_features, "movie_id") \
    .groupBy("user_id") \
    .agg(collect_list("rating").alias("ratings")) \
    .withColumn("predictions", ml_model_udf("ratings"))

# Cache for interactive exploration
recommendations.cache()
recommendations.show()              # First action computes the pipeline
recommendations.filter(...).show()  # Later queries reuse the cached data
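What cache() buys can be illustrated with a plain-Python stand-in (hypothetical names, not Spark's API): the expensive pipeline runs once, and every later query reuses the stored result instead of recomputing.

```python
compute_calls = 0

def expensive_pipeline():
    """Stand-in for the join/groupBy/agg pipeline above."""
    global compute_calls
    compute_calls += 1
    return [("u1", 4.5), ("u2", 3.0), ("u3", 4.9)]

_cache = None
def cached_pipeline():
    """Like .cache(): materialize on first use, then serve from memory."""
    global _cache
    if _cache is None:
        _cache = expensive_pipeline()
    return _cache

cached_pipeline()                                    # first "action" computes
high = [u for u, r in cached_pipeline() if r > 4]    # reuses the cached result

print(high, compute_calls)  # ['u1', 'u3'] 1
```

Without caching, each interactive query would replay the whole join-and-aggregate pipeline; with it, the 1 TB computation happens once.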

Results:


Key Takeaways

1. Same Distributed Algorithms, Better Execution

Spark takes the tried-and-true and makes it faster:

  • The same partition, shuffle, and aggregate patterns as MapReduce
  • Intermediate results flow through memory instead of being written to HDFS between jobs
  • Several chained jobs collapse into one pipelined execution

2. Memory-Centric Design Revolution

In-memory processing is a game-changer:

  • RAM throughput (10+ GB/s) dwarfs disk (~100 MB/s)
  • cache() keeps hot datasets resident for interactive exploration
  • Lineage replaces replication, so fault tolerance costs computation rather than 3x storage

3. Unified Framework

One system, multiple workloads:

  • Low-level RDD transformations for fine-grained control
  • Spark SQL and DataFrames for familiar, optimizable analytics
  • ML pipelines, as in the Netflix recommendation example

4. Programming Model Evolution

From low-level to high-level APIs:

  • RDDs expose explicit transformations (map, filter, reduceByKey)
  • DataFrames and SQL describe what you want and let the engine plan how to run it
