Apache Kafka: Real-Time Event Streaming

Concept. Kafka is a durable, partitioned, replicated commit log: producers append events to per-topic partitions, consumers read them at their own pace, and the log itself is the source of truth, so multiple downstream systems can independently replay the same stream of events.

Intuition. When Mickey hits play, the Spotify client emits a play_started event to the plays topic. Kafka appends it to one of the topic's partitions (chosen by hash(user_id)), replicates it to 3 brokers, and now both the recommendation pipeline and the royalty-payout pipeline can read that exact event independently, at their own speed, replaying as far back as the retention window allows.
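The key-based partition choice above can be sketched in a few lines. This is a minimal stand-in, not Kafka's actual partitioner (the real default uses a murmur2 hash of the key); md5 is used here only so the sketch is deterministic and stdlib-only, and the partition count of 3 is illustrative.

```python
import hashlib

NUM_PARTITIONS = 3  # illustrative partition count for the "plays" topic

def pick_partition(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a key to a partition. Kafka's default partitioner uses murmur2;
    md5 stands in here so the sketch needs only the standard library."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# The same key always maps to the same partition, so every play by one
# user lands on one partition and per-user events stay in order.
assert pick_partition("user:mickey") == pick_partition("user:mickey")
assert 0 <= pick_partition("user:mickey") < NUM_PARTITIONS
```

This is why the worked example hashes user_id rather than picking a random partition: ordering is only guaranteed within a partition, so keying by user keeps each user's history ordered.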


A Worked Example: One Play, Two Pipelines

Trace what happens when Mickey hits play on a Taylor Swift track.

  1. Produce. The Spotify client builds a play_started event ({user_id, song_id, ts, device}) and sends it to a backend service running a Kafka producer. The producer hashes the user_id to pick a partition of the plays topic, then appends the event to that partition's log.

  2. Replicate. The partition lives on three brokers (1 leader, 2 followers). The leader appends the event and, with acks=all, waits for the in-sync followers to acknowledge before confirming the write. The event now survives any single broker dying.

  3. Consume, in parallel. Two unrelated downstream systems are subscribed to the plays topic:

     • The recommendation pipeline is at offset 16 (real-time): it reads Mickey's event within milliseconds and updates his "Up Next" queue.

     • The royalty payouts pipeline is at offset 12 (batched nightly): it will get to Mickey's event when its nightly job runs, the same way it consumes every other play of the day.

  4. Replay, on demand. When a bug ships in the recommendation model, the team rewinds the consumer to yesterday's offset and re-reads the same events with the fixed code. Kafka's log is the durable source of truth, so the replay is bit-identical.
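The whole flow above can be modeled in a few dozen lines. This is a toy in-memory sketch of one partition, not Kafka's implementation: the PartitionLog class, its method names, and the group names are all invented for illustration. What it preserves is the essential shape: an append-only list plus a per-consumer-group offset, so independent readers and replay fall out for free.

```python
from collections import defaultdict

class PartitionLog:
    """A minimal in-memory stand-in for one Kafka partition:
    an append-only event list plus per-consumer-group offsets."""

    def __init__(self):
        self.events = []                  # the log itself, never mutated in place
        self.offsets = defaultdict(int)   # consumer group -> next offset to read

    def append(self, event) -> int:
        self.events.append(event)
        return len(self.events) - 1       # offset of the newly written event

    def poll(self, group: str):
        """Read every event from the group's current offset onward."""
        start = self.offsets[group]
        batch = self.events[start:]
        self.offsets[group] = len(self.events)
        return batch

    def seek(self, group: str, offset: int):
        """Rewind (or fast-forward) a group's offset, enabling replay."""
        self.offsets[group] = offset

log = PartitionLog()
log.append({"user": "mickey", "event": "play_started"})

# Two independent consumer groups read the very same event.
recs = log.poll("recommendations")
royalties = log.poll("royalty")
assert recs == royalties

# Replay: rewind one group and re-read identical events.
log.seek("recommendations", 0)
assert log.poll("recommendations") == recs
```

Note that appending never deletes anything and polling only moves a group's offset, which is exactly why adding a second (or tenth) consumer group costs nothing.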

[Figure] Kafka architecture: producer hashes by user_id into one of three topic partitions; two consumer groups (recommendations and royalty) read independently at their own offsets, replaying past events on demand.


What Makes This Shape Different

Kafka is often introduced as "a message queue," but it is actually a log. The distinction is what makes it useful:

  • Queues consume on read. When a consumer pops a message, it is gone. Adding a second downstream system means re-publishing every message.

  • Logs persist after read. Every event sits in the log until the retention window expires (often days, sometimes forever). New downstream systems can consume the entire history at any time.

This single difference is why Kafka became the substrate for modern event-driven architectures. A new analytics team, a new ML pipeline, or a fraud detector can be added years after the original event was written, and they all read from the exact same log without disturbing existing consumers.
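The queue-versus-log contrast can be made concrete with two data structures. A sketch, using a plain deque for the queue and a list plus reader offsets for the log; the variable names are illustrative, not Kafka API.

```python
from collections import deque

# Queue semantics: consuming removes the message.
queue = deque(["e1", "e2"])
first = queue.popleft()          # "e1" is now gone from the queue
assert "e1" not in queue         # a second system can never see it

# Log semantics: consuming only advances a reader's private offset.
log = ["e1", "e2"]
offset_a = 0
batch_a = log[offset_a:]         # reader A consumes everything
offset_a = len(log)

# A reader added later starts at offset 0 and sees the full history.
offset_b = 0
batch_b = log[offset_b:]
assert batch_b == ["e1", "e2"]
```

The log's events are untouched by reads; only the offsets move. That is the entire mechanism behind "add a new downstream system years later."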


Architecture Cheat Sheet

Component       Role
--------------  ------------------------------------------------------------------
Topic           A named stream of events, like a table
Partition       One ordered log file inside a topic; the unit of parallelism
Producer        Application that appends events to a topic
Consumer        Application that reads events from a topic
Consumer Group  Set of consumers that share the work of reading one topic
Broker          One Kafka server; holds some of the partitions
Cluster         Set of brokers that together host all partitions, with replication

Property    Typical Value
----------  ----------------------------------------------------------
Throughput  Millions of messages per second per cluster
Latency     Under 10 ms producer-to-consumer in a healthy cluster
Storage     Petabytes per cluster
Retention   Days to forever (configurable per topic)
Ordering    Total order within a partition; no order across partitions
Delivery    Tunable: at-most-once, at-least-once, or exactly-once
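The "Consumer Group" row deserves one concrete illustration: within a group, each partition is owned by exactly one consumer, so the group collectively reads the topic exactly once. A simplified round-robin stand-in for Kafka's actual rebalance protocol (which negotiates assignments through a group coordinator broker):

```python
def assign_partitions(partitions, consumers):
    """Round-robin assignment of partitions across one group's consumers.
    A simplified stand-in for Kafka's rebalance protocol."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

# Six partitions shared by three consumers in one group:
a = assign_partitions([0, 1, 2, 3, 4, 5], ["c1", "c2", "c3"])
assert a == {"c1": [0, 3], "c2": [1, 4], "c3": [2, 5]}
```

One consequence worth noticing: a group with more consumers than partitions leaves the extras idle, which is why the partition count is called the unit of parallelism.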

Modern alternatives

Kafka pioneered the distributed log, but its operational footprint has historically been heavy: a JVM service plus a separate ZooKeeper ensemble (newer Kafka releases replace ZooKeeper with the built-in KRaft consensus layer). Drop-in replacements like Redpanda (rewritten in C++ for lower tail latency) implement the same Kafka API while bypassing the JVM. Same shape, lighter ops.


Delivery Semantics

Semantic       Behavior                                Use Case
-------------  --------------------------------------  ------------------------------------------
At Most Once   Messages may be lost, never duplicated  Metrics, logs
At Least Once  Messages never lost, may be duplicated  Most applications
Exactly Once   Messages delivered exactly once         Financial systems, with idempotent writes

Exactly-once requires the consumer to commit its progress in the same transaction that processes the message. Kafka supports this via the transactional API; most teams accept "at-least-once + idempotent consumer" as a simpler approximation.


Where Kafka Fits in the Big Picture

                       MapReduce                 Spark                           Kafka
---------------------  ------------------------  ------------------------------  -----------------------------
Bounded or unbounded?  Bounded batch             Bounded batch (or micro-batch)  Unbounded stream
Latency                Minutes to hours          Seconds to minutes              Milliseconds to seconds
State                  Re-derived per job        RDD lineage in memory           Append-only log on disk
Pattern                Process one big snapshot  Process many snapshots fast     Process events as they arrive

MapReduce and Spark assume the data already exists somewhere (HDFS, S3). Kafka is what gets the data there in the first place, in real time, and makes it replayable.


Real-World Use

  • LinkedIn (Kafka's original creator): 1 trillion messages per day for activity tracking, metrics, and log aggregation.

  • Netflix: 4 million events per second feeding recommendations, A/B test capture, and operational monitoring.

  • Uber: trip updates, surge-pricing inputs, and petabyte-scale analytics pipelines all flow through Kafka topics.

  • Airbnb: search-ranking ML features, payment event capture, and the backbone of their data platform.


Wrap-up

Across this module you have walked through every layer of the modern data stack: storage that survives failures (DFS, GFS), computation that scales linearly (MapReduce, Spark), and now an event substrate that connects them in real time (Kafka). All three rest on the same primitives the earlier sections introduced: hash partitioning, replication, and consensus.

The next time you see "this team built it on Kafka" or "we run Spark on top of S3," you can decompose the architecture into the primitives. That is the point of Module 5.