Apache Kafka: Real-Time Event Streaming
Concept. Kafka is a durable, partitioned, replicated commit log: producers append events to per-topic partitions, consumers read them at their own pace, and the log itself is the source of truth, so multiple downstream systems can independently replay the same stream of events.
Intuition. When Mickey hits play, the Spotify client emits a `play_started` event to the `plays` topic. Kafka appends it to one of the topic's partitions (chosen by `hash(user_id)`), replicates it to 3 brokers, and now both the recommendation pipeline and the royalty-payout pipeline can read that exact event independently, at their own speed, replaying as far back as the retention window allows.
A Worked Example: One Play, Two Pipelines
Trace what happens when Mickey hits play on a Taylor Swift track.
- Produce. The Spotify client builds a `play_started` event (`{user_id, song_id, ts, device}`) and hands it to a Kafka producer. The producer hashes the `user_id` to pick a partition of the `plays` topic, then appends the event to that partition's log. (A producer sketch follows this list.)
- Replicate. The partition lives on three brokers (1 leader, 2 followers). The leader appends the event and waits for the followers to acknowledge before confirming the write. The event now survives any single broker dying.
- Consume, in parallel. Two unrelated downstream systems are subscribed to the `plays` topic:
  - The recommendation pipeline is at offset 16 (real-time): it reads Mickey's event within milliseconds and updates his "Up Next" queue.
  - The royalty payouts pipeline is at offset 12 (batched nightly): it will get to Mickey's event when its nightly job runs, the same way it consumes every other play of the day.
- Replay, on demand. When a bug ships in the recommendation model, the team rewinds the consumer to yesterday's offset and re-reads the same events with the fixed code. Kafka's log is the durable source of truth, so the replay is bit-identical.
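To make the produce and replicate steps concrete, here is a minimal sketch using the official Java client. The broker address, payload, and class name are illustrative; the two real levers are keying the record by `user_id` (the default partitioner hashes the key) and `acks=all` (the leader confirms only after its in-sync followers have the event).

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class PlayProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // illustrative address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("acks", "all"); // leader confirms only after in-sync followers acknowledge

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            String userId = "mickey"; // placeholder ids and payload
            String event = "{\"user_id\":\"mickey\",\"song_id\":\"ts-001\",\"ts\":1700000000,\"device\":\"ios\"}";
            // Keying by user_id: the default partitioner hashes the key, so all of
            // Mickey's plays land in the same partition, preserving their order.
            producer.send(new ProducerRecord<>("plays", userId, event), (metadata, e) -> {
                if (e != null) {
                    e.printStackTrace();
                } else {
                    System.out.printf("appended to partition %d at offset %d%n",
                            metadata.partition(), metadata.offset());
                }
            });
        } // close() flushes any in-flight records
    }
}
```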
What Makes This Shape Different
Kafka is often introduced as "a message queue," but it is actually a log. The distinction is what makes it useful:
- Queues consume on read. When a consumer pops a message, it is gone. Adding a second downstream system means re-publishing every message.
- Logs persist after read. Every event sits in the log until the retention window expires (often days, sometimes forever). New downstream systems can consume the entire history at any time.
This single difference is why Kafka became the substrate for modern event-driven architectures. A new analytics team, a new ML pipeline, or a fraud detector can be added years after the original event was written, and they all read from the exact same log without disturbing existing consumers.
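The replay move from the worked example is just a seek, never a re-publish. A minimal sketch with the official Java client, assuming a single partition and that the retention window still covers yesterday (partition number, group id, and broker address are illustrative):

```java
import java.time.Duration;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndTimestamp;
import org.apache.kafka.common.TopicPartition;

public class ReplayFromYesterday {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "recs-replay"); // illustrative group id
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition tp = new TopicPartition("plays", 0); // one partition for brevity
            consumer.assign(List.of(tp));

            // Ask the broker for the earliest offset at or after the timestamp.
            long yesterday = System.currentTimeMillis() - Duration.ofDays(1).toMillis();
            Map<TopicPartition, OffsetAndTimestamp> offsets =
                    consumer.offsetsForTimes(Map.of(tp, yesterday));
            // Rewind; assumes retention still covers yesterday (lookup is null otherwise).
            consumer.seek(tp, offsets.get(tp).offset());

            // Re-read the same events the buggy code consumed, now with the fix.
            for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(5))) {
                System.out.printf("offset %d: %s%n", record.offset(), record.value());
            }
        }
    }
}
```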
Architecture Cheat Sheet
| Component | Role |
|---|---|
| Topic | A named stream of events, like a table |
| Partition | One ordered log file inside a topic; the unit of parallelism |
| Producer | Application that appends events to a topic |
| Consumer | Application that reads events from a topic |
| Consumer Group | Set of consumers that share the work of reading one topic |
| Broker | One Kafka server, holds some partitions |
| Cluster | Set of brokers that together host all partitions, with replication |

| Property | Typical Value |
|---|---|
| Throughput | Millions of messages per second per cluster |
| Latency | Under 10 ms producer-to-consumer in a healthy cluster |
| Storage | Petabytes per cluster |
| Retention | Days to forever (configurable per topic) |
| Ordering | Total order within a partition; no order across partitions |
| Delivery | Tunable: at-most-once, at-least-once, or exactly-once |
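The retention row deserves emphasis: it is a per-topic setting, not a cluster-wide one. A minimal sketch using the Java `AdminClient` (topic name, partition count, and broker address are assumptions for illustration) that creates the `plays` topic from the worked example with a 7-day replay window:

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreatePlaysTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // 12 partitions = units of parallelism; replication factor 3 = survives
            // any single broker dying; retention.ms is the per-topic replay window.
            NewTopic plays = new NewTopic("plays", 12, (short) 3)
                    .configs(Map.of("retention.ms", "604800000")); // 7 days
            admin.createTopics(List.of(plays)).all().get();
        }
    }
}
```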
Modern Alternatives
Kafka pioneered the distributed log, but its JVM operational footprint is heavy (older versions also required a separate ZooKeeper ensemble; newer releases replace it with built-in KRaft consensus). Drop-in replacements like Redpanda (rewritten in C++ for lower tail latency) implement the same Kafka API while bypassing the JVM. Same shape, lighter ops.
Delivery Semantics
| Semantic | Behavior | Use Case |
|---|---|---|
| At Most Once | Messages may be lost but never duplicated | Metrics, logs |
| At Least Once | Messages never lost but may duplicate | Most applications |
| Exactly Once | Messages delivered exactly once | Financial systems, with idempotent writes |
Exactly-once requires the consumer to commit its progress in the same transaction that processes the message. Kafka supports this via the transactional API; most teams accept "at-least-once + idempotent consumer" as a simpler approximation.
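Here is a minimal sketch of that simpler approximation with the official Java client: auto-commit is disabled and the offset is committed only after the batch is processed, so a crash between poll and commit redelivers events rather than losing them. The group id and `process()` body are placeholders; the idempotence must live in `process()`, e.g. writes keyed on `(user_id, ts)` so a redelivered event overwrites instead of double-counting.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class AtLeastOnceConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "royalties"); // illustrative group id
        props.put("enable.auto.commit", "false"); // commit only after processing
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("plays"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    process(record); // must be idempotent: redelivery is possible
                }
                // Commit AFTER processing: a crash before this line redelivers
                // the batch on restart (at-least-once, never lost).
                consumer.commitSync();
            }
        }
    }

    static void process(ConsumerRecord<String, String> record) {
        // Placeholder for an idempotent write (e.g., upsert keyed on user_id + ts).
        System.out.printf("offset %d: %s%n", record.offset(), record.value());
    }
}
```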
Where Kafka Fits in the Big Picture
| | MapReduce | Spark | Kafka |
|---|---|---|---|
| Bounded or unbounded? | Bounded batch | Bounded batch (or micro-batch) | Unbounded stream |
| Latency | Minutes to hours | Seconds to minutes | Milliseconds to seconds |
| State | Re-derived per job | RDD lineage in memory | Append-only log on disk |
| Pattern | Process one big snapshot | Process many snapshots fast | Process events as they arrive |
MapReduce and Spark assume the data already exists somewhere (HDFS, S3). Kafka is what gets the data there in the first place, in real time, and makes it replayable.
Real-World Use
- LinkedIn (Kafka's original creator): 1 trillion messages per day for activity tracking, metrics, and log aggregation.
- Netflix: 4 million events per second feeding recommendations, A/B test capture, and operational monitoring.
- Uber: trip updates, surge-pricing inputs, and petabyte-scale analytics pipelines all flow through Kafka topics.
- Airbnb: search-ranking ML features, payment event capture, and the backbone of their data platform.
Wrap-up
Across this module you have walked through every layer of the modern data stack: storage that survives failures (GFS, HDFS), computation that scales linearly (MapReduce, Spark), and now an event substrate that connects them in real time (Kafka). All three rest on the same primitives the earlier sections introduced: hash partitioning, replication, and consensus.
The next time you see "this team built it on Kafka" or "we run Spark on top of S3," you can decompose the architecture into the primitives. That is the point of Module 5.