Apache Kafka: Real-Time Event Streaming

Concept. Kafka is a durable, partitioned, replicated commit log: producers append events to per-topic partitions, consumers read them at their own pace, and the log itself is the source of truth, so multiple downstream systems can independently replay the same stream of events.

Intuition. When Mickey hits play, the Spotify client emits a play_started event to the plays topic. Kafka appends it to one of the topic's partitions (chosen by hash(user_id)), replicates it to 3 brokers, and now both the recommendation pipeline and the royalty-payout pipeline can read that exact event independently, at their own speed, replaying as far back as the retention window allows.
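The key-based partition choice above can be sketched in a few lines. This is a minimal stand-in, not Kafka's actual partitioner (the real default uses a murmur2 hash of the key); md5 is used here only so the sketch is deterministic and stdlib-only, and the partition count of 3 is illustrative.

```python
import hashlib

NUM_PARTITIONS = 3  # illustrative partition count for the "plays" topic

def pick_partition(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a key to a partition. Kafka's default partitioner uses murmur2;
    md5 stands in here so the sketch needs only the standard library."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# The same key always maps to the same partition, so every play by one
# user lands on one partition and per-user events stay in order.
assert pick_partition("user:mickey") == pick_partition("user:mickey")
assert 0 <= pick_partition("user:mickey") < NUM_PARTITIONS
```

This is why the worked example hashes user_id rather than picking a random partition: ordering is only guaranteed within a partition, so keying by user keeps each user's history ordered.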


A Worked Example: One Play, Two Pipelines

Trace what happens when Mickey hits play on a Taylor Swift track.

  1. Produce. The Spotify client builds a play_started event ({user_id, song_id, ts, device}) and sends it to a backend service running a Kafka producer. The producer hashes the user_id to pick a partition of the plays topic, then appends the event to that partition's log.

  2. Replicate. The partition lives on three brokers (1 leader, 2 followers). The leader appends the event and, with acks=all, waits for the in-sync followers to acknowledge before confirming the write. The event now survives any single broker dying.

  3. Consume, in parallel. Two unrelated downstream systems are subscribed to the plays topic:

     • The recommendation pipeline is at offset 16 (real-time): it reads Mickey's event within milliseconds and updates his "Up Next" queue.

     • The royalty payouts pipeline is at offset 12 (batched nightly): it will get to Mickey's event when its nightly job runs, the same way it consumes every other play of the day.

  4. Replay, on demand. When a bug ships in the recommendation model, the team rewinds the consumer to yesterday's offset and re-reads the same events with the fixed code. Kafka's log is the durable source of truth, so the replay is bit-identical.
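The whole flow above can be modeled in a few dozen lines. This is a toy in-memory sketch of one partition, not Kafka's implementation: the PartitionLog class, its method names, and the group names are all invented for illustration. What it preserves is the essential shape: an append-only list plus a per-consumer-group offset, so independent readers and replay fall out for free.

```python
from collections import defaultdict

class PartitionLog:
    """A minimal in-memory stand-in for one Kafka partition:
    an append-only event list plus per-consumer-group offsets."""

    def __init__(self):
        self.events = []                  # the log itself, never mutated in place
        self.offsets = defaultdict(int)   # consumer group -> next offset to read

    def append(self, event) -> int:
        self.events.append(event)
        return len(self.events) - 1       # offset of the newly written event

    def poll(self, group: str):
        """Read every event from the group's current offset onward."""
        start = self.offsets[group]
        batch = self.events[start:]
        self.offsets[group] = len(self.events)
        return batch

    def seek(self, group: str, offset: int):
        """Rewind (or fast-forward) a group's offset, enabling replay."""
        self.offsets[group] = offset

log = PartitionLog()
log.append({"user": "mickey", "event": "play_started"})

# Two independent consumer groups read the very same event.
recs = log.poll("recommendations")
royalties = log.poll("royalty")
assert recs == royalties

# Replay: rewind one group and re-read identical events.
log.seek("recommendations", 0)
assert log.poll("recommendations") == recs
```

Note that appending never deletes anything and polling only moves a group's offset, which is exactly why adding a second (or tenth) consumer group costs nothing.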

[Figure] Kafka architecture: producer hashes by user_id into one of three topic partitions; two consumer groups (recommendations and royalty) read independently at their own offsets, replaying past events on demand.


What Makes This Shape Different

Kafka is often introduced as "a message queue," but it is actually a log. The distinction is what makes it useful:

  • Queues consume on read. When a consumer pops a message, it is gone. Adding a second downstream system means re-publishing every message.

  • Logs persist after read. Every event sits in the log until the retention window expires (often days, sometimes forever). New downstream systems can consume the entire history at any time.

This single difference is why Kafka became the substrate for modern event-driven architectures. A new analytics team, a new ML pipeline, or a fraud detector can be added years after the original event was written, and they all read from the exact same log without disturbing existing consumers.
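The queue-versus-log contrast can be made concrete with two data structures. A sketch, using a plain deque for the queue and a list plus reader offsets for the log; the variable names are illustrative, not Kafka API.

```python
from collections import deque

# Queue semantics: consuming removes the message.
queue = deque(["e1", "e2"])
first = queue.popleft()          # "e1" is now gone from the queue
assert "e1" not in queue         # a second system can never see it

# Log semantics: consuming only advances a reader's private offset.
log = ["e1", "e2"]
offset_a = 0
batch_a = log[offset_a:]         # reader A consumes everything
offset_a = len(log)

# A reader added later starts at offset 0 and sees the full history.
offset_b = 0
batch_b = log[offset_b:]
assert batch_b == ["e1", "e2"]
```

The log's events are untouched by reads; only the offsets move. That is the entire mechanism behind "add a new downstream system years later."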


Architecture Cheat Sheet

Component       Role
--------------  ------------------------------------------------------------------
Topic           A named stream of events, like a table
Partition       One ordered log file inside a topic; the unit of parallelism
Producer        Application that appends events to a topic
Consumer        Application that reads events from a topic
Consumer Group  Set of consumers that share the work of reading one topic
Broker          One Kafka server; holds some of the partitions
Cluster         Set of brokers that together host all partitions, with replication

Property    Typical Value
----------  ----------------------------------------------------------
Throughput  Millions of messages per second per cluster
Latency     Under 10 ms producer-to-consumer in a healthy cluster
Storage     Petabytes per cluster
Retention   Days to forever (configurable per topic)
Ordering    Total order within a partition; no order across partitions
Delivery    Tunable: at-most-once, at-least-once, or exactly-once
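The "Consumer Group" row deserves one concrete illustration: within a group, each partition is owned by exactly one consumer, so the group collectively reads the topic exactly once. A simplified round-robin stand-in for Kafka's actual rebalance protocol (which negotiates assignments through a group coordinator broker):

```python
def assign_partitions(partitions, consumers):
    """Round-robin assignment of partitions across one group's consumers.
    A simplified stand-in for Kafka's rebalance protocol."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

# Six partitions shared by three consumers in one group:
a = assign_partitions([0, 1, 2, 3, 4, 5], ["c1", "c2", "c3"])
assert a == {"c1": [0, 3], "c2": [1, 4], "c3": [2, 5]}
```

One consequence worth noticing: a group with more consumers than partitions leaves the extras idle, which is why the partition count is called the unit of parallelism.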

Modern alternatives

Kafka pioneered the distributed log, but its operational footprint has historically been heavy: a JVM service plus a separate ZooKeeper ensemble (newer Kafka releases replace ZooKeeper with the built-in KRaft consensus layer). Drop-in replacements like Redpanda (rewritten in C++ for lower tail latency) implement the same Kafka API while bypassing the JVM. Same shape, lighter ops.


Delivery Semantics

Semantic       Behavior                                Use Case
-------------  --------------------------------------  ------------------------------------------
At Most Once   Messages may be lost, never duplicated  Metrics, logs
At Least Once  Messages never lost, may be duplicated  Most applications
Exactly Once   Messages delivered exactly once         Financial systems, with idempotent writes

Exactly-once requires the consumer to commit its progress in the same transaction that processes the message. Kafka supports this via the transactional API; most teams accept "at-least-once + idempotent consumer" as a simpler approximation.


Where Kafka Fits in the Big Picture

                       MapReduce                 Spark                           Kafka
---------------------  ------------------------  ------------------------------  -----------------------------
Bounded or unbounded?  Bounded batch             Bounded batch (or micro-batch)  Unbounded stream
Latency                Minutes to hours          Seconds to minutes              Milliseconds to seconds
State                  Re-derived per job        RDD lineage in memory           Append-only log on disk
Pattern                Process one big snapshot  Process many snapshots fast     Process events as they arrive

MapReduce and Spark assume the data already exists somewhere (HDFS, S3). Kafka is what gets the data there in the first place, in real time, and makes it replayable.


Real-World Use

  • LinkedIn (Kafka's original creator): 1 trillion messages per day for activity tracking, metrics, and log aggregation.

  • Netflix: 4 million events per second feeding recommendations, A/B test capture, and operational monitoring.

  • Uber: trip updates, surge-pricing inputs, and petabyte-scale analytics pipelines all flow through Kafka topics.

  • Airbnb: search-ranking ML features, payment event capture, and the backbone of their data platform.


Wrap-up

Across this module you have walked through every layer of the modern data stack: storage that survives failures (DFS, GFS), computation that scales linearly (MapReduce, Spark), and now an event substrate that connects them in real time (Kafka). All three rest on the same primitives the earlier sections introduced: hash partitioning, replication, and consensus.

The next time you see "this team built it on Kafka" or "we run Spark on top of S3," you can decompose the architecture into the primitives. That is the point of Module 5.