Apache Kafka: Real-Time Event Streaming
Once upon a time, companies were content to process data in hefty overnight batches. Then the business world woke up and realized it needed to know that a user had clicked a button the moment it happened, not hours later. Enter event streaming.
From Batch to Stream Processing
Kafka Architecture
Core Concepts
| Component | Description |
|---|---|
| Topic | A named stream of records, loosely comparable to a database table |
| Partition | An ordered, append-only slice of a topic; the unit of parallelism |
| Producer | Publishes records to topics |
| Consumer | Reads records from topics |
| Broker | A Kafka server |
| Cluster | Set of brokers |
Key Properties
| Property | Value |
|---|---|
| Throughput | Millions of messages per second |
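| Latency | Typically under 10 milliseconds |
| Storage | Up to petabytes |
| Retention | Configurable, from days to indefinitely |
| Ordering | Guaranteed per partition only |
| Delivery | At least once by default; at most once and exactly once are also available |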
Kafka Topics, Partitions & Consumer Groups
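Within a consumer group, each partition is read by exactly one consumer, so the partition count caps the useful parallelism. A minimal sketch of that idea, assuming a simple round-robin assignment; `assign_partitions` and the consumer names are hypothetical, and real Kafka clients choose among several configurable assignment strategies:

```python
def assign_partitions(num_partitions, consumers):
    """Spread partitions over the consumers of one group, round-robin.

    Illustrative only: this is not the algorithm of any specific Kafka assignor.
    """
    ordered = sorted(consumers)
    assignment = {consumer: [] for consumer in ordered}
    for partition in range(num_partitions):
        owner = ordered[partition % len(ordered)]
        assignment[owner].append(partition)
    return assignment

# Six partitions shared by three consumers: two partitions each.
balanced = assign_partitions(6, ["c1", "c2", "c3"])

# More consumers than partitions: the extra consumer receives nothing.
crowded = assign_partitions(2, ["c1", "c2", "c3"])

print(balanced)
print(crowded)
```

In the second call, `c3` ends up with an empty list, which is why adding consumers beyond the partition count does not increase throughput.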
Producer Patterns
Fire and Forget
Use Case: Metrics, logs
Synchronous Send
Use Case: Financial transactions
Asynchronous with Callback
Use Case: Most applications
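The three producer patterns above differ only in what the caller does with the result of a send. The sketch below contrasts them using a toy in-memory producer; `ToyProducer` is a hypothetical stand-in built on Python's `concurrent.futures`, not a real Kafka client, and the topic names are made up:

```python
from concurrent.futures import ThreadPoolExecutor

class ToyProducer:
    """Hypothetical in-memory stand-in for a Kafka producer (not a real client)."""

    def __init__(self):
        self.log = []  # plays the role of a single partition
        # A single worker thread keeps appends in send order.
        self._pool = ThreadPoolExecutor(max_workers=1)

    def send(self, topic, value):
        """Queue a record and return a Future that resolves to its offset."""
        return self._pool.submit(self._append, topic, value)

    def _append(self, topic, value):
        self.log.append((topic, value))
        return len(self.log) - 1

    def flush(self):
        """Block until every queued record has been appended."""
        self._pool.shutdown(wait=True)

producer = ToyProducer()

# 1. Fire and forget: ignore the Future, so failures go unnoticed.
producer.send("metrics", "cpu=0.42")

# 2. Synchronous send: block until the write is acknowledged.
payment_offset = producer.send("payments", "txn-1001").result(timeout=5)

# 3. Asynchronous with callback: keep going and handle the outcome later.
acks = []

def on_done(future):
    """Record success or failure once the send completes."""
    error = future.exception()
    if error is not None:
        acks.append(("failed", repr(error)))
    else:
        acks.append(("ok", future.result()))

producer.send("orders", "order-7").add_done_callback(on_done)

producer.flush()
```

The synchronous variant pays one round trip of latency per record, which is why it tends to be reserved for writes where losing a record is unacceptable.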
Kafka Guarantees
Delivery Semantics
| Semantic | Description | Use Case |
|---|---|---|
| At Most Once | Messages may be lost but never duplicated | Metrics, logs |
| At Least Once | Messages never lost but may duplicate | Most applications |
| Exactly Once | Messages delivered exactly once | Financial systems |
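On the consumer side, the difference between at-most-once and at-least-once largely comes down to whether the offset is committed before or after a record is processed. A toy simulation, assuming a single consumer that crashes once between the two steps and then restarts from its last committed offset; `run_with_crash` is a hypothetical helper, not a Kafka API:

```python
def run_with_crash(records, commit_first, crash_at):
    """Process `records`, crashing once at index `crash_at` between the commit
    and process steps, then restart from the last committed offset.

    Hypothetical helper for illustration; not part of any Kafka client.
    """
    processed = []  # side effects the application actually performed
    committed = 0   # next offset to read after a restart
    for attempt in (1, 2):  # attempt 1 crashes, attempt 2 is the restart
        offset = committed
        while offset < len(records):
            crash_now = attempt == 1 and offset == crash_at
            if commit_first:
                committed = offset + 1             # commit, then process
                if crash_now:
                    break                          # record is never processed
                processed.append(records[offset])
            else:
                processed.append(records[offset])  # process, then commit
                if crash_now:
                    break                          # commit never happens
                committed = offset + 1
            offset += 1
    return processed

at_most_once = run_with_crash(["a", "b", "c"], commit_first=True, crash_at=1)
at_least_once = run_with_crash(["a", "b", "c"], commit_first=False, crash_at=1)

print(at_most_once)   # ['a', 'c']
print(at_least_once)  # ['a', 'b', 'b', 'c']
```

Committing first loses `b`; processing first handles `b` twice. Exactly-once additionally requires that processing and the offset commit succeed or fail together, which Kafka addresses with idempotent producers and transactions.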
Ordering Guarantees
- Within Partition: Total order guaranteed
- Across Partitions: No ordering guarantee
- Key-based: Same key → Same partition → Ordered
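The key-based rule follows from how records are routed: the partition is derived from a hash of the key, so every record with the same key lands in the same ordered log. A minimal sketch, assuming three partitions and using CRC32 as a stand-in hash (Kafka's default partitioner uses murmur2 for keyed records); `partition_for` and `events_for` are hypothetical helpers:

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 3  # assumed partition count for this example

def partition_for(key, num_partitions=NUM_PARTITIONS):
    """Map a key to a partition. CRC32 is only a stand-in for Kafka's hash."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

events = [
    ("user-1", "login"),
    ("user-2", "login"),
    ("user-1", "click"),
    ("user-2", "logout"),
    ("user-1", "logout"),
]

# Append each record to the partition chosen by its key.
partitions = defaultdict(list)
for key, value in events:
    partitions[partition_for(key)].append((key, value))

def events_for(key):
    """Read back one key's events from the single partition that holds them."""
    return [v for k, v in partitions[partition_for(key)] if k == key]

print(events_for("user-1"))  # ['login', 'click', 'logout']
print(events_for("user-2"))  # ['login', 'logout']
```

Each user's events come back in the order they were produced, while nothing is implied about how `user-1` events interleave with `user-2` events if the two keys hash to different partitions.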
Stream Processing with Kafka
Kafka Streams API
Common Patterns
Filtering
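Filtering keeps only the records that match a predicate and passes them downstream. A plain-Python sketch of the idea, not the Kafka Streams API itself (which is a Java library); `filter_stream` and the sample events are hypothetical:

```python
def filter_stream(records, predicate):
    """Yield only the records for which predicate(record) is true."""
    for record in records:
        if predicate(record):
            yield record

activity = [
    {"user": "a", "type": "click"},
    {"user": "b", "type": "view"},
    {"user": "a", "type": "click"},
]

only_clicks = list(filter_stream(activity, lambda r: r["type"] == "click"))
print(len(only_clicks))  # 2
```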
Transformation
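A transformation rewrites each record's value while leaving its key, and therefore its partition, unchanged. A small illustration in plain Python rather than the Kafka Streams API; `map_values` and the sensor readings are made up for the example:

```python
def map_values(records, fn):
    """Apply fn to every value, keeping each record's key as-is."""
    for key, value in records:
        yield key, fn(value)

readings_celsius = [("sensor-1", 20.0), ("sensor-2", 25.0)]

readings_fahrenheit = list(
    map_values(readings_celsius, lambda c: c * 9 / 5 + 32)
)
print(readings_fahrenheit)  # [('sensor-1', 68.0), ('sensor-2', 77.0)]
```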
Aggregation
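An aggregation folds the records for each key into running state and emits the updated result as new records arrive. The sketch below models a per-key count in plain Python; `count_by_key` is a hypothetical helper, and a real stream processor would also persist this state so it survives restarts:

```python
from collections import defaultdict

def count_by_key(records):
    """Emit (key, running_count) each time a record for that key arrives."""
    counts = defaultdict(int)
    for key, _value in records:
        counts[key] += 1
        yield key, counts[key]

page_views = [("home", "v1"), ("cart", "v2"), ("home", "v3")]

updates = list(count_by_key(page_views))
print(updates)  # [('home', 1), ('cart', 1), ('home', 2)]
```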
Joins
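A join combines records that share a key. One common shape is enriching a stream of events with the latest value from a lookup table; the sketch below models that as an inner join in plain Python. `join_stream_with_table` and the sample data are hypothetical, and the table is assumed to be fully loaded before the stream is read:

```python
def join_stream_with_table(stream, table):
    """Inner join: pair each stream value with the table's value for its key.

    Records whose key is missing from the table are dropped.
    """
    for key, value in stream:
        if key in table:
            yield key, (value, table[key])

customers = {"c1": "Ada", "c2": "Grace"}                   # lookup table
orders = [("c1", "book"), ("c3", "pen"), ("c2", "lamp")]   # event stream

enriched = list(join_stream_with_table(orders, customers))
print(enriched)  # [('c1', ('book', 'Ada')), ('c2', ('lamp', 'Grace'))]
```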
Kafka vs Traditional Messaging
| Feature | Kafka | Traditional MQ |
|---|---|---|
| Storage | Persistent log | Transient queue |
| Replay | Yes, from any retained offset | No, consumed messages are removed |
| Throughput | Millions per second | Thousands per second |
| Consumers | Pull-based | Push-based |
| Ordering | Per partition | Global or none |
| Scalability | Horizontal | Vertical |
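The storage and replay rows are the heart of the difference. A toy comparison, assuming a queue that deletes messages on read and a log in which each consumer simply tracks its own offset; `read_from` is a hypothetical helper:

```python
from collections import deque

records = ["e0", "e1", "e2"]

# Traditional queue: reading a message removes it.
queue = deque(records)
first_reader = [queue.popleft() for _ in range(len(queue))]
second_reader = list(queue)  # nothing left for a later reader

# Log: records stay in place and readers choose where to start.
log = list(records)

def read_from(log, offset):
    """Return every record at or after the given offset."""
    return log[offset:]

live_consumer = read_from(log, 0)
replaying_consumer = read_from(log, 1)  # a second reader, starting at offset 1

print(second_reader)       # []
print(replaying_consumer)  # ['e1', 'e2']
```

Because reads do not mutate the log, any number of independent consumers can process the same records at their own pace.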
Real-World Use Cases
LinkedIn (Original Creator)
- Activity Tracking: 1 trillion messages per day
- Metrics: Real-time monitoring
- Log Aggregation: Central logging
Netflix
- Recommendations: Real-time updates
- A/B Testing: Event collection
- Monitoring: 4 million events per second
Uber
- Trip Updates: Real-time tracking
- Surge Pricing: Dynamic calculations
- Analytics: Petabytes of data
Airbnb
- Search Ranking: Real-time ML features
- Payments: Transaction processing
- Data Pipeline: ETL replacement