Distributed File Systems
GFS: The Pioneer (2003)
Google's solution to store the web
Google had a revelation: why splurge on high-end hardware when you can make do with the cheap stuff? Enter GFS, a system designed to keep data intact even when the hardware underneath is anything but reliable. Think of it as the duct tape holding together a room full of bargain-bin servers.
- Single leader (Master) + chunk servers
- 64MB chunks, 3x replicated to outlast frequent failures
- Optimized for large sequential reads
- Manages petabytes of data
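The large 64MB chunk size is what keeps the single master out of the data path: a client asks the master for chunk locations once per chunk, then streams directly from chunk servers. A minimal sketch (helper name is hypothetical, not from the GFS paper) of mapping byte offsets to chunk indices:

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64MB, as in GFS

def chunk_index(byte_offset: int) -> int:
    """Map a byte offset within a file to the index of its chunk."""
    return byte_offset // CHUNK_SIZE

# A client scanning a 200MB file touches only 4 chunks, so it needs
# master metadata just 4 times before reading from chunk servers.
offsets = [0, 100 * 1024**2, 150 * 1024**2, 200 * 1024**2 - 1]
print(sorted({chunk_index(o) for o in offsets}))  # → [0, 1, 2, 3]
```

With 4KB chunks instead, the same scan would mean ~51,200 metadata lookups, which is exactly the master bottleneck the design avoids.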
HDFS: Open Source Revolution
Hadoop's answer to GFS
HDFS took Google's playbook and ran with it, democratizing big data storage. It's the backbone of the Hadoop ecosystem, ensuring that data stays consistent, even if the hardware doesn't.
- NameNode (leader) + DataNodes
- 128MB blocks with rack-aware replication
- Powers the big data ecosystem
- CP system: Strong consistency
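Rack-aware replication is a deliberate compromise: surviving a whole-rack failure without paying full cross-rack bandwidth for every replica. A sketch of HDFS's default 3-replica policy, using a hypothetical toy cluster model:

```python
import random

# Hypothetical cluster layout: {rack_name: [node_names]}
cluster = {
    "rack1": ["r1n1", "r1n2", "r1n3"],
    "rack2": ["r2n1", "r2n2", "r2n3"],
}

def place_replicas(writer_rack: str, writer_node: str, cluster: dict) -> list:
    """Default HDFS policy for 3 replicas: first on the writer's node,
    second on a node in a different rack, third on a different node in
    that same remote rack. This tolerates a full rack failure while
    sending only one copy across the rack boundary."""
    remote_rack = random.choice([r for r in cluster if r != writer_rack])
    second, third = random.sample(cluster[remote_rack], 2)
    return [writer_node, second, third]

replicas = place_replicas("rack1", "r1n1", cluster)
```

Losing rack1 entirely still leaves two replicas on rack2; losing rack2 leaves one on the writer's node.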
Modern Cloud Storage
S3: Object Storage Redefined
Amazon's S3 reimagined storage with a system that doesn't rely on a single leader. It's all about durability and scale, promising that your data will be there when you need it, even if, in its early years, reads could take a moment to catch up to writes.
- No leader, distributed metadata
- 11 9's durability (99.999999999%)
- Launched eventually consistent (AP); strong read-after-write consistency since December 2020
- Effectively infinite scale
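Where do "11 nines" come from? A back-of-envelope sketch (the 1% per-copy loss probability is a made-up illustration, not an S3 figure; real durability modeling also accounts for repair rates and correlated failures) showing how independent copies multiply nines:

```python
# Toy model: assume each independent copy has a 1% chance of being
# lost before it can be repaired. An object is lost only if *every*
# copy is lost, so loss probabilities multiply.
p_copy_loss = 0.01

for copies in (1, 2, 3):
    p_object_loss = p_copy_loss ** copies
    print(f"{copies} copies → durability ≈ {1 - p_object_loss:.6f}")
```

Each extra independent copy adds two nines under this toy model; spreading copies across availability zones is what keeps the failures (approximately) independent.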
Colossus: GFS 2.0
Google's Colossus took the GFS design and scaled it to the exabyte range. It ditched the single master for distributed metadata, allowing for fast access without a leader bottleneck, and swapped plain replication for erasure coding.
- Distributed metadata (no leader bottleneck)
- Exabyte scale at Google
- Sub-millisecond latencies
- Reed-Solomon erasure coding
Quick Comparison
| System | Year | Leader | Chunk Size | Replication | CAP | Scale |
|---|---|---|---|---|---|---|
| GFS | 2003 | Single | 64MB | 3x | CP | PB |
| HDFS | 2006 | Single (NameNode) | 128MB | 3x rack-aware | CP | PB |
| S3 | 2006 | None | Variable | Across ≥3 AZs | AP (strong reads since 2020) | ∞ |
| Colossus | 2010+ | Distributed | Variable | Reed-Solomon | CP | EB+ |
Real-World Usage
Hadoop (HDFS)
The Data Warehouse
Powers analytics for tech giants. Facebook's 600PB+ data warehouse and Yahoo's 40,000+ node clusters rely on it.
Amazon S3
The Global Hard Drive
From Netflix to Spotify, if it's online, it probably lives on S3.
Google Colossus
The Final Boss of Scale
Handles everything from YouTube videos to Gmail attachments at a scale that defies conventional systems.
Key Takeaways
1. Evolution of Scale
From PBs to EBs
- GFS: Petabytes (2003)
- HDFS: Petabytes (2006)
- S3/Colossus: Exabytes (2010s)
2. Leader Bottleneck Solution
Distributed metadata wins
- GFS/HDFS: Single leader bottleneck
- Modern systems: Distributed coordination
- Trade-off: Complexity for scale
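One common way to distribute metadata without a leader is to hash keys across a set of metadata servers, so no single node sees every lookup. A sketch under stated assumptions (server names and layout are hypothetical; Colossus in practice stores its file metadata in Bigtable rather than a simple hash ring):

```python
import hashlib

# Hypothetical pool of metadata servers.
METADATA_SERVERS = ["md0", "md1", "md2", "md3"]

def metadata_server_for(path: str) -> str:
    """Deterministically route a file path to one metadata server,
    spreading load instead of funneling it through a single master."""
    digest = hashlib.sha256(path.encode()).digest()
    return METADATA_SERVERS[digest[0] % len(METADATA_SERVERS)]

print(metadata_server_for("/logs/2024/01/part-0001"))
```

This is the complexity-for-scale trade: lookups now work without a central leader, but operations spanning many paths (renames, consistent listings) need cross-server coordination that a single master got for free.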
3. CAP Trade-offs
Choose based on use case
- HDFS: CP (analytics, where consistency matters)
- S3: AP at launch (web serving, where availability matters); strongly consistent reads since 2020
- All must tolerate partitions; they differ in what they sacrifice when one occurs