Distributed File Systems

Concept. A distributed file system splits each large file into fixed-size chunks, replicates each chunk across many cheap commodity servers, and gives clients a single-namespace view, so the cluster delivers sequential read throughput proportional to its size while tolerating constant disk and machine failures.

Intuition. When Spotify stores a 64 GB audio-feature export, GFS splits it into 1,024 chunks of 64 MB each, places 3 copies of every chunk on different machines, and lets a job read all 1,024 chunks in parallel from 1,024 disks. Even if a disk dies mid-read, two other replicas of each affected chunk are already online elsewhere.
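
A minimal sketch of that arithmetic and placement, in Python (the server names and simple striding placement are illustrative, not GFS's actual policy):

```python
# Chunk math for a 64 GB file at GFS's 64 MB chunk size, with 3 replicas each.
CHUNK_SIZE = 64 * 1024 * 1024            # 64 MB
FILE_SIZE = 64 * 1024 * 1024 * 1024      # 64 GB
REPLICAS = 3

num_chunks = (FILE_SIZE + CHUNK_SIZE - 1) // CHUNK_SIZE   # ceiling division -> 1024

# Hypothetical fleet of chunkservers; real placement also weighs disk usage,
# rack topology, and recent failures.
servers = [f"chunkserver-{i:03d}" for i in range(100)]

def place_replicas(chunk_index: int) -> list[str]:
    """Pick 3 distinct servers for one chunk (simple striding, for illustration)."""
    return [servers[(chunk_index + r * 37) % len(servers)] for r in range(REPLICAS)]

print(num_chunks)          # 1024
print(place_replicas(0))   # ['chunkserver-000', 'chunkserver-037', 'chunkserver-074']
```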


The Pattern, and What Changes Across Generations

The GFS case study introduced chunks-and-replicas as the answer to "store the web on cheap, dying drives." That same pattern shows up in every distributed file system since: split big files into chunks, replicate each chunk across cheap servers, and give clients one logical namespace. What evolved across generations is how the metadata is coordinated, and that single design choice cascades into scale, latency, and consistency.

Figure: four generations of distributed file systems (GFS, HDFS, S3, Colossus), showing the evolution from a single metadata leader to fully distributed metadata.


GFS (2003): The Pioneer

Google's solution to storing the web on cheap hardware. The case study walks through the architecture. The punchline for this comparison is the single Master that holds all file-to-chunk metadata in RAM. Every open, list, or chunk-lookup goes through it, which is why the cluster scales only until the Master's RAM or CPU runs out, somewhere in the petabyte range.
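
A toy sketch of that single point of coordination (the class and field names are hypothetical, not GFS's actual data structures): every read has to resolve its chunk locations through one process's RAM before any bytes move.

```python
# Toy single-Master metadata service: the entire file -> chunk -> location map
# lives in one process's memory, so namespace size and lookup rate are both
# capped by that one machine.

class Master:
    def __init__(self) -> None:
        self.file_chunks: dict[str, list[str]] = {}       # path -> ordered chunk IDs
        self.chunk_locations: dict[str, list[str]] = {}   # chunk ID -> replica addresses

    def lookup(self, path: str, chunk_index: int) -> list[str]:
        """Every read starts here: map (file, chunk index) to replica locations."""
        chunk_id = self.file_chunks[path][chunk_index]
        return self.chunk_locations[chunk_id]

master = Master()
master.file_chunks["/exports/audio-features"] = ["chunk-0001"]
master.chunk_locations["chunk-0001"] = ["cs-01:7000", "cs-02:7000", "cs-03:7000"]

# The client then streams the 64 MB chunk directly from a chunkserver;
# only the lookup, not the data, flows through the Master.
print(master.lookup("/exports/audio-features", 0))
```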


HDFS (2006): GFS for Everyone Else

Yahoo and the Hadoop community open-sourced the GFS pattern as HDFS, the storage layer that powered the entire Hadoop ecosystem.

  • One NameNode plays the Master role.

  • DataNodes store 128 MB blocks by default (twice GFS's chunk size, sized for the larger files of the Hadoop era; early releases defaulted to 64 MB).

  • Replication is rack-aware: copies sit on different racks so a rack power loss does not lose all replicas of a block (see the placement sketch after this list).

  • CP system: strict consistency, at the cost of availability while the NameNode recovers or fails over.

Same fundamental architecture as GFS, same bottleneck. HDFS Federation later sharded the namespace across multiple NameNodes to push past it.
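
A minimal placement sketch in the spirit of HDFS's default block-placement policy (rack and node names are hypothetical, and the real policy also considers load and free space):

```python
import random

# Hypothetical rack topology: rack name -> DataNodes in that rack.
TOPOLOGY = {
    "rack-a": ["dn-a1", "dn-a2", "dn-a3"],
    "rack-b": ["dn-b1", "dn-b2", "dn-b3"],
    "rack-c": ["dn-c1", "dn-c2", "dn-c3"],
}

def place_block(writer_rack: str) -> list[str]:
    """Roughly HDFS's default policy: first replica near the writer, the other
    two together on a single different rack, so no rack holds all three."""
    first = random.choice(TOPOLOGY[writer_rack])
    other_rack = random.choice([r for r in TOPOLOGY if r != writer_rack])
    second, third = random.sample(TOPOLOGY[other_rack], 2)
    return [first, second, third]

print(place_block("rack-a"))   # e.g. ['dn-a2', 'dn-b1', 'dn-b3']
```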


S3 (2006): No Leader at All

Amazon S3 took a different turn. Instead of one Master, S3 has no central metadata coordinator. Every metadata operation hits a distributed key-value store, and the storage backend is a partitioned cluster spanning availability zones.

  • No leader bottleneck. Throughput scales horizontally with cluster size.

  • 11 nines of durability (99.999999999%) by storing 6+ copies across multiple AZs.

  • Eventually consistent writes (AP system) in the original design: a PUT could take a few seconds to become visible to every reader. (S3 has since added strong read-after-write consistency, announced in late 2020.)

  • Effectively infinite scale. Customers store exabytes; the largest single buckets hold petabytes.

The trade is consistency for availability and scale. For most workloads (object storage, backups, static assets, data lakes) the eventual-consistency window is acceptable.
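
A minimal usage sketch with boto3 (bucket, key, and file names are hypothetical; credentials are assumed to be configured in the environment). The point is that reads and writes are plain HTTP requests against a distributed service, with no metadata leader to go through:

```python
import boto3

s3 = boto3.client("s3")

# Write an object: one HTTP PUT against the distributed metadata/key-value layer.
with open("audio-features.parquet", "rb") as f:
    s3.put_object(
        Bucket="example-feature-exports",       # hypothetical bucket
        Key="2024/audio-features.parquet",
        Body=f,
    )

# Read it back: one HTTP GET. Under the original eventual-consistency model this
# read could briefly lag the PUT; S3 has since added strong read-after-write.
obj = s3.get_object(Bucket="example-feature-exports",
                    Key="2024/audio-features.parquet")
data = obj["Body"].read()
```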


Colossus (2010+): GFS Without the Bottleneck

Google replaced GFS with Colossus to break the Master bottleneck without giving up consistency.

  • Distributed metadata. Metadata is sharded across many servers; no single node holds the full chunk map.

  • Reed-Solomon erasure coding instead of 3× replication, dropping storage overhead from 200% to ~50% for the same fault tolerance (the arithmetic is worked out in the sketch below).

  • Sub-millisecond latencies for metadata operations.

  • Exabyte-plus scale at Google.

Colossus is what GFS would have been if it had been redesigned with everything Google learned about distributed metadata coordination over the prior decade.
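
A quick arithmetic sketch of that overhead claim (the RS(6, 3) parameters are illustrative; Google does not publish Colossus's exact encoding):

```python
# Storage overhead: extra bytes stored per byte of user data.

def replication_overhead(copies: int) -> float:
    """N-way replication stores N copies, so overhead is N - 1."""
    return float(copies - 1)

def erasure_overhead(data_blocks: int, parity_blocks: int) -> float:
    """Reed-Solomon RS(k, m) stores k data + m parity blocks; overhead is m / k."""
    return parity_blocks / data_blocks

print(replication_overhead(3))   # 2.0 -> 200% overhead, survives loss of 2 copies
print(erasure_overhead(6, 3))    # 0.5 ->  50% overhead, survives loss of any 3 blocks
```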


Quick Comparison

| System   | Year  | Metadata        | Chunk Size | Replication            | CAP | Scale |
|----------|-------|-----------------|------------|------------------------|-----|-------|
| GFS      | 2003  | Single Master   | 64 MB      | 3×                     | CP  | PB    |
| HDFS     | 2006  | Single NameNode | 128 MB     | 3× rack-aware          | CP  | PB    |
| S3       | 2006  | Distributed     | Variable   | 6+ copies across AZs   | AP  | EB    |
| Colossus | 2010+ | Distributed     | Variable   | Reed-Solomon erasure   | CP  | EB+   |

Real-World Usage

Hadoop (HDFS): The Data Warehouse

Powers analytics for tech giants. Facebook's 600 PB+ data warehouse and Yahoo's 40,000+ node clusters rely on it.

Amazon S3: The Global Hard Drive

From Netflix to Spotify, if a piece of data is online, it probably lives on S3 (or a peer like Google Cloud Storage / Azure Blob).

Google Colossus: The Final Boss of Scale

Stores everything from YouTube videos to Gmail attachments at a scale that defies conventional file systems.


Key Takeaway: The Leader Bottleneck

Every system above shares the same chunk-and-replicate idea. What changes across generations is how metadata is coordinated. GFS and HDFS lean on a single Master / NameNode, which caps cluster size at the leader's throughput. S3 and Colossus distribute the metadata, removing that ceiling at the cost of more complex coordination. The CAP and scale columns in the table above are downstream of that one design choice.


Next

MapReduce → Now that the storage layer can hold petabytes across cheap hardware, the next question is how to compute on that data without moving it across the network. MapReduce gives you a programming model that works directly on top of HDFS-style chunks.