
MIT 6.824: Lecture 3 - GFS


The Google File System paper is relevant to this course because GFS is an example of distributed storage, which is a key abstraction in building distributed systems. Many distributed systems are either distributed storage systems or systems built on top of distributed storage.

Building distributed storage is a hard problem for a couple of reasons:

- To get good performance, you shard data across many servers.
- With many servers, constant faults become the norm.
- To tolerate faults, you replicate data across machines.
- Replication opens the door to inconsistencies between replicas.
- Protocols that guarantee consistency cost performance.

This cycle, which leads back to performance, highlights the challenges in building distributed systems. The GFS paper touches on these topics and discusses the trade-offs that were made to yield good performance in a production-ready distributed storage system.


Paper Summary

The system was built at a time when Google needed a system to meet its data-processing demands, with the goals of achieving good performance while being:

- scalable, so that it could store very large volumes of data across many machines, and
- reliable and highly available, even though it runs on inexpensive commodity hardware that fails often.

Also note that it was tailored for a workload that largely consisted of sequential access to huge files (read or append). It was not built to optimize for low-latency requests; rather, it was meant for batch workloads which often read a file sequentially, as in MapReduce jobs.

Design Overview

A GFS cluster is made up of a single master and multiple chunkservers, and it is accessed by multiple clients, as shown in the figure below.

GFS Architecture

Breakdown of the architecture:

- The master maintains all the file system metadata: the namespace, access-control information, the mapping from files to chunks, and the current locations of chunks. It also coordinates system-wide activities such as garbage collection and chunk migration.
- Files are divided into fixed-size chunks, and each chunk is identified by a globally unique 64-bit chunk handle assigned by the master at creation time.
- Chunkservers store chunks on their local disks as plain Linux files, and each chunk is replicated on multiple chunkservers (three by default).
- Clients link against a GFS client library, which implements the file system API and communicates with the master and chunkservers on behalf of the application.

An interesting design choice made in this system is the decoupling of the data flow from the control flow. The GFS client only communicates with the master for metadata operations, but all data-bearing communications (reads and writes) go directly to the chunkservers. I'll explain how that works next.

Single Master

As noted earlier, there is a single master in a GFS cluster which clients only interact with to retrieve metadata. This section highlights the role of the master in decoupling the data flow from the control flow.

To read the data for a file:

1. The client translates the file name and byte offset specified by the application into a chunk index within the file (possible because chunks are of a fixed size).
2. It sends the master a request containing the file name and chunk index.
3. The master replies with the corresponding chunk handle and the locations of its replicas. The client caches this information, using the file name and chunk index as the key.
4. The client then requests the data from one of the replicas (usually the closest one), specifying the chunk handle and a byte range within that chunk. Further reads of the same chunk need no more client-master interaction until the cached information expires (the sketch after these steps shows this flow in code).
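
To make the flow concrete, here is a minimal client-side sketch in Go. The names (`askMaster`, `readFromChunkserver`, the `ChunkInfo` struct) are hypothetical stand-ins for the GFS client library's internals, not the real API; the point is only to show where the master is (and is not) involved.

```go
package main

import "fmt"

const chunkSize = 64 * 1024 * 1024 // 64 MB chunks

// ChunkInfo is a hypothetical struct holding what the master returns.
type ChunkInfo struct {
	Handle   uint64   // globally unique chunk handle
	Replicas []string // chunkserver addresses holding this chunk
}

// Cache of (filename, chunk index) -> ChunkInfo, so repeated reads of the
// same chunk skip the master entirely.
var cache = map[string]ChunkInfo{}

func read(filename string, offset, length int64) []byte {
	chunkIndex := offset / chunkSize // step 1: translate offset -> chunk index

	key := fmt.Sprintf("%s/%d", filename, chunkIndex)
	info, ok := cache[key]
	if !ok {
		info = askMaster(filename, chunkIndex) // steps 2-3: metadata only
		cache[key] = info
	}

	// Step 4: data flows directly from a chunkserver, not through the master.
	return readFromChunkserver(info.Replicas[0], info.Handle, offset%chunkSize, length)
}

// Stubs standing in for the real RPCs.
func askMaster(filename string, chunkIndex int64) ChunkInfo {
	return ChunkInfo{Handle: 42, Replicas: []string{"chunkserver-1:9000"}}
}

func readFromChunkserver(addr string, handle uint64, offsetInChunk, length int64) []byte {
	return make([]byte, length)
}

func main() {
	data := read("/logs/web-00", 200*1024*1024, 4096) // read 4 KB at a 200 MB offset
	fmt.Println("read", len(data), "bytes")
}
```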

Chunk Size

In typical Linux filesystems, a file is split into blocks, and those blocks usually range from 512 bytes to 64 kilobytes in size, with the default on most file systems being 4 kilobytes.

The block size is the unit of work for the file system, which means that reads and writes are done in multiples of the block size.

In GFS, chunks are analogous to blocks, except that chunks are of a much larger size (64 MB). Having a large chunk size offers several advantages in this system:

- It reduces how often clients need to interact with the master, since reads and writes on the same chunk require only one initial request to the master for chunk location information (see the sketch after this list).
- Since a client is more likely to perform many operations on a single large chunk, it can reduce network overhead by keeping a persistent TCP connection to the chunkserver.
- It reduces the amount of metadata the master has to store, which makes it feasible for the master to keep all its metadata in memory.
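
As a back-of-the-envelope illustration of the first advantage, compare how many location lookups a sequential read would need with 64 MB chunks versus 4 KB blocks. The 1 GB file size and the 4 KB block size here are just illustrative; only the 64 MB chunk size comes from the paper.

```go
package main

import "fmt"

// How many location lookups would a client need to read a 1 GB file
// sequentially, if every chunk/block required one lookup?
func main() {
	const fileSize = 1 << 30 // 1 GB
	const gfsChunk = 64 << 20
	const fsBlock = 4 << 10

	fmt.Println("lookups with 64 MB chunks:", fileSize/gfsChunk) // 16
	fmt.Println("lookups with 4 KB blocks: ", fileSize/fsBlock)  // 262144
}
```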

Google uses lazy space allocation to avoid wasting space due to internal fragmentation. Internal fragmentation means having unused portions of the 64 MB chunk. For example, if we allocate a 64 MB chunk and only fill up 10 MB, that's a lot of unused space.

According to this Stack Overflow answer,

Lazy space allocation means that the physical allocation of space is delayed as long as possible, until data at the size of the chunk size is accumulated.

From the rest of that answer, I think what this means is that the decision to allocate a new chunk is based solely on the data available, as opposed to using another partitioning scheme to allocate data to chunks.

This does not mean the chunks will always be filled up. A chunk which contains the file region for the end of a file will typically only be partially filled up.
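
A minimal sketch of the idea in Go: a chunkserver can store a chunk as an ordinary file and only extend it as data arrives, so a mostly-empty 64 MB chunk occupies only as much disk space as has actually been written. The on-disk layout here is invented for illustration; the paper does not describe the chunkserver's storage format in this detail.

```go
package main

import (
	"fmt"
	"os"
)

const chunkCapacity = 64 << 20 // logical chunk size: 64 MB

// appendToChunk lazily grows the chunk's backing file: nothing is
// preallocated up to 64 MB, the file only grows as data is written.
func appendToChunk(path string, data []byte) error {
	f, err := os.OpenFile(path, os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o644)
	if err != nil {
		return err
	}
	defer f.Close()

	info, err := f.Stat()
	if err != nil {
		return err
	}
	if info.Size()+int64(len(data)) > chunkCapacity {
		return fmt.Errorf("record does not fit in this chunk")
	}
	_, err = f.Write(data)
	return err
}

func main() {
	_ = appendToChunk("chunk-0000000000000042", []byte("some record data\n"))

	info, _ := os.Stat("chunk-0000000000000042")
	// Physical size is just the bytes written so far, not 64 MB.
	fmt.Printf("on disk: %d bytes of a %d-byte chunk\n", info.Size(), chunkCapacity)
}
```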

Metadata

The master stores three types of metadata in memory:

- the file and chunk namespaces,
- the mapping from files to chunks, and
- the locations of each chunk's replicas.

The first two types listed are also persisted on the master's local disk. The third is not persisted; instead, the master asks each chunkserver about its chunks at master startup and when a chunkserver joins the cluster.

By having the chunkserver as the ultimate source of truth of each chunk's location, GFS eliminates some of the challenges of keeping the master and chunkservers in sync regularly.

The master keeps an operation log, where it stores the namespace and file-to-chunk mappings on local disk. It replicates this operation log on several machines, and GFS does not make changes to the metadata visible to clients until they have been persisted on all replicas.

After startup, the master can restore its file system state by replaying the operation log. It keeps this log small to minimize the startup time by periodically checkpointing it.
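
Here is a rough sketch of what recovery from the operation log might look like. The record types and replay logic are my own simplification, not the actual GFS implementation; the point is that the namespace and file-to-chunk mapping can be rebuilt from a checkpoint plus the log records written after it.

```go
package main

import "fmt"

// A simplified operation-log record. The real GFS record formats are not
// public; these two operation types are just enough to illustrate replay.
type LogRecord struct {
	Op          string // "create" or "add_chunk"
	Path        string
	ChunkHandle uint64
}

// The master's in-memory metadata. Chunk locations are deliberately absent:
// they are rebuilt by asking chunkservers at startup, not from the log.
type MasterState struct {
	Namespace map[string]bool     // file and directory names
	Chunks    map[string][]uint64 // file -> ordered chunk handles
}

// replay rebuilds the in-memory state from a checkpoint plus the log
// records written after that checkpoint.
func replay(checkpoint MasterState, log []LogRecord) MasterState {
	state := checkpoint
	for _, rec := range log {
		switch rec.Op {
		case "create":
			state.Namespace[rec.Path] = true
			state.Chunks[rec.Path] = nil
		case "add_chunk":
			state.Chunks[rec.Path] = append(state.Chunks[rec.Path], rec.ChunkHandle)
		}
	}
	return state
}

func main() {
	checkpoint := MasterState{Namespace: map[string]bool{}, Chunks: map[string][]uint64{}}
	log := []LogRecord{
		{Op: "create", Path: "/logs/web-00"},
		{Op: "add_chunk", Path: "/logs/web-00", ChunkHandle: 42},
	}
	state := replay(checkpoint, log)
	fmt.Println("chunks for /logs/web-00:", state.Chunks["/logs/web-00"])
}
```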

Consistency Model

The consistency guarantee for GFS is relaxed. It does not guarantee that all the replicas of a chunk are byte-wise identical. What it does guarantee is that every piece of data stored will be written at least once on each replica. This means that a replica may contain duplicates, and it is up to the application to deal with such anomalies.

File Region State After Mutation

Table 1 from the paper summarizes the state of a file region after a mutation:

|                      | Write                    | Record Append                          |
|----------------------|--------------------------|----------------------------------------|
| Serial success       | defined                  | defined interspersed with inconsistent |
| Concurrent successes | consistent but undefined | defined interspersed with inconsistent |
| Failure              | inconsistent             | inconsistent                           |

From Table 1 above:

- A file region is consistent if all clients will always see the same data, regardless of which replicas they read from.
- A region is defined after a mutation if it is consistent and clients will see what the mutation wrote in its entirety.
- A region is consistent but undefined when all clients see the same data, but it may not reflect what any single mutation wrote (typically because fragments of concurrent successful mutations are mingled).
- A region is inconsistent when different clients may see different data depending on which replica they read.

The data mutations here may be writes or record appends. A write occurs when data is written at a file offset specified by the application.

Record Appends

Record appends cause data to be written atomically at least once even in the presence of concurrent mutations, but at an offset chosen by GFS.

If a record append succeeds on some replicas and fails on others, those successful appends are not rolled back. This means that if the client retries the operation, the successful replicas may have duplicates for that record.

Retrying the record append at a new file offset could mean that the offset chosen for the initial failed append operation is now blank in the file regions of the failed replicas; that is, if the region has not been modified before the retry. This blank region is known as padding, and the existence of padding and duplicates in replicas is what makes them inconsistent.

Applications that use GFS are left with the responsibility of dealing with these inconsistent file regions. These applications can include a unique ID with each record to filter out duplicates, and use checksums to detect and discard extra padding.
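
As an illustration, an application-level record format might prepend a length, a unique record ID, and a checksum to each record, so a reader can skip padding and drop duplicates. This framing is my own invention for the sketch; the paper only says that applications use checksums and unique IDs, not how.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"hash/crc32"
	"io"
)

// frame wraps a payload as: [length][recordID][crc32][payload].
func frame(recordID uint64, payload []byte) []byte {
	buf := new(bytes.Buffer)
	binary.Write(buf, binary.LittleEndian, uint32(len(payload)))
	binary.Write(buf, binary.LittleEndian, recordID)
	binary.Write(buf, binary.LittleEndian, crc32.ChecksumIEEE(payload))
	buf.Write(payload)
	return buf.Bytes()
}

// readRecords scans a file region, skipping zero-length entries and entries
// that fail the checksum (padding/garbage), and dropping record IDs it has
// already seen (duplicates from retried appends).
func readRecords(region []byte) [][]byte {
	seen := map[uint64]bool{}
	var records [][]byte
	r := bytes.NewReader(region)
	for {
		var length uint32
		var id uint64
		var sum uint32
		if binary.Read(r, binary.LittleEndian, &length) != nil {
			break // end of region
		}
		if length == 0 {
			continue // likely padding: skip ahead
		}
		if binary.Read(r, binary.LittleEndian, &id) != nil {
			break
		}
		if binary.Read(r, binary.LittleEndian, &sum) != nil {
			break
		}
		payload := make([]byte, length)
		if _, err := io.ReadFull(r, payload); err != nil {
			break
		}
		if crc32.ChecksumIEEE(payload) != sum {
			continue // padding or corruption: skip this entry
		}
		if seen[id] {
			continue // duplicate from a retried append: skip
		}
		seen[id] = true
		records = append(records, payload)
	}
	return records
}

func main() {
	region := append(frame(1, []byte("hello")), frame(1, []byte("hello"))...) // duplicate
	fmt.Println("records kept:", len(readRecords(region)))                    // 1
}
```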

There is also the possibility of a client reading from a stale replica. Each chunk replica is given a version number, which the master increases every time it grants a new lease on the chunk. If the chunkserver hosting a chunk replica is down during a mutation, the chunk replica will become stale and will have an older version number. Stale replicas are not given to clients when they ask the master for the location of a chunk, and they are not involved in mutations either.

Despite this, because a client caches the location of a chunk, it may read from a stale replica before the information is refreshed. The impact of this is low because most operations to a chunk are append-only. This means that a stale replica usually returns a premature end of chunk, rather than outdated data for a value.
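
A rough sketch of how the master might filter out stale replicas using version numbers. The data structures here are hypothetical; only the idea of comparing a replica's reported version against the master's current version for the chunk comes from the paper.

```go
package main

import "fmt"

// Replica is a hypothetical record of one chunkserver's copy of a chunk.
type Replica struct {
	Addr    string
	Version uint64 // version reported by the chunkserver
}

// freshReplicas returns only the replicas whose version matches the master's
// current version for the chunk; anything older is stale and is neither
// handed to clients nor included in mutations.
func freshReplicas(currentVersion uint64, replicas []Replica) []string {
	var fresh []string
	for _, r := range replicas {
		if r.Version == currentVersion {
			fresh = append(fresh, r.Addr)
		}
	}
	return fresh
}

func main() {
	replicas := []Replica{
		{Addr: "cs-1:9000", Version: 7},
		{Addr: "cs-2:9000", Version: 6}, // missed the last lease grant: stale
		{Addr: "cs-3:9000", Version: 7},
	}
	fmt.Println(freshReplicas(7, replicas)) // [cs-1:9000 cs-3:9000]
}
```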

System Interactions

This section describes in more detail how the client, master and chunkservers interact to implement data mutations and atomic record appends.

Writes

When the master receives a modification operation for a particular chunk, the following happen:

a) The master finds the chunkservers which hold that chunk and grants a chunk lease to one of them; the replica holding the lease is called the primary (a sketch of this lease bookkeeping follows this list).

b) After the lease expires (typically after 60 seconds), the master is free to grant primary status to a different server for that chunk.

c) The master may lose communication with a primary while the mutation is still happening. If this happens, it is fine for the master to grant a new lease to another replica as long as the lease timeout has expired.
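
A minimal sketch of the lease bookkeeping on the master, assuming the 60-second lease from the paper. The struct and function names are made up for illustration.

```go
package main

import (
	"fmt"
	"time"
)

const leaseDuration = 60 * time.Second

// Lease records which replica is currently the primary for a chunk.
type Lease struct {
	Primary string
	Expires time.Time
}

var leases = map[uint64]Lease{} // chunk handle -> current lease

// grantLease hands out a new lease for a chunk, but only if no unexpired
// lease is outstanding: even if the master has lost contact with the current
// primary, it must wait for the old lease to run out.
func grantLease(handle uint64, replica string, now time.Time) (Lease, error) {
	if l, ok := leases[handle]; ok && now.Before(l.Expires) {
		return Lease{}, fmt.Errorf("chunk %d: lease held by %s until %v", handle, l.Primary, l.Expires)
	}
	l := Lease{Primary: replica, Expires: now.Add(leaseDuration)}
	leases[handle] = l
	return l, nil
}

func main() {
	now := time.Now()
	l, _ := grantLease(42, "cs-1:9000", now)
	fmt.Println("primary:", l.Primary)

	// Granting again before expiry fails, even if cs-1 seems unreachable.
	if _, err := grantLease(42, "cs-2:9000", now.Add(10*time.Second)); err != nil {
		fmt.Println(err)
	}
}
```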

Let's look at Figure 2 which illustrates the control flow of a write operation.

Write Control and Data Flow

The numbered steps below correspond to each number in the diagram.

  1. The client asks the master which chunkserver holds the current lease for the chunk (the primary) and where the other replicas are.
  2. The master grants a new lease to a replica (if none exists), increases the chunk version number, and tells all replicas to do the same after the mutation has been applied. It then replies to the client. After this, the client no longer has to talk to the master.
  3. The client pushes the data to all the chunkservers, not necessarily to the primary first. The servers will initially store this data in an internal LRU buffer cache until the data is used.
  4. Once the client receives acknowledgement that the data has been pushed successfully, it sends the write request to the primary chunkserver. The primary decides what serial order to apply the mutations in and applies them to its own copy of the chunk (see the sketch after this list).
  5. After applying the mutations, the primary forwards the write request and the serial order to all the secondaries for them to apply in the same order.
  6. All secondaries reply to the primary once they have completed the operation.
  7. The primary replies to the client, indicating whether the operation was a success or an error. Note:
    • If the write succeeds at the primary but fails at any of the secondaries, we'll have an inconsistent state and an error is returned to the client.
    • The client can retry steps 3 through 7.
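
Below is a rough sketch of the primary's role in steps 4-6: it assigns a serial order to mutations, applies them locally, and forwards that order to the secondaries. Everything here (the types, `apply`, `forwardToSecondary`) is invented for illustration, not the real GFS code.

```go
package main

import "fmt"

// Mutation is a hypothetical write request whose data has already been
// pushed to and buffered on every replica (step 3).
type Mutation struct {
	ID     string
	Serial uint64 // assigned by the primary
}

type Primary struct {
	nextSerial  uint64
	secondaries []string
}

// applyWrite implements steps 4-6 from the primary's point of view.
func (p *Primary) applyWrite(m Mutation) error {
	// Step 4: the primary picks the serial order and applies locally.
	p.nextSerial++
	m.Serial = p.nextSerial
	apply(m)

	// Step 5: forward the request and the serial order to all secondaries.
	for _, s := range p.secondaries {
		if err := forwardToSecondary(s, m); err != nil {
			// Step 7: any secondary failure is reported back to the client,
			// which may retry; replicas are inconsistent until that succeeds.
			return fmt.Errorf("secondary %s failed: %w", s, err)
		}
	}
	// Steps 6-7: all secondaries acknowledged; report success to the client.
	return nil
}

func apply(m Mutation) { fmt.Println("applied", m.ID, "at serial", m.Serial) }

func forwardToSecondary(addr string, m Mutation) error {
	fmt.Println("forwarded", m.ID, "to", addr)
	return nil
}

func main() {
	p := &Primary{secondaries: []string{"cs-2:9000", "cs-3:9000"}}
	if err := p.applyWrite(Mutation{ID: "write-1"}); err != nil {
		fmt.Println("error:", err)
	}
}
```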

Atomic Record Appends

The system interactions for record appends are largely the same as discussed for writes, with the following exceptions:

- The client pushes the data to all replicas of the last chunk of the file, then sends its request to the primary.
- The primary checks whether appending the record to the current chunk would exceed the maximum chunk size (64 MB). If it would, the primary pads the chunk to the maximum size, tells the secondaries to do the same, and tells the client to retry the operation on the next chunk (see the sketch after this list).
- If the record fits, the primary appends the data at an offset of its choosing, tells the secondaries to write the data at that exact offset, and finally replies to the client with the offset.
- Record appends are restricted to at most one quarter of the maximum chunk size, which keeps the worst-case space wasted on padding small.
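
Here is a small sketch of the primary's decision in the second and third bullets above. The `Chunk` type and the retry signal are hypothetical; the 64 MB limit and the quarter-chunk cap on record size come from the paper.

```go
package main

import (
	"errors"
	"fmt"
)

const (
	maxChunkSize  = 64 << 20         // 64 MB
	maxRecordSize = maxChunkSize / 4 // records are capped at 1/4 of a chunk
)

// Chunk is a hypothetical in-memory view of the last chunk of a file.
type Chunk struct {
	data []byte
}

var errRetryOnNextChunk = errors.New("chunk padded; retry append on the next chunk")

// recordAppend returns the offset GFS chose for the record, or signals the
// client to retry on a fresh chunk if the record does not fit in this one.
func (c *Chunk) recordAppend(record []byte) (int64, error) {
	if len(record) > maxRecordSize {
		return 0, errors.New("record too large")
	}
	if len(c.data)+len(record) > maxChunkSize {
		// Pad the remainder of this chunk (secondaries are told to do the
		// same) and ask the client to retry on the next chunk.
		c.data = append(c.data, make([]byte, maxChunkSize-len(c.data))...)
		return 0, errRetryOnNextChunk
	}
	offset := int64(len(c.data)) // the offset is chosen by GFS, not the client
	c.data = append(c.data, record...)
	return offset, nil
}

func main() {
	c := &Chunk{}
	offset, err := c.recordAppend([]byte("a record"))
	fmt.Println(offset, err) // 0 <nil>
}
```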

Fault Tolerance

Fault tolerance is achieved in GFS by implementing:

- Fast recovery: the master and chunkservers are designed to restore their state and restart in seconds, no matter how they terminated.
- Chunk replication: each chunk is replicated on multiple chunkservers (three by default), and the master re-replicates chunks as replicas are lost or corrupted.
- Master replication: the operation log and checkpoints are replicated on multiple machines, and "shadow" masters can serve read-only metadata requests if the primary master goes down.
- Data integrity: each chunkserver uses checksumming to detect corruption of its stored data, as described below.

Data Integrity

Checksumming is used by each chunkserver to detect the corruption of stored data.

From the course website [1]:

A checksum algorithm takes a block of bytes as input and returns a single number that's a function of all the input bytes. For example, a simple checksum might be the sum of all the bytes in the input (mod some big number). GFS stores the checksum of each chunk as well as the chunk.

When a chunkserver writes a chunk on its disk, it first computes the checksum of the new chunk, and saves the checksum on disk as well as the chunk. When a chunkserver reads a chunk from disk, it also reads the previously-saved checksum, re-computes a checksum from the chunk read from disk, and checks that the two checksums match.

If the data was corrupted by the disk, the checksums won't match, and the chunkserver will know to return an error. Separately, some GFS applications stored their own checksums, over application-defined records, inside GFS files, to distinguish between correct records and padding. CRC32 is an example of a checksum algorithm.
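
A minimal sketch of the idea with CRC32 in Go. In the actual system, each chunk is broken into 64 KB blocks with a 32-bit checksum per block; here a single checksum per chunk keeps the sketch short, and the `storedChunk` type is made up for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"hash/crc32"
)

// storedChunk pairs a chunk's data with the checksum computed when it was
// written, mimicking a chunkserver keeping both on disk.
type storedChunk struct {
	data     []byte
	checksum uint32
}

func writeChunk(data []byte) storedChunk {
	return storedChunk{data: data, checksum: crc32.ChecksumIEEE(data)}
}

// readChunk recomputes the checksum and compares it to the stored one,
// returning an error if the data was corrupted at rest.
func readChunk(c storedChunk) ([]byte, error) {
	if crc32.ChecksumIEEE(c.data) != c.checksum {
		return nil, errors.New("checksum mismatch: chunk is corrupted")
	}
	return c.data, nil
}

func main() {
	chunk := writeChunk([]byte("some chunk data"))

	if _, err := readChunk(chunk); err == nil {
		fmt.Println("read ok")
	}

	chunk.data[0] ^= 0xFF // simulate disk corruption
	if _, err := readChunk(chunk); err != nil {
		fmt.Println(err)
	}
}
```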

[1] GFS FAQ - Lecture Notes from MIT 6.824

Conclusion

This week's material brought up some interesting ideas in the design of a distributed storage system. These include:

- keeping all metadata on a single master while clients exchange data directly with chunkservers, so the master does not become a bottleneck for reads and writes,
- using a large chunk size to cut down on metadata and client-master interaction,
- relaxing the consistency model, and pushing the handling of duplicates and padding to applications, in exchange for a simpler and faster system, and
- using leases to establish a mutation order and checksums to detect corruption.

However, having a single master eventually became less than ideal for Google's use case. As the number of files stored grew, it became harder to fit all the metadata for those files in the master's memory. In addition, the number of clients also increased, putting too much CPU load on the single master.

Another challenge with GFS at Google was that the weak consistency model meant applications had to be designed to cope with those limitations. These limitations led to the creation of Colossus as a successor to GFS.

Further Reading

