The Google File System

Paper digest - The Google File System

Motivation

  • A scalable distributed file system for large, data-intensive distributed applications, driven by observations of application workloads and the technological environment, both current and anticipated:
    • Component failures are the norm rather than the exception.
    • Files are huge by traditional standards. Multi-GB files are common.
    • Workloads: mostly large streaming reads plus some small random reads; many large, sequential writes that append to files.

Goals

  • Provides fault tolerance while running on inexpensive commodity hardware.

  • Delivers high aggregate performance to a large number of clients.

  • Must efficiently implement well-defined semantics for multiple clients that concurrently append to the same file. Atomicity with minimal synchronization overhead is essential.

  • Prefer high sustained bandwidth over low latency (batch processing, not real time).

  • Consistency model: co-design the applications and the file system for flexibility (in other words, push hard problems to the applications).

System Design

  • Master-slave architecture: a single master and multiple chunk servers (slaves), accessed by many clients.

    • GFS Master:
      • Metadata storage: the namespace, access control information, the mapping from files to chunks, and the current locations of chunk replicas.
      • System-wide activities: chunk lease management, garbage collection of orphaned chunks, and chunk migration between chunk servers.
      • The system is designed to minimize the master's involvement, since the master is a single point of failure and a potential bottleneck.
    • GFS Slave (aka chunk server):
      • Stores chunks, each identified by a globally unique 64-bit chunk handle, as files on the local Linux file system; chunks are replicated (3 copies by default).
      • Does not cache file data (relies on Linux's buffer cache).
    • GFS Client:
      • Interacts with the master for metadata operations.
      • Actual data I/O does not go through the master; it flows directly between the client and the chunk servers (see the read-path sketch after this list).
      • Does not cache file data (files are too large to cache, and cache coherence is a hard problem, so the design simply avoids it).
      • Not caching works because of how GFS applications behave: MapReduce reads sequentially, so a cache buys little, and Bigtable manages its own cache.
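The split between metadata traffic and data traffic is easiest to see in code. Below is a minimal sketch in Go, assuming invented types and names (Master, ChunkHandle, a fetch callback); it only illustrates that the client asks the master for a chunk handle and replica locations, then reads the bytes directly from a chunk server.

```go
package gfs

// ChunkHandle is the globally unique 64-bit chunk identifier.
type ChunkHandle uint64

// Master holds all metadata in memory: the namespace, the file-to-chunk
// mapping, and the (volatile) chunk locations reported by chunk servers.
type Master struct {
	files     map[string][]ChunkHandle // file path -> ordered chunk handles
	locations map[ChunkHandle][]string // chunk handle -> chunk server addresses
}

// Lookup returns the chunk handle and replica locations covering a byte
// offset. Only this metadata exchange goes through the master.
func (m *Master) Lookup(path string, offset int64) (ChunkHandle, []string) {
	const chunkSize = 64 << 20 // 64 MB fixed-size chunks
	h := m.files[path][offset/chunkSize]
	return h, m.locations[h]
}

// Read shows the client-side read path: ask the master once for metadata,
// then fetch the bytes directly from one of the chunk servers.
func Read(m *Master, fetch func(addr string, h ChunkHandle, off, n int64) []byte,
	path string, offset, n int64) []byte {
	h, replicas := m.Lookup(path, offset)
	// Data never flows through the master; pick any replica (e.g. the closest).
	return fetch(replicas[0], h, offset%(64<<20), n)
}
```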
  • Lease Management

    • A lease is used to maintain a consistent mutation order across replicas (sketched after this list).
    • The master grants a chunk lease to one of the replicas, which becomes the primary.
    • The primary picks a serial order for all mutations to the chunk; all replicas follow this order when applying mutations.
    • A lease has an initial timeout of 60 seconds. It can be extended at the request of the primary (extension requests piggyback on heartbeat messages) or revoked early by the master; if the master loses contact with the primary, it grants a new lease to another replica only after the old one expires.
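A rough sketch of the lease bookkeeping described above, again with invented names (Lease, LeaseManager): the master grants a 60-second lease on a chunk to one replica, extends it on request from that primary, and hands the lease to another replica only once the old one has expired.

```go
package gfs

import "time"

// Lease records which replica is currently primary for a chunk and until
// when. The structure is invented; the paper only describes the behavior.
type Lease struct {
	Primary string    // address of the primary chunk server
	Expiry  time.Time // leases start with a 60-second timeout
}

const leaseTimeout = 60 * time.Second

// LeaseManager is the master-side bookkeeping, keyed by 64-bit chunk handle.
type LeaseManager struct {
	leases map[uint64]Lease
}

// Grant hands the lease to one replica, making it the primary, unless an
// unexpired lease is already held by another replica.
func (lm *LeaseManager) Grant(chunk uint64, replica string, now time.Time) (Lease, bool) {
	if l, ok := lm.leases[chunk]; ok && now.Before(l.Expiry) {
		return l, false // another primary still holds a valid lease
	}
	l := Lease{Primary: replica, Expiry: now.Add(leaseTimeout)}
	lm.leases[chunk] = l
	return l, true
}

// Extend renews the lease at the request of the current primary; in GFS these
// requests piggyback on the regular heartbeat messages.
func (lm *LeaseManager) Extend(chunk uint64, replica string, now time.Time) bool {
	l, ok := lm.leases[chunk]
	if !ok || l.Primary != replica || now.After(l.Expiry) {
		return false
	}
	l.Expiry = now.Add(leaseTimeout)
	lm.leases[chunk] = l
	return true
}
```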
  • Consistency Model

    • Relaxed consistency model.
    • File namespace mutations (e.g., file creation) are atomic and handled exclusively by the master.
    • Data mutation outcome: consistent but undefined after concurrent successful mutations (all clients see the same data, but it may mix fragments from multiple writers).
    • Data mutation outcome: inconsistent (and hence undefined) after a failed mutation (different clients may see different data).
    • Record append: at-least-once semantics, which implies that records can be duplicated and that padding can appear between them; readers need to deal with both.
    • Readers deal with this by validating records (through checksums) and identifying them (e.g., via unique record IDs to filter duplicates); a reader-side sketch follows this list.
    • Overall, the relaxed consistency model (instead of strong consistency) shifts some complexity onto application developers.
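The reader-side discipline can be made concrete with an invented record layout (length, CRC, unique ID, payload). GFS itself does not define such a format; applications layer one on top of record append. The sketch skips padding and corrupted bytes using checksums and drops duplicates by record ID.

```go
package gfs

import (
	"encoding/binary"
	"hash/crc32"
)

// Invented self-describing record layout:
//   [length uint32][crc uint32][id uint64][payload...]
const headerSize = 4 + 4 + 8

// ScanRecords walks a byte region returned by a read, skipping padding and
// corrupted bytes (checksum mismatch) and filtering duplicate records by ID.
func ScanRecords(buf []byte) [][]byte {
	var out [][]byte
	seen := make(map[uint64]bool) // record IDs already delivered
	for i := 0; i+headerSize <= len(buf); {
		length := binary.LittleEndian.Uint32(buf[i:])
		crc := binary.LittleEndian.Uint32(buf[i+4:])
		id := binary.LittleEndian.Uint64(buf[i+8:])
		end := i + headerSize + int(length)
		if length == 0 || end > len(buf) {
			i++ // looks like padding or garbage: resynchronize byte by byte
			continue
		}
		payload := buf[i+headerSize : end]
		if crc32.ChecksumIEEE(payload) != crc {
			i++ // corrupted or partial record: skip and resynchronize
			continue
		}
		if !seen[id] { // at-least-once append may duplicate records
			seen[id] = true
			out = append(out, payload)
		}
		i = end
	}
	return out
}
```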
  • Append

    • Pipelining: data is pushed linearly along a chain of chunk servers so that each machine's full outbound bandwidth is used.
    • Data flow and control flow are decoupled: the data is first pushed to all replicas, then the write/append request goes to the primary, which assigns the mutation order (see the sketch below).
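A schematic of the two decoupled flows, with an invented RPC surface (PushData, Commit): the client first pipelines the data along a chain of replicas, and only then asks the primary to commit, at which point the primary assigns the serial order that all replicas apply.

```go
package gfs

// Invented RPC surface for illustrating the write path; real GFS RPCs differ.

type Replica interface {
	// Data flow: receive the data into a buffer and forward it to the
	// next replica in the chain so the pipeline stays full.
	PushData(dataID string, data []byte, restOfChain []Replica) error
}

type PrimaryReplica interface {
	Replica
	// Control flow: assign a serial number to the buffered data and forward
	// the mutation to all secondaries; return once every replica applied it.
	Commit(dataID string, secondaries []Replica) error
}

// WriteChunk shows the client-side ordering: data first, control second.
func WriteChunk(primary PrimaryReplica, secondaries []Replica,
	dataID string, data []byte) error {
	// 1. Data flow, decoupled from control: push to the first replica in the
	//    chain, which streams the bytes onward (here the primary leads the chain).
	chain := append([]Replica{primary}, secondaries...)
	if err := chain[0].PushData(dataID, data, chain[1:]); err != nil {
		return err
	}
	// 2. Control flow: the primary picks the mutation order that every
	//    replica follows, which is what keeps the replicas consistent.
	return primary.Commit(dataID, secondaries)
}
```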
  • Fault Tolerance

    • Replicated master: checkpoints and the operation log (Q: how is the replication done? Per the paper, the operation log and checkpoints are replicated on multiple machines, and a mutation is committed only after its log record has been flushed to disk both locally and remotely). A recovery sketch follows this list.
    • Replicated chunks across chunk servers (3 replicas by default); parity or erasure codes are noted as possible future alternatives for read-mostly data.
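A sketch of the checkpoint-plus-operation-log recovery hinted at above, with invented record shapes (Checkpoint, LogRecord): the master rebuilds its metadata by loading the newest checkpoint and replaying only the log records appended after it. The on-disk formats here are assumptions; the paper describes the mechanism, not the layout.

```go
package gfs

// Checkpoint is a snapshot of the master's metadata up to a log position.
type Checkpoint struct {
	Files map[string][]uint64 // namespace: file path -> chunk handles
	Seq   uint64              // last log record folded into this checkpoint
}

// LogRecord is one entry of the master's operation log.
type LogRecord struct {
	Seq   uint64
	Op    string // "create", "delete", or "add-chunk" (invented op names)
	Path  string
	Chunk uint64
}

// Recover rebuilds the master's in-memory metadata by loading the newest
// checkpoint and replaying only the log records appended after it.
func Recover(cp Checkpoint, log []LogRecord) map[string][]uint64 {
	files := cp.Files
	for _, rec := range log {
		if rec.Seq <= cp.Seq {
			continue // already reflected in the checkpoint
		}
		switch rec.Op {
		case "create":
			files[rec.Path] = nil
		case "delete":
			delete(files, rec.Path)
		case "add-chunk":
			files[rec.Path] = append(files[rec.Path], rec.Chunk)
		}
	}
	return files
}
```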