Paper digest - The Google File System
Motivation
- A scalable distributed file system for large distributed data-intensive applications, whose design is driven by observations of application workloads and the technological environment, both current and anticipated:
- Component failures are the norm rather than the exception.
- Files are huge by traditional standards. Multi-GB files are common.
- Workloads: large streaming reads and small random reads; many large, sequential writes.
Goals
- Provides fault tolerance while running on inexpensive commodity hardware.
- Delivers high aggregate performance to a large number of clients.
- Must efficiently implement well-defined semantics for multiple clients that concurrently append to the same file; atomicity with minimal synchronization overhead is essential.
- Prefers bandwidth over latency (batch processing, not real time).
- Consistency model: co-design the applications and the file system for flexibility (in other words, push the hard problems elsewhere).
System Design
Master-Slave architecture.
- GFS Master:
- Management data storage: metadata such as the namespace, access control information, mappings from files to chunks, and current chunk locations.
- System-wide activities: chunk lease management, garbage collection of orphaned chunks, and chunk migration between chunk servers.
- The system is designed to minimize the master's involvement, since the master is a single point of failure (SPOF) and a potential bottleneck.
- GFS Slave (aka chunk server):
- Each chunk is identified by a globally unique 64-bit chunk handle, stored as a plain file on the local Linux file system, and replicated (3 copies by default).
- Does not cache file data (relies on Linux's buffer cache).
- GFS Client:
- Interacts with the master for metadata operations.
- Actual data I/O does not go through the master; it flows directly between the client and the chunk servers (see the read-path sketch after this list).
- Does not cache file data (files are too large to cache, and cache coherence is a hard problem, so GFS simply does not solve it).
- No client caching also suits the main GFS applications, MapReduce and Bigtable: MapReduce does sequential reads with little to gain from caching, and Bigtable has its own cache management.
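A minimal sketch of the client read path implied by the points above. The RPC helpers (`find_chunk`, `read_chunk`) and the replica-picking policy are my own assumptions, since the real GFS client library is not public:

```python
# Hypothetical sketch of a GFS-style read: metadata from the master,
# file data directly from a chunk server. All RPC names are assumed.
CHUNK_SIZE = 64 * 1024 * 1024  # GFS uses 64 MB chunks


def pick_replica(replicas):
    # Placeholder policy: a real client prefers the "closest" replica
    # (same host or rack); here we just take the first one.
    return replicas[0]


def gfs_read(master, filename, offset, length):
    """Read `length` bytes of `filename` starting at byte `offset`."""
    chunk_index = offset // CHUNK_SIZE      # which chunk holds the offset
    chunk_offset = offset % CHUNK_SIZE      # offset inside that chunk

    # 1) Metadata operation via the master: (file name, chunk index) ->
    #    (chunk handle, replica locations). Clients cache this mapping so
    #    the master stays off the data path.
    handle, replicas = master.find_chunk(filename, chunk_index)

    # 2) Data operation directly against a chunk server; the master never
    #    sees the file data itself.
    chunkserver = pick_replica(replicas)
    return chunkserver.read_chunk(handle, chunk_offset, length)
```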
Lease Management
- Lease is used to maintain a consistent mutation order across replicas.
- Master grants chunk lease to one replica (primary).
- The primary picks a serial order for all mutations to the chunk. All replicas follow this order when applying mutations.
- A lease has an initial timeout of 60 seconds. It can be extended at the request of the primary chunk server, or revoked by the master (e.g., to disable mutations on a file that is being renamed); see the sketch below.
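A rough sketch of the lease bookkeeping described above. The 60-second timeout comes from the paper; the class and method names are my own assumptions:

```python
import time

LEASE_TIMEOUT_SECS = 60  # initial lease duration from the paper


class ChunkLease:
    """Master-side record of the lease granted for one chunk."""

    def __init__(self, primary_replica):
        self.primary = primary_replica
        self.expires_at = time.time() + LEASE_TIMEOUT_SECS

    def is_valid(self):
        return time.time() < self.expires_at

    def extend(self):
        # In GFS, extension requests and grants are piggybacked on the
        # regular heartbeat messages between master and primary.
        self.expires_at = time.time() + LEASE_TIMEOUT_SECS


class PrimaryReplica:
    """Primary chunk server: picks one serial order for all mutations."""

    def __init__(self):
        self.next_serial = 0

    def assign_serial(self, mutation):
        # Every replica applies mutations in this serial order; that is
        # what keeps the replicas' contents consistent with each other.
        serial = self.next_serial
        self.next_serial += 1
        return serial, mutation
```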
Consistency Model
- Relaxed consistency model.
- File namespace mutations (e.g., file creation) are atomic, handled by master exclusively.
- Data mutation case: consistent but undefined (e.g., concurrent successful mutations; all clients see the same data, but it may be mingled fragments from multiple writes).
- Data mutation case: inconsistent and undefined (e.g., a failed mutation; different clients may see different data).
- Record append (even concurrent): at-least-once semantics, which implies there can be padding and duplicate records in between that readers need to deal with.
- Readers deal with this by validating records (through checksums) and identifying them (e.g., via unique record IDs); see the reader sketch after this list.
- Overall, the relaxed consistency model (instead of strong consistency) makes application development a bit more difficult.
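A sketch of the reader-side discipline these points imply: records are written in a self-validating format, and readers skip padding and drop duplicates. The record layout here (length + CRC32 + record id) is my own assumption, not something the paper specifies:

```python
import struct
import zlib

# Assumed record layout: 4-byte length, 4-byte CRC32, 8-byte record id, payload.
HEADER = struct.Struct("<IIQ")


def encode_record(record_id, payload):
    """Writer side: make each record self-validating and self-identifying."""
    return HEADER.pack(len(payload), zlib.crc32(payload), record_id) + payload


def read_valid_records(data):
    """Yield unique, checksum-valid records; skip padding and duplicates."""
    seen = set()
    pos = 0
    while pos + HEADER.size <= len(data):
        length, crc, record_id = HEADER.unpack_from(data, pos)
        payload = data[pos + HEADER.size: pos + HEADER.size + length]
        if length > 0 and len(payload) == length and zlib.crc32(payload) == crc:
            if record_id not in seen:        # duplicate from a retried append
                seen.add(record_id)
                yield payload
            pos += HEADER.size + length
        else:
            pos += 1                         # padding or garbage: resynchronize
```

Writers would draw `record_id` from something unique per logical record (e.g., a writer id plus a sequence number) so that duplicates from retried appends can be filtered out.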
Append
- Pipelined data push: data is pushed linearly along a chain of chunk servers so that each machine's full outbound bandwidth is used.
- Data flow and control flow are decoupled: data is first pushed to all replicas, then the client sends the write/append request to the primary, which orders the mutation and forwards the request to the secondaries (see the sketch below).
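A simplified sketch of the two decoupled flows for a single write, with toy in-memory replicas (the real protocol streams data over TCP and starts forwarding before the whole buffer has arrived):

```python
# Toy model of a GFS-style write: data flow first (buffer at every replica),
# control flow second (primary applies and tells secondaries to follow).

class Replica:
    def __init__(self, name):
        self.name = name
        self.buffered = {}          # write_id -> data pushed but not yet applied
        self.chunk = bytearray()    # stand-in for the chunk's contents

    def push(self, write_id, data):
        # Data flow: only buffer the bytes; nothing is applied yet.
        self.buffered[write_id] = data

    def apply(self, write_id):
        # Control flow: apply previously pushed data.
        self.chunk += self.buffered.pop(write_id)


def gfs_write(data, write_id, replica_chain, primary, secondaries):
    # 1) Data flow: push the data linearly along the chain
    #    (client -> R1 -> R2 -> R3) so each machine's full outbound
    #    bandwidth is used instead of fanning out from the client.
    for replica in replica_chain:
        replica.push(write_id, data)

    # 2) Control flow: once every replica holds the data, the client sends
    #    the write request to the primary; the primary applies it and
    #    forwards the request to the secondaries, which apply it in the
    #    same order.
    primary.apply(write_id)
    for s in secondaries:
        s.apply(write_id)


# Example: three replicas, the first one holds the lease (primary).
r1, r2, r3 = Replica("r1"), Replica("r2"), Replica("r3")
gfs_write(b"hello", write_id=1, replica_chain=[r1, r2, r3],
          primary=r1, secondaries=[r2, r3])
```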
Fault Tolerance
- Replicated master with checkpoints of its state (**Q: how is replication done?** Per the paper, the operation log and checkpoints are replicated on multiple machines, and a mutation is committed only after its log record is flushed locally and on the replicas; shadow masters also provide read-only access). See the sketch below.
- Replicated chunk servers (3 replicas per chunk by default); the paper mentions parity or erasure codes as a possible option for read-mostly storage.
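A sketch of the checkpoint-plus-operation-log recovery idea for the master. The file formats and operation names here are placeholders, not GFS's actual formats:

```python
import json


class MasterState:
    """Toy stand-in for the master's metadata, rebuilt at startup."""

    def __init__(self):
        self.files = {}   # filename -> list of chunk handles

    def apply(self, op):
        # Only namespace / chunk-mapping mutations are logged; chunk
        # *locations* are not persisted -- the master re-learns them by
        # polling chunk servers when it starts.
        if op["type"] == "create":
            self.files[op["name"]] = []
        elif op["type"] == "add_chunk":
            self.files[op["name"]].append(op["handle"])


def recover(checkpoint_path, log_path):
    state = MasterState()
    with open(checkpoint_path) as f:
        state.files = json.load(f)              # load the latest checkpoint
    with open(log_path) as f:
        for line in f:
            state.apply(json.loads(line))       # replay ops logged after it
    return state
```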