Paper digest - Megastore: Providing Scalable, Highly Available Storage for Interactive Services
Abstract
- Blends the scalability of a NoSQL datastore with the convenience of a traditional RDBMS.
- Provides both strong consistency guarantees and high availability.
- Provides fully serializable ACID semantics within fine-grained partitions of data.
Motivation
- Online services are forcing the storage community to meet new demands.
- Highly scalable.
- Rapid development.
- Low latency.
- Consistent view of data.
- Highly available.
Design Highlights
- Partition the data; the unit of replication is a partition.
- Full ACID semantics within partitions only.
- Across partitions, only limited consistency guarantees (Spanner later improves on this).
- Provides database features such as secondary indexes, but only those that can scale within user-tolerable latency limits.
- Innovative use of Paxos: replicates user data directly, instead of the traditional uses (locking, leader election, replicating configuration data, e.g. ZooKeeper/ZAB).
Availability with Scalability
- Replication
- Replicate a write-ahead log over a group of symmetric peers.
- Any node can initiate reads and writes.
- Each log append blocks on acknowledgments from a majority of replicas.
- Optimizations on Paxos
- Allow local reads.
- Single roundtrip writes.
- Multiple replicated logs (one per entity group) instead of a single log, for better availability and throughput.
- Key Concept - Entity Groups
- An entity group is a collection of related data and the unit of replication.
- Within a single entity group: full ACID semantics.
- Across entity groups:
- Two-phase commit (stronger consistency, but more expensive).
- Or asynchronous communication through message queues.
- Physical Layout of data
- Uses Bigtable for data storage.
- Lets applications control the placement of data.
- Applications try to keep data near users and replicas near each other.
- The data for an entity group are held in contiguous ranges of Bigtable rows (see the key-layout sketch below).
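As a rough illustration of the contiguous layout, the sketch below encodes Bigtable row keys so that an entity group's root row and all of its child rows share a key prefix and therefore sort together. The `encode_key` helper, table names, and separator byte are hypothetical, not Megastore's actual key format.

```python
# Hypothetical key encoding: all rows of one entity group share the root
# key as a prefix, so a single Bigtable prefix scan returns the whole group.

def encode_key(*parts: object) -> bytes:
    """Join key components with a separator byte so related keys share a prefix."""
    return b"\x00".join(str(p).encode() for p in parts)

# Root entity of the group: User(user_id=42).
root_key = encode_key("User", 42)

# Child rows and the group's replicated log entries live under the same prefix.
photo_key = encode_key("User", 42, "Photo", 7)
log_key = encode_key("User", 42, "_log", 10001)

assert photo_key.startswith(root_key) and log_key.startswith(root_key)
```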
API Design
- Cost-transparent APIs: runtime costs that match application developers’ intuitions.
- Eliminates the need for most joins by giving applications fine-grained control over physical locality.
- When a join is required, it is implemented in application code (see the sketch below).
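Since the store offers no join operator, a join looks roughly like the sketch below: fetch from the parent and child tables separately and merge in application code. The `db.query` client API and the table names are assumptions for illustration only.

```python
# Hypothetical application-level join: one lookup per table, merged in code.

def photos_with_owner_name(db, user_id):
    user = db.query("User", key=user_id)            # parent row lookup
    photos = db.query("Photo", user_id=user_id)     # child rows, same entity group
    # "Join" by attaching the parent attribute to each child row.
    return [{"photo_id": p["photo_id"],
             "url": p["full_url"],
             "owner_name": user["name"]}
            for p in photos]
```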
Data Model
- Lies between the abstract tuples of an RDBMS and the concrete row-column storage of NoSQL.
- The data model is declared in a schema and is strongly typed (roughly sketched below).
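A rough Python rendering of such a schema is below, loosely modeled on the paper's photo-app example: a strongly typed root table defines the entity group, a child table is keyed by the root's key, and local/global indexes are declared alongside. The paper uses its own schema language; this structure is only illustrative.

```python
# Illustrative, strongly typed schema: root table = entity group root,
# child table keyed by the root's primary key (loosely after the paper's
# photo-app example; the real schema language differs).
from dataclasses import dataclass
from typing import List

@dataclass
class User:               # entity group root
    user_id: int          # primary key
    name: str

@dataclass
class Photo:              # child table, stored in the User entity group
    user_id: int          # entity group key, references User
    photo_id: int         # primary key = (user_id, photo_id)
    time: int
    full_url: str
    tags: List[str]

# Declared indexes: local indexes are scoped to one entity group,
# global indexes span groups (and are only weakly consistent).
LOCAL_INDEXES = {"PhotosByTime": ("Photo", ["user_id", "time"])}
GLOBAL_INDEXES = {"PhotosByTag": ("Photo", ["tags"])}
```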
Transactions and Concurrency Control
- Uses MVCC (multiversion concurrency control): readers and writers don't block each other, and reads are isolated from writes for the duration of a transaction.
- Read semantics (contrasted in the sketch below):
- current (single entity group): waits for all committed writes to the group to be applied, then reads the latest value.
- snapshot (single entity group): reads at the timestamp of the last fully applied transaction, ignoring later committed-but-unapplied writes.
- inconsistent: ignores the state of the log and may return stale values; cheapest.
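The sketch below contrasts the three read types against a replica whose local state may lag the replicated log; the `replica` object and its methods are assumed for illustration.

```python
# Hypothetical contrast of the three read types against a possibly lagging replica.

def read(replica, group, key, mode="current"):
    if mode == "current":
        # Apply all committed-but-unapplied log entries for this entity
        # group first, then read the latest value.
        replica.catch_up(group)                     # may block / fetch from peers
        return replica.get(group, key)
    if mode == "snapshot":
        # Read at the timestamp of the last fully applied transaction,
        # ignoring later committed-but-unapplied writes.
        ts = replica.last_applied_timestamp(group)
        return replica.get(group, key, at=ts)
    if mode == "inconsistent":
        # Ignore the log state entirely: cheapest, possibly stale.
        return replica.get(group, key)
    raise ValueError(f"unknown read mode: {mode}")
```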
- Queues
- Deliver batched operations across entity groups (transactional messaging; see the sketch below).
- Two-phase commit is also available for atomic updates across entity groups, though the paper discourages it in favor of queues.
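The queue pattern looks roughly like the sketch below: the enqueue is part of the sender group's transaction, and delivery happens later inside a transaction on the receiving group, so each side stays ACID within its own entity group. The transaction/queue API and the tag-moving example are hypothetical.

```python
# Hypothetical cross-group update via a transactional queue message.

def move_tag(db, src_user, dst_user, photo_id, tag):
    # Transaction on the source entity group: the local mutation and the
    # enqueued message commit (or abort) together.
    with db.transaction(group=("User", src_user)) as txn:
        txn.update("Photo", key=(src_user, photo_id), remove_tag=tag)
        txn.enqueue(to_group=("User", dst_user),
                    message={"op": "add_tag", "photo_id": photo_id, "tag": tag})

def deliver(db, dst_user, message):
    # Later, a separate transaction on the destination group consumes it.
    with db.transaction(group=("User", dst_user)) as txn:
        txn.update("Photo", key=(dst_user, message["photo_id"]),
                   add_tag=message["tag"])
        txn.ack(message)
```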
Paxos based Replication (key innovations)
- Pitfalls of master-slave Paxos:
- Master failover can require a complicated state machine.
- A series of timers must elapse before service is restored.
- It is difficult to avoid user-visible outages.
- Fast Reads
- Local read.
- A per-datacenter coordinator tracks whether the local replica has observed the full sequence of writes for each entity group, so current reads can be served locally (see the sketch below).
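A minimal sketch of that read path, assuming `coordinator`, `local_replica`, and peer replica objects with the listed methods (all hypothetical names):

```python
# Hypothetical fast-read path: consult the local coordinator first, and only
# pay an inter-replica round trip when the local replica might be behind.

def current_read(coordinator, local_replica, peers, group, key):
    if coordinator.is_up_to_date(group):
        # Common case: the local replica has applied every write for this
        # entity group, so the read is served entirely locally.
        return local_replica.get(group, key)

    # Slow path: learn the highest committed log position from other
    # replicas (the paper reads this from a majority), catch up the local
    # replica, then tell the coordinator the group is current again.
    highest = max(p.highest_committed_position(group) for p in peers)
    local_replica.apply_log_up_to(group, highest)
    coordinator.mark_up_to_date(group)
    return local_replica.get(group, key)
```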
- Fast Writes
- Single round trip writes.
- Pre-preparing optimization (adapted from master-based systems): each successful write includes an implied prepare message granting the right to issue accept messages for the next log position.
- If the write succeeds, the prepares are honored, and the next write skips directly to the accept phase.
- Unlike Multi-Paxos, which relies on a long-lived master, Megastore runs an independent instance of the Paxos algorithm for each log position, with a leader chosen per position (a single-roundtrip write sketch follows).
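A minimal sketch of that write path: try a single accept round at proposal number 0 (legitimate because of the implied prepare), and fall back to a full prepare/accept round only if another writer intervened. The replica API and the `full_paxos_round` fallback helper are assumptions.

```python
# Hypothetical single-roundtrip write using the implied prepare from the
# previous successful write on this entity group's log.

def fast_write(replicas, group, next_pos, value):
    # Optimistic accept at proposal number 0: valid only because the
    # previous write implicitly prepared this log position.
    acks = sum(bool(r.accept(group, next_pos, proposal=0, value=value))
               for r in replicas)
    if acks > len(replicas) // 2:
        return True          # committed with one round trip to a majority

    # Someone else got there first (or the grant was lost): fall back to an
    # ordinary Paxos round with a higher proposal number.
    return full_paxos_round(replicas, group, next_pos, value)  # assumed helper
```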
- Replica Types
- Full replica: contains all entity and index data and can serve current reads.
- Witness replica: votes in Paxos and stores the write-ahead log, but does not apply it and stores no entity data (loosely comparable to a ZooKeeper observer, though not the same: a ZK observer does not vote).
- Read-only replica: does not vote, but holds a full snapshot of the data at some point in the recent past (closer to a ZK observer, though a ZK observer's state is more up to date).
Coordinator
- The coordinator is a simple process with no external dependencies and no persistent storage.
- Failure Detection
- Uses an out-of-band protocol to identify when other coordinators are up, healthy, and generally reachable.
- Coordinators obtain specific Chubby locks in remote datacenters at startup.
- A coordinator reverts its state to a conservative default (treating all entity groups as out of date) when it loses a majority of its Chubby locks, e.g. due to a crash or network partition.
- Risk of brief and rare outages when a coordinator fails or becomes unreachable (a sketch of the conservative check follows).
- The liveness protocol is vulnerable to asymmetric network partitions.
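A minimal sketch of the coordinator's conservative behaviour, matching the `is_up_to_date` / `mark_up_to_date` calls used in the fast-read sketch above (the Chubby lock and state APIs are assumed):

```python
# Hypothetical coordinator: answers "up to date" only while it still holds a
# majority of its Chubby locks; otherwise it falls back to the safe default
# of treating every entity group as out of date.

class Coordinator:
    def __init__(self, chubby_locks):
        self.locks = chubby_locks         # obtained at startup
        self.up_to_date_groups = set()    # in memory only: no persistent storage

    def _has_lock_majority(self):
        held = sum(1 for lock in self.locks if lock.still_held())
        return held > len(self.locks) // 2

    def is_up_to_date(self, group):
        if not self._has_lock_majority():
            # Crash or partition: revert to the conservative default, which
            # forces readers onto the majority-read path.
            self.up_to_date_groups.clear()
            return False
        return group in self.up_to_date_groups

    def mark_up_to_date(self, group):
        if self._has_lock_majority():
            self.up_to_date_groups.add(group)

    def invalidate(self, group):
        # Writers invalidate a group here before a write is considered
        # committed, if the local replica did not accept that write.
        self.up_to_date_groups.discard(group)
```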