Paper digest: Consensus in the Cloud: Paxos Systems Demystified
Abstract
- Compared several popular Paxos protocols and Paxos systems and presented the advantages and disadvantages of each.
- Categorized the coordination use-patterns in the cloud, and examined Google and Facebook infrastructures as well as uses of Paxos in Apache projects.
- Analyzed tradeoffs in the distributed coordination domain and identified promising future directions.
Introduction
- Coordination plays a major role in cloud computing systems:
- Leader election
- Group membership
- Cluster management
- Service discovery
- Resource/access management
- Consistent replication of the master nodes in services
- Barrier-orchestration when running large analytic tasks
- And so on..
- The coordination problem has been studied extensively under the name “distributed consensus”.
- Paxos rose to fame with Google Chubby (the lock service for GFS).
- Apache ZooKeeper as a coordination kernel
- Its filesystem abstraction is easy to use.
- Often abused/misused, it frequently became the performance bottleneck of applications and caused scalability problems.
- There are more choices nowadays (Raft, etc.), but confusion remains about:
- Proper use cases.
- Which systems are more suitable for which tasks:
- Paxos protocols (e.g. Multi-Paxos, Raft, ZAB) are useful as low-level components for server replication.
- Paxos systems (e.g. ZK) are useful for highly-available/durable metadata management, with the constraints that all metadata fit in main memory and are not subject to very frequent changes.
Paxos Protocols
- Protocols
- Paxos: fault-tolerant consensus for a single value.
- Multi-Paxos: fault-tolerant consensus for multiple values.
- ZAB: additional constraints on Paxos (primary global order).
- Raft: very similar to ZAB, with some implementation-level differences.
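For the single-value Paxos above, a minimal acceptor-side sketch in Go (illustrative names, not from the paper). It promises in phase 1 and accepts in phase 2 only if no higher ballot has been promised in between:

```go
// Minimal sketch of a single-decree Paxos acceptor (illustrative names).
package paxos

type Ballot int64

type Acceptor struct {
	promised Ballot // highest ballot promised in phase 1
	accepted Ballot // ballot of the accepted value, 0 if none
	value    []byte // accepted value, nil if none
}

// Prepare handles a phase-1a message. It returns a promise plus any
// previously accepted (ballot, value) pair the proposer must respect.
func (a *Acceptor) Prepare(b Ballot) (ok bool, prevBallot Ballot, prevValue []byte) {
	if b <= a.promised {
		return false, 0, nil // already promised a higher (or equal) ballot
	}
	a.promised = b
	return true, a.accepted, a.value
}

// Accept handles a phase-2a message. The value is accepted only if the
// acceptor has not promised a higher ballot in the meantime.
func (a *Acceptor) Accept(b Ballot, v []byte) bool {
	if b < a.promised {
		return false
	}
	a.promised, a.accepted, a.value = b, b, v
	return true
}
```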
- Differences
- Leader Election:
- Zab and Raft differ from Paxos in that they divide execution into phases (called epochs in Zab and terms in Raft).
- Each epoch begins with a new election, goes into the broadcast phase and ends with a leader failure.
- The phases are sequential because of the additional safety properties provided by the isLeader predicate.
- In Zab and Raft there can be at most one leader at any time (see the voting sketch below).
- Paxos allows multiple leaders to coexist.
- Zab has discovery and sync phases; Raft does not, which simplifies the algorithm but possibly makes recovery longer.
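A sketch (Go, illustrative names) of the voter-side rule that gives Zab and Raft at most one leader per epoch/term: a node grants at most one vote per term, so two candidates cannot both collect a quorum for the same term. Raft's additional log up-to-dateness check on votes is omitted here.

```go
package election

type Node struct {
	currentTerm int64
	votedFor    string // "" means no vote granted yet in currentTerm
}

// RequestVote: step up to a newer term if needed, then grant the vote only
// if no other candidate has already received it in this term. A candidate
// that collects a quorum of such grants is the unique leader of the term,
// which is what the isLeader predicate captures.
func (n *Node) RequestVote(term int64, candidate string) bool {
	if term < n.currentTerm {
		return false // stale candidate from an older term
	}
	if term > n.currentTerm {
		n.currentTerm, n.votedFor = term, ""
	}
	if n.votedFor == "" || n.votedFor == candidate {
		n.votedFor = candidate
		return true
	}
	return false
}
```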
- Communication:
- ZAB uses a messaging model. Each update requires at least three messages (see the sketch below):
- proposal, ack, and commit.
- Raft uses RPCs.
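A sketch (Go, illustrative message and field names) of Zab's three-message broadcast from the leader's side: send a proposal, count follower acks, and send commit once a quorum has logged the update.

```go
package zab

type Proposal struct {
	Epoch int64  // current epoch
	Zxid  int64  // per-epoch counter; (Epoch, Zxid) totally orders updates
	Data  []byte // the state-machine update being replicated
}

type Leader struct {
	quorum int                           // majority size, e.g. 2 of 3
	acks   map[int64]int                 // zxid -> acks received so far
	send   func(kind string, p Proposal) // transport stub: "proposal"/"commit"
}

func NewLeader(quorum int, send func(kind string, p Proposal)) *Leader {
	return &Leader{quorum: quorum, acks: make(map[int64]int), send: send}
}

// Broadcast sends the proposal to the followers, who log it and reply with an ack.
func (l *Leader) Broadcast(p Proposal) { l.send("proposal", p) }

// OnAck counts acks for the proposal and, once a quorum has acknowledged it,
// sends the commit that lets followers deliver the update.
func (l *Leader) OnAck(p Proposal) {
	l.acks[p.Zxid]++
	if l.acks[p.Zxid] == l.quorum {
		l.send("commit", p)
	}
}
```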
- Dynamic reconfiguration:
- Reconfiguration is just another command that goes through the consensus process.
- To ensure safety, the new configuration cannot be activated immediately, and configuration changes must go through two phases.
- A process can only propose commands for slots with a known configuration: ∀ρ : ρ.slot_in < ρ.slot_out + WINDOW (see the window-check sketch below).
- With the primary-order property provided by Zab and Raft, both protocols can implement their reconfiguration algorithms without restricting normal operations or relying on external services.
- Both Zab and Raft include a pre-phase where new processes join the cluster as non-voting members, so that the leader in C_old can initialize their state by transferring the currently committed prefix of updates.
- In Zab, between the time the new configuration is proposed and the time it is committed, any command received after the reconfiguration is only scheduled but not committed.
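A small Go sketch of the slot-window rule quoted above; the WINDOW constant and field names are illustrative:

```go
package reconfig

const WINDOW = 5 // how far ahead of decided slots a replica may propose

type Replica struct {
	slotIn  int // next slot to assign a proposed command to
	slotOut int // next slot waiting to be decided/executed
}

// canPropose enforces slot_in < slot_out + WINDOW: a reconfiguration decided
// in slot s takes effect WINDOW slots later, and no command can have been
// proposed under the old configuration beyond that point.
func (r *Replica) canPropose() bool {
	return r.slotIn < r.slotOut+WINDOW
}
```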
- Paxos Extensions
- Generalized Paxos: allows acceptors to vote for independent commands.
- EPaxos:
- allows nodes to commit conflict-free commands by checking the command dependency list (see the sketch below).
- adds significant complexity and extra effort to resolve conflicts when concurrent commands do not commute.
- From an engineer’s perspective, the sketch-level algorithm descriptions in the literature are often underspecified and lead to divergent interpretations and implementations.
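A small Go sketch of the interference check behind Generalized Paxos and EPaxos, under the illustrative assumption that commands conflict when they touch the same key:

```go
package epaxos

type Command struct {
	ID  int64
	Key string
}

// conflicts reports whether two commands interfere (here: same key).
func conflicts(a, b Command) bool { return a.Key == b.Key }

// dependencies collects the IDs of previously seen commands that the new
// command must be ordered after; an empty list means it is conflict-free
// and can be committed without extra ordering work.
func dependencies(newCmd Command, seen []Command) []int64 {
	var deps []int64
	for _, c := range seen {
		if conflicts(newCmd, c) {
			deps = append(deps, c.ID)
		}
	}
	return deps
}
```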
Paxos Systems
- Chubby, Apache ZooKeeper, etcd
- All three services hide the replicated state machine and log abstractions under a small data store with a filesystem-like API (a client sketch follows after this list).
- All three support watchers.
- All three support ephemeral storage.
- All three support observers (non-voting quorum peers).
- Difference
- ZK provides client FIFO order.
- etcd is stateless (note: not entirely true since etcd 3.0’s leases, which are similar to ZK sessions).
- etcd has TTLs (note: ZK also supports TTL nodes since 3.5.3).
- etcd keeps hidden historical versions of data (MVCC).
- ZK has weighted quorums.
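A usage sketch of the shared filesystem-like API, watchers, and ephemeral storage, assuming the go-zookeeper client (github.com/go-zookeeper/zk); the server address, paths, and pre-existing parent znodes are placeholders:

```go
package main

import (
	"fmt"
	"time"

	"github.com/go-zookeeper/zk"
)

func main() {
	conn, _, err := zk.Connect([]string{"127.0.0.1:2181"}, 5*time.Second)
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	// Ephemeral node: removed automatically when this session expires,
	// which is what makes membership and leader-election recipes work.
	_, err = conn.Create("/services/worker-1", []byte("10.0.0.5:8080"),
		zk.FlagEphemeral, zk.WorldACL(zk.PermAll))
	if err != nil {
		panic(err)
	}

	// Watcher: GetW returns the current value plus a one-shot channel that
	// fires on the next change to /config/db, so clients react instead of polling.
	data, _, events, err := conn.GetW("/config/db")
	if err != nil {
		panic(err)
	}
	fmt.Printf("config = %s\n", data)
	ev := <-events
	fmt.Printf("config changed: %v\n", ev.Type)
}
```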
- Proper use criteria for Paxos systems
- Paxos system should not be in the performance critical path of the application.
- The frequency of write operations to the Paxos system should be kept low (see the caching sketch after this list).
- The amount of data maintained in the Paxos system should be kept small.
- An application adopting the Paxos system should genuinely require strong consistency.
- An application adopting the Paxos system should not be distributed over the Wide Area Network (WAN).
- The API abstraction should fit the goal.
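A sketch of keeping the coordination service off the critical path: serve reads from a local cache and contact ZooKeeper only when a watch fires. It assumes the same go-zookeeper client as above; the path and types are illustrative.

```go
package config

import (
	"sync/atomic"

	"github.com/go-zookeeper/zk"
)

type Cache struct{ val atomic.Value }

// Get is what the request path calls: no ZooKeeper round trip involved.
func (c *Cache) Get() []byte { v, _ := c.val.Load().([]byte); return v }

// Follow re-reads the znode only when its one-shot watch fires, so traffic
// to the Paxos system is proportional to configuration changes, not to
// application throughput.
func (c *Cache) Follow(conn *zk.Conn, path string) error {
	for {
		data, _, events, err := conn.GetW(path)
		if err != nil {
			return err
		}
		c.val.Store(data)
		<-events // block until the config znode changes
	}
}
```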
Paxos Usage Patterns
- Server Replication (SR)
- Log Replication (LR)
- Synchronization Service (SS).
- Barrier Orchestration (BO).
- Configuration Management.
- Message Queues (Q).