Paper Digest - Consensus in the Cloud - Paxos Systems Demystified

Abstract

  • Compared several popular Paxos protocols and Paxos systems, and presented the advantages and disadvantages of each.
  • Categorized the coordination use patterns in the cloud, and examined the Google and Facebook infrastructures as well as uses of Paxos in Apache projects.
  • Analyzed tradeoffs in the distributed coordination domain and identified promising future directions.

Introduction

  • Coordination plays a major role in cloud computing systems:
    • Leader election
    • Group membership
    • Cluster management
    • Service discovery
    • Resource/access management
    • Consistent replication of the master nodes in services
    • Barrier-orchestration when running large analytic tasks
    • And so on.
  • The coordination problem has been studied extensively under the name “distributed consensus”.
  • Paxos rose to fame with Google Chubby (a lock service for GFS).
  • Apache ZooKeeper as a coordination kernel
    • Its filesystem abstraction is easy to use.
    • But it is often abused/misused, frequently becoming the performance bottleneck of the applications that use it and causing scalability problems.
  • More choices exist nowadays (Raft, etc.), but confusion remains about:
    • Proper use cases.
    • Which systems are more suitable for which tasks:
      • Paxos protocols (e.g., Multi-Paxos, Raft, Zab) are useful as low-level components for server replication.
      • Paxos systems (e.g., ZooKeeper) are useful for highly-available/durable metadata management, under the constraints that all metadata fit in main memory and are not subject to very frequent changes.

Paxos Protocols

  • Protocols
    • Paxos: fault-tolerant consensus for a single value.
    • Multi-Paxos: fault-tolerant consensus for multiple values.
    • Zab: additional constraints on Paxos (the primary order property).
    • Raft: very similar to Zab, with some implementation-level differences.
  • Differences
    • Leader Election:
      • Zab and Raft protocols differ from Paxos as they divide execution into phases (called epochs in Zab and terms in Raft).
      • Each epoch begins with a new election, goes into the broadcast phase and ends with a leader failure.
      • The phases are sequential because of the additional safety properties provided by the isLeader predicate.
      • In Zab and Raft there can be at most one leader at any time.
      • In Paxos, multiple leaders can coexist.
    • Zab has discovery and synchronization phases; Raft does not, which simplifies the algorithm but possibly makes recovery longer.
    • Communication:
      • Zab uses a messaging model. Each update requires at least three message types (see the Go sketch after this list):
        • proposal, ack, and commit
      • Raft uses RPCs.
    • Dynamic reconfiguration:
      • Reconfiguration is just another command that goes through the consensus process.
      • To ensure safety, the new configuration cannot be activated immediately; configuration changes must go through two phases.
      • A process can only propose commands for slots whose configuration is known: ∀ρ : ρ.slot_in < ρ.slot_out + WINDOW.
      • With the primary order property provided by Zab and Raft, both protocols can implement their reconfiguration algorithms without restricting normal operation or relying on external services.
      • Both Zab and Raft include a pre-phase in which the new processes join the cluster as non-voting members, so that the leader in the old configuration (C_old) can initialize their state by transferring the currently committed prefix of updates.
      • In Zab, between the time a new configuration is proposed and the time it is committed, any command received after the reconfiguration command is only scheduled but not committed.
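
To make the broadcast and epoch/term mechanics concrete, here is a minimal Go sketch (my illustration, not the paper's algorithm): the leader stamps each proposal with its epoch/term, followers ack only proposals that are not from a stale leader, and the leader broadcasts commit once a quorum has acked. The type names, cluster size, and quorum arithmetic are illustrative assumptions.

```go
package main

import "fmt"

// Minimal sketch (not the paper's algorithm) of a Zab/Raft-style broadcast
// round: the leader stamps every proposal with its epoch (Zab) / term (Raft),
// followers ack only proposals that are not from a stale leader, and the
// leader commits once a quorum has acked.

type Proposal struct {
	Epoch int    // epoch/term of the proposing leader
	Slot  int    // position in the replicated log
	Cmd   string // the state-machine command being replicated
}

type Follower struct {
	epoch int              // highest epoch seen so far
	log   map[int]Proposal // accepted proposals, keyed by slot
}

// Ack models the "proposal -> ack" step: a follower rejects proposals from a
// stale leader, which is what keeps at most one leader active per epoch.
func (f *Follower) Ack(p Proposal) bool {
	if p.Epoch < f.epoch {
		return false // proposal from an old leader: ignore
	}
	f.epoch = p.Epoch
	if f.log == nil {
		f.log = make(map[int]Proposal)
	}
	f.log[p.Slot] = p
	return true
}

func main() {
	followers := []*Follower{{}, {}, {}, {}} // a 5-node cluster: leader + 4 followers
	quorum := (len(followers)+1)/2 + 1       // majority of 5 nodes = 3

	p := Proposal{Epoch: 7, Slot: 42, Cmd: "put x=1"}

	acks := 1 // the leader counts itself
	for _, f := range followers {
		if f.Ack(p) {
			acks++
		}
	}
	if acks >= quorum {
		fmt.Println("commit slot", p.Slot) // third message type: commit broadcast
	}
}
```

Rejecting lower-epoch proposals is what gives Zab and Raft at most one active leader per epoch/term, in contrast to plain Paxos, where multiple leaders can coexist and compete.
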
  • Paxos Extensions
    • Generalized Paxos: allows acceptors to vote for independent commands.
    • EPaxos:
      • allows nodes to commit conflict-free commands by checking the command dependency list (a rough sketch follows this list).
      • adds significant complexity and extra effort to resolve conflicts when concurrent commands do not commute.
      • from an engineer’s perspective, the sketch-level algorithm descriptions in the literature are often underspecified and lead to divergent interpretations and implementations.
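
Below is a rough Go sketch of the dependency-list idea (my illustration, not the actual EPaxos protocol): two commands interfere only if they touch the same key and at least one of them writes; non-interfering commands can be committed independently, while interfering ones record each other as dependencies. The `Cmd` type, key names, and IDs are made up.

```go
package main

import "fmt"

// Rough sketch of the EPaxos dependency-list idea (not the real protocol):
// commands that do not interfere can be committed independently; interfering
// commands record each other as dependencies and must be ordered.

type Cmd struct {
	ID    int    // command identifier
	Key   string // key the command touches
	Write bool   // true for writes, false for reads
}

// interferes is the conflict check: same key and at least one write.
func interferes(a, b Cmd) bool {
	return a.Key == b.Key && (a.Write || b.Write)
}

// deps returns the IDs of previously seen commands that the new command
// conflicts with, i.e. its dependency list.
func deps(newCmd Cmd, seen []Cmd) []int {
	var d []int
	for _, c := range seen {
		if interferes(newCmd, c) {
			d = append(d, c.ID)
		}
	}
	return d
}

func main() {
	seen := []Cmd{
		{ID: 1, Key: "x", Write: true},
		{ID: 2, Key: "y", Write: true},
	}
	fmt.Println(deps(Cmd{ID: 3, Key: "x", Write: false}, seen)) // [1]: must be ordered after command 1
	fmt.Println(deps(Cmd{ID: 4, Key: "z", Write: true}, seen))  // []: conflict-free, commits independently
}
```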

Paxos Systems

  • Chubby, Apache ZooKeeper, etcd
    • All three services hide the replicated state machine and log abstractions under a small data store with a filesystem-like API.
    • All three support watchers.
    • All three support ephemeral storage (an etcd client sketch illustrating these features follows this list).
    • All three support observers (non-voting quorum peers).
  • Difference
    • ZK provides client FIFO order.
    • etcd is stateless (note: not entirely true since etcd 3.0 introduced leases, which resemble ZooKeeper sessions).
    • etcd has TTLs (note: ZooKeeper also has TTL nodes since 3.5.3).
    • etcd keeps hidden historical data (MVCC).
    • ZK has weighted quorum.
  • Proper use criteria for Paxos systems
    • The Paxos system should not be in the performance-critical path of the application.
    • The frequency of write operations to the Paxos system should be kept low.
    • The amount of data maintained in the Paxos system should be kept small.
    • The application adopting the Paxos system should really require strong consistency.
    • The application adopting the Paxos system should not be distributed over the Wide Area Network (WAN).
    • The API abstraction should fit the goal.
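
To ground the API comparison, here is a hedged Go sketch using the etcd v3 client (the import path and options reflect etcd v3.5 and may differ in other versions; the endpoint, key names, and TTL are placeholders): attaching a key to a lease approximates a ZooKeeper ephemeral/TTL znode, and the watch mirrors the watcher feature all three systems provide.

```go
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// Hedged sketch of the etcd v3 client: a lease-attached key approximates a
// ZooKeeper ephemeral/TTL znode, and a watch approximates ZooKeeper watchers
// or Chubby events. Endpoint, key, and TTL values are placeholders.
func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ctx := context.Background()

	// Grant a 10-second lease; the key below disappears if the lease is not
	// kept alive, which is how "ephemeral" state is expressed in etcd.
	lease, err := cli.Grant(ctx, 10)
	if err != nil {
		panic(err)
	}
	if _, err := cli.Put(ctx, "/services/api/instance-1", "10.0.0.5:8080",
		clientv3.WithLease(lease.ID)); err != nil {
		panic(err)
	}

	// Watch the service prefix: comparable to setting a ZooKeeper watcher.
	for resp := range cli.Watch(ctx, "/services/api/", clientv3.WithPrefix()) {
		for _, ev := range resp.Events {
			fmt.Printf("%s %s = %s\n", ev.Type, ev.Kv.Key, ev.Kv.Value)
		}
	}
}
```

A ZooKeeper or Chubby client would express the same ideas through ephemeral znodes/file handles and watch registrations on its filesystem-like namespace.
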

Paxos Usage Patterns

  • Server Replication (SR)
  • Log Replication (LR)
  • Synchronization Service (SS).
  • Barrier Orchestration (BO); an illustrative sketch follows this list.
  • Configuration Management.
  • Message Queues (Q).
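
As an example of one pattern, barrier orchestration, here is a hedged Go sketch reusing the etcd v3 client from above (the key names, worker count, and polling interval are made up; a real implementation would also use leases and watches to handle failed workers): each worker registers itself under a barrier prefix and proceeds only once all expected workers have registered.

```go
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

const workers = 3 // illustrative barrier size

// enterBarrier registers this worker under the barrier prefix and blocks
// until all expected workers have registered. A sketch only: it polls
// instead of watching and does not handle worker failures or timeouts.
func enterBarrier(cli *clientv3.Client, id string) error {
	ctx := context.Background()
	if _, err := cli.Put(ctx, "/barrier/run-1/"+id, "ready"); err != nil {
		return err
	}
	for {
		resp, err := cli.Get(ctx, "/barrier/run-1/",
			clientv3.WithPrefix(), clientv3.WithCountOnly())
		if err != nil {
			return err
		}
		if resp.Count >= workers {
			return nil // everyone arrived; cross the barrier
		}
		time.Sleep(200 * time.Millisecond)
	}
}

func main() {
	cli, err := clientv3.New(clientv3.Config{Endpoints: []string{"localhost:2379"}})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	if err := enterBarrier(cli, "worker-1"); err != nil {
		panic(err)
	}
	fmt.Println("barrier passed, starting the analytic task")
}
```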

Paxos Usage in Production