Paper Digest - Consensus in the Cloud - Paxos Systems Demystified

Abstract

  • Compared several popular Paxos protocols and Paxos systems, and presented the advantages and disadvantages of each.
  • Categorized the coordination use patterns in the cloud, and examined the Google and Facebook infrastructures as well as uses of Paxos in Apache projects.
  • Analyzed tradeoffs in the distributed coordination domain and identified promising future directions.

Introduction

  • Coordination plays a major role in cloud computing systems:
    • Leader election
    • Group membership
    • Cluster management
    • Service discovery
    • Resource/access management
    • Consistent replication of the master nodes in services
    • Barrier-orchestration when running large analytic tasks
    • And so on.
  • The coordination problem has been studied extensively under the name “distributed consensus”.
  • Paxos rose to fame with Google Chubby (a lock service for GFS).
  • Apache ZooKeeper as a coordination kernel
    • Its filesystem abstraction is easy to use.
    • But it is often abused/misused, frequently becoming the performance bottleneck of the applications that use it and causing scalability problems.
  • More choices exist nowadays (Raft, etc.), but confusion remains about:
    • Proper use cases.
    • Which systems are more suitable for which tasks:
      • Paxos protocols (e.g., Multi-Paxos, Raft, Zab) are useful as low-level components for server replication.
      • Paxos systems (e.g., ZooKeeper) are useful for highly-available/durable metadata management, under the constraints that all metadata fit in main memory and are not subject to very frequent changes.

Paxos Protocols

  • Protocols
    • Paxos: fault-tolerant consensus for a single value.
    • Multi-Paxos: fault-tolerant consensus for multiple values.
    • Zab: additional constraints on Paxos (the primary order property).
    • Raft: very similar to Zab, with some implementation-level differences.
  • Differences
    • Leader Election:
      • Zab and Raft protocols differ from Paxos as they divide execution into phases (called epochs in Zab and terms in Raft).
      • Each epoch begins with a new election, goes into the broadcast phase and ends with a leader failure.
      • The phases are sequential because of the additional safety properties provided by the isLeader predicate.
      • In Zab and Raft there can be at most one leader at any time.
      • In Paxos, multiple leaders can coexist.
    • Zab has discovery and synchronization phases; Raft does not, which simplifies the algorithm but possibly makes recovery longer.
    • Communication:
      • Zab uses a messaging model. Each update requires at least three message types (see the Go sketch after this list):
        • proposal, ack, and commit
      • Raft uses RPCs.
    • Dynamic reconfiguration:
      • Reconfiguration is just another command that goes through the consensus process.
      • To ensure safety, the new configuration cannot be activated immediately; configuration changes must go through two phases.
      • A process can only propose commands for slots whose configuration is known: ∀ρ : ρ.slot_in < ρ.slot_out + WINDOW.
      • With the primary order property provided by Zab and Raft, both protocols can implement their reconfiguration algorithms without restricting normal operation or relying on external services.
      • Both Zab and Raft include a pre-phase in which the new processes join the cluster as non-voting members, so that the leader in the old configuration (C_old) can initialize their state by transferring the currently committed prefix of updates.
      • In Zab, between the time a new configuration is proposed and the time it is committed, any command received after the reconfiguration command is only scheduled but not committed.
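
To make the broadcast and epoch/term mechanics concrete, here is a minimal Go sketch (my illustration, not the paper's algorithm): the leader stamps each proposal with its epoch/term, followers ack only proposals that are not from a stale leader, and the leader broadcasts commit once a quorum has acked. The type names, cluster size, and quorum arithmetic are illustrative assumptions.

```go
package main

import "fmt"

// Minimal sketch (not the paper's algorithm) of a Zab/Raft-style broadcast
// round: the leader stamps every proposal with its epoch (Zab) / term (Raft),
// followers ack only proposals that are not from a stale leader, and the
// leader commits once a quorum has acked.

type Proposal struct {
	Epoch int    // epoch/term of the proposing leader
	Slot  int    // position in the replicated log
	Cmd   string // the state-machine command being replicated
}

type Follower struct {
	epoch int              // highest epoch seen so far
	log   map[int]Proposal // accepted proposals, keyed by slot
}

// Ack models the "proposal -> ack" step: a follower rejects proposals from a
// stale leader, which is what keeps at most one leader active per epoch.
func (f *Follower) Ack(p Proposal) bool {
	if p.Epoch < f.epoch {
		return false // proposal from an old leader: ignore
	}
	f.epoch = p.Epoch
	if f.log == nil {
		f.log = make(map[int]Proposal)
	}
	f.log[p.Slot] = p
	return true
}

func main() {
	followers := []*Follower{{}, {}, {}, {}} // a 5-node cluster: leader + 4 followers
	quorum := (len(followers)+1)/2 + 1       // majority of 5 nodes = 3

	p := Proposal{Epoch: 7, Slot: 42, Cmd: "put x=1"}

	acks := 1 // the leader counts itself
	for _, f := range followers {
		if f.Ack(p) {
			acks++
		}
	}
	if acks >= quorum {
		fmt.Println("commit slot", p.Slot) // third message type: commit broadcast
	}
}
```

Rejecting lower-epoch proposals is what gives Zab and Raft at most one active leader per epoch/term, in contrast to plain Paxos, where multiple leaders can coexist and compete.
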
  • Paxos Extensions
    • Generalized Paxos: allows acceptors to vote for independent commands.
    • EPaxos:
      • allows nodes to commit conflict-free commands by checking the command dependency list (a rough sketch follows this list).
      • adds significant complexity and extra effort to resolve conflicts when concurrent commands do not commute.
      • from an engineer’s perspective, the sketch-level algorithm descriptions in the literature are often underspecified and lead to divergent interpretations and implementations.
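
Below is a rough Go sketch of the dependency-list idea (my illustration, not the actual EPaxos protocol): two commands interfere only if they touch the same key and at least one of them writes; non-interfering commands can be committed independently, while interfering ones record each other as dependencies. The `Cmd` type, key names, and IDs are made up.

```go
package main

import "fmt"

// Rough sketch of the EPaxos dependency-list idea (not the real protocol):
// commands that do not interfere can be committed independently; interfering
// commands record each other as dependencies and must be ordered.

type Cmd struct {
	ID    int    // command identifier
	Key   string // key the command touches
	Write bool   // true for writes, false for reads
}

// interferes is the conflict check: same key and at least one write.
func interferes(a, b Cmd) bool {
	return a.Key == b.Key && (a.Write || b.Write)
}

// deps returns the IDs of previously seen commands that the new command
// conflicts with, i.e. its dependency list.
func deps(newCmd Cmd, seen []Cmd) []int {
	var d []int
	for _, c := range seen {
		if interferes(newCmd, c) {
			d = append(d, c.ID)
		}
	}
	return d
}

func main() {
	seen := []Cmd{
		{ID: 1, Key: "x", Write: true},
		{ID: 2, Key: "y", Write: true},
	}
	fmt.Println(deps(Cmd{ID: 3, Key: "x", Write: false}, seen)) // [1]: must be ordered after command 1
	fmt.Println(deps(Cmd{ID: 4, Key: "z", Write: true}, seen))  // []: conflict-free, commits independently
}
```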

Paxos Systems

  • Chubby, Apache ZooKeeper, etcd
    • All three services hide the replicated state machine and log abstractions under a small data store with a filesystem-like API.
    • All three support watchers.
    • All three support ephemeral storage (an etcd client sketch illustrating these features follows this list).
    • All three support observers (non-voting quorum peers).
  • Difference
    • ZK provides client FIFO order.
    • etcd is stateless (note: not entirely true since etcd 3.0 introduced leases, which resemble ZooKeeper sessions).
    • etcd has TTLs (note: ZooKeeper also has TTL nodes since 3.5.3).
    • etcd keeps hidden historical data (MVCC).
    • ZK has weighted quorum.
  • Proper use criteria for Paxos systems
    • The Paxos system should not be in the performance-critical path of the application.
    • The frequency of write operations to the Paxos system should be kept low.
    • The amount of data maintained in the Paxos system should be kept small.
    • The application adopting the Paxos system should really require strong consistency.
    • The application adopting the Paxos system should not be distributed over the Wide Area Network (WAN).
    • The API abstraction should fit the goal.
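
To ground the API comparison, here is a hedged Go sketch using the etcd v3 client (the import path and options reflect etcd v3.5 and may differ in other versions; the endpoint, key names, and TTL are placeholders): attaching a key to a lease approximates a ZooKeeper ephemeral/TTL znode, and the watch mirrors the watcher feature all three systems provide.

```go
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// Hedged sketch of the etcd v3 client: a lease-attached key approximates a
// ZooKeeper ephemeral/TTL znode, and a watch approximates ZooKeeper watchers
// or Chubby events. Endpoint, key, and TTL values are placeholders.
func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ctx := context.Background()

	// Grant a 10-second lease; the key below disappears if the lease is not
	// kept alive, which is how "ephemeral" state is expressed in etcd.
	lease, err := cli.Grant(ctx, 10)
	if err != nil {
		panic(err)
	}
	if _, err := cli.Put(ctx, "/services/api/instance-1", "10.0.0.5:8080",
		clientv3.WithLease(lease.ID)); err != nil {
		panic(err)
	}

	// Watch the service prefix: comparable to setting a ZooKeeper watcher.
	for resp := range cli.Watch(ctx, "/services/api/", clientv3.WithPrefix()) {
		for _, ev := range resp.Events {
			fmt.Printf("%s %s = %s\n", ev.Type, ev.Kv.Key, ev.Kv.Value)
		}
	}
}
```

A ZooKeeper or Chubby client would express the same ideas through ephemeral znodes/file handles and watch registrations on its filesystem-like namespace.
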

Paxos Usage Patterns

  • Server Replication (SR)
  • Log Replication (LR)
  • Synchronization Service (SS).
  • Barrier Orchestration (BO); an illustrative sketch follows this list.
  • Configuration Management.
  • Message Queues (Q).
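
As an example of one pattern, barrier orchestration, here is a hedged Go sketch reusing the etcd v3 client from above (the key names, worker count, and polling interval are made up; a real implementation would also use leases and watches to handle failed workers): each worker registers itself under a barrier prefix and proceeds only once all expected workers have registered.

```go
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

const workers = 3 // illustrative barrier size

// enterBarrier registers this worker under the barrier prefix and blocks
// until all expected workers have registered. A sketch only: it polls
// instead of watching and does not handle worker failures or timeouts.
func enterBarrier(cli *clientv3.Client, id string) error {
	ctx := context.Background()
	if _, err := cli.Put(ctx, "/barrier/run-1/"+id, "ready"); err != nil {
		return err
	}
	for {
		resp, err := cli.Get(ctx, "/barrier/run-1/",
			clientv3.WithPrefix(), clientv3.WithCountOnly())
		if err != nil {
			return err
		}
		if resp.Count >= workers {
			return nil // everyone arrived; cross the barrier
		}
		time.Sleep(200 * time.Millisecond)
	}
}

func main() {
	cli, err := clientv3.New(clientv3.Config{Endpoints: []string{"localhost:2379"}})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	if err := enterBarrier(cli, "worker-1"); err != nil {
		panic(err)
	}
	fmt.Println("barrier passed, starting the analytic task")
}
```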

Paxos Usage in Production