Dynamo, Amazon’s Highly Available Key-value Store

Paper digest - Dynamo: Amazon’s Highly Available Key-value Store

Motivation

  • A highly available key-value storage system that manages state of Amazon services.
  • Reliability at massive scale, provide an “always-on” experience.

Requirements

  • Query Model: simple read and write operations to objects (relatively small, usually less than 1MB) that is uniquely identified by a key.

  • ACID (Atomicity, Consistency, Isolation, Durability) : weak consistency, no isolation, only single key update.

  • Efficiency:

    • Function on a commodity hardware infrastructure.
    • Strict latency requirements and SLA.
    • Configurable to achieve customizable latency and throughput requirements.
  • Operation and Security:

    • No authentication and authorization (operating in non-hostile environment).
    • Scale up (each service uses its distinct instance of Dynamo).
  • SLA:

    • Traditional SLA metrics (average, median and expected variance) are not good enough to build a system where all customers have a good experience.
    • SLAs are expressed and measured at the 99.9th percentile of the distribution. Choose 99.9 for cost effective.

System Design

  • Favor availability over consistency by using optimistic replication techniques.
  • Optimistic replication introduces conflict. Conflict resolution introduces two problems: when to resolve them and who resolves them.
    • When to resolve:
      • Write time: writes may be rejected if the data store cannot reach all the replicas at a given time. This leads to poor user experience.
      • Read time: Dynamo targets the design space of an “always write-able” data store. So conflicts are resolved at read time.
    • Who to perform resolve:
      • Configurable.
      • System level resolution: last write wins.
      • Application level resolution: interactive guided merge conflicts, etc.
  • Other design considerations: Incremental scalability, Symmetry, Decentralization, Heterogeneity.

System Architecture

  • Partitioning: Consistent Hashing.
  • Replication: Replicated write protocol with configurable NWR parameters.
  • Conflict Resolution: Vector clock.
  • Temporary failures: Sloppy Quorum and hinted handoff.
  • Recovering from permanent failures: Merkle trees.
  • Membership and failure detection: Gossip-based membership protocol and failure detection.

To be continued.