Dynamo, Amazon’s Highly Available Key-value Store

Motivation

A highly available key-value storage system that manages state of Amazon services.
Reliability at massive scale, provide an “always-on” experience.

Query Model: simple read and write operations to objects (relatively small, usually less than 1MB) that is uniquely identified by a key.
ACID (Atomicity, Consistency, Isolation, Durability) : weak consistency, no isolation, only single key update.
Efficiency:
- Function on a commodity hardware infrastructure.
- Strict latency requirements and SLA.
- Configurable to achieve customizable latency and throughput requirements.
Operation and Security:
- No authentication and authorization (operating in non-hostile environment).
- Scale up (each service uses its distinct instance of Dynamo).
SLA:
- Traditional SLA metrics (average, median and expected variance) are not good enough to build a system where all customers have a good experience.
- SLAs are expressed and measured at the 99.9th percentile of the distribution. Choose 99.9 for cost effective.

Favor availability over consistency by using optimistic replication techniques.
Optimistic replication introduces conflict. Conflict resolution introduces two problems: when to resolve them and who resolves them.
- When to resolve:
  - Write time: writes may be rejected if the data store cannot reach all the replicas at a given time. This leads to poor user experience.
  - Read time: Dynamo targets the design space of an “always write-able” data store. So conflicts are resolved at read time.
- Who to perform resolve:
  - Configurable.
  - System level resolution: last write wins.
  - Application level resolution: interactive guided merge conflicts, etc.
Other design considerations: Incremental scalability, Symmetry, Decentralization, Heterogeneity.

Partitioning: Consistent Hashing.
Replication: Replicated write protocol with configurable NWR parameters.
Conflict Resolution: Vector clock.
Temporary failures: Sloppy Quorum and hinted handoff.
Recovering from permanent failures: Merkle trees.
Membership and failure detection: Gossip-based membership protocol and failure detection.