Paper digest - Dynamo: Amazon’s Highly Available Key-value Store
Motivation
- A highly available key-value storage system that manages state of Amazon services.
- Reliability at massive scale, provide an “always-on” experience.
Requirements
Query Model: simple read and write operations to objects (relatively small, usually less than 1MB) that is uniquely identified by a key.
ACID (Atomicity, Consistency, Isolation, Durability) : weak consistency, no isolation, only single key update.
Efficiency:
- Function on a commodity hardware infrastructure.
- Strict latency requirements and SLA.
- Configurable to achieve customizable latency and throughput requirements.
Operation and Security:
- No authentication and authorization (operating in non-hostile environment).
- Scale up (each service uses its distinct instance of Dynamo).
SLA:
- Traditional SLA metrics (average, median and expected variance) are not good enough to build a system where all customers have a good experience.
- SLAs are expressed and measured at the 99.9th percentile of the distribution. Choose 99.9 for cost effective.
System Design
- Favor availability over consistency by using optimistic replication techniques.
- Optimistic replication introduces conflict. Conflict resolution introduces two problems: when to resolve them and who resolves them.
- When to resolve:
- Write time: writes may be rejected if the data store cannot reach all the replicas at a given time. This leads to poor user experience.
- Read time: Dynamo targets the design space of an “always write-able” data store. So conflicts are resolved at read time.
- Who to perform resolve:
- Configurable.
- System level resolution: last write wins.
- Application level resolution: interactive guided merge conflicts, etc.
- When to resolve:
- Other design considerations: Incremental scalability, Symmetry, Decentralization, Heterogeneity.
System Architecture
- Partitioning: Consistent Hashing.
- Replication: Replicated write protocol with configurable NWR parameters.
- Conflict Resolution: Vector clock.
- Temporary failures: Sloppy Quorum and hinted handoff.
- Recovering from permanent failures: Merkle trees.
- Membership and failure detection: Gossip-based membership protocol and failure detection.