Paper Digest - Megastore: Providing Scalable, Highly Available Storage for Interactive Services

Abstract

  • Blends the scalability of a NoSQL datastore with the convenience of a traditional RDBMS.
  • Provides both strong consistency guarantees and high availability.
  • Provides fully serializable ACID semantics within fine-grained partitions of data.

Motivation

  • Online services are forcing the storage community to meet new demands.
    • Highly scalable.
    • Rapid development.
    • Low latency.
    • Consistent view of data.
    • Highly available.

Design Highlights

  • Partition the data; the unit of replication is a partition (an entity group).
  • Full ACID semantics within partitions only.
  • Across partitions, only limited consistency guarantees (Spanner later improves on this).
  • Provides database features (e.g., secondary indexes), but without scalability guarantees: such features scale only within user-tolerable latency limits.
  • Innovative use of Paxos: it replicates user data directly, rather than serving the traditional roles of locking, leader election, or replicating config data (e.g. ZooKeeper/ZAB).

Availability with Scalability

  • Replication
    • Replicate a write-ahead log over a group of symmetric peers.
    • Any node can initiate reads and writes.
    • Each log append blocks on acknowledgments from a majority of replicas (see the majority-ack sketch at the end of this section).
    • Optimizations on Paxos
      • Allow local reads.
      • Single roundtrip writes.
    • Multiple replicated logs (one per entity group) instead of a single log, for better availability.
  • Key Concept - Entity Groups
    • An entity group is a collection of data and the unit of replication.
    • Within a single entity group: full ACID semantics.
    • Across entity groups:
      • Two-phase commit (strong consistency).
      • Or async communication through message queues.
  • Physical Layout of data
    • Uses Bigtable for data storage.
    • Applications control the placement of data:
      • Applications try to keep data near users and replicas near each other.
      • The data for an entity group are held in contiguous ranges of Bigtable rows (see the row-key sketch below).
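
A minimal sketch (in Python, not the paper's implementation) of the majority-ack log append described under Replication above; the Replica class and replica names are invented for illustration:

    from dataclasses import dataclass, field

    @dataclass
    class Replica:
        name: str
        log: list = field(default_factory=list)
        up: bool = True

        def append(self, position: int, entry: str) -> bool:
            """Acknowledge the append iff this replica is reachable."""
            if not self.up:
                return False
            self.log.append((position, entry))
            return True

    def replicated_append(replicas, position, entry):
        """Block until a majority of replicas acknowledge the log append."""
        acks = sum(r.append(position, entry) for r in replicas)
        return acks > len(replicas) // 2

    replicas = [Replica("us-east"), Replica("us-west"), Replica("eu")]
    replicas[2].up = False                 # one replica down: write still commits
    assert replicated_append(replicas, 0, "txn-1")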
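
And a hedged sketch of the contiguous row layout: assuming a photo-app schema like the paper's running example (User root, Photo children), rows for one entity group share the root entity's key prefix, so Bigtable's lexicographic ordering keeps the group's rows adjacent. The key format here is invented for illustration:

    from typing import Optional

    def row_key(user_id: int, photo_id: Optional[int] = None) -> str:
        """Root row for a User, or a child Photo row under the same prefix."""
        root = f"user:{user_id:08d}"
        return root if photo_id is None else f"{root}/photo:{photo_id:08d}"

    # Rows sort lexicographically, so one group's rows are contiguous:
    keys = [row_key(42), row_key(42, 1), row_key(42, 2), row_key(43)]
    assert keys == sorted(keys)            # group 42's rows precede group 43's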

API Design

  • Cost-transparent APIs: runtime costs that match application developers’ intuitions.
  • Eliminates the need for most joins by offering fine-grained control over physical locality.
  • Joins, when required, are implemented in application code (a minimal sketch follows).
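
For instance, a minimal sketch of an application-side join; the Store class here is a hypothetical dict-backed stand-in for a Megastore client, not a real API:

    class Store:
        """Toy stand-in: users and their photos, keyed by user_id."""
        def __init__(self):
            self.users = {42: {"name": "Ada"}}
            self.photos = {42: [{"url": "p1.jpg"}, {"url": "p2.jpg"}]}

        def get_user(self, user_id):
            return self.users[user_id]

        def get_photos(self, user_id):     # cheap: same entity group
            return self.photos.get(user_id, [])

    def photos_with_owner(store, user_id):
        """Application-side join: one lookup plus one in-group scan."""
        user = store.get_user(user_id)
        return [(user["name"], p["url"]) for p in store.get_photos(user_id)]

    print(photos_with_owner(Store(), 42))  # [('Ada', 'p1.jpg'), ('Ada', 'p2.jpg')]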

Data Model

  • Lies between the abstract tuples of an RDBMS and the concrete row-column storage of NoSQL.
  • Data model is declared in a schema and is strongly typed.
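
As a rough illustration of a declared, strongly typed schema, here is a Python rendering modeled loosely on the paper's photo-app example (User as the entity group root, Photo as a child entity); field names are illustrative, not the paper's exact schema:

    from dataclasses import dataclass

    @dataclass
    class User:              # entity group root
        user_id: int
        name: str

    @dataclass
    class Photo:             # child entity, keyed under its User root
        user_id: int         # reference back to the root entity
        photo_id: int
        time: int
        url: str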

Transactions and Concurrency Control

  • Uses MVCC: readers and writers don’t block each other, and reads are isolated from writes for the duration of a transaction (sketched after this list).
  • Read semantics
    • current (single entity group; waits for all committed writes to apply)
    • snapshot (single entity group; reads the last fully applied state)
    • inconsistent (ignores the state of the log)
  • Queues
    • Transactional messaging between entity groups; delivers batched cross-group operations asynchronously.
  • Two-phase commit for atomic updates across entity groups (supported but discouraged because of its latency).
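
A minimal MVCC sketch (assuming Bigtable-style timestamped cell versions, which matches how the paper stores multiple versions): readers pick a timestamp and see only versions at or before it, so a concurrent writer at a later timestamp never blocks or disturbs them.

    class MVCCCell:
        def __init__(self):
            self.versions = []             # (timestamp, value), appended in order

        def write(self, ts, value):
            self.versions.append((ts, value))

        def read(self, ts):
            """Return the newest value with timestamp <= ts (snapshot read)."""
            visible = [v for t, v in self.versions if t <= ts]
            return visible[-1] if visible else None

    cell = MVCCCell()
    cell.write(10, "v1")
    snapshot_ts = 10
    cell.write(20, "v2")                   # a later, concurrent write
    assert cell.read(snapshot_ts) == "v1"  # the reader is isolated from it
    assert cell.read(25) == "v2"           # a current read sees v2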

Paxos based Replication (key innovations)

  • Pitfalls of master-slave Paxos:
    • Master failover can require a complicated state machine.
    • A series of timers must elapse before service is restored.
    • It is difficult to avoid user-visible outages.
  • Fast Reads
    • Local read.
    • The coordinator tracks whether a replica has observed the full state of writes for an entity group, so that replica can serve local reads (see the coordinator sketch at the end of the Coordinator section).
  • Fast Writes
    • Single round trip writes.
    • Pre-preparing optimization: each successful write includes an implied prepare message granting the leader the right to issue accept messages for the next log position.
    • If the write succeeds, the prepare is honored, and the next write skips directly to the accept phase (sketched after this list).
    • Multi-Paxos: an independent instance of the Paxos algorithm runs for each log position.
  • Replica Types
    • Full Replica: contains all entity and index data and can serve current reads.
    • Witness Replica: votes and stores the WAL, but does not apply it (somewhat like a ZooKeeper observer, though not strictly the same: a ZK observer does not vote).
    • Read-Only Replica: does not vote, but contains a full snapshot of the data at some point in the past (closer to a ZK observer, though a ZK observer stays more up to date with respect to its state).
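
A heavily simplified sketch of the fast-write path (no ballot numbers, no competing proposers, no catch-up): a successful write at log position n doubles as a prepare for position n+1, so the next write skips straight to the accept phase.

    class PaxosReplica:
        def __init__(self):
            self.log = {}                  # log position -> accepted value
            self.prepared_pos = 0          # highest position already prepared

        def prepare(self, pos):
            self.prepared_pos = max(self.prepared_pos, pos)

        def accept(self, pos, value):
            if pos > self.prepared_pos:    # not prepared for this position
                return False
            self.log[pos] = value
            self.prepare(pos + 1)          # the implied prepare for the next slot
            return True

    def fast_write(replicas, pos, value):
        """One accept round trip; a majority of acks commits the entry."""
        acks = sum(r.accept(pos, value) for r in replicas)
        return acks > len(replicas) // 2

    replicas = [PaxosReplica() for _ in range(3)]
    for r in replicas:
        r.prepare(1)                       # initial explicit prepare round
    assert fast_write(replicas, 1, "txn-A")
    assert fast_write(replicas, 2, "txn-B")    # prepare skipped: implied by txn-A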

Coordinator

  • The coordinator is a simple process with no external dependencies and no persistent storage.
  • Failure Detection
    • Uses an out-of-band protocol to identify when other coordinators are up, healthy, and generally reachable.
    • Coordinators obtain specific Chubby locks in remote datacenters at startup.
    • A coordinator reverts its state to a conservative default when it loses a majority of its locks (e.g., due to a crash or network partition).
    • This risks brief and rare outages.
    • The liveness protocol is vulnerable to asymmetric network partitions.
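
A hedged sketch tying the coordinator to fast reads: the local replica may serve a current read only while the coordinator still believes that replica has observed every committed write for the entity group. Method names are invented; the real protocol is more involved.

    class Coordinator:
        """In-memory only; the real coordinator also has no persistent storage."""
        def __init__(self):
            self.up_to_date = set()        # entity groups this replica is current for

        def validate(self, group):         # after the local replica applies a write
            self.up_to_date.add(group)

        def invalidate(self, group):       # by writers, when this replica misses a write
            self.up_to_date.discard(group)

        def conservative_default(self):    # on losing its Chubby locks / partition
            self.up_to_date.clear()        # assume nothing is current

    def read_current(coord, group, local_read, majority_read):
        if group in coord.up_to_date:
            return local_read(group)       # fast path: no inter-replica round trips
        return majority_read(group)        # fall back to a majority Paxos read

    coord = Coordinator()
    coord.validate("group-42")
    print(read_current(coord, "group-42",
                       lambda g: f"local read of {g}",
                       lambda g: f"majority read of {g}"))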

Operation / Metrics / Experience - omitted.