This review analyzes the paper “Spanner: Google’s Globally-Distributed Database”, written by a group of researchers at Google, Inc. The authors are James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, JJ Furman, Sanjay Ghemawat, Andrey Gubarev, Christopher Heiser, Peter Hochschild, Wilson Hsieh, Sebastian Kanthak, Eugene Kogan, Hongyi Li, Alexander Lloyd, Sergey Melnik, David Mwaura, David Nagle, Sean Quinlan, Rajesh Rao, Lindsay Rolig, Yasushi Saito, Michal Szymaniak, Christopher Taylor, Ruth Wang, and Dale Woodford. The paper was published at OSDI 2012.
This paper describes the design, implementation, and deployment of Spanner, a globally replicated database that offers hierarchical scaling, external consistency, synchronous replication, multi-versioning, global distribution, fault tolerance, and a SQL-based interface. The purpose of this review is to assess the strengths and weaknesses of Spanner’s design, its contribution to the field of distributed databases, and its overall significance. This review also examines how Spanner achieves global consistency and scalability, its fault-tolerance mechanisms, its performance, and its real-world applications.
Beyond these features, Spanner supports non-blocking reads in the past, lock-free read-only transactions, and atomic schema changes.
Summary
The paper presents Spanner as the first system to distribute data at global scale while supporting externally consistent transactions. It gives a detailed description of Spanner’s architecture, its use of Paxos for consensus, and a novel time API that exposes clock uncertainty. The paper outlines the data distribution mechanism within spanservers, which house data structures known as tablets that are kept in files within a distributed file system called Colossus (the successor of the Google File System). It explains the data model, which permits a hierarchy of tables, with rows from a parent table interleaved with those of its child tables. Spanner’s most significant innovation is its use of a trustworthy global time system called TrueTime to implement timestamp management and external consistency (linearizability) of transactions. TrueTime represents a timestamp as an interval: the timestamp plus or minus some uncertainty about the true time. Spanner offers several advantages, including geographically dispersed replicas, non-blocking reads, lock-free read-only transactions, and atomic schema changes. It uses Paxos for replicated operations and hides sharding and replication, giving clients the impression of a single database served by Spanner’s servers. Spanner is designed to scale up to millions of machines across hundreds of datacenters and trillions of database rows. There is a strong use of data locality through the concept of buckets: data is stored in tablets, which are grouped into buckets, and an application can control the locality of its data by carefully assigning keys, potentially lowering latency.
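The interleaving and key-locality idea above can be sketched in a few lines. This is a hypothetical illustration (the table names `Users`/`Albums` and the tuple-based key encoding are assumptions for the demo, not Spanner's actual on-disk format): a child row's key is prefixed by its parent's key, so a sorted scan keeps each parent's data contiguous.

```python
# Hypothetical sketch of Spanner-style interleaving: a child row's key
# shares a prefix with its parent row's key, so parent and children sort
# adjacently and land in the same bucket (good locality for range scans).
def user_key(uid):
    return ("Users", uid)

def album_key(uid, aid):
    # Child key is prefixed by the parent key.
    return ("Users", uid, "Albums", aid)

rows = {
    user_key(1): "alice",
    album_key(1, 10): "holiday photos",
    album_key(1, 11): "pets",
    user_key(2): "bob",
}

# A sorted scan returns user 1's row immediately followed by all of
# user 1's albums, before reaching user 2.
for key in sorted(rows):
    print(key, "->", rows[key])
```

Because the application chooses the keys, it controls which rows end up adjacent, which is exactly the locality lever the paper describes.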
Evaluation
Megastore's popularity prompted the need to support semi-relational tables and synchronous replication in Spanner. Despite its relatively poor write performance, Megastore was chosen by many Google applications (including Gmail, Picasa, Calendar, Android Market, and AppEngine) for its semi-relational data model and synchronous replication. Spanner has evolved from a Bigtable-style versioned key-value store into a temporal multi-version database. Data is stored in semi-relational tables, and Spanner offers a SQL-based query language as well as general-purpose long-lived transactions (for example, report generation, on the order of minutes). The Spanner team feels it is preferable to let application programmers deal with performance problems due to overuse of transactions as they arise, rather than always coding around the lack of transactions.
TrueTime
Using the TrueTime API, Spanner versions the data and automatically timestamps each version with its commit time. Spanner supports externally consistent reads and writes, as well as globally consistent reads across the database at a specified timestamp. External consistency (or linearizability) is defined as follows: if a transaction T1 commits before another transaction T2 starts, then T1's commit timestamp is smaller than T2's. With TrueTime, Spanner can assign globally meaningful commit timestamps to transactions that reflect their serialization order.
TT.now() returns a TTinterval that is guaranteed to contain the absolute time at which TT.now() was invoked. TrueTime ensures that for an invocation tt = TT.now(), tt.earliest ≤ tabs(enow) ≤ tt.latest, where enow is the invocation event and tabs(enow) is the absolute time of that event. The instantaneous error bound, denoted ε, is half of the interval's width. Google uses multiple modern clock references (GPS and atomic clocks) to keep the uncertainty small (around 6 ms). TrueTime is implemented using a set of time master machines per datacenter and a time slave daemon per machine. The majority of masters have GPS receivers with dedicated antennas; the remaining masters (Armageddon masters) are equipped with atomic clocks. The TrueTime API explicitly exposes clock uncertainty, and the guarantees on Spanner's timestamps depend on the bounds the implementation provides. If the uncertainty is large, Spanner slows down to wait it out.
TrueTime is a sophisticated design, with redundant hardware support and algorithms to verify its correctness. Because every write transaction must wait out a period bounded by ε, and the uncertainty caused by hardware error is hard to eliminate, Spanner is unlikely to achieve better write performance or timestamp accuracy than ε allows. Ordering in the system is based solely on clock time, and since clock time is uncertain, in many scenarios the system must wait until it is sure a previous event has completed, even when the only purpose of the wait is to make TT.after true. Furthermore, the paper does not say what happens if the TrueTime API yields a faulty timestamp: if, say, a read operation is issued at a timestamp in the future, will it block other transactions, halt, or return an error?
Implementation
A zone has one zonemaster and hundreds or thousands of spanservers. The zonemaster assigns data to spanservers, which serve data to clients. Location proxies help clients locate the spanservers assigned to serve their data. The universe master displays status information about all zones for interactive debugging, and the placement driver automates the movement of data across zones on a timescale of minutes. Each spanserver is responsible for between 100 and 1000 tablets. A tablet implements the mapping (key:string, timestamp:int64) → string. To support replication, each spanserver builds a single Paxos state machine on top of each tablet. At each leader replica, a spanserver provides a lock table (mapping ranges of keys to lock states) for concurrency control, as well as a transaction manager to support distributed transactions. If a transaction involves only one Paxos group (as most transactions do), it can bypass the transaction manager, since the lock table and Paxos together provide transactionality. If a transaction involves more than one Paxos group, the groups' leaders coordinate to perform a two-phase commit. One of the participant groups is chosen as the coordinator; the participant leader of that group is referred to as the coordinator leader, and the slaves of that group as coordinator slaves.
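The tablet mapping (key:string, timestamp:int64) → string can be sketched as a versioned key-value store. This is a minimal illustration, not Spanner's storage engine; the read-at-timestamp rule (return the latest version with commit timestamp ≤ t) is our assumption about how a multi-version read behaves.

```python
# Minimal sketch of the tablet mapping from the paper:
# (key: string, timestamp: int64) -> string. Each write stores a new
# version; a read at timestamp t returns the latest version whose
# timestamp is <= t (an illustrative multi-version read).
class Tablet:
    def __init__(self):
        self.versions = {}  # (key, timestamp) -> value

    def write(self, key, timestamp, value):
        self.versions[(key, timestamp)] = value

    def read(self, key, timestamp):
        candidates = [ts for (k, ts) in self.versions
                      if k == key and ts <= timestamp]
        if not candidates:
            return None
        return self.versions[(key, max(candidates))]


t = Tablet()
t.write("user:1", 10, "alice@v1")
t.write("user:1", 20, "alice@v2")
assert t.read("user:1", 15) == "alice@v1"  # only versions <= 15 visible
assert t.read("user:1", 25) == "alice@v2"
```

Because old versions are retained, a read at a past timestamp never blocks on current writers, which is what makes Spanner's snapshot reads lock-free.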
Concurrency
The Spanner implementation supports read-write transactions, read-only transactions, and snapshot reads. Standalone writes are implemented as read-write transactions, while non-snapshot standalone reads are implemented as read-only transactions; a snapshot read is a read in the past that executes without locking. Read-write transactions use two-phase locking and two-phase commit. First, the client issues reads to the leader replica of the appropriate group, which acquires read locks and then reads the most recent data. Once the client has completed all reads and buffered all writes, it initiates a two-phase commit. A read-write transaction can be assigned a commit timestamp at any time after all locks have been acquired, but before any lock has been released. For a given transaction, Spanner uses the timestamp that the coordinator leader assigns to the Paxos write representing the transaction commit.
To wait out the uncertainty in TrueTime, there is a commit wait: the coordinator leader ensures that clients cannot see any data committed by Ti until TT.after(si) is true.
Write operations use Paxos for consensus and two-phase commit within a transaction, which enforces consistency but has drawbacks: leader failover can result in long waits, and the communication overhead is unavoidable, increasing the latency of every transaction.
Strengths and Weaknesses
The paper describes work in the field of distributed databases and introduces several novel concepts. Spanner's main contribution is its ability to distribute data at global scale while supporting externally consistent distributed transactions, one of the significant achievements in the field. Another major achievement is the TrueTime API, an innovative approach to handling clock uncertainty that enables features such as reads at past timestamps. Spanner's design allows horizontal scaling to meet today's growing data demands while offering a comprehensive set of features valuable for various applications.
Although the paper addresses major challenges in distributed database systems such as consistency, reliability, fault tolerance, performance, and concurrency handling, it does not dive deeply into performance considerations, particularly latency in geographically distributed deployments. It also misses a detailed discussion of the trade-offs involved in achieving global distribution and external consistency, and the design's complexity may pose challenges for manageability and maintainability. The paper describes how Spanner works in theory, but does not show how it performs with real-world applications and massive datasets, nor whether users notice a lag when accessing data across different continents; it could benefit from a more extensive evaluation with real-world workloads. Spanner also does not support full SQL semantics, which might be needed by applications such as the Ads databases; for instance, it does not support secondary indexes, which are crucial for performance-critical operations. Finally, the system relies heavily on clock time for ordering its transactions, but the paper does not describe what happens in the case of clock failure or incorrectly behaving clocks on local machines.
The paper does not go into much detail about what TrueTime buys for Spanner beyond its conclusion, which states: "One aspect of our design stands out: the linchpin of Spanner's feature set is TrueTime. We have shown that reifying clock uncertainty in the time API makes it possible to build distributed systems with much stronger time semantics." Although the paper presents convincing data, there are some assumptions and potential biases that need to be addressed. It focuses on functionality and does not extensively discuss the security considerations of globally distributed systems, nor does it describe the mechanism Spanner uses for data sharding to achieve high read/write throughput. The paper assumes that the benefits of global distribution outweigh its costs and complexity, which may not hold for all scenarios. The evaluation is also biased toward Google's particular use cases, and more research is required to fully understand Spanner's capabilities under other workloads.
Conclusion
We can conclude that “Spanner: Google’s Globally-Distributed Database” presents a ground-breaking design for a scalable, globally distributed database with strong consistency. The paper has an impressive theoretical foundation. The authors work through the challenging ideas of externally consistent transactions, delivering global time, and managing clock uncertainty through the TrueTime API. The evaluation approach is also solid, giving a good sense of the scalability of the two-phase commit protocol.
The paper makes a significant contribution to the field of distributed databases, offering a cutting-edge method for managing transactions and distributing data globally. The study could be strengthened in a few places, especially in addressing the trade-offs and potential drawbacks of the proposed system. More research is required to fully understand Spanner's consequences and further potential uses; future studies could look into security concerns and Spanner's potential use outside Google's internal ecosystem. Considering everything, the work significantly advances the subject of distributed databases and opens the door for future developments in scalable and globally consistent data storage solutions. Future work also includes tuning Spanner's performance and functionality. A major area of improvement is scalability in the number of nodes, as node-local data structures have relatively poor performance on complex SQL queries. Automatically moving data between datacenters in response to changes in client load is another major area for improvement in the near future.