Intro

At the highest level, Spanner is a database that shards data across many sets of Paxos state machines in datacenters spread all over the world.

Spanner has evolved from a Bigtable-like versioned key-value store into a temporal multi-version database.

Features:

Replication configuration is dynamically controlled by applications
Externally consistent reads and writes, and globally consistent reads across the database at a timestamp

External consistency (linearizability): if a transaction $T_1$ commits before another transaction $T_2$ starts, then $T_1$'s commit timestamp is smaller than $T_2$'s.

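TrueTime API: directly exposes clock uncertainty. `TT.now()` returns an interval `[earliest, latest]` that is guaranteed to contain the absolute time at which the call was made.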
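To make the interval idea concrete, here is a minimal sketch of a TrueTime-like clock. It assumes a constant uncertainty bound `epsilon` and an injectable local clock; the class names `TTInterval` and `TrueTime` are illustrative, and the real service derives its bound from GPS and atomic clocks rather than a fixed constant.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class TTInterval:
    earliest: float  # lower bound on absolute time, in seconds
    latest: float    # upper bound on absolute time, in seconds


class TrueTime:
    """Toy TrueTime: the local clock plus a fixed uncertainty bound (assumption)."""

    def __init__(self, epsilon: float = 0.004, clock=time.time):
        self.epsilon = epsilon  # assumed constant; the real bound varies over time
        self.clock = clock      # injectable so the sketch can be tested

    def now(self) -> TTInterval:
        t = self.clock()
        return TTInterval(t - self.epsilon, t + self.epsilon)

    def after(self, t: float) -> bool:
        """True if t has definitely passed."""
        return t < self.now().earliest

    def before(self, t: float) -> bool:
        """True if t has definitely not arrived yet."""
        return t > self.now().latest
```

With `TrueTime(epsilon=1.0, clock=lambda: 100.0)`, `now()` is the interval `[99.0, 101.0]`, so `after(98.5)` holds while `after(99.5)` does not.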

Implementation

Spanner is organized as a set of zones.

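Each zone contains one zonemaster and hundreds of spanservers. The zonemaster assigns data to spanservers, and the spanservers serve data to clients. Location proxies are used by clients to locate the spanservers that hold their data. The universe master and the placement driver are currently singletons. The universe master is a console that displays system status information and is used for debugging. The placement driver periodically communicates with the spanservers to find data that needs to be moved.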

If a transaction involves only one Paxos group, that group's leader can handle it with its lock table alone. If a transaction involves multiple Paxos groups, the group leaders coordinate to perform two-phase commit.

Concurrency Control

Timestamp Management

Read-only

read-only transaction: takes no locks, so it does not block later writes
snapshot read: a read at a timestamp in the past

Leader Lease

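A leader obtains its leader lease through lease votes. The lease is extended after a successful write.
A leader's lease interval starts when it has received votes from a majority and ends when it no longer holds a majority of unexpired lease votes.
Spanner's invariant: leader lease intervals are disjoint, so a group can never have two leaders at the same time.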
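The lease interval described above can be derived from the individual lease votes. The following is a minimal sketch, assuming each vote is a `(granted_at, expires_at)` pair of integer times and that all votes are granted before any of them expires. `lease_interval` and `disjoint` are hypothetical helpers for illustration, not Spanner APIs.

```python
def lease_interval(votes, n_replicas):
    """Return (start, end) of the leader lease, or None if no quorum was reached.

    votes: list of (granted_at, expires_at) lease votes held by one leader.
    Assumes every vote is granted before any vote expires.
    """
    quorum = n_replicas // 2 + 1
    if len(votes) < quorum:
        return None
    # The lease starts once the quorum-th vote has been granted ...
    start = sorted(granted for granted, _ in votes)[quorum - 1]
    # ... and ends when fewer than a quorum of votes remain unexpired.
    end = sorted((expires for _, expires in votes), reverse=True)[quorum - 1]
    return (start, end) if start < end else None


def disjoint(a, b):
    """True if half-open intervals a = [a0, a1) and b = [b0, b1) do not overlap."""
    return a[1] <= b[0] or b[1] <= a[0]
```

The disjointness invariant then says that `disjoint(x, y)` must hold for the lease intervals of any two leaders of the same group.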

RW Transaction

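A read-write transaction must be assigned a timestamp. Because it uses two-phase locking, the timestamp can be assigned at any point after all locks have been acquired and before any lock is released.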

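Within each Paxos group, Spanner assigns timestamps to Paxos writes in monotonically increasing order, even across leaders. This is easy for a single leader. Across leaders it follows from the disjointness of leader lease intervals: a leader may only assign timestamps that fall within its own leader lease.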
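The monotonicity argument can be illustrated with a small sketch. It assumes integer timestamps and a lease given as an inclusive `[lease_start, lease_end]` range; `LeaderTimestamps` is a hypothetical class used only for illustration.

```python
class LeaderTimestamps:
    """Hands out strictly increasing timestamps inside one leader's lease interval."""

    def __init__(self, lease_start: int, lease_end: int):
        self.lease_start = lease_start
        self.lease_end = lease_end
        self.last = lease_start - 1  # nothing assigned yet

    def assign(self, proposed: int) -> int:
        # Never go backwards: bump past the last timestamp this leader assigned,
        # which also keeps the result at or after the start of the lease.
        s = max(proposed, self.last + 1)
        if s > self.lease_end:
            raise RuntimeError("timestamp falls outside this leader's lease")
        self.last = s
        return s
```

Because two leaders of the same group hold disjoint leases, every timestamp assigned by the earlier leader is smaller than every timestamp assigned by the later one, so the order is monotonic across leader changes as well.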

Spanner enforces external consistency: if $T_2$ starts after $T_1$ commits, then the commit timestamp of $T_2$ is greater than the commit timestamp of $T_1$. Formally, Spanner guarantees $t_{abs}(e_{1}^{commit}) \lt t_{abs}(e_{2}^{start}) \implies s_1 \lt s_2$.

Serving Reads

Every replica tracks a value $t_{safe}$, the maximum timestamp at which the replica is up to date. A replica can serve a read at timestamp $t$ if $t \le t_{safe}$.

$t_{safe}=\min(t_{safe}^{Paxos}, t_{safe}^{TM})$

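Here, $t_{safe}^{Paxos}$ is the timestamp of the most recently applied Paxos write: because timestamps increase monotonically and writes are applied in order, no further write will occur at or below $t_{safe}^{Paxos}$. $t_{safe}^{TM}$ is the minimum prepare timestamp over all transactions that are prepared (but not yet committed) at the group, minus 1. If there are no such transactions, $t_{safe}^{TM}$ is $\infty$. The reasoning is that the commit protocol ensures every participant knows a lower bound on the transaction's timestamp: the coordinator leader ensures that $s_i \ge s_{i,g}^{prepare}$ over all participant groups $g$, so $t_{safe}^{TM} = \min_{i}(s_{i,g}^{prepare}) - 1$.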
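The safe-time formula can be written out as a short sketch. It assumes integer timestamps and, following the paper, treats $t_{safe}^{TM}$ as infinity when no transaction is prepared at the replica. The function names are illustrative only.

```python
import math


def t_safe_tm(prepare_timestamps):
    """Transaction-manager safe time: min prepare timestamp minus 1, or infinity."""
    if not prepare_timestamps:
        return math.inf  # no prepared-but-uncommitted transactions
    return min(prepare_timestamps) - 1


def t_safe(last_applied_paxos_ts, prepare_timestamps):
    """t_safe = min(t_safe^Paxos, t_safe^TM)."""
    return min(last_applied_paxos_ts, t_safe_tm(prepare_timestamps))


def can_serve_read(t, last_applied_paxos_ts, prepare_timestamps):
    """A replica can serve a read at timestamp t only if t <= t_safe."""
    return t <= t_safe(last_applied_paxos_ts, prepare_timestamps)
```

For example, with the last applied Paxos write at 100 and transactions prepared at 90 and 95, the safe time is 89: a read at 89 can be served, while a read at 90 must wait until the prepared transactions resolve.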

RO Transaction

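A read-only transaction executes in two steps: first it is assigned a timestamp $s_{read}$, and then its reads are executed as snapshot reads at $s_{read}$ on any sufficiently up-to-date replica.
The simplest choice is $s_{read}=TT.now().latest$, which preserves external consistency. However, a read at such a timestamp may block if $t_{safe}$ has not yet advanced far enough. Spanner therefore chooses the oldest timestamp that still preserves external consistency (Section 4.2.2).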
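The second step, a snapshot read at $s_{read}$, can be sketched as a lookup in a multi-version store. This is an illustrative sketch only: it assumes each key keeps a list of `(commit_ts, value)` versions sorted by timestamp, and it raises a hypothetical `NotCaughtUp` error instead of blocking when the replica's safe time is behind $s_{read}$.

```python
import bisect


class NotCaughtUp(Exception):
    """Raised when the replica's t_safe is still below the requested timestamp."""


def snapshot_read(versions, s_read, t_safe):
    """Return the newest value with commit_ts <= s_read, or None if there is none.

    versions: list of (commit_ts, value) pairs sorted by commit_ts.
    """
    if s_read > t_safe:
        # A real replica would wait for t_safe to advance; the sketch just reports it.
        raise NotCaughtUp(f"t_safe={t_safe} < s_read={s_read}")
    timestamps = [ts for ts, _ in versions]
    i = bisect.bisect_right(timestamps, s_read)  # number of versions with ts <= s_read
    return versions[i - 1][1] if i else None
```

Choosing a larger $s_{read}$ than necessary makes the `s_read > t_safe` case more likely, which is why an older timestamp that still preserves external consistency is preferable.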

Details

Read-Write Transaction

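Writes in a transaction are buffered at the client until commit.

Locking in read-write transactions uses [[wound-wait]] to avoid deadlocks. The client sends its read requests to the group leader so that it reads the most recent data.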
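The wound-wait rule itself is small enough to sketch. The sketch assumes each transaction carries a start timestamp and that a smaller timestamp means an older transaction; `wound_wait` is a hypothetical helper, not part of any Spanner API.

```python
def wound_wait(requester_ts: int, holder_ts: int) -> str:
    """Decide what happens when a requester wants a lock that a holder owns.

    Smaller start timestamp means older transaction.
    """
    if requester_ts < holder_ts:
        return "wound"  # older requester aborts (wounds) the younger holder
    return "wait"       # younger requester waits for the older holder
```

Since a transaction only ever waits for an older one, the wait-for graph cannot contain a cycle, which is what rules out deadlock.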

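While a transaction is open, the client sends keepalive messages to the corresponding leaders so that the transaction does not time out. Once the client has completed all reads and buffered all writes, it begins two-phase commit. The client chooses a coordinator group and sends each leader a commit message containing the identity of the coordinator and the buffered writes, which avoids transmitting the data more than once.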

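A non-coordinator leader first acquires write locks. It then chooses a prepare timestamp that must be larger than any timestamp it has assigned to previous transactions (to preserve monotonicity), and logs a prepare record through Paxos. Each participant then notifies the coordinator of its prepare timestamp.

The coordinator leader also first acquires write locks, but skips the prepare phase. Its commit timestamp must be greater than or equal to all prepare timestamps, greater than $TT.now().latest$ at the time it received the commit message, and greater than any timestamp it has assigned to previous transactions. The coordinator then logs a commit record through Paxos.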
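The three constraints on the commit timestamp can be combined in a short sketch. It assumes integer timestamps and picks the smallest value that satisfies all constraints; it uses "greater than or equal to" for the prepare timestamps, matching the relation $s_i \ge s_{i,g}^{prepare}$ used in the safe-time discussion. `choose_commit_timestamp` is a hypothetical helper.

```python
def choose_commit_timestamp(prepare_timestamps, tt_latest_at_commit_msg, last_assigned):
    """Smallest integer timestamp s satisfying the coordinator's constraints."""
    return max([
        *prepare_timestamps,          # s >= every participant's prepare timestamp
        tt_latest_at_commit_msg + 1,  # s > TT.now().latest when the commit message arrived
        last_assigned + 1,            # s > any timestamp this leader assigned before
    ])
```

For example, `choose_commit_timestamp([120, 110], 100, 50)` gives 120, because the largest prepare timestamp dominates, while `choose_commit_timestamp([90], 100, 50)` gives 101, because the TrueTime bound dominates.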

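Before allowing any coordinator replica (as opposed to the leader itself) to apply the commit record, the coordinator leader waits until $TT.after(s)$ holds, in order to obey the commit-wait rule. Because the leader chose $s$ based on $TT.now().latest$ and now waits until that timestamp is guaranteed to be in the past, the expected wait is at least $2*\epsilon$. After the wait, the coordinator sends the commit timestamp to the client and to the other participant leaders. Each participant leader logs the outcome through Paxos, applies it at the same timestamp, and releases its locks.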
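The commit-wait rule can be demonstrated with a simulated clock. This is a sketch under stated assumptions: `FakeTrueTime` is a hypothetical stand-in with a constant integer uncertainty `epsilon` (say, in milliseconds) and a manually advanced clock, and `commit_wait` simply polls until the timestamp has definitely passed.

```python
class FakeTrueTime:
    """Hypothetical TrueTime stand-in with a manually advanced integer clock."""

    def __init__(self, epsilon: int):
        self.epsilon = epsilon
        self.t = 0  # simulated local time

    def now(self):
        return (self.t - self.epsilon, self.t + self.epsilon)  # (earliest, latest)

    def after(self, ts: int) -> bool:
        return ts < self.now()[0]  # ts has definitely passed

    def sleep(self, dt: int) -> None:
        self.t += dt


def commit_wait(tt, s: int, step: int = 1) -> int:
    """Block until s is guaranteed to be in the past; return how long we waited."""
    waited = 0
    while not tt.after(s):
        tt.sleep(step)
        waited += step
    return waited
```

With `epsilon = 4` and `s` taken from `tt.now()[1]`, the loop ends only once the clock's `earliest` bound has passed `s`, so the wait is at least `2 * epsilon`. This is the wait that makes commit timestamps respect real-time order.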

Read-Only Transaction

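A read-only transaction declares a scope that covers the keys it will read. If the scope is served by a single Paxos group, the client only needs to send the request to that group's leader, which chooses $s_{read}$ and executes the read.
If the scope spans multiple Paxos groups, Spanner currently uses the simple choice $s_{read}=TT.now().latest$.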

Reading

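简单解释Spanner的TrueTime在分布式事务中的作用 (A simple explanation of the role of Spanner's TrueTime in distributed transactions), article by 阿莱克西斯 on Zhihu
Spanner十问 (Ten questions about Spanner), Zhihu (zhihu.com)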

Steve's

6.824-Spanner