Kudu memo

  • Externally consistent transactions, in the sense that all transactions are linearizable

## Hybrid Time

  • Assumption 1. Physical clock error from reference clock is bounded

  • Assumption 2. Physical clock timestamps are monotonically increasing

Physical Clock API:

i = server_id()

PC.Now():
 physical = PC_i(now)
 ε = E_i(physical)
 return physical, ε

where PC_i(now) is the clock built into the server and E_i() is an upper bound on the physical clock's skew from the reference clock PC_ref().
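
A minimal sketch of what this physical-clock API could look like; the `PhysicalClock` class, the microsecond units, and the fixed `max_error_us` bound are illustrative assumptions, not details from the paper:

```python
import time

class PhysicalClock:
    """Hypothetical PC API: the server's clock plus an error bound E_i
    relative to the reference clock PC_ref (e.g. derived from NTP data)."""

    def __init__(self, server_id, max_error_us=500):
        self.server_id = server_id        # i = server_id()
        self.max_error_us = max_error_us  # assumed bound on skew from PC_ref

    def now(self):
        physical = int(time.time() * 1_000_000)  # PC_i(now), in microseconds
        epsilon = self.max_error_us              # E_i(physical)
        return physical, epsilon
```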

HTC API:

last_physical = 0
next_logical = 0

HTC.Now():
 Timestamp now
 int cur_physical = PC.Now().physical

 if cur_physical >= last_physical then
   now.physical = cur_physical
   now.logical = 0
   last_physical = cur_physical
   next_logical = 1
 else
   now.physical = last_physical
   now.logical = next_logical
   next_logical++
 end if
 return now

HTC.Update(Timestamp in):
 Timestamp now = Now()
 if now.physical > in.physical then
   return
 end if
 last_physical = in.physical
 next_logical = in.logical + 1
 return
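
The pseudocode above translates fairly directly into code. This is a runnable sketch under the same assumptions (single-threaded access, microsecond physical timestamps); a real implementation would also lock around the shared state and consume the error bound from the PC API:

```python
import time

class HybridTimestamp:
    def __init__(self, physical, logical):
        self.physical = physical
        self.logical = logical

class HybridTimeClock:
    def __init__(self):
        self.last_physical = 0
        self.next_logical = 0

    def _physical_now(self):
        # Stand-in for PC.Now().physical, in microseconds.
        return int(time.time() * 1_000_000)

    def now(self):
        cur_physical = self._physical_now()
        if cur_physical >= self.last_physical:
            # Physical clock moved forward: reset the logical component.
            ts = HybridTimestamp(cur_physical, 0)
            self.last_physical = cur_physical
            self.next_logical = 1
        else:
            # Physical clock stalled or went backwards: reuse the last
            # physical value and advance the logical counter instead.
            ts = HybridTimestamp(self.last_physical, self.next_logical)
            self.next_logical += 1
        return ts

    def update(self, incoming):
        # HTC.Update: advance past a timestamp received from another server.
        now = self.now()
        if now.physical > incoming.physical:
            return
        self.last_physical = incoming.physical
        self.next_logical = incoming.logical + 1
```
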
  • Theorem 1. The HybridTime clock happened-before relation forms a total order of events

  • Q. This is not trivial: you cannot order e and f from physical clocks alone when PC_i(e) - PC_j(f) < E_i(e) < E_j(f)

  • Theorem 2. For any causal chain from e to f, the physical component of f's HTC timestamp approximates the “real” time the event occurred, with a bounded error [-E_j(e), E_i(f)]
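
One way to make the total-order claim concrete (my own formalization, not a quote from the paper): HTC timestamps (physical, logical) are compared lexicographically, so any two timestamps are comparable even when the physical components alone fall within each other's error bounds:

```latex
% Lexicographic comparison of HTC timestamps t = (p, l)
(p_1, l_1) \prec (p_2, l_2) \iff p_1 < p_2 \;\lor\; (p_1 = p_2 \land l_1 < l_2)
```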

## Types of Transactions

  • Read-write transactions

  • Snapshot transactions

  • Time-travel read

Write transactions that span multiple RGs (replica groups) need one RG leader to play the role of transaction manager (TM); a sketch of the flow follows the list below.

  1. Prepare: the TM sends a Prepare to all participating RGs, which acquire locks and each respond with their current HTC timestamp

  2. Apply: the TM chooses the highest HTC timestamp and sends it as the commit timestamp in an Apply to the RG leaders
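
A hypothetical sketch of this Prepare/Apply flow; the `ReplicaGroupLeader` interface and the `multi_rg_write` driver are invented for illustration, reusing the `HybridTimeClock` sketch above for timestamps:

```python
class ReplicaGroupLeader:
    """Assumed RPC surface of an RG leader; not taken from the paper."""
    def __init__(self, clock):
        self.clock = clock   # a HybridTimeClock with now()/update()
        self.locks = set()

    def prepare(self, txn_id, keys):
        # Acquire row locks for the transaction, then respond with this
        # replica group's current HTC timestamp.
        self.locks.update(keys)
        return self.clock.now()

    def apply(self, txn_id, commit_ts):
        # Move the local clock past the commit timestamp, apply, release locks.
        self.clock.update(commit_ts)
        self.locks.clear()


def multi_rg_write(txn_id, participants):
    """participants: list of (leader, keys) pairs touched by the transaction."""
    # Phase 1 (Prepare): every participant locks and reports its HTC timestamp.
    timestamps = [leader.prepare(txn_id, keys) for leader, keys in participants]
    # The TM picks the highest reported timestamp as the commit timestamp.
    commit_ts = max(timestamps, key=lambda t: (t.physical, t.logical))
    # Phase 2 (Apply): the commit timestamp is sent to every RG leader.
    for leader, _keys in participants:
        leader.apply(txn_id, commit_ts)
    return commit_ts
```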

My questions: the bounded error on the time at f does not account for the feasibility of read-write or write-write transactions. The details are not spelled out, which might be future work here. Also, how would an interleaved select-update be handled, e.g. R(x)->W(y) in T1 and R(y)->W(x) in T2?

## Spanner

  • Lock-free distributed read transactions

  • External Consistency - commit order respects global wall-time order (aka linearizability)

  • TrueTime

  • Transactions that write use 2PL and 2PC

  • Timestamp order == commit order

  • Commit wait: pick timestamp s = TT.now().latest, then wait until TT.now().earliest > s before the commit is visible (an expected wait of roughly 2ε)
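
A toy version of the TT.now() interval and the commit-wait rule; the fixed 4 ms ε and the polling loop are placeholders, not how Spanner actually derives its uncertainty:

```python
import time
from dataclasses import dataclass

@dataclass
class TTInterval:
    earliest: float
    latest: float

EPSILON_S = 0.004  # assumed clock uncertainty (Spanner derives it from GPS/atomic clocks)

def tt_now():
    t = time.time()
    return TTInterval(t - EPSILON_S, t + EPSILON_S)

def commit_wait():
    # Pick the commit timestamp s = TT.now().latest, then wait until
    # TT.now().earliest > s before making the commit visible; the wait
    # works out to roughly 2 * epsilon on average.
    s = tt_now().latest
    while tt_now().earliest <= s:
        time.sleep(EPSILON_S / 4)
    return s
```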

| Operation | Timestamp Discussion | Concurrency Control | Replica Required |
| --- | --- | --- | --- |
| Read-Write Transactions | 4.1.2 | pessimistic | leader |
| Read-Only Transactions | 4.1.4 | lock-free | leader for timestamp; any for read |
| Snapshot Read, client-provided timestamp | | lock-free | any |
| Snapshot Read, client-provided bound | 4.1.3 | lock-free | any |

Table: types of reads and writes in Spanner, how they compare

### Invariants

  • Within a single Paxos group, all leaders’ lease intervals are disjoint

  • All timestamps that any leader issues are unique

### Read-only transactions

  • Assign a read timestamp at the group leader (rough sketch below)
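
A rough sketch of serving a read-only transaction at the assigned timestamp; `Replica`, `safe_time` (standing in for Spanner's t_safe), and the version store are assumptions, and `tt_now()` reuses the TrueTime sketch above:

```python
from dataclasses import dataclass, field

@dataclass
class Replica:
    safe_time: float                              # stand-in for Spanner's t_safe
    versions: dict = field(default_factory=dict)  # key -> list of (ts, value)

    def read_at(self, key, s_read):
        # Latest version of `key` with timestamp <= s_read.
        older = [(ts, v) for ts, v in self.versions.get(key, []) if ts <= s_read]
        return max(older)[1] if older else None

def read_only_txn(replicas, keys):
    # Assign the read timestamp at the leader; TT.now().latest always works,
    # though a replica may have to catch up before it can serve the read.
    s_read = tt_now().latest
    for replica in replicas:
        if replica.safe_time >= s_read:
            return {k: replica.read_at(k, s_read) for k in keys}
    raise RuntimeError("no replica caught up to s_read yet; retry/wait")
```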

### Read-write transactions

  • 2PC across groups (the coordinator group’s leader becomes the transaction manager)

  • Wound-wait: a younger (later) transaction waits for an older (earlier) one; an older transaction does not wait, it wounds (aborts) the younger one holding the lock (see the sketch after this list)

  • All commits and writes are logged via Paxos

  • The group leader does commit wait (ensures TT.now().earliest has passed the commit timestamp, i.e. the ε uncertainty has elapsed, before the commit becomes visible)
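
A minimal sketch of the wound-wait rule from the list above (the single-lock class and the `abort_txn` callback are invented for illustration):

```python
class WoundWaitLock:
    """Wound-wait on one lock: a younger requester waits behind the holder;
    an older requester wounds (aborts) the younger holder and takes the lock.
    A transaction's age is its start timestamp (smaller = older)."""

    def __init__(self):
        self.holder = None  # (txn_id, start_ts) of the current holder

    def acquire(self, txn_id, start_ts, abort_txn):
        if self.holder is None:
            self.holder = (txn_id, start_ts)
            return "granted"
        holder_id, holder_ts = self.holder
        if start_ts > holder_ts:
            return "wait"          # requester is younger: it waits
        abort_txn(holder_id)       # requester is older: wound the holder
        self.holder = (txn_id, start_ts)
        return "granted"

    def release(self, txn_id):
        if self.holder and self.holder[0] == txn_id:
            self.holder = None
```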

## Calvin

  • “Deterministic Distributed Transactions” scheduler on top of distributed storage

  • half a million TPS on the TPC-C benchmark

Sequencer

  • 10 ms epochs; all requests received during an epoch are tied up as one batch at the end of the epoch
  • each batch carries the sequencer’s unique ID, the epoch number, and all the transaction inputs
  • => every scheduler has its own global view of the transaction order per epoch
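
A toy epoch-based sequencer in the spirit of the bullets above; the batch format and the threading details are assumptions:

```python
import threading
import time

class Sequencer:
    """Collects transaction requests and emits one ordered batch per epoch:
    (sequencer_id, epoch_number, [transaction inputs])."""

    def __init__(self, sequencer_id, epoch_ms=10):
        self.sequencer_id = sequencer_id
        self.epoch_ms = epoch_ms
        self.epoch_number = 0
        self.pending = []
        self.lock = threading.Lock()

    def submit(self, txn_input):
        with self.lock:
            self.pending.append(txn_input)

    def run_one_epoch(self, dispatch):
        # Wait out the epoch, then cut a batch and hand it to `dispatch`,
        # which would replicate it and forward it to every scheduler.
        time.sleep(self.epoch_ms / 1000.0)
        with self.lock:
            batch = (self.sequencer_id, self.epoch_number, self.pending)
            self.pending = []
            self.epoch_number += 1
        dispatch(batch)
```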

Replication

  • asynchronous and synchronous (Paxos-based) replication
  • one replica is designated as the master replica
    • sequencer failure is complicated: which batch was the last valid batch sent out, and what transactions did that batch contain?
    • ZooKeeper provided the necessary throughput for replicating transactional inputs

Concurrency Control

  • virtual locking could be used … future work
  • instead, each lock manager is responsible for the data on its own data node
  • the lock order must be strictly kept; late transactions must wait
  • active / passive participants
  • local reads => remote reads => local writes (remote writes are done at remote nodes, each with a partial view)
  • simplify -> abort -> retry? / 2i is very simple
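
A sketch of the strictly-ordered lock manager implied by these bullets: lock requests are queued in the global sequence order, so a later transaction can never overtake an earlier one on a contended record (the data structures are my own assumptions):

```python
from collections import defaultdict, deque

class DeterministicLockManager:
    """Grants record locks strictly in the order established by the sequencing
    layer; later transactions simply wait behind earlier ones."""

    def __init__(self):
        self.queues = defaultdict(deque)  # record key -> queue of txn ids

    def request_locks(self, txn_id, keys):
        # Must be called in the global sequence order, so the queue position
        # of each transaction reflects that order.
        for key in keys:
            self.queues[key].append(txn_id)

    def ready_to_run(self, txn_id, keys):
        # A transaction runs once it is at the head of every queue it needs.
        return all(self.queues[key] and self.queues[key][0] == txn_id
                   for key in keys)

    def release_locks(self, txn_id, keys):
        for key in keys:
            if self.queues[key] and self.queues[key][0] == txn_id:
                self.queues[key].popleft()
```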

Pitfall

  • a stalled transaction write may block unrelated following transactions
  • if it can be predicted, put an artificial delay before the transaction and perform a disk-to-memory ‘warm-up’

  • “No need to pay the overhead of physical REDO logging” eh? Instead, the history of transactional inputs is replayed…

Experiment

  • contention index?
  • scales up to 4-8 nodes if contention = 100%

Questions

  • Partial updates? That may happen if a transaction is a combination of partial updates
  • What if a master replica or a sequencer fails during the transaction … failover?
  • Also, how deadlocks are avoided is not clear; clocks sometimes go wrong and the sequencer could give a lock to a later transaction

## Resources