# Kudu memo
Externally consistent transactions, in the sense that all transactions are linearizable
## Hybrid Time
Assumption 1. Physical clock error from the reference clock is bounded
Assumption 2. Physical clock timestamps are monotonically increasing
Physical Clock API (with i = server_id()):

```
PC.Now():
    physical = PC_i(now)
    ε = E_i(physical)
    return physical, ε
```

where PC_i(now) is a clock built into server i and E_i() is an upper bound on the physical clock's skew from the reference time PC_ref().
HTC API:

```
last_physical = 0
next_logical = 0

HTC.Now():
    Timestamp now
    int cur_physical = PC.Now().physical
    if cur_physical > last_physical then   // strict '>': with '>=' an unchanged clock would reissue (physical, 0)
        now.physical = cur_physical
        now.logical = 0
        last_physical = cur_physical
        next_logical = 1
    else                                   // physical clock hasn't advanced: bump the logical component
        now.physical = last_physical
        now.logical = next_logical
        next_logical++
    end if
    return now

HTC.Update(Timestamp in):
    Timestamp now = Now()
    if now.physical > in.physical then     // already ahead of the incoming timestamp
        return
    end if
    last_physical = in.physical
    next_logical = in.logical + 1
    return
```
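A minimal runnable sketch of the two routines in Python, under the stated assumptions: `time.time_ns` stands in for PC_i, and the error bound E_i is a made-up constant rather than a real NTP-derived bound.

```python
import time

class HybridTimeClock:
    # Runnable sketch of the HTC API above: time.time_ns plays PC_i, and the
    # error bound E_i is a made-up constant, not a real NTP-derived bound.

    ASSUMED_ERROR_US = 500  # hypothetical stand-in for E_i()

    def __init__(self):
        self.last_physical = 0
        self.next_logical = 0

    def pc_now(self):
        # Stand-in for PC.Now(): (microseconds since epoch, assumed error bound).
        return time.time_ns() // 1000, self.ASSUMED_ERROR_US

    def now(self):
        cur_physical, _ = self.pc_now()
        if cur_physical > self.last_physical:  # strict '>' so (physical, 0) is never reissued
            self.last_physical = cur_physical
            self.next_logical = 1
            return (cur_physical, 0)
        ts = (self.last_physical, self.next_logical)  # clock hasn't advanced: bump logical
        self.next_logical += 1
        return ts

    def update(self, incoming):
        # Fold in the timestamp carried on an incoming message, preserving causality.
        # Note: this compares the full (physical, logical) pair; comparing only the
        # physical part, as in the pseudocode above, could move next_logical backwards.
        if self.now() > incoming:
            return
        in_physical, in_logical = incoming
        self.last_physical = in_physical
        self.next_logical = in_logical + 1
```

`now()` would be called for every local event and `update()` on every message receipt, so issued timestamps never go backwards even when the physical clock lags.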
Theorem 1. The HybridTime clock's happened-before relation forms a total order over events
Q. This is not trivial: you cannot order e and f by their physical timestamps alone when the clock-error intervals overlap, i.e. when |PC_i(e) - PC_j(f)| < E_i(e) + E_j(f). For example, with PC_i(e) = 100, E_i(e) = 10 and PC_j(f) = 105, E_j(f) = 10, the intervals [90, 110] and [95, 115] overlap, so physical time alone cannot say which event came first; the logical component (and causal propagation via Update) is what breaks such ties.
Theorem 2. For any causal chain from an event e to an event f, the physical component of f's HTC timestamp approximates the "real" time the event occurred, with the error bounded within [-E_j(e), E_i(f)].
## Types of Transactions
* Read-write transactions
* Snapshot transactions
* Time-travel read
Write transactions that span multiple RGs (replica groups) need one RG leader to act as the transaction manager (TM):
* Prepare: the TM contacts all participating RGs, which acquire their locks and each respond with their HTC timestamp
* Apply: the TM chooses the highest HTC timestamp and sends it to the RG leaders (a sketch follows)
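A sketch of the TM side of that flow; `rg.prepare` and `rg.apply` are hypothetical stand-ins for the Prepare/Apply RPCs to each RG leader, not Kudu's actual API.

```python
def commit_across_replica_groups(participant_rgs, txn):
    # Prepare: each participant RG acquires its locks and answers with its HTC timestamp.
    prepare_timestamps = [rg.prepare(txn) for rg in participant_rgs]
    # The TM picks the highest timestamp, so it is >= every participant's local clock.
    commit_ts = max(prepare_timestamps)
    # Apply: each participant commits at commit_ts and folds it into its HTC via Update().
    for rg in participant_rgs:
        rg.apply(txn, commit_ts)
    return commit_ts
```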
My questions: the bounded error of the time at f does not account for the feasibility of read-write or write-write conflicts between transactions. The details are not dictated, which might be future work here. Also, how is an interleaved select-update of R(x)->W(y) at T1 and R(y)->W(x) at T2 going to be handled?
## Spanner
* Lock-free distributed read transactions
* External consistency: commit order respects global wall-clock order (aka linearizability)
* TrueTime
* Transactions that write use 2PL and 2PC
* Timestamp order == commit order
* Commit wait: pick timestamp s = TT.now().latest, then wait until TT.now().earliest > s before the commit becomes visible (expected wait is roughly 2ε)
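A sketch of commit wait, assuming a TrueTime-like interval API; `tt_now()` returning an `(earliest, latest)` pair with a fixed made-up ±5 ms uncertainty is a stand-in, not the real TrueTime implementation.

```python
import time

def tt_now(assumed_eps_s=0.005):
    # Stand-in for TT.now(): an interval guaranteed to contain real time.
    t = time.time()
    return t - assumed_eps_s, t + assumed_eps_s

def commit_wait():
    # Pick s = TT.now().latest, then block until TT.now().earliest > s,
    # so s is guaranteed to be in the past before the commit becomes visible.
    _, s = tt_now()
    while True:
        earliest, _ = tt_now()
        if earliest > s:
            return s
        time.sleep(0.001)  # the total wait works out to about 2 * epsilon
```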
| Operation | Timestamp Discussion | Concurrency Control | Replica Required |
| --- | --- | --- | --- |
| Read-Write Transactions | 4.1.2 | pessimistic | leader |
| Read-Only Transactions | 4.1.4 | lock-free | leader for timestamp; any for read |
| Snapshot Read, client-provided timestamp | — | lock-free | any |
| Snapshot Read, client-provided bound | 4.1.3 | lock-free | any |

Table: types of reads and writes in Spanner, and how they compare
### Invariants
* Disjointness: within a single Paxos group, all leaders' lease intervals are disjoint
* Monotonicity: all timestamps that any leader issues are unique (assigned in monotonically increasing order)
### Read-only transactions
Assign a read timestamp at the group leader (e.g. TT.now().latest); the read itself can then be served by any sufficiently up-to-date replica, as sketched below.
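A minimal sketch reusing the assumed `tt_now()` interval API from the commit-wait sketch above; `replica_safe_time` (the highest timestamp through which a replica is known to be up to date) is this sketch's simplification of the paper's safe time.

```python
def begin_read_only():
    # The group leader assigns the read timestamp; no locks are taken.
    _, s_read = tt_now()
    return s_read

def serve_read(replica_safe_time, s_read, read_at):
    # Any replica may serve the read once it has applied all writes <= s_read.
    if replica_safe_time >= s_read:
        return read_at(s_read)
    raise RuntimeError("replica not caught up to s_read; wait or retry elsewhere")
```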
### Read-write transactions
* 2PC across groups (one participant group's leader becomes the transaction manager)
* Wound-wait deadlock avoidance: a later (younger) transaction waits for an earlier (older) one, but an earlier transaction never waits; it wounds (aborts) the younger lock holder, as sketched below
* All commits and writes are logged via Paxos
* The coordinator group's leader does commit-wait (ensures the commit timestamp is in the past, i.e. waits out the clock uncertainty ε, before releasing locks)
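A sketch of wound-wait as described above, with age given by an assumed-unique start timestamp (smaller = older); the `Txn`/`Lock` types are hypothetical, and the blocking/retry plumbing is elided.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Txn:
    start_ts: int          # assumed-unique start timestamp; smaller = older
    aborted: bool = False

@dataclass
class Lock:
    holder: Optional[Txn] = None
    waiters: List[Txn] = field(default_factory=list)

def request_lock(lock: Lock, requester: Txn) -> bool:
    # Wound-wait: an older requester wounds (aborts) a younger holder;
    # a younger requester queues up and waits. Unique timestamps impose a
    # strict age order, so no wait-for cycle (deadlock) can ever form.
    if lock.holder is None or lock.holder.aborted:
        lock.holder = requester
        return True                     # lock granted immediately
    if requester.start_ts < lock.holder.start_ts:
        lock.holder.aborted = True      # wound: the younger holder aborts...
        lock.holder = requester         # ...and retries later with its ORIGINAL timestamp
        return True
    lock.waiters.append(requester)      # younger requester waits for the older holder
    return False
```

Restarting a wounded transaction with its original timestamp keeps it "aging", so it cannot be wounded forever and starvation is avoided.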
## Calvin
A “deterministic distributed transactions” scheduler layered on top of a distributed storage system
Half a million TPS on the TPC-C benchmark
Sequencer
* 10 ms epochs: all requests that arrive during an epoch are tied up as one batch at the end of the epoch
* each batch carries the sequencer's unique ID, the epoch number, and all transaction inputs
* => every scheduler can build its own global view of the transaction order per epoch (see the sketch after this list)
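A sketch of the epoch batching; the batch shape and timer handling are assumptions for illustration, not Calvin's actual wire format.

```python
import itertools

class Sequencer:
    # Collects client requests for one epoch, then seals them into a single
    # ordered batch tagged with (sequencer_id, epoch_number).
    EPOCH_MS = 10  # the memo's 10 ms epoch

    def __init__(self, sequencer_id):
        self.sequencer_id = sequencer_id
        self.epochs = itertools.count()
        self.pending = []

    def submit(self, txn_input):
        self.pending.append(txn_input)  # arrival order fixes the intra-batch order

    def close_epoch(self):
        # Called every EPOCH_MS by a timer loop (elided here).
        batch = {
            "sequencer_id": self.sequencer_id,
            "epoch": next(self.epochs),
            "txn_inputs": list(self.pending),
        }
        self.pending.clear()
        # The batch would be replicated (e.g. via Paxos/ZooKeeper) and handed to
        # every scheduler, which can then derive the same global order per epoch.
        return batch
```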
Replication
* asynchronous and synchronous (Paxos-based) replication
* one replica is designated as the master replica
  * sequencer failover is complex: which batch was the last valid batch sent out, and which transactions did that batch contain?
  * ZooKeeper provided the necessary throughput for replicating transactional inputs
Concurrency Control
* virtual locking could be used … future work
* instead, each node's lock manager is responsible for the data stored on that node
* the lock order must strictly follow the sequencer's transaction order; late transactions must wait (see the lock-manager sketch after this list)
* active / passive participants
* local reads => remote reads => local writes (remote writes are done at remote nodes, which only have a partial view)
* simplify -> abort -> retry? / 2i is very simple
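A single-threaded sketch of that ordered locking: as long as `lock_all` is called in the sequencer's global transaction order on every node, every node reaches the same locking decisions, which is the determinism described above (an illustration, not Calvin's actual lock manager).

```python
from collections import defaultdict, deque

class DeterministicLockManager:
    def __init__(self):
        self.queues = defaultdict(deque)  # record key -> FIFO queue of txn ids

    def lock_all(self, txn_id, keys):
        # MUST be called in the global sequence order; the queues then encode
        # the same wait-for order on every node.
        for key in keys:
            self.queues[key].append(txn_id)

    def ready(self, txn_id, keys):
        # A txn may execute once it is at the head of every queue it sits in;
        # otherwise it waits (a "late" txn can never jump ahead of the order).
        return all(self.queues[key][0] == txn_id for key in keys)

    def release_all(self, txn_id, keys):
        for key in keys:
            assert self.queues[key][0] == txn_id
            self.queues[key].popleft()
```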
Pitfall
* a stalled transaction write may block unrelated subsequent transactions
* if this can be predicted, insert an artificial delay beforehand and perform a disk-to-memory ‘warm-up’
“No need to pay the overhead of physical REDO logging”, eh? Instead it replays the history of transactional inputs…
Experiment
* contention index?
* scales up to only 4-8 nodes when contention = 100%
Questions
* Partial updates? That may happen if a transaction is a combination of partial updates
* What if a master replica or a sequencer fails during a transaction … failover?
* Also, how deadlocks are avoided is not clear; clocks sometimes go wrong, and the sequencer may grant a lock to a later transaction
## Resources
* [Repo](https://github.com/cloudera/kudu)
* [Hybrid Time paper](http://pdsl.ece.utexas.edu/david/hybrid-time-tech-report-01.pdf)
* [Transaction Semantics](https://github.com/cloudera/kudu/blob/master/docs/transaction_semantics.adoc)
* [Calvin: Fast Distributed Transactions for Partitioned Database Systems](http://cs-www.cs.yale.edu/homes/dna/papers/calvin-sigmod12.pdf)
* [Spanner paper](http://research.google.com/archive/spanner.html)