######### Kudu memo #########

* Externally consistent transactions, in the sense that all transactions are linearizable

## Hybrid Time

* Assumption 1. The physical clock's error from the reference clock is bounded
* Assumption 2. Physical clock timestamps are monotonically increasing

Physical Clock API::

    i := server_id()
    PC.Now():
      physical = PC_i(now)
      ε = E_i(physical)
      return physical, ε

where ``PC_i(now)`` is the physical clock built into server i and ``E_i()`` is an upper bound on its skew from the reference time ``PC_ref()``.

HTC API (a runnable sketch of this clock appears at the end of this section)::

    last_physical = 0
    next_logical = 0

    HTC.Now():
      Timestamp now
      int cur_physical = PC.Now().physical
      if cur_physical >= last_physical then
        now.physical = cur_physical
        now.logical = 0
        last_physical = cur_physical
        next_logical = 1
      else
        now.physical = last_physical
        now.logical = next_logical
        next_logical++
      end if
      return now

    HTC.Update(Timestamp in):
      Timestamp now = Now()
      if now.physical > in.physical then
        return
      end if
      last_physical = in.physical
      next_logical = in.logical + 1
      return

* Theorem 1. The HybridTime clock's happened-before relation forms a total order of events
* Q. This is **not trivial**: you cannot order e and f from physical clocks alone when ``PC_i(e) - PC_j(f) < E_i(e) < E_j(f)`` ...
* Theorem 2. For any causal chain ending at event f, the physical component of f's HTC timestamp approximates the "real" time the event occurred, with bounded error ``[-E_j(e), E_i(f)]``

## Types of Transactions

* Read-write transactions
* Snapshot transactions
* Time-travel reads

Write transactions that span multiple RGs (replica groups) need one RG leader to act as the transaction manager (TM):

1. *Prepare*: the TM sends a prepare to all involved RGs, which acquire locks and each respond with their HTC timestamp
2. The TM chooses the highest HTC timestamp and sends it in an *Apply* to the RG leaders

My questions: the bounded error of time at f does not account for the feasibility of read-write or write-write conflicts. The details are not spelled out, which might be future work here. Also, how is an interleaved select-update of ``R(x)->W(y)`` at T1 and ``R(y)->W(x)`` at T2 going to be handled?
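A minimal, runnable Python sketch of the HTC clock above. This is illustrative only: the names (``HybridTimeClock``, ``physical_now``) are mine, the error bound ε is omitted, and the comparison against ``last_physical`` is strict (rather than the ``>=`` in the pseudocode) so that repeated reads within one physical tick still get distinct timestamps::

    import time
    import threading

    class HybridTimeClock:
        """Sketch of the HTC pseudocode above; not Kudu's actual implementation."""

        def __init__(self, physical_now=lambda: int(time.time() * 1_000_000)):
            self._physical_now = physical_now  # local physical clock, in microseconds
            self._last_physical = 0
            self._next_logical = 0
            self._lock = threading.Lock()

        def now(self):
            """Return a (physical, logical) pair; the logical part breaks ties
            while the physical clock has not advanced since the last reading."""
            with self._lock:
                cur_physical = self._physical_now()
                if cur_physical > self._last_physical:
                    self._last_physical = cur_physical
                    self._next_logical = 1
                    return (cur_physical, 0)
                # Physical clock stalled (or stepped back): keep last_physical
                # and hand out increasing logical values.
                ts = (self._last_physical, self._next_logical)
                self._next_logical += 1
                return ts

        def update(self, incoming):
            """HTC.Update: move the local clock past a timestamp received from
            another server, so causally later events get later timestamps."""
            in_physical, in_logical = incoming
            now_physical, _ = self.now()
            with self._lock:
                if now_physical > in_physical:
                    return
                self._last_physical = in_physical
                self._next_logical = in_logical + 1

For example::

    clock = HybridTimeClock()
    t1 = clock.now()
    clock.update((t1[0] + 5, 3))  # timestamp piggybacked on an incoming message
    t2 = clock.now()
    assert t2 > t1                # tuple comparison gives Theorem 1's total order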
## Spanner

* Lock-free distributed read transactions
* External consistency: the commit order respects global wall-time order (aka linearizability)
* TrueTime
* Transactions that write use 2PL and 2PC
* Timestamp order == commit order
* **Commit wait**: pick timestamp ``s = TT.now().latest`` and hold the commit until ``TT.now().earliest > s``, a wait on the order of the average ε

+--------------------+--------------------+--------------------+--------------------+
| Operation          | Timestamp          | Concurrency Control| Replica Required   |
|                    | Discussion         |                    |                    |
+--------------------+--------------------+--------------------+--------------------+
| Read-Write         | 4.1.2              | pessimistic        | leader             |
| Transactions       |                    |                    |                    |
+--------------------+--------------------+--------------------+--------------------+
| Read-Only          | 4.1.4              | lock-free          | leader for         |
| Transactions       |                    |                    | timestamp; any for |
|                    |                    |                    | read               |
+--------------------+--------------------+--------------------+--------------------+
| Snapshot Read,     | ---                | lock-free          | any                |
| client-provided    |                    |                    |                    |
| timestamp          |                    |                    |                    |
+--------------------+--------------------+--------------------+--------------------+
| Snapshot Read,     | 4.1.3              | lock-free          | any                |
| client-provided    |                    |                    |                    |
| bound              |                    |                    |                    |
+--------------------+--------------------+--------------------+--------------------+

Table: types of reads and writes in Spanner, and how they compare

### Invariants

* For a single Paxos group, all leaders' lease periods are disjoint
* All timestamps that any leader issues are unique

### Read-only transactions

* Assign a read timestamp at the group leader

### Read-write transactions

* 2PC between groups (the main group's leader becomes the transaction manager)
* Wound-wait: later transactions wait for earlier transactions (earlier transactions don't wait; they abort the later ones instead)
* All commits and writes are logged via Paxos
* The group leader does commit wait (ensures the commit timestamp has passed, i.e. waits out ε, before the commit becomes visible)

## Calvin

* "Deterministic Distributed Transactions": a scheduler on top of a distributed storage layer
* Half a million TPS on the TPC-C benchmark

### Sequencer

* 10 ms epochs; all requests received during an epoch are tied up as a batch at its end
* A batch carries the sequencer's unique ID, the epoch number, and all transaction inputs
* => every scheduler has its own global view of the transaction order per epoch

### Replication

* Asynchronous and synchronous (Paxos-based) replication
* One replica is designated as the *Master* replica
  * Sequencer failure is complicated: which batch was the last valid batch sent out, and what transactions did that batch contain?
  * ZooKeeper provided the necessary throughput for replicating transactional inputs

### Concurrency Control

* Virtual locking could be used ... future work
* Instead, each lock manager is responsible for the data on its own data node
* The lock order must be strictly kept; late transactions must wait
* *active* / *passive* participants
* local reads => remote reads => local writes (remote writes are done at remote nodes, each with a partial view)
* simplify -> abort -> retry? / 2i is very simple

### Pitfalls

* A stalled transaction write may block unrelated following transactions
* If that can be predicted, put an artificial delay before it and 'warm up' the disk-to-memory path
* "No need to pay the overhead of physical REDO logging" - eh? Instead you replay the history of transactional inputs...

### Experiments

* Contention index?
* Scales up to only 4-8 nodes when contention = 100%

### Questions

* Partial updates? That may happen if a transaction is a combination of partial updates
* What if a master replica or a sequencer fails during a transaction ... failover?
* Also, how deadlocks are avoided is not clear; clocks sometimes go wrong, and the sequencer may give a lock to a later transaction
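On the deadlock question above: Calvin's lock manager grants locks strictly in the global order produced by the sequencer, so no two transactions can ever wait on each other in a cycle. A toy Python sketch of that idea (names are mine, exclusive locks only; the real system also distinguishes read and write locks)::

    from collections import deque

    class DeterministicLockManager:
        """Toy sketch of sequencer-ordered locking; not Calvin's actual code."""

        def __init__(self):
            self._queues = {}  # per-record FIFO of transaction ids, in sequencer order

        def request_locks(self, txn_id, keys):
            """Enqueue txn_id on every record it touches.
            Must be called in the global sequence order."""
            for key in keys:
                self._queues.setdefault(key, deque()).append(txn_id)

        def ready(self, txn_id, keys):
            """A transaction may execute once it heads every queue it is in."""
            return all(self._queues[key][0] == txn_id for key in keys)

        def release(self, txn_id, keys):
            """Release locks after commit so the next transaction in line can run."""
            for key in keys:
                assert self._queues[key][0] == txn_id
                self._queues[key].popleft()

    # T1 and T2 both touch x and y; the sequencer ordered T1 first, so T2
    # simply waits behind it on both records; no lock-order cycle is possible.
    lm = DeterministicLockManager()
    lm.request_locks("T1", ["x", "y"])
    lm.request_locks("T2", ["x", "y"])
    assert lm.ready("T1", ["x", "y"]) and not lm.ready("T2", ["x", "y"])
    lm.release("T1", ["x", "y"])
    assert lm.ready("T2", ["x", "y"])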
## Resources

* [Repo](https://github.com/cloudera/kudu)
* [Hybrid Time paper](http://pdsl.ece.utexas.edu/david/hybrid-time-tech-report-01.pdf)
* [Transaction Semantics](https://github.com/cloudera/kudu/blob/master/docs/transaction_semantics.adoc)
* [Calvin: Fast Distributed Transactions for Partitioned Database Systems](http://cs-www.cs.yale.edu/homes/dna/papers/calvin-sigmod12.pdf)
* [Spanner paper](http://research.google.com/archive/spanner.html)