Strict Serializable Multidatabase Certification with Out-of-Order Updates

Multi-phase atomic commitment protocols require long-lived resource locks on the participants and introduce blocking behaviour at the coordinator. They are also pessimistic in nature, preventing reads from executing concurrently with writes. Despite these known shortcomings, multi-phase protocols remain the mainstay of transactional integration between autonomous, federated systems. This paper presents a novel atomic commitment protocol, STRIDE (Speculative Transactions in Decentralised Environments), that offers strict serializable certification of distributed transactions across autonomous, replicated sites. The protocol follows the principles of optimistic concurrency control, operating on the premise that conflicting transactions are infrequent. When they do occur, conflicting transactions are identified through antidependency testing on the certifier, which may be replicated for performance and availability. The majority of transactions can be certified entirely in memory. Unlike its multi-phase counterparts, STRIDE is nonblocking, decentralised and does not mandate the use of long-lived resource locks on the participants. It also offers a flexible isolation model for read-only transactions, which can be served directly from the participant sites without undergoing certification. In addition, update transactions are Φ-serializable, making the certifier immune to the recently disclosed logical timestamp skew anomaly.


I. INTRODUCTION
Many complex systems comprise multiple autonomous applications that "own" some subset of the overall system state and encapsulate parts of the business logic. Often, these systems are labelled as microservices [31] and service-oriented; occasionally, they might also be event-driven. Their contemporary classification notwithstanding, it may be more instructive to think of them as a set of federated subsystems that balance between the ambition for autonomy and the overwhelming need to partake in the greater whole. Ideally, these applications adhere to the single responsibility principle and are fairly elemental on their own; their composition, however, can readily grow into a complex web of process communication that presents many problems to system designers. One such problem is maintaining the notion of global consistency without sacrificing autonomy, which is the main focus of this paper.
We elaborate on the problem before delving further. Consider a simple decentralised credit union, comprising multiple branches, each maintaining a subset of the member accounts. Branches may run different versions of the banking software. Funds can be transferred between members internally within the bank's secure network. A transfer must ensure that there are sufficient funds in the outgoing account and that the incoming account can receive those funds.
A naive way of approaching the problem is to engage in a series of discrete (local) transactions with each partaking branch, applying some sort of compensating action in the event of a processing error or a breach of some invariant. This is commonly referred to as the saga pattern [23]. While acceptable in some systems, sagas present a consistency nightmare for system designers. At any point during the orchestration of a saga, the system may appear to be inconsistent, i.e., violate some invariant. Even if the situation is remedied eventually, intermediate inconsistent states may lead to other undesirable states that cannot be easily detected and remedied by the initiating saga. For example, a report that is run during the saga's execution will observe an inconsistency between some pairs of accounts; this could subsequently be misreported as a financial anomaly. Another problem is failures: processes and network links may be unresponsive and may crash altogether. Especially problematic is the failure of the coordinator, which can leave sagas in a partially completed state.
Next, we turn to choreography, a distant cousin of the saga pattern [30]. In a choreographed system, participants closely observe each other's actions and initiate actions of their own, according to some scripted scheme. This pattern is often employed in event-driven systems. In our example, the branch housing the outgoing account might publish an event on a well-known message topic, signifying the start of the transfer. Every branch subscribes to the topic, and upon identifying a transaction that involves one of its accounts, enters the reciprocal transaction in its own ledger, completing the transfer. The main advantage of choreography is that there is no centralised coordinator; provided that the message broker is available, each participant will eventually complete its side of the transaction. There is a problem, however, if one of the participants is unable to complete the transaction; for example, if the incoming account is suspended due to a pending investigation. The affected branch might publish a failure message, which could be acted upon by the initiator, rolling back the transfer on the outgoing account. Even in such a simple example, choreography quickly becomes unwieldy: we effectively find ourselves spreading fragments of the overarching business logic across autonomous participants, making it difficult to reason about as a whole. The more pressing problem, however, is that choreography, like sagas, is eventually consistent. Generally speaking, if some transaction observes and later comes to depend on an intermediate state resulting from another in-flight transaction (executed within a saga or by choreography), the former may complete despite the latter aborting; hence, the system will fail to converge on a consistent state. In concurrency control theory, this phenomenon is referred to as "dirty reads" and stems from the lack of recoverability [17]. One might say that an eventually consistent system is, in fact, eventually inconsistent.
The issue of transactions, atomicity and isolation has been studied for many decades in the context of databases and distributed systems, and the relevant theories, for example, concurrency control, are well understood; they have been heavily scrutinised in both academic circles and the broader industry and have stood the test of time. ACID works, to put it plainly. It may hence be prudent to consider these theories in the construction of distributed systems.
In yet another solution, one might partially dispense with the notion of autonomy in favour of a shared state, using ACID to govern transactions across branches. Depending on the expected volume of transactions, one may consider a distributed database that conforms to the necessary serializability guarantees; for example, Fauna [33], [39] and FoundationDB [32]. This addresses the problem from a functional standpoint, and a distributed database will ostensibly provide sufficient performance; however, the sites are now strongly coupled by the sharing of their database schema definition and the underlying entity states.
We remark on a minor variation of the above: the use of replicated databases with reduced isolation guarantees to achieve global strict serializability, as demonstrated by Bornea et al. [11] in their work on serializable generalized snapshot isolation (SGSI). This is an extension of the distributed database model, wherein decentralized sites can perform serializable reads from their local replica databases and are required to send their update transactions to an external certifier for final verification and commitment. The replicas and the certifier share a schema definition; however, each replica may have a different version of the entities installed at any given time.
ACID has served as a largely indispensable tool for solving this very problem in monolithic systems; its projection to federated systems has been less successful, however. Using ACID at the database level invariably leads to a significant reduction of autonomy, as participants are bound by a schema contract.
An alternative approach is to group entity data on the basis of known transactions that intersect those entities. For example, transfers span account ledgers and possibly some status flags; therefore, those data elements should be grouped in a single logical database. The same transactions do not modify member details and some sundry account attributes: the latter could reside in a different logical database. In this manner, the system's overall data set is partitioned not by entity aggregates, as is commonly done in microservices and service-oriented architectures, but by the nature of the business processes that atomically read and update those entities. When any branch wishes to initiate a transfer, it forwards the transaction request to the authoritative system responsible for transfers (e.g., a transfer coordinator), which accepts or rejects transfers using ACID under the hood. Once the transfer completes, the initiator is notified of the outcome. Furthermore, each branch can maintain a local read-only projection of the global state that is asynchronously populated by learning the outcomes of all transfers in the system. This bears some similarities to the SGSI approach, in that the transfer coordinator acts as the certifier and the bank branches are de facto replicas; however, the replicas do not share a schema. Nonetheless, there are several notable problems with this approach. First, it presumes an a priori stability of transaction scopes, a premise that may not persist as the system evolves. If the transaction scope changes or a new transaction is introduced that intersects with entities coordinated by different systems, a costly re-engineering effort may be incurred. Furthermore, entities will invariably converge onto a single coordinator over time as more transactions are introduced. (The number of disjoint entity sets decreases over time, as more groupings are added.)
Second, the authoritative systems represent a point of failure and may bottleneck transactions; if the transfer coordinator is faulty or backlogged, key aspects of the system will become inoperable.
This leaves us with atomic commitment protocols. The role of atomic commitment is to coordinate distributed transactions between multiple autonomous sites, such that a transaction either commits at all sites simultaneously or at none. Once the commitment decision is reached, it is durable, and no party may observe a state containing some effects of the transaction but not others. Two-Phase Commit (2PC) [18], [19] is the mainstay of atomic commitment protocols and has received much attention and numerous optimisations [20], [21], [22]. The protocol comprises separate voting and decision phases. A single coordinator is entrusted with executing the protocol over a set of resource managers. 2PC has no mechanism for replacing coordinators, leading to the criticism that it blocks indefinitely if the coordinator is unavailable. Three-Phase Commit (3PC) [24] attempted to address 2PC's main deficiency by adding a coordinator election step. Keidar and Dolev have shown 3PC to block after carefully chosen network partition and merge steps [25] and devised E3PC, which uses view-based exclusive coordinator election.
Atomic commitment addresses the key shortcoming of database-centric models, namely, the loss of site autonomy. On the flip side, the commitment protocols described above present problems of their own. First, resource locking on the participants is a pessimistic concurrency control measure that negatively impacts throughput [17], preventing reads from executing concurrently with writes. Second, multi-phase commitment requires centralised coordination, creating a point of failure on one hand, and forcing business logic to be collocated with the coordinator on the other. Third, participants in a multi-phase commitment protocol must support long-lived resource locks that span multiple phases. Finally, the locking schedule of these protocols may lead to distributed deadlocks, which are more difficult to identify than the local variety. Nonetheless, for lack of viable alternatives, multi-phase atomic commitment is a canonical model in heterogeneous multidatabase systems and federated architectures where site autonomy is imperative [22].

A. CONTRIBUTIONS OF THIS PAPER
This paper presents a novel atomic commitment protocol named STRIDE (Speculative Transactions in Decentralised Environments). STRIDE combines foundational concepts from different disciplines: strict serializable certification from concurrency control theory, heterogeneous autonomous sites from atomic commitment and atomic broadcast from distributed consensus. The resulting protocol coordinates distributed transactions across autonomous sites in an optimistic fashion, on the presumption that most transactions do not conflict. Conflicting transactions are identified through antidependency testing on the certifier, which may be replicated for performance and availability. STRIDE is nonblocking, decentralised, deterministic [16], schema-agnostic and does not mandate the use of long-lived resource locks on the participants. It also offers a flexible isolation model for read-only transactions, which can be served directly from the participant sites. In addition, update transactions are Φ-serializable, making them immune to the recently disclosed Logical Timestamp Skew (LTS) anomaly [16].
The rest of this paper is structured as follows. Section II presents the necessary theoretical concepts and common definitions that are cited throughout this paper. Section III introduces the system model and outlines the model assumptions. Section IV describes the Stateless Certification algorithm that operates over a finite suffix of transactions. Section V offers a refinement of the algorithm that significantly reduces the false abort rate by introducing Antecedent Set Reification. Section VI recounts related work in the field of distributed databases. Section VII summarises this paper.

II. DEFINITIONS AND THEORETICAL FOUNDATIONS
This section summarises known concepts and results of concurrency control theory and introduces some useful definitions as a basis for the following sections.
A. TRANSACTIONS AND HISTORIES
We write ri[x, v] to denote that transaction Ti reads data item x, written by Tj, with value v. Similarly, wi[x, v] means that Ti writes v into x.
The →S precedence operator denotes a happened-before relationship, where a →S b means that a's occurrence preceded b's in some schedule S. If S is obvious from the discussion context, the subscript may be omitted. Formally, the precedence operator is a strict partial order: a binary relation that is irreflexive (¬(a → a)), transitive (a → b ∧ b → c ⇒ a → c) and, consequently, asymmetric (a → b ⇒ ¬(b → a)).
We use ci and ai to indicate the commit and abort operations, respectively, of Ti. When the terminal operation exists but is not specified to be one of commit or abort, it is denoted as ei (ended). For the value read by Tj from Ti to be well-defined, the read must be ordered with respect to Ti's terminal operation; otherwise, the value read by Tj cannot be accurately determined.
A prefix of a partial order P over a set S is a partial order P' over a set S' ⊆ S such that:
• If q ∈ S' and p →P q, then p ∈ S'; and
• If p, q ∈ S', then p →P q ⇔ p →P' q.
A history is any prefix of a complete history. We consider prefixes to reason about instantaneous state, where only some of the queued transactions have been fully executed, or even partially executed, for instance, in the event of a failure. A history is represented as a partial order over a set of operations. The set of operations may be omitted, as it is trivially inferred from the partial order. A partial order is sufficient to produce a directed graph (digraph) like the one in Fig. 1, where operations are vertices and edges denote precedence.
A schedule is a partial order over a set of events; a history is a schedule. This paper uses schedules to refer to partial orders (and, by extension, total orders) outside of transactional operations, for instance, to depict sequences of logical timestamps.
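The precedence and projection definitions above can be made concrete with a small sketch. The `History` class below, a hypothetical construction of our own (not part of STRIDE), stores a history as a digraph of operation names in the paper's notation and supports a transitive precedence test and projection on a subset of operations:

```python
# Hypothetical sketch: a history as a precedence digraph over operations.
# Operation names like "w1[x]", "r2[x]", "c1" follow the paper's notation;
# the class and helper names are illustrative only.

class History:
    def __init__(self, edges):
        # edges: iterable of (a, b) pairs meaning a -> b (a precedes b)
        self.edges = set(edges)
        self.ops = {op for pair in edges for op in pair}

    def precedes(self, a, b):
        # transitive closure check via DFS: does a ->* b?
        stack, seen = [a], set()
        while stack:
            cur = stack.pop()
            for (p, q) in self.edges:
                if p == cur and q not in seen:
                    if q == b:
                        return True
                    seen.add(q)
                    stack.append(q)
        return False

    def projection(self, subset):
        # projection on a subset of operations: keep only edges within it
        return History({(a, b) for (a, b) in self.edges
                        if a in subset and b in subset})

# T1 writes x and commits; T2 reads x after T1's write, then commits.
h = History([("w1[x]", "c1"), ("w1[x]", "r2[x]"), ("r2[x]", "c2")])
assert h.precedes("w1[x]", "c2")          # transitively, w1[x] -> c2
proj = h.projection({"w1[x]", "c1"})      # projection on T1's operations
assert proj.edges == {("w1[x]", "c1")}
```

Note that the projection simply discards edges involving erased operations, mirroring the commit projection's erasure of aborted transactions.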
Let S and S' be two sets such that S' ⊆ S and let P be a partial order over a set S. A projection of P on a set S' is a partial order P' (a subset of P) consisting of all elements in P involving only elements of S'.
For a complete history H (a partial order →H over a set of transactions 𝒯), a projection of H on the set of all operations of some transaction Ti ∈ 𝒯 produces a partial order →i over Ti, complete with all operations of Ti and void of all other operations. A (prefix) history, however, may project a partial history for some transaction that does not conform to the transaction axioms above.
The commit projection C(H) of a history H is its projection on its set of committed transactions. Commit projections are used to erase the operations of aborted transactions.
Two histories H and H' are conflict equivalent if they are over the same set of transactions and order the conflicting operations of nonaborted transactions identically. Alternatively, letting S(H) denote the set of topological sorts of H, H and H' are conflict equivalent if, for some pair of topological sorts, G ∈ S(H) and G' ∈ S(H'), it is possible to arrive from G to G' through a series of swaps of pairs of adjacent nonconflicting operations in G.

B. CONFLICTS AND EQUIVALENCE OF HISTORIES
Two operations conflict if they belong to different transactions, operate on the same data item and at least one of them is a write. All pairs of histories that are conflict equivalent are also view equivalent.

C. SERIAL HISTORIES, ISOLATION PROPERTIES AND SCHEDULERS
A serial history is one in which, for every transaction, the operations of that transaction are contiguous (not interleaved with the operations from any other transaction). Formally, for a serial history H, for all Ti, Tj ∈ H (i ≠ j), either ∀ p ∈ Ti, q ∈ Tj : p →H q or ∀ p ∈ Ti, q ∈ Tj : q →H p. A serial history may be compactly represented by listing the transactions in their serial order; e.g., (T1, T2, T3). (Assuming the partial order of those transactions' operations is recorded elsewhere.) An isolation property (or isolation level, as the terms may be used interchangeably) is a predicate I applicable to an execution history H, such that I(H) iff (if and only if) the property holds.
As is common elsewhere in the literature, a property's abbreviation is used as a surrogate for the set of all histories permitted by that property. For example, SR is an abbreviation for serializable and is also a surrogate for the set of all serializable histories. Formally, for an isolation property I, its abbreviation is the set comprehension {H ∈ ℍ : I(H)}, where ℍ is the set of all histories.
As preamble for the subsequent definitions, let Start(T) and End(T) resolve the real-time start and end timestamps, respectively, of a transaction T, with Start(T) < End(T). Start and End can be used to enforce a real-time constrained precedence partial order →RT over the set of transactions, such that Ti →RT Tj ⇒ End(Ti) < Start(Tj). Note, there is no requirement that Start and End faithfully mirror some reference global clock; only that they correctly impose a temporal order over events. In other words:
• Every timestamp is unique.
• No two timestamps are the same unless they represent the same event.
• A higher-valued timestamp occurs later than a lower-valued timestamp.
We say that a transaction Ti precedes another transaction Tj if Ti ended before Tj started; i.e., End(Ti) < Start(Tj) ⇒ Ti → Tj.
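The real-time precedence relation admits a one-line test. The sketch below is illustrative (the function name and the tuple encoding of transactions are our own assumptions); two transactions are related only when one's end timestamp falls before the other's start, so overlapping transactions are related in neither direction:

```python
# Illustrative check of real-time precedence: Ti precedes Tj iff
# End(Ti) < Start(Tj). The name and encoding are our own.

def precedes_rt(ti, tj):
    # ti, tj are (start, end) timestamp pairs with start < end
    return ti[1] < tj[0]

t1, t2, t3 = (1, 4), (5, 9), (3, 7)
assert precedes_rt(t1, t2)       # T1 ended (4) before T2 started (5)
assert not precedes_rt(t1, t3)   # T1 and T3 overlap: concurrent,
assert not precedes_rt(t3, t1)   # hence unrelated in either direction
```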
Snapshot isolation (SI) is both an isolation property and a concurrency control mechanism that satisfies this property. Informally, it guarantees that all reads made in a transaction will see the most recent consistent snapshot of the database [6], [25]. Formally, if Ti reads a data item x, then x was written by a committed transaction Tj that precedes Ti, and there is no committed Tk (j ≠ k) that also precedes Ti that wrote to x after Tj, and Ti either observes Tj's write to x or its own write to x, whichever is more recent. In other words, writes by concurrent transactions that are active after Ti starts are not visible to Ti [1], [27].
Snapshot isolation prevents the read skew anomaly, in which a transaction observes an inconsistent view of the database. For example, suppose T1 reads x, then T2 updates both x and y. If now T1 reads y, it will see an inconsistent state. SI prevents T1 from observing T2's concurrent write to y.
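The read skew scenario above can be replayed against a minimal multiversion store. This is a sketch under our own assumptions (the `VersionedStore` class and its `begin`/`read`/`commit` methods are illustrative, not part of STRIDE): T1 pins a snapshot before T2 commits, so it continues to observe the consistent pair of values:

```python
# A minimal multiversion store illustrating how SI prevents read skew.
# All names here are illustrative.

class VersionedStore:
    def __init__(self, initial):
        self.commit_ts = 0
        # item -> list of (commit timestamp, value), oldest first
        self.versions = {k: [(0, v)] for k, v in initial.items()}

    def begin(self):
        return self.commit_ts  # the snapshot is the latest committed ts

    def read(self, snapshot, item):
        # return the newest version no later than the snapshot
        return max((tv for tv in self.versions[item] if tv[0] <= snapshot))[1]

    def commit(self, writes):
        self.commit_ts += 1
        for item, value in writes.items():
            self.versions[item].append((self.commit_ts, value))

db = VersionedStore({"x": 1, "y": 1})
t1 = db.begin()              # T1 takes a snapshot
assert db.read(t1, "x") == 1
db.commit({"x": 2, "y": 2})  # T2 updates both x and y and commits
# T1 still sees the consistent pair (1, 1), not the mixed state (1, 2):
assert db.read(t1, "y") == 1
```

A transaction that begins after T2's commit would, of course, observe the new pair (2, 2).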
Generalized snapshot isolation (GSI) relaxes the timing constraint [12] of SI. In the generalised variant, a transaction Ti observes a snapshot that is consistent with some prefix of operations involving committed transactions that precede Ti. GSI allows pathological orderings. In the most extreme example, the returned snapshot may be consistent with an empty history; i.e., a view of the database that corresponds to its initial state.
Item cut isolation (I-CI) is an isolation property wherein each transaction reads from a non-changing cut, or snapshot, over the discrete data items [5].
Monotonic atomic view (MAV) is an isolation property wherein once some of the effects of a transaction Ti are observed by another transaction Tj, thereafter, all effects of Ti are observed by Tj [5]. Applying MAV in conjunction with I-CI (i.e., MAV ∩ I-CI) prevents read skew anomalies [5].
Serializability, informally, is an isolation property that states that a concurrent execution of transactions is serializable if its outcome is equivalent to some outcome in which those transactions execute serially. It serves as the foundation for reasoning about the consistency of a system in the presence of concurrent operations. Serializability is often presented as a necessary correctness condition in the design of concurrent systems.
A history is (conflict) serializable (SR; is in SR) if its commit projection is conflict equivalent to some serial history. The qualifier 'conflict' is typically omitted; serializable histories are assumed to be conflict serializable. A serial history is serializable.
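Conflict serializability is commonly tested by building a serialization graph and checking it for cycles. The sketch below uses our own encoding of a history (a list of `(txn, op, item)` triples) and our own function names; it is illustrative rather than the paper's algorithm:

```python
# A sketch of conflict-serializability testing: build a serialization
# graph over transactions and check it for cycles.

def serializable(history):
    """history: ordered list of (txn, op, item), op in {'r', 'w'}.
    Edge Ti -> Tj when an operation of Ti conflicts with a later
    operation of Tj (same item, at least one write, i != j)."""
    edges = set()
    for i, (ti, oi, xi) in enumerate(history):
        for tj, oj, xj in history[i + 1:]:
            if ti != tj and xi == xj and "w" in (oi, oj):
                edges.add((ti, tj))

    # cycle detection by depth-first search
    def cyclic(node, stack):
        if node in stack:
            return True
        return any(cyclic(b, stack | {node}) for (a, b) in edges if a == node)

    return not any(cyclic(t, frozenset()) for t in {t for t, _, _ in history})

# Equivalent to the serial history (T1, T2): all conflicts point T1 -> T2.
assert serializable([("T1", "w", "x"), ("T2", "r", "x"), ("T2", "w", "y")])
# Classic nonserializable interleaving: edges T1 -> T2 and T2 -> T1 on x.
assert not serializable([("T1", "r", "x"), ("T2", "w", "x"), ("T1", "w", "x")])
```

An acyclic graph admits a topological sort, which names the equivalent serial history; a cycle means no such history exists.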
A history is view serializable if the commit projection of every prefix of that history is view equivalent to some serial history. Clearly, from the earlier definition of equivalence, a conflict serializable history is view serializable.
A blind write occurs when a transaction writes a value without reading it. A serializable history that is not conflict serializable must contain a blind write.
A scheduler controls the concurrent execution of transactions [17]. In practice, a scheduler is a program or a set of programs that form a core part of a database system. A scheduler restricts the order in which read, write, commit and abort operations from different transactions are applied, such that the resulting history conforms to some a priori isolation property. Formally, I schedulers are schedulers that generate histories in I, where I is the set comprehension of an isolation property I. For example, "serializable schedulers generate histories in SR," or stated otherwise, "serializable schedulers generate serializable histories." Serializable histories permit certain pathological orderings of transactions in the equivalent serial histories. A serializable scheduler can reorder all transactions comprising only reads towards the beginning of the history, and all transactions comprising only writes to the end of the history. Thereby, in an extreme (but plausible) scenario, all read transactions return the initial state for some value x and all writes to x are discarded. These, loosely speaking, "optimisations" are not consciously employed in practice in database systems; however, in some databases (particularly in the distributed variety) these effects may inadvertently occur from time to time.
Consider a pair of transactions Ti and Tj that are processed serially (Tj is submitted only after Ti is decided) with no other interfering transactions. For some data item x written by Ti and subsequently read by Tj, rj [x] is not guaranteed to observe wi[x] in serializable histories. I.e., serializability does not offer any real-time, causal or even per-process guarantees.
Strict serializability is an isolation property that includes the serial-emulating property of serializability, in addition to imposing a real-time constraint on nonconcurrent transactions.
A history is strictly serializable (S-SR; is in S-SR) if its commit projection is conflict equivalent to a serial history H, and for all pairs of transactions Ti, Tj that are both in H, End(Ti) < Start(Tj) ⇒ Ti →H Tj.
Strict serializable histories do not permit the pathological orderings that serializable histories are prone to. Specifically, if transactions Ti and Tj are processed sequentially, then for some data item x written by Ti and subsequently read by Tj, rj[x] is guaranteed to observe wi[x].

D. CONCURRENCY CONTROL METHODS
When a scheduler receives an operation, it has the choice of scheduling it immediately, delaying it (inserting it into a queue to schedule later) or rejecting it (thereby aborting the transaction).
Schedulers can be classified based on their propensity to delay or reject operations. An aggressive scheduler tends to schedule operations immediately, but in doing so, it foregoes the opportunity to reorder the operations later on. Conversely, a conservative scheduler tends to delay operations, which may lead to unnecessarily postponing operations that could be completed immediately. A serial scheduler is an extreme case of a conservative scheduler, in which the operations of all transactions but one are delayed. A certifier is a special class of schedulers that never delays transactions.
Pessimistic concurrency control is a concurrency control method (a set of algorithms) that assumes that conflicts among transactions are frequent and thereby acts conservatively with respect to scheduling. Conversely, optimistic concurrency control assumes that conflicts are rare and schedules operations aggressively, looking for indications of conflicts and aborting transactions as appropriate.
Two-phase locking (2PL) is a pessimistic concurrency control method that guarantees serializability. Locks are applied and removed in two phases:
1. Expanding phase: locks are acquired, and no locks are released.
2. Shrinking phase: locks are released, and no locks are acquired.
2PL can be reduced to a single rule wherein a lock is never acquired after one has been released. 2PL is not strict; the basic protocol allows data items to be read from transactions that have not committed. 2PL (and its variants) are deadlock-prone, in that operations may be delayed to a point where none of the queued operations may proceed, pending other queued operations.
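The single-rule formulation of 2PL can be enforced with a per-transaction flag. The toy lock manager below is a sketch of our own (the class and exception names are illustrative); it tracks only the two-phase rule, not lock compatibility or deadlock handling:

```python
# A toy lock manager enforcing the two-phase rule: once a transaction
# releases any lock, it may not acquire another. Names are illustrative.

class TwoPhaseLockError(Exception):
    pass

class TwoPhaseTxn:
    def __init__(self):
        self.held = set()
        self.shrinking = False  # set once the first lock is released

    def acquire(self, item):
        if self.shrinking:
            raise TwoPhaseLockError("lock acquired after a release")
        self.held.add(item)

    def release(self, item):
        self.held.discard(item)
        self.shrinking = True   # the expanding phase has ended

t = TwoPhaseTxn()
t.acquire("x")
t.acquire("y")
t.release("x")       # shrinking phase begins
try:
    t.acquire("z")   # violates 2PL: acquisition after release
    violated = False
except TwoPhaseLockError:
    violated = True
assert violated
```

S2PL and S-S2PL strengthen this sketch by deferring `release` of write locks (or all locks) until commit or abort.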
Strict two-phase locking (S2PL) is a strengthened variant of 2PL in which write locks held by a transaction are not released until it commits or aborts; thus, data items cannot be read from undecided transactions.
Strong strict two-phase locking (S-S2PL) is a further strengthening of S2PL in which both read and write locks are not released until the transaction terminates [15].
Multiversion concurrency control (MVCC) is a family of optimistic concurrency control methods that permit several versions of a data item to coexist, rather than substituting data items for their most recent updates. MVCC avoids contention by presenting a snapshot view of the database to concurrent transactions; changes made by one transaction will not be disclosed to others until the former commits, at which point (or earlier) transactions may be tested for conflicts.
An execution of transactions can be represented with a direct serialization graph [3], with transactions as vertices, and edges indicating precedence in the apparent serial history. Three types of dependencies exist: write-read dependencies (Tj reads a value written by Ti), write-write dependencies (Tj overwrites a value written by Ti) and read-write antidependencies (Tj overwrites a value read by Ti).
Snapshot isolation (SI), in addition to being an isolation property, is an MVCC method that allows a transaction Ti to commit if Ti's writes have not been superseded by another transaction Tj that committed after Ti's snapshot was taken (the first-committer-wins rule). SI is not serializable, being prone to write skew anomalies.
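The first-committer-wins rule reduces to a comparison between a transaction's snapshot timestamp and the commit timestamps of its written items. The sketch below uses our own illustrative names and a flat map of last-committed timestamps; it is not STRIDE's certification test:

```python
# Illustrative first-committer-wins check, the rule by which SI decides
# whether a transaction may commit. Names are our own.

def fcw_commit_allowed(snapshot_ts, writeset, last_committed):
    """last_committed maps item -> commit timestamp of its latest writer.
    Ti may commit only if none of its written items was superseded by a
    transaction that committed after Ti's snapshot was taken."""
    return all(last_committed.get(x, 0) <= snapshot_ts for x in writeset)

last = {"x": 5, "y": 8}
assert fcw_commit_allowed(8, {"x"}, last)      # x unchanged since snapshot
assert not fcw_commit_allowed(5, {"y"}, last)  # y superseded at ts 8 > 5
```

Because the rule inspects only writesets, two transactions that each read the other's written item (write skew) both pass it, which is why SI falls short of serializability.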
Serializable snapshot isolation (SSI) is a strengthened variant of SI that relies on an abridged form of serialization graph testing to identify potentially nonserializable executions [1], [28].

III. SYSTEM MODEL
The system comprises the following elements.

A. TRANSACTION
A transaction T describes a set of operations that must execute atomically across several autonomous cohorts. It comprises the following attributes.
Transaction ID: Immutable. Abbrev. XID. A unique identifier, e.g., a UUID, assigned by the cohort initiating the transaction.
Snapshot version: Immutable. The version of the cohort database used to validate the transaction on the cohort. Denoted by snapshot(T).
Readset: Immutable. The set of items read by the transaction. Denoted by readset(T).
Read versions: Immutable. The logical timestamps of transactions that last updated the items in T's readset. Denoted by readvers(T).
Writeset: Immutable. The set of items written by the transaction. Denoted by writeset(T).
State mapping: Immutable. Describes how the transaction's updates should be installed, given the existing state. Formally, statemap(T) is a function of the form [x, Sx ↦ Sx']; i.e., given the written item x and its current state Sx, produce a new state Sx'.
A state mapping may be idempotent, wherein the result of statemap(T) is independent of Sx; i.e., repeat applications of the state mapping produce the same outcome regardless of the initial state. In the absence of other updates for x, the cohort can safely install an idempotent update any number of times for Sx, consistently producing Sx'. [x, Sx ↦ 4] is an idempotent state mapping. A nonidempotent state mapping depends on Sx and therefore lacks this property; repeat applications of statemap(T) produce different outcomes. [x, Sx ↦ Sx + 1] is a nonidempotent state mapping.
Version: Once-mutable. Initially ∅; once set, it cannot be changed. Indicates the logical timestamp of the transaction within the encompassing total order. Logical timestamps are monotonically increasing sequence numbers assigned to transactions by the atomic broadcast primitive. Denoted by ver(T).
Status: Once-mutable. Initially pending; terminal states are committed and aborted. Indicates the certifier's final decision to commit or abort the transaction.
Safepoint: Once-mutable. Unset on the candidate and assigned only when the transaction commits. Indicates the lower bound of the snapshot version to which the transaction's update may be applied. Denoted by safepoint(T).
Note, since STRIDE is based on replication, we limit our scope to deterministic transactions. Otherwise, if transactions are nondeterministic, their installation in identical replicas may lead to divergent replica states.
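The transaction attributes above can be gathered into a single record. The sketch below follows the field names of this section, but the dataclass itself and the lambda encoding of state mappings are our own illustrative choices, not part of the protocol specification:

```python
# A sketch of a STRIDE-style transaction record. Field names follow
# Section III-A; the encoding is illustrative only.

from dataclasses import dataclass, field
from typing import Optional
import uuid

@dataclass
class Transaction:
    xid: str = field(default_factory=lambda: str(uuid.uuid4()))
    snapshot: int = 0                         # cohort database version
    readset: frozenset = frozenset()          # items read
    readvers: dict = field(default_factory=dict)  # item -> logical ts
    writeset: frozenset = frozenset()         # items written
    statemap: dict = field(default_factory=dict)  # item -> (Sx -> Sx')
    ver: Optional[int] = None                 # assigned by abcast
    status: str = "pending"                   # -> "committed" | "aborted"
    safepoint: Optional[int] = None           # set only on commit

t = Transaction(writeset=frozenset({"x", "y"}),
                statemap={"x": lambda s: 4,       # [x, Sx -> 4]
                          "y": lambda s: s + 1})  # [y, Sy -> Sy + 1]
# The mapping for x is idempotent (ignores the current state);
# the mapping for y is nonidempotent (depends on it).
assert t.statemap["x"](7) == t.statemap["x"](99) == 4
assert t.statemap["y"](1) == 2 and t.statemap["y"](2) == 3
```

Note that encoding state mappings as pure functions of the prior state keeps the record deterministic, consistent with the replication requirement above.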

B. GLOBAL DATABASE
A set of all objects under the certifier's purview. This is a purely virtual construct that need not manifest in reality; it is nonetheless useful for reasoning about the system's state. The global database can be materialised by suspending the certifier and applying all atomically broadcast updates from the first committed transaction to the last in their logical timestamp order.

C. COHORT DATABASE
A local view of the global database, possibly containing a subset of its items, and for those items, possibly depicting their earlier versions. A cohort database is attached to a set of cohort processes.

D. COHORT
Reads from and writes to its local cohort database and initiates transactions based on the replicated state of the cohort database. A cohort:
• Reads the state of its local database.
• Stages a transaction for subsequent certification (and submits it if the transaction is valid locally).
• Serially applies updates from atomic broadcast, upgrading the database's version in the process.
• May apply updates to the local database out of order if it is safe to do so.
A cohort process may be replicated for performance and fault tolerance.

E. AGENT
Acts as a stateless proxy between the initiator process and the certifier. The role of the agent is to accept candidate transactions from cohorts, submit them for certification, and respond to the initiator when the outcome of the transaction has been decided. Each agent has a unique ID in the system. Agents may be collocated with cohorts for convenience and compactness. An agent process may be replicated for performance and fault tolerance.

F. ATOMIC BROADCAST CHANNEL
Abbrev. abcast. A persistent, total order over the messages used by the certifier, where a message represents some action recorded against a transaction. Abcast ensures that all participating processes receive the same set of messages in the same order, and messages are not lost.
The contents of abcast are available to processes outside the certifier, e.g., to cohorts and agents. Delivered messages are assigned monotonically increasing sequence numbers (their logical timestamps); the timestamps are uniformly observed by all recipients.
We specify the behaviour of abcast without constraining its implementation. In practice, abcast may be constructed by employing distributed consensus algorithms such as Paxos [34] (and its derivatives) or Spire [35].
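The abcast properties stated above can be illustrated with a toy in-process stand-in. This is a sketch only; a real deployment would build on a consensus protocol such as Paxos, and the class and method names here are illustrative.

```python
# Toy abcast stand-in: a persistent, totally ordered log whose indices
# serve as the uniformly observed logical timestamps.
class Abcast:
    def __init__(self):
        self.log = []                      # totally ordered message log

    def broadcast(self, message):
        """Append a message; its index is its sequence number / logical timestamp."""
        self.log.append(message)
        return len(self.log) - 1

    def read_from(self, seq):
        """Every recipient replaying from `seq` sees the same suffix, in the same order."""
        return list(enumerate(self.log))[seq:]
```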

G. TRANSACTION DATABASE
Abbrev. XDB. Maintains the hard state of every decided transaction, i.e., those transactions for which a commit/abort decision has been established. There is a single logical XDB instance for the entire system, which is attached to a set of certifier processes. Writes to the XDB are durable and atomic at the item level; an item in the XDB being a transaction decision record written by a certifier. There is no requirement for cross-item transactions; hence, the XDB may be sharded for performance.
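The XDB's item-level atomic writes, together with the fail-fast and fail-safe decision operators described later in Section IV.E, might be sketched as follows. Names (`Xdb`, `try_decide`, `decide`) are our own, not from the paper.

```python
# Sketch of an XDB honouring decision stability (S1): a prior decision for a
# transaction is never overwritten.
class Xdb:
    def __init__(self):
        self.decisions = {}    # xid -> ("commit" | "abort", safepoint)

    def try_decide(self, xid, outcome, safepoint=None):
        """Fail-safe write: silently keep the existing decision if one exists."""
        return self.decisions.setdefault(xid, (outcome, safepoint))

    def decide(self, xid, outcome, safepoint=None):
        """Fail-fast write: raise on a contradictory repeat decision."""
        existing = self.decisions.setdefault(xid, (outcome, safepoint))
        if existing[0] != outcome:
            raise RuntimeError(f"contradictory decision for {xid}")
        return existing
```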

H. CERTIFIER
Inspects candidate transactions for conflicts, determines an outcome (commit/abort), broadcasts the outcome and saves the transaction state to the XDB. A certifier may be replicated for performance and fault tolerance.

I. CLIENT
An external process that initiates a transaction with a cohort. Its role is not critical to the algorithm's specification; it is included here for completeness.

J. NETWORK AND PROCESS FAILURES
All processes in the system are prone to failures and communicate asynchronously by message passing in a non-Byzantine environment. While processes may crash, they eventually recover (for example, by restarting).
Message communication over abcast satisfies the usual abcast safety and liveness properties; i.e., messages may take arbitrarily long to arrive, but they are not lost, duplicated or delivered out of order, and they are either delivered on all nonfaulty processes or on none. Message communication outside of abcast is subject to losses, duplications and out-of-order delivery; however, messages cannot be undetectably corrupted.
All databases, unless otherwise specified, have access to nonvolatile storage, and persisted data survive the failure of the attached processes. Fig. 2 illustrates the basic system model. It is later expanded upon in subsequent refinements of the algorithm.

K. GUARANTEES
Certification is subject to several "ground rules" that underpin its safety:
S1. Decision stability. Once a transaction has been decided, the decision may not be altered.
S2. Abort side effect freedom. A transaction can always be safely aborted, unless it has already been committed. In the trivial case, every transaction is aborted by the certifier.
S3. Serializability. A transaction can only be committed if doing so results in a history that is equivalent to some history in which all transactions executed serially. In the trivial case, if all transactions are submitted strictly serially, they are all committed, yielding a serial history.
Certification is also subject to the usual safety guarantees expected of agreement protocols:
S4. Nontriviality. If a transaction is committed, then it was submitted by some cohort.
S5. Agreement. If a transaction is decided, then all cohorts observe the same decision value for that transaction.
To satisfy liveness, we assume that sufficient processes and links are nonfaulty for some cohort to eventually submit a transaction, for it to eventually be certified, and for the decision to eventually be learned by the initiating cohort.

IV. ALGORITHM 1 - STATELESS CERTIFICATION
We start with a simple algorithm that satisfies the invariants S1-S5 and is reasonably performant, but may abort more transactions than necessary in certain cases.

A. PRELIMINARIES
Transactions are denoted with the literal Ti, where the subscript is the transaction's logical timestamp, a synonym for its version. I.e., ver(Ti) ≜ i. Transactions are arranged in a total order, captured by abcast. Transactions are described by a Candidate message in the abcast sequence, signalling an intent that may eventuate. The certifier's decision is communicated as a Decision message over abcast.
Logical timestamps are not gap-free; i.e., there may be a "hole" in the numbering where some abcast sequence numbers are not assigned to transactions. A Candidate message may be published in duplicate, in which case the first abcast sequence number associated with a transaction is its logical timestamp; the second and subsequent depictions of that transaction with higher sequence numbers are ignored. (Duplicates are possible owing to failures and retries.) The logical timestamp of the decision message is strictly greater than that of its corresponding candidate message³.
A transaction starts when it reads its local cohort database and completes when it is decided by the certifier. Within that period, the transaction is said to be active. The active period is demarcated by two versions: snapshot(T) for the start and ver(T) for the end. I.e., for a transaction Ti, the active period is [snapshot(Ti), i]. Also, every transaction has a nonzero duration; i.e., snapshot(Ti) < i.
We now present the conventional definition of concurrency. If two transactions are concurrent then there is a period in which they overlap; i.e., one transaction starts while the other is active; neither starts after the other completes. Formally, if a pair of transactions Ti, Tj, are concurrent, then snapshot(Ti) < j ∧ snapshot(Tj) < i; that is, they overlap. Conversely, if snapshot(Ti) ≥ j ∨ snapshot(Tj) ≥ i, then the pair are serial (or nonconcurrent). Note the similarities to the classical definition quoted for snapshot isolation systems [2], where a snapshot represents a point-in-time cut of the items in the database. STRIDE snapshots are somewhat more amorphous, containing updates from transactions eagerly applied out of (logical timestamp) order. Therefore, we do not use iff in the conventional definition, as there is a more elaborate, extended definition of concurrency that will be presented later. That is, if two transactions overlap, it does not imply that they are concurrent. Similarly, if transactions are serial, it does not imply that they do not overlap. Avoiding bidirectional implication here ensures that the conventional definition is still correct, albeit not complete. Specifically, some transactions are not classifiable under the conventional definition.
A transaction is initiated when a cohort (possibly acting on its client's behalf) decides that it must update a set of data items atomically while maintaining global strict serializability. A cohort only has its local database state to go on. We define an update transaction as one that writes some items (its writeset) and optionally reads some items (its readset). If the declared readset is nonempty, we assume that the transaction relied on the items in the readset, such that those items may be considered a part of the transaction's invariants.
For example, a transaction might read integer items x and y, and write their sum to z, but only if they sum to 10; otherwise, the transaction aborts. Hence, the readset is {x, y}, the writeset is {z}, and it is assumed that the transaction is dependent on the values of its readset. I.e., should those values change, its invariants may be breached. In every serial history containing the transaction in the committed state, its inputs sum to 10.
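The running example can be expressed as a minimal local-validation check: read x and y, and stage a write of z = x + y only if the invariant x + y = 10 holds. The function name is illustrative, not from the paper.

```python
# Hypothetical local validation of the x + y = 10 example: return the staged
# (readset, writeset, updates) triple if valid locally, else None (abort).
def stage_sum_txn(db):
    x, y = db["x"], db["y"]
    if x + y != 10:
        return None                      # invariant breached: abort locally
    return ({"x", "y"}, {"z"}, {"z": x + y})
```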
A read-only transaction has an empty writeset. Note, either the readset or the writeset of a transaction must be nonempty for it to be considered legal.
The originating cohort of transaction Ti is Ci, which may be used interchangeably to refer to the cohort's local database. (The reference is clear from the context.) Also, Ci has a baseline version, denoted by ver(Ci), being the version of the most recent transaction, Ti, serially installed in Ci, such that all committed transactions prior to Ti have also been installed. I.e., for every cohort C, ver(C) ≜ max{ver(T) : installed(C, T) ∧ ∀ Th, h ≤ ver(T) : committed(Th) ⇒ installed(C, Th)}, where installed(C, T) denotes that the state mapping of T was installed on C. We assume that some initial progenitor transaction T0 is installed prior to all other transactions; thus, ver(C) ≥ 0.

B. ALGORITHM OVERVIEW
At an outline level, the certification of Tk involves the following steps.
Step 1: Local validation. Ck determines that a candidate transaction is valid locally. I.e., it can be applied to the local database while satisfying all invariants of the transaction.
Failing local validation, the cohort may either abort the transaction or retry validation with a more recent set of values.
In the course of validation, the cohort captures the baseline version of its local database, ver(Ck), at a time that is no later than the time of its reads, storing the result in cpt_snapshot(Tk). I.e., cpt_snapshot(Tk) ≜ picklower(ver(Ck)), where picklower(n) ≜ ε m ∈ 0..n : TRUE. (ε is Hilbert's epsilon operator.) It also captures the versions of all items in its readset, storing the results in cpt_readvers(Tk); i.e., cpt_readvers(Tk) ≜ {ver(x) : x ∈ readset(Tk)}, where ver(x) is the logical timestamp of the most recent transaction that wrote to x on the cohort in question. The capture of read versions and the reading of the values are performed in an atomic step.
Ideally, acquiring the readset and the capture of ver(Ck) are also atomic, such that the captured version is consistent with the time of Tk's reads. The algorithm, however, permits some earlier version of Ck to be captured for a readset containing updates installed after cpt_snapshot(Tk). This is suboptimal in that it may produce unnecessary aborts; however, it does not compromise safety.
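The Step 1 captures can be sketched as follows, assuming the cohort can take its reads and version captures in one atomic step (e.g., under a snapshot or a latch). The `(version, value)` item representation and the function name are our own.

```python
# Hypothetical atomic capture of cpt_snapshot, cpt_readvers and the read values.
def capture(db_items, baseline, readset):
    """db_items maps item -> (version, value); baseline is ver(C)."""
    cpt_snapshot = baseline                          # any m <= ver(C) is permitted
    cpt_readvers = {k: db_items[k][0] for k in readset}
    readvals = {k: db_items[k][1] for k in readset}
    return cpt_snapshot, cpt_readvers, readvals
```

Returning `baseline` itself is the tightest choice for `picklower`; any lower value remains safe but risks unnecessary aborts, as noted above.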
Step 2: Candidate submission. The cohort creates a CertifyRequest, containing just the immutable parts of the transaction, and invokes the agent. The request contains:
• The transaction's unique XID, assigned by the cohort.
• The transaction's readset, writeset, captured snapshot and captured read versions.
• The transaction's state mapping, describing how the transaction should be applied to an existing state to produce the new state, for each item updated by the transaction.
The CertifyRequest call is a blocking remote procedure call on the agent, returning when the transaction completes or times out. Steps 1 and 2 are collectively described by Alg. 1a.
Step 3: Candidate ordering. The agent creates a Candidate message containing the elements of the CertifyRequest, appends its agent ID and abcasts the resulting message. Abcasting a candidate message has the effect of making it persistent and assigning a sequence number to it. (All abcast messages are totally ordered, persistent and uniquely numbered.)
Step 4: Certification. The certifier inspects the candidate transaction, comparing it to some suffix of transactions that precede it in the total order, to identify potential antidependencies. The certification algorithm is described in Section IV.C and Alg. 1b.
The certifier decides the outcome of the transaction. In either case (commit/abort), the certifier installs the decision state of the transaction into the XDB, honouring a prior decision state if one exists. The certifier then abcasts a Decision message, containing the XID, the logical timestamp of the transaction (i.e., its version), the commit/abort outcome and the initiating agent's ID. In the case of a commit, the Decision message also includes a safepoint version, whose calculation is described in Alg. 1c. Decision messages may safely be abcast in duplicate, as their payloads cannot differ.
Step 5: Certification response. The agent receives the Decision message for a transaction ID that has a pending request as of Step 2. It replies to the initiating cohort with a CertifyResponse, containing the decision. The agent may time out the request if it fails to receive a corresponding Decision within a set period.
Step 6: Completion. The cohort learns the outcome of the transaction. In the event of commitment, it may apply the transaction's updates directly to its local database if it is safe to do so, as indicated by the transaction's safepoint. The criteria for installing the update of Tk at Ck and the installation process are described in Alg. 1d. In the event of an abort, the cohort may retry the transaction with updated values or cease further activity on the transaction.
Background applicator. The cohort process maintains a separate background applicator thread that receives abcast Decision messages and installs the state mapping of the committed transactions, either concurrently, in accordance with their safepoint instructions (Alg. 1d), or serially, upgrading ver(Ck) in the process (Alg. 1e). Fig. 3 outlines the certification process.

C. CERTIFIER PROCESS
The certifier process maintains a suffix of candidate messages received from abcast. The suffix is held in memory and sized such that it encompasses all undecided transactions, as well as some decided ones. The size of the suffix is tuneable; typically, it is on the order of several seconds' worth of transactions at peak throughput. The certification algorithm is formally presented in Alg. 1b.
Let the earliest transaction in the suffix be Ti. Certification of a candidate Tk proceeds by the following rules.
R1. If readset(Tk) = ∅, commit Tk.
R2. Otherwise, if snapshot(Tk) < i − 1, try to abort Tk.
R3. Otherwise, if ∃ Tj : snapshot(Tk) < j < k ∧ ¬aborted(Tj) ∧ j ∉ readvers(Tk) ∧ writeset(Tj) ∩ readset(Tk) ≠ ∅, then try to abort Tk.
R4. Otherwise, try to commit Tk.
Aborting and committing of transactions must satisfy S1: for R2, R3 and R4, if some prior decision state exists, the proposed decision is discarded in favour of the existing decision. For R1, a contradiction in decisions between successive applications of Alg. 1b is not possible by the invariants of the algorithm. (See Lemma 7.)
A safepoint is an optimisation that allows cohorts to eagerly install Tk's update out of order, concurrently with other such updates. Upon commitment of Tk, the certifier assigns a safepoint to it by identifying the last non-aborted transaction in the suffix before Tk that is a direct dependency of Tk; that is, it either has a read-write antidependency directed upon Tk, or read-conflicts with Tk, or write-conflicts with Tk. If such a predecessor transaction exists, then its logical timestamp is assigned to Tk's safepoint. Otherwise, there is nothing in the suffix that is superseded by Tk; hence, Tk's safepoint is conservatively assigned the logical timestamp of the transaction that immediately precedes the earliest transaction in the suffix. Formally, for a transaction Tk, let the conflict set D be the set of all such direct dependencies of Tk within the suffix; the safepoint is then max(vers(D)). The operator vers(D) is the set of the logical timestamps of all transactions in D, defined by vers(D) ≜ {ver(T) : T ∈ D}.
The operator max(N) is the maximum of a finite, nonempty set of zero-inclusive natural numbers N, defined as max(N ∈ ℘(ℕ0)) ≜ ε n ∈ N : ∀ m ∈ N : m ≤ n.
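As a minimal sketch, the certifier's test might look like the following. R3 and R4 follow the rules as stated; R1 and R2 encode our reading of the algorithm from the examples in Section IV.D (commit outright on an empty readset; try to abort when the snapshot falls short of the suffix), so treat this as an interpretation, not a definitive statement. The representation of the suffix is ours.

```python
# Hypothetical encoding of the R1-R4 certification test. `suffix` maps a
# version j to {"aborted": bool, "writeset": set}; `earliest` is the version
# of the earliest transaction in the suffix.
def certify(k, snapshot, readset, readvers, suffix, earliest):
    if not readset:
        return "commit"                              # R1: no reads, no invariants
    if snapshot < earliest - 1:
        return "try_abort"                           # R2: snapshot outside suffix
    for j in range(snapshot + 1, k):                 # R3: antidependency test
        t = suffix.get(j)
        if t and not t["aborted"] and j not in readvers \
                and t["writeset"] & readset:
            return "try_abort"
    return "try_commit"                              # R4
```

Run against Example 3 of Section IV.D, the sketch conditionally aborts T28 on account of T26, and commits it once 26 is disclosed in readvers.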

D. EXAMPLES
We now provide several examples of certification, covering the rules R1-R4 of Alg. 1b as well as the safepoint calculation in Alg. 1c for committed transactions. Refer to Fig. 4, illustrating a suffix of Candidate messages received from abcast. Candidate transactions are depicted as an ordered sequence; the bold outline indicates the transaction under consideration (i.e., Tk in Alg. 1b). For simplicity, we assume that none of the antidependencies located by R3 was aborted.
Example 1 (empty readset): readset(T6) = ∅ in Fig. 4 (a); T6 is committed outright by R1. We know this to be safe because T6 does not depend on anything outside of its readset and we only consider deterministic transactions; since readset(T6) = ∅, there is no possibility of T6 having its invariants breached upon installation.
Example 2 (straying outside the suffix): snapshot(T15) = 10 in Fig. 4 (b), whereas the suffix boundary is demarcated by T12. 10 < 12 − 1; thus, T15 is conditionally aborted by R2.
The behaviour of Alg. 1b is conservative for the assessment of transactions outside the suffix boundary. Perhaps transaction T11 antidepends on T15; there is no way of knowing, so we assume the worst case. Put simply, T15 is too stale to be certified; it must be retried or abandoned. Section V offers ways of dealing with this contingency. If snapshot(T15) were 11 instead of 10, it would be safe, despite the suffix not starting until 12: R3 considers transactions ahead of the snapshot, which excludes the transaction to which the snapshot points.
Example 3 (antidependency): snapshot(T28) = 23 in Fig. 4 (c), which qualifies for assessment by R3. The algorithm considers all candidates in the range 24..27. writeset(T25) intersects with readset(T28); however, 25 ∈ readvers(T28), meaning that T28's cohort installed T25 out of order and therefore observed its writes. Next, we consider T26, whose writeset also intersects with readset(T28). This time R3 finds a bona fide antidependency, as T28 does not appear to have observed T26's writes. T28 is hence conditionally aborted. T28 reads z as of version 25 while T26 concurrently installs the successor version of z and commits prior to T28. Intuitively, for this history to be serializable, either 1) T28 must observe z at version 26 or later, or 2) T28 must precede T26, or 3) T26 must abort, or 4) T28 must abort. Only option 4 is viable. (Option 2 is excluded due to a serialization order guarantee discussed in Section IV.L.)
Example 4 (conditional commit): snapshot(T35) = 31 in Fig. 4 (d), and there are no transactions in the range 32..34 that antidepend on T35. R4 attempts to commit T35, honouring a previous decision if one was assigned. Assuming T35 was previously undecided, its safepoint will be the highest logical timestamp of the transactions in the conflict set, being 33, seeing that readset(T33) ∩ writeset(T35) ≠ ∅.
The commitment of T35 in R4 is conditional because there is a chance that T35 may have aborted in some previous run of Alg. 1b. This might appear counterintuitive at first, so we elaborate. There is in fact a subtle recursive step concealed in Alg. 1b. Each decision is tied to the decision states of preceding transactions. Suppose we altered the example such that there is some undecided transaction in the range safepoint(T35)..34 that antidepends on T35, hence the certifier aborts the latter by R3. At some later point, the conflicting transaction itself is aborted for whatever reason. A repeat run of Alg. 1b will disregard this antidependency in R3 and fall through to R4, attempting to commit T35. To ensure that decision stability is not compromised, the commitment in R4 must be conditional. Equivalently, the aborting of transactions in R2 and R3 is also conditional.

E. ALGORITHM SPECIFICATION
We now present more formal definitions of the algorithms introduced earlier. There are five subordinate algorithms involved, labelled Alg. 1a through 1e.
The operator picklower(n) returns some natural number, inclusive of zero, that is at most n: picklower(n) ≜ ε m ∈ 0..n : TRUE.
The operators min(N) and max(N) return the minimum and maximum values, respectively, in a given finite, nonempty set of zero-inclusive natural numbers N: min(N) ≜ ε n ∈ N : ∀ m ∈ N : n ≤ m; max(N) ≜ ε n ∈ N : ∀ m ∈ N : m ≤ n.
For brevity, we assume that the cohort directly submits candidates over abcast; that is, the agent is collocated with the cohort. In most cases, the role of the agent can be collapsed into the cohort for Alg. 1a. (For example, the agent may be deployed as a library embedded in the cohort process.) We later discuss conditions under which the separation of agents from cohorts is worthwhile.
The operators commit(T) and abort(T) are fail-fast: they do not attempt to overwrite a decision that has already been assigned in the XDB for T, but if the assigned decision contradicts the proposed decision, they return an error. In practice, these operators may be used to assert system invariants; a contradiction for commit(T) or abort(T) implies an implementation defect or state corruption. Also, commit(T) will skip safepoint calculation for an existing decision, honouring the existing safepoint.
The operators try_commit(T) and try_abort(T) are fail-safe: they will not attempt to overwrite the decision, nor will they raise an error if they encounter a contradiction.
The operator vers(D) is the set of the logical timestamps of all transactions in a set D: vers(D) ≜ {ver(T) : T ∈ D}. Out-of-order installation (Alg. 0d) runs upon receiving a decision for Tk on C. The operator install(Tk) conditionally installs updates in Tk's state mapping for those items whose installed versions have not been superseded by a successor of Tk. The notation g' denotes the successor state of g.
The assignment of readvers(Tk) in Alg. 0a and its subsequent read in Alg. 0b are needed to support out-of-order updates. Ordinarily, with serial installation of updates, transactions are concurrent iff they overlap. For Alg. 0, as it turns out, this is a necessary but insufficient criterion for concurrency. Consider a pair of overlapping transactions Ti, Tj, such that snapshot(Tj) < i ∧ snapshot(Ti) < j. Without loss of generality, assume that i < j, Cj has installed Ti's updates by Alg. 0d, and for all items written by Ti and read by Tj, the version read by Tj is at least as recent as that installed by Ti. It follows that either Tj was submitted serially after Ti, or Tj reads no items written by Ti. This is the archetypal scenario for out-of-order updates. In the absence of readvers(Tj), the certifier would erroneously assume an antidependency Tj ⤏rw Ti and summarily abort Tj. Worse still, this would happen in every such case, obviating all benefits of out-of-order updates. By disclosing readvers(Tj), the cohort signals to the certifier that certain transactions preceding Tj in commit order, while overlapping with Tj, are in fact serial with Tj.
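The conditional install(Tk) operator can be sketched per item; the `(version, value)` representation and the function name are our own, not from the paper.

```python
# Hypothetical per-item, version-guarded installation of Tk's state mapping:
# an item is installed only if its currently installed version has not been
# superseded by a successor of Tk.
def install(db_items, txn_version, writes):
    for item, value in writes.items():
        ver, _ = db_items.get(item, (0, None))
        if ver < txn_version:               # not yet superseded
            db_items[item] = (txn_version, value)
```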
We now present the extended definition of concurrency that accounts for out-of-order updates. Two transactions are concurrent iff there is a period in which they overlap and neither transaction observes the other's updates. Formally, a pair of transactions Ti, Tj, are concurrent iff snapshot(Ti) < j ∧ snapshot(Tj) < i ∧ i ∉ readvers(Tj) ∧ j ∉ readvers(Ti).
Next, we introduce Fig. 5 to guide several of the forthcoming proofs. We represent a transaction Ti as a horizontal segment demarcated by a pair of vertices ri (occurring at snapshot(Ti)) and wi (occurring at ver(Ti), or simply, i). The horizontal axis depicts logical time; i.e., points to the left relate to logical timestamps that occur before those to the right. All reads of Ti occur at an instant represented by its ri vertex; all writes of Ti occur at an instant represented by its wi vertex, sometime after the reads. Transactions have nonzero duration; thus, ri is always to the left of wi. Dependencies are shown as directed edges between transaction vertices. This is similar in spirit to a special case of Adya's unfolded serialization graph (USG) [2], where transactions are restricted to reading and writing exactly once, at the beginning and end of their lifespan, respectively. It is also similar to the SC-graph technique of Shasha et al. [8], adopted later by Fekete et al. [1].
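The extended concurrency test can be encoded as a predicate, under the assumption that a transaction's out-of-order observations are captured by membership in its readvers set. The function name is illustrative.

```python
# Hypothetical predicate for the extended definition: the pair overlap and
# neither transaction observes the other's updates.
def concurrent(i, snap_i, readvers_i, j, snap_j, readvers_j):
    overlap = snap_i < j and snap_j < i
    neither_observes = j not in readvers_i and i not in readvers_j
    return overlap and neither_observes
```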
We depict an antidependency Ti ⤏rw Tj as a dashed edge from the ri vertex to wj. Ti ⤏rw Tj is said to be backward-facing if i > j; otherwise, Ti ⤏rw Tj is said to be forward-facing if i < j. i ≠ j owing to the uniqueness of version numbers. Backward-facing antidependencies only occur among concurrent transactions. Note, it is the transactions' relative commit order that determines the direction of the antidependency, not the spatial orientation of the edge (i.e., the direction it might appear to be facing on the Cartesian plane). In Fig. 5 (a), the antidependency Tk ⤏rw Tv is backward-facing, while Tk ⤏rw Tv is forward-facing in Fig. 5 (b). Dependencies of type write-write and write-read are depicted with a solid edge (annotated with ww and wr, respectively) from the appropriate vertices of the respective transactions, i.e., wi to wj for Ti →ww Tj and wi to rj for Ti →wr Tj. Note, ww and wr dependencies always point in a direction such that Ti occurs before Tj for both Ti →ww Tj and Ti →wr Tj. For the Ti →wr Tj case, the pair must execute serially, whereas for Ti →ww Tj, the pair may execute concurrently. Also, a ww dependency cannot face backward in STRIDE because write versions follow logical timestamp order: if Tj overwrites an item written by Ti, then Tj's version is greater than Ti's.
As a corollary of the aforementioned definitions, given that ww and wr dependencies may only point forward, for a directed cycle to occur among a set of transactions, there must exist at least one backward-facing antidependency in the directed graph [3]. In other words, dependencies must "go back in time" somehow for a cycle to materialise [1]; such a cycle is depicted in Fig. 5 (c).

Lemma 5. Alg. 0b does not commit a transaction that is the source of a backward-facing antidependency.
Proof.
We begin by assuming that Alg. 0b commits a transaction Tk that partakes in such an antidependency and prove its absence by contradiction. In addition to the source transaction, Tk, an antidependency requires a target, Tv, upon which the antidependency is incident, i.e., Tk ⤏rw Tv, and v < k by the assumption of the lemma. There are two separate cases under which Alg. 0b may commit a transaction: ⟨1⟩ by the action of rule R1, and ⟨2⟩ by that of R4.
For ⟨1⟩, Tk is committed iff readset(Tk) = ∅. Yet, writeset(Tv) ∩ readset(Tk) ≠ ∅ by the definition of an antidependency. No set can produce a nonempty intersection with an empty set, therefore writeset(Tv) ∩ readset(Tk) = ∅: a contradiction.
Lemma 6. No subset of committed transactions forms a directed cycle in the serialization graph.
Proof. A directed cycle involves at least two transactions and at least one antidependency, by the result of Adya et al. [3]. It has been shown in Lemma 5 that a backward-facing antidependency is proscribed by the algorithm; hence, for a cycle to exist, a dependency of some type must project forward, starting from some transaction Tk, incident upon Tv (i.e., Tk →? Tv, v > k, where →? denotes any dependency type), transit through zero or more intermediate transactions (involving any dependency type), then close upon Tk from some Tu, such that Tu →? Tk and u < k. (Lemma 5 proscribes the u > k case.) This cycle is exemplified in Fig. 5 (c). Assume Tu exists, with the requisite cyclic dependency path originating from Tk.
Since u < k and v > k, then u < v. The only dependency type that may face backward is an antidependency, which is proscribed by Lemma 5. This leads to a contradiction, since Tv must (directly or indirectly) project a backward-facing antidependency upon Tu for a cycle to materialise. □
We use the general definition of a cycle for Lemma 6 of Adya et al. [3], rather than the specific result of Fekete et al. [1] for snapshot isolation, which requires a pivot transaction, Tpivot, and a pair of (possibly identical) transactions Tu, Tv, such that Tu ⤏rw Tpivot ⤏rw Tv, Tu ≠ Tpivot, Tv ≠ Tpivot. The strongest cohort isolation model required by Alg. 0 is MAV ∩ I-CI per Lemma 2, which is weaker than generalized snapshot isolation (and by extension, snapshot isolation) [2], owing to out-of-order updates; therefore, theorems that hold under SI will not necessarily hold here.
The result of Lemma 7 is important for two reasons. First, the system model assumes fail-recovery; therefore, a process must safely resume any operation when recovering from failure. A certifier cannot skip over a pending transaction or issue a contradictory decision. Second, multiple certifier instances may be operating simultaneously over the same set of transactions. This may occur, for example, due to one certifier appearing to have failed as a result of a false suspicion by an eventually perfect failure detector, and another certifier instance taking over, not sensing that the former is in fact operational. Even in the absence of external coordination, and without the guarantee of mutual exclusion, parallel certifiers cannot arrive at divergent outcomes. In another example, multiple certifier instances may intentionally operate in parallel for load balancing, where in a group of n certifiers F1, F2, ..., Fn, a certifier Fi decides transaction Tk such that k mod n = i.
If the process group is subject to dynamic membership, for example, autoscaling, where the group population may expand and contract automatically depending on the load, changes to n may lead to conflicting work assignments; i.e., two certifiers may attempt to assess the same candidate. Lemma 7 guarantees decision stability in all scenarios.

Lemma 8. Alg. 0 is correct for idempotent state mappings.
Proof. Alg. 0b is the algorithm responsible for certification, hence it is the cynosure of this proof. It relies on lemmas that involve key parts of Alg. 0b and other algorithms.
For the decision stability criterion, S1, the result is discharged by Lemma 7.
For the safe abort criterion, S2, an aborted transaction is not installed in any cohort; hence, it has no bearing on the system state, other than to assign an abort decision in the XDB. The status of a transaction is only consulted in R3, in the predicate ¬aborted(Tj) inside an existential quantifier. The correctness of this case is proven separately by Lemma 7.
For the serializability criterion, S3, it suffices to show that there is no subset of committed transactions S in the resulting history, whose constituents form a directed cycle in the serialization graph [3]. This is discharged by Lemma 6.
For the nontriviality criterion, S4, only a transaction received from abcast is subject to certification, which must have been submitted by Alg. 0a according to the invariants of abcast; in turn, the transaction must have been initiated by some cohort.
For the agreement criterion, S5, if a transaction is decided for some outcome, then it cannot subsequently have its state altered by S1 and all cohorts see the same decision value for that transaction by the invariants of abcast and

G. CONSISTENCY MODELS
In Alg. 0, we permit the out-of-order installation of updates, provided the transaction's version is greater than the baseline version of the cohort. Recall, the ver(C) ≥ safepoint(Tk) predicate in Alg. 0d was effectively short-circuited by the invariant safepoint(Tk) = 0. In Alg. 1, what is the purpose of limiting out-of-order installations, thereby restricting the set of admissible histories and reducing concurrency in the process⁴, if Alg. 0 is already safe as it is? Consider the guarantees provided by Alg. 0, vis-à-vis the safepoint calculation and out-of-order updates. For simplicity, we assume for now that all state mappings are idempotent. For every transaction Ti, writes may be installed at any point, provided i > ver(C). Once an update for Ti is installed, all updates for Th, h ≤ i, are thereafter prohibited by the predicate in install(T). Thus, updates under Alg. 0 are both atomic and monotonic: all items in Ti's state mapping are installed or none, and once any transaction Tj, j > i, reads the updates of Ti, then for all items written by Ti, a read by Tj cannot return a value installed by some transaction Tg, g < i. This equates to Monotonic Atomic View (MAV) of Bailis et al. [5]. Specifically, MAV only prescribes monotonicity from the perspective of any given transaction performing the read, but not across transactions [5]. For example, under MAV, a newer transaction Tk may read an older version of the same item read by Tj, where j < k.
Next, consider the atomicity of reads in Alg. 0a. For every staged transaction Ti, we require the capture of readvals(Ti) and cpt_readvers(Ti) to proceed in an atomic step, reading from a uniform snapshot of the database. This property is labelled as Item Cut Isolation (I-CI) by Bailis et al. [5].
Combining MAV with I-CI, reads in Alg. 0a cannot observe the read skew anomaly (A5A in Berenson et al. [6]). MAV ∩ I-CI is equivalent to the read atomic (RA) isolation level of Bailis et al. [7]; per the authors, a system provides RA isolation if it prevents the fractured reads phenomenon and also proscribes reads of uncommitted, aborted or intermediate data.
The main problem with RA is that it is susceptible to the G-update phenomenon of Adya [2]; i.e., single antidependency cycles with update transactions. We briefly recount the phenomenon here; Adya provides the most comprehensive treatment. A history H and transaction Ti exhibit phenomenon G-update if a DSG containing all update transactions of H and Ti contains a cycle with at least one antidependency edge. Adya also introduces the no-update-conflict-misses isolation level and states its equivalence with G-update [2]. In summary, if Ti depends on Tj, it must not miss the effects of Tj, nor of any update transaction that Tj depends or antidepends on.
To explore the problem, consider the following scenario: A leaderboard system is tracking the ranks of three riders in a speedway race, {A, B, C}, with ranks assigned from the set 1..3. Upon each successive lap, the leaderboard is updated by assigning new ranks as appropriate. The updates are recorded in a primary database and asynchronously installed in a replica database, where they are later read and displayed on a spectator screen overlooking the grandstand. Updates are applied atomically at the replica but in arbitrary order. All read-only transactions on the replica satisfy RA isolation.
Riders are initially ranked {A:1, B:2, C:3} on the starting grid, installed by transaction T0. After the first lap, B overtakes A, issuing T1. After the second lap, C overtakes A, issuing T2. The transaction history on the primary is: Assume ver(Tq) = 3, as it appears after all other transactions considered here. Because T1 precedes Tq in the abcast sequence, Alg. 0b will reject Tq either by R2 (if Tq falls short of the suffix boundary) or by R3 (as it locates T1, which antidepends on Tq, occurs after snapshot(Tq) and is not among readvers(Tq)).
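The anomaly at the replica can be demonstrated concretely. The sketch below assumes specific rank assignments for T1 and T2 (swapping the positions of the overtaking and overtaken riders); with atomic but arbitrarily ordered installation, applying T2 before T1 bars T1 forever and leaves two riders sharing a rank, violating the uniqueness invariant even though every individual read is atomic.

```python
# Leaderboard scenario sketch: the replica installs updates atomically but in
# arbitrary order, as Alg. 0 permits. The concrete writes of T1 and T2 are
# assumptions derived from the overtaking narrative.

replica = {"ranks": {"A": 1, "B": 2, "C": 3}, "ver": 0}   # T0: starting grid

def install(ver, writes):
    if ver <= replica["ver"]:
        return False                  # older updates are barred once a newer one lands
    replica["ranks"].update(writes)   # atomic: all writes applied together
    replica["ver"] = ver
    return True

# T1 (B overtakes A) and T2 (C overtakes A) arrive out of order:
assert install(2, {"A": 3, "C": 2})       # T2 installed first
assert not install(1, {"A": 2, "B": 1})   # T1 now prohibited: its effects are lost

# B and C both hold rank 2: the rank-uniqueness invariant is violated.
assert replica["ranks"] == {"A": 3, "B": 2, "C": 2}
```

This is exactly a missed update in the sense of G-update: the replica state depends on T2 without reflecting T1, on which T2 depends.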
In Alg. 0, any query, even one void of writes, must be certified before its results may be used safely elsewhere in the system. (One might make exceptions for analytics-style queries that can make do with an approximation of the global state.) Although Alg. 0 is safe with certification, it would be ideal if read-only transactions could be handled more efficiently.
As a minor segue, the selective allowance for missing updates has been successfully employed in large-scale geo-replicated systems, for example, in Parallel Snapshot Isolation (PSI) of Sovran et al. [8], where the penalty of synchronization is paid in terms of increased latency, hence some pragmatic performance-consistency trade-off is sought. However, PSI guarantees causality of some operations and employs a combination of preferred sites and conflict-free replicated data types (CRDTs) to either localise conflicts or avoid them altogether. Unfortunately, this is not a property of RA (or MAV ∩ I-CI, the two being equivalent).

H. FALSE ANTIDEPENDENCY IDENTIFICATION
Alg. 0b employs a conservative tactic for identifying prospective antidependencies. The predicate in R3 is a sufficient, but not a necessary, condition for Tj to antidepend on Tk. Consider the following history, comprising committed transactions T1-T3 and an undecided transaction T4. T1 installs the states of x and y. T2 and T3 both update x. T4 subsequently reads both x and y. Consider the behaviour of Alg. 0b with the suffix spanning the range T1..T4. ¬R1 ∧ ¬R2 for T4, therefore Alg. 0b evaluates R3. There are two non-aborted transactions in the suffix located between snapshot(T4) and ver(T4) that satisfy the set intersection criterion: T2 and T3. T3 ᐅ T4 because 3 ∈ readvers(T4); hence, T3 is disregarded by R3. However, 2 ∉ readvers(T4), tripping R3 for T2, despite T4 reading from a newer version of x. R3 is not "trained" to recognise the transitive relationship between T2 and T4 via T3, hence T4 is aborted unnecessarily. Clearly, this behaviour is suboptimal.
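The false abort above can be reproduced by a direct rendering of R3's predicate. This is a sketch with assumed encodings (the suffix as (version, writeset) pairs), not the paper's pseudocode; it shows only why the condition is sufficient but not necessary.

```python
# Illustrative rendering of R3's conservative check: abort the candidate if
# any non-aborted transaction after its snapshot wrote an item it read,
# unless that writer's version is among the versions the candidate read.

def r3_aborts(candidate, suffix):
    for j, writeset in suffix:                      # (version, writeset) pairs
        if j <= candidate["snapshot"]:
            continue
        if writeset & candidate["readset"] and j not in candidate["readvers"]:
            return True                             # prospective antidependency
    return False

suffix = [(2, {"x"}), (3, {"x"})]                   # T2 and T3 both update x
t4 = {"snapshot": 1, "readset": {"x", "y"}, "readvers": {1, 3}}

# T4 read x at version 3, so T2's earlier write cannot really antidepend on
# it; yet R3 aborts T4, missing the transitive relationship via T3.
assert r3_aborts(t4, suffix)
```

A necessary-and-sufficient test would compare, per item, the version actually read against the version written, which is the refinement hinted at later.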

I. OPTIMISING READ-ONLY TRANSACTIONS
Read-only transactions are common in many workloads [2], and finding efficient ways of ensuring their consistency is important. Ideally, we would like a strong consistency model that closely approximates serializability for most read-only workloads, with only the most stringent of read-only transactions requiring explicit certification. Such a model exists: update serializability, introduced by Hansdah and Patnaik [4]. Roughly speaking, update serializability ensures that read-only transactions observe a state that is consistent; i.e., one preserving all invariants. Adya also presents an account of update serializability [2], labelling it PL-3U. It is a slight relaxation of regular serializability, labelled PL-3.
In PL-3U, transactions Ti and Tj may each observe a serializable database state, but unlike PL-3, the serial ordering observed by the two transactions may vary. The difference between PL-3U and PL-3 is exemplified in the following: consider three transactions, T0, T1, T2, updating the set of items {x, y}, such that T1 and T2 do not conflict. The history also contains a pair of read-only transactions, Tv and Tu. The dependencies of H3U are depicted in Fig. 7. Note the cycle T1 →wr Tv ⤏rw T2 →wr Tu ⤏rw T1. Each of Tu and Tv individually observes a state that passes for serializable in the absence of the other. However, they do not agree on the order of the update transactions. According to Tu, the serial order is (T0, T2, T1). According to Tv, the serial order is (T0, T1, T2). Thus, read-only transactions may induce the reversal of the order of update transactions. Nonetheless, both transactions observe a consistent state, and provided the two do not directly communicate in a way that makes these discrepancies apparent, update serializability is as good as serializability [2]. Unsurprisingly, H3U can be made serializable by deleting either one of the two read-only transactions. Formally, a history H is update serializable (is in USR) if every history H′ obtained by deleting all but one read-only transaction from H is serializable. In the absence of read-only transactions, H is in USR iff H is in SR [4]. Additionally, SR ⊂ USR.
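The formal definition lends itself to a mechanical check. The sketch below encodes an approximation of H3U's dependency edges (assumed from the cycle described above, plus wr edges from T0) and verifies that the full history is cyclic while every single-reader projection is acyclic, i.e., H3U is in USR but not in SR.

```python
# USR sketch: a history is update serializable if deleting all but one
# read-only transaction always yields an acyclic serialization graph.
# The edge list approximates the dependencies of H3U.

def has_cycle(nodes, edges):
    graph = {n: [d for s, d in edges if s == n and d in nodes] for n in nodes}
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {n: WHITE for n in nodes}
    def dfs(n):                      # depth-first search for a back edge
        colour[n] = GREY
        for m in graph[n]:
            if colour[m] == GREY or (colour[m] == WHITE and dfs(m)):
                return True
        colour[n] = BLACK
        return False
    return any(colour[n] == WHITE and dfs(n) for n in nodes)

updates, readonly = {"T0", "T1", "T2"}, {"Tu", "Tv"}
edges = [("T0", "T1"), ("T0", "T2"),          # both updates depend on T0
         ("T1", "Tv"), ("Tv", "T2"),          # Tv reads T1, antidepends on T2
         ("T2", "Tu"), ("Tu", "T1")]          # Tu reads T2, antidepends on T1

assert has_cycle(updates | readonly, edges)            # H3U is not in SR
for keep in readonly:                                  # but it is in USR:
    assert not has_cycle(updates | {keep}, edges)      # each projection is SR
```

Deleting either reader breaks the cycle, mirroring the observation that H3U becomes serializable once one of Tu, Tv is removed.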
Resuming the discussion on Alg. 1 and its comparison to Alg. 0, we can now state the cardinal reason for the demarcation of safepoints: update serializability. The careful choice of safepoint placement in Alg. 1c and its subsequent enforcement in Alg. 1d ensures that histories on the cohorts satisfy the no-update-conflict-misses invariant. That is, the G-update phenomenon is proscribed, and uncertified reads are thereby consistent.
For the forthcoming propositions, let installed(C, T) denote that transaction T is installed on cohort C.

Lemma 9.
For committed transactions Ti, Tj, if i < safepoint(Tj) and Tj is installed on some cohort, then Ti is also installed on that cohort. I.e., i < safepoint(Tj) ∧ installed(C, Tj) ⇒ installed(C, Ti) for all C.
For ⟨2.1⟩, if p < k, the proof is identical to that of ⟨1⟩ and is omitted for brevity.
For ⟨2.2⟩, if p > k, it implies a backward-facing antidependency, which is prohibited by Lemma 5. □

Earlier, we remarked on the propensity of Alg. 0 to abort transactions due to false antidependency identification. We could modify R3 to cater for the Hfalse-rw scenario (and others like it), extending the predicate to disregard apparent conflicts on items for which the candidate has already read a newer version.

Update serializability is a sufficient isolation property for many read-only workloads [2]; however, there may be occasional queries for which PL-3U is insufficient. Is certification the only way of achieving serializability for read-only transactions? Not in the least.
Recall that the opportunistic out-of-order update mechanism of Alg. 1c and 1d is an optimisation that fast-tracks the state of replicated items, so that update transactions can be seeded from a more up-to-date cohort state, reducing the temporal overlap between transactions. This is unequivocally useful for update transactions because it lowers the rate of conflicts. Suppose, however, that one or more cohorts were dedicated solely to handling read-only transactions, while others could process a mixture of read-only and update transactions. Then, for the dedicated read-only cohorts, one could disable Alg. 1d altogether and apply all updates serially, using batching to maximise throughput. By simply splitting the workloads, we obtain:
• PL-3 for uncertified read-only transactions;
• PL-3U for uncertified read-only transactions that benefit from observing a more recent state; and
• Strict serializability for all certified transactions.

On idempotence: During our proof of Alg. 0, we assumed that all state mappings were idempotent. This tightening of the model is necessary because Alg. 0 disregards dependencies in its safepoint calculation. A nonidempotent state mapping produces a value that depends on the prior state, which may be arbitrary under Alg. 0. To satisfy nonidempotent state mappings, it is necessary to show that transactions are installed deterministically on every cohort, and that if a transaction reads some value for an item x on some cohort, then it reads the same value for x on every cohort at which it is installed. This, in turn, requires us to show that the immediate predecessor of x is installed on every cohort before the nonidempotent state mapping is exercised.

J. SUPPORT FOR SERIALIZABLE BLIND WRITES
Compare the antidependency identification approach of STRIDE with other prominent optimistic and multiversion concurrency control systems. Conventional SI relies exclusively on write conflict identification [6], indiscriminately rejecting concurrent transactions whose writesets intersect. Serializable Snapshot Isolation (SSI) expands upon the write-write conflict identification of SI by identifying a dangerous structure: a condition wherein a triplet of transactions exhibits an adjacent pair of read-write antidependencies. The condition in SSI is sufficient but not necessary for cycle identification [1]. Precisely Serializable Snapshot Isolation (PSSI) elaborates upon SSI with complete serialization graph testing [10]. PSSI is more permissive than SSI, in that it admits a larger set of histories while still rejecting all nonserializable histories. Serializable Generalized Snapshot Isolation (SGSI) extends GSI to a replicated environment, providing additional performance and availability while guaranteeing one-copy serializability. While SI is not serializable, it is nonetheless a useful property [6] for many systems and offers high levels of concurrency, as reads are not blocked by writes, and vice versa. SI prevents a broad range of anomalies: dirty reads, read skew, nonrepeatable reads and lost writes [6]. It also respects the real-time precedence order of nonconflicting transactions, a favourable trait it shares with strict serializability and linearizability [14], while being neither of those things. Importantly, it is the archetypal isolation model upon which its descendants (SSI, PSSI and SGSI) are built; the latter offer serializability while rejecting the same histories that SI would ordinarily reject due to write conflicts.
We suggest that write conflict elimination is a "hangover" from SI, which was designed as a crude but effective mechanism for avoiding histories that might contain lost writes. Due to its known limitations, namely the lack of antidependency testing [1], SI is unnecessarily conservative, eliminating histories that are otherwise serializable [2]. Specifically, SI's requirement that the writesets of concurrent transactions be disjoint is a sufficient condition for identifying lost updates, but not a necessary one. Consider a history containing a blind write: The serialization graph of Hblind is depicted in Fig. 8, clearly indicating that the history is serializable with the conflict equivalent serial order (T0, T1, T2); however, it is proscribed by SI [2]. Furthermore, Hblind is also permitted by strict serializability because T1 and T2 are concurrent. Finally, Hblind is also permitted by Commitment Ordering [15] and Φ-serializability [16], both of which are stronger still. Adya suggests that blind writes are rare, without submitting empirical evidence in support of this [2]. While this viewpoint may have been plausible at some point, it is not unequivocally true today. There are several classes of systems where readsets are routinely disjoint with writesets; for example, gossiping protocols, content caching, sensor networks, wagering systems (the corresponding author's principal area of industry expertise) and financial markets. There is a frequent requirement among such systems to record the most recent value of some data item, such that the update is unconditional on prior values; for example, to record the most recent temperature reading (it does not matter what the previous reading was); or to replace a value in a cache with a newer value, possibly concurrently from multiple processes; or to update the odds of a betting proposition or the market price of a financial instrument.
In such systems, unconditional concurrent updates to the same item often result in either the same value or in values that converge with time. Another set of use cases concerns atomic in-place operations, where items are updated without requiring a prior read; for example, in atomic counters. This is a typical requirement for clickstream analytics, complex event processing, etc., where tallies are transactionally incremented when specific conditions are observed, and the previous value of the tallied item need not be a condition of the transactions in question.
The lack of proper antidependency analysis in SI means that transactions containing blind writes are indiscriminately aborted, leading to unnecessary retries that eventually succeed, often after several attempts (depending on the level of concurrency).
As a side note, because SI and S-SR each admit histories proscribed by the other, we can now state that snapshot isolation is incomparable to strict serializability (and, by extension, to serializability); i.e., SI ⊄ S-SR ∧ S-SR ⊄ SI. Bailis et al. mischaracterise SI as being a proper superset of SR [13]. Adya does not compare SI with SR; instead, it is mislabelled as a proper superset of S-SR [2]. Although the models are incomparable from a set-theoretic perspective, in that neither is a subset of the other, we accept that from a pragmatic viewpoint, SR is a "stronger" model than SI, because the latter proscribes histories that do not cause known anomalies yet permits histories that lead to serialization cycles.
The advantage of antidependency testing in STRIDE is that we can finally dispense with the archaic write conflict testpermitting benign blind writes while simultaneously rejecting lost updates.
Like most other schedulers, STRIDE is conservative, in that it rejects some serializable histories. Consider, for example, an acyclic history containing a backward-facing antidependency: The serialization graph of Hback-ser is depicted in Fig. 9. It is serializable with the conflict equivalent serial order (T0, T2, T1). SI accepts this history; however, STRIDE rejects it due to a backward-facing antidependency T2 ⤏rw T1.

K. THE ROLE OF AGENTS
The agent plays the least interesting part in the protocol for most update transactions and will not be traversed for most read-only transactions. Its role is to emulate a request-response style synchronous interaction with the certifier, which is otherwise entirely asynchronous. Database queries tend to be invoked synchronously, as are coordinators in atomic commitment protocols; likely, this style of interaction will also be preferred by the users of STRIDE. We suggest that in most systems the agent is part of a client library that is collocated with the cohort, rather than a separately deployed process.
One notable use case for a dedicated agent process is the certification of blind writes. A transaction performing a blind write may contain an empty readset; the cohort database serves no purpose therein. Therefore, we can allow a client to submit a write-only transaction directly to the agent, bypassing the cohort, assuming that the client is only concerned with the outcome of the transaction.

L. CONFORMANCE TO STRONGER ISOLATION PROPERTIES
Generally speaking, all known practical scheduler implementations are conservative in rejecting some histories that are otherwise admissible by their claimed isolation properties. This is especially apparent in pessimistic concurrency control systems; for example, in S-S2PL, the dominant pessimistic scheduler among commercial and open-source database systems. S-S2PL proscribes Hback-ser by way of the shared lock of T1 and T2 on x, which prevents the subsequent write of x by T1 until T2 commits. (Hence c2 → c1.) Of the typical pessimistic schedulers, both 2PL and S2PL admit Hback-ser, but 2PL does not guarantee recoverability [17], whereas S2PL has the drawback of requiring prior knowledge of transactions' read patterns.
Conservatism in a scheduler is invariably suboptimal unless it offers some stronger isolation property that is also genuinely useful. While the rejection of Hback-ser does make STRIDE conservative, this approach carries distinct benefits. Namely, STRIDE conforms to both commitment ordering (CO) proposed by Raz [15] and Φ-serializability (Φ-SR) proposed by Koutanov [16]. We briefly recount these properties here.
For a history to be commitment ordered, the order of every pair of conflicting operations in every pair of committed transactions must match the order of their respective commit events [15]. Formally, a history is commitment ordered (is in CO) if for all conflicting operations pi[x], qj[x] of committed transactions Ti, Tj: pi[x] → qj[x] ⇒ ci → cj. CO was devised to ensure global serializability in multidatabase systems [15].
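The CO definition can be rendered as a straightforward check over a history. The encoding below (operations as (transaction, kind, item) triples, positions standing for order) is an assumption for illustration; the predicate itself follows the formal definition above.

```python
# Minimal CO check: for every pair of conflicting operations of committed
# transactions, the operation order must match the commit order. Two
# operations conflict if they touch the same item and at least one writes.

def is_commitment_ordered(ops, commit_pos):
    """ops: list of (txn, kind, item), in operation order.
    commit_pos: txn -> position of its commit event."""
    for a in range(len(ops)):
        for b in range(a + 1, len(ops)):
            (ti, p, x), (tj, q, y) = ops[a], ops[b]
            if ti != tj and x == y and "w" in (p, q):
                if commit_pos[ti] > commit_pos[tj]:
                    return False      # conflict order contradicts commit order
    return True

ops = [("T1", "w", "x"), ("T2", "r", "x")]
assert is_commitment_ordered(ops, {"T1": 0, "T2": 1})      # commits match
assert not is_commitment_ordered(ops, {"T1": 1, "T2": 0})  # commits reversed
```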
Let Start(T) denote the submission time of transaction T to a scheduler. A history H is Φ-serializable (Φ-SR; is in Φ-SR) if it is strict serializable and for every pair of conflicting transactions Ti, Tj, that are committed in H, Start(Ti) < Start(Tj) ⇔ Ti →H Tj [16]. In other words, a Φ-SR scheduler ensures that conflicting transactions are committed in their logical timestamp order. Φ-SR was designed to avoid the logical timestamp skew (LTS) anomaly, wherein records streamed from a serializable primary database to a replica and serially installed on the latter in their logical timestamp order can compromise the consistency of the replica. In STRIDE, logical timestamp order is equivalent to commitment ordering.
Theorem. Every history admissible by Alg. 1 is in Φ-SR and in CO.
Proof. Under abcast, Φ-SR and CO are equivalent, as commitment order follows logical timestamp order. It suffices to show that either Alg. 1 ⊆ Φ-SR or Alg. 1 ⊆ CO.
We assume there exists a pair of committed transactions, Ti, Tj, in some history H admissible by Alg. 1, that are serialized in the reverse order of their logical timestamps, and derive a contradiction. Without loss of generality, assume Start(Ti) < Start(Tj); hence, i < j. Then, Tj → Ti for their serialization order to be reversed. Of the three dependency types, only read-write antidependencies may be projected backwards. This is a contradiction: backward-facing antidependencies are proscribed by Lemma 5. □

M. LIMITATIONS
In its stateless incarnation, STRIDE is limited by the size of the suffix in Alg. 1b. Increasing the suffix allows the certifier to look further "back in time," so to speak, when certifying transactions. A transaction that is staged over a severely lagging cohort snapshot need not be unnecessarily aborted if no conflicting transactions have emerged since its snapshot was taken. The larger the suffix, the further the certifier can look back in time, and the lower the likelihood of aborting nonconflicting transactions. Increasing the suffix, however, brings about several problems:
1. It takes up additional memory on the certifier.
2. It requires scanning through a larger set of records to identify outbound antidependencies in Alg. 1b.
3. It requires scanning through a larger set of records to identify inbound dependencies in Alg. 1c.
4. During recovery, the certifier has to scan from the head of the suffix to ensure that all pending transactions have been decided.
Points 2 and 3 can be addressed by employing either hashing or probabilistic data structures, such as counting Bloom filters, that support both the addition and removal of elements. For point 2, the use of such structures enables R3 of Alg. 1b to expediently determine, for some candidate transaction Tk, that there is no non-aborted transaction Tj whose writeset intersects with Tk's readset. If Tj's absence can be ascertained, Alg. 1b can safely commit Tk by R4; otherwise, if the presence of Tj cannot be ruled out, R3 can initiate a full scan of the suffix and either abort Tk or proceed to R4 if the scan comes up empty.
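The fast path for point 2 can be sketched with a counting Bloom filter over the writesets in the suffix. The sizes and hash construction below are illustrative choices, not prescribed by the paper; the salient properties are that a negative answer is definitive (commit via R4) while a positive answer may be false and merely triggers a full scan, and that elements can be removed as the suffix is truncated.

```python
# Counting Bloom filter sketch for R3's fast path: tracks items written by
# non-aborted suffix transactions. Supports removal, unlike a plain Bloom
# filter. Hash scheme and sizing are illustrative.

import hashlib

class CountingBloom:
    def __init__(self, size=1024, hashes=3):
        self.counts = [0] * size
        self.size, self.hashes = size, hashes

    def _slots(self, item):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for s in self._slots(item):
            self.counts[s] += 1

    def remove(self, item):              # enables suffix truncation
        for s in self._slots(item):
            self.counts[s] -= 1

    def may_contain(self, item):
        return all(self.counts[s] > 0 for s in self._slots(item))

written = CountingBloom()
written.add("x")                     # some suffix transaction wrote x
assert written.may_contain("x")      # maybe-present: fall back to a full scan
written.remove("x")                  # the writer left the suffix
assert not written.may_contain("x")  # definitively absent: commit via R4
```

Because false positives only cost a full scan (never a wrong decision), the filter preserves the safety of R3 while making the common case cheap.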
A similar technique may be employed for point 3 in Alg. 1c. We also have the option of truncating the quantified range of the set comprehension in the safepoint calculation, making safepoint identification conservative. This restricts the range of versions on the cohorts to which Tk may be concurrently applied, implicitly preserving the safety property of the algorithm 7.
For point 4, the certifier can periodically checkpoint its progress in the suffix to a database, to expedite recovery. The checkpoint is the logical timestamp of the first undecided transaction in the suffix. If all transactions in the suffix have been decided, the checkpoint is the successor timestamp of the last decided transaction. Upon recovery (or if a new certifier joins the group), the certifier loads a suffix that starts no later than the checkpointed timestamp; possibly earlier, depending on the preferred suffix length. (Recall, the suffix must contain all undecided transactions and possibly some decided ones.) However, the certifier only needs to start processing transactions from its checkpoint timestamp.

7 I.e., if Alg. 1 is safe to begin with, then safety is preserved if some transaction is assigned a higher safepoint than that permitted by Alg. 1c. In the extreme case, safepoint(T) = ∞ for any T, which effectively disables Alg. 1d.
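The checkpoint rule lends itself to a one-line sketch. The encoding of the suffix as ordered (timestamp, decided) pairs is an assumption for illustration; the rule itself (first undecided timestamp, or the successor of the last decided one) is as described above.

```python
# Checkpoint computation sketch: the checkpoint is the logical timestamp of
# the first undecided transaction in the suffix, or the successor of the last
# decided transaction if all are decided. Suffix encoding is assumed.

def checkpoint(suffix):
    """suffix: ordered list of (timestamp, decided) pairs."""
    for ts, decided in suffix:
        if not decided:
            return ts                      # first undecided transaction
    return suffix[-1][0] + 1 if suffix else 0

assert checkpoint([(5, True), (6, True), (7, False), (8, True)]) == 7
assert checkpoint([(5, True), (6, True)]) == 7   # all decided: successor of 6
```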
Lastly, point 1 cannot be easily addressed. Alg. 1 represents a fundamental trade-off between false aborts and resource utilisation on certifiers. Some prefix of the suffix may be periodically relocated to stable storage to conserve memory; however, this will impact certification performance for transactions whose snapshot falls into this prefix. The following section builds on this basic premise.

V. ALGORITHM 2 -ANTECEDENT SET REIFICATION
The rejection rate of Alg. 1 is constrained by the chosen suffix size, as previously noted. We now consider an optimisation of Alg. 1 that introduces the notion of a semi-permanent state, allowing the certifier to look further back in time without affecting the suffix size. The design objective is to achieve in-memory certification of an overwhelming majority of transactions while allowing for off-process certification of a relatively small subset of transactions, rather than conservatively aborting the latter. The antecedent set is the coalescing of all writes in the prefix, such that only the most recent writes are preserved and assigned the same version numbers that they had in the prefix. Serially applying the transactions in the prefix and in the antecedent set produces two potentially different histories that are nonetheless state equivalent, which is to say, they produce the same end result.
Consider again R2 of Alg. 1b. What compels R2 to abort Tk if snapshot(Tk) < i − 1? Suppose R2 was rewritten to the provisional variation, as per Alg. 2b-pr. The antecedent set is obviously unknown at this point because the algorithm has no knowledge of its contents. At this stage, we must conservatively assume that the antecedent set contains a transaction that antidepends on Tk; i.e., ∃ x ∈ readset(Tk) : ∃ Aj in the antecedent set : snapshot(Tk) < j ∧ x ∈ writeset(Aj) ∧ j ∉ readvers(Tk). Under this assumption, Alg. 2b-pr is equivalent to Alg. 1b; they behave identically for all inputs. In other words, the certification algorithm proscribes a backward-facing antidependency from Tk to any transaction Aj in the antecedent set, on the assumption that such a transaction might exist.

Algorithm 2b-pr
So far, all we have done is delineate the intuition behind Alg. 1b by codifying it in Alg. 2b-pr using our notional antecedent set. As we know, this algorithm may lead to a false abort of Tk; its likelihood can be approximated by some linear function of the size of the antecedent set relative to the combined size of the antecedent set and the suffix. Hence the pressure of increasing the suffix, especially in high-throughput certifiers. Ideally, we would like to significantly reduce the false abort rate by reifying the antecedent set and thus making its contents available to the algorithm. We now remark on several useful properties of the antecedent set.

Compactness:
A convenient aspect of the antecedent set is that it can be represented compactly. Let there be n distinct items processed by the certifier. Clearly, the cardinality of the antecedent set is at most n, because there can be at most one transaction for any written item, by the combination of A2 and A3. Note, A1 merely drops the readset; it does not affect the cardinality of the set.
Transitivity:
Another convenient property of the antecedent set stems from the transitivity of the observes relation. If some transaction Ai in the antecedent set later has its write superseded by a transaction Tj, j > i, we can say that Ai →ww Tj. Furthermore, if Tj ᐅ Tk and wr(Tj, Tk), then Ai ᐅ Tk. This property is essential: it allows us to safely apply A2 (and subsequently A3) without inadvertently discarding a critical antidependency. (This property is later used in the proof of Lemma 11.)

Commutativity:
A final noteworthy property is that all transactions in the antecedent set commute. By A1 and A2, there is no pair of transactions Ai, Aj such that (readset(Ai) ∪ writeset(Ai)) ∩ (readset(Aj) ∪ writeset(Aj)) ≠ ∅. This is convenient because we can arbitrarily reorder the transactions in the antecedent set after its formation.
The addition of a new transaction, An, by applying rules A1-A3 leads to an updated antecedent set that also satisfies all of the aforementioned properties.
The commutativity property implies that the ordering of transactions in the antecedent set is unimportant; rather than representing the set as a sequence, it can instead be represented as a mapping, α, of data items to the versions of committed transactions. If the requested item does not exist among the transactions in the antecedent set, we assume that a notional placeholder for that item has been written by the initial transaction T0, which cannot antidepend on candidate transactions. (Hence α returns 0 if the antecedent set is empty.) A cohort can exploit this property to insert a new item into the global state while satisfying a uniqueness constraint, by declaring the item in its readset. The first transaction that attempts to install a new item commits; subsequent transactions will be aborted by the certifier.
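The α mapping and its compaction rules can be sketched as a dictionary with a default. The class and method names are illustrative; the behaviour shown (latest write per item survives, absent items attributed to the notional T0) follows the description above.

```python
# Sketch of the α mapping: items map to the version of their last committed
# writer; items absent from the antecedent set are treated as written by the
# notional initial transaction T0, so α returns 0 for them.

class Alpha:
    def __init__(self):
        self.versions = {}             # item -> version of last committed writer

    def install(self, ver, writeset):
        for item in writeset:
            self.versions[item] = ver  # compaction: only the latest write survives

    def __getitem__(self, item):
        return self.versions.get(item, 0)   # 0 if the item was never written

alpha = Alpha()
alpha.install(4, {"x"})
alpha.install(9, {"x", "y"})           # supersedes the earlier write of x
assert alpha["x"] == 9 and alpha["y"] == 9
assert alpha["z"] == 0                 # unseen item: attributed to T0
```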

B. EXPANDED SYSTEM MODEL
In the original system model, presented in Section III, the certifier process only has access to the abcast primitive and the XDB for recording decision states. Before introducing the revised algorithm, we must accommodate the antecedent set and the derived α mapping function; accordingly, the system model is revised to encompass the following additional elements. The expanded system model is depicted in Fig. 10.

1) ANTECEDENT STORE
Abbrev. α-store. A soft-state database of item versions that is lazily populated by installing transaction updates from abcast in logical timestamp order. An α-store contains a version number, ver(α), which is the logical timestamp of the most recent transaction installed in that α-store. There is no requirement for the durability of writes; the contents of the store may be lost, provided ver(α) always accurately corresponds to the state of the store. An α-store is effectively a cache.

2) ANTECEDENT REIFIER
Consumes updates from some prefix in abcast, serially installing them in the associated α-store and updating ver(α) in the process. The reifier may be replicated for fault tolerance.

C. ALGORITHM SPECIFICATION
There is now a sufficient foundation for presenting the next algorithm. Alg. 2 is equivalent to Alg. 1 in all but one of its subordinate algorithms: Alg. 2b, for R2. (Rules R1, R3 and R4 of Alg. 2b are carried over unchanged from Alg. 1b.) The function α[x] produces the version of data item x, taken from some α-store, being the logical timestamp of the most recent transaction that updated x in the antecedent set. The suffix must include all undecided transactions, so for Ti to move from the suffix to the antecedent set, Ti must be committed. If Ti antidepends on Tk, then its subsequent move to the antecedent set will preserve this antidependency.

1) OVERLAPPING SUBSETS AND GAPS
Alg. 2b assumes that ver(α) ≥ i − 1; i.e., it allows the contents of the antecedent set to overlap with the suffix. In the case ver(α) = i − 1, the two are complementary: every transaction is captured in either one set or the other, but not in both. In the case ver(α) > i − 1, R3 can be further optimised to certify over a truncated portion of the suffix that starts from ver(α) + 1. Finally, in the undesirable case that ver(α) < i − 1 and snapshot(Tk) < i − 1, the certifier has the choice of either downgrading to Alg. 1b, that is, conservatively aborting Tk on the assumption that an antidependency exists in the antecedent set, or loading the missing transactions from abcast into memory, such that ver(α) ≥ i − 1.
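The case analysis above can be sketched as a decision function. This is an assumed rendering, not the paper's pseudocode: it shows only the dispatch between the suffix path, the α-consulting path, and the conservative downgrade when a gap exists between the α-store and the suffix.

```python
# Illustrative dispatch for the revised R2: when the candidate's snapshot
# predates the suffix (which starts at logical timestamp i), consult the
# α mapping instead of conservatively aborting. Names are assumptions.

def r2_decide(candidate, suffix_start_i, alpha, alpha_ver):
    if candidate["snapshot"] >= suffix_start_i - 1:
        return "certify-from-suffix"          # R2 of Alg. 1b never fires
    if alpha_ver < suffix_start_i - 1:
        return "abort"                        # gap: downgrade to Alg. 1b
    for x in candidate["readset"]:
        j = alpha[x]                          # latest committed writer of x
        if j > candidate["snapshot"] and j not in candidate["readvers"]:
            return "abort"                    # antidependency in the antecedent set
    return "certify-from-suffix"

alpha = {"x": 9, "y": 2}
tk = {"snapshot": 5, "readset": {"x", "y"}, "readvers": {9}}
assert r2_decide(tk, 12, alpha, alpha_ver=11) == "certify-from-suffix"

tk2 = {"snapshot": 5, "readset": {"x"}, "readvers": set()}
assert r2_decide(tk2, 12, alpha, alpha_ver=11) == "abort"
```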
Compacting the prefix of transactions into an antecedent cache alleviates the strain on the certifier in relation to the suffix size while significantly reducing the number of false aborts due to stale cohort snapshots. Crucially, α is an optimisation: the complete or partial loss of the α-store does not impact the correctness properties of the algorithm, provided the data loss is detectable and the α-store is always prefix-complete, be it a zero-length prefix in the extreme case. In other words, if the α-store loses some transaction Tn, then it must also forfeit all subsequent transactions in the corresponding prefix (every Ts with s > n), and ver(α) must report the immediate committed predecessor of Tn.

2) PERFORMANCE OF THE ANTECEDENT STORE
It is possible to operate the system with multiple independent certifiers, each reading from a dedicated α-store or a pool of α-stores. In this manner, the antecedent set can be load-balanced for read throughput.
An α-store instance is also write-intensive, as it must ingest all transaction updates. To maintain prefix completeness, the installation of updates must occur serially. However, there is no need for it to be synchronous with the commitment of transactions: the corresponding reifier process can apply updates asynchronously, at its discretion, akin to the installation of updates on the cohorts in Alg. 2e. The ability to install updates asynchronously affords another optimisation: ordinarily, serial updates are slow because an update cannot commence until the previous update has completed, leaving the system underutilised. So, rather than updating its α-store one transaction at a time, the reifier process can batch updates together, installing them with fewer round trips. This has virtually no bearing on the latency of certification, as most transactions will be certified directly from the suffix, which remains in memory.
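Batched reification can be sketched as follows. The queue and store encodings are assumptions for illustration; the essential points from the text are preserved: updates are drained in logical timestamp order, applied in one serial step, and ver(α) tracks the last installed transaction.

```python
# Sketch of batched installation by the reifier: updates are drained from an
# in-memory queue and applied to the α-store in one serial batch, preserving
# prefix completeness while amortising round trips. Names are illustrative.

from collections import deque

def reify_batch(queue, alpha_store, max_batch=100):
    batch = []
    while queue and len(batch) < max_batch:
        batch.append(queue.popleft())           # maintain logical timestamp order
    for ver, writeset in batch:
        for item in writeset:
            alpha_store["items"][item] = ver    # latest write per item survives
        alpha_store["ver"] = ver                # ver(α) tracks the last install
    return len(batch)

store = {"items": {}, "ver": 0}
q = deque([(1, {"x"}), (2, {"y"}), (3, {"x"})])
assert reify_batch(q, store) == 3
assert store["items"] == {"x": 3, "y": 2} and store["ver"] == 3
```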

E. PROOF SKETCH
For the forthcoming proof, we use primed notation to depict the next state of the antecedent set. Similarly, P′ depicts the truth value of some predicate P in its next state.

Proof. The proof is by induction over the naturals. We consider a base case in which i = 1; i.e., the antecedent set is empty and the suffix encompasses all transactions up to Tk (and possibly Tk's successors). When i = 1, Alg. 2b and Alg. 1b are equivalent for all inputs, because R2 is never triggered. (There is no Tk such that snapshot(Tk) < i − 1.) Therefore, Alg. 2 is correct in the base case, owing to the established correctness of Alg. 1. In the inductive step, i = g, and it suffices to show that the invariant holds for i = g + 1. It further suffices to show that ⟨1⟩ if Tk was aborted in some previous run, it will not be committed by R1 in a subsequent run, and ⟨2⟩ if Tk was committed in a previous run, it will not be aborted by R2 in a subsequent run.
For ⟨1⟩, if Tk was aborted in some previous run, then readset(Tk) ≠ ∅; therefore, Tk will never be committed by R1 in any run. This is analogous to the corresponding proof in Lemma 7.
For ⟨2.2⟩, i ≥ k implies that Tk is no longer in the suffix in the next state; hence, none of the rules will be evaluated for Tk in the next state.
For ⟨2.3⟩, aborted(Ti) implies ¬committed(Ti)′ and that Ti does not enter the antecedent set; hence, the antecedent set is unchanged in the next state. Therefore, ¬R2′. □

Proof. The proof is by induction over the naturals. We consider a base case in which i = 1; i.e., the antecedent set is empty and the suffix encompasses all transactions up to Tk (and possibly Tk's successors). When i = 1, Alg. 2b and Alg. 1b are equivalent for all inputs, because R2 is never triggered. Therefore, Alg. 2 is correct in the base case, owing to the established correctness of Alg. 1. In the inductive step, i = g, and it suffices to show that the invariant holds for i = g + 1. We consider two separate cases: ⟨1⟩ a backward-facing antidependency is incident on a transaction in the suffix, and ⟨2⟩ a backward-facing antidependency is incident on a transaction in the antecedent set.
For ⟨1⟩, let Ac be the conflicting transaction that antidepends on Tk, and pick some x ∈ writeset(Ac) ∩ readset(Tk). Consider two cases: ⟨1.1⟩ there exists a non-aborted Tj in the suffix such that x ∈ writeset(Tj) ∧ j < k, and ⟨1.2⟩ there is no such Tj. In both cases, Tk is not committed by the assumption of the inductive step; it must be shown that Tk does not commit in the next state. If there is some non-aborted Tj in the suffix such that j < k and Tj antidepends on Tk, then the transaction scheduler must adhere to one of three strategies: either 1) wait until Tj is decided, 2) attempt to decide Tj first, or 3) abort Tk on the worst-case assumption that Tj will be committed. As it stands, the algorithm applies strategy 3.
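Strategy 3 can be sketched as a simple set-intersection test over the in-memory suffix. The function name, the triple shape, and the state labels below are illustrative assumptions, not the algorithm's formal rules.

```python
def certify(tk_readset, tk_snapshot, suffix):
    """Sketch of strategy 3: conservatively abort on any backward-facing
    antidependency.

    `suffix` is a list of (timestamp, writeset, state) triples for candidate
    transactions ordered by logical timestamp, where `state` is one of
    'committed', 'aborted', or 'pending'.  Tk is aborted if any non-aborted
    transaction serialized after Tk's snapshot wrote an item that Tk read."""
    for ts, writeset, state in suffix:
        if ts <= tk_snapshot or state == 'aborted':
            continue  # not concurrent, or cannot induce an antidependency
        if writeset & tk_readset:
            return 'abort'   # worst case: assume the writer commits
    return 'commit'
```

Note how a pending writer is treated identically to a committed one: the worst-case assumption of strategy 3 avoids blocking at the cost of occasional false aborts.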
For (1), it would not be difficult to adapt R3 of Alg. 2b to halt Tk's evaluation until Tj is decided; however, this introduces blocking in the scheduler 8 and runs contrary to the spirit of optimistic concurrency control. Even in this case, it cannot induce a deadlock, as blocking only applies to backward-facing antidependencies, and Tj clearly cannot directly block on Tk since j < k. Nor can Tj block on Tk via some transitive dependency, because for a deadlock cycle to materialise, the antidependency direction must be reversed at some point, which would immediately render it nonblocking.
8 Technically, it may no longer be called a certifier if it queues or otherwise delays operations.
9 SGSI refers to them as replicas; STRIDE refers to them as cohorts to highlight their autonomy.
For (2), adapting R3 to recursively decide Tj before Tk has the advantage of forming a perfectly accurate certifier that never falsely aborts. It also means that transactions' decision states can be deterministically reconstructed from abcast, relieving the need for durable writes on the XDB. On the flip side, if transactions are certified concurrently, then in the absence of process coordination, multiple attempts at certifying a transaction may take place simultaneously. This will not compromise safety, as per Lemma 7; however, it may result in duplicate work. Also, checking transaction states requires a round trip to the database; alternatively, transaction states can be cached in memory. This strategy is roughly equivalent to (1), but instead of waiting for some other certifier to commit or abort the antidependency, the present certifier does so itself. Still, it may be more effective than conservatively aborting Tk, and is worth considering. Other serializable certifiers, for example, SLOG [36] and Tashkent [37], also proactively decide antidependency states by certifying transactions in the order of arrival.
The algorithms in this paper are described in terms of (3) primarily because doing so generalises their specifications. That is, an algorithm that does not mandate decisions on antidependency states contains more behaviours than an alternative based on (1) or (2) for the same model size. Namely, for antidependency scenarios, the behaviour of (1) and (2) always results in either a commit or abort state, whereas (3) permits both behaviours while ensuring decision stability (S1). Although (2) has the hallmarks of a compelling optimisation, we leave the final decision to the implementer.

VI. RELATED WORK
This section recounts related work in the field of distributed transaction processing and concurrency control, relating the results of this paper to other contributions that preceded it.

A. SGSI
The prior work that most closely resembles STRIDE is Serializable Generalized Snapshot Isolation (SGSI) by Bornea et al. [11]. The two algorithms share much in common, including:
• Many elements of the overall system model;
• Lazily replicated databases 9 with weakened local consistency properties;
• Support for uncertified read-only transactions 10 ;
• Commitment ordering and Φ-serializability 11 at the system level; and
• Certification over a totally ordered sequence of candidate transactions.
10 SGSI supports serializability for read-only transactions, while STRIDE offers the choice of serializability or update serializability.
11 SGSI histories are in Φ-SR, although it was devised prior to the work on Φ-serializability, and externalizability in general.
choking point in SGSI's design. Furthermore, updates are installed serially in CertDB and synchronously with the commitment of transactions, contributing to certification latency. STRIDE supports stateless certification, using only the memory of certifiers to look for antidependencies and calculate safepoints. Alg. 2 offers a further optimisation, wherein the stateless certifier is augmented with an optional soft-state antecedent store to reduce memory utilisation. Furthermore, updates can be installed asynchronously and in batches, maximising IO utilisation on the antecedent store. If the antecedent store begins to lag during periods of peak throughput, the certifier can choose to either grow the in-memory suffix or abort transactions seeded from lapsed snapshot versions.
4. Being a descendant of snapshot isolation, SGSI rejects concurrent write conflicts on the assumption that they might conceal a lost update anomaly. In doing so, SGSI also rejects serializable blind writes. STRIDE permits serializable blind writes, using antidependency testing to identify lost updates. A STRIDE transaction must therefore explicitly declare its read dependencies; there is no presumption that if a transaction writes to an item, then it must have also read from it.
5. SGSI mandates the use of identical schemas among replicas and CertDB; i.e., SGSI is intrinsically schema-aware. (Without this, transactions cannot be certified using SQL queries on CertDB.) STRIDE is schema-agnostic; neither the certifier nor the antecedent store cares about the internal structure of the data items; moreover, the cohort databases can be structured differently from each other. The only uniformity required by STRIDE is the identity of the data items; i.e., for every data item x, the mapping of x to the underlying entity is equivalent on every cohort. The antecedent store does not consider the entity, only its version number.
Furthermore, there is no requirement that each cohort maintains a full set of data items processed during the lifetime of the system; it may safely limit its view to the subset of data items that it cares about. (Although in doing so, it limits the scope of queries it can support.)
6. SGSI supports SQL queries with predicates. STRIDE only supports named data items; predicate support is absent from STRIDE and is unlikely to be offered without significant changes to the algorithm.
If we limit our consideration to the attributes of the two algorithms, disregarding their usage context, STRIDE is an objective improvement over SGSI in all areas except for point 6 (predicate support). The bottlenecks described in points 2 and 3 call the performance of SGSI into question: on one hand, SGSI attempts to scale SQL databases by separating certification from reads, using replication for read throughput and availability; on the other hand, the throughput of the certifier is limited by that of CertDB, and writes are applied serially on the replicas and synchronously on CertDB. Bornea et al. did not discuss these issues in [11]. Instead, the authors demonstrated a combined throughput of 88,000 TPM in the TPC-W benchmark under some workloads, showing a 7x improvement over single-replica performance (admittedly, in 2011). At first glance, SGSI appears to live up to its promise; upon closer inspection, however, the result is unsurprising for its time, given that the replicas were equipped with 7,200 RPM mechanical disks and 2 GB of RAM, whereas the certifier used an in-memory SQLite.NET database for conflict detection, buffering updates and committing to disk in batches of 32 certifications [11]. Under these conditions, the bottleneck is presumably in the IO subsystem of the replicas, masking the bottlenecks of the overarching algorithm; a buffered certifier will outperform a conventional DBMS with little effort.
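The schema-agnostic identity model described in point 5 can be sketched as an antecedent store that records only item identities and version numbers, never entity structure. The class and method names below are hypothetical illustrations.

```python
class AntecedentStore:
    """Sketch of a schema-agnostic antecedent store: maps item identity to
    the last committed version.  The entity's internal structure is never
    inspected, so cohort databases may structure the same item differently."""

    def __init__(self):
        self.versions = {}  # item identity -> highest committed version

    def install(self, version, item_ids):
        # Record that the items in `item_ids` were overwritten at `version`.
        for item in item_ids:
            self.versions[item] = version

    def conflicts(self, readset, snapshot):
        """Return the items in `readset` overwritten after `snapshot`,
        i.e. the antidependencies incident on a lapsed transaction."""
        return {i for i in readset
                if self.versions.get(i, 0) > snapshot}
```

Because only `(identity, version)` pairs are stored, the same store serves cohorts whose local schemas differ, which is precisely the uniformity constraint stated above.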
We speculate that using solid-state disks on the replicas, with sufficient memory to support table pinning, will noticeably shift the balance of performance and make SGSI's bottlenecks more apparent; SGSI will still scale to some degree, but the saturation point will occur at a lower multiple.
Returning to point 6, the lack of predicate certification in STRIDE is a significant functional reduction over SGSI. The flip side, however, is that STRIDE is schema-agnostic in its entirety, permitting site autonomy. It is difficult to imagine a schema-agnostic predicate certifier; thus, the two design criteria are inherently contradictory. We suggest that the contrast in points 5 and 6 naturally leads to a distinct classification of the two algorithms: SGSI is a distributed SQL database, whereas STRIDE is an atomic commitment protocol over a heterogeneous multidatabase system. To that point, SGSI is more likely to be used as a common database for one or more closely coupled applications. Conversely, STRIDE is better suited to atomically coordinating distributed transactions across a set of federated, autonomous applications (or systems) with separate databases.

B. Calvin
Calvin [38] is a transactional scheduling and replication layer that uses the deterministic ordering guarantee to achieve strict serializability without relying on a multi-phase atomic commitment protocol for coordination.
Calvin works by assigning a total order to candidate transactions using an abcast primitive, which also doubles as a persistence layer. Transactions encompass data items residing in multiple partitions and replicas. A scheduler at each partitioned replica serially computes a locking plan for the inbound transactions, which are subsequently executed concurrently according to the established plan. Owing to the combination of total order and the upfront declaration of readsets and writesets, the locking schedules agree at each replica. Partitions may synchronously exchange information to facilitate multi-partition transactions, wherein a transaction executing on one partition reads from or writes to an item that resides at another partition.
13 Histories produced by Calvin are in Φ-SR, although it was devised prior to the work on Φ-serializability and externalizability.
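The replica-wise agreement of locking schedules can be sketched as follows. The function signature and tuple shapes are illustrative assumptions, not Calvin's actual interface; the point is that a serial pass over the same abcast-delivered sequence yields the same plan everywhere.

```python
def lock_plan(ordered_txns):
    """Sketch: compute a deterministic per-item lock acquisition order.

    `ordered_txns` is the abcast-delivered sequence of
    (txn_id, readset, writeset) tuples.  Because every replica processes
    the same sequence with the same declared sets, every replica derives
    an identical plan without coordination."""
    plan = {}  # item -> list of (txn_id, mode) in acquisition order
    for txn_id, readset, writeset in ordered_txns:
        for item in sorted(readset - writeset):
            plan.setdefault(item, []).append((txn_id, 'shared'))
        for item in sorted(writeset):
            plan.setdefault(item, []).append((txn_id, 'exclusive'))
    return plan
```

This also shows why transactions cannot prospect for new items during execution: the plan is fixed before execution begins, so an undeclared access would have no lock slot.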
Calvin and STRIDE share many design elements and resulting properties:
• Avoiding multi-phase synchronous coordination by way of deterministic scheduling. Transactions behave identically at each replica; therefore, once their relative order is stably agreed upon, the coordination step becomes unnecessary.
• Use of abcast to fulfil both ordering and durability. Whereas conventional databases rely on a durable REDO log, Calvin and STRIDE exploit the durability of the abcast primitive.
• Disclosing the transactions' readsets and writesets prior to scheduling. A transaction may not prospect for new items during execution.
• Asynchronous replication of transactions.
• Concurrent installation of updates. In Calvin's case, this is accomplished by concurrent execution. In STRIDE, certified updates may be installed concurrently on the replicas.
• Support for serializable blind writes.
• Commitment ordering and Φ-serializability 13 at the system level.
They differ by the following characteristics:
1. Transactions are modelled as stored procedures in Calvin, leading to code sharing across sites. The sites are therefore not autonomous. STRIDE embeds the state mapping in the transaction record, supporting site autonomy. On the flip side, the use of stored procedures may result in a more compact transaction representation because the state mapping is derived locally.
2. Calvin employs locking at each site, meaning it may queue certain transactions behind others, but the scheduler never aborts (unless the transaction aborts of its own will). The commitment model is reversed in STRIDE: a transaction may voluntarily abort during preliminary validation; however, it has no influence over its outcome during or after certification.
3. Calvin combines the scheduling and execution of transactions. In STRIDE, transactions are scheduled on the certifier independently of their subsequent execution on the replicas.
4. While both algorithms are concurrent in their processing of transactions, their concurrency exploits differ markedly. Calvin executes transactions concurrently, subject to a predetermined locking plan.
There is no splitting of transactions' read and write phases: they execute collectively at the relevant partitions. As a result, slow reads and writes block subsequent transactions. In STRIDE, the read, certification and write phases support concurrent execution and are subject to pipelining; in other words, transactions may be reading concurrently from local snapshots while candidates are being certified and committed writes are being installed concurrently for overlapping items. Slow updates have no bearing on certification performance.
5. While both algorithms use asynchronous replication, they differ vastly. Calvin replicates just the candidate transactions; the decision is implicit in the execution phase. The initiator of a Calvin transaction does not know of its outcome until it is executed on the relevant active partitions. By comparison, STRIDE replicates both candidates and decisions; the initiator of a transaction learns of its outcome after certification.
6. Read-only transactions are treated identically to update transactions in Calvin: they require scheduling and entail locking. Calvin cannot serve lock-free, read-only queries unless it is bonded with a multiversion storage engine [38]. By comparison, STRIDE supports consistent uncertified reads at the cohorts.
7. Following from the above, STRIDE offers multiple consistency models for read-only transactions: USR, SR and S-SR, depending on the readers' requirements. This flexibility is absent from Calvin for as long as multiple partitions are used. Reducing Calvin to a single partition supports fast reads at the expense of write throughput.
8. Calvin partitions its dataset mainly for update performance, trading away read performance. STRIDE does not employ partitioning; instead, it replicates data to support site autonomy and local reads.
Calvin's partitioned execution model is susceptible to progress skew: a condition wherein execution runs out of phase across partitions, stalling the progress of multi-partition transactions [38]. STRIDE avoids this by design. Finally, we note that, in the same vein as SGSI, Calvin's design is optimised for use in a distributed database, where the use of stored procedures is commonplace and site autonomy is a nonrequirement. Calvin has recently been adopted in Fauna, a commercial, SQL-compatible distributed database, with minor variations [39].

VII. SUMMARY AND CONCLUSIONS
This paper presents a novel protocol for optimistically certifying distributed transactions in multidatabase systems. In STRIDE, sites called cohorts maintain local replica databases containing data items that form a subset of the notional global state, at versions that may lag behind their most recent committed state. Transactions are initiated by cohorts after querying their local state to ascertain provisional validity, before forwarding the candidate transactions on to a certifier via an agent proxy for categorical validation and atomic commitment. (The agent may be collocated with the cohort.) The agent assigns a logical timestamp order to the candidate transactions by sending them over an atomic broadcast primitive; this is the order in which the transactions are eventually certified. The cohort databases are asynchronously populated by installing the writesets of certified transactions; a cohort eventually catches up to the global state for the subset of data items that it cares about.
The installation of writesets on a cohort may proceed concurrently, and out of order with respect to the transactions' commitment order; nonetheless, the algorithm guarantees that no transaction is installed before any of its direct or transitive dependencies, and that all writesets are installed atomically. In doing so, the cohorts conform to update serializability (USR, PL-3U) for read-only transactions, relieving the need for their certification. Additionally, the isolation property for uncertified read-only transactions can be elevated to full serializability (SR, PL-3) on those cohorts that apply updates serially, in strict commitment order.
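The dependency-respecting installation guarantee can be sketched as a simple eligibility test: a committed transaction installs only once everything it depends on has installed, while unrelated transactions may install in any relative order. The function names and the dependency map below are hypothetical illustrations, not STRIDE's formal rules.

```python
def installable(txn, installed, deps):
    """A committed transaction may install once every transaction it
    directly depends on has been installed; transitive dependencies follow
    inductively, since each dependency obeyed the same rule."""
    return all(d in installed for d in deps.get(txn, set()))

def install_out_of_order(arrival_order, deps):
    """Sketch: repeatedly install any transaction whose dependencies are
    satisfied.  Transactions over nonoverlapping items carry no dependency
    edge and may therefore install out of commitment order."""
    installed, schedule = set(), []
    remaining = list(arrival_order)
    while remaining:
        txn = next(t for t in remaining if installable(t, installed, deps))
        remaining.remove(txn)
        installed.add(txn)
        schedule.append(txn)
    return schedule
```

In this sketch, an independent transaction can overtake an earlier one, yet a dependent transaction can never precede its antecedent, mirroring the guarantee stated above.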
For transactions that complete certification, the resulting isolation property is the intersection of commitment ordering and Φ-serializability; the latter is essential to avoiding the logical timestamp skew anomaly that was recently shown to plague event-replicated systems [16].
The overarching design of STRIDE bears many similarities to SGSI and was conceived from the outset with the limitations of SGSI in mind, ensuring that neither the certifier nor the cohorts are bottlenecked for nonconflicting transactions, and offering a choice of consistency models for uncertified read-only transactions. In addition to concurrent, out-of-order updates and support for serializable blind writes, STRIDE gains schema autonomy, but forfeits predicate support in doing so. Evidently, the algorithms differ in their practical application: SGSI is a single distributed database system, while STRIDE is a transactional multidatabase system.
By paring STRIDE down to its minimally correct form (viz. Alg. 0), it was shown that the isolation levels on the replicas need not be stronger than read atomic (RA), that is, the intersection of monotonic atomic view (MAV) and item cut isolation (I-CI), to produce strict serializable post-certification histories. This result is an improvement over the previous result of [11] that alluded to a minimum of generalised snapshot isolation (GSI). In its more conventional form (Alg. 1 and 2), STRIDE delivers USR on the replicas, which is still weaker than GSI (if we disregard write conflicts, which cannot occur during serial installation). We can now categorically state that local GSI is not a requirement for global SR.
We also remarked on the inherent design trade-offs in STRIDE. In particular, the stateless certification variant (Alg. 1) is limited by the extent of the in-memory suffix of candidate transactions: increasing the suffix results in a lower false abort rate for lapsed nonconflicting transactions but increases memory utilisation on the certifier. We presented an optimised variant (Alg. 2) that coalesces the prefix of committed transactions into an antecedent store, which is subsequently queried by the certifier when assessing a lapsed transaction. With antecedent set reification, certifiers can operate over a much smaller in-memory suffix, acquiring immunity to most false aborts while assessing the overwhelming majority of transactions entirely in memory.
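The suffix/antecedent trade-off can be sketched as a two-tier lookup: conflicts are first sought in the in-memory suffix, and only lapsed transactions (whose snapshots predate the suffix) consult the reified antecedent set. The function signature and the dict-based stores are illustrative assumptions in the spirit of Alg. 2, not its specification.

```python
def certify(readset, snapshot, suffix_start, suffix_writes, antecedents):
    """Sketch: certify against an in-memory suffix, falling back to the
    reified antecedent set for lapsed snapshots.

    `suffix_writes` maps item -> last written version within the suffix
    (versions >= suffix_start); `antecedents` maps item -> last committed
    version in the coalesced prefix.  A transaction is lapsed when its
    snapshot predates the suffix (snapshot < suffix_start - 1)."""
    for item in readset:
        v = suffix_writes.get(item)
        if v is not None and v > snapshot:
            return 'abort'           # antidependency found in the suffix
        if snapshot < suffix_start - 1:
            # Lapsed: the suffix cannot vouch for this read, so consult the
            # antecedent set instead of conservatively (falsely) aborting.
            if antecedents.get(item, 0) > snapshot:
                return 'abort'
    return 'commit'
```

The in-memory fast path handles the overwhelming majority of transactions; only lapsed readsets pay the cost of the antecedent lookup, which is the trade-off discussed above.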