
Page 1: Fundamentals Of Transaction Systems - Part 1: Causality banishes Acausality (Clustered Database)

1-1

Valverde Computing

The Fundamentals of Transaction Systems Part 1: Causality banishes Acausality (Clustered Database)

C.S. Johnson <[email protected]> video: http://ValverdeComputing.Com social: http://ValverdeComputing.Ning.Com

The Open Source/Systems Mainframe Architecture

Page 2: Fundamentals Of Transaction Systems - Part 1: Causality banishes Acausality (Clustered Database)

1-2

Library = Low level communication, operating system drivers and state on Open Systems platforms

Subsystems = Open Source components + middleware standards + Customer Application Cores

EAI = commerce brokers, data integration & rules engines, enterprise mining, web analytics, ETL and data cleansing tools

Optimal Cluster Software Architecture

Page 3: Fundamentals Of Transaction Systems - Part 1: Causality banishes Acausality (Clustered Database)

1-3

Library = Low level communication, operating system drivers and state on Open Systems platforms

ESS, WAN, LAN, SAN drivers and management library

Global serialization library

XML log records library

Buffered log I/O library

XML log reading library

Cluster logging library

Recovery library

XML chains resource manager

Global Transaction (IDs, handles and types) library

Data management library

Transaction management library

XML remote scripting API library

Computer, Cluster and Network management library

Optimally includes a proprietary layer of low-level, C/C++ based drivers, yielding unparalleled transaction processing performance without the client having to deal with the underlying design architecture. These libraries provide a simple and unobtrusive, yet elegant and abstract data management interface for new applications.

Libraries

Page 4: Fundamentals Of Transaction Systems - Part 1: Causality banishes Acausality (Clustered Database)

1-4

Disaster Recovery interface

XML remote scripting

XML management console

Service control manager

Application servers

Application feeders

Application extractors

Application reports

Application human interface

Database and Recovery management interface

Computer, Cluster and Network management interface

Subsystems = Open Source components + middleware standards + Customer Application Cores

The vast majority of optimal middleware and applications are then implemented on open source using cross-platform Java to access this open system interface, allowing unprecedented flexibility for customization and future expansion.

Middleware – Open Source

Page 5: Fundamentals Of Transaction Systems - Part 1: Causality banishes Acausality (Clustered Database)

1-5

Actional Control Broker

Acxiom AbiliTec™

Fair Isaac Blaze Advisor

Mercator Commerce Broker

MicroStrategy

DoubleClick Ensemble

SAS Enterprise Miner

ETL Tools

SeeBeyond®

TIBCO

Trillium

EAI = commerce brokers, data integration & rules engines, enterprise mining, web analytics, ETL and data cleansing tools

Enterprise Application Integration

Page 6: Fundamentals Of Transaction Systems - Part 1: Causality banishes Acausality (Clustered Database)

1-6

High Speed, Minimum Latency Network or SAN “B”

Cluster Redundancy Architecture

Fibre Channel or SAN Based Enterprise Storage Network “B”

High Speed, Minimum Latency Network or SAN “A”

Fibre Channel or SAN Based Enterprise Storage Network “A”

* Elements can be viewed as computers in a cluster, or as clusters in a group

Page 7: Fundamentals Of Transaction Systems - Part 1: Causality banishes Acausality (Clustered Database)

1-7

4 Pillars (or Guardians or Demons)

1. Causality banishes Acausality (Clustered Database)

A. Importance of Serialization, Order

B. S2PL (strict two-phase locking) vs. MVCC

a. Write skew and wormholes

C. Wittgenstein and the Tractatus Logico-Philosophicus

D. Mohandas Gandhi: Be the change you want to see.

E. Daisaku Ikeda: Ningen Kakumei (Principle of Human Revolution)

Page 8: Fundamentals Of Transaction Systems - Part 1: Causality banishes Acausality (Clustered Database)

1-8

2. Relativity shatters the Classical Delusion (Replicated Database)

A. Real-time and distributed systems performance issues

B. Timestamp and clock issues

a. Einstein: no such thing as the global “current moment”

b. Davies: no such thing as the local “current moment” (modern physics)

c. GPS satellites were built with 13 digits of timing precision, and they turned out to need every digit because of elliptical orbits and gravity/acceleration time-dilation variances.

4 Pillars (or Guardians or Demons)

Page 9: Fundamentals Of Transaction Systems - Part 1: Causality banishes Acausality (Clustered Database)

1-9

3. Purity emerges from Impurity (Practical makes perfect)

A. Algorithms need to work correctly and be optimized for time and resources

B. Occam: "All other things being equal, the simplest solution is the best."

C. Einstein: “Make everything as simple as possible, but not simpler”

D. Roger Penrose: Objectivity of Plato's mathematical world = no simultaneous correct proof and disproof

E. Carl Hewitt's Paraconsistency is really just Inconsistency

4 Pillars (or Guardians or Demons)

Page 10: Fundamentals Of Transaction Systems - Part 1: Causality banishes Acausality (Clustered Database)

1-10

4. Certainty suppresses Uncertainty (Groups of Clusters)

A. ΔxΔp ≥ ħ/2

B. Failure, Takeover and Recovery: reasserting the invariants

C. Propagation of Error and Chaos will ultimately loom in all predictions, interpolations and extrapolations

D. Idempotence: transparent retryability

4 Pillars (or Guardians or Demons)

Page 11: Fundamentals Of Transaction Systems - Part 1: Causality banishes Acausality (Clustered Database)

1-11

Cluster Fundamentals

1. Reliable Message Based System - serialized retries with duplicate removal

2. Data Integrity - data must be checked wherever it goes

3. Reliability = fail-fast + fault detection + fault tolerance + fault avoidance + proper fault containment

4. Basic Parallelism - if it isn’t locked, then it isn’t blocked

5. Basic Transparency - when? where? how?

6. Basic Scalability

7. Basic Availability - outage minutes -> zero

8. Application/Database Serialized Consistency - the database must be serialized wherever it goes

9. Recovery - putting it all back together again

Page 12: Fundamentals Of Transaction Systems - Part 1: Causality banishes Acausality (Clustered Database)

1-12

Cluster Fundamentals

10. ACID and BASE - workflow makes this reaction safe

11. True Multi-Threading - shrinking the size of thread-instance state

12. Single System Image and Network Autonomy

13. Minimal Use of Special Hardware - servers need to be off-the-shelf

14. Maintainability and Supportability - H/W & S/W needs to be capable of basic on-line repair

15. Expansively Transparent - Parallelism and Scalability

16. Continuous Database - needs virtual commit by name

Page 13: Fundamentals Of Transaction Systems - Part 1: Causality banishes Acausality (Clustered Database)

1-13

Cluster Fundamentals

17. Reliable Disjoint Async Replication

18. Logical Redo and Volume Autonomy

19. Scalable Joint Replication

20. Bi-Directional Replication - Reliable, Scalable, Atomically Consistent

21. Openness (Glasnost) - Open systems, open source, free software

22. Restructuring (Perestroika) - Online application and schema maintenance

23. Reliable Software Telemetry - push streaming needs a many-to-many architecture

Page 14: Fundamentals Of Transaction Systems - Part 1: Causality banishes Acausality (Clustered Database)

1-14

Cluster Fundamentals

24. Publish and Subscribe

25. Ubiquitous Work Flow

26. Virtual Operating System

27. Scaling Inwards - Extreme Single Row Performance for Exchanges

28. Ad Hoc Aggregation - Institutional Query Transparency for Regulation

29. Reliable Multi-Lateral Trading - Regulated Fairness & Performance, Guaranteed Result

30. Semantic Data - Verity of Data Processing

31. Integration and Test Platform - Real-Time Transaction Database

32. Integrated Logistics

Page 15: Fundamentals Of Transaction Systems - Part 1: Causality banishes Acausality (Clustered Database)

1-15

1. Reliable Message-Based System serialized retries with duplicate removal

Why are loosely-coupled clusters of computers such a great thing? Of course, the computers themselves can be tightly-coupled SMPs, so one does not preclude the other: by carefully architecting the software, we get to have the best of both:

(1) the shared-memory, multi-core, semi-automatic load balancing (with the help of packages like the Intel Threading Building Blocks) within the single unit of failure that is an SMP, where memory-update contention limits scalability

TR-90.8 Guardian 90: A Distributed Operating System Optimized Simultaneously for High-Performance OLTP, Parallelized Batch/Query and Mixed Workloads <http://www.hpl.hp.com/techreports/tandem/TR-90.8.html>

Page 16: Fundamentals Of Transaction Systems - Part 1: Causality banishes Acausality (Clustered Database)

1-16

Why loosely-coupled clusters (continued)?

(2) the shared-nothing (Stonebraker) potential for fault tolerance through separate units of failure connected by messages, which is what makes up a cluster of computers: load balancing is extremely difficult, but there is the possibility of limitless scalability and parallelism, with no shared memory access

However, what are the limits of using cluster messages? Messages aren’t free, they have a cost: LAN messages cost ten times what messages between cores in a shared-memory system do (roughly 2500 instructions vs. 250)

TR-88.4 The Cost of Messages <http://www.hpl.hp.com/techreports/tandem/TR-88.4.html>

1. Reliable Message-Based System serialized retries with duplicate removal

Page 17: Fundamentals Of Transaction Systems - Part 1: Causality banishes Acausality (Clustered Database)

1-17

However, what are the limits of using cluster messages? The increased LAN cost comes from framing, checksumming, packet assembly/disassembly, standard protocols, and OS layering

Co-processors have defrayed a lot of this cost (SMP and LAN) since that article, but the disparity remains

Inlining code does minimize the processor overhead, but there is still a response-time hit (down to 100 ns for Nonstop ServerNet II, which is the hardware limit)

WAN costs require the abandonment of full transparency outside LAN clusters: so, no SQL partitioning across the WAN - it’s all client-server, replication or workflow

TR-89.1 Transparency in its Place The Case Against Transparent Access to Geographically Distributed Data <http://www.hpl.hp.com/techreports/tandem/TR-89.1.html>

1. Reliable Message-Based System serialized retries with duplicate removal

Page 18: Fundamentals Of Transaction Systems - Part 1: Causality banishes Acausality (Clustered Database)

1-18

In an optimal RDBMS (a relational database management system), data spread over multiple clusters of computers implies the use of messages between software subsystems (Gray-Reuter 3.7.3.1)

Message senders can die and be restarted and send duplicate messages, which must be detected and dropped (idempotence: in math, a = a x a; in computers, multiple attempts yield a single result)

Receivers can die and be restarted, so non-replied-to messages must be resent on failure (reliable retry to a new primary)

Gaps in a series may need to be detectable (sessions and sequencing; a solid MsgSys can help do this for you)
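To make those three requirements concrete, here is a minimal sketch (Python, purely illustrative and not the Nonstop MsgSys interface) of per-session sequence numbering with duplicate removal, saved replies for retried senders, and gap detection; all of the names are hypothetical:

# Hypothetical sketch: per-session sequence numbers give duplicate removal,
# re-replying to retries, and gap detection. Not the Nonstop MsgSys API.

class Session:
    def __init__(self):
        self.last_seen = 0     # highest contiguous sequence number applied
        self.replies = {}      # seq -> reply, so a retried request gets the same answer

class ReliableReceiver:
    def __init__(self, handler):
        self.sessions = {}
        self.handler = handler  # the actual work; applied exactly once per sequence number

    def receive(self, session_id, seq, payload):
        s = self.sessions.setdefault(session_id, Session())
        if seq <= s.last_seen:
            return s.replies.get(seq)        # duplicate from a restarted sender: drop and re-reply
        if seq > s.last_seen + 1:
            return ("GAP", s.last_seen)      # gap detected: ask the sender to resend from last_seen + 1
        reply = self.handler(payload)        # apply once; multiple attempts yield a single result
        s.last_seen = seq
        s.replies[seq] = reply
        return reply

A sender that dies and is restarted simply resends everything past its last acknowledged sequence number; the receiver answers the duplicates from the saved replies, which is the idempotence property described above.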

1. Reliable Message-Based System serialized retries with duplicate removal

Page 19: Fundamentals Of Transaction Systems - Part 1: Causality banishes Acausality (Clustered Database)

1-19

Library based components require (for performance reasons):

Drivers and packet communications

Kernel mode execution under driver dispatches

Packet buffering and replies without dispatching target threads (called driver ricochet broadcasts)

Global data with common access controls for kernel mode and thread mode, which allows kernel mode flushing of RMs (RDBMS resource managers) with low-level lock-release

FIFO queuing of packet buffers into subsystem user mode threads (fibers)

Support for stream programming

1. Reliable Message-Based System serialized retries with duplicate removal

Page 20: Fundamentals Of Transaction Systems - Part 1: Causality banishes Acausality (Clustered Database)

1-20

Basic HP (was Tandem) Nonstop clustering services include: Cluster coldload and single-processor reload, and Processor Synchronization:

I’m Alive Protocol: heartbeats are sent every second, and all processors check every two seconds for receipt from every other processor; if one cannot communicate, send it a poison pill message, declare it down, cancel its messages, etc., unless …

Regroup Protocol: Two-round cluster messaging protocol to make sure the unhealthy processor is really down, and not just late for some and not others (split-brain), which gives recalcitrants a second chance

TR-90.5 Fault Tolerance in Tandem Computer Systems <http://www.hpl.hp.com/techreports/tandem/TR-90.5.html>
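As a toy illustration of the heartbeat bookkeeping half of this (the regroup round and poison-pill messaging are only stubbed out), a sketch in Python might look like the following; the constants and names are assumptions, not the real protocol parameters:

# Toy heartbeat bookkeeping in the spirit of the I'm Alive protocol; the
# regroup round and poison-pill messaging are only indicated by a stub.
import time

HEARTBEAT_PERIOD = 1.0   # every processor broadcasts once a second
CHECK_PERIOD = 2.0       # every processor checks receipt every two seconds

class Membership:
    def __init__(self, processors):
        now = time.monotonic()
        self.last_heard = {p: now for p in processors}

    def heard_from(self, processor):
        self.last_heard[processor] = time.monotonic()

    def check(self):
        """Return processors we have not heard from within the check period."""
        now = time.monotonic()
        suspects = [p for p, t in self.last_heard.items() if now - t > CHECK_PERIOD]
        for p in suspects:
            # In the real system this would trigger the two-round regroup
            # protocol before declaring p down and cancelling its messages.
            start_regroup(p)
        return suspects

def start_regroup(processor):
    pass  # placeholder: the regroup gives a late-but-healthy processor its second chance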

1. Reliable Message-Based System serialized retries with duplicate removal

Page 21: Fundamentals Of Transaction Systems - Part 1: Causality banishes Acausality (Clustered Database)

1-21

Processor Synchronization (continued): Global Update Protocol (Glupdate):

Cluster information (the process-pair name directory and other items in the messaging destination table) is replicated in a time-limited, atomic and serial manner

Cluster Time Synchronization: clock adjustments are constantly maintained and relative clock error is tracked; the Nonstop transaction service does not depend upon clock synchronization for commit or any other algorithmic purpose (that would defy relativity), so Nonstop only inserts timestamps for reference purposes

1. Reliable Message-Based System serialized retries with duplicate removal

Page 22: Fundamentals Of Transaction Systems - Part 1: Causality banishes Acausality (Clustered Database)

1-22

The basic clustering of the Nonstop message system is described in the (expired and now available) Nonstop patents from James Katzman, et al:

4,817,091 Fault-tolerant multiprocessor system

4,807,116 Interprocessor communication

4,672,537 Data error detection and device controller failure detection in an input/output system

4,672,535 Multiprocessor system

4,639,864 Power interlock system and method for use with multiprocessor systems

4,484,275 Multiprocessor system

1. Reliable Message-Based System serialized retries with duplicate removal

Page 23: Fundamentals Of Transaction Systems - Part 1: Causality banishes Acausality (Clustered Database)

1-23

Nonstop patents from James Katzman, et al (continued):

4,378,588 Buffer control for a data path system

4,365,295 Multiprocessor system

4,356,550 Multiprocessor system

4,228,496 Multiprocessor system

And the (expired and now available) Glupdate patent from Richard Carr, et al, which has been reliable for over 30 years now and has a much reduced message overhead versus Quorum Consensus + Thomas Write Rule + Lamport timestamps, while accomplishing more (for cluster sizes <= 25 or so):

4,718,002 Method for multiprocessor communications

1. Reliable Message-Based System serialized retries with duplicate removal

Page 24: Fundamentals Of Transaction Systems - Part 1: Causality banishes Acausality (Clustered Database)

1-24

A Nonstop system is a loosely-coupled (no shared memory) cluster (called a “network”) of clusters (called “nodes”) of processors, up to 4096. Each 16 processor node in the Expand network has node autonomy, and its own transaction service TM (transaction manager) capable of bringing the cluster's RDBMS up and down, one RM (resource manager) at a time, or all at once.

Fault tolerance at the subsystem and application level is accomplished by process pairs, which look like a single process to a client sending messages and later retrying after the primary half of the pair has gone down, and takeover by the backup has made a new primary.

1. Reliable Message-Based System serialized retries with duplicate removal

Page 25: Fundamentals Of Transaction Systems - Part 1: Causality banishes Acausality (Clustered Database)

1-25

Takeover is quite different from failover and restart; IBM’s Parallel Sysplex does not do takeover: all nodes have transparent access to data, and applications that fail have to be restarted. There is a ‘Workload Manager’ to restart the apps, but even that does not completely recover the database (50-60% of IBM database applications are not transactional)

An IBM S390 Sysplex Cluster is a set of up to 32 16-way SMPs joined by ultra-fast interconnects and buses, with at least 2 coupling facility smart memory devices and 2 synchronized sysplex clocks (the clocks are not used for processing commit; they do commit by log order, as Nonstop does); see their presentation:

<http://www.mvdirona.com/jrh/work/hpts2001/presentations/DB2%20390%20Availability.pdf>

1. Reliable Message-Based System serialized retries with duplicate removal

Page 26: Fundamentals Of Transaction Systems - Part 1: Causality banishes Acausality (Clustered Database)

1-26

2. Data Integrity data must be checked wherever it goes

Data corruption is an ever-present possibility through electronic noise (e.g. radon decay chain effects, cosmic radiation), physical defects (semiconductor doping flaws), and HW/SW design defects (stray pointers in code)

The statistics are that there are 3 undetected and uncorrected, but program-significant data corruptions per 1000 microprocessors per year (Horst, et al: Proc 23rd FT Computing Symposium 1993)

Disks, even when not in use, will corrupt data at a low rate, so the mirrors need to be crawled and corrected in the background, with single data disk blocks recovered; on non-mirrored disks, errors exceeding the 2-bit (or otherwise) encoding correction are a permanent problem

Page 27: Fundamentals Of Transaction Systems - Part 1: Causality banishes Acausality (Clustered Database)

1-27

2. Data Integrity data must be checked wherever it goes

Memory must be error-checking and correcting (ECC memory), as in most computer systems. Many components have the potential to corrupt data, and this will become more problematic as components shrink

Higher integration levels for processors will cause sporadic internal resets from soft errors, which occur more frequently at higher altitude (Itaniums in Colorado reset 1/day vs. 1/week at sea level in 2001), and which can take a processor offline for half a minute

Optimal, reliable systems will support every one of the following:

Lock-stepped microprocessors

Fail-fast protection of internal buses and drivers

End-to-end checksums on data sent to storage devices

Page 28: Fundamentals Of Transaction Systems - Part 1: Causality banishes Acausality (Clustered Database)

1-28

2. Data Integrity data must be checked wherever it goes

Log writing must use end-to-end checksums on blocks. This is because after a crash, we need to fix up to the last valid written block of log records from a log buffer, and we can’t tolerate garbage in the middle of a block due to power-loss partial writes (drive manufacturer dependencies)

During transaction restart after an RM (resource manager) or computer crash or a full cluster TM (transaction manager) crash, log fixup then searches for the last good block written (valid checksum) to the log mirrors, which becomes the new log tail

The fixup function reads from the mirrors until neither one has a good block at the end of the log, then rewrites all of the last log blocks on both mirrors (to scrub the errors on the short side)
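A hedged sketch of that fixup, assuming each log block carries a trailing CRC32 and the two mirrors hold the same number of blocks, could look like this (illustrative only, not the Nonstop implementation):

# Sketch of log-tail fixup across two mirrors: scan forward while block
# checksums verify, take the longer valid prefix as the new tail, and
# rewrite the tail blocks on both mirrors to scrub errors on the short side.
# Assumed block layout: last 4 bytes are a CRC32 of the rest of the block.
import zlib

def valid(block: bytes) -> bool:
    body, stored = block[:-4], int.from_bytes(block[-4:], "big")
    return zlib.crc32(body) == stored

def last_good_index(blocks):
    n = 0
    for b in blocks:
        if not valid(b):
            break
        n += 1
    return n  # number of leading valid blocks

def fixup(mirror_a, mirror_b):
    good_a, good_b = last_good_index(mirror_a), last_good_index(mirror_b)
    tail = max(good_a, good_b)               # neither mirror has a good block beyond this
    source = mirror_a if good_a >= good_b else mirror_b
    for i in range(min(good_a, good_b), tail):
        mirror_a[i] = source[i]               # rewrite the last log blocks on both mirrors
        mirror_b[i] = source[i]
    return tail                               # the new log tail, in blocks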

Page 29: Fundamentals Of Transaction Systems - Part 1: Causality banishes Acausality (Clustered Database)

1-29

3. Reliability = fail-fast + fault detection + fault tolerance + fault avoidance + proper fault containment

James Gosling: distributed computing is not transparent to either failure or performance

Some errors are tolerable and some operations returning errors can be retried with idempotence

Oddly enough, keeping things reliably running requires a cut-throat approach to critical subsystems that are experiencing anomalies, encountering garbage data, or even running abnormally

Fail-fast – going down quickly prevents the spread of invalid data or even the effects of flawed algorithms or races we can’t handle (what if the corruption checks don’t catch something?)

Page 30: Fundamentals Of Transaction Systems - Part 1: Causality banishes Acausality (Clustered Database)

1-30

Takeover (more transparent) is far superior to failover (failure and restart), if only because it enables the use of fail-fast techniques without hurting users as much

Failure Detection: assertion logic is interwoven throughout all critical library code and all critical subsystem code in reliable systems

To maintain the state machine invariants end-to-end we must detect any violation of the invariants and then reinstate them by whatever means necessary

Bohr-bugs (synchronous: they hit repeatedly) and Heisen-bugs (asynchronous and racy) require different kinds of testing

3. Reliability = fail-fast + fault detection + fault tolerance + fault avoidance + proper fault containment

Page 31: Fundamentals Of Transaction Systems - Part 1: Causality banishes Acausality (Clustered Database)

1-31

Single failures and double failures in clusters require different kinds of testing and furthermore, different kinds of fault tolerance design to ‘transparently’ handle those failures

Fault Tolerance: when something goes wrong and a failure occurs, whether hardware or software, takeover mechanisms ensure the re-establishment of the state machine invariants (a new state that is equivalent to the state before the failure)

In fault tolerant systems, when a piece of hardware fails, the fault tolerance of the software has to function correctly to mask the hardware failure

3. Reliability = fail-fast + fault detection + fault tolerance + fault avoidance + proper fault containment

Page 32: Fundamentals Of Transaction Systems - Part 1: Causality banishes Acausality (Clustered Database)

1-32

Fault Avoidance: by small amounts of forethought and action here and there in the code, potentially large failures can be shrunk down in size to be handled invisibly:

avoiding unnecessary transaction aborts by preparatory checkpointing of shared read locking state in the RM [resource manager] before a coordinated takeover

avoiding unnecessary RM crash recovery outages by detecting missing log writes and performing them in a timely way after a takeover

Fault Containment: garbage pointers in the library kernel globals cause the outage of a computer in a cluster. Encountering the garbage pointers in a critical subsystem process environment may only require a process restart, if proper checkpoints have been made beforehand

3. Reliability = fail-fast + fault detection + fault tolerance + fault avoidance + proper fault containment

Page 33: Fundamentals Of Transaction Systems - Part 1: Causality banishes Acausality (Clustered Database)

1-33

4. Basic Parallelism if it isn’t locked, then it isn’t blocked

An optimal RDBMS will use S2PL (strict two-phase locking) for the transaction-duration locks (there are 5 kinds of locks in the Nonstop RM, DP2)

An optimal RDBMS RM (resource manager) holds both the data and the locks, with no external distributed lock table or external buffer cache to fight over with interlopers: and that means that clients can queue properly

So, one client message connects the transactional application code to: the RM data + the RM client queue + the acquired RM locks + the tx state within the node TM (transaction manager) library globals underneath the RM subsystem + the potential failure takeover process for the RDBMS RM

Page 34: Fundamentals Of Transaction Systems - Part 1: Causality banishes Acausality (Clustered Database)

1-34

4. Basic Parallelism if it isn’t locked, then it isn’t blocked

An optimal RDBMS RM (resource manager) will support the use of a priority queue and priority inversion on that queue so that mixed workloads can intermingle with little impact:

A common problem with clusters is that a low-priority client, once dequeued and being served at the RM, can block the access of a high-priority client that is newly queued

For short-duration requests this is ignorable, but low-priority table scans for queries blocking high-priority OLTP updates is not good for business

The solution is to execute the client function in a thread at the priority of the client (inversion) and to make low-priority scans (and the like) execute for a quantum and be interruptible by high-priority updates (see the sketch below and the paper):

TR-90.8 Guardian 90: A Distributed Operating System Optimized Simultaneously for High-Performance OLTP, Parallelized Batch/Query and Mixed Workloads <http://www.hpl.hp.com/techreports/tandem/TR-90.8.html>
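A rough Python sketch of that idea (not DP2 internals): the RM serves its request queue in client-priority order, and a long low-priority scan runs one quantum at a time so a newly queued high-priority request can jump in between quanta:

# Illustrative only: priority-ordered service with interruptible scans.
import heapq, itertools

class RequestQueue:
    def __init__(self):
        self.heap, self.tie = [], itertools.count()

    def put(self, priority, work):
        # lower number = higher priority; the tie counter keeps FIFO within a priority
        heapq.heappush(self.heap, (priority, next(self.tie), work))

    def run(self):
        while self.heap:
            prio, _, work = heapq.heappop(self.heap)
            more = work()                 # each call does one quantum of the request
            if more is not None:
                self.put(prio, more)      # scan not finished: requeue, letting any
                                          # newly queued higher-priority work go first

def table_scan(rows, start=0, quantum=1000):
    def step():
        end = min(start + quantum, len(rows))
        for r in rows[start:end]:
            pass                          # examine one row of the scan
        if end < len(rows):
            return table_scan(rows, end, quantum)   # continuation for the next quantum
    return step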

Page 35: Fundamentals Of Transaction Systems - Part 1: Causality banishes Acausality (Clustered Database)

1-35

4. Basic Parallelism if it isn’t locked, then it isn’t blocked

Optimally, the RDBMS also supports RM-only transactions, which are active within only one RM and do a transaction flush confined to that RM, so that one application message can send all the compound SQL statements and rowsets for several transactions, which will have microscopic response times, lock hold times, etc. (for instance, a hundred TPC-C transactions for one branch of the bank) … this still allows you to do wide transactions and queries on that RM data at any time; see the 1999 HPTS position paper:

<http://research.microsoft.com/~gray/HPTS99/Papers/JohnsonCharlie.doc>

Page 36: Fundamentals Of Transaction Systems - Part 1: Causality banishes Acausality (Clustered Database)

1-36

4. Basic Parallelism if it isn’t locked, then it isn’t blocked

There is one version of the data in an optimal RDBMS computing universe: applications must serialize; they must interact with each other on the same data using transactions. This is in contrast to MVCC (versioning) databases (Oracle, SQL Server, Sybase, MySQL, Postgres), where transactional reads use snapshot isolation and are blind to concurrent updates: only updates on primary keys block updates, and repeatable-read transactions (in the style of banking and finance transactions) don't provide proper isolation

An optimal S2PL (strict two-phase locking) system's update lock will block both updates and reads, an S2PL shared read lock will block updates, and both kinds of locks will only be released when the transaction stops changing the database, and for the update lock, after changes to the database are made durable at commit or abort time
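For illustration, a minimal S2PL-flavoured lock table that captures those rules (shared locks block writers, exclusive locks block everyone else, and nothing is released before commit or abort) might look like this; it is a sketch, not any product's lock manager:

# Minimal strict two-phase locking sketch: requests that cannot be granted
# simply return False here; a real RM would queue the requester instead.

class LockTable:
    def __init__(self):
        self.shared = {}      # key -> set of transaction ids holding a shared read lock
        self.exclusive = {}   # key -> transaction id holding the update lock

    def lock_shared(self, tx, key):
        holder = self.exclusive.get(key)
        if holder is not None and holder != tx:
            return False                      # blocked: an update lock blocks reads too
        self.shared.setdefault(key, set()).add(tx)
        return True

    def lock_exclusive(self, tx, key):
        holder = self.exclusive.get(key)
        readers = self.shared.get(key, set()) - {tx}
        if (holder is not None and holder != tx) or readers:
            return False                      # blocked by another updater or by readers
        self.exclusive[key] = tx
        return True

    def release_all(self, tx):
        """Called only at commit or abort time: that is what makes it strict 2PL."""
        for readers in self.shared.values():
            readers.discard(tx)
        for key in [k for k, h in self.exclusive.items() if h == tx]:
            del self.exclusive[key]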

Page 37: Fundamentals Of Transaction Systems - Part 1: Causality banishes Acausality (Clustered Database)

1-37

4. Basic Parallelism if it isn’t locked, then it isn’t blocked

This allows every application process to run freely in parallel across the entire database, until they encounter a blocking lock on some record – hence there is no application locking schedule or high level concurrency control or massive sharding of the database required to single thread the concurrency and allow the database to work correctly – so the system runs naturally in parallel at warp speed across the entire network of clusters of computers

In an optimal RDBMS, use of the read-only transaction commit optimization will further allow the locks on large sections of the database to be released at the beginning of the commit flush (which is the end of the database transformation based on data that was read by the transaction, which is why you hold shared read locks)

Page 38: Fundamentals Of Transaction Systems - Part 1: Causality banishes Acausality (Clustered Database)

1-38

5. Basic Transparency when? where? how?

Gosling: distributed computing is not transparent to either failure or performance (once again !)

Transparent / opaque to whom? For an optimal RDBMS, at the kernel mode library programming level, there is no clustering or failure transparency, and the task is to provide transparency (whenever possible) to the layers above

For the vast majority of hardware and software failures, even most double failures, an optimal RDBMS seamlessly rolls along without aborting transactions, so that the applications don’t need to worry about those failures … but here are the three major types of failures that applications will see:

Page 39: Fundamentals Of Transaction Systems - Part 1: Causality banishes Acausality (Clustered Database)

1-39

5. Basic Transparency when? where? how?

(1) Occasionally, the seamless operations and functioning of the application are interrupted, as in the total loss (very rare double failure) of the cluster computers containing a resource manager (RM) … since the TM (transaction manager) library doesn’t know which tx touched which RM after all the copies of the RM-tx state are lost, because of CAB-WAL-WDV

The Nonstop RM uses a variation of WAL, which is the Write Ahead Log protocol (Gray/Reuter 10.3.7.6): WAL functions to guarantee that database blocks which get changed in the RM buffer cache must be written first to the log (serial writes are >= 10X faster, treating the log disk like a tape), before they ever get written to the database disk (random writes are slow)

Page 40: Fundamentals Of Transaction Systems - Part 1: Causality banishes Acausality (Clustered Database)

1-40

5. Basic Transparency when? where? how?

Then the database disk writes are scheduled to go out between the state checkpoints (what has changed since the last RM checkpoint) that the RM makes to the log every five minutes or so – these writes are not demand based, so they can go out in a leisurely fashion (until you get close to the next RM checkpoint time, then things get hectic)

The WAL variation that the Nonstop RM uses is the CAB-WAL-WDV protocol (the ordering is sketched after this list):

CAB – Checkpoint Ahead Buffer: the RM first checkpoints the log buffer to the RM backup process; the backup uses a neat trick to fabricate all the update locks from the log records, and if the primary dies, the backup can take over and will do the log write again (idempotence)

WAL – Write Ahead Log: then write the database changes to the log

WDV – Write Data Volume: leisurely write back the dirty database buffer cache blocks
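The essential point of CAB-WAL-WDV is the ordering, which the following trivial sketch pins down; the three callables are stand-ins for the checkpoint message to the backup RM, the buffered write to the log partition, and the data-volume write-back:

# Order-of-operations sketch of CAB-WAL-WDV. The order is what matters:
# the backup can redo the log write after a takeover (CAB), the log records
# reach the log before the data volume changes (WAL), and only then do the
# dirty cache blocks drain to the database disk (WDV).

def flush_in_cab_wal_wdv_order(log_buffer, dirty_blocks,
                               checkpoint_to_backup, write_to_log, write_data_volume):
    checkpoint_to_backup(log_buffer)   # CAB: backup RM can repeat the log write idempotently
    write_to_log(log_buffer)           # WAL: serial, buffered write to the log partition
    for block in dirty_blocks:
        write_data_volume(block)       # WDV: leisurely write-back of dirty cache blocks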

Page 41: Fundamentals Of Transaction Systems - Part 1: Causality banishes Acausality (Clustered Database)

1-41

5. Basic Transparency when? where? how?

The neat trick that the backup RM uses to restore update locks cannot restore shared read locks (the locks that protect against the write skew and wormhole problems of MVCC databases); this is the very reason that the TM library has to abort all the transactions in the cluster to restore the state of the database, and applications will then have to resubmit any uncommitted updates (as the current mini-batch in NASDAQ SuperMontage does on Nonstop)

Note that the RM could checkpoint shared read locks, but that would be a constant pain suffered to alleviate an only occasional irritation

(2) If the application process that begins a transaction dies, the tx gets aborted and work must be resubmitted

Page 42: Fundamentals Of Transaction Systems - Part 1: Causality banishes Acausality (Clustered Database)

1-42

5. Basic Transparency when? where? how?

(3) Finally, if there is a rare TM or total cluster crash, due to a double log disk failure or the failure to restart one of the cluster computers supporting the logging subsystem (like registry problems), then the optimal RDBMS transaction service must be restarted or a disaster recovery initiated (which is clearly not very transparent to applications)

So when/where/how is this transparent to the application?

In the fact that the optimal RDBMS has no wormholes in it due to the write skew problems of snapshot isolation databases that employ MVCC

In the consistent isolation view of the database outside of the transaction that either commits or aborts

Page 43: Fundamentals Of Transaction Systems - Part 1: Causality banishes Acausality (Clustered Database)

1-43

5. Basic Transparency when? where? how?

So when/where/how is this transparent to the application? (continued)

In the guarantee that if the transaction service says commit and the entire system takes a nosedive a nanosecond later, then the transaction data is there and it’s consistent

In the guarantee that if the transaction service says abort, then all transaction protected work is undone completely before any transaction locks are released

In that the application needs to do nothing but use a transaction to guarantee all of that consistency

Page 44: Fundamentals Of Transaction Systems - Part 1: Causality banishes Acausality (Clustered Database)

1-44

6. Basic Scalability

The original clustered database view of scalability came from David DeWitt and Jim Gray’s 1990 paper on database parallelism:

Speedup – when you can double the hardware and get the same work done in half the time

Scaleup – when you can double the hardware and get twice the work done in the same time

Nowadays ‘scaling up’ means roughly what speedup meant, and ‘scaling out’ means roughly what scaleup meant, although I’ve noticed that different people mean drastically different things when using the modern phrasing: the DeWitt and Gray terms had very precise meanings, and a scalable system does both

TR-90.9 Parallel Database Systems: The Future of Database Processing or a Passing Fad? <http://www.hpl.hp.com/techreports/tandem/TR-90.9.html>

Page 45: Fundamentals Of Transaction Systems - Part 1: Causality banishes Acausality (Clustered Database)

1-45

6. Basic Scalability

Scalability of database logging performance inside the Nonstop cluster and for disaster recovery is accomplished by a three phase commit flushing algorithm and the forced group commit write

The Nonstop RMs (called ‘DP2’) would not force-write database updates to the log (except in highly unusual circumstances); instead, those updates would be streamed to the input buffers of the log partitions (called ‘auxiliary audit trails’), using asynchronous and multi-buffered writes

Nonstop uses the WAL (write ahead log) protocol so that writes only have to be scheduled to the resource manager database disk every five minutes or so (their disk checkpoints are called ‘control points’), for nearly “in-memory” update database performance for the resource manager disk

Page 46: Fundamentals Of Transaction Systems - Part 1: Causality banishes Acausality (Clustered Database)

1-46

6. Basic Scalability

The combination of group commit and WAL yields just short of “in-memory” RDBMS performance, because of the

Five Minute Rule: Keep a data item in electronic memory if its access frequency is 5 minutes or higher; otherwise keep it in magnetic memory. (Gray/Reuter 2.2.1.3)

This rule was originally calculated for a 1KB page size; it still comes out to 5 minutes for a 64KB page size – and this gives us guidance as to roughly the right page size to use
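For reference, a hedged back-of-the-envelope version of the Gray/Putzolu break-even calculation is below; the formula shape follows their papers, but the input numbers are placeholders chosen only to land near five minutes, not measurements:

# Keep a page in memory when its re-reference interval is below the
# break-even interval between memory cost and disk-access cost.

def break_even_seconds(pages_per_mb_ram, accesses_per_second_per_disk,
                       price_per_disk, price_per_mb_ram):
    return (pages_per_mb_ram / accesses_per_second_per_disk) * \
           (price_per_disk / price_per_mb_ram)

# Illustrative only: 64KB pages (16 per MB), ~100 random I/Os per second per
# drive, and a disk-to-RAM price ratio of 2000:1.
print(break_even_seconds(16, 100, 2000, 1.0))   # ~320 seconds, roughly five minutes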

Page 47: Fundamentals Of Transaction Systems - Part 1: Causality banishes Acausality (Clustered Database)

1-47

6. Basic Scalability

At commit time, the Nonstop library transaction service induces explicit RM log flushing only when necessary, from the interrupt service level of the TM library (100 times cheaper than process message wakeups). In busy systems the RMs are stream-writing ahead continuously to the log, so that the transaction updates are almost always already flushed to the log when commit time comes (unless the transactions are tiny and unbuffered)

When flushes due to commit (and abort) are reported to the commit coordinator (for Nonstop, called the ‘TMF Tmp’) on a busy system, they are lumped together into a single and periodic forced write into the log, called a group commit

Page 48: Fundamentals Of Transaction Systems - Part 1: Causality banishes Acausality (Clustered Database)

1-48

6. Basic Scalability

The group commit write by the RDBMS commit coordinator is the one and only time in the system that the transactional database application absolutely must wait for the disk to spin and the drive head to move, and it’s a shared experience (and thereby scalable for the cluster’s transaction service)

So, why is writing to one log disk faster than writing in parallel to a bunch of RM data volume disks? If there is no other disk writehead-moving activity for that disk, and if we write it sequentially using big buffers with effective disk sector management: then by treating a disk like a tape we get 20-100 times the writing throughput (Gray/Reuter 2.2.1.2)

Page 49: Fundamentals Of Transaction Systems - Part 1: Causality banishes Acausality (Clustered Database)

1-49

6. Basic Scalability

Ultimately, however, you can easily generate more joint-serialized database log record blocks than one log disk can receive, so the optimal RDBMS log is vertically partitioned N-1 ways (on Nonstop, called the ‘merged audit trail’)

But you still only force write one group commit buffer to the log root (on Nonstop, the ‘master audit trail’) while streaming log blocks to the N-1 leaf log partitions

So, part of the configuration of an optimal RDBMS clustered transaction service is to assign RMs to log partitions. Reassigning RMs to log to different log partitions should not require the transaction service to be brought down, and needs to be performable online (several issues, too complex to discuss here)

Page 50: Fundamentals Of Transaction Systems - Part 1: Causality banishes Acausality (Clustered Database)

1-50

6. Basic Scalability

Let’s talk about how a swarm of RMs can be flushed for transaction commit (or abort):

Where each RM is flushing its log record contribution to a particular log partition (leaf)

And which log partition (leaf) is itself flushed during the group commit for the merged log (root)

To ensure scalability (that means both speedup and scaleup), all this flushing needs to occur:

Without causing unnecessary forced writes from RMs through their log partition input buffers

And without causing unnecessary forced flushes for already flushed or non-participating log partitions underneath the log root commit write

Page 51: Fundamentals Of Transaction Systems - Part 1: Causality banishes Acausality (Clustered Database)

1-51

6. Basic Scalability

Before an RM can do anything to a transaction-protected file, on behalf of a client request, it needs to be doing so on behalf of a valid transaction: first, the TID (transaction identifier) from the message header, which was sent under the client’s invocation of the transactional file system, is used when the RM performs a bracketing Check-In call to the TM (transaction management) library to create a crosslink element between the RM and the TID in the TM globals, where these connections are tracked for cluster transaction flushing by the TM library at the correct time

The crosslink stores a VSN (Volume Sequence Number), which is initially set to infinity (binary ones, or hex FFFFs), and that means that transaction work is in progress (similar to, but not quite the same as, the term ‘LSN’ from Gray/Reuter 9.3.3; more on the VSN below)

Page 52: Fundamentals Of Transaction Systems - Part 1: Causality banishes Acausality (Clustered Database)

1-52

6. Basic Scalability

As an aside, there will be at least one transaction flush in the cluster for commit or abort, and then potentially many flushes for successive undo attempts (so, TM library transaction flush broadcasts also have sequence numbers), until the transaction is successfully backed out; the recovery of successive backout attempts can get extraordinarily worse, since each attempt can apply, as undo, all the undo of the original transaction and all the undo of the previous attempts … this problem is solved by chaining and avoiding redundant undo records in the log, in the following patent by theoretician and expositor Jim Gray’s favorite practitioner, Franco Putzolu, et al:

Method for providing recovery from a failure in a system utilizing distributed audit [log records] <http://www.google.com/patents?id=L_IWAAAAEBAJ&dq=5,832,203>

Page 53: Fundamentals Of Transaction Systems - Part 1: Causality banishes Acausality (Clustered Database)

1-53

6. Basic Scalability

When an RM does an update, insert or delete of some row in an SQL table or an index entry, that item has to be contained in a cache block which was read from the disk and is now contained in the RM’s buffer cache: if it’s not in cache, it must be read into cache now, and that delay is the source of Jim Gray’s Five Minute Rule

Modifying that cache block atomically requires the increment of the 64-bit VSN counter for the RM: The VSN counts transactional database changes monotonically for this RM in the log partition that it streams changes to, such that {RMID (resource manager identity), VSN} pairs in that log partition’s history precisely measure the progress in flushing the log stream for this RM

Page 54: Fundamentals Of Transaction Systems - Part 1: Causality banishes Acausality (Clustered Database)

1-54

6. Basic Scalability

After incrementing the VSN: the RMID, the VSN, the TID, and the previous state and the new state of the database item that was changed

… these are logically described in an undo/redo log record and that record is inserted at the end of the RM’s log write buffer: some RDBMS products separate the redo and undo log, but since you need to read both to do RM crash recovery, and since you have already successfully scaled up (speedup) by partitioning the log, why complicate the log further? (We will discuss physical vs. logical redo, later on)

Page 55: Fundamentals Of Transaction Systems - Part 1: Causality banishes Acausality (Clustered Database)

1-55

6. Basic Scalability

When the RM is finished working on behalf of this transactional client file system request message, the RM makes the end-bracketing Check-Out call to the TM library with the VSN, which can have three kinds of values (a small bookkeeping sketch follows this list):

Infinity: binary 1s / hex FFFFs means that transaction work is in progress and a Check-Out call is expected soon

Zero: a transactional read of the data was done and will be replied to very soon (a shared read lock is held)

Positive and not Infinity: we changed some data and will reply very soon (an exclusive update lock is held)
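A hypothetical sketch of the Check-In/Check-Out bookkeeping described above; the field names and the infinity sentinel are invented for illustration and this is not the TM library interface:

# Crosslink bookkeeping held in TM globals, per {TID, RM} pair.
INFINITY = 0xFFFFFFFFFFFFFFFF   # "work in progress, Check-Out expected soon"

class TMGlobals:
    def __init__(self):
        self.crosslinks = {}        # (tid, rm_id) -> VSN
        self.highest_written = {}   # rm_id -> (highest VSN flushed to its log partition, LPTR)

    def check_in(self, tid, rm_id):
        self.crosslinks[(tid, rm_id)] = INFINITY

    def check_out(self, tid, rm_id, vsn):
        # vsn == 0: transactional read only (shared read lock held)
        # 0 < vsn < INFINITY: data changed up through this VSN (update lock held)
        self.crosslinks[(tid, rm_id)] = vsn

    def log_write_complete(self, rm_id, highest_vsn, lptr):
        # deposited by the RM after each completed log write
        self.highest_written[rm_id] = (highest_vsn, lptr)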

Page 56: Fundamentals Of Transaction Systems - Part 1: Causality banishes Acausality (Clustered Database)

1-56

6. Basic Scalability

The RM should have (at least) two log write buffers, so that it can be filling one with log records from changes to the database, while asynchronously writing the other to the input buffer of the log (synchronous writes only happen under some very narrow circumstances related to something being down or wrongly configured)

The log needs to be able to handle multiple simultaneous input messages from RMs and also be able to place them in a ring buffer, because many RMs are multi-buffering writes to it, and it is possible that an individual RM’s buffer messages can be queued out of order, and you would rather put a message aside than cancel it to force a retry

Page 57: Fundamentals Of Transaction Systems - Part 1: Causality banishes Acausality (Clustered Database)

1-57

6. Basic Scalability

The RM writes out its log buffer when it gets full enough (when things are busy), or when a transaction flush broadcast requests that the buffer be written (mostly happens when things are not very busy), or when someone configured the buffer cache to be too small (this is because of WAL - write ahead log): after the log write is complete, the RM records the highest VSN written in the log write, and the resulting LPTR (log pointer: 64-bit counter of total blocks written to this log partition) into TM globals with a TM library call

When an RM runs out of memory, it tries to write back dirty cache blocks to the database disk, but this requires that other things be done first because of the CAB-WAL-WDV policy (checkpoint ahead buffer-write ahead log-write data volume, in that order)

Page 58: Fundamentals Of Transaction Systems - Part 1: Causality banishes Acausality (Clustered Database)

1-58

6. Basic Scalability

This is not very scalable, so don’t under-configure the memory size of the RM buffer cache

Why?, because an out-of-memory RM must checkpoint the log write buffer to the backup RM (CAB), to be able to synchronously write the log buffer to the input buffer of the log (WAL), and then to be able to write back cache blocks whose log records have not been flushed to the log yet (WDV)

Stepping away from the issues of a sick data volume, after all the work for the transaction has been done, the user asks the file system to commit the work, and under that call the TM library does a cluster group commit broadcast, which is not much different from an abort broadcast (caused either by direct abort invocation or spontaneously from some failure or anomaly)

Page 59: Fundamentals Of Transaction Systems - Part 1: Causality banishes Acausality (Clustered Database)

1-59

6. Basic Scalability

As yet another aside, when the Nonstop software stack was ported to Windows NT clusters in the late 1990s, an ultra-fast and ultra-reliable version of the network transaction flushing and two-phase commit broadcast was written to utilize the UDP or TCP group broadcast service (multicast) on Ethernet, which did unicasts to complete unreplied multicast messages for incredible scalability; see the patent:

That port was very successful, and the boxed release shipped twice to the Paris Stock Exchange, but then it was mysteriously pulled back by Compaq (the Nonstop mainframe people didn’t complain much)

Transaction state broadcast method using a two-stage multicast in a multiple processor cluster <http://www.google.com/patents?id=pOEIAAAAEBAJ&dq=6,247,059>

Page 60: Fundamentals Of Transaction Systems - Part 1: Causality banishes Acausality (Clustered Database)

1-60

6. Basic Scalability

Getting back to our scenario, out of the commit broadcast a transaction flush packet arrives in the computer’s packet service, which calls the TM library in kernel mode (all traps off, no virtual memory swaps, locked-down memory and code), and that call runs through the TID’s (transaction’s) list of crosslinks in TM library globals, checking the VSN values:

Infinity: binary 1s / hex FFFFs means that transaction work is in progress, and this should not be happening for a commit flush (after ‘commit work’ has been called): call a fail-fast halt to save the database from corruption due to a ‘late RM check-in’

Page 61: Fundamentals Of Transaction Systems - Part 1: Causality banishes Acausality (Clustered Database)

1-61

6. Basic Scalability

Checking the crosslinks’ VSN values (continued):

Zero: data was only read (a shared read lock is held): wake up the RM to release locks now; this RM is flushed

Positive and not Infinity: data was modified (an exclusive update lock is held): now you have some work to do

For every RM you find on the crosslink list for this TID that has a positive VSN:

If the crosslink VSN is less than or equal to the highest VSN written for this RM from log writing (the RM deposited that value in TM globals after completing the log write), then this RM is flushed for this TID; carry on to the next

Page 62: Fundamentals Of Transaction Systems - Part 1: Causality banishes Acausality (Clustered Database)

1-62

6. Basic Scalability

For every RM you find on the crosslink list for this TID that has a positive VSN (continued):

If the crosslink VSN is greater than the highest VSN written for this RM from log writing, then wake up this RM to flush its log writing buffers until the crosslink VSN is less than or equal to the highest VSN written for this RM, and then make a TM library call to deposit this RM’s {LPID (log partition identity), LPTR (log pointer)} pair into the TM TID structure which holds the crosslink list

Page 63: Fundamentals Of Transaction Systems - Part 1: Causality banishes Acausality (Clustered Database)

1-63

6. Basic Scalability

Once the TID’s crosslink list is run through to the point that all the RMs are flushed and have been awakened to release shared read locks, the TM library replies back to the commit broadcaster that this computer is flushed for this TID, with the concise list of {LPID, MAX LPTR} pairs flushed.

Once the TM library that initiated the commit flush broadcast has gotten the replies back from all the computers in the cluster (someone may have aborted), the TID results (i.e., committed/aborting) and the concise list of {LPID, MAX LPTR} pairs for the whole cluster are sent to the computer containing the TM commit coordinator (for Nonstop, called the ‘TMP’)
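Putting the last few slides together, a sketch of the per-computer flush pass over one TID’s crosslinks might look like the following; it reuses the shape of the earlier TM-globals sketch, and wake_rm_release_locks, wake_rm_flush_log, fail_fast_halt and log_partition_of are stand-ins, not real interfaces:

# Per-computer handling of a commit-flush broadcast for one TID.
INFINITY = 0xFFFFFFFFFFFFFFFF   # same sentinel as in the earlier TMGlobals sketch

def flush_tid_on_this_computer(tm, tid, wake_rm_release_locks, wake_rm_flush_log, fail_fast_halt):
    flushed = {}  # LPID -> max LPTR needed for this TID on this computer
    for (t, rm_id), vsn in tm.crosslinks.items():
        if t != tid:
            continue
        if vsn == INFINITY:
            fail_fast_halt("late RM check-in after commit work")  # protect the database
            continue                                              # (never reached if the halt really halts)
        if vsn == 0:
            wake_rm_release_locks(rm_id)          # read-only: just drop the shared read locks
            continue
        highest, lptr = tm.highest_written.get(rm_id, (0, 0))
        if vsn > highest:
            highest, lptr = wake_rm_flush_log(rm_id, vsn)   # RM flushes its buffers until vsn is covered
        lpid = log_partition_of(rm_id)            # assumed mapping of RMs onto log partitions
        flushed[lpid] = max(flushed.get(lpid, 0), lptr)
    return flushed                                # replied to the broadcaster as the {LPID, MAX LPTR} list

def log_partition_of(rm_id):
    return rm_id % 4   # placeholder assignment of RMs to log partitions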

Page 64: Fundamentals Of Transaction Systems - Part 1: Causality banishes Acausality (Clustered Database)

1-64

6. Basic Scalability

The TM commit coordinator wakes up on a timer, which is set tiny if things are not busy in the cluster (not enough traffic to get anything out of group commit), medium if there is enough business (response time decreases when you go as a group at this point, like metering lights on the freeway) and shorter as business picks up (at peak rates, the commit timer should be as short as will allow maximum throughput and minimum response time)

If you want microscopic response times, you would use RM-only transactions, which are focused on one RM and only flush that RM, not the whole cluster

TR-88.1 Group Commit Timers and High-Volume Transaction Systems<http://www.hpl.hp.com/techreports/tandem/TR-88.1.html>

Page 65: Fundamentals Of Transaction Systems - Part 1: Causality banishes Acausality (Clustered Database)

1-65

6. Basic Scalability

When the TM commit coordinator has awakened, the committed/aborting flushed packets from the cluster since the last wakeup are scooped up and lumped together into a group commit (and abort)

First, the log partitions are sent a message to flush their log partition input buffers to their log disk, iff they were included in the joined and concise list of {LPID, MAX LPTR} pairs; otherwise they are not involved in any transaction in the current group committed/aborting list. When a log partition receives the message it will flush its log partition input buffer to disk, iff the MAX LPTR associated with this LPID is not already flushed to disk; otherwise it will reply OK (this LPID is flushed)

Page 66: Fundamentals Of Transaction Systems - Part 1: Causality banishes Acausality (Clustered Database)

1-66

6. Basic Scalability

When the last log partition that it sent flush requests to has replied OK, the TM commit coordinator will write the group of committed and aborting transaction state log records in the log write buffer, by doing a waited write to the log root, and when that acknowledgment comes back, the group commit is complete for all the transactions … note the following:

In a busy system (when it counts), every forced write, except the single commit write for every group commit, could have already been accomplished through streaming by the time it was requested (it’s only that we have noticed this by good bookkeeping)
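A compact sketch of one coordinator wakeup, under the same caveats (every callable is a stand-in): join the {LPID, MAX LPTR} lists, nudge only the log partitions that are actually behind, then do the single forced group-commit write to the log root:

# One TM commit coordinator wakeup, lumping flushed transactions into a group commit.

def group_commit_wakeup(pending, partition_flushed_lptr, flush_partition, force_write_root):
    """pending: list of (tid, outcome, {lpid: max_lptr}) gathered since the last wakeup."""
    needed = {}                                     # joined, concise {LPID: MAX LPTR} list
    for _tid, _outcome, lptrs in pending:
        for lpid, lptr in lptrs.items():
            needed[lpid] = max(needed.get(lpid, 0), lptr)

    for lpid, lptr in needed.items():
        if partition_flushed_lptr.get(lpid, 0) < lptr:
            flush_partition(lpid, lptr)             # only partitions that are behind get a forced flush
        # otherwise the partition would just reply OK: already flushed by streaming

    state_records = [(tid, outcome) for tid, outcome, _ in pending]
    force_write_root(state_records)                 # the one waited write: the group commit itself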

Page 67: Fundamentals Of Transaction Systems - Part 1: Causality banishes Acausality (Clustered Database)

1-67

6. Basic Scalability

Final notes on group commit (continued):

That one buffered and forced write of transaction state log records to the log root by the TM commit coordinator is comparatively tiny, and all the transactions in the system piggyback together on that one timer-driven delay: it is shared and periodic, like a rapid heartbeat

If any VSN or log pointer information is lost by an RM or log partition takeover, we will have to force flush the log partitions during commit, for a while

If you can’t stand that wait for the log to do all this flush coordination and a serial write, then use RM-only transactions

Page 68: Fundamentals Of Transaction Systems - Part 1: Causality banishes Acausality (Clustered Database)

1-68

6. Basic Scalability

When transactions span the network to other clusters (or heterogeneously to other vendor systems), the commit coordinators on the two or more clusters do non-blocking three-phase commit to guarantee the joint commit or abort of the distributed transaction

Optimal RDBMS distributed commit performance will do 60% of the local maximum transaction rate across as many nodes as the customer needs. That “scaling out” (Gray’s scaleup) is accomplished by a method called “Mother-May-I”, and is described in two Nonstop patents:

Hybrid method for flushing transaction state in a fault-tolerant clustered database <http://www.google.com/patents?id=rUt4AAAAEBAJ&dq=7,028,219>

Method for handling node failures and reloads in a fault tolerant clustered database supporting transaction registration and fault-in logic <http://www.google.com/patents?id=S-d3AAAAEBAJ&dq=6,990,608>

Page 69: Fundamentals Of Transaction Systems - Part 1: Causality banishes Acausality (Clustered Database)

1-69

6. Basic Scalability

If transactions have locality of reference and only touch the local node with no lock conflicts (which means that no joint serialization is necessary), then an optimal RDBMS will do “scaling up” (Gray’s speedup) to nearly the full 100% level

Based on the scalability of the cluster and the partitioned log, an optimal RDBMS replication service will consistently maintain the database on a remote cluster with only 1% DB overhead + 4% network messaging overhead on the primary cluster

The optimal RDBMS replication service will consume more of the remote cluster applying the updates (between 15% and 25%).

More than these listed values for overhead is a bug

Page 70: Fundamentals Of Transaction Systems - Part 1: Causality banishes Acausality (Clustered Database)

1-70

7. Basic Availability outage minutes -> zero

What is availability? On some systems it is defined as the existence of a working Unix or Linux shell prompt

On some databases (Oracle) it has been quoted only on database software-produced outages, as though hardware and operating system-produced outages that are not tolerated by the database system are somehow not really happening to the customers

Page 71: Fundamentals Of Transaction Systems - Part 1: Causality banishes Acausality (Clustered Database)

1-71

7. Basic Availability outage minutes -> zero

In an optimal RDBMS, availability is measured in terms of database queuing: if you can begin a transaction, and queue up for a lock on any part of the database under that transaction with the likelihood of actually getting that lock and then accessing that data, then that data is considered available

If you can’t do all that on some part of the database, then that part of the database is actually unavailable

Availability, in Highleyman's “Breaking the Availability Barrier”, p. 32:

Availability = MTBF/(MTBF + MTR)

where any 'mean time before failure' will return an availability of 1 (eternally up), if the 'mean time to repair' is zero.
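As a quick worked example of that formula, using the 12-year MTBF and 30-minute repair time quoted on the next slide (approximate unit conversions only):

# Availability = MTBF / (MTBF + MTR), worked with the slide's own figures.

def availability(mtbf_hours, mtr_hours):
    return mtbf_hours / (mtbf_hours + mtr_hours)

twelve_years = 12 * 365 * 24                 # overall MTBF, in hours
print(availability(twelve_years, 0.5))       # 30-minute repair -> roughly five-and-a-half nines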

Page 72: Fundamentals Of Transaction Systems - Part 1: Causality banishes Acausality (Clustered Database)

1-72

7. Basic Availability outage minutes -> zero

Tandem’s Nonstop TMF has had excellent fault tolerance out-of-the-box for nearly 25 years, with non-blocking three-phase commit coordination between NonStop cluster nodes: when the Tmp process or its CPU dies, the backup Tmp takes over with no perceptible outage, no loss of state, and no transactions being aborted, to the tune of 5½ nines of availability, or 12 years overall MTBF (restarting after total system failure due to double failure makes a 30 minute repair time). Add RDF to make 7 nines in the British banking system (Mosher), or an astonishing 38 years overall MTBF (RDF repair time is consistently under 2 minutes)

IBM Parallel Sysplex, using mainframe DB2 says (2003) that they can do 50 years overall MTBF (open your wallet wide for IBM services, because that’s definitely not out-of-the-box)

Page 73: Fundamentals Of Transaction Systems - Part 1: Causality banishes Acausality (Clustered Database)

1-73

7. Basic Availability outage minutes -> zero

An optimal RDBMS should drastically exceed this level of fault tolerance, going to more than 1 million years overall MTBF and twelve nines (with a 30 second repair time), and this should not require expensive legacy services or onerously complex configuration: it should work that way out-of-the-box

Tandem NonStop was the first to make all the common database operations seamlessly transparent online: SQL partition reorganize, split, and merge; partition move to another disk; changing the log partition that an SQL partition logs to; and SQL catalog changes (add, modify, and delete tables; add, modify, and delete fields). All of these operations can be done mostly without changing applications, and even without modifying query plans from the optimizer (many systems have hundreds of query plans stored, and customers hate recompiling all that)

Page 74: Fundamentals Of Transaction Systems - Part 1: Causality banishes Acausality (Clustered Database)

1-74

7. Basic Availability outage minutes -> zero

The four SQL online partition operations (reorganize, split, merge, and move) all start with a rollforward from an online (fuzzy) dump, applying changes (redo) from the log from dump time forward until nearly caught up, then slowing user updates at the end to finish catching up (avoiding infinite overtaking). You could call this 'recovery in place'
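A toy, runnable sketch of that catch-up loop; every name and number here is hypothetical (this is not NonStop code), but it shows the shape: roll forward, let the log tail advance, and only throttle writers for the final gap.

```cpp
#include <cstdint>
#include <cstdio>

// Toy simulation of 'recovery in place'; all names are hypothetical stand-ins
// for the real dump/redo/throttle machinery described above.
namespace sim {
    std::uint64_t logEnd = 1'000'000;   // tail of the log partition (keeps growing)
    bool throttled = false;

    std::uint64_t restoreOnlineDump() { return 400'000; }   // LSN at fuzzy-dump time
    std::uint64_t currentLogEnd() { return logEnd; }
    void userWorkload() {                                    // writers keep appending redo
        static std::uint64_t burst = 200'000;                // incoming redo shrinks as we catch up
        if (throttled) { logEnd += 100; return; }
        burst /= 2;
        logEnd += burst;
    }
    // Roll the new partition forward by applying redo records in [from, to).
    std::uint64_t applyRedoSince(std::uint64_t from, std::uint64_t to) {
        std::printf("redo applied: %llu .. %llu\n",
                    (unsigned long long)from, (unsigned long long)to);
        return to;
    }
    void switchOver() { std::printf("switch-over: new partition is live\n"); }
}

int main() {
    const std::uint64_t nearlyCaughtUp = 10'000;   // "close enough to the tail"
    std::uint64_t pos = sim::restoreOnlineDump();

    // Phase 1: chase the tail while users run at full speed.
    while (sim::currentLogEnd() - pos > nearlyCaughtUp) {
        std::uint64_t end = sim::currentLogEnd();
        pos = sim::applyRedoSince(pos, end);
        sim::userWorkload();                       // more redo arrives meanwhile
    }
    // Phase 2: briefly slow user updates so redo cannot be overtaken forever.
    sim::throttled = true;
    pos = sim::applyRedoSince(pos, sim::currentLogEnd());
    sim::switchOver();
    sim::throttled = false;
    return 0;
}
```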

An optimal RDBMS will operate in the same seamless way; moreover, transaction abort, RM recovery (data volume recovery), and media failure recovery (archive recovery) will all do their job without writing through to the database disks, which runs up to 100 times faster (only rebuilding the RM disk buffer cache)

This guarantees minimal outage times (MTR) on the functions of the database which are subject to these more visible operational outages

Page 75: Fundamentals Of Transaction Systems - Part 1: Causality banishes Acausality (Clustered Database)

1-75

7. Basic Availability outage minutes -> zero

Much of this is possible because NonStop uses logical keys instead of record pointers (RIDs) to interconnect the btree pages/blocks …

IBM’s mainframe DB2 uses RIDs to connect the leaf levels in btrees, while an optimal RDBMS uses logical keys: this allows btrees to be moved without modification of the btree data, whereas IBM RIDs are only valid at that disk address and need to be remapped to move them. This causes holes in the implementation of utility functions for availability, known as the 'Halloween Problem' in the old days
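A toy contrast of the two leaf-entry designs (hypothetical structures, not DB2 or NonStop internals): a RID is only meaningful at a fixed physical address, while a logical key survives the block being moved.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// RID-style reference: pins the row to a physical page/slot, so moving or
// reorganizing a partition invalidates it until every RID is remapped.
struct Rid {
    std::uint32_t pageNumber;   // physical page on a specific disk/partition
    std::uint16_t slotNumber;   // slot within that page
};

// Key-style reference: names the row by its logical primary key, so btree
// blocks can be split, merged, or moved without rewriting the btree data.
struct KeyRef {
    std::string primaryKey;     // e.g. "account:000123456"
};

int main() {
    Rid rid{4711, 3};                    // breaks as soon as page 4711 is relocated
    KeyRef key{"account:000123456"};     // still resolvable after any relocation
    std::printf("rid=(%u,%u) key=%s\n",
                (unsigned)rid.pageNumber, (unsigned)rid.slotNumber, key.primaryKey.c_str());
    return 0;
}
```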

IBM resolved Halloween by employing the NonStop method (the two patents are nearly identical) for the SQL partition operations, including the rollforward part and the infinite-overtaking part

Page 76: Fundamentals Of Transaction Systems - Part 1: Causality banishes Acausality (Clustered Database)

1-76

7. Basic Availability outage minutes -> zero

However, RIDs disallow using SQL cursors against the base tables since the RIDs can be remapped asynchronously underneath the app. In DB2, cursors can only be used against snapshot isolated copies of the base tables. That RID infantile appendage has an impact on function

This is why an optimal RDBMS will split, merge, and move partitions, and reorganize the database on the fly without any availability outage, using archive recovery interfaces to pull the updates to the old partition from the log in real time and apply them to the new partition, and then switch over when the log tail is near, while SQL cursors are still active

Page 77: Fundamentals Of Transaction Systems - Part 1: Causality banishes Acausality (Clustered Database)

1-77

7. Basic Availability outage minutes -> zero

The SQL ‘recovery in place’ approach to seamless operations allows completely retry-able and transparent fault tolerance for RDBMS utility functions experiencing failures. (So you don’t have to dump the entire database before and after utilities run, as has historically been the case with Oracle)

However … if enough failures of the intolerable kind occur simultaneously … your database can become unavailable, or worse, unusable … so what, then?

The first thing in availability is to reduce your MTR (mean time to repair). In a transaction system that means to bounce back up quickly after a crash, and that means knowing what transactions and locks are outstanding in the cluster.

Page 78: Fundamentals Of Transaction Systems - Part 1: Causality banishes Acausality (Clustered Database)

1-78

7. Basic Availability outage minutes -> zero

IBM's mainframe DB2 uses a piece of special hardware called the CF (coupling facility), which acts as a smart memory to store shared buffers and locks. The CF is a mainframe processor running a special OS called CFCC. Then, each time cluster SMPs fail, the database elements needed for a quick restart are right there

NonStop does not use special hardware in this way. Their innovation is to store their locks at the end of the log (in the last 5-minute RM checkpoint) and quickly reacquire those locks, allowing RMs that have not failed to continue doing business while the RMs requiring recovery get processed in some critical order; see this patent:

Minimum latency reinstatement of database transaction locks<http://www.google.com/patents?id=9Lx6AAAAEBAJ&dq=7,100,076>

Page 79: Fundamentals Of Transaction Systems - Part 1: Causality banishes Acausality (Clustered Database)

1-79

8. Application/Database Serialized Consistency: the database must be serialized wherever it goes

So what is database consistency? It’s like pH: the higher the ACIDity, the stronger the database

The letters in ACID stand for Atomic, Consistent, Isolated, and Durable

A is for Atomicity and means all or nothing: the database everywhere must end up in a state whose visibility to the world outside the transaction is first the old state, then the new state (in the case of commit), or the state remains unchanged (in the case of abort)

Page 80: Fundamentals Of Transaction Systems - Part 1: Causality banishes Acausality (Clustered Database)

1-80

C is for Consistency, which seems like a circular definition, but actually it should probably be spelled ASID, because consistency in database work is really accomplished by serialization (as seen from below)

The database (on reputable systems) is really in the log, the RM disks are a convenient cache whose disk image is only rarely in a consistent state (only after a correctly completed shutdown of the optimal RDBMS transaction service, at which point the RM disk is fairly unusable)

Serialization in the log is defined by the exclusive existence of serialized transaction histories without wormholes, and no other kind

8. Application/Database Serialized Consistency: the database must be serialized wherever it goes

Page 81: Fundamentals Of Transaction Systems - Part 1: Causality banishes Acausality (Clustered Database)

1-81

A transaction history starts in the log with the first update log record for a transaction

Then there’s a series of update records (btree block splits have multiple physical redo log records in a string)

Then there are one or more commit log records, xor one or more abort log records (either commit or abort, never both)

Then, in the case of an abort, there are one or more undo log records

Finally, there are one or more forgotten log records to terminate the transaction history
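That record sequence can be checked mechanically; here is a minimal sketch (hypothetical record kinds, not an actual log format) that validates a single transaction's history against the order just listed.

```cpp
#include <cstddef>
#include <vector>

// Log record kinds for a single transaction's history (sketch).
enum class Rec { Update, Commit, Abort, Undo, Forgotten };

// True if the per-transaction record sequence is well formed: one or more
// updates, then commit(s) XOR abort(s), undo records only on the abort path,
// and one or more forgotten records to terminate the history.
bool wellFormedHistory(const std::vector<Rec>& h) {
    std::size_t i = 0;
    while (i < h.size() && h[i] == Rec::Update) ++i;        // update records
    if (i == 0) return false;                               // history starts with an update
    bool committed = false, aborted = false;
    while (i < h.size() && (h[i] == Rec::Commit || h[i] == Rec::Abort)) {
        if (h[i] == Rec::Commit) committed = true; else aborted = true;
        ++i;
    }
    if (committed == aborted) return false;                 // commit xor abort, at least one
    if (aborted)
        while (i < h.size() && h[i] == Rec::Undo) ++i;      // undo only on the abort path
    std::size_t forgottenStart = i;
    while (i < h.size() && h[i] == Rec::Forgotten) ++i;     // forgotten record(s) last
    return i == h.size() && i > forgottenStart;
}

int main() {
    // Commit path: updates, commit, forgotten.
    std::vector<Rec> ok  = {Rec::Update, Rec::Update, Rec::Commit, Rec::Forgotten};
    // Wrong: undo records on the commit path.
    std::vector<Rec> bad = {Rec::Update, Rec::Commit, Rec::Undo, Rec::Forgotten};
    return (wellFormedHistory(ok) && !wellFormedHistory(bad)) ? 0 : 1;
}
```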

8. Application/Database Serialized Consistency: the database must be serialized wherever it goes

Page 82: Fundamentals Of Transaction Systems - Part 1: Causality banishes Acausality (Clustered Database)

1-82

The big thing about serialized transaction histories in the log is that even though they are mostly interspersed in order (concurrency), you must never have the case where a log record touching data for one transaction is historically interspersed with a log record touching that very same data for another transaction

This is called a wormhole (Gray/Reuter 7.5.8.1), and it occurs when a transaction is either not well-formed (well-formed means covering reads with shared read locks and updates with exclusive update locks) or not two-phase (two-phase means first acquiring all locks, then releasing them). “A transaction history is isolated if, and only if, it has no wormhole transactions.” - Jim Gray
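One way to make the wormhole test concrete is a standard serialization-graph check; the sketch below (my construction, not the Gray/Reuter sorting procedure itself) treats every touch of the same data item by two different transactions as a conflict, adds an ordering edge, and reports a wormhole if the edges form a cycle.

```cpp
#include <functional>
#include <map>
#include <set>
#include <string>
#include <vector>

// A log record, reduced to what matters here: which transaction touched
// which data item, in log (timestamp) order. Every touch counts as a conflict,
// matching the simplified statement above.
struct LogRec { int tx; std::string item; };

bool hasWormhole(const std::vector<LogRec>& log) {
    std::map<int, std::set<int>> after;               // earlier tx -> txs that must come later
    std::map<std::string, std::vector<int>> touched;  // item -> txs seen so far, in log order
    for (const LogRec& r : log) {
        for (int earlier : touched[r.item])
            if (earlier != r.tx) after[earlier].insert(r.tx);
        touched[r.item].push_back(r.tx);
    }
    // Depth-first search for a cycle: a cycle means some transaction runs both
    // "before" and "after" another one, i.e. a wormhole.
    std::set<int> done, onPath;
    std::function<bool(int)> cyclic = [&](int tx) {
        if (onPath.count(tx)) return true;
        if (done.count(tx)) return false;
        onPath.insert(tx);
        auto it = after.find(tx);
        if (it != after.end())
            for (int next : it->second)
                if (cyclic(next)) return true;
        onPath.erase(tx);
        done.insert(tx);
        return false;
    };
    for (const auto& entry : after)
        if (cyclic(entry.first)) return true;
    return false;
}

int main() {
    // T1 and T2 interleave on items "a" and "b" in opposite orders: a wormhole.
    std::vector<LogRec> log = { {1, "a"}, {2, "a"}, {2, "b"}, {1, "b"} };
    return hasWormhole(log) ? 0 : 1;
}
```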

8. Application/Database Serialized Consistency: the database must be serialized wherever it goes

Page 83: Fundamentals Of Transaction Systems - Part 1: Causality banishes Acausality (Clustered Database)

1-83

You must be able to sort (using Jim Gray's sorting method in Gray/Reuter 7.5.8.1) the entire log by transaction timestamp to the same effect as the original log when replayed back into the database at recovery time: yielding the GOLD STANDARD of transaction systems:

Wormhole-Free Transaction Histories

Strict two phase locking (S2PL) transaction concurrency is both well-formed and two-phase on purpose.

8. Application/Database Serialized Consistency: the database must be serialized wherever it goes

Page 84: Fundamentals Of Transaction Systems - Part 1: Causality banishes Acausality (Clustered Database)

1-84

Multiversion concurrency control (MVCC) does not enforce the well-formed part, because shared read locks are not used (even in repeatable read mode in most implementations I've seen)

MVCC databases (Oracle, SQL Server, Sybase, Postgres, MySQL) basically employ forms of snapshot isolation, which allow transactions to work on a private snapshot version of the DB, so most locking is unnecessary. For concurrent transaction users, blocking only rarely occurs, such as when a record is deleted or the primary key is updated.

8. Application/Database Serialized Consistency: the database must be serialized wherever it goes

Page 85: Fundamentals Of Transaction Systems - Part 1: Causality banishes Acausality (Clustered Database)

1-85

MVCC databases can create wormholes by write skew: for instance, if two concurrent transactions read two different row.column values and then update each other's previously read row.column values.

You can only do one of three things when this kind of conflict occurs:
1. Block one tx (which MVCC can't do at all well)
2. Abort one tx (which some MVCCs do), or
3. Corrupt the database integrity (mostly this is what is done)
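Here is the write-skew scenario above in runnable form (a toy example with an assumed invariant x + y >= 0; the numbers are made up):

```cpp
#include <cstdio>

// Toy write-skew demo (snapshot isolation, no shared read locks).
// Invariant the application cares about: x + y >= 0.
int main() {
    int x = 60, y = 60;

    // Both transactions take their private snapshots of the same initial state.
    int t1_x = x, t1_y = y;     // T1's snapshot
    int t2_x = x, t2_y = y;     // T2's snapshot

    // T1: "x + y = 120, so withdrawing 100 from y is safe" -> writes y.
    if (t1_x + t1_y >= 100) y = t1_y - 100;
    // T2: "x + y = 120, so withdrawing 100 from x is safe" -> writes x.
    if (t2_x + t2_y >= 100) x = t2_x - 100;

    // The two writes touched different columns, so snapshot isolation lets both
    // commit; the invariant both transactions checked is now broken.
    std::printf("x=%d y=%d x+y=%d\n", x, y, x + y);   // x=-40 y=-40 x+y=-80
    // Under S2PL, each read above would hold a shared lock on both x and y, so the
    // first exclusive-lock request would block (or deadlock and abort) the other tx.
    return 0;
}
```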

You know, there really isn’t any magic here

8. Application/Database Serialized Consistency: the database must be serialized wherever it goes

Page 86: Fundamentals Of Transaction Systems - Part 1: Causality banishes Acausality (Clustered Database)

1-86

Mostly what MVCC database users end up doing is single-threading, by two kinds of means:

1. Sharding: making many isolated databases, also called database federation (invented by Microsoft for the TPC-C benchmark in the late 1990s)

Then, single-threading the shards by:

Using single-threaded frameworks on dynamic languages (Ruby/Rails, Groovy/Grails, Python/Django, PHP/various)

Or by breaking up a hot spot, like the bank branch balance in the TPC-C transactions, and making many, many single-threaded bank apps in accomplishing the benchmark

8. Application/Database Serialized Consistency: the database must be serialized wherever it goes

Page 87: Fundamentals Of Transaction Systems - Part 1: Causality banishes Acausality (Clustered Database)

1-87

Mostly what MVCC database users end up doing is single-threading, by various means (continued):

2. Alternatively, creating a towering application stack (EAI, enterprise application integration) that:

Filters every transaction and single-threads the consistency space

Queues and single-threads transactions that might conflict

Maintains a complicated and partially sharded, partially duplicated database schema

To prevent the possibility of write skew, by making applications impossible to develop and maintain by end users, who become the 'end-losers'

That is necessary, because the RDBMS cannot protect itself from its applications: it is totally vulnerable to corruption by concurrent access

8. Application/Database Serialized Consistency: the database must be serialized wherever it goes

Page 88: Fundamentals Of Transaction Systems - Part 1: Causality banishes Acausality (Clustered Database)

1-88

What all these single-threading methods accomplish is to effectively convert an MVCC RDBMS model to a fragile application SS2PL model: strong strict two phase locking, where even read locks are held until the transaction updates are flushed to disk

So, when a MySQL executive recently said that the era of the Jim Gray database was passing, it was only true for Web 2.0, not for the critical enterprise or critical computing, where the answers really matter and the big money is at risk

8. Application/Database Serialized Consistency: the database must be serialized wherever it goes

Page 89: Fundamentals Of Transaction Systems - Part 1: Causality banishes Acausality (Clustered Database)

1-89

Of the MVCC databases, only Microsoft SQL Server can completely pass the TPC-E benchmark, which checks for inconsistencies (so they are doing something cute):

If you do aborts and raise the concurrency, you end up aborting more concurrent transactions: 2 of 3, then 3 of 4, etc.

If you try to catch inconsistencies on-the-fly and then single-thread their schedule … well, you should have blocked on shared reads to begin with

There is no guarantee these detection methods work; using timestamps is racy and only sharpens the guillotine blade. Once you start to get asymptotically close to correct, you and your users will start to trust your erroneous implementation

8. Application/Database Serialized Consistency: the database must be serialized wherever it goes

Page 90: Fundamentals Of Transaction Systems - Part 1: Causality banishes Acausality (Clustered Database)

1-90

So, what does all this S2PL get you? Isn't it slow?

Nonstop demonstrated the converse in the ZLE benchmarks on 15 nodes with 200K TPS of updates, with hundreds of queries constantly running against the base tables, and with trickle batch. It was their finest hour. You need to be smart and a good database designer, but it can be done. And without isolating fractured databases.

The first wonderful thing you get from S2PL is the magic of distributed database for free.

8. Application/Database Serialized Consistency: the database must be serialized wherever it goes

Page 91: Fundamentals Of Transaction Systems - Part 1: Causality banishes Acausality (Clustered Database)

1-91

Imagine 10, 20, 100 clusters of database side by side in a massive group. Now allow them to share transactions with global tx IDs that have local tx IDs for each system. The RMs on the different nodes do not interact or scheme to serialize their updates in any way

Now go at the database concurrently with a thousand transactions and just try to make a wormhole: you can't do it

What's stopping you from corrupting the database? The RMs aren't coordinating, the commit coordination only orders the transaction state log records, not the RDBMS update records, so how is it protecting itself?

8. Application/Database Serialized Consistency: the database must be serialized wherever it goes

Page 92: Fundamentals Of Transaction Systems - Part 1: Causality banishes Acausality (Clustered Database)

1-92

In an MVCC log, you have what is called “log serialization”, which is only serialized as far as the log is concerned, which is insufficient. (Which is why you have to single-thread the applications).

In an S2PL log you get “application serialization”, where the applications serialize their own updates, by waiting for shared read and exclusive update locks. The applications themselves are doing that magical, unseen coordination

8. Application/Database Serialized Consistency: the database must be serialized wherever it goes

Page 93: Fundamentals Of Transaction Systems - Part 1: Causality banishes Acausality (Clustered Database)

1-93

So no matter how many clusters are involved in sharing distributed transactions in the massive group, if all of their logs are simply merged together (sorted by transaction-timestamp) they should yield a joined log with total joint serialized histories for all the transactions anywhere in the entire contiguous computational universe of optimal RDBMS clusters

This is why you need to do three-phase commit coordination between nodes: simply to guarantee that all the update records everywhere are actually on a log disk somewhere when we terminate or “Forget” the global transaction

8. Application/Database Serialized Consistency: the database must be serialized wherever it goes

Page 94: Fundamentals Of Transaction Systems - Part 1: Causality banishes Acausality (Clustered Database)

1-94

This also allows you to do replication of many cluster primary databases to many cluster backup databases, and actually do a takeover and make that work as well, up to the last serialized transaction history commit. Only NonStop RDF does this as of now, and that possibility only arises because of S2PL concurrency control

8. Application/Database Serialized Consistency: the database must be serialized wherever it goes

Page 95: Fundamentals Of Transaction Systems - Part 1: Causality banishes Acausality (Clustered Database)

1-95

I is for Isolation and should probably be spelled ACLD, because isolation in database really means locking (as seen from below)

For optimal RDBMS RM files, transaction duration locks are either exclusive update locks which block reads and updates, or shared read locks which block updates only (there are four other kinds of locks in a Nonstop system: held for session, message and operation duration)

Locks are only released after all of their associated transaction database work has ceased, and once the totality of database changes has hit the log disk: the locks are the fingers of the correctly ordered database in the log, reaching out to the cache copy of the database and guaranteeing serialized behavior in the applications interacting through that cache copy of the database
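Those two transaction-duration lock modes reduce to a one-line compatibility rule; a minimal sketch (the session, message, and operation-duration locks mentioned above are left out):

```cpp
// Transaction-duration lock modes (sketch).
enum class LockMode { SharedRead, ExclusiveUpdate };

// Shared read locks are compatible with each other and block only updates;
// exclusive update locks block both reads and updates.
bool compatible(LockMode held, LockMode requested) {
    return held == LockMode::SharedRead && requested == LockMode::SharedRead;
}
```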

8. Application/Database Serialized Consistency: the database must be serialized wherever it goes

Page 96: Fundamentals Of Transaction Systems - Part 1: Causality banishes Acausality (Clustered Database)

1-96

Hence, because of transactional isolation:

Applications communicate with each other using simple relational propositional logic (Queries) ...

creating, changing or discarding truthful propositions (Rows) ...

in shared repositories of mutually agreed upon truths (Tables) ...

through the database in complete, uninterrupted compound sentences (Transactions)

pausing in real-time, only to hear the complete, uninterrupted compound sentences of other concurrent applications (Locks)

otherwise, running at warp speed, unhindered by any other blockage to performance (Minimum Latency)

8. Application/Database Serialized Consistency: the database must be serialized wherever it goes

Page 97: Fundamentals Of Transaction Systems - Part 1: Causality banishes Acausality (Clustered Database)

1-97

Finally, D is for Durability and means that once an optimal RDBMS transaction service says it’s done, it’s really done

If the ENDTRANSACTION procedure call returns OK, and one nanosecond later the entire installation crashes, the data is there and it’s correct

For optimal RDBMS disaster recovery, that means that when it’s done on the database on the primary site, and after all the log records reach the remote site, it’s done there, too

8. Application/Database Serialized Consistency: the database must be serialized wherever it goes

Page 98: Fundamentals Of Transaction Systems - Part 1: Causality banishes Acausality (Clustered Database)

1-98

So, given that we have a safe copy of the database transaction history (wormhole-free) in the log and a periodic collection of archive dumps of the files, we can do database recovery, and there are two kinds:

RM restart after TM restart (Gray/Reuter 11.4.2/TM, 11.4.3/RM or 11.4.6/Unified, 11.4.7-10/tricks): this is called ‘Volume Recovery’ on Nonstop; it is needed after a crash, because a clean shutdown, pushing out all dirty cache blocks, does not require recovery to start up the database volume

RM archive recovery (Gray/Reuter 11.5-6): this is called ‘File Recovery’ on Nonstop and is done after a media failure by restoring a fuzzy online dump, applying the log redo, and then undoing incomplete transactions

9. Recovery putting it all back together again

Page 99: Fundamentals Of Transaction Systems - Part 1: Causality banishes Acausality (Clustered Database)

1-99

RM restart after TM restart: The transaction manager rebuilds its state from the last run by scanning the log root from the beginning of the penultimate TM checkpoint (last two): since each TM checkpoint contains all the transaction state records for transactions that did not generate log records since the last TM checkpoint, this guarantees building a complete snapshot list

As the state transitions are traversed for each transaction in the log, the state continually changes until the Forgotten state is reached, when the TM throws that transaction away

9. Recovery putting it all back together again

Page 100: Fundamentals Of Transaction Systems - Part 1: Causality banishes Acausality (Clustered Database)

1-100

9. Recovery putting it all back together again

These are the transaction states, which are stored as records in the log root (the tense is crucial):

Active state: only seen in the log if a working transaction gets caught by the periodic TM checkpoint

Prepared state: seen in a transaction which came in from a remote cluster (parent), and which is in the middle of a 2 or 3 phase commit

Committed state: seen in a transaction which has touched remote clusters (children), and which is at the end of a 2 or 3 phase commit

Aborting state: like I said

Forgotten state: this transaction is now durably going away

Page 101: Fundamentals Of Transaction Systems - Part 1: Causality banishes Acausality (Clustered Database)

1-101

Transaction state record transitions in the log root:

Active or nil -> forgotten (hurried, lock release after): local commit, neither parents nor children

Active or nil -> prepared (hurried, locks held): distributed commit, definitely having parents, maybe having children

Active or nil -> committed (hurried, lock release after): distributed commit, no parents, definitely having children

Active or nil -> aborting (hurried, locks held): maybe local or distributed

9. Recovery putting it all back together again

Page 102: Fundamentals Of Transaction Systems - Part 1: Causality banishes Acausality (Clustered Database)

1-102

Transaction state record transitions in the log root (continued):

Prepared -> forgotten (hurried, lock release after): distributed commit, definitely having parents, but no children

Prepared -> committed (hurried, lock release after): distributed commit, definitely having both parents and children

Prepared -> aborting (hurried, locks held): distributed commit, definitely having parents, maybe having children

9. Recovery putting it all back together again

Page 103: Fundamentals Of Transaction Systems - Part 1: Causality banishes Acausality (Clustered Database)

1-103

Transaction state record transitions in the log root (continued):

Committed -> forgotten: distributed commit, maybe having parents, definitely having children

Aborting -> aborting (hurried, locks held): try, try again to abort

Aborting -> forgotten (hurried, lock release after): maybe local or distributed
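Collected in one place, the legal transitions listed above amount to a small table; a minimal sketch with hypothetical names (this is not the log-root record format):

```cpp
// Transaction state records as they appear in the log root (sketch).
enum class TxState { Nil, Active, Prepared, Committed, Aborting, Forgotten };

// Legal transitions per the preceding slides. Anything else found while
// scanning the log root would indicate a corrupted transaction history.
bool legalTransition(TxState from, TxState to) {
    switch (from) {
        case TxState::Nil:
        case TxState::Active:
            return to == TxState::Forgotten || to == TxState::Prepared ||
                   to == TxState::Committed || to == TxState::Aborting;
        case TxState::Prepared:
            return to == TxState::Forgotten || to == TxState::Committed ||
                   to == TxState::Aborting;
        case TxState::Committed:
            return to == TxState::Forgotten;
        case TxState::Aborting:
            return to == TxState::Aborting || to == TxState::Forgotten;
        case TxState::Forgotten:
            return false;                       // terminal state
    }
    return false;
}
```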

9. Recovery putting it all back together again

Page 104: Fundamentals Of Transaction Systems - Part 1: Causality banishes Acausality (Clustered Database)

1-104

Actions to take depending on transaction states in the TM after the restart scan of the last two TM checkpoints in the log root:

Active: Locks must be held to abort this transaction

Prepared: request the commit/abort decision from the parent cluster, locks must be held on the RM so that the RM can be made available, in case the answer is abort, which will apply undo

Committed: notify all children clusters about the commit; after they have all responded, go Forgotten and discard the transaction (no locks are held for the Committed state on the TM’s home cluster)

Aborting: Locks must be held to abort this transaction

Forgotten: discard the transaction
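A sketch of that restart decision as a switch over the transaction state found in the scan; the helper functions are placeholders, not actual TM interfaces:

```cpp
#include <cstdio>

enum class TxState { Active, Prepared, Committed, Aborting, Forgotten };

// Placeholder helpers standing in for real TM work.
void holdLocksAndAbort(int tx)       { std::printf("tx %d: hold locks, drive abort\n", tx); }
void askParentForOutcome(int tx)     { std::printf("tx %d: hold locks, ask parent commit/abort\n", tx); }
void notifyChildrenCommitted(int tx) { std::printf("tx %d: notify children of commit, then forget\n", tx); }
void discard(int tx)                 { std::printf("tx %d: discard\n", tx); }

// What the TM does with each transaction found by scanning the last two
// TM checkpoints in the log root after a crash.
void restartAction(int tx, TxState s) {
    switch (s) {
        case TxState::Active:    holdLocksAndAbort(tx);       break;  // locks held until undo done
        case TxState::Prepared:  askParentForOutcome(tx);     break;  // locks held in case of abort
        case TxState::Committed: notifyChildrenCommitted(tx); break;  // no locks held at home cluster
        case TxState::Aborting:  holdLocksAndAbort(tx);       break;
        case TxState::Forgotten: discard(tx);                 break;
    }
}

int main() {
    restartAction(101, TxState::Prepared);
    restartAction(102, TxState::Committed);
    return 0;
}
```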

9. Recovery putting it all back together again

Page 105: Fundamentals Of Transaction Systems - Part 1: Causality banishes Acausality (Clustered Database)

1-105

9. Recovery putting it all back together again

If there is no lock reinstatement during crash recovery for the RMs, then the Prepared state transactions must be resolved by the TM by communicating with the parents of all Prepared transactions and committing or aborting them all, before bringing the database up

Lock reinstatement arises from RM periodic checkpointing, by appending to the RM log checkpoint all the locks that were not released or otherwise written to the log as redo records since the last RM log checkpoint, such that traversing the last two RM log checkpoints after a crash can rebuild the lock table (update locks only; these are sufficient for Prepared state management) from those checkpoints and the redo records
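A sketch of rebuilding the update-lock table from the last two RM checkpoints plus the redo records written since them; all structures and names here are hypothetical:

```cpp
#include <map>
#include <string>
#include <vector>

// Minimal stand-ins for log content (hypothetical, not real record formats).
struct HeldLock     { int tx; std::string item; };     // carried in an RM checkpoint
struct RmCheckpoint { std::vector<HeldLock> unreleasedLocks; };
struct RedoRec      { int tx; std::string item; };     // an update implies an update lock

// Rebuild the update-lock table after a crash: locks listed in the last two RM
// checkpoints, plus locks implied by redo records written since those checkpoints.
// Exclusive update locks only, which is enough to manage Prepared transactions.
std::map<std::string, int>
reinstateLocks(const std::vector<RmCheckpoint>& lastTwoCheckpoints,
               const std::vector<RedoRec>& redoSinceCheckpoints) {
    std::map<std::string, int> lockTable;              // item -> tx holding the update lock
    for (const RmCheckpoint& cp : lastTwoCheckpoints)
        for (const HeldLock& l : cp.unreleasedLocks)
            lockTable[l.item] = l.tx;
    for (const RedoRec& r : redoSinceCheckpoints)
        lockTable[r.item] = r.tx;
    return lockTable;
}
```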

Page 106: Fundamentals Of Transaction Systems - Part 1: Causality banishes Acausality (Clustered Database)

1-106

9. Recovery putting it all back together again

If there is lock reinstatement, then the database can be brought online with Prepared transactions awaiting resolution and also by aborting the remaining (Aborting and Active state transactions) while online

The resource manager (RM) rebuilds its state by starting a redo scan of the log partition from the redo low water mark (retrieved from the TM) to the end of the log, applying all redo (original transaction forward work) into RM cache blocks for any transactions that were alive when the crash occurred, reinstating locks along the way if that is supported

Page 107: Fundamentals Of Transaction Systems - Part 1: Causality banishes Acausality (Clustered Database)

1-107

9. Recovery putting it all back together again

If lock reinstatement is supported, the RM can be brought online at the end of the redo run, otherwise it can only be brought online after Prepared transactions are resolved and all the aborts are completed

The redo low water mark for an RM points to the earliest log partition write that has not been lazy-written to the data volume using the CAB(hurried)-WAL(hurried)-WDV(lazy) protocol

After the redo scan is complete, and using the list of Active and Aborting transactions remaining from the TM, the log partition is traversed from the end in reverse all the way to the undo low water mark, applying undo for the transactions until all of those transactions are completely undone: this can easily go beyond the last two RM checkpoints, and even beyond the redo low water mark - you hope all the undo is online and not on a T9840 StorageTek tape on a shelf somewhere
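A compressed sketch of that two-pass RM restart (redo forward from the redo low water mark, then undo backward to the undo low water mark for the transactions left on the TM's Active/Aborting list); every name here is hypothetical:

```cpp
#include <set>
#include <vector>

// Hypothetical log record for the sketch.
struct Rec {
    int  tx;
    bool isRedo;      // forward work (redo); undo is derived from it for doomed txs
    long lsn;         // position in the log partition
};

struct RmCache { /* RM disk buffer cache being rebuilt */ };

void applyRedo(RmCache&, const Rec&) { /* reapply forward work into cache blocks */ }
void applyUndo(RmCache&, const Rec&) { /* back out forward work for a doomed tx */ }

// logPartition is assumed ordered by lsn; redoLow/undoLow are the low water
// marks handed over by the TM; doomed is the set of Active/Aborting txs.
void rmRestart(RmCache& cache, const std::vector<Rec>& logPartition,
               long redoLow, long undoLow, const std::set<int>& doomed) {
    // Pass 1: redo scan, forward from the redo low water mark to the end of the log.
    for (const Rec& r : logPartition)
        if (r.lsn >= redoLow && r.isRedo)
            applyRedo(cache, r);
    // Pass 2: undo scan, backward from the end toward the undo low water mark,
    // backing out every record belonging to the doomed transactions.
    for (auto it = logPartition.rbegin(); it != logPartition.rend(); ++it)
        if (it->lsn >= undoLow && doomed.count(it->tx) && it->isRedo)
            applyUndo(cache, *it);
}
```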

Page 108: Fundamentals Of Transaction Systems - Part 1: Causality banishes Acausality (Clustered Database)

1-108

9. Recovery putting it all back together again

If the current log pointers for the log root and the log partitions are periodically broadcast to the TM library on every computer in the cluster, then when any transaction is begun, the transaction undo low water mark can be sent back to the TM and then that part of the log can be made to remain online and not go off to tape, to make aborting the transaction later (if it comes to that) easier, or possible

If lock reinstatement is supported, the database can be brought back online in 30 seconds or less, as opposed to 15 minutes or more to resolve Prepared and complete all the aborts

Page 109: Fundamentals Of Transaction Systems - Part 1: Causality banishes Acausality (Clustered Database)

1-109

9. Recovery putting it all back together again

RM archive recovery (if the file is trashed): To recover a file from the archive, first you restore the ‘fuzzy’ online dump from the most recent time before the target time that you want the file consistency to be brought up to

After the dump is in place in the RM, archive recovery applies all the redo for that file from the RM’s log partition, starting at a point in the log partition immediately before the online dump was initiated, and proceeding up to the log pointer or timestamp of the target file that you want to recover to

Traversing the log backwards from that final location (which is usually the end), archive recovery then applies all the undo for the transactions that were incomplete at the target time, back to the undo low water mark: and that’s it