TRANSCRIPT
DB-15: Inside The Recovery Subsystem
Plan to commit; Be prepared to rollback.
Richard Banville, Fellow, Technology and Product Architecture
Progress OpenEdge
© 2007 Progress Software Corporation2 DB-15: Inside the Recovery Subsystem
Recovery Types
Transaction Recovery*
• Before image rollback/undo and crash recovery
Hard Failure Recovery
• Roll forward after images
• Point in time, transaction, retry
Coordinated distributed txn consistency
• OpenEdge® 2PC - Prepare Phase, Commit Phase
Heterogeneous distributed txn consistency (JTA)
• External distributed transaction coordinator
• Requires application changes
• Available for OpenEdge SQL only
* Before Imaging is the focus of this presentation
Agenda
• The BI Units of Measure
• Some Simple Rules
• General Processing (the fun stuff)
• Reliability Switches
• Summary
BI Layout: Notes and Blocks
Notes are the basis for recording change in the database
BI made up of many Notes
Notes are variable sized
Notes are organized in order of operation
Notes are stored into BI blocks
BI block size can be customized (1 KB to 16 KB)
I/O is performed in units of the BI block size
BI Layout: Clusters
Blocks are grouped to form a cluster
BI cluster size can be customized (16KB – 256MB)
Size affects checkpoint frequency (among other things)
Clusters are allocated as needed
Clusters are logically joined and ordered into a ring
Only ever one cluster accepting BI writes
BI Layout: Storage
[Diagram: BI file extents]
The Primary Recovery Area:
BI data stored in the extents of area #2 of the database
It grows as needed
Space is re-used when possible
What’s in a note?
Trid: 81180 code = RL_RMCR version = 2
Trid: 81180 area = 8 dbkey = 14528 update counter = 4770
Header
• Length & note version
• Note code/identifier
• Associates action
• Note type
• Transaction Id
Note Specific Info
• Block pointer & area
• Block update counter
• Record #
• Table number
• Size of record
• Split information
Data Portion (if needed)
• Block change data
• i.e., Record data itself
• Only if needed
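As a rough illustration of the three parts of a note, the sketch below models one in Python. The field names follow the slide and the sample values come from the note shown above; the class names, the types, the `length` value and the `describe()` format are assumptions made for illustration, not the OpenEdge on-disk layout.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NoteHeader:
    """Fields the slide lists under 'Header' (types are assumed)."""
    length: int    # made-up value below; the slide gives no length
    version: int
    code: str      # note code/identifier, e.g. "RL_RMCR"
    trid: int      # transaction id


@dataclass
class BINote:
    """Hypothetical in-memory model of a BI note, not the real format."""
    header: NoteHeader
    area: int                      # note specific info: area
    dbkey: int                     # note specific info: block pointer
    update_counter: int            # note specific info: block update counter
    data: Optional[bytes] = None   # data portion, only if needed

    def describe(self) -> str:
        h = self.header
        return (f"Trid: {h.trid} code = {h.code} version = {h.version}\n"
                f"Trid: {h.trid} area = {self.area} dbkey = {self.dbkey} "
                f"update counter = {self.update_counter}")


note = BINote(NoteHeader(length=64, version=2, code="RL_RMCR", trid=81180),
              area=8, dbkey=14528, update_counter=4770)
print(note.describe())
```

The variable-sized `data` field is left empty here because the sample note on the slide shows no data portion.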
Rules to live by
#1 - Write ahead logging (WAL)
• Recovery log notes written BEFORE data
  – Assures atomic and durable transactions
  – BI, AI - reliable write I/O
  – Can relax data write I/O
    • Write prior to BI-reuse
    • Cluster close
    • Missing data applied by redo
    • Deferring writes allows multiple updates to occur with a single I/O
#2 - Write ordering rule (FS and hardware)
• AI, BI writes get to disk in order requested
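Rule #1 can be sketched with a toy model: the note is recorded first, and a data block is only written after every note that touched it has reached the log on disk. The class `ToyWAL`, its method names and the sequence numbers are hypothetical; this is a simplified illustration of write-ahead logging in general, not OpenEdge code.

```python
class ToyWAL:
    """Toy model: a note must be on disk before the data block it describes."""

    def __init__(self):
        self.log_buffer = []   # notes recorded but not yet on "disk"
        self.log_disk = []     # notes safely written
        self.data_cache = {}   # block -> (value, seq of last note touching it)
        self.data_disk = {}

    def change(self, block, value):
        seq = len(self.log_disk) + len(self.log_buffer)
        self.log_buffer.append((seq, block, value))   # record the note first ...
        self.data_cache[block] = (value, seq)         # ... then change the block in memory
        return seq

    def flush_log(self, up_to_seq):
        # Notes go to disk in the order they were recorded (rule #2).
        while self.log_buffer and self.log_buffer[0][0] <= up_to_seq:
            self.log_disk.append(self.log_buffer.pop(0))

    def write_block(self, block):
        value, seq = self.data_cache[block]
        self.flush_log(seq)              # WAL: the log reaches disk before the data
        self.data_disk[block] = value


wal = ToyWAL()
wal.change("blk-1", "a")
wal.change("blk-1", "b")    # second update to the same block, still no data I/O
wal.write_block("blk-1")    # one data write covers both updates
print(len(wal.log_disk), wal.data_disk)
```

Because the data write is deferred, the two updates to the same block cost two log notes but only one data block write.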
Rules to follow
#3 - BI Space Reuse
• Only when cluster is closed
• Cluster closes when its last transaction ends
  – Checkpoint DOES NOT close a cluster
  – Checkpoint occurs when cluster fills up
#4 - Exclusive Block Access
• When changing data in database
#5 - Atomic Physical Changes
• Such as block chain manipulations
• Enforced by internal TXE mechanism
• SYSTEM ERROR: User 5 died during micro txn.
Rules were meant to be broken
#6 - Without exception:
• All DB changes are recorded in recovery log.
Exception:
• Control Area (area #1) changes are not logged.
  – Why should I care?
  – Allows structural changes w/o affecting recovery
    • Such as adding space while in roll forward.
  – Recovery Mechanism: Builddb
Forward Processing
So you want to perform a database action
Locate/Lock the data block to change
• Not all notes require a block
  – Transaction begin, end
• Not all DB changes require a block!
  – Acquiring additional space
  – Certain index sub-operations
Ensure begin transaction recorded
Record the change in the BI log (via the BI buffer pool)
BI Buffer Pool – Recording a change
[Diagram: BI buffer pool (-bibufs 10) during forward processing. Labels: New Notes (Actions), Current Output Buffer, Modified Queue, Free List, BI file]
PROMON:
• Total BI Writes
• Records (notes) written
• Busy buffer waits
• Empty buffer waits
• Partial Writes
Is it OK to buffer dirty BI blocks?
YES
Is it OK to buffer committed BI data?
Delayed commit is up to you!
Forward Processing (continued)
The BI Note has been written…
Finally perform the DB action (make the change)
• Logical, physical or a mix
Data block’s update ctr is incremented
• Identifies if a noted change made it to disk yet
• Ensures changes re-applied in order
Dependency counter maintained in ctlr struct
• Ensures associated BI flushed if -B eviction
User may be forced to do (expensive) BI I/O
• On -B eviction or no BI buffers available
• Avoid with APWs, BIW and -bibufs
Helping avoid OLTP BI I/O
Broker Processing
[Diagram: BI buffer pool (-bibufs 10) and the Broker. Labels: New Notes (Actions), Current Output Buffer, Modified Queue, Free List, BI file]
Delayed commit (Durability)
• Based on -Mf value, Broker may flush BI buffers to disk
• For aged txn ends
PROMON:
• Total BI Writes
• Records (notes) written
• Partial Writes
BIW Processing
[Diagram: BI buffer pool (-bibufs 10) and the BIW. Labels: New Notes (Actions), Current Output Buffer, Modified Queue, Free List, BI file]
PROMON:
• Total BI Writes
• Records (notes) written
• BIW Writes
• Partial Writes
APW Processing
[Diagram: APW processing. Labels: BI buffer pool (-bibufs 10), New Notes (Actions), Current Output Buffer, Modified Queue, Free List, BI file, APW, Checkpoint Queue, Data Blocks, Associated BI Note (dependency ctr), WAL, db]
BI Clusters And Checkpointing
The Precious Ring
[Diagram: BI files holding the BI cluster layout, database, -B buffer pool, -bibufs (Modified Queue, Current Out Buffer)]
BI blocks are grouped together to form a cluster of blocks.
The clusters are logically joined together in a ring.
Checkpoint – Synchronization point
[Diagram: BI files and BI cluster layout, database, -B buffer pool, -bibufs (Modified Queue, Current Out Buffer), File System Cache]
All Database Changes Halted!
• Db buffer pool scanned
• Db buffers previously marked for chkpt are written out (OUCH!)
• Dirty buffers are marked for chkpt & put on checkpoint queue
  – Fuzzy checkpointing avoids I/O
• File system cache is synchronized
  – No more sync delay
• BI buffer pool flushed
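The marking scheme behind fuzzy checkpointing can be sketched as follows. Buffers that are dirty at a checkpoint are only marked, and they are written during a later checkpoint only if nothing (an APW, for example) wrote them in between. The dictionary layout and the `checkpoint` and `apw_write` functions are hypothetical names for this illustration, and the block numbers are arbitrary.

```python
def checkpoint(buffers):
    """Fuzzy checkpoint sketch: return the blocks that had to be written now."""
    flushed = []
    for blk in buffers:
        if blk["marked"]:
            # Marked at the previous checkpoint and still unwritten: write it now.
            blk["dirty"] = False
            blk["marked"] = False
            flushed.append(blk["id"])
        elif blk["dirty"]:
            # Newly dirty: defer the write, put it on the checkpoint queue.
            blk["marked"] = True
    return flushed


def apw_write(blk):
    """An APW writes a block between checkpoints, so the checkpoint need not."""
    blk["dirty"] = False
    blk["marked"] = False


buffers = [
    {"id": 128, "dirty": True, "marked": False},
    {"id": 172, "dirty": True, "marked": False},
    {"id": 256, "dirty": False, "marked": False},
]
first = checkpoint(buffers)    # nothing was marked before, so nothing is written
apw_write(buffers[0])          # the APW gets to block 128 in time
second = checkpoint(buffers)   # block 172 is still marked and must be flushed
print(first, second)
```

In this model the blocks returned by `checkpoint` correspond to the writes the slide calls out with "OUCH!": the fewer of them, the shorter the halt.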
Checkpoint (with –directio)
[Diagram: BI files and BI cluster layout, database, -B buffer pool, unbuffered I/O]
All Database Changes Halted!
• Db buffer pool scanned
• Db buffers marked for chkpt are written out
• Dirty buffers are marked for chkpt & put on checkpoint queue
  – Fuzzy checkpointing avoids I/O
• BI buffer pool flushed
The APW
The APWs help w/checkpoints too
[Diagram: APW writing to the db from the APW Queue, the Checkpoint Queue and the -B Buffer Pool]
PROMON:
• Buffers Flushed at checkpoint
• BIW Writes
Checkpoint – Size Does Matter
Larger cluster sizes
• Fewer checkpoints (sync points)
  – Will a crash result in additional lost data?
• Longer recovery time
  – Recovery starts at last cluster - 1
• Longer BI format time (runtime)
• Longer BI format time after truncate
  – Use at least one fixed length extent
    • Also use a variable length extent
  – Use bigrow
Checkpoints and Promon
Seeing is believing…
Ckpt                      ------ Database Writes ------
No.  Time      Len  Freq  Dirty  CPT Q  Scan  APW Q  Flushes
27   10:23:12    4     0    384     52     0      0        0
26   10:22:46   25    26    381    381     0      0        0
25   10:22:18   27    28    380    380     0      0        0
24   10:21:50   27    28    346    158   201      0        0
23   10:21:21   28    29    372    360   115      0        0
Ooops!!
Len: begin to end time - Time cluster was actively available for writes
Freq: begin time to begin time - Time between checkpoints
Dirty: # data blocks newly updated – not incremented when “made dirtier”
Time spent performing checkpoint operation: Freq - Len
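As a worked example of the `Freq - Len` arithmetic, the sketch below recomputes it from the PROMON sample above. The rows are copied from the table; the variable and function names are illustrative only. Checkpoint 27 is skipped because it is still the current one and has no Freq yet.

```python
from datetime import datetime

# (checkpoint no., begin time, Len, Freq) copied from the PROMON sample
rows = [
    (27, "10:23:12", 4, 0),
    (26, "10:22:46", 25, 26),
    (25, "10:22:18", 27, 28),
    (24, "10:21:50", 27, 28),
    (23, "10:21:21", 28, 29),
]


def parse(t):
    return datetime.strptime(t, "%H:%M:%S")


checkpoint_time = {}
for newer, older in zip(rows, rows[1:]):
    no, begin, length, freq = older
    # Freq is begin time to begin time of the next checkpoint.
    gap = int((parse(newer[1]) - parse(begin)).total_seconds())
    assert gap == freq
    # Time spent performing the checkpoint operation: Freq - Len.
    checkpoint_time[no] = freq - length

print(checkpoint_time)   # {26: 1, 25: 1, 24: 1, 23: 1}
```

In this sample each completed checkpoint took about one second out of a 26 to 29 second interval.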
APW Specific Activity…
CPT Q: # data buffers APW wrote from checkpoint queue (from prev chkpt)
Scan: # data buffers APW wrote while scanning -B
APW Q: # data buffers APW wrote from APW Q
Dirty buffers added to APWQ from -B LRU eviction
To be avoided…
Flushes: Number of blocks written during checkpoint
(marked from previous checkpoint)
Len: Checkpointing too often should be avoided
Reusing space in the BI file
BI Space Reuse
[Diagram: ring of BI clusters stored in the BI files]
When can BI space be reused?
• Changes have been written to data files
• No open transactions in cluster
  – Why? If an outstanding transaction were to roll back, where would the undo action come from?
  – Checkpoint DOES NOT close a cluster!!
• No need to “Age” cluster anymore
  – -G 0 vs -G 60: thanks, fdatasync()
BI files grow to some working set size
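The reuse rule can be sketched as a small ring model: the next cluster in the ring is reused only if its changes are in the data files and no open transaction has notes in it; otherwise a new cluster is added. The classes `Cluster` and `ClusterRing`, the point at which the ring grows and the transaction id are assumptions for illustration.

```python
class Cluster:
    def __init__(self, number):
        self.number = number
        self.open_txns = set()       # transactions with notes in this cluster
        self.data_written = False    # noted changes have reached the data files

    def reusable(self):
        return self.data_written and not self.open_txns


class ClusterRing:
    def __init__(self, initial=4):
        self.ring = [Cluster(n) for n in range(1, initial + 1)]
        self.current = 0             # index of the one cluster accepting writes

    def next_cluster(self):
        """Move to the next cluster: reuse it if allowed, else grow the ring."""
        nxt = (self.current + 1) % len(self.ring)
        if self.ring[nxt].reusable():
            self.ring[nxt].data_written = False   # starts a fresh life
        else:
            nxt = self.current + 1
            self.ring.insert(nxt, Cluster(len(self.ring) + 1))
        self.current = nxt
        return self.ring[nxt]


ring = ClusterRing(initial=4)
for c in ring.ring:
    c.data_written = True
ring.ring[0].open_txns.add(7)   # a long-running transaction began in cluster 1
ring.current = 3                # cluster 4 has just filled up

grown = ring.next_cluster()     # cluster 1 is still open, so cluster 5 is added
ring.ring[0].open_txns.discard(7)
reused = ring.next_cluster()    # transaction 7 ended, cluster 1 can be reused
print(grown.number, reused.number, len(ring.ring))
```

Once long transactions end, the ring stops growing, which is the "working set size" behavior the slide describes.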
Rollback
Rollback Processing
[Diagram: BI buffer pool (-bibufs 10) during rollback. Labels: Current Output Buffer, Modified Queue, Free List, Current Input Buffer, Backout Buffers, BI file, .lbi]
PROMON:
• Input buffer hits
• Output buffer hits
• Mod buffer hits
• Busy buffer waits
• Total BI Reads
• Notes read
ABL sub transaction rollback: ABL requests compensating action
Read backwards & UNDO until tx begin
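"Read backwards & UNDO until tx begin" can be sketched with a toy log. The note layout (transaction id, kind, key, old value), the sample data and the `rollback` function are hypothetical, and the sketch leaves out the buffers and any notes the undo itself would record.

```python
# Each note: (transaction id, kind, key, value before the change).
log = [
    (1, "begin", None, None),
    (2, "begin", None, None),
    (1, "update", "x", 10),   # x was 10 before txn 1 changed it
    (2, "update", "y", 5),
    (1, "update", "x", 11),   # x was 11 before txn 1 changed it again
]
db = {"x": 12, "y": 6}


def rollback(trid, log, db):
    """Scan the log backwards, undoing this transaction's notes until its begin."""
    undone = 0
    for note_trid, kind, key, old in reversed(log):
        if note_trid != trid:
            continue              # notes of other transactions are left alone
        if kind == "begin":
            break                 # reached tx begin: rollback is complete
        db[key] = old             # restore the before image
        undone += 1
    return undone


undone = rollback(1, log, db)
print(undone, db)
```

Undoing in reverse order matters: x passes back through 11 before it returns to 10, its value before transaction 1 began.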
What about BOB?
[Diagram: BI buffer pool (-bibufs 10). Labels: Current Output Buffer, Modified Queue, Free List, Current Input Buffer, Backout Buffers, BI file]
PROMON:
• Input buffer hits
• Output buffer hits
• Mod buffer hits
• BO Buffer hits
Crash Recovery
Performed on each database startup
• Only needed phases performed
Brings DB up to last known consistent state
• Physically sound
• In-flight transactions rolled back
• Missing committed transactions re-applied
Physical Redo
Bring DB up to point of crash
[Diagram: Before-Image Log. Labels: Oldest active txn, Last Recorded Note, redo phase - forward scan]
• Find last active cluster and back up one
• Apply notes based on updctr
• No BI notes generated during redo
Log messages:
*** Begin Physical Redo Phase, 4 at 0.
*** Physical Redo Phase Completed at block, off, upd…
*** At end of Physical Redo, txn table is 128
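The update-counter check used during physical redo can be sketched as below. It assumes each note carries the update counter the block has after the change; that detail, the dictionary layout and the sample dbkeys and values are assumptions for illustration.

```python
def physical_redo(notes, blocks):
    """Forward scan: re-apply a note only if the block has not seen it yet."""
    applied = 0
    for dbkey, note_ctr, new_value in notes:    # notes are in log order
        block = blocks[dbkey]
        if block["updctr"] < note_ctr:          # the change never reached disk
            block["value"] = new_value
            block["updctr"] = note_ctr
            applied += 1
    return applied


# On-disk state after a crash: block 14528 was written after its 2nd change,
# block 14592 was never written.
blocks = {
    14528: {"updctr": 2, "value": "b"},
    14592: {"updctr": 0, "value": ""},
}
notes = [(14528, 1, "a"), (14528, 2, "b"), (14592, 1, "p"), (14528, 3, "c")]
applied = physical_redo(notes, blocks)
print(applied, blocks)
```

The two notes already reflected on disk are skipped, so redo is safe to repeat if recovery itself is interrupted.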
Physical Undo
Backout physical DB changes (if needed)
[Diagram: Before-Image Log. Labels: Oldest active txn, Last Note, redo phase - forward scan, Physical undo]
• Starts at crash point
• Undo physical and physiological notes
• Causes new BI notes to be generated
• Ends when 1st transaction end encountered
Log messages:
*** Begin Physical Undo 10 txns at block 128 offset 1608
*** Physical Undo Completed at 128 (block #)
Logical Undo
Backout all uncommitted transactions
[Diagram: Before-Image Log. Labels: Oldest active txn, Last Note, redo phase - forward scan, Physical undo, Logical undo backward scan]
• Starts where physical undo left off
• Undo logical and physiological notes
Log messages:
*** Begin Logical Undo Phase, 10 incomplete txns are being backed out.
*** Logical Undo Phase begin at Block 1136 offset 1608.
*** Logical Undo Phase Completed at Block 1135 offset 7743.
Switches: Reliability and Integrity
-I : No longer a valid parameter.
• Never had anything to do with crash recovery
-R : Default - Reliable BI I/O
• Writes bypass the FS cache
• Use for OLTP
*** Before-Image File I/O (-r -R): Reliable.
*** Crash Recovery (-i): Enabled.
Switches: Reliability and Integrity
-r : BI writes are buffered (un-reliable) to FS
• Well tuned system overshadows any gain of -r
• All notes recorded
• Rollback will work
• Crash recovery likely to work
• Recovery from OS crash will most likely fail
*** This session is running with the non-raw (-r) parameter.
*** Before-Image File I/O (-r -R): Not Reliable.
*** Crash Recovery (-i): Enabled.
*** An earlier -r session crashed, the database may be damaged.
Switches: Reliability and Integrity
-i : Does not record purely physical notes
• BI I/O is buffered (un-reliable) to FS
• No FS sync at checkpoint
• Rollback will work.
• OS or DB crash, abnormal termination
  – Must restore from backup
*** This session is being run with the no-integrity (-i) option.
*** Crash Recovery (-i): Not Enabled.
*** Before-Image File I/O (-r -R): Not Reliable.
Why provide it then?
Switches: Last Resort
-F (dash Foolish)
• Enter DB without recovery
• Use as a last resort
• Integrity NOT maintained
• Usually need to
  – Validate Data Integrity
  – Dump and load
Summary
• Recovery is a complex thing
• You can do things to improve the process
• We make it simple for you
Questions?
Thank you for your time!
Other recovery related Switches
-bi
-biblocksize
-directio
• No need for sync at checkpoint time
-bwdelay
-bibufs, -aibufs
-bistall, -bithold
Switches: Transactions
-Mf : Delayed commit
• # seconds a commit note can reside in -bibufs
• Some commits lost/Integrity Maintained
Group Commit Technique
• -groupdelay only runs w/-Mf 0
• Only in multi user mode
• # milliseconds to sleep at commit time
-G : # seconds to age cluster (use & re-use)
• No longer needed with fdatasync()