1, 2, 3…Scatter!
Getting Your Humpty-Dumpty Database in Order.
Tom Bascom, White Star Software
2
A Few Words about the Speaker
• Tom Bascom; Progress user & roaming DBA since 1987
• VP, White Star Software, LLC
  – Expert consulting services related to all aspects of Progress and OpenEdge.
• President, DBAppraise, LLC
  – Remote database management service for OpenEdge.
  – Simplifying the job of managing and monitoring the world’s best business applications.
3
“Fragmentation” vs. “Scatter”
4
$ proutil dbname -C dbanalys > dbname.dba
…
RECORD BLOCK SUMMARY FOR AREA "APP_FLAGS_Dat" : 95
-------------------------------------------------------
                        -Record Size (B)-  -Fragments-  Scatter
Table          Records   Size  Min  Max Mean    Count Factor   Factor
PUB.APP_FLAGS  1676180  47.9M   28   58   29  1676190    1.0      1.9
…
Fragmentation
• “Fragmentation” is splitting records into multiple pieces.
5
$ proutil dbname -C dbanalys > dbname.dba
…
RECORD BLOCK SUMMARY FOR AREA "APP_FLAGS_Dat" : 95
-------------------------------------------------------
                        -Record Size (B)-  -Fragments-  Scatter
Table          Records   Size  Min  Max Mean    Count Factor   Factor
PUB.APP_FLAGS  1676180  47.9M   28   58   29  1676190    1.0      1.9
…
Fragmentation
• “Fragmentation” is splitting records into multiple pieces.
• 10 additional fragments (1,676,190 fragments for 1,676,180 records). Beware!
6
Fragmentation Occurs When
• Record data is too big for the block, e.g. 16k of data going into a 4k block.
• Updated data needs more room to expand than is available.
• The “create limit” and the “toss limit” can be used to reserve more free space in blocks and control fragmentation.
• Progress will automatically “de-frag” when possible (10.1+).
7
Scatter
• “Scatter” is a measure of the “sequentialness” of records.
$ proutil dbname -C dbanalys > dbname.dba
…
RECORD BLOCK SUMMARY FOR AREA "APP_FLAGS_Dat" : 95
-------------------------------------------------------
                        -Record Size (B)-  -Fragments-  Scatter
Table          Records   Size  Min  Max Mean    Count Factor   Factor
PUB.APP_FLAGS  1676180  47.9M   28   58   29  1676190    1.0      1.9
…
8
What is “Scatter Factor”?
• The “factor” does not care about any ordering – not even primary key.
• For type 1 areas it is a measure of how well the data fits into the minimum # of blocks that would be required to hold it with “distance” between blocks taken into account.
• For type 2 areas there is no distance penalty – but free space in a cluster can increase scatter.
• “Logically adjacent” isn’t really reported by dbanalys.
9
Fragmentation and Scatter
• “Fragmentation” is splitting records into multiple pieces.
• “Scatter” is a measure of the “sequentialness” of records.
• The “scatter factor” that proutil reports might be better described as “density”.
• If you have two or more indexes at least one of them is probably scattered.
10
Why is This Important?
11
Locality of Reference
• When data is referenced there is a high probability that it will be referenced again soon.
• If data is referenced there is a high probability that “nearby” data will be referenced soon.
• Locality of reference is why caching exists at all levels of computing.
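To make the point concrete, here is a minimal sketch that replays block requests against a toy LRU cache. It is an illustration only, not how the Progress -B buffer pool is implemented, and the record counts, block size and cache size are made-up assumptions. It compares reading records that sit next to each other in blocks with reading the same records in a shuffled order.

```python
from collections import OrderedDict
import random


def hit_ratio(block_sequence, buffers):
    """Replay block requests against an LRU cache that holds `buffers` blocks."""
    cache = OrderedDict()
    hits = 0
    for block in block_sequence:
        if block in cache:
            hits += 1
            cache.move_to_end(block)       # mark as most recently used
        else:
            cache[block] = True            # read the block into the cache
            if len(cache) > buffers:
                cache.popitem(last=False)  # evict the least recently used block
    return hits / len(block_sequence)


# Hypothetical sizes, chosen only for the demonstration.
RECORDS, PER_BLOCK, BUFFERS = 10_000, 10, 50

# Records read in physical order: each block is visited 10 times in a row.
sequential = [r // PER_BLOCK for r in range(RECORDS)]

# The same reads in a scattered order.
scattered = sequential[:]
random.Random(1).shuffle(scattered)

seq_ratio = hit_ratio(sequential, BUFFERS)
scat_ratio = hit_ratio(scattered, BUFFERS)
print(f"sequential hit ratio: {seq_ratio:.2%}")
print(f"scattered hit ratio:  {scat_ratio:.2%}")
```

With neighbouring records in the same block, nine out of ten reads are served from the cache even though the cache is tiny. With the shuffled order, most reads miss.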
12
How Does Progress Help?
• Temporal Locality (will be reused “soon”)
  – -B (LRU Chain)
  – -mmax
  – -Bt
• Spatial Locality (“nearby” data will be accessed)
  – Type 2 storage areas
  – -B2 (no LRU)
  – Memory mapped prolib
13
Cache Effectiveness
Layer                 Time  # of Recs  # of Ops  Cost per Op  Relative
Progress 4GL to -B    0.96    100,000   203,473     0.000005         1
-B to FS Cache       10.24    100,000    26,711     0.000383        75
FS Cache to SAN       5.93    100,000    26,711     0.000222        45
-B to SAN Cache      11.17    100,000    26,711     0.000605       120
SAN Cache to Disk   200.35    100,000    26,711     0.007500      1500
-B to Disk          211.52    100,000    26,711     0.007919      1585
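The "Cost per Op" column appears to be Time divided by # of Ops, and "Relative" appears to be that cost compared with the cheapest layer. The sketch below recomputes both under that assumption. The figures are copied from the table, the helper name is made up, and the recomputed values may differ from the slide where the slide's numbers were rounded.

```python
# (layer, time, number of operations) as shown in the table above.
LAYERS = [
    ("Progress 4GL to -B", 0.96, 203_473),
    ("-B to FS Cache", 10.24, 26_711),
    ("FS Cache to SAN", 5.93, 26_711),
    ("-B to SAN Cache", 11.17, 26_711),
    ("SAN Cache to Disk", 200.35, 26_711),
    ("-B to Disk", 211.52, 26_711),
]


def cost_per_op(time, ops):
    """Assumed definition: elapsed time divided by the number of operations."""
    return time / ops


# The cheapest layer (4GL to -B) is the baseline for the relative cost.
baseline = cost_per_op(LAYERS[0][1], LAYERS[0][2])

for name, time, ops in LAYERS:
    cost = cost_per_op(time, ops)
    print(f"{name:20s} {cost:.6f} {cost / baseline:8.0f}x")
```

Whatever the exact rounding, the shape is the same: a read that goes all the way to disk costs three orders of magnitude more than one satisfied from -B.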
14
Logical Scatter
15
Definition
• Logical Scatter is the probability that records in a given logical order are also in “physical” order (in the same block).
• Each index has its own ordering and thus its own logical scatter.
• It is very unlikely that more than one index will be well ordered.
• It is quite possible that all indexes might be scattered.
Type 1 Storage Area
16
Block 1
1 Lift Tours Burlington
3 66 9/23 9/28 Standard Mail
1 1 54 4.86 Shipped
1 2 55 23.85 Shipped
Block 2
1 3 53 8.77 Shipped
2 1 19 2.75 Shipped
2 2 49 6.78 Shipped
2 3 13 10.99 Shipped
Block 3
14 Cologne Germany Germany
2 Upton Frisbee Oslo
1 Koberlein Kelly
1 53 1/26 1/31 FlyByNight
Block 4
BBB Brawn, Bubba B. 1,600
DKP Pitt, Dirk K. 1,800
4 Go Fishing Ltd Harrow
16 Thundering Surf Inc. Coffee City
Type 2 Storage Area
17
Block 1
1 Lift Tours Burlington
2 Upton Frisbee Oslo
3 Hoops Atlanta
4 Go Fishing Ltd Harrow
Block 2
5 Match Point Tennis Boston
6 Fanatical Athletes Montgomery
7 Aerobics Tikkurila
8 Game Set Match Deatsville
Block 3
9 Pihtiputaan Pyora Pihtipudas
10 Just Joggers Limited Ramsbottom
11 Keilailu ja Biljardi Helsinki
12 Surf Lautaveikkoset Salo
Block 4
13 Biljardi ja tennis Mantsala
14 Paris St Germain Paris
15 Hoopla Basketball Egg Harbor
16 Thundering Surf Inc. Coffee City
18
Tangent…
• The preceding slides should be all you need to see in order to be convinced that type 1 areas are a bad place to be putting data.
• The schema area is always a type 1 area. Should it have data in it?
19
How to Determine “Logical Scatter”?
• You could read the whole database…
• Multiple times…
• (Because every index must be considered)
-or-
• For each table randomly choose a record.
• For each index of that table find the NEXT record.
• Is it in the same block?
• Lather, Rinse and Repeat.
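The sampling approach can be sketched as follows. This is a toy model under stated assumptions, not the tool used for the talk: the 16 customers from the Type 2 example are stored four to a block in Id order, `block_of` and `estimate_same_block` are hypothetical helpers, and "sequential" is counted strictly as "the NEXT record is in the same block", so the results need not match percentages computed with a different definition.

```python
import random

# (id, name) pairs from the Type 2 example; stored in id order, 4 per block.
CUSTOMERS = [
    (1, "Lift Tours"), (2, "Upton Frisbee"), (3, "Hoops"),
    (4, "Go Fishing Ltd"), (5, "Match Point Tennis"),
    (6, "Fanatical Athletes"), (7, "Aerobics"), (8, "Game Set Match"),
    (9, "Pihtiputaan Pyora"), (10, "Just Joggers Limited"),
    (11, "Keilailu ja Biljardi"), (12, "Surf Lautaveikkoset"),
    (13, "Biljardi ja tennis"), (14, "Paris St Germain"),
    (15, "Hoopla Basketball"), (16, "Thundering Surf Inc."),
]
RECORDS_PER_BLOCK = 4


def block_of(record):
    """Assumed physical layout: records are packed into blocks in id order."""
    return (record[0] - 1) // RECORDS_PER_BLOCK


def estimate_same_block(records, index_key, samples=10_000, seed=42):
    """Pick random records; count how often NEXT (by index) shares the block."""
    rng = random.Random(seed)
    ordered = sorted(records, key=index_key)      # logical (index) order
    same = 0
    for _ in range(samples):
        pos = rng.randrange(len(ordered) - 1)     # the last record has no NEXT
        if block_of(ordered[pos]) == block_of(ordered[pos + 1]):
            same += 1
    return same / samples


by_id = estimate_same_block(CUSTOMERS, index_key=lambda r: r[0])
by_name = estimate_same_block(CUSTOMERS, index_key=lambda r: r[1])
print(f"Id index:   {by_id:.0%} of NEXT records in the same block")
print(f"Name index: {by_name:.0%} of NEXT records in the same block")
```

The same physical layout scores very differently depending on which index defines "next", which is why every index has to be considered separately.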
Type 2 Storage Area
20
Block 1
1 Lift Tours Burlington
2 Upton Frisbee Oslo
3 Hoops Atlanta
4 Go Fishing Ltd Harrow
Block 2
5 Match Point Tennis Boston
6 Fanatical Athletes Montgomery
7 Aerobics Tikkurila
8 Game Set Match Deatsville
Block 3
9 Pihtiputaan Pyora Pihtipudas
10 Just Joggers Limited Ramsbottom
11 Keilailu ja Biljardi Helsinki
12 Surf Lautaveikkoset Salo
Block 4
13 Biljardi ja tennis Mantsala
14 Paris St Germain Paris
15 Hoopla Basketball Egg Harbor
16 Thundering Surf Inc. Coffee City
Id = 100%, Name = 25%, City = 19%
21
Which Index?
22
4GL Index Selection
compile cust.p xref "cust.xrf".

/* cust.p */
for each customer no-lock:
  display custNum.
end.

cust.p 1 COMPILE cust.p
cust.p 1 CPINTERNAL ISO8859-1
cust.p 1 CPSTREAM ISO8859-1
cust.p 1 STRING "Customer" 8 NONE UNTRANSLATABLE
cust.p 1 SEARCH sports2000.Customer CustNum WHOLE-INDEX
cust.p 2 ACCESS sports2000.Customer CustNum
cust.p 2 STRING ">>>>9" 5 NONE TRANSLATABLE FORMAT
cust.p 3 STRING "Cust Num" 8 LEFT TRANSLATABLE
cust.p 3 STRING "CustNum" 7 NONE UNTRANSLATABLE
cust.p 3 STRING "--------" 8 NONE UNTRANSLATABLE
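The SEARCH lines are the ones that show which index the compiler picked. As a rough sketch, assuming the xref lines look like the listing above (the fields after the word SEARCH are the table, the index and an optional WHOLE-INDEX flag), they could be pulled out like this. The function name is made up for illustration.

```python
def index_searches(xref_lines):
    """Return (table, index, whole_index) for each SEARCH entry in an xref listing."""
    found = []
    for line in xref_lines:
        tokens = line.split()
        if "SEARCH" not in tokens:
            continue
        rest = tokens[tokens.index("SEARCH") + 1:]
        if len(rest) < 2:
            continue                      # not in the assumed "table index" shape
        table, index = rest[0], rest[1]
        found.append((table, index, "WHOLE-INDEX" in rest[2:]))
    return found


sample = [
    "cust.p 1 COMPILE cust.p",
    "cust.p 1 SEARCH sports2000.Customer CustNum WHOLE-INDEX",
    "cust.p 2 ACCESS sports2000.Customer CustNum",
]
searches = index_searches(sample)
for table, index, whole in searches:
    flag = " (WHOLE-INDEX)" if whole else ""
    print(f"{table} uses index {index}{flag}")
```

A STRING entry whose text happens to contain the word SEARCH would confuse this simple split, so treat it as a starting point only.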
23
Amdahl’s Law
The performance enhancement possible with a given improvement is limited by the fraction of the execution time that the improved feature is used.
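In its usual form the law says: if a fraction f of the run time is sped up by a factor s, the overall speedup is 1 / ((1 - f) + f / s). A small sketch with made-up fractions (the function name is an assumption for illustration):

```python
def amdahl_speedup(fraction_improved, local_speedup):
    """Overall speedup when `fraction_improved` of the run time gets `local_speedup` times faster."""
    if not 0.0 <= fraction_improved <= 1.0:
        raise ValueError("fraction_improved must be between 0 and 1")
    if local_speedup <= 0:
        raise ValueError("local_speedup must be positive")
    return 1.0 / ((1.0 - fraction_improved) + fraction_improved / local_speedup)


# Making 10% of the work 10x faster barely matters...
print(f"{amdahl_speedup(0.10, 10):.2f}x")
# ...while making 90% of the work 10x faster matters a lot.
print(f"{amdahl_speedup(0.90, 10):.2f}x")
```

For index tuning this means it only pays to fix the scatter of an index that accounts for a large share of the run time.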
24
Compile Time is Not Enough
• Dynamic Queries
• SQL-92 Cost Based Optimizer
• Unreached Code
• Rarely Run Code
• Widespread, but Low Impact Code
25
Execution Time Index Usage
/* one _indexStat row per index covered by -indexrangesize */
for each _indexStat no-lock:
  find _index no-lock where _index._idx-num = _indexStat._indexStat-id no-error.
  if available( _index ) then
    do:
      find _file no-lock where recid( _file ) = _index._file-recid no-error.
      display
        _indexStat._indexStat-id
        "*" when ( _file._prime-index = recid( _index ))   /* primary index */
        "u" when ( _index._unique = true )                 /* unique index  */
        _file._file-name when available _file
        _index._index-name when available _index
        _indexStat._indexStat-read
      .
    end.
end.
26
Execution Time Index Usage
Id        File-Name     Index-Name               Read
 1  *     _File         _File-Name         39,550,882
 2  *  U  _Field        _File/Field                29
 3     U  _Field        _Field-Name         2,457,451
 4     U  _Field        _Field-Position     1,999,506
 5  *  U  _Index        _File/Index         4,791,744
 6  *  U  _Index-Field  _Index/Number       7,344,550
 7        _Index-Field  _Field                      0
 8  *  U  table1        t1_idx1             6,668,593
 9     U  table1        t1_idx2               224,913
10  *  U  table2        t2_idx1            42,078,065
11        table2        t2_idx2        19,772,967,351
12        table2        t2_idx3                     0
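One way to turn these read counts into a "which index matters" figure is each index's share of the reads for its table. The sketch below does that for the application tables in the listing above. Treating this share as the "%Idx Used" figure is an assumption on my part, and the function name is hypothetical.

```python
from collections import defaultdict


def index_usage_share(rows):
    """Map (table, index) to that index's share of all index reads on the table."""
    totals = defaultdict(int)
    for table, _index, reads in rows:
        totals[table] += reads
    return {
        (table, index): (reads / totals[table] if totals[table] else 0.0)
        for table, index, reads in rows
    }


# (table, index, reads) taken from the listing above.
rows = [
    ("table1", "t1_idx1", 6_668_593),
    ("table1", "t1_idx2", 224_913),
    ("table2", "t2_idx1", 42_078_065),
    ("table2", "t2_idx2", 19_772_967_351),
    ("table2", "t2_idx3", 0),
]

shares = index_usage_share(rows)
for (table, index), share in shares.items():
    print(f"{table:8s} {index:8s} {share:6.1%}")
```

An index with zero reads costs nothing at run time, however badly scattered it is.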
27
VST Note
• The default db settings only collect statistics for the first 50 tables and indexes.
• To fix this:
-tablerangesize 1000
-indexrangesize 3000

define variable t as integer no-undo label "Tables".
define variable i as integer no-undo label "Indexes".

for each _file no-lock where _hidden = no:
  t = t + 1.
end.
for each _index no-lock:
  i = i + 1.
end.
display t i.
28
Case Study
29
Logical Scatter Case Study
• A process reading approximately 1,000,000 records.
• An initial run time of 2 hours.
  – 139 records/sec.
• Un-optimized database.
30
Baseline
Table   Index    %Sequential  %Idx Used  Density
Table1  t1_idx1           0%       100%     0.09
        t1_idx2           0%         0%     0.09
Table2  t2_idx1          69%        99%     0.51
        t2_idx2          98%         1%     0.51
        t2_idx3          74%         0%     0.51
4k DB Block
Type 1 Area
• -B 25,000
• Hit Ratio 95%
• 19,208 IO ops
• Run time 2 hours
31
Round 1 – Increase Big B
Table   Index    %Sequential  %Idx Used  Density
Table1  t1_idx1           0%       100%     0.09
        t1_idx2           0%         0%     0.09
Table2  t2_idx1          69%        99%     0.51
        t2_idx2          98%         1%     0.51
        t2_idx3          74%         0%     0.51
4k DB Block
Type 1 Area
• -B 100,000
• Hit Ratio 98%
• 9,816 IO ops
• Run time 60 minutes
32
Round 2 – Increase Some More
Table   Index    %Sequential  %Idx Used  Density
Table1  t1_idx1           0%       100%     0.09
        t1_idx2           0%         0%     0.09
Table2  t2_idx1          69%        99%     0.51
        t2_idx2          98%         1%     0.51
        t2_idx3          74%         0%     0.51
4k DB Block
Type 1 Area
• -B 200,000
• Hit Ratio 99%
• 6,416 IO ops
• Run time 40 minutes
33
Restructure DB
• Dump & Load
• Convert to 8KB DB Blocks
• Convert to Type 2 Storage Areas
34
Round 3 – Back to Baseline -B
8k DB Block
Type 2 Areas
• -B 12,500
• Hit Ratio 95%
• 9,417 IO ops
• Run time 55 minutes
Table Index %Sequential %Idx Used Density
Table1 t1_idx1 71% (0) 100% 0.10
t1_idx2 63% (0) 0% 0.10
Table2 t2_idx1 85% (69) 99% 1.00
t2_idx2 100% (98) 1% 1.00
t2_idx3 83% (74) 0% 0.99
35
Round 4 – Bump Big B
8k DB Block
Type 2 Areas
• -B 50,000
• Hit Ratio 98%
• 4,746 IO ops
• Run time 30 minutes
Table Index %Sequential %Idx Used Density
Table1 t1_idx1 71% 100% 0.10
t1_idx2 63% 0% 0.10
Table2 t2_idx1 85% 99% 1.00
t2_idx2 100% 1% 1.00
t2_idx3 83% 0% 0.99
36
Round 5 – Big B … Again
8k DB Block
Type 2 Areas
• -B 100,000
• Hit Ratio 99%
• 3,192 IO ops
• Run time 20 minutes
Table Index %Sequential %Idx Used Density
Table1 t1_idx1 71% 100% 0.10
t1_idx2 63% 0% 0.10
Table2 t2_idx1 85% 99% 1.00
t2_idx2 100% 1% 1.00
t2_idx3 83% 0% 0.99
37
Are We Done?
8k DB Block
Type 2 Areas
• The most used index is not the most sequential index!
Table Index %Sequential %Idx Used Density
Table1 t1_idx1 71% 100% 0.10
t1_idx2 63% 0% 0.10
Table2 t2_idx1 85% 99% 1.00
t2_idx2 100% 1% 1.00
t2_idx3 83% 0% 0.99
38
Restructure DB
• Dump & Load
• Dump Table2 using the most used index:
  – t2_idx1
• Load Normally
39
Why Not?
• Return on Investment:
  – Pickups from improving %SEQ are less than those from improving Hit Ratio.
  – That last 15% is a drop in the bucket compared to the 6x improvement already gained.
  – Expected improvement would be about 2% of 20 minutes, or around 24 seconds.
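The arithmetic behind these figures can be checked quickly. The sketch below uses the run times reported for each round and the "approximately 1,000,000 records" from the case study. The 2% figure is taken from the slide as given, not derived here.

```python
RECORDS = 1_000_000  # "approximately 1,000,000 records"

# Run time in minutes for each configuration in the case study.
RUNS = {
    "Baseline (4k, type 1, -B 25,000)": 120,
    "Round 1  (4k, type 1, -B 100,000)": 60,
    "Round 2  (4k, type 1, -B 200,000)": 40,
    "Round 3  (8k, type 2, -B 12,500)": 55,
    "Round 4  (8k, type 2, -B 50,000)": 30,
    "Round 5  (8k, type 2, -B 100,000)": 20,
}

baseline_minutes = RUNS["Baseline (4k, type 1, -B 25,000)"]

for label, minutes in RUNS.items():
    rate = RECORDS / (minutes * 60)        # records per second
    speedup = baseline_minutes / minutes   # relative to the 2 hour baseline
    print(f"{label:36s} {rate:6.0f} rec/s  {speedup:4.1f}x")

# The slide's estimate: about 2% of the final 20 minute run time.
expected_gain_seconds = 0.02 * RUNS["Round 5  (8k, type 2, -B 100,000)"] * 60
print(f"expected gain from re-ordering: {expected_gain_seconds:.0f} seconds")
```

The baseline works out to about 139 records/sec and the final round to 6x the baseline, which is why another 24 seconds was not judged worth a further dump and load.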
40
However…
• At low buffer hit ratios (95% or lower):
  – Restructuring to favor the most used index results in a 60% improvement in time.
  – And the hit ratio improves to 99.75%.
  – By eliminating 95% of the disk IO ops (112,247 -> 5,196).
• On the other hand… the system in question has grown again and it may now be worth revisiting.
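The IO and hit ratio figures above are consistent with each other, as a short check shows. The assumption here is that the number of logical reads stays the same before and after the restructure, so disk reads scale with the miss ratio.

```python
# Disk IO operations before and after restructuring, from the slide.
before_ops, after_ops = 112_247, 5_196
io_reduction = 1 - after_ops / before_ops
print(f"disk IO ops eliminated: {io_reduction:.1%}")

# Assumption: same number of logical reads, so misses scale with (1 - hit ratio).
miss_before = 1 - 0.95      # 95% hit ratio
miss_after = 1 - 0.9975     # 99.75% hit ratio
miss_reduction = 1 - miss_after / miss_before
print(f"reduction in miss ratio: {miss_reduction:.1%}")
```

Going from a 95% to a 99.75% hit ratio sounds small, but it means twenty times fewer trips to disk.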
41
Conclusion
• Type 2 Storage Areas improve “logical scatter”.
• Addressing “logical scatter” can be a powerful performance improvement technique.
• Addressing “logical scatter” can be an alternative to increasing -B in environments where shared memory is constrained.
42
Questions?
43
Questions?
• Should I use USE-INDEX to force a “well ordered” index?
• Why might scatter grow over time?
• I have two (or more) conflicting, but very important, needs. What can I do?
44
Thank You!
45