Index Building
Overview
• Database tables
• Building flow (logical)
• Sequential
• Drawbacks
• Parallel processing
• Recovery
• Helpful rules
Database tables
Word Index:
• Z97 - word dictionary
• Z98 - bitmap
• Z980 - cache of bitmap updates
• Z95 - words in document
Database tables
Z97
• translation from word to internal representation (sequence)
• same character set as documents
Database tables
Z98
• “bitmap” of word occurrence in documents
• each bitmap is physically made up of one or more records
• compressed
• one bitmap for every combination of word and index
Database tables
Z980
• cache of bitmap updates
• increases speed of large bitmap updates
• 1/1000
Database tables
Z95
• list of words and their location in a document
• adjacency
Database tables
Heading index:
• Z01 - phrase dictionary
• Z02 - phrase->document mapping
Database tables
Z01:
• filing phrase
• connection to authority database
• hash key (display text)
Building flow - word
Stage 1: Retrieval + Sort
• Read document
• prepare list of words and locations
• for each word find list of indices it belongs to
• sort according to words
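A minimal sketch of stage 1 in Python, for illustration only. The tokenizer, the row layout `(word, index_code, doc_no, position)` and the `indices_for_word` callback are assumptions, not the actual retrieval program.

```python
import re

def retrieve_words(doc_no, text, indices_for_word):
    """Return (word, index_code, doc_no, position) rows for one document."""
    rows = []
    words = re.findall(r"\w+", text.lower())
    for position, word in enumerate(words, start=1):
        # hypothetical callback: which indices does this word belong to?
        for index_code in indices_for_word(word):
            rows.append((word, index_code, doc_no, position))
    return rows

def stage1(documents, indices_for_word):
    """Retrieve all documents, then sort the rows by word."""
    rows = []
    for doc_no, text in documents:
        rows.extend(retrieve_words(doc_no, text, indices_for_word))
    rows.sort()  # tuples sort by word first
    return rows
```

The sorted rows play the role of the intermediate file read by stage 2.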
Building flow - word
Stage 2: Word Dictionary
• read intermediate file from stage 1
• build up word dictionary (check + load)
• replace word with internal representation
• create 2nd intermediate file
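A rough sketch of the "check + load" step in Python. An in-memory dict stands in for Z97, and the sequence numbers start at 1; both are assumptions made for illustration.

```python
def stage2(sorted_rows, dictionary=None):
    """Replace each word with its sequence number, extending the dictionary."""
    dictionary = {} if dictionary is None else dictionary
    out = []
    for word, index_code, doc_no, position in sorted_rows:
        if word not in dictionary:                  # check
            dictionary[word] = len(dictionary) + 1  # load: next sequence number
        out.append((dictionary[word], index_code, doc_no, position))
    return dictionary, out
```

When the input is sorted by word and the dictionary starts empty, the output is already ordered by word number.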
Building flow - word
Stage 3: Sort + Build Z95
• sort intermediate file from stage 2 - by document number
• create Z95 records
• load Z95 sequential file to database
Building flow - word
Stage 4: Merge + Build Z98
• intermediate file from stage 2 already sorted by word number
• split words into a number of files according to range of word numbers
• merge into Z98 records
• load sequential files
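The split and merge can be sketched in Python as follows. This is an illustration under assumptions: rows are `(word_no, index_code, doc_no)` tuples, and each bitmap is an uncompressed Python integer with bit `doc_no` set, whereas the real Z98 records are compressed and may span several records.

```python
from collections import defaultdict

def split_by_word_range(rows, ranges):
    """Distribute rows into buckets; ranges are inclusive (lo, hi) word numbers."""
    buckets = [[] for _ in ranges]
    for row in rows:
        for i, (lo, hi) in enumerate(ranges):
            if lo <= row[0] <= hi:
                buckets[i].append(row)
                break
    return buckets

def build_bitmaps(rows):
    """Merge rows into one bitmap per (word number, index) combination."""
    bitmaps = defaultdict(int)
    for word_no, index_code, doc_no in rows:
        bitmaps[(word_no, index_code)] |= 1 << doc_no
    return dict(bitmaps)
```

Because the ranges do not overlap, each bucket can be merged independently of the others.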
Building flow - heading
Stage 1: Retrieval + Sort
• Read document
• prepare list of phrases
• for each phrase find list of indices it belongs to
• sort according to hash key
Building flow - heading
Stage 2: Phrase Dictionary
• read intermediate file from stage 1
• build up phrase dictionary
• generate unique key - acc sequence
• load Z01 sequential file to database
• build Z02 - non unique
Building flow - heading
Stage 3: Sort + Load Z02
• sort non unique Z02 sequential file
• load Z02 sequential file to database
Sequential - word
• Every stage is handled by a single process
• The next stage proceeds only after the previous stage has handled the data
• Stage 4 proceeds only after all other stages have finished
Sequential - word
Example from version 12.1
csh -f p_manage_01_a $1 >& $data_scratch/p_manage_01_a.log &
csh -f p_manage_01_b $1 >& $data_scratch/p_manage_01_b.log &
csh -f p_manage_01_c $1 >& $data_scratch/p_manage_01_c.log &
csh -f p_manage_01_d $1 >& $data_scratch/p_manage_01_d.log
csh -f p_manage_01_e $1 >& $data_scratch/p_manage_01_e.log
Sequential - word
• p_manage_01_a: retrieval
• p_manage_01_b: sort (by word)
• p_manage_01_c: build Z97
• p_manage_01_d: build Z95
• p_manage_01_e: merge + build Z98
Drawbacks
• Minimum parallel processing
• Single process per stage
• No recoverability - Z97 could be reused but the whole building process needed to be rerun
• Computer resources not fully utilized
• Long run time
Parallel processing
• Large databases - multiple processors
• Identify stages that are not “workflow” bottlenecks
• Coordinate parallel processes with assignment/progress table
Parallel processing (word)
Stage 1: Retrieval + Sort
• Retrieval is parallel - “io” not “workflow” bottleneck
• Split into cycles, each a range of document numbers
Parallel processing (word)
p_manage_01_a.cycles - initial
0001 - - - - 000000001 000010000
0002 - - - - 000010001 000020000
0003 - - - - 000020001 000030000
0004 - - - - 000030001 000040000
0005 - - - - 000040001 000050000
0006 - - - - 000050001 000060000
0007 - - - - 000060001 000070000
0008 - - - - 000070001 000080000
0009 - - - - 000080001 000090000
0010 - - - - 000090001 000100000
0011 - - - - 000100001 000110000
0012 - - - - 000110001 000110511
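A small Python sketch that produces an initial cycles listing in this layout. The field widths (4-digit cycle number, 9-digit document numbers) are read off the sample, and the function name is hypothetical.

```python
def initial_cycles(last_doc, cycle_size, flags=4):
    """Return cycle lines: number, one '-' per stage flag, first and last doc."""
    lines = []
    start, cycle_no = 1, 1
    while start <= last_doc:
        end = min(start + cycle_size - 1, last_doc)
        status = " ".join("-" * flags)  # e.g. "- - - -"
        lines.append("%04d %s %09d %09d" % (cycle_no, status, start, end))
        start, cycle_no = end + 1, cycle_no + 1
    return lines
```

Called as `initial_cycles(110511, 10000)`, it yields 12 cycles, the last one ending at document 110511.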
Parallel processing (word)
p_manage_01_a.cycles - 3 processes, 1st retrieval cycle
0001 ? - - - 000000001 000010000
0002 ? - - - 000010001 000020000
0003 ? - - - 000020001 000030000
0004 - - - - 000030001 000040000
0005 - - - - 000040001 000050000
0006 - - - - 000050001 000060000
0007 - - - - 000060001 000070000
0008 - - - - 000070001 000080000
0009 - - - - 000080001 000090000
0010 - - - - 000090001 000100000
0011 - - - - 000100001 000110000
0012 - - - - 000110001 000110511
Parallel processing (word)
p_manage_01_a.cycles - 3 processes, 2nd retrieval cycle
0001 + + ? - 000000001 000010000
0002 + ? - - 000010001 000020000
0003 + - - - 000020001 000030000
0004 ? - - - 000030001 000040000
0005 ? - - - 000040001 000050000
0006 ? - - - 000050001 000060000
0007 - - - - 000060001 000070000
0008 - - - - 000070001 000080000
0009 - - - - 000080001 000090000
0010 - - - - 000090001 000100000
0011 - - - - 000100001 000110000
0012 - - - - 000110001 000110511
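How a process might claim its next cycle can be sketched in Python. The reading of the flags is an assumption inferred from the samples: "-" not processed, "?" in process, "+" done, with one flag column per stage. A real implementation would also need file locking between processes, which is omitted here.

```python
def claim_next(lines, column=0):
    """Mark the first claimable cycle in `column` as in process ('?').

    A cycle is claimable when its flag is '-' and, for later stages,
    the previous stage's flag is '+'. Returns the cycle number or None.
    """
    for i, line in enumerate(lines):
        fields = line.split()
        flags = fields[1:-2]  # everything between cycle number and doc range
        ready = column == 0 or flags[column - 1] == "+"
        if flags[column] == "-" and ready:
            flags[column] = "?"
            lines[i] = " ".join([fields[0]] + flags + fields[-2:])
            return fields[0]
    return None
```

Three retrieval processes calling `claim_next(lines)` in turn would take cycles 0001, 0002 and 0003.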
Parallel processing (word)
• Whenever possible stages were split into separate sub-stages
• Usually in cases of non-parallel stages
• stages 2 and 3 were not made into parallel processes - retrieval was by far the most costly stage
Parallel processing (word)
Stages 2 and 3 were subdivided into three sub-stages:
• build Z97 + load
• sort intermediate file by document number
• build Z95 + load
Parallel processing (word)
p_manage_01_a.cycles - example
0001 + + + + 000000001 000010000
0002 + + + ? 000010001 000020000
0003 + + ? - 000020001 000030000
0004 + + - - 000030001 000040000
0005 + ? - - 000040001 000050000
0006 + - - - 000050001 000060000
0007 ? - - - 000060001 000070000
0008 ? - - - 000070001 000080000
0009 ? - - - 000080001 000090000
0010 - - - - 000090001 000100000
0011 - - - - 000100001 000110000
0012 - - - - 000110001 000110511
Parallel processing (word)
Stage 4 is split into sub stages:
• pre-processing of intermediate files from stage 2 - distribution of words
• build Z98 - parallel
• load Z98 sequential file
• input files are compressed and stored in separate directory
Parallel processing (word)
Pre-processing:
• generate histogram - # of lines per 5000 words
• determine range of words - no more than 1G in intermediate files
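A hedged Python sketch of turning the histogram into word ranges. It caps each range by a line count (`max_lines`) as a stand-in for the 1G size limit, which is an assumption. The open-ended last range up to 999999999 mirrors the sample p_manage_01_e.cycles file.

```python
def word_ranges(histogram, bucket_words=5000, max_lines=1000):
    """histogram[i] = number of lines for word numbers
    i*bucket_words + 1 .. (i + 1)*bucket_words.
    Returns inclusive (first_word, last_word) ranges."""
    ranges, start, total = [], 1, 0
    for i, lines in enumerate(histogram):
        # close the current range before it would exceed the cap
        if total and total + lines > max_lines:
            ranges.append((start, i * bucket_words))
            start, total = i * bucket_words + 1, 0
        total += lines
    ranges.append((start, 999999999))  # last range is open-ended
    return ranges
```

A single bucket larger than the cap still gets a range of its own, since buckets are never split.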
Parallel processing (word)
p_manage_01_e.cycles
0001 - - 000000001 000600000
0002 - - 000600001 000900000
0003 - - 000900001 999999999
Parallel processing (word)
Build Z98:
• intermediate files - split into discrete range of words
• parallel merging and building of Z98
Parallel processing (word)
p_manage_01_e.cycles - example
0001 + ? 000000001 000600000
0002 ? - 000600001 000900000
0003 ? - 000900001 999999999
Parallel processing (heading)
Stage 1: Retrieval + Sort
• same handling as word index stage 1
• “io” bottleneck
• Split into cycles, each a range of document numbers
Parallel processing (heading)
p_manage_02.cycles
0001 - - - - 000000001 000005000
0002 - - - - 000005001 000010000
0003 - - - - 000010001 000015000
0004 - - - - 000015001 000020000
0005 - - - - 000020001 000025000
0006 - - - - 000025001 000030000
0007 - - - - 000030001 000035000
0008 - - - - 000035001 000040000
0009 - - - - 000040001 000045000
0010 - - - - 000045001 000048435
Parallel processing (heading)
Stages 2 and 3 were subdivided into three sub-stages:
• build Z01 + load + build Z02
• sort non unique Z02 sequential file
• load Z02
Parallel processing (heading)
p_manage_02.cycles - example
0001 + + + ? 000000001 000005000
0002 + + ? - 000005001 000010000
0003 + + - - 000010001 000015000
0004 + ? - - 000015001 000020000
0005 + - - - 000020001 000025000
0006 ? - - - 000025001 000030000
0007 ? - - - 000030001 000035000
0008 ? - - - 000035001 000040000
0009 - - - - 000040001 000045000
0010 - - - - 000045001 000048435
Parallel processing (heading)
Building of headings is conceptually and practically similar to word building, except for the building of bitmaps (Z98)
Recovery
Word index:
• stages 1-3 and stage 4 are separate
• stage 4 runs only after all processing is done in stage 3
Recovery
Stage 1-3 - scenarios:
• database tables need to be enlarged
• not enough disk space - intermediate files
• not enough disk space - sort
• general disaster?
Recovery
Stage 1-3:
• identify last successful section
• change “in process” signs (?) to “not processed” sign (-)
• rerun discrete stage scripts:
– p_manage_01_a
– p_manage_01_c
– p_manage_01_d
– p_manage_01_d1
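The sign change could be scripted roughly as follows in Python. The line layout (cycle number, flags, first and last document number) is taken from the sample cycles files; the function name is hypothetical and reading or writing the actual file is left out.

```python
def reset_in_process(lines):
    """Turn every in-process flag ('?') back into not-processed ('-')."""
    out = []
    for line in lines:
        fields = line.split()
        # flags sit between the cycle number and the two document numbers
        flags = ["-" if f == "?" else f for f in fields[1:-2]]
        out.append(" ".join([fields[0]] + flags + fields[-2:]))
    return out
```

Completed flags ("+") are left alone, so only the interrupted sections are picked up again on the rerun.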
Recovery
Stage 4:
• must be rerun in totality
• input files are saved and compressed
• $word_compress_dir
• p_manage_01_e
Helpful rules
Stage 1 outrunning stage 2-3:
• decide on number of stage 1 processes to stop (p_manage_01_a)
• kill shell and program process
• reset associated cycle in p_manage_01_a.cycles
Helpful rules
Log file names:
p_manage_01_a_{process_number}.log
p_manage_01_e_{process_number}.log
others are without process_number
p_manage_01_c.log
p_manage_01_d.log
p_manage_01_d1.log
p_manage_01_e1.log
p_manage_01_e2.log
Helpful rules
cycle size:
no. docs < 2M - 50k
no. docs < 4M - 100k
otherwise - 200k
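The rule of thumb written out in Python, with the resulting number of cycles. The function names are hypothetical, and the thresholds are applied strictly as stated.

```python
def cycle_size(num_docs):
    """Cycle size (documents per cycle) from the rule of thumb above."""
    if num_docs < 2_000_000:
        return 50_000
    if num_docs < 4_000_000:
        return 100_000
    return 200_000

def cycle_count(num_docs):
    """Number of cycles, rounding the last partial cycle up."""
    size = cycle_size(num_docs)
    return -(-num_docs // size)  # ceiling division
```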
Helpful rules
Disk space calculation:
d = no. documents
c = no. cycles
p = no. processors
s = size of retrieval file
Helpful rules
Sort space ($TMPDIR):
sort = p*s + 20%
stage 1 sort (parallel) + stage 2, 3 sorting (single file)
Helpful rules
Scratch space:
scratch = p*1.5*s + c*s*1/3
output from stage 1 (in process and not yet processed) +
output from stage 3
Helpful rules
Example: UBU
d = 2M, cycle size = 50k
p = 4, c = 40, s = ~0.5G
sort = 4*0.5*1.2 = 2.4G
scratch = 4*1.5*0.5 + 40*0.5*1/3
= 3G + 6.67G
= 9.67G
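The two formulas can be checked with a few lines of Python. The function names are hypothetical; p, c and s are as defined under "Disk space calculation", with sizes in gigabytes.

```python
def sort_space(p, s):
    """Sort space ($TMPDIR): p*s plus 20%."""
    return p * s * 1.2

def scratch_space(p, c, s):
    """Scratch space: stage 1 output plus stage 3 output."""
    return p * 1.5 * s + c * s / 3

# Inputs from the UBU example: p=4, c=40, s=0.5G
print(round(sort_space(4, 0.5), 2))         # sort space in G
print(round(scratch_space(4, 40, 0.5), 2))  # scratch space in G
```

With these inputs the scratch terms are 3G and about 6.67G, which sum to about 9.67G.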