File Processing - Indirect Address Translation MVNC 1
Hashing
Indirect Address Translation
Chapter 11
Indirect Address Translation
Direct translation
» When the primary key (PK) and the relative record position (RRP) are the same, we say there is a direct translation.
» Simple direct access file systems use this technique.
Indirect Address Translation
Direct translation - problems
» The PKs may not be numeric.
– Names
– Alphanumeric IDs
Indirect Address Translation
Direct translation - problems
» Only a small percentage of the possible range of PKs may actually have records assigned to them:
– Consider an employee file whose key field is a 9-digit ID number (e.g. a Social Security Number).
– The company has 200 employees.
– Since the IDs may take any of 10^9 values, the file would have to be huge (10^9 records!). The file would thus have a packing density of:
packing density = 200 records used / 10^9 records allocated = 2 / 10^7 = 0.00002%
Indirect Address Translation
Hashing
» A common technique of indirect translation is hashing.
» It is a solution in which the broad range of PK values is transformed into the smaller range of RRP values.
» Hashing uses a hashing function to translate the key values into the smaller range of RRP values.
[Figure: the broad range of PK values (000000000 to 999999999) is mapped by coercion, or indirect translation, onto the restricted range of RRP values (000 to 250).]
Indirect Address Translation
Hashing algorithms
» Developing a hashing function requires careful attention:
– The algorithm should distribute the keys as evenly as possible across the range of addresses.
– Some different keys MUST necessarily map to the same address, because the range of keys is larger than the range of addresses.
Key Transformation Algorithms
3 general steps to convert a key to an RRP address:
1) If the key is not numeric, convert it into a numeric form without losing information.
2) Operate on the numeric key using an algorithm which converts the keys into a spread of numbers of the order of magnitude of the address numbers required.
3) Multiply the resulting numbers by a constant which compresses them into the precise range of addresses.
Key Transformation Algorithms
Example:
» The key is a 9-digit number.
» The destination file has 7000 records.
» Step 1 - Not needed (the key is already a number).
» Step 2 - Divide the key by 10000 and keep the remainder, which lies between 0 and 9999.
» Step 3 - Multiply the value from step 2 by .7 to put the number within the range 0000 to 6999.
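The three steps of this example can be sketched in Python. This is only an illustration: the function name `key_to_rrp` and its parameter names are assumptions, and integer arithmetic stands in for "multiply by .7 and truncate" so that no floating-point rounding is involved.

```python
def key_to_rrp(key: int, spread_modulus: int = 10_000, file_size: int = 7_000) -> int:
    """Map a numeric key to an RRP in the range 0 .. file_size - 1."""
    # Step 1 is skipped: the key is already numeric.
    # Step 2: keep the remainder after dividing by 10000 (0 .. 9999).
    spread = key % spread_modulus
    # Step 3: compress by 7000 / 10000 = .7, truncating to an integer.
    return spread * file_size // spread_modulus


print(key_to_rrp(123_456_789))  # remainder 6789, compressed to 4752
print(key_to_rrp(999_999_999))  # remainder 9999, compressed to 6999
```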
Key Transformation Algorithms
Example:
» What would happen if we skip step 2 and simply compress the number from step 1?
» What about clustered insertions? (Keys with contiguous values.)
Key Transformation Algorithms - Division
The key is divided by a number approximately equal to the number of available addresses, and the remainder is taken as the RRP.
The divisor should be a prime number or a number with no small factors.
Key Transformation Algorithms - Division
Example:
» Records have a 6-digit key; 5000 RRPs are desired.
» Divide by 4997 and use the remainder.
» Consider the key 142536:
» 142536 / 4997 = 28 remainder 2620.
» Use 2620 as the RRP.
How do you suppose this method would work with clustered insertions?
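A minimal sketch of the division method, using the divisor 4997 from the example; the function name `division_hash` is an assumption. The last lines try a small run of contiguous keys, which is one way to explore the clustered-insertion question.

```python
def division_hash(key: int, divisor: int = 4997) -> int:
    """Return the remainder of key / divisor as the RRP."""
    return key % divisor


print(divmod(142536, 4997))    # (28, 2620): quotient 28, remainder 2620
print(division_hash(142536))   # 2620

# Contiguous keys land on contiguous addresses, so they do not collide
# with one another (as long as the run is shorter than the divisor).
print([division_hash(k) for k in range(142536, 142541)])
# [2620, 2621, 2622, 2623, 2624]
```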
Key Transformation Algorithms - Extraction
Select digits from different parts of the key.
Example:
» Records have a 10-digit key; 5000 RRPs are desired.
» Choose the 3rd, 5th, 8th and 9th digits, counting from the right.
» Consider key = 3865324567: the selected digits, read left to right, give 8625.
» Compress into the RRP range: INT(8625 * .5) = 4312. Use 4312 as the RRP.
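A small sketch of extraction follows. It assumes that digit positions are counted from the right, since that is the reading under which the 3rd, 5th, 8th and 9th digits of 3865324567 give 8625; the function and parameter names are illustrative only.

```python
def extraction_hash(key: int,
                    positions_from_right=(9, 8, 5, 3),
                    key_digits: int = 10,
                    table_size: int = 5000) -> int:
    """Pick digits out of the key, then compress them into 0 .. table_size - 1."""
    digits = str(key).zfill(key_digits)  # pad short keys with leading zeros
    # digits[-p] is the p-th digit counted from the right.
    selected = "".join(digits[-p] for p in positions_from_right)
    # Four selected digits span 0 .. 9999; scaling by 5000 / 10000 is the ".5" step.
    return int(selected) * table_size // 10 ** len(positions_from_right)


print(extraction_hash(3865324567))  # digits 8, 6, 2, 5 -> 8625 -> 4312
```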
Key Transformation Algorithms - Folding
Digits in the key are folded inward like folding paper. Then the digits are added.
Folding tends to be more appropriate for large keys.
[Figure: folding diagram for the key 142537]
Key Transformation Algorithms - Folding
Example
» Let the key be 142537.
» Fold left at the 4th digit, right at the 3rd digit:
» Results in 4137 and 735.
» Add the two resulting values: 4137 + 735 = 4872
» Compress into the RRP range: 4872 x .5 = 2436. Use 2436 as the RRP.
[Figure: folding diagram for the key 142537]
Key Transformation Algorithms - Mid-square method
Square the key, and use the central digits of the result.
Example:
» Let records have a 6-digit key, with 5000 RRPs desired.
» Key value of 142536.
» 142536^2 --> 020316511296
» 1651 - central digits
» Compress into the RRP range:
» 1651 x .5 = 825. Use 825 as the RRP.
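The mid-square example can be checked with a short sketch. The function name and parameters are assumptions; the square is padded to twice the key length (12 digits here) so that "central digits" is well defined.

```python
def mid_square_hash(key: int, key_digits: int = 6,
                    mid_digits: int = 4, table_size: int = 5000) -> int:
    """Square the key, take the central digits, compress into 0 .. table_size - 1."""
    squared = str(key * key).zfill(2 * key_digits)   # e.g. "020316511296"
    start = (len(squared) - mid_digits) // 2          # index of the first central digit
    middle = int(squared[start:start + mid_digits])   # e.g. 1651
    return middle * table_size // 10 ** mid_digits    # the ".5" compression, truncated


print(str(142536 ** 2).zfill(12))  # 020316511296
print(mid_square_hash(142536))     # 825
```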
Key Transformation Algorithms - Selection
The best way to choose a transform is to take the key set for the file and simulate using different transforms.
Choose the one which distributes the records most evenly.
The division method seems to be the best general transform.
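Simulating transforms against a key set might look like the sketch below. The key set (2000 contiguous 6-digit keys), the two transforms compared, and all names are assumptions chosen for illustration; a real comparison would use the file's actual keys.

```python
from collections import Counter


def count_synonyms(keys, transform):
    """Number of keys that land on an address some earlier key already uses."""
    per_address = Counter(transform(k) for k in keys)
    return sum(n - 1 for n in per_address.values())


def division(key):
    return key % 4997                 # division method, 5000 RRPs desired


def compress_only(key):
    return key * 5000 // 10 ** 6      # scale a 6-digit key straight into 0 .. 4999


clustered_keys = range(100_000, 102_000)   # 2000 contiguous key values

print(count_synonyms(clustered_keys, division))       # 0
print(count_synonyms(clustered_keys, compress_only))  # 1990
```

On this clustered key set, compressing the key directly squeezes 2000 keys onto only 10 addresses, while division spreads them over 2000 distinct addresses.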
Important hashing considerations
When designing a practical hashing scheme, several important issues must be addressed:
Record distribution
» A hashing function must be picked which will evenly distribute the records throughout the RRP range.
» Different key sets will have different distribution patterns.
» Thus the hashing function chosen will depend on the patterns of keys in the data set.
Important hashing considerations
Synonyms
» Two or more PKs which transform to the same RRP address.
» The goal is to devise a hashing function for a given set of keys which will minimize synonyms.
» It is, however, statistically beyond reason to totally avoid synonyms.
» Not only would all keys need to be known in advance, but only one algorithm in 10^12000 will work!
Important hashing considerations
Collisions
» A collision occurs when a new record hashes to an address already in use by another record.
» The new record and the existing record are called synonyms.
» The result is called an overflow.
» A scheme must be devised to handle overflows efficiently.
Important hashing considerations
Packing density
» The ratio of records stored in a file to addresses available in the file.
» Typically the best packing density is 80-90%.
» The larger the file (the lower the packing density), the lower the probability of an overflow.
» There is thus a trade-off between space and efficiency.
[Figure: space balanced against efficiency]
Techniques for handling collisions
Strategies for collision resolution:
1. Create the file so that each address (physical record) can hold several logical records (usually synonyms). These are called composite records or buckets.
2. Develop algorithms for relocating records which collide.
Composite Records or buckets
Reduce number of RRP’s, but increase the size of each to hold several records.
Each RRP (called a bucket) now holds several logical records.
[Figure: a file of 12 single-record addresses (1 to 12) compared with a file of 4 buckets (1 to 4).]
Composite Records or buckets
» Buckets are arrays of logical records.
» Bucket size - the number of records per bucket.
» There is now room for several synonyms in each bucket.
» The probability of overflow is reduced.
» Overflow now only occurs when a bucket is full.
» Overall file size need not increase: if the bucket size is 5, reduce the number of physical records by a factor of 5.
Composite Records or buckets
Buckets may be implemented by making each file record an array of logical records.
Example: Consider two half-full files.
[Figure: two files holding 6 records each, one with 12 single-record addresses (1 to 12) and one with 4 buckets (1 to 4). Probability of overflow?]
Composite Records or buckets
Trade-offs
» As bucket size increases, the probability of an overflow is greatly reduced.
» As bucket size increases, the time to read in and scan a bucket increases.
» Typical bucket sizes range from 5 to 30.
» The ideal bucket size is often a multiple of the disk sector or track size.
» What is the extreme case of having the longest possible bucket?
Handling overflows
Increasing bucket size will reduce, but not eliminate, overflows. They must be dealt with.
Many algorithms exist for handling overflows, including:
1. Progressive overflow
2. Separate overflow area
3. Chained progressive overflow
Progressive overflow
Adding a new record
» If the home address is full, try the next address.
» If the next address is full, try the one after it, and so on.
» If at the end of the file, wrap around to record 0.
» If the search continues until the home address is reached again, the file is full.
Progressive overflow
Finding a record
» If the record is in the home bucket, success!
» Else if the home bucket is not full, the search fails.
» Else if the home bucket is full, go on to search the next bucket.
» Keep searching successive buckets until either the record is found or a non-full bucket is searched.
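The add and find rules for progressive overflow can be sketched with an in-memory list of buckets standing in for the file. The class name, the use of `key % number_of_buckets` as the home address, and the absence of deletions are all simplifying assumptions.

```python
class ProgressiveOverflowFile:
    """Buckets held in a list; overflow records move on to the next bucket."""

    def __init__(self, num_buckets: int, bucket_size: int):
        self.bucket_size = bucket_size
        self.buckets = [[] for _ in range(num_buckets)]

    def home(self, key: int) -> int:
        return key % len(self.buckets)   # assumed hashing function (division)

    def add(self, key: int) -> int:
        n = len(self.buckets)
        for step in range(n):                       # at most one full lap
            address = (self.home(key) + step) % n   # wraps around to bucket 0
            if len(self.buckets[address]) < self.bucket_size:
                self.buckets[address].append(key)
                return address
        raise OverflowError("file full: search returned to the home address")

    def find(self, key: int):
        n = len(self.buckets)
        for step in range(n):
            address = (self.home(key) + step) % n
            bucket = self.buckets[address]
            if key in bucket:
                return address        # success
            if len(bucket) < self.bucket_size:
                return None           # a non-full bucket ends the search
        return None                   # every bucket was full and searched


f = ProgressiveOverflowFile(num_buckets=5, bucket_size=2)
print([f.add(k) for k in (0, 5, 10)])  # [0, 0, 1]: the third synonym overflows
print(f.find(10), f.find(15))          # 1 None
```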
Progressive overflow
Finding a record
» Note that as the file fills, the search length will increase.
» What are some enhancements?
– Each bucket has a flag indicating whether the bucket has really overflowed.
Progressive overflow
Deleting a record
» We can't simply remove the record, or find may not work correctly.
» Each record must be marked as used, unused, or deleted.
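A sketch of the three-state marking, using single-record addresses to keep it short. The names are illustrative, and `add` assumes the key is not already in the file. A deleted slot does not stop a search, but it can be reused by a later insertion.

```python
UNUSED, USED, DELETED = "unused", "used", "deleted"


class TombstoneFile:
    """Progressive overflow over single-record slots with a state per slot."""

    def __init__(self, size: int):
        self.state = [UNUSED] * size
        self.keys = [None] * size

    def _probe(self, key: int):
        n = len(self.keys)
        for step in range(n):
            yield (key % n + step) % n   # home address, then successive slots

    def add(self, key: int) -> int:
        for address in self._probe(key):
            if self.state[address] != USED:     # unused or deleted: reusable
                self.state[address], self.keys[address] = USED, key
                return address
        raise OverflowError("file full")

    def find(self, key: int):
        for address in self._probe(key):
            if self.state[address] == UNUSED:   # never used: search fails
                return None
            if self.state[address] == USED and self.keys[address] == key:
                return address
            # DELETED, or USED by another key: keep searching
        return None

    def delete(self, key: int) -> bool:
        address = self.find(key)
        if address is None:
            return False
        self.state[address] = DELETED           # mark, do not reset to UNUSED
        return True


f = TombstoneFile(7)
for k in (0, 7, 14):     # three synonyms end up in slots 0, 1, 2
    f.add(k)
f.delete(7)
print(f.find(14))        # 2: the search passes over the deleted slot 1
```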
Progressive overflow
Evaluation
» Simple
» Robust
» Searches may get very long
» Clustering
Progressive overflow
Alternate version - skip x records each time, where x is relatively prime to the number of records.
This reduces the problem of record clustering.
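The reason x must share no common factor with the number of records is that only then does the search visit every address before returning home. A brief sketch, with a hypothetical helper name:

```python
from math import gcd


def probe_sequence(home: int, step: int, num_records: int) -> list[int]:
    """Addresses visited when skipping `step` records each time."""
    if gcd(step, num_records) != 1:
        raise ValueError("step must be relatively prime to the number of records")
    return [(home + i * step) % num_records for i in range(num_records)]


print(probe_sequence(home=3, step=3, num_records=10))
# [3, 6, 9, 2, 5, 8, 1, 4, 7, 0]
```

With a step of 2 in the same 10-record file, the search from address 3 would only ever reach the odd addresses.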
Separate overflow area
Buckets contain pointers which may point to a record in a special overflow area.
Records (or buckets) are linked together in the overflow area as a linked list.
What happens if there are a lot of synonyms for a few home addresses?
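One possible in-memory sketch of a separate overflow area is shown below. The layout (one overflow pointer per bucket, new overflow records linked at the head of the list) and all names are assumptions made for illustration, not a prescribed file format.

```python
class OverflowAreaFile:
    """Main buckets plus a separate overflow area of linked records."""

    def __init__(self, num_buckets: int, bucket_size: int):
        self.bucket_size = bucket_size
        self.buckets = [[] for _ in range(num_buckets)]
        self.overflow_head = [None] * num_buckets  # one overflow pointer per bucket
        self.overflow = []                          # entries are [key, next_index]

    def add(self, key: int) -> None:
        home = key % len(self.buckets)
        if len(self.buckets[home]) < self.bucket_size:
            self.buckets[home].append(key)
            return
        # Home bucket is full: link the record at the head of its overflow list.
        self.overflow.append([key, self.overflow_head[home]])
        self.overflow_head[home] = len(self.overflow) - 1

    def find(self, key: int):
        home = key % len(self.buckets)
        if key in self.buckets[home]:
            return ("main", home)
        index = self.overflow_head[home]
        while index is not None:                    # walk the linked list
            stored_key, next_index = self.overflow[index]
            if stored_key == key:
                return ("overflow", index)
            index = next_index
        return None


f = OverflowAreaFile(num_buckets=4, bucket_size=2)
for k in (0, 4, 8, 12):       # four synonyms for home address 0
    f.add(k)
print(f.find(4), f.find(8))   # ('main', 0) ('overflow', 0)
```

With many synonyms for a few home addresses, the linked lists for those addresses grow long, and finding a record means following one pointer per overflow record.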
Separate overflow area
[Figure: main buckets, each with an overflow pointer into the overflow area.]
Chained Progressive overflow
Similar to progressive overflow, but pointers link synonyms together for quicker searches.
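A rough sketch of the idea, under simplifying assumptions: single-record addresses, no deletions, and chains that are allowed to merge when a record's home address is already occupied by a non-synonym. The class and attribute names are hypothetical. A search follows pointers from the home address instead of scanning every intervening record.

```python
class ChainedProgressiveFile:
    """Progressive overflow where each slot also stores a pointer to the next link."""

    def __init__(self, size: int):
        self.keys = [None] * size
        self.next = [None] * size   # pointer to the next record in the chain

    def add(self, key: int) -> int:
        n = len(self.keys)
        home = key % n
        if self.keys[home] is None:
            self.keys[home] = key
            return home
        tail = home                           # follow the chain to its last record
        while self.next[tail] is not None:
            tail = self.next[tail]
        for step in range(1, n):              # progressive search for a free slot
            address = (home + step) % n
            if self.keys[address] is None:
                self.keys[address] = key
                self.next[tail] = address     # link the new record onto the chain
                return address
        raise OverflowError("file full")

    def find(self, key: int):
        address = key % len(self.keys)
        while address is not None and self.keys[address] is not None:
            if self.keys[address] == key:
                return address
            address = self.next[address]      # jump straight to the next link
        return None


f = ChainedProgressiveFile(7)
print([f.add(k) for k in (0, 7, 1, 14)])  # [0, 1, 2, 3]
print(f.find(14))                          # 3, reached by following pointers
```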