single level index


Single level index

• Single-level index: a file of entries

– Each entry points to one of:

• The record in the data file: <field value, pointer to record>, or

• The block which contains the record: <field value, pointer to block>

– Entries are ordered by the value of the indexing field

• Searching a single-level index: carry out a binary search in the index file, then?

• Then follow the pointer

• Why single-level?

– Will see other types of indexes later

– Including multi-level indexes


Types of Single-Level Indexes: Primary Index

• Defined on a data file ordered on a key field

– We will think of it as the primary key

– The index is ordered on the same key

• One index entry for each block in the data file

– The index entry has the key field value of the first record in the block

• This record is called the block anchor

• Dense or sparse?

• Sparse: includes an entry for each disk block

– Not for every record
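The lookup path described above (binary search on the block anchors, then follow the block pointer) can be sketched as follows. This is only an illustration: the tiny `data_blocks` file, the names, and the helper `lookup` are all made up for the example, and Python lists stand in for disk blocks.

```python
from bisect import bisect_right

# Hypothetical data file: each inner list is one block, sorted by key.
data_blocks = [
    [(1, "Aaron"), (4, "Adams"), (7, "Baker")],
    [(10, "Chen"), (13, "Diaz"), (16, "Evans")],
    [(19, "Ford"), (22, "Gupta"), (25, "Hill")],
]

# Sparse primary index: one <block anchor, block number> entry per block.
index = [(block[0][0], block_no) for block_no, block in enumerate(data_blocks)]
anchors = [anchor for anchor, _ in index]


def lookup(key):
    """Return the record with the given key, or None if it is absent."""
    # Binary search for the last block anchor that is <= key.
    pos = bisect_right(anchors, key) - 1
    if pos < 0:
        return None  # key is smaller than every anchor
    _, block_no = index[pos]
    # Follow the block pointer, then search inside that one block.
    for record in data_blocks[block_no]:
        if record[0] == key:
            return record
    return None
```

Note that the index has only three entries for nine records, which is what makes it sparse.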


[EN] FIGURE 18.1 Primary index on the ordering key field of the file shown in Figure 13.7.


Types of Single-Level Indexes: Primary Index

• Advantage of having a primary index if the file is already sorted by that field?

• Index file is smaller, so binary search on it is faster

– Why is the index file smaller?

• Fewer records (why?), smaller records (why?)

• If the index file is much smaller, there could be another big advantage

– May be possible to keep all or most of the index file in RAM. Advantage?

• Fewer disk accesses


[EN] Eg 1: Primary Index

Record size R = 100 bytes, block size B = 1024 bytes, r = 30000 records

For data file, blocking factor Bfr = # records in a block = ?

For data file, Bfr = B div R = 1024 div 100 = 10

Number of data file blocks b = ?

Number of data file blocks b = ⌈r/Bfr⌉ = 30000/10 = 3000 blocks

If no index, how many block accesses for search by ordering field?

If no index, binary search needs ⌈log2 b⌉ + 1 = ⌈log2 3000⌉ + 1 = 12 + 1 = 13 block accesses

Indexing field 9 bytes, block pointer 6 bytes.

If sparse primary index (on disk) like Figure 18.1, how many block accesses?

Index entry size = ?

Index entry size = 9 + 6 = 15 bytes

For index file, # entries in a block = ?

For index file, Bfr = B div (entry size) = 1024 div 15 = 68

Total # index entries = ?

Total # index entries = # data blocks = 3000

# index file blocks = ?

# index file blocks = ⌈3000/68⌉ = 45 blocks

# block accesses to search?

Binary search: ⌈log2 45⌉ + 1 = 6 + 1 = 7 block accesses. Plus need one more. Why?

To get the data block. Total # block accesses = 7 + 1 = 8
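The arithmetic of Eg 1 can be replayed in a few lines. The sketch below assumes the counting convention the slide's numbers imply, namely ⌈log2 b⌉ + 1 block accesses for a binary search over b blocks (a slightly conservative bound); the variable and function names are chosen here for illustration.

```python
from math import ceil, log2


def binary_search_accesses(blocks):
    # Convention assumed from the slide's numbers: ceil(log2 b) + 1.
    return ceil(log2(blocks)) + 1


B, R, r = 1024, 100, 30000            # block size, record size, record count
bfr_data = B // R                      # records per data block
b = ceil(r / bfr_data)                 # number of data blocks
no_index = binary_search_accesses(b)   # binary search directly on the data file

entry_size = 9 + 6                     # indexing field + block pointer
bfr_index = B // entry_size            # index entries per block
index_blocks = ceil(b / bfr_index)     # sparse: one entry per data block
with_index = binary_search_accesses(index_blocks) + 1  # +1 for the data block
```

With these inputs `no_index` is 13 and `with_index` is 8, matching the slide.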


Types of Single-Level Indexes: Clustering Index

• Motivation: suppose we repeatedly wanted to ask some question about employees according to which department they work for. Eg:

SELECT LNAME, FNAME FROM EMP

WHERE DNUMBER = 3;

• How to do this?

• What we would like here: an index on DNUMBER, even though it is a non-key field

• Also important when looking for a range. Eg:

(DNUMBER >= 2) AND (DNUMBER <= 7)


[EN] FIGURE 18.2 Clustering index on the DEPTNUMBER ordering nonkey field of the EMP file.



• Data file ordered on a non-key field, called the clustering field

– Clustering field does not have unique values

– Index built on the same clustering field

• Includes one index entry for each distinct value of the field.

• Index entry points to the first data block that contains records with that field value.

• Terminology not standardized: clustering index can also mean any index on the field by which the file is sorted

– Could include primary index as a special case




• Sparse

• Insertion: similar problem as before.

– Eg: a full block holds 7, 7, 8, 9 and we want to insert another 7

• How to deal with this?

• Have an entire block for each value of the clustering field

– Insertion and deletion now straightforward

– Could have a lot of almost empty blocks


[EN] FIGURE 18.3 Clustering index with a separate block cluster for each group of records that share the same value for the clustering field.


Types of Single-Level Indexes: Secondary Index

• Motivation: suppose we want to access employees by both ssn and by name

• Assume EMP file is sorted by ssn and we have a primary index with ssn. How to do efficient access with name ?

• Build another index by name

• Secondary index: file not sorted by this field

– Also called non-clustering index.


Secondary Indices Example [SKS]

• One type of secondary index

– Index record points to a bucket that contains pointers to all the actual records with that particular search-key value.

[Figure: secondary index on balance field of account]


Types of Single-Level Indexes: Secondary Index

• A secondary index provides a secondary means of accessing file for which some primary access already exists.

• Can have multiple secondary indexes

• Secondary index may be on a field which is a

– Secondary key: has a unique value in every record

– Non-key with duplicate values.


Types of Single-Level Indexes: Secondary Index with Secondary Key

• The index is an ordered file with two fields.

– The first field is of the same data type as some nonordering field of the data file that is an indexing field

– The second field is either a block pointer or a record pointer.

– If block pointer, have to search block


• Dense


[EN] FIGURE 18.4 A dense secondary index (with record pointers) on a nonordering key field of a file.


[EN] Eg 2: Secondary Index

Record size R = 100 bytes, block size B = 1024 bytes, r = 30000 records

For data file, blocking factor Bfr = # records in a block = ?

For data file, Bfr = B div R = 1024 div 100 = 10

Number of data file blocks b = ⌈r/Bfr⌉ = 30000/10 = 3000 blocks

If no index, how many block accesses for search by non-ordering field?

If no index, linear search needs on average b/2 = 3000/2 = 1500 block accesses

Indexing field 9 bytes, block pointer 6 bytes.

If dense secondary index (on disk) like Figure 18.4, # block accesses?

Index entry size = 9 + 6 = 15 bytes

For index file, Bfr = # entries in a block = B div (entry size) = 1024 div 15 = 68

Total # index entries = ?

Total # index entries = # records in data file = 30000

# index file blocks = ⌈30000/68⌉ = 442 blocks. # block accesses to search?

Binary search: ⌈log2 442⌉ + 1 = 9 + 1 = 10, plus 1 for getting the data block = 11 block accesses

Compare: gone from 1500 to 11
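The comparison in Eg 2 can be wrapped in a small function so other file sizes can be tried. This is a sketch under the same assumed convention as the slide's numbers (⌈log2 b⌉ + 1 accesses per binary search, b/2 on average for a linear scan); the function name and parameters are invented for the example.

```python
from math import ceil, log2


def dense_secondary_cost(r, B, R, key_size, ptr_size):
    """Return (average linear-scan accesses, accesses via a dense secondary index)."""
    data_blocks = ceil(r / (B // R))
    linear = data_blocks // 2                      # average case: half the file
    entries_per_block = B // (key_size + ptr_size)
    index_blocks = ceil(r / entries_per_block)     # dense: one entry per record
    indexed = ceil(log2(index_blocks)) + 1 + 1     # binary search + data block
    return linear, indexed


linear, indexed = dense_secondary_cost(r=30000, B=1024, R=100, key_size=9, ptr_size=6)
```

For the slide's parameters this gives 1500 versus 11 block accesses.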


[EN] FIGURE 18.5 A secondary index (with record pointers) on a nonkey field implemented using one level of indirection so that index entries are of fixed length and have unique field values.


Types of Single-Level Indexes: Secondary Index with Non-key

• Use extra level of indirection

– Pointer points to block of record pointers

• Upside

– Efficiently retrieve all records with a specific value

– Index file is small

• Downside

– May have to do another disk access to get the block of record pointers
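A minimal in-memory sketch of this extra level of indirection follows. The sample EMP records, the field names and `find_by_dno` are hypothetical; list positions stand in for record pointers, and each dictionary value plays the role of one block of record pointers.

```python
from bisect import bisect_left
from collections import defaultdict

# Hypothetical EMP file stored in ssn order; dno is a non-key, non-ordering field.
records = [
    {"ssn": 101, "lname": "Smith", "dno": 5},
    {"ssn": 102, "lname": "Wong", "dno": 4},
    {"ssn": 103, "lname": "Zelaya", "dno": 5},
    {"ssn": 104, "lname": "Wallace", "dno": 4},
    {"ssn": 105, "lname": "Narayan", "dno": 5},
    {"ssn": 106, "lname": "Borg", "dno": 1},
]

# Indirection level: one bucket of record pointers per distinct dno value.
buckets = defaultdict(list)
for rid, rec in enumerate(records):
    buckets[rec["dno"]].append(rid)

# The index proper: one fixed-length entry per distinct value, kept sorted.
index_keys = sorted(buckets)


def find_by_dno(dno):
    """Return last names of all employees in the given department."""
    pos = bisect_left(index_keys, dno)             # binary search in the index
    if pos == len(index_keys) or index_keys[pos] != dno:
        return []
    bucket = buckets[index_keys[pos]]              # extra access: pointer block
    return [records[rid]["lname"] for rid in bucket]
```

The index has three entries for six records, which is the "index file is small" upside; the bucket lookup is the extra access listed as the downside.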


Query Optimization

• Two Egs of how the optimizer might use indexes

• Eg 1: Get last names of employees who work on a project.

– SQL query

– 2 approaches

– Which index available

• Eg 2: Get last names of employees who make more than 60k and who are in department 5.

– SQL query

– 3 approaches

– Which index available


[EN] Table 18.1 Types of Indexes Based on Properties of Indexing Field


Hashing

• Internal Hashing: when the data is being kept in RAM

• External Hashing: when the data is being kept on disk

– This is what we are interested in

– But will first do a quick review of internal hashing

• Since internal hashing is easier to understand


Mod review

• a mod b = c: shorthand for saying that when we divide a by b, the remainder is c

• 7 mod 5 = 2, 19 mod 4 = 3

• Other notations: a mod b ≡ c, or a = c mod b, or a ≡ c mod b

• 7 = 2 mod 5, 19 = 3 mod 4


Direct Address Tables

• Eg: We want to keep information about students. Suppose we have 10 students, and we want to look up their names and grades etc.

• Operations:

– Insert a student

– Search for a student


Direct Address Tables

• Suppose students have id number between 0 and 9.

• Direct Address Table: info stored in table (array) with 10 entries.

• Eg: student 6 goes to table[6], student 4 goes to table[4]. Search for student 6.

• Slow/Fast ?

• Fast: just an array index calculation

• What if: 9 digit ssn ?


Idea behind Hashing

• Can we use direct address tables now?

• No: a 9-digit ssn would need a table with 10^9 entries. Still want fast searches: hash tables.

• Want a way of getting from ssn to an index in the table. “Random” mapping?

• No, because we will need to search for this element after we have inserted it

– So the way of carrying out this search has to be exactly the same as for inserting it.

• Hashing: way of transforming a key into an array index.

• Hash Function: maps key to an index.

• Eg: Hash(SSN) = SSN % 10

– 123-45-6789 goes to 9

– 122-45-6566 goes to 6

– Searching for 123-45-6789. Where will we look?

• Looks straightforward. Possible problem?


[CLR] Example : Collisions


Collisions

• 123-45-6789 goes to 9

• 111-44-9999 goes to 9

• Collision: When two different keys yield the same index.

• Two issues with collisions:

– Dealing with collisions

– Minimizing collisions : good hash functions, won’t study
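The hash function from the earlier slide, Hash(SSN) = SSN % 10, and the collision shown here can be checked directly. The helper name `hash_ssn` and the choice to strip dashes before taking the remainder are assumptions made for this sketch.

```python
def hash_ssn(ssn, table_size=10):
    """Map an ssn string such as '123-45-6789' to a table index."""
    digits = int(ssn.replace("-", ""))   # treat the ssn as one 9-digit number
    return digits % table_size


slot_a = hash_ssn("123-45-6789")
slot_b = hash_ssn("122-45-6566")
slot_c = hash_ssn("111-44-9999")
collision = slot_a == slot_c             # two different keys, same index
```

Here `slot_a` and `slot_c` are both 9, which is exactly the collision described on the slide.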


Collision Resolution

• Chaining: keep all the entries which map onto the same hash value in a linked list

• Open addressing: put the entry in another available slot


[CLR] Example : Chaining


Chaining

• Idea: T[i] pointer to linked list which contains all elts whose keys hash to i.

• Eg: m=7, T[0..6].

• a,b,c,d,e,f arrive in order. h(a) = 5, h(b) = 5, h(c) = 1, h(d) = 6, h(e) = 5, h(f) = 4.

• Now search for e

• Now search for z, h(z) = 0.
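The chaining example can be traced with a short sketch. Python lists stand in for the linked lists, new keys are appended at the tail (an assumption; inserting at the head is also common), and the hash values are simply looked up from the numbers given on the slide.

```python
M = 7
# Hash values as given on the slide.
h = {"a": 5, "b": 5, "c": 1, "d": 6, "e": 5, "f": 4, "z": 0}

table = [[] for _ in range(M)]           # T[0..6], one chain per slot


def insert(key):
    table[h[key]].append(key)


def search(key):
    """Return (found, number of keys compared along the chain)."""
    chain = table[h[key]]
    for compared, candidate in enumerate(chain, start=1):
        if candidate == key:
            return True, compared
    return False, len(chain)


for key in "abcdef":
    insert(key)
```

After the six inserts, T[5] holds a, b, e, so searching for e walks three entries, while searching for z finds an empty chain at T[0].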


Open Addressing

• No linked lists, all elts stored directly in T.

• If collision: probe: look elsewhere in T.

• Wherever we look to insert, we have to search in the same way.

• There are a number of different ways of doing open addressing

– We look at linear probing.


Linear Probing

• Idea: If current slot is full, look at next one.

• Eg: m=7, T[0..6].

• a,b,c,d,e,f arrive in order. h(a) = 5, h(b) = 5, h(c) = 1, h(d) = 6, h(e) = 5, h(f) = 4.

• Now search for e

• Now search for z, h(z) = 0.
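The same six keys can be traced under linear probing. In this sketch the probe sequence wraps around from slot 6 to slot 0 (the usual convention, assumed here), and the hash values are again taken from the slide.

```python
M = 7
# Hash values as given on the slide.
h = {"a": 5, "b": 5, "c": 1, "d": 6, "e": 5, "f": 4, "z": 0}

table = [None] * M                       # T[0..6], keys stored directly


def probe_sequence(key):
    start = h[key]
    return [(start + i) % M for i in range(M)]   # next slot, wrapping around


def insert(key):
    for slot in probe_sequence(key):
        if table[slot] is None:
            table[slot] = key
            return slot
    raise RuntimeError("hash table is full")


def search(key):
    """Return (slot or None, number of slots probed)."""
    for probes, slot in enumerate(probe_sequence(key), start=1):
        if table[slot] is None:
            return None, probes          # empty slot: key cannot be further on
        if table[slot] == key:
            return slot, probes
    return None, M


slots = [insert(key) for key in "abcdef"]
```

The keys end up in slots 5, 6, 1, 0, 2, 4. Searching for e probes 5, 6, 0, 1, 2 before finding it; searching for z probes 0, 1, 2 and stops at the empty slot 3.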


External Static Hashing

• External Hashing: hashing for disk files

• Static hashing or dynamic hashing

• Static hashing: the file blocks are divided into M equal-sized buckets, numbered bucket 0, bucket 1, ..., bucket M-1

– Typically, a bucket corresponds to one disk block.

• The record with hash key value K is stored in bucket i, where i=h(K)

• Hash function h is a function from set of all search-key values to set of all bucket addresses.


Static Hashing [EN] Figure 17.9


Static Hashing Eg [SKS]

• Hash file organization of account file, using branch_name as hashing field

• There are 10 buckets

• The binary representation of the ith letter of the alphabet is assumed to be the integer i.

• The hash function returns the sum of the binary representations of the characters modulo 10

– Eg: h(Perryridge) = 5, h(Round Hill) = 3, h(Brighton) = 3
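The three hash values quoted above can be reproduced with a short function. The sketch assumes that upper and lower case letters count the same and that the space in "Round Hill" contributes nothing; both assumptions are needed to obtain the slide's values.

```python
def h(branch_name, buckets=10):
    """Sum of letter positions (a=1, ..., z=26) modulo the number of buckets."""
    total = sum(
        ord(ch) - ord("a") + 1
        for ch in branch_name.lower()
        if ch.isalpha()                  # skip spaces and other non-letters
    )
    return total % buckets


perryridge = h("Perryridge")   # letters sum to 125
round_hill = h("Round Hill")   # letters sum to 113
brighton = h("Brighton")       # letters sum to 93
```

Round Hill and Brighton land in the same bucket (3), so even this small example already shows two different search-key values sharing a bucket.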


Static Hashing Eg [SKS]

[Figure: hash file organization of account file, using branch_name as key (see previous slide for details).]


Static Hashing

• Hash function is used to locate records for access, insertion as well as deletion.

• Records with different search-key values may be mapped to the same bucket

– What does this imply when looking for a record?

• Entire bucket has to be searched to locate the record

– But done in RAM, so not a problem

• Search is very efficient on the hash key

• How to deal with collisions

– What is a collision now?


Bucket Overflows

• Collisions occur when a new record hashes to a bucket that is already full

– If it is not full, not a problem

• When would bucket overflow start happening on a large scale?

• Insufficient buckets

• Skew in distribution of records. Why?

• Lousy hash function (or unlucky!)

• Although the probability of bucket overflow can be reduced, it cannot be eliminated.


Handling of Bucket Overflows

• How to handle bucket overflow? Two ways:

• Overflow file kept for storing such records

– All overflow records kept in same block

– Even if coming from different buckets

– See [EN] Eg.

• Overflow chaining

– The overflow blocks of a given bucket are chained together in a linked list.

– See [SKS] Eg


Overflow File Eg [EN] Figure 17.10


Overflow Chaining Eg [SKS]

• Advantage of doing it this way?

• Faster search. Disadvantage ?

• Wasted space


Static Hashing

• To reduce overflow records, a hash file is typically kept 70-80% full.

• The hash function h should distribute the records uniformly among the buckets. Why?

• Otherwise, search time will be increased because many overflow records will exist.

• Ordered access on hash key efficient?

• No: inefficient (requires sorting the records)

– This is true of any hashing scheme

– What about range queries: efficient?

• Range queries also inefficient


Deficiencies of Static Hashing

• Databases grow or shrink with time.

• In static hashing, fixed # buckets. If # buckets too small?

• If # buckets too small, and file grows, performance will degrade due to too many overflows. If # buckets too large?

• Significant amount of space will be wasted initially (and buckets will be underfull).

– Similar problem if database shrinks, again space will be wasted.

• If too much overflow or underflow, solution?


Deficiencies of Static Hashing

• One solution: periodic re-organization of the file with a new hash function. Problem?

• Large overhead, disrupts normal operations

• Different solution: allow the number of buckets to be modified dynamically: dynamic hashing or extendible hashing

– Allow the dynamic growth and shrinking of the number of file records.

– If overflow, split

– If underflow, merge

– We won’t cover in detail, [EN] does


Multi-Level Indexes

• Suppose index too big to be in RAM, is on disk. Consequences?

• Search expensive: log (# blocks). To improve?

• Treat main index kept on disk as a sorted file

– Build a sparse index for the main index

– First level (inner index): the main (“primary”) index file

– Second level (outer index): sparse index of the primary index sorted file

• If even outer index too large to fit in RAM?

• Build another index on outer index

– … and so on, until all entries of top level fit in one block


Multi-Level Indexes [SKS]


Multi-Level Indexes - Eg

• How does this help? Look at an example:

– Suppose we have 2 levels, with the first level being dense (eg: secondary index), with bfr = 20

– Suppose 400 data records

– Suppose 2nd level is in RAM

– How many disk accesses?

• 400 index records, bfr 20, so # blocks in 1st level = 400/20 = 20.

• If only 1st level: ⌈log2 20⌉ + 1 = 6 for the index, 6 + 1 = 7 with the data block

• With 2 levels (if top level in RAM)?

• 2: one first-level index block plus the data block


[EN] FIGURE 18.6 A two-level primary index resembling ISAM (Indexed Sequential Access Method) organization.

• ISAM: originally developed by IBM

• Now used in MySQL

– MyISAM


[EN] Eg 3: Multi-level indexes

Record size R = 100 bytes, block size B = 1024 bytes, r = 30000 records

For data file, blocking factor Bfr = # records in a block = ?

For data file, Bfr = B div R = 1024 div 100 = 10

Number of data file blocks b = ⌈r/Bfr⌉ = 30000/10 = 3000 blocks

We saw that with a dense secondary index (on disk), # block accesses = 11

Indexing field 9 bytes, block pointer 6 bytes, index entry size = 15 bytes

If multi-level index like Figure 18.6, # block accesses?

For index file, Bfr = # entries in a block = B div (entry size) = 1024 div 15 = 68

Total # first level index entries = # records in data file = 30000

# first level index file blocks = ⌈30000/68⌉ = 442 blocks

# second level index file blocks = ?

# second level index file blocks = ⌈442/68⌉ = 7 blocks

# third level index file blocks = ?

# third level index file blocks = ⌈7/68⌉ = 1 block. Top level.

Total # block accesses assuming everything on disk = ?

Total # block accesses = 1 + 1 + 1 + 1 (for data block) = 4

Compare: gone from 11 to 4
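The level-by-level computation in Eg 3 is a simple loop: keep dividing by the index blocking factor until one block remains. The function name `index_levels` and its parameters are invented for this sketch.

```python
from math import ceil


def index_levels(first_level_entries, fan_out):
    """Block counts per index level, bottom to top, ending at a one-block level."""
    levels = []
    entries = first_level_entries
    while True:
        blocks = ceil(entries / fan_out)
        levels.append(blocks)
        if blocks == 1:
            return levels
        entries = blocks                 # next level: one entry per block below


levels = index_levels(30000, 1024 // 15)   # dense first level, 68 entries per block
accesses = len(levels) + 1                 # one block per level + the data block

small = index_levels(400, 20)              # the 400-record example, bfr = 20
```

For Eg 3 the levels come out as 442, 7 and 1 blocks, so a lookup reads 3 index blocks plus 1 data block. For the 400-record example the levels are 20 and 1; with the top level held in RAM only the first-level block and the data block are read from disk.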


Multi-Level Indexes

• Multi-level index can be for any type of first-level index: primary, secondary, clustering.

• Multi-level index is a form of search tree.

• When records are inserted/deleted, expensive. Why?

• Every level of index is a sorted file.

– Sorted file has to be updated

– And so does every index on the file

• Performance degrades as file grows. Why?

• Potentially many overflow blocks can be created.

– Periodic reorganization of entire file is required.

– But can be expensive


Disadvantages of indexed sorted files

• Sequential scan using primary index (file sorted by indexing field) efficient. Why?

• Sequential scan using secondary index: fast?

– Eg: EMPLOYEE file sorted by ssn

– Secondary index by last name

– Want to write out in alphabetical order.

• Expensive

– Each record access may fetch a new block from disk

– Block fetch requires about 5 to 10 milliseconds, versus about 100 nanoseconds for memory access

• Solution: B-trees, B+trees, hashing indexes


Indexes: B-Trees, B+ Trees

• Problems of indexed-sequential files

– As file changes, expensive to maintain index

• B-tree, B+tree indexes solve this problem

– When changes are made, the index automatically reorganizes itself with small, local changes

• B-tree, B+tree indices are an alternative to indexed-sequential files

• We will briefly look at B-trees, then B+trees

– A kind of multi-level index

– Studied in more detail in CSCI 6632


[CLR] example of a B-tree


[EN] FIGURE 18.10 B-tree structure and example


Indexes: B-Trees

• Can keep entire records in trees

– Entire file kept as a B-tree

• Alternative: only keys with links (to rest of the record) in tree.

– Full records kept elsewhere, maybe in unsorted file

– Advantage of doing it like this?

• What has to be kept in B-tree is less

– Advantage?

• Fit in more per node, shallower depth


Indexes: B-Trees

• Advantage compared to binary search trees?

• Fewer disk accesses than binary search trees: why?

• Related info in one block in B-Tree

• B-Trees: each node corresponds to a disk block

• Insertion and deletion efficient?

• Each node is kept between half-full and completely full

– Because of this flexibility, relatively easy to do insertions and deletions

• Now look at B+ Trees


B+ Tree Indexes [RG]

• Leaf pages contain data entries, and are chained (prev & next)

• Non-leaf pages have index entries; only used to direct searches:

P0 | K1 | P1 | K2 | P2 | ... | Km | Pm (each <Ki, Pi> pair is an index entry)

[Figure: non-leaf pages on top; leaf pages at the bottom, sorted by search key]


[EN] FIGURE 18.11 The nodes of a B+tree.


Example B+ Tree [RG]

• Find 7? 29? All > 15 and < 30?

• Insert/delete: find data entry in leaf, then change it. Need to adjust parent sometimes.

– And change sometimes bubbles up the tree

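[Figure: B+ tree with root key 17 (entries <= 17 on the left, entries > 17 on the right); internal nodes (5, 13) and (27, 30); leaves, left to right: 2 3 | 5 7 8 | 14 16 | 22 24 | 27 29 | 33 34 38 39]

Note how data entries in leaf level are sorted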
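The three questions on this slide (find 7, find 29, all entries > 15 and < 30) can be traced on a small hand-built tree. The layout below is one reading of the figure and should be treated as an assumption: root key 17, internal nodes (5, 13) and (27, 30), and six chained leaves. A key equal to a separator is routed to the right child, which is also an assumption; since 17 is not stored in any leaf, the other convention gives the same answers here. Class and function names are invented for the sketch.

```python
from bisect import bisect_right


class Leaf:
    def __init__(self, keys):
        self.keys = keys
        self.next = None                 # link to the next leaf in key order


class Inner:
    def __init__(self, keys, children):
        self.keys = keys                 # separators, used only for navigation
        self.children = children


leaves = [Leaf(k) for k in ([2, 3], [5, 7, 8], [14, 16],
                            [22, 24], [27, 29], [33, 34, 38, 39])]
for left, right in zip(leaves, leaves[1:]):
    left.next = right

root = Inner([17], [Inner([5, 13], leaves[:3]), Inner([27, 30], leaves[3:])])


def find_leaf(key):
    node = root
    while isinstance(node, Inner):
        node = node.children[bisect_right(node.keys, key)]
    return node


def contains(key):
    return key in find_leaf(key).keys


def open_range(low, high):
    """All keys k with low < k < high, read off the leaf chain."""
    result = []
    leaf = find_leaf(low)
    while leaf is not None:
        for k in leaf.keys:
            if k >= high:
                return result
            if k > low:
                result.append(k)
        leaf = leaf.next
    return result
```

The range query descends the tree only once, to the leaf where 15 would live, and then follows the leaf links; this is the fast sequential access that the linked leaf level provides.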


B-tree and B+tree Differences

• In both can do quickly:

– Searches, insertions and deletions to indexes

• Also true of leaf nodes in B+tree

• B-tree: ptrs to data records at all levels of the tree

• B+tree: ptrs to data records only at leaf-level nodes

– Internal nodes only for navigation

• B+tree can have fewer levels than B-tree

– B-tree index is dense

– B+tree index is sparse, linked list is dense

• B+tree can also do fast sequential access: how?

• Linked list at bottom level is in sequential order

• B+tree: greater complexity: maintaining leaf nodes


Multiple-Key Access/Indexes

• Use multiple indices for certain types of queries.

• [EN] Eg: Emp who are 59 years old and are in dept 4

select ssn from Emp

where dno = 4 and age = 59

• Possible strategies for processing the query using indices on single attributes?

– Depends on which indices are available

• What indices would be helpful?


Multiple-Key Access/Indexes

• Suppose 2 indices: dno, age. How to do?

• Method 1: Use index on dno to find Emp with dno 4

– Then test age = 59

• Method 2: Use index on age to find Emp with age 59

– Then test dno = 4

• Method 3: Use index on dno to find records of Emp with dno 4. Use index on age to find records of Emp with age 59.

– Now what?

• Take intersection of both sets of records.
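Method 1 and Method 3 can be compared on a toy Emp table. Everything in this sketch is hypothetical: the five sample rows, the use of list positions as record pointers, and Python dictionaries of sets standing in for the two single-attribute indices.

```python
from collections import defaultdict

# Hypothetical Emp rows: (ssn, dno, age); list position acts as record pointer.
emp = [
    ("111", 4, 59),
    ("222", 4, 30),
    ("333", 5, 59),
    ("444", 4, 59),
    ("555", 1, 45),
]

dno_index = defaultdict(set)
age_index = defaultdict(set)
for rid, (_, dno, age) in enumerate(emp):
    dno_index[dno].add(rid)
    age_index[age].add(rid)

# Method 1: use the dno index, fetch each record, then test age.
method1 = sorted(rid for rid in dno_index.get(4, set()) if emp[rid][2] == 59)

# Method 3: intersect the two pointer sets before fetching any record.
method3 = sorted(dno_index.get(4, set()) & age_index.get(59, set()))

ssns = [emp[rid][0] for rid in method3]
```

Both methods return the same records. The difference is in how many data records are fetched: Method 1 fetches every dno 4 record, while Method 3 fetches only the records in the intersection.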


Ordered Indices on Multiple Attributes

• Composite search keys are search keys containing more than one attribute

– Eg: searching for combination of dno, age

• Lexicographic ordering: (a1, a2) < (b1, b2) if either

a1 < b1, or

a1 = b1 and a2 < b2

Eg: (4, 40) < (5, 20)

Eg: (4, 40) < (4, 45)

• Can build a single index on multiple attributes


Ordered Indices on Multiple Attributes

Suppose we have an index on combined search-key (dno, age).

• Consider the following:

– where dno = 4 and age = 59

• The index on (dno, age) can be used to fetch only records that satisfy both conditions.

• More efficient than using separate indices?

– Eg: use index on dno, age and take intersection

• Using separate indices is less efficient

– We may fetch many records that satisfy only one of the conditions.


Ordered Indices on Multiple Attributes

Suppose we have an index on combined search-key (dno, age).

• Is the following efficiently handled?

– where dno = 4 and age < 59

• Yes: because of lexicographic ordering

• Is the following efficiently handled?

– where dno < 6 and age = 59

• Not quite so efficient

– May fetch many records that satisfy the first but not the second condition
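The two queries can be tried against a sorted list of (dno, age) entries, which behaves like a composite index because Python compares tuples lexicographically. The sample entries, record ids and function names are made up for this sketch.

```python
from bisect import bisect_left

# Hypothetical composite index entries (dno, age, record id), kept sorted.
entries = sorted([
    (4, 40, "r1"), (4, 45, "r2"), (4, 59, "r3"), (4, 61, "r4"),
    (5, 20, "r5"), (5, 59, "r6"), (3, 59, "r7"), (2, 30, "r8"),
])
keys = [(dno, age) for dno, age, _ in entries]
NEG_INF = float("-inf")


def dno_eq_age_lt(dno, age):
    """where dno = ? and age < ?: one contiguous slice of the index."""
    lo = bisect_left(keys, (dno, NEG_INF))
    hi = bisect_left(keys, (dno, age))
    return entries[lo:hi]


def dno_lt_age_eq(dno, age):
    """where dno < ? and age = ?: scan every entry with a smaller dno, then filter."""
    hi = bisect_left(keys, (dno, NEG_INF))
    scanned = entries[:hi]
    return [e for e in scanned if e[1] == age], len(scanned)


first = [rid for _, _, rid in dno_eq_age_lt(4, 59)]
matches, scanned = dno_lt_age_eq(6, 59)
second = [rid for _, _, rid in matches]
```

The first query touches exactly the two qualifying entries. The second has to walk all eight entries with dno < 6 to find the three with age 59, which is the inefficiency noted on the slide.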


Grid Files: [EN] Figure 18.14

• Another alternative for composite search

• Do well in terms of access time. Downside?

• Space for grid array, maintenance when file changes


Hash Indices [SKS]

• Can use hashing for indices:

– A hash index organizes the search keys, with their associated record pointers, into a hash file structure.

• If the file itself is organized using hashing

– A separate hash index on it using the same search-key is unnecessary. Why?

• Sometimes, the term hash index is used to refer to both secondary index structures and hash organized files.


Example of Hash Index [SKS]

• Data file ordered by branch name

• Secondary index on Acct#


Ordered Indexing vs Hashing

• Which works better depends on the particular situation.

• Relative frequency of insertions and deletions

• Average access time vs worst-case access time?

• Expected type of queries: which type of query will each be good at?

• Hashing is generally better at retrieving records having a specified value of the key.

• Ordered indices better at range queries


Cost/Benefit of Indexes

• Indexes can have large benefits:

– B-tree can search 1M rows of indexed data with < 20 lookups

– Hashed index (on avg) about 1 lookup

• Why not have lots of indexes all the time?

• Cost of maintaining index when there are updates.

– Introduction to Oracle 10g, Perry and Post: “According to Oracle performance tuning documentation, each index requires about 3 times the resources as the original DML.”

– “So adding 3 indexes to a table will slow down an INSERT command by about 10 times.”

• Balance faster retrieval vs slower updates


Data Warehousing Systems

• Used for analysis, not transaction processing. Since no transaction processing, consequence?

• No updates. Impact of this?

• Data is denormalized and stored together, and materialized views are used

– Advantage of denormalized?

• Data in fewer tables

– Fewer joins. Why do we normalize?

– Does that logic apply here?

• No updates so no modification anomalies

• Advantage of materialized views?


Data Warehousing Systems

• Don’t have to go back and recalculate views every time a view is referred to

– What is the problem with materialized views?

• Have to change on updates

– Does it apply here?

• Can create lots of indexes (indexes on most columns)

– What is the problem with having lots of indexes?

• Cost of maintaining lots of indexes.

• Does this apply here?

• No updates


Index Definition in SQL

• Index statements part of early versions of SQL

– But not part of SQL standard today. Why?

• Physical access path, not data specification

– Responsibility of DBMS

– Not of person writing SQL queries

– Commercial DBMS have index specifications

• End users may not be aware of indices

– SQL queries remain the same

– Indices can be created/destroyed without affecting correctness of query

• But efficiency is affected


Indexes supported in DBMS

• Theoretically, DBMS not even required to support indices

• In practice, every commercial DBMS supports some form of indexing. Why?

– Some ops inefficient without indices. Which ones?

• Joins

• Range Queries

• Checking uniqueness

– For keys

– When DISTINCT (no duplicates) specified

• Referential Integrity


Index Definition in SQL

• Many DBMS automatically create an index on the primary key

– And on other keys (specified via UNIQUE)

• In addition, DBMS allow the programmer to explicitly create and destroy indexes.

• Since no current SQL standard, we will look at typical syntax for creating indexes

– Based on old SQL syntax

– We then look at Eg from Oracle, SQL-Server

• Also a drop index command

DROP INDEX indexname


Index Definition in SQL

CREATE INDEX LNAME_INDEX

ON EMPLOYEE (LNAME);

• What type of index is this?

• Secondary index

– File not sorted by LNAME

– LNAME is not a key

• Can create index on multiple attributes

CREATE INDEX FULLNAME_INDEX

ON EMPLOYEE (LNAME, FNAME);

– On both, with LNAME being more significant


Index Definition in SQL: on key

• Index corresponding to a key:

CREATE UNIQUE INDEX SSN_INDEX

ON EMPLOYEE (SSN);

• Will enforce uniqueness

• In early versions of SQL, only way of specifying uniqueness. Why?

• Otherwise too inefficient to check uniqueness

• When we specify that an attribute is a key, typically an index like this is created.

• File may not be sorted on indexing field
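To see a unique index enforcing uniqueness, the statements can be run against SQLite through Python's standard `sqlite3` module. SQLite is used here only because it is easy to run; it is not one of the systems named in these slides, its dialect differs from the old syntax shown (for instance, it has no CLUSTER option), and the table layout and sample rows are invented. Index names use underscores because a hyphen is not valid in an unquoted SQL identifier.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE EMPLOYEE (SSN TEXT, LNAME TEXT, FNAME TEXT, DNO INTEGER)")

# Secondary index on a non-key field, and a unique index on the key.
conn.execute("CREATE INDEX LNAME_INDEX ON EMPLOYEE (LNAME)")
conn.execute("CREATE UNIQUE INDEX SSN_INDEX ON EMPLOYEE (SSN)")

conn.execute("INSERT INTO EMPLOYEE VALUES ('123456789', 'Smith', 'John', 5)")

# A second row with the same SSN violates the unique index.
try:
    conn.execute("INSERT INTO EMPLOYEE VALUES ('123456789', 'Wong', 'Franklin', 5)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# Dropping an index changes performance only, not query results.
conn.execute("DROP INDEX LNAME_INDEX")
remaining = [row[0] for row in
             conn.execute("SELECT name FROM sqlite_master WHERE type = 'index'")]
row_count = conn.execute("SELECT COUNT(*) FROM EMPLOYEE").fetchone()[0]
```

The duplicate insert is rejected by the unique index, and after the drop only `SSN_INDEX` remains while the table contents are unchanged.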


Index Definition in SQL: CLUSTER

• Can do a clustering index

– File has to be sorted by the indexing field

– Indexing field may not be a key, may be repeated

CREATE INDEX DNO_INDEX

ON EMPLOYEE (DNO)

CLUSTER;

• Without CLUSTER, file may not be sorted on that field


Index Definition in SQL: Primary, B-tree

• If we want to get a primary index, how to do it?

• Use both CLUSTER and UNIQUE

CREATE UNIQUE INDEX SSN_INDEX

ON EMPLOYEE (SSN)

CLUSTER;

• User can specify that a B-tree index is wanted:

CREATE INDEX MY_INDEX

ON EMPLOYEE (SALARY)

WITH STRUCTURE = BTREE;
