
1

Database Systems

Session 8 – Main Theme

Physical Database Design,

Query Execution Concepts

and Database Programming Techniques

Dr. Jean-Claude Franchitti

New York University

Computer Science Department

Courant Institute of Mathematical Sciences

Presentation material partially based on textbook slides

Fundamentals of Database Systems (6th Edition)

by Ramez Elmasri and Shamkant Navathe

Slides copyright © 2011

Section on indexing references is partially based on a

report prepared by Joseph Conron

2

Agenda

1 Session Overview

2 Disk Storage, Basic File Structures, and Hashing

3 Indexing Structures for Files

4 Algorithms for Query Processing and Optimization

5 Physical Database Design and Tuning

6 Introduction to SQL Programming Techniques

7 Web Database Programming Using PHP

8 Summary and Conclusion

3

Session Agenda

■ Disk Storage, Basic File Structures, and Hashing

■ Indexing Structures for Files

■ Algorithms for Query Processing and Optimization

■ Physical Database Design and Tuning

■ Introduction to SQL Programming Techniques

■ Web Database Programming Using PHP

■ Summary & Conclusion

4

What is the class about?

■ Course description and syllabus:

» http://www.nyu.edu/classes/jcf/CSCI-GA.2433-001

» http://cs.nyu.edu/courses/fall11/CSCI-GA.2433-001/

■ Textbooks:

» Fundamentals of Database Systems (6th Edition)

Ramez Elmasri and Shamkant Navathe

Addison-Wesley

ISBN-10: 0-1360-8620-9, ISBN-13: 978-0136086208, 6th Edition (04/10)

5

Icons / Metaphors

Common Realization

Information

Knowledge/Competency Pattern

Governance

Alignment

Solution Approach

6


7

Agenda

■ Database Design

■ Disk Storage Devices

■ Files of Records

■ Operations on Files

■ Unordered Files

■ Ordered Files

■ Hashed Files

» Dynamic and Extendible Hashing Techniques

■ RAID Technology

8

Database Design

■ Logical DB Design:

1. Create a model of the enterprise (e.g., using ER)

2. Create a logical “implementation” (using a relational model and normalization)

» Creates the top two layers: “User” and “Community”

» Independent of any physical implementation

■ Physical DB Design

» Uses a file system to store the relations

» Requires knowledge of hardware and operating system characteristics

» Addresses questions of distribution, if applicable

» Creates the third layer

■ Query execution planning and optimization ties the two together; not the focus of this unit

9

Issues Addressed In Physical Design

■ Main issues addressed generally in physical design

» Storage media

» File structures

» Indexes

■ We concentrate on

» Centralized (not distributed) databases

» Database stored on a disk using a “standard” file system, not one “tailored” to the database

» Database using an unmodified general-purpose operating system

» Indexes

■ The only criterion we will consider: performance

10

What Is A Disk?

■ A disk consists of a sequence of cylinders

■ A cylinder consists of a sequence of tracks

■ A track consists of a sequence of blocks (actually each block is a sequence of sectors)

■ For us: A disk consists of a sequence of blocks

■ All blocks are of the same size, say 4K bytes

■ We assume: a physical block has the same size as a virtual memory page

11

What Is A Disk?

■ A physical unit of access is always a block.

■ If an application wants to read one or more bits that are in a single block

» If an up-to-date copy of the block is in RAM already (as a page in virtual memory), read from the page

» If not, the system reads a whole block and puts it as a whole page in a disk cache in RAM
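The read path described above can be sketched as follows. This is a minimal illustration, not a real operating system interface: the class name `BlockCache`, the in-memory list standing in for the disk, and the unbounded cache are all assumptions made for the example.

```python
class BlockCache:
    """Toy disk cache: bytes are always served from a whole cached block."""

    def __init__(self, disk_blocks):
        self.disk = disk_blocks   # hypothetical disk: a list of equal-sized byte blocks
        self.cache = {}           # block number -> block contents held in RAM
        self.disk_reads = 0       # number of whole-block reads from the disk

    def read_byte(self, block_no, offset):
        # If the block is not cached, read the WHOLE block from disk into RAM.
        if block_no not in self.cache:
            self.cache[block_no] = self.disk[block_no]
            self.disk_reads += 1
        # Serve the requested byte from the cached copy.
        return self.cache[block_no][offset]


if __name__ == "__main__":
    disk = [bytes([i]) * 4096 for i in range(3)]   # three 4K blocks
    cache = BlockCache(disk)
    cache.read_byte(1, 0)
    cache.read_byte(1, 100)     # same block: no second disk read
    print(cache.disk_reads)
```

Reading two bytes of the same block costs one disk access; that is the property the later cost model relies on.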

12

Disk Storage Devices

■ Preferred secondary storage device for high storage capacity and low cost.

■ Data stored as magnetized areas on magnetic disk surfaces.

■ A disk pack contains several magnetic disks connected to a rotating spindle.

■ Disks are divided into concentric circular tracks on each disk surface.

» Track capacities vary typically from 4 to 50 Kbytes or more

13

Disk Storage Devices (cont.)

■ A track is divided into smaller blocks or sectors

» because it usually contains a large amount of information

■ The division of a track into sectors is hard-coded on the disk surface and cannot be changed.

» One type of sector organization defines a sector as the portion of a track that subtends a fixed angle at the center.

■ A track is divided into blocks.

» The block size B is fixed for each system.

• Typical block sizes range from B=512 bytes to B=4096 bytes.

» Whole blocks are transferred between disk and main memory for processing.

14


15


■ A read-write head moves to the track that contains the block to be transferred.

» Disk rotation moves the block under the read-write head for reading or writing.

■ A physical disk block (hardware) address consists of:

» a cylinder number (imaginary collection of tracks of same radius from all recorded surfaces)

» the track number or surface number (within the cylinder)

» and block number (within track).

■ Reading or writing a disk block is time consuming because of the seek time s and rotational delay (latency) rd.

■ Double buffering can be used to speed up the transfer of contiguous disk blocks.
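To make the cost of one random block access concrete, the sketch below adds up the seek time s, the average rotational delay rd (half a rotation), and a block transfer time. The block transfer time and all numeric values are assumptions for illustration; they are not taken from the slides.

```python
def avg_rotational_delay_ms(rpm):
    """Average rotational delay rd: half of one full rotation, in milliseconds."""
    return 0.5 * 60_000 / rpm


def random_block_access_ms(seek_ms, rpm, transfer_ms):
    """Estimated time to read one randomly placed block: s + rd + transfer time."""
    return seek_ms + avg_rotational_delay_ms(rpm) + transfer_ms


if __name__ == "__main__":
    # Hypothetical disk: 8 ms seek, 6000 rpm, 1 ms to transfer one block.
    print(random_block_access_ms(seek_ms=8, rpm=6000, transfer_ms=1))
```

For contiguous blocks the seek and rotational delay are paid roughly once, which is why reading consecutive blocks (for example with double buffering) is much cheaper per block than random access.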

16


17

Typical Disk Parameters

18

Records

■ Fixed and variable length records

■ Records contain fields which have values of a particular type

» E.g., amount, date, time, age

■ Fields themselves may be fixed length or variable length

■ Variable length fields can be mixed into one record:

» Separator characters or length fields are needed so that the record can be “parsed.”

19

Blocking

■ Blocking:

» Refers to storing a number of records in one block on the disk.

■ Blocking factor (bfr) refers to the number of records per block.

■ There may be empty space in a block if an integral number of records does not fit exactly in one block.

■ Spanned Records:

» Refers to records that do not fit in the space remaining in a block (or are larger than a block) and hence are stored across a number of blocks.

20

What Is A File?

■ A file can be thought of as a “logical” or a “physical” entity

■ A file as a logical entity: a sequence of records

■ Records are either fixed size or variable size

■ A file as a physical entity: a sequence of fixed size blocks (on the disk), but not necessarily physically contiguous (the blocks could be dispersed)

■ Whether or not the blocks of a file are physically contiguous, the following is generally simple to do (for the file system):

» Find the first block

» Find the last block

» Find the next block

» Find the previous block

21

What Is A File?

■ Records are stored in blocks

» This gives the mapping between a “logical” file and a “physical” file

■ Assumptions (some to simplify presentation for now)

» Fixed size records

» No record spans more than one block

» There are, in general, several records in a block

» There is some “left over” space in a block, needed later (for chaining the blocks of the file)

■ We assume that each relation is stored as a file

■ Each tuple is a record in the file

22

Example: Storing A Relation (Logical File)

Relation

E#  Salary
1   1200
3   2100
4   1800
2   1200
6   2300
9   1400
8   1900

Records

Each tuple of the relation is one record of the logical file: (1, 1200), (3, 2100), (4, 1800), (2, 1200), (6, 2300), (9, 1400), (8, 1900)

23

Example: Storing A Relation (Physical File)

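Records

(1, 1200), (3, 2100), (4, 1800), (2, 1200), (6, 2300), (9, 1400), (8, 1900)

Blocks

[ 6 2300 | 9 1400 ]

[ 1 1200 | 3 2100 | 8 1900 ]

[ 4 1800 | 2 1200 ]

(The figure also labels the left-over space in a block and marks the first block of the file.)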

24

Files of Records

■ A file is a sequence of records, where each record is a collection of data values (or data items).

■ A file descriptor (or file header) includes information that describes the file, such as the field names and their data types, and the addresses of the file blocks on disk.

■ Records are stored on disk blocks.

■ The blocking factor bfr for a file is the (average) number of file records stored in a disk block.

■ A file can have fixed-length records or variable-length records.

25

Files of Records (cont.)

■ File records can be unspanned or spanned

» Unspanned: no record can span two blocks

» Spanned: a record can be stored in more than one block

■ The physical disk blocks that are allocated to hold the records of a file can be contiguous, linked, or indexed.

■ In a file of fixed-length records, all records have the same format. Usually, unspanned blocking is used with such files.

■ Files of variable-length records require additional information to be stored in each record, such as separator characters and field types.

» Usually spanned blocking is used with such files.

26

Operation on Files

■ Typical file operations include:

» OPEN: Readies the file for access, and associates a pointer that will refer to a current file record at each point in time.

» FIND: Searches for the first file record that satisfies a certain condition, and makes it the current file record.

» FINDNEXT: Searches for the next file record (from the current record) that satisfies a certain condition, and makes it the current file record.

» READ: Reads the current file record into a program variable.

» INSERT: Inserts a new record into the file and makes it the current file record.

» DELETE: Removes the current file record from the file, usually by marking the record to indicate that it is no longer valid.

» MODIFY: Changes the values of some fields of the current file record.

» CLOSE: Terminates access to the file.

» REORGANIZE: Reorganizes the file records.

• For example, the records marked deleted are physically removed from the file or a new organization of the file records is created.

» READ_ORDERED: Reads the file blocks in order of a specific field of the file.

27

Unordered Files

■ Also called a heap or a pile file.

■ New records are inserted at the end of the file.

■ A linear search through the file records is necessary to search for a record.

» This requires reading and searching half the file blocks on average, and is hence quite expensive.

■ Record insertion is quite efficient.

■ Reading the records in order of a particular field requires sorting the file records.

28

Ordered Files

■ Also called a sequential file.

■ File records are kept sorted by the values of an ordering field.

■ Insertion is expensive: records must be inserted in the correct order.

» It is common to keep a separate unordered overflow (or transaction) file for new records to improve insertion efficiency; this is periodically merged with the main ordered file.

■ A binary search can be used to search for a record on its ordering field value.

» For a file of b blocks, this requires reading and searching about log2(b) blocks on average, an improvement over the b/2 blocks of a linear search.

■ Reading the records in order of the ordering field is quite efficient.
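A binary search over the blocks of an ordered file can be sketched as below. The representation (a list of blocks, each a list of `(key, value)` pairs sorted across the whole file) and the function name are assumptions for illustration; each block inspected counts as one block access.

```python
def binary_search_blocks(blocks, key):
    """Search an ordered file; returns (value or None, number of block accesses)."""
    lo, hi = 0, len(blocks) - 1
    accesses = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        block = blocks[mid]          # one block read from disk
        accesses += 1
        if key < block[0][0]:        # key precedes the first record of the block
            hi = mid - 1
        elif key > block[-1][0]:     # key follows the last record of the block
            lo = mid + 1
        else:                        # key, if present, must be in this block
            for k, v in block:
                if k == key:
                    return v, accesses
            return None, accesses
    return None, accesses


if __name__ == "__main__":
    # 1000 records, 10 per block, so b = 100 blocks.
    records = [(k, k * 10) for k in range(1000)]
    blocks = [records[i:i + 10] for i in range(0, 1000, 10)]
    print(binary_search_blocks(blocks, 357))
```

With b = 100 blocks the search never reads more than 7 blocks, compared with 50 on average for a linear search.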

29

Ordered Files (cont.)

30

Average Access Times

� The following table shows the average access time to access a specific record for a given type of file

31

Hashed Files

■ Hashing for disk files is called External Hashing

■ The file blocks are divided into M equal-sized buckets, numbered bucket 0, bucket 1, ..., bucket M-1.

» Typically, a bucket corresponds to one (or a fixed number of) disk block.

■ One of the file fields is designated to be the hash key of the file.

■ The record with hash key value K is stored in bucket i, where i=h(K), and h is the hashing function.

■ Search is very efficient on the hash key.

■ Collisions occur when a new record hashes to a bucket that is already full.

» An overflow file is kept for storing such records.

» Overflow records that hash to each bucket can be linked together.
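The following is a small in-memory sketch of static external hashing with per-bucket overflow chains. The class name, the choice h(K) = K mod M, and the use of Python lists in place of disk blocks are assumptions made for the example.

```python
class StaticHashFile:
    """Toy static hash file: M fixed buckets plus an overflow chain per bucket."""

    def __init__(self, num_buckets, bucket_capacity):
        self.M = num_buckets
        self.capacity = bucket_capacity
        self.buckets = [[] for _ in range(num_buckets)]    # primary buckets
        self.overflow = [[] for _ in range(num_buckets)]   # overflow records per bucket

    def h(self, key):
        # Assumed hash function for integer keys: h(K) = K mod M.
        return key % self.M

    def insert(self, key, record):
        i = self.h(key)
        if len(self.buckets[i]) < self.capacity:
            self.buckets[i].append((key, record))
        else:
            # Collision with a full bucket: store in the overflow chain of bucket i.
            self.overflow[i].append((key, record))

    def search(self, key):
        i = self.h(key)
        # Check the primary bucket first, then follow the overflow chain.
        for k, record in self.buckets[i] + self.overflow[i]:
            if k == key:
                return record
        return None


if __name__ == "__main__":
    f = StaticHashFile(num_buckets=4, bucket_capacity=2)
    for k in (1, 5, 9):          # all three keys hash to bucket 1
        f.insert(k, f"r{k}")
    print(f.search(9))           # found through the overflow chain
```

A search touches one bucket when there is no overflow; long overflow chains are what degrade performance when the hash function distributes keys unevenly.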

32

Hashed Files (cont.)

■ There are numerous methods for collision resolution, including the following:

» Open addressing: Proceeding from the occupied position specified by the hash address, the program checks the subsequent positions in order until an unused (empty) position is found.

» Chaining: For this method, various overflow locations are kept, usually by extending the array with a number of overflow positions. In addition, a pointer field is added to each record location. A collision is resolved by placing the new record in an unused overflow location and setting the pointer of the occupied hash address location to the address of that overflow location.

» Multiple hashing: The program applies a second hash function if the first results in a collision. If another collision results, the program uses open addressing or applies a third hash function and then uses open addressing if necessary.
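Open addressing can be sketched with linear probing as below. The table is a plain Python list, keys are assumed to be integers hashed with `key % len(table)`, and deletion is left out; these are simplifications for illustration.

```python
def oa_insert(table, key, value):
    """Insert into an open-addressing table; returns the position used."""
    m = len(table)
    start = key % m
    for step in range(m):
        pos = (start + step) % m           # check subsequent positions in order
        if table[pos] is None or table[pos][0] == key:
            table[pos] = (key, value)
            return pos
    raise RuntimeError("hash table is full")


def oa_lookup(table, key):
    """Look up a key by probing from its hash address until an empty slot."""
    m = len(table)
    start = key % m
    for step in range(m):
        pos = (start + step) % m
        if table[pos] is None:             # empty slot: the key cannot be further on
            return None
        if table[pos][0] == key:
            return table[pos][1]
    return None


if __name__ == "__main__":
    table = [None] * 5
    print(oa_insert(table, 3, "a"), oa_insert(table, 8, "b"), oa_insert(table, 13, "c"))
```

Keys 3, 8, and 13 all hash to position 3 in a table of size 5, so the second and third are placed in the next free positions, wrapping around to the start of the table.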

33


34


■ To reduce overflow records, a hash file is typically kept 70-80% full.

■ The hash function h should distribute the records uniformly among the buckets

» Otherwise, search time will be increased because many overflow records will exist.

■ Main disadvantages of static external hashing:

» Fixed number of buckets M is a problem if the number of records in the file grows or shrinks.

» Ordered access on the hash key is quite inefficient (requires sorting the records).

35

Hashed Files - Overflow Handling

36

Dynamic And Extendible Hashed Files

■ Dynamic and Extendible Hashing Techniques

» Hashing techniques are adapted to allow the dynamic growth and shrinking of the number of file records.

» These techniques include the following: dynamic hashing, extendible hashing, and linear hashing.

■ Both dynamic and extendible hashing use the binary representation of the hash value h(K) in order to access a directory.

» In dynamic hashing the directory is a binary tree.

» In extendible hashing the directory is an array of size 2^d, where d is called the global depth.

37

Dynamic And Extendible Hashing (cont.)

■ The directories can be stored on disk, and they expand or shrink dynamically.

» Directory entries point to the disk blocks that contain the stored records.

■ An insertion in a disk block that is full causes the block to split into two blocks, and the records are redistributed between the two blocks.

» The directory is updated appropriately.

■ Dynamic and extendible hashing do not require an overflow area.

■ Linear hashing does require an overflow area but does not use a directory.

» Blocks are split in linear order as the file expands.
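The directory doubling and bucket splitting of extendible hashing can be sketched as follows. This is an in-memory illustration under stated assumptions: the directory is indexed by the low-order d bits of Python's built-in `hash` (textbook presentations often use the high-order bits), buckets are dictionaries rather than disk blocks, and the class and method names are hypothetical.

```python
class Bucket:
    def __init__(self, local_depth, capacity):
        self.local_depth = local_depth   # number of hash bits this bucket distinguishes
        self.capacity = capacity
        self.items = {}


class ExtendibleHash:
    def __init__(self, bucket_capacity=2):
        self.capacity = bucket_capacity
        self.global_depth = 1            # the directory has 2**global_depth entries
        self.directory = [Bucket(1, bucket_capacity), Bucket(1, bucket_capacity)]

    def _index(self, key):
        # Use the low-order global_depth bits of the hash value.
        return hash(key) & ((1 << self.global_depth) - 1)

    def get(self, key):
        return self.directory[self._index(key)].items.get(key)

    def put(self, key, value):
        while True:
            bucket = self.directory[self._index(key)]
            if key in bucket.items or len(bucket.items) < bucket.capacity:
                bucket.items[key] = value
                return
            self._split(bucket)          # full bucket: split it and retry

    def _split(self, bucket):
        if bucket.local_depth == self.global_depth:
            # Double the directory; entries i and i + 2**d share a bucket.
            self.directory = self.directory + self.directory
            self.global_depth += 1
        bucket.local_depth += 1
        new_bucket = Bucket(bucket.local_depth, self.capacity)
        split_bit = 1 << (bucket.local_depth - 1)
        # Directory entries with the new bit set now point to the new bucket.
        for i, b in enumerate(self.directory):
            if b is bucket and (i & split_bit):
                self.directory[i] = new_bucket
        # Redistribute the records between the two buckets.
        old_items, bucket.items = bucket.items, {}
        for k, v in old_items.items():
            self.directory[self._index(k)].items[k] = v


if __name__ == "__main__":
    eh = ExtendibleHash(bucket_capacity=2)
    for k in range(16):
        eh.put(k, k * k)
    print(eh.global_depth, len(eh.directory))
```

No overflow area is used: a full bucket is always resolved by splitting, and the directory doubles only when the bucket being split already uses all d bits.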

38

Extendible Hashing

Structure of the Extendible Hashing Scheme

39

Parallelizing Disk Access using RAID Technology

■ Secondary storage technology must take steps to keep up in performance and reliability with processor technology.

■ A major advance in secondary storage technology is represented by the development of RAID, which originally stood for Redundant Arrays of Inexpensive Disks.

■ The main goal of RAID is to even out the widely different rates of performance improvement of disks against those in memory and microprocessors.

40

RAID Technology (cont.)

■ A natural solution is a large array of small independent disks acting as a single higher-performance logical disk.

■ A concept called data striping is used, which utilizes parallelism to improve disk performance.

■ Data striping distributes data transparently over multiple disks to make them appear as a single large, fast disk.
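Block-level striping can be illustrated by the mapping from a logical block number to a physical location. The round-robin layout and the function name below are assumptions for the sketch; real arrays may stripe at bit, byte, or multi-block granularity.

```python
def locate_block(logical_block, num_disks):
    """Map a logical block number to (disk number, block number on that disk)."""
    return logical_block % num_disks, logical_block // num_disks


if __name__ == "__main__":
    # With 4 disks, logical blocks 0..7 are spread round-robin over the disks.
    for b in range(8):
        print(b, locate_block(b, 4))
```

Consecutive logical blocks land on different disks, so a large sequential read can be served by all disks in parallel.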

41

RAID Technology (cont.)

■ Different RAID organizations were defined based on different combinations of two factors: the granularity of data interleaving (striping) and the pattern used to compute redundant information.

» RAID level 0 has no redundant data and hence has the best write performance, at the risk of data loss

» RAID level 1 uses mirrored disks.

» RAID level 2 uses memory-style redundancy by using Hamming codes, which contain parity bits for distinct overlapping subsets of components. Level 2 includes both error detection and correction.

» RAID level 3 uses a single parity disk, relying on the disk controller to figure out which disk has failed.

» RAID levels 4 and 5 use block-level data striping, with level 5 distributing data and parity information across all disks.

» RAID level 6 applies the so-called P + Q redundancy scheme using Reed-Solomon codes to protect against up to two disk failures by using just two redundant disks.
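For the single-parity levels, the redundant block is the bytewise XOR of the data blocks, so any one lost block can be rebuilt from the survivors. The sketch below shows this under simplifying assumptions: blocks are equal-length `bytes` objects, and the function names are hypothetical.

```python
from functools import reduce


def parity_block(blocks):
    """Bytewise XOR of equal-sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def reconstruct(surviving_blocks, parity):
    """Rebuild the single missing data block from the survivors and the parity block."""
    return parity_block(list(surviving_blocks) + [parity])


if __name__ == "__main__":
    data = [b"abcd", b"efgh", b"ijkl"]     # blocks on three data disks
    p = parity_block(data)                 # block on the parity disk
    print(reconstruct([data[0], data[2]], p))   # recovers the lost second block
```

The scheme must be told which disk failed (here, by passing only the surviving blocks); parity alone locates nothing, which matches the role of the disk controller noted for level 3.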

42

Use of RAID Technology (cont.)

■ Different RAID organizations are used in different situations

» RAID level 1 (mirrored disks) is the easiest for rebuilding a disk from other disks

• It is used for critical applications like logs

» RAID level 2 uses memory-style redundancy by using Hamming codes, which contain parity bits for distinct overlapping subsets of components.

• Level 2 includes both error detection and correction.

» RAID level 3 (single parity disk, relying on the disk controller to figure out which disk has failed) and level 5 (block-level data striping) are preferred for large volume storage, with level 3 giving higher transfer rates.

■ Most popular uses of the RAID technology currently are:

» Level 0 (with striping), Level 1 (with mirroring), and Level 5 with an extra drive for parity.

■ Design decisions for RAID include:

» Level of RAID, number of disks, choice of parity schemes, and grouping of disks for block-level striping.

43

Use of RAID Technology (cont.)

44

Storage Area Networks

■ The demand for higher storage has risen considerably in recent times.

■ Organizations have a need to move from a static, fixed, data center oriented operation to a more flexible and dynamic infrastructure for information processing.

■ Thus they are moving to a concept of Storage Area Networks (SANs).

» In a SAN, online storage peripherals are configured as nodes on a high-speed network and can be attached and detached from servers in a very flexible manner.

■ This allows storage systems to be placed at longer distances from the servers and provide different performance and connectivity options.

45

Storage Area Networks (cont.)

■ Advantages of SANs are:

» Flexible many-to-many connectivity among servers and storage devices using Fibre Channel hubs and switches.

» Up to 10 km separation between a server and a storage system using appropriate fiber optic cables.

» Better isolation capabilities allowing non-disruptive addition of new peripherals and servers.

■ SANs face the problem of combining storage options from multiple vendors and dealing with evolving standards of storage management software and hardware.

46

Summary

■ Database Design

■ Disk Storage Devices

■ Files of Records

■ Operations on Files

■ Unordered Files

■ Ordered Files

■ Hashed Files

» Dynamic and Extendible Hashing Techniques

■ RAID Technology

47


48

Agenda (1/2)

■ Processing a Query

■ Types of Single-level Ordered Indexes

» Primary Indexes

» Clustering Indexes

» Secondary Indexes

■ Multilevel Indexes

■ Best: clustered file and sparse index

■ Hashing on disk

■ Index on non-key fields

■ Secondary index

■ Use index for searches if it is likely to help

■ SQL support

■ Bitmap index

■ Need to know how the system processes queries

■ How to use indexes for some queries

■ How to process some queries

■ Merge Join

■ Hash Join

■ Order of joins

49

Agenda (2/2)

■ Cutting down relations before joining them

■ Logical files: records

■ Physical files: blocks

■ Cost model: number of block accesses

■ File organization

■ Indexes

■ Hashing

■ 2-3 trees for range queries

■ Dense index

■ Sparse index

■ Clustered file

■ Unclustered file

■ Dynamic Multilevel Indexes Using B-Trees and B+-Trees

■ B+ trees

■ Optimal parameter for B+ tree depending on disk and key size parameters

■ Indexes on Multiple Keys

50

Processing A Query

■ Simple query:

SELECT E#
FROM R
WHERE SALARY > 1500;

■ What needs to be done “under the hood” by the file system:

» Read into RAM at least all the blocks containing all records satisfying the condition (assuming none are in RAM)

• It may be necessary/useful to read other blocks too, as we see later

» Get the relevant information from the blocks

» Perform additional processing to produce the answer to the query

■ What is the cost of this?

■ We need a “cost model”

51

Cost Model

■ Two-level storage hierarchy

» RAM

» Disk

■ Reading or writing a block costs one time unit

■ Processing in RAM is free

■ Ignore caching of blocks (unless done previously by the query itself, as the byproduct of reading)

■ Justifying the assumptions

» Accessing the disk is much more expensive than any reasonable CPU processing of queries (we could be more precise, but are not here)

» We do not want to account formally for block contiguity on the disk; we do not know how to do it well in general (and in fact what the OS thinks is contiguous may not be so: the disk controller can override what the OS thinks)

» We do not want to account for a query using cache slots “filled” by another query; we do not know how to do it well, as we do not know in which order queries come

52

Implications of the Cost Model

■ Goal: Minimize the number of block accesses

■ A good heuristic: Organize the physical database so that you make as much use as possible of any block you read/write

■ Note the difference between the cost models in

» Data structures (where an “analysis of algorithms” type of cost model is used: minimize CPU processing)

» Databases (minimize disk accesses)

■ There are serious implications to this difference

■ Database physical (disk) design is obtained by extending and “fine-tuning” “classical” data structures

53

Example

■ If you know exactly where E# = 2 and E# = 9 are:

■ The data structure cost model gives a cost of 2 (2 RAM accesses)

■ The database cost model gives a cost of 2 (2 block accesses)

54

Example

■ If you know exactly where E# = 2 and E# = 4 are:

■ The data structure cost model gives a cost of 2 (2 RAM accesses)

■ The database cost model gives a cost of 1 (1 block access)
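The two examples can be checked with a few lines of code. The sketch assumes the physical layout from the earlier storage example, in which records 6 and 9 share one block, records 1, 3, and 8 share another, and records 4 and 2 share a third; the block numbers 0 to 2 and the function names are arbitrary labels chosen here.

```python
# Assumed mapping from E# to the block holding that record (block numbers are arbitrary).
BLOCK_OF = {6: 0, 9: 0, 1: 1, 3: 1, 8: 1, 4: 2, 2: 2}


def ram_cost(keys):
    """Data structure cost model: one RAM access per record looked at."""
    return len(keys)


def block_cost(keys):
    """Database cost model: one unit per DISTINCT block that must be read."""
    return len({BLOCK_OF[k] for k in keys})


if __name__ == "__main__":
    print(ram_cost([2, 9]), block_cost([2, 9]))   # records in different blocks
    print(ram_cost([2, 4]), block_cost([2, 4]))   # records in the same block
```

Under the database cost model, fetching two records that share a block costs the same as fetching one, which is why physical placement matters.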

55

File Organization and Indexes

■ If we know what we will generally be reading/writing, we can try to minimize the number of block accesses for “frequent” queries

■ Tools:

» File organization

» Indexes (structures showing where records are located)

■ Oversimplifying: File organization tries to provide:

» When you read a block you get “many” useful records

■ Oversimplifying: Indexes try to provide:

» You know where blocks containing useful records are

■ We discuss both in this unit, not clearly separating them

56

File Organization (1/5)

■ The tuples of a relation are typically stored as records in a file on a secondary storage device (e.g., a moving head disk)

■ Unless attention is paid to the way the records in the file are organized, performance of the system is likely to suffer

■ There are three file organizations to consider:

» Heap files, which contain a collection of records in random order.

» Sorted files, in which the records are arranged according to some sort criteria.

» Hashed files, in which the records are organized according to a hash function on some information in the record.

57


■ The physical layer of a DBMS allocates data in a file in units called pages, which usually comprise several blocks of disk storage.

» Typical page sizes are 4 or 8 kilobytes

■ Information is read or retrieved from the file in units of one or more pages to minimize the number of disk accesses

» Because of this paging concept, a blocking factor Bf of records per page is defined as the size of the page P divided by the size of the record R: Bf = FLOOR(P/R)

58


■ The time it takes to perform an operation on a file is proportional to the number of accesses made to the file.

» The number of accesses made to the file is in turn based upon the organization of the records in the file relative to the operation on the data.

■ It is necessary to compare these three file organizations against the following five categorical operations:

» Scanning: retrieval of all of the records in the file. For example, find the record having account number A-110.

» Search for exact match: find all records which have a specific value in their data. For example, find all accounts at the Rocky Point branch.

» Search within range: find all records that satisfy a range condition. For example, find all accounts with balances between $400 and $800.

» Insert a new record: add a record to some page in the file.

» Delete a record: remove a record from some page in the file.

59


■ Depending on which of the three file organizations is being used, the cost in disk accesses for each of these five operations could be quite high

» To see how high, we must calculate the number of accesses required to perform the operation

• Given a blocking factor Bf and number of records M, the number of pages in the file N is N = CEIL(M/Bf)

■ If D is the typical disk access time for a system, and the system retrieves one page per access, then the time to access the file for each of these operations for a file of M records is given below

Cost of Operations by File Organization

          Heap      Sorted*            Hash**
Scan      ND        ND                 1.25ND
Match     0.5ND     Log2(N)D           D
Range     ND        Log2(N)D + mD      1.25ND
Insert    2D        Log2(N)D + ND      2D
Delete    (N+1)D    Log2(N)D + ND      2D

m is the number of pages containing records which satisfy the range condition.

* It is assumed that a binary search can be performed on the sorted file and that the records are sorted on the search attribute.

** Assuming an 80% fill factor and that the hash is on the search attribute.
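The formulas Bf = FLOOR(P/R) and N = CEIL(M/Bf), together with the Match row of the table, can be evaluated as below. The page size, record size, and record count are made-up inputs, and the function names are hypothetical.

```python
import math


def blocking_factor(page_size, record_size):
    """Bf = FLOOR(P / R): whole records that fit in one page."""
    return page_size // record_size


def num_pages(num_records, bf):
    """N = CEIL(M / Bf): pages needed to hold M records."""
    return math.ceil(num_records / bf)


def match_cost(organization, n_pages, d=1.0):
    """Exact-match search cost from the table, in units of the access time D."""
    if organization == "heap":
        return 0.5 * n_pages * d
    if organization == "sorted":
        return math.log2(n_pages) * d
    if organization == "hash":
        return d
    raise ValueError(f"unknown organization: {organization}")


if __name__ == "__main__":
    bf = blocking_factor(4096, 100)        # hypothetical 4 KB pages, 100-byte records
    n = num_pages(1_000_000, bf)           # hypothetical file of one million records
    for org in ("heap", "sorted", "hash"):
        print(org, match_cost(org, n))
```

For these inputs the heap file needs thousands of accesses for an exact match, the sorted file about fifteen, and the hashed file one, which is the spread the table is meant to convey.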

60


■ The noted cost factors are not always worst case

» In the sorted file case, for example, the time to perform a Match operation balloons to ND when a binary search cannot be used

■ Examining the table in the previous slide gives a first indication that none of the three organizations is good for all of the operations, and some are worse than others

» What is desirable is an organization that offers scan and search (Match and Range operations) costs similar to those of the sorted file organization, while providing the low cost for inserts and deletes that the hash method affords

61

Introduction to Indexes (1/2)

■ Indexing is a concept central to efficient processing of queries and updates of databases

» Indexing usually has an impact on the time it takes to process a query

■ It is important to know the following about indexing

» Important properties of indexing schemes

» Model to use to evaluate the relative costs of index operations

• Various indexing schemes have different relative costs.

62

Introduction to Indexes (2/2)

■ It is important to know the following about indexing (continued)

» The role that indexing plays in developing optimized query plans (in particular when it comes to the join operation)

• There are several methods for executing joins based on indexing, and the alternative methods can be compared using the cost model mentioned above

» Expected results of executing queries against a database that supports index nested loop joins

• This helps in understanding the effect that indexing can have on the performance of queries; the observed results should be compared to the cost model

63

Indexes as Access Paths

■ A single-level index is an auxiliary file that makes it more efficient to search for a record in the data file.

■ The index is usually specified on one field of the file (although it could be specified on several fields)

■ One form of an index is a file of entries <field value, pointer to record>, which is ordered by field value

■ The index is called an access path on the field.

64

Indexes

■ An index is a structure which is designed to improve access to desired information over that provided by the three basic file organizations

» The members of an index structure are records which contain a search key (k) and a record ID (rid), which is a pointer to a record in a data file

» The search key is an attribute or set of attributes from the relation which is used to look up the record in the table

■ Before discussing specific indexing methods, it is important to understand some properties which affect the potential effectiveness of an index

» There are a number of ways to organize records in an index, each of which has properties that make it more or less suitable as an indexing method for a given database, depending on how the data in the database is used

» To aid understanding of how these properties can affect the efficiency of the indexing method, it is useful to define some characteristics of databases and how they are accessed.

65

Index Properties - Clustering

■ When the records in a data file are in the same order as, or a similar order to, the entries in the index file, the index is said to be clustered

» There may be at most one clustered index for a given data file, but more than one un-clustered index

» A data file that is clustered on an index is not necessarily maintained as a sorted file, even though it may begin life that way, since maintaining the sort is expensive in the face of frequent inserts and deletes

» The benefit of a clustered index is evident when performing range queries, since the index entries point to records that are distributed across the smallest number of pages.

■ It is also possible to create a database file which is clustered on an attribute of the relation without creating an index for the attribute

» Furthermore, some DBMSs allow attributes from different files to be clustered (inter-file clustering), which is most useful when the attributes are frequently retrieved in the same query

66

Index Properties – Density (1/3)

■ The first index property to consider is Density

» Indexes are said to be dense if there is an entry in the index for each record in the data file (i.e., each entry in the index points to one and only one record)

67


■ An index is sparse (not dense) when there is a one-to-many relationship between index entries and records in the data file

» In a sparse index, one key entry points to a page of data records

■ Sparse indexes rely on an ordering of records in the data file

68


■ Depending on the blocking factor, a sparse index will occupy less space in the file system than a dense index

» By reducing the physical size of the index, the number of accesses to process the index is reduced

» But, since sparse indexes require a sorted data file, the cost of maintaining the order of the data file will offset any saving on index processing in the face of frequent inserts and deletes

■ A sparse index requires that the data file be clustered on the index attribute, and hence there may be at most one sparse index for a relation
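A sparse index lookup can be sketched as follows: the index holds one key per page (here, the first key on the page), and a search first locates the page and then scans inside it. The data layout and function names are assumptions for illustration, and the standard-library `bisect` module stands in for the binary search on the index.

```python
import bisect


def build_sparse_index(pages):
    """One index entry per page: the key of the first record on that page."""
    return [page[0][0] for page in pages]


def sparse_lookup(pages, index, key):
    """Find the page whose first key is the largest one <= key, then scan that page."""
    pos = bisect.bisect_right(index, key) - 1
    if pos < 0:                      # key is smaller than every key in the file
        return None
    for k, record in pages[pos]:     # one data page access
        if k == key:
            return record
    return None


if __name__ == "__main__":
    # The data file must be sorted on the indexed attribute.
    pages = [[(1, "a"), (3, "b")], [(5, "c"), (8, "d")], [(9, "e")]]
    index = build_sparse_index(pages)
    print(index, sparse_lookup(pages, index, 8))
```

The index has one entry per page rather than one per record, which is why it is smaller than a dense index, and also why it only works when the data file is ordered on the indexed attribute.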

69

Index Properties – Primary vs. Secondary Indexes

■ If an index includes the primary key for a relation, then the index is referred to as a primary index

■ There may be only one primary index for a given relation

■ All indexes that do not include the primary key are secondary indexes

■ By definition, a primary index may not contain duplicates, since the search key contains a candidate key

70

Index Properties – Composite Keys

■ An index may be created using a composition of several attributes in a relation if queries frequently combine the attributes concerned

■ If the index key is (<A1>, <A2>), the ordering of the index is constrained to the order of the first attribute <A1> in the set

■ If it is necessary to retrieve records by <A2> frequently, then a second index on <A2> may be required

71

Index Properties – Index Methods

■ There are a number of index methods in use today, and each has characteristics which make it more or less suitable for a given workload

■ The following are the two most common index methods employed by DBMSs

» B+Trees

» Hash Indexes

■ Tree Structured Indexes

» Perhaps the most widely used index methods employ some form of tree structure and associated search, insert, delete, and iterate algorithms

» While no single structure is best for every application, the B-Tree and a variant, the B+Tree, are perhaps the best choice for reasonable performance

72

Indexes as Access Paths (cont.)

� The index file usually occupies considerably fewer disk blocks than the data file because its entries are much smaller

� A binary search on the index yields a pointer to the file record

� Indexes can also be characterized as dense or sparse

» A dense index has an index entry for every search key value (and hence every record) in the data file.

» A sparse (or nondense) index, on the other hand, has index entries for only some of the search values

73

Indexes as Access Paths (cont.)

� Example: Given the following data file EMPLOYEE(NAME, SSN, ADDRESS, JOB, SAL, ... )

� Suppose that:

» record size R = 150 bytes

» block size B = 512 bytes

» number of records r = 30000

� Then, we get:

» blocking factor Bfr = B div R = 512 div 150 = 3 records/block

» number of file blocks b = (r/Bfr) = (30000/3) = 10000 blocks

� For an index on the SSN field, assume the field size VSSN = 9 bytes and the record pointer size PR = 7 bytes. Then:

» index entry size RI = (VSSN + PR) = (9 + 7) = 16 bytes

» index blocking factor BfrI = B div RI = 512 div 16 = 32 entries/block

» number of index blocks bI = ⌈r/BfrI⌉ = ⌈30000/32⌉ = 938 blocks

» binary search on the index needs ⌈log2 bI⌉ = ⌈log2 938⌉ = 10 block accesses

» This is compared to an average linear search cost on the data file of:

• (b/2) = 10000/2 = 5000 block accesses

» If the file records are ordered, the binary search cost on the data file would be:

• ⌈log2 b⌉ = ⌈log2 10000⌉ = 14 block accesses
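The arithmetic of this example can be checked with a short script. This is only a sketch: the function name and the dictionary keys are made up for illustration, and the search costs are counted in block accesses, with the linear and binary search on the data file measured over its b blocks.

```python
import math

def index_costs(r, R, B, key_size, ptr_size):
    """Block counts and search costs for a data file and a dense single-level index."""
    bfr = B // R                      # records per data block (blocking factor)
    b = math.ceil(r / bfr)            # number of data file blocks
    entry = key_size + ptr_size       # bytes per index entry
    bfr_i = B // entry                # index entries per block
    b_i = math.ceil(r / bfr_i)        # number of index blocks
    return {
        "bfr": bfr,
        "b": b,
        "bfr_i": bfr_i,
        "b_i": b_i,
        "index_binary_search": math.ceil(math.log2(b_i)),
        "file_linear_search_avg": b // 2,
        "file_binary_search": math.ceil(math.log2(b)),
    }

# EMPLOYEE example: R = 150, B = 512, r = 30000, SSN = 9 bytes, pointer = 7 bytes
costs = index_costs(r=30000, R=150, B=512, key_size=9, ptr_size=7)
print(costs)
```

Changing B or the pointer size shows how quickly the index shrinks relative to the data file.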

74

Types of Single-Level Indexes

� Primary Index

» Defined on an ordered data file

» The data file is ordered on a key field

» Includes one index entry for each block in the data file; the index entry has the key field value for the first record in the block, which is called the block anchor

» A similar scheme can use the last record in a block.

» A primary index is a nondense (sparse) index, since it includes an entry for each disk block of the data file and the keys of its anchor record rather than for every search value.

75

Primary Index on the ordering Key Field

76


� Clustering Index

» Defined on an ordered data file

» The data file is ordered on a non-key field, unlike a primary index, which requires that the ordering field of the data file have a distinct value for each record.

» Includes one index entry for each distinct value of the field; the index entry points to the first data block that contains records with that field value.

» It is another example of a nondense index; insertion and deletion are relatively straightforward with a clustering index.

77

Clustering Index Example (1/2)

78

Clustering Index Example (2/2)

79


� Secondary Index

» A secondary index provides a secondary means of accessing a file for which some primary access already exists.

» The secondary index may be on a field which is a candidate key and has a unique value in every record, or a non-key with duplicate values.

» The index is an ordered file with two fields.

• The first field is of the same data type as some non-ordering field of the data file that is an indexing field.

• The second field is either a block pointer or a record pointer.

• There can be many secondary indexes (and hence, indexing fields) for the same file.

» Includes one entry for each record in the data file; hence, it is a dense index

80

Example of a Dense Secondary Index

81

Sample Secondary Index on a Non-Key Field

82

Properties of Index Types

83

Multi-Level Indexes

� Because a single-level index is an ordered file, we can create a primary index to the index itself;

» In this case, the original index file is called the first-level index and the index to the index is called the second-level index.

� We can repeat the process, creating a third, fourth, ..., top level until all entries of the top level fit in one disk block

� A multi-level index can be created for any type of first-level index (primary, secondary, clustering) as long as the first-level index consists of more than one disk block
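As a rough illustration of how quickly the levels shrink, the sketch below counts the blocks at each level. It assumes the dense SSN index from the earlier example (30000 first-level entries, 32 entries per index block); the function name is hypothetical.

```python
import math

def index_levels(first_level_entries, fan_out):
    """Return the number of blocks at each index level, first level first."""
    levels = []
    entries = first_level_entries
    while True:
        blocks = math.ceil(entries / fan_out)
        levels.append(blocks)
        if blocks == 1:          # the top level fits in one disk block
            break
        entries = blocks         # the next level has one entry per block of this level
    return levels

levels = index_levels(30000, 32)
print(levels, "-> block accesses to reach the first level:", len(levels))
```

Under these assumptions a search reads one block per level, plus one more block access to fetch the data record.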

84

Sample Two-Level Primary Index

85

Multi-Level Indexes

� Such a multi-level index is a form of search tree

» However, insertion and deletion of index entries is a severe problem because every level of the index is an ordered file.

86

Tradeoff

� Maintaining file organization and indexes is not free

� Changing (deleting, inserting, updating) the database requires

» Maintaining the file organization

» Updating the indexes

� Extreme case: database is used only for SELECT queries

» The “better” the file organization is and the more indexes we have, the more efficient query processing will be

� Extreme case: database is used only for INSERT queries

» The simpler the file organization and the fewer the indexes (ideally none), the more efficient processing will be

» Perhaps just append new records to the end of the file

� In general, somewhere in between

87

Review Of Data Structures To Store N Numbers

� Heap: unsorted sequence (note difference from a common use of the term “heap” in data structures)

� Sorted sequence

� Hashing

� 2-3 trees

88

Heap (Assume Contiguous Storage)

� Finding (including detecting of non-membership)

Takes between 1 and N operations

� Deleting

Takes between 1 and N operations

Depends on the variant also

» If the heap needs to be “compacted” it will always take N (first to reach the record, then to move the “tail” to close the gap)

» If a “null” value can be put instead of a “real” value, then it will cause the heap to grow unnecessarily

� Inserting

Takes 1 (put in front), or N (put in back if you cannot access the back easily, otherwise also 1), or maybe in between by reusing null values

� Linked list: obvious modifications

89

Sorted Sequence

� Finding (including detecting of non-membership)

log N using binary search. But note that “optimized” deletions and insertions could cause this to grow (next transparency)

� Deleting

Takes between log N and log N + N operations. Find the integer, remove and compact the sequence.

Depends on the variant also. For instance, if a “null” value can be put instead of a “real” value, then it will cause the sequence to grow unnecessarily.

90

Sorted Sequence

� Inserting

Takes between log N and log N + N operations. Find the place, insert the integer, and push the tail of the sequence.

Depends on the variant also. For instance, if “overflow” values can be put in the middle by means of a linked list, then this causes the “binary search” to be unbalanced, resulting in possibly up to N operations for a Find.

[Figure: the sorted sequence 2 5 7 9, and the same sequence with 3 inserted as an "overflow"]
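The costs for a sorted sequence kept in contiguous storage can be illustrated with Python's standard `bisect` module. This is a minimal sketch of the "compacting" variant only (no null markers, no overflow lists); the function names are illustrative.

```python
import bisect

def find(seq, x):
    """Binary search: about log N comparisons."""
    i = bisect.bisect_left(seq, x)
    return i < len(seq) and seq[i] == x

def insert(seq, x):
    """Find the place (log N), then push the tail of the sequence (up to N moves)."""
    bisect.insort(seq, x)

def delete(seq, x):
    """Find the integer (log N), then remove it and compact the sequence (up to N moves)."""
    i = bisect.bisect_left(seq, x)
    if i < len(seq) and seq[i] == x:
        del seq[i]
        return True
    return False

seq = [2, 5, 7, 9]
insert(seq, 3)        # 3 lands between 2 and 5; the tail 5, 7, 9 is shifted
print(seq)
```

Because the sequence stays strictly sorted here, Find never degrades the way it can with the overflow variant.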

91

Hashing

� Pick a number B “somewhat” bigger than N

� Pick a “good” pseudo-random function h

h: integers → {0, 1, ..., B – 1}

� Create a “bucket directory,” D, a vector of length B, indexed 0,1, ..., B – 1

� Each integer k will be stored in a location pointed at from location D[h(k)]; if more than one integer hashes to the same location D[h(k)], create a linked list of locations “hanging” from this D[h(k)]

� Probabilistically, almost always, most of the locations D[h(k)] will be pointing at a linked list of length 1 only

92

Hashing: Example Of Insertion

N = 7

B = 10

h(k) = k mod B (this is an extremely bad h, but good for a simple example)

Integers arriving in order:

37, 55, 21, 47, 35, 27, 14
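The insertion sequence can be traced with a small sketch. Python lists stand in for the linked lists, and appending new keys at the end of a chain is an arbitrary choice made here (inserting at the front would work equally well).

```python
def build_directory(keys, B):
    """Bucket directory D of length B; each slot holds a chain (list) of keys."""
    D = [[] for _ in range(B)]
    for k in keys:
        D[k % B].append(k)      # h(k) = k mod B, as in the example
    return D

def find(D, k):
    """Look only at the chain hanging from D[h(k)]."""
    return k in D[k % len(D)]

D = build_directory([37, 55, 21, 47, 35, 27, 14], B=10)
for slot, chain in enumerate(D):
    print(slot, chain)
```

With this h, the keys 37, 47 and 27 all collide in slot 7, which shows why h(k) = k mod 10 is a poor hash function for such data.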

93


[Figure: bucket directory D[0..9], shown empty and then after inserting 37, 55, and 21 in turn]

94


[Figure: bucket directory D[0..9] after inserting 47 and then 35]

95


[Figure: bucket directory D[0..9] after inserting 27 and then 14]

96

Hashing

� Assume, computing h is “free”

� Finding (including detecting of non-membership)

Takes between 1 and N + 1 operations.

Worst case, there is a single linked list of all the integers from a single bucket.

On average, between 1 (look at bucket, find nothing) and a little more than 2 (look at bucket, go to the first element on the list and, with very low probability, continue beyond the first element)

97

Hashing

� Inserting

Obvious modifications of “Finding”

But sometimes N is “too close” to B (bucket table becomes too small)

» Then, increase the size of the bucket table and rehash

» Number of operations linear in N

» Can amortize across all accesses (ignore, if you do not know what this means)

� Deleting

Obvious modification of “Finding”

Sometimes bucket table becomes too large, act “opposite” to Inserting

98

2-3 Tree (Example)

99

2-3 Trees

� A 2-3 tree is a rooted (it has a root) directed (order of children matters) tree

� All paths from root to leaves are of same length

� Each node (other than leaves) has between 2 and 3 children

� For each child of a node, other than the last, there is an index value stored in the node

� For each non-leaf node, the index value indicates the largest value of the leaf in the subtree rooted at the left of the index value.

� A leaf has between 2 and 3 values from among the integers to be stored

100

2-3 Trees

� It is possible to maintain the “structural characteristics” above while inserting and deleting integers

� Sometimes for insertion or deletion of integers there is no need to insert or delete a node

» E.g., inserting 19

» E.g., deleting 45

� Sometimes for insertion or deletion of integers it is necessary to insert or delete nodes (and thus restructure the tree)

» E.g., inserting 88,89,97

» E.g., deleting 40, 45

� Each such restructuring operation takes time at most linear in the number of levels of the tree (which, if there are N leaves, is between log_3 N and log_2 N; so we write: O(log N))

� We show by example of an insertion

101

Insertion Of A Node In The Right Place

First example: Insertion resolved at the lowest level

102

Insertion Of A Node In The Right Place

Second example: Insertion propagates up to the creation of a new root

103

2-3 Trees

� Finding (including detecting of non-membership)

Takes O(log N) operations

� Deleting

Takes O(log N) operations (we did not show)

� Inserting

Takes O(log N) operations

104

What To Use?

� If the set of integers is large, use either hashing or 2-3 trees

� Use 2-3 trees if “many” of your queries are range queries

� Find all the SSNs (integers) in your set that lie in the range 070430000 to 070439999

� Use hashing if “many” of your queries are not range queries

� If you have a total of 10,000 integers randomly chosen from the set 0 ,..., 999999999, how many will fall in the range above?

� How will you find the answer using hashing, and how will you find the answer using 2-3 trees?
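One way to think about these two questions is sketched below. A sorted Python list stands in for the ordered leaf level of a 2-3 tree and a Python set stands in for the hash table; both stand-ins, and the function names, are assumptions made for illustration.

```python
import bisect

LO, HI = 70430000, 70439999   # the SSN range from the slide, written as integers

def range_with_hashing(hash_set):
    """A hash table keeps no order, so every value in the range must be probed."""
    return [k for k in range(LO, HI + 1) if k in hash_set]

def range_with_sorted_index(sorted_keys):
    """An ordered structure: locate LO once, then read neighbours up to HI."""
    start = bisect.bisect_left(sorted_keys, LO)
    end = bisect.bisect_right(sorted_keys, HI)
    return sorted_keys[start:end]

# Expected number of hits: 10,000 keys, each landing in the range with
# probability (size of range) / (size of key space).
expected_hits = 10_000 * (HI - LO + 1) / 1_000_000_000

keys = [5, 70430001, 70435555, 999999999]   # a tiny illustrative key set
print(expected_hits)
print(range_with_hashing(set(keys)), range_with_sorted_index(sorted(keys)))
```

With hashing, the query costs 10,000 probes (or a scan of all stored keys) to find, on average, about 0.1 matching keys, while the ordered structure needs one descent to the start of the range.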

105

Index And File

� In general, we have a data file of blocks, with blocks containing records

� Each record has a field, which contains a key, uniquely labeling/identifying the record and the “rest” of the record

� There is an index, which is another file which essentially (we will be more precise later) consists of records which are pairs of the form (key,block address)

� For each (key,block address) pair in the index file, the block address contains a pointer to a block of the file that contains the record with that key

106

Index And File

� Above, a single block of the index file

� Below, a single block of the data file

� The index record tells us in which data block some record is

107

Dense And Sparse

� An index is dense if for every key appearing in (some record of) the data file, a dedicated pointer to the block containing the record appears in (some record of) the index file

� Otherwise, it is sparse
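The difference can be made concrete with a toy file. In this sketch the data file is a list of sorted blocks, the dense index has one entry per key, and the sparse index has one entry per block, using the first key of each block as its anchor (one possible convention; the largest key of the block could be used instead). All names are hypothetical.

```python
import bisect

# Hypothetical data file: sorted blocks, at most two records (keys) per block.
blocks = [[1, 3], [5, 7], [9]]

dense = {key: b for b, blk in enumerate(blocks) for key in blk}   # one entry per key
sparse_keys = [blk[0] for blk in blocks]                          # one entry per block

def lookup_sparse(key):
    """Find the last anchor <= key, then search inside that block."""
    b = bisect.bisect_right(sparse_keys, key) - 1
    if b >= 0 and key in blocks[b]:
        return b
    return None

print(len(dense), "dense entries versus", len(sparse_keys), "sparse entries")
print(dense[7], lookup_sparse(7), lookup_sparse(4))
```

The sparse lookup only works because the blocks are kept in key order; with a dense index that assumption is not needed.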

108

Clustered And Unclustered

� A data file is clustered if whenever some block B contains records with key values x and y, such that x < y, and key value z, such that x < z < y, appears anywhere in the data file, then it must appear in block B

� You can think of a clustered file as if the file were first sorted with contiguous, not necessarily full blocks and then the blocks were dispersed

� Otherwise, it is unclustered
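The definition translates almost directly into a check. The sketch below assumes unique keys and represents the file as a list of blocks, each a list of key values; the function name is made up for illustration.

```python
def is_clustered(blocks):
    """True if no key stored elsewhere falls strictly between two keys of one block."""
    for i, blk in enumerate(blocks):
        if not blk:
            continue
        lo, hi = min(blk), max(blk)
        for j, other in enumerate(blocks):
            if i != j and any(lo < z < hi for z in other):
                return False
    return True

print(is_clustered([[5, 7], [1, 3], [9]]))   # sorted blocks that were dispersed
print(is_clustered([[1, 7], [3, 5], [9]]))   # 3 and 5 lie between 1 and 7
```

Note that the first file counts as clustered even though its blocks are not stored in key order, matching the "sorted, then dispersed" picture.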

109

Dense Clustered Index (Dense Index And Clustered File)

� To simplify, we stripped out the rest of the records in data file

110

Sparse Clustered Index (Sparse Index And Clustered File)


111

Dense Unclustered Index (Dense Index And Unclustered File)


112

Sparse Unclustered Index (Sparse Index and Unclustered File)


113

2-3 Trees Revisited

� We will now consider a file of records of interest

� Each record has one field, which serves as primary key, which will be an integer

� 2 records can fit in a block

� Sometimes a block will contain only 1 record

� We will consider 2-3 trees as indexes pointing at the records of this file

� The file contains records with indexes: 1, 3, 5, 7, 9

114

Dense Index And Unclustered File

� Pointers point from the leaf level of the tree to blocks (not records) of the file

» Multiple pointers point at a block

� For each value of the key there is a pointer for such value

» Sometimes the value is explicit, such as 1

» Sometimes the value is implicit, such as 5 (there is one value between 3 and 7 and the pointer is the 3rd pointer coming out from the leftmost leaf and it points to a block which contains the record with the key value of 5)

115

Sparse Index And Clustered File

� Pointers point from the leaf level of the tree to blocks (not records) of the file

» A single pointer points at a block

� For each value of the key that is the largest in a block there is a pointer for such value

» Sometimes the key value is explicit, such as 3

» Sometimes the key value is implicit, such as 5

� Because of clustering we know where 1 is

116

“Quality” Of Choices In General

� Sparse index and unclustered file: generally bad, cannot easily find records for some keys

� Dense index and clustered file: generally unnecessarily large index (we will learn later why)

� Dense index and unclustered file: good, can easily find the record for every key

� Sparse index and clustered file: best (we will learn later why)

117

Search Trees

� We will study in greater depth search trees for finding records in a file

� They will be a generalization of 2-3 trees

� For very small files, the whole tree could be just the root

» For example, in the case of 2-3 trees, a file of 1 record, for instance

� So to make the presentation simple, we will assume that there is at least one level in the tree below the root

» The case of the tree having only the root is very simple and can be ignored here

118

B+ Tree (1/5)

� A B+ tree is a balanced tree structure where the leaf nodes contain the data entries and the nodes above contain entries which direct the search

» A distinction is made in a B+ tree between the entries in leaf nodes and entries in the non-leaf nodes

» The entries in the leaf nodes contain pointers to the data records. The leaf level is a dense index

� The non-leaf nodes form a sparse index on the leaf level

» The leaf nodes are frequently referred to as the sequence set, and the non-leaf nodes are referred to as the index set

» Unlike a standard B tree, key values may appear more than once in the tree

» Furthermore, the key values stored in the index set need not be actual keys themselves, but are rather "guideposts" which direct the downward flow left or right

119

B+ Tree (2/5)

� Sample B+ Tree of Height 3

120

B+ Tree (3/5)

� The tree is comprised of nodes containing key values and tree node pointers, as shown in the node format below

� Note that there are n pointers (P1, ..., Pn) and n-1 key values (K1, ..., Kn-1). The relationship between the key values and pointers is that keys in the page pointed to by Pi are lower in value than key Ki, and keys in the page pointed to by Pi+1 are greater than or equal in value to Ki, for i = 1, ..., n-1.

P1 K1 P2 . . . Pn-1 Kn-1 Pn

121

B+ Tree (4/5)

� The height of a B+ Tree is the number of levels in the tree from the root to the leaf layer

» The order of a tree is the measure of the number of entries (number of children) that a node may have

� Each node in a tree of order d that is neither a leaf node nor the root node contains m entries, where d ≤ m ≤ 2d

» Leaf nodes hold m entries, where (d-1)/2 ≤ m ≤ d-1, and the root node contains between 1 and 2d entries

� The fan-out (F) of a tree is the number of downward pointers a node holds and is usually somewhere between 50% and 67% of the maximum number of entries per node, 2d

» It's really the height of the tree that is of interest, since it determines the number of accesses required to traverse the tree from the root to the target leaf node

» To understand the relationship between the height of a tree and the order of a tree, assume a tree has an order of 100 (which is not unusual)

• Then, assuming an average fill factor of 67%, the fan-out is about 133 (2/3 x 2 x 100)

122

B+ Tree (5/5)

� The height (h) of a balanced tree is CEIL(log_F(N)), where N is the number of leaf pages and F is the fan-out

» Then N = F^h

� In the example, with a tree height of only 3, 133^3 = 2,352,637 key values can be stored

» More important, it requires only 134 pages of memory (1 + 133) to hold all of the nodes for level 1 (the root node) and level 2 in memory, which means that only one disk access is required to locate the target index node!

» If a page size of 8 kilobytes is assumed, little more than one megabyte of memory is required
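The numbers in this example can be reproduced with a few lines. This is a sketch under the slide's assumptions (order 100, fill factor of about two thirds, 8-kilobyte pages); the function names are illustrative, and the height is computed with integer arithmetic to avoid floating-point logarithm rounding.

```python
def fan_out(order, fill=2 / 3):
    """Average number of child pointers per node for a given order and fill factor."""
    return int(fill * 2 * order)

def height(n_leaf_pages, F):
    """Smallest h such that F**h >= n_leaf_pages."""
    h, capacity = 0, 1
    while capacity < n_leaf_pages:
        capacity *= F
        h += 1
    return h

F = fan_out(100)                 # fan-out for a tree of order 100
capacity_h3 = F ** 3             # what a tree of height 3 can address
cached_pages = 1 + F             # the root plus all level-2 nodes
cached_kib = cached_pages * 8    # memory needed with 8-kilobyte pages

print(F, capacity_h3, height(capacity_h3, F), cached_pages, cached_kib)
```

Keeping the top two levels in memory costs about one megabyte here, which is why only the last level normally needs a disk access.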

123

B+-Trees: Generalization Of 2-3 Trees

� A B+-tree is a tree satisfying the conditions

» It is rooted (it has a root)

» It is directed (order of children matters)

» All paths from root to leaves are of same length

» For some parameter m:

• All internal (not root and not leaves) nodes have between ceiling of m/2 and m children

• The root has between 2 and m children

� We cannot, in general, avoid the case of the root having only 2 children; we will soon see why

124

B+-Trees: Generalization Of 2-3 Trees

� Each node consists of a sequence (P is address or pointer, I is index or key):

P1, I1, P2, I2, ..., Pm-1, Im-1, Pm

� Ij’s form an increasing sequence.

� Ij is the largest key value in the leaves in the subtree pointed by Pj

»Note, others may have slightly different conventions

125

A Node in a Search Tree with Pointers to Subtrees Below It

126

Sample Search Tree

127

Dynamic Multilevel Indexes Using B-Trees and B+-Trees

� Most multi-level indexes use B-tree or B+-tree data structures because of the insertion and deletion problem

» These structures leave space in each tree node (disk block) to allow for new index entries

� These data structures are variations of search trees that allow efficient insertion and deletion of search values.

� In B-Tree and B+-Tree data structures, each node corresponds to a disk block

� Each node is kept between half-full and completely full

128

Dynamic Multilevel Indexes Using B-Trees and B+-Trees (cont.)

� An insertion into a node that is not full is quite efficient

» If a node is full the insertion causes a split into two nodes

� Splitting may propagate to other tree levels

� A deletion is quite efficient if a node does not become less than half full

� If a deletion causes a node to become less than half full, it must be merged with neighboring nodes

129

Difference between B-tree and B+-tree (1/3)

� A B-Tree is a tree structure similar to the B+ Tree, but the B-Tree permits search key values to appear only once in the tree, whereas B+ trees permit redundancy to allow for a dense index at the leaf level

»By contrast, the leaf level of a B-Tree is a sparse index

»Because search keys are not duplicated, an extra pointer member must be included in each index entry to reference the target data file page

130


� In a B-tree, pointers to data records exist at all levels of the tree

� In a B+-tree, all pointers to data records exist at the leaf-level nodes

� A B+-tree can have fewer levels (or higher capacity of search values) than the corresponding B-tree

131


� A B-Tree has two advantages over a B+ Tree:

» Since the data pointer is stored with each index entry, the target page number can sometimes be acquired before descending to the leaf level

» In some instances, fewer tree nodes are required

� The disadvantages of a B-Tree compared to a B+ Tree are:

» A relatively small number of search key values are actually found without accessing nodes at the leaf level

» Because non-leaf nodes are larger, fan-out is reduced, potentially increasing the height of the tree

» Insertion and deletion algorithms are somewhat more complex, making B-Tree methods harder to implement

� In practice, few DBMS use B-Trees over B+ Trees

132

B-tree Structures

133

The Nodes of a B+-tree

134

Sample Insertion in a B+ Tree

135

Example of a Deletion in a B+-tree

136

B+ Tree Operation and Cost - Search

� In the following algorithms, the number of Key values in a node is n - 1, and the number of Pointer entries is n

� To find all records having a search key value v:

» Begin at the root node.

• Scan the current node for the lowest key entry value Ki > v (1 ≤ i ≤ n - 1)

• If Ki exists, retrieve the node from pointer Pi.

• If no such Ki exists in the node, then retrieve the node from pointer Pn.

» Repeat the above procedure until a leaf node is reached.

» When a leaf node is reached, scan its entries for Ki = v

• If found, then retrieve the data record using Pi

• Else, no key with value v exists

� The expected number of accesses to retrieve the target key value and pointer is no more than h, the height of the tree = CEIL(log_F(N))

» One additional access is required to read the page from the data file.
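The search procedure can be sketched in a few lines. The `Node` class and the tiny hand-built tree below are illustrative assumptions, not a full B+ tree implementation: internal nodes hold child nodes in `pointers`, leaves hold record identifiers, and each visited node counts as one access.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    keys: list                                      # K1 .. Kn-1 (internal) or leaf keys
    pointers: list = field(default_factory=list)    # children, or record ids in a leaf
    leaf: bool = False

def search(root, v):
    """Return (record pointer or None, number of nodes visited)."""
    node, visited = root, 1
    while not node.leaf:
        # the lowest i with Ki > v selects Pi; if there is none, take the last pointer Pn
        i = next((i for i, k in enumerate(node.keys) if k > v), len(node.keys))
        node = node.pointers[i]
        visited += 1
    for k, p in zip(node.keys, node.pointers):
        if k == v:
            return p, visited
    return None, visited

# A hand-built tree of height 2 over the keys 1, 3, 5, 7, 9
leaf1 = Node([1, 3], ["r1", "r3"], leaf=True)
leaf2 = Node([5, 7], ["r5", "r7"], leaf=True)
leaf3 = Node([9], ["r9"], leaf=True)
root = Node([5, 9], [leaf1, leaf2, leaf3])

print(search(root, 7), search(root, 4))
```

Every search, successful or not, visits exactly one node per level, which is the h accesses counted above.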

137

B+ Tree Operation and Cost – Insert (1/2)

� Find the node Pi where v belongs (use above search procedure to find leaf node)

» If space for v exists in Pi, add record to data file page and add new entry (v,p) to leaf node

• Insert is complete

» If there is no space in Pi for the new entry, then split: allocate a new node page Pn and move the upper half of the entries in the existing node to Pn

• Insert v in either Pi or Pn, as appropriate.

» Let k be the lowest key value in Pn

• Insert k into the parent node of Pi using the Insert procedure, repeating up the tree until a split does not occur

138

B+ Tree Operation and Cost – Insert (2/2)

� The cost of an insert varies depending on whether a node split occurs, and if so, how far up the tree the split is propagated

» When space exists for the new key, the number of accesses required to update the index is no more than h + 1 (the cost to find the target node plus one access needed to write the updated node back to disk)

» Assuming a node split occurs at one level of the tree, then the cost of the insert is h + 4

» The additional accesses occur to write the new node, read the parent node, and update the parent node

» When the split occurs at the root node, the height of the tree is increased by one

� Generalizing, the cost of a split occurring at s levels in the tree is h + 3s + 1

» While the cost for inserts in the face of node splits may seem expensive, bear in mind that since the nodes are maintained at 50% to 67% filled state, on the average, only one in F/2 inserts will result in a split
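The cost formula and the "one in F/2" estimate can be combined into a rough expected cost per insert. This is a back-of-the-envelope sketch: the expected-cost function assumes that a split, when it happens, affects only one level, which understates the rare multi-level case. Both function names are hypothetical.

```python
def insert_cost(h, splits=0):
    """Page accesses for one insert: h reads to reach the leaf, 1 write,
    plus 3 accesses (write new node, read parent, write parent) per split level."""
    return h + 3 * splits + 1

def expected_insert_cost(h, F):
    """Average cost if one insert in F/2 causes a single-level split (assumption)."""
    p_split = 2 / F
    return (1 - p_split) * insert_cost(h, 0) + p_split * insert_cost(h, 1)

print(insert_cost(3), insert_cost(3, 1), insert_cost(3, 2))
print(round(expected_insert_cost(3, 133), 3))
```

For a tree of height 3 and fan-out 133, the average comes out only slightly above the no-split cost of h + 1.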

139

B+ Tree Operation and Cost – Delete

� Find the node Pi where v is located (use above search procedure to find leaf node)

� Remove the entry for v

� If Pi is at least half full, then finished

� Else (d - 1 or fewer entries), try to combine the entries of Pi and a sibling

» If the entries from Pi and the sibling fit into a single node then,

• Combine the entries into node Pi and delete the right node (Pi+1)

» Recursively delete the entry for (Ki, Pi+1) in the parent node

� Else (entries from Pi and Pi+1 cannot fit into a single node),

» Copy entries from Pi+1 to Pi to balance the two nodes

» Update the search key value in the parent node to reflect the new distribution of values in Pi, Pi+1

� Deletions work recursively up the tree until a node with d/2 or more entries is found

» When the root node contains only one pointer after deletion, that node is discarded and the remaining child becomes the new root

� The cost for delete operations is equal to the cost of inserts

140

B+ Tree Operation and Cost – Data Record Operations (1/4)

� Thus far, it has been assumed that the index (both index and sequence sets taken together) is used simply to locate a pointer to the data record for which the search key is an attribute, and the cost to perform the operation on the data record has not been considered

� The number of accesses needed to retrieve the desired record(s) can be seen in the table in slide 59 (cost of operation by file organization) based upon the nature of the operation

» Now that the number of accesses is known, the relative cost of physically ordered vs. randomly ordered retrievals can be compared

� The cost to retrieve a single record (exact match) or several records which are stored in the same page is the same regardless of whether the data file is clustered on the index

» The comparative cost for range retrievals that span pages is much higher for randomly ordered files

141


� If the data file is clustered on the indexed attribute, then the cost is (per slide 59) m times the average access time for sequential reads (which is on the order of 1 or 2 msec) plus the cost of the first access to the data page

» Otherwise, assuming that the pages of the data file are randomly distributed, the cost is R times greater than for a clustered file, where R is the random access weighting factor

� The random access factor compares the two delays incurred to access (read or write) a given page

» These are the rotational delay incurred while waiting for the target page to arrive at the read head, and the seek delay, which is the time required to move the read/write head to the track on which the target page is located

142


� For current technology, the seek time is approximately two times that of the typical rotational wait time for a single page access

» When accessing logically consecutive pages, this factor can increase to as much as eight

» It is reasonable to assume that 8 pages can fit on a single disk track

» With disk rotational speeds of 7200 RPM the time for one full revolution is 8.33 msec

» Compare this to the typical head seek time of 8 to 10 msec

» Then consider that to read the first (or only) page, on average a wait of half the rotational delay will be required, or 4 msec

» Add to this the extra cost to move the head and the total cost for the first page is 12 -14 msec

» This cost is the same whether or not the data file is clustered.

143


� Now, consider the effect of reading two pages

» The additional cost to read the second page in a clustered file will be approximately 1 to 2 msec, while the cost to read the second page of a randomly organized file is another 12 to 14 msec

» So, for a two-page retrieval, the clustered file requires about 15 msec, while the acquisition of the two randomly distributed pages is almost twice that at 26 msec

» This disparity increases to a factor of 2.6 for a 4-page retrieval: 20 msec compared to 52 msec.

� Finally, since a B+ Tree sequence set is always clustered, it is possible to store the attribute data in the leaf nodes when the size of the attributes is relatively small compared to the size of the key (or in the trivial case when the relation contains only a single attribute!)

» The extra storage to incorporate the data into the leaf nodes will decrease the fan-out, potentially increasing the height of the tree
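The clustered versus random retrieval comparison made on the preceding slides can be approximated with a simple model. The default values below are assumed midpoints of the ranges quoted in the slides (12 to 14 msec for the first page, 1 to 2 msec for each further clustered page), so the results only roughly match the rounded figures given there.

```python
def retrieval_ms(pages, clustered, first_ms=13.0, next_ms=2.0):
    """Approximate time to read `pages` data pages.

    first_ms: seek plus half a rotation for the first page (assumed midpoint)
    next_ms:  each further page of a clustered file (assumed upper value)
    """
    if pages == 0:
        return 0.0
    if clustered:
        return first_ms + (pages - 1) * next_ms
    return pages * first_ms      # every page pays the full seek and rotational delay

for n in (1, 2, 4):
    c, r = retrieval_ms(n, True), retrieval_ms(n, False)
    print(n, c, r, round(r / c, 2))
```

The gap widens with every additional page, since a clustered file pays the seek only once.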

144

Example - Dense Index & Unclustered File

� m = 57

� Although 57 pointers are drawn, in fact between 29 and 57 pointers come out of each non-root node

[Figure: an internal node and a leaf node, each laid out as P1 I1 ... P56 I56 P57 with left-over space in the block; a pointer leads to the file block containing the record with index I1 (note that I57 is not listed in the index file)]

145

B+-trees: Properties

� Note that a 2-3 tree is a B+-tree with m = 3

� Important properties

» For any value of N, and m ≥ 3, there is always a B+-tree storing N pointers (and associated key values as needed) in the leaves

» It is possible to maintain the above property for the given m, while inserting and deleting items in the leaves (thus increasing and decreasing N)

» For each such operation, only O(depth of the tree) nodes need to be manipulated.

� When inserting a pointer at the leaf level, restructuring of the tree may propagate upwards

� At the worst, the root needs to be split into 2 nodes

� The new root above will only have 2 children

146

B+-trees: Properties

� Depth of the tree is “logarithmic” in the number of items in the leaves

� In fact, this is logarithm to the base at least ceiling of m/2 (ignore the children of the root; if there are few, the height may increase by 1)

� What value of m is best in RAM (assuming RAM cost model)?

� m = 3

� Why? Think of the extreme case where N is large and m = N

» You get a sorted sequence, which is not good (insertion is extremely expensive)

� “Intermediate” cases between 3 and N are better than N but not as good as 3

� But on disk the situation is very different, as we will see

147

Hash Indexes

� Hash index schemes use hashing functions to map keys (typically an attribute in the relation) to the location of the corresponding record in a data file

» Hashing affords an extremely fast method for direct retrieval, but offers no support for range searches

» As such Hash indexes are not usually deployed when B+ Tree indexing is available

» However, certain types of join operations actually work more efficiently with a Hash index compared to a B+ Tree index (to be continued)

� There are two basic forms of hash methods: static and dynamic

148

Static Hash Indexing (1/4)

� Hashing algorithms define a bucket as a unit of storage (typically a page) within a data file

» A hash function H generates all the possible bucket numbers from the set of all search key values

» To locate the bucket in which a target data entry is (or should be) stored, the hash function is applied to the target search key yielding a bucket (page) number

» The bucket is retrieved from the data file and searched for the target data record

� A well-chosen hash function distributes the keys to buckets uniformly

» A commonly used hash function is based on division of the search key by a prime number

» The prime number is chosen as the highest prime number which represents the number of buckets needed to hold about 70% of the expected number of records

» When the search key is a character type, a dividend is formed by adding up all of the pairs (or 4-tuples) of characters, and the result is divided by the prime divisor
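A division-based hash function of this kind might look as follows. This is a sketch under stated assumptions: character keys are folded two bytes at a time, odd-length keys are padded with a zero byte, and the prime 97 in the usage lines is an arbitrary illustrative bucket count. The function names are made up.

```python
def fold_key(s):
    """Turn a character key into an integer dividend by summing 2-byte pairs."""
    data = s.encode("ascii")
    if len(data) % 2:
        data += b"\x00"                      # pad odd-length keys (an assumption)
    return sum(int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2))

def bucket_of(key, prime):
    """Division hashing: bucket number = dividend mod a prime number of buckets."""
    dividend = key if isinstance(key, int) else fold_key(key)
    return dividend % prime

print(bucket_of(123456789, 97), bucket_of("SMITH", 97), bucket_of("SMYTH", 97))
```

Whatever the key type, the result is always a bucket number between 0 and prime - 1.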

149

Static Hash Indexing (2/4)

� However, it's frequently impossible to choose such a function, or to know for certain exactly how many data entries are to be stored

» As such, it is likely that the hash function will generate the same bucket number for several different search key values

» As long as there is room in the target bucket, this is not a problem

� When the target bucket (called a primary bucket) is full, a new bucket (page), called an overflow bucket, must be allocated and chained to the primary bucket set, forming an overflow chain

» The hash table below illustrates the relationship between the hash function h and the primary and overflow buckets in the hash file

» When the overflow chains become too long, performance degrades badly

» See Ramakrishnan, Raghu - Slide Presentation for Database Management Systems - University of Wisconsin, 1997

� Frequent deletion of data entries is also a problem

» If new entries do not hash to the same bucket where the "holes" are, space is wasted

» The only alternative in this case is to rehash the entire file

150

Dynamic Hash Indexing

� Dynamic hashing methods are intended to deal with the problems of overflow and deletes that static hashing suffers from

» The two most commonly used forms of dynamic hashing are extendable hashing and linear hashing

» These algorithms are more complex than static hashing, and adequate description of these methods is beyond the scope of this presentation

» Detailed descriptions of extendable and linear hashing methods can be found in:

• Ramakrishnan, Raghu. Database Management Systems. New York: WCB McGraw-Hill, 1997

• Silberschatz, Abraham, Henry F. Korth, and S. Sudarshan. Database System Concepts. New York: McGraw-Hill, 1997

• Helman, Paul. The Science of Database Management. Burr Ridge, IL: Richard D. Irwin, Inc., 1994

» An excellent simplified description of extendable hashing can be found in:

• Date, C. J. An Introduction to Database Systems. Reading, MA: Addison-Wesley, 1995

� The significant feature of hashing is that in the absence of overflows, the number of accesses needed to retrieve the target page is one!

» Overflows in statically hashed indexes can cause the number of accesses required to grow dramatically

» Dynamic hashing guarantees that the number of accesses is never more than two, and in most cases is one

151

An Example

� Our application:

» Disk of 16 GiB (this means: 16 times 2 to the power of 30); rather small, but a better example

» Each block of 512 bytes; rather small, but a better example

» File of 20 million records

» Each record of 25 bytes

» Each record has 3 fields: SSN, 5 bytes (packed decimal); name, 14 bytes; salary, 6 bytes

� We want to access the records using the value of SSN

� We want to use a B+-tree index for this

� Our goal is to minimize the number of block accesses, so we want to derive as much benefit as possible from each block

� So, each node should be as big as possible (have many pointers), while still fitting in a single block

» Traversing a block once it is in the RAM is free

� Let’s compute the optimal value of m

152

An Example

� There are $2^{34}$ bytes on the disk, and each block holds $2^{9}$ bytes

� Therefore, there are $2^{25}$ blocks

� Therefore, a block address can be specified in 25 bits

� We will allocate 4 bytes to a block address.

» We may be “wasting” space by working at the byte level as opposed to the bit level, but this simplifies the calculations

� A node in the B-tree will contain some m pointers and m – 1 keys, so what is the largest possible value for m, so a node fits in a block?

$$m \times (\text{size of pointer}) + (m - 1) \times (\text{size of key}) \le \text{size of the block}$$

$$m \times 4 + (m - 1) \times 5 \le 512$$

$$9m \le 517$$

$$m \le 57.4\ldots$$

$$m = 57$$
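The same calculation can be checked mechanically. The helper name `max_fanout` is made up for this sketch; the sizes are the ones from the example (512-byte block, 4-byte pointer, 5-byte key).

```python
def max_fanout(block_size: int, pointer_size: int, key_size: int) -> int:
    # m*pointer + (m-1)*key <= block  is equivalent to
    # m <= (block + key) / (pointer + key)
    return (block_size + key_size) // (pointer_size + key_size)


m = max_fanout(block_size=512, pointer_size=4, key_size=5)
min_children = -(-m // 2)  # ceiling of m/2, the minimum for a non-root node
```

Rerunning it with a larger block size shows how quickly the fan-out grows.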

153

An Example

� Therefore, the root will have between 2 and 57 children

� Therefore, an internal node will have between 29 and 57 children

» 29 is the ceiling of 57/2

� We will examine how a tree can develop in two extreme cases:

» “narrowest” possible

» “widest” possible

� To do this, we need to know how the underlying data file is organized, and whether we can reorganize it by moving records in the blocks as convenient for us

� We will assume for now that

» the data file is already given and,

» it is not clustered on the key (SSN) and,

» that we cannot reorganize it but have to build the index on “top of it”

154

Example - Dense Index & Unclustered File

� In fact, between 2 and 57 pointers come out of the root

� In fact, between 29 and 57 pointers come out of a non-root node (internal or leaf)

[Figure: layout of an internal node and of a leaf node, each of the form P1 I1 . . . P56 I56 P57 followed by left-over space; note that I57 is not listed in the index file]

155

An Example

� We obtained a dense index, where there is a pointer “coming out” of the index file for every existing key in the data file

� Therefore we needed a pointer “coming out” of the leaf level for every existing key in the data file

� In the narrow tree, we would like the leaves to contain only 29 pointers, therefore we will have 20,000,000 / 29 = 689,655.1... leaves

� We must have an integer number of leaves

� If we round up, that is have 689,656 leaves, we have at least 689,656 × 29 = 20,000,024 pointers: too many!

� So we must round down and have 689,655 leaves

» If we have only 29 pointers coming out of each leaf we have 689,655 × 29 = 19,999,995 pointers: too few

» But this is OK, we will have some leaves with more than 29 pointers “coming out of them” to get exactly 20,000,000

156

An Example

� In the wide tree, we would like the leaves to contain 57 pointers, therefore we will have 20,000,000 / 57 = 350,877.1... leaves

� If we round down, that is have 350,877 leaves, we have at most 350,877 × 57 = 19,999,989 pointers: too few!

� So we must round up and have 350,878 leaves

» If we have all 57 pointers coming out of each leaf we have 350,878 × 57 = 20,000,046 pointers: too many

» But this is OK, we will have some leaves with fewer than 57 pointers “coming out of them” to get exactly 20,000,000
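The two rounding rules can be written down directly. This is a small sketch; `leaf_count_bounds` is a hypothetical helper name, and the numbers are those of the example.

```python
RECORDS = 20_000_000


def leaf_count_bounds(records: int, min_ptrs: int, max_ptrs: int):
    most_leaves = records // min_ptrs        # narrow tree: round down
    fewest_leaves = -(-records // max_ptrs)  # wide tree: round up
    return most_leaves, fewest_leaves


narrow_leaves, wide_leaves = leaf_count_bounds(RECORDS, min_ptrs=29, max_ptrs=57)
```

In both cases the leaves can be adjusted (some holding more than 29, or fewer than 57, pointers) so that exactly 20,000,000 pointers come out of the leaf level.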

157

An Example

Level   Nodes in a narrow tree   Nodes in a wide tree
1                            1                      1
2                            2                     57
3                           58                  3,249
4                        1,682                185,193
5                       48,778             10,556,001
6                    1,414,562
7                   41,022,298

� We must get a total of 20,000,000 pointers “coming out” in the lowest level.

� For the narrow tree, 6 levels is too much

» If we had 6 levels there would be at least 1,414,562 × 29 = 41,022,298 pointers, but there are only 20,000,000 records

» So it must be 5 levels

� In search, there is one more level, but “outside” the tree, the file itself

158

An Example


Level   Nodes in a narrow tree   Nodes in a wide tree
1                            1                      1
2                            2                     57
3                           58                  3,249
4                        1,682                185,193
5                       48,778             10,556,001
6                    1,414,562
7                   41,022,298

� For the wide tree, 4 levels is too few

» If we had 4 levels there would be at most 185,193 × 57 = 10,556,001 pointers, but there are 20,000,000 records

» So it must be 5 levels

� Note: for the wide tree we round up the number of levels

� Conclusion: it will be 5 levels exactly in the tree (accident of the example; in general could be some range)

� In search, there is one more level, but “outside” the tree, the file itself
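The level table and the two conclusions can be reproduced with a short script. This is a sketch under the example's assumptions (root with 2 to 57 children, other nodes with 29 to 57); the function name `nodes_per_level` is invented here.

```python
def nodes_per_level(root_children: int, fanout: int, levels: int):
    """Nodes on levels 1..levels when the root has `root_children` children
    and every other node has `fanout` children."""
    counts = [1]
    for _ in range(levels - 1):
        counts.append(counts[-1] * (root_children if len(counts) == 1 else fanout))
    return counts


RECORDS = 20_000_000
narrow = nodes_per_level(2, 29, 7)    # fewest children everywhere
wide = nodes_per_level(57, 57, 5)     # most children everywhere

# Narrow tree: deepest level whose nodes, at 29 pointers each, do not exceed the records.
narrow_levels = max(i + 1 for i, n in enumerate(narrow) if n * 29 <= RECORDS)
# Wide tree: shallowest level whose nodes, at 57 pointers each, can cover the records.
wide_levels = min(i + 1 for i, n in enumerate(wide) if n * 57 >= RECORDS)
```

Both come out to 5 for this example, which is why the tree has exactly 5 levels here.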

159

Elaboration On Rounding

� We attempt to go as narrow and as wide as possible. We start by going purely narrow and purely wide, and this is what we have on the transparencies. However, rarely can we have such pure cases. As an analogy with 2-3 trees, going narrow means 2, going wide means 3. But then we always get the number of leaves to be a power of 2 (narrow) or a power of 3 (wide).

� If the number of leaves we want to get is not such a pure case, we need to compromise. I will explain in the case of 2-3, as the point of how to round is explainable there.

160


� Say you are asking the question for narrowest/widest 2-3 tree with 18 leaves.

� If you compute, you get the following:

Level   Nodes (narrowest)   Nodes (widest)
1                       1                1
2                       2                3
3                       4                9
4                       8               27
5                      16               81
6                      32
7                      64
etc.

161


� What this tells me:

� No matter how narrow you go (pure narrow case), you have at least 32 leaves on level 6, so you cannot have a 2-3 tree with 6 levels and 18 leaves. However (this is not a proof, but you can convince yourself by playing with the example), if you take the narrowest tree of 5 levels, which has only 16 leaves, you can insert 2 additional leaves into it without increasing the number of levels. So the tree is in some sense “as narrow as possible.” In summary, you find the position between levels (5 and 6 in this example) and round down. This works because leaves can be inserted (again, I did not give a proof) without needing more than 5 levels.

� If you go the pure wide way, you fall between levels 3 and 4. No matter how wide you go, with 3 levels there is no way you will have more than 9 leaves, but you need 18. It can be shown (again, no proof) that rounding up (from 3 to 4) is enough. Intuitively, one gets there by removing 27 − 18 = 9 leaves from the purely wide tree of 4 levels.
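The claims above are stated without proof. For a case this small, a brute-force check is easy. The sketch below enumerates every leaf count a 2-3 tree can have for a given number of levels, assuming all leaves sit on the same level and every internal node has 2 or 3 children. The function name is made up for the illustration.

```python
def achievable_leaf_counts(levels: int) -> set:
    """Leaf counts achievable by a 2-3 tree with exactly `levels` levels."""
    counts = {1}                      # a single node is both root and leaf
    for _ in range(levels - 1):
        pairs = {a + b for a in counts for b in counts}
        triples = {p + c for p in pairs for c in counts}
        counts = pairs | triples      # a node has 2 or 3 equally tall subtrees
    return counts


feasible = [h for h in range(1, 8) if 18 in achievable_leaf_counts(h)]
```

For 18 leaves, only 4 and 5 levels are feasible, matching “round up” for the wide tree and “round down” for the narrow tree.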

162

An Example

� How does this compare with a sorted file?

� If a file is sorted, it fits in at least 20,000,000 / 20 = 1,000,000 blocks, therefore:

Binary search will take ceiling of log2(1,000,000) = 20 block accesses

� As the narrow tree had 5 levels, we needed at most 6 block accesses (5 for the tree, 1 for the file)

» We say “at most” because in general the narrow and the wide tree differ in the number of levels; only in our example do they coincide, so here we could say “we needed exactly 6 block accesses”

» But if the page/block size is larger, this will be even better

� A 2-3 tree is better than a sorted file, but not by much
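The comparison for the sorted file versus the B+-tree is simple arithmetic; the variable names below are chosen only for this sketch.

```python
import math

records, records_per_block = 20_000_000, 20
blocks = records // records_per_block                   # 1,000,000 blocks when full
binary_search_accesses = math.ceil(math.log2(blocks))   # sorted file
index_accesses = 5 + 1                                  # 5 tree levels + 1 file block
```

Since the binary search cost grows with the logarithm base 2 and the tree cost with a logarithm whose base is the fan-out, a larger block size widens the gap further.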

163

Finding 10 Records

� So how many block accesses are needed for reading, say 10 records?

� If the 10 records are “random,” then we need possibly close to 60 block accesses to get them, 6 accesses per record

» In fact, somewhat less, as the top levels of the B-tree are likely cached, so there is no need to read them again for each search of one of the 10 records of interest

� Even if the 10 records are consecutive, then as the file is not clustered, they will still (likely) be in 10 different blocks of the file, but “pointed from” 1 or 2 leaf nodes of the tree

» We do not know exactly how much we need to spend traversing the index, worst case to access 2 leaves we may have 2 completely disjoint paths starting from the root, but this is unlikely

» In addition, maybe the index leaf blocks are chained so it is easy to go from a leaf to the following leaf

» So in this case, seems like 16 or 17 block accesses in most cases

164

An Example

� We will now assume that we are permitted to reorganize the data file

� We will in fact treat the file as the lowest level of the tree

� The tree will have two types of nodes:

» nodes containing indexes and pointers as before

» nodes containing the data file

� The leaf level of the tree will in fact be the file

� We have seen this before

165

An Example

� For our example, at most 20 records fit in a block

� Each block of the file will contain between 10 and 20 records

� So the bottom level is just like any node in a B-tree, but because the “items” are bigger, the value of m is different, it is m = 20

� So when we insert a record in the right block, there may be now 21 records in a block and we need to split a block into 2 blocks

» This may propagate up

� Similarly for deletion

� We will examine how a tree can develop in two extreme cases:

» “narrowest” possible

» “widest” possible

166

An Example

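[Figure: an internal node of the form P1 I1 . . . P56 I56 P57 followed by left-over space, and a leaf node (a block of the file) holding records, each consisting of an index value I followed by the rest of the record, followed by left-over space]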

For each leaf node (that is a block of the file), there is a pointer associated with the largest key value from the key values appearing in the block; the pointer points from the level just above the leaf of the structure

167

An Example


Level   Nodes in a narrow tree   Nodes in a wide tree
1                            1                      1
2                            2                     57
3                           58                  3,249
4                        1,682                185,193
5                       48,778             10,556,001
6                    1,414,562
7                   41,022,298

� Let us consider the trees, without the file level for now

� For the narrow tree, we must get a total of 20,000,000 / 10 = 2,000,000 pointers in the lowest level

� So we need 2,000,000 / 29 = 68,965.5... nodes

� So it is between level 5 and 6, so must be 5

» Rounding down as before

©Zvi M. Kedem

168

An Example


Level   Nodes in a narrow tree   Nodes in a wide tree
1                            1                      1
2                            2                     57
3                           58                  3,249
4                        1,682                185,193
5                       48,778             10,556,001
6                    1,414,562
7                   41,022,298

� For the wide tree, we must get a total of 20,000,000 / 20 = 1,000,000 pointers in the lowest level

� So we need 1,000,000 / 57 = 17,543.8… nodes

� So it is between level 3 and 4, so must be 4

» Rounding up as before

� Conclusion: it will be between 4 + 1 = 5 and 5 + 1 = 6 levels (including the file level), with this number of block accesses required to access a record

169

Finding 10 records

� So how many block accesses are needed for reading, say 10 records?

� If the 10 records are “random,” then we need possibly close to 60 block accesses to get them, 6 accesses per record.

� In practice (which we do not discuss here), somewhat fewer, as it is likely that the top levels of the B-tree are cached, so there is no need to read them again for each search of one of the 10 records of interest

170

Finding 10 records

� If the 10 records are consecutive, and if there are not too many null records in the file blocks, then all the records are in 2 blocks of the file

� So we only need to find these 2 blocks, and the total number of block accesses is between 7 and a “little more”

� In general, we do not know exactly how much we need to spend traversing the index

» In the worst case in order to access two blocks we may have two completely different paths starting from the root

� But maybe the index leaf blocks are chained so it is easy to go from leaf to leaf

171

An Example

� To reiterate:

� The first layout resulted in an unclustered file

� The second layout resulted in a clustered file

� Clustering means: “logically closer” records are “physically closer”

� More precisely: as if the file has been sorted with blocks not necessarily full and then the blocks were dispersed

� So, for a range query the second layout is much better, as it decreases the number of accesses to the file blocks by a factor of between 10 and 20

172

How About Hashing On A Disk

� Same idea

� Hashing as in RAM, but now choose B, so that “somewhat fewer” than 20 key values from the file will hash on a single bucket value

� Then, from each bucket, “very likely” we have a linked list consisting of only one block

� But how do we store the bucket array? Very likely in memory, or in a small number of blocks, which we know exactly where they are

� So, we need about 2 block accesses to reach the right record “most of the time”

173

Index On A Non-Key Field

� So far we have considered indexes on “real” keys, i.e., for each search key value there was at most one record with that value

� One may want to have an index on a non-key value, such as DateOfBirth

» Choosing the right parameters is not as easy, as there is, in general, an unpredictable number of records with the same key value

� One solution: using an index, find a “header” pointing at a structure (e.g., a sequence of blocks) containing the records of interest

[Figure: index with pointers to “All records for 1957.02.03” and “All records for 1957.02.04”]

174

Index On A Non-Key Field: Hashing

� The number of (key,pointer) pairs on the average should be “somewhat” fewer than what can fit in a single block

� Very efficient access to all the records of interest

175

Index On A Non-Key Field: B+-tree

176

Primary vs. Secondary Indexes

� In the context of clustered files, we discussed a primary index, that is the one according to which a file is physically organized, say SSN

� But we may also want to efficiently answer queries such as:

» Get records of all employees with salary between 35,000 and 42,000

» Get records of all employees with the name: “Ali”

� For this we need more indexes, if we want to do it efficiently

� They will have to be put “on top” of our existing file organization.

» The primary file organization was covered above, it gave the primary index

� We seem to need to handle range queries on salaries and non-range queries on names

� We will generate secondary indexes

177

Secondary Index On Salary

� The file’s primary organization is on SSN

� So, it is “very likely” clustered on SSN, and therefore it cannot be clustered on SALARY

� Create a new file of variable-size records of the form:

(SALARY)(SSN)*

For each existing value of SALARY, we have a list of all SSN of employees with this SALARY.

� This is a clustered file on SALARY

178

Secondary Index On Salary

� Create a B+-tree index for this file

� Variable-size records, which could span many blocks, are handled similarly to the way we handled non-key indexes

� This tree together with the new file form a secondary index on the original file

� Given a range of salaries, using the secondary index we find all SSN of the relevant employees

� Then using the primary index, we find the records of these employees

» But they are unlikely to be contiguous, so may need “many” block accesses

179

Secondary Index on Name

� The file’s primary organization is on SSN

� So, it is “very likely” clustered on SSN, and therefore it cannot be clustered on NAME

� Create a file of variable-size records of the form:

(NAME)(SSN)*

For each existing value of NAME, we have a list of all SSN of employees with this NAME.

� Create a hash table for this file

� This table together with the new file form a secondary index on the original file

� Given a value of name, using the secondary index we find all SSN of the relevant employees

� Then using the primary index, we find the records of these employees

180

Index On Several Fields

� In general, a single index can be created for a set of columns

� So if there is a relation R(A,B,C,D), an index can be created for, say (B,C)

� This means that given a specific value or range of values for (B,C), appropriate records can be easily found

� This is applicable to all types of indexes

181

Symbolic vs. Physical Pointers

� Our secondary indexes were symbolic

Given a value of SALARY or NAME, the “pointer” was the primary key value

� Instead we could have physical pointers

(SALARY)(block address)* and/or (NAME)(block address)*

� Here the block addresses point to the blocks containing the relevant records

� Access more efficient as we skip the primary index

� Maintaining more difficult

» If the primary index needs to be modified (new records inserted, causing splits, for instance), we need to make sure that the physical pointers are properly updated

182

When to Use Indexes To Find Records

� When you expect that it is cheaper than simply going through the file

� How do you know that? Make profiles, estimates, guesses, etc.

� Note the importance of clustering in greatly decreasing the number of block accesses

» Example, if there are 1,000 blocks in the file, 100 records per block (100,000 records in the file), and

• There is a hash index on the records

• The file is unclustered (of course)

» To find records of interest, we need to read at least all the blocks that contain such records

» To find 50 (very small fraction of) records, perhaps use an index, as these records are in about (maybe somewhat fewer) 50 blocks, very likely

» To find 2,000 (still a small fraction of) records, do not use an index, as these records are in “almost as many as” 1,000 blocks, very likely, so it is faster to traverse the file and read 1000 blocks (very likely) than use the index
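One rough way to make the “very likely” estimates above concrete is to compute the expected number of distinct blocks touched. This assumes the wanted records are spread uniformly and independently over the blocks, which is an assumption of this sketch rather than something the slides state. The function name is also made up.

```python
def expected_blocks_touched(num_blocks: int, num_records: int) -> float:
    """Expected number of distinct blocks holding `num_records` records that are
    spread uniformly and independently over `num_blocks` blocks."""
    miss = (1 - 1 / num_blocks) ** num_records   # chance a given block holds none
    return num_blocks * (1 - miss)


few = expected_blocks_touched(1000, 50)      # a little under 50 blocks
many = expected_blocks_touched(1000, 2000)   # most of the 1,000 blocks
```

Under this model, 50 records land in slightly fewer than 50 blocks, while 2,000 records touch the large majority of the file, so a sequential scan becomes competitive.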

183

How About SQL?

� Most commercial database systems implement indexes

� Assume relation R(A,B,C,D) with primary key A

� Some typical statements in commercial SQL-based database systems

» CREATE UNIQUE INDEX index1 ON R(A); UNIQUE implies that this will be a “real” key, just like UNIQUE in SQL DDL

» CREATE INDEX index2 ON R(B ASC,C)

» CREATE CLUSTERED INDEX index3 on R(A)

» DROP INDEX index4

� Generally some variant of B tree is used (not hashing)

» In fact generally you cannot specify whether to use B-trees or hashing

184

Oracle SQL

� Generally:

» When a PRIMARY KEY is declared, the system generates a primary index using a B+-tree

Useful for retrieval and for making sure that this is indeed a primary key (two different rows cannot have the same key)

» When UNIQUE is declared, the system generates a secondary index using a B+-tree

Useful as above

� It is possible to specify hash indexes using so-called HASH CLUSTERs

Useful for fast retrieval, particularly on equality (or on a very small number of values)

� It is possible to specify bit-map indexes, particularly useful for querying databases (not modifying them—that is, in a “warehouse” environment)

185

Bitmap Index

� Assume we have a relation (table) as follows:

ID Sex Region Salary

10 M W 450

31 F E 321

99 F W 450

81 M S 356

62 M N 412

52 M S 216

47 F N 658

44 F S 987

51 F N 543

83 M S 675

96 M E 412

93 F E 587

30 F W 601

186

Bitmap Index

� Bitmap index table listing whether a row has a value

ID F M E N S W

10 0 1 0 0 0 1

31 1 0 1 0 0 0

99 1 0 0 0 0 1

81 0 1 0 0 1 0

62 0 1 0 1 0 0

52 0 1 0 0 1 0

47 1 0 0 1 0 0

44 1 0 0 0 1 0

51 1 0 0 1 0 0

83 0 1 0 0 1 0

96 0 1 1 0 0 0

93 1 0 1 0 0 0

30 1 0 0 0 0 1

187

Bitmap Index

� Useful when the cardinality of an attribute is small:

» This means: the number of distinct values is small

» Which implies that the number of columns in the index is small

� Example: Find all IDs for people for F and S; i.e., people for which Sex = “F” and Region = “S”

» Just do boolean AND on the columns labeled F and S; where you get 1, you look at the ID, and it satisfies this condition

� Example: How many males live in the northern region?

» Just do boolean AND on the columns labeled M and N and count the number of 1’s

� Can use boolean OR, NEGATION, if needed

� The idea is to make the bitmap index table so small that it can fit in the RAM (or still be relatively small)

» So operations are very efficient

� But, the ID field could be large

� Solution: two tables, each smaller
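The two example queries can be tried out on the sample relation with Python integers standing in for the bit columns. This is a sketch only: the helper names `build_bitmaps` and `ids_for` are hypothetical, and a real system would store and compress bitmaps quite differently.

```python
ROWS = [  # (ID, Sex, Region) from the example relation
    (10, "M", "W"), (31, "F", "E"), (99, "F", "W"), (81, "M", "S"),
    (62, "M", "N"), (52, "M", "S"), (47, "F", "N"), (44, "F", "S"),
    (51, "F", "N"), (83, "M", "S"), (96, "M", "E"), (93, "F", "E"),
    (30, "F", "W"),
]


def build_bitmaps(rows):
    """One integer per distinct value; bit i is set when row i has that value."""
    bitmaps = {}
    for i, (_, sex, region) in enumerate(rows):
        for value in (sex, region):
            bitmaps[value] = bitmaps.get(value, 0) | (1 << i)
    return bitmaps


def ids_for(bitmap, rows):
    """IDs of the rows whose bit is set in `bitmap`."""
    return [row[0] for i, row in enumerate(rows) if (bitmap >> i) & 1]


bitmaps = build_bitmaps(ROWS)
female_south = ids_for(bitmaps["F"] & bitmaps["S"], ROWS)   # Sex = F AND Region = S
males_north = bin(bitmaps["M"] & bitmaps["N"]).count("1")   # count the 1 bits
```

Note that the ID list is only consulted after the bit operations, which is the point of keeping the IDs in a separate structure paired with the bitmap by row position.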

188

Optimization: Maintain 2 Smaller Structures With Implicit Row Pairing

ID

10000000400567688782345

31000000400567688782345

99000000400567688782345

81000000400567688782345

62000000400567688782345

52000000400567688782345

47000000400567688782345

44000000400567688782345

51000000400567688782345

83000000400567688782345

96000000400567688782345

93000000400567688782345

30000000400567688782345

F M E N S W

0 1 0 0 0 1

1 0 1 0 0 0

1 0 0 0 0 1

0 1 0 0 1 0

0 1 0 1 0 0

0 1 0 0 1 0

1 0 0 1 0 0

1 0 0 0 1 0

1 0 0 1 0 0

0 1 0 0 1 0

0 1 1 0 0 0

1 0 1 0 0 0

1 0 0 0 0 1

189

Computing Conjunction Conditions

� Simplest example: R(A,B)

� SELECT * FROM R WHERE A = 1 AND B = ‘Mary’;

� Assume the database has indexes on A and on B

� This means, we can easily find

» All the blocks that contain at least one record in which A = 1

» All the blocks that contain at least one record in which B = ‘Mary’

� A reasonably standard solution

» DB picks one attribute with an index, say A

» Brings all the relevant blocks into memory (those in which there is a record with A = 1)

» Selects and outputs those records for which A = 1 and B = ‘Mary’

190

But There Are Choices

� Choice 1:

» DB picks attribute A

» Brings all the relevant blocks into memory (those in which there is a record with A = 1)

� Choice 2:

» DB picks attribute B

» Brings all the relevant blocks into memory (those in which there is a record with B = ‘Mary’)


� It is more efficient to pick the attribute for which there are fewer relevant blocks, i.e., the index is more selective

191

But There Are Choices

� Some databases maintain profiles, statistics helping them to decide which indexes are more selective

� But some have a more static decision process

» Some pick the first in the sequence, in this case A

» Some pick the last in the sequence, in this case B

� So depending on the system, one of the two below may be much more efficient than the other

» SELECT * FROM R WHERE A = 1 AND B = ‘Mary’;

» SELECT * FROM R WHERE B = ‘Mary’ AND A = 1;

� So it may be important for the programmer to decide which of the two equivalent SQL statements to specify

192

Using Both Indexes

� DB brings into RAM the identifiers of the relevant blocks (their serial numbers)

� Intersects the set of identifiers of blocks relevant for A with the set of identifiers of blocks relevant for B

� Reads into RAM the blocks whose identifiers are in the intersection, and selects and outputs the records satisfying the conjunction
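The three steps can be sketched with Python sets. Everything here is illustrative: the function name, the dictionary-based “indexes” mapping a value to the set of block identifiers containing it, and the tiny sample data are assumptions made for the example.

```python
def conjunctive_select(blocks, index_a, index_b, a_value, b_value):
    """blocks: block id -> list of (A, B) records.
    index_a / index_b: attribute value -> set of ids of blocks containing it."""
    candidates = index_a.get(a_value, set()) & index_b.get(b_value, set())
    result = []
    for block_id in sorted(candidates):          # only these blocks are read
        for a, b in blocks[block_id]:
            if a == a_value and b == b_value:    # recheck both conditions
                result.append((a, b))
    return result, len(candidates)


blocks = {
    0: [(1, "Mary"), (2, "John")],
    1: [(1, "Ann")],
    2: [(3, "Mary")],
    3: [(1, "John"), (4, "Mary")],   # has A = 1 and B = 'Mary', but in different records
}
index_a = {1: {0, 1, 3}, 2: {0}, 3: {2}, 4: {3}}
index_b = {"Mary": {0, 2, 3}, "John": {0, 3}, "Ann": {1}}
rows, blocks_read = conjunctive_select(blocks, index_a, index_b, 1, "Mary")
```

Block 3 shows why the records must still be checked after the intersection: a block can qualify on both attributes without containing a single record that satisfies the conjunction.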

193

Joins and Indexing

� Indexes can have an effect on the evaluation of join queries by a DBMS

» There are several algorithms available for implementing join

» The choice of which method to use depends on a number of factors, not the least of which is the presence of an index on one or more of the attributes involved in the query

� The join operator is the most expensive of the operators

» To understand the effect that indexing can have on the cost of joins, several join methods are examined

» For this presentation, only simple equi-joins of the form R (join) S on a single common attribute of R and S, and simple range queries, are considered

» In the following slides, relation R is called the outer relation, and S is the inner relation

194

Nested Loop Joins (1/3)

� This is the brute force method for join of two relations R and S

� The algorithm is as follows:

for each tuple ri in R do
    for each tuple sj in S do
        if ri = sj then
            add <ri, sj> to output

195


� To calculate the anticipated cost of the join using this method, assume that the number of pages in R and S are Pr and Ps, and the number of records per page for R and S are nr and ns, respectively

» Then the cost of the join Cj in units of page reads is:

• Cj = Pr + (nr*Pr*Ps)

» For example, if R consists of 1000 pages of 20 tuples each, and S is 500 pages of 10 tuples each then:

• Cj = 1000 + (20*1000*500) = 10,001,000 page reads
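A small simulation helps confirm how the formula Cj = Pr + nr*Pr*Ps counts page reads. The sketch below models each relation as a list of pages and each page as a list of join-attribute values. This representation, the function names, and the assumption that nothing is buffered between probes are all simplifications for illustration.

```python
def nested_loop_join(r_pages, s_pages):
    """Tuple-at-a-time nested loop join; returns (matches, page_reads)."""
    output, page_reads = [], 0
    for r_page in r_pages:
        page_reads += 1                      # each page of R is read once
        for ri in r_page:
            for s_page in s_pages:
                page_reads += 1              # S is rescanned for every tuple of R
                for sj in s_page:
                    if ri == sj:
                        output.append((ri, sj))
    return output, page_reads


def nested_loop_cost(p_r: int, n_r: int, p_s: int) -> int:
    return p_r + n_r * p_r * p_s             # Cj = Pr + nr*Pr*Ps
```

Evaluating `nested_loop_cost(1000, 20, 500)` applies the formula to the running example (R with 1000 pages of 20 tuples, S with 500 pages).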

196


� Note the relatively small direct contribution that the outer relation R makes to the cost

» Each page of R is read but once, while many passes over the same pages of the inner relation are required

»This unfortunate circumstance is improved upon in the next algorithm

197

Block Nested Loop Joins (1/2)

� This algorithm depends on using available pages of memory to buffer pages of R and S while looping

� If M memory pages are available to use, then the algorithm is:

for each block of M - 2 pages in R do
    for each page in S do
        for each ri in the R memory block do
            for each sj in the S memory page do
                if ri = sj then
                    add <ri, sj> to output block

� Then the cost of the join Cj in units of page reads is: Cj = Pr + Ps* Pr/(M-2)

198

Block Nested Loop Joins (2/2)

� Using the previous example and assuming there are 12 available memory pages (an extremely modest requirement!), then:

» Cj = 1000 + 500*1000/(12 - 2) = 51,000

� A large improvement over the brute force method (whose formula Cj = Pr + nr*Pr*Ps gives 1000 + 20*1000*500 = 10,001,000 page reads for this example) at a trivial cost

» If the number of available memory pages were increased to 102, the cost would drop to 6,000 page reads

� The in-memory process of matching of tuples in S is enhanced by building a hash table in memory for the target attribute of the tuples in R

» Furthermore, performance is improved by using the smaller of the two relations as the outer relation
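The block nested loop cost formula is easy to evaluate for different memory sizes and for either choice of outer relation. In this sketch the function name is hypothetical, and the number of passes is rounded up to a whole number, which the slide's formula leaves implicit.

```python
import math


def block_nested_loop_cost(p_outer: int, p_inner: int, memory_pages: int) -> int:
    """Cj = P_outer + P_inner * ceil(P_outer / (M - 2)), in page reads."""
    chunks = math.ceil(p_outer / (memory_pages - 2))   # passes over the inner relation
    return p_outer + p_inner * chunks


cost_12 = block_nested_loop_cost(1000, 500, 12)       # R outer, 12 pages
cost_102 = block_nested_loop_cost(1000, 500, 102)     # R outer, 102 pages
cost_swapped = block_nested_loop_cost(500, 1000, 12)  # smaller relation S as outer
```

Swapping the roles so that the smaller relation is the outer one gives a slightly lower cost with the same 12 pages.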

199

Index Nested Loop Joins (1/2)

� If there is an index on the join attribute of one of the relations, then that relation can be used as the inner relation and the index can be used to great advantage

� Since only equality joins are being considered, the index may be either B+ Tree or Hash type

� The algorithm for Index Nested Loop Join is:

for each tuple ri in R do
    for each tuple sj in S where ri = sj do
        add <ri, sj> to output

200

Index Nested Loop Joins (2/2)

� This algorithm uses the index on the inner relation to retrieve (any) tuple from S matching the join attribute of R

» The cost for the Index Nested Loop Join is:

• For B+ Tree indexes: Cj = Pr *( hs+1), where hs is the height of the index tree

• For Hash Indexes: Cj = Pr *(A+1), A is average cost of a hash index access (typical value of A is 1.2)

� Again, using the running example, and assuming that the height of S is two (since S contains but 5000 records, a height of two is highly probable):

» B+ Tree: Cj = 1000*(2+1) = 3000

» Hash: Cj = 1000*(1.2+1) = 2200

� The effect of the index on the join is so significant that, when processing a join, some DBMS will build an index on the inner relation at runtime
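The index nested loop algorithm can be sketched in memory with a Python dictionary standing in for a hash index on the inner relation. The tuple representation (dicts), the helper names, and building the index on the fly are assumptions for illustration; building it at runtime mirrors the remark that some DBMSs do exactly that.

```python
from collections import defaultdict


def build_hash_index(tuples, key):
    """In-memory stand-in for an index on the inner relation's join attribute."""
    index = defaultdict(list)
    for t in tuples:
        index[t[key]].append(t)
    return index


def index_nested_loop_join(outer, inner_index, key):
    output = []
    for r in outer:                                  # one pass over the outer relation
        for s in inner_index.get(r[key], []):        # index probe instead of a scan
            output.append((r, s))
    return output
```

For example, joining a small Ratings list with a Genre list on `"title"` only touches the Genre tuples whose title actually occurs in Ratings.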

201

Sort-Merge Join

� The sort-merge join requires that both relations be sorted on the join attribute

� Then the sorted tables are “merged,” comparing tuples in both tables

� The cost of a sort-merge join is:

Cj = Pr*log(Pr) + Ps*log(Ps) + Pr + Ps

� In the example (with base-10 logarithms, so log(1000) = 3 and log(500) ≈ 2.7),

» Cj = 1000*3 + 500*2.7 + 1000 + 500 = 5,850

� It's interesting to examine the sort-merge join algorithm when B+ Tree indexes exist for both relations

» The sequence sets of both indexes are sorted

» The cost to join these using the merge method from sort-merge is simply Pr + Ps, which is quite small
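A compact in-memory version of the sort and merge phases is shown below. It is a sketch, not a disk-based implementation: relations are plain lists of tuples, the join attribute is assumed to be the first field unless another `key` function is passed, and the function name is invented for the example.

```python
def sort_merge_join(r, s, key=lambda t: t[0]):
    """Equi-join of two lists of tuples on key(t); handles duplicate keys."""
    r, s = sorted(r, key=key), sorted(s, key=key)     # sort phase
    output, i, j = [], 0, 0
    while i < len(r) and j < len(s):                  # merge phase
        if key(r[i]) < key(s[j]):
            i += 1
        elif key(r[i]) > key(s[j]):
            j += 1
        else:
            k = key(r[i])
            j_end = j
            while j_end < len(s) and key(s[j_end]) == k:
                j_end += 1                            # end of the matching run in s
            while i < len(r) and key(r[i]) == k:
                output.extend((r[i], sj) for sj in s[j:j_end])
                i += 1
            j = j_end
    return output
```

If both inputs are already sorted (as the sequence sets of B+-tree indexes are), the sort phase costs nothing extra and only the single merge pass remains.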

202

Hash Joins

� There are other methods used to execute joins, such as the Hash Join, and the Hybrid Hash Join

� The hash join partitions the relations into sets by hashing the join attributes of each relation using the same hash function

� Only tuples whose hash values are equal need be compared, that is, tuples that have been hashed into the same partition
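The partitioning idea can be illustrated with a simple in-memory hash join. This is a sketch under stated assumptions: tuples are joined on their first field by default, Python's built-in `hash` modulo a small partition count plays the role of the shared hash function, and the function name is made up. It is not the hybrid variant.

```python
from collections import defaultdict


def hash_join(r, s, key=lambda t: t[0], partitions=4):
    """Partitioned hash join; only tuples in the same partition are compared."""
    def partition(rel):
        parts = defaultdict(list)
        for t in rel:
            parts[hash(key(t)) % partitions].append(t)   # same hash function for both
        return parts

    r_parts, s_parts = partition(r), partition(s)
    output = []
    for p, r_part in r_parts.items():
        table = defaultdict(list)                 # build phase on the R partition
        for t in r_part:
            table[key(t)].append(t)
        for sj in s_parts.get(p, []):             # probe phase with the S partition
            for ri in table.get(key(sj), []):
                output.append((ri, sj))
    return output
```

Tuples that fall into different partitions are never compared, which is where the savings over a nested loop come from.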

203

Computing A Join Of Two Tables

� We will deal with rough asymptotic estimates to get the flavor

» So we will make simplifying assumptions, which still provide the intuition, but for a very simple case

� We have available RAM of 3 blocks

� We have two tables R(A,B), S(C,D)

� Assume no indexes exist

� There are 1,000 blocks in R

� There are 10,000 blocks in S

� We need (or rather DB needs) to compute

SELECT *
FROM R, S
WHERE R.B = S.C;

204

Experiment with Indexing (1/10)

� The next few slides present the results of tests conducted to demonstrate the effect indexing has on the performance of join operation in relational queries

� The test cases were chosen to exercise joins using several of the methods described in this paper, and to compare the characteristics of hash indexes with those of tree-structured indexes

� The results of the test runs are summarized in the next slide

� The test environment was:

» Pentium 200

» 64 Megabytes memory, 700 megabyte contiguous partition

» Linux 2.0.23

» PostgreSQL 6.3.1

205


Test #   Outer   Index     Method         Time (s)
1        R       none      S-M              8.7704
2        G       none      S-M             34.1504
3        R       RB        NL (S, I)        2.4289
4        G       RB        NL (S, I)        0.5253
5        R       RB, GB    NL (I, I)        0.0951
6        G       RB, GB    NL (I, I)        0.0774
7        R       GB        NL (S, I)        1.6691
8        G       GB        NL (I, S)       11.1605
9        R       RH        NL (I, S)        1.4812
10       G       RH        NL (S, I)        0.5393
11       R       RH, GH    NL (I, I)        0.0479
12       G       RH, GH    NL (I, I)        0.0746
13       R       GH        NL (S, I)        1.6472
14       G       GH        NL (I, S)       11.2198
15       G       RB, GH    NL (I, I)        0.2276
16       R       RH, GB    NL (I, I)        0.0496
17       G       RH, GB    NL (I, I)        0.4074
18       R       RB, GH    NL (I, I)        0.0361
19       R       none      HJ (S, H, S)     4.0695
20       R       RB        HJ (S, H, I)     2.6571
21       R       RH        HJ (S, H, S)     5.7368

206


� PostgreSQL was selected because it supports user selectable index methods (hash, B-tree, R-tree), and because it has a reasonably robust planner/optimizer, supporting a variety of join algorithms

» Additionally, the system allows a user to restrict the methods to be used for join (a developer feature)

» It also supports a reasonable statistical reporting scheme which is necessary to measure performance accurately

� The tests fall into two categories: tests which exercise exact match joins, and tests which execute range selection joins

» The same two relations were used in every test

» To ensure that residual buffering did not affect the tests, the DBMS was shut down and restarted for each test

» Indexes were completely removed and rebuilt for each test, and no updates (inserts or deletes) occurred at any time during the tests

207


� Although only one set of values is given for each test, every test was executed at least twice to ensure that results were reasonable

� The two relations are very simple and derive their data from the Internet Movie Database (imdb.com)

» To generate enough data for the tests to be meaningful, the entire IMDB database was downloaded (over 40 megabytes of compressed data, 150 MB uncompressed), as well as a set of tools for accessing the data locally

» The tools allow for generation of reports of one attribute at a time

208


� Several small programs were written to filter the reports generated by these tools to build delimited text files suitable for loading by the PostgreSQL (or any other DBMS) copy facility

» Two relations were built:

1. Ratings ::= (Rating (float), Title char(128), Year char(6))

2. Genre ::= (Title char(128), Year char(6), Genre char(16))

� The Ratings relation was populated with titles from 1960 to 1998, yielding 55,000 tuples in the relation, with no duplicates

� The Genre relation was also populated with titles from 1960 to 1998 and comprised 14,500 tuples, and the categories were restricted to six or seven genres

» Titles are sometimes repeated as many as three times

209


� The sizes of these two relations were picked to be in an approximate 4:1 ratio so that the effect of the size of the inner and outer relations in nested loop joins could be seen easily

� The sizes of tuples in the two relations are nearly identical (142 vs 150) so that the fan-out in B-tree indexes was the same for either

� The queries used for exact match join are:

SELECT *

FROM Ratings R1, Genre R2

WHERE R1.Ti = 'Star Wars' AND R1.Ti = R2.Ti

and

SELECT *


WHERE R2.Ti = 'Star Wars' AND R2.Ti = R1.Ti

210

� The two queries are identical in form, but they yield very different results depending on which relation is indexed and on the position and size of each relation in the loop algorithm (and, I was surprised to learn, in the Sort-Merge join)

� The query used to test range selection joins is:

SELECT *


WHERE R1.Ti BETWEEN 'Ne' AND 'Oo' AND R1.Ti = R2.Ti

� The following table summarizes the results of the tests

» Tests 1 and 2 employ no indexes; these two tests are established as a baseline to contrast indexed joins to non-indexed joins

» Tests 3 to 8 use only B-Tree indexes

» Tests 9 to 14 repeat Tests 3 to 8 using Hash indexes rather than B-Trees

» Tests 15 to 18 test combinations of Hash and B-Tree indexes

» Finally, Tests 19 to 21 execute the range query using B-Tree and Hash indexes alternately


211

� Table legend is as follows:

» Test # is the number of the test executed

• Appendix A is a log file of the raw results of the tests

• The output from each test is clearly marked by test number

» Outer is the relation on which the attribute match is requested

» Index gives the relation (Ratings or Genre, or both) having an index. The type of index is provided in a superscript (hash or b-tree)

» Method names the method and gives the sequence of retrieval operations (Index, Scan, Hash) left to right in hierarchical order of execution

• For example, (I, S) indicates that an index is used on the outer relation and that the inner relation was scanned; (S, H, S) means Scan, Hash, Scan

• Given the outer relation in the Outer column, the inner relation can be deduced!

» Time lists the back-end execution time in seconds

• It does not include the time taken to pass query results from the back-end to the client


212

� Tests 1 through 18 each returned three tuples

» Tests 19 through 21, the range queries, returned 624 tuples

� The results in the table are consistent with the theory

» That is, a well-chosen index combined with a correctly structured query can provide dramatically better results than the same query without indexing

» Some general observations of the results show that:

• As expected, hash indexes perform somewhat better than tree-structured indexes when equi-joins are executed

• Unless there is a huge disparity in the size of the outer vs inner relation, the index works best on the inner relation

• Scans on the larger relation must be avoided

– Note that in tests 8 and 14, even though an index was present, it did little to offset the cost of scanning the larger relation

» If you’re able to index both relations, performance of joins is as good as it gets

• See test cases 5, 11, 12, 16, and 18

» It is interesting to examine the results of tests 1 and 2, in which no index was available

• The time for test 2 is almost four times that of test 1

• Although I have not checked the source code for the SM join, it’s likely that the algorithm has an outer controlling inner relationship similar to the NL join algorithm, since test 2 used the larger relation (R) as the inner relation

� Although hash indexes perform slightly better than B-trees in equi-joins, they offer little help in range joins, as indicated by tests 20 and 21

» Note that in test 21, the presence of the index led the planner to choose a plan which was apparently less effective than the plan used when there was no index. Compare tests 19 and 21

Experiment with Indexing (9/10) - Summary

213

Experiment with Indexing (10/10) - Conclusion

� The use of indexes clearly enhances performance of joins, both equi-joins and range checking joins

� However, an index alone cannot make up for a poorly constructed (or badly optimized) query

� Balancing size (of inner and outer relations) with speed (which relation is indexed) can be a tricky task, but can yield enormous benefits when done well

214

Experiment with Indexing – Appendix (1/23)

Schema for test relations

Table = ratings (54,000 tuples)

+--------------------------------+----------------------------------+--------+
| Field                          | Type                             | Length |
+--------------------------------+----------------------------------+--------+
| ra                             | float8 not null                  |      8 |
| ti                             | char() not null                  |    128 |
| yr                             | char() not null                  |      6 |
+--------------------------------+----------------------------------+--------+

Table = genre (14,500 tuples)

+--------------------------------+----------------------------------+--------+
| Field                          | Type                             | Length |
+--------------------------------+----------------------------------+--------+
| ti                             | char() not null                  |    128 |
| yr                             | char() not null                  |      6 |
| ge                             | char() not null                  |     16 |
+--------------------------------+----------------------------------+--------+

215


exact match query tests - return 3 rows

test1: R1=ratings, R2=genre, no index

test2: R1=genre, R2=ratings, no index

test3: R1=ratings, R2=genre, btree index on R1

test4: R1=genre, R2=ratings, btree index on R2

test5: R1=ratings, R2=genre, btree index on R1 and R2

test6: R1=genre, R2=ratings, btree index on R1 and R2


test8: R1=genre, R2=ratings, btree index on R1

test9: R1=ratings, R2=genre, hash index on R1

test10: R1=genre, R2=ratings, hash index on R2

test11: R1=ratings, R2=genre, hash index on R1 and R2

test12: R1=genre, R2=ratings, hash index on R1 and R2


test14: R1=genre, R2=ratings, hash index on R1

combination query tests - return 3 rows

test15: R1=genre, R2=ratings, hash index on R1, btree on R2

test16: R1=ratings, R2=genre, btree index on R1, hash on R2

test17: R1=genre, R2=ratings, btree index on R1, hash on R2

test18: R1=ratings, R2=genre, hash index on R1, btree on R2

range query tests - return 624 rows

test19: R1=genre, R2=ratings, no index



216


Test 1

--------

NOTICE: QUERY PLAN:

Merge Join (cost=0.00 size=1 width=68)

-> Seq Scan (cost=0.00 size=0 width=0)

-> Sort (cost=0.00 size=0 width=0)

-> Seq Scan on r1 (cost=0.00 size=0 width=32)




---- query is:

SELECT *

FROM ratings R1, genre R2

WHERE R1.Ti = 'Star Wars' AND

R1.Ti = R2.Ti

;

! system usage stats:

! 8.770406 elapsed 7.580000 user 1.100000 system sec

! [7.650000 user 1.130000 sys total]

! 0/0 [0/0] filesystem blocks in/out

! 1823/611 [2173/801] page faults/reclaims, 0 [0] swaps

! 0 [0] signals rcvd, 0/0 [0/0] messages rcvd/sent

! 0/0 [0/0] voluntary/involuntary context switches

! postgres usage stats:

! Shared blocks: 1741 read, 0 written, buffer hit rate = 1.64%

! Local blocks: 0 read, 0 written, buffer hit rate = 0.00%

! Direct blocks: 657 read, 737 written

217


Test 2

--------

NOTICE: QUERY PLAN:

Merge Join (cost=0.00 size=1 width=68)







---- query is:

SELECT *



R2.Ti = R1.Ti

;



! [28.900000 user 2.650000 sys total]









218


Test 3

--------

NOTICE: QUERY PLAN:

Nested Loop (cost=0.00 size=1 width=68)


-> Index Scan on r1 (cost=2.00 size=1 width=32)

---- query is:

SELECT *



R1.Ti = R2.Ti

;



! [1.630000 user 0.570000 sys total]









219


Test 4

--------

NOTICE: QUERY PLAN:




---- query is:

SELECT *



R2.Ti = R1.Ti

;



! [0.460000 user 0.100000 sys total]









220


Test 5

--------

NOTICE: QUERY PLAN:




---- query is:

SELECT *



R1.Ti = R2.Ti

;



! [0.130000 user 0.010000 sys total]









221


Test 6

--------

NOTICE: QUERY PLAN:




---- query is:

SELECT *



R2.Ti = R1.Ti

;



! [0.090000 user 0.040000 sys total]









222


Test 7

--------

NOTICE: QUERY PLAN:




---- query is:

SELECT *



R1.Ti = R2.Ti

;



! [1.330000 user 0.340000 sys total]









223


Test 8

--------

NOTICE: QUERY PLAN:




---- query is:

SELECT *



R2.Ti = R1.Ti

;



! [10.150000 user 1.010000 sys total]









224


Test 9

--------

NOTICE: QUERY PLAN:




---- query is:

SELECT *



R1.Ti = R2.Ti

;



! [1.050000 user 0.130000 sys total]









225


Test 10

--------

NOTICE: QUERY PLAN:




---- query is:

SELECT *



R2.Ti = R1.Ti

;



! [0.440000 user 0.120000 sys total]









226


Test 11

--------

NOTICE: QUERY PLAN:




---- query is:

SELECT *



R1.Ti = R2.Ti

;



! [0.110000 user 0.020000 sys total]









227


Test 12

--------

NOTICE: QUERY PLAN:




---- query is:

SELECT *



R2.Ti = R1.Ti

;



! [0.120000 user 0.030000 sys total]









228


Test 13

--------

NOTICE: QUERY PLAN:




---- query is:

SELECT *



R1.Ti = R2.Ti

;



! [1.280000 user 0.440000 sys total]









229


Test 14

--------

NOTICE: QUERY PLAN:




---- query is:

SELECT *



R2.Ti = R1.Ti

;



! [10.290000 user 0.950000 sys total]









230


Test 15

--------

NOTICE: QUERY PLAN:




---- query is:

SELECT *



R2.Ti = R1.Ti

;



! [0.110000 user 0.040000 sys total]









231


Test 16

--------

NOTICE: QUERY PLAN:




---- query is:

SELECT *



R1.Ti = R2.Ti

;



! [0.080000 user 0.060000 sys total]









232


Test 17

--------

NOTICE: QUERY PLAN:




---- query is:

SELECT *



R2.Ti = R1.Ti

;



! [0.150000 user 0.030000 sys total]









233


Test 18

--------

NOTICE: QUERY PLAN:




---- query is:

SELECT *



R1.Ti = R2.Ti

;



! [0.120000 user 0.010000 sys total]









234


Test 19

--------

NOTICE: QUERY PLAN:

Hash Join (cost=4684.39 size=6036 width=68)


-> Hash (cost=0.00 size=0 width=0)


---- query is:

SELECT *


WHERE R1.Ti between 'Ne' and 'Oo' AND

R1.Ti = R2.Ti

;



! [3.200000 user 0.440000 sys total]









235


Test 20

--------

NOTICE: QUERY PLAN:





---- query is:

SELECT *



R1.Ti = R2.Ti

;



! [1.930000 user 0.290000 sys total]









236


Test 21

--------

NOTICE: QUERY PLAN:





---- query is:

SELECT *



R1.Ti = R2.Ti

;



! [3.500000 user 1.100000 sys total]









237

A Simple Approach

� Read 2 blocks of R into memory

» Read 1 block of S into memory

» Check for join condition for all the “resident” records of R and S

» Repeat a total of 10,000 times, for all the blocks of S

� Repeat for every 2-block subset of R; this must be done a total of 500 times

� Total reads: 1 read of R + 500 reads of S = 5,001,000 block reads

238

A Simple Approach

� Read 2 blocks of S into memory

»Read 1 block of R into memory

»Check for join condition for all the “resident” records of R and S

»Repeat a total of 1,000 times, for all the blocks of R

� Repeat for every 2-block subset of S; this must be done a total of 5,000 times

� Total reads: 1 read of S + 5,000 reads of R = 5,010,000 block reads
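
The two totals above can be checked with a short calculation. This is a minimal sketch, assuming R occupies 1,000 blocks and S occupies 10,000 blocks (the sizes implied by the repeat counts on these slides); the function name `block_nested_loop_reads` is made up for illustration.

```python
import math


def block_nested_loop_reads(outer_blocks: int, inner_blocks: int, outer_buffer: int) -> int:
    """Block reads when `outer_buffer` blocks of the outer file stay in memory
    and the whole inner file is re-read once for each outer chunk."""
    chunks = math.ceil(outer_blocks / outer_buffer)
    return outer_blocks + chunks * inner_blocks


# Assumed file sizes, in blocks (implied by the slides, not stated here).
R_BLOCKS, S_BLOCKS = 1_000, 10_000

r_outer = block_nested_loop_reads(R_BLOCKS, S_BLOCKS, outer_buffer=2)
s_outer = block_nested_loop_reads(S_BLOCKS, R_BLOCKS, outer_buffer=2)
print(r_outer)  # 5001000
print(s_outer)  # 5010000
```

Holding the smaller file in the outer loop gives the slightly lower total, because the inner file is re-read fewer times.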

239

Reminder On Merge-Sort

� At every stage, we have sorted sequences of some length

� After each stage, the sorted sequences are twice as long and there are only half as many of them

� We are finished when we have one sorted sequence

� Three blocks are enough: two for reading the current sorted sequences and one for producing the new sorted sequence from them

� In the example, two sorted sequences of length three are merged; only the first three steps are shown

240

Merge-Join

� If R is sorted on B and S is sorted on C, we can use the merge-sort approach to join them

� While merging, we compare the current candidates and output the smaller (or any, if they are equal)

� While joining, we compare the current candidates, and if they are equal, we join them, otherwise we discard the smaller

� In the example below (where only the join attributes are shown), we would discard 11 and 43 and join 21 with 21 and 61 with 61
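
The compare-and-advance step described above can be sketched as follows. The two input lists are hypothetical stand-ins for the slide's figure (which is not reproduced here); they were chosen only so that 11 and 43 are discarded and 21 and 61 are joined. The sketch assumes unique join values on both sides.

```python
def merge_join(r_keys, s_keys):
    """Join two ascending lists of unique join-attribute values.

    Returns the joined pairs and the values discarded while scanning.
    Values left over once one input is exhausted are simply not visited.
    """
    i = j = 0
    joined, discarded = [], []
    while i < len(r_keys) and j < len(s_keys):
        if r_keys[i] == s_keys[j]:
            joined.append((r_keys[i], s_keys[j]))
            i += 1
            j += 1
        elif r_keys[i] < s_keys[j]:
            discarded.append(r_keys[i])  # smaller candidate cannot match later
            i += 1
        else:
            discarded.append(s_keys[j])
            j += 1
    return joined, discarded


# Hypothetical inputs: R sorted on B, S sorted on C.
joined, discarded = merge_join([11, 21, 61], [21, 43, 61])
print(joined)     # [(21, 21), (61, 61)]
print(discarded)  # [11, 43]
```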

241

Merge-Join

� The procedure:

» Sort R (using merge sort)

» Sort S (using merge sort)

» Join R and S

� To sort R

» Read 3 blocks of R, sort in place and write out sorted sequences of length of 3 blocks

» Merge sort R

� To sort S

» Read 3 blocks of S, sort in place and write out sorted sequences of length of 3 blocks

» Merge sort S

� Then merge-join

242

Performance Of Merge-Join

� To sort R, about 9 passes, where pass means read and write

� To sort S about 19 passes, where pass means read and write

� To merge-join R and S, one pass (at most the size of R, as we join on keys, so each row of R either has one row of S matching, or no matching row at all)

� Total cost, below 300,000 block accesses

243

Hash-Join

� Sorting could be considered in this context similar to the creation of an index

� Hashing can be used too, though more tricky to do in practice.

244

Order Of Joins Matters

� Consider a database consisting of 3 relations

» Lives(Person,City) about people in the US, about 300,000,000 tuples

» Oscar(Person) about people in the US who have won the Oscar, about 1,000 tuples

» Nobel(Person) about people in the US who have won the Nobel, about 100 tuples

� How would you answer the question, trying to do it most efficiently “by hand”?

� Produce the relation Good_Match(Person1,Person2) where the two Persons live in the same city and the first won the Oscar prize and the second won the Nobel prize

� How would you do it using SQL?

245

Order Of Joins Matters (Our Old Example)

� SELECT Oscar.Person Person1, Nobel.Person Person2

FROM Oscar, Lives Lives1, Nobel, Lives Lives2

WHERE Oscar.Person = Lives1.Person AND Nobel.Person = Lives2.Person

AND Lives1.City = Lives2.City

Very inefficient

� Using various joins we can specify easily the “right order,” in effect producing

» Oscar_PC(Person,City), listing people with Oscars and their cities

» Nobel_PC(Person,City), listing people with Nobels and their cities

� Then producing the result from these two small relations

� Effectively we do some WHERE conditions earlier, rather than later

� This is much more efficient
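
One way to spell out the "right order" in SQL is to build the two small relations as subqueries and join only those. The sketch below uses Python's built-in `sqlite3` module with a handful of made-up rows; the people and cities are hypothetical, and whether a given DBMS optimizer needs this rewriting is not something the slides state.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Lives(Person TEXT, City TEXT);
    CREATE TABLE Oscar(Person TEXT);
    CREATE TABLE Nobel(Person TEXT);
    -- Hypothetical sample rows, for illustration only.
    INSERT INTO Lives VALUES ('Ann', 'Boston'), ('Bob', 'Boston'),
                             ('Cy', 'Austin'), ('Di', 'Austin'), ('Ed', 'Boston');
    INSERT INTO Oscar VALUES ('Ann'), ('Cy');
    INSERT INTO Nobel VALUES ('Bob');
""")

# Build Oscar_PC and Nobel_PC first, then join the two small results on City.
query = """
    SELECT Oscar_PC.Person AS Person1, Nobel_PC.Person AS Person2
    FROM (SELECT Oscar.Person AS Person, Lives.City AS City
          FROM Oscar JOIN Lives ON Oscar.Person = Lives.Person) AS Oscar_PC
    JOIN (SELECT Nobel.Person AS Person, Lives.City AS City
          FROM Nobel JOIN Lives ON Nobel.Person = Lives.Person) AS Nobel_PC
      ON Oscar_PC.City = Nobel_PC.City
"""
good_match = conn.execute(query).fetchall()
print(good_match)  # [('Ann', 'Bob')]
```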

246

A Skeleton ISAM File Example

� Sometimes indexes, such as search trees, are tied to the physical device structure

» This is an older approach

» Cannot dynamically rebalance the tree

� ISAM = Indexed Sequential Access Method

247

Summary (1/2)

� Processing a Query

� Types of Single-level Ordered Indexes

» Primary Indexes

» Clustering Indexes

» Secondary Indexes

� Multilevel Indexes

� Best: clustered file and sparse index

� Hashing on disk

� Index on non-key fields

� Secondary index

� Use index for searches if it is likely to help

� SQL support

� Bitmap index

� Need to know how the system processes queries

� How to use indexes for some queries

� How to process some queries

� Merge Join

� Hash Join

� Order of joins

248

Summary (2/2)

� Cutting down relations before joining them

� Logical files: records

� Physical files: blocks

� Cost model: number of block accesses

� File organization

� Indexes

� Hashing

� 2-3 trees for range queries

� Dense index

� Sparse index

� Clustered file

� Unclustered file

� Dynamic Multilevel Indexes Using B-Trees and B+-Trees

� B+ trees

� Optimal parameter for B+ tree depending on disk and key size parameters

� Indexes on Multiple Keys

249


Agenda

0. Introduction to Query Processing

1. Translating SQL Queries into Relational Algebra

2. Algorithms for External Sorting

3. Algorithms for SELECT and JOIN Operations

4. Algorithms for PROJECT and SET Operations

5. Implementing Aggregate Operations and Outer Joins

6. Combining Operations using Pipelining

7. Using Heuristics in Query Optimization

8. Using Selectivity and Cost Estimates in Query Optimization

9. Overview of Query Optimization in Oracle

10. Semantic Query Optimization

251

0. Introduction to Query Processing (1)

� Query optimization:

»The process of choosing a suitable execution strategy for processing a query.

� Two internal representations of a query:

»Query Tree

»Query Graph

252

1. Translating SQL Queries into Relational Algebra (1)

� Query block:

» The basic unit that can be translated into the algebraic operators and optimized.

� A query block contains a single SELECT-FROM-WHERE expression, as well as GROUP BY and HAVING clauses if these are part of the block.

� Nested queries within a query are identified as separate query blocks.

� Aggregate operators in SQL must be included in the extended algebra.

253

Translating SQL Queries into Relational Algebra (2)

SELECT LNAME, FNAME

FROM EMPLOYEE

WHERE SALARY > ( SELECT MAX (SALARY)

FROM EMPLOYEE

WHERE DNO = 5);

SELECT MAX (SALARY)

FROM EMPLOYEE

WHERE DNO = 5

SELECT LNAME, FNAME

FROM EMPLOYEE

WHERE SALARY > C

πLNAME, FNAME (σSALARY>C (EMPLOYEE))

ℱMAX SALARY (σDNO=5 (EMPLOYEE))
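
The decomposition into an inner block (which yields the constant C) and an outer block can be tried out directly. This is a small sketch using Python's `sqlite3` module; the four EMPLOYEE rows are made up for illustration and only the columns used by the query are created.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE EMPLOYEE (LNAME TEXT, FNAME TEXT, SALARY REAL, DNO INTEGER)")
conn.executemany(
    "INSERT INTO EMPLOYEE VALUES (?, ?, ?, ?)",
    [  # hypothetical sample rows
        ("Smith", "John", 30000, 5),
        ("Wong", "Franklin", 40000, 5),
        ("Wallace", "Jennifer", 43000, 4),
        ("Borg", "James", 55000, 1),
    ],
)

# Inner block: evaluated once, producing the constant C.
(c,) = conn.execute("SELECT MAX(SALARY) FROM EMPLOYEE WHERE DNO = 5").fetchone()

# Outer block: uses C as an ordinary constant.
outer = conn.execute(
    "SELECT LNAME, FNAME FROM EMPLOYEE WHERE SALARY > ? ORDER BY LNAME", (c,)
).fetchall()

# The original nested query, for comparison.
nested = conn.execute(
    """SELECT LNAME, FNAME FROM EMPLOYEE
       WHERE SALARY > (SELECT MAX(SALARY) FROM EMPLOYEE WHERE DNO = 5)
       ORDER BY LNAME"""
).fetchall()

print(c)      # 40000.0
print(outer)  # [('Borg', 'James'), ('Wallace', 'Jennifer')]
```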

254

2. Algorithms for External Sorting (1)

� External sorting:

» Refers to sorting algorithms that are suitable for large files of records stored on disk that do not fit entirely in main memory, such as most database files.

� Sort-Merge strategy:

» Starts by sorting small subfiles (runs) of the main file and then merges the sorted runs, creating larger sorted subfiles that are merged in turn.

» Sorting phase: nR = ⌈b/nB⌉

» Merging phase: dM = Min(nB − 1, nR); nP = ⌈log_dM(nR)⌉

» nR: number of initial runs; b: number of file blocks;

» nB: available buffer space; dM: degree of merging;

» nP: number of passes.
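
A small calculation makes the run and pass counts concrete. The sketch below reads the two parenthesized expressions as ceilings (an assumption about the original notation) and counts merge passes with integer arithmetic to avoid floating-point logarithms. The inputs, a 1,000-block file with 3 buffers, are taken from the merge-join slides earlier in this session; the function name is made up.

```python
import math


def external_sort_plan(b: int, n_b: int):
    """Return (nR, dM, nP) for a file of b blocks sorted with n_b buffer blocks."""
    if n_b < 3:
        raise ValueError("merging needs at least 3 buffers (2 inputs, 1 output)")
    n_r = math.ceil(b / n_b)   # number of initial sorted runs
    d_m = min(n_b - 1, n_r)    # degree of merging
    passes, runs = 0, n_r
    while runs > 1:            # each pass merges d_m runs into one
        runs = math.ceil(runs / d_m)
        passes += 1
    return n_r, d_m, passes


print(external_sort_plan(1_000, 3))  # (334, 2, 9)
print(external_sort_plan(1_024, 5))  # (205, 4, 4)
```

With only three buffers the merge is two-way, which is why the pass count grows with the base-2 logarithm of the number of runs.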

255

3. Algorithms for SELECT and JOIN Operations (1)

� Implementing the SELECT Operation

� Examples:

» (OP1): σ SSN='123456789' (EMPLOYEE)

» (OP2): σ DNUMBER>5(DEPARTMENT)

» (OP3): σ DNO=5(EMPLOYEE)

» (OP4): σ DNO=5 AND SALARY>30000 AND SEX=F(EMPLOYEE)

» (OP5): σ ESSN=123456789 AND PNO=10(WORKS_ON)

256

Algorithms for SELECT and JOIN Operations (2)

� Implementing the SELECT Operation (contd.):

� Search Methods for Simple Selection:

» S1 Linear search (brute force):

• Retrieve every record in the file, and test whether its attribute values satisfy the selection condition.

» S2 Binary search:

• If the selection condition involves an equality comparison on a key attribute on which the file is ordered, binary search (which is more efficient than linear search) can be used. (See OP1).

» S3 Using a primary index or hash key to retrieve a single record:

• If the selection condition involves an equality comparison on a key attribute with a primary index (or a hash key), use the primary index (or the hash key) to retrieve the record.
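
Methods S1 and S2 can be sketched on an in-memory list standing in for a file. The records below are hypothetical, and the list is assumed to be physically ordered on SSN so that binary search applies to OP1; OP3 falls back to a linear scan because DNO is not the ordering key.

```python
def linear_search(records, predicate):
    """S1: examine every record and keep those that satisfy the condition."""
    return [r for r in records if predicate(r)]


def binary_search(records, key, value):
    """S2: equality search on the key attribute the file is ordered on."""
    lo, hi = 0, len(records) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if records[mid][key] == value:
            return records[mid]
        if records[mid][key] < value:
            lo = mid + 1
        else:
            hi = mid - 1
    return None


# Hypothetical EMPLOYEE file, ordered on the key attribute SSN.
employee = [
    {"SSN": "123456789", "DNO": 5},
    {"SSN": "333445555", "DNO": 5},
    {"SSN": "987654321", "DNO": 4},
]

op1 = binary_search(employee, "SSN", "123456789")       # OP1
op3 = linear_search(employee, lambda r: r["DNO"] == 5)  # OP3
print(op1)       # {'SSN': '123456789', 'DNO': 5}
print(len(op3))  # 2
```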

257


� Implementing the SELECT Operation (contd.):

� Search Methods for Simple Selection:

» S4 Using a primary index to retrieve multiple records:

• If the comparison condition is >, ≥, <, or ≤ on a key field with a primary index, use the index to find the record satisfying the corresponding equality condition, then retrieve all subsequent records in the (ordered) file.

» S5 Using a clustering index to retrieve multiple records:

• If the selection condition involves an equality comparison on a non-key attribute with a clustering index, use the clustering index to retrieve all the records satisfying the selection condition.

» S6 Using a secondary (B+-tree) index:

• On an equality comparison, this search method can be used to retrieve a single record if the indexing field has unique values (is a key) or to retrieve multiple records if the indexing field is not a key.

• In addition, it can be used to retrieve records on conditions involving >, >=, <, or <=. (FOR RANGE QUERIES)

258


� Implementing the SELECT Operation (contd.):

� Search Methods for Simple Selection:

» S7 Conjunctive selection:

• If an attribute involved in any single simple condition in the conjunctive condition has an access path that permits the use of one of the methods S2 to S6, use that condition to retrieve the records and then check whether each retrieved record satisfies the remaining simple conditions in the conjunctive condition.

» S8 Conjunctive selection using a composite index

• If two or more attributes are involved in equality conditions in the conjunctive condition and a composite index (or hash structure) exists on the combined field, we can use the index directly.

259


� Implementing the SELECT Operation (contd.):

� Search Methods for Complex Selection:

» S9 Conjunctive selection by intersection of record pointers:

• This method is possible if secondary indexes are available on all (or some of) the fields involved in equality comparison conditions in the conjunctive condition and if the indexes include record pointers (rather than block pointers).

• Each index can be used to retrieve the record pointers that satisfy the individual condition.

• The intersection of these sets of record pointers gives the record pointers that satisfy the conjunctive condition, which are then used to retrieve those records directly.

• If only some of the conditions have secondary indexes, each retrieved record is further tested to determine whether it satisfies the remaining conditions.
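
Method S9 can be illustrated with Python sets standing in for the sets of record pointers. In this sketch, list positions play the role of record pointers, secondary indexes exist on DNO and SALARY only, and the remaining condition on SEX is tested on each retrieved record. The data and helper names are hypothetical, and an equality on SALARY is used so that both conditions are equality comparisons.

```python
from collections import defaultdict


def build_secondary_index(records, field):
    """Map each field value to the set of record pointers (list positions)."""
    index = defaultdict(set)
    for rid, rec in enumerate(records):
        index[rec[field]].add(rid)
    return index


# Hypothetical EMPLOYEE records.
employee = [
    {"DNO": 5, "SALARY": 30000, "SEX": "F"},
    {"DNO": 5, "SALARY": 40000, "SEX": "M"},
    {"DNO": 4, "SALARY": 30000, "SEX": "F"},
    {"DNO": 5, "SALARY": 30000, "SEX": "M"},
]
dno_index = build_secondary_index(employee, "DNO")
salary_index = build_secondary_index(employee, "SALARY")


def select_conjunctive(dno, salary, sex):
    # Intersect the pointer sets of the two indexed equality conditions ...
    rids = dno_index.get(dno, set()) & salary_index.get(salary, set())
    # ... then fetch those records and test the un-indexed condition.
    return [employee[rid] for rid in sorted(rids) if employee[rid]["SEX"] == sex]


result = select_conjunctive(5, 30000, "F")
print(result)  # [{'DNO': 5, 'SALARY': 30000, 'SEX': 'F'}]
```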

260


� Implementing the SELECT Operation (contd.):

» Whenever a single condition specifies the selection, we can only check whether an access path exists on the attribute involved in that condition.

• If an access path exists, the method corresponding to that access path is used; otherwise, the “brute force” linear search approach of method S1 is used. (See OP1, OP2 and OP3)

» For conjunctive selection conditions, whenever more than one of the attributes involved in the conditions have an access path, query optimization should be done to choose the access path that retrieves the fewest records in the most efficient way.

» Disjunctive selection conditions

261


� Implementing the JOIN Operation:

» Join (EQUIJOIN, NATURAL JOIN)

• two-way join: a join on two files

• e.g. R ⋈A=B S

• multi-way joins: joins involving more than two files.

• e.g. R ⋈A=B S ⋈C=D T

� Examples

» (OP6): EMPLOYEE ⋈DNO=DNUMBER DEPARTMENT

» (OP7): DEPARTMENT ⋈MGRSSN=SSN EMPLOYEE

262


� Implementing the JOIN Operation (contd.):

� Methods for implementing joins:

» J1 Nested-loop join (brute force):

• For each record t in R (outer loop), retrieve every record s from S (inner loop) and test whether the two records satisfy the join condition t[A] = s[B].

» J2 Single-loop join (Using an access structure to retrieve the matching records):

• If an index (or hash key) exists for one of the two join attributes — say, B of S — retrieve each record t in R, one at a time, and then use the access structure to retrieve directly all matching records s from S that satisfy s[B] = t[A].
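
J1 and J2 can be compared on OP6 with a few in-memory records. In this sketch a Python dictionary plays the role of the access structure on DNUMBER; the records are illustrative sample data, not taken from the slides.

```python
from collections import defaultdict


def nested_loop_join(R, S, a, b):
    """J1: test every (t, s) pair against the join condition t[a] == s[b]."""
    return [(t, s) for t in R for s in S if t[a] == s[b]]


def single_loop_join(R, s_index, a):
    """J2: for each t in R, use the access structure on S to fetch matches."""
    result = []
    for t in R:
        for s in s_index.get(t[a], []):
            result.append((t, s))
    return result


# Hypothetical sample data for OP6.
employee = [
    {"LNAME": "Smith", "DNO": 5},
    {"LNAME": "Wong", "DNO": 5},
    {"LNAME": "Zelaya", "DNO": 4},
]
department = [
    {"DNAME": "Research", "DNUMBER": 5},
    {"DNAME": "Administration", "DNUMBER": 4},
]

# Stand-in for an index (or hash key) on DEPARTMENT.DNUMBER.
dnumber_index = defaultdict(list)
for s in department:
    dnumber_index[s["DNUMBER"]].append(s)

j1 = nested_loop_join(employee, department, "DNO", "DNUMBER")
j2 = single_loop_join(employee, dnumber_index, "DNO")
print(len(j1), j1 == j2)  # 3 True
```

J1 performs |R| × |S| comparisons, while J2 performs one index lookup per record of R.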

263



� Methods for implementing joins:

» J3 Sort-merge join:

• If the records of R and S are physically sorted (ordered) by value of the join attributes A and B, respectively, we can implement the join in the most efficient way possible.

• Both files are scanned in order of the join attributes, matching the records that have the same values for A and B.

• In this method, the records of each file are scanned only once each for matching with the other file—unless both A and B are non-key attributes, in which case the method needs to be modified slightly.

264



� Methods for implementing joins:

» J4 Hash-join:

• The records of files R and S are both hashed to the same hash file, using the same hashing function on the join attributes A of R and B of S as hash keys.

• A single pass through the file with fewer records (say, R) hashes its records to the hash file buckets.

• A single pass through the other file (S) then hashes each of its records to the appropriate bucket, where the record is combined with all matching records from R.
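
The two passes of J4 can be sketched with a Python dictionary standing in for the hash file. This assumes the hashed records of the smaller file fit in memory; the sample records are hypothetical.

```python
from collections import defaultdict


def hash_join(R, S, a, b):
    """J4: hash the smaller file R on attribute a, then probe with S on b."""
    buckets = defaultdict(list)
    for r in R:                        # single pass over the smaller file
        buckets[r[a]].append(r)
    result = []
    for s in S:                        # single pass over the other file
        for r in buckets.get(s[b], []):
            result.append((r, s))
    return result


# Hypothetical sample data: DEPARTMENT is the smaller file.
department = [
    {"DNAME": "Research", "DNUMBER": 5},
    {"DNAME": "Administration", "DNUMBER": 4},
]
employee = [
    {"LNAME": "Smith", "DNO": 5},
    {"LNAME": "Wong", "DNO": 5},
    {"LNAME": "Zelaya", "DNO": 4},
]

pairs = hash_join(department, employee, "DNUMBER", "DNO")
print([(r["DNAME"], s["LNAME"]) for r, s in pairs])
# [('Research', 'Smith'), ('Research', 'Wong'), ('Administration', 'Zelaya')]
```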

265



� Factors affecting JOIN performance

»Available buffer space

» Join selection factor

»Choice of inner VS outer relation

266


� Implementing the JOIN Operation (contd.):

� Other types of JOIN algorithms

� Partition hash join

» Partitioning phase:

• Each file (R and S) is first partitioned into M partitions using a partitioning hash function on the join attributes:

– R1, R2, R3, ..., Rm and S1, S2, S3, ..., Sm

• Minimum number of in-memory buffers needed for the partitioning phase: M+1.

• A disk sub-file is created per partition to store the tuples for that partition.

» Joining or probing phase:

• Involves M iterations, one per partitioned file.

• Iteration i involves joining partitions Ri and Si.

267



� Partitioned Hash Join Procedure:

» Assume Ri is smaller than Si.

1. Copy records from Ri into memory buffers.

2. Read all blocks from Si, one at a time; each record from Si is used to probe for matching record(s) from partition Ri.

3. Join each matching record from Ri with the record from Si and write the result into the result file.
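
The partitioning and probing phases can be sketched in memory as follows. Python lists stand in for the disk sub-files, the built-in `hash` is used as the partitioning function, and M = 4 is an arbitrary choice; all of these are assumptions made for illustration.

```python
def partition(records, attr, m):
    """Partitioning phase: split a file into m partitions by hashing attr."""
    parts = [[] for _ in range(m)]
    for rec in records:
        parts[hash(rec[attr]) % m].append(rec)
    return parts


def partition_hash_join(R, S, a, b, m=4):
    result = []
    # Probing phase: iteration i joins partitions Ri and Si only.
    for r_i, s_i in zip(partition(R, a, m), partition(S, b, m)):
        table = {}
        for r in r_i:                    # step 1: load Ri into memory
            table.setdefault(r[a], []).append(r)
        for s in s_i:                    # step 2: probe with each record of Si
            for r in table.get(s[b], []):
                result.append((r, s))    # step 3: write the joined record
    return result


# Hypothetical data: every S record matches exactly one R record.
R = [{"A": k} for k in range(10)]
S = [{"B": k % 5} for k in range(20)]

result = partition_hash_join(R, S, "A", "B")
print(len(result))  # 20
```

Because both files use the same partitioning function, matching records always land in partitions with the same number, so no cross-partition comparison is needed.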

268



� Cost analysis of partition hash join:

1. Reading and writing each record from R and S during the partitioning phase: (bR + bS) reads and (bR + bS) writes

2. Reading each record during the joining phase: (bR + bS)

3. Writing the result of the join: bRES

� Total Cost:

» 3 * (bR + bS) + bRES
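
As a worked instance of the total above, the sketch below plugs in bR = 1,000 and bS = 10,000 blocks (the file sizes implied by the earlier nested-loop slides) together with a made-up result size of 100 blocks.

```python
def partition_hash_join_cost(b_r: int, b_s: int, b_res: int) -> int:
    """Block accesses: partitioning (read + write), joining (read), result (write)."""
    partitioning = 2 * (b_r + b_s)
    joining = b_r + b_s
    return partitioning + joining + b_res


# b_res = 100 is a hypothetical result size chosen for the example.
cost = partition_hash_join_cost(1_000, 10_000, 100)
print(cost)  # 33100
```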

269


� Implementing the JOIN Operation (contd.):

� Hybrid hash join:

» Same as partitioned hash join except:

• Joining phase of one of the partitions is included during the partitioning phase.

» Partitioning phase:

• Allocate buffers for the smaller relation: one block for each of the M-1 partitions, remaining blocks to partition 1.

• Repeat for the larger relation in the pass through S.

» Joining phase:

• M-1 iterations are needed for the partitions R2, R3, R4, ..., Rm and S2, S3, S4, ..., Sm. R1 and S1 are joined during the partitioning of S1, and results of joining R1 and S1 are already written to the disk by the end of partitioning phase.

270

4. Algorithms for PROJECT and SET Operations (1)

� Algorithm for PROJECT operations (Figure 15.3b)

π<attribute list>(R)

1. If <attribute list> includes a key of relation R, extract all tuples from R with only the values for the attributes in <attribute list>.

2. If <attribute list> does NOT include a key of relation R, duplicate tuples must be removed from the results.

� Methods to remove duplicate tuples

1. Sorting

2. Hashing
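
Both duplicate-removal methods can be sketched on in-memory records. Here a Python `set` stands in for the hash structure and `sorted` for an external sort; the records and function names are hypothetical.

```python
def project_by_hashing(records, attrs, includes_key):
    """PROJECT; duplicates are removed with a hash set unless attrs has a key."""
    rows = [tuple(r[a] for a in attrs) for r in records]
    if includes_key:
        return rows            # a key guarantees there are no duplicates
    seen, result = set(), []
    for row in rows:
        if row not in seen:
            seen.add(row)
            result.append(row)
    return result


def project_by_sorting(records, attrs):
    """PROJECT; duplicates end up adjacent after sorting and are skipped."""
    rows = sorted(tuple(r[a] for a in attrs) for r in records)
    return [row for i, row in enumerate(rows) if i == 0 or row != rows[i - 1]]


# Hypothetical EMPLOYEE records; SSN is the key.
employee = [
    {"SSN": "123456789", "DNO": 5},
    {"SSN": "333445555", "DNO": 5},
    {"SSN": "987654321", "DNO": 4},
]

print(project_by_hashing(employee, ["DNO"], includes_key=False))  # [(5,), (4,)]
print(project_by_sorting(employee, ["DNO"]))                      # [(4,), (5,)]
```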

271

Algorithms for PROJECT and SET Operations (2)

� Algorithm for SET operations

� Set operations:

» UNION, INTERSECTION, SET DIFFERENCE and CARTESIAN PRODUCT

� CARTESIAN PRODUCT of relations R and S includes all possible combinations of records from R and S. The attributes of the result include all attributes of R and S.

� Cost analysis of CARTESIAN PRODUCT

» If R has n records and j attributes and S has m records and k attributes, the result relation will have n*m records and j+k attributes.

� CARTESIAN PRODUCT operation is very expensive and should be avoided if possible.

272

Algorithms for PROJECT and SET Operations (3)

� Algorithm for SET operations (contd.)

� UNION (See Figure 19.3c)

» Sort the two relations on the same attributes.

» Scan and merge both sorted files concurrently; whenever the same tuple exists in both relations, only one is kept in the merged results.

� INTERSECTION (See Figure 19.3d)

» Sort the two relations on the same attributes.

» Scan and merge both sorted files concurrently, keep in the merged results only those tuples that appear in both relations.

� SET DIFFERENCE R-S (See Figure 19.3e)

» Keep in the merged results only those tuples that appear in relation R but not in relation S.
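
All three operations share the same sort-then-merge scan, so they can be sketched in a single pass. This is an illustrative sketch, not a transcription of the textbook figures; it assumes each input relation is duplicate-free and uses made-up tuples.

```python
def sort_merge_set_ops(R, S):
    """Return (R UNION S, R INTERSECT S, R - S) with one merge scan."""
    r, s = sorted(R), sorted(S)
    i = j = 0
    union, intersection, difference = [], [], []
    while i < len(r) and j < len(s):
        if r[i] == s[j]:               # tuple in both: keep one copy
            union.append(r[i])
            intersection.append(r[i])
            i += 1
            j += 1
        elif r[i] < s[j]:              # tuple only in R
            union.append(r[i])
            difference.append(r[i])
            i += 1
        else:                          # tuple only in S
            union.append(s[j])
            j += 1
    # At most one of the two tails below is non-empty.
    union.extend(r[i:])
    difference.extend(r[i:])
    union.extend(s[j:])
    return union, intersection, difference


# Hypothetical union-compatible relations.
R = [(1, "a"), (3, "c"), (5, "e")]
S = [(3, "c"), (4, "d")]
union, intersection, difference = sort_merge_set_ops(R, S)
print(union)         # [(1, 'a'), (3, 'c'), (4, 'd'), (5, 'e')]
print(intersection)  # [(3, 'c')]
print(difference)    # [(1, 'a'), (5, 'e')]
```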

273

5. Implementing Aggregate Operations and Outer Joins (1)

� Implementing Aggregate Operations:

� Aggregate operators:

» MIN, MAX, SUM, COUNT and AVG

� Options to implement aggregate operators:

» Table Scan

» Index

� Example

» SELECT MAX (SALARY)

» FROM EMPLOYEE;

� If an (ascending) index on SALARY exists for the employee relation, then the optimizer could decide on traversing the index for the largest value, which would entail following the rightmost pointer in each index node from the root to a leaf.

274

Implementing Aggregate Operations and Outer Joins (2)

� Implementing Aggregate Operations (contd.):

� SUM, COUNT and AVG

� For a dense index (each record has one index entry):

» Apply the associated computation to the values in the index.

� For a non-dense index:

» Actual number of records associated with each index entry must be accounted for

� With GROUP BY: the aggregate operator must be applied separately to each group of tuples.

» Use sorting or hashing on the group attributes to partition the file into the appropriate groups;

» Compute the aggregate function for the tuples in each group.

� What if we have Clustering index on the grouping attributes?

275


� Implementing Outer Join:

� Outer Join Operators:

» LEFT OUTER JOIN

» RIGHT OUTER JOIN

» FULL OUTER JOIN.

� The full outer join produces a result which is equivalent to the union of the results of the left and right outer joins.

� Example:

SELECT FNAME, DNAME

FROM (EMPLOYEE LEFT OUTER JOIN DEPARTMENT ON DNO = DNUMBER);

� Note: The result of this query is a table of employee names and their associated departments. It is similar to a regular join result, with the exception that if an employee does not have an associated department, the employee's name will still appear in the resulting table, although the department name would be indicated as null.

276


� Implementing Outer Join (contd.):

� Modifying Join Algorithms:

» Nested Loop or Sort-Merge joins can be modified to implement outer join. E.g.,

• For left outer join, use the left relation as outer relation and construct result from every tuple in the left relation.

• If there is a match, the concatenated tuple is saved in the result.

• However, if an outer tuple does not match, then the tuple is still included in the result but is padded with a null value(s).
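
The modification to the nested-loop join can be sketched as follows, with `None` playing the role of the null padding. The records are hypothetical; the employee without a department is included to show the padded tuple.

```python
def left_outer_join(left, right, a, b, right_attrs):
    """Nested-loop left outer join on left[a] == right[b]."""
    result = []
    for t in left:                       # left relation is the outer loop
        matched = False
        for s in right:
            if t[a] == s[b]:
                result.append({**t, **s})
                matched = True
        if not matched:                  # keep the tuple, padded with nulls
            result.append({**t, **{attr: None for attr in right_attrs}})
    return result


# Hypothetical sample data.
employee = [
    {"FNAME": "John", "DNO": 5},
    {"FNAME": "James", "DNO": None},     # no associated department
]
department = [{"DNAME": "Research", "DNUMBER": 5}]

rows = left_outer_join(employee, department, "DNO", "DNUMBER", ["DNAME", "DNUMBER"])
print([(r["FNAME"], r["DNAME"]) for r in rows])
# [('John', 'Research'), ('James', None)]
```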

277


� Implementing Outer Join (contd.):

� Executing a combination of relational algebra operators.

� Implement the previous left outer join example

» {Compute the JOIN of the EMPLOYEE and DEPARTMENT tables}

• TEMP1 ← πFNAME,DNAME (EMPLOYEE ⋈DNO=DNUMBER DEPARTMENT)

» {Find the EMPLOYEEs that do not appear in the JOIN}

• TEMP2 ← πFNAME (EMPLOYEE) - πFNAME (TEMP1)

» {Pad each tuple in TEMP2 with a null DNAME field}

• TEMP2 ← TEMP2 x 'null'

» {UNION the temporary tables to produce the LEFT OUTER JOIN}

• RESULT ← TEMP1 ∪ TEMP2

� The cost of the outer join, as computed above, would include the cost of the associated steps (i.e., join, projections and union).

278

6. Combining Operations using Pipelining (1)

� Motivation

» A query is mapped into a sequence of operations.

» Each execution of an operation produces a temporary result.

» Generating and saving temporary files on disk is time consuming and expensive.

� Alternative:

» Avoid constructing temporary results as much as possible.

» Pipeline the data through multiple operations - pass the result of a previous operator to the next without waiting to complete the previous operation.

279

Combining Operations using Pipelining (2)

� Example:

»For a 2-way join, combine the 2 selections on the input and one projection on the output with the Join.

� Dynamic generation of code to allow for multiple operations to be pipelined.

� Results of a select operation are fed in a "Pipeline" to the join algorithm.

� Also known as stream-based processing.
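
Python generators give a compact picture of stream-based processing: each operator pulls rows from the one below it, so no intermediate result is stored between the selection, the join and the projection. This is a sketch under simplifying assumptions: the data is hypothetical, and the selected DEPARTMENT rows are materialized into an in-memory lookup table for the join, so only the EMPLOYEE side is fully pipelined.

```python
def select(rows, predicate):
    for r in rows:
        if predicate(r):
            yield r


def join(left_rows, right_index, a):
    for t in left_rows:
        for s in right_index.get(t[a], []):
            yield {**t, **s}


def project(rows, attrs):
    for r in rows:
        yield tuple(r[a] for a in attrs)


# Hypothetical sample data.
employee = [
    {"LNAME": "Smith", "SALARY": 30000, "DNO": 5},
    {"LNAME": "Wong", "SALARY": 40000, "DNO": 5},
    {"LNAME": "Wallace", "SALARY": 43000, "DNO": 4},
]
department = [
    {"DNAME": "Research", "DNUMBER": 5},
    {"DNAME": "Administration", "DNUMBER": 4},
]

# Selection on the second input, materialized as the join's lookup table.
dept_index = {}
for d in select(department, lambda d: d["DNUMBER"] == 5):
    dept_index.setdefault(d["DNUMBER"], []).append(d)

# Selection on the first input, join and projection form one pipeline.
pipeline = project(
    join(select(employee, lambda e: e["SALARY"] > 30000), dept_index, "DNO"),
    ["LNAME", "DNAME"],
)
output = list(pipeline)
print(output)  # [('Wong', 'Research')]
```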

280

7. Using Heuristics in Query Optimization (1)

� Process for heuristics optimization

1. The parser of a high-level query generates an initial internal representation;

2. Apply heuristics rules to optimize the internal representation.

3. A query execution plan is generated to execute groups of operations based on the access paths available on the files involved in the query.

� The main heuristic is to apply first the operations that reduce the size of intermediate results.

» E.g., Apply SELECT and PROJECT operations before applying the JOIN or other binary operations.

281

Using Heuristics in Query Optimization (2)

� Query tree:

» A tree data structure that corresponds to a relational algebra expression. It represents the input relations of the query as leaf nodes of the tree, and represents the relational algebra operations as internal nodes.

� An execution of the query tree consists of executing an internal node operation whenever its operands are available and then replacing that internal node by the relation that results from executing the operation.

� Query graph:

» A graph data structure that corresponds to a relational calculus expression. It does not indicate an order on which operations to perform first. There is only a single graph corresponding to each query.

282


� Example:

» For every project located in ‘Stafford’, retrieve the project number, the controlling department number and the department manager’s last name, address and birthdate.

� Relational algebra:

πPNUMBER, DNUM, LNAME, ADDRESS, BDATE (((σPLOCATION=‘STAFFORD’ (PROJECT)) ⋈DNUM=DNUMBER (DEPARTMENT)) ⋈MGRSSN=SSN (EMPLOYEE))

� SQL query:

Q2: SELECT P.PNUMBER, P.DNUM, E.LNAME, E.ADDRESS, E.BDATE

FROM PROJECT AS P, DEPARTMENT AS D, EMPLOYEE AS E

WHERE P.DNUM=D.DNUMBER AND D.MGRSSN=E.SSN AND P.PLOCATION=‘STAFFORD’;

283


▪ Heuristic Optimization of Query Trees:

» The same query could correspond to many different relational algebra expressions, and hence many different query trees.

» The task of heuristic optimization of query trees is to find a final query tree that is efficient to execute.

▪ Example:

Q: SELECT LNAME
   FROM EMPLOYEE, WORKS_ON, PROJECT
   WHERE PNAME = 'AQUARIUS' AND PNUMBER = PNO AND ESSN = SSN
         AND BDATE > '1957-12-31';

284


▪ General Transformation Rules for Relational Algebra Operations:

1. Cascade of σ: A conjunctive selection condition can be broken up into a cascade (sequence) of individual σ operations:

» $\sigma_{c_1 \text{ AND } c_2 \text{ AND } \ldots \text{ AND } c_n}(R) = \sigma_{c_1}(\sigma_{c_2}(\ldots(\sigma_{c_n}(R))\ldots))$

2. Commutativity of σ: The σ operation is commutative:

» $\sigma_{c_1}(\sigma_{c_2}(R)) = \sigma_{c_2}(\sigma_{c_1}(R))$

3. Cascade of π: In a cascade (sequence) of π operations, all but the last one can be ignored:

» $\pi_{List_1}(\pi_{List_2}(\ldots(\pi_{List_n}(R))\ldots)) = \pi_{List_1}(R)$

4. Commuting σ with π: If the selection condition c involves only the attributes A1, ..., An in the projection list, the two operations can be commuted:

» $\pi_{A_1, A_2, \ldots, A_n}(\sigma_c(R)) = \sigma_c(\pi_{A_1, A_2, \ldots, A_n}(R))$

285


▪ General Transformation Rules for Relational Algebra Operations (contd.):

5. Commutativity of ⋈ (and ×): The ⋈ operation is commutative, as is the × operation:

» $R \bowtie_c S = S \bowtie_c R$; $R \times S = S \times R$

6. Commuting σ with ⋈ (or ×): If all the attributes in the selection condition c involve only the attributes of one of the relations being joined, say R, the two operations can be commuted as follows:

» $\sigma_c(R \bowtie S) = (\sigma_c(R)) \bowtie S$

▪ Alternatively, if the selection condition c can be written as (c1 AND c2), where condition c1 involves only the attributes of R and condition c2 involves only the attributes of S, the operations commute as follows:

» $\sigma_c(R \bowtie S) = (\sigma_{c_1}(R)) \bowtie (\sigma_{c_2}(S))$

286



7. Commuting π with ⋈ (or ×): Suppose that the projection list is L = {A1, ..., An, B1, ..., Bm}, where A1, ..., An are attributes of R and B1, ..., Bm are attributes of S. If the join condition c involves only attributes in L, the two operations can be commuted as follows:

» $\pi_L(R \bowtie_c S) = (\pi_{A_1, \ldots, A_n}(R)) \bowtie_c (\pi_{B_1, \ldots, B_m}(S))$

▪ If the join condition c contains additional attributes not in L, these must be added to the projection list, and a final π operation is needed.

287



8. Commutativity of set operations: The set operations ∪ and ∩ are commutative but − is not.

9. Associativity of ⋈, ×, ∪, and ∩: These four operations are individually associative; that is, if θ stands for any one of these four operations (throughout the expression), we have

» $(R \;\theta\; S) \;\theta\; T = R \;\theta\; (S \;\theta\; T)$

10. Commuting σ with set operations: The σ operation commutes with ∪, ∩, and −. If θ stands for any one of these three operations, we have

» $\sigma_c(R \;\theta\; S) = (\sigma_c(R)) \;\theta\; (\sigma_c(S))$

288



11. Commuting π with ∪: The π operation commutes with ∪:

» $\pi_L(R \cup S) = (\pi_L(R)) \cup (\pi_L(S))$

12. Converting a (σ, ×) sequence into ⋈: If the condition c of a σ that follows a × corresponds to a join condition, convert the (σ, ×) sequence into a ⋈ as follows:

» $\sigma_c(R \times S) = R \bowtie_c S$

� Other transformations
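The rule for commuting a selection with a join can be checked on a small instance. The sketch below is an illustration only: the helper names (`theta_join`, `select`) and the sample tuples are assumptions made for the example. It compares selecting after the join with selecting before it, when the selection condition involves only attributes of one relation.

```python
def theta_join(r, s, cond):
    """Nested-loop theta join: concatenate every pair of tuples satisfying cond."""
    return [{**a, **b} for a in r for b in s if cond(a, b)]


def select(rel, cond):
    """Relational SELECT: keep the tuples that satisfy cond."""
    return [t for t in rel if cond(t)]


# Hypothetical sample relations, invented for this sketch.
EMPLOYEE = [
    {"SSN": 1, "DNO": 5, "SALARY": 30000},
    {"SSN": 2, "DNO": 4, "SALARY": 43000},
    {"SSN": 3, "DNO": 5, "SALARY": 55000},
]
DEPARTMENT = [
    {"DNUMBER": 4, "DNAME": "Administration"},
    {"DNUMBER": 5, "DNAME": "Research"},
]


def on_dept(e, d):
    """Join condition: EMPLOYEE.DNO = DEPARTMENT.DNUMBER."""
    return e["DNO"] == d["DNUMBER"]


def high_paid(t):
    """Selection condition that uses only an attribute of EMPLOYEE."""
    return t["SALARY"] > 40000


# Left-hand side: select after joining (larger intermediate result).
late = select(theta_join(EMPLOYEE, DEPARTMENT, on_dept), high_paid)
# Right-hand side: push the selection below the join (smaller join input).
early = theta_join(select(EMPLOYEE, high_paid), DEPARTMENT, on_dept)
```

Both plans return the same tuples, but the second one feeds two EMPLOYEE tuples into the join instead of three, which is the effect the heuristic optimizer is after.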

289


▪ Outline of a Heuristic Algebraic Optimization Algorithm:

1. Using rule 1, break up any select operations with conjunctive conditions into a cascade of select operations.

2. Using rules 2, 4, 6, and 10 concerning the commutativity of select with other operations, move each select operation as far down the query tree as is permitted by the attributes involved in the select condition.

3. Using rule 9 concerning associativity of binary operations, rearrange the leaf nodes of the tree so that the leaf node relations with the most restrictive select operations are executed first in the query tree representation.

4. Using rule 12, combine a Cartesian product operation with a subsequent select operation in the tree into a join operation.

5. Using rules 3, 4, 7, and 11 concerning the cascading of project and the commuting of project with other operations, break down and move lists of projection attributes down the tree as far as possible by creating new project operations as needed.

6. Identify subtrees that represent groups of operations that can be executed by a single algorithm.

290


▪ Summary of Heuristics for Algebraic Optimization:

1. The main heuristic is to apply first the operations that reduce the size of intermediate results.

2. Perform select operations as early as possible to reduce the number of tuples and perform project operations as early as possible to reduce the number of attributes. (This is done by moving select and project operations as far down the tree as possible.)

3. The select and join operations that are most restrictive should be executed before other similar operations. (This is done by reordering the leaf nodes of the tree among themselves and adjusting the rest of the tree appropriately.)

291


� Query Execution Plans

» An execution plan for a relational algebra query consists of a combination of the relational algebra query tree and information about the access methods to be used for each relation as well as the methods to be used in computing the relational operators stored in the tree.

» Materialized evaluation: the result of an operation is stored as a temporary relation.

» Pipelined evaluation: as the result of an operator is produced, it is forwarded to the next operator in sequence.

292

8. Using Selectivity and Cost Estimates in Query Optimization (1)

� Cost-based query optimization:

»Estimate and compare the costs of executing a query using different execution strategies and choose the strategy with the lowest cost estimate.

» (Compare to heuristic query optimization)

� Issues

»Cost function

»Number of execution strategies to be considered

293

Using Selectivity and Cost Estimates in Query Optimization (2)

� Cost Components for Query Execution

1. Access cost to secondary storage

2. Storage cost

3. Computation cost

4. Memory usage cost

5. Communication cost

� Note: Different database systems may focus on different cost components.

294


� Catalog Information Used in Cost Functions

» Information about the size of a file

• Number of records (tuples) (r)

• Record size (R)

• Number of blocks (b)

• Blocking factor (bfr)

» Information about indexes and indexing attributes of a file

• Number of levels (x) of each multilevel index

• Number of first-level index blocks (bI1)

• Number of distinct values (d) of an attribute

• Selectivity (sl) of an attribute

• Selection cardinality (s) of an attribute. (s = sl * r)
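The catalog quantities above are related by simple arithmetic, sketched below. Several things here are assumptions that the slide does not state: a block size `B` in bytes, unspanned fixed-length records (so bfr = floor(B / R)), and a uniform distribution of attribute values (so sl = 1/d). The function name and the sample numbers are illustrative only.

```python
import math


def file_statistics(r, R, B, d):
    """Derive catalog statistics for one file and one attribute.

    r: number of records, R: record size in bytes,
    B: block size in bytes (assumed, not given on the slide),
    d: number of distinct values of the attribute.
    """
    bfr = B // R              # blocking factor, assuming unspanned records
    b = math.ceil(r / bfr)    # number of blocks needed to hold r records
    sl = 1 / d                # selectivity, assuming uniformly distributed values
    s = r / d                 # selection cardinality: s = sl * r = r / d
    return {"bfr": bfr, "b": b, "sl": sl, "s": s}


# Illustrative numbers: 10,000 records of 100 bytes, 512-byte blocks,
# and an attribute with 125 distinct values.
stats = file_statistics(r=10_000, R=100, B=512, d=125)
```

With these inputs the file occupies 2,000 blocks at 5 records per block, and an equality condition on the attribute is expected to select about 80 records.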

295


▪ Examples of Cost Functions for SELECT

▪ S1. Linear search (brute force) approach:

» $C_{S1a} = b$

» For an equality condition on a key, $C_{S1b} = b/2$ on average if the record is found; otherwise $C_{S1b} = b$.

▪ S2. Binary search:

» $C_{S2} = \log_2 b + (s/bfr) - 1$

» For an equality condition on a unique (key) attribute, $C_{S2} = \log_2 b$

▪ S3. Using a primary index (S3a) or hash key (S3b) to retrieve a single record:

» $C_{S3a} = x + 1$

» $C_{S3b} = 1$ for static or linear hashing

» $C_{S3b} = 2$ for extendible hashing

296


� Examples of Cost Functions for SELECT (contd.)

▪ S4. Using an ordering index to retrieve multiple records:

» For a comparison condition on a key field with an ordering index, $C_{S4} = x + (b/2)$

▪ S5. Using a clustering index to retrieve multiple records:

» $C_{S5} = x + \lceil s/bfr \rceil$

▪ S6. Using a secondary (B+-tree) index:

» For an equality comparison, $C_{S6a} = x + s$

» For a comparison condition such as >, <, >=, or <=, $C_{S6b} = x + (b_{I1}/2) + (r/2)$

297


� Examples of Cost Functions for SELECT (contd.)

▪ S7. Conjunctive selection:

» Use either S1 or one of the methods S2 to S6.

» For the latter case, use one condition to retrieve the records and then check in the memory buffer whether each retrieved record satisfies the remaining conditions in the conjunction.

▪ S8. Conjunctive selection using a composite index:

» Same as S3a, S5, or S6a, depending on the type of index.

� Examples of using the cost functions.
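As one way to use the cost functions, the sketch below encodes a few of them and lets a toy optimizer pick the cheapest access path for an equality selection. The function names and all numeric values (block count, index levels, selection cardinalities) are assumptions chosen for illustration; a real optimizer reads them from the catalog.

```python
import math


def cost_linear_search(b):
    """S1: scan all b blocks of the file."""
    return b


def cost_linear_search_key_found(b):
    """S1, equality on a key with the record found: half the blocks on average."""
    return b / 2


def cost_primary_index(x):
    """S3a: traverse x index levels, then read one data block."""
    return x + 1


def cost_clustering_index(x, s, bfr):
    """S5: traverse x index levels, then read the blocks holding the s records."""
    return x + math.ceil(s / bfr)


def cost_secondary_index_equality(x, s):
    """S6a: traverse x index levels, then one block access per matching record."""
    return x + s


# Illustrative statistics for a hypothetical EMPLOYEE file.
b, bfr = 2000, 5

# Candidate access paths for an equality condition on a nonkey attribute DNO,
# assuming a secondary index with x = 2 levels and s = 80 matching records.
candidates = {
    "S1 linear search": cost_linear_search(b),
    "S6a secondary index on DNO": cost_secondary_index_equality(x=2, s=80),
}
best = min(candidates, key=candidates.get)

# A clustering index with x = 3 and s = 20 would cost 3 + ceil(20 / 5) blocks.
clustering_cost = cost_clustering_index(x=3, s=20, bfr=bfr)
```

Here the secondary index wins by a wide margin (82 block accesses against 2,000), which is the kind of comparison cost-based optimization performs for every candidate strategy.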

298


� Examples of Cost Functions for JOIN

» Join selectivity (js)

» $js = |R \bowtie_c S| \,/\, |R \times S| = |R \bowtie_c S| \,/\, (|R| \cdot |S|)$

• If there is no join condition c, js = 1;

• If no tuples from the relations satisfy condition c, js = 0;

• In general, 0 <= js <= 1.

▪ Size of the result file after the join operation

» $|R \bowtie_c S| = js \cdot |R| \cdot |S|$
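Join selectivity can be measured directly on a small instance, as in the sketch below. The helper `theta_join` and the sample tuples are assumptions made for the example; in practice the optimizer estimates js from catalog statistics rather than by running the join.

```python
def theta_join(r, s, cond):
    """Nested-loop theta join returning the concatenated matching tuples."""
    return [{**a, **b} for a in r for b in s if cond(a, b)]


# Hypothetical sample relations, invented for this sketch.
EMPLOYEE = [{"SSN": 1, "DNO": 5}, {"SSN": 2, "DNO": 4}, {"SSN": 3, "DNO": 5}]
DEPARTMENT = [{"DNUMBER": 4}, {"DNUMBER": 5}]

joined = theta_join(EMPLOYEE, DEPARTMENT, lambda e, d: e["DNO"] == d["DNUMBER"])

# js = |R join S| / (|R| * |S|)
js = len(joined) / (len(EMPLOYEE) * len(DEPARTMENT))

# Going the other way: the result size predicted from js.
estimated_size = js * len(EMPLOYEE) * len(DEPARTMENT)
```

Each employee matches exactly one department here, so the join returns 3 of the 6 possible pairs and js is 0.5.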

299


� Examples of Cost Functions for JOIN (contd.)

▪ J1. Nested-loop join:

» $C_{J1} = b_R + (b_R \cdot b_S) + ((js \cdot |R| \cdot |S|)/bfr_{RS})$

» (Use R for the outer loop)

� J2. Single-loop join (using an access structure to retrieve the matching record(s))

» If an index exists for the join attribute B of S with index levels xB, we can retrieve each record s in R and then use the index to retrieve all the matching records t from S that satisfy t[B] = s[A].

» The cost depends on the type of index.

300


▪ Examples of Cost Functions for JOIN (contd.)

▪ J2. Single-loop join (contd.)

» For a secondary index:

• $C_{J2a} = b_R + (|R| \cdot (x_B + s_B)) + ((js \cdot |R| \cdot |S|)/bfr_{RS})$

» For a clustering index:

• $C_{J2b} = b_R + (|R| \cdot (x_B + (s_B/bfr_B))) + ((js \cdot |R| \cdot |S|)/bfr_{RS})$

» For a primary index:

• $C_{J2c} = b_R + (|R| \cdot (x_B + 1)) + ((js \cdot |R| \cdot |S|)/bfr_{RS})$

» If a hash key exists for one of the two join attributes, say B of S:

• $C_{J2d} = b_R + (|R| \cdot h) + ((js \cdot |R| \cdot |S|)/bfr_{RS})$, where h >= 1 is the average number of block accesses needed to retrieve a record given its hash key value

▪ J3. Sort-merge join:

• $C_{J3a} = C_S + b_R + b_S + ((js \cdot |R| \cdot |S|)/bfr_{RS})$

• ($C_S$: cost of sorting the files)
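The join cost formulas can be compared numerically, as in the sketch below. The function names and every statistic used (file sizes, join selectivity, index levels, blocking factor of the result) are assumptions chosen for illustration. `Fraction` is used only to keep the arithmetic exact.

```python
from fractions import Fraction


def result_write_cost(js, r_R, r_S, bfr_RS):
    """Blocks needed to write the join result: (js * |R| * |S|) / bfr_RS."""
    return js * r_R * r_S / bfr_RS


def cost_nested_loop(b_outer, b_inner, write_cost):
    """J1: read the outer file once and the inner file once per outer block."""
    return b_outer + b_outer * b_inner + write_cost


def cost_single_loop_primary_index(b_R, r_R, x_B, write_cost):
    """J2 with a primary index on S: one index probe (x_B + 1) per record of R."""
    return b_R + r_R * (x_B + 1) + write_cost


# Illustrative statistics for hypothetical EMPLOYEE (E) and DEPARTMENT (D) files.
r_E, b_E = 10_000, 2_000
r_D, b_D = 125, 13
js = Fraction(1, 125)   # each employee joins with exactly one department
write = result_write_cost(js, r_E, r_D, bfr_RS=4)

employee_outer = cost_nested_loop(b_E, b_D, write)
department_outer = cost_nested_loop(b_D, b_E, write)
index_probe = cost_single_loop_primary_index(b_E, r_E, x_B=1, write_cost=write)
```

Under these assumptions the nested-loop join is cheaper with the smaller file in the outer loop (28,513 against 30,500 block accesses), and probing a one-level primary index on DEPARTMENT is cheaper still (24,500).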

301


▪ Multiple Relation Queries and Join Ordering

» A query joining n relations will have n-1 join operations, and hence can have a large number of different join orders when we apply the algebraic transformation rules.

» Current query optimizers typically limit the structure of a (join) query tree to that of left-deep (or right-deep) trees.

▪ Left-deep tree:

» A binary tree where the right child of each non-leaf node is always a base relation.

• Amenable to pipelining

• Could utilize any access paths on the base relation (the right child) when executing the join.
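To make the shape of a left-deep tree concrete, the sketch below enumerates all left-deep join orders for a list of relations, representing each join as a `(left, right)` tuple. The function name and the nested-tuple representation are assumptions made for the example; a real optimizer would also cost each order rather than just list it.

```python
from itertools import permutations


def left_deep_orders(relations):
    """Return every left-deep join tree over the given relations.

    Each tree is a nested tuple (left_subtree, base_relation), so the right
    child of every join node is always a base relation.
    """
    trees = []
    for order in permutations(relations):
        tree = order[0]
        for relation in order[1:]:
            tree = (tree, relation)   # previous result on the left, base relation on the right
        trees.append(tree)
    return trees


orders = left_deep_orders(["EMPLOYEE", "WORKS_ON", "PROJECT"])
```

For three relations this yields six trees, one per permutation; restricting the search to this shape keeps the number of candidate plans at n! for n relations.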

302

9. Overview of Query Optimization in Oracle

▪ Oracle DBMS V8

» Rule-based query optimization: the optimizer chooses execution plans based on heuristically ranked operations.

• (Currently it is being phased out)

» Cost-based query optimization: the optimizer examines alternative access paths and operator algorithms and chooses the execution plan with the lowest estimated cost.

• The query cost is calculated based on the estimated usage of resources such as I/O, CPU, and memory.

» Application developers can specify hints to the Oracle query optimizer.

» The idea is that an application developer might know more about the data than the optimizer does.

303

10. Semantic Query Optimization

▪ Semantic Query Optimization:

» Uses constraints specified on the database schema in order to modify one query into another query that is more efficient to execute.

▪ Consider the following SQL query:

SELECT E.LNAME, M.LNAME
FROM EMPLOYEE AS E, EMPLOYEE AS M
WHERE E.SUPERSSN = M.SSN AND E.SALARY > M.SALARY

▪ Explanation:

» Suppose that we had a constraint on the database schema that stated that no employee can earn more than his or her direct supervisor. If the semantic query optimizer checks for the existence of this constraint, it need not execute the query at all because it knows that the result of the query will be empty. Techniques known as theorem proving can be used for this purpose.

304

Summary

0. Introduction to Query Processing

1. Translating SQL Queries into Relational Algebra

2. Algorithms for External Sorting

3. Algorithms for SELECT and JOIN Operations

4. Algorithms for PROJECT and SET Operations

5. Implementing Aggregate Operations and Outer Joins

6. Combining Operations using Pipelining

7. Using Heuristics in Query Optimization

8. Using Selectivity and Cost Estimates in Query Optimization

9. Overview of Query Optimization in Oracle

10. Semantic Query Optimization

305


Agenda

� Physical Database Design in Relational Databases

� Database Tuning in Relational Systems

307

1. Physical Database Design in Relational Databases (1)

▪ Factors that Influence Physical Database Design:

A. Analyzing the database queries and transactions

» For each query, the following information is needed:

1. The files that will be accessed by the query;

2. The attributes on which any selection conditions for the query are specified;

3. The attributes on which any join conditions or conditions to link multiple tables or objects for the query are specified;

4. The attributes whose values will be retrieved by the query.

» Note: the attributes listed in items 2 and 3 above are candidates for definition of access structures.

308

Physical Database Design in Relational Databases (2)

▪ Factors that Influence Physical Database Design (cont.):

A. Analyzing the database queries and transactions (cont.)

» For each update transaction or operation, the following information is needed:

1. The files that will be updated;

2. The type of operation on each file (insert, update, or delete);

3. The attributes on which selection conditions for a delete or update operation are specified;

4. The attributes whose values will be changed by an update operation.

» Note: the attributes listed in item 3 above are candidates for definition of access structures. However, the attributes listed in item 4 are candidates for avoiding an access structure.

309


▪ Factors that Influence Physical Database Design (cont.):

B. Analyzing the expected frequency of invocation of queries and transactions

» The expected frequency information, along with the attribute information collected on each query and transaction, is used to compile a cumulative list of expected frequency of use for all the queries and transactions.

» It is expressed as the expected frequency of using each attribute in each file as a selection attribute or join attribute, over all the queries and transactions.

» 80-20 rule

• 20% of the data is accessed 80% of the time

310


▪ Factors that Influence Physical Database Design (cont.)

C. Analyzing the time constraints of queries and transactions

» Performance constraints place further priorities on the attributes that are candidates for access paths.

» The selection attributes used by queries and transactions with time constraints become higher-priority candidates for primary access structures.

311


▪ Factors that Influence Physical Database Design (cont.)

D. Analyzing the expected frequencies of update operations

»A minimum number of access paths should be specified for a file that is updated frequently.

312


▪ Factors that Influence Physical Database Design (cont.)

E. Analyzing the uniqueness constraints on attributes

»Access paths should be specified on all candidate key attributes — or set of attributes — that are either the primary key or constrained to be unique.

313


� Physical Database Design Decisions

� Design decisions about indexing

»Whether to index an attribute?

»What attribute or attributes to index on?

»Whether to set up a clustered index?

»Whether to use a hash index over a tree index?

»Whether to use dynamic hashing for the file?

314


� Physical Database Design Decisions (cont.)

▪ Denormalization as a design decision for speeding up queries

» The goal of normalization is to separate the logically related attributes into tables to minimize redundancy and thereby avoid the update anomalies that cause extra processing overhead to maintain consistency of the database.

» The goal of denormalization is to improve the performance of frequently occurring queries and transactions. (Typically the designer adds to a table attributes that are needed for answering queries or producing reports so that a join with another table is avoided.)

» Trade-off between update and query performance

315

2. An Overview of Database Tuning in Relational Systems (1)

� Tuning:

» The process of continuing to revise/adjust the physical database design by monitoring resource utilization as well as internal DBMS processing to reveal bottlenecks such as contention for the same data or devices.

� Goal:

» To make applications run faster

» To lower the response time of queries/transactions

» To improve the overall throughput of transactions

316

An Overview of Database Tuning in Relational Systems (2)

▪ Statistics internally collected in DBMSs:

» Size of individual tables

» Number of distinct values in a column

» The number of times a particular query or transaction is submitted/executed in an interval of time

» The times required for different phases of query and transaction processing

▪ Statistics obtained from monitoring:

» Storage statistics

» I/O and device performance statistics

» Query/transaction processing statistics

» Locking/logging related statistics

» Index statistics

317


� Problems to be considered in tuning:

»How to avoid excessive lock contention?

»How to minimize overhead of logging and unnecessary dumping of data?

»How to optimize buffer size and scheduling of processes?

»How to allocate resources such as disks, RAM and processes for most efficient utilization?

318


� Tuning Indexes

» Reasons for tuning indexes

• Certain queries may take too long to run for lack of an index;

• Certain indexes may not get utilized at all;

• Certain indexes may be causing excessive overhead because the index is on an attribute that undergoes frequent changes.

» Options for tuning indexes

• Drop and/or build new indexes

• Change a non-clustered index to a clustered index (and vice versa)

• Rebuild the index

319


� Tuning the Database Design

» Dynamically changing processing requirements need to be addressed by making changes to the conceptual schema if necessary and by reflecting those changes in the logical schema and physical design.

320


▪ Tuning the Database Design (cont.)

» Possible changes to the database design

• Existing tables may be joined (denormalized) because certain attributes from two or more tables are frequently needed together.

• For the given set of tables, there may be alternative design choices, all of which achieve 3NF or BCNF. One may be replaced by the other.

• A relation of the form R(K, A, B, C, D, …) that is in BCNF can be stored into multiple tables that are also in BCNF by replicating the key K in each table.

• Attribute(s) from one table may be repeated in another even though this creates redundancy and potential anomalies.

• Apply horizontal partitioning as well as vertical partitioning if necessary.

321


� Tuning Queries

» Indications for tuning queries

• A query issues too many disk accesses

• The query plan shows that relevant indexes are not being used.

322


▪ Tuning Queries (cont.): Typical instances for query tuning

» In some situations involving correlated queries, temporary tables (temporaries) are useful.

» If multiple options for a join condition are possible, choose one that uses a clustering index and avoid those that contain string comparisons.

» The order of tables in the FROM clause may affect the join processing.

» Some query optimizers perform worse on nested queries compared to their equivalent un-nested counterparts.

» Many applications are based on views that define the data of interest to those applications. Sometimes these views become overkill.

323




▪ Additional Query Tuning Guidelines

» A query with multiple selection conditions that are connected via OR may not prompt the query optimizer to use any index. Such a query may be split up and expressed as a union of queries, each with a condition on an attribute that causes an index to be used.

» Apply the following transformations:

• A NOT condition may be transformed into a positive expression.

• Embedded SELECT blocks may be replaced by joins.

• If an equality join is set up between two tables, the range predicate on the joining attribute set up in one table may be repeated for the other table.

» WHERE conditions may be rewritten to utilize the indexes on multiple columns.
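The OR-to-UNION rewrite can be tried out with Python's built-in `sqlite3` module, as in the sketch below. The table contents and index names are assumptions made for the example, and whether the rewrite actually changes the plan depends on the DBMS and its optimizer; the sketch only shows that the two formulations return the same rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE EMPLOYEE (SSN TEXT PRIMARY KEY, LNAME TEXT, DNO INTEGER, SALARY INTEGER);
    CREATE INDEX idx_employee_dno ON EMPLOYEE (DNO);
    CREATE INDEX idx_employee_salary ON EMPLOYEE (SALARY);
    INSERT INTO EMPLOYEE VALUES
        ('1', 'Smith', 5, 30000),
        ('2', 'Wallace', 4, 43000),
        ('3', 'Wong', 5, 40000),
        ('4', 'Borg', 1, 55000);
    """
)

# Original form: two conditions connected via OR.
or_rows = conn.execute(
    "SELECT SSN FROM EMPLOYEE WHERE DNO = 5 OR SALARY > 50000 ORDER BY SSN"
).fetchall()

# Rewritten form: a union of two queries, each with a condition on one indexed attribute.
union_rows = conn.execute(
    """
    SELECT SSN FROM EMPLOYEE WHERE DNO = 5
    UNION
    SELECT SSN FROM EMPLOYEE WHERE SALARY > 50000
    ORDER BY SSN
    """
).fetchall()

conn.close()
```

UNION removes duplicates, so the rewrite is equivalent here because the selected column is a key; for a non-key select list the two forms can differ in how duplicate rows are reported.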

325

Summary

� Physical Database Design in Relational Databases

� Database Tuning in Relational Systems

326


Agenda

� Database Programming: Techniques and Issues

� Embedded SQL, Dynamic SQL, and SQLJ

� Database Programming with Function Calls: SQL/CLI and JDBC

▪ Database Stored Procedures and SQL/PSM

� Comparing the Three Approaches

328

Introduction to SQL Programming Techniques

� Database applications

� Host language

• Java, C/C++/C#, COBOL, or some other programming language

� Data sublanguage

• SQL

� SQL standards

� Continually evolving

� Each DBMS vendor may have some variations from standard

329

Database Programming: Techniques and Issues

� Interactive interface

� SQL commands typed directly into a monitor

� Execute file of commands

� @<filename>

� Application programs or database applications

▪ Used as canned transactions by end users to access a database

� May have Web interface

330

Approaches to Database Programming

� Embedding database commands in a general-purpose programming language

� Database statements identified by a special prefix

� Precompiler or preprocessor scans the source program code

• Identify database statements and extract them for processing by the DBMS

� Called embedded SQL

331

Approaches to Database Programming (cont’d.)

� Using a library of database functions

� Library of functions available to the host programming language

� Application programming interface (API)

� Designing a brand-new language

� Database programming language designed from scratch

� First two approaches are more common

332

Impedance Mismatch

� Differences between database model and programming language model

� Binding for each host programming language

� Specifies for each attribute type the compatible programming language types

� Cursor or iterator variable

� Loop over the tuples in a query result

333

Typical Sequence of Interaction in Database Programming

� Open a connection to database server

� Interact with database by submitting queries, updates, and other database commands

� Terminate or close connection to database
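The three-step sequence above looks like this with Python's built-in `sqlite3` module, used here only as a convenient stand-in for a database server. The in-memory database, table, and values are assumptions made for the example.

```python
import sqlite3

# 1. Open a connection (an in-memory database stands in for a server here).
conn = sqlite3.connect(":memory:")
try:
    # 2. Interact with the database by submitting updates and queries.
    conn.execute("CREATE TABLE DEPARTMENT (DNUMBER INTEGER PRIMARY KEY, DNAME TEXT)")
    conn.execute("INSERT INTO DEPARTMENT VALUES (5, 'Research')")
    conn.commit()
    dname = conn.execute(
        "SELECT DNAME FROM DEPARTMENT WHERE DNUMBER = 5"
    ).fetchone()[0]
finally:
    # 3. Terminate the connection, even if a statement above failed.
    conn.close()
```

Wrapping the interaction in `try`/`finally` is a design choice of the sketch, so that the connection is released on every path.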

334

Embedded SQL, Dynamic SQL, and SQLJ

� Embedded SQL

� C language

� SQLJ

� Java language

� Programming language called host language

335

Retrieving Single Tuples with Embedded SQL

� EXEC SQL

� Prefix

� Preprocessor separates embedded SQL statements from host language code

� Terminated by a matching END-EXEC

• Or by a semicolon (;)

� Shared variables

� Used in both the C program and the embedded SQL statements

� Prefixed by a colon (:) in SQL statement

336

Retrieving Single Tuples with Embedded SQL (cont’d.)

337


▪ Connecting to the database

CONNECT TO <server name> AS <connection name> AUTHORIZATION <user account name and password> ;

▪ Changing the connection

SET CONNECTION <connection name> ;

▪ Terminating the connection

DISCONNECT <connection name> ;

338


▪ SQLCODE and SQLSTATE communication variables

� Used by DBMS to communicate exception or error conditions

� SQLCODE variable

� 0 = statement executed successfully

� 100 = no more data available in query result

� < 0 = indicates some error has occurred

339


� SQLSTATE

� String of five characters

� ‘00000’ = no error or exception

� Other values indicate various errors or exceptions

� For example, ‘02000’ indicates ‘no more data’ when using SQLSTATE

340


341

Retrieving Multiple Tuples with Embedded SQL Using Cursors

� Cursor

� Points to a single tuple (row) from result of query

� OPEN CURSOR command

� Fetches query result and sets cursor to a position before first row in result

� Becomes current row for cursor

� FETCH commands

� Moves cursor to next row in result of query
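The open/fetch/close pattern of a cursor can be sketched with Python's `sqlite3` module. This is an analogy rather than embedded SQL: there is no SQLCODE variable here, and `fetchone()` returning `None` plays the role of the "no more data" condition. The table and rows are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE EMPLOYEE (SSN TEXT PRIMARY KEY, LNAME TEXT, DNO INTEGER);
    INSERT INTO EMPLOYEE VALUES ('1', 'Smith', 5), ('2', 'Wallace', 4), ('3', 'Wong', 5);
    """
)

cur = conn.cursor()
# Comparable to OPEN: the query is executed and the cursor sits before the first row.
cur.execute("SELECT LNAME FROM EMPLOYEE WHERE DNO = 5 ORDER BY SSN")

names = []
while True:
    row = cur.fetchone()      # comparable to FETCH: advance to the next row
    if row is None:           # no more rows in the query result
        break
    names.append(row[0])

cur.close()                   # comparable to CLOSE
conn.close()
```

The loop processes one tuple at a time, which is how a cursor bridges the set-oriented query result and the record-at-a-time host language.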

342

Sample Program Segment Illustrating the Use of Cursors

343

Retrieving Multiple Tuples with Embedded SQL Using Cursors (cont’d.)

� FOR UPDATE OF

� List the names of any attributes that will be updated by the program

▪ Fetch orientation

▪ Added using one of the values NEXT, PRIOR, FIRST, LAST, ABSOLUTE i, and RELATIVE i

344

Specifying Queries at Runtime Using Dynamic SQL

� Dynamic SQL

� Execute different SQL queries or updates dynamically at runtime

� Dynamic update

� Dynamic query

345

SQLJ: Embedding SQL Commands in Java

� Standard adopted by several vendors for embedding SQL in Java

� Import several class libraries

� Default context

▪ Uses exceptions for error handling

▪ SQLException is used to return errors or exception conditions

346

SQLJ: Embedding SQL Commands in Java (cont’d.)

347

Retrieving Multiple Tuples in SQLJ Using Iterators

� Iterator

� Object associated with a collection (set or multiset) of records in a query result

� Named iterator

� Associated with a query result by listing attribute names and types in query result

� Positional iterator

� Lists only attribute types in query result

348

Sample Java Program Segment Using a Named Iterator

349

Retrieving Multiple Tuples in SQLJ Using Iterators (cont’d.)

350

Database Programming with Function Calls: SQL/CLI & JDBC

� Use of function calls

� Dynamic approach for database programming

� Library of functions

▪ Also known as application programming interface (API)

� Used to access database

� SQL Call Level Interface (SQL/CLI)

� Part of SQL standard

351

SQL/CLI: Using C as the Host Language

� Environment record

� Track one or more database connections

� Set environment information

� Connection record

� Keeps track of information needed for a particular database connection

� Statement record

� Keeps track of the information needed for one SQL statement

352

SQL/CLI: Using C as the Host Language (cont’d.)

� Description record

� Keeps track of information about tuples or parameters

� Handle to the record

� C pointer variable makes record accessible to program

353

Sample C Program Segment Using SQL/CLI for a Query

354

JDBC: SQL Function Calls for Java Programming

� JDBC

� Java function libraries

� Single Java program can connect to several different databases

� Called data sources accessed by the Java program

� Class.forName("oracle.jdbc.driver.OracleDriver")

� Load a JDBC driver explicitly

355

Sample Program Segment with JDBC

356

JDBC: SQL Function Calls for Java Programming

� Connection object

� Statement object has two subclasses:

� PreparedStatement and CallableStatement

� Question mark (?) symbol

� Represents a statement parameter

� Determined at runtime

� ResultSet object

� Holds results of query
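The role of the question mark as a statement parameter can be illustrated outside Java with Python's `sqlite3` module, which also uses `?` placeholders bound at runtime. This is an analogy to a JDBC `PreparedStatement`, not JDBC itself; the table and values are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE EMPLOYEE (SSN TEXT PRIMARY KEY, LNAME TEXT, DNO INTEGER, SALARY INTEGER);
    INSERT INTO EMPLOYEE VALUES
        ('1', 'Smith', 5, 30000),
        ('2', 'Wallace', 4, 43000),
        ('3', 'Wong', 5, 40000);
    """
)

# One statement text with two parameters; the values are supplied at runtime.
query = "SELECT LNAME FROM EMPLOYEE WHERE DNO = ? AND SALARY > ?"

rows_dept5 = conn.execute(query, (5, 35000)).fetchall()
rows_dept4 = conn.execute(query, (4, 35000)).fetchall()

conn.close()
```

Binding values through placeholders instead of concatenating them into the SQL string lets the same statement be reused with different arguments and keeps user input from being interpreted as SQL.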

357

Database Stored Procedures and SQL/PSM

� Stored procedures

� Program modules stored by the DBMS at the database server

� Can be functions or procedures

� SQL/PSM (SQL/Persistent Stored Modules)

� Extensions to SQL

� Include general-purpose programming constructs in SQL

358

Database Stored Procedures and Functions

� Persistent stored modules

� Stored persistently by the DBMS

� Useful:

� When database program is needed by several applications

� To reduce data transfer and communication cost between client and server in certain situations

� To enhance modeling power provided by views

359

Database Stored Procedures and Functions (cont’d.)

▪ Declaring stored procedures:

CREATE PROCEDURE <procedure name> (<parameters>)
<local declarations>
<procedure body> ;

▪ When declaring a function, a return type is necessary, so the declaration form is:

CREATE FUNCTION <function name> (<parameters>)
RETURNS <return type>
<local declarations>
<function body> ;

360

Database Stored Procedures and Functions (cont’d.)

▪ Each parameter has a parameter type and a parameter mode

▪ Parameter type: one of the SQL data types

▪ Parameter mode: IN, OUT, or INOUT

▪ Calling a stored procedure:

CALL <procedure or function name> (<argument list>) ;

361

SQL/PSM: Extending SQL for Specifying Persistent Stored Modules

▪ Conditional branching statement:

IF <condition> THEN <statement list>
ELSEIF <condition> THEN <statement list>
...
ELSEIF <condition> THEN <statement list>
ELSE <statement list>
END IF ;

362

SQL/PSM (cont’d.)

� Constructs for looping

363

SQL/PSM (cont’d.)

364

Comparing the Three Approaches

� Embedded SQL Approach

� Query text checked for syntax errors and validated against database schema at compile time

� For complex applications where queries have to be generated at runtime

• Function call approach more suitable

365

Comparing the Three Approaches (cont’d.)

� Library of Function Calls Approach

� More flexibility

� More complex programming

� No checking of syntax done at compile time

� Database Programming Language Approach

� Does not suffer from the impedance mismatch problem

� Programmers must learn a new language

366

Summary

� Techniques for database programming

� Embedded SQL

� SQLJ

� Function call libraries

� SQL/CLI standard

� JDBC class library

� Stored procedures

� SQL/PSM

367


Agenda

� A Simple PHP Example

� Overview of Basic Features of PHP

� Overview of PHP Database Programming

369

Web Database Programming Using PHP

� Techniques for programming dynamic features into Web

� PHP

� Open source scripting language

� Interpreters provided free of charge

� Available on most computer platforms

370

A Simple PHP Example

� PHP

� Open source general-purpose scripting language

� Comes installed with the UNIX operating system

371

A Simple PHP Example (cont’d.)

� DBMS

� Bottom-tier database server

� PHP

� Middle-tier Web server

� HTML

� Client tier

372

Sample PHP Program Segment

373


� Example Figure 14.1(a)

� PHP script stored in:

� http://www.myserver.com/example/greeting.php

� <?php

� PHP start tag

� ?>

� PHP end tag

� Comments: // or /* */

374


� $_POST

� Auto-global predefined PHP variable

� Array that holds all the values entered through form parameters

� Arrays are dynamic

� Long text strings

� Between opening <<<_HTML_ and closing _HTML_;

375


� PHP variable names

� Start with $ sign

376

Overview of Basic Features of PHP

� Illustrate features of PHP suited for creating dynamic Web pages that contain database access commands

377

PHP Variables, Data Types, and Programming Constructs

� PHP variable names

� Start with $ symbol

▪ Can include letters, digits, and the underscore character (_)

� Main ways to express strings and text

� Single-quoted strings

� Double-quoted strings

� Here documents

� Single and double quotes

378

PHP Variables, Data Types, and Programming Constructs (cont’d.)

� Period (.) symbol

� String concatenate operator

� Single-quoted strings

� Literal strings that contain no PHP program variables

� Double-quoted strings and here documents

� Values from variables need to be interpolated into string

379


� Numeric data types

� Integers and floating points

� Programming language constructs

� For-loops, while-loops, and conditional if-statements

� Boolean expressions

380


� Comparison operators

� == (equal), != (not equal), > (greater than), >= (greater than or equal), < (less than), and <= (less than or equal)

381

PHP Arrays

• Database query results
  » Two-dimensional arrays
  » First dimension representing rows of a table
  » Second dimension representing columns (attributes) within a row
• Main types of arrays:
  » Numeric and associative

382

PHP Arrays (cont’d.)

• Numeric array
  » Associates a numeric index with each element in the array
  » Indexes are integer numbers
      • Start at zero
      • Grow incrementally
• Associative array
  » Provides pairs of (key => value) elements
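A minimal sketch of both array types, plus a query result viewed as a two-dimensional array. All names and values are illustrative assumptions.

```php
<?php
// Numeric array: indexes start at 0 and grow as elements are appended.
$courses = ['Database', 'OS'];
$courses[] = 'Graphics';            // receives index 2

// Associative array: (key => value) pairs.
$teaching = ['Database' => 'Smith', 'OS' => 'Carrick'];
$teaching['Graphics'] = 'Kam';      // arrays are dynamic

// A query result as a two-dimensional array:
// first dimension = rows, second dimension = columns (attributes).
$rows = [
    ['Name' => 'Ann', 'DeptNo' => 5],
    ['Name' => 'Bob', 'DeptNo' => 4],
];

echo $courses[2], ' ', $teaching['OS'], ' ', $rows[1]['Name'], "\n";
```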

383


384


• Techniques for looping through arrays in PHP
• count function
  » Returns current number of elements in array
• sort function
  » Sorts array based on element values in it
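The sketch below shows count and sort together with two common ways of looping through an array. The sample data is assumed for illustration.

```php
<?php
$courses = ['OS', 'Graphics', 'Database'];
$teaching = ['Database' => 'Smith', 'OS' => 'Carrick'];

sort($courses);   // sorts by element value and renumbers indexes from 0

// Loop over a numeric array by index, using count() as the bound.
$lines = [];
for ($i = 0; $i < count($courses); $i++) {
    $lines[] = "$i: $courses[$i]";
}

// Loop over an associative array with foreach.
foreach ($teaching as $course => $teacher) {
    $lines[] = "$course is taught by $teacher";
}

echo implode("\n", $lines), "\n";
```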

385

PHP Functions

• Functions
  » Defined to structure a complex program and to share common sections of code
  » Arguments passed by value
• Examples to illustrate basic PHP functions
  » Figure 14.4
  » Figure 14.5
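A small sketch of a user-defined function and of pass-by-value. It does not reproduce Figures 14.4 or 14.5. The function add_tax is a hypothetical helper.

```php
<?php
// Hypothetical helper: returns the amount with tax added.
function add_tax(float $amount, float $rate): float
{
    $amount = $amount * (1 + $rate);   // changes only the local copy
    return $amount;
}

$price = 100.0;
$withTax = add_tax($price, 0.5);

// $price is unchanged because the argument was passed by value.
echo "$price $withTax\n";
```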

386

Rewriting a Program Segment Using Functions

387

PHP Functions (cont’d.)

388

PHP Server Variables and Forms

• Built-in entries
  » $_SERVER auto-global built-in array variable
  » Provides useful information about server where the PHP interpreter is running

389

PHP Server Variables and Forms (cont’d.)

• Examples:
  » $_SERVER['SERVER_NAME']
  » $_SERVER['REMOTE_ADDR']
  » $_SERVER['REMOTE_HOST']
  » $_SERVER['PATH_INFO']
  » $_SERVER['QUERY_STRING']
  » $_SERVER['DOCUMENT_ROOT']
• $_POST
  » Provides input values submitted by the user through HTML forms specified in <INPUT> tags
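A sketch of reading server and form values. Which $_SERVER entries exist depends on the Web server and the request, so the code falls back to defaults. The function describe_request and the field name user_name are assumptions.

```php
<?php
// $server and $post stand in for $_SERVER and $_POST so the function
// can be called with sample data as well.
function describe_request(array $server, array $post): string
{
    $host  = $server['SERVER_NAME'] ?? 'unknown';
    $query = $server['QUERY_STRING'] ?? '';
    $name  = $post['user_name'] ?? '(none)';
    return "server=$host query=$query user_name=$name";
}

// Escape before printing, since the values come from the client.
echo htmlspecialchars(describe_request($_SERVER, $_POST)), "\n";
```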

390

Overview of PHP Database Programming

• PEAR DB library
  » Part of PHP Extension and Application Repository (PEAR)
  » Provides functions for database access

391

Connecting to a Database

• Library module DB.php must be loaded
• DB library functions accessed using DB::<function_name>
• DB::connect('string')
  » Function for connecting to a database
  » Format for 'string' is: <DBMS software>://<user account>:<password>@<database server>
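The connection string format can be sketched as follows. The helper build_dsn and the sample account, password, and host values are placeholders, not real credentials. The PEAR DB calls are shown as comments because they need the library and a reachable database.

```php
<?php
// Assemble a connection string in the format described above.
function build_dsn(string $dbms, string $user, string $password, string $server): string
{
    return sprintf('%s://%s:%s@%s', $dbms, $user, $password, $server);
}

$dsn = build_dsn('oci8', 'acct1', 'pass12', 'www.host.com/db1');
echo $dsn, "\n";

// With PEAR DB installed, the connection would then be opened roughly like this:
// require_once 'DB.php';
// $d = DB::connect($dsn);
// if (DB::isError($d)) { die('cannot connect: ' . $d->getMessage()); }
```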

392

Sample Connection to a Database

393

Connecting to a Database (cont’d.)

• Query function
  » $d->query takes an SQL command as its string argument
  » Sends query to database server for execution
• $d->setErrorHandling(PEAR_ERROR_DIE)
  » Terminate program and print default error messages if any subsequent errors occur

394

Collecting Data from Forms and Inserting Records

• Collect information through HTML or other types of Web forms
• Create unique record identifier for each new record inserted into the database
  » PEAR DB provides a function $d->nextID to create a sequence of unique values for a particular table
• Placeholders
  » Specified by ? symbol in the SQL string; the values are supplied separately in an array
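A sketch of an insert that uses nextID and ? placeholders. The EMPLOYEE table layout and the form field names are assumptions. FakeInsertConnection is a hypothetical stand-in that mimics only the two PEAR DB methods used here, so the code runs without the library or a database.

```php
<?php
// Insert one record through a PEAR DB-style connection object $d.
function insert_employee($d, array $form)
{
    $eid = $d->nextID('EMPLOYEE');   // next unique id for this table
    // The ? placeholders are bound, in order, to the array values.
    $d->query(
        'INSERT INTO EMPLOYEE VALUES (?, ?, ?, ?)',
        [$eid, $form['emp_name'], $form['emp_job'], $form['emp_dno']]
    );
    return $eid;
}

// Hypothetical stand-in for a real connection; it only records the calls.
class FakeInsertConnection
{
    public $log = [];
    private $seq = 0;

    public function nextID($name)
    {
        return ++$this->seq;
    }

    public function query($sql, $params = [])
    {
        $this->log[] = [$sql, $params];
        return true;
    }
}

$d = new FakeInsertConnection();
$id = insert_employee($d, ['emp_name' => 'Ann', 'emp_job' => 'Engineer', 'emp_dno' => 5]);
echo "inserted id $id\n";
```

Passing the values through placeholders rather than concatenating them into the SQL string keeps user input out of the command text.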

395

Retrieval Queries from Database Tables

• $q
  » Query result
  » $q->fetchRow() retrieves the next record in the query result and controls the loop
• $d->getAll
  » Returns all the records in a query result at once, so they can be held in a single variable such as $allresult
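The two retrieval styles can be compared side by side. This is a sketch: the query text and the column position are assumptions, and FakeResult and FakeQueryConnection are hypothetical stand-ins for PEAR DB objects so the code runs without a database.

```php
<?php
// Style 1: fetch one record at a time; fetchRow() also ends the loop.
function names_by_fetch($d): array
{
    $q = $d->query('SELECT Name FROM EMPLOYEE');
    $names = [];
    while ($r = $q->fetchRow()) {
        $names[] = $r[0];            // first column of the current row
    }
    return $names;
}

// Style 2: fetch every record at once into a single variable.
function names_by_getall($d): array
{
    $allresult = $d->getAll('SELECT Name FROM EMPLOYEE');
    $names = [];
    foreach ($allresult as $r) {
        $names[] = $r[0];
    }
    return $names;
}

// Hypothetical stand-ins that return canned rows.
class FakeResult
{
    private $rows;
    public function __construct(array $rows) { $this->rows = $rows; }
    public function fetchRow() { return array_shift($this->rows); } // null when empty
}

class FakeQueryConnection
{
    private $rows = [['Ann'], ['Bob']];
    public function query($sql) { return new FakeResult($this->rows); }
    public function getAll($sql) { return $this->rows; }
}

$d = new FakeQueryConnection();
echo implode(', ', names_by_fetch($d)), "\n";
echo implode(', ', names_by_getall($d)), "\n";
```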

396

Illustrating Database Retrieval Queries

397

Summary

• PHP scripting language
  » Very popular for Web database programming
• PHP basics for Web programming
  » Data types
• Database commands include:
  » Creating tables, inserting new records, and retrieving database records

398

Agenda









399

Summary

• Disk Storage, Basic File Structures, and Hashing
• Indexing Structures for Files
• Algorithms for Query Processing and Optimization
• Physical Database Design and Tuning
• Introduction to SQL Programming Techniques
• Web Database Programming Using PHP
• Summary & Conclusion

400

Assignments & Readings

• Readings
  » Slides and Handouts posted on the course web site
  » Textbook: Chapters 13, 14, 17, 18, 19, and 20
• Assignment #7
  » Textbook exercises: TBA
  » See Database Project (Part III) specifications and support material posted under handouts and demos on the course Web site.
• Project Framework Setup (ongoing)

401

Next Session: Additional Database Topics (Readings)

• Transaction Processing and Recovery
• Concurrency Control Techniques
• Overview of Data Warehousing and OLAP
• Object, Object Relational, and XML Databases
• Database Security and Distributed Databases
• Other Advanced Database Models, Systems, and Applications

402

Any Questions?
