
Query Evaluation Techniques for Large Databases**

By Goetz Graefe

Prepared by: Edwin Andrés Bernal López

Claudia Jeanneth Becerra Cortés

Course: Advanced Topics in Databases (Tópicos Avanzados de Bases de Datos)

Bogotá, March 23, 2006

**Portland State University, Computer Science Department, P.O. Box 751, Portland, Oregon 97207-0751. Received January 1992; final revision accepted February 1993. Published in ACM Computing Surveys, Vol. 25, No. 2, June 1993.

List of Publications by Goetz Graefe

http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/g/Graefe:Goetz.html

2004

59 Goetz Graefe, Michael J. Zwilling: Transaction support for indexed views. SIGMOD Conference 2004

58 Goetz Graefe: Write-Optimized B-Trees. VLDB 2004: 672-683

57 Conor Cunningham, Goetz Graefe, César A. Galindo-Legaria: PIVOT and UNPIVOT: Optimization and Execution Strategies in an RDBMS. VLDB 2004: 998-1009

2003

56 Goetz Graefe: Executing Nested Queries. BTW 2003: 58-77

55 Goetz Graefe: Partitioned B-trees - a user's guide. BTW 2003: 668-671

54 Goetz Graefe: Sorting And Indexing With Partitioned B-Trees. CIDR 2003

2001

53 Goetz Graefe, Per-Åke Larson: B-Tree Indexes and CPU Caches. ICDE 2001: 349-358

52 William O'Connell, Andrew Witkowski, Goetz Graefe: Collaborative Analytical Processing - Dream or Reality? (Panel abstract). VLDB 2001: 613, presented in the framework of the 27th International Conference on Very Large Data Bases VLDB '01

51 Sameet Agarwal, José A. Blakeley, Thomas Casey, Kalen Delaney, César A. Galindo-Legaria, Goetz Graefe, Michael Rys, Michael J. Zwilling: Microsoft SQL Server (Chapter 27). Database System Concepts, 4th Edition. 2001: 969-1006

2000

50 Goetz Graefe: Dynamic Query Evaluation Plans: Some Course Corrections? IEEE Data Eng. Bull. 23(2): 3-6 (2000)

1999

49 Goetz Graefe: The Value of Merge-Join and Hash-Join in SQL Server. VLDB 1999: 250-253

48 Surajit Chaudhuri, Eric Christensen, Goetz Graefe, Vivek R. Narasayya, Michael J. Zwilling: Self-Tuning Technology in Microsoft SQL Server. IEEE Data Eng. Bull. 22(2): 20-26 (1999)

1998

47 Goetz Graefe: The New Database Imperatives. ICDE 1998: 69-72

46 Goetz Graefe, Usama M. Fayyad, Surajit Chaudhuri: On the Efficient Gathering of Sufficient Statistics for Classification from Large SQL Databases. KDD 1998: 204-208

45 Per-Åke Larson, Goetz Graefe: Memory Management During Run Generation in External Sorting. SIGMOD Conference 1998: 472-483

44 Goetz Graefe, Ross Bunker, Shaun Cooper: Hash Joins and Hash Teams in Microsoft SQL Server. VLDB 1998: 86-97

43 Jim Gray, Goetz Graefe: The Five-Minute Rule Ten Years Later, and Other Computer Storage Rules of Thumb. CoRR cs.DB/9809005 (1998)

1997

42 Jim Gray, Goetz Graefe: The Five-Minute Rule Ten Years Later, and Other Computer Storage Rules of Thumb. SIGMOD Record 26(4): 63-68 (1997)

1996

41 Goetz Graefe: The Microsoft Relational Engine. ICDE 1996: 160-161

40 Goetz Graefe: Iterators, Schedulers, and Distributed-memory Parallelism. Softw., Pract. Exper. 26(4): 427-452 (1996)

1995

39 Diane L. Davison, Goetz Graefe: Dynamic Resource Brokering for Multi-User Query Execution. SIGMOD Conference 1995: 281-292

38 Goetz Graefe, Richard L. Cole: Fast Algorithms for Universal Quantification in Large Databases. ACM Trans. Database Syst. 20(2): 187-236 (1995)

37 Goetz Graefe: The Cascades Framework for Query Optimization. IEEE Data Eng. Bull. 18(3): 19-29 (1995)

36 Goetz Graefe: Letter from the Special Issue Editor. IEEE Data Eng. Bull. 18(3): 2 (1995)

35 Patrick E. O'Neil, Goetz Graefe: Multi-Table Joins Through Bitmapped Join Indices. SIGMOD Record 24(3): 8-11 (1995)

1994

34 Goetz Graefe: Sort-Merge-Join: An Idea Whose Time Has(h) Passed? ICDE 1994: 406-417

33 Richard L. Cole, Goetz Graefe: Optimization of Dynamic Query Evaluation Plans. SIGMOD Conference 1994: 150-160

32 Diane L. Davison, Goetz Graefe: Memory-Contention Responsive Hash Joins. VLDB 1994: 379-390

31 Goetz Graefe: Volcano - An Extensible and Parallel Query Evaluation System. IEEE Trans. Knowl. Data Eng. 6(1): 120-135 (1994)

30 Goetz Graefe, Ann Linville, Leonard D. Shapiro: Sort versus Hash Revisited. IEEE Trans. Knowl. Data Eng. 6(6): 934-944 (1994)

1993

29 Goetz Graefe, Richard L. Cole, Diane L. Davison: Dynamic Techniques for Very Complex Database Queries. FMLDO 1993: 139-142

28 Richard L. Cole, Goetz Graefe: Dynamic Plan Optimization. FMLDO 1993: 45-58

27 Goetz Graefe, William J. McKenna: The Volcano Optimizer Generator: Extensibility and Efficient Search. ICDE 1993: 209-218

26 José A. Blakeley, William J. McKenna, Goetz Graefe: Experiences Building the Open OODB Query Optimizer. SIGMOD Conference 1993: 287-296

25 Richard H. Wolniewicz, Goetz Graefe: Algebraic Optimization of Computations over Scientific Databases. VLDB 1993: 13-24

24 Goetz Graefe: Query Evaluation Techniques for Large Databases. ACM Comput. Surv. 25(2): 73-170 (1993)

23 Richard H. Wolniewicz, Goetz Graefe: Algebraic Optimization of Computations over Scientific Databases. IEEE Data Eng. Bull. 16(1): 48-51 (1993)

22 Goetz Graefe: Letter from the Special Issue Editor. IEEE Data Eng. Bull. 16(4): 3 (1993)

21 Goetz Graefe, Diane L. Davison: Encapsulation of Parallelism and Architecture-Independence in Extensible Database Query Execution. IEEE Trans. Software Eng. 19(8): 749-764 (1993)

20 Goetz Graefe: Options in Physical Database Design. SIGMOD Record 22(3): 76-83 (1993)

19 David Maier, Lois M. L. Delcambre, Calton Pu, Jonathan Walpole, Goetz Graefe, Leonard D. Shapiro: Database Research at the Data-Intensive Systems Center. SIGMOD Record 22(4): 81-86 (1993)

1992

18 David Maier, Goetz Graefe, Leonard D. Shapiro, Scott Daniels, Thomas Keller, Bennet Vance: Issues in Distributed Object Assembly. IWDOM 1992: 165-181

17 Goetz Graefe, Shreekant S. Thakkar: Tuning a Parallel Database Algorithm on a Shared-memory Multiprocessor. Softw., Pract. Exper. 22(7): 495-517 (1992)



1991

16 Thomas Keller, Goetz Graefe, David Maier: Efficient Assembly of Complex Objects. SIGMOD Conference 1991: 148-157

15 Michael J. Carey, David J. DeWitt, Daniel Frank, Goetz Graefe, Joel E. Richardson, Eugene J. Shekita, M. Muralikrishna: The Architecture of the EXODUS Extensible DBMS. On Object-Oriented Database System 1991: 231-256

14 Goetz Graefe, Richard L. Cole, Diane L. Davison, William J. McKenna, Richard H. Wolniewicz: Extensible Query Optimization and Parallel Execution in Volcano. Query Processing for Advanced Database Systems, Dagstuhl 1991: 305-335

13 David Maier, Scott Daniels, Thomas Keller, Bennet Vance, Goetz Graefe, William J. McKenna: Challenges for Query Processing in Object-Oriented Databases. Query Processing for Advanced Database Systems, Dagstuhl 1991: 337-380

12 Scott Daniels, Goetz Graefe, Thomas Keller, David Maier, Duri Schmidt, Bennet Vance: Query Optimization in Revelation, an Overview. IEEE Data Eng. Bull. 14(2): 58-62 (1991)

11 Goetz Graefe: Heap-Filter Merge Join: A New Algorithm For Joining Medium-Size Inputs. IEEE Trans. Software Eng. 17(9): 979-982 (1991)

1990

10 Goetz Graefe: Encapsulation of Parallelism in the Volcano Query Processing System. SIGMOD Conference 1990: 102-111

1989

9 Goetz Graefe: Relational Division: Four Algorithms and Their Performance. ICDE 1989: 94-101

8 Goetz Graefe, Karen Ward: Dynamic Query Evaluation Plans. SIGMOD Conference 1989: 358-366

1988

7 Goetz Graefe, David Maier: Query Optimization in Object-Oriented Database Systems: A Prospectus. OODBS 1988: 358-363

1987

6 Goetz Graefe, David J. DeWitt: The EXODUS Optimizer Generator. SIGMOD Conference 1987: 160-172

5 Goetz Graefe: Rule-Based Query Optimization in Extensible Database Systems. Univ. of Wisconsin-Madison 1987

1986

4 Michael J. Carey, David J. DeWitt, Daniel Frank, Goetz Graefe, M. Muralikrishna, Joel E. Richardson, Eugene J. Shekita: The Architecture of the EXODUS Extensible DBMS. OODBS 1986: 52-65

3 David J. DeWitt, Robert H. Gerber, Goetz Graefe, Michael L. Heytens, Krishna B. Kumar, M. Muralikrishna: GAMMA - A High Performance Dataflow Database Machine. VLDB 1986: 228-237

2 Goetz Graefe: Software Modularization with the EXODUS Optimizer Generator. IEEE Database Eng. Bull. 9(4): 37-45 (1986)

1984

1 Michael J. Carey, David J. DeWitt, Goetz Graefe: Mechanisms for Concurrency Control and Recovery in Prolog - A Proposal. Expert Database Workshop 1984: 271-291

State of the Art in Query Processing, 1993

Bulletin of the Technical Committee on Data Engineering, December 1993, Vol. 16 No. 4, IEEE Computer Society

Special Issue on Query Processing in Commercial Database Systems

Letter from the Special Issue Editor . . . Goetz Graefe

Query Optimization in the IBM DB2 Family . . . Peter Gassner and Guy Lohman

Query Processing in the IBM Application System 400 . . . Richard L. Cole, Mark J. Anderson

Query Processing in NonStop SQL . . . A. Chen, Y-F Kao, M. Pong, D. Shak, S. Sharma, J. Vaishnav

Query Processing in DEC Rdb: Major Issues and Future Challenges . . . Gennady Antoshenkov

Letter from the Editor-in-Chief

“… Goetz Graefe, our issue editor, has succeeded in overcoming these difficulties. He has collected four papers from prominent database vendors. These papers introduce us to the inside world of "real" query processing.”

Letter from the Special Issue Editor

“… Second, in some aspects of query processing, the industrial reality has bypassed academic research. By asking leaders in the industrial field to summarize their work, I hope that this issue is a snapshot of the current state of the art. Undoubtedly, some researchers will find inspirations for new, relevant work of their own in these articles.”

25th International Conference on Very Large Data Bases

Edinburgh-Scotland-UK 7 - 10th Sept 99 http://www.dcs.napier.ac.uk/~vldb99/

VLDB Conference Papers

Contributions to SQL Server 7

Goetz Graefe: The Value of Merge-Join and Hash-Join in SQL Server

SQL Server 7 added many new join strategies. Prior releases of SQL Server have been successful at transaction processing and decision support workloads with neither merge join nor hash join, relying entirely on nested loops and index nested loops join. Given this fact, one needs to ask how much the additional join algorithms improve performance. In a pure OLTP workload that requires only record-to-record navigation, intuition and experience suggest that index nested loop join is sufficient. For a DSS workload, however, the question is much more complex. To answer this question, we analysed TPC-D query performance using an internal build of SQL Server with merge-join and hash-join enabled and disabled. Many previous studies have compared join algorithms, but always for only a few isolated queries and presuming a fixed physical database design. Since physical database design has a major impact on our question, we analysed TPC-D performance for multiple indexing schemes, a simplistic and an optimised physical database design. The latter was optimised specifically for the workload, the available disk space, and the available algorithms using SQL Server's new "index tuning wizard".

Adding “PIVOT” to SQL

Proceedings of the 30th VLDB Conference, Toronto, Canada, 2004

PIVOT and UNPIVOT: Optimization and Execution Strategies in an RDBMS

Conor Cunningham, César A. Galindo-Legaria, Goetz Graefe. Microsoft Corporation, One Microsoft Way, Redmond, WA 98052 USA

Abstract

“… PIVOT and UNPIVOT, two operators on tabular data that exchange rows and columns, enable data transformations useful in data modeling, data analysis, and data presentation. They can quite easily be implemented inside a query processor, much like select, project, and join. Such a design provides opportunities for better performance, both during query optimization and query execution. We discuss query optimization and execution implications of this integrated design and evaluate the performance of this approach using a prototype implementation in Microsoft SQL Server.”



Table of Contents of the Paper (Part 1)

INTRODUCTION

1. ARCHITECTURE OF QUERY EXECUTION ENGINES

2. SORTING AND HASHING
   2.1. Sorting
   2.2. Hashing

3. DISK ACCESS
   3.1. File Scans
   3.2. Associative Access Using Indices
   3.3. Buffer Management

4. AGGREGATION AND DUPLICATE REMOVAL
   4.1. Aggregation Algorithms Based on Nested Loops
   4.2. Aggregation Algorithms Based on Sorting
   4.3. Aggregation Algorithms Based on Hashing
   4.4. A Rough Performance Comparison
   4.5. Additional Remarks on Aggregation

5. BINARY MATCHING OPERATIONS
   5.1. Nested-Loops Join Algorithms
   5.2. Merge-Join Algorithms
   5.3. Hash Join Algorithms
   5.4. Pointer-Based Joins
   5.5. A Rough Performance Comparison

6. UNIVERSAL QUANTIFICATION

7. DUALITY OF SORT- AND HASH-BASED QUERY PROCESSING ALGORITHMS

Table of Contents of the Paper (Part 2)

8. EXECUTION OF COMPLEX QUERY PLANS

9. MECHANISMS FOR PARALLEL QUERY EXECUTION
   9.1. Parallel versus Distributed Database Systems
   9.2. Forms of Parallelism
   9.3. Implementation Strategies
   9.4. Load Balancing and Skew
   9.5. Architectures and Architecture Independence

10. PARALLEL ALGORITHMS
   10.1. Parallel Selections and Updates
   10.2. Parallel Sorting
   10.3. Parallel Aggregation and Duplicate Removal
   10.4. Parallel Joins and Other Binary Matching Operations
   10.5. Parallel Universal Quantification

11. NONSTANDARD QUERY PROCESSING ALGORITHMS
   11.1. Nested Relations
   11.2. Temporal and Scientific Database Management
   11.3. Object-Oriented Database Systems
   11.4. More Control Operators

12. ADDITIONAL TECHNIQUES FOR PERFORMANCE IMPROVEMENT
   12.1. Precomputation and Derived Data
   12.2. Data Compression
   12.3. Surrogate Processing
   12.4. Bit Vector Filtering
   12.5. Specialized Hardware

SUMMARY AND OUTLOOK

INTRODUCTION

“… While many, although not all, techniques discussed in this paper have been developed in the context of relational database systems, most of them are applicable to and useful in the query processing facility for any database management system and any data model, provided the data model permits queries over “bulk” data types such as sets and lists.”

“This survey discusses a large variety of query execution techniques that must be considered when designing and implementing the query execution module of a new database management system: algorithms and their execution costs, sorting versus hashing, parallelism, resource allocation and scheduling issues in complex queries, special operations for emerging database application domains such as statistical and scientific databases, and general performance-enhancing techniques such as precomputation and compression.”

“There are many aspects to the OODB query optimization problem that can benefit from the already proven relational query-optimization technology. However many key features of OODB languages present new and difficult problems not adequately addressed by this technology. These features include object identity, methods, encapsulation, subtype hierarchy, user-defined type constructors, large multimedia objects, multiple collection types, arbitrary nesting of collections, and nesting of query expressions.” Source: lambda-DB, an ODMG-based object-oriented project at the University of Texas at Arlington. http://lambda.uta.edu/ldb/doc/overview.html

INTRODUCTION

“Query optimization is a special form of planning, employing techniques from artificial intelligence such as plan representation, search including directed search and pruning, dynamic programming, branch-and-bound algorithms, etc. The query execution engine is a collection of query execution operators and mechanisms for operator communication and synchronization; it employs concepts from algorithm design, operating systems, networks, and parallel and distributed computation. The facilities of the query execution engine define the space of possible plans that can be chosen by the query optimizer.”

INTRODUCTION

INTRODUCTION: Query Processing Steps [2]

1. ARCHITECTURE OF QUERY EXECUTION ENGINES

“A complete query execution engine consists of a collection of operators and mechanisms to execute complex expressions using multiple operators, including multiple occurrences of the same operator. Taken as a whole, the query processing algorithms form an algebra which we call the physical algebra of a database system.”


“The advantages of a uniform iterator interface for all query processing algorithms are obvious: it permits arbitrary combination of all operators including new ones in extensible systems, it permits arbitrarily complex plans, and it makes the query optimizer simpler to design and implement.”
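A minimal sketch of what such a uniform iterator interface can look like, assuming the open/next/close protocol described in Graefe's survey. The class names `Operator`, `Scan`, and `Select`, the `run` helper, and the sample rows are illustrative and do not come from the slides.

```python
from typing import Callable, Iterable, Optional


class Operator:
    """Uniform iterator interface shared by every operator (assumed protocol)."""

    def open(self) -> None: ...
    def next(self) -> Optional[dict]: ...   # None signals end of stream
    def close(self) -> None: ...


class Scan(Operator):
    """Leaf operator: produces the rows of an in-memory 'file'."""

    def __init__(self, rows: Iterable[dict]):
        self.rows = rows
        self.it = None

    def open(self) -> None:
        self.it = iter(self.rows)

    def next(self) -> Optional[dict]:
        return next(self.it, None)

    def close(self) -> None:
        self.it = None


class Select(Operator):
    """Filter operator: only knows that its child is some Operator."""

    def __init__(self, child: Operator, predicate: Callable[[dict], bool]):
        self.child, self.predicate = child, predicate

    def open(self) -> None:
        self.child.open()

    def next(self) -> Optional[dict]:
        while (row := self.child.next()) is not None:
            if self.predicate(row):
                return row
        return None

    def close(self) -> None:
        self.child.close()


def run(plan: Operator) -> list:
    """Drive any plan to completion through the same three calls."""
    plan.open()
    out = []
    while (row := plan.next()) is not None:
        out.append(row)
    plan.close()
    return out


emp = [{"name": "a", "sal": 10}, {"name": "b", "sal": 30}]
# The same operator appears twice in one plan, as the quote above allows.
plan = Select(Select(Scan(emp), lambda r: r["sal"] > 5), lambda r: r["sal"] > 20)
result = run(plan)
```

Because every operator exposes the same three calls, `Select` can be stacked on a `Scan`, on another `Select`, or on any operator added later, without changes to existing code.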

2. SORTING AND HASHING

“Before discussing specific algorithms, two general approaches to managing sets of data are introduced. The purpose of many query-processing algorithms is to perform some kind of matching, i.e., bringing items that are “alike” together and performing some operation on them. There are two basic approaches used for this purpose, sorting and hashing. This pair permeates many aspects of query processing, from indexing and clustering over aggregation and join algorithms to methods for parallelizing database operations.”

Access Path

Algorithm + data structure used to locate rows satisfying some condition:

File scan: can be used for any condition.

Hash: equality search; all search key attributes of the hash index are specified in the condition.

B+ tree: equality or range search; a prefix of the search key attributes is specified in the condition.

Binary search: the relation is sorted on a sequence of attributes, and some prefix of that sequence is specified in the condition.

2. ACCESS PATHS

Sorting and Hashing

“… We will conclude the discussion of individual query processing by outlining the many existing similarities and dualities of sort and hash-based query-processing algorithms as well as the points where the two types of algorithms differ. The purpose is to contribute to a better understanding of the two approaches and their tradeoffs. We try to discuss the approaches in general terms, ignoring whether the algorithms are used for relational join, union, intersection, aggregation, duplicate removal, or other operations.”

General External Merge Sort

To sort a file with N pages using B buffer pages:

Pass 0: use B buffer pages. Produce $\lceil N/B \rceil$ sorted runs of B pages each.

Pass 1, 2, …, etc.: merge B-1 runs at a time.

Cost of External Merge Sort

Number of passes: $1 + \lceil \log_{B-1} \lceil N/B \rceil \rceil$

Cost = 2N * (# of passes)

E.g., with 5 buffer pages, to sort a 108-page file:

Pass 0: $\lceil 108/5 \rceil$ = 22 sorted runs of 5 pages each (last run is only 3 pages)

Pass 1: $\lceil 22/4 \rceil$ = 6 sorted runs of 20 pages each (last run is only 8 pages)

Pass 2: 2 sorted runs, 80 pages and 28 pages

Pass 3: Sorted file of 108 pages
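The pass count and cost for this example can be checked with a short sketch. It counts passes with integer arithmetic instead of a floating-point logarithm; the function names are illustrative, and it assumes at least 3 buffer pages so that each merge pass reduces the number of runs.

```python
def external_sort_passes(n_pages: int, buffers: int) -> int:
    """Passes of external merge sort: pass 0 plus (B-1)-way merge passes."""
    runs = -(-n_pages // buffers)             # ceil(N / B) runs after pass 0
    passes = 1
    while runs > 1:
        runs = -(-runs // (buffers - 1))      # each pass merges B-1 runs at a time
        passes += 1
    return passes


def external_sort_cost(n_pages: int, buffers: int) -> int:
    """Page I/Os: every pass reads and writes each page once."""
    return 2 * n_pages * external_sort_passes(n_pages, buffers)


passes = external_sort_passes(108, 5)
cost = external_sort_cost(108, 5)
```

For the 108-page file with 5 buffers this gives 4 passes (passes 0 to 3), matching the walk-through above.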

Number of Passes of External Sort

N                B=3   B=5   B=9   B=17   B=129   B=257
100                7     4     3      2       1       1
1,000             10     5     4      3       2       2
10,000            13     7     5      4       2       2
100,000           17     9     6      5       3       3
1,000,000         20    10     7      5       3       3
10,000,000        23    12     8      6       4       3
100,000,000       26    14     9      7       4       4
1,000,000,000     30    15    10      8       5       4

Double Buffering

To reduce the wait time for an I/O request to complete, can prefetch into a 'shadow block'. Potentially, more passes; in practice, most files are still sorted in 2-3 passes.

[Figure: B main memory buffers, k-way merge. Input buffers INPUT 1 … INPUT k with shadow blocks INPUT 1' … INPUT k', and output buffer OUTPUT with shadow block OUTPUT', each of block size b, reading from disk and writing to disk.]

Sorting Records!

Sorting has become a blood sport! Parallel sorting is the name of the game ...

Datamation: sort 1M records of size 100 bytes.

Typical DBMS: 15 minutes.

World record: 3.5 seconds (12-CPU SGI machine, 96 disks, 2GB of RAM).

New benchmarks proposed:

Minute Sort: how many can you sort in 1 minute?

Dollar Sort: how many can you sort for $1.00?

Using B+ Trees for Sorting

Scenario: the table to be sorted has a B+ tree index on the sorting column(s).

Idea: can retrieve records in order by traversing the leaf pages.

Is this a good idea? Cases to consider:

B+ tree is clustered: good idea!

B+ tree is not clustered: could be a very bad idea!

Clustered B+ Tree Used for Sorting

Unclustered B+ Tree Used for Sorting

External Sorting vs. Unclustered Index

N             Sorting       p=1           p=10           p=100
100           200           100           1,000          10,000
1,000         2,000         1,000         10,000         100,000
10,000        40,000        10,000        100,000        1,000,000
100,000       600,000       100,000       1,000,000      10,000,000
1,000,000     8,000,000     1,000,000     10,000,000     100,000,000
10,000,000    80,000,000    10,000,000    100,000,000    1,000,000,000

p: # of records per page. B=1,000 and block size=32 for sorting. p=100 is the more realistic value.
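A rough sketch of how the two cost columns can be derived. The cost model is an assumption, not something stated on the slide: initial runs of B pages, a merge fan-in of floor(B / block size) - 1 with blocked I/O, a sorting cost of 2N per pass, and one page I/O per record (p * N in total) when retrieving through an unclustered index. Under these assumptions the sketch reproduces the Sorting column above.

```python
def blocked_sort_cost(n_pages: int, buffers: int = 1000, block: int = 32) -> int:
    """I/O cost of external sorting with blocked I/O (assumed cost model)."""
    fan_in = buffers // block - 1          # assumption: one block reserved for output
    runs = -(-n_pages // buffers)          # ceil(N / B) initial runs
    passes = 1
    while runs > 1:
        runs = -(-runs // fan_in)
        passes += 1
    return 2 * n_pages * passes            # read + write every page in each pass


def unclustered_index_cost(n_pages: int, records_per_page: int) -> int:
    """Worst case: every record fetched through the index costs one page I/O."""
    return n_pages * records_per_page


sizes = [100, 1_000, 10_000, 100_000, 1_000_000, 10_000_000]
sort_costs = [blocked_sort_cost(n) for n in sizes]
index_costs = [unclustered_index_cost(n, 100) for n in sizes]
```

With p=100, the unclustered index is more expensive than sorting at every file size in the table, which is why traversing an unclustered B+ tree to sort "could be a very bad idea".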

Query Evaluation

Relational Operations

We will consider how to implement:

Selection (σ): selects a subset of rows from a relation.

Projection (π): deletes unwanted columns from a relation.

Join (⋈): allows us to combine two relations.

Set-difference (−): tuples in reln. 1, but not in reln. 2.

Union (∪): tuples in reln. 1 or in reln. 2.

Aggregation (SUM, MIN, etc.) and GROUP BY.

Since each op returns a relation, ops can be composed! After we cover the operations, we will discuss how to optimize queries formed by composing them.


Access Paths

A tree index matches (a conjunction of) terms that involve only attributes in a prefix of the search key.

E.g., a tree index on <a, b, c> matches the selections a=5 AND b=3, and a=5 AND b>6, but not b=3.

A hash index matches (a conjunction of) terms that includes a term of the form attribute = value for every attribute in the search key of the index.

E.g., a hash index on <a, b, c> matches a=5 AND b=3 AND c=5; but it does not match b=3, or a=5 AND b=3, or a>5 AND b=3 AND c=5.
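The two matching rules can be sketched as small predicates. This is an illustration under a simplifying assumption: a selection is represented as a list of hypothetical `(attribute, operator)` pairs, and "attributes in a prefix of the search key" is read as "the attributes used form exactly a prefix of the search key".

```python
def tree_index_matches(search_key, terms):
    """True if the attributes in `terms` form a non-empty prefix of search_key."""
    attrs = {attr for attr, _op in terms}
    return bool(attrs) and attrs == set(search_key[:len(attrs)])


def hash_index_matches(search_key, terms):
    """True if every search key attribute has an equality term."""
    eq_attrs = {attr for attr, op in terms if op == "="}
    return set(search_key) <= eq_attrs


key = ("a", "b", "c")
examples = {
    "a=5 AND b=3": [("a", "="), ("b", "=")],
    "a=5 AND b>6": [("a", "="), ("b", ">")],
    "b=3": [("b", "=")],
    "a=5 AND b=3 AND c=5": [("a", "="), ("b", "="), ("c", "=")],
    "a>5 AND b=3 AND c=5": [("a", ">"), ("b", "="), ("c", "=")],
}
tree_results = {text: tree_index_matches(key, t) for text, t in examples.items()}
hash_results = {text: hash_index_matches(key, t) for text, t in examples.items()}
```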

Access Paths Supported by B+ tree

Example: given a B+ tree whose search key is the sequence of attributes a2, a1, a3, a4:

Access path for search $\sigma_{a1>5 \,\wedge\, a2=3.0 \,\wedge\, a3=\text{'x'}}(R)$: find the first entry having a2=3.0 AND a1>5 AND a3='x' and scan the leaves from there until an entry having a2>3.0. Select the satisfying entries.

Access path for search $\sigma_{a2=3.0 \,\wedge\, a3>\text{'x'}}(R)$: locate the first entry having a2=3.0 and scan the leaves until an entry having a2>3.0. Select the satisfying entries.

No access path for search $\sigma_{a1>5 \,\wedge\, a3=\text{'x'}}(R)$: the condition does not constrain a2, the first attribute of the search key.

Choosing an Access Path

Selectivity of an access path refers to its cost: higher selectivity means lower cost (# of pages).

If several access paths cover a query, the DBMS should choose the one with the greatest selectivity.

The size of an attribute's domain is a measure of the selectivity of that attribute.

Example: for $\sigma_{CrsCode=\text{'CS305'} \,\wedge\, Grade=\text{'B'}}$, a B+ tree with search key CrsCode is more selective than a B+ tree with search key Grade.

Computing Selection

Condition: (attr op value)

No index on attr:

If rows are unsorted, cost = F: scan all data pages to find the rows satisfying the condition.

If rows are sorted on attr, cost = log2 F + (cost of scan): use binary search to locate the first data page containing a row in which (attr = value), then scan further to get all rows satisfying (attr op value).

Computing Selection

Condition: (attr op value)

B+ tree index on attr (for equality or range search):

Locate the first index entry corresponding to a row in which (attr = value); cost = depth of tree.

Clustered index: rows satisfying the condition are packed in sequence in successive data pages; scan those pages; cost depends on the number of qualifying rows.

Unclustered index: index entries with pointers to rows satisfying the condition are packed in sequence in successive index pages; scan the entries and sort the pointers to identify the table data pages with qualifying rows; each page (with at least one such row) is fetched once.

Unclustered B+ Tree Index

[Figure: a B+ tree whose index entries satisfying the condition point to data pages in the data file.]

Computing Selection

Condition: (attr = value)

Hash index on attr (for equality search only):

Hash on value; cost ≈ 1.2 (to account for a possible overflow chain) to search the (unique) bucket containing all index entries or rows satisfying the condition.

Unclustered index: sort the pointers in the index entries to identify the data pages with qualifying rows; each page (containing at least one such row) is fetched once.

Complex Selections

Conjunctions: $\sigma_{a1=x \,\wedge\, a2<y \,\wedge\, a3=z}(R)$

Use the most selective access path, or use multiple access paths.

Disjunctions: $\sigma_{(a1=x \,\vee\, a2<y) \,\wedge\, a3=z}(R)$

Convert to DNF (disjunctive normal form): $(a1=x \wedge a3=z) \vee (a2<y \wedge a3=z)$

Use a file scan if one disjunct requires a file scan.

If better access paths exist, and their combined selectivity is better than a file scan, use the better access paths; otherwise use a file scan.

Two Approaches to General Selections

First approach: find the most selective access path, retrieve tuples using it, and apply any remaining terms that don't match the index.

Most selective access path: an index or file scan that we estimate will require the fewest page I/Os.

Terms that match this index reduce the number of tuples retrieved; other terms are used to discard some retrieved tuples, but do not affect the number of tuples/pages fetched.

Consider day<8/9/94 AND bid=5 AND sid=3. A B+ tree index on day can be used; then, bid=5 and sid=3 must be checked for each retrieved tuple. Similarly, a hash index on <bid, sid> could be used; day<8/9/94 must then be checked.

Intersection of Rids

Second approach (if we have 2 or more matching indexes that use Alternatives (2) or (3) for data entries):

Get sets of rids of data records using each matching index.

Then intersect these sets of rids.

Retrieve the records and apply any remaining terms.

Consider day<8/9/94 AND bid=5 AND sid=3. If we have a B+ tree index on day and an index on sid, both using Alternative (2), we can retrieve rids of records satisfying day<8/9/94 using the first, rids of records satisfying sid=3 using the second, intersect, retrieve the records and check bid=5.
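The rid-intersection approach for this example can be sketched as follows. Everything here is illustrative: the sample records are invented, the B+ tree on day is stood in for by a sorted list searched with `bisect`, the index on sid by a dictionary, and 8/9/94 is assumed to mean August 9, 1994.

```python
import bisect
from collections import defaultdict
from datetime import date

# rid -> record (invented sample data)
records = {
    1: {"day": date(1994, 8, 1), "bid": 5, "sid": 3},
    2: {"day": date(1994, 8, 20), "bid": 5, "sid": 3},
    3: {"day": date(1994, 7, 4), "bid": 7, "sid": 3},
    4: {"day": date(1994, 6, 1), "bid": 5, "sid": 9},
}

# Stand-ins for the two indexes; both store (key, rid) data entries only.
day_index = sorted((rec["day"], rid) for rid, rec in records.items())
sid_index = defaultdict(set)
for rid, rec in records.items():
    sid_index[rec["sid"]].add(rid)


def rids_day_before(cutoff: date) -> set:
    """Range scan on the sorted day index: all rids with day < cutoff."""
    pos = bisect.bisect_left(day_index, (cutoff,))
    return {rid for _day, rid in day_index[:pos]}


# 1. Get rid sets from each matching index, 2. intersect them,
# 3. fetch the surviving records and apply the remaining term bid=5.
candidates = rids_day_before(date(1994, 8, 9)) & sid_index[3]
matching = sorted(rid for rid in candidates if records[rid]["bid"] == 5)
```

Only the records whose rids survive the intersection are fetched, so the remaining term bid=5 is checked on two records here instead of four.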

Aggregation and Duplicate Removal

The idea of aggregation is to represent a group of items by a single value, or to classify items into groups and determine one value per group.

Scalar aggregation

Aggregate functions

Aggregation and Duplicate Removal (3)

Sort-Based Aggregation Algorithm

Sorting brings items with similar characteristics together, which makes it much simpler to remove duplicate data.
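A minimal in-memory sketch of sort-based aggregation and duplicate removal. It illustrates only the idea that sorting makes equal items adjacent; the function names and sample rows are invented, and no external sorting or early aggregation during run generation is modeled.

```python
from itertools import groupby
from operator import itemgetter


def sort_aggregate(rows, key, value):
    """Sort on the grouping attribute, then sum each run of equal keys."""
    ordered = sorted(rows, key=itemgetter(key))
    return [(k, sum(r[value] for r in group))
            for k, group in groupby(ordered, key=itemgetter(key))]


def sort_distinct(items):
    """After sorting, duplicates are adjacent: keep an item only if it differs from the last one kept."""
    out = []
    for item in sorted(items):
        if not out or out[-1] != item:
            out.append(item)
    return out


rows = [{"dept": "b", "sal": 1}, {"dept": "a", "sal": 2}, {"dept": "b", "sal": 3}]
totals = sort_aggregate(rows, "dept", "sal")
unique = sort_distinct([3, 1, 3, 2, 1])
```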


Sort-Based Aggregation Algorithm

The amount of input and output (I/O) computed for this algorithm is as follows:

[Formula not preserved in the slide text.]

Here 2 is the factor that accounts for reading and writing, R is the size of the input, L1 is the number of levels that have not been affected, O is the size of the output, and W is the estimated number of runs of the algorithm.


Hash-Based Algorithm

The general idea is to partition the data being analyzed.

A table is built that essentially contains output items.

The amount of I/O for aggregation depends on the number of levels required.
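A minimal sketch of hash-based aggregation, assuming the whole hash table fits in memory. Partitioning to overflow files and recursion over several levels are deliberately left out; the function name and sample rows are invented.

```python
def hash_aggregate(rows, key, value):
    """Keep one output item (count, sum) per group in a hash table."""
    table = {}
    for row in rows:
        count, total = table.get(row[key], (0, 0))
        table[row[key]] = (count + 1, total + row[value])
    return table


rows = [{"dept": "b", "sal": 1}, {"dept": "a", "sal": 2}, {"dept": "b", "sal": 3}]
groups = hash_aggregate(rows, "dept", "sal")
```

The table holds output items, not input items, so memory is needed in proportion to the number of groups and not to the size of the input.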


Hash-Based Algorithm (2)

$2 \times \left( R \times (L + 1) - F^{L} \times \left( M - \left\lceil \frac{R'/G - M}{M - C} \right\rceil \times C \right) \times G \right)$

where L is the recursion level, R is the size of the input file, K is the number of partition files, F is the fan-out, and M is the file size at which memory overflows.


Chart: comparison of the algorithms

Binary Matching Operations

Just as elimination and aggregation are important for data sets of considerable size, it is also desirable to be able to match information across data sets. This is the main function of "matching": establishing the relationships that exist. For this purpose the join is mainly used, in the following ways.


Nested-Loops Join Algorithms

This is the simplest algorithm. For each selected item it performs a complete search of the rest of the data in order to find the matches.

A temporary file is required for the input that is being scanned.

Its performance is poor for very large data sets.
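A minimal sketch of a nested-loops join. The in-memory list `inner` stands in for the temporary file that is scanned once per outer item; the function name, the `match` predicate, and the sample data are illustrative.

```python
def nested_loops_join(outer, inner, match):
    """For every outer item, scan the whole inner input looking for matches."""
    inner = list(inner)                # stands in for the temporary file
    return [(o, i) for o in outer for i in inner if match(o, i)]


orders = [(1, "pen"), (2, "ink"), (3, "pad")]
customers = [(1, "Ana"), (3, "Luis")]
pairs = nested_loops_join(orders, customers, lambda o, c: o[0] == c[0])
comparisons = len(orders) * len(customers)   # grows with the product of the input sizes
```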


Merge-Join Algorithms

This algorithm requires the inputs to be sorted beforehand in order to produce its results; the procedure is similar to the one reviewed earlier.

Because the inputs are sorted, the algorithm requires no additional memory, except when the total size of the groups of matching items ("packets") is larger than memory.
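A minimal sketch of merge-join over two inputs that are already sorted on the join key. The only extra state is the current group of equal keys on the right input, which is the case where memory matters; the function name, the default key function, and the sample data are illustrative.

```python
def merge_join(left, right, key=lambda r: r[0]):
    """Join two lists that are both sorted on key(...)."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        kl, kr = key(left[i]), key(right[j])
        if kl < kr:
            i += 1
        elif kl > kr:
            j += 1
        else:
            # Find the end of the group of right items sharing this key.
            j_end = j
            while j_end < len(right) and key(right[j_end]) == kl:
                j_end += 1
            # Pair every left item with this key against the whole group.
            while i < len(left) and key(left[i]) == kl:
                out.extend((left[i], r) for r in right[j:j_end])
                i += 1
            j = j_end
    return out


left = [(1, "a"), (2, "b"), (2, "c"), (4, "d")]
right = [(2, "x"), (2, "y"), (3, "z"), (4, "w")]
joined = merge_join(left, right)
```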


Merge-Join Algorithms

The previous algorithm and this one can be combined to optimize the results.

Since the algorithms above need a certain amount of memory, it is convenient to make the following allocation:

$W = \frac{R}{2 \times M} + 1$, where R is the size of the input, M is the amount of memory required, and the other two are read and write constants.


Hash Join Algorithms

These algorithms start from the basic idea of building a hash table and probing that table with the items of the other input.

A drawback of this algorithm is frequent memory overflow, but a good deal of research has been done in this area to improve on the existing solutions.
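A minimal sketch of the build-and-probe idea, assuming the build input fits in memory. Overflow handling through partitioning is not shown; the function name, the key functions, and the sample data are illustrative.

```python
from collections import defaultdict


def hash_join(build, probe, build_key, probe_key):
    """Build a hash table on one input, then probe it with the other."""
    table = defaultdict(list)
    for b in build:
        table[build_key(b)].append(b)
    # .get avoids creating empty buckets for probe keys that have no match.
    return [(b, p) for p in probe for b in table.get(probe_key(p), [])]


customers = [(1, "Ana"), (3, "Luis")]
orders = [(1, "pen"), (2, "ink"), (3, "pad"), (1, "cap")]
matches = hash_join(customers, orders, lambda c: c[0], lambda o: o[0])
```

Building on the smaller input keeps the hash table small, which matters because the table is what has to fit in memory.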


The partitioning performed by the algorithm can be interpreted as follows

Comparison of Methods

Contributions to SQL Server 7
