Query Evaluation Techniques for Large Databases**
By Goetz Graefe
Prepared by: Edwin Andrés Bernal López
Claudia Jeanneth Becerra Cortés
Course: Advanced Topics in Databases (Tópicos Avanzados de Bases de Datos)
Bogotá, March 23, 2006
**Portland State University, Computer Science Department, P.O. Box 751, Portland, Oregon 97207-0751. Received January 1992; final revision accepted February 1993. Published in ACM Computing Surveys, Vol. 25, No. 2, June 1993.
List of Publications by Goetz Graefe
http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/g/Graefe:Goetz.html
2004
59 Goetz Graefe, Michael J. Zwilling: Transaction support for indexed views. SIGMOD Conference 2004
58 Goetz Graefe: Write-Optimized B-Trees. VLDB 2004: 672-683
57 Conor Cunningham, Goetz Graefe, César A. Galindo-Legaria: PIVOT and UNPIVOT: Optimization and Execution Strategies in an RDBMS. VLDB 2004: 998-1009
2003
56 Goetz Graefe: Executing Nested Queries. BTW 2003: 58-77
55 Goetz Graefe: Partitioned B-trees - a user's guide. BTW 2003: 668-671
54 Goetz Graefe: Sorting And Indexing With Partitioned B-Trees. CIDR 2003
2001
53 Goetz Graefe, Per-Åke Larson: B-Tree Indexes and CPU Caches. ICDE 2001: 349-358
52 William O'Connell, Andrew Witkowski, Goetz Graefe: Collaborative Analytical Processing - Dream or Reality? (Panel abstract). VLDB 2001: 613, presented in the framework of the 27th International Conference on Very Large Data Bases, VLDB '01
51 Sameet Agarwal, José A. Blakeley, Thomas Casey, Kalen Delaney, César A. Galindo-Legaria, Goetz Graefe, Michael Rys, Michael J. Zwilling: Microsoft SQL Server (Chapter 27). Database System Concepts, 4th Edition. 2001: 969-1006
2000
50 Goetz Graefe: Dynamic Query Evaluation Plans: Some Course Corrections? IEEE Data Eng. Bull. 23(2): 3-6 (2000)
1999
49 Goetz Graefe: The Value of Merge-Join and Hash-Join in SQL Server. VLDB 1999: 250-253
48 Surajit Chaudhuri, Eric Christensen, Goetz Graefe, Vivek R. Narasayya, Michael J. Zwilling: Self-Tuning Technology in Microsoft SQL Server. IEEE Data Eng. Bull. 22(2): 20-26 (1999)
1998
47 Goetz Graefe: The New Database Imperatives. ICDE 1998: 69-72
46 Goetz Graefe, Usama M. Fayyad, Surajit Chaudhuri: On the Efficient Gathering of Sufficient Statistics for Classification from Large SQL Databases. KDD 1998: 204-208
45 Per-Åke Larson, Goetz Graefe: Memory Management During Run Generation in External Sorting. SIGMOD Conference 1998: 472-483
44 Goetz Graefe, Ross Bunker, Shaun Cooper: Hash Joins and Hash Teams in Microsoft SQL Server. VLDB 1998: 86-97
43 Jim Gray, Goetz Graefe: The Five-Minute Rule Ten Years Later, and Other Computer Storage Rules of Thumb. CoRR cs.DB/9809005 (1998)
1997
42 Jim Gray, Goetz Graefe: The Five-Minute Rule Ten Years Later, and Other Computer Storage Rules of Thumb. SIGMOD Record 26(4): 63-68 (1997)
1996
41 Goetz Graefe: The Microsoft Relational Engine. ICDE 1996: 160-161
40 Goetz Graefe: Iterators, Schedulers, and Distributed-memory Parallelism. Softw., Pract. Exper. 26(4): 427-452 (1996)
1995
39 Diane L. Davison, Goetz Graefe: Dynamic Resource Brokering for Multi-User Query Execution. SIGMOD Conference 1995: 281-292
38 Goetz Graefe, Richard L. Cole: Fast Algorithms for Universal Quantification in Large Databases. ACM Trans. Database Syst. 20(2): 187-236 (1995)
37 Goetz Graefe: The Cascades Framework for Query Optimization. IEEE Data Eng. Bull. 18(3): 19-29 (1995)
36 Goetz Graefe: Letter from the Special Issue Editor. IEEE Data Eng. Bull. 18(3): 2 (1995)
35 Patrick E. O'Neil, Goetz Graefe: Multi-Table Joins Through Bitmapped Join Indices. SIGMOD Record 24(3): 8-11 (1995)
1994
34 Goetz Graefe: Sort-Merge-Join: An Idea Whose Time Has(h) Passed? ICDE 1994: 406-417
33 Richard L. Cole, Goetz Graefe: Optimization of Dynamic Query Evaluation Plans. SIGMOD Conference 1994: 150-160
32 Diane L. Davison, Goetz Graefe: Memory-Contention Responsive Hash Joins. VLDB 1994: 379-390
31 Goetz Graefe: Volcano - An Extensible and Parallel Query Evaluation System. IEEE Trans. Knowl. Data Eng. 6(1): 120-135 (1994)
30 Goetz Graefe, Ann Linville, Leonard D. Shapiro: Sort versus Hash Revisited. IEEE Trans. Knowl. Data Eng. 6(6): 934-944 (1994)
1993
29 Goetz Graefe, Richard L. Cole, Diane L. Davison: Dynamic Techniques for Very Complex Database Queries. FMLDO 1993: 139-142
28 Richard L. Cole, Goetz Graefe: Dynamic Plan Optimization. FMLDO 1993: 45-58
27 Goetz Graefe, William J. McKenna: The Volcano Optimizer Generator: Extensibility and Efficient Search. ICDE 1993: 209-218
26 José A. Blakeley, William J. McKenna, Goetz Graefe: Experiences Building the Open OODB Query Optimizer. SIGMOD Conference 1993: 287-296
25 Richard H. Wolniewicz, Goetz Graefe: Algebraic Optimization of Computations over Scientific Databases. VLDB 1993: 13-24
24 Goetz Graefe: Query Evaluation Techniques for Large Databases. ACM Comput. Surv. 25(2): 73-170 (1993)
23 Richard H. Wolniewicz, Goetz Graefe: Algebraic Optimization of Computations over Scientific Databases. IEEE Data Eng. Bull. 16(1): 48-51 (1993)
22 Goetz Graefe: Letter from the Special Issue Editor. IEEE Data Eng. Bull. 16(4): 3 (1993)
21 Goetz Graefe, Diane L. Davison: Encapsulation of Parallelism and Architecture-Independence in Extensible Database Query Execution. IEEE Trans. Software Eng. 19(8): 749-764 (1993)
20 Goetz Graefe: Options in Physical Database Design. SIGMOD Record 22(3): 76-83 (1993)
19 David Maier, Lois M. L. Delcambre, Calton Pu, Jonathan Walpole, Goetz Graefe, Leonard D. Shapiro: Database Research at the Data-Intensive Systems Center. SIGMOD Record 22(4): 81-86 (1993)
1992
18 David Maier, Goetz Graefe, Leonard D. Shapiro, Scott Daniels, Thomas Keller, Bennet Vance: Issues in Distributed Object Assembly. IWDOM 1992: 165-181
17 Goetz Graefe, Shreekant S. Thakkar: Tuning a Parallel Database Algorithm on a Shared-memory Multiprocessor. Softw., Pract. Exper. 22(7): 495-517 (1992)
1991
16 Thomas Keller, Goetz Graefe, David Maier: Efficient Assembly of Complex Objects. SIGMOD Conference 1991: 148-157
15 Michael J. Carey, David J. DeWitt, Daniel Frank, Goetz Graefe, Joel E. Richardson, Eugene J. Shekita, M. Muralikrishna: The Architecture of the EXODUS Extensible DBMS. On Object-Oriented Database System 1991: 231-256
14 Goetz Graefe, Richard L. Cole, Diane L. Davison, William J. McKenna, Richard H. Wolniewicz: Extensible Query Optimization and Parallel Execution in Volcano. Query Processing for Advanced Database Systems, Dagstuhl 1991: 305-335
13 David Maier, Scott Daniels, Thomas Keller, Bennet Vance, Goetz Graefe, William J. McKenna: Challenges for Query Processing in Object-Oriented Databases. Query Processing for Advanced Database Systems, Dagstuhl 1991: 337-380
12 Scott Daniels, Goetz Graefe, Thomas Keller, David Maier, Duri Schmidt, Bennet Vance: Query Optimization in Revelation, an Overview. IEEE Data Eng. Bull. 14(2): 58-62 (1991)
11 Goetz Graefe: Heap-Filter Merge Join: A New Algorithm For Joining Medium-Size Inputs. IEEE Trans. Software Eng. 17(9): 979-982 (1991)
1990
10 Goetz Graefe: Encapsulation of Parallelism in the Volcano Query Processing System. SIGMOD Conference 1990: 102-111
1989
9 Goetz Graefe: Relational Division: Four Algorithms and Their Performance. ICDE 1989: 94-101
8 Goetz Graefe, Karen Ward: Dynamic Query Evaluation Plans. SIGMOD Conference 1989: 358-366
1988
7 Goetz Graefe, David Maier: Query Optimization in Object-Oriented Database Systems: A Prospectus. OODBS 1988: 358-363
1987
6 Goetz Graefe, David J. DeWitt: The EXODUS Optimizer Generator. SIGMOD Conference 1987: 160-172
5 Goetz Graefe: Rule-Based Query Optimization in Extensible Database Systems. Univ. of Wisconsin-Madison 1987
1986
4 Michael J. Carey, David J. DeWitt, Daniel Frank, Goetz Graefe, M. Muralikrishna, Joel E. Richardson, Eugene J. Shekita: The Architecture of the EXODUS Extensible DBMS. OODBS 1986: 52-65
3 David J. DeWitt, Robert H. Gerber, Goetz Graefe, Michael L. Heytens, Krishna B. Kumar, M. Muralikrishna: GAMMA - A High Performance Dataflow Database Machine. VLDB 1986: 228-237
2 Goetz Graefe: Software Modularization with the EXODUS Optimizer Generator. IEEE Database Eng. Bull. 9(4): 37-45 (1986)
1984
1 Michael J. Carey, David J. DeWitt, Goetz Graefe: Mechanisms for Concurrency Control and Recovery in Prolog - A Proposal. Expert Database Workshop 1984: 271-291
State of the Art in Query Processing, 1993
Bulletin of the Technical Committee on Data Engineering, December 1993, Vol. 16 No. 4, IEEE Computer Society
Special Issue on Query Processing in Commercial Database Systems
Letter from the Special Issue Editor . . . Goetz Graefe
Query Optimization in the IBM DB2 Family . . . Peter Gassner and Guy Lohman
Query Processing in the IBM Application System 400 . . . Richard L. Cole, Mark J. Anderson
Query Processing in NonStop SQL . . . A. Chen, Y-F Kao, M. Pong, D. Shak, S. Sharma, J. Vaishnav
Query Processing in DEC Rdb: Major Issues and Future Challenges . . . Gennady Antoshenkov
Letter from the Editor-in-Chief
“… Goetz Graefe, our issue editor, has succeeded in overcoming these difficulties. He has collected four papers from prominent database vendors. These papers introduce us to the inside world of "real" query processing.”
Letter from the Special Issue Editor
“… Second, in some aspects of query processing, the industrial reality has bypassed academic research. By asking leaders in the industrial field to summarize their work, I hope that this issue is a snapshot of the current state of the art. Undoubtedly, some researchers will find inspirations for new, relevant work of their own in these articles.”
25th International Conference on Very Large Data Bases
Edinburgh, Scotland, UK, 7-10 September 1999. http://www.dcs.napier.ac.uk/~vldb99/
VLDB Conference Talks
Contributions to SQL Server 7
Goetz Graefe: The Value of Merge-Join and Hash-Join in SQL Server
SQL Server 7 added many new join strategies. Prior releases of SQL Server have been successful at transaction processing and decision support workloads with neither merge join nor hash join, relying entirely on nested loops and index nested loops join. Given this fact, one needs to ask how much the additional join algorithms improve performance. In a pure OLTP workload that requires only record-to-record navigation, intuition and experience suggest that index nested loop join is sufficient. For a DSS workload, however, the question is much more complex. To answer this question, we analysed TPC-D query performance using an internal build of SQL Server with merge-join and hash-join enabled and disabled. Many previous studies have compared join algorithms, but always for only a few isolated queries and presuming a fixed physical database design. Since physical database design has a major impact on our question, we analysed TPC-D performance for multiple indexing schemes, a simplistic and an optimised physical database design. The latter was optimised specifically for the workload, the available disk space, and the available algorithms using SQL Server's new "index tuning wizard".
Adding "PIVOT" to SQL
Proceedings of the 30th VLDB Conference, Toronto, Canada, 2004
PIVOT and UNPIVOT: Optimization and Execution Strategies in an RDBMS
Conor Cunningham, César A. Galindo-Legaria, Goetz Graefe
Microsoft Corporation, One Microsoft Way, Redmond, WA 98052 USA
Abstract
“… PIVOT and UNPIVOT, two operators on tabular data that exchange rows and columns, enable data transformations useful in data modeling, data analysis, and data presentation. They can quite easily be implemented inside a query processor, much like select, project, and join. Such a design provides opportunities for better performance, both during query optimization and query execution. We discuss query optimization and execution implications of this integrated design and evaluate the performance of this approach using a prototype implementation in Microsoft SQL Server.”
Table of Contents of the Paper (Part 1)
INTRODUCTION
1. ARCHITECTURE OF QUERY EXECUTION ENGINES
2. SORTING AND HASHING
2.1. Sorting
2.2. Hashing
3. DISK ACCESS
3.1. File Scans
3.2. Associative Access Using Indices
3.3. Buffer Management
4. AGGREGATION AND DUPLICATE REMOVAL
4.1. Aggregation Algorithm Based on Nested Loops
4.2. Aggregation Algorithms Based on Sorting
4.3. Aggregation Algorithms Based on Hashing
4.4. A Rough Performance Comparison
4.5. Additional Remarks on Aggregation
5. BINARY MATCHING OPERATIONS
5.1. Nested-Loops Join Algorithms
5.2. Merge-Join Algorithms
5.3. Hash Join Algorithms
5.4. Pointer-Based Joins
5.5. Rough Performance Comparison
6. UNIVERSAL QUANTIFICATION
7. DUALITY OF SORT- AND HASH-BASED QUERY PROCESSING ALGORITHMS
Table of Contents of the Paper (Part 2)
8. EXECUTION OF COMPLEX QUERY PLANS
9. MECHANISMS FOR PARALLEL QUERY EXECUTION
9.1. Parallel versus Distributed Database Systems
9.2. Forms of Parallelism
9.3. Implementation Strategies
9.4. Load Balancing and Skew
9.5. Architectures and Architecture Independence
10. PARALLEL ALGORITHMS
10.1. Parallel Selections and Updates
10.2. Parallel Sorting
10.3. Parallel Aggregation and Duplicate Removal
10.4. Parallel Joins and Other Binary Matching Operations
10.5. Parallel Universal Quantification
11. NON STANDARD QUERY PROCESSING ALGORITHMS
11.1. Nested Relations
11.2. Temporal and Scientific Database Management
11.3. Object-oriented Database Systems
11.4. More Control Operators
12. ADDITIONAL TECHNIQUES FOR PERFORMANCE IMPROVEMENT
12.1. Precomputation and Derived Data
12.2. Data Compression
12.3. Surrogate Processing
12.4. Bit Vector Filtering
12.5. Specialized Hardware
SUMMARY AND OUTLOOK
INTRODUCTION
“… While many, although not all, techniques discussed in this paper have been developed in the context of relational database systems, most of them are applicable to and useful in the query processing facility for any database management system and any data model, provided the data model permits queries over "bulk" data types such as sets and lists.”
“This survey discusses a large variety of query execution techniques that must be considered when designing and implementing the query execution module of a new database management system: algorithms and their execution costs, sorting versus hashing, parallelism, resource allocation and scheduling issues in complex queries, special operations for emerging database application domains such as statistical and scientific databases, and general performance-enhancing techniques such as precomputation and compression.”
“There are many aspects to the OODB query optimization problem that can benefit from the already proven relational query-optimization technology. However many key features of OODB languages present new and difficult problems not adequately addressed by this technology. These features include object identity, methods, encapsulation, subtype hierarchy, user-defined type constructors, large multimedia objects, multiple collection types, arbitrary nesting of collections, and nesting of query expressions.” The lambda-DB: An ODMG-Based Object-Oriented project at the University of Texas at Arlington. http://lambda.uta.edu/ldb/doc/overview.html
INTRODUCTION
“Query optimization is a special form of planning, employing techniques from artificial intelligence such as plan representation, search including directed search and pruning, dynamic programming, branch-and-bound algorithms, etc. The query execution engine is a collection of query execution operators and mechanisms for operator communication and synchronization; it employs concepts from algorithm design, operating systems, networks, and parallel and distributed computation. The facilities of the query execution engine define the space of possible plans that can be chosen by the query optimizer.”
INTRODUCTION
INTROD.: Query Processing Steps [2]
1. ARCHITECTURE OF QUERY EXECUTION ENGINES
“A complete query execution engine consists of a collection of operators and mechanisms to execute complex expressions using multiple operators, including multiple occurrences of the same operator. Taken as a whole, the query processing algorithms form an algebra which we call the physical algebra of a database system.”
1. ARCHITECTURE OF QUERY EXECUTION ENGINES
“The advantages of a uniform iterator interface for all query processing algorithms are obvious: it permits arbitrary combination of all operators including new ones in extensible systems, it permits arbitrarily complex plans, and it makes the query optimizer simpler to design and implement.”
2. SORTING AND HASHING
“Before discussing specific algorithms, two general approaches to managing sets of data are introduced. The purpose of many query-processing algorithms is to perform some kind of matching, i.e., bringing items that are “alike” together and performing some operation on them. There are two basic approaches used for this purpose, sorting and hashing. This pair permeates many aspects of query processing, from indexing and clustering over aggregation and join algorithms to methods for parallelizing database operations.”
Access Path
Algorithm + data structure used to locate rows satisfying some condition
File scan: can be used for any condition
Hash: equality search; all search key attributes of the hash index are specified in the condition
B+ tree: equality or range search; a prefix of the search key attributes is specified in the condition
Binary search: relation sorted on a sequence of attributes, and some prefix of the sequence is specified in the condition
2. ACCESS PATHS
Sorting and Hashing
“… We will conclude the discussion of individual query processing by outlining the many existing similarities and dualities of sort and hash-based query-processing algorithms as well as the points where the two types of algorithms differ. The purpose is to contribute to a better understanding of the two approaches and their tradeoffs. We try to discuss the approaches in general terms, ignoring whether the algorithms are used for relational join, union, intersection, aggregation, duplicate removal, or other operations.”
General External Merge Sort
To sort a file with N pages using B buffer pages:
Pass 0: use B buffer pages. Produce $\lceil N/B \rceil$ sorted runs of B pages each.
Pass 1, 2, …: merge B-1 runs at a time.
Cost of External Merge Sort
Number of passes: $1 + \lceil \log_{B-1} \lceil N/B \rceil \rceil$
Cost = 2N * (# of passes)
E.g., with 5 buffer pages, to sort a 108-page file:
Pass 0: $\lceil 108/5 \rceil$ = 22 sorted runs of 5 pages each (last run is only 3 pages)
Pass 1: $\lceil 22/4 \rceil$ = 6 sorted runs of 20 pages each (last run is only 8 pages)
Pass 2: 2 sorted runs, 80 pages and 28 pages
Pass 3: sorted file of 108 pages
Number of Passes of External Sort
N               B=3   B=5   B=9   B=17  B=129  B=257
100             7     4     3     2     1      1
1,000           10    5     4     3     2      2
10,000          13    7     5     4     2      2
100,000         17    9     6     5     3      3
1,000,000       20    10    7     5     3      3
10,000,000      23    12    8     6     4      3
100,000,000     26    14    9     7     4      4
1,000,000,000   30    15    10    8     5      4
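The pass counts in the table follow directly from the formula on the previous slide. A minimal sketch, assuming the cost model stated there (pass 0 produces runs of B pages, each later pass merges B-1 runs, and every pass reads and writes all N pages); the function names are illustrative and not taken from the paper:

```python
def external_sort_passes(n_pages: int, n_buffers: int) -> int:
    """Count the passes of an external merge sort: pass 0 plus the merge passes."""
    if n_pages <= 0:
        return 0
    if n_buffers < 3:
        raise ValueError("need at least 3 buffer pages (2 inputs + 1 output)")
    runs = -(-n_pages // n_buffers)          # ceil(N / B) sorted runs after pass 0
    passes = 1
    while runs > 1:
        runs = -(-runs // (n_buffers - 1))   # each merge pass combines B-1 runs
        passes += 1
    return passes


def external_sort_cost(n_pages: int, n_buffers: int) -> int:
    """Total page I/Os: every pass reads and writes each page once."""
    return 2 * n_pages * external_sort_passes(n_pages, n_buffers)
```

For the worked example (N = 108, B = 5) the loop goes 22 runs, then 6, then 2, then 1, which gives the four passes listed on the slide. Integer arithmetic is used instead of a floating-point logarithm so that exact powers of B-1 are not rounded the wrong way.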
Double Buffering
To reduce the wait time for an I/O request to complete, can prefetch into a "shadow block".
Potentially more passes; in practice, most files are still sorted in 2-3 passes.
[Figure: B main memory buffers, k-way merge. Input blocks INPUT 1 … INPUT k and the OUTPUT block, each of block size b, are paired with shadow blocks INPUT 1' … INPUT k' and OUTPUT', between the input disk and the output disk.]
Sorting Records!
Sorting has become a blood sport!
  Parallel sorting is the name of the game ...
Datamation: sort 1M records of size 100 bytes
  Typical DBMS: 15 minutes
  World record: 3.5 seconds (12-CPU SGI machine, 96 disks, 2GB of RAM)
New benchmarks proposed:
  Minute Sort: How many can you sort in 1 minute?
  Dollar Sort: How many can you sort for $1.00?
Using B+ Trees for Sorting
Scenario: Table to be sorted has a B+ tree index on the sorting column(s).
Idea: Can retrieve records in order by traversing leaf pages.
Is this a good idea? Cases to consider:
  B+ tree is clustered: Good idea!
  B+ tree is not clustered: Could be a very bad idea!
Clustered B+ Tree Used for Sorting
Unclustered B+ Tree Used for Sorting
External Sorting vs. Unclustered Index
N            Sorting      p=1          p=10          p=100
100          200          100          1,000         10,000
1,000        2,000        1,000        10,000        100,000
10,000       40,000       10,000       100,000       1,000,000
100,000      600,000      100,000      1,000,000     10,000,000
1,000,000    8,000,000    1,000,000    10,000,000    100,000,000
10,000,000   80,000,000   10,000,000   100,000,000   1,000,000,000
p: # of records per page
B=1,000 and block size=32 for sorting
p=100 is the more realistic value.
Query Evaluation
Relational Operations
We will consider how to implement:
  Selection ($\sigma$): Selects a subset of rows from relation.
  Projection ($\pi$): Deletes unwanted columns from relation.
  Join ($\bowtie$): Allows us to combine two relations.
  Set-difference ($-$): Tuples in reln. 1, but not in reln. 2.
  Union ($\cup$): Tuples in reln. 1 or in reln. 2.
  Aggregation (SUM, MIN, etc.) and GROUP BY
Since each op returns a relation, ops can be composed! After we cover the operations, we will discuss how to optimize queries formed by composing them.
1. ARCHITECTURE OF QUERY EXECUTION ENGINES
Access Paths
A tree index matches (a conjunction of) terms that involve only attributes in a prefix of the search key.
  E.g., a tree index on <a, b, c> matches the selection a=5 AND b=3, and a=5 AND b>6, but not b=3.
A hash index matches (a conjunction of) terms that has a term attribute = value for every attribute in the search key of the index.
  E.g., a hash index on <a, b, c> matches a=5 AND b=3 AND c=5; but it does not match b=3, or a=5 AND b=3, or a>5 AND b=3 AND c=5.
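The two matching rules can be checked mechanically. The sketch below is a simplified reading of the slide, not an optimizer's actual logic: a conjunction is modelled as a list of hypothetical `(attribute, operator)` pairs, and the function names are illustrative.

```python
def tree_index_prefix(search_key, terms):
    """Return the leading search-key attributes constrained by the conjunction.

    An empty result means the tree index offers no access path for these terms.
    """
    constrained = {attr for attr, _op in terms}
    prefix = []
    for attr in search_key:
        if attr not in constrained:
            break                      # the usable prefix ends at the first gap
        prefix.append(attr)
    return prefix


def hash_index_matches(search_key, terms):
    """A hash index needs an equality term for every attribute of its search key."""
    equality_attrs = {attr for attr, op in terms if op == "="}
    return all(attr in equality_attrs for attr in search_key)
```

With the search key `("a", "b", "c")`, the terms `a=5 AND b=3` yield the prefix `["a", "b"]`, while `b=3` alone yields an empty prefix, as in the examples above.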
Access Paths Supported by B+ tree
Example: Given a B+ tree whose search key is the sequence of attributes a2, a1, a3, a4
Access path for search $\sigma_{a1>5 \wedge a2=3.0 \wedge a3=\text{'x'}}(R)$: find the first entry having a2=3.0 ∧ a1>5 ∧ a3='x' and scan leaves from there until an entry having a2>3.0. Select satisfying entries.
Access path for search $\sigma_{a2=3.0 \wedge a3>\text{'x'}}(R)$: locate the first entry having a2=3.0 and scan leaves until an entry having a2>3.0. Select satisfying entries.
No access path for search $\sigma_{a1>5 \wedge a3=\text{'x'}}(R)$
Choosing an Access Path
Selectivity of an access path refers to its cost
  Higher selectivity means lower cost (# pages)
If several access paths cover a query, the DBMS should choose the one with the greatest selectivity
Size of the domain of an attribute is a measure of the selectivity of the domain
  Example: for $\sigma_{CrsCode=\text{'CS305'} \wedge Grade=\text{'B'}}$, a B+ tree with search key CrsCode is more selective than a B+ tree with search key Grade
Computing Selection
condition: (attr op value)
No index on attr:
  If rows unsorted, cost = F
    Scan all data pages to find rows satisfying the condition
  If rows sorted on attr, cost = log2 F + (cost of scan)
    Use binary search to locate the first data page containing a row in which (attr = value)
    Scan further to get all rows satisfying (attr op value)
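The sorted case can be illustrated with a small in-memory stand-in. This sketch assumes the rows are dictionaries held in a Python list already sorted on the attribute; a real system would binary-search over data pages rather than individual rows, and the function names are illustrative.

```python
def first_at_or_after(rows, attr, value):
    """Binary search: index of the first row whose attr is >= value."""
    lo, hi = 0, len(rows)
    while lo < hi:
        mid = (lo + hi) // 2
        if rows[mid][attr] < value:
            lo = mid + 1
        else:
            hi = mid
    return lo


def select_equal(rows, attr, value):
    """Locate the first qualifying row, then scan forward while attr = value."""
    result = []
    i = first_at_or_after(rows, attr, value)
    while i < len(rows) and rows[i][attr] == value:
        result.append(rows[i])
        i += 1
    return result


def select_greater_equal(rows, attr, value):
    """Range condition attr >= value: everything from the located position on."""
    return rows[first_at_or_after(rows, attr, value):]
```

The search step costs a logarithmic number of probes and the scan step is proportional to the number of qualifying rows, which mirrors the "log2 F + (cost of scan)" estimate above.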
Computing Selection
condition: (attr op value)
B+ tree index on attr (for equality or range search):
  Locate the first index entry corresponding to a row in which (attr = value); cost = depth of tree
  Clustered index: rows satisfying the condition are packed in sequence in successive data pages; scan those pages; cost depends on the number of qualifying rows
  Unclustered index: index entries with pointers to rows satisfying the condition are packed in sequence in successive index pages; scan the entries and sort the pointers to identify the table data pages with qualifying rows; each page (with at least one such row) is fetched once
Unclustered B+ Tree Index
[Figure: a B+ tree whose index entries satisfying the condition point to data pages scattered across the data file.]
Computing Selection
condition: (attr = value)
Hash index on attr (for equality search only):
  Hash on value; cost ≈ 1.2 (to account for a possible overflow chain) to search the (unique) bucket containing all index entries or rows satisfying the condition
  Unclustered index: sort the pointers in the index entries to identify the data pages with qualifying rows; each page (containing at least one such row) is fetched once
Complex Selections
Conjunctions: $\sigma_{a1=x \wedge a2<y \wedge a3=z}(R)$
  Use the most selective access path
  Use multiple access paths
Disjunction: $\sigma_{(a1=x \vee a2<y) \wedge (a3=z)}(R)$
  DNF (disjunctive normal form): (a1=x ∧ a3=z) ∨ (a2<y ∧ a3=z)
  Use a file scan if one disjunct requires a file scan
  If better access paths exist, and the combined selectivity is better than a file scan, use the better access paths; else use a file scan
Two Approaches to General Selections
First approach: Find the most selective access path, retrieve tuples using it, and apply any remaining terms that don't match the index:
  Most selective access path: an index or file scan that we estimate will require the fewest page I/Os.
  Terms that match this index reduce the number of tuples retrieved; other terms are used to discard some retrieved tuples, but do not affect the number of tuples/pages fetched.
  Consider day<8/9/94 AND bid=5 AND sid=3. A B+ tree index on day can be used; then, bid=5 and sid=3 must be checked for each retrieved tuple. Similarly, a hash index on <bid, sid> could be used; day<8/9/94 must then be checked.
Intersection of Rids
Second approach (if we have 2 or more matching indexes that use Alternatives (2) or (3) for data entries):
  Get sets of rids of data records using each matching index.
  Then intersect these sets of rids.
  Retrieve the records and apply any remaining terms.
  Consider day<8/9/94 AND bid=5 AND sid=3. If we have a B+ tree index on day and an index on sid, both using Alternative (2), we can retrieve rids of records satisfying day<8/9/94 using the first, rids of recs satisfying sid=3 using the second, intersect, retrieve records and check bid=5.
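The rid-intersection steps can be sketched on a toy table. Everything here is an assumption made for illustration: the table is a dictionary from rid to record, `rids_where` stands in for an index lookup that returns rids without fetching records, and 8/9/94 is read as 9 August 1994.

```python
from datetime import date


def rids_where(table, predicate):
    """Stand-in for an index lookup: the set of rids whose record qualifies."""
    return {rid for rid, rec in table.items() if predicate(rec)}


def intersect_and_fetch(table, rid_sets, residual):
    """Intersect the rid sets, fetch the surviving records, apply remaining terms."""
    rids = set.intersection(*rid_sets)
    # Fetching in rid order models visiting each data page at most once.
    return [table[rid] for rid in sorted(rids) if residual(table[rid])]


table = {
    1: {"day": date(1994, 8, 1), "bid": 5, "sid": 3},
    2: {"day": date(1994, 8, 10), "bid": 5, "sid": 3},
    3: {"day": date(1994, 7, 1), "bid": 4, "sid": 3},
    4: {"day": date(1994, 7, 2), "bid": 5, "sid": 2},
}
by_day = rids_where(table, lambda r: r["day"] < date(1994, 8, 9))
by_sid = rids_where(table, lambda r: r["sid"] == 3)
answer = intersect_and_fetch(table, [by_day, by_sid], lambda r: r["bid"] == 5)
```

Only the records whose rids survive the intersection are fetched, so the term bid=5 is evaluated on two records instead of four.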
Aggregation and Duplicate Removal
The idea of aggregation is to represent a group of items by a single value, or to classify items into groups and determine one value per group.
Scalar aggregation
Aggregate functions
Aggregation and Duplicate Removal (3)
Aggregation Algorithm Based on Sorting
Sorting brings items with similar characteristics together, which makes removing duplicate data much simpler.
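A minimal in-memory sketch of the sort-based idea: once the items are sorted, equal keys are adjacent, so one sequential scan can aggregate each group or drop duplicates. The function names are illustrative, and the I/O behaviour of a real external sort is not modelled.

```python
from itertools import groupby
from operator import itemgetter


def sort_aggregate(rows, key, value, combine=sum):
    """Sort on the grouping key, then fold each run of equal keys into one value."""
    ordered = sorted(rows, key=itemgetter(key))
    return [
        (k, combine(row[value] for row in group))
        for k, group in groupby(ordered, key=itemgetter(key))
    ]


def sort_distinct(items):
    """Duplicate removal: after sorting, duplicates sit next to each other."""
    return [k for k, _group in groupby(sorted(items))]
```

Passing `combine=max` or `combine=min` gives other aggregate functions with the same scan.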
Aggregation and Duplicate Removal (4)
Aggregation Algorithm Based on Sorting
The amount of input and output (I/O) calculated for this algorithm is as follows:
where 2 is the factor that accounts for reading and writing, R is the size of the input, L1 is the number of levels that have not been affected, O is the size of the output, and W is the estimated number of runs.
Aggregation and Duplicate Removal (5)
Algorithm Based on Hashing
The general idea is to partition the data being analyzed.
A table is built that essentially contains output items.
The amount of I/O for the aggregation depends on the number of levels required.
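A minimal in-memory sketch of hash-based aggregation with one level of partitioning. The hash table holds one entry per group, that is, output items rather than input items. Partition files are modelled as Python lists, the fan-out of 4 is an arbitrary assumption, and overflow handling is not modelled.

```python
def hash_aggregate(rows, key, value):
    """Build a table with one running sum per group."""
    table = {}
    for row in rows:
        k = row[key]
        table[k] = table.get(k, 0) + row[value]
    return table


def partitioned_hash_aggregate(rows, key, value, fan_out=4):
    """Split the input by hash value, then aggregate each partition on its own."""
    partitions = [[] for _ in range(fan_out)]
    for row in rows:
        partitions[hash(row[key]) % fan_out].append(row)
    result = {}
    for part in partitions:
        # A given key lands in exactly one partition, so the results never collide.
        result.update(hash_aggregate(part, key, value))
    return result
```

Because each partition can be processed independently, only one partition's table needs to fit in memory at a time; applying the split recursively corresponds to the additional levels mentioned above.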
Aggregation and Duplicate Removal (6)
Algorithm Based on Hashing (2)
$2 \times \left( R \times (L + 1) - F^{L} \times \left( M - \left[ (R'/G - M)/(M - C) \right] \times C \right) \times G \right)$
where L is the recursion level, R is the size of the input files, K is the number of partition files, F is the fan-out, and M is the file size at which memory overflows.
Aggregation and Duplicate Removal (7)
Chart: comparison of the algorithms
Binary Matching Operations
Just as elimination and aggregation are important for data sets of considerable size, it is also desirable to be able to match up information. This is the main function of "matching": establishing the relationships that exist. For this purpose, the join is mainly used, in the following ways.
Binary Matching Operations
Join Algorithms Based on Nested Loops
This is the simplest algorithm.
For each selected item, it performs a complete search of the rest of the data in order to find the matches.
A temporary file is required for the input being scanned.
Obviously, the algorithm performs poorly for very large data sets.
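A minimal sketch of the naive nested-loops join described above, with both inputs held as in-memory lists of dictionaries. The function and parameter names are illustrative.

```python
def nested_loops_join(outer, inner, outer_key, inner_key):
    """For every outer row, scan the whole inner input looking for matches."""
    result = []
    for o in outer:
        for i in inner:                 # complete scan of the inner input
            if o[outer_key] == i[inner_key]:
                result.append({**o, **i})
    return result
```

The inner input is scanned once per outer row, so the work grows with the product of the two input sizes, which is why the algorithm performs poorly on large inputs.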
Binary Matching Operations
Merge-Join Algorithms
This algorithm requires the inputs to be sorted beforehand in order to produce its results; the procedure is similar to the one reviewed earlier.
Because the inputs are sorted, the algorithm requires no additional memory except when the total size of the packets is greater than the size of memory.
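A minimal sketch of a merge join over two in-memory lists that are already sorted on the same key. The names are illustrative. The group of equal keys on the right side is the only thing that has to be kept around, which corresponds to the memory remark above.

```python
def merge_join(left, right, key):
    """Join two lists of dicts that are both sorted on key."""
    result = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][key], right[j][key]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the end of the group of equal keys on the right.
            j_end = j
            while j_end < len(right) and right[j_end][key] == lk:
                j_end += 1
            # Pair every left row carrying this key with the whole right group.
            while i < len(left) and left[i][key] == lk:
                for r in right[j:j_end]:
                    result.append((left[i], r))
                i += 1
            j = j_end
    return result
```

Each input is read once from front to back, so after sorting the join itself is linear in the input sizes plus the size of the result.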
Binary Matching Operations
Merge-Join Algorithms
The previous algorithm and this one can be combined to optimize the results.
Since the previous algorithms need a certain amount of memory, it is convenient to make an allocation as follows:
$W = R / (2 \times M) + 1$, where R is the size of the input, M is the amount of memory required, and the other two are read and write constants.
Binary Matching Operations
Hash Join Algorithms
These algorithms are developed from the basic idea of building a hash table and probing that table with the items of the other input.
This algorithm has drawbacks, such as frequent memory overflow, but a good deal of research has been done in this area to improve the available solutions.
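A minimal sketch of the build-and-probe idea, assuming the build input fits in memory. Overflow handling, which the slide names as the main difficulty, is deliberately left out, and the function and parameter names are illustrative.

```python
from collections import defaultdict


def hash_join(build, probe, build_key, probe_key):
    """Build a hash table on one input, then probe it with the other input."""
    table = defaultdict(list)
    for b in build:                          # build phase
        table[b[build_key]].append(b)
    result = []
    for p in probe:                          # probe phase
        for b in table.get(p[probe_key], []):
            result.append((b, p))
    return result
```

Choosing the smaller input as the build side keeps the hash table small, since the probe input is only streamed past it once.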
Binary Matching Operations
The partitioning performed by the algorithm could be interpreted as follows:
Binary Matching Operations
Comparison of Methods
Contributions to SQL Server 7