Inclusion of New Types in Relational Database Systems

Michael Stonebraker

Why ORDBMS?

• Allows the addition of complex data and the use of a query language to access this data (e.g. insurance)

• Conveniently supports new applications (e.g. Internet, photography)

Complete Extended Type System

• Allow the definition of user-defined data types (e.g. 2-D boxes)

• Allow the definition of new operators for these data types (e.g. overlaps, contained in)

• Allow the implementation of new access methods for data types (e.g. R-trees)

• Allow optimized query processing for commands containing new data types and operators

Motivating Example

Consider a relation holding data on two-dimensional boxes. Each box can be represented by an identifier and the coordinates of two corner points as follows:

create box (id = i4, x1 = f8, x2 = f8, y1 = f8, y2 = f8)

Now consider the query to find all the boxes that overlap the unit square, i.e. the box with coordinates (0, 1, 0, 1). A representation of this request in QUEL is as follows:

retrieve (box.all) where not (box.x2 <= 0 or box.x1 >= 1 or box.y2 <= 0 or box.y1 >= 1)

Problems

• The command is too hard to understand.

• The command is too slow because the query planner will not be able to optimize something this complex.

• The command is too slow because there are too many clauses to check.

Solution

Support a box data type so that the box relation and the resulting user query can be defined as follows:

create box (id = i4, desc = box)

retrieve (box.all) where box.desc !! “0, 1, 0, 1”

Here “!!” is an overlaps operator with two operands of data type box which returns a boolean.
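The idea of a box type with an overlaps operator can be sketched in Python. This is only an illustration, not the paper's implementation: the class name `Box`, the `parse` helper, and the sample boxes are assumptions, and the coordinate order (x1, x2, y1, y2) is taken from the first create statement above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """A 2-D box stored as (x1, x2, y1, y2), following the column order above."""
    x1: float
    x2: float
    y1: float
    y2: float

    @classmethod
    def parse(cls, text: str) -> "Box":
        # Input conversion: turn the external string "0, 1, 0, 1" into a Box.
        x1, x2, y1, y2 = (float(part) for part in text.split(","))
        return cls(x1, x2, y1, y2)

    def overlaps(self, other: "Box") -> bool:
        # Same test as the four-clause QUEL qualification, generalized
        # from the unit square to an arbitrary second box.
        return not (
            self.x2 <= other.x1
            or self.x1 >= other.x2
            or self.y2 <= other.y1
            or self.y1 >= other.y2
        )


unit_square = Box.parse("0, 1, 0, 1")
boxes = [
    Box(0.5, 2.0, 0.5, 2.0),    # overlaps the unit square
    Box(1.0, 2.0, 0.0, 1.0),    # only touches its right edge
    Box(-1.0, 0.5, -1.0, 0.5),  # overlaps the unit square
]
hits = [b for b in boxes if b.overlaps(unit_square)]
```

The four comparisons are still executed, but they are hidden behind one operator, which is what makes the user query readable.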

Intuition

• New user-defined type and operator make resulting query more readable

• New user-defined access methods can allow query planner to optimize query

Consequences

• Need ability to define operators for user defined types

• Require support for fast access paths for queries

- Extend current access methods (e.g. B-trees for boxes using ascending area)

- Define new access methods (e.g. R-trees for boxes using contained in)

• Require support for the query optimizer to construct an efficient plan

Overview

• What will we discuss:

- ADTs

- New Access Methods

- Query Processing and Access Path Selection

Examples of Operators for Box Data Type

Definition of New Types

The new type can be implemented as follows:

define type-name length = value,

input = file-name,

output = file-name

Definition of New Operators

Zero or more operators can be implemented for the new type as follows:

define operator token = value,

left-operand = type-name,

right-operand = type-name,

result = type-name,

precedence-level like operator-2,

file = file-name

Example

define operator token = !!,

left-operand = box,

right-operand = box,

result = boolean,

precedence like *,

file = /usr/foobar
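One way to picture what this registration records is a small catalog keyed by token and operand types. The sketch below is hypothetical: `OperatorDef`, `CATALOG`, `define_operator`, and `apply_operator` are invented names, and a Python function stands in for the object file named by file = /usr/foobar.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class OperatorDef:
    token: str
    left_operand: str
    right_operand: str
    result: str
    precedence_like: str
    impl: Callable[[tuple, tuple], bool]  # stands in for the object file


def box_overlaps(a: tuple, b: tuple) -> bool:
    # Boxes are (x1, x2, y1, y2) tuples in this sketch.
    return not (a[1] <= b[0] or a[0] >= b[1] or a[3] <= b[2] or a[2] >= b[3])


# Hypothetical operator catalog keyed by (token, left type, right type).
CATALOG: Dict[Tuple[str, str, str], OperatorDef] = {}


def define_operator(op: OperatorDef) -> None:
    CATALOG[(op.token, op.left_operand, op.right_operand)] = op


def apply_operator(token, left_type, left, right_type, right):
    # Look up the implementation by token and operand types, then call it.
    op = CATALOG[(token, left_type, right_type)]
    return op.impl(left, right)


define_operator(OperatorDef("!!", "box", "box", "boolean", "*", box_overlaps))
result = apply_operator("!!", "box", (0.5, 2, 0.5, 2), "box", (0, 1, 0, 1))
```

Keying on the operand types as well as the token lets the same token be reused for other types.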

Comments on the Prototype

• Problem: ADT routines are a security loophole

• Possible solutions:

- Run in separate address space

- Interpret ADT procedure

- Use hardware support for protected procedure calls

• Author’s solution: Provide two environments for ADT procedures

- Protected environment for debugging

- Unprotected one for performance

Registering New Access Methods

• Basic idea:

- Access methods contain a small number of procedures that define its characteristics

- Replace these by others which operate on a different data type

Example

• Consider a B-tree and the following generic query:

retrieve (target-list) where relation.key OPR value

- Supports fast access if OPR is in {=, <, <=, >=, >}

- Includes procedure calls to support these operators for a particular data type

• To reuse the B-tree for a new data type, we only have to write procedures for our new operators. As long as these satisfy properties P1, …, P7, the B-tree will function correctly.

Example (cont’d)

• Appropriate information recorded on two access method templates:

- TEMPLATE-1 describes conditions which must be true for the operators provided by the access method (only used by humans)

- TEMPLATE-2 provides necessary information on the data types of operators

• In the AM relation, the designer can implement one or more collections of operators which satisfy the template

F1 = (value - low-key) / (high-key - low-key)

F2 = (high-key - value) / (high-key - low-key)

Example (cont’d)

• User can modify relations to B-tree using any class of operators defined in the AM relation as follows:

modify box to B-tree on desc using area-op

• Secondary index can also be constructed as follows:

index on box is box-index (desc) using area-op
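To make the area-op idea concrete, the sketch below keeps boxes in ascending area order and answers equality and range probes with binary search. It is a stand-in only: a sorted array replaces the B-tree, and the names `AreaIndex`, `area_equal`, `area_less`, and `area_greater` are hypothetical substitutes for the operators in the class.

```python
import bisect


def area(box):
    x1, x2, y1, y2 = box
    return (x2 - x1) * (y2 - y1)


class AreaIndex:
    """Sorted-array stand-in for a B-tree keyed on box area."""

    def __init__(self, boxes):
        self._entries = sorted(boxes, key=area)
        self._keys = [area(b) for b in self._entries]

    def area_equal(self, box):
        # Plays the role "=" plays for an ordinary B-tree key.
        lo = bisect.bisect_left(self._keys, area(box))
        hi = bisect.bisect_right(self._keys, area(box))
        return self._entries[lo:hi]

    def area_less(self, box):
        # Plays the role of "<".
        return self._entries[: bisect.bisect_left(self._keys, area(box))]

    def area_greater(self, box):
        # Plays the role of ">".
        return self._entries[bisect.bisect_right(self._keys, area(box)):]


index = AreaIndex([(0, 1, 0, 1), (0, 2, 0, 2), (0, 3, 0, 1), (1, 3, 1, 3)])
same_area = index.area_equal((0, 4, 0, 1))   # boxes with area 4
smaller = index.area_less((0, 3, 0, 1))      # boxes with area below 3
```

The index structure never inspects a box directly. It only calls the comparison procedures, which is why swapping in a new operator class is enough.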

Implementing New Access Methods

• Collection of procedure calls that retrieve and update records.

• Need to construct open, close, get-first, get-next, get-unique, insert, delete, replace, and build

• Open and close are usually universally usable and designer only needs to construct the remaining procedures

• Replace and delete do not require modification if the same physical page layout as some existing access method is used
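The procedure list above can be read as an interface. The following sketch assumes nothing about the real INGRES calling conventions: the class names, signatures, and the toy `SortedArrayMethod` are all hypothetical. It shows open and close supplied once, with the remaining procedures left to the designer.

```python
from abc import ABC, abstractmethod


class AccessMethod(ABC):
    """The nine entry points named above; open and close are shared defaults."""

    def open(self):
        self.is_open = True

    def close(self):
        self.is_open = False

    @abstractmethod
    def build(self, records): ...

    @abstractmethod
    def get_first(self): ...

    @abstractmethod
    def get_next(self): ...

    @abstractmethod
    def get_unique(self, key): ...

    @abstractmethod
    def insert(self, key, record): ...

    @abstractmethod
    def delete(self, key): ...

    @abstractmethod
    def replace(self, key, record): ...


class SortedArrayMethod(AccessMethod):
    """Toy access method that scans records in key order."""

    def build(self, records):
        self._data = dict(records)
        self._cursor = 0

    def get_first(self):
        self._cursor = 0
        return self.get_next()

    def get_next(self):
        keys = sorted(self._data)
        if self._cursor >= len(keys):
            return None  # end of scan
        key = keys[self._cursor]
        self._cursor += 1
        return key, self._data[key]

    def get_unique(self, key):
        return self._data.get(key)

    def insert(self, key, record):
        self._data[key] = record

    def delete(self, key):
        self._data.pop(key, None)

    def replace(self, key, record):
        self._data[key] = record


am = SortedArrayMethod()
am.open()
am.build([(2, "b"), (1, "a")])
am.insert(3, "c")
first = am.get_first()
second = am.get_next()
am.close()
```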

Implementation Problems

• Interface to transaction management code

• Concurrency control subsystem issues

• Interface to buffer manager

• Only briefly discussed in the paper

Query Processing And Access Path Selection

• Require four pieces of information when defining an operator to allow optimization:

- Selectivity factor, Stups, which estimates the expected number of records satisfying the clause:

where rel-name.field-name OPR value

- Selectivity factor, S, which estimates the expected number of records satisfying the clause:

where relname-1.field-1 OPR relname-2.field-2

- Feasibility of merge-sort

- Feasibility of hash-join

Example

define operator token = AE,

left-operand = box,

right-operand = box,

result = boolean,

precedence like *,

file = /usr/foobar,

Stups = 1,

S = min (N1, N2),

merge-sort with AL,

hash-join
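A rough sketch of how a planner might hold these four pieces of information and derive the join strategies an operator permits. The names `OperatorStats` and `legal_join_methods` are hypothetical, and treating iterative substitution as always available is an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class OperatorStats:
    token: str
    stups: float                     # estimate for: field OPR constant
    s: Callable[[int, int], float]   # estimate for: field-1 OPR field-2
    merge_sort_with: Optional[str]   # ordering operator, or None if infeasible
    hash_join: bool                  # True if hash-join is feasible


# The AE definition above, with N1 and N2 as the sizes of the two relations.
AE = OperatorStats(
    token="AE",
    stups=1,
    s=lambda n1, n2: min(n1, n2),
    merge_sort_with="AL",
    hash_join=True,
)


def legal_join_methods(op: OperatorStats) -> list:
    methods = ["iterative substitution"]  # assumed to work for any operator
    if op.merge_sort_with is not None:
        methods.append("merge-sort")
    if op.hash_join:
        methods.append("hash-join")
    return methods


methods = legal_join_methods(AE)
expected_join_size = AE.s(100, 40)
```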

Generating Query Processing Plan

• Assumptions:

- Relations stored keyed on one field in a single file

- Secondary indexes can exist for other fields

• Queries involving a single relation can be processed as follows:

- Scan of the relation

- Scan of a portion of the primary index

- Scan of a portion of a secondary index

Generating Query Processing Plan (cont’d)

• Joins can be processed as follows:

- Iterative substitution

- Merge-sort

- Hash-join

• Modify the standard query planner to compute the best plan, using appropriate rules to generate legal plans and the selectivities provided

Summary

• Main contributions of paper:

- Shows how to adapt existing access methods for new data types

- Explains how to code new access methods

- Demonstrates how to support automatic generation of optimized query plans

The Postgres Next-Generation Database Management System

Michael Stonebraker

Greg Kemnitz

Applications of DBMS

• Data management (traditional)

• Object management (new)

• Knowledge management (new)

• An example which requires services in all three dimensions is an application that stores and manipulates text and graphics to facilitate the layout of newspaper copy.

Postgres Data Model and Query Language

• Orientation toward database access from a query language

- Emphasis on query language, optimizer, and run-time system

• Orientation toward multilingual access

- No programming language-specific tight integration

• Small number of concepts

- Classes, inheritance, types, and functions

Classes (Constructed Types, Relations)

• Named collection of instances (records, tuples) of objects

- Each instance has same collection of named attributes

- Each attribute is a specific type

- Each instance has a unique (never-changing) identifier (OID)

• Can be created as follows:

create EMP (name = c12, salary = float, age = int)

• Can inherit data elements from other classes:

create SALESMAN (quota = float) inherits EMP

• POSTGRES allows a class to inherit from an arbitrary collection of other parent classes (multiple inheritance)
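Inheritance of data elements can be pictured as merging attribute lists up the parent chain. The sketch below uses a hypothetical in-memory catalog (`CLASSES`, `create`, `all_attributes`). Its handling of name clashes between parents (last one wins) is an assumption of the sketch, not something stated here about POSTGRES.

```python
# Hypothetical catalog: class name -> own attributes and parent classes.
CLASSES = {}


def create(name, attrs, inherits=()):
    CLASSES[name] = {"attrs": dict(attrs), "parents": tuple(inherits)}


def all_attributes(name):
    # Collect attributes from every parent first, then add the class's own.
    merged = {}
    for parent in CLASSES[name]["parents"]:
        merged.update(all_attributes(parent))
    merged.update(CLASSES[name]["attrs"])
    return merged


create("EMP", {"name": "c12", "salary": "float", "age": "int"})
create("SALESMAN", {"quota": "float"}, inherits=["EMP"])
salesman_attrs = all_attributes("SALESMAN")
```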

Classes (cont’d)

• Three kinds of classes in POSTGRES: real classes, derived classes, and versions.

- The instances of a real (or base) class are stored in the database.

- The instances of a derived class (or view, or virtual class) are not physically stored but are materialized only when necessary.

- A version of another class is stored as a differential relative to its parent class.

Types

• Three kinds of types in POSTGRES: base types, arrays of base types, and composite types.

• Base types include hard-wired types (e.g. integers, floats, character strings) and constructed ADTs

- Can assign values to attributes of base types in POSTQUEL by either specifying a constant or a function which returns the correct type

• Arrays of base types are supported using standard bracket notation and we could define a class as follows:

create EMP (name = c12, salary = float[12], age = int)

retrieve (EMP.name) where EMP.salary[4] = 4000

Types (cont’d)

• Composite types allow a user to construct complex objects, that is, attributes which contain other instances as part or all of their value.

- Complex objects have a hierarchical internal structure

- A collection of zero or more instances of any class is automatically a composite type. For example:

create EMP (name = c12, salary = float[12], age = int, manager = EMP, coworkers = EMP)

• Note, each time a class is constructed, a type is automatically available to hold a collection of instances of the class.

• POSTGRES also supports a final constructed type, set, whose value is a collection of instances from all classes. For example, hobbies information can be added to the EMP class as follows:

add to EMP (hobbies = set)

Types (cont’d)

• Path expressions:

- Elements of an attribute that are a composite type can be hierarchically addressed by nested dot notation. For example, one could write:

retrieve (EMP.manager.age) where EMP.name = “Joe”

• Composite types can have a value that is a function which returns the correct type. For example:

replace EMP (hobbies = compute-hobbies(“Jones”)) where EMP.name = “Jones”
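The nested dot notation of a path expression amounts to following one attribute after another. A minimal sketch with instances modelled as Python dictionaries follows. The helper `resolve_path` and the sample data are hypothetical.

```python
def resolve_path(instance, path):
    # Follow each attribute name in turn, e.g. "manager.age".
    value = instance
    for attribute in path.split("."):
        value = value[attribute]
    return value


boss = {"name": "Ann", "age": 52, "manager": None}
joe = {"name": "Joe", "age": 30, "manager": boss}
EMP = [boss, joe]

# Counterpart of: retrieve (EMP.manager.age) where EMP.name = "Joe"
ages = [resolve_path(e, "manager.age") for e in EMP if e["name"] == "Joe"]
```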

Functions

• Three different kinds of functions in POSTGRES:

- C functions

- Operators

- POSTQUEL functions

C Functions

• To be able to perform complex calculations on objects, POSTGRES supports C functions.

• Can define an arbitrary number of C functions whose arguments are base or composite types

• Can have an argument which is a class name. For example:

retrieve (EMP.name) where overpaid (EMP)

- Inherited down the class hierarchy in the standard way

- Can be considered as a new attribute for the class whose type is the return type of the function. For example:

retrieve (EMP.name) where EMP.overpaid

• Queries with C functions in the qualification cannot be optimized by the POSTGRES query optimizer. For example, the preceding query will result in a sequential scan of all instances of the class.
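The two ways of writing the overpaid query, and the sequential scan they both imply, can be sketched as follows. The pay policy inside `overpaid` is invented for illustration, and `FUNCTIONS` and `attribute` are hypothetical helpers.

```python
EMP = [
    {"name": "Joe", "salary": 90000, "age": 30},
    {"name": "Sam", "salary": 40000, "age": 45},
]


def overpaid(emp):
    # Hypothetical policy, invented for this sketch.
    return emp["salary"] > 1500 * emp["age"]


# Functions that take an EMP instance, usable as if they were attributes.
FUNCTIONS = {"overpaid": overpaid}


def attribute(emp, name):
    # Stored attributes come first; otherwise evaluate the function.
    if name in emp:
        return emp[name]
    return FUNCTIONS[name](emp)


# retrieve (EMP.name) where overpaid (EMP)
by_call = [e["name"] for e in EMP if overpaid(e)]

# retrieve (EMP.name) where EMP.overpaid
by_attribute = [e["name"] for e in EMP if attribute(e, "overpaid")]
```

Both forms visit every instance, because nothing is known about `overpaid` that would let an index narrow the search.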

Operators

• To be able to use indexes in processing queries, POSTGRES supports operators.

• Operators are functions with one or two operands which use the standard operator notation in the query language. For example:

retrieve (DEPT.dname) where DEPT.floorspace AGT “(0,0), (1,1), (0,2)”

• Only available for operands which are base types

- Access methods support fast access to specific fields in records

- Unclear what an access method should do for a constructed type

Operators (cont’d)

• To assist the query optimizer, hints such as the negator of an operator can be included in the definition of an operator.

• For example, the following query cannot be optimized as written, but using the negator it can be rewritten as the previous query, which can be optimized:

retrieve (DEPT.dname) where not DEPT.floorspace ALE “(0,0), (1,1), (0,2)”

• Information on available access paths is stored in the POSTGRES system catalogs.
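A sketch of how a negator hint could be used: a negated clause is replaced by the equivalent clause on the negator operator, which an access method may support. The tuple encoding of clauses and the names `NEGATOR` and `push_down_not` are hypothetical. The pairing of ALE with AGT is inferred from the two example queries.

```python
# Assumed negator pairs: "not (a ALE b)" is equivalent to "a AGT b".
NEGATOR = {"ALE": "AGT", "AGT": "ALE"}


def push_down_not(clause):
    """Rewrite ("not", (field, op, const)) using the operator's negator.

    Clauses without a leading "not", or whose operator has no registered
    negator, are returned unchanged.
    """
    if clause[0] == "not":
        field, op, const = clause[1]
        if op in NEGATOR:
            return (field, NEGATOR[op], const)
    return clause


polygon = "(0,0), (1,1), (0,2)"
rewritten = push_down_not(("not", ("DEPT.floorspace", "ALE", polygon)))
```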

POSTQUEL Functions

• Any collection of commands in the POSTQUEL query language can be packaged together and defined as a function. For example:

define function high-pay returns EMP as

retrieve (EMP.all)

where EMP.salary > 50000

• POSTQUEL functions can also have parameters. For example:

define function sal-lookup (c12) returns float as

retrieve (EMP.salary)

where EMP.name = $1

• Can be placed in a query or directly executed using the fast path facility

POSTQUEL Functions (cont’d)

• Attributes of a composite type automatically have values which are functions that return the correct type. For example, consider the following function and command:

define function mgr-lookup (c12) returns EMP as

retrieve (EMP.all)

where EMP.name = DEPT.manager and DEPT.name = $1

append to EMP (name = “Sam”, salary = 1000, age = 40, manager = mgr-lookup(“shoe”))

• Like C functions, POSTQUEL functions can have a specific class as an argument and can either be thought of as functions or as new attributes.

POSTGRES Query Language

• We already saw: User-defined functions and operators, arrays, path expressions

• Support for nested queries

• Transitive closure

• Support for inheritance

• Support for time travel

Nested Queries

• POSTQUEL allows queries to be nested and has operators that have sets of instances as operands. For example:

retrieve (DEPT.dname)

where DEPT.floor NOT-IN

{D.floor from D in DEPT where D.dname != DEPT.dname}
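Read as a set operation, the query above returns each department whose floor is used by no other department. A Python counterpart, with invented sample data:

```python
DEPT = [
    {"dname": "shoe", "floor": 1},
    {"dname": "toy", "floor": 1},
    {"dname": "candy", "floor": 2},
]

# The inner set plays the role of {D.floor from D in DEPT where ...}.
alone_on_floor = [
    d["dname"]
    for d in DEPT
    if d["floor"] not in {o["floor"] for o in DEPT if o["dname"] != d["dname"]}
]
```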

Transitive Closure

• Allows a user to explode an ancestor hierarchy. For example, consider the class parent (older, younger) and the following query:

retrieve* into answer (parent.older) from a in answer

where parent.younger = “John” or parent.younger = a.older

- * after retrieve indicates that the associated query should be run until the answer fails to grow

- * can also be used to indicate that a query should be run over a class and all classes under it in the inheritance hierarchy. For example:

retrieve (E.name) from E in EMP* where E.age > 40
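The "run until the answer fails to grow" semantics of retrieve* is a fixpoint iteration. The sketch below models the parent class as (older, younger) pairs. The function name `ancestors` and the sample data are hypothetical.

```python
def ancestors(parent, person):
    """Repeat the query until the answer stops growing."""
    answer = set()
    while True:
        # Match parent.younger against the person or anyone already found.
        frontier = {person} | answer
        grown = answer | {older for older, younger in parent if younger in frontier}
        if grown == answer:
            return answer
        answer = grown


parent = [("Mary", "John"), ("Ann", "Mary"), ("Bob", "Ann"), ("Zed", "Kim")]
johns_ancestors = ancestors(parent, "John")
```

Each pass adds one more generation, and the loop stops on the first pass that adds nothing.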

Time Travel

• Allows a user to run historical queries. For example (T is a time):

retrieve (EMP.salary) from EMP [T] where EMP.name = “Sam”

- POSTGRES will find the version of Sam’s record valid at the correct time and get the appropriate salary
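A historical query can be pictured as a filter on validity intervals kept with each stored version. In this sketch the field names `tmin` and `tmax`, the half-open interval convention, and the data are assumptions made for illustration.

```python
import math

# Every update leaves the old version in place and adds a new one.
EMP_VERSIONS = [
    {"name": "Sam", "salary": 1000, "tmin": 0, "tmax": 10},
    {"name": "Sam", "salary": 1500, "tmin": 10, "tmax": math.inf},
]


def salary_at(name, t):
    # Counterpart of: retrieve (EMP.salary) from EMP [T] where EMP.name = name
    return [
        v["salary"]
        for v in EMP_VERSIONS
        if v["name"] == name and v["tmin"] <= t < v["tmax"]
    ]


past = salary_at("Sam", 5)
current = salary_at("Sam", 10)
```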

Fast Path

• Reason for fast path: Application may require direct access to user-defined or internal POSTGRES function.

• POSTQUEL has been extended with:

function-name (param-list)

• User can execute any function known to POSTGRES (e.g. parser, optimizer, executor, access methods, buffer manager, utility routines)

• Validity of parameters not checked

• Allows user program to call a function in another address space rather than its own

Rule System

• Reasons for rule system: Users require support for views, triggers, integrity constraints, referential integrity, protection, and version control.

• POSTGRES rule system is a general-purpose rules system that can perform all of these functions.

Rule System (cont’d)

• Rules have the form:

ON event (TO) object

WHERE POSTQUEL-qualification

THEN DO [instead] POSTQUEL-command(s)

- events: retrieve, replace, delete, append, new (replace or append), or old (delete or replace)

- objects: name of a class or class.column

- POSTQUEL-commands: set of POSTQUEL commands with the following two changes:

- new or current can appear instead of the name of a class in front of any attribute

- refuse (target-list) is added as a new POSTQUEL command

Versions

• Innovative application of rule system

• Goal of versions: Create a hypothetical version of a class with the following properties:

- Initially, the hypothetical class has all the instances of the base class

- The hypothetical class can be freely updated to diverge from the base class

- Updates to the hypothetical class do not cause physical modifications to the base class

- Updates to the base class are visible in the hypothetical class, unless the instance updated has been deleted or modified in the hypothetical class

Example

• Can create a version of a class as follows:

create version my-EMP from EMP

• This command is supported by two differential classes for EMP:

EMP-MINUS (deleted-OID)

EMP-PLUS (all-fields-in EMP, replaced-OID)

• The retrieve rule installed at the time the version is created is:

on retrieve to my-EMP

then do instead retrieve (EMP-PLUS.all)

retrieve (EMP.all)

where EMP.OID NOT-IN {EMP-PLUS.replaced-OID} and

EMP.OID NOT-IN {EMP-MINUS.deleted-OID}
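The effect of this retrieve rule can be sketched as a function over the base class and the two differential classes. Instances are dictionaries here, and the field names `oid`, `replaced_oid`, and `deleted_oid` as well as the sample data are assumptions of the sketch.

```python
def retrieve_version(base, plus, minus):
    """Everything in EMP-PLUS, plus base instances not replaced or deleted."""
    replaced = {r["replaced_oid"] for r in plus}
    deleted = {r["deleted_oid"] for r in minus}
    # Instances added or modified in the version (bookkeeping field dropped).
    result = [{k: v for k, v in r.items() if k != "replaced_oid"} for r in plus]
    # Base instances that the version has neither replaced nor deleted.
    result += [
        r for r in base
        if r["oid"] not in replaced and r["oid"] not in deleted
    ]
    return result


EMP = [
    {"oid": 1, "name": "Sam", "salary": 1000},
    {"oid": 2, "name": "Joe", "salary": 2000},
    {"oid": 3, "name": "Fred", "salary": 3000},
]
EMP_PLUS = [{"oid": 10, "name": "Joe", "salary": 2500, "replaced_oid": 2}]
EMP_MINUS = [{"deleted_oid": 3}]

my_emp = retrieve_version(EMP, EMP_PLUS, EMP_MINUS)
```

The base list is never modified, and a later change to Sam in EMP would show through in the version because Sam has been neither replaced nor deleted there.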

Forward Chaining

• Generally, rules specify additional actions to be taken as a result of user updates. These additional actions may activate other rules, and a forward chaining control flow results. For example:

on new EMP.salary

where EMP.name = “Fred”

then do replace E (salary = new.salary) from E in EMP where E.name = “Joe”

Backward Chaining

• Now consider the following rule:

on retrieve to EMP.salary

where EMP.name = “Joe”

then do instead retrieve (EMP.salary) where EMP.name = “Fred”

• In this case, Joe’s salary is not explicitly stored, but it is derived by activating the above rule. If Fred’s salary is not explicitly stored, then further rules would be used to find the ultimate answer and a backward chaining control flow results.
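Backward chaining can be sketched as following "do instead" rules until a stored value is reached. The names `STORED`, `DERIVED_FROM`, and `salary`, the extra employee Bill, and the cycle check are hypothetical additions for illustration.

```python
# Salaries that are explicitly stored.
STORED = {"Fred": 3000}

# Each entry models a rule of the form:
#   on retrieve to EMP.salary where EMP.name = X
#   then do instead retrieve (EMP.salary) where EMP.name = Y
DERIVED_FROM = {"Joe": "Fred", "Bill": "Joe"}


def salary(name, seen=()):
    if name in seen:
        raise ValueError("circular rules")
    if name in DERIVED_FROM:
        # Activating one rule may require activating another.
        return salary(DERIVED_FROM[name], seen + (name,))
    return STORED[name]


joes_salary = salary("Joe")
bills_salary = salary("Bill")
```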

Implementation of Rules

• Two implementations for POSTGRES rules:

- Through record level processing, the rules system is called when individual records are accessed, deleted, inserted, or modified.

- The second implementation is through query rewrite.

Record Level Rule System

• A marker which contains the identifier of a rule is placed on an attribute of an instance. If the executor touches a marked attribute, then it calls the rules system before proceeding.

- Efficient if there are a large number of rules and each only covers a few instances

- No extra overhead will be required unless a marked instance is actually touched.

• However, consider the following rule and an incoming query:

on replace to EMP.salary then do append to AUDIT (name = current.name, salary = current.salary, new = new.salary, user = user())

replace EMP (salary = 1.1 * EMP.salary) where EMP.age > 50

- In the record level rules system, the rule is invoked once for every employee over 50, which is a large overhead.

Query Rewrite Module

• Solution: Rewrite the user command to the following:

append to AUDIT (name = EMP.name, salary = EMP.salary, new = 1.1 * EMP.salary, user = user()) where EMP.age > 50

replace EMP (salary = 1.1 * EMP.salary) where EMP.age > 50

- Auditing operation is done in bulk as a single command

- Preferable over the record level rule system

• This system will perform well if there are a small number of large-scope rules and poorly if there are a large number of small-scope rules.

• Note that the two implementations are complementary.
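The difference between the two implementations can be sketched by counting how often the rule machinery is entered for the audit example. This is an illustration only: the function names, the call counter, and the sample data are hypothetical, and the user field of AUDIT is omitted.

```python
import copy

EMP = [
    {"name": "Ann", "salary": 1000, "age": 55},
    {"name": "Bob", "salary": 2000, "age": 60},
    {"name": "Cal", "salary": 3000, "age": 30},
]


def record_level(emp):
    audit, rule_calls = [], 0
    for e in emp:
        if e["age"] > 50:
            rule_calls += 1  # rules system entered once per touched record
            new = 1.1 * e["salary"]
            audit.append({"name": e["name"], "salary": e["salary"], "new": new})
            e["salary"] = new
    return audit, rule_calls


def query_rewrite(emp):
    # One bulk append to AUDIT, followed by the original replace.
    audit = [
        {"name": e["name"], "salary": e["salary"], "new": 1.1 * e["salary"]}
        for e in emp
        if e["age"] > 50
    ]
    for e in emp:
        if e["age"] > 50:
            e["salary"] = 1.1 * e["salary"]
    return audit, 1  # the rule is applied once, at rewrite time


audit_a, calls_a = record_level(copy.deepcopy(EMP))
audit_b, calls_b = query_rewrite(copy.deepcopy(EMP))
```

Both paths produce the same audit entries. Only the number of rule activations differs, growing with the number of qualifying records in the record level case.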

Storage System

• POSTGRES uses a no-overwrite storage manager.

• Old records remain in the database whenever an update occurs and serve the purpose normally performed by a write-ahead log.

• POSTGRES, therefore, has no conventional log and only stores two bits per transaction indicating whether each transaction is committed, aborted, or in progress.

• This system allows for instantaneous crash recovery and time travel.

• Problem: Database will have committed instances intermixed with instances that were written by aborted transactions.

• Solution: System must distinguish between these two and ignore the latter.
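The per-transaction status information is what lets a reader tell the two kinds of instances apart. A minimal sketch follows, in which the `xid` field, the status table, and the data are hypothetical.

```python
COMMITTED, ABORTED, IN_PROGRESS = "committed", "aborted", "in progress"

# Status recorded per transaction (two bits are enough for three states).
XACT_STATUS = {1: COMMITTED, 2: ABORTED, 3: IN_PROGRESS}

# Each stored instance carries the id of the transaction that wrote it.
RECORDS = [
    {"xid": 1, "name": "Sam"},
    {"xid": 2, "name": "Ghost"},
    {"xid": 3, "name": "Pending"},
]


def visible(records, status):
    # Only instances written by committed transactions are returned.
    return [r for r in records if status[r["xid"]] == COMMITTED]


rows = visible(RECORDS, XACT_STATUS)
```

Under this scheme nothing has to be undone after a crash: instances from transactions that never reached the committed state are simply skipped.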

Storage System (cont’d)

• If stable memory is available, a no-overwrite storage manager is superior to a conventional one.

• However, in the absence of stable memory, a no-overwrite storage manager must force to disk all pages written by a transaction at commit time because the effects of a committed transaction must be durable in case a crash occurs and main memory is lost. A conventional disk manager only needs to force the log pages.

• Even if there are as many log pages as data pages, which is unlikely, the conventional storage manager is performing sequential I/O versus the no-overwrite storage manager, which is performing random I/O.

Conclusions

• Original development and organization of POSTGRES is better than that of INGRES.

• Performance:

- POSTGRES is about twice as fast as UCB-INGRES

- On the other hand, it is 3/5 as fast as ASK-INGRES (commercial version)

• Although POSTGRES 2.1 was a work in progress at the time of publication and contained inefficiencies, it still touched on many interesting ideas for an implementation of an ORDBMS.
