1. intro to adbms course

1 ADBMS MCA 4.5 Jan 10, 2012

Upload: faheem-anwar

Post on 15-Oct-2014


Page 1: 1. Intro to ADBMS Course

ADBMS

MCA 4.5, Jan 10, 2012

Page 2: 1. Intro to ADBMS Course

Textbook(s)

Main textbook, available at the bookstore:

• Database Systems: The Complete Book, Hector Garcia-Molina, Jeffrey Ullman, Jennifer Widom

Almost identical, and also available at the bookstore:

• A First Course in Database Systems, Jeff Ullman and Jennifer Widom

• Database System Implementation, Hector Garcia-Molina, Jeff Ullman and Jennifer Widom

Page 3: 1. Intro to ADBMS Course

Other Texts

• Database Management Systems, Ramakrishnan
  – very comprehensive

• Database System Concepts, A. Silberschatz, H. F. Korth and S. Sudarshan, 6th Ed., McGraw-Hill, ISBN 978-007-132522, International Edition, 2011

• Fundamentals of Database Systems, Elmasri, Navathe
  – very widely used

• Data on the Web, Abiteboul, Buneman, Suciu
  – XML and other new/advanced stuff

Page 4: 1. Intro to ADBMS Course

Course Focus

• The main focus of this course is database system implementation, i.e. how one builds a DBMS.

• This subject can in turn be divided into 3 parts:

1. Storage Management: how secondary storage is used efficiently to hold data and allow it to be accessed quickly.

2. Query Processing: how queries expressed in a very high-level language such as SQL can be executed efficiently.

3. Transaction Management: how to support transactions with the ACID properties.

Page 5: 1. Intro to ADBMS Course

Course Overview

• Physical data storage: blocks on disks, records in blocks, fields in records

• Indexing & Hashing: B-Trees, hashing, …

• Query Processing: methods to execute SQL queries efficiently

• Crash Recovery: failures, stable storage, logging policies, …

Page 6: 1. Intro to ADBMS Course

Course Overview

• Concurrency Control: correctness, locks, …

• Transaction Processing: logs, deadlocks, …

• Security & Integrity: authorization, encryption, …

• Distributed Databases: interoperation, distributed recovery, …

Page 7: 1. Intro to ADBMS Course

Syllabus Chapter-wise

Ch. [11] Hardware

Ch. [12] File and System Structure

Ch. [13] Indexing and Hashing

Ch. [14] Indexing and Hashing

Ch. [15] Query Processing

Ch. [16] Query Optimization

Ch. [17] Crash Recovery

Ch. [18] Concurrency Control

Ch. [19] Transaction Processing

Ch. [20] Information Integration

Review

Page 8: 1. Intro to ADBMS Course

Unit | Syllabus | Weeks

1 | Data Storage and File Structure: The Memory Hierarchy, Disks, Using Secondary Storage Effectively, Accelerating Access to Secondary Storage, Disk Failures, Recovery from Disk Crashes, RAID, Representing Data Elements. Indexing and Hashing: Indexes on Sequential Files, Secondary Indexes, B-Trees, Hash Tables, Multidimensional and Bitmap Indexes | 2

2 | Query Execution: Introduction to Physical-Query-Plan Operators, One-Pass Algorithms for Database Operations, Nested-Loop Joins, Two-Pass Algorithms Based on Sorting and Hashing, Index-Based Algorithms, Buffer Management, Algorithms Using More than 2 Passes, Parallel Algorithms for Relational Operations. Query Optimization: Parsing, Estimating the Cost of Operations, Query Optimization | 2

3 | Advanced Transaction Management: Transactions in SQL. Coping with System Failures: Models for Resilient Operation, Undo Logging, Redo Logging, Undo/Redo Logging, Protecting Against Media Failures. Concurrency Control: Serial and Serializable Schedules, Conflict-Serializability, Enforcing Serializability by Locks, Locking Systems, Tree Protocol, Concurrency Control by Timestamps and Validation. Advanced Transaction Processing: Resolving Deadlocks, Distributed Databases, Commit and Locking, Long-Duration Transactions | 2

4 | Database System Architectures: Data Models: Review of Relational Data Model and Object-Based Model, Semi-structured Data, XML and its Data Model, Object-Orientation in Query Languages, Logical Query Languages, Centralized and Client-Server Architectures, Server System Architectures, Parallel Databases, Distributed Databases, Deductive Databases | 2

5 | Data Warehousing, Data Mining and Information Retrieval: DSS, OLAP, Data Warehousing, Data Mining, ID3 Algorithm, Classification, Association Rules, Clustering, IR | 2

6 | Spatial Data Management: Time in Databases, Spatial and Geographic Data, Multimedia Databases, Mobility and Personal Databases | 2

7 | Misc. Topics: Advanced Application Development | 1

Page 9: 1. Intro to ADBMS Course

Simplified DBMS structure

[Figure: block diagram with the components User/Application, Query processor, Transaction processor, Storage manager, Buffers, and Permanent storage (Indexes, User Data, System Data).]

Page 10: 1. Intro to ADBMS Course


Why study DBMS implementation techniques?

• Computer scientists’ core knowledge

• Techniques applicable in implementing DBMS-like systems

• Understanding of DBMS internals necessary for database administrators

• Note: This course is not about designing DB-based applications or about using some specific database systems

Page 11: 1. Intro to ADBMS Course

Database Systems

• The big commercial database vendors:
  – Oracle
  – IBM (with DB2), which bought Informix recently
  – Microsoft (SQL Server)
  – Sybase

• Some free database systems (Unix):
  – Postgres
  – MySQL
  – Predator

Page 12: 1. Intro to ADBMS Course

Section 8.6 (Garcia-Molina, Ullman, Widom book): Transactions in SQL (Review)

• Transactions
• Serializability example
• Atomicity example
• Read-only Transactions
• Dirty Reads
• Isolation Levels

Page 13: 1. Intro to ADBMS Course

Transactions

• A transaction is a sequence of statements, or a collection of one or more operations on the database, that either all succeed or all fail.

• Transactions have the ACID properties:
  A = atomicity
  C = consistency
  I = isolation
  D = durability

Page 14: 1. Intro to ADBMS Course

ACID Properties

A transaction is a unit of program execution that accesses and possibly updates various data items. To preserve the integrity of data, the database system must ensure:

• Atomicity. Either all operations of the transaction are properly reflected in the database or none are.

• Consistency. A transaction moves the database from one state where integrity holds (relationships among values are maintained) to another state where integrity holds.

• Isolation. Although multiple transactions may execute concurrently, each transaction must be unaware of other concurrently executing transactions. Intermediate transaction results must be hidden from other concurrently executed transactions.

• Durability. After a transaction completes successfully, the changes it has made to the database persist, even if there are system failures.

Page 15: 1. Intro to ADBMS Course

Transactions

• Problem: An application must perform several writes and reads to the database as a unit.

• Example: Two people attempt to book the last seat on a flight.

• Solution: Multiple actions of the application are bundled into one unit called a transaction.
  – Transactions guarantee certain properties that prevent such problems.

Page 16: 1. Intro to ADBMS Course

Serializability

• In applications like banking or airline reservations, hundreds of operations per second may be performed on a single database.

• It is entirely possible that 2 operations affect the same account or flight and that those operations overlap in time.

• Consider the following two examples:
  – Serializability example (e.g. 8.26)
  – Atomicity example (e.g. 8.27)

Page 17: 1. Intro to ADBMS Course

Example 8.26

• Consider a relation: – Flights(fltNum, fltDate, fltSeat, occupied)

• Write a function chooseSeat() in PL/SQL to read relation Flights for flight number and seats available,

• Find if a particular seat is available, and make it occupied if so.

Page 18: 1. Intro to ADBMS Course

Serializability Example

• Suppose 2 agents are trying to book the same seat for the same flight and date at approximately the same time.

Page 19: 1. Intro to ADBMS Course

Error

• Each execution of chooseSeat() tells its customer that the seat belongs to them.

• Both customers believe they have been granted the seat in question.
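The error can be reproduced with a small simulation. This is only a sketch: the dictionary stands in for the Flights relation, and the flight number, date, seat label and agent names are made up for illustration. Both agents read the seat before either one writes, so both are told the seat is theirs.

```python
# Hypothetical in-memory stand-in for Flights(fltNum, fltDate, fltSeat, occupied).
seat = (123, "2012-01-10", "22A")
flights = {seat: False}  # False means the seat is not occupied


def interleaved_booking(flights, seat):
    """Run two chooseSeat()-style calls whose reads both precede either write."""
    # Both agents read the seat first and see it as free.
    agent1_sees_free = not flights[seat]
    agent2_sees_free = not flights[seat]

    granted = []
    # Each agent then acts on its own (by now stale) read.
    if agent1_sees_free:
        flights[seat] = True
        granted.append("agent1")
    if agent2_sees_free:
        flights[seat] = True
        granted.append("agent2")
    return granted


winners = interleaved_booking(flights, seat)
print(winners)  # both agents appear in the list
```

Running the two calls one after the other instead (read, write, then read, write) would grant the seat to the first agent only.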

Page 20: 1. Intro to ADBMS Course

Serial Transaction

• An execution of functions operating on the same database is serial if one function executes completely before any other function begins.

• An execution is serializable if the functions behave as if they were run serially, even though their executions may overlap in time.

• Clearly, if 2 invocations of chooseSeat() are run serially (one after another) or serializably, the error we saw cannot occur.

Page 21: 1. Intro to ADBMS Course

Assuring Serializable Behavior

• In practice it is often impossible to require that operations run one after another: there are just too many of them, and some parallelism is required.

• As a remedy, DBMSs adopt a mechanism for assuring serializable behavior: even if the execution is not serial, the result looks to the user as if the operations were executed serially.

Page 22: 1. Intro to ADBMS Course

Assuring Serializable Behavior

• One common approach is for the DBMS to lock elements of the database so that 2 functions cannot access them at the same time.

• If the function chooseSeat() were written to lock other operations out of the Flights relation, operations that did not access Flights could still run in parallel with it, but no other invocation of chooseSeat() could.
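A minimal sketch of the locking idea, assuming a single lock that covers the whole Flights relation (a coarse choice made only to keep the example short). The class, method and agent names are hypothetical. The check and the update happen while the lock is held, so the two agents cannot both see the seat as free.

```python
import threading


class Flights:
    """Hypothetical in-memory Flights relation guarded by one lock."""

    def __init__(self, seats):
        self._occupied = {seat: False for seat in seats}
        self._lock = threading.Lock()  # one lock for the whole relation

    def choose_seat(self, seat):
        # Read and write happen under the lock, so they act as one unit.
        with self._lock:
            if self._occupied[seat]:
                return False
            self._occupied[seat] = True
            return True


flights = Flights(["22A"])
results = []
results_lock = threading.Lock()


def agent(name):
    ok = flights.choose_seat("22A")
    with results_lock:
        results.append((name, ok))


threads = [threading.Thread(target=agent, args=(f"agent{i}",)) for i in (1, 2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

granted = [name for name, ok in results if ok]
print(granted)  # exactly one agent is granted the seat
```

Which agent wins depends on thread scheduling, but the outcome always matches some serial order of the two calls.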

Page 23: 1. Intro to ADBMS Course

Atomicity example

• Besides the non-serializable behavior that can occur when 2 or more database operations are performed at about the same time, a single operation can put the database in an unacceptable state if there is a hardware or software crash while the operation is executing.

• Example 8.27:
  – Consider a relation Accounts(acctNo, balance).

  – Write a function transfer() that takes 2 accounts and an amount of money, checks that the first account has at least that much money, and if so moves the money from the first account to the second.
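The following sketch illustrates the unacceptable state, not its cure. It assumes an in-memory dictionary in place of Accounts and uses a hypothetical `crash_after_debit` flag to simulate a failure between the two updates; the account numbers and amounts are invented.

```python
class Crash(Exception):
    """Stands in for a hardware or software failure."""


def transfer(accounts, src, dst, amount, crash_after_debit=False):
    """Move amount from src to dst with no atomicity protection."""
    if accounts[src] < amount:
        return False
    accounts[src] -= amount
    if crash_after_debit:
        # The failure hits after the debit but before the credit.
        raise Crash("failure between the two updates")
    accounts[dst] += amount
    return True


accounts = {123: 500, 456: 100}
try:
    transfer(accounts, 123, 456, 200, crash_after_debit=True)
except Crash:
    pass

total = sum(accounts.values())
print(accounts, total)  # the debit happened, the credit did not
```

The money has left the first account without reaching the second, which is exactly the state that atomic execution is meant to rule out.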

Page 24: 1. Intro to ADBMS Course

Transactions in SQL

• Each SQL statement is normally a transaction by itself.

• As a default, transactions in SQL are executed in a serializable manner.

• START TRANSACTION command is used to start a transaction and COMMIT or ROLLBACK is used to end the transaction.

• In program interfaces, transactions begin whenever the database is accessed, and end when either a COMMIT or ROLLBACK statement is executed.
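As a rough illustration of COMMIT and ROLLBACK from a program interface, here is a sketch using Python's standard `sqlite3` module with an in-memory database. Two assumptions apply: SQLite spells the start command `BEGIN` rather than `START TRANSACTION`, and the account numbers and balances are invented for the example.

```python
import sqlite3

# isolation_level=None leaves transaction control to the explicit statements below.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE Accounts (acctNo INTEGER PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO Accounts VALUES (?, ?)", [(123, 500), (456, 100)])


def transfer(conn, src, dst, amount):
    """Move amount from src to dst as one transaction; return True on commit."""
    conn.execute("BEGIN")  # SQLite's spelling of START TRANSACTION
    try:
        (balance,) = conn.execute(
            "SELECT balance FROM Accounts WHERE acctNo = ?", (src,)
        ).fetchone()
        if balance < amount:
            conn.execute("ROLLBACK")  # end the transaction without changes
            return False
        conn.execute(
            "UPDATE Accounts SET balance = balance - ? WHERE acctNo = ?", (amount, src)
        )
        conn.execute(
            "UPDATE Accounts SET balance = balance + ? WHERE acctNo = ?", (amount, dst)
        )
        conn.execute("COMMIT")  # both updates become permanent together
        return True
    except Exception:
        if conn.in_transaction:
            conn.execute("ROLLBACK")  # undo any partial work
        raise


ok1 = transfer(conn, 123, 456, 200)   # enough money: commits
ok2 = transfer(conn, 123, 456, 1000)  # not enough money: rolls back
balances = dict(conn.execute("SELECT acctNo, balance FROM Accounts"))
print(ok1, ok2, balances)
```

Either both updates of a transfer are applied or neither is, which is the behavior Example 8.27 asks for.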

Page 27: 1. Intro to ADBMS Course

Read-only Transactions

• Any transaction that reads and then writes some data into the database is prone to serialization problems.

• When a transaction only reads data and does not write data, we have more freedom to let the transaction execute in parallel with other transactions.

• For example, suppose we wrote a function that read data to determine whether a certain seat was available; we could execute many invocations of this function at once, without risk of permanent harm to the database.

Page 28: 1. Intro to ADBMS Course

Read-only Transactions in SQL

• To tell the SQL system that the next transaction is read-only, use the command SET TRANSACTION READ ONLY; just before that transaction begins.

• We can also inform SQL that the coming transaction may write data with the command SET TRANSACTION READ WRITE; this is the default option and is thus unnecessary.

Page 29: 1. Intro to ADBMS Course

Dirty Read

• Dirty data is a common term for data written by a transaction that has not yet committed.

• A Dirty read is a read of dirty data.

• The risk in reading dirty data is that the transaction that wrote it may eventually abort.

Page 30: 1. Intro to ADBMS Course

Dirty Read

• Sometimes a dirty read matters and sometimes it doesn’t. When it matters little, it makes sense to risk an occasional dirty read and thus avoid:

– The time-consuming work by the DBMS that is needed to prevent dirty reads

– The loss of parallelism that results from waiting until there is no possibility of a dirty read

Page 31: 1. Intro to ADBMS Course

Dirty Read examples

• Example 8.30: Consider the relation Accounts(acctNo, balance).

• Suppose we want to transfer money from one account to another, and that transfers are implemented by a program P that executes the following sequence of steps:

1. Add money to account_2.
2. Test if account_1 has enough money:
   a) If NO: ROLLBACK.
   b) If YES: subtract money from account_1 and end.

Page 32: 1. Intro to ADBMS Course

Example 8.30

• If the program is executed serially, it doesn’t matter that we have put money temporarily in account_2.

• Suppose dirty reads are possible, and imagine there are 3 accounts: A1 (bal=$100), A2 (bal=$200), A3 (bal=$300).

• Suppose 2 transactions T1 and T2 execute program P at roughly the same time:
  – T1: transfers $150 from A1 to A2
  – T2: transfers $250 from A2 to A3

Page 33: 1. Intro to ADBMS Course

Example 8.30

• Here is a possible sequence of events:
  1. T2 executes step 1
  2. T1 executes step 1
  3. T2 executes the test of step 2
  4. T1 executes the test of step 2
  5. T2 executes step 2b
  6. T1 executes step 2a

Page 34: 1. Intro to ADBMS Course

Example 8.30

• Here the dirty read is a problem: it left account A2 with a negative balance (-$50).

• The total amount of money has not changed, though: it is still $600 across the 3 accounts.
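The six events of Example 8.30 can be traced with a short simulation. It is a sketch under one stated assumption: step 2a (the ROLLBACK) is modeled as taking back the amount that step 1 added. The helper names are hypothetical.

```python
balances = {"A1": 100, "A2": 200, "A3": 300}


def step1(dst, amount):
    balances[dst] += amount  # add money to the target account


def has_funds(src, amount):
    return balances[src] >= amount  # may read uncommitted (dirty) data


def step2a(dst, amount):
    balances[dst] -= amount  # assumption: rollback undoes step 1


def step2b(src, amount):
    balances[src] -= amount  # complete the transfer


# T1 transfers $150 from A1 to A2; T2 transfers $250 from A2 to A3.
step1("A3", 250)               # 1. T2 executes step 1: A3 = 550
step1("A2", 150)               # 2. T1 executes step 1: A2 = 350 (uncommitted)
t2_ok = has_funds("A2", 250)   # 3. T2 tests A2 and reads the dirty 350
t1_ok = has_funds("A1", 150)   # 4. T1 tests A1, which holds only 100
step2b("A2", 250)              # 5. T2 executes step 2b: A2 = 100
step2a("A2", 150)              # 6. T1 executes step 2a: A2 = -50

total = sum(balances.values())
print(balances, total)
```

T2's test succeeds only because it saw T1's uncommitted deposit into A2; once T1 takes that deposit back, A2 is overdrawn even though the total is unchanged.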

Page 35: 1. Intro to ADBMS Course

Example 8.31

• Consider the relation Flights(fltNum, fltDate, fltSeat, occupied), find if a particular seat is available, and make it occupied if so. Use the following algorithm:

1. We find an available seat and reserve it by setting occupied to TRUE for that seat. If there is none, we abort.

2. We ask the customer for approval of the seat. If so we commit. If not we release the seat by setting occupied to FALSE and repeat step 1 to get another seat.

Page 36: 1. Intro to ADBMS Course

Example 8.31

• If two transactions T1 and T2 are executing this algorithm at about the same time, T1 might reserve a seat S which its customer later rejects.

• If T2 executes step 1 at a time when seat S is marked occupied, the customer for that transaction is not given the option to take seat S.

• A dirty read has occurred, but here the consequences are not too serious.

• Choosing seats with dirty reads allowed makes sense because it speeds up the average processing time for booking requests.

Page 37: 1. Intro to ADBMS Course

Example 8.31

• SQL allows us to specify that dirty reads are acceptable for a given transaction using the following command:

SET TRANSACTION READ WRITE
    ISOLATION LEVEL READ UNCOMMITTED;

• The first line declares that the transaction may write data.

• The second line declares that the transaction may run with the isolation level read uncommitted, i.e. it is allowed to read dirty data, which suits Example 8.31.

Page 38: 1. Intro to ADBMS Course

SQL Isolation Levels

Isolation levels determine what a transaction is allowed to see. The declaration, valid for one transaction, is:

SET TRANSACTION ISOLATION LEVEL X;

where:

• X = SERIALIZABLE: this transaction must execute as if at a single point in time, with every other transaction occurring either completely before or completely after it.

Page 39: 1. Intro to ADBMS Course

SQL Isolation Levels

• X = READ COMMITTED: this transaction can read only committed data.
  – Example: if transactions are as above, Sally could see the original Sells for statement 1 and the completely changed Sells for statement 2.

• X = REPEATABLE READ: if a transaction reads data twice, then what it saw the first time, it will see the second time (it may see more the second time).
  – Moreover, all data read at any time must be committed; i.e., REPEATABLE READ is a strictly stronger condition than READ COMMITTED.
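One way to compare the four levels is by the read anomalies each one still permits. The table below follows the usual SQL-standard summary; the anomaly names "nonrepeatable read" and "phantom" do not appear on the slides ("it may see more the second time" describes a phantom), and the helper function is hypothetical.

```python
# Anomalies each isolation level still allows (standard SQL summary).
ALLOWED_ANOMALIES = {
    "READ UNCOMMITTED": {"dirty read", "nonrepeatable read", "phantom"},
    "READ COMMITTED": {"nonrepeatable read", "phantom"},
    "REPEATABLE READ": {"phantom"},
    "SERIALIZABLE": set(),
}


def at_least_as_strong(level_a, level_b):
    """True if level_a permits no anomaly that level_b forbids."""
    return ALLOWED_ANOMALIES[level_a] <= ALLOWED_ANOMALIES[level_b]


print(at_least_as_strong("REPEATABLE READ", "READ COMMITTED"))
print(at_least_as_strong("READ COMMITTED", "REPEATABLE READ"))
```

Read this way, each level forbids everything the weaker levels forbid plus one more anomaly, which is why REPEATABLE READ is strictly stronger than READ COMMITTED.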

Page 40: 1. Intro to ADBMS Course

Transaction Management

Start of DBMS Internals

Chapter 17 (Garcia-Molina, Ullman, Widom book)

Page 41: 1. Intro to ADBMS Course

Transaction Manager

• It is normal to group one or more database operations into a transaction, which is a unit of work that must be executed atomically and in apparent isolation from other transactions.

• To assure that transactions are executed correctly and atomically, the transaction manager interacts with:
  – Log and Recovery Manager
  – Buffer Manager
  – Concurrency-Control Manager (Scheduler)
  – Query Processor

Page 42: 1. Intro to ADBMS Course

Transaction Management

Page 43: 1. Intro to ADBMS Course

Log and Recovery Manager

• In order to assure durability every change is logged separately on disk.

• The log manager follows one of several policies designed to assure that no matter when a system failure or crash occurs, recovery manager will be able to examine the log of changes and restore the database to some consistent state.

• The log manager initially writes the log in buffers and negotiates with the buffer manager to make sure that the buffers are written to disk (where data can survive a crash) at appropriate times.
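The slides do not say which logging policy is used. As one possibility, here is a minimal sketch of undo logging, which the syllabus lists: the old value of an item is logged before the item is changed, and recovery undoes the writes of transactions that have no COMMIT record. The in-memory list and dictionary stand in for the on-disk log and database, and every name is hypothetical.

```python
db = {"A": 10, "B": 20}
log = []  # stands in for the log on disk


def write(txn, key, value):
    # Log the old value first so that the change can be undone later.
    log.append(("UPDATE", txn, key, db[key]))
    db[key] = value


def commit(txn):
    log.append(("COMMIT", txn))


def recover():
    """Undo, newest first, every update of a transaction without a COMMIT record."""
    committed = {rec[1] for rec in log if rec[0] == "COMMIT"}
    for rec in reversed(log):
        if rec[0] == "UPDATE" and rec[1] not in committed:
            _, _, key, old_value = rec
            db[key] = old_value


write("T1", "A", 15)
commit("T1")
write("T2", "B", 99)
write("T2", "A", 0)
# A crash happens here, before T2 commits.
recover()
print(db)
```

After recovery the committed change by T1 survives and both writes by T2 are gone, so the database is back in a consistent state.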

Page 44: 1. Intro to ADBMS Course

Concurrency-Control Manager or Scheduler

• The scheduler must assure that the individual actions of multiple transactions are executed in such an order that the net effect is the same as if transactions had in fact executed in their entirety, one at a time or serially.

• A typical scheduler does its work by maintaining locks on certain pieces of the data.

• These locks prevent 2 transactions from accessing the same piece of data in ways that interact badly.

• Locks are generally stored in a main-memory lock table.

• The scheduler affects the execution of queries by forbidding the execution engine (part of the query processor) from accessing locked parts of the database.
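A main-memory lock table can be sketched as a dictionary from data items to the lock currently held on them. The shared ("S") and exclusive ("X") modes used here are an assumption, since the slides only say that locks prevent bad interactions; the class and its methods are hypothetical.

```python
class LockTable:
    """Hypothetical lock table: item -> (mode, set of holding transactions)."""

    def __init__(self):
        self._locks = {}

    def request(self, txn, item, mode):
        """Return True if the lock is granted, False if txn would have to wait."""
        held = self._locks.get(item)
        if held is None:
            self._locks[item] = (mode, {txn})
            return True
        held_mode, holders = held
        if mode == "S" and held_mode == "S":
            holders.add(txn)  # any number of readers may share the item
            return True
        if holders == {txn}:
            if mode == "X":
                self._locks[item] = ("X", holders)  # sole holder may upgrade
            return True
        return False  # conflicting lock held by another transaction

    def release_all(self, txn):
        for item in list(self._locks):
            _, holders = self._locks[item]
            holders.discard(txn)
            if not holders:
                del self._locks[item]


table = LockTable()
r1 = table.request("T1", "seat 22A", "S")  # granted
r2 = table.request("T2", "seat 22A", "S")  # granted, shared with T1
r3 = table.request("T2", "seat 22A", "X")  # refused while T1 also holds it
table.release_all("T1")
r4 = table.request("T2", "seat 22A", "X")  # granted, T2 is the sole holder
r5 = table.request("T1", "seat 22A", "S")  # refused, T2 holds it exclusively
print(r1, r2, r3, r4, r5)
```

A real scheduler would queue a refused request instead of returning False, but the grant-or-refuse decision is the part the lock table provides.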

Page 45: 1. Intro to ADBMS Course

Deadlock Resolution

• As transactions compete for resources through the locks that the scheduler grants, they can get into a situation where none can proceed as each needs something another has.

• Transaction manager has the responsibility to intervene and cancel one or more transactions to let the other proceed.
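The slides do not name a detection method. One common approach, shown here as a sketch, keeps a waits-for graph (an edge from T to U means T waits for a lock that U holds) and looks for a cycle. The victim rule below (cancel the transaction with the largest name) is an arbitrary placeholder, not a recommendation.

```python
def find_cycle(waits_for):
    """Return a list of transactions forming a cycle, or None if there is none."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    color = {}
    path = []

    def visit(txn):
        color[txn] = GREY
        path.append(txn)
        for other in sorted(waits_for.get(txn, ())):
            state = color.get(other, WHITE)
            if state == GREY:
                return path[path.index(other):]  # back edge closes a cycle
            if state == WHITE:
                cycle = visit(other)
                if cycle:
                    return cycle
        path.pop()
        color[txn] = BLACK
        return None

    for txn in sorted(waits_for):
        if color.get(txn, WHITE) == WHITE:
            cycle = visit(txn)
            if cycle:
                return cycle
    return None


# T1 and T2 wait for each other; T3 waits for T1.
waits_for = {"T1": {"T2"}, "T2": {"T1"}, "T3": {"T1"}}
cycle = find_cycle(waits_for)
victim = max(cycle)  # placeholder victim choice

# Cancelling the victim removes it and every edge that points to it.
remaining = {t: ws - {victim} for t, ws in waits_for.items() if t != victim}
print(cycle, victim, find_cycle(remaining))
```

Once the victim is cancelled the graph has no cycle, so the remaining transactions can proceed.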

Page 46: 1. Intro to ADBMS Course

Query Processor

• The portion of the DBMS that most affects the performance the user sees is the query processor. It has two components: the Query Compiler and the Execution Engine.

• Query Compiler: translates the query into an internal form called a query plan, a sequence of operations that are implementations of relational algebra operations. It has 3 major units:
  – A query Parser (builds a parse tree from the textual query)
  – A query Preprocessor (performs semantic checks)
  – A query Optimizer (finds the best available query plan)

• Execution Engine: executes each of the steps in the chosen query plan.

Page 47: 1. Intro to ADBMS Course

Queries

• Find all courses that “Mary” takes:

SELECT C.name
FROM Students S, Takes T, Courses C
WHERE S.name='Mary' and S.ssn = T.ssn and T.cid = C.cid

• What happens behind the scenes?
  – The query processor figures out how to answer the query efficiently.

Page 48: 1. Intro to ADBMS Course

Queries, behind the scene

Declarative SQL query:

SELECT C.name
FROM Students S, Takes T, Courses C
WHERE S.name='Mary' and S.ssn = T.ssn and T.cid = C.cid

Imperative query execution plan:

[Figure: query plan tree over Students, Takes and Courses, with operators labeled sid=sid (join), cid=cid (join), name=“Mary” (selection) and sname (projection).]

The optimizer chooses the best execution plan for a query
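To make "imperative plan" concrete, here is one possible plan for the Mary query written as explicit steps over in-memory lists. The sample rows are invented, the join key follows the SQL text (ssn), and the order of operations (selection first, then the two joins, then projection) is only one of several plans an optimizer could choose.

```python
# Invented sample data for the three relations.
students = [{"ssn": 1, "name": "Mary"}, {"ssn": 2, "name": "John"}]
takes = [{"ssn": 1, "cid": 10}, {"ssn": 1, "cid": 20}, {"ssn": 2, "cid": 10}]
courses = [{"cid": 10, "name": "Databases"}, {"cid": 20, "name": "Algorithms"}]

# Selection: keep only the students named Mary.
marys = [s for s in students if s["name"] == "Mary"]

# Join with Takes on ssn (a simple nested-loop join).
mary_takes = [t for s in marys for t in takes if s["ssn"] == t["ssn"]]

# Join with Courses on cid, then project the course name.
result = [c["name"] for t in mary_takes for c in courses if t["cid"] == c["cid"]]

print(result)
```

Applying the selection first keeps the intermediate results small; joining all three relations before filtering would return the same answer with more work.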

Page 49: 1. Intro to ADBMS Course

Query Compiler

• Query Compiler uses metadata and statistics about the data to decide which sequence of operations is likely to be the fastest.

• For example, the existence of an index (a specialized data structure that facilitates access to data, given values for one or more components of that data) can make one plan much faster than another.
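The effect of an index can be sketched by counting how many rows each access path examines. The dictionary below stands in for an index on Students.name; a real DBMS would more likely use a B-tree or hash index on disk, and the data and function names are invented.

```python
# 1000 invented student rows, one of them named Mary.
students = [{"ssn": i, "name": f"student{i}"} for i in range(1000)]
students[417]["name"] = "Mary"


def scan_lookup(rows, name):
    """Full scan: every row is examined."""
    examined = 0
    matches = []
    for row in rows:
        examined += 1
        if row["name"] == name:
            matches.append(row)
    return matches, examined


# Build the index once: name -> list of rows with that name.
name_index = {}
for row in students:
    name_index.setdefault(row["name"], []).append(row)


def index_lookup(index, name):
    """Index lookup: only the matching rows are touched."""
    matches = index.get(name, [])
    return matches, len(matches)


scan_matches, scan_examined = scan_lookup(students, "Mary")
index_matches, index_examined = index_lookup(name_index, "Mary")
print(scan_examined, index_examined)
```

Both paths return the same row, but the scan examines all 1000 rows while the index lookup touches one, which is the kind of difference the query compiler weighs when it compares plans.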

Page 50: 1. Intro to ADBMS Course

Execution Engine

• It executes each of the steps in the chosen query plan.

• The execution engine interacts with most of the other components of the DBMS, either directly or through the buffers.

• It must get the data from the database (stored on disk) into buffers in order to manipulate the data.

• It needs to interact with the scheduler to avoid accessing data that is locked, and with the log manager to make sure that all database changes are properly logged.