distributed databases 1. 2 outline introduction principles / objectives problems

Distributed databases

1



2

Outline

introduction principles / objectives problems


3

Introduction

communication network

server

applicationapplication

application



server

serverDBMS in its own right


4

Introduction

distributed database = collection of connected sites each site is a DB in its own right (1)

• has its own DBMS and its own users

• operations can be performed locally as if the DB was not distributed

the sites collaborate (transparently from the user’s point of view) the union of all DBs = the DB of the whole organisation (institution)

• (oppose to (1))

physical or logical distribution strict homogeneity (assumption)


5

Motivation

advantages matches the structure of the organisation

• example

efficiency of processing• stored closely to where it is being used

increased accessibility• remote DBs can be accessed

disadvantage complexity


6

Implementations (systems)

commercial INGRES/STAR (Ask Group Inc. Ingres Division) distributed database option of ORACLE7 (Oracle

Corporation) distributed data facility of DB2 (IBM)

they all provide some sort of features for distributed databases


7

Fundamental principle

a distributed DB system should look to the user exactly as a non-distributed DB system


8

Principles / objectives

local autonomy no reliance on central site

continuous operation location independence

fragmentation independence replication independence

distributed query processing distributed transaction management

hardware independence OS independence

network independence DBMS independence


9

Principles / objectives

not independent from each other not exhaustive sometimes contradicting different degree of importance (for the user)


10

Local autonomy

all operations at a certain site are fully controlled by that site

not achievable (why?) therefore, autonomy should be achieved to the

maximum extent possible

local data is locally owned and managed local data belongs to the local server even if it is

accessible from other servers security, integrity, ..., are in the responsibility of the local

server


11

No reliance on a central site

reasons bottle-neck vulnerability

conclusion all sites must be equal


12

Continuous operation

greater reliability the probability that the system is running at any moment

of time

greater availability the probability that the system is running for a specified

period of time


13

Location independence / transparency

users should not have to know where data is physically stored

why do you think this is needed?• think of application programs

what does this objective look like?


14

Data fragmentation

data fragmentation if a relation can be divided into “fragments” for storing purposes motivation: performance - data is stored where it is mostly used

types horizontal or vertical

definition fragment = any subrelation derivable via restriction or projection

restrictions disjoint decompositions non-loss decompositions


15

FRAGMENT Emp INTOLo_Emp AT SITE ‘London’

WHERE Dept_id = ‘Sales’Le_Emp AT SITE ‘Leeds’

WHERE Dept_id = ‘Dev’ ;

Data fragmentation - example


16

Fragmentation independence / transparency

users should perceive data as if it were not fragmented

why?

it is the optimiser’s responsibility to determine which fragments need to be physically accessed

similar to views retrieving updating (JOIN and UNION views)


17

FRAGMENT Emp INTOLo_Emp AT SITE ‘London’ WHERE Dept_id = ‘Sales’Le_Emp AT SITE ‘Leeds’ WHERE Dept_id = ‘Dev’ ;

--looks (and works almost) like a viewSELECT * FROM Emp WHERE Salary > 40 AND Dept_id = ‘Dev’;--is transformed intoSELECT * FROM Lo_emp WHERE Salary > 40 AND Dept_id = ‘Dev’;UNIONSELECT * FROM Le_emp WHERE Salary > 40 AND Dept_id = ‘Dev’ ;

Fragmentation independence - example


18

Data replication

copies of the same fragment can exist at different sites

reasons better availability better performance

disadvantage update propagation


19

Replication independence / transparency

users should not have to be aware of data replication

it is the optimiser’s responsibility to choose which replica to use

commercial systems not full support for replication independence (update

problems) - primary copy


20

Distributed query processing

the system must have set level operators one record at a time - too many messages (traffic) relational - indicated

optimisation particularly relevant! find best way to move data across the network


21

Distributed transaction management

problems recovery concurrency

transaction = set of agents agent - runs on a certain machine

recovery two-phase commit protocol

concurrency locking


22

Problems

occur due to network utilisation network messages are costly

aim minimise network utilisation

problems: query processing catalog management update propagation recovery control concurrency control


23

Query processing

in a distributed environment query execution is distributed query optimisation is distributed

• global optimisation

• local optimisation

example• query on relation R issued at site X

• part of R, say Ry, stored at Y

• part of R, say Rz, stored at Z

• where is the query going to be executed?


24

Query processing example - initial conditions

Site A: Suppliers ( S_id, City ) 10,000 tuplesContracts ( S_id, P_id ) 1,000,000 tuples

Site B: Parts (P_id, Colour ) 100,000 tuples

SELECT S.S_idFROM Suppliers S, Contracts C, Parts PWHERE S.S_id = C.S_id AND P.P_id = C.P_id AND

City = ‘London’ ANDColour = ‘red’ ;


25

Query processing example - evaluation

possible evaluation procedures (1) move relation Parts to site A and evaluate the query at A (2) move relations Suppliers and Contracts to B and evaluate at B (3) join Suppliers with Contracts at A, restrict the tuples for suppliers

from London, and for each of these tuples check at site B to see whether the corresponding part is red

(4) join Suppliers with Contracts at A, restrict the tuples for suppliers from London, transfer them B and terminate the processing there

(5) restrict Parts to tuples containing red parts, move the result to A and process there

(6) think of other possibilities … there is an extra dimension added by the site where the query was issued


26

Query processing example - total time

total_time = delay_time + data_transfer_time = no_messages * 0.1 + data_volume(in bits) / 50000

assumptions:

1. disregard computation time on each server (site)2. estimated cardinality of some intermediate results

red parts …... 10contracts with suppliers from London …... 50,000

3. communication assumptions

date rate …... 50k bits / second

access delay …... 0.1 second

4. size of each tuple ……. 200 bits


27

Query processing example - total time

procedure messages/ accesses

total datatransferred

time

move all to A 1 access 100,000 * 200 6.67min

move all to B 2 accesses (1,000,000 + 10,000) *200

1.12hours

join S and C at site A, restrictto 'London' and check foreach tuple …

2 * 50,000messages(query +response foreach tuple)

0 2.58hours

restrict P to 'red' and checkfor each tuple whether thereexists a contract …

2 * 10messages(see above)

0 2seconds

restrict P to 'red', transfer tosite A and process at A

1 access 10 * 200 0.1seconds


28

Catalog management

what ‘other’ data does the catalog include? fragmentation, replication ...

where should the catalogue be stored centralised

• objective: no central site!

fully replicated • loss of autonomy - update propagation!

partitioned • non local operations - very expensive!

combination of first and third


29

Catalog management

R* - object naming

<creator ID>@<creator site ID>.<local name>@<birth site ID> e.g. [email protected]@PostgresGold

each site maintains a catalog entry for every object born at that site (and the site where it had

migrated, if applicable) every object stored at that site object identification - at most 2 sites need to be accessed


30

Update propagation

problems because of replication data might become less available

• due to immediate update request

primary copy scheme one copy is designated primary copy (unique) primary copies exist at different sites (distributed) an update is logically complete if the primary copy has been updated

• the site holding the primary copy would have to propagate the updates• this has to be done before COMMIT (preserve - ACID)• commercial systems: update propagation is guaranteed for some future

time

violation of local autonomy


31

Recovery control

two-phase commit protocol issues

there can be no central site so each site should be able to act as a coordinator

• usually the site where the transaction was initiated

other sites are told by the coordinator what to do • loss of autonomy

there is no protocol (theoretically) that guarantees that a transaction is / is not performed by all agents with respect to any kind of failure

increased number of messages more complex protocols


32

Concurrency control

locking overhead - increased number of messages

primary copy strategy locking only the primary copy the primary copy’s site will propagate the update loss of autonomy (severely)

global deadlock two interlocked (waiting for each other) sites cannot be detected using the wait-for graph - therefore,

communication overhead


33

Global deadlock

Transaction 1x

Transaction 1y Transaction 2y

Transaction 2x

holds lock on tx

holds lock on ty

site X

site Y


34

Gateways

DBMS #1(Ingres)

DBMS #2(Oracle)

GATEWAY


35

Client / server systems

are a special case of distributed database systems

remember from last term read for extra information

distributed databases 1. 2 outline introduction principles / objectives problems

Documents

distributed databases

network slide

user slide

right slide

equal slide

local server slide

union views slide

disadvantage complexity