Distributed databases
Outline
  introduction
  principles / objectives
  problems
Introduction
  [Figure: applications and servers connected by a communication network; each server is a DBMS in its own right]
Introduction
  distributed database = collection of connected sites
  each site is a DB in its own right (1)
    • has its own DBMS and its own users
    • operations can be performed locally, as if the DB were not distributed
  the sites collaborate (transparently from the user’s point of view)
  the union of all DBs = the DB of the whole organisation (institution)
    • (as opposed to (1))
  physical or logical distribution
  strict homogeneity (assumption)
Motivation
  advantages
    matches the structure of the organisation
      • example
    efficiency of processing
      • data is stored close to where it is being used
    increased accessibility
      • remote DBs can be accessed
  disadvantage
    complexity
Implementations (systems)
  commercial
    INGRES/STAR (Ask Group Inc., Ingres Division)
    distributed database option of ORACLE7 (Oracle Corporation)
    distributed data facility of DB2 (IBM)
  they all provide some sort of features for distributed databases
Fundamental principle
  to the user, a distributed DB system should look exactly like a non-distributed DB system
Principles / objectives
  local autonomy
  no reliance on central site
  continuous operation
  location independence
  fragmentation independence
  replication independence
  distributed query processing
  distributed transaction management
  hardware independence
  OS independence
  network independence
  DBMS independence
Principles / objectives
  not independent of each other
  not exhaustive
  sometimes contradictory
  different degrees of importance (for the user)
Local autonomy
  all operations at a given site are fully controlled by that site
    not fully achievable (why?)
    therefore, autonomy should be achieved to the maximum extent possible
  local data is locally owned and managed
    local data belongs to the local server even if it is accessible from other servers
    security, integrity, ... are the responsibility of the local server
No reliance on a central site
  reasons
    bottleneck
    vulnerability
  conclusion
    all sites must be equal
Continuous operation
  greater reliability
    the probability that the system is running at any moment of time
  greater availability
    the probability that the system is running for a specified period of time
Location independence / transparency
  users should not have to know where data is physically stored
  why do you think this is needed?
    • think of application programs
  what does this objective look like?
Data fragmentation
  data fragmentation
    a relation can be divided into “fragments” for storage purposes
    motivation: performance - data is stored where it is most often used
  types
    horizontal or vertical
  definition
    fragment = any subrelation derivable via restriction or projection
  restrictions
    disjoint decompositions
    non-loss decompositions
Data fragmentation - example
  FRAGMENT Emp INTO
    Lo_Emp AT SITE 'London' WHERE Dept_id = 'Sales'
    Le_Emp AT SITE 'Leeds' WHERE Dept_id = 'Dev' ;
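The FRAGMENT statement is a horizontal fragmentation by restriction. As a minimal sketch of what the two restrictions on a decomposition (disjoint, non-loss) mean, the Python below models Emp as a list of dicts. The sample rows, the `restrict` helper and the `Emp_id` key are assumptions made for illustration; they do not come from any DBMS.

```python
# Toy Emp relation; the rows are made up for illustration.
emp = [
    {"Emp_id": 1, "Dept_id": "Sales", "Salary": 45},
    {"Emp_id": 2, "Dept_id": "Dev", "Salary": 50},
    {"Emp_id": 3, "Dept_id": "Dev", "Salary": 35},
]

def restrict(relation, predicate):
    """Relational restriction: keep the tuples that satisfy the predicate."""
    return [row for row in relation if predicate(row)]

# Horizontal fragments, mirroring the FRAGMENT statement.
lo_emp = restrict(emp, lambda r: r["Dept_id"] == "Sales")  # stored at London
le_emp = restrict(emp, lambda r: r["Dept_id"] == "Dev")    # stored at Leeds

def key(row):
    """Assumed primary key of Emp."""
    return row["Emp_id"]

# Disjoint decomposition: no tuple appears in both fragments.
is_disjoint = not ({key(r) for r in lo_emp} & {key(r) for r in le_emp})

# Non-loss decomposition: the union of the fragments gives back Emp.
reconstructed = sorted(lo_emp + le_emp, key=key)
is_non_loss = reconstructed == sorted(emp, key=key)
```

For vertical fragments the same idea would use projection to split and a join on the key to reconstruct.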
Fragmentation independence / transparency
  users should perceive data as if it were not fragmented
  why?
  it is the optimiser’s responsibility to determine which fragments need to be physically accessed
  similar to views
    retrieving
    updating (JOIN and UNION views)
Fragmentation independence - example
  FRAGMENT Emp INTO
    Lo_Emp AT SITE 'London' WHERE Dept_id = 'Sales'
    Le_Emp AT SITE 'Leeds' WHERE Dept_id = 'Dev' ;
  -- looks (and works almost) like a view
  SELECT * FROM Emp WHERE Salary > 40 AND Dept_id = 'Dev' ;
  -- is transformed into
  SELECT * FROM Lo_Emp WHERE Salary > 40 AND Dept_id = 'Dev'
  UNION
  SELECT * FROM Le_Emp WHERE Salary > 40 AND Dept_id = 'Dev' ;
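Deciding which fragments need to be physically accessed can be sketched as follows. This is a rough illustration, assuming every fragment is defined by a single equality on Dept_id; the `FRAGMENTS` table and the `fragments_needed` and `rewrite` functions are hypothetical names, not part of any real optimiser.

```python
# Fragment catalogue: fragment name -> (site, defining Dept_id value).
# Assumption: every fragment is defined by one equality on Dept_id.
FRAGMENTS = {
    "Lo_Emp": ("London", "Sales"),
    "Le_Emp": ("Leeds", "Dev"),
}

def fragments_needed(dept_id=None):
    """Return the fragments a query on Emp must touch.

    dept_id is the value required by a `Dept_id = ...` condition in the
    query, or None when the query does not restrict Dept_id.
    """
    if dept_id is None:
        return sorted(FRAGMENTS)  # no restriction: any fragment may hold matches
    return sorted(name for name, (_, value) in FRAGMENTS.items() if value == dept_id)

def rewrite(where, dept_id=None):
    """Build the UNION query over the fragments that survive pruning."""
    parts = [f"SELECT * FROM {name} WHERE {where}" for name in fragments_needed(dept_id)]
    return " UNION ".join(parts)

query = rewrite("Salary > 40 AND Dept_id = 'Dev'", dept_id="Dev")
```

The plain transformation unions both fragments. Under the assumption above, the Lo_Emp branch can be dropped because its defining condition (Dept_id = 'Sales') contradicts the query's Dept_id = 'Dev', so only the Leeds site has to be contacted.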
Data replication
  copies of the same fragment can exist at different sites
  reasons
    better availability
    better performance
  disadvantage
    update propagation
Replication independence / transparency
  users should not have to be aware of data replication
  it is the optimiser’s responsibility to choose which replica to use
  commercial systems
    no full support for replication independence (update problems) - primary copy
Distributed query processing
  the system must have set-level operators
    one record at a time - too many messages (traffic)
    relational - indicated
  optimisation
    particularly relevant!
    find the best way to move data across the network
Distributed transaction management
  problems
    recovery
    concurrency
  transaction = set of agents
    agent - runs on a certain machine
  recovery
    two-phase commit protocol
  concurrency
    locking
Problems
  occur due to network utilisation
    network messages are costly
  aim
    minimise network utilisation
  problems:
    query processing
    catalog management
    update propagation
    recovery control
    concurrency control
Query processing
  in a distributed environment
    query execution is distributed
    query optimisation is distributed
      • global optimisation
      • local optimisation
  example
    • query on relation R issued at site X
    • part of R, say Ry, stored at Y
    • part of R, say Rz, stored at Z
    • where is the query going to be executed?
Query processing example - initial conditions
  Site A:
    Suppliers ( S_id, City )     10,000 tuples
    Contracts ( S_id, P_id )     1,000,000 tuples
  Site B:
    Parts ( P_id, Colour )       100,000 tuples
  SELECT S.S_id
  FROM Suppliers S, Contracts C, Parts P
  WHERE S.S_id = C.S_id AND
        P.P_id = C.P_id AND
        City = 'London' AND
        Colour = 'red' ;
Query processing example - evaluation
  possible evaluation procedures
    (1) move relation Parts to site A and evaluate the query at A
    (2) move relations Suppliers and Contracts to B and evaluate at B
    (3) join Suppliers with Contracts at A, restrict the tuples to suppliers from London, and for each of these tuples check at site B whether the corresponding part is red
    (4) join Suppliers with Contracts at A, restrict the tuples to suppliers from London, transfer them to B and finish the processing there
    (5) restrict Parts to tuples containing red parts, move the result to A and process there
    (6) think of other possibilities …
  there is an extra dimension added by the site where the query was issued
Query processing example - total time
  total_time = delay_time + data_transfer_time
             = no_messages * 0.1 + data_volume (in bits) / 50000     (in seconds)
  assumptions:
    1. disregard computation time on each server (site)
    2. estimated cardinality of some intermediate results
         red parts …... 10
         contracts with suppliers from London …... 50,000
    3. communication assumptions
         data rate …... 50k bits / second
         access delay …... 0.1 second
    4. size of each tuple ……. 200 bits
Query processing example - total time
  procedure                             messages / accesses       total data transferred (bits) time
  move all to A                         1 access                  100,000 * 200                 6.67 min
  move all to B                         2 accesses                (1,000,000 + 10,000) * 200    1.12 hours
  join S and C at site A, restrict to   2 * 50,000 messages       0                             2.78 hours
  'London' and check for each tuple …   (query + response for
                                        each tuple)
  restrict P to 'red' and check for     2 * 10 messages           0                             2 seconds
  each tuple whether there exists a     (see above)
  contract …
  restrict P to 'red', transfer to      1 access                  10 * 200                      0.1 seconds
  site A and process at A
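The figures can be recomputed from the total_time formula. The sketch below is only a calculator for the stated assumptions (0.1 s access delay, 50k bits / second, 200 bits per tuple); the strategy labels and the names `total_time` and `strategies` are chosen here for illustration.

```python
ACCESS_DELAY = 0.1   # seconds per message / access
DATA_RATE = 50_000   # bits per second
TUPLE_BITS = 200     # bits per tuple

def total_time(messages, tuples_moved):
    """no_messages * access delay + data volume / data rate, in seconds."""
    return messages * ACCESS_DELAY + tuples_moved * TUPLE_BITS / DATA_RATE

strategies = {
    "move Parts to A": total_time(1, 100_000),
    "move Suppliers and Contracts to B": total_time(2, 1_000_000 + 10_000),
    "check each London contract at B": total_time(2 * 50_000, 0),
    "check each red part at A": total_time(2 * 10, 0),
    "move red parts to A": total_time(1, 10),
}

# Print the strategies from cheapest to most expensive.
for name, seconds in sorted(strategies.items(), key=lambda kv: kv[1]):
    print(f"{name:40s} {seconds:12.2f} s")
```

Under these assumptions the per-tuple check on London contracts costs 2 * 50,000 * 0.1 = 10,000 seconds, roughly 2.78 hours, while shipping the ten red Parts tuples to A takes about 0.14 seconds. The spread of several orders of magnitude is the point of the example.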
Catalog management
  what ‘other’ data does the catalog include?
    fragmentation, replication ...
  where should the catalog be stored?
    centralised
      • objective: no central site!
    fully replicated
      • loss of autonomy - update propagation!
    partitioned
      • non-local operations - very expensive!
    combination of first and third
Catalog management
  R* - object naming
    <creator ID>@<creator site ID>.<local name>@<birth site ID>
    e.g. [email protected]@PostgresGold
  each site maintains a catalog entry for
    every object born at that site (and the site to which it has migrated, if applicable)
    every object stored at that site
  object identification - at most 2 sites need to be accessed
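A small sketch of how such a system-wide name could be taken apart, and why at most two sites are visited to locate an object. It assumes that IDs contain no "@" or "." characters; the name `alice@SiteA.Stats@SiteB`, the `catalogs` dictionary and both functions are hypothetical and only illustrate the scheme.

```python
from typing import NamedTuple

class SystemWideName(NamedTuple):
    creator_id: str
    creator_site_id: str
    local_name: str
    birth_site_id: str

def parse_name(name: str) -> SystemWideName:
    """Split <creator ID>@<creator site ID>.<local name>@<birth site ID>."""
    creator_id, rest = name.split("@", 1)
    creator_site_id, rest = rest.split(".", 1)
    local_name, birth_site_id = rest.split("@", 1)
    return SystemWideName(creator_id, creator_site_id, local_name, birth_site_id)

def sites_to_ask(name: str, catalogs: dict) -> list:
    """Sites visited to locate an object: the birth site first, then, if the
    object has migrated, the site recorded in the birth site's catalog."""
    birth = parse_name(name).birth_site_id
    current = catalogs[birth].get(name, birth)
    return [birth] if current == birth else [birth, current]

# Hypothetical name and catalogs, for illustration only.
name = "alice@SiteA.Stats@SiteB"
catalogs = {"SiteB": {name: "SiteC"}, "SiteC": {}}
route = sites_to_ask(name, catalogs)
```

Here the object was born at SiteB and later migrated to SiteC, so the lookup touches the birth site and then the current site, and never more than those two.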
Update propagation
  problems
    because of replication, data might become less available
      • due to immediate update request
  primary copy scheme
    one copy is designated the primary copy (unique)
    primary copies exist at different sites (distributed)
    an update is logically complete once the primary copy has been updated
      • the site holding the primary copy then has to propagate the update
      • this has to be done before COMMIT (to preserve ACID)
      • commercial systems: update propagation is guaranteed for some future time
    violation of local autonomy
Recovery control
  two-phase commit protocol
  issues
    there can be no central site, so each site should be able to act as a coordinator
      • usually the site where the transaction was initiated
    other sites are told by the coordinator what to do
      • loss of autonomy
    there is (theoretically) no protocol that guarantees that a transaction is either performed or not performed by all agents under every possible kind of failure
    increased number of messages
    more complex protocols
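The basic shape of two-phase commit can be sketched as below. This is a deliberately simplified, in-memory illustration: there is no logging, no timeout and no failure handling, which are exactly the hard parts. The `Participant` class and `two_phase_commit` function are hypothetical names.

```python
class Participant:
    """One agent of the transaction; `can_commit` simulates its local vote."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "active"

    def prepare(self):
        # Phase 1: a real agent would force its log records to disk, then vote.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def rollback(self):
        self.state = "aborted"

def two_phase_commit(participants):
    """Coordinator logic; returns the global decision."""
    # Phase 1: ask every participant to prepare and collect the votes.
    votes = [p.prepare() for p in participants]
    decision = "commit" if all(votes) else "rollback"
    # Phase 2: broadcast the decision; every participant must obey it.
    for p in participants:
        if decision == "commit":
            p.commit()
        else:
            p.rollback()
    return decision
```

Each participant exchanges several messages with the coordinator (prepare, vote, decision and typically an acknowledgement), which is where the increased message count comes from, and the participants must do what the coordinator decides, which is the loss of autonomy.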
Concurrency control
  locking
    overhead - increased number of messages
  primary copy strategy
    locking only the primary copy
    the primary copy’s site will propagate the update
    loss of autonomy (severe)
  global deadlock
    two interlocked sites (each waiting for the other)
    cannot be detected using a site's local wait-for graph - therefore, communication overhead
Global deadlock
  [Figure: global deadlock across two sites. Site X runs agents Transaction 1x and Transaction 2x, with a lock held on tx; site Y runs agents Transaction 1y and Transaction 2y, with a lock held on ty]
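A sketch of why neither site sees the deadlock on its own. The wait-for edges below are an assumed scenario in the spirit of the figure (the arrows themselves are not recoverable from the transcript): 2x waits for the lock held by 1x, 1x waits for its remote agent 1y, 1y waits for the lock held by 2y, and 2y waits for its remote agent 2x. `has_cycle` and `merge` are illustrative helper names.

```python
def has_cycle(graph):
    """Detect a cycle in a directed graph given as {node: [successors]}."""
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {}

    def visit(node):
        colour[node] = GREY
        for nxt in graph.get(node, []):
            state = colour.get(nxt, WHITE)
            if state == GREY:
                return True  # back edge: a cycle exists
            if state == WHITE and visit(nxt):
                return True
        colour[node] = BLACK
        return False

    return any(colour.get(n, WHITE) == WHITE and visit(n) for n in list(graph))

def merge(*graphs):
    """Combine local wait-for graphs into one global graph."""
    merged = {}
    for g in graphs:
        for node, succs in g.items():
            merged.setdefault(node, []).extend(succs)
    return merged

# Assumed scenario: an edge a -> b means "a waits for b".
site_x = {"T2x": ["T1x"], "T1x": ["T1y"]}  # T1x waits for its remote agent T1y
site_y = {"T1y": ["T2y"], "T2y": ["T2x"]}  # T2y waits for its remote agent T2x

local_x = has_cycle(site_x)
local_y = has_cycle(site_y)
global_deadlock = has_cycle(merge(site_x, site_y))
```

Each local graph is acyclic, and the cycle only appears once the graphs are combined. Combining them requires the sites to exchange wait-for information, hence the communication overhead.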
Gateways
  [Figure: a GATEWAY between DBMS #1 (Ingres) and DBMS #2 (Oracle)]
Client / server systems
  are a special case of distributed database systems
  remember from last term
  read for extra information