
Chapter 2 - Background

2.1 Introduction

This and the next chapter provide an overview of parallel relational database

systems. The concepts are introduced in this chapter and current research areas are

identified in the next chapter.

The exploitation of the inherent parallelism in database applications by running

them on suitable parallel platforms to enhance their performance has become of

significant commercial interest. Initial parallel systems were based on purpose-built

database software and hardware (parallel database machines) and were appropriate for

carrying out relational-type operations in parallel [Dewitt, 90]. There have been several

different parallel machines built that support standard commercial database systems

adapted to a parallel environment.

The first section in this chapter provides a brief review of parallel database

architectures and discusses their advantages and disadvantages. Section 2.3 describes the

GoldRush MegaServer parallel platform from ICL. This is the architecture which has

been used throughout this thesis. Data placement is introduced in section 2.4 by

describing some of the main methods and describing the differences between them.

Section 2.5 describes the two database performance benchmarks that have been used in

this thesis. Some of the terminology concerning parallel database systems that will be

used throughout the rest of the thesis is described in section 2.6. The chapter is rounded

off with some summarising comments.


2.2 Review of parallel database architectures

This section presents an overview of parallel relational database systems.

Familiarity with conventional Relational DBMS (RDBMS) [Codd, 70, 94], [Elmasri, 99]

is assumed.

Traditional parallel relational database systems were usually classified as either

shared-everything, shared-disk or shared-nothing, although the distinctions between these

have become blurred. As with non-parallel architectures, the majority of parallel

architectures which host database systems comprise processors, main memory modules,

and secondary storage (disks) as their major types of components. The hardware vendors

make design choices about the way the various components are connected and this serves

as the basis for a classification of parallel database systems into the categories: shared

memory, shared disk, shared nothing and hierarchical.

2.2.1 Shared memory systems

In this type of system the processors and disks have access to a common memory,

typically via a bus or through an interconnection network. This gives extremely efficient communication between the processors: data in shared memory can be accessed by any processor without having to be moved in software. The architecture may not scale beyond 32 or 64 processors, since the bus or the interconnection network becomes a bottleneck. Shared memory systems are widely used for lower degrees of parallelism (4 to 8).

Examples of research systems, developed for the shared memory model include

DBS3 [Bergsten, 91], [Zait, 96], and XPRS [Hong, 93], while examples of commercial


systems include Sybase [Sybase, 04], Oracle [Oracle, 96], [Oracle, 96B], [Oracle, 04],

DB2 [IBM, 04], Informix [Informix, 94], [IBM, 03] and SQL Server [Microsoft, 04].

In a shared memory system there is a single copy of the operating system running

and a single instance of the database engine. As every processor has access to all of the

available memory, communication is straightforward. Message exchange and data

sharing are through the shared memory. It is also relatively easy to implement

synchronisation by using low-level mechanisms (such as semaphores). Shared memory

systems also lend themselves to load balancing, usually taken care of by the operating

system. These systems perform best when running a large number of small tasks such as

multiple threads.
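
As an illustration only (not part of any particular DBMS), the following Python sketch shows the kind of low-level, shared-memory synchronisation referred to above: several threads update a value held in memory common to all of them, with a semaphore ensuring that only one thread modifies it at a time.

    import threading

    counter = 0                          # data held in memory shared by all threads
    semaphore = threading.Semaphore(1)   # low-level mutual-exclusion primitive

    def worker(increments):
        global counter
        for _ in range(increments):
            semaphore.acquire()          # enter critical section
            counter += 1                 # safe update of the shared value
            semaphore.release()          # leave critical section

    threads = [threading.Thread(target=worker, args=(10000,)) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(counter)                       # always 40000, regardless of interleaving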

The shared bus allows any memory module to be accessed by any processor, but contention on the bus increases the memory-access latency. To counteract this, processors usually have fast private cache memories to speed up access to shared memory [Lu, 94]. This does, however, mean that mechanisms are required to ensure cache coherency. An example of a cache coherency algorithm is the “snooping bus write invalidate” method [Shatdal, 96], in which all of the cache controllers listen on the shared bus and a cached value is invalidated when a memory write to the cached address occurs; on a subsequent read, the value is read back into the cache from memory. Cache coherency introduces additional overhead, which increases as the number of processors or memory modules increases.
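
The following simplified Python sketch (an illustration, not the actual hardware protocol) captures the essence of write-invalidate snooping: every cache controller observes writes broadcast on the shared bus and discards its local copy of the address written, so that the next read fetches the up-to-date value from memory.

    class Memory:
        def __init__(self):
            self.cells = {}

    class SnoopingCache:
        def __init__(self, memory, bus):
            self.memory, self.bus, self.lines = memory, bus, {}
            bus.append(self)                       # register on the shared bus

        def read(self, addr):
            if addr not in self.lines:             # miss: fetch from shared memory
                self.lines[addr] = self.memory.cells.get(addr)
            return self.lines[addr]

        def write(self, addr, value):
            self.memory.cells[addr] = value
            self.lines[addr] = value
            for cache in self.bus:                 # broadcast the write on the bus
                if cache is not self:
                    cache.snoop_invalidate(addr)

        def snoop_invalidate(self, addr):
            self.lines.pop(addr, None)             # invalidate the stale cached copy

    memory, bus = Memory(), []
    c1, c2 = SnoopingCache(memory, bus), SnoopingCache(memory, bus)
    c1.write("x", 1)
    print(c2.read("x"))    # 1: c2 misses and fetches the value from memory
    c2.write("x", 2)       # c1's cached copy of "x" is invalidated
    print(c1.read("x"))    # 2: the next read fetches the new value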

2.2.2 Shared disk systems

In this type of system, all of the processors can directly access all of the disks via

an interconnection network, but each processor has its own main memory. This

architecture provides a degree of fault-tolerance, in that, if a processor fails, the other


processors can take over its tasks since the database is resident on the disks and these are

accessible from all processors. The downside is that the bottleneck now occurs at the

interconnection to the disk subsystem. Shared disk systems can scale to a large number of

processors, but communication between processors is slower.

Examples of commercial shared disk products include Oracle Parallel Server [Oracle,

96B], [Oracle, 04], DB2 [IBM, 04] and DB2 for IBM System/390 [White, 99].

Figure 2.1 Parallel database architectures

Shared disk systems can have their disk controllers and input/output controllers

connected to the interconnecting network rather than directly to nodes via their bus. On

the other hand some vendors implement a single global file store for their systems, which

provides a virtual shared disk architecture. Although certain disks in the system may be

directly connected to certain nodes, the DBMS has no concept of a node ‘owning’ a


certain set of disks. Rather, all disks and data are accessible by all nodes, an example

being Oracle [Oracle, 96B], [Oracle, 04].

Global locking and protocols are required to avoid conflicting accesses to the

same pages and to ensure cache coherency [Valduriez, 93]. To ensure the consistency of

data, a distributed lock manager is employed. This can lead to the creation of heavy

traffic on the interconnection network consisting of lock protocol messages.

2.2.3 Shared nothing systems

In this type of system, a node consists of a processor, memory, and a disk system.

A processor at one node communicates with another processor at another node using

some form of interconnection network. A node functions as the server for the data on the

disk or disks the node owns. Data accessed from local disks (and local memory accesses)

do not pass through the interconnection network, thereby minimizing the interference of

resource sharing. Shared nothing multiprocessors can be scaled up to thousands of

processors without interference. The main drawback is the cost of communication and

non-local disk access, as sending data involves software interaction at both ends.

Examples of research systems, developed for the shared nothing model are Bubba [Boral,

88], [Boral, 90] and EDS [Watson, 91], while commercial products include Informix XPS

[Informix, 98], [IBM, 03], Sybase MPP [Knoop, 99], [Sybase, 04C], DB2 Parallel

Edition [Baru, 95], [Infinio, 03] and NonStop SQL/MP [HP, 04].

The database is partitioned and stored on the disks of the different nodes. Each

node is granted exclusive control over a part of the database. A node has its own set of

local disks where its data is stored. It is not possible for a processor to directly access the


disks and memory of another. Communication between processors is achieved by sending

messages through the interconnection network.

In more recent architectures, each node itself may be a separate system. Such

architectures are made up of a combination of shared memory at the node level and

distributed memory at the global system level. In addition, the disk controllers of a

particular node may also be connected to the bus of a second node in order to allow fault-

tolerance. Thus the term ‘shared nothing’, originally referring to architectures where a

disk connects to a single node with a single processor, is sometimes also used to describe

hybrid systems with some limited form of memory and disk sharing [Delis, 98].

When more nodes are added to the system, the aggregate bandwidth requirement scales up in proportion to the number of processors added, and most current interconnection networks are designed to satisfy this requirement. Thus the major advantage of these systems is that they can scale up to handle hundreds of nodes.

It is difficult to co-ordinate large numbers of nodes. This difficulty manifests

itself in load balancing. Ideally the workload should be spread equally among the

participating nodes. However, when data skew [Hua, 91] is present, a parallelised

operation such as join may not bring good returns unless load-balancing issues are

considered explicitly as part of the design of the join algorithm.

2.2.4 Hierarchical

A hierarchical or hybrid architecture combines characteristics of shared-memory,

shared-disk, and shared-nothing architectures. The top level is a shared-nothing

architecture, where the nodes are connected by an interconnection network, and do not

share disks or memory with each other.


Each node of the system could be a shared-memory system with a few processors

or each node could be a shared-disk system, and each of the systems sharing a set of disks

could be a shared-memory system. Commercial hybrid NUMA (non-uniform memory

access) systems have been developed by vendors such as Data General, IBM, ICL, NCR,

among others [Garth, 96], [Rudin, 96]. Another example of a commercial hybrid system

is the Compaq TruCluster running Oracle [Gartner, 02].

2.2.5 Comparative analysis

A study of the benefits of the different architectures [Valduriez, 93] points to

shared disk architectures having similar advantages and disadvantages to shared nothing

systems. However, shared disk architectures are seen to provide a better opportunity for

load balancing and ease of migration. Moving from a centralised to a shared disk system

is relatively straightforward as the data on disk does not need to be reorganised. The

access to the shared disk subsystem is highlighted as a potential bottleneck but is not an

issue with shared nothing systems as the nodes do not have to co-ordinate data sharing

since each one is responsible for its own data. Shared disk systems are considered more

flexible in the sense that any node may be selected to access any piece of data, while in

shared nothing systems data can be accessed only through the node that owns it. The

shared disk ensures that a failed node will not affect the availability of the system,

although part of the data may not be accessible.

Shared memory’s main advantages are reported as the simplicity and ease of load

balancing, while limited extensibility and low availability are its disadvantages. Shared

memory architectures may suffer from poor availability, in the sense that a memory fault

may affect most processors and bring the system down.


Extensibility and availability are the main virtues of shared nothing systems. On

the other hand, high complexity and poor load balancing are shared nothing’s main

disadvantages. The difficulty with load balancing is due to the amount and location of

processing being determined by the database partitioning. This can especially be a

problem when a new processor or disk is to be added: a decision must be made as to how

to redistribute the data among the processors and disks, which is a costly operation. The

high complexity of a shared nothing system stems from the need for distributed database techniques, such as two-phase commit, to be implemented to ensure correct functioning of

the system.

There is no single database architecture that is better than others in all respects.

The choice of an optimal platform ultimately depends on the specific user application and

workload, although the current trend is for some form of hybrid system, to try and

capture the best parts of the pure architectures.

The architectural differences between the types of systems are becoming more

and more blurred. Hybrid systems called non-uniform memory access (NUMA) systems

were developed by vendors such as Data General, IBM, ICL, NCR, among others [Garth,

96], [Rudin, 96]. In these systems a node has access to the memory of a different node,

thus a common global memory is established. Each processor accesses the memory over

a two level interconnecting mechanism. Access to local memory (memory residing on the

same board as the processors of the node) is through a high-speed local bus and has a low

access time. Access to remote memory is through a different system bus and has higher

access time. A NUMA system runs a single instance of the operating system, thus

(potentially) providing a shared memory programming model. The effectiveness of this


architecture depends on maintaining a high level of data locality in each node. Cray/SGI

Origin2000, the HP-Convex Exemplar and the Sequent NUMA-Q Series are examples of

this hybrid architecture.

Another architecture that was used for handling large database systems was

Symmetric Multiprocessing (SMP). SMP machines are characterised by a tightly-coupled set of identical processors, usually operating on a single shared bank of memory. The

essence is of a timesharing, multi-tasking operating system that has more than one

processor to choose from when scheduling programs to run. The architectural design

consists of separate caches or a shared cache as well as shared main memory. A

commercial example of a SMP machine is the Sun Enterprise 10000 Server [Sun, 99].

There is also an architecture combining both the SMP and NUMA architectures where

another level of cache is introduced. A commercial example of this hybrid architecture is

the Unisys ES7000 [Unisys, 04].

A currently popular architecture for large database systems is a cluster of clusters. A cluster is a group of separate computers connected together and used as a single computing entity to provide a service or run an application, for purposes of scalability, load balancing, and task distribution. Clusters exchange information via messages or memory addressing. Clusters are usually made up of SMP machines, and a cluster may itself consist of a single SMP machine. A commercial

example of a cluster machine is the IBM pSeries 690 server [IBM, 04B].

Another current trend in computer manufacture is toward slim, hot swappable

blade servers [Vasudeva, 04] which fit in a single chassis like books on a bookshelf. Each blade is an independent server, with its own processors, memory, storage, network controllers,


operating system and applications. The blade server simply slides into a bay in the

chassis and plugs into a mid- or backplane, sharing power, fans, floppy drives, switches,

and ports with other blade servers. The benefits of the blade approach include reduced

cabling. With switches and power units shared, precious space is freed up. Initial

performance figures for these blade servers [Horwitz, 04] suggest that they may well

dominate the market place of the future.

Parallel technology grew out of distributed technology and the current trend

appears to be moving back to distributed technology mainly in the area of Grid

technology [Foster, 99]. Many of the theories and tools developed for parallel database

systems can be utilised in the area of distributed systems.

2.3 Parallel platform

This section describes the GoldRush MegaServer parallel platform from ICL. At

the beginning of 1996 GoldRush/Oracle and GoldRush/Informix were the state-of-the-art

shared disk and shared nothing high performance database solutions offered by ICL. GoldRush is the architecture that was used in the Mercury Project [Mercury, 95].

2.3.1 ICL GoldRush MegaServer

The ICL GoldRush MegaServer uses parallel processing technology to support

very large databases in an open client/server environment. The architecture of the system

is presented in detail in [Watson, 95], [Watson, 95B], [Watson, 95C].

The basic GoldRush hardware architecture, shown in Figure 2.2, consists of a set

of up to 64 Processing Elements (PEs) and Communication Elements (CEs), and a single


Management Element (ME), connected together by a high performance network

(DeltaNet).

Figure 2.2 The GoldRush architecture

The PE is the basic building block of the hardware. It runs a UNIX operating

system and database back-end servers. The unit comprises a main PCB with a subsidiary

PCB and a twin HyperSPARC module mounted on it. The internal communication bus of

the PE is a 40 MHz Mbus. The main internal unit is a dual processing unit which consists

of a Processing Unit (PU) and a System Support Unit (SSU). Both of these units include

a 90MHz HyperSPARC chip set. The PU is the main processing engine and the SSU

provides support for message passing between elements. Up to 256 Mbytes of RAM store

is provided in the PE, primarily for use as database cache. Each PE also has two fast and

wide SCSI-2 connections, allowing up to 30 disks to be connected.

The CE is identical to a PE except that it provides external links for GoldRush to

clients via LANs. To do this it has one SCSI-2 interface and two Sbus connectors, into

which two standard FDDI Sbus modules are plugged. The CEs do not run database

servers, but are dedicated to relaying messages transparently between the PEs and FDDI

couplers. It is typical to have 1-2 CEs per 16 elements. The ME is a conventional mid-

range UNIX processor (an ICL DRS6000) which runs GoldRush management software.


The DeltaNet high speed network is the medium whereby all the elements of a

GoldRush system communicate with each other. The basic component of the DeltaNet is

a network switching element. This is a packet switching 8×8-crossbar, which

dynamically provides unidirectional channels from each of its 8 inputs to any of its 8

output ports. The network switching elements are connected in stages allowing up to 64

connections. Packets are 128 bytes long. If two or more packets arrive at the inputs of a

switch simultaneously, and each needs to be routed through the same output, contention

results. The switch has 4 buffers per input, so one of the packets is retained until the

output is free. Each channel data interface is capable of 20 Mbytes per second. The total

DeltaNet throughput is up to 1.2 Gbytes per second.

GoldRush relies entirely on message passing for inter-PE communication. This is

done efficiently through a lightweight communications protocol. It is available through

UNIX interfaces so that software can exploit it. For example, the Informix and Oracle

software use it for communicating when exploiting intra-query parallelism.

For database systems supporting a shared disk architecture, GoldRush provides a

Global Coherent Filestore (GCF). Even though physically each disk is attached to one

element only, each element creates its own portion of the GCF on its own local disks and

then cross-mounts the global file store from all other elements, thus making it accessible

to the database server running on it. The GCF is implemented in the OS kernel and uses

the lightweight communications protocol.

A Distributed Lock Manager (DLM) is implemented to provide

coherency/concurrency control for database servers running within a shared disk system.

It is distributed across all PEs, with an instance running on each one. Each global lock is


managed by one instance of the DLM and all requests for it are sent to that instance by

the PE generating the request. All communication with the DLM is by the lightweight

communications protocol.

GoldRush can be configured as either a shared disk or a shared nothing system. In

a shared disk configuration each PE runs its own instance of the database. The tables are

placed across disks attached to a number of PEs, with the GCF allowing shared access.

Each PE maintains a local database cache in memory and the DLM is used to ensure

coherency. In a shared nothing configuration, again, each PE runs a database server and

the tables are partitioned across the PEs. Each PE only accesses the table fragments

stored locally. Queries are decomposed into ‘fragments’ and directed to PEs which own

relevant data. There is no need for the services of the DLM, as no global coherency needs

to be maintained.

2.4 Data placement

The way in which data is distributed across the processing elements of a parallel

shared-nothing architecture can have a significant effect on the performance of a parallel

DBMS. Data placement strategies provide a mechanical approach to determining a data

distribution which will provide good performance. However, there is considerable

variation in the results produced by different strategies and no simple way of determining

which strategy will provide the best results for any particular database application. A

poor data distribution can lead to higher loads and/or load imbalance and hence higher

cost when processing transactions. This section considers some of the main types of data

placement strategies and describes some of the problems associated with the placement of


data, with particular attention to placing data on shared nothing hardware, such as the

GoldRush MegaServer.

Deciding how to distribute the data is a complex issue. A given data distribution

may be ideal for one type of query but may produce a load imbalance with consequential

reduction in performance for another. The process of deciding how to distribute data

among the different nodes in the system is known as data placement. This involves

breaking up each relation into a number of fragments and allocating each fragment to a

node of the system (or to a particular disk of the node). Various strategies for data

placement have been developed which attempt to achieve improved performance in

different ways. Some focus on the complexity of operations such as joins, others are

based on the accesses made to each fragment of a relation. However, there is no obvious

choice among the different approaches, which would provide the best results in all cases.

The purpose of this section is to provide an understanding of some of the different data

placement strategies and their relative merits.

Various data placement strategies have been developed by researchers to provide

a mechanical approach to the process which will produce data distributions capable of

realising much of the performance potential of shared-nothing database systems. Since

the problem is NP-complete [Sacca, 83], there is no guarantee that even

a feasible solution, let alone an optimal one, can be found in a reasonable amount of time.

Generally, a static data placement strategy can be divided into three phases. They

are:

• Declustering phase in which relations are partitioned into a number of fragments

which will be allocated to the disks of a parallel machine in the placement phase;


• Placement phase in which the fragments obtained from the declustering phase are

allocated to the disks of a parallel machine;

• Re-distribution phase in which data is re-distributed to restore good system

performance after the load has become unbalanced and overall system

performance degraded.

Dynamic data placement is concerned with dynamically re-organising relations

during execution in order to increase the degree of parallelism, reduce response time and

improve throughput. It usually takes the form of generating temporary relations (i.e.

intermediate relations). The placement of base relations resulting from the initial data

placement scheme is not changed.

There is no simple way of determining which data placement strategy would

provide the best results for any particular database.

2.4.1 Overview of data placement strategies

As previously stated, a static data placement strategy can generally be divided

into one of the three phases mentioned. These three phases are not completely

independent as the approach chosen for placement sometimes limits the choice of the

declustering and re-distribution methods.

2.4.1.1 Declustering and Re-distribution

The declustering phase of data placement is concerned with the problem of

partitioning an entire data space into subspaces, which may be overlapping or non-

overlapping. In the case of relational databases, this partitioning can be either horizontal

or vertical. A fragment in a horizontal partitioning consists of a subset of tuples from a


relation, while a fragment in a vertical partitioning consists of a projection of a relation

over a set of attributes. These two types of partitioning can also be combined, which

results in a mixed approach. A discussion on the complexity of operations over a mixed-

partitioned relation is given in [Thanos, 91].

For horizontal partitioning, there are three fundamental types: range, hash and the

self-explanatory round-robin (card dealing) approach.

The range declustering method partitions a relation into fragments by dividing the

domains of some attributes into sections and then assigning each tuple to the fragment whose range covers the tuple's attribute values. The hash declustering method partitions

a relation into fragments according to the hash value obtained by applying a hash

function to the chosen attributes. These can be performed on both single and multiple

attributes. Other partitioning methods include using bit-wise operators [Kim, 88] and

space filling curves [Faloutsos, 93]. However, since hash-join algorithms have been

widely adopted in parallel database systems, the hash method is the most popular as it fits

nicely with the requirements of a hash-join algorithm.
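
A minimal Python sketch of the three horizontal declustering methods is given below; the fragment boundaries and the choice of partitioning attribute are hypothetical and merely illustrate how each method maps a tuple to a fragment.

    def range_partition(tuple_, boundaries):
        # boundaries, e.g. [100, 200], define fragments (-inf,100), [100,200), [200,+inf)
        key = tuple_["key"]
        for i, upper in enumerate(boundaries):
            if key < upper:
                return i
        return len(boundaries)

    def hash_partition(tuple_, n_fragments):
        # fragment chosen by hashing the partitioning attribute
        return hash(tuple_["key"]) % n_fragments

    def round_robin_partition(tuple_index, n_fragments):
        # tuples dealt out to the fragments in turn, like cards
        return tuple_index % n_fragments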

As insertions and deletions are performed on the local data associated with each

PE, the size of the local partition of data may change so that gradually a non-uniform

distribution of data will appear. This causes an unbalanced load on PEs which in turn

degrades overall system performance. In this situation, a re-distribution of data is

necessary to restore good system performance. Although the re-distribution of data can

always be performed with the original declustering and placement strategies, it may

require the movement of large amounts of data in complex ways resulting in a cost which

is even higher than that of processing data under the skewed circumstances. While this


data movement could be done in the background, throughput would be severely degraded

in a heavily-loaded system. In such cases re-distribution should be infrequent and should

involve as little data movement as possible.

Various placement strategies have been developed for parallel database systems.

These can be classified into three categories according to the criteria used in reducing

costs incurred on resources such as network bandwidth, CPUs and disks, namely:

• size based [Hua, 90], [Hua, 92]

• access frequency based (although some methods can be considered as combined

strategies [Copeland, 88])

• network traffic based [Apers, 88]

The main idea behind these approaches is to achieve the minimal load (e.g.

network traffic) or a balance of load (e.g. size, I/O access) across the system.

2.4.1.2 Size based strategies

Size based strategies were developed for systems utilising intra-operator

parallelism, where the CPU load of each PE depends on the amount of data being

processed. Since the declustering methods are built around the ideas of hashing and

sorting of relations, relation operations applied to the partitioning attribute can be

performed very efficiently. However, queries that involve non-partitioning attributes are

naturally inefficient.

Hua et al [Hua, 90] proposed a size based data placement scheme which balances

the loads of nodes during query processing. Each relation is partitioned into a large

number of cells using a grid file structure. These cells are then assigned to the PEs using

a greedy algorithm so as to balance the data load for each element. This algorithm is


executed by putting all cells into a list, ordered by their size, with the largest cell first,

and then allocating the first cell in the list to the PE which currently has the largest disk

space available. The cell is then removed from the list. This process is repeated until the

cell list becomes empty. This initial placement scheme was also used in [Hua, 92], where

the concept of a cell was generalised to any logical unit of data partitioning and

movement, i.e., all records belonging to the same cell are stored on the same PE.
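
A sketch of the greedy cell-allocation step as it is described above (the grid-file declustering itself is omitted) might look as follows in Python; the cell sizes and capacities are hypothetical.

    def greedy_size_placement(cell_sizes, n_pes, pe_capacity):
        # Assign cells to PEs, largest cell first, always to the PE with the
        # most free disk space remaining (a sketch of the idea in [Hua, 90]).
        free = [pe_capacity] * n_pes
        placement = {}
        for cell, size in sorted(cell_sizes.items(), key=lambda kv: -kv[1]):
            pe = max(range(n_pes), key=lambda p: free[p])   # most free space
            placement[cell] = pe
            free[pe] -= size
        return placement

    # hypothetical cell sizes (in pages) spread over two PEs
    print(greedy_size_placement({"c1": 500, "c2": 300, "c3": 200, "c4": 100},
                                n_pes=2, pe_capacity=1000))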

2.4.1.3 Access frequency based strategies

Access frequency based strategies aim to balance the load of disks across PEs

since disks are most likely to be the bottlenecks or hotspots of the system. Sometimes, the

relation access frequency is taken into account together with the relation size due to the

fact that some parts of the relations may be RAM resident and do not require disk access.

This has resulted in some size and access frequency based strategies.

The Bubba [Copeland, 88] method defines frequency of access to a relation as the

heat of the relation and the quotient of a relation's heat divided by its size as the

temperature of the relation. Heat is used in determining the number of nodes across

which to spread the relations. Temperature is used in determining whether a relation

should be RAM resident. The strategy then assigns fragments of relations to nodes taking

account of each node's temperature and heat. It puts relations in a list ordered by their

temperatures and assigns the fragments of the relations in the list to PEs using a greedy

algorithm to ensure that the heat of each PE is roughly equivalent to the others. The heat

of each relation can be estimated for initial placement and tracked for re-distribution.
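
The Python sketch below illustrates the heat and temperature notions and the greedy assignment so that the accumulated heat of each PE stays roughly balanced; the access frequencies and sizes are hypothetical, whole relations stand in for fragments, and the real Bubba strategy [Copeland, 88] additionally uses temperature to decide RAM residency.

    def bubba_placement(relations, n_pes):
        # relations: {name: (heat, size)}, heat being the access frequency.
        # Relations are taken hottest-per-byte first and assigned greedily
        # to the coolest PE so far (a sketch of the idea in [Copeland, 88]).
        pe_heat = [0.0] * n_pes
        placement = {}
        ordered = sorted(relations.items(),
                         key=lambda kv: -(kv[1][0] / kv[1][1]))   # temperature order
        for name, (heat, size) in ordered:
            pe = min(range(n_pes), key=lambda p: pe_heat[p])      # coolest PE
            placement[name] = pe
            pe_heat[pe] += heat
        return placement, pe_heat

    placement, heats = bubba_placement(
        {"orders": (900.0, 300), "stock": (600.0, 600), "history": (100.0, 800)},
        n_pes=2)
    print(placement, heats)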

There are several disk allocation algorithms developed for multi-disk systems

[Scheuermann, 94], which can be classified as access frequency based. The disk cooling


method [Scheuermann, 94] uses a greedy procedure which involves both placement and

re-distribution phases. It tries to achieve an I/O load balance while minimizing the

amount of data that is moved across disks. As the disk cooling procedure was

implemented as a background daemon which is invoked at fixed time intervals to

restore the balance of I/O load across different disks, the heat of each data block can be

tracked dynamically, based on a moving average of the inter-arrival time of requests to

the same block. Alternatively, it can also be tracked by keeping a count of the number of

requests to a block within a fixed timeframe. The problem with these heat tracking

methods is that for application sizing, it would be very difficult to track block heat

dynamically for initial data placement, as there is no ready system running for the

tracking exercise.
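
As a rough illustration of the dynamic heat tracking described above, the sketch below estimates a block's heat from an exponentially smoothed moving average of the inter-arrival time of requests to it; the smoothing factor is hypothetical and this is not the exact formulation used in [Scheuermann, 94].

    class BlockHeatTracker:
        # Heat is taken as the reciprocal of a smoothed average of the
        # inter-arrival time of requests to a block: frequent requests
        # mean a small gap and therefore a high heat.
        def __init__(self, alpha=0.3):
            self.alpha = alpha         # hypothetical smoothing factor
            self.last_seen = {}        # block -> time of previous request
            self.avg_gap = {}          # block -> smoothed inter-arrival time

        def record_request(self, block, now):
            if block in self.last_seen:
                gap = now - self.last_seen[block]
                old = self.avg_gap.get(block, gap)
                self.avg_gap[block] = (1 - self.alpha) * old + self.alpha * gap
            self.last_seen[block] = now

        def heat(self, block):
            gap = self.avg_gap.get(block)
            return 0.0 if not gap else 1.0 / gap

    tracker = BlockHeatTracker()
    for t in [0.0, 1.0, 2.0, 2.5, 3.0]:
        tracker.record_request("block-17", t)
    print(tracker.heat("block-17"))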

2.4.1.4 Network traffic based strategies

Network traffic based strategies aim to minimise the delay incurred by network

communication in transferring data between two nodes. They were originally developed

in a distributed database environment where the PEs are remote from each other and

running independently but connected together by a network.

Apers [Apers, 88] presented an algorithm based on a heuristic approach coupled

with a greedy algorithm. In this algorithm, a network is constructed with the fragments of

a relation as its nodes. Each edge represents the cost of data transmission between two

nodes. Each pair of nodes is then evaluated on the basis of the transmission cost of the

allocation. The pair of nodes with the highest cost is selected and united. This process is

repeated until the network is unified with the actual network of machines. This algorithm

was later extended by Sacca and Wiederhold [Sacca, 83] for shared-nothing parallel


database systems, in which a first-fit bin-packing approach was used to allocate selected

fragments.
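
A very simplified Python sketch of this greedy idea is shown below: groups of fragments are repeatedly united along the edge with the highest transmission cost, so that the heaviest traffic ends up local to a node, until only as many groups remain as there are machines. The costs are hypothetical, and the real algorithm also accounts for the capacities of the nodes.

    def apers_grouping(fragments, edge_costs, n_machines):
        # edge_costs: {(frag_a, frag_b): transmission cost between the fragments}.
        # Unite the pair of groups joined by the costliest edge until only
        # n_machines groups remain (a sketch of the idea in [Apers, 88]).
        groups = {f: {f} for f in fragments}          # each fragment starts alone
        while len(groups) > n_machines:
            best, best_cost = None, -1.0
            for (a, b), cost in edge_costs.items():
                ga = next(k for k, g in groups.items() if a in g)
                gb = next(k for k, g in groups.items() if b in g)
                if ga != gb and cost > best_cost:
                    best, best_cost = (ga, gb), cost
            if best is None:                          # no more edges to merge on
                break
            ga, gb = best
            groups[ga] |= groups.pop(gb)              # unite the two groups
        return list(groups.values())

    print(apers_grouping(["f1", "f2", "f3", "f4"],
                         {("f1", "f2"): 9.0, ("f2", "f3"): 4.0,
                          ("f3", "f4"): 7.0, ("f1", "f4"): 1.0},
                         n_machines=2))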

Ibiza-Espiga and Williams [Ibiza, 92] developed a variant of Apers' placement

method. It pre-assigns the clusters of relations (each cluster contains relations which need

to be co-located due to hash join operations) to groups of PEs, taking into account the

available space of the PEs, so that each cluster is mapped to a set of PEs. The idea is to

find an appropriate balance among the work load of the PEs and an equitable distribution

of the space. Then the units within each cluster are assigned to the set of PEs associated

with the cluster using Apers' method.

In [Williams, 98] a study of a number of placement strategies was carried out on

the original version of STEADY. To keep the study manageable it was only carried out at

the PE level. The distribution of relation fragments amongst multiple disks attached to

each PE was assumed to be round-robin. One of the unsurprising outcomes of the study

was that the way in which data is distributed across the PEs of a parallel shared-nothing

architecture can have a significant effect on the performance of a parallel DBMS. Data

placement strategies provide a mechanical approach to distributing data amongst the PEs

although there is considerable variation in the results produced by different strategies. In

the study, five data placement strategies were considered in detail. These were

representative of the three general categories, size based, access frequency based and

network traffic based. The five different strategies were applied to the transaction

processing benchmark TPC-C [Gray, 93] where the number of processing elements was

varied, and the database size was varied. One conclusion of the study was that the access

frequency based and size based strategies used, provided comparable performance for


TPC-C. The traditional network traffic based strategy provided almost identical

performance to that of the size based strategies, which was not a surprise as it utilises size

as a measure for network traffic. In middle to heavy weight read-write oriented

applications, the disks generally became the system bottlenecks. The cost difference

between a remote disk access and a local disk access was mainly related to network

communication. Hence improvements made to other factors such as reduction of network

traffic and CPU cost did not enhance the performance significantly. Another network

traffic based strategy, the distributed-database strategy, scaled up well when the number

of units in relation clusters was larger than the number of processing elements

participating in database activities. The degree of the scalability of this strategy was

sensitive to the comparison between the fragmentation degree of relations and the number

of processing elements. As the number of PEs was increased or the database size

decreased, the performance obtained from data distributions generated using the size

based and access frequency based strategies generally scaled up. However, there were

peaks and drops at particular numbers of PEs, depending on the maximum number of

most heavyweight fragments assigned to a PE.

Although this study was carried out on placement at the PE level, data placement

at the disk level is also important to system performance, as an individual disk on a disk

stack attached to a PE could become a bottleneck due to poor placement of data at this

level. The design of multi-disk structures provides room for parallel I/O accesses. This

complicates still further the process of determining the best data distribution for a

particular database application, and reinforces the need for tools to experiment with data

placement strategies.


2.5 Database performance benchmarks

Database performance benchmarks [Gray, 93] are usually classified into two

types. Transaction processing benchmarks are designed to test OLTP (on-line transaction

processing) operations, typical of applications whose database operations are pre-defined

and grouped in transactions. Decision support benchmarks are designed to test ad hoc

querying on specially constructed databases. The two main benchmarks used in the

calibration and validation of STEADY were AS3AP and TPC-C.

AS3AP is a scalable, portable ANSI SQL relational database benchmark. This

benchmark provides a comprehensive set of tests for database processing power; it has built-in scalability and portability so that it can test a broad range of systems; it minimizes human effort in implementing and running benchmark tests; and it provides a uniform metric for a straightforward interpretation of benchmark results.

TPC-C is an online transaction processing (OLTP) benchmark. This benchmark

involves a mix of five concurrent transactions of different types. The database comprises nine tables having a wide range of record and population sizes.

This benchmark is measured in transactions per minute.

2.5.1 AS3AP benchmark

The AS3AP benchmark was designed to:

• provide a comprehensive but tractable set of tests for database processing

power.

• have built in scalability and portability, so that it can be used to test a

broad range of systems.


• minimize human effort in implementing and running the benchmark tests.

• provide a uniform metric, the equivalent database ratio, for a

straightforward and non-ambiguous interpretation of the benchmark

results.

For a particular DBMS, the AS3AP benchmark determines an equivalent database

size, which is the maximum size of the AS3AP database for which the system is able to

perform the designated AS3AP set of tests in under 12 hours. The equivalent database

size is an absolute performance metric by itself. It can also provide a basis for comparing

cost and performance of systems, as follows: the cost per megabyte of a DBMS is the

total cost of the DBMS divided by the equivalent database size. The equivalent database

ratio for two systems is the ratio of their equivalent database sizes. Both the cost per

megabyte and the equivalent database ratio provide global comparison metrics.
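
For illustration, using hypothetical figures: if a system costing £40,000 achieves an equivalent database size of 2,000 MB, its cost per megabyte is 40,000 / 2,000 = £20 per MB; if a second system achieves an equivalent database size of 1,000 MB, then the equivalent database ratio of the first system to the second is 2,000 / 1,000 = 2.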

The AS3AP database contains five relations. One, the tiny relation, is a one tuple,

one column relation, used only to measure overhead. The other four relations are named

as follows:

• uniques. A relation where all attributes have unique values.

• hundred. A relation where most of the attributes have exactly 100 unique

values, and are correlated. This relation provides absolute selectivities of

100, and projections producing exactly 100 multi-attribute tuples.

• tenpct. A relation where most of the attributes have 10% unique values.

This relation provides relative selectivities of 10%.

• updates. A relation customized for updates. Different distributions are

used and three types of indices are built on this relation.


The four relations have the same ten attributes (columns) with the same names

and types. The attributes cover the range of commonly used data types: signed and

unsigned integer, floating point, exact decimal, alphanumeric, fixed and variable length

strings, and an 8 character date type. As an example, the uniques relation consists of:

key INTEGER (4)

int INTEGER (4)

signed INTEGER (4)

float REAL (4)

double DOUBLE (8)

decim NUMERIC(18,2)

date DATETIME (8)

code CHAR (10)

name CHAR(20)

address VARCHAR(20)

2.5.2 TPC-C benchmark

Approved in July of 1992, the TPC Benchmark C is an on-line transaction

processing (OLTP) benchmark. TPC-C is more complex than previous OLTP

benchmarks such as TPC-A because of its multiple transaction types, more complex

database and overall execution structure.

TPC-C simulates a complete computing environment where a population of users

executes transactions against a database. The benchmark is centred on the principal

activities (transactions) of an order-entry environment. These transactions include

entering and delivering orders, recording payments, checking the status of orders, and


monitoring the level of stock at the warehouses. While the benchmark portrays the

activity of a wholesale supplier, TPC-C is not limited to the activity of any particular

business segment, but represents any industry that must manage, sell, or distribute a

product or service.

In the business model, a wholesale parts supplier operates out of a number of

warehouses and their associated sales districts. The TPC benchmark is designed to scale

just as the Company expands and new warehouses are created. However, certain

consistent requirements must be maintained as the benchmark is scaled. Each warehouse

in the TPC-C model must supply ten sales districts, and each district serves three

thousand customers. At any time an operator from a sales district can select any one of

the five operations or transactions offered by the Company's order-entry system.

The most frequent transaction consists of entering a new order which, on average,

consists of ten different items. Each warehouse tries to maintain stock for the 100,000

items in the Company's catalogue and fill orders from that stock. However, in reality, one

warehouse will probably not have all the parts required to fill every order. Therefore,

TPC-C requires that close to ten percent of all orders must be supplied by another

warehouse of the Company. Another frequent transaction consists in recording a payment

received from a customer. Less frequently, operators will request the status of a

previously placed order, process a batch of ten orders for delivery, or query the system

for potential supply shortages by examining the level of stock at the local warehouse. A

total of five types of transactions are used to model this business activity. The

performance metric reported by TPC-C measures the number of orders that can be fully

processed per minute and is expressed in tpm-C.


Each warehouse has ten terminals and all five transactions are available at each

terminal. A remote terminal emulator (RTE) is used to maintain the required mix of

transactions over the performance measurement period. This mix represents the complete

business processing of an order as it is entered, paid for, checked, and delivered. More

specifically, the required mix is defined to produce an equal number of New-Order and

Payment transactions and to produce one Delivery transaction, one Order-Status

transaction, and one Stock-Level transaction for every ten New-Order transactions. The

tpm-C metric is the number of New-Order transactions executed per minute.
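
To make the mix concrete (an illustrative calculation based on the mix just described, not a quotation from the benchmark specification): for every 10 New-Order transactions the mix contains 10 Payment, 1 Delivery, 1 Order-Status and 1 Stock-Level transaction, i.e. 23 transactions in total, of which New-Orders make up roughly 43%. A system completing, say, 2,300 transactions per minute under this mix would therefore report approximately 1,000 tpm-C, since only the New-Order transactions are counted.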

The nine relations involved are:

• Warehouse

• District

• Stock

• Items

• Customer

• Order

• New-Order

• Orderline

• History

The tables cover a range of cardinalities, from a few rows up to millions of rows.

Relationships between table sizes must be maintained according to the scalability requirements of the benchmark, which are governed by the number of warehouses.
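
As an indication only (the exact population rules are defined in the TPC-C specification and are simplified here), the initial cardinalities grow roughly linearly with the number of warehouses, with the Items catalogue as the notable fixed-size exception, as the Python sketch below suggests.

    def approximate_tpcc_cardinalities(warehouses):
        # Rough initial table populations as a function of the number of
        # warehouses (indicative figures only; see the TPC-C specification).
        w = warehouses
        return {
            "Warehouse": w,
            "District": 10 * w,          # ten districts per warehouse
            "Customer": 30_000 * w,      # 3,000 customers per district
            "Stock": 100_000 * w,        # one stock row per item per warehouse
            "Items": 100_000,            # fixed catalogue size
            "Order": 30_000 * w,         # roughly one initial order per customer
            "Orderline": 300_000 * w,    # about ten lines per order
            "New-Order": 9_000 * w,      # most recent orders awaiting delivery
            "History": 30_000 * w,       # one payment history row per customer
        }

    print(approximate_tpcc_cardinalities(10))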


2.6 Parallel database system terminology

This section introduces some issues and terminology pertaining to parallel

database systems that will be used throughout the rest of the thesis.

In 1970, Codd [Codd, 70] introduced the relational database model, which has

since seen widespread use in database machines and software systems. A relation can be

thought of as a table of data with rows called tuples and columns called attributes.

Attribute values are members of sets of values called domains. The set of attributes over

which all tuples of a relation are uniquely defined is called the primary key.

One of the industry standard interfaces for a relational database management

system (RDBMS) is the Structured Query Language (SQL), a non-procedural language

designed specifically for relational databases.

2.6.1 Query operations

The selection operation is the most basic relational database query operation:

given a relation, it retrieves those tuples with attributes that match certain conditions. For

example, a selection could retrieve all of those tuples from a relation where one of its

attributes is equal to a given value, e.g. in SQL:

SELECT * FROM relation WHERE attribute = value

Selection operations are critical to database machine performance, as most

relational database operations involve one or more selections. Selection implementation

methods include full relation search (usually in parallel when in hardware), indexing

methods, and hashing. It is important to note that the method a database machine uses to


implement selection operations dramatically affects the implementation and performance

of other operations (such as join).

The projection operation is similar to a selection but only retrieves a subset of

attributes from matching tuples. Because the subset of attributes may not be unique over

all tuples, it is sometimes necessary to remove duplicate results.

The join operation combines related tuples from two relations into single tuples,

on a specific set of attributes. The join operation represents a large part of the usefulness

of the relational model, and is consequently very important.

2.6.2 Complex query operations

Relational queries are often complicated and may involve multiple selections,

joins and projections. The series of operations involved is known as a query plan, and is

usually expressed in tree form and optimized by a relational database management

system prior to execution.

2.6.3 Parallel query execution

Within a relational database engine, a query is represented as an operator tree,

which is derived from the query execution plan. Examples of operators are scan (part of

select operation), build and probe (parts of a join operation implemented using a hash-

join algorithm), and merge (part of sort-merge-based join operation). Each operator

works on one or more streams of tuples and produces a new stream. So, operators can be

arranged in parallel dataflow graphs.
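
As a small illustration of the build and probe operators mentioned above, the Python sketch below implements a simple in-memory hash join: build consumes one input stream to construct a hash table on the join attribute, and probe streams the other input past that table, emitting the joined tuples. The relation and attribute names are hypothetical.

    def build(tuples, key):
        # Build phase: hash one input relation on the join attribute.
        table = {}
        for t in tuples:
            table.setdefault(t[key], []).append(t)
        return table

    def probe(tuples, key, table):
        # Probe phase: stream the other relation past the hash table,
        # emitting joined tuples as matches are found.
        for t in tuples:
            for match in table.get(t[key], []):
                yield {**match, **t}

    orders = [{"custkey": 1, "order": "A"}, {"custkey": 2, "order": "B"}]
    customers = [{"custkey": 1, "name": "Smith"}, {"custkey": 3, "name": "Jones"}]
    hash_table = build(customers, "custkey")
    print(list(probe(orders, "custkey", hash_table)))
    # [{'custkey': 1, 'name': 'Smith', 'order': 'A'}]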

With dataflow graphs of this type, there are two main types of parallelism, namely

inter-query parallelism and intra-query parallelism. In inter-query parallelism, multiple


server processes are used simultaneously to work on the operator trees of different

queries. Intra-query parallelism has three types:

• Independent parallelism - if neither of two operators use data produced by the

other, both can run in parallel on separate processors.

• Inter-operator or pipeline parallelism - refers to the running of a number of

operators in a pipeline, with the output stream of tuples from one operator

forming the input of the next one in the pipeline. Operators such as aggregate (sum, count, etc.) and sort cannot be pipelined as these operators do not emit

their first output until they have consumed the whole of their input.

• Intra-operator or partitioned parallelism - several processes are assigned to

work together on a single operator of the tree. This partitioned execution can

be achieved either by designing and implementing parallel algorithms for each

operator, or by parallelising the data instead [Dewitt, 90], which is simpler,

since the use of existing routines for executing operators is preserved, and is

achieved by inserting merge and split operators in suitable places within the

dataflow graph.

The merge operator combines several data streams into a single sequential stream

to be fed into the next operator. In this way, an operator working on a stream of tuples

can be executed as a number of sub-processes executing on independent processors, each

working on a subset of the tuples from the stream. The output of each operator is fed into

a merge operator, where the streams are combined and passed on.

The split operator does the opposite: it partitions or replicates a single data stream

into several independent streams, so that multiple processes may operate on the stream in


parallel. The partition of the data can be round robin or hash, or can be performed by an

arbitrary program [Graefe, 90], [Graefe, 93].
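
A minimal Python sketch of split and merge operators follows: split partitions a tuple stream by hashing a chosen attribute so that several operator instances can work on it in parallel, while merge simply recombines the resulting streams into one sequential stream. The partitioning attribute and the number of streams are illustrative.

    def split(stream, key, n_streams):
        # Partition one tuple stream into n independent streams by hashing
        # the chosen attribute (round-robin partitioning would be similar).
        outputs = [[] for _ in range(n_streams)]
        for t in stream:
            outputs[hash(t[key]) % n_streams].append(t)
        return outputs

    def merge(streams):
        # Combine several streams back into a single sequential stream.
        for s in streams:
            for t in s:
                yield t

    tuples = [{"key": k} for k in range(8)]
    partitions = split(tuples, "key", n_streams=3)   # e.g. feed 3 parallel scans
    print([len(p) for p in partitions])
    print(len(list(merge(partitions))))              # all 8 tuples reappear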

2.6.4 Query optimisation and scheduling

Query optimisation refers to the process of selecting an optimal execution plan for

a query from a set of feasible plans, based on some objective such as minimising the

query response time. One part of this process is to determine a suitable order in which to

perform operations. For example, in a query consisting of a number of join operations, it

may be beneficial to perform highly selective joins first, thus eliminating a large number

of tuples early and reducing the size of intermediate relations. Another part of the

optimisation process is concerned with choosing what algorithms to employ for the

relational operations within the plan. A join may be performed using a hash-based,

nested-loop based, or merge-sort based method amongst others.

2.7 Summary

This chapter has reviewed the field of parallel relational database systems and

introduced some of the concepts and terminology which are used throughout the rest of

the thesis. The common hardware architectures underlying modern parallel database

systems were described and compared. In this thesis the parallel machine that is modelled is a GoldRush MegaServer, which has a shared nothing architecture. Although no single database architecture is better than the others in all respects, the shared nothing architecture is a reasonable choice for hosting parallel database management systems, particularly for evenly distributed, ‘delightful’, systems. Stonebraker [Stonebraker, 86] states that ‘In scalable, tunable, nearly delightful data bases, shared nothing systems will have no apparent disadvantages compared to the other alternatives. Hence the shared nothing architecture adequately addresses the common case.’

The notions of data partitioning and placement were reviewed. Two standard

benchmarks for database systems that were used throughout the work in this thesis were

also described.