
Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey

Parallel Database Systems 101

Jim Gray & Gordon Bell, Microsoft Corporation

Presented at VLDB 95, Zurich, Switzerland, Sept 1995

• Detailed notes available from [email protected]
– this presentation is 120 of the 174 slides (time limit)
– Notes in PowerPoint7 and Word7

Outline

• Why Parallelism:
– technology push
– application pull
• Benchmark Buyer's Guide
– metrics
– simple tests
• Parallel Database Techniques
– partitioned data
– partitioned and pipelined execution
– parallel relational operators
• Parallel Database Systems
– Teradata, Tandem, Oracle, Informix, Sybase, DB2, RedBrick

Kinds Of Information Processing

               Point-to-Point         Broadcast
Immediate      conversation, money    lecture, concert      (Network)
Time Shifted   mail                   book, newspaper       (DataBase)

• It's ALL going electronic
• Immediate is being stored for analysis (so ALL database)
• Analysis & Automatic Processing are being added

Why Put Everything in Cyberspace?

• Low rent: min $/byte
• Shrinks time: now or later
• Shrinks space: here or there
• Automate processing: knowbots

[Figure: Point-to-Point OR Broadcast on one axis, Immediate OR Time Delayed on the other; Network and DataBase; Locate, Process, Analyze, Summarize]

Databases: Information At Your Fingertips™
Information Network™, Knowledge Navigator™

• All information will be in an online database (somewhere)
• You might record everything you
– read: 10MB/day, 400 GB/lifetime (two tapes)
– hear: 400MB/day, 16 TB/lifetime (a tape per decade)
– see: 1MB/s, 40GB/day, 1.6 PB/lifetime (maybe someday)
• Data storage, organization, and analysis is a challenge.
• That is what databases are about
• DBs do a good job on "records"
• Now working on text, spatial, image, and sound.

Databases Store ALL Data Types

• The Old World:
– Millions of objects
– 100-byte objects
• The New World:
– Billions of objects
– Big objects (1MB)
– Objects have behavior (methods)

[Figure: old-world People table with columns Name and Address (Mike, Won, David; NY, Berk, Austin) next to a new-world People table with columns Name, Address, Papers, Picture, Voice]

• Paperless office
• Library of Congress online
• All information online: entertainment, publishing, business
• Information Network, Knowledge Navigator, Information at your fingertips

Magnetic Storage Cheaper than Paper

• File Cabinet:
– cabinet (4 drawer): 250$
– paper (24,000 sheets): 250$
– space (2x3 @ 10$/ft2): 180$
– total: 700$, or 3 ¢/sheet
• Disk:
– disk (8 GB): 2,000$
– ASCII: 4 m pages, 0.05 ¢/sheet (60x cheaper)
– Image: 200 k pages, 1 ¢/sheet (3x cheaper than paper)
• Store everything on disk
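
The per-sheet figures above can be rechecked with a few lines of arithmetic. This is only a sketch: the prices and page counts are the slide's 1995 numbers, and the variable and function names are illustrative, not from the talk.

```python
# Prices and page counts are the slide's 1995 figures; names are illustrative.
cabinet_dollars = 250 + 250 + 180   # cabinet + paper + floor space (the slide rounds to 700$)
sheets = 24_000
disk_dollars = 2_000                # one 8 GB disk
ascii_pages = 4_000_000             # "4 m pages" of ASCII text
image_pages = 200_000               # "200 k pages" stored as images


def cents_per_sheet(dollars, pages):
    """Storage cost of one page, in cents."""
    return 100 * dollars / pages


paper = cents_per_sheet(cabinet_dollars, sheets)
ascii_disk = cents_per_sheet(disk_dollars, ascii_pages)
image_disk = cents_per_sheet(disk_dollars, image_pages)

print(f"paper:       {paper:.2f} cents/sheet")
print(f"disk, ASCII: {ascii_disk:.2f} cents/sheet ({paper / ascii_disk:.0f}x cheaper)")
print(f"disk, image: {image_disk:.2f} cents/sheet ({paper / image_disk:.1f}x cheaper)")
```

Using the unrounded 680$ total, paper comes out a little under 3 ¢/sheet, so the ratios land slightly below the slide's rounded 60x and 3x.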

Moore's Law

[Figure: 1 chip memory size (2 MB to 32 MB), 1970 to 2000. Chip generations in bits: 1K, 4K, 16K, 64K, 256K, 1M, 4M, 16M, 64M, 256M; capacity axis marked 8KB, 128KB, 1MB, 8MB, 128MB, 1GB]

• XXX doubles every 18 months (60% increase per year)
– Micro Processor speeds
– chip density
– Magnetic disk density
– Communications bandwidth (WAN bandwidth approaching LANs)
• Exponential Growth:
– The past does not matter
– 10x here, 10x there, soon you're talking REAL change.
• PC costs decline faster than any other platform
– Volume & learning curves
– PCs will be the building bricks of all future systems
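
The link between "doubles every 18 months" and "60% increase per year" is a one-line exponent. A minimal sketch (the function name is illustrative):

```python
def growth_factor(years, doubling_months=18.0):
    """Multiplicative growth after `years` if a quantity doubles every `doubling_months`."""
    return 2.0 ** (12.0 * years / doubling_months)


per_year = growth_factor(1)      # close to 1.59, i.e. roughly 60% per year
per_decade = growth_factor(10)   # close to 100x

print(f"{(per_year - 1) * 100:.0f}% per year, {per_decade:.0f}x per decade")
```

The decade figure is where "10x here, 10x there" comes from: an 18-month doubling time compounds to about two orders of magnitude in ten years.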

In The Limit: The Pico Processor

• 1 M SPECmarks, 1 TFLOP
• 10^6 clocks to bulk ram
• Event-horizon on chip.
• VM reincarnated
• Multi-program cache
• On-Chip SMP
• Terror Bytes!

[Figure: Pico Processor (1 mm^3) and its storage hierarchy: 10 pico-second ram, 10 nano-second ram, 10 microsecond ram, 10 millisecond disc, 10 second tape archive; capacities shown: 1 MB, 100 MB, 10 GB, 1 TB, 100 TB]

What's a Terabyte? (250 K$ of Disk @ .25$/MB)

1 Terabyte:
• 1,000,000,000 business letters (150 miles of bookshelf)
• 100,000,000 book pages (15 miles of bookshelf)
• 50,000,000 FAX images (7 miles of bookshelf)
• 10,000,000 TV pictures (mpeg) (10 days of video)
• 4,000 LandSat images

Library of Congress (in ASCII) is 25 TB

• 1980:
– 200 M$ of disc: 10,000 discs
– 5 M$ of tape silo: 10,000 tapes
• 1995:
– 250 K$ of magnetic disc: 70 discs
– 500 K$ of optical disc robot: 250 platters
– 50 K$ of tape silo: 50 tapes

Terror Byte !!

Summary (of storage)

• Capacity and cost are improving fast (100x per decade)
• Accesses are getting larger (MOX, GOX, SCANS)
• BUT latencies and bandwidth are not improving much (3x per decade)
• How to deal with this???
• Bandwidth:
– Use partitioned parallel access (disk & tape farms)
• Latency:
– Pipeline data up storage hierarchy (next section)

Interesting Storage Ratios

• Disk is back to 100x cheaper than RAM
• Nearline tape is only 10x cheaper than disk
– and the gap is closing!

[Figure: price ratios, 1960 to 2000, on a scale from 1:1 to 100:1: RAM $/MB : Disk $/MB (30:1 marked) and Disk $/MB : Nearline Tape]

• Why bother with Tape???
• Disk & DRAM look good

Performance = Storage Accesses, not Instructions Executed

• In the "old days" we counted instructions and IO's
• Now we count memory references
• Processors wait most of the time

[Figure: Where the time goes: clock ticks used by AlphaSort components: Sort, Disc Wait, OS, Memory Wait, D-Cache Miss, I-Cache Miss, B-Cache Data Miss]

70 MIPS. "Real" apps have worse I-cache misses, so they run at 60 MIPS if well tuned, 20 MIPS if not.

Storage Latency: How Far Away is the Data?

Clock Ticks   Storage level           Where             How long
1             Registers               My Head           1 min
2             On Chip Cache           This Room
10            On Board Cache          This Campus       10 min
100           Memory                  Sacramento        1.5 hr
10^6          Disk                    Pluto             2 Years
10^9          Tape / Optical Robot    Andromeda         2,000 Years
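
The analogy works because one clock tick is read as one minute of travel. The sketch below reproduces the "how long" figures under that assumption, reading the slide's figure as pairing Disk with 10^6 ticks and Tape / Optical Robot with 10^9 ticks; the names are illustrative.

```python
MINUTES_PER_TICK = 1  # assumption: the slide's scale, where 1 clock tick reads as "1 min"

LEVELS = {
    "Registers": 1,
    "On Chip Cache": 2,
    "On Board Cache": 10,
    "Memory": 100,
    "Disk": 10**6,
    "Tape / Optical Robot": 10**9,
}


def human_scale(ticks):
    """Express a latency in clock ticks as travel time on the slide's scale."""
    minutes = ticks * MINUTES_PER_TICK
    if minutes < 60:
        return f"{minutes:g} min"
    hours = minutes / 60
    if hours < 24:
        return f"{hours:.1f} hr"
    days = hours / 24
    if days < 365:
        return f"{days:.0f} days"
    return f"{days / 365:,.0f} years"


for level, ticks in LEVELS.items():
    print(f"{level:22s} {ticks:>13,} ticks  {human_scale(ticks)}")
```

The computed values are close to, not identical with, the slide's rounded labels (for example, 100 minutes is shown as 1.5 hr).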

Network Speeds

• Network speeds grow 60% / year
• WAN speeds limited by politics
– if voice is X$/minute, how much is video?
• Switched 100Mb Ethernet
– 1,000x more bandwidth
• ATM is a scaleable net:
– 1 Gb/s to desktop & wall plug
– commodity: same for LAN, WAN
• 1 Tb/s fibers in laboratory

[Figure: Comm Speedups by Year, 1960 to 2000, log scale from 1e3 to 1e9: Processors (i/s) and LANs & WANs (b/s)]

Network Trends & Challenge

• Bandwidth UP 10^4, Price DOWN
• Speed-of-light unchanged
• Software got worse
• Standard Fast Nets
» ATM
» PCI
» Myrinet
» Tnet
• HOPE:
– Commodity Net
– Good software
• Then clusters become a SNAP!
• commodity: 10k$/slice

[Figure: POTS, WAN, LAN, CAN, and PC Bus bandwidth, 1965 to 2000, log scale from 10^2 to 10^10 with 1 Kb/s, 1 Mb/s, and 1 Gb/s marked]
[Figure: WAN Data Rates (fiber) by Year, 1970 to 2000, log scale from 0.01 to 1000]

The Seven Price Tiers

• 10$: wrist watch computers
• 100$: pocket / palm computers
• 1,000$: portable computers
• 10,000$: personal computers (desktop)
• 100,000$: departmental computers (closet)
• 1,000,000$: site computers (glass house)
• 10,000,000$: regional computers (glass castle)

SuperServer: Costs more than 100,000 $
"Mainframe": Costs more than 1M$
Must be an array of processors, disks, tapes, comm ports

The New Computer Industry

• Horizontal integration is new structure
• Each layer picks best from lower layer.
• Desktop (C/S) market
– 1991: 50%
– 1995: 75%

Function           Example
Operation          AT&T
Integration        EDS
Applications       SAP
Middleware         Oracle
Baseware           Microsoft
Systems            Compaq
Silicon & Oxide    Intel & Seagate

Software Economics: Bill's Law

• Bill Joy's law (Sun): Don't write software for less than 100,000 platforms.
– @10M$ engineering expense, 1,000$ price
• Bill Gates's law: Don't write software for less than 1,000,000 platforms.
– @10M$ engineering expense, 100$ price
• Examples:
– UNIX vs NT: 3,500$ vs 500$
– UNIX-Oracle vs SQL-Server: 100,000$ vs 1,000$
– No Spreadsheet or Presentation pack on UNIX/VMS/...
• Commoditization of base Software & Hardware

Thesis: Many Little will Win over Few Big

[Figure: price tiers 1 M$, 100 K$, 10 K$ for Mainframe, Mini, Micro, Nano; disk form factors 14", 9", 5.25", 3.5", 2.5", 1.8"]

Year 2000 4B Machine

• The Year 2000 commodity PC (3K$)
• Billion Instructions/Sec
• Billion Bytes RAM
• Billion Bits/s Net
• 10 B Bytes Disk
• Billion Pixel display
– 3000 x 3000 x 24 pixel

[Figure: 1 Bips Processor, .1 B byte RAM, 10 B byte Disk, 1 B bits/sec LAN/WAN]

4 B PC's: The Bricks of Cyberspace

• Cost 3,000 $
• Come with
– OS (NT, POSIX,..)
– DBMS
– High speed Net
– System management
– GUI / OOUI
– Tools
• Compatible with everyone else
• CyberBricks

Implications of Hardware Trends

• Large Disc Farms will be inexpensive (100$/GB)
• Large RAM databases will be inexpensive (1,000$/GB)
• Processors will be inexpensive
• So the building block will be a processor with large RAM and lots of Disc

[Figure: building block: 1k SPECint CPU, 5 GB RAM, 50 GB Disc]

Implication of Hardware Trends: Clusters

• Future Servers are CLUSTERS of processors, discs
• Distributed Database techniques make clusters work

[Figure: cluster node: CPU, 5 GB RAM, 50 GB Disc]

Future SuperServer: 4T Machine

• Array of 1,000 4B machines: processors, disks, tapes, comm lines
• A few MegaBucks
• Challenge:
– Manageability
– Programmability
– Security
– Availability
– Scaleability
– Affordability
• As easy as a single system

[Figure: High Speed Network (10 Gb/s) connecting 100 Nodes (1 Tips); 1,000 discs = 10 Terrorbytes; 100 Tape Transports = 1,000 tapes = 1 PetaByte]

Great Debate: Shared What?

• Shared Memory (SMP)
– Easy to program
– Difficult to build
– Difficult to scaleup
– Sequent, SGI, Sun
• Shared Disk
– VMScluster, Sysplex
• Shared Nothing (network)
– Hard to program
– Easy to build
– Easy to scaleup
– Tandem, Teradata, SP2

[Figure: CLIENTS attached to each of the three architectures, showing Processors and Memory]

• Winner will be a synthesis of these ideas
• Distributed shared memory (DASH, Encore) blurs distinction between Network and Bus (locality still important)
• But gives Shared memory message cost.

Scaleables: Uneconomic So Far

• A Slice is a processor, memory, and a few disks.
• Slice Price of Scaleables so far is 5x to 10x markup
– Teradata: 70k$ for an Intel 486 + 32MB + 4 disk
– Tandem: 100k$ for a MipsCo R4000 + 64MB + 4 disk
– Intel: 75k$ for an I860 + 32MB + 2 disk
– TMC: 75k$ for a SPARC 3 + 32MB + 2 disk
– IBM/SP2: 100k$ for a R6000 + 64MB + 8 disk
• Compaq Slice Price is less than 10k$
• What is the problem?
– Proprietary interconnect
– Proprietary packaging
– Proprietary software (vendorIX)

Summary

• Storage trends force pipeline & partition parallelism
– Lots of bytes & bandwidth per dollar
– Lots of latency
• Processor trends force pipeline & partition
– Lots of MIPS per dollar
– Lots of processors
• Putting it together (Scaleable Networks and Platforms)
– Build clusters of commodity processors & storage
– Commodity interconnect is key (S of PMS)
» Traditional interconnects give 100k$/slice.
– Commodity Cluster Operating System is key
– Fault isolation and tolerance is key
– Automatic Parallel Programming is key

The Hardware is in Place and Then A Miracle Occurs

SNAP: Scaleable Network And Platforms

• Commodity Distributed OS built on
– Commodity Platforms
– Commodity Network Interconnect

?

Enables Parallel Applications

Why Parallel Access To Data?

Bandwidth

• 1 Terabyte at 10 MB/s: 1.2 days to scan
• 1 Terabyte, 1,000 x parallel: 1.5 minute SCAN
• Parallelism: divide a big problem into many smaller ones to be solved in parallel.
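
The scan times follow directly from size divided by aggregate bandwidth. A minimal sketch, assuming decimal units (1 TB = 10^12 bytes) and perfectly even partitioning across disks; the function name is illustrative.

```python
TB = 10**12
MB = 10**6


def scan_seconds(data_bytes, bytes_per_second_per_disk, disks=1):
    """Elapsed time to read everything when the data is spread evenly over `disks`."""
    return data_bytes / (bytes_per_second_per_disk * disks)


serial = scan_seconds(1 * TB, 10 * MB)
parallel = scan_seconds(1 * TB, 10 * MB, disks=1_000)

print(f"one disk:    {serial / 86_400:.1f} days")
print(f"1,000 disks: {parallel / 60:.1f} minutes")
```

The one-disk case is 100,000 seconds and the 1,000-disk case is 100 seconds, which the slide rounds to 1.2 days and 1.5 minutes.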

DataFlow Programming: Prefetch & Postwrite Hide Latency

Latency

• Can't wait for the data to arrive (2,000 years!)
• Need a memory that gets the data in advance (100MB/S)
• Solution:
– Pipeline from source (tape, disc, ram...) to cpu cache
– Pipeline results to destination
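
Prefetching can be sketched as a reader that runs ahead of the consumer and fills a bounded buffer. The code below is an illustration of the idea in ordinary Python threads, not how any of the surveyed systems implement it; `prefetching` and `depth` are hypothetical names.

```python
import queue
import threading


def prefetching(source, depth=4):
    """Yield items from `source`, reading up to `depth` items ahead in a background thread."""
    buffer = queue.Queue(maxsize=depth)
    done = object()  # end-of-stream marker

    def reader():
        for item in source:
            buffer.put(item)  # blocks when the buffer is full
        buffer.put(done)

    threading.Thread(target=reader, daemon=True).start()
    while True:
        item = buffer.get()
        if item is done:
            return
        yield item


# The consumer is written as a plain sequential loop; the fetch overlaps with it.
blocks = (bytes(1024) for _ in range(100))
total = sum(len(block) for block in prefetching(blocks))
print(total)
```

The sketch omits error handling: an exception in the source would stall the consumer, so a real pipeline would forward failures through the buffer as well.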

Why are Relational Operators So Successful for Parallelism?

• Relational data model: uniform operators on uniform data stream
• Closed under composition
• Each operator consumes 1 or 2 input streams
• Each stream is a uniform collection of data
• Sequential data in and out: Pure dataflow
• Partitioning some operators (e.g. aggregates, non-equi-join, sort,..) requires innovation

AUTOMATIC PARALLELISM

Database Systems "Hide" Parallelism

• Automate system management via tools
– data placement
– data organization (indexing)
– periodic tasks (dump / recover / reorganize)
• Automatic fault tolerance
– duplex & failover
– transactions
• Automatic parallelism
– among transactions (locking)
– within a transaction (parallel execution)

Automatic Parallel OR DB

    Select image
    from landsat
    where date between 1970 and 1990
    and overlaps(location, :Rockies)
    and snow_cover(image) > .7;

• The three predicates are Temporal (date), Spatial (overlaps), and Image (snow_cover) tests
• Assign one process per processor/disk:
– find images with right date & location
– analyze image, if 70% snow, return it

[Figure: Landsat table with columns date (1/2/72 ... 4/8/95), loc (33N 120W ... 34N 120W), and image; the date, location, & image tests produce the Answer image]

Outline

• Why Parallelism:
– technology push
– application pull
• Benchmark Buyer's Guide
– metrics
– simple tests
• Parallel Database Techniques
– partitioned data
– partitioned and pipelined execution
– parallel relational operators
• Parallel Database Systems
– Teradata, Tandem, Oracle, Informix, Sybase, DB2, RedBrick

Parallelism: Speedup & Scaleup

• Speedup: Same Job, More Hardware, Less time (100GB to 100GB)
• Scaleup: Bigger Job, More Hardware, Same time (100GB to 1 TB)
• Transaction Scaleup: more clients/servers, Same response time (1 k clients to 10 k clients, 100GB to 1 TB)
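
Both terms reduce to a ratio of elapsed times. A minimal sketch, with hypothetical measurements; the function names are illustrative.

```python
def speedup(old_time, new_time):
    """Same job on more hardware: old elapsed time divided by new elapsed time."""
    return old_time / new_time


def scaleup(small_system_time, big_system_time):
    """N-times bigger job on N-times more hardware; 1.0 means elapsed time held steady."""
    return small_system_time / big_system_time


# Hypothetical measurements in seconds, each comparing 1 node against 10 nodes.
print(f"speedup: {speedup(old_time=1200.0, new_time=130.0):.1f} (linear would be 10)")
print(f"scaleup: {scaleup(small_system_time=1200.0, big_system_time=1250.0):.2f} (linear would be 1)")
```

Linear speedup on N-times the hardware is a ratio of N; linear scaleup is a ratio of 1.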

The New Law of Computing

• Grosch's Law: 2x $ is 4x performance
– 1 MIPS for 1 $
– 1,000 MIPS for 32 $ (.03$/MIPS)
• Parallel Law: 2x $ is 2x performance
– 1 MIPS for 1 $
– 1,000 MIPS for 1,000 $
– Needs Linear Speedup and Linear Scaleup
– Not always possible
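
The 32 $ and 1,000 $ price points fall out of the two laws: under Grosch's Law performance grows as the square of cost, so cost grows as the square root of performance, while the Parallel Law is linear. A short sketch with illustrative function names:

```python
import math


def grosch_cost(mips, base_mips=1.0, base_dollars=1.0):
    """Grosch's Law: 2x the money buys 4x the performance, so cost ~ sqrt(performance)."""
    return base_dollars * math.sqrt(mips / base_mips)


def parallel_cost(mips, base_mips=1.0, base_dollars=1.0):
    """Parallel Law with linear scaleup: cost grows in proportion to performance."""
    return base_dollars * mips / base_mips


for law in (grosch_cost, parallel_cost):
    dollars = law(1_000)
    print(f"{law.__name__}: {dollars:,.0f} $ for 1,000 MIPS, {dollars / 1_000:.2f} $/MIPS")
```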

Parallelism: Performance is the Goal

• Goal is to get 'good' performance.
• Law 1: parallel system should be faster than serial system
• Law 2: parallel system should give near-linear scaleup or near-linear speedup or both.

The New Performance Metrics

• Transaction Processing Performance Council:
– TPC-A: simple transaction
– TPC-B: server only, about 3x lighter than TPC-A
– Both obsoleted by TPC-C (no new results after 6/7/95)
• TPC-C (revision 3) Transactions Per Minute tpm-C
– Mix of 5 transactions: query, update, minibatch
– Terminal price eliminated
– about 5x heavier than tpcA (so 3.5 ktpcA ~ 20 ktpmC)
• TPC-D approved in March 1995 - Transactions Per Hour
– Scaleable database (30 GB, 100GB, 300GB,... )
– 17 complex SQL queries (no rewrites, no hints without permission)
– 2 load/purge queries
– No official results yet, many "customer" results.

TPC-C Results 12/94

[Figure: COST ($/TPMC, 0 to 3,000) versus PERFORMANCE (TPMC, 0 to 22,000) for Tandem Himalaya Server (16, 32, 64, and 112 cpus), HP 9000 E55 and H70, HP T500, AS400, RS6000, SUN, and DG]

Courtesy of Charles Levine of Tandem (of course)

Success Stories

• Online Transaction Processing
– many little jobs
– SQL systems support 3700 tps-A (24 cpu, 240 disk)
– SQL systems support 21,000 tpm-C (112 cpu, 670 disks)
• Batch (decision support and Utility)
– few big jobs, parallelism inside
– Scan data at 100 MB/s
– Linear Scaleup to 500 processors

[Figure: transactions / sec versus hardware; recs / sec versus hardware]

The Perils of Parallelism

Speedup = OldTime / NewTime

[Figure: three plots of Speedup versus Processors & Discs: The Good Speedup Curve (Linearity); A Bad Speedup Curve (No Parallelism Benefit); A Bad Speedup Curve, 3-Factors (Startup, Interference, Skew)]

• Startup: creating processes, opening files, optimization
• Interference: device (cpu, disc, bus), logical (lock, hotspot, server, log,...)
• Skew: if tasks get very small, variance > service time
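
A toy model shows how the three factors bend the speedup curve away from linear. The cost formula and every constant below are assumptions chosen only to make the shape visible; they are not taken from the talk or from any measured system.

```python
def elapsed(n, work=1000.0, startup_per_process=0.5, interference=0.002, skew=0.1):
    """Toy elapsed time for `work` seconds of effort spread over n processes."""
    startup = startup_per_process * n           # Startup: cost grows with process count
    slowdown = 1.0 + interference * (n - 1)     # Interference: each peer slows the others
    imbalance = 1.0 + skew if n > 1 else 1.0    # Skew: the slowest task sets the finish time
    return startup + (work / n) * slowdown * imbalance


def speedup(n, **overheads):
    """OldTime / NewTime, with the one-process run as the old time."""
    return elapsed(1, **overheads) / elapsed(n, **overheads)


for n in (1, 10, 100, 1000):
    print(n, round(speedup(n), 1))
```

With all three overheads set to zero the model returns exactly linear speedup; with the defaults, the curve rises, flattens, and then falls as startup cost dominates.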

Benchmark Buyer's Guide

The Benchmark Report / The Whole Story (for any system)

[Figure: Throughput versus Processors & Discs]

• Things to ask
– When does it stop scaling?
– Throughput numbers, not ratios.
• Standard benchmarks allow
– Comparison to others
– Comparison to sequential
• Ratios and non-standard benchmarks are red flags.

Performance 101: Scan Rate

[Figure: Scan feeding AggCount]

• Disk is 3MB/s to 10MB/s
• Record is 100B to 200B (TPC-D 110...160, Wisconsin 204)
• So should be able to read 10kr/s to 100kr/s
• Simple test: Time this on a 1M record table (table on one disk, turn off parallelism)

    SELECT count(*) FROM T WHERE x < :infinity;

• Typical problems:
– disk or controller is an antique
– no read-ahead in operating system or DB
– small page reads (2kb)
– data not clustered on disk
– big cpu overhead in record movement
• Parallelism is not the cure for these problems
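
The expected scan rate is disk bandwidth divided by record size, which also gives a target elapsed time for the 1M-record test. A minimal sketch with illustrative names, using the slide's bandwidth and record-size ranges:

```python
def records_per_second(disk_bytes_per_second, record_bytes):
    """Upper bound on scan rate if the disk, not the CPU, is the limit."""
    return disk_bytes_per_second / record_bytes


def scan_seconds(n_records, rate):
    return n_records / rate


slowest = records_per_second(3e6, 200)    # 3 MB/s disk, 200-byte records
fastest = records_per_second(10e6, 100)   # 10 MB/s disk, 100-byte records

print(f"expected rate: {slowest:,.0f} to {fastest:,.0f} records/s")
print(f"1M-record scan: {scan_seconds(1e6, fastest):.0f} to {scan_seconds(1e6, slowest):.0f} seconds")
```

A measured time far outside that window on a single disk points at one of the listed problems rather than at a lack of parallelism.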

Parallel Scan Rate

[Figure: four Scan and AggCount pipelines feeding one AggSum]

• Simplest parallel test: scaleup previous test:
– 4 disks, 4 controllers, 4 processors
– 4 times as many records, partitioned 4 ways
– Same query
• Should have same elapsed time.
• Some systems do.
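
The query plan in this test is one count per partition followed by a sum of the partial counts. The sketch below shows that structure with a thread pool; it illustrates the plan shape only (Python threads will not show real scan speedup), and the function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def agg_count(partition, limit):
    """Scan one partition and count the qualifying records (Scan + AggCount)."""
    return sum(1 for x in partition if x < limit)


def parallel_count(partitions, limit):
    """Run one counter per partition, then combine the partial counts (AggSum)."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partial_counts = list(pool.map(lambda p: agg_count(p, limit), partitions))
    return sum(partial_counts)


# Four partitions of a 100-row table, spread round robin.
partitions = [list(range(i, 100, 4)) for i in range(4)]
print(parallel_count(partitions, limit=50))
```

Only the small partial counts cross partition boundaries, which is why this query should scale up with no change in elapsed time.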

Parallel Update Rate

[Figure: UPDATE and Log]

• Test:

    UPDATE T SET x = x + :one;

– Test for million row T on 1 disk
– Test for four million row T on 4 disks
– Look for bottlenecks.
• After each call, execute ROLLBACK WORK
– See if UNDO runs at the DO speed
– See if UNDO is parallel (scales up)
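
A timing harness for the DO and UNDO halves of this test might look like the following. It uses Python's built-in `sqlite3` with an in-memory table purely as a stand-in so the sketch runs anywhere; SQLite is not a parallel database, and a real test would point the same two statements at the system under evaluation. The function name and table size are illustrative.

```python
import sqlite3
import time


def time_do_and_undo(conn):
    """Return (do_seconds, undo_seconds) for the slide's UPDATE followed by a rollback."""
    start = time.perf_counter()
    conn.execute("UPDATE T SET x = x + :one", {"one": 1})
    done = time.perf_counter()
    conn.rollback()  # plays the role of ROLLBACK WORK
    undone = time.perf_counter()
    return done - start, undone - done


conn = sqlite3.connect(":memory:")  # stand-in only; not the system being benchmarked
conn.execute("CREATE TABLE T (x INTEGER)")
conn.executemany("INSERT INTO T VALUES (?)", ((i,) for i in range(100_000)))
conn.commit()

do_seconds, undo_seconds = time_do_and_undo(conn)
print(f"DO {do_seconds:.4f}s, UNDO {undo_seconds:.4f}s")
```

Repeating the measurement at one and four partitions gives the two numbers the slide asks for: whether UNDO keeps pace with DO, and whether both scale up.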

The records/$/second Metric

• Parallel database systems scan data
• An interesting metric (100 byte record):
– Record Scan Rate / System Cost
• Typical scan rates: 1k records/s to 30k records/s
• Each Scaleable system has a "slice price" guess:
– Gateway: 15k$ (P5 + ATM + 2 disks + NT + SQLserver or Informix or Oracle)
– Teradata: 75k$
– Sequent: 75k$ (P5 + 2 disks + Dynix + Informix)
– Tandem: 100k$
– IBM SP2: 130k$ (RS6000 + 2 disks, AIX, DB2)
• You can compute slice price for systems later in presentation
• BAD: 0.1 records/s/$ (there is one of these)
• GOOD: 0.33 records/s/$ (there is one of these)
• Super! 1.00 records/s/$ (there is one of these)
• We should aim at 10 records/s/$ with P6.
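
The metric is a single division. In the sketch below the scan rates are hypothetical, the slice prices are two of the slide's guesses, and the grade cutoffs are an assumption that treats the slide's three reference points as thresholds.

```python
def records_per_second_per_dollar(scan_rate, slice_price):
    """Record scan rate divided by system cost."""
    return scan_rate / slice_price


def grade(metric):
    """Bucket a result using the slide's reference points (cutoffs are an assumption)."""
    if metric >= 1.0:
        return "Super!"
    if metric >= 0.33:
        return "GOOD"
    return "BAD"


# Hypothetical scan rates paired with two of the slide's slice-price guesses.
for name, rate, price in [("15k$ slice at 15k records/s", 15_000, 15_000),
                          ("100k$ slice at 10k records/s", 10_000, 100_000)]:
    metric = records_per_second_per_dollar(rate, price)
    print(f"{name}: {metric:.2f} records/s/$ ({grade(metric)})")
```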

Embarrassing Questions to Ask Your PDB Vendor

• How are constraints checked?
– ask about unique secondary indices
– ask about deferred constraints
– ask about referential integrity
• How does parallelism interact with
– triggers
– Stored procedures
– OO extensions
• How can I change my 10 TB database design in an hour?
– add index
– add constraint
– reorganize / repartition
• These are hard problems.

Outline

• Why Parallelism:
– technology push
– application pull
• Benchmark Buyer's Guide
– metrics
– simple tests
• Parallel Database Techniques
– partitioned data
– partitioned and pipelined execution
– parallel relational operators
• Parallel Database Systems
– Teradata, Tandem, Oracle, Informix, Sybase, DB2, RedBrick

Automatic Data Partitioning

• Split a SQL table to subset of nodes & disks
• Partition within set:
– Range: good for equijoins, range queries, group-by
– Hash: good for equijoins
– Round Robin: good to spread load
• Shared disk and memory less sensitive to partitioning
• Shared nothing benefits from "good" partitioning

[Figure: a table split five ways (A...E, F...J, K...N, O...S, T...Z) under each of the three schemes]
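
The three schemes differ only in how a row is mapped to a partition number. A sketch under stated assumptions: the range boundaries are the five letter ranges from the figure, CRC32 is an arbitrary stand-in for whatever hash a real system uses, and all names are illustrative.

```python
import bisect
import zlib

UPPER_BOUNDS = ["E", "J", "N", "S"]  # last letter of each range except the final T...Z


def range_partition(key):
    """Partition index from the first letter of `key`, using the figure's five ranges."""
    return bisect.bisect_left(UPPER_BOUNDS, key[:1].upper())


def hash_partition(key, n):
    """Partition index from a hash of the whole key (CRC32 is an arbitrary choice)."""
    return zlib.crc32(key.encode("utf-8")) % n


def round_robin_partition(row_number, n):
    """Partition index from arrival order, ignoring the key entirely."""
    return row_number % n


names = ["Adams", "Fox", "Kim", "Olsen", "Taylor", "Evans"]
for i, name in enumerate(names):
    print(name, range_partition(name), hash_partition(name, 5), round_robin_partition(i, 5))
```

Range and hash both send equal keys to the same partition, which is what makes them useful for equijoins; round robin does not, but it spreads load evenly regardless of the key distribution.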

Index Partitioning

• Hash indices partition by hash
• B-tree indices partition as a forest of trees.
– One tree per range
• Primary index clusters data

[Figure: hash partitions 0...9, 10..19, 20..29, 30..39, 40..; B-tree range partitions A..C, D..F, G...M, N...R, S..Z]

Secondary Index Partitioning

• In shared nothing, secondary indices are problematic
• Partition by base table key ranges
– Insert: completely local (but what about unique?)
– Lookup: examines ALL trees (see figure)
– Unique index involves lookup on insert.
• Partition by secondary key ranges
– Insert: two nodes (base and index)
– Lookup: two nodes (index -> base)
– Uniqueness is easy
– Teradata solution

[Figure: a Base Table with a full A..Z secondary index tree at every node, next to a Base Table whose secondary index is split into ranges A..C, D..F, G...M, N...R, S..]

Kinds of Parallel Execution

• Pipeline
• Partition: outputs split N ways, inputs merge M ways

[Figure: a pipeline of stages, each labeled Any Sequential Program; a partitioned stage running copies of Any Sequential Program side by side]

Page 54: 1 Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Parallel Database Systems 101 Jim Gray & Gordon Bell Microsoft Corporation presented

81Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey

Data Rivers Split + Merge Streams

[Figure: a River connecting N producers to M consumers: N x M data streams]

Producers add records to the river, consumers consume records from the river
Purely sequential programming.
River does flow control and buffering
  does partition and merge of data records
River = Split/Merge in Gamma = Exchange operator in Volcano.
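A minimal sketch of a river, assuming Python threads and bounded queues as stand-ins for processes and message buffers. `run_river` and its parameters are hypothetical names, not taken from Gamma or Volcano. Each producer and consumer is ordinary sequential code; the river does the split, merge, and flow control.

```python
import queue
import threading

def run_river(producers, n_consumers, consume, partition=hash):
    """producers: list of iterables of records.
    consume: sequential function that takes an iterator of records and must drain it.
    Returns one result per consumer."""
    # Bounded queues give flow control: a fast producer blocks when a consumer lags.
    inboxes = [queue.Queue(maxsize=64) for _ in range(n_consumers)]
    done = object()                      # end-of-stream marker, one per producer per consumer
    results = [None] * n_consumers

    def produce(source):
        for rec in source:
            inboxes[partition(rec) % n_consumers].put(rec)   # split
        for q in inboxes:
            q.put(done)

    def drain(i):
        def records():
            remaining = len(producers)
            while remaining:
                rec = inboxes[i].get()                        # merge of N input streams
                if rec is done:
                    remaining -= 1
                else:
                    yield rec
        results[i] = consume(records())

    threads = [threading.Thread(target=produce, args=(p,)) for p in producers]
    threads += [threading.Thread(target=drain, args=(i,)) for i in range(n_consumers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

For example, `run_river([range(0, 10), range(10, 20)], 2, sum, partition=lambda r: r)` routes even numbers to consumer 0 and odd numbers to consumer 1.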


Partitioned Execution

[Figure: a table range-partitioned A...E, F...J, K...N, O...S, T...Z; one Count per partition feeds a final Count]

Spreads computation and IO among processors

Partitioned data gives NATURAL parallelism
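The count in the figure can be sketched as one local count per partition followed by a sum of the sub-counts. This is an illustration only: `partitioned_count` is a hypothetical name, and a thread pool stands in for one processor per partition.

```python
from concurrent.futures import ThreadPoolExecutor

def partitioned_count(partitions, predicate=lambda rec: True):
    """Count matching records: one local count per partition, then a final sum."""
    def local_count(part):
        return sum(1 for rec in part if predicate(rec))

    with ThreadPoolExecutor(max_workers=max(1, len(partitions))) as pool:
        sub_counts = list(pool.map(local_count, partitions))
    return sum(sub_counts)   # the final Count only sees one number per partition
```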


N x M way Parallelism

[Figure: partitions A...E, F...J, K...N, O...S, T...Z each feed a Sort and a Join; the Join outputs flow into Merge operators]

N inputs, M outputs, no bottlenecks.

Partitioned Data
Partitioned and Pipelined Data Flows


Picking Data Ranges

Disk Partitioning
  For range partitioning, sample load on disks.
    Cool hot disks by making range smaller
  For hash partitioning,
    Cool hot disks by mapping some buckets to others

River Partitioning
  Use hashing and assume uniform
  If range partitioning, sample data and use histogram to level the bulk

Teradata, Tandem, Oracle use these tricks
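The "sample data and use histogram" step can be sketched as picking equi-depth split points from a sorted sample. The function names are hypothetical, and this is a generic technique, not the specific method of any vendor named above.

```python
from bisect import bisect_right

def split_points(sample, n_partitions):
    """Equi-depth boundaries: each range holds about the same number of sampled keys."""
    s = sorted(sample)
    return [s[(i * len(s)) // n_partitions] for i in range(1, n_partitions)]

def partition_of(points, key):
    """Range partition that a key falls into, given the split points."""
    return bisect_right(points, key)

def loads(points, keys):
    """Number of keys landing in each partition."""
    counts = [0] * (len(points) + 1)
    for k in keys:
        counts[partition_of(points, k)] += 1
    return counts
```

On skewed keys such as squares of 0..999, equal-width ranges would overload the low partitions, while the sampled split points give each of four partitions 250 keys.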


Blocking Operators = Short Pipelines

An operator is blocking if it does not produce any output until it has consumed all its input

Examples: Sort, Aggregates, Hash-Join (reads all of one operand)

Blocking operators kill pipeline parallelism
Make partition parallelism all the more important.

[Figure: Database Load Template has three blocked phases. Boxes: Scan (from Tape File, SQL Table, or Process), Sort Runs, Merge Runs, Table Insert (SQL Table), Index Insert (Index 1, Index 2, Index 3)]


Simple Aggregates (sort or hash?)

Simple aggregates (count, min, max, ...) can use indices
  More compact
  Sometimes have aggregate info.

GROUP BY aggregates
  scan in category order if possible (use indices)
  Else if categories fit in RAM, use RAM category hash table
  Else
    make temp of <category, item>
    sort by category,
    do math in merge step.
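The last two GROUP BY strategies can be sketched side by side, here for SUM. The function names are hypothetical, and the in-memory `sorted` call stands in for an external sort of the temp file.

```python
from itertools import groupby
from operator import itemgetter

def hash_group_sum(rows):
    """Categories fit in RAM: one pass, one hash-table entry per category."""
    totals = {}
    for category, item in rows:
        totals[category] = totals.get(category, 0) + item
    return totals

def sort_group_sum(rows):
    """Too many categories: sort <category, item> pairs, then aggregate each run."""
    temp = sorted(rows, key=itemgetter(0))   # stands in for an external sort
    return {cat: sum(item for _, item in group)
            for cat, group in groupby(temp, key=itemgetter(0))}
```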


Sort

Used for
  loading and reorganization (sort makes them sequential)
  build B-trees
  reports
  non-equijoins
Rarely used for aggregates or equi-joins (if hash available)

[Figure: Input Data -> Sort -> Runs -> Merge -> Sorted Data]


Parallel Sort

M input N output Sort design

[Figure: Scan or other source -> Range or Hash Partition River -> Sub-sorts generate runs -> Merge runs]

River is range or hash partitioned

Disk and merge not needed if sort fits in memory

Scales linearly because $\log(10^{12}) / \log(10^{6}) = 12/6 = 2$ => 2x slower

Sort is benchmark from hell for shared nothing machines
  net traffic = disk bandwidth, no data filtering at the source
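A minimal sketch of the range-partitioned design, assuming the whole sort fits in memory (so no runs or merge step). `parallel_sort` and `bounds` are hypothetical names, and the thread pool only illustrates the structure of N independent sub-sorts.

```python
from bisect import bisect_right
from concurrent.futures import ThreadPoolExecutor

def parallel_sort(records, bounds):
    """Range-partition records on the split points in bounds, sort each partition
    independently, then concatenate: partition i only holds keys below partition i+1."""
    parts = [[] for _ in range(len(bounds) + 1)]
    for rec in records:                      # the river: every record crosses the net once
        parts[bisect_right(bounds, rec)].append(rec)
    with ThreadPoolExecutor(max_workers=len(parts)) as pool:
        sorted_parts = list(pool.map(sorted, parts))
    return [rec for part in sorted_parts for rec in part]
```

Every input record moves through the river exactly once, which matches the slide's point that net traffic equals disk bandwidth.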


SIGMOD Sort Award

Datamation Sort: 1M records (100 B recs)
  1000 seconds 1986
  60 seconds 1990
  7 seconds 1994
  3.5 seconds 1995 (SGI challenge)
  micros finally beat the mainframe!
  finally! a UNIX system that does IO

SIGMOD MinuteSort
  1.1GB, Nyberg, 1994, Alpha 3cpu
  1.6GB, Nyberg, 1995, SGI Challenge (12 cpu)

no SIGMOD PennySort record

[Figure: Sort Time on an SGI Challenge, 1.6 GB (16 M 100-byte records), 12 cpu, 2.2 GB, 96 disk. x-axis: Threads (Sprocs) devoted to sorting (1, 2, 4, 6, 10); y-axis: Elapsed Time (seconds), 0 to 250; series: pin, read-done, lists-sorted, lists merged, write done]

[Figure: Sort Records/second vs Time, 1985 to 1995, 1.0E+02 to 1.0E+06 records/second. Systems: M68000, Cray YMP, IBM 3090, Tandem, Hardware Sorter, Sequent, Alpha, Intel HyperCube, SGI]


Nested Loops Join

[Figure: Outer Table and Inner Table]

If inner table indexed on join cols (b-tree or hash)
  then sequential scan outer (from start key)
  For each outer record
    probe inner table for matching recs

Works best if inner is in RAM (=> small inner)
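The indexed nested loops join above, as a short sketch. The in-memory dictionary is an assumption standing in for the inner table's b-tree or hash index, and the function name is hypothetical.

```python
from collections import defaultdict

def index_nested_loops_join(outer, inner, outer_key, inner_key):
    """Scan outer sequentially; probe an index on inner for each outer record."""
    index = defaultdict(list)        # stands in for the inner table's index
    for row in inner:
        index[inner_key(row)].append(row)
    for o in outer:                  # sequential scan of outer
        for i in index.get(outer_key(o), ()):   # probe inner for matching recs
            yield o, i
```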


Merge Join (and sort-merge join)

[Figure: Left Table and Right Table scanned in step; the NxM case is a Cartesian product]

If tables sorted on join cols (b-tree or hash)
  then sequential scan each (from start key)
    left < right: advance left
    left = right: match
    left > right: advance right

Nice sequential scan of data (disk speed)
  (MxN case may cause backwards rescan)

Sort-merge join sorts before doing the merge

Partitions well: partition smaller to larger partition.

Works for all joins (outer, non-equijoins, Cartesian, exclusion,...)
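A sketch of the merge step for an equijoin on inputs already sorted on the join key, including the backwards rescan for the MxN duplicate case. `merge_join` is a hypothetical name; lists stand in for sequentially scanned tables.

```python
def merge_join(left, right, key=lambda r: r[0]):
    """Equijoin of two inputs that are already sorted on key."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = key(left[i]), key(right[j])
        if lk < rk:
            i += 1                       # left < right: advance left
        elif lk > rk:
            j += 1                       # left > right: advance right
        else:
            group_start = j              # left = right: match
            while i < len(left) and key(left[i]) == lk:
                j = group_start          # MxN case: back up and rescan the right group
                while j < len(right) and key(right[j]) == lk:
                    out.append((left[i], right[j]))
                    j += 1
                i += 1
    return out
```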


Hash Join

Hash smaller table into N buckets (hope N=1)

If N=1 read larger table, hash to smaller
Else, hash outer to disk then bucket-by-bucket hash join.

Purely sequential data behavior

Always beats sort-merge and nested unless data is clustered.

Good for equi, outer, exclusion join

Lots of papers, products just appearing (what went wrong?)

Hash reduces skew

[Figure: Right Table, Left Table, Hash Buckets]
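A sketch of both cases: with one bucket the smaller table is built into an in-memory hash table and the larger table probes it; with N buckets both inputs are split by hash first and joined bucket by bucket. The names are hypothetical, and lists in memory stand in for the buckets that would be written to disk.

```python
def hash_join(small, large, key=lambda r: r[0], n_buckets=1):
    """Equijoin. n_buckets=1 is the in-memory case (the 'hope N=1' case)."""
    if n_buckets == 1:
        table = {}
        for s in small:                          # build on the smaller table
            table.setdefault(key(s), []).append(s)
        return [(s, l) for l in large            # probe with the larger table
                for s in table.get(key(l), ())]

    def split(rows):
        buckets = [[] for _ in range(n_buckets)]
        for r in rows:
            buckets[hash(key(r)) % n_buckets].append(r)
        return buckets

    out = []
    # Matching keys always land in the same bucket number on both sides.
    for small_bucket, large_bucket in zip(split(small), split(large)):
        out.extend(hash_join(small_bucket, large_bucket, key, 1))
    return out
```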


Parallel Hash Join

ICL implemented hash join with bitmaps in CAFS machine (1976)!

Kitsuregawa pointed out the parallelism benefits of hash join in early 1980’s (it partitions beautifully)

We ignored them! (why?)
But now, everybody's doing it (or promises to do it).

Hashing minimizes skew, requires little thinking for redistribution

Hashing uses massive main memory


Observations

It is easy to build a fast parallel execution environment (no one has done it, but it is just programming)

It is hard to write a robust and world-class query optimizer.
  There are many tricks
  One quickly hits the complexity barrier

Common approach:
  Pick best sequential plan
  Pick degree of parallelism based on bottleneck analysis
  Bind operators to process


What’s Wrong With That?

Why isn’t the best serial plan the best parallel plan?

Counter example:
  Table partitioned with local secondary index at two nodes
  Range query selects all of node 1 and 1% of node 2.
  Node 1 should do a scan of its partition.
  Node 2 should use secondary index.

  SELECT * FROM telephone_book WHERE name < 'NoGood';

Sybase Navigator & DB2 PE should get this right.

We need theorems here (practitioners do not have them)

[Figure: partitions A..M and N..Z, one read by a Table Scan and the other by an Index Scan]
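The counter example can be made concrete with a toy access-path chooser. The 5% break-even fraction and both function names are assumptions for illustration, not figures from the talk or from any product. The point is only that the choice is made per partition rather than once for the whole table.

```python
def choose_access_path(fraction_selected, index_breakeven=0.05):
    """Assumed rule of thumb: an index pays off only when few rows qualify."""
    return "table scan" if fraction_selected > index_breakeven else "index scan"

def plan_per_node(selectivity_by_node, index_breakeven=0.05):
    """Pick the access path separately for each partition."""
    return {node: choose_access_path(f, index_breakeven)
            for node, f in selectivity_by_node.items()}

def plan_serial(selectivity_by_node, index_breakeven=0.05):
    """One global choice from the average selectivity (equal-size partitions assumed),
    applied at every node, as a serial optimizer would do."""
    overall = sum(selectivity_by_node.values()) / len(selectivity_by_node)
    path = choose_access_path(overall, index_breakeven)
    return {node: path for node in selectivity_by_node}
```

With selectivities of 100% at node 1 and 1% at node 2, the per-node plan scans node 1 and uses the index at node 2, while the single global choice scans both.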


What Systems Work This Way

Shared Nothing
  Teradata: 400 nodes
  Tandem: 110 nodes
  IBM / SP2 / DB2: 128 nodes
  Informix/SP2: 48 nodes
  ATT & Sybase: 8x14 nodes

Shared Disk
  Oracle: 170 nodes
  Rdb: 24 nodes

Shared Memory
  Informix: 9 nodes
  RedBrick: ? nodes

[Figure: three architecture diagrams, each with CLIENTS; labels: Memory, Processors]


Outline

• Why Parallelism:
  – technology push
  – application pull
• Benchmark Buyer’s Guide
  – metrics
  – simple tests
• Parallel Database Techniques
  – partitioned data
  – partitioned and pipelined execution
  – parallel relational operators
• Parallel Database Systems
  – Teradata, Oracle, DB2
  – Tandem, Informix, RedBrick, Sybase


System Survey Ground Rules

Premise: The world does not need yet another PDB survey

It would be nice to have a survey of “real” systems

Visited each parallel DB vendor I could (time limited)

Asked not to be given confidential info.

Asked for public manuals and benchmarks

Asked that my notes be reviewed

I say only nice things (I am a PDB booster)


Acknowledgments

Teradata
  Todd Walter and Carrie Ballinger
Tandem
  Susanne Englert, Don Slutz, HansJorge Zeller, Mike Pong
Oracle
  Gary Hallmark, Bill Widdington
Informix
  Gary Kelley, Hannes Spintzik, Frank Symonds, Dave Clay
Navigator
  Rick Stellwagen, Brian Hart, Ilya Listvinsky, Bill Huffman, Bob McDonald, Jan Graveson Ron Chung Hu, Stuart Thompto
DB2
  Chaitan Baru, Gilles Fecteau, James Hamilton, Hamid Pirahesh
Redbrick
  Phil Fernandez, Donovan Schneider


Teradata

• Ship 1984, now an ATT GIS brand name

• Parallel DB server for decision support: SQL in, tables out

• Support Heterogeneous data (convert to client format)

Data hash partitioned among AMPs with fallback (mirror) hash.

Applications run on clients

Biggest installation: 476 nodes, 2.4 TB

Ported to UNIX base

[Figure: Teradata configuration: Application Processor, PEP, AMP; client types: IBM, PC, Mac, UNIX, VMS, AS400]


[Figure: Teradata configuration: Application Processor, PEP, AMP; client types: IBM, PC, Mac, UNIX, VMS, AS400]

Parsing Engines
  Interface to IBM or Ethernet or...
  Accept SQL, return records and status.
  Support SQL 89, moving to SQL92
  Parse, Plan & authorize SQL
    cost based optimizer
  Issue requests to AMPs
  Merge AMP results to requester.
  Some global load control based on client priority (adaptive and GREAT!)

Access Modules
  Almost all work done in AMPs
  A shared nothing SQL engine
    scans, inserts, joins, log, lock,....
  Manages up to 4 disks (as one logical volume)
  Easy design, manage, grow (just add disk)


Data Layout: Hash Partitioning

All data declustered to all nodes
Each table has a hash key (may be compound)
Key maps to one of 4,000 buckets
Buckets map to one of the AMPs
Non-Unique secondary index partitioned by table criterion
Fallback bucket maps to second AMP in cluster.

Typical cluster is 6 nodes (2 is mirroring).
Cluster limits failure scope:
  2 failures only cause data outage if both in same cluster.

Within a node, each hash to cylinder then hash to “page”

Page is a heap with a sorted directory
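The key -> bucket -> AMP mapping with a fallback AMP in the same cluster can be sketched as below. Only the 4,000-bucket count comes from the slide. The CRC32 hash, the modulo bucket map, the fallback offset rule, and all names are assumptions for illustration and do not describe Teradata's actual functions.

```python
import zlib

N_BUCKETS = 4000   # from the slide: key maps to one of 4,000 buckets

def bucket_of(key):
    """Hash a (possibly compound) key to a bucket. CRC32 is a stand-in hash."""
    return zlib.crc32(repr(key).encode()) % N_BUCKETS

def placement(key, n_amps, cluster_size):
    """Return (bucket, primary AMP, fallback AMP).
    Assumed rules: bucket b lives on AMP b % n_amps; the fallback copy goes to a
    different AMP of the same cluster, spread over the cluster's other members."""
    if cluster_size < 2 or n_amps % cluster_size != 0:
        raise ValueError("need cluster_size >= 2 and n_amps divisible by cluster_size")
    bucket = bucket_of(key)
    primary = bucket % n_amps
    cluster_start = (primary // cluster_size) * cluster_size
    offset = 1 + (bucket // n_amps) % (cluster_size - 1)      # 1 .. cluster_size-1
    fallback = cluster_start + (primary - cluster_start + offset) % cluster_size
    return bucket, primary, fallback
```

Because the fallback never leaves the cluster, two failures lose data only when both hit the same cluster, which is the failure-scope property the slide states.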


Teradata Optimization & Execution

Sophisticated query optimizer (many tricks)
Great emphasis on Joins & Aggregates.

Nested, merge, product, bitmap join (no hash join)

Automatic load balancing from hashing & load control

Excellent utilities for data loading, reorganize

Move > 1TB database from old to new in 6 days, in background while old system running

Old hardware, 3.8B row table (1TB), >300 AMPs
  typical scan, sort, join averages 30 minutes


Query Execution

Protocol
  PE requests work
  AMP responds OK (or pushback)
  AMP works (if all OK)
  AMP declares finished
  When all finished, PE does 2PC and starts pull

Simple scan:
  PE broadcasts scan to each AMP
  Each AMP scans, produces answer spool file
  PE pulls spool file from AMPs via Ynet

If scan were ordered, sort “catcher” would be forked at each AMP, pipelined to scans
  Ynet and PE would do merge of merges from AMPs


Aggregates, Updates

Aggregate of Scan:
  Scans produce local sub-aggregates
  Hash sub-aggregates to Ynet
  Each AMP “catches” its sub-aggregate hash buckets
  Consolidate sub-aggregates.
  PE pulls aggregates from AMPs via Ynet.
  Note: fully scaleable design

Insert / Update / Delete at an AMP node
  generates insert / update / delete messages to
    unique-secondary indices
    fallback bucket of base table.
  messages saved in spool if node is down
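The aggregate-of-scan steps can be sketched for a grouped COUNT. Lists stand in for AMPs and a loop stands in for the Ynet; the function name and the hash-to-owner rule are assumptions made for this sketch.

```python
from collections import Counter

def parallel_group_count(partitions, category=lambda rec: rec):
    """Grouped COUNT in three steps: local sub-aggregates, hash redistribution,
    consolidation at the AMP that owns each category."""
    n = len(partitions)
    # Step 1: each AMP scans its own partition and builds local sub-aggregates.
    local = [Counter(category(rec) for rec in part) for part in partitions]
    # Step 2: each sub-aggregate is hashed to its owner; the owner "catches"
    # and consolidates it. Only one row per category per AMP crosses the net.
    caught = [Counter() for _ in range(n)]
    for sub in local:
        for cat, count in sub.items():
            caught[hash(cat) % n][cat] += count
    # Step 3: the requester pulls the consolidated aggregates from every AMP.
    result = {}
    for owner in caught:
        result.update(owner)
    return result
```

No single node sees all the sub-aggregates, which is one way to read the slide's note that the design is fully scaleable.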


Query Execution: Joins

Great emphasis on Joins.
Includes small-table large-table optimization
  cheapest triple, then cheapest in triple.

If equi-partitioned, do locally
If not equi-partitioned,
  May replicate small table to large partition (Ynet shines)
  May repartition one if other is already partitioned on join
  May repartition both (in parallel)

Join algorithm within node is
  Product
  Nested
  Sort-merge
  Hash bit map of secondary indices, intersected.


Utilities

Bulk Data Load, Fast Data Load, Multi-load,
  Blast 32KB of data to an AMP
  Multiple sessions by multiple clients can drive 200x parallel
  Double buffer
  AMP unpacks, and puts “upsert” onto Ynet
  One record can generate multiple upserts (transaction -> inventory, store-sales, ...)
  Catcher on Ynet grabs relevant “upserts” to temp file.
  Sorts and then batches inserts (survives restarts).
  Online and restartable.
  Customers cite this as Teradata strength.

Fast Export (similar to bulk data load)


Utilities II

Backup / Restore: Rarely needed because of fallback.
  Cluster is unit of recovery
  Backup is online, Restore is offline

Reorganize: Rarely needed, add disk is just restart

Add node:
  rehash all buckets that go to that node (Ynet has old and new bucket map)
  Fully parallel and fault tolerant, takes minutes


Port To UNIX

New design (3700 series) described in VLDB 93

Ported to UNIX platforms (3600 AP, PE, AMP)

Moved Teradata to Software Ynet on SMPs

Based on Bullet-Proof UNIX with TOS layer atop.
  message system
  communications stacks
  raw disk & virtual processors
  virtual partitions (buckets go to virtual partitions)
  removes many TOS limits

Result is 10x to 60x faster than an AMP

Compiled expression evaluation (gives 50x speedup on scans)

Large main memory helps

[Figure: software stack. Layers: Applications; SQL; Parsing engine (parallelism); Teradata SQL (AMP logic); UNIX PDE: TOS adapter; UNIX 5.4 (SMP, RAS, virtual Ynet); HARDWARE]


Customer Benchmarks

Standard Benchmarks
  Only old Boral/DeWitt Wisconsin numbers.
  Nothing public.

Moving > 1TB database from old to new in 6 days, in background while old system running
  So: unload-load rate > 2MB/s sustained
  Background task (speed limited by host speed/space)

Old hardware, 3.8B row table, >300 AMPs
  typical scan, sort, join averages 30 minutes

rates (rec size not cited):
  operation           krec/s/AMP   k rec/s
  scan                9            2.7 mr/s !!!!!!
  clustered join      2            600 kr/s
  insert-select       .39          120 kr/s
  Hash index build    3.3          100 kr/s


UNIX/SMP Port of Teradata

op                        rows       seconds   k r/s   MB/s
scan                      50000000   737       67.8    11.0
copy                      5000000    1136      4.4     0.7
aggregate                 50000000   788       63.5    10.3
Join 50x2M (clustered)    52000000   768       67.7    11.0
Join 5x5 (unclustered)    10000000   237       42.2    6.8
Join 50Mx.1K              50000100   1916      26.1    4.2

Times to process a Teradata Test DB on an 8 Pentium, 3650.
These numbers are 10 to 150x better than a single AMP
  Compiled expression handling
  more memory


Teradata Good Things

Scaleable to large (multi-terabyte) databases

Available TODAY!

It is VERY real: in production in many large sites

Robust and complete set of utilities

Automatic management.

Integrates with the IBM mainframe OLTP world

Heterogeneous data support is good for data warehouse


Tandem

Message-based OS (Guardian):
  (1) location transparency
  (2) fault isolation (failover to other nodes).

Expand software: 255 Systems WAN

Classic shared-nothing system (like Teradata except applications run inside DB machine).

[Figure: 4 node System, 224 PROCESSORS; 1-16 MIPS R4400 cpus, dual port controllers, dual 30MB/s LAN; link labels: 8 x 1MB/s, 30MB/s]

1974-1985: Encompass: Fault-tolerant Distributed OLTP
1986: NonStopSQL: First distributed and high-performance SQL (200 tps)
1989: Parallel NonStopSQL: Parallel query optimizer/executor
1994: Parallel and Online SQL (utilities, DDL, recovery, ....)
1995: Moving to ServerNet: shared disk model


Tandem Data Layout

Each table or index range partitioned to a set of disks (anywhere in network)

Index is B-tree per partition
  clustering index is B+ tree

Table fragments are files (extent based).

Descriptors for all local files live in local catalog (node autonomy)

Tables can be distributed in network (lan or wan)

Duplexed disks and disk processes for failover

[Figure: File = {parts}; labels: Partition, Block; Extents may be added]


Tandem Software (Process Structure)

[Figure: process structure]
  C/COBOL/.. Application, GUI, Utilities, Query Compiler
  SQL engine: Joins, Sorts, global aggs, triggers, index maintenance, views, security
  Transactions, Helper Processes
  Disk Server Pair (per data partition): Selects, Update, Delete, Record/Set insert, Aggregates, Assertions, Locking, Logging, buffer pool
  Disk Pair or Array

Hardware & OS move data at 4MB/s with >1 ins/byte


OLTP Features

Insert / Update / Delete index in parallel with base table
  If 5 indices, 5x faster response time.

Record and key-value range locking, SQL92 isolation levels

Undo scanner per log: double-buffers undo to each server

21 k tpc-C (WOW!!) with 110 node server (800GB db)

Can mix OLTP and batch.
Priority serving to avoid priority inversion problem

Buffer management prevents sequential buffer pollution


Tandem Query Plan & Execution

[Figure: Application, SQL subsystem, Executors, Disk Servers]

Simple selects & aggregates done in disk servers

Parallelism chosen:
  scan: table fragmentation
  hash: # processors or outer table fragments

Sorts: redistribution, sort in executors (N-M)
Joins done in executors (nest, sort-merge, hash).

Redistribution is always a hash (minimize skew)
Pipeline as deep as possible (use lots of processes)

Multiple logs & parallel UNDO avoid bottlenecks

Can mix OLTP and batch.
Priority serving to avoid priority inversion problem

Buffer management prevents sequential buffer pollution


Parallel Operators

Initially just inserted rivers between sequential operators
Parallel query optimizer
Created executors at all clustering nodes or at all nodes, repartitioned via hash to them
Gave parallel select, insert, update, delete, join, sort, aggregates,...
  correlated subqueries are blocking

Got linear speedup/scaleup on Wisconsin.
Marketing never noticed, product slept from 1989-1993

Developers added:
  Hash Join
  aggregates in disk process
  SQL92 features
  parallel utilities
  online everything
  converted to MIPSco
  fixed bugs


Join Strategies

Nested loop
Sort merge
Both can work off index-only access
Replicate small to all partitions (when one small)
Small-table Cartesian product large-table optimization
Now hybrid-hash join
  uses many small buckets
  tuned to memory demand
  tuned to sequential disk performance
  no bitmaps because (1) parallel hash (2) equijoins usually do not benefit

When both large, and unclustered (rare case)
  N+M scanners, 16 catchers: sortmerge or hybrid hash


Administration (Parallel & Online everything)

All utilities are online (claim to reduce outages by 40%):
  Add table, column,...
  Add index:
    builds index from stale copy
    uses log for catchup
    in final minute, gets lock, completes index.
  Reorg B-tree while it is accessed
  Add / split / merge / reorg partition
  Backup
  Recover page, partition, file.
  Add, alter logs, disks, processors, ...

You need this: Terabyte operations take a long time!

Parallel Utilities:
  load (M to N)
  index build (M scanners, N inserters, in background)
  recovery:


Benchmarks

No official DSS benchmark reports

Unofficial results
  1 to 16 R4400 class processors, 64MB each (Himalayas)
  3 disks, 3 ctlrs each

                        Sequential             16x Parallel
                        rec/s      MB/s        rec/s      MB/s       speedup
Load Wisc               1.6 kr/s   321 Kb/s    28 kr/s    5.4 MB/s   16
Parallel Index build    1.5 kr/s   15Kb/s      24 kr/s    240 KB/s   16
SCAN                    28 kr/s    5.8 MB/s    470 kr/s   94 MB/s    16 !!!!!!!
Aggregate (1 col)       25 kr/s    4.9 MB/s    400 kr/s   58 MB/s    16
Aggregate (6 col)       18 kr/s    3.6 MB/s    300 kr/s   60 MB/s    16
2-Way hash Join         13 kr/s    2.6 MB/s    214 kr/s   42 MB/s    16
3-Way hash Join         ? kr/s     ? Mb/s      ? kr/s     ? MB/s     ?

1x and 16x rates are best I’ve seen anywhere.


Tandem Good Things

21 K TPM-C (WOW!)

It is available TODAY!

Online everything

Fault tolerant, distributed, high availability

Mix OLTP and batch

Great Hash Join Algorithm

Probably the best peak performance available


Oracle

Parallel Server (V7):
  Multiple threads in a server
  Multiple servers in a cluster
  Client/server, OLTP & clusters (TP lite)

Parallel Query (V7.1)
  Parallel SELECT (and sub-selects)

Parallel Recovery: (V7.1)
  @ restart, one log scanner, multiple redoers

Beta in 1993, Ship 6/94.
More Parallel (create table): V7.2, 6/95

Shared disk implementation ported to most platforms

Parallel SELECT (no parallel INSERT, UPDATE, DELETE, DDL) except for sub-selects inside these verbs.


Oracle Data Layout

[Figure: labels: Table or Index, Segment, Block, Extents (extents may be added), Table Space = File Set]

Homogenous:
  one table (index) per segment
  extents picked from a TableSpace

Files may be raw disk
Segments are B-trees or heaps.

data -> disk map is automatic
No range / hash / round-robin partitioning

ROWID can be used as scan partitioning on base tables.

Guiding principle:
  If it’s not organized, it can’t get disorganized, and doesn’t need to be reorganized.


Oracle Parallel Query Product Concept

Convert serial SELECT plan to parallel plan

If Table scan or HINT then consider parallel plan
Table has default degree of parallelism (explicitly set)
  Overridden by system limits and hints.
  Use max degree of all participating tables.
Intermediate results are hash partitioned
Nested Loop Join and Merge Join

User hints can (must?) specify join order, join strategy, index, degree of parallelism,...

[Figure: Client, Query Coordinator, multi-process & thread DB]


Query Planning

Query Coordinator starts with Oracle Cost-Based plan

If plan requests Table scan or HINT then consider parallel plan

Table has default degree of parallelism (explicitly set)
  Overridden by system limits and hints.
  Use max degree of all participating tables.

Shared disk makes temp space allocation easy

Planner picks degree of parallelism and river partitioning.

Proud of their OR optimization.


Query Execution

Coordinator does extra work to
  merge the outputs of several sorts
    subsorts pushed to servers
  aggregate the outputs of several aggregates
    aggregates pushed to servers

Parallel function invocation is potentially a big win.

SELECT COUNT ( f(a,b,c,...)) FROM T;

Invokes function f on each element of T, 100x parallel.
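The shape of that query can be sketched as below: each server applies `f` to its share of the rows and counts the non-NULL results (as SQL `COUNT(expr)` does), and the coordinator adds up the sub-counts. The function name and `degree` parameter are hypothetical, and the thread pool only illustrates the structure, since CPython threads do not run CPU-bound code in parallel.

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_count_f(rows, f, degree=4):
    """COUNT(f(a, b, c, ...)) over rows, with f invoked in `degree` parallel streams."""
    chunks = [rows[i::degree] for i in range(degree)]   # one share of T per server

    def local_count(chunk):
        # COUNT(expr) counts only rows where the expression is not NULL (None here).
        return sum(1 for row in chunk if f(*row) is not None)

    with ThreadPoolExecutor(max_workers=degree) as pool:
        return sum(pool.map(local_count, chunks))       # coordinator adds sub-counts
```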


Join Strategies

Oracle has (1) Nested Loop Join (2) Merge Join

Replicate inner to outer partition automatic in shared disk (looks like partition outer).

Has small-table large-table optimization (Cartesian product join)

User hints can specify join order, join strategy, index, degree of parallelism,...


Transactions & Recovery

Transactions and transaction save points (linear nest).

ReadOnly snapshots for decision support.

SQL92 isolation levels (ACID = Snapshot isolation)

Database has multiple rollback segments (UNDO log),

Transaction has one commit/REDO log so may be a bottleneck

Parallel recovery at restart:
  One log scanner,
  DEGREE REDO streams, typically one per disk
  INSTANCE REDO streams, typically two-deep per disk


Oracle Utilities

User can write parallel load / unload utility

Index build, Constraints, are separate steps
  Not incremental or online or restartable.

Update Statistics (Analyze) is not parallel
Index build is N-1 parallel: N scanner/sorter, 1 inserter.
Parallel recovery at restart:
  One log scanner,
  DEGREE REDO streams, typically one per disk
  INSTANCE REDO streams, typically two-deep per disk

Administration

Not much special:
  Limit degree of parallelism at a server
  Set default parallelism of a table
  Query can only lower these limits

No special tools, meters, monitors,...
Just ordinary Parallel Server


Benchmarks

Sequent 20x 50MHz 486, .5GB RAM, 20 disk

                      Sequential             20x Parallel
                      rec/s      KB/s        rec/s      MB/s        speedup
Load 5M Wisc          .5 kr/s    113 KB/s    8.8 kr/s   1.8 MB/s    16
Parallel Index load   2.2 kr/s   18 KB/s     29 kr/s    235 KB/s    13
SCAN                  1.7 kr/s   364 KB/s    26 kr/s    5.3 MB/s    15
Agg MJ                3.3 kr/s   660 KB/s    45 kr/s    9.3 MB/s    14
Agg NJ                1.4 kr/s   290 KB/s    26 kr/s    5.4 MB/s    19

Same benchmark on 16x SP1 (a shared nothing machine) got similar results.

168x N-cube (16MB/node), 4 lock nodes, 64 disk nodes got good scaleup

Oracle has published details on all these benchmarks.

Sept 1994 news: 20 Pentium, 40 disk system, SCAN at 44 MB/s, 55% cpu


Oracle Good Things

Available now!

Parallel Everywhere (on everybody’s box)

A HIGH FUNCTION SQL

No restrictions (triggers, indices,...)

Very easy to use (almost no knobs or options)

Parallel invocation of stored procedures

Near-linear scaleup and speedup of SELECTs.

Respectable performance on Sequent


Informix

DSA (Dynamic Scalable Architecture) describes redesign to thread-based, server-based system.
– V6 (1993): DSA, rearchitecture (threads, OLTP focus)
– V7 (1994): PDQ, Parallel Data Query (SMP)
– V8 (1995): XMP, cluster parallelism (shared disk/nothing)

Parallelism is a MAJOR focus now that SQL92 under control

Other major focus is TOOLS (ODBC, DRDA, NewEra 4GL).

Informix is a UNIX SQL system: AIX (IBM), HP/UX (HP), OSF/1 (DEC, HP), SCO/UNIX, Sequent/DYNIX, SUN (SunOS, Solaris)

Today shared nothing parallelism on IBM SP2, ATT3650, ICL, (beta)


Informix Data Layout

[Figure labels: DBspace, Block, File; chunks may be added]

Table or index maps to homogeneous set of DBspaces; each contains “chunks” (extents)

Partition by: range, round robin, expression, hash (V8)

Access via B+Tree, B* tree, and hash (V8)

Built an extent-based file system on raw disks or files

High speed sequential, clustering, async IO,...


Informix Execution

Completely parallel DML, some parallel DDL

Parallel SELECT, UPDATE, DELETE
– Executor per partition in all cases.

Parallel sort, joins (nest, merge, hash), aggregates, union

Whenever an operator has input and a free output buffer, it can work to fill the output buffer.
– Natural flow control

Blocking operators (sort, hash join, aggregates, correlated subqueries)
– Spool to a buffer (if small), else spool to disk.

Shared buffer pool minimizes data copies.

[Figure: scan operators feeding M join operators that return results to the Client; labels: Buffer Pool, Virtual Processes, helpers]
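The "input plus a free output buffer" rule is what bounded queues provide. The sketch below is an illustration with Python threads, not Informix code; the operator names and the two-slot buffer size are assumptions.

```python
import queue
import threading

DONE = object()  # sentinel marking end of stream

def scan(rows, out):
    """Producer: put() blocks whenever the output buffer is full."""
    for row in rows:
        out.put(row)
    out.put(DONE)

def select_even(inp, out):
    """Filter operator: runs only while it has input and a free output slot."""
    while True:
        row = inp.get()
        if row is DONE:
            out.put(DONE)
            return
        if row % 2 == 0:
            out.put(row)

def run_pipeline(rows, buffer_slots=2):
    a = queue.Queue(maxsize=buffer_slots)
    b = queue.Queue(maxsize=buffer_slots)
    threads = [threading.Thread(target=scan, args=(rows, a)),
               threading.Thread(target=select_even, args=(a, b))]
    for t in threads:
        t.start()
    result = []
    while True:                      # the client drains the final buffer
        row = b.get()
        if row is DONE:
            break
        result.append(row)
    for t in threads:
        t.join()
    return result
```

A slow consumer fills its input queue, which stalls the producer upstream: flow control falls out of the buffer limits with no extra protocol.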


Parallel Plans

Query plan is parallelized by
– scanner per table partition (does select, project)
– sub-aggregates per partition (hash or sort)

If clustered join (nested loop or merge), then operator per outer or per partition

If hash join, parallel scan smaller first, build bitmap and hash buckets, then scan larger and:
– join to smaller if it fits in memory
– else filter via bitmap and build larger buckets, then join bucket by bucket

Hybrid hash join with bitmaps and bucket tuning.
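A minimal sketch of the in-memory case (the smaller input fits in memory) shows how the bitmap filters the larger input before any bucket lookup. It is not Informix's hybrid hash join: the spill-to-disk path and bucket tuning are omitted, and the function name and 64-bit bitmap size are assumptions.

```python
def hash_join_with_bitmap(small, large, key_small, key_large, bitmap_bits=64):
    """Join two lists of dicts on a key, pre-filtering the larger input with a bitmap."""
    bitmap = 0
    buckets = {}
    # Build phase: scan the smaller input, set one bitmap bit per key hash,
    # and file each row into its hash bucket.
    for row in small:
        k = row[key_small]
        bitmap |= 1 << (hash(k) % bitmap_bits)
        buckets.setdefault(k, []).append(row)
    # Probe phase: scan the larger input; the bitmap discards rows that
    # cannot match before they reach a bucket lookup.
    out = []
    for row in large:
        k = row[key_large]
        if not (bitmap >> (hash(k) % bitmap_bits)) & 1:
            continue
        for match in buckets.get(k, []):
            out.append({**match, **row})
    return out
```

In the spilling case the same bitmap pays off more, since filtered rows are never written to the larger input's buckets.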


Parallel Operators

Parallel SELECT, UPDATE, DELETE
– Executor per partition in all cases.

Parallel sort, joins, aggregates, union

Only correlated subqueries are blocking

Completely parallel DML, some parallel DDL


Transactions & Recovery

SQL 2 isolation levels allow DSS to run in background

Transaction save points

Separate logical and physical logs.

Bulk updates could bottleneck on single log.

Recovery unit is data partition (DBspace)

Parallel recovery: thread per DBspace

If DB fragment unavailable, DSS readers can skip it


Informix Administration

Can assign % of processors, memory, IO to DSS (parallel query)

The sum of all parallel queries lives within this quota

Each query can specify the % of the total that it wishes (0 means sequential execution).

Utilities
– Parallel data load (SMP only)
– Parallel index build (N-M)
– Parallel recovery
– Online backup / restore


Benchmarks

Sequent system:
– 9 Pentium processors
– 1 GB main memory
– Base tables on 16 disks (FWD SCSI)
– Indices on 10 disks
– Temp space on 10 disks

Load 300M Wisc: 3 kr/s, 600 KB/s
Parallel Index load: 48 kr/s, 1 MB/s

                    Sequential            Parallel
                    rec/s     MB/s        rec/s      MB/s       speedup
SCAN                17 kr/s   3.5 MB/s    147 kr/s   30 MB/s    8.3
Aggregate           11 kr/s   2.3 MB/s    113 kr/s   23 MB/s    10.1
2-Way hash Join     18 kr/s   3.2 MB/s    242 kr/s   31 MB/s    9.7
3-Way hash Join     25 kr/s   3.5 MB/s    239 kr/s   33 MB/s    9.5


Informix Shared Nothing Benchmark

IBM SP2: TPC-D-like database, 48 SP2 processors
Customer benchmark, not an audited benchmark.

Load
– 60 GB in 40 min
– 250 GB in 140 min
– about 100 GB/hr, or 2 GB/node/hr

Scan & Aggregate (#6)
– 60 GB in 7 min = 140 MB/s = 3 MB/s/node = 30 kr/s
– 260 GB in 24 min = 180 MB/s = 4 MB/s/node = 40 kr/s

Power Test (17 complex queries and 2 load/purge ops)
– 60 GB in 5 hrs
– 260 GB in 18 hrs

Multiuser Test
– 1 user, 12 queries: 10 hrs
– 4 users, 3 queries: 10 hrs
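The per-node rates quoted above follow from the totals divided by the 48 processors. A small worked check, assuming 1 GB = 1000 MB (the helper names are illustrative only):

```python
NODES = 48  # SP2 processors quoted on the slide

def rate_mb_per_s(gigabytes, minutes):
    """Aggregate throughput in MB/s, taking 1 GB = 1000 MB."""
    return gigabytes * 1000 / (minutes * 60)

def load_gb_per_hr(gigabytes, minutes):
    """Aggregate load rate in GB/hr."""
    return gigabytes * 60 / minutes

scan_60 = rate_mb_per_s(60, 7)        # roughly 143 MB/s in total
scan_260 = rate_mb_per_s(260, 24)     # roughly 181 MB/s in total
load_250 = load_gb_per_hr(250, 140)   # roughly 107 GB/hr in total

scan_60_per_node = scan_60 / NODES    # roughly 3 MB/s/node
scan_260_per_node = scan_260 / NODES  # roughly 3.8 MB/s/node, quoted as 4
load_250_per_node = load_250 / NODES  # roughly 2.2 GB/node/hr, quoted as 2
```

The slide's figures are these values rounded; the per-node scan rate rising from 3 to about 4 MB/s on the larger database is the scaleup claim in miniature.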


Informix Good Things

A full function SQL

Available today on Sequent

Beautiful manuals

Linear speedup and scaleup

Best published performance on UNIX systems

Probably best price performance.

(but things are changing fast!)

Some mechanisms to mix OLTP and batch.


Sybase Navigator

Product concept

Two-layer software architecture:
– (1) Navigator drives array of shared-nothing SQL engines.
– (2) Array of SQL engines, each unaware of others.
  similar to Tandem disk processes
  SQL engine is COTS.

Goal: linear scaleup and speedup, plus good OLTP support

Emphasize WHOLE LIFECYCLE
– Configurator: tools to design a parallel system
– Administrator: tools to manage a parallel system (install/upgrade, start/stop, backup/restore, monitor/tune)
– Optimizer: execute requests in parallel.

[Figure: array of SQL engines]


Configurator

Fully graphical design tool

Given
– ER model and dataflow model of the application
– workload characteristics
– response time requirements
– hardware components
(heavy into circles and arrows)

Recommends
– hardware configuration
– table definitions (SQL)
– table partitioning


Administrator

Made HUGE investments in this area.

Truly industry-leading graphical tools make MPP configuration “doable”.

GUI interface to manage:
– startup / shutdown of cluster
– backup / restore / manage logs
– configure (install, add nodes, configure and tune servers)
– manage / consolidate system event logs
– system stored procedures (global operations), e.g. aggregate statistics from local to global cat
– monitor SQL Server events


Data Layout

Pure shared nothing

Navigator partitions data among SQL servers
• map to a subset of the servers
• range partition or hash partition.

Secondary indices are partitioned with base table
No unique secondary indices
Only shorthand views, no protection views
Schema server stores global data definition for all nodes.

Each partition server has
– schema for its partition
– data for its partition
– log for its partition


Sybase SQL Server Backgrounder

Recently became SQL89 compliant (cursors, nulls, etc.)
Stored procedures, multi-threaded, internationalized
B*-tree centric (clustering index is B+tree)
Use nested loops, sort-merge join (sort is index build).
Page locking, 2K disk IO, ... other little-endian design decisions.
Respectable TPC-C results (AIX RS/6000).
UNIX raw disks or files are base (also on OS/2, NetWare, ...).

Table->disk mapping:

CREATE DATABASE name ON {device...} LOG ON {device...}
SP_ADDSEGMENT segment, device
CREATE TABLE name(cols) [ON segment]

Microsoft has a copy of the code, deep ported to NT


Navigator Extension Mechanisms

Navigator extended Sybase TDS by
– adding stored procedures to do things
– extending the syntax (e.g. see data placement syntax below)

Sybase TDS and OpenServer design are great for this
– All “front ends based on OpenServer and threads”


Process Structure - Pure Shared Nothing

[Figure: process structure. Components: Clients, GUI Navigator Manager, Control server (1/node), SQL servers, Split servers, DBA server (system manager, monitor & SQL optimizer), schema server (catalogs database in a SQL server)]

DBA Server does everything:
– SQL compilation
– System management
– Catalog management
– SQL server restart (in 2nd node)

DBA fallback detects deadlock, does DBA takeover on fail

Control server at each node manages SQL servers there (security, request caching, 2PC, final merge / aggregate, ..., parallel stored procedures (SIMD))

Split server manages re-partitioning of data

SQL Server is unit of query parallelism (one per cpu per node)


Simple Request Processing

[Figure: Client, Control server (1/node), SQL and Split servers, DBA server, schema server]

Client connects to Navigator (a Control Server) using standard Sybase TDS protocol.

SQL request flows to DBA server that
– compiles it
– sends stored procedures (plans) to all control servers
– sends plans to all relevant SQL servers

Control server executes plan: passes it to SQL server, returns results.

Plan is cached, so on second call the DBA server is not invoked. Good for OLTP.
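The effect of plan caching can be shown with a toy control server. Everything here is hypothetical (class name, callback shapes, keying the cache on SQL text); it only illustrates why the second call skips the DBA server.

```python
class ControlServer:
    """Hypothetical control server that caches compiled plans by SQL text."""

    def __init__(self, compile_plan):
        self._compile = compile_plan   # stands in for the round trip to the DBA server
        self._plans = {}
        self.compilations = 0          # how many times the DBA server was invoked

    def execute(self, sql, run):
        plan = self._plans.get(sql)
        if plan is None:               # first call: ask the DBA server to compile
            plan = self._compile(sql)
            self._plans[sql] = plan
            self.compilations += 1
        return run(plan)               # later calls reuse the cached plan
```

For short OLTP requests the compile round trip would dominate, which is why removing it on repeat calls matters.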


Parallel Request Processing

[Figure: Client, DBA server, and schema server, with Control, SQL, and Split servers on several nodes]

If query involves multiple nodes, then command sent to each one (diagram shows secondary index lookup)

Query sent to SQL servers that may have relevant data.

If data needs to be redistributed or aggregated, split servers issue queries and inserts (that is their only role).

Split servers have no persistent storage.


Data Manipulation

SQL server is unit of parallelism

"Parallelized EVERYTHING in the T-SQL language"
– Includes SIMD execution of T-SQL procedures, plus N-M data move operations.

Two-level optimization:
– DBA Server has optimizer (BIG investment, all new code, NOT the infamous Sybase optimizer)
– Each SQL server has Sybase optimizer
– If extreme skew, different servers have different plans
– DBA optimizer shares code with SQL server (so they do not play chess with one another).

Very proud of their optimizer.


Query Execution

Classic Selinger cost-based optimizer.
SELECT, UPDATE, DELETE N-to-M parallel
Bulk and async INSERT interface.
N-M parallel sort
Aggregate (hash/sort)
Select and join can do index-only access if data is there.
Eliminate correlated subqueries (convert to join) (Ganski & Wong, SIGMOD 87, extended)

Join: nested-loop, sort-merge, index only
– Sybase often dynamically builds index to support nested loop (fake sort-merge)
– Typically left-deep sequence of binary joins.


Join and Partition Strategy

Partition strategies
– If already partitioned on join, then no splitting
– Else move subset of T1 to T2 partitions,
– or replicate T1 to all T2 partitions,
– or repartition both T1 and T2 to width of home nodes or target.

No hash join, but all (re)partitioning is range or hash based.

Not aggressive parallelism/pipelining: 2 ops at a time.
Pipeline to disk via split server (not local to disk and then split).
Split servers fake subtables for SQL engines.
Top-level aggregates merged by control, others done by split.
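The "repartition both T1 and T2" strategy can be sketched as below. This is an illustration under stated assumptions, not Navigator code: the function names are hypothetical, hashing stands in for the split servers' routing, and the local join is a nested loop because the slide says the engines have no hash join.

```python
def repartition(rows, key, width):
    """Split-server role: route each row to one of `width` partitions by hash of the join key."""
    parts = [[] for _ in range(width)]
    for row in rows:
        parts[hash(row[key]) % width].append(row)
    return parts

def nested_loop_join(t1, t2, key1, key2):
    """Local join inside one SQL engine (nested loop)."""
    return [{**a, **b} for a in t1 for b in t2 if a[key1] == b[key2]]

def partitioned_join(t1, t2, key1, key2, width):
    p1 = repartition(t1, key1, width)
    p2 = repartition(t2, key2, width)
    result = []
    for i in range(width):   # each pair (p1[i], p2[i]) could run on its own server
        result.extend(nested_loop_join(p1[i], p2[i], key1, key2))
    return result
```

Rows with equal join keys always hash to the same partition, so the union of the local joins equals the full join while each engine stays unaware of the others.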


Utilities

Bulk data load (N-M) async calls

GUI manages

Backup all SQL servers in parallel

Reorg via CREATE TABLE <new> , INSERT INTO <new> SELECT * FROM <old>

Utilities are mostly offline (as per Sybase)

Nice EXPLAIN utility


Futures

Hash join within split servers

Shared memory optimizations

Full support for unique secondary indices

Full trigger support (cross-server triggers)

Full security and view support.


Benchmarks

Preliminary: 8x8 3600 - Ynet.
– node: 8 x (50MHz 486, 256k local cache), 512MB main memory, 2 x 10 disk arrays @ 2GB, 4 MB/s per disk
– 6 x Sybase servers

Scaleup & speedup tests of 1, 4, and 8 nodes.
Numbers (except loading) reported as ratios of elapsed times

S&S tests show a >7x speedup of 8-way over 1-way

Tests cover insert, select, update, delete, join, aggregate, load

Reference Account: Chase Manhattan Bank
– 14x8 P5 ATT 3600 cluster (112 processors)
– 56 SQL servers, 10GB each = 560 GB
– 100x faster than DB2/MVS (minutes vs days)

Linearity is great.


Navigator Good Things

Concern for lifecycle: design, install, manage, operate, use

Good optimization techniques

Fully parallel, including stored procedures!

Scaleup and Speedup are near linear.


Sybase IQ

Sybase bought Expressway

Expressway evolved from Model 204

bitmap technology: index duplicates with bitmap

compress bitmap.

Can give 10x or 100x speedup.

Can save space and IO bandwidth

Currently, two products (Sybase and IQ) not integrated
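The bitmap idea can be illustrated with Python integers used as bit vectors. This is a generic sketch, not Sybase IQ's format: compression is left out, and the function names and sample columns are assumptions.

```python
def build_bitmap_index(column):
    """Map each distinct value to an int whose bit i is set when row i holds that value."""
    index = {}
    for row_id, value in enumerate(column):
        index[value] = index.get(value, 0) | (1 << row_id)
    return index

def rows_of(bitmap):
    """Decode a bitmap back into a sorted list of row ids."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]

region = ["east", "west", "east", "north", "west"]
status = ["open", "open", "closed", "open", "closed"]
region_idx = build_bitmap_index(region)
status_idx = build_bitmap_index(status)

# WHERE region = 'west' AND status = 'open' becomes one bitwise AND.
west_and_open = rows_of(region_idx["west"] & status_idx["open"])
# WHERE region IN ('east', 'north') becomes one bitwise OR.
east_or_north = rows_of(region_idx["east"] | region_idx["north"])
```

Duplicate values cost one bit per row instead of one index entry per row, which is where the space and IO savings come from on low-cardinality columns.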


DB2

DB2/VM = SQL/DS: System R gone public

DB2/MVS (classic Parallel Sysplex, Parallel Query Server, ...)
– Parallel and async IO into one process (on mainframe)
– Parallel execution in next release (late next year?)
– MVS PQS now withdrawn?

DB2/AS400: Home grown

DB2/2 PE: OS2/DM grown large.
– First moved to AIX
– Being extended with parallelism
– Parallelism based on SP/2, shared nothing done right.
– Benchmarks today, beta everywhere

DB2++: separate code path, has OO extensions, good TPC-C
– Ported to HP/UX, Solaris, NT in beta


DB2/2 Data Layout

• DATABASE: a collection of nodes (up to 128 SP2s so far)

• NODEGROUP: a collection of logical nodes (a 4k hash map)

• LOGICAL NODE: A DB2 instance (segments, log, locks...)

• PHYSICAL NODE: A box.

• Logical Node: Segments of 4 k pages

– Segments allocated in units (64K default)

– Tables stripe across all segments

• Table created in NodeGroup:

– Hash (partition key) across all members of group

• Cluster has single system Image

[Figure: segments on nodes, with the nodes divided into Group 1 and Group 2]
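The nodegroup's 4k hash map adds one level of indirection between a partition key and a node. A minimal sketch, assuming a CRC32 hash and round-robin bucket assignment (neither is stated on the slide; the function names are hypothetical):

```python
import zlib

MAP_SIZE = 4096  # the "4k hash map" from the slide

def build_hash_map(nodes):
    """Assign the 4096 hash buckets to the nodes of a nodegroup, round robin (an assumption)."""
    return [nodes[i % len(nodes)] for i in range(MAP_SIZE)]

def node_for(partition_key, hash_map):
    """Hash the partition key to a bucket, then look the bucket up in the map."""
    bucket = zlib.crc32(str(partition_key).encode("utf-8")) % MAP_SIZE
    return hash_map[bucket]
```

Because rows are bound to buckets rather than directly to nodes, adding or dropping a node only means reassigning some map entries and moving those buckets.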


DB2/2 Query Execution

• Each node maintains pool of AIX server processes

• Query optimizer does query decomposition to node plans (like R* distributed query decomposition)

• Parallel optimization is one-phase (not like Wei Hong’s work)

• Sends sub-plans to nodes to be executed by servers

• Node binds plan to server process

• Intermediate results hashed

• Proud that Optimizer does not need hints.

• “Standard” join strategies (except no hash join).


DB2/2 Utilities

• 4 loaders:

– import

– raw-insert (fabricates raw blocks, no checks)

– insert

– bulk insert

• Reorganize hash map, add / drop nodes, add devices

– Table unavailable during these operations

• Online & Incremental backup

• Fault tolerance via HACMP


DB2/2 Performance: Good Performance, Great Scaling

Wisconsin scaleups

big = 4.8 M rec = 1 GB

small = 1.2 M rec = 256MB

scan rate ~12 kr/s/node

raw load: 2.5 kr/s/node

see notes for more data

[Figure: Speedup vs Nodes, DB2/2 PE on SP2. x-axis: nodes (0 to 16); y-axis: speedup (0 to 25); series: Load, Scan, Agg, SMJ, NLJ, SMJ2, Index1, Index2, MJ]


DB2/2 Good Things

• Scaleable to 128 nodes (or more)

• From IBM

• Good performance

• Complete SQL (update, insert,...)

• Will converge with DB2/3 (OO and TPC-C stuff)

• Will be available off AIX someday

– (AIX is slow and SP2 is very expensive)


RedBrick

• Read-only (LOAD then SELECT only) Database system

– Load is incremental and sophisticated

• Precompute indices to make small-large joins run fast

– Indices use compression techniques.

– Only join via indices

• Many aggregate functions to make DSS reports easy

• Parallelism:

– Pipeline IO

– Typically a thread per processor (works on index partition)

– Piggyback many queries on one scan

– Parallel utilities (index in parallel, etc)

– SP2 implementation uses shared disk model.
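Piggybacking many queries on one scan can be sketched as a single pass that feeds every query's running state. This is a generic illustration, not RedBrick's implementation; the function name and the (predicate, initial value, fold) query shape are assumptions.

```python
def shared_scan(table, queries):
    """Run several aggregate queries in one pass over the table.

    `queries` maps a name to (predicate, initial value, fold function).
    """
    state = {name: init for name, (_, init, _) in queries.items()}
    for row in table:                       # the table is read exactly once
        for name, (pred, _, fold) in queries.items():
            if pred(row):
                state[name] = fold(state[name], row)
    return state

sales = [{"region": "east", "amt": 10},
         {"region": "west", "amt": 5},
         {"region": "east", "amt": 7}]
answers = shared_scan(sales, {
    "east_total": (lambda r: r["region"] == "east", 0, lambda acc, r: acc + r["amt"]),
    "count_all": (lambda r: True, 0, lambda acc, r: acc + 1),
})
```

The IO cost is paid once however many queries ride along, so the scan rate is shared rather than divided among concurrent DSS reports.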


Summary

There is a LOT of activity

(many products coming to market)

Query optimization is near the complexity barrier

Needs a new approach?

All have good speedup & scaleup if they can find a plan

Managing huge processor / disk / tape arrays is hard.

I am working on commoditizing these ideas:
– low $/record/sec (scaleup PC technology)
– low Admin $/node (automate, automate, automate, ...)
– Continuous availability (online & fault tolerant)