Topics 9-10: Database optimization for modern machines

Dr. N. Mamoulis Advanced Database Technologies

Computer architecture is changing over the years, so we have to change our programmes, too!

Most database operators, indexes, buffering techniques and storage schemes were developed and optimized over 20 years ago, so we need to re-tune them for the modern reality.

Moore’s Law – Dictionary Term

http://info.astrian.net/jargon/terms/m/moore_s_law.html

Moore's Law /morz law/ prov. The observation that the logic density of silicon integrated circuits has closely followed the curve (bits per square inch) = 2^(t - 1962), where t is time in years; that is, the amount of information storable on a given amount of silicon has roughly doubled every year since the technology was invented. This relation, first uttered in 1964 by semiconductor engineer Gordon Moore (who co-founded Intel four years later), held until the late 1970s, at which point the doubling period slowed to 18 months. The doubling period remained at that value through time of writing (late 1999). Moore's Law is apparently self-fulfilling. The implication is that somebody, somewhere is going to be able to build a better chip than you if you rest on your laurels, so you'd better start pushing hard on the problem.

Features of Modern Machines and Future Trends

CPU speed, memory bandwidth, and disk bandwidth follow Moore’s Law. On the other hand, memory latency improves by only 1% per year.

Memories are becoming larger, and slower relative to memory bandwidth and CPU speed.

Most of the time the processor of your PC is idle, waiting for data to be fetched from memory.

The role of main memory caches is becoming very important in the overall performance.

Features of Modern Machines and Future Trends (cont’d)

Disk bandwidth also follows Moore’s Law.
On the other hand, disk seek time improves by only 10% per year.
Thus a random disk access today may cost as much as 20-50 sequential accesses.

Features of Modern Machines and Future Trends (cont’d)

Another characteristic of modern machines is that their processors (AMD Athlon, Intel P4) have parallel processing units that can process multiple (e.g., 5-9) independent instructions at the same time.

The new bottleneck: Memory Access

Memory is now hierarchical: two levels of caching

[Figure: the memory hierarchy. Components: CPU, L1 cache, L2 cache, main memory, CPU die. Transfer units: L1 cache-line, L2 cache-line, memory page.]

Memory-latency: the time needed to transfer 1 byte from the main memory to the L2 cache.
Cache (L2) miss: occurs if the requested data is not in the cache and needs to be fetched from main memory.
Cache-line: the transfer unit from main memory to cache (e.g., L2 cache-line = 128 bytes).

Example of Memory Latency Effects in Databases

Employee(id: 4 bytes, name: 20 bytes, age: 4 bytes, gender: 1 byte)

id     name          age  gender
12000  Mike Chan     30   M
12305  George Best   27   M
12503  Gina Gillian  45   F
13454  John Smith    39   M
53959  Sarah Croft   53   F
92123  James White   57   M

Disk/Memory representation:

| 12000 Mike Chan 30 M | 12305 George Best 27 M | ...
  29 bytes               29 bytes

Example of Memory Latency Effects in Databases (cont’d)

Example query: how many female employees work for the company?

SELECT COUNT(*)
FROM EMPLOYEE
WHERE gender = 'F';

What is the cost of this query, assuming that the relation is in main memory?
We have to access the gender values, compare them to 'F' and add 1 (or 0).
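
The scan described above can be sketched in C as follows. This is only an illustration, assuming the tuples are packed back to back with the field sizes of the Employee schema (4 + 20 + 4 + 1 = 29 bytes); the names nsm_write and nsm_count_gender are hypothetical helpers, not part of any DBMS.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Byte offsets inside one 29-byte Employee tuple (from the schema above). */
enum { ID_OFF = 0, NAME_OFF = 4, AGE_OFF = 24, GENDER_OFF = 28, TUPLE_SIZE = 29 };

/* Hypothetical helper: serialize one tuple into a 29-byte slot. */
void nsm_write(unsigned char *slot, int32_t id, const char *name,
               int32_t age, char gender)
{
    assert(strlen(name) <= 20);            /* name field is 20 bytes */
    memset(slot, 0, TUPLE_SIZE);
    memcpy(slot + ID_OFF, &id, sizeof id);
    memcpy(slot + NAME_OFF, name, strlen(name));
    memcpy(slot + AGE_OFF, &age, sizeof age);
    slot[GENDER_OFF] = (unsigned char)gender;
}

/* SELECT COUNT(*) FROM EMPLOYEE WHERE gender = g, over n packed tuples. */
size_t nsm_count_gender(const unsigned char *rel, size_t n, char g)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        /* Only 1 byte out of every 29 is actually needed by the query. */
        count += (rel[i * TUPLE_SIZE + GENDER_OFF] == (unsigned char)g);
    }
    return count;
}
```

The loop does very little CPU work per tuple (one comparison, one addition), which is why the memory accesses tend to dominate its cost.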

Example of Memory Latency Effects in Databases (cont’d)

Accessing a specific address from memory also loads the information around it into the cache-line (in parallel over a wide bus).
Example: Accessing the gender of the first tuple:

Main memory:  | 12000 Mike Chan 30 M | 12305 George Best 27 M | ...
                29 bytes               29 bytes
Cache:        cache-line (128 bytes)

Example of Memory Latency Effects in Databases (cont’d)

Thus a cache-line at a time is accessed from main memory.
If the requested information is already in cache (e.g., the gender of the second tuple), main memory is not accessed.
Otherwise, a cache-miss occurs.
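
To make the effect concrete, the following sketch counts how many distinct cache-lines a scan touches when it reads one field per tuple. It is a simplified model, not a measurement: it assumes the relation starts on a cache-line boundary, the field is a single byte, and every line touched is a miss. The function name cache_lines_touched is made up for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Number of distinct cache-lines touched when reading the 1-byte field at
 * offset field_off in each of n tuples of tuple_size bytes, stored
 * contiguously from a cache-line-aligned address. */
size_t cache_lines_touched(size_t n, size_t tuple_size,
                           size_t field_off, size_t line_size)
{
    assert(line_size > 0 && field_off < tuple_size);
    size_t lines = 0;
    size_t prev_line = (size_t)-1;          /* no line seen yet */
    for (size_t i = 0; i < n; i++) {
        size_t line = (i * tuple_size + field_off) / line_size;
        if (line != prev_line) {            /* addresses only increase */
            lines++;
            prev_line = line;
        }
    }
    return lines;
}
```

Under these assumptions, scanning the gender of 1000 Employee tuples (29 bytes each, gender at offset 28, 128-byte lines) touches 227 cache-lines, whereas 1000 gender values stored contiguously would occupy only 8.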

Effect of cache misses

The stride.c program demonstrates how cache misses may dominate the query processing cost.
http://www.csis.hku.hk/~nikos/courses/CSIS7101/stride.c

initialize a random memory buffer;
sum = 0;
ilimit = 200000; //tentative
for(i=0; i<ilimit; i+=stride)
    sum += buffer[i];
printf("sum=%d\n",sum);

[Figure: the buffer is traversed with stride=5; the visited elements are added to sum.]
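
The pseudocode can be turned into a small experiment along the following lines. This is a sketch written for this transcript, not the original stride.c: the function names and the use of clock() for timing are assumptions, and a meaningful measurement would need a buffer much larger than the caches and several repetitions.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Sum every stride-th byte of buffer[0..ilimit-1], as in the loop above. */
long sum_with_stride(const unsigned char *buffer, size_t ilimit, size_t stride)
{
    assert(stride > 0);
    long sum = 0;
    for (size_t i = 0; i < ilimit; i += stride)
        sum += buffer[i];
    return sum;
}

/* Average CPU time per accessed element, in nanoseconds.
 * The sum is returned through sum_out so the loop is not optimized away. */
double ns_per_access(const unsigned char *buffer, size_t ilimit,
                     size_t stride, long *sum_out)
{
    clock_t start = clock();
    *sum_out = sum_with_stride(buffer, ilimit, stride);
    clock_t end = clock();
    size_t accesses = (ilimit + stride - 1) / stride;   /* ceil(ilimit/stride) */
    if (accesses == 0)
        return 0.0;
    return 1e9 * (double)(end - start) / CLOCKS_PER_SEC / (double)accesses;
}
```

Calling ns_per_access with increasing stride values and comparing the results is one way to observe the cost per access growing as fewer of the accessed bytes share a cache-line.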

Effect of cache misses (cont’d)

Re-designing the DBMS

We need to redesign the DBMS in order to face the new bottleneck: memory access.
New storage schemes are proposed to minimize memory and disk accesses.
Query processing techniques are redesigned to take memory-latency effects into consideration.
Algorithms are changed to take advantage of the intra-parallelism of instruction execution.
New instruction types are used (e.g., SIMD instructions).

Problems in optimizing the DBMS

Programming languages do not have control over the replacement policy of memory caches. This policy is determined by the hardware.
As a result, we cannot apply buffer management techniques for memory cache management.
Also, simple instructions generated by programming languages cannot control the parallel instruction execution capabilities of the machine.
In many cases, we re-write the programs, trying to “fool” the cache replacement policy and the instruction execution in order to make them more efficient.

The N-ary storage model

The N-ary storage model (NSM) stores the information for each tuple as a sequence of bytes.

id     name          age  gender
12000  Mike Chan     30   M
12305  George Best   27   M
12503  Gina Gillian  45   F
13454  John Smith    39   M
53959  Sarah Croft   53   F
92123  James White   57   M

Disk/Memory representation:

| 12000 Mike Chan 30 M | 12305 George Best 27 M | ...
  29 bytes               29 bytes

A Decomposition Storage Model (DSM)

The table is vertically decomposed, and information for each attribute is stored sequentially.

id     name          age  gender
12000  Mike Chan     30   M
12305  George Best   27   M
12503  Gina Gillian  45   F
13454  John Smith    39   M
53959  Sarah Croft   53   F
92123  James White   57   M

a1:                    a2:           a3:
id     name            id     age    id     gender
12000  Mike Chan       12000  30     12000  M
12305  George Best     12305  27     12305  M
12503  Gina Gillian    12503  45     12503  F
13454  John Smith      13454  39     13454  M
53959  Sarah Croft     53959  53     53959  F
92123  James White     92123  57     92123  M

Disk/Memory representation:

a1: | 12000 Mike Chan | 12305 George Best | ...
a2: | 12000 30 | 12305 27 | ...
a3: | 12000 M | 12305 M | ...
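
As an illustration of the decomposition, the sketch below splits an array of whole tuples into the three binary tables a1, a2 and a3. The struct names and the function dsm_decompose are hypothetical, and the structs are ordinary C structs (with compiler padding) rather than the exact on-disk byte layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct { int32_t id; char name[21]; int32_t age; char gender; } Employee;
typedef struct { int32_t id; char name[21]; } A1;   /* (id, name)   */
typedef struct { int32_t id; int32_t age;   } A2;   /* (id, age)    */
typedef struct { int32_t id; char gender;   } A3;   /* (id, gender) */

/* Vertically decompose n tuples; each output array must have room for n rows.
 * The id is replicated in every binary table as the surrogate. */
void dsm_decompose(const Employee *rows, size_t n, A1 *a1, A2 *a2, A3 *a3)
{
    assert(rows && a1 && a2 && a3);
    for (size_t i = 0; i < n; i++) {
        a1[i].id = rows[i].id;
        memcpy(a1[i].name, rows[i].name, sizeof a1[i].name);
        a2[i].id = rows[i].id;
        a2[i].age = rows[i].age;
        a3[i].id = rows[i].id;
        a3[i].gender = rows[i].gender;
    }
}

/* A query on gender now scans only a3, not the whole tuples. */
size_t dsm_count_gender(const A3 *a3, size_t n, char g)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        count += (a3[i].gender == g);
    return count;
}
```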

Properties of DSM

The relation is decomposed into many binary tables, one for each attribute.
The attributes of a binary table are a surrogate id and the attribute from the original relation.
The surrogate id is necessary to bring back together the information for a tuple, by joining the binary tables.
Example: Print the information for the tuple with id=12305

NSM:

SELECT *
FROM EMPLOYEE
WHERE id=12305

DSM:

SELECT a1.id, name, age, gender
FROM a1, a2, a3
WHERE a1.id=12305 AND
      a2.id = a1.id AND
      a3.id = a1.id
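
The join in the DSM query can be sketched as a lookup of the same surrogate id in each binary table. This is a deliberately naive version using linear scans; the type and function names are hypothetical, and a real system would use an index or a positional lookup instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct { int32_t id; char name[21]; } A1;   /* (id, name)   */
typedef struct { int32_t id; int32_t age;   } A2;   /* (id, age)    */
typedef struct { int32_t id; char gender;   } A3;   /* (id, gender) */
typedef struct { int32_t id; char name[21]; int32_t age; char gender; } Employee;

/* Reconstruct the tuple with the given id from three binary tables of n rows
 * each (rows may be in any order). Returns 1 on success, 0 if the id is
 * missing from any of the tables. */
int dsm_reconstruct(const A1 *a1, const A2 *a2, const A3 *a3, size_t n,
                    int32_t id, Employee *out)
{
    assert(out != NULL);
    int found = 0;   /* bit 0: a1 matched, bit 1: a2 matched, bit 2: a3 matched */
    for (size_t i = 0; i < n; i++) {
        if (a1[i].id == id) {
            out->id = id;
            memcpy(out->name, a1[i].name, sizeof out->name);
            found |= 1;
        }
        if (a2[i].id == id) { out->age = a2[i].age;       found |= 2; }
        if (a3[i].id == id) { out->gender = a3[i].gender; found |= 4; }
    }
    return found == 7;
}
```

Each extra attribute in the result adds one more table to probe, which is the reconstruction cost that NSM avoids.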

Properties of DSM (cont’d)

Advantages of DSM
  If the relation has many attributes, but queries involve only few attributes, then:
    Much less information is read from disk than in NSM.
    Cache misses are fewer than in NSM, because the stride is smaller.
    Projection queries are very fast.

Disadvantages of DSM
  If a tuple needs to be reconstructed, this will require many joins.
  The size of the decomposed relation is larger than the original due to the replication of the surrogate key.

A Decomposition Model with Unary tables (Monet)

id     name          age  gender
12000  Mike Chan     30   M
12305  George Best   27   M
12503  Gina Gillian  45   F
13454  John Smith    39   M
53959  Sarah Croft   53   F
92123  James White   57   M

void  id       void  name            void  age    void  gender
1     12000    1     Mike Chan       1     30     1     M
2     12305    2     George Best     2     27     2     M
3     12503    3     Gina Gillian    3     45     3     F
4     13454    4     John Smith      4     39     4     M
5     53959    5     Sarah Croft     5     53     5     F
6     92123    6     James White     6     57     6     M

The void column (the surrogate) is not materialized.

Stored representation of the gender table: | 1 | 1 | M F ...
  first field: value of first surrogate
  second field: size of attribute type in bytes

Given the surrogate sid of a tuple, we can compute its attribute value v by:
v = *(table_address+(sid - first_surrogate)*size)
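
The address computation above maps directly to C pointer arithmetic. The sketch below is only an illustration of that formula; the UnaryTable struct and the value_addr function are names invented here, not Monet's actual data structures.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A unary table: a dense array of fixed-size values plus the two header
 * fields from the slide. The surrogate column itself is never stored. */
typedef struct {
    const unsigned char *table_address;   /* start of the value array  */
    uint32_t first_surrogate;             /* surrogate of the first value */
    size_t size;                          /* size of attribute type in bytes */
    size_t count;                         /* number of values (for bounds checks) */
} UnaryTable;

/* Address of the value belonging to surrogate sid:
 * table_address + (sid - first_surrogate) * size */
const void *value_addr(const UnaryTable *t, uint32_t sid)
{
    assert(sid >= t->first_surrogate);
    assert((size_t)(sid - t->first_surrogate) < t->count);
    return t->table_address + (size_t)(sid - t->first_surrogate) * t->size;
}
```

For example, with the age values stored as a 4-byte integer array and first_surrogate = 1, the value for surrogate 3 is found at offset (3 - 1) * 4 = 8 bytes, with no search and no join on stored ids.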

Partition Attributes Across (PAX)

Attempts to combine the advantages of both NSM and DSM, while avoiding their disadvantages.
The data are split into pages, like in NSM, but the organization in each page is like DSM. Thus:
  Cache misses are minimized because the information for a specific attribute is stored compactly in the page.
  Record reconstruction cost is low, since the tuples are actually stored like in NSM on disk, but vertically partitioned in each page, where the reconstruction cost is minimal.
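
A PAX-style page can be sketched as a struct with one small array per attribute. This is a simplified model under several assumptions: fixed-length attributes only, a tiny fixed capacity, and a header that holds just the tuple count. The names PaxPage, pax_insert and pax_get are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { PAX_CAPACITY = 4, NAME_LEN = 20 };

typedef struct { int32_t id; char name[NAME_LEN + 1]; int32_t age; char gender; } Employee;

/* One page: the same tuples an NSM page would hold, grouped by attribute. */
typedef struct {
    size_t  count;                              /* page header: tuples stored */
    int32_t id[PAX_CAPACITY];                   /* all ids together           */
    char    name[PAX_CAPACITY][NAME_LEN + 1];   /* all names together         */
    int32_t age[PAX_CAPACITY];                  /* all ages together          */
    char    gender[PAX_CAPACITY];               /* all genders together       */
} PaxPage;

/* Append a tuple to the page; returns 0 if the page is full. */
int pax_insert(PaxPage *page, const Employee *e)
{
    assert(page && e);
    if (page->count == PAX_CAPACITY)
        return 0;
    size_t slot = page->count++;
    page->id[slot] = e->id;
    memcpy(page->name[slot], e->name, sizeof page->name[slot]);
    page->age[slot] = e->age;
    page->gender[slot] = e->gender;
    return 1;
}

/* Reconstruct a whole tuple: all its parts are in the same page,
 * at the same slot of each attribute array, so no join is needed. */
void pax_get(const PaxPage *page, size_t slot, Employee *out)
{
    assert(slot < page->count);
    out->id = page->id[slot];
    memcpy(out->name, page->name[slot], sizeof out->name);
    out->age = page->age[slot];
    out->gender = page->gender[slot];
}
```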

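How do pages look in each schema?

NSM (whole tuples stored one after another in each page):
  Page 1: PAGE HEADER | 12000 Mike Chan 30 M | 12305 George Best 27 M | ...
  Page 2: PAGE HEADER | 54647 John Kit 33 M | 94765 Flora Ho 27 F | ...

DSM (many tables, each with its own pages):
  (id, name) page: PAGE HEADER | 12000 Mike Chan | 12305 George Best | ...
  (id, age) page:  PAGE HEADER | 12000 30 | 12305 27 | 54647 33 | 94765 27 | ...

PAX (same tuples per page as NSM, grouped by attribute inside the page):
  Page 1: PAGE HEADER | 12000 12305 ... | Mike Chan George Best ... | 30 27 ... | ...
  Page 2: PAGE HEADER | 54647 94765 ... | John Kit Flora Ho ... | 33 27 ... | ...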

Comparison between PAX and NSM

SELECT AVG(age)
FROM EMPLOYEE
WHERE id<20000

NSM
  Page:  PAGE HEADER | 12000 Mike Chan 30 M | 12305 George Best 27 M | ...
  Cache: 12000 Mike Chan 30 M
  Irrelevant data are fetched to cache.

PAX
  Page:  PAGE HEADER | 12000 12305 ... | Mike Chan George Best ... | 30 27 ... | ...
  Cache: 12000 12305 ...
         30 27 ...
  Relevant data are fetched to cache.
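
On a PAX page this query only needs the id and age arrays. The sketch below evaluates it over two such arrays; the function name is hypothetical, and returning 0.0 when no tuple qualifies is a simplification (SQL would return NULL).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* SELECT AVG(age) FROM EMPLOYEE WHERE id < bound, evaluated over the id and
 * age arrays of a PAX page. The number of qualifying tuples is returned
 * through matches. */
double avg_age_where_id_below(const int32_t *id, const int32_t *age, size_t n,
                              int32_t bound, size_t *matches)
{
    assert(id && age && matches);
    long long sum = 0;
    size_t hits = 0;
    for (size_t i = 0; i < n; i++) {
        /* id[] and age[] are each contiguous, so every cache-line fetched
         * contains only values this query actually uses. */
        if (id[i] < bound) {
            sum += age[i];
            hits++;
        }
    }
    *matches = hits;
    return hits ? (double)sum / (double)hits : 0.0;
}
```

With the six example tuples, the ids below 20000 are 12000, 12305, 12503 and 13454, so the result is (30 + 27 + 45 + 39) / 4 = 35.25.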

Comparison between PAX and DSM

SELECT AVG(age)
FROM EMPLOYEE
WHERE name LIKE 'M%'

PAX
  Page:  PAGE HEADER | 12000 12305 ... | Mike Chan George Best ... | 30 27 ... | ...
  Cache: Mike Chan George ...
         30 ...
  The join is avoided.

DSM
  (id, name) page: PAGE HEADER | 12000 Mike Chan | 12305 George Best | ...
  (id, age) page:  PAGE HEADER | 12000 30 | 12305 27 | 54647 33 | 94765 27 | ...
  Cache: 12000 Mike Chan 12305 George ...
  Qualifying ids (12000, ...) have to be joined with other pages.

Is PAX always better than DSM?

NO. If the relation has many attributes (e.g., 10) and the query involves only a few (e.g., 2), then the join may be cheaper than reading whole tuples into memory.
Example: R(a1, a2, a3, ..., a10). Each attribute is 4 bytes long.
DSM: D1(id,a1), D2(id,a2), ..., D10(id,a10).
Query:

SELECT avg(a2)
FROM R
WHERE a1=x;

The query is highly selective (only 1% of the tuples qualify), so the qualifying ids can fit in memory.
PAX has to read the whole table: 40 bytes × |R|.
DSM has to read D1 (8 bytes × |R|), apply the selection, and create an intermediate table X with the qualifying ids (in memory). Then it has to read D2 (another 8 bytes × |R|) to get the a2 values that join with X. So the total that DSM reads from disk is 16 bytes × |R|.
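
The arithmetic of this example can be written down as two small functions. They are a back-of-the-envelope model only, assuming that page headers are ignored, that the intermediate table X stays in memory, and that each DSM binary table stores a surrogate id next to the attribute; the function names are invented for this sketch.

```c
#include <assert.h>

/* PAX (like NSM) reads every attribute of every tuple. */
unsigned long long pax_bytes_read(unsigned long long tuples,
                                  unsigned n_attrs, unsigned attr_size)
{
    assert(n_attrs > 0);
    return tuples * n_attrs * attr_size;
}

/* DSM reads one binary table (id, attribute) per attribute used by the query. */
unsigned long long dsm_bytes_read(unsigned long long tuples,
                                  unsigned n_query_attrs,
                                  unsigned id_size, unsigned attr_size)
{
    assert(n_query_attrs > 0);
    return tuples * n_query_attrs * (id_size + attr_size);
}
```

For |R| = 1,000,000 tuples this gives 40,000,000 bytes for PAX and 16,000,000 bytes for DSM, matching the 40 × |R| and 16 × |R| figures above. The advantage shrinks as the query touches more attributes: under this model DSM reads more than PAX once the query involves more than 5 of the 10 attributes.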

Summary

In modern computer architectures the bottleneck is memory access. In many cases the processor is waiting for data to be fetched from memory.
Memory access is no longer “random”. When we access a location in memory, the data around it are loaded into a fast memory chip (cache). Access locality is very important.
Database operators, storage and buffering schemes, and indexes are optimized for the reduction of cache misses.

References

George P. Copeland, Setrag Khoshafian. A Decomposition Storage Model. SIGMOD, 1985.
Peter A. Boncz, Stefan Manegold, Martin L. Kersten. Database Architecture Optimized for the New Bottleneck: Memory Access. VLDB, 1999.
Anastassia Ailamaki, David J. DeWitt, Mark D. Hill, Marios Skounakis. Weaving Relations for Cache Performance. VLDB, 2001.
Peter A. Boncz. Monet: A Next-Generation DBMS Kernel For Query-Intensive Applications. PhD dissertation, Universiteit van Amsterdam, May 2002.