
Analyzing and Optimizing Linux Kernel for PostgreSQL

Sangwook Kim · PGConf.Asia 2017

Sangwook Kim
•  Co-founder and CEO @ Apposha
•  Ph.D. in Computer Science
•  Cloud/Virtualization: SMP scheduling [ASPLOS'13, VEE'14]; group-based memory management [JPDC'14]
•  Database/Storage: non-volatile cache management [USENIX ATC'15, ApSys'16]; request-centric I/O prioritization [FAST'17, HotStorage'17]

What We Do

•  Fully-managed & performance-optimized databases for MySQL, PostgreSQL, and MongoDB
•  No special H/W support
•  No code changes for the DB engine

Contents

•  Background tasks

•  Full page writes

•  Parallel query


PostgreSQL Architecture

[Diagram: client requests enter PostgreSQL processes (T1–T4), which issue I/Os through the operating system to the storage device; database performance depends on this whole stack]

* PostgreSQL's processes
-  Backend (foreground)
-  Checkpointer
-  Autovacuum workers
-  WAL writer
-  Writer
-  …

PostgreSQL Checkpoint

[Diagram: checkpoint processing, from http://www.interdb.jp/pg/pgsql09.html]
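Checkpoint frequency and smoothing are governed by a few stock settings; a minimal postgresql.conf sketch with illustrative values (the 10GB max_wal_size matches the talk's setup; the rest are assumptions, not the speaker's recommendations):

checkpoint_timeout = 15min            # upper bound on time between checkpoints
max_wal_size = 10GB                   # WAL volume that triggers a checkpoint
checkpoint_completion_target = 0.9    # spread checkpoint writes over the interval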

Impact of Checkpointer

[Chart: max trx latency (ms, 0–2500) vs. elapsed time (sec, 0–3000); under the Default configuration, checkpoints cause periodic latency spikes of up to ~1.1 sec]

•  Dell PowerEdge R730: 32 cores, 132GB DRAM, 1 SAS SSD
•  PostgreSQL v9.5.6: 52GB shared_buffers, 10GB max_wal_size
•  TPC-C workload: 50GB dataset, 1 hour run, 50 clients

PostgreSQL Autovacuum

[Diagram: vacuum reclaims dead tuples and records reusable space in the Free Space Map; from http://bstar36.tistory.com/308]

Impact of Autovacuum Workers

[Chart: trx throughput (trx/sec, 0–9000) vs. # of clients (1–300); the Aggressive AV configuration improves throughput by ~40% over Default]

Aggressive AV settings:
autovacuum_max_workers           3    => 6
autovacuum_naptime               1min => 15s
autovacuum_vacuum_threshold      50   => 25
autovacuum_analyze_threshold     50   => 10
autovacuum_vacuum_scale_factor   0.2  => 0.1
autovacuum_analyze_scale_factor  0.1  => 0.05
autovacuum_vacuum_cost_delay     20ms => -1
autovacuum_vacuum_cost_limit     -1   => 1000
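To judge whether autovacuum keeps up before tuning it this aggressively, the standard statistics views help; an illustrative query (not from the talk) against pg_stat_user_tables:

SELECT relname, n_dead_tup, last_autovacuum, autovacuum_count
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;  -- tables with the most dead tuples and when autovacuum last ran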

Impact of Autovacuum Workers

[Chart: max trx latency (ms, 0–2500) vs. elapsed time (sec, 0–3000); Aggressive AV raises latency spikes to ~2.5 sec, worse than Default]

Why Background Tasks Matter

• Multiple independent layers

[Diagram: foreground (FG) and background (BG) read()/write() requests pass through the caching layer (buffer cache), file system layer, and block layer down to the storage device; each layer is a separate abstraction that applies its own admission control and reordering, oblivious to task priority]

Why Background Tasks Matter

•  I/O priority inversion

[Diagram: a FG task waits on a lock or condition variable held by a BG task, which is itself waiting on its own (low-priority) I/O; the FG task cannot proceed until the BG I/O completes and the BG task wakes the waiters]

Request-Centric I/O Prioritization

• Solution v1 (reactive)
•  Request-aware I/O processing in the I/O path [FAST'17]
•  I/O priority inheritance [USENIX ATC'15, FAST'17]
•  Locks
•  Condition variables

[Diagram: when a FG task blocks on a lock or condition variable held by a BG task, the BG task registers as the blocker and inherits FG priority, so its submitted I/O completes at high priority before it wakes the FG task]

[USENIX ATC'15] Request-Oriented Durable Write Caching for Application Performance
[FAST'17] Enlightening the I/O Path: A Holistic Approach for Application Performance

Request-Centric I/O Prioritization

• Solution v1 (problem): pervasive changes across the kernel

Synchronization:
linux/include/linux/mutex.h, linux/include/linux/pagemap.h, linux/include/linux/rtmutex.h, linux/include/linux/rwsem.h, linux/include/uapi/linux/sem.h, linux/include/linux/wait.h, linux/kernel/sched/wait.c, linux/kernel/locking/rwsem.c, linux/kernel/futex.c, linux/kernel/locking/mutex.h, linux/kernel/locking/mutex.c, linux/kernel/locking/rtmutex.c, linux/kernel/locking/rwsem-xadd.c

Interface:
linux/kernel/fork.c, linux/kernel/sys.c, linux/kernel/sysctl.c, linux/include/linux/sched.h

Block layer:
linux/include/linux/blk_types.h, linux/include/linux/blkdev.h, linux/include/linux/buffer_head.h, linux/include/linux/caq.h, linux/block/blk-core.c, linux/block/blk-flush.c, linux/block/blk-lib.c, linux/block/blk-mq.c, linux/block/caq-iosched.c, linux/block/cfq-iosched.c, linux/block/elevator.c

File system layer:
linux/fs/buffer.c, linux/fs/ext4/extents.c, linux/fs/ext4/inode.c, linux/include/linux/jbd2.h, linux/fs/jbd2/commit.c, linux/fs/jbd2/journal.c, linux/fs/jbd2/transaction.c

Caching layer:
linux/include/linux/writeback.h, linux/mm/page-writeback.c, linux/include/linux/mm_types.h, linux/fs/buffer.c

Request-Centric I/O Prioritization

• Solution v2 (proactive): the V12 Engine

[Diagram: Linux I/O stack with two Apposha components added: a Front-End File System stacked between the VFS/page cache and Ext4/XFS/F2FS/etc., and an Apposha I/O Scheduler alongside Noop/CFQ/Deadline in the block layer above the device driver]

-  Priority-aware I/O scheduling
-  Control device-level congestion
-  Control writes at the upper layer
-  Tag low priority for background I/Os

Request-Centric I/O Prioritization

[Diagram: the V12 Engine = kernel modules (Front-End File System + I/O Scheduler) plus user-level libraries (V12-M for MongoDB, V12-P for PostgreSQL) that classify task priority and optimize WAL accesses]

Request-Centric I/O Prioritization

[Chart: max trx latency (ms, 0–2500) vs. elapsed time (sec, 0–3000); V12-P bounds latency spikes to ~0.15 sec, versus ~1.1 sec for stock v9.6 and ~2.5 sec for Aggressive AV]

V12-P: V12 Engine for PostgreSQL

New feature since v9.6: checkpoint_flush_after
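For reference, checkpoint_flush_after asks the kernel to write back checkpoint data incrementally instead of letting dirty pages pile up until fsync(); an illustrative postgresql.conf line (256kB is the stock Linux default, shown for concreteness):

checkpoint_flush_after = 256kB   # hint OS writeback after this much checkpoint data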

Contents

•  Background tasks

•  Full page writes

•  Parallel query


Full Page Writes

After each checkpoint, the first modification of a page writes the entire 8kB page image into the WAL (rather than just the tuple-level change), so that a torn page can be reconstructed after a crash. (Diagram from "Tuning PostgreSQL for High Write Workloads" – Grant McAlister.)

Full Page Writes

•  Impact on WAL size

[Chart: WAL size (GB, 0–14) vs. elapsed time (sec, 0–600); with full_page_writes on (Default), WAL grows steeply]

•  PostgreSQL v9.6.5: 52GB shared_buffers, 1GB max_wal_size
•  TPC-C workload: 50GB dataset, 10 min run, 200 clients

WAL Compression

•  Impact on WAL size

[Chart: WAL size (GB, 0–14) vs. elapsed time (sec, 0–600); wal_compression substantially slows WAL growth relative to Default]

•  PostgreSQL v9.6.5: 52GB shared_buffers, 1GB max_wal_size
•  TPC-C workload: 50GB dataset, 10 min run, 200 clients
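Enabling it is a single stock setting (available since v9.5); an illustrative postgresql.conf line:

wal_compression = on   # compress full-page images written to WAL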

WAL Compression

•  Impact on performance

[Chart: trx throughput (trx/sec, 0–12000), Default vs. WAL compression; compression trades CPU for WAL volume and yields only a modest throughput gain]

What if we could safely disable full_page_writes w/o special H/W support?

Atomic Write Support

[Diagram sequence: with full_page_writes off in PostgreSQL, the V12 Engine makes each page write atomic inside the Linux kernel:
1. write(): the data block is staged in memory; a mapping table tracks its on-disk location.
2. fsync(): writeback persists the new block to a fresh on-disk location instead of overwriting the old one.
3. The mapping table entry is atomically switched to point at the new block.
4. The old on-disk block is freed.]

Because the old page version stays intact until the atomic mapping switch, a crash mid-write can never expose a torn page, which is what makes disabling full_page_writes safe.
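On the PostgreSQL side, the user-facing change is then a single setting; illustrative line (safe only when the kernel or storage guarantees atomic page writes, e.g. via the V12 Engine):

full_page_writes = off   # DANGER on a stock kernel: torn pages become unrecoverable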

Atomic Write Support

•  Impact on WAL size

[Chart: WAL size (GB, 0–14) vs. elapsed time (sec, 0–600); V12-P with full_page_writes off keeps WAL far smaller than both Default and WAL compression]

V12-P: V12 Engine for PostgreSQL

Atomic Write Support

•  Impact on performance

[Chart: trx throughput (trx/sec, 0–12000); V12-P delivers ~2X the throughput of Default, well ahead of WAL compression]

V12-P: V12 Engine for PostgreSQL

Contents

•  Background tasks

•  Full page writes

•  Parallel query


Parallel Query

[Diagram: a user query arrives at a backend, which launches parallel workers for scans, aggregates, joins, …]

[Diagram: query response time is gated by the slowest worker: if every worker hits cache, the response is fast; if even one worker misses cache and must go to storage, the whole query is slow]
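As background for the experiments below, the degree of parallelism is a stock setting (since v9.6); an illustrative psql snippet matching the 7 planned workers in the plan shown later:

SET max_parallel_workers_per_gather = 7;  -- allow up to 7 workers per Gather node
EXPLAIN SELECT * FROM pgbench_accounts WHERE filler LIKE '%x%';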

Parallel Query

• Problem inside Linux kernel

[Diagram: fair-based I/O scheduling ("communism"): two queries, each needing 10 sec of I/O, interleave and finish at 18 and 20 sec; add a third and they finish at 24, 26, and 30 sec; with N concurrent queries, every query's latency keeps inflating]

[Diagram: task-aware I/O scheduling ("utilitarianism") instead serves one query's workers to completion: Query1 finishes at 10 sec and Query2 at 20 sec, versus 18 and 20 sec under fair scheduling, improving average latency without delaying the last query]

Optimizing Parallel Query

•  Illustrative experiment

pgbench -i -s 1000 test1
pgbench -i -s 1000 test2

Q1 (on test1): SELECT * FROM pgbench_accounts WHERE filler LIKE '%x%';
Q2 (on test2): SELECT * FROM pgbench_accounts WHERE filler LIKE '%x%';

QUERY PLAN
--------------------------------------------------------------------------------------
Gather  (cost=1000.00..1818916.53 rows=1 width=97)
  Workers Planned: 7
  ->  Parallel Seq Scan on pgbench_accounts  (cost=0.00..1817916.43 rows=1 width=97)
        Filter: (filler ~~ '%x%'::text)

[Chart: query latency (ms, 0–70) for Q1–Q4 under the Deadline, CFQ, NOOP, and Task-aware I/O schedulers]
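To reproduce the scheduler comparison, the legacy block layer exposes the stock schedulers via sysfs (sdX is a placeholder device name; on blk-mq kernels the scheduler names differ):

cat /sys/block/sdX/queue/scheduler               # e.g. noop deadline [cfq]
echo deadline > /sys/block/sdX/queue/scheduler   # switch to deadline (as root)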

Optimizing Parallel Query

•  Illustrative experiment (continued)

[Chart-only slide; details not recoverable from the transcript]

Recall

• Multiple independent layers

[Diagram: the same layered I/O stack as before: FG/BG read()/write() through the buffer cache, file system layer, and block layer abstractions, with reordering on the way to the storage device]

Optimizing Parallel Query

•  Illustrative experiment (device queue off)

[Chart: query latency (ms, 0–100) for Q1–Q4 under Deadline, CFQ, NOOP, and Task-aware with the device queue disabled; annotated gains: Avg 23%, Q1 73%]

Optimizing Parallel Query

•  Illustrative experiment (4ms idle_slice)

[Chart: query latency (ms, 0–70) for Q1–Q4 under Deadline, CFQ, NOOP, Task-aware, and Task-aware with a 4ms idle slice; annotated gains: Avg 17%, Q1 2.8X]

Optimizing Parallel Query

• Ongoing work
•  Handling heavy tasks
•  Extending to CPU scheduling

H http://apposha.io
F www.facebook.com/apposha
M sangwook@apposha.io
