TRANSCRIPT
Analyzing and Optimizing Linux Kernel for PostgreSQL
Sangwook Kim PGConf.Asia 2017
Sangwook Kim
• Co-founder and CEO @ Apposha
• Ph.D. in Computer Science
• Cloud/Virtualization
  - SMP scheduling [ASPLOS’13, VEE’14]
  - Group-based memory management [JPDC’14]
• Database/Storage
  - Non-volatile cache management [USENIX ATC’15, ApSys’16]
  - Request-centric I/O prioritization [FAST’17, HotStorage’17]
What We Do
• Fully-managed & performance-optimized for MongoDB, MySQL, and PostgreSQL
• No special H/W support
• No code changes to the DB engine
Contents
• Background tasks
• Full page writes
• Parallel query
PostgreSQL Architecture
[Diagram: clients send requests to PostgreSQL tasks (T1–T4) running on the operating system; foreground and background processes issue I/Os to the storage device, and all of them affect database performance.]
• PostgreSQL’s processes
  - Backend (foreground)
  - Checkpointer
  - Autovacuum workers
  - WAL writer
  - Writer
  - …
PostgreSQL Checkpoint
[Diagram: checkpoint flow, from http://www.interdb.jp/pg/pgsql09.html]
Impact of Checkpointer
[Chart: max transaction latency (ms, 0–2500) vs. elapsed time (sec, 0–3000); the Default configuration shows periodic latency spikes up to ~1.1 sec.]
• Dell PowerEdge R730: 32 cores, 132GB DRAM, 1 SAS SSD
• PostgreSQL v9.5.6: 52GB shared_buf, 10GB max_wal
• TPC-C workload: 50GB dataset, 1 hour run, 50 clients
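One PostgreSQL-side mitigation for spikes like these is to spread checkpoint writes over a longer window. A hedged sketch of the relevant postgresql.conf knobs (the values are illustrative, not settings from the talk):

```ini
# Spread checkpoint writes over a larger fraction of the interval
checkpoint_completion_target = 0.9   # default 0.5 in v9.5
checkpoint_timeout = 15min           # default 5min
max_wal_size = 10GB                  # the talk's runs used 10GB max_wal
```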
PostgreSQL Autovacuum
[Diagram: vacuum reclaims dead tuples into the Free Space Map, from http://bstar36.tistory.com/308]
Impact of Autovacuum Workers
[Chart: transaction throughput (trx/sec, 0–9000) vs. # of clients (1–300); a ~40% gap between Default and Aggressive AV.]
Aggressive AV settings:
- autovacuum_max_workers: 3 => 6
- autovacuum_naptime: 1min => 15s
- autovacuum_vacuum_threshold: 50 => 25
- autovacuum_analyze_threshold: 50 => 10
- autovacuum_vacuum_scale_factor: 0.2 => 0.1
- autovacuum_analyze_scale_factor: 0.1 => 0.05
- autovacuum_vacuum_cost_delay: 20ms => -1
- autovacuum_vacuum_cost_limit: -1 => 1000
Impact of Autovacuum Workers
[Chart: max transaction latency (ms, 0–2500) vs. elapsed time (sec, 0–3000); with Aggressive AV, latency spikes reach ~2.5 sec.]
Why Background Tasks Matter
• Multiple independent layers
[Diagram: the application issues read()/write() calls from foreground (FG) and background (BG) tasks; the caching layer (buffer cache), file system layer, and block layer each independently reorder I/Os and apply admission control before they reach the storage device, so the FG/BG distinction is lost across the abstraction boundaries.]
Why Background Tasks Matter
• I/O priority inversion
[Diagram: a FG task waits on a lock or condition variable held by a BG task; the BG task is itself waiting on its own deprioritized I/O, so the FG task stays blocked until the BG I/O completes and the BG task wakes the waiters.]
Request-Centric I/O Prioritization
• Solution v1 (reactive)
  - Request-aware I/O processing in the I/O path [FAST’17]
  - I/O priority inheritance [USENIX ATC’15, FAST’17]
    • Locks
    • Condition variables
[Diagram: when a FG task blocks on a lock or condition variable (CV) held by a BG task, the BG task inherits the FG priority for its I/O submission and completion, then wakes the registered waiters.]
[USENIX ATC’15] Request-Oriented Durable Write Caching for Application Performance
[FAST’17] Enlightening the I/O Path: A Holistic Approach for Application Performance
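The lock half of this inheritance scheme can be sketched in user space. The sketch below is an illustrative model, not the actual kernel patch: a lock that boosts its holder's effective priority to the highest priority among its waiters, and restores the base priority on release.

```python
class Task:
    """Toy task with a base priority and a possibly boosted effective one."""
    def __init__(self, name, priority):
        self.name = name
        self.base_priority = priority        # higher number = more important
        self.effective_priority = priority

class InheritanceLock:
    """Lock whose holder inherits the highest priority among its waiters."""
    def __init__(self):
        self.holder = None
        self.waiters = []

    def acquire(self, task):
        """Return True if acquired; otherwise queue the task and boost the holder."""
        if self.holder is None:
            self.holder = task
            return True
        self.waiters.append(task)
        self.holder.effective_priority = max(
            self.holder.effective_priority,
            max(w.effective_priority for w in self.waiters))
        return False

    def release(self):
        """Drop the inherited boost and hand the lock to the first waiter."""
        prev = self.holder
        prev.effective_priority = prev.base_priority
        self.holder = self.waiters.pop(0) if self.waiters else None
        return prev
```

For example, if a low-priority checkpointer holds the lock and a high-priority backend blocks on it, the checkpointer's I/O now runs at backend priority until it releases the lock, which is the inversion the talk describes.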
Request-Centric I/O Prioritization
• Solution v1 (problem): the changes are scattered across many kernel files
  - Synchronization: linux/include/linux/mutex.h, linux/include/linux/pagemap.h, linux/include/linux/rtmutex.h, linux/include/linux/rwsem.h, linux/include/uapi/linux/sem.h, linux/include/linux/wait.h, linux/kernel/sched/wait.c, linux/kernel/locking/rwsem.c, linux/kernel/futex.c, linux/kernel/locking/mutex.h, linux/kernel/locking/mutex.c, linux/kernel/locking/rtmutex.c, linux/kernel/locking/rwsem-xadd.c
  - Interface: linux/kernel/fork.c, linux/kernel/sys.c, linux/kernel/sysctl.c, linux/include/linux/sched.h
  - Block layer: linux/include/linux/blk_types.h, linux/include/linux/blkdev.h, linux/include/linux/buffer_head.h, linux/include/linux/caq.h, linux/block/blk-core.c, linux/block/blk-flush.c, linux/block/blk-lib.c, linux/block/blk-mq.c, linux/block/caq-iosched.c, linux/block/cfq-iosched.c, linux/block/elevator.c
  - File system layer: linux/fs/buffer.c, linux/fs/ext4/extents.c, linux/fs/ext4/inode.c, linux/include/linux/jbd2.h, linux/fs/jbd2/commit.c, linux/fs/jbd2/journal.c, linux/fs/jbd2/transaction.c
  - Caching layer: linux/include/linux/writeback.h, linux/mm/page-writeback.c, linux/include/linux/mm_types.h, linux/fs/buffer.c
Request-Centric I/O Prioritization
• Solution v2 (proactive)
[Diagram: Linux I/O stack with two Apposha components inserted — an Apposha Front-End File System stacked above Ext4/XFS/F2FS/etc. under the VFS and page cache, and an Apposha I/O Scheduler alongside Noop/CFQ/Deadline in the block layer above the device driver.]
  - Priority-aware I/O scheduling
  - Control device-level congestion
  - Control writes at the upper layer
  - Tag background I/Os as low priority
Request-Centric I/O Prioritization
• V12 Engine
  - Kernel modules: Front-End File System, I/O Scheduler
  - User-level libraries (V12-M for MongoDB, V12-P for PostgreSQL): classify task priority, optimize WAL accesses
Request-Centric I/O Prioritization
[Chart: max transaction latency (ms, 0–2500) vs. elapsed time (sec, 0–3000) for Default, Aggressive AV, and V12-P; V12-P keeps latency spikes at ~0.15 sec.]
V12-P: V12 Engine for PostgreSQL
Request-Centric I/O Prioritization
[Chart: the same comparison with PostgreSQL v9.6 added; v9.6 keeps latency spikes at ~1.1 sec.]
V12-P: V12 Engine for PostgreSQL
New feature since v9.6: checkpoint_flush_after
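checkpoint_flush_after asks the kernel to write back checkpoint pages once a threshold of dirty data accumulates, instead of letting them pile up in the page cache for one large burst. A minimal sketch (the value shown is the documented Linux default):

```ini
# v9.6+: flush checkpoint writes to disk after this much dirty data
# has accumulated in the OS page cache (0 disables the forced flush)
checkpoint_flush_after = 256kB
```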
Contents
• Background tasks
• Full page writes
• Parallel query
Full Page Writes
• Impact on WAL size
(Chart credit: Tuning PostgreSQL for High Write Workloads – Grant McAlister)
[Chart: WAL size (GB, 0–14) vs. elapsed time (sec, 0–600) for Default.]
• PostgreSQL v9.6.5: 52GB shared_buffer, 1GB max_wal
• TPC-C workload: 50GB dataset, 10 min run, 200 clients
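The growth comes from full-page images: the first modification of a page after a checkpoint logs the entire 8 KB page rather than a small per-record change. A rough estimator of this effect (record size and update counts are illustrative assumptions, not measurements from the talk):

```python
def estimated_wal_bytes(pages_touched, updates_per_page, record_bytes=100,
                        page_bytes=8192, full_page_writes=True):
    """Rough WAL volume for one checkpoint interval.

    With full_page_writes on, the first update of each page after a
    checkpoint logs a full page image; later updates to the same page
    log only a small per-record change.
    """
    if full_page_writes:
        first_touch = pages_touched * page_bytes
        later_touches = pages_touched * (updates_per_page - 1) * record_bytes
        return first_touch + later_touches
    return pages_touched * updates_per_page * record_bytes
```

Under these assumptions, a workload touching a million pages twice per checkpoint interval writes ~8.3 GB of WAL with full-page writes versus ~0.2 GB without, which is the shape of the gap the chart shows.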
WAL Compression
• Impact on WAL size
[Chart: WAL size (GB, 0–14) vs. elapsed time (sec, 0–600), Default vs. WAL compression.]
• PostgreSQL v9.6.5: 52GB shared_buffer, 1GB max_wal
• TPC-C workload: 50GB dataset, 10 min run, 200 clients
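The setting behind this comparison is a single knob added in v9.5; a minimal sketch:

```ini
# v9.5+: compress full-page images written to WAL (default off)
wal_compression = on
```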
WAL Compression
• Impact on performance
[Chart: transaction throughput (trx/sec, 0–12000), Default vs. WAL compression.]
What if we could safely disable full_page_writes without special H/W support?
Atomic Write Support
[Animation: PostgreSQL runs with full_page_writes off while the Linux kernel (w/ V12 Engine) guarantees atomic page updates via a mapping table:
1. write() places the data block in memory.
2. fsync() writes the block back to a new on-disk location, leaving the old on-disk version intact.
3. The mapping table is atomically updated to point to the new block.
4. The old on-disk block is freed.]
Atomic Write Support
• Impact on WAL size
[Chart: WAL size (GB, 0–14) vs. elapsed time (sec, 0–600) for Default, WAL compression, and V12-P.]
V12-P: V12 Engine for PostgreSQL
Atomic Write Support
• Impact on performance
[Chart: transaction throughput (trx/sec, 0–12000) for Default, WAL compression, and V12-P; V12-P reaches ~2X throughput.]
V12-P: V12 Engine for PostgreSQL
Contents
• Background tasks
• Full page writes
• Parallel query
Parallel Query
[Diagram: a user query arrives at a backend, which launches multiple parallel workers for scan, aggregate, join, …]
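How many workers the backend may launch is bounded by configuration; a hedged sketch of the v9.6 knobs (the values are illustrative):

```ini
max_worker_processes = 8              # global pool of background workers
max_parallel_workers_per_gather = 4   # workers per Gather node (0 disables parallel query)
```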
Parallel Query
[Diagram: each worker reads through its own cache; workers that hit in cache respond fast, but workers that miss and must go to storage respond slowly, delaying the whole query.]
Parallel Query
• Problem inside the Linux kernel
[Illustration: with fair-based ("communism") I/O scheduling, queries that each need 10 sec of I/O are interleaved — two concurrent queries finish at 18 and 20 sec; three finish at 24, 26, and 30 sec; with N queries, every query's completion time keeps growing.]
Parallel Query
• Problem inside the Linux kernel
[Illustration: fair-based ("communism") scheduling finishes the same two 10-sec queries at 18 and 20 sec, while task-aware ("utilitarianism") scheduling finishes them at 10 and 20 sec.]
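The fair-vs-task-aware numbers can be reproduced with a toy model: under fair (processor-sharing) scheduling every active query gets an equal share of device bandwidth, while a task-aware scheduler serves one query's I/O to completion at a time. A minimal sketch of this idealized model (not the talk's actual scheduler, and ignoring the interleaving overhead that pushes the fair case past 20 sec):

```python
def fair_share_completions(demands):
    """Processor sharing: device bandwidth is split equally among active jobs.

    Returns each job's completion time given its I/O demand in seconds.
    """
    remaining = [float(d) for d in demands]
    done = [0.0] * len(demands)
    now = 0.0
    while any(r > 0 for r in remaining):
        active = [i for i, r in enumerate(remaining) if r > 0]
        # Advance time until the smallest remaining job finishes.
        step = min(remaining[i] for i in active) * len(active)
        now += step
        for i in active:
            remaining[i] -= step / len(active)
            if remaining[i] <= 1e-9:
                remaining[i] = 0.0
                done[i] = now
    return done

def run_to_completion(demands):
    """Task-aware: serve one query's I/O to completion at a time."""
    done, now = [], 0.0
    for d in demands:
        now += d
        done.append(now)
    return done
```

For two queries needing 10 sec of I/O each, fair sharing finishes both near 20 sec, while run-to-completion finishes them at 10 and 20 sec: the same total work, but a much better average completion time.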
Optimizing Parallel Query
• Illustrative experiment
pgbench -i -s 1000 test1
pgbench -i -s 1000 test2
Q1: SELECT * FROM pgbench_accounts WHERE filler LIKE '%x%'; (run with -d test1)
Q2: SELECT * FROM pgbench_accounts WHERE filler LIKE '%x%'; (run with -d test2)
QUERY PLAN
--------------------------------------------------------------------------------------
Gather (cost=1000.00..1818916.53 rows=1 width=97)
  Workers Planned: 7
  -> Parallel Seq Scan on pgbench_accounts (cost=0.00..1817916.43 rows=1 width=97)
       Filter: (filler ~~ '%x%'::text)
[Chart: query latency (ms, 0–70) for Q1–Q4 under Deadline, CFQ, NOOP, and Task-aware schedulers.]
Optimizing Parallel Query
• Illustrative experiment
Recall
• Multiple independent layers
[Diagram, repeated from earlier: the caching layer (buffer cache), file system layer, and block layer each independently reorder FG/BG I/Os before they reach the storage device.]
Optimizing Parallel Query
• Illustrative experiment (device queue off)
[Chart: query latency (ms, 0–100) for Q1–Q4 under Deadline, CFQ, NOOP, and Task-aware; Task-aware improves average latency by 23% and Q1 by 73%.]
Optimizing Parallel Query
• Illustrative experiment (4ms idle_slice)
[Chart: query latency (ms, 0–70) for Q1–Q4 under Deadline, CFQ, NOOP, Task-aware, and Task-aware (Idle 4ms); the idle-slice variant improves average latency by a further 17% and Q1 by 2.8X.]
Optimizing Parallel Query
• Ongoing work
  - Handling heavy tasks
  - Extending to CPU scheduling
Homepage: http://apposha.io
Facebook: www.facebook.com/apposha
Mail: [email protected]