TRANSCRIPT
Analyzing and Optimizing Linux Kernel for PostgreSQL
Sangwook Kim PGConf.Asia 2017
Sangwook Kim
• Co-founder and CEO @ Apposha
• Ph.D. in Computer Science
• Cloud/Virtualization
  - SMP scheduling [ASPLOS’13, VEE’14]
  - Group-based memory management [JPDC’14]
• Database/Storage
  - Non-volatile cache management [USENIX ATC’15, ApSys’16]
  - Request-centric I/O prioritization [FAST’17, HotStorage’17]
What We Do
• Fully-managed & performance-optimized for MongoDB, MySQL, and PostgreSQL
• No special H/W support
• No code changes to the DB engine
Contents
• Background tasks
• Full page writes
• Parallel query
PostgreSQL Architecture
[Diagram: clients send requests to PostgreSQL tasks (T1–T4) running on the operating system; foreground and background processes issue I/Os to the storage device, and all of them affect database performance.]
• PostgreSQL’s processes
  - Backend (foreground)
  - Checkpointer
  - Autovacuum workers
  - WAL writer
  - Writer
  - …
PostgreSQL Checkpoint
[Diagram: checkpoint flow, from http://www.interdb.jp/pg/pgsql09.html]
Impact of Checkpointer
[Chart: max transaction latency (ms, 0–2500) vs. elapsed time (sec, 0–3000); the Default configuration shows periodic latency spikes up to ~1.1 sec.]
• Dell PowerEdge R730: 32 cores, 132GB DRAM, 1 SAS SSD
• PostgreSQL v9.5.6: 52GB shared_buf, 10GB max_wal
• TPC-C workload: 50GB dataset, 1 hour run, 50 clients
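One PostgreSQL-side mitigation for spikes like these is to spread checkpoint writes over a longer window. A hedged sketch of the relevant postgresql.conf knobs (the values are illustrative, not settings from the talk):

```ini
# Spread checkpoint writes over a larger fraction of the interval
checkpoint_completion_target = 0.9   # default 0.5 in v9.5
checkpoint_timeout = 15min           # default 5min
max_wal_size = 10GB                  # the talk's runs used 10GB max_wal
```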
PostgreSQL Autovacuum
[Diagram: vacuum reclaims dead tuples into the Free Space Map, from http://bstar36.tistory.com/308]
Impact of Autovacuum Workers
[Chart: transaction throughput (trx/sec, 0–9000) vs. # of clients (1–300); a ~40% gap between Default and Aggressive AV.]
Aggressive AV settings:
- autovacuum_max_workers: 3 => 6
- autovacuum_naptime: 1min => 15s
- autovacuum_vacuum_threshold: 50 => 25
- autovacuum_analyze_threshold: 50 => 10
- autovacuum_vacuum_scale_factor: 0.2 => 0.1
- autovacuum_analyze_scale_factor: 0.1 => 0.05
- autovacuum_vacuum_cost_delay: 20ms => -1
- autovacuum_vacuum_cost_limit: -1 => 1000
Impact of Autovacuum Workers
[Chart: max transaction latency (ms, 0–2500) vs. elapsed time (sec, 0–3000); with Aggressive AV, latency spikes reach ~2.5 sec.]
Why Background Tasks Matter
• Multiple independent layers
[Diagram: the application issues read()/write() calls from foreground (FG) and background (BG) tasks; the caching layer (buffer cache), file system layer, and block layer each independently reorder I/Os and apply admission control before they reach the storage device, so the FG/BG distinction is lost across the abstraction boundaries.]
Why Background Tasks Matter
• I/O priority inversion
[Diagram: a FG task waits on a lock or condition variable held by a BG task; the BG task is itself waiting on its own deprioritized I/O, so the FG task stays blocked until the BG I/O completes and the BG task wakes the waiters.]
Request-Centric I/O Prioritization
• Solution v1 (reactive)
  - Request-aware I/O processing in the I/O path [FAST’17]
  - I/O priority inheritance [USENIX ATC’15, FAST’17]
    • Locks
    • Condition variables
[Diagram: when a FG task blocks on a lock or condition variable (CV) held by a BG task, the BG task inherits the FG priority for its I/O submission and completion, then wakes the registered waiters.]
[USENIX ATC’15] Request-Oriented Durable Write Caching for Application Performance
[FAST’17] Enlightening the I/O Path: A Holistic Approach for Application Performance
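The lock half of this inheritance scheme can be sketched in user space. The sketch below is an illustrative model, not the actual kernel patch: a lock that boosts its holder's effective priority to the highest priority among its waiters, and restores the base priority on release.

```python
class Task:
    """Toy task with a base priority and a possibly boosted effective one."""
    def __init__(self, name, priority):
        self.name = name
        self.base_priority = priority        # higher number = more important
        self.effective_priority = priority

class InheritanceLock:
    """Lock whose holder inherits the highest priority among its waiters."""
    def __init__(self):
        self.holder = None
        self.waiters = []

    def acquire(self, task):
        """Return True if acquired; otherwise queue the task and boost the holder."""
        if self.holder is None:
            self.holder = task
            return True
        self.waiters.append(task)
        self.holder.effective_priority = max(
            self.holder.effective_priority,
            max(w.effective_priority for w in self.waiters))
        return False

    def release(self):
        """Drop the inherited boost and hand the lock to the first waiter."""
        prev = self.holder
        prev.effective_priority = prev.base_priority
        self.holder = self.waiters.pop(0) if self.waiters else None
        return prev
```

For example, if a low-priority checkpointer holds the lock and a high-priority backend blocks on it, the checkpointer's I/O now runs at backend priority until it releases the lock, which is the inversion the talk describes.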
Request-Centric I/O Prioritization
• Solution v1 (problem): the changes are scattered across many kernel files
  - Synchronization: linux/include/linux/mutex.h, linux/include/linux/pagemap.h, linux/include/linux/rtmutex.h, linux/include/linux/rwsem.h, linux/include/uapi/linux/sem.h, linux/include/linux/wait.h, linux/kernel/sched/wait.c, linux/kernel/locking/rwsem.c, linux/kernel/futex.c, linux/kernel/locking/mutex.h, linux/kernel/locking/mutex.c, linux/kernel/locking/rtmutex.c, linux/kernel/locking/rwsem-xadd.c
  - Interface: linux/kernel/fork.c, linux/kernel/sys.c, linux/kernel/sysctl.c, linux/include/linux/sched.h
  - Block layer: linux/include/linux/blk_types.h, linux/include/linux/blkdev.h, linux/include/linux/buffer_head.h, linux/include/linux/caq.h, linux/block/blk-core.c, linux/block/blk-flush.c, linux/block/blk-lib.c, linux/block/blk-mq.c, linux/block/caq-iosched.c, linux/block/cfq-iosched.c, linux/block/elevator.c
  - File system layer: linux/fs/buffer.c, linux/fs/ext4/extents.c, linux/fs/ext4/inode.c, linux/include/linux/jbd2.h, linux/fs/jbd2/commit.c, linux/fs/jbd2/journal.c, linux/fs/jbd2/transaction.c
  - Caching layer: linux/include/linux/writeback.h, linux/mm/page-writeback.c, linux/include/linux/mm_types.h, linux/fs/buffer.c
Request-Centric I/O Prioritization
• Solution v2 (proactive)
[Diagram: Linux I/O stack with two Apposha components inserted — an Apposha Front-End File System stacked above Ext4/XFS/F2FS/etc. under the VFS and page cache, and an Apposha I/O Scheduler alongside Noop/CFQ/Deadline in the block layer above the device driver.]
  - Priority-aware I/O scheduling
  - Control device-level congestion
  - Control writes at the upper layer
  - Tag background I/Os as low priority
Request-Centric I/O Prioritization
• V12 Engine
  - Kernel modules: Front-End File System, I/O Scheduler
  - User-level libraries (V12-M for MongoDB, V12-P for PostgreSQL): classify task priority, optimize WAL accesses
Request-Centric I/O Prioritization
[Chart: max transaction latency (ms, 0–2500) vs. elapsed time (sec, 0–3000) for Default, Aggressive AV, and V12-P; V12-P keeps latency spikes at ~0.15 sec.]
V12-P: V12 Engine for PostgreSQL
Request-Centric I/O Prioritization
[Chart: the same comparison with PostgreSQL v9.6 added; v9.6 keeps latency spikes at ~1.1 sec.]
V12-P: V12 Engine for PostgreSQL
New feature since v9.6: checkpoint_flush_after
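checkpoint_flush_after asks the kernel to write back checkpoint pages once a threshold of dirty data accumulates, instead of letting them pile up in the page cache for one large burst. A minimal sketch (the value shown is the documented Linux default):

```ini
# v9.6+: flush checkpoint writes to disk after this much dirty data
# has accumulated in the OS page cache (0 disables the forced flush)
checkpoint_flush_after = 256kB
```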
Contents
• Background tasks
• Full page writes
• Parallel query
Full Page Writes
• Impact on WAL size
(Chart credit: Tuning PostgreSQL for High Write Workloads – Grant McAlister)
[Chart: WAL size (GB, 0–14) vs. elapsed time (sec, 0–600) for Default.]
• PostgreSQL v9.6.5: 52GB shared_buffer, 1GB max_wal
• TPC-C workload: 50GB dataset, 10 min run, 200 clients
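The growth comes from full-page images: the first modification of a page after a checkpoint logs the entire 8 KB page rather than a small per-record change. A rough estimator of this effect (record size and update counts are illustrative assumptions, not measurements from the talk):

```python
def estimated_wal_bytes(pages_touched, updates_per_page, record_bytes=100,
                        page_bytes=8192, full_page_writes=True):
    """Rough WAL volume for one checkpoint interval.

    With full_page_writes on, the first update of each page after a
    checkpoint logs a full page image; later updates to the same page
    log only a small per-record change.
    """
    if full_page_writes:
        first_touch = pages_touched * page_bytes
        later_touches = pages_touched * (updates_per_page - 1) * record_bytes
        return first_touch + later_touches
    return pages_touched * updates_per_page * record_bytes
```

Under these assumptions, a workload touching a million pages twice per checkpoint interval writes ~8.3 GB of WAL with full-page writes versus ~0.2 GB without, which is the shape of the gap the chart shows.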
WAL Compression
• Impact on WAL size
[Chart: WAL size (GB, 0–14) vs. elapsed time (sec, 0–600), Default vs. WAL compression.]
• PostgreSQL v9.6.5: 52GB shared_buffer, 1GB max_wal
• TPC-C workload: 50GB dataset, 10 min run, 200 clients
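The setting behind this comparison is a single knob added in v9.5; a minimal sketch:

```ini
# v9.5+: compress full-page images written to WAL (default off)
wal_compression = on
```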
WAL Compression
• Impact on performance
[Chart: transaction throughput (trx/sec, 0–12000), Default vs. WAL compression.]
What if we could safely disable full_page_writes without special H/W support?
Atomic Write Support
[Animation: PostgreSQL runs with full_page_writes off while the Linux kernel (w/ V12 Engine) guarantees atomic page updates via a mapping table:
1. write() places the data block in memory.
2. fsync() writes the block back to a new on-disk location, leaving the old on-disk version intact.
3. The mapping table is atomically updated to point to the new block.
4. The old on-disk block is freed.]
Atomic Write Support
• Impact on WAL size
[Chart: WAL size (GB, 0–14) vs. elapsed time (sec, 0–600) for Default, WAL compression, and V12-P.]
V12-P: V12 Engine for PostgreSQL
Atomic Write Support
• Impact on performance
[Chart: transaction throughput (trx/sec, 0–12000) for Default, WAL compression, and V12-P; V12-P reaches ~2X throughput.]
V12-P: V12 Engine for PostgreSQL
Contents
• Background tasks
• Full page writes
• Parallel query
Parallel Query
[Diagram: a user query arrives at a backend, which launches multiple parallel workers for scan, aggregate, join, …]
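How many workers the backend may launch is bounded by configuration; a hedged sketch of the v9.6 knobs (the values are illustrative):

```ini
max_worker_processes = 8              # global pool of background workers
max_parallel_workers_per_gather = 4   # workers per Gather node (0 disables parallel query)
```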
Parallel Query
[Diagram: each worker reads through its own cache; workers that hit in cache respond fast, but workers that miss and must go to storage respond slowly, delaying the whole query.]
Parallel Query
• Problem inside the Linux kernel
[Illustration: with fair-based ("communism") I/O scheduling, queries that each need 10 sec of I/O are interleaved — two concurrent queries finish at 18 and 20 sec; three finish at 24, 26, and 30 sec; with N queries, every query's completion time keeps growing.]
Parallel Query
• Problem inside the Linux kernel
[Illustration: fair-based ("communism") scheduling finishes the same two 10-sec queries at 18 and 20 sec, while task-aware ("utilitarianism") scheduling finishes them at 10 and 20 sec.]
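The fair-vs-task-aware numbers can be reproduced with a toy model: under fair (processor-sharing) scheduling every active query gets an equal share of device bandwidth, while a task-aware scheduler serves one query's I/O to completion at a time. A minimal sketch of this idealized model (not the talk's actual scheduler, and ignoring the interleaving overhead that pushes the fair case past 20 sec):

```python
def fair_share_completions(demands):
    """Processor sharing: device bandwidth is split equally among active jobs.

    Returns each job's completion time given its I/O demand in seconds.
    """
    remaining = [float(d) for d in demands]
    done = [0.0] * len(demands)
    now = 0.0
    while any(r > 0 for r in remaining):
        active = [i for i, r in enumerate(remaining) if r > 0]
        # Advance time until the smallest remaining job finishes.
        step = min(remaining[i] for i in active) * len(active)
        now += step
        for i in active:
            remaining[i] -= step / len(active)
            if remaining[i] <= 1e-9:
                remaining[i] = 0.0
                done[i] = now
    return done

def run_to_completion(demands):
    """Task-aware: serve one query's I/O to completion at a time."""
    done, now = [], 0.0
    for d in demands:
        now += d
        done.append(now)
    return done
```

For two queries needing 10 sec of I/O each, fair sharing finishes both near 20 sec, while run-to-completion finishes them at 10 and 20 sec: the same total work, but a much better average completion time.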
Optimizing Parallel Query
• Illustrative experiment
pgbench -i -s 1000 test1
pgbench -i -s 1000 test2
Q1: SELECT * FROM pgbench_accounts WHERE filler LIKE '%x%'; (run with -d test1)
Q2: SELECT * FROM pgbench_accounts WHERE filler LIKE '%x%'; (run with -d test2)
QUERY PLAN
--------------------------------------------------------------------------------------
Gather (cost=1000.00..1818916.53 rows=1 width=97)
  Workers Planned: 7
  -> Parallel Seq Scan on pgbench_accounts (cost=0.00..1817916.43 rows=1 width=97)
       Filter: (filler ~~ '%x%'::text)
[Chart: query latency (ms, 0–70) for Q1–Q4 under Deadline, CFQ, NOOP, and Task-aware schedulers.]
Optimizing Parallel Query
• Illustrative experiment
Recall
• Multiple independent layers
[Diagram, repeated from earlier: the caching layer (buffer cache), file system layer, and block layer each independently reorder FG/BG I/Os before they reach the storage device.]
Optimizing Parallel Query
• Illustrative experiment (device queue off)
[Chart: query latency (ms, 0–100) for Q1–Q4 under Deadline, CFQ, NOOP, and Task-aware; Task-aware improves average latency by 23% and Q1 by 73%.]
Optimizing Parallel Query
• Illustrative experiment (4ms idle_slice)
[Chart: query latency (ms, 0–70) for Q1–Q4 under Deadline, CFQ, NOOP, Task-aware, and Task-aware (Idle 4ms); the idle-slice variant improves average latency by a further 17% and Q1 by 2.8X.]
Optimizing Parallel Query
• Ongoing work
  - Handling heavy tasks
  - Extending to CPU scheduling
Homepage: http://apposha.io
Facebook: www.facebook.com/apposha
Mail: [email protected]