
Analyzing and Optimizing Linux Kernel for PostgreSQL

Sangwook Kim · PGConf.Asia 2017

Sangwook Kim
•  Co-founder and CEO @ Apposha
•  Ph.D. in Computer Science
•  Cloud/Virtualization: SMP scheduling [ASPLOS'13, VEE'14]; group-based memory management [JPDC'14]
•  Database/Storage: non-volatile cache management [USENIX ATC'15, ApSys'16]; request-centric I/O prioritization [FAST'17, HotStorage'17]

What We Do

•  Fully-managed & performance-optimized databases for MySQL, PostgreSQL, and MongoDB
•  No special H/W support
•  No code changes for the DB engine

Contents

•  Background tasks

•  Full page writes

•  Parallel query


PostgreSQL Architecture

[Diagram: client requests enter PostgreSQL processes (T1–T4), which issue I/Os through the operating system to the storage device; database performance depends on this whole stack]

* PostgreSQL's processes
-  Backend (foreground)
-  Checkpointer
-  Autovacuum workers
-  WAL writer
-  Writer
-  …

PostgreSQL Checkpoint

[Diagram: checkpoint processing, from http://www.interdb.jp/pg/pgsql09.html]
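Checkpoint frequency and smoothing are governed by a few stock settings; a minimal postgresql.conf sketch with illustrative values (the 10GB max_wal_size matches the talk's setup; the rest are assumptions, not the speaker's recommendations):

checkpoint_timeout = 15min            # upper bound on time between checkpoints
max_wal_size = 10GB                   # WAL volume that triggers a checkpoint
checkpoint_completion_target = 0.9    # spread checkpoint writes over the interval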

Impact of Checkpointer

[Chart: max trx latency (ms, 0–2500) vs. elapsed time (sec, 0–3000); under the Default configuration, checkpoints cause periodic latency spikes of up to ~1.1 sec]

•  Dell PowerEdge R730: 32 cores, 132GB DRAM, 1 SAS SSD
•  PostgreSQL v9.5.6: 52GB shared_buffers, 10GB max_wal_size
•  TPC-C workload: 50GB dataset, 1 hour run, 50 clients

PostgreSQL Autovacuum

[Diagram: vacuum reclaims dead tuples and records reusable space in the Free Space Map; from http://bstar36.tistory.com/308]

Impact of Autovacuum Workers

[Chart: trx throughput (trx/sec, 0–9000) vs. # of clients (1–300); the Aggressive AV configuration improves throughput by ~40% over Default]

Aggressive AV settings:
autovacuum_max_workers           3    => 6
autovacuum_naptime               1min => 15s
autovacuum_vacuum_threshold      50   => 25
autovacuum_analyze_threshold     50   => 10
autovacuum_vacuum_scale_factor   0.2  => 0.1
autovacuum_analyze_scale_factor  0.1  => 0.05
autovacuum_vacuum_cost_delay     20ms => -1
autovacuum_vacuum_cost_limit     -1   => 1000
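To judge whether autovacuum keeps up before tuning it this aggressively, the standard statistics views help; an illustrative query (not from the talk) against pg_stat_user_tables:

SELECT relname, n_dead_tup, last_autovacuum, autovacuum_count
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;  -- tables with the most dead tuples and when autovacuum last ran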

Impact of Autovacuum Workers

[Chart: max trx latency (ms, 0–2500) vs. elapsed time (sec, 0–3000); Aggressive AV raises latency spikes to ~2.5 sec, worse than Default]

Why Background Tasks Matter

• Multiple independent layers

[Diagram: foreground (FG) and background (BG) read()/write() requests pass through the caching layer (buffer cache), file system layer, and block layer down to the storage device; each layer is a separate abstraction that applies its own admission control and reordering, oblivious to task priority]

Why Background Tasks Matter

•  I/O priority inversion

[Diagram: a FG task waits on a lock or condition variable held by a BG task, which is itself waiting on its own (low-priority) I/O; the FG task cannot proceed until the BG I/O completes and the BG task wakes the waiters]

Request-Centric I/O Prioritization

• Solution v1 (reactive)
•  Request-aware I/O processing in the I/O path [FAST'17]
•  I/O priority inheritance [USENIX ATC'15, FAST'17]
•  Locks
•  Condition variables

[Diagram: when a FG task blocks on a lock or condition variable held by a BG task, the BG task registers as the blocker and inherits FG priority, so its submitted I/O completes at high priority before it wakes the FG task]

[USENIX ATC'15] Request-Oriented Durable Write Caching for Application Performance
[FAST'17] Enlightening the I/O Path: A Holistic Approach for Application Performance

Request-Centric I/O Prioritization

• Solution v1 (problem): pervasive changes across the kernel

Synchronization:
linux/include/linux/mutex.h, linux/include/linux/pagemap.h, linux/include/linux/rtmutex.h, linux/include/linux/rwsem.h, linux/include/uapi/linux/sem.h, linux/include/linux/wait.h, linux/kernel/sched/wait.c, linux/kernel/locking/rwsem.c, linux/kernel/futex.c, linux/kernel/locking/mutex.h, linux/kernel/locking/mutex.c, linux/kernel/locking/rtmutex.c, linux/kernel/locking/rwsem-xadd.c

Interface:
linux/kernel/fork.c, linux/kernel/sys.c, linux/kernel/sysctl.c, linux/include/linux/sched.h

Block layer:
linux/include/linux/blk_types.h, linux/include/linux/blkdev.h, linux/include/linux/buffer_head.h, linux/include/linux/caq.h, linux/block/blk-core.c, linux/block/blk-flush.c, linux/block/blk-lib.c, linux/block/blk-mq.c, linux/block/caq-iosched.c, linux/block/cfq-iosched.c, linux/block/elevator.c

File system layer:
linux/fs/buffer.c, linux/fs/ext4/extents.c, linux/fs/ext4/inode.c, linux/include/linux/jbd2.h, linux/fs/jbd2/commit.c, linux/fs/jbd2/journal.c, linux/fs/jbd2/transaction.c

Caching layer:
linux/include/linux/writeback.h, linux/mm/page-writeback.c, linux/include/linux/mm_types.h, linux/fs/buffer.c

Request-Centric I/O Prioritization

• Solution v2 (proactive): the V12 Engine

[Diagram: Linux I/O stack with two Apposha components added: a Front-End File System stacked between the VFS/page cache and Ext4/XFS/F2FS/etc., and an Apposha I/O Scheduler alongside Noop/CFQ/Deadline in the block layer above the device driver]

-  Priority-aware I/O scheduling
-  Control device-level congestion
-  Control writes at the upper layer
-  Tag low priority for background I/Os

Request-Centric I/O Prioritization

[Diagram: the V12 Engine = kernel modules (Front-End File System + I/O Scheduler) plus user-level libraries (V12-M for MongoDB, V12-P for PostgreSQL) that classify task priority and optimize WAL accesses]

Request-Centric I/O Prioritization

[Chart: max trx latency (ms, 0–2500) vs. elapsed time (sec, 0–3000); V12-P bounds latency spikes to ~0.15 sec, versus ~1.1 sec for stock v9.6 and ~2.5 sec for Aggressive AV]

V12-P: V12 Engine for PostgreSQL

New feature since v9.6: checkpoint_flush_after
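For reference, checkpoint_flush_after asks the kernel to write back checkpoint data incrementally instead of letting dirty pages pile up until fsync(); an illustrative postgresql.conf line (256kB is the stock Linux default, shown for concreteness):

checkpoint_flush_after = 256kB   # hint OS writeback after this much checkpoint data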

Contents

•  Background tasks

•  Full page writes

•  Parallel query


Full Page Writes

After each checkpoint, the first modification of a page writes the entire 8kB page image into the WAL (rather than just the tuple-level change), so that a torn page can be reconstructed after a crash. (Diagram from "Tuning PostgreSQL for High Write Workloads" – Grant McAlister.)

Full Page Writes

•  Impact on WAL size

[Chart: WAL size (GB, 0–14) vs. elapsed time (sec, 0–600); with full_page_writes on (Default), WAL grows steeply]

•  PostgreSQL v9.6.5: 52GB shared_buffers, 1GB max_wal_size
•  TPC-C workload: 50GB dataset, 10 min run, 200 clients

WAL Compression

•  Impact on WAL size

[Chart: WAL size (GB, 0–14) vs. elapsed time (sec, 0–600); wal_compression substantially slows WAL growth relative to Default]

•  PostgreSQL v9.6.5: 52GB shared_buffers, 1GB max_wal_size
•  TPC-C workload: 50GB dataset, 10 min run, 200 clients
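Enabling it is a single stock setting (available since v9.5); an illustrative postgresql.conf line:

wal_compression = on   # compress full-page images written to WAL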

WAL Compression

•  Impact on performance

[Chart: trx throughput (trx/sec, 0–12000), Default vs. WAL compression; compression trades CPU for WAL volume and yields only a modest throughput gain]

What if we could safely disable full_page_writes w/o special H/W support?

Atomic Write Support

[Diagram sequence: with full_page_writes off in PostgreSQL, the V12 Engine makes each page write atomic inside the Linux kernel:
1. write(): the data block is staged in memory; a mapping table tracks its on-disk location.
2. fsync(): writeback persists the new block to a fresh on-disk location instead of overwriting the old one.
3. The mapping table entry is atomically switched to point at the new block.
4. The old on-disk block is freed.]

Because the old page version stays intact until the atomic mapping switch, a crash mid-write can never expose a torn page, which is what makes disabling full_page_writes safe.
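On the PostgreSQL side, the user-facing change is then a single setting; illustrative line (safe only when the kernel or storage guarantees atomic page writes, e.g. via the V12 Engine):

full_page_writes = off   # DANGER on a stock kernel: torn pages become unrecoverable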

Atomic Write Support

•  Impact on WAL size

[Chart: WAL size (GB, 0–14) vs. elapsed time (sec, 0–600); V12-P with full_page_writes off keeps WAL far smaller than both Default and WAL compression]

V12-P: V12 Engine for PostgreSQL

Atomic Write Support

•  Impact on performance

[Chart: trx throughput (trx/sec, 0–12000); V12-P delivers ~2X the throughput of Default, well ahead of WAL compression]

V12-P: V12 Engine for PostgreSQL

Contents

•  Background tasks

•  Full page writes

•  Parallel query


Parallel Query

[Diagram: a user query arrives at a backend, which launches parallel workers for scans, aggregates, joins, …]

[Diagram: query response time is gated by the slowest worker: if every worker hits cache, the response is fast; if even one worker misses cache and must go to storage, the whole query is slow]
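As background for the experiments below, the degree of parallelism is a stock setting (since v9.6); an illustrative psql snippet matching the 7 planned workers in the plan shown later:

SET max_parallel_workers_per_gather = 7;  -- allow up to 7 workers per Gather node
EXPLAIN SELECT * FROM pgbench_accounts WHERE filler LIKE '%x%';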

Parallel Query

• Problem inside Linux kernel

[Diagram: fair-based I/O scheduling ("communism"): two queries, each needing 10 sec of I/O, interleave and finish at 18 and 20 sec; add a third and they finish at 24, 26, and 30 sec; with N concurrent queries, every query's latency keeps inflating]

[Diagram: task-aware I/O scheduling ("utilitarianism") instead serves one query's workers to completion: Query1 finishes at 10 sec and Query2 at 20 sec, versus 18 and 20 sec under fair scheduling, improving average latency without delaying the last query]

Optimizing Parallel Query

•  Illustrative experiment

pgbench -i -s 1000 test1
pgbench -i -s 1000 test2

Q1 (on test1): SELECT * FROM pgbench_accounts WHERE filler LIKE '%x%';
Q2 (on test2): SELECT * FROM pgbench_accounts WHERE filler LIKE '%x%';

QUERY PLAN
--------------------------------------------------------------------------------------
Gather  (cost=1000.00..1818916.53 rows=1 width=97)
  Workers Planned: 7
  ->  Parallel Seq Scan on pgbench_accounts  (cost=0.00..1817916.43 rows=1 width=97)
        Filter: (filler ~~ '%x%'::text)

[Chart: query latency (ms, 0–70) for Q1–Q4 under the Deadline, CFQ, NOOP, and Task-aware I/O schedulers]
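To reproduce the scheduler comparison, the legacy block layer exposes the stock schedulers via sysfs (sdX is a placeholder device name; on blk-mq kernels the scheduler names differ):

cat /sys/block/sdX/queue/scheduler               # e.g. noop deadline [cfq]
echo deadline > /sys/block/sdX/queue/scheduler   # switch to deadline (as root)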

Optimizing Parallel Query

•  Illustrative experiment (continued)

[Chart-only slide; details not recoverable from the transcript]

Recall

• Multiple independent layers

[Diagram: the same layered I/O stack as before: FG/BG read()/write() through the buffer cache, file system layer, and block layer abstractions, with reordering on the way to the storage device]

Optimizing Parallel Query

•  Illustrative experiment (device queue off)

[Chart: query latency (ms, 0–100) for Q1–Q4 under Deadline, CFQ, NOOP, and Task-aware with the device queue disabled; annotated gains: Avg 23%, Q1 73%]

Optimizing Parallel Query

•  Illustrative experiment (4ms idle_slice)

[Chart: query latency (ms, 0–70) for Q1–Q4 under Deadline, CFQ, NOOP, Task-aware, and Task-aware with a 4ms idle slice; annotated gains: Avg 17%, Q1 2.8X]

Optimizing Parallel Query

• Ongoing work
•  Handling heavy tasks
•  Extending to CPU scheduling

H http://apposha.io
F www.facebook.com/apposha
M sangwook@apposha.io
