Enlightening the I/O Path: A Holistic Approach for Application Performance (USENIX FAST '17)
Sangwook Kim (1, 3), Hwanju Kim (2), Joonwon Lee (3), and Jinkyu Jeong (3)
(1) Apposha
(2) Dell EMC
(3) Sungkyunkwan University
Data-Intensive Applications
Relational, Key-value, Search, Column, Document
Data-Intensive Applications
• Common structure
[Figure: a Client sends a Request to the Application and receives a Response; application threads T1 to T4 issue I/O through the Operating System to the Storage Device. Label: Application performance]
* Example: MongoDB
- Client (foreground)
- Checkpointer
- Log writer
- Eviction worker
- …
Background tasks are problematic
for application performance
Application Impact
• Illustrative experiment
• YCSB update-heavy workload against MongoDB
[Figure: operation throughput (ops/sec, 0 to 30000) vs. elapsed time (sec, 0 to 1800) for CFQ, CFQ-IDLE, SPLIT-A, SPLIT-D, and QASIO]
• CFQ plot annotations: regular checkpoint task; 30 seconds latency at 99.99th percentile
• CFQ-IDLE: I/O priority does not help
• SPLIT-A, SPLIT-D, QASIO: state-of-the-art schedulers do not help much
What’s the Problem?
• Independent policies in multiple layers
• Each layer processes I/Os w/ limited information
• I/O priority inversion
• Background I/Os can arbitrarily delay foreground tasks
Multiple Independent Layers
• Independent I/O processing
[Figure: the I/O path as a stack of abstraction layers, top to bottom: Application, Caching Layer, File System Layer, Block Layer, Storage Device]
• Caching layer: read() and write() go through the buffer cache, subject to admission control
• Block layer: the block-level queue applies admission control, then reorders foreground (FG) and background (BG) I/Os
• Storage device: the device-internal queue applies admission control, then reorders FG and BG I/Os again
I/O Priority Inversion
• Task dependency
[Figure: the layered I/O path (Application, Caching Layer, File System Layer, Block Layer, Storage Device) annotated with locks and condition variables]
• Locks: an FG task blocks on a lock held by a BG task that is waiting for I/O
• Condition variables: an FG task waits on a condition variable until a BG task, itself waiting for I/O, wakes it
• User-level variables: an FG task waits on a user-level variable until a BG task wakes it
I/O Priority Inversion
• I/O dependency
• An FG task waits for outstanding I/Os to complete, for ensuring consistency and/or durability
Our Approach
• Request-centric I/O prioritization (RCP)
• Critical I/O: I/O in the critical path of request handling
• Policy: holistically prioritizes critical I/Os along the I/O path
[Figure: operation throughput (ops/sec, 0 to 35000) vs. elapsed time (sec, 0 to 1800) for CFQ, CFQ-IDLE, SPLIT-A, SPLIT-D, QASIO, and RCP; annotation: 100 ms latency at 99.99th percentile]
Challenges
• How to accurately identify I/O criticality
• How to effectively enforce I/O criticality
Critical I/O Detection
• Enlightenment API
• Interface for tagging foreground tasks
• I/O priority inheritance
• Handling task dependency
• Handling I/O dependency
I/O Priority Inheritance
• Handling task dependency
• Locks
• Condition variables
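[Figure, locks: FG blocks on a lock held by BG; BG inherits FG's priority, submits and completes its I/O, then unlocks]
[Figure, condition variables: BG registers with the CV; when FG waits on the CV, BG inherits FG's priority, submits and completes its I/O, then wakes FG]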
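The lock and condition-variable cases can be sketched as a small user-space model. This is only an illustration under stated assumptions, not the kernel implementation: the class names `Task`, `InheritingLock`, and `InheritingCondVar`, and the single `inherited` flag, are hypothetical and chosen for readability.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Task:
    name: str
    critical: bool = False    # tagged as a foreground task
    inherited: bool = False   # boosted because a critical task depends on it

    def is_critical(self) -> bool:
        return self.critical or self.inherited


@dataclass
class InheritingLock:
    owner: Optional[Task] = None
    waiters: List[Task] = field(default_factory=list)

    def acquire(self, task: Task) -> bool:
        """Return True if acquired; otherwise queue the task and boost the owner."""
        if self.owner is None:
            self.owner = task
            return True
        self.waiters.append(task)
        if task.is_critical():
            self.owner.inherited = True   # the owner's I/O is now treated as critical
        return False

    def release(self) -> Optional[Task]:
        """Drop the inherited priority and hand the lock to the next waiter."""
        if self.owner is None:
            return None
        self.owner.inherited = False
        self.owner = self.waiters.pop(0) if self.waiters else None
        # The new owner inherits if a critical task is still waiting behind it.
        if self.owner is not None and any(w.is_critical() for w in self.waiters):
            self.owner.inherited = True
        return self.owner


@dataclass
class InheritingCondVar:
    signaler: Optional[Task] = None   # task registered as the one that will wake waiters
    waiters: List[Task] = field(default_factory=list)

    def register(self, task: Task) -> None:
        self.signaler = task

    def wait(self, task: Task) -> None:
        self.waiters.append(task)
        if task.is_critical() and self.signaler is not None:
            self.signaler.inherited = True

    def wake(self) -> List[Task]:
        """Wake all waiters and drop the signaler's inherited priority."""
        woken, self.waiters = self.waiters, []
        if self.signaler is not None:
            self.signaler.inherited = False
        return woken
```

In this model a background checkpointer that holds the lock becomes critical as soon as a foreground client blocks on it, and reverts to non-critical on release.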
I/O Priority Inheritance
• Handling I/O dependency
[Figure: the block layer has two stages, a queue (Q) admission stage followed by a scheduler queueing stage]
• Non-critical I/O tracking: a per-device root holds one entry per non-critical I/O (NCIO)
• Each entry records: descriptor, location, resolver, sector #
• An entry is allocated when a non-critical I/O enters the Q admission stage
• The entry is updated as the I/O moves through the Q admission and scheduler queueing stages
• The entry is deleted on completion
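A minimal sketch of the non-critical I/O tracking lifecycle (allocate, update, delete on completion) is given here. The field names follow the slide labels, but several things are assumptions made for illustration: the names `NcioTracker`, `NcioEntry`, and `Location`, and the choice to key entries by sector number. The `resolver` field is left as an opaque placeholder because the slides do not say what it holds.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Optional


class Location(Enum):
    ADMISSION = "queue admission stage"
    SCHED_QUEUE = "scheduler queueing stage"


@dataclass
class NcioEntry:
    descriptor: str                  # stand-in for the I/O descriptor
    sector: int
    location: Location
    resolver: Optional[object] = None  # placeholder; meaning not given in the slides


class NcioTracker:
    """Per-device index of in-flight non-critical I/Os (assumed keyed by sector)."""

    def __init__(self) -> None:
        self._by_sector: Dict[int, NcioEntry] = {}

    def allocate(self, descriptor: str, sector: int) -> NcioEntry:
        # A new non-critical I/O starts out at the queue admission stage.
        entry = NcioEntry(descriptor, sector, Location.ADMISSION)
        self._by_sector[sector] = entry
        return entry

    def update(self, sector: int, location: Location) -> None:
        # Record where the I/O currently sits as it moves along the block layer.
        self._by_sector[sector].location = location

    def lookup(self, sector: int) -> Optional[NcioEntry]:
        # A critical task that depends on this sector can find the pending I/O here.
        return self._by_sector.get(sector)

    def delete_on_completion(self, sector: int) -> None:
        self._by_sector.pop(sector, None)
```

Knowing the current location is what would let a dependent critical task decide how to push the I/O forward, for example by retrying admission or by promoting it in the scheduler queue.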
Handling Transitive Dependency
• Possible states of dependent task
• Blocked on task
• Blocked on I/O
• Blocked at admission stage
• Recording blocking status
• Blocked on task: the task is recorded, so the priority is inherited again by that task
• Blocked on I/O: the I/O is recorded, so it can be reprioritized (reprio)
• Blocked at admission stage: the admission is retried (retry)
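The three blocking states can be walked with a short loop once each task records what it is blocked on. The following is a sketch under assumptions, not the authors' code: `Task`, `PendingIO`, the `ADMISSION` marker, and the `inherit` function are hypothetical names, and the log strings exist only to make the order of actions visible.

```python
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass
class PendingIO:
    name: str
    critical: bool = False


ADMISSION = "admission"   # marker: the task is blocked at the admission stage


@dataclass
class Task:
    name: str
    critical: bool = False
    # Recorded blocking status: another task, a pending I/O, or the admission stage.
    blocked_on: Optional[Union["Task", PendingIO, str]] = None


def inherit(task: Optional[Task], log: List[str]) -> None:
    """Make `task` critical, then follow its recorded blocking status."""
    seen = set()
    while task is not None and id(task) not in seen:
        seen.add(id(task))            # guard against dependency cycles
        task.critical = True
        log.append(f"inherit {task.name}")
        target = task.blocked_on
        if isinstance(target, Task):
            task = target             # blocked on task: inherit transitively
        elif isinstance(target, PendingIO):
            target.critical = True    # blocked on I/O: reprioritize that I/O
            log.append(f"reprio {target.name}")
            task = None
        elif target == ADMISSION:
            log.append(f"retry {task.name}")   # blocked at admission: retry as critical
            task = None
        else:
            task = None               # not blocked: nothing more to do
```

For a chain where `bg1` waits on `bg2` and `bg2` waits on an I/O, calling `inherit(bg1, log)` marks both tasks and the I/O critical in that order.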
Challenges
• How to accurately identify I/O criticality
• Enlightenment API
• I/O priority inheritance
• Recording blocking status
• How to effectively enforce I/O criticality
Criticality-Aware I/O Prioritization
• Caching layer
• Apply low dirty ratio for non-critical writes (1% by default)
• Block layer
• Isolate allocation of block queue slots
• Maintain 2 FIFO queues
• Schedule critical I/O first
• Limit # of outstanding non-critical I/Os (1 by default)
• Support queue promotion to resolve I/O dependency
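The block-layer policy listed above (two FIFO queues, critical first, a cap on outstanding non-critical I/Os, and queue promotion) can be modeled in a few lines. This is a simplified sketch, assuming hypothetical names `Request` and `TwoQueueScheduler`; it does not model the isolated allocation of block queue slots or the caching-layer dirty ratio.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Optional


@dataclass
class Request:
    sector: int
    critical: bool


class TwoQueueScheduler:
    def __init__(self, max_noncritical_outstanding: int = 1) -> None:
        self.critical_q: Deque[Request] = deque()
        self.noncritical_q: Deque[Request] = deque()
        self.max_nc = max_noncritical_outstanding   # 1 by default, as on the slide
        self.nc_outstanding = 0

    def enqueue(self, req: Request) -> None:
        (self.critical_q if req.critical else self.noncritical_q).append(req)

    def promote(self, sector: int) -> bool:
        """Move a queued non-critical request to the critical queue."""
        req = next((r for r in self.noncritical_q if r.sector == sector), None)
        if req is None:
            return False
        self.noncritical_q.remove(req)
        req.critical = True
        self.critical_q.append(req)
        return True

    def dispatch(self) -> Optional[Request]:
        """Critical I/O first; non-critical only while under the outstanding cap."""
        if self.critical_q:
            return self.critical_q.popleft()
        if self.noncritical_q and self.nc_outstanding < self.max_nc:
            self.nc_outstanding += 1
            return self.noncritical_q.popleft()
        return None

    def complete(self, req: Request) -> None:
        if not req.critical:
            self.nc_outstanding -= 1
```

With the default cap of 1, a second non-critical request stays queued until the first completes, unless a dependency promotes it to the critical queue.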
Evaluation
• Implementation on Linux 3.13 w/ ext4
• Application studies
• PostgreSQL relational database
• Backend processes as foreground tasks
• I/O priority inheritance on LWLocks (semop)
• MongoDB document store
• Client threads as foreground tasks
• I/O priority inheritance on Pthread mutex and condition vars (futex)
• Redis key-value store
• Master process as foreground task
Evaluation
• Experimental setup
• 2 Dell PowerEdge R530 (server & client)
• 1TB Micron MX200 SSD
• I/O prioritization schemes
• CFQ (default), CFQ-IDLE
• SPLIT-A (priority), SPLIT-D (deadline) [SOSP’15]
• QASIO [FAST’15]
• RCP
Application Throughput
• PostgreSQL w/ TPC-C workload
[Figure: transaction throughput (trx/sec, 0 to 7000) for the 10GB, 60GB, and 200GB datasets under CFQ, CFQ-IDLE, SPLIT-A, SPLIT-D, QASIO, and RCP; annotations: 37%, 31%, 28%]
Application Throughput
• Impact on background task
[Figure: transaction log size (GB, 0 to 35) vs. elapsed time (sec, 0 to 1800) for CFQ, CFQ-IDLE, SPLIT-A, SPLIT-D, QASIO, and RCP]
• Our scheme improves application throughput w/o penalizing background tasks
Application Latency
• PostgreSQL w/ TPC-C workload
[Figure: CCDF P[X>=x] (log scale, 10^0 down to 10^-5, labeled 0th to 99.999th percentile) of transaction latency (msec, 0 to 6000) for CFQ, CFQ-IDLE, SPLIT-A, SPLIT-D, QASIO, and RCP]
• Over 2 sec at 99.9th
• 300 msec at 99.999th
• Our scheme is effective for improving tail latency
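The latency plot is a complementary CDF: each point gives the fraction of transactions whose latency is at least x, so the 99.9th percentile corresponds to P[X>=x] = 10^-3. A minimal sketch of how such a curve could be computed from raw latency samples follows; the function names `ccdf` and `tail_latency` and the sample values are illustrative assumptions, not data from the talk.

```python
from typing import List, Optional, Sequence, Tuple


def ccdf(samples: Sequence[float]) -> List[Tuple[float, float]]:
    """Return (x, P[X >= x]) for each distinct sample value x, ascending."""
    ordered = sorted(samples)
    n = len(ordered)
    points = []
    for i, x in enumerate(ordered):
        # At the first occurrence of x, exactly n - i samples are >= x.
        if i == 0 or x != ordered[i - 1]:
            points.append((x, (n - i) / n))
    return points


def tail_latency(samples: Sequence[float], tail_probability: float) -> Optional[float]:
    """Smallest observed x with P[X >= x] <= tail_probability, or None if too few samples."""
    for x, p in ccdf(samples):
        if p <= tail_probability:
            return x
    return None


# Made-up latencies in msec, only to show the shape of the result.
example = [1, 2, 2, 3, 10]
curve = ccdf(example)   # [(1, 1.0), (2, 0.8), (3, 0.4), (10, 0.2)]
```

Reading a tail such as 10^-5 needs at least 100,000 samples; with fewer, `tail_latency` returns `None` rather than extrapolating.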
Summary of Other Results
• Performance results
• MongoDB: 12%-201% throughput, 5x-20x latency at 99.9th
• Redis: 7%-49% throughput, 2x-20x latency at 99.9th
• Analysis results
• System latency analysis using LatencyTOP
• System throughput vs. Application latency
• Need for holistic approach
Conclusions
• Key observation
• All the layers in the I/O path should be considered as a whole with I/O priority inversion in mind for effective I/O prioritization
• Request-centric I/O prioritization
• Enlightens the I/O path solely for application performance
• Improves throughput and latency of real applications
• Ongoing work
• Making the implementation practical
• Applying RCP to database cluster with multiple replicas