
1

A New Approach to File System Cache Writeback of Application Data

Sorin Faibish – EMC Distinguished Engineer

P. Bixby, J. Forecast, P. Armangau and S. Pawar

EMC USD Advanced Development

SYSTOR 2010, May 24-26, 2010, Haifa, Israel

2

Outline

Motivation: changes in server technology

Cache writeback problem statement

Monitoring behavior of application data flush

Cache writeback as a closed loop system

Current cache writeback methods are obsolete

I/O “slow down” problem

New algorithms for cache writeback

Simulation results of new algorithms

Experimental results of a real NFS server

Summary and conclusions

Future work and extension to Linux FS

3

Motivation: changes in server technology

Large numbers of cores in CPUs: more computing power

Larger, cheaper memory caches: the amount of cached data is very large

Very large disk drives, but only a modest increase in disk throughput

Application data I/O has increased much faster, but still requires constant flushing to disk

Cache writeback is used to smooth bursty I/O traffic to disk

Conclusion: cache writeback of large amounts of application data is slower

4

Cache writeback problem statement

I/O speeds are increasing, forcing servers to cache large amounts of dirty pages to hide disk latency

Large numbers of clients access the servers, increasing the burstiness of disk I/O and the need for cache

Large FS and server caches allow longer retention

Cache writeback flush is based on cache fullness metrics

Flush to disk is done at maximum speed when the cache is full, leaving no room for additional I/Os

As long as the cache is full, I/Os have to wait for empty cache pages to become available: I/O “stoppage”

Result: application performance is lower than disk performance

5

Monitoring behavior of application data flush

Understanding the problem:
• Instrument the kernel to measure cache Dirty Pages (DP) dynamics
• Monitor the behavior of DP in the Buffer Cache
• Run a multi-client benchmark application

6

Cache writeback as a closed loop system

Application controls the flush using I/O commits based on the application cache state
– DP in cache are the difference between incoming I/O and DP flushed to disk
– Goal is to keep the difference/error at zero
– The error loop is closed because the application sends commits after each I/O
– Cache writeback is controlled by the application

Flush to disk is based on the state of fullness of the Buffer Cache
– The cache control mechanism ensures cache availability for new I/Os
– DP in cache are like water in a tank
– The water level is controlled by the cache manager to prevent overflow
– No relation between application I/O arrival and when the I/O is flushed to disk
– Results in large delays between I/O creation and I/O on disk: open loop
– Cache writeback is controlled by the algorithm

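[Figure: block diagram of application-controlled writeback. Blocks: Dirty Pages & Buffer Cache Dynamics, Cache Writeback Algorithm. Signals: User I/Os (+), I/Os Flushed (-), I/Os in Cache, Application Commits, Dirty Pages.]

[Figure: block diagram of cache-fullness-controlled writeback. Blocks: Dirty Pages & Buffer Cache Dynamics, Cache Writeback Algorithm, Sample Dirty Pages, Delay (sec). Signals: User I/Os (+), Watermark Flushes (-), I/Os in Cache, Dirty Pages.]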
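The "water in a tank" picture above can be illustrated with a small balance equation: the dirty-page level rises with incoming I/Os and falls with flushes, and anything that does not fit has to wait. The following is only a minimal sketch; the function name `step_cache`, its parameters and the page counts are hypothetical and are not taken from the presentation.

```python
def step_cache(dirty, incoming, flushed, capacity):
    """Advance the dirty-page count by one sample interval.

    Returns (new_dirty, stalled), where stalled is the number of incoming
    pages that found no free cache page and had to wait.
    All names and units here are illustrative assumptions.
    """
    # Flushing can only remove pages that are actually dirty.
    dirty_after_flush = max(dirty - flushed, 0)
    # Free room left in the "tank" after the flush.
    free = capacity - dirty_after_flush
    # Only as many incoming pages as fit are accepted; the rest stall.
    accepted = min(incoming, free)
    stalled = incoming - accepted
    return dirty_after_flush + accepted, stalled


dirty, stalled = step_cache(dirty=900, incoming=300, flushed=100, capacity=1000)
print(dirty, stalled)  # 1000 100
```

In this example the cache ends the interval full, and 100 pages of new I/O are left waiting, which is the situation the deck calls I/O “stoppage”.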

7

Current cache writeback methods

Trickle flush of DPs
– Flush based on a proportion of incoming application I/Os (rate based)
– Uses low priority to reduce CPU consumption
– Background task with low efficiency
– Used only to reduce memory pressure
– Cannot address high bursts of I/O

Watermark-based flush of DPs
– Inspired by database and transactional applications
– Cache writeback is triggered by the number/proportion of DPs in the cache
– There is no prediction of high I/O bursts: a disadvantage for multi-client workloads
– Flush is done at maximum disk speed to reduce latency
– Close to the incoming I/O rate for small caches: flushes often
– Inefficient for very large caches
– Interferes with metadata and read operations

[Figure: watermark-based flush. Labels: N Dirty Pages/sec arriving, n Flushes/sec, Watermark increase (N-n)*t, File System user Dirty Pages, Other Dirty Pages.]
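The watermark behaviour described on this slide can be sketched as a small simulation. With no flush running the level grows by N*t per interval, and while the flush runs it changes by (N-n)*t. This is a minimal sketch under assumptions: the function `simulate_watermark`, the high/low watermark pair and all numbers are hypothetical, chosen only to show the sawtooth pattern.

```python
def simulate_watermark(arrivals, high, low, max_flush):
    """Return the dirty-page level after each interval.

    arrivals:  pages dirtied per interval (N * t for each interval)
    high/low:  watermarks in pages; flushing starts once the level
               reaches high and stops once it is at or below low
    max_flush: pages the disk can flush per interval (n * t)
    All names and values are illustrative assumptions.
    """
    level, flushing, history = 0, False, []
    for arrived in arrivals:
        level += arrived
        if level >= high:
            flushing = True
        if flushing:
            # Flush at maximum disk speed, as in the watermark method.
            level = max(level - max_flush, 0)
            if level <= low:
                flushing = False
        history.append(level)
    return history


history = simulate_watermark([100] * 12, high=400, low=100, max_flush=250)
print(history)
# [100, 200, 300, 150, 0, 100, 200, 300, 150, 0, 100, 200]
```

Even with a perfectly constant arrival rate, the level oscillates between the watermarks and the disk alternates between idle and full speed.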

8

Current cache writeback deficiency

Watermark-based flush of DPs is similar to a non-linear saturation effect in the cache closed loop

The saturation introduces oscillations in the DP behavior

The oscillations add I/O latency on top of the disk latency

Makes the disk I/O bursty, which reduces aggregate performance

[Figure: dirty pages (blue, Memory [MB]) and their rate of change (green, Rate [MB/sec]) versus Time [sec], from 280 to 360 sec.]

9

I/O “slow down” problem

Application data flush requires FS MD updates to the same disks

Flush is triggered when the high watermark threshold is crossed

Watermark-based flushes cannot throttle the I/O speed because they are a last resort before the kernel crashes from starvation

Additional I/Os are slowed down until the MD for the newly arriving I/Os is flushed

Even if NVRAM is used, the DP need to be removed from the cache to make room for additional I/Os

Application I/O latency increases until the cache is freed: “slow down”

In the worst cases the latency is so high that it resembles an I/O stoppage

If an additional burst of I/Os arrives from other new clients, there is no room for them, and the new I/Os wait until the DP level drops below the low watermark: stoppage

10

New algorithms for cache writeback

Try to address the deficiencies of current cache writeback methods

Inspired by control system and signal processing theory

Use adaptive control and machine learning methods

Make better use of modern HW characteristics

The goals of the solution are:
– Reduce the I/O slowdown so that it is limited only by the maximum disk I/O throughput
– Reduce disk I/O burstiness to a minimum
– Maximize the aggregate I/O performance of the system (benchmark)

The same algorithms apply to network as well as local FSs

All the algorithms can be used to flush both application DPs and MD DPs

11

New algorithms for cache writeback (cont.)

We present and simulate only 5 algorithms (more were considered):
– Modified Trickle Flush: an improved version of trickle flush that changes the priority and uses more CPU
– Fixed Interval Algorithm: uses a target number of DPs as a goal, similar to watermark methods, but compensates better for bursts of I/O (semi-throttling) by pacing the flush to disk
– Variable Interval Algorithm: uses an adaptive control scheme that adapts the time interval based on the change in DP during the previous interval; similar to trickle but with faster adaptation in response to I/O bursts
– Quantum Flush: uses the idea of lowest retention of DP in cache, similar to watermark-based methods, but adapts the flush speed in proportion to the number of new I/Os in the previous sample time
– Rate of Change Proportional Algorithm: flushes DPs in proportion to the first derivative of the number of DPs, using a fixed interval and a forgetting factor proportional to the difference between the I/O rate and the maximum disk throughput:

$$c = R\,(t - t_i) + W\,\mu$$

$$\mu = \alpha\,\frac{B - R}{B}$$
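The two formulas above can be evaluated directly, as in the sketch below. The slide does not define its symbols, so the reading used here is an assumption: R is taken as the I/O rate, t - t_i as the time since the last sample, W as the current dirty-page backlog, B as the maximum disk throughput, α as a tuning constant and c as the number of pages to flush. The function and parameter names are hypothetical.

```python
def forgetting_factor(alpha, disk_max, io_rate):
    """mu = alpha * (B - R) / B, with B = disk_max and R = io_rate (assumed meanings)."""
    return alpha * (disk_max - io_rate) / disk_max


def pages_to_flush(io_rate, elapsed, backlog, alpha, disk_max):
    """c = R * (t - t_i) + W * mu, with W = backlog (assumed meaning)."""
    mu = forgetting_factor(alpha, disk_max, io_rate)
    # First term tracks the pages dirtied since the last sample;
    # second term drains a fraction mu of the existing backlog.
    return io_rate * elapsed + backlog * mu


c = pages_to_flush(io_rate=500.0, elapsed=1.0, backlog=2000.0,
                   alpha=0.16, disk_max=1000.0)
print(round(c, 6))
```

Under this reading, μ shrinks to zero as the I/O rate approaches the maximum disk throughput, so the flush then only keeps up with new arrivals; when the disk has spare bandwidth, a larger share of the backlog is drained. The value α = 0.16 matches the legend of the simulation plot, and the other numbers are made up for illustration.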

12

Simulation results of new algorithms

Selection of the best algorithm by:
– Optimal behavior in response to unexpected bursts of I/Os
– Flush best matching the rate of change in DPs in the cache (minimum DP level)
– Minimal I/O slow down for clients (reduced average I/O latency)

Rate of change based algorithm with forgetting factor was best

[Figure: Dirty Pages in the Buffer Cache for all Algorithms - best version. # of Dirty Pages (0 to 3000) versus Time [sec] (0 to 100) for Trickle 1 sec, FIA, FVA, Quantum and Rate alpha=0.16.]

13

Experimental results of a real NFS server

We implemented the Modified Trickle (MT) and Rate Proportional algorithms on the Celerra NAS server

We used the SPEC sfs2008 benchmark and measured the number of DP in cache with 4 msec resolution

Experimental results show some I/O slowdown with the MT algorithm, resulting in 92K NFS iops (diagrams sampled at the same 55K NFS iops level)

The Rate Proportional algorithm shows a much shorter I/O slow down time, resulting in 110.6K NFS iops

[Figure: Trickle Algorithm. Dirty Pages in BC (green) and User I/Os [1000 IO/sec] (red) versus Time [sec], 0 to 100 sec.]

[Figure: Rate Proportional Algorithm. Dirty Pages in BC (green) and User I/Os [1000 IO/sec] (red) versus Time [sec], 0 to 100 sec.]

14

Summary and conclusions

We discussed new algorithms and a new paradigm for cache writeback in modern FS and servers

We discussed how the new algorithms can reduce the impact of bursts of application I/Os on the aggregate I/O performance, which is otherwise bounded by the maximum disk speed

We showed how current cache writeback algorithms create I/O slowdown at I/O speeds that are lower than the disk speed but change rapidly

We presented a small number of algorithms from the literature and explained their deficiencies

We discussed several new algorithms and showed simulation results that allowed us to select the best algorithm for experimentation

We presented experimental results for 2 algorithms and showed that Rate Proportional is the best algorithm based on the given success criteria

Finally, we discussed how these algorithms can be used for MD and DP on any file system, network or local

15

Future work and extension to Linux FS

Investigate additional algorithms, inspired by signal processing of non-linear signals, that address oscillatory behavior

Address similar behavior in cache writeback of local file systems, including ext3, ReiserFS and ext4 in Linux OS (a discussion at next Linux workshop)

Linux FS developers are aware of this behavior and are currently working to instrument the Linux kernel with the same measurement tools we used

We are also looking at machine learning to compensate for very fast I/O rate changes, which would allow us to optimize application performance for very large numbers of clients

Additional work is needed to find algorithms that allow the maximum application performance to equal the maximum aggregate disk performance

We are also looking to instrument the NFS clients’ kernel so that we can evaluate the I/O slow down and tune the flush algorithm to reduce the slow down effect to zero

More work is needed to extend this study to MD and to find new MD-specific flushing methods
