improve linux swap for high speed flash storage · • tlb flush is page based • tlb flush is...

15
Fusion-io Confidential—Copyright © 2013 Fusion-io, Inc. All rights reserved. Fusion-io Confidential—Copyright © 2013 Fusion-io, Inc. All rights reserved. Improve Linux SWAP For High Speed Flash Storage Shaohua Li <[email protected]>

Upload: others

Post on 18-Jul-2020

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Improve Linux SWAP For High Speed Flash Storage · • TLB flush is page based • TLB flush is involved by several tasks Solution: • Improve smp_call_function_many() • The first

Fusion-io Confidential—Copyright © 2013 Fusion-io, Inc. All rights reserved. Fusion-io Confidential—Copyright © 2013 Fusion-io, Inc. All rights reserved.

Improve Linux SWAP For High Speed Flash Storage

Shaohua Li <[email protected]>

Page 2: Improve Linux SWAP For High Speed Flash Storage · • TLB flush is page based • TLB flush is involved by several tasks Solution: • Improve smp_call_function_many() • The first

Background

▸ Partially replace DRAM with Flash storage (SSD) ▸ Compared to DRAM, Flash has:

•  Low price •  High density •  Low power

▸ Reasonable latency is tolerable

Page 3: Improve Linux SWAP For High Speed Flash Storage · • TLB flush is page based • TLB flush is involved by several tasks Solution: • Improve smp_call_function_many() • The first

How does swap work?

Page in active LRU list

Page in inactive LRU list

Add page to swap cache

Unmap page and set PTE

Pageout page

Remove page from swap cache

Free page

Page fault

Allocate new page

Add page to swap cache

Pagein page

Set PTE

SWAPOUT SWAPIN

Page 4: Improve Linux SWAP For High Speed Flash Storage · • TLB flush is page based • TLB flush is involved by several tasks Solution: • Improve smp_call_function_many() • The first

SSD specific characteristics

▸ Fast ▸ No seek penalty ▸ Big size request has better throughput ▸ High iodepth has better throughput ▸ Discard

Page 5: Improve Linux SWAP For High Speed Flash Storage · • TLB flush is page based • TLB flush is involved by several tasks Solution: • Improve smp_call_function_many() • The first

SWAPOUT – TLB flush

▸ At least 2 TLB flushes •  Clear PTE ‘A’ bit •  Clear PTE ‘P’ bit

▸ Overhead is quite high •  TLB flush is page based •  TLB flush is involved by several tasks

▸ Solution: •  Improve smp_call_function_many() •  The first TLB flush can be removed in x86? •  Batch TLB flush?

Page 6: Improve Linux SWAP For High Speed Flash Storage · • TLB flush is page based • TLB flush is involved by several tasks Solution: • Improve smp_call_function_many() • The first

SWAPOUT – swap_map scan

▸ swap_map entry - in-memory data structure to track swap disk usage

▸ Slow linear memory scan to find a cluster (128 adjacent pages) - to produce big size IO request

▸ Solution: cluster list •  Pros - O(1) algorithm •  Cons - restrictive cluster alignment •  Only enabled for SSD

Page 7: Improve Linux SWAP For High Speed Flash Storage · • TLB flush is page based • TLB flush is involved by several tasks Solution: • Improve smp_call_function_many() • The first

SWAPOUT – IO pattern

▸ Interleaved IO pattern •  Multiple reclaimers •  New found cluster is shared by all reclaimers

▸ Block layer can’t merge the interleaved IO completely

▸ Solution: per-cpu cluster •  Reclaimer does sequential IO •  Easy to do IO merge in block layer

Page 8: Improve Linux SWAP For High Speed Flash Storage · • TLB flush is page based • TLB flush is involved by several tasks Solution: • Improve smp_call_function_many() • The first

SWAPIN

▸ Page fault does sync IO - iodepth 1, page size IO request

▸ Need swap readahead to produce: •  Big size IO request •  High iodepth IO •  Parallel IO and CPU

▸ Userspace readahead API •  madvise(MADV_WILLNEED) is extended to do swap

prefetch

Page 9: Improve Linux SWAP For High Speed Flash Storage · • TLB flush is page based • TLB flush is involved by several tasks Solution: • Improve smp_call_function_many() • The first

SWAPIN - cont

▸ In-kernel readahead •  Arbitrary readahead (always 8 pages) •  Random access workload ▸ Unnecessary currently ▸ Waste IO and increase memory pressure ▸  Let readahead aware workload is random

•  Sequential access workload ▸ Not enough currently ▸ Hard to do - can’t guarantee sequentially accessed pages

swapped out sequentially •  Sequentially accessed pages might not live adjacently in LRU

list •  Adjacent pages of LRU list might be swapout by different

reclaimers

Page 10: Improve Linux SWAP For High Speed Flash Storage · • TLB flush is page based • TLB flush is involved by several tasks Solution: • Improve smp_call_function_many() • The first

Lock contentions

▸ Lock contentions are high •  Concurrent swapout - kswapd, direct page reclaim •  Concurrent swapin – page fault from each task

▸ Solution: •  anon_vma mutex - now it’s a rw_semaphore •  swap_lock and swap address space lock ▸ Per-swap lock now (a workaround) ▸ Still have lock contention with very high speed SSD

Page 11: Improve Linux SWAP For High Speed Flash Storage · • TLB flush is page based • TLB flush is involved by several tasks Solution: • Improve smp_call_function_many() • The first

SWAP discard

▸ Discard is important to optimize SSD write throughput

▸ swap discard implementation is synchronous •  Block layer discard API is sync (introduce delay) •  Discard just before write is useless

▸ Solution: async swap discard •  No delay •  Discard and write can run in the same time •  Discard is cluster based

Page 12: Improve Linux SWAP For High Speed Flash Storage · • TLB flush is page based • TLB flush is involved by several tasks Solution: • Improve smp_call_function_many() • The first

Other issues

▸ Page reclaim policy – bias swap? ▸ Huge page swap

Page 13: Improve Linux SWAP For High Speed Flash Storage · • TLB flush is page based • TLB flush is involved by several tasks Solution: • Improve smp_call_function_many() • The first

Benchmark – sequential workload

Page 14: Improve Linux SWAP For High Speed Flash Storage · • TLB flush is page based • TLB flush is involved by several tasks Solution: • Improve smp_call_function_many() • The first

Benchmark – random workload

Page 15: Improve Linux SWAP For High Speed Flash Storage · • TLB flush is page based • TLB flush is involved by several tasks Solution: • Improve smp_call_function_many() • The first

f u s i o n i o . c o m | R E D E F I N E W H A T ’ S P O S S I B L E