practical, transparent operating system support for superpages
CS 443 Advanced OS
David R. Choffnes, Spring 2005
Practical, transparent operating system support
for superpages
Juan Navarro, Sitaram Iyer, Peter Druschel, Alan Cox
(Rice University)
Appears in: Fifth Symposium on Operating Systems Design and Implementation
(OSDI 2002)
Presented by: David R. Choffnes
3
Introduction
TLB coverage
– Definition
– Effect on performance

Superpages
– Wasted memory
– Fragmentation

Contribution
– General, transparent superpage support
– Deals with fragmentation
– Contiguity-aware page replacement algorithm
– Demotion/eviction of dirty superpages
4
The Superpage Problem
[Figure: TLB coverage trend. TLB coverage as a percentage of main memory has decreased by a factor of 1000 in 15 years; TLB miss overheads have grown from roughly 5% to 5-10%, and up to 30% in some cases.]
5
The Superpage Problem
Increasing TLB coverage
– More TLB entries is expensive
– A larger page size leads to internal fragmentation and increased I/O
– Solution: use multiple page sizes

Superpage definition

Hardware-imposed constraints
– Finite set of page sizes (a subset of powers of 2)
– Contiguity
– Alignment
6
A superpage TLB
[Figure: a superpage TLB translating virtual to physical addresses. Entries map either a single base page (size = 1) or a superpage (size = 4) that covers contiguous, aligned pages in virtual memory and frames in physical memory.]

Supported page sizes:
– Alpha: 8, 64, 512 KB; 4 MB
– Itanium: 4, 8, 16, 64, 256 KB; 1, 4, 16, 64, 256 MB
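The hardware constraints above can be sketched in a few lines. This is an illustrative check (not code from the paper): a superpage mapping needs a supported power-of-two size and size-aligned virtual and physical addresses; the function names and the use of Alpha's page sizes are assumptions for the example.

```python
# Sketch: hardware constraints on a superpage mapping -- a supported
# size, plus virtual and physical addresses aligned to that size
# (contiguity follows from using a single mapping for the region).
# Sizes below are the Alpha's, in bytes; the names are illustrative.

ALPHA_PAGE_SIZES = [8 << 10, 64 << 10, 512 << 10, 4 << 20]  # 8K, 64K, 512K, 4M

def valid_superpage(vaddr, paddr, size, sizes=ALPHA_PAGE_SIZES):
    """True iff (vaddr -> paddr, size bytes) can be one TLB entry."""
    return size in sizes and vaddr % size == 0 and paddr % size == 0

def largest_superpage(vaddr, paddr, length, sizes=ALPHA_PAGE_SIZES):
    """Largest supported size that fits in `length` and keeps both
    addresses aligned; None if not even a base page fits."""
    for size in sorted(sizes, reverse=True):
        if size <= length and valid_superpage(vaddr, paddr, size):
            return size
    return None
```

Note how a misaligned physical address alone is enough to force a smaller page size, which is why the allocator must care about where frames land, not just how many are free.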
8
Issue 1: superpage allocation

[Figure: base pages A, B, C, D of a virtual memory object must be placed in physical frames that are contiguous and aligned to superpage boundaries before the object can be mapped by a superpage.]

How / when / what size to allocate?
9
Superpage Issues (Cont.)
Promotion
– Incremental
– Timing (not too soon, not too late)

Demotion and eviction
– Hardware reference and dirty bit limitations
10
Issue 2: promotion
Promotion: create a superpage out of a set of smaller pages
– mark the page table entry of each base page

When to promote?
– Create a small superpage? May waste overhead.
– Wait for the app to touch pages? May lose the opportunity to increase TLB coverage.
– Forcibly populate pages? May cause internal fragmentation.
11
Superpage Issues: Fragmentation
Fragmentation
– Memory becomes fragmented due to
  • use of multiple page sizes
  • persistence of file cache pages
  • scattered wired (non-pageable) pages
– Contiguity as a contended resource
12
Related Approaches
HP-UX and IRIX reservations
– Not transparent

Page relocation
– Used exclusively, it leads to lower performance due to increased TLB misses

Hardware support
– Talluri and Hill: remove the contiguity requirement

This approach: a hybrid reservation-and-relocation system, with a page replacement policy that biases toward pages that contribute to contiguity.
13
Design
Reservation-based superpage management

Multiple superpage sizes

Demotion of sparsely referenced superpages

Preservation of contiguity without compaction

Efficient disk I/O for partially modified superpages

Uses a buddy allocator for contiguous regions
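Since the design leans on a buddy allocator for contiguous regions, a minimal binary-buddy sketch may help. This is illustrative (not the FreeBSD code): free lists per power-of-two order, splitting on allocation, and coalescing a freed block with its buddy to rebuild larger contiguous regions.

```python
# Minimal binary-buddy allocator sketch. Addresses are frame numbers;
# a block of order k spans 2**k frames and its buddy's address is
# found by flipping bit k (addr ^ (1 << k)).

class BuddyAllocator:
    def __init__(self, total_order):
        # free_lists[k] holds start addresses of free blocks of 2**k frames
        self.total_order = total_order
        self.free_lists = {k: set() for k in range(total_order + 1)}
        self.free_lists[total_order].add(0)   # one big free block

    def alloc(self, order):
        # find the smallest free block that fits, splitting as needed
        k = order
        while k <= self.total_order and not self.free_lists[k]:
            k += 1
        if k > self.total_order:
            return None                        # no contiguous region left
        addr = self.free_lists[k].pop()
        while k > order:                       # split off the upper buddy
            k -= 1
            self.free_lists[k].add(addr + (1 << k))
        return addr

    def free(self, addr, order):
        # coalesce with the buddy whenever it is also free
        while order < self.total_order:
            buddy = addr ^ (1 << order)
            if buddy not in self.free_lists[order]:
                break
            self.free_lists[order].remove(buddy)
            addr = min(addr, buddy)
            order += 1
        self.free_lists[order].add(addr)
```

The coalescing step is what regenerates contiguity when pages are freed; the contiguity-aware replacement policy discussed later exists precisely to feed this allocator frames whose buddies are already free.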
14
Key observation
Once an application touches the first page of a memory object, it is likely to quickly touch every page of that object.

Example: array initialization

Opportunistic policies
– make superpages as large, and as soon, as possible
– as long as there is no penalty for a wrong decision
15
Reservations
A set of frames is initially reserved at page-fault time
– Fixed-size objects: the largest aligned superpage that is not larger than the object
– Dynamically sized objects: same as fixed, but the reservation is allowed to extend beyond the current end of the object

Preemption
– If no memory is available for an allocation request, the system preempts the reservation whose most recent page allocation occurred least recently
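The two policies above are simple enough to sketch. This is illustrative code, not the paper's implementation; the function names are made up, and the Alpha page sizes are used only as a concrete size set.

```python
# Sketch of the reservation policies: (1) size a reservation for a
# fixed-size object as the largest supported superpage that is both
# aligned at the faulting address and no larger than the object;
# (2) pick a preemption victim as the reservation whose most recent
# allocation happened least recently.

ALPHA_PAGE_SIZES = [8 << 10, 64 << 10, 512 << 10, 4 << 20]

def reservation_size(vaddr, obj_size, sizes=ALPHA_PAGE_SIZES):
    for size in sorted(sizes, reverse=True):
        if size <= obj_size and vaddr % size == 0:
            return size
    return min(sizes)          # fall back to a base-page reservation

def preemption_victim(reservations):
    """reservations: {name: time of most recent page allocation}.
    Preempt the one whose most recent allocation is least recent."""
    return min(reservations, key=reservations.get)
```

The victim rule encodes the key observation: a reservation that has not been extended for a while is unlikely to ever fill up, so its unused frames are the cheapest contiguity to reclaim.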
16
Managing reservations
[Figure: reservations kept on per-size lists (1, 2, and 4 frames), indexed by the largest unused (and aligned) chunk of each reservation. The best candidate for preemption sits at the front of each list: the reservation whose most recently populated frame was populated least recently.]
17
Other Design Issues
Fragmentation control
– Coalescing
– Contiguity-aware page replacement

Incremental promotion
– Occurs as soon as a superpage-sized region is fully populated

Speculative demotion
– Occurs on eviction (recursively)
– Occurs on the first write to a clean superpage
  • (the overhead of hash digests to detect which parts were modified is too high)
– A daemon periodically demotes pages speculatively
  • necessary due to the reference bit limitation
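Incremental promotion can be sketched with a small population map. This is an illustrative model, not the paper's data structure: it counts populated base pages per aligned candidate region and promotes a region to the next larger size the moment it becomes fully populated. The class name and the toy sizes (4 and 16 base pages) are assumptions.

```python
# Sketch: a population map that drives incremental promotion.
# Promotion fires as soon as an aligned superpage-sized region is
# fully populated, smallest size first.

class PopulationMap:
    def __init__(self, sizes=(4, 16)):   # superpage sizes in base pages
        self.sizes = sizes
        self.populated = set()           # populated base-page indices
        self.promoted = []               # (start, size) of created superpages

    def touch(self, page):
        self.populated.add(page)
        for size in self.sizes:          # smallest first: incremental
            start = page - page % size   # aligned region containing page
            region = set(range(start, start + size))
            if region <= self.populated and (start, size) not in self.promoted:
                self.promoted.append((start, size))
```

Checking smallest sizes first is what makes promotion incremental: an application that touches pages sequentially climbs through the size hierarchy instead of waiting for the largest region to fill.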
19
More Design Issues
Multi-list reservation scheme
– One list for each page size supported by the hardware
– Reservations sorted by allocation recency
– Preemption removes from the head of a list
  • a preempted reservation is recursively broken into extents
  • fully populated extents are not put back on the reservation lists

Population map
– Reserved-frame lookup
– Overlap avoidance
– Promotion decisions
– Preemption assistance
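The multi-list bookkeeping can be sketched with one recency-ordered queue per size, so the preemption candidate is always at the head in O(1). This is an illustrative model under that stated policy; the class and method names are made up.

```python
# Sketch: one list per supported size, kept in allocation-recency
# order. Each allocation moves the reservation to the tail, so the
# head is always the reservation whose most recent allocation is
# least recent -- the preemption candidate.

from collections import OrderedDict

class ReservationLists:
    def __init__(self, sizes):
        self.lists = {s: OrderedDict() for s in sizes}

    def note_allocation(self, size, reservation):
        # re-inserting moves the reservation to the tail (most recent)
        self.lists[size].pop(reservation, None)
        self.lists[size][reservation] = True

    def preempt(self, size):
        # head of the list: least recently extended reservation
        if not self.lists[size]:
            return None
        res = next(iter(self.lists[size]))
        del self.lists[size][res]
        return res
```

Keeping the lists sorted by construction is the point: the preemption policy from the reservations slide needs no search at allocation time.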
20
Implementation Notes
FreeBSD uses three lists of pages in A-LRU order: active, inactive, cache

Contiguity-aware page daemon
– Cache pages are considered available for allocation
– The daemon is activated when contiguity falls low
– Clean file-backed pages are moved to the inactive list as soon as the file is closed

Wired page clustering

Multiple mappings
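One way to picture contiguity-aware replacement is as a scoring rule over eviction candidates. The sketch below is an assumption-laden illustration, not the FreeBSD daemon: among otherwise-evictable pages, it prefers the one whose frame has the most already-free neighbors in its aligned chunk, since freeing that page restores the most contiguity.

```python
# Sketch: bias victim selection toward pages whose frames help
# complete a contiguous, aligned run of free frames. `region` is a
# candidate superpage size in frames (4 here, purely illustrative).

def contiguity_score(frame, free_frames, region=4):
    """Number of free frames in the aligned region-frame chunk that
    contains `frame`; higher means evicting this page contributes
    more toward a complete free region."""
    start = frame - frame % region
    return sum(1 for f in range(start, start + region) if f in free_frames)

def pick_victim(candidates, free_frames):
    return max(candidates, key=lambda f: contiguity_score(f, free_frames))
```

Under this rule, a page sitting alone in an otherwise-free chunk is evicted before an equally old page in a mostly-used chunk, which is exactly the bias toward "pages that contribute to contiguity" described earlier.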
21
Evaluation
Setup
– FreeBSD 4.3
– Alpha 21264, 500 MHz, 512 MB RAM
– 8 KB, 64 KB, 512 KB, 4 MB pages
– 128-entry DTLB, 128-entry ITLB
– Unmodified applications
22
Best-Case Results
TLB miss reduction usually above 95%

SPEC CPU2000 integer
– 11.2% improvement (range 0% to 38%)

SPEC CPU2000 floating point
– 11.0% improvement (range -1.5% to 83%)

Other benchmarks
– FFT (200³ matrix): 55%
– 1000x1000 matrix transpose: 655%

Speedups of 30% or more in 8 out of 35 benchmarks
24
Sustained benefits
A web server is used to fragment memory; FFTW is then run to see how quickly contiguity is reclaimed
– FFTW reaches a speedup of almost 55%; web server performance degrades only 1.6% on a successive run
– Concurrent execution: only 3% degradation with the modified page daemon
25
Fragmentation control
[Figure: normalized contiguity of free memory (0 to 1) over a 10-minute run of a web server followed by four FFT runs. Without fragmentation control, contiguity stays low and FFT sees no speedup or only a partial speedup; with fragmentation control, contiguity recovers and FFT achieves full speedup.]
26
Adversary applications
Incremental promotion
– 8.9% slowdown, of which 7.2% is hardware-specific

Sequential access
– 0.1% degradation

Preemption
– 1.1% degradation

General overhead
– Using the superpage support mechanisms without ever promoting: 1-2% performance degradation
27
Et Cetera

Dirty superpages
– The performance penalty of not demoting them can be a factor of 20

Scalability
– Most operations are O(1), O(S), or O(S*R)
– The daemon, promotion, demotion, and dirty/reference bit emulation are linear
  • promotion/demotion is amortized to O(S) for programs that need to change page size only early in their lifetime
  • dirty/reference bits: motivates the need for clustered page tables, either in the OS or in hardware