Managing Distributed, Shared L2 Caches through OS-Level Page Allocation
Sangyeun Cho and Lei Jin
Dept. of Computer Science, University of Pittsburgh


Page 1:

Managing Distributed, Shared L2 Caches through OS-Level Page Allocation

Sangyeun Cho and Lei Jin
Dept. of Computer Science, University of Pittsburgh

MICRO-39, Dec. 13, 2006

Page 2:

Multicore distributed L2 caches

L2 caches typically sub-banked and distributed
• IBM Power4/5: 3 banks
• Sun Microsystems T1: 4 banks
• Intel Itanium2 (L3): many “sub-arrays”

(Distributed L2 caches + switched NoC) → NUCA

Hardware-based management schemes
• Private caching
• Shared caching
• Hybrid caching

[Figure: tiled multicore; each tile has a processor core, a local L2 cache slice, and a router]

Page 3:

Private caching

1. L1 miss
2. L2 access
• Hit
• Miss
3. Access directory
• A copy on chip
• Global miss

+ short hit latency (always local)
− high on-chip miss rate
− long miss resolution time
− complex coherence enforcement

Page 4:

Shared caching

1. L1 miss
2. L2 access
• Hit
• Miss

+ low on-chip miss rate
+ straightforward data location
+ simple coherence (no replication)
− long average hit latency

Page 5:

Our work

Placing “flexibility” as the top design consideration

OS-level data to L2 cache mapping
• Simple hardware based on shared caching
• Efficient mapping maintenance at page granularity

Demonstrating the impact using different policies

Page 6:

Talk roadmap

Data mapping, a key property

Flexible page-level mapping
• Goals
• Architectural support
• OS design issues

Management policies

Conclusion and future work

Page 7:

Data mapping, the key

Data mapping = deciding data location (i.e., cache slice)

Private caching
• Data mapping determined by program location
• Mapping created at miss time
• No explicit control

Shared caching
• Data mapping determined by address:
  slice number = (block address) % (Nslice)
• Mapping is static
• Cache block installation at miss time
• No explicit control
• (Run-time can impact location within slice)

Mapping granularity = block
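
To make the contrast concrete, here is a minimal C sketch of shared caching’s static, block-granularity mapping; the block size, slice count, and names are illustrative assumptions, not the paper’s parameters:

#include <stdint.h>

#define BLOCK_SHIFT 6          /* assume 64 B cache blocks */
#define NSLICE      16         /* assume 16 L2 cache slices */

/* slice_num = (block address) % Nslice: the home slice is a fixed
   function of the address, so there is no run-time control. */
static unsigned block_home_slice(uint64_t addr)
{
    uint64_t block_addr = addr >> BLOCK_SHIFT;
    return (unsigned)(block_addr % NSLICE);
}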

Page 8:

Changing cache mapping granularity

Memory blocks → Memory pages

• miss rate?
• latency?
• impact on existing techniques? (e.g., prefetching)

Page 9:

Observation: page-level mapping

[Figure: OS page allocation maps the memory pages of Program 1 and Program 2 to different cache slices]

Mapping data to different cache slices is feasible
Key: OS page allocation policies → flexible

Page 10:

Goal 1: performance management

Proximity-aware data mapping

Page 11:

Goal 2: power management

Usage-aware cache shut-off

[Figure: per-slice usage counters; slices showing 0 usage can be shut off]

Page 12:

Goal 3: reliability management

On-demand cache isolation

[Figure: faulty slices, marked X, are isolated from page mapping]

Page 13:

Goal 4: QoS management

Contract-based cache allocation

Page 14:

Architectural support

On an L1 miss, the home slice is computed from the data address (page_num | page offset):

Method 1: “bit selection”
  slice_num = (page_num) % (Nslice)
  (the slice number comes directly from low-order page-number bits: other bits | slice_num | page offset)

Method 2: “region table”
  slice_numx if regionx_low ≤ page_num ≤ regionx_high
  (a small table of range registers: region0_low/region0_high → slice_num0, region1_low/region1_high → slice_num1, …)

Method 3: “page table (TLB)”
  page_num ↔ slice_num
  (each TLB entry vpage_num → ppage_num also carries a slice_num)

Simple hardware support enough

Combined scheme feasible
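
A C sketch of the three lookup methods; field widths, table sizes, and all identifiers are assumptions for illustration, not the paper’s hardware:

#include <stdint.h>

#define PAGE_SHIFT 12          /* assume 4 kB pages */
#define NSLICE     16          /* assume 16 slices */
#define NREGION    4           /* assume a 4-entry region table */

/* Method 1: "bit selection" -- slice bits come directly from page_num. */
static unsigned slice_bit_selection(uint64_t addr)
{
    return (unsigned)((addr >> PAGE_SHIFT) % NSLICE);
}

/* Method 2: "region table" -- a few range registers map page-number
   ranges to slices; fall back to bit selection on no match. */
struct region { uint64_t low, high; unsigned slice; };
static struct region reg_table[NREGION];

static unsigned slice_region_table(uint64_t addr)
{
    uint64_t page_num = addr >> PAGE_SHIFT;
    for (int i = 0; i < NREGION; i++)
        if (reg_table[i].low <= page_num && page_num <= reg_table[i].high)
            return reg_table[i].slice;
    return slice_bit_selection(addr);
}

/* Method 3: "page table (TLB)" -- each TLB entry carries a slice
   number, so lookup is a field read once the entry is in hand. */
struct tlb_entry { uint64_t vpage, ppage; unsigned slice; };

static unsigned slice_tlb(const struct tlb_entry *e)
{
    return e->slice;
}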

Page 15:

Some OS design issues

Congruence group CG(i)
• Set of physical pages mapped to slice i
• A free list for each i → multiple free lists

On each page allocation, consider
• Data proximity
• Cache pressure
• (e.g.) Profitability function P = f(M, L, P, Q, C)
  M: miss rates
  L: network link status
  P: current page allocation status
  Q: QoS requirements
  C: cache configuration

Impact on process scheduling

Leverage existing frameworks
• Page coloring – multiple free lists
• NUMA OS – process scheduling & page allocation
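
A minimal C sketch of allocation from per-slice free lists, assuming a 4×4 mesh and a toy proximity-only profitability function; all names are illustrative, not the authors’ kernel code:

#include <stdlib.h>

#define NSLICE 16              /* assume a 4x4 tile grid */

struct page { struct page *next; unsigned long pfn; };

/* CG(i): free list of physical pages whose address maps to slice i. */
static struct page *cg_free_list[NSLICE];

/* Toy stand-in for P = f(M, L, P, Q, C): here only proximity
   (Manhattan distance on the 4x4 mesh) is considered. */
static double profitability(int slice, int core)
{
    int dx = abs(slice % 4 - core % 4);
    int dy = abs(slice / 4 - core / 4);
    return -(double)(dx + dy);
}

/* On a page fault from `core`, pick the most profitable slice whose
   congruence group still has free pages, and pop a page from it. */
static struct page *alloc_page_for(int core)
{
    int best = -1;
    double best_p = 0.0;

    for (int s = 0; s < NSLICE; s++) {
        if (!cg_free_list[s])
            continue;           /* CG(s) exhausted */
        double p = profitability(s, core);
        if (best < 0 || p > best_p) { best = s; best_p = p; }
    }
    if (best < 0)
        return NULL;            /* all congruence groups exhausted */

    struct page *pg = cg_free_list[best];
    cg_free_list[best] = pg->next;
    return pg;
}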

Page 16:

Working example

[Figure: 4×4 tile grid (slices 0–15); a program runs on tile 5. Page allocations follow per-slice profitability values, e.g. P(4) = 0.9, P(6) = 0.8, P(5) = 0.7, … and P(1) = 0.95, P(6) = 0.9, P(4) = 0.8, …, so successive pages land on slices 4, 1, 6, …]

Static vs. dynamic mapping

Program information (e.g., profile)

Proper run-time monitoring needed

Page 17:

Page mapping policies

Page 18:

Simulating private caching

For a page requested from a program running on core i, map the page to cache slice i

[Figure: L2 cache latency (cycles) vs. L2 cache slice size (128 kB, 256 kB, 512 kB) for SPEC2k INT and SPEC2k FP; private caching vs. OS-based]

Simulating private caching is simple

Similar or better performance
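
A sketch of this policy (names assumed): the OS allocates every page touched by core i from CG(i), so the core’s data stays in its local slice.

#define NSLICE 16

/* "Private" emulation: slice i for core i. */
static int policy_private(int requesting_core)
{
    return requesting_core % NSLICE;
}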

Page 19:

Simulating “large” private caching

For a page requested from a program running on core i, map the page to cache slice i; also spread pages

[Figure: relative performance (1/time) of OS-based vs. private for SPEC2k INT (gcc, parser, eon, twolf) and SPEC2k FP (wupwise, galgel, ammp, sixtrack); 512 kB cache slice; one bar reaches 1.93]

Page 20:

Simulating shared caching

For a page requested from a program running on core i, map the page to all cache slices (round-robin, random, …)

[Figure: L2 cache latency (cycles) vs. L2 cache slice size (128 kB, 256 kB, 512 kB) for SPEC2k INT and SPEC2k FP; shared vs. OS-based; two bars clipped at 129 and 106 cycles]

Simulating shared caching is simple

Mostly similar behavior/performance

Pathological cases (e.g., applu)
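
A sketch of this policy (names assumed), using round-robin; random spreading would equally match the slide.

#define NSLICE 16

/* "Shared" emulation: spread pages over all slices. */
static int policy_shared(void)
{
    static int next;            /* zero-initialized */
    int s = next;
    next = (next + 1) % NSLICE;
    return s;
}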

Page 21:

Simulating clustered caching

For a page requested from a program running on a core of group j, map the page to any cache slice within the group (round-robin, random, …)

[Figure: 4×4 tile grid partitioned into groups; relative performance (1/time) on FFT, LU, RADIX, OCEAN for private, OS-based, and shared; 4 cores used; 512 kB cache slice]

Simulating clustered caching is simple

Lower miss traffic than private

Lower on-chip traffic than shared
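
A sketch of this policy (names and cluster size assumed):

#define NSLICE  16
#define CLUSTER 4              /* assume 4 slices per group */

/* "Clustered" emulation: pages from a core in group j are spread
   round-robin over that group's slices only. */
static int policy_clustered(int requesting_core)
{
    static int next;            /* zero-initialized */
    int base = (requesting_core / CLUSTER) * CLUSTER;  /* group start */
    int s = base + next;
    next = (next + 1) % CLUSTER;
    return s % NSLICE;
}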

Page 22:

Profile-driven page mapping

Using profiling, collect:
• Inter-page conflict information
• Per-page access count information

Page mapping cost function (per slice)
• Given the program location, the page to map, and previously mapped pages:
  cost = (# conflicts × miss penalty) + weight × (# accesses × latency)
         [miss cost]                     [latency cost]
• weight as a knob: a larger value puts more weight on proximity (than on miss rate)
• Optimizes both miss rate and data proximity

Theoretically important to understand limits; can be practically important, too
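
A sketch of choosing the minimum-cost slice under this function; the constants and names are assumptions, not the paper’s parameters:

#include <stdint.h>

#define NSLICE 16

/* cost(s) = conflicts(s) * miss_penalty + weight * accesses * latency(s);
   the page is mapped to the slice with minimum cost, so a larger
   weight favors low-latency (proximal) slices. */
static int best_slice(const uint64_t conflicts[NSLICE],
                      uint64_t accesses,
                      const double latency[NSLICE],
                      double weight, double miss_penalty)
{
    int best = 0;
    double best_cost = 0.0;

    for (int s = 0; s < NSLICE; s++) {
        double miss_cost    = (double)conflicts[s] * miss_penalty;
        double latency_cost = weight * (double)accesses * latency[s];
        double cost = miss_cost + latency_cost;
        if (s == 0 || cost < best_cost) { best = s; best_cost = cost; }
    }
    return best;
}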

Page 23:

Profile-driven page mapping, cont’d

[Figure: breakdown of L2 cache accesses (on-chip hit, local hit, miss, remote) as the weight knob varies, for ammp, art, bzip2, crafty, eon, equake, gap, gcc, gzip, mcf, mesa, mgrid, parser, twolf, vortex, vpr, and wupwise; 256 kB L2 cache slice]

Page 24:

Profile-driven page mapping, cont’d

[Figure: number of pages mapped to each slice for gcc; pages cluster around the program location; 256 kB L2 cache slice]

Page 25:

Profile-driven page mapping, cont’d

[Figure: performance improvement over shared caching with a 256 kB L2 cache slice. Per-benchmark gains in figure order (ammp through wupwise): −1%, 9%, 1%, 17%, 7%, 0%, 9%, 21%, 2%, 39%, 4%, 3%, 9%, 23%, 2%, 6%, 108%]

Room for performance improvement

Matches the better of private/shared, or beats both

Dynamic mapping schemes desired

Page 26:

Isolating faulty caches

When there are faulty cache slices, avoid mapping pages to them

[Figure: relative L2 cache latency vs. number of cache slice deletions (0, 1, 2, 4, 8), shared vs. OS-based; 4 cores running a multiprogrammed workload; 512 kB cache slice]
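
With OS-level mapping, fault isolation reduces to a policy tweak; a sketch (names assumed):

#define NSLICE 16

static int slice_faulty[NSLICE];   /* set when a slice is diagnosed bad */

/* Allocation simply skips the congruence groups of faulty slices,
   so no new pages are mapped to them. */
static int next_good_slice(int preferred)
{
    for (int k = 0; k < NSLICE; k++) {
        int s = (preferred + k) % NSLICE;
        if (!slice_faulty[s])
            return s;
    }
    return -1;                     /* every slice is faulty */
}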

Page 27:

Conclusion

“Flexibility” will become important in future multicores
• Many shared resources
• Allows us to implement high-level policies

OS-level page-granularity data-to-slice mapping
• Low hardware overhead
• Flexible

Several management policies studied
• Mimicking private/shared/clustered caching is straightforward
• Performance-improving schemes

Page 28:

Future work

Dynamic mapping schemes
• Performance
• Power

Performance monitoring techniques
• Hardware-based
• Software-based

Data migration and replication support

Page 29:

Managing Distributed, Shared L2 Caches through OS-Level Page Allocation

Sangyeun Cho and Lei Jin
Dept. of Computer Science, University of Pittsburgh

Thank you!

Page 30:

Multicores are here

• AMD Opteron dual-core (2005)
• IBM Power5 (2004)
• Sun Microsystems T1, 8 cores (2005)
• Intel Core2 Duo (2006)
• Quad cores (2007)
• Intel 80 cores? (2010?)

Page 31:

A multicore outlook

???

Page 32:

A processor model

Many cores (e.g., 16)

[Figure: tiled layout; each tile has a processor core, a local L2 cache slice, and a router]

Private L1 I/D caches
• 8 kB~32 kB

Local unified L2 cache
• 128 kB~512 kB
• 8~18 cycles

Switched network
• 2~4 cycles/switch

Distributed directory
• Scatter hotspots

Page 33:

Other approaches

Hybrid/flexible schemes
• “Core clustering” [Speight et al., ISCA 2005]
• “Flexible CMP cache sharing” [Huh et al., ICS 2004]
• “Flexible bank mapping” [Liu et al., HPCA 2004]

Improving shared caching
• “Victim replication” [Zhang and Asanovic, ISCA 2005]

Improving private caching
• “Cooperative caching” [Chang and Sohi, ISCA 2006]
• “CMP-NuRAPID” [Chishti et al., ISCA 2005]