Managing Distributed, Shared L2 Caches through OS-Level Page Allocation
Sangyeun Cho and Lei Jin
Dept. of Computer Science, University of Pittsburgh


Page 1:

Managing Distributed, Shared L2 Caches through OS-Level Page Allocation

Sangyeun Cho and Lei Jin
Dept. of Computer Science, University of Pittsburgh

MICRO-39, Dec. 13, 2006

Page 2:

Multicore distributed L2 caches

L2 caches typically sub-banked and distributed
• IBM Power4/5: 3 banks
• Sun Microsystems T1: 4 banks
• Intel Itanium2 (L3): many “sub-arrays”

(Distributed L2 caches + switched NoC) → NUCA

Hardware-based management schemes
• Private caching
• Shared caching
• Hybrid caching

[Figure: tiled multicore; each tile has a processor core, a local L2 cache slice, and a router]

Page 3:

Private caching

1. L1 miss
2. L2 access
• Hit
• Miss
3. Access directory
• A copy on chip
• Global miss

+ short hit latency (always local)
− high on-chip miss rate
− long miss resolution time
− complex coherence enforcement

Page 4:

Shared caching

1. L1 miss
2. L2 access
• Hit
• Miss

+ low on-chip miss rate
+ straightforward data location
+ simple coherence (no replication)
− long average hit latency

Page 5:

Our work

Placing “flexibility” as the top design consideration

OS-level data to L2 cache mapping
• Simple hardware based on shared caching
• Efficient mapping maintenance at page granularity

Demonstrating the impact using different policies

Page 6:

Talk roadmap

Data mapping, a key property

Flexible page-level mapping
• Goals
• Architectural support
• OS design issues

Management policies

Conclusion and future work

Page 7:

Data mapping, the key

Data mapping = deciding data location (i.e., cache slice)

Private caching
• Data mapping determined by program location
• Mapping created at miss time
• No explicit control

Shared caching
• Data mapping determined by address:
  slice number = (block address) % (Nslice)
• Mapping is static
• Cache block installation at miss time
• No explicit control
• (Run-time can impact location within slice)

Mapping granularity = block
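
To make the contrast concrete, here is a minimal C sketch of shared caching’s static, block-granularity mapping; the block size, slice count, and names are illustrative assumptions, not the paper’s parameters:

#include <stdint.h>

#define BLOCK_SHIFT 6          /* assume 64 B cache blocks */
#define NSLICE      16         /* assume 16 L2 cache slices */

/* slice_num = (block address) % Nslice: the home slice is a fixed
   function of the address, so there is no run-time control. */
static unsigned block_home_slice(uint64_t addr)
{
    uint64_t block_addr = addr >> BLOCK_SHIFT;
    return (unsigned)(block_addr % NSLICE);
}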

Page 8:

Changing cache mapping granularity

Memory blocks → Memory pages

• miss rate?
• latency?
• impact on existing techniques? (e.g., prefetching)

Page 9:

Observation: page-level mapping

[Figure: OS page allocation maps the memory pages of Program 1 and Program 2 to different cache slices]

Mapping data to different cache slices is feasible
Key: OS page allocation policies → flexible

Page 10:

Goal 1: performance management

Proximity-aware data mapping

Page 11:

Goal 2: power management

Usage-aware cache shut-off

[Figure: per-slice usage counters; slices showing 0 usage can be shut off]

Page 12:

Goal 3: reliability management

On-demand cache isolation

[Figure: faulty slices, marked X, are isolated from page mapping]

Page 13:

Goal 4: QoS management

Contract-based cache allocation

Page 14:

Architectural support

On an L1 miss, the home slice is computed from the data address (page_num | page offset):

Method 1: “bit selection”
  slice_num = (page_num) % (Nslice)
  (the slice number comes directly from low-order page-number bits: other bits | slice_num | page offset)

Method 2: “region table”
  slice_numx if regionx_low ≤ page_num ≤ regionx_high
  (a small table of range registers: region0_low/region0_high → slice_num0, region1_low/region1_high → slice_num1, …)

Method 3: “page table (TLB)”
  page_num ↔ slice_num
  (each TLB entry vpage_num → ppage_num also carries a slice_num)

Simple hardware support enough

Combined scheme feasible
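
A C sketch of the three lookup methods; field widths, table sizes, and all identifiers are assumptions for illustration, not the paper’s hardware:

#include <stdint.h>

#define PAGE_SHIFT 12          /* assume 4 kB pages */
#define NSLICE     16          /* assume 16 slices */
#define NREGION    4           /* assume a 4-entry region table */

/* Method 1: "bit selection" -- slice bits come directly from page_num. */
static unsigned slice_bit_selection(uint64_t addr)
{
    return (unsigned)((addr >> PAGE_SHIFT) % NSLICE);
}

/* Method 2: "region table" -- a few range registers map page-number
   ranges to slices; fall back to bit selection on no match. */
struct region { uint64_t low, high; unsigned slice; };
static struct region reg_table[NREGION];

static unsigned slice_region_table(uint64_t addr)
{
    uint64_t page_num = addr >> PAGE_SHIFT;
    for (int i = 0; i < NREGION; i++)
        if (reg_table[i].low <= page_num && page_num <= reg_table[i].high)
            return reg_table[i].slice;
    return slice_bit_selection(addr);
}

/* Method 3: "page table (TLB)" -- each TLB entry carries a slice
   number, so lookup is a field read once the entry is in hand. */
struct tlb_entry { uint64_t vpage, ppage; unsigned slice; };

static unsigned slice_tlb(const struct tlb_entry *e)
{
    return e->slice;
}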

Page 15:

Some OS design issues

Congruence group CG(i)
• Set of physical pages mapped to slice i
• A free list for each i → multiple free lists

On each page allocation, consider
• Data proximity
• Cache pressure
• (e.g.) Profitability function P = f(M, L, P, Q, C)
  M: miss rates
  L: network link status
  P: current page allocation status
  Q: QoS requirements
  C: cache configuration

Impact on process scheduling

Leverage existing frameworks
• Page coloring – multiple free lists
• NUMA OS – process scheduling & page allocation
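
A minimal C sketch of allocation from per-slice free lists, assuming a 4×4 mesh and a toy proximity-only profitability function; all names are illustrative, not the authors’ kernel code:

#include <stdlib.h>

#define NSLICE 16              /* assume a 4x4 tile grid */

struct page { struct page *next; unsigned long pfn; };

/* CG(i): free list of physical pages whose address maps to slice i. */
static struct page *cg_free_list[NSLICE];

/* Toy stand-in for P = f(M, L, P, Q, C): here only proximity
   (Manhattan distance on the 4x4 mesh) is considered. */
static double profitability(int slice, int core)
{
    int dx = abs(slice % 4 - core % 4);
    int dy = abs(slice / 4 - core / 4);
    return -(double)(dx + dy);
}

/* On a page fault from `core`, pick the most profitable slice whose
   congruence group still has free pages, and pop a page from it. */
static struct page *alloc_page_for(int core)
{
    int best = -1;
    double best_p = 0.0;

    for (int s = 0; s < NSLICE; s++) {
        if (!cg_free_list[s])
            continue;           /* CG(s) exhausted */
        double p = profitability(s, core);
        if (best < 0 || p > best_p) { best = s; best_p = p; }
    }
    if (best < 0)
        return NULL;            /* all congruence groups exhausted */

    struct page *pg = cg_free_list[best];
    cg_free_list[best] = pg->next;
    return pg;
}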

Page 16:

Working example

[Figure: 4×4 tile grid (slices 0–15); a program runs on tile 5. Page allocations follow per-slice profitability values, e.g. P(4) = 0.9, P(6) = 0.8, P(5) = 0.7, … and P(1) = 0.95, P(6) = 0.9, P(4) = 0.8, …, so successive pages land on slices 4, 1, 6, …]

Static vs. dynamic mapping

Program information (e.g., profile)

Proper run-time monitoring needed

Page 17:

Page mapping policies

Page 18:

Simulating private caching

For a page requested from a program running on core i, map the page to cache slice i

[Figure: L2 cache latency (cycles) vs. L2 cache slice size (128 kB, 256 kB, 512 kB) for SPEC2k INT and SPEC2k FP; private caching vs. OS-based]

Simulating private caching is simple

Similar or better performance
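
A sketch of this policy (names assumed): the OS allocates every page touched by core i from CG(i), so the core’s data stays in its local slice.

#define NSLICE 16

/* "Private" emulation: slice i for core i. */
static int policy_private(int requesting_core)
{
    return requesting_core % NSLICE;
}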

Page 19:

Simulating “large” private caching

For a page requested from a program running on core i, map the page to cache slice i; also spread pages

[Figure: relative performance (1/time) of OS-based vs. private for SPEC2k INT (gcc, parser, eon, twolf) and SPEC2k FP (wupwise, galgel, ammp, sixtrack); 512 kB cache slice; one bar reaches 1.93]

Page 20:

Simulating shared caching

For a page requested from a program running on core i, map the page to all cache slices (round-robin, random, …)

[Figure: L2 cache latency (cycles) vs. L2 cache slice size (128 kB, 256 kB, 512 kB) for SPEC2k INT and SPEC2k FP; shared vs. OS-based; two bars clipped at 129 and 106 cycles]

Simulating shared caching is simple

Mostly similar behavior/performance

Pathological cases (e.g., applu)
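
A sketch of this policy (names assumed), using round-robin; random spreading would equally match the slide.

#define NSLICE 16

/* "Shared" emulation: spread pages over all slices. */
static int policy_shared(void)
{
    static int next;            /* zero-initialized */
    int s = next;
    next = (next + 1) % NSLICE;
    return s;
}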

Page 21:

Simulating clustered caching

For a page requested from a program running on a core of group j, map the page to any cache slice within the group (round-robin, random, …)

[Figure: 4×4 tile grid partitioned into groups; relative performance (1/time) on FFT, LU, RADIX, OCEAN for private, OS-based, and shared; 4 cores used; 512 kB cache slice]

Simulating clustered caching is simple

Lower miss traffic than private

Lower on-chip traffic than shared
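
A sketch of this policy (names and cluster size assumed):

#define NSLICE  16
#define CLUSTER 4              /* assume 4 slices per group */

/* "Clustered" emulation: pages from a core in group j are spread
   round-robin over that group's slices only. */
static int policy_clustered(int requesting_core)
{
    static int next;            /* zero-initialized */
    int base = (requesting_core / CLUSTER) * CLUSTER;  /* group start */
    int s = base + next;
    next = (next + 1) % CLUSTER;
    return s % NSLICE;
}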

Page 22:

Profile-driven page mapping

Using profiling, collect:
• Inter-page conflict information
• Per-page access count information

Page mapping cost function (per slice)
• Given the program location, the page to map, and previously mapped pages:
  cost = (# conflicts × miss penalty) + weight × (# accesses × latency)
         [miss cost]                     [latency cost]
• weight as a knob: a larger value puts more weight on proximity (than on miss rate)
• Optimizes both miss rate and data proximity

Theoretically important to understand limits; can be practically important, too
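
A sketch of choosing the minimum-cost slice under this function; the constants and names are assumptions, not the paper’s parameters:

#include <stdint.h>

#define NSLICE 16

/* cost(s) = conflicts(s) * miss_penalty + weight * accesses * latency(s);
   the page is mapped to the slice with minimum cost, so a larger
   weight favors low-latency (proximal) slices. */
static int best_slice(const uint64_t conflicts[NSLICE],
                      uint64_t accesses,
                      const double latency[NSLICE],
                      double weight, double miss_penalty)
{
    int best = 0;
    double best_cost = 0.0;

    for (int s = 0; s < NSLICE; s++) {
        double miss_cost    = (double)conflicts[s] * miss_penalty;
        double latency_cost = weight * (double)accesses * latency[s];
        double cost = miss_cost + latency_cost;
        if (s == 0 || cost < best_cost) { best = s; best_cost = cost; }
    }
    return best;
}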

Page 23:

Profile-driven page mapping, cont’d

[Figure: breakdown of L2 cache accesses (on-chip hit, local hit, miss, remote) as the weight knob varies, for ammp, art, bzip2, crafty, eon, equake, gap, gcc, gzip, mcf, mesa, mgrid, parser, twolf, vortex, vpr, and wupwise; 256 kB L2 cache slice]

Page 24:

Profile-driven page mapping, cont’d

[Figure: number of pages mapped to each slice for gcc; pages cluster around the program location; 256 kB L2 cache slice]

Page 25:

Profile-driven page mapping, cont’d

[Figure: performance improvement over shared caching with a 256 kB L2 cache slice. Per-benchmark gains in figure order (ammp through wupwise): −1%, 9%, 1%, 17%, 7%, 0%, 9%, 21%, 2%, 39%, 4%, 3%, 9%, 23%, 2%, 6%, 108%]

Room for performance improvement

Matches the better of private/shared, or beats both

Dynamic mapping schemes desired

Page 26:

Isolating faulty caches

When there are faulty cache slices, avoid mapping pages to them

[Figure: relative L2 cache latency vs. number of cache slice deletions (0, 1, 2, 4, 8), shared vs. OS-based; 4 cores running a multiprogrammed workload; 512 kB cache slice]
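
With OS-level mapping, fault isolation reduces to a policy tweak; a sketch (names assumed):

#define NSLICE 16

static int slice_faulty[NSLICE];   /* set when a slice is diagnosed bad */

/* Allocation simply skips the congruence groups of faulty slices,
   so no new pages are mapped to them. */
static int next_good_slice(int preferred)
{
    for (int k = 0; k < NSLICE; k++) {
        int s = (preferred + k) % NSLICE;
        if (!slice_faulty[s])
            return s;
    }
    return -1;                     /* every slice is faulty */
}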

Page 27:

Conclusion

“Flexibility” will become important in future multicores
• Many shared resources
• Allows us to implement high-level policies

OS-level page-granularity data-to-slice mapping
• Low hardware overhead
• Flexible

Several management policies studied
• Mimicking private/shared/clustered caching is straightforward
• Performance-improving schemes

Page 28:

Future work

Dynamic mapping schemes
• Performance
• Power

Performance monitoring techniques
• Hardware-based
• Software-based

Data migration and replication support

Page 29:

Managing Distributed, Shared L2 Caches through OS-Level Page Allocation

Sangyeun Cho and Lei Jin
Dept. of Computer Science, University of Pittsburgh

Thank you!

Page 30:

Multicores are here

• AMD Opteron dual-core (2005)
• IBM Power5 (2004)
• Sun Microsystems T1, 8 cores (2005)
• Intel Core2 Duo (2006)
• Quad cores (2007)
• Intel 80 cores? (2010?)

Page 31:

A multicore outlook

???

Page 32:

A processor model

Many cores (e.g., 16)

[Figure: tiled layout; each tile has a processor core, a local L2 cache slice, and a router]

Private L1 I/D caches
• 8 kB~32 kB

Local unified L2 cache
• 128 kB~512 kB
• 8~18 cycles

Switched network
• 2~4 cycles/switch

Distributed directory
• Scatter hotspots

Page 33:

Other approaches

Hybrid/flexible schemes
• “Core clustering” [Speight et al., ISCA 2005]
• “Flexible CMP cache sharing” [Huh et al., ICS 2004]
• “Flexible bank mapping” [Liu et al., HPCA 2004]

Improving shared caching
• “Victim replication” [Zhang and Asanovic, ISCA 2005]

Improving private caching
• “Cooperative caching” [Chang and Sohi, ISCA 2006]
• “CMP-NuRAPID” [Chishti et al., ISCA 2005]