StimulusCache: Boosting Performance of Chip Multiprocessors with Excess Cache

Hyunjin Lee, Sangyeun Cho, Bruce R. Childers

Dept. of Computer Science, University of Pittsburgh


Staggering processor chip yield

[Figure: 8-core CMP. Probability (y-axis, 0% to 35%) vs. # of sound cores (x-axis, 1 to 8).]

IBM Cell initial yield = 14%

Two sources of yield loss
• Physical defects
• Process variations

Smaller device sizes
• Critical defect size shrinks
• Process variations become more severe

As a result, yield is severely limited
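
A yield histogram like the one on this slide can be approximated with a simple model. The sketch below assumes, purely for illustration, that every core fails independently with the same probability, so the number of sound cores follows a binomial distribution. The per-core probability of 0.8 and the function names are hypothetical and are not taken from the slides.

```python
from math import comb


def sound_core_distribution(n_cores, p_core_ok):
    """Return P(exactly k sound cores) for k = 0..n_cores (binomial model)."""
    return [
        comb(n_cores, k) * p_core_ok**k * (1 - p_core_ok) ** (n_cores - k)
        for k in range(n_cores + 1)
    ]


def fraction_with_at_least(n_cores, p_core_ok, min_sound):
    """Fraction of chips that have at least `min_sound` sound cores."""
    return sum(sound_core_distribution(n_cores, p_core_ok)[min_sound:])


if __name__ == "__main__":
    # 0.8 is an assumed per-core probability, chosen only for the example.
    for k, prob in enumerate(sound_core_distribution(8, 0.8)):
        print(f"{k} sound cores: {prob:.1%}")
    print(f"all 8 sound:      {fraction_with_at_least(8, 0.8, 8):.1%}")
    print(f"at least 7 sound: {fraction_with_at_least(8, 0.8, 7):.1%}")
```

Under this model, accepting chips with one faulty core always raises the sellable fraction, because the "exactly 7 sound cores" term is added to the "all 8 sound" term.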


Core disabling to rescue

Recent multicore processors employ “core disabling”
• Disable failed cores to salvage sound cores in a chip
• Significant yield improvement
• IBM Cell: 14% → 40% with core disabling of a single faulty core

Yet this approach unnecessarily disables many “good components”
• Ex: AMD Phenom X3 disables L2 cache of faulty cores

But… is it economical?


Core disabling uneconomical

Many unused L2 caches exist
Problem exacerbated with many cores

[Figure: two plots (8 core and 32 core) of probability vs. # of sound cores / L2 caches, each with three series: L2 cache, processing logic, and core (L2 + proc. logic). 8 core: x-axis 1 to 8, y-axis 0% to 80%. 32 core: x-axis 16 to 32, y-axis 0% to 40%.]
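
The gap between L2 yield and processing-logic yield can be made concrete with a small calculation. The sketch below assumes, for illustration only, that logic and L2 fail independently and that an excess cache is a sound L2 whose processing logic failed. The probabilities 0.8 and 0.95 and the function name are hypothetical.

```python
def expected_counts(n_cores, p_logic_ok, p_l2_ok):
    """Return expected (sound cores, sound L2 caches, excess caches).

    A core is sound only if both its logic and its L2 are sound; an excess
    cache is a sound L2 attached to failed processing logic.
    """
    sound_cores = n_cores * p_logic_ok * p_l2_ok
    sound_l2 = n_cores * p_l2_ok
    excess_caches = n_cores * p_l2_ok * (1 - p_logic_ok)
    return sound_cores, sound_l2, excess_caches


if __name__ == "__main__":
    for n in (8, 32):
        cores, l2, excess = expected_counts(n, p_logic_ok=0.8, p_l2_ok=0.95)
        print(f"{n}-core CMP: {cores:.2f} sound cores, "
              f"{l2:.2f} sound L2s, {excess:.2f} excess caches")
```

With fixed probabilities the expected number of excess caches grows linearly with the core count, which is one way to read "problem exacerbated with many cores".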


StimulusCache

Basic idea
• Exploit “excess cache” (EC) in a failed core
• Core disabling (yield ↑) + larger cache capacity (performance ↑)

Simple HW architecture extension
• Cache controller has knowledge about EC utilization
• L2 caches are chain-linked using vector tables

Modest OS support
• OS manages the hardware data structures in cache controllers to set up EC utilization policies

Sizable performance improvement


StimulusCache design issues

Questions

1: How to arrange ECs to give to cores?

2: Where to place data, in ECs or local L2?

3: What HW support is needed?

4: How to flexibly and dynamically allocate ECs?

[Figure: 8-core CMP floorplan (Cores 0 to 7) marking which tiles are working cores and which provide excess caches.]


Excess cache utilization policies

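[Figure: 2×2 policy design space. Temporal axis: Static vs. Dynamic. Spatial axis: Private vs. Sharing. One end of the space is simple with limited performance; the other is complex with maximized performance.]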


Excess cache utilization policies

Temporal axis: Static vs. Dynamic

Spatial axis: Private vs. Sharing
• Private: exclusive to a core; performance isolation; limited capacity usage
• Sharing: shared by multiple cores; performance interference; maximum capacity usage


Excess cache utilization policies

             Static            Dynamic
Private      Static private    BAD (not evaluated)
Sharing      Static sharing    Dynamic sharing

(Rows: spatial axis. Columns: temporal axis.)


Static private policy

Symmetric allocation

Asymmetric allocation

[Figure: two diagrams showing four ECs attached to the L2 caches of Cores 0 to 3, once allocated symmetrically and once asymmetrically.]
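
The two static private allocations can be sketched as plain mappings from core to a private list of ECs. This is a minimal illustration, not the hardware mechanism: the function names, the EC identifiers 4 to 7, and the asymmetric split 2/1/1/0 are all assumptions made for the example.

```python
def symmetric_allocation(cores, ecs):
    """Deal ECs out round-robin so cores receive equal shares when divisible."""
    allocation = {core: [] for core in cores}
    for i, ec in enumerate(ecs):
        allocation[cores[i % len(cores)]].append(ec)
    return allocation


def asymmetric_allocation(ecs, ecs_per_core):
    """Give each core the number of ECs requested in `ecs_per_core`."""
    if sum(ecs_per_core.values()) > len(ecs):
        raise ValueError("requested more ECs than are available")
    allocation, next_ec = {}, 0
    for core, count in ecs_per_core.items():
        allocation[core] = ecs[next_ec:next_ec + count]
        next_ec += count
    return allocation


if __name__ == "__main__":
    cores, ecs = [0, 1, 2, 3], [4, 5, 6, 7]  # hypothetical EC identifiers
    print(symmetric_allocation(cores, ecs))
    print(asymmetric_allocation(ecs, {0: 2, 1: 1, 2: 1, 3: 0}))
```

In both cases every EC belongs to exactly one core, which is what gives the private policies their performance isolation.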


Static sharing policy

All ECs shared by all cores

[Figure: the L2 caches of Cores 0 to 3, the shared EC #1 and EC #2, and main memory.]


Dynamic sharing policy

Flow-in#N: number of data blocks flowing into EC#N
Hit#N: number of data blocks that hit at EC#N

[Figure: the L2 caches of Cores 0 to 3, EC #1, EC #2, and main memory, annotated with Flow-in#1, Hits#1, Flow-in#2, and Hits#2.]


Dynamic sharing policy

Hit/Flow-in ↑: more ECs
Hit/Flow-in ↓: fewer ECs

[Figure: the L2 caches of Cores 0 to 3, EC #1, EC #2, and main memory, annotated with Flow-in#1, Hits#1, Flow-in#2, and Hits#2.]


Dynamic sharing policy

Core 0: at least 1 EC, no harmful effect on EC#2
• Allocate 2 ECs

[Figure: the L2 caches of Cores 0 to 3, EC #1, EC #2, and main memory, annotated with Flow-in#1 and Hits#1.]


Dynamic sharing policy

Core 1: at least 2 ECs
• Allocate 2 ECs

[Figure: the L2 caches of Cores 0 to 3, EC #1, EC #2, and main memory, annotated with Flow-in#1, Flow-in#2, and Hits#2.]


Dynamic sharing policy

Core 2: at least 1 EC, harmful effect on EC#2
• Allocate 1 EC

[Figure: the L2 caches of Cores 0 to 3, EC #1, EC #2, and main memory, annotated with Flow-in#1, Hits#1, and Flow-in#2.]


Dynamic sharing policy

Core 3: no benefit with ECs
• Allocate 0 ECs

[Figure: the L2 caches of Cores 0 to 3, EC #1, EC #2, and main memory, annotated with Flow-in#1 and Flow-in#2.]


Dynamic sharing policy

Maximized capacity utilization
Minimized capacity interference

[Figure: the L2 caches of Cores 0 to 3, EC #1, EC #2, and main memory. Number of ECs allocated (EC#): Core 0: 2, Core 1: 2, Core 2: 1, Core 3: 0.]
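
The walk-through above can be sketched as a per-core decision over the Flow-in and Hit counters. The slides give only the direction of the rule (a higher Hit/Flow-in ratio means more ECs), so the specific rule below, the threshold `min_hit_ratio`, the function name, and all counter values are assumptions chosen to reproduce the 2, 2, 1, 0 allocation; this is not the authors' algorithm.

```python
def ecs_to_allocate(flow_in, hits, min_hit_ratio=0.05):
    """Return how many ECs of the shared chain one core should use.

    flow_in[i] and hits[i] are that core's counters for EC #(i + 1).
    Assumed rule: use the chain up to the deepest EC whose hit/flow-in
    ratio reaches the threshold; ECs beyond it only add interference.
    """
    allocated = 0
    for depth, (blocks_in, blocks_hit) in enumerate(zip(flow_in, hits), start=1):
        if blocks_in > 0 and blocks_hit / blocks_in >= min_hit_ratio:
            allocated = depth
    return allocated


if __name__ == "__main__":
    # Hypothetical counters as (flow_in, hits), each listed for EC #1 and EC #2.
    counters = {
        0: ([1000, 700], [300, 200]),   # hits in both ECs
        1: ([2000, 2000], [0, 900]),    # hits appear only with the second EC
        2: ([1500, 1300], [200, 10]),   # second EC is barely useful
        3: ([1200, 1199], [1, 0]),      # no benefit from ECs
    }
    for core, (flow_in, hits) in counters.items():
        print(f"Core {core}: allocate {ecs_to_allocate(flow_in, hits)} EC(s)")
```

Taking the deepest useful EC rather than stopping at the first poor one is what lets a core like Core 1, whose hits appear only at EC #2, still receive both ECs.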


HW architecture: Vector table

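ECAV: Excess Cache Allocation Vector
SCV: Shared Core Vector
NECP: Next Excess Cache Pointers

[Figure: vector table with one bit position per core (Cores 0 to 7). Example ECAV: 1 0 1 1 0 0 0 0. Example SCV: 0 0 0 0 0 0 0 0. The NECP holds one entry per core (Core 0, Core 1, ..., Core 7), each with a valid bit and a next-core index.]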


ECAV: Excess cache allocation vector

Data search support

ECAV of Core 6:

Core   0 1 2 3 4 5 6 7
ECAV   1 0 1 1 0 0 0 0

[Figure: 8-core CMP floorplan (Cores 0 to 7) marking working cores and excess caches.]
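
Decoding the ECAV is a simple bit scan. The sketch below assumes, following the 0 to 7 labels on the slide, that the leftmost bit corresponds to Core 0; the function name is hypothetical.

```python
def ecs_to_search(ecav):
    """Return the IDs of the cores whose L2 serves as an EC for this core."""
    return [core_id for core_id, bit in enumerate(ecav) if bit == 1]


ECAV_CORE_6 = [1, 0, 1, 1, 0, 0, 0, 0]  # example vector from the slide

if __name__ == "__main__":
    print(ecs_to_search(ECAV_CORE_6))  # [0, 2, 3]
```

On a local L2 miss, Core 6 would therefore look for the block in the ECs at Cores 0, 2, and 3.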


SCV: Shared core vector

Cache coherency support

SCV of Cores 0, 2, and 3:

Core   0 1 2 3 4 5 6 7
SCV    0 0 0 0 0 0 1 0

[Figure: 8-core CMP floorplan (Cores 0 to 7) marking working cores and excess caches.]
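
In the slide's example, the SCVs at Cores 0, 2, and 3 mirror the ECAV of Core 6. The sketch below assumes that this mirror relationship holds in general, i.e. that bit i of an EC's SCV is set exactly when core i has that EC in its ECAV. The function name and the dictionary layout are hypothetical.

```python
def build_scvs(ecavs, n_cores=8):
    """Derive each EC's SCV from the ECAVs of the working cores.

    `ecavs` maps a working core's ID to its ECAV (list of n_cores bits).
    Returns a dict mapping an EC's core ID to its SCV.
    """
    scvs = {}
    for user_core, ecav in ecavs.items():
        for ec_core, bit in enumerate(ecav):
            if bit:
                scvs.setdefault(ec_core, [0] * n_cores)[user_core] = 1
    return scvs


if __name__ == "__main__":
    scvs = build_scvs({6: [1, 0, 1, 1, 0, 0, 0, 0]})
    for ec_core in sorted(scvs):
        print(f"SCV at Core {ec_core}: {scvs[ec_core]}")
```

Each EC can then consult its SCV to find the cores that may hold copies of its blocks and so need to take part in coherence actions.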


NECP: Next excess cache pointers

Data promotion/ eviction support

NECP entries along the chain of Core 6:

Location   Valid   Index   Next level
Core 6     1       0 1 1   EC at Core 3
Core 3     1       0 1 0   EC at Core 2
Core 2     1       0 0 0   EC at Core 0
Core 0     0       0 0 0   Main memory

[Figure: 8-core CMP floorplan (Cores 0 to 7) marking working cores and excess caches.]
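
The NECP entries form a linked list that can be walked from the working core to main memory. The sketch below encodes the slide's example as (valid, next core) pairs. The dictionary layout, the function name, and the loop guard are assumptions added for illustration.

```python
# NECP entries for Core 6's chain, taken from the slide: location -> (valid, index)
NECP = {
    6: (1, 0b011),  # next level: EC at Core 3
    3: (1, 0b010),  # next level: EC at Core 2
    2: (1, 0b000),  # next level: EC at Core 0
    0: (0, 0b000),  # invalid pointer: next level is main memory
}


def ec_chain(start_core, necp):
    """Follow valid NECP pointers and return the ECs in eviction order."""
    chain, current, visited = [], start_core, {start_core}
    while True:
        valid, next_core = necp[current]
        if not valid:
            return chain  # an invalid pointer means the next level is main memory
        if next_core in visited:
            raise ValueError("NECP chain contains a loop")
        chain.append(next_core)
        visited.add(next_core)
        current = next_core


if __name__ == "__main__":
    print(ec_chain(6, NECP))  # [3, 2, 0]
```

A block evicted from Core 6's L2 would move to the EC at Core 3, then Core 2, then Core 0, and finally to main memory; a hit in any EC along the way can promote the block back toward Core 6.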


Software support

OS enforces an excess cache utilization policy before programming cache controllers
• Explicit decision by administrator
• Autonomous decision based on system monitoring

OS may program cache controllers
• At system boot-up time
• Before a program starts
• During a program execution

OS may take into account workload characteristics before programming


Experimental setup

Intel ATOM-like cores w/ a 16-stage pipeline @ 2GHz

Memory hierarchy
• L1: 32KB I/D, 1 cycle
• L2: 512KB, 10 cycles
• Main memory: 300 cycles, contention modeled

On-chip network
• Crossbar for 8-core CMP (4 cores + 4 ECs)
• 2D mesh for 32-core CMP (16 cores + 16 ECs)
• Contention modeled

SPEC CPU2006, SPLASH-2, and SPECjbb2005


Static private – single thread

[Figure: performance improvement (y-axis, -20% to 140%) with 1 EC, 2 EC, 3 EC, and 4 EC for h264ref, hmmer, astar, bzip2, mcf, gcc, gromacs, gamess, soplex, sphinx3, GemsFDTD, and milc. Benchmarks are grouped into INT and FP, each split into Light, Medium, and Heavy.]


Static private – multithread

[Figure: performance improvement (y-axis, -5% to 50%) of static.private for multi-programmed workloads (HHHH, LLHH2, LLHH4, LLMM1, LLMM3, MMHH1, MMHH3, MMMM) and multi-threaded workloads (cholesky, lu). Annotations: all “H” workloads; with “H” workloads; without “H” workloads; more than 40% improvement.]


Static sharing – multithread

[Figure: performance improvement (y-axis, -5% to 50%) of static.sharing and static.private for multi-programmed workloads (HHHH, LLHH2, LLHH4, LLMM1, LLMM3, MMHH1, MMHH3, MMMM) and multi-threaded workloads (cholesky, lu). Annotations: capacity interference; multithreaded workloads; significant improvement.]


Dynamic sharing – multithread

[Figure: performance improvement (y-axis, -5% to 50%) of dynamic.sharing, static.sharing, and static.private for multi-programmed workloads (HHHH, LLHH2, LLHH4, LLMM1, LLMM3, MMHH1, MMHH3, MMMM) and multi-threaded workloads (cholesky, lu). Annotations: additional improvement; capacity interference avoided.]


Dynamic sharing – individual workloads

StimulusCache is always better than the baseline

[Figure: additional performance improvement over static sharing (y-axis, -2% to 20%) for core 0, core 1, core 2, and core 3 under the workloads LLLL, LLMM1 to LLMM4, LLHH1 to LLHH4, MMMM, MMHH1 to MMHH4, and HHHH. Annotations: significant additional improvement over static sharing; minimum degradation over static sharing.]


Conclusions

Processing logic yield vs. L2 cache yield
• A large number of excess L2 caches

StimulusCache
• Core disabling (yield ↑) + larger cache capacity (performance ↑)
• Simple HW architecture extension + modest OS support

Three excess cache utilization policies
• Static private, static sharing, and dynamic sharing

Performance improvement by up to 135%.