
Energy Efficient D-TLB and Data Cache Using Semantic-Aware Multilateral Partitioning

School of Electrical and Computer Engineering
Georgia Institute of Technology
Atlanta, GA 30332

ISLPED 2003

Hsien-Hsin “Sean” Lee
Chinnakrishnan Ballapuram

Background Picture

Address Translation and Caches
  Major processor power contributors
  I-TLB and d-TLB lookup for every instruction and memory reference
  TLBs are fully associative

Superscalar processor needs multi-ported design
  Increasing power consumption
  Multi-wide machines may need multiple memory references in the same cycle

Virtual Memory Space Partitioning

Based on programming language
Non-overlapped subdivisions
Split Code and Data: I-Cache and D-Cache
Split Data into Regions: Stack (grows downward), Heap (grows upward), Global (static), Read-only (static)

[Figure: ARM Architecture virtual memory map between max mem and min mem. Labels: Protected, reserved (two areas), STACK grows downward, HEAP grows upward, Static GLOBAL Data Region, Read-only region, Code Region]

A program's distinct access behavior to each of these regions creates an opportunity to reduce power.

Outline of the Talk

Motivation
  Unique access behavior and locality are analyzed for energy reduction
Semantic-Aware Multilateral Partitioning (SAM)
Semantic-Aware d-TLB (SAT)
Semantic-Aware d-Cachelets (SAC)
Selective Multi-Porting SAM Architecture
Performance/Energy/Area Evaluation
Conclusions

Footprint of Stack Page Accesses

[Figure: stack page accesses. x-axis: offset (0 to 4000); y-axis: unique virtual page number (786426 to 786429)]

Only two stack pages are required by all stack accesses, so the stack band is small.
In general, the x-axis shows the working set size and the y-axis shows the required TLB entries.
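
The page numbers on the y-axis relate to addresses by simple arithmetic. The following is a minimal sketch, assuming 4 KB pages (not stated on the slide, but suggested by the 0 to 4000 offset range). The function name and the sample addresses are hypothetical and chosen only so that they land in the pages shown in the plot.

```python
PAGE_SIZE = 4096  # assumed 4 KB pages; not stated on the slide


def split_address(vaddr):
    """Return (virtual page number, offset within the page)."""
    return vaddr // PAGE_SIZE, vaddr % PAGE_SIZE


# Hypothetical stack addresses, not taken from the measured traces.
trace = [0xBFFFDFF0, 0xBFFFDF80, 0xBFFFC010, 0xBFFFDFF0]

# The set of distinct pages is the number of TLB entries the trace needs.
pages = sorted({split_address(a)[0] for a in trace})
print(pages)
```

With these sample addresses only two distinct pages are touched, which mirrors the narrow stack band described on the slide.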

Footprint of Global and Heap Page Accesses

[Figure: heap and global page accesses. x-axis: offset (0 to 4000); y-axis: unique virtual page number (8260 to 8310); series: heap, global]

The number of heap pages (y-axis) and the heap working set (x-axis) required are greater than those of stack and global.
heap band >> global band > stack band

Compulsory data-TLB misses

[Figure: Number of compulsory TLB misses (log scale, 1 to 100000) per benchmark. MiBench: blowfish, bitcount, cjpeg, djpeg, dijkstra, fft, rijndael, patricia. Spec2000: bzip2, gcc, mcf, parser. Also H-Mean. Series: stack, global, heap]

Highly active heap accesses evict the useful stack and global entries due to conflict misses.
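
A compulsory miss is the first reference to a page, so the per-region counts in this chart amount to counting first touches. The sketch below assumes a trace of (region, address) pairs and 4 KB pages. The trace format, the function name and the sample addresses are all hypothetical, not the authors' tooling.

```python
from collections import Counter

PAGE_SIZE = 4096  # assumed page size


def compulsory_tlb_misses(trace):
    """Count first-touch (compulsory) page references per region.

    trace: iterable of (region, virtual_address) pairs (hypothetical format).
    """
    seen = set()
    misses = Counter()
    for region, vaddr in trace:
        page = vaddr // PAGE_SIZE
        if page not in seen:  # first reference to this page
            seen.add(page)
            misses[region] += 1
    return misses


# Hypothetical trace: one stack page, one global page, three heap pages.
trace = [
    ("stack", 0xBFFFDFF0), ("heap", 0x02050000), ("heap", 0x02051000),
    ("stack", 0xBFFFDF00), ("global", 0x02040010), ("heap", 0x02052000),
    ("heap", 0x02050008),
]
result = compulsory_tlb_misses(trace)
print(dict(result))
```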

Compulsory data-Cache misses

[Figure: Number of compulsory cache misses (log scale, 1 to 10000000) per benchmark. Series: stack, global, heap]

The stack and global working sets are smaller than the heap working set.
A smaller stack and global cache size is enough to capture most of the memory accesses to these semantic regions.

Dynamic Data Memory Distribution

About 40% of the dynamic memory accesses go to the stack, and they are concentrated on only a few pages.
Out of every 4 memory accesses, roughly 2 go to the stack, 1 to global and 1 to the heap.

[Figure: distribution of dynamic memory accesses (0% to 100%). Series: stack, global, heap, text, env]

Semantic-Aware Memory Architecture

Smaller stack and global TLB
Smaller stack and global cache
Reduced power consumption

[Figure: SAM block diagram. Labels: Virtual address; Data Address Router with ld_data_base_reg, ld_env_base_reg, ld_data_bound_reg; sTLB (entries 0 to 1); gTLB (entries 0 to 3); uTLB (entries 0 to 63); sCache, gCache, hCache; Unified L2 Cache; To Processor]

Most of the memory references go to the sTLB and sCache.
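
The Data Address Router steers each data address to one region's TLB and cachelet. The slide names three registers but does not give the comparison rule, so the sketch below is only an assumed illustration: global data lies in [data_base, data_bound), the stack occupies a window just below env_base, and everything else is treated as heap. The function, the `stack_limit` parameter and the sample register values are hypothetical.

```python
def route(vaddr, data_base, data_bound, env_base, stack_limit):
    """Pick a semantic region for a data address (assumed rule, see lead-in)."""
    if data_base <= vaddr < data_bound:
        return "global"  # static global data region
    if env_base - stack_limit <= vaddr < env_base:
        return "stack"   # assumed: stack sits just below env_base
    return "heap"        # assumed: all remaining data addresses


# Hypothetical register contents, for illustration only.
regs = dict(data_base=0x02040000, data_bound=0x02050000,
            env_base=0xC0000000, stack_limit=0x00100000)

for addr in (0x02040010, 0xBFFFDFF0, 0x02051000):
    print(hex(addr), route(addr, **regs))
```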

Semantic-Aware TLB Misses

[Figure: x-axis: Number of TLB Entries (1 to 512); left y-axis: Number of TLB Misses (log scale, 1 to 1000000000); right y-axis: TLB Miss Rate (0% to 80%); series: sTLB, gTLB, hTLB, stack miss %, global miss %, heap miss %]

The number of hTLB misses does not come down even at 512 TLB entries.

[Figure: Number of TLB Misses (log scale) and TLB Miss Rate (0% to 80%) versus Number of TLB Entries (1 to 512); series: sTLB, gTLB, hTLB]

The number of gTLB misses saturates at 8 TLB entries.


[Figure: Number of TLB Misses (log scale) and TLB Miss Rate (0% to 80%) versus Number of TLB Entries (1 to 512); series: sTLB, gTLB, hTLB]

The number of sTLB misses saturates faster than global and heap.
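
Miss-versus-entries curves of this kind can be reproduced in miniature with a small TLB model. The sketch below assumes a fully associative TLB with LRU replacement and 4 KB pages. The replacement policy, the function name and both synthetic traces are assumptions for illustration, not the authors' simulator or data.

```python
from collections import OrderedDict

PAGE_SIZE = 4096  # assumed page size


def tlb_misses(addresses, entries):
    """Count misses of a fully associative TLB with LRU replacement."""
    tlb = OrderedDict()  # keys are page numbers, oldest first
    misses = 0
    for vaddr in addresses:
        page = vaddr // PAGE_SIZE
        if page in tlb:
            tlb.move_to_end(page)        # mark as most recently used
        else:
            misses += 1
            if len(tlb) >= entries:
                tlb.popitem(last=False)  # evict the least recently used page
            tlb[page] = True
    return misses


# Synthetic traces: the stack alternates between two pages,
# the heap sweeps cyclically over 100 pages.
stack_trace = [0xBFFFD000 - PAGE_SIZE * (i % 2) + 8 * (i % 64) for i in range(1000)]
heap_trace = [0x02050000 + PAGE_SIZE * (i % 100) for i in range(1000)]

for n in (1, 2, 4, 8, 128):
    print(n, tlb_misses(stack_trace, n), tlb_misses(heap_trace, n))
```

In this toy setup the stack misses stop falling at two entries, while the cyclic heap sweep keeps missing until the TLB holds all 100 of its pages.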


Semantic-Aware Cache Misses

[Figure: x-axis: Cache Size in KB (2 to 256); left y-axis: Number of Cache Misses (log scale, 1 to 1000000000); right y-axis: Cache Miss Rate (0% to 50%); series: sCache, gCache, hCache, sCache miss %, gCache miss %, hCache miss %]

The stack demonstrates a more stable working set size than the other two. Global saturates at a reasonable rate.

Simulation Infrastructure

Target Architecture: ARM
Performance: SimpleScalar
Power: Integrated Wattch Power Model
Access Time/Area: CACTI 3.0

Execution Engine                     Out-of-Order
Fetch / Decode / Issue / Commit      4 / 4 / 4 / 4
L1 / L2 / Memory Latency             1 / 6 / 150
TLB hit / miss latency               1 / 30
L1 Cache baseline                    DM 32KB
L1 stack / global / heap Cachelet    8KB / 8KB / 16KB
L2 Cache                             4w 512KB
Cache line size                      32B

Design Effectiveness of SAM

[Figure: per-benchmark results on a 0.00 to 1.00 scale for blowfish, bitcount, cjpeg, djpeg, dijkstra, fft, rijndael, patricia, bzip2, gcc, mcf, parser and Avg. Series: Performance Ratio, d-TLB Energy w/ SAT, L1 d-Cache Energy w/ SAC]

~4% Perf. Loss
~35% Energy Savings

Multi-porting Effectiveness of SAM

[Figure: per-benchmark results on a 0.00 to 1.00 scale for blowfish, bitcount, cjpeg, djpeg, dijkstra, fft, rijndael, patricia, bzip2, gcc, mcf, parser and Avg. Series: Performance Ratio, d-TLB Energy w/ SAT, L1 d-Cache Energy w/ SAC]

Baseline: 2 port TLB/Cache
SAM: 2 port s-TLB/Cache, 1 port g- and h-TLB/Cache

~45% Energy Savings
~4% Perf. Loss

Multi-porting Access Time / Die Area

                   Baseline        Semantic-Aware Cachelets (SAC)
Cache Model        32KB unified    8KB sCachelet    8KB gCachelet    16KB hCachelet   Total SAC Area   Area Savings
R/W ports          2               2                1                1
Access time (ns)   1.125           0.826            0.692            0.816
Area (mm²)         5.304           1.393            0.616            1.095            3.104            41.5 %

Cache Model        64KB unified    16KB sCachelet   16KB gCachelet   32KB hCachelet   Total SAC Area   Area Savings
R/W ports          2               2                1                1
Access time (ns)   1.630           0.949            0.816            0.948
Area (mm²)         8.942           2.555            1.095            2.246            5.897            34.1 %

These area savings come with a 4% performance loss.
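
The area-savings column follows from the per-cachelet areas. The short sketch below redoes that arithmetic with the figures from the table; the function name is hypothetical. Summing the rounded 64KB entries gives a total within 0.001 mm² of the 5.897 on the slide, presumably a rounding effect, and the savings percentage is unaffected at one decimal place.

```python
def area_savings(baseline_mm2, cachelet_mm2):
    """Return (total cachelet area, percent saved versus the baseline cache)."""
    total = sum(cachelet_mm2)
    return total, 100.0 * (baseline_mm2 - total) / baseline_mm2


# Areas in mm^2, copied from the table: s, g and h cachelets.
total_32, save_32 = area_savings(5.304, [1.393, 0.616, 1.095])
total_64, save_64 = area_savings(8.942, [2.555, 1.095, 2.246])

print(f"32KB config: {total_32:.3f} mm2 total, {save_32:.1f} % saved")
print(f"64KB config: {total_64:.3f} mm2 total, {save_64:.1f} % saved")
```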

Conclusions

Presented Semantic-Aware Multilateral technique to reduce d-TLB and data cache energy consumption
  data TLB: 36% energy savings
  data Cache: 34% energy savings
  4% performance loss

Selective Multi-porting SAM reduces energy and area
  data TLB: 47% energy savings
  data Cache: 45% energy savings
  4% performance loss

Distribution of Parallel TLB Activity

[Figure: y-axis: Number of Parallel TLB Accesses (0 to 14000000); x-axis categories: stack 1, global 1, heap 1, stack 2, global 2, heap 2, stack 3, global 3, heap 3, stack 4, global 4, heap 4; series: blowfish, bitcount, cjpeg, djpeg, dijkstra, fft, patricia, rijndael]

Cost-Effective TLB configuration

bm          Bf   Bc   Cj    Dj   Dij  Fft  Rij  Pat  Bz   Gc   Par
dTLB base   32   32   128   64   64   64   32   256  64   64   64
sTLB        2    2    2     2    2    2    2    2    4    4    4
gTLB        8    8    8     8    32   8    8    8    16   16   16
hTLB        16   32   128   64   32   64   32   256  64   64   64

Design Effectiveness of SAM

[Figure: Speed (y-axis, 0.88 to 1) versus Energy (x-axis, 0 to 1), one point each for blowfish, djpeg, bitcount, cjpeg, fft, dijkstra, rijndael, patricia, bzip2, mcf, gcc, parser and average]
