A Speculative Technique for Auto-Memoization Processor with Multithreading

DESCRIPTION
Yushi KAMIYA, Tomoaki TSUMURA, Hiroshi MATSUO, Yasuhiko NAKASHIMA: "A Speculative Technique for Auto-Memoization Processor with Multithreading" (presentation slides), Proc. 10th Intl. Conf. on Parallel and Distributed Computing, Applications and Technologies (PDCAT'09), Higashi-Hiroshima, Japan, pp.160-166 (Dec. 2009)

TRANSCRIPT
A Speculative Technique for Auto-Memoization Processor with Multithreading
Yushi KAMIYA†
Tomoaki TSUMURA†
Hiroshi MATSUO†
Yasuhiko NAKASHIMA‡
† Nagoya Institute of Technology
‡ Nara Institute of Science and Technology

The 10th International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT)
Hiroshima, Japan, December 9, 2009
Outline
• Background
• Model
  – Auto-Memoization Processor
  – Two speculative threads
• Implementation
  – Architecture
  – Register Synchronization
• Evaluation
• Conclusion
Research background
• Former speedup techniques
  – Based on ILP: superscalar, SIMD
  – Based on TLP: auto-parallelizing compilers
• Limits of ILP- and TLP-based techniques
  – Many programs exhibit little exploitable ILP
  – Resource conflicts: memory throughput
  – It is difficult to find TLP in programs
• Memoization
  – Storing the results of functions for later reuse
  ⇒ Auto-Memoization Processor
How to skip execution

Memoization for functions and loops
• Memoizable instruction regions
  (A) Functions: a region between an instruction with a callee label and the return instruction
  (B) Loops: a region between a backward branch instruction and its branch target label

  (A) Functions:         (B) Loops:
    main:                  .LL3:
      :                      :
      call func              ba .LL3
      :                      :
    func:
      :
      return %x
Auto-Memoization Processor

[Figure: processor with Regs, ALU, D$1/D$2, MemoBuf, and MemoTbl. When a function or a loop is detected, the input/output sequence is saved into MemoBuf during computation (with stores held in a temporary buffer) and registered in MemoTbl at the end of computation (writeback). On later executions, input matching against MemoTbl decides whether the region can be skipped.]
Registration of an input sequence
[Figure: MemoTbl consists of RB (CAM), RA (RAM), W1 (RAM), and RF (RAM). For the call opr(4), the input sequence — argument %i0 = 00000004, x = 00000002 at address 0x1000, and y[1] = 00000001 at 0x1008 — is collected in MemoBuf as address/value pairs (A)(B)(C) and registered as a chain of RB/RA entries; W1 pointers record the write set, and the result (v = 6) is stored for later write-back.]

  int x, y[5];
  ...
  opr(4);
  ...

  int opr(int a) {
    int v;
    v = x + a;
    v = v * y[1];
    return v;
  }
Input Matching

[Figure: on a later call opr(4), the input sequence (FF: 00000004, 00: 00000002 from 0x1000, 02: 00000001 from 0x1008) is compared against the chained RB entries of MemoTbl; when every entry matches, the stored result of opr can be reused and execution of the region is skipped.]
Reuse Overheads

[Figure: the two sources of reuse overhead — comparing the input sequence with the values of the RB entries (search), and writing the output sequence back into the registers (Regs) and cache (D$1).]
Speculative Multithreading
• Speedup technique using multiple cores
  – Precomputing functions and loop iterations with predicted input sequences
  – Storing the resulting input/output sequences in MemoTbl
  – The main core can then reuse the results instead of recalculating

[Figure: three SpMT cores receive stride-value-predicted inputs and calculate fact(3) = 6, fact(4) = 24, and fact(5) = 120 in advance, registering them in MemoTbl; the main core, having computed fact(1) and fact(2) itself, reuses the precomputed fact(4). (fact: factorial, n!)]
Memoization and Multithreading
• Problems of speculative multithreading
  – It does not make the best use of many cores:
    input/output sequences registered by SpMT cores are not always used,
    and sequences are discarded when MemoTbl fills up
• Our proposal: reduction of reuse overhead
  – A multithreading technique for memoization
  – Using multiple cores effectively
Reduction of Reuse Overhead
• Two kinds of reuse overhead
  – Overhead when input matching succeeds:
    the cost of input matching, plus
    the cost of writing back outputs into registers or caches
  – Overhead when input matching fails:
    the cost of input matching up to the point of failure
• Two speculative threads
  – No-memoization thread: assumes that input matching will fail
    ・・・ region (A) is executed normally
  – Preceding thread: assumes that input matching will succeed
    ・・・ region (B) is executed speculatively

  ...
  v = u / w;
  sum();      ・・・ (A)
  y = x + 4;  ・・・ (B)
  ...
Execution model

[Figure: timelines of the former model and the proposed model, broken into execution, search, and write-back phases. In the former model the main thread alone searches MemoTbl, writes back, and continues. In the proposed model the no-memoization thread executes the region while the main thread searches, and the preceding thread speculatively executes the code after the region; when matching fails the search cost (α) is hidden, and when it succeeds the post-region code (β) is overlapped, reducing the reuse overhead by α + β. A search may match completely, match only the first several input values of the RB entries, or not match at all.]

  ...
  v = u / w;
  x = sum(5, 3);
  y = x + 4;
  z = x + y;
  ...
  x = sum(3, 6);
  z = x + y;
  ...

  int sum(int a, int b) {
    int i, sum = 0;
    for (i = 0; i < a; i++)
      sum += i + b;
    return sum;
  }
Prediction Pointer

[Figure: each RB entry of MemoTbl is extended with a prediction pointer alongside the W1 pointer. During input matching for opr(4), the prediction pointer indicates the next entry expected to match, guiding the search through the entry chain until the whole sequence matches and the stored result (v = 6) is reused.]
Architecture – the proposed model

[Figure: the main thread, preceding thread, and no-memoization thread each run on a core with Regs, ALU, and D$1, plus an additional register file set (SpRF); MemoTbl, a shared MemoBuf, and D$2 are shared with all cores. The SpMT cores have their own local MemoBuf and input predictor, and do not use the shared MemoBuf.]
Register Synchronization

[Figure: register masks (one bit per register g0, g1, g2, ...) mark which registers each thread has modified. In the example code (...; sum(); a = b * c; ...; min(a, b, c); ...), at the points (A) search, (B), and (C) the main and no-memoization threads exchange RF ⇔ SpRF by copying (WB) only the masked registers (values such as 0FFF1000 and 00000040); registers whose mask bits are clear are not synchronized.]
Performance Evaluation
• Evaluation environment
  – Single-issue SPARC-V8 simulator
  – Simulation parameters:

    MemoBuf (shared + local) (RAM)            160 KBytes
    MemoTbl (CAM)                             128 KBytes
    MemoTbl (RAM)                             448 KBytes
    Comparison (register and CAM)             9 cycles / 32 Bytes
    Comparison (cache and CAM)                10 cycles / 32 Bytes
    Write-back (MemoTbl ⇒ register or cache)  1 cycle / 32 Bytes
    Register copy                             1 cycle / 32 bits

• Workload
  – SPEC CPU95 (train)
Performance – SPEC CPU95

[Figure: normalized execution cycles (0.0–1.4) for the SPEC CPU95 benchmarks — CINT: 099.go, 124.m88ksim, 126.gcc, 129.compress, 130.li, 132.ijpeg, 134.perl, 147.vortex; CFP: 101.tomcatv, 102.swim, 103.su2cor, 104.hydro2d, 107.mgrid, 110.applu, 125.turb3d, 141.apsi, 145.fpppp, 146.wave5 — under five models: (N) w/o Memoization, (M) Memoization, (S) Memoization + SpMT, (P) Memoization + Proposal, (A) Memoization + SpMT + Proposal. Each bar is broken down into exec, reuse_ovh, regcopy, D$1, D$2, and window cycles.]
Reduced cycles:

                                      max     ave.
  (M) Memoization                     13.9%   -0.1%
  (S) Memoization + SpMT              35.2%    5.6%
  (P) Memoization + Proposal          21.7%    2.1%
  (A) Memoization + SpMT + Proposal   36.0%    9.0%
Conclusion
• We have proposed a technique that reduces the reuse overhead
  – An approach that uses speculative multithreading from a different
    point of view
• Future work
  – Dynamically changing the assignment of cores to threads:
    exchanging the SpMT threads and the other threads with each other
  – Further improvement of the processor:
    hiding the overheads of reusing loop iterations