A Speculative Technique for Auto-Memoization Processor with Multithreading

DESCRIPTION
Yushi KAMIYA, Tomoaki TSUMURA, Hiroshi MATSUO, Yasuhiko NAKASHIMA: "A Speculative Technique for Auto-Memoization Processor with Multithreading" (presentation slides), Proc. 10th Intl. Conf. on Parallel and Distributed Computing, Applications and Technologies (PDCAT'09), Higashi-Hiroshima, Japan, pp.160-166 (Dec. 2009)

TRANSCRIPT
A Speculative Technique for Auto-Memoization Processor with Multithreading
Yushi KAMIYA†
Tomoaki TSUMURA†
Hiroshi MATSUO†
Yasuhiko NAKASHIMA‡
† Nagoya Institute of Technology
‡ Nara Institute of Science and Technology

The 10th International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT)
Hiroshima, Japan, December 9, 2009
Outline
• Background
• Model
  – Auto-Memoization Processor
  – Two speculative threads
• Implementation
  – Architecture
  – Register Synchronization
• Evaluation
• Conclusion
Research background
• Former speedup techniques
  – Based on ILP: superscalar, SIMD
  – Based on TLP: auto-parallelizing compilers
• Limits of ILP- and TLP-based techniques
  – Many programs exhibit little exploitable ILP
  – Resource conflicts: memory throughput
  – It is difficult to find TLP in programs
• Memoization
  – Storing the results of functions for later reuse
  ⇒ Auto-Memoization Processor
How to skip execution

Memoization for functions and loops
• Memoizable instruction regions
  (A) Functions: a region between an instruction with a callee label and the return instruction
  (B) Loops: a region between a backward branch instruction and its branch target label

  (A) Functions:         (B) Loops:
    main:                  .LL3:
      :                      :
      call func              ba .LL3
      :                      :
    func:
      :
      return %x
Auto-Memoization Processor

[Figure: processor with Regs, ALU, D$1/D$2, MemoBuf, and MemoTbl. When a function or a loop is detected, the input/output sequence is saved into MemoBuf during computation (with stores held in a temporary buffer) and registered in MemoTbl at the end of computation (writeback). On later executions, input matching against MemoTbl decides whether the region can be skipped.]
Registration of an input sequence
[Figure: MemoTbl consists of RB (CAM), RA (RAM), W1 (RAM), and RF (RAM). For the call opr(4), the input sequence — argument %i0 = 00000004, x = 00000002 at address 0x1000, and y[1] = 00000001 at 0x1008 — is collected in MemoBuf as address/value pairs (A)(B)(C) and registered as a chain of RB/RA entries; W1 pointers record the write set, and the result (v = 6) is stored for later write-back.]

  int x, y[5];
  ...
  opr(4);
  ...

  int opr(int a) {
    int v;
    v = x + a;
    v = v * y[1];
    return v;
  }
Input Matching

[Figure: on a later call opr(4), the input sequence (FF: 00000004, 00: 00000002 from 0x1000, 02: 00000001 from 0x1008) is compared against the chained RB entries of MemoTbl; when every entry matches, the stored result of opr can be reused and execution of the region is skipped.]
Reuse Overheads

[Figure: the two sources of reuse overhead — comparing the input sequence with the values of the RB entries (search), and writing the output sequence back into the registers (Regs) and cache (D$1).]
Speculative Multithreading
• Speedup technique using multiple cores
  – Precomputing functions and loop iterations with predicted input sequences
  – Storing the resulting input/output sequences in MemoTbl
  – The main core can then reuse the results instead of recalculating

[Figure: three SpMT cores receive stride-value-predicted inputs and calculate fact(3) = 6, fact(4) = 24, and fact(5) = 120 in advance, registering them in MemoTbl; the main core, having computed fact(1) and fact(2) itself, reuses the precomputed fact(4). (fact: factorial, n!)]
Memoization and Multithreading
• Problems of speculative multithreading
  – It does not make the best use of many cores:
    input/output sequences registered by SpMT cores are not always used,
    and sequences are discarded when MemoTbl fills up
• Our proposal: reduction of reuse overhead
  – A multithreading technique for memoization
  – Using multiple cores effectively
Reduction of Reuse Overhead
• Two kinds of reuse overhead
  – Overhead when input matching succeeds:
    the cost of input matching, plus
    the cost of writing back outputs into registers or caches
  – Overhead when input matching fails:
    the cost of input matching up to the point of failure
• Two speculative threads
  – No-memoization thread: assumes that input matching will fail
    ・・・ region (A) is executed normally
  – Preceding thread: assumes that input matching will succeed
    ・・・ region (B) is executed speculatively

  ...
  v = u / w;
  sum();      ・・・ (A)
  y = x + 4;  ・・・ (B)
  ...
Execution model

[Figure: timelines of the former model and the proposed model, broken into execution, search, and write-back phases. In the former model the main thread alone searches MemoTbl, writes back, and continues. In the proposed model the no-memoization thread executes the region while the main thread searches, and the preceding thread speculatively executes the code after the region; when matching fails the search cost (α) is hidden, and when it succeeds the post-region code (β) is overlapped, reducing the reuse overhead by α + β. A search may match completely, match only the first several input values of the RB entries, or not match at all.]

  ...
  v = u / w;
  x = sum(5, 3);
  y = x + 4;
  z = x + y;
  ...
  x = sum(3, 6);
  z = x + y;
  ...

  int sum(int a, int b) {
    int i, sum = 0;
    for (i = 0; i < a; i++)
      sum += i + b;
    return sum;
  }
Prediction Pointer

[Figure: each RB entry of MemoTbl is extended with a prediction pointer alongside the W1 pointer. During input matching for opr(4), the prediction pointer indicates the next entry expected to match, guiding the search through the entry chain until the whole sequence matches and the stored result (v = 6) is reused.]
Architecture – the proposed model

[Figure: the main thread, preceding thread, and no-memoization thread each run on a core with Regs, ALU, and D$1, plus an additional register file set (SpRF); MemoTbl, a shared MemoBuf, and D$2 are shared with all cores. The SpMT cores have their own local MemoBuf and input predictor, and do not use the shared MemoBuf.]
Register Synchronization

[Figure: register masks (one bit per register g0, g1, g2, ...) mark which registers each thread has modified. In the example code (...; sum(); a = b * c; ...; min(a, b, c); ...), at the points (A) search, (B), and (C) the main and no-memoization threads exchange RF ⇔ SpRF by copying (WB) only the masked registers (values such as 0FFF1000 and 00000040); registers whose mask bits are clear are not synchronized.]
Performance Evaluation
• Evaluation environment
  – Single-issue SPARC-V8 simulator
  – Simulation parameters:

    MemoBuf (shared + local) (RAM)            160 KBytes
    MemoTbl (CAM)                             128 KBytes
    MemoTbl (RAM)                             448 KBytes
    Comparison (register and CAM)             9 cycles / 32 Bytes
    Comparison (cache and CAM)                10 cycles / 32 Bytes
    Write-back (MemoTbl ⇒ register or cache)  1 cycle / 32 Bytes
    Register copy                             1 cycle / 32 bits

• Workload
  – SPEC CPU95 (train)
Performance – SPEC CPU95

[Figure: normalized execution cycles (0.0–1.4) for the SPEC CPU95 benchmarks — CINT: 099.go, 124.m88ksim, 126.gcc, 129.compress, 130.li, 132.ijpeg, 134.perl, 147.vortex; CFP: 101.tomcatv, 102.swim, 103.su2cor, 104.hydro2d, 107.mgrid, 110.applu, 125.turb3d, 141.apsi, 145.fpppp, 146.wave5 — under five models: (N) w/o Memoization, (M) Memoization, (S) Memoization + SpMT, (P) Memoization + Proposal, (A) Memoization + SpMT + Proposal. Each bar is broken down into exec, reuse_ovh, regcopy, D$1, D$2, and window cycles.]
Reduced cycles:

                                      max     ave.
  (M) Memoization                     13.9%   -0.1%
  (S) Memoization + SpMT              35.2%    5.6%
  (P) Memoization + Proposal          21.7%    2.1%
  (A) Memoization + SpMT + Proposal   36.0%    9.0%
Conclusion
• We have proposed a technique that reduces the reuse overhead
  – An approach that uses speculative multithreading from a different
    point of view
• Future work
  – Dynamically changing the assignment of cores to threads:
    exchanging the SpMT threads and the other threads with each other
  – Further improvement of the processor:
    hiding the overheads of reusing loop iterations