Developing an Architecture for a Single-Flux Quantum Based Reconfigurable Accelerator
F. Mehdipour, Hiroaki Honda*, H. Kataoka, K. Inoue and K. Murakami
Graduate School of Information Science and Electrical Engineering, Kyushu University, Japan
*Institute of Systems, Information Technologies and Nanotechnologies (ISIT), Fukuoka, Japan
E-mail: [email protected]
2009/01/29
Agenda
IntroductionSFQ-LSRDP General ArchitectureThe Design Procedure and Tool ChainInput/ Output Nodes PlacementArea MinimizationExperimental ResultsConclusions
2009/01/29
CREST-JST SFQ-RDP Project (2006~): A Low-power, high-performance reconfigurable processor based on single-flux quantum circuits
SFQ-LSRDPProf. K. Murakami et al.
Kyushu Univ.Architecture, Compiler
and Applications
Dr. S. Nagasawa et al.
Superconducting Research Lab. (SRL)
SFQ process
Prof. N. Yoshikawa et al.
Yokohama National Univ.SFQ-FPU chip, cell library
Prof. A. Fujimaki et al.
Nagoya Univ.SFQ-RDP chip, cell library,
and wiring
Prof. N. Takagi (Leader) et al.
Nagoya Univ.CAD for logic design and arithmetic circuits
2009/01/29
Goals
Discovering appropriate scientific applications
Developing compiler tools
Developing performance analyzing tools
Designing and Implementing SFQ-LSRDP Designing and Implementing SFQ-LSRDP architecture considering the features and architecture considering the features and limitations of SFQ circuitslimitations of SFQ circuits
2009/01/29
How a reconfigurable processor works
Application codeMain
Memory
GPPComputation-intensive (critical) code
Non-critical code
Computation-intensive (critical) code
Non-critical code
Non-critical code
LSRDP
...PE PE PEPE
...PE PE PEPE
...PE PE PEPE
LSRDP
ORN
…
ORN...
2009/01/29
Single-flux quantum (SFQ)against CMOS
CMOS main issues in implementing a large accelerator: High electric power consumption High heat radiation Difficulties in high-density packing
SFQ Features: High-speed switching and signal transmission Low power consumption Compact implementation (smaller area) Suitable for pipeline processing of data stream
ジョセフソン接合
超伝導ループ
磁束量子Single Flux QuantumSuperconductivityloop
Josephson junctionジョセフソン接合
超伝導ループ
磁束量子
ジョセフソン接合
超伝導ループ
磁束量子
ジョセフソン接合
超伝導ループ
磁束量子Single Flux QuantumSuperconductivityloop
Josephson junction
2009/01/29
Outline of large-scale reconfigurable data-path (LSRDP) processor
Features:Handling data flow graphs (DFGs)
extracted from scientific applicationsPipeline executionBurst transfer of input /output
rearranged data from/to memoryReduced no. of memory accesses
(alleviating the memory wall problem)MainMemory
GPP
ORN
: : : :
ORN : Operand Routing Network
...PE PE PEPE
...PE PE PEPE
...PE PE PEPE
LSRDP
: : : ... :SB
SMAC
Scratchpad Memory
Reconfigurable data-path components:A matrix of large number of floating-
point Functional Units (FUs) Reconfigurable Operand Routing
Network : (ORN)Dynamic reconfiguration facilitiesStreaming Buffer (SB) for I/O ports
SFQ-LSRDP General Architecture
2009/01/29
LSRDP architecture
Processing Elements
FU (Functional Unit): implements basic 64-bit double-precision floating point operations including: ADD/SUB and MUL
TU(transfer unit): as a routing resource for transferring data b/w inconsecutive rows
FU TU
FU
TU FU TUTU
FU TUFU
PE including two components
Four functionalities
Input ports
Output ports
MULNode 15
TU
7 4
15
13
12
2009/01/29
PE structures
FU TU
PE Basic arch.
3-inps/2-outs
FU -
- TU
FU TU
TU TU
FU TUTUPE arch. I
4-inps/3-outs
FU --
FU TU-
- TU-
FU TUTU
FU TU-
TU TUTU
FU TU
PE arch. II
3-inps/3-outs
TU TU
- TU
FU -TU TU
FU TU
TU-TU TU
2009/01/29
Layout types- Type IW
ORN
ORN
ORN
.
.
.
…A
TM
AT
M
AT
M
AT
M
AT
M
…A
TM
AT
M
AT
M
AT
M
AT
M
…A
TM
AT
M
AT
M
AT
M
AT
M
…A
TM
AT
M
AT
M
AT
M
AT
M
ADD/SUB
MUL
TU
Each PE implements ADD/SUB and MUL
M
A
T
: ADD/SUB
: MUL
: Transfer Unit
H
Flexible but consumes a lot of resources
2009/01/29
W
ORN
ORN
ORN
.
.
.
…M TA T A T A T M T
…M TA T A T A T M T
…M TA T A T A T M T
…M TA T A T A T M T
Layout types- Type II
H
Each PE implements ADD/SUB or MUL Each PE implements
ADD/SUB or MUL
ADD/SUB TU MUL TU
2009/01/29
Maximum connection length (MCL)-Definition
(i, 0)
(i+1,0)
(i+1,j)
...
...
(i,j)
ORN
...
... ...
...
(i+1,j+L)
Longest ConnectionLength= L
(i,j+2)
(i,j+1)
(i+1,j+2)
(i+1,j+1)
ConnectionLength= 0
ConnectionLength= 2
MCL: maximum horizontal distance b/w two PEs located in two subsequent rows
2009/01/29
An ORN structure
A. Fujimaki, et al., Demonstration of an SFQ-Based Accelerator Prototype for a High-Performance Computer,” ASC08, 2008.
FPUFPUFPUFPUFPU TTTTT
FPUFPUFPUFPUFPU TT
T
TT
½CB½CB½CB½CB½CB
CB CB CB CBT2 T2
½CB½CB½CB½CB½CB
CB CB CB CBCB
CB CB CB CBCB CB CB CBCBCB
CB CB CB CBT2 T2CB CB CB CBCB
T2 CB T2 CBT2 CB T2 CBCBT2
FPUFPUFPUFPUFPU TTTTT
FPUFPUFPUFPUFPU TT
T
TT
½CB½CB½CB½CB½CB
CB CB CB CBT2 T2
½CB½CB½CB½CB½CB
CB CB CB CBCB
CB CB CB CBCB CB CB CBCBCB
CB CB CB CBT2 T2CB CB CB CBCB
T2 CB T2 CBT2 CB T2 CBCBT2
ORN is consisted of 2-bit shift registers, 1-by-2 and 2-by-2 cross bar switches
FPU
2bit shiftregister
ORN
2009/01/29
Dynamic reconfiguration architecture
FU(A op B)
TransferUnit
ImmediateRegister (64b)
ORN
MUX
・・・・・・
ImmediateRegister
・・・・・・
PEInput-AInput-B Input-C
log(2x (2MCL+1)) x 3 [b]
Conf. Reg.[bit]
Three bit-stream lines for dynamic reconfiguration of:• Immediate registers (64bit) in each PE • Selector bits for muxes selecting the input data of FUs• Cross-bar switches in ORNs
Execution
wait
Starting ofExecution
End ofExecution
Starting ofReconfiguration
End ofReconfiguration
idle
Reconfiguration
ORN
Immediate
PE
InitialState
2009/01/29
What should be decided during the design procedure
Height
PE1 ...
...
...
PEm...
.
.
.
.
.
.
.
.
.
PE2 PE3
ORN
ORN
Width
...
...
Streaming Buffer (SB)
ORN
Operand Routing Network (ORN)
Streaming Buffer (SB)
Maximum Connection Length (MCL)? ORN size and structure?
Reconfiguration mechanism?(PE, ORN, Immediate data)
Layout: FU types(ADD/SUB and MUL)?
Width and Height ?
On-chip memory configuration?
The number of I/O ports?
The Design Procedure and Tool Chain
2009/01/29
Compiler and design flow
Application Code
Hardware-Software Partitioning(manual or automatic)
Critical Parts(h/w part)
Non-Critical Parts(s/w part)
Port positioning
s/w Part Modification
Binary Code(for GPP)
Configurations(for LSRDP)
LSRDP Architecture
Placement
Routing
DFG Genration
Bit-Stream Generation
DFGsDFGs
Analyzing DFG mappingresults
Design Phase
Mapping
• DFGs are manually generated
• DFG mapping results are employed for:• Analyzing LSRDP architecture statistics (a quantitative approach)• Generating LSRDP configuration bit-streams
2009/01/29
Benchmark applications
Finite differential method calculation of2nd order partial differential equations 1dim-Heat equation (Heat) 1dim-Vibration equation (Vibration) 2dim-Poisson equation (Poisson)
Quantum chemistry application Recursive parts of Electron Repulsion Integral calculation
(ERI-Rec)
Types of operations in the calculations:ADD/SUB and MUL
2009/01/29
DFG extraction- Heat equation
1-dim. heat equation for T(x,t)
Calculation by Finite DifferenceMethod (FDM)
2
2
( , ) ( , )T x t T x tA
t x
(A is const.)
T(i,j+1)
T(i-1,j) T(i,j) T(i+1,j)
+
*
*
+
D
B
T(i,j+1)
T(i-1,j) T(i,j) T(i+1,j)
+
*
*
+
D
B
),(),(*),(*
),(
11
1
jijiji
ji
txTtxTBtxTD
txT
Basic DFG
Basic DFG can be extended to horizontal and vertical directions to make a larger DFG
2009/01/29
A sample DFG - Heat
Inputs: 32Outputs: 16Operations: 721 Immediates: 364
A sample DFG (Heat)
2009/01/29
DFG mapping flow
Longest connectionsMCL= 2
Placing DFG nodes on LSRDP
Routing connections
Routing Inp/Out connections
DFGLSRDP Architecture
Description
Configuration File
Placing IO nodesRe-placing DFG nodes on
LSRDP (considering IO nodespositions)
Re-palcing output nodes
Modified Mapping Flow
Placing Input/Output Nodes
2009/01/29
Fan-out based I/O nodes placement
ni: the number of children of input node i
Ci1, Ci2, Ci3, Ci,ni X: location of the input node i
Total Connection Length:
TCL= |Ci1-X|+ |Ci2-X|+…|Ci,ni-X|
Objective: Minimize TCL ni= 1 X= Ci1 ni= 2 Ci1 <= X <= Ci2 ni= 3 X = Ci2 ni>=2 X = Cij, j=2…ni-1
(i, 0)(i,
j+3)... (i,j)
Reconfigurable Routing Network
...
...(i,j+4)
(i,j+2)
(i,j+1)
...
Reconfigurable Routing Network
... ...
...
...
...
...
...
(i+1,0)
(i+1,j+3)
... (i+1,j)
...(i+1,j+4)
(i+1,j+2)
(i+1,j+1)
INP2
INP1
2: Boundaries of the location withminimum connection length to children
1: There is only one location withminimum connection length to a child
INP3 3: The best location is on a middle point
2009/01/29
One main reason for the large MCL
Inputs Ports are far from each other
2009/01/29
Proximity-factor based placement
Proximity factor indicates how far a pair of input ports should be located from each other
For a pair of input nodes The larger number of closer descendants, higher proximity factor is
assigned
Si,j: a set of common descendants for input nodes i and j
Dk,i(=Dk,j): distance of common descendant node k to the input nodes i and j (it is equal to ASAP execution level of the node)
2,1,
,21,2
,12,1
......
...
...
nn
n
n
pp
pp
pp
P
jiifD
jiif
pp
jiSk ik
ijji
,,
,,1
2009/01/29
Proximity factor-Example
12
13
4
1
3
1
2
1),(
7,6,4
2,121
2,1
pIIP
S
4
1),(
7
3,131
3,1
pIIP
S
Inputs nodes I1 and I2 should be located closer than I3
4
1),(
7
3,232
3,2
pIIP
S
1
I1 I2 I3
4
6
7
2 3
5
2009/01/29
Input nodes placement alg.: Example
if C(l)> C(r)l= l+1, L[l]=j
elser= r+1, L[r]=j
1
1
1)(
r
li
ij liplC
1
1
1)(
r
li
ij irprC
N/2-3 N/2-2 N/2-1 1 N/2+1 N/2+2 N/2+3… …
N/2-3 N/2-2 2 1 N/2+1 N/2+2 N/2+3… …
Placing the 1st input node with the highest proximity factor
Placing the 2nd input node with the highest proximity factor
2009/01/29
Input ports placement alg.: Example
Placing i-th input node
l r
N/2-K … 2 1 3 … N/2+M… …
If C(l)> C(r):
l r
i … 2 1 3 … N/2+M… …
If C(r)> C(l):
l r
N/2-K … 2 1 3 … i …
Area Minimization
2009/01/29
Estimating the area of a PE
TU TUFU
PE arch. I
FU TU
PE arch. II
TU TU op
mux
A B C
TUsel
A B C
Layout I: Area(PE)= 2.2x Area(FU)
Layout II: Area(PE)= 1.2x Area(FU)
Layout I: Area(PE)= 2.2x Area(FU),
Layout II: Area(PE)= 1.2x Area(FU)
Area(FU)= Area(ADD/SUB)= Area(MUL)
Area(TU)= Area(MUX)~ 0.1 Area (FU)TUFU
PE basic arch
Layout I: Area(PE)= 2.1x Area(FU),
Layout II: Area(PE)= 1.1x Area(FU)
2009/01/29
Estimating the ORN area-PE Basic arch.
Area (ORN) = 1.5 x W x (4 x MCL) x Area (CB)
W: the no. of the PEs in a RDP row
FU TU
Basic arch.
3-inps/2-outs
MCL= 1
Nu
mb
er
of
row
s =
1.5
×W
Number of columns = 4×MCL
2009/01/29
Estimating the ORN area-PE arch. I
Area (ORN) = 2 x W x (6 x MCL+ 2) x Area (CB)
FU TUTU
PE arch. I
4-inps/3-outs
Nu
mb
er
of
row
s =
2
×W
Number of columns = 6×MCL+2
MCL= 1
2009/01/29
Estimating the ORN area-PE arch. II
Area (ORN) = 1.5 x W x (4 x MCL + 1) x Area (CB)
FU TU
PE arch. II
3-inps/3-outs
TU TUN
um
ber
of
row
s =
1.5
×W
Number of columns = 4×MCL+1
MCL= 2
2009/01/29
A modified connection length measurement
New measurement technique for the net length
src
dest
dh
dv
Connection length measurement:
initial C.L.= dh
modified C.L.= dh/ dv
src
dest1
dest2
C.L.(previous)= 3
C.L.(new)=1
C.L.(previous)= 3
C.L.(new)=3
2009/01/29
A modified connection length measurement- Example
1, 31, 1
2, 22, 2/3
Parent 2
3, 13,1/3
4, 04, 0
dh
dh/dv
Parent 1
0, 40, 4/3
1, 30.5, 0.75
2, 21, 0.5
3, 13/2, 1/4
4, 02, 0
0, 40, 1
is chosen when C.L. is measured as dh
MCL= 2
is chosen when C.L. is measured as dh/dv
MCL= 1
dh
dh/dv
2009/01/29
MCL minimization- Using a MCL threshold
A maximum threshold is assumed for the MCL During the placement process:
For each CL larger than the threshold, the vertical distance increases as:
dv= CL/MCL_Threshold
src
dest
dest
• max permitted length= 2
• dh =3 > max permitted length
• dv= 1
PE with the min. C.L to the source
dv= dv+ [3/2]=dv+1= 2
2009/01/29
Basic placement and routing vs. integrated placement and routing
DFG
LSRDP Architecture Description
Placing Input Nodes
Placing Operational & Output Nodes
Routing Nets
Routing IO NetsFinal Map
DFG
LSRDP Architecture Description
Placing Input Nodes using PF-based alg.
Placing Operational Nodes &Routing Nets (node by node)
Placing Output Nodes
Routing Output NetsFinal Map
Basic Placement and Routing Flow Integrated Placement and Routing Flow
Experimental Results
2009/01/29
Specifications of the benchmark DFGs
DFG
# of nodes
# of inputs
# ofoutputs
# of pureops
max. inp.nodesfan-out
Max.fan-out
Heat-8x1 34 6 4 16 2 1
Heat-8x2 60 8 4 32 3 3
Heat-16x2 172 16 12 96 3 3
Poisson-3x3 62 18 1 33 3 2
Vibration-4x2 48 8 4 24 3 2
Vibration-8x2 136 16 12 72 4 4
ERI-1 20 8 3 9 3 1
ERI-2 76 16 9 51 3 3
ERI-3 89 14 9 66 3 6
ERI-4 67 19 1 47 4 3
Max 172 19 12 96 4 6
2009/01/29
Evaluation results for various architectures-MCL and ORN sizes
Layout-I Layout-IIS1 S2 S1 S2
MCLPE basic arch. 14 6 15 12PE arch. I 8 3 9 4PE arch. II 10 4 12 7
ORN size (overall)
x CB
PE basic arch 25116 12600 29250 24336
PE arch. I 30600 13680 38080 17680
PE arch. II 18696 8237 23520 14790
nodes placement Connection length measurement
S1 fan-out based lh
S2 proximity-factor based lhv
S2 results in smaller MCL and ORN size for both layout types
2009/01/29
Evaluation results for various architectures-no. of utilized PEs
Layout-I Layout-IIS1 S2 S1 S2
No. of PEs (overall)
x PE
PE basic arch 580 683 330 344
PE arch. I 634 713 384 384
PE arch. II 627 669 360 384
By using lhv, larger number of RDP rows are utilized
larger number of PEs will be employed for S2
2009/01/29
Evaluation results for various architectures-overall LSRDP area (KJJ)
Layout-I Layout-II
S1 S2 S1 S2Overall LSRDP
Area x (KJJ)PE basic arch 36923 34193 29200 27040
PE arch. I 48083 35995 36190 25031PE arch. II 35307 31258 27266 23451
S2 results in smaller overall area in terms of KJJ for both layout types
Layout II results in smaller area
PE arch. II gives smaller area
FU TU FU TUTU
Basic PE arch.
3-inps/2-outs
PE arch. I
4-inps/3-outs
FU TU
PE arch. II
3-inps/3-outs
TU TU
Layout-I
ADD/SUBMUL
ADD/SUBMUL
ADD/SUBMUL
ADD/SUBMUL
ADD/SUBMUL
ADD/SUBMUL
ADD/SUBMUL
ADD/SUBMUL
ADD/SUBMUL
ADD/SUBMUL
...
...
...
ADD/SUBMUL
ADD/SUBMUL
ADD/SUBMUL
ADD/SUBMUL
...
.
.
.
.
.
.
.
.
.
ADD/SUBMUL
ADD/SUBMUL
ORN
ORN
ORN
.
.
.
Layout-II
ADD/SUB
MUL
ADD/SUB
MULADD/SUB
MUL
ADD/SUB
MULADD/SUB
MUL
...
...
...
ADD/SUB
MULADD/SUB
MUL ...
.
.
.
.
.
.
.
.
.
MULADD/SUB
ORN
ORN
ORN
.
.
.
2009/01/29
A sample ORN implementation
data_in
ladderclkin_lfin
clkin_lfoutclkin_hf
data_outcircuit under test
Block diagram of a high frequency test bench
input shift register output shift register
ladder
input shift registeroutput shift register
circuit under test
A photograph of a chip with 1-to-3 ORN prototype test bench
5 m
m
2009/01/29
Conclusions
SFQ-LSRDP is a basic core of a high-performance low-power computer
Data Flow Graphs (DFGs) extracted from scientific applications are mapped on the LSRDP
LSRDP micro-architecture is designed based on characteristics of DFGs via a quantitative approach
LSRDP is promising for resolving issues originated from CMOS technology as well as achieving remarkable performance
Acknowledgement: This research was supported in part by Core Research for Evolutional Science
and Technology (CREST) of Japan Scienceand Technology Corporation (JST).
Thanks for your attention!
Any questions?