Download - Developing an Architecture for a Single-Flux Quantum Based Reconfigurable Accelerator F. Mehdipour, Hiroaki Honda *, H. Kataoka, K. Inoue and K. Murakami

Developing an Architecture for a Single-Flux Quantum Based Reconfigurable Accelerator

F. Mehdipour, Hiroaki Honda*, H. Kataoka, K. Inoue 　 and K. Murakami

Graduate School of Information Science and Electrical Engineering, Kyushu University, Japan

*Institute of Systems, Information Technologies and Nanotechnologies (ISIT), Fukuoka, Japan

E-mail: [email protected]

2009/01/29

Agenda

IntroductionSFQ-LSRDP General ArchitectureThe Design Procedure and Tool ChainInput/ Output Nodes PlacementArea MinimizationExperimental ResultsConclusions

2009/01/29

CREST-JST SFQ-RDP Project (2006~): A Low-power, high-performance reconfigurable processor based on single-flux quantum circuits

SFQ-LSRDPProf. K. Murakami et al.

Kyushu Univ.Architecture, Compiler

and Applications

Dr. S. Nagasawa et al.

Superconducting Research Lab. (SRL)

SFQ process

Prof. N. Yoshikawa et al.

Yokohama National Univ.SFQ-FPU chip, cell library

Prof. A. Fujimaki et al.

Nagoya Univ.SFQ-RDP chip, cell library,

and wiring

Prof. N. Takagi (Leader) et al.

Nagoya Univ.CAD for logic design and arithmetic circuits

2009/01/29

Goals

Discovering appropriate scientific applications

Developing compiler tools

Developing performance analyzing tools

Designing and Implementing SFQ-LSRDP Designing and Implementing SFQ-LSRDP architecture considering the features and architecture considering the features and limitations of SFQ circuitslimitations of SFQ circuits

2009/01/29

How a reconfigurable processor works

Application codeMain

Memory

GPPComputation-intensive (critical) code

Non-critical code

Computation-intensive (critical) code

Non-critical code

Non-critical code

LSRDP

...PE PE PEPE

...PE PE PEPE

...PE PE PEPE

LSRDP

ORN

…

ORN...

2009/01/29

Single-flux quantum (SFQ)against CMOS

CMOS main issues in implementing a large accelerator: High electric power consumption High heat radiation Difficulties in high-density packing

SFQ Features: High-speed switching and signal transmission Low power consumption Compact implementation (smaller area) Suitable for pipeline processing of data stream

ジョセフソン接合

超伝導ループ

磁束量子Single Flux QuantumSuperconductivityloop

Josephson junctionジョセフソン接合

超伝導ループ

磁束量子


超伝導ループ

磁束量子


超伝導ループ

磁束量子Single Flux QuantumSuperconductivityloop

Josephson junction

2009/01/29

Outline of large-scale reconfigurable data-path (LSRDP) processor

Features:Handling data flow graphs (DFGs)

extracted from scientific applicationsPipeline executionBurst transfer of input /output

rearranged data from/to memoryReduced no. of memory accesses

(alleviating the memory wall problem)MainMemory

GPP

ORN

: : : :

ORN : Operand Routing Network

...PE PE PEPE

...PE PE PEPE

...PE PE PEPE

LSRDP

: : : ... :SB

SMAC

Scratchpad Memory

Reconfigurable data-path components:A matrix of large number of floating-

point Functional Units (FUs) Reconfigurable Operand Routing

Network : (ORN)Dynamic reconfiguration facilitiesStreaming Buffer (SB) for I/O ports

SFQ-LSRDP General Architecture

2009/01/29

LSRDP architecture

Processing Elements

FU (Functional Unit): implements basic 64-bit double-precision floating point operations including: ADD/SUB and MUL

TU(transfer unit): as a routing resource for transferring data b/w inconsecutive rows

FU TU

FU

TU FU TUTU

FU TUFU

PE including two components

Four functionalities

Input ports

Output ports

MULNode 15

TU

7 4

15

13

12

2009/01/29

PE structures

FU TU

PE Basic arch.

3-inps/2-outs

FU -

- TU

FU TU

TU TU

FU TUTUPE arch. I

4-inps/3-outs

FU --

FU TU-

- TU-

FU TUTU

FU TU-

TU TUTU

FU TU

PE arch. II

3-inps/3-outs

TU TU

- TU

FU -TU TU

FU TU

TU-TU TU

2009/01/29

Layout types- Type IW

ORN

ORN

ORN

.

.

.

…A

TM

AT

M

AT

M

AT

M

AT

M

…A

TM

AT

M

AT

M

AT

M

AT

M

…A

TM

AT

M

AT

M

AT

M

AT

M

…A

TM

AT

M

AT

M

AT

M

AT

M

ADD/SUB

MUL

TU

Each PE implements ADD/SUB and MUL

M

A

T

: ADD/SUB

: MUL

: Transfer Unit

H

Flexible but consumes a lot of resources

2009/01/29

W

ORN

ORN

ORN

.

.

.

…M TA T A T A T M T




Layout types- Type II

H

Each PE implements ADD/SUB or MUL Each PE implements

ADD/SUB or MUL

ADD/SUB TU MUL TU

2009/01/29

Maximum connection length (MCL)-Definition

(i, 0)

(i+1,0)

(i+1,j)

...

...

(i,j)

ORN

...

... ...

...

(i+1,j+L)

Longest ConnectionLength= L

(i,j+2)

(i,j+1)

(i+1,j+2)

(i+1,j+1)

ConnectionLength= 0

ConnectionLength= 2

MCL: maximum horizontal distance b/w two PEs located in two subsequent rows

2009/01/29

An ORN structure

A. Fujimaki, et al., Demonstration of an SFQ-Based Accelerator Prototype for a High-Performance Computer,” ASC08, 2008.

FPUFPUFPUFPUFPU TTTTT

FPUFPUFPUFPUFPU TT

T

TT

½CB½CB½CB½CB½CB

CB CB CB CBT2 T2


CB CB CB CBCB

CB CB CB CBCB CB CB CBCBCB

CB CB CB CBT2 T2CB CB CB CBCB

T2 CB T2 CBT2 CB T2 CBCBT2

FPUFPUFPUFPUFPU TTTTT

FPUFPUFPUFPUFPU TT

T

TT


CB CB CB CBT2 T2


CB CB CB CBCB

CB CB CB CBCB CB CB CBCBCB

CB CB CB CBT2 T2CB CB CB CBCB

T2 CB T2 CBT2 CB T2 CBCBT2

ORN is consisted of 2-bit shift registers, 1-by-2 and 2-by-2 cross bar switches

FPU

2bit shiftregister

ORN

2009/01/29

Dynamic reconfiguration architecture

FU(A op B)

TransferUnit

ImmediateRegister (64b)

ORN

MUX

・・・・・・

ImmediateRegister

・・・・・・

PEInput-AInput-B Input-C

log(2x (2MCL+1)) x 3 [b]

Conf. Reg.[bit]

Three bit-stream lines for dynamic reconfiguration of:• Immediate registers (64bit) in each PE • Selector bits for muxes selecting the input data of FUs• Cross-bar switches in ORNs

Execution

wait

Starting ofExecution

End ofExecution

Starting ofReconfiguration

End ofReconfiguration

idle

Reconfiguration

ORN

Immediate

PE

InitialState

2009/01/29

What should be decided during the design procedure

Height

PE1 ...

...

...

PEm...

.

.

.

.

.

.

.

.

.

PE2 PE3

ORN

ORN

Width

...

...

Streaming Buffer (SB)

ORN

Operand Routing Network (ORN)

Streaming Buffer (SB)

Maximum Connection Length (MCL)? ORN size and structure?

Reconfiguration mechanism?(PE, ORN, Immediate data)

Layout: FU types(ADD/SUB and MUL)?

Width and Height ?

On-chip memory configuration?

The number of I/O ports?

The Design Procedure and Tool Chain

2009/01/29

Compiler and design flow

Application Code

Hardware-Software Partitioning(manual or automatic)

Critical Parts(h/w part)

Non-Critical Parts(s/w part)

Port positioning

s/w Part Modification

Binary Code(for GPP)

Configurations(for LSRDP)

LSRDP Architecture

Placement

Routing

DFG Genration

Bit-Stream Generation

DFGsDFGs

Analyzing DFG mappingresults

Design Phase

Mapping

• DFGs are manually generated

• DFG mapping results are employed for:• Analyzing LSRDP architecture statistics (a quantitative approach)• Generating LSRDP configuration bit-streams

2009/01/29

Benchmark applications

Finite differential method calculation of2nd order partial differential equations 1dim-Heat equation 　　　　 (Heat) 1dim-Vibration equation (Vibration) 2dim-Poisson equation (Poisson)

Quantum chemistry application Recursive parts of Electron Repulsion Integral calculation

(ERI-Rec)

Types of operations in the calculations:ADD/SUB and MUL

2009/01/29

DFG extraction- Heat equation

1-dim. heat equation for T(x,t)

　　　　　　　　　　　　　

Calculation by Finite DifferenceMethod (FDM)

2

2

( , ) ( , )T x t T x tA

t x

(A is const.)

T(i,j+1)

T(i-1,j) T(i,j) T(i+1,j)

+

*

*

+

D

B

T(i,j+1)

T(i-1,j) T(i,j) T(i+1,j)

+

*

*

+

D

B

),(),(*),(*

),(

11

1

jijiji

ji

txTtxTBtxTD

txT

Basic DFG

Basic DFG can be extended to horizontal and vertical directions to make a larger DFG

2009/01/29

A sample DFG - Heat

Inputs: 32Outputs: 16Operations: 721 Immediates: 364

A sample DFG (Heat)

2009/01/29

DFG mapping flow

Longest connectionsMCL= 2

Placing DFG nodes on LSRDP

Routing connections

Routing Inp/Out connections

DFGLSRDP Architecture

Description

Configuration File

Placing IO nodesRe-placing DFG nodes on

LSRDP (considering IO nodespositions)

Re-palcing output nodes

Modified Mapping Flow

Placing Input/Output Nodes

2009/01/29

Fan-out based I/O nodes placement

ni: the number of children of input node i

Ci1, Ci2, Ci3, Ci,ni X: location of the input node i

Total Connection Length:

TCL= |Ci1-X|+ |Ci2-X|+…|Ci,ni-X|

Objective: Minimize TCL ni= 1 X= Ci1 ni= 2 Ci1 <= X <= Ci2 ni= 3 X = Ci2 ni>=2 X = Cij, j=2…ni-1

(i, 0)(i,

j+3)... (i,j)

Reconfigurable Routing Network

...

...(i,j+4)

(i,j+2)

(i,j+1)

...

Reconfigurable Routing Network

... ...

...

...

...

...

...

(i+1,0)

(i+1,j+3)

... (i+1,j)

...(i+1,j+4)

(i+1,j+2)

(i+1,j+1)

INP2

INP1

2: Boundaries of the location withminimum connection length to children

1: There is only one location withminimum connection length to a child

INP3 3: The best location is on a middle point

2009/01/29

One main reason for the large MCL

Inputs Ports are far from each other

2009/01/29

Proximity-factor based placement

Proximity factor indicates how far a pair of input ports should be located from each other

For a pair of input nodes The larger number of closer descendants, higher proximity factor is

assigned

Si,j: a set of common descendants for input nodes i and j

Dk,i(=Dk,j): distance of common descendant node k to the input nodes i and j (it is equal to ASAP execution level of the node)

2,1,

,21,2

,12,1

......

...

...

nn

n

n

pp

pp

pp

P

jiifD

jiif

pp

jiSk ik

ijji

,,

,,1

2009/01/29

Proximity factor-Example

12

13

4

1

3

1

2

1),(

7,6,4

2,121

2,1

pIIP

S

4

1),(

7

3,131

3,1

pIIP

S

Inputs nodes I1 and I2 should be located closer than I3

4

1),(

7

3,232

3,2

pIIP

S

1

I1 I2 I3

4

6

7

2 3

5

2009/01/29

Input nodes placement alg.: Example

if C(l)> C(r)l= l+1, L[l]=j

elser= r+1, L[r]=j

1

1

1)(

r

li

ij liplC

1

1

1)(

r

li

ij irprC

N/2-3 N/2-2 N/2-1 1 N/2+1 N/2+2 N/2+3… …

N/2-3 N/2-2 2 1 N/2+1 N/2+2 N/2+3… …

Placing the 1st input node with the highest proximity factor

Placing the 2nd input node with the highest proximity factor

2009/01/29

Input ports placement alg.: Example

Placing i-th input node

l r

N/2-K … 2 1 3 … N/2+M… …

If C(l)> C(r):

l r

i … 2 1 3 … N/2+M… …

If C(r)> C(l):

l r

N/2-K … 2 1 3 … i …

Area Minimization

2009/01/29

Estimating the area of a PE

TU TUFU

PE arch. I

FU TU

PE arch. II

TU TU op

mux

A B C

TUsel

A B C

Layout I: Area(PE)= 2.2x Area(FU)

Layout II: Area(PE)= 1.2x Area(FU)

Layout I: Area(PE)= 2.2x Area(FU),


Area(FU)= Area(ADD/SUB)= Area(MUL)

Area(TU)= Area(MUX)~ 0.1 Area (FU)TUFU

PE basic arch

Layout I: Area(PE)= 2.1x Area(FU),


2009/01/29

Estimating the ORN area-PE Basic arch.

Area (ORN) = 1.5 x W x (4 x MCL) x Area (CB)

W: the no. of the PEs in a RDP row

FU TU

Basic arch.

3-inps/2-outs

MCL= 1

Nu

mb

er

of

row

s =

1.5

×W

Number of columns = 4×MCL

2009/01/29

Estimating the ORN area-PE arch. I

Area (ORN) = 2 x W x (6 x MCL+ 2) x Area (CB)

FU TUTU

PE arch. I

4-inps/3-outs

Nu

mb

er

of

row

s =

2

×W

Number of columns = 6×MCL+2

MCL= 1

2009/01/29

Estimating the ORN area-PE arch. II

Area (ORN) = 1.5 x W x (4 x MCL + 1) x Area (CB)

FU TU

PE arch. II

3-inps/3-outs

TU TUN

um

ber

of

row

s =

1.5

×W

Number of columns = 4×MCL+1

MCL= 2

2009/01/29

A modified connection length measurement

New measurement technique for the net length

src

dest

dh

dv

Connection length measurement:

initial C.L.= dh

modified C.L.= dh/ dv

src

dest1

dest2

C.L.(previous)= 3

C.L.(new)=1

C.L.(previous)= 3

C.L.(new)=3

2009/01/29

A modified connection length measurement- Example

1, 31, 1

2, 22, 2/3

Parent 2

3, 13,1/3

4, 04, 0

dh

dh/dv

Parent 1

0, 40, 4/3

1, 30.5, 0.75

2, 21, 0.5

3, 13/2, 1/4

4, 02, 0

0, 40, 1

is chosen when C.L. is measured as dh

MCL= 2

is chosen when C.L. is measured as dh/dv

MCL= 1

dh

dh/dv

2009/01/29

MCL minimization- Using a MCL threshold

A maximum threshold is assumed for the MCL During the placement process:

For each CL larger than the threshold, the vertical distance increases as:

dv= CL/MCL_Threshold

src

dest

dest

• max permitted length= 2

• dh =3 > max permitted length

• dv= 1

PE with the min. C.L to the source

dv= dv+ [3/2]=dv+1= 2

2009/01/29

Basic placement and routing vs. integrated placement and routing

DFG

LSRDP Architecture Description

Placing Input Nodes

Placing Operational & Output Nodes

Routing Nets

Routing IO NetsFinal Map

DFG

LSRDP Architecture Description

Placing Input Nodes using PF-based alg.

Placing Operational Nodes &Routing Nets (node by node)

Placing Output Nodes

Routing Output NetsFinal Map

Basic Placement and Routing Flow Integrated Placement and Routing Flow

Experimental Results

2009/01/29

Specifications of the benchmark DFGs

DFG

# of nodes

# of inputs

# ofoutputs

# of pureops

max. inp.nodesfan-out

Max.fan-out

Heat-8x1 34 6 4 16 2 1

Heat-8x2 60 8 4 32 3 3

Heat-16x2 172 16 12 96 3 3

Poisson-3x3 62 18 1 33 3 2

Vibration-4x2 48 8 4 24 3 2

Vibration-8x2 136 16 12 72 4 4

ERI-1 20 8 3 9 3 1

ERI-2 76 16 9 51 3 3

ERI-3 89 14 9 66 3 6

ERI-4 67 19 1 47 4 3

Max 172 19 12 96 4 6

2009/01/29

Evaluation results for various architectures-MCL and ORN sizes

Layout-I Layout-IIS1 S2 S1 S2

MCLPE basic arch. 14 6 15 12PE arch. I 8 3 9 4PE arch. II 10 4 12 7

ORN size (overall)

x CB

PE basic arch 25116 12600 29250 24336

PE arch. I 30600 13680 38080 17680

PE arch. II 18696 8237 23520 14790

nodes placement Connection length measurement

S1 fan-out based lh

S2 proximity-factor based lhv

S2 results in smaller MCL and ORN size for both layout types

2009/01/29

Evaluation results for various architectures-no. of utilized PEs

Layout-I Layout-IIS1 S2 S1 S2

No. of PEs (overall)

x PE

PE basic arch 580 683 330 344

PE arch. I 634 713 384 384

PE arch. II 627 669 360 384

By using lhv, larger number of RDP rows are utilized

larger number of PEs will be employed for S2

2009/01/29

Evaluation results for various architectures-overall LSRDP area (KJJ)

Layout-I Layout-II

S1 S2 S1 S2Overall LSRDP

Area x (KJJ)PE basic arch 36923 34193 29200 27040

PE arch. I 48083 35995 36190 25031PE arch. II 35307 31258 27266 23451

S2 results in smaller overall area in terms of KJJ for both layout types

Layout II results in smaller area

PE arch. II gives smaller area

FU TU FU TUTU

Basic PE arch.

3-inps/2-outs

PE arch. I

4-inps/3-outs

FU TU

PE arch. II

3-inps/3-outs

TU TU

Layout-I

ADD/SUBMUL

ADD/SUBMUL

ADD/SUBMUL

ADD/SUBMUL

ADD/SUBMUL

ADD/SUBMUL

ADD/SUBMUL

ADD/SUBMUL

ADD/SUBMUL

ADD/SUBMUL

...

...

...

ADD/SUBMUL

ADD/SUBMUL

ADD/SUBMUL

ADD/SUBMUL

...

.

.

.

.

.

.

.

.

.

ADD/SUBMUL

ADD/SUBMUL

ORN

ORN

ORN

.

.

.

Layout-II

ADD/SUB

MUL

ADD/SUB

MULADD/SUB

MUL

ADD/SUB

MULADD/SUB

MUL

...

...

...

ADD/SUB

MULADD/SUB

MUL ...

.

.

.

.

.

.

.

.

.

MULADD/SUB

ORN

ORN

ORN

.

.

.

2009/01/29

A sample ORN implementation

data_in

ladderclkin_lfin

clkin_lfoutclkin_hf

data_outcircuit under test

Block diagram of a high frequency test bench

input shift register output shift register

ladder

input shift registeroutput shift register

circuit under test

A photograph of a chip with 1-to-3 ORN prototype test bench

5 m

m

2009/01/29

Conclusions

SFQ-LSRDP is a basic core of a high-performance low-power computer

Data Flow Graphs (DFGs) extracted from scientific applications are mapped on the LSRDP

LSRDP micro-architecture is designed based on characteristics of DFGs via a quantitative approach

LSRDP is promising for resolving issues originated from CMOS technology as well as achieving remarkable performance

Acknowledgement: This research was supported in part by Core Research for Evolutional Science

and Technology (CREST) of Japan Scienceand Technology Corporation (JST).

Thanks for your attention!

Any questions?

Download - Developing an Architecture for a Single-Flux Quantum Based Reconfigurable Accelerator F. Mehdipour, Hiroaki Honda *, H. Kataoka, K. Inoue and K. Murakami

Top Related