A Distributed Stallable Architecture to Handle Delay Variations Dr. Alberto A. Del Barrio Complutense University of Madrid

Upload: greendisc

Post on 19-Jun-2015


DESCRIPTION

A summary of Alberto A. Del Barrio's work during his stay at UCLA, under the direction of Prof. Jason Cong

TRANSCRIPT

Page 1: A Distributed Stallable Architecture to Handle Delay Variations

A Distributed Stallable Architecture to Handle Delay Variations

Dr. Alberto A. Del Barrio
Complutense University of Madrid

Page 2: A Distributed Stallable Architecture to Handle Delay Variations
Page 3: A Distributed Stallable Architecture to Handle Delay Variations

UCLA Stay

• VLSI, Architecture, Synthesis and Technology (VAST) Laboratory
– http://cadlab.cs.ucla.edu/beta/cadlab/news
• Led by Prof. Jason Cong
• Around 20 students (postdocs, predocs, master's, undergrads, visitors)
• More than 400 papers
• Tool releases, startups
– xPilot → AutoESL → Vivado HLS (Xilinx)

Page 4: A Distributed Stallable Architecture to Handle Delay Variations

UCLA research summary

• Goal: handling delay variations by applying the Distributed Architecture developed in my thesis
• How to do this?
– Simulator & HW implementation
– Binding algorithm

Page 6: A Distributed Stallable Architecture to Handle Delay Variations

How to model delay while considering process variations [Jung and Kim, ICCAD’07]

Page 7: A Distributed Stallable Architecture to Handle Delay Variations

Comparisons

• State of the art: Worst Case
– Overly pessimistic
• CODES’09: BTW + Centralized Stallable Arch.
– Every failure at execution time will stall the whole datapath
    • Many operations finishing their execution at the same time will incur an extra cycle penalty
    • Dynamic behavior that escapes static analysis will stall the datapath
– Can only recover from failures of up to 1 cycle
– Worse behavior when sharing resources
• Proposal: Distributed Stallable Arch.

Page 8: A Distributed Stallable Architecture to Handle Delay Variations

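An example: Differential Equation Solver (DES)

[Figure: data-flow graph of the DES example with 11 operations: multiplications ×1, ×2, ×3, ×4, ×6, ×7; additions +5, +10; subtractions -9, -11; comparison <8]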

Page 9: A Distributed Stallable Architecture to Handle Delay Variations

Worst Case vs Best Case

Worst Case = Better Than Worst Case

Best Case: 5 cycles of difference. The Distributed Arch. will be close to the Best Case (BC)

Page 10: A Distributed Stallable Architecture to Handle Delay Variations

UCLA research summary

• Goal: handling delay variations by applying the Distributed Architecture developed in my thesis
• How to do this?
– Simulator & HW implementation
– Binding algorithm

Page 11: A Distributed Stallable Architecture to Handle Delay Variations

Razor Register

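[Figure: Razor register schematic with a main register, a shadow register, a 0/1 multiplexer and a comparator (Comp); signals clk, dclk, din and hit]

The comparison is performed between two registers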
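The two-register comparison can be illustrated with a small behavioral sketch. It is only a model under stated assumptions: the class name, the method name and the recovery policy (the main register is overwritten with the shadow value on a mismatch) are hypothetical, and the polarity of the slide's hit signal is not given, so the sketch simply returns a mismatch flag.

```python
from dataclasses import dataclass


@dataclass
class RazorRegister:
    """Behavioral model: main register sampled at clk, shadow at the later dclk."""
    main: int = 0
    shadow: int = 0

    def cycle(self, din_at_clk: int, din_at_dclk: int) -> bool:
        """Sample din twice and report whether the two samples differ."""
        self.main = din_at_clk      # value seen at the clk edge
        self.shadow = din_at_dclk   # value seen at the delayed dclk edge
        mismatch = self.main != self.shadow
        if mismatch:
            # Assumed recovery policy: trust the later (shadow) sample.
            self.main = self.shadow
        return mismatch


reg = RazorRegister()
on_time = reg.cycle(din_at_clk=42, din_at_dclk=42)  # logic settled before clk
late = reg.cycle(din_at_clk=13, din_at_dclk=42)     # logic settled after clk
print(on_time, late, reg.main)
```

The second call models a timing failure: the value captured at clk is stale, the comparison detects it, and the register ends up holding the value seen at dclk.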


Page 12: A Distributed Stallable Architecture to Handle Delay Variations

Razor Register Chronogram

[Figure: chronogram with labels D, dclk and T]

If the inputs change during the dclk_shift interval, the value stored in the shadow register could be dirty

If an FU is shared, the worst-case delay is allocated for its operations

Page 13: A Distributed Stallable Architecture to Handle Delay Variations

Literature: Centralized Stallable Architecture

[Figure: centralized stallable architecture. Blocks: Buffer, Rz. Input Register, Combinational Logic, Rz. Output Register, FSM Combinational Logic, Rz. State Register, Rz. Stab. Register]

By Cong et al., CODES’09. Deals with process variations.

Problems: FU sharing restricts the possibilities of the design. The worst-case timing must be allocated when sharing happens.

Page 14: A Distributed Stallable Architecture to Handle Delay Variations

Better Than Worst Case (unconstrained)

Additional slack guarantees that ops. 4 and 7 are correct

Page 15: A Distributed Stallable Architecture to Handle Delay Variations

Better Than Worst Case (unconstrained)

If only operations 4 and 7 have problems, it's OK

Page 16: A Distributed Stallable Architecture to Handle Delay Variations

Better Than Worst Case (unconstrained)

The failure of operation 5 was not considered in the static analysis

Every failure is translated into an extra cycle

The 1st iteration finishes after 11 cycles (1 failure)

Page 17: A Distributed Stallable Architecture to Handle Delay Variations

My approach: Distributed Stallable Architecture

[Figure: distributed stallable architecture. Blocks: Buffer, Raz. Input Register, Combinational Logic, Rz. Output Register, Rz. Stab. Register, local controllers (FSM CL 1 with St Reg 1, ..., FSM CL N with St Reg N), Commit Signals Logic Unit]

Distributed Architecture, by Del Barrio et al., DATE’10, TCAD (March, 2011)

The controller is split into several local controllers, plus a coordinator responsible for checking hazards dynamically

But how to integrate it with Razor registers?

Page 18: A Distributed Stallable Architecture to Handle Delay Variations

Usefulness of a Distributed Architecture

[Figure: two unconstrained static schedules of the DES operations over csteps 1 to 4: the BTW static schedule and the priority-list static schedule]

Operations 7 and 4 are covered: if a failure happens, it will have no impact on the latency

Page 19: A Distributed Stallable Architecture to Handle Delay Variations

Usefulness of a Distributed Architecture

[Figure: BTW execution example and Distributed execution example of the DES operations over csteps 1 to 4, with operations marked R and stall cycles labeled 1-stall to 4-stall]

But what if more failures happen? BTW may not be enough

Page 20: A Distributed Stallable Architecture to Handle Delay Variations

Distributed Architecture: Best Case Static Scheduling (unconstrained)


Page 21: A Distributed Stallable Architecture to Handle Delay Variations

Distributed Architecture

We schedule considering the Best Case, but the datapath is able to reschedule on the fly, and some failures can also be hidden

If no more failures occur, the 1st iteration will finish in 10 cycles, while hiding 2 failures
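The difference between stalling the whole datapath and letting each operation start when its operands are ready can be sketched as follows. This is a toy model, not the simulator from the talk: the 4-operation graph, the single-cycle best-case delay, the one-cycle failure penalty and both function names are assumptions made for illustration, and resource constraints are ignored.

```python
from graphlib import TopologicalSorter


def centralized_latency(static_latency, failed_ops):
    """Centralized model: every failure stalls the whole datapath one cycle."""
    return static_latency + len(failed_ops)


def distributed_latency(preds, failed_ops, base_delay=1, penalty=1):
    """Distributed model: an operation starts once its own predecessors finish."""
    finish = {}
    for op in TopologicalSorter(preds).static_order():
        start = max((finish[p] for p in preds[op]), default=0)
        delay = base_delay + (penalty if op in failed_ops else 0)
        finish[op] = start + delay
    return max(finish.values())


# Hypothetical data-flow graph: a -> b -> d and c -> d.
preds = {"a": [], "b": ["a"], "c": [], "d": ["b", "c"]}

best_case = distributed_latency(preds, failed_ops=set())
hidden = distributed_latency(preds, failed_ops={"c"})    # c has slack
exposed = distributed_latency(preds, failed_ops={"b"})   # b is critical
stalled = centralized_latency(best_case, failed_ops={"c"})
print(best_case, hidden, exposed, stalled)
```

In this toy graph a failure in `c` is absorbed by its slack in the distributed model, while the centralized model pays one extra cycle for it. A failure on the critical path (`b`) costs one cycle in both.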


Page 22: A Distributed Stallable Architecture to Handle Delay Variations

Simulator Results:

Unconstrained

Benchmark     Codes BTW   DisM Barrier   DisM
DiffEq        10.5        10.394         6.31
ARF [13]      18.79       18.379         14.103
FFT [12]      11.57       12.218         11.844
FIR16 [12]    20.65       20.342         16.074
EWF [13]      12.892      12.829         12.316

Codes and DisM Barrier have similar results. DisM reduces latency by 17%

RC-constrained: 4+, 4*

Benchmark     Codes BTW   DisM Barrier   DisM
DiffEq        10.66       10.374         8.305
ARF [13]      23.32       20.318         17.825
FFT [12]      16.51       14.92          14.38
FIR16 [12]    24.27       21.22          16.6
EWF [13]      16.1        12.79          12.47

DisM Barrier and DisM reduce latency by 12% and 23%, respectively
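One plausible way to relate the quoted percentages to the tables is to average the per-benchmark latency reduction with respect to Codes BTW. The slides do not say how the percentages were computed, so the averaging method and the helper name `mean_reduction` are assumptions. The numbers are copied from the two tables, taking the first one as the unconstrained case.

```python
from statistics import mean

# Latencies per benchmark: (Codes BTW, DisM Barrier, DisM).
unconstrained = {
    "DiffEq": (10.5, 10.394, 6.31),
    "ARF": (18.79, 18.379, 14.103),
    "FFT": (11.57, 12.218, 11.844),
    "FIR16": (20.65, 20.342, 16.074),
    "EWF": (12.892, 12.829, 12.316),
}
rc_constrained = {
    "DiffEq": (10.66, 10.374, 8.305),
    "ARF": (23.32, 20.318, 17.825),
    "FFT": (16.51, 14.92, 14.38),
    "FIR16": (24.27, 21.22, 16.6),
    "EWF": (16.1, 12.79, 12.47),
}


def mean_reduction(table, column):
    """Mean per-benchmark latency reduction (%) versus Codes BTW (column 0)."""
    return 100 * mean(1 - row[column] / row[0] for row in table.values())


unc_barrier = mean_reduction(unconstrained, 1)
unc_dism = mean_reduction(unconstrained, 2)
rc_barrier = mean_reduction(rc_constrained, 1)
rc_dism = mean_reduction(rc_constrained, 2)
print(f"unconstrained: Barrier {unc_barrier:.1f}%, DisM {unc_dism:.1f}%")
print(f"RC-constrained: Barrier {rc_barrier:.1f}%, DisM {rc_dism:.1f}%")
```

Under this averaging the values land close to the quoted figures: DisM Barrier is within a fraction of a percent of Codes BTW in the unconstrained case, and the remaining three averages fall near 17%, 12% and 23%.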

Page 23: A Distributed Stallable Architecture to Handle Delay Variations

Implementation Results (P & R)

The Distr. Arch. implements modulo scheduling dynamically, as it executes operations when they are ready

The comparison is performed between two registers, so there is no problem with FU sharing: din (combinational) does not influence the comparison result

(Time columns: Latency, Cycle Time, Ex. Time. Area columns: DSPs, LUTs, Regs, %DSPs, %LUTs, %Regs.)

Toolchain                       Resources   Latency   Cycle Time (ns)   Ex. Time (ns)   DSPs   LUTs   Regs   %DSPs   %LUTs   %Regs
10 ns Vivado+NP*+Xilinx 14.4    (2*,2+)     4         8.165             32.66           2      135    71     0.26    0.04    0.05
5 ns Vivado+NP*+Xilinx 14.4     (2*,2+)     5         9.003             45.015          2      157    73     0.26    0.05    0.05
Alb+Xilinx 14.4                 (2*,2+)     3         8.7               26.1            2      246    118    0.26    0.08    0.08
Alb tuned+Xilinx 14.4           (2*,2+)     3         9.535             28.605          2      204    86     0.26    0.07    0.06
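The execution-time column is consistent with latency multiplied by cycle time, assuming the latency is given in cycles. The short check below recomputes it from the table values; the function name `execution_time` is only illustrative.

```python
import math

# Toolchain -> (latency in cycles, cycle time in ns, reported Ex. Time in ns).
rows = {
    "10 ns Vivado+NP*+Xilinx 14.4": (4, 8.165, 32.66),
    "5 ns Vivado+NP*+Xilinx 14.4": (5, 9.003, 45.015),
    "Alb+Xilinx 14.4": (3, 8.7, 26.1),
    "Alb tuned+Xilinx 14.4": (3, 9.535, 28.605),
}


def execution_time(latency_cycles, cycle_time_ns):
    """Execution time in ns for one run of the datapath."""
    return latency_cycles * cycle_time_ns


computed = {name: execution_time(lat, cyc) for name, (lat, cyc, _) in rows.items()}
consistent = all(
    math.isclose(computed[name], reported)
    for name, (_, _, reported) in rows.items()
)
fastest = min(computed, key=computed.get)
print(consistent, fastest)
```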

Page 24: A Distributed Stallable Architecture to Handle Delay Variations

UCLA research summary

• Goal: handling delay variations by applying the Distributed Architecture developed in my thesis
• How to do this?
– Simulator & HW implementation
– Binding algorithm

Page 25: A Distributed Stallable Architecture to Handle Delay Variations

Binding Problem: bad binding

cstep 1: ×1, ×2, +5
cstep 2: ×6, ×3, <8
cstep 3: -9, ×7, ×4
cstep 4: -11, +10

The hazard between 8 and 9 stalls several components of the graph

Page 26: A Distributed Stallable Architecture to Handle Delay Variations

Clock Cycle   M1 (State, T)   M2 (State, T)   A1 (State, T)   A2 (State, T)
1             S1, 0           S2, 0           S5, 0           S8, 0
2             S6, 0           S3, 0           S5, 0           S8, 0
3             S7, 0           S4, 0           S10, 0          S8, 0
4             S7, 0           S4, 0           S10, 0          S9, 0
5             S1, 1           S2, 1           S10, 0          S11, 0
6             S6, 1           S3, 1           S5, 1           S8, 1
7             S7, 1           S4, 1           S10, 1          S8, 1
8             S7, 1           S4, 1           S10, 1          S8, 1
9             S7, 1           S4, 1           S10, 1          S9, 1
10            S1, 0           S2, 0           S10, 1          S11, 1

[Table columns "Issued" and "Committed", listing the operations issued and committed in each clock cycle, are not recoverable from the transcript]

Page 27: A Distributed Stallable Architecture to Handle Delay Variations

Binding problem: good binding

cstep 1: ×1, ×2, +5
cstep 2: ×6, ×3, <8
cstep 3: -9, ×7, ×4
cstep 4: -11, +10

The hazard between 8 and 10 is less damaging than the one between 6 and 4, because of the extra cstep

The cost function depends on two bound operations
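A greedy binding driven by a pairwise cost can be sketched as below. The schedule and the unit names M1, M2, A1, A2 come from the slides, but the cost function (the inverse of the cstep gap between two operations sharing a unit) and the function names are hypothetical stand-ins, since the talk does not give the actual cost. The result is not meant to reproduce the binding shown on the slide.

```python
# Schedule from the slide: operation id -> (unit type, cstep).
# Additions, subtractions and the comparison all use adder units (A1, A2).
schedule = {
    1: ("mul", 1), 2: ("mul", 1), 5: ("add", 1),
    6: ("mul", 2), 3: ("mul", 2), 8: ("add", 2),
    9: ("add", 3), 7: ("mul", 3), 4: ("mul", 3),
    11: ("add", 4), 10: ("add", 4),
}
units = {"mul": ["M1", "M2"], "add": ["A1", "A2"]}


def hazard_cost(cstep_a, cstep_b):
    """Hypothetical pairwise cost: closer csteps on one unit cost more."""
    return 1.0 / abs(cstep_a - cstep_b)


def greedy_binding(schedule, units):
    """Bind operations in cstep order to the unit with the lowest added cost."""
    bound = {fu: [] for fus in units.values() for fu in fus}
    binding = {}
    for op, (kind, cstep) in sorted(schedule.items(), key=lambda kv: kv[1][1]):
        # A unit cannot run two operations in the same cstep.
        free = [fu for fu in units[kind]
                if all(schedule[o][1] != cstep for o in bound[fu])]
        best = min(free, key=lambda fu: sum(
            hazard_cost(cstep, schedule[o][1]) for o in bound[fu]))
        bound[best].append(op)
        binding[op] = best
    return binding


binding = greedy_binding(schedule, units)
print(binding)
```

With this cost, operations 8 and 9 end up on different adders, which avoids the adjacent-cstep hazard of the bad binding. The sketch raises an error if a cstep holds more operations of one type than there are units.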

Page 28: A Distributed Stallable Architecture to Handle Delay Variations

Clock Cycle   M1 (State, T)   M2 (State, T)   A1 (State, T)   A2 (State, T)
1             S1, 0           S2, 0           S5, 0           S9, 0
2             S6, 0           S3, 0           S5, 0           S9, 0
3             S4, 0           S7, 0           S8, 0           S9, 0
4             S1, 1           S2, 1           S10, 0          S11, 0
5             S6, 1           S3, 1           S5, 1           S9, 1
6             S4, 1           S7, 1           S8, 1           S9, 1
7             S1, 0           S7, 1           S8, 1           S11, 1
8             S1, 0           S2, 0           S10, 1          S11, 1

[Table columns "Issued" and "Committed", listing the operations issued and committed in each clock cycle, are not recoverable from the transcript]

Page 29: A Distributed Stallable Architecture to Handle Delay Variations

State of the Research

• Simulator: OK
• Implementation:
– Sharing problem not solved yet
• Binding algorithm
– Greedy version: OK
– ILP formulation: difficult to model, not working
– Network Flow formulation: possible target
• Study of controller granularity
– 1 FSM per FU
– 1 FSM per operation cluster ... but what exactly is an operation cluster?