Dynamic Voltage/Frequency Scaling in Loop Accelerators using BLADES

Ganesh Dasika¹, Shidhartha Das², Kevin Fan¹, Scott Mahlke¹, David Bull²

¹ University of Michigan, Electrical Engineering and Computer Science, Advanced Computer Architecture Laboratory, Ann Arbor, MI
² ARM Ltd., Cambridge, United Kingdom



Introduction

[Austin, IEEE Computer March 04]

Razor

• Allows for voltage/frequency scaling beyond the first-failure point
• Exploits the difference between design-time conditions (“slow”) and actual conditions (“typical”)

[Das, JSSC 2006]
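
A minimal sketch of the idea, assuming the usual Razor arrangement in which a delayed shadow sample is compared against the main flip-flop. The function name, the three outcome labels, and the timing numbers (in picoseconds) are hypothetical; `t_spec_ps` is assumed to be the speculation window during which a late-arriving value can still be detected.

```python
def razor_capture(arrival_ps, clock_period_ps, t_spec_ps):
    """Classify one register capture in a simplified Razor timing model."""
    if arrival_ps <= clock_period_ps:
        # Data settled before the clock edge: the main flip-flop is correct.
        return "ok"
    if arrival_ps <= clock_period_ps + t_spec_ps:
        # Data settled inside the speculation window: the delayed shadow
        # sample disagrees with the main flip-flop, so an error is flagged.
        return "detected"
    # Data settled after the shadow sample too: the error goes unnoticed.
    return "missed"


# Hypothetical arrival times for a 1000 ps clock and a 300 ps window.
for arrival in (900, 1100, 1400):
    print(arrival, razor_capture(arrival, clock_period_ps=1000, t_spec_ps=300))
```

In this model, scaling past the first-failure point stays safe only while every late arrival lands in the "detected" band and is then recovered.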

Razor in General Purpose Processors

• Requires detailed analysis of microarchitectural impact
– Analyze what state should be stored
– Lengthening pipeline for stabilization increases complexity of forwarding logic
• Unpredictable control and data flow
• Difficult to determine worst-case vectors

BLADES

• Better-than-worst-case Loop Accelerator Design
• Incorporate DVFS into ASICs using Razor
– Shave off some of the high NRE using HLS
– Develop generic methodology for any application
– Razor solution for a templated architecture
• Create ASIC design flow that is aware of Razor-ization costs

Loop Accelerator Template

• Hardware realization of modulo-scheduled loop
• Parameterized execution resources, storage, connectivity
• Control is statically determined, simple and not timing-critical
• Opportunity to make application-specific optimizations
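
As a rough illustration of what a modulo-scheduled loop looks like once it is bound to hardware, the sketch below builds a modulo reservation table for a toy loop body. The op names, FU bindings, start times, and the helper `build_mrt` are all made up for illustration and are not taken from the talk.

```python
# Toy modulo reservation table (MRT). All op names, FU bindings and
# start times below are hypothetical, chosen only for illustration.

def build_mrt(schedule, ii):
    """Map each (FU, time mod II) slot to the op that occupies it.

    schedule: dict op_name -> (fu_index, start_time) for one iteration.
    ii: initiation interval, i.e. cycles between successive iterations.
    Raises ValueError if two ops collide on the same FU slot.
    """
    mrt = {}
    for op, (fu, start) in schedule.items():
        slot = (fu, start % ii)
        if slot in mrt:
            raise ValueError(f"{op} conflicts with {mrt[slot]} on slot {slot}")
        mrt[slot] = op
    return mrt


# One iteration with start times 0 to 2; a new iteration starts every 2 cycles.
schedule = {"ld": (0, 0), "add": (1, 1), "shl": (0, 1), "st": (1, 2)}
mrt = build_mrt(schedule, ii=2)

for (fu, slot), op in sorted(mrt.items()):
    print(f"FU {fu}, cycle mod II = {slot}: {op}")
```

Because the table is fixed at compile time, each FU's opcode for every cycle is known in advance, which is consistent with the slide's point that control is statically determined and simple.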

Razorized Loop Accelerator

[Figure: loop accelerator datapath with Razor on the function units (+, *). Labels: extended register queues, added interconnect, “roll-back” muxes, R.]

• R is the number of extra entries required
• Function of max pipeline depth and error-detection delay

Error “Life-Cycle”

[Figure: path of an error through the Razorized accelerator (+, * function units). Labels: Razor, error, error OR-tree, error stabilization, error processing, roll-back pipelining, control, error reset.]

Issues with RazorIssues with Razor

• Area, added hold-fixing

9

tspec

D

CLK
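
The hold-fixing cost can be sketched as follows, assuming the standard Razor constraint that no short path may change a flip-flop's input within the speculation window after the clock edge (otherwise a legitimate new value would be flagged as an error). The function name, path names, and delays (in picoseconds) are hypothetical, and reading `tspec` as that speculation window is an assumption.

```python
def hold_padding_needed(min_path_delays_ps, t_spec_ps, t_hold_ps=0):
    """Return the extra delay each path needs to satisfy
    min-path delay >= t_hold + t_spec (0 if the path already meets it)."""
    required = t_hold_ps + t_spec_ps
    return {
        path: max(0, required - delay)
        for path, delay in min_path_delays_ps.items()
    }


# Hypothetical short-path delays into two Razor flip-flops.
paths = {"fu0_out": 200, "fu1_out": 600}
padding = hold_padding_needed(paths, t_spec_ps=500)
print(padding)
```

Every picosecond of padding is buffer area and power, which is why a wider speculation window raises the hold-fixing overhead in this model.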

Opcode-chaining

[Figure: example schedules of Add0, Add1, Or0 and Or1 on FUs 0-3 over time, including one FU executing the chained ops Add-Or0 and Add-Or1. Labels: +, II.]

• 50% FU utilization removes the need for hold-fixing, but requires halving performance or doubling area
• Use hybrid scheme to execute >2 ops per FU
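
The area/performance trade-off stated above can be checked with simple arithmetic. The sketch below assumes single-cycle ops and a hypothetical helper `fus_needed`; the four-op workload mirrors the op names on the slide, but mapping them to these exact numbers is an assumption.

```python
import math


def fus_needed(num_ops, ii, max_utilization=1.0):
    """Minimum number of FUs so that num_ops single-cycle ops fit in II
    cycles when each FU may be busy at most max_utilization of the time."""
    slots_per_fu = math.floor(ii * max_utilization)
    if slots_per_fu < 1:
        raise ValueError("II too small for this utilization cap")
    return math.ceil(num_ops / slots_per_fu)


# Four ops per iteration (e.g. Add0, Add1, Or0, Or1).
print(fus_needed(4, ii=2))                       # full utilization
print(fus_needed(4, ii=2, max_utilization=0.5))  # same II, capped at 50%
print(fus_needed(4, ii=4, max_utilization=0.5))  # same FU count, doubled II
```

Capping utilization at 50% either doubles the FU count at the same II or keeps the FU count and doubles II, which is the "doubling area or halving performance" choice on the slide.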

Identifying Opcode Chains

• Compiler identifies subgraphs of 3-4 input, 1 output instructions
– All arith. ops supported
• Greedy selection algorithm

[Figure: dataflow graph of LD, <<, >>, +, & and ST operations with candidate subgraphs numbered 1-7.]
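
The slides do not say how the greedy selection ranks candidates, so the following is only one plausible sketch: take the largest candidate subgraph first and skip any candidate that overlaps an already chosen one. The function name, candidate names, and op ids are hypothetical.

```python
def greedy_select(candidates):
    """Pick non-overlapping candidate subgraphs, largest first.

    candidates: dict name -> set of op ids covered by that subgraph.
    Returns the list of chosen names in selection order.
    """
    chosen, covered = [], set()
    # Sort by size (descending), then by name for a deterministic order.
    for name in sorted(candidates, key=lambda n: (-len(candidates[n]), n)):
        ops = candidates[name]
        if covered.isdisjoint(ops):
            chosen.append(name)
            covered |= ops
    return chosen


# Hypothetical candidates over a small dataflow graph.
candidates = {
    "c1": {"shl_a", "add_a"},
    "c2": {"add_a", "shr_a", "add_b"},
    "c3": {"and_a", "add_c"},
    "c4": {"add_b", "and_a"},
}
selected = greedy_select(candidates)
print(selected)
```

A real flow would presumably also check the 3-4 input, 1 output limit and weigh each candidate by its benefit; this sketch leaves both out.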

Custom FUs

[Figure: the dataflow graph of LD, <<, >>, +, & and ST operations with subgraphs numbered 1-7, shown alongside a custom FU built from >>, +, +, << and + ops. Labels: “Enabled every 2 cycles”, “Razor DFF”.]

Results

• idct, sharp and systolic_dct had multiple CFUs and an overall lower number of FUs
• Viterbi and dequant had significant control flow that restricted opportunities for creating custom ops
• 22% reduction in hold-fixing overhead in sobel

Conclusion

• Application-specific optimizations definitely help to mitigate Razor costs
– 24% reduction in overhead
– 33% energy savings overall
• Can optimize Razor-ization with further input from the compiler
– Critical-instruction analysis
– Error impact analysis

Thank you!

http://cccp.eecs.umich.edu

Future Work

• Errors in different FUs affect the system differently
– Error “impact-analysis”
– Data computation not necessarily error-sensitive
– Address, branch target/direction critical to functionality
• Razor-ization of arbitrary Verilog

Motivation

• Using Razor has significant design overhead
– Error-recovery system
– Added “backup” state
– Additional hold-time fixing
• Each microarchitecture requires different modifications
• Information about the workload cannot be used, since the design must preserve generality
