Executive Summary

Page 1: Executive Summary


Page 2: Executive Summary

Overall Architecture of ARC

♦ Architecture of ARC
  Multiple cores and accelerators
  Global Accelerator Manager (GAM)
  Shared L2 cache banks and NoC routers between multiple accelerators

[Figure legend: GAM; accelerator + DMA + SPM; shared router; core; shared L2 $; memory controller]

Page 3: Executive Summary

What are the Problems with ARC?

♦ Dedicated accelerators are inflexible
  An LCA may be useless for new algorithms or new domains
  Often under-utilized
  LCAs contain many replicated structures
  • Things like fp-ALUs, DMA engines, SPM
  • Unused when the accelerator is unused

♦ We want flexibility and better resource utilization
  Solution: CHARM

♦ Private SPM is wasteful
  Solution: BiN

Page 4: Executive Summary

A Composable Heterogeneous Accelerator-Rich Microprocessor (CHARM) [ISLPED'12]

♦ Motivation
  Great deal of data parallelism
  • Tasks performed by accelerators tend to have a great deal of data parallelism
  Variety of LCAs with possible overlap
  • Utilization of any particular LCA is somewhat sporadic
  It is expensive to have both:
  • Sufficient diversity of LCAs to handle the various applications
  • Sufficient quantity of a particular LCA to handle the parallelism
  Overlap in functionality
  • LCAs can be built using a limited number of smaller, more general blocks: accelerator building blocks (ABBs)

♦ Idea
  Flexible accelerator building blocks (ABBs) that can be composed into accelerators

♦ Leverage economies of scale

Page 5: Executive Summary

Microarchitecture of CHARM

♦ ABB
  Accelerator building blocks (ABBs): primitive components that can be composed into accelerators

♦ ABB islands
  Multiple ABBs
  Shared DMA controller, SPM, and NoC interface

♦ ABC
  Accelerator Block Composer (ABC)
  • Orchestrates the data flow between ABBs to create a virtual accelerator
  • Arbitrates requests from cores

♦ Other components
  Cores, L2 banks, memory controllers

Page 6: Executive Summary

CAMEL

♦ What are the problems with CHARM?
  What if new algorithms introduce new ABBs?
  What if we want to use this architecture on multiple domains?

♦ Adding programmable fabric to increase
  Longevity
  Domain span

Page 7: Executive Summary

CAMEL

♦ What are the problems with CHARM?
  What if new algorithms introduce new ABBs?
  What if we want to use this architecture on multiple domains?

♦ Adding programmable fabric to increase
  Longevity
  Domain span

♦ The ABC is now responsible for allocating the programmable fabric

Page 8: Executive Summary


Page 9: Executive Summary

Extensive Use of Accelerators

♦ Accelerators provide high power-efficiency over general-purpose processors
  IBM wire-speed processor
  Intel Larrabee

♦ ITRS 2007 system drivers prediction: accelerator number close to 1500 by 2022

♦ Two kinds of accelerators
  Tightly coupled – part of the datapath
  Loosely coupled – shared via the NoC

♦ Challenges
  Accelerator extraction and synthesis
  Efficient accelerator management
  • Scheduling
  • Sharing
  • Virtualization
  …
  Friendly programming models

[Figure: four cores with private L1s, shared L2 banks, and two accelerators connected by the NoC]

Page 10: Executive Summary

Architecture Support for Accelerator-Rich CMPs (ARC) [DAC'12]

Motivation
♦ Managing accelerators through the OS is expensive
♦ In an accelerator-rich CMP, management should be cheaper both in terms of time and energy
  Invoke "open"s the driver and returns the handle to the driver; called once
  RD/WR is called multiple times

Latency of OS-based accelerator management (# cycles):

Operation   1 core    2 cores   4 cores   8 cores   16 cores
Invoke      214413    256401    266133    308434    316161
RD/WR       703       725       781       837       885
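
To make the cost asymmetry in the table concrete, here is a sketch of the conventional OS-driver pattern it measures, written against generic POSIX calls. The device node name and the byte-stream transfer model are illustrative assumptions, not part of ARC.

    #include <fcntl.h>
    #include <unistd.h>

    /* "Invoke" corresponds to opening the driver once; every block moved
     * afterwards pays the per-call RD/WR latency shown in the table above. */
    int use_accelerator_via_os(const char *buf_in, char *buf_out,
                               size_t n_blocks, size_t blk_bytes)
    {
        int fd = open("/dev/lcacc0", O_RDWR);          /* Invoke: called once      */
        if (fd < 0)
            return -1;

        for (size_t i = 0; i < n_blocks; i++) {        /* RD/WR: called many times */
            if (write(fd, buf_in + i * blk_bytes, blk_bytes) < 0 ||
                read (fd, buf_out + i * blk_bytes, blk_bytes) < 0) {
                close(fd);
                return -1;
            }
        }
        close(fd);
        return 0;
    }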

[Figure: an application running on the CPU goes through the OS to reach the accelerator manager and the accelerators]

Page 11: Executive Summary

Overall Architecture of ARC

♦ Architecture of ARC
  Multiple cores and accelerators
  Global Accelerator Manager (GAM)
  Shared L2 cache banks and NoC routers between multiple accelerators

[Figure legend: GAM; accelerator + DMA + SPM; shared router; core; shared L2 $; memory controller]

Page 12: Executive Summary

Overall Communication Scheme in ARC

1. The core requests a given type of accelerator (lcacc-req).

New ISA:
  lcacc-req t
  lcacc-rsrv t, e
  lcacc-cmd id, f, addr
  lcacc-free id

[Figure: CPU, memory, LCA, and GAM; step 1 is the core's request to the GAM]

Page 13: Executive Summary

Overall Communication Scheme in ARC

2. The GAM responds with a "list + waiting time" or NACK.

New ISA:
  lcacc-req t
  lcacc-rsrv t, e
  lcacc-cmd id, f, addr
  lcacc-free id

[Figure: step 2 is the GAM's response to the core]

Page 14: Executive Summary

Overall Communication Scheme in ARC

3. The core reserves the accelerator (lcacc-rsrv) and waits.

New ISA:
  lcacc-req t
  lcacc-rsrv t, e
  lcacc-cmd id, f, addr
  lcacc-free id

[Figure: step 3 is the core's reservation message to the GAM]

Page 15: Executive Summary

Overall Communication Scheme in ARC

4. The GAM ACKs the reservation and sends the core ID to the accelerator.

New ISA:
  lcacc-req t
  lcacc-rsrv t, e
  lcacc-cmd id, f, addr
  lcacc-free id

[Figure: step 4 is the GAM's ACK to the core and its message to the LCA]

Page 16: Executive Summary

Overall Communication Scheme in ARC

5. The core shares a task description with the accelerator through memory and starts it (lcacc-cmd).
  • The task description consists of:
    o Function ID and input parameters
    o Input/output addresses and strides
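
A minimal C sketch of such a task description is shown below. The field names and widths are assumptions for illustration; the slide only lists the categories of information it carries.

    #include <stdint.h>

    #define LCACC_MAX_ARGS 8

    /* In-memory task description the core writes and the LCA reads (step 6). */
    struct lcacc_task_desc {
        uint32_t function_id;             /* which function the LCA should run     */
        uint32_t num_args;
        uint64_t args[LCACC_MAX_ARGS];    /* scalar input parameters               */
        uint64_t in_addr,  in_stride;     /* input array base address and stride   */
        uint64_t out_addr, out_stride;    /* output array base address and stride  */
    };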

New ISA:
  lcacc-req t
  lcacc-rsrv t, e
  lcacc-cmd id, f, addr
  lcacc-free id

[Figure: step 5 is the task description written to memory and the lcacc-cmd sent to the accelerator]

Page 17: Executive Summary

Overall Communication Scheme in ARC

6. The accelerator reads the task description and begins working.
  • Reads/writes from/to memory are overlapped with compute
  • The core is interrupted on a TLB miss

New ISA:
  lcacc-req t
  lcacc-rsrv t, e
  lcacc-cmd id, f, addr
  lcacc-free id

[Figure: step 6 is the LCA reading the task description from memory]

Page 18: Executive Summary

Overall Communication Scheme in ARC

7. When the accelerator finishes its current task, it notifies the core.

New ISA:
  lcacc-req t
  lcacc-rsrv t, e
  lcacc-cmd id, f, addr
  lcacc-free id

[Figure: step 7 is the LCA's completion notification to the core]

Page 19: Executive Summary

Overall Communication Scheme in ARC

8. The core then sends a message to the GAM freeing the accelerator (lcacc-free).

New ISA:
  lcacc-req t
  lcacc-rsrv t, e
  lcacc-cmd id, f, addr
  lcacc-free id

[Figure: step 8 is the core's lcacc-free message to the GAM]
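
The sketch below walks the eight steps from the core's point of view. The lcacc_* functions stand in for the new ISA instructions named on these slides; their C signatures, return conventions, and the light-weight-interrupt wait are assumptions for illustration only.

    #include <stdint.h>

    /* Stand-ins for the new ISA instructions. Operand letters follow the slide
     * mnemonics (t = accelerator type, e = reservation estimate, id = accelerator
     * ID, f = function, addr = task-description address); returning an id from
     * lcacc_rsrv and an error code from lcacc_req are assumptions. */
    extern int  lcacc_req (uint32_t t);
    extern int  lcacc_rsrv(uint32_t t, uint32_t e);
    extern void lcacc_cmd (uint32_t id, uint32_t f, const void *addr);
    extern void lcacc_free(uint32_t id);
    extern int  wait_for_lw_interrupt(void);   /* light-weight interrupt, later slides */

    #define LCACC_TASK_DONE 1                  /* assumed completion code */

    int run_on_lca(uint32_t type, uint32_t func, const void *task_desc)
    {
        if (lcacc_req(type) < 0)               /* steps 1-2: request; GAM replies with  */
            return -1;                         /* a "list + waiting time" or a NACK     */

        int id = lcacc_rsrv(type, 0);          /* steps 3-4: reserve and wait; GAM ACKs */
        if (id < 0)                            /* and forwards our core ID to the LCA   */
            return -1;

        lcacc_cmd(id, func, task_desc);        /* steps 5-6: share the task description */
                                               /* through memory and start the LCA      */

        while (wait_for_lw_interrupt() != LCACC_TASK_DONE)
            ;                                  /* step 7: LCA notifies the core when    */
                                               /* done (TLB-miss interrupts arrive here)*/

        lcacc_free(id);                        /* step 8: free the accelerator via GAM  */
        return 0;
    }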

Page 20: Executive Summary

Accelerator Chaining and Composition

♦ Chaining
  Efficient accelerator-to-accelerator communication

♦ Composition
  Constructing virtual accelerators

[Figure: chaining – two accelerators, each with its own scratchpad and DMA controller, communicate directly; composition/virtualization – M-point 1-D FFT blocks are composed into an N-point 2-D FFT and a 3-D FFT]

Page 21: Executive Summary

Accelerator Virtualization

♦ Application programmer or compilation framework selects high-level functionality
♦ Implementation via
  Monolithic accelerator
  Distributed accelerators composed into a virtual accelerator
  Software decomposition libraries

♦ Example: implementing a 4x4 2-D FFT using two 4-point 1-D FFTs

Page 22: Executive Summary

Accelerator Virtualization

♦ Application programmer or compilation framework selects high-level functionality
♦ Implementation via
  Monolithic accelerator
  Distributed accelerators composed into a virtual accelerator
  Software decomposition libraries

♦ Example: implementing a 4x4 2-D FFT using two 4-point 1-D FFTs
  Step 1: 1-D FFT on rows 1 and 2

Page 23: Executive Summary

Accelerator Virtualization

♦ Application programmer or compilation framework selects high-level functionality
♦ Implementation via
  Monolithic accelerator
  Distributed accelerators composed into a virtual accelerator
  Software decomposition libraries

♦ Example: implementing a 4x4 2-D FFT using two 4-point 1-D FFTs
  Step 2: 1-D FFT on rows 3 and 4

Page 24: Executive Summary

Accelerator Virtualization

♦ Application programmer or compilation framework selects high-level functionality
♦ Implementation via
  Monolithic accelerator
  Distributed accelerators composed into a virtual accelerator
  Software decomposition libraries

♦ Example: implementing a 4x4 2-D FFT using two 4-point 1-D FFTs
  Step 3: 1-D FFT on columns 1 and 2

Page 25: Executive Summary

Accelerator Virtualization

♦ Application programmer or compilation framework selects high-level functionality
♦ Implementation via
  Monolithic accelerator
  Distributed accelerators composed into a virtual accelerator
  Software decomposition libraries

♦ Example: implementing a 4x4 2-D FFT using two 4-point 1-D FFTs
  Step 4: 1-D FFT on columns 3 and 4
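
A software-decomposition sketch of the example above: a 4x4 2-D FFT built from 4-point 1-D FFTs applied to rows and then to columns, in the same order as the four steps. The direct O(N^2) DFT is used purely for clarity; the real 1-D FFT LCAs would implement the transform in hardware.

    #include <complex.h>
    #include <math.h>

    #define N 4
    static const double PI = 3.14159265358979323846;

    /* One 4-point 1-D FFT (naive DFT), in place, over elements x[0], x[stride], ... */
    static void fft4(double complex *x, int stride)
    {
        double complex y[N];
        for (int k = 0; k < N; k++) {
            y[k] = 0;
            for (int n = 0; n < N; n++)
                y[k] += x[n * stride] * cexp(-2.0 * I * PI * k * n / N);
        }
        for (int k = 0; k < N; k++)
            x[k * stride] = y[k];
    }

    void fft2d_4x4(double complex a[N][N])
    {
        for (int r = 0; r < N; r++)   /* steps 1-2: 1-D FFTs on rows 1-2, then 3-4     */
            fft4(&a[r][0], 1);
        for (int c = 0; c < N; c++)   /* steps 3-4: 1-D FFTs on columns 1-2, then 3-4  */
            fft4(&a[0][c], N);
    }

On the ARC platform, the two row (or column) transforms of each step would be issued to the two physical 4-point 1-D FFT accelerators rather than run in software.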

Page 26: Executive Summary

Light-Weight Interrupt Support

[Figure: interrupt messages flow between the GAM, the LCAs, and the CPU]

Page 27: Executive Summary

Light-Weight Interrupt Support

Request/reserve confirmations and NACKs are sent by the GAM.

[Figure: GAM-to-core interrupt messages]

Page 28: Executive Summary

Light-Weight Interrupt Support

LCA-to-core interrupts: TLB miss, task done.

[Figure: LCA-to-core interrupt messages]

Page 29: Executive Summary

Light-Weight Interrupt Support

The core sends logical addresses to the LCA.
The LCA keeps a small TLB for the addresses it is working on.

[Figure: LCA-to-core interrupts: TLB miss, task done]

Page 30: Executive Summary

Light-Weight Interrupt Support

The core sends logical addresses to the LCA.
The LCA keeps a small TLB for the addresses it is working on.

Why logical addresses?
  1. Accelerators can work on irregular addresses (e.g. indirect addressing)
  2. Using a large page size could be a solution, but would affect other applications

Page 31: Executive Summary

Light-Weight Interrupt Support

It is expensive to handle the interrupts via the OS.

Latency to switch to the ISR and back (# cycles):

Operation   1 core   2 cores   4 cores   8 cores   16 cores
Interrupt   16 K     20 K      24 K      27 K      29 K

Page 32: Executive Summary

Light-Weight Interrupt Support

Extending the core with light-weight interrupt (LWI) support.

[Figure: each core is extended with an LWI unit that receives interrupt packets from the GAM and the LCAs]

Page 33: Executive Summary

Light-Weight Interrupt Support

Extending the core with light-weight interrupt (LWI) support.

Two main components are added:
  A table to store ISR info
  An interrupt controller to queue and prioritize incoming interrupt packets

Each thread registers the address of the ISR, its arguments, and the lw-int source.

Limitations:
  Can only be used when the thread that the LW interrupt belongs to is running
  Otherwise the interrupt is handled by the OS
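
A minimal sketch of the per-thread LWI table described above. The names, table size, and dispatch function are assumptions; the slides only state that each thread registers an ISR address, its arguments, and the lw-interrupt source, and that a controller queues and prioritizes incoming interrupt packets.

    #include <stdint.h>

    typedef void (*lwi_isr_t)(uint32_t source, void *arg);

    struct lwi_entry {
        uint32_t  source;     /* which LCA/GAM this entry listens to          */
        lwi_isr_t isr;        /* user-level handler, runs without an OS trap  */
        void     *arg;        /* argument passed back to the handler          */
    };

    #define LWI_TABLE_SIZE 8
    static struct lwi_entry lwi_table[LWI_TABLE_SIZE];

    int lwi_register(uint32_t source, lwi_isr_t isr, void *arg)
    {
        for (int i = 0; i < LWI_TABLE_SIZE; i++) {
            if (lwi_table[i].isr == 0) {
                lwi_table[i] = (struct lwi_entry){ source, isr, arg };
                return 0;
            }
        }
        return -1;            /* table full: fall back to an OS-handled interrupt */
    }

    /* Called by the (hypothetical) interrupt controller for each queued packet.
     * Packets whose source is not registered by the running thread would be
     * escalated to the OS, matching the limitation stated on the slide. */
    void lwi_dispatch(uint32_t source)
    {
        for (int i = 0; i < LWI_TABLE_SIZE; i++)
            if (lwi_table[i].isr && lwi_table[i].source == source)
                lwi_table[i].isr(source, lwi_table[i].arg);
    }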

Page 34: Executive Summary

Programming Interface to ARC

  Platform creation
  Application mapping & development

Page 35: Executive Summary

Evaluation Methodology

♦ Benchmarks
  Medical imaging
  Vision & navigation

Page 36: Executive Summary

Application Domain: Medical Image Processing

Pipeline stages: denoising, registration, segmentation, analysis, and reconstruction.

Representative methods:
  Denoising – total variational algorithm
  Registration – fluid registration (Navier-Stokes equations)
  Segmentation – level set methods
  Reconstruction – compressive sensing: medical images exhibit sparsity and can be sampled at a rate below the classical Shannon-Nyquist rate

[Slide shows the governing equations for each method]

Page 37: Executive Summary

Area Overhead

♦ AutoESL (from Xilinx) for C-to-RTL synthesis
♦ Synopsys for ASIC synthesis
  32 nm Synopsys Educational library
♦ CACTI for L2
♦ Orion for NoC
♦ One UltraSPARC IIIi core (area scaled to 32 nm)
  178.5 mm^2 in 0.13 um (http://en.wikipedia.org/wiki/UltraSPARC_III)

                          Core    NoC    L2      Deblur   Denoise   Segmentation   Registration   SPM banks
Number of instances/size  1       1      8 MB    1        1         1              1              39 x 2KB
Area (mm^2)               10.8    0.3    39.8    0.03     0.01      0.06           0.01           0.02
Percentage (%)            18.0    0.5    66.2    3.4      0.8       6.5            1.2            2.4

Total ARC: 14.3%

Page 38: Executive Summary

Experimental Results – Performance (N cores, N threads, N accelerators)

  Performance improvement over OS-based approaches: on average 51x, up to 292x
  Performance improvement over SW-only approaches: on average 168x, up to 380x

[Charts: speedup over SW-only and over OS-based versions (gain in X) for Registration, Deblur, Denoise, and Segmentation across configurations N = 1, 2, 4, 8, 16]

Page 39: Executive Summary

Experimental Results – Energy (N cores, N threads, N accelerators)

  Energy improvement over OS-based approaches: on average 17x, up to 63x
  Energy improvement over SW-only approaches: on average 241x, up to 641x

[Charts: energy gain over SW-only and over OS-based versions (gain in X) for Registration, Deblur, Denoise, and Segmentation across configurations N = 1, 2, 4, 8, 16]

Page 40: Executive Summary

What are the Problems with ARC?

♦ Dedicated accelerators are inflexible
  An LCA may be useless for new algorithms or new domains
  Often under-utilized
  LCAs contain many replicated structures
  • Things like fp-ALUs, DMA engines, SPM
  • Unused when the accelerator is unused

♦ We want flexibility and better resource utilization
  Solution: CHARM

♦ Private SPM is wasteful
  Solution: BiN

Page 41: Executive Summary

A Composable Heterogeneous Accelerator-Rich Microprocessor (CHARM) [ISLPED'12]

♦ Motivation
  Great deal of data parallelism
  • Tasks performed by accelerators tend to have a great deal of data parallelism
  Variety of LCAs with possible overlap
  • Utilization of any particular LCA is somewhat sporadic
  It is expensive to have both:
  • Sufficient diversity of LCAs to handle the various applications
  • Sufficient quantity of a particular LCA to handle the parallelism
  Overlap in functionality
  • LCAs can be built using a limited number of smaller, more general blocks: accelerator building blocks (ABBs)

♦ Idea
  Flexible accelerator building blocks (ABBs) that can be composed into accelerators

♦ Leverage economies of scale

Page 42: Executive Summary

Microarchitecture of CHARM

♦ ABB
  Accelerator building blocks (ABBs): primitive components that can be composed into accelerators

♦ ABB islands
  Multiple ABBs
  Shared DMA controller, SPM, and NoC interface

♦ ABC
  Accelerator Block Composer (ABC)
  • Orchestrates the data flow between ABBs to create a virtual accelerator
  • Arbitrates requests from cores

♦ Other components
  Cores, L2 banks, memory controllers

Page 43: Executive Summary

An Example of ABB Library (for Medical Imaging)

[Figure: the ABB library table and the internals of the Poly ABB]

Page 44: Executive Summary

Example of ABB Flow-Graph (Denoise)

[Figure: data-flow graph of the Denoise kernel]

Page 45: Executive Summary

Example of ABB Flow-Graph (Denoise)

[Figure: the Denoise data-flow graph: six subtract/multiply pairs feed an adder tree, followed by a square root (sqrt) and a reciprocal (1/x)]

Page 46: Executive Summary

Example of ABB Flow-Graph (Denoise)

The flow-graph is covered by four ABBs:
  ABB1: Poly
  ABB2: Poly
  ABB3: Sqrt
  ABB4: Inv

[Figure: the subtract/multiply pairs and adder tree mapped onto the two Poly ABBs, the square root onto the Sqrt ABB, and the reciprocal onto the Inv ABB]

Page 47: Executive Summary

Example of ABB Flow-Graph (Denoise)

The flow-graph is covered by four ABBs:
  ABB1: Poly
  ABB2: Poly
  ABB3: Sqrt
  ABB4: Inv

[Figure: the same mapping, showing the ABB boundaries on the flow-graph]
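
One plausible reading of the Denoise flow-graph above, written as software: the six subtract/multiply pairs and the adder tree form a sum of squared neighbor differences, split across the two Poly ABBs, followed by the Sqrt and Inv ABBs. The slides do not give the actual Denoise kernel, so treat this only as an illustration of the ABB decomposition.

    #include <math.h>

    double denoise_weight(const double nbr[6], double center)
    {
        /* ABB1: Poly - first three squared differences */
        double s1 = 0.0;
        for (int i = 0; i < 3; i++)
            s1 += (center - nbr[i]) * (center - nbr[i]);

        /* ABB2: Poly - remaining three squared differences, plus the final add */
        double s2 = 0.0;
        for (int i = 3; i < 6; i++)
            s2 += (center - nbr[i]) * (center - nbr[i]);
        double sum = s1 + s2;

        /* ABB3: Sqrt, then ABB4: Inv (1/x) */
        return 1.0 / sqrt(sum);
    }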

Page 48: Executive Summary

LCA Composition Process

[Figure: four ABB islands around the ABC and a core; island 1 holds ABBs x and y, island 2 holds x and w, island 3 holds z and w, island 4 holds y and z]

Page 49: Executive Summary

LCA Composition Process

1. Core initiation
  The core sends the task description (the task flow-graph of the desired LCA) to the ABC, together with the polyhedral space for input and output.

[Figure: the core sends a task description, a flow-graph over ABBs x, y, and z with a 10x10 input and output, to the ABC sitting among the four ABB islands]

Page 50: Executive Summary

LCA Composition Process

2. Task-flow parsing and task-list creation
  The ABC parses the task-flow graph, breaks the request into a set of tasks with smaller data sizes, and fills the task list.
  Needed ABBs: "x", "y", "z"
  With a task size of one 5x5 block, the ABC internally generates 4 tasks for the 10x10 request.

[Figure: the ABC among the four ABB islands, holding the task list it generated]
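
A sketch of the tiling step described above: a 10x10 input/output region broken into 5x5-block tasks, which is how the ABC arrives at 4 tasks for this request. The task structure and the bound check are illustrative assumptions.

    struct abc_task {
        int row0, col0;        /* top-left corner of the block */
        int rows, cols;        /* block size                   */
    };

    int build_task_list(int total_rows, int total_cols, int blk,
                        struct abc_task *list, int max_tasks)
    {
        int n = 0;
        for (int r = 0; r < total_rows; r += blk)
            for (int c = 0; c < total_cols; c += blk) {
                if (n == max_tasks)
                    return -1;
                list[n++] = (struct abc_task){ r, c, blk, blk };
            }
        return n;              /* 10x10 with blk = 5 yields 4 tasks */
    }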

Page 51: Executive Summary

LCA Composition Process

3. Dynamic ABB mapping
  The ABC uses a pattern-matching algorithm to assign ABBs to islands.
  It fills the composed LCA table and the resource allocation table.

Resource allocation table (before mapping):

Island ID   ABB Type   ABB ID   Status
1           x          1        Free
1           y          1        Free
2           x          1        Free
2           w          1        Free
3           z          1        Free
3           w          1        Free
4           y          1        Free
4           z          1        Free
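
A minimal sketch of the resource-table lookup behind step 3, using the table above. The matching loop simply claims one free ABB per needed type; it is only the simplest possible stand-in for the ABC's pattern-matching algorithm.

    enum abb_status { FREE, BUSY };

    struct rt_entry {
        int  island_id;
        char abb_type;             /* 'x', 'y', 'z', 'w' in the example */
        int  abb_id;
        enum abb_status status;
    };

    static struct rt_entry resource_table[] = {
        {1,'x',1,FREE}, {1,'y',1,FREE}, {2,'x',1,FREE}, {2,'w',1,FREE},
        {3,'z',1,FREE}, {3,'w',1,FREE}, {4,'y',1,FREE}, {4,'z',1,FREE},
    };
    #define RT_SIZE (sizeof resource_table / sizeof resource_table[0])

    /* Try to allocate one free ABB of each needed type; returns 1 on success. */
    int compose_lca(const char *needed_types, int n_needed, int *out_islands)
    {
        for (int i = 0; i < n_needed; i++) {
            int found = 0;
            for (unsigned j = 0; j < RT_SIZE && !found; j++) {
                if (resource_table[j].status == FREE &&
                    resource_table[j].abb_type == needed_types[i]) {
                    resource_table[j].status = BUSY;
                    out_islands[i] = resource_table[j].island_id;
                    found = 1;
                }
            }
            if (!found)
                return 0;      /* not enough free ABBs to compose this LCA */
        }
        return 1;
    }

With this first-fit order, composing the needed types "xyz" marks the x and y on island 1 and the z on island 3 busy, which is the state shown in the next table.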

Page 52: Executive Summary

LCA Composition Process

3. Dynamic ABB mapping (continued)
  The ABC uses a pattern-matching algorithm to assign ABBs to islands.
  It fills the composed LCA table and the resource allocation table.

Resource allocation table (after mapping the first LCA):

Island ID   ABB Type   ABB ID   Status
1           x          1        Busy
1           y          1        Busy
2           x          1        Free
2           w          1        Free
3           z          1        Busy
3           w          1        Free
4           y          1        Free
4           z          1        Free

Page 53: Executive Summary

LCA Composition Process

4. LCA cloning
  Repeat to generate more LCAs if ABBs are available.

Resource allocation table (after cloning a second LCA):

Island ID   ABB Type   ABB ID   Status
1           x          1        Busy
1           y          1        Busy
2           x          1        Busy
2           w          1        Free
3           z          1        Busy
3           w          1        Free
4           y          1        Busy
4           z          1        Busy

Page 54: Executive Summary

LCA Composition Process

5. ABBs finishing a task
  When an ABB finishes, it signals the ABC ("DONE"). If the ABC has another task, it sends it; otherwise it frees the ABB.

Resource allocation table:

Island ID   ABB Type   ABB ID   Status
1           x          1        Busy
1           y          1        Busy
2           x          1        Busy
2           w          1        Free
3           z          1        Busy
3           w          1        Free
4           y          1        Busy
4           z          1        Busy

Page 55: Executive Summary

LCA Composition Process

5. ABBs being freed
  When an ABB finishes, it signals the ABC. If the ABC has another task, it sends it; otherwise it frees the ABB.

Resource allocation table (the second LCA's ABBs freed):

Island ID   ABB Type   ABB ID   Status
1           x          1        Busy
1           y          1        Busy
2           x          1        Free
2           w          1        Free
3           z          1        Busy
3           w          1        Free
4           y          1        Free
4           z          1        Free

Page 56: Executive Summary

LCA Composition Process

6. Core notified of end of task
  When the LCA finishes, the ABC signals the core ("DONE").

Resource allocation table (all ABBs freed):

Island ID   ABB Type   ABB ID   Status
1           x          1        Free
1           y          1        Free
2           x          1        Free
2           w          1        Free
3           z          1        Free
3           w          1        Free
4           y          1        Free
4           z          1        Free

Page 57: Executive Summary

ABC Internal Design

♦ ABC sub-components
  Resource Table (RT): keeps track of available/used ABBs
  Composed LCA Table (CLT): eliminates the need to re-compose LCAs
  Task List (TL): queues the LCA requests broken into smaller data sizes
  TLB: services and shares the translation requests issued by ABBs
  Task Flow-Graph Interpreter (TFGI): breaks the LCA DFG into ABBs
  LCA Composer (LC): composes the LCA using available ABBs

♦ Implementation
  RT, CLT, TL, and TLB are implemented using RAM
  The TFGI has a table of ABB types and an FSM that reads the task-flow graph and compares it against them
  The LC has an FSM that walks the CLT and RT and marks the available ABBs

[Figure: the Accelerator Block Composer containing the Resource Table, Composed LCA Table, TLB, Task List, DFG Interpreter, and LCA Composer; inputs from cores and from ABB "done" signals, outputs to ABBs for allocation and TLB service]
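
A structural sketch of the ABC sub-components listed above, as a simulator might model them. The member types are assumptions; the slide only names the six blocks, says RT, CLT, TL, and TLB are RAM-backed, and describes TFGI and LC as FSMs.

    struct rt_entry;         /* Resource Table rows (see the earlier sketch) */
    struct clt_entry;        /* Composed LCA Table rows                      */
    struct abc_task;         /* Task List entries (see the tiling sketch)    */
    struct tlb_entry;        /* shared translations for the ABBs             */

    struct abc {
        struct rt_entry  *resource_table;      /* RT: available/used ABBs               */
        struct clt_entry *composed_lca_table;  /* CLT: already-composed, reusable LCAs  */
        struct abc_task  *task_list;           /* TL: requests broken into smaller tasks*/
        struct tlb_entry *tlb;                 /* TLB: serviced on behalf of the ABBs   */
        /* TFGI and LC are hardware FSMs; here they appear only as the entry
         * points a software model might expose. */
        int (*interpret_flow_graph)(struct abc *self, const void *task_flow_graph);
        int (*compose_lca)(struct abc *self, const char *needed_types, int n);
    };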

Page 58: Executive Summary

CHARM Software Infrastructure

♦ ABB type extraction
  Input: compute-intensive kernels from different applications
  Output: ABB super-patterns
  Currently semi-automatic

♦ ABB template mapping
  Input: kernels + ABB types
  Output: kernels covered as an ABB flow-graph

♦ CHARM uProgram generation
  Input: ABB flow-graph
  Output:

Page 59: Executive Summary

Evaluation Methodology

♦ Simics + GEMS based simulation
♦ AutoPilot/Xilinx + Synopsys for ABB/ABC/DMA-C synthesis
♦ CACTI for memory synthesis (SPM)
♦ Automatic flow to generate the CHARM software and simulation modules
♦ Case studies
  Physical LCA sharing with a Global Accelerator Manager (LCA+GAM)
  Physical LCA sharing with the ABC (LCA+ABC)
  ABB composition and sharing with the ABC (ABB+ABC)
♦ Medical imaging benchmarks
  Denoise, Deblur, Segmentation, and Registration

Page 60: Executive Summary

Area Overhead Analysis

♦ Area-equivalent
  The total area consumed by the ABBs equals the total area of all LCAs required to run a single instance of each benchmark

♦ Total CHARM area is 14% of the 1 cm x 1 cm chip
  A bit less than the LCA-based design

Page 61: Executive Summary

Results: Improvement Over LCA-Based Design

♦ "Nx" has N times the area-equivalent accelerators

♦ Performance
  2.5X vs. LCA+GAM (max 5X)
  1.4X vs. LCA+ABC (max 2.6X)

♦ Energy
  1.9X vs. LCA+GAM (max 3.4X)
  1.3X vs. LCA+ABC (max 2.2X)

♦ ABB+ABC gives better energy and performance
  The ABC starts composing ABBs to create new LCAs
  Creates more parallelism

[Charts: normalized performance and normalized energy of LCA+GAM, LCA+ABC, and ABB+ABC for Deblur, Denoise, Registration, and Segmentation at 1x, 2x, 4x, and 8x area-equivalent accelerators]

Page 62: Executive Summary

Results: Platform Flexibility

♦ Two applications from two domains unrelated to medical imaging (MI)
  Computer vision
  • Log-Polar Coordinate Image Patches (LPCIP)
  Navigation
  • Extended Kalman Filter-based Simultaneous Localization and Mapping (EKF-SLAM)

♦ Only one ABB is added
  Indexed Vector Load

MAX benefit over LCA+GAM   3.64X
AVG benefit over LCA+GAM   2.46X
MAX benefit over LCA+ABC   3.04X
AVG benefit over LCA+ABC   2.05X

Page 63: Executive Summary

CAMEL

♦ What are the problems with CHARM?
  What if new algorithms introduce new ABBs?
  What if we want to use this architecture on multiple domains?

♦ Adding programmable fabric to increase
  Longevity
  Domain span

Page 64: Executive Summary

CAMEL

♦ What are the problems with CHARM?
  What if new algorithms introduce new ABBs?
  What if we want to use this architecture on multiple domains?

♦ Adding programmable fabric to increase
  Longevity
  Domain span

♦ The ABC is now responsible for allocating the programmable fabric

Page 65: Executive Summary


CAMEL Microarchitecture

Page 66: Executive Summary


CAMEL Programmable Fabric Block

Page 67: Executive Summary

Programmable Fabric Allocation by ABC

♦ ABB replication in order to rate-match
♦ ABB assignment on the paths that have larger slack, in order to decrease the negative effect
♦ Interval-based allocation to increase the performance across multiple requests
♦ When to use the programmable fabric?
  The ABB does not exist in the system
  The ABB is being used by other accelerators
  ● Beneficial if it can be rate-matched
  ● Potential slow-down otherwise

Page 68: Executive Summary

FPGA Allocation by ABC – Block Diagram

♦ Greedy approach
  Services a single application

♦ Interval-based approach
  Wait for some fixed interval
  Use the greedy approach to find the best allocation satisfying multiple requests

[Flowchart: from the LCA task flow-graph and the list of available ABBs, find the ABBs needed on the programmable fabric; if a minimum-area mapping is not feasible, return FALSE; otherwise set the maximum rate and, while the mapping does not fit the fabric, decrease the rate, returning FALSE at the lowest rate; once it fits, program the fabric and return TRUE]
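
A sketch of the greedy path through the flowchart above: find the ABBs that must go to the programmable fabric, check that a minimum-area mapping is feasible, then start from the maximum rate and lower it until the mapping fits, programming the fabric on success. Function and field names are assumptions for illustration.

    #include <stdbool.h>

    struct pf_request {
        int needed_on_pf;      /* # of ABB types that must go to the fabric     */
        int pf_area;           /* area available on the programmable fabric     */
        int max_rate;          /* rate matching the hard ABBs in the flow graph */
    };

    /* Hypothetical helpers: area of the mapping when ABBs are replicated to
     * sustain `rate` (assumed to grow with the rate), and the configuration step. */
    extern int  pf_area_needed(const struct pf_request *req, int rate);
    extern void program_fabric(const struct pf_request *req, int rate);

    bool allocate_on_pf(const struct pf_request *req)
    {
        if (req->needed_on_pf == 0)
            return true;                        /* nothing missing: no fabric needed */
        if (pf_area_needed(req, 1) > req->pf_area)
            return false;                       /* min-area mapping not feasible     */

        for (int rate = req->max_rate; rate >= 1; rate--) {
            if (pf_area_needed(req, rate) <= req->pf_area) {
                program_fabric(req, rate);      /* fits: program the fabric          */
                return true;
            }
            /* otherwise decrease the rate and retry */
        }
        return false;                           /* lowest rate reached without a fit */
    }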

Page 69: Executive Summary


CAMEL Results – All Domains

Page 70: Executive Summary

CAMEL Results – Speedup

Page 71: Executive Summary


CAMEL Results – Energy improvement