ASPDAC 2008 Tutorial: System-Level Synthesis -- Functions, Architectures, and Communications

Alberto Sangiovanni-Vincentelli & Douglas Densmore, UC Berkeley, [email protected]

Jason Cong, UCLA, [email protected]

Radu Marculescu, CMU, [email protected]

Outline

Foundation of platform-based design and Metropolis framework (Alberto/Douglas) 9-10:30
Challenges in System Level Design
Platform-based Design as a unifying methodology
A framework for PBD:
• Theoretical foundations (heterogeneous systems, metamodeling, abstract semantics)
• Metropolis II: integration platform architecture
• Application to embedded system design: cars and building networked embedded systems

Synthesis for functionality (Jason) 11am – 12:30pm
Synthesis for customized logic
Use of application-specific processors and processor networks

Synthesis for communication (Radu) 2 – 3:30pm
Networks-on-Chip design space
Optimization for performance, energy, and fault-tolerance

Real-life examples (all) 4 – 5pm
Q/A Discussions

Copyright: A. Sangiovanni-Vincentelli

System-Level Design: FROM Art TO Science

Alberto Sangiovanni-Vincentelli

The Edgar L. and Harold H. Buttner Chair of EECSUniversity of California at Berkeley

4

Outline

Challenges

The Movement towards Design SciencePlatform-based Design as a Unifying Approach

Metropolis

The GSRC Agenda

5

Challenge: The Bifurcation of the Market

The Core:
Performance is premium
Power and cost constrained
Relatively long life-time
Expensive jewelry

The Expanding Periphery:
Cost and size are premium
Integration and power are key
“Just enough performance”
Electronics like a fashion statement

Challenge: The Physical Internet

[Figure: log(people per computer) versus year. Labels: Number Crunching, Data Storage; Productivity, Interactive; Mainframe, Minicomputer, Workstation, PC, Laptop, PDA, Cellular phone; Ubiquitous Sensor Networks; Streaming information to and from physical world]

Challenge: The Waning Days of Moore’s Law

Limitations imposed by physics
Limitations imposed by economics
may ultimately end its long run (20 nm – 13 nm – 8 nm – …?) or may not …

Challenge: Parallel Architectures

Scaling enabled integration of complex systems with hundreds of millions of devices on a single die

Intel Merom dual core, ISSCC 07, 290M trans.
SUN Niagara-2, ISSCC 07, 500M trans.
IBM/Sony Cell, ISSCC 05, 235M trans.

Challenge: Design Chain Integration (Automotive Industry)

Automakers
• 2005 Revenue $1.1T
• CAGR 2.8% (2004-2010)

Tier 1 Suppliers (90%+ of revenue from automotive)
• 2004 Revenue ~$200B
• CAGR 5.4% (2004-2010)

IC Vendors (~15% of revenue from automotive)
• 2005 revenue $17.4B
• CAGR 10% (2004-2010)

Source: Public financials, Gartner 2005

Challenge: Platforms and Software Content

Supplied by ST

2000STAPI

1998Specs

System-Above-Chip (Boards, Chips, & Software)
NO value in customer owning/writing drivers. (TMM, E*, HNS)
Customer added value is application, Conditional Access, Brand Name
ST supplies the complete base system BELOW MIDDLEWARE to save time to market

2003 &Beyond

11

Concurrency and Heterogeneity

[Figure: Intel Montecito]

[Figure (Source: Bosch): automotive electronics architecture. Labels: Information Systems, Telematics, Body Electronics, Body Functions, Driving and Vehicle Dynamic Functions; Fault Tolerant, Fail Safe, Fault Functional; Mobile Communications, Navigation, Access to WWW, DAB, FireWall, GateWay, Theft warning, Door Module, Light Module, Air Conditioning, Shift by Wire, Engine Management, ABS, Steer by Wire, Brake by Wire; buses: MOST, Firewire, CAN, Lin, TTCAN, FlexRay]

Today, more than 80 microprocessors and millions of lines of code

HVAC: High Performance Buildings

13

Challenge summary

Yesterday: Features (can you do it?)
Today: Cost (are you cheaper?)
Tomorrow: Integration (but can you also…?)

Industry will move towards robust architectures which can:
mix-and-match components from different vendors
avoid costly system-level simulations
create a system by just interconnecting modules

NXP Semiconductors, René Penning de Vries, May 3 - 2007, IEF Athens

Plug and Pray!

Integration: Plug and Play?

15

The Design Integration Nightmare

P. Picasso, Blue Period

Specification:

P. Picasso “Femme se coiffant”1940

Implementation:

16

Common Features

[Figure: heterogeneous system. Protocol stack: Application, Transport Layer, Network Layer, MAC Layer, Link Layer, Physical Layer; models of computation: Discrete Event, Process Networks, CSP, Continuous Time; other labels: Pre-Post, Low pass, Manager, Tables and Parameters, User]

• Systems are assembled out of heterogeneous components

• Systems are distributed

• Interactions difficult to define

17

The Intellectual Agenda

To create a modern computational systems science and systems design practice with

Concurrency
Composability
Time
Hierarchy
Heterogeneity
Resource constraints
Verifiability
Understandability

18

Outline

Challenges


Metropolis

The GSRC Agenda

19

Opportunity: System Design Chain

Interfaces

Fabrics

Manufacturing

Implementation

System Design

IP

Design Science

Design Process Transformation in Chip Design

Embedded System Design Gaps

[Figure: design flow from Textual Design Specification through Functional High Level HW MODELS, Cycle Accurate RTL and Timing Accurate Gate-Level Netlist, alongside C/C++ SW CODE and RTOS CODE. Steps: HW/SW Partitioning, Estimation, Translation, Synthesis, Drive CODE, Test Vectors, Simulate RTL, Verify Gates, Validate. Several steps are marked MANUAL and NO FORMAL SEMANTICS]

Source: Greg Spirakis, Intel

Early Platform Architecture: Philips Nexperia

[Figure (Source: Philips): Nexperia platform, split into Hardware and Software. Software: Applications; Middleware (JavaTV, TVPAK, OpenTV, MHP/Java, proprietary ...); Streaming and Platform Software; Kernel: pSOS, VxWorks, Win-CE. Nexperia Hardware (DVP SYSTEM SILICON): TriMedia CPU (TM-xxxx, D$, I$) and MIPS CPU (PRxxxx, D$, I$), DEVICE IP BLOCKs on PI BUSes, MMI, SDRAM, DVP MEMORY BUS. TriMedia™, MIPS™]

Platform-types

Designing Platforms: the IC Company View

Xilinx “Highly-Programmable Platform (Virtex-II Pro)”
IBM PowerPC 7/00
Mindspeed SkyRail gigabit serial I/O 9/00
RocketChips mixed-signal IP acquisition 10/00
Wind River O/S 3/01
Virtex-II Pro production 3/02

Application Space

e

Ideal Architectural Platform

23

Using Platforms: the System Company View

24

Architectural Space

Ideal Application Platform

Application Space

24

25

The Platform Concept

Meet-in-the-Middle Structured methodology that limits the space of exploration, yet achieves good results in limited time
A formal mechanism for identifying the most critical hand-off points in the design chain
A method for design re-use at all abstraction levels
An intellectual framework for the complete electronic design process!

[Figure: Application Space (Application Instance) and Architectural Space (Platform Instance) connected through the Semantic Platform by Platform Mapping and Platform Design-Space Export. Example: Texas Instruments OMAP]

The EXREAL Platform™

We provide integrated solutions based on LSI development platform, application platform and partnerships

Integrated Solution Platform: Integrated solutions including applied application (including collaboration with users)
Application Platform: Deployment to platform for each application
Flexible Scalability, High Portability, Heterogeneous Structure

Separation of Concerns (ca. 1990!)

[Figure: Development Process: Specification, Analysis, Mapping, Performance Analysis, Refinement, Evaluation of Architectural and Partitioning Alternatives, Implementation. Behavior Components (Matlab, Dymola, C-Code, IPs) and Behavior (f1, f2, f3); Virtual Architectural Components (CPUs, Buses, Operating Systems) and Platform (ECU-1, ECU-2, ECU-3, Bus)]

Platform-Based Design

Platform: library of resources defining an abstraction layer
Resources do contain virtual components, i.e., place holders that will be customized in the implementation phase to meet constraints
Very important resources are interconnections and communication protocols

[Figure: Application Space (Application Instance), Architectural Space (Platform Instance), Platform Mapping, Platform Design-Space Export]

Fractal Nature of Design

[Figure: Function Space (Function Instance), Platform (Architectural) Space (Platform Instance), Platform Design-Space Export, Mapped; the same pattern is drawn again at a second level]

An Example: Wireless Sensor Networks

30

[Figure (Source: Jan Rabaey): Functional & Performance Requirements and Constraints across three levels: Network Level (Network Architecture, Performance analysis), Radio Node Level (Node Architecture), Module Level]

FunctionFunction Space Platform

Formal Mechanism

Library Elements

Closure under constrained composition (term algebra)

Platform Instance

Formal Mechanism

32

PlatformCommon Semantic Platform

Platform InstanceAll Platform behaviors(non deterministic)

Mapping

33

Platform Instance

FunctionCommon Semantic PlatformFunction Space

Mapped Instance

Admissible Refinements

Platform-based Design for DFM

[Figure: Design Methods. Design: Digital SoC P+R, Custom Layout, Layout Optimization, Parasitic Extraction, LVS/DRC, leading to the “Golden” GDS. Mask / Mfrg “Post Design”: Sign-Off PV, Batch RET Treatment, Verification (OPC, CMP …), Yield Ramp and FA. Process effects: Litho, CMP, Etch]

PBD Abstraction Links Implementation to Manufacturing

[Figure: the same flow with Electrical Analysis (RLC) and Physical Analysis (Litho, CMP, Etch) attached to Design around the “Golden” GDS]

36

Hetero and Scalable Architecture

・ Heterogeneous and scalable architecture for various markets
・ Multi-core abstraction layer for software virtualization

[Figure (Courtesy: NEC): layers Processes/devices, Circuits, Architecture, Software, Tools. HW: CPU, DSP, dedicated HW, Hardware engine, Peripherals, Security… SW: Driver, OS, Multi-core abstraction layer, CPU M/W, DSP M/W (Middleware), Application. Multi-core, Various markets]

Design Tools: Platform-Based Design for Integrated Building Management

38

Platform-Based Design for Dynamic Networks

39

Consequences

There is no difference between HW and SW. Decision comes later.
HW/SW implementation depends on choice of component at the architecture platform level.
Function/Architecture co-design happens at all levels of abstraction.

Each platform is an “architecture” since it is a library of usable components and interconnects. It can be designed independently of a particular behavior.
Usable components can be considered as “containers”, i.e., they can support a set of behaviors.
Mapping chooses one such behavior. A Platform Instance is a mapped behavior onto a platform.
Fractal: it applies to ALL levels of design from functional all the way to DFM

40

Outline

Challenges


Metropolis

The GSRC Agenda

41

Putting it All Together….

We need an integration platform:
To deal with heterogeneity:
• Where we can deal with Hardware and Software
• Where we can mix digital and analog
• Where we can assemble internal and external IPs
• Where we can work at different levels of abstraction
• Where we can work with performance-power trade-offs

To handle the design chain

To support integration
• e.g. tool integration
• e.g. IP integration

The integration platform must subsume the traditional design flow, rather than displacing it

42

Metropolis: an Environment for System-Level Design

• Motivation

– Both design complexity and the need for verification are increasing

– Semantic link between specification and implementation is necessary

• Platform-Based Design

– Meet-in-the-middle approach

– Separation of concerns

– Function vs. architecture

– Capability vs. performance

– Computation vs. communication

• Metropolis Framework

– Extensible framework providing simulation, verification, and synthesis capabilities

– Easily extract relevant design information and interface to external tools

• Released Sept. 15th, 2004

Metropolis Guiding Principles

Unified MOC
Formal Semantics
Separation of Concerns
Mapping Function to Architecture

Metropolis: Methodologies, Tools

Fundamental Concepts

Support for different Models of Computation

Support for Architecture Specification and Analysis

Mix of imperative and declarative specification styles

Quantities of interest dictated by the designer, not the framework

Framework designed to allow interfacing with external tools

45

Meta Frameworks

Abstract semantics: Tagged Signal Semantics, Process Networks Semantics, Firing Semantics, Stateful Firing Semantics
Models of computation: Kahn process networks, dataflow, discrete events, synchronous/reactive, hybrid systems, continuous time

Metropolis provides a process networks abstract semantics and emphasizes formal description of constraints, communication refinement, and joint modeling of applications and architectures.

Metropolis Framework

Functionality: What does it do?
Architecture Platform: How is it done? At what cost?
Mapping: Binding between the two

[Figure: MetaModel language, Metamodel Compiler (Front end, Abstract syntax trees, Back end 1, Back end 2, ..., Back end n), Simulator tool, Verification tool, … tool, Metropolis Interactive Shell]

Metropolis Modeling

Functional Modeling (from more to less abstract)
• Network of processes with sequential program for each
• Unbounded FIFOs with multi-rate read and write
• Communication refined to bounded FIFOs and shared memories with finer primitives (called TTL API): allocate/release space, move data, probe space/data

Mapping (Functional Network to Arch. Network): synch(…), synch(…), …
• Associate functional and architectural models explicitly and formally
• Add declarative constraints that associate events
• Accomplished with the “synch” keyword in MMM

Metropolis Modeling

Architecture Modeling (from more to less abstract)
• Mapped to resources with coarse service APIs
• Services annotated with performance models
• Interfaces to match the TTL API
• Cycle-accurate services and performance models

[Figure: architecture components DSP, HW, DMA, RAMs, RAMd, MemF, MemS and caches ($)]

Metropolis Objects

Metropolis elements adhere to a “separation of concerns” ideology.

• Processes (Computation): Active Objects, Sequential Executing Thread
• Media (Communication): Passive Objects, Implement Interface Services
• Quantity Managers (Coordination): Schedule access to resources and quantities

[Figure: Proc1 with ports P1, P2; Media1 with interfaces I1, I2; quantity manager QM1]

Meta-Model : Functional Netlist

process P {
    port reader X;
    port writer Y;
    thread() {
        while (true) {
            ...
            z = f(X.read());
            Y.write(z);
        }
    }
}

medium M implements reader, writer {
    int storage;
    int n, space;
    void write(int z) {
        await (space > 0; this.writer; this.writer)
            n = 1; space = 0; storage = z;
    }
    word read() { ... }
}

interface reader extends Port {
    update int read();
    eval int n();
}

interface writer extends Port {
    update void write(int i);
    eval int space();
}
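The process/medium split in the metamodel listing can be mimicked in ordinary Python as a rough illustration. This is only a sketch under stated assumptions: the names `Medium`, `process_p` and `run` are hypothetical, a one-slot `queue.Queue` stands in for the `await (space > 0; ...)` guard, and the loop is bounded so the example terminates (the original process loops forever). It is not the Metropolis runtime.

```python
import queue
import threading


class Medium:
    """One-place buffer: write blocks while the slot is full, read while it is empty."""

    def __init__(self):
        self._slot = queue.Queue(maxsize=1)

    def write(self, z):
        self._slot.put(z)  # blocks until there is space, like await(space > 0)

    def read(self):
        return self._slot.get()  # blocks until a value is present


def process_p(x, y, f, count):
    """Sequential process: read from port X, compute f, write to port Y."""
    for _ in range(count):
        y.write(f(x.read()))


def run(inputs, f):
    """Environment: feeds inputs through medium -> process -> medium."""
    m_in, m_out = Medium(), Medium()
    worker = threading.Thread(target=process_p, args=(m_in, m_out, f, len(inputs)))
    worker.start()
    results = []
    for value in inputs:
        m_in.write(value)
        results.append(m_out.read())
    worker.join()
    return results


squares = run([1, 2, 3], lambda v: v * v)
```

The process only sees `read` and `write`, so the medium could be swapped for a bounded FIFO or a shared-memory model without touching the computation, which is the point of separating computation from communication.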

[Figure: netlist MyFncNetlist containing processes P1 and P2 (ports X, Y), medium M, and environments Env1, Env2]

Meta-Model: Architecture Components

An architecture component specifies services, i.e.
• what it can do: interfaces
• how much it costs: quantities, annotation, logic of constraints

medium Bus implements BusMasterService … {
    port BusArbiterService Arb;
    port MemService Mem; …
    update void busRead(String dest, int size) {
        if (dest == … ) Mem.memRead(size);
        [[ Arb.request(B(thisthread, this.busRead));
           GTime.request(B(thisthread, this.memRead),
                         BUSCLKCYCLE + GTime.A(B(thisthread, this.busRead)));
        ]]
    }
    …

scheduler BusArbiter extends Quantity implements BusArbiterService {
    update void request(event e) { … }
    update void resolve() {
        // schedule
    }
}

interface BusMasterService extends Port {
    update void busRead(String dest, int size);
    update void busWrite(String dest, int size);
}

interface BusArbiterService extends Port {
    update void request(event e);
    update void resolve();
}

[Figure: BusArbiter and Bus]

Metro. Netlists and Events

Metropolis Architectures are created via two netlists:
• Scheduled – generate events [1] for services in the scheduled netlist.
• Scheduling – allow these events access to the services and annotate events with quantities.

[Figure: Scheduled Netlist (Proc1, Proc2, ports P1, P2, Media1, interfaces I1, I2) and Scheduling Netlist (QM1, GlobalTime)]

[1] E. Lee and A. Sangiovanni-Vincentelli, A Unified Framework for Comparing Models of Computation, IEEE Trans. on Computer Aided Design of Integrated Circuits and Systems, Vol. 17, N. 12, pg. 1217-1229, December 1998

Quantity Request – Service

Ti

CpuRtos GTime

CpuRtos.cpuRead()

CS.Request(beg(Ti, this.cpuRead),csr)

ScheduledNetlist SchedulingNetlist

Task.Read() {
    CpuRtos.cpuRead();
}

CpuRtos.Read() {
    CS.Request(beg(Ti, this.cpuRead), csr);
    Bus.busRead();
    CS.Request(end(Ti, this.cpuRead), csr);
}

CS.Resolve() {
    // Task scheduling algorithm;
}

setMustDo(e)

Bus.busRead()

CpuScheduler

CS.Resolve()

Modeling & Char. Review

[Figure: Scheduled Netlist with processes Task1, Task2, Task3, Task4 and media PPC, DEDICATED HW, BRAM, PLB; Scheduling Netlist with PPC Sched, DedHW Sched, PLB Sched, BRAM Sched and GlobalTime; Characterizer. Legend: Media (scheduled), Process, Quantity Manager, Quantity, Enabled Event, Disabled Event]

Heterogeneous IP Import in Metropolis

Excessive time spent in design import
Redefining and implementing classes and methods
Memory allocation, data types, templates, etc.

Challenges in Infineon case study
802.11a on MuSIC (multiple SIMD core) architecture

[Figure: Collection of Heterogeneous IP (different design teams, different languages, different MoCs), IP rewritten in Metamodel, Metropolis Design]

Heterogeneous IP Import in Metro II

Pros
Framework easier to develop and maintain
Leverage existing compilers/debuggers
Quicker import for most IP

Cons
Framework has limited visibility

[Figure: Collection of Heterogeneous IP, Wrappers, Metro II Design; labels Phase 1, Phase 2]

Behavior-Performance Separation in Metropolis

Processes make explicit requests for annotation
Annotation/scheduling are intertwined
Iteration between multiple quantity managers

Challenges in GM case study
Vehicle stability application on distributed CAN architecture
Interactions between global time QM and resource QM difficult to debug

[Figure: processes P1, P2 and resource R with quantity managers Global Time and Resource Scheduler. Steps: 1. Explicit quantity requests, 2. Quantity Resolution, 3. Granting of requests]

Behavior-Performance Separation in Metro II

Pros
Phase 1 objects no longer explicitly request annotation
Separation of quantity managers into annotators and schedulers
• “Global time” separates into physical time (annotation) and logical time (scheduling)

Cons
Additional phase introduced into execution model

[Figure: Phase 1: processes P1, P2 and resource R; Phase 2: Physical Time; Phase 3: Logical Time and Resource Scheduler. Steps: 1. Block processes at interfaces, 2. Annotations, 3. Sched. Resolution, 4. Enable some processes]

Operational/Denotational Specification in Metropolis

Constraints break operational encapsulation
Constraints between arbitrary pairs of events
Any state in scope of event may be used in constraints

No special declarative constructs for mapping

Challenges in Intel case study
JPEG encoding on MXP5800 heterogeneous multiprocessor
Keeping track of events, values, and constraints requires separate data structure
Hard to debug local variables involved in synchronization constraints

void func() {
    int a;
    event e1;
    int b;
    event e2;
}

void arch() {
    int c;
    event e3;
    int d;
    event e4;
}

sync(e1, e2, a == c)

sync(e3, e4, b <= d)

59

Operational/Denotational Specification in Metro II

Accessible events are beg/end of interface methods
Values are either parameters or return values

Mapping allocates functional components to architectural components

Coarser granularity

60

Updated Features for Metro II

Import heterogeneous IP
Different languages
Different models of computation

Behavior-Performance Separation
No explicit requests for annotation
Annotation separated from constraint solving

Function-Architecture Separation
Explicit separate phases for function and architecture models

Operational/Denotational Separation
Restricted access to events and values
Mapping carried out at various levels

Coordination Framework
Event-oriented Framework
4-Phase Execution

4-Phase Execution

1. Function: Each function process proposes events and suspends

2. Architecture: Architecture process triggered by function process proposes events and suspends

3. Annotation: Tag proposed events with quantities (logical and physical)

4. Constraint solving: Enable/disable events by solving the constraints (denotational and imperative)
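One round of the four phases can be sketched as a single Python function. Everything here is an illustrative assumption rather than the Metro II API: events are plain dictionaries, `mapping` assigns each function process to one architecture resource, `cost` supplies a physical-time tag per service, and the only constraint solved in phase 4 is a per-resource capacity limit.

```python
def four_phase_round(function_events, mapping, cost, capacity):
    """Run one hypothetical round; returns (function events, architecture events)."""
    # Phase 1 (Function): each function process proposes an event and suspends.
    proposed = [
        {"process": proc, "service": service, "tags": {}, "state": "proposed"}
        for proc, service in function_events
    ]

    # Phase 2 (Architecture): the mapped architecture process is triggered
    # by the function event and proposes its own event.
    arch = [
        {"process": mapping[e["process"]], "service": e["service"],
         "tags": {}, "state": "proposed", "trigger": e}
        for e in proposed
    ]

    # Phase 3 (Annotation): tag proposed events with quantities.
    for e in arch:
        e["tags"]["physical_time"] = cost[e["service"]]

    # Phase 4 (Constraint solving): enable events while the resource has
    # capacity left; the remaining events wait for a later round.
    used = {}
    for e in arch:
        resource = e["process"]
        if used.get(resource, 0) < capacity[resource]:
            used[resource] = used.get(resource, 0) + 1
            e["state"] = e["trigger"]["state"] = "enabled"
        else:
            e["state"] = e["trigger"]["state"] = "waiting"
    return proposed, arch


func_events, arch_events = four_phase_round(
    function_events=[("P1", "read"), ("P2", "read")],
    mapping={"P1": "Bus", "P2": "Bus"},
    cost={"read": 4},
    capacity={"Bus": 1},
)
```

With both processes mapped to a single-capacity bus, the first request is enabled and the second is left waiting, and neither function process ever asked for an annotation itself.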

[Figure: execution loop over Function, Architecture, Annotation and Constraint Solving; label: Extended Base Model]

Events

An event is the fundamental concept in the framework

Fields:
Process: Generator of event
Value Set: Variables exposed along with event
Tag Set: Quantity annotations

E = <p, V, T>
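The tuple E = <p, V, T> maps naturally onto a small record type. The sketch below is an assumption for illustration only: the class name `Event`, the field names and the use of dictionaries for the value and tag sets are not taken from the framework.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    """E = <p, V, T>: generating process, exposed values, quantity tags."""
    process: str                                 # p: generator of the event
    values: dict = field(default_factory=dict)   # V: variables exposed with the event
    tags: dict = field(default_factory=dict)     # T: quantity annotations


# A hypothetical event from process "P1" exposing one value,
# later annotated with a physical-time tag.
e = Event(process="P1", values={"z": 7})
e.tags["physical_time"] = 12.5
```

Keeping values and tags in separate fields mirrors the separation on the slide: behavior writes values, annotation writes tags.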

63

Event States

[Figure: event state machine with states Inactive (start), Proposed, Waiting and Enabled; transitions labeled “enabled by S” and “disabled by S”]

Phases and Events

Each phase is allowed to interact with events in a limited way

Keep responsibilities separate

Phase Events Tags Values

Propose Disable Read Write Read Write

Func. Yes Yes Yes

Arch. Yes Yes Yes

Annotation Yes Yes Yes

Const. Sol. Yes Yes Yes

65

Mappers

Mapper

Func. Comp

Arch. Comp

Support mapping at various abstraction levels

• Event level
• Service level
• Interface level
• Component level

66

Service level triggering

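[Figure: a Functional Method { } in the Function Phase and an Arch Service { } in the Architecture Phase; the B and E events of the functional method are connected to the B and E events of the architecture service by Trigger arrows]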

68

Industrial Collaborations

Infineon: Software Defined Radios
Intel: Mobile Platforms, Multi-media Platforms
General Motors: Next generation car architectures
United Technologies: Elevator (OTIS), Air conditioning (CARRIER), Security (Chubb)
Xilinx: Programmable platform modeling

Outline

Challenges


Metropolis

The GSRC Agenda

Core Theme Overview and Design Flow

Jason Cong, Daniel D. Gajski, Wen-mei Hwu, Andrew Kahng, Radu Marculescu, Jaijeet Roychowdhury

Modeling & Simulation Side
1. Import Functional Model (i.e. h.264, UMTS)
2. Check Equiv. (Model Algebra)
3. Create Arch Services (i.e. Xilinx)
4. Map and Simulate

Implementation Side (UCLA/UCB/UCSD/CMU/Columbia)
ASPN Synthesis/Verification
ASIP synthesis
xPilot HW/MCsim synthesis
Communication Synthesis, NOC synthesis, and physical modeling of interconnection and logic
Analog synthesis, Desynchronization

Compare implementation results with simulation results

[Figure: design flow. Labels: UIUC, UCI, UCB; Input: C descriptions, MMMs, SystemC, C++; Analysis, Parallelism extraction, Code cleanup; Concurrent Functional Model (after architecture independent optimization); ASPN Synthesis (MetroII); SW, SOC Platform, HW; Performance/Area Estimations; Synthesizable RTL; Processor Library (ARM, PowerPC, MicroBlaze …); Transformation Rules, TLM Equivalence Checker (UCI), Equivalence Result, Design Optimization, App1 + Platform1, App2 + Platform2, TLM Gen., TLM1, TLM2, Design Decisions, Metro-II Framework (UCB), Metro-TLM Library]

[Figure: Function Space (Function Instance) mapped onto Platform Instance]

Co-simulation: MetroII functional model and ASPN architectural model; event trace and annotated event trace; MetroII simulation and MCSim simulation
Refinement: high level MetroII model, refine, ASPN model with mapped functionality; MCSim simulation (cycle accurate); abstracted performance annotation
Analog Sim

Engineering Tomorrow’s Designs: Synthetic Biology

The creation of novel biological functions and tools by modifying or integrating well-characterized biological components into higher-order systems using mathematical modeling to direct the construction towards the desired end product.

“Building life from the ground up” (Jay Keasling, UCB). Keynote presentation, World Congress on Industrial Biotechnology and Bioprocessing, March 2007.

Development of foundational technologies:
• tools for hiding information and managing complexity
• core components that can be used in combination reliably

Pioneering Synthetic Biology: Moving from ad-hoc to structured design
[Reference: Scientific American, June 2006]

Engineering Tomorrow’s Designs

Similar Considerations Hold for the Nano-Electronics and Nano-mechanics Arenas

A Disciplined Platform-Based Design Methodology ??
Exploration of scalable computational fabrics
Deriving useful abstractions and interfaces
Developing modeling and characterization environments
Automating the synthesis process
Populating the design space

Source: J. Rabaey


Core Theme Overview and Design Flow

Alberto Sangiovanni-Vincentelli, Luca Carloni, Jason Cong, Daniel D. Gajski, Wen-mei Hwu, Andrew Kahng, Radu Marculescu, Jaijeet Roychowdhury

[Figure: GSRC design flow (UIUC, UCI, UCB; Implementation Side: UCLA/UCB/UCSD/CMU/Columbia). Labels: MMMs, SystemC, C++; Analysis; ASPN Synthesis (MetroII); 1. Import Functional Model (i.e. h.264, UMTS); 4. Map and Simulate; Transformation Rules, TLM Equivalence Checker (UCI), Equivalence Result, Design Optimization, App1 + Platform1, App2 + Platform2, TLM Gen., TLM1, TLM2, Design Decisions, Metro-II Framework (UCB), Metro-TLM Library; Function Space, Function Instance, Platform Instance, Mapped; ASPN Simulation, Co-simulation, event trace, MetroII model, refine, Analog Sim]

Processor Synthesis
Processor Library: ARM, PowerPC, ASIP
• Customized coprocessor
• ASIPs

xPilot
• Process mapping

ASPN Synthesis Engine

Outline - Synthesis for customized logic

Overview of behavioral synthesis

Scheduling
Task Description
Scheduling algorithms
• ASAP/ALAP, list scheduling, force-directed scheduling, mathematical-programming-based scheduling, scheduling for low power

Binding
Task Description
Classification of binding algorithms
Binding algorithms
• Left-edge, min-cost network flow, multi-vdd binding

Architectural synthesis for multi-cycle communication (MCAS)

IC Design Steps

[Figure (© Sherwani): design representations System-Level Specification, Behavior-level Description, RT-Level Description, Generic Logic Description, Gate/Circuit Design, Placed & Routed Design; steps Logical Synthesis, Technology Mapping, Physical Design, Fabrication, Packaging. Example logic: X=(AB*CD)+(A+D)+(A(B+C)), Y = (A(B+C)+AC+D+A(BC+D))]

Behavioral Synthesis

C Program:
void f (int var) { int [] array;
…..}

VHDL/Verilog:
entity f is
port (…)
architecture behav…

Advantages of Behavioral Synthesis

Shorter verification/simulation cycle
• 100X speed up with behavior-level simulation

Better complexity management, faster time to market
• 10M gate design may require 700K lines of RTL code

Rapid system exploration
• Quick evaluation of different hardware/software boundaries
• Fast exploration of multiple micro-architecture alternatives

Higher quality of results
• Platform-based synthesis & optimization
• Full consideration of physical reality

Subtasks in High-Level Synthesis

Scheduling determines when an operation will be executed
Allocation determines the number of instances of each type of resource
Binding binds operations, variables, or data-transfers to the resources
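Binding of variables to registers can be illustrated with the left-edge algorithm, which the outline lists among the binding algorithms. The sketch below is a generic textbook-style version under stated assumptions, not the tutorial's implementation: each variable has a half-open lifetime [birth, death) in control steps, and the variable names and lifetimes are made up for the example.

```python
def left_edge(lifetimes):
    """Bind variables to registers.

    lifetimes: dict mapping variable name -> (birth, death), a half-open
    interval in control steps. Returns dict variable -> register index.
    """
    # Sort variables by the left edge (birth time) of their lifetime.
    remaining = sorted(lifetimes, key=lambda v: lifetimes[v])
    binding = {}
    register = 0
    while remaining:
        last_death = float("-inf")
        leftover = []
        # Fill the current register with non-overlapping lifetimes.
        for var in remaining:
            birth, death = lifetimes[var]
            if birth >= last_death:
                binding[var] = register
                last_death = death
            else:
                leftover.append(var)
        remaining = leftover
        register += 1
    return binding


# Hypothetical lifetimes of five variables.
regs = left_edge({"a": (1, 3), "b": (2, 4), "c": (3, 5), "d": (4, 6), "e": (1, 2)})
num_registers = len(set(regs.values()))
```

Here at most two lifetimes overlap at any control step, and the binding uses exactly two registers; allocation (how many registers) falls out of the binding in this simple case.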

[Figure: a DFG of additions (+) and multiplications (×); scheduling & allocation places the operations in control steps 1-6; binding assigns operations and variables to ALUs. Resources: 2 adders, 2 multipliers]

xPilot: Behavioral-to-RTL Synthesis Flow [Cong’06]

Inputs: Behavioral spec. in C/SystemC; Platform description
Frontend compiler
SSDM
Presynthesis optimizations: Loop unrolling/shifting; Strength reduction / Tree height reduction; Bitwidth analysis; Memory analysis …
Core synthesis optimizations: Scheduling; Resource binding, e.g., functional unit binding, register/port binding
μArch-generation & RTL/constraints generation: Verilog/VHDL/SystemC; FPGAs: Altera, Xilinx; ASICs: Magma, Synopsys, …
Outputs: RTL + constraints, targeting FPGAs/ASICs






The Scheduling Task

Behavioral Model: CDFG
Control-data flow graph
Nodes: operations
Directed edges
• Data edge
• Control edge: branch, loops
Generated by a compiler frontend from high level description (C/VHDL/others)
Parse
Compiler optimizations
If no control structure, a data-flow graph (DFG) is sufficient

RTL Model: Finite State Machine
Each clock cycle corresponds to a state in the FSM
Scheduling: map operations to states.

do {
    xl = x + dx;
    ul = u - 3*x*u*dx - 3*y*dx;
    yl = y + u*dx;
    c = xl < a;
    x = xl; u = ul; y = yl;
} while (c);

Impact of Scheduling

Performance
Latency/throughput: given clock cycle time

Area
Functional units
Registers
Multiplexors

Power / Reliability / etc.

Unconstrained Scheduling
Only consideration: dependencies
As-soon-as-possible (ASAP) schedule
  Schedule each operation to the earliest possible step
As-late-as-possible (ALAP) schedule
  Schedule each operation to the latest possible step, without increasing the total latency

[Figure: the same DFG (operations +, *, *, −, +) drawn under its ASAP schedule and under its ALAP schedule]
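The two unconstrained schedules can be sketched in a few lines. This is a minimal illustration, not the tutorial's own code: the function name `asap_alap`, the input format, and the assumption that every operation takes exactly one c-step are all hypothetical.

```python
from collections import defaultdict

def asap_alap(ops, deps, latency=None):
    """ops: operation names in topological order; deps: (pred, succ) pairs.
    Assumption: every operation takes exactly one c-step."""
    preds, succs = defaultdict(list), defaultdict(list)
    for p, s in deps:
        preds[s].append(p)
        succs[p].append(s)
    # ASAP: one step after the latest predecessor (step 1 for sources).
    asap = {}
    for op in ops:
        asap[op] = 1 + max((asap[p] for p in preds[op]), default=0)
    total = latency if latency is not None else max(asap.values())
    # ALAP: one step before the earliest successor (last step for sinks).
    alap = {}
    for op in reversed(ops):
        alap[op] = min((alap[s] for s in succs[op]), default=total + 1) - 1
    return asap, alap

# Hypothetical DFG: a -> c -> d and b -> d.
asap, alap = asap_alap(["a", "b", "c", "d"],
                       [("a", "c"), ("c", "d"), ("b", "d")])
mobility = {op: alap[op] - asap[op] for op in asap}  # only b can move
```

The difference `alap - asap` is the slack that the later scheduling heuristics exploit.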

Resource-Constrained Scheduling
When functional units are limited
  Each functional unit can perform only one operation in each clock cycle.
  ASAP does not guarantee that resource constraints are met
A resource-constrained scheduling problem for DFG
  Given the number of functional units of each type
  Minimize latency
  Resource constraint: if there are only k adders, no more than k additions can be executed in the same c-step.
  NP-hard!
    • Reduces to multiprocessor scheduling when resources are identical.
Usually solved heuristically using list scheduling [Pangrle & Gajski, 87]

List Scheduling Algorithm
Constructive algorithm for resource-constrained scheduling
  c-step by c-step
  Standard compiler technique
Maintain a list of ‘ready’ operations, considering dependencies
Select one operation from the ready operations based on some priority function.

    while (there are unscheduled operations) {
        curStep ← curStep + 1
        while (there are data-ready operations and available resources) {
            op ← the ready operation with highest priority
            schedule op to curStep
            update the ready list
            update priorities
        }
    }

List Scheduling Algorithm
Commonly used priority functions for latency optimization
  Nodes with small ALAP value picked first
  Nodes with more successors picked first
[Figure: list-scheduling example with 1 adder and 1 multiplier. Ready list: +1, *1, +2, *2, +3. When a ready operation finds no free unit ("Not enough resources this step, proceed to next!"), it waits for the next c-step.]
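A minimal sketch of the list-scheduling loop, assuming one-cycle operations, at least one unit of every operation type, and a priority map in which a smaller value (for example the ALAP step) is served first. The function name, data layout, and the dependencies in the usage example are hypothetical.

```python
def list_schedule(ops, deps, resources, priority):
    """ops: {name: type}; deps: (pred, succ) pairs; resources: {type: count};
    priority: {name: value}, smaller value scheduled first.
    Assumptions: one c-step per operation, every type has >= 1 unit."""
    preds = {op: set() for op in ops}
    for p, s in deps:
        preds[s].add(p)
    schedule, step = {}, 0
    while len(schedule) < len(ops):
        step += 1
        free = dict(resources)  # units still unused in this c-step
        # Ready: unscheduled ops whose predecessors ran in an earlier step.
        ready = [op for op in ops if op not in schedule
                 and all(schedule.get(p, step) < step for p in preds[op])]
        for op in sorted(ready, key=lambda o: (priority[o], o)):
            if free.get(ops[op], 0) > 0:
                free[ops[op]] -= 1
                schedule[op] = step
            # otherwise: not enough resources this step, retry next step
    return schedule

# Five operations, 1 adder and 1 multiplier; dependencies are made up.
ops = {"a1": "add", "m1": "mul", "a2": "add", "m2": "mul", "a3": "add"}
deps = [("a2", "a3"), ("m2", "a3")]
prio = {"a1": 1, "m1": 1, "a2": 2, "m2": 2, "a3": 3}
result = list_schedule(ops, deps, {"add": 1, "mul": 1}, prio)
```

With these inputs the two adds and two multiplies that are ready in step 1 compete for one unit each, so half of them slip to step 2.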

Force-Directed Scheduling Algorithm
Time-constrained scheduling
  Deadline (latency) of DFG is given as constraint
Force-directed scheduling [Paulin & Knight, 1989]
  Tries to reduce hardware resources.
  Balances the concurrency of operations to ensure a high utilization of each unit.
    • Functional units
    • Registers
    • Interconnect

Force-Directed Scheduling Algorithm
Constructive
  Operation by operation
Guided by ‘force’
  Force is defined for every possible assignment of an unscheduled operation i to c-step j
Each iteration commits the assignment with the least force

    while (there are unscheduled operations) {
        create distribution graph;
        lowest_force ← infinity;
        for (each unscheduled operation i) {
            for (each feasible c-step j for i) {
                calculate the force of scheduling i in c-step j, force(i,j);
                if (force(i,j) < lowest_force) {
                    lowest_force = force(i,j);
                    best_op = i; best_step = j;
                }
            }
        }
        schedule best_op to best_step;
    }

Force-Directed Scheduling Algorithm
Determine ASAP & ALAP schedules.
Determine the time frame of each op
  Length: possible range
  Width: probability
  Uniform distribution

Force-Directed Scheduling Algorithm
Create a distribution graph (DG) for each kind of operation unit
  DG(i): the expected number of operations in c-step i.
For DFG
  Minimize functional units
    • minimize max_i DG(i)
  Revised formulation
    • minimize Σ_i DG(i)^2
    • Analogous to spring system energy minimization
    • Note: Σ_i DG(i) is constant; when there are no dependencies between operations, the revised formulation is equivalent to the original one

Force-Directed Scheduling Algorithm
Scheduling an operation changes the distribution graph
Self force of an assignment of i to j
  Self_force(i,j) = Σ_k DG(k) * x(i,j,k)
    • k is the c-step index
    • x(i,j,k) is the change of DG(k) after assigning i to j
Example: trying to schedule the multiplication M5
  To c-step 1: self_force(M5, 1) = DG(1)*x(M5,1,1) + DG(2)*x(M5,1,2) = 2.83*0.5 + 2.33*(-0.5) = 0.25
  To c-step 2: self_force(M5, 2) = DG(1)*x(M5,2,1) + DG(2)*x(M5,2,2) = 2.83*(-0.5) + 2.33*0.5 = -0.25
A desirable schedule should have negative self force
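The distribution graph and the self force can be computed directly from the time frames. The sketch below assumes uniform probability over each time frame, as stated on the earlier slide; the function names are hypothetical, and the DG values 2.83 and 2.33 are taken from the M5 example.

```python
def distribution_graph(frames, steps):
    """frames: {op: (asap, alap)} for operations of one type.
    Each op is spread uniformly over its time frame."""
    dg = {k: 0.0 for k in steps}
    for lo, hi in frames.values():
        p = 1.0 / (hi - lo + 1)
        for k in range(lo, hi + 1):
            dg[k] += p
    return dg

def self_force(dg, frame, j):
    """Force of fixing an op with time frame `frame` to c-step j."""
    lo, hi = frame
    p = 1.0 / (hi - lo + 1)
    # x(i,j,k): the probability becomes 1 at k == j and 0 elsewhere.
    return sum(dg[k] * ((1.0 if k == j else 0.0) - p)
               for k in range(lo, hi + 1))

dg = {1: 2.83, 2: 2.33}          # DG values quoted in the M5 example
f1 = self_force(dg, (1, 2), 1)   # about  0.25
f2 = self_force(dg, (1, 2), 2)   # about -0.25

# A tiny made-up DG: one op fixed at step 1, one free over steps 1..2.
small = distribution_graph({"a": (1, 1), "b": (1, 2)}, [1, 2])
```

The negative force for c-step 2 reflects that step 2 is currently the less crowded one.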

Force-Directed Scheduling Algorithm
Predecessor & successor forces
  Scheduling an operation may affect the time frames of other operations
  Predecessor & successor forces = sum of self forces for implicitly scheduled operations
  Force(i,j) = self_force(i,j) + predecessor&successor_force(i,j)
Look-ahead
  When we try an assignment, consider the effect on DG(i) due to all implied assignments
  Minimize Σ_i DG(i)^2
Can be extended to balance register lifetime, communication, etc.
DG computation can be extended for branches and loops

Classical Integer Linear Programming Formulation
An exact formulation
0-1 assignment variables
  x_ij = 1 if operation i is scheduled to c-step j, otherwise 0
  Constraint: Σ_j x_ij = 1
Resource modeling
  Number of additions in c-step j is Σ_{i is addition} x_ij
The c-step for node i
  t(i) = Σ_j j*x_ij
Dependency constraints: t(i1) – t(i2) >= delay(i2), where i1 depends on i2
Overall latency for DFG: max t(i)
Various scheduling problems can be modeled.

Classical Integer Linear Programming Formulation
Pros: modeling ability
  Can be extended to handle almost every design aspect
    • Resource allocation
    • Module selection
    • Area, power, etc.
Cons: computationally expensive
  #variables = O(#operations * #c-steps)
  0-1 assignment variables: need extensive search to find the optimal solution

Resource Allocation & Scheduling Using ILP
Given total area for functional units, minimize latency for DFG.

\begin{align*}
\text{Minimize} \quad & t \\
\text{s.t.} \quad & \sum_j x_{ij} = 1 && \text{for all operations } i \\
& t \ge \sum_j j \cdot x_{ij} && \text{for all operations } i \\
& \sum_{type(i)=k} x_{ij} \le r_k && \text{for all c-steps } j \\
& \sum_k Area_k \cdot r_k \le Area \\
& \sum_j j \cdot x_{tj} - \sum_j j \cdot x_{sj} \ge delay(s) && \text{for all dependencies of } t \text{ on } s
\end{align*}

SDC-Based Scheduling Algorithm
J. Cong and Z. Zhang. "An Efficient and Versatile Scheduling Algorithm Based On SDC Formulation". DAC 2006.
Motivation: more efficient ILP formulation
  Use the c-step index directly: sv(i) is the c-step for operation i.
  #variables = O(#operations)
  Restrict the type of constraints
    • sv(i) – sv(j) <= b (finite difference constraint)
Idea: use a system of finite difference constraints to model all constraints in scheduling
  Some constraints are modeled approximately or heuristically
Advantage: easy to get integer solutions (details later)

SDC-Based Scheduling Algorithm
Scheduling variable definition
  For each operation node v in the CDFG, define {sv_i(v) | i ∈ [0, K]}, where K = Latency(v)
  ∀n, ∀i, the value of sv_i(n) is a non-negative integer
  sv_i(v) – sv_{i-1}(v) = 1
  Let sv_beg(v) ≡ sv_0(v) and sv_end(v) ≡ sv_K(v)
Example: v is a multiplication with Latency(v) = 2 (stages *0, *1, *2)
  sv_0(v) ≡ sv_beg(v)
  sv_1(v)
  sv_2(v) ≡ sv_end(v)
  sv_1(v) – sv_0(v) = 1
  sv_2(v) – sv_1(v) = 1

SDC-Based Scheduling Algorithm
The value of a scheduling variable describes the relative temporal position of one pipeline stage of an operation node in the final schedule
[Figure: DFG with operations v1 to v5 (+, *, *, −, +) scheduled over control steps CS0 to CS6]
  sv_0(v3) ≡ sv_beg(v3) = 3
  sv_1(v3) = 4
  sv_2(v3) ≡ sv_end(v3) = 5

SDC-Based Scheduling Algorithm
Integer difference constraints
  A special kind of linear constraint
  In the form of sv(i) – sv(j) <= b, where b is an integer
  Very powerful and suitable for scheduling problems
System of difference constraints (SDC)
  Totally unimodular matrix (TUM)
    • An integer matrix A is called totally unimodular if every square, nonsingular submatrix of A has a determinant of +1 or −1
  The constraint matrix is totally unimodular for SDC

SDC-Based Scheduling Algorithm
Theorem on integrality
  The LP model based on a TUM matrix is solvable by linear programming in polynomial time with integer solutions [Papadimitriou and Steiglitz, Combinatorial Optimization 1982]
  Or, equivalently, the extreme points of the polyhedron defined by SDC are vectors of integers.
Steps of the algorithm
  Model constraints using SDC
  Model objective using a linear expression of scheduling variables
  Solve an LP and get an integer solution, then generate the FSM
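To illustrate why difference constraints yield integer solutions so easily, the sketch below finds one feasible integer assignment with Bellman-Ford relaxation instead of an LP solver. This is a swapped-in textbook technique, not the LP step named above, and it ignores the objective. The function name, the constraint tuples, and the one-variable-per-operation simplification are assumptions.

```python
def solve_sdc(variables, constraints):
    """constraints: (u, v, b) meaning sv(u) - sv(v) <= b, with integer b.
    Returns one non-negative integer solution, or None if infeasible."""
    dist = {x: 0 for x in variables}      # virtual source at distance 0
    for _ in range(len(variables)):
        changed = False
        for u, v, b in constraints:       # edge v -> u with weight b
            if dist[v] + b < dist[u]:
                dist[u] = dist[v] + b
                changed = True
        if not changed:
            shift = -min(dist.values())   # make the earliest step 0
            return {x: d + shift for x, d in dist.items()}
    return None  # still relaxing after |V| passes: negative cycle

# Hypothetical chain a -> b -> c plus d -> c, each op taking one cycle:
# "b starts at least one cycle after a" is sv(a) - sv(b) <= -1.
cons = [("a", "b", -1), ("b", "c", -1), ("d", "c", -1)]
sched = solve_sdc(["a", "b", "c", "d"], cons)
bad = solve_sdc(["a", "b"], [("a", "b", -1), ("b", "a", -1)])
```

Because every bound b is an integer, every distance stays an integer, which mirrors the integrality property of the totally unimodular constraint matrix.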

SDC-Based Scheduling Algorithm – Constraint Generation
Dependency constraint
  Data dependency: sv_end(u) – sv_beg(v) ≤ 0, where v depends on u
  Control dependency: introduce artificial nodes
    • Super-source of bb_i: ssrc(bb_i)
        ∀v ∈ bb_i, sv_end(ssrc(bb_i)) – sv_beg(v) ≤ 0
    • Super-sink of bb_i: ssnk(bb_i)
        ∀v ∈ bb_i, sv_end(v) – sv_beg(ssnk(bb_i)) ≤ 0
    • If bb_j depends on bb_i after backward edge removal
        sv_end(ssnk(bb_i)) – sv_beg(ssrc(bb_j)) ≤ 0

SDC-Based Scheduling Algorithm – Constraint Generation
Relative timing constraint
  A minimum timing constraint between v_i and v_j
    • v_j follows v_i by at least L clock cycles
    • sv_beg(v_i) – sv_beg(v_j) ≤ –L
  Similar for maximum timing constraint
    • Latency constraint
Cycle time (frequency) constraint
  Given a target clock period T, the maximum combinational delay within a clock cycle must not exceed T
    • sv_beg(v_i) – sv_end(v_j) ≤ –(⌈CombDelay(v_i, v_j) / T⌉ – 1)
  Prevents chaining a long combinational path in one cycle
[Figure: DFG with operations +, *, *, −, + labeled v1 to v5]
Example: Addsub (+/−) takes 2ns, Mult (*) takes 5ns. Consider path v2-v3-v5: sv_beg(v2) – sv_end(v5) ≤ –1
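The right-hand side of the cycle-time constraint is simple arithmetic. The sketch below evaluates it for the path example under two assumptions that the slide does not state: that the path v2-v3-v5 contains two multiplications and one addition (12 ns in total), and that the target clock period T is 10 ns.

```python
import math

def cycle_time_bound(comb_delay_ns, clock_period_ns):
    """Right-hand side b of sv_beg(vi) - sv_end(vj) <= b."""
    return -(math.ceil(comb_delay_ns / clock_period_ns) - 1)

path_delay = 5 + 5 + 2                      # assumed: mult, mult, addsub
bound = cycle_time_bound(path_delay, 10)    # assumed T = 10 ns
short = cycle_time_bound(2 + 2, 10)         # a path that fits in one cycle
```

Under these assumptions the bound is -1, matching the constraint shown for the path; a path that fits within one period gives 0, so chaining is allowed.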

SDC-Based Scheduling Algorithm – Constraint Generation
Resource constraint as SDC
  For each type of resource, heuristically create an ordered list of the operations
    • Use priorities like in list scheduling to decide order
  Serialize the operation pairs which are k distance away, where k is the number of available resources.
[Figure: example with resource constraint: 2 adders]

SDC-Based Scheduling Algorithm – Versatile Objectives
Modeling objectives as linear expressions
  ASAP objective: min ∑ sv_beg(v)
  ALAP objective: max ∑ sv_beg(v)
  Longest path latency: min sv_end(ssnk(exit-bb))
  Expected overall latency
    • Weighted sum of basic block latency
    • min ∑_i α_i (sv_end(bb_i) – sv_beg(bb_i))
  Slack distribution
    • min ∑_{v depends on u} (sv_beg(v) – sv_end(u))

Scheduling for Low Power
A lot of opportunities for power reduction during scheduling
  An active research area
Explore the design space of
  Multiple Vdd/Vth [Johnson&Roy, 97; Tang et al., 05]
  Variable Vdd/Vth [Shin&Kim, 05]
  Others: clock gating, power gating, etc.
Algorithms
  Integer linear programming
  Various heuristics

Scheduling with Integer Time Budgeting
Wei Jiang, Zhiru Zhang, Miodrag Potkonjak and Jason Cong, “Scheduling with Integer Time Budgeting for Low-Power Optimization”, ASPDAC 2008.
A follow-up work of SDC-based scheduling for low power.
Integer time budgeting problem
  Time budgeting: distributing slacks to different modules of a design to optimize some objectives (such as area, power)
  Example: reduce area/power by using slower adders
  In scheduling, we require these slacks to be integers
Mathematical programming formulation
  Similar scheduling variables as in SDC-based scheduling
  Introduce a time-budget variable for each node
  Convex separable objective

Mathematical Formulation of Time Budgeting Problem
Directed Acyclic Graph: G = (V, E)

\begin{align*}
\text{Min} \quad & \sum_{i=1}^{|V|} f_i(b_i) \\
\text{Subject to:} \quad & s_i + b_i \le s_j, && \forall e(i,j) \in E \\
& b_i \ge d_i, && \forall v_i \in V \\
& s_i \ge 0, && \forall v_i \in PIs \\
& s_i \le T, && \forall v_i \in POs
\end{align*}

s_i: start time of node v_i
d_i: minimum latency of v_i
b_i: time budget at node v_i
Each f_i is a single-variable convex function
Linearly constrained separable convex optimization problem
[Figure: example DAG with nodes v1 to v6 and deadline T; power (mW; ticks 100, 50, 30) versus delay (ns; ticks 1 to 4) tradeoff curve]

Totally Unimodular Constraint Matrix
Theorem 1: The constraint matrix is a TUM
Example: v1 and v2 both feed v3

Constraint            s1  s2  s3  b1  b2  b3
s1 + b1 – s3 ≤ 0       1   0  -1   1   0   0
s2 + b2 – s3 ≤ 0       0   1  -1   0   1   0
s1 ≥ 0                 1   0   0   0   0   0
s2 ≥ 0                 0   1   0   0   0   0
s3 ≤ T                 0   0   1   0   0   0

Optimizing Separable Convex Objective
[Figure: two plots of power (mW; ticks 100, 50, 30) versus delay (ns; ticks 1 to 4): the convex power-delay curve and its piecewise-linear (PWL) approximation]

Application to Low-Power Scheduling
Motivation
  Scheduling and time budgeting are highly correlated
Problem description
  Consider the scheduling and budgeting problems together to minimize the average power under time constraint T
Power modeling
  Power-delay tradeoff curve (convex)
  Total node power: summing up the power for all the nodes (operations)
  Total FU power: summing up the power of all functional units
    • Consider resource sharing

Low-Power Scheduling
Each node v_i ∈ V_op is associated with a node budgeting variable bv(v_i), which denotes the number of clock cycles that operation v_i lasts in the final schedule
Adjust the following constraints
  Data dependence constraint
    • ∀(u, v) ∈ E_d: sv_beg(u) + bv(u) ≤ sv_beg(v)
  Latency constraint T
    • ∀v ∈ V_op: sv_beg(v) + bv(v) ≤ T
  Throughput constraint with initiation interval II
    • ∀v ∈ V_op: bv(v) ≤ II
Optimizing total node power
  We can optimally minimize the total node power in polynomial time

\[ \text{Min} \sum_{i=1}^{|V_{op}|} pw_{op(v_i)}(bv(v_i)) \]

Consideration of Resource Binding
Optimizing total FU power

\[ \text{Min} \sum_{j=1}^{|F|} |f_j| \cdot pw_{op(f_j)}(bv(f_j)) \]

  The constraint matrix is no longer totally unimodular with the requirement that:
    • all operations sharing the same functional unit must have the same slack
  The problem is NP-complete (reduction from 3-SAT)
Proposed heuristic
  First solve the continuous version and obtain the “optimal” fractional budget fb(v_i) for each node v_i
  Perform a global rounding by minimizing the least-squares error
    • Objective function is separable convex

\[ \text{Min} \sum_{i=1}^{|V_{op}|} (bv(v_i) - fb(v_i))^2 \]

Introduction of Binding
Binding maps operations, variables, or data transfers to resources
  Functional unit, register, memory array, multiplexer, bus…
  Resources are usually shared to save area cost
FU binding
  Goal is to minimize the number of FUs
Register binding
  Goal is to minimize the number of registers
Advanced binding
  Goal is to optimize and trade off multiple design qualities, including total area, interconnections, clock period, power…

Binding Example
[Figure: a scheduled DFG with operations 1 (*), 2 (*), 3 (+), 4 (+)]
FUs (registers) are shared by operations of the same type (variables) whose lifetimes do not overlap
Lifetime [birth-time, death-time]
  • Operation: the whole execution time
  • Variable: from the time the variable is generated to the time it is last read

Datapath uArch Model
[Figure: datapath with registers (holding variables), multiplexers, and functional units (MUL, ALU)]

Possible Positions of Binding in Entire Behavioral Synthesis Flow
After scheduling is done
  Decides resource usage and detailed architecture
Before scheduling is done
  Affects both area and delay
Simultaneous scheduling and binding
  Globally better results

Binding Works Classified by Algorithms
Graph-based algorithms
  Clique partitioning [Tseng CAD’86] [Paulin DAC’86]
  Left-edge [Kurdahi DAC’87]
    • Minimum number of registers and FUs
  Bipartite [Huang DAC’90]
Branch-and-bound [Pangrle DAC’88]
Integer Linear Programming (ILP) [Gebotys JSSC’92] [Rim DAC’92]
  Pros: Optimal solution
  Cons: Scalability; difficult to formulate versatile constraints
Simulated annealing
  Simultaneous allocation and binding of all resources [Devadas TCAD’89] [Choi TDAES’99]
  Pros: Consider multiple optimization parameters together for globally better results
  Cons: Run-time and scalability
Min-cost network flow
  [Kim CICC’95] [Chang DAC’95] [Chang DATE’96] [Lyuh TVLSI’03] [Chen ASPDAC’04] [Chen DAC’06]
  Formulate binding problems as a min-cost network flow
    • Edge cost represents optimization goal, such as interconnections, power, …
……

The Order of FU and Register Binding
Inter-dependency exists between FU binding and register binding
  To minimize interconnection, each task needs the other’s result to make accurate decisions
[Figure: (1) a scheduling example with operations o1, o2, o3 and variables v1, v2, v3 over steps 1 to 4; (2a) FU binding + REG binding, with operation groups {o1, o2}, {o3} and variable groups {v1, v2}, {v3}; (2b) REG binding + FU binding, with operation groups {o1}, {o2, o3} and variable groups {v1, v2}, {v3}. Resource constraints: 2 FUs (F1, F2), 2 REGs (R1, R2)]
The inter-dependency is more complicated in real designs

Simultaneous Binding Algorithms
Performing tasks simultaneously
  Leads to globally better results
  But usually intractable
Simultaneous binding
  Simulated annealing [Devadas TCAD’89] [Choi TDAES’99]
  Simulated evolution [Ly TCAD’93]
  Iterative searching [Dasgupta ISLPED’95] [Lakshminarayana TVLSI’99]
  Step-by-step simultaneous binding [Kim CICC’95]
  Interleaving simultaneous binding [Cong DATE’08]

Binding Works Classified by Optimization Goals
Number of FUs and registers
  [Tseng CAD’86] [Kurdahi DAC’87]

Interconnections, multiplexers [Paulin DAC’86] [Stok DATE’90] [Huang DAC’90] [Kim CICC’95] [Lyuh TVLSI’03] [Chen ASPDAC’04] [Cong DATE’08]

Low power [Chang DAC’95] [Chang DATE’96] [Hong ICCAD’00] [Zhong TCAD’05] [Chen DAC’06]

Spurious switching activity [Mussoll ISLPED’95] [Dey TCAD’99] [Zhong ICCD’02]

Clock period [Huang DAC’06]

….

Binding Works Classified by Micro-Architectures
Processor architecture
  Multicluster architecture [Farkas 97]
  Multicomputer processor-DRAM chip model [Dally 99]
Synthesis consideration of distributed architectures
  Use register files during post-processing, e.g., Hyper [Rabaey 91]
  Sequencer (stack or queue)-based architecture [Aloqeely 94]
  Data routing approach [Lanneer 94]
  Distributed VLIW [Jacome 00]
  Distributed-register architecture [Jeon 01, Kim 01]
  Regular Distributed Register (RDR) for multicycle communication [Cong 04]
…

Micro-Architecture Example - Register File
Multiplexer and interconnect costs are significant
Register files are used to hide the multiplexers, which are replaced by dedicated decoders
[Figure: (a) a scheduled data-flow graph with optimal register binding labeled on each variable; (b) binding using discrete registers, with MUXes feeding the FU; (c) binding using a register file (entries 1 to 4) feeding the FU]

Terminology and Definition
The input of the following binding algorithms is a scheduled data flow graph (DFG), G = (O, A)
  O: the set of operations
    • Each operation has an operation type t
  A: the data dependences between operations
  V: the set of variables
    • A variable crossing a clock boundary needs to be registered

Compatibility and Conflict Graph
Given a scheduled DFG G = (O, A), build the graphs below for operations of type f: Gf = (Vf, Af)
  Vf: all the operations of type f in O
  Af: depends on the graph type
    Compatibility graph: all the edges between compatible operations in Vf
    Conflict graph: all the edges between incompatible operations in Vf
    Comparability graph: all the directed edges between compatible operations in Vf
      • af = (vi, vj) iff death-time(vi) < birth-time(vj)
[Figure: a scheduled DFG with operations 1 to 4 of the same type, and its compatibility graph, conflict graph, and comparability graph]
Note: The graphs for variables/registers are constructed in the same way.
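The three graphs follow mechanically from the lifetimes. A minimal sketch, assuming closed lifetimes [birth-time, death-time] measured in c-steps and the strict rule death-time(vi) < birth-time(vj) given above; the function name and the example lifetimes are hypothetical.

```python
from itertools import combinations

def binding_graphs(lifetimes):
    """lifetimes: {op: (birth, death)} for operations of one type.
    Returns (compatibility, conflict, comparability) edge sets."""
    compat, conflict, compar = set(), set(), set()
    for a, b in combinations(sorted(lifetimes), 2):
        (birth_a, death_a), (birth_b, death_b) = lifetimes[a], lifetimes[b]
        if death_a < birth_b:        # a finishes before b starts
            compat.add((a, b))
            compar.add((a, b))       # directed edge a -> b
        elif death_b < birth_a:      # b finishes before a starts
            compat.add((a, b))
            compar.add((b, a))       # directed edge b -> a
        else:                        # lifetimes overlap
            conflict.add((a, b))
    return compat, conflict, compar

# Made-up schedule: ops 1 and 2 run in step 1, op 3 in step 2, op 4 in step 3.
compat, conflict, compar = binding_graphs(
    {1: (1, 1), 2: (1, 1), 3: (2, 2), 4: (3, 3)})
```

Here only operations 1 and 2 conflict, so two functional units are enough for this made-up schedule.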

Time Complexity
The minimum coloring of the conflict graph gives the minimum number of resources needed
  Generally NP-complete
  For a DFG, the corresponding conflict graphs are interval graphs
    • Vertices correspond to intervals
    • Edges correspond to interval intersection
    • Solvable in polynomial time
  Loops make register binding intractable
    • Circular-arc graph

Left-Edge Algorithm
Input is a group of intervals with starting and ending times
Goal: minimize the number of colors (resources)
Basic idea
  1. Sort intervals in order of increasing starting time
  2. Get the first interval from the list and put it into a new color group
  3. Search the list from the beginning and put as many intervals as possible into the new color group
  4. Go to step 2 until the list is empty
Possible to incorporate other factors
  Interconnect, bitwidth …
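The four steps translate almost line for line into code. A minimal sketch, assuming half-open intervals [start, end), so that an interval may reuse a color as soon as the previous one ends; the function name and the example intervals are hypothetical.

```python
def left_edge(intervals):
    """intervals: {name: (start, end)}, treated as half-open [start, end).
    Returns a list of color groups, each a list of interval names."""
    # Step 1: sort by increasing starting time.
    remaining = sorted(intervals, key=lambda n: intervals[n])
    groups = []
    while remaining:                       # Step 4: repeat until empty
        group, last_end, rest = [], None, []
        for name in remaining:             # Steps 2 and 3: fill one color
            start, end = intervals[name]
            if last_end is None or start >= last_end:
                group.append(name)
                last_end = end
            else:
                rest.append(name)          # overlaps, try the next color
        groups.append(group)
        remaining = rest
    return groups

colors = left_edge({"a": (0, 3), "b": (1, 4), "c": (3, 5),
                    "d": (4, 7), "e": (5, 8)})
```

Each returned group corresponds to one register or functional unit.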

Example
[Figure: left-edge example, from Giovanni De Micheli’s class slides: a scheduled DFG, the lifetime intervals over time steps 0 to 8, the coloring produced by the left-edge algorithm, and the colored conflict graph]

Min-Cost Network Flow [Chang DAC’95]
Assumption
  Scheduling and functional-unit allocation are already done
    • Sufficient number of FUs, equal to or greater than that of the left-edge algorithm
The network is constructed based on nodes’ compatibility
Costs represent the goals of concern
  The lower the cost, the more likely the two nodes share the same resource
An n-flow binds the nodes to n resources
  Special nodes and properties are added to guarantee that all nodes are covered once and only once

Construction of Network - 1
Comparability graph
  Properties
    Comparability graph has a transitive orientation
    Directed Acyclic Graph (DAG)
    Operations bound to a resource execute in topological order
Given a DFG G, build a comparability graph, Gc = (Vc, Ac)
  Vc: all the operations in G
  Ac: all the edges between compatible operations in Vc
    ac = (vi, vj) iff Deathtime(vi) < Birthtime(vj)
  Wij: weight of ac, the cost of binding vi and vj into a single FU
    Interconnections, switching activity (power) …
[Figure: a scheduled DFG with operations 1 to 4 and its comparability graph Gc]

Construction of Network - 2
Network
  Add source and sink vertices to the comparability graph
  There is an edge from the source to every node in the graph
  There is an edge from every node in the graph to the sink
[Figure: comparability graph Gc with operations 1 to 4, and the network obtained by adding source s and sink t]

Network Flow and Binding
FU binding solutions correspond to flows in the network
[Figure: a scheduled DFG with operations 1 to 4 and the corresponding network flows from s to t. Resources: two FUs]

Split Network
How to guarantee that all nodes are covered once and only once?
  Split each node v into two nodes, v and v’, with flow(v, v’) = 1
[Figure: the network (operations 1 to 4, source s, sink t) and its split network with added nodes 1’ to 4’]

Lemmas
A unit flow f (|f| = 1) in the split network corresponds to a clique χ in the original comparability graph Gc
  An edge (vi’, vj) in the flow indicates that operations vi and vj will be bound to the same FU
A k-flow f (|f| = k) that passes through every node by a unit flow is equivalent to finding k disjoint paths (or chains) in the network, thus generating k cliques in Gc covering all the operation nodes
  This forms a legal binding solution

Multi-Vdd Binding (Chen et al ASPDAC’05)
Solved problem
  Simultaneously perform resource binding and voltage assignment to optimize power
Solution
  Extend the split network ([Chang DAC’95]) to support voltage assignment
  Edge costs represent switching activities or voltage decisions
  Optimality
    • Guarantees that the largest number of operations are assigned low-Vdd with the minimum total switching activity

Problem Definition of FU Binding with Voltage Assignment
Given:
  A scheduled data flow graph (DFG)
  A module (functional unit) library with multiple Vdds
  The Vdd of each module can be changed dynamically while executing different operations
Goal:
  Assign low-Vdd to the maximum number of operations with switching-activity consideration
  Minimize total switching power through functional unit binding
Constraint:
  Latency constraint
  Resource constraint

Motivational Example
Which set of operations to extend?
  Honor data dependency
  Maximum number under resource and latency constraints
  The best such set in terms of switching-activity reduction during FU binding later on
Need to consider voltage assignment and FU binding simultaneously to achieve an optimal solution
[Figure: DFG with operations 1 to 6 and its possible extensions. Legend: Multiplication, Addition; High-Vdd: 3 clocks; Low-Vdd: 5 clocks]

Network Construction - 1
Given a DFG G, build a comparability graph, Gc = (Vc, Ac)
  Vc: all the operations in G
  Ac: all the edges between compatible operations in Vc
    ac = (vi, vj) iff DT(vi) < BT(vj)
  Wij: weight of ac, the cost of binding vi and vj into a single FU
    Switching activity
Extendable operations: op1, op3
  Definition: operations which can be assigned low-Vdd without violating data-dependency or timing constraints
[Figure: a scheduled DFG (additions) with operations 1 to 4, and its comparability graph with edge weights w14 and w23]


[Figure: the comparability graph (operations 1 to 4, edge weights w14, w23) and the network with two Vdds, with source s, sink t, added nodes 1’ and 3’, and cost –T on the edges (v, v’)]
Add nodes (v’) for extendable operations
Assign cost (–T) to edges (v, v’)
  L = 100
  T = L × |Vc|
  –T for maximum number of extensions
C(v1’, v4) = C(v1, v4)
C(vi, vj) = –L × (1 – Wij)
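The edge costs can be evaluated with two small helpers. This sketch assumes the switching-activity weight Wij is normalized to [0, 1], which the cost formula suggests but the slide does not state; the function names are hypothetical.

```python
L = 100  # scaling constant from the slide

def extension_cost(num_ops):
    """Cost -T on an edge (v, v'), with T = L * |Vc|."""
    return -(L * num_ops)

def sharing_cost(w_ij):
    """C(vi, vj) = -L * (1 - Wij); Wij assumed normalized to [0, 1]."""
    return -L * (1 - w_ij)

# With |Vc| = 4 operations, a chain uses at most 3 sharing edges, so the
# sharing costs on it total at most 3 * L in magnitude, which is below T.
t_cost = extension_cost(4)
c_14 = sharing_cost(0.25)
```

Under that assumption a single extension edge outweighs the sharing edges of a whole chain, which is consistent with the slide's note that –T is chosen to maximize the number of extensions.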


Add split nodes (vd) to guarantee that all the operations will be bound
  flow(v, vd) = 1
  C(v2d, v3) = C(v2, v3)
[Figure: the network with two Vdds (operations 1 to 3, extended node 1’, source s, sink t) and the split network with two Vdds, with added nodes 1d, 2d, 3d and cost –T on the extension edge]

Theorem
Given
  A comparability graph with estimated switching activities on the edges
  k functional units
  Dual supply voltages
Theorem
  The min-cost k-flow f on the corresponding split network gives the largest number of extended operations in the design with the minimum total switching activity on k functional units

Multi-Cycle Communication Architectural Synthesis (MCAS)
Regular Distributed Register (RDR) micro-architecture
  Highly regular
  Direct support of multi-cycle on-chip communication
MCAS: Architectural Synthesis for Multi-cycle Communication
  Integrates architectural synthesis (e.g. resource binding, scheduling) with physical planning
  Targets the RDR architecture

Needs for Multi-Cycle On-Chip Communication

Interconnect delays dominate the timing in DSM technologies.
Single-cycle full-chip synchronization is no longer possible.

ITRS’01 0.07um technology
5.63 GHz across-chip clock
800 mm2 (28.3mm x 28.3mm)
IPEM BIWS estimations
Buffer size: 100x
Driver/receiver size: 100x

From corner to corner, at the semi-global layer (Tier 3):
A signal can travel up to 11.4mm in one cycle
Crossing the chip needs 5 clock cycles

[Figure: chip map with distance marks at 0, 11.4, 22.8 and 28.3 mm and the regions reachable in 1 to 5 clock cycles.]
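The 5-cycle figure can be reproduced with a short calculation. This sketch assumes Manhattan (horizontal plus vertical) routing between opposite corners, which is an assumption of this example; the slide only states the result.

```python
import math

def cycles_to_cross(width_mm, height_mm, reach_mm_per_cycle):
    """Clock cycles needed for a corner-to-corner signal."""
    distance = width_mm + height_mm  # assumed Manhattan route
    return math.ceil(distance / reach_mm_per_cycle)

# 28.3mm x 28.3mm chip, 11.4mm reachable per cycle (values from the slide).
cycles = cycles_to_cross(28.3, 28.3, 11.4)
```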

Regular Distributed Register Architecture (1)

Distribute registers to each “island”
Choose the island size such that local computation and communication in each island can be done in a single cycle:

$D_{intra\text{-}island} = D_{logic} + D_{opt\text{-}int} \le D_{logic} + 2\,D_{opt\text{-}int}(W_i + H_i) \le T$

[Figure: array of islands connected by global interconnect. Each island (width Wi, height Hi) contains a Local Computational Cluster (LCC), i.e. a cluster with area constraint (e.g. ADD, MUL, MUX), a register file, and an FSM.]

Regular Distributed Register Architecture (2)

Use register banks: registers in each island are partitioned into k banks for 1-cycle, 2-cycle, … k-cycle interconnect communication in each island
Highly regular

[Figure: the same island array, with each island's register file split into 1-cycle, 2-cycle, … k-cycle banks.]

Example : Regular Distributed Register Architecture for 70nm Technology

ITRS’01 70nm technology
Chip dimension: 800 mm2 (28.3mm x 28.3mm)
5.63 GHz across-chip clock
• Can travel up to 11.4mm within 1 clock cycle under best interconnect optimization
• Need 5 clock cycles to cross the chip
Each island base dimension
• Wi = Hi = 2.08mm
• ≈ 1/3 of distance a wire can travel in 1 clock cycle
• Logic volume: 6.76M min-size 2-NAND gates
12x12 array of islands
Local registers are partitioned into 7 banks

[Figure: data flow graph extracted from the discrete cosine transform (DCT), with 12 operations: 1 (-), 2 (+), 3 (*), 4 (*), 5 (-), 6 (-), 7 (*), 8 (*), 9 (-), 10 (-), 11 (*), 12 (*). The nodes with the same color are assigned to the same functional unit.]

Example: Impact of Interconnect on Scheduling

Performance-driven Placement

[Figure: DFG operations bound to functional units placed in islands: Alu1 = {1, 5, 10}, Alu2 = {2, 6, 9}, Mul1 = {4, 8, 11}, Mul2 = {3, 7, 12}. Boxes on the DFG edges represent registers.]

resource      delay   num
alu           1 ns    2
multiplier    2 ns    2

Short interconnect: 1 ns
Long interconnect: 2 ns

Single-cycle vs. Multi-cycle Interconnect Communication

Single-cycle interconnect communication
Scheduled in 6 clock cycles
Clock period is 4ns
Total latency is 24ns

[Figure: schedule of the 12 DFG operations over Cycle1 to Cycle6.]

Multi-cycle interconnect communication
Scheduled in 9 clock cycles
Clock period is 2ns
Total latency is 18ns

[Figure: schedule of the 12 DFG operations over Cycle1 to Cycle9.]

Enhancement 1: Simultaneous Placement and Scheduling for Performance Optimization

With placement integrated with scheduling, the critical path is reduced.
The DFG can be scheduled in 8 clock cycles, with a clock period of 2ns.
The total latency is 16ns.

[Figure: simultaneous placement and scheduling; schedule over Cycle1 to Cycle8 with Alu1 = {1, 5, 10}, Alu2 = {2, 6, 9}, Mul1 = {4, 8, 11}.]

Enhancement 2: Simultaneous Placement, Scheduling and Binding for Performance Optimization

With placement integrated with scheduling and binding, the critical path is further reduced.
The DFG can be scheduled in 7 clock cycles, with a clock period of 2ns.
The total latency is 14ns.

[Figure: simultaneous placement, scheduling and binding; schedule over Cycle1 to Cycle7 with Alu1 = {1, 5, 10}, Alu2 = {2, 6, 9}, Mul1 = {4, 8, 12}.]
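The four latency figures quoted on the scheduling slides all follow from latency = cycles × clock period. A small sketch that tabulates them; the dictionary keys are descriptive labels chosen for this example.

```python
# (clock cycles, clock period in ns) as stated on the slides
schedules = {
    "single-cycle communication": (6, 4),
    "multi-cycle communication": (9, 2),
    "placement + scheduling": (8, 2),
    "placement + scheduling + binding": (7, 2),
}

latency_ns = {name: cycles * period
              for name, (cycles, period) in schedules.items()}
```

Going from single-cycle to multi-cycle communication adds cycles but halves the period, which is why total latency still drops.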

MCAS: Placement-Driven Architectural Synthesis Using RDR Architecture

[Figure: MCAS flow. Inputs: C / VHDL, RDR arch. spec., target clock period, resource constraints. Steps: CDFG generation, resource allocation, functional unit binding, scheduling-driven placement, placement-driven rebinding & scheduling, register and port binding, datapath & FSM generation. Intermediate data: CDFG, Interconnected Component Graph (ICG), location information. Outputs: RTL VHDL files, floorplan constraints, multi-cycle path constraints.]

Multi-Cycle Communication Architectural Synthesis (MCAS) System

Scheduling-driven placement
Integrate list-scheduling with an SA-based global placement for minimizing the total latency.
Employ net weighting technique to shorten the critical global connections.

Placement-driven rescheduling & rebinding
Integrate force-directed list-scheduling with simultaneous rescheduling & rebinding to further minimize the latency.

Challenges in Behavioral Synthesis

Sub-tasks in the synthesis flow have complicated inter-dependencies
How to perform resource allocation, scheduling and binding simultaneously to gain better design quality

Physical reality has an impact on the final results
E.g. how to consider the impact on back-end synthesis (such as routability) during behavioral synthesis
• MCAS considers interconnect complexity and coarse placement

Still need more enhancements to become a complete solution
E.g. how to consider floorplanning of voltage islands for multi-Vdd synthesis

Outline

Foundation of platform-based design and Metropolis framework (Alberto)
Challenges in System Level Design
Platform-based Design as a unifying methodology
A framework for PBD:
• Theoretical foundations (heterogeneous systems, metamodeling, abstract semantics)
• Metropolis II: integration platform architecture
• Application to embedded system design: cars and building networked embedded systems
Synthesis for functionality (Jason)
Synthesis for communication (Radu)
Networks-on-Chip design space
Optimization for performance, energy, and fault-tolerance
Beyond system-on-a-chip (Clas)



[Figure: design flow across UIUC, UCI and UCB. Labels: MMMs, SystemC, C++; analysis; transformation rules; TLM generation (App1 + Platform1 to TLM1, App2 + Platform2 to TLM2); TLM equivalence checker (UCI) and equivalence result; design optimization and design decisions; Metro-II framework (UCB) with Metro-TLM library; function space, function instance, platform instance, mapped instance; 4. Map and Simulate: ASPN simulation, co-simulation, analog sim, event trace, MetroII model refinement; processor synthesis with processor library: ASIPs, xPilot, process mapping.]

Application-Specific Instruction-Set Processors (ASIPs)

General purpose processor cores + programmable fabric
Loosely coupled as a coprocessor
• Example: Xilinx MicroBlaze, etc.
Tightly integrated as extra function units in application-specific instruction-set processors
• GPP has the capability to extend basic instruction set
• Programmable fabric implements the customized instructions
• Example: Altera Nios / Nios II

[Figures: custom instruction logic for Nios II (source: www.altera.com); Xilinx MicroBlaze (source: www.xilinx.com)]

Comparison of Different Approaches

General-purpose CPUs
Very flexible and easy to program
Low performance for data-intensive applications
Poor power efficiency

Hardwired Logic
High performance due to parallelism
Best power efficiency
Inflexible after tapeout
High cost and long development time

ASIP
Reduce development time and cost by reusing most of the components of a pre-verified processor
Extend instructions to leverage the parallelism for performance improvement
Reconfigurability of programmable fabrics provides a certain flexibility

Motivational Example

t1 = a * b;
t2 = b * 0xf0;
t3 = c & 0x12;
t4 = t1 + t2;
t5 = t2 + t3;
t6 = t5 + t4;

Execution time: 9 clock cycles (*: 2 clock cycles, others: 1 clock cycle)

Extended instruction set: I ∪ extop1 ∪ extop2

[Figure: DFG of the code above (inputs a, b, c, constants 0xf0 and 0x12) with the operations grouped into extop1 and extop2.]

t1 = extop1(a, b, 0xf0);
t2 = extop2(b, c, 0xf0, 0x12);
t3 = t1 + t2;

Execution time: 5 clock cycles
Speedup: 1.8
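To check that the extended-instruction version computes the same value, the two code fragments can be compared directly. The bodies of `extop1` and `extop2` below are inferred from their argument lists and the DFG grouping (two first-level operations plus the addition joining them); the slide does not spell them out, so treat them as an assumption.

```python
def original(a, b, c):
    t1 = a * b
    t2 = b * 0xF0
    t3 = c & 0x12
    t4 = t1 + t2
    t5 = t2 + t3
    return t5 + t4  # t6

def extop1(a, b, k):
    # assumed: (a * b) + (b * k), i.e. t1 and t2 feeding t4
    return a * b + b * k

def extop2(b, c, k1, k2):
    # assumed: (b * k1) + (c & k2), i.e. t2 (duplicated) and t3 feeding t5
    return b * k1 + (c & k2)

def extended(a, b, c):
    t1 = extop1(a, b, 0xF0)
    t2 = extop2(b, c, 0xF0, 0x12)
    return t1 + t2  # t3

speedup = 9 / 5  # cycle counts quoted on the slide
```

Note that the multiplication `b * 0xf0` is computed in both extended instructions; duplicating it is what lets each pattern stay single-output.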

Subtask (1) ─ Extended Instruction Identification

Find the extended instruction candidates under micro-architectural constraints
Exactly isomorphic patterns
• Heuristic-based pattern grow [Clark, Micro’03] [Sun, ICCAD’02]
• Pattern enumeration [Atasu, DAC’03] [Cong, FPGA’04]
Similar patterns [Brisk, DAC’04] [Cong, FPGA’08]
• Reduce area overhead by sharing resources
• May slow down clock speed

[Figure: example DFG with inputs a, b, c and constants 0xf0, 0x12; nodes n1 (*), n2 (*), n3 (&), n4 (+), n5 (+), n6 (+); candidate patterns map to MUL and ALU units.]

Subtask (2) ─ Instruction Selection

Instruction selection problem
Limited resource budget
Select a subset of instruction candidates to be finally implemented on configurable fabrics
The objective is to maximize performance

Approaches
Greedy [Clark, Micro’03]
Knapsack [Cong, FPGA’04]
Iterative [Atasu, DAC’03]

Pattern   Function   Area   Gain
p1        *+         2      1
p2        &+         2      1
p3        ++         2      1
p4        +++        4      1
p5        (*)+(*)    4      3
p6        (*)+(&)    4      3

If the total area is 8, which instructions should be used?

[Figure: example DFG (n1 to n6) with candidate patterns p1 and p2 highlighted.]

Subtask (3) ─ Application Mapping

Application mapping
A covering problem: cover the application with the extended instruction set
Exploit the extended instructions for maximal gain
Approaches
• Iterative covering [Atasu, DAC’03]
• Binate covering [Liao, ICCAD’95] [Cong, FPGA’04]

[Figure: the example DFG (n1 to n6) and the pattern table (p1 to p6 with function, area and gain) from the instruction selection slide.]

Subtask (1) – Heuristic Method

Grow subgraphs from seed nodes [Clark et al, Micro’03]
All nodes are seeds
Take 4 factors into consideration
• Criticality: combine operations on the critical path
• Latency: combine low-latency nodes to pack more nodes
• Area: prefer nodes with low area overhead
• Input/Output: prefer nodes with fewer input/output ports
Sum of these factors determines the value of each direction

Subtask (1) – Cut Enumeration

Enumerate multiple-input single-output patterns [Cong, FPGA’04]
Cone: a subgraph consisting of node v and its predecessors such that any path connecting a node in the cone and v lies entirely in the cone
Each pattern is an Nin-feasible cone
Cut enumeration is used to enumerate all the k-feasible cones [Cong et al, FPGA’99]
Basic idea: in topological order, merge the cuts of fan-ins and discard those cuts that are not k-feasible

3-feasible cones:
n1: {a, b}
n2: {b, 0xf0}
n3: {c, 0x12}
n4: {n1, n2}, {n1, b, 0xf0}, {n2, a, b}, {a, b, 0xf0}

[Figure: example DFG with inputs a, b, c and constants 0xf0, 0x12; n1, n2, n3 are multiplications, n4, n5, n6 are additions.]
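The merge-and-discard idea can be sketched as follows. This is a simplified illustration rather than the published algorithm: it keeps the trivial cut {v} for every node so that fan-in cuts can be merged, and it uses the example DFG (n4 = n1 + n2, n5 = n2 + n3, n6 = n4 + n5). The function and variable names are hypothetical.

```python
from itertools import product

def enumerate_cuts(fanins, order, k):
    """Return node -> set of k-feasible cuts (each cut is a frozenset of nodes)."""
    cuts = {}
    for v in order:  # order must be topological
        ins = fanins.get(v, [])
        if not ins:  # primary input or constant: only the trivial cut
            cuts[v] = {frozenset([v])}
            continue
        merged = set()
        for combo in product(*(cuts[u] for u in ins)):
            cut = frozenset().union(*combo)
            if len(cut) <= k:  # discard cuts that are not k-feasible
                merged.add(cut)
        merged.add(frozenset([v]))  # trivial cut, used when merging at fan-outs
        cuts[v] = merged
    return cuts

fanins = {
    "n1": ["a", "b"], "n2": ["b", "0xf0"], "n3": ["c", "0x12"],
    "n4": ["n1", "n2"], "n5": ["n2", "n3"], "n6": ["n4", "n5"],
}
order = ["a", "b", "c", "0xf0", "0x12", "n1", "n2", "n3", "n4", "n5", "n6"]
cuts = enumerate_cuts(fanins, order, k=3)
n4_cuts = cuts["n4"] - {frozenset(["n4"])}
```

With k = 3, the non-trivial cuts of n4 are exactly the four cones listed on the slide.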

Subtask (1) – Search Tree (1)

Enumerate all multiple-input multiple-output patterns [Atasu et al, DAC’03]
Nodes are numbered based on a reversed topological sort
Build a search tree to enumerate all the possible patterns

[Figure, from Atasu’s conference slides: example search tree with Nout = 1]

Subtask (1) – Search Tree (2)

Subtree elimination
Prune the search space to speed up the enumeration process
• Based on a violation of the output port constraint
• Based on a violation of the convexity constraint


Subtask (1) – Resource Sharing

Use substring matching to share resources [Brisk, DAC’04]
Colors represent the type of operations
Encode paths as strings
Merge DFGs along matched nodes

[Figure, from Philip Brisk’s conference slides: two DFGs (area = 17 and area = 25) are merged into a resulting datapath with area = 28 (area estimate = 42); the figure also lists per-operation area costs and example graphs G1 to G4.]

Subtask (2) – Greedy Selection Heuristic

Subgraph Number   Value   Cost   Ops
1                 20      4      (3,4),(6,8)
2                 6       1      (1,3,7)
…                 …       …      …
N                 9       5      (1,7)

After selecting a pattern and updating the table:

Subgraph Number   Value   Cost   Ops
1                 10      4      (6,8)
2                 6       1      (1,3,7)
…                 …       …      …
N                 0       5

• Use estimates of performance improvement / cost
• Iteratively pick the pattern that provides the largest value/cost ratio
• Update the table after selecting a pattern

from Nathan Clark’s conference slides
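One greedy step (pick the best value/cost ratio, then update the table) can be sketched as below. The data mirror the slide's table; splitting subgraph 1's value of 20 evenly over its two occurrences is an assumption made for this example, and `greedy_step` is a hypothetical helper name.

```python
def greedy_step(patterns, cost):
    """patterns: name -> list of (ops, value) occurrences.
    Picks the pattern with the largest value/cost ratio, then removes
    occurrences of other patterns that reuse any of its operations."""
    def value(name):
        return sum(v for _, v in patterns[name])

    best = max(patterns, key=lambda n: value(n) / cost[n])
    used = {op for ops, _ in patterns[best] for op in ops}
    for name in patterns:
        if name != best:
            patterns[name] = [(ops, v) for ops, v in patterns[name]
                              if not used & set(ops)]
    return best

patterns = {
    "1": [((3, 4), 10), ((6, 8), 10)],  # assumed 10 per occurrence
    "2": [((1, 3, 7), 6)],
    "N": [((1, 7), 9)],
}
cost = {"1": 4, "2": 1, "N": 5}
chosen = greedy_step(patterns, cost)
values_after = {n: sum(v for _, v in occ) for n, occ in patterns.items()}
```

Subgraph 2 wins with ratio 6/1, and the operations it consumes (1, 3, 7) invalidate one occurrence of subgraph 1 and the only occurrence of subgraph N, matching the updated table.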

Subtask (2) – Knapsack Formulation (1)

Simultaneously consider speedup, occurrence frequency and area [Cong, FPGA’04]
Speedup: the ratio of the latency of the pure software implementation to the latency of the hardware implementation
Occurrence
• Some pattern instances may be isomorphic
• Graph isomorphism test [Nauty package]
• For small subgraphs, the isomorphism test is very fast

Gain(p) = Speedup(p) × Occurrence(p)

Pattern *+: Tsw = 3, Thw = 2, Speedup = 1.5

[Figure: example DFG (n1 to n6).]

Subtask (2) – Knapsack Formulation (2)

Can be formulated as a 0-1 knapsack problem
Given:
• n items (patterns)
• the ith item (pattern) is associated with value (gain) vi and weight (area) wi
• Total weight W (area constraint A)
Objective: select a subset of items to maximize the total value, while the total weight does not exceed W.
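As a sketch of the knapsack formulation, the standard dynamic program can be applied to the six-pattern table from the instruction selection slide (areas and gains for p1 to p6) with an area budget of 8. The function name and data layout are choices made for this example, and overlap between patterns is ignored, as in the plain 0-1 knapsack model.

```python
def knapsack(items, capacity):
    """items: list of (name, area, gain). Returns (best_gain, chosen_names)."""
    # best[w] = (gain, names) achievable with total area <= w
    best = [(0, ())] * (capacity + 1)
    for name, area, gain in items:
        # iterate capacities downward so each item is used at most once
        for w in range(capacity, area - 1, -1):
            cand = (best[w - area][0] + gain, best[w - area][1] + (name,))
            if cand[0] > best[w][0]:
                best[w] = cand
    return best[capacity]

patterns = [("p1", 2, 1), ("p2", 2, 1), ("p3", 2, 1),
            ("p4", 4, 1), ("p5", 4, 3), ("p6", 4, 3)]
gain, chosen = knapsack(patterns, 8)
```

For this table the best choice under area 8 is p5 and p6, with a total gain of 6.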

Subtask (2) (3) – Iterative Method (1) [Atasu et al, DAC’03]

How to select M patterns within a single basic block?
Build an (M+1)-ary tree
Branch 0 means that the node is not included in any pattern
Branch i (i>0) means that the node is in pattern i

[Figure: a sample search tree to identify two subgraphs (M=2) from a single basic block]

Subtask (2) (3) – Iterative Method (2)

How to select M patterns within multiple basic blocks?
Add one pattern at a time
At each iteration, check in which basic block an additional pattern brings the largest gain

[Figure: identification of 3 subgraphs within 3 basic blocks]

Subtask (3) – Binate Covering

Modeled as a binate covering problem [Cong, FPGA’04]
Covering the sink node
Covering the inputs of the selected pattern

Pattern   Function   Cost   Covers
p0        +          1      n6
p1        +          1      n5
p2        +          1      n4
p3        *          2      n3
p4        *          2      n2
p5        *          2      n1
p6        *+         2      n1, n4
p7        *+         2      n2, n4
p8        *+         2      n2, n5
p9        *+         2      n3, n5
p10       (*)+(*)    2      n1, n2, n4
p11       (*)+(*)    2      n2, n3, n5

Covering clause:
p0 → p2 + p6 + p7 + p10
¬p0 + (p2 + p6 + p7 + p10)

[Figure: example DFG with inputs a, b, c and constants 0xf0, 0x12; n1, n2, n3 are multiplications, n4, n5, n6 are additions.]
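For a table this small, the covering constraints can be checked by brute force. The sketch below encodes each pattern by its cost, its sink node, and the internal nodes it reads; the sink and input sets are derived here from the "Covers" column and the example DFG (n4 = n1 + n2, n5 = n2 + n3, n6 = n4 + n5), so they are an assumption of this example. A real binate covering solver would not enumerate all subsets.

```python
from itertools import combinations

# pattern -> (cost, sink node, internal nodes that feed the pattern)
PATTERNS = {
    "p0": (1, "n6", {"n4", "n5"}),
    "p1": (1, "n5", {"n2", "n3"}),
    "p2": (1, "n4", {"n1", "n2"}),
    "p3": (2, "n3", set()),
    "p4": (2, "n2", set()),
    "p5": (2, "n1", set()),
    "p6": (2, "n4", {"n2"}),
    "p7": (2, "n4", {"n1"}),
    "p8": (2, "n5", {"n3"}),
    "p9": (2, "n5", {"n2"}),
    "p10": (2, "n4", set()),
    "p11": (2, "n5", set()),
}

def is_valid(selection, output="n6"):
    """The output must be produced, and every input of a selected pattern
    must be the sink of some selected pattern (p0 -> p2+p6+p7+p10, etc.)."""
    sinks = {PATTERNS[p][1] for p in selection}
    if output not in sinks:
        return False
    return all(PATTERNS[p][2] <= sinks for p in selection)

def min_cost_cover():
    names = sorted(PATTERNS)
    best = None
    for r in range(1, len(names) + 1):
        for sel in combinations(names, r):
            if is_valid(sel):
                c = sum(PATTERNS[p][0] for p in sel)
                if best is None or c < best[0]:
                    best = (c, set(sel))
    return best

cost, cover = min_cost_cover()
```

Under these assumptions the cheapest cover uses p0 together with the two (*)+(*) patterns p10 and p11.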

Application-Specific Processor Network (ASPN)

ASPN consists of processor cores, logic blocks, memories, communication channels and peripherals
ASPN enables a higher level of abstraction and higher productivity
Extend the standard cell-based methodology for ASIC design to an application-specific processor-based design paradigm

[Figure: example ASPN with a graphics engine, audio processing, I/O peripherals, on-chip memory, DSP, programmable logic, ARM and MIPS cores, encryption engine, digital baseband, PMU, scratch-pad memory, memory controller and ASIPs, connected through a network of routers (R).]

The Era of ASPN

Multicore processors enter the mainstream
Moore’s law continues to apply in the multicore era
Intel, AMD, IBM and Sun have launched their dual-core and quad-core processors

Hardware accelerators
Examples: Synergistic Processing Elements (SPE) in the CELL processor, encryption engines in Niagara II, accelerators implemented on FPGA chips
Perform kernel computations in hardware to increase performance
• Customized implementations to explore coarse / fine granularity of parallelism
• Reduce heat dissipation and power consumption (0.1 GFLOPS/Watt on AMD 2.5GHz vs. 0.9 GFLOPS/Watt on Stratix III FPGA)
• Adapt to a wide range of applications and evolving standards

[Figure: example accelerator blocks: Parser FSM, Texture IDCT, Motion Comp., Copy Controller, Texture Update, Display Controller.]

ASPN Synthesis

[Figure: processors (μP) running tasks on an OS with drivers, each connected through a network interface and buffer; the two steps shown are architecture synthesis and application mapping.]

Design challenges
How to construct the optimal architectures for the given application (architecture synthesis)?
How to partition and map jobs to the synthesized architectures (application mapping)?

Synthesis Works

Optimize throughput
Modified list scheduling [Hoang, 1993]
Adaptive multi-objective genetic algorithm [Dick, 1998] [Grajcar, 1999]
Modulo scheduling [Karkowski, 1997]
Integer linear programming to map tasks to a fixed number of processors [Jin, 2005]
Iterative balanced partitioning [Dai, 2005] [Yu, 2007]

Optimize latency
Merge tasks greedily + list scheduling [Sarkar, 1986]
Formulate as a mixed integer linear program [Prakash, 1991]
Iterative method [Wolf, 1997]

Optimize latency under throughput constraint
Labeling + clustering approach [Cong, 2007]

Throughput Optimization

Recursively bipartition and refine [Yu, DAC’07]
Implemented on multiple homogeneous PEs to improve system throughput
Throughput is determined by the maximal stage latency
The stage latency for each stage is the sum of its processing time and communication time

[Figure: a task graph with node and edge weights implemented two ways: 34 us stage latency with 34 us latency, and 10 us stage latency with 30 us latency.]
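The relation between stage latency and throughput stated above can be written out directly. The per-stage processing and communication times below are hypothetical values chosen so that the bottleneck is 10 us, as in the figure's pipelined case; the actual breakdown is not given on the slide.

```python
def stage_latencies(stages):
    """stages: list of (processing_time_us, communication_time_us)."""
    return [proc + comm for proc, comm in stages]

# Hypothetical three-stage pipeline (times in microseconds).
stages = [(8, 2), (9, 1), (7, 3)]
latencies = stage_latencies(stages)
bottleneck_us = max(latencies)           # maximal stage latency
throughput_per_us = 1.0 / bottleneck_us  # one result per bottleneck period
```

Refinement then amounts to moving work out of whichever stage attains `bottleneck_us`.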

Resource Balanced Bipartitioning

Bipartition the program recursively into a number of stages with minimum communication cost
Compute the cut_ratio r for bipartitioning
• Compute r based on the number of stages (r = 2:(3+2) if we have 5 stages)
Apply r-balanced min-cut partitioning to get subgraphs G1 and G2
• Min-cut partition to minimize communication cost
Allocate PEs to G1 and G2 based on their processing times

[Figure: using 4 processors to implement a 3-stage pipeline system; the task graph is partitioned and each partition is allocated 1 PE or 2 PEs.]

Refinement

Refinement procedure is used to improve the quality of initial results
Migrate tasks from bottleneck stages to non-bottleneck ones
Only move tasks to neighboring stages that can accommodate additional tasks

[Figure: using 4 processors to implement a 3-stage pipeline system, one PE per partition, with the bottleneck PE highlighted.]

Latency Optimization

Optimize system latency for a given task graph on heterogeneous multiprocessor systems [Prakash, DAC’91]
Create a mixed integer linear programming model (the model has been simplified in this tutorial)
Variables
• Binary variables
  Subtask to processor mapping: σd,a = 1 if the task Sa maps to the processor Pd
  Communication mapping: γa1,a2 = 1 if Sa1 and Sa2 have remote communication (0 otherwise)
• Real variables
  Input data available time Tia of task a
  Output data available time Toa of task a
  Task a’s start time Tsa
  Task a’s finish time Tea

MILP Constraints (1)

Add constraints to ensure the proper ordering of tasks and communications

Processor selection constraint: exactly one processor is used for task Sa
$\sum_{d \in P} \sigma_{d,a} = 1$

Data-transfer constraint: γa1,a2 = 1 if Sa1 and Sa2 are implemented on different processors
$\gamma_{a1,a2} = 1 - \sum_{d \in P} \sigma_{d,a1}\,\sigma_{d,a2}$

Task start time constraint: task Sa cannot be executed until its inputs are ready
$T_{ia} \le T_{sa}$

MILP Constraints (2)

Task finish time constraint: it depends on the processor used
$T_{ea} = T_{sa} + \sum_{d \in P} \sigma_{d,a}\,\mathrm{Delay}(\mathrm{Type}(d), S_a)$

Input data available time constraint: it depends on the finish time of the data producer and the communication volume if they are mapped onto different processors
$T_{ia2} \ge T_{ea1} + \gamma_{a1,a2}\,\mathrm{Comm}(V_{a1,a2})$

Other constraints: to ensure that the hardware resources are shared correctly

Optimization objective: minimize latency Tf
$T_f \ge T_{ea}$ for each task a
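To see how the timing constraints interact, the sketch below evaluates them for one fixed mapping instead of solving the MILP. It assumes a1 is the producer and a2 the consumer on each edge, takes each start time as early as the constraints allow, and deliberately ignores the resource-sharing constraints, so it is only valid when tasks mapped to the same processor do not overlap. All names and numbers are hypothetical.

```python
def evaluate_latency(tasks, edges, mapping, ptype, delay, comm):
    """Evaluate T_f for a fixed task-to-processor mapping.

    tasks   : task names in topological order
    edges   : list of (producer, consumer, volume)
    mapping : task -> processor (the sigma variables, already decided)
    ptype   : processor -> processor type
    delay   : (type, task) -> execution time
    comm    : function volume -> transfer time for remote communication
    """
    t_end = {}
    for a in tasks:
        t_in = 0.0
        for a1, a2, volume in edges:
            if a2 != a:
                continue
            gamma = 0 if mapping[a1] == mapping[a2] else 1  # remote or local
            t_in = max(t_in, t_end[a1] + gamma * comm(volume))
        t_start = t_in  # earliest start with T_ia <= T_sa
        t_end[a] = t_start + delay[(ptype[mapping[a]], a)]
    return max(t_end.values())  # smallest T_f with T_f >= T_ea for all a

tasks = ["S1", "S2", "S3"]
edges = [("S1", "S3", 4), ("S2", "S3", 2)]
ptype = {"P1": "fast", "P2": "slow"}
delay = {("fast", "S1"): 2, ("fast", "S2"): 3, ("fast", "S3"): 4,
         ("slow", "S1"): 4, ("slow", "S2"): 5, ("slow", "S3"): 6}
mapping = {"S1": "P1", "S2": "P2", "S3": "P1"}
t_f = evaluate_latency(tasks, edges, mapping, ptype, delay, lambda v: 0.5 * v)
```

Here S3 waits for S2's result to arrive from the other processor (5 + 1 time units) before running for 4 units, giving a latency of 10.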

Optimize Latency Under Throughput Constraints

Analogous to the problem of circuit clustering for delay minimization in VLSI physical design
Labeling + Clustering [Cong, FPGA’07]

Labeling
Define the label of a node as the minimum pipeline stage where it can be executed

[Figure: task graph with nodes a to i and their weights, stage period = 14; each node is annotated with its label (l = 0, 1 or 2).]

Clustering For DAG

Generate clusters in the reversed topological order
Theorem 1: the labeling and clustering algorithms generate latency-optimal pipeline solutions for directed acyclic task graphs in O(|V|^2)
Theorem 2: if every task’s computation time is no less than its communication time, we can generate a latency-optimal, duplication-free solution.

[Figure: node c duplicated as c’.]

Relaxed Model – Allow Inter-stage Communication

[Figure: the same task graph (nodes a to i, with c duplicated as c’) clustered with stage period = 14: 3 stages with no inter-stage communication, 2 stages when inter-stage communication is allowed.]

Branch and Bound Algorithm for Cluster Generation

NP-complete problem
Need to calculate the earliest finish time (MinTime) for each node
Label the node with <L, MinTime> in topological order
Apply pruning techniques to save computation time

[Figure: branch-and-bound tree over the choices include a, include b, include d, include e (Y/N) for a task graph with nodes a to e; the candidate clusters give MinTime(e) = 14, 14, 14 and 13.]

MCSim: An Efficient Simulation Tool for Heterogeneous Multi-core Systems [Cong]

Goals
Provide a framework to explore future ASPNs
Scalable/Fast
• Can we tractably simulate 64+ cores?
• Can simulations be parallelized for greater speedup?
Synthesizable coprocessors and NoC
• Helps measure physical characteristics and validates functionality
Support for a variety of benchmarks/workloads
• Multitasked single-threaded workloads
• Cooperative shared-memory multithreaded applications
Modular
• Plug and play with different models
Metrics of interest
• Performance
• Power
• Area

MCSim Structure

[Figure: SESC instances (each on MINT, with cores C and coprocessors Co, tightly or loosely coupled) exchange messages through a functional network switch with a cache controller, L2 banks and a central page handler; a SystemC NoC model returns message latencies.]

• Each SESC instance is a number of cores cooperating on a single (potentially multithreaded) application
• May be loosely or tightly coupled with application-specific coprocessors
• A number of cache banks
• Central page handler: allows support for multitasking
• A functional network switch to functionally route messages between components
• A SystemC NoC model to accurately model latency and power

Coprocessor Simulation Model Generation Flow

[Figure: flow with the following elements: C specification, front-end compiler, SSDM/CDFG, platform-based behavioral synthesis (with platform description & constraints), FSMD/SSDM, coprocessor simulator generation, cycle-accurate performance model in C; also shown: data model, coprocessor-processor interface.]

Accuracy & Simulation Speed of the Generated Performance Models

Benchmark                  #Cycles   SystemC (sec)   C (sec)   Speedup
dct                        147       0.128           0.004     32.8
idct                       283       0.219           0.004     56.2
Motion Compensation (MC)   873       4.170           0.086     48.5
pipelined MC               303       1.230           0.030     41.0
sha                        396       0.240           0.005     48.0

(SystemC and C columns: simulation speed in seconds)

MCSim Simulator Validation - Litho Simulation

Partitioning   3x3    4x4     5x5     6x6
Speedup        7.58   11.61   14.49   19.59

ALUT     Memory Bits   Fmax (MHz)   Speedup
25042    2,972,876     115.58       15.52

• About 1000 lines of ANSI-C code
• Generate 1,1000 lines of VHDL code for 5x5 partitioning

Off by only 7%

References - 1

C-J Tseng, D. Siewiorek, "Automated Synthesis of Data Paths in Digital Systems," IEEE Trans. On CAD, V.CAD-5, N.3, pp. 379-395, July 1986.
F. Kurdahi, A. Parker, "REAL : A Program for Register Allocation," Proc. of DAC24, pp. 210-215, 1987.
K. Choi and S. P. Levitan, “A flexible datapath allocation method for architectural synthesis,” ACM Trans. Des. Autom. Electron. Syst., vol. 4, no. 4, pp. 376–404, 1999.
S. Devadas and A. Newton, “Algorithms for hardware allocation in data path synthesis,” IEEE Trans. Computer-Aided Design, vol. 8(7), pp. 768–781, July 1989.
C. Gebotys and M. Elmasry, “Optimal synthesis of high-performance architectures,” IEEE J. Solid-State Circuits, vol. 27(3), pp. 389–397, Mar. 1992.
M. Rim, R. Jain, and R. De Leone, “Optimal allocation and binding in high-level synthesis,” in Proc. of the 29th ACM/IEEE Conference on Design Automation (DAC’92), 1992, pp. 120–123.
T. Kim and C. L. Liu, “An integrated data path synthesis algorithm based on network flow method,” Proc. of the IEEE Custom Integrated Circuits Conference, vol. 1-4, pp. 615–618, May 1995.
J. M. Chang and M. Pedram, “Register allocation and binding for low power,” in Proc. Design Automation Conf., June 1995, pp. 29–35.
J. M. Chang and M. Pedram, “Module Assignment for Low Power,” Conf. on European Design Automation, 1996, pp. 376–381.
C. G. Lyuh and K. Taewhan, “High-level Synthesis for Low-Power Based on Network Flow Method,” IEEE Trans. on VLSI Systems, 2003, 11(3): 364–375.
D. Chen, J. Cong, Y. Fan and J. Xu, "Optimality Study of Resource Binding with Multi-Vdds," Proceedings of the 2006 Design Automation Conference, San Francisco, CA, pp. 580-585, July 2006.
L. Stok, “Architectural Synthesis and Optimization of Digital Systems,” Ph.D. Dissertation, Eindhoven University of Technology, 1991.
Shih-Hsu Huang, Chun-Hua Cheng, Yow-Tyng Nieh, Wei-Chieh Yu, “Register binding for clock period minimization,” DAC 2006.

References - 2

E. Mussoll and J. Cortadella, “High-level synthesis techniques for reducing the activity of functional units,” in Proc. Int. Symp. Low Power Design, Apr. 1995, pp. 99–104.
S. Dey, A. Raghunathan, N. K. Jha, and K. Wakabayashi, “Controller-based power management for control-flow intensive designs,” IEEE Trans. Computer-Aided Design, vol. 18, no. 10, pp. 1496–1508, Oct. 1999.
C.-Y. Huang, Y.-S. Chen, Y.-L. Lin, and Y.-C. Hsu, “Data path allocation based on bipartite weighted matching,” in Proc. of the 27th Conference on Design Automation, 1990, pp. 499–504.
S. Hong and T. Kim, “Bus optimization for low-power data path synthesis based on network flow method,” in Proc. Int. Conf. Computer-Aided Design, Nov. 2000, pp. 312–317.
P. G. Paulin, J. P. Knight and E. F. Girczyc, "HAL: A Multi-Paradigm Approach to Automatic Data Path Synthesis," 23rd Design Automation Conference, pp. 263-270, Jul. 1986.
B. M. Pangrle, "Splicer: A Heuristic Approach to Connectivity Binding," 25th ACM/IEEE Design Automation Conference, pp. 536-541, Jun. 1988.
D. Chen and J. Cong, "Register Binding and Port Assignment for Multiplexer Optimization," Proceedings of the Asia Pacific Design Automation Conference, pp. 68-73, January 2004.
B. M. Pangrle and D. D. Gajski, “Design tools for intelligent silicon compilation,” IEEE Trans. CAD, 1987.
P. G. Paulin and J. P. Knight, “Force-directed scheduling for the behavioral synthesis of ASICs,” IEEE Trans. CAD, 1989.
D. Shin and J. Kim, “Optimizing intra-task voltage scheduling using data flow analysis,” ASPDAC’05.
M. C. Johnson and K. Roy, “Datapath scheduling with multiple supply voltages and level converters,” ACM Trans. Des. Autom. Electron. Syst. 2, 3 (Jul. 1997), 227-248.
X. Tang, H. Zhou and P. Banerjee, “Leakage power optimization with dual-Vth library in high-level synthesis,” DAC’05.
……

Outline

Foundation of platform-based design and Metropolis framework (Alberto/Douglas)
Real-life examples (all)




Outline (Radu’s part)

Part I
Packet-based communication and NoC design
Mapping and routing issues
VFIs and NoCs power management

Part II
Performance optimization via buffer sizing
Exploiting small world effects
Fault-tolerance and scalability issues

Conclusion
A 3D view on NoC design
FPGA prototype

SoC universe

[Figure: design flow elements: application and architecture, mapping, performance evaluation, communication refinement, simulation/prototyping.]

Implementing on-chip multiprocessor systems brings concurrency and communication to the forefront of the design process!

“The chip IS the network”

Design constraints
High performance
Low power and reliable operation
Time-to-market (ease of design, modularity, CAD tools)
Cost

On-chip networks
Modularity, scalability
Better predictability
Higher bandwidth
Concurrent communication
Energy efficiency

[Figure: buses, P2P irregular architectures, and regular NoC architectures.]

Packet-based communication: this is the focus today!
Regular architectures implementing application-specific NoCs

A NEW science of networks needs to emerge

There are similarities
NoC approach is inspired by the success of macro networks
Share some concepts (i.e. topology, routing, etc.)

… but also major differences between NoCs and macro networks
Resource limitation
• Much less area overhead possible. Buffer space is very limited.
Energy efficiency
• Energy of global communication does not scale down with device scaling
Design time specialization
• NoCs are usually developed for a specific set of application(s)
• Traffic is also application-specific
• Trade-offs among buffer space and quality of video, power and performance

New design methodologies and tools are needed for NoCs!

NoC design issues

Static
Communication infrastructure
• Topology (mesh, hypercube, …)
• Buffer size (uniform, preferential)

Dynamic
Communication paradigm
• Routing (deterministic, stochastic, …), flow-control
• Traffic (uniform, bursty, …)

Optimization
Mapping applications onto architectures
• Performance, energy
• Fault-tolerance, …

NoC design space

[Figure: design quality vs. design effort, with increasing customization level and flexibility from a fixed standard architecture, to a semi-customized architecture, to a customized architecture. Labels in the figure: standard topologies, explore mapping & routing; fixed topology and routing, explore mapping; buffer allocation, long-range links; arbitrary topologies. Citation groups shown: DATE’03, DATE’04, CODES’04, etc.; DAC’04, ICCAD’04, DATE’05, ISQED’07, etc.; ICCD’02, ICCD’04, DATE’05, DAC’07, etc.]

Packet-based communication

Performance and power dissipation are two major design constraints
Regularized, tile-based network-on-chip architecture
• Well-controlled electrical parameters
• Reliable interconnect
• High performance

[Figure: tile array (tiles (0,0), (0,1), (1,1), (2,1) labeled); each tile has a processing element and a communication wrapper with input buffers, switch fabric and output buffers.]

What does a tile look like?

[Figure: one tile = processing core + router. The router has a buffered input port and an output port for each of West, East, North, South and the processor, plus a routing table and a crossbar switch.]

Questions to address
Buffer size?
Topology?
Mapping, routing, etc.…

MIT’s RAW processor

[Figure: RAW tile array showing computation resources; longest wire = length of tile (462 Gb/s @ 225 MHz).]

A Scalable 32-bit Fabric for General Purpose and Embedded Computing

Source: http://cag.lcs.mit.edu/raw

Intel’s 80-Tile 1.28TFLOPS NoC

S. Vangal et al., “An 80-Tile 1.28TFLOPS Network-on-Chip in 65nm CMOS,” Proc. IEEE Intl. Solid-State Circuits Conf., 2007


Energy-aware application mapping

[Figure: a 4x4 tile-based architecture (tiles (0,0) to (3,3), each with network logic) and an Application Characterization Graph (APCG) with cores ASIC1, ASIC2, CPU1, DSP1, DSP2, DSP3, whose edges are annotated with volume and bandwidth (5Mb | 2Gb/s, 4Mb | 1.5Gb/s); routing connects the mapped cores.]

[Figure: MPEG design with ad-hoc mapping vs. MPEG4 design with energy-aware mapping; blocks: Video in, ME, MC, DCT/Q, IDCT/IQ, VLE, VOP def., Recon. frame, Frame buffer, Video out.]

Cycle-accurate simulations show ~50% communication energy savings!

Problem formulation
Given an APCG and an ARCG with size(APCG) ≤ size(ARCG)

Find a mapping function map( ) from the APCG to the ARCG and a deadlock-free, minimal routing function R( ) which minimizes:

Such that:

Where is the bandwidth of link and:

217

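How does the energy-aware mapping work?
Based on a branch-and-bound algorithm

Search tree
• Root node: nothing mapped yet (xxxx)
• Internal node: partial mapping (e.g. 2xxx, 23xx, 231x)
• Leaf node: one feasible complete mapping (e.g. 2314, 2341)

[Figure: search tree for mapping IP0, IP1, IP2 and IP3 onto tiles 1 to 4. The root xxxx branches into 1xxx, 2xxx, 3xxx and 4xxx; 2xxx branches into 21xx, 23xx and 24xx; 23xx branches into 231x and 234x, which end in the leaves 2314 and 2341.]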
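The search tree described on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the cost of a mapping is taken to be communication volume times Manhattan hop count (a stand-in for the bit-energy model), the lower bound of a partial mapping is simply the cost of the edges already placed, and the names `hops` and `map_branch_and_bound` are hypothetical.

```python
def hops(a, b):
    """Manhattan distance between two tiles given as (row, col)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def map_branch_and_bound(num_ips, tiles, volume):
    """Place IPs 0..num_ips-1 on distinct tiles; return (cost, mapping).

    volume maps an IP pair (i, j) to its communication volume.
    Assumed cost model: sum over edges of volume * hop count.
    """
    best = {"cost": float("inf"), "mapping": None}

    def placed_cost(mapping):
        # Cost of the edges whose two endpoints are already placed.
        # Costs are non-negative, so this is a valid lower bound.
        return sum(v * hops(mapping[i], mapping[j])
                   for (i, j), v in volume.items()
                   if i in mapping and j in mapping)

    def search(mapping, free_tiles):
        bound = placed_cost(mapping)
        if bound >= best["cost"]:
            return  # prune: this subtree cannot beat the best leaf so far
        if len(mapping) == num_ips:
            best["cost"], best["mapping"] = bound, dict(mapping)  # leaf node
            return
        ip = len(mapping)  # next IP to place (internal node)
        for tile in sorted(free_tiles):
            mapping[ip] = tile
            search(mapping, free_tiles - {tile})
            del mapping[ip]

    search({}, frozenset(tiles))
    return best["cost"], best["mapping"]


tiles = [(0, 0), (0, 1), (1, 0), (1, 1)]
volume = {(0, 1): 10, (1, 2): 5, (2, 3): 10, (0, 3): 1}
cost, mapping = map_branch_and_bound(4, tiles, volume)
print(cost, mapping)
```

A tighter lower bound (for example, adding one hop per not-yet-placed edge) would prune more of the tree; the simple bound is used here to keep the sketch short.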

218

Can we do better?

Main idea: Exploiting routing flexibility helps expand the solution space but makes the problem even more complex

[Figure: the 4x4 tile-based architecture, tiles (0,0) through (3,3), and a communication task graph with nodes ASIC1, ASIC2, CPU1, DSP1, DSP2 and DSP3 whose edges are labeled 2Gb/s and 1.5Gb/s.]

Assume the link bandwidth is only 3.0Gb/s

219

Exploiting routing flexibility

1. Helps in finding solutions for architectures with lower link bandwidth
• Lower implementation cost

2. Leads to solutions with less energy consumption

526Mb/s

476Mb/s

500Mb/s

4.80J

4.50J

3.12J

220

NoC design with multiple VFIs

[Figure: a NoC partitioned into voltage/frequency islands VFI 1 (V1, f1, Vt1), VFI 2 (V2, f2, Vt2) and VFI 3 (V3, f3, Vt3), with mixed clock / mixed voltage FIFOs between islands]

NoC architecture is partitioned into multiple VFIs

Globally asynchronous, locally synchronous (GALS) communication
• Each VFI can work at its own speed, while communication across different voltage islands is achieved through mixed clock/mixed voltage FIFOs

221

[Figure: two routers, one in Clock Domain 1 and one in Clock Domain 2. Each has a crossbar switch, FIFOs with output controllers (OC) and a PE FIFO. Mixed clock FIFOs are used within a domain and a mixed clock/mixed voltage FIFO connects the two domains.]

Interface between two VFI domains

222

VFI synthesis
VFI design choices
• Chip partitioning
• Given a VFI partitioning, assign the supply and threshold voltages

Each node in the network is a separate VFI
• Possibly largest energy savings, but very costly
• Mixed-clock/voltage FIFOs, voltage converters and power distribution

[Figure: application energy consumption and area and energy overhead plotted against an increasing level of granularity, with “our target” marked]

223

Design methodology for multi-VFI NoCs

NoC architecture (topology, routing, etc.)
Application

Scheduling

VFI Partitioning & Static Voltage-Frequency Assignment

Interface Design for Voltage-Frequency Islands

Dynamic Voltage and Frequency Scaling (DVFS) On-line

Workload Characterization

ASPN synthesis (UCLA)
• Voltage-frequency levels of customized processors

COSI & latency-insensitive design (UCB, Columbia Univ.)
• Interaction to achieve energy optimization

Interaction with physical design (UCSD)
• Technology parameters, variability, etc.

System-level

Micro-architectural level

Physical level

224

VFI partitioning problem
Given
• NoC architecture and a schedule for the driver application
• Maximum number of allowed VFIs and physical constraints

Find
• VFI partitioning (i.e., optimum number of VFIs, n ≤ N)
• Assignment of the supply and threshold voltages to each island

Such that the total energy consumption is minimized

$$E_{Total} = E_{App} + \sum_{i=1}^{n} E_{VFI}(i)$$

where E_App is the application (useful) energy consumption (comp + comm), E_VFI(i) is the overhead of the ith VFI, and n is the number of VFIs

225

Voltage/frequency assignment problem
Given a VFI partitioning

Find supply (V_i) and threshold (V_ti) voltage assignments
Such that application energy consumption is minimized

$$\min E_{App} = \sum_{\forall i \in T} E_i(V_i, V_{ti}) + \sum_{\forall i \in T} \sum_{\forall j \in T} vol(i,j)\, E_{bit}(i,j)$$

where the first term is the energy consumed when the task is executed at (V_i, V_ti) and the second term is the communication energy consumption

Subject to the following deadline constraints per task t:

$$\frac{x_t}{f_t} + t_{Comm} \le deadline_t - start\ time_t$$

where x_t / f_t is the execution time and t_Comm is the communication delay

226

VFI partitioning algorithm

1. Given an initial partitioning with N islands, find the static voltages (solve the voltage/frequency assignment problem)
2. For all pairs of neighboring islands (i, j):
   • Merge VFIs i and j
   • Solve static VF assignment problem
   • Compute the energy consumption
3. Merge the pair of islands that provides the minimum energy
4. Update the VFI configuration and repeat from step 2
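The merge loop of the VFI partitioning algorithm can be illustrated with a small Python sketch. Several pieces are assumptions made only for this example: the initial partitioning puts every tile in its own island, the static voltage assignment is replaced by a toy rule (an island runs at the highest voltage any of its tiles requires), per-tile energy is taken as V squared, and each island pays a fixed overhead. The function names are hypothetical.

```python
from itertools import combinations


def total_energy(islands, v_required, overhead):
    """Toy E_Total: sum of per-tile V^2 at the island voltage, plus a
    fixed overhead per island (both are assumptions of this sketch)."""
    energy = 0.0
    for island in islands:
        v_island = max(v_required[t] for t in island)  # toy VF assignment
        energy += len(island) * v_island ** 2 + overhead
    return energy


def are_neighbours(a, b):
    """Two islands are neighbours if any of their tiles are adjacent."""
    return any(abs(p[0] - q[0]) + abs(p[1] - q[1]) == 1 for p in a for q in b)


def partition_vfis(v_required, overhead):
    """Greedy merging; returns (energy, islands) of the best configuration seen."""
    islands = [frozenset([t]) for t in sorted(v_required)]
    best = (total_energy(islands, v_required, overhead), islands)
    while len(islands) > 1:
        candidates = []
        for a, b in combinations(islands, 2):
            if are_neighbours(a, b):
                merged = [x for x in islands if x not in (a, b)] + [a | b]
                candidates.append(
                    (total_energy(merged, v_required, overhead), merged))
        if not candidates:
            break
        # Merge the pair that gives the minimum energy, then iterate.
        energy, islands = min(candidates, key=lambda c: c[0])
        if energy < best[0]:
            best = (energy, islands)
    return best


# Four tiles in a row: two low-voltage tiles followed by two high-voltage tiles.
v_required = {(0, 0): 1.0, (0, 1): 1.0, (0, 2): 2.0, (0, 3): 2.0}
energy, islands = partition_vfis(v_required, overhead=0.5)
print(energy, [sorted(i) for i in islands])
```

With these toy numbers the greedy loop settles on two islands, which mirrors the trade-off on the slide: more islands lower the application energy, but each island adds overhead.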

227

Energy savings for a 5x5 multi-VFI NoC

               1-VFI     2-VFI     3-VFI
Energy cons.   6.9 mJ    2.6 mJ    1.5 mJ

[Figure: island assignment by tile ID for the 1-VFI, 2-VFI and 3-VFI configurations]

228

Run-time application mapping

Time

Applications : Application comes in

: DVFS

App 1: VFI1, VFI2App 2

App 3

Selection of this region is very important!

229





230

What does an on-chip router look like?

[Figure: on-chip router with North, East, West, South and Local input FIFOs, address decoders and channel control, a crossbar switch with its crossbar arbiter, and North, East, West, South and Local output channels]

What should the proper buffer size of each input FIFO be?

231

Impact of buffer size on router area

Prototype router layout (buffer = 4 words)

Router size vs. buffer capacity

Buffer

Other logic

Buffering resources for on-chip routers consume significant area. To reduce the chip cost, the use of these resources has to be minimized.

232

Impact of buffer size on performance

System performance for different buffer capacities

Histogram for 1000 different random buffer configurations

Uniform: 1856 cycles

Best random: 187 cycles

• Most NoCs are application specific and demonstrate specific traffic patterns.
• With limited buffering space available, it is important to carefully allocate these resources to each channel to match the traffic pattern of the given application.

862

64

233

Problem formulation

Given:
• Total available buffering budget B
• Application communication characteristics
• Architecture-specific packet servicing time S and routing function R
• S characterizes the packet service time in a router without contention

Determine:
• Buffer length for each input channel
• Minimize the average packet latency L

$$\min(L) \quad \text{s.t.} \quad \sum_{\forall x} \sum_{\forall y} \sum_{\forall dir} l_{x,y,dir} \le B$$

234

Router/Channel Analytical Models

[Figure: router R(x,y) with its North, East, West, South and Local input FIFOs, crossbar switch, crossbar arbiter and North, East, West, South and Local output channels. Channels of this router are labeled C(x,y,N), C(x,y,E), C(x,y,W), C(x,y,S), C(x,y,L) and C(x,y,LO); channels of the neighbouring routers are labeled C(x-1,y,E), C(x+1,y,W), C(x,y-1,N) and C(x,y+1,S).]

235

Evaluation under realistic traffic
Applying the algorithm for applications that mimic real traffic
• Automotive
• Telecommunication
• Audio-video multimedia systems

To achieve the same or better performance as CNoC, the UNoC has to consume 12.5, 3.5 and 6 times more buffering space.

[Figure: packet latency (cycles) comparison for the Automotive, Telecom and Audio-video benchmarks; series UNoC (B=96), UNoC (B=144) and CNoC (B=96); vertical axis from 0 to 2000 cycles; labeled values 908.3, 537.4 and 181.6]

236

‘Regularized’ MPEG-2 decoder

Map MPEG app onto tiles and use a network for inter-tile communication

Experiments
• Collect network traffic traces via simulation (time stamp, #bytes arrived)
• Analyze macroblock-level statistical properties of the resulting time series
• Build formal models of packet size distributions

Trace 1

Trace 2

237

Surprising result: On-chip self-similar behavior

Expectation: typical short-range dependent process
[Figure: autocorrelation coefficient (-1 to +1) vs. lag k (0 to 100) for a typical short-range dependent process. Interactions “die out”.]

Reality: typical long-range dependent process
[Figure: autocorrelation coefficient (-1 to +1) vs. lag k (0 to 100) for a typical long-range dependent process]

The rate at which autocorrelation decays is described by the Hurst parameter (H)

Self-similar (fractal) processes model long-range dependence (0.5 < H < 1.0)

Trace analysis from regularized MPEG (Hawaii video)
[Figure: Hawaii video data compared with ideal Markov (short-range) dependency and ideal fractal behavior]

…but that’s not what happens. This “heavy tailed” distribution confirms long-range interactions
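One common way to estimate the Hurst parameter of a traffic trace is the aggregated-variance method: average the series over blocks of size m and use the fact that the variance of the block means scales as m^(2H-2). The Python sketch below is a generic illustration of that estimator, not the analysis used for the MPEG traces in this tutorial; the function name and the block sizes are assumptions. It is run on white noise, which has no long-range dependence, so the estimate should land near H = 0.5.

```python
import math
import random


def hurst_aggregated_variance(series, block_sizes):
    """Estimate H from the slope of log(var of block means) vs. log(m)."""
    points = []
    for m in block_sizes:
        k = len(series) // m  # number of complete blocks
        means = [sum(series[i * m:(i + 1) * m]) / m for i in range(k)]
        mu = sum(means) / k
        var = sum((x - mu) ** 2 for x in means) / (k - 1)
        points.append((math.log(m), math.log(var)))
    # Least-squares slope of the log-log points.
    mean_x = sum(p[0] for p in points) / len(points)
    mean_y = sum(p[1] for p in points) / len(points)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in points)
             / sum((x - mean_x) ** 2 for x, _ in points))
    return 1.0 + slope / 2.0  # var ~ m^(2H - 2)  =>  H = 1 + slope / 2


rng = random.Random(0)
white_noise = [rng.gauss(0.0, 1.0) for _ in range(20000)]
h_estimate = hurst_aggregated_variance(white_noise, [1, 2, 4, 8, 16, 32, 64])
print(round(h_estimate, 2))
```

For a long-range dependent trace the same estimator would return a value between 0.5 and 1.0, because the variance of the block means decays more slowly than 1/m.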

238

Designing on-chip networks is quite unique!

[Figure: simulation results compared with the LRD model analytical prediction and the Markovian model analytical prediction; axis ticks 0.01 and 0.1; one region is annotated “Bad movie quality!”]

Implications of long-range dependent traffic on on-chip network design
• The average delay of a buffer increases sharply at surprisingly low utilization factors
• If ignored, this produces optimistic performance predictions and inadequate resource allocation

239





240

When it comes to silicon, regularity is good!

Fully structured
• Well-controlled parameters
• Low power
• Simple to layout
• Large inter-node distance
• Large latency
• Not application specific

Fully customized
• Small inter-node distance
• Better performance
• Widely varying links
• Loss of structure
• Wire routing, floorplan

[Figure: tile-based NoC with tiles (0,0), (0,1), (1,1) and (2,1); a tile contains a processing element, a communication wrapper, a switch fabric, input buffers and output buffers]

241

“It’s a Small-World After All”

[Figure: physics collaboration network (Newman et al., Physical Review, 2004) and yeast proteins network (Maslov et al., Science, 2002)]

Graph theory: regular graphs, small-world networks, random graphs
• Highly clustered
• Short inter-node distances

Small-world networks
• Natural: biological networks
• Social networks: movie actors, collaboration networks
• Technological: Internet, WWW

242

Inducing small-world effects in NoCs

Completely structured
• Large inter-node distance
• Long latency
• Not tailored towards a target application
• Well-controlled parameters
• Low power
• Simple to layout

Fully customized
• Loss of structure
• Widely varying links
• Wire routing, floorplan, timing, …
• Better performance
• Small inter-node distance

Customization via LRL (long-range link)

243

Problem formulation

Given
• Communication frequencies between nodes
• Maximum number of links to be added
• The initial network & corresponding routing strategy

Determine
• The set of long-range links to be added on top of the mesh network
• A deadlock-free routing strategy for the newly added long-range links

s.t. network performance is optimized

$$f_{ij} = \frac{V_{ij}}{\sum_{p} \sum_{q \ne p} V_{pq}}$$

where V_ij is the communication volume from node i to j
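As a small worked example of the communication frequency definition, the Python sketch below normalizes each communication volume by the total volume over all ordered node pairs. The dictionary layout and the function name are assumptions made for illustration; the volumes are made-up numbers.

```python
def communication_frequencies(volume):
    """f_ij = V_ij / (sum of V_pq over all pairs p != q).

    volume maps an ordered pair (i, j), with i != j, to the volume V_ij.
    """
    total = sum(volume.values())
    return {pair: v / total for pair, v in volume.items()}


volume = {(0, 1): 30, (1, 2): 10, (2, 0): 60}
f = communication_frequencies(volume)
print(f)
```

By construction the frequencies sum to 1, so they can be read as the fraction of the total traffic carried by each source-destination pair.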

244

Performance evaluation

FREE STATE

CONGESTED STATE

Phase transition

Sepang Circuit, Malaysia
Monza Circuit, Italy

Ave. speed in fastest lap: 256 km/h, 213 km/h

Ave. speed for the race: 245 km/h, 201 km/h

245

Long-range link (LRL) insertion algorithm

Inputs: routing algorithm for the mesh network; available resources, S; communication frequencies, f_ij

1. While utilization < S:
   • For all (i, j), i, j ∈ T: add a link from i to j and evaluate the current configuration
   • Update utilization
2. When utilization is no longer below S: generate routing data
3. Output
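The insertion loop can be sketched in Python as a greedy search. The sketch makes several assumptions that are not taken from the slides: the "current configuration" is evaluated by the frequency-weighted hop count (a stand-in for the critical traffic load), links are bidirectional, a link of Manhattan length L consumes L units of the resource budget S, and the best candidate is kept in each iteration. All function names are hypothetical.

```python
from collections import deque
from itertools import combinations


def mesh_edges(n):
    """Edges of an n x n mesh; nodes are (x, y) tuples."""
    edges = set()
    for x in range(n):
        for y in range(n):
            if x + 1 < n:
                edges.add(((x, y), (x + 1, y)))
            if y + 1 < n:
                edges.add(((x, y), (x, y + 1)))
    return edges


def weighted_hops(nodes, edges, freq):
    """Sum over (src, dst) of f_ij * shortest hop count (BFS per source)."""
    adj = {v: [] for v in nodes}
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)
    total = 0.0
    for src in {s for s, _ in freq}:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        total += sum(f * dist[d] for (s, d), f in freq.items() if s == src)
    return total


def insert_long_range_links(n, freq, budget):
    nodes = [(x, y) for x in range(n) for y in range(n)]
    edges = mesh_edges(n)
    used, added = 0, []
    while True:
        base = weighted_hops(nodes, edges, freq)
        best = None
        for a, b in combinations(nodes, 2):
            length = abs(a[0] - b[0]) + abs(a[1] - b[1])
            if length < 2 or used + length > budget or (a, b) in edges:
                continue  # not long-range, over budget, or already present
            gain = base - weighted_hops(nodes, edges | {(a, b)}, freq)
            if gain > 1e-12 and (best is None or gain > best[0]):
                best = (gain, (a, b), length)
        if best is None:
            break
        _, link, length = best
        edges.add(link)
        added.append(link)
        used += length  # update utilization
    return added, weighted_hops(nodes, edges, freq)


freq = {((0, 0), (2, 2)): 0.7, ((0, 0), (0, 1)): 0.3}
links, cost = insert_long_range_links(3, freq, budget=4)
print(links, cost)
```

In this toy case most of the traffic goes corner to corner on a 3x3 mesh, so the whole budget is spent on a single link between those two nodes.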

246

Routing strategy for LRLs
• Local routing decision
• Deadlock free

Decision flow
• If there is a long-range link and the long-range link decreases d(i,j): use the long range link
• Otherwise: use the default routing algorithm

[Figure: basic turn model with allowed and not allowed turns; South-Last routing]
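The local routing decision can be written as a short Python function. This is a sketch under assumptions: the default algorithm is replaced here by plain XY routing (the slide mentions a turn-model based South-Last scheme instead), the distance d(i,j) is approximated by the Manhattan distance, and the `turn_allowed` hook is a hypothetical placeholder for whatever turn-model check keeps the network deadlock free.

```python
def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def xy_next_hop(cur, dst):
    """Stand-in default routing: move along x first, then along y."""
    if cur[0] != dst[0]:
        return (cur[0] + (1 if dst[0] > cur[0] else -1), cur[1])
    if cur[1] != dst[1]:
        return (cur[0], cur[1] + (1 if dst[1] > cur[1] else -1))
    return cur


def route_step(cur, dst, long_links, turn_allowed=lambda cur, nxt, dst: True):
    """Pick the next node for a packet at `cur` heading to `dst`.

    long_links maps a node to the far end of its long-range link, if any.
    """
    far_end = long_links.get(cur)
    if far_end is not None:                                    # is there a long-range link?
        if 1 + manhattan(far_end, dst) < manhattan(cur, dst):  # does it shorten the path?
            if turn_allowed(cur, far_end, dst):                # assumed deadlock check
                return far_end
    return xy_next_hop(cur, dst)


long_links = {(0, 0): (2, 2)}
print(route_step((0, 0), (3, 3), long_links))  # takes the long-range link
print(route_step((0, 0), (1, 0), long_links))  # falls back to default routing
```

Because the decision uses only the current node, the destination and the local long-range link, it needs no global state, which matches the "local routing decision" bullet.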

247

Latency comparison

Auto-industry benchmark (4x4 mesh network vs. 4x4 mesh network with long-range links)
• 13.6% improvement in the critical traffic load
• 69% reduction in the latency at the critical load

Telecom benchmark (5x5 mesh network vs. 5x5 mesh network with long-range links)
• 36.3% improvement in the critical traffic load
• 61.5% reduction in the latency at the critical load

[Figure: average packet latency (cycles) vs. total packet injection rate (packet/cycle) for both benchmarks]

248

Scalability, scalability, scalability

[Figure: critical traffic load (packets/cycle, 0 to 1.4) vs. network size (4x4, 6x6, 8x8, 10x10) for the 2D mesh network and the 2D mesh network with LRL]

[Figure: average packet latency (cycles, 0 to 200) vs. network size (4x4, 6x6, 8x8, 10x10)]

249

Practical considerations

Extra port

Long-range Link

Long-range links are divided into regular link segments

X

Y

250

CMU’s FPGA prototype for LRLs
Details
• Wormhole routing
• Parameterized packet length
• Routing table
• 4 cycles service time for the header flit
• 16 bit channels (parameterized)

[Figure: router block diagram with a routing table and, for each of Port 1 to Port 4, an input controller (In Cont.) and an output controller (Out Cont.)]

Area (for Xilinx Virtex-II XC2V4000 FPGA)
[Figure: number of slices (axis 0 to 600) for 3-port, 4-port, 5-port and 6-port routers; bar labels 1.4%, 1.8%, 2.2% and 2.8%]

4x4 mesh network: 6683 slices (29%)

4x4 mesh network with LRL: 7143 slices (31%)

251

Measurements using the FPGA prototype

252

Outline
Part I

Part II
• Performance optimization via buffer sizing
• Exploiting small world effects
• Fault-tolerance and scalability issues


253

Issues to worry about

• Design complexity increases
• Verification and testing become more difficult

When the manufacturing process shrinks from 0.25 μm to 0.18 μm and the supply voltage drops from 2 V to 1.6 V, α-particle and neutron effects increase more than 10 times

C. Constantinescu, “Neutron SER Characterization of Microprocessors”, DSN 2005

[Figure: cosmic rays striking the network between a source and a destination]

254

Tiles that communicate stochastically

Each tile contains
• IP core
• Send / receive buffers
• CRC hardware
• Random number generator (RND)

Main idea
• Probabilistic broadcast: packets are randomly transmitted several times, using multiple paths
• Transmissions protected by CRC. Corrupted packets are discarded

We call this on-chip stochastic communication

[Figure: 4x4 grid of tiles numbered 1 to 16 with a producer and a consumer; each tile has an IP, buffers, a CRC check and RND units on its links]

255

Parameters of stochastic communication

Transmission probability (p ∈ [0,1])
• Governs the random packet transmission
• Influences latency and energy dissipation
• Messages are disseminated “explosively” fast

Time-to-live (TTL)
• A packet has a finite TTL
• Influences the energy dissipation

Energy dissipation
• Depends on the total number of packets (N_packets) transmitted in the NoC
• Can be obtained from

$$E_{communication} = N_{packets} \cdot S \cdot E_{bit}$$

where S = average size of packets, E_bit = energy dissipated per bit

Duration of a round (T_R)
• Influences latency

$$T_R = \frac{N_{packets/round} \cdot S}{f}$$
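To see how the transmission probability and the TTL interact, the following Python sketch simulates a simplified gossip process on an n x n mesh. It is an assumption-laden toy, not the tutorial's simulator: every informed tile forwards the message to each neighbour independently with probability p in every round, the TTL is modelled as the number of rounds, and corrupted or dropped packets are not modelled. The energy helper applies the formula E_communication = N_packets x S x E_bit; all names and numbers are illustrative.

```python
import random


def neighbours(tile, n):
    x, y = tile
    return [(x + dx, y + dy)
            for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))
            if 0 <= x + dx < n and 0 <= y + dy < n]


def gossip(n, producer, consumer, p, ttl, rng):
    """Return (round in which the consumer is reached or None, packets sent)."""
    informed = {producer}
    packets = 0
    reached_at = None
    for rnd in range(1, ttl + 1):
        newly_informed = set()
        for tile in informed:
            for nb in neighbours(tile, n):
                if rng.random() < p:  # forward with probability p
                    packets += 1
                    newly_informed.add(nb)
        informed |= newly_informed
        if reached_at is None and consumer in informed:
            reached_at = rnd
    return reached_at, packets


def communication_energy(n_packets, avg_size_bits, e_bit):
    """E_communication = N_packets * S * E_bit."""
    return n_packets * avg_size_bits * e_bit


rng = random.Random(42)
reached, n_packets = gossip(4, (0, 0), (3, 3), p=0.5, ttl=12, rng=rng)
print(reached, n_packets, communication_energy(n_packets, 64, 1e-12))
```

Raising p shortens the time to reach the consumer but increases the packet count, and therefore the energy, which is the latency versus energy trade-off listed above.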

256

Problem description
Given
• NoC communication architecture and an initial configuration (i.e. a set of sources and destinations)
• Environment influences the NoC reliability via various particles (neutron, α, etc.)
• Each node can send the packet to a subset of its neighbours at random

Determine
• Evolution of the packet dissemination s.t. the system-level fault tolerance is ensured through the probabilistic communication mechanism

[Figure: 4x4 grid of tiles numbered 1 to 16 with a producer and a consumer, annotated with the fault types: data upset, buffer overflow, synchronization error]

257

New model: Spreader-Ignorant interaction

[Figure: spreader (S) nodes interacting with ignorant (I) neighbours at rates α1, α2, α3 and α4, shown on a mesh and on a complete graph. S: Spreader, I: Ignorant]

$$P\{S(t+h) = s,\ I(t+h) = i \mid S(t) = s-k,\ I(t) = i+k\} = \alpha_k\,(s-k)(i+k)\,h + O(h), \quad \text{for } k \in \{1,2,3,4\}$$

Main idea: The topology plays an essential role!

258

Master equation for packet dissemination
What is the probability P(s,i,t) of having s spreader nodes and i ignorant nodes at time t?

Solve the following equation:

[Equation: master equation for dP(s,i,t)/dt, with rates α_k (k = 1, …, 4) and β and with terms for the spreader-ignorant interaction, the spreader-spreader interaction and the spreader-stifler interaction]

259

Coverage analysis
Coverage of critical points for a 10x10 mesh network (250 rounds)
• Number of reached nodes saturates
• As probability gets higher, the saturation (# of reached nodes) increases faster

Stochastic simulation of spreader nodes in 10x10 mesh network
• Forwarding probability 0.7
• Similar asymptotic behaviors in the presence of faults

[Figure: coverage for 10x10 mesh network without faults]

260

Hierarchical NoC

Shared Bus

Putting it all together

Desire for lower power & higher performance suggests on-chip diversity (e.g. GALS architectures, mixed technologies, complex deterministic and stochastic communication)

261



262

The big picture

[Figure: 3D design space with axes X, Y and Z labeled paradigm, infrastructure and application. Axis values: adaptive, det; SWN, random, custom, regular, random. Points shown: LRL, DyAD, MPEG2, multimedia.]

263

The big picture

[Figure: 3D design space with axes X, Y and Z labeled paradigm, infrastructure and application. Axis values: adaptive, det; SWN, random, custom, regular, random. Points shown: LRL, DyAD, MPEG2 A/V, MPEG4 multimedia.]

264

The big picture

[Figure: mapping the spread of contagions. Black nodes are potentially infectious persons; pink nodes are exposed persons with an incubating infection who are not infectious; green nodes are exposed persons with no infection who are not infectious. The infection status is unknown for the grey nodes. The black node in the center of the graph is also the most infectious. (Source: www.orgnet.com)]

[Figure: 3D design space with axes X, Y and Z labeled paradigm, infrastructure and application. Axis values: adaptive, det; SWN, random, custom, regular, random. Points shown: DyAD, rumors/epidemics.]

265

More info about some slides – see references…

General
R. Marculescu, U. Y. Ogras, N. H. Zamora, 'Computation and Communication Refinement for Multiprocessor SoC Design: A System-Level Perspective,' in ACM TODAES, Vol. 11, No. 3, July 2006.
U. Y. Ogras, J. Hu, R. Marculescu, 'Key Research Problems in NoC Design: A Holistic Perspective,' in Proc. CODES+ISSS, Jersey City, NJ, Sep. 2005.
A. Jantsch, H. Tenhunen, Eds., Networks on Chip, Kluwer Academic, 2003.

Point-to-point communication synthesis
J. Hu, Y. Deng, R. Marculescu, 'System-Level Point-to-Point Communication Synthesis Using Floorplanning Information,' in Proc. ASPDAC-VLSI, Bangalore, Jan. 2002.
A. Pinto, L. Carloni, A. Sangiovanni-Vincentelli, 'Constraint-Driven Communication Synthesis,' in Proc. DAC, New Orleans, LA, June 2002.

266

References (cont’d)

NoC mapping, scheduling, routing
J. Hu, R. Marculescu, 'Communication and Task Scheduling of Application-Specific Networks-on-Chip,' in IEE Proc. Computers & Digital Techniques, Sept. 2005.
J. Hu, R. Marculescu, 'Energy- and Performance-Aware Mapping for Regular NoC Architectures,' in IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, Vol. 24, No. 4, April 2005.

Buffer allocation, topology synthesis/customization
J. Hu, U. Y. Ogras, R. Marculescu, 'System-Level Buffer Allocation for Application-Specific Networks-on-Chip Router Design,' in IEEE Trans. CAD, Vol. 25, Dec. 2006.
U. Y. Ogras, R. Marculescu, '"It’s a small world after all": NoC Performance Optimization via Long Link Insertion,' in IEEE Trans. on VLSI, Vol. 14, July 2006.

Traffic analysis
G. Varatkar, R. Marculescu, 'Traffic Analysis for On-chip Networks Design of Multimedia Applications,' in IEEE Trans. on VLSI, Jan. 2004.

267

References (cont’d)

Stochastic communication
P. Bogdan, T. Dumitras, R. Marculescu, 'Stochastic Communication: A New Paradigm for Fault-Tolerant Networks-on-Chip,' in VLSI Design, Hindawi Publishing Corp., 2007.
T. Dumitras, R. Marculescu, 'On-Chip Stochastic Communication,' in Proc. DATE, Munich, Germany, March 2003.
C. Constantinescu, 'Impact of Deep Submicron Technology on Dependability of VLSI Circuits,' in Proc. DSN, 2002.

Power and link management
C.-L. Chou, R. Marculescu, 'Incremental Run-time Application Mapping for Homogeneous NoCs with Multiple Voltage Levels,' in Proc. CODES+ISSS, Salzburg, Austria, Oct. 2007.
U. Y. Ogras, R. Marculescu, P. Choudhary, D. Marculescu, 'Voltage-Frequency Island Partitioning for GALS-based Networks-on-Chip,' in Proc. DAC, 2007.
J. Hu, Y. Shin, N. Dhanwada, R. Marculescu, 'Architecting Voltage Islands in Core-based System-on-a-Chip Designs,' in Proc. ISLPED, Newport Beach, CA, Aug. 2004.

268

References (cont’d)

Implementation
U. Y. Ogras, R. Marculescu, H. G. Lee, P. Choudhary, D. Marculescu, M. Kaufman, P. Nelson, 'Challenges and Promising Results in NoC Prototyping Using FPGAs,' in IEEE Micro Special Issue on Interconnects for Multi-Core Chips, Sept./Oct. 2007.
H. G. Lee, N. Chang, U. Y. Ogras, R. Marculescu, 'On-chip Communication Architecture Exploration: A Quantitative Evaluation of Point-to-Point, Bus and Network-on-Chip Approaches,' in ACM TODAES, Vol. 12, No. 3, June 2007.

This list of references is NOT exhaustive. There are many good contributions not mentioned here due to space limitations.

A good selection of NoC papers is available at
http://www.cl.cam.ac.uk/~rdm34/onChipNetBib/browser.htm
http://www.ocpip.org/university/biblio_main/comparison/

Sponsors: Marco GSRC, SRC, NSF

269

NoCPro: The CMU-SNU prototype
• A complete MPEG-2 encoder implementation using NoC, bus-based and P2P architectures
• Detailed area, power, and performance comparisons based on real measurements
• Illustrates the scalability of the NoC approach using a real multimedia application

[Figure: MPEG-2 encoder on a NoC. Routers R1 and R2 connect Input Buffer, DCT & Quant., VLE & Out. Buffer, IQuant. & IDCT, Motion Est., Motion Comp. and Frame Buffer.]

270

MPEG-2 Encoder
MPEG-2 belongs to the rich class of multimedia applications
• JPEG, MJPEG, MPEG1, etc.

MPEG-2 communication task graph
[Figure: task graph with nodes Input Buffer, DCT, Quant., IQuant., IDCT, Motion Est., Motion Comp., Frame Buffer and VLE & Out]

271

MPEG-2 Encoder Implementation (P2P)

Processing elements

961480956802

3,8732,527

74

# of Slices

Area(Xilinx Virtex-II FPGA)

Processing Element    # of BRAMs
Input Buffer          1
DCT/Quant.            1
VLE/Out Buf.          1
IQuant/IDCT           1
Reconst FB.           75
Motion Comp.          19
Motion Esti.          8

Network interface(116 slices)

Input port Output port

Input buffer(FIFO)

...

...

Output buffer(FIFO)

PE

PE

272

MPEG-2 Encoder Implementation (P2P)

[Figure: point-to-point implementation. Dedicated links connect the processing elements Frame Buffer (FB), Input Buffer (IB), DCT & Quant. (DQ), VLE & Out. Buffer (VB), Motion Comp. (MC), Motion Est. (ME) and IQuant. & IDCT (IQ) through their network interfaces.]

# of links: 10
# of NIs: 19

273

MPEG-2 Encoder Implementation (NoC)

On-chip router design
• Wormhole routing
• Packet length: 64 flits (parameterized)
• Routing table
• 4-cycle service time for header flit

Area (Xilinx Virtex-II XC2V4000)

Router    Resource (# of slices)    Util. (%)
3-port    219                       1.0
4-port    304                       1.3
5-port    397                       1.7
6-port    503                       2.2

[Figure: router block diagram with a routing table and, for each of Port 1 to Port 4, an input and an output with their input controller (Input Cont.) and output controller (Output Cont.)]

274

MPEG-2 Encoder Implementation (NoC)

MPEG-2 encoder (hierarchical star network)

[Figure: routers R1 and R2 connect Input Buffer, DCT & Quant., VLE & Out. Buffer, IQuant. & IDCT, Motion Est., Motion Comp. and Frame Buffer. Between a PE and its router sits an interface with packetize/depacketize logic, an input buffer (FIFO), an output buffer (FIFO), an input port and an output port.]

# of links: 8
# of NIs: 7
# of routers: 2

189 slices

275

Scalability with Increasing Parallelism

Increase the number of modules
• ME module is the performance bottleneck

[Figure: the NoC version (routers R1 and R2 with IB, DQ, VB, IQ, MC, FB, ME1 and ME2) and the point-to-point version (FB, IB, DQ, VB, MC, IQ, ME1 and ME2) after adding a second motion estimation module]

Point-to-point: # of links: 14, # of NIs: 27
NoC: # of links: 9, # of NIs: 8, # of routers: 2

276

To summarize…

[Figure: three implementations of the MPEG-2 encoder, each with a second motion estimation module (Motion Est. 2) added to Input Buffer, DCT & Quant., VLE & Out. Buffer, Inv Quant. & IDCT, Motion Est., Motion Comp. and Frame Buffer]
• Networks-on-chip implementation: modules connected through routers R1 and R2
• Point-to-point implementation: dedicated links
• Bus implementation: bus control unit

277

Area and Performance Comparison

[Figure: # of frames/sec (0 to 500) vs. degree of parallelism (1, 2, 4, 8) for P2P, NoC and Bus]

In terms of performance
• NoC scales similar to P2P implementation
• Bus implementation scales poorly

[Figure: # of slices (0 to 25K) for P2P, Bus and NoC]

In terms of area
• NoC scales similar to the bus architecture
• Scaling of the P2P implementation is poor

278

Energy and Power Consumption Comparison

[Figure: energy (mJ/frame, 0 to 100) and percentage (0 to 50%) for P2P, Bus and NoC]

In terms of energy per frame
• NoC exhibits the best scalability
• Scaling of the bus implementation is poor

[Figure: power (mW, 0 to 7,000) and percentage (0 to 60%) for P2P, Bus and NoC]

In terms of power consumption
• Bus implementation exhibits the best scalability (due to slow operation)
• Scaling of the NoC implementation is better than P2P

279

Conclusion
Main message: The NoC outperforms P2P!
• Less design complexity and smaller area (~22%)
• Less energy consumption (~42%) and better energy-delay product
• Similar performance (P2P: 48 frames/s, NoC: 47 frames/s)

NoC implementation is scalable in terms of area, performance, and power consumption

Embedded Low Power Laboratory

Outline
Foundation of platform-based design and Metropolis framework (Alberto/Douglas)

Real-life examples (all)

xPilot: Behavioral-to-RTL Synthesis Flow

Inputs
• Behavioral spec. in C/SystemC
• Platform description

Frontend compiler

Presynthesis optimizations
• Loop unrolling/shifting
• Strength reduction / Tree height reduction
• Bitwidth analysis
• Memory analysis …

SSDM

Core synthesis optimizations
• Scheduling
• Resource binding, e.g., functional unit binding, register/port binding

μArch-generation & RTL/constraints generation
• Verilog/VHDL/SystemC
• FPGAs: Altera, Xilinx
• ASICs: Magma, Synopsys, …

RTL + constraints

FPGAs/ASICs

Advantages of Behavioral Synthesis

Shorter verification/simulation cycle
• 100X speed up with behavior-level simulation

Better complexity management, faster time to market
• 10X improvement on code density

Rapid system exploration
• Quick evaluation of different hardware/software boundaries
• Fast exploration of multiple micro-architecture alternatives

Higher quality of results
• Platform-based synthesis & optimization
• Full consideration of physical reality

xPilot is Licensed to AutoESL for Commercialization

[Figure: AutoPilotTM ESL synthesis flow. A design specification in C/C++/SystemC, user constraints and a platform characterization library enter AutoPilotTM, which performs compilation & elaboration, advanced code transformation, and behavioral & communication synthesis and optimizations. It produces RTL HDLs & RTL SystemC together with timing/power/layout constraints for ASIC/FPGA implementation and FPGA prototyping. A common testbench is used for simulation, verification, and prototyping.]

ESL Synthesis
• Platform-based & communication-centric ESL synthesis
• Automated ESL-to-GDSII silicon compilation
• Rapid platform-based system-level exploration
• More than 10X design productivity gain

AutoClipse IDE

Standard JFace Text Editor
• Keyword highlighting, key bindings

Outline View
• Using internal parser

Content Assist
• Using internal parser and index

C/C++ Projects View
• Showing CDT specific things: includes, binaries

Build Console Output

MPEG-4 Simple Profile Decoder: C-Based Synthesis Results

Texture Update & Copy Control

Texture/IDCT

Motion Comp.

Parser/VLDModule

16

BRAMs

8.0

2693Video:CIF@30fps

Device:v2p30 1407

2032

899

Slices Period (ns)Setting

• Complexity of synthesized RTLs

Module Name        Orig. C Source File (+ line#)    VHDL line#
Copy Controller    copyControl.c (287)              2815
Motion Comp.       Motion-compensation.c (312)      4681
                   motion_decode.c (492)            10934
Parser/VLD         bitstream.c (439)                6093
                   parser.c (1095)                  12036
Texture/IDCT       texture_vld.c (504)              6089
                   texture_idct.c (1819)            11537
Texture Update     textureUpdate.c (220)            2736
Total              5168                             56921

Experimental Results: ASIC Flow
• Magma RTL to GDSII flow
• Technology library: TSMC 90nm
• Design: Motion Compensation Block

• 1st column: Cycle time constraint enforced in AutoPilot and Magma tools
• 2nd column: Estimated cycle count of synthesized RTL
• 3rd-5th column: Data reported by Magma tool

Clock Period      Cycle    Cell     Area     Crit. Path    Fmax     Total
Constraint (ps)   Count    Count    (um2)    Delay (ps)    (MHz)    Latency (ns)
4000              641      3271     28419    3810          262.5    2442.2
3500              641      3275     28487    3361          297.5    2154.4
3000              705      3304     28995    3047          328.2    2148.1
2500              787      3509     31111    2330          429.2    1833.7
2000              795      3890     32410    1831          546.1    1455.6
1500              859      4918     36933    1353          739.1    1162.2
1000              1095     6868     44778    1071          933.7    1172.7
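The last two columns of the ASIC-flow results follow from the others: Fmax is the reciprocal of the critical path delay, and the total latency is the cycle count multiplied by the critical path delay. The Python sketch below recomputes both for every row as a sanity check; the row tuples are transcribed from the slide's table (clock period constraint, cycle count, critical path delay, reported Fmax, reported latency), and the helper name is hypothetical.

```python
# (clock period constraint [ps], cycle count, critical path [ps],
#  reported Fmax [MHz], reported total latency [ns])
rows = [
    (4000, 641, 3810, 262.5, 2442.2),
    (3500, 641, 3361, 297.5, 2154.4),
    (3000, 705, 3047, 328.2, 2148.1),
    (2500, 787, 2330, 429.2, 1833.7),
    (2000, 795, 1831, 546.1, 1455.6),
    (1500, 859, 1353, 739.1, 1162.2),
    (1000, 1095, 1071, 933.7, 1172.7),
]


def derived(cycles, crit_path_ps):
    """Return (Fmax in MHz, total latency in ns) from the raw columns."""
    fmax_mhz = 1e6 / crit_path_ps                 # 1 / ps  ->  MHz
    latency_ns = cycles * crit_path_ps / 1000.0   # ps -> ns
    return fmax_mhz, latency_ns


max_error = 0.0
for period, cycles, crit, fmax, latency in rows:
    f, lat = derived(cycles, crit)
    max_error = max(max_error, abs(f - fmax), abs(lat - latency))

best = min(rows, key=lambda r: r[4])  # row with the shortest total latency
print(max_error, best[0])
```

The check also makes one trend explicit: tightening the clock constraint raises the cycle count, so the shortest total latency occurs at the 1500 ps constraint rather than at 1000 ps.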

ESL SystemC to ASIC (Magma Flow)

Flow: behavioral SystemC design model → AutoPilotTM synthesis tool → RTL SystemC, VHDL/Verilog → Magma BlastCreate

SystemC behavioral specification
• AES (Advanced Encryption Standard)
• Untimed model; bit-accurate data types
• About 1300 lines of code

AutoPilot synthesis result
• Latency: 270 cycles
• RTL Verilog code: about 23K lines

Magma Blast-Create result
• Technology node: TSMC 90nm
• Area (um2): 70K
• Frequency: 125MHz+ (8ns constraint)

AutoPilotTM Simulation/Verification Flow

[Figure: the synthesis flow and the automated simulation flow side by side. The behavioral C/C++/SystemC design model goes through the AutoPilotTM synthesis tool to RTL SystemC and HDL models, then to ASIC/FPGA RTL synthesis/layout and FPGA prototype/emulation. The behavior-level (untimed) test bench and stimuli are reused through the AutoPilotTM bench adapter to produce cycle-accurate waveforms and a coverage and assertion report.]

Bench Adapter
• Generate wrappers for RTL models
• Reuse untimed bench and stimuli
• Automatically compare SystemC and HDL waveforms

Compilation for Reconfigurable Accelerated Computing

AutoPilot C-based synthesis for high-performance computing

Synthesize pure ANSI-C, the “Universal Language” for software programmers
• Quickly port legacy C programs into optimized hardware implementations

GCC-compatible compilation flow
• Full support of IEEE-754 floating point data types & operations
• Efficiently handle bit-accurate fixed-point arithmetic

Automatic parallelization / pipelining for performance speedup

ANSI-C (input to AutoPilotTM):

    int test(int in0, int in1)
    {
        /* user code */
    }

RTL VHDL (output of AutoPilotTM):

    entity test is
      port (
        in0    : IN  SIGNED (31 downto 0);
        in1    : IN  SIGNED (31 downto 0);
        result : OUT SIGNED (31 downto 0);
        clk    : IN  STD_LOGIC;
        reset  : IN  STD_LOGIC;
        done   : OUT STD_LOGIC;
        start  : IN  STD_LOGIC );
    end;
    -- Synthesized code in VHDL --

Acceleration of Lithographic Simulation with AutoPilotTM

Lithography simulation
• Simulate the optical imaging process
• Computationally intensive; very slow for full-chip simulation

15X+ performance improvement vs. AMD Opteron 2.2GHz processor with automated compilation

XtremeData X1000 development system (AMD Opteron + Altera StratixII EP2S180)

Flow: algorithm in C → AutoPilotTM synthesis tool

$$I(x,y) = \sum_{k} \lambda_k \left| \sum \tau \left[ \psi_k(x - x_1, y - y_1) - \psi_k(x - x_2, y - y_1) + \psi_k(x - x_2, y - y_2) - \psi_k(x - x_1, y - y_2) \right] \right|^2$$
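The intensity formula can be evaluated directly, as in the Python sketch below. The interpretation is an assumption: the inner sum is taken over mask rectangles with corners (x1, y1) and (x2, y2) and transmission τ, and each ψ_k is an arbitrary callable standing in for a precomputed kernel lookup. The data layout and the function name are hypothetical, and the kernel used in the example is a toy chosen so the result is easy to verify by hand.

```python
def intensity(x, y, rects, kernels):
    """I(x, y) = sum_k lambda_k * | sum_rects tau * [four-corner psi_k terms] |^2

    rects:   list of (tau, x1, y1, x2, y2), assumed to be mask rectangles
    kernels: list of (lambda_k, psi_k), psi_k being a callable psi_k(u, v)
    """
    total = 0.0
    for lam, psi in kernels:
        inner = 0.0
        for tau, x1, y1, x2, y2 in rects:
            inner += tau * (psi(x - x1, y - y1) - psi(x - x2, y - y1)
                            + psi(x - x2, y - y2) - psi(x - x1, y - y2))
        total += lam * abs(inner) ** 2
    return total


# Toy kernel psi(u, v) = u * v: the four-corner combination then equals the
# rectangle area (x2 - x1) * (y2 - y1), here 2 * 3 = 6.
rects = [(1.0, 0.0, 0.0, 2.0, 3.0)]
kernels = [(1.0, lambda u, v: u * v)]
value = intensity(5.0, 5.0, rects, kernels)
print(value)
```

The loop nest over kernels, rectangles and image points is regular and independent across points, which is the kind of structure that lends itself to automatic parallelization and pipelining.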

AutoPilot QoR
Customer design

Evaluation criteria
• QoR: compare latency and area with manual design
• Quick design space exploration

Results highlight
• Parameterizable C design
• QoR comparable to hand design; discovered an alternative with 26% better latency

AutoPilot™ Features and Benefits

Features
• Advanced compiler optimizations
• Communication/interconnect centric synthesis
• Unified FPGAs and ASICs support
• Platform-based and implementation-aware behavioral synthesis
• Scalable & near-optimal algorithms for global behavioral & communication co-optimizations
• Integrated C/C++/SystemC-based design flow

Benefits
• Higher quality-of-results
• Faster timing/power closure
• Enable design reuse: from single source to multiple RTLs for different technologies & platforms
• Precise platform pre-characterization allowing more informed optimizations
• Best language/abstraction support; enable efficient simulation, prototyping, and implementation


Example Design

1. Select an application and understand its behavior.
2. Create a Metropolis functional model which models this behavior.
3. Assemble an architecture from library services or create your own services.
4. Map the functionality to the architecture.
5. Extract a structural file from the top-level netlist of the architecture created.

[Diagram: JPEG Encoder Function Model (Block Level): Preprocessing, DCT, Quantization, Huffman. Mapping Process onto an architecture assembled from the IP Library: MicroBlaze, SynthMaster, SynthSlave, BRAM, On-Chip Peripheral Bus (OPB). Structure Extractor: Top Level Netlist → File for Xilinx EDK Tool Flow.]
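Steps 3 to 5 can be sketched in a few lines. This is a minimal illustration, not Metropolis code: the `Architecture` class, the `check_mapping` and `extract_structure` helpers, and the text format of the structural file are all hypothetical. In particular, the output is not the real Xilinx EDK file format.

```python
from dataclasses import dataclass, field


@dataclass
class Architecture:
    components: dict = field(default_factory=dict)   # name -> kind
    connections: list = field(default_factory=list)  # (component, bus)

    def add(self, name, kind, bus=None):
        """Step 3: assemble the architecture from library services."""
        self.components[name] = kind
        if bus is not None:
            self.connections.append((name, bus))


def check_mapping(processes, arch, mapping):
    """Step 4: every functional process must map to a known component."""
    for proc in processes:
        if mapping.get(proc) not in arch.components:
            raise ValueError(f"{proc} is not mapped to a known component")
    return True


def extract_structure(arch):
    """Step 5: flatten the top-level netlist into structural text lines."""
    lines = [f"component {name} : {kind}"
             for name, kind in arch.components.items()]
    lines += [f"connect {name} -> {bus}" for name, bus in arch.connections]
    return lines
```

Keeping the mapping separate from both the functional process list and the architecture mirrors the separation of function and architecture that the design flow relies on.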


Example Design Cont.

1. Feed the captured structural file to the permutation generator.
2. Feed the permutations to the Xilinx tools and extract the data.
3. Capture execution info for software and hardware services.
4. Provide transaction info for communication services.

[Diagram: File for Xilinx EDK Tool Flow → Permutation Generator → Permutation 1, Permutation 2, ..., Permutation N → Platform Characterization Tool (Xilinx EDK/ISE Tools) → Characterizer Database (ISS Info, Char Data, Transaction Info)]

Characterizer Database entries shown on the slide:
• Software Routines: int DCT (data){ Begin calculate …… }
• 32 Bit Read = Ack, Addr, Data, Trans, Ack
• Hardware Routines: DCT1 = 10 Cycles, DCT2 = 5 Cycles, FFT = 5 Cycles
Capture labels on the slide, in order of appearance: Automatic, Manual, Manual
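A minimal sketch of the permutation and lookup steps follows. It assumes a permutation is one combination of architecture parameter values, which is an interpretation rather than something the slide states. The parameter names in the usage note and the helper names are hypothetical; only the cycle counts come from the slide.

```python
from itertools import product

# Hardware routine costs in cycles, as listed on the slide.
HARDWARE_ROUTINES = {"DCT1": 10, "DCT2": 5, "FFT": 5}


def generate_permutations(options):
    """Yield one dict per combination of parameter values.

    options maps a parameter name to the list of values to try
    (an assumed representation of the structural file's free choices).
    """
    names = sorted(options)
    for values in product(*(options[name] for name in names)):
        yield dict(zip(names, values))


def total_cycles(routines, database):
    """Annotate a sequence of routine calls with characterized cycle counts."""
    return sum(database[name] for name in routines)
```

For example, `list(generate_permutations({"microblaze": [1, 2], "synth_slave": [0, 1, 2]}))` produces six candidate systems, each of which would be handed to the characterization tools in step 2.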


Example Design Cont.

1. Simulate the design and observe the performance.
2. Refine design to meet performance requirements.
3. Use Refinement Verification to check validity of design changes.
• Depth, Vertical, or Horizontal
• Refinement properties
4. Re-simulate to see if your goals are met.

[Diagram repeated from the previous slides: JPEG Encoder Function Model (Block Level) mapped onto MicroBlaze, SynthMaster, SynthSlave, BRAM, and the On-Chip Peripheral Bus (OPB), annotated with ISS Info, Char Data, and Transaction Info. Refinements shown: Concurrent (Vertical Refinement) with an added BRAM, and New Algorithm (Depth). The Verification Tool answers Yes? / No?]

Simulation results shown on the slide:
• Execution time 100ms, Bus Cycles 4000, Ave Memory Occupancy 500KB
• Execution time 200ms, Bus Cycles 1000, Ave Memory Occupancy 100KB

Backend Tool Process:
1. Abstract Syntax Tree (AST) retrieves structure: Vertical Refinement, Horizontal Refinement
2. Control Data Flow Graph - Depth: FORTE – Intel Tool, Reactive Models – UC Berkeley
3. Event Traces – Refinement Properties.
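Step 4 amounts to comparing simulated metrics with design goals. The sketch below uses the two result sets from the slide. The goal thresholds and the name `unmet_goals` are hypothetical, and the slide does not say which result set precedes the refinement, so they are simply labeled in slide order.

```python
# Simulation results as listed on the slide (in slide order).
FIRST_RESULT = {"execution_time_ms": 100, "bus_cycles": 4000, "avg_memory_kb": 500}
SECOND_RESULT = {"execution_time_ms": 200, "bus_cycles": 1000, "avg_memory_kb": 100}

# Hypothetical upper bounds chosen only for this illustration.
GOALS = {"execution_time_ms": 250, "bus_cycles": 2000, "avg_memory_kb": 256}


def unmet_goals(measured, goals):
    """Return the names of metrics whose measured value exceeds its bound."""
    return sorted(name for name, limit in goals.items()
                  if measured[name] > limit)
```

With these assumed bounds, `unmet_goals(FIRST_RESULT, GOALS)` reports the bus and memory metrics, while `unmet_goals(SECOND_RESULT, GOALS)` is empty: the second design trades a longer execution time for lower bus traffic and memory occupancy.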


Intel MXP5800 Architecture

Highly Heterogeneous Parallel Platform
• Designed for Imaging Applications
• 8 Image Signal Processors connected with mesh
• PEs have limited capabilities
• Communication is data-driven with support for multiple consumers
• Buffer memory is extremely limited: 16 registers


Application and Architecture Modeling

Functional Modeling
• Hierarchical
• 23 Processes
• 21 FIFOs
• Focus on DCT

[Diagram: top level Pre-processing, DCT, Quantization, Huffman. Process blocks: Scan, Color Conv., 1D-DCT, Transpose, 1D-DCT, Transpose, ZigZag, Mult, RLE, Lookup. Further blocks: Shift, Add4, Sub4, Mult1, Mult2, Merge, Add2, Sub2.]

Architectural Modeling
• Time is performance metric
• Tasks provide blocking read/write and execution services
• PEs support static schedules
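A functional model built from processes and FIFOs with blocking read/write can be sketched with threads and bounded queues. This only illustrates the modeling style and is not the Metropolis metamodel: `process`, `run_pipeline`, the end-of-stream marker, and the default FIFO depth are assumptions of the sketch.

```python
import queue
import threading

_DONE = object()  # end-of-stream marker (an assumption of this sketch)


def process(fn, fifo_in, fifo_out):
    """Apply fn to every token read from fifo_in and write it to fifo_out."""
    while True:
        token = fifo_in.get()        # blocking read
        if token is _DONE:
            fifo_out.put(_DONE)
            return
        fifo_out.put(fn(token))      # blocking write when the FIFO is full


def run_pipeline(stages, tokens, depth=2):
    """Connect the stage functions with bounded FIFOs and collect the output."""
    fifos = [queue.Queue(maxsize=depth) for _ in range(len(stages) + 1)]

    def source():
        for token in tokens:
            fifos[0].put(token)
        fifos[0].put(_DONE)

    threads = [threading.Thread(target=source)]
    threads += [threading.Thread(target=process,
                                 args=(fn, fifos[i], fifos[i + 1]))
                for i, fn in enumerate(stages)]
    for t in threads:
        t.start()

    results = []
    while True:
        token = fifos[-1].get()
        if token is _DONE:
            break
        results.append(token)
    for t in threads:
        t.join()
    return results
```

Because every stage is a single process reading one FIFO in order, the output order is deterministic regardless of thread scheduling, which is what makes this style convenient for functional modeling.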


Mapping
• Replication of best scenarios from Intel library
• Accurate performance modeling
• Easy implementation of additional scenarios
• Change allocation and scheduling

[Figure: Cycles for different scenarios. Y-axis: Cycles (0 to 2500). X-axis: Scenario (Hardware, Balanced, OPE emphasis, OPE Heavy). Series: Metropolis Scenarios, Intel Software Library.]

A. Davare, Q. Zhu, J. Moondanos, ASV, “JPEG Encoding on the MXP5800: A Platform-based Design Case Study,” Proceedings of EstiMedia 2005.


Motion JPEG on Xilinx

Step 1: Decompose application (MJPEG encoding) into desired topologies.

Step 2: Create MJPEG Functional Models in the MMM language.

Step 3: Create Architecture Models in MMM for a target platform.

Step 4: Map processes in the func. model to tasks in the arch. model.


Mapping Motion-JPEG

Step 3: Map processes in the functional model to tasks in the architecture model.

In our exploration: One to one mapping between functional and architectural tasks

System   Estimated Cycles  Characterized Cycles  Real Cycles  Rankings (Real, Char, Est)  Max MHz  Execution Time (Secs)  Area (Slices)
Model 1  145282 (52%)      228356 (25%)          304585       4, 4, 4                     101.5    0.0030                 4306
Model 2  103812 (33%)      145659 (6%)           154217       3, 3, 2                     72.3     0.0021                 4927
Model 3  103935 (29%)      145414 (1.2%)         147036       2, 2, 3                     56.7     0.0026                 7035
Model 4  103320 (28%)      144432 (<+1%)         143335       1, 1, 1                     46.3     0.0031                 9278

[Figure: Real Cycles and Execution Time. X-axis: Model (1 to 4). Left axis: Cycles (0 to 350000). Right axis: Execution Time (Sec) (0 to 0.0035). Series: Real Cycles, Execution Time.]

[Figure: Real Cycles and Area. X-axis: Model (1 to 4). Left axis: Cycles (0 to 350000). Right axis: Slices (0 to 10000). Series: Cycles, Area.]
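The relationships between the columns of the Motion-JPEG results can be checked with a short script. It assumes that execution time is real cycles divided by the maximum clock frequency, and that the percentages are differences relative to the real cycle count. The slide states neither explicitly, but both are consistent with the numbers. The function names are hypothetical.

```python
# (system, estimated cycles, characterized cycles, real cycles, max MHz),
# taken from the Motion-JPEG results.
MODELS = [
    ("Model 1", 145282, 228356, 304585, 101.5),
    ("Model 2", 103812, 145659, 154217, 72.3),
    ("Model 3", 103935, 145414, 147036, 56.7),
    ("Model 4", 103320, 144432, 143335, 46.3),
]


def execution_time_s(real_cycles, max_mhz):
    """Execution time in seconds, assuming the design runs at its max clock."""
    return real_cycles / (max_mhz * 1e6)


def relative_error_pct(real_cycles, predicted_cycles):
    """Difference between predicted and real cycles, in percent of real."""
    return abs(real_cycles - predicted_cycles) / real_cycles * 100.0


def ranking(column):
    """Rank systems by ascending cycle count in a tuple column (1 = fewest)."""
    ordered = sorted(MODELS, key=lambda row: row[column])
    return {row[0]: rank for rank, row in enumerate(ordered, start=1)}
```

Under these assumptions, Model 4 needs the fewest cycles but Model 2 has the shortest execution time because of its higher clock rate, and the characterized ranking (`ranking(2)`) matches the real ranking (`ranking(3)`) while the estimated ranking (`ranking(1)`) swaps Models 2 and 3.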
