ASPDAC 2008 Tutorial: System-Level Synthesis -- Functions, Architectures, and Communications
Alberto Sangiovanni Vincentelli & Douglas Densmore, UC Berkeley, [email protected] (@eecs.berkeley.edu)
Jason Cong, UCLA, [email protected] (@CS.UCLA.EDU)
Radu Marculescu, CMU, [email protected] (@ece.cmu.edu)
Outline
Foundation of platform-based design and Metropolis framework (Alberto/Douglas) 9-10:30
• Challenges in System Level Design
• Platform-based Design as a unifying methodology
• A framework for PBD:
  – Theoretical foundations (heterogeneous systems, metamodeling, abstract semantics)
  – Metropolis II: integration platform architecture
  – Application to embedded system design: cars and building networked embedded systems
Synthesis for functionality (Jason) 11am – 12:30pm
• Synthesis for customized logic
• Use of application-specific processors and processor networks
Synthesis for communication (Radu) 2 – 3:30pm
• Networks-on-Chip design space
• Optimization for performance, energy, and fault-tolerance
Real-life examples (all) 4 – 5pm
Q/A Discussions
Copyright: A. Sangiovanni-Vincentelli
System-Level Design: From Art to Science
Alberto Sangiovanni-Vincentelli
The Edgar L. and Harold H. Buttner Chair of EECS
University of California at Berkeley
Outline
Challenges
The Movement towards Design Science
Platform-based Design as a Unifying Approach
Metropolis
The GSRC Agenda
Challenge: The Bifurcation of the Market
The Core:
• Performance is premium
• Power and cost constrained
• Relatively long life-time
• Expensive jewelry
The Expanding Periphery:
• Cost and size are premium
• Integration and power are key
• “Just enough performance”
• Electronics like a fashion statement
Challenge: The Physical Internet
[Figure: log(people per computer) versus year. Platforms: Mainframe, Minicomputer, Workstation, PC, Laptop, PDA, Cellular phone, Ubiquitous Sensor Networks. Uses: number crunching and data storage; productivity and interactive; streaming information to and from the physical world.]
Challenge: The Waning Days of Moore’s Law
• Limitations imposed by physics
• Limitations imposed by economics
may ultimately end its long run (20 nm – 13 nm – 8 nm – …?) or may not …
Challenge: Parallel Architectures
Scaling enabled integration of complex systems with hundreds of millions of devices on a single die
• Intel KEROM dual core: ISSCC 07, 290M trans.
• SUN Niagara-2: ISSCC 07, 500M trans.
• IBM/Sony Cell: ISSCC 05, 235M trans.
Challenge: Design Chain Integration (Automotive Industry)
Source: Public financials, Gartner 2005
• IC Vendors (~15% of revenue from automotive): 2005 revenue $17.4B; CAGR 10% (2004-2010)
• Tier 1 Suppliers (90%+ of revenue from automotive): 2004 revenue ~$200B; CAGR 5.4% (2004-2010)
• Automakers: 2005 revenue $1.1T; CAGR 2.8% (2004-2010)
Challenge: Platforms and Software Content
Supplied by ST (timeline labels): 1998 Specs; 2000 STAPI; 2003 & Beyond
System-Above-Chip (Boards, Chips, & Software)
• NO value in customer owning/writing drivers. (TMM, E*, HNS)
• Customer added value is application, Conditional Access, Brand Name
• ST supplies the complete base system BELOW MIDDLEWARE to save time to market
Concurrency and Heterogeneity
Intel Montecito
[Figure (Source: Bosch): automotive electronics architecture. Domains: Information Systems, Telematics, Body Electronics, Body Functions, Driving and Vehicle Dynamic Functions. Dependability labels: Fault Tolerant, Fail Safe, Fault Functional. Nodes: Mobile Communications, Navigation, FireWall, Access to WWW, DAB, GateWay (two), Theft warning, Door Module, Light Module, Air Conditioning, Shift by Wire, Engine Management, ABS, Steer by Wire, Brake by Wire. Buses: MOST, Firewire, CAN, Lin, TTCAN, FlexRay.]
Today, more than 80 microprocessors and millions of lines of code
HVAC: High Performance Buildings
Challenge summary
• Yesterday: Features (can you do it?)
• Today: Cost (are you cheaper?)
• Tomorrow: Integration (but can you also…?)
Industry will move towards robust architectures which can:
• mix-and-match components from different vendors
• avoid costly system-level simulations
• create a system by just interconnecting modules
NXP Semiconductors, René Penning de Vries, May 3 - 2007, IEF Athens
Integration: Plug and Play?
Plug and Pray!
The Design Integration Nightmare
Specification: P. Picasso, Blue Period
Implementation: P. Picasso, “Femme se coiffant”, 1940
Common Features
[Figure: a protocol stack (Application, Transport Layer, Network Layer, MAC Layer, Link Layer, Physical Layer) annotated with models of computation: Discrete Event, Process Networks, CSP, Continuous Time. Other labels: Pre-Post, Low pass, Manager, Tables and Parameters, User.]
• Systems are assembled out of heterogeneous components
• Systems are distributed
• Interactions difficult to define
The Intellectual Agenda
To create a modern computational systems science and systems design practice with
• Concurrency
• Composability
• Time
• Hierarchy
• Heterogeneity
• Resource constraints
• Verifiability
• Understandability
Opportunity: System Design Chain
[Figure labels: System Design, Implementation, Manufacturing, IP, Fabrics, Interfaces, Design Science.]
Design Process Transformation in Chip Design: Embedded System Design Gaps
Source: Greg Spirakis, Intel
[Figure: design flow starting from a Textual Design Specification. Hardware models: Functional High Level, Cycle Accurate RTL, Timing Accurate Gate-Level Netlist. Software: C/C++ SW code, RTOS code. Steps: HW/SW Partitioning, Estimation, Translation, Synthesis, Simulate RTL, Verify Gates, Validate, Drive Code, Test Vectors. Four hand-offs are marked MANUAL and one is marked NO FORMAL SEMANTICS.]
Early Platform Architecture: Philips Nexperia
Source: Philips
[Figure: Nexperia platform, split into Software and Hardware.
Software: Applications; Middleware (JavaTV, TVPAK, OpenTV, MHP/Java, proprietary ...); Streaming and Platform Software; Kernel (pSOS, VxWorks, Win-CE).
Nexperia Hardware (DVP system silicon): TriMedia™ CPU (TM-xxxx, D$, I$) and MIPS™ CPU (PRxxxx, D$, I$), each with device IP blocks on a PI bus; MMI; SDRAM on the DVP memory bus.]
Platform-types
Designing Platforms: the IC Company View
Xilinx: “Highly-Programmable Platform (Virtex-II Pro)”
• 7/00: IBM PowerPC
• 9/00: Mindspeed SkyRail gigabit serial I/O
• 10/00: RocketChips mixed-signal IP acquisition
• 3/01: Wind River O/S
• 3/02: Virtex-II Pro production
[Figure labels: Application Space, Ideal Architectural Platform.]
Using Platforms: the System Company View
[Figure labels: Architectural Space, Ideal Application Platform, Application Space.]
The Platform Concept
Meet-in-the-Middle
• Structured methodology that limits the space of exploration, yet achieves good results in limited time
• A formal mechanism for identifying the most critical hand-off points in the design chain
• A method for design re-use at all abstraction levels
• An intellectual framework for the complete electronic design process!
Texas Instruments OMAP
[Figure labels: Application Space, Application Instance, Platform Mapping, Platform (Semantic Platform), Platform Design-Space Export, Platform Instance, Architectural Space.]
The EXREAL Platform™
We provide integrated solutions based on LSI development platform, application platform and partnerships
• Integrated Solution Platform: integrated solutions including applied application (including collaboration with users)
• Application Platform: deployment to platform for each application
• Flexible Scalability, High Portability, Heterogeneous Structure
Separation of Concerns (ca. 1990!)
[Figure: development process from Specification through Analysis to Implementation.
Behavior Platform: behavior components (Matlab, Dymola, C-Code, IPs) with functions f1, f2, f3.
Virtual architectural components: CPUs, Buses, Operating Systems; ECU-1, ECU-2, ECU-3 on a Bus.
Steps: Mapping, Performance Analysis, Refinement, Evaluation of Architectural and Partitioning Alternatives.]
Platform-Based Design
• Platform: library of resources defining an abstraction layer
• Resources do contain virtual components, i.e., placeholders that will be customized in the implementation phase to meet constraints
• Very important resources are interconnections and communication protocols
[Figure labels: Application Space, Application Instance, Platform Mapping, Platform Design-Space Export, Platform Instance, Architectural Space.]
Fractal Nature of Design
[Figure: the platform picture repeated at successive abstraction levels. At each level a Function Instance from the Function Space is Mapped onto a Platform Instance from the Platform (Architectural) Space, with Platform Design-Space Export between levels.]
An Example: Wireless Sensor Networks
Source: Jan Rabaey
[Figure: three design levels (Network Level, Radio Node Level, Module Level). Each level pairs Functional & Performance Requirements with an architecture (Network Architecture, Node Architecture) through Performance analysis; Constraints link one level to the next.]
Formal Mechanism
[Figure 1 labels: Function, Function Space; Platform, Library Elements, Closure under constrained composition (term algebra), Platform Instance.]
[Figure 2 labels: Platform, Platform Instance, Mapping, Common Semantic Platform, All Platform behaviors (non deterministic).]
[Figure 3 labels: Function, Function Space, Platform Instance, Common Semantic Platform, Mapped Instance, Admissible Refinements.]
Platform-based Design for DFM
[Figure: Design Methods. Design side: Digital SoC P+R, Custom Layout, Layout Optimization, Parasitic Extraction, LVS/DRC. Hand-off: “Golden” GDS. Mask / Mfrg “Post Design” side: Sign-Off PV, Batch RET Treatment, Verification (OPC, CMP …), Yield Ramp and FA. Process steps: Litho, CMP, Etch.]
PBD Abstraction Links Implementation to Manufacturing
[Figure: the same flow with added labels Electrical Analysis, Physical Analysis, and RLC next to the Litho, CMP, and Etch steps.]
Hetero and Scalable Architecture
Courtesy: NEC
• Heterogeneous and scalable architecture for various markets
• Multi-core abstraction layer for software virtualization
[Figure: layered stacks for multi-core and various markets. Software layers: Application, Middleware (CPU M/W, DSP M/W), OS, Driver, Multi-core abstraction layer. Hardware: CPUs, DSP, dedicated HW, Hardware engine, Peripherals, Security. Side labels: Processes/devices, Circuits, Architecture, Software, Tools.]
Design Tools: Platform-Based Design for Integrated Building Management
Platform-Based Design for Dynamic Networks
Consequences
• There is no difference between HW and SW. Decision comes later.
• HW/SW implementation depends on the choice of component at the architecture platform level.
• Function/Architecture co-design happens at all levels of abstraction.
  – Each platform is an “architecture” since it is a library of usable components and interconnects. It can be designed independently of a particular behavior.
  – Usable components can be considered as “containers”, i.e., they can support a set of behaviors.
  – Mapping chooses one such behavior. A Platform Instance is a mapped behavior onto a platform.
  – Fractal: it applies to ALL levels of design from functional all the way to DFM.
Putting it All Together….
We need an integration platform:
• To deal with heterogeneity:
  – Where we can deal with Hardware and Software
  – Where we can mix digital and analog
  – Where we can assemble internal and external IPs
  – Where we can work at different levels of abstraction
  – Where we can work with performance-power trade-offs
• To handle the design chain
• To support integration
  – e.g. tool integration
  – e.g. IP integration
The integration platform must subsume the traditional design flow, rather than displacing it
Metropolis: an Environment for System-Level Design
• Motivation
– Both design complexity and the need for verification are increasing
– Semantic link between specification and implementation is necessary
• Platform-Based Design
– Meet-in-the-middle approach
– Separation of concerns
– Function vs. architecture
– Capability vs. performance
– Computation vs. communication
• Metropolis Framework
– Extensible framework providing simulation, verification, and synthesis capabilities
– Easily extract relevant design information and interface to external tools
• Released Sept. 15th, 2004
Metropolis Guiding Principles
• Unified MOC
• Formal Semantics
• Separation of Concerns
• Mapping Function to Architecture
[Figure labels: Metropolis, Methodologies, Tools.]
Fundamental Concepts
• Support for different Models of Computation
• Support for Architecture Specification and Analysis
• Mix of imperative and declarative specification styles
• Quantities of interest dictated by the designer, not the framework
• Framework designed to allow interfacing with external tools
Meta Frameworks
[Figure: abstract semantics (Tagged Signal Semantics, Process Networks Semantics, Firing Semantics, Stateful Firing Semantics) shown together with models of computation: Kahn process networks, dataflow, discrete events, synchronous/reactive, hybrid systems, continuous time.]
Metropolis provides a process networks abstract semantics and emphasizes formal description of constraints, communication refinement, and joint modeling of applications and architectures.
Metropolis Framework
• Functionality: What does it do?
• Architecture Platform: How is it done? At what cost?
• Mapping: Binding between the two
[Figure: the MetaModel language enters the Metamodel Compiler front end, which produces abstract syntax trees for back ends 1 to n (Simulator tool, Verification tool, … tool); the Metropolis Interactive Shell drives the compiler.]
Metropolis Modeling: Functional Modeling
• Network of processes with sequential program for each
• Unbounded FIFOs with multi-rate read and write
• Communication refined to bounded FIFOs and shared memories with finer primitives (called TTL API): allocate/release space, move data, probe space/data
[Figure: process networks at two abstraction levels, first with unbounded (∞) FIFOs, then refined.]
Mapping (Functional Network to Arch. Network): synch(…), synch(…), …
• Associate functional and architectural models explicitly and formally
• Add declarative constraints that associate events
• Accomplished with the “synch” keyword in MMM
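The refinement from unbounded FIFOs to bounded ones with finer primitives can be illustrated with a small sketch. This is not the TTL API itself: the class name `BoundedFifo` and the method names (`allocate_space`, `move_in`, `move_out`, `probe_space`, `probe_data`) are hypothetical, chosen only to mirror the three kinds of primitives listed above (allocate/release space, move data, probe space/data).

```python
from collections import deque


class BoundedFifo:
    """Illustrative bounded FIFO with separate allocate / move / probe steps."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = deque()   # committed data visible to the reader
        self.reserved = 0      # slots allocated by the writer but not yet filled

    def probe_space(self):
        """Number of free slots the writer could still allocate."""
        return self.capacity - len(self.items) - self.reserved

    def probe_data(self):
        """Number of items the reader could consume."""
        return len(self.items)

    def allocate_space(self, n):
        """Reserve n slots; report failure instead of blocking when full."""
        if n > self.probe_space():
            return False
        self.reserved += n
        return True

    def move_in(self, value):
        """Fill one previously reserved slot with data."""
        if self.reserved == 0:
            raise RuntimeError("no reserved slot")
        self.reserved -= 1
        self.items.append(value)

    def move_out(self):
        """Read the oldest item and release its slot."""
        if not self.items:
            raise RuntimeError("no data")
        return self.items.popleft()


fifo = BoundedFifo(capacity=2)
reserved_ok = fifo.allocate_space(2)
fifo.move_in("a")
fifo.move_in("b")
full_refused = not fifo.allocate_space(1)   # both slots are occupied
first = fifo.move_out()
```

Splitting a write into "allocate" and "move" is what lets an architecture model attach a cost to each step separately, which a single unbounded `write` cannot express.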
Metropolis Modeling: Architecture Modeling
• Mapped to resources with coarse service APIs
• Services annotated with performance models
• Interfaces to match the TTL API
• Cycle-accurate services and performance models
[Figure: architecture networks at two abstraction levels. Labels: DMA, DSP, HW, RAMs, RAMd, MemF, MemS, $ (cache).]
Metropolis Objects
Metropolis elements adhere to a “separation of concerns” ideology.
• Processes (Computation): active objects; sequential executing thread
• Media (Communication): passive objects; implement interface services
• Quantity Managers (Coordination): schedule access to resources and quantities
[Figure labels: Proc1 with ports P1, P2; Media1 with interfaces I1, I2; QM1.]
Meta-Model: Functional Netlist

process P {
  port reader X;
  port writer Y;
  thread() {
    while (true) {
      ...
      z = f(X.read());
      Y.write(z);
    }
  }
}

medium M implements reader, writer {
  int storage;
  int n, space;
  void write(int z) {
    await (space > 0; this.writer; this.writer)
      n = 1; space = 0; storage = z;
  }
  word read() { ... }
}

interface reader extends Port {
  update int read();
  eval int n();
}

interface writer extends Port {
  update void write(int i);
  eval int space();
}

[Figure: netlist MyFncNetlist containing processes P1 and P2 (each with ports X and Y), medium M, and environments Env1 and Env2.]
Meta-Model: Architecture Components
An architecture component specifies services, i.e.
• what it can do: interfaces
• how much it costs: quantities, annotation, logic of constraints

medium Bus implements BusMasterService … {
  port BusArbiterService Arb;
  port MemService Mem; …
  update void busRead(String dest, int size) {
    if (dest == … ) Mem.memRead(size);
    [[
      Arb.request(B(thisthread, this.busRead));
      GTime.request(B(thisthread, this.memRead),
                    BUSCLKCYCLE + GTime.A(B(thisthread, this.busRead)));
    ]]
  }
  …

scheduler BusArbiter extends Quantity implements BusArbiterService {
  update void request(event e) { … }
  update void resolve() { //schedule }
}

interface BusMasterService extends Port {
  update void busRead(String dest, int size);
  update void busWrite(String dest, int size);
}

interface BusArbiterService extends Port {
  update void request(event e);
  update void resolve();
}

[Figure labels: Bus, BusArbiter.]
Metro. Netlists and Events
Metropolis Architectures are created via two netlists:
• Scheduled – generate events (1) for services in the scheduled netlist.
• Scheduling – allow these events access to the services and annotate events with quantities.
[Figure: Scheduled Netlist (Proc1 and Proc2 with ports P1, P2; Media1 with interfaces I1, I2) and Scheduling Netlist (QM1, GlobalTime).]
(1) E. Lee and A. Sangiovanni-Vincentelli, A Unified Framework for Comparing Models of Computation, IEEE Trans. on Computer Aided Design of Integrated Circuits and Systems, Vol. 17, N. 12, pg. 1217-1229, December 1998
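Quantity Request – Service

Scheduled Netlist:
Task.Read() {
  CpuRtos.cpuRead();
}

CpuRtos.Read() {
  CS.Request(beg(Ti, this.cpuRead), csr);
  Bus.busRead();
  CS.Request(end(Ti, this.cpuRead), csr);
}

Scheduling Netlist:
CS.Resolve() {
  //Task scheduling algorithm;
}

[Figure: task Ti calls CpuRtos.cpuRead() in the scheduled netlist; CpuRtos issues CS.Request(beg(Ti, this.cpuRead), csr) to the CpuScheduler in the scheduling netlist and calls Bus.busRead(); the CpuScheduler runs CS.Resolve() and answers with setMustDo(e). GTime also sits in the scheduling netlist.]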
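The request/resolve split between the scheduled and scheduling netlists can be sketched as follows. This is an illustration only, not Metropolis code: the `CpuScheduler` class, the fixed-priority policy standing in for the "task scheduling algorithm", and the task names `T1` and `T2` are all assumptions made for the example.

```python
class CpuScheduler:
    """Hypothetical quantity manager: collects requests, grants one per resolve()."""

    def __init__(self, priority):
        self.priority = priority   # task name -> priority (lower number wins)
        self.pending = []          # outstanding (task, event label) requests

    def request(self, task, event):
        """Called from the scheduled netlist when a service begins or ends."""
        self.pending.append((task, event))

    def resolve(self):
        """Stand-in task scheduling algorithm: fixed priority, one grant per call."""
        if not self.pending:
            return None
        winner = min(self.pending, key=lambda req: self.priority[req[0]])
        self.pending.remove(winner)
        return winner


cs = CpuScheduler(priority={"T1": 2, "T2": 1})
cs.request("T1", "beg(cpuRead)")
cs.request("T2", "beg(cpuRead)")
grants = [cs.resolve(), cs.resolve(), cs.resolve()]
```

The point of the split is that tasks never decide the order themselves: they only post requests, and the policy lives entirely in `resolve`, so it can be swapped without touching the tasks.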
Modeling & Char. Review
[Figure: Scheduled Netlist with Task1, Task2, Task3, Task4 and media PPC, PLB, BRAM, DEDICATED HW; Scheduling Netlist with PPC Sched, PLB Sched, BRAM Sched, DedHW Sched, and GlobalTime; a Characterizer attached to the scheduled netlist. Legend: Media (scheduled), Process, Quantity Manager, Quantity, Enabled Event, Disabled Event.]
Heterogeneous IP Import in Metropolis
• Excessive time spent in design import
  – Redefining and implementing classes and methods
  – Memory allocation, data types, templates, etc.
• Challenges in Infineon case study
  – 802.11a on MuSIC (multiple SIMD core) architecture
[Figure: a collection of heterogeneous IP (different design teams, different languages, different MoCs) is rewritten in Metamodel to form the Metropolis design.]
Heterogeneous IP Import in Metro II
Pros
• Framework easier to develop and maintain
• Leverage existing compilers/debuggers
• Quicker import for most IP
Cons
• Framework has limited visibility
[Figure: a collection of heterogeneous IP enters the Metro II design through wrappers.]
Behavior-Performance Separation in Metropolis
• Processes make explicit requests for annotation
• Annotation/scheduling are intertwined
  – Iteration between multiple quantity managers
• Challenges in GM case study
  – Vehicle stability application on distributed CAN architecture
  – Interactions between global time QM and resource QM difficult to debug
[Figure: Phase 1 holds processes P1, P2 and resource R; Phase 2 holds Global Time and the Resource Scheduler. Steps: 1. Explicit quantity requests; 2. Quantity Resolution; 3. Granting of requests.]
Behavior-Performance Separation in Metro II
Pros
• Phase 1 objects no longer explicitly request annotation
• Separation of quantity managers into annotators and schedulers
  – “Global time” separates into physical time (annotation) and logical time (scheduling)
Cons
• Additional phase introduced into execution model
[Figure: Phase 1 holds P1, P2, and R; Phase 2 holds Physical Time; Phase 3 holds Logical Time and the Resource Scheduler. Steps: 1. Block processes at interfaces; 2. Annotations; 3. Sched. Resolution; 4. Enable some processes.]
Operational/Denotational Specification in Metropolis
• Constraints break operational encapsulation
  – Constraints between arbitrary pairs of events
  – Any state in scope of event may be used in constraints
• No special declarative constructs for mapping
• Challenges in Intel case study
  – JPEG encoding on MXP5800 heterogeneous multiprocessor
  – Keeping track of events, values, and constraints requires separate data structure
  – Hard to debug local variables involved in synchronization constraints

void func() {
  int a;
  event e1;
  int b;
  event e2;
}

void arch() {
  int c;
  event e3;
  int d;
  event e4;
}

sync(e1, e2, a == c)
sync(e3, e4, b <= d)
Operational/Denotational Specification in Metro II
• Accessible events are beg/end of interface methods
  – Values are either parameters or return values
• Mapping allocates functional components to architectural components
  – Coarser granularity
Updated Features for Metro II
• Import heterogeneous IP
  – Different languages
  – Different models of computation
• Behavior-Performance Separation
  – No explicit requests for annotation
  – Annotation separated from constraint solving
• Function-Architecture Separation
  – Explicit separate phases for function and architecture models
• Operational/Denotational Separation
  – Restricted access to events and values
  – Mapping carried out at various levels
[Figure labels: Coordination Framework, Event-oriented Framework, 4-Phase Execution.]
4-Phase Execution
1. Function: each function process proposes events and suspends
2. Architecture: architecture process triggered by function process proposes events and suspends
3. Annotation: tag proposed events with quantities (logical and physical)
4. Constraint solving: enable/disable events by solving the constraints (denotational and imperative)
[Figure: Extended Base Model with Function, Architecture, Annotation, and Constraint Solving.]
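One round of the four phases listed above can be sketched in a few lines. This is a deliberately simplified illustration, not the Metro II implementation: the function `run_round`, the dictionary-based events, the `mapping` and `cost` tables, and the "at most `capacity` enabled events per resource" constraint are all assumptions introduced for the example.

```python
def run_round(proposals, mapping, cost, capacity=1):
    """Run one function / architecture / annotation / constraint-solving round."""
    # Phase 1 (function): each function process proposes one event and suspends.
    events = [{"proc": p, "service": s, "tags": {}, "enabled": False}
              for p, s in proposals]

    # Phase 2 (architecture): the mapped architecture resource is triggered.
    for e in events:
        e["resource"] = mapping[e["service"]]

    # Phase 3 (annotation): tag each proposed event with a quantity.
    for e in events:
        e["tags"]["cycles"] = cost[e["resource"]]

    # Phase 4 (constraint solving): enable at most `capacity` events per resource;
    # the remaining events stay disabled and would be proposed again next round.
    used = {}
    for e in events:
        r = e["resource"]
        if used.get(r, 0) < capacity:
            e["enabled"] = True
            used[r] = used.get(r, 0) + 1
    return events


events = run_round(
    proposals=[("P1", "read"), ("P2", "read"), ("P3", "filter")],
    mapping={"read": "Bus", "filter": "Dsp"},
    cost={"Bus": 4, "Dsp": 10},
)
enabled = [e["proc"] for e in events if e["enabled"]]
```

Here `P1` and `P2` compete for the single bus, so only one of them is enabled in this round, while `P3` proceeds on its own resource.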
Events
An event is the fundamental concept in the framework
Fields:
• Process: generator of event
• Value Set: variables exposed along with event
• Tag Set: quantity annotations
E = <p, V, T>
Event States
[Figure: event state machine with states Inactive, Proposed, Waiting, and Enabled, entered at “start”; transitions are labeled “enabled by S” and “disabled by S”.]
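The event tuple E = <p, V, T> and the four event states can be put together in a short sketch. The state names come from the slide, but the exact transition arrows did not survive extraction, so the `TRANSITIONS` table below is an assumption (in particular the `propose` and `finish` actions and where each "enabled by S" / "disabled by S" arrow starts); the class and field names are likewise hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum


class State(Enum):
    INACTIVE = "inactive"
    PROPOSED = "proposed"
    WAITING = "waiting"
    ENABLED = "enabled"


# Assumed transition table: (current state, action) -> next state.
TRANSITIONS = {
    (State.INACTIVE, "propose"): State.PROPOSED,
    (State.PROPOSED, "enable"): State.ENABLED,    # "enabled by S"
    (State.PROPOSED, "disable"): State.WAITING,   # "disabled by S"
    (State.WAITING, "enable"): State.ENABLED,     # "enabled by S"
    (State.WAITING, "disable"): State.WAITING,    # "disabled by S"
    (State.ENABLED, "finish"): State.INACTIVE,
}


@dataclass
class Event:
    process: str                                  # p: generator of the event
    values: dict = field(default_factory=dict)    # V: exposed variables
    tags: dict = field(default_factory=dict)      # T: quantity annotations
    state: State = State.INACTIVE

    def step(self, action):
        """Apply one action, rejecting moves the table does not allow."""
        key = (self.state, action)
        if key not in TRANSITIONS:
            raise ValueError(f"{action!r} not allowed in state {self.state.name}")
        self.state = TRANSITIONS[key]
        return self.state


e = Event(process="P1", values={"size": 4})
trace = [e.step(a).name for a in ["propose", "disable", "enable", "finish"]]
```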
Phases and Events
Each phase is allowed to interact with events in a limited way
Keep responsibilities separate
Columns: Phase; Events (Propose, Disable); Tags (Read, Write); Values (Read, Write)
Func. Yes Yes Yes
Arch. Yes Yes Yes
Annotation Yes Yes Yes
Const. Sol. Yes Yes Yes
Mappers
Support mapping at various abstraction levels
• Event level
• Service level
• Interface level
• Component level
[Figure: a Mapper connecting a Func. Comp to an Arch. Comp.]
Service level triggering
[Figure: in the Function Phase, the begin (B) and end (E) events of a Functional Method each Trigger the corresponding begin (B) and end (E) events of an Arch Service in the Architecture Phase.]
Industrial Collaborations
• Infineon: Software Defined Radios
• Intel: Mobile Platforms, Multi-media Platforms
• General Motors: Next generation car architectures
• United Technologies: Elevator (OTIS), Air conditioning (CARRIER), Security (Chubb)
• Xilinx: Programmable platform modeling
Core Theme Overview and Design Flow
Jason Cong, Daniel D. Gajski, Wen-mei Hwu, Andrew Kahng, Radu Marculescu, Jaijeet Roychowdhury
[Figure: design flow targeting SW, SOC Platform, and HW.
Modeling & Simulation Side:
• Input: C descriptions, MMMs, SystemC, C++
• Analysis (UIUC): parallelism extraction, code cleanup, performance/area estimations
• Concurrent Functional Model (after architecture independent optimization)
• TLM Equivalence Checker (UCI): Transformation Rules, Design Decisions, Design Optimization; App1 + Platform1 and App2 + Platform2 each pass through TLM Gen. to give TLM1 and TLM2; output is an Equivalence Result
• Metro-II Framework (UCB) with Metro-TLM Library: 1. Import Functional Model (i.e. h.264, UMTS); 2. Check Equiv. (Model Algebra); 3. Create Arch Services (i.e. Xilinx); 4. Map and Simulate
• Co-simulation: MetroII functional model and ASPN architectural model exchange an event trace and an annotated event trace (MetroII simulation, MCSim simulation)
• Refinement: a high level MetroII model is refined to an ASPN model with mapped functionality; MCSim simulation (cycle accurate) returns abstracted performance annotation
• Analog Sim
Implementation Side (UCLA/UCB/UCSD/CMU/Columbia):
• ASPN Synthesis (MetroII), ASPN Synthesis/Verification, ASIP synthesis, xPilot HW/MCsim synthesis
• Processor Library: ARM, PowerPC, MicroBlaze …
• Communication Synthesis, NOC synthesis, and physical modeling of interconnection and logic
• Analog synthesis, Desynchronization
• Synthesizable RTL
Compare implementation results with simulation results]
Engineering Tomorrow’s Designs: Synthetic Biology
The creation of novel biological functions and tools by modifying or integrating well-characterized biological components into higher-order systems using mathematical modeling to direct the construction towards the desired end product.
“Building life from the ground up” (Jay Keasling, UCB), keynote presentation, World Congress on Industrial Biotechnology and Bioprocessing, March 2007.
Development of foundational technologies:
• tools for hiding information and managing complexity
• core components that can be used in combination reliably
Moving from ad-hoc to structured design
Pioneering Synthetic Biology [Reference: Scientific American, June 2006]
Engineering Tomorrow’s Designs
Similar Considerations Hold for the Nano-Electronics and Nano-mechanics Arenas
A Disciplined Platform-Based Design Methodology ??
• Exploration of scalable computational fabrics
• Deriving useful abstractions and interfaces
• Developing modeling and characterization environments
• Automating the synthesis process
• Populating the design space
Source: J. Rabaey
Core Theme Overview and Design Flow
Alberto Sangiovanni-Vincentelli, Luca Carloni, Jason Cong, Daniel D. Gajski, Wen-mei Hwu, Andrew Kahng, Radu Marculescu, Jaijeet Roychowdhury
[Figure: the GSRC design flow.
Modeling & Simulation Side:
• Input: C descriptions, MMMs, SystemC, C++
• Analysis (UIUC): parallelism extraction, code cleanup, performance/area estimations
• Concurrent Functional Model (after architecture independent optimization)
• TLM Equivalence Checker (UCI): Transformation Rules, Design Decisions, Design Optimization; App1 + Platform1 and App2 + Platform2 each pass through TLM Gen. to give TLM1 and TLM2; output is an Equivalence Result
• Metro-II Framework (UCB) with Metro-TLM Library: 1. Import Functional Model (i.e. h.264, UMTS); 2. Check Equiv. (Model Algebra); 3. Create Arch Services (i.e. Xilinx); 4. Map and Simulate
• ASPN Simulation and Co-simulation: MetroII functional model and ASPN architectural model exchange an event trace and an annotated event trace (MetroII simulation, MCSim simulation)
• Refinement: a high level MetroII model is refined to an ASPN model with mapped functionality; MCSim simulation (cycle accurate) returns abstracted performance annotation
• Analog Sim
Implementation Side (UCLA/UCB/UCSD/CMU/Columbia):
• ASPN Synthesis Engine
• Processor Synthesis: customized coprocessor, ASIPs; Processor Library: ARM, PowerPC, ASIP
• xPilot: process mapping
• Communication Synthesis, NOC synthesis, and physical modeling of interconnection and logic]
Outline - Synthesis for customized logic
• Overview of behavioral synthesis
• Scheduling
  – Task description
  – Scheduling algorithms: ASAP/ALAP, list scheduling, force-directed scheduling, mathematical-programming-based scheduling, scheduling for low power
• Binding
  – Task description
  – Classification of binding algorithms
  – Binding algorithms: left-edge, min-cost network flow, multi-vdd binding
• Architectural synthesis for multi-cycle communication (MCAS)
IC Design Steps [©Sherwani]
[Figure: design representations (System-Level Specification, Behavior-level Description, RT-Level Description, Generic Logic Description, Gate/Circuit Design, Placed & Routed Design) connected by the steps Behavioral Synthesis, Logical Synthesis, Technology Mapping, Physical Design, Fabrication, and Packaging.]
Example logic descriptions:
X=(AB*CD)+(A+D)+(A(B+C))
Y = (A(B+C)+AC+D+A(BC+D))
Behavioral Synthesis example:
C Program: void f (int var) { int [] array; ….. }
VHDL/Verilog: entity f is port (…) architecture behav…
Advantages of Behavioral Synthesis
• Shorter verification/simulation cycle
  – 100X speed up with behavior-level simulation
• Better complexity management, faster time to market
  – 10M gate design may require 700K lines of RTL code
• Rapid system exploration
  – Quick evaluation of different hardware/software boundaries
  – Fast exploration of multiple micro-architecture alternatives
• Higher quality of results
  – Platform-based synthesis & optimization
  – Full consideration of physical reality
Subtasks in High-Level Synthesis
• Scheduling determines when an operation will be executed
• Allocation determines the number of instances of each type of resource
• Binding binds operations, variables, or data-transfers to the resources
[Figure: a DFG of + and × operations is scheduled over control steps 1 to 6 (scheduling & allocation), then its operations and variables are bound to ALUs (binding). Resources: 2 adders, 2 multipliers.]
xPilot: Behavioral-to-RTL Synthesis Flow [Cong’06]
Inputs: behavioral spec. in C/SystemC; platform description
• Frontend compiler
• SSDM
• Presynthesis optimizations
  – Loop unrolling/shifting
  – Strength reduction / Tree height reduction
  – Bitwidth analysis
  – Memory analysis …
• Core synthesis optimizations
  – Scheduling
  – Resource binding, e.g., functional unit binding, register/port binding
• μArch-generation & RTL/constraints generation
  – Verilog/VHDL/SystemC
  – FPGAs: Altera, Xilinx
  – ASICs: Magma, Synopsys, …
Outputs: RTL + constraints, for FPGAs/ASICs
The Scheduling Task
Behavioral model: CDFG (control-data flow graph)
• Nodes: operations
• Directed edges: data edges; control edges (branches, loops)
• Generated by a compiler frontend from a high-level description (C/VHDL/others): parse, then compiler optimizations
• If there is no control structure, a data-flow graph (DFG) is sufficient
RTL model: finite state machine (FSM)
• Each clock cycle corresponds to a state in the FSM
• Scheduling: map operations to states
do {
    xl = x + dx;
    ul = u - 3*x*u*dx - 3*y*dx;
    yl = y + u*dx;
    c = xl < a;
    x = xl; u = ul; y = yl;
} while (c);
Impact of Scheduling
Performance
• Latency/throughput, for a given clock cycle time
Area
• Functional units
• Registers
• Multiplexers
Power, reliability, etc.
Unconstrained Scheduling
Only consideration: dependencies
As-soon-as-possible (ASAP) schedule
• Schedule each operation to the earliest possible step
As-late-as-possible (ALAP) schedule
• Schedule each operation to the latest possible step, without increasing the total latency
[Figure: ASAP schedule and ALAP schedule of the same DFG with operations +, *, *, −, +.]
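The two unconstrained schedules can be sketched in a few lines of Python. This is a minimal illustration rather than the tutorial's own code: the five-operation DFG, the operation names, and the unit delays are assumptions made up for the example, and c-steps are numbered from 0.

```python
from collections import defaultdict

def asap(ops, deps, delay):
    """Earliest start step of each operation; ops must be in topological order."""
    start = {}
    for op in ops:
        start[op] = max((start[p] + delay[p] for p in deps[op]), default=0)
    return start

def alap(ops, deps, delay, latency):
    """Latest start step that still lets everything finish within `latency` steps."""
    succs = defaultdict(list)
    for op in ops:
        for p in deps[op]:
            succs[p].append(op)
    start = {}
    for op in reversed(ops):  # successors are visited before predecessors
        finish = min((start[s] for s in succs[op]), default=latency)
        start[op] = finish - delay[op]
    return start

# Hypothetical DFG (not the one in the slide figure).
ops = ["add1", "mul1", "mul2", "sub1", "add2"]
deps = {"add1": [], "mul1": [], "mul2": ["add1", "mul1"], "sub1": ["mul2"], "add2": []}
delay = {op: 1 for op in ops}

asap_steps = asap(ops, deps, delay)
latency = max(asap_steps[op] + delay[op] for op in ops)
alap_steps = alap(ops, deps, delay, latency)
mobility = {op: alap_steps[op] - asap_steps[op] for op in ops}
```

The difference between the ALAP and ASAP steps (the mobility) is zero for operations on the critical path; here only `add2` has freedom to move.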
Resource-Constrained Scheduling
When functional units are limited
• Each functional unit can perform only one operation in each clock cycle
• ASAP does not guarantee that resource constraints are met
A resource-constrained scheduling problem for a DFG
• Given the number of functional units of each type, minimize latency
• Resource constraint: if there are only k adders, no more than k additions can be executed in the same c-step
• NP-hard! Reduces to multiprocessor scheduling when resources are identical
• Usually solved heuristically using list scheduling [Pangrle & Gajski, 87]
List Scheduling Algorithm
Constructive algorithm for resource-constrained scheduling
• Proceeds c-step by c-step
• Standard compiler technique
• Maintain a list of 'ready' operations, considering dependencies
• Select one operation from the ready operations based on some priority function

curStep ← 0
while (there are unscheduled operations) {
    curStep ← curStep + 1
    while (there are data-ready operations and available resources) {
        op ← the ready operation with highest priority
        schedule op to curStep
        update the ready list
        update priorities
    }
}
List Scheduling Algorithm
Commonly used priority functions for latency optimization
• Nodes with smaller ALAP values are picked first
• Nodes with more successors are picked first
[Figure: list-scheduling example with 1 adder and 1 multiplier. Ready operations: +1, *1, +2, *2, +3. When there are not enough resources in the current step, the remaining ready operations proceed to the next step.]
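A minimal Python sketch of list scheduling follows, assuming unit-latency operations and using the ALAP step as the priority (smaller is more urgent). The dependencies between the five operations are hypothetical, since the slide figure does not survive in this transcript; only the resource budget of 1 adder and 1 multiplier comes from the slide.

```python
def list_schedule(ops, deps, kind, units, priority):
    """Resource-constrained list scheduling for unit-latency operations.

    deps[op]: predecessors; kind[op]: resource type; units[type]: number of
    functional units; priority[op]: smaller value = scheduled first.
    """
    step_of = {}
    step = 0
    while len(step_of) < len(ops):
        step += 1
        free = dict(units)
        # Ready = unscheduled operations whose predecessors ran in earlier steps.
        ready = [o for o in ops
                 if o not in step_of
                 and all(step_of.get(p, step) < step for p in deps[o])]
        for op in sorted(ready, key=lambda o: priority[o]):
            if free[kind[op]] > 0:
                free[kind[op]] -= 1
                step_of[op] = step
    return step_of

# Hypothetical DFG: m2 needs a1 and m1, a3 needs m2, a2 is independent.
ops = ["a1", "m1", "a2", "m2", "a3"]
deps = {"a1": [], "m1": [], "a2": [], "m2": ["a1", "m1"], "a3": ["m2"]}
kind = {"a1": "add", "a2": "add", "a3": "add", "m1": "mul", "m2": "mul"}
alap_step = {"a1": 1, "m1": 1, "m2": 2, "a2": 3, "a3": 3}  # used as priority

schedule = list_schedule(ops, deps, kind, {"add": 1, "mul": 1}, alap_step)
```

In step 1 both `a1` and `a2` are ready, but only one adder exists, so the lower-priority `a2` waits for step 2.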
Force-Directed Scheduling Algorithm
Time-constrained scheduling
• The deadline (latency) of the DFG is given as a constraint
Force-directed scheduling [Paulin & Knight, 1989]
• Tries to reduce hardware resources
• Balances the concurrency of operations to ensure high utilization of each unit: functional units, registers, interconnect
Force-Directed Scheduling Algorithm
Constructive
• Operation by operation
Guided by 'force'
• Force is defined for every possible assignment of an unscheduled operation i to c-step j
• Each iteration commits the assignment with the least force

while (there are unscheduled operations) {
    create distribution graph;
    lowest_force ← infinity;
    for (each unscheduled operation i) {
        for (each feasible c-step j for i) {
            calculate the force of scheduling i in c-step j, force(i,j);
            if (force(i,j) < lowest_force) {
                lowest_force ← force(i,j);
                best_op ← i; best_step ← j;
            }
        }
    }
    schedule best_op to best_step;
}
Force-Directed Scheduling Algorithm
Determine ASAP & ALAP schedules
Determine the time frame of each operation
• Length: possible range
• Width: probability
• Uniform distribution
Force-Directed Scheduling Algorithm
Create a distribution graph (DG) for each kind of operation unit
• DG(i): the expected number of operations in c-step i
For a DFG
• Minimize functional units: minimize $\max_i DG(i)$
• Revised formulation: minimize $\sum_i DG(i)^2$
• Analogous to spring system energy minimization
• Note: $\sum_i DG(i)$ is constant; when there are no dependencies between operations, the revised formulation is equivalent to the original one
Force-Directed Scheduling Algorithm
Scheduling an operation changes the distribution graph
Self force of an assignment of i to j
• $\mathrm{self\_force}(i,j) = \sum_k DG(k) \cdot x(i,j,k)$
• k is the c-step index
• x(i,j,k) is the change of DG(k) after assigning i to j
Example: trying to schedule the multiplication M5
• To c-step 1: self_force(M5, 1) = DG(1)*x(M5,1,1) + DG(2)*x(M5,1,2) = 2.83*0.5 + 2.33*(-0.5) = 0.25
• To c-step 2: self_force(M5, 2) = DG(1)*x(M5,2,1) + DG(2)*x(M5,2,2) = 2.83*(-0.5) + 2.33*0.5 = -0.25
A desirable schedule should have a negative self force
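The distribution graph and the self force can be computed as below. This is a sketch under the uniform-distribution assumption stated earlier: an operation with time frame [lo, hi] contributes 1/(hi − lo + 1) to every c-step in its frame, and assigning it to step j changes DG by (1 − p) at j and by −p at the other steps of the frame. The function names and the three-operation time frames are hypothetical; the DG values 2.83 and 2.33 are the ones quoted in the M5 example.

```python
def distribution_graph(frames, steps):
    """DG(k): expected number of operations in c-step k."""
    dg = {k: 0.0 for k in steps}
    for lo, hi in frames.values():
        p = 1.0 / (hi - lo + 1)  # uniform probability over the time frame
        for k in range(lo, hi + 1):
            dg[k] += p
    return dg

def self_force(dg, frame, j):
    """Sum over k of DG(k) * x(i, j, k) for an operation with the given frame."""
    lo, hi = frame
    p = 1.0 / (hi - lo + 1)
    return sum(dg[k] * ((1.0 if k == j else 0.0) - p) for k in range(lo, hi + 1))

# Hypothetical time frames for three multiplications.
frames = {"M1": (1, 1), "M2": (1, 2), "M3": (2, 3)}
dg_small = distribution_graph(frames, range(1, 4))

# Slide example: DG(1) = 2.83, DG(2) = 2.33, M5 may go to c-step 1 or 2.
dg_slide = {1: 2.83, 2: 2.33}
force_step1 = self_force(dg_slide, (1, 2), 1)  # about  0.25
force_step2 = self_force(dg_slide, (1, 2), 2)  # about -0.25
```

The negative force for c-step 2 reflects that step 2 is currently less crowded than step 1, so placing M5 there balances the distribution.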
Force-Directed Scheduling Algorithm
Predecessor & successor forces
• Scheduling an operation may affect the time frames of other operations
• Predecessor & successor forces = sum of self forces for implicitly scheduled operations
• Force(i,j) = self_force(i,j) + predecessor&successor_force(i,j)
Look-ahead
• When we try an assignment, consider the effect on DG(i) due to all implied assignments
• Minimize $\sum_i DG(i)^2$
Can be extended to balance register lifetime, communication, etc.
DG computation can be extended for branches and loops
Classical Integer Linear Programming Formulation
An exact formulation
0-1 assignment variables
• $x_{ij} = 1$ if operation i is scheduled to c-step j, otherwise 0
• Constraint: $\sum_j x_{ij} = 1$
Resource modeling
• The number of additions in c-step j is $\sum_{i \text{ is addition}} x_{ij}$
The c-step for node i
• $t(i) = \sum_j j \cdot x_{ij}$
Dependency constraints
• $t(i_1) - t(i_2) \ge delay(i_2)$ if $i_1$ depends on $i_2$
Overall latency for the DFG: $\max_i t(i)$
Various scheduling problems can be modeled
Classical Integer Linear Programming Formulation
Pros: modeling ability
• Can be extended to handle almost every design aspect: resource allocation, module selection, area, power, etc.
Cons: computationally expensive
• #variables = O(#operations × #c-steps)
• 0-1 assignment variables: need extensive search to find the optimal solution
Resource Allocation & Scheduling Using ILP
Given the total area for functional units, minimize the latency of the DFG.
$$\min t$$
subject to
$$\sum_j x_{ij} = 1 \quad \text{for all operations } i$$
$$t \ge \sum_j j \cdot x_{ij} \quad \text{for all operations } i$$
$$\sum_{type(i)=k} x_{ij} \le r_k \quad \text{for all c-steps } j \text{ and resource types } k$$
$$\sum_k Area_k \cdot r_k \le Area$$
$$\sum_j j \cdot x_{i_1 j} - \sum_j j \cdot x_{i_2 j} \ge delay(i_2) \quad \text{for all dependencies of } i_1 \text{ on } i_2$$
SDC-Based Scheduling Algorithm
J. Cong and Z. Zhang. "An Efficient and Versatile Scheduling Algorithm Based On SDC Formulation". DAC 2006.
Motivation: a more efficient ILP formulation
• Use the c-step index directly: sv(i) is the c-step for operation i
• #variables = O(#operations)
• Restrict the type of constraints to integer difference constraints, $sv(i) - sv(j) \le b$
Idea: use a system of difference constraints to model all constraints in scheduling
• Some constraints are modeled approximately or heuristically
Advantage: easy to get integer solutions (details later)
SDC-Based Scheduling Algorithm
Scheduling variable definition
• For each operation node v in the CDFG, define $\{sv_i(v) \mid i \in [0, K]\}$, where $K = Latency(v)$
• $\forall v, \forall i$: the value of $sv_i(v)$ is a non-negative integer
• $sv_i(v) - sv_{i-1}(v) = 1$
• Let $sv_{beg}(v) \equiv sv_0(v)$ and $sv_{end}(v) \equiv sv_K(v)$
Example: v is a multiplication with Latency(v) = 2 (stages *0, *1, *2)
• $sv_0(v) \equiv sv_{beg}(v)$ and $sv_2(v) \equiv sv_{end}(v)$
• $sv_1(v) - sv_0(v) = 1$
• $sv_2(v) - sv_1(v) = 1$
SDC-Based Scheduling Algorithm
The value of a scheduling variable describes the relative temporal position of one pipeline stage of an operation node in the final schedule
[Figure: DFG with operations v1 to v5 (+, *, *, −, +) scheduled over control steps CS0 to CS6]
• $sv_0(v_3) \equiv sv_{beg}(v_3) = 3$
• $sv_1(v_3) = 4$
• $sv_2(v_3) \equiv sv_{end}(v_3) = 5$
SDC-Based Scheduling Algorithm
Integer difference constraints
• A special kind of linear constraint
• Of the form $sv(i) - sv(j) \le b$, where b is an integer
• Very powerful and well suited to scheduling problems
System of difference constraints (SDC)
• Totally unimodular matrix (TUM): an integer matrix A is called totally unimodular if every square, nonsingular submatrix of A has a determinant of +1 or −1
• The constraint matrix of an SDC is totally unimodular
SDC-Based Scheduling Algorithm
Theorem on integrality
• An LP model based on a TUM matrix is solvable by linear programming in polynomial time with integer solutions [Papadimitriou and Steiglitz, Combinatorial Optimization, 1982]
• Or, equivalently, the extreme points of the polyhedron defined by an SDC are vectors of integers
Steps of the algorithm
• Model constraints using SDC
• Model the objective using a linear expression of scheduling variables
• Solve an LP and get an integer solution, then generate the FSM
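To see why integer solutions come easily, note that a system of difference constraints can also be checked for feasibility with a shortest-path computation. The sketch below swaps in that standard technique (Bellman-Ford on the constraint graph) instead of the LP solve described in the slides; it only finds some feasible integer schedule and does not optimize any objective. The function name `solve_sdc` and the three-variable system are hypothetical.

```python
def solve_sdc(variables, constraints):
    """Find integers satisfying every (i, j, b), meaning x[i] - x[j] <= b.

    Returns a dict of non-negative integers, or None if the system is infeasible.
    """
    dist = {v: 0 for v in variables}  # implicit source with 0-weight edges
    for _ in range(len(variables)):
        changed = False
        for i, j, b in constraints:
            if dist[j] + b < dist[i]:  # relax edge j -> i with weight b
                dist[i] = dist[j] + b
                changed = True
        if not changed:
            break
    else:
        return None  # still relaxing after |V| passes: negative cycle
    shift = -min(dist.values())
    return {v: d + shift for v, d in dist.items()}

# Hypothetical start-step variables a, b, c:
#   b starts at least 1 step after a:  a - b <= -1
#   c starts at least 2 steps after b: b - c <= -2
#   c starts at most 4 steps after a:  c - a <= 4
feasible = solve_sdc("abc", [("a", "b", -1), ("b", "c", -2), ("c", "a", 4)])

# Tightening the last constraint to c - a <= 2 contradicts the first two.
infeasible = solve_sdc("abc", [("a", "b", -1), ("b", "c", -2), ("c", "a", 2)])
```

Because every bound b is an integer, the shortest-path distances are integers too, which mirrors the integrality property of the TUM-based LP.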
SDC-Based Scheduling Algorithm: Constraint Generation
Dependency constraint
• Data dependency: $sv_{end}(u) - sv_{beg}(v) \le 0$ if v depends on u
• Control dependency: introduce artificial nodes
  • Super-source of $bb_i$: $ssrc(bb_i)$, with $\forall v \in bb_i$: $sv_{end}(ssrc(bb_i)) - sv_{beg}(v) \le 0$
  • Super-sink of $bb_i$: $ssnk(bb_i)$, with $\forall v \in bb_i$: $sv_{end}(v) - sv_{beg}(ssnk(bb_i)) \le 0$
  • If $bb_j$ depends on $bb_i$ after backward edge removal: $sv_{end}(ssnk(bb_i)) - sv_{beg}(ssrc(bb_j)) \le 0$
SDC-Based Scheduling Algorithm: Constraint Generation
Relative timing constraint
• A minimum timing constraint between $v_i$ and $v_j$: $v_j$ follows $v_i$ by at least L clock cycles, so $sv_{beg}(v_i) - sv_{beg}(v_j) \le -L$
• Similar for a maximum timing constraint, e.g., a latency constraint
Cycle time (frequency) constraint
• Given a target clock period T, the maximum combinational delay within a clock cycle must not exceed T
• $sv_{beg}(v_i) - sv_{end}(v_j) \le -(\lceil CombDelay(v_i, v_j) / T \rceil - 1)$
• Prevents chaining a long combinational path in one cycle
[Figure: DFG with operations v1 to v5 (+, *, *, −, +)]
Example
• Addsub (+/−) takes 2 ns; Mult (*) takes 5 ns
• Consider path v2-v3-v5: $sv_{beg}(v_2) - sv_{end}(v_5) \le -1$
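The right-hand side of the cycle-time constraint is simple arithmetic, sketched below. Two things are assumptions here because the transcript does not state them: that v2 and v3 are multiplications and v5 is an addition (read off the figure residue), and that the target clock period is 10 ns. Under those assumptions the path delay is 12 ns and the bound comes out as −1, matching the constraint on the slide.

```python
import math

def cycle_time_bound(comb_delay_ns, clock_period_ns):
    """Bound b in sv_beg(vi) - sv_end(vj) <= b for a combinational path vi..vj."""
    return -(math.ceil(comb_delay_ns / clock_period_ns) - 1)

delay_ns = {"addsub": 2.0, "mult": 5.0}  # from the slide
# Assumed operation types on path v2-v3-v5: mult, mult, addsub.
path_delay = delay_ns["mult"] + delay_ns["mult"] + delay_ns["addsub"]

bound_at_10ns = cycle_time_bound(path_delay, 10.0)  # assumed T = 10 ns
bound_at_12ns = cycle_time_bound(path_delay, 12.0)  # whole path fits in one cycle
```

With a 12 ns clock the bound relaxes to 0, meaning the whole path may be chained in a single cycle.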
SDC-Based Scheduling Algorithm: Constraint Generation
Resource constraint as SDC
• For each type of resource, heuristically create an ordered list of the operations, using priorities as in list scheduling to decide the order
• Serialize the operation pairs that are k positions apart in the list, where k is the number of available resources
[Figure: example with a resource constraint of 2 adders]
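The serialization step can be sketched as follows. The exact form of the generated constraint is an assumption: here operation u must complete before the operation k positions later in the ordered list starts, written as sv_beg(u) − sv_beg(v) ≤ −Latency(u). The operation names and the ordering are hypothetical.

```python
def resource_constraints(ordered_ops, k, latency):
    """Serialize operations that are k positions apart in the priority order.

    Each tuple (u, v, b) stands for the constraint sv_beg(u) - sv_beg(v) <= b.
    """
    return [
        (ordered_ops[i], ordered_ops[i + k], -latency[ordered_ops[i]])
        for i in range(len(ordered_ops) - k)
    ]

# Four additions ordered by some list-scheduling priority, with 2 adders.
adds = ["add1", "add2", "add3", "add4"]
constraints = resource_constraints(adds, 2, {op: 1 for op in adds})
```

With 2 adders, `add1` and `add2` may run together, while `add3` must wait for `add1` and `add4` for `add2`, so no more than two additions overlap.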
SDC-Based Scheduling Algorithm: Versatile Objectives
Modeling objectives as linear expressions
• ASAP objective: $\min \sum_v sv_{beg}(v)$
• ALAP objective: $\max \sum_v sv_{beg}(v)$
• Longest path latency: $\min sv_{end}(ssnk(exit\text{-}bb))$
• Expected overall latency: weighted sum of basic block latencies, $\min \sum_i \alpha_i (sv_{end}(bb_i) - sv_{beg}(bb_i))$
• Slack distribution: $\min \sum_{v \text{ depends on } u} (sv_{beg}(v) - sv_{end}(u))$
Scheduling for Low Power
A lot of opportunities for power reduction during scheduling
• An active research area
Explore the design space of
• Multiple Vdd/Vth [Johnson & Roy, 97; Tang et al., 05]
• Variable Vdd/Vth [Shin & Kim, 05]
• Others: clock gating, power gating, etc.
Algorithms
• Integer linear programming
• Various heuristics
Scheduling with Integer Time Budgeting
Wei Jiang, Zhiru Zhang, Miodrag Potkonjak and Jason Cong, “Scheduling with Integer Time Budgeting for Low-Power Optimization”, ASPDAC 2008.
A follow-up work of SDC-based scheduling, targeting low power
Integer time budgeting problem
• Time budgeting: distributing slack to different modules of a design to optimize some objective (such as area or power)
• Example: reduce area/power by using slower adders
• In scheduling, we require these slacks to be integers
Mathematical programming formulation
• Scheduling variables similar to those in SDC-based scheduling
• Introduce a time-budget variable for each node
• Convex separable objective
Mathematical Formulation of Time Budgeting Problem
Directed acyclic graph: G = (V, E)
$$\min \sum_{i=1}^{|V|} f_i(b_i)$$
subject to
$$s_i + b_i \le s_j \quad \forall e(i,j) \in E$$
$$b_i \ge d_i \quad \forall v_i \in V$$
$$s_i \ge 0 \quad \forall v_i \in PIs$$
$$s_i \le T \quad \forall v_i \in POs$$
• $s_i$: start time of node $v_i$
• $d_i$: minimum latency of $v_i$
• $b_i$: time budget at node $v_i$
Each $f_i$ is a single-variable convex function
Linearly constrained separable convex optimization problem
[Figure: DAG with nodes v1 to v6 and deadline T; power-delay tradeoff curve with Delay (ns) from 1 to 4 and Power (mW) values 100, 50, 30]
Totally Unimodular Constraint Matrix
Theorem 1: The constraint matrix is a TUM
Example (v1 and v2 both precede v3)
• s1 + b1 – s3 ≤ 0
• s2 + b2 – s3 ≤ 0
• s1 ≥ 0
• s2 ≥ 0
• s3 ≤ T
Constraint matrix (one row per constraint above)
      s1   s2   s3   b1   b2   b3
       1    0   -1    1    0    0
       0    1   -1    0    1    0
       1    0    0    0    0    0
       0    1    0    0    0    0
       0    0    1    0    0    0
Optimizing Separable Convex Objective
[Figure: power-delay tradeoff curve (Delay (ns) from 1 to 4, Power (mW) values 100, 50, 30) and its piecewise-linear (PWL) approximation]
Application to Low-Power Scheduling
Motivation
• Scheduling and time budgeting are highly correlated
Problem description
• Consider the scheduling and budgeting problems together to minimize the average power under time constraint T
Power modeling
• Power-delay tradeoff curve (convex)
• Total node power: the sum of the power of all the nodes (operations)
• Total FU power: the sum of the power of all functional units, considering resource sharing
Low-Power Scheduling
Each node $v_i \in V_{op}$ is associated with a node budgeting variable $bv(v_i)$, which denotes the number of clock cycles that operation $v_i$ lasts in the final schedule
Adjust the following constraints
• Data dependence constraint: $\forall (u, v) \in E_d$: $sv_{beg}(u) + bv(u) \le sv_{beg}(v)$
• Latency constraint T: $\forall v \in V_{op}$: $sv_{beg}(v) + bv(v) \le T$
• Throughput constraint with initiation interval II: $\forall v \in V_{op}$: $bv(v) \le II$
Optimizing total node power
• We can optimally minimize the total node power in polynomial time
$$\min \sum_{i=1}^{|V_{op}|} pw_{op(v_i)}(bv(v_i))$$
Consideration of Resource Binding
Optimizing total FU power
$$\min \sum_{j=1}^{|F|} |f_j| \cdot pw_{op(f_j)}(bv(f_j))$$
• The constraint matrix is no longer totally unimodular under the requirement that all operations sharing the same functional unit must have the same slack
• The problem is NP-complete (reduction from 3-SAT)
Proposed heuristic
• First solve the continuous version and obtain the “optimal” fractional budget $fb(v_i)$ for each node $v_i$
• Perform a global rounding by minimizing the least-squares error; the objective function is separable convex
$$\min \sum_{i=1}^{|V_{op}|} (bv(v_i) - fb(v_i))^2$$
Introduction of Binding
Binding maps operations, variables, or data transfers to the resources
• Functional unit, register, memory array, multiplexer, bus…
• Resources are usually shared to save area cost
FU binding
• Goal is to minimize the number of FUs
Register binding
• Goal is to minimize the number of registers
Advanced binding
• Goal is to optimize and trade off multiple design qualities, including total area, interconnections, clock period, power…
Binding Example
[Figure: a scheduled DFG with operations 1 (*), 2 (*), 3 (+), 4 (+)]
FUs (registers) are shared by operations of the same type (variables) whose lifetimes do not overlap
Lifetime: [birth-time, death-time]
• Operation: the whole execution time
• Variable: from the time the variable is generated to the time it is last read
Datapath uArch Model
[Figure: datapath in which variables are bound to registers, which feed multiplexers and functional units (MUL, ALU)]
Possible Positions of Binding in Entire Behavioral Synthesis Flow
After scheduling is done
• Decides resource usage and detailed architecture
Before scheduling is done
• Affects both area and delay
Simultaneous scheduling and binding
• Globally better results
Binding Works Classified by Algorithms
Graph-based algorithms
• Clique partitioning [Tseng CAD’86] [Paulin DAC’86]
• Left-edge [Kurdahi DAC’87]: minimum number of registers and FUs
• Bipartite [Huang DAC’90]
Branch-and-bound [Pangrle DAC’88]
Integer Linear Programming (ILP) [Gebotys JSSC’92] [Rim DAC’92]
• Pros: optimal solution
• Cons: scalability; difficult to formulate versatile constraints
Simulated annealing
• Simultaneous allocation and binding of all resources [Devadas TCAD’89] [Choi TDAES’99]
• Pros: considers multiple optimization parameters together for globally better results
• Cons: run-time and scalability
Min-cost network flow [Kim CICC’95] [Chang DAC’95] [Chang DATE’96] [Lyuh TVLSI’03] [Chen ASPDAC’04] [Chen DAC’06]
• Formulate binding problems as a min-cost network flow
• Edge cost represents the optimization goal, such as interconnections, power, …
……
The Order of FU and Register Binding
Inter-dependency exists between FU binding and register binding
• To minimize interconnection, each task needs the other’s result to make an accurate decision
[Figure: (1) a scheduling example with operations o1 to o3 and variables v1 to v3 over steps 1 to 4, under resource constraints of 2 FUs (F1, F2) and 2 REGs (R1, R2); (2a) FU binding followed by REG binding, grouping operations as {o1, o2}, {o3}; (2b) REG binding followed by FU binding, grouping operations as {o1}, {o2, o3}; variables are grouped as {v1, v2}, {v3} in both cases]
The inter-dependency is more complicated in real designs
Simultaneous Binding Algorithms
Performing tasks simultaneously
• Leads to globally better results
• But usually intractable
Simultaneous binding
• Simulated annealing [Devadas TCAD’89] [Choi TDAES’99]
• Simulated evolution [Ly TCAD’93]
• Iterative searching [Dasgupta ISLPED’95] [Lakshminarayana TVLSI’99]
• Step-by-step simultaneous binding [Kim CICC’95]
• Interleaving simultaneous binding [Cong DATE’08]
Binding Works Classified by Optimization Goals
Number of FUs and registers
• [Tseng CAD’86] [Kurdahi DAC’87]
Interconnections, multiplexers
• [Paulin DAC’86] [Stok DATE’90] [Huang DAC’90] [Kim CICC’95] [Lyuh TVLSI’03] [Chen ASPDAC’04] [Cong DATE’08]
Low power
• [Chang DAC’95] [Chang DATE’96] [Hong ICCAD’00] [Zhong TCAD’05] [Chen DAC’06]
Spurious switching activity
• [Mussoll ISLPED’95] [Dey TCAD’99] [Zhong ICCD’02]
Clock period
• [Huang DAC’06]
….
Binding Works Classified by Micro-Architectures
Processor architecture
• Multicluster architecture [Farkas 97]
• Multicomputer processor-DRAM chip model [Dally 99]
Synthesis consideration of distributed architectures
• Use register files during post-processing, e.g., Hyper [Rabaey 91]
• Sequencer (stack or queue)-based architecture [Aloqeely 94]
• Data routing approach [Lanneer 94]
• Distributed VLIW [Jacome 00]
• Distributed-register architecture [Jeon 01, Kim 01]
• Regular Distributed Register (RDR) for multicycle communication [Cong 04]
…
Micro-Architecture Example - Register File
Multiplexer and interconnect costs are significant
Register files are used to hide the multiplexers, which are replaced by dedicated decoders
[Figure: (a) a scheduled data-flow graph with optimal register binding labeled on each variable; (b) binding using discrete registers, with MUXes feeding the FU; (c) binding using a register file feeding the FU]
Terminology and Definition
The input of the following binding algorithms is a scheduled data flow graph (DFG), G = (O, A)
• O: the set of operations; each operation has an operation type t
• A: the data dependences between operations
• V: the set of variables; a variable crossing a clock boundary needs to be registered
Compatibility and Conflict Graph
Given a scheduled DFG G = (V, A), build the graphs below for operations of type f, Gf = (Vf, Af)
• Vf: all the operations of type f in V
• Af: depends on the graph type
  • Compatibility graph: all the edges between compatible operations in Vf
  • Conflict graph: all the edges between incompatible operations in Vf
  • Comparability graph: all the directed edges between compatible operations in Vf, with af = (vi, vj) iff death-time(vi) < birth-time(vj)
Note: The graphs for variables/registers are constructed in the same way.
[Figure: a scheduled DFG with operations 1 to 4 of the same type, and its compatibility graph, conflict graph, and comparability graph]
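The three graphs can be derived directly from operation lifetimes, as in the sketch below. The lifetimes are hypothetical (the figure's schedule is not recoverable from this transcript); the direction rule death-time(vi) < birth-time(vj) is the one given above.

```python
from itertools import combinations

def build_graphs(lifetime):
    """lifetime[op] = (birth, death) for operations of a single type.

    Returns (compatibility edges, conflict edges, directed comparability edges).
    """
    compatible, conflict, comparable = [], [], []
    for a, b in combinations(sorted(lifetime), 2):
        birth_a, death_a = lifetime[a]
        birth_b, death_b = lifetime[b]
        if death_a < birth_b:      # a finishes before b starts
            compatible.append((a, b))
            comparable.append((a, b))
        elif death_b < birth_a:    # b finishes before a starts
            compatible.append((a, b))
            comparable.append((b, a))
        else:                      # lifetimes overlap
            conflict.append((a, b))
    return compatible, conflict, comparable

# Hypothetical schedule: operations 1 and 3 share step 1, 2 is in step 2, 4 in step 3.
lifetimes = {1: (1, 1), 2: (2, 2), 3: (1, 1), 4: (3, 3)}
compat_edges, conflict_edges, compar_edges = build_graphs(lifetimes)
```

Only operations 1 and 3 overlap, so they form the single conflict edge and must be bound to different functional units.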
Time Complexity
The minimum coloring of the conflict graph gives the minimum number of resources needed
• Generally NP-complete
• For a DFG, the corresponding conflict graphs are interval graphs: vertices correspond to intervals, edges correspond to interval intersections, and the coloring problem is solvable in polynomial time
Loops make register binding intractable
• The conflict graph becomes a circular-arc graph
Left-Edge Algorithm
Input is a group of intervals with starting and ending times
Goal: minimize the number of colors (resources)
Basic idea
1. Sort the intervals in order of increasing starting time
2. Take the first interval from the list and put it into a new color group
3. Search the list from the beginning and put as many intervals as possible into the new color group
4. Go to step 2 until the list is empty
Possible to incorporate other factors
• Interconnect, bitwidth …
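The four steps above can be sketched in Python as follows. The overlap rule is an assumption chosen to match the lifetime convention used for comparability edges: an interval may join a group only if it starts strictly after the last interval in the group ends. The five intervals are hypothetical.

```python
def left_edge(intervals):
    """intervals: {name: (start, end)}; returns a list of color groups."""
    # Step 1: sort by increasing start time.
    pending = sorted(intervals, key=lambda name: intervals[name][0])
    groups = []
    while pending:                      # Step 4: repeat until the list is empty
        group, last_end, rest = [], None, []
        for name in pending:            # Steps 2 and 3: fill one color group
            start, end = intervals[name]
            if last_end is None or start > last_end:
                group.append(name)
                last_end = end
            else:
                rest.append(name)
        groups.append(group)
        pending = rest
    return groups

# Hypothetical variable lifetimes.
lifetimes = {"a": (1, 3), "b": (2, 4), "c": (4, 6), "d": (5, 7), "e": (7, 8)}
registers = left_edge(lifetimes)
```

Each group corresponds to one register (or functional unit); here two groups suffice because at most two of the intervals are alive at the same time.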
Example
From Giovanni De Micheli’s class slides
[Figure: a scheduled DFG over steps 0 to 8, the corresponding intervals, the colored conflict graph, and the resulting left-edge coloring]
Min-Cost Network Flow [Chang DAC’95]
Assumption
• Scheduling and functional-unit allocation are already done
• Sufficient number of FUs: equal to or greater than that found by the left-edge algorithm
The network is constructed based on the nodes’ compatibility
Costs represent the goals of concern
• The lower the cost, the more likely the two nodes are to share the same resource
An n-flow binds the nodes to n resources
• Special nodes and properties are added to guarantee that all nodes are covered once and only once
Construction of Network - 1
Comparability graph
Given a DFG G, build a comparability graph, Gc = (Vc, Ac)
• Vc: all the operations in G
• Ac: all the edges between compatible operations in Vc; ac = (vi, vj) iff Deathtime(vi) < Birthtime(vj)
• Wij: weight of ac, the cost of binding vi and vj to a single FU (interconnections, switching activity (power) …)
Properties
• The comparability graph has a transitive orientation
• Directed acyclic graph (DAG)
• Operations bound to a resource execute in topological order
[Figure: a scheduled DFG with operations 1 to 4 and its comparability graph Gc]
Construction of Network - 2
Network
• Add source and sink vertices to the comparability graph
• There is an edge from the source to every node in the graph
• There is an edge from every node in the graph to the sink
[Figure: comparability graph Gc with operations 1 to 4, and the network obtained by adding source s and sink t]
Network Flow and Binding
FU binding solutions correspond to flows in the network
[Figure: a scheduled DFG with operations 1 to 4 and the corresponding network flows from s to t; resources: two FUs]
Split Network
How to guarantee that all nodes are covered once and only once?
• Split each node v into two nodes, v and v’
• flow(v, v’) = 1
[Figure: the network with nodes 1 to 4 between s and t, and the split network with added nodes 1’ to 4’]
Lemmas
A unit flow f (|f| = 1) in the split network corresponds to a clique χ in the original comparability graph Gc
• An edge (vi’, vj) in the flow indicates that operations vi and vj will be bound to the same FU
A k-flow f (|f| = k) that passes through every node with a unit flow is equivalent to finding k disjoint paths (or chains) in the network, thus generating k cliques in Gc covering all the operation nodes
• This forms a legal binding solution
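The split-network idea can be tried on a toy instance with a small successive-shortest-path min-cost flow. This is a sketch under stated assumptions, not the construction from [Chang DAC’95]: the requirement flow(v, v’) = 1 is enforced here by giving each (v, v’) edge a large negative cost so that the cheapest k-flow covers every node, and the operations, comparability edges, and weights are hypothetical.

```python
def min_cost_flow(n, edges, s, t, k):
    """Send k units from s to t; edges are (u, v, capacity, cost).

    Returns the flow on each input edge, in input order.
    """
    arcs = []  # residual arcs in pairs: arc e and its reverse e ^ 1
    for u, v, cap, cost in edges:
        arcs.append([u, v, cap, cost])
        arcs.append([v, u, 0, -cost])
    for _ in range(k):
        dist = [float("inf")] * n
        prev = [None] * n
        dist[s] = 0
        for _ in range(n - 1):  # Bellman-Ford handles the negative costs
            for e, (u, v, cap, cost) in enumerate(arcs):
                if cap > 0 and dist[u] + cost < dist[v]:
                    dist[v] = dist[u] + cost
                    prev[v] = e
        if prev[t] is None:
            raise ValueError("fewer than k units of flow are possible")
        v = t
        while v != s:  # push one unit along the shortest path
            e = prev[v]
            arcs[e][2] -= 1
            arcs[e ^ 1][2] += 1
            v = arcs[e][0]
    return [arcs[2 * i + 1][2] for i in range(len(edges))]

def bind_by_min_cost_flow(ops, comparable, k, big=1000):
    """comparable: {(vi, vj): cost} for every directed comparability edge."""
    s, t = 0, 1
    v_in = {op: 2 + 2 * i for i, op in enumerate(ops)}
    v_out = {op: 3 + 2 * i for i, op in enumerate(ops)}
    edges = []
    for op in ops:
        edges.append((s, v_in[op], 1, 0))
        edges.append((v_in[op], v_out[op], 1, -big))  # assumed covering reward
        edges.append((v_out[op], t, 1, 0))
    pairs = list(comparable)
    for vi, vj in pairs:
        edges.append((v_out[vi], v_in[vj], 1, comparable[(vi, vj)]))
    flow = min_cost_flow(2 + 2 * len(ops), edges, s, t, k)
    nxt = {vi: vj for (vi, vj), f in zip(pairs, flow[3 * len(ops):]) if f}
    chains = []
    for i, op in enumerate(ops):
        if flow[3 * i]:  # the s -> op edge carries flow: op starts a chain
            chain = [op]
            while chain[-1] in nxt:
                chain.append(nxt[chain[-1]])
            chains.append(chain)
    return chains

# Hypothetical instance: o1 and o3 conflict; weights are sharing costs.
ops = ["o1", "o2", "o3", "o4"]
weights = {("o1", "o2"): 1, ("o3", "o2"): 4, ("o1", "o4"): 5,
           ("o3", "o4"): 2, ("o2", "o4"): 3}
binding = bind_by_min_cost_flow(ops, weights, k=2)
```

Each returned chain is the list of operations bound to one FU, in execution order; the two chains here have total sharing cost 1 + 2 = 3, the cheapest way to cover all four operations with two FUs.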
Multi-Vdd Binding (Chen et al ASPDAC’05)
Solved problem
• Simultaneously perform resource binding and voltage assignment to optimize power
Solution
• Extend the split network ([Chang DAC’95]) to support voltage assignment
• Edge costs represent switching activities or voltage decisions
• Optimality: guarantees that the largest number of operations are assigned low-Vdd with the minimum total switching activity
Problem Definition of FU Binding with Voltage Assignment
Given:
• A scheduled data flow graph (DFG)
• A module (functional unit) library with multiple Vdds
• The Vdd of each module can be changed dynamically while executing different operations
Goal:
• Assign low-Vdd to the maximum number of operations, taking switching activity into consideration
• Minimize total switching power through functional unit binding
Constraint:
• Latency constraint
• Resource constraint
Motivational Example
Which set of operations to extend?
Honor data dependencies
Maximum number under resource and latency constraints
The best such set in terms of switching-activity reduction during the later FU binding
Voltage assignment and FU binding need to be considered simultaneously to achieve an optimal solution
[Figure: two copies of a DFG with operations 1-6, showing possible extensions]
Multiplication: high-Vdd 3 clocks, low-Vdd 5 clocks
Addition
Network Construction - 1
[Figure: a scheduled DFG (additions) with operations 1-4 and its comparability graph with edge weights w14 and w23]
Given a DFG G, build a comparability graph Gc = (Vc, Ac)
Vc: all the operations in G
Ac: all the edges between compatible operations in Vc
ac = (vi, vj) iff DT(vi) < BT(vj)
Wij: weight of ac, the cost of binding vi and vj into a single FU (switching activity)
Extendable operations: op1, op3
Definition: operations that can be assigned low-Vdd without violating a data-dependency or timing constraint
Network Construction - 2
[Figure: comparability graph (operations 1-4, edge weights w14 and w23) and the network with two Vdds, with source s, sink t and added nodes 1' and 3'; the added edges carry cost -T, and C(v1', v4) = C(v1, v4)]
Add nodes (v') for extendable operations
Assign cost (-T) to edges (v, v')
L = 100
T = L × |Vc|
–T for maximum number of extensions
C(vi, vj) = –L × (1 – Wij)
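The cost assignment can be sketched in a few lines. The weights below are hypothetical and are assumed to be normalized to the range 0..1; the function names are illustrative only. Under that assumption every binding cost lies between -L and 0, and a chain over |Vc| operations uses at most |Vc| - 1 binding edges, so a single extension reward of -T outweighs any difference in binding costs. That is presumably why T is scaled by |Vc|.

```python
L = 100  # scaling constant given on the slide


def extension_cost(num_ops):
    """Cost of an edge (v, v'): -T with T = L * |Vc|."""
    return -L * num_ops


def binding_cost(w_ij):
    """Cost of binding vi and vj: C(vi, vj) = -L * (1 - Wij)."""
    return -L * (1 - w_ij)


num_ops = 4                                 # |Vc| for a four-operation DFG
weights = {(1, 4): 0.25, (2, 3): 0.5}       # hypothetical w14 and w23
T = -extension_cost(num_ops)
costs = {edge: binding_cost(w) for edge, w in weights.items()}
print(T, costs)
```

With these assumed weights the total binding cost of any chain stays below T in magnitude, so the min-cost flow first maximizes the number of extensions and only then minimizes switching activity.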
Network Construction - 3
Add split nodes (vd) to guarantee that all the operations will be bound
[Figure: the network with two Vdds (operations 1-3, extension node 1', source s, sink t) and the split network with two Vdds, which adds split nodes 1d, 2d and 3d; the extension edge has cost -T, C(v2d, v3) = C(v2, v3), and flow(v, vd) = 1]
Theorem
Given
A comparability graph with estimated switching activities on the edges
k functional units
Dual supply voltages
Theorem
The min-cost k-flow f on the corresponding split network gives the largest number of extended operations in the design with the minimum total switching activity on k functional units
Outline - Synthesis for customized logic
Overview of behavioral synthesis
Scheduling
Task description
Scheduling algorithms
• ASAP/ALAP, list scheduling, force-directed scheduling, mathematical-programming-based scheduling, scheduling for low power
Binding
Task description
Classification of binding algorithms
Binding algorithms
• Left-edge, min-cost network flow, multi-Vdd binding
Architectural synthesis for multi-cycle communication (MCAS)
Multi-Cycle Communication Architectural Synthesis (MCAS)
Regular Distributed Register (RDR) micro-architecture
Highly regular
Direct support of multi-cycle on-chip communication
MCAS: Architectural Synthesis for Multi-cycle Communication
Integrates architectural synthesis (e.g. resource binding, scheduling) with physical planning
Targets the RDR architecture
Needs for Multi-Cycle On-Chip Communication
[Figure: chip regions reachable in 1 to 5 clock cycles, with distance marks at 0, 11.4, 22.8 and 28.3 mm]
Interconnect delays dominate the timing in DSM technologies
Single-cycle full-chip synchronization is no longer possible
ITRS'01 0.07 um technology
5.63 GHz across-chip clock
800 mm2 (28.3 mm x 28.3 mm)
IPEM BIWS estimations
Buffer size: 100x
Driver/receiver size: 100x
From corner to corner:
On the semi-global layer (Tier 3), a signal can travel up to 11.4 mm in one cycle
5 clock cycles are needed
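The 5-cycle figure can be checked with a short calculation. The sketch below assumes Manhattan (horizontal plus vertical) routing between opposite corners, which is an assumption made here rather than something stated on the slide; the function name is illustrative.

```python
import math


def cycles_to_cross(width_mm, height_mm, reach_mm_per_cycle):
    """Clock cycles for a corner-to-corner wire, assuming Manhattan routing."""
    distance_mm = width_mm + height_mm
    return math.ceil(distance_mm / reach_mm_per_cycle)


# 28.3 mm x 28.3 mm die, 11.4 mm reachable per cycle (values from the slide).
cycles = cycles_to_cross(28.3, 28.3, 11.4)
print(cycles)
```

The routed distance is 56.6 mm, a little under five times the 11.4 mm reach, which rounds up to the 5 cycles quoted above.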
Regular Distributed Register Architecture (1)
Distribute registers to each "island"
Choose the island size such that local computation and communication in each island can be done in a single cycle:
[Figure: array of islands connected by global interconnect; each island contains a local computational cluster (LCC), a register file and an FSM]
$D_{intra\text{-}island} = D_{logic} + D_{opt\text{-}int} \le D_{logic} + 2\,D_{opt\text{-}int}(W_i + H_i) \le T$
Island (Wi x Hi): local computational cluster (LCC, a cluster with area constraint containing e.g. ADD, MUL, MUX), register file, FSM
Regular Distributed Register Architecture (2)
[Figure: the same array of islands connected by global interconnect; each island (Wi x Hi) contains an LCC (a cluster with area constraint, e.g. ADD, MUL, MUX), a register file and an FSM; the register banks are labeled 1 cycle, 2 cycle, …, k cycle]
Use register banks:
Registers in each island are partitioned into k banks for 1-cycle, 2-cycle, …, k-cycle interconnect communication
Highly regular
Example: Regular Distributed Register Architecture for 70nm Technology
ITRS'01 70nm technology
Chip dimension: 800 mm2 (28.3 mm x 28.3 mm)
5.63 GHz across-chip clock
• Can travel up to 11.4 mm within 1 clock cycle under best interconnect optimization
• Need 5 clock cycles to cross the chip
Each island base dimension
• Wi = Hi = 2.08 mm
• ≈ 1/3 of the distance a wire can travel in 1 clock cycle
• Logic volume: 6.76M min-size 2-NAND gates
12x12 array of islands
Local registers are partitioned into 7 banks
[Figure: data flow graph extracted from a discrete cosine transform (DCT), with operations 1-12: subtractions 1, 5, 6, 9, 10; addition 2; multiplications 3, 4, 7, 8, 11, 12]
The nodes with the same color are assigned to the same functional unit.
Example: Impact of Interconnect on Scheduling
Performance-driven placement
[Figure: the DCT data flow graph and the placement of its functional units on islands: Alu1 (ops 1, 5, 10), Alu2 (ops 2, 6, 9), Mul1 (ops 4, 8, 11), Mul2 (ops 3, 7, 12); markers on the DFG edges represent registers]
Resource     Delay   Num
alu          1 ns    2
multiplier   2 ns    2
Long interconnect: 2 ns
Short interconnect: 1 ns
Single-cycle vs. Multi-cycle Interconnect Communication
Single-cycle interconnect communication
Scheduled in 6 clock cycles
Clock period is 4 ns
Total latency is 24 ns
[Figure: 6-cycle schedule of the DCT data flow graph]
Multi-cycle interconnect communication
Scheduled in 9 clock cycles
Clock period is 2 ns
Total latency is 18 ns
[Figure: 9-cycle schedule of the DCT data flow graph]
Enhancement 1: Simultaneous Placement and Scheduling for Performance Optimization
With placement integrated with scheduling, the critical path is reduced.
The DFG can be scheduled in 8 clock cycles, with a clock period of 2 ns.
The total latency is 16 ns.
[Figure: 8-cycle schedule and the placement produced by simultaneous placement and scheduling: Alu1 (ops 1, 5, 10), Alu2 (ops 2, 6, 9), Mul1 (ops 4, 8, 11), Mul2 (ops 3, 7, 12)]
Enhancement 2: Simultaneous Placement, Scheduling and Binding for Performance Optimization
With placement integrated with scheduling and binding, the critical path is further reduced.
The DFG can be scheduled in 7 clock cycles, with a clock period of 2 ns.
The total latency is 14 ns.
[Figure: 7-cycle schedule and the placement produced by simultaneous placement, scheduling and binding: Alu1 (ops 1, 5, 10), Alu2 (ops 2, 6, 9), Mul1 (ops 4, 8, 12), Mul2 (ops 3, 7, 11)]
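The latency figures quoted for the four schedules all follow from cycles times clock period. A minimal sketch of that arithmetic, using only the numbers given on the slides (the dictionary keys are just descriptive labels):

```python
def total_latency_ns(cycles, clock_period_ns):
    """Total latency of a schedule: number of cycles times the clock period."""
    return cycles * clock_period_ns


# (clock cycles, clock period in ns) as stated on the slides
schedules = {
    "single-cycle communication": (6, 4),
    "multi-cycle communication": (9, 2),
    "simultaneous placement and scheduling": (8, 2),
    "simultaneous placement, scheduling and binding": (7, 2),
}
latencies = {name: total_latency_ns(*s) for name, s in schedules.items()}
for name, ns in latencies.items():
    print(f"{name}: {ns} ns")
```

The multi-cycle schedule needs more cycles than the single-cycle one, but the shorter clock period more than compensates, and integrating placement and binding then removes cycles at the same period.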
MCAS: Placement-Driven Architectural Synthesis Using RDR Architecture
MCAS flow (figure labels)
Inputs: C / VHDL; resource constraints; RDR arch. spec.; target clock period
Steps: CDFG generation; resource allocation; functional unit binding; scheduling-driven placement; placement-driven rebinding & scheduling; register and port binding; datapath & FSM generation
Intermediate data: CDFG; Interconnected Component Graph (ICG); location information
Outputs: RTL VHDL files; floorplan constraints; multi-cycle path constraints
Multi-Cycle Communication Architectural Synthesis (MCAS) System
Scheduling-driven placement
Integrates list scheduling with SA-based global placement to minimize the total latency.
Employs a net-weighting technique to shorten the critical global connections.
Placement-driven rescheduling & rebinding
Integrates force-directed list scheduling with simultaneous rescheduling & rebinding to further minimize the latency.
Challenges in Behavioral Synthesis
Sub-tasks in the synthesis flow have complicated inter-dependencies
How to perform resource allocation, scheduling and binding simultaneously to gain better design quality
Physical reality has an impact on the final results
E.g. how to consider the impact on back-end synthesis (such as routability) during behavioral synthesis
• MCAS considers interconnect complexity and coarse placement
Still needs more enhancements to become a complete solution
E.g. how to consider floorplanning of voltage islands for multi-Vdd synthesis
Outline
Foundation of platform-based design and Metropolis framework (Alberto)
Challenges in System Level Design
Platform-based Design as a unifying methodology
A framework for PBD:
• Theoretical foundations (heterogeneous systems, metamodeling, abstract semantics)
• Metropolis II: integration platform architecture
• Application to embedded system design: cars and building networked embedded systems
Synthesis for functionality (Jason)
Synthesis for customized logic
Use of application-specific processors and processor networks
Synthesis for communication (Radu)
Networks-on-Chip design space
Optimization for performance, energy, and fault-tolerance
Beyond system-on-a-chip (Clas)
Core Theme Overview and Design Flow
Alberto Sangiovanni-Vincentelli, Luca Carloni, Jason Cong, Daniel D. Gajski, Wen-mei Hwu, Andrew Kahng, Radu Marculescu, Jaijeet Roychowdhury
Modeling & Simulation Side
UIUC
Input: C descriptions
MMMsSystemC, C++
Analysis
Parallelism extraction Code cleanup
Performance/AreaEstimations
ASPN Synthesis (MetroII)1. ImportFunctional Model(i.e. h.264, UMTS)
UCI
UCB
2. Check Equiv.(Model Algebra)
3. Create Arch Services(i.e. Xilinx)
4. Map and Simulate
Concurrent Functional Model(after architecture independent optimization
TransformationRules
TLMEquivalence
Checker(UCI)
EquivalenceResult
DesignOptimization
App1 +Platform1
App2 +Platform2
TLMGen.
TLMGen.
TLM1
TLM2
DesignDecisions
Metro-II Framework
(UCB)
Metro-TLMLibrary
Platform Instance
Platform Design-Space Export
Platform Instance
Function Instance
FunctionSpace Mapped
Platform(Architectural) Space
FunctionSpace
Platform Instance
Function Instance
Mapped
ASPN Simulation
Co-simulation
MetroIIFunctional model
ASPNArchitectural model
event trace
annotated event trace
MCSim simulationMetroII simulation
RefinementHigh level
MetroII model
refine
ASPN model withmapped functionality
MCSim simulation(cycle accurate)
abstracted performance annotation
Analog Sim
Communication Synthesis, NOC synthesis, and physical modelingof interconnection and logic
Processor Synthesis
Processor Library
–ARM–PowerPC–ASIP
Processor Synthesis
•Customized coprocessor
•ASIPs
xPilot
•Process mapping
ASPN Synthesis Engine
Implementation SideUCLA/UCB/UCSD/CMU/Columbia
Application-Specific Instruction-Set Processors (ASIPs)
General-purpose processor cores + programmable fabric
Loosely coupled as a coprocessor
• Example: Xilinx MicroBlaze, etc.
Tightly integrated as extra function units in application-specific instruction-set processors
• GPP has the capability to extend its basic instruction set
• Programmable fabric implements the customized instructions
• Example: Altera Nios / Nios II
[Figure: custom instruction logic for Nios II; source: www.altera.com]
[Figure: Xilinx MicroBlaze; source: www.xilinx.com]
Comparison of Different Approaches
General-purpose CPUs
Very flexible and easy to program
Low performance for data-intensive applications
Poor power efficiency
Hardwired logic
High performance due to parallelism
Best power efficiency
Inflexible after tapeout
High cost and long development time
ASIP
Reduces development time and cost by reusing most of the components of a pre-verified processor
Extends instructions to leverage parallelism for performance improvement
Reconfigurability of the programmable fabric provides a certain flexibility
Motivational Example
t1 = a * b;
t2 = b * 0xf0;
t3 = c & 0x12;
t4 = t1 + t2;
t5 = t2 + t3;
t6 = t5 + t4;
Execution time: 9 clock cycles
*: 2 clock cycles; others: 1 clock cycle
Extended instruction set: I ∪ extop1 ∪ extop2
[Figure: DFG over inputs a, b, c, 0xf0 and 0x12 with two multiplications, one AND and three additions; extop1 and extop2 each cover a subgraph]
t1 = extop1(a, b, 0xf0);
t2 = extop2(b, c, 0xf0, 0x12);
t3 = t1 + t2;
Execution time: 5 clock cycles
Speedup: 1.8
Subtask (1) ─ Extended Instruction Identification
Find the extended instruction candidates under micro-architectural constraints
Exactly isomorphic patterns
• Heuristic-based pattern growth [Clark, Micro'03] [Sun, ICCAD'02]
• Pattern enumeration [Atasu, DAC'03] [Cong, FPGA'04]
Similar patterns [Brisk, DAC'04] [Cong, FPGA'08]
• Reduce area overhead by sharing resources
• May slow down clock speed
[Figure: example DFG with nodes n1 (*), n2 (*), n3 (&), n4 (+), n5 (+), n6 (+) over inputs a, b, c, 0xf0 and 0x12; patterns implemented with MUL and ALU resources]
Subtask (2) ─ Instruction Selection
Instruction selection problem
Limited resource budget
Select a subset of instruction candidates to be finally implemented on the configurable fabric
The objective is to maximize performance
Approaches
Greedy [Clark, Micro'03]
Knapsack [Cong, FPGA'04]
Iterative [Atasu, DAC'03]
[Figure: example DFG with nodes n1 (*), n2 (*), n3 (&), n4 (+), n5 (+), n6 (+) over inputs a, b, c, 0xf0 and 0x12; labels p1, p2]
Pattern  Function  Area  Gain
p1       *+        2     1
p2       &+        2     1
p3       ++        2     1
p4       +++       4     1
p5       (*)+(*)   4     3
p6       (*)+(&)   4     3
If the total area is 8, which instructions should be used?
Subtask (3) ─ Application Mapping
Application mapping
A covering problem: cover the application with the extended instruction set
Exploit the extended instructions for maximal gain
Approaches
• Iterative covering [Atasu, DAC'03]
• Binate covering [Liao, ICCAD'95] [Cong, FPGA'04]
[Figure: example DFG with nodes n1 (*), n2 (*), n3 (&), n4 (+), n5 (+), n6 (+) over inputs a, b, c, 0xf0 and 0x12]
Pattern  Function  Area  Gain
p1       *+        2     1
p2       &+        2     1
p3       ++        2     1
p4       +++       4     1
p5       (*)+(*)   4     3
p6       (*)+(&)   4     3
Subtask (1) – Heuristic Method
Grow subgraphs from seed nodes [Clark et al., Micro'03]
All nodes are seeds
Take 4 factors into consideration
• Criticality: combine operations on the critical path
• Latency: combine low-latency nodes to pack more nodes
• Area: prefer nodes with low area overhead
• Input/Output: prefer nodes with fewer input/output ports
The sum of these factors determines the value of each direction
Subtask (1) – Cut Enumeration
Enumerate multiple-input single-output patterns [Cong, FPGA'04]
Cone: a subgraph consisting of node v and its predecessors such that any path connecting a node in the cone and v lies entirely in the cone
Each pattern is an Nin-feasible cone
Cut enumeration is used to enumerate all the k-feasible cones [Cong et al., FPGA'99]
Basic idea: in topological order, merge the cuts of the fan-ins and discard those cuts that are not k-feasible
3-feasible cones:
n1: {a, b} n2: {b, 0xf0} n3: {c, 0x12}
n4: {n1, n2}, {n1, b, 0xf0}, {n2, a, b}, {a, b, 0xf0}
[Figure: example DFG with nodes n1 (*), n2 (*), n3 (*), n4 (+), n5 (+), n6 (+) over inputs a, b, c, 0xf0 and 0x12]
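The merge-and-discard idea can be sketched as follows. The DFG wiring is taken from the motivational example (n4 = n1 + n2, n5 = n2 + n3, n6 = n4 + n5); the function name and data layout are illustrative assumptions, and the sketch keeps the trivial cut {v} for each node so that fan-out nodes can use it.

```python
from itertools import product


def enumerate_cuts(fanins, order, k):
    """Return {node: set of k-feasible cuts}, each cut a frozenset of nodes."""
    cuts = {}
    for v in order:                      # nodes in topological order
        found = {frozenset([v])}         # trivial cut: the node itself
        ins = fanins.get(v, [])
        if ins:
            # Merge one cut from every fan-in; keep only k-feasible results.
            for combo in product(*(cuts[u] for u in ins)):
                merged = frozenset().union(*combo)
                if len(merged) <= k:
                    found.add(merged)
        cuts[v] = found
    return cuts


fanins = {
    "n1": ["a", "b"], "n2": ["b", "0xf0"], "n3": ["c", "0x12"],
    "n4": ["n1", "n2"], "n5": ["n2", "n3"], "n6": ["n4", "n5"],
}
order = ["a", "b", "c", "0xf0", "0x12", "n1", "n2", "n3", "n4", "n5", "n6"]
cuts = enumerate_cuts(fanins, order, k=3)
nontrivial_n4 = cuts["n4"] - {frozenset(["n4"])}
print(sorted(sorted(c) for c in nontrivial_n4))
```

For k = 3 the non-trivial cuts found for n4 are the four listed on the slide; the merge {b, 0xf0, c, 0x12} for n5 has four inputs and is discarded.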
Subtask (1) – Search Tree (1)
Enumerate all multiple-input multiple-output patterns [Atasu et al., DAC'03]
Nodes are numbered based on a reversed topological sort
Build a search tree to enumerate all the possible patterns
Example: Nout = 1
from Atasu's conference slides
Subtask (1) – Search Tree (2)
Subtree elimination
Prune the search space to speed up the enumeration process
Based on a violation of the output port constraint
Based on a violation of the convexity constraint
from Atasu's conference slides
Subtask (1) – Resource Sharing (1)
[Figure: two DFGs (area = 17 and area = 25) and the resulting merged datapath (area = 28, area estimate = 42), with a table of area costs]
Subtask (1) – Resource Sharing (2)
[Figure: DFGs G1, G2, G3 and G4]
from Philip Brisk's conference slides
Use substring matching to share resources [Brisk, DAC'04]
Colors represent the type of operations
Encode paths as strings
Merge DFGs along matched nodes
Subtask (2) – Greedy Selection Heuristic
• Use estimates of performance improvement / cost
• Iteratively pick the pattern that provides the largest value/cost ratio
• Update the table after selecting a pattern
Table before the update:
Subgraph Number  Value  Cost  Ops
1                20     4     (3,4), (6,8)
2                6      1     (1,3,7)
…                …      …     …
N                9      5     (1,7)
Table after the update:
Subgraph Number  Value  Cost  Ops
1                10     4     (6,8)
2                6      1     (1,3,7)
…                …      …     …
N                0      5
from Nathan Clark's conference slides
Subtask (2) – Knapsack Formulation (1)
Simultaneously consider speedup, occurrence frequency and area [Cong, FPGA'04]
Speedup: the ratio between the latency of the pure software implementation and that of the hardware implementation
Occurrence
Some pattern instances may be isomorphic
Graph isomorphism test [Nauty package]
For small subgraphs, the isomorphism test is very fast
Gain(p) = Speedup(p) × Occurrence(p)
Example: pattern *+ has Tsw = 3 and Thw = 2, so Speedup = 1.5
[Figure: example DFG with nodes n1 (*), n2 (*), n3 (*), n4 (+), n5 (+), n6 (+) over inputs a, b, c, 0xf0 and 0x12]
Subtask (2) – Knapsack Formulation (2)
Can be formulated as a 0-1 knapsack problem
Given:
• n items (patterns)
• the ith item (pattern) is associated with value (gain) vi and weight (area) wi
• Total weight W (area constraint A)
Objective:
Select a subset of items to maximize the total value, while the total weight does not exceed W.
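A textbook dynamic-programming solution of the 0-1 knapsack problem is sketched below. It is not claimed to be the procedure of [Cong, FPGA'04]; the item data reuse the Area/Gain values of patterns p1 to p6 from the instruction-selection slide with an area budget of 8, and the function name is illustrative. The sketch assumes integer areas and ignores whether selected patterns overlap in the DFG.

```python
def knapsack(items, capacity):
    """0-1 knapsack. items: list of (name, weight, value), integer weights.

    Returns (best total value, tuple of chosen item names).
    """
    # best[c] holds the best (value, names) found with total weight <= c.
    best = [(0, ())] * (capacity + 1)
    for name, weight, value in items:
        # Iterate capacities downwards so each item is used at most once.
        for c in range(capacity, weight - 1, -1):
            candidate = (best[c - weight][0] + value,
                         best[c - weight][1] + (name,))
            if candidate[0] > best[c][0]:
                best[c] = candidate
    return best[capacity]


# (pattern, area, gain) from the pattern table
patterns = [("p1", 2, 1), ("p2", 2, 1), ("p3", 2, 1),
            ("p4", 4, 1), ("p5", 4, 3), ("p6", 4, 3)]
total_gain, chosen = knapsack(patterns, capacity=8)
print(total_gain, chosen)
```

For this table the two area-4 patterns with gain 3 fill the budget exactly, which beats any combination that includes the gain-1 patterns.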
Subtask (2) (3) – Iterative Method (1) [Atasu et al., DAC'03]
How to select M patterns within a single basic block?
Build an (M+1)-ary tree
Branch 0 means that the node is not included in any pattern
Branch i (i > 0) means that the node is in pattern i
[Figure: a sample search tree to identify two subgraphs (M = 2) from a single basic block]
from Atasu's conference slides
Subtask (2) (3) – Iterative Method (2)
How to select M patterns within multiple basic blocks?
Add one pattern at a time
At each iteration, check in which basic block an additional pattern brings the largest gain
[Figure: identification of 3 subgraphs within 3 basic blocks]
Subtask (3) – Binate Covering
Modeled as a binate covering problem [Cong, FPGA'04]
Covering the sink node
Covering the inputs of the selected pattern
Pattern  Function  Cost  Covers
p0       +         1     n6
p1       +         1     n5
p2       +         1     n4
p3       *         2     n3
p4       *         2     n2
p5       *         2     n1
p6       *+        2     n1, n4
p7       *+        2     n2, n4
p8       *+        2     n2, n5
p9       *+        2     n3, n5
p10      (*)+(*)   2     n1, n2, n4
p11      (*)+(*)   2     n2, n3, n5
Covering clause:
p0 → p2 + p6 + p7 + p10
¬p0 + (p2 + p6 + p7 + p10)
[Figure: example DFG with nodes n1 (*), n2 (*), n3 (*), n4 (+), n5 (+), n6 (+) over inputs a, b, c, 0xf0 and 0x12]
Application-Specific Processor Network (ASPN)
ASPN consists of processor cores, logic blocks, memories, communication channels and peripherals
ASPN enables a higher level of abstraction and higher productivity
Extends the standard-cell-based methodology for ASIC design to an application-specific processor-based design paradigm
[Figure: two ASPN examples built from a graphics engine, audio processing, DSP, programmable logic, ARM and MIPS cores, ASIPs, an encryption engine, digital baseband, a PMU, I/O peripherals, on-chip memory, scratch-pad memory and a memory controller; in the second example the blocks are attached to a grid of nodes labeled R]
The Era of ASPN
Multicore processors enter the mainstream
Moore's law continues to apply in the multicore era
Intel, AMD, IBM and Sun have launched their dual-core and quad-core processors
Hardware accelerators
Examples: Synergistic Processing Elements (SPE) in the CELL processor, encryption engines in Niagara II, accelerators implemented on FPGA chips
Perform kernel computations in hardware to increase performance
• Customized implementations to explore coarse / fine granularity of parallelism
• Reduce heat dissipation and power consumption (0.1 GFLOPS/Watt on AMD 2.5GHz vs. 0.9 GFLOPS/Watt on Stratix III FPGA)
• Adapt to a wide range of applications and evolving standards
ASPN Synthesis
[Figure: an application (parser FSM, texture IDCT, motion compensation, copy controller, texture update, display controller) and a target architecture of processors (μP with OS, driver and tasks) attached to a network through network interfaces and buffers; the two steps shown are architecture synthesis and application mapping]
Design challenges
How to construct the optimal architectures for the given application (architecture synthesis)?
How to partition and map jobs to the synthesized architectures (application mapping)?
Synthesis Works
Optimize throughput
Modified list scheduling [Hoang, 1993]
Adaptive multi-objective genetic algorithm [Dick, 1998] [Grajcar, 1999]
Modulo scheduling [Karkowski, 1997]
Integer linear programming to map tasks to a fixed number of processors [Jin, 2005]
Iterative balanced partitioning [Dai, 2005] [Yu, 2007]
Optimize latency
Merge tasks greedily + list scheduling [Sarkar, 1986]
Formulate as a mixed integer linear program [Prakash, 1991]
Iterative method [Wolf, 1997]
Optimize latency under throughput constraint
Labeling + clustering approach [Cong, 2007]
Throughput Optimization
Recursively bipartition and refine [Yu, DAC'07]
Implemented on multiple homogeneous PEs to improve system throughput
Throughput is determined by the maximal stage latency
The stage latency of each stage is the sum of its processing time and communication time
[Figure: two implementations of a weighted task graph, one annotated "34 us stage latency, 34 us latency" and the other "10 us stage latency, 30 us latency"]
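The relation between stage latency, throughput and end-to-end latency can be sketched as below. The per-stage processing and communication times are hypothetical values chosen only so that the totals match the 34 us and 10 us / 30 us annotations; the sketch also assumes that every stage is clocked at the slowest stage's latency, so total latency is the number of stages times that period.

```python
def stage_latency(processing_us, communication_us):
    """Stage latency = processing time + communication time."""
    return processing_us + communication_us


def pipeline_metrics(stages):
    """stages: list of (processing_us, communication_us).

    Returns (stage period, total latency), both in microseconds.
    """
    period = max(stage_latency(p, c) for p, c in stages)
    return period, len(stages) * period


single_stage = pipeline_metrics([(34, 0)])                 # one stage, no pipelining
three_stage = pipeline_metrics([(8, 2), (9, 1), (7, 3)])   # hypothetical split
print(single_stage, three_stage)
```

Throughput is one result per stage period, so under these assumed numbers the three-stage pipeline produces a result every 10 us instead of every 34 us.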
Resource Balanced Bipartitioning
Bipartition the program recursively into a number of stages with minimum communication cost
Compute the cut ratio r for bipartitioning
• Compute r based on the number of stages (r = 2:(3+2) if we have 5 stages)
Apply r-balanced min-cut partitioning to get subgraphs G1 and G2
• Min-cut partition to minimize communication cost
Allocate PEs to G1 and G2 based on their processing times
[Figure: weighted task graph partitioned recursively; 4 processors are used to implement a 3-stage pipeline system, with 1 or 2 PEs allocated to each partition]
Refinement
A refinement procedure is used to improve the quality of the initial results
Migrate tasks from bottleneck stages to non-bottleneck ones
Only move tasks to neighboring stages that can accommodate additional tasks
[Figure: weighted task graph; 4 processors are used to implement a 3-stage pipeline system, with 1 PE per partition and the bottleneck PE marked]
Latency Optimization
Optimize system latency for a given task graph on heterogeneous multiprocessor systems [Prakash, DAC'91]
Create a mixed integer linear programming model (the model has been simplified in this tutorial)
Variables
• Binary variables
Subtask-to-processor mapping: σd,a = 1 if the task Sa maps to the processor Pd
Communication mapping: γa1,a2 = 1 if Sa1 and Sa2 have remote communication, 0 otherwise
• Real variables
Input data available time Tia of task a
Output data available time Toa of task a
Task a's start time Tsa
Task a's finish time Tea
MILP Constraints (1)
Add constraints to ensure the proper ordering of tasks and communications
Processor selection constraint: exactly one processor is used for task Sa
$\sum_{d \in P} \sigma_{d,a} = 1$
Data-transfer constraint: γa1,a2 = 1 if Sa1 and Sa2 are implemented on different processors
$\gamma_{a_1,a_2} = 1 - \sum_{d \in P} \sigma_{d,a_1}\,\sigma_{d,a_2}$
Task start time constraint: task Sa cannot be executed until its inputs are ready
$T_{ia} \le T_{sa}$
MILP Constraints (2)
Task finish time constraint: it depends on the processor used
$T_{ea} = T_{sa} + \sum_{d \in P} \sigma_{d,a}\,\mathrm{Delay}(\mathrm{Type}(d), S_a)$
Input data available time constraint: it depends on the finish time of the data producer and the communication volume if they are mapped onto different processors
$T_{ia_2} \ge T_{ea_1} + \gamma_{a_1,a_2}\,\mathrm{Comm}(V_{a_1,a_2})$
Other constraints: to ensure that the hardware resources are shared correctly
Optimization objective: minimize latency Tf
$T_f \ge T_{ea}$ for each task $a$
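To make the constraints concrete, the sketch below evaluates them for a fixed mapping of two dependent tasks instead of solving the MILP. The processors, delays and communication time are hypothetical, Delay(Type(d), Sa) is simplified to a lookup table delay[d][a], and each task is assumed to start as soon as its input data is available.

```python
def gamma(sigma, a1, a2):
    """Data-transfer variable: 1 if a1 and a2 run on different processors."""
    return 1 - sum(sigma[d][a1] * sigma[d][a2] for d in sigma)


def finish_time(sigma, delay, start, a):
    """T_ea = T_sa + sum over d of sigma[d][a] * delay[d][a]."""
    return start + sum(sigma[d][a] * delay[d][a] for d in sigma)


def chain_latency(sigma, delay, comm):
    """Latency T_f of the chain S1 -> S2 under mapping sigma."""
    for a in ("S1", "S2"):
        # Processor selection constraint: exactly one processor per task.
        assert sum(sigma[d][a] for d in sigma) == 1
    te1 = finish_time(sigma, delay, 0, "S1")
    ti2 = te1 + gamma(sigma, "S1", "S2") * comm   # input available time of S2
    te2 = finish_time(sigma, delay, ti2, "S2")    # S2 starts when input is ready
    return max(te1, te2)


# Hypothetical delays: delay[processor][task]
delay = {"P1": {"S1": 3, "S2": 5}, "P2": {"S1": 4, "S2": 4}}
split = {"P1": {"S1": 1, "S2": 0}, "P2": {"S1": 0, "S2": 1}}
same = {"P1": {"S1": 1, "S2": 1}, "P2": {"S1": 0, "S2": 0}}
latency_split = chain_latency(split, delay, comm=2)
latency_same = chain_latency(same, delay, comm=2)
print(latency_split, latency_same)
```

With these assumed numbers, running S2 on the faster processor P2 does not pay off, because the remote communication adds more time than the faster execution saves. This is the kind of trade-off the MILP resolves over all mappings.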
Optimize Latency Under Throughput Constraints
Analogous to the problem of circuit clustering for delay minimization in VLSI physical design
Labeling + Clustering [Cong, FPGA'07]
Labeling
Define the label of a node as the minimum pipeline stage in which it can be executed
[Figure: weighted task graph with nodes a to i (weights 3, 4, 6 and 8) and the labels l = 0, 1 or 2 computed for stage period = 14]
Clustering for DAG
Generate clusters in the reversed topological order
Theorem 1: the labeling and clustering algorithms generate latency-optimal pipeline solutions for directed acyclic task graphs in O(|V|^2)
Theorem 2: if every task's computation time is no less than its communication time, we can generate a latency-optimal, duplication-free solution.
[Figure: node c (weight 4) duplicated as c']
Relaxed Model – Allow Inter-stage Communication
[Figure: the task graph (nodes a to i, with c duplicated as c') clustered for stage period = 14: 3 stages with no inter-stage communication, 2 stages when inter-stage communication is allowed]
Branch and Bound Algorithm for Cluster Generation
NP-complete problem
Need to calculate the earliest finish time (MinTime) for each node
Label the node with <L, MinTime> in topological order
Apply pruning techniques to save computation time
[Figure: task graph fragment with nodes a to e and the branch-and-bound tree for node e over the decisions "include a", "include b", "include d" and "include e" (Y/N branches), with outcomes MinTime(e) = 14, 14, 14 and 13]
MCSim: An Efficient Simulation Tool for Heterogeneous Multi-core Systems [Cong]
Goals
Provide a framework to explore future ASPNs
Scalable/fast
• Can we tractably simulate 64+ cores?
• Can simulations be parallelized for greater speedup?
Synthesizable coprocessors and NoC
• Helps measure physical characteristics and validates functionality
Support for a variety of benchmarks/workloads
• Multitasked single-threaded workloads
• Cooperative shared-memory multithreaded applications
Modular
• Plug and play with different models
Metrics of interest
• Performance
• Power
• Area
MCSim Structure
[Figure: MCSim structure. SESC instances (each built on MINT, with cores C and coprocessors Co, tightly or loosely coupled), L2 banks with a cache controller, a functional network switch, a SystemC NoC model that receives messages and returns message latencies, and a central page handler]
• Each SESC instance is a number of cores cooperating on a single (potentially multithreaded) application
• May be loosely/tightly coupled with application-specific coprocessors
A number of cache banks
Central page handler
• Allows support for multitasking
• A functional network switch to functionally route messages between components
• A SystemC NoC model to accurately model latency and power
Coprocessor Simulation Model Generation Flow
Flow (figure labels): C specification; front-end compiler; SSDM/CDFG; platform description & constraints; behavioral synthesis; platform-based behavioral synthesis; FSMD/SSDM; coprocessor simulator generation; coprocessor-processor interface; cycle-accurate performance model in C; data model
Accuracy & Simulation Speed of the Generated Performance Models
Simulation speed is given in seconds for the SystemC and C models.
Benchmark                 #Cycles  SystemC (sec)  C (sec)  Speedup
dct                       147      0.128          0.004    32.8
idct                      283      0.219          0.004    56.2
MotionCompensation (MC)   873      4.170          0.086    48.5
pipelined MC              303      1.230          0.030    41.0
Sha                       396      0.240          0.005    48.0
MCSim Simulator Validation - Litho Simulation
Partitioning  3x3   4x4    5x5    6x6
Speedup       7.58  11.61  14.49  19.59
ALUT   Memory Bits  Fmax (MHz)  Speedup
25042  2,972,876    115.58      15.52
• About 1000 lines of ANSI-C code
• Generate 1,1000 lines of VHDL code for 5x5 partitioning
Off by only 7%
References - 1
C-J Tseng, D. Siewiorek, "Automated Synthesis of Data Paths in Digital Systems," IEEE Trans. on CAD, V.CAD-5, N.3, pp. 379-395, July 1986.
F. Kurdahi, A. Parker, "REAL: A Program for Register Allocation," Proc. of DAC24, pp. 210-215, 1987.
K. Choi and S. P. Levitan, "A flexible datapath allocation method for architectural synthesis," ACM Trans. Des. Autom. Electron. Syst., vol. 4, no. 4, pp. 376-404, 1999.
S. Devadas and A. Newton, "Algorithms for hardware allocation in data path synthesis," IEEE Trans. Computer-Aided Design, vol. 8(7), pp. 768-781, July 1989.
C. Gebotys and M. Elmasry, "Optimal synthesis of high-performance architectures," IEEE J. Solid-State Circuits, vol. 27(3), pp. 389-397, Mar. 1992.
M. Rim, R. Jain, and R. De Leone, "Optimal allocation and binding in high-level synthesis," in Proc. of the 29th ACM/IEEE Conference on Design Automation (DAC'92), 1992, pp. 120-123.
T. Kim and C. L. Liu, "An integrated data path synthesis algorithm based on network flow method," Proc. of the IEEE Custom Integrated Circuits Conference, vol. 1-4, pp. 615-618, May 1995.
J. M. Chang and M. Pedram, "Register allocation and binding for low power," in Proc. Design Automation Conf., June 1995, pp. 29-35.
J. M. Chang and M. Pedram, "Module Assignment for Low Power," Conf. on European Design Automation, 1996, pp. 376-381.
C. G. Lyuh and K. Taewhan, "High-level Synthesis for Low-Power Based on Network Flow Method," IEEE Trans. on VLSI Systems, 11(3), pp. 364-375, 2003.
D. Chen, J. Cong, Y. Fan and J. Xu, "Optimality Study of Resource Binding with Multi-Vdds," Proceedings of the 2006 Design Automation Conference, San Francisco, CA, pp. 580-585, July 2006.
L. Stok, "Architectural Synthesis and Optimization of Digital Systems," Ph.D. Dissertation, Eindhoven University of Technology, 1991.
Shih-Hsu Huang, Chun-Hua Cheng, Yow-Tyng Nieh, Wei-Chieh Yu, "Register binding for clock period minimization," DAC 2006.
References - 2
E. Mussoll and J. Cortadella, "High-level synthesis techniques for reducing the activity of functional units," in Proc. Int. Symp. Low Power Design, Apr. 1995, pp. 99-104.
S. Dey, A. Raghunathan, N. K. Jha, and K. Wakabayashi, "Controller-based power management for control-flow intensive designs," IEEE Trans. Computer-Aided Design, vol. 18, no. 10, pp. 1496-1508, Oct. 1999.
C.-Y. Huang, Y.-S. Chen, Y.-L. Lin, and Y.-C. Hsu, "Data path allocation based on bipartite weighted matching," in Proc. of the 27th Conference on Design Automation, 1990, pp. 499-504.
S. Hong and T. Kim, "Bus optimization for low-power data path synthesis based on network flow method," in Proc. Int. Conf. Computer-Aided Design, Nov. 2000, pp. 312-317.
P. G. Paulin, J. P. Knight and E. F. Girczyc, "HAL: A Multi-Paradigm Approach to Automatic Data Path Synthesis," 23rd Design Automation Conference, pp. 263-270, Jul. 1986.
B. M. Pangrle, "Splicer: A Heuristic Approach to Connectivity Binding," 25th ACM/IEEE Design Automation Conference, pp. 536-541, Jun. 1988.
D. Chen and J. Cong, "Register Binding and Port Assignment for Multiplexer Optimization," Proceedings of the Asia Pacific Design Automation Conference, pp. 68-73, January 2004.
B. M. Pangrle and D. D. Gajski, "Design tools for intelligent silicon compilation," IEEE Trans. CAD, 1987.
P. G. Paulin and J. P. Knight, "Force-directed scheduling for the behavioral synthesis of ASICs," IEEE Trans. CAD, 1989.
D. Shin and J. Kim, "Optimizing intra-task voltage scheduling using data flow analysis," ASPDAC'05.
M. C. Johnson and K. Roy, "Datapath scheduling with multiple supply voltages and level converters," ACM Trans. Des. Autom. Electron. Syst. 2, 3 (Jul. 1997), pp. 227-248.
X. Tang, H. Zhou and P. Banerjee, "Leakage power optimization with dual-Vth library in high-level synthesis," DAC'05.
……
Outline
Foundation of platform-based design and Metropolis framework (Alberto/Douglas)
Challenges in System Level Design
Platform-based Design as a unifying methodology
A framework for PBD:
• Theoretical foundations (heterogeneous systems, metamodeling, abstract semantics)
• Metropolis II: integration platform architecture
• Application to embedded system design: cars and building networked embedded systems
Synthesis for functionality (Jason)
Synthesis for customized logic
Use of application-specific processors and processor networks
Synthesis for communication (Radu)
Networks-on-Chip design space
Optimization for performance, energy, and fault-tolerance
Real-life examples (all)
Core Theme Overview and Design Flow
Alberto Sangiovanni-Vincentelli, Luca Carloni, Jason Cong, Daniel D. Gajski, Wen-mei Hwu, Andrew Kahng, Radu Marculescu, Jaijeet Roychowdhury
Modeling & Simulation Side
UIUC
Input: C descriptions
MMMsSystemC, C++
Analysis
Parallelism extraction Code cleanup
Performance/AreaEstimations
ASPN Synthesis (MetroII)
1. Import Functional Model (e.g., H.264, UMTS)
UCI
UCB
2. Check Equiv.(Model Algebra)
3. Create Arch Services(i.e. Xilinx)
4. Map and Simulate
Concurrent Functional Model (after architecture-independent optimization)
TransformationRules
TLMEquivalence
Checker(UCI)
EquivalenceResult
DesignOptimization
App1 +Platform1
App2 +Platform2
TLMGen.
TLMGen.
TLM1
TLM2
DesignDecisions
Metro-II Framework
(UCB)
Metro-TLMLibrary
203
Platform Instance
Platform Design-Space Export
Platform Instance
Function Instance
FunctionSpace Mapped
Platform(Architectural) Space
FunctionSpace
Platform Instance
Function Instance
Mapped
ASPN Simulation
Co-simulation
MetroIIFunctional model
ASPNArchitectural model
event trace
annotated event trace
MCSim simulationMetroII simulation
RefinementHigh level
MetroII model
refine
ASPN model withmapped functionality
MCSim simulation(cycle accurate)
abstracted performance annotation
Analog Sim
Communication Synthesis, NOC synthesis, and physical modelingof interconnection and logic
Processor Synthesis
Processor Library
–ARM–PowerPC–ASIP
Processor Synthesis
•Customized coprocessor
•ASIPs
xPilot
•Process mapping
ASPN Synthesis Engine
Implementation SideUCLA/UCB/UCSD/CMU/Columbia
204
Outline (Radu’s part)
Part I
• Packet-based communication and NoC design
• Mapping and routing issues
• VFIs and NoC power management
Part II
• Performance optimization via buffer sizing
• Exploiting small-world effects
• Fault-tolerance and scalability issues
Conclusion
• A 3D view on NoC design
• FPGA prototype
205
SoC universe Application Architecture
Mapping
Performance evaluation
Communication refinement
Simulation/prototyping
Implementing on-chip multiprocessor systems brings concurrency and communication to the forefront of the design process!
206
“The chip IS the network”
On-chip NETWORKS
• Modularity, scalability
• Better predictability
• Higher bandwidth
• Concurrent communication
• Energy efficiency
Regular NoC architectures
Design constraints
• High performance
• Low power and reliable operation
• Time-to-market (ease of design, modularity, CAD tools)
• Cost
Buses, P2P, irregular architectures
Packet-based communication: this is the focus today!
Regular architectures implementing application-specific NoCs
207
A NEW science of networks needs to emerge
There are similarities
• The NoC approach is inspired by the success of macro networks
• NoCs and macro networks share some concepts (e.g., topology, routing, etc.)
… but also major differences between NoCs and macro networks
Resource limitation
• Much less area overhead is possible. Buffer space is very limited.
Energy efficiency
• Energy of global communication does not scale down with device scaling
Design time specialization
• NoCs are usually developed for a specific set of application(s)
• Traffic is also application-specific
• Trade-offs among buffer space and quality of video, power and performance
New design methodologies and tools are needed for NoCs!
208
NoC design issues
Static
Communication infrastructure
• Topology (mesh, hypercube, …)
• Buffer size (uniform, preferential)
Dynamic
Communication paradigm
• Routing (deterministic, stochastic, …), flow control
• Traffic (uniform, bursty, …)
Optimization
Mapping applications onto architectures
• Performance, energy
• Fault-tolerance, …
209
NoC design spaceDesign effort
Design quality
•Standard topologies•Explore mapping & routing
Fixed standard Architecture
Semi-customized Architecture
Increased customization level and flexibility
•Buffer allocation•Long-range links
•Fixed topology and routing•Explore mapping
•Arbitrary topologies
Customized Architecture
first
second
DATE’03, DATE’04, CODES’04, etc.
DAC’04, ICCAD’04, DATE’05, ISQED’07, etc.
ICCD’02, ICCD’04, DATE’05, DAC’07, etc.
210
Packet-based communication
Performance and power dissipation are two major design constraints
Regularized, tile-based network-on-chip architecture
• Well-controlled electrical parameters
• Reliable interconnect
• High performance
Processing Element
Communication wrapper
SwitchFabric
InputBuffers
OutputBuffers
(0,0) (0,1)
(1,1)
(2,1)
211
What does a tile look like?
[Figure: one tile, consisting of a processing core and a router. The router contains a crossbar switch, a routing table, and buffered input and output ports for the North, South, East, West, and processor (Proc.) directions]
Questions to address
• Buffer sizes?
• Topology?
• Mapping, routing, etc.
212
MIT’s RAW processor
Computation resources
Longest wire = length of tile
(462 Gb/s @ 225 MHz)
A Scalable 32-bit Fabric for General Purpose and Embedded Computing
Source: http://cag.lcs.mit.edu/raw
213
Intel’s 80-Tile 1.28TFLOPS NoC
S. Vangal et al., “An 80-Tile 1.28TFLOPS Network-on-Chip in 65nm CMOS,” Proc. IEEE Intl. Solid-State Circuits Conf., 2007
214
Energy-aware application mapping
[Figure: tile-based architecture (a 4x4 mesh of tiles (0,0) to (3,3), each with network logic) and an Application Characterization Graph (APCG) with cores ASIC1, ASIC2, CPU1, DSP1, DSP2, and DSP3; edges carry labels such as 5Mb | 2Gb/s and 4Mb | 1.5Gb/s. Mapping and routing connect the APCG to the architecture]
[Figure: MPEG4 design with ad-hoc mapping vs. MPEG4 design with energy-aware mapping; modules: Video in, ME, MC, VLE, Recon. Frame, Frame buffer, Video Out, VOP def., DCT/Q, IDCT/IQ]
Cycle-accurate simulations show ~50% communication energy savings!
216
Problem formulation
Given an APCG and an ARCG with size(APCG) ≤ size(ARCG)
Find a mapping function map( ) from the APCG to the ARCG and a deadlock-free, minimal routing function R( ) which minimizes:
Such that:
Where is the bandwidth of link and:
217
How does the energy-aware mapping work?
Based on a branch-and-bound algorithm
Search tree
• Internal node: partial mapping
• Leaf node: one feasible complete mapping
[Figure: search tree for mapping IP0 to IP3 onto tiles 1 to 4. Root node: xxxx; internal nodes such as 1xxx, 2xxx, 3xxx, 4xxx, 21xx, 23xx, 24xx, 231x, 234x; leaf nodes such as 2314 and 2341]
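The branch-and-bound search on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes a 2D mesh with minimal routing, so that communication energy is proportional to volume times hop count (Manhattan distance), and it omits the link bandwidth constraints. The names `bnb_map`, `placed_cost`, and `volumes` are hypothetical.

```python
from itertools import product

def manhattan(a, b):
    """Hop count between two tiles under minimal routing on a mesh."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def bnb_map(num_cores, volumes, rows, cols):
    """Map cores 0..num_cores-1 to mesh tiles, minimizing sum(volume * hops).

    volumes: {(src_core, dst_core): communication volume}  (assumed input)
    Returns (best_cost, tuple of tiles indexed by core).
    """
    tiles = list(product(range(rows), range(cols)))
    assert num_cores <= len(tiles), "size(APCG) must not exceed size(ARCG)"
    best = {"cost": float("inf"), "mapping": None}

    def placed_cost(mapping):
        # Cost of edges whose endpoints are both placed. This is a lower
        # bound, because unplaced edges can only add non-negative cost.
        return sum(v * manhattan(mapping[s], mapping[d])
                   for (s, d), v in volumes.items()
                   if s < len(mapping) and d < len(mapping))

    def search(mapping, used):
        cost = placed_cost(mapping)
        if cost >= best["cost"]:
            return  # bound: this subtree cannot beat the best leaf so far
        if len(mapping) == num_cores:  # leaf node: complete mapping
            best["cost"], best["mapping"] = cost, tuple(mapping)
            return
        for tile in tiles:  # branch: place the next core on a free tile
            if tile not in used:
                search(mapping + [tile], used | {tile})

    search([], frozenset())  # root node: nothing mapped yet
    return best["cost"], best["mapping"]

# Four cores communicating in a ring, mapped onto a 2x2 mesh.
cost, mapping = bnb_map(4, {(0, 1): 10, (1, 2): 10, (2, 3): 10, (3, 0): 1}, 2, 2)
```

In this toy case every communicating pair ends up on adjacent tiles. A realistic version would tighten the bound with an estimate for the unmapped cores and check link bandwidth at each leaf.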
218
Can we do better?
Main idea: Exploiting routing flexibility helps expand the solution space but makes the problem even more complex
[Figure: tile-based architecture (a 4x4 mesh of tiles (0,0) to (3,3)) and a communication task graph with cores ASIC1, ASIC2, CPU1, DSP1, DSP2, and DSP3; edges labeled 2Gb/s and 1.5Gb/s]
Assume the link bandwidth is only 3.0Gb/s
219
Exploiting routing flexibility
1. Helps in finding solutions for architectures with lower link bandwidth
• Lower implementation cost
2. Leads to solutions with less energy consumption
526Mb/s
476Mb/s
500Mb/s
4.80J
4.50J
3.12J
220
NoC design with multiple VFIs
[Figure: NoC partitioned into voltage/frequency islands VFI 1 (V1, f1, Vt1), VFI 2 (V2, f2, Vt2), and VFI 3 (V3, f3, Vt3), connected through mixed-clock/mixed-voltage FIFOs]
The NoC architecture is partitioned into multiple VFIs
Globally asynchronous, locally synchronous (GALS) communication
• Each VFI can work at its own speed, while communication across different voltage islands is achieved through mixed-clock/mixed-voltage FIFOs
221
Interface between two VFI domains
[Figure: in each of Clock Domain 1 and Clock Domain 2, a crossbar switch connects the PE FIFO and the port FIFOs, each with an output controller (OC). Links inside a domain use mixed-clock FIFOs; the link between the two domains uses a mixed-clock/mixed-voltage FIFO]
222
VFI synthesis
VFI design choices
• Chip partitioning
• Given a VFI partitioning, assign the supply and threshold voltages
Each node in the network is a separate VFI
• Possibly the largest energy savings, but very costly
• Mixed-clock/voltage FIFOs, voltage converters and power distribution
[Figure: application energy consumption and area/energy overhead vs. increasing level of granularity, with "our target" marked]
223
Design methodology for multi-VFI NoCs
NoC Architecture(topology, routing, etc.) Application
Scheduling
VFI Partitioning & Static Voltage-Frequency Assignment
Interface Design for Voltage-Frequency Islands
Dynamic Voltage and Frequency Scaling (DVFS) On-line
Workload Characterization
ASPN synthesis (UCLA)Voltage-frequency levels of customized
processors
COSI & Latency-insensitive design (UCB, Columbia Univ.)
Interaction to achieve energy optimization
Interaction with physical design (UCSD)Technology parameters, variability, etc.
System-level
Micro-architectural level
Physical level
224
VFI partitioning problem
Given
• NoC architecture and a schedule for the driver application
• Maximum number of allowed VFIs and physical constraints
Find
• VFI partitioning (i.e., optimum number of VFIs, n ≤ N)
• Assignment of the supply and threshold voltages to each island
Such that the total energy consumption is minimized
$E_{Total} = E_{App} + \sum_{i=1}^{n} E_{VFI}(i)$
where $E_{App}$ is the application (useful) energy consumption (comp + comm), $E_{VFI}(i)$ is the overhead of the i-th VFI, and n is the number of VFIs
225
Voltage/frequency assignment problem
Given a VFI partitioning
Find supply ($V_i$) and threshold ($V_{ti}$) voltage assignments
Such that the application energy consumption is minimized
$\min E_{App} = \sum_{\forall i \in T} E_i(V_i, V_{ti}) + \sum_{\forall i \in T} \sum_{\forall j \in T} vol(i,j)\, E_{bit}(i,j)$
where the first term is the energy consumed when the task is executed at $(V_i, V_{ti})$ and the second term is the communication energy consumption
Subject to the following deadline constraints per task t:
$\frac{x_t}{f_t} + t^{Comm}_t \le deadline_t - start\ time_t$
where $x_t / f_t$ is the execution time and $t^{Comm}_t$ is the communication delay
226
VFI partitioning algorithm
1. Given an initial partitioning with N islands, find the static voltages (solve the voltage/frequency assignment problem)
2. For all pairs of neighboring islands (i, j): merge VFIs i and j, solve the static VF assignment problem, and compute the energy consumption
3. Merge the pair of islands that provides the minimum energy
4. Update the VFI configuration and repeat from step 2
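The greedy merging loop of the VFI partitioning algorithm can be illustrated with a toy energy model. The sketch below is hypothetical and only mirrors the control flow on the slide: it assumes each tile has a minimum required voltage, that an island runs at the highest requirement among its tiles, that per-tile energy scales with V², and that every island adds a fixed overhead. None of these modeling choices come from the slides; the real flow solves the voltage/frequency assignment problem at each step.

```python
def total_energy(islands, v_req, overhead):
    """Toy E_Total = E_App + sum of per-island overheads (assumed model)."""
    app = sum(len(isl) * max(v_req[t] for t in isl) ** 2 for isl in islands)
    return app + overhead * len(islands)

def greedy_vfi(v_req, edges, overhead):
    """v_req: {tile: required voltage}; edges: set of neighboring tile pairs.

    Starts with one island per tile, repeatedly merges the neighboring pair
    that gives the lowest total energy, and returns the best configuration
    seen along the way as (energy, list of islands).
    """
    islands = [frozenset([t]) for t in v_req]
    best = (total_energy(islands, v_req, overhead), list(islands))
    while len(islands) > 1:
        candidates = []
        for a in range(len(islands)):
            for b in range(a + 1, len(islands)):
                neighbors = any((x, y) in edges or (y, x) in edges
                                for x in islands[a] for y in islands[b])
                if neighbors:
                    merged = [isl for k, isl in enumerate(islands) if k not in (a, b)]
                    merged.append(islands[a] | islands[b])
                    candidates.append((total_energy(merged, v_req, overhead), merged))
        if not candidates:
            break
        energy, islands = min(candidates, key=lambda c: c[0])
        if energy < best[0]:
            best = (energy, list(islands))
    return best

# Four tiles in a row: two can run at a low voltage, two need a high one.
energy, islands = greedy_vfi({0: 1.0, 1: 1.0, 2: 2.0, 3: 2.0},
                             {(0, 1), (1, 2), (2, 3)}, overhead=0.5)
```

With these made-up numbers the search settles on two islands: merging tiles with equal requirements removes overhead at no cost, while merging across the voltage boundary would force the low-voltage tiles to run at the high voltage.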
227
Energy savings for a 5x5 multi-VFI NoC
Energy cons.: 1-VFI: 6.9 mJ; 2-VFI: 2.6 mJ; 3-VFI: 1.5 mJ
[Figures: 1-VFI, 2-VFI, and 3-VFI configurations, plotted by tile ID]
228
Run-time application mapping
Time
Applications : Application comes in
: DVFS
App 1: VFI1, VFI2App 2
App 3
Selection of this region is very important!
229
What does an on-chip router look like?
[Figure: on-chip router with five input ports (North, East, West, South, Local), each with an input FIFO, address decoder, and channel controller; a crossbar switch with a crossbar arbiter; and five output channels (North, East, West, South, Local)]
What should the proper buffer size of each input FIFO be?
231
Impact of buffer size on router area
[Figures: prototype router layout (buffer = 4 words), showing buffer vs. other logic; router size vs. buffer capacity]
Buffering resources for on-chip routers consume significant area.
To reduce the chip cost, the use of these resources has to be minimized.
232
Impact of buffer size on performance
[Figures: system performance for different buffer capacities; histogram for 1000 different random buffer configurations. Labeled values: uniform, 1856 cycles; best random, 187 cycles; 862; 64]
• Most NoCs are application-specific and demonstrate specific traffic patterns.
• With limited buffering space available, it is important to carefully allocate these resources to each channel to match the traffic pattern of the given application.
233
Problem formulation
Given:
• Total available buffering budget B
• Application communication characteristics
• Architecture-specific packet servicing time S and routing function R
• S characterizes the packet service time in a router without contention
Determine:
• Buffer length for each input channel
• Minimize the average packet latency L
$\min(L) \quad \text{s.t.} \quad \sum_{\forall x} \sum_{\forall y} \sum_{\forall dir} l_{x,y,dir} \le B$
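To make the budget constraint concrete, here is a small greedy sketch that distributes B buffer slots over input channels. It is an assumption-laden stand-in, not the analytical router model from the tutorial: each channel is treated as an independent M/M/1/K queue, and each extra slot goes to the channel that is currently most likely to be full. The names `blocking_probability`, `allocate_buffers`, and the channel loads are hypothetical.

```python
def blocking_probability(rho, k):
    """Probability that an M/M/1/K queue with load rho and capacity K is full."""
    if rho == 1.0:
        return 1.0 / (k + 1)
    return (1.0 - rho) * rho ** k / (1.0 - rho ** (k + 1))

def allocate_buffers(loads, budget):
    """loads: {channel: utilization rho}; budget: total buffer slots B.

    Returns {channel: buffer length}, with the lengths summing to the budget.
    """
    alloc = {ch: 1 for ch in loads}  # every channel needs at least one slot
    assert budget >= len(alloc), "budget too small for one slot per channel"
    for _ in range(budget - len(alloc)):
        # Give the next slot to the channel most likely to block.
        worst = max(alloc, key=lambda ch: blocking_probability(loads[ch], alloc[ch]))
        alloc[worst] += 1
    return alloc

# A heavily loaded channel and a lightly loaded one sharing six slots.
alloc = allocate_buffers({"north": 0.9, "local": 0.1}, budget=6)
```

The point of the toy run is the asymmetry: a uniform split would give each channel three slots, whereas the traffic-aware split hands almost all of the spare slots to the busy channel.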
234
Router/Channel Analytical Models
[Figure: router R(x,y) with channels C(x,y,N), C(x,y,E), C(x,y,W), C(x,y,S), C(x,y,L), and C(x,y,LO), together with the neighboring channels C(x-1,y,E), C(x+1,y,W), C(x,y-1,N), and C(x,y+1,S). Each input port (North, East, West, South, Local) has an input FIFO, address decoder, and channel controller; a crossbar switch with a crossbar arbiter drives the North, East, West, South, and Local output channels]
235
Evaluation under realistic traffic
Applying the algorithm to applications that mimic real traffic
• Automotive
• Telecommunication
• Audio-video multimedia systems
To achieve the same or better performance as CNoC, the UNoC has to consume 12.5, 3.5 and 6 times more buffering space.
[Figure: packet latency (cycles) comparison for the Automotive, Telecom, and Audio-video benchmarks: UNoC (B=96), UNoC (B=144), CNoC (B=96); vertical axis 0 to 2000 cycles; labeled values 908.3, 537.4, 181.6]
236
‘Regularized’ MPEG-2 decoder
Map the MPEG application onto tiles and use a network for inter-tile communication
Experiments
• Collect network traffic traces via simulation (time stamp, #bytes arrived)
• Analyze macroblock-level statistical properties of the resulting time series
• Build formal models of packet size distributions
Trace 1
Trace 2
237
Surprising result: On-chip self-similar behavior
[Figure: autocorrelation coefficient (-1 to +1) vs. lag k (0 to 100). Expectation: a typical short-range dependent process, where interactions “die out”. Reality: a typical long-range dependent process]
The rate at which autocorrelation decays is described by the Hurst parameter (H)
Self-similar (fractal) processes model long range dependence (0.5 < H < 1.0)
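The difference between the two decay rates can be checked numerically. The sketch below compares the autocorrelation of fractional Gaussian noise, a standard long-range dependent model with Hurst parameter H, against a first-order autoregressive (Markovian) process. Choosing these two models and the parameter values is an assumption made for illustration; the slides do not say which models were fitted to the MPEG traces.

```python
def fgn_autocorrelation(k, hurst):
    """Autocorrelation at lag k of fractional Gaussian noise with Hurst parameter H."""
    h2 = 2.0 * hurst
    return 0.5 * (abs(k + 1) ** h2 - 2.0 * abs(k) ** h2 + abs(k - 1) ** h2)

def markov_autocorrelation(k, phi):
    """Autocorrelation at lag k of a first-order autoregressive process."""
    return phi ** abs(k)

# Compare both models at lag 100, the right edge of the plots on this slide.
lrd = fgn_autocorrelation(100, hurst=0.9)   # decays like a power law in k
srd = markov_autocorrelation(100, phi=0.9)  # decays exponentially in k
```

At lag 100 the long-range dependent model still shows a correlation of roughly 0.29, while the Markovian one has dropped below 1e-4. With H = 0.5 the first function reduces to uncorrelated noise, which is why long-range dependence is associated with 0.5 < H < 1.0.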
Hawaii video
…but that’s not what happens.This “heavy tailed” distribution confirms long-range interactions
Trace analysis from regularized MPEG
Ideal Markov (short-range) dependency
Ideal fractal behavior
Hawaii video data
238
Designing on-chip networks is quite unique!
[Figure: LRD model analytical prediction and Markovian model analytical prediction compared against simulation (axis ticks 0.01 and 0.1); the region of bad movie quality is marked]
Implications of long-range dependent traffic on on-chip network design
• The average delay of a buffer increases sharply at surprisingly low utilization factors
• If ignored, this produces optimistic performance predictions and inadequate resource allocation
239
When it comes to silicon, regularity is good!
Fully structured
• Well-controlled parameters
• Low power
• Simple to lay out
• Large inter-node distance
• Large latency
• Not application specific
Fully customized
• Small inter-node distance
• Better performance
• Widely varying links
• Loss of structure
• Wire routing, floorplan
[Figure: tile-based mesh with tiles (0,0), (0,1), (1,1), (2,1); each tile has a processing element, communication wrapper, switch fabric, input buffers, and output buffers]
241
“It’s a Small-World After All”
[Figures: physics collaboration network (Newman et al., Physical Review, 2004); yeast proteins (Maslov et al., Science, 2002)]
Graph theory: regular graphs, small-world networks, random graphs
• Small-world networks are highly clustered and have short inter-node distances
Small-world networks
• Natural: biological networks
• Social networks: movie actors, collaboration networks
• Technological: Internet, WWW
242
Inducing small-world effects in NoCs
Completely structured
• Well-controlled parameters
• Low power
• Simple to lay out
• Large inter-node distance
• Long latency
• Not tailored towards a target application
Fully customized
• Small inter-node distance
• Better performance
• Loss of structure
• Widely varying links
• Wire routing, floorplan, timing, …
Customization via LRL (long-range link)
243
Problem formulation
Given
• Communication frequencies between nodes
• Maximum number of links to be added
• The initial network and corresponding routing strategy
Determine
• The set of long-range links to be added on top of the mesh network
• A deadlock-free routing strategy for the newly added long-range links
s.t. network performance is optimized
$f_{ij} = \frac{V_{ij}}{\sum_{p} \sum_{q \ne p} V_{pq}}$
where $V_{ij}$ is the communication volume from node i to j
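As a quick worked example of the communication frequency definition, the following sketch normalizes a volume matrix so that the off-diagonal entries sum to one. The function name and the two-node volume matrix are made up for illustration.

```python
def communication_frequencies(volume):
    """volume[i][j] is the communication volume from node i to node j.

    Returns f[i][j] = V_ij / (sum of V_pq over all p and all q != p).
    """
    n = len(volume)
    total = sum(volume[p][q] for p in range(n) for q in range(n) if p != q)
    return [[volume[i][j] / total if i != j else 0.0 for j in range(n)]
            for i in range(n)]

# Node 0 sends 2 units to node 1; node 1 sends 6 units to node 0.
freq = communication_frequencies([[0, 2], [6, 0]])
```

Here `freq[0][1]` is 0.25 and `freq[1][0]` is 0.75, so the heavier flow gets three times the weight when candidate long-range links are evaluated.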
244
Performance evaluation
[Figure: phase transition between the FREE STATE and the CONGESTED STATE]
Monza Circuit, Italy: ave. speed in fastest lap 256 km/h; ave. speed for the race 245 km/h
Sepang Circuit, Malaysia: ave. speed in fastest lap 213 km/h; ave. speed for the race 201 km/h
245
Long-range link (LRL) insertion algorithm
Inputs: routing algorithm for the mesh network; available resources, S; communication frequencies, f_ij
For all (i, j), i, j ∈ T:
• Add a link from i to j
• Evaluate the current configuration
Update utilization
While utilization < S, repeat; otherwise, generate routing data
Output
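The insertion loop can be sketched as a greedy search. The version below is a simplified illustration under stated assumptions, not the published algorithm: cost is the frequency-weighted hop count computed by breadth-first search, a long-range link consumes as many resource units as its Manhattan length (one per regular link segment), and each iteration keeps the single link with the largest cost reduction. All function names are hypothetical, and deadlock-free routing is not modeled.

```python
from collections import deque
from itertools import product

def mesh_edges(rows, cols):
    """Undirected links of a rows x cols mesh."""
    edges = set()
    for r, c in product(range(rows), range(cols)):
        if r + 1 < rows:
            edges.add(((r, c), (r + 1, c)))
        if c + 1 < cols:
            edges.add(((r, c), (r, c + 1)))
    return edges

def hops(src, edges):
    """Breadth-first search hop counts from src over undirected edges."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)
    dist, queue = {src: 0}, deque([src])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, []):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def cost(freq, edges):
    """Frequency-weighted average hop count (assumed performance metric)."""
    return sum(f * hops(s, edges)[d] for (s, d), f in freq.items())

def insert_long_range_links(rows, cols, freq, budget):
    """Greedily add the most beneficial link while resources remain."""
    edges = mesh_edges(rows, cols)
    nodes = list(product(range(rows), range(cols)))
    added, used = [], 0
    while True:
        base, best = cost(freq, edges), None
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                length = abs(a[0] - b[0]) + abs(a[1] - b[1])
                if length < 2 or used + length > budget or (a, b) in edges:
                    continue  # not long-range, too expensive, or already there
                gain = base - cost(freq, edges | {(a, b)})
                if gain > 0 and (best is None or gain > best[0]):
                    best = (gain, a, b, length)
        if best is None:
            return added, base
        _, a, b, length = best
        edges.add((a, b))
        added.append((a, b))
        used += length

# A 1x4 row of tiles where all traffic goes from one end to the other.
links, final_cost = insert_long_range_links(1, 4, {((0, 0), (0, 3)): 1.0}, budget=3)
```

In this example a single link from (0,0) to (0,3) uses the whole budget and cuts the weighted hop count from 3 to 1.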
246
Routing strategy for LRLs
Local routing decision
1. Is there a long-range link? If no, use the default routing algorithm
2. Does the long-range link decrease d(i,j)? If no, use the default routing algorithm
3. Is it deadlock free? If no, use the default routing algorithm
4. Otherwise, use the long-range link
Basic turn model (allowed vs. not allowed turns): South-Last routing
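The local routing decision reduces to a few nested checks, sketched below. This is a hypothetical illustration: the default algorithm here is plain XY routing rather than the South-Last scheme named on the slide, and the deadlock check is passed in as a predicate (`turn_allowed`) instead of being derived from a turn model.

```python
def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def xy_route(current, dest):
    """Default routing (assumed): move along x first, then along y."""
    (x, y), (dx, dy) = current, dest
    if x != dx:
        return (x + (1 if dx > x else -1), y)
    if y != dy:
        return (x, y + (1 if dy > y else -1))
    return current

def next_hop(current, dest, long_links, default_route, turn_allowed):
    """long_links: {node: far endpoint of its long-range link}."""
    far = long_links.get(current)
    if far is not None:                                           # there is a long-range link
        if 1 + manhattan(far, dest) < manhattan(current, dest):   # it decreases d(i,j)
            if turn_allowed(current, far, dest):                  # and stays deadlock free
                return far
    return default_route(current, dest)

links = {(0, 0): (3, 0)}
always = lambda cur, far, dst: True
never = lambda cur, far, dst: False
hop_far = next_hop((0, 0), (3, 1), links, xy_route, always)    # takes the long link
hop_near = next_hop((0, 0), (1, 0), links, xy_route, always)   # long link would overshoot
hop_safe = next_hop((0, 0), (3, 1), links, xy_route, never)    # turn forbidden, fall back
```

Because the decision uses only the current node, the destination, and the local link table, it can be made independently at every router.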
247
Latency comparison
Auto-industry benchmark (4x4 mesh network vs. 4x4 mesh network with long-range links)
• 13.6% improvement in the critical traffic load
• 69% reduction in the latency at the critical load
Telecom benchmark (5x5 mesh network vs. 5x5 mesh network with long-range links)
• 36.3% improvement in the critical traffic load
• 61.5% reduction in the latency at the critical load
[Figures: average packet latency (cycles) vs. total packet injection rate (packet/cycle) for each benchmark]
248
Scalability, scalability, scalability
[Figures: critical traffic load (packets/cycle) and average packet latency (cycles) vs. network size (4x4, 6x6, 8x8, 10x10), for a 2D mesh network and a 2D mesh network with LRL]
249
Practical considerations
Extra port
Long-range Link
Long-range links are divided into regular link segments
X
Y
250
CMU’s FPGA prototype for long-range links
Details
• Wormhole routing
• Parameterized packet length
• Routing table
• 4 cycles service time for the header flit
• 16-bit channels (parameterized)
Area (for Xilinx Virtex-II XC2V4000 FPGA)
[Figure: router block diagram with a routing table and input/output controllers on ports 1 to 4]
[Figure: number of slices (0 to 600) for 3-port, 4-port, 5-port, and 6-port routers, with utilization labels 1.4%, 1.8%, 2.2%, and 2.8%]
4x4 mesh network: 6683 slices (29%)
4x4 mesh network with LRL: 7143 slices (31%)
251
Measurements using the FPGA prototype
252
Issues to worry about
Design complexity increases
Verification and testing become more difficult
For a manufacturing process shrinking from 0.25 μm to 0.18 μm and a supply voltage drop from 2 V to 1.6 V, α-particle and neutron effects increase more than 10 times
C. Constantinescu - “Neutron SER Characterization of Microprocessors”, DSN 2005
Destination
Source
Cosmic Rays
254
Tiles that communicate stochastically
Each tile contains
• IP core
• Send / receive buffers
• CRC hardware
• Random number generator (RND)
Main idea
• Probabilistic broadcast: packets are randomly transmitted several times, using multiple paths
• Transmissions are protected by CRC. Corrupted packets are discarded
We call this on-chip stochastic communication
[Figure: 4x4 grid of tiles numbered 1 to 16 with a producer and a consumer; each tile has an IP, buffers, a CRC check, and RND units]
255
Parameters of stochastic communication
Transmission probability (p ∈ [0,1])
• Governs the random packet transmission
• Influences latency and energy dissipation
• Messages are disseminated “explosively” fast
Time-to-live (TTL)
• A packet has a finite TTL
• Influences the energy dissipation
Energy dissipation
• Depends on the total number of packets ($N_{packets}$) transmitted in the NoC
• Can be obtained from $E_{communication} = N_{packets} \cdot S \cdot E_{bit}$, where S = average size of packets and $E_{bit}$ = energy dissipated per bit
Duration of a round ($T_R$)
• Influences latency
$T_R = \frac{N_{packets/round} \cdot S}{f}$
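A small simulation helps to see how p and the TTL interact. The sketch below is a simplified model written for illustration only: every informed tile forwards the packet to each mesh neighbor with probability p in every round, faults and CRC checks are not modeled, and the names `gossip` and `comm_energy` are hypothetical. The energy helper just evaluates the formula above.

```python
import random

def gossip(rows, cols, producer, consumer, p, ttl, seed=0):
    """Return (round in which the consumer is reached or None, packets sent)."""
    rng = random.Random(seed)
    informed = {producer}
    packets = 0
    for rnd in range(1, ttl + 1):
        newly = set()
        for (r, c) in informed:
            for nb in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                inside = 0 <= nb[0] < rows and 0 <= nb[1] < cols
                if inside and rng.random() < p:  # forward with probability p
                    packets += 1
                    newly.add(nb)
        informed |= newly
        if consumer in informed:
            return rnd, packets
    return None, packets  # TTL expired before the consumer was reached

def comm_energy(n_packets, avg_size_bits, e_bit):
    """E_communication = N_packets * S * E_bit."""
    return n_packets * avg_size_bits * e_bit

# With p = 1 the broadcast degenerates into flooding on a 4x4 mesh.
rounds, sent = gossip(4, 4, (0, 0), (3, 3), p=1.0, ttl=10)
silent = gossip(4, 4, (0, 0), (3, 3), p=0.0, ttl=10)
```

With p = 1 the consumer in the opposite corner is reached after six rounds, the Manhattan distance, at the price of many redundant packets. Lower values of p trade latency for fewer transmissions and therefore less energy.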
256
Problem description
Given
• NoC communication architecture and an initial configuration (i.e., a set of sources and destinations)
• Environment influences the NoC reliability via various particles (neutron, α, etc.)
• Each node can send the packet to a subset of its neighbours at random
Determine
• Evolution of the packet dissemination s.t. the system-level fault tolerance is ensured through the probabilistic communication mechanism
[Figure: 4x4 grid of tiles numbered 1 to 16 with a producer and a consumer; failure types shown: data upset, buffer overflow, synchronization error]
257
New model: Spreader-Ignorant interaction
S: Spreader, I: Ignorant
[Figures: mesh and complete-graph examples in which spreader (S) nodes interact with neighboring ignorant (I) nodes, with parameters α1, α2, α3, α4]
$P\{S(t+h) = s,\ I(t+h) = i \mid S(t) = s-k,\ I(t) = i+k\} = \alpha_k (s-k)(i+k)\,h + O(h), \quad \text{for } k \in \{1, 2, 3, 4\}$
Main idea: The topology plays an essential role!
258
Master equation for packet dissemination
What is the probability P(s,i,t) of having s spreader nodes and i ignorant nodes at time t?
Solve the following equation:
[Equation: master equation for dP(s,i,t)/dt, with terms labeled Spreader-Ignorant Interaction, Spreader-Spreader Interaction, and Spreader-Stifler Interaction]
259
Coverage analysis
Coverage of critical points for a 10x10 mesh network (250 rounds)
• The number of reached nodes saturates
• As the probability gets higher, saturation (in # of reached nodes) is reached faster
Stochastic simulation of spreader nodes in a 10x10 mesh network
• Forwarding probability 0.7
• Similar asymptotic behaviors in the presence of faults
[Figures: coverage for a 10x10 mesh network without faults]
260
Putting it all together
[Figure: hierarchical NoC combined with a shared bus]
The desire for lower power and higher performance suggests on-chip diversity (e.g., GALS architectures, mixed technologies, complex deterministic and stochastic communication)
261
The big picture
paradigm
infrastructure
application
adaptivedet
SWN
randomcustom
regularrandom
LRL
DyAD
MPEG2
multimedia
X
Y
Z
263
The big picture
paradigm
infrastructure
application
adaptivedet
SWN
randomcustom
regularrandom
LRL
DyAD
MPEG2A/V
MPEG4multimedia
X
Y
Z
264
Mapping the Spread of Contagions. Black nodes are persons potentially infectious, pink nodes represent exposed persons with incubating infection and are not infectious, green represent exposed persons with no infection and are not infectious. The infection status is unknown for the grey nodes. The black node in the center of the graph, is also the most infectious.(Source www.orgnet.com)
The big picture
paradigm
infrastructure
application
adaptivedet
SWN
randomcustom
regularrandomDyAD
rumors/epidemics
X
Y
Z
265
More info about some slides – see references…
General
R. Marculescu, U. Y. Ogras, N. H. Zamora, 'Computation and Communication Refinement for Multiprocessor SoC Design: A System-Level Perspective,' in ACM TODAES, Vol. 11, No. 3, July 2006.
U. Y. Ogras, J. Hu, R. Marculescu, 'Key Research Problems in NoC Design: A Holistic Perspective,' in Proc. CODES+ISSS, Jersey City, NJ, Sep. 2005.
A. Jantsch, H. Tenhunen, Eds., Networks on Chip, Kluwer Academic, 2003.
Point-to-point communication synthesis
J. Hu, Y. Deng, R. Marculescu, 'System-Level Point-to-Point Communication Synthesis Using Floorplanning Information,' in Proc. ASPDAC-VLSI, Bangalore, Jan. 2002.
A. Pinto, L. Carloni, A. Sangiovanni-Vincentelli, 'Constraint-Driven Communication Synthesis,' in Proc. DAC, New Orleans, LA, June 2002.
266
References (cont’d)
NoC mapping, scheduling, routing
J. Hu, R. Marculescu, 'Communication and Task Scheduling of Application-Specific Networks-on-Chip,' in IEE Proc. Computers & Digital Techniques, Sept. 2005.
J. Hu, R. Marculescu, 'Energy- and Performance-Aware Mapping for Regular NoC Architectures,' in IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, Vol. 24, No. 4, April 2005.
Buffer allocation, topology synthesis/customization
J. Hu, U. Y. Ogras, R. Marculescu, 'System-Level Buffer Allocation for Application-Specific Networks-on-Chip Router Design,' in IEEE Trans. CAD, Vol. 25, Dec. 2006.
U. Y. Ogras, R. Marculescu, '"It’s a small world after all": NoC Performance Optimization via Long Link Insertion,' in IEEE Trans. on VLSI, Vol. 14, July 2006.
Traffic analysis
G. Varatkar, R. Marculescu, 'Traffic Analysis for On-chip Networks Design of Multimedia Applications,' in IEEE Trans. on VLSI, Jan. 2004.
267
References (cont’d)
Stochastic communication
P. Bogdan, T. Dumitras, R. Marculescu, 'Stochastic Communication: A New Paradigm for Fault-Tolerant Networks-on-Chip,' in VLSI Design, Hindawi Publishing Corp., 2007.
T. Dumitras, R. Marculescu, 'On-Chip Stochastic Communication,' in Proc. DATE, Munich, Germany, March 2003.
C. Constantinescu, 'Impact of Deep Submicron Technology on Dependability of VLSI Circuits,' in Proc. DSN, 2002.
Power and link management
C.-L. Chou, R. Marculescu, 'Incremental Run-time Application Mapping for Homogeneous NoCs with Multiple Voltage Levels,' in Proc. CODES+ISSS, Salzburg, Austria, Oct. 2007.
U. Y. Ogras, R. Marculescu, P. Choudhary, D. Marculescu, 'Voltage-Frequency Island Partitioning for GALS-based Networks-on-Chip,' in Proc. DAC, 2007.
J. Hu, Y. Shin, N. Dhanwada, R. Marculescu, 'Architecting Voltage Islands in Core-based System-on-a-Chip Designs,' in Proc. ISLPED, Newport Beach, CA, Aug. 2004.
268
References (cont’d)
Implementation
U. Y. Ogras, R. Marculescu, H. G. Lee, P. Choudhary, D. Marculescu, M. Kaufman, P. Nelson, 'Challenges and Promising Results in NoC Prototyping Using FPGAs,' in IEEE Micro Special Issue on Interconnects for Multi-Core Chips, Sept./Oct. 2007.
H. G. Lee, N. Chang, U. Y. Ogras, R. Marculescu, 'On-chip Communication Architecture Exploration: A Quantitative Evaluation of Point-to-Point, Bus and Network-on-Chip Approaches,' in ACM TODAES, Vol. 12, No. 3, June 2007.
This list of references is NOT exhaustive. There are many good contributions not mentioned here due to space limitations.
A good selection of NoC papers is available at
http://www.cl.cam.ac.uk/~rdm34/onChipNetBib/browser.htm
http://www.ocpip.org/university/biblio_main/comparison/
Sponsors: Marco GSRC, SRC, NSF
269
NoCPro: The CMU-SNU prototype
• A complete MPEG-2 encoder implementation using NoC, bus-based and P2P architectures
• Detailed area, power, and performance comparisons based on real measurements
• Illustrates the scalability of the NoC approach using a real multimedia application
[Figure: Input Buffer, DCT & Quant., VLE & Out. Buffer, IQuant. & IDCT, Motion Est., Motion Comp., and Frame Buffer connected through routers R1 and R2]
270
MPEG-2 Encoder
MPEG-2 belongs to the rich class of multimedia applications
• JPEG, MJPEG, MPEG1, etc.
[Figure: MPEG-2 communication task graph with nodes Input Buffer, DCT, Quant., IQuant., IDCT, Motion Est., Motion Comp., Frame Buffer, and VLE & Out]
271
MPEG-2 Encoder Implementation (P2P)
Processing elements
Area (Xilinx Virtex-II FPGA)
Processing Element: # of BRAMs
Input Buffer: 1
DCT/Quant.: 1
VLE/Out Buf.: 1
IQuant/IDCT: 1
Reconst FB.: 75
Motion Comp.: 19
Motion Esti.: 8
# of Slices (values as extracted; the row assignment is not preserved): 961, 480, 956, 802, 3,873, 2,527, 74
[Figure: processing element (PE) with its network interface (116 slices): input port with input buffer (FIFO) and output port with output buffer (FIFO)]
272
MPEG-2 Encoder Implementation (P2P)
[Figure: processing elements Frame Buffer (FB), Input Buffer (IB), DCT & Quant. (DQ), VLE & Out. Buffer (VB), Motion Comp. (MC), Motion Est. (ME), and IQuant. & IDCT (IQ), each with network interfaces, connected by dedicated links]
# of links: 10
# of NIs: 19
273
MPEG-2 Encoder Implementation (NoC)
On-chip router design
• Wormhole routing
• Packet length: 64 flits (parameterized)
• Routing table
• 4-cycle service time for header flit
Area (Xilinx Virtex-II XC2V4000)
Router    Resource (# of slices)    Util. (%)
3-port    219                       1.0
4-port    304                       1.3
5-port    397                       1.7
6-port    503                       2.2
[Figure: router block diagrams with a routing table and input/output controllers on ports 1 to 4]
274
MPEG-2 Encoder Implementation (NoC)
MPEG-2 encoder (hierarchical star network)
[Figure: Input Buffer, DCT & Quant., VLE & Out. Buffer, IQuant. & IDCT, Motion Est., Motion Comp., and Frame Buffer connected through routers R1 and R2. Each PE connects to its router through a packetize/depacketize interface with an input port, an output port, and FIFO input and output buffers; label: 189 slices]
# of links: 8
# of NIs: 7
# of routers: 2
275
Scalability with Increasing Parallelism
Increase the number of modules
• The ME module is the performance bottleneck
[Figures: NoC and P2P implementations of the encoder with a second motion estimation module (ME2) added alongside ME1]
P2P: # of links: 14, # of NIs: 27
NoC: # of links: 9, # of NIs: 8, # of routers: 2
276
To summarize…
MotionEst. 2
MotionEst. 2
InputBuffer R1 R2
DCT &Quant.
VLE &Out. Buffer
Inv Quant.& IDCT
MotionEst.
MotionComp.
FrameBuffer
Networks-on-chip Implementation
FrameBuffer
InputBuffer
DCT &Quant.
VLE &Out. Buffer
MotionComp.
MotionEst.
Inv Quant.& IDCT
Point-to-point Implementation
Dedicatedlinks
MotionEst. 2
InputBuffer
DCT &Quant.
VLE &Out. Buffer
Inv Quant.& IDCT
MotionEst.
MotionComp.
FrameBuffer
Bus Implementation
Bus Cont.Unit
277
Area and Performance Comparison
[Figure: # of frames/sec (0 to 500) vs. degree of parallelism (1, 2, 4, 8) for the P2P, NoC, and bus implementations]
In terms of performance
• NoC scales similarly to the P2P implementation
• The bus implementation scales poorly
[Figure: # of slices (0 to 25K) vs. degree of parallelism (1, 2, 4, 8) for the P2P, bus, and NoC implementations]
In terms of area
• NoC scales similarly to the bus architecture
• Scaling of the P2P implementation is poor
278
Energy and Power Consumption Comparison
[Figure: energy (mJ/frame) and percentage (%) vs. degree of parallelism (1, 2, 4, 8) for the P2P, bus, and NoC implementations]
In terms of energy per frame
• NoC exhibits the best scalability
• Scaling of the bus implementation is poor
[Figure: power (mW) and percentage (%) vs. degree of parallelism (1, 2, 4, 8) for the P2P, bus, and NoC implementations]
In terms of power consumption
• The bus implementation exhibits the best scalability (due to slow operation)
• Scaling of the NoC implementation is better than P2P
279
Conclusion
Main message: The NoC outperforms P2P!
• Less design complexity and smaller area (~22%)
• Less energy consumption (~42%) and better energy-delay product
• Similar performance (P2P: 48 frames/s, NoC: 47 frames/s)
The NoC implementation is scalable in terms of area, performance, and power consumption
Embedded Low Power Laboratory
Outline
Foundation of platform-based design and Metropolis framework (Alberto/Douglas)
Challenges in System Level Design
Platform-based Design as a unifying methodology
A framework for PBD:
• Theoretical foundations (heterogeneous systems, metamodeling, abstract semantics)
• Metropolis II: integration platform architecture
• Application to embedded system design: cars and building networked embedded systems
Synthesis for functionality (Jason)
Synthesis for customized logic
Use of application-specific processors and processor networks
Synthesis for communication (Radu)
Networks-on-Chip design space
Optimization for performance, energy, and fault-tolerance
Real-life examples (all)
xPilot: Behavioral-to-RTL Synthesis Flow
[Figure: flow from behavioral spec. in C/SystemC and platform description, through the frontend compiler and SSDM, to RTL + constraints for FPGAs/ASICs]
Presynthesis optimizations
• Loop unrolling/shifting
• Strength reduction / Tree height reduction
• Bitwidth analysis
• Memory analysis …
Core synthesis optimizations
• Scheduling
• Resource binding, e.g., functional unit binding, register/port binding
μArch-generation & RTL/constraints generation
• Verilog/VHDL/SystemC
• FPGAs: Altera, Xilinx
• ASICs: Magma, Synopsys, …
Advantages of Behavioral Synthesis
Shorter verification/simulation cycle
• 100X speed up with behavior-level simulation
Better complexity management, faster time to market
• 10X improvement on code density
Rapid system exploration
• Quick evaluation of different hardware/software boundaries
• Fast exploration of multiple micro-architecture alternatives
Higher quality of results
• Platform-based synthesis & optimization
• Full consideration of physical reality
xPilot is Licensed to AutoESL for Commercialization
[Figure: AutoPilot™ ESL synthesis flow. Inputs: design specification (C/C++/SystemC), user constraints, timing/power/layout constraints, and a platform characterization library. AutoPilot™ stages: compilation & elaboration, advanced code transformation, behavioral & communication synthesis and optimizations. Outputs: RTL HDLs & RTL SystemC for ASICs/FPGAs implementation and FPGA prototype. A common testbench is used for simulation, verification, and prototyping.]
• Platform-based & communication-centric ESL synthesis
• Automated ESL-to-GDSII silicon compilation
• Rapid platform-based system-level exploration
• More than 10X design productivity gain
AutoClipse IDE
Standard JFace Text Editor
• Keyword highlighting, key bindings
Outline View
• Using internal parser
Content Assist
• Using internal parser and index
C/C++ Projects View
• Showing CDT specific things: includes, binaries
Build Console Output
MPEG-4 Simple Profile Decoder: C-Based Synthesis Results
Setting: Device: v2p30; Video: CIF@30fps
Modules: Parser/VLD, Motion Comp., Texture/IDCT, Texture Update & Copy Control
Slices (per-module values; the row assignment is not preserved in the extraction): 2693, 1407, 2032, 899
Period (ns): 8.0
BRAMs: 16

• Complexity of synthesized RTLs

Module Name        Orig. C Source File (+ line#)    VHDL line#
Copy Controller    copyControl.c (287)              2815
Motion Comp.       Motion-compensation.c (312)      4681
Parser/VLD         bitstream.c (439)                6093
Parser/VLD         parser.c (1095)                  12036
Parser/VLD         texture_vld.c (504)              6089
(module label not preserved)  motion_decode.c (492) 10934
Texture/IDCT       texture_idct.c (1819)            11537
Texture Update     textureUpdate.c (220)            2736
Total              5168                             56921
Experimental Results: ASIC Flow
• Magma RTL to GDSII flow
• Technology library: TSMC 90nm
• Design: Motion Compensation Block

• 1st column: Cycle time constraint enforced in AutoPilot and Magma tools
• 2nd column: Estimated cycle count of synthesized RTL
• 3rd-5th column: Data reported by Magma tool

Clock Period      Cycle   Cell    Area     Crit. Path   Fmax     Total
Constraint (ps)   Count   Count   (um2)    Delay (ps)   (MHz)    Latency (ns)
1000              1095    6868    44778    1071         933.7    1172.7
1500              859     4918    36933    1353         739.1    1162.2
2000              795     3890    32410    1831         546.1    1455.6
2500              787     3509    31111    2330         429.2    1833.7
3000              705     3304    28995    3047         328.2    2148.1
3500              641     3275    28487    3361         297.5    2154.4
4000              641     3271    28419    3810         262.5    2442.2
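The last two columns of the table follow from the others. The sketch below recomputes them from the cycle counts and critical path delays on the slide, assuming that Fmax is the reciprocal of the critical path delay and that total latency is the cycle count times the critical path delay. The function names are illustrative and are not part of AutoPilot or the Magma tools.

```python
# (clock period constraint in ps, cycle count, critical path delay in ps),
# copied from the Motion Compensation Block results.
ROWS = [
    (1000, 1095, 1071),
    (1500, 859, 1353),
    (2000, 795, 1831),
    (2500, 787, 2330),
    (3000, 705, 3047),
    (3500, 641, 3361),
    (4000, 641, 3810),
]

def fmax_mhz(crit_path_ps):
    """Maximum clock frequency in MHz for a critical path given in ps."""
    return 1e6 / crit_path_ps

def total_latency_ns(cycle_count, crit_path_ps):
    """Total latency in ns when the design runs at its critical path delay."""
    return cycle_count * crit_path_ps / 1000.0

for constraint_ps, cycles, crit_ps in ROWS:
    print(f"{constraint_ps:5d} ps: Fmax {fmax_mhz(crit_ps):6.1f} MHz, "
          f"latency {total_latency_ns(cycles, crit_ps):7.1f} ns")

# The constraint with the lowest total latency trades cycle count against clock speed.
best = min(ROWS, key=lambda row: total_latency_ns(row[1], row[2]))
print("Lowest total latency at constraint:", best[0], "ps")
```

With these numbers the tightest constraint is not the fastest overall: the extra cycles needed at 1000 ps outweigh the shorter clock period.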
ESL SystemC to ASIC (Magma Flow)
[Figure: ESL SystemC to ASIC flow: behavioral SystemC design model -> AutoPilot™ synthesis tool -> RTL SystemC, VHDL/Verilog -> Magma BlastCreate]
SystemC behavioral specification
• AES (Advanced Encryption Standard)
• Untimed model; bit-accurate data types
• About 1300 lines of code
AutoPilot synthesis result
• Latency: 270 cycles
• RTL Verilog code: about 23K lines
Magma Blast-Create result
• Technology node: TSMC-90nm
• Area (u2): 70K
• Frequency: 125MHz+ (8ns constraint)
AutoPilot™ Simulation/Verification Flow
[Figure: a behavioral C/C++/SystemC design model and a behavior-level (untimed) test bench and stimuli feed the synthesis flow (AutoPilot™ synthesis tool) and the automated simulation flow. Outputs: RTL SystemC and HDL models, cycle accurate waveform / coverage and assertion report, ASICs/FPGAs RTL synthesis/layout, FPGA prototype/emulation.]
AutoPilot™ Bench Adapter
• Generate wrappers for RTL models
• Reuse untimed bench and stimuli
• Automatically compare SystemC and HDL waveforms
Compilation for Reconfigurable Accelerated Computing
AutoPilot C-based synthesis for high-performance computing
Synthesize pure ANSI-C, the “Universal Language” for software programmers
• Quickly port legacy C programs into optimized hardware implementations
GCC-compatible compilation flow
• Full support of IEEE-754 floating point data types & operations
• Efficiently handle bit-accurate fixed-point arithmetic
Automatic parallelization / pipelining for performance speedup

ANSI-C (input to AutoPilot™):
    int test(int in0, int in1)
    {
        /* user code */
    }

RTL VHDL (output of AutoPilot™):
    entity test is port (
        in0    : IN SIGNED (31 downto 0);
        in1    : IN SIGNED (31 downto 0);
        result : OUT SIGNED (31 downto 0);
        clk    : IN STD_LOGIC;
        reset  : IN STD_LOGIC;
        done   : OUT STD_LOGIC;
        start  : IN STD_LOGIC );
    end;
    -- Synthesized code in VHDL --
Acceleration of Lithographic Simulation with AutoPilot™
Lithography simulation
• Simulate the optical imaging process
• Computationally intensive; very slow for full-chip simulation
15X+ Performance Improvement vs. AMD Opteron 2.2GHz Processor with automated compilation
• XtremeData X1000 development system (AMD Opteron + Altera StratixII EP2S180)
[Figure: algorithm in C -> AutoPilot™ synthesis tool]
$$I(x,y) = \sum_{k} \lambda_k \cdot \left| \sum_{\tau} \left[ \psi_k(x-x_1,\, y-y_1) - \psi_k(x-x_2,\, y-y_1) + \psi_k(x-x_2,\, y-y_2) - \psi_k(x-x_1,\, y-y_2) \right] \right|^2$$
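The intensity formula can be read as a weighted sum over kernels of the squared magnitude of a four-corner sum over layout rectangles. The sketch below is a minimal software rendering of that reading, not the C algorithm that was synthesized. It assumes that τ indexes rectangles with opposite corners (x1, y1) and (x2, y2), that each ψ_k is a callable returning a (possibly complex) value, and that λ_k is a real weight; the slide defines none of these. The name intensity and the toy kernel are hypothetical.

```python
def intensity(x, y, rectangles, kernels):
    """Evaluate I(x, y) for a list of rectangles and weighted kernels.

    rectangles: iterable of (x1, y1, x2, y2) corner tuples (assumed meaning of tau).
    kernels: iterable of (lambda_k, psi_k), where psi_k(dx, dy) returns a number.
    """
    total = 0.0
    for weight, psi in kernels:
        acc = 0j
        for x1, y1, x2, y2 in rectangles:
            # Four-corner combination with the signs given in the formula.
            acc += (psi(x - x1, y - y1) - psi(x - x2, y - y1)
                    + psi(x - x2, y - y2) - psi(x - x1, y - y2))
        total += weight * abs(acc) ** 2
    return total

# Toy kernel for illustration only: with psi(dx, dy) = dx * dy the four-corner
# sum reduces to the rectangle area, here 2 * 3 = 6.
toy_kernels = [(0.5, lambda dx, dy: dx * dy)]
print(intensity(1.0, 1.0, [(0, 0, 2, 3)], toy_kernels))  # 0.5 * 6**2 = 18.0
```

The two nested loops are independent across kernels and rectangles, which is the kind of structure that lends itself to the parallelization and pipelining mentioned on the earlier slide.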
AutoPilot QoR: Customer design
Evaluation criteria
• QoR: Compare (latency and area) with manual design
• Quick design space exploration
Results highlight
• Parameterizable C design
• QoR comparable to hand design; discovered an alternative with 26% better latency
AutoPilot™ Features and Benefits
AutoPilot™ Features
• Platform-based and implementation-aware behavioral synthesis
• Scalable & near-optimal algorithms for global behavioral & communication co-optimizations
• Integrated C/C++/SystemC-based design flow
• Precise platform pre-characterization allowing more informed optimizations
• Advanced compiler optimizations
• Communication/interconnect centric synthesis
• Unified FPGAs and ASICs support
Benefits
• Higher quality-of-results
• Faster timing/power closure
• Enable design reuse: from single source to multiple RTLs for different technologies & platforms
• Best language/abstraction support; enable efficient simulation, prototyping, and implementation
Example Design
1. Select an application and understand its behavior.
2. Create a Metropolis functional model which models this behavior.
3. Assemble an architecture from library services or create your own services.
4. Map the functionality to the architecture.
5. Extract a structural file from the top level netlist of the architecture created.
[Figure: JPEG Encoder Function Model (Block Level) with blocks Preprocessing, DCT, Quantization, Huffman; a Mapping Process for each block; an architecture drawn from the IP Library with MicroBlaze, SynthMaster, SynthSlave, two BRAMs, and the On-Chip Peripheral Bus (OPB); the Structure Extractor takes the Top Level Netlist and produces the file for the Xilinx EDK tool flow]
Example Design Cont.
1. Feed the captured structural file to the permutation generator.
2. Feed the permutations to the Xilinx tools and extract the data.
3. Capture execution info for software and hardware services.
4. Provide transaction info for communication services.
[Figure: the file for the Xilinx EDK tool flow enters the Permutation Generator, which produces Permutation 1, Permutation 2, ..., Permutation N for the Platform Characterization Tool (Xilinx EDK/ISE Tools). The Characterizer Database holds ISS Info, Char Data, and Transaction Info. Example entries shown: Software Routines: int DCT (data) { Begin calculate …… }; 32 Bit Read = Ack, Addr, Data, Trans, Ack; Hardware Routines: DCT1 = 10 Cycles, DCT2 = 5 Cycles, FFT = 5 Cycles. One entry is tagged Automatic and two are tagged Manual.]
Example Design Cont.
1. Simulate the design and observe the performance.
2. Refine design to meet performance requirements.
3. Use Refinement Verification to check validity of design changes.
• Depth, Vertical, or Horizontal
• Refinement properties
4. Re-simulate to see if your goals are met.
Simulation results shown:
• Execution time 100ms, Bus Cycles 4000, Ave Memory Occupancy 500KB
• Execution time 200ms, Bus Cycles 1000, Ave Memory Occupancy 100KB
Backend Tool Process:
1. Abstract Syntax Tree (AST) retrieves structure.
2. Control Data Flow Graph: Depth
• FORTE: Intel Tool
• Reactive Models: UC Berkeley
3. Event Traces: Refinement Properties
• Vertical Refinement
• Horizontal Refinement
[Figure: JPEG Encoder Function Model (Block Level) with blocks Preprocessing, DCT, Quantization, Huffman mapped onto MicroBlaze, SynthMaster, SynthSlave, BRAMs, and the On-Chip Peripheral Bus (OPB), annotated with ISS Info, Char Data, and Transaction Info. Refinements shown: Concurrent (Vertical Refinement) and New Algorithm (Depth), each checked by the Verification Tool (Yes? No?)]
Intel MXP5800 Architecture
Highly Heterogeneous Parallel Platform
• Designed for Imaging Applications
• 8 Image Signal Processors connected with mesh
PEs have limited capabilities
• Communication is data-driven with support for multiple consumers
• Buffer memory is extremely limited: 16 registers
Application and Architecture Modeling
Functional Modeling
• Hierarchical
• 23 Processes
• 21 FIFOs
• Focus on DCT
[Figure: functional model with top-level blocks Pre-processing, DCT, Quantization, Huffman. Sub-blocks shown: Scan, Color Conv.; 1D-DCT, Transpose, 1D-DCT, Transpose; ZigZag, Mult; RLE, Lookup; and Shift, Add4, Sub4, Mult1, Mult2, Merge, Add2, Sub2]
Architectural Modeling
• Time is performance metric
• Tasks provide blocking read/write and execution services
• PEs support static schedules
Mapping
• Replication of best scenarios from Intel library
• Accurate performance modeling
• Easy implementation of additional scenarios
• Change allocation and scheduling
[Figure: "Cycles for different scenarios": cycles (axis 0 to 2500) per scenario (labels as extracted: Hardware Balanced OPE emphasis OPE Heavy), comparing Metropolis Scenarios with the Intel Software Library]
A. Davare, Q. Zhu, J. Moondanos, ASV, “JPEG Encoding on the MXP5800: A Platform-based Design Case Study,” Proceedings of EstiMedia 2005.
Motion JPEG on Xilinx
Step 1: Decompose application (MJPEG encoding) into desired topologies.
Step 2: Create MJPEG Functional Models in the MMM language.
Step 3: Create Architecture Models in MMM for a target platform.
Step 4: Map processes in the func. model to tasks in the arch. model.
Mapping Motion-JPEG
Step 3: Map processes in the functional model to tasks in the architecture model.
In our exploration: One to one mapping between functional and architectural tasks

System    Estimated       Characterized     Real      Rankings            Max     Execution     Area
          Cycles          Cycles            Cycles    (Real, Char, Est)   MHz     Time (Secs)   (Slices)
Model 1   145282 (52%)    228356 (25%)      304585    4, 4, 4             101.5   0.0030        4306
Model 2   103812 (33%)    145659 (6%)       154217    3, 3, 2             72.3    0.0021        4927
Model 3   103935 (29%)    145414 (1.2%)     147036    2, 2, 3             56.7    0.0026        7035
Model 4   103320 (28%)    144432 (<+1%)     143335    1, 1, 1             46.3    0.0031        9278
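The derived columns of the Motion-JPEG results can be recomputed from the raw cycle counts. The sketch below assumes that the percentages are the error of each predicted cycle count relative to the real cycle count, that execution time is real cycles divided by the maximum clock frequency, and that rankings order the models from fewest to most cycles. The dictionary layout and function names are illustrative and are not taken from the Metropolis tools.

```python
# Cycle counts and maximum clock frequency per model, copied from the slide.
MODELS = {
    "Model 1": {"est": 145282, "char": 228356, "real": 304585, "mhz": 101.5},
    "Model 2": {"est": 103812, "char": 145659, "real": 154217, "mhz": 72.3},
    "Model 3": {"est": 103935, "char": 145414, "real": 147036, "mhz": 56.7},
    "Model 4": {"est": 103320, "char": 144432, "real": 143335, "mhz": 46.3},
}

def error_pct(predicted, real):
    """Signed error of a predicted cycle count relative to the real one, in percent."""
    return 100.0 * (predicted - real) / real

def exec_time_s(cycles, mhz):
    """Execution time in seconds for a cycle count at a clock given in MHz."""
    return cycles / (mhz * 1e6)

def ranking(field):
    """Model names ordered from fewest to most cycles for the given field."""
    return sorted(MODELS, key=lambda name: MODELS[name][field])

for name, m in MODELS.items():
    print(f"{name}: est {error_pct(m['est'], m['real']):+.1f}%, "
          f"char {error_pct(m['char'], m['real']):+.1f}%, "
          f"time {exec_time_s(m['real'], m['mhz']):.4f} s")

print("real:", ranking("real"))
print("char:", ranking("char"))
print("est: ", ranking("est"))
```

Under these assumptions the characterized model reproduces the real ranking exactly, while the estimated model swaps Models 2 and 3, which matches the Rankings column.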
[Figure: "Real Cycles and Execution Time": real cycles (axis 0 to 350000) and execution time (sec, axis 0 to 0.0035) for Models 1 to 4]
[Figure: "Real Cycles and Area": cycles (axis 0 to 350000) and area (slices, axis 0 to 10000) for Models 1 to 4]