Modular design and the importance of models of computation
Bart Kienhuis, Leiden University, LIACS Computer Systems Group
Based on the presentation given at the 37th DAC tutorial on Embedded System Design
2
New Applications
Stream-oriented applications
• Multimedia
• (Smart) imaging
• Bioinformatics
• Classical digital signal processing
Ferocious appetite for compute power
3
Smart Imaging
Camellia project
• Core for ambient and mobile intelligent imaging applications
• IST project (fr5)
• Detection of a pedestrian walking in front of a car
• Renault, Philips
[Figure: camera image with detected objects labeled Target 1 and Target 2]
Huge compute requirements
• Giga operations per second
• Real time
Embedded system
• Low cost
• Low power
4
Heterogeneous Architectures
Stream-based applications
• Autonomously operating components (task-level parallelism)
• Low-bandwidth communication between components
• Programmable interconnect
• Distributed memory
[Figure: heterogeneous architecture in which IP cores, RPUs, CPUs, microprocessors, and memories are connected by a programmable interconnect (NoC)]
Microprocessors (DSP, CPU)
Reconfigurable Units (FPGA)
Dedicated hardware (IP cores)
Distributed memory (memory banks)
5
Intrinsic Computational Efficiency (ICE)
[Figure: computational efficiency [MOPS/W] on a logarithmic axis from 10^0 to 10^6, plotted against feature size [µm] from 2 down to 0.07. Microprocessors shown: i386SX, i486DX, P5, 68040, microsparc, Supersparc, 601, 604, Ultrasparc, P6, 604e, 21164a, Turbosparc, 21364, 7400. A second curve gives the intrinsic computational efficiency of silicon; the region between the two curves is labeled "playing field".]
E. Roza, System-on-chip: what are the limits?, IEE Electronics & Communication Engineering Journal, vol. 13, no. 6, Dec. 2001, pp. 249-255.
6
Xilinx Virtex II Pro
• PowerPC based
• 420 Dhrystone MIPS at 300 MHz
• 1 to 4 PowerPCs
Virtex-4
• Already over a billion transistors
• 90 nm technology
[Figure: die photo with the reconfigurable logic and memory blocks and the PowerPCs marked]
Source: Xilinx
7
SpaceCake
Current development at Philips Research. The idea is to make a platform that is resilient to Moore’s Law.
Homogeneous tiles
If more transistors become available, more tiles can be combined, delivering more compute power.
[Figure: a tile with CPU0, CPU1, CPU2, an RPU, an IP core, and memory connected by a programmable interconnect]
Stravers, P. and Hoogerbrugge, J. 2001. Homogeneous multiprocessing and the future of silicon design paradigms. In Proceedings of the Int. Symposium on VLSI Technology, Systems, and Applications.
8
PicoArray
• Startup, England
• Processor with 430 16-bit RISC cores on a single die
• Projected markets: WCDMA / 802.11
• Homogeneous architecture
Source: www.picoarray.com
9
Different Views
Homogeneous architectures
• Linear scaling: more CPUs lead to an equal increase in available compute power
• Load balancing: each CPU works at the same workload level
Heterogeneous architectures
• Flexibility: match the computation to the correct component in terms of ICE; take advantage of heterogeneity
10
Software Efficiency
[Figure: productivity gap, 1981 to 2009, on logarithmic axes. Logic transistors per chip (K) grow at a 58% / yr. compound complexity growth rate; productivity in transistors per staff-month grows at a 21% / yr. compound productivity growth rate. The widening distance between the two lines is the productivity gap. Source: SEMATECH © Kreutzer]
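The widening of the gap follows from compounding the two growth rates quoted on the slide. The sketch below is only an illustration of that arithmetic; the function names are hypothetical and the rates are assumed to stay constant over the period.

```python
def growth_factor(annual_rate: float, years: int) -> float:
    """Compound growth after `years` at `annual_rate` (0.58 means 58% per year)."""
    return (1.0 + annual_rate) ** years


def gap_factor(years: int,
               complexity_rate: float = 0.58,
               productivity_rate: float = 0.21) -> float:
    """How much faster chip complexity has grown than design productivity."""
    return growth_factor(complexity_rate, years) / growth_factor(productivity_rate, years)


# Print the relative gap for a few horizons (rates taken from the slide).
for years in (1, 5, 10):
    print(years, "years:", round(gap_factor(years), 2))
```

Under these assumptions the gap grows by roughly 30% every year, which is why it widens so quickly on a logarithmic plot.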
11
Problem
“We know how to build billion transistor ICs, but we do not know how to program them”
Current compiler technology is not capable of handling the heterogeneity of these architectures.
Why is that, and how can it be solved?
12
Y-chart Approach
Three different ways to improve the performance of a system:
• Suggest architectural improvements
• Rewrite the applications
• Use different mapping strategies
[Figure: Y-chart. A set of applications and an architecture instance are combined in a mapping step; performance analysis of the mapped system yields performance numbers, which feed back to the architecture, the applications, and the mapping.]
Kienhuis, B., Deprettere, E., van der Wolf, P., and Vissers, K. 2002. A Methodology to Design Programmable Embedded Systems. LNCS, vol. 2268. Springer Verlag, pages 18-37.
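The Y-chart loop can be illustrated with a toy exploration that varies only the mapping while the application and the architecture stay fixed. Everything in this sketch is an assumption made for illustration: the cost model (work divided by processing-element speed, with the busiest element bounding the run time), the function names, and the task and element names used in the usage note.

```python
from itertools import product


def estimate_time(task_costs, mapping, speeds):
    """Toy performance analysis: the most heavily loaded element bounds the run time."""
    load = {}
    for task, element in mapping.items():
        load[element] = load.get(element, 0.0) + task_costs[task] / speeds[element]
    return max(load.values())


def explore_mappings(task_costs, speeds):
    """Try every assignment of tasks to processing elements and keep the fastest."""
    tasks = list(task_costs)
    best = None
    for assignment in product(speeds, repeat=len(tasks)):
        mapping = dict(zip(tasks, assignment))
        time = estimate_time(task_costs, mapping, speeds)
        if best is None or time < best[0]:
            best = (time, mapping)
    return best
```

For example, `explore_mappings({"VLD": 4, "IDCT": 8}, {"cpu": 1, "coproc": 4})` evaluates all four mappings of two tasks onto two elements. Changing `speeds` would correspond to suggesting an architectural improvement, and changing `task_costs` to rewriting the application.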
13
Mapping
[Figure: Y-chart in which the application is MPEG decoding (coded video, Demux, VLD, Q-1, IDCT, adder, Reorder, decoded video, with a motion buffer and the side signals ordering, quantization control, and motion vectors & mode) and the architecture instance is the heterogeneous architecture with a programmable interconnect (NoC). The step that connects them is labeled MAPPING.]
14
Mapping
Architecture:
• Resources
  • ALUs, CORDICs, PEs
  • Registers, SRAM, DRAM
  • Busses, switches
• Communication
  • Bits, signals
Application:
• Computations
  • IDCT, SQRT, quantizer
• Communication
  • Pixels, blocks
Both describe a network of components that perform a particular function and that communicate in a particular way.
[Figure: an architecture (CPU and two coprocessors on a bus) next to the MPEG decoding application network]
15
Mapping
[Figure: the architecture (CPU and two coprocessors on a bus) and the application (MPEG decoding network), related by a mapping]
Can we formalize the description of these networks?
“Models of Architecture” and “Models of Computation”
16
Model of Computation
[Figure: network of four functional blocks A, B, C, and D]
A Model of Computation is a formal representation of the operational semantics of networks of functional blocks describing the computations.
17
Model of Computation Terminology
• Actor: describes the functionality
• Relation: the actors communicate with each other using relations
• Token: a quantum of information that is exchanged; it represents a signal
• Firing: a quantum of computation; the moment of interaction with other actors
• Port: active or passive
fire { … token = get(); … send(token); … }
[Figure: actors A, B, C, and D connected through ports and relations, with a token traveling over a relation]
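The terminology can be made concrete with a small sketch. The class names `Relation`, `Port`, and `Scale` are hypothetical and chosen for illustration; the relation is assumed to be a simple FIFO, which is only one of the possible communication semantics.

```python
from collections import deque


class Relation:
    """Connects ports of two actors; here assumed to be a FIFO of tokens."""

    def __init__(self):
        self._tokens = deque()

    def put(self, token):
        self._tokens.append(token)

    def take(self):
        return self._tokens.popleft()

    def __len__(self):
        return len(self._tokens)


class Port:
    """An actor's access point to a relation."""

    def __init__(self, relation):
        self.relation = relation

    def get(self):
        return self.relation.take()

    def send(self, token):
        self.relation.put(token)


class Scale:
    """Actor: each firing consumes one token and produces one scaled token."""

    def __init__(self, in_port, out_port, factor):
        self.in_port = in_port
        self.out_port = out_port
        self.factor = factor

    def fire(self):
        token = self.in_port.get()                 # interaction with the upstream actor
        self.out_port.send(token * self.factor)    # interaction with the downstream actor
```

A firing is then a single call such as `Scale(Port(r_in), Port(r_out), 2).fire()` after a token has been placed on `r_in`.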
18
Active/Passive Actors
Two kinds of actors:
Passive actor:
• Scheduler needed
• Schedule ABBCD
• A firing needs to terminate
• Fire-and-exit behavior
fire { token = get(); … send(token); … }
Active actor:
• Schedules itself
• A firing typically doesn’t terminate
• Endless while loop
• Process behavior
fire { while(1) { token = get(); send(token); } }
[Figure: network of actors A, B, C, and D; the passive firing is marked with an exit]
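The difference between the two kinds of actors can be sketched as follows. This is an illustration under assumptions: the actor simply increments each token, `STOP` is a hypothetical end-of-stream marker added so that the demo terminates, and the active actor is modeled as a thread.

```python
import queue
import threading

STOP = object()  # hypothetical end-of-stream marker, not part of the model itself


def passive_fire(inbox, outbox):
    """Passive actor: one firing, then control returns to the scheduler."""
    token = inbox.get()
    outbox.put(token + 1)


def active_actor(inbox, outbox):
    """Active actor: schedules itself by looping on a blocking get()."""
    while True:
        token = inbox.get()
        if token is STOP:
            outbox.put(STOP)
            return
        outbox.put(token + 1)


def run_passive(tokens):
    inbox, outbox = queue.Queue(), queue.Queue()
    for token in tokens:
        inbox.put(token)
    for _ in tokens:                 # the scheduler decides when each firing happens
        passive_fire(inbox, outbox)
    return [outbox.get() for _ in tokens]


def run_active(tokens):
    inbox, outbox = queue.Queue(), queue.Queue()
    worker = threading.Thread(target=active_actor, args=(inbox, outbox))
    worker.start()                   # from here on the actor runs on its own
    for token in tokens:
        inbox.put(token)
    inbox.put(STOP)
    results = []
    while True:
        token = outbox.get()
        if token is STOP:
            break
        results.append(token)
    worker.join()
    return results
```

Both variants compute the same stream; what differs is who decides when the computation happens.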
19
Communication Between Actors
Data type of the token
• Integer, Double, Complex
• Matrix, Vector
• Record
Way the exchange takes place (communication semantics)
• Buffered
• Timed
• Synchronized
[Figure: Actor 1 (fire { … send(); … }) passes a token through its port to the port of Actor 2 (fire { … get(); … })]
20
Different Semantics
• Analog computers (ODEs)
• Discrete time (difference equations)
• Discrete-event systems (DE)
• Process networks (Kahn)
• Sequential processes with rendezvous (CSP)
• Dataflow (Dennis)
• Synchronous-reactive systems (SR)
• Codesign finite state machines (CFSM)
[Figure: notions of time illustrated as continuous time, discrete time, discrete events, partially ordered events (E1 to E6), and synchronous/reactive]
21
Synchronous/reactive Models (SR)
Network of concurrently executing actors
• Passive actors
• Communication is unbuffered
• Computation and communication are instantaneous
• A model progresses as a sequence of “ticks”
• At a tick, the signals are defined by a fixed point equation:

$$\begin{bmatrix} x \\ y \\ z \end{bmatrix} = \begin{bmatrix} f_A(1) \\ f_B(z) \\ f_C(x, y) \end{bmatrix}$$

Characteristics of SR models
• Tightly synchronized
• Stable state points
• Control-intensive systems
[Figure: actors A, B, C, and D exchanging the signals x, y, and z; a token passes from a sending actor (fire { … send(); … }) through ports to a receiving actor (fire { … get(); … })]
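One way to read the fixed point equation of this slide is x = f_A(1), y = f_B(z), z = f_C(x, y): within a tick, all signals start as unknown and the actor functions are re-evaluated until nothing changes. The sketch below assumes that reading; the Boolean actor functions (`f_a` as NOT, `f_b` as NOT, `f_c` as an AND that can decide on one known-false input) are hypothetical choices made so that the feedback loop between y and z resolves.

```python
UNKNOWN = None  # a signal that has not been determined yet in this tick


def f_a(constant):
    return not constant


def f_b(z):
    return UNKNOWN if z is UNKNOWN else (not z)


def f_c(x, y):
    """AND that already answers False when either input is known to be False."""
    if x is False or y is False:
        return False
    if x is UNKNOWN or y is UNKNOWN:
        return UNKNOWN
    return True


def tick(max_iterations=10):
    """Iterate (x, y, z) = (f_A(1), f_B(z), f_C(x, y)) from all-unknown to a fixed point."""
    x = y = z = UNKNOWN
    for _ in range(max_iterations):
        new_signals = (f_a(True), f_b(z), f_c(x, y))
        if new_signals == (x, y, z):
            return new_signals
        x, y, z = new_signals
    raise RuntimeError("signals did not stabilize within the iteration limit")
```

If `f_c` needed both inputs before answering, y and z would stay unknown forever, which is the kind of causality problem an SR compiler has to detect.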
22
Process Network (PN)
Network of concurrently executing processes
• Active actors
• Communicate over unbounded FIFOs
• A process is either performing some operation, a blocking read, or a non-blocking write
Characteristics of process networks
• Deterministic execution
• Doesn’t impose a particular schedule
• (Dynamic) dataflow
[Figure: processes A, B, C, and D connected by stream channels; a token passes from a sending process (fire { … send(); … }) through ports to a receiving process (fire { … get(); … })]
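A process network in this style can be approximated with threads and unbounded queues. The sketch below is an assumption-laden illustration: the three processes, the squaring operation, and the `END` marker (added so that the demo terminates) are all hypothetical. `queue.Queue()` without a size limit gives a non-blocking `put` and a blocking `get`, which matches the read and write behavior listed above.

```python
import queue
import threading

END = object()  # hypothetical end-of-stream marker so the threads can finish


def source(out_channel, values):
    for value in values:
        out_channel.put(value)        # write never blocks: the FIFO is unbounded
    out_channel.put(END)


def square(in_channel, out_channel):
    while True:
        token = in_channel.get()      # blocking read: wait until a token arrives
        if token is END:
            out_channel.put(END)
            return
        out_channel.put(token * token)


def sink(in_channel, collected):
    while True:
        token = in_channel.get()
        if token is END:
            return
        collected.append(token)


def run_network(values):
    first, second = queue.Queue(), queue.Queue()
    collected = []
    processes = [
        threading.Thread(target=source, args=(first, values)),
        threading.Thread(target=square, args=(first, second)),
        threading.Thread(target=sink, args=(second, collected)),
    ]
    for process in processes:
        process.start()
    for process in processes:
        process.join()
    return collected
```

However the operating system interleaves the three threads, the sink receives the same sequence, which is the deterministic execution named on the slide.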
23
Synchronous Dataflow (SDF)
Network of concurrently executing actors
• Passive actors
• Communication is buffered
• A model progresses as a sequence of “iterations”
• A “firing rule” determines the firing condition of an actor
• At each firing, a fixed number of tokens is consumed and produced
Characteristics of SDF
• Compile-time analyzable: memory, schedule, speed
• Static dataflow
Schedule: ABBBC
[Figure: actors A, B, C, and D with fixed token rates of 1 and 3 annotated on their ports; tokens pass from a sending actor (fire { … send(); … }) through ports to a receiving actor (fire { … get(); … })]
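Because the token rates are fixed, a schedule can be derived at compile time by solving the balance equations (tokens produced on each edge per iteration equal tokens consumed). The sketch below assumes a three-actor chain in which A produces 3 tokens per firing, B consumes and produces 1, and C consumes 3; these rates are chosen to reproduce a schedule like the one on the slide and are not read off the original figure. The function names are hypothetical.

```python
from fractions import Fraction
from functools import reduce
from math import gcd

# (producer, tokens produced per firing, consumer, tokens consumed per firing)
EDGES = [("A", 3, "B", 1), ("B", 1, "C", 3)]


def repetition_vector(edges):
    """Smallest firing counts q with produced * q[src] == consumed * q[dst] on every edge."""
    rates = {edges[0][0]: Fraction(1)}
    changed = True
    while changed:                       # propagate rates through the connected graph
        changed = False
        for src, produced, dst, consumed in edges:
            if src in rates and dst not in rates:
                rates[dst] = rates[src] * produced / consumed
                changed = True
            elif dst in rates and src not in rates:
                rates[src] = rates[dst] * consumed / produced
                changed = True
    for src, produced, dst, consumed in edges:
        if rates[src] * produced != rates[dst] * consumed:
            raise ValueError("inconsistent rates: no periodic schedule exists")
    scale = reduce(lambda a, b: a * b // gcd(a, b),
                   (rate.denominator for rate in rates.values()), 1)
    return {actor: int(rate * scale) for actor, rate in rates.items()}


def build_schedule(edges, repetitions):
    """Simulate token counts and fire any actor whose inputs hold enough tokens."""
    tokens = {(src, dst): 0 for src, _, dst, _ in edges}
    remaining = dict(repetitions)
    order = []
    while any(remaining.values()):
        for actor in repetitions:
            ready = remaining[actor] > 0 and all(
                tokens[(src, dst)] >= consumed
                for src, _, dst, consumed in edges if dst == actor)
            if ready:
                for src, produced, dst, consumed in edges:
                    if dst == actor:
                        tokens[(src, dst)] -= consumed
                    if src == actor:
                        tokens[(src, dst)] += produced
                remaining[actor] -= 1
                order.append(actor)
                break
        else:
            raise RuntimeError("deadlock: no actor can fire")
    return "".join(order)
```

The same token simulation also bounds the buffer sizes, which is why memory, schedule, and speed can all be analyzed before the system runs.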
24
Codesign Finite State Machine (CFSM)
Network of concurrently executing actors
• Passive actors
• Synchronous locally
• Asynchronous globally
• An “event” causes the evaluation (firing) of an FSM
Characteristics of CFSM
• Compile-time analyzable
• Reactive systems
[Figure: actors A, B, C, and D, each an FSM; a token (timed event) passes from the port of one FSM to the port of another]
25
Finite State Machine (FSM)
• More efficient way to describe sequential control
• Formal semantics, which allow various properties such as safety, liveness, and fairness to be verified
• An FSM may have only one state active at a time
• An FSM has only a finite number of states
[Figure: seat belt alarm FSM]
States: OFF, WAIT, ALARM
Ports: Port_BELT, Port_KEY, Port_END, Port_START, Port_ALARM
Transitions (condition => action):
• KEY=ON => START
• KEY=OFF or BELT=ON => ALARM=OFF
• END=5 => ALARM=ON
• END=10 or BELT=ON or KEY=OFF => ALARM=OFF
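The seat belt alarm can be written down as a transition function. The transcript lists the states (OFF, WAIT, ALARM) and the transition labels but not which states each transition connects, so the source and target states below are an assumption based on the usual reading of this example; the event strings and function names are hypothetical.

```python
def step(state, event):
    """Return (next_state, action) for one event; action is None if nothing is emitted."""
    if state == "OFF" and event == "KEY=ON":
        return "WAIT", "START"                       # start the timer
    if state == "WAIT" and event in ("KEY=OFF", "BELT=ON"):
        return "OFF", "ALARM=OFF"
    if state == "WAIT" and event == "END=5":
        return "ALARM", "ALARM=ON"                   # belt still off after 5 time units
    if state == "ALARM" and event in ("END=10", "BELT=ON", "KEY=OFF"):
        return "OFF", "ALARM=OFF"
    return state, None                               # event not relevant in this state


def run(events, state="OFF"):
    """Feed a sequence of events to the FSM; exactly one state is active at any time."""
    actions = []
    for event in events:
        state, action = step(state, event)
        if action is not None:
            actions.append(action)
    return state, actions
```

For instance, `run(["KEY=ON", "END=5", "BELT=ON"])` models a driver who fastens the belt only after the alarm has started.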
26
Model of Architecture
A Model of Architecture is a formal representation of the operational semantics of networks of functional blocks describing architectures.
A Model of Architecture is similar to a Model of Computation, but the focus is on the architecture instead of on the applications.
A, B, C, and D are now hardware resources like CPUs, busses, memory, and dedicated coprocessors.
[Figure: network of four functional blocks A, B, C, and D]
27
Examples
Control-dominated tasks
• Sequential
Control/data tasks
• Sequential
• Centralized computation
• Mutually exclusive [1]
Data-dominated tasks
• Parallel / DMA
• Data flow
• Distributed computation
Less mature than MoC
[Figure: example architectures placed along a complexity axis from low to high: a CPU with bus and memory, and a CPU with processing elements PE1, PE2, and PE3 and memory connected by a programmable communication network]
[1] Wolf, Wayne. A Decade of Hardware/Software Codesign. IEEE Computer, Volume 36, April 2003, pages 38-43.
28
Conclusion: Matching Models
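[Figure: the application with its Model of Computation on one side and the architecture with its Model of Architecture on the other, linked through the data type; the match between the two is labeled “natural fit”]
When the MoC and MoA match, a simple mapping results.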
29
Putting It Together Example 1.
Platform: microprocessor (“von Neumann architecture”)
[Figure: Y-chart for this platform]
• Architecture instances: microprocessors (Pentium / Arm / MIPS / Alpha), given as a machine description
• Mapping: compiler (GCC)
• Applications: SPECInt benchmarks
• Result: performance numbers
30
Putting It Together Example 1.
Application (Picture in Picture):
    for i=1:1:10,
      for j=1:1:10,
        A(i,j) = FIR();
      end
    end
    for i=1:1:10,
      for j=1:1:10,
        A(i,j) = SRC( A(i,j) );
      end
    end
[Figure: microprocessor with program counter, memory (address), ALU, and instruction decoder; Y-chart with the Picture in Picture application, the microprocessor, a compiler, a simulator, and performance numbers]
Model of Architecture:
• Sequential (program counter)
• One item over the bus at a time
• Shared memory
Model of Computation:
• Sequential
• Shared memory
Natural fit
31
Putting It Together Example 2.
[Figure: Y-chart (applications, architecture instance, mapping, performance analysis, performance numbers) with the heterogeneous architecture with a programmable interconnect (NoC) as the architecture instance]
Matlab Code (QR Algorithm):
    %parameter N 8 16;
    %parameter K 100 1000;

    for k = 1:1:K,
      for j = 1:1:N,
        [ r(j,j), x(k,j), t ] = Vectorize( r(j,j), x(k,j) );
        for i = j+1:1:N,
          [ r(j,i), x(k,i), t ] = Rotate( r(j,i), x(k,i), t );
        end
      end
    end
Model of Architecture:
• Task-level parallelism
• Heterogeneity
• Distributed memory
Model of Computation:
• Imperative
• Sequential execution
• Global memory
No natural fit
32
Putting It Together Example 2.
[Figure: Y-chart (applications, architecture instance, mapping, performance analysis, performance numbers) with the heterogeneous architecture with a programmable interconnect (NoC) as the architecture instance]
Model of Architecture:
• Task-level parallelism
• Heterogeneity
• Distributed memory
Model of Computation:
• Process networks
• Distributed memory
• Distributed control
[Figure: process network with the nodes Source, P1, P2, S1, P3, P4, and Sink]
Natural fit
33
Our Research Focus….
Application: Matlab Code (QR Algorithm)
    %parameter N 8 16;
    %parameter K 100 1000;

    for k = 1:1:K,
      for j = 1:1:N,
        [ r(j,j), x(k,j), t ] = Vectorize( r(j,j), x(k,j) );
        for i = j+1:1:N,
          [ r(j,i), x(k,i), t ] = Rotate( r(j,i), x(k,i), t );
        end
      end
    end
Compaan: from the Matlab application to a process network
[Figure: process network with the nodes Source, P1, P2, S1, P3, P4, and Sink]
Laura / ESPAM: from the process network to the programming of the architecture
[Figure: heterogeneous architecture in which IP cores, RPUs, CPUs, microprocessors, and memories are connected by a programmable interconnect (NoC)]
34
Other Examples
Other examples of the “natural fit” concept:
• VSP architecture, CSDF
• Tangram, CSP
• Polis system, CFSM
35
Conclusions
• We will get billion-transistor ICs
• We already know how to build them, but not how to program them
• Current compilers do not take into account the notion of “natural fit” in terms of models of computation and models of architecture
• New compiler research is needed that takes into account different models of computation