Modular design and the importance of models of computation

Bart Kienhuis, Leiden University, LIACS, Computer Systems Group

Based on the presentation given at the 37th DAC tutorial on Embedded System Design

New Applications

Stream-oriented applications:
• Multi-media
• (Smart) imaging
• Bioinformatics
• Classical digital signal processing

Ferocious appetite for compute power

Smart Imaging

Camellia project:
• Core for ambient and mobile intelligent imaging applications
• IST project (fr5)
• Detection of a pedestrian walking in front of a car
• Renault, Philips

[Figure: detection example with labels Target 1 and Target 2.]

Huge compute requirements:
• Giga operations per second
• Real time

Embedded system:
• Low cost
• Low power

Heterogeneous Architectures

• Stream-based applications
• Autonomous operating components (task-level parallelism)
• Low-bandwidth communication between components
• Programmable interconnect
• Distributed memory

[Figure: heterogeneous platform. A programmable interconnect (NoC) connects microprocessors (DSP, CPU) with their memories, reconfigurable units (RPU; FPGA), dedicated hardware (IP cores), and distributed memory banks.]

Intrinsic Computational Efficiency (ICE)

[Figure: computational efficiency [MOPS/W] on a logarithmic axis from 10^0 to 10^6 versus feature size [µm] (2, 1, 0.5, 0.25, 0.13, 0.07). Curves: intrinsic computational efficiency of silicon, and microprocessors (i386SX, i486DX, P5, 68040, microsparc, Supersparc, 601, 604, Ultrasparc, P6, 604e, 21164a, Turbosparc, 21364, 7400). An area labeled “Playing field” is also marked.]

E. Roza, System-on-chip: what are the limits?, IEE Electronics & Communication Engineering Journal, vol. 13, no. 6, Dec. 2001, pp. 249-255.

Xilinx Virtex II Pro

PowerPC based:
• 420 Dhrystone MIPS at 300 MHz
• 1 to 4 PowerPCs

Virtex-4:
• Already over a billion transistors
• 90 nm technology

[Figure labels: reconfigurable logic and memory blocks; PowerPCs. Source: Xilinx]

SpaceCake

Current development at Philips Research. The idea is to make a platform that is resilient to Moore’s Law.

Homogeneous tiles:
If more transistors become available, more tiles can be combined, delivering more compute power.

[Figure: a tile with CPU0, CPU1, CPU2, an RPU, an IP core, and memory on a programmable interconnect.]

Stravers, P. and Hoogerbrugge, J. 2001. Homogeneous multiprocessing and the future of silicon design paradigms. In Proceedings of the Int. Symposium on VLSI Technology, Systems, and Applications.

PicoArray

• Startup, England
• Processor with 430 16-bit RISC cores on a single die
• Projected markets: WCDMA / 802.11
• Homogeneous architecture

Source: www.picoarray.com

Different Views

Homogeneous architectures:
• Linear scaling: more CPUs lead to an equal increase in available compute power
• Load balancing: each CPU is working at the same workload level

Heterogeneous architectures:
• Flexibility
• Match the computation to the correct component in terms of ICE; take advantage of heterogeneity

Software Efficiency

[Figure: logic transistors per chip (K) and productivity (transistors per staff-month), both on logarithmic axes, for 1981 to 2009. Logic transistors per chip: 58% per year compound complexity growth rate. Transistors per staff-month: 21% per year compound productivity growth rate. The difference between the two curves is labeled “Productivity gap”. Source: SEMATECH © Kreutzer]

Problem

“We know how to build billion transistor ICs, but we do not know how to program them”

Current compiler technology cannot handle the heterogeneity of these architectures.

Why is that, and how can it be solved?

Y-chart Approach

Three different ways to improve the performance of a system:
• Suggest architectural improvements
• Rewrite the applications
• Use different mapping strategies

[Figure: Y-chart. An architecture instance and a set of applications are brought together by a mapping; performance analysis then yields performance numbers, which motivate the three kinds of improvement listed above.]

Kienhuis, B., Deprettere, E., Van der Wolf, P., and Vissers, K. 2002. A Methodology to Design Programmable Embedded Systems. LNCS, vol. 2268. Springer Verlag, pages 18-37.
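The Y-chart loop can be sketched as a small exploration routine: try combinations of architecture instance and mapping, run a performance analysis on each, and compare the resulting numbers. Everything in this sketch is an illustrative assumption rather than part of the cited methodology: the task names, the operation counts, the resource speeds, and the toy cost model (time of the most heavily loaded resource).

```python
from itertools import product

def analyze(architecture, mapping, workload):
    """Toy performance analysis: time needed by the most heavily loaded resource."""
    load = {name: 0.0 for name in architecture}
    for task, operations in workload.items():
        resource = mapping[task]
        load[resource] += operations / architecture[resource]  # ops / (ops per second)
    return max(load.values())

def explore(architectures, mappings, workload):
    """Evaluate every valid (architecture, mapping) pair and return the fastest."""
    results = []
    for (a_name, arch), (m_name, mapping) in product(architectures.items(),
                                                     mappings.items()):
        # A mapping is only valid if every resource it uses exists in the architecture.
        if all(resource in arch for resource in mapping.values()):
            results.append((analyze(arch, mapping, workload), a_name, m_name))
    return min(results)

# Hypothetical application: two tasks with their operation counts.
workload = {"vld": 2e6, "idct": 8e6}
# Hypothetical architecture instances: resource name -> operations per second.
architectures = {
    "cpu_only": {"cpu": 1e6},
    "cpu_plus_ip": {"cpu": 1e6, "ip": 8e6},
}
# Hypothetical mapping strategies: task -> resource.
mappings = {
    "all_on_cpu": {"vld": "cpu", "idct": "cpu"},
    "idct_on_ip": {"vld": "cpu", "idct": "ip"},
}

best = explore(architectures, mappings, workload)
print(best)
```

In this toy setup, changing `architectures` corresponds to suggesting architectural improvements, changing `workload` to rewriting the application, and changing `mappings` to trying a different mapping strategy.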

Mapping

[Figure: Y-chart (applications, mapping, performance analysis, performance numbers) in which the MPEG decoding application is mapped onto the heterogeneous architecture. Application blocks: coded video, Demux, VLD, Q-1, IDCT, +, Motion, Buffer, Reorder, decoded video. Side signals: ordering, quantization control, motion vectors & mode.]

Mapping

Architecture:
• Resources
  • ALUs, CORDICs, PEs
  • Registers, SRAM, DRAM
  • Busses, switches
• Communication
  • Bits, signals

Application:
• Computations
  • IDCT, SQRT, Quantizer
• Communication
  • Pixels, blocks

Both describe a network of components that perform a particular function and that communicate in a particular way.

[Figure: an architecture (a CPU and coprocessors on a bus) next to the MPEG decoding application graph.]

Mapping

[Figure: the architecture (a CPU and coprocessors on a bus) and the MPEG decoding application, connected by a mapping.]

Can we formalize the description of these networks? “Models of Architecture” and “Models of Computation”

Model of Computation

A model of computation is a formal representation of the operational semantics of networks of functional blocks describing the computations.

[Figure: network of four functional blocks A, B, C, and D.]

Model of Computation: Terminology

• Actor: describes the functionality.
• Relation: the actors can communicate with each other using relations.
• Token: the exchange of a quantum of information. It represents a signal.
• Firing: a quantum of computation; the moment of interaction with other actors.

fire { … token = get(); … send(token); … }

[Figure: actors A, B, C, D (active/passive) with ports, connected by relations that carry tokens.]
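A minimal sketch of this terminology in Python, assuming a relation is simply a FIFO of tokens. The class names mirror the slide's terms, but the classes and the doubling actor are illustrative assumptions, not an existing framework.

```python
from collections import deque

class Relation:
    """Connects two ports and holds the tokens in transit."""
    def __init__(self):
        self.tokens = deque()

class Port:
    """An actor's access point to a relation."""
    def __init__(self, relation):
        self.relation = relation

    def send(self, token):
        self.relation.tokens.append(token)

    def get(self):
        return self.relation.tokens.popleft()

class Actor:
    """Wraps a function; one call to fire() is one quantum of computation."""
    def __init__(self, function, in_port, out_port):
        self.function = function
        self.in_port = in_port
        self.out_port = out_port

    def fire(self):
        token = self.in_port.get()                # consume one token
        self.out_port.send(self.function(token))  # produce one token

# Usage: a single actor that doubles each incoming token.
r_in, r_out = Relation(), Relation()
doubler = Actor(lambda value: 2 * value, Port(r_in), Port(r_out))
r_in.tokens.append(21)
doubler.fire()
result = list(r_out.tokens)
print(result)
```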

Active/Passive Actors

Two kinds of actors:

Passive actor:
• Scheduler needed
• Schedule: ABBCD
• A firing needs to terminate
• Fire-and-exit behavior

fire { token = get(); … send(token); … }

Active actor:
• Schedules itself
• A firing typically doesn’t terminate
• Endless while loop
• Process behavior

fire { while(1) { token = get(); send(token); } }
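The difference between the two kinds of actors can be sketched in Python, under the assumption that a passive actor is a plain function invoked by a scheduler and an active actor is a thread running its own endless loop. The increment operation and all names are hypothetical.

```python
import queue
import threading

def passive_fire(inbox, outbox):
    """Passive actor: performs one firing and exits back to the scheduler."""
    outbox.append(inbox.pop(0) + 1)

def active_actor(inbox, outbox):
    """Active actor: schedules itself in an endless loop (process behavior)."""
    while True:
        outbox.put(inbox.get() + 1)  # get() blocks until a token arrives

# Passive: an external scheduler decides when each firing happens.
src, dst = [1, 2, 3], []
for _ in range(3):  # this loop plays the role of the scheduler
    passive_fire(src, dst)

# Active: the actor runs as its own thread; nobody calls it per token.
q_in, q_out = queue.Queue(), queue.Queue()
threading.Thread(target=active_actor, args=(q_in, q_out), daemon=True).start()
for token in [1, 2, 3]:
    q_in.put(token)
active_result = [q_out.get(timeout=5) for _ in range(3)]

print(dst, active_result)
```

The active actor's thread is marked as a daemon only so that this demonstration program can exit; the actor itself never returns.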

Communication Between Actors

[Figure: Actor 1 executes fire { … send(); … } and passes a token through its port to the port of Actor 2, which executes fire { … get(); … }.]

Data type of the token:
• Integer, Double, Complex
• Matrix, Vector
• Record

Way the exchange takes place (communication semantics):
• Buffered
• Timed
• Synchronized

Different Semantics

• Analog computers (ODEs)
• Discrete time (difference equations)
• Discrete-event systems (DE)
• Process networks (Kahn)
• Sequential processes with rendezvous (CSP)
• Dataflow (Dennis)
• Synchronous-reactive systems (SR)
• Codesign finite state machines (CFSM)

[Figure: timelines illustrating continuous time, discrete time, discrete events, partially-ordered events (E1 to E6), and synchronous/reactive.]

Synchronous/Reactive Models (SR)

Network of concurrently executing actors:
• Passive actors
• Communication is unbuffered
• Computation and communication are instantaneous
• A model progresses as a sequence of “ticks.”
• At a tick, the signals are defined by a fixed point equation:

\[
\begin{aligned}
x &= f_A(1) \\
y &= f_B(z) \\
z &= f_C(x, y)
\end{aligned}
\]

Characteristics of SR models:
• Tightly synchronized
• Stable state points
• Control-intensive systems

[Figure: actors A, B, C, D connected through ports, with signals x, y, and z.]
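One way to see what "defined by a fixed point equation" means is to solve the equations for a single tick by iteration: start with every signal unknown and re-evaluate the actors until nothing changes. The sketch below assumes the equations read x = f_A(1), y = f_B(z), z = f_C(x, y). The three actor functions are hypothetical and were chosen so that the cycle between y and z can be resolved, because `f_c` does not need `y` when `x` is positive.

```python
BOTTOM = None  # "unknown": the value of every signal at the start of a tick

def f_a(c):
    """Actor A: passes its constant input through."""
    return c

def f_b(z):
    """Actor B: strict, so it needs z before it can produce y."""
    return BOTTOM if z is BOTTOM else z + 1

def f_c(x, y):
    """Actor C: non-strict in y, which breaks the y/z dependency cycle."""
    if x is BOTTOM:
        return BOTTOM
    if x > 0:
        return x  # y is not needed in this case
    return BOTTOM if y is BOTTOM else y

def tick():
    """Iterate x = f_A(1), y = f_B(z), z = f_C(x, y) until the signals are stable."""
    x = y = z = BOTTOM
    while True:
        new = (f_a(1), f_b(z), f_c(x, y))
        if new == (x, y, z):
            return new  # fixed point reached
        x, y, z = new

print(tick())
```

If every actor were strict, the cycle through y and z would leave both signals unknown; the iteration would still terminate, with `BOTTOM` as their value.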


Process Network (PN)

Network of concurrently executing processes:
• Active actors
• Communicate over unbounded FIFOs
• A process performs some operation, a blocking read, or a non-blocking write

Characteristics of process networks:
• Deterministic execution
• Doesn’t impose a particular schedule
• (Dynamic) dataflow

[Figure: processes A, B, C, D connected by stream channels.]
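A small process network can be sketched with Python threads and `queue.Queue`, whose default is an unbounded FIFO: `get()` blocks until a token is available, and `put()` on an unbounded queue never blocks. The three processes and the `DONE` end-of-stream marker are assumptions added so that the demonstration terminates; a process network as such has no notion of termination.

```python
import queue
import threading

DONE = object()  # end-of-stream marker, added only so the demo terminates

def source(out, n):
    for i in range(n):
        out.put(i)  # non-blocking write on an unbounded FIFO
    out.put(DONE)

def scale(inp, out, factor):
    while True:
        token = inp.get()  # blocking read
        if token is DONE:
            out.put(DONE)
            return
        out.put(token * factor)

def sink(inp, result):
    while True:
        token = inp.get()  # blocking read
        if token is DONE:
            return
        result.append(token)

def run_network(n=5, factor=3):
    a, b = queue.Queue(), queue.Queue()  # unbounded FIFO channels
    result = []
    threads = [
        threading.Thread(target=source, args=(a, n)),
        threading.Thread(target=scale, args=(a, b, factor)),
        threading.Thread(target=sink, args=(b, result)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result

print(run_network())
```

The result is the same regardless of how the operating system interleaves the three threads, which illustrates the deterministic execution listed above: no schedule is imposed, yet the output stream is fixed.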


Synchronous Dataflow (SDF)

Network of concurrently executing actors:
• Passive actors
• Communication is buffered

A model progresses as a sequence of “iterations.”

A “firing rule” determines the firing condition of an actor.

At each firing, a fixed number of tokens is consumed and produced.

Characteristics of SDF:
• Compile-time analyzable: memory, schedule, speed
• Static dataflow

[Figure: actors A, B, C, D exchanging tokens through ports; the edges are annotated with production and consumption rates of 1 and 3. Schedule: ABBBC]
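Because the token rates are fixed, a schedule can be derived before the program runs. The sketch below solves the balance equations (firings of the producer times tokens produced equals firings of the consumer times tokens consumed) and then simulates one iteration to obtain a firing order. The three-actor chain and its rates are an assumption chosen to reproduce the schedule ABBBC; the slide's figure does not make the exact edge annotations recoverable, and the code does not check a graph for rate consistency.

```python
from fractions import Fraction
from math import lcm

def repetition_vector(actors, edges):
    """Smallest integers q with q[src] * produced == q[dst] * consumed per edge."""
    q = {actors[0]: Fraction(1)}
    changed = True
    while changed:  # propagate firing ratios across the connected graph
        changed = False
        for src, produced, dst, consumed in edges:
            if src in q and dst not in q:
                q[dst] = q[src] * produced / consumed
                changed = True
            elif dst in q and src not in q:
                q[src] = q[dst] * consumed / produced
                changed = True
    scale = lcm(*(f.denominator for f in q.values()))
    return {actor: int(f * scale) for actor, f in q.items()}

def schedule(actors, edges):
    """Simulate one iteration, firing the first actor whose firing rule holds."""
    remaining = repetition_vector(actors, edges)
    tokens = [0] * len(edges)  # tokens currently buffered on each edge
    order = []
    while any(remaining.values()):
        for actor in actors:
            inputs = [i for i, e in enumerate(edges) if e[2] == actor]
            if remaining[actor] and all(tokens[i] >= edges[i][3] for i in inputs):
                for i in inputs:
                    tokens[i] -= edges[i][3]  # consume a fixed number of tokens
                for i, e in enumerate(edges):
                    if e[0] == actor:
                        tokens[i] += e[1]     # produce a fixed number of tokens
                remaining[actor] -= 1
                order.append(actor)
                break
        else:
            raise ValueError("deadlock: no actor can fire")
    return "".join(order)

# Hypothetical graph: (source, tokens produced, destination, tokens consumed).
actors = ["A", "B", "C"]
edges = [("A", 3, "B", 1), ("B", 1, "C", 3)]
print(repetition_vector(actors, edges), schedule(actors, edges))
```

The same simulation also bounds the buffer sizes: the largest value each entry of `tokens` reaches during the iteration is the memory that edge needs under this schedule.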

Codesign Finite State Machine (CFSM)

Network of concurrently executing actors:
• Passive actors
• Synchronous locally
• Asynchronous globally

An “event” causes the evaluation (firing) of an FSM.

Characteristics of CFSM:
• Compile-time analyzable
• Reactive systems

[Figure: actors A, B, C, D; each actor is an FSM with ports, and the tokens exchanged between FSMs are timed events.]

Finite State Machine (FSM)

• More efficient way to describe sequential control.
• Formal semantics, which allows for verifying various properties like safety, liveness, and fairness.
• An FSM may have only one state active at a time.
• An FSM has only a finite number of states.

[Figure: seat-belt alarm FSM.
States: OFF, WAIT, ALARM.
Ports: Port_KEY, Port_BELT, Port_END, Port_START, Port_ALARM.
Transition labels:
• KEY=ON => START
• KEY=OFF or BELT=ON => ALARM=OFF
• END=5 => ALARM=ON
• END=10 or BELT=ON or KEY=OFF => ALARM=OFF]
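The seat-belt alarm can be written down as a transition table. The sketch assumes the usual reading of the diagram, which the extracted slide text does not fully confirm: KEY=ON leads from OFF to WAIT, END=5 leads from WAIT to ALARM, and the two ALARM=OFF transitions lead from WAIT and from ALARM back to OFF. Representing the inputs as a dictionary with an absent timer event as `None` is also an assumption.

```python
# state -> list of (guard on inputs, next state, emitted outputs)
TRANSITIONS = {
    "OFF": [
        (lambda i: i["KEY"] == "ON", "WAIT", {"START": True}),
    ],
    "WAIT": [
        (lambda i: i["KEY"] == "OFF" or i["BELT"] == "ON", "OFF", {"ALARM": "OFF"}),
        (lambda i: i["END"] == 5, "ALARM", {"ALARM": "ON"}),
    ],
    "ALARM": [
        (lambda i: i["END"] == 10 or i["BELT"] == "ON" or i["KEY"] == "OFF",
         "OFF", {"ALARM": "OFF"}),
    ],
}

def step(state, inputs):
    """Evaluate the FSM once: take the first enabled transition, else stay put."""
    for guard, target, outputs in TRANSITIONS[state]:
        if guard(inputs):
            return target, outputs
    return state, {}

# Usage: key turned on, belt left off until after the alarm has started.
events = [
    {"KEY": "ON", "BELT": "OFF", "END": None},  # start the timer
    {"KEY": "ON", "BELT": "OFF", "END": 5},     # 5 seconds passed: alarm on
    {"KEY": "ON", "BELT": "ON", "END": None},   # belt fastened: alarm off
]
state, trace = "OFF", []
for inputs in events:
    state, outputs = step(state, inputs)
    trace.append((state, outputs))
print(trace)
```

Only one state is active at any time, and the table lists every state explicitly, which is what makes properties such as "the alarm is eventually switched off" checkable by exhaustive inspection.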

Model of Architecture

A model of architecture is a formal representation of the operational semantics of networks of functional blocks describing architectures.

A model of architecture is similar to a model of computation, but the focus is on the architecture instead of on the applications.

A, B, C, and D are now hardware resources like CPUs, busses, memory, and dedicated coprocessors.

[Figure: network of four functional blocks A, B, C, and D.]

Examples

Less mature than MoC

[Figure: three example architectures arranged along a complexity axis (low to high): a CPU with bus and memory; a second CPU, bus, and memory configuration; and a CPU with processing elements PE1, PE2, PE3 and memory on a programmable communication network.]

Control-dominated tasks:
• Sequential

Control/data tasks:
• Sequential
• Centralized computation
• Mutually exclusive [1]

Data-dominated tasks:
• Parallel / DMA
• Data flow
• Distributed computation

[1] Wolf, Wayne. A Decade of Hardware/Software Codesign. IEEE Computer, Volume 36, April 2003, pages 38-43.

Conclusion: Matching Models

[Figure: an application with its model of computation on one side and an architecture with its model of architecture on the other, linked by the data type; labeled “Natural Fit”.]

When the MoC and MoA match, a simple mapping results.

Putting It Together: Example 1

Platform: microprocessor (“von Neumann architecture”)

Y-chart instance:
• Architecture instances: microprocessor (Pentium/Arm, MIPS/Alpha)
• Machine description
• Applications: SPECInt benchmarks
• Mapping: compiler (GCC)
• Performance numbers

Application (Picture in Picture):

for i=1:1:10,
  for j=1:1:10,
    A(i,j) = FIR();
  end
end
for i=1:1:10,
  for j=1:1:10,
    A(i,j) = SRC( A(i,j) );
  end
end

Architecture (microprocessor): program counter, instruction decoder, ALU, and memory (address).

Model of Architecture:
• Sequential (program counter)
• One item over the bus at a time
• Shared memory

Model of Computation:
• Sequential
• Shared memory

[Figure: Y-chart instance with the Picture in Picture application, the microprocessor, a compiler, and a simulator producing performance numbers.]

Natural fit

[Figure: Y-chart in which the application below is mapped onto the heterogeneous architecture (microprocessors, RPUs, IP cores, and memories), followed by performance analysis yielding performance numbers.]

Matlab Code (QR Algorithm):

%parameter N 8 16;
%parameter K 100 1000;

for k = 1:1:K,
  for j = 1:1:N,
    [ r(j,j), x(k,j), t ] = Vectorize( r(j,j), x(k,j) );
    for i = j+1:1:N,
      [ r(j,i), x(k,i), t ] = Rotate( r(j,i), x(k,i), t );
    end
  end
end

Model of Architecture:
• Task-level parallelism
• Heterogeneity
• Distributed memory

Model of Computation:
• Imperative
• Sequential execution
• Global memory

No natural fit
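To make the imperative, sequential, global-memory character of the QR loop nest concrete, here is a Python transcription. The slides do not define Vectorize and Rotate; the sketch assumes they are the two Givens-rotation operations commonly used for QR updating, so both function bodies are assumptions. Indices are 0-based here, whereas the Matlab code is 1-based.

```python
import math

def vectorize(r, x):
    """Assumed Givens 'vectorize': rotate (r, x) onto (norm, 0); return rotation t."""
    norm = math.hypot(r, x)
    t = (1.0, 0.0) if norm == 0.0 else (r / norm, x / norm)  # (cos, sin)
    return norm, 0.0, t

def rotate(r, x, t):
    """Assumed Givens 'rotate': apply the rotation t to the pair (r, x)."""
    c, s = t
    return c * r + s * x, -s * r + c * x, t

def qr_update(x_rows, n):
    """Sequential loop nest over shared arrays r and x, as in the Matlab code."""
    r = [[0.0] * n for _ in range(n)]
    x = [list(row) for row in x_rows]
    for k in range(len(x)):
        for j in range(n):
            r[j][j], x[k][j], t = vectorize(r[j][j], x[k][j])
            for i in range(j + 1, n):
                r[j][i], x[k][i], t = rotate(r[j][i], x[k][i], t)
    return r

samples = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
r = qr_update(samples, 2)
print(r)
```

Every statement reads and writes the global arrays `r` and `x`, and the loop order fixes a single execution sequence. That is the mismatch with an architecture built around task-level parallelism and distributed memory.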

[Figure: Y-chart in which a process network (Source, P1, P2, P3, P4, S1, Sink) is mapped onto the heterogeneous architecture (microprocessors, RPUs, IP cores, and memories), followed by performance analysis yielding performance numbers.]

Model of Architecture:
• Task-level parallelism
• Heterogeneity
• Distributed memory

Model of Computation:
• Process networks
• Distributed memory
• Distributed control

Natural fit

Our Research Focus

[Figure: the application, given as Matlab code (QR algorithm), is converted by Compaan into a process network (Source, P1, P2, P3, P4, S1, Sink). Laura/ESPAM take the process network to the heterogeneous architecture with its programmable interconnect (NoC), microprocessors, RPUs, IP cores, and memories. The step is labeled “Programming”.]

Matlab Code (QR Algorithm):

%parameter N 8 16;
%parameter K 100 1000;

for k = 1:1:K,
  for j = 1:1:N,
    [ r(j,j), x(k,j), t ] = Vectorize( r(j,j), x(k,j) );
    for i = j+1:1:N,
      [ r(j,i), x(k,i), t ] = Rotate( r(j,i), x(k,i), t );
    end
  end
end

Other Examples

Other examples of the “Natural Fit” concept:
• VSP architecture, CSDF
• Tangram, CSP
• Polis system, CFSM

Conclusions

• We will get billion-transistor ICs.
• We already know how to build them, but not how to program them.
• Current compilers do not take into account the notion of “natural fit” in terms of models of computation and models of architecture.
• New compiler research is needed that takes different models of computation into account.
