
Page 1: Speed-ups obtained by Reconfigurable Computing Reiner Hartenstein CAPES/DFG Cooperation on Reconfigurable Computing, inv. talk, Sept 19, 2008, Dept of

Speed-ups obtained by

Reconfigurable Computing

Reiner Hartenstein

CAPES/DFG Cooperation on Reconfigurable Computing,

inv. talk, Sept 19, 2008, Dept of Mechanical Engineering,

Universidade de Brasilia

1

slightly modified version

Reiner
The title is not quite right, actually: we have to link this side and the other side (of this paradigm, of course) with each other. That is what the talk is really about. Besides, the other side is new for you here.
Page 2: Speed-ups obtained by Reconfigurable Computing Reiner Hartenstein CAPES/DFG Cooperation on Reconfigurable Computing, inv. talk, Sept 19, 2008, Dept of

© 2008, [email protected] http://hartenstein.de

outline

2

Introduction

Manycore Crisis & von Neumann Syndrome

The Impact of Reconfigurable Computing

Programmer education: new roadmap needed

Conclusions

Page 3: Speed-ups obtained by Reconfigurable Computing Reiner Hartenstein CAPES/DFG Cooperation on Reconfigurable Computing, inv. talk, Sept 19, 2008, Dept of

© 2008, [email protected] http://hartenstein.de

5 key issues

climate change faster than predicted: by carbon emission, primarily from power plants ?

the manycore programming crisis stalls progress (the end of the free ride on the Gordon Moore curve); technologically stalled Moore's Law*

very high and growing computer energy cost – and growing number of power plants needed here

3

Reconfigurable Computing is a promising alternative

2008: 65, 45, 32 nm [Nick Tredennick (Gilder), 2003]; *) Tom Williams (keynote): the 20 nm wall

Reiner
but this much is certain
Reiner
by ending the GHz race it stalls progress toward affordable HPC (the manycore programming crisis): the end of the free ride on the Gordon Moore curve
Reiner
the 20 nanometer wall: slowdown in speed advance; copper wiring hits the wall; is high-k dielectric manufacturable at all? Earlier predictions were based only on resolution problems in mask making etc., but now: material problems
Reiner
A poll by Forrester Research at a conference reveals that firms are increasingly concerned with the impact of energy consumption in their business operations.
Page 4: Speed-ups obtained by Reconfigurable Computing Reiner Hartenstein CAPES/DFG Cooperation on Reconfigurable Computing, inv. talk, Sept 19, 2008, Dept of

© 2008, [email protected] http://hartenstein.de

History of data processing

• prototyped: 1884 Herman Hollerith

4

• datastream-based

The first reconfigurable computer

DPU

• 1st Xilinx FPGA 100 years later

Reiner
Herman Hollerith, born 29 Feb 1860 in Buffalo. Hand-crafted configware! No program counter, i.e. no CPU: only a DPU; no instruction streams; no booting. The transistor was not yet invented. Who knows when the vacuum tube was invented?
Page 5: Speed-ups obtained by Reconfigurable Computing Reiner Hartenstein CAPES/DFG Cooperation on Reconfigurable Computing, inv. talk, Sept 19, 2008, Dept of

© 2008, [email protected] http://hartenstein.de

Configware Programming

5

60 years later: RAM available – ferrite cores

manually (Configuration)

motivating the von Neumann paradigm

J. v N, 1946

or, by swapping pre-wired board

(Reconfiguration)

no instruction streams

Page 6: Speed-ups obtained by Reconfigurable Computing Reiner Hartenstein CAPES/DFG Cooperation on Reconfigurable Computing, inv. talk, Sept 19, 2008, Dept of

© 2008, [email protected] http://hartenstein.de

fine-grained reconfigurable

6

[Figure: Xilinx' old "island architecture": an array of CLBs (Configurable Logic Blocks) with switch boxes and connect boxes; a connect box connects to a CLB, forming a wire from point A to point B]

FPGA: Field-Programmable Gate Array

Reiner Hartenstein
has become mainstream; came up 25 years ago
Reiner Hartenstein
explain the LUT (look-up table)
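To make the "explain the LUT" note concrete, here is a minimal C sketch (my own illustration, not from the slides) of a 4-input look-up table: its 16 configuration bits play the role of the "hidden RAM" written before run time, and evaluating the LUT at run time is just a table lookup.

#include <stdio.h>
#include <stdint.h>

/* A 4-input LUT: its "hidden RAM" is a 16-bit truth table, written once
   at configuration time and only read at run time. */
typedef struct { uint16_t truth_table; } lut4_t;

/* Configuration step, done before run time. */
static void lut4_configure(lut4_t *lut, uint16_t truth_table) {
    lut->truth_table = truth_table;
}

/* Run time: the 4 input bits form an address into the truth table. */
static int lut4_eval(const lut4_t *lut, int a, int b, int c, int d) {
    int addr = (a << 3) | (b << 2) | (c << 1) | d;
    return (lut->truth_table >> addr) & 1;
}

int main(void) {
    lut4_t lut;
    lut4_configure(&lut, 0x8000);   /* configured as a 4-input AND gate */
    printf("%d %d\n", lut4_eval(&lut, 1, 1, 1, 1),
                      lut4_eval(&lut, 1, 0, 1, 1));   /* prints: 1 0 */
    return 0;
}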
Page 7: Speed-ups obtained by Reconfigurable Computing Reiner Hartenstein CAPES/DFG Cooperation on Reconfigurable Computing, inv. talk, Sept 19, 2008, Dept of

© 2008, [email protected] http://hartenstein.de7

[Figure (same as the previous slide): FPGA (Field-Programmable Gate Array) island architecture; the CLB array with switch boxes and connect boxes; connect to CLB, forming a wire from A to B; CLB: Configurable Logic Block]

Reiner Hartenstein
has become mainstream; came up 25 years ago
Reiner
the old island architecture from Xilinx
Page 8: Speed-ups obtained by Reconfigurable Computing Reiner Hartenstein CAPES/DFG Cooperation on Reconfigurable Computing, inv. talk, Sept 19, 2008, Dept of

© 2008, [email protected] http://hartenstein.de

switch box

RAM-based: the configware code is loaded before run time into a "hidden RAM"

8

[Figure: switch box; its flip-flops (FF) are part of the "hidden RAM"; configuration bits 0 / 1]

hidden-RAM FPGAs have been mainstream for more than a decade

this switch box has 150 transistors & 150 flip-flops (FF)

patches even at the customer's desk

Reiner Hartenstein
why Bill Gates is so rich: RAM-based
Page 9: Speed-ups obtained by Reconfigurable Computing Reiner Hartenstein CAPES/DFG Cooperation on Reconfigurable Computing, inv. talk, Sept 19, 2008, Dept of

© 2008, [email protected] http://hartenstein.de

Coarse-grained Reconfigurable Array

Conditional Swap Example: if X > Y then swap;

[Figure: inputs Xi and Yi feed a comparator (>) and two multiplexers that select the outputs Xo and Yo; the swap is turned into a wiring pattern; legend: route through only / route through and function (multiplexer); CFB, not CLB (parallelization of the bubble sort algorithm)]

Reiner Hartenstein
e.g. part of bubble-sort hardware derived from C descriptions; ccd = multiplexer; > is a relational operator; 1: route only; 3: function AND route
Reiner
it's a time to space mapping
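As a complement to the conditional-swap slide, here is a minimal C sketch (my own illustration, not from the talk) of the same operator in both worlds: the time-domain if/swap as it appears in software, and the space-domain form written as a comparator plus two multiplexer-style selections, which is what gets mapped onto the rDPU wiring pattern.

#include <stdio.h>

/* Time domain: the conditional swap as it appears in software,
   e.g. inside a bubble-sort loop. */
static void swap_in_time(int *x, int *y) {
    if (*x > *y) { int t = *x; *x = *y; *y = t; }
}

/* Space domain: the same behaviour as a comparator plus two
   multiplexer-style selections, i.e. a pure wiring pattern. */
static void swap_in_space(int xi, int yi, int *xo, int *yo) {
    int cond = (xi > yi);     /* the '>' comparator        */
    *xo = cond ? yi : xi;     /* multiplexer selecting Xo  */
    *yo = cond ? xi : yi;     /* multiplexer selecting Yo  */
}

int main(void) {
    int a = 7, b = 3, xo, yo;
    swap_in_time(&a, &b);
    swap_in_space(7, 3, &xo, &yo);
    printf("%d %d | %d %d\n", a, b, xo, yo);   /* prints: 3 7 | 3 7 */
    return 0;
}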
Page 10: Speed-ups obtained by Reconfigurable Computing Reiner Hartenstein CAPES/DFG Cooperation on Reconfigurable Computing, inv. talk, Sept 19, 2008, Dept of

© 2008, [email protected] http://hartenstein.de

Another coarse-grained r-Array

10

SNN Filter on a supersystolic array: mainly a pipe network

array size: 10 x 16; legend: rDPU not used / used for routing only / operator and routing / port location marker / backbus connect / backbus connect not used

rDPU: reconfigurable Data Path Unit, 32 bits wide; no CPU

(99% placement efficiency)

CFB: route through only

CoDe-X inside [Jürgen Becker]; by KressArray Xplorer [Ulrich Nageldinger]

Reiner
the principles go back to the systolic array, 1979; supersystolic means the anti-machine paradigm (the counterpart of von Neumann), see later
Reiner
98% placement efficiency
Page 11: Speed-ups obtained by Reconfigurable Computing Reiner Hartenstein CAPES/DFG Cooperation on Reconfigurable Computing, inv. talk, Sept 19, 2008, Dept of

© 2008, [email protected] http://hartenstein.de

Platform FPGA

11

[Figure: platform FPGA with 256–1704 BGA pins; 56–424 fast on-chip block RAMs (BRAMs); 8–32 fast serial I/O channels; DPUs; configware code input]

[courtesy Lattice Semiconductor]

Reiner Hartenstein
Now, large FPGAs featuring special-purpose logic such as dedicated multipliers and on-chip memories embedded into the logic fabric, have become attractive platforms for accelerating kernels in scientific applications.
Page 12: Speed-ups obtained by Reconfigurable Computing Reiner Hartenstein CAPES/DFG Cooperation on Reconfigurable Computing, inv. talk, Sept 19, 2008, Dept of

© 2008, [email protected] http://hartenstein.de

Reconfigurable Supercomputing

12

Silicon Graphics

Reconfigurable Application-

Specific Computing (RASC™)

• Xilinx Virtex-II Pro  • Library by Cray

Cray XD1

Supercomputing 2007, Reno, Nevada, USA: 9600 registered attendees, 440 exhibitors

Chuck Thacker … (even Microsoft working at it)

(Lab in Cambridge, UK, etc.).

Page 13: Speed-ups obtained by Reconfigurable Computing Reiner Hartenstein CAPES/DFG Cooperation on Reconfigurable Computing, inv. talk, Sept 19, 2008, Dept of

© 2008, [email protected] http://hartenstein.de

what "Configware" means

13

[Figure: Software to Configware Migration. Traditional Computing (time domain): Software Source → Software Compiler (instruction-procedural) → Software Code. Reconfigurable Computing (space domain): Configware Source → mapper (Placement & Routing) → Configware Code (structural: space domain), plus data scheduler (data-procedural) → Flowware Code]

Page 14: Speed-ups obtained by Reconfigurable Computing Reiner Hartenstein CAPES/DFG Cooperation on Reconfigurable Computing, inv. talk, Sept 19, 2008, Dept of

© 2008, [email protected] http://hartenstein.de

outline

14

Introduction

The Manycore Crisis & the von Neumann Syndrome

The Impact of Reconfigurable Computing

Programmer education: new roadmap needed

Conclusions

Page 15: Speed-ups obtained by Reconfigurable Computing Reiner Hartenstein CAPES/DFG Cooperation on Reconfigurable Computing, inv. talk, Sept 19, 2008, Dept of

© 2008, [email protected] http://hartenstein.de15

Many-core: Break-through or Breakdown?

Industry is facing a disruptive turning point

intel's vision: MultiCore

The stakes are high ...

HPC users lack understanding in basic precepts*

*) PRACE consortium (Partnership foR Advanced Computing in Europe) http://www.prace-project.eu/documents/D3.3.1_document_final.pdf

it's an education, qualification, and R&D problem

“could reset µP HW & SW roadmaps for next 30 years”, [David Patterson]

forcing a historic transition to a parallel programming model yet to be invented [David Callahan]

"I would be panicked if I were in industry" [John Hennessy]

Reiner Hartenstein
Intel and Sun keynotes [DAC'08] admit that there are problems with manycore programming, a lack of suitable software, and curbing by memory latency.
Reiner Hartenstein
who leads Microsoft's Parallel Computing Initiative,
Reiner
[Dave Patterson]: "intel has thrown a Hail Mary pass and nobody is running yet" *) a Hail Mary pass in American football is a forward pass made in desperation, with a very small chance of success
Reiner
Multi-threading, transactional memory, register re-naming, speculative tricks, multiple superscalarism, out-of-order execution, …: no silver bullets. The secret of the IBM Cell processor's success.
Reiner
The stakes are high. If research does not find efficient parallel techniques, programming will become so difficult that people will not benefit from the new hardware: from growth industry to replacement industry? Hennessy: I would be panicked if I were in industry
Reiner
HPC: lacking or missing multicore programming qualifications*
Reiner
productivity goes down disproportionately with the number of processes. In particular HPC application domains, massive parallelism requires 10–30 professionals in multi-disciplinary, multi-institutional teams for 5–10 years [Douglass Post, DoD HPCMP, panelist at SC07]. Software done: machine obsolete
Page 16: Speed-ups obtained by Reconfigurable Computing Reiner Hartenstein CAPES/DFG Cooperation on Reconfigurable Computing, inv. talk, Sept 19, 2008, Dept of

© 2008, [email protected] http://hartenstein.de16

Declining Programmer Productivity

In particular HPC application domains, massive parallelism requires 10–30 professionals in multi-disciplinary, multi-institutional teams for 5–10 years [Douglass Post, DoD HPCMP, panelist at SC07]

The Law of More: programmer productivity declines disproportionately with increasing parallelism

Software done: machine obsolete

Page 17: Speed-ups obtained by Reconfigurable Computing Reiner Hartenstein CAPES/DFG Cooperation on Reconfigurable Computing, inv. talk, Sept 19, 2008, Dept of

© 2008, [email protected] http://hartenstein.de17

The von Neumann Syndrome

Page 18: Speed-ups obtained by Reconfigurable Computing Reiner Hartenstein CAPES/DFG Cooperation on Reconfigurable Computing, inv. talk, Sept 19, 2008, Dept of

© 2008, [email protected] http://hartenstein.de18

The von Neumann Syndrome

More power for creating foam than for accelerating the vessel?

Page 19: Speed-ups obtained by Reconfigurable Computing Reiner Hartenstein CAPES/DFG Cooperation on Reconfigurable Computing, inv. talk, Sept 19, 2008, Dept of

© 2008, [email protected] http://hartenstein.de19

Massive Overhead Phenomena

[Figure: a single-core von Neumann CPU]

Dijkstra 1968: The Goto considered harmful
Koch et al. 1975: The universal Bus considered harmful
Backus, 1978: Can programming be liberated from the von Neumann style?
Arvind et al., 1983: A critique of Multiprocessing the von Neumann Style

overhead piling up to code sizes of astronomic dimensions

"von Neumann Syndrome": 2006, C.V. "RAM" Ramamoorthy

1986, E.I.S. project: 94% for address computation; total speed-up: x 15000

"a terrifying number of processes running in parallel create sequential-processing bottlenecks and losses in data locality" (2008, David Callahan)

von Neumann overhead (list not complete):
instruction fetch: instruction stream
state address computation: instruction stream
data address computation: instruction stream
data meet PU + other overhead: instruction stream
i/o to/from off-chip RAM: instruction stream
C++ compiler, virtualization, many other features

Reiner Hartenstein
bus = von Neumann bottleneck
Reiner Hartenstein
criticizing the program counter's flexibility
Reiner Hartenstein
PISA DRC accelerator [ICCAD 1984]: 94% of the computation load only for moving a 4-by-3 window (a kind of image processing); entire project: 15000x speed-up; funded by the E.I.S. project (the German Mead-&-Conway effort)
Reiner
David Callahan joined Microsoft in late 2005. He is part of a cross-divisional team that is looking forward to the coming surge of multi-core processors that will make parallel-computing ubiquitous in home and office. This is a tremendous opportunity for Microsoft to exploit this fundamental shift in programming and how systems will be used to enable new user experiences and capabilities in all our business areas. Callahan’s particular strengths are in programming languages, programming techniques, and compilation techniques focused on expression and exploiting concurrency.
Page 20: Speed-ups obtained by Reconfigurable Computing Reiner Hartenstein CAPES/DFG Cooperation on Reconfigurable Computing, inv. talk, Sept 19, 2008, Dept of

© 2008, [email protected] http://hartenstein.de

manycore von Neumann: arrays of massive overhead phenomena

20

[Figure: a single von Neumann CPU vs. a manycore array of CPUs; the overhead phenomena are replicated proportionately to the number of processors, and grow disproportionately to the number of processors]

fast on-chip memory cannot store such huge instruction code blocks

manycore von Neumann overhead:
instruction fetch: instruction stream
state address computation: instruction stream
data address computation: instruction stream
data meet PU + other overhead: instruction stream
i/o to/from off-chip RAM: instruction stream
inter-PU communication: instruction stream
message passing overhead: instruction stream
transactional memory overhead: instruction stream
multithreading overhead etc.: instruction stream

Reiner Hartenstein
HTM overhead:ISCA07: ABSTRACTHardware Transactional Memory (HTM) systems reflect choices from three key design dimensions: conflict detection, version management, and conflict resolution. Previously proposed HTMs represent three points in this design space: lazy conflict detection, lazy version management, committer wins (LL); eager conflict detection, lazy version management, requester wins (EL); and eager conflict detection, eager version management, and requester stalls with conservative deadlock avoidance (EE)
Reiner
coming: 16 cores per chip, or 32, or 30; an increase by 2x every 2 years! permanent compatibility problems
Reiner Hartenstein
fast on-chip memory is far too small for code packages of such astronomic dimensions; slow off-chip memory allows no way around the memory wall
Page 21: Speed-ups obtained by Reconfigurable Computing Reiner Hartenstein CAPES/DFG Cooperation on Reconfigurable Computing, inv. talk, Sept 19, 2008, Dept of

© 2008, [email protected] http://hartenstein.de

outline

21

Introduction

Manycore Crisis & von Neumann Syndrome

The Impact of Reconfigurable Computing

Programmer education: new roadmap needed

Conclusions

Page 22: Speed-ups obtained by Reconfigurable Computing Reiner Hartenstein CAPES/DFG Cooperation on Reconfigurable Computing, inv. talk, Sept 19, 2008, Dept of

© 2008, [email protected] http://hartenstein.de

Speed-up factors obtained by Software to Configware migration

22

[Chart: speed-up factors obtained with FPGAs (log scale from 10^0 to 10^6, growing about 2x per year, 1980–2010), grouped into DSP and wireless; image processing, pattern matching, multimedia; bioinformatics; astrophysics; crypto:
MAC 1000, Viterbi decoding 400, Reed-Solomon decoding 2400, FFT 100, SPIHT wavelet-based image compression 457, pattern recognition 730, video-rate stereo vision 900, real-time face detection 6000, BLAST 52, protein identification 40, Smith-Waterman pattern matching 288, molecular dynamics simulation 88, GRAPE (astrophysics) 20, crypto 1000, DES breaking 3000 and 28500]

Reiner Hartenstein
Success with RC has been achieved in a variety of areas such as signal and image processing, cryptology, communications processing, data and text mining, and global optimization, for a variety of platform types, from high-end systems on earth to mission-critical systems in space.
Page 23: Speed-ups obtained by Reconfigurable Computing Reiner Hartenstein CAPES/DFG Cooperation on Reconfigurable Computing, inv. talk, Sept 19, 2008, Dept of

© 2008, [email protected] http://hartenstein.de

Accelerator card from Bruchsal

23

• I/O Bandwidth: 50 GByte/s

• Manufacturer: SIEMENS Bruchsal

16 FPGAs

Tera means 10^12 or 1 000 000 000 000 (1 trillion)

MAC means Multiply and ACcumulate

• 1.5 TeraMAC/s

Page 24: Speed-ups obtained by Reconfigurable Computing Reiner Hartenstein CAPES/DFG Cooperation on Reconfigurable Computing, inv. talk, Sept 19, 2008, Dept of

© 2008, [email protected] http://hartenstein.de24

Energy saving factors obtained by software to configware migration

[Chart: the same applications as on the speed-up slide, now annotated with the energy saving factors* obtained by software to configware migration with FPGAs; e.g. for DES breaking the energy saving factors are 3440 and 300, vs. speed-ups of 28500 and 3000]

Energy saving: almost 10x less than the speed-up …

… could be improved

Reiner
earlier papers do not report energy factors
Page 25: Speed-ups obtained by Reconfigurable Computing Reiner Hartenstein CAPES/DFG Cooperation on Reconfigurable Computing, inv. talk, Sept 19, 2008, Dept of

© 2008, [email protected] http://hartenstein.de25

von Neumann overhead vs. Reconfigurable Computing

25

overhead / von Neumann machine / anti machine:
instruction fetch / instruction stream / none*
state address computation / instruction stream / none*
data address computation / instruction stream / none*
data meet PU + other overhead / instruction stream / none*
i/o to/from off-chip RAM / instruction stream / none*
inter-PU communication / instruction stream / none*
message passing overhead / instruction stream / none*
transactional memory overhead / instruction stream / none*
multithreading overhead etc. / instruction stream / none*

using reconfigurable data counters

*) configured before run time: no instruction fetch at run time

[Figure: a manycore array of CPUs vs. an rDPA (reconfigurable datapath array, coarse-grained reconfigurable) of rDPUs]

Page 26: Speed-ups obtained by Reconfigurable Computing Reiner Hartenstein CAPES/DFG Cooperation on Reconfigurable Computing, inv. talk, Sept 19, 2008, Dept of

© 2008, [email protected] http://hartenstein.de

Data meet the processor (CPU)

26

by Software

inefficient transport via off-chip memory by memory-cycle-hungry instruction streams

This is just one of many von Neumann overhead phenomena

illustrating von Neumann syndrome

Page 27: Speed-ups obtained by Reconfigurable Computing Reiner Hartenstein CAPES/DFG Cooperation on Reconfigurable Computing, inv. talk, Sept 19, 2008, Dept of

© 2008, [email protected] http://hartenstein.de

Data meet the CPU

27

by Flowware

Placement of the execution locality (not moving the data) within a pipe network generated by the configware compiler*

illustrating acceleration

*) before run time (at compile time)

Reiner
processing unit, or DPU: DataPath Unit
Page 28: Speed-ups obtained by Reconfigurable Computing Reiner Hartenstein CAPES/DFG Cooperation on Reconfigurable Computing, inv. talk, Sept 19, 2008, Dept of

© 2008, [email protected] http://hartenstein.de28

What did we learn? There are 2 kinds of datastreams:

“Dataflow machine” would be a nice term, but was introduced by a different scene*

1) indirectly moved by an instruction stream machine (von Neumann): extremely inefficient

2) directly moved by a datastream machine (from Reconfigurable Computing): very efficient

*) meanwhile dead: not really a dataflow machine, but had used compilers accepting a dataflow language

Page 29: Speed-ups obtained by Reconfigurable Computing Reiner Hartenstein CAPES/DFG Cooperation on Reconfigurable Computing, inv. talk, Sept 19, 2008, Dept of

© 2008, [email protected] http://hartenstein.de29

What else did we learn?

There are 2 kinds of parallelism:

1) Concurrent processes: instruction stream parallelism (CPU manycores): inefficient

2) Data parallelism by parallel datastreams (in Reconfigurable Computing Systems): efficient

- Data parallelism brings the performance (we do data processing !)

Conclusion:

Page 30: Speed-ups obtained by Reconfigurable Computing Reiner Hartenstein CAPES/DFG Cooperation on Reconfigurable Computing, inv. talk, Sept 19, 2008, Dept of

© 2008, [email protected] http://hartenstein.de

What Parallelism?

30

[Figure, Hartenstein's watering can model: data parallelism on an array of rDPUs, no von Neumann bottleneck; instruction parallelism on an array of CPUs, many von Neumann bottlenecks]

Page 31: Speed-ups obtained by Reconfigurable Computing Reiner Hartenstein CAPES/DFG Cooperation on Reconfigurable Computing, inv. talk, Sept 19, 2008, Dept of

© 2008, [email protected] http://hartenstein.de

Put old ideas into practice (POIIP)

31

... „The biggest payoff will come from putting old ideas into practice and teaching people how to apply them properly.“ [David Parnas]

"We need a complete re-definition of CS" [Burton Smith and other celebrities]

Wrong! I do not agree [Reiner Hartenstein], finding out that ...

"We need a complete re-definition of curriculum recommendations; they are missing several key issues." [Reiner Hartenstein]

Page 32: Speed-ups obtained by Reconfigurable Computing Reiner Hartenstein CAPES/DFG Cooperation on Reconfigurable Computing, inv. talk, Sept 19, 2008, Dept of

© 2008, [email protected] http://hartenstein.de

outline

32

Introduction

Manycore Crisis & von Neumann Syndrome

The Impact of Reconfigurable Computing

Programmer education: new road map needed

Conclusions

Page 33: Speed-ups obtained by Reconfigurable Computing Reiner Hartenstein CAPES/DFG Cooperation on Reconfigurable Computing, inv. talk, Sept 19, 2008, Dept of

© 2008, [email protected] http://hartenstein.de

Fighting against obsolete curricula?

Real-Time Systems (Sweden)

Recommendations for Designing new ICT Curricula

Workshop on Embedded Systems Education

WESE

Chess – Center for Hybrid and Embedded Software Systems (courses in embedded systems)

Graduate Curriculum on Embedded Software and Systems (EU)

Advanced Real Time Systems

The Embedded Systems Approach?

… support their own educational approach

"You can always teach programming to a hardware guy ... but you can never teach hardware to a programmer"

it's not the programmer's fault: it's due to obsolete CS curricula

Reiner
you can always teach programming to a hardware guy, but you can never teach hardware to a programmer; it is not the programmer's fault, it is the fault of his/her educators
Page 34: Speed-ups obtained by Reconfigurable Computing Reiner Hartenstein CAPES/DFG Cooperation on Reconfigurable Computing, inv. talk, Sept 19, 2008, Dept of

© 2008, [email protected] http://hartenstein.de34

CS is a Monster

fully wrong educational mainstream approaches:

1) the basic mind set is exclusively instruction-stream-oriented; data streams are considered exotic

2) mapping parallelism into the time domain; abstracting away the space domain is fatal

We need a dual-rail education

Reiner
e.g. threads; we need time to space mapping instead of abstracting it away
Reiner Hartenstein
as long as the space domain is abstracted away, all that stuff is implemented by instruction streams! MPI takes ~50% of the computation time [RAW86]: von Neumann syndrome
Reiner
we need data parallelism as mainstream; beware of "dataflow" (indeterministic): it is dead
Page 35: Speed-ups obtained by Reconfigurable Computing Reiner Hartenstein CAPES/DFG Cooperation on Reconfigurable Computing, inv. talk, Sept 19, 2008, Dept of

© 2008, [email protected] http://hartenstein.de

We need to POIIP for:

35

2 key rules of thumb, terrifically simple:

1) a loop turns into a pipeline [1979]
2) a decision box turns into a demultiplexer [1967]: PvOIIP

for Software to Hardware Migration and for Software to Configware Migration

Reiner
this illustration fits well for coarse-grained arrays ..... with FPGAs it can be more complicated
Page 36: Speed-ups obtained by Reconfigurable Computing Reiner Hartenstein CAPES/DFG Cooperation on Reconfigurable Computing, inv. talk, Sept 19, 2008, Dept of

© 2008, [email protected] http://hartenstein.de

Two Dichotomies

36

Dichotomy = mutual allocation to two opposed domains such that a third domain is excluded. The dichotomy model serves as an educational orientation guide for dual-rail education to overcome the software/configware chasm & the software/hardware chasm: 1) Machine Paradigm Dichotomy (von Neumann / dataflow machine*): the "Twin Paradigm" model; 2) Relativity Dichotomy: time domain / space domain, which helps parallelization by time to space mapping

*) see definition

Reiner Hartenstein
(RC) .... it's an alternative culture .... now the area is going mainstream: a rapidly widening audience of non-specialists gets interested ... severe communication gaps due to educational deficits, not only for users: still many hardware and EDA experts ask: isn't it just logic design on a strange platform? It is time to clarify and popularize the fundamental aspects and to explain that it is a fundamentally different culture --- Dichotomy helps to understand it
Reiner
Dichotomy is the solution of a dilemma - the CS education dilemma
Page 37: Speed-ups obtained by Reconfigurable Computing Reiner Hartenstein CAPES/DFG Cooperation on Reconfigurable Computing, inv. talk, Sept 19, 2008, Dept of

© 2008, [email protected] http://hartenstein.de

Def.: Dataflow Machine

37

The old "Dataflow Machine" research scene is dead: indeterministic, with an unpredictable order of execution, yet sequential execution, so not really a dataflow machine; it had used compilers accepting a dataflow language.

We re-define this term as the counterpart of von Neumann: deterministic, with data counters (no program counter).

Page 38: Speed-ups obtained by Reconfigurable Computing Reiner Hartenstein CAPES/DFG Cooperation on Reconfigurable Computing, inv. talk, Sept 19, 2008, Dept of

© 2008, [email protected] http://hartenstein.de

1) Paradigm Dichotomy

38

The Twin Paradigm Approach (TTPA), the procedural dichotomy:

instruction domain: CPU, program counter, instruction stream
datastream domain: (r)DPA, data counter, data stream

Page 39: Speed-ups obtained by Reconfigurable Computing Reiner Hartenstein CAPES/DFG Cooperation on Reconfigurable Computing, inv. talk, Sept 19, 2008, Dept of

© 2008, [email protected] http://hartenstein.de

Paradigm Dichotomy

39

(procedural dichotomy) The Twin Paradigm Approach (TTPA):

instruction domain: CPU, program counter, instruction stream
datastream domain: (r)DPA, data counter(s), data stream

Asymmetry: we need data parallelism

Reiner Hartenstein
the only asymmetry in our dichotomy
Page 40: Speed-ups obtained by Reconfigurable Computing Reiner Hartenstein CAPES/DFG Cooperation on Reconfigurable Computing, inv. talk, Sept 19, 2008, Dept of

© 2008, [email protected] http://hartenstein.de

Data Machine: from old stuff [1979 - ...]

40

[Figure: an (r)DPA fed from all sides by ASMs (Auto-Sequencing Memories); each ASM = RAM + GAG + data counter; ASM: data streams [Kung et al. 1979]; systolic array → super systolic [1995]; new is only its generalization [1989]; (r)DPA [1990]]

Reiner Hartenstein
several data counters instead of a program counter; programmed by Flowware; the data counter is placed in memory** (not with the datapath***); *) especially coarse-grained, for instance a platform FPGA; **) normally on-chip; ***) not like with a CPU
Reiner Hartenstein
1) making it reconfigurable; 2) discard algebraic synthesis methods; 3) add data sequencers -> machine paradigm; 4) with a reconfigurable address generator
Page 41: Speed-ups obtained by Reconfigurable Computing Reiner Hartenstein CAPES/DFG Cooperation on Reconfigurable Computing, inv. talk, Sept 19, 2008, Dept of

© 2008, [email protected] http://hartenstein.de

Procedural Languages Twins

41

imperative Software Languages vs. (super)systolic Flowware Languages:
read next instruction / read next data item
goto (instruction address) / goto (data address)
jump to (instruction address) / jump to (data address)
instruction loop / data loop
instruction loop nesting / data loop nesting
instruction loop escape / data loop escape
instruction stream branching / data stream branching
no internally parallel loops / yes: internally parallel loops
program counter / data counter(s)

But there is the Asymmetry: data counter(s) for data parallelism

Reiner Hartenstein
instruction-procedural
Reiner Hartenstein
data-procedural
Reiner Hartenstein
without instruction parallelism, without instruction streams!
Page 42: Speed-ups obtained by Reconfigurable Computing Reiner Hartenstein CAPES/DFG Cooperation on Reconfigurable Computing, inv. talk, Sept 19, 2008, Dept of

© 2008, [email protected] http://hartenstein.de

Relativity Dichotomy

time domain (procedure domain) vs. space domain (structure domain)

42

von Neumann Machine (time): 2 phases: 1) programming instruction streams, 2) run time

Anti Machine (time/space): 3 phases: 1) reconfiguration of structures, 2) programming data streams, 3) run time

Reiner
The theory of relativity deals with the structure of space and time. The special theory of relativity deals with the relativity of space and time. Because space is finite, we need a time to time/space mapping.
Page 43: Speed-ups obtained by Reconfigurable Computing Reiner Hartenstein CAPES/DFG Cooperation on Reconfigurable Computing, inv. talk, Sept 19, 2008, Dept of

© 2008, [email protected] http://hartenstein.de

time-iterative to space-iterative

43

a time to space mapping: n time steps on 1 CPU → 1 time step on n DPUs (n = length of pipeline): POIIP

Often the space dimension is limited (e.g. by the chip size), so we need a time to space/time mapping: k*n time steps on 1 CPU → k time steps on n DPUs

Strip mining [D. Loveman, J-ACM, 1977]; loop transformation methodology: the 70ies and later

Reiner
this illustration fits well for coarse-grained arrays ..... with FPGAs it can be more complicated
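To make the strip-mining rule concrete, here is a minimal C sketch (my own illustration; the array size N and the strip width P are assumed toy values): the sequential loop takes N time steps on one CPU, while the strip-mined form takes k = N/P outer time steps, the inner loop standing for P DPUs working simultaneously in space.

#include <assert.h>

#define N 12   /* assumed problem size                              */
#define P 4    /* assumed strip width: the number of DPUs available */

/* time algorithm: N time steps on 1 CPU */
void scale_sequential(int a[N]) {
    for (int i = 0; i < N; i++)
        a[i] *= 2;
}

/* strip-mined form: k = N/P outer time steps; the inner loop stands for
   the P DPUs of one strip working simultaneously. */
void scale_strip_mined(int a[N]) {
    assert(N % P == 0);                  /* keeps the sketch simple  */
    for (int i = 0; i < N; i += P)       /* k time steps             */
        for (int j = 0; j < P; j++)      /* the P DPUs, in parallel  */
            a[i + j] *= 2;
}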
Page 44: Speed-ups obtained by Reconfigurable Computing Reiner Hartenstein CAPES/DFG Cooperation on Reconfigurable Computing, inv. talk, Sept 19, 2008, Dept of

© 2008, [email protected] http://hartenstein.de

outline

44

Introduction

Manycore Crisis & von Neumann Syndrome

The Impact of Reconfigurable Computing

Conclusions

Page 45: Speed-ups obtained by Reconfigurable Computing Reiner Hartenstein CAPES/DFG Cooperation on Reconfigurable Computing, inv. talk, Sept 19, 2008, Dept of

© 2008, [email protected] http://hartenstein.de45

Conclusions (1)

We massively need programmable accelerator co-processors. Established technologies are available, and we can still use standard software and its tools.

Configware skills and basic hardware knowledge are essential qualifications for programmers.

We need a massive migration of software to configware: to cope with the implementation wall and with the programmer population's unsustainable skills mismatches.

Reiner Hartenstein
Yesterday's programmers: people understanding only software. Embedded engineers: people who combine an understanding of software and hardware. Programmers today: people who combine an understanding of software and configware; understanding configware requires some understanding of hardware. Modern programmers: combining an understanding of software, configware, and hardware?
Reiner Hartenstein
Hennessy: “… parallelism and ease of use of truly parallel computers: a problem that’s as hard as any that computer science has faced. … I would be panicked if I were in industry.” --- fund [only] three universities to get underway– Berkeley, Illinois, and Stanford - the need for a major, government-sponsored attack on the multicore challenge --- Since the first commercial computer in 1950, cost-performance of computing has improved by about 100 billion overall, using the rapidly increasing transistor speed the last 20 yearsarguing for a new Manhattan project: was TOP SECRET - academic career to support the war. Dave Patterson: need for a major, government-sponsored attack on the multicore challengeAlwyn Goodloe: I think “ease of use” [Hennessy] is really key in the effort. Most CIOs will probably say that recompiling the ole dusty deck is as much as they are willing to do. We probably need to say exactly what we mean by ease of use.
Reiner
Reiner, 03.09.2008: we cannot switch fully and quickly to a disruptive paradigm shift; the enormous burden of legacy software requires a mass movement ... We need a twin-paradigm approach, mainly based on computing wisdom that has mostly been ignored by our curricula for decades; it calls for a mass movement into a run-away computing revolution causing a quick, wide, and strong impact, as known from the VLSI design revolution a la Mead & Conway.
Page 46: Speed-ups obtained by Reconfigurable Computing Reiner Hartenstein CAPES/DFG Cooperation on Reconfigurable Computing, inv. talk, Sept 19, 2008, Dept of

© 2008, [email protected] http://hartenstein.de46

Conclusions (2)

CS education is a monster!

Jaw-dropping sclerosis of curriculum taskforces

We need a complete re-definition of CS education

CS should learn a lot from Embedded Systems, like in Mechanical Engineering

Fully wrong educational mainstream approaches

We urgently need Dual-Rail Education

Reiner
A solution is impossible based on von Neumann only. The CS education dilemma: typical programmers have a von-Neumann-dominated mind set; missing skills for time to space mapping; missing skills for software to configware migration. We need a dichotomy-based CS education approach.
Reiner
why this is a difficult problem: we are caught in a severe educational dilemma; curriculum taskforces fully ignore the requirements of the de facto job market, being blockheads, stubborn like a donkey; Peter Denning: doesn't want to discuss any details; it's criminal: recruiting accreditation
Reiner
curricula, not CS theory. The reason for a decade of enrolment decline: our CS programs mainly stress subject areas that are outsourcing candidates, compared to the more hardware-intensive past. E.g. embedded-system qualifications are acquired by on-the-job training; the expert is then hired by a competitor. No outsourcing candidate: the subject is not taught.
Reiner
The Monster paradigm: the von Neumann (vN) model has become an insatiable monster ... CS courses seem to teach that memory and processing cycles are infinite; no understanding of architecture or assembly code. CS courses are more like IT*, focusing on databases, web design and Java, preparing for jobs commonly off-shored. *) These imperfect courses even fail to deliver. [Mike Anderson] Unsustainable skills mismatch: we need a change from the top [D. Selwood, ETJ]; we need people understanding both software and configware; there are even special job centres for FPGA engineers
Reiner
We are in the most disruptive development in the entire history of computing; because of the skills mismatch we cannot meet the challenges of this historic turning point in the computer industry
Page 47: Speed-ups obtained by Reconfigurable Computing Reiner Hartenstein CAPES/DFG Cooperation on Reconfigurable Computing, inv. talk, Sept 19, 2008, Dept of

© 2008, [email protected] http://hartenstein.de

thank you for your patience

47

Page 48: Speed-ups obtained by Reconfigurable Computing Reiner Hartenstein CAPES/DFG Cooperation on Reconfigurable Computing, inv. talk, Sept 19, 2008, Dept of

© 2008, [email protected] http://hartenstein.de

END

48

Page 49: Speed-ups obtained by Reconfigurable Computing Reiner Hartenstein CAPES/DFG Cooperation on Reconfigurable Computing, inv. talk, Sept 19, 2008, Dept of

© 2008, [email protected] http://hartenstein.de

backup for discussion:

49

Page 50: Speed-ups obtained by Reconfigurable Computing Reiner Hartenstein CAPES/DFG Cooperation on Reconfigurable Computing, inv. talk, Sept 19, 2008, Dept of

© 2008, [email protected] http://hartenstein.de

time to space mapping: time domain (procedure domain) vs. space domain (structure domain)

50

time algorithm → space algorithm: program loop (n time steps, 1 CPU) → pipeline (1 time step, n DPUs)

time algorithm → space/time algorithm: Bubble Sort (n x k time steps, 1 "conditional swap" unit) → Shuffle Sort (k time steps, n "conditional swap" units)

[Figure: a chain of conditional-swap units with inputs x and y]

Reiner
The theory of relativity deals with the structure of space and time. The special theory of relativity deals with the relativity of space and time.
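A minimal C sketch (my own illustration; the parallel variant is modelled here as odd-even transposition) of the two algorithms named above: bubble sort re-uses one conditional-swap unit over n x k time steps, while the shuffle-sort style version lets a whole row of conditional-swap units fire in each of about n time steps.

/* The conditional swap: the one operator that gets replicated in space. */
void cond_swap(int *x, int *y) {
    if (*x > *y) { int t = *x; *x = *y; *y = t; }
}

/* Time algorithm: bubble sort, about n x k steps on ONE conditional-swap unit. */
void bubble_sort(int a[], int n) {
    for (int k = 0; k < n - 1; k++)
        for (int i = 0; i + 1 < n; i++)
            cond_swap(&a[i], &a[i + 1]);
}

/* Space/time algorithm: each outer step models one clock tick in which a
   whole row of conditional-swap units works simultaneously (odd and even
   pairs alternating); about n such ticks sort the array. */
void shuffle_sort(int a[], int n) {
    for (int k = 0; k < n; k++)
        for (int i = k & 1; i + 1 < n; i += 2)  /* in hardware: all at once */
            cond_swap(&a[i], &a[i + 1]);
}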
Page 51: Speed-ups obtained by Reconfigurable Computing Reiner Hartenstein CAPES/DFG Cooperation on Reconfigurable Computing, inv. talk, Sept 19, 2008, Dept of

© 2008, [email protected] http://hartenstein.de

Architecture instead of synchronization

51

"Shuffle Sort" example: a direct time to space mapping would give accessing conflicts; the modification with the shuffle function gives a better architecture instead of complex synchronization: half the number of blocks, plus up-and-down movement of the data (shuffle function); no von Neumann syndrome!

[Figure: rows of conditional-swap units; the shuffle variant uses half as many blocks]

Reiner
The theory of relativity deals with the structure of space and time. The special theory of relativity deals with the relativity of space and time.
Page 52: Speed-ups obtained by Reconfigurable Computing Reiner Hartenstein CAPES/DFG Cooperation on Reconfigurable Computing, inv. talk, Sept 19, 2008, Dept of

© 2008, [email protected] http://hartenstein.de

Transformations since the 70ies

time domain (procedure domain) vs. space domain (structure domain)

52

Strip Mining Transformation, time algorithm → space/time algorithm: program loop (n x k time steps, 1 CPU) → pipeline (k time steps, n DPUs)

loop transformations: a rich methodology has been published [survey: Diss. Karin Schmidt, 1994, Shaker Verlag]

Reiner
The theory of relativity deals with the structure of space and time. The special theory of relativity deals with the relativity of space and time.
Page 53: Speed-ups obtained by Reconfigurable Computing Reiner Hartenstein CAPES/DFG Cooperation on Reconfigurable Computing, inv. talk, Sept 19, 2008, Dept of

© 2008, [email protected] http://hartenstein.de

Education revolution: the microelectronics design revolution

Carver Mead, Lynn Conway as the model (in Germany: the E.I.S. project)

53

The new M-&-C division of labour: the "tall thin man" covers everything from the application down to the layout level with coherence; strongly reduced breadth of specialization; emphasis on "systems"; decluttering & intuitive models to fix the education dilemma; silicon foundry (external technology).

The traditional division of labour: wide breadth of specialization, fragmented across the levels (application, RT level, logic level, switching level, circuit level, layout level, in-house technology), with submission and rejection between each level. [1980]

Reiner
10 years of preaching and taking a beating for it
Reiner Hartenstein
bureaucracy: submit, send back; communication problems; decluttering urgently needed
Page 54: Speed-ups obtained by Reconfigurable Computing Reiner Hartenstein CAPES/DFG Cooperation on Reconfigurable Computing, inv. talk, Sept 19, 2008, Dept of

© 2008, [email protected] http://hartenstein.de

Education Revolution: Reconfigurable Computing Revolution

Program level / Application level

54

[Figure: the "tall thin man"* spans the application level and the program level across the dichotomy, clearing out the von Neumann paradigm (instruction-stream-based) and the anti-machine paradigm (datastream-based): the Twin Paradigm; Christophe Bobda, the new Mead & Conway?]

*) or "tall thin woman"

Reiner Hartenstein
Introduction to RC Systems (proposed); Christophe Bobda: the new M&C
Page 55: Speed-ups obtained by Reconfigurable Computing Reiner Hartenstein CAPES/DFG Cooperation on Reconfigurable Computing, inv. talk, Sept 19, 2008, Dept of

© 2008, [email protected] http://hartenstein.de

Who generates the data streams?

55

[Figure: a "systolic" array with data streams entering and leaving on all sides]

Without a Sequencer it's not a Machine!

Page 56: Speed-ups obtained by Reconfigurable Computing Reiner Hartenstein CAPES/DFG Cooperation on Reconfigurable Computing, inv. talk, Sept 19, 2008, Dept of

© 2008, [email protected] http://hartenstein.de

The Anti Machine

56

several data counters instead of a program counter; the data counter is placed in memory** (not with the datapath***)

[Figure: a (r)DPA* (super-systolic array, KressArray) surrounded by ASMs (Auto-Sequencing Memories); each ASM = RAM + GAG + data counter; ASM: data streams [Kung et al. 1979]; programmed by Flowware]

*) especially coarse-grained: for instance a platform FPGA
**) normally on-chip
***) not like with a CPU
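A minimal C sketch (my own illustration, with assumed names and a toy address pattern) of what an Auto-Sequencing Memory does: a data counter drives a generic address generator (GAG) over a pattern configured before run time, so the data stream is produced without any instruction fetches at run time.

#include <stdio.h>

/* A toy ASM: a RAM block plus a GAG driven by a data counter. */
typedef struct {
    const int *ram;      /* the memory block                            */
    int base, stride;    /* address pattern, configured before run time */
    int count;           /* the data counter                            */
} asm_t;

/* Emit the next element of the data stream (transport-triggered style). */
static int asm_next(asm_t *m) {
    int addr = m->base + m->stride * m->count++;   /* GAG address compute */
    return m->ram[addr];
}

int main(void) {
    int ram[8] = {10, 11, 12, 13, 14, 15, 16, 17};
    asm_t stream = { ram, /*base=*/1, /*stride=*/2, /*count=*/0 };
    for (int i = 0; i < 3; i++)
        printf("%d ", asm_next(&stream));   /* prints: 11 13 15 */
    printf("\n");
    return 0;
}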

Page 57: Speed-ups obtained by Reconfigurable Computing Reiner Hartenstein CAPES/DFG Cooperation on Reconfigurable Computing, inv. talk, Sept 19, 2008, Dept of

© 2008, [email protected] http://hartenstein.de57

Mission of this talk

We need time to space migration.

Since infinite space is not available, we often need partial time to space migration.

Software to hardware mapping (and software to configware mapping) means time to space migration (and von Neumann to anti machine migration).

Page 58: Speed-ups obtained by Reconfigurable Computing Reiner Hartenstein CAPES/DFG Cooperation on Reconfigurable Computing, inv. talk, Sept 19, 2008, Dept of

© 2008, [email protected] http://hartenstein.de58

Morphware: old stuff

structural programming (non-von-Neumann)

1971 PROMs for small logic

1975 PLA

1978 PAL with PALASM tool

1984 first Xilinx FPGA

meanwhile mainstream …

Reiner Hartenstein
don't call it soft hardware!!! programming in space, not in time
Reiner Hartenstein
What Google finds first: politics, law and autism; phone losers of America; prostitution licensing authority?
Reiner Hartenstein
fuse programming unit that instantly generated a custom IC on the designer's desktop
Reiner Hartenstein
the beginning EDA industry (M&C)
Reiner Hartenstein
change resources within milliseconds at the customer's location
Reiner
The fastest growing segment of the microchip market; more recently flooding the automotive electronics market
Reiner
more flexible than a PLA (similar to a memory layout) but much less area-efficient (pure FPGA); wiring overhead: 2 orders of magnitude behind Moore's Law; reconfigurability overhead: 2 orders of magnitude. However, modern FPGAs are platform FPGAs.
Reiner
Hit rate on Google: >9,000,000 for FPGA; a large and growing number of international conferences; mainstream in embedded systems for more than a decade; since 2006 a hot spot at Supercomputing conferences, more recently flooding the automotive electronics market; the fastest growing segment of the microchip market
Page 59: Speed-ups obtained by Reconfigurable Computing Reiner Hartenstein CAPES/DFG Cooperation on Reconfigurable Computing, inv. talk, Sept 19, 2008, Dept of

© 2008, [email protected] http://hartenstein.de

POIIP: Loop turns into pipeline [1979]

[Figure: left, the loop: a CPU plus memory, the loop body executed repeatedly; right, the pipeline: a chain of rDPUs ((reconfigurable) DataPath Units), each stage holding the loop body]

Reiner Hartenstein
I do not want to frighten you
Reiner Hartenstein
this illustration fits well for coarse-grained arrays ..... with FPGAs it can be more complicated
Reiner Hartenstein
memory-cycle-hungry instruction streams (and data fetch/store)
Reiner
no instruction streams: no memory cycles; transport-triggered
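A minimal C sketch (my own illustration, with an assumed toy loop body) of the rule that a loop turns into a pipeline: the von Neumann version re-executes the loop body for every data item, while the pipeline version models one rDPU per stage, all stages working in parallel on successive items of the data stream.

#define N 8        /* length of the data stream               */
#define STAGES 3   /* pipeline depth = trip count of the loop */

/* The loop body: one step of a toy computation. */
int body(int x) { return x * 2 + 1; }

/* von Neumann style: the loop body is re-executed STAGES times per item. */
int run_loop(int x) {
    for (int s = 0; s < STAGES; s++) x = body(x);
    return x;
}

/* Pipeline style: one rDPU per stage; in every clock tick all stages work
   on different data items, so after filling, one result emerges per tick. */
void run_pipeline(const int in[N], int out[N]) {
    int reg[STAGES + 1] = {0};                 /* pipeline registers       */
    for (int t = 0; t < N + STAGES; t++) {     /* clock ticks              */
        for (int s = STAGES; s > 0; s--)       /* all stages "in parallel" */
            reg[s] = body(reg[s - 1]);
        reg[0] = (t < N) ? in[t] : 0;          /* feed the data stream     */
        if (t >= STAGES) out[t - STAGES] = reg[STAGES];
    }
}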
Page 60: Speed-ups obtained by Reconfigurable Computing Reiner Hartenstein CAPES/DFG Cooperation on Reconfigurable Computing, inv. talk, Sept 19, 2008, Dept of

© 2008, [email protected] http://hartenstein.de

super-systolic array

60

(recall this example !)

Legend: rDPU not used / used for routing only / operator and routing / port location marker / backbus connect / backbus connect not used

route through only

supporting any complex free form pipe networks

far beyond just uniform linear pipes

CoDe-X inside [Jürgen Becker]; by KressArray Xplorer [Ulrich Nageldinger]

Reiner Hartenstein
any wild scheme: zig-zag, fork & join, spiral, maze, and many other
Reiner Hartenstein
replacing algebraic synthesis methods by simulated annealing [Rainer Kress]
Page 61: Speed-ups obtained by Reconfigurable Computing Reiner Hartenstein CAPES/DFG Cooperation on Reconfigurable Computing, inv. talk, Sept 19, 2008, Dept of

© 2008, [email protected] http://hartenstein.de

decision box turns into demultiplexer

61

[1967]: PvOIIP

[Figure: a flowchart decision box (ENABLE in, CONDITION, branches B0 / B1) turned into a demultiplexer that steers ENABLE to B0 or B1 depending on CONDITION]

RTM available as a DEC product: 1973

[~1971] (introducing HDLs): "That's so simple! Why did it take 30 years to find out?"

C. G. Bell et al.: IEEE Trans. C-21/5, May 1972; W. A. Clark: 1967 SJCC, AFIPS Conf. Proc.

Reiner
this illustration fits well for coarse-grained arrays ..... with FPGAs it can be more complicated
Reiner Hartenstein
more than 40 years!
Reiner Hartenstein
that's time to space mapping
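A minimal C sketch (my own illustration) of this second rule of thumb: in the time domain a decision box chooses which branch executes next, while in the space domain both branches exist as wired structures and the condition merely steers the ENABLE signal, which is exactly a demultiplexer.

#include <stdio.h>

/* Decision box, time domain: the branch chooses WHICH code runs next. */
static int decision_box(int enable, int condition) {
    if (!enable) return -1;
    return condition ? 1 : 0;    /* go to branch B1 or branch B0 */
}

/* Demultiplexer, space domain: both branches exist as structures;
   the condition only steers the ENABLE signal to one of them. */
static void demux(int enable, int condition, int *enable_b0, int *enable_b1) {
    *enable_b0 = enable && !condition;
    *enable_b1 = enable &&  condition;
}

int main(void) {
    int e0, e1;
    demux(1, 0, &e0, &e1);
    printf("B0=%d B1=%d branch=%d\n", e0, e1, decision_box(1, 0));
    return 0;                     /* prints: B0=1 B1=0 branch=0   */
}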
Page 62: Speed-ups obtained by Reconfigurable Computing Reiner Hartenstein CAPES/DFG Cooperation on Reconfigurable Computing, inv. talk, Sept 19, 2008, Dept of

© 2008, [email protected] http://hartenstein.de62

von Neumann overhead: an example

von Neumann overhead (single CPU):
instruction fetch: instruction stream
state address computation: instruction stream
data address computation: instruction stream
data meet PU + other overhead: instruction stream
i/o to/from off-chip RAM: instruction stream

~94% of the computation load only for moving this window

reconfigurable address generator (GAG): ~20x speed-up

[Figure: a small array of rDPUs]

PISA DRC accelerator [ICCAD 1984], funded by the E.I.S. project (entire project: 15000x speed-up)

Reiner
Design Rule Check accelerator: it is a kind of image processing; 2-D memory address space; hundreds or thousands of Boolean equations per 4-by-4 scan window position; away from von Neumann: 15,000x total speed-up
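A minimal C sketch (my own illustration; the image width and the window size are assumed toy values) of why a generic address generator (GAG) helps here: nested data counters emit the whole scan-window address sequence as a pattern configured before run time, so no instruction stream is spent on address computation.

#include <stdio.h>

#define IMG_W 8   /* assumed image width of the 2-D memory address space */

/* GAG-style nested address pattern: emits the RAM addresses of a w x h
   scan window at position (x0, y0) without per-address instructions. */
static void gag_window(int x0, int y0, int w, int h, void (*emit)(int)) {
    for (int dy = 0; dy < h; dy++)               /* outer data counter */
        for (int dx = 0; dx < w; dx++)           /* inner data counter */
            emit((y0 + dy) * IMG_W + (x0 + dx)); /* linear RAM address */
}

static void print_addr(int addr) { printf("%d ", addr); }

int main(void) {
    gag_window(2, 1, 4, 3, print_addr);   /* here: a 4-by-3 window */
    printf("\n");
    return 0;
}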
Page 63: Speed-ups obtained by Reconfigurable Computing Reiner Hartenstein CAPES/DFG Cooperation on Reconfigurable Computing, inv. talk, Sept 19, 2008, Dept of

© 2008, [email protected] http://hartenstein.de

Data Machine: from old stuff [1979 - ...]

63

[Figure (as on page 40): an (r)DPA fed from all sides by ASMs (Auto-Sequencing Memories); each ASM = RAM + GAG + data counter; ASM: data streams [Kung et al. 1979]; systolic array → super systolic [1995]; new is only its generalization [1989]; (r)DPA [1990]]

Reiner Hartenstein
several data counters instead of a program counter; programmed by Flowware; the data counter is placed in memory** (not with the datapath***); *) especially coarse-grained, for instance a platform FPGA; **) normally on-chip; ***) not like with a CPU
Reiner Hartenstein
1) making it reconfigurable; 2) discard algebraic synthesis methods; 3) add data sequencers -> machine paradigm; 4) with a reconfigurable address generator