
Speed-ups obtained by

Reconfigurable Computing

Reiner Hartenstein

CAPES/DFG Cooperation on Reconfigurable Computing,

inv. talk, Sept 19, 2008, Dept of Mechanical Engineering,

Universidade de Brasilia


slightly modified version

Reiner

The title is not quite right: we have to link this side and the far side (of this paradigm, of course). That is really what the talk is about. Besides, the far side is new to you here.

© 2008, [email protected] http://hartenstein.de

outline


Introduction

Manycore Crisis & von Neumann Syndrome

The Impact of Reconfigurable Computing

Programmer education: new roadmap needed

Conclusions


5 key issues

climate change faster than predicted: by carbon emission, primarily from power plants ?

the manycore programming crisis stalls progress (end of the free ride on the Gordon Moore curve)

technologically stalled Moore‘s Law*

very high and growing computer energy cost – and growing number of power plants needed here


Reconfigurable Computing is a promising alternative

2008: 65, 45, 32 nm [Nick Tredennick (Gilder), 2003]

*) Tom Williams (keynote): the 20 nm wall

Reiner

but this much is certain

Reiner

by ending the GHz race -- stalls progress to affordable HPC (the manycore programming crisis)

end of the free ride on the Gordon Moore Curve

Reiner

the 20 nanometer wall: slowdown in speed advance, copper wiring hits the wall, high-k dielectric manufacturable at all ?

earlier predictions were based only on resolution problems in mask making etc. - but now: material problems

Reiner

A poll by Forrester Research at a conference reveals that firms are increasingly concerned with the impact of energy consumption in their business operations.


History of data processing

• prototyped: 1884 Herman Hollerith


• datastream-based

The first reconfigurable computer

DPU

• 1st Xilinx FPGA 100 years later

Reiner

Herman Hollerith, *29 Feb 1860, Buffalo

hand-crafted configware !

- no program counter - i. e. no CPU: only DPU

- no instruction streams

- no booting

transistor not yet invented

who knows: vacuum tube invented when ?


Configware Programming

manually (Configuration), or by swapping a pre-wired board (Reconfiguration)

no instruction streams

60 years later: RAM available – ferrite cores, motivating the von Neumann paradigm (J. v N, 1946)



fine-grained reconfigurable

forming a wire

switch box

connect to CLB

connect box

CLB: Configurable Logic Box

[Figure: FPGA (Field-Programmable Gate Array) as an array of CLBs, with labels A and B; Xilinx old „island architecture“]

Reiner Hartenstein

has become mainstream; came up 25 years ago

Reiner Hartenstein

explain LUT


Reiner

old island architecture from Xilinx


switch box

RAM-based

configware code loaded before run time into “hidden RAM”

FF: part of “hidden RAM”

[Figure: configuration bits (0 / 1) held in the hidden RAM]

hidden-RAM FPGAs: mainstream for more than a decade

this switch box has 150 transistors & 150 flipflops (FF)

patches even at the customer‘s desk

Reiner Hartenstein

Bill Gates so rich: RAM-based
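The “hidden RAM” idea can be illustrated with a small sketch. It assumes a 2-input look-up table (LUT) as the configurable element; the class name `Lut2` and the bit ordering are hypothetical and not taken from the slides. The configuration bits are loaded once before run time; at run time the inputs merely address the stored bits, so no instruction is fetched.

```python
class Lut2:
    """Hypothetical 2-input look-up table: 4 configuration bits form a truth table."""

    def __init__(self, config_bits):
        # "Configware": loaded before run time into the (hidden) configuration RAM.
        if len(config_bits) != 4 or any(b not in (0, 1) for b in config_bits):
            raise ValueError("expected four bits, each 0 or 1")
        self.ram = list(config_bits)

    def __call__(self, a, b):
        # Run time: the two inputs address the RAM; there is no program counter.
        return self.ram[(a << 1) | b]


AND = Lut2([0, 0, 0, 1])  # the cell configured as AND
XOR = Lut2([0, 1, 1, 0])  # the same kind of cell, configured as XOR

print([AND(a, b) for a in (0, 1) for b in (0, 1)])
print([XOR(a, b) for a in (0, 1) for b in (0, 1)])
```

Changing the function means rewriting the four stored bits, not rewiring the cell, which is the sense in which a patch can be applied at the customer's desk.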


Coarse-grained Reconfigurable Array

Swap (inputs X, Y): if X > Y then swap;

Legend: route-through only; route-through and function (multiplexer)

swap turned into a wiring pattern

Conditional Swap Example

[Figure: comparator (>) on inputs Xi, Yi; two multiplexers with inputs 0 / 1 deliver outputs Xo, Yo]

CFB ! CLB (parallelization of the bubble sort algorithm)

Reiner Hartenstein

e. g. part of bubble sort hardware, derived from C descriptions

ccd = multiplexer

> is relational operator

1: route only

3: function AND route

Reiner

it's a time to space mapping
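The conditional swap can be sketched as one comparator driving two multiplexers. This is a minimal software model under that assumption; the function names `mux`, `conditional_swap` and `bubble_sort` are illustrative only. The sequential `bubble_sort` reuses the single swap unit over time, which is the "time" side of the mapping.

```python
def mux(select, in0, in1):
    """2-to-1 multiplexer: passes in1 when select is true, otherwise in0."""
    return in1 if select else in0


def conditional_swap(xi, yi):
    """One comparator (>) drives two multiplexers: 'if X > Y then swap'."""
    gt = xi > yi
    xo = mux(gt, xi, yi)  # smaller value leaves on the X output
    yo = mux(gt, yi, xi)  # larger value leaves on the Y output
    return xo, yo


def bubble_sort(values):
    """Time-domain version: a single conditional swap unit reused step after step."""
    data = list(values)
    n = len(data)
    for _ in range(n):
        for i in range(n - 1):
            data[i], data[i + 1] = conditional_swap(data[i], data[i + 1])
    return data


print(conditional_swap(3, 1))
print(bubble_sort([4, 2, 5, 1, 3]))
```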


Another coarse-grained r-Array


SNN Filter on supersystolic Array: mainly a Pipe Network

Legend: rDPU not used; used for routing only; operator and routing; port location marker; backbus connect; backbus connect not used

array size: 10 x 16

rDPU: reconfigurable Data Path Unit, 32 bits wide

no CPU

(99% placement efficiency)

CFB ! route-through only

CoDe-X inside [Jürgen Becker]

by KressArray Xplorer [Ulrich Nageldinger]

Reiner

principles go back to the systolic array, 1979

supersystolic means antimachine paradigm (counterpart of von Neumann) - see later

Reiner

98 % placement efficiency


Platform FPGA

256 – 1704 BGA

56 – 424 fast on-chip Block RAMs: BRAMs

8 – 32 fast serial I/O channels

DPUs

Configware code input

[courtesy Lattice Semiconductor]

Reiner Hartenstein

Now, large FPGAs featuring special-purpose logic such as dedicated multipliers and on-chip memories embedded into the logic fabric, have become attractive platforms for accelerating kernels in scientific applications.


Reconfigurable Supercomputing


Silicon Graphics: Reconfigurable Application-Specific Computing (RASC™)

Cray XD1

• Xilinx Virtex-II Pro

• Library by Cray

Supercomputing 2007, Reno, Nevada, USA: 9600 registered attendees, 440 exhibitors

Chuck Thacker … (even Microsoft working at it) (Lab in Cambridge, UK, etc.).


What Configware means

traditional Computing (time domain): Software Source, Software Compiler, Software Code (instruction-procedural)

Reconfigurable Computing (space domain): Configware Source, mapper (Placement & Routing), Configware Code (structural: space domain); data scheduler, Flowware Code (data-procedural)

Software to Configware Migration


outline

Introduction

The Manycore Crisis & the von Neumann Syndrome

The Impact of Reconfigurable Computing

Programmer education: new roadmap needed

Conclusions


Many-core: Break-through or Breakdown?

Industry is facing a disruptive turning point

Intel’s vision: MultiCore

The stakes are high ...

HPC users lack understanding in basic precepts*

*) PRACE consortium (Partnership foR Advanced Computing in Europe) http://www.prace-project.eu/documents/D3.3.1_document_final.pdf

it‘s an education, qualification, and a R&D problem

“could reset µP HW & SW roadmaps for next 30 years”, [David Patterson]

forcing a historic transition to a parallel programming model yet to be invented [David Callahan]

„I would be panicked if I were in industry“ [John Hennessy]

Reiner Hartenstein

Intel and Sun keynotes [DAC’08] admit that there are problems with manycore programming, a lack of suitable software, and curbing by memory latency.

Reiner Hartenstein

who leads Microsoft's Parallel Computing Initiative,

Reiner

[Dave Patterson]: „Intel has thrown a Hail Mary pass* and nobody is running yet“

*) a Hail Mary pass in American football is a forward pass made in desperation, with a very small chance of success

Reiner

Multi-threading, transactional memory, register re-naming, speculative tricks, multiple super scalarism, out-of-order execution, …: no silver bullets.

The secret of success of the IBM Cell processor

Reiner

The stakes are high. If research does not find efficient parallel techniques, programming will become so difficult that people will not benefit from the new hardware: from growth industry to replacement industry ?

Hennessy: I would be panicked if I were in industry

Reiner

HPC: lacking or missing multicore programming qualifications*

Reiner

productivity goes down disproportionately with the number of processes

In particular HPC application domains, massive parallelism requires 10 – 30 professionals in multi-disciplinary, multi-institutional teams for 5 - 10 years [Douglass Post, DoD HPCMP, panelist at SC07]

Software done: machine obsolete


Declining Programmer Productivity

In particular HPC application domains, massive parallelism requires 10 – 30 professionals in multi-disciplinary, multi-institutional teams for 5 - 10 years [Douglass Post, DoD HPCMP, panelist at SC07]

The Law of More: programmer productivity declines disproportionately with increasing parallelism

Software done: machine obsolete


The von Neumann Syndrome

More power for creating foam than to accelerate the vessel ?


Massive Overhead Phenomena

[Figure: single-core CPU]

Dijkstra 1968: The Goto considered harmful

Koch et al. 1975: The universal Bus considered harmful

Backus, 1978: Can programming be liberated from the von Neumann style?

Arvind et al., 1983: A critique of Multiprocessing the von Neumann Style

overhead piling up to code sizes of astronomic dimensions

“von Neumann Syndrome” (2006: C.V. “RAM” Ramamoorthy)

1986, E.I.S. Projekt: 94% for address computation

total speed-up: x 15000

„a terrifying number of processes running in parallel, create sequential-processing bottlenecks and losses in data locality“

2008 - David Callahan:

overhead                        von Neumann machine
instruction fetch               instruction stream
state address computation       instruction stream
data address computation        instruction stream
data meet PU + other overh.     instruction stream
i / o to / from off-chip RAM    instruction stream
(list not complete)

C++ compiler

virtualization

many other features

Reiner Hartenstein

bus = von Neumann bottleneck

Reiner Hartenstein

criticizing the program counter's flexibility

Reiner Hartenstein

PISA DRC accelerator [ICCAD 1984]

94% computation load only for moving a 4-by-3 window (kind of image processing)

(entire project: 15000x speed-up)

funded by E.I.S. Projekt (German M-&-C)
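A back-of-the-envelope reading of the 94% figure, as a sketch only. It assumes, hypothetically, that the address-computation load could be removed completely while everything else stayed unchanged; the function name `max_speedup` is illustrative and not from the project.

```python
def max_speedup(removed_fraction):
    """Upper bound on speed-up when a fraction of the load disappears entirely."""
    if not 0.0 <= removed_fraction < 1.0:
        raise ValueError("fraction must be in [0, 1)")
    return 1.0 / (1.0 - removed_fraction)


bound = max_speedup(0.94)   # assumption: all address computation is eliminated
remaining = 15000 / bound   # factor that would still have to come from elsewhere

print(round(bound, 1))
print(round(remaining))
```

Under this assumption, dropping the address computation alone would account for a factor of roughly 17; the rest of the reported x 15000 would have to come from other sources, for instance parallel datapaths.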

Reiner

David Callahan joined Microsoft in late 2005. He is part of a cross-divisional team that is looking forward to the coming surge of multi-core processors that will make parallel-computing ubiquitous in home and office. This is a tremendous opportunity for Microsoft to exploit this fundamental shift in programming and how systems will be used to enable new user experiences and capabilities in all our business areas. Callahan’s particular strengths are in programming languages, programming techniques, and compilation techniques focused on expression and exploiting concurrency.


manycore von Neumann: arrays of massive overhead phenomena

[Figure: a single von Neumann CPU next to a manycore array of CPUs]

proportionate to the number of processors

disproportionate to the number of processors

fast on-chip memory cannot store such huge instruction code blocks

overhead                        von Neumann machine
data meet PU + other overh.     instruction stream
i / o to / from off-chip RAM    instruction stream
Inter PU communication          instruction stream
message passing overhead        instruction stream
transactional memory overh.     instruction stream
multithreading overhead etc.    instruction stream

Reiner Hartenstein

HTM overhead, ISCA07 abstract: Hardware Transactional Memory (HTM) systems reflect choices from three key design dimensions: conflict detection, version management, and conflict resolution. Previously proposed HTMs represent three points in this design space: lazy conflict detection, lazy version management, committer wins (LL); eager conflict detection, lazy version management, requester wins (EL); and eager conflict detection, eager version management, and requester stalls with conservative deadlock avoidance (EE).

Reiner

coming: 16 cores per chip, or 32, or 30

increase by x2, every 2 years!

permanent compatibility problems

Reiner Hartenstein

fast on-chip memory is far too small for such code packages of astronomic dimensions

slow off-chip memory allows no way around the memory wall


outline

Introduction

Manycore Crisis & von Neumann Syndrome

The Impact of Reconfigurable Computing


Programmer education: new roadmap needed

Conclusions


Speed-up factors obtained by Software to Configware migration

[Chart: speed-up factor (log scale, 10^0 to 10^6) over the years 1980 to 2010; FPGA; reference line: x 2/yr]

DSP and wireless: Reed-Solomon Decoding 2400; MAC 1000; Viterbi Decoding 400; FFT 100

Image processing, Pattern matching, Multimedia: real-time face detection 6000; video-rate stereo vision 900; pattern recognition 730; SPIHT wavelet-based image compression 457

Bioinformatics: Smith-Waterman pattern matching 288; BLAST 52; protein identification 40

Astrophysics: GRAPE 20

crypto 1000; DES breaking 28500

molecular dynamics simulation 88

3000 (label not recoverable)

Reiner Hartenstein

Success with RC has been achieved in a variety of areas such as signal and image processing, cryptology, communications processing, data and text mining, and global optimization, for a variety of platform types, from high-end systems on earth to mission-critical systems in space.


Accelerator card from Bruchsal


• I/O Bandwidth: 50 GByte/s

• Manufacturer: SIEMENS Bruchsal

16 FPGAs

Tera means 10^12 or 1 000 000 000 000 (1 trillion)

MAC means Multiply and ACcumulate

• 1.5 TeraMAC/s


Energy saving factors obtained by software to configware migration

[Chart: same applications and speed-up factors as in the speed-up chart (1980 to 2010, log scale, FPGA, x 2/yr), with two energy saving factors* added: 3440 and 300]

Energy saving: almost x10 less than speed-up …

… could be improved

Reiner

earlier papers do not report energy factors


von Neumann overhead vs. Reconfigurable Computing

overhead                        von Neumann machine    anti machine
instruction fetch               instruction stream     none*
state address computation       instruction stream     none*
data address computation        instruction stream     none*
data meet PU + other overh.     instruction stream     none*
i / o to / from off-chip RAM    instruction stream     none*
Inter PU communication          instruction stream     none*
message passing overhead        instruction stream     none*
transactional memory overh.     instruction stream     none*
multithreading overhead etc.    instruction stream     none*

*) configured before run time

using reconfigurable data counters

no instruction fetch at run time

[Figure: rDPA: reconfigurable datapath array (coarse-grained rec.) of rDPUs, next to a manycore array of CPUs]


Data meet the processor (CPU)


by Software

inefficient transport over off-Chip-memory by memory-cycle-hungry instruction streams

This is just one of many von Neumann-Overhead-Phenomena

illustrating von Neumann syndrome


Data meet the CPU

by Flowware

Placement of the execution locality (not moving data)

within pipe network: generated by the Configware-Compiler*

illustrating acceleration

*) before run time (at compile time)

Reiner

processing unit or DPU: DataPath Unit


What did we learn?

There are 2 kinds of datastreams:

“Dataflow machine” would be a nice term, but was introduced by a different scene*

1) indirectly moved by an instruction stream machine (von Neumann): extremely inefficient

2) directly moved by a datastream machine (from Reconfigurable Computing): very efficient

*) meanwhile dead: not really a dataflow machine, but had used compilers accepting a dataflow language


What else did we learn?

There are 2 kinds of parallelism:

1) Concurrent processes: instruction stream parallelism (CPU manycores): inefficient

2) Data parallelism by parallel datastreams (in Reconfigurable Computing Systems): efficient

Conclusion:

- Data parallelism brings the performance (we do data processing !)


What Parallelism?

data parallelism (array of rDPUs): no von Neumann bottleneck

instruction parallelism (array of CPUs): many von Neumann bottlenecks

[Hartenstein’s watering can model]


Put old ideas into practice (POIIP)

... „The biggest payoff will come from putting old ideas into practice and teaching people how to apply them properly.“ [David Parnas]

“We need a complete re-definition of CS” [Burton Smith and other celebrities]

Wrong! I do not agree. [Reiner Hartenstein]

finding out, that ...

“We need a complete re-definition of curriculum recommendations - missing several key issues.” [Reiner Hartenstein]


outline

Introduction

Manycore Crisis & von Neumann Syndrome

The Impact of Reconfigurable Computing

Programmer education: new roadmap needed

Conclusions


Fighting against obsolete curricula?

Real-Time Systems (Sweden)

Recommendations for Designing new ICT Curricula

Workshop on Embedded Systems Education

WESE

Chess – Center for Hybrid and Embedded Software Systems (courses in embedded systems)

Graduate Curriculum on Embedded Software and Systems (EU)

Advanced Real Time Systems

The Embedded Systems Approach?

… support their own educational approach

„You can always teach programming to a hardware guy ... but you can never teach hardware to a programmer“

it‘s not the programmer‘s fault: it‘s due to obsolete CS curricula

Reiner

you can always teach programming to a hardware guy, but you can never teach hardware to a programmer

it is not the programmer's fault - it is the fault of his / her educators


CS is a Monster

fully wrong educational mainstream approaches:

1) the basic mind set is exclusively instruction-stream-oriented - data streams are considered exotic

2) mapping parallelism into the time domain – abstracting away the space domain is fatal

We need a dual-rail education

Reiner

e. g. threads

we need time to space mapping instead of abstracting it away

Reiner Hartenstein

as long as the space domain is abstracted away, all that stuff is implemented by instruction streams!

MPI takes ~ 50% of computation time [RAW86]

von Neumann syndrome

Reiner

we need data parallelism as a mainstream

beware of "data flow" (indeterministic) - is dead


We need to POIIP for:

Software to Hardware Migration and Software to Configware Migration:

2 key rules of thumb - terrifically simple:

1) loop turns into pipeline [1979]

2) decision box turns into demultiplexer [1967]: PvOIIP

Reiner

this illustration fits well for coarse-grained ..... with FPGAs it can be more complicated
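The two rules of thumb can be sketched in software. This is only a structural model under stated assumptions: Python generators still execute sequentially, and the names `stage`, `demux_stage`, `software_loop` and `pipeline`, as well as the example operators, are hypothetical. The loop body becomes a chain of fixed stages, and the decision box becomes a selector between two branch operators that both exist side by side.

```python
def stage(op, stream):
    """One pipeline stage: a fixed operator applied to every item flowing through."""
    for item in stream:
        yield op(item)


def demux_stage(predicate, op_true, op_false, stream):
    """Decision box as (de)multiplexer: both branch operators exist in space;
    the predicate only selects which path a data item takes."""
    for item in stream:
        yield op_true(item) if predicate(item) else op_false(item)


def software_loop(data):
    """Time-domain reference: an instruction loop with a decision box."""
    out = []
    for x in data:
        x = x + 1
        if x % 2 == 0:
            x = x // 2
        else:
            x = 3 * x
        out.append(x)
    return out


def pipeline(data):
    """Space-domain structure: loop body unrolled into stages fed by a data stream."""
    s1 = stage(lambda x: x + 1, data)
    s2 = demux_stage(lambda x: x % 2 == 0, lambda x: x // 2, lambda x: 3 * x, s1)
    return list(s2)


print(software_loop([1, 2, 3, 4]))
print(pipeline([1, 2, 3, 4]))
```

Both versions compute the same result; what differs is that the pipeline has no loop counter or branch instruction of its own, only stages that data items pass through.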


Two Dichotomies

Dichotomy = mutual allocation to two opposed domains such that a third domain is excluded.

The dichotomy model as an educational orientation guide for dual-rail education to overcome the software/configware chasm & the software/hardware chasm:

1) Machine Paradigm Dichotomy (von Neumann / Dataflow machine*): the „Twin Paradigm“ model

2) Relativity Dichotomy: time domain / space domain – helps parallelization by time to space mapping

*) see definition

Reiner Hartenstein

(RC) .... it‘s an alternative culture ....

now the area is going mainstream: a rapidly widening audience of non-specialists gets interested ...

severe communication gaps due to educational deficits, not only to users: still many hardware and EDA experts ask: isn’t it just logic design on a strange platform ?

it is time to clarify and popularize fundamental aspects and to explain that it is a fundamentally different culture --- Dichotomy helps to understand it

Reiner

Dichotomy is the solution of a dilemma - the CS education dilemma


Def.: Dataflow Machine

The old „Dataflow Machine“ research scene is dead.

indeterministic: unpredictable order of execution; sequential execution: not really a dataflow machine

had used compilers accepting a dataflow language

we re-define this term: counterpart of von Neumann: deterministic, with data counters (no program counter)



1 ) Paradigm Dichotomy

The Twin Paradigm Approach (TTPA) (procedural dichotomy)

                      instruction domain    datastream domain
machine               CPU                   (r)DPA
sequencer             program counter       data counter
instruction stream    +                     -
data stream           -                     +


Paradigm Dichotomy

The Twin Paradigm Approach (TTPA) (procedural dichotomy)

instruction domain: CPU, program counter, instruction stream

datastream domain: (r)DPA, data counter, data stream

Asymmetry: data parallelism

we need data parallelism

Reiner Hartenstein

the only asymmetry in our dichotomy


Data Machine: from old stuff [1979 - ...]

[Figure: (r)DPA surrounded by ASM blocks; data streams (marked x, -, |) flow into and out of the array]

ASM: Auto-Sequencing Memory (RAM, data counter, GAG)

Data streams [Kung et al. 1979]

New is only: its generalization [1989]

systolic array, super systolic [1995]

[1995]

[1990]

Reiner Hartenstein

several data counters instead of a program counter

programmed by Flowware

the data counter: placed in memory** (not with datapath***)

*) especially coarse-grained: for instance: platform FPGA

**) normally on-chip

***) not like with CPU

Reiner Hartenstein

1) making it reconfigurable

2) discard algebraic synthesis methods

3) add data sequencers -> machine paradigm

4) with reconfigurable address generator
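A data counter in place of a program counter can be sketched as follows. The sketch assumes a simple 2-D scan pattern as the configured address sequence; `address_generator` and `AutoSequencingMemory` are hypothetical names chosen for illustration, not the actual GAG or ASM design.

```python
def address_generator(base, rows, cols, row_stride, col_stride=1):
    """Hypothetical generic address generator (GAG): a data counter stepping
    through a 2-D scan pattern that is fixed before run time."""
    for r in range(rows):
        for c in range(cols):
            yield base + r * row_stride + c * col_stride


class AutoSequencingMemory:
    """Hypothetical ASM: a RAM block that sequences itself via its own data counter."""

    def __init__(self, ram, addresses):
        self.ram = ram
        self.addresses = addresses

    def stream(self):
        # The memory emits a data stream; the datapath never computes an address.
        for addr in self.addresses:
            yield self.ram[addr]


ram = list(range(100, 120))  # 20 words, read as a 4 x 5 image (row stride 5)
asm = AutoSequencingMemory(
    ram, address_generator(base=6, rows=2, cols=3, row_stride=5)
)
print(list(asm.stream()))
```

The scan parameters play the role of flowware here: they describe which data arrive when, while the operators that consume the stream are configured separately.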


Procedural Languages Twins

imperative Software Languages       (super) systolic Flowware Languages
read next instruction               read next data item
goto (instruction address)          goto (data address)
jump to (instruction address)       jump to (data address)
instruction loop                    data loop
instruction loop nesting            data loop nesting
instruction loop escape             data loop escape
instruction stream branching        data stream branching
no: no internally parallel loops    yes: internally parallel loops

But there is the Asymmetry: program counter vs. data counter(s) for data parallelism

Reiner Hartenstein

instruction-procedural

Reiner Hartenstein

data-procedural

Reiner Hartenstein

without instruction parallelism

without instruction streams!


Relativity Dichotomy

time domain: procedure domain; space domain: structure domain

von Neumann Machine (time), 2 phases:

1) programming instruction streams

2) run time

Anti Machine (time/space), 3 phases:

1) reconfiguration of structures (space)

2) programming data streams (time)

3) run time

Reiner

The theory of relativity deals with the structure of space and time. The special theory of relativity deals with the relativity of space and time.

Because space is finite, we need a time to time/space mapping.


time-iterative to space-iterative

a time to space mapping: n time steps, 1 CPU turn into 1 time step, n DPUs

Often the space dimension is limited (e.g. because of the chip size)

a time to space/time mapping: k*n time steps, 1 CPU turn into k time steps, n DPUs ( n = length of pipeline )

Strip mining [D. Loveman, J-ACM, 1977]

loop transformation methodology: 70ies and later

POIIP
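The step counts above can be checked with a small model of strip mining. It is a sketch under simplifying assumptions: each DPU handles one data item per time step, the items are independent, and the names `strip_mine`, `n_dpus` and `op` are hypothetical.

```python
def strip_mine(data, n_dpus, op):
    """Apply op to every item in strips of at most n_dpus items.

    Returns (results, time_steps). Items within one strip are assumed to be
    processed simultaneously, one per DPU, so each strip costs one time step.
    """
    if n_dpus < 1:
        raise ValueError("need at least one DPU")
    results = []
    time_steps = 0
    for start in range(0, len(data), n_dpus):
        strip = data[start:start + n_dpus]
        results.extend(op(x) for x in strip)  # conceptually in parallel
        time_steps += 1
    return results, time_steps


data = list(range(12))  # k * n = 12 items
print(strip_mine(data, 1, lambda x: x * x)[1])  # 1 CPU
print(strip_mine(data, 4, lambda x: x * x)[1])  # n = 4 DPUs
```

With 12 items, one processing unit needs 12 time steps, while 4 DPUs need k = 3 time steps; the results are identical in both cases.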




outline

Introduction

Manycore Crisis & von Neumann Syndrome

The Impact of Reconfigurable Computing

Programmer education: new roadmap needed


Conclusions


Conclusions (1)

We massively need programmable accelerator co-processors.

Established technologies are available, and we can still use standard software and its tools.

Configware skills and basic hardware knowledge are essential qualifications for programmers.

We need a massive Migration of Software to Configware. To cope with the implementation wall: to cope with the programmer population‘s unsustainable skills mismatches

Reiner Hartenstein

modern programmers:

Yesterday’s programmers: people understanding only software.

embedded engineers: people who combine understanding of software and hardware.

Programmers today: people who combine understanding of software and configware.

understanding configware requires some understanding of hardware.

combining understanding of software, configware, and hardware?

Reiner Hartenstein

Hennessy: “… parallelism and ease of use of truly parallel computers: a problem that’s as hard as any that computer science has faced. … I would be panicked if I were in industry.”

fund [only] three universities to get underway – Berkeley, Illinois, and Stanford - the need for a major, government-sponsored attack on the multicore challenge

Since the first commercial computer in 1950, cost-performance of computing has improved by about 100 billion overall, using the rapidly increasing transistor speed the last 20 years

arguing for a new Manhattan project: was TOP SECRET - academic career to support the war.

Dave Patterson: need for a major, government-sponsored attack on the multicore challenge

Alwyn Goodloe: I think “ease of use” [Hennessy] is really key in the effort. Most CIOs will probably say that recompiling the ole dusty deck is as much as they are willing to do. We probably need to say exactly what we mean by ease of use.

Reiner

Reiner, 03.09.2008: we cannot switch quickly and fully to a disruptive paradigm shift. The enormous burden of legacy software requires a mass movement ...

We need a twin paradigm approach, mainly based on computing wisdom that our curricula have mostly ignored for decades.

This calls for a mass movement into a run-away computing revolution causing a quick, wide, and strong impact, as known from the VLSI design revolution à la Mead & Conway.


Conclusions (2)

CS education is a monster !

Jaw-dropping sclerosis of curriculum taskforces

We need a complete re-definition of CS education

CS should learn a lot from Embedded Systems, like in Mechanical Engineering

Fully wrong educational mainstream approaches

We urgently need Dual-Rail Education

Reiner

Solution impossible based on von-Neumann-only

The CS education dilemma

Typical programmers: von-Neumann-dominated mind set

Missing skills for time to space mapping

Missing skills for software to configware migration

We need a dichotomy-based CS education approach

Reiner

why this is a difficult problem:

we are caught in a severe educational dilemma

curriculum taskforces fully ignoring the requirements of the de facto job market

being blockheads: stubborn like a donkey

Peter Denning: don't want to discuss any details

it's criminal: recruiting accreditation

Reiner

curricula - not CS theory

Reason of a decade of enrolment decline

our CS programs mainly stress subject areas being outsourcing candidates - compared to the more hardware-intensive past

e. g. embedded system qualifications acquired by on the job training. Expert then hired by a competitor.

no outsourcing candidate subject not taught

Reiner

The Monster paradigm: the von Neumann (vN) model has become an insatiable monster

CS courses are more like IT*, focusing on databases, web design and java; preparing for jobs commonly off-shored; seem to teach that ... memory and processing cycles are infinite; no understanding of architecture or assembly code

*) These imperfect courses even fail to deliver. [Mike Anderson]

unsustainable skills mismatch: we need a change from the top [D- Selwood, ETJ]

we need people understanding both, software and configware

even special job centers for FPGA engineers

Reiner

We are in the most disruptive development in the entire history of computing

because of skill mismatch we cannot meet the challenges of the historic turning point in computer industry


thank you for your patience



END



backup for discussion:



time to space mapping

time domain: procedure domain; space domain: structure domain

time algorithm: program loop, n time steps, 1 CPU

space algorithm: pipeline, 1 time step, n DPUs

time algorithm: Bubble Sort, n x k time steps, 1 „conditional swap“ unit

space/time algorithm: Shuffle Sort, k time steps, n „conditional swap“ units

[Figure: a single conditional swap unit with inputs x and y, next to a column of conditional swap units]

Reiner

The theory of relativity deals with the structure of space and time. The special theory of relativity deals with the relativity of space and time.


Architecture instead of synchro

„Shuffle Sort“ Example

direct time to space mapping: accessing conflicts

modification: with shuffle function

Better architecture instead of complex synchronisation: half the number of blocks + up and down of data (shuffle function) – no von Neumann syndrome !

[Figure: columns of conditional swap units, before and after the modification]
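One possible reading of the shuffle idea, offered as an assumption rather than as the architecture on the slide: a single column of n/2 conditional swap units is reused in every step, and the data are shifted by one position on alternate steps (the "up and down" movement). This corresponds to the well-known odd-even transposition sort. The names `shuffle_sort` and `units` are illustrative.

```python
def conditional_swap(x, y):
    """'if X > Y then swap': returns the pair in ascending order."""
    return (y, x) if x > y else (x, y)


def shuffle_sort(values):
    """Sort with n // 2 conditional swap units and an alternating one-position shift.

    Assumed model: in each step all units work side by side on disjoint pairs;
    the shift on alternate steps lets every neighbouring pair meet a unit.
    Returns (sorted_list, number_of_units).
    """
    data = list(values)
    n = len(data)
    units = n // 2
    for step in range(n):
        offset = step % 2  # the "shuffle": data moved by one position
        for u in range(units):
            i = offset + 2 * u
            if i + 1 < n:
                data[i], data[i + 1] = conditional_swap(data[i], data[i + 1])
    return data, units


print(shuffle_sort([5, 1, 4, 2, 3]))
```

Because the pairs handled within one step never overlap, the units need no synchronisation among themselves in this model; ordering is enforced by the wiring pattern alone.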




Transformations since the 70ies

time domain: procedure domain; space domain: structure domain

time algorithm: program loop, n x k time steps, 1 CPU

space/time algorithm: pipeline, k time steps, n DPUs

Strip Mining Transformation

loop transformations: rich methodology published [survey: Diss. Karin Schmidt, 1994, Shaker Verlag]




Education Revolution: the Microelectronics Design Revolution

Carver Mead, Lynn Conway [1980]

as a model (in Germany: the E.I.S. project)

traditional division of labor: breadth of specialization; fragmentation; submission, rejection

levels: Application, RT level, Logic level, Switching level, Circuit level, Layout level, in-house technology

The new M-&-C division of labor: the tall thin man; breadth of specialization strongly reduced; coherence; Application; Silicon Foundry (external technology)

Clearing out & intuitive models to resolve the education dilemma

Emphasis on “Systems”

Reiner

10 years of preaching and taking a beating

Reiner Hartenstein

bureaucracy; submit; refer back; communication problems; clearing out urgently needed


Education Revolution: Reconfigurable Computing Revolution

Application level

Program level

the tall thin man*: > Dichotomy <

Christophe Bobda: The new Mead & Conway ?

clearing out: von-Neumann-Paradigm (instructionstream-based)

clearing out: Anti machine Paradigm (datastream-based)

Twin Paradigm

*) or “tall thin woman”

Reiner Hartenstein

Introduction to RC Systems (proposed)

Christophe Bobda: the new M&C


Who generates the data streams?

55

[Figure: "systolic" array with data streams (x, -, | symbols)]

Without a sequencer it's not a machine!


The Anti Machine

56

several data counters instead of a program counter

the data counter: placed in memory** (not with the datapath***)

[Figure: (r)DPA* with ASM blocks and data streams (x, -, | symbols)]

ASM: Auto-Sequencing Memory

*) especially coarse-grained; for instance: platform FPGA

**) normally on-chip

***) not like with CPU


programmed by Flowware

Super-systolic Array

RAM

data counter

GAG

(KressArray)
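A rough software model of an auto-sequencing memory may make the idea concrete. It is only a sketch under assumptions: the class `AutoSequencingMemory`, its `start`/`stride`/`count` parameters, and the function `dpu_multiply_accumulate` are hypothetical names, and a real GAG supports far richer address patterns than a fixed stride.

```python
class AutoSequencingMemory:
    """RAM block with its own data counter (hypothetical model)."""

    def __init__(self, ram, start=0, stride=1, count=None):
        self.ram = list(ram)
        self.start = start
        self.stride = stride
        self.count = len(self.ram) if count is None else count

    def addresses(self):
        # The data counter sits next to the memory and produces the
        # address sequence itself; no instruction fetch is involved.
        addr = self.start
        for _ in range(self.count):
            yield addr
            addr += self.stride

    def stream(self):
        """The data stream this memory delivers to the datapath array."""
        for addr in self.addresses():
            yield self.ram[addr]


def dpu_multiply_accumulate(stream_a, stream_b):
    """A datapath unit with a fixed, configured operation.

    It has no program counter: it just processes data as it arrives.
    """
    acc = 0
    for a, b in zip(stream_a, stream_b):
        acc += a * b
    return acc


asm_a = AutoSequencingMemory([1, 2, 3, 4])
asm_b = AutoSequencingMemory([10, 20, 30, 40, 50, 60, 70, 80],
                             stride=2, count=4)
print(dpu_multiply_accumulate(asm_a.stream(), asm_b.stream()))
```

The sequencing information (here `start`, `stride`, `count`) plays the role that the slides call Flowware: it schedules data streams rather than instructions.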


Mission of this talk

We need time to space migration

since infinite space is not available, we often need partial time to space migration

software to hardware mapping (and software to configware mapping) means time to space migration

(and von Neumann to anti machine migration)
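Partial time to space migration can be sketched as strip mining: the work is cut into strips that fit the available space, and the strips are processed one after another in time. The function name `strip_mined_map` and the parameter `num_dpus` are hypothetical, and the model assumes that all DPUs of one strip finish within a single time step.

```python
def strip_mined_map(op, data, num_dpus):
    """Apply op to every element with at most num_dpus DPUs in parallel."""
    if num_dpus < 1:
        raise ValueError("at least one DPU is required")
    results, time_steps = [], 0
    for start in range(0, len(data), num_dpus):
        strip = data[start:start + num_dpus]   # fits the available space
        results.extend(op(x) for x in strip)   # conceptually simultaneous
        time_steps += 1
    return results, time_steps


data = list(range(10))
for dpus in (1, 4, 10):
    _, steps = strip_mined_map(lambda x: x * x, data, dpus)
    print(dpus, "DPUs:", steps, "time steps")
```

With one DPU everything stays in the time domain, with ten DPUs everything is mapped to space, and the values in between are the partial migration the slide refers to.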


Morphware: old stuff

structural programming (non-von-Neumann)

1971 PROMs for small logic

1975 PLA

1978 PAL with PALASM tool

1984 first Xilinx FPGA

meanwhile mainstream …

Reiner Hartenstein

don't call it soft hardware!

programming in space, not in time

Reiner Hartenstein

What Google finds first: politics, law and autism; Phone Losers of America; Prostitution Licensing Authority?

Reiner Hartenstein

fuse programming unit that instantly generated a custom IC on the designer's desktop

Reiner Hartenstein

beginning EDA industry (M&C)

Reiner Hartenstein

change resources within milliseconds at the customer's location

Reiner

The fastest growing segment of the microchip market

more recently flooding the automotive electronics market

Reiner

more flexible than PLA (similar to memory layout), but much less area-efficient (pure FPGA)

wiring overhead: 2 orders of magnitude behind Moore's Law

reconfigurability overhead: 2 orders of magnitude

However, modern FPGAs are platform FPGAs

Reiner

Hit rate by Google: >9,000,000 for FPGA

Large and growing number of international conferences

Mainstream in embedded systems for more than a decade

Since 2006 a hot spot at Supercomputing Conferences, more recently flooding the automotive electronics market

The fastest growing segment of the microchip market


POIIP: Loop turns into pipeline [1979]

rDPU: (reconfigurable) DataPath Unit

[Figure: a loop (loop body executed by CPU and Memory) turns into a pipeline of rDPUs]

Reiner Hartenstein

I do not want to frighten you

Reiner Hartenstein

memory-cycle-hungry instruction streams (and data fetch/store)

Reiner

no instruction streams: no memory cycles; transport-triggered


super-systolic array

60

(recall this example !)

Legend: rDPU not used; used for routing only (route-through only); operator and routing; port location marker; backbus connect

supporting any complex free form pipe networks

far beyond just uniform linear pipes

CoDe-X inside [Jürgen Becker]

by KressArray Xplorer [Ulrich Nageldinger]

Reiner Hartenstein

any wild scheme: zig-zag, fork & join, spiral, maze, and many others

Reiner Hartenstein

replacing algebraic synthesis methods by simulated annealing [Rainer Kress]
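How simulated annealing can place operators on a datapath array is sketched below. This is a generic textbook-style illustration and not the KressArray Xplorer algorithm; the cost function (Manhattan wire length), the cooling schedule, and the names `wire_length` and `anneal_placement` are all assumptions made for this example.

```python
import math
import random


def wire_length(placement, nets):
    """Total Manhattan distance over all connected operator pairs."""
    return sum(abs(placement[a][0] - placement[b][0]) +
               abs(placement[a][1] - placement[b][1]) for a, b in nets)


def anneal_placement(operators, nets, rows, cols, steps=5000, seed=0):
    """Place operators on a rows x cols array (needs at least 2 cells)."""
    if len(operators) > rows * cols:
        raise ValueError("array too small for the operators")
    rng = random.Random(seed)
    slots = list(operators) + [None] * (rows * cols - len(operators))
    rng.shuffle(slots)

    def to_placement(current):
        return {op: divmod(i, cols)
                for i, op in enumerate(current) if op is not None}

    cost = initial_cost = wire_length(to_placement(slots), nets)
    best_slots, best_cost = list(slots), cost
    temperature = float(rows + cols)
    for _ in range(steps):
        i, j = rng.sample(range(len(slots)), 2)
        slots[i], slots[j] = slots[j], slots[i]        # trial move
        new_cost = wire_length(to_placement(slots), nets)
        delta = new_cost - cost
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            cost = new_cost                            # accept the move
            if cost < best_cost:
                best_slots, best_cost = list(slots), cost
        else:
            slots[i], slots[j] = slots[j], slots[i]    # reject: undo
        temperature = max(temperature * 0.999, 1e-3)   # cool down
    return to_placement(best_slots), initial_cost, best_cost


ops = list("abcdef")
nets = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e"), ("e", "f")]
placement, before, after = anneal_placement(ops, nets, rows=3, cols=3)
print(before, "->", after, placement)
```

Because worse moves are occasionally accepted while the temperature is high, the search can reach irregular layouts (zig-zag, spiral and so on) that a regular algebraic mapping would not produce.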


decision box turns into demultiplexer

61

[1967] PvOIIP

[Figure: decision box (ENABLE, CONDITION, branches B0 and B1) and the corresponding demultiplexer (ENABLE, CONDITION 0/1, outputs B0 and B1)]

RTM as a DEC product available: 1973

[~1971] (introducing HDLs): "That's so simple! Why did it take 30 years to find out?"

C. G. Bell et al.: IEEE Trans. C-21/5, May 1972

W. A. Clark: 1967 SJCC, AFIPS Conf. Proc.

Reiner


Reiner Hartenstein

more than 40 years !

Reiner Hartenstein

that's time to space mapping
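The correspondence between decision box and demultiplexer can be written down as a small truth-table sketch. The function names are hypothetical, and the assignment of CONDITION = 0 to B0 and CONDITION = 1 to B1 is an assumption read off the figure labels.

```python
def decision_box(enable, condition):
    """Time domain: control flow selects the ONE branch executed next."""
    if not enable:
        return None
    return "B1" if condition else "B0"


def demultiplexer(enable, condition):
    """Space domain: both branches exist as wires, returned as (B0, B1).

    The ENABLE signal is routed to exactly one of the two outputs.
    """
    b0 = enable and not condition
    b1 = enable and condition
    return b0, b1


for enable in (False, True):
    for condition in (False, True):
        print(enable, condition,
              decision_box(enable, condition),
              demultiplexer(enable, condition))
```

In both views the same information is carried; the flowchart branch simply becomes a routed signal.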


von Neumann overhead: an example

overhead                        von Neumann machine
data meet PU + other overhead   instruction stream
i/o to/from off-chip RAM        instruction stream

single CPU: ~94% of the computation load only for moving this window

reconfigurable address generator (GAG): ~20x speed-up

[Figure: array of rDPUs]

PISA DRC accelerator [ICCAD 1984]

funded by the E.I.S. project

(entire project: 15,000x speed-up)

Reiner

Design Rule Check accelerator:

It is a kind of image processing

2-D memory address space

hundreds or thousands of Boolean equations per 4-by-4 scan window position

away from von Neumann: 15,000x total speed-up
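What "moving the window" means in terms of addresses is shown in the sketch below. It assumes a row-major 2-D memory and a 4-by-4 scan window; `window_addresses`, `scan` and the toy rule are hypothetical and do not reproduce the PISA design or its measured figures.

```python
def window_addresses(width, height, window=4):
    """For each window position, yield the linear addresses it covers.

    On a CPU every one of these address computations is executed as
    instructions; an address generator produces them in hardware.
    """
    for top in range(height - window + 1):
        for left in range(width - window + 1):
            yield [(top + dy) * width + (left + dx)
                   for dy in range(window)
                   for dx in range(window)]


def scan(image, width, height, rule, window=4):
    """Evaluate rule on the pixels under every scan window position."""
    return [rule([image[a] for a in addrs])
            for addrs in window_addresses(width, height, window)]


image = list(range(20))          # 5 x 4 image, row-major
print(scan(image, width=5, height=4, rule=sum))
```

In the accelerator described on the slide, the rule would be the set of Boolean equations evaluated per window position; here `sum` merely stands in for it.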


Data Machine: from old stuff [1979 - ...]

63

[Figure: (r)DPA with ASM blocks and data streams (x, -, | symbols)]

ASM: Auto-Sequencing Memory (RAM, data counter, GAG)

New is only: its generalization [1989]

systolic array -> super-systolic

[Figure labels: data counter, (r)DPA, ASM; year markers [1995], [1995], [1990]]

Reiner Hartenstein

several data counters instead of a program counter

programmed by Flowware

the data counter: placed in memory** (not with the datapath***)

*) especially coarse-grained; for instance: platform FPGA

**) normally on-chip

***) not like with CPU

Reiner Hartenstein

1) making it reconfigurable

2) discard algebraic synthesis methods

3) add data sequencers -> machine paradigm

4) with reconfigurable address generator
