Speed-ups obtained by
Reconfigurable Computing
Reiner Hartenstein
CAPES/DFG Cooperation on Reconfigurable Computing,
inv. talk, Sept 19, 2008, Dept of Mechanical Engineering,
Universidade de Brasilia
slightly modified version
© 2008, [email protected] http://hartenstein.de
outline
Introduction
Manycore Crisis & von Neumann Syndrome
The Impact of Reconfigurable Computing
Programmer education: new roadmap needed
Conclusions

5 key issues
climate change faster than predicted: by carbon emission, primarily from power plants?
very high and growing computer energy cost, and a growing number of power plants needed
the manycore programming crisis stalls progress (end of the free ride on the Gordon Moore curve)
technologically stalled Moore's Law*
Reconfigurable Computing is a promising alternative
2008: 65, 45, 32 nm [Nick Tredennick (Gilder), 2003]
*) Tom Williams (keynote): the 20 nm wall

History of data processing
The first reconfigurable computer
• prototyped: 1884 Herman Hollerith
• datastream-based
• 1st Xilinx FPGA 100 years later
(figure label: DPU)

Configware Programming
manually (Configuration), or by swapping pre-wired boards (Reconfiguration)
no instruction streams
60 years later: RAM available (ferrite cores), motivating the von Neumann paradigm [J. v. N., 1946]

fine-grained reconfigurable
FPGA: Field-Programmable Gate Array
CLB: Configurable Logic Box
Xilinx old „island architecture“
[Figure: grid of CLBs; a connect box connects to a CLB, switch boxes form a wire between points A and B]

switch box
RAM-based: configware code loaded before run time into “hidden RAM”
FF: part of “hidden RAM”
this switch box has 150 transistors & 150 flipflops (FF)
hidden-RAM FPGAs: mainstream for more than a decade
patches even at the customer's desk
[Figure: switch box with configuration bits (0/1) held in hidden RAM]

Coarse-grained Reconfigurable Array
Conditional Swap Example (parallelization of the bubble sort algorithm)
if X > Y then swap;
swap turned into a wiring pattern
Legend: rout thru only; rout thru and function (multiplexer)
[Figure: comparator (>) on inputs Xi, Yi selects inputs 0/1 of two multiplexers producing outputs Xo, Yo; labels: CFB !, CLB]
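
The conditional swap slide can be read as one comparator driving two multiplexers. The following is a minimal sketch of that reading in Python; the function names `mux` and `conditional_swap` are illustrative assumptions and do not come from the talk.

```python
from typing import Tuple


def mux(select: bool, in0: int, in1: int) -> int:
    """2-to-1 multiplexer: forwards in1 when select is set, otherwise in0."""
    return in1 if select else in0


def conditional_swap(xi: int, yi: int) -> Tuple[int, int]:
    """Model of 'if X > Y then swap' as a fixed wiring pattern (assumed layout).

    One comparator drives the select input of two multiplexers, so both
    outputs are produced together instead of by a sequence of instructions.
    """
    greater = xi > yi            # the '>' comparator
    xo = mux(greater, xi, yi)    # Xo carries the smaller value
    yo = mux(greater, yi, xi)    # Yo carries the larger value
    return xo, yo
```

Calling `conditional_swap(5, 3)` yields the pair in ascending order, while an already ordered pair passes through unchanged.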

Another coarse-grained r-Array
SNN Filter on supersystolic Array: mainly a Pipe Network
rDPU: reconfigurable Data Path Unit, 32 bits wide
no CPU
array size: 10 x 16
(99% placement efficiency)
Legend: rDPU not used; used for routing only; operator and routing; port location marker; backbus connect; backbus connect not used
CFB ! rout thru only
CoDe-X inside [Jürgen Becker]
by KressArray Xplorer [Ulrich Nageldinger]

Platform FPGA
256 – 1704 BGA
56 – 424 fast on-chip Block RAMs: BRAMs
8 – 32 fast serial I/O channels
DPUs
Configware code input
[courtesy Lattice Semiconductor]

Reconfigurable Supercomputing
Silicon Graphics: Reconfigurable Application-Specific Computing (RASC™)
Cray XD1
• Xilinx Virtex-II Pro
• Library by Cray
Supercomputing 2007, Reno, Nevada, USA: 9600 registered attendees, 440 exhibitors
Chuck Thacker … (even Microsoft working at it) (Lab in Cambridge, UK, etc.)

What does Configware mean?
Software to Configware Migration
traditional Computing (instruction-procedural):
Software Source -> Compiler -> Software Code
Reconfigurable Computing:
Configware Source -> mapper (Placement & Routing) -> Configware Code (structural: space domain)
data scheduler -> Flowware Code (data-procedural)
domain labels in the figure: space domain, time domain

outline
Introduction
The Manycore Crisis & the von Neumann Syndrome
The Impact of Reconfigurable Computing
Programmer education: new roadmap needed
Conclusions

Many-core: Break-through or Breakdown?
Industry is facing a disruptive turning point
Intel's vision: MultiCore
The stakes are high ...
“could reset µP HW & SW roadmaps for next 30 years” [David Patterson]
forcing a historic transition to a parallel programming model yet to be invented [David Callahan]
„I would be panicked if I were in industry“ [John Hennessy]
HPC users lack understanding in basic precepts*
it's an education, qualification, and R&D problem
*) PRACE consortium (Partnership foR Advanced Computing in Europe) http://www.prace-project.eu/documents/D3.3.1_document_final.pdf

Declining Programmer Productivity
In particular HPC application domains, massive parallelism requires 10 – 30 professionals in multi-disciplinary, multi-institutional teams for 5 – 10 years [Douglass Post, DoD HPCMP, panelist at SC07]
The Law of More: programmer productivity declines disproportionately with increasing parallelism
Software done: machine obsolete

The von Neumann Syndrome
More power for creating foam than to accelerate the vessel?

Massive Overhead Phenomena
single core (CPU): von Neumann overhead piling up to code sizes of astronomic dimensions
Dijkstra 1968: The Goto considered harmful
Koch et al. 1975: The universal Bus considered harmful
Backus, 1978: Can programming be liberated from the von Neumann style?
Arvind et al., 1983: A critique of Multiprocessing the von Neumann Style
1986, E.I.S. Projekt: 94% for address computation; total speed-up: x 15000
2006: C.V. “RAM” Ramamoorthy: “von Neumann Syndrome”
2008 - David Callahan: „a terrifying number of processes running in parallel, create sequential-processing bottlenecks and losses in data locality“
overhead                        von Neumann machine
instruction fetch               instruction stream
state address computation       instruction stream
data address computation        instruction stream
data meet PU + other overh.     instruction stream
i/o to/from off-chip RAM        instruction stream
(list not complete)
further sources of overhead: C++ compiler, virtualization, many other features

manycore von Neumann: arrays of massive overhead phenomena
fast on-chip memory cannot store such huge instruction code blocks
overhead                        von Neumann machine
instruction fetch               instruction stream
state address computation       instruction stream
data address computation        instruction stream
data meet PU + other overh.     instruction stream
i/o to/from off-chip RAM        instruction stream
Inter PU communication          instruction stream
message passing overhead        instruction stream
transactional memory overh.     instruction stream
multithreading overhead etc.    instruction stream
Figure annotations: proportionate to the number of processors; disproportionate to the number of processors
[Figure: single CPU next to a many-core array of CPUs]

outline
Introduction
Manycore Crisis & von Neumann Syndrome
The Impact of Reconfigurable Computing
Programmer education: new roadmap needed
Conclusions

Speed-up factors obtained by Software to Configware migration
[Chart: Speedup-Factor (log scale, 10^0 to 10^6) over the years 1980 to 2010; trend annotation: FPGA, x 2/yr]
DSP and wireless: Reed-Solomon Decoding 2400; MAC 1000; Viterbi Decoding 400; FFT 100
Image processing, Pattern matching, Multimedia: real-time face detection 6000; video-rate stereo vision 900; pattern recognition 730; SPIHT wavelet-based image compression 457
Bioinformatics: Smith-Waterman pattern matching 288; BLAST 52; protein identification 40
molecular dynamics simulation 88
Astrophysics: GRAPE 20
crypto 1000; DES breaking 28500
further data point, label not recoverable: 3000

Accelerator card from Bruchsal
• Manufacturer: SIEMENS Bruchsal
• 16 FPGAs
• I/O Bandwidth: 50 GByte/s
• 1.5 TeraMAC/s
Tera means 10^12 or 1 000 000 000 000 (1 trillion)
MAC means Multiply and ACcumulate

Energy saving factors obtained by software to configware migration
[Chart: same applications and speed-up factors as in the speed-up chart (1980 to 2010, 10^0 to 10^6), with an added FPGA energy saving factor* scale; energy saving values shown: 3440 and 300]
Energy saving: almost x10 less than speed-up …
… could be improved

von Neumann overhead vs. Reconfigurable Computing
overhead                        von Neumann machine     anti machine
instruction fetch               instruction stream      none*
state address computation       instruction stream      none*
data address computation        instruction stream      none*
data meet PU + other overh.     instruction stream      none*
i/o to/from off-chip RAM        instruction stream      none*
Inter PU communication          instruction stream      none*
message passing overhead        instruction stream      none*
transactional memory overh.     instruction stream      none*
multithreading overhead etc.    instruction stream      none*
*) configured before run time
using reconfigurable data counters
rDPA: reconfigurable datapath array (coarse-grained rec.)
no instruction fetch at run time
[Figure: array of rDPUs next to an array of CPUs]

Data meet the processor (CPU)
by Software
inefficient transport over off-chip memory by memory-cycle-hungry instruction streams
This is just one of many von Neumann overhead phenomena
illustrating von Neumann syndrome

Data meet the CPU
by Flowware
Placement of the execution locality (not moving data) within pipe network: generated by the Configware-Compiler*
illustrating acceleration
*) before run time (at compile time)

What did we learn?
There are 2 kinds of datastreams:
1) indirectly moved by an instruction stream machine (von Neumann): extremely inefficient
2) directly moved by a datastream machine (from Reconfigurable Computing): very efficient
“Dataflow machine” would be a nice term, but was introduced by a different scene*
*) meanwhile dead: not really a dataflow machine, but had used compilers accepting a dataflow language

What else did we learn?
There are 2 kinds of parallelism:
1) Concurrent processes: instruction stream parallelism (CPU manycores): inefficient
2) Data parallelism by parallel datastreams (in Reconfigurable Computing Systems): efficient
Conclusion:
- Data parallelism brings the performance (we do data processing !)

What Parallelism?
data parallelism (array of rDPUs): no von Neumann bottleneck
instruction parallelism (array of CPUs): many von Neumann bottlenecks
[Hartenstein’s watering can model]

Put old ideas into practice (POIIP)
... „The biggest payoff will come from putting old ideas into practice and teaching people how to apply them properly.“ [David Parnas]
“We need a complete re-definition of CS” [Burton Smith and other celebrities]
Wrong! I do not agree, finding out that ... [Reiner Hartenstein]
“We need a complete re-definition of curriculum recommendations - missing several key issues.” [Reiner Hartenstein]

outline
Introduction
Manycore Crisis & von Neumann Syndrome
The Impact of Reconfigurable Computing
Programmer education: new roadmap needed
Conclusions

Fighting against obsolete curricula?
The Embedded Systems Approach?
Real-Time Systems (Sweden)
Recommendations for Designing new ICT Curricula
Workshop on Embedded Systems Education (WESE)
Chess: Center for Hybrid and Embedded Software Systems (courses in embedded systems)
Graduate Curriculum on Embedded Software and Systems (EU)
Advanced Real Time Systems
… support their own educational approach
„You can always teach programming to a hardware guy ... but you can never teach hardware to a programmer“
it's not the programmer's fault: it's due to obsolete CS curricula

CS is a Monster
fully wrong educational mainstream approaches:
1) the basic mind set is exclusively instruction-stream-oriented; data streams are considered exotic
2) mapping parallelism into the time domain; abstracting away the space domain is fatal
We need a dual-rail education

We need to POIIP for:
Software to Hardware Migration and Software to Configware Migration:
2 key rules of thumb - terrifically simple:
1) loop turns into pipeline [1979]
2) decision box turns into demultiplexer [1967]: PvOIIP

Two Dichotomies
Dichotomy = mutual allocation to two opposed domains such that a third domain is excluded.
The dichotomy model as an educational orientation guide for dual-rail education, to overcome the software/configware chasm & the software/hardware chasm
1) Machine Paradigm Dichotomy (von Neumann / Dataflow machine*): the „Twin Paradigm“ model
2) Relativity Dichotomy: time domain / space domain; helps parallelization by time to space mapping
*) see definition

Def.: Dataflow Machine
The old „Dataflow Machine“ research scene is dead.
indeterministic: unpredictable order of execution
sequential execution: not really a dataflow machine
had used compilers accepting a dataflow language
we re-define this term: counterpart of von Neumann
deterministic, w. data counters (no program counter)

1) Paradigm Dichotomy (procedural dichotomy)
The Twin Paradigm Approach (TTPA)
instruction domain: CPU, program counter, instruction stream
datastream domain: (r)DPA, data counter(s), data stream
Asymmetry: we need data parallelism

Data Machine: from old stuff [1979 - ...]
ASM: Auto-Sequencing Memory (RAM, data counter, GAG)
Data streams [Kung et al. 1979]
systolic array
super systolic [1995]
New is only: its generalization [1989]
[Figure: (r)DPA surrounded by ASM blocks delivering data streams; further year labels: [1995], [1990]]

Procedural Languages Twins
imperative Software Languages        (super) systolic Flowware Languages
read next instruction                read next data item
goto (instruction address)           goto (data address)
jump to (instruction address)        jump to (data address)
instruction loop                     data loop
instruction loop nesting             data loop nesting
instruction loop escape              data loop escape
instruction stream branching         data stream branching
no: no internally parallel loops     yes: internally parallel loops
program counter                      data counter(s)
But there is the Asymmetry: data counter(s) for data parallelism

Relativity Dichotomy
time domain: procedure domain
space domain: structure domain
von Neumann Machine (time): 2 phases: 1) programming instruction streams; 2) run time
Anti Machine (time/space): 3 phases: 1) reconfiguration of structures; 2) programming data streams; 3) run time

time-iterative to space-iterative
loop transformation methodology: 70ies and later (POIIP)
a time to space mapping: n time steps, 1 CPU, turns into 1 time step, n DPUs
a time to space/time mapping: k*n time steps, 1 CPU, turns into k time steps, n DPUs
Often the space dimension is limited (e.g. because of the chip size)
Strip mining [D. Loveman, J-ACM, 1977]
( n = length of pipeline )
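
The step counts on this slide can be illustrated with a small sketch. It assumes a simplified model in which one strip of n items is handled in a single time step by n DPUs (simulated sequentially here) and pipeline fill latency is ignored; the function names are hypothetical.

```python
from typing import Callable, List, Tuple


def run_on_one_cpu(data: List[int], body: Callable[[int], int]) -> Tuple[List[int], int]:
    """Time-iterative version: one loop iteration per time step."""
    out, steps = [], 0
    for item in data:
        out.append(body(item))
        steps += 1
    return out, steps


def run_strip_mined(data: List[int], body: Callable[[int], int],
                    n_dpus: int) -> Tuple[List[int], int]:
    """Strip-mined version: the loop is cut into strips of n_dpus items.

    Each strip models one time step in which n_dpus units work side by side
    (an assumption of this sketch; the units are simulated one after another).
    """
    out, steps = [], 0
    for start in range(0, len(data), n_dpus):
        strip = data[start:start + n_dpus]
        out.extend([body(item) for item in strip])  # one DPU per item
        steps += 1
    return out, steps
```

With k = 4 strips of n = 8 items, the single-CPU loop takes k*n = 32 steps and the strip-mined version takes k = 4 steps, while both produce the same results.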

outline
Introduction
Manycore Crisis & von Neumann Syndrome
The Impact of Reconfigurable Computing
Conclusions

Conclusions (1)
We massively need programmable accelerator co-processors.
Established technologies are available, and we can still use standard software and its tools.
Configware skills and basic hardware knowledge are essential qualifications for programmers.
We need a massive migration of software to configware.
This is to cope with the implementation wall and with the programmer population's unsustainable skills mismatches.

Conclusions (2)
CS education is a monster!
Jaw-dropping sclerosis of curriculum taskforces
We need a complete re-definition of CS education
CS should learn a lot from Embedded Systems, like in Mechanical Engineering
Fully wrong educational mainstream approaches
We urgently need Dual-Rail Education

time to space mapping
time domain: procedure domain
space domain: structure domain
program loop (time algorithm): n time steps, 1 CPU
pipeline (space algorithm): 1 time step, n DPUs
Bubble Sort (time algorithm): n x k time steps, 1 „conditional swap“ unit
Shuffle Sort (space/time algorithm): k time steps, n „conditional swap“ units
[Figure: one conditional swap unit with inputs x, y next to a column of conditional swap units]

Architecture instead of synchro
Example: „Shuffle Sort“
direct time to space mapping: accessing conflicts
modification: with shuffle function
Better architecture instead of complex synchronisation: half the number of blocks + up and down of data (shuffle function); no von Neumann syndrome!
[Figure: columns of conditional swap units, before and after the modification]
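
The contrast between one conditional-swap unit used many times and a row of units used a few times can be sketched as follows. The space/time variant below is odd-even transposition sort, chosen here as a stand-in under the assumption that it captures the idea; the talk's „Shuffle Sort“ (half the blocks plus a shuffle function) may differ in detail.

```python
from typing import List, Tuple


def bubble_sort_steps(values: List[int]) -> Tuple[List[int], int]:
    """Time algorithm: a single conditional-swap unit, one use per time step."""
    a, steps = list(values), 0
    for end in range(len(a) - 1, 0, -1):
        for i in range(end):
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
            steps += 1
    return a, steps


def odd_even_sort_steps(values: List[int]) -> Tuple[List[int], int]:
    """Space/time algorithm: a row of conditional-swap units per time step.

    Even steps compare pairs (0,1), (2,3), ...; odd steps compare (1,2), (3,4), ...
    The pairs of one step are disjoint, so all units could operate at once.
    """
    a, steps = list(values), 0
    for step in range(len(a)):
        for i in range(step % 2, len(a) - 1, 2):
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
        steps += 1
    return a, steps
```

For 8 values the single-unit version needs 28 time steps, whereas the row of units needs 8, and the units only ever touch neighbouring positions, so no access conflicts arise between them.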

Transformations since the 70ies
time domain: procedure domain
space domain: structure domain
program loop (time algorithm): n x k time steps, 1 CPU
Pipeline (space/time algorithm): k time steps, n DPUs
Strip Mining Transformation
loop transformations: rich methodology published [survey: Diss. Karin Schmidt, 1994, Shaker Verlag]

Education Revolution: the Microelectronics Design Revolution [1980]
Carver Mead, Lynn Conway: as a model (in Germany: the E.I.S. project)
tall thin man
traditional division of labor: breadth of specialization, fragmentation
levels: Application; RT level; Logic level; Switching level; Circuit level; Layout level; in-house technology
between the levels: submission / rejection
The new M-&-C division of labor: breadth of specialization strongly reduced
Application; Silicon Foundry (external technology); coherence
Clearing out & intuitive models to resolve the education dilemma
Emphasis on “Systems”

Education Revolution: Reconfigurable Computing Revolution
The new Mead & Conway? Christophe Bobda
the tall thin man*: Application level, Program level
> Dichotomy <: Twin Paradigm
clearing out: von Neumann paradigm (instruction-stream-based)
clearing out: Anti machine paradigm (datastream-based)
*) or “tall thin woman”

Who generates the data streams?
„systolic“
Without a Sequencer it's not a Machine!
[Figure: systolic array with input and output data streams]

The Anti Machine
several data counters instead of a program counter
the data counter: placed in memory** (not with datapath***)
ASM: Auto-Sequencing Memory (RAM, data counter, GAG)
Data streams [Kung et al. 1979]
(r)DPA*: Super-systolic Array (KressArray), programmed by Flowware
*) especially coarse-grained; for instance: platform FPGA
**) normally on-chip
***) not like with CPU
[Figure: (r)DPA surrounded by ASM blocks delivering data streams]

Mission of this talk
We need time to space migration
since infinite space is not available, we often need partial time 2 space migration
software 2 hardware mapping (and software 2 configware mapping) means time to space migration (and von Neumann 2 anti machine migration)

Morphware: old stuff
structural programming (non-von-Neumann)
1971 PROMs for small logic
1975 PLA
1978 PAL with PALASM tool
1984 first Xilinx FPGA
meanwhile mainstream …

POIIP: Loop turns into pipeline [1979]
rDPU: (reconfigurable) DataPath Unit
loop: CPU with Memory, executing the loop body
Pipeline: chain of rDPUs implementing the loop body
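
A minimal sketch of the rule „loop turns into pipeline“, assuming the loop body is a chain of simple operations and each operation is mapped to one clocked unit with an output register. The clock-by-clock simulation and all names are illustrative assumptions, not the talk's tooling.

```python
from typing import Callable, List, Optional, Tuple

Stage = Callable[[int], int]


def run_loop(data: List[int], stages: List[Stage]) -> List[int]:
    """Loop on a CPU: every item passes through the whole loop body in turn."""
    out = []
    for item in data:
        for stage in stages:
            item = stage(item)
        out.append(item)
    return out


def run_pipeline(data: List[int], stages: List[Stage]) -> Tuple[List[int], int]:
    """Same loop body laid out in space: one unit per operation, clocked."""
    regs: List[Optional[int]] = [None] * len(stages)  # output register per stage
    stream = list(data)
    out, clocks = [], 0
    while stream or any(r is not None for r in regs):
        if regs[-1] is not None:          # a finished value leaves the pipeline
            out.append(regs[-1])
        # update back to front so each value advances exactly one stage per clock
        for i in range(len(stages) - 1, 0, -1):
            regs[i] = stages[i](regs[i - 1]) if regs[i - 1] is not None else None
        regs[0] = stages[0](stream.pop(0)) if stream else None
        clocks += 1
    return out, clocks
```

With three stages and four data items the pipeline delivers the same results as the loop after 4 + 3 = 7 clocks, whereas the loop performs 4 x 3 = 12 operation steps one after another.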

super-systolic array
(recall this example !)
supporting any complex free form pipe networks, far beyond just uniform linear pipes
Legend: rDPU not used; used for routing only; operator and routing; port location marker; backbus connect; backbus connect not used; rout thru only
CoDe-X inside [Jürgen Becker]
by KressArray Xplorer [Ulrich Nageldinger]

decision box turns into demultiplexer [1967] (PvOIIP)
decision box: CONDITION, ENABLE; exits B0, B1
demultiplexer: CONDITION, ENABLE; outputs B0 (0), B1 (1)
W. A. Clark: 1967 SJCC, AFIPS Conf. Proc.
C. G. Bell et al: IEEE Trans-C21/5, May 1972
RTM as a DEC product available: 1973
[~1971] (introducing HDLs): „That's so simple! Why did it take 30 years to find out?“
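
The correspondence between a decision box and a demultiplexer can be sketched as below. The mapping of CONDITION = 1 to output B1 and CONDITION = 0 to output B0 is an assumption read off the figure labels, and the function names are illustrative.

```python
from typing import Tuple


def decision_box(condition: bool, enable: bool) -> str:
    """Flowchart view: control arrives (enable) and leaves on exactly one branch."""
    if not enable:
        return "idle"
    return "B1" if condition else "B0"


def demultiplexer(condition: bool, enable: bool) -> Tuple[bool, bool]:
    """Hardware view: ENABLE is routed to output B0 or B1, selected by CONDITION."""
    b0 = enable and not condition
    b1 = enable and condition
    return b0, b1
```

For every input combination the active demultiplexer output names the branch the decision box would take, and with ENABLE low neither output is active.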

von Neumann overhead: an example
overhead                        von Neumann machine
instruction fetch               instruction stream
state address computation       instruction stream
data address computation        instruction stream
data meet PU + other overh.     instruction stream
i/o to/from off-chip RAM        instruction stream
single CPU: ~94% computation load only for moving this window
reconfigurable address generator (GAG): ~20x speed-up
PISA DRC accelerator [ICCAD 1984], funded by E.I.S. Projekt
(entire project: 15000x speed-up)
[Figure: single CPU next to a group of rDPUs]
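
To make the address computation load concrete, here is a hedged sketch of a sliding window over a row-major image. The scan pattern, the window size and the function names are assumptions for illustration; they do not model the actual PISA window or the GAG hardware.

```python
from typing import Iterator, List


def window_addresses(width: int, height: int, win: int) -> Iterator[List[int]]:
    """Yield, for every window position, the linear addresses it covers.

    Stands in for what an address generator (data counter) would deliver;
    on a CPU the same index arithmetic runs as instructions in inner loops.
    """
    for top in range(height - win + 1):
        for left in range(width - win + 1):
            yield [(top + dy) * width + (left + dx)
                   for dy in range(win) for dx in range(win)]


def window_sums(image: List[int], width: int, height: int, win: int) -> List[int]:
    """The useful computation: one sum per window; addresses come from the generator."""
    return [sum(image[a] for a in addrs)
            for addrs in window_addresses(width, height, win)]
```

In this sketch each window of win x win items costs win*win address calculations (a multiply and two additions each) for a single useful sum, which hints at why moving the window can dominate the load when addresses are computed by instructions.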

Data Machine: from old stuff [1979 - ...]
ASM: Auto-Sequencing Memory (RAM, data counter, GAG)
Data streams [Kung et al. 1979]
systolic array
super systolic
New is only: its generalization [1989]
(r)DPAs with ASM data counters
[Figure: (r)DPA surrounded by ASM blocks delivering data streams; year labels: [1995], [1995], [1990]]