Introduction Code Generation Synchronous Dataflow Current Status
Machine Assisted Code Generation for Manycore Processors
Jerker Bengtsson, jerker.bengtsson@hh
EPC meeting at Lindholmen, March 19, 2008
Centre for Research on Embedded Systems
Introduction
Are new languages necessary for multi-/many-/whatever-cores?
  The solution is well-defined parallel models of computation (MoC)
  Dataflow is a good match for manycore targets
Why investigate code generation?
  Consensus 1: programming complexity needs to be reduced
  Consensus 2: we want code with a high degree of portability
  Alternative: program against a machine API that abstracts shared machine resources
    (+) solves part of the portability problem
    (-) does not reduce multicore programming complexity
What do we mean by "machine assisted"?
  Latency optimisation is different from throughput optimisation
  → different problems require different optimisation strategies
  Provide means to specialise the parallel mapping strategy
Code Generation Framework
[Figure: abstract code generator framework, split into a Front End and a Back End. Boxes, top to bottom: Machine Parameters and Program; Program Graph and Machine Graph; Configuration Graph; C-code generation & compilation; Manycore Executable.]
Synchronous Dataflow
[Figure: hierarchical SDF model of a WCDMA Adaptive Multi Rate (AMR) transmitter processing chain. The figure shows three copies each of the actors Segm, CRC, Code, Rm, Dtx1, Intrl and Segm, followed by Mux, Dtx2, Segm, Intrl and Phy; the edges are annotated with integer token rates.]
The graph shows the top-level composite actors
The integers associated with the edges specify the token consumption and production rates of each actor when it fires
Hierarchical SDF models are composed of atomic and composite actors
SDF is a well-defined, restricted subset of dataflow
Pipeline, task and data parallelism can be discovered by a code generator
The properties of SDF guarantee
  buffer-bounded execution
  execution without deadlock
Limitations of SDF
  Expressiveness is limited
  Not efficient for dynamic computation problems
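The token rates on the edges are what make static analysis possible. As a minimal sketch (not the framework's implementation), the balance equations q[src] × produced = q[dst] × consumed can be solved for the smallest integer repetition vector q. The function name, the edge encoding and the three-actor example rates are assumptions made for illustration; they are not taken from the AMR figure. The sketch checks rate consistency only, which is what bounds the buffers; deadlock freedom additionally depends on the initial tokens on cycles and is not checked here.

```python
from fractions import Fraction
from math import lcm  # Python 3.9+


def repetition_vector(actors, edges):
    """Solve the SDF balance equations for a connected graph.

    edges: list of (src, dst, produced, consumed) with positive integer rates.
    Returns the smallest positive integer firing counts per actor,
    or None if the rates are inconsistent (no bounded-buffer schedule).
    """
    adj = {a: [] for a in actors}
    for src, dst, prod, cons in edges:
        # q[dst] = q[src] * prod / cons, and the inverse in the other direction
        adj[src].append((dst, Fraction(prod, cons)))
        adj[dst].append((src, Fraction(cons, prod)))

    rate = {actors[0]: Fraction(1)}
    stack = [actors[0]]
    while stack:
        a = stack.pop()
        for b, ratio in adj[a]:
            expected = rate[a] * ratio
            if b not in rate:
                rate[b] = expected
                stack.append(b)
            elif rate[b] != expected:
                return None  # two paths demand different firing ratios

    if len(rate) != len(actors):
        raise ValueError("graph is not connected")

    # Scale the fractional solution to the smallest integer solution.
    scale = lcm(*(r.denominator for r in rate.values()))
    return {a: int(r * scale) for a, r in rate.items()}


# Hypothetical chain: A produces 2, B consumes 3; B produces 1, C consumes 2.
chain = [("A", "B", 2, 3), ("B", "C", 1, 2)]
q = repetition_vector(["A", "B", "C"], chain)
```

For the hypothetical chain, one schedule period fires A three times, B twice and C once, after which every buffer is back at its initial fill level.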
What manycore targets?
[Figure: an m × n array of processing elements, PE1,1 to PEm,n. The detail of one PE shows a switch/router, instruction memory, data memory, an instruction execution unit and a register file.]
Machine abstraction: computational capacity
Computational capacity is described by the tuple $\langle P, p, m, b_l, b_g \rangle$
  $P$ is the number of cores
  $p$ is the processing power per core
  $m$ is the local memory size
  $b_l$ is the local memory bandwidth
  $b_g$ is the global shared memory bandwidth
Machine abstraction: network capacity
The network capacity is described by the tuple $\langle s_o, s_l, n_b, c, n_{hl}, r_o, r_l \rangle$
  $s_o$ is the send occupancy
  $s_l$ is the send latency
  $n_b$ is the network buffer capacity
  $c$ is the link bandwidth
  $n_{hl}$ is the network hop latency
  $r_o$ is the receive occupancy
  $r_l$ is the receive latency
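One way to hand these parameters to a code generator is as a small immutable machine description. The sketch below is an assumption about how that could look, not the framework's actual data model: the class names, field types and example values are hypothetical, and units are left open because the slides do not state them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ComputeCapacity:
    """Computational capacity parameters (units are not given in the slides)."""
    P: int        # number of cores
    p: float      # processing power per core
    m: int        # local memory size
    b_l: float    # local memory bandwidth
    b_g: float    # global shared memory bandwidth


@dataclass(frozen=True)
class NetworkCapacity:
    """Network capacity parameters (units are not given in the slides)."""
    s_o: float    # send occupancy
    s_l: float    # send latency
    n_b: int      # network buffer capacity
    c: float      # link bandwidth
    n_hl: float   # network hop latency
    r_o: float    # receive occupancy
    r_l: float    # receive latency


@dataclass(frozen=True)
class Machine:
    compute: ComputeCapacity
    network: NetworkCapacity


# Hypothetical 16-core machine; every number here is made up for illustration.
example = Machine(
    compute=ComputeCapacity(P=16, p=1.0, m=64 * 1024, b_l=4.0, b_g=8.0),
    network=NetworkCapacity(s_o=2, s_l=1, n_b=4, c=1.0, n_hl=1, r_o=2, r_l=1),
)
```

Freezing the dataclasses keeps the machine description read-only, so a mapping strategy cannot change it by accident while exploring configurations.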
Performance functions
computation time: $T_p = \lceil r_p / p \rceil$
send time: $T_s = \lceil r_{c_{out}} / \mathit{messlen} \rceil \times s_o + T_{blocked}()$
  network injection time is $s_l + Q_{blocked}()$
receive time: $T_r = \lceil r_{c_{in}} / \mathit{messlen} \rceil \times r_o + T_{blocked}()$
  network extraction time is $r_l + Q_{blocked}()$
communication latency is
  $T_c = n_{hl} \times n_{hops} + b_l + (L - 1) \times \max(1/c,\ P/b_g)$ for global memory access
  $T_c = \mathit{dist} \times n_{hl} + (L - 1) \times 1/c$ for core-to-core communication
State-dependent performance functions $T_{blocked}()$ and $Q_{blocked}()$ can be set constant (if we know the worst case...)
...contact me if you want the details
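The performance functions above translate directly into code. The following is a minimal sketch that evaluates them with the blocking terms set to constants, as the slide suggests for the worst case. The slide does not define r_p, r_cout, r_cin, messlen, L, nhops or dist, so the sketch treats them as plain numeric inputs; the function names and the example numbers are assumptions.

```python
def ceil_div(a, b):
    """Integer ceiling of a / b for positive integers."""
    return -(-a // b)


def computation_time(r_p, p):
    # T_p = ceil(r_p / p)
    return ceil_div(r_p, p)


def send_time(r_cout, messlen, s_o, t_blocked=0):
    # T_s = ceil(r_cout / messlen) * s_o + T_blocked(), with T_blocked constant
    return ceil_div(r_cout, messlen) * s_o + t_blocked


def receive_time(r_cin, messlen, r_o, t_blocked=0):
    # T_r = ceil(r_cin / messlen) * r_o + T_blocked(), with T_blocked constant
    return ceil_div(r_cin, messlen) * r_o + t_blocked


def global_memory_latency(n_hl, nhops, b_l, L, c, P, b_g):
    # T_c = n_hl * nhops + b_l + (L - 1) * max(1/c, P/b_g)
    return n_hl * nhops + b_l + (L - 1) * max(1 / c, P / b_g)


def core_to_core_latency(dist, n_hl, L, c):
    # T_c = dist * n_hl + (L - 1) * 1/c
    return dist * n_hl + (L - 1) * (1 / c)


# Made-up numbers, only to show how the functions are called.
t_p = computation_time(r_p=1000, p=300)
t_s = send_time(r_cout=10, messlen=4, s_o=2)
t_r = receive_time(r_cin=10, messlen=4, r_o=3, t_blocked=1)
t_mem = global_memory_latency(n_hl=2, nhops=3, b_l=1, L=5, c=1, P=4, b_g=2)
t_c2c = core_to_core_latency(dist=3, n_hl=2, L=5, c=1)
```

Passing the blocking time as a keyword argument keeps the state-dependent part separate, so a worst-case constant can later be swapped for a more detailed model without touching the rest.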
Where we are now
Machine abstraction
The framework is being implemented in Ptolemy (http://ptolemy.berkeley.edu/)
The intermediate representation (IR) is a hierarchical heterogeneous model
  the multicore level is Process Networks
  core internals are SDF models
We can generate the IR from SDF input
  ...but no clustering or optimisation yet
C code can be generated from the IR
  parallel code using POSIX threads
  can be modified to generate target-specific code
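The two-level structure (Process Networks between cores, SDF inside each core) can be illustrated with threads and blocking FIFO channels. This is a sketch in Python rather than the generated C with POSIX threads, and it is not output of the framework: the function name, the two toy actors and the channel capacity of 4 are assumptions. A bounded queue is used as the channel, which is an implementation choice; the blocking read is the part that mirrors process-network semantics.

```python
import queue
import threading


def run_pipeline(samples):
    """Run two 'cores' as threads connected by one FIFO channel."""
    chan = queue.Queue(maxsize=4)  # bounded FIFO between the two processes
    done = object()                # sentinel marking the end of the stream
    out = []

    def producer():
        # Stands in for the SDF actors scheduled on the first core.
        for x in samples:
            chan.put(x * 2)        # blocks while the channel is full
        chan.put(done)

    def consumer():
        # Stands in for the SDF actors scheduled on the second core.
        while True:
            tok = chan.get()       # blocking read, as in a process network
            if tok is done:
                break
            out.append(tok + 1)

    threads = [threading.Thread(target=producer),
               threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return out


result = run_pipeline([1, 2, 3])
```

Because each channel has a single writer and a single reader and reads block, the result does not depend on how the threads are interleaved.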
What is going on now?
Investigation of clustering and parallelisation strategies
  User-specified clustering
  Different types of automated clustering
    detect and cluster non-parallel actor chains
    constraint-driven clustering (RT constraints)
    exploit "hidden" data parallelism
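As an illustration of the first automated strategy, a non-parallel chain can be read as a run of actors in which each link has exactly one successor whose only predecessor it is. The sketch below fuses such runs into clusters. It is one possible interpretation, not the strategy under investigation: the function name and the example graph are hypothetical, token rates are ignored, and the graph is assumed to be acyclic.

```python
def cluster_chains(actors, edges):
    """Group actors of an acyclic graph into maximal non-parallel chains.

    edges: list of (src, dst) pairs. Returns a list of clusters, each a
    list of actors in pipeline order.
    """
    succ = {a: [] for a in actors}
    pred = {a: [] for a in actors}
    for s, d in edges:
        succ[s].append(d)
        pred[d].append(s)

    def links_to_next(a):
        # a fuses with its successor if the link between them is exclusive
        return len(succ[a]) == 1 and len(pred[succ[a][0]]) == 1

    clusters = []
    for a in actors:
        # Skip actors that are the fused continuation of their predecessor.
        if len(pred[a]) == 1 and links_to_next(pred[a][0]):
            continue
        chain = [a]
        while links_to_next(chain[-1]):
            chain.append(succ[chain[-1]][0])
        clusters.append(chain)
    return clusters


# Hypothetical graph: a serial front part, a two-way split and join, a sink.
names = ["Src", "A", "B", "Split", "C", "D", "Join", "Sink"]
links = [("Src", "A"), ("A", "B"), ("B", "Split"),
         ("Split", "C"), ("Split", "D"),
         ("C", "Join"), ("D", "Join"), ("Join", "Sink")]
clusters = cluster_chains(names, links)
```

In the example, the serial front part collapses into one cluster while the parallel branches C and D stay separate, so they remain candidates for mapping to different cores.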
New spin-off proposal: codegen for multicore RTOS
  Problem: for non-trivial systems, we will need a higher degree of run-time flexibility
  Approach?: run-time (real-time scheduling) support for multi-cores
  To approach this problem, we need RT scheduling theory (Hoai Hoang)