Post on 02-Feb-2016
lrm 04/22/23
University at Albany, SUNY
Efficient Radar Processing Via Array and Index Algebras
Lenore R. Mullin, Daniel J. Rosenkrantz, Harry B. Hunt III, and Xingmin Luo
University at Albany, SUNY
NSF CCR 0105536
Outline
• Overview – Motivation
  – Radar Software Processing: to exceed 1 × 10^12 ops/second
  – The Mapping Problem: Efficient Use of Memory Hierarchy; Portable, Scalable, …
  – Radar Uses Linear and Multi-linear Operators: Array-Based Operations
  – Array Operations Require Array Algebra and Index Calculus
• Array Algebra: MoA and Index Calculus: Psi Calculus
  – Reshape to Use Processor/Memory Hierarchy Efficiently: Lift Dimension
  – High-Level Monolithic Operations: Remove Temporaries
• Time Domain Convolution
• Benefits of Using MoA and Psi Calculus
Levels of Processor/Memory Hierarchy
• Can be Modeled by Increasing Dimensionality of Data Array.
  – Additional dimension for each level of the hierarchy.
  – Envision data as reshaped to reflect increased dimensionality.
  – Calculus automatically transforms algorithm to reflect reshaped data array.
  – Data layout, data movement, and scalarization automatically generated based on reshaped data array.
Levels of Processor/Memory Hierarchy (continued)
• Math and indexing operations in same expression
• Framework for design space search
  – Rigorous and provably correct
  – Extensible to complex architectures
Approach
Mathematics of Arrays. Example: “raising” array dimensionality, y = conv(x).
[Figure: the flat vector x = < 0 1 2 … 35 > is mapped across the Memory Hierarchy (Main Memory, L2 Cache, L1 Cache) and across Parallelism: x is reshaped into length-3 blocks < 0 1 2 >, < 3 4 5 >, …, < 33 34 35 >, which are distributed over processors P0, P1, P2.]
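The Map in this figure can be sketched in code. The following is my illustration, not the authors' implementation, and the function and parameter names are made up: it views the flat 36-element x as a 3-d (processor, block, element) array purely by index arithmetic, with no copying.

```cpp
#include <vector>
#include <cstddef>

// View a flat vector through a lifted (processor, block, element) index.
// linear offset = ((p * blocks_per_proc) + b) * block_len + e
int lifted(const std::vector<int>& x,
           std::size_t p, std::size_t b, std::size_t e,
           std::size_t blocks_per_proc, std::size_t block_len) {
    return x[(p * blocks_per_proc + b) * block_len + e];
}
```

With blocks_per_proc = 4 and block_len = 3, processor 1's first element is x[12], as in the Map above; nothing is moved until the layout demands it.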
Application Domain: Signal Processing
3-d Radar Data Processing: Composition of Monolithic Array Operations
• Algorithm is Input
• Architectural Information is Input
• Hardware Info: Memory, Processor
• Change algorithm to better match hardware/memory/communication. Lift dimension algebraically.
[Figure: processing pipeline of Pulse Compression, Doppler Filtering, Beamforming, Detection; kernels include Convolution and Matrix Multiply. For each kernel: Model processors (dim = dim+1); Model time-variance (dim = dim+1); Model Level 1 cache (dim = dim+1); Model All Three: dim = dim+3.]
Current Abstraction Approaches
Even when operations compose, they don’t compose, X(YZ), without temporary arrays.
[Figure: a taxonomy of current approaches, ranging from scalable/portable to fine-tuned high performance:
• Classical compiler technology & optimization: loop-transformation theories, grammar changes, compiler AST optimizations, standard compiler optimizations.
• Libraries: BLAS, LINPACK, LAPACK, ScaLAPACK; ATLAS; PVL, Blitz++, MTL; PETE (AST preprocessor).
• Languages with monolithic arrays, compiled and interpreted: Fortran 95, ZPL, C++ with classes/functions/templates, MATLAB.
Caveats: requires highly skilled programmers; partial algebras.]
Outline
• Overview
• Array Algebra: MoA and Index Calculus: Psi Calculus
– Reshape to use Processor/Memory Hierarchy Efficiently: Lift Dimension
– High-Level Monolithic Operations: Remove Temporaries
• Time Domain Convolution
• Benefits of Using MoA and Psi Calculus
PSI Calculus
Basic Properties:
• Index calculus: centers around the psi function.
• Shape-polymorphic functions and operators: operations are defined using shapes and psi.
• Fundamental type is the array, modeled as (shape_vector, components).
• Scalars are 0-dimensional arrays, that is: (empty_vector, scalar_value).
• Denotational Normal Form (DNF) = reduced form in Cartesian coordinates (independent of data layout: row major, column major, regular sparse, …).
• Operational Normal Form (ONF) = reduced form for 1-d memory layout(s).
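As a concrete sketch of these properties, an array can be modeled as a (shape_vector, components) pair, with psi computing a row-major offset; a scalar is the 0-dimensional case. This is my code, and the struct and function names are illustrative, not the authors' notation.

```cpp
#include <vector>
#include <cstddef>

// An MoA array as (shape_vector, components).
struct MoA {
    std::vector<std::size_t> shape;   // shape_vector
    std::vector<int> comp;            // flat components
};

// psi: map a full index to a component via row-major (Cartesian/DNF) strides,
// accumulated Horner-style: ((i0*s1)+i1)*s2 + i2 ...
int psi(const MoA& a, const std::vector<std::size_t>& idx) {
    std::size_t off = 0;
    for (std::size_t d = 0; d < a.shape.size(); ++d)
        off = off * a.shape[d] + idx[d];
    return a.comp[off];
}
```

With an empty shape vector the loop never runs and psi returns the single component, which is exactly the scalar-as-0-dimensional-array case above.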
Psi Reduction
The ONF has the minimum number of reads/writes. PSI Calculus rules are applied mechanically to produce the ONF, which is easily translated to an optimal loop implementation.

A = cat(rev(B), rev(C))  becomes, by “psi” Reduction:

A[i] = B[B.size-1-i]          if 0 ≤ i < B.size
A[i] = C[C.size+B.size-1-i]   if B.size ≤ i < B.size+C.size
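The reduced form can be transcribed directly into one loop. This sketch is mine (assuming integer arrays); it shows that the ONF needs no temporary for rev(B) or rev(C).

```cpp
#include <vector>
#include <cstddef>

// Single-loop ONF of A = cat(rev(B), rev(C)): each source element is read
// once and each destination element written once, with no intermediate array.
std::vector<int> cat_rev_rev(const std::vector<int>& B,
                             const std::vector<int>& C) {
    std::vector<int> A(B.size() + C.size());
    for (std::size_t i = 0; i < A.size(); ++i)
        A[i] = (i < B.size())
             ? B[B.size() - 1 - i]                // first clause of the ONF
             : C[C.size() + B.size() - 1 - i];    // second clause
    return A;
}
```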
Some Psi Calculus Operations
Convolution: PSI Calculus Description
PSI Calculus operators compose to form higher-level operations.

Definition of y = conv(h, x):

y[n] = Σ_{k=0}^{M-1} h[k] · x′[n+k]

where x has N elements, h has M elements, 0 ≤ n < N+M-1, and x′ is x padded by M-1 zeros on either end.

Algorithm and PSI Calculus description, for x = < 1 2 3 4 >, h = < 5 6 7 >:

Algorithm step | Psi Calculus | Result
Initial step | | x = < 1 2 3 4 >, h = < 5 6 7 >
Form x′ | x′ = cat(reshape(<M-1>, <0>), cat(x, reshape(<M-1>, <0>))) | x′ = < 0 0 1 2 3 4 0 0 >
Rotate x′ (N+M-1) times | x′rot = binaryOmega(rotate, 0, iota(N+M-1), 1, x′) | x′rot = < 0 0 1 2 … > < 0 1 2 3 … > < 1 2 3 4 … > …
Take the size-of-h part of each row of x′rot | x′final = binaryOmega(take, 0, reshape(<N+M-1>, <M>), 1, x′rot) | x′final = < 0 0 1 > < 0 1 2 > < 1 2 3 > …
Multiply | Prod = binaryOmega(*, 1, h, 1, x′final) | Prod = < 0 0 7 > < 0 6 14 > < 5 12 21 > …
Sum | Y = unaryOmega(sum, 1, Prod) | Y = < 7 20 38 … >
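The definition can be checked with a direct transcription. This sketch is mine; the slide's own derivation uses the Psi operators above rather than explicit loops. It pads x with M-1 zeros on each side and applies the summation.

```cpp
#include <vector>
#include <cstddef>

// Direct time-domain convolution per the slide's definition:
// y[n] = sum_{k=0}^{M-1} h[k] * x'[n+k], 0 <= n < N+M-1,
// where x' is x padded with M-1 zeros on either end.
std::vector<int> conv(const std::vector<int>& h, const std::vector<int>& x) {
    const std::size_t M = h.size(), N = x.size(), tz = N + M - 1;
    std::vector<int> xp(N + 2 * (M - 1), 0);          // x' : zero-padded x
    for (std::size_t i = 0; i < N; ++i) xp[M - 1 + i] = x[i];
    std::vector<int> y(tz, 0);
    for (std::size_t n = 0; n < tz; ++n)              // one window per output
        for (std::size_t k = 0; k < M; ++k)
            y[n] += h[k] * xp[n + k];
    return y;
}
```

For x = < 1 2 3 4 > and h = < 5 6 7 > this reproduces the Y row of the table: < 7 20 38 56 39 20 >.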
Experimental Platform and Method
Hardware
• DY4 CHAMP-AV Board
  – Contains 4 MPC7400’s and 1 MPC 8420
• MPC7400 (G4)
  – 450 MHz
  – 32 KB L1 data cache
  – 2 MB L2 cache
  – 64 MB memory/processor

Software
• VxWorks 5.2
  – Real-time OS
• GCC 2.95.4 (non-official release)
  – GCC 2.95.3 with patches for VxWorks
  – Optimization flags: -O3 -funroll-loops -fstrict-aliasing

Method
• Run many iterations; report average, minimum, maximum time
  – From 10,000,000 iterations for small data sizes, to 1000 for large data sizes
• All approaches run on same data
• Only average times shown here
• Only one G4 processor used
• Use of the VxWorks OS resulted in very low variability in timing
• High degree of confidence in results
Experiment: Conv(x,h)
• Cost of temporaries in regular C++ approach more pronounced due to large number of operations
• Cost of expression tree manipulation also more pronounced
Convolution and Dimension Lifting
• Model Processor and Level 1 cache.
  – Start with 1-d inputs (input dimension).
  – Envision 2nd dimension ranging over output values.
  – Envision Processors Reshaped into a 3rd dimension. The 2nd dimension is partitioned.
  – Envision Cache Reshaped into a 4th dimension. The 1st dimension is partitioned.
  – “psi” Reduce to Normal Form.
– Envision 2nd dimension ranging over output values.
Let tz = N+M-1, with M = |h| = 4 and N = |x|.
[Figure: a tz × tz view in which each row holds a shifted copy of h3 h2 h1 h0 padded with zeros, one row per output element, applied against x0 … x4.]
– Envision Processors Reshaped into a 3rd dimension. The 2nd dimension is partitioned.
Let p = 2.
[Figure: the tz × tz view is reshaped to 2 × tz/2 × tz, one tz/2 × tz block per processor.]
– Envision Cache Reshaped into a 4th dimension. The 1st dimension is partitioned.
[Figure: each processor’s tz/2 × tz block is further partitioned along its 1st dimension into cache-sized pieces, adding a 4th dimension.]
ONF for the Convolution Decomposition with Processors & Cache
Generic form: 4-dimensional, after “psi” Reduction

1. for i0 = 0 to p-1 do:
2.   for i1 = 0 to tz/p - 1 do:
3.     sum ← 0
4.     for icacherow = 0 to M/cache - 1 do:
5.       for i3 = 0 to cache - 1 do:
6.         sum ← sum + h[(M - ((icacherow × cache) + i3)) - 1] × x′[((tz/p) × i0) + i1 + (icacherow × cache) + i3]
Let tz = N+M-1, M = |h|, N = |x|. (Time Domain)
Loop roles: i0 is the Processor loop, i1 the Time loop, icacherow and i3 the Cache loops.
sum is calculated for each element of y.
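The four-loop ONF can be transcribed into runnable form. This sketch is mine: it assumes p divides tz and cache divides M, and it stores each sum into y rather than streaming it out. Note that this ONF accumulates h in reverse index order.

```cpp
#include <vector>
#include <cstddef>

// Runnable transcription of the slide's ONF: i0 ranges over processors,
// i1 over the outputs owned by each processor, and the two inner loops
// tile h into Level-1-cache-sized rows.
std::vector<int> conv_onf(const std::vector<int>& h, const std::vector<int>& x,
                          std::size_t p, std::size_t cache) {
    const std::size_t M = h.size(), N = x.size(), tz = N + M - 1;
    std::vector<int> xp(N + 2 * (M - 1), 0);                  // x'
    for (std::size_t i = 0; i < N; ++i) xp[M - 1 + i] = x[i];
    std::vector<int> y(tz, 0);
    for (std::size_t i0 = 0; i0 < p; ++i0)                    // processor loop
        for (std::size_t i1 = 0; i1 < tz / p; ++i1) {         // time loop
            int sum = 0;
            for (std::size_t ic = 0; ic < M / cache; ++ic)    // cache-row loop
                for (std::size_t i3 = 0; i3 < cache; ++i3)    // cache loop
                    sum += h[(M - (ic * cache + i3)) - 1]
                         * xp[(tz / p) * i0 + i1 + ic * cache + i3];
            y[(tz / p) * i0 + i1] = sum;                      // one element of y
        }
    return y;
}
```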
Outline
• Overview
• Array Algebra: MoA and Index Calculus: Psi Calculus
• Time Domain Convolution
• Other algorithms in Radar
  – Modified Gram-Schmidt QR Decomposition: MoA to ONF Experiments
  – Composition of Matrix Multiplication in Beamforming: MoA to DNF Experiments
  – FFT
• Benefits of Using MoA and Psi Calculus
Algorithms in Radar
[Figure: the MoA & Psi Calculus workflow applied to Time Domain Convolution (x,y), Modified Gram Schmidt QR (A), and Beamforming A x (BH x C): a manual description & derivation for 1 processor yields the DNF; lifting dimension (processor, L1 cache) and reformulating yields the ONF for 1 processor; the DNF-to-ONF step is mechanized using expression templates and compiler optimizations; the DNF/ONF are implemented in Fortran 90, benchmarked at NCSA with LAPACK, used to reason about RAW, and inform thoughts on an abstract machine.]
ONF for the QRDecomposition with Processors & Cache
Modified Gram Schmidt
[Figure: loop structure of the derived form: Initialization, then a Main Loop containing Processor Loops and Processor/Cache Loops for the steps Compute Norm, Normalize, Dot Product, and Orthogonalize.]
DNF for the Composition of A x (BH x C)
Generic form: 4-dimensional

1. Z = 0
2. for i = 0 to n-1 do:
3.   for j = 0 to n-1 do:
4.     for k = 0 to n-1 do:
5.       z[k;] ← z[k;] + A[k;j] × X[j;i] × B[i;]

Given A, B, X, Z: n by n arrays. (Beamforming)
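A runnable transcription of this DNF (my sketch, using the slide's A, X, B, Z names, with an extra loop m spelling out the row operation z[k;]):

```cpp
#include <vector>
#include <cstddef>

using Mat = std::vector<std::vector<int>>;

// Fused triple product: Z[k;] accumulates A[k;j] * X[j;i] * B[i;] directly,
// so the intermediate matrix product is never materialized as a temporary.
Mat fused_triple_product(const Mat& A, const Mat& X, const Mat& B) {
    const std::size_t n = A.size();
    Mat Z(n, std::vector<int>(n, 0));                 // 1. Z = 0
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            for (std::size_t k = 0; k < n; ++k)
                for (std::size_t m = 0; m < n; ++m)   // row operation z[k;]
                    Z[k][m] += A[k][j] * X[j][i] * B[i][m];
    return Z;
}
```

Because every product is accumulated straight into Z, no temporary array for the inner product is created, which is the point of composing the two multiplications as one DNF.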
Fftpsirad2: Performance Comparisons
Mechanizing MoA and Psi Reduction
• Index theory introduced: Abrams 1972
• MoA & calculus theory: Mullin ’88
• Prototype compiler, output C, F90, HPF: Mullin and Thibault ’94
• HPF compiler, AST manipulations: Mullin, et al. ’96
• SAC, functional C: Mullin and Bodo ’96
• C++ classes: Helal, Sameh, and Mullin ’01
• C++ expression templates: Mullin, Ruttledge, Bond ’02
• PVL with the Portable Expression Template Engine (PETE)
• Parallel and distributed processing
• Abstract machine
• Automate cost and determine optimizations: minimize search space
• Lifting compiler optimizations to the application programmer interface
• Theory applied to embedded systems
On-going research
• We are implementing the Psi Calculus using expression templates.
• We are building on work done at MIT, and we are working with MTL library developers (Lumsdaine) at Indiana University and the STL library developer, Musser, at RPI.
Benefits of Using MoA and Psi Calculus
• Processor/Memory Hierarchy can be modeled by reshaping data, using an extra dimension for each level.
• Composition of monolithic operations can be re-expressed as composition of operations on smaller data granularities
  – Matches memory hierarchy levels
  – Avoids materialization of intermediate arrays.
• Algorithm can be automatically (algebraically) transformed to reflect the array reshapings above.
• Facilitates programming expressed at a high level
  – Facilitates intentional program design and analysis
  – Facilitates portability
• This approach is applicable to many other problems in radar.
Email and Questions
• Lenore R. Mullin, lenore@cs.albany.edu • Daniel J. Rosenkrantz, djr@cs.albany.edu
• Harry B. Hunt III, hunt@cs.albany.edu
• Xingmin Luo, xluo@cs.albany.edu
*The End*
Typical C++ Operator Overloading
Example: A = B + C (vector add)
[Figure: Main passes B&, C& to operator+, which returns a temp copy; the temp copy & is passed to operator=, which assigns into A.]

1. Pass B and C references to operator+
2. Create temporary result vector
3. Calculate results, store in temporary
4. Return copy of temporary
5. Pass results reference to operator=
6. Perform assignment

2 temporary vectors created.

Additional Memory Use:
• Static memory
• Dynamic memory (also affects execution time)
• Cache misses/page faults

Additional Execution Time:
• Time to create a new vector
• Time to create a copy of a vector
• Time to destruct both temporaries
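The six steps can be seen in a conventional overload. This sketch is mine, not the deck's benchmark code; it shows why this style must materialize a whole temporary result vector before the assignment happens.

```cpp
#include <vector>
#include <cstddef>

// Conventional operator overloading: operator+ builds and returns an entire
// temporary vector, which operator= then copies into the destination.
std::vector<double> operator+(const std::vector<double>& b,
                              const std::vector<double>& c) {
    std::vector<double> temp(b.size());          // temporary result vector
    for (std::size_t i = 0; i < b.size(); ++i)
        temp[i] = b[i] + c[i];
    return temp;                                 // copied/moved out to the caller
}
```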
C++ Expression Templates and PETE
Parse trees, not vectors, created.

Example: A = B + C. The expression type (parse tree) built by the expression templates is:
BinaryNode<OpAdd, Reference<Vector>, Reference<Vector> >
[Figure: Main passes B&, C& to operator+; the parse tree “+” over B and C is returned by reference and passed to operator=.]

1. Pass B and C references to operator+
2. Create expression parse tree
3. Return expression parse tree
4. Pass expression tree reference to operator=
5. Calculate result and perform assignment

Reduced Memory Use:
• Parse tree only contains references

Reduced Execution Time:
• Better cache use
• Loop-fusion-style optimization
• Compile-time expression tree manipulation

PETE: http://www.acl.lanl.gov/pete
• PETE, the Portable Expression Template Engine, is available from the Advanced Computing Laboratory at Los Alamos National Laboratory
• PETE provides:
  – Expression template capability
  – Facilities to help navigate and evaluate parse trees
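A minimal expression-template sketch (mine, far simpler than PETE, supporting only a single + node) shows the same five steps: operator+ returns a lightweight parse-tree node holding references, and operator= evaluates it in one fused loop with no temporary vector.

```cpp
#include <vector>
#include <cstddef>

struct Vec;

// Parse-tree node for B + C: holds references only, no element storage.
struct AddNode {
    const Vec& l;
    const Vec& r;
    double operator[](std::size_t i) const;   // defined once Vec is complete
};

struct Vec {
    std::vector<double> d;
    explicit Vec(std::size_t n, double v = 0) : d(n, v) {}
    double operator[](std::size_t i) const { return d[i]; }
    Vec& operator=(const AddNode& e) {        // single fused evaluation loop
        for (std::size_t i = 0; i < d.size(); ++i) d[i] = e[i];
        return *this;
    }
};

inline double AddNode::operator[](std::size_t i) const { return l[i] + r[i]; }

// operator+ creates the parse tree, not a result vector.
inline AddNode operator+(const Vec& a, const Vec& b) { return AddNode{a, b}; }
```

PETE generalizes this idea to arbitrary nested expressions via templated BinaryNode types built at compile time.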
Implementing Psi Calculus with Expression Templates
Example: A = take(4, drop(3, rev(B)))
B = < 1 2 3 4 5 6 7 8 9 10 >, A = < 7 6 5 4 >

Recall: Psi Reduction for 1-d arrays always yields one or more expressions of the form:
x[i] = y[stride*i + offset],  l ≤ i < u

1. Form the expression tree: take(4, drop(3, rev(B))).
2. Add size information: B size=10, rev size=10, drop size=7, take size=4.
3. Apply Psi Reduction rules, from the leaves up:
   size=10: A[i] = B[i]
   size=10: A[i] = B[-i + B.size - 1] = B[-i + 9]
   size=7:  A[i] = B[-(i+3) + 9] = B[-i + 6]
   size=4:  A[i] = B[-i + 6]
4. Rewrite as sub-expressions with iterators at the leaves and loop-bounds information at the root: size=4, iterator: offset=6, stride=-1.

• Iterators used for efficiency, rather than recalculating indices for each i
• One “for” loop to evaluate each sub-expression
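The reduced form for this example collapses to one loop over the single expression A[i] = B[stride*i + offset]. This sketch is mine; it hard-codes the derived stride = -1 and offset = B.size-1-3 = 6 for the take(4, drop(3, rev(B))) example above, so it assumes B has at least 7 elements.

```cpp
#include <vector>
#include <cstddef>

// Reduced form of A = take(4, drop(3, rev(B))): a single loop over
// A[i] = B[stride*i + offset], with no intermediate array for rev or drop.
std::vector<int> take_drop_rev(const std::vector<int>& B) {
    const int offset = static_cast<int>(B.size()) - 1 - 3;   // 9 - 3 = 6
    const int stride = -1;
    std::vector<int> A(4);
    for (int i = 0; i < 4; ++i)
        A[i] = B[stride * i + offset];
    return A;
}
```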