platform-independent description of imaging · computer science x - system simulation group harald...
TRANSCRIPT
Computer Science X - System Simulation Group
Harald Köstler ([email protected])
Platform-independent
description of imaging
algorithms
H. Köstler
Salt Lake City, 17.3.2015
1
Computer Science X - System Simulation Group
Harald Köstler ([email protected])
ExaStencils Project
2
Christian Lengauer
Armin Größlinger
Stefan Kronawitter
Sven Apel
Alexander GrebhahnMatthias Bolten
Hannah Rittich
Ulrich Rüde.
Harald Köstler
Sebastian Kuckuk
Jürgen Teich.
Frank Hannig
Christian Schmitt
A unique, tool-assisted, domain-specific
co-design approach for the class of
stencil codes
http://www.exastencils.org/
Computer Science X - System Simulation Group
Harald Köstler ([email protected])
Problems in High Performance Computing
Hardware: Modern HPC platforms are massively
parallel
Intra-core, intra-node, and inter-node
Software: Imaging applications become more complex
with increasing computational power
More complex models
Code development in interdisciplinary teams
Algorithm: Class of different algorithms grows, many of
them are just a general idea (like multigrid)
Components and parameters depend on image, type of problem, …
Computer Science X - System Simulation Group
Harald Köstler ([email protected])
High Performance Computing: Applications
real-time imaging e.g. medicine
Large-scalesimulation
e.g. multi-physics
Computer Science X - System Simulation Group
Harald Köstler ([email protected])
Variational Imaging
General Problem: Find mapping , such that
the energy functional
is minimized.
5
Computer Science X - System Simulation Group
Harald Köstler ([email protected])
State of the Art: Application-driven Projects
6
Mathematician
Hardware specialist
Software specialist
User from application field Description of application
Solution method
Parallel implementation and framework
Efficient implementation on specific hardware
Computer Science X - System Simulation Group
Harald Köstler ([email protected])
Proposed: Domain-driven Projects
7
Mathematician
Hardware specialist
Software specialist
Users from different
application fieldsDescription of application in domain specific language
Automatic selection of algorithmic components
Code generation for specific application
Automatic tuning on specific hardware
Domain knowledgeFeature Model
Domain expert
PDE
Operators::Laplacian(Data::solution) = Data::rhs
Computer Science X - System Simulation Group
Harald Köstler ([email protected])
Aspects
Code generation for imaging applications
High dynamic range compression
Image registration / Optical Flow
Image segmentation
Image denoising
Domain-specific language design
Domain-specific knowledge representation and optimization
Efficient Algorithms
Parallelization
Performance Tuning
8
Computer Science X - System Simulation Group
Harald Köstler ([email protected])
ExaStencils – Overview
• DSL as intuitive
interface to the user
• Automatic deduction of
configuration if desired
• Prediction and
Optimization of the
configuration’s
performance using
SPL and LFA
• Code generation in
Scala
• Automatic hardware-
specific optimizations
11
Computer Science X - System Simulation Group
Harald Köstler ([email protected])
ExaStencils – Layers
1210
Computer Science X - System Simulation Group
Harald Köstler ([email protected])
DSL Scope
11
dimension 2D 3DDomain UnitSquare UnitCubeSolution Scalar VectorOperator Stencil weak form nonlinearBoundary Dirichlet Neumann periodic
Architecture CPU GPULanguages C++ Scala CUDA OpenCLParallelization MPI OpenMP
discrete domain regularGrid nodebased cellbasedoperator discretization FD FE FVdata types float double complexdiscretization accuracy 2 4
Computer Science X - System Simulation Group
Harald Köstler ([email protected])
DSL EXAMPLE
High Dynamic Range compression
12
Computer Science X - System Simulation Group
Harald Köstler ([email protected])
Application: High Dynamic Range Compression
Data: Siemens AG. Healthcare Sector
Sequences of 2D x-ray images
ComputationalSteering
Visualization withOpenGL
Köstler. et al. Performance engineering to achieve real-time high dynamic range imaging.J Real-Time Image Processing. 2013.
Computer Science X - System Simulation Group
Harald Köstler ([email protected])
Optimized HDR Compression (size 2048x2048)
0
20
40
60
80
100
120
140
GTX 295/2 GTX 480 GTX 480(wavefront)
fps
half of an NVIDIA GTX 295
112 GB/s peak bandwidth
compute capability 1.3
NVIDIA GTX 480
177 GB/s peak bandwidth
compute capability 2.0 (Fermi)
14
Computer Science X - System Simulation Group
Harald Köstler ([email protected])
Selection of Multigrid Components: LFA Toolbox
15
Goal: minimize time T to reach prescribed accuracy κ
Asymptotic convergence rates are estimated via LFA
Computer Science X - System Simulation Group
Harald Köstler ([email protected])
Hardware description (DSL level Hardware)
16
Hardware cpu
bandwidth = 40
peak = 30
cores = 4
Node n
sockets = 2
Computer Science X - System Simulation Group
Harald Köstler ([email protected])
HDR Compression
Idea: Modify magnitude of image gradient by position-dependent attenuating function
Energy functional
Solve by multigrid the Euler-Lagrange equation
RR 2:
ICFattal/Lischinski/Werman. Gradient Domain High Dynamic Range Compression. SIGGRAPH. 2002
xdCxuuEu
2
)(min)(
onu
infu
0
17
Computer Science X - System Simulation Group
Harald Köstler ([email protected])
What is Multigrid?
Goal: Solve partial differential equation
After discretization one requires an efficient iterative solver for sparse systems
multigrid solver has complexity O(N) in number of unknowns N
18
hh fAu
onu
infu
0Ω
∂Ω
Computer Science X - System Simulation Group
Harald Köstler ([email protected])
Problem Description (continuous)
19
Domain omega = UnitSquare
f : omega -> R^1
u : omega -> R^1
Laplacian : ( omega -> R^1 ) -> ( omega -> R^1 )
Laplacian = dx^2 + dy^2
pde : Laplacian [ u ] = f in omega
bc : u = 0 in partial_omega
Computer Science X - System Simulation Group
Harald Köstler ([email protected])
Generation of discrete problem
Domain-specific knowledge
Discretization methods FD, FV, FE
Types of operators supported result in sparse matrices
Domain-specific optimization chooses type of discretization and e.g. concrete data types
Description is parsed, an abstract syntax tree is constructed and then transformed into a discrete representation of the problem
Code generation framework is implemented in Scala language
20
Computer Science X - System Simulation Group
Harald Köstler ([email protected])
Problem Description (discrete)
21
Fragments f1 = Regular_Square
Discrete_Domain omega levels 8
xsize [0] = 1024 // from Image
ysize [0] = 1024 // from Image
xsize [l+1] = xsize [l] / 2
ysize [l+1] = ysize [l] / 2
Field<Double,1>@nodes f
Field<Double,1>@nodes u
StencilMatrix<Double,1,1, FD, 2>@nodes Laplacian
Computer Science X - System Simulation Group
Harald Köstler ([email protected])
Multigrid Algorithm
22
Computer Science X - System Simulation Group
Harald Köstler ([email protected])
Algorithmic Parameters and Components
23
mgcomponents
solver = multigrid
cycletype=Vcycle
mgparameter
nprae = 2
npost = 1
iters = 10
Computer Science X - System Simulation Group
Harald Köstler ([email protected])
Layer 4 Functions – Example
24
function VCycle@coarsest ( ) : Unit /* coarse grid solver */
function VCycle@((coarsest + 1) to finest) ( ) : Unit
repeat up 1
Smoother@current ( )
UpResidual@current ( )
Restriction@current ( )
SetSolution@coarser ( 0 )
VCycle@coarser ( )
Correction@current ( )
repeat up 2
Smoother@current ( )
Computer Science X - System Simulation Group
Harald Köstler ([email protected])
Example: Jacobi Smoother
25
function Smoother@((coarsest + 1) to finest) ( ) : Unit
communicate Solution[0]@(current)
loop over inner on Solution@(current)
Solution[1]@current =
Solution[0]@current
+ ( ( ( 1.0 / diag ( Lapl@(current) ) ) * 0.8 )
* ( RHS@current - Lapl@(current) * Solution[0]@current ) )
…
Computer Science X - System Simulation Group
Harald Köstler ([email protected])
Jacobi Smoother – Resulting Code (w/o basic Opt)
26
#include "MultiGrid/MultiGrid.h"
void Smoother_4()
exchsolutionData_4(0);
#pragma omp parallel for schedule(static) num_threads(8)
for (int fragmentIdx = 0; fragmentIdx < 8; ++fragmentIdx)
if (isValidForSubdomain[fragmentIdx][0])
for (int y = iterationOffsetBegin[fragmentIdx][0][1];
y < (iterationOffsetEnd[fragmentIdx][0][1]+17); y +=1)
for (int x = iterationOffsetBegin[fragmentIdx][0][0];
x < (iterationOffsetEnd[fragmentIdx][0][0]+17); x +=1)
slottedFieldData_Solution[1][fragmentIdx][4][(((y*19)+19)+(x+1))] =
(slottedFieldData_Solution[0][fragmentIdx][4][(((y*19)+19)+(x+1))]
+(((1.0e+00/fieldData_LaplCoeff[fragmentIdx][4][((y*17)+x)])*8.0e-01)
*(fieldData_RHS[fragmentIdx][4][((y*17)+x)]
-(((((fieldData_LaplCoeff[fragmentIdx][4][((y*17)+x)]
*slottedFieldData_Solution[0][fragmentIdx][4][(((y*19)+19)+(x+1))])
+(fieldData_LaplCoeff[fragmentIdx][4][(((y*17)+289)+x)]
*slottedFieldData_Solution[0][fragmentIdx][4][(((y*19)+19)+(x+2))]))
+(fieldData_LaplCoeff[fragmentIdx][4][(((y*17)+578)+x)]
*slottedFieldData_Solution[0][fragmentIdx][4][(((y*19)+19)+x)]))
+(fieldData_LaplCoeff[fragmentIdx][4][(((y*17)+867)+x)]
*slottedFieldData_Solution[0][fragmentIdx][4][(((y*19)+38)+(x+1))]))
+(fieldData_LaplCoeff[fragmentIdx][4][(((y*17)+1156)+x)]
*slottedFieldData_Solution[0][fragmentIdx][4][((y*19)+(x+1))])))));
…
Computer Science X - System Simulation Group
Harald Köstler ([email protected])
Jacobi Smoother – Resulting Code (w/ basic Opt)
27
#include "MultiGrid/MultiGrid.h"
void Smoother_4()
exchsolutionData_4(0);
#pragma omp parallel for schedule(static) num_threads(8)
for (int fragmentIdx = 0; fragmentIdx < 8; ++fragmentIdx)
if (isValidForSubdomain[fragmentIdx][0])
for (int c0 = iterationOffsetBegin[fragmentIdx][0][1];
(c0<=(iterationOffsetEnd[fragmentIdx][0][1]+16)); c0 = (c0+1))
double* slottedFieldData_Solution_1_fragmentIdx_4_p1 =
&(slottedFieldData_Solution[1][fragmentIdx][4][(19*c0)]);
double* fieldData_RHS_fragmentIdx_4_p1 = &(fieldData_RHS[fragmentIdx][4][(17*c0)]);
double* slottedFieldData_Solution_0_fragmentIdx_4_p1 = …
double* fieldData_LaplCoeff_fragmentIdx_4_p1 = …
for (int c1 = iterationOffsetBegin[fragmentIdx][0][0];
(c1<=(iterationOffsetEnd[fragmentIdx][0][0]+16)); c1 = (c1+1))
slottedFieldData_Solution_1_fragmentIdx_4_p1[(c1+20)] =
(slottedFieldData_Solution_0_fragmentIdx_4_p1[(c1+20)]
+(((1.0e+00/fieldData_LaplCoeff_fragmentIdx_4_p1[c1])*8.0e-01)*(fieldData_RHS_fragmentIdx_4_p1[c1]
-(((((fieldData_LaplCoeff_fragmentIdx_4_p1[c1]*slottedFieldData_Solution_0_fragmentIdx_4_p1[(c1+20)])
+(fieldData_LaplCoeff_fragmentIdx_4_p1[(c1+289)]*slottedFieldData_Solution_0_fragmentIdx_4_p1[(c1+21)]))
+(fieldData_LaplCoeff_fragmentIdx_4_p1[(c1+578)]*slottedFieldData_Solution_0_fragmentIdx_4_p1[(c1+19)]))
+(fieldData_LaplCoeff_fragmentIdx_4_p1[(c1+867)]*slottedFieldData_Solution_0_fragmentIdx_4_p1[(c1+39)]))
+(fieldData_LaplCoeff_fragmentIdx_4_p1[(c1+1156)]*slottedFieldData_Solution_0_fragmentIdx_4_p1[(c1+1)])))));
…
Computer Science X - System Simulation Group
Harald Köstler ([email protected])
Denoising by Diffusion
Idea:
Use nonlinear anisotropic diffusion process to denoise the image
u0 in the domain Ω ,i.e. solve the time-dependent PDE
28
inxuxu
Tonnug
Tint
uugdiv
)()0,(
0,
)(
0
Computer Science X - System Simulation Group
Harald Köstler ([email protected])
Runtime Results for Different Problems
29
cpu gpu
1 8 256
Double Float Double Float Double Float
Lap
lace
2D Jacobi 546 1.17 3.50 3.50 6.42 9.10
GaussSeidel 453 1.16 2.42 3.24 3.41 4.72
3D Jacobi 608 1.08 2.60 3.00 6.83 10.31
GaussSeidel 608 1.08 2.60 3.01 4.22 5.96
Co
mp
lex
D
iffu
sio
n 2D Jacobi 9235 1.02 2.81 2.87 23.99 33.58
GaussSeidel 7504 1.11 2.18 2.29 11.62 16.42
3D Jacobi 8799 1.15 2.77 2.91 15.94 39.28
GaussSeidel 9048 1.80 2.60 2.87 10.28 24.19
ms speedup speedup speedup speedup speedup
Image Sizes 4096x4096
resp. 256x256x256
Computer Science X - System Simulation Group
Harald Köstler ([email protected])
Future Work
Imaging applications (tested)
High dynamic range compression
Image denoising
Optical Flow
Imaging applications (next)
Image registration
Image segmentation
30
Computer Science X - System Simulation Group
Harald Köstler ([email protected])
Acknowledgements
• Funded by
• Bundesministerium für Bildung und Forschung
• KONWIHR. Bavarian project
• DFG SPP 1648/1 – Software for Exascale computing
http://www.exastencils.org/