Scientific Computations on Modern Parallel Vector Systems

Leonid Oliker
Julian Borrill, Jonathan Carter, Andrew Canning, John Shalf, David Skinner
Lawrence Berkeley National Laboratory

Stephane Ethier
Princeton Plasma Physics Laboratory

http://crd.lbl.gov/~oliker
Architectural Comparison
Node Type  Where  CPU/Node  Clock (MHz)  Peak (GFlop/s)  Mem BW (GB/s)  Peak (byte/flop)  Netwk BW (GB/s/P)  Bisect BW (byte/flop)  MPI Latency (usec)  Network Topology
Power3     NERSC     16        375            1.5             1.0            0.47              0.13                 0.087                16.3             Fat-tree
Power4     ORNL      32       1300            5.2             2.3            0.44              0.13                 0.025                 7.0             Fat-tree
Altix      ORNL       2       1500            6.0             6.4            1.1               0.40                 0.067                 2.8             Fat-tree
ES         ESC        8        500            8.0            32.0            4.0               1.5                  0.19                  5.6             Crossbar
X1         ORNL       4        800           12.8            34.1            2.7               6.3                  0.088                 7.3             2D-torus
• Custom vector architectures have high memory bandwidth relative to peak and superior interconnects: latency, point-to-point bandwidth, and bisection bandwidth
• Overall, the ES appears to be the most balanced architecture, while the Altix shows the best balance among the superscalar architectures
• A key 'balance point' for vector systems is the scalar:vector ratio
Applications studied
Code      Discipline        Lines    Structure             Description
LBMHD     Plasma Physics    1,500    grid-based            Lattice Boltzmann approach for magneto-hydrodynamics
CACTUS    Astrophysics      100,000  grid-based            Solves Einstein's equations of general relativity
PARATEC   Material Science  50,000   Fourier space/grid    Density Functional Theory electronic structure code
GTC       Magnetic Fusion   5,000    particle-based        Particle-in-cell method for the gyrokinetic Vlasov-Poisson equation
MADbench  Cosmology         2,000    dense linear algebra  Maximum-likelihood two-point angular correlation, I/O intensive
• Applications chosen with the potential to run at ultrascale; computations contain abundant data parallelism
• ES runs require minimal parallelization and vectorization hurdles
• Codes were originally designed for superscalar systems; each was ported onto a single node of the SX-6, and the first multi-node experiments were performed at the ESC
Plasma Physics: LBMHD
• LBMHD uses a Lattice Boltzmann method to model magneto-hydrodynamics (MHD)
• Performs a 2D simulation of high-temperature plasma, evolving from initial conditions and decaying to form current sheets
• The 2D spatial grid is coupled to an octagonal streaming lattice and block-distributed over a 2D processor grid
• Main computational components:
  - Collision: requires coefficients for the local gridpoint only, no communication
  - Stream: values at gridpoints are streamed to neighbors; at cell boundaries information is exchanged via MPI
  - Interpolation: required between the spatial and stream lattices
• Developed by George Vahala's group at the College of William and Mary; ported by Jonathan Carter
Current density decays of two cross-shaped structures
LBMHD: Porting Details
• Collision routine rewritten: for the ES, loop ordering was switched so the gridpoint loop (~1000 iterations) is innermost rather than the velocity or magnetic-field loops (~10 iterations); see the sketch below
• The X1 compiler made this transformation automatically: multistreaming the outer loop and vectorizing (via strip mining) the inner loop
• Temporary arrays were padded to reduce bank conflicts
• Stream routine performs well: array shift operations, block copies, 3rd-degree polynomial evaluation
• Boundary value exchange uses MPI_Isend/MPI_Irecv pairs
• Further work: plan to use ES "global memory" to remove message copies
(left) octagonal streaming lattice coupled with square spatial grid
(right) example of diagonal streaming vector updating three spatial cells
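Below is a minimal C sketch of the collision-loop reordering, assuming a simplified BGK-style relaxation kernel; the names, sizes, and kernel body are illustrative, not LBMHD's actual code.

/* Collision kernel with the ES loop ordering: the short velocity loop
 * (~10 iterations) is outermost and the long gridpoint loop (~1000
 * iterations) is innermost, giving the vector units long, independent
 * inner iterations. */
#define NGRID 1000  /* gridpoints per block: long, vectorizable */
#define NVEL  9     /* octagonal lattice: 8 streaming directions + rest */

void collision(double f[NVEL][NGRID], const double feq[NVEL][NGRID],
               double omega)
{
    for (int v = 0; v < NVEL; v++)         /* short loop outside */
        for (int i = 0; i < NGRID; i++)    /* long loop inside: vectorizes */
            f[v][i] += omega * (feq[v][i] - f[v][i]);  /* BGK-style relaxation */
}

On the X1, the compiler reaches the same structure automatically by multistreaming the outer loop and strip-mining the inner loop for the vector pipes.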
Material Science: PARATEC
• PARATEC performs first-principles quantum mechanical total-energy calculations using pseudopotentials and a plane-wave basis set
• Uses Density Functional Theory (DFT) to calculate the structure and electronic properties of new materials
• DFT calculations are among the largest consumers of supercomputer cycles in the world
• Uses an all-band conjugate gradient (CG) approach to obtain the wavefunctions of the electrons
• Roughly 33% 3D FFT, 33% BLAS3, and 33% hand-coded F90; part of the calculation is done in real space and the rest in Fourier space
• Uses a specialized 3D FFT to transform the wavefunctions
• Computationally intensive; generally obtains a high percentage of peak
• Developed by Andrew Canning with Louie's and Cohen's groups (UCB, LBNL)
Induced current and charge density in crystallized glycine
PARATEC: Wavefunction Transpose
• Transposes the wavefunctions from Fourier space to real space
• The 3D FFT is done via 3 sets of 1D FFTs and 2 transposes; a sketch follows below
• Most communication occurs in the global transpose from (b) to (c); there is little communication from (d) to (e)
• Many FFTs are done at the same time to avoid latency issues
• Only non-zero elements are communicated/calculated
• Much faster than the vendor-supplied 3D FFT
[Figure: wavefunction transpose stages (a)-(f)]
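A minimal C sketch of the three-sets-of-1D-FFTs-plus-transposes structure, assuming a small cubic grid held on one node; the naive O(N^2) DFT below is a stand-in for a tuned 1D FFT library call, and nothing here is PARATEC's actual code.

#include <complex.h>
#include <math.h>
#include <string.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

#define N 8   /* small demo size */

/* Stand-in for a tuned 1D FFT: naive O(N^2) DFT along one pencil. */
static void dft_1d(double complex *row)
{
    double complex out[N];
    for (int k = 0; k < N; k++) {
        out[k] = 0;
        for (int j = 0; j < N; j++)
            out[k] += row[j] * cexp(-2.0 * M_PI * I * j * k / N);
    }
    memcpy(row, out, sizeof out);
}

/* Rotate axes (x,y,z) -> (y,z,x) so the next axis becomes contiguous.
 * In the distributed code this is the global transpose, where most of
 * the communication occurs; performing many independent FFTs per
 * message amortizes the latency. */
static void rotate_axes(double complex a[N][N][N])
{
    static double complex t[N][N][N];
    for (int x = 0; x < N; x++)
        for (int y = 0; y < N; y++)
            for (int z = 0; z < N; z++)
                t[y][z][x] = a[x][y][z];
    memcpy(a, t, sizeof t);
}

/* 3D FFT = 3 sets of 1D FFTs with axis rotations in between; the third
 * rotation simply restores the original orientation, matching the
 * slide's count of 2 essential transposes. */
void fft_3d(double complex a[N][N][N])
{
    for (int pass = 0; pass < 3; pass++) {
        for (int x = 0; x < N; x++)
            for (int y = 0; y < N; y++)
                dft_1d(a[x][y]);        /* transform the contiguous axis */
        rotate_axes(a);
    }
}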
Astrophysics: CACTUS
• Numerical solution of Einstein's equations from the theory of general relativity
• Among the most complex equations in physics: a set of coupled nonlinear hyperbolic and elliptic systems with thousands of terms
• CACTUS evolves these equations to simulate high gravitational fluxes, such as the collision of two black holes
• Evolves the PDEs on a regular grid using finite differences; a stencil sketch follows below
• Uses the ADM formulation: the domain is decomposed into 3D hypersurfaces for different slices of space along the time dimension
• An exciting new field is about to be born: gravitational wave astronomy, providing fundamentally new information about the Universe
• Gravitational waves: ripples in spacetime curvature, caused by the motion of matter, that cause distances to change
• Developed at the Max Planck Institute; vectorized by John Shalf
Visualization of grazing collision of two black holes
• Communication only at boundaries, so high parallel efficiency is expected
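A minimal C sketch of this regular-grid pattern, using a simple heat-equation-style 5-point stencil as a stand-in for the far more complex Einstein evolution equations; names and sizes are illustrative.

#define NX 66   /* 64 interior points + 2 ghost zones per dimension */
#define NY 66

/* In the parallel code this is an MPI halo exchange with neighboring
 * processors; here, periodic copies stand in for it. */
static void exchange_ghost_zones(double u[NX][NY])
{
    for (int j = 0; j < NY; j++) { u[0][j] = u[NX-2][j]; u[NX-1][j] = u[1][j]; }
    for (int i = 0; i < NX; i++) { u[i][0] = u[i][NY-2]; u[i][NY-1] = u[i][1]; }
}

/* One explicit finite-difference step: each interior point is updated
 * from its four neighbors, so only the one-deep ghost layer must be
 * communicated per step -- the reason high parallel efficiency is
 * expected. */
void evolve_step(double u[NX][NY], double unew[NX][NY], double dt, double dx)
{
    exchange_ghost_zones(u);
    double c = dt / (dx * dx);
    for (int i = 1; i < NX - 1; i++)
        for (int j = 1; j < NY - 1; j++)
            unew[i][j] = u[i][j] + c * (u[i+1][j] + u[i-1][j]
                                      + u[i][j+1] + u[i][j-1] - 4.0 * u[i][j]);
}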
Magnetic Fusion: GTC
• Gyrokinetic Toroidal Code: models the transport of thermal energy (plasma microturbulence)
• The goal of magnetic fusion is a burning-plasma power plant producing cleaner energy
• GTC solves the 3D gyroaveraged gyrokinetic system with a particle-in-cell (PIC) approach
• PIC scales as N instead of N^2: particles interact with the electromagnetic field on a grid rather than with each other directly
• Allows solving the equations of particle motion as ODEs (instead of nonlinear PDEs)
• Main computational tasks:
  - Scatter: deposit particle charge to the nearest grid points
  - Solve: the Poisson equation to obtain the potential at each point
  - Gather: calculate the force on each particle from the neighboring potentials
  - Move: advance particles by solving the equations of motion
  - Shift: migrate particles that have moved outside the local domain
3D visualization of electrostatic potential in a magnetic fusion device
• Developed at Princeton Plasma Physics Laboratory; vectorized by Stephane Ethier
GTC: Scatter Operation
• Particle charge is deposited amongst the nearest grid points; the force is then calculated from the neighbors' potentials, and the particle is moved accordingly
• Several particles can contribute to the same grid points, resulting in memory conflicts (dependencies) that prevent vectorization
• Solution: keep VLEN copies of the charge-deposition array, with a reduction after the main loop (sketched below)
• However, this greatly increases the memory footprint (8X)
• Since particles are randomly localized, the scatter also hinders cache reuse
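A minimal C sketch of the VLEN-copies technique; the 1D grid, sizes, and nearest-gridpoint deposition are illustrative simplifications of GTC's toroidal geometry.

#include <string.h>

#define VLEN  256     /* hardware vector length (256 on the ES) */
#define NGRID 1024    /* illustrative 1D grid */
#define NPART 100000

/* VLEN private copies of the deposition array: this is the memory
 * footprint increase noted above (~8X overall in GTC). */
static double rho_priv[VLEN][NGRID];

void scatter(const double *x, const double *q, double *rho)
{
    memset(rho_priv, 0, sizeof rho_priv);

    /* Each lane l = i % VLEN writes only its own copy, so any VLEN
     * consecutive iterations touch distinct arrays -- the write
     * dependency that prevented vectorization is gone. */
    for (int i = 0; i < NPART; i++) {
        int cell = (int)x[i];            /* nearest gridpoint; x in [0, NGRID) */
        rho_priv[i % VLEN][cell] += q[i];
    }

    /* Reduction after the main loop collapses the private copies. */
    for (int g = 0; g < NGRID; g++) {
        double s = 0.0;
        for (int l = 0; l < VLEN; l++)
            s += rho_priv[l][g];
        rho[g] = s;
    }
}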
Cosmology: MADbench
• Microwave Anisotropy Dataset Computational Analysis Package
• Optimal general algorithm for extracting key cosmological data from the Cosmic Microwave Background (CMB) radiation
• The CMB encodes the fundamental parameters of cosmology: the geometry of the Universe, its expansion rate, and the number of neutrino species
• Preserves the full complexity of the underlying scientific problem
• Calculates the maximum-likelihood two-point angular correlation function
• Recasts the problem in dense linear algebra using ScaLAPACK; steps include mat-mat, mat-vec, Cholesky decomposition, and redistribution
• High I/O requirement due to the out-of-core nature of the calculation (see the sketch below)
• Developed at NERSC/CRD by Julian Borrill
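A hedged C sketch of the out-of-core pattern that drives the I/O requirement: matrices too large for memory are written to scratch disk as they are built, then streamed back block by block during the dense-algebra steps. All names, the block size, and the callback are illustrative, not MADbench's actual interface.

#include <stdio.h>

#define BLOCK (1L << 20)   /* doubles per I/O block; illustrative */

/* Write a matrix too large to keep resident out to scratch disk. */
void store_matrix(const double *a, long n, const char *path)
{
    FILE *f = fopen(path, "wb");
    if (!f) return;
    fwrite(a, sizeof(double), n, f);
    fclose(f);
}

/* Stream the matrix back in blocks, computing on each block as it
 * arrives (e.g., one accumulation step of a mat-vec). Compute and I/O
 * interleave, so the machine's I/O-to-flop balance directly shapes
 * sustained performance. */
void process_matrix_blocks(double *buf, long n, const char *path,
                           void (*compute)(const double *, long))
{
    FILE *f = fopen(path, "rb");
    if (!f) return;
    for (long off = 0; off < n; off += BLOCK) {
        long len = (n - off < BLOCK) ? n - off : BLOCK;
        if (fread(buf, sizeof(double), len, f) != (size_t)len) break;
        compute(buf, len);
    }
    fclose(f);
}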
CMB Data Analysis
• CMB analysis moves from the time domain (observations, O(10^12)) to the pixel domain (maps, O(10^8)) to the multipole domain (power spectra, O(10^4)), calculating the compressed data and their reduced error bars at each step
MADbench: Performance Characterization
• In-depth analysis shows the performance contribution of each component on the evaluated architectures
• Identifies system-specific balance and opportunities for optimization
• Results show that I/O has a greater effect on the ES than on Seaborg, due to the ratio between I/O performance and peak ALU speed
• Demonstrated IPM's ability to measure MPI overhead on a variety of architectures without recompiling, at trivial runtime overhead (<1%)
[Figure: stacked bars of % of theoretical peak, broken down into Calc, Calc+MPI, Calc+MPI+I/O, and Calc+MPI+I/O+Rmp, at P = 16, 64, 256, and 1024 on Seaborg (Sbg), ES, Phoenix (Phx), and Columbia (Cmb); P = 1024 shown for Sbg and ES only]
Overview
• Tremendous potential of vector architectures: all 5 codes ran faster than ever before
• Vector systems allow resolutions not possible with scalar architectures (regardless of # of processors)
• Opportunity to perform scientific runs at unprecedented scale
• ES shows high raw performance and much higher sustained performance compared with the X1
• Non-vectorizable code segments become very expensive (8:1 or even 32:1 ratio)
• The evaluation codes contain sufficient regularity in computation for high vector performance
• GTC is an example of code at odds with data parallelism
• Important to characterize the full application, including I/O effects
• Much more difficult to evaluate codes poorly suited for vectorization
• Vectors are potentially at odds with emerging techniques (irregular, multi-physics, multi-scale)
• Plan to expand the scope of application domains/methods and examine the latest HPC architectures
           % of peak (P=64)                  Speedup of ES (P=max avail) vs.
Code       Pwr3   Pwr4   Altix   ES    X1    Pwr3   Pwr4   Altix   X1
LBMHD       7%     5%    11%    58%   37%    30.6   15.3    7.2    1.5
CACTUS      6%    11%     7%    34%    6%    45.0    5.1    6.4    4.0
GTC         9%     6%     5%    20%   11%     9.4    4.3    4.1    1.1
PARATEC    57%    33%    54%    58%   20%     8.2    3.9    1.4    3.9
MADbench   49%    ---    19%    37%   17%     6.3    ---    3.5    1.4
Average                                      19.9    7.2    4.5    2.4