Lecture 16 Irregular Problems – Part II


Lecture 16

Irregular Problems – Part II

Announcements

©2012 Scott B. Baden /CSE 260/ Fall 2012 2

Today’s lecture

•  Particle methods
•  Sparse matrices
•  Irregular meshes

Classifying application domains – Colella’s “7 Dwarfs”

•  Classify applications according to their pattern of data access and the operations on the data: important in HPC
•  Structured grids: image processing, simulations
•  Dense linear algebra: Gaussian elimination, SVD, many others
•  N-body methods
•  Sparse linear algebra
•  Unstructured grids
•  Spectral methods (FFT)
•  Monte Carlo

The N-body problem

•  Compute the trajectories of a system of N bodies, often called particles, moving under mutual influence
   –  The Greek word for particle: somati'dion = “little body”
   –  No general analytic (exact) solution when N > 2
   –  Numerical simulations are required
   –  N can range from thousands to millions
•  A force law governs the way the particles interact
   –  We may not need to perform all O(N²) force computations
   –  Introduces non-uniformity due to uneven particle distributions

Discretization

•  Particles move continuously through space and time according to a force, a continuous function of position and time: F(x,t)
•  We can’t solve the problem analytically, so we must solve it numerically
•  We approximate continuous values using a discrete representation
•  Evaluate the force field at discrete points in time, called timesteps: Δt, 2Δt, 3Δt, … where Δt = the discrete time step (a parameter)
•  “Push” the bodies according to Newton’s second law: F = ma = m du/dt

while (current time < end time)
    forall bodies i ∈ 1:N
        compute force Fi induced by all bodies j ∈ 1:N
    update position xi by Fi Δt ∀i
    current time += Δt
end
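The time-stepping loop above can be sketched as follows. This is a minimal 1-D sketch, not the lecture's code; `simulate` and its `compute_forces` callback are hypothetical names introduced here for illustration.

```python
# Sketch of the N-body push loop (1-D for simplicity).
# `compute_forces` is a hypothetical callback returning one force per body.
def simulate(positions, velocities, masses, compute_forces, dt, end_time):
    """Advance all bodies from t = 0 to end_time in steps of dt."""
    t = 0.0
    while t < end_time:
        forces = compute_forces(positions, masses)
        for i in range(len(positions)):
            # F = m a  =>  a = F / m; update velocity, then position
            velocities[i] += forces[i] / masses[i] * dt
            positions[i] += velocities[i] * dt
        t += dt
    return positions, velocities
```

With zero force, a body with unit velocity simply drifts: one body starting at x = 0 with v = 1 reaches x = 1 after one time unit.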

Timestep selection

•  We approximate the velocity of a particle by the tangent to the particle’s trajectory
•  Since we compute velocities at discrete points in space and time, we approximate the true trajectory by a straight line
•  So long as Δt is small enough, the resultant error is reasonable
•  If not, then we might “jump” onto another trajectory: this is an error

[Figure: particle A’s trajectory and particle B’s trajectory]

Some details

•  There is no self-induced force
•  How should we choose the timestep Δt?
•  We apply numerical criteria to select Δt, based on an analysis of the evolving solution
•  Selection may be ongoing, i.e. dynamic

Computing the force

•  The running time of the computation is dominated by the force computation, so we ignore the push phase
•  The simplest approach is the direct method, with a running time of O(N²):
   Force on particle i = ∑j=0:N-1 F(xi, xj)
•  F( ) is the force law
•  One example is the gravitational force law:
   G mi mj / r²ij, where rij = dist(xi, xj) and G is the gravitational constant
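A sketch of the direct method for the gravitational force law, assuming 2-D positions; `direct_forces` is a name introduced here, not from the lecture.

```python
import math

def direct_forces(pos, mass, G=1.0):
    """O(N^2) direct method: gravitational force on each 2-D body.
    pos: list of (x, y) tuples; mass: list of masses.
    There is no self-induced force, so the j == i term is skipped."""
    n = len(pos)
    forces = [(0.0, 0.0)] * n
    for i in range(n):
        fx = fy = 0.0
        for j in range(n):
            if i == j:
                continue
            dx = pos[j][0] - pos[i][0]
            dy = pos[j][1] - pos[i][1]
            r2 = dx * dx + dy * dy
            r = math.sqrt(r2)
            f = G * mass[i] * mass[j] / r2   # magnitude G m_i m_j / r_ij^2
            fx += f * dx / r                  # directed along the separation
            fy += f * dy / r
        forces[i] = (fx, fy)
    return forces
```

Two unit masses one unit apart attract each other with unit force under G = 1, each force pointing toward the other body.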

Significance of locality

•  Many scientific problems exhibit both spatial and temporal locality, due to the underlying physics
   –  The values of the solution at nearby points in space and time are closer than values at distant points
   –  Gravitational interactions decay as 1/r²
   –  Waves live in a localized region of space at any one time and will appear in a nearby region of space at a nearby point in time
•  Numerically speaking, timesteps are “small”: the solution changes gradually
•  Let’s consider a more highly localized force

Jim Demmel, U. C. Berkeley

Localized Van der Waals force

F(r) = C (2 − 1/(30r)),   r < δ
F(r) = C / (30r)⁵,        r ≥ δ
where C = 0.1

[Plot of F(r) for r from 0 to 0.2]

Implementation

•  We don’t need to compute all O(N²) interactions
•  To speed up the search for nearby particles, sort them into a chaining mesh (Hockney & Eastwood, 1981)
•  Compute forces one mesh box at a time
•  Only consider particles in the box and its 8 surrounding cells

Jim Demmel, U. C. Berkeley
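The chaining-mesh idea can be sketched as below. This is an illustrative sketch, assuming 2-D particles and a box size equal to the force cutoff; the function names are introduced here, not from Hockney & Eastwood.

```python
from collections import defaultdict

def build_chaining_mesh(pos, cell):
    """Sort particles into a chaining mesh of box size `cell`;
    returns a dict mapping box (bx, by) -> list of particle indices."""
    mesh = defaultdict(list)
    for i, (x, y) in enumerate(pos):
        mesh[(int(x // cell), int(y // cell))].append(i)
    return mesh

def neighbors(mesh, pos, i, cell):
    """Candidate interaction partners of particle i: the particles in
    its own box and in the 8 surrounding boxes."""
    bx, by = int(pos[i][0] // cell), int(pos[i][1] // cell)
    out = []
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            out.extend(j for j in mesh.get((bx + dx, by + dy), ())
                       if j != i)
    return out
```

Only nearby particles are returned, so a localized force needs far fewer than N² evaluations.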


Parallel implementation

•  We split the mesh as in a stencil method
•  We use ghost cells to store copies of nearby particles
•  The ghost region also manages particles that have moved outside the subdomain and must be repatriated to their new owner
•  Requires packing and unpacking of particle lists

Load balancing

•  Boxes carry varying amounts of work depending on the local particle density
•  A uniformly blocked workload assignment will result in load imbalance
•  Particles can redistribute themselves over processors
•  Two alternatives:
   –  Static: cyclic mapping
   –  Dynamic: non-uniform partitioning

Static decomposition

•  Block cyclic decomposition
•  All processors should get about the same amount of work if we choose a reasonable chunk size
•  What are the tradeoffs?

0 1 0 1 0 1 0 1
2 3 2 3 2 3 2 3
0 1 0 1 0 1 0 1
2 3 2 3 2 3 2 3
0 1 0 1 0 1 0 1
2 3 2 3 2 3 2 3
0 1 0 1 0 1 0 1
2 3 2 3 2 3 2 3
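The cyclic mapping in the figure can be expressed in one line. A minimal sketch, assuming a prow × pcol process grid and block size b; `owner` is a name introduced here for illustration.

```python
def owner(i, j, b, prow, pcol):
    """Owner of mesh box (i, j) under a block cyclic decomposition
    with block size b on a prow x pcol process grid: block indices
    wrap around the process grid cyclically."""
    return ((i // b) % prow) * pcol + ((j // b) % pcol)
```

With b = 1 on a 2 × 2 grid this reproduces the layout shown above: box (0,0) goes to processor 0, (0,1) to 1, (1,0) to 2, and so on.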


Communication overhead

•  With b=2, each processor must obtain all the simulation data
•  With b=4, this factor decreases to ½
•  If the dependence distance increases, so must the granularity

0 1 0 1 0 1
2 3 2 3 2 3
0 1 0 1 0 1
2 3 2 3 2 3
0 1 0 1 0 1
2 3 2 3 2 3

Jim Demmel, U. C. Berkeley

0 1 2 3 0 1 2 3
4 5 6 7 4 5 6 7
8 9 10 11 8 9 10 11
12 13 14 15 12 13 14 15
0 1 2 3 0 1 2 3
4 5 6 7 4 5 6 7
8 9 10 11 8 9 10 11
12 13 14 15 12 13 14 15

Non-uniform partitioning

•  Cyclic partitioning is a static strategy
   –  Partitions space uniformly and maps work irregularly
   –  A statistical sampling method: it does not attempt to measure load imbalance
   –  May incur high communication overheads due to the fine-grain data decomposition
•  Another approach is to partition the work non-uniformly according to the spatial workload distribution
   –  Each non-uniform region carries an equal amount of work
   –  Particles move: the partitioning changes over time
•  Avoids the high surface-to-volume ratio of cyclic decomposition
•  In many numerical problems the solution changes gradually due to timestep constraints, so we don’t have to repartition every timestep

The workload density distribution

•  Irregular partitioning algorithms require that we estimate the workload in space and time
•  Construct a workload density mapping ρ(x,t) giving the workload at each point x = (x,y,z) of the mesh
•  In static problems the workload density mapping depends only on position
•  In many applications we can come up with a good estimate of the mapping
•  Since timesteps are “small,” particle motion is localized in space and in time
•  This condition holds for continuum methods such as classical partial differential equations: we can adjust the partitioning frequency accordingly

Non-uniform blocked decomposition

•  A well-known partitioning technique is recursive coordinate bisection (RCB)

An example of RCB in one dimension

•  Consider the following loop:

for i = 0 : N-1 do
    if ( G(i) ) then y[i] = f1( … )
    else y[i] = f2( … )
end for

•  Assume f1( ) takes twice as long to compute as f2( )

Partitioning procedure

•  Let W[i] = if (G(i)) then 2 else 1:
   1 1 2 1 2 2 2 1 1 1 1 1 1 2 2 1
•  Compute the running sum (i.e. the scan) of W:
   1 2 4 5 7 9 11 12 13 14 15 16 17 19 21 22
•  Split into 2 equal parts, at 22/2 = 11:
   1 1 2 1 2 2 2 | 1 1 1 1 1 1 2 2 1

Recursive coordinate bisection

•  Recurse until the desired number of partitions has been rendered:
   1 1 2 1 2 2 2 1 1 1 1 1 1 2 2 1
•  May be applied in multiple dimensions
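The scan-and-split procedure can be sketched in one dimension as below. A minimal sketch; `bisect_weights` and `rcb_1d` are names introduced here for illustration.

```python
from itertools import accumulate

def bisect_weights(w):
    """One bisection step: split [0, len(w)) where the running sum
    (scan) of the weights first reaches half the total."""
    total = sum(w)
    for k, s in enumerate(accumulate(w)):
        if s >= total / 2:
            return k + 1          # left part is w[0:k+1]
    return len(w)

def rcb_1d(lo, hi, w, parts):
    """Recursively bisect w[lo:hi] into `parts` pieces of
    roughly equal work; returns a list of (lo, hi) ranges."""
    if parts == 1:
        return [(lo, hi)]
    cut = lo + bisect_weights(w[lo:hi])
    half = parts // 2
    return rcb_1d(lo, cut, w, half) + rcb_1d(cut, hi, w, parts - half)
```

On the weight vector from the slide (total work 22), the first bisection cuts after the seventh element, giving two halves of work 11 each, matching the split at 22/2 = 11.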

Load balancing with space-filling curves

•  Multidimensional RCB suffers from granularity problems
•  We can reduce the granularity with many partitions per processor
•  This introduces higher surface-to-volume effects
•  Another approach: space-filling curves

Partitioning with space-filling curves

•  Maps a higher dimensional physical space onto the line
   –  Load balancing in one dimension
   –  Many types; Hilbert curves shown here
•  Irregular communication surfaces
•  Useful for managing locality: memory, databases

mathforum.org/advanced/robertd/lsys3d.html

http://www.math.ohio-state.edu/~fiedorow/math655/Peano.html
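As an illustration of the space-to-line mapping, here is a Z-order (Morton) curve rather than the Hilbert curve shown on the slide: it is a simpler relative that also maps 2-D space to the line while roughly preserving locality. `z_order_key` is a name introduced here.

```python
def z_order_key(x, y, bits=16):
    """Interleave the bits of integer box coordinates (x, y) to get a
    position on a Z-order (Morton) curve. Sorting boxes by this key
    and cutting the sorted list into equal-work pieces gives a 1-D
    load balancing problem, as with the Hilbert curve."""
    key = 0
    for b in range(bits):
        key |= ((x >> b) & 1) << (2 * b)      # x bits -> even positions
        key |= ((y >> b) & 1) << (2 * b + 1)  # y bits -> odd positions
    return key
```

The four boxes of a 2 × 2 grid are visited in the “Z” order 0, 1, 2, 3, and nearby boxes tend to receive nearby keys.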


Load balancing efficiency

•  If we ignore serial sections and other overheads, then we may express load imbalance in terms of a load balancing efficiency metric
•  Let each processor i complete its assigned work in time Ti
•  Thus, the running time TP = MAXi ( Ti )
•  Define the total work T = ∑i Ti
•  We define the load balancing efficiency η = T / (P · TP)
•  Ideally η = 1.0
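The metric is a one-liner; `load_balance_efficiency` is a name introduced here for illustration.

```python
def load_balance_efficiency(times):
    """eta = T / (P * T_P), where times[i] = T_i is the time processor i
    spends on its assigned work, T = sum of the T_i, P = number of
    processors, and T_P = max over the T_i (the running time)."""
    T = sum(times)
    P = len(times)
    return T / (P * max(times))
```

A perfectly balanced run gives η = 1.0; if one of two processors does three times the work of the other, η drops to 2/3.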


Sparse matrices

•  A matrix where we employ knowledge about the location of the non-zeroes
•  Consider Jacobi’s method with a 5-point stencil:

u’[i,j] = (u[i-1,j] + u[i+1,j] + u[i,j-1] + u[i,j+1] - h²f[i,j]) / 4

•  Example: a 1M x 1M submatrix of the web connectivity graph, constructed from an archive at the Stanford WebBase, with 3 non-zeroes/row
   –  Dense: 2²⁰ × 2²⁰ = 2⁴⁰ = 1024 Gwords
   –  Sparse: (3/2²⁰) × 2⁴⁰ = 3 Mwords
   –  The sparse representation saves a factor of 1 million in storage

Web connectivity Matrix: 1M x 1M

Jim Demmel


Circuit simulation

www.cise.ufl.edu/research/sparse/matrices/Hamm/scircuit.html

•  Motorola circuit: 170,998 × 170,998 matrix
•  958,936 nonzeroes (.003% nonzeroes, 5.6 nonzeroes/row)

Cholesky

•  In some cases the matrix is both
   –  Symmetric: A = Aᵀ
   –  Positive definite: A = L · Lᵀ, where L acts as a “square root” of A
•  No need for pivoting
•  We can use Cholesky factorization
•  The flop count is cut in half
•  What complications does this avoid?
•  Result: many matrices can be transformed into this form
•  SuperLU: Demmel and Li

Savings with sparse matrices

•  7 x 7 grid: 49 x 49 matrix; use Cholesky’s method
•  Nonzeroes: 349, 30% of the space
•  Flops: 2643, 6.7%
•  Fill-in: a zero entry becomes nonzero

Managing fill-in

•  Fill-in and computation are reduced with an improved ordering
•  The advantage increases with N: n = 63 (N = n² = 3969)
   –  250,109 non-zeroes → 85,416
   –  16 million flops → 3.6 million
•  Natural order: nonzeros = 233; Minimum Degree order: nonzeros = 207

Sparse matrix-vector multiplication

•  Important kernel used in sparse linear algebra: y[i] += A[i,j] × x[j]
•  Many formats; a common format for CPUs is Compressed Sparse Row (CSR)

Jim Demmel

Sparse matrix-vector multiply kernel:

// y[i] += A[i,j] × x[j]
#pragma omp parallel for schedule(dynamic, chunk)
for i = 0 : N-1          // rows
    i0 = ptr[i]
    i1 = ptr[i+1] - 1
    for j = i0 : i1      // nonzeros of row i
        y[i] += val[j] * x[ind[j]]
    end j
end i
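The CSR kernel can be written out as a runnable sketch; `spmv_csr` is a name introduced here, and a serial version is shown for clarity.

```python
def spmv_csr(ptr, ind, val, x):
    """y = A x for A in Compressed Sparse Row form:
    the nonzeros of row i occupy positions ptr[i] .. ptr[i+1]-1,
    ind holds their column indices, val their values."""
    n = len(ptr) - 1
    y = [0.0] * n
    for i in range(n):                         # rows
        s = 0.0
        for j in range(ptr[i], ptr[i + 1]):    # nonzeros of row i
            s += val[j] * x[ind[j]]
        y[i] = s
    return y
```

For the 3 × 3 matrix [[1,0,2],[0,3,0],[4,0,5]], CSR storage is ptr = [0,2,3,5], ind = [0,2,1,0,2], val = [1,2,3,4,5]; only the five nonzeros are stored and touched.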


Irregular mesh: NASA airfoil in 2D

Jim Demmel

•  In the Jacobi method, computational effort is applied uniformly
•  Irregular “unstructured” meshes don’t have a uniform structure, and they apply computational effort non-uniformly
•  We can apply direct mesh updates, but it is more common to generate the equivalent sparse matrix

A Structured Adaptive mesh

Courtesy Phil Colella, Lawrence Berkeley National Laboratory


Turbulent flow over a step

Applied Mathematics Group Center for Applied Scientific Computing, Lawrence Livermore National Laboratory

http://www.llnl.gov/CASC/groups/casc-amg.html

Fin


Irregular problems

•  In the Jacobi application, computational effort is applied uniformly
•  Irregular applications apply computational effort non-uniformly
•  A load balancing problem arises when partitioning the data
•  There may not be a mesh

Courtesy of Randy Bank

Achieving asynchronicity on multicore

•  The matrix factorization can be represented as a DAG:
   –  nodes: tasks that operate on “tiles”
   –  edges: dependencies among tasks
•  Tasks can be scheduled asynchronously and in any order, as long as dependencies are not violated

Source: Jack Dongarra. System: PLASMA.
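The scheduling rule above can be sketched with a simple ready-queue scheme (Kahn's algorithm); this is an illustrative serial sketch of dependency-driven execution, not PLASMA's actual runtime, and `run_dag` is a name introduced here.

```python
from collections import deque

def run_dag(tasks, deps, execute):
    """Run tasks in any order consistent with the DAG: a task becomes
    ready once all of its dependencies have completed.
    deps[t] = set of tasks that t depends on."""
    remaining = {t: set(deps.get(t, ())) for t in tasks}
    dependents = {t: [] for t in tasks}
    for t, ds in remaining.items():
        for d in ds:
            dependents[d].append(t)
    ready = deque(t for t, ds in remaining.items() if not ds)
    order = []
    while ready:
        t = ready.popleft()
        execute(t)                     # in PLASMA this runs on a worker
        order.append(t)
        for u in dependents[t]:        # retire t; release its dependents
            remaining[u].discard(t)
            if not remaining[u]:
                ready.append(u)
    return order
```

In a parallel runtime the ready queue is shared among worker threads, so independent tasks execute concurrently; here the same invariant (no task runs before its dependencies) is shown serially.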

02/14/2006 CS267 Lecture 9 44

Row and Column Block Cyclic Layout

•  Processors and matrix blocks are distributed in a 2D array
•  prow-by-pcol array of processors; brow-by-bcol matrix blocks
•  pcol-fold parallelism in any column, and calls to BLAS2 and BLAS3 on matrices of size brow-by-bcol
•  The serial bottleneck is eased
•  prow ≠ pcol and brow ≠ bcol are possible, even desirable

0 1 0 1 0 1 0 1

2 3 2 3 2 3 2 3

0 1 0 1 0 1 0 1

2 3 2 3 2 3 2 3

0 1 0 1 0 1 0 1

2 3 2 3 2 3 2 3

0 1 0 1 0 1 0 1

2 3 2 3 2 3 2 3



Distributed GE with a 2D Block Cyclic Layout

•  The block size b in the algorithm and the block sizes brow and bcol in the layout satisfy b = bcol
•  Shaded regions indicate processors busy with computation or communication
•  It is unnecessary to have a barrier between each step of the algorithm; e.g., steps 9, 10, and 11 can be pipelined