
Page 1: A Parallel Computing Tutorial

Ulrike Meier Yang
Lawrence Livermore National Laboratory, P. O. Box 808, Livermore, CA 94551
LLNL-PRES-463131

This work performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344.

Page 2:

References

“Introduction to Parallel Computing”, Blaise Barney, https://computing.llnl.gov/tutorials/parallel_comp

“A Parallel Multigrid Tutorial”, Jim Jones, https://computing.llnl.gov/casc/linear_solvers/present.html

“Introduction to Parallel Computing: Design and Analysis of Algorithms”, Vipin Kumar, Ananth Grama, Anshul Gupta, George Karypis

Page 3:

Overview

Basics of Parallel Computing (see Barney)
- Concepts and Terminology
- Computer Architectures
- Programming Models
- Designing Parallel Programs

Parallel algorithms and their implementation
- Basic kernels
- Krylov methods
- Multigrid

Page 4:

Serial computation

Traditionally, software has been written for serial computation:
- run on a single computer
- instructions are executed one after another
- only one instruction is executed at a time

Page 5:

Von Neumann Architecture

John von Neumann first authored the general requirements for an electronic computer in 1945.

- Data and instructions are stored in memory.
- The control unit fetches instructions and data from memory, decodes the instructions, and then sequentially coordinates operations to accomplish the programmed task.
- The arithmetic unit performs basic arithmetic operations.
- Input/Output is the interface to the human operator.

Page 6:

Parallel Computing

Simultaneous use of multiple compute resources to solve a single problem.

Page 7:

Why use parallel computing?

- Save time and/or money
- Solve larger problems
- Provide concurrency
- Use non-local resources
- Overcome the limits of serial computing

Page 8:

Concepts and Terminology

Flynn’s Taxonomy (1966)

- SISD: Single Instruction, Single Data
- SIMD: Single Instruction, Multiple Data
- MISD: Multiple Instruction, Single Data
- MIMD: Multiple Instruction, Multiple Data

Page 9:

SISD

- Serial computer
- Deterministic execution
- Examples: older-generation mainframes, workstations, PCs

Page 10:

SIMD

- A type of parallel computer
- All processing units execute the same instruction at any given clock cycle
- Each processing unit can operate on a different data element
- Two varieties: processor arrays and vector pipelines
- Most modern computers, particularly those with graphics processing units (GPUs), employ SIMD instructions and execution units.

Page 11:

MISD

- A single data stream is fed into multiple processing units.
- Each processing unit operates on the data independently via independent instruction streams.
- Few actual examples: the Carnegie-Mellon C.mmp computer (1971).

Page 12:

MIMD

- Currently the most common type of parallel computer
- Every processor may be executing a different instruction stream
- Every processor may be working with a different data stream
- Execution can be synchronous or asynchronous, deterministic or non-deterministic
- Examples: most current supercomputers, networked parallel computer clusters and "grids", multi-processor SMP computers, multi-core PCs.
- Note: many MIMD architectures also include SIMD execution sub-components.

Page 13:

Parallel Computer Architectures

Shared memory: all processors can access the same memory.

Uniform memory access (UMA):
- identical processors
- equal access and access times to memory

Page 14:

Non-uniform memory access (NUMA)

- Not all processors have equal access to all memories
- Memory access across the link is slower

Advantages:
- user-friendly programming perspective to memory
- fast and uniform data sharing due to the proximity of memory to CPUs

Disadvantages:
- lack of scalability between memory and CPUs
- programmer is responsible for ensuring "correct" access of global memory
- expense

Page 15:

Distributed memory

Distributed memory systems require a communication network to connect inter-processor memory.

Advantages:
- memory is scalable with the number of processors
- no memory interference or overhead for trying to keep cache coherency
- cost effective

Disadvantages:
- programmer is responsible for data communication between processors
- it can be difficult to map existing data structures to this memory organization

Page 16:

Hybrid Distributed-Shared Memory

Generally used for the currently largest and fastest computers.

Has a mixture of the previously mentioned advantages and disadvantages.

Page 17:

Parallel programming models

Shared memory

Threads

Message Passing

Data Parallel

Hybrid

All of these can be implemented on any architecture.

Page 18:

Shared memory

Tasks share a common address space, which they read and write asynchronously.

Various mechanisms such as locks / semaphores may be used to control access to the shared memory.

Advantage: no need to explicitly communicate data between tasks -> simplified programming

Disadvantages:
• need to take care when managing memory, avoid synchronization conflicts
• harder to control data locality

Page 19:

Threads

A thread can be considered as a subroutine in the main program.

Threads communicate with each other through the global memory.

Commonly associated with shared memory architectures and operating systems.

- POSIX Threads (Pthreads)
- OpenMP
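A minimal OpenMP sketch of the threads model (not from the slides; the array length and the axpy operation are illustrative). Each loop is split among the threads of one shared-memory node:

  #include <stdio.h>
  #include <omp.h>

  #define N 1000000

  int main(void)
  {
      static double x[N], y[N];
      double alpha = 2.0;

      /* threads initialize the shared arrays in parallel */
      #pragma omp parallel for
      for (int i = 0; i < N; i++) { x[i] = 1.0; y[i] = 2.0; }

      /* axpy: y = y + alpha*x, iterations divided among the threads */
      #pragma omp parallel for
      for (int i = 0; i < N; i++)
          y[i] += alpha * x[i];

      printf("y[0] = %g, max threads = %d\n", y[0], omp_get_max_threads());
      return 0;
  }

Compiled with an OpenMP-enabled compiler, e.g. gcc -fopenmp.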

Page 20:

Message Passing

A set of tasks that use their own local memory during computation.

Data are exchanged by sending and receiving messages.

Data transfer usually requires cooperative operations to be performed by each process. For example, a send operation must have a matching receive operation.

- MPI (released in 1994)
- MPI-2 (released in 1996)
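As a hedged illustration of the matching send/receive requirement, a minimal MPI sketch (ranks, message size, and tag are illustrative, not from the slides):

  #include <stdio.h>
  #include <mpi.h>

  int main(int argc, char **argv)
  {
      int rank, nprocs;
      double buf[4] = {1.0, 2.0, 3.0, 4.0};

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

      if (nprocs >= 2) {
          if (rank == 0) {
              /* the send on rank 0 must be matched by a receive on rank 1 */
              MPI_Send(buf, 4, MPI_DOUBLE, 1, 99, MPI_COMM_WORLD);
          } else if (rank == 1) {
              MPI_Recv(buf, 4, MPI_DOUBLE, 0, 99, MPI_COMM_WORLD,
                       MPI_STATUS_IGNORE);
              printf("rank 1 received %g %g %g %g\n",
                     buf[0], buf[1], buf[2], buf[3]);
          }
      }

      MPI_Finalize();
      return 0;
  }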

Page 21:

Data Parallel

The data parallel model demonstrates the following characteristics:
• Most of the parallel work performs operations on a data set, organized into a common structure, such as an array.
• A set of tasks works collectively on the same data structure, with each task working on a different partition.
• Tasks perform the same operation on their partition.

On shared memory architectures, all tasks may have access to the data structure through global memory. On distributed memory architectures, the data structure is split up and resides as "chunks" in the local memory of each task.

Page 22:

Other Models

Hybrid
- combines various models, e.g. MPI/OpenMP

Single Program Multiple Data (SPMD)
- A single program is executed by all tasks simultaneously.

Multiple Program Multiple Data (MPMD)
- An MPMD application has multiple executables. Each task can execute the same or a different program as other tasks.

Page 23:

Designing Parallel Programs

Examine the problem:
- Can the problem be parallelized?
- Are there data dependencies?
- Where is most of the work done?
- Identify bottlenecks (e.g. I/O)

Partitioning:
- How should the data be decomposed?

Page 24:

Various partitionings

Page 25:

How should the algorithm be distributed?

Page 26:

Communications

Types of communication:

- point-to-point

- collective

Page 27:

Synchronization types

Barrier
• Each task performs its work until it reaches the barrier. It then stops, or "blocks".
• When the last task reaches the barrier, all tasks are synchronized.

Lock / semaphore
• The first task to acquire the lock "sets" it. This task can then safely (serially) access the protected data or code.
• Other tasks can attempt to acquire the lock but must wait until the task that owns the lock releases it.
• Can be blocking or non-blocking.
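A minimal OpenMP sketch of both synchronization types (not from the slides; the shared counter is purely illustrative). The barrier lines up all threads; the critical section behaves like a lock around the shared update:

  #include <stdio.h>
  #include <omp.h>

  int main(void)
  {
      int counter = 0;

      #pragma omp parallel
      {
          int id = omp_get_thread_num();   /* each thread does its own work */

          #pragma omp barrier              /* all threads synchronize here */

          /* lock-like critical section: one thread at a time updates the
             shared counter, the others wait until it is released */
          #pragma omp critical
          counter += id;
      }

      printf("counter = %d\n", counter);
      return 0;
  }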

Page 28:

Load balancing

Keep all tasks busy all of the time. Minimize idle time.

The slowest task will determine the overall performance.

Page 29:

Granularity

In parallel computing, granularity is a qualitative measure of the ratio of computation to communication.

Fine-grain parallelism:
- low computation to communication ratio
- facilitates load balancing
- implies high communication overhead and less opportunity for performance enhancement

Coarse-grain parallelism:
- high computation to communication ratio
- implies more opportunity for performance increase
- harder to load balance efficiently

Page 30:

Parallel Computing Performance Metrics

Let T(n,p) be the time to solve a problem of size n using p processors.

- Speedup: S(n,p) = T(n,1)/T(n,p)
- Efficiency: E(n,p) = S(n,p)/p
- Scaled efficiency: SE(n,p) = T(n,1)/T(pn,p)

An algorithm is scalable if there exists C > 0 with SE(n,p) ≥ C for all p.
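A small C sketch of these definitions applied to measured run times (the timings below are made up purely for illustration):

  #include <stdio.h>

  /* speedup, efficiency, and scaled efficiency from measured run times */
  double speedup(double t1, double tp)             { return t1 / tp; }
  double efficiency(double t1, double tp, int p)   { return t1 / (tp * p); }
  double scaled_efficiency(double t1, double tpn)  { return t1 / tpn; }

  int main(void)
  {
      /* illustrative timings: T(n,1)=10s, T(n,8)=1.6s, T(8n,8)=11s */
      double t_n_1 = 10.0, t_n_8 = 1.6, t_8n_8 = 11.0;

      printf("S(n,8)  = %.2f\n", speedup(t_n_1, t_n_8));
      printf("E(n,8)  = %.2f\n", efficiency(t_n_1, t_n_8, 8));
      printf("SE(n,8) = %.2f\n", scaled_efficiency(t_n_1, t_8n_8));
      return 0;
  }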

Page 31:

Strong Scalability

Increasing the number of processors for a fixed problem size.

[Figures: run times (seconds), speedup (method vs. perfect speedup), and efficiency, each plotted against the number of processors, 1 to 8.]

Page 32:

Weak scalability

Increasing the number of processors using a fixed problem size per processor.

[Figures: run times (seconds) and scaled efficiency, each plotted against the number of processors, 1 to 8.]

Page 33:

Amdahl’s Law

Maximal speedup = 1/(1-P), where P is the parallel portion of the code.

Page 34:

Amdahl’s Law

Speedup = 1/((P/N) + S), where N is the number of processors and S = 1-P is the serial portion of the code.

Speedup:

       N    P=0.50    P=0.90    P=0.99
      10      1.82      5.26      9.17
     100      1.98      9.17     50.25
    1000      1.99      9.91     90.99
   10000      1.99      9.99     99.02
  100000      1.99      9.99     99.90

[Figure: speedup vs. number of processors for parallel portions of 50%, 75%, 90%, 95%.]
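The table can be checked with a short C sketch that simply evaluates the formula above (loop bounds and formatting are illustrative):

  #include <stdio.h>

  int main(void)
  {
      double P[3] = {0.50, 0.90, 0.99};
      long   N[5] = {10, 100, 1000, 10000, 100000};

      printf("%8s %9s %9s %9s\n", "N", "P=0.50", "P=0.90", "P=0.99");
      for (int i = 0; i < 5; i++) {
          printf("%8ld", N[i]);
          for (int j = 0; j < 3; j++) {
              double S = 1.0 - P[j];                 /* serial portion */
              printf(" %9.2f", 1.0 / (P[j] / N[i] + S));
          }
          printf("\n");
      }
      return 0;
  }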

Page 35:

Performance Model for distributed memory

Time for n floating point operations:

  t_comp = f·n, where 1/f is the flop rate (Mflops)

Time for communicating n words:

  t_comm = α + β·n, where α is the latency and 1/β the bandwidth

Total time: t_comp + t_comm

Page 36:

Parallel Algorithms and Implementations

basic kernels

Krylov methods

multigrid

Page 37:

Vector Operations

Adding two vectors: z = x + y
  Proc 0: z_0 = x_0 + y_0    Proc 1: z_1 = x_1 + y_1    Proc 2: z_2 = x_2 + y_2

Axpys: y = y + αx

Vector products: s = x^T y
  Proc 0: s_0 = x_0^T y_0    Proc 1: s_1 = x_1^T y_1    Proc 2: s_2 = x_2^T y_2
  then s = s_0 + s_1 + s_2 (requires a global sum)
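A minimal MPI sketch of the distributed inner product (local vector length and values are illustrative); MPI_Allreduce performs the global sum s = s_0 + s_1 + ... on all processors:

  #include <stdio.h>
  #include <mpi.h>

  #define NLOC 1000   /* local vector length per processor (illustrative) */

  int main(int argc, char **argv)
  {
      double x[NLOC], y[NLOC], s_local = 0.0, s;
      int rank;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      for (int i = 0; i < NLOC; i++) { x[i] = 1.0; y[i] = 2.0; }

      /* local partial product s_i = x_i^T y_i */
      for (int i = 0; i < NLOC; i++)
          s_local += x[i] * y[i];

      /* global sum over all processors */
      MPI_Allreduce(&s_local, &s, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

      if (rank == 0) printf("x^T y = %g\n", s);
      MPI_Finalize();
      return 0;
  }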

Page 38:

Dense n x n Matrix Vector Multiplication

Matrix size n x n, p processors; the figure shows A partitioned into blocks of rows over Proc 0, Proc 1, Proc 2, Proc 3.

  T = 2(n²/p)f + (p-1)(α + βn/p)
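A minimal MPI sketch of such a row-partitioned dense matvec (an assumption for illustration, not the tutorial's code): it assumes n is divisible by p, uses MPI_Allgather to collect the full x, and the matrix entries are made up:

  #include <stdio.h>
  #include <stdlib.h>
  #include <mpi.h>

  int main(int argc, char **argv)
  {
      int rank, p, n = 512;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &p);

      int nloc = n / p;                       /* rows owned by this process */
      double *Aloc = malloc((size_t)nloc * n * sizeof(double));
      double *xloc = malloc(nloc * sizeof(double));
      double *x    = malloc(n * sizeof(double));
      double *yloc = malloc(nloc * sizeof(double));

      for (int i = 0; i < nloc; i++) {
          xloc[i] = 1.0;
          for (int j = 0; j < n; j++) Aloc[i * n + j] = 1.0;
      }

      /* communication: every processor needs the whole vector x */
      MPI_Allgather(xloc, nloc, MPI_DOUBLE, x, nloc, MPI_DOUBLE, MPI_COMM_WORLD);

      /* computation: 2*(n*n/p) flops per processor */
      for (int i = 0; i < nloc; i++) {
          yloc[i] = 0.0;
          for (int j = 0; j < n; j++) yloc[i] += Aloc[i * n + j] * x[j];
      }

      if (rank == 0) printf("y[0] = %g (expected %d)\n", yloc[0], n);
      free(Aloc); free(xloc); free(x); free(yloc);
      MPI_Finalize();
      return 0;
  }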

Page 39:

Sparse Matrix Vector Multiplication

Matrix of size n² x n², p² processors.

Consider the graph of the matrix; the partitioning influences the run time:

  T_1 ≈ 10(n²/p²)f + 2(α + βn)   vs.   T_2 ≈ 10(n²/p²)f + 4(α + βn/p)

Page 40:

Effect of Partitioning on Performance

[Figure: speedup vs. number of processors (1-15) for the optimized partitioning (MPI opt) and the non-optimized partitioning (MPI-noopt), compared to perfect speedup.]

Page 41:

Sparse Matrix Vector Multiply with COO Data Structure

Coordinate format (COO): real Data(nnz), integer Row(nnz), integer Col(nnz)

Example matrix:

  1 0 0 2
  0 3 4 0
  0 5 6 0
  7 0 8 9

  Data = 2 5 1 3 7 9 4 8 6
  Row  = 0 2 0 1 3 3 1 3 2
  Col  = 3 1 0 1 0 3 2 2 2

Matrix-vector multiplication:

  for (k=0; k<nnz; k++)
    y[Row[k]] = y[Row[k]] + Data[k] * x[Col[k]];

Page 42:

Sparse Matrix Vector Multiply with COO Data Structure

Matrix-vector multiplication:

  for (k=0; k<nnz; k++)
    y[Row[k]] = y[Row[k]] + Data[k] * x[Col[k]];

For the same 4x4 example matrix,

  1 0 0 2
  0 3 4 0
  0 5 6 0
  7 0 8 9

  Data = 2 3 1 5 7 9 4 8 6
  Row  = 0 1 0 2 3 3 1 3 2
  Col  = 3 1 0 1 0 3 2 2 2

If the COO arrays are simply split by nonzeros, entries of the same row (here rows 1-3) can land on different processors, and both then update the same y[Row[k]], a possible conflict. Distributing the entries by rows avoids this:

  Proc 0 (rows 0-1):        Proc 1 (rows 2-3):
    Data = 2 3 1 4            Data = 5 7 9 8 6
    Row  = 0 1 0 1            Row  = 2 3 3 3 2
    Col  = 3 1 0 2            Col  = 1 0 3 2 2

Page 43:

Sparse Matrix Vector Multiply with CSR Data Structure

Compressed sparse row (CSR): real Data(nnz), integer Rowptr(n+1), integer Col(nnz)

Example matrix:

  1 0 0 2
  0 3 4 0
  0 5 6 0
  7 0 8 9

  Data   = 1 2 3 4 5 6 7 8 9
  Rowptr = 0 2 4 6 9
  Col    = 0 3 1 2 1 2 0 2 3

Matrix-vector multiplication:

  for (i=0; i<n; i++)
    for (j=Rowptr[i]; j < Rowptr[i+1]; j++)
      y[i] = y[i] + Data[j] * x[Col[j]];
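The CSR kernel can be exercised with the 4x4 example above; a minimal, self-contained C sketch (the right-hand-side vector is illustrative):

  #include <stdio.h>

  /* CSR matvec for the 4x4 example above: y = A*x with x = (1,1,1,1)^T */
  int main(void)
  {
      int    n = 4, Rowptr[5] = {0, 2, 4, 6, 9};
      int    Col[9]  = {0, 3, 1, 2, 1, 2, 0, 2, 3};
      double Data[9] = {1, 2, 3, 4, 5, 6, 7, 8, 9};
      double x[4] = {1, 1, 1, 1}, y[4] = {0, 0, 0, 0};

      for (int i = 0; i < n; i++)
          for (int j = Rowptr[i]; j < Rowptr[i + 1]; j++)
              y[i] += Data[j] * x[Col[j]];

      /* expected result: y = (3, 7, 11, 24) */
      printf("y = %g %g %g %g\n", y[0], y[1], y[2], y[3]);
      return 0;
  }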

Page 44:

Sparse Matrix Vector Multiply with CSC Data Structure

Compressed sparse column (CSC): real Data(nnz), integer Colptr(n+1), integer Row(nnz)

Example matrix:

  1 0 0 2
  0 3 4 0
  0 5 6 0
  7 0 8 9

  Data   = 1 7 3 5 4 6 8 2 9
  Colptr = 0 2 4 7 9
  Row    = 0 3 1 2 1 2 3 0 3

Matrix-vector multiplication:

  for (i=0; i<n; i++)
    for (j=Colptr[i]; j < Colptr[i+1]; j++)
      y[Row[j]] = y[Row[j]] + Data[j] * x[i];

Page 45:

Parallel Matvec, CSR vs. CSC

- CSR (rows partitioned): processor i computes its piece of the result, y_i = A_i x, where A_i is its block of rows; it needs the pieces of x owned by the other processors.
- CSC (columns partitioned): processor i computes a partial vector y_i = A_i x_i with its block of columns; the result y = y_0 + y_1 + y_2 is obtained by summing the partial vectors.

Page 46:

Sparse Matrix Vector Multiply with Diagonal Storage

Diagonal storage: real Data(n,m), integer offset(m)

Example matrix:

  1 2 0 0
  0 3 0 0          Data = 0 1 2       offset = -2 0 1
  5 0 6 7                 0 3 0
  0 8 0 9                 5 6 7
                          8 9 0

(row j of Data holds the entries of matrix row j on the diagonals given by offset; missing entries are padded with zeros)

Matrix-vector multiplication:

  for (i=0; i<m; i++)
  {  k = offset[i];
     for (j=0; j < n; j++)
        y[j] = y[j] + Data[j,i] * x[j+k];
  }

(with the j loop restricted so that j+k stays in range)

Page 47:

Other Data Structures

Block data structures (BCSR, BCSC)
- block size b x b, n divisible by b
- real Data(nnb*b*b), integer Ptr(n/b+1), integer Index(nnb)

Storage based on space filling curves (e.g. Hilbert curves, Z-order)
- integer IJ(2*nnz), real Data(nnz)
- Matrix-vector multiplication:

    for (i=0; i<nnz; i++)
      y[IJ[2*i+1]] = y[IJ[2*i+1]] + Data[i] * x[IJ[2*i]];

Page 48:

Data Structures for Distributed Memory

Concept for MATAIJ (PETSc) and ParCSR (Hypre)

[Figure: sparsity pattern of an 11 x 11 matrix distributed across Proc 0, Proc 1, and Proc 2 in contiguous blocks of rows.]


Page 50:

Data Structures for Distributed Memory

Concept for MATAIJ (PETSc) and ParCSR (Hypre)

Each processor stores a local CSR matrix (the diagonal block), a nonlocal CSR matrix (the off-diagonal block), and a mapping from local to global indices.

[Figure: the same 11 x 11 sparsity pattern with global column indices 0-10; the nonlocal part of a processor is compressed to the global columns it actually uses, e.g. 0 2 3 8 10.]

Page 51:

Communication Information

- Neighborhood information (number of neighbors)
- Amount of data to be sent and/or received

Global partitioning for p processors (every processor stores the row ranges r_0, r_1, r_2, ..., r_p of all processors): not good on very large numbers of processors!

Page 52:

Assumed partition (AP) algorithm

Answering global distribution questions previously required O(P) storage & computations.

On BG/L, O(P) storage may not be possible.

The new algorithm requires
• O(1) storage
• O(log P) computations

Enables scaling to 100K+ processors.

The AP idea has general applicability beyond hypre.

[Figure: the index range 1..N with the actual partition over processors 1-4 and the assumed partition p = (iP)/N; data owned by processor 3 can lie in processor 4's assumed partition. Actual partition info is sent to the assumed-partition processors (a distributed directory).]

Page 53:

Assumed partition (AP) algorithm is more challenging for structured AMR grids

- AMR can produce grids with "gaps"; a simple, naïve AP function leaves processors with empty partitions.
- Our AP function accounts for these gaps for scalability.
- Demonstrated on 32K procs of BG/L.

Page 54:

Effect of AP on one of hypre’s multigrid solvers

[Figure: run times (seconds) on up to 64K procs on BG/P, comparing the global partition and the assumed partition.]

Page 55:

Machine specification – Hera - LLNL

- Multi-core / multi-socket cluster
- 864 diskless nodes interconnected by Quad Data Rate (QDR) InfiniBand
- AMD Opteron quad-core processors (2.3 GHz)
- Fat-tree network

Page 56:

Multicore cluster details (Hera):
- individual 512 KB L2 cache for each core
- 2 MB L3 cache shared by 4 cores
- 4 sockets per node, 16 cores sharing the 32 GB main memory
- NUMA memory access

Page 57:

MxV Performance 2-16 cores – OpenMP vs. MPI, 7pt stencil

[Figures: Mflops vs. matrix size for OpenMP and for MPI, using 1, 2, 4, 8, 12, and 16 cores.]

Page 58:

MxV Speedup for various matrix sizes, 7-pt stencil

[Figures: speedup vs. number of MPI tasks and vs. number of OpenMP threads (1-16) for matrix sizes 512, 8000, 21952, 32768, 64000, 110592, and 512000, compared to perfect speedup.]

Page 59:

Iterative linear solvers

Krylov solvers:

- Conjugate gradient

- GMRES

- BiCGSTAB

Generally consist of basic linear kernels

Page 60:

Conjugate Gradient for solving Ax=b

k = 0;  x_0 = 0;  r_0 = b;  ρ_0 = r_0^T r_0;
while (ρ_k > ε² ρ_0)
    Solve M z_k = r_k;
    γ_k = r_k^T z_k;
    if (k = 0)  p_1 = z_0;  else  p_{k+1} = z_k + (γ_k / γ_{k-1}) p_k;
    k = k + 1;
    α_k = γ_{k-1} / (p_k^T A p_k);
    x_k = x_{k-1} + α_k p_k;
    r_k = r_{k-1} - α_k A p_k;
    ρ_k = r_k^T r_k;
end
x = x_k;
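A minimal serial C sketch of the pseudocode above with M = diag(A), i.e. diagonal scaling as used on the next slide. The 1D Laplacian test matrix, size, and tolerance are assumptions for illustration; this is not the hypre/LLNL implementation:

  #include <stdio.h>
  #include <math.h>

  #define N 64   /* problem size (illustrative) */

  static void matvec(double A[N][N], const double *p, double *Ap)
  {
      for (int i = 0; i < N; i++) {
          Ap[i] = 0.0;
          for (int j = 0; j < N; j++) Ap[i] += A[i][j] * p[j];
      }
  }

  static double dot(const double *u, const double *v)
  {
      double s = 0.0;
      for (int i = 0; i < N; i++) s += u[i] * v[i];
      return s;
  }

  int main(void)
  {
      static double A[N][N];
      double x[N], r[N], z[N], p[N], Ap[N], b[N], eps = 1e-8;

      /* A = dense 1D Laplacian (2 on the diagonal, -1 off-diagonal), b = 1 */
      for (int i = 0; i < N; i++) {
          A[i][i] = 2.0;
          if (i > 0)     A[i][i - 1] = -1.0;
          if (i < N - 1) A[i][i + 1] = -1.0;
          b[i] = 1.0;  x[i] = 0.0;  r[i] = b[i];
      }

      double rho0 = dot(r, r), rho = rho0, gamma_old = 0.0;
      int k = 0;
      while (rho > eps * eps * rho0 && k < 10 * N) {
          for (int i = 0; i < N; i++) z[i] = r[i] / A[i][i];   /* solve Mz = r */
          double gamma = dot(r, z);
          if (k == 0)
              for (int i = 0; i < N; i++) p[i] = z[i];
          else
              for (int i = 0; i < N; i++) p[i] = z[i] + (gamma / gamma_old) * p[i];
          gamma_old = gamma;
          k++;
          matvec(A, p, Ap);
          double alpha = gamma / dot(p, Ap);
          for (int i = 0; i < N; i++) { x[i] += alpha * p[i]; r[i] -= alpha * Ap[i]; }
          rho = dot(r, r);
      }
      printf("converged in %d iterations, ||r|| = %e\n", k, sqrt(rho));
      return 0;
  }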

Page 61:

Weak scalability – diagonally scaled CG

[Figures: total times for 100 iterations (seconds) and number of iterations, each plotted against the number of processors, up to about 120.]

Page 62:

Achieve algorithmic scalability through multigrid

[Figure: finest grid and first coarse grid, with smoothing on the fine grid, restriction to the coarse grid, and interpolation back.]

Page 63:

Multigrid parallelized by domain partitioning

Partition fine grid

Interpolation fully parallel

Smoothing

Page 64:

Algebraic Multigrid

Setup Phase:
• Select coarse "grids"
• Define interpolation, P^(m), m = 1, 2, ...
• Define restriction and coarse-grid operators: R^(m) = (P^(m))^T, A^(m+1) = R^(m) A^(m) P^(m)

Solve Phase (on level m):
  Relax       A^(m) u^m ≈ f^m
  Compute     r^m = f^m - A^(m) u^m
  Restrict    r^{m+1} = R^(m) r^m
  Solve       A^(m+1) e^{m+1} = r^{m+1}
  Interpolate e^m = P^(m) e^{m+1}
  Correct     u^m ← u^m + e^m
  Relax       A^(m) u^m ≈ f^m

Page 65:

How about scalability of multigrid V cycle?

Time for relaxation on the finest grid, with n x n points per proc:

  T_0 ≈ 4α + 4nβ + 5n²f

Time for relaxation on level k:

  T_k ≈ 4α + 4(n/2^k)β + 5(n/2^k)²f

Total time for a V-cycle, with L ≈ log₂(pn) levels:

  T_V ≈ 2 Σ_k (4α + 4(n/2^k)β + 5(n/2^k)²f)
      ≈ 8(1 + 1 + 1 + ...)α + 8(1 + 1/2 + 1/4 + ...)nβ + 10(1 + 1/4 + 1/16 + ...)n²f
      ≈ 8Lα + 16nβ + (40/3)n²f

For large p:  T_V(n,1) / T_V(pn,p) = O(1/log₂(p))

Page 66:

How about scalability of multigrid W cycle?

Total time for a W-cycle, with L ≈ log₂(pn) levels:

  T_W ≈ 2 Σ_k 2^k (4α + 4(n/2^k)β + 5(n/2^k)²f)
      ≈ 8(1 + 2 + 4 + ...)α + 8(1 + 1 + 1 + ...)nβ + 10(1 + 1/2 + 1/4 + ...)n²f
      ≈ 8(2^{L+1} - 1)α + 16Lnβ + 20n²f
      ≈ 8(2pn - 1)α + 16log₂(pn)nβ + 20n²f

Page 67:

Smoothers

Smoothers significantly affect multigrid convergence and runtime.

• solving: Ax = b, A SPD
• smoothing: x_{n+1} = x_n + M⁻¹(b - A x_n)
• smoothers:

  • Jacobi, M = D, easy to parallelize:

      x_{k+1}(i) = A(i,i)⁻¹ ( b(i) - Σ_{j≠i} A(i,j) x_k(j) ),   i = 0, ..., n-1

  • Gauss-Seidel, M = D + L, highly sequential!!

      x_{k+1}(i) = A(i,i)⁻¹ ( b(i) - Σ_{j<i} A(i,j) x_{k+1}(j) - Σ_{j>i} A(i,j) x_k(j) ),   i = 0, ..., n-1
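To contrast the two update rules, a minimal C sketch of one Jacobi sweep and one Gauss-Seidel sweep for a CSR matrix (the 4x4 example matrix and right-hand side are illustrative, not from the slides):

  #include <stdio.h>

  /* Jacobi uses only old values x_old (easy to parallelize);
     Gauss-Seidel reuses already-updated entries x[j] (sequential). */
  static void jacobi_sweep(int n, const int *Rowptr, const int *Col,
                           const double *Data, const double *b,
                           const double *x_old, double *x_new)
  {
      for (int i = 0; i < n; i++) {
          double s = b[i], diag = 1.0;
          for (int j = Rowptr[i]; j < Rowptr[i + 1]; j++) {
              if (Col[j] == i) diag = Data[j];
              else             s -= Data[j] * x_old[Col[j]];
          }
          x_new[i] = s / diag;
      }
  }

  static void gauss_seidel_sweep(int n, const int *Rowptr, const int *Col,
                                 const double *Data, const double *b, double *x)
  {
      for (int i = 0; i < n; i++) {
          double s = b[i], diag = 1.0;
          for (int j = Rowptr[i]; j < Rowptr[i + 1]; j++) {
              if (Col[j] == i) diag = Data[j];
              else             s -= Data[j] * x[Col[j]];   /* newest values */
          }
          x[i] = s / diag;
      }
  }

  int main(void)
  {
      int    Rowptr[5] = {0, 2, 4, 6, 9};
      int    Col[9]  = {0, 3, 1, 2, 1, 2, 0, 2, 3};
      double Data[9] = {1, 2, 3, 4, 5, 6, 7, 8, 9};
      double b[4] = {1, 1, 1, 1}, x0[4] = {0, 0, 0, 0}, x1[4];

      jacobi_sweep(4, Rowptr, Col, Data, b, x0, x1);
      printf("after one Jacobi sweep: %g %g %g %g\n", x1[0], x1[1], x1[2], x1[3]);

      gauss_seidel_sweep(4, Rowptr, Col, Data, b, x0);   /* updates x0 in place */
      printf("after one GS sweep:     %g %g %g %g\n", x0[0], x0[1], x0[2], x0[3]);
      return 0;
  }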

Page 68:

Parallel Smoothers

Red-Black Gauss-Seidel

Multi-color Gauss-Seidel

Page 69:

Hybrid smoothers

The smoother is applied blockwise, M = diag(M_1, ..., M_p), with one block M_k per processor (or thread):

  Hybrid Gauss-Seidel:            M_k = D_k + L_k
  Hybrid SOR:                     M_k = (1/ω) D_k + L_k
  Hybrid symmetric Gauss-Seidel:  M_k = (D_k + L_k) D_k⁻¹ (D_k + L_k)^T

Can be used for distributed as well as shared parallel computing.

Needs to be used with care.

Might require a smoothing parameter: M_ω = ω M.

Corresponding block schemes, Schwarz solvers, ILU, etc.

Page 70:

How is the smoother threaded?

The matrices are distributed across P processors in contiguous blocks of rows:

  A = [ A_1 ; A_2 ; ... ; A_p ]

Within a processor, the smoother is threaded such that each thread works on a contiguous subset of the rows (A_p is split among threads t1, t2, t3, t4).

Hybrid: Gauss-Seidel within each thread, Jacobi on the thread boundaries. A sketch of this idea follows below.
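A minimal OpenMP sketch of the hybrid idea (an illustration, not the hypre implementation): Gauss-Seidel within each thread's contiguous block of rows, Jacobi (old values) across thread boundaries. The function name, signature, and block computation are assumptions:

  #include <string.h>
  #include <omp.h>

  void hybrid_gs_sweep(int n, const int *Rowptr, const int *Col,
                       const double *Data, const double *b,
                       double *x, double *x_old)
  {
      memcpy(x_old, x, n * sizeof(double));   /* snapshot of previous iterate */

      #pragma omp parallel
      {
          int nt  = omp_get_num_threads();
          int tid = omp_get_thread_num();
          int lo  = (int)((long long)n * tid / nt);        /* this thread's rows */
          int hi  = (int)((long long)n * (tid + 1) / nt);

          for (int i = lo; i < hi; i++) {
              double s = b[i], diag = 1.0;
              for (int j = Rowptr[i]; j < Rowptr[i + 1]; j++) {
                  int c = Col[j];
                  if (c == i)                 diag = Data[j];
                  else if (c >= lo && c < hi) s -= Data[j] * x[c];      /* GS inside block */
                  else                        s -= Data[j] * x_old[c];  /* Jacobi across blocks */
              }
              x[i] = s / diag;
          }
      }
  }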

Page 71:

FEM application: 2D Square

- The application partitions the grid element-wise into nice subdomains for each processor (16 procs; the colors in the figure indicate the element numbering).
- AMG determines thread subdomains by splitting nodes (not elements), so the application's numbering of the elements/nodes becomes relevant (i.e., the threaded domains depend on that numbering).

Page 72:

We want to use OpenMP on multicore machines...

Hybrid-GS is not guaranteed to converge, but in practice it is effective for diffusion problems when the subdomains are well-defined.

Some options:
1. The application orders the unknowns in each processor subdomain "nicely"
   • may be more difficult due to refinement
   • may impose a non-trivial burden on the application
2. AMG reorders the unknowns (rows)
   • non-trivial, costly
3. Smoothers that do not depend on the parallelism

Page 73:

Coarsening algorithms

Original coarsening algorithm

Parallel coarsening algorithms

Page 74:

Choosing the Coarse Grid (Classical AMG)

Add measures (the number of strong influences of a point).

Page 75:

Choosing the Coarse Grid (Classical AMG)

- Pick a point with maximal measure and make it a C-point.

[Figure: grid of points labeled with their measures.]

Page 76:

Choosing the Coarse Grid (Classical AMG)

- Pick a point with maximal measure and make it a C-point.
- Make its neighbors F-points.

[Figure: grid with the new C-point and F-points marked.]

Page 77:

Choosing the Coarse Grid (Classical AMG)

- Pick a point with maximal measure and make it a C-point.
- Make its neighbors F-points.
- Increase the measures of neighbors of new F-points by the number of new F-neighbors.

[Figure: grid with updated measures.]

Page 78:

Choosing the Coarse Grid (Classical AMG)

- Pick a point with maximal measure and make it a C-point.
- Make its neighbors F-points.
- Increase the measures of neighbors of new F-points by the number of new F-neighbors.
- Repeat until finished.

[Figure: grid after the next selection step.]

Page 79:

Choosing the Coarse Grid (Classical AMG)

- Pick a point with maximal measure and make it a C-point.
- Make its neighbors F-points.
- Increase the measures of neighbors of new F-points by the number of new F-neighbors.
- Repeat until finished.

[Figure: grid after a further selection step.]

Page 80:

Choosing the Coarse Grid (Classical AMG)

- Pick a point with maximal measure and make it a C-point.
- Make its neighbors F-points.
- Increase the measures of neighbors of new F-points by the number of new F-neighbors.
- Repeat until finished.

Page 81:

Parallel Coarsening Algorithms

Perform the sequential algorithm on each processor, then deal with the processor boundaries.

Page 82:

PMIS: start

select C-pt with maximal measure

select neighbors as F-pts

3 5 5 5 5 5 3

5 8 8 8 8 8 5

5 8 8 8 8 8 5

5 8 8 8 8 8 5

5 8 8 8 8 8 5

5 8 8 8 8 8 5

3 5 5 5 5 5 3

Page 83:

PMIS: add random numbers

select C-pt with maximal measure

select neighbors as F-pts

3.7 5.3 5.0 5.9 5.4 5.3 3.4

5.2 8.0 8.5 8.2 8.6 8.2 5.1

5.9 8.1 8.8 8.9 8.4 8.9 5.9

5.7 8.6 8.3 8.8 8.3 8.1 5.0

5.3 8.7 8.3 8.4 8.3 8.8 5.9

5.0 8.8 8.5 8.6 8.7 8.9 5.3

3.2 5.6 5.8 5.6 5.9 5.9 3.0

Page 84:

PMIS: exchange neighbor information

select C-pt with maximal measure

select neighbors as F-pts

3.7 5.3 5.0 5.9 5.4 5.3 3.4

5.2 8.0 8.5 8.2 8.6 8.2 5.1

5.9 8.1 8.8 8.9 8.4 8.9 5.9

5.7 8.6 8.3 8.8 8.3 8.1 5.0

5.3 8.7 8.3 8.4 8.3 8.8 5.9

5.0 8.8 8.5 8.6 8.7 8.9 5.3

3.2 5.6 5.8 5.6 5.9 5.9 3.0

Page 85:

PMIS: select

select C-pts with maximal measure locally
make neighbors F-pts

[Figure: the grid with the locally maximal points (e.g. 8.8, 8.9) selected as C-points and their neighbors marked as F-points.]

Page 86:

PMIS: update 1

select C-pts with maximal measure locally
make neighbors F-pts (requires neighbor info)

[Figure: remaining unassigned points and their measures after the first update.]

Page 87:

PMIS: select

select C-pts with maximal measure locally
make neighbors F-pts

[Figure: second selection of local maxima among the remaining points.]

Page 88:

PMIS: update 2

select C-pts with maximal measure locally
make neighbors F-pts

[Figure: remaining points after the second update.]

Page 89:

PMIS: final grid

select C-pts with maximal measure locally
make neighbors F-pts
remove neighbor edges

Page 90:

Interpolation

- Nearest-neighbor interpolation is straightforward to parallelize.
- Long-range interpolation is harder to parallelize.

[Figure: a fine point i and coarse interpolation points j, k, l.]

Page 91:

Communication patterns – 128 procs – Level 0

Setup Solve

Page 92:

Communication patterns – 128 procs – Setup

distance 1 vs. distance 2 interpolation

Distance 1 – Level 4 Distance 2 – Level 4

Page 93:

Communication patterns – 128 procs – Solve

distance 1 vs. distance 2 interpolation

Distance 1 – Level 4 Distance 2 – Level 4

Page 94:

AMG-GMRES(10) on Hera, 7pt 3D Laplace problem on a unit cube, 100x100x100 grid points per node (16 procs per node)

[Figures: setup times and solve times (seconds) vs. number of cores, up to about 3000, for Hera-1, Hera-2, BGL-1, BGL-2.]

Page 95:

Distance 1 vs distance 2 interpolation on different computer architectures

AMG-GMRES(10), 7pt 3D Laplace problem on a unit cube, using 50x50x25 points per processor (100x100x100 per node on Hera)

[Figure: times (seconds) vs. number of cores, up to about 3000, for Hera-1, Hera-2, BGL-1, BGL-2.]

Page 96:

AMG-GMRES(10) on Hera, 7pt 3D Laplace problem on a unit cube, 100x100x100 grid points per node (16 procs per node)

[Figures: setup times and solve cycle times (seconds) vs. number of cores, up to about 10000, for MPI 16x1, H 8x2, H 4x4, H 2x8, OMP 1x16, and MCSUP 1x16.]

Page 97:

AMG-GMRES(10) on Hera, 7pt 3D Laplace problem on a unit cube, 100x100x100 grid points per node (16 procs per node)

[Figure: total times (seconds) vs. number of cores, up to about 12000, for MPI 16x1, H 8x2, H 4x4, H 2x8, OMP 1x16, and MCSUP 1x16.]

Page 98:

Blue Gene/P Solution – Intrepid (Argonne National Laboratory)

- 40 racks with 1024 compute nodes each
- quad-core 850 MHz PowerPC 450 processor
- 163,840 cores
- 2 GB main memory per node, shared by all cores
- 3D torus network
- isolated, dedicated partitions

Page 99:

AMG-GMRES(10) on Intrepid, 7pt 3D Laplace problem on a unit cube, 50x50x25 grid points per core (4 cores per node)

[Figures: setup times and cycle times (seconds) vs. number of cores, up to about 100000, comparing the smp, dual, vn, and vn-omp node modes and the H1x4, H2x2, MPI, and H4x1 programming models.]

Page 100:

AMG-GMRES(10) on Intrepid, 7pt 3D Laplace problem on a unit cube, 50x50x25 grid points per core (4 cores per node)

[Figure: total times (seconds) vs. number of cores, up to about 100000, for H1x4, H2x2, MPI, H4x1.]

No. of iterations:

  No. of cores   H1x4   H2x2   MPI
           128     17     17    18
          1024     20     20    22
          8192     25     25    25
         27648     29     27    28
         65536     30     33    29
        128000     37     36    36

Page 101:

Cray XT-5 Jaguar-PF – Oak Ridge National Laboratory

- 18,688 nodes in 200 cabinets
- two sockets with an AMD Opteron hex-core processor per node
- 16 GB main memory per node
- network: 3D torus/mesh with wrap-around links (torus-like) in two dimensions and without such links in the remaining dimension
- Compute Node Linux, limited set of services, but reduced noise ratio
- PGI's C compilers (v.904) with and without OpenMP; OpenMP settings: -mp=numa (preemptive thread pinning, localized memory allocations)

Page 102:

AMG-GMRES(10) on Jaguar, 7pt 3D Laplace problem on a unit cube, 50x50x30 grid points per core (12 cores per node)

[Figures: setup times and cycle times (seconds) vs. number of cores, up to about 200000, for MPI, H12x1, H1x12, H4x3, H2x6.]

Page 103:

AMG-GMRES(10) on Jaguar, 7pt 3D Laplace problem on a unit cube, 50x50x30 grid points per core (12 cores per node)

[Figure: total times (seconds) vs. number of cores, up to about 200000, for MPI, H1x12, H4x3, H2x6.]

Page 104:

Thank You!

This work performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344.