Parallel Processing - Gravity Probe B


Science Advisory Committee Meeting - 20 September 3, 2010 • Stanford University

1 04_Parallel Processing

Parallel Processing

Majid AlMeshari John W. Conklin


Outline
•  Challenge
•  Requirements
•  Resources
•  Approach
•  Status
•  Tools for Processing


Challenge

A computationally intensive algorithm is applied to a very large data set. Verifying the filter's correctness takes a long time and hinders progress.


Requirements
•  Achieve a speed-up of 5-10x over the serial version
•  Minimize parallelization overhead
•  Minimize the time needed to parallelize new releases of the code
•  Achieve reasonable scalability

[Figure: time (sec) vs. number of processors; labeled "SAC 19"]


Resources
•  Hardware
–  Computer cluster (Regelation) with 44 64-bit CPUs
»  64-bit enables us to address a memory space beyond 4 GB
»  Using this cluster because:
(a) we likely will not need more than 44 processors
(b) it is the same platform as our Linux boxes
•  Software
–  MatlabMPI
»  Matlab parallel computing toolbox created by MIT Lincoln Lab
»  Eliminates the need to port our code into another language


Approach - Serial Code Structure
•  Initialization
•  Loop over iterations
•  Loop over gyros
•  Loop over orbits
•  Loop over segments
•  Build h(x) (the right-hand side)
•  Build Jacobian, J(x)
•  Load data (Z), perform fit
•  Bayesian estimation, display results

Annotation: computationally intensive components (parallelize!)


Approach - Parallel Code Structure
•  Initialization
•  Loop over iterations
•  Loop over gyros
•  Loop over orbits
•  Loop over segments
•  Build h(x)
•  Build Jacobian, J(x)
•  Distribute J(x), all nodes
•  Load data (Z), perform fit; loop over partial orbits
•  Send information matrices to master & combine
•  Bayesian estimation, display results

Annotations:
–  In parallel, split up by columns of J(x)
–  In parallel, split up by orbits
–  Custom data distribution code required (overhead)
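The two parallel phases named on this slide can be illustrated with a small sketch. It is not the authors' MatlabMPI code: it is written in Python, uses a toy Jacobian whose column j is t**j, and uses threads only to show the structure (threads will not speed up this toy case). The function names `jacobian_column`, `information_matrix`, `split`, and `parallel_information` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def jacobian_column(t, j):
    """Toy column j of J: t**j (a stand-in for the real Jacobian)."""
    return [ti ** j for ti in t]

def information_matrix(rows):
    """J_r^T J_r for one block of rows of J."""
    n = len(rows[0])
    return [[sum(r[a] * r[b] for r in rows) for b in range(n)] for a in range(n)]

def split(items, n_chunks):
    """Split a list into contiguous, nearly equal, non-empty pieces."""
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return [c for c in chunks if c]

def parallel_information(t, n_params, n_workers):
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # Phase 1: columns of J are built independently, one task per column.
        columns = list(pool.map(lambda j: jacobian_column(t, j), range(n_params)))
        J = [list(row) for row in zip(*columns)]  # sync point: complete J
        # Phase 2: each task handles one block of rows (e.g. a group of orbits).
        partials = list(pool.map(information_matrix, split(J, n_workers)))
    # Sync point: the master sums the partial information matrices.
    total = [[0] * n_params for _ in range(n_params)]
    for P in partials:
        for a in range(n_params):
            for b in range(n_params):
                total[a][b] += P[a][b]
    return J, total

t = list(range(12))  # integer sample times keep the arithmetic exact
J, combined = parallel_information(t, n_params=3, n_workers=4)
serial = information_matrix(J)  # same quantity computed in one piece
```

Because the information matrix is a sum over rows, the row blocks can be processed independently and added afterwards, which is why the second sync point only needs to combine small matrices rather than move data.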


Code in Action

[Figure: flow across processors Proc 1, Proc 2, Proc 3, ..., Proc n]
•  Initialization
•  Compute h and partial J (split by columns of J)
•  Sync point: get complete J
•  Perform fit (split by rows of J)
•  Sync point: combine info matrix
•  Bayesian estimation
•  Another iteration? (go for next gyro, segment; go for another iteration)


Parallelization Techniques - 1
•  Minimize inter-processor communication
–  Use MatlabMPI for one-to-one communication only
–  Use a custom approach for one-to-many communication

[Figure: two stacked bar charts of time (sec, 0 to 2000) vs. processors (1 to 16), with components Comp J, Dist J (I/O), Est, and comp info (I/O); one panel shows communication time with the custom approach, the other communication time with MPI]
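The slides do not describe the custom one-to-many approach. As one possible reading, the sketch below assumes a file system shared by all nodes and contrasts writing one message file per receiver with writing a single file that every receiver reads. All function and file names here are hypothetical.

```python
import json
import tempfile
from pathlib import Path

def send_one_to_one(shared_dir, payload, n_workers):
    """Point-to-point style: the master writes one message file per receiver."""
    paths = []
    for rank in range(1, n_workers + 1):
        path = Path(shared_dir) / f"msg_to_{rank}.json"
        path.write_text(json.dumps(payload))
        paths.append(path)
    return paths

def publish_one_to_many(shared_dir, payload):
    """Assumed custom approach: the master writes once, all workers read it."""
    path = Path(shared_dir) / "broadcast.json"
    path.write_text(json.dumps(payload))
    return path

def receive(path):
    """A worker loads the payload from shared storage."""
    return json.loads(Path(path).read_text())

shared = tempfile.mkdtemp()          # stand-in for a shared cluster directory
J = [[1.0, 2.0], [3.0, 4.0]]         # toy Jacobian
p2p_files = send_one_to_one(shared, J, n_workers=15)   # 15 writes
broadcast_file = publish_one_to_many(shared, J)        # 1 write
received = [receive(broadcast_file) for _ in range(15)]
```

Under this assumption the master's write cost stays constant as processors are added, while the point-to-point variant grows linearly with the number of receivers.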


Parallelization Techniques - 2
•  Balance processing load among processors

$\text{Load per processor} = \sum_{k} \text{length}(\text{reducedVector}_k) \cdot (\text{computationWeight}_k + \text{diskIOWeight}), \quad k \in \{\text{RT}, \text{Tel}, \text{Cg}, \text{TFM}, \text{S0}\}$

[Figure: stacked bar chart of time (sec, 0 to 2000) vs. processors (1 to 16), with components Comp J, Dist J (I/O), Est, and comp info (I/O); annotation: CPUs have uniform load]
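The load formula on this slide can be evaluated directly. The sketch below does that and then assigns work units to processors with a greedy heaviest-first rule. The weights, the unit names, and the greedy rule itself are assumptions made for illustration; the slides give only the formula, not the assignment method.

```python
import heapq

# Hypothetical weights; the slides do not give numeric values.
COMPUTATION_WEIGHT = {"RT": 3.0, "Tel": 1.0, "Cg": 2.0, "TFM": 1.5, "S0": 0.5}
DISK_IO_WEIGHT = 0.25

def load(lengths):
    """sum over k of length(reducedVector_k) * (computationWeight_k + diskIOWeight)."""
    return sum(n * (COMPUTATION_WEIGHT[k] + DISK_IO_WEIGHT) for k, n in lengths.items())

def balance(units, n_procs):
    """Greedy rule (an assumption): heaviest unit first, to the least-loaded processor."""
    heap = [(0.0, p) for p in range(n_procs)]
    assignment = {p: [] for p in range(n_procs)}
    for name, lengths in sorted(units.items(), key=lambda kv: load(kv[1]), reverse=True):
        total, p = heapq.heappop(heap)
        assignment[p].append(name)
        heapq.heappush(heap, (total + load(lengths), p))
    totals = {p: total for total, p in heap}
    return assignment, totals

# Hypothetical work units: reduced-vector length per component k.
units = {
    name: {k: n for k in COMPUTATION_WEIGHT}
    for name, n in [("unit_a", 100), ("unit_b", 60), ("unit_c", 40), ("unit_d", 20)]
}
assignment, totals = balance(units, n_procs=2)
```

Including the disk I/O term in the weight matters because a unit that is cheap to compute can still be slow to load, and balancing on computation alone would leave some processors waiting on I/O.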


Parallelization Techniques - 3
•  Minimize memory footprint of slave code
–  Tightly manage memory to prevent swapping to disk or “out of memory” errors
–  We saved ~40% of RAM per processor and have not seen an “out of memory” error since


Parallelization Techniques - 4
•  Find contention points
–  Cache what can be cached
–  Optimize what can be optimized

[Figure: bar chart "Example: Compute h Optimization", time (sec, 0 to 4000), old code vs. optimized code]
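The slides do not say which quantities were cached. As a generic illustration of "cache what can be cached", the sketch below memoizes a hypothetical term that is expensive to compute but does not change between iterations, so a second pass reuses every value. The names `expensive_term` and `compute_h` are placeholders, not the authors' functions.

```python
import math
from functools import lru_cache

calls = 0  # counts how often the expensive body actually runs

@lru_cache(maxsize=None)
def expensive_term(orbit_index, harmonic):
    """Hypothetical costly quantity that is identical in every iteration."""
    global calls
    calls += 1
    return math.cos(harmonic * 2.0 * math.pi * orbit_index / 100.0)

def compute_h(n_orbits, n_harmonics):
    """Toy right-hand side built from the cached terms."""
    return [
        sum(expensive_term(o, k) for k in range(1, n_harmonics + 1))
        for o in range(n_orbits)
    ]

first = compute_h(50, 4)    # fills the cache: 200 evaluations
second = compute_h(50, 4)   # later iteration: served entirely from the cache
```

The trade-off is memory: an unbounded cache grows with the number of distinct arguments, which matters when RAM per processor is already tight.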


Status – Current Speedup

Speedup vs. number of processors:

            1 CPU    8 CPUs   16 CPUs   20 CPUs
Sep. '09    1.00     4.26     6.50      7.22
May '10     1.00     4.08     7.19      9.62
July '10    1.00     5.76     10.83     14.50
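A speedup figure is easier to judge next to parallel efficiency, defined as speedup divided by the number of processors. The sketch below computes it from the values on this slide. It assumes the three groups of numbers correspond to the legend entries in the order they appear (Sep. '09, May '10, July '10); that mapping is inferred, not stated.

```python
PROCESSORS = [1, 8, 16, 20]

# Speedups as they appear on the slide; the series labels are an assumed mapping.
SPEEDUP = {
    "Sep. '09": [1.00, 4.26, 6.50, 7.22],
    "May '10": [1.00, 4.08, 7.19, 9.62],
    "July '10": [1.00, 5.76, 10.83, 14.50],
}

def efficiency(speedups, processors=PROCESSORS):
    """Parallel efficiency = speedup / number of processors."""
    return [s / p for s, p in zip(speedups, processors)]

def meets_target(speedups, low=5.0):
    """True if any configuration reaches the lower end of the 5-10x requirement."""
    return max(speedups) >= low

for label, speedups in SPEEDUP.items():
    formatted = ", ".join(f"{e:.2f}" for e in efficiency(speedups))
    print(f"{label}: efficiency = [{formatted}], meets 5x: {meets_target(speedups)}")
```

Efficiency that falls as processors are added indicates growing overhead from communication and synchronization, which is the quantity the requirements ask to minimize.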


Tools for Processing - 1
•  Run Manager/Scheduler/Version Controller
–  Built by Ahmad Aljadaan
–  Given a batch of options files, it schedules batches of runs, serial or parallel, on the cluster
–  Facilitates version control by creating a unique directory for each run with its options file, code and results
–  Used with a large number of test cases
–  Helped us schedule 1048+ runs since January 2010
»  and counting…
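The per-run directory idea can be sketched as follows. This is not the tool described on the slide: the directory layout, the run-ID format, and the file names are assumptions chosen for illustration.

```python
import shutil
import tempfile
import uuid
from datetime import datetime, timezone
from pathlib import Path

def create_run_directory(root, options_file, code_dir):
    """Create a unique directory holding a run's options file, code snapshot, and results."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S")
    run_dir = Path(root) / f"{stamp}_{uuid.uuid4().hex[:8]}"  # hypothetical ID format
    (run_dir / "results").mkdir(parents=True)                 # empty results folder
    shutil.copy2(options_file, run_dir / Path(options_file).name)
    shutil.copytree(code_dir, run_dir / "code")               # freeze the code version
    return run_dir

# Hypothetical workspace with one options file and a tiny code tree.
workspace = Path(tempfile.mkdtemp())
options = workspace / "case_001.opt"
options.write_text("iterations = 3\n")
code = workspace / "src"
code.mkdir()
(code / "filter.m").write_text("% placeholder\n")

runs = [create_run_directory(workspace / "runs", options, code) for _ in range(2)]
```

Copying the code alongside the options file means any result can later be traced to the exact inputs that produced it, even after the working copy has moved on.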


Tools for Processing - 2
•  Result Viewer
–  Built by Ahmad Aljadaan
–  Allows searching for runs, displays results, plots related charts
–  Facilitates comparison of results


Questions