a case study of communication optimizations on 3d ...charm.cs.illinois.edu/newpapers/09-24/talk.pdfa...

Post on 09-Jul-2020

5 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

A CASE STUDY OF COMMUNICATION OPTIMIZATIONS ON 3D MESH INTERCONNECTS

University of Illinois at Urbana-Champaign

Abhinav Bhatele, Eric Bohm, Laxmikant V. KaleParallel Programming Laboratory

Euro-Par 2009

Outline

MotivationSolution: Mapping of OpenAtomPerformance BenefitsBigger Picture:

Resources NeededHeuristic Solutions

Automatic Mapping

August 27th, 2009

2

Abhinav Bhatele @ Euro-Par 2009

OpenAtom

Ab-Initio Molecular Dynamics codeConsider electrostatic interactions between the nuclei and electronsCalculate different energy termsDivided into different phases with lot of communication

August 27th, 2009

3

Abhinav Bhatele @ Euro-Par 2009

OpenAtom on Blue Gene/L

August 27th, 2009Abhinav Bhatele @ Euro-Par 2009

4

0

0.05

0.1

0.15

0.2

0.25

0.3

512 1024 2048 4096 8192

Tim

e pe

r ste

p (s

ecs)

No. of cores

w32 Default

Runs on Blue Gene/L at IBM T J Watson Research Center, CO mode

w32 = 32 water molecules with 70 Ry cutoff

The problem lies in …

August 27th, 2009Abhinav Bhatele @ Euro-Par 2009

5

Performance Analysis and Visualization Tool: Projections (part of Charm++) – Timeline View

Solution –

August 27th, 2009Abhinav Bhatele @ Euro-Par 2009

6

Topology Aware Mapping

Processor Virtualization

User View System View

Programmer: Decomposes the computation into objects

Runtime: Maps the computation on to the processors

August 27th, 2009

7

Abhinav Bhatele @ Euro-Par 2009

Benefits of Charm++

August 27th, 2009Abhinav Bhatele @ Euro-Par 2009

8

Computation is divided into objects/chares/virtual processors (VPs)Separates decomposition from mappingVPs can be flexibly mapped to actual physical processors (PEs)

Topology Manager API†

The application needs information such asDimensions of the partitionRank to physical co-ordinates and vice-versa

TopoManager: a uniform APIOn BG/L and BG/P: provides a wrapper for system callsOn XT3/4/5, there are no such system callsProvides a clean and uniform interface to the application

August 27th, 2009Abhinav Bhatele @ Euro-Par 2009

9

† http://charm.cs.uiuc.edu/~bhatele/phd/topomgr.htm

Parallelization using Charm++

August 27th, 2009

10

Eric Bohm, Glenn J. Martyna, Abhinav Bhatele, Sameer Kumar, Laxmikant V. Kale, John A. Gunnels, and Mark E. Tuckerman. Fine Grained Parallelization of the Car-Parrinello ab initio MD Method on Blue Gene/L. IBM J. of R. and D.: Applications of Massively Parallel Systems, 52(1/2):159-174, 2008.

Abhinav Bhatele @ Euro-Par 2009

Mapping Challenge

August 27th, 2009Abhinav Bhatele @ Euro-Par 2009

11

Load Balancing: Multiple VPs per PEMultiple groups of communicating objects

Intra-group communicationInter-group communication

Conflicting communication requirements

Topology Mapping of Chare Arrays

August 27th, 2009

12

RealSpace and GSpace have state-wisecommunication Paircalculator and

GSpace have plane-wisecommunication

Abhinav Bhatele @ Euro-Par 2009

Performance Improvements on BG/L

August 27th, 2009Abhinav Bhatele @ Euro-Par 2009

13

0

0.05

0.1

0.15

0.2

0.25

0.3

512 1024 2048 4096 8192

Tim

e pe

r ste

p (s

ecs)

No. of cores

w32 Defaultw32 Topology

Runs on Blue Gene/L at IBM T J Watson Research Center, CO mode, Year: 2006

w32 = 32 water molecules with 70 Ry cutoff

Improved Timeline Views

August 27th, 2009Abhinav Bhatele @ Euro-Par 2009

14

Results on Blue Gene/L

August 27th, 2009Abhinav Bhatele @ Euro-Par 2009

15

0

5

10

15

20

25

1024 2048 4096 8192 16384

Tim

e pe

r ste

p (s

ecs)

No. of cores

w256 Defaultw256 TopologyGST_BIG DefaultGST_BIG Topology

GST_BIG = 64 Ge, 128 Sb and 256 Te molecules with 20 Ry cutoff

Runs on Blue Gene/L at IBM T J Watson Research Center, CO mode

Results on Blue Gene/P

August 27th, 2009Abhinav Bhatele @ Euro-Par 2009

16

0

2

4

6

8

10

12

1024 2048 4096 8192

Tim

e pe

r ste

p (s

ecs)

No. of cores

w256 Defaultw256 Topology

w256 = 256 water molecules with 70 Ry cutoff

Runs on Blue Gene/P at Argonne National Laboratory, VN mode

Results on Cray XT3

August 27th, 2009Abhinav Bhatele @ Euro-Par 2009

17

2

3

4

5

6

7

8

512 1024 2048

Tim

e pe

r ste

p (s

ecs)

No. of cores

w256 Defaultw256 TopologyGST_BIG DefaultGST_BIG Topology

Runs on Cray XT3 (Bigben) at Pittsburgh Supercomputing Center, VN mode(with system reservation to obtain complete 3d mesh shapes)

Performance Analysis

August 27th, 2009Abhinav Bhatele @ Euro-Par 2009

18

0

500000

1000000

1500000

2000000

2500000

3000000

1024 2048 4096 8192

Idle

Tim

e (s

ecs)

No. of cores

DefaultTopology

Performance Analysis and Visualization Tool: Projections – Idle time added across all processors

w256M_70Ry on Blue Gene/L

Reduction in Communication Volume

August 27th, 2009Abhinav Bhatele @ Euro-Par 2009

19

0

200

400

600

800

1000

1200

1024 2048 4096 8192

Band

wid

th (

GB)

No. of cores

DefaultTopology

Data obtained from Blue Gene/P’s Uniform Performance Counters

w256M_70Ry on Blue Gene/P

Relative Performance Improvement

August 27th, 2009Abhinav Bhatele @ Euro-Par 2009

20

0

0.5

1

1.5

2

2.5

512 1024 2048 4096 8192

% Im

prov

emen

t

No. of cores

Cray XT3Blue Gene/LBlue Gene/P

w256M_70Ry

Bigger picture

August 27th, 2009Abhinav Bhatele @ Euro-Par 2009

21

Different kinds of applications:Computation boundCommunication bound

Latency tolerantLatency sensitive

Technique:Obtain processor topology and application communication graphHeuristic Techniques for mapping

Why does distance affect message latencies?

Consider a 3D mesh/torus interconnect

Message latencies can be modeled by

(Lf/B) x D + L/B

Lf = length of flit, B = bandwidth,

D = hops, L = message size

When (Lf * D) << L, first term is negligible

August 27th, 2009Abhinav Bhatele @ Euro-Par 2009

22

But in presence of contention …

Automatic Topology Aware Mapping

August 27th, 2009Abhinav Bhatele @ Euro-Par 2009

23

Many MPI applications exhibit a simple two-dimensional near-neighbor communication patternExamples: MILC, WRF, POP, Stencil, …

Acknowledgements:Shawn Brown and Chad Vizino (PSC)Glenn Martyna, Sameer Kumar, Fred Mintzer (IBM)Teragrid for running time on Bigben (XT3)ANL for running time on Blue Gene/P

DOE Grant B341494 (CSAR), DOE Grant DE-FG05-08OR23332 (ORNL LCF) and NSF Grant ITR 0121357

August 27th, 2009Abhinav Bhatele @ Euro-Par 2009

Funding

E-mail: bhatele@illinois.eduWebpage: http://charm.cs.illinois.edu

top related