A Performance Profiling Strategy for High-Performance Map Re-Projection of Coarse-Scale Spatial Raster Data

Babak Behzad 1,3, Yan Liu 1,2,4, Eric Shook 1,2, Michael P. Finn 5, David M. Mattli 5 and Shaowen Wang 1,2,3,4

1 CyberInfrastructure and Geospatial Information Laboratory (CIGI)
2 Department of Geography and Geographic Information Science
3 Department of Computer Science
4 National Center for Supercomputing Applications (NCSA)
University of Illinois at Urbana-Champaign
5 Center of Excellence for Geospatial Information Science, U.S. Geological Survey (USGS)

AutoCarto’12



Page 2: Outline

Overview
– Map re-projection
– pRasterBlaster: HPC solution to map re-projection
Performance profiling
– pRasterBlaster computational and scaling bottlenecks
Conclusion

Page 3: Introduction

Map re-projection
– An important cartographic operation
Desktop application: mapIMG
– Challenges exist when scaling to coarse-scale spatial datasets
– Re-projecting a 1 GB raster dataset can take 45-60 minutes
Parallel computing techniques will help scale to large datasets
– Raster data was born to be parallelized

Page 4: Parallelizing Map Re-Projection

Map re-projection of a large dataset is too slow, or even impossible, on desktop machines
pRasterBlaster
– mapIMG in an HPC (High-Performance Computing) environment
– Early days
    Row-wise decomposition
    I/O occurred directly in the program's inner loop
– Rigorous geometry handling and novel resampling
    Resampling options for categorical data and population counts (as well as standard resampling methods for continuous data)
– Able to project/re-project large maps in a short amount of time

Page 5: pRasterBlaster

Fast and accurate raster re-projection in three (primary) steps
Step 1: Calculate and partition output space
Step 2: Read input and re-project
Step 3: Combine temporary files

Page 6: Performance Profiling: Motivation and Objectives

Exploit performance profiling tools to make pRasterBlaster more scalable and efficient
– The early version did not scale to a large number of processors
– Resolve computational bottlenecks so that pRasterBlaster can leverage thousands of processors
Demonstrate techniques for using performance profilers
– Potentially useful for many GIS applications

Page 7: What is performance profiling?

A form of dynamic program analysis
Measures
– memory footprint of the program
– time complexity of the program
– usage of particular instructions
– frequency and duration of function calls
Aids program optimization

Page 8: How do profilers work?

Statistical profilers
– Operate by sampling
– Probe the program at regular intervals
– Pros: Low overhead
– Cons: Typically less numerically accurate and less specific

Page 9: How do profilers work?

Instrumenting profilers
– Instrument target programs with additional instructions to collect the required information
– Pros: Much more accurate than statistical profilers
– Cons: Can slow the program down (since new instructions are added)
Different kinds of instrumenting profilers
– Manual instrumenting
    Done by the programmers
– Automatic profilers
    Software instruments the program automatically
    TAU and IPM were used in this research
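
As a rough illustration of what "additional instructions" means, the sketch below adds a call counter and a CPU-time accumulator to one function by hand. It is a minimal sketch and does not show how TAU or IPM work internally. The names `Probe`, `probe_begin`, `probe_end`, and `reproject_cell` are hypothetical, and the standard C `clock` function stands in for whatever timer a real profiler would use.

```c
#include <assert.h>
#include <time.h>

/* Hypothetical per-function record kept by the instrumentation. */
typedef struct {
    const char *name;     /* label of the instrumented function */
    unsigned long calls;  /* how many times it was entered */
    double cpu_seconds;   /* CPU time accumulated inside it */
} Probe;

/* Inserted at function entry: count the call and read the CPU clock. */
clock_t probe_begin(Probe *p) {
    assert(p != NULL);
    p->calls += 1;
    return clock();
}

/* Inserted at function exit: add the CPU time spent since probe_begin. */
void probe_end(Probe *p, clock_t started) {
    assert(p != NULL);
    p->cpu_seconds += (double)(clock() - started) / CLOCKS_PER_SEC;
}

Probe reproject_probe = { "reproject_cell", 0, 0.0 };

/* Hypothetical workload with the two probe calls added by hand. */
double reproject_cell(double value) {
    clock_t started = probe_begin(&reproject_probe);
    double result = value * 2.0;  /* placeholder for the real computation */
    probe_end(&reproject_probe, started);
    return result;
}
```

After a run, `reproject_probe.calls` and `reproject_probe.cpu_seconds` hold the frequency and accumulated duration of the instrumented function. The two extra calls per invocation are the overhead that the slide lists as the cost of instrumenting.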

Page 10: Manual Instrumenting

The traditional way of instrumenting C code is with the time function, declared in the time.h header. Here is a code fragment that demonstrates its use:

```c
#include <stdio.h>   /* printf */
#include <time.h>    /* time, difftime */

int main(void) {
    time_t start, finish;

    /* ... */
    time(&start);
    /* section to be timed */
    time(&finish);
    /* difftime returns the elapsed seconds as a double */
    printf("Elapsed time: %.0f s\n", difftime(finish, start));
    /* ... */
    return 0;
}
```
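
On POSIX systems `time` counts whole seconds, which is too coarse for short code sections. The sketch below shows one possible finer-grained variant based on the C11 function `timespec_get`. It is an illustration under that assumption, not code from pRasterBlaster, and `seconds_between`, `time_section`, and `example_section` are hypothetical names. `TIME_UTC` is calendar time, so a clock adjustment during the measurement would distort the result.

```c
#include <assert.h>
#include <time.h>

/* Seconds elapsed from a to b, including the nanosecond part. */
double seconds_between(struct timespec a, struct timespec b) {
    return (double)(b.tv_sec - a.tv_sec)
         + (double)(b.tv_nsec - a.tv_nsec) / 1e9;
}

/* Runs section(arg) once and returns the wall-clock seconds it took. */
double time_section(void (*section)(void *), void *arg) {
    struct timespec start, finish;
    int ok;

    assert(section != NULL);
    ok = timespec_get(&start, TIME_UTC);   /* returns TIME_UTC on success */
    assert(ok == TIME_UTC);
    section(arg);                          /* section to be timed */
    ok = timespec_get(&finish, TIME_UTC);
    assert(ok == TIME_UTC);
    (void)ok;                              /* silences a warning under NDEBUG */
    return seconds_between(start, finish);
}

/* Example workload: replaces *arg (a long n) with the sum 0 + 1 + ... + (n - 1). */
void example_section(void *arg) {
    long *n = arg;
    long total = 0;
    for (long i = 0; i < *n; ++i) {
        total += i;
    }
    *n = total;
}
```

A call such as `time_section(example_section, &n)` returns the elapsed seconds as a `double`, which can be printed with `%f`.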

Page 11: Manual Instrumenting in Parallel Programs

Instrument the portion of the program running on individual processors

```c
#include <stdio.h>   /* printf */
#include <time.h>    /* time, difftime */

int main(void) {
    time_t start, finish;
    int my_rank = 0;   /* rank of this process; an MPI program would set it with MPI_Comm_rank */

    /* ... */
    time(&start);
    /* section to be timed */
    time(&finish);
    printf("Elapsed time on Process %d: %.0f s\n", my_rank, difftime(finish, start));
    /* ... */
    return 0;
}
```

Page 12: IPM (Integrated Performance Monitoring)

IPM is a portable profiling infrastructure for MPI programs
– Provides a low-overhead profile of the performance and resource utilization of a parallel program
– Communication, computation, and I/O are the primary focus
– http://ipm-hpc.sourceforge.net
We initially profiled pRasterBlaster with IPM to understand how its run time breaks down into communication, computation, and I/O

Page 13: TAU (Tuning and Analysis Utilities)

The TAU performance system is a portable profiling and tracing toolkit
– Analyzes parallel programs written in Fortran, C, C++, Java, and Python
– http://tau.uoregon.edu
TAU gathers performance information through instrumentation of functions, methods, basic blocks, and statements
IPM is designed to profile MPI applications, while TAU can profile any kind of parallel application

Page 14: TAU for pRasterBlaster

Page 15: TAU for pRasterBlaster

Page 16: Computational Bottleneck I: Symptom

Page 17: Computational Bottleneck I: Symptom

Page 18: Computational Bottleneck I: Symptom

Page 19: Cause: Workload Distribution Issue

[Figure: N rows on P processor cores. Panels: when P is small; when P is big]

Page 20: Solution: Load Balancing

[Figure: N rows on P processor cores. Panels: when P is small; when P is big]
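
The slides do not state what was wrong with the original distribution algorithm. As one illustration of how a row-wise scheme can look fine when P is small and break down when P is big, the sketch below compares a naive split, in which every process gets floor(N/P) rows and the last one also takes the remainder, with a balanced split whose row counts differ by at most one. Both schemes and all function names are hypothetical and are not taken from pRasterBlaster.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical naive scheme: each rank gets floor(n / p) rows and the
 * last rank additionally takes the n % p leftover rows. */
size_t naive_rows(size_t n, size_t p, size_t rank) {
    assert(p > 0 && rank < p);
    size_t base = n / p;
    return (rank == p - 1) ? base + n % p : base;
}

/* Balanced scheme: the first n % p ranks get one extra row, so the
 * row counts of any two ranks differ by at most one. */
size_t balanced_rows(size_t n, size_t p, size_t rank) {
    assert(p > 0 && rank < p);
    return n / p + (rank < n % p ? 1 : 0);
}

/* Index of the first row owned by rank under the balanced scheme. */
size_t balanced_first_row(size_t n, size_t p, size_t rank) {
    assert(p > 0 && rank < p);
    size_t base = n / p;
    size_t rem = n % p;
    return rank * base + (rank < rem ? rank : rem);
}
```

With N = 1000 and P = 4, both schemes give every process 250 rows. With N = 1000 and P = 64, the naive scheme gives the last process 55 rows while the others get 15, whereas the balanced scheme gives every process 15 or 16 rows.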

Page 21: Computational Bottleneck I: Summary

Symptom
– Load imbalance
– Detected by TAU first
– Verified by manual instrumenting
Cause
– Workload distribution algorithm problem (not obvious on small platforms)
Solution
– Revised algorithm for distributing workload
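
Once per-process timings have been collected by manual instrumenting, load imbalance can be summarized in a single number. The sketch below uses the ratio of the slowest process time to the mean process time. This metric is a common convention chosen here for illustration, not one named in the slides, and `load_imbalance` is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>

/* Ratio of the slowest process time to the mean process time.
 * 1.0 means perfectly balanced; larger values mean more imbalance. */
double load_imbalance(const double *seconds, size_t count) {
    assert(seconds != NULL && count > 0);
    double max = seconds[0];
    double sum = 0.0;
    for (size_t i = 0; i < count; ++i) {
        if (seconds[i] > max) {
            max = seconds[i];
        }
        sum += seconds[i];
    }
    assert(sum > 0.0);  /* timings are expected to be positive */
    return max / (sum / (double)count);
}
```

For example, timings of 1, 1, 1, and 5 seconds give a ratio of 2.5: the run takes 2.5 times longer than it would if the same work were spread evenly.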

Page 22: Computational Bottleneck II: Symptom

Page 23: Computational Bottleneck II: Symptom

Page 24: Computational Bottleneck II: Cause

Page 25: Computational Bottleneck II: Analysis

Spatial data-dependent performance anomaly
– The anomaly is data dependent
– The four corners of the raster were processed by the processors whose indexes are close to the two ends
Exception handling in C++ is costly
– Coordinate transformation in nodata areas was handled as an exception
Solution
– Remove the C++ exception handling
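
One way to avoid paying for an exception on every nodata cell is to have the coordinate check return a status that the inner loop tests. The sketch below illustrates that pattern in C. It is not pRasterBlaster code: `cell_has_input` is a toy stand-in for a real inverse projection that treats the valid area as the ellipse inscribed in the output raster, and `reproject_row` and its placeholder output value are equally hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy validity test (an assumption, not a real projection): an output
 * cell has input data only if its centre lies inside the ellipse
 * inscribed in the output raster. Returns false instead of throwing. */
bool cell_has_input(size_t col, size_t row, size_t cols, size_t rows) {
    assert(col < cols && row < rows);
    /* Cell centre scaled to the range [-1, 1] on both axes. */
    double x = (2.0 * (double)col + 1.0) / (double)cols - 1.0;
    double y = (2.0 * (double)row + 1.0) / (double)rows - 1.0;
    return x * x + y * y <= 1.0;
}

/* Fills one output row and returns how many cells were skipped as
 * nodata. The value 1 is a placeholder for a resampled input value. */
size_t reproject_row(size_t row, size_t cols, size_t rows,
                     int *out, int nodata) {
    assert(out != NULL);
    size_t skipped = 0;
    for (size_t col = 0; col < cols; ++col) {
        if (cell_has_input(col, row, cols, rows)) {
            out[col] = 1;
        } else {
            out[col] = nodata;  /* cheap branch, no exception raised */
            ++skipped;
        }
    }
    return skipped;
}
```

In a 4 x 4 grid, the first row skips its two corner cells and the second row skips none. This mirrors the observation that the processes handling the first and last rows meet the most nodata cells.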

Page 26: Computational Bottleneck II: Performance Improvement

Page 27: Computational Bottleneck II: Summary

Symptom
– Processors responsible for polar regions spent more time than those processing equatorial regions
Cause
– Corner cells were mapped to invalid input raster cells, generating exceptions
– C++ exception handling was expensive
Solution
– Removed C++ exception handling
– Corner cells need not be processed
    They now contribute less computation time

Page 28: Conclusions

Performance profiling identified computational bottlenecks in pRasterBlaster
We demonstrated the value of profilers for pRasterBlaster
– The techniques are likely valuable for other GIS applications
Performance profiling is an important tool for developing scalable and efficient high-performance applications

Page 29: Future Work

Identify and resolve remaining performance issues in pRasterBlaster
– I/O was recently identified as the next major roadblock