
Page 1: Cyberinfrastructure for Scalable and High Performance Geospatial Computation

Cyberinfrastructure for Scalable and High Performance Geospatial Computation

Xuan Shi

Graduate assistants supported by the CyberGIS grant

Fei Ye (2011) and Zhong Chen (2012)

School of Computational Science and Engineering (CSE)

College of Computing, Georgia Institute of Technology

Page 2: Cyberinfrastructure for Scalable and High Performance Geospatial Computation

Overview

Keeneland and Kraken: the Cyberinfrastructure for our research and development

Scalable and high performance geospatial software modules developed over the past 19 months

Page 3: Cyberinfrastructure for Scalable and High Performance Geospatial Computation

Keeneland: a hybrid computer architecture and system

A five-year Track 2D cooperative agreement awarded by the National Science Foundation (NSF) in 2009

Developed by Georgia Tech, UT-Knoxville, and ORNL

120 nodes [240 CPUs + 360 GPUs]

Integrated into XSEDE in July 2012

Blue Waters – full scale of hybrid computer systems

Page 4: Cyberinfrastructure for Scalable and High Performance Geospatial Computation

Kraken: a Cray XT5 supercomputer

As of November 2010, Kraken was the 8th fastest computer in the world

The world's first academic supercomputer to enter the petascale

Peak performance of 1.17 PetaFLOPS

112,896 computing cores (18,816 2.6 GHz six-core AMD Opteron processors)

147 TB of memory

Page 5: Cyberinfrastructure for Scalable and High Performance Geospatial Computation

Scalable and high performance geospatial computation (1)

Time and speedup on a desktop vs. Keeneland; GPU entries are time (s) / speedup over a single CPU:

Data Size | Single CPU (s), desktop | Single GPU, desktop | 1 GPU, Keeneland | 3 GPUs   | 6 GPUs  | 9 GPUs
2191      | 1331 (22.2 min)         | 15.3 / 87           | 3 / 444          | 4 / 333  | 6 / 222 | 6 / 222
4596      | 2502 (41.7 min)         | 14.6 / 171          | 5 / 500          | 5 / 500  | 7 / 357 | 8 / 313
5822      | 2926 (48.8 min)         | 16.5 / 177          | 7 / 418          | 5 / 585  | 6 / 488 | 6 / 488
6941      | 3717 (62.0 min)         | 17.1 / 217          | 6 / 620          | 4 / 929  | 7 / 531 | 6 / 620
7689      | 3978 (66.3 min)         | 18.4 / 216          | 7 / 568          | 5 / 796  | 6 / 663 | 8 / 497
9543      | 4875 (81.3 min)         | 20.6 / 237          | 7 / 696          | 4 / 1219 | 6 / 813 | 8 / 609
9817      | 5061 (84.4 min)         | 21.2 / 239          | 7 / 723          | 4 / 1265 | 6 / 844 | 7 / 723

Performance comparison across data scales (number of sample points) and computing resources; times are in seconds.

Speedup is the single-CPU time divided by the GPU time.

Interpolation uses the values of the 12 nearest neighbors; output grid size: 1M+ cells.

Interpolation Using IDW Algorithm on GPU and Keeneland
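For reference, a minimal CUDA sketch of the per-cell IDW computation follows. All names here are hypothetical, and where the benchmarked module restricts the weighted sum to the 12 nearest neighbors, this sketch weights every sample point for brevity.

```cuda
// Minimal IDW sketch: one thread interpolates one output grid cell.
// Assumed layout: sample coordinates and values in sx, sy, sv.
__global__ void idwKernel(const float *sx, const float *sy, const float *sv,
                          int nSamples, float *grid, int width, int height,
                          float cellSize, float power)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col >= width || row >= height) return;

    float gx = col * cellSize, gy = row * cellSize;
    float num = 0.0f, den = 0.0f;
    for (int i = 0; i < nSamples; ++i) {
        float dx = gx - sx[i], dy = gy - sy[i];
        float d2 = dx * dx + dy * dy;
        if (d2 < 1e-12f) { num = sv[i]; den = 1.0f; break; }  // exact hit
        float w = 1.0f / powf(d2, 0.5f * power);              // 1 / d^power
        num += w * sv[i];
        den += w;
    }
    grid[row * width + col] = num / den;
}

// Launch, e.g.: dim3 block(16, 16);
//               dim3 grid((width + 15) / 16, (height + 15) / 16);
//               idwKernel<<<grid, block>>>(...);
```

Each output cell is independent of the others, which is why IDW parallelizes so cleanly; the flattening of speedup in the 6- and 9-GPU columns above suggests the per-GPU workload eventually becomes too small to amortize overheads.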

Page 6: Cyberinfrastructure for Scalable and High Performance Geospatial Computation

Scalable and high performance geospatial computation (2)

Time and speedup on a desktop vs. Keeneland; GPU entries are time (s) / speedup over a single CPU:

Data Size | Single CPU (s), desktop | Single GPU, desktop | 1 GPU, Keeneland | 3 GPUs  | 6 GPUs  | 9 GPUs
2191      | 669 (11.2 min)          | 56 / 12             | 7 / 96           | 4 / 167 | 6 / 112 | 6 / 112
4596      | 1570 (26.2 min)         | 66 / 24             | 8 / 196          | 5 / 314 | 6 / 262 | 7 / 224
6941      | 1960 (32.7 min)         | 65 / 30             | 7 / 280          | 4 / 490 | 7 / 280 | 6 / 327
9817      | 2771 (46.2 min)         | 52 / 53             | 6 / 462          | 4 / 693 | 7 / 396 | 6 / 462

Performance comparison across data scales (number of sample points) and computing resources; times are in seconds.

Speedup is the single-CPU time divided by the GPU time.

Interpolation uses the values of the 10 nearest neighbors; output grid size: 1M+ cells.

Interpolation Using Kriging Algorithm on GPU and Keeneland

Three Kriging semivariogram models, a) spherical, b) exponential, and c) Gaussian, have been implemented on the GPU and on Keeneland.
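For reference, the three models can be written as small device functions. This sketch uses one common parameterization (nugget c0, partial sill c, range a, with the "practical range" factor of 3); the benchmarked code may use a different convention.

```cuda
// Semivariogram models used by ordinary Kriging (one common convention).
// h = lag distance, c0 = nugget, c = partial sill, a = range.
__device__ float sphericalGamma(float h, float c0, float c, float a)
{
    if (h >= a) return c0 + c;  // beyond the range: full sill
    float r = h / a;
    return c0 + c * (1.5f * r - 0.5f * r * r * r);
}

__device__ float exponentialGamma(float h, float c0, float c, float a)
{
    return c0 + c * (1.0f - expf(-3.0f * h / a));
}

__device__ float gaussianGamma(float h, float c0, float c, float a)
{
    return c0 + c * (1.0f - expf(-3.0f * h * h / (a * a)));
}
```

Each output cell then builds and solves a small linear system from these gamma values over its 10 nearest neighbors, which is why Kriging is markedly more expensive per cell than IDW in the tables above.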

Page 7: Cyberinfrastructure for Scalable and High Performance Geospatial Computation

Scalable and High Performance Geospatial Computation (3)

Parallelizing Cellular Automata (CA) on GPU and Keeneland (1)

Cellular Automata (CA) is the foundation for geospatial modeling and simulation, such as SLEUTH for urban growth simulation

Game of Life (GOL), invented by Cambridge mathematician John Conway, is a well-known generic CA that consists of a collection of cells which, based on a few mathematical rules, can live, die or multiply.

The Rules:

For a space that is 'populated':

Each cell with one or no neighbors dies, as if by loneliness.

Each cell with four or more neighbors dies, as if by overpopulation.

Each cell with two or three neighbors survives.

For a space that is 'empty' or 'unpopulated':

Each cell with three neighbors becomes populated.
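These rules map naturally onto one GPU thread per cell. A minimal CUDA sketch (hypothetical names; cells outside the border are treated as dead):

```cuda
// One Game-of-Life generation: each thread updates a single cell.
// The grid is row-major; out-of-bounds neighbors count as dead.
__global__ void golStep(const unsigned char *in, unsigned char *out,
                        int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    int neighbors = 0;
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx) {
            if (dx == 0 && dy == 0) continue;
            int nx = x + dx, ny = y + dy;
            if (nx >= 0 && nx < width && ny >= 0 && ny < height)
                neighbors += in[ny * width + nx];
        }

    unsigned char alive = in[y * width + x];
    // Birth on exactly 3 neighbors; survival on 2 or 3.
    out[y * width + x] = (neighbors == 3) || (alive && neighbors == 2);
}
```

The host loop simply swaps the in and out buffers between generations.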

Page 8: Cyberinfrastructure for Scalable and High Performance Geospatial Computation

Scalable and High Performance Geospatial Computation (3)

Parallelizing Cellular Automata on GPU and Keeneland (2)

Size of CA: 10,000 x 10,000

Number of iterations: 100

CPU time: ~100 minutes

GPU [desktop] time: ~6 minutes

Keeneland [20 GPUs]: 20 seconds

CPU: Intel Xeon 5110 @ 1.60 GHz, 3.25 GB of RAM

GPU: NVIDIA GeForce GTX 260 with 27 streaming multiprocessors (SMs)

A cell is “born” if it has exactly 3 neighbors, stays alive if it has 2 or 3 living neighbors, and dies otherwise.

A simple SLEUTH model has been implemented on a single GPU. Implementation on Kraken and Keeneland using multiple GPUs is under development; a sketch of the standard decomposition follows.
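The usual route to a multi-GPU CA, and presumably the shape of the Keeneland runs above, is a strip decomposition with halo exchange. A hedged MPI sketch of the host-side exchange (around each kernel launch the strip would be copied to and from the GPU, or passed directly with CUDA-aware MPI):

```c
#include <mpi.h>

// Each rank owns localRows rows of the CA plus one halo row on each side:
// row 0 and row localRows+1 are halos, rows 1..localRows are owned.
void exchangeHalos(unsigned char *strip, int width, int localRows,
                   int rank, int nRanks, MPI_Comm comm)
{
    int up   = (rank > 0)          ? rank - 1 : MPI_PROC_NULL;
    int down = (rank < nRanks - 1) ? rank + 1 : MPI_PROC_NULL;

    // Send the first owned row up; receive the bottom halo from below.
    MPI_Sendrecv(strip + 1 * width, width, MPI_UNSIGNED_CHAR, up, 0,
                 strip + (localRows + 1) * width, width, MPI_UNSIGNED_CHAR,
                 down, 0, comm, MPI_STATUS_IGNORE);
    // Send the last owned row down; receive the top halo from above.
    MPI_Sendrecv(strip + localRows * width, width, MPI_UNSIGNED_CHAR, down, 1,
                 strip + 0 * width, width, MPI_UNSIGNED_CHAR, up, 1,
                 comm, MPI_STATUS_IGNORE);
}
```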

Page 9: Cyberinfrastructure for Scalable and High Performance Geospatial Computation

Scalable and High Performance Geospatial Computation (4)

Parallelizing ISODATA for Unsupervised Image Classification on Kraken (1)

Iterative Self-Organizing Data Analysis Technique Algorithm (ISODATA)

Performance comparison: ERDAS takes 3:44:37 (13,477 seconds) to read the image file [~2 minutes] and classify one tile of 18 GB imagery [0.5 m resolution, three bands].

Number of Cores         | 144   | 324  | 576  | 900
Stripe Count            | 80    | 80   | 80   | 80
Stripe Size (MB)        | 10    | 10   | 10   | 10
Read Time (s)           | 5.66  | 5.13 | 2.94 | 2.77
Classification Time (s) | 13.72 | 6.15 | 3.56 | 3.31

Our solution on Kraken using different numbers of cores, with optimized Lustre stripe count and stripe size.
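A sketch of how such a striped parallel read might look with MPI-IO; the function and data layout here are illustrative assumptions, not the project's actual code. On Lustre, the stripe settings from the table can be applied to the target directory beforehand (e.g., lfs setstripe -c 80 -S 10m <dir>; older Lustre versions spell the size flag -s).

```c
#include <mpi.h>
#include <stdlib.h>

// Each rank reads one contiguous slice of the raw imagery (assumed layout).
unsigned char *readSlice(const char *path, MPI_Offset fileSize,
                         int rank, int nRanks, MPI_Offset *sliceBytes)
{
    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, path, MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);

    MPI_Offset chunk  = fileSize / nRanks;
    MPI_Offset offset = rank * chunk;
    *sliceBytes = (rank == nRanks - 1) ? fileSize - offset : chunk;

    unsigned char *buf = malloc(*sliceBytes);
    // Collective read: MPI-IO can aggregate the requests so they line up
    // with the Lustre stripes instead of all ranks hitting the same OST.
    MPI_File_read_at_all(fh, offset, buf, (int)*sliceBytes, MPI_BYTE,
                         MPI_STATUS_IGNORE);
    MPI_File_close(&fh);
    return buf;
}
```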

Tue Jun 12 12:48:37 EDT 2012
Iteration 1: convergence = 0.000
Iteration 2: convergence = 0.918
Iteration 3: convergence = 0.938
Iteration 4: convergence = 0.954
---- Classification completed ----
The reading file time is 15.4807
The classification time is 9.2374
The total ISODATA algorithm running time is 24.7181
Histogram:
Class 0: 1124674113
Class 1: 1970406180
Class 2: 2845484626
Class 3: 2897947070
Class 4: 2298948648
Class 5: 1662539363
Application 1436660 resources: utime ~30211s, stime ~1215s
Tue Jun 12 12:49:06 EDT 2012

Tue Jun 12 15:39:10 EDT 2012
Iteration 1: convergence = 0.000
Iteration 2: convergence = 0.919
Iteration 3: convergence = 0.936
Iteration 4: convergence = 0.953
---- Classification completed ----
The reading file time is 53.5952
The classification time is 9.1167
The total ISODATA algorithm running time is 62.7119
Histogram:
Class 0: 2811537615
Class 1: 8743937711
Class 2: 12122628756
Class 3: 11850984345
Class 4: 9714452352
Class 5: 5956459221
Application 1440071 resources: utime ~208415s, stime ~4110s
Tue Jun 12 15:40:18 EDT 2012

Tue Jun 12 14:24:23 EDT 2012
Iteration 1: convergence = 0.000
Iteration 2: convergence = 0.915
Iteration 3: convergence = 0.935
Iteration 4: convergence = 0.952
---- Classification completed ----
The reading file time is 28.6973
The classification time is 8.9810
The total ISODATA algorithm running time is 37.6782
Histogram:
Class 0: 2811537615
Class 1: 3715199078
Class 2: 5660559329
Class 3: 5766104126
Class 4: 4652035362
Class 5: 2994564490
Application 1439048 resources: utime ~78392s, stime ~2164s
Tue Jun 12 14:25:05 EDT 2012

Tue Jun 12 16:06:31 EDT 2012
Iteration 1: convergence = 0.000
Iteration 2: convergence = 0.919
Iteration 3: convergence = 0.937
Iteration 4: convergence = 0.953
---- Classification completed ----
The reading file time is 47.8197
The classification time is 9.6519
The total ISODATA algorithm running time is 57.4716
Histogram:
Class 0: 2811537623
Class 1: 14137169249
Class 2: 18231156326
Class 3: 17844190199
Class 4: 14839032207
Class 5: 8936914396
Application 1440335 resources: utime ~275810s, stime ~6377s
Tue Jun 12 16:07:33 EDT 2012

Data size: | 36 GB       | 72 GB       | 144 GB      | 216 GB
Cores:     | 1,800       | 3,600       | 7,200       | 10,800

20+ hours to transfer the data from Georgia Tech to Kraken at ORNL

The more cores requested, the longer the job queue wait time

~ 10 seconds to complete the classification process

I/O needs to be further optimized

Page 10: Cyberinfrastructure for Scalable and High Performance Geospatial Computation

Scalable and High Performance Geospatial Computation (4)

Parallelizing ISODATA for Unsupervised Image Classification on Kraken (2)

Iterative Self-Organizing Data Analysis Technique Algorithm (ISODATA)

Per-tile timing by number of classes (I/O and CLS [classification] times in seconds; Total = I/O + CLS; IR = number of iterations):

# of tiles |       5 classes        |       10 classes        |       15 classes        |       20 classes
           | I/O    CLS   Total  IR | I/O    CLS   Total   IR | I/O    CLS    Total  IR | I/O    CLS    Total  IR
1          | 4.32   2.13   6.45   4 | 4.25   8.62  12.87   11 | 5.51   12.07  17.57  11 | 6.00   18.13  24.13  13
2          | 8.94   2.16  11.10   4 | 20.31  7.92  28.23   10 | 17.16  11.32  28.47  11 | 9.02   15.09  24.11  12
4          | 21.01  2.21  23.23   4 | 16.40  7.95  24.35   10 | 14.80  13.41  28.21  13 | 16.40  7.95   24.35  10
8          | 28.83  2.23  31.06   4 | 28.95  7.41  36.36    9 | 28.67  14.78  43.46  14 | 29.52  15.34  44.86  12
12         | 44.86  2.29  47.15   4 | 45.92  6.57  52.49    8 | 58.31  9.43   67.74   9 | 41.56  15.37  56.93  12

Performance comparison: to classify one tile of 18 GB imagery into 10, 15, and 20 classes, ERDAS takes about 5.5, 6.5, and 7.5 hours respectively to complete 20 iterations, while the convergence value remains below 0.95.
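For reference, one distributed ISODATA iteration might look like the sketch below. This is an assumption about the implementation, not the project's code; convergence is measured here as the fraction of pixels that kept their previous class, which matches the rising 0.000 to ~0.95 values in the logs above.

```c
#include <mpi.h>
#include <stdlib.h>
#include <float.h>

// One ISODATA iteration over nLocal pixels with nBands bands per rank.
// means holds nClasses * nBands values; returns the global convergence.
double isodataIteration(const float *pixels, int *labels, long nLocal,
                        int nBands, int nClasses, float *means)
{
    double *sum  = calloc((size_t)nClasses * nBands, sizeof(double));
    long   *count = calloc(nClasses, sizeof(long));
    long unchanged = 0, nTotal = nLocal;

    for (long p = 0; p < nLocal; ++p) {
        int best = 0; float bestDist = FLT_MAX;
        for (int c = 0; c < nClasses; ++c) {  // nearest class mean
            float d = 0.0f;
            for (int b = 0; b < nBands; ++b) {
                float diff = pixels[p * nBands + b] - means[c * nBands + b];
                d += diff * diff;
            }
            if (d < bestDist) { bestDist = d; best = c; }
        }
        if (labels[p] == best) unchanged++;
        labels[p] = best;
        count[best]++;
        for (int b = 0; b < nBands; ++b)
            sum[best * nBands + b] += pixels[p * nBands + b];
    }

    // Merge partial statistics across ranks, then update the class means.
    MPI_Allreduce(MPI_IN_PLACE, sum, nClasses * nBands, MPI_DOUBLE,
                  MPI_SUM, MPI_COMM_WORLD);
    MPI_Allreduce(MPI_IN_PLACE, count, nClasses, MPI_LONG, MPI_SUM,
                  MPI_COMM_WORLD);
    MPI_Allreduce(MPI_IN_PLACE, &unchanged, 1, MPI_LONG, MPI_SUM,
                  MPI_COMM_WORLD);
    MPI_Allreduce(MPI_IN_PLACE, &nTotal, 1, MPI_LONG, MPI_SUM,
                  MPI_COMM_WORLD);

    for (int c = 0; c < nClasses; ++c)
        if (count[c] > 0)
            for (int b = 0; b < nBands; ++b)
                means[c * nBands + b] = (float)(sum[c * nBands + b] / count[c]);

    free(sum); free(count);
    return (double)unchanged / (double)nTotal;
}
```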

Page 11: Cyberinfrastructure for Scalable and High Performance Geospatial Computation

Scalable and High Performance Geospatial Computation (5)

Near-repeat calculation for spatial-temporal analysis of crime events on GPU and Keeneland

Through a re-engineering process, the near-repeat calculation was first parallelized on an NVIDIA GeForce GTX 260 GPU, which takes about 48.5 minutes to complete one calculation plus 999 simulations on two event chains over 30,000 events.

By combining MPI and GPU programs, we can dispatch the simulation work onto multiple Keeneland nodes to accelerate the simulation process.

Using 100 GPUs on Keeneland, the 1,000 simulations complete in about 264 seconds.

With more GPUs, the simulation time can be reduced further.
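A sketch of that MPI dispatch pattern; runNearRepeatSimulation is a hypothetical placeholder for the GPU-backed near-repeat routine.

```c
#include <mpi.h>
#include <cuda_runtime.h>

// Placeholder for the GPU-backed near-repeat routine (hypothetical name).
extern void runNearRepeatSimulation(int simId);

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nRanks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nRanks);

    int nDevices = 0;
    cudaGetDeviceCount(&nDevices);
    if (nDevices > 0)
        cudaSetDevice(rank % nDevices);  // bind this rank to one node-local GPU

    // Round-robin over the Monte Carlo permutations: with 100 GPU-bound
    // ranks, each rank runs 10 of the 1,000 simulations.
    const int nSims = 1000;
    for (int s = rank; s < nSims; s += nRanks)
        runNearRepeatSimulation(s);

    MPI_Barrier(MPI_COMM_WORLD);  // all simulations finished
    MPI_Finalize();
    return 0;
}
```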

One run of a 4+ event-chain calculation can easily approach or exceed petascale (10^15) and exascale (10^18) operation counts.