

The Scalable Data Management, Analysis, and Visualization Institute http://sdav-scidac.org

Exploring Cosmology with SDAV Technologies

Tom Peterka, Venkat Vishwanath, Joe Insley, Juliana Kwan, Adrian Pope, Hal Finkel, Katrin Heitmann, Salman Habib (ANL), Jon Woodring, Chris Sewell, Jim Ahrens (LANL), George Zagaris, Robert Maynard, Berk Geveci (Kitware), Wei-keng Liao, Mostofa Patwary, Ankit Agrawal, Saba Sehrish (NU)

Above: The cosmology tools plugin in ParaView supports interactive feature exploration and includes a parallel reader for Voronoi tessellations and Minkowski functionals over connected components.

Right: Connected components of Voronoi cells that have been filtered on cell volume are further characterized.

Overview

SDAV technologies are helping cosmologists unravel the mysterious nature of dark matter and dark energy by transforming raw data into meaningful representations. For example, mesh tessellations help analyze point data because they transform sparse discrete samples into dense continuous functions. Similarly, large-scale structures such as halos and voids are extracted, tracked, and summarized in high-level models using SDAV techniques. In close partnership with cosmologists, SDAV provides the custom tools needed to manage, analyze, and visualize data.

[Diagram: in situ analysis framework. Labels: Visualization; Feature Detection & Modeling; Halo Finding; Voronoi Tessellation; Multistream Detection; Feature Tracking; Analysis Tools; in situ analysis; ParaView; visualization and further analysis; Storage; analysis configuration; simulation configuration; HACC.]

In Situ Analysis Framework

Left: Framework enables in situ analyses at selected time steps. Analysis filters include halo finding, multistream feature classification, feature tracking, and Voronoi tessellation.

Left: Comparison of density-based clustering (DBSCAN) and HACC friends-of-friends (FOF) halo finder shows similar results but improved linking at halo edges and less false linkage. Scalability of parallel DBSCAN is shown in center and right.
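To make the FOF versus DBSCAN distinction concrete, here is a minimal single-node sketch using scikit-learn on synthetic particles. It is not the parallel PDSDBSCAN or the HACC FOF halo finder; the linking length, min_samples value, and toy "bridge" filament are illustrative assumptions only.

```python
# Minimal single-node sketch (scikit-learn) contrasting FOF-style linking with
# density-based DBSCAN on toy particles. This is NOT the parallel PDSDBSCAN or
# the HACC FOF halo finder; eps, min_samples, and the "bridge" filament are
# illustrative assumptions only.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
halo_a = rng.normal([0.0, 0.0, 0.0], 0.05, size=(2000, 3))
halo_b = rng.normal([0.5, 0.0, 0.0], 0.05, size=(2000, 3))
bridge = np.linspace([0.0, 0.0, 0.0], [0.5, 0.0, 0.0], 40)   # thin filament between halos
particles = np.vstack([halo_a, halo_b, bridge])

eps = 0.02                                   # linking length / DBSCAN radius
# min_samples=1 makes every point a core point, so DBSCAN degenerates to
# friends-of-friends linking; the filament falsely joins the two halos.
fof_like = DBSCAN(eps=eps, min_samples=1).fit_predict(particles)
# A larger min_samples demands a minimum local density, so the sparse filament
# is labeled noise and the two halos stay separate (improved linking at edges).
density_based = DBSCAN(eps=eps, min_samples=8).fit_predict(particles)

print("FOF-like groups: ", len(set(fof_like)))
print("DBSCAN clusters: ", len(set(density_based)) - (1 if -1 in density_based else 0))
```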

Right: Vl3 is a parallel visual analysis framework for Blue Gene supercomputers, GPU-based clusters, and scientists' laptops. Its modular design scales to 16K GPU cores and supports interactive visual analysis and data exploration of 1 trillion particles.

Feature Tracking

[Diagram: distributed MergerTree. Labels: z=5, z=3, z=2, z=1; Temporal Halo Information; Current Halo Data; Previous Halo Data; Parallel MergerTree Kernel.]

[Plot: Storage Requirements for In Situ Merger Tree from a 1024³ N-Body Cosmological Simulation; tracking data size (GB, 0 to 1.0) versus timestep (0 to 450).]

Left: Parallel in situ merger tree algorithm builds merger trees incrementally in-core. Only the final merger tree is stored at the end of the simulation.
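The usual way to build such a tree is to link halos in adjacent timesteps by the particle IDs they share. The sketch below illustrates that linking step on assumed inputs (dictionaries mapping halo ID to a set of particle IDs) and is not the parallel in situ MergerTree kernel itself.

```python
# Illustrative sketch of the per-timestep linking step behind a merger tree:
# connect each current halo to the previous halos it shares particle IDs with.
# Inputs are assumed dictionaries mapping halo ID -> set of particle IDs; this
# is not the parallel in situ MergerTree kernel itself.
def link_halos(prev_halos, curr_halos, min_shared=10):
    """Return (prev_id, curr_id, n_shared) edges from progenitors to descendants."""
    owner = {pid: hid for hid, pids in prev_halos.items() for pid in pids}
    edges = []
    for curr_id, pids in curr_halos.items():
        counts = {}
        for pid in pids:
            prev_id = owner.get(pid)
            if prev_id is not None:
                counts[prev_id] = counts.get(prev_id, 0) + 1
        edges += [(p, curr_id, n) for p, n in counts.items() if n >= min_shared]
    return edges

# Toy example: previous halo 1 contributes particles to both current halos A and B.
prev = {1: set(range(0, 100)), 2: set(range(100, 150))}
curr = {"A": set(range(0, 60)), "B": set(range(60, 130))}
print(link_halos(prev, curr))   # [(1, 'A', 60), (1, 'B', 40), (2, 'B', 30)]
```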

Voronoi Tessellation

[Plot: Strong scaling. Tessellation time including I/O (s) versus number of processes (128 to 16384) for 128³, 256³, 512³, and 1024³ particles, with perfect scaling for reference.]

[Plot: Weak scaling. Tessellation time per particle including I/O (microseconds) versus number of processes (128 to 8192) for 128³, 256³, and 512³ particles, with perfect scaling for reference.]

[Histograms: cell density contrast ((density − mean) / mean) versus number of cells at three time steps, 100 bins each.]

Time step   Range            Bin width   Skewness   Kurtosis
t = 11      [−0.77, 0.59]    0.014       1.6        4.1
t = 21      [−0.77, 2.4]     0.033       2          5.5
t = 31      [−0.72, 15]      0.15        4.5        23

Right: Overview of the parallel algorithm, using an example with four processes. Particles and Voronoi cells are colored according to the process where they originated prior to exchanging the ghost layer.

Below: Density contrast distribution of evolving Voronoi cells at three time steps. Statistically, the trends are consistent with the formation of large-scale structures such as halos and voids.
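For a Voronoi tessellation, the per-cell density contrast can be estimated directly from cell volumes, since each cell contains exactly one particle. The sketch below assumes unit-mass particles and a precomputed array of cell volumes, and uses toy lognormal volumes rather than actual tessellation output.

```python
# Rough sketch of the per-cell density contrast statistic shown in the
# histograms above, assuming unit-mass particles (one per Voronoi cell) and a
# precomputed array of cell volumes. Toy lognormal volumes stand in for the
# actual tessellation output.
import numpy as np
from scipy.stats import skew, kurtosis

def density_contrast(cell_volumes):
    density = 1.0 / np.asarray(cell_volumes)       # one unit-mass particle per cell
    return density / density.mean() - 1.0          # (density - mean) / mean

volumes = np.random.default_rng(3).lognormal(mean=0.0, sigma=0.5, size=100_000)
delta = density_contrast(volumes)
counts, edges = np.histogram(delta, bins=100)      # 100 bins, as in the plots above
print(f"range [{delta.min():.2f}, {delta.max():.2f}]  "
      f"skewness {skew(delta):.2f}  kurtosis {kurtosis(delta):.2f}")
```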

Bottom: In situ strong scaling (left) and weak scaling (center) are plotted on a log-log scale. Weak scaling time is normalized by the number of particles. Plots represent the total tessellation time, including the time to write the result to storage. Strong scaling efficiency is 41%; weak scaling efficiency is 86%. Bottom right: raw performance is tabulated.
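The quoted efficiencies follow the standard definitions: strong-scaling efficiency compares measured speedup to ideal speedup over the process range, and weak-scaling efficiency compares normalized time at the smallest and largest runs. The timings in the sketch below are invented solely to show the arithmetic.

```python
# Standard scaling-efficiency arithmetic; the timings below are invented solely
# to illustrate the formulas, not measured values from the poster's runs.
def strong_scaling_efficiency(p0, t0, p1, t1):
    # Ideal strong scaling: time drops in proportion to the added processes.
    return (t0 * p0) / (t1 * p1)

def weak_scaling_efficiency(t0, t1):
    # Ideal weak scaling: normalized (per-particle) time stays constant.
    return t0 / t1

print(strong_scaling_efficiency(128, 100.0, 16384, 1.9))   # ~0.41, i.e. 41%
print(weak_scaling_efficiency(0.86, 1.0))                  # 0.86, i.e. 86%
```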

Above: Each Voronoi cell is associated with one input particle, the site of the cell. A cell consists of all points closer to the site of that cell than to any other site: V_i = { x | d(x, s_i) < d(x, s_k) ∀ k ≠ i }. In 3D, Voronoi cells are polyhedra; the dual is the Delaunay tetrahedralization.
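The same cell definition can be explored with off-the-shelf tools. The sketch below uses SciPy's Voronoi and Delaunay on toy particles to show the site-to-cell mapping and the duality; it is unrelated to the parallel in situ tessellation implementation described above.

```python
# Small sketch of the cell definition and Voronoi/Delaunay duality using SciPy
# on toy particles; the parallel in situ tessellation above is a separate,
# distributed implementation.
import numpy as np
from scipy.spatial import Voronoi, Delaunay, cKDTree

rng = np.random.default_rng(1)
sites = rng.random((1000, 3))                 # particle positions = Voronoi sites

vor = Voronoi(sites)                          # polyhedral cell per (interior) site
tri = Delaunay(sites)                         # dual Delaunay tetrahedralization
print(len(vor.vertices), "Voronoi vertices,", len(tri.simplices), "tetrahedra")

# V_i contains every point x with d(x, s_i) < d(x, s_k) for all k != i, so any
# query point belongs to the cell of its nearest site.
queries = rng.random((5, 3))
_, owning_cell = cKDTree(sites).query(queries)
print("owning cell per query point:", owning_cell)
```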

Above: The CosmoTools framework also includes ParaView plugins.

A unified framework facilitates integration of new algorithms, services, and tools without modifying HACC code. A simple API, consisting of 7 main functions, allows different tools to be easily controlled through a configuration file.
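The poster does not list the seven API functions, so the sketch below is purely hypothetical: it only illustrates the general pattern of an in situ analysis interface driven by a configuration file (initialize, hand over particle data, run enabled analyses at selected timesteps, finalize). All names are invented and are not the actual CosmoTools API.

```python
# Purely hypothetical sketch of an in situ analysis interface driven by a
# configuration file. Every name here is invented for illustration; this is
# not the actual CosmoTools API, which the poster does not enumerate.
import configparser

EXAMPLE_CONFIG = """
[schedule]
every_n_steps = 10

[filters]
halo_finding = yes
voronoi_tessellation = yes
feature_tracking = no
"""

class InSituAnalysis:
    def __init__(self, config_text):
        cfg = configparser.ConfigParser()
        cfg.read_string(config_text)
        self.every = cfg.getint("schedule", "every_n_steps", fallback=10)
        self.filters = [name for name, on in cfg["filters"].items() if on == "yes"]

    def initialize(self, comm=None):          # called once by the simulation
        self.comm = comm

    def set_particles(self, positions, velocities, ids):
        self.data = (positions, velocities, ids)   # zero-copy handoff in a real tool

    def coprocess(self, timestep):            # run enabled analyses at selected steps
        if timestep % self.every == 0:
            for name in self.filters:
                print(f"step {timestep}: running {name}")

    def finalize(self):
        pass

analysis = InSituAnalysis(EXAMPLE_CONFIG)
analysis.initialize()
for step in range(0, 31):
    analysis.coprocess(step)
```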

Speedups of MPI DBSCAN on astrophysics dataset

!"

#!!"

$%!!!"

$%#!!"

&%!!!"

!" #!!" $%!!!" $%#!!" &%!!!"!"##$%

"&

'()#*&

'($)"

'*$+#)"

'$$#+*)"

(a) Synthetic-cluster-extended dataset

!"

#!!"

$!!"

%&'!!"

%&(!!"

'&!!!"

!" )!!" %&!!!" %&)!!" '&!!!"

!"##$%

"&

'()#*&

*(%+"*,%-)+"*%%)-,+"

(b) Synthetic-random-extended dataset

!"

#!!"

$%!!!"

$%#!!"

&%!!!"

!" #!!" $%!!!" $%#!!" &%!!!"

!"##$%

"&

'()#*&

'(")(")'"

(c) Millennium-run-simulation dataset

!"

#$!!!"

%$!!!"

&$!!!"

'$!!!"

!" #$!!!" %$!!!" &$!!!" '$!!!"

!"##$%

"&'()#*&

(("

(d) Millennium-run-simulation dataset

Figure 6. Speedup of PDSDBSCAN-D on Hopper at NERSC, a CRAY XE6distributed memory computer, on three different categories of datasets.

!"

#!"

$!"

%!"

&!"

'!!"

!" ()!!!" '!)!!!" '()!!!"

!"#$"%

&'(")*+)&',"%

)-.")

/*#"0)

*+,-.",+/012-3+4"5678948"

(a) Local comp. vs. Merging on mm

!"!#

!"$#

!"%#

!"&#

!"'#

(&)*# (+)",*# ()),"+*#

!"#$%&'(%)*$+,

&-''.

/%'0'

1232

4356

782'./%'9:

;'

&%# )$'# $,&# ,)$#

(b) Synthetic-cluster-extended dataset

!"

#"

$"

%"

&"

'!"

'#"

(%')" (*'+,)" ('',+*)"

!"#$%&'(%)*$+,

&-''.

/%'0'

1232

4356

782'./%'9:

;'

%$" '#&" #,%" ,'#"

(c) Synthetic-random-extended dataset

!"

#"

$"

%"

&"

'!"

'#"

()" **" *)" *("

!"#$%&'(%)*$+,

&-./0%.1.

2343

5467

8'3./0%.9:

;.

%$" '#&" #+%" +'#"

(d) Millennium-run-simulation dataset

Figure 7. (a) Trade-off between local computation and merging w.r.t thenumber of processors on mm, a millennium-run-simulation dataset. (b)-(d)Time taken by the preprocessing step, gather-neighbors, compared to the totaltime taken by PDSDBSCAN-D using 64, 128, 256, and 512 processors.

The speedups on the synthetic-cluster-extended and millennium-simulation-run (db, mb, md) datasets are significantly higher than on the synthetic-random-extended dataset. However, on the dataset mm in millennium-simulation-run (Figure 6(d)), we get a speedup of 5,765 using 8,192 process cores.

Figure 7(a) shows the trade-off between the local computation and the merging stage by comparing them with the total time (local computation time + merging time) in percent. We use mm, the millennium-run-simulation dataset, for this purpose and continue up to 16,384 processors to understand the behavior clearly. As can be seen, the communication time increases (the computation time decreases) with the number of processors. When using more than 10,000 processors, communication time starts dominating the computation time and therefore the speedup starts decreasing. For example, we achieved a speedup of 5,765 using 8,192 process cores, whereas the speedup is 5,124 using 16,384 process cores. We observe similar behaviors for other datasets.

Figures 7(b), 7(c), and 7(d) show a comparison of the time taken by the gather-neighbors preprocessing step to the total time taken by PDSDBSCAN-D, in percent, on all datasets using 64, 128, 256, and 512 processors. As can be seen, the gather-neighbors step adds an overhead of at most 0.59% (minimum 0.10% and average 0.27%) of total time on synthetic-cluster-extended datasets. Similar results are found on millennium-simulation-run datasets (maximum 4.82%, minimum 0.21%, and average 1.25%). However, these numbers are relatively higher (maximum 9.82%, minimum 1.01%, and average 3.76%) for synthetic-random-extended datasets, as the points are uniformly distributed in space and therefore the number of points gathered in each processor is higher compared to the other two test sets. It is also to be noted that these values increase with the number of processors and also with the eps parameter, as the overlapping region among the processors is proportional to the number of processors. We observe that on 64 processors the memory space taken by the remote points in each processor is on average 0.68 times, 1.57 times, and 1.02 times that taken by the local points on the synthetic-cluster-extended, synthetic-random-extended, and millennium-simulation-run datasets, respectively. These values change to 1.27 times, 2.94 times, and 3.18 times, respectively, on 512 processors. However, with this scheme the local-computation stage in PDSDBSCAN-D can perform the clustering without any communication overhead, similar to PDSDBSCAN-S. The alternative would be to perform communication for each point to obtain its remote neighbors.

VI. CONCLUSION AND FUTURE WORK

In this study we have revisited the well-known density-based clustering algorithm, DBSCAN. This algorithm is known to be challenging to parallelize as the computation involves an inherent data access order. We present a new parallel DBSCAN (PDSDBSCAN) algorithm based on the disjoint-set data structure. The use of this data structure works as a mechanism for increasing concurrency, which in turn leads to scalable performance. The algorithm uses a bottom-up approach to construct the clusters as a collection of hierarchical trees. This approach achieves a better-balanced workload distribution. PDSDBSCAN is implemented using both OpenMP and MPI. Our experimental results conducted on a shared-memory computer show scalable performance, achieving speedups up to a factor of 30.3 when using 40 cores on datasets containing several hundred million high-dimensional points. Similar scalability results have been obtained on a distributed-memory machine, with a speedup of 5,765 using 8,192 process cores. Our experiments also show that PDSDBSCAN significantly outperforms existing parallel DBSCAN algorithms. We intend to conduct further studies to provide more extensive results on a much larger number of cores with datasets from different scientific domains. Finally, we note that our algorithm also seems to be suitable for other parallel architectures, such as GPU and heterogeneous architectures.
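The disjoint-set structure mentioned above is the textbook union-find with path compression and union by rank: merging two clusters is a union of their trees, which is what makes bottom-up, order-independent cluster construction possible. The sketch below shows the generic data structure, not the paper's parallel implementation.

```python
# Textbook disjoint-set (union-find) with path compression and union by rank,
# the data structure PDSDBSCAN builds on. This is the generic structure,
# not the paper's parallel implementation.
class DisjointSet:
    def __init__(self, n):
        self.parent = list(range(n))
        self.rank = [0] * n

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]   # path compression
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        if self.rank[ra] < self.rank[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra                                # union by rank
        if self.rank[ra] == self.rank[rb]:
            self.rank[ra] += 1

ds = DisjointSet(6)
for a, b in [(0, 1), (1, 2), (4, 5)]:    # merge neighboring points into clusters
    ds.union(a, b)
print([ds.find(i) for i in range(6)])    # cluster representative per point
```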

Detailed profiling of computation and communication (merging) costs.
Comparison of DBSCAN and FOF clustering results.

Three representations of the same halo. From left to right: original raw particle data, Voronoi tessellation, and regular grid density sampling.

Voronoi tessellation of cosmological simulations reveals regions of irregular low-density voids amid clusters of high-density halos.

Right: Organizing particles in a kd-tree improves the performance of the FOF halo finder in HACC. Implementing it in PISTON leverages the thread parallelism of GPUs and many-core CPUs.
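As a rough CPU-only illustration of the idea (not the PISTON or HACC implementation), friends-of-friends linking accelerated by a kd-tree can be written with SciPy; the linking length and minimum group size below are arbitrary choices for the toy data.

```python
# Rough CPU-only sketch of kd-tree accelerated friends-of-friends (FOF) linking
# using SciPy; this is not the PISTON/HACC implementation, and the linking
# length and minimum group size are arbitrary choices for the toy data.
import numpy as np
from scipy.spatial import cKDTree
from scipy.sparse.csgraph import connected_components

def fof_halos(positions, linking_length, min_members=10):
    """Label each particle with its FOF group; groups below min_members get -1."""
    tree = cKDTree(positions)
    # Sparse distance matrix keeps only pairs within the linking length,
    # which is where the kd-tree pays off relative to brute force.
    pairs = tree.sparse_distance_matrix(tree, linking_length, output_type="coo_matrix")
    _, labels = connected_components(pairs, directed=False)
    sizes = np.bincount(labels)
    return np.where(sizes[labels] >= min_members, labels, -1)

positions = np.random.default_rng(2).random((50_000, 3))   # toy particles, unit box
labels = fof_halos(positions, linking_length=0.02)
print("halos with >= 10 members:", len(np.unique(labels[labels >= 0])))
```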

Left: The Cosmic Emu 2D slicer portrays small multiples of 2D slices through a 5D space of cosmological input parameters and maps the output power spectrum to color.

[Diagram: merger tree of halo IDs across successive timesteps.]

[Plot: Parallel MergerTree performance, 256³ grid, 512 MPI ranks; execution time (s, 0 to 30) versus timestep (0 to 45), split into I/O and MergerTree components.]

Right: The resulting merger tree is validated by scientists (image courtesy of Eve Kovacs).

Above: Run time and memory usage as the tree grows larger over time.