High Performance Computing and Big Data

TRANSCRIPT

Page 1: High Performance Computing and Big Data

Big Data Institute, Seoul National University, Korea

Geoffrey Fox, August 22, 2016, [email protected]

http://www.dsc.soic.indiana.edu/, http://spidal.org/, http://hpc-abds.org/kaleidoscope/

Department of Intelligent Systems Engineering, School of Informatics and Computing, Digital Science Center, Indiana University Bloomington

High Performance Computing and Big Data

Page 2: High Performance Computing and Big Data

Abstract
• We propose a hybrid software stack in which large-scale data systems for both research and commercial applications run on the commodity (Apache) Big Data Stack (ABDS), with High Performance Computing (HPC) enhancements added where needed to improve performance. We give several examples taken from bio- and financial informatics.
• We look in detail at parallel and distributed runtimes, including MPI from HPC and Apache Storm, Heron, Spark and Flink from ABDS, stressing that one needs to distinguish the different needs of parallel (tightly coupled) and distributed (loosely coupled) systems.
• We also study "Java Grande", i.e. the principles that allow Java codes to perform as fast as those written in more traditional HPC languages. We also note the differences between capability computing (individual jobs using many nodes) and capacity computing (lots of independent jobs).
• We discuss how this HPC-ABDS concept allows one to discuss convergence of Big Data, Big Simulation, Cloud and HPC systems. See http://hpc-abds.org/kaleidoscope/

Page 3: High Performance Computing and Big Data

Why Connect ("Converge") Big Data and HPC
• Two major trends in computing systems are
– Growth in high performance computing (HPC) with an international exascale initiative (China in the lead)
– The big data phenomenon with an accompanying cloud infrastructure of well publicized, dramatic and increasing size and sophistication.
• Note "Big Data" is largely an industry initiative, although the software used is often open source
– So the HPC label overlaps with "research", e.g. the HPC community is largely responsible for Astronomy and Accelerator (LHC, Belle, BEPC ...) data analysis
• Merge HPC and Big Data to get
– More efficient sharing of large-scale resources running simulations and data analytics
– Higher performance Big Data algorithms
– A richer software environment for the research community, building on many big data tools
– An easier sustainability model for HPC; HPC does not have the resources to build and maintain a full software stack

Page 4: High Performance Computing and Big Data

Convergence Points (Nexus) for HPC-Cloud-Big Data-Simulation

• Nexus 1: Applications – Divide use cases into Data and Model and compare characteristics separately in these two components with 64 Convergence Diamonds (features)

• Nexus 2: Software – High Performance Computing (HPC) enhanced Big Data Stack HPC-ABDS. 21 layers adding a high performance runtime to Apache systems (Hadoop is fast!). Establish principles to get good performance from the Java or C programming languages

• Nexus 3: Hardware – Use Infrastructure as a Service (IaaS) and DevOps to automate deployment of software-defined systems on hardware designed for functionality and performance, e.g. appropriate disks, interconnect, memory

Page 5: High Performance Computing and Big Data

SPIDAL Project
Datanet: CIF21 DIBBs: Middleware and High Performance Analytics Libraries for Scalable Data Science
• NSF 14-43054, started October 1, 2014
• Indiana University (Fox, Qiu, Crandall, von Laszewski)
• Rutgers (Jha)
• Virginia Tech (Marathe)
• Kansas (Paden)
• Stony Brook (Wang)
• Arizona State (Beckstein)
• Utah (Cheatham)
• A co-design project: software, algorithms, applications

Page 6: High Performance Computing and Big Data

Software: MIDAS, HPC-ABDS

Co-designing Building Blocks Collaboratively

Page 7: High Performance Computing and Big Data

Main Components of SPIDAL Project
• Design and build a scalable high performance data analytics library
• SPIDAL (Scalable Parallel Interoperable Data Analytics Library): scalable analytics for:
– Domain-specific data analytics libraries – mainly from the project.
– Add core machine learning libraries – mainly from the community.
– Performance of Java and MIDAS inter- and intra-node.
• NIST Big Data Application Analysis – features of data intensive applications deriving 64 Convergence Diamonds. Application Nexus.
• HPC-ABDS: Cloud-HPC interoperable software: the performance of HPC (High Performance Computing) and the rich functionality of the commodity Apache Big Data Stack. Software Nexus.
• MIDAS: integrating middleware – from the project.
• Applications: Biomolecular Simulations, Network and Computational Social Science, Epidemiology, Computer Vision, Geographical Information Systems, Remote Sensing for Polar Science and Pathology Informatics, streaming for robotics, streaming stock analytics.
• Implementations: HPC as well as clouds (OpenStack, Docker); convergence with a common DevOps tool. Hardware Nexus.

Page 8: High Performance Computing and Big Data


Application Nexus

Use-case Data and Model
NIST Collection
Big Data Ogres
Convergence Diamonds

Page 9: High Performance Computing and Big Data

Data and Model in Big Data and Simulations I
• Need to discuss Data and Model as problems have both intermingled, but we can get insight by separating them, which allows better understanding of Big Data - Big Simulation "convergence" (or differences!)
• The Model is a user construction: it has a "concept" and parameters and gives results determined by the computation. We use the term "model" in a general fashion to cover all of these.
• Big Data problems can be broken up into Data and Model
– For clustering, the model parameters are the cluster centers while the data is the set of points to be clustered (see the sketch below)
– For queries, the model is the structure of the database and the results of the query, while the data is the whole database queried and the SQL query
– For deep learning with ImageNet, the model is the chosen network with model parameters as the network link weights. The data is the set of images used for training or classification
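To make the Data/Model split concrete, here is a minimal K-means-style sketch (illustrative NumPy only, not SPIDAL code): the data (the points) stays fixed while the model (the cluster centers) is what each iteration updates.

```python
import numpy as np

# Data: the fixed set of points to be clustered (static between iterations)
points = np.random.rand(10_000, 3)

# Model: the cluster centers, i.e. the parameters the computation determines
centers = points[np.random.choice(len(points), 8, replace=False)]

for _ in range(10):                      # model parameters change every iteration
    # assign each data point to its nearest center
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    # update the model: each center becomes the mean of its assigned points
    for k in range(len(centers)):
        members = points[labels == k]
        if len(members):
            centers[k] = members.mean(axis=0)
```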


Page 10: High Performance Computing and Big Data

Data and Model in Big Data and Simulations II
• Simulations can also be considered as Data plus Model
– The Model can be a formulation with particle dynamics or partial differential equations, defined by parameters such as particle positions and discretized velocity, pressure and density values
– The Data could be small when it is just boundary conditions
– The Data is large with data assimilation (weather forecasting) or when data visualizations are produced by the simulation
• Big Data implies the Data is large, but the Model varies in size
– e.g. LDA with many topics or deep learning has a large model
– Clustering or dimension reduction can be quite small in model size
• Data is often static between iterations (unless streaming); Model parameters vary between iterations

Page 11: High Performance Computing and Big Data


http://hpc-abds.org/kaleidoscope/survey/

Online Use Case Form

Page 12: High Performance Computing and Big Data

51 Detailed Use Cases: contributed July-September 2013
Covers goals, data features such as the 3 V's, software, hardware
• Government Operation (4): National Archives and Records Administration, Census Bureau
• Commercial (8): Finance in Cloud, Cloud Backup, Mendeley (Citations), Netflix, Web Search, Digital Materials, Cargo shipping (as in UPS)
• Defense (3): Sensors, Image surveillance, Situation Assessment
• Healthcare and Life Sciences (10): Medical records, Graph and Probabilistic analysis, Pathology, Bioimaging, Genomics, Epidemiology, People Activity models, Biodiversity
• Deep Learning and Social Media (6): Driving Car, Geolocate images/cameras, Twitter, Crowd Sourcing, Network Science, NIST benchmark datasets
• The Ecosystem for Research (4): Metadata, Collaboration, Language Translation, Light source experiments
• Astronomy and Physics (5): Sky Surveys including comparison to simulation, Large Hadron Collider at CERN, Belle Accelerator II in Japan
• Earth, Environmental and Polar Science (10): Radar Scattering in Atmosphere, Earthquake, Ocean, Earth Observation, Ice sheet Radar scattering, Earth radar mapping, Climate simulation datasets, Atmospheric turbulence identification, Subsurface Biogeochemistry (microbes to watersheds), AmeriFlux and FLUXNET gas sensors
• Energy (1): Smart grid
• Published by NIST as http://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.1500-3.pdf with a common set of 26 features recorded for each use case; "Version 2" being prepared

26 features for each use case; biased to science

Page 13: High Performance Computing and Big Data

Sample Features of 51 Use Cases I
• PP (26) "All" Pleasingly Parallel or Map Only
• MR (18) Classic MapReduce MR (add MRStat below for full count)
• MRStat (7) Simple version of MR where key computations are simple reductions as found in statistical averages such as histograms and averages
• MRIter (23) Iterative MapReduce or MPI (Flink, Spark, Twister)
• Graph (9) Complex graph data structure needed in analysis
• Fusion (11) Integrate diverse data to aid discovery/decision making; could involve sophisticated algorithms or could just be a portal
• Streaming (41) Some data comes in incrementally and is processed this way
• Classify (30) Classification: divide data into categories
• S/Q (12) Index, Search and Query

Page 14: High Performance Computing and Big Data

Sample Features of 51 Use Cases II
• CF (4) Collaborative Filtering for recommender engines
• LML (36) Local Machine Learning (independent for each parallel entity) – application could have GML as well
• GML (23) Global Machine Learning: Deep Learning, Clustering, LDA, PLSI, MDS
– Large Scale Optimizations as in Variational Bayes, MCMC, Lifted Belief Propagation, Stochastic Gradient Descent, L-BFGS, Levenberg-Marquardt. Can call this EGO or Exascale Global Optimization with scalable parallel algorithms
• Workflow (51) Universal
• GIS (16) Geotagged data, often displayed in ESRI, Microsoft Virtual Earth, Google Earth, GeoServer etc.
• HPC (5) Classic large-scale simulation of cosmos, materials, etc. generating (visualization) data
• Agent (2) Simulations of models of data-defined macroscopic entities represented as agents

Page 15: High Performance Computing and Big Data

7 Computational Giants of NRC Massive Data Analysis Report
1) G1: Basic Statistics, e.g. MRStat
2) G2: Generalized N-Body Problems
3) G3: Graph-Theoretic Computations
4) G4: Linear Algebraic Computations
5) G5: Optimizations, e.g. Linear Programming
6) G6: Integration, e.g. LDA and other GML
7) G7: Alignment Problems, e.g. BLAST

http://www.nap.edu/catalog.php?record_id=18374
Big Data Models?

Page 16: High Performance Computing and Big Data

HPC (Simulation) Benchmark Classics
• Linpack or HPL: parallel LU factorization for solution of linear equations; HPCG
• NPB version 1: mainly classic HPC solver kernels
– MG: Multigrid
– CG: Conjugate Gradient
– FT: Fast Fourier Transform
– IS: Integer Sort
– EP: Embarrassingly Parallel
– BT: Block Tridiagonal
– SP: Scalar Pentadiagonal
– LU: Lower-Upper symmetric Gauss-Seidel

Simulation Models

Page 17: High Performance Computing and Big Data

13 Berkeley Dwarfs
1) Dense Linear Algebra
2) Sparse Linear Algebra
3) Spectral Methods
4) N-Body Methods
5) Structured Grids
6) Unstructured Grids
7) MapReduce
8) Combinational Logic
9) Graph Traversal
10) Dynamic Programming
11) Backtrack and Branch-and-Bound
12) Graphical Models
13) Finite State Machines

The first 6 of these correspond to Colella's original list (classic simulations); Monte Carlo was dropped, and N-body methods are a subset of "Particle" in Colella.

Note the list is a little inconsistent in that MapReduce is a programming model while spectral method is a numerical method. We need multiple facets to classify use cases!

Largely Models for Data or Simulation

Page 18: High Performance Computing and Big Data

Classifying Use cases


Page 19: High Performance Computing and Big Data

Classifying Use Cases
• The Big Data Ogres were built on a collection of 51 big data use cases gathered by the NIST Public Working Group, where 26 properties were gathered for each application.
• This information was combined with other studies including the Berkeley dwarfs, the NAS parallel benchmarks and the Computational Giants of the NRC Massive Data Analysis Report.
• The Ogre analysis led to a set of 50 features, divided into four views, that can be used to categorize and distinguish between applications.
• The four views are Problem Architecture (macro patterns); Execution Features (micro patterns); Data Source and Style; and finally the Processing View or runtime features.
• We generalized this approach to integrate Big Data and Simulation applications into a single classification looking separately at Data and Model, with the total number of facets growing to 64 – called Convergence Diamonds – split between the same 4 views.
• A mapping of facets into the work of the SPIDAL project has been given.

Page 20: High Performance Computing and Big Data

[Figure: 4 Ogre Views and 50 Facets.
– Problem Architecture View: Pleasingly Parallel; Classic MapReduce; Map-Collective; Map Point-to-Point; Map Streaming; Shared Memory; Single Program Multiple Data; Bulk Synchronous Parallel; Fusion; Dataflow; Agents; Workflow
– Data Source and Style View: SQL/NoSQL/NewSQL; Enterprise Data Model; Files/Objects; HDFS/Lustre/GPFS; Archived/Batched/Streaming; Shared/Dedicated/Transient/Permanent; Metadata/Provenance; Internet of Things; HPC Simulations; Geospatial Information System
– Execution View: Performance Metrics; Flops per Byte / Memory I/O; Execution Environment and Core libraries; Volume; Velocity; Variety; Veracity; Communication Structure; Data Abstraction; Metric = M / Non-Metric = N; O(N²) = NN / O(N) = N; Regular = R / Irregular = I; Dynamic = D / Static = S; Iterative/Simple
– Processing View: Micro-benchmarks; Local Analytics; Global Analytics; Base Statistics; Recommendations; Search/Query/Index; Classification; Learning; Optimization Methodology; Streaming; Alignment; Linear Algebra Kernels; Graph Algorithms; Visualization]

Page 21: High Performance Computing and Big Data

64 Features in 4 views for Unified Classification of Big Data and Simulation Applications

[Figure: Convergence Diamonds – 64 facets in the same 4 views (Problem Architecture, Execution, Data Source and Style, Processing), now split into Data (D) and Model (M) facets. The Processing View separates Big Data Processing Diamonds (Micro-benchmarks; Local and Global Analytics/Informatics/Simulations; Recommender Engine; Base Data Statistics; Data Search/Query/Index; Data Classification; Learning; Optimization Methodology; Streaming Data Algorithms; Data Alignment; Linear Algebra Kernels with many subclasses; Graph Algorithms; Visualization; Core Libraries) from Simulation (Exascale) Processing Diamonds (Iterative PDE Solvers; Multiscale Method; Spectral Methods; N-body Methods; Particles and Fields; Evolution of Discrete Systems; Nature of mesh if used). Execution facets such as Performance Metrics; Flops per Byte / Memory IO / Flops per watt; Execution Environment and Core libraries; Data Volume and Model Size; Data Velocity; Data and Model Variety; Veracity; Communication Structure; Data and Model Abstraction; Metric/Non-Metric; Regular/Irregular; Dynamic/Static; Iterative/Simple appear separately for Data (D) and Model (M). Other labels: Simulations and Analytics (Model for Big Data); Nearly all Data; Nearly all Data+Model; Mix of Data and Model; All Model.]

Page 22: High Performance Computing and Big Data

Examples in Problem Architecture View PA
• The facets in the Problem Architecture view include 5 very common ones describing the synchronization structure of a parallel job (see the sketch after this list):
– MapOnly or Pleasingly Parallel (PA1): the processing of a collection of independent events;
– MapReduce (PA2): independent calculations (maps) followed by a final consolidation via MapReduce;
– MapCollective (PA3): parallel machine learning dominated by scatter, gather, reduce and broadcast;
– MapPoint-to-Point (PA4): simulations or graph processing with many local linkages in the points (nodes) of the studied system;
– MapStreaming (PA5): the fifth important problem architecture is seen in recent approaches to processing real-time data.
– We do not focus on pure shared-memory architectures (PA6) but look at hybrid architectures with clusters of multicore nodes, and find important performance issues dependent on the node programming model.
• Most of our codes are SPMD (PA7) and BSP (PA8).
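A toy illustration (hypothetical function and data, Python standard library only) of how the first two structures differ: PA1 is a bag of independent maps, while PA2 adds one final consolidation.

```python
from functools import reduce
from multiprocessing import Pool

def process_event(event):
    """Map: an independent calculation on one event (the PA1 / PA2 map phase)."""
    return event * event

if __name__ == "__main__":
    events = range(1_000)

    # PA1 MapOnly / Pleasingly Parallel: independent maps, no consolidation needed
    with Pool() as pool:
        mapped = pool.map(process_event, events)

    # PA2 Classic MapReduce: the same maps followed by a final reduction
    total = reduce(lambda a, b: a + b, mapped, 0)
    print(total)
```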


Page 23: High Performance Computing and Big Data

6 Forms of MapReduce

Describes the architecture of:
– Problem (Model reflecting data)
– Machine
– Software

2 important (software) variants of Iterative MapReduce and Map-Streaming:
a) "In-place" HPC
b) Flow for model and data

[Figure: the 6 forms –
1) Map-Only, Pleasingly Parallel
2) Classic MapReduce (Input, map, reduce)
3) Iterative MapReduce or Map-Collective (Input, iterated map and reduce, Output)
4) Map Point-to-Point Communication (Input, map, local graph communication)
5) Map-Streaming (Events, brokers, maps)
6) Shared-Memory Map Communication (Shared memory, map & communication)]

Page 24: High Performance Computing and Big Data

Examples in Execution View EV
• The Execution view is a mix of facets describing either data or model; PA was largely the overall Data+Model.
• EV-M14 is complexity of the model (O(N²) for N points), seen in the non-metric space models EV-M13 such as one gets with DNA sequences.
• EV-M11 describes iterative structure, distinguishing Spark, Flink and Harp from the original Hadoop.
• The facet EV-M8 describes the communication structure, which is a focus of our research as much data analytics relies on collective communication; this is in principle understood, but we find that significant new work is needed compared to basic HPC releases, which tend to address point-to-point communication.
• The model size EV-M4 and data volume EV-D4 are important in describing algorithm performance: just as in simulation problems, the grain size (the number of model parameters held in the unit – thread or process – of parallel computing) is a critical measure of performance.

Page 25: High Performance Computing and Big Data

Examples in Data View DV
• We can highlight DV-5 streaming, where there is a lot of recent progress;
• DV-9 categorizes our biomolecular simulation application with data produced by an HPC simulation;
• DV-10 is Geospatial Information Systems, covered by our spatial algorithms;
• DV-7, provenance, is an example of an important feature that we are not covering;
• The data storage and access facets DV-3 and DV-4 are covered in our pilot data work;
• The Internet of Things DV-8 is not a focus of our project, although our recent streaming work relates to this and our addition of HPC to Apache Heron and Storm is an example of the value of HPC-ABDS to IoT.

Page 26: High Performance Computing and Big Data

Examples in Processing View PV
• The Processing view PV characterizes algorithms and is only Model (no Data features), but covers both Big Data and Simulation use cases.
• Graph PV-M13 and Visualization PV-M14 are covered in SPIDAL.
• PV-M15 directly describes SPIDAL, which is a library of core and other analytics.
• This project covers many aspects of PV-M4 to PV-M11, as these characterize the SPIDAL algorithms (such as optimization, learning, classification).
– We are of course NOT addressing PV-M16 to PV-M22, which are simulation algorithm characteristics not applicable to data analytics.
• Our work largely addresses Global Machine Learning PV-M3, although some of our image analytics are Local Machine Learning PV-M2 with parallelism over images and not over the analytics.
• Many of our SPIDAL algorithms have linear algebra PV-M12 at their core; one nice example is multidimensional scaling (MDS), which is based on matrix-matrix multiplication and conjugate gradient.

Page 27: High Performance Computing and Big Data

Comparison of Data Analytics with Simulation I
• Simulations (models) produce big data as visualization of results – they are a data source
– Or they consume often smallish data to define a simulation problem
– HPC simulation with (weather) data assimilation is data + model
• Pleasingly parallel is often important in both
• Both are often SPMD and BSP
• Non-iterative MapReduce is a major big data paradigm
– not a common simulation paradigm, except where "Reduce" summarizes pleasingly parallel execution as in some Monte Carlos
• Big Data often has large collective communication
– Classic simulation has a lot of smallish point-to-point messages
– This motivates the MapCollective model
• Simulations are often characterized by difference or differential operators leading to nearest-neighbor sparsity
• Some important data analytics can be sparse, as in PageRank and "bag of words" algorithms, but many involve full matrix algorithms

Page 28: High Performance Computing and Big Data

Comparison of Data Analytics with Simulation II
• There are similarities between some graph problems and particle simulations with a particular cutoff force.
– Both are MapPoint-to-Point problem architecture
• Note many big data problems are "long range force" (as in gravitational simulations) as all points are linked.
– Easiest to parallelize. Often full matrix algorithms
– e.g. in DNA sequence studies, the distance δ(i, j) is defined by BLAST, Smith-Waterman, etc., between all sequences i, j.
– Opportunity for "fast multipole" ideas in big data. See the NRC report
• The current Ogres/Diamonds do not have facets to designate underlying hardware: GPU vs. many-core (Xeon Phi) vs. multi-core, as these define how maps are processed; they keep the map-X structure fixed. Maybe this should change, as the ability to exploit vector or SIMD parallelism could be a model facet.

Page 29: High Performance Computing and Big Data

Comparison of Data Analytics with Simulation III
• In image-based deep learning, neural network weights are block sparse (corresponding to links to pixel blocks) but can be formulated as full matrix operations on GPUs and MPI in blocks.
• In HPC benchmarking, Linpack is being challenged by a new sparse conjugate gradient benchmark HPCG, while I am diligently using non-sparse conjugate gradient solvers in clustering and multidimensional scaling.
• Simulations tend to need high precision and very accurate results – partly because of differential operators
• Big Data problems often don't need high accuracy, as seen in the trend to low precision (16 or 32 bit) deep learning networks
– There are no derivatives and the data has inevitable errors
• Note parallel machine learning (GML, not LML) can benefit from HPC-style interconnects and architectures, as seen in GPU-based deep learning
– So commodity clouds are not necessarily best

Page 30: High Performance Computing and Big Data


Some use of Analytics

Page 31: High Performance Computing and Big Data

Clustering
• The SPIDAL library includes several clustering algorithms with sophisticated features
– Deterministic Annealing
– Radius cutoff in cluster membership
– Elkan's algorithm using the triangle inequality
• They also cover two important cases
– Points are vectors – algorithm is O(N) for N points
– Points are not vectors – all we know is the distance δ(i, j) between each pair of points i and j – algorithm is O(N²) for N points
• We find visualization important to judge the quality of clustering
• As data is typically not 2D or 3D, we use dimension reduction to project the data so we can then view it
• We have a browser viewer WebPlotViz that replaces an older Windows system

Page 32: High Performance Computing and Big Data

2D Vector Clustering with cutoff at 3σ

LCMS Mass Spectrometer Peak Clustering. Charge 2 sample with 10.9 million points and 420,000 clusters visualized in WebPlotViz.

Orange stars – points outside all clusters; yellow circles – cluster centers

Page 33: High Performance Computing and Big Data

Dimension Reduction
• Principal Component Analysis (a linear mapping) and Multidimensional Scaling MDS (nonlinear and applicable to non-Euclidean spaces) are methods to map abstract spaces to three dimensions for visualization
• Both run well in parallel and give great results
• Semimetric spaces have pairwise distances δ(i, j) defined between points i, j in the space
• But data is typically in a high-dimensional or non-vector space, so we use dimension reduction: associate each point i with a vector Xi in a Euclidean space of dimension K so that δ(i, j) ≈ d(Xi, Xj), where d(Xi, Xj) is the Euclidean distance between the mapped points i and j in the K-dimensional space
• K = 3 is natural for visualization but other values are interesting
• Principal Component Analysis is the best-known dimension reduction approach but is a) linear and b) requires the original points to be in a vector space (see the sketch below)
• There are many other nonlinear vector space methods such as GTM (Generative Topographic Mapping)
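For reference, a minimal NumPy sketch of the linear PCA mapping to K = 3 dimensions (illustrative only; the nonlinear MDS mapping needs an iterative solver such as the SMACOF variant on the next slide):

```python
import numpy as np

def pca_3d(X):
    """Project N x D vectors to 3D using the top three principal components."""
    Xc = X - X.mean(axis=0)                      # center the data
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:3].T                         # N x 3 coordinates for visualization

coords = pca_3d(np.random.rand(1000, 50))        # e.g. 1000 points in 50 dimensions
```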

Page 34: High Performance Computing and Big Data

WDA-SMACOF: "Best" MDS
• MDS minimizes the stress σ(X) with pairwise distances δ(i, j):

  σ(X) = Σ_{i<j≤N} weight(i, j) (δ(i, j) − d(Xi, Xj))²

• SMACOF is a clever Expectation Maximization method that chooses a good steepest descent
• Improved by Deterministic Annealing, gradually reducing a Temperature distance scale; DA does not impact compute time much and gives DA-SMACOF
– Deterministic Annealing is like Simulated Annealing but with no Monte Carlo
• Classic SMACOF is O(N²) for uniform weights and O(N³) for nontrivial weights, but one gets nonuniform weights from
– the preferred Sammon method, weight(i, j) = 1/δ(i, j), or
– missing distances put in as weight(i, j) = 0
• Use conjugate gradient – it converges in 5-100 iterations – a big gain for a matrix with a million rows. This removes a factor of N in time complexity and gives WDA-SMACOF
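A minimal sketch (NumPy/SciPy, illustrative only) of evaluating the stress and the Sammon-style weights defined above; the full WDA-SMACOF solver with deterministic annealing and conjugate gradient is not shown.

```python
import numpy as np
from scipy.spatial.distance import pdist

def stress(delta, weight, X):
    """Weighted MDS stress: sum over i<j of weight(i,j) * (delta(i,j) - d(Xi,Xj))^2.

    delta and weight are condensed upper-triangle arrays (pdist ordering);
    X holds the N x 3 mapped coordinates.
    """
    d = pdist(X)                                  # Euclidean distances after mapping
    return float(np.sum(weight * (delta - d) ** 2))

def sammon_weights(delta):
    """weight(i,j) = 1/delta(i,j); missing distances (delta = 0) get weight 0."""
    w = np.zeros_like(delta)
    np.divide(1.0, delta, out=w, where=delta > 0)
    return w
```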


Page 35: High Performance Computing and Big Data

446K sequences, ~100 clusters

Note the distorted shapes are probably due to imperfect distance measures, e.g. position correlated with sequence length

Page 36: High Performance Computing and Big Data

Fungi -- 4 Classic Clustering Methods plus Species Coloring


Page 37: High Performance Computing and Big Data

Heatmap of original distance vs 3D Euclidean distances for sequences and stocks
• One can visualize the quality of dimension reduction by comparing, as a scatterplot or heatmap, the distances δ(i, j) before and after mapping to 3D.
• Perfection is a straight diagonal line, and the results seem good in general
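A small sketch (NumPy/SciPy, illustrative only) of how such a heatmap can be built: bin the original distances δ(i, j) against the mapped 3D distances d(Xi, Xj); a perfect mapping puts all the mass on the diagonal.

```python
import numpy as np
from scipy.spatial.distance import pdist

def distance_heatmap(delta, X3d, bins=200):
    """2D histogram of original distances delta(i,j) versus mapped 3D distances.

    delta: condensed pairwise distances in the original space (pdist ordering);
    X3d: the N x 3 coordinates produced by MDS or PCA.
    """
    d3 = pdist(X3d)
    counts, delta_edges, d3_edges = np.histogram2d(delta, d3, bins=bins)
    return counts, delta_edges, d3_edges
```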


Proteomics Example

Stock Market Example

Page 38: High Performance Computing and Big Data


Labelled by Species MDS: Same Species Two groupings

Page 39: High Performance Computing and Big Data

RAxML result visualized in FigTree.

MSA or SWG distances

Spherical Phylograms

MSA

SWG

Page 40: High Performance Computing and Big Data

Quality of 3D Phylogenetic Tree
• 3 different MDS implementations and 3 different distance measures
• EM-SMACOF is basic SMACOF for MDS
• LMA was the previous best method, using a Levenberg-Marquardt nonlinear χ² solver
• WDA-SMACOF finds the best result

Sum of branch lengths of the Spherical Phylogram generated in 3D space on two datasets

[Bar charts: sum of branch lengths of the Spherical Phylogram on the 599nts and 999nts datasets, for MSA, SWG and NW distances, comparing WDA-SMACOF, LMA and EM-SMACOF.]

Page 41: High Performance Computing and Big Data

HTML5 web viewer WebPlotViz
• Supports visualization of 3D point sets (typically derived by mapping from abstract spaces) for the streaming and non-streaming cases
– Simple data management layer
– 3D web visualizer with various capabilities such as defining color schemes, point sizes, glyphs, labels
• Core technologies
– MongoDB management
– Play server-side framework
– Three.js
– WebGL
– JSON data objects
– Bootstrap Javascript web pages
• Open source: http://spidal-gw.dsc.soic.indiana.edu/
• ~10,000 lines of extra code

[Architecture diagram: front end view (browser) with plot visualization & time series animation (Three.js); web request controllers (Play framework) handling uploads, plot requests and JSON-format plots; an upload-format-to-JSON converter on the server; data layer in MongoDB.]

Page 42: High Performance Computing and Big Data

Stock Daily Data Streaming Example
• The example is a collection of around 7000 distinct stocks with daily values available at ~2750 distinct times
– Clusterings as provided by Wall Street: the Dow Jones set of 30 stocks, the S&P 500, various ETFs etc.
• Data comes from the Center for Research in Security Prices (CRSP) database through the Wharton Research Data Services (wrds) web interface
• Available for free to Indiana University students for research
• Daily stock prices from 2004 Jan 01 to 2015 Dec 31 in the form of a CSV file
• We use the information: ID, Date, Symbol, Factor to Adjust Volume, Factor to Adjust Price, Price, Outstanding Stocks
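A sketch (standard-library Python) of computing the relative change per stock from such a CSV file. The column names and the direction of the price adjustment are guesses at the export described above, so treat them as placeholders.

```python
import csv
from collections import defaultdict

def relative_changes(csv_path, baseline_date="2005-01-03"):
    """Relative adjusted-price change per symbol versus a baseline trading day."""
    series = defaultdict(dict)                    # symbol -> {date: adjusted price}
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            # assumed column names; the adjustment is shown as a multiply,
            # which may differ in the real CRSP export
            price = float(row["Price"]) * float(row["Factor to Adjust Price"])
            series[row["Symbol"]][row["Date"]] = price

    changes = {}
    for symbol, by_date in series.items():
        base = by_date.get(baseline_date)
        if base:
            changes[symbol] = {d: p / base - 1.0 for d, p in sorted(by_date.items())}
    return changes
```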


Page 43: High Performance Computing and Big Data

Relative Changes in Stock Values using one-day values, measured from January 2004 and starting after one year (January 2005). Filled circles are final values.

[Plot labels: Apple, Mid Cap, Energy, S&P, Dow Jones, Finance; Origin = 0% change, +10%, +20%]

Page 44: High Performance Computing and Big Data

Relative Changes in Stock Values using one-day values: expansion of the previous data.

[Plot labels: Mid Cap, Energy, S&P, Dow Jones, Finance; Origin = 0% change, +10%]

Page 45: High Performance Computing and Big Data

Algorithm Challenge
• The NRC Massive Data Analysis report stresses the importance of finding O(N) or O(N log N) algorithms for O(N²) problems, where N is the number of points
• This is well understood for O(N²) simulation problems where there is a long-range force, as in gravitational (cosmology) simulations for N stars or galaxies
• Such simulations are governed by equations that allow a systematic "multipole expansion" with O(N) as the first term plus corrections
– This has been used successfully in parallel for 25 years
• O(N²) big data problems don't have a systematic practical approach, even though there is a qualitative argument shown on the next slide.
• The work w(i, j) is labelled by two indices i and j, each running from 1 to N.
• If points i and j are near each other, one needs to perform accurate calculations
• If they are far apart, one can use approximations and, for example, replace the points in a far-away cluster of M particles by their cluster center weighted by M (sketched below)
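A sketch of that qualitative idea (illustrative Python/NumPy, not an algorithm from the project): pairwise work is done exactly only against nearby clusters, while a far-away cluster of M points is replaced by its centroid weighted by M, reducing O(N²) toward O(N × cluster size).

```python
import numpy as np

def approx_pair_sums(points, labels, centers, kernel, cutoff):
    """Approximate sum_j kernel(x_i, x_j) for every point i.

    Clusters whose center lies beyond `cutoff` from x_i contribute
    count * kernel(x_i, center); nearby clusters are summed exactly.
    """
    counts = np.bincount(labels, minlength=len(centers))
    result = np.zeros(len(points))
    for i, x in enumerate(points):
        dist_to_centers = np.linalg.norm(centers - x, axis=1)
        for k in range(len(centers)):
            if dist_to_centers[k] > cutoff:
                result[i] += counts[k] * kernel(x, centers[k])   # O(1) per far cluster
            else:
                for y in points[labels == k]:                    # exact for near clusters
                    result[i] += kernel(x, y)
    return result
```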


Page 46: High Performance Computing and Big Data

[Figure: "clean" sample of 446K points. The O(N²) interactions between the green and purple clusters should be representable by centroids as in Barnes-Hut. This is hard as there is no Gauss theorem and no multipole expansion, and the points are really in a 1000-dimensional space (clustered before 3D projection). The O(N²) green-green and purple-purple interactions have value, but the green-purple ones are "wasted". O(N²) is reduced to O(N) times cluster size.]

Page 47: High Performance Computing and Big Data

Software Nexus

Application Layer on
Big Data Software Components for Programming and Data Processing on
HPC for runtime on
IaaS and DevOps Hardware and Systems
• HPC-ABDS
• MIDAS
• Java Grande

Page 48: High Performance Computing and Big Data

HPC-ABDS
Kaleidoscope of (Apache) Big Data Stack (ABDS) and HPC Technologies

Cross-Cutting

Functions 1) Message and Data Protocols: Avro, Thrift, Protobuf 2) Distributed Coordination: Google Chubby, Zookeeper, Giraffe, JGroups 3) Security & Privacy: InCommon, Eduroam OpenStack Keystone, LDAP, Sentry, Sqrrl, OpenID, SAML OAuth 4) Monitoring: Ambari, Ganglia, Nagios, Inca

17) Workflow-Orchestration: ODE, ActiveBPEL, Airavata, Pegasus, Kepler, Swift, Taverna, Triana, Trident, BioKepler, Galaxy, IPython, Dryad, Naiad, Oozie, Tez, Google FlumeJava, Crunch, Cascading, Scalding, e-Science Central, Azure Data Factory, Google Cloud Dataflow, NiFi (NSA), Jitterbit, Talend, Pentaho, Apatar, Docker Compose, KeystoneML 16) Application and Analytics: Mahout , MLlib , MLbase, DataFu, R, pbdR, Bioconductor, ImageJ, OpenCV, Scalapack, PetSc, PLASMA MAGMA, Azure Machine Learning, Google Prediction API & Translation API, mlpy, scikit-learn, PyBrain, CompLearn, DAAL(Intel), Caffe, Torch, Theano, DL4j, H2O, IBM Watson, Oracle PGX, GraphLab, GraphX, IBM System G, GraphBuilder(Intel), TinkerPop, Parasol, Dream:Lab, Google Fusion Tables, CINET, NWB, Elasticsearch, Kibana, Logstash, Graylog, Splunk, Tableau, D3.js, three.js, Potree, DC.js, TensorFlow, CNTK 15B) Application Hosting Frameworks: Google App Engine, AppScale, Red Hat OpenShift, Heroku, Aerobatic, AWS Elastic Beanstalk, Azure, Cloud Foundry, Pivotal, IBM BlueMix, Ninefold, Jelastic, Stackato, appfog, CloudBees, Engine Yard, CloudControl, dotCloud, Dokku, OSGi, HUBzero, OODT, Agave, Atmosphere 15A) High level Programming: Kite, Hive, HCatalog, Tajo, Shark, Phoenix, Impala, MRQL, SAP HANA, HadoopDB, PolyBase, Pivotal HD/Hawq, Presto, Google Dremel, Google BigQuery, Amazon Redshift, Drill, Kyoto Cabinet, Pig, Sawzall, Google Cloud DataFlow, Summingbird 14B) Streams: Storm, S4, Samza, Granules, Neptune, Google MillWheel, Amazon Kinesis, LinkedIn, Twitter Heron, Databus, Facebook Puma/Ptail/Scribe/ODS, Azure Stream Analytics, Floe, Spark Streaming, Flink Streaming, DataTurbine 14A) Basic Programming model and runtime, SPMD, MapReduce: Hadoop, Spark, Twister, MR-MPI, Stratosphere (Apache Flink), Reef, Disco, Hama, Giraph, Pregel, Pegasus, Ligra, GraphChi, Galois, Medusa-GPU, MapGraph, Totem 13) Inter process communication Collectives, point-to-point, publish-subscribe: MPI, HPX-5, Argo BEAST HPX-5 BEAST PULSAR, Harp, Netty, ZeroMQ, ActiveMQ, RabbitMQ, NaradaBrokering, QPid, Kafka, Kestrel, JMS, AMQP, Stomp, MQTT, Marionette Collective, Public Cloud: Amazon SNS, Lambda, Google Pub Sub, Azure Queues, Event Hubs 12) In-memory databases/caches: Gora (general object from NoSQL), Memcached, Redis, LMDB (key value), Hazelcast, Ehcache, Infinispan, VoltDB, H-Store 12) Object-relational mapping: Hibernate, OpenJPA, EclipseLink, DataNucleus, ODBC/JDBC 12) Extraction Tools: UIMA, Tika 11C) SQL(NewSQL): Oracle, DB2, SQL Server, SQLite, MySQL, PostgreSQL, CUBRID, Galera Cluster, SciDB, Rasdaman, Apache Derby, Pivotal Greenplum, Google Cloud SQL, Azure SQL, Amazon RDS, Google F1, IBM dashDB, N1QL, BlinkDB, Spark SQL 11B) NoSQL: Lucene, Solr, Solandra, Voldemort, Riak, ZHT, Berkeley DB, Kyoto/Tokyo Cabinet, Tycoon, Tyrant, MongoDB, Espresso, CouchDB, Couchbase, IBM Cloudant, Pivotal Gemfire, HBase, Google Bigtable, LevelDB, Megastore and Spanner, Accumulo, Cassandra, RYA, Sqrrl, Neo4J, graphdb, Yarcdata, AllegroGraph, Blazegraph, Facebook Tao, Titan:db, Jena, Sesame Public Cloud: Azure Table, Amazon Dynamo, Google DataStore 11A) File management: iRODS, NetCDF, CDF, HDF, OPeNDAP, FITS, RCFile, ORC, Parquet 10) Data Transport: BitTorrent, HTTP, FTP, SSH, Globus Online (GridFTP), Flume, Sqoop, Pivotal GPLOAD/GPFDIST 9) Cluster Resource Management: Mesos, Yarn, Helix, Llama, Google Omega, Facebook Corona, Celery, HTCondor, SGE, OpenPBS, Moab, Slurm, Torque, Globus Tools, Pilot Jobs 8) File systems: HDFS, Swift, Haystack, f4, Cinder, Ceph, 
FUSE, Gluster, Lustre, GPFS, GFFS Public Cloud: Amazon S3, Azure Blob, Google Cloud Storage 7) Interoperability: Libvirt, Libcloud, JClouds, TOSCA, OCCI, CDMI, Whirr, Saga, Genesis 6) DevOps: Docker (Machine, Swarm), Puppet, Chef, Ansible, SaltStack, Boto, Cobbler, Xcat, Razor, CloudMesh, Juju, Foreman, OpenStack Heat, Sahara, Rocks, Cisco Intelligent Automation for Cloud, Ubuntu MaaS, Facebook Tupperware, AWS OpsWorks, OpenStack Ironic, Google Kubernetes, Buildstep, Gitreceive, OpenTOSCA, Winery, CloudML, Blueprints, Terraform, DevOpSlang, Any2Api 5) IaaS Management from HPC to hypervisors: Xen, KVM, QEMU, Hyper-V, VirtualBox, OpenVZ, LXC, Linux-Vserver, OpenStack, OpenNebula, Eucalyptus, Nimbus, CloudStack, CoreOS, rkt, VMware ESXi, vSphere and vCloud, Amazon, Azure, Google and other public Clouds Networking: Google Cloud DNS, Amazon Route 53

21 layers, over 350 software packages (January 29, 2016)

Page 49: High Performance Computing and Big Data

Functionality of 21 HPC-ABDS Layers
1) Message Protocols
2) Distributed Coordination
3) Security & Privacy
4) Monitoring
5) IaaS Management from HPC to hypervisors
6) DevOps
7) Interoperability
8) File systems
9) Cluster Resource Management
10) Data Transport
11) A) File management  B) NoSQL  C) SQL
12) In-memory databases & caches / Object-relational mapping / Extraction Tools
13) Inter-process communication: collectives, point-to-point, publish-subscribe, MPI
14) A) Basic Programming model and runtime, SPMD, MapReduce  B) Streaming
15) A) High level Programming  B) Frameworks
16) Application and Analytics
17) Workflow-Orchestration

Lesson of the large number (350): this is a rich software environment that HPC cannot "compete" with. We need to use it, not regenerate it. Note that level 13, inter-process communication, was added.

Page 50: High Performance Computing and Big Data

HPC-ABDS SPIDAL Project Activities
• Level 17: Orchestration: Apache Beam (Google Cloud Dataflow) integrated with Heron/Flink and Cloudmesh on an HPC cluster
• Level 16: Applications: Datamining for molecular dynamics, image processing for remote sensing and pathology, graphs, streaming, bioinformatics, social media, financial informatics, text mining
• Level 16: Algorithms: Generic and custom for applications – SPIDAL
• Level 14: Programming: Storm, Heron (Twitter's replacement for Storm), Hadoop, Spark, Flink. Improve inter- and intra-node performance; science data structures
• Level 13: Runtime Communication: Enhanced Storm and Hadoop (Spark, Flink, Giraph) using HPC runtime technologies, Harp
• Level 12: In-memory Database: Redis + Spark used in Pilot-Data Memory
• Level 11: Data management: HBase and MongoDB integrated via use of Beam and other Apache tools; enhance HBase
• Level 9: Cluster Management: Integrate Pilot Jobs with Yarn, Mesos, Spark, Hadoop; integrate Storm and Heron with Slurm
• Level 6: DevOps: Python Cloudmesh virtual cluster interoperability

Green is MIDAS; black is SPIDAL

Page 51: High Performance Computing and Big Data

Java Grande

Revisited on 3 data analytics codes:
– Clustering
– Multidimensional Scaling
– Latent Dirichlet Allocation
all sophisticated algorithms

Page 52: High Performance Computing and Big Data

Java MPI performs better than FJ Threads
128 24-core Haswell nodes on the SPIDAL 200K DA-MDS code

[Chart legend: best FJ Threads intra-node, MPI inter-node; best MPI, inter- and intra-node; MPI inter/intra-node, Java not optimized. Speedup compared to 1 process per node on 48 nodes.]

Page 53: High Performance Computing and Big Data

Investigating Process and Thread Models
• FJ (Fork-Join) threads give lower performance than Long Running Threads (LRT)
• Results
– Large effects for Java
– The best affinity is process and thread binding to cores (CE)
– At best, LRT mimics the performance of "all processes"
• 6 thread/process affinity models (a binding sketch follows)
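As a concrete (Linux-only) illustration of the binding idea, the Python standard library can pin a process, and hence the threads it spawns, to a chosen set of cores; the core numbering below is invented for the example.

```python
import os

def bind_to_cores(core_ids):
    """Pin the calling process (and the threads it spawns) to the given cores (Linux only)."""
    os.sched_setaffinity(0, set(core_ids))       # 0 means "this process"
    return os.sched_getaffinity(0)

# e.g. a CE-style binding of one process to the 12 cores of one chip
print(bind_to_cores(range(0, 12)))
```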

Page 54: High Performance Computing and Big Data

Java and C K-Means: LRT-FJ and LRT-BSP with different affinity patterns over varying threads and processes.

[Charts, Java and C: 10⁶ points and 1000 centers on 16 nodes; 10⁶ points and 50k and 500k centers, performance on 16 nodes]

Page 55: High Performance Computing and Big Data

Java versus C Performance
• C and Java are comparable, with Java doing better on larger problem sizes
• All data is from the one-million-point dataset with varying numbers of centers, on 16 nodes of 24-core Haswell

Page 56: High Performance Computing and Big Data

Performance Dependence on Number of Cores inside a node (16 nodes total)
• Long-Running Threads (LRT) Java
– All Processes
– All Threads internal to node
– Hybrid – use one process per chip
• Fork-Join Java
– All Threads
– Hybrid – use one process per chip
• Fork-Join C
– All Threads
• All MPI internode

Page 57: High Performance Computing and Big Data


HPC-ABDS DataFlow and In-place Runtime

Page 58: High Performance Computing and Big Data

HPC-ABDS Parallel Computing
• Both simulations and data analytics use similar parallel computing ideas
• Both decompose both model and data
• Both tend to use SPMD and often BSP (Bulk Synchronous Parallel)
• One has computing phases (called maps in big data terminology) and communication/reduction (more generally collective) phases
• Big data thinks of problems as multiple linked queries, even when the queries are small, and uses a dataflow model
• Simulation uses dataflow for multiple linked applications, but small steps such as iterations are done in place
• Reduction in HPC (MPI Reduce) is done as an optimized tree or pipelined communication between the same processes that did the computing
• Reduction in Hadoop or Flink is done as separate map and reduce processes using dataflow
– This leads to the 2 forms (In-Place and Flow) of Map-X mentioned earlier (see the sketch below)
• Interesting fault tolerance issues are highlighted by Hadoop-MPI comparisons – not discussed here!
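A minimal mpi4py sketch of the in-place style (assuming mpi4py and an MPI library are installed; run under mpiexec): the reduced model comes back to the same long-lived processes that computed it, with no separate reduce stage.

```python
# run with: mpiexec -n 4 python allreduce_sketch.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
local_model = np.full(4, comm.Get_rank(), dtype="d")   # toy model parameters on this rank

# MPI Allreduce: optimized tree/pipelined communication among the computing processes;
# in a dataflow runtime this would instead flow to separate reduce tasks
global_model = np.empty_like(local_model)
comm.Allreduce(local_model, global_model, op=MPI.SUM)

if comm.Get_rank() == 0:
    print(global_model)
```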


Page 59: High Performance Computing and Big Data

Illustration of In-Place AllReduce in MPI


Page 60: High Performance Computing and Big Data

Breaking Programs into Parts


Coarse Grain Dataflow HPC or ABDS

Fine Grain Parallel Computing Data/model parameter decomposition

Page 61: High Performance Computing and Big Data

K-means Clustering: Flink and MPI
One million 2D points fixed; various numbers of centers; 24 cores on 16 nodes

Page 62: High Performance Computing and Big Data

• MPI is designed for the fine-grain case and is typical of the parallel computing used in large-scale simulations
– Only changes in model parameters are transmitted
– In-place implementation
• Dataflow is typical of distributed or Grid computing paradigms
– Data sometimes, and model parameters certainly, are transmitted
– Caching in iterative MapReduce avoids data communication; in fact systems like TensorFlow, Spark or Flink are called dataflow but usually implement "model-parameter" flow
• We quantify this by an overhead analysis on the next slide that works for "in-place" runtimes. Flow implementations have additional sources of overhead that we know are large but haven't studied as quantitatively

HPC-ABDS Parallel Computing I

Page 63: High Performance Computing and Big Data

• Overheads are given by similar formulae for big data and simulations:

  Overhead f = (1 / Model parameter size in each map)ⁿ × (Typical hardware communication cost / Typical computing cost)

• The index n > 0 depends on the communication structure
– n = 0.5 for matrix problems; n = 1 for O(N²) problems
• Large f: intra-job reduction, such as K-means clustering, where one has center changes at the end of each iteration
• Small f: inter-job reduction, as at the end of a query, as seen in workflow
• Increasing the grain size (the model parameter size in each map) decreases the overhead, as n > 0 (see the sketch below)
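A tiny helper (illustrative numbers only) showing how the overhead formula behaves as the grain size grows:

```python
def parallel_overhead(grain_size, n, t_comm, t_calc):
    """f = (1 / grain size)^n * (communication cost / compute cost)."""
    return (1.0 / grain_size) ** n * (t_comm / t_calc)

# n = 0.5 (matrix-like problem): overhead falls as the model partition per map grows
for grain in (10_000, 100_000, 1_000_000):
    print(grain, parallel_overhead(grain, n=0.5, t_comm=1e-6, t_calc=1e-8))
```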


HPC-ABDS Parallel Computing II

Page 64: High Performance Computing and Big Data

• For a given application, one needs to understand:
– Are we using Data Flow or "Model-parameter" Flow?
– The required compute/communication ratio
• It is inefficient to use the same runtime mechanism independent of these characteristics
– Use In-Place or Flow software implementations as appropriate
• Classic dataflow is the approach of Spark and Flink, so we need to add parallel in-place computing, as done by Harp for Hadoop
– TensorFlow also uses in-place technology
• The HPC-ABDS plan is to keep current user interfaces (say to Spark, Flink, Hadoop, Storm, Heron) and transparently use HPC to improve performance, exploiting the added level 13 in HPC-ABDS
• We have done this for Hadoop (next slide), Spark, Storm, Heron
– Working on further HPC integration with ABDS

HPC-ABDS Parallel Computing III

Page 65: High Performance Computing and Big Data

Harp (Hadoop Plugin) brings HPC to ABDS
• Basic Harp: iterative HPC communication; scientific data abstractions
• Careful support of distributed data AND a distributed model
• Avoids the parameter server approach but distributes the model over worker nodes and supports collective communication to bring the global model to each node (see the sketch below)
• Applied first to Latent Dirichlet Allocation (LDA) with a large model and data
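The communication pattern (not Harp's actual Java API) can be sketched with an allgather: each worker owns a model partition and the collective assembles the full model on every worker, with no central parameter server.

```python
# run with: mpiexec -n 4 python model_allgather_sketch.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
# each worker holds one partition of the model (e.g. a slice of topic-word counts)
my_partition = np.random.rand(1000)

# collective communication brings the global model to every worker node
global_model = np.concatenate(comm.allgather(my_partition))
print(comm.Get_rank(), global_model.shape)
```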

[Diagram: MapReduce Model (maps, shuffle, reducers) versus MapCollective Model (maps with collective communication). Software stack: MapReduce applications over MapReduce V2 and MapCollective applications over Harp, both on YARN.]

Page 66: High Performance Computing and Big Data

[Charts: Latent Dirichlet Allocation on 100 Haswell nodes for the Clueweb, enwiki and Bi-gram Clueweb datasets; red curves are Harp (lgs and rtt).]

Page 67: High Performance Computing and Big Data


Harp LDA on Big Red II Supercomputer (Cray)

[Charts: execution time (hours) and parallel efficiency versus number of nodes for Harp LDA on Big Red II (25-125 nodes) and on Juliet (10-30 nodes).]

Harp LDA on Juliet (Intel Haswell)

• Big Red II: tested on 25, 50, 75, 100 and 125 nodes; each node uses 32 parallel threads; Gemini interconnect

• Juliet: tested on 10, 15, 20, 25, 30 nodes; each node uses 64 parallel threads on 36 core Intel Haswell nodes (each with 2 chips); Infiniband interconnect

Harp LDA Scaling Tests

Corpus: 3,775,554 Wikipedia documents, Vocabulary: 1 million words; Topics: 10k topics; alpha: 0.01; beta: 0.01; iteration: 200


Page 68: High Performance Computing and Big Data

Streaming Applications and Technology


Page 69: High Performance Computing and Big Data

Adding HPC to Storm & Heron for Streaming

Robotics applications:
– Simultaneous Localization and Mapping: a robot with a laser range finder builds a map from robot data
– N-body collision avoidance: robots need to avoid collisions when they move
– Time series data visualization in real time: map high-dimensional data to a 3D visualizer; applied to stock market data tracking 6000 stocks

Page 70: High Performance Computing and Big Data

Data Pipeline

A gateway sends data to pub-sub message brokers (RabbitMQ, Kafka), which feed streaming workflows (Apache Heron and Storm – a stream application with some tasks running in parallel; multiple streaming workflows) and persistent storage. Hosted on HPC and an OpenStack cloud. The end-to-end delay without any processing is less than 10 ms.

Storm does not support "real parallel processing" within bolts – we add optimized inter-bolt communication.

Page 71: High Performance Computing and Big Data


Improvement of Storm (Heron) using HPC communication algorithms

Page 72: High Performance Computing and Big Data

End of Software Discussion


Page 73: High Performance Computing and Big Data

Workflow in HPC-ABDS
• HPC is familiar with Taverna, Pegasus, Kepler, Galaxy etc., and
• ABDS has many workflow systems, with recent Apache systems being Crunch, NiFi and Beam (the open source version of Google Cloud Dataflow)
– Use ABDS for sustainability reasons?
– ABDS approaches are better integrated than HPC approaches with ABDS data management like HBase, and are optimized for distributed data.
• Heron, Spark and Flink provide the distributed dataflow runtime which is needed for workflow
• Beam uses Spark or Flink as its runtime and supports streaming and batch data
• Needs more study

Page 74: High Performance Computing and Big Data

Automatic Parallelization
• The database community looks at a big data job as a dataflow of (SQL) queries and filters
• Apache projects like Pig, MRQL and Flink aim at automatic query optimization by dynamic integration of queries and filters, including iteration and different data analytics functions
• Going back to ~1993, High Performance Fortran (HPF) compilers optimized sets of array and loop operations for large-scale parallel execution of optimized vector and matrix operations
• HPF worked fine for initial simple regular applications but ran into trouble for cases where parallelism is hard (irregular, dynamic)
• Will the same happen in the Big Data world?
• It is straightforward to parallelize k-means clustering, but sophisticated algorithms like Elkan's method (which uses the triangle inequality – see the sketch below) and fuzzy clustering are much harder (though not used much NOW)
• Will Big Data technology run into HPF-style trouble with the growing use of sophisticated data analytics?
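For concreteness, a sketch (NumPy, illustrative only) of the triangle-inequality pruning at the heart of Elkan's method: if d(c_a, c_k) ≥ 2 d(x, c_a) then c_k cannot be closer to x than the currently assigned center c_a, so d(x, c_k) is never computed.

```python
import numpy as np

def nearest_center_elkan(x, centers, assigned, center_dists):
    """Assignment step for one point with Elkan-style pruning.

    center_dists holds the pairwise distances between centers,
    recomputed once per iteration.
    """
    d_xa = np.linalg.norm(x - centers[assigned])
    best, best_k = d_xa, assigned
    for k in range(len(centers)):
        if k == assigned or center_dists[assigned, k] >= 2.0 * d_xa:
            continue                        # pruned: no distance computation needed
        d_xk = np.linalg.norm(x - centers[k])
        if d_xk < best:
            best, best_k = d_xk, k
    return best_k
```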


Page 75: High Performance Computing and Big Data

Infrastructure Nexus

IaaS
DevOps

Cloudmesh

Page 76: High Performance Computing and Big Data

Constructing HPC-ABDS Exemplars
• This is one of the next steps in the NIST Big Data Working Group
• Jobs are defined hierarchically as a combination of Ansible scripts (preferred over Chef or Puppet as it is Python)
• Scripts are invoked on Infrastructure (Cloudmesh tool)
• INFO 524 "Big Data Open Source Software Projects", an IU Data Science class, required the final project to be defined in Ansible, and a decent grade required that the script worked (on NSF Chameleon and FutureSystems)
– 80 students gave 37 projects, with ~15 pretty good, such as
– "Machine Learning benchmarks on Hadoop with HiBench": Hadoop/Yarn, Spark, Mahout, Hbase
– "Human and Face Detection from Video": Hadoop (Yarn), Spark, OpenCV, Mahout, MLlib
• Build up a curated collection of Ansible scripts defining use cases for benchmarking, standards, education: https://docs.google.com/document/d/1INwwU4aUAD_bj-XpNzi2rz3qY8rBMPFRVlx95k0-xc4
• The Fall 2015 class INFO 523, an introductory data science class, was less constrained; students just had to run a data science application, but the catalog is interesting
– 140 students: 45 projects (NOT required) with 91 technologies, 39 datasets

Page 77: High Performance Computing and Big Data

Cloudmesh Interoperability DevOps Tool
• Model: define the software configuration with tools like Ansible (Chef, Puppet); instantiate on a virtual cluster
• Save scripts, not virtual machines, and let the script build the applications
• Cloudmesh is an easy-to-use command line program/shell and portal to interface with heterogeneous infrastructures, taking the script as input
– It first defines a virtual cluster and then instantiates the script on it
– It has several common Ansible-defined software stacks built in
• Supports OpenStack, AWS, Azure, SDSC Comet, VirtualBox and libcloud-supported clouds, as well as classic HPC and Docker infrastructures
– Has an abstraction layer that makes it possible to integrate other IaaS frameworks
• Managing VMs across different IaaS providers is easier
• Demonstrated interaction with various cloud providers: FutureSystems, Chameleon Cloud, Jetstream, CloudLab, Cybera, AWS, Azure, VirtualBox
• Status: AWS, Azure, VirtualBox and Docker need improvements; we focus currently on SDSC Comet and NSF resources that use OpenStack

HPC Cloud Interoperability Layer

Page 78: High Performance Computing and Big Data

Cloudmesh Architecture
• We define a basic virtual cluster, which is a set of instances with a common security context
• We then add basic tools, including languages such as Python, Java etc.
• Then add management tools such as Yarn, Mesos, Storm, Slurm etc.
• Then add roles for different HPC-ABDS PaaS subsystems such as HBase, Spark
– There will be dependencies, e.g. the Storm role uses Zookeeper
• Any one project picks some of the HPC-ABDS PaaS Ansible roles and adds >= 1 SaaS that are specific to their project and, for example, read project data and perform project analytics
• E.g. there will be an OpenCV role used in image processing applications

Software Engineering Process

Page 79: High Performance Computing and Big Data

Summary of Big Data - Big Simulation Convergence?

HPC-Clouds convergence? (easier than converging higher levels in the stack)

Can HPC continue to do it alone?

Convergence Diamonds
HPC-ABDS software on differently optimized hardware infrastructure

Page 80: High Performance Computing and Big Data

General Aspects of Big Data HPC Convergence
• Applications, Benchmarks and Libraries
– 51 NIST Big Data Use Cases, 7 Computational Giants of the NRC Massive Data Analysis report, 13 Berkeley dwarfs, 7 NAS parallel benchmarks
– Unified discussion by separately discussing data & model for each application
– 64 facets – Convergence Diamonds – characterize applications
– The characterization identifies hardware and software features for each application across big data and simulation; a "complete" set of benchmarks (NIST)
• Software Architecture and its implementation
– HPC-ABDS: Cloud-HPC interoperable software: the performance of HPC (High Performance Computing) and the rich functionality of the Apache Big Data Stack.
– Added HPC to Hadoop, Storm, Heron, Spark; could add to Beam and Flink
– Could work in the Apache model, contributing code
• Run the same HPC-ABDS across all platforms, but "data management" nodes have a different balance in I/O, network and compute from "model" nodes
– Optimize to data and model functions as specified by the Convergence Diamonds
– Do not optimize separately for simulation and big data
• Convergence Language: make C++, Java, Scala, Python (R) … perform well
• Training: students prefer to learn Big Data rather than HPC
• Sustainability: research/HPC communities cannot afford to develop everything (hardware and software) from scratch

Page 81: High Performance Computing and Big Data

Typical Convergence Architecture
• Run the same HPC-ABDS software across all platforms, but the data management machine has a different balance in I/O, network and compute from the "model" machine
– Note the data storage approach: HDFS vs. Object Store vs. Lustre-style file systems is still rather unclear
• The Model behaves similarly whether it comes from Big Data or Big Simulation.

[Diagram: a "Data Management" cluster of nodes labelled C and D alongside a "Model" cluster of nodes labelled C only – Data Management and Model for Big Data and Big Simulation.]