
Page 1

NSF14-43054 started October 1, 2014

Datanet: CIF21 DIBBs: Middleware and High Performance Analytics Libraries for Scalable Data Science

• Indiana University (Fox, Qiu, Crandall, von Laszewski)
• Rutgers (Jha)
• Virginia Tech (Marathe)
• Kansas (Paden)
• Stony Brook (Wang)
• Arizona State (Beckstein)
• Utah (Cheatham)

Overview by Geoffrey Fox, May 16, 2016

http://spidal.org/
http://news.indiana.edu/releases/iu/2014/10/big-data-dibbs-grant.shtml
http://www.nsf.gov/awardsearch/showAward?AWD_ID=1443054

Page 2

Some Important Components of SPIDAL DIBBs
• NIST Big Data Application Analysis – features of data-intensive applications.
• HPC-ABDS: Cloud-HPC interoperable software combining the performance of HPC (High Performance Computing) with the rich functionality of the commodity Apache Big Data Stack.
  – This is a reservoir of software subsystems – nearly all from outside the project, and a mix of HPC and Big Data communities.
  – Leads to Big Data – Simulation – HPC Convergence.
• MIDAS: Integrating Middleware – from the project.
• Applications: Biomolecular Simulations, Network and Computational Social Science, Epidemiology, Computer Vision, Spatial Geographical Information Systems, Remote Sensing for Polar Science, and Pathology Informatics.
• SPIDAL (Scalable Parallel Interoperable Data Analytics Library): Scalable Analytics for:
  – Domain-specific data analytics libraries – mainly from the project.
  – Core machine learning libraries – mainly from the community.
  – Performance of Java and MIDAS inter- and intra-node.
• Benchmarks – the project adds to the community (WBDB2015 Benchmarking Workshop).
• Implementations: XSEDE and Blue Waters as well as clouds (OpenStack, Docker).

Page 3

Big Data – Big Simulation (Exascale) Convergence
• Discuss Data and Model together, as they are built around problems which combine them; but we can gain insight by separating them, which allows a better understanding of Big Data – Big Simulation "convergence".
• Big Data implies Data is large, but Model varies:
  – e.g. LDA with many topics or deep learning has a large model
  – Clustering or dimension reduction can be quite small
• Simulations can also be considered as Data and Model:
  – Model is solving particle dynamics or partial differential equations
  – Data could be small, when it is just boundary conditions
  – Data is large with data assimilation (weather forecasting) or when data visualizations are produced by the simulation
• Data is often static between iterations (unless streaming); Model varies between iterations.
• Take the 51 NIST and other use cases and derive multiple specific features.
• Generalize and systematize these features, termed "facets".
• 50 facets (Big Data) or 64 facets (Big Simulation and Data), divided into 4 sets or views where each view has "similar" facets.
  – Allows one to study coverage of benchmark sets and architectures.

Page 4

64 Features in 4 Views for Unified Classification of Big Data and Simulation Applications

[Figure: the 64 "Convergence Diamonds" facets arranged in four views.
– Problem Architecture View: Pleasingly Parallel, Classic MapReduce, Map-Collective, Map Point-to-Point, Map Streaming, Shared Memory, Single Program Multiple Data, Bulk Synchronous Parallel, Fusion, Dataflow, Agents, Workflow.
– Execution View: Performance Metrics; Flops per Byte / Memory IO / Flops per Watt; Execution Environment and Core Libraries; Data Volume; Model Size; Data Velocity; Data and Model Variety; Veracity; Communication Structure; Data and Model Dynamic=D / Static=S; Data and Model Regular=R / Irregular=I; Iterative / Simple; Data and Model Abstraction; Data and Model Metric=M / Non-Metric=N; O(N^2)=NN / O(N)=N.
– Data Source and Style View: SQL/NoSQL/NewSQL; Enterprise Data Model; Files/Objects; HDFS/Lustre/GPFS; Archived/Batched/Streaming (S1–S5); Shared/Dedicated/Transient/Permanent; Metadata/Provenance; Internet of Things; HPC Simulations; Geospatial Information System.
– Processing View, split into Big Data Processing Diamonds (Micro-benchmarks; Local and Global Analytics/Informatics/Simulations; Recommender Engine; Base Data Statistics; Data Search/Query/Index; Data Classification; Learning; Optimization Methodology; Streaming Data Algorithms; Data Alignment; Linear Algebra Kernels with many subclasses; Graph Algorithms; Visualization; Core Libraries) and Simulation (Exascale) Processing Diamonds (Iterative PDE Solvers; Multiscale Method; Spectral Methods; N-body Methods; Particles and Fields; Evolution of Discrete Systems; Nature of Mesh if Used).
Facets are tagged as Simulations, Analytics (Model for Data), or Both – i.e. nearly all Data, nearly all Model, or a mix of Data and Model.]

Page 5

6 Forms of MapReduce Cover "All" Circumstances

Describes:
– Problem (Model reflecting data)
– Machine
– Software Architecture

Page 6

HPC-ABDS Kaleidoscope of (Apache) Big Data Stack (ABDS) and HPC Technologies

Cross-Cutting Functions
1) Message and Data Protocols: Avro, Thrift, Protobuf
2) Distributed Coordination: Google Chubby, Zookeeper, Giraffe, JGroups
3) Security & Privacy: InCommon, Eduroam, OpenStack Keystone, LDAP, Sentry, Sqrrl, OpenID, SAML, OAuth
4) Monitoring: Ambari, Ganglia, Nagios, Inca

17) Workflow-Orchestration: ODE, ActiveBPEL, Airavata, Pegasus, Kepler, Swift, Taverna, Triana, Trident, BioKepler, Galaxy, IPython, Dryad, Naiad, Oozie, Tez, Google FlumeJava, Crunch, Cascading, Scalding, e-Science Central, Azure Data Factory, Google Cloud Dataflow, NiFi (NSA), Jitterbit, Talend, Pentaho, Apatar, Docker Compose, KeystoneML
16) Application and Analytics: Mahout, MLlib, MLbase, DataFu, R, pbdR, Bioconductor, ImageJ, OpenCV, Scalapack, PetSc, PLASMA, MAGMA, Azure Machine Learning, Google Prediction API & Translation API, mlpy, scikit-learn, PyBrain, CompLearn, DAAL (Intel), Caffe, Torch, Theano, DL4j, H2O, IBM Watson, Oracle PGX, GraphLab, GraphX, IBM System G, GraphBuilder (Intel), TinkerPop, Parasol, Dream:Lab, Google Fusion Tables, CINET, NWB, Elasticsearch, Kibana, Logstash, Graylog, Splunk, Tableau, D3.js, three.js, Potree, DC.js, TensorFlow, CNTK
15B) Application Hosting Frameworks: Google App Engine, AppScale, Red Hat OpenShift, Heroku, Aerobatic, AWS Elastic Beanstalk, Azure, Cloud Foundry, Pivotal, IBM BlueMix, Ninefold, Jelastic, Stackato, appfog, CloudBees, Engine Yard, CloudControl, dotCloud, Dokku, OSGi, HUBzero, OODT, Agave, Atmosphere
15A) High-level Programming: Kite, Hive, HCatalog, Tajo, Shark, Phoenix, Impala, MRQL, SAP HANA, HadoopDB, PolyBase, Pivotal HD/Hawq, Presto, Google Dremel, Google BigQuery, Amazon Redshift, Drill, Kyoto Cabinet, Pig, Sawzall, Google Cloud DataFlow, Summingbird
14B) Streams: Storm, S4, Samza, Granules, Neptune, Google MillWheel, Amazon Kinesis, LinkedIn Databus, Twitter Heron, Facebook Puma/Ptail/Scribe/ODS, Azure Stream Analytics, Floe, Spark Streaming, Flink Streaming, DataTurbine
14A) Basic Programming model and runtime, SPMD, MapReduce: Hadoop, Spark, Twister, MR-MPI, Stratosphere (Apache Flink), Reef, Disco, Hama, Giraph, Pregel, Pegasus, Ligra, GraphChi, Galois, Medusa-GPU, MapGraph, Totem
13) Inter-process communication (collectives, point-to-point, publish-subscribe): MPI, HPX-5, Argo BEAST HPX-5 BEAST PULSAR, Harp, Netty, ZeroMQ, ActiveMQ, RabbitMQ, NaradaBrokering, QPid, Kafka, Kestrel, JMS, AMQP, Stomp, MQTT, Marionette Collective; Public Cloud: Amazon SNS, Lambda, Google Pub Sub, Azure Queues, Event Hubs
12) In-memory databases/caches: Gora (general object from NoSQL), Memcached, Redis, LMDB (key value), Hazelcast, Ehcache, Infinispan, VoltDB, H-Store
12) Object-relational mapping: Hibernate, OpenJPA, EclipseLink, DataNucleus, ODBC/JDBC
12) Extraction Tools: UIMA, Tika
11C) SQL (NewSQL): Oracle, DB2, SQL Server, SQLite, MySQL, PostgreSQL, CUBRID, Galera Cluster, SciDB, Rasdaman, Apache Derby, Pivotal Greenplum, Google Cloud SQL, Azure SQL, Amazon RDS, Google F1, IBM dashDB, N1QL, BlinkDB, Spark SQL
11B) NoSQL: Lucene, Solr, Solandra, Voldemort, Riak, ZHT, Berkeley DB, Kyoto/Tokyo Cabinet, Tycoon, Tyrant, MongoDB, Espresso, CouchDB, Couchbase, IBM Cloudant, Pivotal Gemfire, HBase, Google Bigtable, LevelDB, Megastore and Spanner, Accumulo, Cassandra, RYA, Sqrrl, Neo4J, graphdb, Yarcdata, AllegroGraph, Blazegraph, Facebook Tao, Titan:db, Jena, Sesame; Public Cloud: Azure Table, Amazon Dynamo, Google DataStore
11A) File management: iRODS, NetCDF, CDF, HDF, OPeNDAP, FITS, RCFile, ORC, Parquet
10) Data Transport: BitTorrent, HTTP, FTP, SSH, Globus Online (GridFTP), Flume, Sqoop, Pivotal GPLOAD/GPFDIST
9) Cluster Resource Management: Mesos, Yarn, Helix, Llama, Google Omega, Facebook Corona, Celery, HTCondor, SGE, OpenPBS, Moab, Slurm, Torque, Globus Tools, Pilot Jobs
8) File systems: HDFS, Swift, Haystack, f4, Cinder, Ceph, FUSE, Gluster, Lustre, GPFS, GFFS; Public Cloud: Amazon S3, Azure Blob, Google Cloud Storage
7) Interoperability: Libvirt, Libcloud, JClouds, TOSCA, OCCI, CDMI, Whirr, Saga, Genesis
6) DevOps: Docker (Machine, Swarm), Puppet, Chef, Ansible, SaltStack, Boto, Cobbler, Xcat, Razor, CloudMesh, Juju, Foreman, OpenStack Heat, Sahara, Rocks, Cisco Intelligent Automation for Cloud, Ubuntu MaaS, Facebook Tupperware, AWS OpsWorks, OpenStack Ironic, Google Kubernetes, Buildstep, Gitreceive, OpenTOSCA, Winery, CloudML, Blueprints, Terraform, DevOpSlang, Any2Api
5) IaaS Management from HPC to hypervisors: Xen, KVM, QEMU, Hyper-V, VirtualBox, OpenVZ, LXC, Linux-Vserver, OpenStack, OpenNebula, Eucalyptus, Nimbus, CloudStack, CoreOS, rkt, VMware ESXi, vSphere and vCloud, Amazon, Azure, Google and other public Clouds
Networking: Google Cloud DNS, Amazon Route 53

21 layers, over 350 software packages (January 29, 2016)

Page 7

HPC-ABDS Mapping of Activities (green on the original slide marks MIDAS)
• Level 17: Orchestration: Apache Beam (Google Cloud Dataflow) integrated with Cloudmesh on an HPC cluster
• Level 16: Applications: Data mining for molecular dynamics; image processing for remote sensing and pathology; graphs, streaming, bioinformatics, social media, financial informatics, text mining
• Level 16: Algorithms: Generic and custom algorithms for applications – SPIDAL
• Level 14: Programming: Storm, Heron (Twitter's replacement for Storm), Hadoop, Spark, Flink; improve inter- and intra-node performance
• Level 13: Communication: Enhanced Storm and Hadoop using HPC runtime technologies, Harp
• Level 11: Data management: HBase and MongoDB integrated via Beam and other Apache tools; enhance HBase
• Level 9: Cluster Management: Integrate Pilot Jobs with Yarn, Mesos, Spark, Hadoop; integrate Storm and Heron with Slurm
• Level 6: DevOps: Python Cloudmesh virtual cluster interoperability

Page 8

[Image-only slide; no text extracted.]

Page 9

MIDAS: Software Activities in DIBBs
• Developing the HPC-ABDS concept to integrate HPC and Apache technologies
• Java: relook at Java Grande to make performance the "best simply possible"
• DevOps: Cloudmesh provides interoperability between HPC and cloud (OpenStack, AWS, Docker) platforms, based on virtual clusters with software-defined systems using Ansible (Chef)
• Scheduling: Integrate Slurm and Pilot Jobs with Yarn & Mesos (ABDS schedulers) and the programming layer (Hadoop, Spark, Flink, Heron/Storm)
• Communication and scientific data abstractions: the Harp plug-in to Hadoop outperforms ABDS programming layers
• Data Management: use HBase and MongoDB with customization
• Workflow: use Apache Crunch and Beam (Google Cloud Dataflow) as they link to other ABDS technologies
• Starting to integrate MIDAS components and move them into the algorithms of the SPIDAL Library

Page 10

Java MPI Performs Better than Threads: 128 24-core Haswell nodes on the SPIDAL DA-MDS code

[Performance chart comparing three configurations: best threads intra-node with MPI inter-node; best MPI, both inter- and intra-node; SM = optimized shared memory for intra-node MPI.]

HPC into Java Runtime and Programming Model

Page 11

Cloudmesh Interoperability DevOps Tool
• Model: Define software configuration with tools like Ansible; instantiate on a virtual cluster
• An easy-to-use command-line program/shell and portal to interface with heterogeneous infrastructures
  – Supports OpenStack, AWS, Azure, SDSC Comet, VirtualBox, and libcloud-supported clouds, as well as classic HPC and Docker infrastructures
  – Has an abstraction layer that makes it possible to integrate other IaaS frameworks
  – Uses defaults that help in interacting with various clouds
  – Managing VMs across different IaaS providers is easy
  – The client saves state between consecutive calls
• Demonstrated interaction with various cloud providers: FutureSystems, Chameleon Cloud, Jetstream, CloudLab, Cybera, AWS, Azure, VirtualBox
• Status: AWS, Azure, VirtualBox, and Docker need improvements; we currently focus on Comet and NSF resources that use OpenStack
• Currently evaluating 40 team projects from the "Big Data Open Source Software Projects" class, which used this approach running on VirtualBox, Chameleon, and FutureSystems

Page 12

(This slide repeats the Cloudmesh Interoperability DevOps Tool content from Page 11, annotated "HPC Cloud Interoperability Layer".)

Page 13

Cloudmesh Client – Architecture

[Architecture figure with three panels: Component View, Layered View, and Systems Configuration.]

Page 14

Cloudmesh Client – OSG Management
• While using Comet, it is possible to use the same images that are used on the internal OSG cluster.
• This reduces overall management effort.
• The Cloudmesh client is used to manage the VMs.

OSG: Open Science Grid. LIGO data analysis was conducted on Comet, supported by the Cloudmesh client. Funded by NSF Comet.

Page 15

Cloudmesh Client – In Support of Experiment Workflow

[Workflow figure: Create IaaS → Deploy PaaS → Deploy Data → Execute Scripts → Evaluate → Repeat. Annotations: choose general infrastructure, from HPC to clouds; manage VMs and virtual clusters; Cloudmesh scripts integrate with the shell and with Python (scripts, variables, shared state); integrate with Ansible; built-in copy and rsync commands; integration into IPython and Apache Beam.]

Page 16

Pilot-Hadoop/Spark Architecture

http://arxiv.org/abs/1602.00345

HPC into Scheduling Layer

Page 17

Pilot-Hadoop Example

Page 18

Pilot-Data/Memory for Iterative Processing
• Provide a common API for distributed cluster memory

[Figure: Scalable K-Means example.]

Page 19

Harp Implementations
• Basic Harp: iterative communication and scientific data abstractions
• Careful support of distributed data AND a distributed model
• Avoids the parameter-server approach; instead the model is distributed over the worker nodes, and collective communication brings the global model to each node (see the sketch below)
• Applied first to Latent Dirichlet Allocation (LDA) with a large model and data

HPC into Programming/Communication Layer
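
The collective model synchronization Harp provides is conceptually an allreduce over the distributed model. A minimal sketch of that pattern using mpi4py rather than Harp's Java API (the array size and the averaging step are illustrative assumptions):

```python
# Sketch of the collective "bring the global model to each node" pattern,
# illustrated with an MPI allreduce (mpi4py), not Harp's actual Java API.
# Run with e.g.: mpiexec -n 4 python model_sync_sketch.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Each worker holds a local partial model update (size is hypothetical).
local_update = np.random.rand(1000)

# Sum the partial updates across all workers ...
global_update = np.empty_like(local_update)
comm.Allreduce(local_update, global_update, op=MPI.SUM)

# ... and average so every worker ends up with the same global model view.
global_update /= size
print(f"rank {rank}: global model norm = {np.linalg.norm(global_update):.3f}")
```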

Page 20

Latent Dirichlet Allocation on 100 Haswell nodes: red is Harp (lgs and rtt)

[Performance charts for the Clueweb, enwiki, and bi-gram datasets.]

Page 21

Harp LDA Scaling Tests

[Plots: Harp LDA on the Big Red II supercomputer (Cray) and Harp LDA on Juliet (Intel Haswell); each plot shows execution time (hours) and parallel efficiency versus the number of nodes.]

• Big Red II: tested on 25, 50, 75, 100, and 125 nodes; each node uses 32 parallel threads; Gemini interconnect
• Juliet: tested on 10, 15, 20, 25, and 30 nodes; each node uses 64 parallel threads on 36-core Intel Haswell nodes (each with 2 chips); InfiniBand interconnect

Corpus: 3,775,554 Wikipedia documents; vocabulary: 1 million words; topics: 10k; alpha: 0.01; beta: 0.01; iterations: 200
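
The parallel efficiency plotted in these scaling tests is the usual strong-scaling ratio. Assuming the smallest tested node count is the baseline (an assumption about how the curves were normalized):

E(N) = \frac{N_0 \, T(N_0)}{N \, T(N)}

where T(N) is the LDA execution time on N nodes and N_0 is the baseline node count (25 for Big Red II, 10 for Juliet).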

Page 22

SPIDAL Algorithms – Subgraph Mining
• Finding patterns in graphs is very important:
  – Counting the number of embeddings of a given labeled/unlabeled template subgraph
  – Finding the most frequent subgraphs/motifs efficiently from a given set of candidate templates
  – Computing the graphlet frequency distribution (see the sketch after the datasets table below)
• Reworking the existing parallel VT algorithm Sahad with MIDAS middleware gives HarpSahad, which runs 5 (Web-google) to 9 (Miami) times faster than the original Hadoop version
• Work in progress

Datasets:

Network      No. of Nodes (million)   No. of Edges (million)   Size (MB)
Web-google   0.9                      4.3                      65
Miami        2.1                      51.2                     740

[The slide also shows the set of subgraph templates used.]
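
A minimal single-node illustration of template-subgraph (graphlet) counting using networkx's VF2 matcher on a small stand-in graph; the production HarpSahad code uses a parallel color-coding algorithm, and the graph, template, and choice of counting induced embeddings here are illustrative:

```python
# Brute-force graphlet counting sketch (small graphs only); SPIDAL's HarpSahad
# uses parallel color coding on graphs like Web-google and Miami instead.
import networkx as nx
from networkx.algorithms import isomorphism

G = nx.karate_club_graph()        # small stand-in for the Web-google / Miami networks
template = nx.path_graph(3)       # 3-node path graphlet ("open triad")

gm = isomorphism.GraphMatcher(G, template)
# subgraph_isomorphisms_iter enumerates induced embeddings; each unordered
# embedding of a path appears twice (its automorphism count), hence // 2.
count = sum(1 for _ in gm.subgraph_isomorphisms_iter()) // 2
print("induced 3-path (open triad) embeddings:", count)
```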

Page 23

SPIDAL Algorithms – Random Graph Generation
• Random graphs are important and are needed with particular degree distributions and clustering coefficients:
  – Preferential attachment (PA) model, Chung-Lu (CL), stochastic Kronecker, stochastic block model (SBM), and block two-level Erdos-Renyi (BTER)
  – Generative algorithms for these models are mostly sequential and take a prohibitively long time to generate large-scale graphs
• SPIDAL is developing efficient parallel algorithms for generating random graphs under the different models, with a new DG method with low memory use, high performance, almost optimal load balancing, and excellent scaling (a single-node Chung-Lu sketch follows below)
  – The algorithms are about 3-4 times faster than previous ones
  – They can generate a network with 250 billion edges in 12 seconds using 1,024 processors
• Needs to be packaged for SPIDAL using MIDAS (currently MPI)
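
Of the models listed above, Chung-Lu is the simplest to sketch on one node: networkx's expected_degree_graph is a sequential Chung-Lu generator (the heavy-tailed degree sequence below is a made-up example, not one of the project's datasets):

```python
# Sequential Chung-Lu baseline; the SPIDAL work parallelizes this class of
# generators (PA, CL, SBM, BTER) with MPI to reach hundreds of billions of edges.
import numpy as np
import networkx as nx

rng = np.random.default_rng(0)
# Hypothetical heavy-tailed expected-degree sequence for 10,000 vertices.
w = rng.zipf(2.5, size=10_000).clip(max=500)

G = nx.expected_degree_graph(w.tolist(), selfloops=False, seed=0)
print(G.number_of_nodes(), "nodes,", G.number_of_edges(), "edges")
```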

Page 24

SPIDAL Algorithms – Triangle Counting
• Triangle counting is an important special case of subgraph mining, and specialized programs can outperform general programs (see the sketch below)
• Previous work used Hadoop, but the MPI-based PATRIC is much faster
• The SPIDAL version uses a much more efficient (non-overlapping) graph decomposition – a factor of 25 lower memory than PATRIC
• Next graph problem – community detection

MPI version complete; needs to be packaged for SPIDAL and MIDAS (Harp) added.
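
A single-node triangle-counting sketch for comparison with the specialized parallel codes; the input graph and its size are stand-ins:

```python
# Triangle counting via networkx's per-node counts; the SPIDAL/PATRIC-style codes
# instead use MPI with a non-overlapping graph decomposition.
import networkx as nx

G = nx.erdos_renyi_graph(1000, 0.01, seed=42)     # stand-in input graph
# nx.triangles counts, for each node, the triangles through it; every triangle
# is seen at its three corners, hence the division by 3.
total = sum(nx.triangles(G).values()) // 3
print("triangles:", total)
```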

Page 25

SPIDAL Algorithms – Core I
• Several parallel core machine learning algorithms; need to add SPIDAL Java optimizations to complete the parallel codes, except MPI MDS
  – https://www.gitbook.com/book/esaliya/global-machine-learning-with-dsc-spidal/details
• O(N²) distance-matrix calculation with Hadoop parallelism and various options (storage in MongoDB vs. distributed files), normalization, packing to save memory, and exploiting symmetry (see the single-node sketch below)
• WDA-SMACOF: Multidimensional scaling (MDS) is optimal nonlinear dimension reduction, enhanced by SMACOF, deterministic annealing, and conjugate gradient for non-uniform weights; used in many applications. MPI (shared memory) and MIDAS (Harp) versions
• MDS Alignment to optimally align related point sets, as in MDS time series
• WebPlotViz data management (MongoDB) and browser visualization for 3D point sets, including time series; available as source or SaaS
• MDS as χ² using Manxcat: an alternative, more general but less reliable solution of MDS; the latest version of WDA-SMACOF is usually preferable
• Other dimension reduction: SVD, PCA, GTM to do
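
A single-node sketch of the O(N²) distance-matrix plus MDS step using scikit-learn's SMACOF-based MDS; the random feature matrix is a stand-in, and the real WDA-SMACOF adds deterministic annealing, non-uniform weights, and MPI/Harp parallelism on top of this idea:

```python
# Distance matrix + SMACOF MDS on one node; a conceptual stand-in for the
# parallel SPIDAL WDA-SMACOF pipeline, not the SPIDAL implementation itself.
import numpy as np
from sklearn.metrics import pairwise_distances
from sklearn.manifold import MDS

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 20))                   # hypothetical 500 x 20 feature matrix

D = pairwise_distances(X, metric="euclidean")    # full O(N^2) distance matrix
mds = MDS(n_components=3, dissimilarity="precomputed", random_state=1)
Y = mds.fit_transform(D)                         # 3D embedding, e.g. for WebPlotViz
print(Y.shape)
```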

Page 26

SPIDAL Algorithms – Core II
• Latent Dirichlet Allocation (LDA) for topic finding in text collections; new algorithm with the MIDAS runtime outperforming current best practice
• DA-PWC: Deterministic Annealing Pairwise Clustering for the case where points aren't in a vector space; used extensively to cluster DNA and proteomic sequences; improved algorithm over others published. Parallelism is good but needs SPIDAL Java
• DAVS: Deterministic Annealing Clustering for vectors; includes specification of errors and a limit on cluster sizes. Gives very accurate answers for cases where a distinct clustering exists. Being upgraded for new LC-MS proteomics data with one million clusters in a 27-million-point data set
• K-means: basic vector clustering; fast and adequate where clusters aren't needed accurately
• Elkan's improved K-means vector clustering: for high-dimensional spaces; uses the triangle inequality to avoid expensive distance calculations (see the sketch below)
• Future work – Classification: logistic regression, random forest, SVM, (deep learning); collaborative filtering, TF-IDF search, and Spark MLlib algorithms
• Harp-DaaL extends Intel DAAL's local batch mode to multi-node distributed modes, leveraging Harp's communication benefits for iterative compute models
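
Elkan's triangle-inequality trick is available in scikit-learn, which makes for a compact illustration of the bullet above; the data shape and cluster count are arbitrary:

```python
# Requesting Elkan's K-means variant, which uses the triangle inequality to skip
# many point-to-center distance computations on dense data.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
X = rng.normal(size=(20_000, 50))                # hypothetical dense vectors

km = KMeans(n_clusters=100, algorithm="elkan", n_init=1, random_state=0).fit(X)
print("inertia:", km.inertia_)
```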

Page 27

SPIDAL Algorithms – Optimization I
• Manxcat: Levenberg-Marquardt algorithm for non-linear χ² optimization, with a sophisticated version of Newton's method calculating the value and derivatives of the objective function. Parallelism is in the calculation of the objective function and in the parameters to be determined. Complete – needs SPIDAL Java optimization
• Viterbi algorithm, for finding the maximum a posteriori (MAP) solution for a Hidden Markov Model (HMM). The running time is O(n·s²), where n is the number of variables and s is the number of possible states each variable can take (see the sketch below). We will provide an "embarrassingly parallel" version that processes multiple problems (e.g. many images) independently; parallelizing within the same problem is not needed in our application space. Needs packaging in SPIDAL
• Forward-backward algorithm, for computing marginal distributions over HMM variables. Similar characteristics to Viterbi above. Needs packaging in SPIDAL
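
A minimal log-space Viterbi sketch matching the O(n·s²) cost described above; the SPIDAL plan is simply to run many such independent problems in parallel. The 2-state example probabilities are made up:

```python
import numpy as np

def viterbi(log_pi, log_A, log_B, obs):
    """log_pi: (s,) initial log-probs; log_A: (s,s) transition log-probs;
    log_B: (s,k) emission log-probs; obs: (n,) observation indices."""
    n, s = len(obs), len(log_pi)
    delta = np.empty((n, s))                 # best log-prob of a path ending in each state
    psi = np.zeros((n, s), dtype=int)        # back-pointers
    delta[0] = log_pi + log_B[:, obs[0]]
    for t in range(1, n):                    # O(n * s^2)
        scores = delta[t - 1][:, None] + log_A   # rows: previous state, cols: next state
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_B[:, obs[t]]
    path = np.empty(n, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(n - 2, -1, -1):           # backtrack the MAP state sequence
        path[t] = psi[t + 1][path[t + 1]]
    return path

# Tiny 2-state example with made-up probabilities.
pi = np.log([0.6, 0.4])
A = np.log([[0.7, 0.3], [0.4, 0.6]])
B = np.log([[0.9, 0.1], [0.2, 0.8]])
print(viterbi(pi, A, B, np.array([0, 0, 1, 1, 0])))
```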

Page 28

SPIDAL Algorithms – Optimization II
• Loopy belief propagation (LBP) for approximately finding the maximum a posteriori (MAP) solution for a Markov Random Field (MRF). Here the running time is O(n²·s²·i) in the worst case, where n is the number of variables, s is the number of states per variable, and i is the number of iterations required (usually a function of n, e.g. log(n) or sqrt(n)). There are various parallelization strategies depending on the values of s and n for any given problem.
  – We will provide two parallel versions: an embarrassingly parallel version for when s and n are relatively modest, and a version parallelizing each iteration of the same problem for the common situation when s and n are quite large, so that each iteration takes a long time relative to the number of iterations required.
  – Needs packaging in SPIDAL
• Markov Chain Monte Carlo (MCMC) for approximately computing marginal distributions and sampling over MRF variables (see the sketch below). Similar to LBP, with the same two parallelization strategies. Needs packaging in SPIDAL
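
A small Gibbs-sampling (MCMC) sketch for approximate marginals on an Ising-like MRF, the single-problem case; the grid size, coupling, and iteration counts are illustrative, and the SPIDAL versions would parallelize either across problems or within each sweep:

```python
# Gibbs sampling over a tiny Ising-like MRF to estimate P(x_ij = +1).
import numpy as np

rng = np.random.default_rng(3)
H, W, coupling, sweeps, burn_in = 32, 32, 0.6, 400, 100
x = rng.choice([-1, 1], size=(H, W))
marginal = np.zeros((H, W))

for sweep in range(sweeps):
    for i in range(H):
        for j in range(W):
            # Sum over the 4-connected neighbors (free boundary conditions).
            nb = sum(x[a, b] for a, b in ((i-1, j), (i+1, j), (i, j-1), (i, j+1))
                     if 0 <= a < H and 0 <= b < W)
            p_plus = 1.0 / (1.0 + np.exp(-2.0 * coupling * nb))  # P(x_ij=+1 | neighbors)
            x[i, j] = 1 if rng.random() < p_plus else -1
    if sweep >= burn_in:
        marginal += (x == 1)

marginal /= sweeps - burn_in         # approximate marginal P(x_ij = +1)
print(marginal.mean())
```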

Page 29

Imaging Applications: Remote Sensing, Pathology, Spatial Systems
• Both pathology and remote sensing are working on 3D images
• Each pathology image can have 10 billion pixels, and we may extract a million spatial objects and 100 million features (dozens to 100 features per object) per image. We often tile the image into 4K x 4K tiles for processing, and develop buffering-based tiling to handle boundary-crossing objects (see the tiling sketch below). For each typical research study, we may have hundreds to thousands of pathology images
• Remote sensing is aimed at radar images of ice and snow sheets
• 2D problems need modest parallelism "intra-image" but often need parallelism over images
• 3D problems need parallelism for an individual image
• Use optimization algorithms to support applications (e.g. Markov Chain, integer programming, Bayesian maximum a posteriori, variational level set, Euler-Lagrange equation)
• Classification (deep learning convolutional neural networks, SVM, random forest, etc.) will be important
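
The 4K x 4K buffered tiling mentioned above is easy to sketch; the array is a small stand-in for a multi-gigapixel slide, and the halo width is an assumption:

```python
# Tile an image into 4096 x 4096 tiles with an overlap buffer so objects that
# cross tile boundaries are fully visible to at least one tile.
import numpy as np

image = np.zeros((12_000, 12_000), dtype=np.uint8)   # stand-in; real slides reach ~10 gigapixels
TILE, BUFFER = 4096, 128                             # 4K tiles; 128-pixel halo is an assumption

def tiles(img, tile=TILE, buffer=BUFFER):
    h, w = img.shape[:2]
    for y in range(0, h, tile):
        for x in range(0, w, tile):
            y0, x0 = max(0, y - buffer), max(0, x - buffer)
            y1, x1 = min(h, y + tile + buffer), min(w, x + tile + buffer)
            yield (y, x), img[y0:y1, x0:x1]          # tile origin + buffered view

origin, first_tile = next(tiles(image))
print(origin, first_tile.shape)
```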

Page 30

2D Radar Polar Remote Sensing
• Need to estimate the structure of the earth (ice, snow, rock) from radar signals from a plane, in 2 or 3 dimensions.
• The original 2D analysis ([11]) used Hidden Markov Methods; better results are obtained using MCMC (our solution).
• Extending to snow radar layers.

Page 31

3D Radar Polar Remote Sensing
• Uses LBP to analyze 3D radar images

[Figure captions:]
Reconstructing bedrock in 3D, for (left) ground truth, (center) an existing algorithm based on maximum likelihood estimators, and (right) our technique based on a Markov Random Field formulation.
Radar gives a cross-section view, parameterized by angle and range, of the ice structure, which yields a set of 2D tomographic slices (right) along the flight path.
Each image represents a 3D depth map, with along-track and cross-track dimensions on the x-axis and y-axis respectively, and depth coded as colors.

Page 32

Algorithms – Nuclei Segmentation for Pathology Images
• Segment boundaries of nuclei from pathology images and extract features for each nucleus
• Consists of tiling, segmentation, vectorization, and boundary-object aggregation
• Could be executed on MapReduce (MIDAS Harp)

[Figures: the nuclear segmentation algorithm – Step 1: preliminary region partition; Step 2: background normalization; Step 3: nuclear boundary refinement; Step 4: nuclei separation – and the execution pipeline on MapReduce (MIDAS Harp): whole-slide images on a distributed file system are tiled; each tile passes through segmentation (mask images), boundary vectorization (raw polygons), boundary normalization (normalized polygons), and isolated buffer-polygon removal (non-boundary polygons); boundary polygons are then merged in a whole-slide polygon aggregation step to give the final segmented polygons.]

Page 33

Algorithms – Spatial Querying Methods

[Figures: example spatial queries and the architecture of the spatial query engine.]

• Hadoop-GIS is a general framework to support high-performance spatial queries and analytics for spatial big data on MapReduce.
• It supports multiple types of spatial queries on MapReduce through spatial partitioning, a customizable spatial query engine, and on-demand indexing (see the sketch below).
• SparkGIS is a variation of Hadoop-GIS which runs on Spark to take advantage of in-memory processing.
• Will extend Hadoop/Spark to the Harp MIDAS runtime.
• 2D complete; 3D in progress.
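
A single-node sketch of the indexed spatial-query pattern these engines apply per partition, using shapely's STRtree; the point objects and query window are synthetic:

```python
# Indexed window query with an STR packed R-tree; Hadoop-GIS/SparkGIS partition
# space and run this kind of query per partition on MapReduce or Spark.
import numpy as np
from shapely.geometry import Point, box
from shapely.strtree import STRtree

rng = np.random.default_rng(4)
objects = [Point(x, y) for x, y in rng.uniform(0, 1000, size=(5000, 2))]  # hypothetical spatial objects
tree = STRtree(objects)

window = box(100, 100, 200, 200)       # a containment/window query
hits = tree.query(window)              # candidates whose bounding boxes intersect the window
print(len(hits))
```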

Page 34

Some Applications Enabled
• KU/IU: Remote sensing in polar regions
• SB: Digital pathology imaging
• SB: Large-scale GIS applications, including public health
• VT: Graph analysis in studies of networks in many areas
• UT, ASU, Rutgers: Analysis of biomolecular simulations

Applications that are not part of the DIBBs project but use its algorithms/software:
• IU: Bioinformatics and financial modeling with MDS
• Integration with network science infrastructure
  – VT/IU: CINET: SPIDAL algorithms will be made available
  – IU: OSoMe observatory on social media (currently Twitter), https://peerj.com/preprints/2008/, using enhanced HBase
  – IU: Topic analysis of text data
• IU/Rutgers: Streaming with HPC-enhanced Storm/Heron

Page 35

Enabled Applications – Digital Pathology

• Digital pathology images scanned from human tissue specimens provide rich information about morphological and functional characteristics of biological systems.

• Pathology image analysis has high potential to provide diagnostic assistance, identify therapeutic targets, and predict patient outcomes and therapeutic responses.

• It relies on both pathology image analysis algorithms and spatial querying methods.

• Extremely large image scale.

[Figure: glass slides → scanning → whole slide images → image analysis.]

Page 36

Applications – Public Health
• GIS-oriented public health research has a strong focus on the locations of patients and the agents of disease, and studies the spatial patterns and variations.
• Integrating multiple spatial big data sources at fine spatial resolutions allows public health researchers and health officials to adequately identify, analyze, and monitor health problems at the community level.
• This will rely on high-performance spatial querying methods for data integration.
• Note the synergy between GIS and large-image processing, as in pathology.

Page 37

Biomolecular Simulation Data Analysis

Path Similarity Analysis (PSA) with Hausdorff distance
• Utah (CPPTraj), Arizona State (MDAnalysis), Rutgers
• Parallelize key algorithms, including the O(N²) distance computations between trajectories (see the sketch below)
• Integrate the SPIDAL O(N²) distance and clustering libraries
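
A small all-pairs Hausdorff-distance sketch on synthetic 3D "trajectories" using scipy; the real PSA workflow operates on MD trajectories via MDAnalysis/CPPTraj and parallelizes the O(N²) loop:

```python
# All-pairs symmetric Hausdorff distances between toy trajectories.
import numpy as np
from scipy.spatial.distance import directed_hausdorff

rng = np.random.default_rng(5)
trajectories = [rng.normal(size=(300, 3)).cumsum(axis=0) for _ in range(20)]  # toy 3D paths

n = len(trajectories)
D = np.zeros((n, n))                   # the O(N^2) all-pairs distance matrix
for i in range(n):
    for j in range(i + 1, n):
        d = max(directed_hausdorff(trajectories[i], trajectories[j])[0],
                directed_hausdorff(trajectories[j], trajectories[i])[0])
        D[i, j] = D[j, i] = d
print(D.max())
```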

Page 38

[Figure captions:]
Clustered distances for two methods for sampling macromolecular transitions (200 trajectories each), showing that both methods produce distinctly different pathways.
RADICAL-Pilot benchmark run for three different test sets of trajectories, using 12x12 "blocks" per task. RADICAL-Pilot Hausdorff distance: an all-pairs problem.

Page 39

Classification of Lipids in Membranes
Biological membranes are lipid bilayers with distinct inner and outer surfaces that are formed by lipid monolayers (leaflets). Movement of lipids between leaflets, or a change of topology (merging of leaflets during fusion events), is difficult to detect in simulations.

[Figure: lipids colored by leaflet; same color = continuous leaflet.]

Page 40

LeafletFinder
LeafletFinder is a graph-based algorithm to detect continuous lipid membrane leaflets in an MD simulation*. The current implementation is slow and does not work well for large systems (>100,000 lipids). (A minimal sketch of the idea follows below.)

Pipeline: phosphate atom coordinates → build a nearest-neighbors adjacency matrix → find the largest connected subgraphs.

* N. Michaud-Agrawal, E. J. Denning, T. B. Woolf, and O. Beckstein. MDAnalysis: A toolkit for the analysis of molecular dynamics simulations. J Comp Chem, 32:2319–2327, 2011.
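
A minimal sketch of the LeafletFinder idea on synthetic coordinates: build a nearest-neighbour graph over the phosphate positions and take the largest connected components as leaflets. The coordinates, cutoff, and graph library are illustrative; MDAnalysis's LeafletFinder works on real topologies and trajectories:

```python
# Two flat synthetic "leaflets" ~40 A apart; a 15 A cutoff connects lipids within
# a leaflet but not across, so the two largest connected components are the leaflets.
import numpy as np
import networkx as nx
from scipy.spatial import cKDTree

rng = np.random.default_rng(6)
upper = np.column_stack([rng.uniform(0, 100, 500), rng.uniform(0, 100, 500), rng.normal(40, 1, 500)])
lower = np.column_stack([rng.uniform(0, 100, 500), rng.uniform(0, 100, 500), rng.normal(0, 1, 500)])
phosphates = np.vstack([upper, lower])

pairs = cKDTree(phosphates).query_pairs(r=15.0)   # neighbour pairs within the cutoff
G = nx.Graph()
G.add_nodes_from(range(len(phosphates)))
G.add_edges_from(pairs)
leaflets = sorted(nx.connected_components(G), key=len, reverse=True)
print([len(c) for c in leaflets[:2]])             # sizes of the two largest "leaflets"
```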

Page 41

Time series of stock values projected to 3D, using one-day stock values measured from January 2004 and starting after one year, January 2005 (filled circles are final values).

[3D plot with labeled trajectories: Apple, Mid Cap, Energy, S&P, Dow Jones, Finance; reference traces for 0% change (origin), +10%, and +20%.]