SALSA | Judy Qiu, xqiu@indiana.edu, http://SALSAhpc.indiana.edu, School of Informatics and Computing, Indiana University. IndexedHBase: Analysis Tools for Data Enabled Science. Summer Workshop on Algorithms and Cyberinfrastructure for large scale optimization/AI, August 9, 2013


Page 1:

Judy Qiu (xqiu@indiana.edu)

http://SALSAhpc.indiana.edu

School of Informatics and Computing, Indiana University

IndexedHBase

Analysis Tools for Data Enabled Science

Summer Workshop on Algorithms and Cyberinfrastructure for large scale optimization/AI, August 9, 2013

Page 2:

Big Data Challenge

(Source: Helen Sun, Oracle Big Data)

Page 3:

Learning from Big Data: converting raw data to knowledge discovery

- Exponential data growth
- Continuous analysis of streaming data
- A variety of algorithms and data structures
- Multi/manycore and GPU architectures

Thousands of cores in clusters and millions in data centers

Cost and time trade-off

Parallelism is a must to process data in a meaningful length of time

Page 4: Judy Qiu xqiu@indiana.edu http:// SALSA hpc.indiana.edu School of Informatics and Computing Indiana University Analysis Tools for Data Enabled Science

SALSA

Programming Runtimes

High-level programming models such as MapReduce adopt a data-centered design

Computation starts from data

Support moving computation to data

Shows promising results for data-intensive computing

(Google, Yahoo, Amazon, Microsoft, …)

Challenges: traditional MapReduce and classical parallel runtimes cannot solve iterative algorithms efficiently

- Hadoop: repeated data access to HDFS; no optimization for (in-memory) data caching or (collective) intermediate data transfers
- MPI: no natural support for fault tolerance; complicated programming interface

Runtime examples:
- Pig Latin, Hive
- MPI, PVM, HPF
- Hadoop MapReduce
- Chapel, X10
- Classic Cloud: queues, workers
- DAGMan, BOINC
- Workflows: Swift, Falkon
- PaaS: worker roles

Perform computations efficiently; achieve higher throughput

Page 5:

(a) Map Only (Pleasingly Parallel)
- CAP3 gene analysis
- Smith-Waterman distances
- Document conversion (PDF -> HTML)
- Brute force searches in cryptography
- Parametric sweeps
- PolarGrid MATLAB data analysis

(b) Classic MapReduce
- High Energy Physics (HEP) histograms
- Distributed search
- Distributed sorting
- Information retrieval
- Calculation of pairwise distances for sequences (BLAST)

(c) Iterative MapReduce
- Expectation maximization algorithms
- Linear algebra
- Data mining, including K-means clustering, Deterministic Annealing clustering, Multidimensional Scaling (MDS)
- PageRank

(d) Loosely Synchronous
- Many MPI scientific applications utilizing a wide variety of communication constructs, including local interactions
- Solving differential equations and particle dynamics with short-range forces

[Diagram: interconnection patterns — (a) map only, no communication; (b) input → map → reduce → output; (c) input → map → reduce with iterations; (d) MPI collective communication (Pij)]

Applications & Different Interconnection Patterns

Domain of MapReduce and Iterative Extensions

Page 6:

Data Analysis Tools: MapReduce optimized for iterative computations

Twister: the speedy elephant

Abstractions:
- In-Memory: cacheable map/reduce tasks
- Data Flow: iterative; loop-invariant and variable data
- Thread: lightweight; local aggregation
- Map-Collective: communication patterns optimized for large intermediate data transfers
- Portability: HPC (Java), Azure Cloud (C#)

Page 7:

Programming Model for Iterative MapReduce

- Distinction between loop-invariant data and variable data (data flow vs. δ flow)
- Loop-invariant data is loaded only once via Configure(), then processed by Map(Key, Value) and Reduce(Key, List&lt;Value&gt;)
- Cacheable map/reduce tasks (in memory)
- Faster intermediate data transfer mechanism
- Combine(Map&lt;Key, Value&gt;): a combiner operation collects all reduce outputs
- Main program drives the iterations, feeding the variable data into each round:

  while(..) { runMapReduce(..) }
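The control flow above — a main program looping over map/reduce rounds until convergence, with the loop-invariant data cached once and only the variable data updated each iteration — can be sketched as a toy in-process K-means driver. This is an illustration of the pattern, not the actual Twister API; all names here are ours.

```python
# Toy in-process model of the iterative MapReduce pattern:
# the points are the loop-invariant data ("cached" once); the
# centroids are the variable data, updated every iteration.

def kmeans_iterative(points, centroids, max_iters=100, tol=1e-6):
    cached = points  # loop-invariant data: loaded/cached only once
    for _ in range(max_iters):
        # Map: assign each cached point to its nearest centroid,
        # accumulating per-centroid partial sums and counts
        partials = {}
        for x in cached:
            c = min(range(len(centroids)),
                    key=lambda i: sum((a - b) ** 2
                                      for a, b in zip(x, centroids[i])))
            s, n = partials.get(c, ([0.0] * len(x), 0))
            partials[c] = ([a + b for a, b in zip(s, x)], n + 1)
        # Reduce/Combine: compute new centroids from partial sums
        new = list(centroids)
        for c, (s, n) in partials.items():
            new[c] = [v / n for v in s]
        shift = max(sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
                    for p, q in zip(centroids, new))
        centroids = new          # variable data for the next round
        if shift < tol:
            break
    return centroids
```

With two well-separated groups, e.g. `kmeans_iterative([[0,0],[0,1],[10,10],[10,11]], [[1,1],[9,9]])`, the loop converges in two rounds to the group means.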

Page 8:

Map-Collective Communication Model

We generalize the Map-Reduce concept to Map-Collective, noting that large collectives are a distinguishing feature of data intensive and data mining applications.

Collectives generalize Reduce to include all large scale linked communication-compute patterns.

MapReduce already includes a step in the collective direction with sort, shuffle, merge as well as basic reduction.

Patterns:
- MapReduce: WordCount, Grep
- MapReduce-MergeBroadcast: K-means clustering, PageRank
- Map-AllGather: MDS-BCCalc
- Map-AllReduce: K-means clustering, MDS-StressCalc
- Map-ReduceScatter: PageRank, Belief Propagation
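As a sketch of the Map-AllReduce pattern named above: each map task produces a partial result, and the collective combines them element-wise so that every task ends the iteration holding the same full result. This is a plain in-memory simulation; real implementations pipeline and overlap the communication.

```python
# Minimal simulation of a Map-AllReduce collective: each "task"
# contributes a partial vector; allreduce gives every task the
# element-wise sum of all partials.

def allreduce_sum(partials):
    """Combine per-task partial vectors; every task gets the total."""
    total = [sum(col) for col in zip(*partials)]
    return [list(total) for _ in partials]   # one copy per task

# e.g. three map tasks, each holding partial centroid sums
partials = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
results = allreduce_sum(partials)
# every task now holds [12, 15, 18]
```

This is exactly the shape of a K-means iteration: the partial vectors are per-task centroid sums, and after the allreduce every task can compute the new centroids locally.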

Page 9:

Case Studies: Data Analysis Algorithms

- Clustering using image data
- Parallel inverted indexing used for HBase
- Matrix algebra as needed: matrix multiplication, equation solving, eigenvector/eigenvalue calculation

Support a suite of parallel data-analysis capabilities

Page 10:

Iterative Computations

[Figures: performance of K-means; parallel overhead of matrix multiplication]

Page 11:

PageRank

Well-known PageRank algorithm [1]

Used ClueWeb09 [2] (1TB in size) from CMU

Hadoop loads the web graph in every iteration

Twister keeps the graph in memory

Pregel approach seems natural for graph-based problems

[1] PageRank algorithm, http://en.wikipedia.org/wiki/PageRank
[2] ClueWeb09 data set, http://boston.lti.cs.cmu.edu/Data/clueweb09/
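The algorithm referenced in [1] is the standard power iteration over the web graph. A minimal adjacency-list sketch follows; the real ClueWeb09 graph is of course partitioned and stored in compressed sparse form rather than held in one Python dict.

```python
# Minimal PageRank power iteration over an adjacency list,
# with the standard damping factor d = 0.85.

def pagerank(links, d=0.85, iters=50):
    """links: {page_id: [outgoing page_ids]}; returns rank per page."""
    n = len(links)
    rank = [1.0 / n] * n
    for _ in range(iters):
        new = [(1.0 - d) / n] * n
        for src, outs in links.items():
            if outs:
                share = d * rank[src] / len(outs)
                for dst in outs:
                    new[dst] += share
            else:  # dangling node: spread its rank uniformly
                for dst in range(n):
                    new[dst] += d * rank[src] / n
        rank = new
    return rank

# 3-page cycle: 0 -> 1 -> 2 -> 0; ranks converge to 1/3 each
ranks = pagerank({0: [1], 1: [2], 2: [0]})
```

Note that the full rank vector (the "variable data") must be visible to every map task each iteration, which is why Hadoop's per-iteration reload of the graph is so costly compared with Twister keeping it in memory.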

[Diagram: partial updates per iteration — M (map) over a partial adjacency matrix, R (reduce), C (combine) producing partially merged updates against the current page ranks (compressed)]

Page 12:

Data-Intensive K-means Clustering

Image classification: 7 million images; 512 features per image; 1 million clusters; 10K map tasks; 64 GB of broadcast data (1 GB of data transfer per map task node); 20 TB of intermediate data in shuffling.

[Figure: image clustering pipeline — input images (e.g. 99000070, 99000076, 99000432) are divided into patches, HOG features (f1, f2, …, f_dim) are extracted per patch, and the feature vectors are clustered]

Collaboration with Prof. David Crandall

Page 13:

High Dimensional Image Data

The K-means clustering algorithm is used to cluster images with similar features. Each image is characterized as a data point (vector) with dimensionality in the range of 512 to 2048; each value (feature) ranges from 0 to 255.

A full execution of the image clustering application:

We successfully cluster 7.42 million vectors into 1 million cluster centers. 10000 map tasks are created on 125 nodes. Each node has 80 tasks, each task caches 742 vectors. For 1 million centroids, broadcasting data size is about 512 MB. Shuffling data is 20 TB, while the data size after local aggregation is about 250 GB. Since the total memory size on 125 nodes is 2 TB, we cannot even execute the program unless local aggregation is performed.
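The sizes quoted above are mutually consistent if each broadcast feature is stored as 1 byte and each partial centroid sum as a 4-byte value during shuffling — those storage widths are our assumption, not stated on the slide. A back-of-the-envelope check:

```python
# Back-of-the-envelope check of the slide's numbers.
# Assumed widths: 1 byte/feature for broadcast centroids,
# 4 bytes/value for partial centroid sums in the shuffle.

centroids, dims = 1_000_000, 512
map_tasks, nodes = 10_000, 125

broadcast_bytes = centroids * dims * 1          # centroid broadcast
per_task_partial = centroids * dims * 4         # per map task
shuffle_bytes = per_task_partial * map_tasks    # without aggregation
after_local_agg = per_task_partial * nodes      # one partial per node

print(broadcast_bytes / 1e6, "MB broadcast")          # 512.0 MB
print(shuffle_bytes / 1e12, "TB shuffle")             # 20.48 TB
print(after_local_agg / 1e9, "GB after aggregation")  # 256.0 GB
```

The 256 GB figure matches the slide's "about 250 GB", and since it is well under the 2 TB of aggregate memory on 125 nodes while the raw 20 TB shuffle is not, local aggregation is indeed what makes the run feasible.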

Page 14:

[Diagram: per-worker control flow — Broadcast from Driver → Map → Local Aggregation → Shuffle → Reduce → Combine to Driver, shown for Workers 1-3]

Image clustering control flow in Twister, with the new local aggregation feature in Map-Collective drastically reducing intermediate data size
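The local aggregation step works like a combiner: map outputs destined for the same reduce key are merged within a worker before shuffling, so each worker ships one partial sum per key instead of one record per map task. A toy sketch (the key and value shapes here are illustrative):

```python
# Toy illustration of local aggregation before shuffle:
# without it, every (key, vector) pair crosses the network;
# with it, each worker ships one merged vector per key.

def local_aggregate(map_outputs):
    """Merge one worker's map outputs by key (element-wise sum)."""
    merged = {}
    for key, vec in map_outputs:
        if key in merged:
            merged[key] = [a + b for a, b in zip(merged[key], vec)]
        else:
            merged[key] = list(vec)
    return merged

# one worker ran several map tasks, each emitting centroid updates
outputs = [("c0", [1, 1]), ("c1", [2, 0]),
           ("c0", [3, 1]), ("c1", [0, 4])]
shuffled = local_aggregate(outputs)
# 4 records shrink to 2: {"c0": [4, 2], "c1": [2, 4]}
```

Scaled up, this is what turns the 20 TB shuffle into roughly 250 GB: the per-node data no longer grows with the number of map tasks, only with the number of keys.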

We explore operations such as high-performance broadcasting and shuffling, then add them to the Twister iterative MapReduce framework. There are different algorithms for broadcasting:

- Pipeline (works well for Cloud)
- Minimum-spanning tree
- Bidirectional exchange
- Bucket algorithm
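The trade-off between these algorithms is visible in simple cost models: a flat (sequential) broadcast of m bytes to p receivers costs about p·m transfer time, while a chunked pipeline/chain costs about (k + p − 1)·(m/k) for k chunks, approaching a single message time for large k. A toy model, deliberately ignoring per-message latency (which in practice bounds the useful chunk count):

```python
# Toy cost model for broadcasting m bytes to p receivers at
# bandwidth B (bytes/s). Per-message latency is ignored here.

def flat_broadcast_time(m, p, B):
    """Root sends the whole message to each receiver in turn."""
    return p * m / B

def pipeline_broadcast_time(m, p, B, chunks):
    """Chain/pipeline: k chunks flow down a chain of p nodes."""
    chunk = m / chunks
    return (chunks + p - 1) * chunk / B

m, p, B = 512e6, 125, 1.25e9     # 512 MB, 125 nodes, 10 Gbps
flat = flat_broadcast_time(m, p, B)             # 51.2 s
piped = pipeline_broadcast_time(m, p, B, 1000)  # ~0.46 s
```

This is why pipelining (and the topology-aware chain variant on the next slide) dominates for the large centroid broadcasts in the image clustering run.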

Page 15:

Broadcast Comparison: Twister vs. MPI

The new topology-aware chain broadcasting algorithm gives 20% better performance than the best C/C++ MPI methods (and is four times faster than Java MPJ).

It achieves a factor of 5 improvement over the pipeline-based method without topology optimization on 150 nodes.

Performance comparison of Twister chain method and Open MPI MPI_Bcast

Performance comparison of Twister chain method and MPJ broadcasting method (MPJ 2GB is prediction only)

Chain method with/without topology-awareness

Page 16:

Broadcast Comparison: Local Aggregation

The left figure shows that the time cost of shuffling drops to only 10% of the original.

The right figure presents the collective communication cost per iteration, which is 169 seconds (less than 3 minutes).

Comparison between shuffling with and without local aggregation

Communication cost per iteration of the image clustering application

Page 17:

Triangle Inequality and K-means

The dominant part of the K-means algorithm is finding the nearest center to each point: O(#points × #clusters × vector dimension).

A simple algorithm takes the minimum over centers c of d(x, c) = distance(point x, center c). But most of the d(x, c) calculations are wasted, as they are much larger than the minimum value. Elkan [1] showed how to use the triangle inequality to speed this up, via relations like:

d(x, c) ≥ d(x, c-last) - d(c, c-last)

where c-last is the position of center c at the last iteration. So compare d(x, c-last) - d(c, c-last) with d(x, c-best), where c-best is the nearest cluster at the last iteration. Complexity is reduced by a factor of the vector dimension, which matters when clustering high-dimensional spaces such as social imagery with 512 or more features per image.

[1] Charles Elkan, "Using the triangle inequality to accelerate k-means," in Proceedings of the Twentieth International Conference on Machine Learning (ICML 2003), Tom Fawcett and Nina Mishra, Eds., Washington, DC, August 21-24, 2003, pp. 147-153.
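The pruning rule above can be sketched directly: before computing d(x, c) for a candidate center, compare the triangle-inequality lower bound against the best distance found so far, and skip the center when the bound already rules it out. This is a simplified single-bound version of Elkan's idea, with our own illustrative names; the full algorithm maintains the cached distances incrementally across iterations.

```python
from math import dist  # Euclidean distance (Python 3.8+)

def assign_with_bounds(x, centers, last_dists, center_shift):
    """Nearest-center search using the triangle-inequality bound
      d(x, c) >= d(x, c_last) - d(c, c_last).
    last_dists[i] = d(x, center i at the last iteration), cached
    from the previous pass; center_shift[i] = d(c_i now, c_i last),
    computed once per center per iteration (cheap: O(k * dim))."""
    best_i, best_d, skipped = 0, dist(x, centers[0]), 0
    for i in range(1, len(centers)):
        if last_dists[i] - center_shift[i] >= best_d:
            skipped += 1      # exact d(x, c_i) cannot beat best_d
            continue
        d = dist(x, centers[i])
        if d < best_d:
            best_i, best_d = i, d
    return best_i, best_d, skipped
```

When centers barely move between iterations, `center_shift` is small, the cached bounds stay tight, and most exact distance computations are skipped — which is the behavior the next slide's graph measures.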

Page 18:

Fast K-means Algorithm

The graph shows the fraction of distances d(x, c) calculated in each iteration for a test data set: 200K points, 124 centers, vector dimension 74.

d(x(P), m(now, c)) ≥ d(x(P), m(last, c)) - d(m(now, c), m(last, c))   (1)

lower_bound = d(x(P), m(last, c)) - d(m(now, c), m(last, c)); center c can be skipped when lower_bound ≥ d(x(P), m(last, c-current_best))   (2)

Page 19:

Results on the Fast K-means Algorithm

Histograms of distance distributions for 76,800 points clustered into 3200 clusters in a 2048-dimensional space.

The distances from points to their nearest center are shown as triangles; the distances to other (further) centers as crosses; the distances between centers as filled circles.

Page 20:

Data Analysis Architecture

- Applications/Algorithms: Kernels, Genomics, Proteomics, Information Retrieval, Polar Science, Scientific Simulation Data Analysis and Management, Dissimilarity Computation, Clustering, Multidimensional Scaling, Generative Topological Mapping — supporting scientific simulations (data mining and data analysis)
- Services and Workflow: Security, Provenance, Portal
- High Level Language
- Programming Model: Cross-Platform Iterative MapReduce (Collectives, Fault Tolerance, Scheduling)
- Runtime
- Storage: Distributed File Systems, Data Parallel File System, Object Store
- Infrastructure: Linux HPC (bare-system), Amazon Cloud, Windows Server HPC (bare-system), Azure Cloud, Virtualization, Grid Appliance
- Hardware: CPU Nodes, GPU Nodes, Virtualization