Analysis Tools for Data Enabled Science
IndexedHBase
Judy Qiu (xqiu@indiana.edu)
http://SALSAhpc.indiana.edu
School of Informatics and Computing, Indiana University
Summer Workshop on Algorithms and Cyberinfrastructure for Large Scale Optimization/AI, August 9, 2013
Big Data Challenge
(Source: Helen Sun, Oracle Big Data)
Learning from Big Data
- Converting raw data to knowledge discovery
- Exponential data growth
- Continuous analysis of streaming data
- A variety of algorithms and data structures
- Multi/manycore and GPU architectures
- Thousands of cores in clusters and millions in data centers
- Cost and time trade-off: parallelism is a must to process data in a meaningful length of time
Programming Runtimes
High-level programming models such as MapReduce adopt a data-centered design:
- Computation starts from data
- Support for moving computation to data
- Promising results for data-intensive computing (Google, Yahoo, Amazon, Microsoft …)
Challenges: traditional MapReduce and classical parallel runtimes cannot solve iterative algorithms efficiently
- Hadoop: repeated data access to HDFS; no optimization for (in-memory) data caching or (collective) intermediate data transfers (see the driver sketch below)
- MPI: no natural support for fault tolerance; complicated programming interface
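To make the Hadoop limitation concrete, here is a minimal sketch of a naive iterative driver using the standard org.apache.hadoop.mapreduce API. It is an illustration only: the identity Mapper/Reducer stand in for the real per-iteration computation, and paths are placeholders. The point is that every iteration launches a fresh job, so the loop-invariant input is re-read from HDFS each time and intermediate results are written back to HDFS between iterations.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class NaiveIterativeDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path staticInput = new Path(args[0]);          // loop-invariant data (e.g. data points)
        Path current = new Path(args[1] + "/iter0");   // variable data (e.g. current centroids)

        for (int i = 0; i < 10; i++) {
            Job job = Job.getInstance(conf, "iteration-" + i);
            job.setJarByClass(NaiveIterativeDriver.class);
            job.setMapperClass(Mapper.class);          // identity stand-ins for the real computation
            job.setReducerClass(Reducer.class);
            // The static input is re-read from HDFS in EVERY iteration:
            FileInputFormat.addInputPath(job, staticInput);
            // Variable data is redistributed each iteration via the distributed cache:
            job.addCacheFile(current.toUri());
            Path next = new Path(args[1] + "/iter" + (i + 1));
            FileOutputFormat.setOutputPath(job, next); // intermediate results go back to HDFS
            job.waitForCompletion(true);
            current = next;                            // next iteration starts from freshly written output
        }
    }
}
```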
[Figure: spectrum of programming runtimes, from "perform computations efficiently" to "achieve higher throughput" — MPI, PVM, HPF; Chapel, X10; Hadoop MapReduce; Pig Latin, Hive; Classic Cloud: queues, workers; DAGMan, BOINC; Workflows: Swift, Falkon; PaaS: Worker Roles.]
Applications & Different Interconnection Patterns (domain of MapReduce and iterative extensions)
(a) Map Only (pleasingly parallel) — input → map → output, no communication: CAP3 gene analysis; Smith-Waterman distances; document conversion (PDF -> HTML); brute-force searches in cryptography; parametric sweeps; PolarGrid MATLAB data analysis
(b) Classic MapReduce — input → map → reduce: High Energy Physics (HEP) histograms; distributed search; distributed sorting; information retrieval; calculation of pairwise distances for sequences (BLAST)
(c) Iterative MapReduce — input → map → reduce, with iterations: expectation-maximization algorithms; linear algebra; data mining, including K-means clustering, deterministic annealing clustering, multidimensional scaling (MDS), PageRank
(d) Loosely Synchronous — collective communication (MPI, Pij): many MPI scientific applications utilizing a wide variety of communication constructs, including local interactions; solving differential equations and particle dynamics with short-range forces
Data Analysis Tools: MapReduce optimized for iterative computations
Twister: the speedy elephant
Abstractions:
- In-memory: cacheable map/reduce tasks
- Data flow: iterative; loop-invariant and variable data
- Thread: lightweight; local aggregation
- Map-Collective: communication patterns optimized for large intermediate data transfers
- Portability: HPC (Java), Azure Cloud (C#)
Programming Model for Iterative MapReduce
- Distinction between loop-invariant data (loaded only once) and variable data (data flow vs. δ flow)
- Cacheable map/reduce tasks (in memory)
- Faster intermediate data transfer mechanism
- Combine operation to collect all reduce outputs
API elements: Configure(), Map(Key, Value), Reduce(Key, List<Value>), Combine(Map<Key,Value>)
Main program: while(..) { runMapReduce(..) } — sketched below
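A hedged skeleton of the programming model above (not the actual Twister API; the class and method names here are hypothetical): loop-invariant data is configured and cached once per task, only the small variable data is broadcast each iteration, and the Combine step collects all reduce outputs back to the main program.

```java
import java.util.List;
import java.util.Map;

/**
 * Hypothetical skeleton illustrating the iterative MapReduce model:
 * Configure() caches loop-invariant data once, Map/Reduce run each iteration,
 * Combine(Map<Key,Value>) returns all reduce outputs to the main program.
 */
public abstract class IterativeMapReduceSketch<K, V, R, C> {

    // Called once: load and cache the loop-invariant data partition in memory.
    public abstract void configure(List<double[]> loopInvariantPartition);

    // Called every iteration with the broadcast variable data (e.g. current centroids).
    public abstract Map<K, V> map(K key, byte[] variableData);

    public abstract R reduce(K key, List<V> values);

    // Collects all reduce outputs back to the driver.
    public abstract C combine(Map<K, R> reduceOutputs);

    /** Main-program pattern from the slide: while (..) { runMapReduce(..) } */
    public C drive(byte[] variableData, int maxIterations) {
        C combined = null;
        for (int i = 0; i < maxIterations; i++) {
            // Broadcast variableData to the cached map tasks, run map -> reduce -> combine;
            // runMapReduce(..) is a stand-in for the framework call.
            combined = runMapReduce(variableData);
            variableData = encode(combined);   // only the small delta flows between iterations
        }
        return combined;
    }

    protected abstract C runMapReduce(byte[] variableData);
    protected abstract byte[] encode(C combined);
}
```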
Map-Collective Communication Model
We generalize the MapReduce concept to Map-Collective, noting that large collectives are a distinguishing feature of data-intensive and data-mining applications. Collectives generalize Reduce to include all large-scale linked communication-compute patterns. MapReduce already includes a step in the collective direction with sort, shuffle, and merge as well as basic reduction.
Patterns (H-Collectives):
- MapReduce: WordCount, Grep
- MapReduce-MergeBroadcast: K-means Clustering, PageRank
- Map-AllGather: MDS-BCCalc
- Map-AllReduce: K-means Clustering, MDS-StressCalc (sketched below)
- Map-ReduceScatter: PageRank, Belief Propagation
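To make the Map-AllReduce pattern concrete, here is a minimal sketch (the method names are mine, not the H-Collectives API) of how K-means centroid updates fit it: each map task computes partial sums and counts over its cached points, and an allreduce-style element-wise merge gives the new centroids without a full shuffle. The merge is shown centrally for clarity; the framework would perform it collectively.

```java
import java.util.Arrays;
import java.util.List;

/** Sketch of the Map-AllReduce pattern for K-means centroid updates (hypothetical helper names). */
public class MapAllReduceKMeansSketch {

    /** Per-map-task step: partial sums/counts over the locally cached points. */
    static double[][] partialSums(List<double[]> cachedPoints, double[][] centroids) {
        int k = centroids.length, dim = centroids[0].length;
        double[][] partial = new double[k][dim + 1];       // last column holds the count
        for (double[] p : cachedPoints) {
            int best = 0;
            double bestDist = Double.MAX_VALUE;
            for (int c = 0; c < k; c++) {
                double d = 0;
                for (int j = 0; j < dim; j++) {
                    double diff = p[j] - centroids[c][j];
                    d += diff * diff;
                }
                if (d < bestDist) { bestDist = d; best = c; }
            }
            for (int j = 0; j < dim; j++) partial[best][j] += p[j];
            partial[best][dim] += 1;                        // count of points assigned to this centroid
        }
        return partial;
    }

    /** AllReduce step: element-wise sum of all partial tables, then sums / counts = new centroids. */
    static double[][] allReduce(List<double[][]> partialsFromAllTasks) {
        if (partialsFromAllTasks.isEmpty()) throw new IllegalArgumentException("no partials");
        double[][] total = null;
        for (double[][] partial : partialsFromAllTasks) {
            if (total == null) {
                total = Arrays.stream(partial).map(double[]::clone).toArray(double[][]::new);
            } else {
                for (int c = 0; c < total.length; c++)
                    for (int j = 0; j < total[c].length; j++)
                        total[c][j] += partial[c][j];
            }
        }
        for (double[] row : total) {
            double count = Math.max(row[row.length - 1], 1);
            for (int j = 0; j < row.length - 1; j++) row[j] /= count;
        }
        return total;
    }
}
```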
Case Studies: Data Analysis Algorithms
- Clustering using image data
- Parallel inverted indexing used for HBase
- Matrix algebra as needed: matrix multiplication (sketched below), equation solving, eigenvector/eigenvalue calculation
Support a suite of parallel data-analysis capabilities
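Matrix algebra fits the same model. As a concrete, single-process sketch (hypothetical, not the deck's implementation), block matrix multiplication can be expressed as map tasks computing block products A[i][k] * B[k][j] keyed by the output block (i, j), with a reduce step summing blocks that share a key.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Single-process sketch of MapReduce-style blocked matrix multiplication. */
public class BlockMatMulSketch {

    /** "Map": emit one partial block product per (i, k, j) triple, keyed by output block (i, j). */
    static Map<String, List<double[][]>> mapPhase(double[][][][] a, double[][][][] b) {
        Map<String, List<double[][]>> emitted = new HashMap<>();
        int nI = a.length, nK = a[0].length, nJ = b[0].length;
        for (int i = 0; i < nI; i++)
            for (int j = 0; j < nJ; j++)
                for (int k = 0; k < nK; k++)
                    emitted.computeIfAbsent(i + "," + j, key -> new ArrayList<>())
                           .add(multiplyBlocks(a[i][k], b[k][j]));
        return emitted;
    }

    /** "Reduce": sum all partial products that share the same output-block key. */
    static double[][] reducePhase(List<double[][]> partials) {
        double[][] sum = new double[partials.get(0).length][partials.get(0)[0].length];
        for (double[][] p : partials)
            for (int r = 0; r < sum.length; r++)
                for (int c = 0; c < sum[r].length; c++)
                    sum[r][c] += p[r][c];
        return sum;
    }

    static double[][] multiplyBlocks(double[][] x, double[][] y) {
        double[][] out = new double[x.length][y[0].length];
        for (int r = 0; r < x.length; r++)
            for (int k = 0; k < y.length; k++)
                for (int c = 0; c < y[0].length; c++)
                    out[r][c] += x[r][k] * y[k][c];
        return out;
    }
}
```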
Iterative Computations
K-means and Matrix Multiplication
[Figures: performance of K-means; parallel overhead of Matrix Multiplication.]
PageRank
- Well-known PageRank algorithm [1]
- Used the ClueWeb09 data set [2] (1 TB in size) from CMU
- Hadoop loads the web graph in every iteration
- Twister keeps the graph in memory
- The Pregel approach seems natural for graph-based problems
[1] PageRank Algorithm, http://en.wikipedia.org/wiki/PageRank
[2] ClueWeb09 Data Set, http://boston.lti.cs.cmu.edu/Data/clueweb09/
[Figure: Twister PageRank flow — map (M) tasks hold a partial adjacency matrix and emit partial updates, reduce (R) tasks merge them, and a combine (C) step yields partially merged updates; the current page ranks (compressed) are fed back across iterations.]
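To connect the figure with the text, a minimal, self-contained sketch (hypothetical data layout, not the actual Twister implementation) of one PageRank iteration: each map task keeps its partition of the out-links cached across iterations and emits partial rank updates; the reduce/combine step merges them into the new rank vector.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch of one PageRank iteration over a cached, partitioned adjacency list (hypothetical layout). */
public class PageRankIterationSketch {
    static final double DAMPING = 0.85;

    /** Map step for one partition: out-links stay cached across iterations; only ranks change. */
    static Map<Integer, Double> mapPartition(Map<Integer, int[]> cachedOutLinks,
                                             Map<Integer, Double> currentRanks) {
        Map<Integer, Double> partialUpdates = new HashMap<>();
        for (Map.Entry<Integer, int[]> e : cachedOutLinks.entrySet()) {
            int[] targets = e.getValue();
            if (targets.length == 0) continue;
            double share = currentRanks.getOrDefault(e.getKey(), 0.0) / targets.length;
            for (int t : targets)
                partialUpdates.merge(t, share, Double::sum);   // partial rank contribution
        }
        return partialUpdates;
    }

    /** Reduce/combine step: merge partial updates from all partitions into the new rank vector. */
    static Map<Integer, Double> combine(List<Map<Integer, Double>> partials, int numPages) {
        Map<Integer, Double> newRanks = new HashMap<>();
        for (Map<Integer, Double> partial : partials)
            for (Map.Entry<Integer, Double> e : partial.entrySet())
                newRanks.merge(e.getKey(), e.getValue(), Double::sum);
        // Apply damping: rank = (1 - d)/N + d * sum of incoming shares.
        newRanks.replaceAll((page, sum) -> (1 - DAMPING) / numPages + DAMPING * sum);
        return newRanks;
    }
}
```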
Data Intensive Kmeans Clustering
Image Classification: 7 million images; 512 features per image; 1 million clusters; 10K Map tasks; 64G broadcasting data (1GB data transfer per Map task node); 20 TB intermediate data in shuffling.
[Figure: image classification pipeline — images (e.g. IDs 99000070, 99000076, 99000432) are divided into patches, HOG features (f1, f2, …, f_dim) are extracted per patch, and clustering groups the feature vectors so that similar patches from different images land in the same clusters.]
Collaboration with Prof. David Crandall
High Dimensional Image Data
The K-means clustering algorithm is used to cluster images with similar features. Each image is characterized as a data point (vector) with dimension in the range of 512 ~ 2048; each value (feature) ranges from 0 to 255.
A full execution of the image clustering application: we successfully cluster 7.42 million vectors into 1 million cluster centers. 10,000 map tasks are created on 125 nodes; each node runs 80 tasks, and each task caches 742 vectors. For 1 million centroids, the broadcast data size is about 512 MB. The shuffled data is 20 TB, while the data size after local aggregation is about 250 GB. Since the total memory on the 125 nodes is 2 TB, the program cannot even execute unless local aggregation is performed.
[Figure: control flow across Workers 1–3 — broadcast from driver, Map tasks, worker-local aggregation, shuffle, Reduce, combine to driver.]
Image clustering control flow in Twister: the new local aggregation feature in Map-Collective drastically reduces intermediate data size (sketched below).
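A minimal sketch of the local aggregation step (hypothetical structure, single-worker view) that makes the 20 TB shuffle feasible: instead of shuffling one key/value pair per map-task output, each worker first merges the accumulators produced by its own map tasks, so only one partial sum per centroid per node crosses the network.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch of worker-local aggregation before the shuffle (hypothetical structure). */
public class LocalAggregationSketch {

    /**
     * Merge the key/value pairs emitted by all map tasks on one worker.
     * Keys are centroid ids; values are (sum vector, count) accumulators.
     */
    static Map<Integer, double[]> aggregateLocally(List<Map<Integer, double[]>> mapOutputsOnWorker) {
        Map<Integer, double[]> local = new HashMap<>();
        for (Map<Integer, double[]> taskOutput : mapOutputsOnWorker) {
            for (Map.Entry<Integer, double[]> e : taskOutput.entrySet()) {
                double[] acc = local.get(e.getKey());
                if (acc == null) {
                    local.put(e.getKey(), e.getValue().clone());      // first contribution for this centroid
                } else {
                    double[] v = e.getValue();
                    for (int j = 0; j < acc.length; j++) acc[j] += v[j];  // element-wise merge
                }
            }
        }
        // Only this merged table (one entry per centroid touched on this worker) is shuffled,
        // instead of every individual map-task output.
        return local;
    }
}
```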
We explore operations such as high-performance broadcasting and shuffling, then add them to the Twister iterative MapReduce framework. There are different algorithms for broadcasting:
- pipeline (works well for clouds) — see the chain sketch below
- minimum-spanning tree
- bidirectional exchange
- bucket algorithm
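As an illustration of the chain/pipeline idea (a sketch over plain data streams with made-up framing, not the Twister implementation): the driver splits the broadcast data into chunks and each node forwards chunk i downstream before reading chunk i+1, so transfers of different chunks overlap along the chain and the total time approaches a single data-size transfer regardless of node count.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

/** Sketch of pipelined chain broadcast: receive a chunk from the previous node, forward it downstream. */
public class ChainBroadcastSketch {
    static final int CHUNK = 1 << 20;   // 1 MB chunks keep the pipeline full

    /** Run on every non-root node: upstream/downstream are already-connected streams. */
    static byte[] relay(DataInputStream upstream, DataOutputStream downstream, boolean lastNode)
            throws IOException {
        int total = upstream.readInt();                    // total broadcast size, sent first
        if (!lastNode) downstream.writeInt(total);
        byte[] data = new byte[total];
        for (int off = 0; off < total; off += CHUNK) {
            int len = Math.min(CHUNK, total - off);
            upstream.readFully(data, off, len);            // receive chunk i ...
            if (!lastNode) {
                downstream.write(data, off, len);          // ... forward it before reading chunk i+1;
                downstream.flush();                        // across nodes, chunk transfers overlap
            }
        }
        return data;
    }

    /** Run on the root/driver node: push chunks to the first node in the chain. */
    static void send(byte[] data, DataOutputStream downstream) throws IOException {
        downstream.writeInt(data.length);
        for (int off = 0; off < data.length; off += CHUNK)
            downstream.write(data, off, Math.min(CHUNK, data.length - off));
        downstream.flush();
    }
}
```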
Broadcast Comparison: Twister vs. MPI
The new topology-aware chain broadcasting algorithm gives 20% better performance than the best C/C++ MPI methods (and is four times faster than Java MPJ). It is a factor of 5 improvement over the pipeline-based method without topology optimization on 150 nodes.
Performance comparison of Twister chain method and Open MPI MPI_Bcast
Performance comparison of Twister chain method and MPJ broadcasting method (MPJ 2GB is prediction only)
Chain method with/without topology-awareness
Broadcast Comparison: Local Aggregation
The left figure shows that the time cost of shuffling is only 10% of the original time. The right figure presents the collective communication cost per iteration, which is 169 seconds (less than 3 minutes).
Comparison between shuffling with and without local aggregation
Communication cost per iteration of the image clustering application
Triangle Inequality and Kmeans
- The dominant part of the K-means algorithm is finding the nearest center to each point: O(#Points * #Clusters * Vector Dimension)
- Simple algorithms find the min over centers c of d(x, c) = distance(point x, center c)
- But most of the d(x, c) calculations are wasted, as they are much larger than the minimum value
- Elkan [1] showed how to use the triangle inequality to speed this up, via relations like d(x, c) >= d(x, c-last) – d(c, c-last), where c-last is the position of center c at the last iteration
- So compare d(x, c-last) – d(c, c-last) with d(x, c-best), where c-best is the nearest cluster at the last iteration
- Complexity is reduced by a factor of roughly the vector dimension, so this is important when clustering high-dimensional spaces such as social imagery with 512 or more features per image
[1] Charles Elkan, "Using the Triangle Inequality to Accelerate k-Means," in Proceedings of the Twentieth International Conference on Machine Learning (ICML), T. Fawcett and N. Mishra, Eds., Washington DC, August 21–24, 2003, pp. 147–153.
Fast Kmeans Algorithm
The graph shows the fraction of distances d(x, c) calculated in each iteration for a test data set: 200K points, 124 centers, vector dimension 74.
d(x(P), m(now, c1)) ≥ d(x(P), m(last, c1)) – d(m(now, c1), m(last, c1))   (1)
lower_bound = d(x(P), m(last, c)) – d(m(now, c), m(last, c)); the exact distance to center c need not be computed when lower_bound ≥ d(x(P), m(last, c = current_best))   (2)
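A sketch of the pruning test from relations (1)–(2), keeping one lower bound per (point, center) pair in the spirit of Elkan's lower-bound variant (array and variable names are mine): after the centers move, each cached bound is loosened by the center's drift, and a center is skipped whenever its bound is already at least the distance to the current best center.

```java
/** Sketch of the triangle-inequality pruning used to skip most d(x, c) computations. */
public class FastKMeansPruningSketch {

    /**
     * Assign one point x to its nearest center.
     * lowerBounds[c] is a valid lower bound on d(x, center c), carried over from the previous
     * iteration (initialized to the exact distances in iteration 0);
     * centerDrift[c] = d(m(now, c), m(last, c)), computed once per iteration for all centers.
     */
    static int assign(double[] x, double[][] centersNow, double[] lowerBounds,
                      double[] centerDrift, int lastBest) {
        // Relation (1): d(x, m(now,c)) >= d(x, m(last,c)) - drift >= lowerBounds[c] - drift.
        for (int c = 0; c < lowerBounds.length; c++)
            lowerBounds[c] = Math.max(lowerBounds[c] - centerDrift[c], 0);

        int best = lastBest;
        double bestDist = distance(x, centersNow[lastBest]);   // exact distance to previous best
        lowerBounds[lastBest] = bestDist;

        for (int c = 0; c < centersNow.length; c++) {
            if (c == lastBest) continue;
            if (lowerBounds[c] >= bestDist) continue;   // relation (2): this center cannot win, skip it
            double d = distance(x, centersNow[c]);      // only here do we pay the O(dim) cost
            lowerBounds[c] = d;                         // the exact distance is the tightest bound
            if (d < bestDist) { bestDist = d; best = c; }
        }
        return best;
    }

    static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int j = 0; j < a.length; j++) {
            double diff = a[j] - b[j];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }
}
```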
Results on Fast Kmeans Algorithm
Histograms of distance distributions for 3200 clusters and 76,800 points in a 2048-dimensional space. The distances of points to their nearest center are shown as triangles; the distances to other (further away) centers as crosses; and the distances between centers as filled circles.
Data Analysis Architecture
[Architecture stack figure. Layers, top to bottom: Applications/Algorithms — Support Scientific Simulations (Data Mining and Data Analysis): Kernels, Genomics, Proteomics, Information Retrieval, Polar Science, Scientific Simulation Data Analysis and Management, Dissimilarity Computation, Clustering, Multidimensional Scaling, Generative Topological Mapping; Security, Provenance, Portal; High Level Language; Programming Model — Cross Platform Iterative MapReduce (Collectives, Fault Tolerance, Scheduling); Runtime; Storage — Distributed File Systems, Object Store, Data Parallel File System; Services and Workflow; Infrastructure — Linux HPC Bare-system, Amazon Cloud, Windows Server HPC Bare-system, Virtualization, Azure Cloud, Grid Appliance; Hardware — CPU Nodes, GPU Nodes, Virtualization.]