
Page 1:

Performance Issues in Parallelizing Data-Intensive Applications on a Multi-core Cluster

Vignesh Ravi and Gagan Agrawal

{raviv,agrawal}@cse.ohio-state.edu

Page 2:

OUTLINE

• Motivation
• FREERIDE Middleware
• Generalized Reduction structure
• Shared Memory Parallelization techniques
• Scalability results – K-means, Apriori & E-M
• Performance Analysis results
• Related work & Conclusion

Page 3:

Motivation

• Availability of huge amounts of data – data-intensive applications
• Advent of multi-core processors
• Need for abstractions and parallel programming systems
• The best Shared Memory Parallelization (SMP) technique is still not clear

Page 4:

Context: FREERIDE

• A middleware for parallelizing data-intensive applications

• Motivated by difficulties in implementing parallel data mining applications

• Provides high-level APIs for easier parallel programming

• Based on the observation that many data mining and scientific applications share a similar generalized reduction structure

Page 5:

FREERIDE – Core

• Reduction Object – a shared data structure where results from processed data instances are stored

Types of reduction:
• Local Reduction – reduction within a single node
• Global Reduction – reduction across a cluster of nodes

Page 6:

Generalized Reduction structure
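The figure for this slide (the reduction processing loop) is not preserved in the transcript. Its structure can be sketched as follows; this is a minimal illustration of the generalized reduction idea, and the names ReductionObject, local_reduce, and global_reduce are assumptions, not FREERIDE's actual API.

```cpp
// Minimal sketch of the generalized reduction structure (illustrative
// only; these names are assumed, not FREERIDE's actual API).
#include <cstddef>
#include <vector>

struct ReductionObject {
    std::vector<double> values;                 // shared accumulator
    explicit ReductionObject(std::size_t n) : values(n, 0.0) {}
    void accumulate(std::size_t key, double v) { values[key] += v; }
};

// Local reduction: each node/thread folds its chunk of the data set
// into the reduction object, one fine-grained update per data instance.
template <typename Element, typename ProcessFn>
void local_reduce(const std::vector<Element>& chunk,
                  ReductionObject& robj, ProcessFn process) {
    for (const Element& e : chunk)
        process(e, robj);    // derives a (key, value) pair, updates robj
}

// Global reduction: per-node objects are combined across the cluster.
void global_reduce(ReductionObject& mine, const ReductionObject& theirs) {
    for (std::size_t i = 0; i < mine.values.size(); ++i)
        mine.values[i] += theirs.values[i];
}
```

The key point is that every data instance is folded into the shared reduction object through a fine-grained update, and per-node objects are later combined in a global reduction.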

Page 7:

Parallelization Challenges

• The reduction object cannot be statically partitioned between threads/nodes
– Data races must be handled at runtime
• The reduction object can be large
– Replication can cause memory overhead
• Updates to the reduction object are fine-grained
– Locking schemes can cause significant overhead

Page 8:

Techniques in FREERIDE

• Full-replication (f-r)
• Locking-based techniques
– Full-locking (f-l)
– Optimized full-locking (o-f-l)
– Cache-sensitive locking (cs-l)
(The replication/locking contrast is sketched below.)
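The contrast between full-replication and full-locking can be illustrated with a simple histogram-style reduction; this is a sketch of the trade-off, not the middleware's code, and NUM_BINS, the modulo keying, and the std::thread usage are assumptions for this example (result is assumed pre-sized to NUM_BINS).

```cpp
// Illustrative contrast between full-replication and full-locking
// (simplified; not the middleware's code).
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

constexpr std::size_t NUM_BINS = 1024;

// Full-replication (f-r): private copies, no locks, merge at the end.
// Memory grows with thread count and the merge itself costs time.
void full_replication(const std::vector<std::size_t>& data,
                      std::size_t num_threads, std::vector<double>& result) {
    std::vector<std::vector<double>> copies(
        num_threads, std::vector<double>(NUM_BINS, 0.0));
    std::vector<std::thread> workers;
    for (std::size_t t = 0; t < num_threads; ++t)
        workers.emplace_back([&, t] {
            for (std::size_t i = t; i < data.size(); i += num_threads)
                copies[t][data[i] % NUM_BINS] += 1.0;  // private, lock-free
        });
    for (auto& w : workers) w.join();
    for (const auto& copy : copies)                    // merge phase
        for (std::size_t b = 0; b < NUM_BINS; ++b) result[b] += copy[b];
}

// Full-locking (f-l): one shared copy with one lock per element. No
// replication or merge cost, but every fine-grained update pays a lock.
void full_locking(const std::vector<std::size_t>& data,
                  std::size_t num_threads, std::vector<double>& result) {
    std::vector<std::mutex> locks(NUM_BINS);
    std::vector<std::thread> workers;
    for (std::size_t t = 0; t < num_threads; ++t)
        workers.emplace_back([&, t] {
            for (std::size_t i = t; i < data.size(); i += num_threads) {
                std::size_t b = data[i] % NUM_BINS;
                std::lock_guard<std::mutex> g(locks[b]);
                result[b] += 1.0;
            }
        });
    for (auto& w : workers) w.join();
}
```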

Page 9:

Memory Layout of locking schemes
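The layout figure itself is not in the transcript. Going by the scheme names and the trade-offs discussed later in the talk, the three layouts can be sketched as below; the field names, the 1-byte atomic_flag lock, and the 64-byte cache line are assumptions, not the middleware's definitions.

```cpp
// Sketch of the three locking layouts (all names and sizes assumed).
#include <atomic>
#include <cstddef>

constexpr std::size_t CACHE_LINE = 64;

// Full-locking (f-l): locks live in an array separate from the elements,
// so each update touches two unrelated cache lines.
struct FullLocking {
    double*           elements;  // reduction-object elements
    std::atomic_flag* locks;     // one lock per element, stored apart
};

// Optimized full-locking (o-f-l): each lock sits next to the element it
// guards, so one cache line can serve both the lock and the data.
struct OFLEntry {
    std::atomic_flag lock;       // initialize with ATOMIC_FLAG_INIT
    double           value;
};

// Cache-sensitive locking (cs-l): one lock per cache line, packed with
// the elements that fill the rest of that line.
struct alignas(CACHE_LINE) CSLine {
    std::atomic_flag lock;       // guards every element on this line
    double           values[7];  // 7 doubles fill the remaining 56 bytes
};
```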

Page 10:

Applications Implemented on FREERIDE

• Apriori (association mining)
• K-means (clustering)
• Expectation Maximization (E-M) (clustering)
(A k-means sketch in generalized-reduction form follows.)
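As an illustration of how such an application maps onto the generalized reduction structure, here is a hedged k-means sketch using 3-dimensional points, matching the data setup later in the talk; Point, process_point, and the accumulator layout are invented for this example, not taken from FREERIDE.

```cpp
// Hedged k-means sketch in generalized-reduction form (names invented).
#include <cmath>
#include <cstddef>
#include <vector>

struct Point { double x, y, z; };   // 3-D points, as in the data setup

// Reduction object: per-cluster coordinate sums and counts. Each point
// produces one fine-grained update, exactly the access pattern the SMP
// techniques must synchronize.
struct KmeansRObj {
    std::vector<double> sum_x, sum_y, sum_z;
    std::vector<long>   count;
    explicit KmeansRObj(std::size_t k)
        : sum_x(k, 0.0), sum_y(k, 0.0), sum_z(k, 0.0), count(k, 0) {}
};

void process_point(const Point& p, const std::vector<Point>& centroids,
                   KmeansRObj& robj) {
    std::size_t best = 0;
    double best_d = HUGE_VAL;
    for (std::size_t c = 0; c < centroids.size(); ++c) {
        double dx = p.x - centroids[c].x, dy = p.y - centroids[c].y,
               dz = p.z - centroids[c].z;
        double d = dx * dx + dy * dy + dz * dz;
        if (d < best_d) { best_d = d; best = c; }
    }
    // The fine-grained update to the shared reduction object:
    robj.sum_x[best] += p.x; robj.sum_y[best] += p.y;
    robj.sum_z[best] += p.z; robj.count[best] += 1;
}
```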

Page 11:

Goals in Experimental Study

• Scalability of data-intensive applications on multi-core

• Comparison of different shared memory parallelization (SMP) techniques and MPI

• Performance analysis of SMP techniques

Page 12:

Experimental setup

Each node in the cluster has:
• Two quad-core Intel Xeon E5345 CPUs (8 cores per node)
• Each core runs at 2.33 GHz
• 6 GB main memory
Nodes in the cluster are connected by InfiniBand.

Page 13:

Experiments

Two sets of experiments:
• Comparison of scalability results for f-r, cs-l, o-f-l, and MPI with K-means, Apriori, and E-M
– Single node
– Cluster of nodes
• Performance analysis results with K-means, Apriori, and E-M

Page 14:

Applications data setup

• Apriori
– Dataset size: 900 MB
– Support = 3%, confidence = 9%
• K-means
– Dataset size: 6.4 GB
– 3-dimensional points
– Number of clusters: 250
• E-M
– Dataset size: 6.4 GB
– 3-dimensional points
– Number of clusters: 60

Page 15:

Apriori (single node)

Page 16:

Apriori (cluster)

Page 17:

K-means (single node)

Page 18:

K-means (cluster)

Page 19:

E-M (single node)

Page 20:

E-M (cluster)

Page 21:

Performance Analysis of SMP techniques

• Given an application, can we predict the factors that determine the best SMP technique?

• Why do locking techniques suffer with Apriori, but compete well with the other applications?

• What factors limit the overall scalability of data-intensive applications?

Page 22:

Performance Analysis setup

• Valgrind used for dynamic binary analysis
• Cachegrind used for the analysis of cache utilization
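As a usage note (the slides do not give exact command lines, so this is an assumption about a typical invocation): cache behavior would be collected with "valgrind --tool=cachegrind ./app <args>" and summarized with "cg_annotate cachegrind.out.<pid>".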

Page 23:

Performance Analysis

Locking vs Merge Overhead

Page 24:

Performance Analysis (contd.) – Relative L2 misses for the reduction object

Page 25:

Performance Analysis (contd.) – Total program read/write misses

Page 26:

Analysis

• Important trade-off
– Memory needs of the application
– Frequency of updates to the reduction object

• E-M is compute- and memory-intensive
– Locking overhead is very low
– Replication overhead is high

• Apriori has a high update fraction and very little computation
– Locking overhead is extremely high
– Replication performs best
(A rule-of-thumb sketch of this trade-off follows.)
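The trade-off can be distilled into a rule of thumb; the numeric threshold below is an illustrative placeholder, since the slides give no cut-offs.

```cpp
// Rule-of-thumb distilled from the analysis above (the 0.1 threshold
// is an illustrative placeholder; the slides give no numeric cut-offs).
#include <cstddef>

enum class Technique { FullReplication, CacheSensitiveLocking };

Technique choose(double update_fraction,   // share of work that updates robj
                 std::size_t robj_bytes,   // reduction-object size
                 std::size_t threads,
                 std::size_t mem_budget_bytes) {
    // Replication multiplies the object's footprint by the thread count;
    // fall back to locking when the copies no longer fit in memory.
    if (robj_bytes * threads > mem_budget_bytes)
        return Technique::CacheSensitiveLocking;
    // With frequent fine-grained updates and little computation between
    // them (as in Apriori), lock overhead dominates, so replication wins.
    return update_fraction > 0.1 ? Technique::FullReplication
                                 : Technique::CacheSensitiveLocking;
}
```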

Page 27:

Related Work

• Google MapReduce
• Yahoo Hadoop
• Phoenix – Stanford University
• SALSA – Indiana University

Page 28:

Conclusion

• Replication and locking schemes can each outperform the other

• Locking schemes have huge overhead when there is little computation between updates to the reduction object

• MPI processes compete well up to 4 threads, but experience communication overheads with 8 threads

• Performance analysis shows that an application's memory needs and update fraction are significant factors for scalability

Page 29:

Thank you! Questions?