Performance Issues in Parallelizing Data-Intensive Applications on a Multi-core Cluster
Vignesh Ravi and Gagan Agrawal
{raviv,agrawal}@cse.ohio-state.edu
OUTLINE
• Motivation
• FREERIDE Middleware
• Generalized Reduction structure
• Shared Memory Parallelization techniques
• Scalability results - Kmeans, Apriori & EM
• Performance Analysis results
• Related work & Conclusion
Motivation
• Availability of huge amounts of data – Data-intensive applications
• Advent of multi-core
• Need for abstractions and parallel programming systems
• The best Shared Memory Parallelization (SMP) technique is still not clear
Context: FREERIDE
• A middleware for parallelizing data-intensive applications
• Motivated by difficulties in implementing parallel data mining applications
• Provides high-level APIs for easier parallel programming
• Based on the observation that many data mining and scientific applications share a similar generalized reduction structure
FREERIDE – Core
• Reduction Object – A shared data structure where results from processed data instances are stored
Types of Reduction
• Local Reduction – Reduction within a single node
• Global Reduction – Reduction among a cluster of nodes
Generalized Reduction structure
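The figure for this slide is not part of the transcript. As a rough sketch of how a reduction object, local reduction and global reduction fit together, the Python below folds data chunks into per-node reduction objects and then combines them. The function names, the `process` step and the toy data are all hypothetical and are not FREERIDE's API.

```python
def process(element, num_bins):
    """Hypothetical per-element work: map a data instance to (index, value)."""
    return element % num_bins, 1

def local_reduction(chunk, num_bins):
    """Reduction within one node: fold a chunk of data into a reduction object."""
    robj = [0] * num_bins                  # the reduction object
    for e in chunk:
        i, val = process(e, num_bins)
        robj[i] += val                     # update must be associative and commutative
    return robj

def global_reduction(robjs):
    """Reduction across nodes: combine per-node reduction objects element-wise."""
    return [sum(col) for col in zip(*robjs)]

data = list(range(20))
chunks = [data[0:10], data[10:20]]         # one chunk per simulated node
result = global_reduction([local_reduction(c, 4) for c in chunks])
```

Because the update is associative and commutative, the order in which data instances are processed does not change the final reduction object, which is what lets the work be split across threads and nodes.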
Parallelization Challenges
• Reduction object cannot be statically partitioned between threads/nodes
  – Data races should be handled at runtime
• Size of reduction object could be large
  – Replication can cause memory overhead
• Updates to the reduction object are fine-grained
  – Locking schemes can cause significant overhead
Techniques in FREERIDE
• Full-replication (f-r)
• Locking-based techniques
  – Full-locking (f-l)
  – Optimized full-locking (o-f-l)
  – Cache-sensitive locking (cs-l)
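To make the difference between the two families concrete, here is a minimal Python sketch. It assumes a small array-shaped reduction object and a made-up workload (`updates_for` is hypothetical); it illustrates the general ideas of replication and per-element locking, not FREERIDE's implementation.

```python
import threading

NUM_THREADS = 4
NUM_ELEMS = 8

def updates_for(tid):
    """Hypothetical workload: each thread emits 1000 (index, value) updates."""
    return [((tid + k) % NUM_ELEMS, 1) for k in range(1000)]

def run(work):
    threads = [threading.Thread(target=work, args=(t,)) for t in range(NUM_THREADS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

def full_replication():
    # f-r: each thread updates a private copy without locks; copies are merged at the end.
    copies = [[0] * NUM_ELEMS for _ in range(NUM_THREADS)]
    def work(tid):
        for i, v in updates_for(tid):
            copies[tid][i] += v
    run(work)
    return [sum(col) for col in zip(*copies)]   # merge step

def full_locking():
    # f-l: a single shared copy, with one lock per reduction element.
    shared = [0] * NUM_ELEMS
    locks = [threading.Lock() for _ in range(NUM_ELEMS)]
    def work(tid):
        for i, v in updates_for(tid):
            with locks[i]:
                shared[i] += v
    run(work)
    return shared
```

Both functions produce the same reduction object. Replication pays with memory (one copy per thread) and a merge step, while locking pays with a lock acquire and release on every update.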
Memory Layout of locking schemes
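The layout figure is not included in this transcript. The sketch below computes byte offsets under one plausible reading of the three locking schemes: full locking keeps locks in a separate array, optimized full locking places each lock next to its element, and cache-sensitive locking shares one lock among the elements of a cache line. The 64-byte cache line and the 8-byte lock and element sizes are assumptions chosen for illustration.

```python
CACHE_LINE = 64   # bytes per cache line (assumed)
ELEM = 8          # bytes per reduction element (assumed)
LOCK = 8          # bytes per lock (assumed)

def full_locking(i, n):
    """f-l: n elements followed by a separate array of n locks."""
    return {"elem": i * ELEM, "lock": n * ELEM + i * LOCK}

def optimized_full_locking(i):
    """o-f-l: each lock is stored immediately before its own element."""
    base = i * (LOCK + ELEM)
    return {"lock": base, "elem": base + LOCK}

def cache_sensitive_locking(i):
    """cs-l: one lock per cache line, shared by the elements in that line."""
    per_line = (CACHE_LINE - LOCK) // ELEM       # 7 elements per line here
    line, slot = divmod(i, per_line)
    base = line * CACHE_LINE
    return {"lock": base, "elem": base + LOCK + slot * ELEM}

def line_of(offset):
    """Cache line index that a byte offset falls into."""
    return offset // CACHE_LINE
```

Under these assumptions, an update in f-l typically touches two cache lines (lock and element), an update in o-f-l touches one, and cs-l touches one while also spending less memory on locks.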
Applications Implemented on FREERIDE
• Apriori (Association mining)
• Kmeans (Clustering based)
• Expectation Maximization (E-M) (clustering based)
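As an illustration of how one of these applications maps onto a reduction object, the sketch below writes a single k-means pass as a local reduction: the reduction object holds, per cluster, the coordinate sums and a point count. The function names and the four sample points are hypothetical.

```python
def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centroids[c])))

def kmeans_local_reduction(points, centroids):
    """One pass over the data: accumulate per-cluster coordinate sums and counts."""
    dims = len(centroids[0])
    robj = [[0.0] * dims + [0] for _ in centroids]   # [sum_x, sum_y, sum_z, count]
    for pt in points:
        c = nearest(pt, centroids)
        for d in range(dims):
            robj[c][d] += pt[d]
        robj[c][dims] += 1
    return robj

def new_centroids(robj, old):
    """Turn the reduction object into updated centroids (keep the old one if empty)."""
    out = []
    for row, prev in zip(robj, old):
        *sums, count = row
        out.append(tuple(s / count for s in sums) if count else prev)
    return out

points = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (10.0, 10.0, 10.0), (12.0, 10.0, 10.0)]
centroids = [(1.0, 1.0, 1.0), (9.0, 9.0, 9.0)]
robj = kmeans_local_reduction(points, centroids)
updated = new_centroids(robj, centroids)
```

Every point triggers one small update to the reduction object, which is the fine-grained update pattern that makes the choice between replication and locking matter.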
Goals in Experimental Study
• Scalability of data-intensive applications on multi-core
• Comparison of different shared memory parallelization (SMP) techniques and MPI
• Performance analysis of SMP techniques
Experimental setup
Each node in the cluster has:
• Intel Xeon E5345 CPUs
• Two quad-core processors
• Each core 2.33 GHz
• 6 GB main memory
Nodes in the cluster are connected by InfiniBand
Experiments
Two sets of experiments:
• Comparison of scalability results for f-r, cs-l, o-f-l and MPI with k-means, Apriori and E-M
  – Single node
  – Cluster of nodes
• Performance analysis results with k-means, Apriori and E-M
Applications data setup
• Apriori
  – Dataset size: 900 MB
  – Support = 3%, Confidence = 9%
• K-means
  – Dataset size: 6.4 GB
  – 3-dimensional points
  – No. of clusters: 250
• E-M
  – Dataset size: 6.4 GB
  – 3-dimensional points
  – No. of clusters: 60
Apriori (Single node)
Apriori (cluster)
k-means (single node)
K-means (cluster)
E-M (Single node)
E-M (cluster)
Performance Analysis of SMP techniques
• Given an application, can we predict the factors that determine the best SMP technique?
• Why do locking techniques suffer with Apriori but compete well with other applications?
• What factors limit the overall scalability of data-intensive applications?
Performance Analysis setup
• Valgrind used for dynamic binary analysis
• Cachegrind used for the analysis of cache utilization
Performance Analysis
Locking vs Merge Overhead
Performance Analysis (contd.)
Relative L2 misses for reduction object
Performance Analysis (contd.)
Total program read/write misses
Analysis
• Important trade-off
  – Memory needs of application
  – Frequency of updating reduction object
• E-M is compute and memory intensive
  – Locking overhead is very low
  – Replication overhead is high
• Apriori has a high update fraction and very little computation
  – Locking overhead is extremely high
  – Replication performs the best
Related Work
• Google MapReduce
• Yahoo Hadoop
• Phoenix – Stanford University
• SALSA – Indiana University
Conclusion
• Neither replication nor locking is uniformly better; each can outperform the other depending on the application
• Locking schemes have huge overhead when there is little computation between updates to the reduction object
• MPI processes compete well up to 4 threads, but experience communication overheads with 8 threads
• Performance analysis shows that the memory needs of an application and its update fraction are significant factors for scalability
Thank you! Questions?