inspector-executor load balancing algorithms for block-sparse tensor contractions
DESCRIPTION
Inspector-Executor Load Balancing Algorithms for Block-Sparse Tensor Contractions. David Ozog *, Jeff R. Hammond † , James Dinan † , Pavan Balaji † , Sameer Shende *, Allen Malony * *University of Oregon † Argonne National Laboratory - PowerPoint PPT PresentationTRANSCRIPT
![Page 1: Inspector-Executor Load Balancing Algorithms for Block-Sparse Tensor Contractions](https://reader035.vdocuments.net/reader035/viewer/2022062315/5681649e550346895dd68355/html5/thumbnails/1.jpg)
Inspector-Executor Load Balancing Algorithms for Block-Sparse
Tensor Contractions
David Ozog*, Jeff R. Hammond†, James Dinan†, Pavan Balaji†, Sameer Shende*, Allen Malony*
*University of Oregon †Argonne National Laboratory
2013 International Conference on Parallel Processing (ICPP)October 2, 2013
![Page 2: Inspector-Executor Load Balancing Algorithms for Block-Sparse Tensor Contractions](https://reader035.vdocuments.net/reader035/viewer/2022062315/5681649e550346895dd68355/html5/thumbnails/2.jpg)
Outline
1. NWChem, Coupled Cluster, Tensor Contraction Engine2. Load Balance Challenges3. Dynamic Load Balancing with Global Arrays (GA)4. Nxtval Performance Experiments5. Inspector/Executor Design6. Performance Modeling (DGEMM and TCE Sort)7. Largest Processing Time (LPT) Algorithm8. Dynamic Buckets – Design and Implementation9. Results10. Conclusions 11. Future Work
![Page 3: Inspector-Executor Load Balancing Algorithms for Block-Sparse Tensor Contractions](https://reader035.vdocuments.net/reader035/viewer/2022062315/5681649e550346895dd68355/html5/thumbnails/3.jpg)
NWChem and Coupled ClusterNWChem:• Wide range of methods, accuracies, and
supported supercomputer architectures• Well-known for its support of many
quantum mechanical methods on massively parallel systems.
• Built on top of Global Arrays (GA) / ARMCI
Coupled Cluster (CC):• Ab initio - i.e., Highly accurate• Solves an approximate Schrödinger
Equation• Accuracy hierarchy: CCSD < CCSD(T) < CCSDT < CCSDT(Q) < CCSDTQ• The respective computational costs:
• And respective storage costs:
)()()()()( 109876 nOnOnOnOnO
)()()()()( 86644 nOnOnOnOnO
*Photos from nwchem-sw.org
![Page 4: Inspector-Executor Load Balancing Algorithms for Block-Sparse Tensor Contractions](https://reader035.vdocuments.net/reader035/viewer/2022062315/5681649e550346895dd68355/html5/thumbnails/4.jpg)
NWChem and Coupled Cluster
*Diagram from GA tutorial (ACTS 2009)
Global Address Space
Distributed Memory Spaces
NWChem:• Wide range of methods, accuracies, and
supported supercomputer architectures• Well-known for its support of many
quantum mechanical methods on massively parallel systems.
• Built on top of Global Arrays (GA) / ARMCI
Coupled Cluster (CC):• Ab initio - i.e., Highly accurate• Solves an approximate Schrödinger
Equation• Accuracy hierarchy: CCSD < CCSD(T) < CCSDT < CCSDT(Q) < CCSDTQ• The respective computational costs:
• And respective storage costs:
)()()()()( 109876 nOnOnOnOnO
)()()()()( 86644 nOnOnOnOnO
![Page 5: Inspector-Executor Load Balancing Algorithms for Block-Sparse Tensor Contractions](https://reader035.vdocuments.net/reader035/viewer/2022062315/5681649e550346895dd68355/html5/thumbnails/5.jpg)
DGEMM Tasks - Load Imbalance
• In CCSX (X=D,T,Q), 1 tensor contraction contains between 1 hundred and 1 million DGEMMs
• MFLOPs per task depend on: • number of atoms• Spin and spatial
symmetry • Accuracy of chosen basis• The tile size
![Page 6: Inspector-Executor Load Balancing Algorithms for Block-Sparse Tensor Contractions](https://reader035.vdocuments.net/reader035/viewer/2022062315/5681649e550346895dd68355/html5/thumbnails/6.jpg)
Computational Challenges
Benzene Water Clusters Macro-Molecules
Highly symmetric Asymmetric QM/MM
• Load balance is crucially important for performance• Obtaining optimal load balance is an NP-Hard problem.
*Photos from nwchem-sw.org
![Page 7: Inspector-Executor Load Balancing Algorithms for Block-Sparse Tensor Contractions](https://reader035.vdocuments.net/reader035/viewer/2022062315/5681649e550346895dd68355/html5/thumbnails/7.jpg)
GA Dynamic Load Balancing Template
![Page 8: Inspector-Executor Load Balancing Algorithms for Block-Sparse Tensor Contractions](https://reader035.vdocuments.net/reader035/viewer/2022062315/5681649e550346895dd68355/html5/thumbnails/8.jpg)
GA Dynamic Load Balancing Template
![Page 9: Inspector-Executor Load Balancing Algorithms for Block-Sparse Tensor Contractions](https://reader035.vdocuments.net/reader035/viewer/2022062315/5681649e550346895dd68355/html5/thumbnails/9.jpg)
GA Dynamic Load Balancing Template
![Page 10: Inspector-Executor Load Balancing Algorithms for Block-Sparse Tensor Contractions](https://reader035.vdocuments.net/reader035/viewer/2022062315/5681649e550346895dd68355/html5/thumbnails/10.jpg)
GA Dynamic Load Balancing Template
![Page 11: Inspector-Executor Load Balancing Algorithms for Block-Sparse Tensor Contractions](https://reader035.vdocuments.net/reader035/viewer/2022062315/5681649e550346895dd68355/html5/thumbnails/11.jpg)
GA Dynamic Load Balancing Template
![Page 12: Inspector-Executor Load Balancing Algorithms for Block-Sparse Tensor Contractions](https://reader035.vdocuments.net/reader035/viewer/2022062315/5681649e550346895dd68355/html5/thumbnails/12.jpg)
GA Dynamic Load Balancing Template
![Page 13: Inspector-Executor Load Balancing Algorithms for Block-Sparse Tensor Contractions](https://reader035.vdocuments.net/reader035/viewer/2022062315/5681649e550346895dd68355/html5/thumbnails/13.jpg)
GA Dynamic Load Balancing Template
![Page 14: Inspector-Executor Load Balancing Algorithms for Block-Sparse Tensor Contractions](https://reader035.vdocuments.net/reader035/viewer/2022062315/5681649e550346895dd68355/html5/thumbnails/14.jpg)
GA Dynamic Load Balancing Template
![Page 15: Inspector-Executor Load Balancing Algorithms for Block-Sparse Tensor Contractions](https://reader035.vdocuments.net/reader035/viewer/2022062315/5681649e550346895dd68355/html5/thumbnails/15.jpg)
GA Dynamic Load Balancing Template
Works best when:
• On a single node (in SysV shared memory)
• Time spent in FOO(a) is huge
• On high-speed interconnects
• Number of simultaneous calls is reasonably small (less than 1,000).
![Page 16: Inspector-Executor Load Balancing Algorithms for Block-Sparse Tensor Contractions](https://reader035.vdocuments.net/reader035/viewer/2022062315/5681649e550346895dd68355/html5/thumbnails/16.jpg)
Nxtval - Performance ExperimentsTAU Profiling• 14 water molecules, aug-cc-PVDZ
• 123 nodes, 8 ppn
• Nxtval consumes a large percentage of the execution time.
Flooding micro-benchmark
• Proportional time within Nxtval increases with more participating processes.
• When the arrival rate exceeds the processing rate, process hosting the counter must utilize buffer and flow control.
![Page 17: Inspector-Executor Load Balancing Algorithms for Block-Sparse Tensor Contractions](https://reader035.vdocuments.net/reader035/viewer/2022062315/5681649e550346895dd68355/html5/thumbnails/17.jpg)
Nxtval Performance Experiments
Strong Scaling
• 10 water molecules, (aDZ)• 14 water molecules, (aDZ)
• 8 processes per node
• Percentage of overall execution time within Nxtval increases with scaling.
![Page 18: Inspector-Executor Load Balancing Algorithms for Block-Sparse Tensor Contractions](https://reader035.vdocuments.net/reader035/viewer/2022062315/5681649e550346895dd68355/html5/thumbnails/18.jpg)
Inspector/Executor Design1. Inspector
• Calculate memory requirements• Remove null tasks• Collate task-list
2. Task Cost Estimator• Two options:
• Use performance models • Load gettimeofday() measurement from previous iteration(s)
• Deduce performance models off-line
3. Static Partitioner• Partition into N groups where N is the number of MPI
processes• Minimize load balance according to cost estimations• Write task list information for each proc/contraction to
volatile memory
4. Executor• Launch all tasks
![Page 19: Inspector-Executor Load Balancing Algorithms for Block-Sparse Tensor Contractions](https://reader035.vdocuments.net/reader035/viewer/2022062315/5681649e550346895dd68355/html5/thumbnails/19.jpg)
Performance Modeling - DGEMMDGEMM:
• A(m,k), B(k,n), and C(m,n) are 2D matrices• α and β are scalar coefficients
Our Performance Model:
• (mn) dot products of length k• Corresponding (mn) store operations in C• m loads of size k from A• n loads of size k from B• a, b, c, and d are found by solving a nonlinear least squares problem (in Matlab)
![Page 20: Inspector-Executor Load Balancing Algorithms for Block-Sparse Tensor Contractions](https://reader035.vdocuments.net/reader035/viewer/2022062315/5681649e550346895dd68355/html5/thumbnails/20.jpg)
Performance Modeling - DGEMMDGEMM:
• A(m,k), B(k,n), and C(m,n) are 2D matrices• α and β are scalar coefficients
Our Performance Model:
• (mn) dot products of length k• Corresponding (mn) store operations in C• m loads of size k from A• n loads of size k from B• a, b, c, and d are found by solving a nonlinear least squares problem (in Matlab)
![Page 21: Inspector-Executor Load Balancing Algorithms for Block-Sparse Tensor Contractions](https://reader035.vdocuments.net/reader035/viewer/2022062315/5681649e550346895dd68355/html5/thumbnails/21.jpg)
Performance Modeling – TCE “Sort”
Our Performance Model:
• TCE “Sorts” are actually matrix permutations
• 3rd order polynomial fit suffices
• Data always fits in L2 cache for this architecture
• Somewhat noisy measurements, but that’s OK.
(bytes)
![Page 22: Inspector-Executor Load Balancing Algorithms for Block-Sparse Tensor Contractions](https://reader035.vdocuments.net/reader035/viewer/2022062315/5681649e550346895dd68355/html5/thumbnails/22.jpg)
Largest Processing Time (LPT) Algorithm
1. Sort tasks by cost in descending order
2. Assign to least loaded process so far
*SIAM Journal on Applied Mathematics, Vol. 17, No. 2. (Mar., 1969), pp. 416-429.
• Polynomial time algorithm applied to an NP-Hard problem
• Proven “4/3 approximate” by Richard Graham*
![Page 23: Inspector-Executor Load Balancing Algorithms for Block-Sparse Tensor Contractions](https://reader035.vdocuments.net/reader035/viewer/2022062315/5681649e550346895dd68355/html5/thumbnails/23.jpg)
1. Sort tasks by cost in descending order
2. Assign to least loaded process so far
*SIAM Journal on Applied Mathematics, Vol. 17, No. 2. (Mar., 1969), pp. 416-429.
• Polynomial time algorithm applied to an NP-Hard problem
• Proven “4/3 approximate” by Richard Graham*
Largest Processing Time (LPT) Algorithm
![Page 24: Inspector-Executor Load Balancing Algorithms for Block-Sparse Tensor Contractions](https://reader035.vdocuments.net/reader035/viewer/2022062315/5681649e550346895dd68355/html5/thumbnails/24.jpg)
LPT - Binary Min Heap
1. Initialize a heap with N nodes (N = # of procs) each having zero cost.
2. Perform IncreaseMin() operation for each new cost from the sorted list of tasks.
• IncreaseMin() is quite efficient because UpdateRoot() often occurs in O(1) time.
• Far more efficient than the naïve approach of iterating through an array to find the min.
• Execution time for this phase is negligible.
![Page 25: Inspector-Executor Load Balancing Algorithms for Block-Sparse Tensor Contractions](https://reader035.vdocuments.net/reader035/viewer/2022062315/5681649e550346895dd68355/html5/thumbnails/25.jpg)
LPT - Load Balance
(a) Original with Nxtval Measured
(b) Inspector/Executor with Nxtval Measured
(c) LPT – 1st iteration
(d) LPT – subsequent iterations
![Page 26: Inspector-Executor Load Balancing Algorithms for Block-Sparse Tensor Contractions](https://reader035.vdocuments.net/reader035/viewer/2022062315/5681649e550346895dd68355/html5/thumbnails/26.jpg)
Dynamic Buckets Design
![Page 27: Inspector-Executor Load Balancing Algorithms for Block-Sparse Tensor Contractions](https://reader035.vdocuments.net/reader035/viewer/2022062315/5681649e550346895dd68355/html5/thumbnails/27.jpg)
Dynamic Buckets Implementation
![Page 28: Inspector-Executor Load Balancing Algorithms for Block-Sparse Tensor Contractions](https://reader035.vdocuments.net/reader035/viewer/2022062315/5681649e550346895dd68355/html5/thumbnails/28.jpg)
Dynamic Buckets Load Balance
(a) LPT Predicted
(b) LPT Measured
(c) Dynamic Buckets Predicted
d) Dynamic Buckets Measured
![Page 29: Inspector-Executor Load Balancing Algorithms for Block-Sparse Tensor Contractions](https://reader035.vdocuments.net/reader035/viewer/2022062315/5681649e550346895dd68355/html5/thumbnails/29.jpg)
I/E ResultsNitrogen - CCSDT Benzene - CCSD
![Page 30: Inspector-Executor Load Balancing Algorithms for Block-Sparse Tensor Contractions](https://reader035.vdocuments.net/reader035/viewer/2022062315/5681649e550346895dd68355/html5/thumbnails/30.jpg)
10-H2O Cluster Results (DB)CCSD_t2_7_3 CCSD_t2_7
![Page 31: Inspector-Executor Load Balancing Algorithms for Block-Sparse Tensor Contractions](https://reader035.vdocuments.net/reader035/viewer/2022062315/5681649e550346895dd68355/html5/thumbnails/31.jpg)
Conclusions
1. Nxtval can be expensive at large scales2. Static Partitioning can fix the problem, but has
weaknesses:• Requires performance model• Noise degrades results
3. Dynamic Buckets is a viable alternative, and requires few changes to GA applications.
4. Solving load balance issues differs from problem to problem – work needs to be done to pinpoint why and what to do about it.
![Page 32: Inspector-Executor Load Balancing Algorithms for Block-Sparse Tensor Contractions](https://reader035.vdocuments.net/reader035/viewer/2022062315/5681649e550346895dd68355/html5/thumbnails/32.jpg)
Future Work (Research)
1. Cyclops Tensor Framework (CTF)2. DAG Scheduling of tensor contractions3. What happens with accelerators (MIC/GPU)?
1. Performance model2. Balancing load across both CPU and device
4. Comparison with hierarchical distributed load balancing, work stealing, etc.
5. Hypergraph partitioning / data locality