the scalable petascale data- driven approach for the cholesky...
TRANSCRIPT
![Page 1: The Scalable Petascale Data- Driven Approach for the Cholesky …endo/publication/tsujita-espm2... · 2016. 10. 28. · The Scalable Petascale Data- Driven Approach for the Cholesky](https://reader035.vdocuments.net/reader035/viewer/2022071103/5fdd5d9b08b09d599516d51f/html5/thumbnails/1.jpg)
The Scalable Petascale Data-Driven Approach for the Cholesky Factorization
with multiple GPUs
Yuki Tsujita, Toshio Endo, Katsuki FujisawaTokyo Institute of Technology, Japan
ESPM2 2015@Austin Texas, USA
1
![Page 2: The Scalable Petascale Data- Driven Approach for the Cholesky …endo/publication/tsujita-espm2... · 2016. 10. 28. · The Scalable Petascale Data- Driven Approach for the Cholesky](https://reader035.vdocuments.net/reader035/viewer/2022071103/5fdd5d9b08b09d599516d51f/html5/thumbnails/2.jpg)
What is the Cholesky factorization?
• The Cholesky factorization is a factorization of a real symmetric positive-definite matrix into the product of a lower triangular matrix and its transpose
• StatementA=LLT(A∈Rm×m)
• The time complexity of the Cholesky is O(m3)
A LLT
2
![Page 3: The Scalable Petascale Data- Driven Approach for the Cholesky …endo/publication/tsujita-espm2... · 2016. 10. 28. · The Scalable Petascale Data- Driven Approach for the Cholesky](https://reader035.vdocuments.net/reader035/viewer/2022071103/5fdd5d9b08b09d599516d51f/html5/thumbnails/3.jpg)
SDPARA: Our Target application
• Dense Cholesky factorization is the important kernel of SDPARA GPU ver.[Fujisawa et al. 2011]
• SDPARA GPU ver.– Application to solve SDP(SemiDefinite Program)– Offload a part of its calculations to GPU
3
Year n m CHOLESKY(Flops)
2003 630 24,503 78.58 Giga
2010 10,462 76,554 2.414 Tera
2012 1,779,204 1,484,406 0.533 Peta
2014 2,752,649 2,339331 1.713 Peta
Table: Performance record of CHOLESKY of SDPARA
![Page 4: The Scalable Petascale Data- Driven Approach for the Cholesky …endo/publication/tsujita-espm2... · 2016. 10. 28. · The Scalable Petascale Data- Driven Approach for the Cholesky](https://reader035.vdocuments.net/reader035/viewer/2022071103/5fdd5d9b08b09d599516d51f/html5/thumbnails/4.jpg)
Existing approach I:Synchronous Implementation[Fujisawa et al. IPDPS 2014]
• Block Cholesky factorization– The input data is divided into the blocks
• The calculations proceed in each block– The blocks are assigned to the processes by
two dimensional block cyclic division• Processes do calculations of the only assigned data
• Each iteration proceeds synchronously – The data are transferred from CPU to GPU at the
beginning of each iteration– If a process has no task in a certain iteration, it has to
wait for the other processes finishing without doing anything
L0
A11
A21 A22
L0 L21
L11
�A224
![Page 5: The Scalable Petascale Data- Driven Approach for the Cholesky …endo/publication/tsujita-espm2... · 2016. 10. 28. · The Scalable Petascale Data- Driven Approach for the Cholesky](https://reader035.vdocuments.net/reader035/viewer/2022071103/5fdd5d9b08b09d599516d51f/html5/thumbnails/5.jpg)
The DAG of the Cholesky factorization
• Kernels are divided into fine-grained tasks– Basically each task proceeds asynchronously
• PCIe comm. performs only when it needs• Inter-process comm. performs
in Point-to-Point way• We found the performance may decrease
in extremely large scale case
Existing approach II:Data-Driven Implementation[Tsujita&Endo JSSPP 2015]
5
DPOTRF
DTRSM
DSYRK
DGEMM
intra-process dependencyinter-processdependency
proc 0 proc 1 proc 2 proc 3
![Page 6: The Scalable Petascale Data- Driven Approach for the Cholesky …endo/publication/tsujita-espm2... · 2016. 10. 28. · The Scalable Petascale Data- Driven Approach for the Cholesky](https://reader035.vdocuments.net/reader035/viewer/2022071103/5fdd5d9b08b09d599516d51f/html5/thumbnails/6.jpg)
• Large problem size(m>2M)– Use capacity of host memory to put the matrix
data[Tsujita,Endo JSSPP2015]
• High performance(>1.7PFlops)– Use multiple GPUs and reduce PCIe
communication by GPU memory aware scheduling[Tsujita,Endo JSSPP2015]
– Solve the communication bottleneck by introducing the scalable communication
6
Our Target
![Page 7: The Scalable Petascale Data- Driven Approach for the Cholesky …endo/publication/tsujita-espm2... · 2016. 10. 28. · The Scalable Petascale Data- Driven Approach for the Cholesky](https://reader035.vdocuments.net/reader035/viewer/2022071103/5fdd5d9b08b09d599516d51f/html5/thumbnails/7.jpg)
Contribution
7
• Goal– Performance improvement of the multi-node multi-GPU
Cholesky factorization• Approach
– Data-Driven scheduling to reduce data movement(presented@JSSPP2015)
• Scheduling tasks in an application• Task selection to improve GPU memory reusability
– The MPI communication pattern for the scalability improvement
• Achieve the performance of 1.77PFlops with 1360 nodes
![Page 8: The Scalable Petascale Data- Driven Approach for the Cholesky …endo/publication/tsujita-espm2... · 2016. 10. 28. · The Scalable Petascale Data- Driven Approach for the Cholesky](https://reader035.vdocuments.net/reader035/viewer/2022071103/5fdd5d9b08b09d599516d51f/html5/thumbnails/8.jpg)
Existing method Ⅰ(Synchronous)
Existing method Ⅱ(Data-Driven) Proposed method
Data driven × ✓ ✓
PCIe Commreducing
×(Naïve)
✓(Swap)
✓(Swap)
MPI CommScalability
✓(Group)
×(Point-to-point)
✓(Scalable
communication)
Overlap of calculations &
communications✓ ✓ ✓
8
Implementation overview
![Page 9: The Scalable Petascale Data- Driven Approach for the Cholesky …endo/publication/tsujita-espm2... · 2016. 10. 28. · The Scalable Petascale Data- Driven Approach for the Cholesky](https://reader035.vdocuments.net/reader035/viewer/2022071103/5fdd5d9b08b09d599516d51f/html5/thumbnails/9.jpg)
• GPU memory-aware scheduling– Task selection considering the reusability of GPU
memory
• Point-to-Point asynchronous MPI communication
• GPU memory management by swapping– select an unnecessary data as a victim
9
Our Basic Data-Driven Implementation
(Existing Approach II)
![Page 10: The Scalable Petascale Data- Driven Approach for the Cholesky …endo/publication/tsujita-espm2... · 2016. 10. 28. · The Scalable Petascale Data- Driven Approach for the Cholesky](https://reader035.vdocuments.net/reader035/viewer/2022071103/5fdd5d9b08b09d599516d51f/html5/thumbnails/10.jpg)
Worker thread & Ignition thread
• MPI process has several worker thread and one ignition thread
• Worker– Executes tasks– Process has two or three worker per one GPU in order to
achieve overlapping of calculation, PCIe and MPI simply
• Ignition – checks arrival of notice messages from other processes– handles data request
• All threads in a process shares a single task queue10
![Page 11: The Scalable Petascale Data- Driven Approach for the Cholesky …endo/publication/tsujita-espm2... · 2016. 10. 28. · The Scalable Petascale Data- Driven Approach for the Cholesky](https://reader035.vdocuments.net/reader035/viewer/2022071103/5fdd5d9b08b09d599516d51f/html5/thumbnails/11.jpg)
Task Execution
11
MPI Process 1
Task A
Receive data request
Send data
MPI Process 2
Receive data
cudaMemcpy
Execute on GPU
Send notice of task end
Firing
Send data request
worker2
Firing
ignition2
Task B
data
request
Noticeworker1 ignition1
Task C
MPI Process 3
NoticeFiring
ignition3 worker3
T1
T1
T2
![Page 12: The Scalable Petascale Data- Driven Approach for the Cholesky …endo/publication/tsujita-espm2... · 2016. 10. 28. · The Scalable Petascale Data- Driven Approach for the Cholesky](https://reader035.vdocuments.net/reader035/viewer/2022071103/5fdd5d9b08b09d599516d51f/html5/thumbnails/12.jpg)
12
The pitfall of Data-Driven
0
0.5
1
1.5
2
2.5
3
3.5
0 5 10 15 20
Spee
d (T
Flop
s)
Number of Nodes
Synchronous Data-Driven
0
100
200
300
400
500
600
0 100 200 300 400 500
Spee
d (T
Flop
s)
Number of NodesSYNC (QAP5) SYNC (QAP6) SYNC (QAP7)
D2 (QAP5) D2 (QAP6) D2 (QAP7)
As problem size or number of the nodes increases,
the performance decreases in data-driven execution
By Data-Driven implementation,we get better performance
The suspected bottleneck is aconcentration of MPI
communication
Not only does our approach suffer from this problem !
![Page 13: The Scalable Petascale Data- Driven Approach for the Cholesky …endo/publication/tsujita-espm2... · 2016. 10. 28. · The Scalable Petascale Data- Driven Approach for the Cholesky](https://reader035.vdocuments.net/reader035/viewer/2022071103/5fdd5d9b08b09d599516d51f/html5/thumbnails/13.jpg)
Synchronous implementation uses MPI_Bcast for data transfer
But in Data-Driven implementation•Each task runs asynchronously -> MPI_Bcast, MPI_Ibcast: וWhen many processes request the same tile, Point-to-Point communication is executed 2√P times
The existing data-driven shows less performance in high parallel situation
13
The pitfall of Data-Driven
For scalable data transfer, we createa broadcast tree structure dynamically
・・・
2√P
T
![Page 14: The Scalable Petascale Data- Driven Approach for the Cholesky …endo/publication/tsujita-espm2... · 2016. 10. 28. · The Scalable Petascale Data- Driven Approach for the Cholesky](https://reader035.vdocuments.net/reader035/viewer/2022071103/5fdd5d9b08b09d599516d51f/html5/thumbnails/14.jpg)
• Presupposition– Data send is occurred only when a process receive
requests from other processes– The order of data requests is unsettle
• For scalable data transfer, We make CSlist(Client-Server list)
– one CSlist for one tile– CSlist has clients and corresponding servers– When a process receives requests, checks CSlist・Server: send data ・Others: forward to its server
– When a process sends data, forces a part of its clients on requestor 14
Scalable Communication
Tile A
C 2 3 4 5S 1 1 1 1
CSlist
![Page 15: The Scalable Petascale Data- Driven Approach for the Cholesky …endo/publication/tsujita-espm2... · 2016. 10. 28. · The Scalable Petascale Data- Driven Approach for the Cholesky](https://reader035.vdocuments.net/reader035/viewer/2022071103/5fdd5d9b08b09d599516d51f/html5/thumbnails/15.jpg)
15
Scalable Communication
Tile A
Process 1Process 2
Process 3 Process 4
Process 5
1. request
C 2 3 4 5
S 1 1 1 1
CSlist
![Page 16: The Scalable Petascale Data- Driven Approach for the Cholesky …endo/publication/tsujita-espm2... · 2016. 10. 28. · The Scalable Petascale Data- Driven Approach for the Cholesky](https://reader035.vdocuments.net/reader035/viewer/2022071103/5fdd5d9b08b09d599516d51f/html5/thumbnails/16.jpg)
16
Scalable Communication
C 2 3 4 5
S 3 - 1 1
Tile A
Process 1Process 2
Process 3 Process 4
Process 5
Tile A
C 2 3
S 3 -
2. data send
CSlist
![Page 17: The Scalable Petascale Data- Driven Approach for the Cholesky …endo/publication/tsujita-espm2... · 2016. 10. 28. · The Scalable Petascale Data- Driven Approach for the Cholesky](https://reader035.vdocuments.net/reader035/viewer/2022071103/5fdd5d9b08b09d599516d51f/html5/thumbnails/17.jpg)
17
Scalable Communication
C 2 3 4 5
S 3 - 1 1
Tile A
Process 1Process 2
Process 3 Process 4
Process 5
Tile A
C 2 3
S 3 -
3. data send
1. request CSlist
2. request(forward)
![Page 18: The Scalable Petascale Data- Driven Approach for the Cholesky …endo/publication/tsujita-espm2... · 2016. 10. 28. · The Scalable Petascale Data- Driven Approach for the Cholesky](https://reader035.vdocuments.net/reader035/viewer/2022071103/5fdd5d9b08b09d599516d51f/html5/thumbnails/18.jpg)
18
Scalable termination detectionA process cannot exit even if all of its tasks has been finished
→ Process may still receive requests for its owned data from other running processes
The detection of process’s termination becomes difficult !
we solve this by using CSlistCSlist shows “which client has been requested this tile, or not yet”So there is no further request message, when CSlists for all local tiles become emptyBy using CSlist we can detect process’s termination without especial communications
Tile A
Process 1
Process 2
Process 3Process 4
Process 5
C 2 3 4 5
S - - - -If all servers in the CSlist
become emptyits tile has been sent to all
processes that need itC 2 3
S - -
Tile A
![Page 19: The Scalable Petascale Data- Driven Approach for the Cholesky …endo/publication/tsujita-espm2... · 2016. 10. 28. · The Scalable Petascale Data- Driven Approach for the Cholesky](https://reader035.vdocuments.net/reader035/viewer/2022071103/5fdd5d9b08b09d599516d51f/html5/thumbnails/19.jpg)
Experiment Conditions• We use 1360 nodes of TSUBAME2.5
19
node architecture of TSUBAME 2.5
CPU Intel Xeon 2.93 GHz (6 cores) x 2
CPU memory 54GiB
GPU NVIDIA Tesla K20X × 3
GPU memory 6GiB
• Three MPI processes per a node• One GPU per a MPI process(3 GPU/node)• Tile Size:2,048 x 2,048• GPU memory:5,000MiB per a GPU• NVIDIA CUDA 7.0 and CUBLAS 7.0
![Page 20: The Scalable Petascale Data- Driven Approach for the Cholesky …endo/publication/tsujita-espm2... · 2016. 10. 28. · The Scalable Petascale Data- Driven Approach for the Cholesky](https://reader035.vdocuments.net/reader035/viewer/2022071103/5fdd5d9b08b09d599516d51f/html5/thumbnails/20.jpg)
Performance Evaluation
• Compared Implementations
20
• Evaluation– Scalability evaluation– Extremely Large Scale
• Problem sizeQAP5: m=379,350 QAP6: m=709,275QAP7: m=1,218,400 QAP9: m=1,962,225
Existing approach Ⅰ(Synchronous: SYNC)
Existing approach Ⅱ(Data-Driven:DD)
Proposed method(Proposal)
PCI CommReducing × ✓ ✓
MPI CommScalability
✓(Group)
×(Point-to-point)
✓(scalable
communication)
![Page 21: The Scalable Petascale Data- Driven Approach for the Cholesky …endo/publication/tsujita-espm2... · 2016. 10. 28. · The Scalable Petascale Data- Driven Approach for the Cholesky](https://reader035.vdocuments.net/reader035/viewer/2022071103/5fdd5d9b08b09d599516d51f/html5/thumbnails/21.jpg)
0
100
200
300
400
500
600
700
800
0 100 200 300 400 500
Spee
d (T
Flop
s)
Number of NodesSYNC (QAP5) DD (QAP5) Proposal (QAP5)
SYNC (QAP6) DD (QAP6) Proposal (QAP6)
SYNC (QAP7) DD (QAP7) Proposal (QAP7)
Conduct scalability evaluation on TSUBAME2.5 using until 400 nodes(3 GPUs per a node)
21
Scalability Evaluation
By Data-Driven + Tree Comm. 37% performance improvement695 TFlops on 400 nodes with 1200 GPUs
Without Tree Comm. Performance largely decrease than SYNC(communication bottleneck)
![Page 22: The Scalable Petascale Data- Driven Approach for the Cholesky …endo/publication/tsujita-espm2... · 2016. 10. 28. · The Scalable Petascale Data- Driven Approach for the Cholesky](https://reader035.vdocuments.net/reader035/viewer/2022071103/5fdd5d9b08b09d599516d51f/html5/thumbnails/22.jpg)
22
Extremely Large Scale• Conduct scalability evaluation on from 400 nodes to 1360
nodes (3GPUs per a node)
0200400600800
100012001400160018002000
0 500 1000 1500
Spee
d (T
Flop
s)
Number of NodesSYNC (QAP7) Proposal (QAP7)SYNC (QAP9) Proposal (QAP9)
1.775PFlopson 1360 nodes with 4080 GPU by our approach
![Page 23: The Scalable Petascale Data- Driven Approach for the Cholesky …endo/publication/tsujita-espm2... · 2016. 10. 28. · The Scalable Petascale Data- Driven Approach for the Cholesky](https://reader035.vdocuments.net/reader035/viewer/2022071103/5fdd5d9b08b09d599516d51f/html5/thumbnails/23.jpg)
Related work• StarPU: a unified platform for task scheduling on heterogeneous multicore
Architectures[Cédric Augonnet et al.]– A DAG scheduling framework for heterogeneous environments– Allows for each task to run either on CPUs or GPUs according to the resource
utilization in order to improve the performance– But StarPU does not have scalability improvement techniques as our approach
• DAGuE: A generic distributed DAG engine for high performance computing[George Bosilca et al.]– DAG(Direct Acyclic Graph) scheduler for distributed environments with GPUs– The Cholesky factorization is one of their target application– But it is not clear how DAGuE treats memory objects when GPU memory is full
23
![Page 24: The Scalable Petascale Data- Driven Approach for the Cholesky …endo/publication/tsujita-espm2... · 2016. 10. 28. · The Scalable Petascale Data- Driven Approach for the Cholesky](https://reader035.vdocuments.net/reader035/viewer/2022071103/5fdd5d9b08b09d599516d51f/html5/thumbnails/24.jpg)
• The solution of the scalability issues in typical data driven implementation by introducing scalable data transfer method and termination detection method
• Compared with the synchronous implementation, 37% performance improvement on 400 nodes and 1,200 GPUs of TSUBAME2.5 supercomputer
• Achieved 1.775PFlops on 1360 nodes and 4080 GPUs of TSUBAME2.5 supercomputer
24
Conclusion
![Page 25: The Scalable Petascale Data- Driven Approach for the Cholesky …endo/publication/tsujita-espm2... · 2016. 10. 28. · The Scalable Petascale Data- Driven Approach for the Cholesky](https://reader035.vdocuments.net/reader035/viewer/2022071103/5fdd5d9b08b09d599516d51f/html5/thumbnails/25.jpg)
• Use both CPU & GPU for kernel calculations• Comparative experiments with related works• Construct the ideal task selection model and
conduct comparative experiments with it• Apply our approach to other applications
25
Future Work