

Real World HPC Systems for Big Data/AI Research (2019/06/16)

Efficient 2D Convolution on CUDA-enabled GPUs [2]

[1] Shweta Salaria, Aleksandr Drozd, Artur Podobas, Satoshi Matsuoka. Learning Neural Representations for Predicting GPU Performance. ISC'19.
[2] Peng Chen, Mohamed Wahib, Shin'ichiro Takizawa, Ryousei Takano, Satoshi Matsuoka. Efficient Algorithms for the Summed Area Tables Primitive on GPUs. IEEE CLUSTER'18.
[3] Haoyu Zhang, Mohamed Wahib, Satoshi Matsuoka. Can Local Binary Convolutions Make Neural Networks Models Smaller? Submitted to ICPP'19 (Poster).
[4] Mateusz Bysiek, Aleksandr Drozd, Satoshi Matsuoka. Migrating Legacy Fortran to Python While Retaining Fortran-level Performance Through Transpilation and Type Hints. 6th Workshop on Python for High-Performance and Scientific Computing, co-held with SC16.

[Figure: 2D convolution within a CUDA warp (Thread 0 … Thread 31). An M×N filter (weights w_0, w_1, w_2, …) slides over input values v held in a 32×C register matrix. Three steps: (1) register cache (32×C); (2) each thread computes partial sums with the recurrence sum_i ← v_i × s_i + sum_{i-1}; (3) partial sums are transferred between threads to assemble the convolution results e_0 … e_31.]

Problem
• 2D convolution computation on CUDA-enabled GPUs

Proposal
(1) Use registers as a cache
• Lower latency, higher throughput
• High data reuse
(2) Compute partial sums in parallel
(3) Transfer partial sums in parallel
• No stalling within a thread
• Efficient communication between threads via warp shuffle
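The three steps above can be modeled in plain Python (a sketch only — the actual implementation is a CUDA kernel, where step (3) uses warp shuffle instructions; the function name and 1D simplification are illustrative assumptions):

```python
# Model of the warp-level partial-sum scheme on one row of the input.
# Each "thread" t caches its input element in a register; outputs are
# built with the recurrence sum_i = v_i * s_i + sum_{i-1}.
WARP_SIZE = 32

def warp_convolve(row, weights):
    """Sliding-window convolution across a simulated 32-thread warp."""
    k = len(weights)
    out = []
    for t in range(WARP_SIZE - k + 1):   # one output per thread
        s = 0.0                          # sum_{-1} = 0
        for i in range(k):               # step (2): partial sums
            # step (3): row[t + i] would arrive from thread t+i via shuffle
            s = weights[i] * row[t + i] + s
        out.append(s)
    return out

row = list(range(32))
print(warp_convolve(row, [1.0, 1.0, 1.0])[:3])   # → [3.0, 6.0, 9.0]
```

Because every operand lives in registers and the recurrence reuses the previous partial sum, no thread waits on shared or global memory inside the loop.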

Can Local Binary Convolutions Make Neural Networks Models Smaller? [3]

Background
• Local Binary Convolutional Neural Network (LBCNN) models use Local Binary Convolution (LBC) layers
• (+) fewer learned parameters
• (−) lower accuracy

Problem
• Not applicable to large models: too slow, too much accuracy loss

Proposed Method
• Replace only half of the normal convolution layers with LBC

Results
[Figures: accuracy and speed of the different models; results for different numbers of difference maps generated by the LBC filters.]
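An LBC layer saves parameters by freezing the convolution filters as random binary (±1/0) weights and learning only a small 1×1 linear combination of the resulting "difference maps". A minimal 1D NumPy sketch (shapes, sparsity, and the function name are illustrative assumptions; the real layers are 2D and interleave a nonlinearity):

```python
import numpy as np

rng = np.random.default_rng(0)

K, M = 3, 4                      # filter width, number of difference maps
# Fixed binary anchor filters: generated once, never trained.
anchors = rng.choice([-1.0, 0.0, 1.0], size=(M, K))
# The only learned parameters: one 1x1 weight per difference map.
combine = rng.normal(size=M)

def lbc_1d(signal):
    """Apply the fixed binary filters, then mix the difference maps
    with the learnable 1x1 combination."""
    n = len(signal) - K + 1
    maps = np.empty((M, n))
    for m in range(M):
        for t in range(n):
            maps[m, t] = anchors[m] @ signal[t:t + K]  # binary convolution
    return combine @ maps        # learnable linear combination

out = lbc_1d(np.arange(8.0))
print(out.shape)                 # one mixed map of length 6
```

Only M weights are trained here instead of M×K, which is where the parameter savings (and the accuracy loss the slide mentions) come from.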

Learning Neural Representations for Predicting GPU Performance [1]

Problem
• Modeling the performance of applications across systems with different GPU microarchitectures

Proposal
• A collaborative-filtering-based matrix factorization (MF) model that automatically learns latent features describing the performance of applications on systems
• A multi-layer perceptron (MLP) that models complex non-linear interactions between applications and systems
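The MLP side of the proposal can be sketched in NumPy: look up an application embedding and a system embedding, concatenate them, and pass the result through ReLU layers to a scalar score. All dimensions and the random initialisation below are illustrative assumptions, not the paper's values:

```python
import numpy as np

rng = np.random.default_rng(0)

N_APPS, N_SYSTEMS, DIM = 10, 4, 8            # hypothetical sizes
app_emb = rng.normal(size=(N_APPS, DIM))     # application embeddings
sys_emb = rng.normal(size=(N_SYSTEMS, DIM))  # system embeddings

def predict_score(app_id, sys_id, weights):
    """Concatenate the two latent vectors, apply ReLU hidden layers,
    and emit one predicted score (e.g. normalized IPS)."""
    x = np.concatenate([app_emb[app_id], sys_emb[sys_id]])
    for W in weights[:-1]:
        x = np.maximum(W @ x, 0.0)           # ReLU hidden layers
    return float(weights[-1] @ x)            # linear output layer

# MLP-2: two hidden layers, then a scalar output.
weights = [rng.normal(size=(16, 2 * DIM)),
           rng.normal(size=(8, 16)),
           rng.normal(size=(1, 8))]
print(predict_score(3, 1, weights))
```

In the actual model the embeddings and layer weights are trained jointly against measured performance; here they are random placeholders showing only the data flow.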

[Figure: multi-layer perceptron model using latent features. Application and system IDs are mapped through embeddings to an application latent vector and a system latent vector; these are concatenated and passed through ReLU layers 1 … N to produce the predicted score.]

Results
• An MLP with 2 layers (MLP-2) achieves 90.6% accuracy when predicting instructions per second (IPS)
[Figures: actual vs. predicted normalized IPS values; accuracy of the models on the IPS dataset.]

Framework for Transpilation Between Python and Fortran [4]

Fortran
(+) top performance
(+) HPC legacy
(−) hard to maintain

Python
(+) ease of programming
(+) general-purpose tools
(−) big runtime overhead

Dilemma
• Performance or ease of programming?

Solution
• Python for development, Fortran at runtime.

1. Take legacy Fortran
• reuse legacy code
• keep battle-tested implementations
• will be translated by our framework

2. Migrate to Python
• happens once
• semi-automatic or automatic
• performance-critical data is retained (via type hints)
• user can easily extend functionality

3. Translate performance-critical kernels to Fortran
• JIT, at runtime
• fully automatic
• original performance is retained
• user doesn't interact with Fortran
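Type hints are what make step 3 possible: they pin down argument and result types so a Python kernel can be lowered to statically typed Fortran. A hypothetical example of the kind of kernel such a framework could translate (this is not the framework's actual API, just a plain type-hinted function):

```python
import numpy as np

def saxpy(a: float, x: np.ndarray, y: np.ndarray) -> np.ndarray:
    """a*x + y, written as an explicit loop: the hints fix every type,
    and the loop maps directly onto a Fortran DO loop."""
    out = np.empty_like(x)
    for i in range(x.shape[0]):
        out[i] = a * x[i] + y[i]
    return out

print(saxpy(2.0, np.ones(4), np.zeros(4)))  # → [2. 2. 2. 2.]
```

In pure Python this loop pays the usual interpreter overhead; translated to Fortran and JIT-compiled at runtime, the same code runs at native speed while the user keeps editing only Python.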


Results
• Figure 1: DGEMM performance the same as Fortran; 5× better than Numba.
• Figure 2: the migrated Miranda IO benchmark retains its original performance.
[Figure annotations: 230×, 5×, 6×, 6×.]
[Figure: results; evaluation on an NVIDIA Titan Xp GPU.]