

Real World HPC Systems for Big Data/AI Research (2019/06/16)

Efficient 2D Convolution on CUDA-enabled GPUs [2]

[1] Shweta Salaria, Aleksandr Drozd, Artur Podobas, Satoshi Matsuoka. Learning Neural Representations for Predicting GPU Performance. ISC'19.
[2] Peng Chen, Mohamed Wahib, Shin'ichiro Takizawa, Ryousei Takano, Satoshi Matsuoka. Efficient Algorithms for the Summed Area Tables Primitive on GPUs. IEEE CLUSTER'18.
[3] Haoyu Zhang, Mohamed Wahib, Satoshi Matsuoka. Can Local Binary Convolutions Make Neural Networks Models Smaller? Submitted to ICPP'19 (Poster).
[4] Mateusz Bysiek, Aleksandr Drozd, Satoshi Matsuoka. Migrating Legacy Fortran to Python While Retaining Fortran-level Performance Through Transpilation and Type Hints. 6th Workshop on Python for High-Performance and Scientific Computing, co-held with SC16.

[Figure: 2D convolution within a CUDA warp (Thread 0 … Thread 31). An M×N filter (weights w_0, w_1, w_2, …) slides over input values v held in a 32×C register matrix. Three steps: (1) register cache (32×C); (2) each thread computes partial sums with the recurrence sum_i ← v_i × s_i + sum_{i-1}; (3) partial sums are transferred between threads to assemble the convolution results e_0 … e_31.]

Problem
• 2D convolution computation on CUDA-enabled GPUs

Proposal
(1) Use registers as a cache
• Lower latency, higher throughput
• High data reuse
(2) Compute partial sums in parallel
(3) Transfer partial sums in parallel
• No stalling within a thread
• Efficient communication between threads via warp shuffle
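The three steps above can be modeled in plain Python (a sketch only — the actual implementation is a CUDA kernel, where step (3) uses warp shuffle instructions; the function name and 1D simplification are illustrative assumptions):

```python
# Model of the warp-level partial-sum scheme on one row of the input.
# Each "thread" t caches its input element in a register; outputs are
# built with the recurrence sum_i = v_i * s_i + sum_{i-1}.
WARP_SIZE = 32

def warp_convolve(row, weights):
    """Sliding-window convolution across a simulated 32-thread warp."""
    k = len(weights)
    out = []
    for t in range(WARP_SIZE - k + 1):   # one output per thread
        s = 0.0                          # sum_{-1} = 0
        for i in range(k):               # step (2): partial sums
            # step (3): row[t + i] would arrive from thread t+i via shuffle
            s = weights[i] * row[t + i] + s
        out.append(s)
    return out

row = list(range(32))
print(warp_convolve(row, [1.0, 1.0, 1.0])[:3])   # → [3.0, 6.0, 9.0]
```

Because every operand lives in registers and the recurrence reuses the previous partial sum, no thread waits on shared or global memory inside the loop.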

Can Local Binary Convolutions Make Neural Networks Models Smaller? [3]

Background
• Local Binary Convolutional Neural Network (LBCNN) models use Local Binary Convolution (LBC) layers
• (+) fewer learned parameters
• (−) lower accuracy

Problem
• Not applicable to large models: too slow, too much accuracy loss

Proposed Method
• Replace only half of the normal convolution layers with LBC

Results
[Figures: accuracy and speed of the different models; results for different numbers of difference maps generated by the LBC filters.]
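An LBC layer saves parameters by freezing the convolution filters as random binary (±1/0) weights and learning only a small 1×1 linear combination of the resulting "difference maps". A minimal 1D NumPy sketch (shapes, sparsity, and the function name are illustrative assumptions; the real layers are 2D and interleave a nonlinearity):

```python
import numpy as np

rng = np.random.default_rng(0)

K, M = 3, 4                      # filter width, number of difference maps
# Fixed binary anchor filters: generated once, never trained.
anchors = rng.choice([-1.0, 0.0, 1.0], size=(M, K))
# The only learned parameters: one 1x1 weight per difference map.
combine = rng.normal(size=M)

def lbc_1d(signal):
    """Apply the fixed binary filters, then mix the difference maps
    with the learnable 1x1 combination."""
    n = len(signal) - K + 1
    maps = np.empty((M, n))
    for m in range(M):
        for t in range(n):
            maps[m, t] = anchors[m] @ signal[t:t + K]  # binary convolution
    return combine @ maps        # learnable linear combination

out = lbc_1d(np.arange(8.0))
print(out.shape)                 # one mixed map of length 6
```

Only M weights are trained here instead of M×K, which is where the parameter savings (and the accuracy loss the slide mentions) come from.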

Learning Neural Representations for Predicting GPU Performance [1]

Problem
• Modeling the performance of applications across systems with different GPU microarchitectures

Proposal
• A collaborative-filtering-based matrix factorization (MF) model that automatically learns latent features describing the performance of applications on systems
• A multi-layer perceptron (MLP) that models complex non-linear interactions between applications and systems
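The MLP side of the proposal can be sketched in NumPy: look up an application embedding and a system embedding, concatenate them, and pass the result through ReLU layers to a scalar score. All dimensions and the random initialisation below are illustrative assumptions, not the paper's values:

```python
import numpy as np

rng = np.random.default_rng(0)

N_APPS, N_SYSTEMS, DIM = 10, 4, 8            # hypothetical sizes
app_emb = rng.normal(size=(N_APPS, DIM))     # application embeddings
sys_emb = rng.normal(size=(N_SYSTEMS, DIM))  # system embeddings

def predict_score(app_id, sys_id, weights):
    """Concatenate the two latent vectors, apply ReLU hidden layers,
    and emit one predicted score (e.g. normalized IPS)."""
    x = np.concatenate([app_emb[app_id], sys_emb[sys_id]])
    for W in weights[:-1]:
        x = np.maximum(W @ x, 0.0)           # ReLU hidden layers
    return float(weights[-1] @ x)            # linear output layer

# MLP-2: two hidden layers, then a scalar output.
weights = [rng.normal(size=(16, 2 * DIM)),
           rng.normal(size=(8, 16)),
           rng.normal(size=(1, 8))]
print(predict_score(3, 1, weights))
```

In the actual model the embeddings and layer weights are trained jointly against measured performance; here they are random placeholders showing only the data flow.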

[Figure: multi-layer perceptron model using latent features. Application and system IDs are mapped through embeddings to an application latent vector and a system latent vector; these are concatenated and passed through ReLU layers 1 … N to produce the predicted score.]

Results
• An MLP with 2 layers (MLP-2) achieves 90.6% accuracy when predicting instructions per second (IPS)
[Figures: actual vs. predicted normalized IPS values; accuracy of the models on the IPS dataset.]

Framework for Transpilation Between Python and Fortran [4]

Fortran
(+) top performance
(+) HPC legacy
(−) hard to maintain

Python
(+) ease of programming
(+) general-purpose tools
(−) big runtime overhead

Dilemma
• Performance or ease of programming?

Solution
• Python for development, Fortran at runtime.

1. Take legacy Fortran
• reuse legacy code
• keep battle-tested implementations
• will be translated by our framework

2. Migrate to Python
• happens once
• semi-automatic or automatic
• performance-critical data is retained (via type hints)
• user can easily extend functionality

3. Translate performance-critical kernels to Fortran
• JIT, at runtime
• fully automatic
• original performance is retained
• user doesn't interact with Fortran
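Type hints are what make step 3 possible: they pin down argument and result types so a Python kernel can be lowered to statically typed Fortran. A hypothetical example of the kind of kernel such a framework could translate (this is not the framework's actual API, just a plain type-hinted function):

```python
import numpy as np

def saxpy(a: float, x: np.ndarray, y: np.ndarray) -> np.ndarray:
    """a*x + y, written as an explicit loop: the hints fix every type,
    and the loop maps directly onto a Fortran DO loop."""
    out = np.empty_like(x)
    for i in range(x.shape[0]):
        out[i] = a * x[i] + y[i]
    return out

print(saxpy(2.0, np.ones(4), np.zeros(4)))  # → [2. 2. 2. 2.]
```

In pure Python this loop pays the usual interpreter overhead; translated to Fortran and JIT-compiled at runtime, the same code runs at native speed while the user keeps editing only Python.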


Results
• Figure 1: DGEMM performance the same as Fortran; 5× better than Numba.
• Figure 2: the migrated Miranda IO benchmark retains its original performance.
[Figure annotations: 230×, 5×, 6×, 6×.]
[Figure: results; evaluation on an NVIDIA Titan Xp GPU.]