mauricio breternitz, ph.d. amd research - …fleuret/dltm/slides/dltm-slides-breternitz...amd hsa...

MACHINE LEARNING AT AMDFOUNDATIONS AND SUPPORT

MAURICIO BRETERNITZ, PH.D.AMD RESEARCH

| LDIAP / JULY 6, 2016|2

AMD RESEARCH

| LDIAP / JULY 6, 2016|3

OUTLINE

Phone

Tablet

Notebook

Workstation

Dense Server

Super computer

From Phil Rogers APU13 Keynote

Vision

Foundations – AMD Hardware

Foundations – AMD Open Source and Libraries

Optimizing Open Source Libraries

Machine Learning Support

| LDIAP / JULY 6, 2016|4

AMD’S VISION

Computers/Supercomputers of high-performance coherent heterogeneous nodes

‒ Build from fast x86 CPUs with a strong existing HPC software ecosystem

‒ Coherently connected to fast AMD GPUs for parallel compute

‒ 2P nodes for high density and memory footprint

HSA programming model makes parallel acceleration on the GPU easier

‒ C++ and OpenMP directly accelerated

‒ Programmers freed from managing memory transfers

‒ True single source programming to use the GPUs on day one

Open source software on open standard hardware

‒ Fully open source execution stack

‒ Fully open source compilers

‒ Fully open source enablement for DSLs and runtimes

| LDIAP / JULY 6, 2016|5

OPEN STANDARDS

Pervasive philosophy across all AMD products

‒ Build great products based on open standards

AMD hardware and software HPC architecture is based on HSA

‒ Extensive industrial, academic and government membership including DOE labs

‒ IP Rules enable collaboration

‒ Specifications ratified, released, and public at www.hsafoundation.com

HSA memory model supports other open standards

‒ C++ 14, PGAS languages, .Net and Java memory models

Programming models based on open standards

‒ HCC C++ based on C++ ISO Standards https://isocpp.org/

‒ OpenMP http://openmp.org/wp/openmp-specifications/

‒ LLVM optimizing compiler http://llvm.org/

‒ CLANG front end to LLVM http://clang.llvm.org/

http://www.hsafoundation.com/

https://isocpp.org/

http://openmp.org/wp/openmp-specifications/

http://llvm.org/

http://clang.llvm.org/

| LDIAP / JULY 6, 2016|6

OPEN SOURCE AT AMD

AMD HPC Software is built on open source Linux

Operating system support

‒ All AMD patches for HSA and IOMMU are upstream at https://www.kernel.org/

‒ Linux Distros shipping in 2015 with HSA enabled

AMD HSA Execution stack is all open source

‒ Kernel mode drivers

‒ HSA Runtime

‒ HSAIL Finalizer

AMD’s OpenCL, C++, and Python HPC compilers are open source

‒ Based on CLANG and LLVM

‒ HSAIL backend for LLVM being upstreamed to http://llvm.org/

Additional compilers from third-party companies are closed source

‒ OpenMP

‒ OpenACC 6

https://www.kernel.org/

http://llvm.org/

| LDIAP / JULY 6, 2016|7

AMD VISION

Open Source Software

‒ www.gpuopen.com

Open, Standard Hardware

8

RadeonTM Technology Group Presents

New Path Forward for HPC and Ultrascale Computing Markets

Focused Commitment to Meet Customer Computing Needs

Open Foundation For Development, Discovery and Education

ROCm: Radeon Open Compute

Platform

| LDIAP / JULY 6, 2016|9

OUTLINE

Phone

Tablet

Notebook

Workstation

Dense Server

Super computer


Vision





| LDIAP / JULY 6, 2016|10

11

Gear for the ROCm stage

8GB @ 1 TB/s memory

bandwidth

13.9 TFlops Single

Precision

300 Watts

4GB @ 512 GB/s memory

bandwidth

8.19 TFlops Single

Precision

175 Watts

8GB @ 256 GB/s memory

bandwidth

5.8 TFlops Single

Precision

100 Watts

Radeon R9 Nano FirePro S9300 x2 Radeon RX480

12

GPU

Dual

x86 CPU

Module

A heterogeneous platform with CPU+GPU on the same silicon die

Ideal for both serial and data-parallel parts of application

TDP of less than 100 Watts

ACCELERATED PROCESSING UNIT (APU)

Data-parallel Serial

13

ROCm Keeping the Signal Clean User Mode DMA: Drives Down Latency of the Transfer SHOC

0123456789

10111213

1K

2K

4K

8K

16

K

32

K

64

K

12

8K

25

6K

51

2K

1M

2M

4M

8M

16

M

32

M

64

M

12

8M

25

6M

Host->Device Measured Bandwidth (GB/s)

ROCm NANO (Direct Attach) CATALYST NANO (Direct Attach)

0123456789

1011121314

1K

2K

4K

8K

16

K

32

K

64

K

12

8K

25

6K

51

2K

1M

2M

4M

8M

16

M

32

M

64

M

12

8M

25

6M

Device->Host Measured Bandwidth (GB/s)

ROCm NANO (Direct Attach) CATALYST NANO (Direct Attach)

SHOC Version 1.1.5 ROC RC2 vs Radeon Software Crimson Edition15.12 on R9 Nano

14

ROOFLINE FROM MIXBENCH* HIP PORT

RadeonTM R9 NANO BRINGS A “SOLID SOUND”

0

1000

2000

3000

4000

5000

6000

7000

8000

0 10 20 30 40 50 60 70

Single Precision Gflops

Fiji GFLOPS Flops/byte

0

100

200

300

400

0 10 20 30 40

Double Precision Gflops

Fiji… Flops/byte

https://github.com/ekondis/mixbench Mixbench on ROCm RC2 Radeon R9 NANO ( formerly codenamed “FIJI”)

https://github.com/ekondis/mixbench

| LDIAP / JULY 6, 2016|15

OUTLINE

Phone

Tablet

Notebook

Workstation

Dense Server

Super computer


Vision





16

Bringing HSA to Discrete GPU & HPC

AMD has invested in HSA since before June-2012 (HSA Foundation)‒ Initial focus on APUs

‒ Recently extended HSA technology to discrete GPUs and HPC market

Radeon Open Compute (ROCm) Architecture for dGPU‒ ROC Runtime (ROCR) : HSA Runtime + Extensions (for peer-to-peer, improved copies, memory pinning)

‒ ROC Compilers HCC, Python, HIP initially developed on HSA APUs

‒ Hardware Architecture includes user-mode queues, signals, context switching (HSA Arch)

‒ Discrete memory exposed to programmer

‒ ROCm discrete devices pass conformance for HSA “Base” Profile

ROCm Platform ‒ Announced Nov/2015 (SuperComputing) as “Boltzmann Initiative”

‒ ROCm Preview release Jan/2016

‒ ROCm 1.0 released April/2016

‒ ROCm 1.1 released June/2016

‒ ROCm 1.2 July/2016 – Broadening Platform Support

‒ ROCm 1.3 Sept/2016 – Broadening Programming Model Support

1717 | LDIAP – JULY 2016

AMD and the GPUOpen Initiative

Vendor-optimized open-source support for important GPU software‒ http://gpuopen.com/

‒ Most source code available on GitHub or Bitbucket!

Open-source Gaming Libraries

‒ e.g., TressFX – Hair physics

‒ e.g., AOFX – optimized ambient occlusion

‒ Many others!

Open-source Compute Libraries

‒ clBLAS

‒ clFFT

‒ clRNG

http://gpuopen.com/

18

It’s about making “premium sound” on the ROCm

stage

HCC is a single source ISO C++ 11/14 compiler for both the CPU and

GPU

OPENMP 3.1 C & C++ support soday for CPU

C++17 “Parallel Standard Template Library”

Built on rich compiler infrastructure CLANG/LLVM and libC++

Performance Optimization for Accelerators

Low-level memory placement controls: pre-fetch, discard data movement

Asynchronous compute kernels

Scratchpad memories support

Blog discussing ROC Language Options:

http://gpuopen.com/rocm-do-you-speaka-my-language/

HCC (Heterogeneous Compute Compiler) Mainstream Standard Languages for GPU Acceleration

19

HCC - Heterogeneous Compute Compiler

HCC support ISO C++ 11/14 & C11 compilation

‒ Fully Open Sourced Compiler leverage CLANG/LLVM Technology

‒ Single-source : Host and dev code in the same source file

‒ C++17 “Parallel Standard Template Library” support

‒ OpenMP 3.1/4.0 on CPU initially, planning OpenMP 4.5 for GPU

HCC generates both CPU and GPU code

‒ Traditional CPU programs can be compiled with HCC

‒ Heterogeneous programs

‒ Compiles the host code for CPU

‒ Compiles the kernel/parallel region for the GPU

Motivated programmers can drill down to optimize, where desired

‒ Control, pre-fetch, discard data movement

‒ Run asynchronous compute kernels

‒ Access GPU scratchpad memories

C++ & C COMPILER

20

C++17 Example (APU)

int column = 128;

int row = 256;

// define the compute grid

bounds<2> grid { column, row };

float* a = new float[grid.size()];

float* b = new float[grid.size()];

float* c = new float[grid.size()];

// …Write a and b here…

// Using standard C++17 launch syntax

parallel::for_each(par, begin(grid), end(grid),

[&](index<2> idx) {

int i = idx[1] * column + idx[0];

c[i] = a[i] + b[i];

}

Single-Heap

Single-Source

ISO Standard C++17

21

Port from CUDA to a common C++ programming model

HIP code runs through either CUDA NVCC or HCC

HiPify tools simplify porting from CUDA to HIP

Builds on HCC Compiler

Host and device code can use templates, lambdas, advanced C++ features

C-based runtime APIs (hipMalloc, hipMemcpy, hipKernelLaunch and more)

HIP = “Heterogeneous-Compute Interface for Portability”

Bringing rhythm to today’s developers

cudaMemcpy(M, m, Size2,cudaMemcpyHostToDevice );

cudaMemcpy(A, a, Size2,cudaMemcpyHostToDevice );

cudaMemcpy(B, b, Size, cudaMemcpyHostToDevice );

dim3 dimBlockXY(blockSize2d,blockSize2d);

dim3 dimGridXY(gridSize2d,gridSize2d);

for (t=0; t<(Size-1); t++)

{

Kernel1<<<dimGrid,dimBlock>>>(M,A,Size2,t);

cudaThreadSynchronize();

Kernel2<<<dimGridXY,dimBlockXY>>>(M,A,B,Size,t);

cudaThreadSynchronize();

}

hipMemcpy(M, m, Size2, hipMemcpyHostToDevice );

hipMemcpy(A, a, Size2, hipMemcpyHostToDevice );

hipMemcpy(B, b, Size, hipMemcpyHostToDevice );

dim3 dimBlockXY(blockSize2d,blockSize2d);

dim3 dimGridXY(gridSize2d,gridSize2d);

for (t=0; t<(Size-1); t++)

{

hipLaunchKernel(Kernel1, dimGrid, dimBlock, 0, 0, M,A,Size2,t);

hipDeviceSynchronize();

hipLaunchKernel(Kernel2, dimGridXY, dimBlockXY, 0, 0,

M,A,B,Size,t);

hipDeviceSynchronize();

}

HIPIFY

22

Playing “Deep Bass” with Anaconda

Numba speed up your applications with high performance functions written directly in Python

Rich set of options to optimize for APU or discrete GPU

‒ Async execution, specify group size, use shared memory

‒ Data transfer can be performed implicitly based on kernel arguments

ANACONDA on ROCm a New Class of Performance for Python

Embracing Python Developer Community with Heterogeneous Acceleration

23

Kick up the BeatFocusing on Solution & Building Out Key Foundations to Support Libraries, Frameworks and Applications via GPUopen

Machine Learning

and Deep Neural

Network

Data/Graph

Analytics

Molecular

and Atomic

Simulation

Energy

and

Signal

Processing

24

ROCm A Stage For Your “Songs” Build a Rich Foundation of Libraries and Applications

Core math libraries

‒ BLAS ( Skinny, Non-Square, Square ) SGEMM & DGEMM

‒ FFT

‒ SPARSE

‒ RAND

Third-party Frameworks

‒ Kokkos C++ Framework

‒ CHARM++ Frameworks

‒ HPX Framework ( in development )

Benchmarks and Applications

‒ SHOC

‒ Parboil

‒ Lullesh

‒ MiniFE

‒ Cloverleaf

‒ NAMD

‒ LeanMD

‒ CoMD

‒ MiniAMR

‒ HPGMG

‒ XSBench

25

Rocking the neural pathways

Supporting Key Neural Network Frameworks

Torch 7 and Caffe

mlOpen

Optimized Convolution Neural Network for NN Frameworks

OpenVX with Graph Optimizer

Foundation for rich Machine Learning

Instinctive Computing foundation for Machine Learning and Neural Networks

| LDIAP / JULY 6, 2016|26

AMD OPEN SOURCE PROJECTS

ACL – the AMD Compute Libraries are a set of open source solutions, providing developers with libraries targeted to those who want to accelerate computations on GPUs, APUs and CPUs.

HIP : C++ Heterogeneous-Compute Interface for Portability — a C++ runtime API and kernel language that allows developers to create portable applications that can run on AMD and other GPU’s.

Bolt C++ – an STL compatible library for creating accelerated data parallel applications

Aparapi – an API for expressing data parallel workloads in Java™ and a runtime component capable of converting the Java™ bytecode of compatible workloads into OpenCL™

Introducing: The Boltzmann Initiative – A Defining Moment for Heterogeneous Computing Graphics driver – open source Linux driver for Radeon graphics cards

http://developer.amd.com/tools-and-sdks/opencl-zone/acl-amd-compute-libraries/

http://gpuopen.com/compute-product/hip-convert-cuda-to-portable-c-code/?webSyncID=20bd30b8-088b-00c1-d9c0-673e0c9d796c&sessionGUID=b8693ee9-da3c-37a7-495c-33d4e39a0139

https://github.com/HSA-Libraries/Bolt

http://developer.amd.com/tools-and-sdks/opencl-zone/aparapi/

https://community.amd.com/community/amd-business/blog/2015/11/16/a-defining-moment-for-heterogeneous-computing

http://xorg.freedesktop.org/wiki/RadeonFeature/

| LDIAP / JULY 6, 2016|27

ACL – AMD COMPUTE LIBRARIES

GPU Libraries

‒ clBLAS: The primary goal of clBLAS is to make it easier for developers to utilize the inherent performance and power efficiency benefits of heterogeneous computing. clBLAS interfaces do not hide nor wrap OpenCL interfaces, but rather leaves OpenCL™ state management to the control of the user to allow for maximum performance and flexibility.

‒ clRNG: has both host and device side interfaces to give you, the developer, sufficient control. The host side interfaces are native C APIs. clRNG 1.0 is an OpenCL library that generates uniform random numbers.

‒ clSPARSE: True in spirit with the other clMath libraries, clSparse exports a “C” interface to allow projects to build wrappers around clSPARSE in any language they need. clSPARSE is an OpenCL™ library implementing Sparse linear algebra. This project is a result of a collaboration between AMD Inc. and Vratis Ltd.

‒ clFFT: is a software library containing Fast Fourier Transform (FFT) functions written in OpenCL. In addition to GPU devices, the libraries also support running on CPU devices to facilitate debugging and heterogeneous programming.

HTTP://DEVELOPER.AMD.COM/TOOLS-AND-SDKS/OPENCL-ZONE/ACL-AMD-COMPUTE-LIBRARIES/

https://github.com/clMathLibraries/clBLAS

https://github.com/clMathLibraries/clRNG

https://github.com/clMathLibraries/clSPARSE

https://github.com/clMathLibraries/clFFT

| LDIAP / JULY 6, 2016|28

ACL – AMD COMPUTE LIBRARIES

CPU Libraries

‒ BLIS: is a portable framework for instantiating BLAS-like libraries. The framework was designed to isolate essential kernels of computation that, when optimized, immediately enable optimized implementations of most of its commonly used and computationally intensive operations.

‒ Libflame: is a portable library for dense matrix computations, providing much of the functionality present in LAPACK. In fact, libflame includes a compatibility layer, FLAPACK, which includes a complete LAPACK implementation.

HTTP://DEVELOPER.AMD.COM/TOOLS-AND-SDKS/OPENCL-ZONE/ACL-AMD-COMPUTE-LIBRARIES/

https://github.com/flame/blis

https://github.com/flame/libflame

| LDIAP / JULY 6, 2016|29

OUTLINE

Phone

Tablet

Notebook

Workstation

Dense Server

Super computer



Foundations – AMD Libraries



| LDIAP / JULY 6, 2016|30

AMD CONTRIBUTIONS TO OPEN SOURCE MACHINE LEARNING

Caffe – a deep learning framework made with expression, speed, and modularity in mind.

Torch – a scientific computing framework with wide support for machine learning algorithms.

Numba – speed up your applications with high-performance functions written directly in Python.

libAV – cross-platform tools and libraries to convert, manipulate, and stream a wide range of multimedia formats and protocols.

http://gpuopen.com/compute-product/hccaffe/?webSyncID=20bd30b8-088b-00c1-d9c0-673e0c9d796c&sessionGUID=b8693ee9-da3c-37a7-495c-33d4e39a0139

http://torch.ch/

http://numba.pydata.org/

https://libav.org/

| LDIAP / JULY 6, 2016|31

CAFFE

HCCAFFE is a Caffe-based deep-learning framework that supports the HCC compiler and Boltzmann runtime

The Caffe framework has been ported using HCC and the C++ AMP dialect.

HcCaffe includes a number of models for test:

‒ Alexnet

‒ Googlenet

‒ Cifar10

‒ Imagenet

‒ Mnist

‒ Feature extraction

https://bitbucket.org/multicoreware/hccaffe

| LDIAP / JULY 6, 2016|32

TORCH

HCTORCH is a Torch-based deep-learning framework that supports the HCC compiler and Boltzmann runtime

The Torch framework has been ported using HCC

https://bitbucket.org/multicoreware/hctorch/

https://bitbucket.org/multicoreware/hctorch/

| LDIAP / JULY 6, 2016|33

MapReduce in 30 seconds: Estimating PI

1- pick random points in unit rectangle

2- count fraction inside circle

Area: π/ 4

Map-Reduce:

Map: random point inside? Issue k=1, v=1 else k=0,v=1

Reduce: count 0 keys and count 1 keys

Programmer: writes { map, reduce } methods,

system does rest

| LDIAP / JULY 6, 2016|34

• Open source implementation of MapReduce programming model

• Runs on distributed network in several Java VMs

• Distributed file system, reliability guarantees, speculative execution, Java programming language and libraries, implicit parallelism

Map

Map

Map

V1

V2

V3

Reduce

Reduce

V1 V2

V3

barrier

HADOOP

| LDIAP / JULY 6, 2016|35

TARGET ARCHITECTURE

Target: CLUSTER of APUs

Two-Level Parallelism:

‒ Across nodes in cluster

‒ Within Node (APU)

‒ Multicore (CPU)

‒ Data parallel(GPU)

| LDIAP / JULY 6, 2016|36

NESTED PROCESSING APPROACH

PARTITIONWORK LOCALLY

COMBINE

RESULTS

nest: CPU+GPU

| LDIAP / JULY 6, 2016|37

WHY APUS – MAP REDUCE-REDUCE

Map

Map

Map

V1

V2

V3

Reduce

Reduce

V1 V2

V3

barrier

Map

Map

Map

V1

V2

V3

Reduce

Reduce

V1 V2

V3

Map

Map

Map

V1

V2

V3

Reduce

Reduce

V1 V2

V3

Reduce

Reduce

V1 V2

V3

Reduce

Reduce

V1 V2

V3

Reduce

Reduce

V1 V2

V3

SPLIT: enough work for one GPU

NO MEMORY LIMIT

CPU+GPU execution

CPU+GPU execution on each nodeFinal reduction in cluster

Aggregate node’s results

Network

| LDIAP / JULY 6, 2016|38

HadoopCLMapReduce on Distributed Heterogeneous Platforms Through

Seamless Integration of Hadoop and OpenCL

Max Grossman1, Mauricio Breternitz2, Vivek Sarkar1

1Rice University, 2AMD Research2013 International Workshop on High Performance Data Intensive Computing.

May 2013.

Implemented with AMD’s APARAPI:

• Java methods -> GPU

HADOOPCL

Collaboration with Rice University:

M. Grossman, M. Breternitz, V. Sarkar. “HadoopCL2: Motivating the Design of a Distributed,

Heterogeneous Programming System With Machine-Learning Applications.”

IEEE Transactions on Parallel and Distributed Systems, 2014

| LDIAP / JULY 6, 2016|39

IEEE Transactions on Parallel and Distributed Systems 27.3 (2016): 762-775

AMD A10-7850K APU

4 CPU cores at 3.7GHz,

Radeon GPU

2GB global memory,

14.5GB of system memory

| LDIAP / JULY 6, 2016|40

OPEN-SOURCE – APACHE SPARK ACCELERATION

COLLABORATION: RICE UNIVERSITY

PROF. VIVEK SARKARMAX GROSSMAN

Tool: aparapi-swat from https://github.com/agrippa/aparapi-swat, where there is a build.sh.

SWAT : https://github.com/agrippa/spark-swata Maven project: cd $SWAT_HOME/swat/ && mvn clean

https://github.com/agrippa/aparapi-swat

https://github.com/agrippa/spark-swat

clSPARSE:An OpenCL™

Sparse BLAS Library

| LDIAP / JULY 6, 2016|42

SINGLE PRECISION SPMV – OPEN SOURCE

0

15

30

45

60

75

90

105

Den

se8

Pro

tein

FEM

-Sp

her

es

FEM

-…

Win

dTu

nn

el

FEM

-Har

bo

r

FEM

-Sh

ip

Eco

no

mic

s

FEM

-…

Cir

cuit

Web

bas

e

circ

uit

5M

eu-2

00

5

Ga4

1A

s41

H7

2

in-2

00

4

mip

1

Si4

1G

e41

H7

2

ASI

C_6

80

k

dc2

FullC

hip

ins2

bo

ne0

10

cran

kseg

_2

ldo

or

raja

t31

Ru

cci1

bo

yd2

sls

tran

sien

t

Geo

Mea

n

SP G

FLO

Ps

ViennaCL-AMD clSPARSE-AMD

clSPARSE 2.5x faster on average

| LDIAP / JULY 6, 2016|43

ISCA’2016

| LDIAP / JULY 6, 2016|44

| LDIAP / JULY 6, 2016|45

QUESTIONS ?

[email protected]

| LDIAP / JULY 6, 2016|46

DISCLAIMER & ATTRIBUTIONThe information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.

The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.

AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.

AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

ATTRIBUTION

© 2016 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo, AMD Radeon, AMD Catalyst, and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. OpenCL is a trademark of Apple Inc. used by permission by Khronos. Other names are for informational purposes only and may be trademarks of their respective owners.

mauricio breternitz, ph.d. amd research - …fleuret/dltm/slides/dltm-slides-breternitz...amd hsa...

Documents