MACHINE LEARNING AT AMD: FOUNDATIONS AND SUPPORT
MAURICIO BRETERNITZ, PH.D., AMD RESEARCH
| LDIAP / JULY 6, 2016|2
AMD RESEARCH
OUTLINE
Vision
Foundations – AMD Hardware
Foundations – AMD Open Source and Libraries
Optimizing Open Source Libraries
Machine Learning Support
[Figure: device range from phone, tablet, notebook and workstation to dense server and supercomputer. From Phil Rogers APU13 Keynote]
AMD’S VISION
Computers/Supercomputers of high-performance coherent heterogeneous nodes
‒ Build from fast x86 CPUs with a strong existing HPC software ecosystem
‒ Coherently connected to fast AMD GPUs for parallel compute
‒ 2P nodes for high density and memory footprint
HSA programming model makes parallel acceleration on the GPU easier
‒ C++ and OpenMP directly accelerated
‒ Programmers freed from managing memory transfers
‒ True single source programming to use the GPUs on day one
Open source software on open standard hardware
‒ Fully open source execution stack
‒ Fully open source compilers
‒ Fully open source enablement for DSLs and runtimes
OPEN STANDARDS
Pervasive philosophy across all AMD products
‒ Build great products based on open standards
AMD hardware and software HPC architecture is based on HSA
‒ Extensive industrial, academic and government membership including DOE labs
‒ IP Rules enable collaboration
‒ Specifications ratified, released, and public at www.hsafoundation.com
HSA memory model supports other open standards
‒ C++ 14, PGAS languages, .Net and Java memory models
Programming models based on open standards
‒ HCC C++ based on C++ ISO Standards https://isocpp.org/
‒ OpenMP http://openmp.org/wp/openmp-specifications/
‒ LLVM optimizing compiler http://llvm.org/
‒ CLANG front end to LLVM http://clang.llvm.org/
OPEN SOURCE AT AMD
AMD HPC Software is built on open source Linux
Operating system support
‒ All AMD patches for HSA and IOMMU are upstream at https://www.kernel.org/
‒ Linux Distros shipping in 2015 with HSA enabled
AMD HSA Execution stack is all open source
‒ Kernel mode drivers
‒ HSA Runtime
‒ HSAIL Finalizer
AMD’s OpenCL, C++, and Python HPC compilers are open source
‒ Based on CLANG and LLVM
‒ HSAIL backend for LLVM being upstreamed to http://llvm.org/
Additional compilers from third-party companies are closed source
‒ OpenMP
‒ OpenACC
AMD VISION
Open Source Software
‒ www.gpuopen.com
Open, Standard Hardware
Radeon™ Technology Group Presents
New Path Forward for HPC and Ultrascale Computing Markets
Focused Commitment to Meet Customer Computing Needs
Open Foundation For Development, Discovery and Education
ROCm: Radeon Open Compute Platform
Gear for the ROCm stage
‒ FirePro S9300 x2: 8 GB @ 1 TB/s memory bandwidth, 13.9 TFLOPS single precision, 300 Watts
‒ Radeon R9 Nano: 4 GB @ 512 GB/s memory bandwidth, 8.19 TFLOPS single precision, 175 Watts
‒ Radeon RX480: 8 GB @ 256 GB/s memory bandwidth, 5.8 TFLOPS single precision, 100 Watts
ACCELERATED PROCESSING UNIT (APU)
A heterogeneous platform with CPU+GPU on the same silicon die
Ideal for both serial and data-parallel parts of application
TDP of less than 100 Watts
[Figure: APU die diagram; labels: GPU (data-parallel), dual x86 CPU module (serial)]
ROCm: Keeping the Signal Clean
User Mode DMA: Drives Down Latency of the Transfer
[Figure: SHOC host->device measured bandwidth (GB/s, axis 0 to 13) versus transfer size (1K to 256M); series: ROCm NANO (Direct Attach), CATALYST NANO (Direct Attach)]
[Figure: SHOC device->host measured bandwidth (GB/s, axis 0 to 14) versus transfer size (1K to 256M); series: ROCm NANO (Direct Attach), CATALYST NANO (Direct Attach)]
SHOC Version 1.1.5, ROC RC2 vs Radeon Software Crimson Edition 15.12 on R9 Nano
ROOFLINE FROM MIXBENCH* HIP PORT
Radeon™ R9 NANO BRINGS A “SOLID SOUND”
[Figure: single-precision GFLOPS (axis 0 to 8000) versus flops/byte (axis 0 to 70), Fiji]
[Figure: double-precision GFLOPS (axis 0 to 400) versus flops/byte (axis 0 to 40), Fiji]
* https://github.com/ekondis/mixbench. Mixbench on ROCm RC2, Radeon R9 NANO (formerly codenamed “FIJI”)
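The shape of a roofline plot can be approximated from two numbers: peak compute throughput and peak memory bandwidth. The sketch below is a minimal model of that idea, not mixbench itself. It assumes the R9 Nano figures quoted on the hardware slide of this deck (8.19 TFLOPS single precision, 512 GB/s); the function name `roofline_gflops` is hypothetical.

```python
def roofline_gflops(flops_per_byte, peak_gflops, bandwidth_gb_s):
    """Attainable GFLOPS under the simple roofline model."""
    # Below the ridge point the kernel is memory bound: bandwidth * intensity.
    # Above it the kernel is compute bound: capped at the peak rate.
    return min(peak_gflops, bandwidth_gb_s * flops_per_byte)


# Assumed inputs: R9 Nano numbers from the hardware slide of this deck.
PEAK_SP_GFLOPS = 8190.0   # 8.19 TFLOPS single precision
BANDWIDTH_GB_S = 512.0    # 512 GB/s memory bandwidth

# Operational intensity at which the two limits meet.
ridge_point = PEAK_SP_GFLOPS / BANDWIDTH_GB_S

if __name__ == "__main__":
    for intensity in (1, 4, 16, 64):
        gflops = roofline_gflops(intensity, PEAK_SP_GFLOPS, BANDWIDTH_GB_S)
        print(f"{intensity:3d} flops/byte -> {gflops:7.1f} GFLOPS")
    print(f"ridge point: {ridge_point:.1f} flops/byte")
```

Under these assumed inputs the ridge point sits near 16 flops/byte: kernels with lower operational intensity are limited by memory bandwidth, and only kernels above it can approach the peak rate.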
Bringing HSA to Discrete GPU & HPC
AMD has invested in HSA since before June 2012 (HSA Foundation)
‒ Initial focus on APUs
‒ Recently extended HSA technology to discrete GPUs and HPC market
Radeon Open Compute (ROCm) Architecture for dGPU
‒ ROC Runtime (ROCR): HSA Runtime + Extensions (for peer-to-peer, improved copies, memory pinning)
‒ ROC Compilers HCC, Python, HIP initially developed on HSA APUs
‒ Hardware Architecture includes user-mode queues, signals, context switching (HSA Arch)
‒ Discrete memory exposed to programmer
‒ ROCm discrete devices pass conformance for HSA “Base” Profile
ROCm Platform
‒ Announced Nov/2015 (SuperComputing) as “Boltzmann Initiative”
‒ ROCm Preview release Jan/2016
‒ ROCm 1.0 released April/2016
‒ ROCm 1.1 released June/2016
‒ ROCm 1.2 July/2016 – Broadening Platform Support
‒ ROCm 1.3 Sept/2016 – Broadening Programming Model Support
AMD and the GPUOpen Initiative
Vendor-optimized open-source support for important GPU software
‒ http://gpuopen.com/
‒ Most source code available on GitHub or Bitbucket!
Open-source Gaming Libraries
‒ e.g., TressFX – Hair physics
‒ e.g., AOFX – optimized ambient occlusion
‒ Many others!
Open-source Compute Libraries
‒ clBLAS
‒ clFFT
‒ clRNG
It’s about making “premium sound” on the ROCm stage
HCC is a single-source ISO C++ 11/14 compiler for both the CPU and GPU
OpenMP 3.1 C & C++ support today for CPU
C++17 “Parallel Standard Template Library”
Built on rich compiler infrastructure CLANG/LLVM and libC++
Performance Optimization for Accelerators
Low-level memory placement controls: pre-fetch, discard data movement
Asynchronous compute kernels
Scratchpad memories support
Blog discussing ROC Language Options:
http://gpuopen.com/rocm-do-you-speaka-my-language/
HCC (Heterogeneous Compute Compiler) Mainstream Standard Languages for GPU Acceleration
HCC - Heterogeneous Compute Compiler
HCC supports ISO C++ 11/14 & C11 compilation
‒ Fully open-source compiler leveraging CLANG/LLVM technology
‒ Single-source: host and device code in the same source file
‒ C++17 “Parallel Standard Template Library” support
‒ OpenMP 3.1/4.0 on CPU initially, planning OpenMP 4.5 for GPU
HCC generates both CPU and GPU code
‒ Traditional CPU programs can be compiled with HCC
‒ Heterogeneous programs
‒ Compiles the host code for CPU
‒ Compiles the kernel/parallel region for the GPU
Motivated programmers can drill down to optimize, where desired
‒ Control, pre-fetch, discard data movement
‒ Run asynchronous compute kernels
‒ Access GPU scratchpad memories
C++ & C COMPILER
C++17 Example (APU)
int column = 128;
int row = 256;
// define the compute grid
bounds<2> grid { column, row };
float* a = new float[grid.size()];
float* b = new float[grid.size()];
float* c = new float[grid.size()];
// …Write a and b here…
// Using standard C++17 launch syntax
parallel::for_each(par, begin(grid), end(grid),
    [&](index<2> idx) {
        // flatten the 2-D index into the 1-D arrays
        int i = idx[1] * column + idx[0];
        c[i] = a[i] + b[i];
    });
Single-Heap
Single-Source
ISO Standard C++17
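To check the index arithmetic in the example above, the same computation can be written as a serial reference. This is a plain Python sketch, not HCC code: the helper name `add_grids` is hypothetical, and flat Python lists stand in for the arrays on the single shared heap.

```python
def add_grids(a, b, column, row):
    """Serial reference for c[i] = a[i] + b[i] over a column x row grid."""
    c = [0.0] * (column * row)
    for y in range(row):           # plays the role of idx[1]
        for x in range(column):    # plays the role of idx[0]
            i = y * column + x     # same flattening as idx[1] * column + idx[0]
            c[i] = a[i] + b[i]
    return c


column, row = 128, 256
a = [float(i) for i in range(column * row)]
b = [2.0] * (column * row)
c = add_grids(a, b, column, row)
```

Every (x, y) pair maps to a distinct flat index, so the iterations are independent of one another, which is what allows the parallel version to run them in any order.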
Port from CUDA to a common C++ programming model
HIP code runs through either CUDA NVCC or HCC
HiPify tools simplify porting from CUDA to HIP
Builds on HCC Compiler
Host and device code can use templates, lambdas, advanced C++ features
C-based runtime APIs (hipMalloc, hipMemcpy, hipLaunchKernel and more)
HIP = “Heterogeneous-Compute Interface for Portability”
Bringing rhythm to today’s developers
// CUDA source (input to HIPIFY)
cudaMemcpy(M, m, Size2, cudaMemcpyHostToDevice);
cudaMemcpy(A, a, Size2, cudaMemcpyHostToDevice);
cudaMemcpy(B, b, Size, cudaMemcpyHostToDevice);
dim3 dimBlockXY(blockSize2d, blockSize2d);
dim3 dimGridXY(gridSize2d, gridSize2d);
for (t = 0; t < (Size - 1); t++)
{
    Kernel1<<<dimGrid, dimBlock>>>(M, A, Size2, t);
    cudaThreadSynchronize();
    Kernel2<<<dimGridXY, dimBlockXY>>>(M, A, B, Size, t);
    cudaThreadSynchronize();
}
// HIP source (output of HIPIFY)
hipMemcpy(M, m, Size2, hipMemcpyHostToDevice);
hipMemcpy(A, a, Size2, hipMemcpyHostToDevice);
hipMemcpy(B, b, Size, hipMemcpyHostToDevice);
dim3 dimBlockXY(blockSize2d, blockSize2d);
dim3 dimGridXY(gridSize2d, gridSize2d);
for (t = 0; t < (Size - 1); t++)
{
    hipLaunchKernel(Kernel1, dimGrid, dimBlock, 0, 0, M, A, Size2, t);
    hipDeviceSynchronize();
    hipLaunchKernel(Kernel2, dimGridXY, dimBlockXY, 0, 0, M, A, B, Size, t);
    hipDeviceSynchronize();
}
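The renaming visible in the two listings can be imitated with a few text substitutions. The sketch below is a toy illustration only, not the HIPIFY tool: it handles just the API names and the launch syntax that appear on this slide, and the function name `toy_hipify` is hypothetical.

```python
import re

# Only the names that appear on this slide; a real porting tool covers far more.
API_RENAMES = {
    "cudaMemcpyHostToDevice": "hipMemcpyHostToDevice",
    "cudaThreadSynchronize": "hipDeviceSynchronize",
    "cudaMemcpy": "hipMemcpy",
}

# kernel<<<grid, block>>>(args) -> hipLaunchKernel(kernel, grid, block, 0, 0, args)
LAUNCH = re.compile(r"(\w+)\s*<<<\s*(\w+)\s*,\s*(\w+)\s*>>>\s*\(")


def toy_hipify(source):
    """Rewrite a CUDA snippet into the HIP spelling shown on the slide."""
    out = LAUNCH.sub(r"hipLaunchKernel(\1, \2, \3, 0, 0, ", source)
    for cuda_name, hip_name in API_RENAMES.items():
        # Word boundaries keep cudaMemcpy from matching inside
        # cudaMemcpyHostToDevice.
        out = re.sub(r"\b" + cuda_name + r"\b", hip_name, out)
    return out


if __name__ == "__main__":
    print(toy_hipify("Kernel1<<<dimGrid,dimBlock>>>(M,A,Size2,t);"))
    print(toy_hipify("cudaThreadSynchronize();"))
```

The sketch assumes every kernel takes at least one argument and that the launch configuration consists of exactly two plain identifiers, as in the listing; anything more elaborate needs a real parser.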
Playing “Deep Bass” with Anaconda
Numba speeds up your applications with high-performance functions written directly in Python
Rich set of options to optimize for APU or discrete GPU
‒ Async execution, specify group size, use shared memory
‒ Data transfer can be performed implicitly based on kernel arguments
ANACONDA on ROCm: a New Class of Performance for Python
Embracing Python Developer Community with Heterogeneous Acceleration
Kick up the Beat
Focusing on Solutions & Building Out Key Foundations to Support Libraries, Frameworks and Applications via GPUOpen
‒ Machine Learning and Deep Neural Network
‒ Data/Graph Analytics
‒ Molecular and Atomic Simulation
‒ Energy and Signal Processing
ROCm: A Stage For Your “Songs”
Build a Rich Foundation of Libraries and Applications
Core math libraries
‒ BLAS ( Skinny, Non-Square, Square ) SGEMM & DGEMM
‒ FFT
‒ SPARSE
‒ RAND
Third-party Frameworks
‒ Kokkos C++ Framework
‒ CHARM++ Frameworks
‒ HPX Framework ( in development )
Benchmarks and Applications
‒ SHOC
‒ Parboil
‒ LULESH
‒ MiniFE
‒ Cloverleaf
‒ NAMD
‒ LeanMD
‒ CoMD
‒ MiniAMR
‒ HPGMG
‒ XSBench
Rocking the neural pathways
Supporting Key Neural Network Frameworks
Torch 7 and Caffe
mlOpen
Optimized Convolution Neural Network for NN Frameworks
OpenVX with Graph Optimizer
Foundation for rich Machine Learning
Instinctive Computing foundation for Machine Learning and Neural Networks
AMD OPEN SOURCE PROJECTS
ACL – the AMD Compute Libraries are a set of open source solutions, providing developers with libraries targeted to those who want to accelerate computations on GPUs, APUs and CPUs.
HIP – C++ Heterogeneous-Compute Interface for Portability, a C++ runtime API and kernel language that allows developers to create portable applications that can run on AMD and other GPUs.
Bolt C++ – an STL compatible library for creating accelerated data parallel applications
Aparapi – an API for expressing data parallel workloads in Java™ and a runtime component capable of converting the Java™ bytecode of compatible workloads into OpenCL™
Introducing: The Boltzmann Initiative – A Defining Moment for Heterogeneous Computing
Graphics driver – open source Linux driver for Radeon graphics cards
ACL – AMD COMPUTE LIBRARIES
GPU Libraries
‒ clBLAS: The primary goal of clBLAS is to make it easier for developers to utilize the inherent performance and power efficiency benefits of heterogeneous computing. clBLAS interfaces neither hide nor wrap OpenCL™ interfaces; they leave OpenCL™ state management under the control of the user to allow for maximum performance and flexibility.
‒ clRNG: clRNG 1.0 is an OpenCL™ library that generates uniform random numbers. It has both host-side and device-side interfaces to give the developer sufficient control. The host-side interfaces are native C APIs.
‒ clSPARSE: clSPARSE is an OpenCL™ library implementing sparse linear algebra. In the spirit of the other clMath libraries, it exports a “C” interface so that projects can build wrappers around clSPARSE in any language they need. The project is the result of a collaboration between AMD Inc. and Vratis Ltd.
‒ clFFT: a software library containing Fast Fourier Transform (FFT) functions written in OpenCL™. In addition to GPU devices, the library also supports running on CPU devices to facilitate debugging and heterogeneous programming.
http://developer.amd.com/tools-and-sdks/opencl-zone/acl-amd-compute-libraries/
ACL – AMD COMPUTE LIBRARIES
CPU Libraries
‒ BLIS: a portable framework for instantiating BLAS-like libraries. The framework was designed to isolate essential kernels of computation that, when optimized, immediately enable optimized implementations of most of its commonly used and computationally intensive operations.
‒ libflame: a portable library for dense matrix computations, providing much of the functionality present in LAPACK. libflame includes a compatibility layer, FLAPACK, which includes a complete LAPACK implementation.
AMD CONTRIBUTIONS TO OPEN SOURCE MACHINE LEARNING
Caffe – a deep learning framework made with expression, speed, and modularity in mind.
Torch – a scientific computing framework with wide support for machine learning algorithms.
Numba – speed up your applications with high-performance functions written directly in Python.
libAV – cross-platform tools and libraries to convert, manipulate, and stream a wide range of multimedia formats and protocols.
CAFFE
HCCAFFE is a Caffe-based deep-learning framework that supports the HCC compiler and Boltzmann runtime
The Caffe framework has been ported using HCC and the C++ AMP dialect.
HcCaffe includes a number of models for test:
‒ Alexnet
‒ Googlenet
‒ Cifar10
‒ Imagenet
‒ Mnist
‒ Feature extraction
https://bitbucket.org/multicoreware/hccaffe
TORCH
HCTORCH is a Torch-based deep-learning framework that supports the HCC compiler and Boltzmann runtime
The Torch framework has been ported using HCC
https://bitbucket.org/multicoreware/hctorch/
MapReduce in 30 seconds: Estimating PI
1- Pick random points in the unit square
2- Count the fraction that falls inside the circle
Area: π/4, so π ≈ 4 × (fraction inside)
Map-Reduce:
Map: is the random point inside? Emit k=1, v=1; else emit k=0, v=1
Reduce: count the 0 keys and count the 1 keys
Programmer: writes { map, reduce } methods; the system does the rest
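The map and reduce steps described on this slide can be sketched in a few lines of single-process Python. This is an illustration of the programming model only, not Hadoop code: the function names are hypothetical, and the circle is assumed to be the quarter circle of radius 1 centred on a corner of the unit square, which covers π/4 of it.

```python
import random
from collections import defaultdict


def map_point(point):
    """Map: emit key 1 if the point lies inside the circle, else key 0; value 1."""
    x, y = point
    key = 1 if x * x + y * y <= 1.0 else 0
    return key, 1


def reduce_counts(pairs):
    """Reduce: sum the values emitted for each key."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return counts


def estimate_pi(num_points, seed=0):
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    points = ((rng.random(), rng.random()) for _ in range(num_points))
    counts = reduce_counts(map_point(p) for p in points)
    inside, outside = counts[1], counts[0]
    # The quarter circle covers pi/4 of the unit square.
    return 4.0 * inside / (inside + outside)


if __name__ == "__main__":
    print(estimate_pi(100_000))
```

Because each point is mapped independently and the reduce step only sums counts, the map calls can be spread over any number of workers without changing the result.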
HADOOP
• Open source implementation of MapReduce programming model
• Runs on distributed network in several Java VMs
• Distributed file system, reliability guarantees, speculative execution, Java programming language and libraries, implicit parallelism
[Figure: three Map tasks emit values V1, V2, V3; after a barrier, two Reduce tasks consume them]
TARGET ARCHITECTURE
Target: CLUSTER of APUs
Two-Level Parallelism:
‒ Across nodes in cluster
‒ Within Node (APU)
‒ Multicore (CPU)
‒ Data parallel (GPU)
NESTED PROCESSING APPROACH
PARTITION WORK LOCALLY
COMBINE RESULTS
nest: CPU+GPU
WHY APUS – MAP REDUCE-REDUCE
[Figure: the Map/Reduce diagram (three Map tasks emitting V1, V2, V3 to two Reduce tasks after a barrier) replicated per node, followed by further Reduce stages that exchange results over the network]
SPLIT: enough work for one GPU
NO MEMORY LIMIT
CPU+GPU execution on each node
Final reduction in cluster
Aggregate node’s results
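The two-level scheme on this slide (reduce locally on each node, then aggregate the per-node results across the cluster) can be sketched serially as follows. This is an assumption-laden illustration, not HadoopCL code: the function names are hypothetical, nodes and splits are simulated with list slices, and the combine function is assumed to be associative with `initial` as its identity.

```python
from functools import reduce


def split(items, parts):
    """Partition items into `parts` nearly equal contiguous chunks."""
    size, extra = divmod(len(items), parts)
    chunks, start = [], 0
    for p in range(parts):
        end = start + size + (1 if p < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def node_reduce(chunk, splits_per_node, combine, initial):
    """Level 1: reduce each local split, then combine the partials on the node."""
    partials = [reduce(combine, part, initial)
                for part in split(chunk, splits_per_node)]
    return reduce(combine, partials, initial)


def cluster_reduce(items, nodes, splits_per_node, combine, initial):
    """Level 2: aggregate the per-node results into the final value."""
    node_results = [node_reduce(chunk, splits_per_node, combine, initial)
                    for chunk in split(items, nodes)]
    return reduce(combine, node_results, initial)


values = list(range(1, 101))
total = cluster_reduce(values, nodes=3, splits_per_node=4,
                       combine=lambda a, b: a + b, initial=0)
```

In a real deployment each `node_reduce` call would run concurrently on a different node, so only one partial result per node has to cross the network for the final reduction.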
HADOOPCL
HadoopCL: MapReduce on Distributed Heterogeneous Platforms Through Seamless Integration of Hadoop and OpenCL
Max Grossman (1), Mauricio Breternitz (2), Vivek Sarkar (1)
(1) Rice University, (2) AMD Research
2013 International Workshop on High Performance Data Intensive Computing, May 2013.
Implemented with AMD’s APARAPI:
• Java methods -> GPU
Collaboration with Rice University:
M. Grossman, M. Breternitz, V. Sarkar. “HadoopCL2: Motivating the Design of a Distributed, Heterogeneous Programming System With Machine-Learning Applications.” IEEE Transactions on Parallel and Distributed Systems, 2014
IEEE Transactions on Parallel and Distributed Systems 27.3 (2016): 762-775
AMD A10-7850K APU: 4 CPU cores at 3.7 GHz, Radeon GPU, 2 GB global memory, 14.5 GB of system memory
OPEN-SOURCE – APACHE SPARK ACCELERATION
COLLABORATION: RICE UNIVERSITY
PROF. VIVEK SARKAR, MAX GROSSMAN
Tool: aparapi-swat from https://github.com/agrippa/aparapi-swat, where there is a build.sh.
SWAT: https://github.com/agrippa/spark-swat, a Maven project: cd $SWAT_HOME/swat/ && mvn clean
clSPARSE: An OpenCL™ Sparse BLAS Library
SINGLE PRECISION SPMV – OPEN SOURCE
[Figure: single-precision SpMV GFLOPS (axis 0 to 105) per matrix, ViennaCL-AMD versus clSPARSE-AMD. Matrices: Dense8, Protein, FEM-Spheres, FEM-…, WindTunnel, FEM-Harbor, FEM-Ship, Economics, FEM-…, Circuit, Webbase, circuit5M, eu-2005, Ga41As41H72, in-2004, mip1, Si41Ge41H72, ASIC_680k, dc2, FullChip, ins2, bone010, crankseg_2, ldoor, rajat31, Rucci1, boyd2, sls, transient, GeoMean]
clSPARSE 2.5x faster on average
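For readers unfamiliar with the benchmarked operation, SpMV multiplies a sparse matrix by a dense vector. The sketch below is a serial reference in plain Python, not the clSPARSE API: the function name `spmv_csr` is hypothetical, and compressed sparse row (CSR) storage is assumed purely for illustration.

```python
def spmv_csr(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a matrix A stored in CSR form."""
    y = [0.0] * (len(row_ptr) - 1)
    for row in range(len(y)):
        # Nonzeros of this row occupy values[row_ptr[row]:row_ptr[row + 1]].
        for k in range(row_ptr[row], row_ptr[row + 1]):
            y[row] += values[k] * x[col_idx[k]]
    return y


# 3x3 example matrix:
# [[10, 0, 2],
#  [ 0, 3, 0],
#  [ 1, 0, 4]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [10.0, 2.0, 3.0, 1.0, 4.0]
x = [1.0, 2.0, 3.0]
y = spmv_csr(row_ptr, col_idx, values, x)
```

Each row costs two floating-point operations per stored nonzero but reads a value, a column index and an entry of `x`, so the operation is usually limited by memory bandwidth rather than arithmetic, and rows with very different nonzero counts make load balancing on a GPU harder.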
ISCA’2016
DISCLAIMER & ATTRIBUTION
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
ATTRIBUTION
© 2016 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo, AMD Radeon, AMD Catalyst, and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. OpenCL is a trademark of Apple Inc. used by permission by Khronos. Other names are for informational purposes only and may be trademarks of their respective owners.