BLAS and Vectorization extensions.
Carlos Pachajoa
December 5, 2012
Contents
BLAS
Vectorization extensions for X86
GPGPU
BLAS
Stands for Basic Linear Algebra Subprograms
It is an interface for linear algebra operations. BLAS itself is a specification for Fortran. The equivalent interface in C is CBLAS.
Use the local implementation with #include <cblas.h>
BLAS levels
The operations are divided in three levels:
I Level 1: Vector operations like y ← αx + y, x, y ∈ ℝ^N,
and also dot products and vector norms.
I Level 2: Matrix-vector operations like y ← αAx + βy, x, y ∈ ℝ^N, A ∈ ℝ^(M×N),
and solutions for triangular systems, among others.
I Level 3: Matrix-matrix operations like C ← αAB + βC, C ∈ ℝ^(M×N), A ∈ ℝ^(M×P), B ∈ ℝ^(P×N)
There are different calls for different precisions and for real or complex numbers.
BLAS function naming conventions
The first letter specifies precision:
I S for real, single precision.
I D for real, double precision.
I C for complex, single precision.
I Z for complex, double precision.
The first letter is followed by a function name; for example, xAXPY is y ← αx + y from level one. Here, x is a placeholder for the number space and precision. Therefore, SAXPY performs the operation using single-precision floating-point numbers.
CBLAS data representation
CBLAS receives data as contiguous positions in memory and a size. Both matrices and vectors are stored in this way. To specify a matrix, one additionally has to provide a stride (the leading dimension) and define whether it is row- or column-major.
{1,2,3,4,5,6,7,8,9} will be

1 2 3
4 5 6
7 8 9

in row-major order, and

1 4 7
2 5 8
3 6 9

in column-major order.
The ordering is given using this enumeration:
enum CBLAS_ORDER {CblasRowMajor=101, CblasColMajor=102};
A function signature
y← αAx + βy, y← αATx + βy
void cblas_sgemv(const enum CBLAS_ORDER Order,
                 const enum CBLAS_TRANSPOSE TransA,
                 const int M,
                 const int N,
                 const float alpha,
                 const float *A,
                 const int lda,
                 const float *X,
                 const int incX,
                 const float beta,
                 float *Y,
                 const int incY
);
Enumeration types
enum CBLAS_ORDER     {CblasRowMajor=101,   /* row-major arrays */
                      CblasColMajor=102};  /* column-major arrays */

enum CBLAS_TRANSPOSE {                     // Whether to work with the transpose
                      CblasNoTrans=111,    /* trans='N' */
                      CblasTrans=112,      /* trans='T' */
                      CblasConjTrans=113}; /* trans='C' */

enum CBLAS_UPLO      {                     // The matrix is upper or lower triangular
                      CblasUpper=121,      /* uplo='U' */
                      CblasLower=122};     /* uplo='L' */

enum CBLAS_DIAG      {                     // Whether the matrix is unit triangular
                      CblasNonUnit=131,    /* diag='N' */
                      CblasUnit=132};      /* diag='U' */

enum CBLAS_SIDE      {                     // Order of matrix multiplication
                      CblasLeft=141,       /* side='L' */
                      CblasRight=142};     /* side='R' */
Some CBLAS implementations
I ATLAS (Automatically Tuned Linear Algebra Software)
I MKL (Math Kernel Library)
I CUBLAS
Documents with CBLAS routines
I http://math-atlas.sourceforge.net/psdoc/cblasqref.ps
I https://developer.apple.com/library/mac/documentation/Accelerate/Reference/BLAS_Ref/Reference/reference.html
SIMD
Single Instruction, Multiple Data
Taken from
http://archive.arstechnica.com/cpu/1q00/simd/figure6.gif
SSE
Streaming SIMD Extensions.
Additional registers in the processor and operations in the architecture.
http://en.wikipedia.org/wiki/File:XMM_registers.svg
8 registers with 128 bits each; each register holds 4 single-precision floating-point numbers.
Some instructions
; All instruction mnemonics end in S;
; the penultimate letter denotes scalar or vector:
; S stands for scalar, P for packed (vector).
; Operands are XMM registers.

; Adds all elements of op1 and op2 into op1
ADDPS op1, op2

; Adds the first element of op1 and op2 into
; the first position of op1
ADDSS op1, op2
Some SSE instructions
vec_res.x = v1.x + v2.x;
vec_res.y = v1.y + v2.y;
vec_res.z = v1.z + v2.z;
vec_res.w = v1.w + v2.w;

; xmm0 = v1.w | v1.z | v1.y | v1.x
movaps xmm0, [v1]
; xmm0 = v1.w+v2.w | v1.z+v2.z | v1.y+v2.y | v1.x+v2.x
addps xmm0, [v2]
; store the result back
movaps [vec_res], xmm0
http://en.wikipedia.org/wiki/Streaming_SIMD_Extensions
SSE upgrades
SSE2 Allows multiple types of data fitting in the vectors, including integers and characters, and performs the corresponding operations. AMD's 64-bit implementation also doubled the number of XMM registers.
SSE3 Addition of horizontal operations within the XMM registers, such as data reduction.
SSSE3 Additional instructions for SSE3.
SSE4 Introduction of Dword multiply, which multiplies two pairs of 32-bit integers to produce two 64-bit numbers. Also adds vector dot products.
AVX
Intel’s extension to SSE for the Sandy Bridge microarchitecture,introduced in 2011. Also available in AMD’s Bulldozer.
http://upload.wikimedia.org/wikipedia/commons/f/f5/AVX_registers.svg
Automatic vectorization
The Intel compiler can, under certain conditions, vectorize loops inthe code.
This can be activated by using the -vec option.
for(i=0; i<SIZE; i++)
    A[i] = B[i] + C[i];
Obstacles to vectorization
I Non-contiguous memory access
for(i=0; i < SIZE; i+=stride) A[i] = B[i] + C[i];
I Data dependencies
for(i=0; i < SIZE; i++) A[i] = A[i-1] + B[i];
Guiding ICC vectorization
I Pragmas, for example #pragma ivdep, among others, to control when to vectorize a loop.
I Keywords, such as restrict.
I Switches passed to the compiler, such as optimization levels.
Look at the ICC automatic vectorization documentation [11].
GPGPU
http://blogs.nvidia.com/2009/12/whats-the-difference-between-a-cpu-and-a-gpu/
GPGPU and CPU
CPU
I General purpose
I Pipelines
I Few threads
I Lots of cache (exploiting data correlation)
GPGPU
I Specialized for local vector operations
I Many cores and threads
I Little cache
I Lower power consumption relative to a CPU
CUDA
Stands for Compute Unified Device Architecture.
Effectively, a programming model to access and control GPUs using a virtual instruction set, in a similar manner to a CPU.
Only supported on NVIDIA cards.
Uses the NVIDIA compiler, and can be programmed using CUDA C/C++, languages based on C/C++.
OpenCL
Stands for Open Computing Language
It also provides access to the GPU.
I It’s an open standard, supported by NVIDIA and AMD,among others.
I Provides a language based on C99.
I Functionality provided by a driver.
I Compilation handled by linking to the correct library.
References
[1] http://en.wikipedia.org/wiki/Streaming_SIMD_Extensions
[2] http://www.netlib.org/blas/
[3] http://www.stanford.edu/class/me200c/tutorial_77/18.1_blas.html
[4] http://math-atlas.sourceforge.net/faq.html
[5] http://software.intel.com/sites/products/documentation/hpc/mkl/mklman/GUID-2BCA8900-BD2F-4A15-9044-0AA23D07D0D2.htm
[6] https://developer.nvidia.com/cublas
[7] http://www.godevtool.com/TestbugHelp/XMMfpins.htm
[8] http://software.intel.com/en-us/avx
[9] http://www.khronos.org/opencl/
[10] http://developer.download.nvidia.com/CUDA/training/GTC_Express_Sarah_Tariq_June2011.pdf
[11] http://software.intel.com/sites/products/documentation/hpc/composerxe/en-us/2011Update/cpp/lin/optaps/common/optaps_vec_use.htm