BLAS and Vectorization extensions.
Carlos Pachajoa
December 5, 2012
Contents
BLAS
Vectorization extensions for X86
GPGPU
BLAS
Stands for Basic Linear Algebra Subprograms
It is an interface for linear algebra operations. BLAS itself is a specification for Fortran. The equivalent interface in C is CBLAS.
Use the local implementation with #include <cblas.h>
BLAS levels
The operations are divided in three levels:
I Level 1: Vector operations like y ← αx + y, x, y ∈ ℝ^N,
and also dot products and vector norms.
I Level 2: Matrix-vector operations like y ← αAx + βy, x, y ∈ ℝ^N, A ∈ ℝ^(M×N),
and solutions for triangular systems, among others.
I Level 3: Matrix-matrix operations like C ← αAB + βC, C ∈ ℝ^(M×N), A ∈ ℝ^(M×P), B ∈ ℝ^(P×N)
There are different calls for different precisions and for real or complex numbers.
BLAS function naming conventions
The first letter specifies precision:
I S for real, single precision.
I D for real, double precision.
I C for complex, single precision.
I Z for complex, double precision.
The first letter is followed by a function name; for example, xAXPY is y ← αx + y from level one. Here, x is a placeholder for the number space and precision. Therefore, SAXPY performs the operation using single-precision floating-point numbers.
CBLAS data representation
CBLAS receives data as contiguous positions in memory and a size. Both matrices and vectors are stored in this way. To specify a matrix, one additionally has to provide a stride (the leading dimension) and define whether it is row- or column-major.
{1,2,3,4,5,6,7,8,9} will be

1 2 3
4 5 6
7 8 9

in row-major order, and

1 4 7
2 5 8
3 6 9

in column-major order.
The ordering is given using this enumeration:
enum CBLAS_ORDER {CblasRowMajor=101, CblasColMajor=102};
A function signature
y← αAx + βy, y← αATx + βy
void cblas_sgemv(const enum CBLAS_ORDER Order,
                 const enum CBLAS_TRANSPOSE TransA,
                 const int M,
                 const int N,
                 const float alpha,
                 const float *A,
                 const int lda,
                 const float *X,
                 const int incX,
                 const float beta,
                 float *Y,
                 const int incY
);
Enumeration types
enum CBLAS_ORDER     {CblasRowMajor=101,   /* row-major arrays */
                      CblasColMajor=102};  /* column-major arrays */

enum CBLAS_TRANSPOSE {                     // Whether to work with the transpose
                      CblasNoTrans=111,    /* trans='N' */
                      CblasTrans=112,      /* trans='T' */
                      CblasConjTrans=113}; /* trans='C' */

enum CBLAS_UPLO      {                     // The matrix is upper or lower triangular
                      CblasUpper=121,      /* uplo='U' */
                      CblasLower=122};     /* uplo='L' */

enum CBLAS_DIAG      {                     // Whether the matrix is unit triangular
                      CblasNonUnit=131,    /* diag='N' */
                      CblasUnit=132};      /* diag='U' */

enum CBLAS_SIDE      {                     // Order of matrix multiplication
                      CblasLeft=141,       /* side='L' */
                      CblasRight=142};     /* side='R' */
Some CBLAS implementations
I ATLAS (Automatically Tuned Linear Algebra Software)
I MKL (Math Kernel Library)
I CUBLAS
Documents with CBLAS routines
I http://math-atlas.sourceforge.net/psdoc/cblasqref.ps
I https://developer.apple.com/library/mac/documentation/Accelerate/Reference/BLAS_Ref/Reference/reference.html
SIMD
Single Instruction, Multiple Data
Taken from
http://archive.arstechnica.com/cpu/1q00/simd/figure6.gif
SSE
Streaming SIMD Extensions.
Additional registers in the processor and operations in the architecture.
http://en.wikipedia.org/wiki/File:XMM_registers.svg
8 registers with 128 bits each; each register holds 4 single-precision floating-point numbers.
Some instructions
; All instruction mnemonics end in S;
; the penultimate letter denotes scalar or vector:
; S stands for scalar, P for packed (vector).
; Operands are XMM registers.

; Adds all elements of op1 and op2 into op1
ADDPS op1, op2

; Adds the first element of op1 and op2 into
; the first position of op1
ADDSS op1, op2
Some SSE instructions
vec_res.x = v1.x + v2.x;
vec_res.y = v1.y + v2.y;
vec_res.z = v1.z + v2.z;
vec_res.w = v1.w + v2.w;

; xmm0 = v1.w | v1.z | v1.y | v1.x
movaps xmm0, [v1]
; xmm0 = v1.w+v2.w | v1.z+v2.z | v1.y+v2.y | v1.x+v2.x
addps xmm0, [v2]
; store the result back
movaps [vec_res], xmm0
http://en.wikipedia.org/wiki/Streaming_SIMD_Extensions
SSE upgrades
SSE2 Allows multiple types of data fitting in the vectors, including integers and characters, and performs the corresponding operations. AMD's 64-bit implementation also doubled the number of XMM registers.
SSE3 Addition of horizontal operations within the XMM registers, such as data reduction.
SSSE3 Additional instructions for SSE3.
SSE4 Introduction of Dword multiply, which multiplies two pairs of 32-bit integers to produce two 64-bit numbers. Also adds vector dot products.
AVX
Intel’s extension to SSE for the Sandy Bridge microarchitecture,introduced in 2011. Also available in AMD’s Bulldozer.
http://upload.wikimedia.org/wikipedia/commons/f/f5/AVX_registers.svg
Automatic vectorization
The Intel compiler can, under certain conditions, vectorize loops inthe code.
This can be activated by using the -vec option.
for(i=0; i<SIZE; i++)
    A[i] = B[i] + C[i];
Obstacles to vectorization
I Non-contiguous memory access
for(i=0; i < SIZE; i+=stride) A[i] = B[i] + C[i];
I Data dependencies
for(i=0; i < SIZE; i++) A[i] = A[i-1] + B[i];
Guiding ICC vectorization
I Pragmas, for example #pragma ivdep, among others, to control when to vectorize a loop.
I Keywords, such as restrict.
I Switches passed to the compiler, such as optimization levels.
Look at the ICC automatic vectorization documentation [11].
GPGPU
http://blogs.nvidia.com/2009/12/whats-the-difference-between-a-cpu-and-a-gpu/
GPGPU and CPU
CPU
I General purpose
I Pipelines
I Few threads
I Lots of cache (exploiting data correlation)
GPGPU
I Specialized for local vector operations
I Many cores and threads
I Little cache
I Lower power consumption relative to a CPU
CUDA
Stands for Compute Unified Device Architecture.
Effectively, a programming model to access and control GPUs using a virtual instruction set, in a similar manner to a CPU.
Only supported on NVIDIA cards.
Uses the NVIDIA compiler, and can be programmed using CUDA C/C++, languages based on C/C++.
OpenCL
Stands for Open Computing Language
It also provides access to the GPU.
I It’s an open standard, supported by NVIDIA and AMD,among others.
I Provides a language based on C99.
I Functionality provided by a driver.
I Compilation handled by linking to the correct library.
References
[1] http://en.wikipedia.org/wiki/Streaming_SIMD_Extensions
[2] http://www.netlib.org/blas/
[3] http://www.stanford.edu/class/me200c/tutorial_77/18.1_blas.html
[4] http://math-atlas.sourceforge.net/faq.html
[5] http://software.intel.com/sites/products/documentation/hpc/mkl/mklman/GUID-2BCA8900-BD2F-4A15-9044-0AA23D07D0D2.htm
[6] https://developer.nvidia.com/cublas
[7] http://www.godevtool.com/TestbugHelp/XMMfpins.htm
[8] http://software.intel.com/en-us/avx
[9] http://www.khronos.org/opencl/
[10] http://developer.download.nvidia.com/CUDA/training/GTC_Express_Sarah_Tariq_June2011.pdf
[11] http://software.intel.com/sites/products/documentation/hpc/composerxe/en-us/2011Update/cpp/lin/optaps/common/optaps_vec_use.htm