
  • School of Computer Science and Statistics

    Deep Neural Network Algorithms on Graphics Processors for Embedded

    Systems

    David Cronin 13319942

    10 May 2018

    A dissertation submitted in partial fulfilment of the requirements for the degree of

    Magister in Arte Ingeniaria at The College of the Holy and Undivided Trinity of Queen Elizabeth near Dublin

    with the supervision of Prof. David Gregg and Dr. Andrew Anderson

    Submitted to the University of Dublin, Trinity College, May, 2018

    http://www.scss.tcd.ie

  • Declaration

    I, David Cronin, declare that the following dissertation, except where otherwise stated, is entirely my own work; that it has not previously been submitted as an exercise for a degree, either in Trinity College Dublin, or in any other University; and that the library may lend or copy it or any part thereof on request.

    I have read and I understand the plagiarism provisions in the General Regulations of the University Calendar for the current year, found at http://www.tcd.ie/calendar.

    I have also completed the Online Tutorial on avoiding plagiarism ‘Ready Steady Write’, located at http://tcd-ie.libguides.com/plagiarism/ready-steady-write.

    Signed: Date:



  • Abstract

    Deep Neural Network Algorithms on Graphics Processors for Embedded Systems by David Cronin

    Deep neural networks are becoming an increasingly popular method for tackling a range of problems, particularly those of recognition. Much attention has been given to improving their performance by utilising graphics processors to perform key computations. However, this focus is primarily on computers with very high performance components. This thesis tackles the idea of performing these tasks on low-powered embedded devices, often built with components that have comparatively low performance. Accordingly, this thesis describes the implementation of an algorithm that performs a key computation associated with deep neural networks, specifically matrix multiplication.

    The algorithm described can target graphics processors using OpenGL on a large range of devices, scaling from small low-powered embedded devices with ARM architectures and integrated graphics processors to large high-powered desktop workstations with Intel architectures and discrete graphics processors. The implementation is also portable and cross-platform, supporting both macOS and Linux. This is achieved by repurposing OpenGL functionality, originally intended for graphics processing, to perform more general-purpose computations.

    This thesis demonstrates that it is possible to achieve improved execution times when performing matrix multiplication by using OpenGL efficiently to target the graphics processor. This thesis also demonstrates, as a proof of concept, that this algorithm can compile and run on a small low-powered embedded device for more limited computations. Further work is also described that could be done to achieve improved results. By being able to perform the work of deep neural networks on these embedded devices, a whole range of new applications could become possible, as these devices would be more capable of interpreting their environments using image and audio processing.


  • Acknowledgements

    I would like to thank my project supervisors, David Gregg and Andrew Anderson, for their advice and encouragement throughout this project. My thanks to Andrew in particular for helping me to integrate my work into the triNNity library and for reliably responding to my emails with helpful pointers and suggestions, even at the weekend.


  • Contents

    1 Introduction 1
    1.1 Research Questions 1
    1.2 Objectives 1
    1.3 Purpose of the Research 2
    1.4 Layout of the thesis 4

    2 Background 5
    2.1 Machine Learning 5
    2.2 Artificial Neural Networks 7
    2.3 Convolutional Neural Networks 10
    2.3.1 Fully Connected Layer 11
    2.3.2 Pooling Layer 11
    2.3.3 Convolution Layer 12
    2.4 GEMM 14
    2.4.1 Dot Product 14
    2.4.2 Matrix Multiplication 14
    2.4.3 Fully Connected Layer 15
    2.4.4 Convolution Layer 15
    2.4.5 Implementations 17
    2.5 Graphical Processing Unit 18
    2.5.1 Architecture 18
    2.5.2 OpenGL 19
    2.5.3 OpenGL ES 20

    3 Implementation 21
    3.1 Setup 22
    3.2 Transferring Data to the GPU 23
    3.3 Performing the Computation 26
    3.4 Retrieving Results from the GPU 27
    3.5 Transpositions 28
    3.5.1 Matrix A and Matrix B 29
    3.5.2 Matrix A and Matrix Bᵀ 30
    3.5.3 Matrix Aᵀ and Matrix B 31
    3.5.4 Matrix Aᵀ and Matrix Bᵀ 32
    3.6 Shader Programmes 33
    3.7 Error Checking 35
    3.8 Final Steps 35

    4 Evaluation 37
    4.1 Benchmarking 38
    4.1.1 Compilation 38
    4.1.2 Validation and Results 39
    4.1.3 GEMM Variations 39
    4.1.4 Scenarios 40
    4.2 Results 41
    4.2.1 MacBook Air 41
    4.2.2 ASUS Tinkerboard 43
    4.3 Discussion 44
    4.4 Further Work 46

    5 Conclusion 48

    A1 Appendix 54
    A1.1 Source code developed for the implementation 54


  • List of Figures

    2.1 Rosenblatt's Perceptron 7
    2.2 Single Layer Perceptron Network 8
    2.3 Multi-Layer Perceptron Network 9
    2.4 An example of a Convolutional Neural Network 10
    2.5 A diagram of an Input Layer and a Fully Connected Layer 11
    2.6 An illustration of a Max Pooling Layer 12
    2.7 An illustration of a Convolution Layer 13
    2.8 Example kernels with values or weights for three different patterns 13
    2.9 Example of a kernel applied to an input 13
    2.10 An illustration of Matrix Multiplication 15
    2.11 A Fully Connected Layer implemented with Matrix Multiplication 15
    2.12 The creation of a Patch Matrix 16
    2.13 Patch Matrix and Kernel Matrix being multiplied to perform convolution 16
    2.14 im2row and im2col Patch Matrices 17
    2.15 Single Instruction Multiple Data Stream 18
    2.16 Programmable Graphics Pipeline 19

    3.1 How Transform Feedback fits into the Graphics Pipeline 27
    3.2 Layout of Matrix A and B, and how B is arranged in texture memory 29
    3.3 Layout of Matrix A and Bᵀ, and how Bᵀ is arranged in texture memory 30
    3.4 Layout of Matrix Aᵀ and B, and how B is arranged in texture memory 31
    3.5 Layout of Matrix Aᵀ and Bᵀ, and how Bᵀ is arranged in texture memory 32

    4.1 Benchmark Results for the GEMM implementation on a MacBook Air 42


  • List of Tables

    2.1 Truth Table for Logical AND Operation 9

    4.1 The specifications of each convolution scenario tested 40
    4.2 Technical Specifications for MacBook Air