Building end-to-end ML workflows with Arm
Gian Marco Iodice, Tech Lead ACL, Arm
Wei Xiao, Principal Evangelist AI Ecosystem, Arm


TRANSCRIPT

Page 1: Building end-to-end ML workflows with Arm

Building end-to-end ML workflows with Arm
Gian Marco Iodice, Tech Lead ACL, Arm

Wei Xiao, Principal Evangelist AI Ecosystem, Arm

Page 2: Building end-to-end ML workflows with Arm

Agenda

• Introduction: Presenting ArmNN and Compute Library

• Q/A – Break

• Hands-on session: Preventing Disaster with CNN

• Q/A – Break

• Performance Analysis for Deep Learning Inference

• Wrap up

Page 3: Building end-to-end ML workflows with Arm

Agenda

• Introduction: Presenting ArmNN and Compute Library

• Q/A – Break

• Hands-on session: Preventing Disaster with CNN

• Q/A – Break

• Hands-on session: Performance Analysis for Deep Learning Inference

• Wrap up

Page 4: Building end-to-end ML workflows with Arm

ArmNN and Compute Library

Page 5: Building end-to-end ML workflows with Arm

ArmNN and Compute Library

Free and open-source SW libraries developed by Arm for ML inference applications

• Both support Android and Linux

• Official releases published together quarterly on GitHub

• Development branch available on Linaro

Page 6: Building end-to-end ML workflows with Arm

Linaro AI Initiative
The home of ArmNN and the Compute Library, bringing companies together to develop best-in-class Deep Learning performance on Arm

https://mlplatform.org
https://mlplatform.org/contributing

Page 7: Building end-to-end ML workflows with Arm

Challenges Addressed
Deploying ML applications raises challenges such as:

• Framework/code/performance portability
• Code optimization on specific architectures

ArmNN and the Compute Library were developed to address these challenges: they aim to make the deployment of intelligent vision applications easy and performant on Arm-based platforms

Page 8: Building end-to-end ML workflows with Arm

SW Architecture Overview
• An ML workload can use ArmNN or the Compute Library to access CPU and GPU acceleration

• Only ArmNN provides access to the Arm Ethos NPU

• Only ArmNN provides parsers for 3rd-party libraries (TensorFlow, TensorFlow Lite, Caffe, ONNX, ...)

• The Arm Android NN HAL driver provides access to ArmNN for Android applications

[Diagram: the ML workload reaches the Cortex-A CPU and Mali GPU through the Compute Library, either directly or via ArmNN (itself reachable through the Android NN HAL driver or a 3rd-party ML library such as TensorFlow or Caffe); the Ethos NPU is reached through ArmNN and the NPU driver]

Page 9: Building end-to-end ML workflows with Arm

ArmNN
The inference engine that enables efficient deployment of ML workloads on Arm Cortex-A CPUs, Mali GPUs and Ethos NPUs

• ArmNN is also available through the Android NN API

• The ArmNN SDK includes tools and parsers for various frameworks (e.g. TensorFlow, TensorFlow Lite, Caffe, ...) and uses the Compute Library to target Arm Cortex-A CPUs and Arm Mali GPUs

https://developer.arm.com/ip-products/processors/machine-learning/arm-nn

Page 10: Building end-to-end ML workflows with Arm

Arm Compute Library (ACL)
A bundle of optimized functions for ML, Computer Vision and Image Processing on Arm Cortex-A CPUs and Arm Mali GPUs

• Acceleration on Arm CPUs through Neon/SVE (aarch32/aarch64)

• Acceleration on Arm Mali GPUs through OpenCL

• Over 120 functions implemented!

https://developer.arm.com/technologies/compute-library

Page 11: Building end-to-end ML workflows with Arm

Functions

Machine learning
• Activation
• Convolution
• Depth-wise Convolution
• Normalization
• Pooling
• Softmax
• and many more…

Computer vision
• Canny Edge
• Harris corner
• HOG
• Gaussian Pyramid
• Gradient
• Optical Flow
• and many more…

Image processing
• Colour convert
• Dilate
• Gaussian/Sobel 3x3/5x5
• Histogram equalization
• Remap
• Warp Affine/Perspective
• and many more…

Page 12: Building end-to-end ML workflows with Arm

Key Features
• Different data type support for CPU and GPU: FP32/FP16/Int8/Uint8

• More than 40 examples ready to be profiled with our benchmark test suite

• Different algorithms available for convolution layer (GEMM, Winograd, FFT and Direct)

• Memory manager

• Micro-architecture optimizations for key algorithms like GEMM or Winograd

• OpenCL tuner

• Fast math support

Page 13: Building end-to-end ML workflows with Arm

OpenCL Tuner
Simply tweaking the number of work-items in a work-group can have a huge performance impact.

Setting the optimal local work-group size (LWS) can be tricky because of:

• Cache size

• Maximum number of threads per compute unit

• Work-group dispatching

• Input and output dimensions

• …

In ACL we implemented the OpenCL tuner to look for the optimal LWS
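The search the tuner performs can be pictured with a toy sketch (ours, not ACL code): exhaustively time the kernel once per candidate LWS and keep the fastest. The `run_kernel` cost model below is a made-up stand-in for launching and timing a real OpenCL kernel.

```python
def run_kernel(lws):
    """Stand-in for launching and timing an OpenCL kernel with the given
    local work-group size; a made-up cost model replaces real timing."""
    x, y = lws
    # toy model: penalize work-group sizes far from 64 and non-square shapes
    return abs(x * y - 64) + 0.1 * abs(x - y)

def tune_lws(candidates):
    """Exhaustive tuning: time the kernel for every candidate LWS, keep the fastest."""
    return min(candidates, key=run_kernel)

candidates = [(x, y) for x in (1, 2, 4, 8, 16) for y in (1, 2, 4, 8, 16)]
best = tune_lws(candidates)  # (8, 8) under this toy cost model
```

ACL's Exhaustive, Normal and Rapid levels effectively shrink or grow the candidate list, trading tuning time for the quality of the LWS found.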

Page 14: Building end-to-end ML workflows with Arm

OpenCL Tuner Improvement

[Bar chart: performance improvement (%), higher is better, on a 0-60% axis, for the Exhaustive, Normal and Rapid tuning levels across AlexNet, GoogleNet, Inception V3, Inception V4, MobileNet, MobileNet v2, ResNet12, ResNet50, SqueezeNet and VGG16 in F32, plus MobileNet, MobileNet v2 and VGG VDSR in QASYMM8]

Three different levels of tuning are supported, trading off performance improvement against tuning time

Page 15: Building end-to-end ML workflows with Arm

Agenda

• Introduction: Presenting ArmNN and Compute Library

• Q/A

• Hands-on sessions

• Q/A

• Hands-on session: Performance Analysis for Deep Learning Inference

• Wrap up

Page 16: Building end-to-end ML workflows with Arm

Hands-on session

Page 17: Building end-to-end ML workflows with Arm

Let’s have fun on Raspberry Pi 4!

Page 18: Building end-to-end ML workflows with Arm

Raspberry Pi 4
A tiny but powerful development board with the mission to put the power of computing and digital making into the hands of people all over the world*

Raspberry Pi 4 is the latest version, released in June 2019

• Arm Cortex-A72 quad-core running at 1.5GHz

• Wifi, Bluetooth, Ethernet

• Dual monitor support (4K)

• and many more…

* https://www.raspberrypi.org/

Page 19: Building end-to-end ML workflows with Arm

What Will you Find in your Raspberry Pi?

• Raspbian OS pre-installed (September 2019 release) with serial communication and remote connections (ssh) enabled

• Pre-built binaries for ArmNN and Compute Library (19.08 release)

• Ready to use examples for the hands-on sessions

All the instructions to reproduce the labs from scratch can be found in the Backup section of the Support Material

Page 20: Building end-to-end ML workflows with Arm

Note about Raspbian OS

The Arm Cortex-A72 is an aarch64 processor, but Raspbian OS is based on an Armv7-A (aarch32) filesystem.

This means that the Arm Compute Library will not be able to call the routines optimized for aarch64

Page 21: Building end-to-end ML workflows with Arm

How to Control the Raspberry Pi
There are multiple ways to control the Raspberry Pi, depending on your needs. Assuming that:

• both host (your laptop) and Raspberry Pi are on the same network

• an internet connection and control of the desktop interface are needed

we can use the serial and Ethernet cables to control our board

Page 22: Building end-to-end ML workflows with Arm

Raspberry Pi 4: Connections

[Diagram: the laptop (host) talks to the Raspberry Pi 4 (target) over a USB serial cable; the board is powered from an AC socket through a USB-C power cable and connects to the network through a router/Ethernet socket]

Page 23: Building end-to-end ML workflows with Arm

Raspberry Pi 4: Serial Cable

[Diagram: the serial cable carries Vcc, Gnd, Tx and Rx between the laptop (host) and the Raspberry Pi 4 (target); Tx and Rx are swapped between the two ends. Wire colours: Red = Vcc, Black = Gnd, White = Tx, Green = Rx]

Page 24: Building end-to-end ML workflows with Arm

Step 1: Get the IP Address
• Open the serial terminal following the instructions provided

• Log in to Raspbian OS
  • User: pi
  • Password: raspberry

• In the terminal window, enter the command:

$ vncserver

• Copy the IP address along with the desktop number (e.g. 10.42.0.53:1)

Page 25: Building end-to-end ML workflows with Arm

Step 2: Control the Desktop Interface
• Install the VNC viewer for Google Chrome on your laptop, following the instructions provided

• Open the VNC app and enter the IP address and desktop number of the Raspberry Pi

• Log in to Raspbian OS using the same credentials as in step 1

Page 26: Building end-to-end ML workflows with Arm

Setup: https://tinyurl.com/u7a6jfv

Workshop instructions: https://tinyurl.com/uwvy59a

Page 27: Building end-to-end ML workflows with Arm

Now, we are ready to play!

Page 28: Building end-to-end ML workflows with Arm

Lab 1

Run Image Classification for Fire Detection

Page 29: Building end-to-end ML workflows with Arm

Arm NN Components

• Arm NN Core
  • Graph builder API
  • Optimizer
  • Runtime
  • Reference and Neon/CL backends via the Compute Library
  • New backend planned

• Parsers
  • TensorFlow
  • TensorFlow Lite
  • Caffe
  • ONNX

• Android NNAPI Driver

Page 30: Building end-to-end ML workflows with Arm

Arm NN Supported Layers … more to come!

Activation, Addition, BatchNormalization, BatchToSpaceND, Constant, Convolution2d, DepthwiseConvolution2d, DetectionPostProcess, Division, Equal, Floor, FullyConnected, Gather, Greater, Input, L2Normalization, LSTM, Maximum, Mean, MemCopy, Merger, Minimum, Multiplication, Normalization, Output, Pad, Permute, Pooling2d, Reshape, ResizeBilinear, Rsqrt, Softmax, SpaceToBatchND, Splitter, StridedSlice, Subtraction

Page 31: Building end-to-end ML workflows with Arm

Lab 2

Run Time Series ML Model

Page 32: Building end-to-end ML workflows with Arm

Agenda

• Introduction: Presenting ArmNN and Compute Library

• Q/A

• Hands-on session:

• Q/A

• Performance Analysis for Deep Learning Inference

• Wrap up

Page 33: Building end-to-end ML workflows with Arm

Performance Analysis for Deep Learning Inference

Page 34: Building end-to-end ML workflows with Arm

Introduction

What can we say about the following measurements?

SqueezeNet: 200 fps, 5 ms
MobileNet: 250 fps, 4 ms

Nothing...

• What processor?

• What frequency?

• How many threads?

• What data type?

Page 35: Building end-to-end ML workflows with Arm

Goals for Performance Analysis

1. Give meaning to our performance numbers: are they good or bad?

2. Understand performance bottlenecks to fix inefficiencies

• No lucky or casual optimizations

• Better optimization planning for the short and long term

We need to introduce a formal methodology

Page 36: Building end-to-end ML workflows with Arm

DNN Under Test (DUT)

In order to give meaning to our perf. numbers, we need to know:

• Hardware capabilities

• Algorithm complexity (number of ops required by the algorithm)

[Diagram: input (e.g. image, audio, text) → DNN → measurement (exec. time, fps)]

Page 37: Building end-to-end ML workflows with Arm

HW Capabilities vs Algorithm Complexity

Hardware capabilities:
• (Fl)ops/Core/Cycle: # arithmetic operations per core and per cycle
• MACs/Core/Cycle: # multiply-accumulate (mul+add) operations per core and per cycle
• Max external memory bandwidth (i.e. DDR): # bytes read from or stored into memory per unit of time (bytes/s)

Algorithm complexity:
• Sum of the ops required by each function/layer in the network
• Well defined for DNNs, and dominated mainly by the MAC operations in Convolution/Fully Connected layers
• Since a convolution layer can use multiple algorithms (GEMM, Winograd, FFT, …), we should count the ops required by the one actually executed
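As a rough illustration of such an op count, a convolution layer's cost can be estimated as follows (our helper, using the common 2-flops-per-MAC convention and ignoring bias, padding and algorithm-specific savings):

```python
def conv2d_gflops(out_w, out_h, kernel, in_ch, out_ch):
    """Estimate the GFlops of a convolution layer: each output element costs
    kernel*kernel*in_ch multiply-accumulates, and one MAC = 2 flops (mul + add)."""
    macs = out_w * out_h * kernel * kernel * in_ch * out_ch
    return 2 * macs / 1e9

# A VGG-style 3x3 convolution on a 224x224 output, 64 -> 64 channels
g = conv2d_gflops(224, 224, 3, 64, 64)  # about 3.7 GFlops
```

The same count divided by two gives GMACs, which is what hardware datasheets usually quote.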

Page 38: Building end-to-end ML workflows with Arm

Example: Algorithm Complexity

VGG16 layers | Parameters (W/H/Kernel/IFMs/OFMs) | Algorithm | GFlops
Conv1 | 224/224/3/3/64 | GEMM | 3.55
Conv2 | 224/224/3/64/64 | GEMM | 1.95
Conv3 | 112/112/3/64/128 | GEMM | 1.20
Conv4 | 112/112/3/128/128 | GEMM | 2.08
Conv5 | 56/56/3/128/256 | GEMM | 1.20
Conv6 / Conv7 | 56/56/3/256/256 | GEMM | 2.39 x2
… | … | … | …
Conv11 / Conv12 / Conv13 | 14/14/3/512/512 | GEMM | 0.354 x3
FC1 | 1/1/1/4096 | GEMM | 0.0000502
FC2 | 1/1/1/4096 | GEMM | 0.00000819
FC3 | 1/1/1/1000 | GEMM | 0.00000819

Total GFlops: ~30

Page 39: Building end-to-end ML workflows with Arm

Processor Utilization
Expresses how well the algorithm uses the available processor computational resources

Putil = Tt / Ta ∈ [0, 1]

• Ta: actual execution time (time measured)
• Tt: theoretical execution time:

Tt = Ops_algorithm / (Ops_per_core_per_cycle × num_cores × freq)

where Ops_algorithm is the number of ops required by the algorithm and Ops_per_core_per_cycle is the processor's Ops/Core/Cycle

Page 40: Building end-to-end ML workflows with Arm

Static Memory Bound Analysis

Used to check whether an algorithm may be limited by memory transfers

Algorithm_MaxBW = (DR_total + DW_total) / Tt

• Tt: theoretical execution time
• DR_total: total data read
• DW_total: total data written

This should be compared with the external memory bandwidth (i.e. DDR)
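As a sketch (our helper functions, not library code), the check for a GEMM-shaped workload, counting 2*M*N*K flops and reusing the F32 fully-connected case that appears in Exercise 3 later in the deck:

```python
def gemm_theoretical_time(m, n, k, flops_core_cycle, cores, freq_hz):
    """Tt for an MxNxK matrix multiplication, counting 2*M*N*K flops."""
    return (2 * m * n * k) / (flops_core_cycle * cores * freq_hz)

def algorithm_max_bw(dr_gb, dw_gb, tt_s):
    """Algorithm_MaxBW = (DR_total + DW_total) / Tt, in GB/s."""
    return (dr_gb + dw_gb) / tt_s

# F32 fully-connected layer as GEMM: M=1, N=4096, K=25088 (DR/DW from the slides)
tt = gemm_theoretical_time(1, 4096, 25088, flops_core_cycle=16, cores=4, freq_hz=2e9)
bw = algorithm_max_bw(dr_gb=9.34e-5, dw_gb=0.38, tt_s=tt)
memory_bound = bw > 12.9  # needs far more than the 12.9 GB/s the DDR can provide
```

If the required bandwidth exceeds what the external memory can sustain, no amount of compute optimization will reach the theoretical time.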

Page 41: Building end-to-end ML workflows with Arm

Exercise 1

Feature | Value
GFlops network | 1.1
Actual execution time (ms) | 4
Input image size | 224x224
Flops/Core/Cycle | 512
Num. cores | 20
Frequency | 500 MHz
Max external memory bandwidth | 12.9 GB/s

Processor utilization? 5.6%. Really low!
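The formula from the Processor Utilization slide can be checked in a couple of lines (our helper; with these inputs it lands around 5%, confirming the "really low" verdict):

```python
def processor_utilization(ops, flops_core_cycle, cores, freq_hz, ta_s):
    """Putil = Tt / Ta, with Tt = ops / (flops/core/cycle * cores * freq)."""
    tt = ops / (flops_core_cycle * cores * freq_hz)
    return tt / ta_s

# Exercise 1: 1.1 GFlops network, 512 flops/core/cycle, 20 cores, 500 MHz, Ta = 4 ms
putil = processor_utilization(1.1e9, 512, 20, 500e6, 4e-3)  # roughly 0.05
```

The same function applied to Exercise 2's numbers returns a value above 1.0, which is exactly the red flag the next slide discusses.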

Page 42: Building end-to-end ML workflows with Arm

Exercise 2

Feature | Value
GFlops network | 34.3
Actual execution time (ms) | 200
Input image size | 224x224
Flops/Core/Cycle | 16
Num. cores | 4
Frequency | 2 GHz
Max external memory bandwidth | 12.9 GB/s

Processor utilization? 133.6%. It cannot be > 100%! Maybe the network GFlops figure is not correct?

Page 43: Building end-to-end ML workflows with Arm

Exercise 3

Feature | Value
Flops/Core/Cycle | 16
Num. cores | 4
Frequency | 2 GHz
Max external memory bandwidth | 12.9 GB/s

Data type | M, N, K | DR [GB] | DW [GB] | Tt [s] | Algorithm MaxBW [GB/s]
F32 | 2704, 256, 1152 | 0.01160 | 0.00109 | 0.012 | 1.019
F32 | 1, 4096, 25088 | 9.34E-5 | 0.38 | 0.0016 | 238.5

The second case is memory bound: it needs 238.5 GB/s, far above the 12.9 GB/s available!

Matrix multiplication:
• M: number of output rows
• N: number of output columns
• K: number of right-hand-side matrix rows

Page 44: Building end-to-end ML workflows with Arm

Formal Methodology
A top-down approach made up of increasingly granular investigations to find performance bottlenecks:

• Level 1: Graph (DNN) profiling: network timing

• Level 2: Function profiling: layer/function timing

• Level 3: HW counters profiling: per-function HW counters (L2 cache hits, mis-predictions, cycles)

Page 45: Building end-to-end ML workflows with Arm

Driver Overhead Estimation

[Diagram: graph profiling measures the whole-network time TN0, during which func0, func1 and func2 execute for t0, t1 and t2 with CPU/driver overhead in between (possibly while another processor runs); function profiling measures t0, t1 and t2 individually, but the profiler itself increases the overhead, so the network time TN1 it observes differs from TN0]

TN0 != TN1
TN0 != (t0 + t1 + t2)

Overhead = TN0 - (t0 + t1 + t2)
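Numerically the estimation is trivial once both profiling runs are available; a minimal sketch with hypothetical measurements:

```python
def driver_overhead(graph_time_s, function_times_s):
    """Overhead = TN0 - (t0 + t1 + ...): the whole-graph time from graph
    profiling minus the per-function times from function profiling."""
    return graph_time_s - sum(function_times_s)

# hypothetical numbers: 10 ms end-to-end, three functions totalling 8.5 ms
overhead = driver_overhead(10e-3, [4.0e-3, 3.0e-3, 1.5e-3])  # 1.5 ms of overhead
```

The key point from the slide carries over: the two inputs must come from separate runs, because the function profiler inflates the very overhead being measured.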

Page 46: Building end-to-end ML workflows with Arm

Driver Overhead Estimation: Streamline

Arm® Streamline Performance Analyzer is a system-wide visualizer and profiler for Arm hardware targets, and models of them

Page 47: Building end-to-end ML workflows with Arm

Locating and Fixing Inefficiencies
With graph and function profiling we can gather four important pieces of information:

1. Network utilization (Phase 1)
2. Overhead limitation (Phase 1)
3. Layer/function utilization (Phase 2)
4. Memory transfer limitation (Phase 2)

Page 48: Building end-to-end ML workflows with Arm

Phase 1

[Flowchart: graph profiling → is Putil_i below Threshold_util? If not, the DNN is considered optimized; otherwise run function profiling and evaluate the overhead → is the overhead below Threshold_overhead? Depending on the outcome, continue to Phase 2, perform overhead optimization, or go to Level 3 (HW counters profiling)]

Page 49: Building end-to-end ML workflows with Arm

Phase 2 (1 of 2)

[Flowchart: starting from function profiling, run the layer analysis: remove layers with % time spent below Threshold_t (e.g. < 1) and layers with Putil above Threshold_p (e.g. > 50); if the resulting list is empty, return to Phase 1, otherwise continue with the memory bandwidth evaluation on the next slide]

Layer analysis: ResNet12 (F32)

Layer | % time spent | % Putil
Conv12 | 61.2 | 5.3
Conv1 | 3.63 | 40.1
Conv4 | 3.15 | 58.3
Conv6 | 3.15 | 59.2
Conv3 | 3.11 | 61.3
Conv1 | 3.11 | 60.1
Conv5 | 3.1 | 59.3
Conv7 | 3.1 | 58.6
Activation12 | 0.01 | 3
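The pruning step of this flow can be sketched as follows (our helper; the thresholds come from the slide's "i.e. < 1" and "i.e. > 50" hints):

```python
def shortlist_layers(layers, t_threshold=1.0, putil_threshold=50.0):
    """Drop layers with % time spent < t_threshold and layers with
    Putil > putil_threshold; what remains is worth investigating."""
    return [name for name, pct_time, putil in layers
            if pct_time >= t_threshold and putil <= putil_threshold]

# ResNet12 (F32) numbers from the layer-analysis table: (layer, % time, % Putil)
layers = [("Conv12", 61.2, 5.3), ("Conv1", 3.63, 40.1), ("Conv4", 3.15, 58.3),
          ("Conv6", 3.15, 59.2), ("Conv3", 3.11, 61.3), ("Conv1", 3.11, 60.1),
          ("Conv5", 3.1, 59.3), ("Conv7", 3.1, 58.6), ("Activation12", 0.01, 3.0)]
suspects = shortlist_layers(layers)  # layers that are slow AND under-utilized
```

Here the shortlist keeps Conv12 (dominant runtime, very low utilization) and the low-utilization Conv1, which is exactly where the memory-bandwidth evaluation on the next slide focuses.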

Page 50: Building end-to-end ML workflows with Arm

Phase 2 (2 of 2)

Memory bandwidth evaluation

Layer, ResNet12 (F32) | Max memory bandwidth [GB/s]
Conv12 | 79.5 (GEMM-based)
Conv1 | 4.71 (GEMM-based)

Feature | Value
Flops/Core/Cycle | 16
Num. cores | 4
Frequency | 2 GHz
Max memory bandwidth | 12.9 GB/s

[Flowchart: is the layer memory bound? If not, go to Level 3 (HW counters profiling). If it is, can we use a different algorithm (e.g. FFT)? If yes, change the algorithm, update the required flops, and return to Phase 1. If not, can we change the DNN design? If yes, change the DNN design and return to Phase 1; otherwise go to Level 3 (HW counters profiling)]

Page 51: Building end-to-end ML workflows with Arm

MLPerf

A useful set of benchmarks for measuring training and inference performance of ML hardware, software and services, developed as a collaboration of companies and researchers

https://mlperf.org

Page 52: Building end-to-end ML workflows with Arm

Lab 3

Performance Evaluation with ACL

Page 53: Building end-to-end ML workflows with Arm

ACL Benchmark Framework

• Currently in an experimental phase

• Only available on the Arm Compute Library development branch (Linaro)

• Allows performance profiling of all the examples included in the “examples” folder

• For each example, it generates a new binary with prefix “benchmark_” in the build/tests folder

Page 54: Building end-to-end ML workflows with Arm

Benchmark Examples

Plenty of DNN examples are included in ACL and ready to be profiled, e.g.:

./benchmark_graph_mobilenet --example_args=--threads=1 --iterations=10
./benchmark_graph_mobilenet --example_args=--threads=4 --iterations=10
./benchmark_graph_mobilenet --example_args=--threads=4 --iterations=10 --instruments=scheduler_timer_ms