Building end-to-end ML workflows with Arm
TRANSCRIPT
Gian Marco Iodice, Tech Lead ACL, Arm
Wei Xiao, Principal Evangelist AI Ecosystem, Arm
Agenda
• Introduction: Presenting ArmNN and Compute Library
• Q/A – Break
• Hands-on session: Preventing Disaster with CNN
• Q/A – Break
• Hands-on session: Performance Analysis for Deep Learning Inference
• Wrap up
ArmNN and Compute Library
ArmNN and the Compute Library are free and open-source SW libraries developed by Arm for ML inference applications
• Both support Android and Linux
• Officially released together on GitHub every quarter
• Development branch available on Linaro
Linaro AI Initiative
The home of ArmNN and the Compute Library, bringing companies together to develop best-in-class Deep Learning performance on Arm
https://mlplatform.org https://mlplatform.org/contributing
Challenges Addressed
Deploying ML applications poses challenges such as:
• Framework/code/performance portability
• Code optimization for specific architectures
ArmNN and the Compute Library were developed to address these challenges: they aim to make the deployment of intelligent vision applications easy and performant on Arm-based platforms
SW Architecture Overview
• An ML workload can use ArmNN or the Compute Library to access CPU and GPU acceleration
• Only ArmNN provides access to the Arm Ethos NPU
• Only ArmNN provides parsers for 3rd-party libraries (TensorFlow, TensorFlow Lite, Caffe, ONNX, ...)
• The Arm Android NN HAL driver provides access to ArmNN for Android applications
[Architecture diagram: an ML workload (Android NN or a 3rd-party ML library such as TensorFlow or Caffe) sits on top of ArmNN; the HAL driver connects Android NN to ArmNN; ArmNN dispatches to the Compute Library for the Cortex-A CPU and Mali GPU, and to the NPU driver for the Ethos NPU]
ArmNN
The inference engine that enables the efficient deployment of ML workloads on Arm Cortex-A CPUs, Mali GPUs and Ethos NPUs
• ArmNN is also available through the Android NN API
• The ArmNN SDK includes tools and parsers for various frameworks (e.g. TensorFlow, TensorFlow Lite, Caffe, ...) and uses the Compute Library to target Arm Cortex-A CPUs and Arm Mali GPUs
https://developer.arm.com/ip-products/processors/machine-learning/arm-nn
Arm Compute Library (ACL)
A bundle of optimized functions for ML, Computer Vision and Image Processing on Arm Cortex-A CPUs and Arm Mali GPUs
• Provides acceleration on Arm CPUs through Neon/SVE (aarch32/aarch64)
• Provides acceleration on Arm Mali GPU through OpenCL
• Over 120 functions implemented!
https://developer.arm.com/technologies/compute-library
Functions
Machine learning:
• Activation
• Convolution
• Depth-wise Convolution
• Normalization
• Pooling
• Softmax
• and many more…
Computer vision:
• Canny Edge
• Harris corner
• HOG
• Gaussian Pyramid
• Gradient
• Optical Flow
• and many more…
Image processing:
• Colour convert
• Dilate
• Gaussian/Sobel 3x3/5x5
• Histogram equalization
• Remap
• Warp Affine/Perspective
• and many more…
Key Features
• Different data type support for CPU and GPU: FP32/FP16/Int8/Uint8
• More than 40 examples ready to be profiled with our benchmark test suite
• Different algorithms available for convolution layer (GEMM, Winograd, FFT and Direct)
• Memory manager
• Micro-architecture optimizations for key algorithms like GEMM or Winograd
• OpenCL tuner
• Fast math support
OpenCL Tuner
Simply tweaking the number of work-items in a work-group (the local work size, LWS) can have a huge performance impact
Setting the optimal LWS can be tricky because of:
• Cache size
• Maximum number of threads per compute unit
• Work-group dispatching
• Input and output dimensions
• …
In ACL we implemented the OpenCL tuner to look for the optimal LWS
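The idea behind the tuner can be pictured with a short sketch: given some way to time a kernel for a candidate LWS, an exhaustive tuner simply keeps the fastest candidate. This is only an illustration; `time_kernel`/`fake_time_kernel` are synthetic stand-ins, not the ACL implementation.

```python
import itertools

def tune_lws(time_kernel, candidates):
    """Time the kernel for each candidate local work size (LWS) and
    return the fastest one -- the idea behind ACL's 'Exhaustive' mode."""
    best_lws, best_time = None, float("inf")
    for lws in candidates:
        t = time_kernel(lws)  # measured execution time for this LWS
        if t < best_time:
            best_lws, best_time = lws, t
    return best_lws, best_time

# Synthetic stand-in for a real OpenCL kernel launch: pretend 8x8 is optimal.
def fake_time_kernel(lws):
    x, y = lws
    return abs(x - 8) + abs(y - 8) + 1.0  # synthetic time in ms

candidates = list(itertools.product([1, 2, 4, 8, 16], repeat=2))
lws, t = tune_lws(fake_time_kernel, candidates)
print(lws, t)  # -> (8, 8) 1.0
```

The 'Normal' and 'Rapid' modes mentioned below would correspond to searching a progressively smaller candidate set, trading improvement for tuning time.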
OpenCL Tuner Improvement
[Chart: performance improvement in % (higher is better) from the OpenCL tuner across networks – AlexNet, GoogleNet, Inception V3/V4, MobileNet, MobileNet v2, ResNet12, ResNet50, SqueezeNet and VGG16 in F32, plus MobileNet, MobileNet v2 and VGG VDSR in QASYMM8 – for the Exhaustive, Normal and Rapid tuning modes]
Three different levels of tuning are supported, trading off performance improvement against tuning time
Hands-on session
Let’s have fun on Raspberry Pi 4!
Raspberry Pi 4
A tiny but powerful development board with the mission to put the power of computing and digital making into the hands of people all over the world*
Raspberry Pi 4 is the latest version, released in June 2019
• Arm Cortex-A72 quad-core running at 1.5GHz
• Wifi, Bluetooth, Ethernet
• Dual monitor support (4K)
• and many more…
* https://www.raspberrypi.org/
What Will You Find on Your Raspberry Pi?
• Raspbian OS pre-installed (September 2019 release) with serial communication and remote connections (ssh) enabled
• Pre-built binaries for ArmNN and Compute Library (19.08 release)
• Ready to use examples for the hands-on sessions
All the instructions to reproduce the labs from scratch can be found in the Backup section of the Support Material
Note about Raspbian OS
The Arm Cortex-A72 is an aarch64 (Armv8-A) processor, but Raspbian OS is based on an Armv7-A (aarch32) filesystem
This means that the Arm Compute Library will not be able to call the routines optimized for aarch64
How to Control the Raspberry Pi
There are multiple ways to control the Raspberry Pi, according to your needs. Assuming that:
• both host (your laptop) and Raspberry Pi are on the same network
• an internet connection and control of the desktop interface are needed
we can use the serial and Ethernet cables to control our board
Raspberry Pi 4: Connections
[Diagram: the laptop (host) connects to the Raspberry Pi 4 (target) through a USB serial cable and through the router/Ethernet socket; the board is powered from an AC socket through a USB-C power cable]
Raspberry Pi 4: Serial Cable
Tx and Rx are swapped between host and target: the laptop's Tx line connects to the Raspberry Pi's Rx line, and vice versa
• Red = Vcc
• Black = Gnd
• White = Tx
• Green = Rx
Step 1: Get IP Address
• Open the serial terminal following the instructions provided
• Log in to Raspbian OS
• User: pi
• Password: raspberry
• In the terminal window, enter the command:
$ vncserver
• Copy the IP address along with the desktop number (i.e. 10.42.0.53:1)
Step 2: Control the Desktop Interface
• Install the VNC viewer for Google Chrome on your laptop, following the instructions provided
• Open the VNC app and enter the IP address and desktop number of the Raspberry Pi
• Log in to Raspbian OS using the same credentials as in step 1
Setup: https://tinyurl.com/u7a6jfv
Workshop instruction - https://tinyurl.com/uwvy59a
Now, we are ready to play!
Lab 1
Run Image Classification for Fire Detection
Arm NN Components
• Arm NN Core
  • Graph builder API
  • Optimizer
  • Runtime
  • Reference and Neon/CL backends via Compute Library
  • New backends planned
• Parsers
  • TensorFlow
  • TensorFlow Lite
  • Caffe
  • ONNX
• Android NNAPI Driver
Arm NN Supported Layers … more to come!
Activation, Addition, BatchNormalization, BatchToSpaceND, Constant, Convolution2d, DepthwiseConvolution2d, DetectionPostProcess, Division, Equal, Floor, FullyConnected, Gather, Greater, Input, L2Normalization, LSTM, Maximum, Mean, MemCopy, Merger, Minimum, Multiplication, Normalization, Output, Pad, Permute, Pooling2d, Reshape, ResizeBilinear, Rsqrt, Softmax, SpaceToBatchND, Splitter, StridedSlice, Subtraction
Lab 2
Run Time Series ML Model
Performance Analysis for Deep Learning Inference
Introduction
What can we say about the following measurements?
SqueezeNet: 200 fps, 5 ms
MobileNet: 250 fps, 4 ms
Nothing...
• What processor?
• What frequency?
• How many threads?
• What data type?
Goals for Performance Analysis
1. Give a meaning to our performance numbers: are they good or bad?
2. Understand performance bottlenecks to fix inefficiencies
• No more relying on lucky or ad-hoc optimizations
• Better optimization planning for the short and long term
We need to introduce a formal methodology
DNN Under Test (DUT)
To give a meaning to our performance numbers, we need to know:
• Hardware capabilities
• Algorithm complexity (number of ops required by the algorithm)
[Diagram: input data (i.e. image, audio, text) feeds the DNN under test, which produces an output along with measurements of execution time and fps]
HW Capabilities vs Algorithm Complexity
Hardware capabilities:
• (Fl)ops/Core/Cycle: # arithmetic operations per core and per cycle
• MACs/Core/Cycle: # multiply-accumulate (mul+add) operations per core and per cycle
• Max external memory bandwidth (i.e. DDR): # bytes read from or stored into memory per unit of time (bytes/s)
Algorithm complexity:
• Sum of the ops required by each function/layer in the network
• Well defined for DNNs and mainly dominated by the MAC operations in Convolution/Fully Connected layers
• Since a convolution layer can use multiple algorithms (GEMM, Winograd, FFT, ...), we should consider the ops required by the one actually executed
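As a rough sketch of how per-layer complexity can be counted, the flops of a standard 2D convolution can be estimated from the output size, kernel size and channel counts, with each MAC counted as two ops. This is an illustrative convention only; the per-layer figures in the table that follows are the workshop's own and may be computed differently.

```python
def conv2d_gflops(out_w, out_h, k, in_ch, out_ch):
    """GFlops of a standard 2D convolution, counting one MAC as 2 ops
    (one multiply + one accumulate)."""
    macs = out_w * out_h * k * k * in_ch * out_ch
    return 2 * macs / 1e9

# Example: a 224x224 output, 3x3 kernel, 3 -> 64 channels (VGG16-like first layer).
print(conv2d_gflops(224, 224, 3, 3, 64))  # ~0.17 GFlops under this convention
```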
Example Algorithm Complexity
VGG16 layer | Parameters (W/H/Kernel/IFMs/OFMs) | Algorithm | GFlops
Conv1 | 224/224/3/3/64 | GEMM | 3.55
Conv2 | 224/224/3/64/64 | GEMM | 1.95
Conv3 | 112/112/3/64/128 | GEMM | 1.20
Conv4 | 112/112/3/128/128 | GEMM | 2.08
Conv5 | 56/56/3/128/256 | GEMM | 1.20
Conv6 / Conv7 | 56/56/3/256/256 | GEMM | 2.39 x2
… | … | … | …
Conv11 / Conv12 / Conv13 | 14/14/3/512/512 | GEMM | 0.354 x3
FC1 | 1/1/1/4096 | GEMM | 0.0000502
FC2 | 1/1/1/4096 | GEMM | 0.00000819
FC3 | 1/1/1/1000 | GEMM | 0.00000819
Total GFlops | ~30
Processor Utilization
Expresses how well the algorithm uses the available processor computational resources
Ta: actual execution time (time measured)
Tt: theoretical execution time
Putil = Tt / Ta ∈ [0, 1]
Tt = Ops_algorithm / (Ops/Core/Cycle × num_cores × freq)
where Ops_algorithm is the number of ops required by the algorithm and Ops/Core/Cycle is the processor's per-core, per-cycle throughput
Static Memory Bound Analysis
Used to check whether an algorithm can be limited by memory transfers
Algorithm_MaxBW = (DR_total + DW_total) / Tt
Tt: theoretical execution time
DR_total: total data read
DW_total: total data written
This should be compared with the external memory bandwidth (i.e. DDR)
Exercise 1
Feature | Value
GFlops network | 1.1
Actual execution time (ms) | 4
Input image size | 224x224
Flops/Core/Cycle | 512
Num. cores | 20
Frequency | 500 MHz
Max external memory bandwidth | 12.9 GB/s
Processor utilization? 5.6% – really low!
Exercise 2
Feature | Value
GFlops network | 34.3
Actual execution time (ms) | 200
Input image size | 224x224
Flops/Core/Cycle | 16
Num. cores | 4
Frequency | 2 GHz
Max external memory bandwidth | 12.9 GB/s
Processor utilization? 133.6% – it cannot be > 100%! Maybe the network GFlops figure is not correct?
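The utilization formula can be checked with a few lines of Python. This is a sketch rather than workshop material, and small differences from the slides' percentages come from rounding in the inputs.

```python
def processor_utilization(gflops, exec_time_s, flops_core_cycle, cores, freq_hz):
    """P_util = Tt / Ta, with Tt = Ops / (Flops/Core/Cycle * num_cores * freq)."""
    peak_flops = flops_core_cycle * cores * freq_hz   # peak throughput (flops/s)
    tt = gflops * 1e9 / peak_flops                    # theoretical execution time
    return tt / exec_time_s

# Exercise 1: ~5-6% utilization -- really low.
print(processor_utilization(1.1, 0.004, 512, 20, 500e6))   # ~0.054

# Exercise 2: > 1.0 (i.e. > 100%) -- impossible, so one of the
# inputs (e.g. the network GFlops) must be wrong.
print(processor_utilization(34.3, 0.200, 16, 4, 2e9))      # ~1.34
```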
Exercise 3
Feature | Value
Flops/Core/Cycle | 16
Num. cores | 4
Frequency | 2 GHz
Max external memory bandwidth | 12.9 GB/s
Data type | M, N, K | DR [GB] | DW [GB] | Tt [s] | Algorithm MaxBW [GB/s]
F32 | 2704, 256, 1152 | 0.01160 | 0.00109 | 0.012 | 1.019
F32 | 1, 4096, 25088 | 9.34E-5 | 0.38 | 0.0016 | 238.5
The second case is memory bound!
Matrix multiplication:
• M: number of output rows
• N: number of output columns
• K: number of right-hand-side matrix rows
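The memory-bound check can be sketched in the same way; the values below come from the Exercise 3 table (small differences in the printed bandwidths are due to the rounded Tt values in the table).

```python
def algorithm_max_bw(dr_total_gb, dw_total_gb, tt_s):
    """Algorithm_MaxBW = (DR_total + DW_total) / Tt, in GB/s."""
    return (dr_total_gb + dw_total_gb) / tt_s

DDR_BW = 12.9  # GB/s, max external memory bandwidth from the table

# Row 1 (GEMM 2704x256x1152): well below the DDR bandwidth -> compute bound.
print(algorithm_max_bw(0.01160, 0.00109, 0.012))   # ~1.06 GB/s

# Row 2 (GEMM 1x4096x25088, a fully-connected-like shape): far above
# 12.9 GB/s -> memory bound.
print(algorithm_max_bw(9.34e-5, 0.38, 0.0016))     # ~238 GB/s
```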
Formal Methodology
Top-down approach, made up of increasingly granular investigations to find performance bottlenecks:
Level 1: Graph (DNN) profiling – network timing
Level 2: Function profiling – layer/function timing
Level 3: HW counters profiling – function HW counters (L2 cache hits, mis-predictions, cycles)
Driver Overhead Estimation
[Diagram: during graph profiling, the CPU alternates between driver overhead and dispatching func0, func1 and func2 to another processor, and the whole network time TN0 is measured; during function profiling, the individual function times t0, t1 and t2 are measured, but the profiler itself increases the overhead, yielding a different network time TN1]
TN0 != TN1
TN0 != (t0 + t1 + t2)
Overhead = TN0 - (t0 + t1 + t2)
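The overhead formula translates directly into code. The numbers below are hypothetical, and, as the slide notes, function profiling itself inflates the per-function times, so this only gives an estimate.

```python
def driver_overhead(network_time, layer_times):
    """Overhead = TN0 - (t0 + t1 + ...): the gap between the end-to-end
    network time and the sum of the individually timed functions."""
    return network_time - sum(layer_times)

# Hypothetical measurements (ms): graph profiling gave TN0 = 12.0 ms,
# function profiling timed three kernels at 3.0, 4.0 and 3.5 ms.
print(driver_overhead(12.0, [3.0, 4.0, 3.5]))  # -> 1.5
```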
Streamline
Arm® Streamline Performance Analyzer is a system-wide visualizer and profiler for Arm hardware targets, and for models of them
Locating and Fixing Inefficiencies
With graph and function profiling we can gather four important pieces of information:
1. Network utilization
2. Overhead limitation
3. Layer/function utilization
4. Memory transfer limitation
Phase 1
[Flowchart: run graph profiling and compute the network utilization; if Putil is not below Threshold_util, the DNN is considered optimized; otherwise evaluate the driver overhead; if the overhead is not below Threshold_overhead, perform overhead optimization; otherwise run function profiling and continue with Phase 2, from which level 3 (HW counters profiling) can be entered]
Phase 2 (1 of 2)
Layer analysis, based on function profiling:
• Remove layers with % time spent < Threshold_t (i.e. < 1)
• Remove layers with Putil > Threshold_p (i.e. > 50)
• If the resulting list is empty, go back to Phase 1; otherwise continue with the next step
Layer analysis: ResNet12 (F32)
Layer | % time spent | % Putil
Conv12 | 61.2 | 5.3
Conv1 | 3.63 | 40.1
Conv4 | 3.15 | 58.3
Conv6 | 3.15 | 59.2
Conv3 | 3.11 | 61.3
Conv1 | 3.11 | 60.1
Conv5 | 3.1 | 59.3
Conv7 | 3.1 | 58.6
…
Activation12 | 0.01 | 3
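The filtering step can be sketched as a small snippet over a hypothetical per-layer profile (loosely modeled on the ResNet12 table; the threshold values follow the "i.e. < 1" and "i.e. > 50" hints from the slides):

```python
# Hypothetical per-layer profile: (name, % time spent, % processor utilization).
layers = [
    ("Conv12", 61.2, 5.3),
    ("Conv1", 3.63, 40.1),
    ("Conv4", 3.15, 58.3),
    ("Activation12", 0.01, 3.0),
]

THRESHOLD_T = 1.0    # drop layers taking < 1% of total time (not worth optimizing)
THRESHOLD_P = 50.0   # drop layers already running at > 50% utilization

# Remaining candidates: slow layers with poor utilization.
candidates = [
    (name, t, putil)
    for name, t, putil in layers
    if t >= THRESHOLD_T and putil <= THRESHOLD_P
]
print(candidates)  # -> [('Conv12', 61.2, 5.3), ('Conv1', 3.63, 40.1)]
```

Conv12, dominating the runtime at only 5.3% utilization, is exactly the kind of layer the memory-bandwidth evaluation in the next step should examine first.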
Phase 2 (2 of 2)
Memory bandwidth evaluation on the remaining layers:
Layer (ResNet12, F32) | Max memory bandwidth [GB/s]
Conv12 | 79.5 (GEMM-based)
Conv1 | 4.71 (GEMM-based)
Feature | Value
Flops/Core/Cycle | 16
Num. cores | 4
Frequency | 2 GHz
Max memory bandwidth | 12.9 GB/s
[Flowchart: for each layer, check whether it is memory bound; if it is not, go to level 3 (HW counters profiling); if it is, check whether a different algorithm (i.e. FFT) can be used – if so, change the algorithm, update the Flops required and go back to Phase 1; if not, check whether the DNN design can be changed – change the design if possible, otherwise go to level 3 (HW counters profiling)]
MLPerf
A useful set of benchmarks for measuring the training and inference performance of ML hardware, software, and services, developed as a collaboration of companies and researchers
https://mlperf.org
Lab 3
Performance Evaluation with ACL
ACL Benchmark Framework
• Currently in an experimental phase
• Only available on the Arm Compute Library development branch (Linaro)
• Allows performance profiling of all the examples included in the “examples” folder
• For each example, it generates a new binary with the prefix “benchmark_” in the build/tests folder
Benchmark Examples
Plenty of DNN examples are included in ACL and ready to be profiled, i.e.:
./benchmark_graph_mobilenet --example_args=--threads=1 --iterations=10
./benchmark_graph_mobilenet --example_args=--threads=4 --iterations=10
./benchmark_graph_mobilenet --example_args=--threads=4 --iterations=10 --instruments=scheduler_timer_ms