Building end-to-end ML workflows with Arm
TRANSCRIPT
Gian Marco Iodice, Tech Lead ACL, Arm
Wei Xiao, Principal Evangelist AI Ecosystem, Arm
Agenda
• Introduction: Presenting ArmNN and Compute Library
• Q/A – Break
• Hands-on session: Preventing Disaster with CNN
• Q/A – Break
• Hands-on session: Performance Analysis for Deep Learning Inference
• Wrap up
ArmNN and Compute Library
ArmNN and the Compute Library are free and open-source SW libraries developed by Arm for ML inference applications
• Both support Android and Linux
• Officially released together on GitHub every quarter
• Development branch available on Linaro
Linaro AI Initiative
The home of ArmNN and the Compute Library, bringing companies together to develop best-in-class Deep Learning performance on Arm
https://mlplatform.org https://mlplatform.org/contributing
Challenges Addressed
Deploying ML applications poses challenges such as:
• Framework/code/performance portability
• Code optimization for specific architectures
ArmNN and the Compute Library were developed to address these challenges: they aim to make the deployment of intelligent vision applications easy and performant on Arm-based platforms
SW Architecture Overview
• An ML workload can use ArmNN or the Compute Library to access CPU and GPU acceleration
• Only ArmNN provides access to the Arm Ethos NPU
• Only ArmNN provides parsers for 3rd-party libraries (TensorFlow, TensorFlow Lite, Caffe, ONNX, ...)
• The Arm Android NN HAL driver provides access to ArmNN for Android applications
[Architecture diagram: an ML workload (Android NN or a 3rd-party ML library such as TensorFlow or Caffe) sits on top of ArmNN; the HAL driver connects Android NN to ArmNN; ArmNN dispatches to the Compute Library for the Cortex-A CPU and Mali GPU, and to the NPU driver for the Ethos NPU]
ArmNN
The inference engine that enables the efficient deployment of ML workloads on Arm Cortex-A CPUs, Mali GPUs and Ethos NPUs
• ArmNN is also available through the Android NN API
• The ArmNN SDK includes tools and parsers for various frameworks (e.g. TensorFlow, TensorFlow Lite, Caffe, ...) and uses the Compute Library to target Arm Cortex-A CPUs and Arm Mali GPUs
https://developer.arm.com/ip-products/processors/machine-learning/arm-nn
Arm Compute Library (ACL)
A bundle of optimized functions for ML, Computer Vision and Image Processing on Arm Cortex-A CPUs and Arm Mali GPUs
• Provides acceleration on Arm CPUs through Neon/SVE (aarch32/aarch64)
• Provides acceleration on Arm Mali GPU through OpenCL
• Over 120 functions implemented!
https://developer.arm.com/technologies/compute-library
Functions
Machine learning:
• Activation
• Convolution
• Depth-wise Convolution
• Normalization
• Pooling
• Softmax
• and many more…
Computer vision:
• Canny Edge
• Harris corner
• HOG
• Gaussian Pyramid
• Gradient
• Optical Flow
• and many more…
Image processing:
• Colour convert
• Dilate
• Gaussian/Sobel 3x3/5x5
• Histogram equalization
• Remap
• Warp Affine/Perspective
• and many more…
Key Features
• Different data type support for CPU and GPU: FP32/FP16/Int8/Uint8
• More than 40 examples ready to be profiled with our benchmark test suite
• Different algorithms available for convolution layer (GEMM, Winograd, FFT and Direct)
• Memory manager
• Micro-architecture optimizations for key algorithms like GEMM or Winograd
• OpenCL tuner
• Fast math support
OpenCL Tuner
Simply tweaking the number of work-items in a work-group (the local work size, LWS) can have a huge performance impact
Setting the optimal LWS can be tricky because of:
• Cache size
• Maximum number of threads per compute unit
• Work-group dispatching
• Input and output dimensions
• …
In ACL we implemented the OpenCL tuner to look for the optimal LWS
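The idea behind the tuner can be pictured with a short sketch: given some way to time a kernel for a candidate LWS, an exhaustive tuner simply keeps the fastest candidate. This is only an illustration; `time_kernel`/`fake_time_kernel` are synthetic stand-ins, not the ACL implementation.

```python
import itertools

def tune_lws(time_kernel, candidates):
    """Time the kernel for each candidate local work size (LWS) and
    return the fastest one -- the idea behind ACL's 'Exhaustive' mode."""
    best_lws, best_time = None, float("inf")
    for lws in candidates:
        t = time_kernel(lws)  # measured execution time for this LWS
        if t < best_time:
            best_lws, best_time = lws, t
    return best_lws, best_time

# Synthetic stand-in for a real OpenCL kernel launch: pretend 8x8 is optimal.
def fake_time_kernel(lws):
    x, y = lws
    return abs(x - 8) + abs(y - 8) + 1.0  # synthetic time in ms

candidates = list(itertools.product([1, 2, 4, 8, 16], repeat=2))
lws, t = tune_lws(fake_time_kernel, candidates)
print(lws, t)  # -> (8, 8) 1.0
```

The 'Normal' and 'Rapid' modes mentioned below would correspond to searching a progressively smaller candidate set, trading improvement for tuning time.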
OpenCL Tuner Improvement
[Chart: performance improvement in % (higher is better) from the OpenCL tuner across networks – AlexNet, GoogleNet, Inception V3/V4, MobileNet, MobileNet v2, ResNet12, ResNet50, SqueezeNet and VGG16 in F32, plus MobileNet, MobileNet v2 and VGG VDSR in QASYMM8 – for the Exhaustive, Normal and Rapid tuning modes]
Three different levels of tuning are supported, trading off performance improvement against tuning time
Hands-on session
Let’s have fun on Raspberry Pi 4!
Raspberry Pi 4
A tiny but powerful development board with the mission to put the power of computing and digital making into the hands of people all over the world*
Raspberry Pi 4 is the latest version, released in June 2019
• Arm Cortex-A72 quad-core running at 1.5GHz
• Wifi, Bluetooth, Ethernet
• Dual monitor support (4K)
• and many more…
* https://www.raspberrypi.org/
What Will You Find on Your Raspberry Pi?
• Raspbian OS pre-installed (September 2019 release) with serial communication and remote connections (ssh) enabled
• Pre-built binaries for ArmNN and Compute Library (19.08 release)
• Ready to use examples for the hands-on sessions
All the instructions to reproduce the labs from scratch can be found in the Backup section of the Support Material
Note about Raspbian OS
The Arm Cortex-A72 is an aarch64 (Armv8-A) processor, but Raspbian OS is based on an Armv7-A (aarch32) filesystem
This means that the Arm Compute Library will not be able to call the routines optimized for aarch64
How to Control the Raspberry Pi
There are multiple ways to control the Raspberry Pi, according to your needs. Assuming that:
• both host (your laptop) and Raspberry Pi are on the same network
• an internet connection and control of the desktop interface are needed
we can use the serial and Ethernet cables to control our board
Raspberry Pi 4: Connections
[Diagram: the laptop (host) connects to the Raspberry Pi 4 (target) through a USB serial cable and through the router/Ethernet socket; the board is powered from an AC socket through a USB-C power cable]
Raspberry Pi 4: Serial Cable
Tx and Rx are swapped between host and target: the laptop's Tx line connects to the Raspberry Pi's Rx line, and vice versa
• Red = Vcc
• Black = Gnd
• White = Tx
• Green = Rx
Step 1: Get IP Address
• Open the serial terminal following the instructions provided
• Log in to Raspbian OS
• User: pi
• Password: raspberry
• In the terminal window, enter the command:
$ vncserver
• Copy the IP address along with the desktop number (i.e. 10.42.0.53:1)
Step 2: Control the Desktop Interface
• Install the VNC viewer for Google Chrome on your laptop, following the instructions provided
• Open the VNC app and enter the IP address and desktop number of the Raspberry Pi
• Log in to Raspbian OS using the same credentials as in step 1
Setup: https://tinyurl.com/u7a6jfv
Workshop instruction - https://tinyurl.com/uwvy59a
Now, we are ready to play!
Lab 1
Run Image Classification for Fire Detection
Arm NN Components
• Arm NN Core
  • Graph builder API
  • Optimizer
  • Runtime
  • Reference and Neon/CL backends via Compute Library
  • New backends planned
• Parsers
  • TensorFlow
  • TensorFlow Lite
  • Caffe
  • ONNX
• Android NNAPI Driver
Arm NN Supported Layers … more to come!
Activation, Addition, BatchNormalization, BatchToSpaceND, Constant, Convolution2d, DepthwiseConvolution2d, DetectionPostProcess, Division, Equal, Floor, FullyConnected, Gather, Greater, Input, L2Normalization, LSTM, Maximum, Mean, MemCopy, Merger, Minimum, Multiplication, Normalization, Output, Pad, Permute, Pooling2d, Reshape, ResizeBilinear, Rsqrt, Softmax, SpaceToBatchND, Splitter, StridedSlice, Subtraction
Lab 2
Run Time Series ML Model
Performance Analysis for Deep Learning Inference
Introduction
What can we say about the following measurements?
SqueezeNet: 200 fps, 5 ms
MobileNet: 250 fps, 4 ms
Nothing...
• What processor?
• What frequency?
• How many threads?
• What data type?
Goals for Performance Analysis
1. Give a meaning to our performance numbers: are they good or bad?
2. Understand performance bottlenecks to fix inefficiencies
• No more relying on lucky or ad-hoc optimizations
• Better optimization planning for the short and long term
We need to introduce a formal methodology
DNN Under Test (DUT)
To give a meaning to our performance numbers, we need to know:
• Hardware capabilities
• Algorithm complexity (number of ops required by the algorithm)
[Diagram: input data (i.e. image, audio, text) feeds the DNN under test, which produces an output along with measurements of execution time and fps]
HW Capabilities vs Algorithm Complexity
Hardware capabilities:
• (Fl)ops/Core/Cycle: # arithmetic operations per core and per cycle
• MACs/Core/Cycle: # multiply-accumulate (mul+add) operations per core and per cycle
• Max external memory bandwidth (i.e. DDR): # bytes read from or stored into memory per unit of time (bytes/s)
Algorithm complexity:
• Sum of the ops required by each function/layer in the network
• Well defined for DNNs and mainly dominated by the MAC operations in Convolution/Fully Connected layers
• Since a convolution layer can use multiple algorithms (GEMM, Winograd, FFT, ...), we should consider the ops required by the one actually executed
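As a rough sketch of how per-layer complexity can be counted, the flops of a standard 2D convolution can be estimated from the output size, kernel size and channel counts, with each MAC counted as two ops. This is an illustrative convention only; the per-layer figures in the table that follows are the workshop's own and may be computed differently.

```python
def conv2d_gflops(out_w, out_h, k, in_ch, out_ch):
    """GFlops of a standard 2D convolution, counting one MAC as 2 ops
    (one multiply + one accumulate)."""
    macs = out_w * out_h * k * k * in_ch * out_ch
    return 2 * macs / 1e9

# Example: a 224x224 output, 3x3 kernel, 3 -> 64 channels (VGG16-like first layer).
print(conv2d_gflops(224, 224, 3, 3, 64))  # ~0.17 GFlops under this convention
```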
Example Algorithm Complexity
VGG16 layer | Parameters (W/H/Kernel/IFMs/OFMs) | Algorithm | GFlops
Conv1 | 224/224/3/3/64 | GEMM | 3.55
Conv2 | 224/224/3/64/64 | GEMM | 1.95
Conv3 | 112/112/3/64/128 | GEMM | 1.20
Conv4 | 112/112/3/128/128 | GEMM | 2.08
Conv5 | 56/56/3/128/256 | GEMM | 1.20
Conv6 / Conv7 | 56/56/3/256/256 | GEMM | 2.39 x2
… | … | … | …
Conv11 / Conv12 / Conv13 | 14/14/3/512/512 | GEMM | 0.354 x3
FC1 | 1/1/1/4096 | GEMM | 0.0000502
FC2 | 1/1/1/4096 | GEMM | 0.00000819
FC3 | 1/1/1/1000 | GEMM | 0.00000819
Total GFlops | ~30
Processor Utilization
Expresses how well the algorithm uses the available processor computational resources
Ta: actual execution time (time measured)
Tt: theoretical execution time
Putil = Tt / Ta ∈ [0, 1]
Tt = Ops_algorithm / (Ops/Core/Cycle × num_cores × freq)
where Ops_algorithm is the number of ops required by the algorithm and Ops/Core/Cycle is the processor's per-core, per-cycle throughput
Static Memory Bound Analysis
Used to check whether an algorithm can be limited by memory transfers
Algorithm_MaxBW = (DR_total + DW_total) / Tt
Tt: theoretical execution time
DR_total: total data read
DW_total: total data written
This should be compared with the external memory bandwidth (i.e. DDR)
Exercise 1
Feature | Value
GFlops network | 1.1
Actual execution time (ms) | 4
Input image size | 224x224
Flops/Core/Cycle | 512
Num. cores | 20
Frequency | 500 MHz
Max external memory bandwidth | 12.9 GB/s
Processor utilization? 5.6% – really low!
Exercise 2
Feature | Value
GFlops network | 34.3
Actual execution time (ms) | 200
Input image size | 224x224
Flops/Core/Cycle | 16
Num. cores | 4
Frequency | 2 GHz
Max external memory bandwidth | 12.9 GB/s
Processor utilization? 133.6% – it cannot be > 100%! Maybe the network GFlops figure is not correct?
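The utilization formula can be checked with a few lines of Python. This is a sketch rather than workshop material, and small differences from the slides' percentages come from rounding in the inputs.

```python
def processor_utilization(gflops, exec_time_s, flops_core_cycle, cores, freq_hz):
    """P_util = Tt / Ta, with Tt = Ops / (Flops/Core/Cycle * num_cores * freq)."""
    peak_flops = flops_core_cycle * cores * freq_hz   # peak throughput (flops/s)
    tt = gflops * 1e9 / peak_flops                    # theoretical execution time
    return tt / exec_time_s

# Exercise 1: ~5-6% utilization -- really low.
print(processor_utilization(1.1, 0.004, 512, 20, 500e6))   # ~0.054

# Exercise 2: > 1.0 (i.e. > 100%) -- impossible, so one of the
# inputs (e.g. the network GFlops) must be wrong.
print(processor_utilization(34.3, 0.200, 16, 4, 2e9))      # ~1.34
```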
Exercise 3
Feature | Value
Flops/Core/Cycle | 16
Num. cores | 4
Frequency | 2 GHz
Max external memory bandwidth | 12.9 GB/s
Data type | M, N, K | DR [GB] | DW [GB] | Tt [s] | Algorithm MaxBW [GB/s]
F32 | 2704, 256, 1152 | 0.01160 | 0.00109 | 0.012 | 1.019
F32 | 1, 4096, 25088 | 9.34E-5 | 0.38 | 0.0016 | 238.5
The second case is memory bound!
Matrix multiplication:
• M: number of output rows
• N: number of output columns
• K: number of right-hand-side matrix rows
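The memory-bound check can be sketched in the same way; the values below come from the Exercise 3 table (small differences in the printed bandwidths are due to the rounded Tt values in the table).

```python
def algorithm_max_bw(dr_total_gb, dw_total_gb, tt_s):
    """Algorithm_MaxBW = (DR_total + DW_total) / Tt, in GB/s."""
    return (dr_total_gb + dw_total_gb) / tt_s

DDR_BW = 12.9  # GB/s, max external memory bandwidth from the table

# Row 1 (GEMM 2704x256x1152): well below the DDR bandwidth -> compute bound.
print(algorithm_max_bw(0.01160, 0.00109, 0.012))   # ~1.06 GB/s

# Row 2 (GEMM 1x4096x25088, a fully-connected-like shape): far above
# 12.9 GB/s -> memory bound.
print(algorithm_max_bw(9.34e-5, 0.38, 0.0016))     # ~238 GB/s
```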
Formal Methodology
Top-down approach, made up of increasingly granular investigations to find performance bottlenecks:
Level 1: Graph (DNN) profiling – network timing
Level 2: Function profiling – layer/function timing
Level 3: HW counters profiling – function HW counters (L2 cache hits, mis-predictions, cycles)
Driver Overhead Estimation
[Diagram: during graph profiling, the CPU alternates between driver overhead and dispatching func0, func1 and func2 to another processor, and the whole network time TN0 is measured; during function profiling, the individual function times t0, t1 and t2 are measured, but the profiler itself increases the overhead, yielding a different network time TN1]
TN0 != TN1
TN0 != (t0 + t1 + t2)
Overhead = TN0 - (t0 + t1 + t2)
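The overhead formula translates directly into code. The numbers below are hypothetical, and, as the slide notes, function profiling itself inflates the per-function times, so this only gives an estimate.

```python
def driver_overhead(network_time, layer_times):
    """Overhead = TN0 - (t0 + t1 + ...): the gap between the end-to-end
    network time and the sum of the individually timed functions."""
    return network_time - sum(layer_times)

# Hypothetical measurements (ms): graph profiling gave TN0 = 12.0 ms,
# function profiling timed three kernels at 3.0, 4.0 and 3.5 ms.
print(driver_overhead(12.0, [3.0, 4.0, 3.5]))  # -> 1.5
```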
Streamline
Arm® Streamline Performance Analyzer is a system-wide visualizer and profiler for Arm hardware targets, and for models of them
Locating and Fixing Inefficiencies
With graph and function profiling we can gather four important pieces of information:
1. Network utilization
2. Overhead limitation
3. Layer/function utilization
4. Memory transfer limitation
Phase 1
[Flowchart: run graph profiling and compute the network utilization; if Putil is not below Threshold_util, the DNN is considered optimized; otherwise evaluate the driver overhead; if the overhead is not below Threshold_overhead, perform overhead optimization; otherwise run function profiling and continue with Phase 2, from which level 3 (HW counters profiling) can be entered]
Phase 2 (1 of 2)
Layer analysis, based on function profiling:
• Remove layers with % time spent < Threshold_t (i.e. < 1)
• Remove layers with Putil > Threshold_p (i.e. > 50)
• If the resulting list is empty, go back to Phase 1; otherwise continue with the next step
Layer analysis: ResNet12 (F32)
Layer | % time spent | % Putil
Conv12 | 61.2 | 5.3
Conv1 | 3.63 | 40.1
Conv4 | 3.15 | 58.3
Conv6 | 3.15 | 59.2
Conv3 | 3.11 | 61.3
Conv1 | 3.11 | 60.1
Conv5 | 3.1 | 59.3
Conv7 | 3.1 | 58.6
…
Activation12 | 0.01 | 3
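The filtering step can be sketched as a small snippet over a hypothetical per-layer profile (loosely modeled on the ResNet12 table; the threshold values follow the "i.e. < 1" and "i.e. > 50" hints from the slides):

```python
# Hypothetical per-layer profile: (name, % time spent, % processor utilization).
layers = [
    ("Conv12", 61.2, 5.3),
    ("Conv1", 3.63, 40.1),
    ("Conv4", 3.15, 58.3),
    ("Activation12", 0.01, 3.0),
]

THRESHOLD_T = 1.0    # drop layers taking < 1% of total time (not worth optimizing)
THRESHOLD_P = 50.0   # drop layers already running at > 50% utilization

# Remaining candidates: slow layers with poor utilization.
candidates = [
    (name, t, putil)
    for name, t, putil in layers
    if t >= THRESHOLD_T and putil <= THRESHOLD_P
]
print(candidates)  # -> [('Conv12', 61.2, 5.3), ('Conv1', 3.63, 40.1)]
```

Conv12, dominating the runtime at only 5.3% utilization, is exactly the kind of layer the memory-bandwidth evaluation in the next step should examine first.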
Phase 2 (2 of 2)
Memory bandwidth evaluation on the remaining layers:
Layer (ResNet12, F32) | Max memory bandwidth [GB/s]
Conv12 | 79.5 (GEMM-based)
Conv1 | 4.71 (GEMM-based)
Feature | Value
Flops/Core/Cycle | 16
Num. cores | 4
Frequency | 2 GHz
Max memory bandwidth | 12.9 GB/s
[Flowchart: for each layer, check whether it is memory bound; if it is not, go to level 3 (HW counters profiling); if it is, check whether a different algorithm (i.e. FFT) can be used – if so, change the algorithm, update the Flops required and go back to Phase 1; if not, check whether the DNN design can be changed – change the design if possible, otherwise go to level 3 (HW counters profiling)]
MLPerf
A useful set of benchmarks for measuring the training and inference performance of ML hardware, software, and services, developed as a collaboration of companies and researchers
https://mlperf.org
Lab 3
Performance Evaluation with ACL
ACL Benchmark Framework
• Currently in an experimental phase
• Only available on the Arm Compute Library development branch (Linaro)
• Allows performance profiling of all the examples included in the “examples” folder
• For each example, it generates a new binary with the prefix “benchmark_” in the build/tests folder
Benchmark Examples
Plenty of DNN examples are included in ACL and ready to be profiled, i.e.:
./benchmark_graph_mobilenet --example_args=--threads=1 --iterations=10
./benchmark_graph_mobilenet --example_args=--threads=4 --iterations=10
./benchmark_graph_mobilenet --example_args=--threads=4 --iterations=10 --instruments=scheduler_timer_ms