comprehensive arm solutions for innovative machine learning (ml) and computer vision...

© 2017 Arm Limited Arm Technical Symposium 2017

Comprehensive ArmSolutions for Innovative

Machine Learning (ML) and Computer Vision (CV)

Applications

Helena Zheng| Senior Product Manager

© 2017 Arm Limited 3

Machine Learning is a Subset of Artificial IntelligenceAI means many things to many people

Artificial Intelligence

Machine Learning

Perception & Vision

Natural Language Processing

Knowledge Representation

Planning & Navigation

Generalized Intelligence

ML itself has a lot of depth


Why Artificial Intelligence(AI) Exploding NowAvailability of increased data sourced at the edge with ubiquitous powerful compute!

Compute Data

2016 – 1 zettabyte

2020 – 2.3 zettabyte

IP Traffic

2010

2015


AI Presents Significant Opportunity for Innovation

Robotics

Home, surveillance & analytics

VR/MR

IoT

Shipping & logistics

Mobile

Drones

Automotive


Distributing Intelligence

Security and privacy for your data

Cloud-based training

High-performance processing

AI in your hand & cloud

Real-time inference for autonomous systems

On-device learning


Why is On-device ML Driving AI to the Edge?

Bandwidth PrivacyLatencyCostPower


Arm ML Platform Enables

FlexibilityEfficiency Freedom


Arm’s ML Computing Platform

Arm DS-5 / Keil tools / compilers / drivers

AI Applications | Incorporating ML, CV, speech recognition etc. Applications

Edge devices

Stable SW

interfaces

Neural network frameworks(e.g. TensorFlow, Caffe, AndroidNN)

Spirit metadatalibrary

Optional Spirit libraries

& model sets

Compute library

SpiritComputer Vision

GPUCPUCPU

SVE


Components of Arm ML Platform

Software Specialized Acceleration Hardware


Software Development


Arm Compute LibraryFaster, advanced processing

Functions for CV and deep-learning algorithms

Optimized for Arm CPU and GPU

OS and platform agnostic

No fee, MIT license

Use as a plug-in backend for your own runtime implementation

What is the Arm Compute Library?

Delivers faster processing Offers OpenCV and Open VX compatibility

Available now: https://developer.arm.com/technologies/compute-library

4.6x faster than stock OpenCVon NEON

https://developer.arm.com/technologies/compute-library


Arm Compute Library Partners


Specialized Acceleration

© 2017 Arm Limited

Computer Vision (CV)


Spirit for Object Detection and Localization

Metadata stream (Regions of interest)

Image stream

Ben

Beth

SpiritCV pre-processor

Senso

r in

terface

Feature

extraction

Classifier 1

Classifier 2

ISPSensor

CPUGPU

Acceleration


Comparison with Neural Network Framework Solutions

SSD Neural Network

Yolo

Spirit


Object localization (>200k locations in full HD)

Size variation (~20x for full HD)

Scalable to 4K without performance compromise

Real time, 60 fps, no dropped frames

Invariance to optical distortions

Invariance to illumination conditions

High occlusion tolerance

Suitable for stationary and moving cameras

Spirit: Key Features


Spirit uses a form of HOG*/ SVM* baked into an efficient hardware design

Using a DSP to achieve the same performance

E.g. Pedestrian Detection on a DSP

• Processed at VGA resolution

• Achieving 5fps

• Operating at 40MHz/50MHz

• Scaling this to Spirit performance levels of 1080p60 would require the DSP to run at 3.24GHz

*Histogram of Oriented Gradients *Support Vector Machines

Comparison with a DSP

© 2017 Arm Limited

ML on Mali GPUs


Mali GPUs: Increasing ML Throughput and Efficiency

Increasing efficiency

0.9

0.95

1

1.05

1.1

1.15

1.2

Mali-G71 Mali-G72

Rel

ativ

e En

eryg

Eff

icen

cy SGEMM FP32 SGEMM FP16

17% Efficiency

gain

• GEMM depicts core functionality of ML algorithms

• Mali-G72 has several optimizations to improve ML inference

• Less power-hungry FMA unit

• Bigger L1 cache in the execution engine

• Mali-G72 is the most efficient Mali GPU for machine learning


Hardware


AI Applications at the Edge on Arm

Detect plant diseases Sort cucumbers Detect Caltrain delays


Instruction Sets for AI

• Additional dot product instructions (Cortex-A55 and Cortex-A75)

• New Scalable Vector Extensions (SVE) instructions

Cortex-A

• Optimized CMSIS-DSP libraries for matrix multiplication

Cortex-M

• Improved performance and efficiency (for broader use cases)

• Flexibility in multi-core computing with Arm DynamIQ technology

Closely-coupled acceleration


DynamIQ: New Cluster Design for New Cores

Arm DynamIQ big.LITTLE systems:

• Greater product differentiation and scalability

• Improved energy efficiency and performance

• SW compatibility with Energy Aware Scheduling (EAS)

Private L2 and shared L3 caches

• Local cache close to processors

• L3 cache shared between all cores

DynamIQ Shared Unit (DSU)

• Contains L3, Snoop Control Unit (SCU) and all cluster interfaces

Additional instructions for ML1b+4L1b+3L1b+2L

1b+7L

Example: DynamIQ big.LITTLEconfigurations

..

AMBA4 ACE

SCU

Shared L3 cacheACP

Cortex-A5532b/64b Core

Private L2 cache

Async BridgesPeripheral Port

Cortex-A7532b/64b Core

Private L2 cache

DynamIQ Shared Unit (DSU)

2b+6L4b+4L


New DynamIQ-based CPUs for New Possibilities

>50%

more performancecompared to current devices

2.5x

higher power efficiencycompared to current devices

Estimated device performance using SPECINT2006, final device results may varyComparison using Cortex-A73 at 2.4GHz vs Cortex-A75 at 3GHz

Comparison using Cortex-A53 in 28nm devices vs Cortex-A55 in 16nm devices

Cortex-A75 processor Cortex-A55 processor


1.000.90

0.700.59

0.440.35

0

0.2

0.4

0.6

0.8

1

1.2

Tim

e (l

ow

er is

bet

ter)

Harris Corners (relative to Cortex-A53)

Cortex-A53 (FP32) Cortex-A55 (FP32) Cortex-A55(FP16)

Cortex-A73 (FP32) Cortex-A75(FP32) Cortex-A75(FP16)

Enhanced Architecture for Emerging Use Cases

Computer Vision Machine Learning

1 1.2

2.5

5.5

0

1

2

3

4

5

6

MA

C/c

ycle

General Matrix Multiply (relative to Cortex-A53)

Cortex-A53 (FP32) Cortex-A55 (FP32)

Cortex-A55 (FP16) Cortex-A55 (8-bit)

8-bit dot

product

FP16 FP16


Introducing the Scalable Vector Extension (SVE)

Predicate-driven loop control and management

Per-lane predicationGather-load and scatter-store

Extended floating-point horizontal reductions

1 + 2 + 3 + 4

1 + 2

+

3 + 4

3 7

= =

=

=

1 2 3 45 5 5 51 0 1 0

6 2 8 4

+

=

pred

Vector partitioning and software-managed speculation

1 2 0 0

1 1 0 0

+

pred

1 2

n-2

1 01 0CMPLT n

n-1 n n+1INDEX i

for (i = 0; i < n; ++i)


Instruction Sets for AI


• New SVE instructions

Cortex-A


Cortex-M


• Flexibility in multi-core computing with DynamIQ technology



Software Optimizations: Cortex-M Example

Convolution

• Use of partial im2col to reduce the memory footprint

• Optimized data dimension layout (Height-Width-Channel) to save im2col overhead

Pooling

• Split into x-pooling and y-pooling instead of window-based

• 5.1X improvements compared to Caffe-like implementation

Activation

• ReLU: use SIMD within a register, 2.6X improvement compared to Caffe-like implementation

• Sigmoid and Tanh: use table look-up

*Baseline uses CMSIS 1D Conv and Caffe-like Pooling/ReLU

CIFAR-10 network runtime improvement

0

1

2

3

4

5

6

Conv Pooling ReLU Total

Rel

ativ

e th

rou

ghp

ut

Baseline New kernels

The new kernels will be integrated into future versions of CMSIS

4x higher

perf


Instruction sets for AI


• New SVE instructions

Cortex-A


Cortex-M


• Flexibility in multi-core computing with DynamIQ technology



Interface to Acceleration Logic

DynamIQ clusterAccelerator

(4) Writes result into CPU memory

(1) Configure registers for task

(2) Fetches data from CPU memory

(3) Carries out acceleration

ACP

PP

DynamIQ cluster

I/O agent

(4) Reads result from CPU memory

or sends data for onward processing

(3) Processing completed

(1) Writes data into CPU memory

(2) Carries out computation

ACP

PP


Flexible Acceleration Platform

XPXP

XPXP

CMN-600

Agile System Cache

Accelerator

DMC-620

Acc

eler

ato

r

DMC-620

CCIX

DynamIQ Cluster

Level-3 Cache

Accel Cache

CCIX

Accelerator

XP

XP

Agile System Cache

Closely coupled Independent Off-chip coherent


Arm ML Platform Enables

FlexibilityEfficiency Freedom

comprehensive arm solutions for innovative machine learning (ml) and computer vision...

Documents