comprehensive arm solutions for innovative machine learning (ml) and computer vision...
TRANSCRIPT
© 2017 Arm Limited Arm Technical Symposium 2017
Comprehensive ArmSolutions for Innovative
Machine Learning (ML) and Computer Vision (CV)
Applications
Helena Zheng| Senior Product Manager
© 2017 Arm Limited 3
Machine Learning is a Subset of Artificial IntelligenceAI means many things to many people
Artificial Intelligence
Machine Learning
Perception & Vision
Natural Language Processing
Knowledge Representation
Planning & Navigation
Generalized Intelligence
ML itself has a lot of depth
© 2017 Arm Limited 4
Why Artificial Intelligence(AI) Exploding NowAvailability of increased data sourced at the edge with ubiquitous powerful compute!
Compute Data
2016 – 1 zettabyte
2020 – 2.3 zettabyte
IP Traffic
2010
2015
© 2017 Arm Limited 5
AI Presents Significant Opportunity for Innovation
Robotics
Home, surveillance & analytics
VR/MR
IoT
Shipping & logistics
Mobile
Drones
Automotive
© 2017 Arm Limited 6
Distributing Intelligence
Security and privacy for your data
Cloud-based training
High-performance processing
AI in your hand & cloud
Real-time inference for autonomous systems
On-device learning
© 2017 Arm Limited 7
Why is On-device ML Driving AI to the Edge?
Bandwidth PrivacyLatencyCostPower
© 2017 Arm Limited 8
Arm ML Platform Enables
FlexibilityEfficiency Freedom
© 2017 Arm Limited 9
Arm’s ML Computing Platform
Arm DS-5 / Keil tools / compilers / drivers
AI Applications | Incorporating ML, CV, speech recognition etc. Applications
Edge devices
Stable SW
interfaces
Neural network frameworks(e.g. TensorFlow, Caffe, AndroidNN)
Spirit metadatalibrary
Optional Spirit libraries
& model sets
Compute library
SpiritComputer Vision
GPUCPUCPU
SVE
© 2017 Arm Limited 10
Components of Arm ML Platform
Software Specialized Acceleration Hardware
© 2017 Arm Limited 11
Software Development
© 2017 Arm Limited 12
Arm Compute LibraryFaster, advanced processing
Functions for CV and deep-learning algorithms
Optimized for Arm CPU and GPU
OS and platform agnostic
No fee, MIT license
Use as a plug-in backend for your own runtime implementation
What is the Arm Compute Library?
Delivers faster processing Offers OpenCV and Open VX compatibility
Available now: https://developer.arm.com/technologies/compute-library
4.6x faster than stock OpenCVon NEON
© 2017 Arm Limited 13
Arm Compute Library Partners
© 2017 Arm Limited 14
Specialized Acceleration
© 2017 Arm Limited
Computer Vision (CV)
© 2017 Arm Limited 16
Spirit for Object Detection and Localization
Metadata stream (Regions of interest)
Image stream
Ben
Beth
SpiritCV pre-processor
Senso
r in
terface
Feature
extraction
Classifier 1
Classifier 2
ISPSensor
CPUGPU
Acceleration
© 2017 Arm Limited 17
Comparison with Neural Network Framework Solutions
SSD Neural Network
Yolo
Spirit
© 2017 Arm Limited 18
Object localization (>200k locations in full HD)
Size variation (~20x for full HD)
Scalable to 4K without performance compromise
Real time, 60 fps, no dropped frames
Invariance to optical distortions
Invariance to illumination conditions
High occlusion tolerance
Suitable for stationary and moving cameras
Spirit: Key Features
© 2017 Arm Limited 19
Spirit uses a form of HOG*/ SVM* baked into an efficient hardware design
Using a DSP to achieve the same performance
E.g. Pedestrian Detection on a DSP
• Processed at VGA resolution
• Achieving 5fps
• Operating at 40MHz/50MHz
• Scaling this to Spirit performance levels of 1080p60 would require the DSP to run at 3.24GHz
*Histogram of Oriented Gradients *Support Vector Machines
Comparison with a DSP
© 2017 Arm Limited
ML on Mali GPUs
© 2017 Arm Limited 21
Mali GPUs: Increasing ML Throughput and Efficiency
Increasing efficiency
0.9
0.95
1
1.05
1.1
1.15
1.2
Mali-G71 Mali-G72
Rel
ativ
e En
eryg
Eff
icen
cy SGEMM FP32 SGEMM FP16
17% Efficiency
gain
• GEMM depicts core functionality of ML algorithms
• Mali-G72 has several optimizations to improve ML inference
• Less power-hungry FMA unit
• Bigger L1 cache in the execution engine
• Mali-G72 is the most efficient Mali GPU for machine learning
© 2017 Arm Limited 22
Hardware
© 2017 Arm Limited 23
AI Applications at the Edge on Arm
Detect plant diseases Sort cucumbers Detect Caltrain delays
© 2017 Arm Limited 25
Instruction Sets for AI
• Additional dot product instructions (Cortex-A55 and Cortex-A75)
• New Scalable Vector Extensions (SVE) instructions
Cortex-A
• Optimized CMSIS-DSP libraries for matrix multiplication
Cortex-M
• Improved performance and efficiency (for broader use cases)
• Flexibility in multi-core computing with Arm DynamIQ technology
Closely-coupled acceleration
© 2017 Arm Limited 26
DynamIQ: New Cluster Design for New Cores
Arm DynamIQ big.LITTLE systems:
• Greater product differentiation and scalability
• Improved energy efficiency and performance
• SW compatibility with Energy Aware Scheduling (EAS)
Private L2 and shared L3 caches
• Local cache close to processors
• L3 cache shared between all cores
DynamIQ Shared Unit (DSU)
• Contains L3, Snoop Control Unit (SCU) and all cluster interfaces
Additional instructions for ML1b+4L1b+3L1b+2L
1b+7L
Example: DynamIQ big.LITTLEconfigurations
..
AMBA4 ACE
SCU
Shared L3 cacheACP
Cortex-A5532b/64b Core
Private L2 cache
Async BridgesPeripheral Port
Cortex-A7532b/64b Core
Private L2 cache
DynamIQ Shared Unit (DSU)
2b+6L4b+4L
© 2017 Arm Limited 27
New DynamIQ-based CPUs for New Possibilities
>50%
more performancecompared to current devices
2.5x
higher power efficiencycompared to current devices
Estimated device performance using SPECINT2006, final device results may varyComparison using Cortex-A73 at 2.4GHz vs Cortex-A75 at 3GHz
Comparison using Cortex-A53 in 28nm devices vs Cortex-A55 in 16nm devices
Cortex-A75 processor Cortex-A55 processor
© 2017 Arm Limited 28
1.000.90
0.700.59
0.440.35
0
0.2
0.4
0.6
0.8
1
1.2
Tim
e (l
ow
er is
bet
ter)
Harris Corners (relative to Cortex-A53)
Cortex-A53 (FP32) Cortex-A55 (FP32) Cortex-A55(FP16)
Cortex-A73 (FP32) Cortex-A75(FP32) Cortex-A75(FP16)
Enhanced Architecture for Emerging Use Cases
Computer Vision Machine Learning
1 1.2
2.5
5.5
0
1
2
3
4
5
6
MA
C/c
ycle
General Matrix Multiply (relative to Cortex-A53)
Cortex-A53 (FP32) Cortex-A55 (FP32)
Cortex-A55 (FP16) Cortex-A55 (8-bit)
8-bit dot
product
FP16 FP16
© 2017 Arm Limited 29
Introducing the Scalable Vector Extension (SVE)
Predicate-driven loop control and management
Per-lane predicationGather-load and scatter-store
Extended floating-point horizontal reductions
1 + 2 + 3 + 4
1 + 2
+
3 + 4
3 7
= =
=
=
1 2 3 45 5 5 51 0 1 0
6 2 8 4
+
=
pred
Vector partitioning and software-managed speculation
1 2 0 0
1 1 0 0
+
pred
1 2
n-2
1 01 0CMPLT n
n-1 n n+1INDEX i
for (i = 0; i < n; ++i)
© 2017 Arm Limited 30
Instruction Sets for AI
• Additional dot product instructions (Cortex-A55 and Cortex-A75)
• New SVE instructions
Cortex-A
• Optimized CMSIS-DSP libraries for matrix multiplication
Cortex-M
• Improved performance and efficiency (for broader use cases)
• Flexibility in multi-core computing with DynamIQ technology
Closely-coupled acceleration
© 2017 Arm Limited 31
Software Optimizations: Cortex-M Example
Convolution
• Use of partial im2col to reduce the memory footprint
• Optimized data dimension layout (Height-Width-Channel) to save im2col overhead
Pooling
• Split into x-pooling and y-pooling instead of window-based
• 5.1X improvements compared to Caffe-like implementation
Activation
• ReLU: use SIMD within a register, 2.6X improvement compared to Caffe-like implementation
• Sigmoid and Tanh: use table look-up
*Baseline uses CMSIS 1D Conv and Caffe-like Pooling/ReLU
CIFAR-10 network runtime improvement
0
1
2
3
4
5
6
Conv Pooling ReLU Total
Rel
ativ
e th
rou
ghp
ut
Baseline New kernels
The new kernels will be integrated into future versions of CMSIS
4x higher
perf
© 2017 Arm Limited 32
Instruction sets for AI
• Additional dot product instructions (Cortex-A55 and Cortex-A75)
• New SVE instructions
Cortex-A
• Optimized CMSIS-DSP libraries for matrix multiplication
Cortex-M
• Improved performance and efficiency (for broader use cases)
• Flexibility in multi-core computing with DynamIQ technology
Closely-coupled acceleration
© 2017 Arm Limited 33
Interface to Acceleration Logic
DynamIQ clusterAccelerator
(4) Writes result into CPU memory
(1) Configure registers for task
(2) Fetches data from CPU memory
(3) Carries out acceleration
ACP
PP
DynamIQ cluster
I/O agent
(4) Reads result from CPU memory
or sends data for onward processing
(3) Processing completed
(1) Writes data into CPU memory
(2) Carries out computation
ACP
PP
© 2017 Arm Limited 34
Flexible Acceleration Platform
XPXP
XPXP
CMN-600
Agile System Cache
Accelerator
DMC-620
Acc
eler
ato
r
DMC-620
CCIX
DynamIQ Cluster
Level-3 Cache
Accel Cache
CCIX
Accelerator
XP
XP
Agile System Cache
Closely coupled Independent Off-chip coherent
© 2017 Arm Limited 35
Arm ML Platform Enables
FlexibilityEfficiency Freedom
3838
Thank You!Danke!Merci!谢谢!ありがとう!Gracias!Kiitos!감사합니다धन्यवाद
© 2017 Arm Limited