Deep Learning on Mobile Phones: A Practitioner's Guide
Anirudh Koul, Siddha Ganju, Meher Kasam
Anirudh Koul
@AnirudhKoul
Head of AI & Research, Aira
[Lastname]@aira.io
Siddha Ganju
@SiddhaGanju
Architect, Self-Driving Vehicles, NVIDIA
[FirstnameLastname]@gmail.com
Meher Anand Kasam
@MeherKasam
Software Engineer, Square
[FirstnameMiddlenameK]@gmail.com
Why Deep Learning On Mobile?
Latency Privacy
Response Time Limits – Powers of 10
0.1 second : Reacting instantly
1.0 seconds : User's flow of thought
10 seconds : Keeping the user's attention
[Miller 1968; Card et al. 1991; Jakob Nielsen 1993]
Mobile Deep Learning Recipe
(Efficient) Mobile Inference Engine + (Efficient) Pretrained Model = DL App
Building a DL App in _ time
Building a DL App in 1 hour
Use Cloud APIs for General Recognition Needs
• Microsoft Cognitive Services
• Clarifai
• Google Cloud Vision
• IBM Watson Services
• Amazon Rekognition
How to Choose a Computer Vision Based API?
Benchmark & Compare them
COCO-Text v2.0 for text reading in the wild
• ~2k random images
• Candidate text has at least 2 characters together
• Direct word match
COCO-Val 2017 for image tagging in the wild
• ~4k random images
• Tag similarity match instead of word match
Pricing
Recognize Text Benchmarks
Text API                        Accuracy
Amazon Rekognition              45.4%
Google Cloud Vision             33.4%
Microsoft Cognitive Services    55.4%
Evaluation criteria:
• Photos have candidate words with length >= 2
• Direct word match with ground truth
Image Tagging Benchmarks
Evaluation criteria:
• Concept similarity match instead of word match
• E.g. ‘military-officer ’ tag matched with ground truth tag ‘person’
Image Tagging API               Accuracy    Avg #Tags
Amazon Rekognition              65%         14
Google Cloud Vision             47.6%       14
Microsoft Cognitive Services    50.0%       8
Image Tagging Benchmarks
Hard to do Precision-Recall since COCO ground truth tags are not exhaustive
Lower # of tags for a given accuracy indicates higher F-measure
Tips for reducing network latency
• For text recognition
  • A compression setting of up to 90% has little effect on accuracy, but gives drastic savings in size
  • Resizing is dangerous: text recognition needs a minimum size for recognition
• For image recognition
  • Resize to 224 as the minimum(height, width) at 50% compression with bilinear interpolation
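The resize tip for image recognition comes down to a small piece of arithmetic. A minimal sketch, assuming the goal is to scale the photo so its shorter side is 224 pixels while keeping the aspect ratio; the function name `target_size` is hypothetical, and the actual bilinear resampling and JPEG compression would be done by whichever image library the app already uses.

```python
def target_size(width, height, short_side=224):
    """Return (new_width, new_height) with the shorter side equal to short_side."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    # One scale factor for both axes keeps the aspect ratio.
    scale = short_side / min(width, height)
    if width <= height:
        return short_side, max(short_side, round(height * scale))
    return max(short_side, round(width * scale)), short_side


# A 12 MP landscape photo and a portrait video frame.
print(target_size(4032, 3024))
print(target_size(1080, 1920))
```

Pinning the shorter side (rather than the longer one) keeps every input at or above the 224-pixel size that the ImageNet-style models in this deck expect.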
Building a DL App in 1 day
http://deeplearningkit.org/2015/12/28/deeplearningkit-deep-learning-for-ios-tested-on-iphone-6s-tvos-and-os-x-developed-in-metal-and-swift/
Energy to train a Convolutional Neural Network
Energy to use a Convolutional Neural Network
Base Pretrained Model
ImageNet – 1000 Object Categorizer
VGG16
Inception-v3
Resnet-50
MobileNet
SqueezeNet
Running pre-trained models on mobile
Core ML
TensorFlow Lite
Caffe2
Apple’s Ecosystem
Metal (2014) → BNNS + MPS (2016) → CoreML (2017) → CoreML 2 (2018)
Apple’s Ecosystem
Metal (2014)
- Low-level, low-overhead, hardware-accelerated 3D graphics and compute shader application programming interface (API)
- Available since iOS 8
Apple’s Ecosystem
BNNS + MPS (2016)
Fast low-level primitives:
• BNNS – Basic Neural Network Subroutines
  • Ideal case: Fully connected NN
• MPS – Metal Performance Shaders
  • Ideal case: Convolutions
Inconvenient for large networks:
• Inception-v3 inference consisted of 1.5K lines of hard-coded model definition
• Libraries like Forge by Matthijs Hollemans provide abstraction
Apple’s Ecosystem
CoreML (2017)
Convert Caffe/TensorFlow model to CoreML model in 3 lines:
import coremltools
coreml_model = coremltools.converters.caffe.convert('my_caffe_model.caffemodel')
coreml_model.save('my_model.mlmodel')
Add model to iOS project and call for prediction.
Direct support for Keras, Caffe, scikit-learn, XGBoost, LibSVM
Automatically minimizes memory footprint and power consumption
Apple’s Ecosystem
CoreML 2 (2018)
• Model quantization support down to 1 bit
• Batch API for improved performance
• Conversion support for MXNet, ONNX
  • ONNX opens models from PyTorch, Cognitive Toolkit, Caffe2, Chainer
• Create ML for quick training
• tf-coreml for direct conversion from TensorFlow
CoreML Benchmark - Pick a DNN for your mobile architecture
Execution time in ms per device (device year in parentheses)
Model          Top-1 Accuracy (%)   Size of Model (MB)   iPhone 5S (2013)   iPhone 6 (2014)   iPhone 6S/SE (2015)   iPhone 7 (2016)   iPhone 8/X (2017)
VGG 16         71                   553                  7408               4556              235                   181               146
Inception v3   78                   95                   727                637               114                   90                78
Resnet 50      75                   103                  538                557               77                    74                71
MobileNet      71                   17                   129                109               44                    35                33
SqueezeNet     57                   5                    75                 78                36                    30                29
Huge improvement in GPU hardware in 2015
Putting out more frames than an art gallery
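One way to read the benchmark is as a lookup: given a device and a latency budget, which model is the most accurate one that still fits? The sketch below copies the numbers from the benchmark above; the helper names `best_model` and `frames_per_second` are hypothetical, and preferring the smaller model on an accuracy tie is an assumption, not something the slides state.

```python
# name: (top-1 accuracy %, size in MB, execution time in ms per device)
BENCHMARK = {
    "VGG 16": (71, 553, {"iPhone 5S": 7408, "iPhone 6": 4556,
                         "iPhone 6S/SE": 235, "iPhone 7": 181, "iPhone 8/X": 146}),
    "Inception v3": (78, 95, {"iPhone 5S": 727, "iPhone 6": 637,
                              "iPhone 6S/SE": 114, "iPhone 7": 90, "iPhone 8/X": 78}),
    "Resnet 50": (75, 103, {"iPhone 5S": 538, "iPhone 6": 557,
                            "iPhone 6S/SE": 77, "iPhone 7": 74, "iPhone 8/X": 71}),
    "MobileNet": (71, 17, {"iPhone 5S": 129, "iPhone 6": 109,
                           "iPhone 6S/SE": 44, "iPhone 7": 35, "iPhone 8/X": 33}),
    "SqueezeNet": (57, 5, {"iPhone 5S": 75, "iPhone 6": 78,
                           "iPhone 6S/SE": 36, "iPhone 7": 30, "iPhone 8/X": 29}),
}


def best_model(device, budget_ms):
    """Most accurate model that runs within budget_ms on device, or None."""
    candidates = [
        (accuracy, -size_mb, name)  # on equal accuracy, prefer the smaller model
        for name, (accuracy, size_mb, times) in BENCHMARK.items()
        if times[device] <= budget_ms
    ]
    return max(candidates)[2] if candidates else None


def frames_per_second(execution_ms):
    """Upper bound on frame rate if inference is the only per-frame cost."""
    return 1000.0 / execution_ms


print(best_model("iPhone 7", 100))
print(best_model("iPhone 5S", 100))
print(round(frames_per_second(33), 1))
```

The same lookup makes the hardware jump visible: a 100 ms budget admits only SqueezeNet on an iPhone 5S, but every model except VGG 16 on an iPhone 7.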
TensorFlow Ecosystem
TensorFlow (2015) → TensorFlow Mobile (2016) → TensorFlow Lite (2018)
TensorFlow Ecosystem
TensorFlow (2015): the full, bulky deal
TensorFlow Ecosystem
TensorFlow Mobile (2016)
• Easy pipeline to bring TensorFlow models to mobile
• Excellent documentation
• Optimizations to bring model to mobile
TensorFlow Ecosystem
TensorFlow Lite (2018)
• Smaller
• Faster
• Minimal dependencies
  • Easier to package & deploy
• Allows running custom operators
1 line conversion from Keras to TensorFlow Lite
• tflite_convert --keras_model_file=keras_model.h5 --output_file=foo.tflite
TensorFlow Lite is small
• ~75KB for core interpreter
• ~400KB for core interpreter + supported operations
• Compared to 1.5MB for Tensorflow Mobile
TensorFlow Lite is fast
• Takes advantage of on-device hardware acceleration
• Uses FlatBuffers
  • Reduces code footprint, memory usage
  • Reduces CPU cycles on serialization and deserialization
  • Improves startup time
• Pre-fused activations
  • Combining batch normalization layer with previous convolution
• Interpreter uses static memory and static execution plan
  • Decreases load time
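The "combining batch normalization layer with previous convolution" step can be illustrated with plain arithmetic. This is a minimal sketch of the general folding idea, not TensorFlow Lite's actual implementation: a 1x1 convolution is modelled as one dot product per output channel, and the function names are hypothetical. Since batch norm at inference time is a per-channel scale and shift, it can be absorbed into the convolution's weights and bias.

```python
import math


def conv1x1(x, weights, bias):
    """Toy 1x1 convolution: one dot product per output channel."""
    return [sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
            for w, b in zip(weights, bias)]


def batch_norm(y, gamma, beta, mean, var, eps=1e-5):
    """Inference-time batch normalization, applied per channel."""
    return [g * (v - m) / math.sqrt(s + eps) + b
            for v, g, b, m, s in zip(y, gamma, beta, mean, var)]


def fold_batch_norm(weights, bias, gamma, beta, mean, var, eps=1e-5):
    """Return (weights, bias) of a single conv equal to conv followed by batch norm."""
    folded_w, folded_b = [], []
    for w, b, g, bt, m, s in zip(weights, bias, gamma, beta, mean, var):
        scale = g / math.sqrt(s + eps)
        folded_w.append([w_i * scale for w_i in w])
        folded_b.append((b - m) * scale + bt)
    return folded_w, folded_b


x = [0.5, -1.0, 2.0]
weights, bias = [[0.2, 0.4, -0.1], [1.0, 0.0, 0.3]], [0.1, -0.2]
gamma, beta, mean, var = [1.5, 0.8], [0.05, -0.1], [0.3, 0.9], [0.25, 4.0]

two_steps = batch_norm(conv1x1(x, weights, bias), gamma, beta, mean, var)
one_step = conv1x1(x, *fold_batch_norm(weights, bias, gamma, beta, mean, var))
print(two_steps, one_step)
```

After folding, the batch norm layer disappears from the graph entirely, so the device runs one operation per channel instead of two.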
TensorFlow Lite Architecture
TensorFlow Lite Benchmarks - http://alpha.lab.numericcal.com/
TensorFlow Lite Benchmarks - http://ai-benchmark.com/
• Crowdsourced benchmarking with the AI Benchmark Android app
  • By Andrey Ignatov from ETH Zurich
• 9 tests
  • E.g. Semantic Segmentation, Image Super Resolution, Face Recognition
TensorFlow Lite acceleration – GPU delegate (dev preview)
Caffe2
From Facebook
Under 1 MB of binary size
Built for Speed :
For ARM CPU : Uses NEON Kernels, NNPack
For iPhone GPU : Uses Metal Performance Shaders and Metal
For Android GPU : Uses Qualcomm Snapdragon NPE (4-5x speedup)
ONNX format support to import models from CNTK/PyTorch
MLKit
• Simple, easy to use
• Abstraction over TensorFlow Lite
• Built in Image Labeling, OCR, Face Detection, Barcode scanning, landmark detection, Smart reply
• Model management with Firebase
  • Upload model on web interface to distribute
• A/B Testing
MLKit – Face Contours
By leveraging GPU delegate,
~4x speed up on Pixel 3
~6x speed up on iPhone 7
Recommendation for production development
1. Train a model using Keras
2. Convert to Tensorflow Lite format
3. Upload to Firebase
4. Deploy to iOS/Android apps with MLKit
Keras → tflite_convert → .tflite file
Common Questions
“My app has become too big to download. What do I do?”
• iOS doesn’t allow apps over 150 MB to be downloaded
• Solution : Download on demand, and compile on device
• 0 MB change to app size on first install
Common Questions
“Do I need to ship a new app update with every model improvement?”
• Making app updates is a decent amount of overhead, plus ~2 days wait time
• Solution : Check for model updates, download and compile on device
• Easier solution – Use a framework for model management, e.g.
  • Google ML Kit
  • Fritz
  • Numericcal
Common Questions
“Why does my app not recognize objects at top/bottom of screen?”
• Solution : Check the cropping used; by default, it's center crop ☺
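To see why a default center crop hides objects near the top and bottom of the screen, it helps to compute the crop box. A minimal sketch, assuming the preprocessing takes the largest centered square from a portrait camera frame; the function names and the 1080x1920 frame size are illustrative assumptions.

```python
def center_crop_box(width, height):
    """Return (left, top, right, bottom) of the largest centered square."""
    side = min(width, height)
    left = (width - side) // 2
    top = (height - side) // 2
    return left, top, left + side, top + side


def is_inside(box, x, y):
    """True if pixel (x, y) survives the crop."""
    left, top, right, bottom = box
    return left <= x < right and top <= y < bottom


box = center_crop_box(1080, 1920)    # portrait frame
print(box)
print(is_inside(box, 540, 100))      # object near the top of the screen
print(is_inside(box, 540, 960))      # object in the middle
```

For a 1080x1920 frame, the top and bottom 420 rows never reach the model, which is the behaviour the question describes.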
Building a DL App in 1 week
Learn playing an accordion: 3 months
Knows piano, fine tune skills: 1 week
I got a dataset, Now What?
Step 1 : Find a pre-trained model
Step 2 : Fine tune a pre-trained model
Step 3 : Run using existing frameworks
“Don’t Be A Hero” - Andrej Karpathy
How to find pretrained models for my task?
Model Zoo
https://modelzoo.co
- 300+ models
Papers with Code
https://paperswithcode.com/sota
AlexNet, 2012 (simplified)
[Krizhevsky, Sutskever,Hinton’12]
Honglak Lee, Roger Grosse, Rajesh Ranganath, and Andrew Ng, “Unsupervised Learning of Hierarchical Representations with Convolutional Deep Belief Networks”, 11
n-dimension feature representation
Deciding how to fine tune
Size of New Dataset   Similarity to Original Dataset   What to do?
Large                 High                             Fine tune.
Small                 High                             Don’t fine tune, it will overfit. Train linear classifier on CNN features.
Small                 Low                              Train a classifier from activations in lower layers. Higher layers are dataset specific to older dataset.
Large                 Low                              Train CNN from scratch.
http://blog.revolutionanalytics.com/2016/08/deep-learning-part-2.html
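The decision table above is small enough to encode directly, which can be handy as a checklist in a training script. This is only a sketch: what counts as a "large" dataset or a "similar" one is a judgment call the slides leave open, so both are passed in as booleans, and the function name is hypothetical.

```python
# (dataset_is_large, similar_to_original) -> recommendation from the table
STRATEGY = {
    (True, True): "fine tune",
    (False, True): "train a linear classifier on CNN features",
    (False, False): "train a classifier on activations from lower layers",
    (True, False): "train the CNN from scratch",
}


def transfer_strategy(dataset_is_large, similar_to_original):
    """Look up the recommended approach for a new dataset."""
    return STRATEGY[(bool(dataset_is_large), bool(similar_to_original))]


# A few hundred photos of dog breeds, starting from an ImageNet model.
print(transfer_strategy(dataset_is_large=False, similar_to_original=True))
```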
Could you train your own classifier ... without coding?
• Microsoft CustomVision.ai
  • Unique: Under a minute training, custom object detection (100x speedup)
• Google AutoML
  • Unique: Full CNN training, crowdsourced workers
• IBM Watson Visual Recognition
• Baidu EZDL
  • Unique: Custom sound recognition
Custom Vision Service (customvision.ai) – Drag and drop training
Tip : Upload 30 photos per class to make a prototype model
Upload 200 photos per class for a more robust production model
The more distinct the shape/type of object, the fewer images required.
Custom Vision Service (customvision.ai) – Drag and drop training
Tip : Use the Fatkun browser extension to download images from a search engine, or use the Bing Image Search API to programmatically download photos with proper rights
CoreML exporter from customvision.ai – Drag and drop training
5 minute shortcut to training, fine-tuning and getting a model ready in CoreML format
Drag and drop interface
Building a Crowdsourced Data Collector in 1 month
Barcode recognition from Seeing AI
Aim : Help blind users identify products using barcode
Issue : Blind users don’t know where the barcode is
Live : Guide user in finding a barcode with audio cues
With Server : Decode barcode to identify product
Tech : MPSCNN running on mobile GPU + barcode library
Metrics : 40 FPS (~25 ms) on iPhone 7
Currency recognition from Seeing AI
Aim : Identify currency
Live : Identify denomination of paper currency instantly
With Server : -
Tech : Task specific CNN running on mobile GPU
Metrics : 40 FPS (~25 ms) on iPhone 7
Training Data Collection App
Request volunteers to take photos of objects
in non-obvious settings
Sends photos to cloud, trains model nightly
Newsletter shows the best photos from volunteers
Let them compete for fame
Daily challenge - Collected by volunteers
Building a production DL App in 3 months
What you want: $200,000
What you can afford: $2000
https://www.flickr.com/photos/kenjonbro/9075514760/ and http://www.newcars.com/land-rover/range-rover-sport/2016
Revolution of Depth
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”, 2015
[Figure: layer-by-layer diagrams of ILSVRC networks shown side by side, each deeper than the last]
AlexNet, 8 layers (ILSVRC 2012)
• 11x11 conv, 96, /4, pool/2
• 5x5 conv, 256, pool/2
• 3x3 conv, 384
• 3x3 conv, 384
• 3x3 conv, 256, pool/2
• fc, 4096
• fc, 4096
• fc, 1000
VGG, 19 layers (ILSVRC 2014)
• 2 x (3x3 conv, 64), pool/2
• 2 x (3x3 conv, 128), pool/2
• 4 x (3x3 conv, 256), pool/2
• 4 x (3x3 conv, 512), pool/2
• 4 x (3x3 conv, 512), pool/2
• fc, 4096
• fc, 4096
• fc, 1000
GoogleNet, 22 layers (ILSVRC 2014)
• Stem of Conv, MaxPool and LocalRespNorm layers
• 9 stacked blocks of parallel 1x1, 3x3 and 5x5 Conv plus 3x3 MaxPool branches, joined by DepthConcat
• AveragePool, FC and softmax output (softmax2), with two auxiliary classifiers (softmax0, softmax1)
ResNet, 152 layers (ILSVRC 2015): ultra deep
• 7x7 conv, 64, /2, pool/2
• 3 x (1x1 conv, 64; 3x3 conv, 64; 1x1 conv, 256)
• 8 x (1x1 conv, 128; 3x3 conv, 128; 1x1 conv, 512), first block /2
• 36 x (1x1 conv, 256; 3x3 conv, 256; 1x1 conv, 1024), first block /2
• 3 x (1x1 conv, 512; 3x3 conv, 512; 1x1 conv, 2048), first block /2
• ave pool, fc 1000
Revolution of Depth vs Classification Accuracy
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”, 2015
ImageNet Classification top-5 error (%)
Entry                 Top-5 error (%)   Depth
ILSVRC'10             28.2              shallow
ILSVRC'11             25.8              shallow
ILSVRC'12 AlexNet     16.4              8 layers
ILSVRC'13             11.7              8 layers
ILSVRC'14 VGG         7.3               19 layers
ILSVRC'14 GoogleNet   6.7               22 layers
ILSVRC'15 ResNet      3.6               152 layers
ILSVRC'16 Ensemble    2.9               Ensemble of Resnet, Inception Resnet, Inception and Wide Residual Network
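Accuracy vs Operations Per Image Inference
Size is proportional to num parameters
Alfredo Canziani, Adam Paszke, Eugenio Culurciello, “An Analysis of Deep Neural Network Models for Practical Applications” 2016
[Figure: accuracy plotted against operations per inference, one bubble per network; two bubbles are labeled 552 MB and 240 MB, and one region of the plot is marked "What we want"]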
Your Budget - Smartphone Floating Point Operations Per Second (2015)
http://pages.experts-exchange.com/processing-power-compared/
iPhone X is more powerful than a Macbook Pro
https://thenextweb.com/apple/2017/09/12/apples-new-iphone-x-already-destroying-android-devices-g/
Strategies to get maximum efficiency from your CNN
Before training
• Pick an efficient architecture for your task
• Designing efficient layers
After training
• Pruning
• Quantization
• Network binarization
CoreML Benchmark - Pick a DNN for your mobile architecture
Execution time in ms per device (device year in parentheses)
Model          Top-1 Accuracy (%)   Size of Model (MB)   Million Multi Adds   iPhone 5S (2013)   iPhone 6 (2014)   iPhone 6S/SE (2015)   iPhone 7 (2016)   iPhone 8/X (2017)
VGG 16         71                   553                  15300                7408               4556              235                   181               146
Inception v3   78                   95                   5000                 727                637               114                   90                78
Resnet 50      75                   103                  3900                 538                557               77                    74                71
MobileNet      71                   17                   569                  129                109               44                    35                33
SqueezeNet     57                   5                    800                  75                 78                36                    30                29
Huge improvement in GPU hardware in 2015
MobileNet family
Splits the convolution into a 3x3 depthwise conv and a 1x1 pointwise conv
Tune with two parameters – Width Multiplier and resolution multiplier
Andrew G. Howard et al, "MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications”, 2017
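The saving from splitting a convolution into depthwise and pointwise parts can be checked with the multiply-add counts given in the MobileNets paper: a standard convolution costs K·K·M·N·F·F and the separable version costs K·K·M·F·F + M·N·F·F, for kernel size K, M input channels, N output channels and an F x F feature map. The sketch below is an illustration with hypothetical function names; the layer shape in the example is an arbitrary choice, and truncating the scaled channel counts with `int` is an assumption of this sketch.

```python
def standard_conv_mult_adds(k, m, n, f):
    """Multiply-adds of a standard k x k convolution on an f x f feature map."""
    return k * k * m * n * f * f


def separable_conv_mult_adds(k, m, n, f, width_mult=1.0, res_mult=1.0):
    """Multiply-adds of a depthwise k x k conv followed by a 1x1 pointwise conv.

    width_mult thins the channels; res_mult shrinks the feature map.
    """
    m, n, f = int(width_mult * m), int(width_mult * n), int(res_mult * f)
    depthwise = k * k * m * f * f
    pointwise = m * n * f * f
    return depthwise + pointwise


k, m, n, f = 3, 512, 512, 14
standard = standard_conv_mult_adds(k, m, n, f)
separable = separable_conv_mult_adds(k, m, n, f)
print(standard, separable, round(separable / standard, 4))
print(separable_conv_mult_adds(k, m, n, f, width_mult=0.5))
```

The ratio works out to 1/N + 1/K², so with 3x3 kernels the separable layer needs roughly 8 to 9 times fewer multiply-adds, and the two multipliers reduce the cost further at some cost in accuracy.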
Efficient Classification Architectures
https://ai.googleblog.com/2018/04/mobilenetv2-next-generation-of-on.html
MobileNetV2 is the current favorite
Efficient Detection Architectures
Jonathan Huang et al, "Speed/accuracy trade-offs for modern convolutional object detectors”, 2017
Efficient Detection Architectures
Jonathan Huang et al, "Speed/accuracy trade-offs for modern convolutional object detectors”, 2017
Efficient Segmentation Architectures
ICNet - Image cascade network
Tricks while designing your own network
• Dilated convolutions
  • Great for segmentation / when target object has high area in image
• Replace NxN convolutions with Nx1 followed by 1xN
• Depthwise separable convolutions (e.g. MobileNet)
• Inverted residual block (e.g. MobileNetV2)
• Replacing large filters with multiple small filters
  • 5x5 is slower than 3x3 followed by 3x3
Design consideration for custom architectures – Small Filters
Three layers of 3x3 convolutions >>
One layer of 7x7 convolution
Replace large 5x5, 7x7 convolutions with stacks of 3x3 convolutions
Replace NxN convolutions with stack of 1xN and Nx1
Fewer parameters ☺
Less compute ☺
More non-linearity ☺
Better
Faster
Stronger
Andrej Karpathy, CS-231n Notes, Lecture 11
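The "fewer parameters" claim is easy to verify by counting weights. A minimal sketch, assuming every layer has C input and C output channels, stride 1, and ignoring biases; the function names are hypothetical.

```python
def conv_params(kernel_h, kernel_w, channels, layers=1):
    """Weights in a stack of conv layers with `channels` inputs and outputs each."""
    return layers * kernel_h * kernel_w * channels * channels


def receptive_field(kernel, layers):
    """Receptive field of `layers` stacked stride-1 convolutions of size kernel."""
    return 1 + layers * (kernel - 1)


C = 64
one_7x7 = conv_params(7, 7, C)
three_3x3 = conv_params(3, 3, C, layers=3)
print(receptive_field(7, 1), receptive_field(3, 3))   # both stacks see a 7x7 window
print(one_7x7, three_3x3)                             # 49*C*C vs 27*C*C weights

# Factorizing an NxN filter into 1xN followed by Nx1.
one_3x3 = conv_params(3, 3, C)
factorized = conv_params(1, 3, C) + conv_params(3, 1, C)
print(one_3x3, factorized)                            # 9*C*C vs 6*C*C weights
```

Three 3x3 layers cover the same 7x7 window with 27C² weights instead of 49C², and they apply three non-linearities instead of one.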
Selective training to keep networks shallow
Idea : Limit data augmentation to how your network will be used
Example : If making a selfie app, there is no benefit in rotating training images beyond ±45 degrees; the phone will rotate the image anyway.
Followed by WordLens / Google Translate
Example : Add blur if analyzing mobile phone frames
Pruning
Aim : Remove all connections with absolute weights below a threshold
Song Han, Jeff Pool, John Tran, William J. Dally, "Learning both Weights and Connections for Efficient Neural Networks", 2015
Observation : Most parameters are in the fully connected layers
AlexNet (240 MB): 96% of all parameters
VGG-16 (552 MB): 90% of all parameters
Pruning gets quickest model compression without accuracy loss
AlexNet 240 MB VGG-16 552 MB
First layer which directly interacts with image is sensitive and cannot be pruned too much without hurting accuracy
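The pruning aim stated above (remove connections whose absolute weight is below a threshold) fits in a few lines. This is a sketch on a plain list-of-lists weight matrix with hypothetical function names; it shows only the thresholding step, whereas the cited paper also retrains the remaining connections after pruning.

```python
def prune_by_magnitude(weights, threshold):
    """Zero out every weight whose absolute value is below threshold."""
    return [[0.0 if abs(w) < threshold else w for w in row] for row in weights]


def sparsity(weights):
    """Fraction of weights that are exactly zero."""
    flat = [w for row in weights for w in row]
    return sum(1 for w in flat if w == 0.0) / len(flat)


layer = [[0.8, -0.02, 0.05],
         [-0.6, 0.01, 0.3]]
pruned = prune_by_magnitude(layer, threshold=0.1)
print(pruned)
print(sparsity(pruned))
```

Because the first layer is sensitive, a per-layer threshold (smaller for early layers, larger for the fully connected ones) is a natural extension of this sketch.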
Prune in Keras (Before)
import tensorflow as tf

# Load MNIST and scale pixel values to [0, 1].
mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

# A small fully connected classifier.
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(512, activation=tf.nn.relu),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10, activation=tf.nn.softmax)
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

model.fit(x_train, y_train, epochs=5)
model.evaluate(x_test, y_test)
Prune in Keras (After)
import tensorflow as tf
# `prune` is the pruning wrapper module used on the slide; its import is not shown in the source.

mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

# Same classifier, with each Dense layer wrapped so its weights get pruned.
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(),
    prune.Prune(tf.keras.layers.Dense(512, activation=tf.nn.relu)),
    tf.keras.layers.Dropout(0.2),
    prune.Prune(tf.keras.layers.Dense(10, activation=tf.nn.softmax))
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

model.fit(x_train, y_train, epochs=5)
model.evaluate(x_test, y_test)
Weight Sharing
Idea : Cluster weights with similar values together, and store in a dictionary.
Codebook
Huffman coding
HashedNets
Cons: Need a special inference engine, doesn’t work for most applications
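The codebook idea can be sketched with a tiny one-dimensional k-means: cluster the weights, store the k centroids once, and replace every weight by a small index into that table. This is an illustration under stated assumptions, not any particular library's method: the function name is hypothetical, centroids are initialised evenly between the smallest and largest weight, and k must be at least 2.

```python
def build_codebook(weights, k, iterations=10):
    """Cluster weights into k shared values. Returns (centroids, indices)."""
    lo, hi = min(weights), max(weights)
    # Spread the initial centroids evenly over the weight range.
    centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]

    def assign():
        return [min(range(k), key=lambda c: abs(w - centroids[c])) for w in weights]

    for _ in range(iterations):
        indices = assign()
        for c in range(k):
            members = [w for w, i in zip(weights, indices) if i == c]
            if members:  # leave a centroid where it is if nothing maps to it
                centroids[c] = sum(members) / len(members)
    return centroids, assign()


weights = [0.11, 0.09, -0.52, -0.48, 0.10, 0.90, 0.88, -0.50]
centroids, indices = build_codebook(weights, k=3)
shared = [centroids[i] for i in indices]   # what inference would look up
print(centroids)
print(indices)
```

With 3 centroids each weight needs only a 2-bit index instead of a 32-bit float, and the index stream is what Huffman coding would then compress further. The lookup step at inference time is the reason a special inference engine is needed.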
Filter Pruning - ThiNet
Idea : Discard whole filter if not important to predictions
Advantage:
• No change in architecture, other than thinning of filters per layer
• Can be further compressed with other methods
Just like feature selection, select filter to discard. Possible greedy methods:
• Absolute weight sum of entire filter closest to 0
• Average percentage of ‘Zeros’ as outputs
• ThiNet – Collect statistics on the output of the next layer
Quantization
Reduce precision from 32 bits to 16 bits or fewer
Use stochastic rounding for best results
In Practice:
• Ristretto + Caffe
  • Automatic network quantization
  • Finds balance between compression rate and accuracy
• Apple Metal Performance Shaders automatically quantize to 16 bits
• TensorFlow has 8 bit quantization support
  • Gemmlowp – Low precision matrix multiplication library
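A linear (affine) quantizer is the simplest way to see what reducing precision means in practice. The sketch below maps floats onto 2^bits evenly spaced levels between the smallest and largest weight; the function names are hypothetical, and it uses round-to-nearest rather than the stochastic rounding recommended above, purely to keep the example deterministic.

```python
def quantize_linear(weights, bits=8):
    """Map floats to integer codes in [0, 2**bits - 1]. Returns (codes, scale, lo)."""
    lo, hi = min(weights), max(weights)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo


def dequantize_linear(codes, scale, lo):
    """Recover approximate float weights from integer codes."""
    return [lo + c * scale for c in codes]


weights = [-1.0, -0.25, 0.0, 0.4, 1.0]
codes, scale, lo = quantize_linear(weights, bits=8)
restored = dequantize_linear(codes, scale, lo)
print(codes)
print(max(abs(w - r) for w, r in zip(weights, restored)))
```

Each weight now takes 8 bits instead of 32, and the reconstruction error is bounded by half a quantization step.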
Quantizing CNNs in Practice
Reducing CoreML models to half size
import coremltools

# Load a model, lower its precision, and then save the smaller model.
model_spec = coremltools.utils.load_spec('model.mlmodel')
model_fp16_spec = coremltools.utils.convert_neural_network_spec_weights_to_fp16(model_spec)
coremltools.utils.save_spec(model_fp16_spec, 'modelFP16.mlmodel')
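Quantizing CNNs in Practice
Reducing CoreML models to even smaller size
Choose bits and quantization mode
Bits from [1,2,4,8]
Quantization mode from ["linear","linear_lut","kmeans_lut","custom_lut"]
• Lut = look up table
import coremltools
from coremltools.models.neural_network.quantization_utils import *

# Load the full-precision model (file name assumed; the slide does not show this step).
model = coremltools.models.MLModel('model.mlmodel')
# Quantize the weights to 8 bits using linear quantization, then save.
quantized_model = quantize_weights(model, 8, 'linear')
quantized_model.save('quantizedModel.mlmodel')
# Compare predictions of the two models on sample inputs.
compare_models(model, quantized_model, './sample_data/')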
Binary weighted Networks
Idea : Reduce the weights to -1,+1
Speedup : Convolution operation can be approximated by only summation and subtraction
Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, Ali Farhadi, “XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks”
XNOR-Net
Idea : Reduce both weights + inputs to -1,+1
Speedup : Convolution operation can be approximated by XNOR and Bitcount operations
Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, Ali Farhadi, “XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks”
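When both weights and inputs are restricted to -1/+1, a dot product reduces to bit operations. A minimal sketch, assuming +1 is stored as bit 1 and -1 as bit 0 (the function names are hypothetical): XNOR marks the positions where the two vectors agree, and since each agreement contributes +1 and each disagreement -1, the dot product is 2 x (number of agreements) - n.

```python
def pack_bits(values):
    """Pack a list of +1/-1 values into an int; bit i is set when values[i] == +1."""
    bits = 0
    for i, v in enumerate(values):
        if v == 1:
            bits |= 1 << i
        elif v != -1:
            raise ValueError("values must be +1 or -1")
    return bits


def xnor_dot(a_bits, b_bits, n):
    """Dot product of two packed +1/-1 vectors of length n via XNOR and bitcount."""
    mask = (1 << n) - 1
    agreements = bin(~(a_bits ^ b_bits) & mask).count("1")
    return 2 * agreements - n


a = [1, -1, 1, 1, -1, -1, 1, -1]
b = [1, 1, -1, 1, -1, 1, 1, -1]
print(xnor_dot(pack_bits(a), pack_bits(b), len(a)))
print(sum(x * y for x, y in zip(a, b)))   # reference: ordinary dot product
```

On real hardware the same trick processes 32 or 64 weights per instruction, which is where the speedup over floating-point multiply-adds comes from.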
XNOR-Net on Mobile
Battery free, solar powered AI Device from XNOR.AI
Challenges
Off the shelf CNNs not robust for video
Solutions:
• Collective confidence over several frames
• CortexNet
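"Collective confidence over several frames" can be approximated by averaging per-label scores over a sliding window, so that one noisy frame does not flip the label shown to the user. This is only one possible reading of the slide; the class name `FrameVoter`, the window size and the averaging rule are all assumptions of this sketch.

```python
from collections import deque


class FrameVoter:
    """Average per-label confidences over the last `window` frames."""

    def __init__(self, window=5):
        self.history = deque(maxlen=window)

    def update(self, scores):
        """scores: dict of label -> confidence for the newest frame.

        Returns (label, mean confidence) for the best label over the window.
        """
        self.history.append(scores)
        totals = {}
        for frame in self.history:
            for label, confidence in frame.items():
                totals[label] = totals.get(label, 0.0) + confidence
        best = max(totals, key=totals.get)
        return best, totals[best] / len(self.history)


voter = FrameVoter(window=3)
frames = [{"cat": 0.9, "dog": 0.1},
          {"cat": 0.8, "dog": 0.2},
          {"cat": 0.4, "dog": 0.6}]   # a single noisy frame
for frame in frames:
    label, confidence = voter.update(frame)
print(label, round(confidence, 2))
```

The third frame alone would say "dog", but the windowed average still reports "cat".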
Building a DL App and get $10 million in funding
(or a PhD)
Competitions to follow
Winners = High accuracy + Low energy consumption
• LPIRC - Low-Power Image Recognition Challenge
• EDLDC - Embedded Deep Learning Design Contest
• System Design Contest at Design Automation Conference (DAC)
AutoML – Let AI design an efficient AI architecture
MnasNet: Platform-Aware Neural Architecture Search for Mobile
• An automated neural architecture search approach for designing mobile models using reinforcement learning
• Incorporates latency information into the reward objective function
• Measure real-world inference latency by executing on a particular platform
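[Figure: search loop. The Controller samples models from the search space; the Trainer measures accuracy and mobile phones measure latency; both feed a multi-objective reward that is returned to the Controller]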
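The latency-aware reward can be written down compactly. The MnasNet paper defines it as ACC(m) x (LAT(m) / T)^w, where T is the target latency and w is a negative exponent (the paper reports using -0.07). The sketch below follows that formula; the function name and the example accuracies and latencies are made up for illustration.

```python
def mnas_reward(accuracy, latency_ms, target_ms, w=-0.07):
    """Multi-objective reward: accuracy scaled by how latency compares to the target."""
    return accuracy * (latency_ms / target_ms) ** w


target = 80.0   # ms, measured on the phone the model is meant for
print(mnas_reward(0.75, 80.0, target))    # on target: reward equals accuracy
print(mnas_reward(0.76, 160.0, target))   # slightly more accurate but 2x slower
print(mnas_reward(0.74, 40.0, target))    # slightly less accurate but 2x faster
```

Because the exponent is small, latency acts as a soft constraint: a model that is twice as slow loses about 5% of its reward, so it has to be clearly more accurate to win.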
AutoML – Let AI design an efficient AI architecture
For same accuracy:
• 1.5x faster than MobileNetV2
• ResNet-50 accuracy with 19x fewer parameters
• SSD300 mAP with 35x fewer FLOPs
Mr. Data Scientist PhD
One Last Question