
Deep Learning on Mobile Phones - A Practitioner's Guide

Anirudh Koul, Siddha Ganju, Meher Kasam


Anirudh Koul
@AnirudhKoul
Head of AI & Research, Aira
[Lastname]@aira.io

Siddha Ganju
@SiddhaGanju
Architect, Self-Driving Vehicles, NVIDIA
[FirstnameLastname]@gmail.com

Meher Anand Kasam
@MeherKasam
Software Engineer, Square
[FirstnameMiddlenameK]@gmail.com

Why Deep Learning On Mobile?

Latency Privacy

Response Time Limits – Powers of 10

0.1 second : Reacting instantly

1.0 seconds : User’s flow of thought

10 seconds : Keeping the user’s attention

[Miller 1968; Card et al. 1991; Jakob Nielsen 1993]

Mobile Deep Learning Recipe

(Efficient) Mobile Inference Engine + (Efficient) Pretrained Model = DL App

Building a DL App in _ time

Building a DL App in 1 hour

Use Cloud APIs for General Recognition Needs

• Microsoft Cognitive Services

• Clarifai

• Google Cloud Vision

• IBM Watson Services

• Amazon Rekognition

How to Choose a Computer Vision Based API?

Benchmark & Compare them

COCO-Text v2.0 for text reading in the wild
• ~2k random images
• Candidate text has at least 2 characters together
• Direct word match

COCO-Val 2017 for image tagging in the wild
• ~4k random images
• Tag similarity match instead of word match

Pricing

Recognize Text Benchmarks

Text API                       Accuracy
Amazon Rekognition             45.4%
Google Cloud Vision            33.4%
Microsoft Cognitive Services   55.4%

Evaluation criteria:
• Photos have candidate words with length >= 2
• Direct word match with ground truth
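The two evaluation criteria above can be sketched in a few lines. This is a minimal illustration, not the authors' benchmark code: the function name `word_match_accuracy`, the case-insensitive comparison, and the sample words are all assumptions made for the example.

```python
def word_match_accuracy(predicted_words, ground_truth_words, min_length=2):
    """Fraction of candidate ground-truth words that the API returned exactly.

    Assumptions (not from the slides): matching is case-insensitive and
    every ground-truth word counts once.
    """
    # Only words with at least `min_length` characters are candidates.
    candidates = [w.lower() for w in ground_truth_words if len(w) >= min_length]
    if not candidates:
        return None  # no candidate text in this photo, so it is skipped
    predicted = {w.lower() for w in predicted_words}
    hits = sum(1 for w in candidates if w in predicted)
    return hits / len(candidates)


# Hypothetical photo: the API read two of the three candidate words.
score = word_match_accuracy(["STOP", "Main", "St"], ["stop", "main", "street", "a"])
print(score)
```

Averaging this score over all photos that have candidate text gives one accuracy number per API.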

Image Tagging Benchmarks

Image Tagging API              Accuracy   Avg #Tags
Amazon Rekognition             65%        14
Google Cloud Vision            47.6%      14
Microsoft Cognitive Services   50.0%      8

Evaluation criteria:
• Concept similarity match instead of word match
• E.g. ‘military-officer’ tag matched with ground truth tag ‘person’

Hard to do Precision-Recall since COCO ground truth tags are not exhaustive

Lower # of tags for a given accuracy indicates higher F-measure

Tips for reducing network latency

• For text recognition
  • A compression setting of up to 90% has little effect on accuracy, but gives drastic savings in size
  • Resizing is dangerous: text recognition needs a minimum size for recognition
• For image recognition
  • Resize to 224 as the minimum(height, width) at 50% compression with bilinear interpolation
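The image recognition tip can be sketched as follows. This is an illustration under stated assumptions: "50% compression" is read here as a JPEG quality setting of 50, the function names are hypothetical, and the upload helper needs the Pillow library, which is imported lazily so the size calculation runs without it.

```python
def target_size(width, height, min_side=224):
    """Scale (width, height) so that the shorter side equals min_side."""
    scale = min_side / min(width, height)
    return max(1, round(width * scale)), max(1, round(height * scale))


def shrink_for_upload(src_path, dst_path, min_side=224, jpeg_quality=50):
    """Resize with bilinear interpolation and save as a compressed JPEG.

    Requires Pillow (pip install Pillow). jpeg_quality=50 is an assumed
    reading of the "50% compression" tip.
    """
    from PIL import Image

    with Image.open(src_path) as img:
        new_size = target_size(img.width, img.height, min_side)
        resized = img.convert("RGB").resize(new_size, Image.BILINEAR)
        resized.save(dst_path, format="JPEG", quality=jpeg_quality)


print(target_size(4032, 3024))  # a 12 MP photo shrinks to (299, 224)
```

Text recognition requests would skip the resize step and only lower the JPEG quality, since the text needs a minimum size to stay legible.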

Building a DL App in 1 day

http://deeplearningkit.org/2015/12/28/deeplearningkit-deep-learning-for-ios-tested-on-iphone-6s-tvos-and-os-x-developed-in-metal-and-swift/

Energy to train Convolutional Neural Network

Energy to use Convolutional Neural Network

Base Pretrained Model

ImageNet – 1000 Object Categorizer

Inception-v3

Resnet-50

MobileNet

SqueezeNet

Running pre-trained models on mobile

Core ML

TensorFlow Lite

Caffe2

Apple’s Ecosystem

Timeline: Metal (2014) → BNNS + MPS (2016) → CoreML (2017) → CoreML2 (2018)

Apple’s Ecosystem: Metal

• Low-level, low-overhead, hardware-accelerated 3D graphics and compute shader application programming interface (API)
• Available since iOS 8

Apple’s Ecosystem: BNNS + MPS

Fast low-level primitives:
• BNNS – Basic Neural Network Subroutines
  • Ideal case: Fully connected NN
• MPS – Metal Performance Shaders
  • Ideal case: Convolutions

Inconvenient for large networks:
• Inception-v3 inference consisted of 1.5K lines of hard-coded model definition
• Libraries like Forge by Matthijs Hollemans provide abstraction

Apple’s Ecosystem: CoreML

Convert Caffe/TensorFlow model to CoreML model in 3 lines:

import coremltools
coreml_model = coremltools.converters.caffe.convert('my_caffe_model.caffemodel')
coreml_model.save('my_model.mlmodel')

Add model to iOS project and call for prediction.

Direct support for Keras, Caffe, scikit-learn, XGBoost, LibSVM

Automatically minimizes memory footprint and power consumption

Apple’s Ecosystem: CoreML2

• Model quantization support down to 1 bit
• Batch API for improved performance
• Conversion support for MXNet, ONNX
  • ONNX opens models from PyTorch, Cognitive Toolkit, Caffe2, Chainer
• Create ML for quick training
• tf-coreml for direct conversion from TensorFlow

CoreML Benchmark - Pick a DNN for your mobile architecture

Execution time (ms) per iPhone model:

Model          Top-1 Accuracy (%)   Size (MB)   5S     6      6S/SE   7     8/X
VGG 16         71                   553         7408   4556   235     181   146
Inception v3   78                   95          727    637    114     90    78
Resnet 50      75                   103         538    557    77      74    71
MobileNet      71                   17          129    109    44      35    33
SqueezeNet     57                   5           75     78     36      30    29

Release years: iPhone 5S 2013, iPhone 6 2014, iPhone 6S 2015, iPhone 7 2016, iPhone 8/X 2017

Huge improvement in GPU hardware in 2015

Putting out more frames than an art gallery
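To read the benchmark as frame rates, divide 1000 ms by the execution time. The sketch below does that for the iPhone 8/X column of the table above; the dictionary and function names are illustrative only.

```python
# Execution times in milliseconds on iPhone 8/X, copied from the table above.
IPHONE_8_X_MS = {
    "VGG 16": 146,
    "Inception v3": 78,
    "Resnet 50": 71,
    "MobileNet": 33,
    "SqueezeNet": 29,
}


def frames_per_second(execution_ms):
    """Frames per second if every frame takes execution_ms to process."""
    return 1000.0 / execution_ms


for model, ms in IPHONE_8_X_MS.items():
    print(f"{model:<13} {ms:>4} ms  ~{frames_per_second(ms):.0f} FPS")
```

By the same arithmetic, VGG 16 at 7408 ms on an iPhone 5S is well under one frame per second, which is why the architecture choice matters so much on older devices.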

TensorFlow Ecosystem

Timeline: TensorFlow (2015) → TensorFlow Mobile (2016) → TensorFlow Lite (2018)

TensorFlow: the full, bulky deal

Easy pipeline to bring Tensorflow models to mobile

Excellent documentation

Optimizations to bring model to mobile

• Smaller

• Faster

• Minimal dependencies
  • Easier to package & deploy
• Allows running custom operators

1 line conversion from Keras to TensorFlow Lite

• tflite_convert --keras_model_file=keras_model.h5 --output_file=foo.tflite

TensorFlow Lite is small

• ~75KB for core interpreter

• ~400KB for core interpreter + supported operations

• Compared to 1.5 MB for TensorFlow Mobile

TensorFlow Lite is fast

• Takes advantage of on-device hardware acceleration

• Uses FlatBuffers
  • Reduces code footprint, memory usage
  • Reduces CPU cycles on serialization and deserialization
  • Improves startup time
• Pre-fused activations
  • Combining batch normalization layer with previous Convolution
• Interpreter uses static memory and static execution plan
  • Decreases load time
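Combining a batch normalization layer with the previous convolution works because both are linear at inference time, so the normalization can be absorbed into the convolution's weights and bias. The sketch below shows the arithmetic on plain Python lists; it is not TensorFlow Lite's converter code, and the function names and sample numbers are made up for the illustration.

```python
import math


def fold_batch_norm(weights, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold y = gamma * (conv(x) - mean) / sqrt(var + eps) + beta into the conv.

    weights: one flat kernel (list of floats) per output channel.
    bias, gamma, beta, mean, var: one float per output channel.
    Returns (folded_weights, folded_bias).
    """
    folded_w, folded_b = [], []
    for w, b, g, be, m, v in zip(weights, bias, gamma, beta, mean, var):
        scale = g / math.sqrt(v + eps)
        folded_w.append([x * scale for x in w])
        folded_b.append(be + (b - m) * scale)
    return folded_w, folded_b


def conv_at_one_position(patch, kernel, b):
    """Convolution output for a single input patch and a single channel."""
    return sum(p * k for p, k in zip(patch, kernel)) + b


# One output channel, one input patch (illustrative numbers).
patch = [1.0, 2.0, 3.0]
weights, bias = [[0.5, -1.0, 2.0]], [0.1]
gamma, beta, mean, var, eps = [1.5], [0.2], [0.3], [4.0], 1e-5

raw = conv_at_one_position(patch, weights[0], bias[0])
separate = gamma[0] * (raw - mean[0]) / math.sqrt(var[0] + eps) + beta[0]

fw, fb = fold_batch_norm(weights, bias, gamma, beta, mean, var, eps)
fused = conv_at_one_position(patch, fw[0], fb[0])

print(abs(separate - fused) < 1e-9)
```

After folding, the batch normalization layer disappears from the graph, saving one pass over the activations per fused layer.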

TensorFlow Lite Architecture

TensorFlow Lite Benchmarks - http://alpha.lab.numericcal.com/

TensorFlow Lite Benchmarks - http://ai-benchmark.com/

• Crowdsourcing benchmarking with AI Benchmark Android app
  • By Andrey Ignatov from ETH Zurich
• 9 Tests
  • E.g. Semantic Segmentation, Image Super Resolution, Face Recognition

TensorFlow Lite acceleration – GPU delegate (dev preview)

Caffe2

From Facebook

Under 1 MB of binary size

Built for Speed :

For ARM CPU : Uses NEON Kernels, NNPack

For iPhone GPU : Uses Metal Performance Shaders and Metal

For Android GPU : Uses Qualcomm Snapdragon NPE (4-5x speedup)

ONNX format support to import models from CNTK/PyTorch

ML Kit

• Simple, easy to use
• Abstraction over TensorFlow Lite
• Built-in Image Labeling, OCR, Face Detection, Barcode scanning, Landmark detection, Smart reply
• Model management with Firebase
  • Upload model on web interface to distribute
• A/B Testing

MLKit – Face Contours

By leveraging the GPU delegate:
• ~4x speedup on Pixel 3
• ~6x speedup on iPhone 7

Recommendation for production development

1. Train a model using Keras
2. Convert to TensorFlow Lite format (.tflite file) using tflite_convert
3. Upload to Firebase
4. Deploy to iOS/Android apps with MLKit

Common Questions

“My app has become too big to download. What do I do?”

• iOS doesn’t allow apps over 150 MB to be downloaded over cellular data

• Solution : Download on demand, and compile on device

• 0 MB change to app size on first install

Common Questions

“Do I need to ship a new app update with every model improvement?”

• Making app updates is a decent amount of overhead, plus ~2 days wait time
• Solution : Check for model updates, download and compile on device
• Easier solution – Use a framework for model management, e.g.
  • Google ML Kit
  • Fritz
  • Numericcal

Common Questions

“Why does my app not recognize objects at top/bottom of screen?”

• Solution : Check the cropping used; by default, it’s a center crop ☺
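A center crop keeps the largest centered square of the frame, so anything in the top and bottom bands of a portrait frame never reaches the model. The sketch below makes that concrete; the function names and the 1080x1920 frame size are assumptions for the example, not taken from any particular framework.

```python
def center_crop_box(width, height):
    """Return (left, top, right, bottom) of the largest centered square."""
    side = min(width, height)
    left = (width - side) // 2
    top = (height - side) // 2
    return left, top, left + side, top + side


def is_inside_crop(x, y, width, height):
    """True if pixel (x, y) survives the center crop."""
    left, top, right, bottom = center_crop_box(width, height)
    return left <= x < right and top <= y < bottom


# Portrait camera frame, 1080 wide by 1920 tall.
box = center_crop_box(1080, 1920)
print(box)                                     # (0, 420, 1080, 1500)
print(is_inside_crop(540, 100, 1080, 1920))    # False: near the top edge
print(is_inside_crop(540, 960, 1080, 1920))    # True: middle of the screen
```

In this example the top 420 and bottom 420 rows are discarded, which matches the symptom in the question.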

Building a DL App in 1 week

Learn Playing an Accordion: 3 months

Knows Piano + Fine Tune Skills: 1 week

I got a dataset, Now What?

Step 1 : Find a pre-trained model

Step 2 : Fine tune a pre-trained model

Step 3 : Run using existing frameworks

“Don’t Be A Hero” - Andrej Karpathy

How to find pretrained models for my task?

Model Zoo

https://modelzoo.co

- 300+ models

Papers with Code

https://paperswithcode.com/sota

AlexNet, 2012 (simplified)

[Krizhevsky, Sutskever, Hinton ’12]

Honglak Lee, Roger Grosse, Rajesh Ranganath, and Andrew Ng, “Unsupervised Learning of Hierarchical Representations with Convolutional Deep Belief Networks”, 11

n-dimension feature representation

Deciding how to fine tune

Size of New Dataset   Similarity to Original Dataset   What to do?
Large                 High                             Fine tune.
Small                 High                             Don’t fine tune, it will overfit. Train a linear classifier on CNN features.
Small                 Low                              Train a classifier from activations in lower layers. Higher layers are specific to the original dataset.
Large                 Low                              Train CNN from scratch.

http://blog.revolutionanalytics.com/2016/08/deep-learning-part-2.html
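The decision table above is small enough to express as a lookup. The sketch below is only a restatement of the four rows; the function name and the boolean inputs are illustrative, and deciding what counts as "large" or "similar" is left to the practitioner.

```python
def fine_tuning_strategy(dataset_is_large, similar_to_original):
    """Map the two questions from the table to a recommended action."""
    if dataset_is_large and similar_to_original:
        return "Fine tune"
    if not dataset_is_large and similar_to_original:
        return "Train a linear classifier on CNN features"
    if not dataset_is_large and not similar_to_original:
        return "Train a classifier from activations in lower layers"
    return "Train CNN from scratch"


# Example: a few hundred photos of pets, with an ImageNet-pretrained model.
print(fine_tuning_strategy(dataset_is_large=False, similar_to_original=True))
```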

Deciding when to fine tune

Could you train your own classifier... without coding?

• Microsoft CustomVision.ai
  • Unique: Under a minute training, Custom object detection (100x speedup)
• Google AutoML
  • Unique: Full CNN training, crowdsourced workers
• IBM Watson Visual Recognition
• Baidu EZDL
  • Unique: Custom Sound recognition

Custom Vision Service (customvision.ai) – Drag and drop training

Tip : Upload 30 photos per class to make a prototype model

Upload 200 photos per class for a more robust production model

The more distinct the shape/type of object, the fewer images required.

Tip : Use the Fatkun browser extension to download images from a search engine, or use the Bing Image Search API to programmatically download photos with proper rights

CoreML exporter from customvision.ai – Drag and drop training

5 minute shortcut to training, fine-tuning and getting the model ready in CoreML format

Drag and drop interface

Building a Crowdsourced Data Collector in 1 month

Barcode recognition from Seeing AI

Aim : Help blind users identify products using barcode

Issue : Blind users don’t know where the barcode is

Live : Guide user in finding a barcode with audio cues

Server : Decode barcode to identify product

Tech : MPSCNN running on mobile GPU + barcode library

Metrics : 40 FPS (~25 ms) on iPhone 7

Currency recognition from Seeing AI

Aim : Identify currency

Live : Identify denomination of paper currency instantly

Server : (none)

Tech : Task specific CNN running on mobile GPU

Metrics : 40 FPS (~25 ms) on iPhone 7

Training Data Collection App

Request volunteers to take photos of objects in non-obvious settings

Sends photos to cloud, trains model nightly

Newsletter shows the best photos from volunteers

Let them compete for fame

Daily challenge - Collected by volunteers

Building a production DL App in 3 months

What you want: $200,000

What you can afford: $2,000

Image credits: https://www.flickr.com/photos/kenjonbro/9075514760/ and http://www.newcars.com/land-rover/range-rover-sport/2016

Revolution of Depth

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”, 2015

(Figure: layer-by-layer diagrams of each network, growing deeper from left to right)

• AlexNet, 8 layers (ILSVRC 2012)
• VGG, 19 layers (ILSVRC 2014)
• GoogLeNet, 22 layers (ILSVRC 2014)
• ResNet, 152 layers (ILSVRC 2015), ultra deep

Revolution of Depth vs Classification Accuracy

ImageNet Classification top-5 error (%):

• ILSVRC'14 VGG, 19 layers: 7.3
• ILSVRC'14 GoogLeNet, 22 layers: 6.7
• ILSVRC'15 ResNet, 152 layers: 3.6
• ILSVRC'16 Ensemble of Resnet, Inception Resnet, Inception and Wide Residual Network: 2.9

Earlier entries on the chart: ILSVRC'10 and ILSVRC'11 (shallow), ILSVRC'12 AlexNet (8 layers), ILSVRC'13

Accuracy vs Operations Per Image Inference

(Figure: bubble size is proportional to the number of parameters; labeled model sizes include 552 MB and 240 MB; one region of the chart is marked “What we want”)

Alfredo Canziani, Adam Paszke, Eugenio Culurciello, “An Analysis of Deep Neural Network Models for Practical Applications” 2016

Your Budget - Smartphone Floating Point Operations Per Second (2015)

http://pages.experts-exchange.com/processing-power-compared/

iPhone X is more powerful than a Macbook Pro

https://thenextweb.com/apple/2017/09/12/apples-new-iphone-x-already-destroying-android-devices-g/

Strategies to get maximum efficiency from your CNN

Before training

• Pick an efficient architecture for your task

• Designing efficient layers

After training

• Pruning

• Quantization

• Network binarization

CoreML Benchmark - Pick a DNN for your mobile architecture

Execution time (ms) per iPhone model:

Model          Top-1 Accuracy (%)   Size (MB)   Million Multi-Adds   5S     6      6S/SE   7     8/X
VGG 16         71                   553         15300                7408   4556   235     181   146
Inception v3   78                   95          5000                 727    637    114     90    78
Resnet 50      75                   103         3900                 538    557    77      74    71
MobileNet      71                   17          569                  129    109    44      35    33
SqueezeNet     57                   5           800                  75     78     36      30    29

Release years: iPhone 5S 2013, iPhone 6 2014, iPhone 6S 2015, iPhone 7 2016, iPhone 8/X 2017

Huge improvement in GPU hardware in 2015

MobileNet family

Splits the convolution into a 3x3 depthwise conv and a 1x1 pointwise conv

Tune with two parameters – width multiplier and resolution multiplier

Andrew G. Howard et al, “MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications”, 2017
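The saving from splitting a convolution into depthwise and pointwise parts can be checked with the cost formulas from the MobileNets paper: a standard convolution costs K·K·M·N·F·F multiply-adds, while the separable version costs K·K·M·F·F + M·N·F·F, where K is the kernel size, M and N the input and output channels, and F the feature map size. The layer shape below is an illustrative choice, not a figure from the slides.

```python
def standard_conv_mult_adds(kernel, in_ch, out_ch, feature_size):
    """Multiply-adds of a standard kernel x kernel convolution."""
    return kernel * kernel * in_ch * out_ch * feature_size * feature_size


def separable_conv_mult_adds(kernel, in_ch, out_ch, feature_size):
    """Multiply-adds of a depthwise conv followed by a 1x1 pointwise conv."""
    depthwise = kernel * kernel * in_ch * feature_size * feature_size
    pointwise = in_ch * out_ch * feature_size * feature_size
    return depthwise + pointwise


# Illustrative layer: 3x3 kernel, 512 -> 512 channels, 14x14 feature map.
std = standard_conv_mult_adds(3, 512, 512, 14)
sep = separable_conv_mult_adds(3, 512, 512, 14)
print(std, sep)
print(f"separable is ~{std / sep:.1f}x cheaper")
```

The ratio sep / std simplifies to 1/N + 1/K², so for 3x3 kernels the saving approaches 9x as the number of output channels grows.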

Efficient Classification Architectures

https://ai.googleblog.com/2018/04/mobilenetv2-next-generation-of-on.html

MobileNetV2 is the current favorite

Efficient Detection Architectures

Jonathan Huang et al, “Speed/accuracy trade-offs for modern convolutional object detectors”, 2017


Efficient Segmentation Architectures

ICNet - Image cascade network

Tricks while designing your own network

• Dilated Convolutions
  • Great for segmentation / when target object has high area in image
• Replace NxN convolutions with Nx1 followed by 1xN
• Depthwise Separable Convolutions (e.g. MobileNet)
• Inverted residual block (e.g. MobileNetV2)
• Replacing large filters with multiple small filters
  • 5x5 is slower than 3x3 followed by 3x3

Design consideration for custom architectures – Small Filters

Three layers of 3x3 convolutions >>

One layer of 7x7 convolution

Replace large 5x5, 7x7 convolutions with stacks of 3x3 convolutions

Replace NxN convolutions with stack of 1xN and Nx1

Fewer parameters ☺

Less compute ☺

More non-linearity ☺

Better

Faster

Stronger

Andrej Karpathy, CS-231n Notes, Lecture 11
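The "fewer parameters" claim is simple arithmetic. Three stacked 3x3 convolutions (stride 1) see the same 7x7 region of the input as one 7x7 convolution, but with C input and output channels they need 3·9·C² = 27C² weights instead of 49C². The sketch below works this out; the channel count of 256 is an arbitrary example and biases are ignored.

```python
def conv_params(kernel_h, kernel_w, channels):
    """Weights in one conv layer with `channels` inputs and outputs (no bias)."""
    return kernel_h * kernel_w * channels * channels


C = 256  # illustrative channel count

one_7x7 = conv_params(7, 7, C)
three_3x3 = 3 * conv_params(3, 3, C)
print(one_7x7, three_3x3, three_3x3 / one_7x7)

# Same idea for factorizing one NxN filter into 1xN followed by Nx1.
one_3x3 = conv_params(3, 3, C)
factorized_3x3 = conv_params(1, 3, C) + conv_params(3, 1, C)
print(one_3x3, factorized_3x3, factorized_3x3 / one_3x3)
```

The stacked version also inserts a non-linearity after every small layer, which is where the "more non-linearity" point comes from.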

Selective training to keep networks shallow

Idea : Limit data augmentation to how your network will be used

Example : If making a selfie app, there is no benefit in rotating training images beyond ±45 degrees. Your phone will rotate anyway.

Followed by WordLens / Google Translate

Example : Add blur if analyzing mobile phone frames

Pruning

Aim : Remove all connections with absolute weights below a threshold

Song Han, Jeff Pool, John Tran, William J. Dally, "Learning both Weights and Connections for Efficient Neural Networks", 2015
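The pruning rule stated above, dropping every connection whose absolute weight is below a threshold, can be sketched on a plain list of weights. The function names, sample weights, and threshold are illustrative; in the cited paper the network is also retrained after pruning so the remaining weights can compensate, which this sketch does not cover.

```python
def prune_by_magnitude(weights, threshold):
    """Zero out every weight whose absolute value is below the threshold."""
    return [w if abs(w) >= threshold else 0.0 for w in weights]


def sparsity(weights):
    """Fraction of weights that are exactly zero."""
    return sum(1 for w in weights if w == 0.0) / len(weights)


weights = [0.8, -0.02, 0.05, -0.6, 0.01, 0.3]
pruned = prune_by_magnitude(weights, threshold=0.1)
print(pruned)            # [0.8, 0.0, 0.0, -0.6, 0.0, 0.3]
print(sparsity(pruned))  # 0.5
```

The zeroed weights only save space and time if the model is then stored or executed in a sparse format.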

Observation : Most parameters are in Fully Connected Layers

• AlexNet (240 MB): 96% of all parameters
• VGG-16 (552 MB): 90% of all parameters

Pruning gets quickest model compression without accuracy loss

First layer, which directly interacts with the image, is sensitive and cannot be pruned too much without hurting accuracy
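The observation about fully connected layers can be verified by counting parameters. The sketch below does this for VGG-16, assuming the standard configuration from the VGG paper (thirteen 3x3 convolutions, three fully connected layers, 224x224 input); the layer list is not taken from the slides.

```python
# (input channels, output channels) of the thirteen 3x3 conv layers in VGG-16.
CONV_CHANNELS = [
    (3, 64), (64, 64),
    (64, 128), (128, 128),
    (128, 256), (256, 256), (256, 256),
    (256, 512), (512, 512), (512, 512),
    (512, 512), (512, 512), (512, 512),
]
# (inputs, outputs) of the three fully connected layers.
FC_SHAPES = [(7 * 7 * 512, 4096), (4096, 4096), (4096, 1000)]

# Weights plus one bias per output channel or unit.
conv_params = sum(3 * 3 * cin * cout + cout for cin, cout in CONV_CHANNELS)
fc_params = sum(n_in * n_out + n_out for n_in, n_out in FC_SHAPES)
total_params = conv_params + fc_params

print(total_params)                                              # 138357544
print(f"fully connected share: {fc_params / total_params:.1%}")
print(f"size at 32-bit floats: {total_params * 4 / 1e6:.0f} MB")
```

The fully connected share comes out just under 90% and the 32-bit size at roughly 553 MB, in line with the figures on the slide.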

Prune in Keras (Before)

import tensorflow as tf

# Load MNIST and scale pixel values to [0, 1].
mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(512, activation=tf.nn.relu),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10, activation=tf.nn.softmax)
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

model.fit(x_train, y_train, epochs=5)
model.evaluate(x_test, y_test)

Prune in Keras (After)

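import tensorflow as tf

# `prune` is the pruning module as named on the slide; its import is not shown in the source.
# In the released TensorFlow Model Optimization Toolkit, the comparable wrapper is
# tfmot.sparsity.keras.prune_low_magnitude.

mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

# Only change from the "Before" version: the Dense layers are wrapped for pruning.
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(),
    prune.Prune(tf.keras.layers.Dense(512, activation=tf.nn.relu)),
    tf.keras.layers.Dropout(0.2),
    prune.Prune(tf.keras.layers.Dense(10, activation=tf.nn.softmax))
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

model.fit(x_train, y_train, epochs=5)
model.evaluate(x_test, y_test)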

Weight Sharing

Idea : Cluster weights with similar values together, and store in a dictionary.

Codebook

Huffman coding

HashedNets

Cons: Need a special inference engine, doesn’t work for most applications
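The codebook idea can be sketched with a tiny one-dimensional k-means: similar weights are clustered, the cluster centers go into a codebook, and each weight is stored as a small index into it. This is a minimal pure-Python illustration with made-up weights; the function names and the linear initialization of the centroids are assumptions, and it leaves out the Huffman coding step.

```python
def nearest(value, centroids):
    """Index of the centroid closest to value."""
    return min(range(len(centroids)), key=lambda k: abs(value - centroids[k]))


def build_codebook(weights, num_clusters, iterations=10):
    """Cluster weights with 1-D k-means; return (codebook, indices)."""
    lo, hi = min(weights), max(weights)
    steps = max(num_clusters - 1, 1)
    # Assumed initialization: centroids spaced evenly between min and max.
    centroids = [lo + (hi - lo) * i / steps for i in range(num_clusters)]
    for _ in range(iterations):
        assignments = [nearest(w, centroids) for w in weights]
        for k in range(num_clusters):
            members = [w for w, a in zip(weights, assignments) if a == k]
            if members:
                centroids[k] = sum(members) / len(members)
    indices = [nearest(w, centroids) for w in weights]
    return centroids, indices


weights = [0.11, 0.09, 0.10, -0.52, -0.48, -0.50]
codebook, indices = build_codebook(weights, num_clusters=2)
print(codebook)   # two shared values, close to -0.5 and 0.1
print(indices)    # [1, 1, 1, 0, 0, 0]

# Reconstruction: every weight is replaced by its shared codebook value.
shared = [codebook[i] for i in indices]
```

With 2 clusters each index needs only 1 bit instead of a 32-bit float, but the inference engine has to look up the codebook on every access, which is the drawback noted above.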

Filter Pruning - ThiNet

Idea : Discard whole filter if not important to predictions

Advantage:

• No change in architecture, other than thinning of filters per layer

• Can be further compressed with other methods

Just like feature selection, select filter to discard. Possible greedy methods:

• Absolute weight sum of entire filter closest to 0

• Average percentage of ‘Zeros’ as outputs

• ThiNet – Collect statistics on the output of the next layer
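The first greedy criterion in the list, discarding filters whose absolute weight sum is closest to zero, can be sketched as a ranking by L1 norm. This is not the ThiNet method itself, which looks at next-layer output statistics; the function name, the flat filter representation, and the sample values are assumptions for the illustration.

```python
def filters_to_keep(filters, keep_ratio):
    """Return the indices of the filters with the largest absolute weight sums.

    filters: list of filters, each a flat list of weights.
    keep_ratio: fraction of filters to keep (at least one is always kept).
    """
    norms = [sum(abs(w) for w in f) for f in filters]
    num_keep = max(1, round(len(filters) * keep_ratio))
    ranked = sorted(range(len(filters)), key=lambda i: norms[i], reverse=True)
    return sorted(ranked[:num_keep])


filters = [
    [0.9, -0.8, 0.7],    # strong filter
    [0.01, -0.02, 0.0],  # nearly dead filter
    [0.3, 0.4, -0.5],
    [0.05, 0.0, -0.01],  # nearly dead filter
]
kept = filters_to_keep(filters, keep_ratio=0.5)
print(kept)  # [0, 2]
```

Removing a filter also removes the matching input channel of the next layer, which is why the architecture stays the same apart from thinner layers.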

Quantization

Reduce precision from 32 bits to 16 bits or fewer

Use stochastic rounding for best results

In Practice:

• Ristretto + Caffe
  • Automatic network quantization
  • Finds balance between compression rate and accuracy
• Apple Metal Performance Shaders automatically quantize to 16 bits
• TensorFlow has 8 bit quantization support
  • Gemmlowp – Low precision matrix multiplication library
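As a rough picture of what 8 bit quantization does to a weight tensor, the sketch below maps floats linearly onto integer codes 0..255 and back. It is a minimal illustration in plain Python with round-to-nearest, not the scheme used by TensorFlow, Ristretto, or Metal Performance Shaders; the function names and sample weights are made up.

```python
def quantize_linear(weights, num_bits=8):
    """Map floats onto integer codes in [0, 2**num_bits - 1].

    Returns (codes, scale, lowest) so the values can be restored later.
    """
    lowest, highest = min(weights), max(weights)
    levels = 2 ** num_bits - 1
    scale = (highest - lowest) / levels if highest > lowest else 1.0
    codes = [round((w - lowest) / scale) for w in weights]
    return codes, scale, lowest


def dequantize_linear(codes, scale, lowest):
    """Approximate the original floats from the integer codes."""
    return [lowest + c * scale for c in codes]


weights = [-1.0, -0.25, 0.0, 0.4, 1.0]
codes, scale, lowest = quantize_linear(weights)
restored = dequantize_linear(codes, scale, lowest)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print(codes)
print(max_error <= scale / 2 + 1e-12)  # error is at most half a quantization step
```

Each weight now needs 1 byte instead of 4, at the cost of an error of at most half a step per weight.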

Quantizing CNNs in Practice

Reducing CoreML models to half size

import coremltools

# Load a model, lower its precision, and then save the smaller model.
model_spec = coremltools.utils.load_spec('model.mlmodel')
model_fp16_spec = coremltools.utils.convert_neural_network_spec_weights_to_fp16(model_spec)
coremltools.utils.save_spec(model_fp16_spec, 'modelFP16.mlmodel')

Quantizing CNNs in Practice

Reducing CoreML models to even smaller size

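Choose bits and quantization mode

• Bits from [1, 2, 4, 8]
• Quantization mode from ["linear", "linear_lut", "kmeans_lut", "custom_lut"]
  • LUT = look-up table

import coremltools
from coremltools.models.neural_network.quantization_utils import quantize_weights, compare_models

# Load the full-precision model (file name assumed for illustration).
model = coremltools.models.MLModel('model.mlmodel')
quantized_model = quantize_weights(model, 8, 'linear')
quantized_model.save('quantizedModel.mlmodel')
# Compare predictions of both models on a folder of sample inputs.
compare_models(model, quantized_model, './sample_data/')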

Binary weighted Networks

Idea : Reduce the weights to -1,+1

Speedup : Convolution operation can be approximated by only summation and subtraction

Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, Ali Farhadi, “XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks”

XNOR-Net

Idea : Reduce both weights + inputs to -1,+1

Speedup : Convolution operation can be approximated by XNOR and Bitcount operations
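Why XNOR and bit counting are enough: if both vectors only contain +1 and -1, each product is +1 when the signs agree and -1 when they differ. Packing the signs into bits, XNOR marks the agreeing positions, and the dot product is 2 × (number of agreements) − length. The sketch below checks this on made-up vectors; it omits the scaling factors that XNOR-Net uses to approximate real-valued convolutions.

```python
def pack_signs(values):
    """Pack a list of +1/-1 values into an int; bit i is set when values[i] is +1."""
    bits = 0
    for i, v in enumerate(values):
        if v > 0:
            bits |= 1 << i
    return bits


def binary_dot(a_bits, b_bits, length):
    """Dot product of two +1/-1 vectors using only XNOR and a bit count."""
    mask = (1 << length) - 1
    agree = ~(a_bits ^ b_bits) & mask      # XNOR, limited to `length` bits
    matches = bin(agree).count("1")        # bit count (popcount)
    return 2 * matches - length


a = [1, -1, 1, 1, -1, -1, 1, -1]
b = [1, 1, -1, 1, -1, 1, 1, -1]

reference = sum(x * y for x, y in zip(a, b))
fast = binary_dot(pack_signs(a), pack_signs(b), len(a))
print(reference, fast)  # 2 2
```

On real hardware a single instruction handles 32 or 64 of these positions at once, which is where the speedup comes from.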

XNOR-Net

XNOR-Net on Mobile

Battery free, solar powered AI Device from XNOR.AI

Challenges

Off the shelf CNNs not robust for video

Solutions:

• Collective confidence over several frames

• CortexNet
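The first solution, collective confidence over several frames, can be sketched as a sliding-window average of per-frame confidences. The class name `FrameSmoother`, the window size, and the sample frames are assumptions for this example; it is one simple way to pool frames, not a method prescribed by the slides.

```python
from collections import deque


class FrameSmoother:
    """Average per-label confidences over the most recent frames."""

    def __init__(self, window=5):
        self.history = deque(maxlen=window)

    def update(self, confidences):
        """Add one frame's {label: confidence} dict; return (label, mean confidence)."""
        self.history.append(confidences)
        totals = {}
        for frame in self.history:
            for label, confidence in frame.items():
                totals[label] = totals.get(label, 0.0) + confidence
        best = max(totals, key=totals.get)
        return best, totals[best] / len(self.history)


# The second frame alone would flip the prediction to "dog".
frames = [
    {"cat": 0.9, "dog": 0.1},
    {"cat": 0.4, "dog": 0.6},
    {"cat": 0.8, "dog": 0.2},
]
smoother = FrameSmoother(window=3)
results = [smoother.update(frame) for frame in frames]
for label, confidence in results:
    print(label, round(confidence, 2))
```

A longer window gives steadier output but reacts more slowly when the scene really changes.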

Building a DL App and getting $10 million in funding

(or a PhD)

Competitions to follow

Winners = High accuracy + Low energy consumption

• LPIRC - Low-Power Image Recognition Challenge
• EDLDC - Embedded Deep Learning Design Contest
• System Design Contest at Design Automation Conference (DAC)

AutoML – Let AI design an efficient AI architecture

MnasNet: Platform-Aware Neural Architecture Search for Mobile

• An automated neural architecture search approach for designing mobile models using reinforcement learning

• Incorporates latency information into the reward objective function

• Measure real-world inference latency by executing on a particular platform

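(Figure: the Controller samples models from the search space; the Trainer measures accuracy and mobile phones measure latency; both are combined into a multi-objective reward that is fed back to the Controller)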
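One way to fold latency into the reward is the soft-constraint form described in the MnasNet paper: reward = accuracy × (latency / target)^w, with a small negative exponent w. The sketch below assumes that form with w = -0.07; the function name and the sample accuracy and latency numbers are made up for illustration.

```python
def mnas_reward(accuracy, latency_ms, target_ms, w=-0.07):
    """Multi-objective reward: accuracy scaled by how latency compares to the target.

    w = -0.07 is assumed here; a negative exponent penalizes models that are
    slower than the target and mildly rewards faster ones.
    """
    return accuracy * (latency_ms / target_ms) ** w


target = 80.0  # hypothetical latency budget in ms

on_target = mnas_reward(0.75, 80.0, target)
slower_but_better = mnas_reward(0.76, 160.0, target)
faster_but_worse = mnas_reward(0.74, 40.0, target)

print(round(on_target, 4), round(slower_but_better, 4), round(faster_but_worse, 4))
```

In this example a model that gains one point of accuracy but doubles the latency scores lower than the on-target model, which is the trade-off the controller learns to navigate.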

AutoML – Let AI design an efficient AI architecture

For same accuracy:

• 1.5x faster than MobileNetV2

• ResNet-50 accuracy with 19x fewer parameters

• SSD300 mAP with 35x fewer FLOPs

Mr. Data Scientist PhD

One Last Question

How to access the slides in 1 second

http://bit.ly/ml-slides

@anirudhkoul
