
Squeezing Deep Learning into mobile phones

- A Practitioner's Guide

Anirudh Koul, @anirudhkoul, http://koul.ai
Project Lead, Seeing AI
Applied Researcher, Microsoft AI & Research
akoul at microsoft dot com

Currently working on applying artificial intelligence for productivity, augmented reality and accessibility
Along with Eugene Seleznev, Saqib Shaikh, Meher Kasam

Why Deep Learning On Mobile?


Latency

Privacy

Mobile Deep Learning Recipe

Mobile Inference Engine (Efficient) + Pretrained Model (Efficient) = DL App

Building a DL App in _ time

Building a DL App in 1 hour

Use Cloud APIs

Microsoft Cognitive Services
Clarifai
Google Cloud Vision
IBM Watson Services
Amazon Rekognition

Microsoft Cognitive Services

Models won the 2015 ImageNet Large Scale Visual Recognition Challenge
Vision, Face, Emotion, Video and 21 other topics
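As an illustration, a minimal Python sketch of calling such a cloud API — here the Cognitive Services Computer Vision endpoint. The subscription key, region and the v1.0 endpoint/parameters are assumptions from the early-2017 API; adapt them to your account and the current docs.

```python
# Minimal sketch: sending an image to the Cognitive Services Computer Vision API.
# Key and endpoint are placeholders; parameter names reflect the v1.0 API circa 2017.
import requests

SUBSCRIPTION_KEY = "YOUR_KEY_HERE"  # placeholder
ENDPOINT = "https://westus.api.cognitive.microsoft.com/vision/v1.0/analyze"

def describe_image(image_path):
    headers = {
        "Ocp-Apim-Subscription-Key": SUBSCRIPTION_KEY,
        "Content-Type": "application/octet-stream",
    }
    params = {"visualFeatures": "Description,Tags"}
    with open(image_path, "rb") as f:
        response = requests.post(ENDPOINT, headers=headers, params=params, data=f.read())
    response.raise_for_status()
    return response.json()

print(describe_image("photo.jpg"))
```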

Building a DL App in 1 day

http://deeplearningkit.org/2015/12/28/deeplearningkit-deep-learning-for-ios-tested-on-iphone-6s-tvos-and-os-x-developed-in-metal-and-swift/

Energy to train a Convolutional Neural Network

Energy to use a Convolutional Neural Network

Base PreTrained Model

ImageNet – 1000-category object classifier
Inception
ResNet

Running pre-trained models on mobile

MXNet
Tensorflow
CNNDroid
DeepLearningKit
Caffe
Torch

MXNET


Amalgamation : Pack all the code in a single source file

Pro:
• Cross-platform (iOS, Android), easy porting
• Usable in any programming language

Con:
• CPU only, slow

https://github.com/Leliana/WhatsThis

Tensorflow

Easy pipeline to bring Tensorflow models to mobile
Great documentation
Optimizations to bring models to mobile
Upcoming: XLA (Accelerated Linear Algebra) compiler to optimize for hardware
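For illustration, a minimal TensorFlow 1.x sketch of loading the same frozen GraphDef (.pb) a mobile app would ship and running it on the desktop to verify it before porting. The tensor names assume the classic Inception-v3 example graph and are placeholders for your own model.

```python
# Minimal sketch: run a frozen .pb on desktop before shipping it to iOS/Android.
import tensorflow as tf

def run_frozen_graph(pb_path, jpeg_path):
    graph_def = tf.GraphDef()
    with tf.gfile.GFile(pb_path, "rb") as f:
        graph_def.ParseFromString(f.read())

    with tf.Graph().as_default() as graph:
        tf.import_graph_def(graph_def, name="")

    with tf.Session(graph=graph) as sess:
        image_data = tf.gfile.GFile(jpeg_path, "rb").read()
        softmax = graph.get_tensor_by_name("softmax:0")                   # assumed output name
        preds = sess.run(softmax, {"DecodeJpeg/contents:0": image_data})  # assumed input name
    return preds

print(run_frozen_graph("inception_v3_frozen.pb", "photo.jpg").argmax())
```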

CNNdroid

GPU-accelerated CNNs for Android
Supports Caffe, Torch and Theano models
~30-40x speedup using the mobile GPU vs CPU (AlexNet)

Internally, CNNdroid expresses data parallelism for the different layers, instead of leaving it to the GPU's hardware scheduler

DeepLearningKit

Platform: iOS, OS X and tvOS (Apple TV)
DNN type: CNN models trained in Caffe
Runs on the mobile GPU, uses Metal

Pro: Fast, directly ingests Caffe models
Con: Unmaintained

Caffe

Caffe for Android: https://github.com/sh1r0/caffe-android-lib
Sample app: https://github.com/sh1r0/caffe-android-demo

Caffe for iOS: https://github.com/aleph7/caffe
Sample app: https://github.com/noradaiko/caffe-ios-sample

Pro: Usually a couple of lines to port a pretrained model to the mobile CPU
Con: Unmaintained

Running pre-trained models on mobile

Mobile Library   | Platform    | GPU | DNN Architectures Supported | Trained Models Supported
Tensorflow       | iOS/Android | Yes | CNN, RNN, LSTM, etc.        | Tensorflow
CNNDroid         | Android     | Yes | CNN                         | Caffe, Torch, Theano
DeepLearningKit  | iOS         | Yes | CNN                         | Caffe
MXNet            | iOS/Android | No  | CNN, RNN, LSTM, etc.        | MXNet
Caffe            | iOS/Android | No  | CNN                         | Caffe
Torch            | iOS/Android | No  | CNN, RNN, LSTM, etc.        | Torch

Building a DL App in 1 week

Learn playing an accordion from scratch: 3 months

Knows piano → fine-tune skills: 1 week

I got a dataset, Now What?

Step 1: Find a pre-trained model
Step 2: Fine-tune the pre-trained model
Step 3: Run using existing frameworks

“Don’t Be A Hero” - Andrej Karpathy
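As a sketch of steps 1–2, a minimal Keras fine-tuning example. It assumes Keras with an ImageNet-pretrained InceptionV3; the class count and training data are placeholders for your own dataset.

```python
# Minimal fine-tuning sketch: reuse a pre-trained backbone, train only a new head.
from keras.applications.inception_v3 import InceptionV3
from keras.layers import GlobalAveragePooling2D, Dense
from keras.models import Model

NUM_CLASSES = 10  # placeholder

# Step 1: start from a pre-trained model (ImageNet weights, no classifier head)
base = InceptionV3(weights="imagenet", include_top=False)

# Step 2: add a small classifier head and fine-tune only the new layers at first
x = GlobalAveragePooling2D()(base.output)
x = Dense(256, activation="relu")(x)
predictions = Dense(NUM_CLASSES, activation="softmax")(x)
model = Model(inputs=base.input, outputs=predictions)

for layer in base.layers:   # freeze the pre-trained backbone
    layer.trainable = False

model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit_generator(train_generator, ...)  # Step 3: train, then export for mobile
```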

How to find pretrained models for my task?


Search “Model Zoo”

Microsoft Cognitive Toolkit (previously called CNTK) – 50 models
Caffe Model Zoo
Keras
Tensorflow
MXNet

AlexNet, 2012 (simplified)

[Krizhevsky, Sutskever, Hinton '12]

Honglak Lee, Roger Grosse, Rajesh Ranganath, and Andrew Ng, “Unsupervised Learning of Hierarchical Representations with Convolutional Deep Belief Networks”, 2011

n-dimensional feature representation

Deciding how to fine tune

Size of New Dataset | Similarity to Original Dataset | What to do?
Large               | High                           | Fine-tune.
Small               | High                           | Don't fine-tune, it will overfit. Train a linear classifier on CNN features.
Small               | Low                            | Train a classifier from activations in lower layers. Higher layers are specific to the older dataset.
Large               | Low                            | Train the CNN from scratch.

Source: http://blog.revolutionanalytics.com/2016/08/deep-learning-part-2.html
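For the "train a linear classifier on CNN features" case, a minimal sketch that uses a pre-trained network as a fixed feature extractor and fits a simple classifier on top. Keras and scikit-learn are assumed; X_train/y_train/X_test/y_test are placeholder arrays.

```python
# Minimal sketch: extract CNN bottleneck features, fit a linear classifier on them.
from keras.applications.resnet50 import ResNet50, preprocess_input
from sklearn.linear_model import LogisticRegression

extractor = ResNet50(weights="imagenet", include_top=False, pooling="avg")

def cnn_features(images):                    # images: (N, 224, 224, 3) float array
    return extractor.predict(preprocess_input(images.copy()))

features = cnn_features(X_train)             # X_train / y_train are placeholders
clf = LogisticRegression()
clf.fit(features, y_train)
print(clf.score(cnn_features(X_test), y_test))
```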


Building a DL Website in 1 week

Less Data + Smaller Networks = Faster browser training


Several JavaScript Libraries

Run large CNNs: Keras-JS, MXNetJS, CaffeJS

Train and run CNNs: ConvNetJS

Train and run LSTMs: Brain.js, Synaptic.js

Train and run NNs: Mind.js, DN2A

ConvNetJS

Both train and test NNs in the browser, including training CNNs in the browser

Keras.js


Run Keras models in browser, with GPU support.

Brain.JS

Train and run NNs in the browser
Supports feedforward networks, RNN, LSTM, GRU
No CNNs
Demo: http://brainjs.com/

Trained NN to recognize color contrast

MXNetJS

On Firefox and Microsoft Edge, performance is ~8x faster than on Chrome, due to differences in ASM.js optimization.

Building a DL App in 1 month

(and get featured in Apple App store)

Response Time Limits – Powers of 10

0.1 second: reacting instantly
1.0 second: user's flow of thought
10 seconds: keeping the user's attention

[Miller 1968; Card et al. 1991; Jakob Nielsen 1993]

Apple frameworks for Deep Learning Inference

BNNS – Basic Neural Network Subroutines
MPS – Metal Performance Shaders

Metal Performance Shaders (MPS)

Fast, provides GPU acceleration for the inference phase
Faster app load times than Tensorflow (Jan 2017)
About 1/3rd the run-time memory of Tensorflow on Inception-V3 (Jan 2017)
~130 ms on iPhone 7 Plus to run Inception-V3

Cons:
• Limited documentation
• No easy way to programmatically port models
• No batch normalization. Solution: fold Conv and BatchNorm weights together
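One way to fold BatchNorm into the preceding convolution is to rescale the conv weights and bias with the batch-norm statistics. A minimal NumPy sketch, assuming an (out_channels, in_channels, kH, kW) weight layout and per-output-channel statistics; adjust layout and epsilon for your framework.

```python
# Minimal sketch: fold a BatchNorm layer into the preceding convolution's weights.
import numpy as np

def fold_batchnorm(W, b, gamma, beta, mean, var, eps=1e-5):
    scale = gamma / np.sqrt(var + eps)          # one factor per output channel
    W_folded = W * scale[:, None, None, None]   # rescale each output filter
    b_folded = (b - mean) * scale + beta
    return W_folded, b_folded
```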


Putting out more frames than an art gallery

Basic Neural Network Subroutines (BNNS)


Runs on CPU

BNNS is faster for smaller networks than MPS but slower for bigger networks

BrainCore

NN framework for iOS
Provides LSTM functionality
Fast, uses Metal, runs on the iPhone GPU
https://github.com/aleph7/braincore

Building a DL App in 6 months


What you want ($200,000) vs. what you can afford ($2,000)

Image credits: https://www.flickr.com/photos/kenjonbro/9075514760/ and http://www.newcars.com/land-rover/range-rover-sport/2016

Revolution of Depth

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun, “Deep Residual Learning for Image Recognition”, 2015

[Architecture diagrams: AlexNet, 8 layers (ILSVRC 2012); VGG, 19 layers (ILSVRC 2014); GoogleNet, 22 layers (ILSVRC 2014); ResNet, 152 layers, “ultra deep” (ILSVRC 2015)]

ImageNet classification top-5 error (%):
ILSVRC'10 – 28.2 (shallow)
ILSVRC'11 – 25.8 (shallow)
ILSVRC'12 AlexNet – 16.4 (8 layers)
ILSVRC'13 – 11.7 (8 layers)
ILSVRC'14 VGG – 7.3 (19 layers)
ILSVRC'14 GoogleNet – 6.7 (22 layers)
ILSVRC'15 ResNet – 3.57 (152 layers)

Your Budget - Smartphone Floating Point Operations Per Second (2015)

Source: http://pages.experts-exchange.com/processing-power-compared/

Accuracy vs Operations Per Image Inference

(Circle size is proportional to the number of parameters; VGG-16 ≈ 552 MB, AlexNet ≈ 240 MB. “What we want”: high accuracy with few operations per image.)

Alfredo Canziani, Adam Paszke, Eugenio Culurciello, “An Analysis of Deep Neural Network Models for Practical Applications”, 2016

Accuracy Per Parameter

Alfredo Canziani, Adam Paszke, Eugenio Culurciello, “An Analysis of Deep Neural Network Models for Practical Applications”, 2016

Pick your DNN Architecture for your mobile architecture

ResNet family: under 150 ms per image on iPhone 7 using the Metal GPU

Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun, "Deep Residual Learning for Image Recognition", 2015

Strategies to make DNNs even more efficient

• Shallow networks
• Compressing pre-trained networks
• Designing compact layers
• Quantizing parameters
• Network binarization

Pruning


Aim : Remove all connections with absolute weights below a threshold

Song Han, Jeff Pool, John Tran, William J. Dally, "Learning both Weights and Connections for Efficient Neural Networks", 2015
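A minimal NumPy sketch of this magnitude pruning; in practice the pruned network is then retrained with the mask held fixed.

```python
# Minimal sketch: zero out (and mask) weights whose magnitude is below a threshold.
import numpy as np

def prune_weights(W, threshold):
    mask = np.abs(W) >= threshold
    return W * mask, mask

W = np.random.randn(4096, 4096) * 0.01
W_pruned, mask = prune_weights(W, threshold=0.02)
print("kept %.1f%% of connections" % (100.0 * mask.mean()))
```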

Observation: most parameters are in the fully connected layers

AlexNet (240 MB): 96% of all parameters
VGG-16 (552 MB): 90% of all parameters

Pruning gives the quickest model compression without accuracy loss

AlexNet (240 MB), VGG-16 (552 MB)

The first layer, which directly interacts with the image, is sensitive and cannot be pruned much without hurting accuracy

Weight Sharing

Idea: cluster weights with similar values together, and store them in a dictionary.

Codebook, Huffman coding, HashedNets

Simplest implementation:
• Round all weights into 256 levels
• Tensorflow's export script reduces the Inception zip file from 87 MB to 26 MB, with ~1% drop in precision
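A minimal sketch of codebook-style weight sharing, using scikit-learn's KMeans as a stand-in for whatever clustering a real exporter uses, with 256 shared values so each weight index fits in one byte.

```python
# Minimal sketch: replace weights with a 256-entry codebook plus one byte per weight.
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(W, levels=256):
    km = KMeans(n_clusters=levels, n_init=1).fit(W.reshape(-1, 1))
    codebook = km.cluster_centers_.ravel()   # 256 shared float values
    indices = km.labels_.astype(np.uint8)    # 1 byte per weight
    return codebook, indices

def reconstruct(codebook, indices, shape):
    return codebook[indices].reshape(shape)
```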

Selective training to keep networks shallow

Idea: augment data only in the ways your network will actually be used

Example: if making a selfie app, there is no benefit in rotating training images beyond ±45 degrees; the phone will auto-rotate anyway. This approach is used by Word Lens / Google Translate.

Example : Add blur if analyzing mobile phone frames
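A minimal Keras sketch of such use-case-limited augmentation: rotation capped at ±45 degrees, plus a hypothetical mild-blur preprocessing function to mimic soft mobile-camera frames.

```python
# Minimal sketch: augment only as the network will be used (limited rotation + blur).
import numpy as np
from keras.preprocessing.image import ImageDataGenerator
from scipy.ndimage import gaussian_filter

def mild_blur(img):
    # Light Gaussian blur per channel, simulating blurry phone-camera frames
    sigma = np.random.uniform(0.0, 1.0)
    return gaussian_filter(img, sigma=(sigma, sigma, 0))

datagen = ImageDataGenerator(rotation_range=45,            # no benefit beyond +/-45 deg
                             preprocessing_function=mild_blur)
# datagen.flow_from_directory("train/", target_size=(224, 224), batch_size=32)
```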

Design consideration for custom architectures – Small Filters

Three layers of 3x3 convolutions >> one layer of 7x7 convolution

Replace large 5x5 and 7x7 convolutions with stacks of 3x3 convolutions
Replace NxN convolutions with a stack of 1xN and Nx1 convolutions
⇒ Fewer parameters ⇒ Less compute ⇒ More non-linearity

Better, Faster, Stronger

Andrej Karpathy, CS-231n Notes, Lecture 11
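A quick back-of-the-envelope check of the parameter savings, assuming C input and output channels:

```python
# Three stacked 3x3 convs cover a 7x7 receptive field with far fewer weights.
C = 256
params_7x7 = 7 * 7 * C * C               # one 7x7 layer
params_3x3_stack = 3 * (3 * 3 * C * C)   # three 3x3 layers
print(params_7x7, params_3x3_stack)      # 3,211,264 vs 1,769,472 -> ~45% fewer weights
```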

SqueezeNet - AlexNet-level accuracy in 0.5 MB

SqueezeNet base: 4.8 MB
SqueezeNet compressed: 0.5 MB

80.3% top-5 accuracy on ImageNet
0.72 GFLOPS/image

Fire Block

Forrest N. Iandola, Song Han et al, "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size"
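A minimal Keras sketch of a Fire block: squeeze with 1x1 convolutions, then expand with parallel 1x1 and 3x3 convolutions and concatenate. The filter counts are illustrative rather than the exact SqueezeNet configuration.

```python
# Minimal sketch of a SqueezeNet-style "Fire" block in Keras.
from keras.layers import Conv2D, Concatenate, Input
from keras.models import Model

def fire_block(x, squeeze=16, expand=64):
    s = Conv2D(squeeze, (1, 1), activation="relu", padding="same")(x)   # squeeze
    e1 = Conv2D(expand, (1, 1), activation="relu", padding="same")(s)   # expand 1x1
    e3 = Conv2D(expand, (3, 3), activation="relu", padding="same")(s)   # expand 3x3
    return Concatenate(axis=-1)([e1, e3])

inp = Input(shape=(224, 224, 96))
out = fire_block(inp)
Model(inp, out).summary()
```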

Reduced precision

Reduce precision from 32 bits to 16 bits or fewer
Use stochastic rounding for best results

In practice:
• Ristretto + Caffe: automatic network quantization, finds a balance between compression rate and accuracy
• Apple Metal Performance Shaders automatically quantize to 16 bits
• Tensorflow has 8-bit quantization support
• Gemmlowp – low-precision matrix multiplication library
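A minimal NumPy sketch of reduced precision with stochastic rounding, mapping float32 weights onto 8-bit codes and back:

```python
# Minimal sketch: linear 8-bit quantization with stochastic rounding.
import numpy as np

def quantize_stochastic(W, bits=8):
    levels = 2 ** bits - 1
    w_min, w_max = W.min(), W.max()
    scaled = (W - w_min) / (w_max - w_min) * levels             # map to [0, levels]
    floor = np.floor(scaled)
    q = floor + (np.random.rand(*W.shape) < (scaled - floor))   # round up with prob = frac
    return q.astype(np.uint8), w_min, w_max

def dequantize(q, w_min, w_max, bits=8):
    levels = 2 ** bits - 1
    return w_min + q.astype(np.float32) / levels * (w_max - w_min)
```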

Binary weighted Networks

Idea: reduce the weights to -1, +1
Speedup: the convolution operation can be approximated by only summation and subtraction

Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, Ali Farhadi, “XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks”
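A minimal sketch of the binary-weight approximation used in XNOR-Net's binary-weight variant: replace a filter W with alpha * sign(W), where alpha is the mean absolute value of the filter's weights.

```python
# Minimal sketch: binarize a filter's weights with a per-filter scaling factor.
import numpy as np

def binarize_weights(W):
    alpha = np.abs(W).mean()   # per-filter scaling factor
    B = np.sign(W)             # weights constrained to -1, +1
    return alpha, B

W = np.random.randn(3, 3, 64)
alpha, B = binarize_weights(W)
W_approx = alpha * B           # used in place of W; convolution becomes add/subtract
```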


XNOR-Net

Idea: reduce both weights and inputs to -1, +1
Speedup: the convolution operation can be approximated by XNOR and bitcount operations

Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, Ali Farhadi, “XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks”
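A tiny pure-Python illustration of why this helps: for -1/+1 vectors packed into bits, a dot product reduces to an XNOR followed by a population count.

```python
# Minimal sketch: dot product of -1/+1 vectors via XNOR + popcount on packed bits.
import numpy as np

def pack_bits(v):                        # map -1 -> 0, +1 -> 1 and pack into an int
    bits = 0
    for i, x in enumerate(v):
        if x > 0:
            bits |= (1 << i)
    return bits

def xnor_dot(a_bits, b_bits, n):
    xnor = ~(a_bits ^ b_bits) & ((1 << n) - 1)
    popcount = bin(xnor).count("1")      # number of agreeing positions
    return 2 * popcount - n              # equals the -1/+1 dot product

a = np.sign(np.random.randn(64)); a[a == 0] = 1
b = np.sign(np.random.randn(64)); b[b == 0] = 1
assert xnor_dot(pack_bits(a), pack_bits(b), 64) == int(np.dot(a, b))
```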


XNOR-Net on Mobile


Building a DL App and getting $10 million in funding (or a PhD)

Minerva


DeepX Toolkit

Nicholas D. Lane et al, “DXTK: Enabling Resource-efficient Deep Learning on Mobile and Embedded Devices with the DeepX Toolkit”, 2016

EIE : Efficient Inference Engine on Compressed DNNs

Song Han, Xingyu Liu, Huizi Mao, Jing Pu, Ardavan Pedram, Mark Horowitz, William Dally, "EIE: Efficient Inference Engine on Compressed Deep Neural Network", 2016

189x faster than CPU, 13x faster than GPU

One Last Question

How to access the slides in 1 second

Link posted here -> @anirudhkoul
