deep learning for vision - cse.msu.edu/cse803/Lectures/deepLearningTutorial.pptx.pdf

Deep Learning for Vision, Xiaoming Liu. Slides adapted from Adam Coates.

TRANSCRIPT

Page 1: Deep Learning for Vision - cse.msu.edu/cse803/Lectures/deepLearningTutorial.pptx.pdf · Deep Learning for Vision, Xiaoming Liu. Slides adapted from Adam Coates.

Deep Learning for Vision

Xiaoming Liu

Slides adapted from Adam Coates

Page 2:

What do we want ML to do?

• Given an image, predict complex high-level patterns:

Object recognition   Detection   Segmentation

"Cat"

[Martin et al., 2001]

Page 3:

How is ML done?

• Machine learning often uses a common pipeline with hand-designed feature extraction.
• The final ML algorithm learns to make decisions starting from the higher-level representation.
• Sometimes there are layers of increasingly high-level abstractions.
   – Constructed using prior knowledge about the problem domain.

Feature Extraction → Machine Learning Algorithm → "Cat"?

Prior Knowledge, Experience

Page 4:

"Deep Learning"

• Deep Learning:
   – Train multiple layers of features/abstractions from data.
   – Try to discover a representation that makes decisions easy.

Low-level Features → Mid-level Features → High-level Features → Classifier → "Cat"?

Deep Learning: train layers of features so that the classifier works well.

More abstract representation

Page 5:

"Deep Learning"

• Why do we want "deep learning"?
   – Some decisions require many stages of processing.
      • It is easy to invent cases where a "deep" model is compact but a shallow model is very large / inefficient.
   – We already, intuitively, hand-engineer "layers" of representation.
      • Let's replace this with something automated!
   – Algorithms scale well with data and computing power.
      • In practice, one of the most consistently successful ways to get good results in ML.
   – Can try to take advantage of unlabeled data to learn representations before the task.

Page 6:

Have we been here before?

Ø Yes.
   – Basic ideas are common to past ML and neural networks research.
      • Supervised learning is straightforward.
      • Standard ML development strategies are still relevant.
      • Some knowledge carries over from problem domains.

Ø No.
   – Faster computers; more data.
   – Better optimizers; better initialization schemes.
      • "Unsupervised pre-training" trick  [Hinton et al. 2006; Bengio et al. 2006]
   – Lots of empirical evidence about what works.
      • Made useful by the ability to "mix and match" components.  [See, e.g., Jarrett et al., ICCV 2009]

Page 7:

Real impact

• DL systems are high performers in many tasks over many domains.

Image recognition  [e.g., Krizhevsky et al., 2012]
Speech recognition  [e.g., Heigold et al., 2013]
NLP  [e.g., Socher et al., ICML 2011; Collobert & Weston, ICML 2008]

[Honglak Lee]

Page 8:

Outline

• ML refresher / crash course
   – Logistic regression
   – Optimization
   – Features
• Supervised deep learning
   – Neural network models
   – Back-propagation
   – Training procedures
• Supervised DL for images
   – Neural network architectures for images
   – Application to ImageNet
• References / Resources

Page 9:

MACHINE LEARNING REFRESHER

Crash  Course  

Page 10:

Supervised Learning

• Given labeled training examples:  {(x(i), y(i))}
• For instance:  x(i) = vector of pixel intensities;  y(i) = object class ID.
• Goal: find f(x) to predict y from x on training data.
   – Hopefully: the learned predictor also works on "test" data.

[255 98 93 87 …]  →  f(x)  →  y = 1  ("Cat")

Page 11:

Logistic Regression

• Simple binary classification algorithm.
   – Start with a function of the form:

      f(x) = σ(θᵀx) = 1 / (1 + exp(−θᵀx))

   – Interpretation: f(x) is the probability that y = 1.
      • The sigmoid "nonlinearity" squashes the linear function to [0, 1].
   – Find the choice of θ that minimizes the objective:

      J(θ) = −Σᵢ [ y(i) log f(x(i)) + (1 − y(i)) log(1 − f(x(i))) ]
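As a concrete illustration (mine, not from the slides), the sigmoid and the logistic objective above translate directly into a few lines of Python:

```python
import math

def sigmoid(z):
    # squashes any real number into (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def predict(theta, x):
    # f(x) = sigmoid(theta . x): the probability that y = 1
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))

def loss(theta, data):
    # negative log-likelihood over labeled examples (x, y), y in {0, 1}
    total = 0.0
    for x, y in data:
        f = predict(theta, x)
        total -= y * math.log(f) + (1 - y) * math.log(1 - f)
    return total
```

With θ = 0 every prediction is 0.5, and the loss per example is log 2; training drives θ toward values that push f(x) toward the labels.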

Page 12:

Optimization

• How do we tune θ to minimize J(θ)?
• One algorithm: gradient descent.
   – Compute the gradient:  ∇θ J(θ)
   – Follow the gradient "downhill":  θ ← θ − η ∇θ J(θ)
• Stochastic Gradient Descent (SGD): take a step using the gradient from only a small batch of examples.
   – Scales to larger datasets.  [Bottou & LeCun, 2005]
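The SGD idea above can be sketched end-to-end on a toy problem (my illustration; it uses the standard fact that the logistic-loss gradient for one example is (f(x) − y)·x):

```python
import math
import random

def sgd_step(theta, batch, lr=0.1):
    """One SGD step: gradient of the logistic loss computed on a small batch only."""
    grad = [0.0] * len(theta)
    for x, y in batch:
        f = 1.0 / (1.0 + math.exp(-sum(t * xi for t, xi in zip(theta, x))))
        for j, xj in enumerate(x):
            grad[j] += (f - y) * xj  # d(loss)/d(theta_j) for this example
    return [t - lr * g / len(batch) for t, g in zip(theta, grad)]

# toy data: y = 1 iff the first coordinate is positive (second coordinate is a bias input)
data = [([+1.0, 1.0], 1), ([-1.0, 1.0], 0), ([+2.0, 1.0], 1), ([-2.0, 1.0], 0)]
theta = [0.0, 0.0]
for epoch in range(200):
    random.shuffle(data)                 # shuffle, then sweep small batches
    for i in range(0, len(data), 2):
        theta = sgd_step(theta, data[i:i + 2])
```

After training, the weight on the informative coordinate has become positive, so the classifier separates the two classes.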

 

Page 13:

Is this enough?

• The loss is convex → we always find the minimum.
• Works for simple problems:
   – Classify digits as 0 or 1 using pixel intensity.
   – Certain pixels are highly informative, e.g., the center pixel.

• Fails for even slightly harder problems.
   – Is this a coffee mug?

Page 14:

Why is vision so hard?

"Coffee Mug" → Pixel Intensity

Pixel intensity is a very poor representation.

Page 15:

Why is vision so hard?

[Scatter plot: images plotted by the intensities of two pixels, e.g. pixel intensity [72 160]; "+" marks Coffee Mug, "−" marks Not Coffee Mug.]

Page 16:

Why is vision so hard?

[Scatter plot: with more examples, the "+" (Coffee Mug) and "−" (Not Coffee Mug) points are thoroughly interleaved in pixel-intensity space, so "Is this a Coffee Mug?" is very hard for a learning algorithm operating on raw pixels.]

Page 17:

Features

[Scatter plot: the same examples plotted by "cylinder?" and "handle?" feature responses. In this feature space the "+" (Coffee Mug) and "−" (Not Coffee Mug) points separate cleanly, so "Is this a Coffee Mug?" becomes easy for the learning algorithm.]

Page 18:

Features

• Features are usually hard-wired transformations built into the system.
   – Formally, a function φ that maps raw input to a "higher level" representation.
   – Completely static, so just substitute φ(x) for x and do logistic regression as before.

Where do we get good features?
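Because the feature map is static, the pipeline is just function composition: the classifier never sees raw pixels. A minimal sketch (the particular features here, mean intensity and a crude edge score, are hypothetical):

```python
import math

def phi(x):
    # hypothetical hand-designed features for a 1D intensity vector
    mean = sum(x) / len(x)                               # average intensity
    edges = sum(abs(a - b) for a, b in zip(x, x[1:]))    # total adjacent-pixel change
    return [mean, edges]

def classify(theta, x):
    # logistic regression exactly as before, but applied to phi(x) instead of x
    z = sum(t * f for t, f in zip(theta, phi(x)))
    return 1.0 / (1.0 + math.exp(-z))
```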

Page 19:

Features

• Huge investment devoted to building application-specific feature representations.
   – Find higher-level patterns so that the final decision is easy to learn with an ML algorithm.

Object Bank [Li et al., 2010]   Super-pixels [Gould et al., 2008; Ren & Malik, 2003]
SIFT [Lowe, 1999]   Spin Images [Johnson & Hebert, 1999]

Page 20:

SUPERVISED DEEP LEARNING

Extension  to  neural  networks  

Page 21:

Basic idea

• We saw how to do supervised learning when the "features" φ(x) are fixed.
   – Let's extend to the case where features are given by tunable functions with their own parameters:

      f(x) = σ(wᵀ σ(Wx))

   The inputs to the outer function are "features", one for each row of W; the outer part of the function is the same as logistic regression.
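A sketch of this tunable-feature model (one hidden layer; the names W, w, h are mine, matching the description: W's rows are the learned feature detectors, and the outer sigmoid is the logistic-regression part):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(W, w, x):
    # one learned feature per row of W ...
    h = [sigmoid(sum(wij * xj for wij, xj in zip(row, x))) for row in W]
    # ... then ordinary logistic regression on those features
    return sigmoid(sum(wi * hi for wi, hi in zip(w, h)))
```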

Page 22:

Basic idea

• To do supervised learning for two-class classification, minimize:

      J(W, w) = −Σᵢ [ y(i) log f(x(i)) + (1 − y(i)) log(1 − f(x(i))) ]

• Same as logistic regression, but now f(x) has multiple stages ("layers", "modules"):

      x → intermediate representation ("features") → prediction for y

Page 23:

Neural network

• This model is a sigmoid "neural network":
   – The flow of computation is called "forward prop".
   – Each unit applying the sigmoid is a "neuron".

Page 24:

Neural network

• Can stack up several layers: the network must learn multiple stages of internal "representation".

Page 25:

Back-propagation

• Minimize:  J(W, w)
• To minimize J(W, w) we need the gradients with respect to all parameters.
   – Then use the gradient descent algorithm as before.
• The formula for the gradient of the outer layer can be found by hand (same as before); but what about W?

Page 26:

The Chain Rule

• Suppose we have a module that looks like:  z = g(Wx), with g applied elementwise.
• If we know ∂J/∂z and g′, the chain rule gives:

      ∂J/∂x = Wᵀ (g′(Wx) ⊙ ∂J/∂z)

   Similarly for W:

      ∂J/∂W = (g′(Wx) ⊙ ∂J/∂z) xᵀ

Ø Given the gradient with respect to the output, we can build a new "module" that finds the gradient with respect to the inputs.

(∂z/∂x is the Jacobian matrix.)
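A paired forward/backward "module" for z = g(Wx), with g the sigmoid, can be sketched as follows (my illustration of the chain-rule formulas above, not code from the slides):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(W, x):
    a = [sum(wij * xj for wij, xj in zip(row, x)) for row in W]  # a = Wx
    z = [sigmoid(ai) for ai in a]                                # z = g(a)
    return a, z

def backward(W, x, a, dJ_dz):
    # delta_i = g'(a_i) * dJ/dz_i ; for the sigmoid, g'(a) = g(a) * (1 - g(a))
    delta = [sigmoid(ai) * (1 - sigmoid(ai)) * d for ai, d in zip(a, dJ_dz)]
    # dJ/dx = W^T delta  (gradient w.r.t. the module's input)
    dJ_dx = [sum(W[i][j] * delta[i] for i in range(len(W))) for j in range(len(x))]
    # dJ/dW_ij = delta_i * x_j  (gradient w.r.t. the parameters)
    dJ_dW = [[di * xj for xj in x] for di in delta]
    return dJ_dx, dJ_dW
```

Chaining such backward modules, output to input, is exactly back-propagation.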

Page 27:

The Chain Rule

• Easy to build a toolkit of known rules to compute gradients, given the gradient at a module's output.
   – Automated differentiation!  E.g., Theano [Bergstra et al., 2010]

[Table: Function | Gradient w.r.t. input | Gradient w.r.t. parameters]

Page 28:

Back-propagation

• Can re-apply the chain rule to get gradients for all intermediate values and parameters.
   – "Backward" modules for each forward stage.

Page 29:

Example

• Given f(x), compute the gradient with respect to W, using several items from our table:

   [worked derivation not recoverable from the transcript]

Page 30:

Training Procedure

• Collect labeled training data.
   – For SGD: randomly shuffle after each epoch!
• For a batch of examples:
   – Compute the gradient w.r.t. all parameters in the network.
   – Make a small update to the parameters.
   – Repeat until convergence.
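The procedure above as a generic skeleton (the helper name compute_gradients is hypothetical and stands in for back-propagation):

```python
import random

def train(params, data, compute_gradients, lr=0.01, epochs=10, batch_size=32):
    """Generic SGD loop: shuffle each epoch, small update per batch."""
    for epoch in range(epochs):
        random.shuffle(data)                          # randomly shuffle after each epoch
        for i in range(0, len(data), batch_size):
            batch = data[i:i + batch_size]
            grads = compute_gradients(params, batch)  # gradient w.r.t. all parameters
            params = [p - lr * g for p, g in zip(params, grads)]  # small update
    return params
```

For example, with the toy objective J = Σ p², whose gradient is 2p, the loop shrinks the parameters toward zero.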

Page 31:

Training Procedure

• Historically, this has not worked so easily.
   – Non-convex: local minima; convergence criteria.
   – Optimization becomes difficult with many stages.
      • "Vanishing gradient problem"
   – Hard to diagnose and debug malfunctions.
• Many things turn out to matter:
   – Choice of nonlinearities.
   – Initialization of parameters.
   – Optimizer parameters: step size, schedule.

Page 32:

Nonlinearities

• The choice of functions inside the network matters.
   – The sigmoid function turns out to be difficult.
   – Some other choices are often used:

      tanh(z)        ReLU(z) = max{0, z}        abs(z)

      "Rectified Linear Unit" → increasingly popular.  [Nair & Hinton, 2010]
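The original slide plotted these three curves; written out, they are:

```python
import math

def tanh(z):
    return math.tanh(z)   # saturates at -1 and +1, zero-centered (unlike the sigmoid)

def relu(z):
    return max(0.0, z)    # "Rectified Linear Unit": zero for z < 0, identity above

def absval(z):
    return abs(z)         # absolute-value rectification
```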

Page 33:

Initialization

• Usually small random values.
   – Try to choose them so that the typical input to a neuron avoids saturating / non-differentiable areas.
   – Has to be random, otherwise every neuron will be equal!
   – Occasionally inspect units for saturation / blowup.
   – Larger values may give faster convergence, but worse models!
• Initialization schemes for particular units:
   – tanh units: Unif[−r, r];  sigmoid: Unif[−4r, 4r].  See [Glorot et al., AISTATS 2010]
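A sketch of such an initialization. The bound r = sqrt(6 / (fan_in + fan_out)) is the one proposed by Glorot et al.; treat this block as my paraphrase of the cited scheme, not the slides' exact recipe:

```python
import math
import random

def glorot_uniform(fan_in, fan_out, for_sigmoid=False):
    """Weight matrix (fan_out rows x fan_in cols) drawn from Unif[-r, r];
    sigmoid units use the 4x wider interval Unif[-4r, 4r]."""
    r = math.sqrt(6.0 / (fan_in + fan_out))
    if for_sigmoid:
        r *= 4.0
    return [[random.uniform(-r, r) for _ in range(fan_in)] for _ in range(fan_out)]
```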

   

Page 34:

Optimization: Step sizes

• Choose the SGD step size carefully.
   – Even a factor of ~2 can make a difference.
• Strategies:
   – Brute force: try many; pick the one with the best result.
   – Racing: pick the size with the best error on validation data after T steps.
      • Not always accurate if T is too small.
• Step size schedules:
   – Fixed: same step size throughout.
   – Step
   – Multi-step
   – Inverse

Bengio, 2012: "Practical Recommendations for Gradient-Based Training of Deep Architectures"
Hinton, 2010: "A Practical Guide to Training Restricted Boltzmann Machines"
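The listed schedules can be sketched as functions of the iteration count. The slide gives no formulas, so the parameterization below (gamma, power, milestone points, in the style popularized by Caffe) is my assumption:

```python
def fixed(base_lr, it):
    return base_lr                     # same step size throughout

def step(base_lr, it, gamma=0.1, step_size=1000):
    # drop by a factor gamma every step_size iterations
    return base_lr * (gamma ** (it // step_size))

def multi_step(base_lr, it, gamma=0.1, milestones=(1000, 5000)):
    # drop by gamma at each listed milestone
    return base_lr * (gamma ** sum(1 for m in milestones if it >= m))

def inverse(base_lr, it, gamma=1e-4, power=0.75):
    # smooth polynomial decay
    return base_lr * (1.0 + gamma * it) ** (-power)
```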

Page 35:

Optimization: Momentum

• "Smooth" estimate of the gradient from several steps of SGD:

      v ← μv − η ∇θ J(θ);   θ ← θ + v

• A little bit like second-order information.
   – High-curvature directions cancel out.
   – Low-curvature directions "add up" and accelerate.
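The classical momentum update in code (a common formulation with momentum coefficient mu and step size lr; the slide itself gives no explicit formula):

```python
def momentum_step(params, velocity, grads, lr=0.01, mu=0.9):
    """v <- mu*v - lr*g ; p <- p + v. Consistent gradient directions accumulate speed."""
    velocity = [mu * v - lr * g for v, g in zip(velocity, grads)]
    params = [p + v for p, v in zip(params, velocity)]
    return params, velocity

# with a constant gradient, the step magnitude grows toward lr / (1 - mu)
p, v = [0.0], [0.0]
steps = []
for _ in range(5):
    p, v = momentum_step(p, v, [1.0], lr=0.1, mu=0.9)
    steps.append(v[0])
```

The growing steps are the "low-curvature directions add up" effect from the slide; in oscillating directions the velocity terms cancel instead.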

Page 36:

Other factors

• A "weight decay" penalty can help.
   – Add a small penalty for the squared weight magnitude.
• For modest datasets, L-BFGS or other second-order methods are easier than SGD.
   – See, e.g.: Martens & Sutskever, ICML 2011.
   – Can crudely extend to the mini-batch case if batches are large.  [Le et al., ICML 2011]
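Weight decay adds λ‖θ‖² to the objective, so its gradient contribution simply shrinks every weight a little on each step. A sketch (the coefficient name lam is illustrative):

```python
def sgd_weight_decay(params, grads, lr=0.01, lam=1e-4):
    # gradient of the penalty lam * sum(p^2) is 2*lam*p for each weight
    return [p - lr * (g + 2.0 * lam * p) for p, g in zip(params, grads)]
```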

Page 37:

SUPERVISED DL FOR VISION

Application

Page 38:

Working with images

• Major factors:
   – Choose the functional form of the network to roughly match the computations we need to represent.
      • E.g., "selective" features and "invariant" features.
   – Try to exploit knowledge of images to accelerate training or improve performance.
• Generally try to avoid wiring detailed visual knowledge into the system; prefer to learn it.

Page 39:

Local connectivity

• Neural network view of a single neuron: an extremely large number of connections.
   → More parameters to train.
   → Higher computational expense.
   → Turns out not to be helpful in practice.

Page 40:

Local connectivity

• Reduce parameters with local connections.
   – The weight vector is a spatially localized "filter".

Page 41:

Local connectivity

• Sometimes think of neurons as viewing small adjacent windows.
   – Specify connectivity by the size ("receptive field" size) and spacing ("step" or "stride") of the windows.
      • Typical RF size = 3 to 20 pixels.
      • Typical step size = 1 pixel, up to the RF size.
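The receptive-field size and stride fully determine how many windows (and hence output positions) there are; a quick helper makes this concrete:

```python
def n_windows(input_size, rf_size, stride):
    """Number of receptive-field windows of size rf_size that fit in a 1D input,
    moving stride pixels at a time (no padding)."""
    assert rf_size <= input_size
    return (input_size - rf_size) // stride + 1

# e.g. a 1D input of 32 pixels with RF size 5 and stride 1 gives 28 output positions
```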

Page 42:

Local connectivity

• The spatial organization of filters means output features can also be organized like an image.
   – X, Y dimensions correspond to the X, Y position of the neuron's window.
   – "Channels" are different features extracted from the same spatial location.  (Also called "feature maps", or "maps".)

[1-dimensional example: a 1D input mapped to outputs indexed by X spatial location and "channel" (map) index.]

Page 43:

Local connectivity

Ø We can treat the output of a layer like an image and re-use the same tricks.

[1-dimensional example: the layer's output, indexed by X spatial location and "channel" (map) index, serves as the 1D input to the next layer.]

Page 44:

Weight-Tying

• Even with local connections, we may still have too many weights.
   – Trick: constrain some weights to be equal if we know that some parts of the input should learn the same kinds of features.
   – Images tend to be "stationary": different patches tend to have similar low-level structure.
Ø Constrain the weights used at different spatial positions to be equal.

Page 45:

Weight-Tying

Ø Before, we could have neurons with different weights at different locations; we can reduce parameters by making them equal.

[1-dimensional example: the same filter weights are applied at every X spatial location, producing one output "channel" (map) per filter.]

• Sometimes called a "convolutional" network: each unique filter is spatially convolved with the input to produce the responses for each map.  [LeCun et al., 1989; LeCun et al., 2004]
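A minimal sketch of weight tying in 1D: one filter's weights are reused at every spatial position, i.e. the filter slides over the input (implemented here as correlation, ignoring the kernel flip of textbook convolution) and yields one output map per filter:

```python
def conv1d(x, filters):
    """Apply each shared-weight filter at every valid position of the 1D input x.
    Returns one output 'map' per filter (stride 1, no padding)."""
    maps = []
    for w in filters:
        k = len(w)
        maps.append([sum(wj * xj for wj, xj in zip(w, x[i:i + k]))
                     for i in range(len(x) - k + 1)])
    return maps
```

For instance, the filter [1, -1] is a tiny edge detector: it responds only where adjacent input values differ.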

Page 46:

Pooling

• Functional layers designed to represent invariant features.
• Usually locally connected with specific nonlinearities.
   – Combined with convolution, this corresponds to hard-wired translation invariance.
• Usually fix the weights to a local box or Gaussian filter.
   – Easy to represent max-, average-, or 2-norm pooling.

[Scherer et al., ICANN 2010]  [Boureau et al., ICML 2010]
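Max pooling sketched in 1D (with stride equal to the pool width, a common choice; my illustration):

```python
def max_pool1d(x, width):
    # maximum over each non-overlapping window of `width` values
    return [max(x[i:i + width]) for i in range(0, len(x) - width + 1, width)]
```

Note the hard-wired invariance: shifting a strong response within its pooling window leaves the output unchanged.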

Page 47:

Batch Normalization

• Reduces covariate shift during training.
• Computed spatially over all samples in each batch (samples 1 … m, feature maps of size W × H):

      μ_B ← (1 / (W·H·m)) · Σ_{w,h,i} x_{w,h,i}

      σ²_B ← (1 / (W·H·m)) · Σ_{w,h,i} (x_{w,h,i} − μ_B)²

      x̂_{w,h,i} ← (x_{w,h,i} − μ_B) / √(σ²_B + ε)

      y_{w,h,i} ← γ·x̂_{w,h,i} + β ≡ BN_{γ,β}(x_{w,h,i})
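The four equations above, transcribed for one channel over a batch of 1D feature maps (gamma and beta are the learned scale and shift; eps guards against division by zero):

```python
def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """batch: list of samples, each a list of activations for one channel.
    Normalize over all samples and positions, then scale and shift."""
    values = [v for sample in batch for v in sample]
    n = len(values)
    mu = sum(values) / n                            # batch mean
    var = sum((v - mu) ** 2 for v in values) / n    # batch variance
    return [[gamma * (v - mu) / (var + eps) ** 0.5 + beta for v in sample]
            for sample in batch]
```

With the default gamma = 1, beta = 0, the normalized activations have (approximately) zero mean and unit variance across the batch.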

Page 48:

Application: ImageNet

• System from Krizhevsky et al., NIPS 2012:
   – Convolutional neural network.
   – Max-pooling.
   – Rectified linear units (ReLU).
   – Local response normalization.
   – Local connectivity.

Page 49:

Application: ImageNet

• Top result in LSVRC 2012: ~85% top-5 accuracy.

What's an Agaric!?

Page 50:

More applications

• Segmentation: predict classes of pixels / super-pixels.
   [Ciresan et al., NIPS 2012]  [Farabet et al., ICML 2012]
• Detection: combine classifiers with a sliding-window architecture.
   – Economical when used with convolutional nets.
   [Pierre Sermanet (2010):  http://www.youtube.com/watch?v=f9CuzqI1SkE]
• Robotic grasping.  [Lenz et al., RSS 2013]

Page 51:

Go

Page 52:

AlphaGo beat one of the best human players (Lee Sedol) 4–1 in March 2016 (win, win, win, loss, win). The games are available at https://en.wikipedia.org/wiki/AlphaGo_versus_Lee_Sedol

Page 53:

How does AlphaGo work?

(For all the details, see their paper: www.nature.com/nature/journal/v529/n7587/pdf/nature16961.pdf)

• Combines two techniques:
   – Monte Carlo Tree Search (MCTS)
      • Historical advantages of MCTS over minimax-search approaches:
         – Does not require an evaluation function.
         – Typically works better with large branching factors.
      • Improved Go programs by ~10 kyu around 2006.
   – Deep Learning
      • "Value networks" to evaluate board positions, and
      • "Policy networks" to select moves.

Page 54:

AlphaGo hardware

• Evaluating the policy and value networks requires several orders of magnitude more computation than traditional search heuristics.
• AlphaGo uses an asynchronous multi-threaded search that executes simulations on CPUs, and computes the policy and value networks in parallel on GPUs.
• The final version of AlphaGo used 40 search threads, 48 CPUs, and 8 GPUs.
• (They also implemented a distributed version.)

Page 55:

Speech Recognition

A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates, A. Ng, "Deep Speech: Scaling up end-to-end speech recognition", arXiv:1412.5567v2, 2014.

Input: acoustic features (spectrograms)
Output: characters (and space) + null (~)
No phoneme and lexicon models (no OOV problem)

Page 56:

Resources

Tutorials:
Stanford Deep Learning tutorial:  http://ufldl.stanford.edu/wiki
Deep Learning tutorials list:  http://deeplearning.net/tutorials
IPAM DL/UFL Summer School:  http://www.ipam.ucla.edu/programs/gss2012/
ICML 2012 Representation Learning Tutorial:  http://www.iro.umontreal.ca/~bengioy/talks/deep-learning-tutorial-2012.html

Page 57:

References

http://www.stanford.edu/~acoates/bmvc2013refs.pdf

Overviews:
Yoshua Bengio, "Practical Recommendations for Gradient-Based Training of Deep Architectures"
Yoshua Bengio & Yann LeCun, "Scaling Learning Algorithms towards AI"
Yoshua Bengio, Aaron Courville & Pascal Vincent, "Representation Learning: A Review and New Perspectives"

Software:
Theano GPU library:  http://deeplearning.net/software/theano
SPAMS toolkit:  http://spams-devel.gforge.inria.fr/