Introduction to Deep Learning

  • Deep Learning

  • DeepFace: Closing the Gap to Human-Level Performance in Face Verification

  • Perceptrons

    One of the earliest supervised training algorithms is that of the perceptron, a basic neural network building block.

    transfer function: $z = \sum_i w_i x_i + b$

    activation function: $y = 1$ if $z > 0$, else $y = 0$
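
    A minimal NumPy sketch of that computation; the AND-gate weights and bias are made-up values for illustration only:

```python
import numpy as np

def perceptron(x, w, b):
    # transfer function: weighted sum of the inputs plus a bias
    z = np.dot(w, x) + b
    # activation function: output 1 if the sum is positive, 0 otherwise
    return 1 if z > 0 else 0

# made-up weights and bias that happen to implement logical AND
w = np.array([1.0, 1.0])
b = -1.5
print(perceptron(np.array([1, 1]), w, b))  # 1
print(perceptron(np.array([1, 0]), w, b))  # 0
```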

  • Drawbacks

    The single perceptron has one major drawback: it can only learn linearly separable functions. How major is this drawback? Take XOR, a relatively simple function, and notice that it can't be classified by a linear separator (see the sketch below).
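
    A quick, non-rigorous illustration of that claim: a brute-force search over a coarse, made-up grid of weights and biases never reproduces XOR with a single perceptron.

```python
import numpy as np
from itertools import product

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_xor = [0, 1, 1, 0]

def perceptron(x, w, b):
    return 1 if np.dot(w, x) + b > 0 else 0

# try every combination of weights and bias on a coarse grid;
# none of them reproduces the XOR truth table
grid = np.linspace(-2.0, 2.0, 21)
found = any(
    [perceptron(x, np.array([w1, w2]), b) for x in X] == y_xor
    for w1, w2, b in product(grid, grid, grid)
)
print(found)  # False: no linear separator classifies XOR
```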

  • Multilayer networks

    Multilayer networks could learn complicated things, and they did - but very slowly. What emerged from this second neural network revolution was that we had a good theory, but learning was slow and the results, while good, were not amazing.

    The real question, which received very little attention for such an important one, was: why don't multilayer networks learn?

  • Feedforward Neural Networks

  • If each of our perceptrons is only allowed to use a linear activation function, then the final output of our network will still be some linear function of the inputs, just adjusted with a ton of different weights that it has collected throughout the network.

    A linear composition of a bunch of linear functions is still just a linear function, so most neural networks use non-linear activation functions (see the sketch below).
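
    A small sketch of that point, assuming sigmoid as the non-linearity and random, made-up weight shapes: stacking purely linear layers collapses to a single linear map, while inserting a non-linearity between them does not.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 2))   # first layer weights (made-up shapes)
W2 = rng.normal(size=(1, 3))   # second layer weights
x = rng.normal(size=2)

# with purely linear activations, two layers collapse into one linear map
two_linear_layers = W2 @ (W1 @ x)
single_linear_map = (W2 @ W1) @ x
print(np.allclose(two_linear_layers, single_linear_map))   # True

# a non-linearity between the layers breaks that equivalence
two_nonlinear_layers = W2 @ sigmoid(W1 @ x)
print(np.allclose(two_nonlinear_layers, single_linear_map))  # False (in general)
```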

  • Hidden layers

    A single hidden layer is powerful enough to learn any function.

    In practice, however, we often learn better with multiple hidden layers: deeper networks.

  • The Problem with Large Networks

    The problem is that it is fairly easy to create things that behave like neurons, the brain's major component. What is not easy is working out what the whole thing does once you have assembled it.

    Why don't multilayer networks learn? It all had to do with the way the training errors were being passed back from the output layer to the deeper layers of artificial neurons.

  • Vanishing gradient

    The "vanishing gradient" problem meant that as soon as a neural network got reasonably good at a task, the lower layers didn't really get any information about how to change to help do the task better.

    This is because the error in a layer gets "split up" and partitioned out to each unit in that layer, which in turn further reduces the gradient in the layers below (a numerical illustration follows below).
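
    A rough numerical illustration, assuming sigmoid activations (the depths below are arbitrary): the sigmoid derivative is at most 0.25, so the error signal can shrink roughly geometrically as it is passed down through the layers.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# the sigmoid derivative sigmoid(z) * (1 - sigmoid(z)) peaks at 0.25,
# so each extra layer multiplies the backpropagated error by a factor <= 0.25
max_deriv = sigmoid(0.0) * (1.0 - sigmoid(0.0))   # 0.25
for depth in (1, 5, 10, 20):
    print(depth, max_deriv ** depth)   # the signal reaching the lowest layers shrinks fast
```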

  • Overfitting

  • Autoencoders

    A feedforward neural network which typically aims to learn a compressed, distributed representation (encoding) of a dataset.

  • What .auto..what !!??

    The intuition behind this architecture is that the network will not learn a mapping between the training data and its labels, but will instead learn the internal structure and features of the data itself. (Because of this, the hidden layer is also called a feature detector.)

  • So what !

    Usually, the number of hidden units is smaller than the input/output layers, which forces the network to learn only the most important features and achieves a dimensionality reduction.

    Dimensionality reduction = feature selection and feature extraction.

    We're attempting to learn the data in a truer sense (see the sketch below).
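
    A minimal sketch of such an autoencoder: a single hidden layer smaller than the input, trained by plain gradient descent to reconstruct toy data. All sizes, the learning rate, and the iteration count are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.random((100, 8))                         # toy dataset: 100 samples, 8 features

n_hidden = 3                                     # fewer hidden units than inputs
W1 = rng.normal(scale=0.1, size=(8, n_hidden)); b1 = np.zeros(n_hidden)   # encoder
W2 = rng.normal(scale=0.1, size=(n_hidden, 8)); b2 = np.zeros(8)          # decoder

lr = 0.5
for _ in range(1000):
    H = sigmoid(X @ W1 + b1)                     # compressed representation (the "code")
    X_hat = sigmoid(H @ W2 + b2)                 # reconstruction of the input
    err = X_hat - X                              # the target is the input itself

    # backpropagate the squared reconstruction error
    d_out = err * X_hat * (1 - X_hat)
    d_hid = (d_out @ W2.T) * H * (1 - H)
    W2 -= lr * H.T @ d_out / len(X); b2 -= lr * d_out.mean(axis=0)
    W1 -= lr * X.T @ d_hid / len(X); b1 -= lr * d_hid.mean(axis=0)

print(np.mean((sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2) - X) ** 2))  # reconstruction MSE
```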

  • Restricted Boltzmann Machines

    Classical factor analysis, an example: suppose you ask a bunch of users to rate a set of movies on a 0-100 scale. In classical factor analysis, you could then try to explain each movie and user in terms of a set of latent factors.

  • Restricted Boltzmann Machines

    Restricted Boltzmann Machines essentially perform a binary version of factor analysis. Instead of users rating a set of movies on a continuous scale, they simply tell you whether they like a movie or not, and the RBM will try to discover latent factors that can explain the activation of these movie choices.

    A Restricted Boltzmann Machine is a stochastic neural network (stochastic meaning these activations have a probabilistic element).

  • Restricted Boltzmann Machines

    RBMs are composed of a hidden and a visible layer. Unlike the feedforward networks, the connections between the visible and hidden layers are undirected (the values can be propagated in both the visible-to-hidden and hidden-to-visible directions).

  • Contrastive divergence - training

    Positive phase: an input sample v is clamped to the input layer.

    v is propagated to the hidden layer in a similar manner to the feedforward networks.

    The result of the hidden layer activations is h.

    Positive statistics: $\mathrm{Positive}(e_{ij}) = v_i \cdot h_j$

    Negative phase: propagate h back to the visible layer with result v'. Propagate the new v' back to the hidden layer, with activation result h'.

    Negative statistics: $\mathrm{Negative}(e_{ij}) = v'_i \cdot h'_j$

    Weight update: $w_{ij}^{\text{new}} = w_{ij}^{\text{old}} + \alpha \, (\mathrm{Positive}(e_{ij}) - \mathrm{Negative}(e_{ij}))$, where $\alpha$ is the learning rate.
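
    A rough NumPy sketch of one CD-1 update, assuming binary units and omitting the bias terms for brevity; the layer sizes and the sample are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_update(v0, W, lr=0.1):
    # positive phase: clamp the data v0, compute hidden probabilities and sample h
    p_h0 = sigmoid(v0 @ W)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
    positive = np.outer(v0, p_h0)                 # Positive(e_ij) = v_i * h_j

    # negative phase: reconstruct v', then recompute the hidden probabilities h'
    p_v1 = sigmoid(h0 @ W.T)
    v1 = (rng.random(p_v1.shape) < p_v1).astype(float)
    p_h1 = sigmoid(v1 @ W)
    negative = np.outer(v1, p_h1)                 # Negative(e_ij) = v'_i * h'_j

    # weight update: w_ij <- w_ij + lr * (Positive - Negative)
    return W + lr * (positive - negative)

# made-up RBM with 6 visible and 2 hidden units, updated on one binary sample
W = rng.normal(scale=0.1, size=(6, 2))
v = np.array([1, 1, 1, 0, 0, 0], dtype=float)
W = cd1_update(v, W)
```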

  • Example

  • Why deep learning now:

    What's different is that we can run very large and very deep networks on fast GPUs (sometimes with billions of connections and 12 layers) and train them on large datasets with millions of examples.

  • What is wrong with back-propagation?

    It requires labeled training data, but almost all data is unlabeled.

    The learning time does not scale well: it is very slow in networks with multiple hidden layers.

    It can get stuck in poor local optima. These are often quite good, but for deep nets they are far from optimal.

  • Training a deep network

    First train a layer of features that receive input directly from the pixels.

    Then treat the activations of the trained features as if they were pixels and learn features of features in a second hidden layer.

    Do it again (a sketch of the whole procedure follows below).

    It can be proved (we're not going to do it!) that each time we add another layer of features, we improve a variational lower bound on the log probability of generating the training data.
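
    A rough sketch of the greedy layer-by-layer procedure, using a simplified mean-field CD-1 step (probabilities instead of samples) and made-up layer sizes and data:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
layer_sizes = [784, 500, 250]            # e.g. pixels -> features -> features of features
X = rng.random((64, layer_sizes[0]))     # toy batch standing in for pixel data
lr = 0.1

weights = []
layer_input = X
for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
    W = rng.normal(scale=0.1, size=(n_in, n_out))
    # train this layer as an RBM with a few simplified (mean-field) CD-1 sweeps
    for _ in range(10):
        p_h = sigmoid(layer_input @ W)                                   # positive phase
        v1 = sigmoid(p_h @ W.T)                                          # reconstruction
        p_h1 = sigmoid(v1 @ W)                                           # negative phase
        W += lr * (layer_input.T @ p_h - v1.T @ p_h1) / len(layer_input)
    weights.append(W)
    # treat the trained features as if they were pixels for the next layer
    layer_input = sigmoid(layer_input @ W)
```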

  • Who is working on deep learning?

  • Geoffrey Hinton

    He is the co-inventor of the backpropagation and contrastive divergence training algorithms and is an important figure in the deep learning movement.

    "I get very excited when we discover a way of making neural networks better and when that's closely related to how the brain works."

    Geoffrey Hinton

    Google has hired him along with two of his University of Toronto graduate students.

  • Yann LeCun

    A computer science researcher with contributions in machine learning, known for his work on optical character recognition and computer vision using convolutional neural networks.

    He has been much in the news lately as one of the leading experts in Deep Learning.

    Facebook has created a new research laboratory with the ambitious, long-term goal of bringing about major advances in Artificial Intelligence.

    Director of AI Research, Facebook

  • Andrew Ng

    On May 16, 2014, Ng announced on his Coursera blog that he would be stepping away from his day-to-day responsibilities at Coursera and joining Baidu as Chief Scientist, working on the Baidu Brain project.

    Coursera co-founder

  • Facts:

    At some point in the late 1990s, a ConvNet-based system was reading 10 to 20% of all the checks in the US.

    ConvNets are now widely used by Facebook, Google, Microsoft, IBM, Baidu, NEC and others for image and speech recognition.

  • Example

    Create an algorithm to distinguish dogs from cats. In this competition, you'll write an algorithm to classify whether images contain either a dog or a cat.

    A student of Yann LeCun recently won the Dogs vs. Cats competition using a version of ConvNet, achieving 98.9% accuracy.

  • ImageNet LSVRC-2010 contest

    The best system in the 2010 competition got 47% error for its first choice and 25% error for its top 5 choices.

    The task: classify 1.2 million high-resolution images into 1000 different classes.

    A very deep neural net (Krizhevsky et al., 2012) gets less than 40% error for its first choice and less than 20% for its top 5 choices.

  • The Speech Recognition Task (Mohamed, Dahl, & Hinton, 2012)

    Deep neural networks pioneered by George Dahl and Abdel-rahman Mohamed are now replacing the previous machine learning method for the acoustic model.

    After standard processing, a deep net with 8 layers gets 20.7% error rate.

    The best previous speaker-independent result was 24.4%, and this required averaging several models.

  • Happy deep day
