Introduction to Deep Learning

  • Deep Learning

  • DeepFace: Closing the Gap to Human-Level Performance in Face Verification

  • Perceptrons

    One of the earliest supervised training algorithms is that of the perceptron, a basic neural network building block.

    transfer function: $z = \sum_i w_i x_i + b$

    activation function: $y = 1$ if $z > 0$, else $y = 0$
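
    A minimal NumPy sketch of that computation; the AND-gate weights and bias are made-up values for illustration only:

```python
import numpy as np

def perceptron(x, w, b):
    # transfer function: weighted sum of the inputs plus a bias
    z = np.dot(w, x) + b
    # activation function: output 1 if the sum is positive, 0 otherwise
    return 1 if z > 0 else 0

# made-up weights and bias that happen to implement logical AND
w = np.array([1.0, 1.0])
b = -1.5
print(perceptron(np.array([1, 1]), w, b))  # 1
print(perceptron(np.array([1, 0]), w, b))  # 0
```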

  • Drawbacks

    The single perceptron has one major drawback: it can only learn linearly separable functions. How major is this drawback? Take XOR, a relatively simple function, and notice that it can't be classified by a linear separator (see the sketch below).
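
    A quick, non-rigorous illustration of that claim: a brute-force search over a coarse, made-up grid of weights and biases never reproduces XOR with a single perceptron.

```python
import numpy as np
from itertools import product

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_xor = [0, 1, 1, 0]

def perceptron(x, w, b):
    return 1 if np.dot(w, x) + b > 0 else 0

# try every combination of weights and bias on a coarse grid;
# none of them reproduces the XOR truth table
grid = np.linspace(-2.0, 2.0, 21)
found = any(
    [perceptron(x, np.array([w1, w2]), b) for x in X] == y_xor
    for w1, w2, b in product(grid, grid, grid)
)
print(found)  # False: no linear separator classifies XOR
```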

  • Multilayer networks

    Multilayer networks could learn complicated things, and they did - but very slowly. What emerged from this second neural network revolution was that we had a good theory, but learning was slow and the results, while good, were not amazing.

    The real question, which received very little attention for such an important one, was: why don't multilayer networks learn?

  • Feedforward Neural Networks

  • If each of our perceptrons is only allowed to use a linear activation function, then the final output of our network will still be some linear function of the inputs, just adjusted with a ton of different weights that it has collected throughout the network.

    A linear composition of a bunch of linear functions is still just a linear function, so most neural networks use non-linear activation functions (see the sketch below).
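
    A small sketch of that point, assuming sigmoid as the non-linearity and random, made-up weight shapes: stacking purely linear layers collapses to a single linear map, while inserting a non-linearity between them does not.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 2))   # first layer weights (made-up shapes)
W2 = rng.normal(size=(1, 3))   # second layer weights
x = rng.normal(size=2)

# with purely linear activations, two layers collapse into one linear map
two_linear_layers = W2 @ (W1 @ x)
single_linear_map = (W2 @ W1) @ x
print(np.allclose(two_linear_layers, single_linear_map))   # True

# a non-linearity between the layers breaks that equivalence
two_nonlinear_layers = W2 @ sigmoid(W1 @ x)
print(np.allclose(two_nonlinear_layers, single_linear_map))  # False (in general)
```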

  • Hidden layers

    A single hidden layer is powerful enough to learn any function.

    In practice, however, we often learn better with multiple hidden layers: deeper networks.

  • The Problem with Large Networks

    The problem is that it is fairly easy to create things that behave like neurons, the brain's major component. What is not easy is working out what the whole thing does once you have assembled it.

    Why don't multilayer networks learn? It all had to do with the way the training errors were being passed back from the output layer to the deeper layers of artificial neurons.

  • Vanishing gradient

    The "vanishing gradient" problem meant that as soon as a neural network got reasonably good at a task, the lower layers didn't really get any information about how to change to help do the task better.

    This is because the error in a layer gets "split up" and partitioned out to each unit in that layer, which in turn further reduces the gradient in the layers below (a numerical illustration follows below).
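
    A rough numerical illustration, assuming sigmoid activations (the depths below are arbitrary): the sigmoid derivative is at most 0.25, so the error signal can shrink roughly geometrically as it is passed down through the layers.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# the sigmoid derivative sigmoid(z) * (1 - sigmoid(z)) peaks at 0.25,
# so each extra layer multiplies the backpropagated error by a factor <= 0.25
max_deriv = sigmoid(0.0) * (1.0 - sigmoid(0.0))   # 0.25
for depth in (1, 5, 10, 20):
    print(depth, max_deriv ** depth)   # the signal reaching the lowest layers shrinks fast
```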

  • Overfitting

  • Autoencoders

    A feedforward neural network which typically aims to learn a compressed, distributed representation (encoding) of a dataset.

  • What .auto..what !!??

    The intuition behind this architecture is that the network will not learn a mapping between the training data and its labels, but will instead learn the internal structure and features of the data itself. (Because of this, the hidden layer is also called a feature detector.)

  • So what !

    Usually, the number of hidden units is smaller than the input/output layers, which forces the network to learn only the most important features and achieves a dimensionality reduction.

    Dimensionality reduction = feature selection and feature extraction.

    We're attempting to learn the data in a truer sense (see the sketch below).
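
    A minimal sketch of such an autoencoder: a single hidden layer smaller than the input, trained by plain gradient descent to reconstruct toy data. All sizes, the learning rate, and the iteration count are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.random((100, 8))                         # toy dataset: 100 samples, 8 features

n_hidden = 3                                     # fewer hidden units than inputs
W1 = rng.normal(scale=0.1, size=(8, n_hidden)); b1 = np.zeros(n_hidden)   # encoder
W2 = rng.normal(scale=0.1, size=(n_hidden, 8)); b2 = np.zeros(8)          # decoder

lr = 0.5
for _ in range(1000):
    H = sigmoid(X @ W1 + b1)                     # compressed representation (the "code")
    X_hat = sigmoid(H @ W2 + b2)                 # reconstruction of the input
    err = X_hat - X                              # the target is the input itself

    # backpropagate the squared reconstruction error
    d_out = err * X_hat * (1 - X_hat)
    d_hid = (d_out @ W2.T) * H * (1 - H)
    W2 -= lr * H.T @ d_out / len(X); b2 -= lr * d_out.mean(axis=0)
    W1 -= lr * X.T @ d_hid / len(X); b1 -= lr * d_hid.mean(axis=0)

print(np.mean((sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2) - X) ** 2))  # reconstruction MSE
```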

  • Restricted Boltzmann Machines

    Classical factor analysis, an example: suppose you ask a bunch of users to rate a set of movies on a 0-100 scale. In classical factor analysis, you could then try to explain each movie and user in terms of a set of latent factors.

  • Restricted Boltzmann Machines

    Restricted Boltzmann Machines essentially perform a binary version of factor analysis. Instead of users rating a set of movies on a continuous scale, they simply tell you whether they like a movie or not, and the RBM will try to discover latent factors that can explain the activation of these movie choices.

    A Restricted Boltzmann Machine is a stochastic neural network (stochastic meaning these activations have a probabilistic element).

  • Restricted Boltzmann Machines

    RBMs are composed of a hidden and a visible layer. Unlike the feedforward networks, the connections between the visible and hidden layers are undirected (the values can be propagated in both the visible-to-hidden and hidden-to-visible directions).

  • Contrastive divergence - training

    Positive phase: an input sample v is clamped to the input layer.

    v is propagated to the hidden layer in a similar manner to the feedforward networks.

    The result of the hidden layer activations is h.

    Positive statistics: $\mathrm{Positive}(e_{ij}) = v_i \cdot h_j$

    Negative phase: propagate h back to the visible layer with result v'. Propagate the new v' back to the hidden layer, with activation result h'.

    Negative statistics: $\mathrm{Negative}(e_{ij}) = v'_i \cdot h'_j$

    Weight update: $w_{ij}^{\text{new}} = w_{ij}^{\text{old}} + \alpha \, (\mathrm{Positive}(e_{ij}) - \mathrm{Negative}(e_{ij}))$, where $\alpha$ is the learning rate.
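
    A rough NumPy sketch of one CD-1 update, assuming binary units and omitting the bias terms for brevity; the layer sizes and the sample are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_update(v0, W, lr=0.1):
    # positive phase: clamp the data v0, compute hidden probabilities and sample h
    p_h0 = sigmoid(v0 @ W)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
    positive = np.outer(v0, p_h0)                 # Positive(e_ij) = v_i * h_j

    # negative phase: reconstruct v', then recompute the hidden probabilities h'
    p_v1 = sigmoid(h0 @ W.T)
    v1 = (rng.random(p_v1.shape) < p_v1).astype(float)
    p_h1 = sigmoid(v1 @ W)
    negative = np.outer(v1, p_h1)                 # Negative(e_ij) = v'_i * h'_j

    # weight update: w_ij <- w_ij + lr * (Positive - Negative)
    return W + lr * (positive - negative)

# made-up RBM with 6 visible and 2 hidden units, updated on one binary sample
W = rng.normal(scale=0.1, size=(6, 2))
v = np.array([1, 1, 1, 0, 0, 0], dtype=float)
W = cd1_update(v, W)
```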

  • Example

  • Why deep learning now:

    What's different is that we can run very large and very deep networks on fast GPUs (sometimes with billions of connections and 12 layers) and train them on large datasets with millions of examples.

  • What is wrong with back-propagation?

    It requires labeled training data, but almost all data is unlabeled.

    The learning time does not scale well: it is very slow in networks with multiple hidden layers.

    It can get stuck in poor local optima. These are often quite good, but for deep nets they are far from optimal.

  • Training a deep network

    First train a layer of features that receive input directly from the pixels.

    Then treat the activations of the trained features as if they were pixels and learn features of features in a second hidden layer.

    Do it again (a sketch of the whole procedure follows below).

    It can be proved (we're not going to do it!) that each time we add another layer of features, we improve a variational lower bound on the log probability of generating the training data.
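
    A rough sketch of the greedy layer-by-layer procedure, using a simplified mean-field CD-1 step (probabilities instead of samples) and made-up layer sizes and data:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
layer_sizes = [784, 500, 250]            # e.g. pixels -> features -> features of features
X = rng.random((64, layer_sizes[0]))     # toy batch standing in for pixel data
lr = 0.1

weights = []
layer_input = X
for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
    W = rng.normal(scale=0.1, size=(n_in, n_out))
    # train this layer as an RBM with a few simplified (mean-field) CD-1 sweeps
    for _ in range(10):
        p_h = sigmoid(layer_input @ W)                                   # positive phase
        v1 = sigmoid(p_h @ W.T)                                          # reconstruction
        p_h1 = sigmoid(v1 @ W)                                           # negative phase
        W += lr * (layer_input.T @ p_h - v1.T @ p_h1) / len(layer_input)
    weights.append(W)
    # treat the trained features as if they were pixels for the next layer
    layer_input = sigmoid(layer_input @ W)
```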

  • Who is working on deep learning?

  • Geoffrey Hinton

    He is the co-inventor of the backpropagation and contrastive divergence training algorithms and is an important figure in the deep learning movement.

    "I get very excited when we discover a way of making neural networks better and when that's closely related to how the brain works."

    Geoffrey Hinton

    Google has hired him along with two of his University of Toronto graduate students.

  • Yann LeCun

    A computer science researcher with contributions in machine learning, known for his work on optical character recognition and computer vision using convolutional neural networks.

    He has been much in the news lately as one of the leading experts in Deep Learning.

    Facebook has created a new research laboratory with the ambitious, long-term goal of bringing about major advances in Artificial Intelligence.

    Director of AI Research, Facebook

  • Andrew Ng

    On May 16, 2014, Ng announced on his Coursera blog that he would be stepping away from his day-to-day responsibilities at Coursera and joining Baidu as Chief Scientist, working on the Baidu Brain project.

    Coursera co-founder

  • Facts:

    At some point in the late 1990s, a ConvNet-based system was reading 10 to 20% of all the checks in the US.

    ConvNets are now widely used by Facebook, Google, Microsoft, IBM, Baidu, NEC and others for image and speech recognition.

  • Example

    Create an algorithm to distinguish dogs from cats. In this competition, you'll write an algorithm to classify whether images contain either a dog or a cat.

    A student of Yann LeCun recently won the Dogs vs. Cats competition using a version of ConvNet, achieving 98.9% accuracy.

  • ImageNet LSVRC-2010 contest

    The best system in the 2010 competition got 47% error for its first choice and 25% error for its top 5 choices.

    The task: classify 1.2 million high-resolution images into 1000 different classes.

    A very deep neural net (Krizhevsky et al., 2012) gets less than 40% error for its first choice and less than 20% for its top 5 choices.

  • The Speech Recognition Task (Mohamed, Dahl, & Hinton, 2012)

    Deep neural networks pioneered by George Dahl and Abdel-rahman Mohamed are now replacing the previous machine learning method for the acoustic model.

    After standard processing, a deep net with 8 layers gets 20.7% error rate.

    The best previous speaker-independent result was 24.4%, and this required averaging several models.

  • Happy deep day
