let’s learn deep

Post on 11-Feb-2017









Some Interesting Results

Image Source: http://colah.github.io/posts/2014-07-NLP-RNNs-Representations/

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. & Dean, J. Distributed representations of words and phrases and their compositionality. In Proc. Advances in Neural Information Processing Systems 26 3111–3119 (2013).

LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature (2015).


Zou, Will Y., et al. "Bilingual Word Embeddings for Phrase-Based Machine Translation." EMNLP. 2013.

Paraphrase Detection

Socher, Richard, et al. "Dynamic pooling and unfolding recursive autoencoders for paraphrase detection." Advances in Neural Information Processing Systems. 2011.

Socher, Richard; Perelygin, Alex; Wu, Jean Y.; Chuang, Jason; Manning, Christopher D.; Ng, Andrew Y.; Potts, Christopher. "Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank." EMNLP 2013.


Why Neural Networks?

- The perceptron algorithm can ALWAYS learn to classify linearly separable samples.

- BUT, how to tackle non-linearity?


- Add a non-linear transform to the data

- ANNs with one hidden layer can approximate any continuous function on a compact domain [1,2]

- Can be trained through BACKPROPAGATION

http://cs231n.github.io/neural-networks-1/

[1] Cybenko, George. "Approximation by superpositions of a sigmoidal function." Mathematics of Control, Signals and Systems 2.4 (1989): 303-314.

[2] http://neuralnetworksanddeeplearning.com/chap4.html
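The points above can be illustrated with a minimal sketch in plain Python. It assumes labels in {+1, -1} and uses XOR as the non-linearly-separable example; the product feature `x1 * x2` is a hypothetical hand-picked non-linear transform, not something taken from the slides.

```python
def train_perceptron(samples, epochs=1000):
    """Classic perceptron rule; labels are +1 / -1.

    Returns (weights, bias, converged), where converged means one full
    pass over the data produced no mistakes.
    """
    n = len(samples[0][0])
    w, b = [0.0] * n, 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in samples:
            activation = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * activation <= 0:  # misclassified, or exactly on the boundary
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                mistakes += 1
        if mistakes == 0:
            return w, b, True
    return w, b, False


# XOR is not linearly separable in the raw 2-D input space.
xor = [((0, 0), -1), ((0, 1), 1), ((1, 0), 1), ((1, 1), -1)]
_, _, converged_raw = train_perceptron(xor)

# Assumed non-linear transform: append the product x1 * x2 as a third feature.
xor_lifted = [((x1, x2, x1 * x2), y) for (x1, x2), y in xor]
w, b, converged_lifted = train_perceptron(xor_lifted)

print(converged_raw, converged_lifted)
```

A hidden layer plays the role of the hand-written transform here: it learns the non-linear features instead of having them supplied.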

A simple Neural Network



$loss = H(f(W, X), Y)$

$\text{log loss} = -\sum y \cdot \log(f(W, X))$

$\text{hinge loss} = \sum \max(0,\ 1 - f(W, X) \cdot y)$
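As a rough sketch, the two losses on this slide might be written as follows. It assumes the log loss is the negated sum (cross-entropy) over one-hot targets, and that the hinge loss takes raw scores with labels in {+1, -1}; the function names are illustrative.

```python
import math


def log_loss(probs, onehot):
    """Cross-entropy: -sum(y * log(f(W, X))) over samples and classes.

    probs  : list of per-sample probability vectors, i.e. f(W, X)
    onehot : list of per-sample one-hot target vectors, i.e. y
    """
    total = 0.0
    for p_row, y_row in zip(probs, onehot):
        total -= sum(y * math.log(p) for p, y in zip(p_row, y_row) if y > 0)
    return total


def hinge_loss(scores, labels):
    """sum(max(0, 1 - f(W, X) * y)) with labels y in {+1, -1}."""
    return sum(max(0.0, 1.0 - s * y) for s, y in zip(scores, labels))


print(log_loss([[0.5, 0.5], [0.25, 0.75]], [[1, 0], [0, 1]]))
print(hinge_loss([2.0, 0.5, -1.0], [1, 1, 1]))
```

The hinge loss is zero for any sample scored on the correct side with a margin of at least 1, while the log loss keeps rewarding higher confidence in the correct class.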

Train it through back propagation

$W_t = W_{t-1} - l \cdot \frac{\partial\, loss(W)}{\partial W}$
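The update rule can be sketched on a one-weight model. This is a minimal illustration, assuming f(W, X) = W * X, a mean squared error loss, and toy data generated from y = 2x; none of these choices come from the slides.

```python
def gradient(w, xs, ys):
    """d(loss)/dW for loss = mean((w * x - y) ** 2)."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]  # toy data, y = 2x
w, lr = 0.0, 0.1                           # lr plays the role of l in the update rule

for _ in range(50):
    w = w - lr * gradient(w, xs, ys)       # W_t = W_{t-1} - l * d(loss)/dW

print(w)  # approaches 2.0
```

Backpropagation is what supplies `gradient` for a multi-layer network: it applies the chain rule layer by layer to obtain the derivative of the loss with respect to every weight.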


Types of ANN: Vanilla Feed Forward NN


Hinton, Geoffrey E. "Learning distributed representations of concepts." Proceedings of the Eighth Annual Conference of the Cognitive Science Society. Vol. 1. 1986.



Collobert, Ronan, et al. "Natural language processing (almost) from scratch." The Journal of Machine Learning Research 12 (2011): 2493-2537.

Example of multitasking with NN. Task 1 and Task 2 are two tasks trained with the window approach architecture presented in Figure 1. Lookup tables as well as the first hidden layer are shared. The last layer is task specific. The principle is the same with more than two tasks.

AI Question Answering

Counting

Compound Coreference

Factoid Q/A with supporting facts

Weston J, Bordes A, Chopra S, Mikolov T. Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks. arXiv:1502.05698; 2015.

Reasoning about an agent's motivation

Bordes A, Usunier N, Chopra S, Weston J. Large-scale Simple Question Answering with Memory Networks. arXiv. 2015.

Weston J, Chopra S, Bordes A. Memory Networks. In: International Conference on Learning Representations.; 2015:1-14. http://arxiv.org/abs/1410.3916.

20 tasks in total. One system should solve all of them, with no task-specific systems. A Memory Network is used to solve these tasks. Accuracy of ~42% beats the older benchmarks.


Types of ANN: Recurrent NN


Learn sequential structures such as sequences of characters, words or audio signals.
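A minimal sketch of how a recurrent network reads a sequence, assuming the common update h_t = tanh(W_x x_t + W_h h_{t-1}); the weight values and the two-symbol one-hot inputs are made up for illustration.

```python
import math


def rnn_forward(inputs, W_x, W_h, h0):
    """Run a vanilla RNN over a sequence and return the final hidden state.

    W_x has shape (hidden, input); W_h has shape (hidden, hidden).
    """
    h = h0
    for x in inputs:
        pre = [
            sum(wx * xi for wx, xi in zip(W_x[j], x))
            + sum(wh * hi for wh, hi in zip(W_h[j], h))
            for j in range(len(h))
        ]
        h = [math.tanh(p) for p in pre]  # the same weights are reused at every step
    return h


# Hypothetical weights and a two-symbol alphabet encoded as one-hot vectors.
W_x = [[1.0, -1.0], [0.5, 0.5]]
W_h = [[0.5, 0.0], [0.0, 0.5]]
a, b = [1.0, 0.0], [0.0, 1.0]

state_ab = rnn_forward([a, b], W_x, W_h, h0=[0.0, 0.0])
state_ba = rnn_forward([b, a], W_x, W_h, h0=[0.0, 0.0])
print(state_ab, state_ba)
```

The two final states differ even though both sequences contain the same symbols, which is what lets the network model order.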

Types of ANN: Recursive NN


From Machine Learning to Machine Reasoning Léon Bottou

Learn arbitrary structures like parse trees.
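One way to picture a network over a parse tree is a single composition function applied bottom-up. The sketch below assumes binary trees written as nested tuples, the composition parent = tanh(W [left; right]), and made-up 2-D word vectors and weights.

```python
import math


def compose(left, right, W):
    """Combine two child vectors into one parent vector of the same size."""
    children = left + right  # concatenation [left; right]
    return [math.tanh(sum(w * c for w, c in zip(row, children))) for row in W]


def encode(tree, embeddings, W):
    """Recursively encode a binary tree; leaves are words, nodes are pairs."""
    if isinstance(tree, str):
        return embeddings[tree]
    left, right = tree
    return compose(encode(left, embeddings, W), encode(right, embeddings, W), W)


# Hypothetical 2-D word vectors and a 2x4 composition matrix.
embeddings = {"the": [1.0, 0.0], "cat": [0.0, 1.0], "sat": [1.0, 1.0]}
W = [[0.5, -0.5, 0.3, 0.1], [0.2, 0.4, -0.3, 0.6]]

left_branching = encode((("the", "cat"), "sat"), embeddings, W)
right_branching = encode(("the", ("cat", "sat")), embeddings, W)
print(left_branching, right_branching)
```

The same three words produce different vectors under the two bracketings, so the representation depends on the tree structure and not only on the words.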

Types of ANN: Convolutional Neural Nets


Learn similar features in different parts of the inputs

Used heavily on image data, because the same feature can appear in different parts of an image.
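Weight sharing can be sketched in one dimension. The example assumes a "valid" sliding-window convolution (implemented as cross-correlation, as most deep learning libraries do) and a toy signal in which the pattern 1, 2, 1 occurs twice.

```python
def conv1d(signal, kernel):
    """Slide one shared kernel over the signal (no padding, stride 1)."""
    k = len(kernel)
    return [
        sum(kernel[j] * signal[i + j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


signal = [0, 1, 2, 1, 0, 0, 1, 2, 1, 0]  # the pattern 1, 2, 1 appears twice
kernel = [1, 2, 1]                       # a single set of weights for every position

response = conv1d(signal, kernel)
print(response)  # → [4, 6, 4, 1, 1, 4, 6, 4]
```

The response peaks (value 6) at both occurrences of the pattern, although only three weights were used.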

Types of ANN: Auto Encoders

From Machine Learning to Machine Reasoning Léon Bottou


Learn to reconstruct the input
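A minimal sketch of the idea, assuming a linear autoencoder that squeezes 2-D inputs through a 1-D code and is trained with per-sample gradient descent on squared reconstruction error; the data, initial weights and learning rate are illustrative choices.

```python
def reconstruct(x, w, v):
    """Encode x to a single number h, then decode h back to 2-D."""
    h = sum(wi * xi for wi, xi in zip(w, x))  # encoder: h = w . x
    return h, [h * vi for vi in v]            # decoder: x_hat = h * v


def total_loss(data, w, v):
    loss = 0.0
    for x in data:
        _, x_hat = reconstruct(x, w, v)
        loss += sum((a - b) ** 2 for a, b in zip(x_hat, x))
    return loss


# Toy data lying on the line x2 = 2 * x1, so a 1-D code is enough.
data = [[t, 2.0 * t] for t in (-1.0, -0.5, 0.5, 1.0)]
w, v = [0.1, 0.1], [0.1, 0.1]
lr = 0.01

initial_loss = total_loss(data, w, v)
for _ in range(500):
    for x in data:
        h, x_hat = reconstruct(x, w, v)
        err = [a - b for a, b in zip(x_hat, x)]
        grad_v = [2.0 * e * h for e in err]                  # d(loss)/dv
        grad_h = sum(2.0 * e * vi for e, vi in zip(err, v))  # d(loss)/dh
        grad_w = [grad_h * xi for xi in x]                   # d(loss)/dw
        v = [vi - lr * g for vi, g in zip(v, grad_v)]
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
final_loss = total_loss(data, w, v)

print(initial_loss, final_loss)
```

No labels are involved: the input itself is the training target, which is why autoencoders count as unsupervised.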

Types of ANN: RBMs and DBNs

RBM: Restricted Boltzmann Machine

DBN: Deep Belief Networks

Generative graphical model

Salakhutdinov, Ruslan, Andriy Mnih, and Geoffrey Hinton. "Restricted Boltzmann machines for collaborative filtering." Proceedings of the 24th international conference on Machine learning. ACM, 2007.

What is Deep About Deep Learning?

1. Deep Belief networks

2. RBMs, Auto encoders

3. Convolutional Neural Networks

4. Stacked Auto Encoders

Depth helps because, for some functions, a deep network needs only a polynomial number of parameters, whereas a network with fewer layers may need exponentially many.

Taigman, Yaniv, et al. "DeepFace: Closing the gap to human-level performance in face verification." Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on. IEEE, 2014.

What is Deep Learning? Like a Lego Building exercise.

Stacking of various models and propagating the error from the output of this architecture to each layer.

Solves the issue of feature selection

Non linear relationship between features

Much easier to train a model on large data than to hand craft features.

When Deep Learning?




Why were Deep ANNs in the shadows?

There were major challenges in training ANNs:
◦ Need large amounts of data to train (for better function approximation)
◦ More weights to train (standard image classification models have millions or billions of weights)
◦ Vanishing and exploding gradient problem (for deeper neural networks)

What changed?

Algorithms for training ANNs:
◦ Stochastic Gradient Descent (with momentum)
◦ RMSProp
◦ Adam, AdaDelta
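The momentum variant can be sketched as below. This assumes the classical formulation (velocity = momentum * velocity - lr * gradient) and, for brevity, uses the exact gradient of a toy loss w² where a real run would use a noisy mini-batch gradient.

```python
def sgd_momentum(grad_fn, w, lr=0.1, momentum=0.9, steps=500):
    """Gradient descent with classical momentum on a single weight."""
    velocity = 0.0
    for _ in range(steps):
        # Keep a decaying running sum of past gradients, then move along it.
        velocity = momentum * velocity - lr * grad_fn(w)
        w = w + velocity
    return w


# Toy loss(w) = w ** 2, whose gradient is 2 * w; the minimum is at w = 0.
w_final = sgd_momentum(lambda w: 2.0 * w, w=5.0)
print(w_final)
```

RMSProp, Adam and AdaDelta follow the same loop but additionally rescale each step using running statistics of the squared gradients.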

Fixed vanishing and exploding gradient problems:

◦ LSTM, GRU Units (for vanishing gradients)
◦ Gradient Clipping (for exploding gradients)
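Gradient clipping is simple enough to sketch directly. The version below assumes clipping by the L2 norm of the whole gradient vector, which is one common variant; clipping each component to a fixed range is another.

```python
import math


def clip_by_norm(grad, max_norm):
    """Rescale grad so its L2 norm is at most max_norm; direction is kept."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)
    scale = max_norm / norm
    return [g * scale for g in grad]


print(clip_by_norm([30.0, 40.0], max_norm=5.0))  # large gradient is shrunk
print(clip_by_norm([0.3, 0.4], max_norm=5.0))    # small gradient is unchanged
```

Because only the length changes, the update still points downhill; it just cannot take an arbitrarily large step.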

Methods to prevent overfitting:
◦ Regularization
◦ Dropout
◦ Adversarial Networks

Computation Resources:
◦ GPU Computing
◦ HPC, MPI

Larger Datasets:
◦ ImageNet (for image classification)
◦ Google Billion Words Corpus (for auto-generated word vectors)

Methods to gain sparsity:
◦ Dropout
◦ ReLU, MaxOut activations
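Both sparsity mechanisms fit in a few lines. The sketch assumes "inverted" dropout, where surviving activations are scaled by 1 / (1 - rate) during training so nothing needs rescaling at test time; the seed and inputs are arbitrary.

```python
import random


def relu(values):
    """Zero out negative activations; positive ones pass through."""
    return [max(0.0, v) for v in values]


def dropout(values, rate, rng):
    """Inverted dropout: drop each unit with probability `rate`."""
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]


rng = random.Random(0)  # fixed seed so the run is repeatable
activations = relu([-1.5, 0.2, -0.3, 2.0])
dropped = dropout([1.0] * 1000, rate=0.5, rng=rng)

print(activations)
print(dropped.count(0.0), "of", len(dropped), "units dropped")
```

ReLU produces zeros deterministically wherever the pre-activation is negative, while dropout produces them at random and only during training.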

Machine Learning to Neural Networks

MACHINE LEARNING METHODS

Deterministic Models
◦ Linear Regression
◦ Logistic Regression
◦ SVM
◦ CRF

Generative Models
◦ HMM
◦ LDA
◦ Collaborative Filtering

Unsupervised
◦ K-means
◦ Hierarchical Clustering

NEURAL NETWORKS

Deterministic Models
◦ ANN Squared Error loss
◦ ANN Softmax layer and log loss
◦ ANN Hinge loss
◦ RNN with prediction at end

Generative Models
◦ RNN generating sequences
◦ RBMs
◦ RBMs

Unsupervised
◦ Auto Encoders
◦ RBMs
◦ Deep Belief Networks


Loss Functions & Optimization

RMSProp, Adagrad and Adadelta are used in high-performance networks.

The idea is:

For some f(W, X), minimize the loss between y and f(W, X).

This is done using a loss function.

The major one is log loss.
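For classification, the log loss is usually paired with a softmax output layer. The sketch below assumes f(W, X) produces raw scores (logits) and that the target is a class index; the helper names are illustrative.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_log_loss(logits, target):
    """Negative log-probability assigned to the correct class."""
    return -math.log(softmax(logits)[target])


def softmax_log_loss_grad(logits, target):
    """Gradient of the loss w.r.t. the logits: probabilities minus one-hot target."""
    probs = softmax(logits)
    return [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]


print(softmax_log_loss([0.0, 0.0], target=0))       # log(2): the model is undecided
print(softmax_log_loss_grad([0.0, 0.0], target=0))  # pushes class 0 up, class 1 down
```

The gradient is simply the prediction error per class, which is what the optimizer then propagates back through the network.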

Open Questions

Autoencoders for text data

AI Question Answering

Sarcasm Sentiment analysis

Collaborate

SEMEVAL 2016 is coming up, and there are tasks like:
◦ Sentiment analysis
◦ Question Answering
◦ http://alt.qcri.org/semeval2016/task4/

the didbend first water.bond warmerial in roid.the lagents to duttersprantessi harkian, arow ... with enkyber fanter-indoug tood cool... the summer small winding skates the moutledday markedgly searl.doupy of it your sold all ic house bat she - etther of thouder fol my old starsgream trains ond cat out the song"saurand shide of gres dewill a now centher mother of at, the creaking passs cool sunsing sapcingatale dowthing aland suncaking in.do a back-end stliagh in in ithicn like into whereso to the touther pate patin on' gal on the aloopmesaterfleoss the sound i lean

I andhe had begetter by His husband, brought unto a hundred cruelings,shrouded me, pierced Arjuna, on thy foe, proud directions and urged bySatyaki in the heart as the filled hill with his flying poison. Untothy host, called Earth, recognise him, by means of her abode, 'Thou shaltconquer thy car is in all kinds of righteousness. Whatever I is filledwith respect. In thee enjoyment will iniunto that Kshatriya enjoys verilyto that as to him that I have now take for me of Kuru's race.'"


"Drona said, 'Renounced still, thou art my great science and foreholder,thou wilt, O best of men, go now, may be said to be Pandu. Persons offooly acts also may injury With regions of entirety? Thou art the deterioryfrom this point of desire. There should be and enjoyeth rites defeatedby the world meet with without injury.

