DEEP LEARNING FOR NATURAL LANGUAGE PROCESSING
TRANSCRIPT
Prerana Singhal
THE NEED FOR NATURAL LANGUAGE PROCESSING
No. of internet users – huge and growing
Treasure chest of data in the form of Natural Language
APPLICATIONS
Search
Customer Support
Q & A
Summarization
Sentiment Analysis
NATURAL LANGUAGE PROCESSING
Rule based systems (since 1960s)
Statistical Machine Learning (since late 1980s): Naïve Bayes, SVM, HMM, LDA, …
Examples: spam classifiers, Google News, Google Translate
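For a flavour of the statistical approach, here is a minimal Naïve Bayes spam classifier sketched with scikit-learn (a modern stand-in, not code from the talk; the corpus and labels are invented for illustration):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    # Tiny invented corpus; a real spam filter trains on thousands of mails.
    mails = ["win a free prize now", "meeting moved to 3pm",
             "free money click here", "lunch tomorrow?"]
    labels = [1, 0, 1, 0]  # 1 = spam, 0 = ham

    clf = make_pipeline(CountVectorizer(), MultinomialNB())
    clf.fit(mails, labels)
    print(clf.predict(["free prize inside"]))  # [1]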
WHY IS NLP HARD?
“Flipkart is a good website” (Easy)
“I didn’t receive the product on time” (Negation)
“Really shoddy service” (Rare words)
“It’s gr8 to see this” (Misspellings)
“Well played Flipkart! You’re giving IRCTC a run for their money” (Sarcasm)
Accuracy sometimes not good enough for production
EXCITING DEEP LEARNING RESULTS
Amazing results, especially in the image and speech domains:
ImageNet: 6% error rate
Facial Recognition: 97.35% accuracy
Speech Recognition: 25% error reduction
Handwriting Recognition (ICDAR)
IMAGE MODELS
SENSIBLE ERRORS
DEEP LEARNING FOR NLP
Positive–Negative Sentiment Analysis: accuracy increase from 85% to 96% (73% error reduction)
State-of-the-art results on various text classification tasks (same model): Tweets, Reviews, Emails
Beyond Text Classification
Why does it outperform statistical models?
STATISTICAL CLASSIFIERS
RAW DATA
Flipkart! You need to improve your delivery
FEATURE ENGINEERING
Functions which transform input (raw) data into a feature space
Discriminative – for the decision boundary
Feature engineering is painful (example below)
Deep Neural Networks: identify the features automatically
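For contrast with learned features, a hypothetical hand-engineered feature function of the kind statistical classifiers rely on (every feature name here is invented for illustration):

    def extract_features(text):
        # Transform raw text into a hand-crafted feature space.
        tokens = text.split()
        return {
            "num_exclamations": text.count("!"),
            "has_negation": int(any(t.lower() in ("not", "don't", "didn't")
                                    for t in tokens)),
            "num_allcaps_words": sum(t.isupper() for t in tokens),
            "num_tokens": len(tokens),
        }

    print(extract_features("Flipkart! You need to improve your delivery"))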
NEURAL NETWORKS
DEEP NEURAL NETWORKS
Higher layers form higher levels of abstraction.
DEEP NEURAL NETWORKS
Unsupervised pre-training
DEEP LEARNING FOR NLP
Why Deep Learning?
Problems with applying deep learning to natural language
PROBLEMS WITH STATISTICAL MODELS
BAG OF WORDS
“FLIPKART IS BETTER THAN AMAZON”
Word ordering information lost
Data sparsity
Words as atomic symbols
Very hard to find higher-level features (features other than BOW)
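A minimal sketch of the ordering problem, using scikit-learn's CountVectorizer as the bag-of-words featurizer (an assumed choice; any BOW featurizer behaves the same):

    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["Flipkart is better than Amazon",
            "Amazon is better than Flipkart"]
    X = CountVectorizer().fit_transform(docs)

    # Both sentences map to the identical count vector:
    # the ordering information is gone.
    print(np.array_equal(X[0].toarray(), X[1].toarray()))  # True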
HOW TO ENCODE THE MEANING OF A WORD?
WordNet: a dictionary of synonyms
Synonyms of “good”: adept, expert, good, practiced, proficient, skillful
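Assuming NLTK with the WordNet corpus downloaded, these synonym sets can be pulled like so:

    from nltk.corpus import wordnet as wn  # requires nltk.download('wordnet')

    # Collect lemma names across all synsets of "good".
    synonyms = {lemma.name() for syn in wn.synsets("good")
                for lemma in syn.lemmas()}
    print(sorted(synonyms))  # includes 'adept', 'expert', 'practiced', ...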
WORD EMBEDDINGS: THE FIRST BREAKTHROUGH
NEURAL LANGUAGE MODEL
WORD EMBEDDINGS: VISUALIZATIONS
CAPTURE RELATIONSHIPS
[Embedding visualization figures omitted.]
Trained in a completely unsupervised way
Reduce data sparsity
Semantic Hashing
Appear to carry semantic information about the words
Freely available for out-of-the-box usage
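A sketch of that out-of-the-box usage with gensim and the commonly distributed pretrained Google News vectors (the file name is an assumption):

    from gensim.models import KeyedVectors

    # Pretrained Google News word2vec vectors (large download).
    vectors = KeyedVectors.load_word2vec_format(
        "GoogleNews-vectors-negative300.bin", binary=True)

    # The embeddings appear to capture relationships:
    # king - man + woman is closest to queen.
    print(vectors.most_similar(positive=["king", "woman"],
                               negative=["man"], topn=1))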
COMPOSITIONALITY
How do we go beyond words (sentences and paragraphs)?
This turns out to be a very hard problem
Simple approaches:
Word Vector Averaging
Weighted Word Vector Averaging
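A minimal numpy sketch of word vector averaging, assuming a `vectors` lookup like the gensim one above:

    import numpy as np

    def sentence_vector(sentence, vectors):
        # Average the word vectors, skipping out-of-vocabulary words.
        words = [w for w in sentence.split() if w in vectors]
        return np.mean([vectors[w] for w in words], axis=0)

    # Weighted averaging scales each word vector first,
    # e.g. by its TF-IDF weight, before taking the mean.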
CONVOLUTIONAL NEURAL NETWORKS
Excellent feature extractors in images
Features are detected regardless of position in the image
NLP: “Natural Language Processing (Almost) from Scratch” (Collobert et al., 2011) first applied CNNs to NLP
CNN FOR TEXT
[Figure: CNN-for-text walkthrough. Each word in the sentence is mapped to a 3-dimensional embedding vector. A 3 x 9 weight matrix composes each window of 3 consecutive word vectors (9 values) into a single 3-dimensional feature vector, sliding one word at a time across the sentence. Element-wise max pooling over the resulting feature vectors yields a fixed-size representation, which the classifier labels Neutral.]
DEMYSTIFYING MAX POOLING
Finds the most important part(s) of the sentence
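A minimal numpy sketch of the composition and max-pooling steps illustrated above, with toy dimensions and random weights rather than the trained model:

    import numpy as np

    words, dim, window = 5, 3, 3
    embeddings = np.random.randn(words, dim)  # one row per word
    W = np.random.randn(dim, window * dim)    # the 3 x 9 weight matrix

    # Compose each 3-word window (9 values) into one 3-d feature vector.
    features = np.array([W @ embeddings[i:i + window].ravel()
                         for i in range(words - window + 1)])

    # Max pooling over time: keep the strongest response of each feature
    # across all windows, i.e. "the most important part of the sentence".
    pooled = features.max(axis=0)  # shape (3,)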
CNN FOR TEXT
Window sizes: 3, 4, 5
Static mode
Non-static mode
Multichannel mode
Multiclass classification
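A hedged sketch of this architecture in Keras (a modern stand-in for the Theano-era tooling used at the time; vocabulary, filter counts, and sizes are assumptions):

    from tensorflow.keras import layers, models

    VOCAB, DIM, MAXLEN, CLASSES = 20000, 300, 50, 3  # assumed sizes

    inp = layers.Input(shape=(MAXLEN,))
    # Static mode freezes pretrained embeddings (trainable=False);
    # non-static fine-tunes them; multichannel feeds one copy of each.
    emb = layers.Embedding(VOCAB, DIM)(inp)

    pooled = []
    for w in (3, 4, 5):  # the three window sizes from the slide
        conv = layers.Conv1D(100, w, activation="relu")(emb)
        pooled.append(layers.GlobalMaxPooling1D()(conv))

    out = layers.Dense(CLASSES, activation="softmax")(
        layers.Concatenate()(pooled))  # multiclass output
    model = models.Model(inp, out)
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")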
RESULTS

Dataset | Source | Labels | Statistical Models | CNN
Flipkart Twitter Sentiment | Twitter | Pos, Neg | 85% | 96%
Flipkart Twitter Sentiment | Twitter | Pos, Neg, Neu | 76% | 89%
Fine-grained sentiment in Emails | Emails | Angry, Sad, Complaint, Request | 55% | 68%
SST2 | Movie Reviews | Pos, Neg | 79.4% | 87.5%
SemEval Task 4 | Restaurant Reviews | food / service / ambience / price / misc | 88.5% | 89.6%
SENTIMENT: ANECDOTES
DRAWBACKS & LEARNINGS
Computationally expensive
How to scale training? How to scale prediction?
Libraries for Deep Learning: Theano, PyLearn2, Torch
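For a flavour of Theano's symbolic style, a minimal compiled logistic scorer (an illustration, not code from the talk):

    import numpy as np
    import theano
    import theano.tensor as T

    # Build a symbolic graph: logistic score of a 3-d input.
    x = T.dvector("x")
    w = theano.shared(np.zeros(3), name="w")
    b = theano.shared(0.0, name="b")
    score = T.nnet.sigmoid(T.dot(w, x) + b)

    # Theano compiles the graph to fast (optionally GPU) code.
    predict = theano.function([x], score)
    print(predict([0.1, -0.2, 0.3]))  # 0.5 with zero weights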
“I THINK YOU SHOULD BE MORE EXPLICIT HERE IN STEP TWO”
OPEN SOURCED
https://github.com/flipkart-incubator/optimus
BEYOND TEXT CLASSIFICATION
Text Classification covers a lot of NLP problems (or many problems can be reduced to it)
Word Embeddings: Unsupervised Learning
Sequence Learning: RNN, LSTM
RECURRENT MODELS
RNNs, LSTMs
Machine Translation, Chat, Classification
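A minimal Keras sketch of an LSTM text classifier, with the same hypothetical sizes as the CNN sketch above:

    from tensorflow.keras import layers, models

    inp = layers.Input(shape=(50,))          # token ids
    emb = layers.Embedding(20000, 300)(inp)
    h = layers.LSTM(128)(emb)                # reads tokens in order, so
                                             # word ordering is preserved
    out = layers.Dense(2, activation="softmax")(h)
    model = models.Model(inp, out)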
ANY QUESTIONS?