Supervised and semi-supervised learning for NLP
John Blitzer
Natural Language Computing Group (自然语言计算组): http://research.microsoft.com/asia/group/nlc/
Why should I know about machine learning?
This is an NLP summer school. Why should I care about machine learning?
ACL 2008: 50 of 96 full papers mention learning or statistics in their titles.
4 of 4 outstanding papers propose new learning or statistical inference methods.
Example 1: Review classification
Running with Scissors: A Memoir
Title: Horrible book, horrible.
This book was horrible. I read half of it,
suffering from a headache the entire time, and
eventually i lit it on fire. One less copy in the
world...don't waste your money. I wish i had the
time spent reading this book back so i could use
it for better purposes. This book wasted my life
Input: a product review (the text above)
Output: a label, Positive or Negative
From the MSRA Machine Learning Group (机器学习组): http://research.microsoft.com/research/china/DCCUE/ml.aspx
Example 2: Relevance Ranking
Un-ranked list → Ranked list
Example 3: Machine Translation
Input (Chinese sentence): 全国田径冠军赛结束
Output (English sentence): The national track & field championships concluded
Course Outline
1) Supervised Learning [2.5 hrs]
2) Semi-supervised learning [3 hrs]
3) Learning bounds for domain adaptation [30 mins]
Supervised Learning Outline
1) Notation and Definitions [5 mins]
2) Generative Models [25 mins]
3) Discriminative Models [55 mins]
4) Machine Learning Examples [15 mins]
Training and testing data
Training data: labeled pairs (x_1, y_1), …, (x_n, y_n)
Use the training data to learn a function from inputs x to labels y
Use this function to label unlabeled testing data
Feature representations of x
Each document is represented as a sparse count vector: one dimension per feature, such as the words horrible, waste, excellent and the bigrams read_half, loved_it. In the example, the negative review has count 2 for horrible and nonzero counts for read_half and waste; a positive review would instead have nonzero counts for excellent and loved_it, with zeros everywhere else.
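As a minimal sketch of this representation (the tokenizer and the choice of bigram features are our assumptions, not from the slides):

```python
from collections import Counter

def bag_of_words(text):
    """Map a document to a sparse count vector: feature -> count."""
    tokens = text.lower().replace(".", " ").replace(",", " ").split()
    # add word bigrams joined with '_', e.g. read_half, loved_it
    bigrams = ["_".join(pair) for pair in zip(tokens, tokens[1:])]
    return Counter(tokens + bigrams)

features = bag_of_words("Horrible book, horrible. I read half of it")
# 'horrible' gets count 2; the bigram 'read_half' gets count 1
```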
Generative model
Graphical Model Representation
Encode a multivariate probability distribution:
Nodes indicate random variables
Edges indicate conditional dependency
Graphical Model Inference
For the negative review, the model factors as
p(x, y = -1) = p(y = -1) · p(horrible | y = -1) · p(read_half | y = -1) · p(waste | y = -1)
Inference at test time
Given an unlabeled instance, how can we find its label?
Just choose the most probable label y: ŷ = argmax_y p(y) ∏_j p(x_j | y)
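The argmax above can be sketched directly (the toy probability tables below are assumed values for illustration):

```python
import math

def predict(words, priors, cond):
    """Naive Bayes inference: argmax_y p(y) * prod_j p(x_j | y).
    Scores are computed in log space to avoid underflow on long documents."""
    def score(y):
        return math.log(priors[y]) + sum(
            math.log(cond[y].get(w, 1e-6)) for w in words)  # tiny floor for unseen words
    return max(priors, key=score)

priors = {+1: 0.5, -1: 0.5}
cond = {+1: {"excellent": 0.30, "horrible": 0.01},
        -1: {"excellent": 0.01, "horrible": 0.30}}
label = predict(["horrible", "horrible"], priors, cond)  # -> -1
```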
Estimating parameters from training data
Back to labeled training data: (x_1, y_1), …, (x_n, y_n)
Set the parameters to relative counts: p̂(y) = count(y) / n and p̂(x_j | y) = count(x_j, y) / count(y)
Multiclass Classification
Input query: "自然语言处理" ("natural language processing")
Query classification: assign the query to one of several categories, such as Travel, Technology, News, Entertainment, …
Training and testing are the same as in the binary case
Maximum Likelihood Estimation
Why set parameters to counts?
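A standard one-line justification (reconstructed here, not copied from the slide): maximizing the log-likelihood of the training labels, subject to the probabilities summing to one, yields exactly the count ratios.

```latex
\max_{p(y)}\ \sum_{i=1}^{n} \log p(y_i)
\quad \text{s.t.} \quad \sum_{y} p(y) = 1
```

Setting the derivative of the Lagrangian \(\sum_y \mathrm{count}(y)\log p(y) + \lambda\bigl(1 - \sum_y p(y)\bigr)\) to zero gives \(p(y) = \mathrm{count}(y)/\lambda\), and the constraint forces \(\lambda = n\), so \(\hat{p}(y) = \mathrm{count}(y)/n\). The conditional parameters \(p(x_j \mid y)\) follow the same way.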
MLE – Label marginals: p̂(y) = count(y) / n
Problems with Naïve Bayes: predicting broken traffic lights
Lights are broken: both lights are always red
Lights are working: one light is red and one is green
Problems with Naïve Bayes 2
Now, suppose both lights are red. What will our model predict?
We got the wrong answer. Is there a better model?
The MLE generative model is not the best model!
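A runnable version of the traffic-light failure; the counts (six working days, one broken day) are assumed for illustration, since the slide's exact numbers are not in the text:

```python
# Hypothetical data: (light1, light2, state). When working, exactly one
# light is red; when broken, both are red.
data = ([("red", "green", "working")] * 3 +
        [("green", "red", "working")] * 3 +
        [("red", "red", "broken")])

def nb_score(l1, l2, label):
    """Naive Bayes score p(y) * p(light1 | y) * p(light2 | y), from counts."""
    rows = [d for d in data if d[2] == label]
    prior = len(rows) / len(data)
    p1 = sum(d[0] == l1 for d in rows) / len(rows)
    p2 = sum(d[1] == l2 for d in rows) / len(rows)
    return prior * p1 * p2

working = nb_score("red", "red", "working")  # 6/7 * 1/2 * 1/2 ≈ 0.21
broken = nb_score("red", "red", "broken")    # 1/7 * 1 * 1 ≈ 0.14
# Naive Bayes picks "working", yet two red lights only ever occur when broken:
# the independence assumption ignores the correlation between the two lights.
```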
More on Generative models
We can introduce more dependencies, but this can explode the parameter space
Discriminative models minimize error directly (next)
Further reading:
K. Toutanova. Competitive generative models with structure learning for NLP classification tasks. EMNLP 2006.
A. Ng and M. Jordan. On Discriminative vs. Generative Classifiers: A comparison of logistic regression and naïve Bayes. NIPS 2002.
Discriminative Learning
We will focus on linear models, f(x) = sign(w · x)
Goal: directly minimize the model's training error
Upper bounds on binary training error
0-1 loss (error): NP-hard to minimize over all data points
Exp loss, exp(-y · score): minimized by AdaBoost
Hinge loss, max(0, 1 - y · score): minimized by support vector machines
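The three losses, written as functions of the margin m = y · score (a small sketch; the loop at the end checks the slide's point that exp and hinge upper-bound the 0-1 loss):

```python
import math

def zero_one(m):
    """0-1 loss: 1 for a misclassification (margin <= 0), else 0."""
    return 1.0 if m <= 0 else 0.0

def exp_loss(m):
    """Exponential loss, minimized by AdaBoost."""
    return math.exp(-m)

def hinge(m):
    """Hinge loss, minimized by support vector machines."""
    return max(0.0, 1.0 - m)

# Both surrogate losses upper-bound the 0-1 loss at every margin:
for m in (-1.5, -0.5, 0.0, 0.5, 1.5):
    assert exp_loss(m) >= zero_one(m) and hinge(m) >= zero_one(m)
```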
Binary classification: Weak hypotheses
In NLP, a single feature can serve as a weak learner
Sentiment example: a weak hypothesis might predict positive whenever the feature excellent is present, and negative otherwise
The AdaBoost algorithm
Input: training sample (x_1, y_1), …, (x_n, y_n)
(1) Initialize weights D_1(i) = 1/n
(2) For t = 1 … T,
    Train a weak hypothesis h_t to minimize the weighted error on D_t
    Set α_t [chosen later]
    Update D_{t+1}(i) ∝ D_t(i) exp(-α_t y_i h_t(x_i))
(3) Output the model f(x) = sign(Σ_t α_t h_t(x))
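A compact sketch of steps (1)-(3), using single-feature weak hypotheses as the slides suggest (the toy reviews and feature names below are our assumptions):

```python
import math

def adaboost(X, y, hypotheses, T):
    """AdaBoost: X is a list of examples, y their labels in {-1, +1},
    hypotheses a pool of functions x -> {-1, +1}."""
    n = len(X)
    D = [1.0 / n] * n                                  # (1) uniform weights
    model = []
    for _ in range(T):                                 # (2)
        # weak hypothesis with the smallest weighted error on D
        h, eps = min(((h, sum(D[i] for i in range(n) if h(X[i]) != y[i]))
                      for h in hypotheses), key=lambda p: p[1])
        eps = min(max(eps, 1e-10), 1 - 1e-10)          # keep alpha finite
        alpha = 0.5 * math.log((1 - eps) / eps)
        # upweight mistakes, downweight correct examples, renormalize
        D = [D[i] * math.exp(-alpha * y[i] * h(X[i])) for i in range(n)]
        Z = sum(D)
        D = [d / Z for d in D]
        model.append((alpha, h))
    return lambda x: 1 if sum(a * h(x) for a, h in model) >= 0 else -1  # (3)

# Toy sentiment data: each example is a set of features
X = [{"excellent", "book"}, {"excellent", "read"},
     {"terrible", "the_plot"}, {"awful", "the_plot"}]
y = [1, 1, -1, -1]
# one weak hypothesis per feature: +1 if the word is present, else -1
words = ["excellent", "terrible", "awful", "the_plot"]
hyps = [lambda x, w=w: 1 if w in x else -1 for w in words]
f = adaboost(X, y, hyps, T=5)
```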
A small example (labels: +, +, –, –)
+ Excellent book. The_plot was riveting
+ Excellent read
– Terrible: The_plot was boring and opaque
– Awful book. Couldn't follow the_plot.
After each round, AdaBoost upweights the examples the current weak hypothesis misclassifies.
Bound on training error [Freund & Schapire 1995]
Training error ≤ ∏_t Z_t, where Z_t = Σ_i D_t(i) exp(-α_t y_i h_t(x_i)) is the normalizer from round t
We greedily minimize the training error by minimizing each Z_t in turn
For proofs and a more complete discussion:
Robert Schapire and Yoram Singer. Improved Boosting Algorithms Using Confidence-rated Predictions. Machine Learning Journal 1999.
Exponential convergence of error in t
We chose α_t to minimize Z_t. Was that the right choice?
Setting dZ_t/dα_t = 0 gives α_t = (1/2) ln((1 - ε_t) / ε_t), where ε_t is the weighted error of h_t
Plugging in our solution for α_t, we have Z_t = 2√(ε_t(1 - ε_t)), so the training error is at most ∏_t 2√(ε_t(1 - ε_t)) ≤ exp(-2 Σ_t (1/2 - ε_t)²)
AdaBoost drawbacks
What happens when an example is mislabeled or an outlier?
Exp loss exponentially penalizes incorrect scores.
Hinge loss linearly penalizes incorrect scores.
Support Vector Machines
Linearly separable data: some hyperplane puts all + examples on one side and all – examples on the other
Non-separable data: no such hyperplane exists
Margin
Lots of separating hyperplanes. Which should we choose?
Choose the hyperplane with the largest margin
Max-margin optimization
max_{γ, ‖w‖ ≤ 1} γ subject to y_i (w · x_i) ≥ γ for all i
(the score of the correct label must exceed the margin)
Why do we fix the norm of w to be at most 1?
Scaling the weight vector doesn't change the optimal hyperplane
Equivalent optimization problem
min (1/2)‖w‖² subject to y_i (w · x_i) ≥ 1 for all i
Minimize the norm of the weight vector, with a fixed margin of 1 for each example
Back to the non-separable case
We can't satisfy all the margin constraints, but some hyperplanes are still better than others
Soft margin optimization
Add slack variables ξ_i to the optimization:
min (1/2)‖w‖² + C Σ_i ξ_i subject to y_i (w · x_i) ≥ 1 - ξ_i and ξ_i ≥ 0
Allow margin constraints to be violated, but minimize the violation as much as possible
Optimization 1: Absorbing constraints
At the optimum, ξ_i = max(0, 1 - y_i (w · x_i)), so the constrained problem becomes the unconstrained hinge-loss objective min (λ/2)‖w‖² + Σ_i max(0, 1 - y_i (w · x_i))
Optimization 2: Sub-gradient descent
The max creates a non-differentiable point, but there is a subgradient
Subgradient of max(0, 1 - y (w · x)) with respect to w: -y x when y (w · x) < 1, and 0 otherwise
Stochastic subgradient descent
Subgradient descent is like gradient descent: also guaranteed to converge, but slow
Pegasos [Shalev-Shwartz and Singer 2007]: sub-gradient descent on a randomly selected subset of examples
Convergence bound: the objective after T iterations is within O(log T / (λT)) of the best objective value
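A minimal single-example sketch of the Pegasos-style update (this simplification omits the paper's projection step and mini-batches; the toy data is assumed):

```python
import random

def pegasos(X, y, lam=0.1, T=2000, seed=0):
    """Stochastic subgradient descent for a linear SVM, Pegasos-style.
    X: list of feature dicts, y: labels in {-1, +1}. One random example
    per step, with step size 1/(lam * t)."""
    random.seed(seed)
    w = {}
    for t in range(1, T + 1):
        i = random.randrange(len(X))
        eta = 1.0 / (lam * t)                       # decreasing step size
        score = sum(w.get(f, 0.0) * v for f, v in X[i].items())
        for f in w:                                 # gradient of (lam/2)||w||^2
            w[f] *= (1.0 - eta * lam)
        if y[i] * score < 1:                        # hinge subgradient is active
            for f, v in X[i].items():
                w[f] = w.get(f, 0.0) + eta * y[i] * v
    return w

# Toy separable data
X = [{"good": 1.0}, {"good": 1.0, "plot": 1.0},
     {"bad": 1.0}, {"bad": 1.0, "plot": 1.0}]
y = [1, 1, -1, -1]
w = pegasos(X, y)
```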
SVMs for NLP
We've been looking at binary classification, but most NLP problems aren't binary; combining binary linear classifiers for multiclass problems gives piecewise-linear decision boundaries
We showed 2-dimensional examples, but NLP is typically very high-dimensional
Joachims [2000] discusses linear models in high-dimensional spaces
Kernels and non-linearity
Kernels let us efficiently map training data into a high-dimensional feature space, then learn a model which is linear in the new space but non-linear in the original space
But for NLP, we already have a high-dimensional representation!
Optimization with non-linear kernels is often super-linear in the number of examples
More on SVMs
John Shawe-Taylor and Nello Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press 2004.
Dan Klein and Ben Taskar. Max Margin Methods for NLP: Estimation, Structure, and Applications. ACL 2005 Tutorial.
Ryan McDonald. Generalized Linear Classifiers in NLP. Tutorial at the Swedish Graduate School in Language Technology. 2007.
SVMs vs. AdaBoost
SVMs with slack are noise tolerant; AdaBoost has no explicit regularization and must resort to early stopping
AdaBoost easily extends to non-linear models; non-linear optimization for SVMs is super-linear in the number of examples
This can be important for examples with hundreds or thousands of features
More on discriminative methods
Logistic regression, also known as Maximum Entropy: a probabilistic discriminative model which directly models p(y | x)
A good general machine learning book, on discriminative learning and more:
Chris Bishop. Pattern Recognition and Machine Learning. Springer 2006.
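A tiny sketch of what "directly models p(y | x)" means for a linear feature vector (the feature names and weights below are hypothetical):

```python
import math

def p_positive(w, x):
    """Logistic regression / MaxEnt: p(y = +1 | x) = 1 / (1 + exp(-w · x)).
    w and x are sparse feature dicts."""
    score = sum(w.get(f, 0.0) * v for f, v in x.items())
    return 1.0 / (1.0 + math.exp(-score))

w = {"excellent": 2.0, "horrible": -2.0}   # assumed example weights
p = p_positive(w, {"excellent": 1})        # well above 0.5
```

Training sets w to maximize the conditional log-likelihood of the labels, rather than the joint likelihood a generative model maximizes.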
Learning to rank
Output: an ordered list of documents, ranked (1) through (4)
Features for web page ranking
Good features for this model?
(1) How many words are shared between the query and the web page?
(2) What is the PageRank of the web page?
(3) Other ideas?
Optimization Problem
Loss for a query and a pair of documents: the scores of documents with different ranks must be separated by a margin
MSRA Internet Search and Mining Group (互联网搜索与挖掘组): http://research.microsoft.com/asia/group/wsm/
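That pairwise margin condition can be written as a hinge loss over document pairs (a sketch; the function name is ours, not the slides'):

```python
def pair_loss(score_hi, score_lo, margin=1.0):
    """Hinge loss on one (higher-rank, lower-rank) document pair for a
    query: zero once the higher-ranked document's score exceeds the
    lower-ranked one's by the margin, positive otherwise."""
    return max(0.0, margin - (score_hi - score_lo))

pair_loss(3.0, 1.0)   # correctly separated: loss 0.0
pair_loss(1.0, 2.0)   # mis-ordered: positive loss
```

Summing this loss over all rank-violating pairs for each query, and minimizing it with the same subgradient machinery as the classification SVM, gives a margin-based ranker.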
Come work with us at Microsoft!
http://www.msra.cn/recruitment/