Semi-supervised learning and self-training
LING 572
Fei Xia
02/14/06
Outline
• Overview of Semi-supervised learning (SSL)
• Self-training (a.k.a. Bootstrapping)
Additional References
• Xiaojin Zhu (2006): Semi-Supervised Learning Literature Survey.
• Olivier Chapelle et al. (2005): Semi-Supervised Learning. The MIT Press.
Overview of SSL
What is SSL?
• Labeled data:
  – Ex: POS tagging: tagged sentences
  – Creating labeled data is difficult, expensive, and/or time-consuming.
• Unlabeled data:
  – Ex: POS tagging: untagged sentences
  – Obtaining unlabeled data is easier.
• Goal: use both labeled and unlabeled data to improve performance.
• Learning:
  – Supervised (labeled data only)
  – Semi-supervised (both labeled and unlabeled data)
  – Unsupervised (unlabeled data only)
• Problems:
  – Classification
  – Regression
  – Clustering
  – …
Here we focus on the semi-supervised classification problem.
A brief history of SSL
• The idea of self-training appeared in the 1960s.
• SSL took off in the 1970s.
• Interest in SSL increased in the 1990s, mostly due to applications in NLP.
Does SSL work?
• Yes, under certain conditions:
  – Problem itself: knowledge of p(x) carries information that is useful for inferring p(y | x).
  – Algorithm: the modeling assumptions fit the problem structure well.
• SSL will be most useful when there are far more unlabeled data than labeled data.
• SSL can degrade performance when mistakes reinforce themselves.
Illustration (Zhu, 2006)
Illustration (cont)
Assumptions
• Smoothness (continuity) assumption: if two points x1 and x2 in a high-density region are close, then so should be the corresponding outputs y1 and y2.
• Cluster assumption: If points are in the same cluster, they are likely to be of the same class.
• Low-density separation: the decision boundary should lie in a low-density region.
• …
SSL algorithms
• Self-training
• Co-training
• Generative models:
  – Ex: EM with generative mixture models
• Low-density separation:
  – Ex: transductive SVM
• Graph-based models
Which SSL method should we use?
• It depends.
• Semi-supervised methods make strong model assumptions. Choose the ones whose assumptions fit the problem structure.
Semi-supervised and active learning
• Both address the same issue: labeled data are hard to get.
• Semi-supervised learning: the learner chooses which automatically labeled examples to add to the labeled data.
• Active learning: the learner chooses which unlabeled examples to hand to a human annotator.
The two selection criteria are mirror images, as the sketch below illustrates.
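A minimal Python sketch of the contrast: both settings score an unlabeled pool with the current classifier, but self-training keeps the most confident predictions, while active learning (under uncertainty sampling, one common criterion) asks a human about the least confident ones. The function names and the `probs` array layout are illustrative assumptions, not from the slides.

```python
import numpy as np

def select_for_self_training(probs, k):
    """Self-training: pick the k unlabeled examples the current
    classifier is MOST confident about, to pseudo-label them."""
    confidence = probs.max(axis=1)        # top class probability per example
    return np.argsort(-confidence)[:k]

def select_for_active_learning(probs, k):
    """Active learning (uncertainty sampling): pick the k examples the
    classifier is LEAST confident about, to send to a human annotator."""
    confidence = probs.max(axis=1)
    return np.argsort(confidence)[:k]

# probs: an (n_unlabeled, n_classes) array of predicted class probabilities.
```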
Self-training
Basics of self-training
• Probably the earliest SSL idea.
• Also called self-teaching or bootstrapping.
• Appeared in the 1960s and 1970s.
• First well-known NLP paper: (Yarowsky, 1995)
Self-training algorithm
• Let L be the set of labeled data, U be the set of unlabeled data.
• Repeat:
  – Train a classifier h on the training data L
  – Classify the data in U with h
  – Find the subset U' of U with the most confident scores
  – L + U' → L
  – U − U' → U
A minimal sketch of this loop is given below.
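This sketch assumes two caller-supplied stand-ins, not part of any particular library: `train(L)` returns a classifier, and `predict_proba(h, x)` returns a `(label, confidence)` pair.

```python
def self_train(train, predict_proba, L, U, threshold=0.95, max_iter=100):
    """Generic self-training loop.
    L: list of (x, y) labeled examples; U: list of unlabeled examples."""
    for _ in range(max_iter):
        h = train(L)                                   # train classifier h on L
        preds = [predict_proba(h, x) for x in U]       # classify U with h
        # U': the subset of U with the most confident scores
        keep = [i for i, (_, p) in enumerate(preds) if p >= threshold]
        if not keep:                                   # nothing confident: stop
            break
        L = L + [(U[i], preds[i][0]) for i in keep]    # L + U' -> L
        keep_set = set(keep)
        U = [x for i, x in enumerate(U) if i not in keep_set]  # U - U' -> U
    return train(L)
```

The confidence threshold (0.95 here, an arbitrary choice) controls the trade-off noted earlier: a lower threshold adds more data but lets mistakes reinforce themselves.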
Case study: (Yarowsky, 1995)
Setting
• Task: word sense disambiguation (WSD)
  – Ex: "plant": living organism vs. factory
• "Unsupervised": just a few seed collocations are needed for each sense.
• Learner: decision list (sketched below)
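A rough sketch of a decision-list learner, assuming binary senses and a smoothed log-likelihood-ratio score, in the spirit of Yarowsky's learner but not his exact formulation:

```python
from collections import defaultdict
from math import log

def train_decision_list(examples, alpha=0.1):
    """examples: list of (features, sense) pairs with senses 'A'/'B'.
    Each rule is (score, feature, sense), ranked by the smoothed
    log-likelihood ratio |log(P(A|f) / P(B|f))|."""
    counts = defaultdict(lambda: {'A': 0.0, 'B': 0.0})
    for feats, sense in examples:
        for f in feats:
            counts[f][sense] += 1
    rules = []
    for f, c in counts.items():
        # additive smoothing (alpha) keeps zero counts from giving log(0)
        ratio = (c['A'] + alpha) / (c['B'] + alpha)
        rules.append((abs(log(ratio)), f, 'A' if ratio > 1 else 'B'))
    return sorted(rules, reverse=True)      # strongest evidence first

def classify(rules, feats, default='A'):
    """Apply the single highest-ranked rule whose feature is present."""
    for _, f, sense in rules:
        if f in feats:
            return sense
    return default
```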
Assumption #1: One sense per collocation
• Nearby words provide strong and consistent clues to the sense of a target word.
• The effect varies with the type of collocation:
  – It is strongest for immediately adjacent collocations.
• Assumption #1 ⇒ use collocations in the decision rules (feature extraction sketched below).
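A sketch of collocation feature extraction in that spirit; the template inventory here (adjacent words plus a ±k window) is an illustrative subset, not Yarowsky's full set:

```python
def collocation_features(tokens, i, k=10):
    """Features for the target word at position i in a token list."""
    feats = set()
    if i > 0:
        feats.add(('word-1', tokens[i - 1]))      # word immediately to the left
    if i + 1 < len(tokens):
        feats.add(('word+1', tokens[i + 1]))      # word immediately to the right
    for j in range(max(0, i - k), min(len(tokens), i + k + 1)):
        if j != i:
            feats.add(('in-window', tokens[j]))   # word within +/-k words
    return feats
```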
Assumption #2: One sense per discourse
• The sense of a target word is highly consistent within any given document.
• The assumption holds most of the time (99.8% in their experiments).
• Assumption #2 ⇒ filter and augment the addition of unlabeled data (sketched below).
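A sketch of how the one-sense-per-discourse assumption can filter and augment the newly added labels; Yarowsky's actual procedure is more careful, and the dominance threshold below is an assumed parameter:

```python
from collections import Counter

def one_sense_per_discourse(docs, dominance=0.75):
    """docs maps doc_id -> list of (example, sense_or_None) pairs for the
    target word. If one sense clearly dominates a document, relabel every
    occurrence in that document with it (augment) and override minority
    labels (filter)."""
    out = {}
    for doc_id, pairs in docs.items():
        tally = Counter(s for _, s in pairs if s is not None)
        if tally:
            sense, n = tally.most_common(1)[0]
            if n / sum(tally.values()) >= dominance:
                pairs = [(x, sense) for x, _ in pairs]
        out[doc_id] = pairs
    return out
```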
Step 1: identify all examples of the given word: “plant”
Our sample set S
Step 2: Create initial labeled data using a small number of seed collocations
• Sense A: "life"
• Sense B: "manufacturing"
This gives our L(0); the residual data set is U(0) = S − L(0). A sketch of this seed-labeling step follows.
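The sketch assumes each sample is a `(tokens, i)` pair with the target word at position `i`, matched against the seeds in a ±10-word window (the window size is an assumption):

```python
def seed_label(samples, seeds):
    """seeds maps a seed collocation to a sense, e.g.
    {'life': 'A', 'manufacturing': 'B'} for 'plant'."""
    L0, U0 = [], []
    for tokens, i in samples:
        window = set(tokens[max(0, i - 10):i + 11])   # +/-10 words around target
        senses = {sense for word, sense in seeds.items() if word in window}
        if len(senses) == 1:                          # exactly one seed matched
            L0.append(((tokens, i), senses.pop()))
        else:                                         # no match, or conflicting seeds
            U0.append((tokens, i))
    return L0, U0                                     # U(0) = S - L(0)
```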
Initial “labeled” data
Step 3a: Train a DL classifier
Step 3b: Apply the classifier to the entire set
• Add to L the members of U that are tagged with probability above a threshold:
  L(i) + V(i) → L(i+1)
  U(i) − V(i) → U(i+1)

or

  L(0) + V(i) → L(i+1)
  U(0) − V(i) → U(i+1)

where V(i) is the set of examples labeled with probability above the threshold at iteration i.
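In code, the two bookkeeping variants differ only in which base sets they update from. A sketch, treating L and U as sets of example IDs with the assigned labels stored elsewhere:

```python
def update_incremental(L_i, U_i, V_i):
    """L(i) + V(i) -> L(i+1); U(i) - V(i) -> U(i+1): additions accumulate."""
    return L_i | V_i, U_i - V_i

def update_from_iteration_zero(L_0, U_0, V_i):
    """L(0) + V(i) -> L(i+1); U(0) - V(i) -> U(i+1): only the seed labels
    are kept for sure, so earlier automatic additions can be revised."""
    return L_0 | V_i, U_0 - V_i
```

The second variant lets a later, better classifier retract examples that an earlier iteration labeled wrongly.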
Step 3c: filter and augment this addition with assumption #2
Repeat step 3 until convergence
The final DL classifier
The original algorithm
Keep the initial labeling unchanged (cf. the second update variant above).
Options for obtaining the seeds
• Use words in dictionary definitions
• Use a single defining collocate for each sense:
  – Ex: "bird" and "machine" for the word "crane"
• Label salient corpus collocates:
  – Collect frequent collocates (sketched after this list)
  – Manually check the collocates
Getting the seeds is not hard.
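As a sketch of the third option, collecting frequent collocates as seed candidates for a human to check; the window size and list length are assumed parameters:

```python
from collections import Counter

def frequent_collocates(contexts, k=10, top_n=20):
    """contexts: list of (tokens, i) pairs with the target word at i.
    Returns the top_n most frequent words within +/-k of the target,
    to be manually checked and labeled as seeds."""
    counts = Counter()
    for tokens, i in contexts:
        lo, hi = max(0, i - k), min(len(tokens), i + k + 1)
        counts.update(t for j, t in enumerate(tokens[lo:hi], lo) if j != i)
    return counts.most_common(top_n)
```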
Performance
• Baseline: 63.9%
• Supervised learning: 96.1%
• "Unsupervised" learning: 90.6%–96.5%
Why does it work?
• Steven Abney (2004): "Understanding the Yarowsky Algorithm"
Summary of self-training
• The algorithm is straightforward and intuitive.
• It produces outstanding results.
• But the added unlabeled data can pollute the original labeled data: early mistakes may reinforce themselves.
Additional slides
Papers on self-training
• Yarowsky (1995): WSD
• Riloff et al. (2003): identify subjective nouns
• Maeireizo et al. (2004): classify dialogues as “emotional” or “non-emotional”.