Page 1:

Semi-Supervised Training for Statistical Parsing

Mark Steedman, Steven Baker, Jeremiah Crim, Stephen Clark, Julia Hockenmaier, Rebecca Hwa,

Miles Osborne, Paul Ruhlen, Anoop Sarkar (Edinburgh, Penn, UMD, JHU, Cornell)

August 21, 2002

Page 2:

Introduction

Mark Steedman

Page 3:

The Team

Name                  E-Mail               Affiliation
Steedman, Mark        [email protected]     University of Edinburgh
Sarkar, Anoop         [email protected]     University of Pennsylvania
Osborne, Miles        [email protected]     University of Edinburgh
Hwa, Rebecca          [email protected]     University of Maryland
Clark, Stephen        [email protected]     University of Edinburgh
Hockenmaier, Julia    [email protected]     University of Edinburgh
Ruhlen, Paul          [email protected]     Johns Hopkins University
Baker, Steven         [email protected]     Cornell University
Crim, Jeremiah        [email protected]     Johns Hopkins University

Associates

Name                  E-Mail               Affiliation
John Henderson        [email protected]     Mitre Corp.
Chris Callison-Burch  [email protected]     Stanford/Edinburgh

Page 4:

Statistical parsing: the State of the Art

• Parsers trained on the Penn Treebank have improved rapidly . . .

• . . . and have been widely and usefully applied.

• Generalizing to other genres of text and other languages is highly desirable.

• But we will probably never get another 1M words of any kind of human-annotated text.

• Automatically parsed text such as the BLLIP Parsed Corpus (30M words) is too noisy.

• Co-training has been shown to support bootstrapping respectable performance from small labeled corpora (Sarkar 2001).

Page 5:

What is Co-training?

• Co-Training (Blum and Mitchell 1998) is a weakly supervised method for bootstrapping a model from a (small) seed set of labeled examples, augmented with a much larger set of unlabeled examples, by exploiting redundancy among multiple statistical models.

• A set of models is trained on the same labeled seed material.

• The whole set of models is then run on a larger amount of unlabeled data. Novel parses from any parser that are deemed reliable under some measure are used as additional labeled data for retraining the other models.

Page 6:

What is Co-training? (Contd.)

• Co-training can be thought of as seeking to optimize an objective function that measures the degree of agreement between the predictions for the unlabeled data based on the two views (see below).

• Crucially, Co-Training is training on others' output, in contrast to Self-training, or training on own output (cf. Charniak 1997).

• The theory of co-training has been developed in application to classifiers. The present project is an empirical study of its applicability to parsers, following (Sarkar 2001).

Page 7:

Potential Payoffs are Large

• Payoff 1:

– Improved performance of Wide-Coverage Parsers

• Payoff 2:

– Methods for building Very Large Treebanks (larger and less noisy than BLLIP), useful for numerous speech/language applications

• Payoff 3:

– Methods for bootstrapping parsers for novel genres and new languages from small labeled datasets

Page 8:

Co-training: Project Details

• Three Distinct Parsers

– Treebank CFG parsers (Collins 1999, 2000, . . .)

– LTAG parser (Sarkar)

– CCG parsers (and CCG supertaggers) (Clark, Hockenmaier)

• Approaches

– Co-training with supertaggers and parsers (CCG)
– Co-training with different parsers (CFG and LTAG)
– Co-training Rerankers (Collins 2000)

• Data:

– Use labeled WSJ (1M words) and Brown (440K words) Penn Treebank
– Then use unlabeled North American News Text corpus: 500M words

Page 9:

Workshop Goals

• Identify criteria for Parse Selection that exclude noise and maximize co-training effects;

• Explore the way co-training effects depend on size of labeled seed set;

• Explore effectiveness of co-training for porting parsers to new genres by training on the Brown Corpus, co-training on unlabeled WSJ and held-out PT-WSJ labeled sec. 00, ultimately evaluating on PT-WSJ sec. 23.

• Explore effectiveness of co-training on parsers trained on all of PT-WSJ sec. 2-21, co-trained on unlabeled WSJ and labeled sec. 00, ultimately evaluating on sec. 23.

Page 10:

What We Have to Show

• We will show that co-training enhances performance for parsers and taggers trained on small (500-10,000 sentences) amounts of labeled data.

• We will also show that co-training can be used for porting parsers trained on one genre to parse on another without any new human-labeled data at all, improving on the state of the art for this task.

• We will also show that even tiny amounts of human-labelled data for the target genre enhance porting via co-training.

• We will show a number of preliminary results on ways to deliver similar improvements for parsers trained on large (Penn WSJ Treebank) labeled datasets and expressive grammars.

• These include a number of novel methods for Parse Selection.

Page 11:

Outline

• Data Management (5 min, Steven Baker)

• Parse Selection (15 min, Rebecca Hwa)

• Experiments I: Collins-CFG/LTAG (25 min, Anoop Sarkar)

• Experiments II: CCG/CCG-Supertagger (30 min, Stephen Clark)

• Student Presentation: Oracle experiments (10 min, Steven Baker)

• Coffee Break (30 min)

• Student Proposal: Smoothing for Co-training (15 min, Julia Hockenmaier)

• Experiments III: Rerankers (10 min, Jeremiah Crim)

• Student Proposal: Co-training Rerankers (5 min, Jeremiah Crim)

• Conclusions (5 min, Miles Osborne)

• Questions and Discussion (30 min)

Page 12:

Co-training Architecture and Parse Selection

Steve Baker, Rebecca Hwa, and Paul Ruhlen

Page 13:

Co-training: The Algorithm

• Requires:

– Two learners with different views of the task
– Cache Manager (CM) to interface with the disparate learners
– A small set of labelled seed data and a larger pool of unlabelled data

• Pseudo-Code:

– Init: Train both learners with labelled seed data
– Loop:
  ∗ CM picks unlabelled data to add to cache
  ∗ Both learners label cache
  ∗ CM selects newly labelled data to add to the learners' respective training sets
  ∗ Learners re-train
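For concreteness, here is a minimal Python sketch of this loop. The learner interface (train/parse, with parse returning a tree and a confidence score) and the top-n selection stub are hypothetical stand-ins for the workshop components, not the actual implementation:

    import random

    def top_n_select(teacher_output, n=20):
        # Toy selection: keep the n parses the teacher was most confident about.
        ranked = sorted(teacher_output, key=lambda ex: ex[2], reverse=True)
        return [(sentence, tree) for sentence, tree, score in ranked[:n]]

    def cotrain(learner_a, learner_b, seed, unlabelled, cache_size=50, rounds=100):
        # Init: train both learners on the labelled seed data.
        learner_a.train(seed)
        learner_b.train(seed)
        pool = list(unlabelled)
        for _ in range(rounds):
            if not pool:
                break
            random.shuffle(pool)                  # CM picks unlabelled data
            cache, pool = pool[:cache_size], pool[cache_size:]
            # Both learners label the cache; parse() returns (tree, score).
            out_a = [(s, *learner_a.parse(s)) for s in cache]
            out_b = [(s, *learner_b.parse(s)) for s in cache]
            # Each learner re-trains on data selected from the *other*
            # learner's output (added to its training set in the real system).
            learner_a.train(top_n_select(out_b))
            learner_b.train(top_n_select(out_a))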

Page 14:

Co-training: Theoretical Considerations

• Co-training works (provably) when the views are independent.

– Blum and Mitchell, 1998
– Nigam and Ghani, 2000
– Dasgupta et al., 2001

• To apply theory to practice, we would like to explicitly optimise an objective function that measures the degree of agreement between the predictions for the unlabelled data based on the two views.

Page 15:

Parse Selection: Practical Heuristics

• Computing the objective function is impractical.

• We use parse selection heuristics to determine which labelled examples to add.

– Each parser assigns a score for every sentence, estimating the goodness of its best parse.

– Cache Manager uses these scores to select the newly labelled data for both parsers according to some selection method.

Page 16:

Scoring Phase

• Ideally, we want the parser to return its true accuracy for the parses it produced (e.g., parseval scores).

• In practice, we can approximate these scores with confidence metrics.

– log probability of the best parse
– entropy of the n-best parses (tree-entropy)
– number of parses, sentence length
– . . .
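As an illustration, here is a hedged Python sketch of two such confidence metrics, assuming the parser returns an n-best list of (tree, probability) pairs; the function names are ours:

    import math

    def log_prob_score(nbest):
        # Confidence = log probability of the best parse.
        return max(math.log(prob) for tree, prob in nbest)

    def tree_entropy_score(nbest):
        # Confidence = -entropy of the renormalised n-best distribution;
        # a peaked (low-entropy) distribution gives a higher score.
        total = sum(prob for tree, prob in nbest)
        return sum((p / total) * math.log(p / total) for tree, p in nbest if p > 0)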

Page 17:

Selection Phase

• Want to select training examples for one parser (student) labelled by the other (teacher) so as to minimise noise and maximise training utility.

– Top-n: Choose the n examples for which the teacher assigned the highest scores.

– Difference: Choose the examples for which the teacher assigned a higher score than the student by some threshold.

– Intersection: Choose the examples that received high scores from the teacher but low scores from the student.

– Disagreement: Choose the examples for which the two parsers provided different analyses and the teacher assigned a higher score than the student.
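A minimal sketch of these four methods, assuming each cache example is a dict carrying both parsers' analyses and confidence scores (the keys and thresholds are illustrative, not the workshop settings):

    def top_n(examples, n):
        # Teacher's n highest-scoring parses.
        return sorted(examples, key=lambda e: e["teacher_score"], reverse=True)[:n]

    def difference(examples, threshold):
        # Teacher scored higher than the student by at least `threshold`.
        return [e for e in examples
                if e["teacher_score"] - e["student_score"] >= threshold]

    def intersection(examples, hi, lo):
        # High teacher score but low student score.
        return [e for e in examples
                if e["teacher_score"] >= hi and e["student_score"] <= lo]

    def disagreement(examples):
        # Different analyses, with the teacher the more confident of the two.
        return [e for e in examples
                if e["teacher_parse"] != e["student_parse"]
                and e["teacher_score"] > e["student_score"]]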

Page 18:

Two Pilot Studies

• What scoring function might work well?

– Experiment with different scoring functions while using the same selection method.

• What selection method might work well?

– Experiment with different selection methods while using the same scoring method.

Page 19:

Experimental Setup

• Unlabelled training data: derived from Penn Treebank WSJ sec02-21.

• Cache size: 50 sentences per iteration.

• Development test set: WSJ sec00.

• Evaluation metric: parseval

– Precision (P) = (# of correctly guessed labelled constituents) / (# of guessed labelled constituents)

– Recall (R) = (# of correctly guessed labelled constituents) / (# of labelled constituents in gold standard)

– F2 = (2 × P × R) / (P + R)
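A toy sketch of these metrics, treating a parse as a set of (label, start, end) constituents; real Parseval additionally follows bracketing conventions (e.g. for punctuation) not modelled here:

    def parseval(guessed, gold):
        # guessed, gold: collections of (label, start, end) triples.
        guessed, gold = set(guessed), set(gold)
        correct = len(guessed & gold)
        p = correct / len(guessed)                    # labelled precision
        r = correct / len(gold)                       # labelled recall
        f2 = 2 * p * r / (p + r) if p + r else 0.0    # harmonic mean (the "F2" above)
        return p, r, f2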

Page 20:

Scoring Function Variation Study

• Setting: Two versions of the same parser trained on different initial labelled seed data (1000-sentence subsets of Penn Treebank; resp. w/o conjunctions, and w/o prepositions).

• Scoring functions: parseval, best parse probability, tree-entropy, combination.

• Selection method: intersection, picks 20 sentences per iteration.

Page 21:

Scoring Function Variation Learning Curve

• All practical scoring functions performed similarly

[Figure: learning curves for the "Parseval", "Probability", "Tree Entropy", and "Combination" scoring functions. x-axis: Number of Training Sentences (1000-3000); y-axis: Parsing Accuracy on Test Data (F2 score), 77-83; reference line at 81.3.]

Page 22:

Selection Method Variation Study

• Setting: Two different parsers (CFG and LTAG) trained on the same initial labelled seed data (500, 1000 sentences from Penn Treebank)

• Scoring function: parseval

• Selection methods: Top-n, Difference, Intersection, Disagreement

Page 23:

Selection Method Variation (500 examples)

• "Smart" selection methods do better than Top-n

• None is as good as human-annotated data

[Figure: learning curves for "Hand-Annotated", "Top-N", "Difference", "Intersection", and "Disagreement" selection. x-axis: Number of Training Sentences (600-2000); y-axis: Parsing Accuracy on Test Data (F2), 77-83.]

Page 24:

Selection Method Variation (1000 examples)

• Similar trend for experiment with larger initial labelled set

• Disparity between selection methods and hand-annotated upper-bound is bigger

[Figure: learning curves for "Hand-Annotated", "Top-n", "Difference", "Intersection", and "Disagreement" selection. x-axis: Number of Training Sentences (1000-2000); y-axis: Parsing Accuracy on Test Data (F2), 80-83.]

Page 25:

Summary

• Use parse selection heuristics to approximate objective function.

• Scoring function differences are inconclusive – use log probability for large-scale experiments.

• Selection methods that use scores from both parsers (Difference, Intersection, Disagreement) seem more reliable.

Page 26:

Co-training between Collins-CFG and LTAG

Rebecca Hwa, Miles Osborne and Anoop Sarkar

Page 27:

Collins-CFG and LTAG: Different Views for Co-Training

Collins-CFG:
– Bi-lexical dependencies are between nonterminals
– Can produce novel elementary trees for the LTAG
– Learning curves show convergence on 1M words labeled data
– When using small amounts of seed data, abstains less often than LTAG

LTAG:
– Bi-lexical dependencies are between elementary trees
– Can produce novel bi-lexical dependencies for Collins-CFG
– Learning curves show LTAG needs relatively more labeled data
– When using small amounts of seed data, abstains more often than Collins-CFG

Page 28:

[Figure: Experimental comparison between Collins-CFG and LTAG (self-training). x-axis: Iterations (0-120); y-axis: fscore, 72.5-76.5; curves: Collins-CFG self-training, LTAG self-training.]

Page 29:

Workshop Goals

• Use co-training to boost performance, when faced with small seed data
→ Use small subsets of WSJ labeled data as seed data

• Use co-training to port parsers to new genres
→ Use Brown corpus as seed data, co-train and test on WSJ

• Use a large set of labeled data and use unlabeled data to improve parsing performance
→ Use Penn Treebank (40K sents) as seed data

Page 30:

Experiments on Small Labeled Seed Data

• Motivating the size of the initial seed data set

• We plotted learning curves, tracking parser accuracy while varying the amount of labeled data

• Find the “elbow” in the curve where the payoff will occur

• This was done for both the Collins-CFG and the LTAG parser

Page 31:

[Figure: Collins Parser learning curve, Penn Treebank sec. 2-21 in order. x-axis: Training sentences (up to 35,000); y-axis: F-measure, 0.76-0.90. Fitted curve f = a(1 - b x^c) with a = 0.892732, b = 0.574215, c = 0.20543; predicted performance = 0.88586; best observed = 0.886645. Upper bound = 89.3%; projected to a treebank of 80K sentences = 88.9%, of 400K = 89.2%.]

Page 32:

Workshop Goals

• Use co-training to boost performance, when faced with small seed data
→ Use 500 sentences of WSJ labeled data as seed data
→ Compare performance of co-training vs. self-training

• Use co-training to port parsers to new genres
→ Use Brown corpus as seed data, co-train and test on WSJ

• Use a large set of labeled data and use unlabeled data to improve parsing performance
→ Use Penn Treebank (40K sents) as seed data

Page 33:

[Figure: Co-training compared with Self-training (Collins-CFG with LTAG, initialized with 500 sentences seed data). x-axis: Iterations (0-120); y-axis: fscore, 74.5-78.5; curves: Collins-CFG co-training, Collins-CFG self-training.]

Page 34:

Workshop Goals

• Use co-training to boost performance, when faced with small seed data
→ Co-training beats self-training with 500 sentence seed data
→ Compare parse selection methods for different parser views

• Use co-training to port parsers to new genres
→ Use Brown corpus as seed data, co-train and test on WSJ

• Use a large set of labeled data and use unlabeled data to improve parsing performance
→ Use Penn Treebank (40K sents) as seed data

Page 35:

[Figure: Comparing parse selection methods between Collins-CFG and LTAG. x-axis: Iterations (0-80); y-axis: fscore, 75-78; curves: Collins-CFG throw all, Collins-CFG intersection.]

Page 36:

[Figure: Comparing parse selection methods between Collins-CFG and LTAG. x-axis: Iterations (0-60); y-axis: fscore, 87-89.5; curves: LTAG throw all, LTAG intersection.]

Page 37:

Workshop Goals

• Use co-training to boost performance, when faced with small seed data
→ Co-training beats self-training with 500 sentence seed data
→ Different parse selection methods better for different parser views
→ Compare performance when seed data is doubled to 1K sentences

• Use co-training to port parsers to new genres
→ Use Brown corpus as seed data, co-train and test on WSJ

• Use a large set of labeled data and use unlabeled data to improve parsing performance
→ Use Penn Treebank (40K sents) as seed data

Page 38:

[Figure: Comparison between 500 sentence seed data vs. 1K sentence seed data (Collins-CFG with LTAG). x-axis: Iterations (0-120); y-axis: fscore, 75-80; curves: Collins-CFG 500 co-training (throw all), Collins-CFG 1K co-training (intersection).]

Page 39:

Workshop Goals

• Use co-training to boost performance, when faced with small seed data
→ Co-training beats self-training with 500 sentence seed data
→ Different parse selection methods better for different parser views
→ Co-training still improves performance with 1K sentence seed data

• Use co-training to port parsers to new genres
→ Use Brown corpus as seed data, co-train and test on WSJ

• Use a large set of labeled data and use unlabeled data to improve parsing performance
→ Use Penn Treebank (40K sents) as seed data

Page 40:

[Figure: Train on Brown corpus as seed data, co-train and test on WSJ (Collins-CFG with LTAG). x-axis: Iterations (0-140); y-axis: fscore, 76.5-80.5; curves: Brown seed co-training, Brown seed plus 100 WSJ sents co-training.]

Page 41:

Workshop Goals

• Use co-training to boost performance, when faced with small seed data
→ Co-training beats self-training with 500 sentence seed data
→ Different parse selection methods better for different parser views
→ Co-training still improves performance with 1K sentence seed data

• Use co-training to port parsers to new genres
→ Co-training improves performance significantly when porting from one genre (Brown) to another (WSJ)

• Use a large set of labeled data and use unlabeled data to improve parsing performance
→ Use Penn Treebank (40K sents) as seed data

Page 42:

Co-training using a large labeled seed set

• Experiments using 40K Penn Treebank WSJ sentences as seed data for co-training did not produce a positive result

• Even after adding 260K sentences of unlabeled data, co-training did not significantly improve performance over the baseline

• However, we plan to do more experiments in the future which leverage more recent work on parse selection and the difference between the Collins-CFG and LTAG views

Page 43:

[Figure: Collins Parser learning curve, Penn Treebank sec. 2-21 in order (repeated from Page 31). Fitted curve f = a(1 - b x^c) with a = 0.892732, b = 0.574215, c = 0.20543; predicted performance = 0.88586; best observed = 0.886645. Upper bound = 89.3%; projected to a treebank of 80K sentences = 88.9%, of 400K = 89.2%.]

Page 44:

[Figure: LTAG parser learning curve, fscores for sentences <= 40 words. x-axis: Training sentences (up to 35,000); y-axis: F-measure, 0.83-0.92. Fitted curve f = a(1 - b x^c) with a = 0.915678, b = 0.31822, c = 0.104404 (first 6 points ignored); predicted performance = 0.886902; best observed = 0.887. Upper bound = 91.6%; projected to a treebank of 80K sentences = 89.3%, of 400K = 90.4%.]

Page 45:

Summary

• Use co-training to boost performance, when faced with small seed data
→ Co-training beats self-training with 500 sentence seed data
→ Different parse selection methods better for different parser views
→ Co-training still improves performance with 1K sentence seed data

• Use co-training to port parsers to new genres
→ Co-training improves performance significantly when porting from one genre (Brown) to another (WSJ)

Page 46:

Summary

• Use a large set of labeled data and use unlabeled data to improve parsing performance
→ Using 40K sentences of Penn Treebank as seed data showed no improvement over the baseline. Future work: improving LTAG performance

Page 47:

Summary of experiments done during the workshop

Seed Type            Seed Size   Improvement      added/cache   total    Upper Bound      Select
                                 start    end                   added    Gold    Oracle
CFG  WSJ             500         75       78      20/30         3000     85      80       all
CFG  WSJ             1000        79       80      20/30         3000     86      82       int
CFG  WSJ             all (40K)   88       88      1000/2000     260K     -       -        all
LTAG WSJ             500         87.5     89.5    20/30         3000     -       -        int
LTAG WSJ             1000        89       90      20/30         3000     -       -        int
LTAG WSJ             all (40K)   -        -       -             -        -       -        -
CFG  Brown           1000        75       78      20/30         3000     -       81       all
CFG  Brown           all (24K)   80.7     80.8    200/300       -        -       -        int
CFG  Brown+100 WSJ   1000        77       79      20/30         3000     -       82       int
CFG  Brown+100 WSJ   all (24K)   -        -       -             -        -       -        -

Page 48:

Collins runs 1-4

[Figure: x-axis: Iterations (0-120); y-axis: fscore, 74.5-78.5; curves: throw all (500), all+intersection (500), self-training (500), intersection (500).]

Page 49:

tagparser runs 1-4

[Figure: x-axis: Iterations (0-120); y-axis: fscore, 84.5-89.5; curves: throw all (500), all+intersection (500), self-training (500), intersection (500).]

Page 50:

Collins runs 5-8

[Figure: x-axis: Iterations (0-120); y-axis: fscore, 78.6-80; curves: throw all (1000), all+intersection (1000), self-training (1000), intersection (1000).]

Page 51:

tagparser runs 5-8

[Figure: x-axis: Iterations (0-120); y-axis: fscore, 88.6-90.4; curves: throw all (1000), all+intersection (1000), self-training (1000), intersection (1000).]

Page 52:

brown-wsj

[Figure: Brown 1k, WSJ 100, CFG results (co-trained with LTAG). x-axis: Co-training rounds (0-60); y-axis: F2, 76.6-78.6; curves: self, all, int.]

Page 53:

Experimental Setup – Evaluating the Collins-CFG parser

• Various starting seed human labeled data sets were used (WSJ 40K, 500, 1K sents, Brown 1K, 24K sents)

• Evaluation of the newly trained model at each iteration of co-training was done on the entire Section 00 of the Penn Treebank (1921 sentences)

• The fscore graphs we show are plotted for each iteration of the co-training algorithm

• We also show self-training results where the output of the Collins-CFG parser was used to bootstrap new labeled data

Page 54:

Experimental Setup – Evaluating the LTAG parser

• Various starting seed human labeled data sets were used (WSJ 500, 1K sentences)

• Evaluation of the newly trained model at each iteration of co-training was done on a 250 sentence subset (≤ 15 words) of the Section 00 of the Penn Treebank

• We also show self-training results where the output of the LTAG parser was used to bootstrap new labeled data

Page 55:

Summary: Collins-CFG co-training with LTAG

• Positive co-training effect has been obtained for the Collins-CFG parser using output of the LTAG parser

• Training on 500-sentence and 1000-sentence seed sets and selection techniques including intersection

• Full set selection works better for very small seed data: 500 sentences

Page 56:

Summary: LTAG co-training with Collins-CFG

• Positive co-training effect has been obtained for the LTAG parser using output of the Collins-CFG parser

• Training on 500-sentence and 1000-sentence seed sets and selection techniques including intersection

• Selection techniques like Intersection work better than full set selection

Page 57:

Co-training with Combinatory Categorial Grammar (CCG)

Steve Baker, Stephen Clark, Jay Crim, Julia Hockenmaier, Paul Ruhlen, Mark Steedman

August 2002

Page 58:

Outline

• Brief introduction to CCG

• Possible views for CCG co-training

• Co-training experiments

• Oracle experiments

Page 59:

Combinatory Categorial Grammar (CCG)

• CCG is a mildly context-sensitive, lexicalised grammar formalism

• CCG is based on categories

– atomic categories: S, N, NP, PP
– complex categories: (S\NP)/NP

• Categories are combined using a small number of combinatory rules

– example of function application: (S\NP)/NP NP → S\NP

• Combinatory rules allow unbounded dependencies to be captured for wide-coverage parsing (Clark et al., Hockenmaier and Steedman, ACL 2002)
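As an illustration of the application rule above, a minimal Python sketch that encodes complex categories as (result, slash, argument) triples; this is for exposition only, not the parsers' internal representation:

    S, NP = "S", "NP"
    TV = ((S, "\\", NP), "/", NP)          # (S\NP)/NP, e.g. a transitive verb

    def forward_apply(fn, arg):
        # Forward application:  X/Y  Y  =>  X
        if isinstance(fn, tuple) and fn[1] == "/" and fn[2] == arg:
            return fn[0]
        return None

    def backward_apply(arg, fn):
        # Backward application:  Y  X\Y  =>  X
        if isinstance(fn, tuple) and fn[1] == "\\" and fn[2] == arg:
            return fn[0]
        return None

    vp = forward_apply(TV, NP)             # (S\NP)/NP + NP  ->  S\NP
    assert backward_apply(NP, vp) == S     # NP + S\NP       ->  S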

Page 60:

A CCG derivation tree

Derivation for "There is no asbestos in our products" (indented tree):

S[dcl]
  NP
    N: There
  S[dcl]\NP
    S[dcl]\NP
      (S[dcl]\NP)/NP: is
      NP
        NP/N: no
        N: asbestos
    (S\NP)\(S\NP)
      ((S\NP)\(S\NP))/NP: in
      NP
        NP/N: our
        N: products

• CCG trades

– lexical category types (≈1,200 compared with ≈50 Penn Treebank pos-tags)
– for phrase-structure rules (≈3,000 binary rule instantiations against ≈12,000 n-ary rules for a treebank grammar)

Page 61:

Data and models for CCG parsing

• The corpus: CCGbank (Hockenmaier and Steedman, LREC 2002)
– translation of Penn Treebank phrase-structure trees to CCG derivations

• The parser: (Hockenmaier and Steedman, ACL 2002)
– top-down generative probability model similar to lexicalised PCFG
– state-of-the-art performance at (unlabelled) dependency recovery
– CCG Parseval performance not comparable with CFG
  ∗ binary CCG trees heavily penalised
– but use Parseval for co-training experiments
  ∗ consistent with other co-training experiments; relative difference important

Page 62:

Resolving Lexical Category Ambiguity: Supertagging

• Many words have many possible categories

• e.g.: the (39) man (4) is (93) an (12) executive (4) director (3), where each number is the word's count of possible lexical categories

• Prior frequency and contextual information can help resolve ambiguity

• Standard techniques for pos-tagging (maximum entropy models, HMMs) can be used to estimate word-category sequence probabilities

– Max-Ent super-tagger gives 90% accuracy (Clark, TAG+ 2002)
– use Brants' TnT tagger for co-training experiments (quick to train)

Page 63:

Two co-training views (Sarkar 2001)

[Derivation tree for "There is no asbestos in our products", as on Page 60.]

• generative probability of parse tree

• probability of lexical category sequence
– for parses with the same category sequence use tree probability

• Why not co-train across grammar formalisms with CCG?

Page 64:

A learning curve for the CCG parser

[Figure: CCG parser learning curve, sequential subsets of Penn Treebank. x-axis: Training sentences (up to 40,000); y-axis: F-measure, 0.72-0.86. Fitted curve f = a(1 - b x^c) with a = 0.842109, b = 0.874194, c = 0.302984; predicted performance = 0.812077; best observed = 0.8118. Upper bound = 84.2%; projected to a treebank of 80K sentences = 82.8%, of 400K = 84.1%.]

Page 65:

Co-training Experiments

Seed type   Seed size   added/cache   selection   evaluation
wsj         1,000       90/300        top-N       Section 00 wsj PTB
wsj         10,000      39/300
Brown       24,000      39/300
wsj         40,000      30m words

Page 66:

Parser scores with 1,000 sentences WSJ seed data

[Figure: Parseval performance of parser, 1K seed. x-axis: Number of Training Sentences (0-12,000); y-axis: Parseval F-score, 75.2-76.6; curves: parser_withSuperTagger_FScore, parser_SelfTraining_FScore.]

Page 67:

Supertagger with 1,000 sentences WSJ seed data

[Figure: supertagger accuracy, 1K WSJ seed. x-axis: 0-12,000; y-axis: 80.5-86; series: results.tnt.]

Page 68:

Parser as a supertagger with 1,000 WSJ seed data

[Figure: Performance of parser as supertagger, 1K seed. x-axis: Number of Training Sentences (0-12,000); y-axis: Supertagging accuracy, 87.4-88.4; curves: parser_withSuperTagger_SuperTag, parser_SelfTraining_SuperTag.]

Page 69:

Parser scores with 10,000 sentences WSJ seed data

[Figure: parser scores, 10K WSJ seed. x-axis: 0-5,000; y-axis: 80.8-81.6; series: results.juliadep.]

Page 70:

Supertagger with 10,000 sentences WSJ seed data

[Figure: supertagger accuracy, 10K WSJ seed. x-axis: 0-5,000; y-axis: 87.05-87.55; series: results.tnt.]

Page 71:

Parser as a supertagger with 10,000 WSJ seed

[Figure: Performance of parser as supertagger. x-axis: Number of iterations (0-140); y-axis: Supertagging accuracy, 91.8-92.2; series: parser_withSuperTagger_10K_Tagging.]

Page 72:

Summary of seed-WSJ experiments

• Supertagger performance improves significantly for 1K and 10K seed data

• Parser as a supertagger improves significantly for 1K and 10K seed data (performance better than supertagger itself)

• Parsing results inconclusive

– parsing view is currently not obtaining useful information from the supertagger view (for parsing)

Page 73:

Parser scores with Brown seed data

[Figure: parser scores, Brown seed. x-axis: 0-6,000; y-axis: 72-73.6; series: results.juliadep.]

Page 74:

Supertagger scores with Brown seed data

[Figure: supertagger scores, Brown seed. x-axis: 0-6,000; y-axis: 81-84.5; series: results.tnt.]

Page 75:

Parser as a supertagger with Brown seed data

[Figure: Performance of parser as supertagger, Brown seed. x-axis: Number of iterations (0-40); y-axis: Supertagging accuracy, 87.66-87.84; series: parser_withSuperTagger_Brown_Tagging.]

Page 76:

Summary of Brown-seed experiments

• Supertagger performance improves significantly

• Parser as a supertagger improves significantly

• Parsing results improve significantly

– parser is obtaining new information about the new genre from the supertagger view

Page 77:

Supertagger with lots of pre-parsed data

[Figure: supertagger accuracy with lots of pre-parsed data. x-axis: 0-35; y-axis: 88.6-90; four series plotted from "results".]

Page 78:

Summary of CCG co-training experiments

• Supertagger performance – and parser performance as a supertagger – increased significantly for all experiments (across all types and sizes of seed data)

• Parser performance increased for the Brown porting experiment

• Other parsing results were inconclusive; possible reasons:

– the 2 parsing views were too similar (consider Sarkar (2001))
– current set-up does not provide new categories for the lexicon

Page 79:

Summary of all CCG co-training experiments

Seed type    Seed size   Parser start/max   Supertagger start/max     added/cache   select
wsj (dist)   1,000       71.16 / 72.06      80.78 / 81.88             30/300        top 30
wsj (dist)   1,000       71.16 / 72.18      80.78 / 84.36             90/300        top 90
wsj (dist)   1,000       71.16 / 71.87      80.78 / 81.46             300/300       all
wsj          1,000       76.34 / 76.56      80.78 / 85.49             90/300        top 90
wsj          1,000       76.34 / 76.41      (dist) 71.16 / 71.38      90/300        top 90
wsj          1,000       76.34 / 76.51      selftrain                 90/300        top 90
wsj          10,000      81.21 / 81.80      87.07 / 87.51             39/300        top 39
wsj          10,000      80.47 / 80.80      87.07 / 87.54             39/300        intersect
brown        24,000      72.78 / 74.06      81.49 / 84.47             39/300        top 39

Page 80:

CCG Parseval Oracle Experiments

Steven Baker

Page 81:

CCG Parseval Oracle Experiments

• Use Parseval as an oracle to obtain a score for newly-parsed data

• Can use this measure to investigate differences between parse selection methods without the inaccuracy of the sentence score skewing the results

• Also useful in establishing realistic upper bounds of performance when using non-oracle scores to determine parse selection

Page 82:

Oracle Top-N / Difference Comparison, 10k Seed

[Figure: 10k Seed Data, Top-Scoring 30% Added. x-axis: 0-9,000; y-axis: F-score, 80.7-81.7; curves: Parseval Top N, Parseval Difference.]

Page 83:

Oracle Top-N / Difference Comparison, 10k Seed

[Figure: 10k Seed Data, Top-Scoring 10% Added. x-axis: 0-3,000; y-axis: F-score, 80.9-81.6; curves: Parseval Top N, Parseval Difference.]

Page 84:

10k Seed Training Set Results

• Top-N selection method results in increased performance over Difference selection

• Top-N method performs similarly using both yield rates, even though the higher yield rate means the selection method is less restrictive about what it adds back in

• Upon examination, with 10k initial training, almost 30% of new parses were perfect according to Parseval, so the quality of parses being added would not be changed as Top-N's yield is increased from 10% to 30%.

• Also suggests that at 10k, some improvement can be had when co-training with the right selection method.

Page 85:

Oracle Top-N / Difference Comparison, 1k Seed

[Figure: 1k Seed Data, Top-Scoring 15% Added. x-axis: 0-6,000; y-axis: F-score, 70.8-72.2; curves: Parseval Top N, Parseval Difference.]

Page 86:

1k Seed Training Set Results

• Difference selection method performs better than Top-N selection at 1k

• Shows that there is little room for improvement of current CCG parsers at 1k, even using oracle selection methods.

Page 87:

CCG Parseval Oracle Summary

• A selection method that works well for one set of views may not work well for another

– Top-N was the better method in the 10k experiments
– Difference was the better method in the 1k experiment

• These results are similar to Oracle selection results seen in the CFG/LTAG domain

• Only small improvements were seen in the CCG Oracle-selection experiments, which mirrors the non-Oracle results

Page 88:

Smoothing to help parsers co-train

Julia Hockenmaier and Mark Steedman

August 22, 2002

Page 89:

Graduate student proposal

• Investigate smoothing for statistical parsing with CCG

– Important general question
– Essential to allow CCG to benefit from co-training

• Student: Julia Hockenmaier, University of Edinburgh

• Supervisor: Prof. Mark Steedman, University of Edinburgh

Page 90:

Why smoothing is important

• Statistical parsing will always suffer from a sparse data problem

• Statistical CCG parsers suffer from a lexical coverage problem

• Smoothing: assign non-zero probabilities to unseen events (e.g. unseen lexical entries)

– What unseen events should get a non-zero probability?
– How do we compute this probability?

Page 91:

Smoothing in our current parser

• Dependency model: linear interpolation with baseline model, using the same technique as Collins '99 to estimate weights

• Smoothing of lexicalized rule probabilities: back off from head word, but not from lexical head category

• Smoothing of lexicon probabilities: use POS-tag information to generate new lexical entries
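The interpolation in the first bullet can be sketched as follows; the weight formula lambda = c / (c + 5u) is our assumption modelled on the Collins '99 recipe, with c the context count and u the number of distinct outcomes seen with that context:

    def interpolated_prob(count_joint, count_context, distinct_outcomes,
                          backoff_prob, k=5):
        # Linear interpolation of a specific (maximum-likelihood) estimate
        # with a backoff estimate, weighted by how well-attested the context is.
        if count_context == 0:
            return backoff_prob
        lam = count_context / (count_context + k * distinct_outcomes)
        ml_estimate = count_joint / count_context
        return lam * ml_estimate + (1 - lam) * backoff_prob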

Page 92:

Why smoothing is important for co-training

• Co-training especially beneficial with limited amounts of seed data
⇒ Smoothing is essential in this setting

• Collins–LTAG co-training works because:

– Both parsers have more levels of backoff, hence can deal with noisy input, and achieve reasonable performance with little seed data

– Both parsers provide each other with novel input

• With our present setup, co-training for CCG has limited effect:

– The two views cannot provide novel input such as new lexical categories or new rule instantiations

Page 93:

Smoothing (lexicalized) rule probabilities

• Treebank-style: P(S〈VBZ,buys〉 → NP VP)

• CCG-style: P(S[dcl]〈(S[dcl]\NP)/NP,buys〉 → NP S[dcl]\NP)

⇒ We cannot back off from the lexical category! (But might be able to back off to classes of lexical categories)
⇒ Can we use techniques like (Eisner 2001) or similarity-based clustering for CCG?

Page 94:

Smoothing lexicon probabilities

• In CCG, lexical categories determine the space of possible parses

• In our parsers, the lexicon consists of observed word-category pairs

• Even a lexicon extracted from the entire labelled training corpus is incomplete

• This is not so much a problem for unknown words, as for low-frequency observed words

Page 95:

Lexical rules

• Assume we have observed 3rd pers. sg buys with (S[dcl]\NP)/NP

• If we know English morphology, we also know the following entries:

– non-3rd-pers-sg buy: (S[dcl]\NP)/NP
– infinitival buy: (S[b]\NP)/NP
– present participle buying: (S[ng]\NP)/NP
– past participle bought: (S[pt]\NP)/NP
– passive bought: S[pass]\NP

• We can use a morphological analyzer to generate new word forms
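A hypothetical sketch of such a lexical rule for transitive verbs, with the morphological forms supplied by hand where a real system would consult an analyzer:

    def expand_transitive_verb(forms):
        # forms: dict mapping '3sg', 'base', 'ing', 'en' to word strings.
        return [
            (forms["3sg"],  "(S[dcl]\\NP)/NP"),   # buys
            (forms["base"], "(S[dcl]\\NP)/NP"),   # non-3rd-person-sg buy
            (forms["base"], "(S[b]\\NP)/NP"),     # infinitival buy
            (forms["ing"],  "(S[ng]\\NP)/NP"),    # present participle buying
            (forms["en"],   "(S[pt]\\NP)/NP"),    # past participle bought
            (forms["en"],   "S[pass]\\NP"),       # passive bought
        ]

    entries = expand_transitive_verb(
        {"3sg": "buys", "base": "buy", "ing": "buying", "en": "bought"})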

Page 96:

Smoothing for lexical rules

• How do we estimate probabilities for these new lexical entries?

• Conduct a comprehensive study similar to (Chen and Goodman, ’96)

Page 97:

The proposal

We propose to investigate:

• how we can make use of lexical rules in order to overcome the sparse data problem in the CCG lexicon

• which smoothing techniques are most effective to estimate the probabilities of these new entries

• how we can smooth the rule probabilities

Page 98:

Evaluating the effectiveness of smoothing

We propose to evaluate the effectiveness of smoothing:

• by measuring the effect on overall parse performance

• by measuring the effect on co-training

Page 99:

Possible payoffs

• Better CCG parsers

• Better understanding of smoothing for parsing with expressive grammars

• Allow CCG to benefit from co-training

Page 100:

Co-training with Re-rankers

Jeremiah Crim
Johns Hopkins University

Page 101:

Re-ranking vs. Parsing

• A re-ranker reorders the output of an n-best (probabilistic) parser based on features of the parse

• While parsers use local features to make decisions, re-rankers use features that can span the entire tree

• Instead of co-training parsers, co-train different re-rankers

Steedman et al. CLSP WS-2002 2


Which features?

Steedman et al. CLSP WS-2002 3


Motivation: Why re-rankers?

• Speed

– We need a large amount of unlabeled data to get an improvement
– But parsing is slow, and we must re-parse data multiple times
– If we are re-ranking, we only have to parse data once
– Output will be reordered many times, but this can be done very quickly

• Objective function

– The lower runtime of re-rankers allows us to explicitly maximize agreement between parses

– Measures such as intersection and difference, which attempt to approximate agreement, are not needed in this framework

Steedman et al. CLSP WS-2002 4


Motivation: Why re-rankers?

• Accuracy

– Re-rankers can improve the performance of existing parsers
– Collins ’00 reports a 13 percent reduction in error rate from re-ranking

• Task closer to classification

– A re-ranker can be seen as a binary classifier: either a parse is the best for a sentence or it isn’t

– This is the original domain that co-training was intended for

Steedman et al. CLSP WS-2002 5


Re-ranker 1: Log-linear model

• Similar to the re-ranker proposed by Collins in “Discriminative Reranking for Natural Language Parsing”, ICML ’00

• Implemented prior to the workshop by Miles Osborne
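The slides do not reproduce the model itself, but the general shape of a log-linear re-ranker is easy to sketch: score each candidate as exp(w · f(parse)) and normalize over the n-best list. The feature names below are invented for illustration, and the feature extraction is assumed to happen upstream:

```python
import math

def rerank_log_linear(nbest, weights):
    """nbest: list of (parse, feature_dict); weights: feature -> weight.
    Returns the n-best list reordered by the conditional probability
    exp(w . f) / Z, with Z normalizing over the n-best list."""
    scores = [sum(weights.get(f, 0.0) * v for f, v in feats.items())
              for _, feats in nbest]
    z = sum(math.exp(s) for s in scores)
    ranked = sorted(zip(nbest, scores), key=lambda x: -x[1])
    return [(parse, math.exp(s) / z) for (parse, _), s in ranked]

# Toy example with two hypothetical features:
w = {"lex:buys": 0.7, "rule:NP->NP PP": -0.3}
nbest = [("tree A", {"lex:buys": 1.0}),
         ("tree B", {"rule:NP->NP PP": 1.0})]
print(rerank_log_linear(nbest, w))   # tree A comes out on top
```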

Steedman et al. CLSP WS-2002 6


Re-ranker 2: Linear perceptron

• Learn a mapping from the vector of features for a given parse to the parseval score for that parse.

• Perceptron Algorithm:

1. Assign starting values to the weights for each feature and use them to predict the parseval score for the first training parse

2. Update the weights based on how far off the guess was
3. Repeat step 2 for all training examples

• Use these weights to re-rank the n-best output of Collins’ parser (sketched below)
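A minimal sketch of the procedure above, assuming each training parse is represented as a feature dictionary paired with its parseval score. The learning rate and epoch count are illustrative; the update is the delta-rule-style regression the slide describes, not necessarily the exact workshop implementation:

```python
def train_perceptron(examples, epochs=10, lr=0.01):
    """examples: list of (feature_dict, parseval_score).
    Nudges the weights so that w . f(parse) tracks the parseval score."""
    w = {}
    for _ in range(epochs):
        for feats, target in examples:
            guess = sum(w.get(f, 0.0) * v for f, v in feats.items())
            error = target - guess          # how far off the guess was
            for f, v in feats.items():      # update weights accordingly
                w[f] = w.get(f, 0.0) + lr * error * v
    return w

def rerank(nbest, w):
    """Reorder (parse, feature_dict) pairs by predicted parseval score."""
    return sorted(nbest, key=lambda pf: -sum(w.get(f, 0.0) * v
                                             for f, v in pf[1].items()))
```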

Steedman et al. CLSP WS-2002 7


Differences between Re-rankers

• Log linear: learns weights for features that predict the best parse for a sentence. Can teach the perceptron about features that predict a good parse for a sentence, regardless of parseval score.

• Linear perceptron: learns weights for features that predict an expected parseval score for a parse. Can teach the log-linear model about features that will lead to a high parseval score, no matter what the rank of the parse.

Steedman et al. CLSP WS-2002 8


Maximizing Objective Function

• Maximize agreement between re-rankers

1. Parse the next sentences into the cache.

2. Re-rank the cache using one model and partition it into sections.

3. The other model searches for the best partition to retrain on.
4. Retrain on this partition.

5. Switch the roles of the models. Repeat.
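A schematic of this loop in Python. `parse_nbest`, `retrain`, `agreement`, and the models’ `confidence` method are caller-supplied placeholders standing in for the n-best parser, the re-ranker trainers, and the agreement measure; the cache and partition sizes are likewise illustrative:

```python
def cotrain_rerankers(model_a, model_b, sentences, parse_nbest, retrain,
                      agreement, iters=10, cache_size=2500, n_parts=5):
    """Agreement-maximizing co-training between two re-rankers."""
    stream = iter(sentences)
    teacher, student = model_a, model_b
    for _ in range(iters):
        # 1. Parse the next sentences into the cache (n-best lists).
        cache = [parse_nbest(next(stream)) for _ in range(cache_size)]
        # 2. Teacher re-ranks the cache and partitions it into sections.
        ranked = sorted(cache, key=teacher.confidence, reverse=True)
        step = len(ranked) // n_parts
        parts = [ranked[i * step:(i + 1) * step] for i in range(n_parts)]
        # 3. Student searches for the partition that, once retrained on,
        #    maximizes agreement with the teacher.
        best = max(parts,
                   key=lambda p: agreement(retrain(student, p), teacher))
        # 4. Retrain the student on that partition.
        student = retrain(student, best)
        # 5. Switch roles and repeat.
        teacher, student = student, teacher
    return teacher, student
```

Note that step 3 retrains once per candidate partition, which is affordable precisely because re-rankers are cheap to retrain relative to re-parsing.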

Steedman et al. CLSP WS-2002 9


Experiments III

• Initial small-scale experiments:

– Train Collins parser on first 1000 sentences of Penn Treebank section 2
– Parse remainder of PTB 02-21 using this parser
– Train initial re-rankers on first 1000 sentences
– Co-train re-rankers using remainder of parsed data

Steedman et al. CLSP WS-2002 10


Baseline and Upper Bound

• Baseline parseval score (section 0):

– Random choice from top n parses: 77.82
– Parser’s initial choice: 82.28
– Log linear model before co-training: 82.07
– Linear perceptron before co-training: 82.36

• Upper bound parseval score (section 0):

– Gold standard: always pick the best from top n: 89.50
– Oracle: re-rankers trained on additional “clean” data:
∗ Log linear model, +10k clean: 83.61
∗ Linear perceptron, +10k clean: 82.76

Steedman et al. CLSP WS-2002 11


Results III

• Replace only one partition of the cache each iteration

• The log linear re-ranker improved from 82.07 to 82.41

– 27 percent of the error-rate reduction achieved by the clean-data oracle

• No improvement for linear perceptron

Steedman et al. CLSP WS-2002 12


Summary III: Co-Training Re-rankers

Seed type   Seed size   F0      Fn      n+/ncache per iter’n   n total added   Gold UB   Oracle UB
WSJ         1000        82.07   82.41   500/2500               10000           89.5      83.61

Steedman et al. CLSP WS-2002 13


Post-WS02 Project Proposal

Co-training with Re-rankers

Jeremiah Crim, Johns Hopkins University
Miles Osborne, University of Edinburgh

Steedman et al. CLSP WS-2002 14


Post-WS02 Project - Why re-rank?

• Faster than parsing

• Can maximize agreement

• Greater accuracy

• Task closer to classification

Steedman et al. CLSP WS-2002 15


Post-WS02 Project - More re-rankers

• We have only co-trained two re-rankers: Log Linear and Linear Perceptron

• Other re-rankers to try: Winnow, Voted Perceptron (Collins), Naive Bayes, Decision Tree, etc.

• Many have already been implemented, during the workshop (Winnow, Perceptron) or before, and the co-training infrastructure is in place

Steedman et al. CLSP WS-2002 16


Post-WS02 Project - Other feature sets

• We’ve only worked with one feature set so far

• There are many other ways that features could be assigned to parses

– e.g. use all nonterminal subtrees (see Collins and Duffy ’02) or lexical nonterminals for subtrees of depth 1

– features from other parsers?

Steedman et al. CLSP WS-2002 17


Post-WS02 Project - Maximizing agreement

• We use Spearman’s rank correlation to measure similarity between the outputs of the re-rankers (see the sketch after this list)

– This treats all ranks as equally important
– Explore non-uniform weighting extensions
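For concreteness, the standard (unweighted) Spearman formula applied to two rankings of the same n-best list, with tie handling omitted; a non-uniform variant would reweight the per-rank differences:

```python
def spearman(rank_a, rank_b):
    """Spearman's rank correlation between two rankings of the same
    n parses, each a list of ranks 1..n (ties ignored):
    rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1))."""
    n = len(rank_a)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1.0 - 6.0 * d2 / (n * (n ** 2 - 1))

print(spearman([1, 2, 3, 4], [1, 2, 3, 4]))   # perfect agreement: 1.0
print(spearman([1, 2, 3, 4], [4, 3, 2, 1]))   # reversed order: -1.0
```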

Steedman et al. CLSP WS-2002 18


Post-WS02 Project - Maximizing agreement

• Exploring a huge search space... must intelligently choose what to retrain on

• Current method greedily optimizes agreement one re-ranker at a time

• But a slightly suboptimal set for one re-ranker may allow a larger improvement in the other

• Explore alternative search methods

– e.g. best-first search, etc.

Steedman et al. CLSP WS-2002 19


Post-WS02 Project - Possible Payoffs

• Beat section 23?

(improving state-of-the-art parsing results)

– After learning the best parameters for co-training on small sets, train the parser and re-rankers on all of the PTB

– Retrain on up to 500M additional words (NANC); a large portion was already pre-processed during the workshop

– Better chance of improvement than continuing to tweak parsers that have already almost converged

– Room for improvement:
∗ Current best parser: 89.7
∗ Oracle that picks the best parse from the top 50: 95+

Steedman et al. CLSP WS-2002 20


Summary

August 22, 2002

Summary August 22, 2002


Summary 1

• Co-training worked in a variety of situations.

• Main results:

– Co-training is beneficial when the seed data is limited.
– Co-training is less helpful (but not harmful) when the seed data is plentiful.
– Novel parse selection techniques were developed, and were crucial for co-training with larger amounts of seed data.

Summary 1


Workshop Goals

• Explore how co-training effects depend on the size of the labelled seed set.

• Explore effectiveness of co-training for porting parsers to new genres.

• Explore effectiveness of co-training on parsers seeded with all available labelled material.

Summary 2


Small labelled seed sets

Co-training enhances performance:

• For CFG/LTAG:

– CFG always improves.
– LTAG always improves.

• For Supertagger/CCG:

– Supertagger – and the parser used as a supertagger – always improves (not yet converged).

– CCG parsers inconclusive.

Co-training allows rapid development of statistical parsers for languages with limited available resources.

Summary 3


Porting parsers to new genres

Co-training can help porting:

• For CFG/LTAG:

– CFG always improves.
– LTAG always improves.

• For Supertagger/CCG:

– Supertagger always improves.
– CCG parsers improve.

Co-training allows more effective porting to novel domains.

Summary 4


Improving state-of-the-art parsing performance

Not there yet:

• The Collins-CFG parser has (nearly) converged already.

• CCG requires enhanced smoothing for co-training to be effective.

Solutions:

• Collins-CFG/LTAG re-ranking experiments have started.

• CCG smoothing will be implemented (as described by Julia).

Summary 5


Future work 1

Need for analysis:

• Workshop has been largely experimental.

• Understand the relationship between view independence and parser configurations.

• Understand the link between labelled seed set size and co-training.

Summary 6


Future work 2

• For Collins-CFG/LTAG:

– Better scoring functions for parse selection.
– Large-scale experiments have started.
– Develop re-rankers.

• For CCG:

– Extend statistical CCG modelling such that it can benefit from co-training.

– Do large-scale experiments.

Summary 7


Conclusion

• This is the largest experimental study to date on the use of unlabelled data for improving parser performance.

• Co-training enhances performance for parsers and taggers trained on small (500-10,000 sentences) amounts of labeled data.

• Co-training can be used for porting parsers trained on one genre to parse another, without any new human-labeled data at all, improving on the state of the art for this task.

• Even tiny amounts of human-labelled data for the target genre enhance porting via co-training.

• New methods for parse selection have been developed, and they play a crucial role.

Summary 8