Inducing Structure for Perception
Slav Petrov
Advisors: Dan Klein, Jitendra Malik
Collaborators: L. Barrett, R. Thibaux, A. Faria, A. Pauls, P. Liang, A. Berg
a.k.a. Slav's split&merge Hammer


TRANSCRIPT

  • Inducing Structure for Perception. Slav Petrov.

    Advisors: Dan Klein, Jitendra Malik. Collaborators: L. Barrett, R. Thibaux, A. Faria, A. Pauls, P. Liang, A. Berg. a.k.a. Slav's split&merge Hammer.

  • The Main Idea: a complex underlying process produces an observation ("He was right."); true structure vs. manually specified structure vs. MLE structure.

  • The Main Idea: a complex underlying process produces an observation ("He was right."); manually specified structure → automatically refined structure (via EM).

  • Why Structure? A bag-of-words model ("the the the food cat dog ate and") or a bag-of-characters model ("t e c a e h t g f a o d o o d n h e t d a") is not very informative.

  • Structure is important

  • Syntactic Ambiguity: Last night I shot an elephant in my pajamas.

  • Visual Ambiguity: Old or young?

  • Three Peaks?

  • No, One Mountain!

  • Three Domains

  • Timeline

  • Syntax: Language Modeling, Split & Merge Learning, Syntactic Machine Translation, Coarse-to-Fine Inference, Non-parametric Bayesian Learning, Generative vs. Conditional Learning.

  • Learning Accurate, Compact and Interpretable Tree Annotation. Slav Petrov, Leon Barrett, Romain Thibaux, Dan Klein.

  • Motivation (Syntax). Task: parse "He was right." Why? Information extraction, syntactic machine translation.

  • Treebank Parsing

  • Non-Independence: independence assumptions are often too strong.

    Relative frequency of the most common NP expansions (all NPs vs. subject and object NPs):

    Expansion   ALL     SUBJECT   OBJECT
    NP PP       11.3%   9.3%      23.0%
    DT NN       9.3%    8.8%      6.7%
    PRP         5.5%    20.5%     3.7%

  • The Game of Designing a Grammar: annotation refines base treebank symbols to improve the statistical fit of the grammar. Parent annotation [Johnson 98].

  • The Game of Designing a Grammar: annotation refines base treebank symbols to improve the statistical fit of the grammar. Parent annotation [Johnson 98]. Head lexicalization [Collins 99, Charniak 00].

  • The Game of Designing a Grammar: annotation refines base treebank symbols to improve the statistical fit of the grammar. Parent annotation [Johnson 98]. Head lexicalization [Collins 99, Charniak 00]. Automatic clustering?

  • Learning Latent Annotations: EM algorithm

    Brackets are known; base categories are known; only the subcategories are induced. Just like forward-backward for HMMs.

  • Inside/Outside Scores. Inside score of A_x over a span: the probability of the words inside the span given A_x. Outside score of A_x: the probability of the words outside the span together with A_x above it.
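    To make these quantities concrete, here is a minimal, hypothetical sketch (not code from the talk) of the two passes over a single fixed binary parse tree with latent subcategories; the toy grammar, the Node class, and all probability values are invented for illustration.

      import numpy as np

      # Toy latent-variable grammar: each observed category A is refined into
      # n_sub subcategories A_1..A_n.  Binary rule scores are indexed by
      # [parent_sub, left_sub, right_sub]; lexical scores by [sub].
      n_sub = 2
      binary = {("S", "NP", "VP"): np.full((n_sub, n_sub, n_sub), 1.0 / (n_sub * n_sub))}
      lexical = {("NP", "He"): np.array([0.6, 0.4]),
                 ("VP", "left"): np.array([0.5, 0.5])}

      class Node:
          def __init__(self, label, children=(), word=None):
              self.label, self.children, self.word = label, children, word
              self.inside = self.outside = None   # vectors over subcategories

      def inside(node):
          # Bottom-up: inside[x] = P(words below node | node is label_x).
          if node.word is not None:
              node.inside = lexical[(node.label, node.word)]
          else:
              left, right = node.children
              inside(left); inside(right)
              rule = binary[(node.label, left.label, right.label)]
              node.inside = np.einsum("xyz,y,z->x", rule, left.inside, right.inside)
          return node.inside

      def outside(node, score):
          # Top-down: outside[x] = P(rest of the tree, node is label_x).
          node.outside = score
          if node.word is None:
              left, right = node.children
              rule = binary[(node.label, left.label, right.label)]
              outside(left, np.einsum("xyz,x,z->y", rule, score, right.inside))
              outside(right, np.einsum("xyz,x,y->z", rule, score, left.inside))

      tree = Node("S", children=(Node("NP", word="He"), Node("VP", word="left")))
      inside(tree)
      outside(tree, np.ones(n_sub))   # root outside score
      # Posterior over subcategories at a node: inside * outside / tree probability.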

  • Learning Latent Annotations (Details)

    E-Step: compute posterior probabilities of the refined rules at each node of the observed treebank trees.

    M-Step: re-estimate the rule probabilities from the expected rule counts (a reconstruction of both steps follows).
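    The equations on this slide are not recoverable from the transcript; the block below is a reconstruction in the spirit of Matsuzaki et al. (2005) and Petrov et al. (2006), using my own notation (beta for rule probabilities, P_IN/P_OUT for inside/outside scores), not a verbatim copy of the slide.

      % E-step: posterior that the refined rule A_x -> B_y C_z is used at a
      % given node of the observed treebank tree T with yield w:
      \[
        P(A_x \to B_y C_z \mid w, T) \;\propto\;
        P_{\mathrm{OUT}}(A_x)\,\beta(A_x \to B_y C_z)\,
        P_{\mathrm{IN}}(B_y)\,P_{\mathrm{IN}}(C_z)
      \]
      % M-step: relative frequency of the expected rule counts from the E-step:
      \[
        \beta(A_x \to B_y C_z) \;=\;
        \frac{\mathbb{E}[\#(A_x \to B_y C_z)]}
             {\sum_{y',z'} \mathbb{E}[\#(A_x \to B_{y'} C_{z'})]}
      \]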

  • Overview: hierarchical training, adaptive splitting, parameter smoothing.

  • Refinement of the DT tag

  • Hierarchical refinement of the DT tag

  • Hierarchical Estimation Results

    Model                  F1
    Baseline               87.3
    Hierarchical Training  88.4

    [Chart: parsing accuracy (F1) vs. total number of grammar symbols; curves for flat training, hierarchical training, 50% merging, and 50% merging with smoothing.]

  • Refinement of the ',' tag: splitting all categories the same amount is wasteful.

  • The DT tag revisited

  • Adaptive Splitting: we want to split complex categories more. Idea: split everything, then roll back the splits that were least useful.

  • Adaptive Splitting: evaluate the loss in likelihood from removing each split, i.e. the ratio of the data likelihood with the split reversed to the data likelihood with the split. There is no loss in accuracy when 50% of the splits are reversed.

  • Adaptive Splitting (Details)True data likelihood:

    Approximate likelihood with split at n reversed:

    Approximate loss in likelihood:
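    The slide's equations are not recoverable from this transcript; the block below reconstructs the merge criterion roughly along the lines of Petrov et al. (2006), using my own notation (P_IN/P_OUT for inside/outside scores, p_1/p_2 for the relative frequencies of the two subcategories), so treat it as an approximation of what the slide showed.

      % Merging the subcategories A_1, A_2 of A at a single node n:
      \[
        P_{\mathrm{IN}}(n, A) = p_1\,P_{\mathrm{IN}}(n, A_1) + p_2\,P_{\mathrm{IN}}(n, A_2),
        \qquad
        P_{\mathrm{OUT}}(n, A) = P_{\mathrm{OUT}}(n, A_1) + P_{\mathrm{OUT}}(n, A_2).
      \]
      % Approximate loss from reversing the split: the product, over all
      % occurrences n of A, of the merged likelihood relative to the original:
      \[
        \Delta(A_1, A_2) \;\approx\; \prod_{n}
        \frac{P^{\text{merged}}_{n}(w)}{P(w)}
      \]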

  • Adaptive Splitting Results

    Model              F1
    Previous           88.4
    With 50% Merging   89.5

  • Number of Phrasal Subcategories

    [Chart: number of subcategories allocated to each phrasal category.]
    NP 37, VP 32, PP 28, ADVP 22, S 21, ADJP 19, SBAR 15, QP 9, WHNP 5, PRN 4, NX 4,
    SINV 3, PRT 2, WHPP 2, SQ 2, CONJP 2, FRAG 2, NAC 2, UCP 2, WHADVP 2,
    INTJ 1, SBARQ 1, RRC 1, WHADJP 1, X 1, ROOT 1, LST 1

  • Number of Phrasal Subcategories (highlighting PP, VP, NP)

    [Same chart as above.]

  • Number of Phrasal Subcategories (highlighting X, NAC)

    [Same chart as above.]

  • Number of Lexical Subcategories (highlighting TO, ',', POS)

    [Chart: number of subcategories allocated to each part-of-speech tag.]
    NNP 62, JJ 58, NNS 57, NN 56, VBN 49, RB 47, VBG 40, VB 37, VBD 36, CD 32, IN 27,
    VBZ 25, VBP 19, DT 17, NNPS 11, CC 7, JJR 5, JJS 5, : 5, PRP 4, PRP$ 4, MD 3, RBR 3,
    WP 2, POS 2, PDT 2, WRB 2, -LRB- 2, . 2, EX 2, WP$ 2, WDT 2, -RRB- 2,
    '' 1, FW 1, RBS 1, TO 1, $ 1, UH 1, , 1, `` 1, SYM 1, RP 1, LS 1, # 1

  • Number of Lexical Subcategories (highlighting IN, DT, RB, VBx)

    [Same chart as above.]

  • Number of Lexical Subcategories (highlighting NN, NNS, NNP, JJ)

    [Same chart as above.]

  • Smoothing: heavy splitting can lead to overfitting.

    Idea: Smoothing allows us to pool statistics

  • Linear Smoothing
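    The smoothing formula itself is missing from the transcript; a plausible reconstruction of the linear interpolation described in the accompanying notes (each subcategory's rewrite probability smoothed towards the mean across the subcategories of its parent, with smoothing weight alpha) is:

      \[
        \beta'(A_x \to B_y\,C_z) \;=\; (1-\alpha)\,\beta(A_x \to B_y\,C_z) + \alpha\,\bar{\beta},
        \qquad
        \bar{\beta} \;=\; \frac{1}{n}\sum_{x'} \beta(A_{x'} \to B_y\,C_z)
      \]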

  • Result Overview

    Model            F1
    Previous         89.5
    With Smoothing   90.7

  • Linguistic Candy

    Proper nouns (NNP):
      NNP-14: Oct., Nov., Sept.      NNP-12: John, Robert, James   NNP-2: J., E., L.
      NNP-1:  Bush, Noriega, Peters  NNP-15: New, San, Wall        NNP-3: York, Francisco, Street

    Personal pronouns (PRP):
      PRP-0: It, He, I    PRP-1: it, he, they    PRP-2: it, them, him

  • Linguistic Candy

    Relative adverbs (RBR):
      RBR-0: further, lower, higher    RBR-1: more, less, More    RBR-2: earlier, Earlier, later

    Cardinal numbers (CD):
      CD-7: one, two, Three    CD-4: 1989, 1990, 1988    CD-11: million, billion, trillion
      CD-0: 1, 50, 100         CD-3: 1, 30, 31           CD-9: 78, 58, 34

  • Nonparametric PCFGs using Dirichlet Processes. Percy Liang, Slav Petrov, Dan Klein and Michael Jordan.

  • Improved Inference for Unlexicalized Parsing. Slav Petrov and Dan Klein.

  • 1621 min (exhaustive parsing of the 1600-sentence development set with the refined grammar)

  • Coarse-to-Fine Parsing [Goodman 97, Charniak & Johnson 05]

  • Prune? For each chart item X[i, j], compute its posterior probability under the coarse grammar, then keep only the surviving items in the refined pass. E.g., consider the span 5 to 12 with coarse candidates QP, NP, VP: anything whose posterior falls below the threshold is pruned (see the sketch below).
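    Here is a minimal sketch of the pruning rule, under assumed data structures (log-space inside/outside tables keyed by (symbol, i, j)); it illustrates the idea and is not the Berkeley Parser's actual code.

      import math

      def prune_chart(inside, outside, sentence_logprob, threshold=1e-5):
          # Keep a coarse chart item X[i, j] only if its posterior
          # inside * outside / P(sentence) clears the threshold; all refined
          # subcategories of a pruned coarse symbol are skipped in the next pass.
          keep = set()
          for (sym, i, j), log_in in inside.items():
              log_post = log_in + outside[(sym, i, j)] - sentence_logprob
              if math.exp(log_post) >= threshold:
                  keep.add((sym, i, j))
          return keep

      # Hypothetical usage for the span (5, 12) from the slide:
      #   keep = prune_chart(coarse_inside, coarse_outside, coarse_logprob)
      #   ("QP", 5, 12) not in keep  ->  drop every QP_x item over (5, 12)
      #   from the refined chart.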

  • 1621 min → 111 min (no search error)

  • Hierarchical Pruning: consider again the span 5 to 12.

    coarse:          QP   NP   VP
    split in two:    QP1  QP2  NP1  NP2  VP1  VP2
    split in four:   QP1  QP2  QP3  QP4  NP1  NP2  NP3  NP4  VP1  VP2  VP3  VP4
    split in eight:  ...

  • Intermediate Grammars: X-Bar = G0, G1, G2, ..., G = the final refined grammar.

  • 1621 min → 111 min → 35 min (no search error)

  • State Drift (DT tag)

  • Projected Grammars: intermediate grammars obtained by projecting the final grammar G back down towards X-Bar = G0.

  • Estimating Projected Grammars: Nonterminals? The nonterminals in G (NP0, NP1, VP0, VP1, S0, S1) are mapped to the nonterminals of the projected grammar (NP, VP, S).

  • Estimating Projected Grammars: Rules?

    S1 → NP1 VP1  0.20    S1 → NP1 VP2  0.12    S1 → NP2 VP1  0.02    S1 → NP2 VP2  0.03
    S2 → NP1 VP1  0.11    S2 → NP1 VP2  0.05    S2 → NP2 VP1  0.08    S2 → NP2 VP2  0.12

    All of these project to S → NP VP.

  • Estimating Projected Grammars [Corazza & Satta 06]: 0.56

  • Calculating Expectations

    Nonterminals: c_k(X) is the expected count of X up to depth k; the iteration converges within 25 iterations (a few seconds).

    Rules: their projected probabilities are computed from these expected counts (see the sketch below).
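    A small sketch of both computations on this slide, under assumed data structures (a refined rule list with probabilities, nonterminal children only, and a designated ROOT symbol); the function and variable names are mine.

      from collections import defaultdict

      def expected_counts(rules, root="ROOT", iterations=25):
          # c(X): expected number of occurrences of nonterminal X in a tree drawn
          # from the refined PCFG.  `rules` is a list of (parent, children, prob);
          # each pass extends the expectation by one more level of depth.
          counts = defaultdict(float)
          counts[root] = 1.0
          for _ in range(iterations):
              new = defaultdict(float)
              new[root] = 1.0
              for parent, children, prob in rules:
                  for child in children:
                      new[child] += counts[parent] * prob
              counts = new
          return counts

      def project_rules(rules, counts, proj):
          # Projected rule probability = expected-count-weighted average of the
          # refined rules that map onto it (cf. Corazza & Satta 06).  `proj`
          # maps every refined symbol, e.g. "NP1", to its coarse ancestor "NP".
          num, den = defaultdict(float), defaultdict(float)
          for sym, c in counts.items():
              den[proj[sym]] += c
          for parent, children, prob in rules:
              coarse = (proj[parent], tuple(proj[c] for c in children))
              num[coarse] += counts[parent] * prob
          return {rule: p / den[rule[0]] for rule, p in num.items() if den[rule[0]] > 0}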

  • 1621 min → 111 min → 35 min → 15 min (no search error)

  • Parsing times for each pass, from X-Bar = G0 up to G.

  • Bracket Posteriors (after G0)

  • Bracket Posteriors (after G1)

  • Bracket Posteriors (Movie / Final Chart)

  • Bracket Posteriors (Best Tree)

  • Parse Selection: computing the most likely unsplit tree is NP-hard. Options: settle for the best derivation; rerank an n-best list; use an alternative objective function.

  • Final Results (Efficiency)

    Berkeley Parser: 15 min, 91.2 F-score, implemented in Java.

    Charniak & Johnson '05 parser: 19 min, 90.7 F-score, implemented in C.

  • Final Results (Accuracy)

                                              ≤40 words F1   all F1
    ENG  Charniak & Johnson 05 (generative)   90.1           89.6
         This Work                            90.6           90.1
    GER  Dubey 05                             76.3           -
         This Work                            80.8           80.1
    CHN  Chiang et al. 02                     80.0           76.6
         This Work                            86.3           83.4

  • Conclusions (Syntax)

    Split & Merge Learning: hierarchical training, adaptive splitting, parameter smoothing.

    Hierarchical Coarse-to-Fine Inference: projections, marginalization.

    Multi-lingual unlexicalized parsing.

  • Generative vs. Discriminative

    Conditional Estimation: L-BFGS, Iterative Scaling.

    Conditional Structure: alternative merging criterion.

  • How much supervision?

  • Syntactic Machine Translation (collaboration with ISI/USC): use parse trees; use annotated parse trees.

    Learn split synchronous grammars

  • Speech: Speech Synthesis, Split & Merge Learning, Coarse-to-Fine Decoding, Combined Generative + Conditional Learning.

  • Learning Structured Models for Phone Recognition. Slav Petrov, Adam Pauls, Dan Klein.

  • Motivation (Speech)

  • Traditional Models: "dad", Start → End, with a Begin-Middle-End structure for each phone (see the sketch below).
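    For reference, a minimal sketch of that baseline topology (the self-loop probability is an arbitrary illustrative value, not one from the talk):

      import numpy as np

      def phone_hmm(n_substates=3, self_loop=0.6):
          # Left-to-right transition matrix over the begin/middle/end substates
          # of one phone, plus an absorbing exit state: each substate either
          # loops or advances to the next one.
          n = n_substates + 1
          T = np.zeros((n, n))
          for s in range(n_substates):
              T[s, s] = self_loop
              T[s, s + 1] = 1.0 - self_loop
          T[n_substates, n_substates] = 1.0
          return T

      print(phone_hmm())   # the begin-middle-end backbone for, e.g., /d/ in "dad"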

  • Model Overview: Traditional vs. Our Model.

  • Differences to Grammars

  • Refinement of the ih-phone

  • Inference

    Coarse-to-Fine

    Variational Approximation

  • Phone Classification Results

    Method                                      Error Rate
    GMM Baseline (Sha and Saul, 2006)           26.0%
    HMM Baseline (Gunawardana et al., 2005)     25.1%
    SVM (Clarkson and Moreno, 1999)             22.4%
    Hidden CRF (Gunawardana et al., 2005)       21.7%
    This Paper                                  21.4%
    Large Margin GMM (Sha and Saul, 2006)       21.1%

  • Phone Recognition Results

    Method                                                        Error Rate
    State-Tied Triphone HMM (HTK) (Young and Woodland, 1994)      27.1%
    Gender Dependent Triphone HMM (Lamel and Gauvain, 1993)       27.1%
    This Paper                                                    26.1%
    Bayesian Triphone HMM (Ming and Smith, 1998)                  25.6%
    Heterogeneous Classifiers (Halberstadt and Glass, 1998)       24.4%

  • Confusion Matrix

  • How much supervision?

    Hand-aligned: exact phone boundaries are known.

    Automatically aligned: only the sequence of phones is known.

  • Generative + Conditional Learning: learn the structure generatively, estimate the Gaussians conditionally. Collaboration with Fei Sha.

  • Speech Synthesis. Acoustic phone model: generative, accurate, models phone-internal structure well.

    Use it for speech synthesis!

  • Large Vocabulary ASR: ASR system = acoustic model + decoder.

    Coarse-to-Fine Decoder: Subphone → Phone → Syllable → Word → Bigram.

  • Scenes: Split & Merge Learning, Decoding.

  • Motivation (Scenes): Seascape

  • Motivation (Scenes)

  • Learning

    Oversegment the image

    Extract vertical stripes

    Extract features

    Train HMMs

  • Inference

    Decode stripes

    Enforce horizontal consistency
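    A rough, hypothetical sketch of this inference step (invented label set, toy scores, and a crude majority-vote consistency pass); it illustrates one way the two bullets could be implemented, not the system actually built.

      import numpy as np

      LABELS = ["sky", "water", "sand"]        # hypothetical scene region labels

      def viterbi(emission_logp, transition_logp):
          # Decode one vertical stripe.  emission_logp: (n_segments, n_labels),
          # transition_logp: (n_labels, n_labels); returns a label index per segment.
          n_seg, _ = emission_logp.shape
          score = emission_logp[0].copy()
          back = np.zeros(emission_logp.shape, dtype=int)
          for t in range(1, n_seg):
              cand = score[:, None] + transition_logp   # previous label x next label
              back[t] = cand.argmax(axis=0)
              score = cand.max(axis=0) + emission_logp[t]
          path = [int(score.argmax())]
          for t in range(n_seg - 1, 0, -1):
              path.append(int(back[t, path[-1]]))
          return path[::-1]

      def smooth_horizontally(stripe_labels):
          # Crude horizontal-consistency pass: majority vote over adjacent stripes
          # (assumes every stripe has the same number of segments).
          out = []
          for i in range(len(stripe_labels)):
              window = stripe_labels[max(0, i - 1): i + 2]
              out.append([max(set(col), key=col.count) for col in zip(*window)])
          return out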

  • Alternative Approach: Conditional Random Fields

    Pro: vertical and horizontal dependencies are learnt; inference is more natural.

    Contra: computationally more expensive.

  • Timeline

  • Results so far

    State-of-the-art parser for different languages: automatically learnt, simple and compact, fast and accurate, available for download.

    Phone recognizer: automatically learnt, competitive performance, a good foundation for a speech recognizer.

  • Proposed Deliverables: syntax parser, speech recognizer, speech synthesizer, syntactic machine translation, scene recognizer.

  • Thank You!

    To make this less abstract, here is an example. There is some complex real-world process that produces an observable output. We would like to model and understand this process. Typically we ask an expert to build a model of the structure of the process. However, this model will be at a level of abstraction that is suitable for humans but far too coarse for a computer to accurately model the process, so taking the MLE from the collection of labeled examples will result in a poor model. What people often do instead is manually enrich the structure by imposing additional constraints that have some scientific motivation. In contrast, we will take the structure that was specified by the human and enrich it automatically. We will do this by refining the model with latent variables and using EM to learn the values of those latent variables. My talk will be about ways to guide EM to a good local maximum, and then about how to exploit the structure of the model at inference time, once the model has been learnt.

    But why should we care about structure at all? If we threw everything into a bag and scrambled it around, then a bag-of-pixels model or a bag-of-characters model would look like this. Not very informative. Of course we would not do that, but would rather keep at least the words or objects separate. This gives you a pretty good idea of what is going on: something about dogs and cats and food. Presumably you have formed an idea like this in your head: the dog and the cat ate the food. While this is the most probable reading, one can also order the words and objects differently.

    Many of you have played the game of designing a grammar. The main problem one faces is that the categories in the treebank are too general. Consider, for example, the two NPs in this sentence. It is well known that these NPs have different distributions. There are many ways to capture the difference. One way that has been shown to work well for this particular example is to add parent annotation. Another way is head lexicalization. Yet another way is to automatically cluster the categories into subcategories. Today I will present a way to start from a simple X-Bar grammar, which has fewer than 100 symbols, and to automatically learn compact and interpretable subcategories.

    How are we going to do this? We will use an EM algorithm, as Matsuzaki et al. did. The details are in their paper and also in ours, so I will just give a quick high-level overview. Since the brackets and categories for our training trees are already known, we only need to induce the subcategories. This means that we do not need to run the general inside-outside algorithm. We can turn our parse trees into tree-shaped graphical models and then use an algorithm just like the forward-backward algorithm for HMMs. To emphasize: we don't need to run the inside-outside algorithm, and the algorithm is not cubic but linear, so training can be done efficiently.

    Using this machinery we can reproduce the results of previous work. Here we have plotted grammar size, measured in number of subcategories, against accuracy. The first thing to notice is that we start from a more basic baseline than previous work: the X-Bar grammar that we use as an initializer has fewer than 100 symbols and a performance of less than 65%. One can see that increasing the number of latent annotations improves parsing performance. However, since the number of rules grows cubically in the number of subcategories, one quickly reaches the limit of what is computationally feasible. This is where our innovations start. In the rest of this talk I will present three new ideas. First, I will show that hierarchical training leads to better parameter estimates. Then we will see that adaptive splitting allows us to refine only those categories that need to be split. Finally, we will smooth our estimates to prevent overfitting.

    Let us look at what happens when we split a category. What I am showing here are the top three words and their emission probabilities for each (sub)category of the determiner tag. When we split the category into four, EM learns a clustering. However, the more subcategories we have, the harder it gets for EM to find a good clustering. Since EM is a local search method, it is more and more likely to converge to a suboptimal solution when we increase the number of subcategories.

    We therefore propose staged training: we split and retrain our grammars. In each iteration we initialize EM with the results of the smaller grammar, splitting every previous subcategory in two and adding some randomness to break symmetry.

    If we go back and look at the learned categories, we find that they are actually interpretable. In the first step the algorithm learns to distinguish between determiners and demonstratives, a subcategorization that is linguistically sensible. Klein and Manning also found this split useful and hand-annotated this distinction in their work. In the next iteration the algorithm learns, on the one side, to distinguish between definite and indefinite determiners, while on the other side it learns to separate demonstratives from quantificational elements. One should note that these splits are very reliable and typically occur in this order. As we had hoped, the hierarchical training leads to better parameter estimates, which in turn results in a 1% improvement in parsing accuracy.

    This being said, there are still some things that are broken in our model. Let's look at another category, e.g. the ',' tag. Whether we apply hierarchical training or not, the result will be the same: we generate several subcategories that are exactly the same. This was a trivial example, but oversplitting will occur everywhere after some number of splits. Let's look at the DT tag again and focus on the determiner branch. Remember that we had learned to distinguish definite from indefinite determiners. In the next iteration we learn the following four categories. On the right we have learned to distinguish sentence-initial definite determiners from sentence-medial determiners, something that can be useful. On the left we have generated two clusters that are essentially the same. However, it is hard to tell whether oversplitting has already occurred, since we don't know the underlying statistics. The algorithm can judge much better than us which splits are useful and which aren't. But we need to be careful: oversplitting is not only wasteful, but also potentially dangerous, as it fragments the counts of other categories that interact with this category.

    In general, we want to split as much as possible and let the algorithm itself figure out where to allocate the splits. After each learning phase we therefore revisit each split that we have just made and evaluate whether it was useful. If not, we undo the split. In this example the algorithm might decide that this split was not useful and merge the two subcategories back together. In this way the algorithm revisits all newly introduced pairs of splits. For each split it computes the loss in likelihood that would be incurred if the split were removed. We found that merging half of the newly introduced splits results in no loss in accuracy but allows us to control the grammar size.

    Here you can see how the grammar size has been dramatically reduced. The arrows connect models that have undergone the same number of split iterations and therefore have the same maximal number of subcategories. Because we have merged some of the subcategories back in, the blue model has far fewer total subcategories. Adaptive splitting allows us to do two more split iterations and further refine the more complicated categories. This additional expressiveness gives us in turn another 1% improvement in parsing accuracy. We are now close to 90% with a grammar that has a total of barely more than 1000 subcategories.

    This grammar can allocate up to 64 subcategories, and here is where they get allocated. Here are the numbers of subcategories for each of the 25 phrasal categories. In general, phrasal categories are split less heavily than lexical categories, a trend one could have expected.

    As expected, noun and verb phrases are the most complex categories and have therefore been split more heavily. The majority of the categories have been split very little or not at all, because they are rare or have little variance. The more heavily split lexical categories are the different verb categories and especially the nominal categories and adjectives; we will see some examples later. It is interesting to notice that none of the categories has been split into the maximal possible number of subcategories. Still, proper nouns have been split into 62 subcategories, which is a non-negligible number. Since we treat each of them as a separate entity, we have quite heavily fragmented our data. This in turn leads to poor probability estimates, and we run the danger of overfitting our model.

    A way to mitigate the risk of overfitting is to pool the statistics in a smoothed model. We do the smoothing in a linear way, where each rewrite probability is smoothed towards the mean across the subcategories of the parent. We found the results to be robust with regard to the smoothing parameter alpha. Our best model, which has about 1000 subcategories, achieves a surprisingly high F1 score of 90.7 on the development section of the WSJ.

    Before I finish, let me give you some interesting examples of what our grammars learn. These are intended to highlight some interesting observations; the full list of subcategories and more details are in the paper. The subcategories sometimes capture syntactic and sometimes semantic differences, but very often they represent syntactico-semantic relations similar to those found in distributional clustering results. For the proper nouns, for example, the system learns a subcategory for months, first names, last names and initials. It also learns which words typically come first and second in multi-word units. For personal pronouns there is a subcategory for accusative case and one each for sentence-initial and sentence-medial nominative case. Relative adverbs are divided into distance, degree and time. For the cardinal numbers the system learns to distinguish between spelled-out numbers, dates and others.

    Wait a minute: 1621 minutes? For 1600 sentences? It literally takes one whole minute to parse a sentence exhaustively with our refined grammar. But obviously one can do better than this naive version. For example, the idea of coarse-to-fine parsing has been around for a while. In coarse-to-fine parsing one estimates two grammars from the treebank: a refined grammar and a coarse grammar. Those grammars are estimated independently; the only requirement is that there is some type of mapping between the symbols of the two grammars. For Charniak, the refined grammar is a lexicalized grammar and the coarse grammar is some coarser, markovized, unlexicalized grammar. For us, the refined grammar is a split grammar and the coarse grammar is the X-Bar grammar. Once we have these two grammars we can speed up parsing a lot by first exhaustively parsing with the coarse grammar and then pruning chart items with low posterior probability. We can then use this pruned chart as a mask to constrain the refined chart.

    To make this explicit, what do I mean by pruning? For each item we compute its posterior probability using the inside/outside scores. For example, for a given span we have constituents like QP, NP, VP in our coarse grammar. When we prune, we compute the probability of the coarse symbols. If this probability is below a threshold, we can be pretty confident that there won't be such a constituent in the final parse tree. Each of those symbols corresponds to a set of symbols in the refined grammar, so we can safely remove all of its refined versions from the chart of the refined grammar.

    How much does this help? It helps quite a bit. If we tune the threshold so as not to make any parse errors, the parsing time goes down to two hours from more than a day. But for a practical application this is still too slow.

    Consider again the same span as before. We had decided that QP is not a valid constituent. In hierarchical pruning we do several pre-parses instead of just one. Instead of going directly to our most refined grammar, we can use a grammar where the categories have been split in two. We already said that QPs are not valid, so we won't need to consider them. We compute the posterior probability of the different NPs and VPs, and we might decide that NP1 is valid but NP2 is not. We can then go to the next, more refined grammar, where some other subcategories will get pruned, and so on. This is just an example, but it is pretty representative: the pruning allows us to keep the number of active chart items roughly constant even though we are doubling the size of the grammar. And, as you might guess, we can use the grammars from training for pruning; we just need to keep them around after training. If we do this, we can save some more time. We are getting into a reasonable ballpark now, but we can do much better.

    There is one problem with using the grammars from training for pruning. Let's look at what happens when we train with our hierarchical procedure. Here we are looking at one of the branches of the DT tag after three splits, and I am showing the most likely tag for each subcategory. In the next split round we will split each category in two, add some noise and then use EM to refit the model. But there will be nothing tying the categories together and enforcing the hierarchical structure. So what can happen is that two categories swap places. This state drift can occur in theory and does often occur in practice. While the categories don't get completely scrambled, the grammar history is not ideal for pruning. What works better is to compute grammars specifically for pruning. During training we build a hierarchy of increasingly refined grammars, where EM finds some wiggly path to our most refined grammar. What we do is throw away the intermediate grammars and keep only the final grammar. We can then obtain grammars of intermediate complexity, but optimally close to the final, refined grammar, through projections. The advantage of those projections is that they capture the substate drifts. They are also easy to compute from the final grammar and do not require us to keep a treebank around; we can estimate them directly from the final grammar.

    How do we go about estimating them? The nonterminals are easy: we just project them to their coarser ancestors. Here we are showing how a grammar with two subcategories is projected back to an X-Bar grammar. In practice we will be projecting our final grammar, which has up to 64 subcategories, to grammars of intermediate complexity with 32, 16, ... subcategories. So we said that we can just map refined symbols to coarse symbols when we project. We can do this for the rules too: for example, those eight split rules will all map to the unsplit rule S goes to NP VP. But how do we get the rule probabilities? They are harder to estimate, but still pretty intuitive. We cannot just add up the rule probabilities of the split rules, since we need to weight them by the frequency of the parent symbol.

    We need to take a small detour. Think about how we estimate grammars in general. Typically we estimate grammars from treebanks by counting how often each rule occurs. But here we don't have a treebank that corresponds to our refined grammar; what we have is the grammar itself, and this grammar induces an infinite tree distribution. We would like our coarse grammar to be as close to this tree distribution as possible. Corazza & Satta worked out that the equivalent of taking the maximum likelihood estimate from a finite treebank corresponds to minimizing the KL divergence between the tree distributions of the two grammars. To fit the distribution of the coarse grammar as closely as possible to the one of the refined grammar, we just need to compute the parameters as in the MLE case, but replace the counts with expected counts. These expected counts can be computed with a simple iterative procedure: we initialize the count of the root symbol to one and then propagate the counts, where each update corresponds to the expected counts up to depth k. This procedure converges within a few seconds. We can then use those expected counts to compute the rule probabilities.

    We can now use these projected grammars in our hierarchical coarse-to-fine parsing scheme. Wow, that's fast now: 15 minutes without search error, and remember, that's at 91.2 F-score. And if you are willing to sacrifice 0.1 in accuracy, you can cut this time in half again.

    Let's look a little bit under the hood and see what happens when we parse. Here I am showing the fraction of time that we spend in each phase. The first pass with the X-Bar grammar is the most costly one, and each of the following ones is pretty cheap. Another thing we can do is look at the chart. Here I am showing the chart for the first sentence of our dev set; black corresponds to high posterior probability. Since we have only two dimensions, I have collapsed the different constituents and am showing only bracket posteriors; obviously, in practice the charts are much more fine-grained. The first pass (X-Bar grammar) already removes a lot of chart items. The second one clears out a lot too. As you can see, large blocks of the chart have been ruled out, and the valid chart items tend to cluster. These clusters correspond to different ways of attaching certain phrases and cannot be resolved by the simple grammars. Actually, every pass is useful: after each pass we double the number of chart items, but by pruning we are able to keep the number of valid chart items roughly constant. You can also see that some ambiguities are resolved early on, while others, like some attachment ambiguities, are not resolved until the end. At the end we have an extremely sparse chart which contains most of the probability mass. What we need to do is extract a single parse tree from it, for example this one. But how do we get this tree?

    How do we extract this best tree from the posteriors? The problem is that our grammar induces a derivation distribution, but we care only about parse trees, which have unsplit evaluation symbols. Unfortunately, computing the best unsplit tree is intractable, so we need an approximation of some type. We can settle for the best derivation, which however is a poor choice. We can extract an n-best list and rerank it, which works relatively well. Or, the best strategy, we can use an objective function that decomposes along the posteriors.

    How does this compare to the state of the art? In terms of efficiency, we compared to Charniak & Johnson's parser, which is arguably the best parser for English. We ran their parser on the same machine and on the same setup as ours. Our Java implementation is faster and more accurate than theirs, and we can speed up parsing even more by allowing a small search error. In terms of accuracy we are better than the generative component of the Brown parser, but still worse than the reranking parser. However, we could extract n-best lists and rerank them too, if we wanted. The nice thing about our approach is that we do not require any knowledge about the target language; we just need a treebank. We therefore trained grammars for German and Chinese, and we surpass the previous state of the art by a huge margin there, mainly because most work on parsing has been done on English and parsing techniques for other languages are still somewhat behind.

    To conclude, these are the take-home points: hierarchical coarse-to-fine inference is extremely fast; it is important to use good grammars for pruning, such as the ones obtained by projection; for selecting the best parse tree it is important to marginalize out the latent annotation; and unlexicalized parsing is applicable to other languages as well.

    Sonorant vs. non-sonorant; duration depends on context; r-coloring and nasalization.

    And finally, we have made the parser, along with grammars for a variety of languages, available for download, and we would be happy if you used it in your research. Thank you for your attention; I am happy to take some questions now.