复旦大学大数据学院 statistical natural language parsing · 2017. 11. 29. ·...

66
复旦大学大数据学院 School of Data Science, Fudan University DATA130006 Text Management and Analysis Statistical Natural Language Parsing 魏忠钰 November 29 th , 2017 Adapted from Stanford CS124U

Upload: others

Post on 14-Oct-2020

6 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: 复旦大学大数据学院 Statistical Natural Language Parsing · 2017. 11. 29. · 复旦大学大数据学院 SchoolofDataScience,FudanUniversity DATA130006Text Management and

复旦大学大数据学院School of Data Science, Fudan University

DATA130006 Text Management and Analysis

Statistical Natural Language Parsing魏忠钰

November29th,2017

Adapted from Stanford CS124U

Page 2: 复旦大学大数据学院 Statistical Natural Language Parsing · 2017. 11. 29. · 复旦大学大数据学院 SchoolofDataScience,FudanUniversity DATA130006Text Management and

NLP Startups

Page 3: 复旦大学大数据学院 Statistical Natural Language Parsing · 2017. 11. 29. · 复旦大学大数据学院 SchoolofDataScience,FudanUniversity DATA130006Text Management and

NLP Startups

Page 4: 复旦大学大数据学院 Statistical Natural Language Parsing · 2017. 11. 29. · 复旦大学大数据学院 SchoolofDataScience,FudanUniversity DATA130006Text Management and
Page 5: 复旦大学大数据学院 Statistical Natural Language Parsing · 2017. 11. 29. · 复旦大学大数据学院 SchoolofDataScience,FudanUniversity DATA130006Text Management and
Page 6: 复旦大学大数据学院 Statistical Natural Language Parsing · 2017. 11. 29. · 复旦大学大数据学院 SchoolofDataScience,FudanUniversity DATA130006Text Management and

Constituency (phrase structure)

§ Phrasestructureorganizeswordsintonestedconstituents.

Page 7: 复旦大学大数据学院 Statistical Natural Language Parsing · 2017. 11. 29. · 复旦大学大数据学院 SchoolofDataScience,FudanUniversity DATA130006Text Management and

Parsing Tree Example

Page 8: 复旦大学大数据学院 Statistical Natural Language Parsing · 2017. 11. 29. · 复旦大学大数据学院 SchoolofDataScience,FudanUniversity DATA130006Text Management and

Attachment ambiguities

§ Akeyparsingdecisionishowwe‘attach’variousconstituents§ PPs,adverbialorparticipialphrases,infinitives,coordinations,etc.

§ Catalan numbers: Cn = (2n)!/[(n+1)!n!]

§ An exponentially growing series, which arises in many tree-like contexts

Page 9: 复旦大学大数据学院 Statistical Natural Language Parsing · 2017. 11. 29. · 复旦大学大数据学院 SchoolofDataScience,FudanUniversity DATA130006Text Management and

Two problems to solve: 1. Repeated work…

Page 10: 复旦大学大数据学院 Statistical Natural Language Parsing · 2017. 11. 29. · 复旦大学大数据学院 SchoolofDataScience,FudanUniversity DATA130006Text Management and

Two problems to solve: 2. Choosing the correct parse

§Wordsaregoodpredictorsofattachment

§ Moscowsentmorethan100,000soldiersintoAfghanistan…

§ SydneyWaterbreachedanagreementwithNSWHealth…

§Ourstatisticalparserswilltrytoexploitsuchstatistics.

Page 11: 复旦大学大数据学院 Statistical Natural Language Parsing · 2017. 11. 29. · 复旦大学大数据学院 SchoolofDataScience,FudanUniversity DATA130006Text Management and

context-free grammars (CFGs)

§G=(T,N,S,R)§ Tisasetofterminalsymbols§ Nisasetofnonterminalsymbols§ Sisthestartsymbol(S∈ N)§ Risasetofrules/productionsoftheformX® g

§ X∈ Nandg∈ (N∪ T)*

§ AgrammarGgeneratesalanguageL.

Page 12: 复旦大学大数据学院 Statistical Natural Language Parsing · 2017. 11. 29. · 复旦大学大数据学院 SchoolofDataScience,FudanUniversity DATA130006Text Management and

A phrase structure grammar

S® NPVPVP® VNPVP® VNPPPNP® NPNPNP® NPPPNP® NNP® ePP® PNP

peoplefishtankspeoplefishwithrods

N® peopleN® fishN® tanksN® rodsV® peopleV® fishV® tanksP® with

Page 13: 复旦大学大数据学院 Statistical Natural Language Parsing · 2017. 11. 29. · 复旦大学大数据学院 SchoolofDataScience,FudanUniversity DATA130006Text Management and

Many parses

Page 14: 复旦大学大数据学院 Statistical Natural Language Parsing · 2017. 11. 29. · 复旦大学大数据学院 SchoolofDataScience,FudanUniversity DATA130006Text Management and

Probabilistic – or stochastic – context-free grammars (PCFGs)

• G=(T,N,S,R,P)• Tisasetofterminalsymbols• Nisasetofnonterminalsymbols• Sisthestartsymbol(S∈ N)• Risasetofrules/productionsoftheformX® g• Pisaprobabilityfunction

• P:R® [0,1]•

• AgrammarGgeneratesalanguagemodelL.

∀X ∈ N, P(X →γ) =1X→γ ∈R∑

P(γ) =1γ ∈T *∑

Page 15: 复旦大学大数据学院 Statistical Natural Language Parsing · 2017. 11. 29. · 复旦大学大数据学院 SchoolofDataScience,FudanUniversity DATA130006Text Management and

A PCFG

S® NPVP 1.0VP® VNP 0.6VP® VNPPP 0.4NP® NPNP 0.1NP® NPPP 0.2NP® N 0.7PP® PNP 1.0

N® people 0.5N® fish 0.2N® tanks 0.2N® rods 0.1V® people 0.1V® fish 0.6V® tanks 0.3P® with 1.0

Page 16: 复旦大学大数据学院 Statistical Natural Language Parsing · 2017. 11. 29. · 复旦大学大数据学院 SchoolofDataScience,FudanUniversity DATA130006Text Management and

The rise of annotated data: The Penn Treebank((S(NP-SBJ(DTThe)(NNmove))(VP(VBDfollowed)(NP(NP(DTa)(NNround))(PP(INof)(NP(NP(JJsimilar)(NNSincreases))(PP(INby)(NP(JJother)(NNSlenders)))(PP(INagainst)(NP(NNPArizona)(JJreal)(NNestate)(NNSloans))))))

(,,)(S-ADV(NP-SBJ(-NONE- *))(VP(VBGreflecting)(NP(NP(DTa)(VBGcontinuing)(NNdecline))(PP-LOC(INin)(NP(DTthat)(NNmarket)))))))

(..)))

Page 17: 复旦大学大数据学院 Statistical Natural Language Parsing · 2017. 11. 29. · 复旦大学大数据学院 SchoolofDataScience,FudanUniversity DATA130006Text Management and

The probability of trees and strings

§ P(t)– Theprobabilityofatreet istheproductoftheprobabilitiesoftherulesusedtogenerateit.

§ P(s)– Theprobabilityofthestrings isthesumoftheprobabilitiesofthetreeswhichhavethatstringastheiryield

P(s)=Σj P(s,t)wheret isaparseofs

=Σj P(t)

Page 18: 复旦大学大数据学院 Statistical Natural Language Parsing · 2017. 11. 29. · 复旦大学大数据学院 SchoolofDataScience,FudanUniversity DATA130006Text Management and

P(t1)=1.0× 0.7× 0.4× 0.5× 0.6× 0.7× 1.0× 0.2× 1.0× 0.7× 0.1

=0.0008232

Page 19: 复旦大学大数据学院 Statistical Natural Language Parsing · 2017. 11. 29. · 复旦大学大数据学院 SchoolofDataScience,FudanUniversity DATA130006Text Management and

P(t2)=1.0× 0.7× 0.6× 0.5× 0.6× 0.2× 0.7× 1.0× 0.2× 1.0× 0.7× 0.1

=0.00024696

Page 20: 复旦大学大数据学院 Statistical Natural Language Parsing · 2017. 11. 29. · 复旦大学大数据学院 SchoolofDataScience,FudanUniversity DATA130006Text Management and

Tree and String Probabilities

• s =peoplefishtankswithrods• P(t1)=1.0× 0.7× 0.4× 0.5× 0.6× 0.7

× 1.0× 0.2× 1.0× 0.7× 0.1=0.0008232

• P(t2)=1.0× 0.7× 0.6× 0.5× 0.6× 0.2× 0.7× 1.0× 0.2× 1.0× 0.7× 0.1

=0.00024696• P(s)=P(t1)+P(t2)

=0.0008232 +0.00024696=0.00107016

Verb attach

Noun attach

Page 21: 复旦大学大数据学院 Statistical Natural Language Parsing · 2017. 11. 29. · 复旦大学大数据学院 SchoolofDataScience,FudanUniversity DATA130006Text Management and

• A PP is more likely attached to a VP compared to NP.

Page 22: 复旦大学大数据学院 Statistical Natural Language Parsing · 2017. 11. 29. · 复旦大学大数据学院 SchoolofDataScience,FudanUniversity DATA130006Text Management and

Outline

§Restricting the grammar form for efficient parsing

Page 23: 复旦大学大数据学院 Statistical Natural Language Parsing · 2017. 11. 29. · 复旦大学大数据学院 SchoolofDataScience,FudanUniversity DATA130006Text Management and

Chomsky Normal Form

§ AllrulesareoftheformX® YZorX® w§ X,Y,Z∈ Nandw∈ T

§ Atransformationtothisformdoesn’tchangethegenerativecapacityofaCFG§ Thatis,itrecognizesthesamelanguage

§ Butmaybewithdifferenttrees

§ Emptiesandun-aries areremovedrecursively§ n-ary rulesaredividedbyintroducingnewnon-terminals(n>2)

Page 24: 复旦大学大数据学院 Statistical Natural Language Parsing · 2017. 11. 29. · 复旦大学大数据学院 SchoolofDataScience,FudanUniversity DATA130006Text Management and

A phrase structure grammar

S® NPVPVP® VNPVP® VNPPPNP® NPNPNP® NPPPNP® NNP® ePP® PNP

S® VPVP® V

N® peopleN® fishN® tanksN® rodsV® peopleV® fishV® tanksP® with

Page 25: 复旦大学大数据学院 Statistical Natural Language Parsing · 2017. 11. 29. · 复旦大学大数据学院 SchoolofDataScience,FudanUniversity DATA130006Text Management and

Chomsky Normal Form steps

S® NPVPS® VPVP® VNPVP® VVP® VNPPPVP® VPPNP® NPNPNP® NPNP® NPPPNP® PPNP® NPP® PNPPP® P

N® peopleN® fishN® tanksN® rodsV® peopleV® fishV® tanksP® with

S® VNPS® V

Page 26: 复旦大学大数据学院 Statistical Natural Language Parsing · 2017. 11. 29. · 复旦大学大数据学院 SchoolofDataScience,FudanUniversity DATA130006Text Management and

Chomsky Normal Form steps

S® NPVP

VP® VNP

S® VNP

VP® V

S® V

VP® VNPPP

S® VNPPP

VP® VPP

S® VPP

NP® NPNP

NP® NP

NP® NPPP

NP® PP

NP® N

PP® PNP

PP® P

N® peopleN® fishN® tanksN® rodsV® peopleV® fishV® tanksP® with

Page 27: 复旦大学大数据学院 Statistical Natural Language Parsing · 2017. 11. 29. · 复旦大学大数据学院 SchoolofDataScience,FudanUniversity DATA130006Text Management and

Chomsky Normal Form steps

S® NPVP

VP® VNP

S® VNP

VP® V

VP® VNPPP

S® VNPPP

VP® VPP

S® VPP

NP® NPNP

NP® NP

NP® NPPP

NP® PP

NP® N

PP® PNP

PP® P

N® peopleN® fishN® tanksN® rodsV® peopleS® peopleV® fishS® fishV® tanksS® tanksP® with

Page 28: 复旦大学大数据学院 Statistical Natural Language Parsing · 2017. 11. 29. · 复旦大学大数据学院 SchoolofDataScience,FudanUniversity DATA130006Text Management and

Chomsky Normal Form steps

S® NPVP

VP® VNP

S® VNP

VP® VNPPP

S® VNPPP

VP® VPP

S® VPP

NP® NPNP

NP® NP

NP® NPPP

NP® PP

NP® N

PP® PNP

PP® P

N® people

N® fish

N® tanks

N® rods

V® people

S® people

VP® people

V® fish

S® fish

VP® fish

V® tanks

S® tanks

VP® tanks

P® with

Page 29: 复旦大学大数据学院 Statistical Natural Language Parsing · 2017. 11. 29. · 复旦大学大数据学院 SchoolofDataScience,FudanUniversity DATA130006Text Management and

Chomsky Normal Form steps

S® NPVP

VP® VNP

S® VNP

VP® VNPPP

S® VNPPP

VP® VPP

S® VPP

NP® NPNP

NP® NPPP

NP® PNP

PP® PNP

VP® VX

X® NPPP

X=@VP_V

NP® people

NP® fish

NP® tanks

NP® rods

V® people

S® people

VP® people

V® fish

S® fish

VP® fish

V® tanks

S® tanks

VP® tanks

P® with

PP® with

Page 30: 复旦大学大数据学院 Statistical Natural Language Parsing · 2017. 11. 29. · 复旦大学大数据学院 SchoolofDataScience,FudanUniversity DATA130006Text Management and

Chomsky Normal Form steps

S® NPVP

VP® VNP

S® VNP

VP® V@VP_V

@VP_V® NPPP

S® V@S_V

@S_V® NPPP

VP® VPP

S® VPP

NP® NPNP

NP® NPPP

NP® PNP

PP® PNP

NP® people

NP® fish

NP® tanks

NP® rods

V® people

S® people

VP® people

V® fish

S® fish

VP® fish

V® tanks

S® tanks

VP® tanks

P® with

PP® with

Page 31: 复旦大学大数据学院 Statistical Natural Language Parsing · 2017. 11. 29. · 复旦大学大数据学院 SchoolofDataScience,FudanUniversity DATA130006Text Management and

A phrase structure grammar

S® NPVPVP® VNPVP® VNPPPNP® NPNPNP® NPPPNP® NNP® ePP® PNP

N® peopleN® fishN® tanksN® rodsV® peopleV® fishV® tanksP® with

Page 32: 复旦大学大数据学院 Statistical Natural Language Parsing · 2017. 11. 29. · 复旦大学大数据学院 SchoolofDataScience,FudanUniversity DATA130006Text Management and

Chomsky Normal Form steps

S® NPVP

VP® VNP

S® VNP

VP® V@VP_V

@VP_V® NPPP

S® V@S_V

@S_V® NPPP

VP® VPP

S® VPP

NP® NPNP

NP® NPPP

NP® PNP

PP® PNP

NP® people

NP® fish

NP® tanks

NP® rods

V® people

S® people

VP® people

V® fish

S® fish

VP® fish

V® tanks

S® tanks

VP® tanks

P® with

PP® with

Page 33: 复旦大学大数据学院 Statistical Natural Language Parsing · 2017. 11. 29. · 复旦大学大数据学院 SchoolofDataScience,FudanUniversity DATA130006Text Management and

Chomsky Normal Form

§Withsomeextrabook-keepinginsymbolnames,youcanevenreconstructthesametreeswithade-transform

§ Youshouldthinkofthisasatransformationforefficientparsing

§ InpracticefullChomskyNormalFormisapain§ Reconstructingn-aries iseasy§ Reconstructingunaries/emptiesistrickier

§Binarization iscrucialforcubictimeCFGparsing

Page 34: 复旦大学大数据学院 Statistical Natural Language Parsing · 2017. 11. 29. · 复旦大学大数据学院 SchoolofDataScience,FudanUniversity DATA130006Text Management and

An example: before binarization…

ROOT

S

NP VP

N

people

V NP PP

P NP

rodswithtanksfish

N N

Page 35: 复旦大学大数据学院 Statistical Natural Language Parsing · 2017. 11. 29. · 复旦大学大数据学院 SchoolofDataScience,FudanUniversity DATA130006Text Management and

After binarization…

P NP

rods

N

with

NP

N

people tanksfish

N

VP

V NP PP

@VP_V

ROOT

S

Page 36: 复旦大学大数据学院 Statistical Natural Language Parsing · 2017. 11. 29. · 复旦大学大数据学院 SchoolofDataScience,FudanUniversity DATA130006Text Management and

Treebank: empties and unaries

ROOT

S-HLN

NP-SUBJ VP

VB-NONE-

e Atone

PTB Tree

ROOT

S

NP VP

VB-NONE-

e Atone

NoFuncTags

ROOT

S

VP

VB

Atone

NoEmpties

ROOT

S

Atone

NoUnaries

ROOT

VB

Atone

High Low

Page 37: 复旦大学大数据学院 Statistical Natural Language Parsing · 2017. 11. 29. · 复旦大学大数据学院 SchoolofDataScience,FudanUniversity DATA130006Text Management and

Outline

§Restricting the grammar form for efficient parsing

§ Exact polynomial time parsing of (P)CFGs

Page 38: 复旦大学大数据学院 Statistical Natural Language Parsing · 2017. 11. 29. · 复旦大学大数据学院 SchoolofDataScience,FudanUniversity DATA130006Text Management and

Constituency Parsing

fishpeoplefishtanks

RuleProb θiS® NPVP θ0NP® NPNP θ1…

N® fish θ42N® people θ43V® fish θ44…

PCFG

N N V N

VP

NPNP

S

Page 39: 复旦大学大数据学院 Statistical Natural Language Parsing · 2017. 11. 29. · 复旦大学大数据学院 SchoolofDataScience,FudanUniversity DATA130006Text Management and

Cocke-Kasami-Younger (CKY) Constituency Parsing

fishpeoplefishtanks

Page 40: 复旦大学大数据学院 Statistical Natural Language Parsing · 2017. 11. 29. · 复旦大学大数据学院 SchoolofDataScience,FudanUniversity DATA130006Text Management and

Viterbi (Max) Scores

peoplefish

NP 0.35V 0.1N 0.5

VP 0.06NP 0.14V 0.6N 0.2

S® NPVP 0.9S® VP 0.1VP® VNP 0.5VP® V 0.1VP® V@VP_V 0.3VP® VPP 0.1@VP_V® NPPP 1.0NP® NPNP 0.1NP® NPPP 0.2NP® N 0.7PP® PNP 1.0

S -> NPVP=0.35*0.06*0.9VP -> V NP = 0.1 *0.5 *0.14NP -> NPNP = 0.1 *0.35 *0.14

Page 41: 复旦大学大数据学院 Statistical Natural Language Parsing · 2017. 11. 29. · 复旦大学大数据学院 SchoolofDataScience,FudanUniversity DATA130006Text Management and

Extended CKY parsing

§Unaries canbeincorporatedintothealgorithm§ Messy,butdoesn’tincreasealgorithmiccomplexity

§ Emptiescanbeincorporated§ Doesn’tincreasecomplexity;essentiallylikeu-naries

§ Binarizationisvital§ Withoutbinarization,youdon’tgetparsingcubicinthelengthofthesentenceandinthenumberofnon-terminalsinthegrammar

Page 42: 复旦大学大数据学院 Statistical Natural Language Parsing · 2017. 11. 29. · 复旦大学大数据学院 SchoolofDataScience,FudanUniversity DATA130006Text Management and

The CKY algorithm (1960/1965) extended to unaries

function CKY(words, grammar) returns [most_probable_parse,prob]score = new double[#(words)+1][#(words)+1][#(nonterms)]back = new Pair[#(words)+1][#(words)+1][#nonterms]]for i=0; i<#(words); i++for A in nontermsif A -> words[i] in grammarscore[i][i+1][A] = P(A -> words[i])

//handle unariesboolean added = truewhile added added = falsefor A, B in nontermsif score[i][i+1][B] > 0 && A->B in grammarprob = P(A->B)*score[i][i+1][B]if prob > score[i][i+1][A]score[i][i+1][A] = probback[i][i+1][A] = Badded = true

Page 43: 复旦大学大数据学院 Statistical Natural Language Parsing · 2017. 11. 29. · 复旦大学大数据学院 SchoolofDataScience,FudanUniversity DATA130006Text Management and

The CKY algorithm (1960/1965) extended to unaries

for span = 2 to #(words)for begin = 0 to #(words)- spanend = begin + spanfor split = begin+1 to end-1for A,B,C in nonterms

prob=score[begin][split][B]*score[split][end][C]*P(A->BC)if prob > score[begin][end][A]score[begin]end][A] = probback[begin][end][A] = new Triple(split,B,C)

//handle unariesboolean added = truewhile addedadded = falsefor A, B in nontermsprob = P(A->B)*score[begin][end][B];if prob > score[begin][end][A]score[begin][end][A] = probback[begin][end][A] = Badded = true

return buildTree(score, back)

Page 44: 复旦大学大数据学院 Statistical Natural Language Parsing · 2017. 11. 29. · 复旦大学大数据学院 SchoolofDataScience,FudanUniversity DATA130006Text Management and

The grammar: Binary, no epsilons,

S® NPVP 0.9S® VP 0.1VP® VNP 0.5VP® V 0.1VP® V@VP_V 0.3VP® VPP 0.1@VP_V® NPPP 1.0NP® NPNP 0.1NP® NPPP 0.2NP® N 0.7PP® PNP 1.0

N® people 0.5N® fish 0.2N® tanks 0.2N® rods 0.1V® people 0.1V® fish 0.6V® tanks 0.3P® with 1.0

Page 45: 复旦大学大数据学院 Statistical Natural Language Parsing · 2017. 11. 29. · 复旦大学大数据学院 SchoolofDataScience,FudanUniversity DATA130006Text Management and

score[0][1]

score[1][2]

score[2][3]

score[3][4]

score[0][2]

score[1][3]

score[2][4]

score[0][3]

score[1][4]

score[0][4]

0

1

2

3

4

1 2 3 4fish people fish tanks

Page 46: 复旦大学大数据学院 Statistical Natural Language Parsing · 2017. 11. 29. · 复旦大学大数据学院 SchoolofDataScience,FudanUniversity DATA130006Text Management and

0

1

2

3

4

1 2 3 4fish people fish tanksS® NPVP 0.9S® VP 0.1VP® VNP 0.5VP® V 0.1VP® V@VP_V 0.3VP® VPP 0.1@VP_V® NPPP 1.0NP® NPNP 0.1NP® NPPP 0.2NP® N 0.7PP® PNP 1.0

N® people 0.5N® fish 0.2N® tanks

0.2N® rods 0.1V® people 0.1V® fish 0.6V® tanks 0.3P® with 1.0

for i=0; i<#(words); i++for A in nontermsif A -> words[i] in grammarscore[i][i+1][A] = P(A -> words[i]);

Page 47: 复旦大学大数据学院 Statistical Natural Language Parsing · 2017. 11. 29. · 复旦大学大数据学院 SchoolofDataScience,FudanUniversity DATA130006Text Management and

0

1

2

3

4

1 2 3 4fish people fish tanksS® NPVP 0.9S® VP 0.1VP® VNP 0.5VP® V 0.1VP® V@VP_V 0.3VP® VPP 0.1@VP_V® NPPP 1.0NP® NPNP 0.1NP® NPPP 0.2NP® N 0.7PP® PNP 1.0

N® people 0.5N® fish 0.2N® tanks

0.2N® rods 0.1V® people 0.1V® fish 0.6V® tanks 0.3P® with 1.0

for i=0; i<#(words); i++for A in nontermsif A -> words[i] in grammarscore[i][i+1][A] = P(A -> words[i]);

Page 48: 复旦大学大数据学院 Statistical Natural Language Parsing · 2017. 11. 29. · 复旦大学大数据学院 SchoolofDataScience,FudanUniversity DATA130006Text Management and

N® fish0.2V® fish0.6

0

1

2

3

4

1 2 3 4fish people fish tanksS® NPVP 0.9S® VP 0.1VP® VNP 0.5VP® V 0.1VP® V@VP_V 0.3VP® VPP 0.1@VP_V® NPPP 1.0NP® NPNP 0.1NP® NPPP 0.2NP® N 0.7PP® PNP 1.0

N® people 0.5N® fish 0.2N® tanks

0.2N® rods 0.1V® people 0.1V® fish 0.6V® tanks 0.3P® with 1.0

// handle unariesboolean added = true

while added added = falsefor A, B in nonterms

if score[i][i+1][B] > 0 && A->B in grammarprob = P(A->B)*score[i][i+1][B]if(prob > score[i][i+1][A])

score[i][i+1][A] = probback[i][i+1][A] = Badded = true

Page 49: 复旦大学大数据学院 Statistical Natural Language Parsing · 2017. 11. 29. · 复旦大学大数据学院 SchoolofDataScience,FudanUniversity DATA130006Text Management and

N® fish0.2V® fish0.6

N® people0.5V® people0.1

N® fish0.2V® fish0.6

N® tanks0.2V® tanks0.1

0

1

2

3

4

1 2 3 4fish people fish tanksS® NPVP 0.9S® VP 0.1VP® VNP 0.5VP® V 0.1VP® V@VP_V 0.3VP® VPP 0.1@VP_V® NPPP 1.0NP® NPNP 0.1NP® NPPP 0.2NP® N 0.7PP® PNP 1.0

N® people 0.5N® fish 0.2N® tanks

0.2N® rods 0.1V® people 0.1V® fish 0.6V® tanks 0.3P® with 1.0

// handle unariesboolean added = true

while added added = falsefor A, B in nonterms

if score[i][i+1][B] > 0 && A->B in grammarprob = P(A->B)*score[i][i+1][B]if(prob > score[i][i+1][A])

score[i][i+1][A] = probback[i][i+1][A] = Badded = true

Page 50: 复旦大学大数据学院 Statistical Natural Language Parsing · 2017. 11. 29. · 复旦大学大数据学院 SchoolofDataScience,FudanUniversity DATA130006Text Management and

N® fish0.2V® fish0.6NP® N0.14VP® V0.06S® VP0.006

N® people0.5V® people0.1

N® fish0.2V® fish0.6

N® tanks0.2V® tanks0.1

0

1

2

3

4

1 2 3 4fish people fish tanksS® NPVP 0.9S® VP 0.1VP® VNP 0.5VP® V 0.1VP® V@VP_V 0.3VP® VPP 0.1@VP_V® NPPP 1.0NP® NPNP 0.1NP® NPPP 0.2NP® N 0.7PP® PNP 1.0

N® people 0.5N® fish 0.2N® tanks

0.2N® rods 0.1V® people 0.1V® fish 0.6V® tanks 0.3P® with 1.0

// handle unariesboolean added = true

while added added = falsefor A, B in nonterms

if score[i][i+1][B] > 0 && A->B in grammarprob = P(A->B)*score[i][i+1][B]if(prob > score[i][i+1][A])

score[i][i+1][A] = probback[i][i+1][A] = Badded = true

Page 51: 复旦大学大数据学院 Statistical Natural Language Parsing · 2017. 11. 29. · 复旦大学大数据学院 SchoolofDataScience,FudanUniversity DATA130006Text Management and

N® fish0.2V® fish0.6NP® N0.14VP® V0.06S® VP0.006

N® people0.5V® people0.1NP® N0.35VP® V0.01S® VP0.001

N® fish0.2V® fish0.6NP® N0.14VP® V0.06S® VP0.006

N® tanks0.2V® tanks0.1NP® N0.14VP® V0.03S® VP0.003

0

1

2

3

4

1 2 3 4fish people fish tanksS® NPVP 0.9S® VP 0.1VP® VNP 0.5VP® V 0.1VP® V@VP_V 0.3VP® VPP 0.1@VP_V® NPPP 1.0NP® NPNP 0.1NP® NPPP 0.2NP® N 0.7PP® PNP 1.0

N® people 0.5N® fish 0.2N® tanks

0.2N® rods 0.1V® people 0.1V® fish 0.6V® tanks 0.3P® with 1.0

prob=score[begin][split][B]*score[split][end][C]*P(A->BC)if (prob > score[begin][end][A])

score[begin]end][A] = probback[begin][end][A] = new Triple(split,B,C)

for span = 2 to #(words)for begin = 0 to #(words)- span

end = begin + spanfor split = begin+1 to end-1

Page 52: 复旦大学大数据学院 Statistical Natural Language Parsing · 2017. 11. 29. · 复旦大学大数据学院 SchoolofDataScience,FudanUniversity DATA130006Text Management and

N® fish0.2V® fish0.6NP® N0.14VP® V0.06S® VP0.006

N® people0.5V® people0.1NP® N0.35VP® V0.01S® VP0.001

N® fish0.2V® fish0.6NP® N0.14VP® V0.06S® VP0.006

N® tanks0.2V® tanks0.1NP® N0.14VP® V0.03S® VP0.003

0

1

2

3

4

1 2 3 4fish people fish tanksS® NPVP 0.9S® VP 0.1VP® VNP 0.5VP® V 0.1VP® V@VP_V 0.3VP® VPP 0.1@VP_V® NPPP 1.0NP® NPNP 0.1NP® NPPP 0.2NP® N 0.7PP® PNP 1.0

N® people 0.5N® fish 0.2N® tanks

0.2N® rods 0.1V® people 0.1V® fish 0.6V® tanks 0.3P® with 1.0

prob=score[begin][split][B]*score[split][end][C]*P(A->BC)if (prob > score[begin][end][A])

score[begin]end][A] = probback[begin][end][A] = new Triple(split,B,C)

for span = 2 to #(words)for begin = 0 to #(words)- span

end = begin + spanfor split = begin+1 to end-1

Page 53: 复旦大学大数据学院 Statistical Natural Language Parsing · 2017. 11. 29. · 复旦大学大数据学院 SchoolofDataScience,FudanUniversity DATA130006Text Management and

N® fish0.2V® fish0.6NP® N0.14VP® V0.06S® VP0.006

N® people0.5V® people0.1NP® N0.35VP® V0.01S® VP0.001

N® fish0.2V® fish0.6NP® N0.14VP® V0.06S® VP0.006

N® tanks0.2V® tanks0.1NP® N0.14VP® V0.03S® VP0.003

NP® NPNP0.0049

VP® VNP0.105

S® NPVP0.00126

0

1

2

3

4

1 2 3 4fish people fish tanksS® NPVP 0.9S® VP 0.1VP® VNP 0.5VP® V 0.1VP® V@VP_V 0.3VP® VPP 0.1@VP_V® NPPP 1.0NP® NPNP 0.1NP® NPPP 0.2NP® N 0.7PP® PNP 1.0

N® people 0.5N® fish 0.2N® tanks

0.2N® rods 0.1V® people 0.1V® fish 0.6V® tanks 0.3P® with 1.0

prob=score[begin][split][B]*score[split][end][C]*P(A->BC)if (prob > score[begin][end][A])

score[begin]end][A] = probback[begin][end][A] = new Triple(split,B,C)

for span = 2 to #(words)for begin = 0 to #(words)- span

end = begin + spanfor split = begin+1 to end-1

Page 54: 复旦大学大数据学院 Statistical Natural Language Parsing · 2017. 11. 29. · 复旦大学大数据学院 SchoolofDataScience,FudanUniversity DATA130006Text Management and

N® fish0.2V® fish0.6NP® N0.14VP® V0.06S® VP0.006

N® people0.5V® people0.1NP® N0.35VP® V0.01S® VP0.001

N® fish0.2V® fish0.6NP® N0.14VP® V0.06S® VP0.006

N® tanks0.2V® tanks0.1NP® N0.14VP® V0.03S® VP0.003

NP® NPNP0.0049

VP® VNP0.105

S® NPVP0.00126

NP® NPNP0.0049

VP® VNP0.007

S® NPVP0.0189

NP® NPNP0.00196

VP® VNP0.042

S® NPVP0.00378

0

1

2

3

4

1 2 3 4fish people fish tanksS® NPVP 0.9S® VP 0.1VP® VNP 0.5VP® V 0.1VP® V@VP_V 0.3VP® VPP 0.1@VP_V® NPPP 1.0NP® NPNP 0.1NP® NPPP 0.2NP® N 0.7PP® PNP 1.0

N® people 0.5N® fish 0.2N® tanks

0.2N® rods 0.1V® people 0.1V® fish 0.6V® tanks 0.3P® with 1.0

//handle unariesboolean added = truewhile added

added = falsefor A, B in nonterms

prob = P(A->B)*score[begin][end][B];if prob > score[begin][end][A]

score[begin][end][A] = probback[begin][end][A] = Badded = true

Page 55: 复旦大学大数据学院 Statistical Natural Language Parsing · 2017. 11. 29. · 复旦大学大数据学院 SchoolofDataScience,FudanUniversity DATA130006Text Management and

N® fish0.2V® fish0.6NP® N0.14VP® V0.06S® VP0.006

N® people0.5V® people0.1NP® N0.35VP® V0.01S® VP0.001

N® fish0.2V® fish0.6NP® N0.14VP® V0.06S® VP0.006

N® tanks0.2V® tanks0.1NP® N0.14VP® V0.03S® VP0.003

NP® NPNP0.0049

VP® VNP0.105

S® VP0.0105

NP® NPNP0.0049

VP® VNP0.007

S® NPVP0.0189

NP® NPNP0.00196

VP® VNP0.042

S® VP0.0042

0

1

2

3

4

1 2 3 4fish people fish tanksS® NPVP 0.9S® VP 0.1VP® VNP 0.5VP® V 0.1VP® V@VP_V 0.3VP® VPP 0.1@VP_V® NPPP 1.0NP® NPNP 0.1NP® NPPP 0.2NP® N 0.7PP® PNP 1.0

N® people 0.5N® fish 0.2N® tanks

0.2N® rods 0.1V® people 0.1V® fish 0.6V® tanks 0.3P® with 1.0

for split = begin+1 to end-1for A,B,C in nonterms

prob=score[begin][split][B]*score[split][end][C]*P(A->BC)if prob > score[begin][end][A]

score[begin]end][A] = probback[begin][end][A] = new Triple(split,B,C)

for span = 2 to #(words)for begin = 0 to #(words)- span

end = begin + spanfor split = begin+1 to end-1

Page 56: 复旦大学大数据学院 Statistical Natural Language Parsing · 2017. 11. 29. · 复旦大学大数据学院 SchoolofDataScience,FudanUniversity DATA130006Text Management and

N® fish0.2V® fish0.6NP® N0.14VP® V0.06S® VP0.006

N® people0.5V® people0.1NP® N0.35VP® V0.01S® VP0.001

N® fish0.2V® fish0.6NP® N0.14VP® V0.06S® VP0.006

N® tanks0.2V® tanks0.1NP® N0.14VP® V0.03S® VP0.003

NP® NPNP0.0049

VP® VNP0.105

S® VP0.0105

NP® NPNP0.0049

VP® VNP0.007

S® NPVP0.0189

NP® NPNP0.00196

VP® VNP0.042

S® VP0.0042

0

1

2

3

4

1 2 3 4fish people fish tanksS® NPVP 0.9S® VP 0.1VP® VNP 0.5VP® V 0.1VP® V@VP_V 0.3VP® VPP 0.1@VP_V® NPPP 1.0NP® NPNP 0.1NP® NPPP 0.2NP® N 0.7PP® PNP 1.0

N® people 0.5N® fish 0.2N® tanks

0.2N® rods 0.1V® people 0.1V® fish 0.6V® tanks 0.3P® with 1.0

for split = begin+1 to end-1for A,B,C in nonterms

prob=score[begin][split][B]*score[split][end][C]*P(A->BC)if prob > score[begin][end][A]

score[begin]end][A] = probback[begin][end][A] = new Triple(split,B,C)

for span = 2 to #(words)for begin = 0 to #(words)- span

end = begin + spanfor split = begin+1 to end-1

Page 57: 复旦大学大数据学院 Statistical Natural Language Parsing · 2017. 11. 29. · 复旦大学大数据学院 SchoolofDataScience,FudanUniversity DATA130006Text Management and

N® fish0.2V® fish0.6NP® N0.14VP® V0.06S® VP0.006

N® people0.5V® people0.1NP® N0.35VP® V0.01S® VP0.001

N® fish0.2V® fish0.6NP® N0.14VP® V0.06S® VP0.006

N® tanks0.2V® tanks0.1NP® N0.14VP® V0.03S® VP0.003

NP® NPNP0.0049

VP® VNP0.105

S® VP0.0105

NP® NPNP0.0049

VP® VNP0.007

S® NPVP0.0189

NP® NPNP0.00196

VP® VNP0.042

S® VP0.0042

0

1

2

3

4

1 2 3 4fish people fish tanksS® NPVP 0.9S® VP 0.1VP® VNP 0.5VP® V 0.1VP® V@VP_V 0.3VP® VPP 0.1@VP_V® NPPP 1.0NP® NPNP 0.1NP® NPPP 0.2NP® N 0.7PP® PNP 1.0

N® people 0.5N® fish 0.2N® tanks

0.2N® rods 0.1V® people 0.1V® fish 0.6V® tanks 0.3P® with 1.0

for split = begin+1 to end-1for A,B,C in nonterms

prob=score[begin][split][B]*score[split][end][C]*P(A->BC)if prob > score[begin][end][A]

score[begin]end][A] = probback[begin][end][A] = new Triple(split,B,C)

for span = 2 to #(words)for begin = 0 to #(words)- span

end = begin + spanfor split = begin+1 to end-1

Page 58: 复旦大学大数据学院 Statistical Natural Language Parsing · 2017. 11. 29. · 复旦大学大数据学院 SchoolofDataScience,FudanUniversity DATA130006Text Management and

N® fish0.2V® fish0.6NP® N0.14VP® V0.06S® VP0.006

N® people0.5V® people0.1NP® N0.35VP® V0.01S® VP0.001

N® fish0.2V® fish0.6NP® N0.14VP® V0.06S® VP0.006

N® tanks0.2V® tanks0.1NP® N0.14VP® V0.03S® VP0.003

NP® NPNP0.0049

VP® VNP0.105

S® VP0.0105

NP® NPNP0.0049

VP® VNP0.007

S® NPVP0.0189

NP® NPNP0.00196

VP® VNP0.042

S® VP0.0042

0

1

2

3

4

1 2 3 4fish people fish tanksS® NPVP 0.9S® VP 0.1VP® VNP 0.5VP® V 0.1VP® V@VP_V 0.3VP® VPP 0.1@VP_V® NPPP 1.0NP® NPNP 0.1NP® NPPP 0.2NP® N 0.7PP® PNP 1.0

N® people 0.5N® fish 0.2N® tanks

0.2N® rods 0.1V® people 0.1V® fish 0.6V® tanks 0.3P® with 1.0

for split = begin+1 to end-1for A,B,C in nonterms

prob=score[begin][split][B]*score[split][end][C]*P(A->BC)if prob > score[begin][end][A]

score[begin]end][A] = probback[begin][end][A] = new Triple(split,B,C)

for span = 2 to #(words)for begin = 0 to #(words)- span

end = begin + spanfor split = begin+1 to end-1

Page 59: 复旦大学大数据学院 Statistical Natural Language Parsing · 2017. 11. 29. · 复旦大学大数据学院 SchoolofDataScience,FudanUniversity DATA130006Text Management and

N® fish0.2V® fish0.6NP® N0.14VP® V0.06S® VP0.006

N® people0.5V® people0.1NP® N0.35VP® V0.01S® VP0.001

N® fish0.2V® fish0.6NP® N0.14VP® V0.06S® VP0.006

N® tanks0.2V® tanks0.1NP® N0.14VP® V0.03S® VP0.003

NP® NPNP0.0049

VP® VNP0.105

S® VP0.0105

NP® NPNP0.0049

VP® VNP0.007

S® NPVP0.0189

NP® NPNP0.00196

VP® VNP0.042

S® VP0.0042

NP® NPNP0.0000686

VP® VNP0.00147

S® NPVP0.000882

0

1

2

3

4

1 2 3 4fish people fish tanksS® NPVP 0.9S® VP 0.1VP® VNP 0.5VP® V 0.1VP® V@VP_V 0.3VP® VPP 0.1@VP_V® NPPP 1.0NP® NPNP 0.1NP® NPPP 0.2NP® N 0.7PP® PNP 1.0

N® people 0.5N® fish 0.2N® tanks

0.2N® rods 0.1V® people 0.1V® fish 0.6V® tanks 0.3P® with 1.0

for split = begin+1 to end-1for A,B,C in nonterms

prob=score[begin][split][B]*score[split][end][C]*P(A->BC)if prob > score[begin][end][A]

score[begin]end][A] = probback[begin][end][A] = new Triple(split,B,C)

for span = 2 to #(words)for begin = 0 to #(words)- span

end = begin + spanfor split = begin+1 to end-1

Page 60: 复旦大学大数据学院 Statistical Natural Language Parsing · 2017. 11. 29. · 复旦大学大数据学院 SchoolofDataScience,FudanUniversity DATA130006Text Management and

N® fish0.2V® fish0.6NP® N0.14VP® V0.06S® VP0.006

N® people0.5V® people0.1NP® N0.35VP® V0.01S® VP0.001

N® fish0.2V® fish0.6NP® N0.14VP® V0.06S® VP0.006

N® tanks0.2V® tanks0.1NP® N0.14VP® V0.03S® VP0.003

NP® NPNP0.0049

VP® VNP0.105

S® VP0.0105

NP® NPNP0.0049

VP® VNP0.007

S® NPVP0.0189

NP® NPNP0.00196

VP® VNP0.042

S® VP0.0042

NP® NPNP0.0000686

VP® VNP0.00147

S® NPVP0.000882

NP® NPNP0.0000686

VP® VNP0.000098

S® NPVP0.01323

0

1

2

3

4

1 2 3 4fish people fish tanksS® NPVP 0.9S® VP 0.1VP® VNP 0.5VP® V 0.1VP® V@VP_V 0.3VP® VPP 0.1@VP_V® NPPP 1.0NP® NPNP 0.1NP® NPPP 0.2NP® N 0.7PP® PNP 1.0

N® people 0.5N® fish 0.2N® tanks

0.2N® rods 0.1V® people 0.1V® fish 0.6V® tanks 0.3P® with 1.0

for split = begin+1 to end-1for A,B,C in nonterms

prob=score[begin][split][B]*score[split][end][C]*P(A->BC)if prob > score[begin][end][A]

score[begin]end][A] = probback[begin][end][A] = new Triple(split,B,C)

Page 61: 复旦大学大数据学院 Statistical Natural Language Parsing · 2017. 11. 29. · 复旦大学大数据学院 SchoolofDataScience,FudanUniversity DATA130006Text Management and

N® fish0.2V® fish0.6NP® N0.14VP® V0.06S® VP0.006

N® people0.5V® people0.1NP® N0.35VP® V0.01S® VP0.001

N® fish0.2V® fish0.6NP® N0.14VP® V0.06S® VP0.006

N® tanks0.2V® tanks0.1NP® N0.14VP® V0.03S® VP0.003

NP® NPNP0.0049

VP® VNP0.105

S® VP0.0105

NP® NPNP0.0049

VP® VNP0.007

S® NPVP0.0189

NP® NPNP0.00196

VP® VNP0.042

S® VP0.0042

NP® NPNP0.0000686

VP® VNP0.00147

S® NPVP0.000882

NP® NPNP0.0000686

VP® VNP0.000098

S® NPVP0.01323

NP® NPNP0.0000009604

VP® VNP0.00002058

S® NPVP0.00018522

0

1

2

3

4

1 2 3 4fish people fish tanksS® NPVP 0.9S® VP 0.1VP® VNP 0.5VP® V 0.1VP® V@VP_V 0.3VP® VPP 0.1@VP_V® NPPP 1.0NP® NPNP 0.1NP® NPPP 0.2NP® N 0.7PP® PNP 1.0

N® people 0.5N® fish 0.2N® tanks

0.2N® rods 0.1V® people 0.1V® fish 0.6V® tanks 0.3P® with 1.0

CallbuildTree(score,back)togetthebestparse

Page 62: 复旦大学大数据学院 Statistical Natural Language Parsing · 2017. 11. 29. · 复旦大学大数据学院 SchoolofDataScience,FudanUniversity DATA130006Text Management and

Quiz Question!

runs down

NNS 0.0023VB 0.001

PP 0.2IN 0.0014NNS 0.0001

PP → IN 0.002NP → NNS NNS 0.01NP → NNS NP 0.005NP → NNS PP 0.01VP → VB PP 0.045VP → VB NP 0.015

?? ???? ??

What constituents (with what probability can you make?

Page 63: 复旦大学大数据学院 Statistical Natural Language Parsing · 2017. 11. 29. · 复旦大学大数据学院 SchoolofDataScience,FudanUniversity DATA130006Text Management and

Outline

§ Restrictingthegrammarformforefficientparsing§ Exactpolynomialtimeparsingof(P)CFGs§ ConstituencyParserEvaluation

Page 64: 复旦大学大数据学院 Statistical Natural Language Parsing · 2017. 11. 29. · 复旦大学大数据学院 SchoolofDataScience,FudanUniversity DATA130006Text Management and

Evaluating constituency parsing

Page 65: 复旦大学大数据学院 Statistical Natural Language Parsing · 2017. 11. 29. · 复旦大学大数据学院 SchoolofDataScience,FudanUniversity DATA130006Text Management and

Evaluating constituency parsing

Goldstandardbrackets:S-(0:11), NP-(0:2),VP-(2:9),VP-(3:9),NP-(4:6),PP-(6-9),NP-(7,9),NP-(9:10)

Candidatebrackets:S-(0:11),NP-(0:2),VP-(2:10),VP-(3:10),NP-(4:6),PP-(6-10),NP-(7,10)

LabeledPrecision 3/7=42.9%LabeledRecall 3/8=37.5%LP/LRF1 40.0%TaggingAccuracy 11/11=100.0%

Page 66: 复旦大学大数据学院 Statistical Natural Language Parsing · 2017. 11. 29. · 复旦大学大数据学院 SchoolofDataScience,FudanUniversity DATA130006Text Management and

How good are PCFGs?

§ PennWSJparsingaccuracy:about73%LP/LRF1

§ Robust§Usuallyadmiteverything,butwithlowprobability

§ Partialsolutionforgrammarambiguity§ APCFGgivessomeideaoftheplausibilityofaparse§ Butnotsogoodbecausetheindependenceassumptionsaretoostrong

§Giveaprobabilisticlanguagemodel§ Butinthesimplecaseitperformsworsethanatrigrammodel

§ TheproblemseemstobethatPCFGslackthelexicalizationofatrigrammodel