Natural Language Processing Assignment – Final Presentation
Varun Suprashanth, 09005063
Tarun Gujjula, 09005068
Asok Ramachandran, 09005072
Part 1 : POS Tagger
Tasks Completed
• Implementation of Viterbi – Unigram, Bigram.
• Five Fold Evaluation.
• Per POS Accuracy.
• Confusion Matrix.
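The Viterbi bigram decoder listed above can be sketched as follows — a minimal illustration assuming transition probabilities P(t|prev) and emission probabilities P(w|t) were already estimated from the annotated corpus; the names `trans` and `emit` are placeholders, not the assignment's actual code.

```python
def viterbi_bigram(words, tags, trans, emit):
    """Return the most probable tag sequence for `words`.
    trans[(prev, t)] = P(t | prev), emit[(t, w)] = P(w | t)."""
    # best[i][t] = (probability of the best path ending in tag t at word i,
    #               backpointer to the previous tag on that path)
    best = [{t: (trans.get(('^', t), 0.0) * emit.get((t, words[0]), 0.0), None)
             for t in tags}]
    for i in range(1, len(words)):
        col = {}
        for t in tags:
            # Extend the best path from every previous tag q and keep the max.
            p, prev = max(
                (best[i - 1][q][0] * trans.get((q, t), 0.0)
                 * emit.get((t, words[i]), 0.0), q)
                for q in tags)
            col[t] = (p, prev)
        best.append(col)
    # Pick the best final tag, then follow backpointers to recover the path.
    last = max(tags, key=lambda t: best[-1][t][0])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    return list(reversed(path))
```

The same dynamic-programming table also yields the entries of the confusion matrix once the decoded tags are compared against the gold annotation fold by fold.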
[Bar chart: Per POS Accuracy for Bigram Assumption — one bar per BNC C5 tag (AJ0 … VVZ-NN2); y-axis accuracy from 0 to 1.2]
Screenshot of Confusion Matrix (rows: actual tag, columns: predicted tag; top-left block shown):

          AJ0  AJ0-AV0  AJ0-NN1  AJ0-VVD  AJ0-VVG  AJ0-VVN   AJC   AJS   AT0   AV0  AV0-AJ0   AVP
AJ0      2899       20       32        1        3        3     0     0    18    35       27     1
AJ0-AV0    31       18        2        0        0        0     0     0     0     1       15     0
AJ0-NN1   161        0      116        0        0        0     0     0     0     0        1     0
AJ0-VVD     7        0        0        0        0        0     0     0     0     0        0     0
AJ0-VVG     8        0        0        0        2        0     0     0     1     0        0     0
AJ0-VVN     8        0        0        3        0        2     0     0     1     0        0     0
AJC         2        0        0        0        0        0    69     0     0    11        0     0
AJS         6        0        0        0        0        0     0    38     0     2        0     0
AT0       192        0        0        0        0        0     0     0  7000    13        0     0
AV0       120        8        2        0        0        0    15     2    24  2444       29    11
AV0-AJ0    10        7        0        0        0        0     0     0     0    16       33     0
AVP        24        0        0        0        0        0     0     0     1    11        0   737
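Per-POS accuracy can be read directly off such a confusion matrix: for each gold tag, the diagonal count divided by the row total. A minimal sketch, assuming the matrix is stored as a dict of dicts `confusion[gold][predicted]` (a hypothetical layout; the counts below are toy values, not the assignment's figures):

```python
def per_pos_accuracy(confusion):
    """For each gold tag, the fraction of its tokens that were tagged correctly.
    confusion[gold][predicted] = count of gold-tag tokens predicted as `predicted`."""
    acc = {}
    for gold, row in confusion.items():
        total = sum(row.values())               # all tokens carrying this gold tag
        acc[gold] = row.get(gold, 0) / total if total else 0.0
    return acc
```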
Part 2 : Discriminative vs. Generative
Problem Statement
• Generate the unigram parameters P(t_i|w_i); you already have the annotated corpus.
• Compute the argmax of P(T|W) directly; do not invert through Bayes' theorem.
• Compare the unigram-based performance of (2) with the HMM-based system.
Tasks Completed
• Generated the unigram parameters P(t_i|w_i).
• Computed the argmax of P(T|W).
• Compared the unigram-based performance of the above with the HMM-based system.
• The generative model produced better results on ambiguous sentences.
Discriminative
• P(T|W) = P(t_1 … t_n | w_1 … w_n)
• Assuming each word–tag pair to be independent,
• P(T|W) = P(t_1|w_1) · P(t_2|w_2) · … · P(t_n|w_n) = Π_i P(t_i|w_i)
• Precision: 0.896788
• F-measure: 0.896788
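The discriminative tagger can be sketched directly from counts — since P(t|w) = c(w, t)/c(w) and the denominator c(w) is the same for every candidate tag of w, the per-word argmax reduces to the word's most frequent tag. A toy illustration (function names, the fallback tag, and the corpus are assumptions, not the assignment's code):

```python
from collections import Counter, defaultdict

def train_best_tags(tagged_corpus):
    """tagged_corpus: iterable of sentences, each a list of (word, tag) pairs."""
    counts = defaultdict(Counter)               # counts[word][tag] = c(w, t)
    for sentence in tagged_corpus:
        for word, tag in sentence:
            counts[word][tag] += 1
    # argmax_t P(t|w) = argmax_t c(w, t) / c(w) = most frequent tag for w.
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def tag_sentence(words, best_tag, default='NN1'):
    """Tag each word independently; unseen words fall back to a default tag."""
    return [best_tag.get(w, default) for w in words]
```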
Per-POS Accuracy
[Bar chart: per-POS accuracy — one bar per tag (AJ0 … VVZ-NN2); y-axis 0 to 1.2]
Generative
• P(T|W) ∝ P(W|T) · P(T)
• Assuming the unigram assumption and word–tag pairs to be independent,
• P(T|W) ∝ Π_i P(w_i|t_i) · P(t_i)
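For comparison, the generative unigram scorer picks argmax_t P(w|t) · P(t) for each word. A minimal sketch with illustrative probability tables (all names are placeholders):

```python
def generative_tag(words, tags, p_word_given_tag, p_tag):
    """p_word_given_tag[(w, t)] = P(w|t); p_tag[t] = P(t).
    Each word is scored independently under the unigram assumption."""
    return [max(tags,
                key=lambda t: p_word_given_tag.get((w, t), 0.0) * p_tag.get(t, 0.0))
            for w in words]
```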
Part 3 : Analysis of Corpora Using Word Prediction
Tasks Completed
• Predicted the next word on the basis of the patterns occurring in both corpora.
• The first corpus had untagged sentences; the second had tagged sentences.
• The corpus with tagged words gives better results for word prediction.
Untagged Corpus
• By the bigram assumption,
• P(w_n | w_{n-1}) = c(w_{n-1} w_n) / c(w_{n-1}), where c(·) is the count in the corpus.
• By the trigram assumption,
• P(w_n | w_{n-2} w_{n-1}) = c(w_{n-2} w_{n-1} w_n) / c(w_{n-2} w_{n-1})
Tagged Corpus
• Prediction now conditions on the previous words together with their tags.
• By the bigram assumption,
• P(w_n | w_{n-1}, t_{n-1}) = c(w_{n-1}, t_{n-1}, w_n) / c(w_{n-1}, t_{n-1})
• By the trigram assumption, analogously, conditioning on the two previous word–tag pairs.
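The tagged and untagged predictors differ only in what the context key is. The count-based bigram predictor can be sketched as follows — all names and the training text are illustrative; for the tagged corpus, the context key would be the (word, tag) pair instead of the word alone:

```python
from collections import Counter, defaultdict

def train_bigrams(sentences):
    """Count, for each word, which words follow it: nxt[prev][w] = c(prev, w)."""
    nxt = defaultdict(Counter)
    for s in sentences:
        for a, b in zip(s, s[1:]):
            nxt[a][b] += 1
    return nxt

def predict_next(prev_word, nxt):
    """argmax_w c(prev, w), i.e. the argmax of P(w_n | w_{n-1}),
    since the denominator c(prev) is constant across candidates."""
    if prev_word not in nxt:
        return None
    return nxt[prev_word].most_common(1)[0][0]
```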
Examples
• Example 1:
  - Tagged context: TO0_to VBI_be CJC_or XX0_not TO0_to → predicted: VBI_be
  - Untagged context: to be or not to → predicted: The
• Example 2:
  - Tagged context: AJ0_complete CJC_and AJ0_utter → predicted: NN1_contempt
  - Untagged context: complete and utter → predicted: Loud
Examples (cont.)
• Example 3:
  - Tagged context: PNQ_who VBZ_is DPS_your AJ0-NN1_favourite → predicted: NN1_gardening
  - Untagged context: who is your favourite → predicted: is
Results
• Raw-text LM: word prediction accuracy 13.21%
• POS-tagged-text LM: word prediction accuracy 15.53%
Part 4 : A-star Implementation
Problem Statement
• The goal is to see which algorithm is better for POS tagging: Viterbi or A*.
• Look upon the column of POS tags above all the words as forming the state-space graph.
• The start state S is '^' and the goal state G is '$'.
• Your job is to come up with a good heuristic. One possibility is that the heuristic value h(N), where N is a node on a word W, is the product of the distance of W from '$' and the least arc cost in the state-space graph.
• g(N) is the cost of the best path found so far to W from '^'.
• Run A* with this heuristic and see the result.
• Compare the result with Viterbi.
A-star Implementation
• Precision: 0.937254
• F-measure: 0.937254
[Bar chart: per-POS accuracy — one bar per tag (AJ0 … VVZ-NN2); y-axis 0 to 1.2]
Screenshot of Confusion Matrix (counts only; the row and column tag labels were lost in extraction, though the columns appear to match the earlier AJ0 … AVP block):
12836 58 187 9 13 28 0 0 240 110 52 7
98 44 3 0 0 0 0 0 0 5 26 0
357 1 377 0 2 0 0 0 1 0 1 0
33 0 0 2 0 1 0 0 7 0 0 0
33 0 2 0 29 0 0 0 4 0 0 0
42 0 0 5 0 15 0 0 5 0 0 0
4 0 0 0 0 0 403 0 3 38 0 0
4 0 0 0 0 0 0 214 0 18 0 0
1 0 0 0 0 0 0 0 23454 55 0 0
82 11 2 0 0 0 58 11 99 9198 68 42
34 12 0 0 0 0 0 0 0 69 75 0
4 0 0 0 0 0 0 0 1 38 0 1533
0 0 0 0 0 0 0 0 0 5 0 72
0 0 0 0 0 0 0 0 1 15 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 3 0 0
0 0 0 0 0 0 0 0 1 109 0 0
0 0 0 0 0 0 0 0 0 0 0 0
Heuristic
• h = g * (N - n) / n
• where N is the length of the sentence and n is the index of the current word in the sentence.
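A sketch of A* decoding over the tag lattice with this heuristic — arc costs taken as negative log probabilities of transition times emission, and h computed exactly as on the slide; the `trans`/`emit` tables and all names are assumptions for illustration, not the assignment's code:

```python
import heapq
import math

def astar_tag(words, tags, trans, emit):
    """trans[(prev, t)] = P(t | prev), emit[(t, w)] = P(w | t)."""
    N = len(words)
    # Each heap entry: (f = g + h, g, index n of the next word to tag, tag path).
    heap = [(0.0, 0.0, 0, ('^',))]
    while heap:
        f, g, n, path = heapq.heappop(heap)
        if n == N:                      # goal: every word has a tag
            return list(path[1:])       # drop the '^' start state
        for t in tags:
            p = trans.get((path[-1], t), 0.0) * emit.get((t, words[n]), 0.0)
            if p == 0.0:
                continue                # zero-probability arc: skip
            g2 = g - math.log(p)        # arc cost = -log probability
            # Slide's heuristic h = g * (N - n) / n, with n counted 1-based.
            h = g2 * (N - (n + 1)) / (n + 1)
            heapq.heappush(heap, (g2 + h, g2, n + 1, path + (t,)))
    return None
```

Note that this h scales the cost so far by the fraction of the sentence remaining; it is not guaranteed admissible, so unlike Viterbi, A* with it may return a suboptimal sequence on some inputs.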
A-star vs. Viterbi
Part 5 : YAGO
Problem Statement
• Take as input two words and show A PATH between them, listing all the concepts encountered on the way.
• For example, on the path from 'bulldog' to 'cheshire cat', one would presumably encounter 'bulldog-dog-mammal-cat-cheshire cat'.
• Similarly for 'VVS Laxman' and 'Hyderabad', or 'Tendulkar' and 'Tennis' (you will be surprised!).
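The path-listing step can be sketched as a breadth-first search over a concept graph — the toy graph below mirrors the slide's bulldog example and is not actual YAGO data; the adjacency-dict layout is an assumption:

```python
from collections import deque

def find_path(graph, src, dst):
    """Shortest path (fewest hops) between two concepts in an undirected
    concept graph given as an adjacency dict: graph[node] = list of neighbours."""
    queue = deque([[src]])
    seen = {src}
    while queue:
        path = queue.popleft()
        if path[-1] == dst:
            return path
        for nb in graph.get(path[-1], ()):
            if nb not in seen:
                seen.add(nb)
                queue.append(path + [nb])
    return None                         # no path between the two concepts
```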
Part 6: Parser Projection
Example
• English: Dhoni is the captain of India.
• Hindi: dhoni bhaarat ke kaptaan hai.
• Hindi parse:
[
[ [dhoni]NN]NP
[ [[[bhaarat]NNP]NP [ke]P ]PP [kaptaan]NN]NP [hai]VBZ ]VP
]S
• English parse:
[
[ [Dhoni]NN]NP
[ [is]VBZ [[the]ART [captain]NN]NP [[of]P [[India]NNP]NP]PP]VP
]S
Problems and Conclusions
• Many idioms in English are translated literally, even though they mean something else, e.g. phrases like "break a leg", "he lost his head", "French kiss", "flip the bird".
• Noise because of misalignments.
Natural Language Tool Kit
• The Natural Language Toolkit, or more commonly NLTK, is a suite of libraries and programs for symbolic and statistical natural language processing (NLP) for the Python programming language.
• NLTK includes graphical demonstrations and sample data.
• It is accompanied by extensive documentation, including a book that explains the underlying concepts behind the language processing tasks supported by the toolkit.
• It provides lexical resources such as WordNet.
• It has a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning.
[EOF]