Natural Language Processing Assignment – Final Presentation
Varun Suprashanth, 09005063
Tarun Gujjula, 09005068
Asok Ramachandran, 09005072
Part 1 : POS Tagger
Tasks Completed
• Implementation of Viterbi – Unigram, Bigram.
• Five Fold Evaluation.
• Per POS Accuracy.
• Confusion Matrix.
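The Viterbi bigram decoder listed above can be sketched as follows — a minimal illustration assuming transition probabilities P(t|prev) and emission probabilities P(w|t) were already estimated from the annotated corpus; the names `trans` and `emit` are placeholders, not the assignment's actual code.

```python
def viterbi_bigram(words, tags, trans, emit):
    """Return the most probable tag sequence for `words`.
    trans[(prev, t)] = P(t | prev), emit[(t, w)] = P(w | t)."""
    # best[i][t] = (probability of the best path ending in tag t at word i,
    #               backpointer to the previous tag on that path)
    best = [{t: (trans.get(('^', t), 0.0) * emit.get((t, words[0]), 0.0), None)
             for t in tags}]
    for i in range(1, len(words)):
        col = {}
        for t in tags:
            # Extend the best path from every previous tag q and keep the max.
            p, prev = max(
                (best[i - 1][q][0] * trans.get((q, t), 0.0)
                 * emit.get((t, words[i]), 0.0), q)
                for q in tags)
            col[t] = (p, prev)
        best.append(col)
    # Pick the best final tag, then follow backpointers to recover the path.
    last = max(tags, key=lambda t: best[-1][t][0])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    return list(reversed(path))
```

The same dynamic-programming table also yields the entries of the confusion matrix once the decoded tags are compared against the gold annotation fold by fold.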
[Bar chart: Per POS Accuracy for Bigram Assumption — one bar per BNC C5 tag (AJ0 … VVZ-NN2); y-axis accuracy from 0 to 1.2]
Screenshot of Confusion Matrix (rows: actual tag, columns: predicted tag; top-left block shown):

          AJ0  AJ0-AV0  AJ0-NN1  AJ0-VVD  AJ0-VVG  AJ0-VVN   AJC   AJS   AT0   AV0  AV0-AJ0   AVP
AJ0      2899       20       32        1        3        3     0     0    18    35       27     1
AJ0-AV0    31       18        2        0        0        0     0     0     0     1       15     0
AJ0-NN1   161        0      116        0        0        0     0     0     0     0        1     0
AJ0-VVD     7        0        0        0        0        0     0     0     0     0        0     0
AJ0-VVG     8        0        0        0        2        0     0     0     1     0        0     0
AJ0-VVN     8        0        0        3        0        2     0     0     1     0        0     0
AJC         2        0        0        0        0        0    69     0     0    11        0     0
AJS         6        0        0        0        0        0     0    38     0     2        0     0
AT0       192        0        0        0        0        0     0     0  7000    13        0     0
AV0       120        8        2        0        0        0    15     2    24  2444       29    11
AV0-AJ0    10        7        0        0        0        0     0     0     0    16       33     0
AVP        24        0        0        0        0        0     0     0     1    11        0   737
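Per-POS accuracy can be read directly off such a confusion matrix: for each gold tag, the diagonal count divided by the row total. A minimal sketch, assuming the matrix is stored as a dict of dicts `confusion[gold][predicted]` (a hypothetical layout; the counts below are toy values, not the assignment's figures):

```python
def per_pos_accuracy(confusion):
    """For each gold tag, the fraction of its tokens that were tagged correctly.
    confusion[gold][predicted] = count of gold-tag tokens predicted as `predicted`."""
    acc = {}
    for gold, row in confusion.items():
        total = sum(row.values())               # all tokens carrying this gold tag
        acc[gold] = row.get(gold, 0) / total if total else 0.0
    return acc
```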
Part 2 : Discriminative vs. Generative
Problem Statement
• Generate the unigram parameters P(t_i|w_i); you already have the annotated corpus.
• Compute the argmax of P(T|W) directly; do not invert through Bayes' theorem.
• Compare the unigram-based performance of (2) with the HMM-based system.
Tasks Completed
• Generated the unigram parameters P(t_i|w_i).
• Computed the argmax of P(T|W).
• Compared the unigram-based performance of the above with the HMM-based system.
• The generative model produced better results on ambiguous sentences.
Discriminative
• P(T|W) = P(t_1 … t_n | w_1 … w_n)
• Assuming each word–tag pair to be independent,
• P(T|W) = P(t_1|w_1) · P(t_2|w_2) · … · P(t_n|w_n) = Π_i P(t_i|w_i)
• Precision: 0.896788
• F-measure: 0.896788
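The discriminative tagger can be sketched directly from counts — since P(t|w) = c(w, t)/c(w) and the denominator c(w) is the same for every candidate tag of w, the per-word argmax reduces to the word's most frequent tag. A toy illustration (function names, the fallback tag, and the corpus are assumptions, not the assignment's code):

```python
from collections import Counter, defaultdict

def train_best_tags(tagged_corpus):
    """tagged_corpus: iterable of sentences, each a list of (word, tag) pairs."""
    counts = defaultdict(Counter)               # counts[word][tag] = c(w, t)
    for sentence in tagged_corpus:
        for word, tag in sentence:
            counts[word][tag] += 1
    # argmax_t P(t|w) = argmax_t c(w, t) / c(w) = most frequent tag for w.
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def tag_sentence(words, best_tag, default='NN1'):
    """Tag each word independently; unseen words fall back to a default tag."""
    return [best_tag.get(w, default) for w in words]
```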
Per-POS Accuracy
[Bar chart: per-POS accuracy — one bar per tag (AJ0 … VVZ-NN2); y-axis 0 to 1.2]
Generative
• P(T|W) ∝ P(W|T) · P(T)
• Assuming the unigram assumption and word–tag pairs to be independent,
• P(T|W) ∝ Π_i P(w_i|t_i) · P(t_i)
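For comparison, the generative unigram scorer picks argmax_t P(w|t) · P(t) for each word. A minimal sketch with illustrative probability tables (all names are placeholders):

```python
def generative_tag(words, tags, p_word_given_tag, p_tag):
    """p_word_given_tag[(w, t)] = P(w|t); p_tag[t] = P(t).
    Each word is scored independently under the unigram assumption."""
    return [max(tags,
                key=lambda t: p_word_given_tag.get((w, t), 0.0) * p_tag.get(t, 0.0))
            for w in words]
```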
Part 3 : Analysis of Corpora Using Word Prediction
Tasks Completed
• Predicted the next word on the basis of the patterns occurring in both corpora.
• The first corpus had untagged sentences; the second had tagged sentences.
• The corpus with tagged words gives better results for word prediction.
Untagged Corpus
• By the bigram assumption,
• P(w_n | w_{n-1}) = c(w_{n-1} w_n) / c(w_{n-1}), where c(·) is the count in the corpus.
• By the trigram assumption,
• P(w_n | w_{n-2} w_{n-1}) = c(w_{n-2} w_{n-1} w_n) / c(w_{n-2} w_{n-1})
Tagged Corpus
• Prediction now conditions on the previous words together with their tags.
• By the bigram assumption,
• P(w_n | w_{n-1}, t_{n-1}) = c(w_{n-1}, t_{n-1}, w_n) / c(w_{n-1}, t_{n-1})
• By the trigram assumption, analogously, conditioning on the two previous word–tag pairs.
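The tagged and untagged predictors differ only in what the context key is. The count-based bigram predictor can be sketched as follows — all names and the training text are illustrative; for the tagged corpus, the context key would be the (word, tag) pair instead of the word alone:

```python
from collections import Counter, defaultdict

def train_bigrams(sentences):
    """Count, for each word, which words follow it: nxt[prev][w] = c(prev, w)."""
    nxt = defaultdict(Counter)
    for s in sentences:
        for a, b in zip(s, s[1:]):
            nxt[a][b] += 1
    return nxt

def predict_next(prev_word, nxt):
    """argmax_w c(prev, w), i.e. the argmax of P(w_n | w_{n-1}),
    since the denominator c(prev) is constant across candidates."""
    if prev_word not in nxt:
        return None
    return nxt[prev_word].most_common(1)[0][0]
```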
Examples
• Example 1:
  - Tagged context: TO0_to VBI_be CJC_or XX0_not TO0_to → predicted: VBI_be
  - Untagged context: to be or not to → predicted: The
• Example 2:
  - Tagged context: AJ0_complete CJC_and AJ0_utter → predicted: NN1_contempt
  - Untagged context: complete and utter → predicted: Loud
Examples (cont.)
• Example 3:
  - Tagged context: PNQ_who VBZ_is DPS_your AJ0-NN1_favourite → predicted: NN1_gardening
  - Untagged context: who is your favourite → predicted: is
Results
• Raw-text LM: word prediction accuracy 13.21%
• POS-tagged-text LM: word prediction accuracy 15.53%
Part 4 : A-star Implementation
Problem Statement
• The goal is to see which algorithm is better for POS tagging: Viterbi or A*.
• Look upon the column of POS tags above all the words as forming the state-space graph.
• The start state S is '^' and the goal state G is '$'.
• Your job is to come up with a good heuristic. One possibility is that the heuristic value h(N), where N is a node on a word W, is the product of the distance of W from '$' and the least arc cost in the state-space graph.
• g(N) is the cost of the best path found so far to W from '^'.
• Run A* with this heuristic and see the result.
• Compare the result with Viterbi.
A-star Implementation
• Precision: 0.937254
• F-measure: 0.937254
[Bar chart: per-POS accuracy — one bar per tag (AJ0 … VVZ-NN2); y-axis 0 to 1.2]
Screenshot of Confusion Matrix (counts only; the row and column tag labels were lost in extraction, though the columns appear to match the earlier AJ0 … AVP block):
12836 58 187 9 13 28 0 0 240 110 52 7
98 44 3 0 0 0 0 0 0 5 26 0
357 1 377 0 2 0 0 0 1 0 1 0
33 0 0 2 0 1 0 0 7 0 0 0
33 0 2 0 29 0 0 0 4 0 0 0
42 0 0 5 0 15 0 0 5 0 0 0
4 0 0 0 0 0 403 0 3 38 0 0
4 0 0 0 0 0 0 214 0 18 0 0
1 0 0 0 0 0 0 0 23454 55 0 0
82 11 2 0 0 0 58 11 99 9198 68 42
34 12 0 0 0 0 0 0 0 69 75 0
4 0 0 0 0 0 0 0 1 38 0 1533
0 0 0 0 0 0 0 0 0 5 0 72
0 0 0 0 0 0 0 0 1 15 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 3 0 0
0 0 0 0 0 0 0 0 1 109 0 0
0 0 0 0 0 0 0 0 0 0 0 0
Heuristic
• h = g * (N - n) / n
• where N is the length of the sentence and n is the index of the current word in the sentence.
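A sketch of A* decoding over the tag lattice with this heuristic — arc costs taken as negative log probabilities of transition times emission, and h computed exactly as on the slide; the `trans`/`emit` tables and all names are assumptions for illustration, not the assignment's code:

```python
import heapq
import math

def astar_tag(words, tags, trans, emit):
    """trans[(prev, t)] = P(t | prev), emit[(t, w)] = P(w | t)."""
    N = len(words)
    # Each heap entry: (f = g + h, g, index n of the next word to tag, tag path).
    heap = [(0.0, 0.0, 0, ('^',))]
    while heap:
        f, g, n, path = heapq.heappop(heap)
        if n == N:                      # goal: every word has a tag
            return list(path[1:])       # drop the '^' start state
        for t in tags:
            p = trans.get((path[-1], t), 0.0) * emit.get((t, words[n]), 0.0)
            if p == 0.0:
                continue                # zero-probability arc: skip
            g2 = g - math.log(p)        # arc cost = -log probability
            # Slide's heuristic h = g * (N - n) / n, with n counted 1-based.
            h = g2 * (N - (n + 1)) / (n + 1)
            heapq.heappush(heap, (g2 + h, g2, n + 1, path + (t,)))
    return None
```

Note that this h scales the cost so far by the fraction of the sentence remaining; it is not guaranteed admissible, so unlike Viterbi, A* with it may return a suboptimal sequence on some inputs.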
A-star vs. Viterbi
Part 5 : YAGO
Problem Statement
• Take as input two words and show A PATH between them, listing all the concepts encountered on the way.
• For example, on the path from 'bulldog' to 'cheshire cat', one would presumably encounter 'bulldog-dog-mammal-cat-cheshire cat'.
• Similarly for 'VVS Laxman' and 'Hyderabad', or 'Tendulkar' and 'Tennis' (you will be surprised!).
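The path-listing step can be sketched as a breadth-first search over a concept graph — the toy graph below mirrors the slide's bulldog example and is not actual YAGO data; the adjacency-dict layout is an assumption:

```python
from collections import deque

def find_path(graph, src, dst):
    """Shortest path (fewest hops) between two concepts in an undirected
    concept graph given as an adjacency dict: graph[node] = list of neighbours."""
    queue = deque([[src]])
    seen = {src}
    while queue:
        path = queue.popleft()
        if path[-1] == dst:
            return path
        for nb in graph.get(path[-1], ()):
            if nb not in seen:
                seen.add(nb)
                queue.append(path + [nb])
    return None                         # no path between the two concepts
```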
Part 6: Parser Projection
Example
• English: Dhoni is the captain of India.
• Hindi: dhoni bhaarat ke kaptaan hai.
• Hindi parse:
[
[ [dhoni]NN]NP
[ [[[bhaarat]NNP]NP [ke]P ]PP [kaptaan]NN]NP [hai]VBZ ]VP
]S
• English parse:
[
[ [Dhoni]NN]NP
[ [is]VBZ [[the]ART [captain]NN]NP [[of]P [[India]NNP]NP]PP]VP
]S
Problems and Conclusions
• Many idioms in English are translated literally, even though they mean something else, e.g. phrases like "break a leg", "he lost his head", "French kiss", "flip the bird".
• Noise because of misalignments.
Natural Language Tool Kit
• The Natural Language Toolkit, or more commonly NLTK, is a suite of libraries and programs for symbolic and statistical natural language processing (NLP) for the Python programming language.
• NLTK includes graphical demonstrations and sample data.
• It is accompanied by extensive documentation, including a book that explains the underlying concepts behind the language processing tasks supported by the toolkit.
• It provides lexical resources such as WordNet.
• It has a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning.
[EOF]