School of Computing, Faculty of Engineering
Word Bi-grams and PoS Tags
COMP3310 Natural Language Processing
Eric Atwell, Language Research Group
(with thanks to Katja Markert, Marti Hearst, and other contributors)
Reminder
FreqDist counts of tokens and their distribution can be useful
E.g. find main characters in Gutenberg texts
E.g. compare word-lengths in different languages
Humans can predict the next word …
N-gram models are based on counts in a large corpus
Auto-generate a story ... (but it gets stuck in a local maximum)
Grammatical trends: modal verb distribution predicts genre
Why do puns make us groan?
He drove his expensive car into a tree and found out how the Mercedes bends.
Isn't the Grand Canyon just gorges?
Time flies like an arrow. Fruit flies like a banana.
Predicting Next Words
One reason puns make us groan is that they play on our assumptions about what the next word will be: human language processing involves predicting the most probable next word
They also exploit
• homophony – same sound, different spelling and meaning (bends, Benz; gorges, gorgeous)
• polysemy – same spelling, different meaning
NLP programs can also make use of word-sequence modeling
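As a minimal sketch of such word-sequence modelling (assuming NLTK with the Brown corpus downloaded, e.g. via nltk.download('brown')), a ConditionalFreqDist over word bigrams gives the most likely next word:

import nltk
from nltk.corpus import brown

# Count word bigrams: for each word, how often each next word follows it
cfd = nltk.ConditionalFreqDist(nltk.bigrams(brown.words(categories='news')))

print(cfd['the'].max())   # the most likely word to follow 'the' in news text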
Auto-generate a Story
How to fix the generator getting stuck in a loop? Use a random number generator.
Auto-generate a Story
The choice() function (from random import *) picks one item at random from a list.
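A minimal sketch of such a generator (generate_story is a hypothetical helper, not the lecture's own code): where always taking cfd[word].max() loops on the same words, choice() picks any successor observed in the corpus:

import nltk
from random import choice
from nltk.corpus import brown

cfd = nltk.ConditionalFreqDist(nltk.bigrams(brown.words(categories='romance')))

def generate_story(cfd, word, length=20):
    words = [word]
    for _ in range(length):
        word = choice(list(cfd[word]))   # a random word seen after `word`
        words.append(word)
    return ' '.join(words)

print(generate_story(cfd, 'The'))

Note that choice() samples the successors uniformly rather than by frequency; weighting the draw by the bigram counts would stay closer to the corpus statistics.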
Part-of-Speech Tagging: Terminology
Tagging
• The process of associating labels with each token in a text, using an algorithm to select a tag for each word, e.g.:
  - hand-coded rules
  - statistical taggers
  - the Brill (transformation-based) tagger
  - hybrid taggers: a combination of the above, e.g. by “vote”
Tags
• The labels
Tag Set
• The collection of tags used for a particular task, e.g. the Brown or LOB tagset
Example from the GENIA corpus
Typically, a tagged text is a sequence of whitespace-separated word/tag tokens:
These/DT findings/NNS should/MD be/VB useful/JJ for/IN therapeutic/JJ strategies/NNS and/CC the/DT development/NN of/IN immunosuppressants/NNS targeting/VBG the/DT CD28/NN costimulatory/NN pathway/NN ./.
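NLTK can split such tokens back into (word, tag) pairs with nltk.tag.str2tuple(); a small sketch using the first tokens above:

import nltk

text = 'These/DT findings/NNS should/MD be/VB useful/JJ'
tagged = [nltk.tag.str2tuple(t) for t in text.split()]
print(tagged)   # [('These', 'DT'), ('findings', 'NNS'), ...]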
What does Tagging do?
Collapses Distinctions
• Lexical identity may be discarded
• e.g., all personal pronouns tagged with PRP
Introduces Distinctions
• Ambiguities may be resolved
• e.g. deal tagged with NN or VB
Helps in classification and prediction
Significance of Parts of Speech
A word’s POS tells us a lot about the word and its neighbors:
• Limits the range of meanings (deal), pronunciation (OBject vs obJECT), or both (wind)
• Helps in stemming
• Limits the range of following words
• Can help select nouns from a document for summarization
• Basis for partial parsing (chunked parsing)
• Parsers can build trees directly over the POS tags, instead of maintaining a word-level lexicon
Choosing a tagset
The choice of tagset greatly affects the difficulty of the problem
Need to strike a balance between
• getting better information about context (larger, finer-grained tagsets)
• making it possible for classifiers to do their job (smaller tagsets are easier to predict)
Some of the best-known Tagsets
Brown corpus: 87 tags
• (more when tags are combined, e.g. isn’t)
LOB corpus: 132 tags
Penn Treebank: 45 tags
Lancaster UCREL C5 (used to tag the BNC): 61 tags
Lancaster C7: 145 tags
The Brown Corpus
An early digital corpus (1961)
• Francis and Kučera, Brown University
Contents: 500 texts, each 2000 words long
• From American books, newspapers, magazines
• Representing genres:
• Science fiction, romance fiction, press reportage, scientific writing, popular lore
help(nltk.corpus.brown)
>>> help(nltk.corpus.brown)
| paras(self, fileids=None, categories=None)
|
| raw(self, fileids=None, categories=None)
|
| sents(self, fileids=None, categories=None)
|
| tagged_paras(self, fileids=None, categories=None, simplify_tags=False)
|
| tagged_sents(self, fileids=None, categories=None, simplify_tags=False)
|
| tagged_words(self, fileids=None, categories=None, simplify_tags=False)
|
| words(self, fileids=None, categories=None)
|
nltk.corpus.brown
>>> nltk.corpus.brown.words()
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]
>>> nltk.corpus.brown.tagged_words()
[('The', 'AT'), ('Fulton', 'NP-TL'), ...]
>>> nltk.corpus.brown.tagged_sents()
[[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ('Grand', 'JJ-TL'), ('Jury', 'NN-TL'), ('said', 'VBD'), ('Friday', 'NR'), ('an', 'AT'), ('investigation', 'NN'), …
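For example (assuming a recent NLTK, where FreqDist provides most_common()), counting the tags rather than the words shows which parts of speech dominate a genre:

>>> tag_fd = nltk.FreqDist(tag for (word, tag) in
...     nltk.corpus.brown.tagged_words(categories='news'))
>>> tag_fd.most_common(5)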
Penn Treebank
First large syntactically annotated corpus
1 million words from Wall Street Journal
Part-of-speech tags and syntax trees
help(nltk.corpus.treebank)
| parsed(*args, **kwargs)
| @deprecated: Use .parsed_sents() instead.
|
| parsed_sents(self, files=None)
|
| raw(self, files=None)
|
| read(*args, **kwargs)
| @deprecated: Use .raw() or .sents() or .tagged_sents() or
| .parsed_sents() instead.
|
| sents(self, files=None)
|
| tagged(*args, **kwargs)
| @deprecated: Use .tagged_sents() instead.
|
| tagged_sents(self, files=None)
|
| tagged_words(self, files=None)
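For instance (output abbreviated), the Treebank reader works just like the Brown one:

>>> nltk.corpus.treebank.tagged_words()
[('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ...]
>>> print(nltk.corpus.treebank.parsed_sents()[0])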
How hard is POS tagging?
| Number of tags | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Number of word types | 35340 | 3760 | 264 | 61 | 12 | 2 | 1 |
In the Brown corpus, about 12% of word types are ambiguous, but they account for 40% of word tokens
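These counts can be reproduced, at least approximately (the exact figures depend on case-folding and how combined tags are handled), with a ConditionalFreqDist from word types to their tags:

import nltk
from nltk.corpus import brown

# For each word type, count the distinct tags it occurs with
cfd = nltk.ConditionalFreqDist((w.lower(), t) for (w, t) in brown.tagged_words())
ambiguous = [w for w in cfd.conditions() if len(cfd[w]) > 1]
print(len(ambiguous), 'ambiguous word types out of', len(cfd.conditions()))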
Tagging with lexical frequencies
Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NN
People/NNS continue/VBP to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN
Problem: assign a tag to race given its lexical frequency
Solution: choose the tag with the greater probability
• P(race|VB)
• P(race|NN)
Actual estimate from the Switchboard corpus:
• P(race|NN) = .00041
• P(race|VB) = .00003
This suggests we should always tag race/NN, which is correct 41 times out of 44 (93%)
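This lexical-probability-only strategy is exactly what NLTK's UnigramTagger implements; a minimal sketch, trained on the Brown news category (so it produces Brown tags such as NN):

import nltk
from nltk.corpus import brown

# A unigram tagger gives every word its single most frequent tag,
# ignoring context entirely
tagger = nltk.UnigramTagger(brown.tagged_sents(categories='news'))
print(tagger.tag(['the', 'race', 'is', 'on']))
# 'race' comes out as NN whatever the surrounding words are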
Reminder
Puns play on our assumptions of the next word…
… eg they present us with an unexpected homonym (bends)
ConditionalFreqDist() counts word-pairs: word bigrams
Used for story generation, speech recognition, …
Parts of Speech group words into grammatical categories …
… and separate different functions of a word
In English, many words are ambiguous: 2 or more PoS-tags
Very simple tagger: choose by lexical probability (only)
Better PoS-taggers: to come…