University of Oslo, Department of Informatics

Part-of-speech tagging with Hidden Markov Models

Stephan Oepen, Jonathon Read
INF4820, Department of Informatics, University of Oslo
October 7, 2011
Topics for today

Parts of speech
- Lexical categories
- POS tagging
- A symbolic approach

Hidden Markov Models
- Hidden Markov Models (HMMs) for stochastic POS tagging
Parts of speech

- Known by a variety of names: parts of speech, POS, lexical categories, word classes, morphological classes, lexical tags...
- But essentially they label collections of words that serve similar purposes
- Open classes: new words are created, updated, and deleted all the time
- Closed classes: smaller classes with relatively static membership; usually function words
Open Class Words

Nouns
- Typically denote people, places, things, concepts, phenomena...
- Proper nouns (University of Oslo)
- Common nouns (the rest, e.g. dog, language, idea)
  - Count nouns: countable, with plural forms (dog, dogs, one dog, two dogs)
  - Mass nouns: uncountable (snow, altruism, *two snows)

Verbs
- Typically denote actions, processes, etc.
- Morphological affixes for person, tense, and aspect (eat, eats, eaten, ate)
Open Class Words

Adjectives
- Typically descriptive of a noun, denoting properties, characteristics, qualities, etc. (for example, white, old, bad)
- Can be compared for degree (small, smaller, smallest)

Adverbs
- A very heterogeneous lexical class
- Modify verbs, verb phrases, or other adverbs
- Many possible subclasses:
  - Directional/locative adverbs (here, home, downhill)
  - Degree adverbs (extremely, very, somewhat)
  - Manner adverbs (slowly, delicately)
  - Temporal adverbs (yesterday)
Closed Class Words
I Prepositions: on, under, from, at, near, over, . . .I Determiners: a, an, the, that, . . .I Pronouns: she, who, I, others, . . .I Conjunctions: and, but, or, when, . . .I Auxiliary verbs: can, may, should, must, . . .I Interjections, particles, numerals, negatives, politeness
markers, greetings, existential there . . .
(Examples from Jurafsky & Martin)
Tagsets

The previous lists are by no means exhaustive; there are many different inventories of tags (tagsets).

Consider these examples from the Penn Treebank version of the Brown corpus, which uses 45 tags:

1. The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./.
2. There/EX are/VBP 70/CD children/NNS there/RB
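Tagged text in this slash-separated format is easy to take apart programmatically. As a small illustration (not part of the slides), the helper below splits on the final slash of each token, so punctuation tokens like "./." are handled correctly:

```python
# Split a Penn-Treebank-style tagged sentence into (word, tag) pairs.
# rpartition splits on the LAST slash, so tokens like "./." work too.
def parse_tagged(sentence):
    pairs = []
    for token in sentence.split():
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs

print(parse_tagged("There/EX are/VBP 70/CD children/NNS there/RB"))
```

Note how the same surface form "there" receives two different tags (EX and RB), which is exactly the ambiguity a tagger must resolve.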
Part-of-speech tagging

Part-of-speech tagging is the process of labeling words with the appropriate part of speech.

[string of words] + [tagset] → part-of-speech tagger → [tagged string of words]
Part-of-speech tagging

Part-of-speech tagging is important for a number of tasks in natural language processing:

- Parsing
  - A prerequisite for determining phrase structure
- Lemmatisation
  - Knowing a word's part of speech tells us what affixes might have been applied
- Word sense disambiguation
  - "They felt the plane bank" vs. "Shares in the bank fell"
- Machine translation
  - sky (Norwegian) ⇒ cloud, avoid, or shy in English?
Rule-based part-of-speech tagging

A two-stage solution:

1. Morphological analysis and dictionary look-up to enumerate all possible POS for each word
2. Apply hand-written rules to remove inconsistent tags

Adverbial-that rule
Given input: "that"
if
    (+1 A/ADV/QUANT);   /* if the next word is an adjective, adverb, or quantifier */
    (+2 SENT+LIM);      /* and the one following it is a sentence boundary, */
    (NOT -1 SVOC/A);    /* and the previous word is not a verb like 'consider' */
                        /* which allows adjectives as object complements */
then eliminate non-ADV tags
else eliminate ADV tag
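As a rough sketch of how such a constraint could be applied to candidate tag sets (the tag names and the set of SVOC/A verbs here are hypothetical stand-ins, not the actual EngCG inventory):

```python
# Sketch of the adverbial-that rule over per-word candidate-tag sets.
# Tag labels and the SVOC/A verb list are illustrative assumptions.
ADJ_ADV_QUANT = {"ADJ", "ADV", "QUANT"}
SVOC_A_VERBS = {"consider", "deem", "find"}  # verbs taking adjectival object complements

def apply_adverbial_that(words, candidates, i):
    """Prune the candidate tags of words[i] == 'that' per the rule above.

    candidates is a list of sets of possible tags, one set per word."""
    if words[i].lower() != "that":
        return
    next_ok = i + 1 < len(words) and candidates[i + 1] & ADJ_ADV_QUANT
    boundary = i + 2 >= len(words)          # sentence boundary two words ahead
    prev_ok = i == 0 or words[i - 1].lower() not in SVOC_A_VERBS
    if next_ok and boundary and prev_ok:
        candidates[i] &= {"ADV"}            # eliminate non-ADV tags
    else:
        candidates[i] -= {"ADV"}            # eliminate the ADV tag
```

For "it was that simple", the conditions hold and only ADV survives for "that"; after "consider", the ADV reading is eliminated instead.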
Stochastic part-of-speech tagging

View part-of-speech tagging as a sequence classification task:
- given a sequence of words w_1^n,
- determine the corresponding sequence of classes \hat{t}_1^n:

    \hat{t}_1^n = \arg\max_{t_1^n} P(t_1^n \mid w_1^n)

Note on notation: \arg\max_x f(x) should be read as "the x such that f(x) is maximised".
Refactoring for a tractable formulation

    \hat{t}_1^n = \arg\max_{t_1^n} P(t_1^n \mid w_1^n)
               \approx \arg\max_{t_1^n} \prod_{i=1}^{n} P(w_i \mid t_i) P(t_i \mid t_{i-1})
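The product on the right-hand side can be evaluated directly for any candidate tag sequence. A minimal sketch, computed in log space to avoid underflow (the table layout and the "&lt;s&gt;" start marker are assumptions, not part of the slides):

```python
import math

# Score one candidate tag sequence under the bigram approximation:
# sum of log P(w_i | t_i) + log P(t_i | t_{i-1}).
# trans[(prev, t)] and emit[(t, w)] are hypothetical probability tables.
def score(words, tags, trans, emit):
    logp = 0.0
    prev = "<s>"                      # start-of-sentence marker
    for w, t in zip(words, tags):
        logp += math.log(trans[(prev, t)]) + math.log(emit[(t, w)])
        prev = t
    return logp
```

The best tagging is then the candidate sequence maximising this score; enumerating all candidates is what the search-space discussion below addresses.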
Estimation

Tag transition probabilities
Based on a training corpus of previously tagged text, the MLE can be computed from the counts of observed tags:

    P(t_i \mid t_{i-1}) = C(t_{i-1}, t_i) / C(t_{i-1})

Word likelihoods
Computed from relative frequencies in the same way:

    P(w_i \mid t_i) = C(t_i, w_i) / C(t_i)

Sparse data problem
The issues related to MLE and smoothing that we discussed for n-gram models also apply here...
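The two count formulas translate almost line for line into code. A sketch of unsmoothed MLE estimation from a tagged corpus (the corpus format and the "&lt;s&gt;" start marker are assumptions for illustration):

```python
from collections import Counter

# Unsmoothed MLE estimates, following the count formulas above.
# corpus is a list of sentences, each a list of (word, tag) pairs.
def estimate(corpus):
    tag_count = Counter()        # C(t)
    bigram_count = Counter()     # C(t_{i-1}, t_i)
    word_tag_count = Counter()   # C(t, w)
    for sentence in corpus:
        prev = "<s>"             # start marker, counted so P(t | <s>) normalises
        tag_count["<s>"] += 1
        for word, tag in sentence:
            tag_count[tag] += 1
            bigram_count[(prev, tag)] += 1
            word_tag_count[(tag, word)] += 1
            prev = tag
    trans = {(p, t): c / tag_count[p] for (p, t), c in bigram_count.items()}
    emit = {(t, w): c / tag_count[t] for (t, w), c in word_tag_count.items()}
    return trans, emit
```

Any word/tag pair unseen in training gets probability zero here, which is exactly the sparse-data problem the slide mentions; in practice the counts would be smoothed.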
Search space

The          determiner                   1
brief        noun, adjective, adverb      ×3 = 3
notes        noun, verb                   ×2 = 6
introducing  verb
each         adjective, adverb, pronoun   ×3 = 18
work         noun, verb, adjective        ×3 = 54
offer        noun, verb                   ×2 = 108
salient      noun, adjective              ×2 = 216
historical   adjective
or           noun, conjunction            ×2 = 432
technical    noun, adjective              ×2 = 864
points       noun, verb                   ×2 = 1728
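The running total in the right-hand column is simply the product of each word's ambiguity, which can be checked in one line:

```python
from math import prod

# Per-word tag ambiguities for the twelve-word example above.
ambiguity = [1, 3, 2, 1, 3, 3, 2, 2, 1, 2, 2, 2]

# The number of candidate tag sequences is their product.
print(prod(ambiguity))  # 1728
```

Even this short sentence has 1728 candidate tag sequences; the count grows exponentially with sentence length, which is why exhaustive enumeration is hopeless.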
Search space

- A large search space
- Composed of several smaller searches

Dynamic programming
- decompose into smaller problems
- solve each smaller problem once
- combine the results to find the overall solution
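Applied to tagging, this dynamic-programming idea is the Viterbi algorithm: one sub-problem per (position, tag), each solved once. A minimal sketch, assuming toy probability tables with a "&lt;s&gt;" start state and ignoring the final-state transition for simplicity:

```python
import math

# Minimal Viterbi decoder for the bigram tagging HMM.
# trans[(prev, t)] and emit[(t, w)] are hypothetical probability tables.
def viterbi(words, tagset, trans, emit):
    # best[t] = (log prob of best tag path ending in t, that path)
    best = {"<s>": (0.0, [])}
    for w in words:
        new_best = {}
        for t in tagset:
            e = emit.get((t, w), 0.0)
            if e == 0.0:
                continue
            # choose the best predecessor state for tag t (solved once, reused)
            cands = []
            for prev, (lp, path) in best.items():
                a = trans.get((prev, t), 0.0)
                if a > 0.0:
                    cands.append((lp + math.log(a) + math.log(e), path + [t]))
            if cands:
                new_best[t] = max(cands)
        best = new_best
    return max(best.values())[1] if best else []
```

Because only the best path into each tag is kept at every position, the work is proportional to (sentence length) × (tagset size)², not to the exponential number of candidate sequences.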
Hidden Markov Models

A hidden Markov model lets us handle both:
- observed events (like the words in a sentence) and
- hidden events (like part-of-speech tags).

Q = q_1 q_2 ... q_N: a set of N states

A = a_11 a_12 ... a_1N ... a_NN: a transition probability matrix A, representing the probability of moving from state i to state j, such that \sum_{j=1}^{N} a_ij = 1 for all i

O = o_1 o_2 ... o_T: a sequence of T observations, each one drawn from a vocabulary V = v_1 v_2 ... v_V

B = b_i(o_t): a sequence of observation likelihoods, also called emission probabilities, each expressing the probability of an observation o_t being generated from a state i

q_0, q_F: a special start state and final state that are not associated with observations, together with transition probabilities a_01 a_02 ... a_0N out of the start state and a_1F a_2F ... a_NF into the final state.