University of Oslo, Department of Informatics

Part-of-speech tagging with Hidden Markov Models

Stephan Oepen, Jonathon Read
INF4820, Department of Informatics, University of Oslo
October 7, 2011
Topics for today

Parts of speech
- Lexical categories
- POS tagging
- A symbolic approach

Hidden Markov Models
- Hidden Markov Models (HMMs) for stochastic POS tagging
Parts of speech

- Known by a variety of names: parts of speech, POS, lexical categories, word classes, morphological classes, lexical tags...
- But essentially they label collections of words that serve similar purposes
- Open classes: new words are created, updated, and deleted all the time
- Closed classes: smaller classes with relatively static membership; usually function words
Open Class Words

Nouns
- Typically denote people, places, things, concepts, phenomena...
- Proper nouns (University of Oslo)
- Common nouns (the rest, e.g. dog, language, idea)
  - Count nouns: countable, with plural forms (dog, dogs, one dog, two dogs)
  - Mass nouns: uncountable (snow, altruism, *two snows)

Verbs
- Typically denote actions, processes, etc.
- Morphological affixes for person, tense, and aspect (eat, eats, eaten, ate)
Open Class Words

Adjectives
- Typically descriptive of a noun, denoting properties, characteristics, qualities, etc. (for example, white, old, bad)
- Can be compared for degree (small, smaller, smallest)

Adverbs
- A very heterogeneous lexical class
- Modify verbs, verb phrases, or other adverbs
- Many possible subclasses:
  - Directional/locative adverbs (here, home, downhill)
  - Degree adverbs (extremely, very, somewhat)
  - Manner adverbs (slowly, delicately)
  - Temporal adverbs (yesterday)
Closed Class Words
I Prepositions: on, under, from, at, near, over, . . .I Determiners: a, an, the, that, . . .I Pronouns: she, who, I, others, . . .I Conjunctions: and, but, or, when, . . .I Auxiliary verbs: can, may, should, must, . . .I Interjections, particles, numerals, negatives, politeness
markers, greetings, existential there . . .
(Examples from Jurafsky & Martin)
Tagsets

The previous lists are by no means exhaustive; there are many different inventories of tags (tagsets).

Consider these examples from the Penn Treebank version of the Brown corpus, which uses 45 tags:

1. The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./.
2. There/EX are/VBP 70/CD children/NNS there/RB
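Tagged text in this slash-separated format is easy to take apart programmatically. As a small illustration (not part of the slides), the helper below splits on the final slash of each token, so punctuation tokens like "./." are handled correctly:

```python
# Split a Penn-Treebank-style tagged sentence into (word, tag) pairs.
# rpartition splits on the LAST slash, so tokens like "./." work too.
def parse_tagged(sentence):
    pairs = []
    for token in sentence.split():
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs

print(parse_tagged("There/EX are/VBP 70/CD children/NNS there/RB"))
```

Note how the same surface form "there" receives two different tags (EX and RB), which is exactly the ambiguity a tagger must resolve.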
Part-of-speech tagging

Part-of-speech tagging is the process of labeling words with the appropriate part of speech.

[string of words] + [tagset] → part-of-speech tagger → [tagged string of words]
Part-of-speech tagging

Part-of-speech tagging is important for a number of tasks in natural language processing:

- Parsing
  - A prerequisite for determining phrase structure
- Lemmatisation
  - Knowing a word's part of speech tells us what affixes might have been applied
- Word sense disambiguation
  - "They felt the plane bank" vs. "Shares in the bank fell"
- Machine translation
  - sky (Norwegian) ⇒ cloud, avoid, or shy in English?
Rule-based part-of-speech tagging

A two-stage solution:

1. Morphological analysis and dictionary look-up to enumerate all possible POS for each word
2. Apply hand-written rules to remove inconsistent tags

Adverbial-that rule
Given input: "that"
if
    (+1 A/ADV/QUANT);   /* if the next word is an adjective, adverb, or quantifier */
    (+2 SENT+LIM);      /* and the one following it is a sentence boundary, */
    (NOT -1 SVOC/A);    /* and the previous word is not a verb like 'consider' */
                        /* which allows adjectives as object complements */
then eliminate non-ADV tags
else eliminate ADV tag
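As a rough sketch of how such a constraint could be applied to candidate tag sets (the tag names and the set of SVOC/A verbs here are hypothetical stand-ins, not the actual EngCG inventory):

```python
# Sketch of the adverbial-that rule over per-word candidate-tag sets.
# Tag labels and the SVOC/A verb list are illustrative assumptions.
ADJ_ADV_QUANT = {"ADJ", "ADV", "QUANT"}
SVOC_A_VERBS = {"consider", "deem", "find"}  # verbs taking adjectival object complements

def apply_adverbial_that(words, candidates, i):
    """Prune the candidate tags of words[i] == 'that' per the rule above.

    candidates is a list of sets of possible tags, one set per word."""
    if words[i].lower() != "that":
        return
    next_ok = i + 1 < len(words) and candidates[i + 1] & ADJ_ADV_QUANT
    boundary = i + 2 >= len(words)          # sentence boundary two words ahead
    prev_ok = i == 0 or words[i - 1].lower() not in SVOC_A_VERBS
    if next_ok and boundary and prev_ok:
        candidates[i] &= {"ADV"}            # eliminate non-ADV tags
    else:
        candidates[i] -= {"ADV"}            # eliminate the ADV tag
```

For "it was that simple", the conditions hold and only ADV survives for "that"; after "consider", the ADV reading is eliminated instead.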
Stochastic part-of-speech tagging

View part-of-speech tagging as a sequence classification task:
- given a sequence of words w_1^n,
- determine the corresponding sequence of classes \hat{t}_1^n:

    \hat{t}_1^n = \arg\max_{t_1^n} P(t_1^n \mid w_1^n)

Note on notation: \arg\max_x f(x) should be read as "the x such that f(x) is maximised".
Refactoring for a tractable formulation

    \hat{t}_1^n = \arg\max_{t_1^n} P(t_1^n \mid w_1^n)
               \approx \arg\max_{t_1^n} \prod_{i=1}^{n} P(w_i \mid t_i) P(t_i \mid t_{i-1})
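The product on the right-hand side can be evaluated directly for any candidate tag sequence. A minimal sketch, computed in log space to avoid underflow (the table layout and the "&lt;s&gt;" start marker are assumptions, not part of the slides):

```python
import math

# Score one candidate tag sequence under the bigram approximation:
# sum of log P(w_i | t_i) + log P(t_i | t_{i-1}).
# trans[(prev, t)] and emit[(t, w)] are hypothetical probability tables.
def score(words, tags, trans, emit):
    logp = 0.0
    prev = "<s>"                      # start-of-sentence marker
    for w, t in zip(words, tags):
        logp += math.log(trans[(prev, t)]) + math.log(emit[(t, w)])
        prev = t
    return logp
```

The best tagging is then the candidate sequence maximising this score; enumerating all candidates is what the search-space discussion below addresses.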
Estimation

Tag transition probabilities
Based on a training corpus of previously tagged text, the MLE can be computed from the counts of observed tags:

    P(t_i \mid t_{i-1}) = C(t_{i-1}, t_i) / C(t_{i-1})

Word likelihoods
Computed from relative frequencies in the same way:

    P(w_i \mid t_i) = C(t_i, w_i) / C(t_i)

Sparse data problem
The issues related to MLE and smoothing that we discussed for n-gram models also apply here...
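The two count formulas translate almost line for line into code. A sketch of unsmoothed MLE estimation from a tagged corpus (the corpus format and the "&lt;s&gt;" start marker are assumptions for illustration):

```python
from collections import Counter

# Unsmoothed MLE estimates, following the count formulas above.
# corpus is a list of sentences, each a list of (word, tag) pairs.
def estimate(corpus):
    tag_count = Counter()        # C(t)
    bigram_count = Counter()     # C(t_{i-1}, t_i)
    word_tag_count = Counter()   # C(t, w)
    for sentence in corpus:
        prev = "<s>"             # start marker, counted so P(t | <s>) normalises
        tag_count["<s>"] += 1
        for word, tag in sentence:
            tag_count[tag] += 1
            bigram_count[(prev, tag)] += 1
            word_tag_count[(tag, word)] += 1
            prev = tag
    trans = {(p, t): c / tag_count[p] for (p, t), c in bigram_count.items()}
    emit = {(t, w): c / tag_count[t] for (t, w), c in word_tag_count.items()}
    return trans, emit
```

Any word/tag pair unseen in training gets probability zero here, which is exactly the sparse-data problem the slide mentions; in practice the counts would be smoothed.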
Search space

The          determiner                   1
brief        noun, adjective, adverb      ×3 = 3
notes        noun, verb                   ×2 = 6
introducing  verb
each         adjective, adverb, pronoun   ×3 = 18
work         noun, verb, adjective        ×3 = 54
offer        noun, verb                   ×2 = 108
salient      noun, adjective              ×2 = 216
historical   adjective
or           noun, conjunction            ×2 = 432
technical    noun, adjective              ×2 = 864
points       noun, verb                   ×2 = 1728
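The running total in the right-hand column is simply the product of each word's ambiguity, which can be checked in one line:

```python
from math import prod

# Per-word tag ambiguities for the twelve-word example above.
ambiguity = [1, 3, 2, 1, 3, 3, 2, 2, 1, 2, 2, 2]

# The number of candidate tag sequences is their product.
print(prod(ambiguity))  # 1728
```

Even this short sentence has 1728 candidate tag sequences; the count grows exponentially with sentence length, which is why exhaustive enumeration is hopeless.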
Search space

- A large search space
- Composed of several smaller searches

Dynamic programming
- decompose into smaller problems
- solve each smaller problem once
- combine the results to find the overall solution
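Applied to tagging, this dynamic-programming idea is the Viterbi algorithm: one sub-problem per (position, tag), each solved once. A minimal sketch, assuming toy probability tables with a "&lt;s&gt;" start state and ignoring the final-state transition for simplicity:

```python
import math

# Minimal Viterbi decoder for the bigram tagging HMM.
# trans[(prev, t)] and emit[(t, w)] are hypothetical probability tables.
def viterbi(words, tagset, trans, emit):
    # best[t] = (log prob of best tag path ending in t, that path)
    best = {"<s>": (0.0, [])}
    for w in words:
        new_best = {}
        for t in tagset:
            e = emit.get((t, w), 0.0)
            if e == 0.0:
                continue
            # choose the best predecessor state for tag t (solved once, reused)
            cands = []
            for prev, (lp, path) in best.items():
                a = trans.get((prev, t), 0.0)
                if a > 0.0:
                    cands.append((lp + math.log(a) + math.log(e), path + [t]))
            if cands:
                new_best[t] = max(cands)
        best = new_best
    return max(best.values())[1] if best else []
```

Because only the best path into each tag is kept at every position, the work is proportional to (sentence length) × (tagset size)², not to the exponential number of candidate sequences.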
Hidden Markov Models

A hidden Markov model lets us handle both:
- observed events (like the words in a sentence) and
- hidden events (like part-of-speech tags).

Q = q_1 q_2 ... q_N: a set of N states

A = a_11 a_12 ... a_1N ... a_NN: a transition probability matrix A, representing the probability of moving from state i to state j, such that \sum_{j=1}^{N} a_ij = 1 for all i

O = o_1 o_2 ... o_T: a sequence of T observations, each one drawn from a vocabulary V = v_1 v_2 ... v_V

B = b_i(o_t): a sequence of observation likelihoods, also called emission probabilities, each expressing the probability of an observation o_t being generated from a state i

q_0, q_F: a special start state and final state that are not associated with observations, together with transition probabilities a_01 a_02 ... a_0N out of the start state and a_1F a_2F ... a_NF into the final state.