Learning Hidden Markov Model Structure for Information Extraction
Kristie Seymore,
Andrew McCallum,
& Ronald Rosenfeld
Hidden Markov Model Structures
Machine learning tool applied to Information Extraction
Part of speech tagging (Kupiec 1992)
Topic detection & tracking (Yamron et al. 1998)
Dialog act modeling (Stolcke, Shriberg, & others 1998)
HMM in Information Extraction
Gene names and locations (Leek 1997)
Named-entity extraction (Nymble system – Freitag & McCallum 1999)
Information Extraction Strategy
1 HMM = 1 Field
1 state / class
Hand-built models using human data inspection
HMM Advantages
Strong statistical foundations
Used widely in Natural Language Processing
Handles new data robustly
Uses established training algorithms which are computationally efficient to develop and evaluate
HMM Disadvantages
Require a priori notion of model topology
Need large amounts of training data to use
Authors’ Contribution
Automatically determined model structure from data
One HMM to extract all information
Introduced DISTANTLY-LABELED DATA
OUTLINE
Information Extraction basics with HMM
Learning model structure from data
Training data
Experiment results
Model selection
Error breakdown
Conclusions
Future work
Information Extraction basics with HMM
OBJECTIVE – to label every word of CS research paper headers
Title, Author, Date, Keyword, etc.
1 HMM / 1 Header
Initial state to Final state
Discrete output, First-order HMM
Q – set of states
q_I – initial state
q_F – final state
Σ = {σ_1, σ_2, . . . , σ_m} – discrete output vocabulary
X = x_1 x_2 . . . x_l – output string
PROCESS: Initial state -> new state -> emit output symbol -> another state -> emit another output symbol -> . . . FINAL STATE
PARAMETERS
P(q -> q') – transition probabilities
P(q ↑ σ) – emission probabilities
The probability of string x being emitted by an HMM M is computed as a sum over all possible paths, where q_0 and q_{l+1} are restricted to be q_I and q_F respectively, and x_{l+1} is an end-of-string token (uses the Forward algorithm)
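As a concrete illustration, here is a minimal sketch of the Forward algorithm for a discrete-output, first-order HMM. The dict-based parameter layout and the state names ("I" for q_I, "F" for q_F) are assumptions chosen for readability, not the authors' implementation.

```python
# A minimal Forward-algorithm sketch for a discrete-output, first-order
# HMM with dedicated initial and final states. The dict-based parameter
# layout is a hypothetical stand-in, not the paper's code.

def forward_probability(x, states, trans, emit, q_i="I", q_f="F"):
    """P(x | M): sum over all state paths from q_i to q_f emitting x."""
    # alpha[q] = probability of emitting x[0..t] and being in state q at t
    alpha = {q: trans[(q_i, q)] * emit[(q, x[0])] for q in states}
    for symbol in x[1:]:
        alpha = {
            q: sum(alpha[p] * trans[(p, q)] for p in states) * emit[(q, symbol)]
            for q in states
        }
    # the transition into the final state ends the string
    return sum(alpha[q] * trans[(q, q_f)] for q in states)
```

Because the final state is explicit, the result is the probability of emitting exactly the string x and then terminating.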
The output is observable, but the underlying state sequence is HIDDEN
To recover the state sequence V(x|M) that has the highest probability of having produced the observation sequence (uses the Viterbi algorithm)
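Viterbi decoding can be sketched the same way: instead of summing over paths, keep the single best path into each state. The dict-based layout and state names below are hypothetical, not the authors' implementation.

```python
# A minimal Viterbi-decoding sketch for a discrete-output, first-order
# HMM with dedicated initial and final states. The dict-based parameter
# layout is a hypothetical stand-in, not the paper's code.

def viterbi_path(x, states, trans, emit, q_i="I", q_f="F"):
    """Return (probability, state sequence) of the best path emitting x."""
    # delta[q] = (highest probability of any path ending in q, that path)
    delta = {q: (trans[(q_i, q)] * emit[(q, x[0])], [q]) for q in states}
    for symbol in x[1:]:
        delta = {
            q: max(
                ((delta[p][0] * trans[(p, q)] * emit[(q, symbol)],
                  delta[p][1] + [q]) for p in states),
                key=lambda cand: cand[0],
            )
            for q in states
        }
    # fold in the transition to the final state
    return max(
        ((delta[q][0] * trans[(q, q_f)], delta[q][1]) for q in states),
        key=lambda cand: cand[0],
    )
```

In the header-extraction setting, the recovered state sequence directly gives each word's class tag.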
HMM application
Each state has a class (e.g. title, author)
Each word in the header is an observation
Each state emits words from the header with an associated CLASS TAG
This is learned from TRAINING DATA
Learning model structure from data
Decide on states and associated transition states
Set up labeled training data
Use MERGE techniques
• Neighbor merge (link all adjacent words in title)
• V-merging – 2 states with same label and transitions (one transition to title and out)
Apply Bayesian model merging to maximize result accuracy
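The core operation behind both merge techniques is collapsing two states that share a label by pooling their counts. A sketch, assuming a hypothetical count-dict layout (not the authors' implementation):

```python
# Collapse two same-label HMM states into one by pooling their
# transition and emission counts, as in neighbor-merging / V-merging.
# Count-dict layout and state names are hypothetical.

from collections import Counter

def merge_states(q1, q2, merged, trans_counts, emit_counts):
    """Replace states q1 and q2 everywhere with a single state `merged`."""
    rename = lambda q: merged if q in (q1, q2) else q
    new_trans, new_emit = Counter(), Counter()
    for (p, q), n in trans_counts.items():
        new_trans[(rename(p), rename(q))] += n
    for (q, word), n in emit_counts.items():
        new_emit[(rename(q), word)] += n
    return new_trans, new_emit
```

After a V-merge, two parallel title states with separate transitions in and out become one title state with a single transition in and out.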
Example Hidden Markov Model
Bayesian model merging seeks to find the model structure that maximizes the probability of the model (M) given some training data (D), by iteratively merging states until an optimal tradeoff between fit to the data and model size has been reached
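Written out, this is the standard Bayesian posterior over model structures (notation reconstructed, not quoted from the slides):

```latex
M^{*} \;=\; \operatorname*{argmax}_{M} P(M \mid D)
      \;=\; \operatorname*{argmax}_{M} \frac{P(D \mid M)\,P(M)}{P(D)}
      \;=\; \operatorname*{argmax}_{M} P(D \mid M)\,P(M)
```

Here P(D | M) measures fit to the training data, and the prior P(M) penalizes larger models, which is the fit-versus-size tradeoff the merging procedure optimizes.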
Three types of training data
Labeled data
Unlabeled data
Distantly-labeled data
Labeled data
Manual and expensive
Provides COUNTS function c() to estimate model parameters
Formulas for deriving parameters using counts c():
(4) Transition probabilities: P(q -> q') = c(q -> q') / Σ_s c(q -> s)
(5) Emission probabilities: P(q ↑ σ) = c(q ↑ σ) / Σ_ρ c(q ↑ ρ)
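In code, the estimates in (4) and (5) are just ratios of observed counts. This sketch assumes labeled headers are represented as lists of (state, word) pairs, which is a hypothetical stand-in for the paper's labeled training data:

```python
# Maximum-likelihood parameter estimates from labeled data: each
# probability is a ratio of counts, as in equations (4) and (5).
# The (state, word) representation is a hypothetical stand-in.

from collections import Counter

def estimate_parameters(labeled_sequences, q_i="I", q_f="F"):
    """Estimate transition and emission probabilities from labeled headers."""
    trans_counts, emit_counts = Counter(), Counter()
    for seq in labeled_sequences:
        path = [q_i] + [state for state, _ in seq] + [q_f]
        for p, q in zip(path, path[1:]):
            trans_counts[(p, q)] += 1           # c(q -> q')
        for state, word in seq:
            emit_counts[(state, word)] += 1     # c(q emits sigma)
    trans_totals, emit_totals = Counter(), Counter()
    for (p, _), n in trans_counts.items():
        trans_totals[p] += n
    for (q, _), n in emit_counts.items():
        emit_totals[q] += n
    # (4) P(q -> q') = c(q -> q') / sum_s c(q -> s)
    trans = {k: n / trans_totals[k[0]] for k, n in trans_counts.items()}
    # (5) P(q emits sigma) = c(q emits sigma) / sum_rho c(q emits rho)
    emit = {k: n / emit_totals[k[0]] for k, n in emit_counts.items()}
    return trans, emit
```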
Unlabeled Data
Needs estimated parameters from labeled data
Use Baum-Welch training algorithm
• Iterative expectation-maximization algorithm which adjusts model parameters to locally maximize the likelihood of the unlabeled data
• Sensitive to initial parameters
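One Baum-Welch iteration can be sketched as a forward pass, a backward pass, and a re-normalization of expected counts. The dict-based parameter layout, with dedicated initial/final states "I" and "F", is an illustrative assumption, not the paper's code:

```python
# One Baum-Welch (forward-backward) EM iteration for a discrete-output,
# first-order HMM with dedicated initial and final states. Dict-based
# parameter layout is a hypothetical stand-in for the paper's models.

from collections import Counter

def baum_welch_step(sequences, states, trans, emit, q_i="I", q_f="F"):
    """Re-estimate transition and emission probabilities from unlabeled data."""
    t_counts, e_counts = Counter(), Counter()
    for x in sequences:
        n = len(x)
        # forward pass: alpha[t][q] = P(x[0..t], state at t is q)
        alpha = [{q: trans[(q_i, q)] * emit[(q, x[0])] for q in states}]
        for t in range(1, n):
            alpha.append({
                q: sum(alpha[t - 1][p] * trans[(p, q)] for p in states)
                   * emit[(q, x[t])]
                for q in states
            })
        # backward pass: beta[t][q] = P(x[t+1..], reach q_f | state at t is q)
        beta = [None] * n
        beta[n - 1] = {q: trans[(q, q_f)] for q in states}
        for t in range(n - 2, -1, -1):
            beta[t] = {
                q: sum(trans[(q, r)] * emit[(r, x[t + 1])] * beta[t + 1][r]
                       for r in states)
                for q in states
            }
        z = sum(alpha[n - 1][q] * trans[(q, q_f)] for q in states)  # P(x | M)
        # E-step: accumulate expected transition and emission counts
        for q in states:
            t_counts[(q_i, q)] += alpha[0][q] * beta[0][q] / z
            t_counts[(q, q_f)] += alpha[n - 1][q] * trans[(q, q_f)] / z
            for t in range(n):
                e_counts[(q, x[t])] += alpha[t][q] * beta[t][q] / z
            for r in states:
                for t in range(n - 1):
                    t_counts[(q, r)] += (alpha[t][q] * trans[(q, r)]
                                         * emit[(r, x[t + 1])]
                                         * beta[t + 1][r] / z)
    # M-step: normalize pooled expected counts into probabilities
    t_totals, e_totals = Counter(), Counter()
    for (p, _), v in t_counts.items():
        t_totals[p] += v
    for (q, _), v in e_counts.items():
        e_totals[q] += v
    new_trans = {k: v / t_totals[k[0]] for k, v in t_counts.items()}
    new_emit = {k: v / e_totals[k[0]] for k, v in e_counts.items()}
    return new_trans, new_emit
```

Because every quantity is scaled by the current parameters, the result depends on the starting point, which is why the slides stress sensitivity to initial parameters.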
Distantly-labeled data
Data labeled for another purpose
Partially applicable to this domain for training
EXAMPLE – CS research headers vs. BibTeX bibliographic labeled citations
Experiment results
Prepare text using a computer program
• Header – from beginning to INTRODUCTION or end of 1st page
• Remove punctuation, case, & newlines
• Labels:
  +ABSTRACT+  Abstract
  +INTRO+  Introduction
  +PAGE+  End of 1st page
Manually label 1000 headers
• Minus 65 discarded due to poor format
Derive fixed word vocabularies from training
Sources & Amounts of Training Data
Model selection
MODELS 1-4 – 1 state / class
MODEL 1 – fully connected HMM with uniform transition estimates between states
MODEL 2 – maximum likelihood transition estimate, with others uniform
MODEL 3 – all transitions given maximum likelihood estimates – used as the BASELINE HMM
MODEL 4 – adds smoothing – no zero probabilities
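The effect of Model 4's smoothing can be illustrated with add-one (Laplace) smoothing, one common scheme that guarantees no zero probabilities; the slides do not state which smoothing method the authors actually used.

```python
# Add-one (Laplace) smoothing: every outcome, even one never observed
# in training, gets nonzero probability mass. Shown as one common
# scheme; not necessarily the one used in the paper.

def smoothed_probability(count, total, num_outcomes):
    """(c + 1) / (N + V), where V is the number of possible outcomes."""
    return (count + 1) / (total + num_outcomes)
```

Unseen words no longer force a path's probability to zero, which matters when decoding headers containing vocabulary absent from training data.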
ACCURACY OF MODELS (by % word classification accuracy)
L – labeled data
L+D – labeled and distantly-labeled data
Multiple states / class:
– hand-set distantly-labeled states
+ automatically-set distantly-labeled states
Compared BASELINE to best MULTI-STATE to V-MERGED models
UNLABELED DATA & TRAINING
INITIAL – L + D + U, λ = 0.5 for each emission distribution
λ varies – optimum distribution
PP – includes smoothing
Error breakdown
Errors by CLASS TAG
BOLD – distantly-labeled data tags
Conclusions
The approach works well on research paper headers
Improvement factors:
• Multi-state classes
• Distantly-labeled data (10% improvement)
• Distantly-labeled data can reduce the amount of labeled data needed
Future work
Use Bayesian model merging to completely automate model learning
Also describe layout by position on the page
Model internal state structure
Model of Internal State Structure
First 2 words – explicit
Multiple affiliations possible
Last 2 words – explicit
My Assessment
Highly mathematical and complex
Even unlabeled data is in a preset order
Model requires work setting up training data
A change in target data will completely change the model
Valuable experiments with heuristics and smoothing impacting results
Wish they had included a sample 1st page
QUESTIONS