Learning Hidden Markov Model Structure for Information Extraction
Kristie Seymore,
Andrew McCallum,
& Ronald Rosenfeld
Hidden Markov Model Structures
Machine learning tool applied to Information Extraction
Part of speech tagging (Kupiec 1992)
Topic detection & tracking (Yamron et al. 1998)
Dialog act modeling (Stolcke, Shriberg, & others 1998)
HMM in Information Extraction
Gene names and locations (Leek 1997)
Named-entity extraction (Nymble system – Freitag & McCallum 1999)
Information Extraction Strategy
1 HMM = 1 Field
1 state / class
Hand-built models using human data inspection
HMM Advantages
Strong statistical foundations
Used widely in Natural Language Processing
Handles new data robustly
Uses established training algorithms which are computationally efficient to develop and evaluate
HMM Disadvantages
Require a priori notion of model topology
Need large amounts of training data to use
Authors’ Contribution
Automatically determined model structure from data
One HMM to extract all information
Introduced DISTANTLY-LABELED DATA
OUTLINE
Information Extraction basics with HMM
Learning model structure from data
Training data
Experiment results
Model selection
Error breakdown
Conclusions
Future work
Information Extraction basics with HMM
OBJECTIVE – to label every word of CS research paper headers
Title, Author, Date, Keyword, etc.
1 HMM / 1 Header
Initial state to Final state
Discrete output, First-order HMM
Q – set of states
q_I – initial state
q_F – final state
Σ = {σ_1, σ_2, . . . , σ_m} – discrete output vocabulary
X = x_1 x_2 . . . x_l – output string
PROCESS: Initial state -> new state -> emit output symbol -> another state -> emit another output symbol -> . . . FINAL STATE
PARAMETERS
P(q -> q') – transition probabilities
P(q ↑ σ) – emission probabilities
The probability of string x being emitted by an HMM M is computed as a sum over all possible paths, where q_0 and q_{l+1} are restricted to be q_I and q_F respectively, and x_{l+1} is an end-of-string token (uses the Forward algorithm)
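As a concrete illustration, here is a minimal sketch of the Forward algorithm for a discrete-output, first-order HMM. The dict-based parameter layout and the state names ("I" for q_I, "F" for q_F) are assumptions chosen for readability, not the authors' implementation.

```python
# A minimal Forward-algorithm sketch for a discrete-output, first-order
# HMM with dedicated initial and final states. The dict-based parameter
# layout is a hypothetical stand-in, not the paper's code.

def forward_probability(x, states, trans, emit, q_i="I", q_f="F"):
    """P(x | M): sum over all state paths from q_i to q_f emitting x."""
    # alpha[q] = probability of emitting x[0..t] and being in state q at t
    alpha = {q: trans[(q_i, q)] * emit[(q, x[0])] for q in states}
    for symbol in x[1:]:
        alpha = {
            q: sum(alpha[p] * trans[(p, q)] for p in states) * emit[(q, symbol)]
            for q in states
        }
    # the transition into the final state ends the string
    return sum(alpha[q] * trans[(q, q_f)] for q in states)
```

Because the final state is explicit, the result is the probability of emitting exactly the string x and then terminating.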
The output is observable, but the underlying state sequence is HIDDEN
To recover the state sequence V(x|M) that has the highest probability of having produced the observation sequence (uses the Viterbi algorithm)
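Viterbi decoding can be sketched the same way: instead of summing over paths, keep the single best path into each state. The dict-based layout and state names below are hypothetical, not the authors' implementation.

```python
# A minimal Viterbi-decoding sketch for a discrete-output, first-order
# HMM with dedicated initial and final states. The dict-based parameter
# layout is a hypothetical stand-in, not the paper's code.

def viterbi_path(x, states, trans, emit, q_i="I", q_f="F"):
    """Return (probability, state sequence) of the best path emitting x."""
    # delta[q] = (highest probability of any path ending in q, that path)
    delta = {q: (trans[(q_i, q)] * emit[(q, x[0])], [q]) for q in states}
    for symbol in x[1:]:
        delta = {
            q: max(
                ((delta[p][0] * trans[(p, q)] * emit[(q, symbol)],
                  delta[p][1] + [q]) for p in states),
                key=lambda cand: cand[0],
            )
            for q in states
        }
    # fold in the transition to the final state
    return max(
        ((delta[q][0] * trans[(q, q_f)], delta[q][1]) for q in states),
        key=lambda cand: cand[0],
    )
```

In the header-extraction setting, the recovered state sequence directly gives each word's class tag.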
HMM application
Each state has a class (e.g. title, author)
Each word in the header is an observation
Each state emits words from the header with an associated CLASS TAG
This is learned from TRAINING DATA
Learning model structure from data
Decide on states and associated transition states
Set up labeled training data
Use MERGE techniques
• Neighbor merge (link all adjacent words in title)
• V-merging – 2 states with same label and transitions (one transition to title and out)
Apply Bayesian model merging to maximize result accuracy
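The core operation behind both merge techniques is collapsing two states that share a label by pooling their counts. A sketch, assuming a hypothetical count-dict layout (not the authors' implementation):

```python
# Collapse two same-label HMM states into one by pooling their
# transition and emission counts, as in neighbor-merging / V-merging.
# Count-dict layout and state names are hypothetical.

from collections import Counter

def merge_states(q1, q2, merged, trans_counts, emit_counts):
    """Replace states q1 and q2 everywhere with a single state `merged`."""
    rename = lambda q: merged if q in (q1, q2) else q
    new_trans, new_emit = Counter(), Counter()
    for (p, q), n in trans_counts.items():
        new_trans[(rename(p), rename(q))] += n
    for (q, word), n in emit_counts.items():
        new_emit[(rename(q), word)] += n
    return new_trans, new_emit
```

After a V-merge, two parallel title states with separate transitions in and out become one title state with a single transition in and out.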
Example Hidden Markov Model
Bayesian model merging seeks to find the model structure that maximizes the probability of the model (M) given some training data (D), by iteratively merging states until an optimal tradeoff between fit to the data and model size has been reached
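Written out, this is the standard Bayesian posterior over model structures (notation reconstructed, not quoted from the slides):

```latex
M^{*} \;=\; \operatorname*{argmax}_{M} P(M \mid D)
      \;=\; \operatorname*{argmax}_{M} \frac{P(D \mid M)\,P(M)}{P(D)}
      \;=\; \operatorname*{argmax}_{M} P(D \mid M)\,P(M)
```

Here P(D | M) measures fit to the training data, and the prior P(M) penalizes larger models, which is the fit-versus-size tradeoff the merging procedure optimizes.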
Three types of training data
Labeled data
Unlabeled data
Distantly-labeled data
Labeled data
Manual and expensive
Provides COUNTS function c() to estimate model parameters
Formulas for deriving parameters using counts c():
(4) Transition probabilities: P(q -> q') = c(q -> q') / Σ_s c(q -> s)
(5) Emission probabilities: P(q ↑ σ) = c(q ↑ σ) / Σ_ρ c(q ↑ ρ)
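In code, the estimates in (4) and (5) are just ratios of observed counts. This sketch assumes labeled headers are represented as lists of (state, word) pairs, which is a hypothetical stand-in for the paper's labeled training data:

```python
# Maximum-likelihood parameter estimates from labeled data: each
# probability is a ratio of counts, as in equations (4) and (5).
# The (state, word) representation is a hypothetical stand-in.

from collections import Counter

def estimate_parameters(labeled_sequences, q_i="I", q_f="F"):
    """Estimate transition and emission probabilities from labeled headers."""
    trans_counts, emit_counts = Counter(), Counter()
    for seq in labeled_sequences:
        path = [q_i] + [state for state, _ in seq] + [q_f]
        for p, q in zip(path, path[1:]):
            trans_counts[(p, q)] += 1           # c(q -> q')
        for state, word in seq:
            emit_counts[(state, word)] += 1     # c(q emits sigma)
    trans_totals, emit_totals = Counter(), Counter()
    for (p, _), n in trans_counts.items():
        trans_totals[p] += n
    for (q, _), n in emit_counts.items():
        emit_totals[q] += n
    # (4) P(q -> q') = c(q -> q') / sum_s c(q -> s)
    trans = {k: n / trans_totals[k[0]] for k, n in trans_counts.items()}
    # (5) P(q emits sigma) = c(q emits sigma) / sum_rho c(q emits rho)
    emit = {k: n / emit_totals[k[0]] for k, n in emit_counts.items()}
    return trans, emit
```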
Unlabeled Data
Needs estimated parameters from labeled data
Use Baum-Welch training algorithm
• Iterative expectation-maximization algorithm which adjusts model parameters to locally maximize the likelihood of the unlabeled data
• Sensitive to initial parameters
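One Baum-Welch iteration can be sketched as a forward pass, a backward pass, and a re-normalization of expected counts. The dict-based parameter layout, with dedicated initial/final states "I" and "F", is an illustrative assumption, not the paper's code:

```python
# One Baum-Welch (forward-backward) EM iteration for a discrete-output,
# first-order HMM with dedicated initial and final states. Dict-based
# parameter layout is a hypothetical stand-in for the paper's models.

from collections import Counter

def baum_welch_step(sequences, states, trans, emit, q_i="I", q_f="F"):
    """Re-estimate transition and emission probabilities from unlabeled data."""
    t_counts, e_counts = Counter(), Counter()
    for x in sequences:
        n = len(x)
        # forward pass: alpha[t][q] = P(x[0..t], state at t is q)
        alpha = [{q: trans[(q_i, q)] * emit[(q, x[0])] for q in states}]
        for t in range(1, n):
            alpha.append({
                q: sum(alpha[t - 1][p] * trans[(p, q)] for p in states)
                   * emit[(q, x[t])]
                for q in states
            })
        # backward pass: beta[t][q] = P(x[t+1..], reach q_f | state at t is q)
        beta = [None] * n
        beta[n - 1] = {q: trans[(q, q_f)] for q in states}
        for t in range(n - 2, -1, -1):
            beta[t] = {
                q: sum(trans[(q, r)] * emit[(r, x[t + 1])] * beta[t + 1][r]
                       for r in states)
                for q in states
            }
        z = sum(alpha[n - 1][q] * trans[(q, q_f)] for q in states)  # P(x | M)
        # E-step: accumulate expected transition and emission counts
        for q in states:
            t_counts[(q_i, q)] += alpha[0][q] * beta[0][q] / z
            t_counts[(q, q_f)] += alpha[n - 1][q] * trans[(q, q_f)] / z
            for t in range(n):
                e_counts[(q, x[t])] += alpha[t][q] * beta[t][q] / z
            for r in states:
                for t in range(n - 1):
                    t_counts[(q, r)] += (alpha[t][q] * trans[(q, r)]
                                         * emit[(r, x[t + 1])]
                                         * beta[t + 1][r] / z)
    # M-step: normalize pooled expected counts into probabilities
    t_totals, e_totals = Counter(), Counter()
    for (p, _), v in t_counts.items():
        t_totals[p] += v
    for (q, _), v in e_counts.items():
        e_totals[q] += v
    new_trans = {k: v / t_totals[k[0]] for k, v in t_counts.items()}
    new_emit = {k: v / e_totals[k[0]] for k, v in e_counts.items()}
    return new_trans, new_emit
```

Because every quantity is scaled by the current parameters, the result depends on the starting point, which is why the slides stress sensitivity to initial parameters.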
Distantly-labeled data
Data labeled for another purpose
Partially applicable to this domain for training
EXAMPLE – CS research headers vs. BibTeX bibliographic labeled citations
Experiment results
Prepare text using a computer program
• Header – from beginning to INTRODUCTION or end of 1st page
• Remove punctuation, case, & newlines
• Labels:
  +ABSTRACT+  Abstract
  +INTRO+  Introduction
  +PAGE+  End of 1st page
Manually label 1000 headers
• Minus 65 discarded due to poor format
Derive fixed word vocabularies from training
Sources & Amounts of Training Data
Model selection
MODELS 1-4 – 1 state / class
MODEL 1 – fully connected HMM with uniform transition estimates between states
MODEL 2 – maximum likelihood transition estimate, with others uniform
MODEL 3 – all transitions given maximum likelihood estimates – used as the BASELINE HMM
MODEL 4 – adds smoothing – no zero probabilities
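The effect of Model 4's smoothing can be illustrated with add-one (Laplace) smoothing, one common scheme that guarantees no zero probabilities; the slides do not state which smoothing method the authors actually used.

```python
# Add-one (Laplace) smoothing: every outcome, even one never observed
# in training, gets nonzero probability mass. Shown as one common
# scheme; not necessarily the one used in the paper.

def smoothed_probability(count, total, num_outcomes):
    """(c + 1) / (N + V), where V is the number of possible outcomes."""
    return (count + 1) / (total + num_outcomes)
```

Unseen words no longer force a path's probability to zero, which matters when decoding headers containing vocabulary absent from training data.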
ACCURACY OF MODELS (by % word classification accuracy)
L – labeled data
L+D – labeled and distantly-labeled data
Multiple states / class:
– hand-set distantly-labeled states
+ automatically-set distantly-labeled states
Compared BASELINE to best MULTI-STATE to V-MERGED models
UNLABELED DATA & TRAINING
INITIAL – L + D + U, λ = 0.5 for each emission distribution
λ varies – optimum distribution
PP – includes smoothing
Error breakdown
Errors by CLASS TAG
BOLD – distantly-labeled data tags
Conclusions
The approach works well on research paper headers
Improvement factors:
• Multi-state classes
• Distantly-labeled data (10% improvement)
• Distantly-labeled data can reduce the amount of labeled data needed
Future work
Use Bayesian model merging to completely automate model learning
Also describe layout by position on the page
Model internal state structure
Model of Internal State Structure
First 2 words – explicit
Multiple affiliations possible
Last 2 words – explicit
My Assessment
Highly mathematical and complex
Even unlabeled data is in a preset order
Model requires work setting up training data
A change in target data will completely change the model
Valuable experiments with heuristics and smoothing impacting results
Wish they had included a sample 1st page
QUESTIONS