
Page 1: Hidden Markov Models for Information Extraction

Recent Results and Current Projects

Joseph Smarr & Huy Nguyen
Advisor: Chris Manning

Page 2: HMM Approach to IE

- HMM states are associated with a semantic type (background-text, person-name, etc.)
- Constrained EM learns transitions and emissions
- Viterbi alignment of a document marks tagged ranges of text with the same semantic type
- Extract the range with the highest probability (decoding sketched below)

[Diagram: Viterbi state sequence 2 3 4 5 6 2 aligned to the sentence "Speaker is Huy Nguyen this week", tagging "Huy Nguyen" as the extracted field]
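To make the decoding step concrete, here is a minimal Viterbi sketch (our illustration, not the authors' code) for a two-state background/person-name model; all probabilities are made-up toy numbers, and the extracted field is the run of tokens assigned to the target state.

```python
import numpy as np

# Minimal sketch: states 0 = background, 1 = person-name.
# All probabilities below are made-up toy numbers, not learned values.
states = ["background", "person-name"]
start = np.array([0.95, 0.05])
trans = np.array([[0.8, 0.2],    # background -> {background, person-name}
                  [0.5, 0.5]])   # person-name -> {background, person-name}

def emit(state, word):
    """Toy emission model: person-names are capitalized, background rarely is."""
    cap = word[0].isupper()
    if state == 1:
        return 0.9 if cap else 0.01
    return 0.1 if cap else 0.9

def viterbi(words):
    n, k = len(words), len(states)
    delta = np.zeros((n, k))               # best log-prob ending in each state
    back = np.zeros((n, k), dtype=int)     # argmax backpointers
    delta[0] = np.log(start) + [np.log(emit(j, words[0])) for j in range(k)]
    for t in range(1, n):
        for j in range(k):
            scores = delta[t - 1] + np.log(trans[:, j])
            back[t, j] = scores.argmax()
            delta[t, j] = scores.max() + np.log(emit(j, words[t]))
    path = [int(delta[-1].argmax())]       # trace back the best path
    for t in range(n - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

words = "Speaker is Huy Nguyen this week".split()
tags = viterbi(words)
print([f"{w}/{states[s]}" for w, s in zip(words, tags)])
print("extracted:", " ".join(w for w, s in zip(words, tags) if s == 1))
```

Run on the example sentence, this tags "Huy Nguyen" with the person-name state and everything else as background, then extracts the tagged range.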

Page 3: Existing Work

- Leek (1997; UCSD MS thesis): early results, fixed structures
- Freitag & McCallum (1999, 2000): grow complex structures

Page 4: Limitations of Existing Work

- Only one field is extracted at a time
  - Relative position of fields is ignored (e.g., authors usually come before titles in citations)
  - Similar-looking fields aren't competed for (e.g., acquired company vs. purchasing company)
- Simple model of unknown words
  - Use <UNK> for all words seen fewer than N times
- No separation of content and context
  - e.g., can't plug in generic date extractors, etc.

Page 5: Current Research Goals

- Flexibly train and combine extractors for multiple fields of information
- Learn structures suited to individual fields
  - Can be recombined and reused with many HMMs
- Learn intelligent context structures to link targets
  - Canonical ordering of fields
  - Common prefixes and suffixes
- Construct a merged HMM for actual extraction
  - Context/target split makes the search problem tractable
  - Transitions between models are compiled out in the merge

Page 6: Current Research Goals

- Richer models for handling unknown words
  - Estimate the likelihood of novel words in each state
  - Featural decomposition for finer-grained probabilities, e.g. Nguyen → UNK[Capitalized, No-numbers] (see the sketch below)
  - Character-level models for higher precision, e.g. phone numbers, room numbers, dates, etc.
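A minimal sketch of the featural-decomposition idea (our own illustration; the feature set and rarity cutoff are assumptions): rare words are replaced by a feature-signature token, so unknown words share emission statistics with seen words of the same shape.

```python
from collections import Counter

def unk_signature(word):
    """Map a word to a featural UNK token, e.g. 'Nguyen' -> 'UNK[Cap,NoNum]'."""
    feats = ["Cap" if word[0].isupper() else "NoCap",
             "Num" if any(ch.isdigit() for ch in word) else "NoNum"]
    return "UNK[" + ",".join(feats) + "]"

def replace_rare(tokens, min_count=2):
    """Replace words seen fewer than min_count times with their signature."""
    counts = Counter(tokens)
    return [w if counts[w] >= min_count else unk_signature(w) for w in tokens]

print(unk_signature("Nguyen"))    # UNK[Cap,NoNum]
print(unk_signature("650-1234"))  # UNK[NoCap,Num]
```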

- Conditional training to focus on the extraction task
  - Classical joint estimation often wastes states modeling patterns in English background text
  - Conditional training is slower, but only rewards structure that increases labeling accuracy

Page 7: Learning Target Structures

- Goal: learn a flexible structure tailored to the composition of a particular field
- Representation: disjunction of multi-state chains
- Learning method (sketched below):
  - Collect and isolate all examples of the target field
  - Initialization: a single state
  - Search operators (greedy search): extend a current chain, or start a new chain
  - Stopping criterion: MDL score
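The greedy loop might look like the following skeleton (a sketch under our own assumptions; mdl_score, extend_chain, and new_chain are hypothetical helpers standing in for the actual scoring function and search operators):

```python
def learn_target_structure(examples, mdl_score, extend_chain, new_chain):
    """Greedy structure search over a disjunction of state chains.

    `examples` are the isolated target-field token sequences; the helper
    callables (hypothetical here) score a model by MDL and apply operators.
    """
    model = {"chains": [["state0"]]}         # initialization: a single state
    best = mdl_score(model, examples)        # MDL: lower is better
    while True:
        # Neighbors reachable by one operator: extend any chain, or add one.
        candidates = [extend_chain(model, i) for i in range(len(model["chains"]))]
        candidates.append(new_chain(model))
        scored = [(mdl_score(c, examples), c) for c in candidates]
        score, candidate = min(scored, key=lambda sc: sc[0])
        if score >= best:                    # stop when MDL no longer improves
            return model
        model, best = candidate, score
```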

Page 8: Example Target HMM: dlramt

[Diagram: learned target HMM for the dlramt (dollar amount) field, a disjunction of chains between START and END. One chain runs number (13.5, 240, 100) → magnitude (mln, billion) → nationality (U.S., Canadian) → currency (dlrs, dollars, yen, pesos); another runs {undisclosed, withheld} → amount.]

Page 9: Learning Context Structures

- Goal: learn structure to connect multiple target HMMs
  - Captures canonical ordering of fields
  - Identifies prefix and suffix patterns around targets
- Initialization (sketched below): a background state connected to each target
  - Find the minimum number of words between each target type in the corpus
  - Connect targets directly if the distance is 0
  - Add a context state between targets if they're close
- Search operators (greedy search):
  - Add a prefix/suffix between background and target
  - Lengthen an existing chain
  - Start a new chain (by splitting an existing one)
- Stopping criterion: MDL score
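A sketch of that initialization heuristic (our own illustration; the document encoding and the closeness threshold are assumptions): scan labeled documents for the minimum token gap between consecutive target fields, then decide how each pair gets wired.

```python
def labeled_spans(doc):
    """Maximal runs of identically labeled tokens.
    `doc` is a list of (token, label) pairs, label None for background."""
    spans, prev = [], None
    for i, (_, lab) in enumerate(doc):
        if lab is not None and lab == prev:
            spans[-1][2] = i + 1              # extend the current run
        elif lab is not None:
            spans.append([lab, i, i + 1])     # start a new run
        prev = lab
    return spans

def min_gaps(docs):
    """Minimum token gap between each ordered pair of consecutive fields."""
    gaps = {}
    for doc in docs:
        spans = labeled_spans(doc)
        for (f1, _, end1), (f2, start2, _) in zip(spans, spans[1:]):
            gaps[(f1, f2)] = min(gaps.get((f1, f2), start2 - end1), start2 - end1)
    return gaps

def initial_wiring(gaps, close=3):            # `close` is an assumed threshold
    """Direct arc if gap 0, shared context state if close, else via background."""
    return {pair: ("direct" if g == 0 else "context" if g <= close else "background")
            for pair, g in gaps.items()}
```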

Page 10: Example of Context HMM

[Diagram: context HMM for the acquisitions domain. Between START and END, a Background state (emitting words like "The", "yesterday", "Reuters") connects to the Purchaser target; a Context state emitting "purchased", "acquired", "bought" links Purchaser to the Acquired target.]

Page 11: Merging Context and Targets

- In the context HMM, targets are collapsed into a single state that always emits "purchaser", etc.
- Target HMMs have single START and END states
- Glue target HMMs into place by "compiling out" the start/end transitions, creating one big HMM (sketched below)
- Challenge: create supportive structure without being overly restrictive
  - Too little structure: hard to find regularities
  - Too much structure: can't generate all docs
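A sketch of the compiling-out step under a simple dictionary encoding of an HMM (our own representation, not the authors'): arcs into the target's placeholder state are redistributed over the target HMM's initial states, and arcs leaving its END are attached to wherever the placeholder used to go.

```python
def merge_target(context, placeholder, target):
    """Splice a target HMM (with its own START/END) into the context HMM,
    replacing `placeholder` and compiling out the start/end transitions.
    HMMs are dicts {state: {next_state: prob}}; state names are assumed
    disjoint between the two models."""
    merged = {s: dict(arcs) for s, arcs in context.items() if s != placeholder}
    # Arcs into the placeholder are rerouted to the target's initial states.
    for arcs in merged.values():
        if placeholder in arcs:
            p = arcs.pop(placeholder)
            for first, q in target["START"].items():
                arcs[first] = arcs.get(first, 0.0) + p * q
    # Copy the target's internal states; its END arcs inherit the
    # placeholder's old outgoing transitions.
    out_arcs = context[placeholder]
    for state, arcs in target.items():
        if state in ("START", "END"):
            continue
        merged[state] = {}
        for nxt, p in arcs.items():
            if nxt == "END":
                for dest, q in out_arcs.items():
                    merged[state][dest] = merged[state].get(dest, 0.0) + p * q
            else:
                merged[state][nxt] = merged[state].get(nxt, 0.0) + p
    return merged
```

Because the rerouted probabilities are products of the original arc probabilities, each state's outgoing distribution still sums to one after the merge.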

Page 12: Example of Merging HMMs

[Diagram: before and after merging. Top: the context HMM between START and END, with Background, Context, and collapsed Purchaser and Acquired states. Bottom: the merged HMM, in which a full target HMM (its own START/END compiled out) has replaced the collapsed Acquired state.]

Page 13: Tricks and Optimizations

- Mandatory end state: allows explicit modeling of the document end
- Structural enhancements (illustrated below):
  - Add transitions from start directly to targets
  - Add transitions from target/suffix directly to end
  - Allow "skip-ahead" transitions
- Separation of core structure learning:
  - Structure learning is performed on a "skeleton" structure
  - Enhancements are added during parameter estimation
  - Keeps the search tractable while exploiting rich transitions
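As an illustration, the enhancements can be bolted onto a learned skeleton just before parameter estimation (using the same hypothetical dictionary encoding as the merging sketch above; eps is an assumed small initial probability, and re-normalization is left to EM):

```python
def add_enhancements(hmm, targets, suffixes, eps=1e-3):
    """Add start->target, target/suffix->end, and skip-ahead transitions
    with small initial probabilities; EM re-estimates them afterwards and
    can drive useless arcs back toward zero."""
    for t in targets:
        hmm["START"].setdefault(t, eps)          # jump straight into a target
    for s in list(targets) + list(suffixes):
        hmm[s].setdefault("END", eps)            # jump straight to the end
    for state, arcs in hmm.items():
        for nxt in list(arcs):                   # skip-ahead: inherit each
            for skip in list(hmm.get(nxt, {})):  # successor's successors
                arcs.setdefault(skip, eps)
    return hmm
```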

Page 14: Sample of Recent F1 Results

[Chart: average F1 over 10 folds (y-axis 40%-65%) on the purchaser and dlramt fields and their average, comparing the FrMcC, Jim, Chris2, S-Merged, and Merged models.]

Page 15: Unknown Word Results

[Chart: F1 (y-axis 0%-80%) on the purchaser and dlramt fields and their average, comparing the Single UNK baseline against the held-out and featural-decomposition unknown-word models.]

Page 16: Conditional Training

- Observation: joint HMMs waste states modeling patterns in background text
  - Improves document likelihood (like n-grams)
  - Doesn't improve labeling accuracy (and can hurt it!)
  - Ideally, focus only on prefixes, suffixes, etc.
- Idea: maximize the conditional probability of the labels, P(labels | words), instead of the joint P(labels, words)
  - Should only reward modeling of helpful patterns
  - Can't use standard Baum-Welch training
  - Solution: use numerical optimization (conjugate gradient, CG)

Page 17: Potential of Conditional Training

- Don't waste states modeling background patterns
- Toy data model: ((abc)*(eTo))*, where T is the target (generator sketched below)
  - e.g. abcabcabcabceToabcabceToabcabcabc
  - Modeling abc improves joint likelihood but provides no help for labeling targets

[Diagram: optimal joint model vs. optimal labeling model. The joint model spends separate states cycling through a, b, c plus an e → T → o chain; the labeling model collapses a|b|c into a single background state and keeps only the e → T → o prefix/target/suffix chain.]
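For concreteness, a tiny generator for this toy language (our illustration; the block count and target rate are arbitrary choices):

```python
import random

def sample_toy(n_blocks=10, p_target=0.25):
    """Sample a string in the spirit of ((abc)*(eTo))*: background 'abc'
    blocks interleaved with 'eTo' blocks, where T marks the target symbol."""
    return "".join("eTo" if random.random() < p_target else "abc"
                   for _ in range(n_blocks))

print(sample_toy())  # e.g. abcabceToabcabcabceToabcabcabc
```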

Page 18: Running Conditional Training

- Gradient-based training requires a differentiable objective
- Value (computed with the forward algorithm, run with and without the type constraints):
  $\log P(c \mid w) = \log \sum_{t} P(c, w, t) - \log \sum_{t, c'} P(c', w, t)$
  where $w$ are the words, $c$ the labels, and $t$ ranges over state paths
- Derivative (a difference of parameter expectations):
  $\frac{\partial \log P(c \mid w)}{\partial \log \lambda_{ij}} = E_c[n_{ij} \mid c, w] - E_u[n_{ij} \mid w]$
  where $\lambda_{ij}$ is a transition/emission parameter, $n_{ij}$ its count along a path, and the subscripts mark expectations with the labels clamped ($c$) vs. unclamped ($u$)
- Likelihood and expectations are easily computed with existing HMM algorithms: compute the values with and without the type constraints (see the code sketch below)
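In code, the gradient is just a difference of two expected-count vectors, both obtainable from forward-backward (a schematic sketch; forward_backward is a hypothetical routine returning the log-likelihood and expected counts, run once with the labels clamped and once unconstrained; the gradient is with respect to the log-parameters):

```python
def conditional_value_and_grad(params, words, labels, forward_backward):
    """log P(labels | words) and its gradient for one document.

    `forward_backward(params, words, labels)` (hypothetical) returns
    (log-likelihood, {param: expected count}); labels=None means the
    unconstrained pass over all state paths.
    """
    ll_clamped, n_clamped = forward_backward(params, words, labels)
    ll_free, n_free = forward_backward(params, words, None)
    value = ll_clamped - ll_free                  # log P(c,w) - log P(w)
    grad = {k: n_clamped.get(k, 0.0) - n_free[k] for k in n_free}
    return value, grad    # feed to a conjugate-gradient optimizer
```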

Page 19: Challenges for Conditional Training

- Need an additional constraint to keep the numbers small
  - Can't guarantee you'll get a probability distribution
  - But that's OK if you're just summing and multiplying!
  - Solution: the sum of all parameters must equal a constant (sketched below)
- Need to fix the parameter space ahead of time
  - Can't add states, new words, etc.
  - Solution: start with a large ergodic model in which all states emit the entire vocabulary (use UNK tokens)
- Need sensible initialization
  - A uniform structure has high variance
  - A fixed structure usually dictates the training
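One simple way to realize the constant-sum constraint between optimizer steps (our own sketch of the idea stated above; the constant and the positivity floor are arbitrary choices):

```python
import numpy as np

def project_constant_sum(params, total=100.0, floor=1e-8):
    """Clip parameters to stay positive, then rescale so they sum to a
    fixed constant; this bounds the values without forcing each state's
    arcs to form a proper probability distribution."""
    p = np.maximum(np.asarray(params, dtype=float), floor)
    return p * (total / p.sum())
```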

Page 20: Results on Toy Data Set

- Results on (([ae][bt][co])*(eto))*, which contains spurious prefix-, target-, and suffix-like symbols
- Joint training always labels every t
- Conditional training eventually gets it perfectly

Page 21: Current and Future Work

- Richer search operators for structure learning
- Richer models of unknown words (character-level)
- Reduce the variance of conditional training
- Build a reusable repository of target HMMs
- Integrate with larger IE framework(s)
  - Semantic Web / KAON
  - LTG
- Applications:
  - Semi-automatic ontology markup for web pages
  - Smart email processing