Introduction to Conditional Random Fields

John Osborne, Sept 4, 2009


Page 1: Introduction to Conditional Random Fields

Introduction to Conditional Random Fields

John Osborne, Sept 4, 2009

Page 2: Introduction to Conditional Random Fields

Overview

• Useful Definitions
• Background
  – HMM
  – MEMM
• Conditional Random Fields
  – Statistical and Graph Definitions
• Computation (Training and Inference)
• Extensions
  – Bayesian Conditional Random Fields
  – Hierarchical Conditional Random Fields
  – Semi-CRFs
• Future Directions

Page 3: Introduction to Conditional Random Fields

Useful Definitions

• Random Field (Wikipedia)
  – In probability theory, let S = {X1, ..., Xn} be a set of random variables, with each Xi taking values in {0, 1, ..., G − 1}, on the sample space Ω = {0, 1, ..., G − 1}^n. A probability measure π is a random field if, for all ω in Ω, π(ω) > 0.
• Markov Process (chain if finite sequence)
  – Stochastic process with the Markov property
• Markov Property
  – The probability that a random variable assumes a value depends on the other random variables only through those that are its immediate neighbors
  – "Memoryless"
• Hidden Markov Model (HMM)
  – Markov model where the current state is unobserved
• Viterbi Algorithm
  – Dynamic programming technique to discover the most likely sequence of hidden states that explains the observations in an HMM (see the sketch after this list)
  – Used to determine labels
• Potential Function == Feature Function
  – In a CRF the potential function scores the compatibility of yt, yt−1 and wt(X)
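Since Viterbi decoding comes up again later for CRF inference, here is a minimal sketch for a toy HMM. All state names, observation symbols, and probabilities are hypothetical, chosen only to make the dynamic program concrete.

```java
// Minimal Viterbi decoding sketch for a toy two-state HMM.
// The model numbers below are hypothetical illustration values.
public class ViterbiDemo {
    public static void main(String[] args) {
        String[] states = {"Healthy", "Fever"};
        double[] start = {0.6, 0.4};
        double[][] trans = {{0.7, 0.3}, {0.4, 0.6}};
        double[][] emit = {{0.5, 0.4, 0.1}, {0.1, 0.3, 0.6}}; // 0=normal, 1=cold, 2=dizzy
        int[] obs = {0, 1, 2};

        int n = obs.length, k = states.length;
        double[][] score = new double[n][k]; // best log-score of any path ending in state j at time t
        int[][] back = new int[n][k];        // backpointers for path reconstruction

        for (int j = 0; j < k; j++)
            score[0][j] = Math.log(start[j]) + Math.log(emit[j][obs[0]]);

        for (int t = 1; t < n; t++)
            for (int j = 0; j < k; j++) {
                double best = Double.NEGATIVE_INFINITY;
                for (int i = 0; i < k; i++) {
                    double s = score[t - 1][i] + Math.log(trans[i][j]);
                    if (s > best) { best = s; back[t][j] = i; }
                }
                score[t][j] = best + Math.log(emit[j][obs[t]]);
            }

        // Trace back the most likely state sequence from the best final state.
        int[] path = new int[n];
        for (int j = 1; j < k; j++)
            if (score[n - 1][j] > score[n - 1][path[n - 1]]) path[n - 1] = j;
        for (int t = n - 1; t > 0; t--) path[t - 1] = back[t][path[t]];

        StringBuilder sb = new StringBuilder();
        for (int t = 0; t < n; t++) sb.append(states[path[t]]).append(' ');
        System.out.println(sb.toString().trim()); // prints: Healthy Healthy Fever
    }
}
```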

Page 4: Introduction to Conditional Random Fields

Background

• Interest in CRFs arose from Richa's work with gene expression

• Current literature shows them performing better on NLP tasks than other commonly used NLP approaches such as Support Vector Machines (SVMs), neural networks, HMMs and others
  – Term coined by Lafferty et al. in 2001

• Predecessors were the HMM and maximum entropy Markov models (MEMMs)

Page 5: Introduction to Conditional Random Fields

HMM

– Definition
  • Markov model where the current state is unobserved
– Generative model: it models the joint distribution of observations and states (see the factorization below)
– Examining all of the input X would be prohibitive, hence the Markov property: look only at the current element of the sequence
– No multiple interacting features or long-range dependencies
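For reference, the standard linear-chain HMM factorization (not spelled out on the slide) makes the generative nature explicit:

```latex
% A generative HMM models the joint probability of the observation sequence X
% and the state sequence Y as a product of transition and emission terms:
P(X, Y) = \prod_t P(y_t \mid y_{t-1}) \, P(x_t \mid y_t)
% Because each x_t is generated from y_t alone, features spanning the whole
% input cannot be used without breaking this factorization.
```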

Page 6: Introduction to Conditional Random Fields

MEMMs

– McCallum et al., 2000
– Non-generative finite-state model based on a next-state classifier
– Directed graph
– P(Y|X) = ∏t P(yt | yt−1, wt(X)), where wt(X) is a sliding window over the X sequence (a sketch of the per-step classifier follows)
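A minimal sketch of that next-state classifier. The scoring function here is a hypothetical stand-in; a real MEMM would use a learned log-linear score θ · f(yt−1, yt, wt(X)).

```java
import java.util.Arrays;

// Sketch of an MEMM's locally normalized next-state classifier:
// P(y_t | y_{t-1}, w_t(X)) is a softmax over candidate next states.
public class MemmStep {
    // Hypothetical stand-in for theta . f(prev, next, window).
    static double score(int prev, int next, double[] window) {
        double s = (prev == next) ? 1.0 : 0.0;  // toy "stay in state" feature
        for (double w : window) s += w * next;   // toy observation feature
        return s;
    }

    // Local (per-step) normalization: each step's scores compete only with
    // each other, which is exactly what the label bias problem criticizes.
    static double[] nextStateDistribution(int prev, double[] window, int numStates) {
        double[] p = new double[numStates];
        double z = 0.0;
        for (int y = 0; y < numStates; y++) {
            p[y] = Math.exp(score(prev, y, window));
            z += p[y];
        }
        for (int y = 0; y < numStates; y++) p[y] /= z;
        return p;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(
            nextStateDistribution(0, new double[]{0.5, -0.2}, 3)));
    }
}
```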

Page 7: Introduction to Conditional Random Fields

Label Bias Problem

• Transitions leaving a given state compete only against each other, rather than against all other transitions in the model

• Implies "conservation of score mass" (Bottou, 1991)

• Observations can be ignored; Viterbi decoding can't downgrade a branch (a toy numeric illustration follows)

• CRFs will solve this problem by having a single exponential model for the joint probability of the ENTIRE SEQUENCE OF LABELS given the observation sequence
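A toy numeric illustration (an assumed example, in the spirit of the discussion in Bottou 1991 and Lafferty et al. 2001): under per-step normalization, a state with a single outgoing transition must pass on all of its probability mass, whatever the observation says.

```java
// Label bias in one line of arithmetic: if state 1 can only go to state 2,
// local normalization forces P(2 | 1, obs) = exp(s) / exp(s) = 1 for ANY
// observation score s, so the observation at that step is ignored.
public class LabelBiasDemo {
    public static void main(String[] args) {
        double scoreGoodObs = 5.0, scoreBadObs = -5.0;
        System.out.println(Math.exp(scoreGoodObs) / Math.exp(scoreGoodObs)); // 1.0
        System.out.println(Math.exp(scoreBadObs) / Math.exp(scoreBadObs));   // 1.0
        // A globally normalized CRF scores the whole label sequence at once,
        // so a poor observation score here can still downgrade the branch.
    }
}
```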

Page 8: Introduction to Conditional Random Fields

Big Picture Definition

• Wikipedia definition (Aug 2009)
  – A conditional random field (CRF) is a type of discriminative probabilistic model most often used for the labeling or parsing of sequential data, such as natural language text or biological sequences.

• A probabilistic model is a statistical model, in mathematical terms "a pair (Y, P) where Y is the set of possible observations and P the set of possible probability distributions on Y"
  – In statistical terms this means the objective is to infer (or pick) the distinct element (probability distribution) in the set P given your observation Y

• Discriminative model, meaning it models the conditional probability distribution P(y|x), which can predict y given x
  – It cannot do it the other way around (produce x from y), since it is not a generative model (capable of generating sample data given a model): it does not model a joint probability distribution
  – Similar to other discriminative models such as support vector machines and neural networks

• When analyzing sequential data, a conditional model specifies the probabilities of possible label sequences given an observation sequence

Page 9: Introduction to Conditional Random Fields

CRF Graphical Definition

Definition from Lafferty
• Undirected graphical model
• Let G = (V, E) be a graph such that Y = (Yv), v ∈ V, so that Y is indexed by the vertices of G. Then (X, Y) is a conditional random field in case, when conditioned on X, the random variables Yv obey the Markov property with respect to the graph: p(Yv | X, Yw, w ≠ v) = p(Yv | X, Yw, w ~ v), where w ~ v means that w and v are neighbors in G

[Figure: CRF undirected graph]

Page 10: Introduction to Conditional Random Fields

Computation of CRF

• Training
  – Conditioning
  – Calculation of the feature functions
  – P(Y|X) = (1/Z(X)) exp(∑t Ψ(yt, yt−1, wt(X)))
    • Z is a normalizing factor
    • The potential function Ψ is the term in parentheses (a sketch of this computation follows)

• Inference
  – Viterbi decoding
  – Approximate model averaging
  – Others?
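A minimal sketch of that formula for a tiny linear chain, assuming a hypothetical potential Ψ and computing Z(X) by brute-force enumeration; a real implementation would use the forward algorithm for Z.

```java
// P(Y|X) = (1/Z(X)) exp(sum_t psi(y_t, y_{t-1}, w_t(X))) for a tiny chain.
// The potential psi below is a hypothetical stand-in for a learned weighted
// sum of feature functions.
public class CrfProbability {
    static final int NUM_LABELS = 2;

    // psi scores the compatibility of (y_t, y_{t-1}) with the window w_t(X);
    // here the "window" is reduced to a single observation feature.
    static double psi(int yPrev, int y, double obsFeature) {
        double s = (yPrev == y) ? 0.5 : 0.0;   // toy transition feature
        s += obsFeature * y;                   // toy observation feature
        return s;
    }

    static double unnormalizedScore(int[] y, double[] x) {
        double total = 0.0;
        for (int t = 0; t < y.length; t++) {
            int yPrev = (t == 0) ? 0 : y[t - 1]; // fixed dummy start label
            total += psi(yPrev, y[t], x[t]);
        }
        return Math.exp(total);
    }

    // Brute-force partition function: sums over all NUM_LABELS^n label
    // sequences. Only feasible for tiny examples; the forward algorithm
    // does this in O(n * NUM_LABELS^2).
    static double partition(double[] x) {
        int n = x.length;
        double z = 0.0;
        int total = (int) Math.pow(NUM_LABELS, n);
        for (int code = 0; code < total; code++) {
            int[] y = new int[n];
            int c = code;
            for (int t = 0; t < n; t++) { y[t] = c % NUM_LABELS; c /= NUM_LABELS; }
            z += unnormalizedScore(y, x);
        }
        return z;
    }

    public static void main(String[] args) {
        double[] x = {0.3, -0.1, 0.8};
        int[] y = {1, 1, 0};
        System.out.println(unnormalizedScore(y, x) / partition(x)); // P(y|x)
    }
}
```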

Page 11: Introduction to Conditional Random Fields

Training Approaches

• CRF is supervised learning, so one can train using:
  – Maximum likelihood (original paper)
    • Used an iterative scaling method; was very slow
  – Gradient ascent
    • Also slow when naïve
  – Mallet implementation used the BFGS algorithm
    • http://en.wikipedia.org/wiki/BFGS
    • Broyden-Fletcher-Goldfarb-Shanno
    • Approximate 2nd-order algorithm
  – Stochastic gradient method (2006), accelerated via Stochastic Meta-Descent
  – Gradient tree boosting (a variant of gradient boosting, 2001)
    • http://jmlr.csail.mit.edu/papers/volume9/dietterich08a/dietterich08a.pdf
    • Potential functions are sums of regression trees
      – Decision trees using real values
    • Published 2008
    • Competitive with Mallet
  – Bayesian (estimate the posterior probability)

(The gradient that the likelihood-based methods climb is written out below.)
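For reference (not on the slide), the conditional log-likelihood that these optimizers climb, and its gradient, take the standard log-linear form for a chain CRF with weights θk and feature functions fk:

```latex
% Conditional log-likelihood of one training pair (x, y):
\ell(\theta) = \sum_t \sum_k \theta_k\, f_k(y_t, y_{t-1}, w_t(x)) \;-\; \log Z(x)

% Gradient: observed feature counts minus the model's expected counts.
% The expectation term is why training runs inference (forward-backward)
% in its inner loop.
\frac{\partial \ell}{\partial \theta_k}
  = \sum_t f_k(y_t, y_{t-1}, w_t(x))
  \;-\; \mathbb{E}_{y' \sim p_\theta(\cdot \mid x)}
        \Big[ \sum_t f_k(y'_t, y'_{t-1}, w_t(x)) \Big]
```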

Page 12: Introduction to Conditional Random Fields

Conditional Random Field Extensions: Semi-CRF

• Semi-CRF
  – Instead of assigning labels to each member of the sequence, labels are assigned to sub-sequences
  – Advantage: "features for semi-CRFs can measure properties of segments, and transitions within a segment can be non-Markovian"
  – http://www.cs.cmu.edu/~wcohen/postscript/semiCRF.pdf

Page 13: Introduction to Conditional Random Fields

Bayesian CRF

• Qi et al., 2005
  – http://www.cs.purdue.edu/homes/alanqi/papers/Qi-Bayesian-CRF-AIstat05.pdf
• Replacement for the maximum-likelihood method of Lafferty
• Reduces over-fitting
• "Power EP" method

Page 14: Introduction to Conditional Random Fields

Hierarchical CRF (HCRF)

• http://www.springerlink.com/content/r84055k2754464v5/
• http://www.cs.washington.edu/homes/fox/postscripts/places-isrr-05.pdf
• Applied to GPS motion data: surveillance, tracking, dividing people's workday into labels such as work, travel, sleep, etc.
• Less work has been published on HCRFs

Page 15: Introduction to Conditional Random Fields

Future Directions

• Less work on conditional random fields in biology
  – PubMed hits:
    • "Conditional Random Field" - 21
    • "Conditional Random Fields" - 43
  – CRF variants & promoter/regulatory element searches show no hits
• CRF and ontology searches show no hits
• Plan
  – Implement a CRF in Java, apply it to biology problems, try to find ways to extend it?