CS 6784 Paper Presentation
Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data
John Lafferty, Andrew McCallum, Fernando C. N. Pereira
Presenters: Brad Gulko and Stephanie Hyland
February 20, 2014


TRANSCRIPT


  • Main Contributions

    Main Contribution Summary

    This 2001 paper introduced the Conditional Random Field (CRF).

    Describes an efficient representation of field potentials in terms of features.

    Provides two algorithms for finding Maximum Likelihood parameter values.

    Provides some really unconvincing examples...

    ... examples are NOT the strongest point of this paper.

  • Main Contributions

    Talk Structure

    Brad:
    CRF in context
    The Label Bias Problem

    Stephanie:
    Parameter Estimation
    Experiments
    Conclusion

  • Intro

    CRF in Context

    Let X = {X1, X2, ...} be a set of observed random variables,
    Y = {Y1, Y2, ...} be a set of label random variables, and
    (X, Y) be a set of joint observations of X, Y.

                  Generative   Discriminative
    Directed      HMM          ME-MM
    Undirected    MRF          CRF

    In 2001, HMM, ME-MM and MRF were well known; the paper presents the CRF.

  • Generative vs. Discriminative

    Generative: maximise the joint $P(Y, X) = P(Y \mid X)\, P(X)$

    Discriminative: maximise the conditional $P(Y \mid X)$

    When is Discriminative helpful?

    Tractability requires independence...
    but sometimes there are important correlations in X.

  • Generative vs. Discriminative

    Examples: important correlations

    Long-range interactions in human genomics

    Pronoun definition and binding

    Context in whole-scene image recognition

    Recursive structure in language

  • Directed vs. Undirected

    For a graphical model G(E, V) with joint potential $\Psi(V)$:

    Let C be the set of cliques (fully connected subgraphs) in G, with $c \in C$ having edges $E_c$ and vertices $V_c$.

    Finally, Dom(V) is the set of all values assumable by the random variables, $V\ (= X \cup Y)$.

    $P(V) = \frac{1}{Z} \Psi(V), \qquad Z = \sum_{v \in \mathrm{Dom}(V)} \Psi(v)$

  • Directed vs. Undirected, continued

    Compactness requires factorization (Hammersley-Clifford, 1971):

    $\Psi(V) = \prod_{c \in C} \Psi_c(V_c)$

    Directed: local normalization - each $\Psi_c$ is a probability:

    $\forall c \in C, \quad \sum_{v \in \mathrm{Dom}(V_c)} \Psi_c(v) = 1$

    Undirected: global normalization relaxes this constraint... but what does it buy us?

  • The Label Bias Problem: Conditional Markov Model (ME-MM)

    Toy Problem – fragment of a ME-MM

    Training Data: 8 observations of X = RI labeled (Y1, Y2) = (1, 3); 2 observations of X = RO labeled (Y1, Y2) = (2, 4).

    P(Y1 | X1):
               Y1 = 1   Y1 = 2
    X1 = R     0.8      0.2

    Rel. Joint (Y2, X2, Y1) - training counts:
                       Y2 = 3   Y2 = 4
    Y1 = 1, X2 = I     8        -
    Y1 = 1, X2 = O     -        -
    Y1 = 2, X2 = I     -        -
    Y1 = 2, X2 = O     -        2

    Conditional P(Y2 | X2, Y1), with unseen rows defaulting to uniform and ε small:
                       Y2 = 3   Y2 = 4
    Y1 = 1, X2 = I     1 - ε    ε
    Y1 = 1, X2 = O     0.5      0.5
    Y1 = 2, X2 = I     0.5      0.5
    Y1 = 2, X2 = O     ε        1 - ε

    Let's try Viterbi for X = RO. Which labeling wins?

    Y1, Y2   P(Y1 | R)   P(Y2 | Y1, O)   P(Y1, Y2 | RO)
    1, 3     0.8         0.5             0.4
    1, 4     0.8         0.5             0.4
    2, 3     0.2         ε               ≈ 0
    2, 4     0.2         1 - ε           ≈ 0.2

    But we want (2, 4): that is how RO was labeled in the training data. What happened?

    Local normalization requires each factor to be a probability... so the 0.8 of mass arriving at Y1 = 1 must be passed on in full, no matter what X2 says.
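To make the failure concrete, here is a minimal sketch (Python) of the locally-normalized computation above. The tables are the ones reconstructed from the slide; `eps` stands in for the slide's unspecified small value ε, and the variable names are illustrative.

```python
# Locally normalized (ME-MM-style) scoring of the toy input X = "RO".
eps = 1e-3  # stand-in for the slide's small epsilon

p_y1 = {1: 0.8, 2: 0.2}  # P(Y1 | X1 = 'R'), from 8 "RI" vs 2 "RO" training examples

# P(Y2 | X2, Y1); (Y1, X2) combinations never seen in training default to uniform
p_y2 = {
    (1, 'I'): {3: 1 - eps, 4: eps},
    (1, 'O'): {3: 0.5,     4: 0.5},  # Y1=1 never seen with 'O' -> uniform
    (2, 'I'): {3: 0.5,     4: 0.5},
    (2, 'O'): {3: eps,     4: 1 - eps},
}

x2 = 'O'
scores = {(y1, y2): p_y1[y1] * p_y2[(y1, x2)][y2]
          for y1 in (1, 2) for y2 in (3, 4)}
for path, p in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(path, round(p, 4))
```

(1, 3) and (1, 4) tie at 0.4 and beat the intended (2, 4) at roughly 0.2: the mass that local normalization forces out of Y1 = 1 ignores the observation, which is exactly the label bias described above.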

  • The Label Bias Problem: Potentials

    Toy Problem – fragment of a CRF

    Training Data: as before. Potentials are learned from the same counts; entries the training data never produced are some small value ε.

    Ψ(X1, Y1):
              Y1 = 1   Y1 = 2
    X1 = R    8        2

    Ψ(Y1, Y2):
              Y2 = 3   Y2 = 4
    Y1 = 1    8        ε
    Y1 = 2    ε        2

    Ψ(X2, Y2):
              Y2 = 3   Y2 = 4
    X2 = I    8        ε
    X2 = O    ε        2

    Which labeling wins, now? For X = RO:

    Y1, Y2   Ψ(X1,Y1)   Ψ(Y1,Y2)   Ψ(X2,Y2)   Ψ(Y1,Y2,X=RO)   P(Y1,Y2 | X)
    1, 3     8          8          ε          64ε              small
    1, 4     8          ε          2          16ε              tiny
    2, 3     2          ε          ε          2ε²              infinitesimal
    2, 4     2          2          2          8                ~100%

    Because potentials do not have to normalize into probabilities until AFTER aggregation, they don't suffer from inappropriate conditioning.
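For contrast, a matching sketch of the globally-normalized scoring; the potential values follow the slide's tables, and `eps` again marks the entries the training data never produced.

```python
# Globally normalized (CRF-style) scoring of the same toy input X = "RO".
eps = 1e-3  # stand-in for the slide's small epsilon

psi_x1y1 = {('R', 1): 8.0, ('R', 2): 2.0}
psi_y1y2 = {(1, 3): 8.0, (2, 4): 2.0}      # unseen pairs fall back to eps
psi_x2y2 = {('I', 3): 8.0, ('O', 4): 2.0}  # unseen pairs fall back to eps

def psi(table, key):
    return table.get(key, eps)

x1, x2 = 'R', 'O'
raw = {(y1, y2): psi(psi_x1y1, (x1, y1))
                 * psi(psi_y1y2, (y1, y2))
                 * psi(psi_x2y2, (x2, y2))
       for y1 in (1, 2) for y2 in (3, 4)}

Z = sum(raw.values())  # ONE global partition function, computed after aggregation
for path, s in sorted(raw.items(), key=lambda kv: -kv[1]):
    print(path, round(s / Z, 4))
```

With normalization applied once, after the potentials are multiplied, (2, 4) takes essentially all the mass, matching the slide's ~100%.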

  • Fun fact

    Fun fact: We have seen this in class before!

    Graphical model G(E, V) with joint potential $\Psi(V)$, C the set of cliques in G with $c \in C$ having edges $E_c$ and vertices $V_c$:

    $P(V) \propto \Psi(V) = \prod_{c \in C} \Psi_c(V_c)$

    M³ nets: cliques are pairs, and all conditioned on observed x:

    $P(y \mid x) \propto \prod_{(i,j) \in E} \psi_{ij}(y_i, y_j, x)$

    AMN: cliques are pairs of nodes and singletons:

    $P_\phi(y) = \frac{1}{Z} \prod_i^N \phi_i(y_i) \prod_{(i,j) \in E} \phi_{ij}(y_i, y_j)$

  • Where do Parameters Come From?

    CRFs are part of the same general class:

    $P \propto \Psi = \prod_{c \in C} \Psi_c$

    For trees, cliques are pairs of vertices sharing an edge ($e \in E$) and single vertices ($v \in V$):

    $\Psi = \prod_{e \in E} \Psi_e \prod_{v \in V} \Psi_v$

    And because this is a conditional network with $V = X \cup Y$:

    $P(y \mid x) \propto \prod_{e \in E} \Psi_e(e, y|_e, x) \prod_{v \in V} \Psi_v(v, y|_v, x)$

    An exponential identity ($\Psi = \exp \log \Psi$) gives us:

    $P(y \mid x) \propto \exp\Big( \sum_{e \in E} \log \Psi_e(e, y|_e, x) + \sum_{v \in V} \log \Psi_v(v, y|_v, x) \Big)$

    Potentials can be ANY positive values... like (exponentiated) linear combinations of arbitrary features:

    $P(y \mid x) \propto \exp\Big( \sum_{e \in E,\, k} \lambda_k f_k(e, y|_e, x) + \sum_{v \in V,\, k} \mu_k g_k(v, y|_v, x) \Big)$
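As a small illustration of that last step, here is a sketch of a log-linear potential built from binary features. The feature functions, labels and weights below are invented for illustration; only the exponentiated-linear form comes from the slide.

```python
import math

# Illustrative edge features f_k(y_prev, y, x, i) and vertex features g_k(y, x, i).
def f_trans(y_prev, y, x, i):
    return 1.0 if (y_prev, y) == ('NOUN', 'VERB') else 0.0

def g_capitalized(y, x, i):
    return 1.0 if y == 'NOUN' and x[i][0].isupper() else 0.0

lam = {f_trans: 1.5}       # edge weights (the lambda_k)
mu = {g_capitalized: 0.8}  # vertex weights (the mu_k)

def log_potential(y_prev, y, x, i):
    # log Psi = sum_k lambda_k f_k + sum_k mu_k g_k
    s = sum(w * f(y_prev, y, x, i) for f, w in lam.items())
    s += sum(w * g(y, x, i) for g, w in mu.items())
    return s

x = ['Dogs', 'bark']
print(math.exp(log_potential('NOUN', 'VERB', x, 1)))  # any positive value is legal
```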

  • Training algorithms

    Improved iterative scaling

    Want to maximize the log-likelihood with respect to the parameters $\theta = (\lambda_1, \lambda_2, \cdots; \mu_1, \mu_2, \cdots)$.

    Method: Improved Iterative Scaling¹, an extension of Generalised Iterative Scaling (Darroch and Ratcliff 1972). Improved because features need not sum to a constant.

    Idea: a new set of parameters $\theta' = \theta + \delta\theta = (\lambda_1 + \delta\lambda_1, \cdots; \mu_1 + \delta\mu_1, \cdots)$ which will not decrease the objective function. Iteratively apply!

    Problem: slow, and nobody uses this any more.

    ¹ Della Pietra et al. (1997)

  • Training algorithms

    Modern CRF training - L-BFGS

    Generally use the L-BFGS² algorithm.

    Approximates Newton's method: optimise a multivariate function $f(\theta)$ through the updates

    $\theta_{t+1} = \theta_t - [H_f(\theta_t)]^{-1} \nabla f(\theta_t)$

    Quasi-Newton: approximates the Hessian $H_f(\theta)$.

    Limited-memory: doesn't store the full (approximate) Hessian.

    ² Limited-memory Broyden-Fletcher-Goldfarb-Shanno algorithm
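A minimal sketch of the modern recipe, using SciPy's L-BFGS-B on the negative log-likelihood of a tiny log-linear (maxent) model. The data and model are toy stand-ins, not the paper's CRF or experiments; the same pattern applies to a CRF once its log-likelihood and gradient (expected minus empirical feature counts) can be evaluated.

```python
import numpy as np
from scipy.optimize import minimize

# Toy binary maxent model: P(y=1 | x) = sigmoid(theta . x)
X = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [0.0, 0.0]])
y = np.array([1.0, 1.0, 0.0, 0.0])

def neg_log_likelihood(theta):
    z = X @ theta
    return -np.sum(y * z - np.logaddexp(0.0, z))

def grad(theta):
    p = 1.0 / (1.0 + np.exp(-(X @ theta)))
    return X.T @ (p - y)  # expected minus empirical feature counts

res = minimize(neg_log_likelihood, x0=np.zeros(2), jac=grad, method='L-BFGS-B')
print(res.x, res.fun)
```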

  • Performance

    Label bias

    Generate data with a noisy HMM.

    Five-state system (not counting the 'initial state'), with transitions:

    1 → 2 → 3
    4 → 5 → 3

    Emissions: highly biased!

    P(X = Y's preferred value | Y) = 29/32,  P(X = each other value | Y) = 1/32

    Preferred values: 1 → 'r', 4 → 'r', 2 → 'i', 5 → 'o', 3 → 'b'.

    Result: CRF error 4.6%, MEMM error 42%.
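One plausible reading of this setup as a data generator is sketched below. The start-state distribution is not given on the slide, so the uniform choice between states 1 and 4 is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

# Deterministic paths 1 -> 2 -> 3 and 4 -> 5 -> 3; 3 is the final state.
next_state = {1: 2, 2: 3, 4: 5, 5: 3}
preferred = {1: 'r', 4: 'r', 2: 'i', 5: 'o', 3: 'b'}
alphabet = ['r', 'i', 'o', 'b']

def emit(state):
    # P(preferred | state) = 29/32; the remaining 3/32 split over the other symbols.
    if rng.random() < 29 / 32:
        return preferred[state]
    return rng.choice([a for a in alphabet if a != preferred[state]])

def sample_sequence():
    ys = [1 if rng.random() < 0.5 else 4]  # uniform start: an assumption
    while ys[-1] != 3:
        ys.append(next_state[ys[-1]])
    return [emit(s) for s in ys], ys

print(sample_sequence())  # e.g. (['r', 'i', 'b'], [1, 2, 3])
```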

  • Performance

    Mixed-order sources

    Generate data with a mixed-order HMM:

    Transitions: $(1-\alpha)\, p_1(y_i \mid y_{i-1}) + \alpha\, p_2(y_i \mid y_{i-1}, y_{i-2})$

    Emissions: $(1-\alpha)\, p_1(x_i \mid y_i) + \alpha\, p_2(x_i \mid y_i, x_{i-1})$

    Five labels, 26 observation values.

    Training/testing: 1000 sequences of length 25.

    CRF trained with Algorithm S (modified IIS). MEMM trained with iterative scaling.

    Viterbi to label the test set.
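A sketch of a matching generator, under stated assumptions: the component distributions p1 and p2 are drawn at random here (the slide does not specify the paper's actual sources), and the first steps fall back to the first-order terms where no history exists yet.

```python
import numpy as np

rng = np.random.default_rng(0)
L, V, alpha = 5, 26, 0.5  # five labels, 26 observation values, mixing weight

# Random component distributions, standing in for the paper's fixed sources.
p1_trans = rng.dirichlet(np.ones(L), size=L)       # p1(y_i | y_{i-1})
p2_trans = rng.dirichlet(np.ones(L), size=(L, L))  # p2(y_i | y_{i-1}, y_{i-2})
p1_emit = rng.dirichlet(np.ones(V), size=L)        # p1(x_i | y_i)
p2_emit = rng.dirichlet(np.ones(V), size=(L, V))   # p2(x_i | y_i, x_{i-1})

def sample_sequence(T=25):
    ys = [int(rng.integers(L))]                      # uniform start: an assumption
    xs = [int(rng.choice(V, p=p1_emit[ys[0]]))]
    for i in range(1, T):
        if i == 1:
            pt = p1_trans[ys[-1]]                    # no y_{i-2} yet
        else:
            pt = (1 - alpha) * p1_trans[ys[-1]] + alpha * p2_trans[ys[-1], ys[-2]]
        yi = int(rng.choice(L, p=pt))
        pe = (1 - alpha) * p1_emit[yi] + alpha * p2_emit[yi, xs[-1]]
        xs.append(int(rng.choice(V, p=pe)))
        ys.append(yi)
    return xs, ys

xs, ys = sample_sequence()
print(ys[:5], xs[:5])
```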

  • Performance

    Mixed-order sources: results

    [Scatter plot of CRF error vs. MEMM error omitted; squares mark runs with α < 0.5.]

    CRF sort of wins?

  • Performance

    Part of Speech Tagging

    Penn Treebank: 45 syntactic tags; label each word in a sentence.

    Train first-order HMM, MEMM, CRF.

    Spelling features exploit the conditional framework. Examples: starts with a number / upper case? contains a hyphen? has a suffix?
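A sketch of what such spelling features might look like as code; the exact feature set and the suffix list are illustrative, not the paper's.

```python
# Illustrative spelling features for a single token.
SUFFIXES = ('ing', 'ed', 'ly', 'ion', 's')  # an example list, not the paper's

def spelling_features(word):
    feats = {
        'starts_with_number': word[:1].isdigit(),
        'starts_with_upper': word[:1].isupper(),
        'contains_hyphen': '-' in word,
    }
    for s in SUFFIXES:
        feats[f'suffix={s}'] = word.endswith(s)
    return feats

print(spelling_features('Well-known'))  # upper case and hyphen fire; no suffix does
```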

  • Skip-chain CRF

    Example: the skip-chain CRF³. Has long-range features!

    Basic idea: extend the linear-chain CRF by joining some distant observations with 'skip edges'.

    Connect multiple mentions of an entity across a whole document.

    ³ From An Introduction to Conditional Random Fields for Relational Learning, Charles Sutton and Andrew McCallum, 2006.

  • Skip-chain CRF

    Example (factor-graph figure omitted). Note: squares denote factors (e.g. potential functions).

    Question: Ignoring the skip edges (in blue), what potentials does $Y_i$ appear in?

    Answer: $\Psi(Y_i, Y_{i-1}, X_i)$ and $\Psi(Y_{i+1}, Y_i, X_{i+1})$.

  • Skip-chain CRF

    Skip-chain task

    Data: 485 email announcements for seminars at CMU.

    Task: identify start time, end time, location, speaker.

    Linear-chain CRF with skip edges between identical capitalised words.

    Other word-specific features, e.g. 'appears in a list of first names', 'upper case', 'appears to be part of a time/date' (by regex), etc.
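A sketch of the skip-edge construction described above, assuming 'capitalised' simply means the token starts with an upper-case letter.

```python
# Connect each capitalised token to earlier identical capitalised tokens.
def skip_edges(tokens):
    seen = {}    # capitalised word -> earlier positions
    edges = []
    for i, w in enumerate(tokens):
        if w[:1].isupper():
            edges.extend((j, i) for j in seen.get(w, []))
            seen.setdefault(w, []).append(i)
    return edges

tokens = "Speaker : Prof. Smith . Smith will talk at 3pm .".split()
print(skip_edges(tokens))  # [(3, 5)] links the two 'Smith' mentions
```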

  • Skip-chain CRF

    Skip-chain results

    [Results table omitted; values are F1 scores.]

    Repeated occurrences of the speaker improve skip-chain performance.

    Tokens are classified consistently by the skip-chain model: linear-chain is inconsistent on 30.2 speakers (on average); skip-chain on 4.8.

  • Summary

    CRFs combine discriminative (e.g. MEMM) and undirected (e.g. MRF) properties to solve problems:

    Global normalisation avoids label bias.

    Conditioning on observations avoids modelling complex dependencies.

    Enables the use of features over global structure.

    The examples in the paper are strangely insubstantial, but CRFs are widely and successfully used.