
Page 1: Latent Variables Naman Agarwal Michael Nute May 1, 2013

Latent Variables

Naman Agarwal, Michael Nute

May 1, 2013

Page 2: Latent Variables Naman Agarwal Michael Nute May 1, 2013

Latent Variables: Contents

• Definition & example of latent variables
• EM algorithm refresher
• Structured SVM with latent variables
• Learning under semi-supervision or indirect supervision
– CoDL
– Posterior Regularization
– Indirect Supervision

Page 3: Latent Variables Naman Agarwal Michael Nute May 1, 2013

Latent Variables: General Definition & Examples

A latent variable in a machine learning algorithm is one that is assumed to exist (or to have a null value) but is never observed directly; instead it is inferred from the observed variables.
• It generally corresponds to some meaningful element of the problem for which direct supervision is intractable.
• Latent variable methods often treat the variable as part of the input/feature space (e.g. PCA, factor analysis) or as part of the output space (e.g. EM).
– This distinction is only illustrative, though, and can be blurred, as we will see with indirect supervision.

Latent Input Variables: 𝒳 → 𝒳∗ (unobserved) → 𝒴

As part of the input space, the observed input 𝒳 affects the output 𝒴 only through the unobserved variable 𝒳∗. This formulation is only helpful if the dimension of 𝒳∗ is smaller than the dimension of 𝒳, so latent variables here are essentially an exercise in dimension reduction.

Latent Output Variables: 𝒳 → 𝒴∗ (unobserved) → 𝒴 (observed)

When we think of a latent variable as part of the output space, the method becomes an exercise in unsupervised or semi-supervised learning. (Both pictures are made concrete below.)
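Not from the slides, but one way to make the two pictures concrete is to write them probabilistically, treating everything as random variables and assuming, in the latent-input case, that 𝒴 depends on 𝒳 only through 𝒳∗ (as the diagram suggests):

\text{Latent input:}\quad p(y \mid x) = \sum_{x^*} p(y \mid x^*)\, p(x^* \mid x)
\text{Latent output:}\quad p(y \mid x) = \sum_{y^*} p(y, y^* \mid x)

In both cases, learning has to marginalize (or maximize) over the unobserved quantity, which is what EM and the methods below do.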

Page 4: Latent Variables Naman Agarwal Michael Nute May 1, 2013

Example: Paraphrase Identification

Problem: Given sentences A and B, determine whether they are paraphrases of each other.
• Note that if they are paraphrases, then there will exist a mapping between the named entities and predicates of the two sentences.
• The mapping is not directly observed; it is a latent variable in the decision problem of determining whether the sentences say the same thing.

A: Druce will face murder charges, Conte said.

B: Conte said Druce will be charged with murder.

(The mapping between the semantic elements of A and B is the latent structure.)

Revised Problem: Given sentences A and B, determine the mapping of semantic elements between A and B.
• Now we are trying to learn specifically the mapping between them, so we can use the Boolean question in the previous problem as a latent variable.
• In practice, the Boolean question is easy to answer, so we can use it to guide the semi-supervised task of mapping semantic elements.
• This is called indirect supervision (more on that later).

1 Example taken from a talk by D. Roth, "Constraints Driven Structured Learning with Indirect Supervision," Language Technologies Institute Colloquium, Carnegie Mellon University, Pittsburgh, PA, April 2010.

Page 5: Latent Variables Naman Agarwal Michael Nute May 1, 2013

The EM Algorithm Refresher

The EM Algorithm (formally)

Setup:
• Observed data: X
• Unobserved data: Z
• Unknown parameters: θ
• Log-likelihood function: L(θ; X, Z) = log p(X, Z | θ)

Algorithm:
Initialize θ^(0).
E-Step: find the expected value of the log-likelihood over the unobserved data, given the current estimate of the parameters:
Q(θ | θ^(t)) = E_{Z | X, θ^(t)}[ log p(X, Z | θ) ]
M-Step: find the parameters that maximize the expected log-likelihood function:
θ^(t+1) = argmax_θ Q(θ | θ^(t))

(The E-step takes the expectation over the possible "labels" of the unobserved Z.)

In practice, many algorithms that use latent variables have a structure similar to the Expectation-Maximization algorithm (even though EM is not discriminative and some of the others are), so let's review it.
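To make the E/M loop concrete, here is a minimal, self-contained sketch of soft EM for a two-component 1-D Gaussian mixture. The data, initialization, and variable names are illustrative assumptions, not taken from the slides; numpy is assumed.

# Soft EM for a two-component 1-D Gaussian mixture (illustrative sketch).
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(3.0, 1.0, 300)])  # observed data X

# Unknown parameters theta = (pi, mu, sigma); the component Z of each point is unobserved.
pi, mu, sigma = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])

def log_gauss(x, mu, sigma):
    return -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2)

for step in range(50):
    # E-step: responsibilities r[i, k] = P(Z_i = k | x_i, theta)
    log_r = np.log(pi) + log_gauss(x[:, None], mu, sigma)
    log_r -= log_r.max(axis=1, keepdims=True)      # for numerical stability
    r = np.exp(log_r)
    r /= r.sum(axis=1, keepdims=True)

    # M-step: maximize the expected complete-data log-likelihood
    nk = r.sum(axis=0)
    pi = nk / len(x)
    mu = (r * x[:, None]).sum(axis=0) / nk
    sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / nk)

print(pi, mu, sigma)

The E-step fills in a soft distribution over the "labels" Z, and the M-step re-fits the parameters against those soft counts.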

Page 6: Latent Variables Naman Agarwal Michael Nute May 1, 2013

The EM Algorithm: Hard EM vs. Soft EM

• The algorithm on the previous slide is often called Soft EM because it computes the expectation of the log-likelihood function in the E-step.
• An important variation is called Hard EM: instead of computing an expectation, we simply choose the MAP value of the unobserved Z and proceed with the likelihood function conditional on that value.
• This is a simpler procedure, and many latent variable methods essentially resemble it:

Label → Train (repeat until convergence)
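For contrast, here is the same mixture fit with the Hard-EM recipe: the E-step is replaced by a MAP assignment of each point ("Label"), and the parameters are re-fit on those hard labels ("Train"). Again an illustrative sketch, not code from the talk.

# Hard-EM variant of the Gaussian-mixture sketch above.
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(3.0, 1.0, 300)])
pi, mu, sigma = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])

for step in range(50):
    # "Label": MAP value of the latent component for each point
    log_p = np.log(pi) - 0.5 * np.log(2 * np.pi * sigma**2) - (x[:, None] - mu) ** 2 / (2 * sigma**2)
    z = log_p.argmax(axis=1)

    # "Train": maximum-likelihood parameters given the hard labels
    for k in range(2):
        xk = x[z == k]
        pi[k] = len(xk) / len(x)
        mu[k] = xk.mean()
        sigma[k] = xk.std() + 1e-8   # small floor to avoid a degenerate component

print(pi, mu, sigma)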

Page 7: Latent Variables Naman Agarwal Michael Nute May 1, 2013

Yu & Joachims—Learning Structured SVMs with Latent VariablesModel Formulation

General Structured SVM Formulation:Solve:

Where: are input and structure for training example . is the feature vector is the loss function in the output space is the weight vector

Structured SVM Formulation with Latent Variable:

Let be an unobserved variable. Since the predicted now depends on , the predicted value of the latent variable , the loss function of the actual and may now become a function of as well:

So our new optimization problem becomes:

Problem is now the difference of two convex functions, so we can solve it using a concave-convex procedure (CCCP).
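The slide's equations were images and did not survive the transcript; the following is a reconstruction of the two objectives in the notation just defined, following Yu & Joachims (2009) up to notational details.

\min_w \; \frac{1}{2}\|w\|^2 + C \sum_i \Big[ \max_{\hat y} \big( \Delta(y_i,\hat y) + w \cdot \Phi(x_i,\hat y) \big) - w \cdot \Phi(x_i, y_i) \Big] \qquad \text{(standard, margin rescaling)}

\min_w \; \frac{1}{2}\|w\|^2 + C \sum_i \max_{\hat y, \hat h} \big( \Delta(y_i,\hat y,\hat h) + w \cdot \Phi(x_i,\hat y,\hat h) \big) \;-\; C \sum_i \max_{h} \; w \cdot \Phi(x_i, y_i, h) \qquad \text{(with latent } h\text{)}

The first two terms of the latent objective are convex in w (a squared norm plus maxima of affine functions), while the last term is the negative of a maximum of affine functions and hence concave; the objective is therefore a difference of convex functions, which is what licenses CCCP.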

Page 8: Latent Variables Naman Agarwal Michael Nute May 1, 2013

Yu & Joachims—Learning Structured SVMs with Latent VariablesOptimization Methodology & Notes

The CCCP:1.Compute

for each

2.Update by solving the standard Structured SVM formulation, treating each as though it were an observed value.

(repeat until convergence)

Note the similarity to the simple way we looked at Hard-EM earlier: first we label the unlabeled values, then we re-train the model based on the newly labeled values.

Notes:
• Technically the loss function would compare the true (y_i, h_i) to the predicted (ŷ, ĥ), but since we do not observe h_i, we are restricted to loss functions that reduce to the form Δ(y_i, ŷ, ĥ) shown above.
• It is not strictly necessary that the loss function depend on ĥ; in NLP it often does not.
• In the absence of latent variables, the optimization problem reduces to the general structured SVM formulation.
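To show the shape of the alternation (not the authors' implementation), here is a deliberately simplified sketch: a binary classification problem in which each example carries two candidate feature vectors and the latent variable h picks one of them. The synthetic data generator, the feature map Φ(x, y, h) = y·x[h], and the use of scikit-learn's LinearSVC as the inner "fully supervised" solver are all assumptions made for the example.

# Toy CCCP-style alternation for a latent-variable SVM (illustrative simplification).
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
n, d = 400, 5
y = rng.choice([-1, 1], size=n)
w_true = rng.normal(size=d)
h_true = rng.integers(0, 2, size=n)            # which of the two candidates actually reflects the label
good = y[:, None] * w_true + 0.3 * rng.normal(size=(n, d))
noise = rng.normal(size=(n, d))
X = np.empty((n, 2, d))
X[np.arange(n), h_true] = good                  # informative candidate sits at index h_true
X[np.arange(n), 1 - h_true] = noise             # the other candidate is pure noise

h = np.zeros(n, dtype=int)                      # arbitrary initial completion of the latent variable
for it in range(10):
    # "Train": solve the fully supervised problem, treating the imputed h as observed
    clf = LinearSVC(C=1.0).fit(X[np.arange(n), h], y)
    w = clf.coef_.ravel()
    # "Impute": h_i* = argmax_h  w . Phi(x_i, y_i, h), with Phi(x, y, h) = y * x[h] here
    new_h = (y[:, None] * (X @ w)).argmax(axis=1)
    if np.array_equal(new_h, h):
        break
    h = new_h

print("fraction of latent choices recovered:", (h == h_true).mean())

Step 1 of the CCCP above corresponds to the new_h line (impute the latent value that best completes the observed label under the current w), and step 2 to the LinearSVC re-fit.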

Page 9: Latent Variables Naman Agarwal Michael Nute May 1, 2013

Learning under semi-supervision
• A labeled dataset is hard to obtain; we generally have a small labeled dataset and a large unlabeled dataset.
• Naïve algorithm (a kind of EM), sketched in code below:
– Train on the labeled dataset [Initialization]
– Make inferences on the unlabeled set [Expectation]
– Include the inferred labels in your training data [Maximization]
– Repeat
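A minimal sketch of this naïve loop on synthetic data; the dataset and the choice of LogisticRegression are illustrative assumptions, not part of the slides.

# Naive self-training loop: train on labeled data, label the unlabeled data, retrain, repeat.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = (X[:, 0] + 0.5 * X[:, 1] + 0.2 * rng.normal(size=1000) > 0).astype(int)
X_lab, y_lab = X[:50], y[:50]                          # small labeled set
X_unlab = X[50:]                                        # large unlabeled set

clf = LogisticRegression().fit(X_lab, y_lab)            # Initialization
for it in range(5):
    y_guess = clf.predict(X_unlab)                      # "Expectation": infer labels for the unlabeled set
    X_all = np.vstack([X_lab, X_unlab])
    y_all = np.concatenate([y_lab, y_guess])
    clf = LogisticRegression().fit(X_all, y_all)        # "Maximization": retrain on everything

print("accuracy on the unlabeled portion:", (clf.predict(X_unlab) == y[50:]).mean())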

Can we do better? Indirect supervision:
• Constraints
• Binary decision problems

Page 10: Latent Variables Naman Agarwal Michael Nute May 1, 2013

Constraint-Driven Learning (CoDL)

• Proposed by Chang et al. [2007]
• Uses constraints obtained from domain knowledge to streamline semi-supervision
• The constraints are quite general
• Incorporates soft constraints

Page 11: Latent Variables Naman Agarwal Michael Nute May 1, 2013

Why are constraints useful?

Desired (correct) segmentation:
[AUTHOR Lars Ole Anderson . ] [TITLE Program Analysis and specification for the C programming language . ] [TECH-REPORT PhD thesis , ] [INSTITUTION DIKU , University of Copenhagen , ] [DATE May 1994 . ]

An HMM trained on 30 data sets produces the noisy segmentation:
[AUTHOR Lars Ole Anderson . Program Analysis and ] [TITLE specification for the ] [EDITOR C ] [BOOKTITLE programming language . ] [TECH-REPORT PhD thesis , ] [INSTITUTION DIKU , University of Copenhagen , May ] [DATE 1994 . ]

This leads to noisy predictions. A simple constraint that state transitions can occur only at punctuation marks produces the correct output.

Page 12: Latent Variables Naman Agarwal Michael Nute May 1, 2013

CoDL Framework

Notation:
• L = {(x_i, y_i)} is the labeled dataset
• U = {x_j} is the unlabeled dataset
• Φ(x, y) represents a feature vector

Structured learning task: learn w such that the prediction is y* = argmax_y w·Φ(x, y).
• {C_1, …, C_K} are the set of constraints, where each C_k maps an output y (for a given input x) to satisfied or violated.

Page 13: Latent Variables Naman Agarwal Michael Nute May 1, 2013

CoDL Objective

• If the constraints are hard, inference selects the highest-scoring output among those that satisfy every constraint.
• If the constraints are soft, they define a notion of violation via a distance function d that measures how far an assignment y is from satisfying a constraint.
• The objective in this "soft" formulation penalizes the model score by the weighted constraint violations; a reconstruction is given below.
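The slide's formulas were images; the following is a reconstruction of the inference objectives along the lines of the constrained conditional model formulation in Chang et al. [2007], whose exact notation may differ.

\text{Hard constraints:}\quad y^* = \arg\max_{y \,:\, C_k(x, y) = 1 \;\forall k} \; w \cdot \Phi(x, y)

\text{Soft constraints:}\quad y^* = \arg\max_{y} \; \Big[ w \cdot \Phi(x, y) - \sum_{k} \rho_k \, d\big(y, \mathbf{1}_{C_k}(x)\big) \Big]

Here d(y, 1_{C_k}(x)) measures how far y is from the nearest assignment satisfying constraint C_k (e.g. a minimal Hamming distance), and ρ_k is the penalty for violating C_k.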

Page 14: Latent Variables Naman Agarwal Michael Nute May 1, 2013

Learning Algorithm

Divided into four steps.
• Initialization: train an initial model on the labeled set L.
• Expectation: for each x in the unlabeled set U, generate the best K "valid" assignments to Y (assignments that satisfy the constraints) using beam-search techniques.
– This can be thought of as assigning a uniform posterior distribution over those K assignments and 0 everywhere else.

Page 15: Latent Variables Naman Agarwal Michael Nute May 1, 2013

Learning Algorithm (cntd.)

• Maximization: re-estimate the weights on the labeled data together with the K inferred completions of the unlabeled data, and combine the result with the supervised model; the smoothing parameter γ keeps the model from drifting too far from the supervised model (see the reconstruction below).
• Repeat until convergence.
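A sketch of the maximization step as I read it from the slide and the CoDL paper (the exact weighting may differ): the new weights are a convex combination of the supervised model and a model re-trained on the constrained completions of the unlabeled data.

w^{(t+1)} = \gamma \, w^{(0)} + (1 - \gamma) \, \mathrm{learn}\Big( L \;\cup\; \{ (x, y_1^x), \dots, (x, y_K^x) : x \in U \} \Big)

where w^{(0)} is the model trained on the labeled data alone, y_1^x, …, y_K^x are the top-K constraint-satisfying assignments from the expectation step, and γ is the smoothing parameter mentioned above.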

Page 16: Latent Variables Naman Agarwal Michael Nute May 1, 2013

Posterior Regularization [Ganchev et al. '09]

• Recall the distinction between hard and soft EM; PR modifies the (soft) E-step by imposing constraints, in expectation, on the posterior distribution of the latent variables.
• The objective function has two components:
– the log-likelihood, and
– the deviation between the predicted posterior and the closest distribution that satisfies the constraints.
• The constraints are specified in terms of expectations over an auxiliary posterior q, where q ranges over the set of all posterior distributions of the latent variables (a reconstruction of the objective is given below).
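A reconstruction of the posterior-regularization objective (the slide's formula was an image); this follows Ganchev et al. up to details such as slack-relaxed constraints.

J(\theta) = \mathcal{L}(\theta) \;-\; \sum_{x} \min_{q \in \mathcal{Q}} \mathrm{KL}\big( q(z) \,\|\, p_\theta(z \mid x) \big), \qquad \mathcal{Q} = \{ q : \mathbb{E}_q[\phi(x, z)] \le b \}

Here L(θ) is the log-likelihood, q is the auxiliary posterior over the latent variables, and Q is the set of posteriors whose constraint features φ have the required expectations.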

Page 17: Latent Variables Naman Agarwal Michael Nute May 1, 2013

The PR Algorithm

• Initialization: estimate the parameters from the labeled dataset.
• Expectation step: compute the constraint-satisfying distribution q that is closest to the current posterior.
• Maximization step: re-estimate the parameters to maximize the expected log-likelihood under q.
• Repeat until convergence. (The two steps are written out below.)
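Written out, under the same caveat that this is a reconstruction of the standard PR updates rather than a transcription of the slide:

\text{E}'\text{-step:}\quad q^{(t)} = \arg\min_{q \in \mathcal{Q}} \mathrm{KL}\big( q(z) \,\|\, p_{\theta^{(t)}}(z \mid x) \big)
\text{M-step:}\quad \theta^{(t+1)} = \arg\max_{\theta} \; \mathbb{E}_{q^{(t)}}\big[ \log p_\theta(x, z) \big]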

Page 18: Latent Variables Naman Agarwal Michael Nute May 1, 2013

Indirect Supervision: Motivation (Paraphrase Identification)

S1: Druce will face murder charges, Conte said.
S2: Conte said Druce will be charged with murder.

• There exists some latent structure H between S1 and S2.
• H acts as a justification for the binary decision, and can be used as an intermediate step in learning the model.

Page 19: Latent Variables Naman Agarwal Michael Nute May 1, 2013

Supervision through Binary Problems

• Now we ask the previous question in the reverse direction: given answers to the binary problem, can we improve our identification of the latent structure?
• Example:
– Structured prediction problem: field identification in classified advertisements (size, rent, etc.)
– Companion binary problem: is the text a well-formed advertisement? A labeled dataset for this binary question is easy to obtain.

Page 20: Latent Variables Naman Agarwal Michael Nute May 1, 2013

The Model [Chang et al. 2010]

Notation:
• L = {(x_i, y_i)} is the labeled dataset
• B = {(x_i, b_i)} is the binary (±1) labeled dataset, with B = B⁺ ∪ B⁻
• Φ(x, h) represents a feature vector over an input and a candidate structure

Structured learning task: learn w that scores structures via w·Φ(x, h). Additionally we require (formalized below) that:
• on B⁻, the weight vector scores all structures badly
• on B⁺, the weight vector scores some structure well
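One way to formalize the two requirements in the slide's notation (the paper folds these into a margin-based loss, shown two slides later, rather than using hard sign constraints):

\forall x_i \in B^-:\;\; \max_h \, w \cdot \Phi(x_i, h) < 0 \qquad \text{(every structure scores badly)}
\forall x_i \in B^+:\;\; \max_h \, w \cdot \Phi(x_i, h) > 0 \qquad \text{(some structure scores well)}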

Page 21: Latent Variables Naman Agarwal Michael Nute May 1, 2013

Loss Function

• The previous "constraint" can be captured by a loss function on the best-scoring structure for each binary-labeled example.
• We then wish to optimize an objective that combines this binary loss with the usual structured prediction loss over the labeled dataset (the full objective is given with the model specification on the next slide).

Page 22: Latent Variables Naman Agarwal Michael Nute May 1, 2013

Indirect Supervision: Model Specification

Setup:
• Fully-labeled training data: L = {(x_i, y_i)}
• Binary-labeled training data: B = {(x_i, b_i)}, where b_i ∈ {+1, −1} and B = B⁺ ∪ B⁻

Two conditions imposed on the weight vector:
• For negative examples, no structure scores well (i.e. there is no good predicted structure for the negative examples).
• For positive examples, some structure scores well (i.e. there is at least one good predicted structure for the positive examples).

So the optimization problem becomes the objective written out below, where:
• ℓ(·) is a common loss function such as the hinge loss
• κ_i is a normalization constant
• the term contributed by the positive binary examples is non-convex and must be optimized CCCP-style: fix the maximizing structures, solve the remaining convex problem given by the first two terms, and repeat.
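A reconstruction of the full objective (the slide's formula was an image), following Chang et al. (2010) up to how the constants and normalization are arranged:

\min_w \;\; \frac{1}{2}\|w\|^2 \;+\; C_1 \sum_{(x_i, y_i) \in L} L_S(x_i, y_i; w) \;+\; C_2 \sum_{(x_i, b_i) \in B} \ell\Big( 1 - b_i \, \max_{h} \frac{w \cdot \Phi(x_i, h)}{\kappa_i} \Big)

Here L_S is the usual structured (hinge) loss on the fully-labeled data, ℓ is the binary loss (e.g. hinge), and κ_i is the normalization constant. For b_i = −1 the binary term is convex in w; for b_i = +1 it is the non-convex piece handled CCCP-style as described above.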

Page 23: Latent Variables Naman Agarwal Michael Nute May 1, 2013

Latent Variables in NLP: Overview of Three Methods

For each method: a two-second description, the latent variable, the EM analogue, and the key advantage.

Structural SVM1
• 2-second description: structured SVM with latent variables and EM-like training.
• Latent variable: separate from, and independent of, the output variable.
• EM analogue: Hard EM; the latent value is found by maximizing the current model's score over completions of each training example.
• Key advantage: enables a structured SVM to be learned with a latent variable.

CoDL2
• 2-second description: train on labeled data, generate the K best constrained structures for the unlabeled data and train on those, then average the two models.
• Latent variable: the output variable for the unlabeled training examples.
• EM analogue: Soft EM with a uniform distribution on the top-K predicted outputs.
• Key advantage: efficient semi-supervised learning when constraints are difficult to guarantee for predictions but easy to evaluate.

Indirect Supervision3
• 2-second description: take a small number of labeled examples plus many examples where we only know whether a label exists; train a model on both at the same time.
• Latent variables: (1) the companion binary-decision variable; (2) the output structure on positive, unlabeled examples.
• EM analogue: Hard EM where a structure label is imputed only for examples where the binary classifier is positive.
• Key advantage: combines the information gained from indirect supervision (on lots of data) with direct supervision.

1 Learning Structural SVMs with Latent Variables. Chun-Nam John Yu and T. Joachims. ICML 2009.
2 Guiding Semi-Supervision with Constraint-Driven Learning. M. Chang, L. Ratinov and D. Roth. ACL 2007.
3 Structured Output Learning with Indirect Supervision. M. Chang, V. Srikumar, D. Goldwasser and D. Roth. ICML 2010.