Learning Data Representations with “Partial Supervision”
Ariadna Quattoni
Outline

- Motivation: low-dimensional representations
- Principal Component Analysis
- Structural Learning
- Vision applications
- NLP applications
- Joint sparsity
- Vision applications
Semi-Supervised Learning

Core task: learn a function F : X → Y, where X ⊆ R^d is the “raw” feature space and Y = {-1, +1} is the output space.

Classical setting:
- Labeled dataset (small): T = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}
- Unlabeled dataset (large): U = {x_1, x_2, ..., x_u}

Partial supervision setting:
- Labeled dataset (small): T as above
- Partially labeled dataset (large): U = {(x_1, c_1), (x_2, c_2), ..., (x_u, c_u)}
Semi-Supervised Learning: Classical Setting

1. From the unlabeled dataset, learn a representation G : X → X', a dimensionality reduction from X ⊆ R^d to X' ⊆ R^h with h ≪ d, of the form G(x) = θx with θ ∈ R^{h×d}.
2. From the labeled dataset, train a classifier F : G(X) → Y.
Semi-Supervised Learning: Partial Supervision Setting

1. From the unlabeled dataset plus partial supervision, learn a representation G : X → X', a dimensionality reduction from X ⊆ R^d to X' ⊆ R^h with h ≪ d, of the form G(x) = θx with θ ∈ R^{h×d}.
2. From the labeled dataset, train a classifier F : G(X) → Y.
Why is “learning representations” useful?
Infer the intrinsic dimensionality of the data.
Learn the “relevant” dimensions.
Infer the hidden structure.
Example: Hidden Structure

20 symbols: S = {s_1, s_2, ..., s_20}
4 topics: T_1 = {s_1, ..., s_5}, T_2 = {s_6, ..., s_10}, T_3 = {s_11, ..., s_15}, T_4 = {s_16, ..., s_20}

Generate a data point: choose a topic T, then sample 3 symbols from T. A subset of 3 symbols, e.g. {s_1, s_10, s_11}, is encoded as the vector

x = [x_1 = 1/3, x_2 = 0, ..., x_10 = 1/3, x_11 = 1/3, ..., x_20 = 0]

[Figure: the 20×20 data covariance matrix.]
Example: Hidden Structure

Number of latent dimensions = 4. We want a function that maps each x to the topic that generated it: G(x) = θx, where the projection matrix θ is the 4×20 matrix whose i-th row is the indicator of the five symbols of topic T_i:

θ = [ 1 1 1 1 1 | 0 0 0 0 0 | 0 0 0 0 0 | 0 0 0 0 0 ]
    [ 0 0 0 0 0 | 1 1 1 1 1 | 0 0 0 0 0 | 0 0 0 0 0 ]
    [ 0 0 0 0 0 | 0 0 0 0 0 | 1 1 1 1 1 | 0 0 0 0 0 ]
    [ 0 0 0 0 0 | 0 0 0 0 0 | 0 0 0 0 0 | 1 1 1 1 1 ]

Multiplying a data point by θ sums its mass within each topic, so a point whose three 1/3-weighted symbols came from a single topic T_i is mapped to the latent topic vector with a 1 in coordinate i and 0 elsewhere.
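A toy sketch of this generative process and the topic projection (assuming NumPy; the sizes follow the slides, and the variable names are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
n_symbols, n_topics, per_topic = 20, 4, 5

# Row t of theta is the indicator of topic t's block of 5 symbols.
theta = np.zeros((n_topics, n_symbols))
for t in range(n_topics):
    theta[t, t * per_topic:(t + 1) * per_topic] = 1.0

def sample_point():
    """Choose a topic, sample 3 of its symbols, weight each by 1/3."""
    t = rng.integers(n_topics)
    idx = rng.choice(np.arange(t * per_topic, (t + 1) * per_topic),
                     size=3, replace=False)
    x = np.zeros(n_symbols)
    x[idx] = 1.0 / 3.0
    return t, x

t, x = sample_point()
z = theta @ x                    # latent representation
assert abs(z[t] - 1.0) < 1e-9    # all mass falls in the generating topic
```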
Outline (next: Principal Component Analysis)
Classical Setting: Principal Component Analysis

Treat the rows of θ as a “basis”: an example generated from topic T_i is reconstructed as x' = Σ_{j=1}^{4} z_j θ_j, where z puts weight 1/3 on the generating topic:

z = [1/3, 0, 0, 0] for T_1
z = [0, 1/3, 0, 0] for T_2
z = [0, 0, 1/3, 0] for T_3
z = [0, 0, 0, 1/3] for T_4

This gives a low reconstruction error ||x - x'||₂² = 2(1/3)², since x' = (1/3)θ_i differs from x only on the two unsampled symbols of the topic.
Minimum Error Formulation

Goal: approximate the high-dimensional x with a low-dimensional x'. Expand each point in an orthonormal basis {u_i} (u_i^T u_j = 0 for i ≠ j), keeping only the first m coordinates free:

x'_n = Σ_{i=1}^{m} z_{ni} u_i + Σ_{i=m+1}^{d} b_i u_i

Error: J = (1/|U|) Σ_n ||x_n - x'_n||²

Solution: at the optimum, J = Σ_{i=m+1}^{d} u_i^T S u_i, where S is the data covariance matrix. Minimizing over the basis gives S u_i = λ_i u_i, i.e. the u_i are eigenvectors of S, and the distortion is J = Σ_{i=m+1}^{d} λ_i.
Principal Component Analysis: 2D Example

[Figure: 2D example showing the projection error onto the principal direction; in the rotated basis the variables are uncorrelated.]

When the covariance is already diagonal, u_i = e_i and λ_i = var(x_i): cut dimensions according to their variance. For dimensionality reduction to help, the variables must be correlated.
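The minimum-error view of PCA can be sketched directly (assuming NumPy; the data and variable names are mine, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)
# Correlated 2D data: the second coordinate is a noisy copy of the first.
a = rng.normal(size=500)
X = np.column_stack([a, a + 0.1 * rng.normal(size=500)])

S = np.cov(X.T, bias=True)          # data covariance matrix S
lam, U = np.linalg.eigh(S)          # solves S u_i = lambda_i u_i (ascending)
lam, U = lam[::-1], U[:, ::-1]      # reorder: largest eigenvalue first

m = 1                               # number of principal directions to keep
mu = X.mean(axis=0)
Z = (X - mu) @ U[:, :m]             # low-dimensional coordinates z_ni
X_rec = Z @ U[:, :m].T + mu         # reconstruction x'_n

# Distortion J = mean squared reconstruction error = sum of discarded eigenvalues.
J = np.mean(np.sum((X - X_rec) ** 2, axis=1))
assert np.isclose(J, lam[m:].sum())
```

The final assertion checks the identity from the previous slide: the distortion equals the sum of the eigenvalues of the discarded directions.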
Outline (next: Structural Learning)
Partial Supervision Setting [Ando & Zhang, JMLR 2005]

Unlabeled dataset + partial supervision → create auxiliary tasks → structure learning → representation G : X → X'.
Partial Supervision Setting

Unlabeled data + partial supervision:
- Images with associated natural-language captions.
- Video sequences with associated speech.
- Documents with keywords.

How could the partial supervision help?
- It is a hint for discovering important features.
- Use the partial supervision to define “auxiliary tasks”.
- Discover feature groupings that are useful for these tasks.

Sometimes “auxiliary tasks” can be defined from unlabeled data alone, e.g. an auxiliary task for word tagging: predicting substructures.
Auxiliary Tasks

- Machine learning papers, keywords: machine learning, dimensionality reduction; linear embedding, spectral methods, distance learning.
- Computer vision papers, keywords: object recognition, shape matching, stereo.

Mask the occurrences of the keywords in the documents.
Auxiliary task: predict “object recognition” from the document content.
Core task: is it a vision or a machine learning article?
Auxiliary Tasks

From the partially labeled set U = {(x_1, c_1), ..., (x_u, c_u)}, build for a keyword k the auxiliary dataset

D = {(x_1, y_1), ..., (x_u, y_u)}, with y_i = +1 if keyword k occurs with x_i, and y_i = -1 otherwise.
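Building these auxiliary labels is mechanical; a sketch (the helper name and the toy data are mine, assuming each data point carries a keyword set):

```python
def auxiliary_dataset(partially_labeled, keyword):
    """Turn U = [(x, keywords), ...] into D = [(x, y), ...] with
    y = +1 if the keyword occurs with x, and y = -1 otherwise."""
    return [(x, 1 if keyword in kws else -1) for x, kws in partially_labeled]

U = [([0.1, 0.9], {"object recognition", "stereo"}),
     ([0.8, 0.2], {"machine learning"})]
D = auxiliary_dataset(U, "object recognition")
# D == [([0.1, 0.9], 1), ([0.8, 0.2], -1)]
```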
Structure Learning

Learning with no prior knowledge: pick the best hypothesis from examples alone,
f̂ = argmin_{f ∈ F} L(f, D_j), with hypothesis class F = {f_w(x) = w · x}.

Learning with prior knowledge: restrict the hypothesis space to F(θ) = {f_v(x) = v^T θ x}, a subset of {f_w(x) = w · x}.

Learning from auxiliary tasks: choose F(θ) = {f_v(x) = v^T θ x} using hypotheses learned for related tasks.
Learning Good Hypothesis Spaces

Class of linear predictors: f(v, x) = v^T θ x, where θ is an h-by-d matrix of structural parameters. Goal: find the problem-specific parameters v_j and the shared θ that minimize the joint loss

Σ_{j=1}^{m} [ L(v_j, θ, D_j) + reg(v_j) ] + reg(θ)

where L(v_j, θ, D_j) is the loss on training set D_j, the v_j are problem-specific parameters, and θ is shared across all tasks.
Algorithm Step 1

Train classifiers for the auxiliary tasks:

w*_j = argmin_w Σ_i l(f(w, x_i), y_i) + (C/2) ||w||²
Algorithm Step 2: PCA on the Classifier Coefficients

Stack the auxiliary classifiers into W = [w*_1, w*_2, ..., w*_m]. Compute θ ∈ R^{h×d} by taking the first h eigenvectors of the covariance matrix W W^t.

θ spans a linear subspace of dimension h: a good low-dimensional approximation to the space of coefficients.
Algorithm Step 3: Training on the Core Task

Project the data: q(x) = θx. Then train

v* = argmin_v Σ_i l(f(v, q(x_i)), y_i) + (C/2) ||v||²

This is equivalent to training the core task in the original d-dimensional space with the parameter constraint w*_core = θ^t v*.
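The three steps can be sketched compactly (assuming NumPy; this is illustrative only, with ridge-regularized least squares standing in for the regularized loss of Step 1, and synthetic auxiliary tasks whose true weights share a low-dimensional subspace):

```python
import numpy as np

rng = np.random.default_rng(2)
d, h, n_aux, n_pts = 30, 4, 20, 200

# Synthetic auxiliary tasks: true weights lie in an h-dimensional subspace.
basis = rng.normal(size=(h, d))
W_true = rng.normal(size=(n_aux, h)) @ basis
X = rng.normal(size=(n_pts, d))
Y = np.sign(X @ W_true.T)

# Step 1: train one (ridge least-squares) classifier per auxiliary task.
C = 1.0
A = X.T @ X + C * np.eye(d)
W = np.linalg.solve(A, X.T @ Y).T      # rows are the learned w*_j

# Step 2: PCA on the coefficients: rows of theta are the top-h right
# singular vectors of W (eigenvectors of the coefficient covariance).
_, _, Vt = np.linalg.svd(W, full_matrices=False)
theta = Vt[:h]                          # h x d shared structure

# Step 3: project the core-task data onto q(x) = theta x and train v there;
# the induced weights in the original space are theta^T v.
Q = X @ theta.T                         # n_pts x h projected features
```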
Example
Object = { letter, letter, letter }
An object: abC
Example
The same object seen in a different font
Abc
Example
The same object seen in a different font
ABc
Example
The same object seen in a different font
abC
Example

6 letters (topics), 5 fonts per letter (symbols): 30 symbols, i.e. 30 binary features. An object such as acE is encoded by the indicator vector of its three symbols:

acE → [1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 ... 0 0 0 0 1]

20 words (objects) are formed from the letters, e.g. the “ABC” object, the “ADE” object, the “BCF” object, the “ABD” object.

Auxiliary task: recognize each word (object).
PCA on the Data Cannot Recover the Latent Structure

[Figure: the 30×30 covariance matrix of the data.]
PCA on the Coefficients Can Recover the Latent Structure

[Figure: the weight matrix W of the auxiliary tasks; rows are the 30 features (fonts), columns are the 20 auxiliary tasks (e.g. the parameter column for the object “BCD”), and the latent topics (letters) appear as row groups.]
PCA on the Coefficients Can Recover the Latent Structure

[Figure: the 30×30 covariance matrix of W, rows and columns indexed by the features (fonts).]

Each block of correlated variables corresponds to a latent topic.
Outline (next: Vision Applications)
News Domain

Dataset: news images from the Reuters website.
Problem: predicting news topics from images, e.g. figure skating, ice hockey, Golden Globes, Grammys.
Learning visual representations using images with captions
The Italian team celebrate their gold medal win during the flower ceremony after the final round of the men's team pursuit speedskating at Oval Lingotto during the 2006 Winter Olympics.
Former U.S. President Bill Clinton speaks during a joint news conference with Pakistan's Prime Minister Shaukat Aziz at Prime Minister house in Islamabad.
Diana and Marshall Reed leave the funeral of miner David Lewis in Philippi, West Virginia on January 8, 2006. Lewis was one of 12 miners who died in the Sago Mine.
Senior Hamas leader Khaled Meshaal (2nd-R), is surrounded by his bodyguards after a news conference in Cairo February 8, 2006.
Jim Scherr, the US Olympic Committee's chief executive officer seen here in 2004, said his group is watching the growing scandal and keeping informed about the NHL's investigation into Rick Tocchet,
U.S. director Stephen Gaghan and his girlfriend Daniela Unruh arrive on the red carpet for the screening of his film 'Syriana' which runs out of competition at the 56th Berlinale International Film Festival.
Auxiliary task: predict “team” from the image content.
Learning Visual Topics

The word “games” might contain the visual topics: medals, people, pavement. The word “demonstrations” might contain the visual topic: people. Auxiliary tasks share visual topics: different words can share topics, and each topic can be observed under different appearances.
Experimental Results
Outline (next: NLP Applications)
Chunking

- Named entity chunking: “Jane lives in New York and works for Bank of New York.” → PER, LOC, ORG.
- Syntactic chunking: “But economists in Europe failed to predict that …” → NP, VP, PP, SBAR.

Data points: word occurrences. Labels: Begin-PER, Inside-PER, Begin-LOC, ..., Outside.
Example Input Vector Representation

For the fragment “... lives in New York ...”, the word “in” is represented by the indicator features curr-“in” = 1, left-“lives” = 1, right-“New” = 1; the word “New” by curr-“New” = 1, left-“in” = 1, right-“York” = 1.

The input vectors X are high-dimensional, and most entries are 0.
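A sketch of building such sparse indicator vectors (the dict-based encoding and helper name are mine):

```python
def word_features(words, i):
    """Sparse indicator features for the word at position i:
    the current word plus its left and right context words."""
    feats = {f'curr-"{words[i]}"': 1}
    if i > 0:
        feats[f'left-"{words[i-1]}"'] = 1
    if i + 1 < len(words):
        feats[f'right-"{words[i+1]}"'] = 1
    return feats

sent = ["lives", "in", "New", "York"]
print(word_features(sent, 1))
# {'curr-"in"': 1, 'left-"lives"': 1, 'right-"New"': 1}
```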
Algorithmic Procedure

1. Create m auxiliary problems.
2. Assign auxiliary labels to the unlabeled data.
3. Compute θ (the shared structure) by joint empirical risk minimization over all the auxiliary problems.
4. Fix θ, and minimize the empirical risk on the labeled data for the target task.

Predictor: f(x) = w^T x + v^T θ x, where θx supplies the additional features.
Example Auxiliary Problems

Auxiliary problems of the form: Is the current word “New”? Is the current word “day”? Is the current word “IBM”? Is the current word “computer”? ...

Split the features into the current word (Φ1) and the context (Φ2: the left and right words). Predict Φ1 from Φ2, compute the shared θ, and add θΦ2 as new features.
Experiments (CoNLL-03 Named Entity)

- 4 classes: LOC, ORG, PER, MISC.
- Labeled data: news documents; 204K words (English), 206K words (German).
- Unlabeled data: 27M words (English), 35M words (German).
- Features: a slight modification of ZJ03. Words, POS, character types, 4 characters at the beginning/ending in a 5-word window; words in a 3-chunk window; labels assigned to the two words on the left; bi-gram of the current word and left label; labels assigned to previous occurrences of the current word.
- No gazetteer. No hand-crafted resources.
Auxiliary Problems

| # of aux. problems | Auxiliary labels | Features used for learning auxiliary problems |
|---|---|---|
| 1000 | Previous words | All but previous words |
| 1000 | Current words | All but current words |
| 1000 | Next words | All but next words |

3,000 auxiliary problems in total.
Syntactic Chunking Results (CoNLL-00)

| Method | Description | F-measure |
|---|---|---|
| supervised baseline | | 93.60 |
| ASO-semi | + unlabeled data | 94.39 (+0.79%) |
| Co/self oracle | + unlabeled data | 93.66 |
| KM01 | SVM combination | 93.91 |
| CM03 | perceptron in two layers | 93.74 |
| ZDJ02 | Reg. Winnow | 93.57 |
| ZDJ02+ | + full parser (ESG) output | 94.17 |

Exceeds the previous best systems.
Other Experiments

Confirmed effectiveness on:
- POS tagging
- Text categorization (2 standard corpora)
Outline (next: Joint Sparsity)
Notation

Collection of tasks: D = {D_1, D_2, ..., D_m}, where

D_k = {(x_1^k, y_1^k), ..., (x_{n_k}^k, y_{n_k}^k)},  x ∈ R^d,  y ∈ {-1, +1}

Joint sparse approximation: collect the per-task weight vectors as the columns of the d×m matrix

W = [ w_{1,1}  w_{1,2}  ...  w_{1,m} ]
    [ w_{2,1}  w_{2,2}  ...  w_{2,m} ]
    [   ...      ...    ...    ...   ]
    [ w_{d,1}  w_{d,2}  ...  w_{d,m} ]
Single-Task Sparse Approximation

Consider learning a single sparse linear classifier of the form f(x) = w · x. We want a few features with non-zero coefficients. Recent work suggests using L1 regularization:

w* = argmin_w Σ_{(x,y)∈D} l(f(x), y) + Q Σ_{j=1}^{d} |w_j|

The first term is the classification error; the L1 term penalizes non-sparse solutions. Donoho [2004] proved (in a regression setting) that the solution with the smallest L1 norm is also the sparsest solution.
Joint Sparse Approximation

Setting: learn one classifier per task, f_k(x) = w_k · x, coupled through a joint regularizer:

[w*_1, ..., w*_m] = argmin_{w_1,...,w_m} Σ_{k=1}^{m} (1/|D_k|) Σ_{(x,y)∈D_k} l(f_k(x), y) + Q R(w_1, ..., w_m)

The first term is the average loss on the training set of each task k; R penalizes solutions that utilize too many features.
Joint Regularization Penalty

How do we penalize solutions that use too many features? The natural choice is

R(W) = # of non-zero rows of W

where row i of W collects the coefficients of feature i across all classifiers, and column k collects the coefficients of classifier k. But this would lead to a hard combinatorial problem.
Joint Regularization Penalty

We will use the L1-∞ norm [Tropp 2006]:

R(W) = Σ_{i=1}^{d} max_k |W_{ik}|

This norm combines:
- An L1 norm over the per-row maxima (the maximum absolute value of each feature's coefficients across tasks), which promotes sparsity: use few features.
- An L∞ norm on each row, which promotes non-sparsity within a row: share features.

The combination results in a solution where only a few features are used, but the features that are used contribute to solving many classification problems.
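The penalty itself is one line to compute; a sketch (assuming NumPy, with W as the d×m coefficient matrix; the toy matrix is mine):

```python
import numpy as np

def l1_inf(W):
    """L1-inf norm: sum over feature rows of the max |coefficient| across tasks."""
    return np.abs(W).max(axis=1).sum()

# Only 2 of the 4 feature rows are non-zero, and each is shared by 3 tasks.
W = np.array([[0.5, -0.2, 0.4],
              [0.0,  0.0, 0.0],
              [0.3,  0.1, -0.3],
              [0.0,  0.0, 0.0]])
print(l1_inf(W))   # 0.5 + 0.3 = 0.8
```

Note how the zero rows contribute nothing: the penalty only charges for each feature once, however many tasks use it.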
Joint Sparse Approximation

Using the L1-∞ norm, we can rewrite our objective function as:

min_W Σ_{k=1}^{m} (1/|D_k|) Σ_{(x,y)∈D_k} l(f_k(x), y) + Q Σ_{i=1}^{d} max_k |W_{ik}|

For any convex loss this is a convex objective. For the hinge loss, l(f(x), y) = max(0, 1 - y f(x)), the optimization problem can be expressed as a linear program.
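For concreteness, the joint objective with the hinge loss can be evaluated as follows (a sketch assuming NumPy; the task data, names, and shapes are mine):

```python
import numpy as np

def hinge(scores, y):
    """Hinge loss l(f(x), y) = max(0, 1 - y f(x)), elementwise."""
    return np.maximum(0.0, 1.0 - y * scores)

def joint_objective(W, tasks, Q):
    """W: d x m coefficients; tasks: list of (X_k, y_k) pairs; Q: reg. weight."""
    loss = sum(hinge(X_k @ W[:, k], y_k).mean()
               for k, (X_k, y_k) in enumerate(tasks))
    return loss + Q * np.abs(W).max(axis=1).sum()   # + Q * L1-inf penalty

rng = np.random.default_rng(3)
tasks = [(rng.normal(size=(10, 5)), rng.choice([-1, 1], size=10))
         for _ in range(3)]
W = np.zeros((5, 3))
print(joint_objective(W, tasks, Q=0.1))   # at W = 0 every margin is violated: 3.0
```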
Joint Sparse Approximation

Linear program formulation (hinge loss). Objective:

min_{W, ε, t} Σ_{k=1}^{m} (1/|D_k|) Σ_{j=1}^{|D_k|} ε_j^k + Q Σ_{i=1}^{d} t_i

Max-value constraints: for k = 1 : m and i = 1 : d,

-t_i ≤ w_i^k ≤ t_i

Slack-variable constraints: for k = 1 : m and j = 1 : |D_k|,

y_j^k f_k(x_j^k) ≥ 1 - ε_j^k  and  ε_j^k ≥ 0
An Efficient Training Algorithm

The LP formulation can be optimized using standard LP solvers. It is feasible for small problems but becomes intractable for larger datasets with thousands of examples and dimensions. We might also want a more general optimization algorithm that can handle arbitrary convex losses. We developed a simple and efficient global optimization algorithm for training joint models with L1-∞ constraints, with total cost on the order of O(md log(md)).
Outline (next: Vision Applications)
Ten topics: SuperBowl, Danish Cartoons, Sharon, Australian Open, Trapped Miners, Golden Globes, Grammys, Figure Skating, Academy Awards, Iraq.

- Learn a representation using labeled data from 9 topics: learn the matrix W using our transfer algorithm.
- Define the set of relevant features to be R = {r : max_k |w_{rk}| > 0}.
- Train a classifier for the 10th held-out topic using only the relevant features R.
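Selecting the relevant feature set R from a learned W is then a single thresholding step; a sketch (assuming NumPy; the toy W is mine, standing in for the matrix learned on the 9 source topics):

```python
import numpy as np

# W: d x m matrix of coefficients learned on the source topics (toy example).
W = np.array([[0.7,  0.0, -0.1],
              [0.0,  0.0,  0.0],
              [0.2, -0.4,  0.0]])

# R = {r : max_k |w_rk| > 0} -- features with any non-zero coefficient.
R = np.flatnonzero(np.abs(W).max(axis=1) > 0)
print(R)   # [0 2]
```

The held-out topic's classifier would then be trained on the columns of the data indexed by R only.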
Results

[Figure: asymmetric transfer, average AUC vs. number of training samples (4 to 140), comparing the baseline representation with the transferred representation; AUC values range from about 0.52 to 0.72.]
Future Directions

- Joint sparsity regularization to control inference time.
- Learning representations for ranking problems.