Learning Data Representations with “Partial Supervision”
Ariadna Quattoni
Outline
Motivation: Low dimensional representations. Principal Component Analysis. Structural Learning. Vision Applications. NLP Applications. Joint Sparsity. Vision Applications.
Semi-Supervised Learning
Core task: learn a function F: X → Y from the "raw" feature space X ⊆ R^d to the output space Y = {−1, +1}.
Labeled dataset (small): T = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}
Classical setting: unlabeled dataset (large) U = {x_1, x_2, ..., x_u}
Partial supervision setting: partially labeled dataset (large) U = {(x_1, c_1), (x_2, c_2), ..., (x_u, c_u)}
Semi-Supervised Learning: Classical Setting
Unlabeled dataset → learn representation G: X → X'
Labeled dataset → train classifier F: G(X) → Y
X' ⊆ R^h with h ≪ d (dimensionality reduction)
G(x) = θx with θ ∈ R^{h×d}
Semi-Supervised Learning: Partial Supervision Setting
Unlabeled dataset + partial supervision → learn representation G: X → X'
Labeled dataset → train classifier F: G(X) → Y
X' ⊆ R^h with h ≪ d (dimensionality reduction)
G(x) = θx with θ ∈ R^{h×d}
Why is “learning representations” useful?
Infer the intrinsic dimensionality of the data.
Learn the “relevant” dimensions.
Infer the hidden structure.
Example: Hidden Structure
20 symbols: S = {s_1, s_2, ..., s_20}
4 topics: T_1 = {s_1, ..., s_5}, T_2 = {s_6, ..., s_10}, T_3 = {s_11, ..., s_15}, T_4 = {s_16, ..., s_20}
Generate a data point: choose a topic T, sample 3 symbols from T.
A subset of 3 symbols, e.g. {s_1, s_10, s_11}, gives the data point
x = [x_1 = 1/3, 0, ..., 0, x_10 = 1/3, x_11 = 1/3, 0, ..., x_20 = 0]
Data Covariance Matrix
[Figure: 20 × 20 data covariance matrix]
Example: Hidden Structure
G(x) = θx
Number of latent dimensions = 4. Map each x to the topic that generated it.
Projection matrix θ (4 × 20): row i has 1s on the 5 symbols of topic T_i and 0s elsewhere:
θ = [ 1 1 1 1 1 | 0 0 0 0 0 | 0 0 0 0 0 | 0 0 0 0 0
      0 0 0 0 0 | 1 1 1 1 1 | 0 0 0 0 0 | 0 0 0 0 0
      0 0 0 0 0 | 0 0 0 0 0 | 1 1 1 1 1 | 0 0 0 0 0
      0 0 0 0 0 | 0 0 0 0 0 | 0 0 0 0 0 | 1 1 1 1 1 ]
Data point (three symbols from one topic): x = [1/3, 1/3, 1/3, 0, ..., 0]^T
Latent representation (topic vector): z = θx, e.g. z = [1, 0, 0, 0]^T
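To make this example concrete, here is a small sketch (all names and details are illustrative, not from the slides) that generates data points by picking a topic and averaging three of its symbols, and then recovers the generating topic with the block-indicator projection matrix θ.

```python
import numpy as np

rng = np.random.default_rng(0)
n_symbols, n_topics, per_topic = 20, 4, 5

# Projection matrix theta (4 x 20): row i has 1s on the symbols of topic T_i.
theta = np.zeros((n_topics, n_symbols))
for i in range(n_topics):
    theta[i, i * per_topic:(i + 1) * per_topic] = 1.0

def generate_point():
    """Choose a topic T, sample 3 of its symbols, give each weight 1/3."""
    topic = rng.integers(n_topics)
    symbols = rng.choice(np.arange(topic * per_topic, (topic + 1) * per_topic),
                         size=3, replace=False)
    x = np.zeros(n_symbols)
    x[symbols] = 1.0 / 3.0
    return x, topic

# Latent representation: z = theta @ x is an indicator of the generating topic.
x, topic = generate_point()
print(topic, theta @ x)            # e.g. 2  [0. 0. 1. 0.]

# The data covariance has one block of correlated symbols per topic.
X = np.array([generate_point()[0] for _ in range(1000)])
cov = np.cov(X, rowvar=False)      # 20 x 20
```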
Outline
Motivation: Low dimensional representations. Principal Component Analysis. Structural Learning. Vision Applications. NLP Applications. Joint Sparsity. Vision Applications.
Classical Setting: Principal Component Analysis
Rows of θ as a 'basis': x' = Σ_{j=1}^{4} z_j r_j
Example generated by:
  T1: z = [1/3, 0, 0, 0]
  T2: z = [0, 1/3, 0, 0]
  T3: z = [0, 0, 1/3, 0]
  T4: z = [0, 0, 0, 1/3]
Low reconstruction error: ||x − x'||_2^2 = (1/3)^2 + (1/3)^2
Minimum Error Formulation
Approximate the high-dimensional x with a low-dimensional x':
  x̃_n = Σ_{i=1}^{m} z_{ni} u_i + Σ_{i=m+1}^{d} b_i u_i
Orthonormal basis: u_i^T u_j = 0 for i ≠ j
Error: J = (1/U) Σ_{n=1}^{U} ||x_n − x̃_n||^2
Solution (S is the data covariance): S u_i = λ_i u_i
Distortion: J = Σ_{i=m+1}^{d} λ_i
Principal Component Analysis: 2D Example
Projection error: with uncorrelated variables, u_i = e_i and λ_i = var(x_i).
Cut dimensions according to their variance; the variables must be correlated for this to help.
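A minimal numpy sketch of the formulation above (function and variable names are my own): form the data covariance, keep the top-m eigenvectors, and report the distortion as the sum of the discarded eigenvalues.

```python
import numpy as np

def pca(X, m):
    """X: (n, d) data matrix; m: number of latent dimensions to keep."""
    Xc = X - X.mean(axis=0)                   # center the data
    S = np.cov(Xc, rowvar=False)              # d x d data covariance
    eigvals, eigvecs = np.linalg.eigh(S)      # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]         # sort by decreasing variance
    U = eigvecs[:, order[:m]]                 # d x m orthonormal basis u_1..u_m
    Z = Xc @ U                                # latent coordinates z_n
    distortion = eigvals[order[m:]].sum()     # J = sum of discarded eigenvalues
    return U, Z, distortion
```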
Outline
Motivation: Low dimensional representations. Principal Component Analysis. Structural Learning. Vision Applications. NLP Applications. Joint Sparsity. Vision Applications.
Partial Supervision Setting [Ando & Zhang, JMLR 2005]
Unlabeled dataset + partial supervision → create auxiliary tasks → structure learning → G: X → X'
Partial Supervision Setting
Unlabeled data + partial supervision: Images with associated natural language captions. Video sequences with associated speech. Document + keywords
How could the partial supervision help? A hint for discovering important features. Use the partial supervision to define “auxiliary tasks”. Discover feature groupings that are useful for these tasks.
Sometimes the 'auxiliary tasks' can be defined from unlabeled data alone, e.g. an auxiliary task for word tagging is predicting substructures.
Auxiliary Tasks:
keywords: machine learning, dimensionality reduction
keywords: linear embedding, spectral methods, distance learning
keywords: object recognition, shape matching, stereo
machine learning papers
computer vision papers
Mask occurrences of the keywords.
Auxiliary task: predict "object recognition" from the document content.
Core task: is it a vision or a machine learning article?
Auxiliary Tasks
From the partially labeled dataset U = {(x_1, c_1), (x_2, c_2), ..., (x_u, c_u)}, define one auxiliary task per keyword k:
  D_k = {(x_1, y_1), (x_2, y_2), ..., (x_u, y_u)}
  y_i = +1 if keyword k appears in c_i, y_i = −1 otherwise
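As a concrete (hypothetical) version of this construction, the sketch below turns a partially labeled collection of (example, keyword set) pairs into one binary auxiliary dataset per keyword.

```python
def make_auxiliary_tasks(partially_labeled, keywords):
    """partially_labeled: list of (x, keyword_set) pairs, i.e. (x_i, c_i).

    Returns a dict mapping each keyword k to its auxiliary dataset D_k,
    a list of (x, y) pairs with y = +1 iff k appears in c_i.
    """
    return {
        k: [(x, 1 if k in c else -1) for x, c in partially_labeled]
        for k in keywords
    }
```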
Structure Learning
Learning with no prior knowledge: hypothesis space F = {f(w, x) = w'x}; the best hypothesis is learned from examples, f̂ = argmin_{f ∈ F} L(f, D_j).
Learning with prior knowledge: a restricted hypothesis space F(θ) = {f(v, x) = v'θx} inside F = {f(w, x) = w'x}.
Learning from auxiliary tasks: F(θ) is a hypothesis space learned from related tasks.
Learning Good Hypothesis Spaces
Class of linear predictors: f(v, x) = v^T θ x, where θ is an h × d matrix of structural parameters.
Goal: find the problem-specific parameters v_j and the shared θ that minimize the joint loss:
  min_{θ, v_1, ..., v_m} Σ_{j=1}^{m} [ L(v_j, θ, D_j) + reg(v_j) ] + reg(θ)
where L(v_j, θ, D_j) is the loss on training set D_j, the v_j are problem-specific parameters, and θ is shared across tasks.
Algorithm Step 1:
Train classifiers for auxiliary tasks.
  w*_j = argmin_w Σ_i l(f(w, x_i), y_i) + C ||w||_2^2
Algorithm Step 2: PCA on Classifier Coefficients
W = [w*_1, w*_2, ..., w*_m]
θ ∈ R^{h×d} is obtained by taking the first h eigenvectors of the covariance matrix W W^t.
This gives a linear subspace of dimension h: a good low-dimensional approximation to the space of coefficients.
Algorithm Step 3: Training on the Core Task
Project the data: q(x) = θx
  v* = argmin_v Σ_i l(f(v, q(x_i)), y_i) + C ||v||_2^2
Equivalent to training the core task in the original d-dimensional space with the parameter constraint: w_core = θ^t v*
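A hedged end-to-end sketch of the three steps (scikit-learn logistic regression stands in for the actual loss l; all names and defaults are illustrative): train one linear classifier per auxiliary task, run PCA on the stacked coefficient vectors to obtain θ, then train the core task on the projected features q(x) = θx.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def structural_learning(X_aux, aux_labels, X_core, y_core, h, C=1.0):
    """X_aux: (n, d) features for the auxiliary examples.
    aux_labels: list of m label vectors (one +/-1 vector per auxiliary task).
    X_core, y_core: labeled data for the core task.  h: latent dimension.
    """
    # Step 1: train a regularized linear classifier for each auxiliary task.
    W = np.array([
        LogisticRegression(C=C, max_iter=1000).fit(X_aux, y).coef_.ravel()
        for y in aux_labels
    ])                                              # m x d coefficient matrix

    # Step 2: PCA on the classifier coefficients; the rows of theta span the
    # h-dimensional subspace that best approximates the coefficient vectors.
    _, _, Vt = np.linalg.svd(W - W.mean(axis=0), full_matrices=False)
    theta = Vt[:h]                                  # h x d

    # Step 3: train the core task on the projected data q(x) = theta x.
    core = LogisticRegression(C=C, max_iter=1000).fit(X_core @ theta.T, y_core)
    w_core = theta.T @ core.coef_.ravel()           # equivalent d-dim parameters
    return theta, core, w_core
```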
Example
Object = { letter, letter, letter }
An object
abC
Example
The same object seen in a different font
Abc
Example
The same object seen in a different font
ABc
Example
The same object seen in a different font
abC
Example
Object "acE" is represented as a sparse indicator vector over the 30 font symbols, grouped by letter (A, B, C, ..., E):
1 0 0 0 0 | 0 0 0 0 0 | 0 0 1 0 0 | ... | 0 0 0 0 1
6 letters (topics), 5 fonts per letter (symbols): 30 symbols → 30 features.
20 words (objects), e.g. "ABC", "ADE", "BCF", "ABD", ...
Auxiliary task: recognize an object (word).
PCA on the data cannot recover the latent structure.
[Figure: 30 × 30 data covariance matrix]
PCA on the coefficients can recover the latent structure.
[Figure: weight matrix W of the auxiliary tasks. Rows: features, i.e. fonts (30); columns: the 20 auxiliary tasks. Each column holds the parameters for one object, e.g. "BCD"; the row blocks correspond to topics, i.e. letters.]
PCA on the coefficients can recover the latent structure.
Each block of correlated variables corresponds to a latent topic.
[Figure: 30 × 30 covariance of W; rows and columns are features, i.e. fonts.]
Outline
Motivation: Low dimensional representations. Principal Component Analysis. Structural Learning. Vision Applications. NLP Applications. Joint Sparsity. Vision Applications.
News domain
figure skating ice hockey golden globes
grammys
Dataset: news images from the Reuters web site. Problem: predicting news topics from images.
Learning visual representations using images with captions
The Italian team celebrate their gold medal win during the flower ceremony after the final round of the men's team pursuit speedskating at Oval Lingotto during the 2006 Winter Olympics.
Former U.S. President Bill Clinton speaks during a joint news conference with Pakistan's Prime Minister Shaukat Aziz at Prime Minister house in Islamabad.
Diana and Marshall Reed leave the funeral of miner David Lewis in Philippi, West Virginia on January 8, 2006. Lewis was one of 12 miners who died in the Sago Mine.
Senior Hamas leader Khaled Meshaal (2nd-R), is surrounded by his bodyguards after a news conference in Cairo February 8, 2006.
Jim Scherr, the US Olympic Committee's chief executive officer seen here in 2004, said his group is watching the growing scandal and keeping informed about the NHL's investigation into Rick Tocchet,
U.S. director Stephen Gaghan and his girlfriend Daniela Unruh arrive on the red carpet for the screening of his film 'Syriana' which runs out of competition at the 56th Berlinale International Film Festival.
Auxiliary task: predict "team" from the image content.
Learning visual topics
The word 'games' might contain the visual topics: medals, people, pavement.
The word 'demonstrations' might contain the visual topics: people, …
Auxiliary tasks share visual topics.
Different words can share topics. Each topic can be observed under different appearances.
Experimental Results
Outline
Motivation: Low dimensional representations. Principal Component Analysis. Structural Learning. Vision Applications. NLP Applications. Joint Sparsity. Vision Applications.
Chunking
Jane lives in New York and works for Bank of New York.  (Jane: PER, New York: LOC, Bank of New York: ORG)
But economists in Europe failed to predict that …  (chunks: NP, PP, NP, VP, SBAR)
• Named entity chunking
• Syntactic chunking
Data points: word occurrences Labels: Begin-PER, Inside-PER, Begin-LOC, …, Outside
Example input vector representation
… lives in New York …
For the word "in":  curr-"in" = 1, left-"lives" = 1, right-"New" = 1
For the word "New": curr-"New" = 1, left-"in" = 1, right-"York" = 1
Input vector x: high-dimensional vectors; most entries are 0.
Algorithmic Procedure
1. Create m auxiliary problems.
2. Assign auxiliary labels to the unlabeled data.
3. Compute θ (the shared structure) by joint empirical risk minimization over all the auxiliary problems.
4. Fix θ, and minimize the empirical risk on the labeled data for the target task.
Predictor: f(x) = w^T x + v^T θ x  (θx provides additional features)
Example auxiliary problems
Auxiliary labels: Is the current word "New"? Is the current word "day"? Is the current word "IBM"? Is the current word "computer"? …
Split the input vector into region 1 (the current word) and region 2 (the left and right words).
Predict region 1 from region 2, compute the shared θ, and add θ·(region 2) as new features.
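A small, hypothetical sketch of how such auxiliary problems can be generated from unlabeled text: each problem predicts whether the current word is one of the most frequent words, using only the context features (region 2), so the learned θ can later supply new features to the tagger.

```python
from collections import Counter

def make_word_aux_problems(windows, n_problems=1000):
    """windows: list of (left_word, current_word, right_word) tuples from
    unlabeled text.  Returns one auxiliary problem per frequent word:
    (target_word, [(context_features, label), ...]), where the label says
    whether the current word equals target_word and the features come only
    from the left/right context (region 2)."""
    counts = Counter(curr for _, curr, _ in windows)
    problems = []
    for target, _ in counts.most_common(n_problems):
        examples = [({"left=" + l: 1, "right=" + r: 1}, 1 if c == target else -1)
                    for l, c, r in windows]
        problems.append((target, examples))
    return problems
```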
Experiments (CoNLL-03 named entity)
4 classes: LOC, ORG, PER, MISC
Labeled data: news documents; 204K words (English), 206K words (German)
Unlabeled data: 27M words (English), 35M words (German)
Features: a slight modification of ZJ03.
Words, POS, char types, 4 chars at the beginning/ending in a 5-word window; words in a 3-chunk window; labels assigned to two words on the left, bi-gram of the current word and left label; labels assigned to previous occurrences of the current word.
No gazetteer. No hand-crafted resources.
Auxiliary problems

# of aux. problems   Auxiliary labels   Features used for learning auxiliary problems
1000                 Previous words     All but previous words
1000                 Current words      All but current words
1000                 Next words         All but next words

3000 auxiliary problems in total.
Syntactic chunking results (CoNLL-00)

method      description                 F-measure
supervised  baseline                    93.60
ASO-semi    +Unlabeled data             94.39 (+0.79%)
Co/self     oracle +Unlabeled data      93.66
KM01        SVM combination             93.91
CM03        Perceptron in two layers    93.74
ZDJ02       Reg. Winnow                 93.57
ZDJ02+      +full parser (ESG) output   94.17

Exceeds previous best systems.
Other experiments
Confirmed effectiveness on: POS tagging and text categorization (2 standard corpora).
Outline
Motivation: Low dimensional representations. Principal Component Analysis. Structural Learning. Vision Applications. NLP Applications. Joint Sparsity. Vision Applications.
Notation
Collection of tasks: D = {D_1, D_2, ..., D_m}
Each task: D_k = {(x_1^k, y_1^k), ..., (x_{n_k}^k, y_{n_k}^k)}, with x ∈ R^d and y ∈ {−1, +1}.
Joint sparse approximation: W is the d × m matrix whose k-th column contains the parameters w_k of task k:
W = [ w_{1,1}  w_{1,2}  ...  w_{1,m}
      w_{2,1}  w_{2,2}  ...  w_{2,m}
      ...
      w_{d,1}  w_{d,2}  ...  w_{d,m} ]
Single Task Sparse Approximation
Consider learning a single sparse linear classifier of the form: f(x) = w · x
We want only a few features with non-zero coefficients.
Recent work suggests using L1 regularization:
  w* = argmin_w Σ_{(x,y) ∈ D} l(f(x), y) + Q Σ_{j=1}^{d} |w_j|
The first term is the classification error; the L1 penalty penalizes non-sparse solutions.
Donoho [2004] proved (in a regression setting) that the solution with smallest L1 norm is also the sparsest solution.
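For instance, such a sparse linear classifier can be obtained with an off-the-shelf L1-regularized logistic regression (a stand-in for the generic loss l above; here C plays roughly the role of 1/Q, and the data are synthetic).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
# Only features 0 and 3 matter in this toy problem.
y = np.where(X[:, 0] - X[:, 3] + 0.1 * rng.normal(size=200) > 0, 1, -1)

# The L1 penalty drives most coefficients exactly to zero.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
clf.fit(X, y)
print("non-zero coefficients:", np.count_nonzero(clf.coef_))
```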
Joint Sparse Approximation
Setting: each task k uses a linear classifier f_k(x) = w_k · x.
  argmin_{w_1, ..., w_m} Σ_{k=1}^{m} (1/|D_k|) Σ_{(x,y) ∈ D_k} l(f_k(x), y) + Q R(w_1, ..., w_m)
The first term is the average loss on each training set D_k; the regularizer R(·) penalizes solutions that utilize too many features.
Joint Regularization Penalty
How do we penalize solutions that use too many features?
  R(W) = number of non-zero rows of W
where row j of W holds the coefficients for feature j across all classifiers and column k holds the coefficients of classifier k.
This penalty would lead to a hard combinatorial problem.
Joint Regularization Penalty
We will use an L1-∞ norm [Tropp 2006]:
  R(W) = Σ_{i=1}^{d} max_k |W_{i,k}|
This norm combines:
  An L1 norm on the maximum absolute values of the coefficients across tasks, which promotes sparsity: use few features.
  An L∞ norm on each row, which promotes non-sparsity within a row: share features.
The combination of the two norms results in a solution where only a few features are used, but the features that are used contribute to solving many classification problems.
Joint Sparse Approximation
Using the L1-∞ norm, we can rewrite our objective function as:
  min_W Σ_{k=1}^{m} (1/|D_k|) Σ_{(x,y) ∈ D_k} l(f_k(x), y) + Q Σ_{i=1}^{d} max_k |W_{i,k}|
For any convex loss this is a convex objective.
For the hinge loss, l(f(x), y) = max(0, 1 − y f(x)), the optimization problem can be expressed as a linear program.
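A minimal numpy sketch (names are illustrative) that evaluates this objective for a given W: the average hinge loss per task plus the L1-∞ penalty on the rows of W.

```python
import numpy as np

def joint_objective(W, tasks, Q):
    """W: (d, m) matrix whose k-th column holds the parameters of task k.
    tasks: list of (X_k, y_k), with X_k of shape (n_k, d) and y_k in {-1, +1}.
    Q: regularization constant."""
    loss = 0.0
    for k, (Xk, yk) in enumerate(tasks):
        margins = yk * (Xk @ W[:, k])
        loss += np.mean(np.maximum(0.0, 1.0 - margins))   # average hinge loss
    penalty = np.abs(W).max(axis=1).sum()                 # sum of row maxima (L1-inf)
    return loss + Q * penalty
```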
Joint Sparse Approximation
Linear program formulation (hinge loss):
Objective:
  min_{W, ε, t} Σ_{k=1}^{m} (1/|D_k|) Σ_{j=1}^{|D_k|} ε_j^k + Q Σ_{i=1}^{d} t_i
Max value constraints:
  for k = 1:m and i = 1:d:  −t_i ≤ w_{k,i} ≤ t_i
Slack variable constraints:
  for k = 1:m and j = 1:|D_k|:  y_j^k f_k(x_j^k) ≥ 1 − ε_j^k  and  ε_j^k ≥ 0
The LP formulation is feasible for small problems but becomes intractable for larger data-sets with thousands of examples and dimensions.
We might want a more general optimization algorithm that can handle arbitrary convex losses.
An efficient training algorithm
The LP formulation can be optimized using standard LP solvers.
We developed a simple and efficient global optimization algorithm for training joint models with L1-∞ constraints.
The total cost is of the order of O(d m log(d m)).
Outline
Motivation: Low dimensional representations. Principal Component Analysis. Structural Learning. Vision Applications. NLP Applications. Joint Sparsity. Vision Applications.
Topics: SuperBowl, Danish Cartoons, Sharon, Australian Open, Trapped Miners, Golden Globes, Grammys, Figure Skating, Academy Awards, Iraq.
Learn a representation using labeled data from 9 topics: learn the matrix W using our transfer algorithm.
Define the set of relevant features to be: R = {r : max_k |W_{r,k}| > 0}
Train a classifier for the 10th, held-out topic using only the relevant features R.
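Given the learned matrix W (columns: the 9 training topics), the relevant-feature set R is just the rows whose largest absolute coefficient is non-zero; a short numpy version (a small tolerance stands in for exact zero):

```python
import numpy as np

def relevant_features(W, tol=1e-8):
    """Return the indices r with max_k |W[r, k]| > 0 (up to a tolerance)."""
    return np.flatnonzero(np.abs(W).max(axis=1) > tol)
```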
Results
[Figure: Asymmetric transfer. Average AUC vs. # of training samples, comparing the baseline representation with the transferred representation.]
Future Directions
Joint Sparsity Regularization to control inference time.
Learning representations for ranking problems.