
Page 1: Learning with Limited Supervision by Input and Output Coding

Learning with Limited Supervision by Input and Output Coding

Yi Zhang

Machine Learning Department, Carnegie Mellon University

April 30th, 2012

1

Page 2: Learning with Limited Supervision by Input and Output Coding

Thesis Committee

Jeff Schneider (Chair)
Geoff Gordon
Tom Mitchell
Xiaojin (Jerry) Zhu, University of Wisconsin-Madison

2

Page 3: Learning with Limited Supervision by Input and Output Coding

Introduction

Learning a prediction system, usually based on examples

Training examples are usually limited
Cost of obtaining high-quality examples
Complexity of the prediction problem

3

[Diagram: learn a mapping X → Y from training examples (x1, y1), …, (xn, yn)]

Page 4: Learning with Limited Supervision by Input and Output Coding

Introduction

Solution: exploit extra information about the input and output space
Improve the prediction performance
Reduce the cost of collecting training examples

4


Page 5: Learning with Limited Supervision by Input and Output Coding

Introduction

Solution: exploit extra information about the input and output space
Representation and discovery?
Incorporation?

5


Page 6: Learning with Limited Supervision by Input and Output Coding

Outline

6

Part I: Encoding Input Information by Regularization

Learning with word correlation

A matrix-normal penalty for multi-task learning

Learn compressible models

Projection penalties

Part II: Encoding Output Information by Output Codes

Composite likelihood for pairwise coding

Multi-label output codes with CCA

Maximum-margin output coding

Page 7: Learning with Limited Supervision by Input and Output Coding

Regularization

The general formulation

Ridge regression

Lasso

7
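In standard notation, the three objectives are:

\min_{w} \sum_{i=1}^{n} L(y_i, w^\top x_i) + \lambda\, R(w), \qquad R(w) = \|w\|_2^2 \;(\text{ridge regression}), \qquad R(w) = \|w\|_1 \;(\text{lasso}).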

Page 8: Learning with Limited Supervision by Input and Output Coding

Outline

8

Part I: Encoding Input Information by Regularization

Learning with word correlation

A matrix-normal penalty for multi-task learning

Learn compressible models

Projection penalties

Part II: Encoding Output Information by Output Codes

Composite likelihood for pairwise coding

Multi-label output codes with CCA

Maximum-margin output coding

Page 9: Learning with Limited Supervision by Input and Output Coding

Learning with unlabeled text

For a text classification task: plenty of unlabeled text on the Web, seemingly unrelated to the task
What can we gain from such unlabeled text?

9

Yi Zhang, Jeff Schneider and Artur Dubrawski. Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text. NIPS 2008

Page 10: Learning with Limited Supervision by Input and Output Coding

A motivating example for text learning

Humans learn text classification effectively! Two training examples:

+: [gasoline, truck] -: [vote, election]

Query: [gallon, vehicle]

Seems very easy! But why?

10

Page 11: Learning with Limited Supervision by Input and Output Coding

A motivating example for text learning

Humans learn text classification effectively! Two training examples:

+: [gasoline, truck] -: [vote, election]

Query: [gallon, vehicle]

Seems very easy! But why? Gasoline ~ gallon, truck ~ vehicle

11

Page 12: Learning with Limited Supervision by Input and Output Coding

A covariance operator for regularization

Covariance structure of model coefficients
Usually unknown; can it be learned from unlabeled text?

12
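As a sketch (the exact penalty in the thesis may differ in details), a covariance operator \Sigma over the model coefficients enters the regularizer as a quadratic form, equivalent to a zero-mean Gaussian prior with covariance \Sigma:

\min_{w} \sum_{i} L(y_i, w^\top x_i) + \lambda\, w^\top \Sigma^{-1} w.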

Page 13: Learning with Limited Supervision by Input and Output Coding

Learning with unlabeled text

Infer the covariance operator:
Extract latent topics from unlabeled text (with resampling)
Observe the contribution of words in each topic
[gas: 0.3, gallon: 0.2, truck: 0.2, safety: 0.2, …]
Estimate the correlation (covariance) of words

13

Page 14: Learning with Limited Supervision by Input and Output Coding

Learning with unlabeled text

Infer the covariance operator:
Extract latent topics from unlabeled text (with resampling)
Observe the contribution of words in each topic
[gas: 0.3, gallon: 0.2, truck: 0.2, safety: 0.2, …]
Estimate the correlation (covariance) of words
For a new task, we learn with regularization

14
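A minimal Python sketch of this pipeline, assuming a topic-word contribution matrix (e.g., from LDA on the unlabeled text) is already available; the function names and the closed-form least-squares solver are illustrative choices, not the exact procedure from the thesis.

```python
import numpy as np

def word_covariance_from_topics(topic_word, eps=1e-3):
    """topic_word: (num_topics, num_words) matrix of word contributions
    per latent topic, e.g., extracted from unlabeled text."""
    # Treat each topic as one observation of word co-occurrence behavior.
    sigma = np.cov(topic_word, rowvar=False)
    # Ridge the diagonal so the covariance is invertible.
    return sigma + eps * np.eye(sigma.shape[0])

def covariance_regularized_least_squares(X, y, sigma, lam=1.0):
    """Solve min_w ||Xw - y||^2 + lam * w^T sigma^{-1} w in closed form."""
    sigma_inv = np.linalg.inv(sigma)
    return np.linalg.solve(X.T @ X + lam * sigma_inv, X.T @ y)

# Toy usage with random data standing in for real documents.
rng = np.random.default_rng(0)
topics = rng.random((50, 20))            # 50 topics over a 20-word vocabulary
X, y = rng.random((10, 20)), rng.choice([-1.0, 1.0], 10)
w = covariance_regularized_least_squares(X, y, word_covariance_from_topics(topics))
```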

Page 15: Learning with Limited Supervision by Input and Output Coding

Experiments

Empirical results on 20 newsgroups
190 1-vs-1 classification tasks, 2% labeled examples
For any task, the majority of the unlabeled text (18 of 20 newsgroups) is irrelevant

Similar results with logistic regression and least squares

15

[1] V. Sindhwani and S. Keerthi. Large scale semi-supervised linear svms. In SIGIR, 2006

Page 16: Learning with Limited Supervision by Input and Output Coding

Outline

16

Part I: Encoding Input Information by Regularization

Learning with word correlation

A matrix-normal penalty for multi-task learning

Multi-task generalization

Learn compressible models

Projection penalties

Part II: Encoding Output Information by Output Codes

Composite likelihood for pairwise coding

Multi-label output codes with CCA

Maximum-margin output coding

Page 17: Learning with Limited Supervision by Input and Output Coding

Multi-task learning

Different but related prediction tasks
An example: landmine detection using radar images
Multiple tasks: different landmine fields (geographic conditions, landmine types)
Goal: information sharing among tasks

17

Page 18: Learning with Limited Supervision by Input and Output Coding

Regularization for multi-task learning

18

Our approach: view MTL as estimating a parameter matrix W (rows: tasks, columns: feature dimensions)

Page 19: Learning with Limited Supervision by Input and Output Coding

Regularization for multi-task learning

19

Our approach: view MTL as estimating a parameter matrix W
For a vector w, a covariance-based penalty corresponds to a Gaussian prior
For a matrix W, what is the analogous covariance operator for regularization?

Yi Zhang and Jeff Schneider. Learning Multiple Tasks with a Sparse Matrix-Normal Penalty. NIPS 2010

Page 20: Learning with Limited Supervision by Input and Output Coding

Matrix-normal distributions

Consider a 2-by-3 matrix W:
The full covariance of vec(W) is the Kronecker product of the row covariance and the column covariance

20


Page 21: Learning with Limited Supervision by Input and Output Coding

Matrix-normal distributions

Consider a 2-by-3 matrix W:
The full covariance of vec(W) is the Kronecker product of the row covariance and the column covariance
The matrix-normal density offers a compact form for this Kronecker-structured covariance

21

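In standard notation, a matrix-normal prior W \sim \mathcal{MN}(0, \Omega_r, \Omega_c) with row covariance \Omega_r and column covariance \Omega_c means \mathrm{vec}(W) \sim \mathcal{N}(0, \Omega_c \otimes \Omega_r), and the corresponding log-prior penalty is

\mathrm{tr}\big(\Omega_r^{-1}\, W\, \Omega_c^{-1}\, W^\top\big),

so the full Kronecker covariance never needs to be formed explicitly.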

Page 22: Learning with Limited Supervision by Input and Output Coding

Joint learning of multiple tasks

Learning with a matrix-normal penalty (a matrix-normal prior on W)
Joint learning via alternating optimization

22

Page 23: Learning with Limited Supervision by Input and Output Coding

Joint learning of multiple tasks

Learning with a matrix-normal penalty (a matrix-normal prior on W)
Alternating optimization
Other recent work as variants of special cases:
Multi-task feature learning [Argyriou et al., NIPS 06]: learning with the feature covariance
Clustered multi-task learning [Jacob et al., NIPS 08]: learning with the task covariance and spectral constraints
Multi-task relationship learning [Zhang et al., UAI 10]: learning with the task covariance

23
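A sketch of the alternating scheme, assuming the penalized objective takes the matrix-normal form above:

\min_{W,\,\Omega_r,\,\Omega_c}\; \sum_{t}\sum_{i} L\big(y_{ti},\, W_t^\top x_{ti}\big) + \lambda\, \mathrm{tr}\big(\Omega_r^{-1} W \Omega_c^{-1} W^\top\big),

alternating between (i) fixing \Omega_r, \Omega_c and solving a quadratically regularized problem for W, and (ii) fixing W and updating \Omega_r \propto W \Omega_c^{-1} W^\top and \Omega_c \propto W^\top \Omega_r^{-1} W (flip-flop covariance updates).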

Page 24: Learning with Limited Supervision by Input and Output Coding

Sparse covariance selection

Sparse covariance selection in matrix-normal penalties
Sparsity of the inverse row and column covariances corresponds to conditional independence among rows (tasks) and columns (feature dimensions) of W

24

Page 25: Learning with Limited Supervision by Input and Output Coding

Sparse covariance selection

Sparse covariance selection in matrix-normal penalties
Sparsity of the inverse row and column covariances corresponds to conditional independence among rows (tasks) and columns (feature dimensions) of W
Alternating optimization:
Estimating W: same as before
Estimating the row and column covariances: L1-penalized covariance estimation

25
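In standard form, L1-penalized covariance estimation solves

\hat{\Omega}^{-1} = \arg\min_{\Theta \succ 0}\; \mathrm{tr}(S\,\Theta) - \log\det\Theta + \rho\, \|\Theta\|_1,

where S is the sample covariance built from the rows (or columns) of the current W; the L1 term zeroes out entries of the inverse covariance, i.e., enforces the conditional independencies above.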

Page 26: Learning with Limited Supervision by Input and Output Coding

Results on multi-task learning
Landmine detection: multiple landmine fields

Face recognition: multiple 1-vs-1 tasks

26

[1] Jacob, Bach, and Vert. Clustered multi-task learning: A convex formulation. NIPS, 2008
[2] Argyriou, Evgeniou, and Pontil. Multi-task feature learning. NIPS, 2006

[Bar charts: 1-AUC with 30 samples per task (landmine detection) and classification error with 5 samples per subject (face recognition), comparing STL, MTL_Clust [1], MTL_Feat [2], and the proposed method.]

Page 27: Learning with Limited Supervision by Input and Output Coding

Outline

27

Part I: Encoding Input Information by Regularization

Learning with word correlation

A matrix-normal penalty for multi-task learning

Multi-task generalization

Go beyond covariance and correlation structures

Learn compressible models

Projection penalties

Part II: Encoding Output Information by Output Codes

Composite likelihood for pairwise coding

Multi-label output codes with CCA

Maximum-margin output coding

Page 28: Learning with Limited Supervision by Input and Output Coding

Learning compressible models

Learning compressible models

A compression operator P inside the penalty, instead of penalizing the raw coefficients
Bias: model compressibility

28

Yi Zhang, Jeff Schneider and Artur Dubrawski. Learning Compressible Models. SDM 2010
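A sketch of the penalty, assuming the later slides' notation where Pw denotes the compressed coefficients:

\min_{w}\; \sum_i L(y_i, w^\top x_i) + \lambda\, \|P w\|_1,

i.e., sparsity is imposed after applying the compression operator P (e.g., a 2D-DCT for energy compaction, or a differencing operator for local smoothness) rather than on w directly.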

Page 29: Learning with Limited Supervision by Input and Output Coding

Energy compaction

Image energy is concentrated at a few frequencies

29

JPEG (2D-DCT), 46 : 1 compression

Page 30: Learning with Limited Supervision by Input and Output Coding

Energy compaction

Image energy is concentrated at a few frequencies
Models need to operate at the relevant frequencies

30

JPEG (2D-DCT), 46 : 1 compression

2D-DCT

Page 31: Learning with Limited Supervision by Input and Output Coding

Sparse vs. compressible

Model coefficients w

Digit recognition:

31

[Figure panels: coefficients w, compressed coefficients Pw, and coefficients w shown as an image, each comparing sparse vs. compressible models.]

Page 32: Learning with Limited Supervision by Input and Output Coding

Outline

32

Part I: Encoding Input Information by Regularization

Learning with word correlation

A matrix-normal penalty for multi-task learning

Multi-task generalization

Go beyond covariance and correlation structures
Encode a dimension reduction
Learn compressible models
Projection penalties

Part II: Encoding Output Information by Output Codes

Composite likelihood for pairwise coding

Multi-label output codes with CCA

Maximum-margin output coding

Page 33: Learning with Limited Supervision by Input and Output Coding

Dimension reduction

Dimension reduction conveys information about the input space
Feature selection: importance
Feature clustering: granularity
Feature extraction: more general structures

33

Page 34: Learning with Limited Supervision by Input and Output Coding

How to use a dimension reduction?

However, any reduction loses certain information that may be relevant to a prediction task
Goal of projection penalties:
Encode useful information from a dimension reduction
Control the risk of potential information loss

34

Yi Zhang and Jeff Schneider. Projection Penalty: Dimension Reduction without Loss. ICML 2010

Page 35: Learning with Limited Supervision by Input and Output Coding

Projection penalties: the basic idea

The basic idea:
Observation: reducing the feature space restricts the model search to a model subspace M_P
Solution: still search in the full model space M, and penalize the projection distance to the model subspace M_P

35

Page 36: Learning with Limited Supervision by Input and Output Coding

Projection penalties: linear cases
Learn with a (linear) dimension reduction P

36

Page 37: Learning with Limited Supervision by Input and Output Coding

Projection penalties: linear cases

Learn with projection penalties

Optimization:

37

projection distance
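A sketch of the linear-case objective, consistent with the "projection distance" term above:

\min_{w}\; \sum_i L\big(y_i, w^\top x_i\big) + \lambda\, \min_{v}\, \|\, w - P^\top v \,\|_2^2,

i.e., the model w is still searched in the full space, but its distance to the subspace of models reachable through the reduction P is penalized.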

Page 38: Learning with Limited Supervision by Input and Output Coding

Projection penalties: nonlinear cases

38
Yi Zhang and Jeff Schneider. Projection Penalty: Dimension Reduction without Loss. ICML 2010

[Diagram: feature maps Φ and Φ' from the input space X into R^d and R^p, a reduction P between them, and model spaces M and M_P with models w and w_P; how to project in the nonlinear case is the open question.]

Page 39: Learning with Limited Supervision by Input and Output Coding

Projection penalties: nonlinear cases

39
Yi Zhang and Jeff Schneider. Projection Penalty: Dimension Reduction without Loss. ICML 2010

[Diagram: repeats the previous slide's construction.]

Page 40: Learning with Limited Supervision by Input and Output Coding

Empirical results
Text classification (20 newsgroups), using logistic regression
Dimension reduction: latent Dirichlet allocation

40

[Bar chart: classification error for Original, Reduction, and Projection Penalty at 2%, 5%, and 10% training data.]

Page 41: Learning with Limited Supervision by Input and Output Coding

Empirical results
Text classification (20 newsgroups), using logistic regression
Dimension reduction: latent Dirichlet allocation

41

[Bar chart: classification error for Original, Reduction, and Projection Penalty at 2%, 5%, and 10% training data.]

Similar results on face recognition, using SVM (poly-2); dimension reduction: KPCA, KDA, OLaplacianFace
Similar results on house price prediction, using regression; dimension reduction: PCA and partial least squares

Page 42: Learning with Limited Supervision by Input and Output Coding

Outline

42

Part II: Encoding Output Information by Output Codes

Composite likelihood for pairwise coding

Multi-label output codes with CCA

Maximum-margin output coding

Part I: Encoding Input Information by Regularization

Learning with word correlation

A matrix-normal penalty for multi-task learning

Multi-task generalization

Go beyond covariance and correlation structures
Encode a dimension reduction
Learn compressible models
Projection penalties

Page 43: Learning with Limited Supervision by Input and Output Coding

Outline

43

Part II: Encoding Output Information by Output Codes

Composite likelihood for pairwise coding

Multi-label output codes with CCA

Maximum-margin output coding

Part I: Encoding Input Information by Regularization

Learning with word correlation

A matrix-normal penalty for multi-task learning

Multi-task generalization

Go beyond covariance and correlation structures
Encode a dimension reduction
Learn compressible models
Projection penalties

Page 44: Learning with Limited Supervision by Input and Output Coding

Multi-label classification

Multi-label classification

Existence of certain label dependency
Example: classify an image into scenes (deserts, river, forest, etc.)
The multi-class problem is a special case: only one class is true

44

[Diagram: learn to predict labels y1, y2, …, yq from x; label dependency exists among the outputs.]

Page 45: Learning with Limited Supervision by Input and Output Coding

Output coding

d < q: compression, i.e., source coding
d > q: error-correcting codes, i.e., channel coding
Use the redundancy to correct prediction (“transmission”) errors

45

[Diagram: encode the labels y1 … yq into a code z1, z2, …, zd, learn to predict z from x, then decode z back to y.]

Page 46: Learning with Limited Supervision by Input and Output Coding

Error-correcting output codes (ECOCs)

Multi-class ECOCs [Dietterich & Bakiri, 1994] [Allwein, Schapire & Singer, 2001]
Encode into a (redundant) set of binary problems
Learn to predict the code
Decode the predictions
Our goal: design ECOCs for multi-label classification

46

[Diagram: labels y1 … yq encoded into binary subproblems z1, z2, …, zt such as y2 vs. y3 or {y3, y4} vs. y7; learn to predict the code from x, then decode.]

Page 47: Learning with Limited Supervision by Input and Output Coding

Outline

47

Part II: Encoding Output Information by Output Codes

Composite likelihood for pairwise coding

Multi-label output codes with CCA

Maximum-margin output coding

Part I: Encoding Input Information by Regularization

Learning with word correlation

A matrix-normal penalty for multi-task learning

Multi-task generalization

Go beyond covariance and correlation structures
Encode a dimension reduction
Learn compressible models
Projection penalties

Page 48: Learning with Limited Supervision by Input and Output Coding

Composite likelihood

The composite likelihood (CL): a partial specification of the likelihood as the product of simple component likelihoods
e.g., pairwise likelihood
e.g., full conditional likelihood
Estimation using composite likelihoods:
Computational and statistical efficiency
Robustness under model misspecification

48
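In standard notation, the two example component-likelihood choices are

\text{pairwise: } CL(y \mid x) = \prod_{i < j} p(y_i, y_j \mid x), \qquad \text{full conditional: } CL(y \mid x) = \prod_{i} p(y_i \mid y_{-i}, x).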

Page 49: Learning with Limited Supervision by Input and Output Coding

Multi-label problem decomposition

Problem decomposition methods
Decomposition into subproblems (encoding)
Decision making by combining subproblem predictions (decoding)
Examples: 1-vs-all, 1-vs-1, 1-vs-1 + 1-vs-all, etc.

49


Page 50: Learning with Limited Supervision by Input and Output Coding

1-vs-All (Binary Relevance)

Classify each label independently

The composite likelihood view

50

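The 1-vs-all decomposition corresponds to the composite likelihood built from the single-label marginals,

CL(y \mid x) = \prod_{k=1}^{q} p(y_k \mid x),

which ignores all label dependencies.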

Page 51: Learning with Limited Supervision by Input and Output Coding

Pairwise label ranking [1]

1-vs-1 method (a.k.a. pairwise label ranking)
Subproblems: pairwise label comparisons
Decision making: label ranking by counting the number of winning comparisons, then thresholding

51

[Diagram: learn pairwise comparisons y1 vs. y2, y1 vs. y3, …, yq-1 vs. yq from x to rank the labels y1 … yq.]

[1] Hullermeier et. al. Artif. Intell., 2008

Page 52: Learning with Limited Supervision by Input and Output Coding

Pairwise label ranking [1]

1-vs-1 method (a.k.a. pairwise label ranking)
Subproblems: pairwise label comparisons
Decision making: label ranking by counting the number of winning comparisons, then thresholding

The composite likelihood view

52


[1] Hullermeier et. al. Artif. Intell., 2008

Page 53: Learning with Limited Supervision by Input and Output Coding

Calibrated label ranking [2]

1-vs-1 + 1-vs-all (a.k.a. calibrated label ranking)
Subproblems: 1-vs-1 + 1-vs-all
Decision making: label ranking, with a smart thresholding based on the 1-vs-1 and 1-vs-all predictions

53


[2] Furnkranz et. al. MLJ, 2008

Page 54: Learning with Limited Supervision by Input and Output Coding

Calibrated label ranking [2]

1-vs-1 + 1-vs-all (a.k.a. calibrated label ranking)
Subproblems: 1-vs-1 + 1-vs-all
Decision making: label ranking, with a smart thresholding based on the 1-vs-1 and 1-vs-all predictions

The composite likelihood view

54


[2] Furnkranz et. al. MLJ, 2008

Page 55: Learning with Limited Supervision by Input and Output Coding

A composite likelihood view

A composite likelihood view of problem decomposition
Choice of subproblems: specification of a composite likelihood?
Decision making: inference on the composite likelihood?

55


Page 56: Learning with Limited Supervision by Input and Output Coding

A composite pairwise coding

Subproblems: individual and pairwise label densities
A pairwise label density conveys more information than a pairwise comparison

56

Yi Zhang and Jeff Schneider. A Composite Likelihood View for Multi-Label Classification. AISTATS 2012

[Tables: the four joint configurations (yi, yj) in {0,1} × {0,1}, shown for the two quantities being compared.]

Page 57: Learning with Limited Supervision by Input and Output Coding

A composite pairwise coding

Decision making: a robust mean-field approximation

is not robust to underestimation of label densities

57

Yi Zhang and Jeff Schneider. A Composite Likelihood View for Multi-Label Classification. AISTATS 2012

Page 58: Learning with Limited Supervision by Input and Output Coding

A composite pairwise coding

Decision making: a robust mean-field approximation

is not robust to underestimation of label densities

A composite divergence, robust and efficient to optimize

58

Yi Zhang and Jeff Schneider. A Composite Likelihood View for Multi-Label Classification. AISTATS 2012

Page 59: Learning with Limited Supervision by Input and Output Coding

Data sets

The Scene data: image scenes (beach, sunset, fall foliage, field, mountain, and urban)
Example image labels: beach, urban

59

[Boutell et. al., Pattern Recognition 2004]

Page 60: Learning with Limited Supervision by Input and Output Coding

Data sets

The Emotion data: music emotions (amazed, happy, relaxed, sad, etc.)
The Medical data: clinical text, labeled with medical categories (ICD-9-CM codes)
The Yeast data: gene functional categories
The Enron data: email tags on topics, attachment types, and emotional tones

60

Page 61: Learning with Limited Supervision by Input and Output Coding

Empirical results

Similar results on other data sets (emotions, medical, etc)

61

[1] Hullermeier et al. Label ranking by learning pairwise preferences. Artif. Intell., 2008
[2] Furnkranz et al. Multi-label classification via calibrated label ranking. MLJ, 2008
[3] Read et al. Classifier chains for multi-label classification. ECML, 2009
[4] Tsoumakas et al. Random k-labelsets: an ensemble method for multilabel classification. ECML, 2007
[5] Zhang et al. Multi-label learning by exploiting label dependency. KDD, 2010

Page 62: Learning with Limited Supervision by Input and Output Coding

Outline

62

Part II: Encoding Output Information by Output Codes

Composite likelihood for pairwise coding

Multi-label output codes with CCA

Maximum-margin output coding

problem-dependent coding and code predictability

Part I: Encoding Input Information by Regularization

Learning with word correlation

A matrix-normal penalty for multi-task learning

Multi-task generalization

Go beyond covariance and correlation structures
Encode a dimension reduction
Learn compressible models
Projection penalties

Page 63: Learning with Limited Supervision by Input and Output Coding

Multi-label output coding

Design output coding for multi-label problems
Problem-dependent encodings to exploit label dependency
Code predictability
Propose: multi-label ECOCs via CCA

63


Page 64: Learning with Limited Supervision by Input and Output Coding

Canonical correlation analysis

Given paired data (X, Y), CCA finds projection directions u and v with maximum correlation:

64
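In standard form, for an input-space direction u and a label-space direction v,

\max_{u, v}\; \frac{u^\top C_{xy}\, v}{\sqrt{u^\top C_{xx}\, u}\, \sqrt{v^\top C_{yy}\, v}},

where C_{xx}, C_{yy}, and C_{xy} are the (cross-)covariance matrices of the inputs and the labels.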

Page 65: Learning with Limited Supervision by Input and Output Coding

Canonical correlation analysis

Given paired data (X, Y), CCA finds projection directions u and v with maximum correlation
Also known as “the most predictable criterion”:
CCA finds the most predictable directions v in the label space

65

Page 66: Learning with Limited Supervision by Input and Output Coding

Multi-label ECOCs using CCA

Encoding and learning
Perform CCA on (X, Y)
The code includes both the original labels and the label projections

66

Yi Zhang and Jeff Schneider. Multi-label Output Codes using Canonical Correlation Analysis. AISTATS 2011


Page 67: Learning with Limited Supervision by Input and Output Coding

Multi-label ECOCs using CCA

Encoding and learning
Perform CCA on (X, Y)
The code includes both the original labels and the label projections
Learn classifiers for the original labels
Learn regressions for the label projections

67

Yi Zhang and Jeff Schneider. Multi-label Output Codes using Canonical Correlation Analysis. AISTATS 2011


Page 68: Learning with Limited Supervision by Input and Output Coding

Multi-label ECOCs using CCA

Decoding
Classifiers: Bernoulli on the q original labels
Regression: Gaussian on the d label projections

68

Yi Zhang and Jeff Schneider. Multi-label Output Codes using Canonical Correlation Analysis. AISTATS 2011


Page 69: Learning with Limited Supervision by Input and Output Coding

Multi-label ECOCs using CCA

Decoding
Classifiers: Bernoulli on the q original labels
Regression: Gaussian on the d label projections
Mean-field approximation

69

Yi Zhang and Jeff Schneider. Multi-label Output Codes using Canonical Correlation Analysis. AISTATS 2011

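A hypothetical end-to-end sketch of the CCA coding pipeline using scikit-learn; the base learners (logistic regression and ridge regression) and the exhaustive decoding loop are illustrative assumptions, since the thesis decodes with a mean-field approximation rather than enumeration.

```python
import numpy as np
from sklearn.cross_decomposition import CCA
from sklearn.linear_model import LogisticRegression, Ridge

def fit_cca_output_code(X, Y, d=2):
    """Encode: add d CCA label projections to the q original labels.
    Assumes each column of the binary label matrix Y contains both classes."""
    cca = CCA(n_components=d).fit(X, Y)
    Z = cca.transform(X, Y)[1]                       # label projections (code bits)
    clfs = [LogisticRegression().fit(X, Y[:, k]) for k in range(Y.shape[1])]
    regs = [Ridge().fit(X, Z[:, j]) for j in range(d)]
    return cca, clfs, regs

def decode(x, cca, clfs, regs, sigma2=1.0):
    """Decode by exhaustive MAP over label vectors, scoring Bernoulli
    potentials for the labels and Gaussian potentials for the projections."""
    x = x.reshape(1, -1)
    p = np.array([c.predict_proba(x)[0, 1] for c in clfs])   # Bernoulli means
    z = np.array([r.predict(x)[0] for r in regs])            # Gaussian means
    q = len(clfs)
    best, best_score = None, -np.inf
    for bits in range(2 ** q):                        # fine for small q
        y = np.array([(bits >> k) & 1 for k in range(q)], dtype=float)
        zy = cca.transform(x, y.reshape(1, -1))[1].ravel()    # project candidate y
        score = np.sum(y * np.log(p + 1e-9) + (1 - y) * np.log(1 - p + 1e-9))
        score -= np.sum((zy - z) ** 2) / (2 * sigma2)
        if score > best_score:
            best, best_score = y, score
    return best
```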

Page 70: Learning with Limited Supervision by Input and Output Coding

Empirical results

Similar results on other criteria (macro/micro F1 scores)
Similar results on other data (emotions)
Similar results on other base learners (decision trees, SVMs)

70

[1] Furnkranz et al. Multi-label classification via calibrated label ranking. MLJ, 2008
[2] D. Hsu et al. Multi-label prediction via compressed sensing. NIPS, 2009
[3] Zhang and Schneider. A composite likelihood view for multi-label classification. AISTATS 2012

Page 71: Learning with Limited Supervision by Input and Output Coding

Outline

71

Part II: Encoding Output Information by Output Codes

Composite likelihood for pairwise coding

Multi-label output codes with CCA

Maximum-margin output coding

problem-dependent coding and code predictability

Discriminative and predictable codes

Part I: Encoding Input Information by Regularization

Learning with word correlation

A matrix-normal penalty for multi-task learning

Multi-task generalization

Go beyond covariance and correlation structures
Encode a dimension reduction
Learn compressible models
Projection penalties

Page 72: Learning with Limited Supervision by Input and Output Coding

Recall: coding with CCA

CCA finds label projections z that are most predictable
Low “transmission errors” in channel coding

72


Page 73: Learning with Limited Supervision by Input and Output Coding

A recent paper [1]: coding with PCA

Label projections z obtained by PCA
z has maximum sample variance, i.e., the codewords are far away from each other
Minimum code distance?

73
[1] Tai and Lin, 2010


Page 74: Learning with Limited Supervision by Input and Output Coding

Goal: predictable and discriminative codes

Predictable: the prediction is close to the correct codeword
Discriminative: the prediction is far away from incorrect codewords

74


Page 75: Learning with Limited Supervision by Input and Output Coding

Maximum margin output coding

A max-margin formulation

75

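A sketch of the formulation in structured-SVM style, assuming codewords enc(y) and a learned metric M, consistent with the metric-learning reformulation on the following slides: for every example i and every incorrect label vector y,

d_M\big(f(x_i), \mathrm{enc}(y_i)\big) \;\le\; d_M\big(f(x_i), \mathrm{enc}(y)\big) - \Delta(y, y_i) + \xi_i,

with the encoding and M optimized jointly against a regularizer plus \sum_i \xi_i; the exponentially many constraints over y are what the cutting-plane and overgenerating machinery below handles.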

Page 76: Learning with Limited Supervision by Input and Output Coding

Maximum margin output coding

A max-margin formulation

Assume M is the best linear predictor (in closed form of X, Y, V)
Reformulate using metric learning
Deal with the exponentially large number of constraints:
The cutting plane method
Overgenerating

76

Page 77: Learning with Limited Supervision by Input and Output Coding

Maximum margin output coding

A max-margin formulation

Assume M is the best linear predictor, and define:

77

Page 78: Learning with Limited Supervision by Input and Output Coding

Maximum margin output coding

A max-margin formulation

Metric learning formulation: define the Mahalanobis metric:

and the notation:

78

Page 79: Learning with Limited Supervision by Input and Output Coding

Maximum margin output coding

The metric learning problem

An exponentially large number of constraints
Cutting plane method? No polynomial-time separation oracle!

79

Page 80: Learning with Limited Supervision by Input and Output Coding

Maximum margin output coding

The metric learning problem

An exponentially large number of constraints
Cutting plane method? No polynomial-time separation oracle!
Cutting plane method with overgenerating (relaxation):
Relax the discrete codewords into a continuous (box-constrained) domain
Linearize for the relaxed domain
New separation oracle: a box-constrained QP

80

Page 81: Learning with Limited Supervision by Input and Output Coding

Empirical results

Similar results on other data (emotions and medical)

81

[1] Furnkranz et al. Multi-label classification via calibrated label ranking. MLJ, 2008
[2] Zhang et al. Multi-label learning by exploiting label dependency. KDD, 2010
[3] D. Hsu et al. Multi-label prediction via compressed sensing. NIPS, 2009
[4] Tai and Lin. Multi-label classification with principal label space transformation. Neural Computation
[5] Zhang and Schneider. Multi-label output codes via canonical correlation analysis. AISTATS 2011

Page 82: Learning with Limited Supervision by Input and Output Coding

Conclusion

Regularization to exploit input information
Semi-supervised learning with word correlation
Multi-task learning with a matrix-normal penalty
Learning compressible models
Projection penalties for dimension reduction

Output coding to exploit output information
Composite pairwise coding
Coding via CCA
Coding via a max-margin formulation

Future

82

Page 83: Learning with Limited Supervision by Input and Output Coding

Thank you! Questions?

83

Part II: Encoding Output Information by Output Codes

Composite likelihood for pairwise coding

Multi-label output codes with CCA

Maximum-margin output coding

problem-dependent coding and code predictability

Discriminative and predictable codes

Part I: Encoding Input Information by Regularization

Learning with word correlation

A matrix-normal penalty for multi-task learning

Multi-task generalization

Go beyond covariance and correlation structures
Encode a dimension reduction
Learn compressible models
Projection penalties

Page 84: Learning with Limited Supervision by Input and Output Coding

84

Page 85: Learning with Limited Supervision by Input and Output Coding

85

Page 86: Learning with Limited Supervision by Input and Output Coding

Local smoothness

Smoothness of model coefficients
Key property: a certain order of derivatives is sparse

86

Differentiation operator

Page 87: Learning with Limited Supervision by Input and Output Coding

Brain computer interaction

Classify electroencephalography (EEG) signals
Sparse models vs. piecewise smooth models

87

Page 88: Learning with Limited Supervision by Input and Output Coding

Projection penalties: linear cases

Learn a linear model with a given linear reduction P

88

Page 89: Learning with Limited Supervision by Input and Output Coding

Projection penalties: linear cases

Learn a linear model with a given linear reduction P

89

Page 90: Learning with Limited Supervision by Input and Output Coding

Projection penalties: linear cases

Learn a linear model with projection penalties

90

projection distance

Page 91: Learning with Limited Supervision by Input and Output Coding

Projection penalties: RKHS cases

Learning in RKHS with projection penalties
Primal formulation
Solve in the dual (see the next page)
Solve for v and b in the primal

91


Page 92: Learning with Limited Supervision by Input and Output Coding

Projection penalties: RKHS cases

Representer theorem for the model; the dual formulation:

92


Page 93: Learning with Limited Supervision by Input and Output Coding

Projection penalties: nonlinear cases

Learning linear models

Learning RKHS models

93


Page 94: Learning with Limited Supervision by Input and Output Coding

Empirical results

Face recognition (Yale), using SVM (poly-2)
Dimension reduction: KPCA, KDA, OLaplacian

94

[Bar charts: classification error for Original, Reduction, and Projection Penalty with 3, 5, and 7 training images per class.]

Page 95: Learning with Limited Supervision by Input and Output Coding

Empirical results

Face recognition (Yale), using SVM (poly-2)
Dimension reduction: KPCA, KDA, OLaplacian

95

[Bar charts: classification error for Original, Reduction, and Projection Penalty with 3, 5, and 7 training images per class.]

Page 96: Learning with Limited Supervision by Input and Output Coding

Empirical results

Face recognition (Yale), using SVM (poly-2)
Dimension reduction: KPCA, KDA, OLaplacian

96

[Bar charts: classification error for Original, Reduction, and Projection Penalty with 3, 5, and 7 training images per class.]

Page 97: Learning with Limited Supervision by Input and Output Coding

Empirical results

Price forecasting (Boston housing), using ridge regression
Dimension reduction: partial least squares

97

[Bar chart: 1 - R² for Original, Reduction, and Projection Penalty with 50 training samples.]

Page 98: Learning with Limited Supervision by Input and Output Coding

Binary relevance

Binary relevance (a.k.a. 1-vs-all)
Subproblems: classify each label independently
Decision making: same
Assume no label dependency

98


Page 99: Learning with Limited Supervision by Input and Output Coding

Binary relevance

Binary relevance (a.k.a. 1-vs-all)
Subproblems: classify each label independently
Decision making: same
Assume no label dependency
The composite likelihood view

99


Page 100: Learning with Limited Supervision by Input and Output Coding

Empirical results

Emotion data (classify music into different emotions)
Evaluation measure: subset accuracy

100

[1] Furnkranz et al. Multi-label classification via calibrated label ranking. MLJ, 2008
[2] D. Hsu et al. Multi-label prediction via compressed sensing. NIPS, 2009