
Page 1: Learning with Limited Supervision by Input and Output Coding

Learning with Limited Supervision by Input and Output Coding

Yi Zhang

Machine Learning Department, Carnegie Mellon University

April 30th, 2012

1

Page 2: Learning with Limited Supervision by Input and Output Coding

Thesis Committee

Jeff Schneider (Chair)
Geoff Gordon
Tom Mitchell
Xiaojin (Jerry) Zhu, University of Wisconsin-Madison

2

Page 3: Learning with Limited Supervision by Input and Output Coding

Introduction

Learning a prediction system, usually based on examples

Training examples are usually limited
Cost of obtaining high-quality examples
Complexity of the prediction problem

3

[Diagram: learn a mapping X → Y from training examples (x1, y1), …, (xn, yn)]

Page 4: Learning with Limited Supervision by Input and Output Coding

Introduction

Solution: exploit extra information about the input and output space
Improve the prediction performance
Reduce the cost of collecting training examples

4


Page 5: Learning with Limited Supervision by Input and Output Coding

Introduction

Solution: exploit extra information about the input and output space
Representation and discovery?
Incorporation?

5


Page 6: Learning with Limited Supervision by Input and Output Coding

Outline

6

Part I: Encoding Input Information by Regularization

Learning with word correlation

A matrix-normal penalty for multi-task learning

Learn compressible models

Projection penalties

Part II: Encoding Output Information by Output Codes

Composite likelihood for pairwise coding

Multi-label output codes with CCA

Maximum-margin output coding

Page 7: Learning with Limited Supervision by Input and Output Coding

Regularization

The general formulation

Ridge regression

Lasso

7
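In standard notation, the three objectives are:

\min_{w} \sum_{i=1}^{n} L(y_i, w^\top x_i) + \lambda\, R(w), \qquad R(w) = \|w\|_2^2 \;(\text{ridge regression}), \qquad R(w) = \|w\|_1 \;(\text{lasso}).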

Page 8: Learning with Limited Supervision by Input and Output Coding

Outline

8

Part I: Encoding Input Information by Regularization

Learning with word correlation

A matrix-normal penalty for multi-task learning

Learn compressible models

Projection penalties

Part II: Encoding Output Information by Output Codes

Composite likelihood for pairwise coding

Multi-label output codes with CCA

Maximum-margin output coding

Page 9: Learning with Limited Supervision by Input and Output Coding

Learning with unlabeled text

For a text classification task: plenty of unlabeled text on the Web, seemingly unrelated to the task
What can we gain from such unlabeled text?

9

Yi Zhang, Jeff Schneider and Artur Dubrawski. Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text. NIPS 2008

Page 10: Learning with Limited Supervision by Input and Output Coding

A motivating example for text learning

Humans learn text classification effectively! Two training examples:

+: [gasoline, truck] -: [vote, election]

Query: [gallon, vehicle]

Seems very easy! But why?

10

Page 11: Learning with Limited Supervision by Input and Output Coding

A motivating example for text learning

Humans learn text classification effectively! Two training examples:

+: [gasoline, truck] -: [vote, election]

Query: [gallon, vehicle]

Seems very easy! But why? Gasoline ~ gallon, truck ~ vehicle

11

Page 12: Learning with Limited Supervision by Input and Output Coding

A covariance operator for regularization

Covariance structure of model coefficients
Usually unknown; can it be learned from unlabeled text?

12
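As a sketch (the exact penalty in the thesis may differ in details), a covariance operator \Sigma over the model coefficients enters the regularizer as a quadratic form, equivalent to a zero-mean Gaussian prior with covariance \Sigma:

\min_{w} \sum_{i} L(y_i, w^\top x_i) + \lambda\, w^\top \Sigma^{-1} w.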

Page 13: Learning with Limited Supervision by Input and Output Coding

Learning with unlabeled text

Infer the covariance operator:
Extract latent topics from unlabeled text (with resampling)
Observe the contribution of words in each topic
[gas: 0.3, gallon: 0.2, truck: 0.2, safety: 0.2, …]
Estimate the correlation (covariance) of words

13

Page 14: Learning with Limited Supervision by Input and Output Coding

Learning with unlabeled text

Infer the covariance operator:
Extract latent topics from unlabeled text (with resampling)
Observe the contribution of words in each topic
[gas: 0.3, gallon: 0.2, truck: 0.2, safety: 0.2, …]
Estimate the correlation (covariance) of words
For a new task, we learn with regularization

14
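A minimal Python sketch of this pipeline, assuming a topic-word contribution matrix (e.g., from LDA on the unlabeled text) is already available; the function names and the closed-form least-squares solver are illustrative choices, not the exact procedure from the thesis.

```python
import numpy as np

def word_covariance_from_topics(topic_word, eps=1e-3):
    """topic_word: (num_topics, num_words) matrix of word contributions
    per latent topic, e.g., extracted from unlabeled text."""
    # Treat each topic as one observation of word co-occurrence behavior.
    sigma = np.cov(topic_word, rowvar=False)
    # Ridge the diagonal so the covariance is invertible.
    return sigma + eps * np.eye(sigma.shape[0])

def covariance_regularized_least_squares(X, y, sigma, lam=1.0):
    """Solve min_w ||Xw - y||^2 + lam * w^T sigma^{-1} w in closed form."""
    sigma_inv = np.linalg.inv(sigma)
    return np.linalg.solve(X.T @ X + lam * sigma_inv, X.T @ y)

# Toy usage with random data standing in for real documents.
rng = np.random.default_rng(0)
topics = rng.random((50, 20))            # 50 topics over a 20-word vocabulary
X, y = rng.random((10, 20)), rng.choice([-1.0, 1.0], 10)
w = covariance_regularized_least_squares(X, y, word_covariance_from_topics(topics))
```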

Page 15: Learning with Limited Supervision by Input and Output Coding

Experiments

Empirical results on 20 newsgroups
190 1-vs-1 classification tasks, 2% labeled examples
For any task, the majority of the unlabeled text (18 of 20 newsgroups) is irrelevant

Similar results with logistic regression and least squares

15

[1] V. Sindhwani and S. Keerthi. Large scale semi-supervised linear svms. In SIGIR, 2006

Page 16: Learning with Limited Supervision by Input and Output Coding

Outline

16

Part I: Encoding Input Information by Regularization

Learning with word correlation

A matrix-normal penalty for multi-task learning

Multi-task generalization

Learn compressible models

Projection penalties

Part II: Encoding Output Information by Output Codes

Composite likelihood for pairwise coding

Multi-label output codes with CCA

Maximum-margin output coding

Page 17: Learning with Limited Supervision by Input and Output Coding

Multi-task learning

Different but related prediction tasks
An example: landmine detection using radar images
Multiple tasks: different landmine fields (geographic conditions, landmine types)
Goal: information sharing among tasks

17

Page 18: Learning with Limited Supervision by Input and Output Coding

Regularization for multi-task learning

18

Our approach: view MTL as estimating a parameter matrix W (rows: tasks, columns: feature dimensions)

Page 19: Learning with Limited Supervision by Input and Output Coding

Regularization for multi-task learning

19

Our approach: view MTL as estimating a parameter matrix W
For a vector w, a covariance-based penalty corresponds to a Gaussian prior
For a matrix W, what is the analogous covariance operator for regularization?

Yi Zhang and Jeff Schneider. Learning Multiple Tasks with a Sparse Matrix-Normal Penalty. NIPS 2010

Page 20: Learning with Limited Supervision by Input and Output Coding

Matrix-normal distributions

Consider a 2-by-3 matrix W:
The full covariance of vec(W) is the Kronecker product of the row covariance and the column covariance

20


Page 21: Learning with Limited Supervision by Input and Output Coding

Matrix-normal distributions

Consider a 2-by-3 matrix W:
The full covariance of vec(W) is the Kronecker product of the row covariance and the column covariance
The matrix-normal density offers a compact form for this Kronecker-structured covariance

21

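In standard notation, a matrix-normal prior W \sim \mathcal{MN}(0, \Omega_r, \Omega_c) with row covariance \Omega_r and column covariance \Omega_c means \mathrm{vec}(W) \sim \mathcal{N}(0, \Omega_c \otimes \Omega_r), and the corresponding log-prior penalty is

\mathrm{tr}\big(\Omega_r^{-1}\, W\, \Omega_c^{-1}\, W^\top\big),

so the full Kronecker covariance never needs to be formed explicitly.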

Page 22: Learning with Limited Supervision by Input and Output Coding

Joint learning of multiple tasks

Learning with a matrix-normal penalty (a matrix-normal prior on W)
Joint learning via alternating optimization

22

Page 23: Learning with Limited Supervision by Input and Output Coding

Joint learning of multiple tasks

Learning with a matrix-normal penalty (a matrix-normal prior on W)
Alternating optimization
Other recent work as variants of special cases:
Multi-task feature learning [Argyriou et al., NIPS 06]: learning with the feature covariance
Clustered multi-task learning [Jacob et al., NIPS 08]: learning with the task covariance and spectral constraints
Multi-task relationship learning [Zhang et al., UAI 10]: learning with the task covariance

23
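A sketch of the alternating scheme, assuming the penalized objective takes the matrix-normal form above:

\min_{W,\,\Omega_r,\,\Omega_c}\; \sum_{t}\sum_{i} L\big(y_{ti},\, W_t^\top x_{ti}\big) + \lambda\, \mathrm{tr}\big(\Omega_r^{-1} W \Omega_c^{-1} W^\top\big),

alternating between (i) fixing \Omega_r, \Omega_c and solving a quadratically regularized problem for W, and (ii) fixing W and updating \Omega_r \propto W \Omega_c^{-1} W^\top and \Omega_c \propto W^\top \Omega_r^{-1} W (flip-flop covariance updates).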

Page 24: Learning with Limited Supervision by Input and Output Coding

Sparse covariance selection

Sparse covariance selection in matrix-normal penalties
Sparsity of the inverse row and column covariances corresponds to conditional independence among rows (tasks) and columns (feature dimensions) of W

24

Page 25: Learning with Limited Supervision by Input and Output Coding

Sparse covariance selection

Sparse covariance selection in matrix-normal penalties
Sparsity of the inverse row and column covariances corresponds to conditional independence among rows (tasks) and columns (feature dimensions) of W
Alternating optimization:
Estimating W: same as before
Estimating the row and column covariances: L1-penalized covariance estimation

25
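In standard form, L1-penalized covariance estimation solves

\hat{\Omega}^{-1} = \arg\min_{\Theta \succ 0}\; \mathrm{tr}(S\,\Theta) - \log\det\Theta + \rho\, \|\Theta\|_1,

where S is the sample covariance built from the rows (or columns) of the current W; the L1 term zeroes out entries of the inverse covariance, i.e., enforces the conditional independencies above.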

Page 26: Learning with Limited Supervision by Input and Output Coding

Results on multi-task learning
Landmine detection: multiple landmine fields

Face recognition: multiple 1-vs-1 tasks

26

[1] Jacob, Bach, and Vert. Clustered multi-task learning: A convex formulation. NIPS, 2008
[2] Argyriou, Evgeniou, and Pontil. Multi-task feature learning. NIPS, 2006

[Bar charts: 1-AUC with 30 samples per task (landmine detection) and classification error with 5 samples per subject (face recognition), comparing STL, MTL_Clust [1], MTL_Feat [2], and the proposed method.]

Page 27: Learning with Limited Supervision by Input and Output Coding

Outline

27

Part I: Encoding Input Information by Regularization

Learning with word correlation

A matrix-normal penalty for multi-task learning

Multi-task generalization

Go beyond covariance and correlation structures

Learn compressible models

Projection penalties

Part II: Encoding Output Information by Output Codes

Composite likelihood for pairwise coding

Multi-label output codes with CCA

Maximum-margin output coding

Page 28: Learning with Limited Supervision by Input and Output Coding

Learning compressible models

Learning compressible models

A compression operator P inside the penalty, instead of penalizing the raw coefficients
Bias: model compressibility

28

Yi Zhang, Jeff Schneider and Artur Dubrawski. Learning Compressible Models. SDM 2010
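A sketch of the penalty, assuming the later slides' notation where Pw denotes the compressed coefficients:

\min_{w}\; \sum_i L(y_i, w^\top x_i) + \lambda\, \|P w\|_1,

i.e., sparsity is imposed after applying the compression operator P (e.g., a 2D-DCT for energy compaction, or a differencing operator for local smoothness) rather than on w directly.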

Page 29: Learning with Limited Supervision by Input and Output Coding

Energy compaction

Image energy is concentrated at a few frequencies

29

JPEG (2D-DCT), 46 : 1 compression

Page 30: Learning with Limited Supervision by Input and Output Coding

Energy compaction

Image energy is concentrated at a few frequencies
Models need to operate at the relevant frequencies

30

JPEG (2D-DCT), 46 : 1 compression

2D-DCT

Page 31: Learning with Limited Supervision by Input and Output Coding

Sparse vs. compressible

Model coefficients w

Digit recognition:

31

[Figure panels: coefficients w, compressed coefficients Pw, and coefficients w shown as an image, each comparing sparse vs. compressible models.]

Page 32: Learning with Limited Supervision by Input and Output Coding

Outline

32

Part I: Encoding Input Information by Regularization

Learning with word correlation

A matrix-normal penalty for multi-task learning

Multi-task generalization

Go beyond covariance and correlation structures
Encode a dimension reduction
Learn compressible models
Projection penalties

Part II: Encoding Output Information by Output Codes

Composite likelihood for pairwise coding

Multi-label output codes with CCA

Maximum-margin output coding

Page 33: Learning with Limited Supervision by Input and Output Coding

Dimension reduction

Dimension reduction conveys information about the input space
Feature selection: importance
Feature clustering: granularity
Feature extraction: more general structures

33

Page 34: Learning with Limited Supervision by Input and Output Coding

How to use a dimension reduction?

However, any reduction loses certain information that may be relevant to a prediction task
Goal of projection penalties:
Encode useful information from a dimension reduction
Control the risk of potential information loss

34

Yi Zhang and Jeff Schneider. Projection Penalty: Dimension Reduction without Loss. ICML 2010

Page 35: Learning with Limited Supervision by Input and Output Coding

Projection penalties: the basic idea

The basic idea:
Observation: reducing the feature space restricts the model search to a model subspace M_P
Solution: still search in the full model space M, and penalize the projection distance to the model subspace M_P

35

Page 36: Learning with Limited Supervision by Input and Output Coding

Projection penalties: linear cases
Learn with a (linear) dimension reduction P

36

Page 37: Learning with Limited Supervision by Input and Output Coding

Projection penalties: linear cases

Learn with projection penalties

Optimization:

37

projection distance
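A sketch of the linear-case objective, consistent with the "projection distance" term above:

\min_{w}\; \sum_i L\big(y_i, w^\top x_i\big) + \lambda\, \min_{v}\, \|\, w - P^\top v \,\|_2^2,

i.e., the model w is still searched in the full space, but its distance to the subspace of models reachable through the reduction P is penalized.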

Page 38: Learning with Limited Supervision by Input and Output Coding

Projection penalties: nonlinear cases

38
Yi Zhang and Jeff Schneider. Projection Penalty: Dimension Reduction without Loss. ICML 2010

[Diagram: feature maps Φ and Φ' from the input space X into R^d and R^p, a reduction P between them, and model spaces M and M_P with models w and w_P; how to project in the nonlinear case is the open question.]

Page 39: Learning with Limited Supervision by Input and Output Coding

Projection penalties: nonlinear cases

39
Yi Zhang and Jeff Schneider. Projection Penalty: Dimension Reduction without Loss. ICML 2010

[Diagram: repeats the previous slide's construction.]

Page 40: Learning with Limited Supervision by Input and Output Coding

Empirical results
Text classification (20 newsgroups), using logistic regression
Dimension reduction: latent Dirichlet allocation

40

[Bar chart: classification error for Original, Reduction, and Projection Penalty at 2%, 5%, and 10% training data.]

Page 41: Learning with Limited Supervision by Input and Output Coding

Empirical results
Text classification (20 newsgroups), using logistic regression
Dimension reduction: latent Dirichlet allocation

41

[Bar chart: classification error for Original, Reduction, and Projection Penalty at 2%, 5%, and 10% training data.]

Similar results on face recognition, using SVM (poly-2); dimension reduction: KPCA, KDA, OLaplacianFace
Similar results on house price prediction, using regression; dimension reduction: PCA and partial least squares

Page 42: Learning with Limited Supervision by Input and Output Coding

Outline

42

Part II: Encoding Output Information by Output Codes

Composite likelihood for pairwise coding

Multi-label output codes with CCA

Maximum-margin output coding

Part I: Encoding Input Information by Regularization

Learning with word correlation

A matrix-normal penalty for multi-task learning

Multi-task generalization

Go beyond covariance and correlation structures
Encode a dimension reduction
Learn compressible models
Projection penalties

Page 43: Learning with Limited Supervision by Input and Output Coding

Outline

43

Part II: Encoding Output Information by Output Codes

Composite likelihood for pairwise coding

Multi-label output codes with CCA

Maximum-margin output coding

Part I: Encoding Input Information by Regularization

Learning with word correlation

A matrix-normal penalty for multi-task learning

Multi-task generalization

Go beyond covariance and correlation structures
Encode a dimension reduction
Learn compressible models
Projection penalties

Page 44: Learning with Limited Supervision by Input and Output Coding

Multi-label classification

Multi-label classification

Existence of certain label dependency
Example: classify an image into scenes (deserts, river, forest, etc.)
The multi-class problem is a special case: only one class is true

44

[Diagram: learn to predict labels y1, y2, …, yq from x; label dependency exists among the outputs.]

Page 45: Learning with Limited Supervision by Input and Output Coding

Output coding

d < q: compression, i.e., source coding
d > q: error-correcting codes, i.e., channel coding
Use the redundancy to correct prediction (“transmission”) errors

45

[Diagram: encode the labels y1 … yq into a code z1, z2, …, zd, learn to predict z from x, then decode z back to y.]

Page 46: Learning with Limited Supervision by Input and Output Coding

Error-correcting output codes (ECOCs)

Multi-class ECOCs [Dietterich & Bakiri, 1994] [Allwein, Schapire & Singer, 2001]
Encode into a (redundant) set of binary problems
Learn to predict the code
Decode the predictions
Our goal: design ECOCs for multi-label classification

46

[Diagram: labels y1 … yq encoded into binary subproblems z1, z2, …, zt such as y2 vs. y3 or {y3, y4} vs. y7; learn to predict the code from x, then decode.]

Page 47: Learning with Limited Supervision by Input and Output Coding

Outline

47

Part II: Encoding Output Information by Output Codes

Composite likelihood for pairwise coding

Multi-label output codes with CCA

Maximum-margin output coding

Part I: Encoding Input Information by Regularization

Learning with word correlation

A matrix-normal penalty for multi-task learning

Multi-task generalization

Go beyond covariance and correlation structures
Encode a dimension reduction
Learn compressible models
Projection penalties

Page 48: Learning with Limited Supervision by Input and Output Coding

Composite likelihood

The composite likelihood (CL): a partial specification of the likelihood as the product of simple component likelihoods
e.g., pairwise likelihood
e.g., full conditional likelihood
Estimation using composite likelihoods:
Computational and statistical efficiency
Robustness under model misspecification

48
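In standard notation, the two example component-likelihood choices are

\text{pairwise: } CL(y \mid x) = \prod_{i < j} p(y_i, y_j \mid x), \qquad \text{full conditional: } CL(y \mid x) = \prod_{i} p(y_i \mid y_{-i}, x).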

Page 49: Learning with Limited Supervision by Input and Output Coding

Multi-label problem decomposition

Problem decomposition methods
Decomposition into subproblems (encoding)
Decision making by combining subproblem predictions (decoding)
Examples: 1-vs-all, 1-vs-1, 1-vs-1 + 1-vs-all, etc.

49


Page 50: Learning with Limited Supervision by Input and Output Coding

1-vs-All (Binary Relevance)

Classify each label independently

The composite likelihood view

50

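The 1-vs-all decomposition corresponds to the composite likelihood built from the single-label marginals,

CL(y \mid x) = \prod_{k=1}^{q} p(y_k \mid x),

which ignores all label dependencies.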

Page 51: Learning with Limited Supervision by Input and Output Coding

Pairwise label ranking [1]

1-vs-1 method (a.k.a. pairwise label ranking)
Subproblems: pairwise label comparisons
Decision making: label ranking by counting the number of winning comparisons, then thresholding

51

[Diagram: learn pairwise comparisons y1 vs. y2, y1 vs. y3, …, yq-1 vs. yq from x to rank the labels y1 … yq.]

[1] Hullermeier et. al. Artif. Intell., 2008

Page 52: Learning with Limited Supervision by Input and Output Coding

Pairwise label ranking [1]

1-vs-1 method (a.k.a. pairwise label ranking)
Subproblems: pairwise label comparisons
Decision making: label ranking by counting the number of winning comparisons, then thresholding

The composite likelihood view

52


[1] Hullermeier et. al. Artif. Intell., 2008

Page 53: Learning with Limited Supervision by Input and Output Coding

Calibrated label ranking [2]

1-vs-1 + 1-vs-all (a.k.a. calibrated label ranking)
Subproblems: 1-vs-1 + 1-vs-all
Decision making: label ranking, with a smart thresholding based on the 1-vs-1 and 1-vs-all predictions

53


[2] Furnkranz et. al. MLJ, 2008

Page 54: Learning with Limited Supervision by Input and Output Coding

Calibrated label ranking [2]

1-vs-1 + 1-vs-all (a.k.a. calibrated label ranking)
Subproblems: 1-vs-1 + 1-vs-all
Decision making: label ranking, with a smart thresholding based on the 1-vs-1 and 1-vs-all predictions

The composite likelihood view

54


[2] Furnkranz et. al. MLJ, 2008

Page 55: Learning with Limited Supervision by Input and Output Coding

A composite likelihood view

A composite likelihood view of problem decomposition
Choice of subproblems: specification of a composite likelihood?
Decision making: inference on the composite likelihood?

55


Page 56: Learning with Limited Supervision by Input and Output Coding

A composite pairwise coding

Subproblems: individual and pairwise label densities
A pairwise label density conveys more information than a pairwise comparison

56

Yi Zhang and Jeff Schneider. A Composite Likelihood View for Multi-Label Classification. AISTATS 2012

[Tables: the four joint configurations (yi, yj) in {0,1} × {0,1}, shown for the two quantities being compared.]

Page 57: Learning with Limited Supervision by Input and Output Coding

A composite pairwise coding

Decision making: a robust mean-field approximation

is not robust to underestimation of label densities

57

Yi Zhang and Jeff Schneider. A Composite Likelihood View for Multi-Label Classification. AISTATS 2012

Page 58: Learning with Limited Supervision by Input and Output Coding

A composite pairwise coding

Decision making: a robust mean-field approximation

is not robust to underestimation of label densities

A composite divergence, robust and efficient to optimize

58

Yi Zhang and Jeff Schneider. A Composite Likelihood View for Multi-Label Classification. AISTATS 2012

Page 59: Learning with Limited Supervision by Input and Output Coding

Data sets

The Scene data: image scenes (beach, sunset, fall foliage, field, mountain, and urban)
Example image labels: beach, urban

59

[Boutell et. al., Pattern Recognition 2004]

Page 60: Learning with Limited Supervision by Input and Output Coding

Data sets

The Emotion data: music emotions (amazed, happy, relaxed, sad, etc.)
The Medical data: clinical text, labeled with medical categories (ICD-9-CM codes)
The Yeast data: gene functional categories
The Enron data: email tags on topics, attachment types, and emotional tones

60

Page 61: Learning with Limited Supervision by Input and Output Coding

Empirical results

Similar results on other data sets (emotions, medical, etc)

61

[1] Hullermeier et al. Label ranking by learning pairwise preferences. Artif. Intell., 2008
[2] Furnkranz et al. Multi-label classification via calibrated label ranking. MLJ, 2008
[3] Read et al. Classifier chains for multi-label classification. ECML, 2009
[4] Tsoumakas et al. Random k-labelsets: an ensemble method for multilabel classification. ECML, 2007
[5] Zhang et al. Multi-label learning by exploiting label dependency. KDD, 2010

Page 62: Learning with Limited Supervision by Input and Output Coding

Outline

62

Part II: Encoding Output Information by Output Codes

Composite likelihood for pairwise coding

Multi-label output codes with CCA

Maximum-margin output coding

problem-dependent coding and code predictability

Part I: Encoding Input Information by Regularization

Learning with word correlation

A matrix-normal penalty for multi-task learning

Multi-task generalization

Go beyond covariance and correlation structures
Encode a dimension reduction
Learn compressible models
Projection penalties

Page 63: Learning with Limited Supervision by Input and Output Coding

Multi-label output coding

Design output coding for multi-label problems
Problem-dependent encodings to exploit label dependency
Code predictability
Propose: multi-label ECOCs via CCA

63


Page 64: Learning with Limited Supervision by Input and Output Coding

Canonical correlation analysis

Given paired data (X, Y), CCA finds projection directions u and v with maximum correlation:

64
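In standard form, for an input-space direction u and a label-space direction v,

\max_{u, v}\; \frac{u^\top C_{xy}\, v}{\sqrt{u^\top C_{xx}\, u}\, \sqrt{v^\top C_{yy}\, v}},

where C_{xx}, C_{yy}, and C_{xy} are the (cross-)covariance matrices of the inputs and the labels.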

Page 65: Learning with Limited Supervision by Input and Output Coding

Canonical correlation analysis

Given paired data (X, Y), CCA finds projection directions u and v with maximum correlation
Also known as “the most predictable criterion”:
CCA finds the most predictable directions v in the label space

65

Page 66: Learning with Limited Supervision by Input and Output Coding

Multi-label ECOCs using CCA

Encoding and learning
Perform CCA on (X, Y)
The code includes both the original labels and the label projections

66

Yi Zhang and Jeff Schneider. Multi-label Output Codes using Canonical Correlation Analysis. AISTATS 2011


Page 67: Learning with Limited Supervision by Input and Output Coding

Multi-label ECOCs using CCA

Encoding and learning
Perform CCA on (X, Y)
The code includes both the original labels and the label projections
Learn classifiers for the original labels
Learn regressions for the label projections

67

Yi Zhang and Jeff Schneider. Multi-label Output Codes using Canonical Correlation Analysis. AISTATS 2011


Page 68: Learning with Limited Supervision by Input and Output Coding

Multi-label ECOCs using CCA

Decoding
Classifiers: Bernoulli on the q original labels
Regression: Gaussian on the d label projections

68

Yi Zhang and Jeff Schneider. Multi-label Output Codes using Canonical Correlation Analysis. AISTATS 2011


Page 69: Learning with Limited Supervision by Input and Output Coding

Multi-label ECOCs using CCA

Decoding
Classifiers: Bernoulli on the q original labels
Regression: Gaussian on the d label projections
Mean-field approximation

69

Yi Zhang and Jeff Schneider. Multi-label Output Codes using Canonical Correlation Analysis. AISTATS 2011

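A hypothetical end-to-end sketch of the CCA coding pipeline using scikit-learn; the base learners (logistic regression and ridge regression) and the exhaustive decoding loop are illustrative assumptions, since the thesis decodes with a mean-field approximation rather than enumeration.

```python
import numpy as np
from sklearn.cross_decomposition import CCA
from sklearn.linear_model import LogisticRegression, Ridge

def fit_cca_output_code(X, Y, d=2):
    """Encode: add d CCA label projections to the q original labels.
    Assumes each column of the binary label matrix Y contains both classes."""
    cca = CCA(n_components=d).fit(X, Y)
    Z = cca.transform(X, Y)[1]                       # label projections (code bits)
    clfs = [LogisticRegression().fit(X, Y[:, k]) for k in range(Y.shape[1])]
    regs = [Ridge().fit(X, Z[:, j]) for j in range(d)]
    return cca, clfs, regs

def decode(x, cca, clfs, regs, sigma2=1.0):
    """Decode by exhaustive MAP over label vectors, scoring Bernoulli
    potentials for the labels and Gaussian potentials for the projections."""
    x = x.reshape(1, -1)
    p = np.array([c.predict_proba(x)[0, 1] for c in clfs])   # Bernoulli means
    z = np.array([r.predict(x)[0] for r in regs])            # Gaussian means
    q = len(clfs)
    best, best_score = None, -np.inf
    for bits in range(2 ** q):                        # fine for small q
        y = np.array([(bits >> k) & 1 for k in range(q)], dtype=float)
        zy = cca.transform(x, y.reshape(1, -1))[1].ravel()    # project candidate y
        score = np.sum(y * np.log(p + 1e-9) + (1 - y) * np.log(1 - p + 1e-9))
        score -= np.sum((zy - z) ** 2) / (2 * sigma2)
        if score > best_score:
            best, best_score = y, score
    return best
```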

Page 70: Learning with Limited Supervision by Input and Output Coding

Empirical results

Similar results on other criteria (macro/micro F1 scores)
Similar results on other data (emotions)
Similar results on other base learners (decision trees, SVMs)

70

[1] Furnkranz et al. Multi-label classification via calibrated label ranking. MLJ, 2008
[2] D. Hsu et al. Multi-label prediction via compressed sensing. NIPS, 2009
[3] Zhang and Schneider. A composite likelihood view for multi-label classification. AISTATS 2012

Page 71: Learning with Limited Supervision by Input and Output Coding

Outline

71

Part II: Encoding Output Information by Output Codes

Composite likelihood for pairwise coding

Multi-label output codes with CCA

Maximum-margin output coding

problem-dependent coding and code predictability

Discriminative and predictable codes

Part I: Encoding Input Information by Regularization

Learning with word correlation

A matrix-normal penalty for multi-task learning

Multi-task generalization

Go beyond covariance and correlation structures
Encode a dimension reduction
Learn compressible models
Projection penalties

Page 72: Learning with Limited Supervision by Input and Output Coding

Recall: coding with CCA

CCA finds label projections z that are most predictable
Low “transmission errors” in channel coding

72


Page 73: Learning with Limited Supervision by Input and Output Coding

A recent paper [1]: coding with PCA

Label projections z obtained by PCA
z has maximum sample variance, i.e., the codewords are far away from each other
Minimum code distance?

73
[1] Tai and Lin, 2010


Page 74: Learning with Limited Supervision by Input and Output Coding

Goal: predictable and discriminative codes

Predictable: the prediction is close to the correct codeword
Discriminative: the prediction is far away from incorrect codewords

74


Page 75: Learning with Limited Supervision by Input and Output Coding

Maximum margin output coding

A max-margin formulation

75

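A sketch of the formulation in structured-SVM style, assuming codewords enc(y) and a learned metric M, consistent with the metric-learning reformulation on the following slides: for every example i and every incorrect label vector y,

d_M\big(f(x_i), \mathrm{enc}(y_i)\big) \;\le\; d_M\big(f(x_i), \mathrm{enc}(y)\big) - \Delta(y, y_i) + \xi_i,

with the encoding and M optimized jointly against a regularizer plus \sum_i \xi_i; the exponentially many constraints over y are what the cutting-plane and overgenerating machinery below handles.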

Page 76: Learning with Limited Supervision by Input and Output Coding

Maximum margin output coding

A max-margin formulation

Assume M is the best linear predictor (in closed form of X, Y, V)
Reformulate using metric learning
Deal with the exponentially large number of constraints:
The cutting plane method
Overgenerating

76

Page 77: Learning with Limited Supervision by Input and Output Coding

Maximum margin output coding

A max-margin formulation

Assume M is the best linear predictor, and define:

77

Page 78: Learning with Limited Supervision by Input and Output Coding

Maximum margin output coding

A max-margin formulation

Metric learning formulation: define the Mahalanobis metric:

and the notation:

78

Page 79: Learning with Limited Supervision by Input and Output Coding

Maximum margin output coding

The metric learning problem

An exponentially large number of constraints
Cutting plane method? No polynomial-time separation oracle!

79

Page 80: Learning with Limited Supervision by Input and Output Coding

Maximum margin output coding

The metric learning problem

An exponentially large number of constraints
Cutting plane method? No polynomial-time separation oracle!
Cutting plane method with overgenerating (relaxation):
Relax the discrete codewords into a continuous (box-constrained) domain
Linearize for the relaxed domain
New separation oracle: a box-constrained QP

80

Page 81: Learning with Limited Supervision by Input and Output Coding

Empirical results

Similar results on other data (emotions and medical)

81

[1] Furnkranz et al. Multi-label classification via calibrated label ranking. MLJ, 2008
[2] Zhang et al. Multi-label learning by exploiting label dependency. KDD, 2010
[3] D. Hsu et al. Multi-label prediction via compressed sensing. NIPS, 2009
[4] Tai and Lin. Multi-label classification with principal label space transformation. Neural Computation
[5] Zhang and Schneider. Multi-label output codes via canonical correlation analysis. AISTATS 2011

Page 82: Learning with Limited Supervision by Input and Output Coding

Conclusion

Regularization to exploit input information
Semi-supervised learning with word correlation
Multi-task learning with a matrix-normal penalty
Learning compressible models
Projection penalties for dimension reduction

Output coding to exploit output information
Composite pairwise coding
Coding via CCA
Coding via a max-margin formulation

Future

82

Page 83: Learning with Limited Supervision by Input and Output Coding

Thank you! Questions?

83

Part II: Encoding Output Information by Output Codes

Composite likelihood for pairwise coding

Multi-label output codes with CCA

Maximum-margin output coding

problem-dependent coding and code predictability

Discriminative and predictable codes

Part I: Encoding Input Information by Regularization

Learning with word correlation

A matrix-normal penalty for multi-task learning

Multi-task generalization

Go beyond covariance and correlation structures
Encode a dimension reduction
Learn compressible models
Projection penalties

Page 84: Learning with Limited Supervision by Input and Output Coding

84

Page 85: Learning with Limited Supervision by Input and Output Coding

85

Page 86: Learning with Limited Supervision by Input and Output Coding

Local smoothness

Smoothness of model coefficients
Key property: a certain order of derivatives is sparse

86

Differentiation operator

Page 87: Learning with Limited Supervision by Input and Output Coding

Brain computer interaction

Classify electroencephalography (EEG) signals
Sparse models vs. piecewise smooth models

87

Page 88: Learning with Limited Supervision by Input and Output Coding

Projection penalties: linear cases

Learn a linear model with a given linear reduction P

88

Page 89: Learning with Limited Supervision by Input and Output Coding

Projection penalties: linear cases

Learn a linear model with a given linear reduction P

89

Page 90: Learning with Limited Supervision by Input and Output Coding

Projection penalties: linear cases

Learn a linear model with projection penalties

90

projection distance

Page 91: Learning with Limited Supervision by Input and Output Coding

Projection penalties: RKHS cases

Learning in RKHS with projection penalties
Primal formulation
Solve in the dual (see the next page)
Solve for v and b in the primal

91


Page 92: Learning with Limited Supervision by Input and Output Coding

Projection penalties: RKHS cases

Representer theorem for the model; the dual formulation:

92


Page 93: Learning with Limited Supervision by Input and Output Coding

Projection penalties: nonlinear cases

Learning linear models

Learning RKHS models

93


Page 94: Learning with Limited Supervision by Input and Output Coding

Empirical results

Face recognition (Yale), using SVM (poly-2)
Dimension reduction: KPCA, KDA, OLaplacian

94

[Bar charts: classification error for Original, Reduction, and Projection Penalty with 3, 5, and 7 training images per class.]

Page 95: Learning with Limited Supervision by Input and Output Coding

Empirical results

Face recognition (Yale), using SVM (poly-2)
Dimension reduction: KPCA, KDA, OLaplacian

95

[Bar charts: classification error for Original, Reduction, and Projection Penalty with 3, 5, and 7 training images per class.]

Page 96: Learning with Limited Supervision by Input and Output Coding

Empirical results

Face recognition (Yale), using SVM (poly-2)
Dimension reduction: KPCA, KDA, OLaplacian

96

[Bar charts: classification error for Original, Reduction, and Projection Penalty with 3, 5, and 7 training images per class.]

Page 97: Learning with Limited Supervision by Input and Output Coding

Empirical results

Price forecasting (Boston housing), using ridge regression
Dimension reduction: partial least squares

97

[Bar chart: 1 - R² for Original, Reduction, and Projection Penalty with 50 training samples.]

Page 98: Learning with Limited Supervision by Input and Output Coding

Binary relevance

Binary relevance (a.k.a. 1-vs-all)
Subproblems: classify each label independently
Decision making: same
Assume no label dependency

98


Page 99: Learning with Limited Supervision by Input and Output Coding

Binary relevance

Binary relevance (a.k.a. 1-vs-all)
Subproblems: classify each label independently
Decision making: same
Assume no label dependency
The composite likelihood view

99


Page 100: Learning with Limited Supervision by Input and Output Coding

Empirical results

Emotion data (classify music into different emotions)
Evaluation measure: subset accuracy

100

[1] Furnkranz et al. Multi-label classification via calibrated label ranking. MLJ, 2008
[2] D. Hsu et al. Multi-label prediction via compressed sensing. NIPS, 2009