
Learning with Limited Supervision by Input and Output Coding

Yi Zhang

Machine Learning Department, Carnegie Mellon University

April 30th, 2012

1

Thesis Committee

Jeff Schneider, Chair
Geoff Gordon
Tom Mitchell
Xiaojin (Jerry) Zhu, University of Wisconsin-Madison

2

Introduction

Learning a prediction system, usually based on examples

Training examples are usually limited:
Cost of obtaining high-quality examples
Complexity of the prediction problem

3

[Diagram: learning a predictor from examples (x1, y1), …, (xn, yn) mapping X to Y]

Introduction

Solution: exploit extra information about the input and output space
Improve the prediction performance
Reduce the cost of collecting training examples

4


Introduction

Solution: exploit extra information about the input and output space
Representation and discovery?
Incorporation?

5


Outline

6

Part I: Encoding Input Information by Regularization

Learning with word correlation

A matrix-normal penalty for multi-task learning

Learn compressible models

Projection penalties

Part II: Encoding Output Information by Output Codes

Composite likelihood for pairwise coding

Multi-label output codes with CCA

Maximum-margin output coding

Regularization

The general formulation

Ridge regression

Lasso

7
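The formulas on this slide did not survive extraction. For reference, the standard regularized learning formulation the bullets refer to, with ridge and lasso as the two special cases, can be written as

```latex
\min_{w}\ \sum_{i=1}^{n} L\!\left(y_i, w^{\top} x_i\right) + \lambda\,\Omega(w),
\qquad
\Omega_{\mathrm{ridge}}(w) = \lVert w \rVert_2^2,
\qquad
\Omega_{\mathrm{lasso}}(w) = \lVert w \rVert_1
```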

Outline

8

Part I: Encoding Input Information by Regularization

Learning with word correlation

A matrix-normal penalty for multi-task learning

Learn compressible models

Projection penalties

Part II: Encoding Output Information by Output Codes

Composite likelihood for pairwise coding

Multi-label output codes with CCA

Maximum-margin output coding

Learning with unlabeled text

For a text classification task, there is plenty of unlabeled text on the Web, seemingly unrelated to the task. What can we gain from such unlabeled text?

9

Yi Zhang, Jeff Schneider and Artur Dubrawski. Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text. NIPS 2008

A motivating example for text learning

Humans learn text classification effectively! Two training examples:

+: [gasoline, truck] -: [vote, election]

Query: [gallon, vehicle]

Seems very easy! But why?

10

A motivating example for text learning

Humans learn text classification effectively! Two training examples:

+: [gasoline, truck] -: [vote, election]

Query: [gallon, vehicle]

Seems very easy! But why? Gasoline ~ gallon, truck ~ vehicle

11

A covariance operator for regularization

Covariance structure of model coefficients: usually unknown -- can we learn it from unlabeled text?

12
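The penalty itself was lost in the transcript. A covariance-weighted quadratic penalty of roughly the following form is what the description suggests (the exact scaling used in the thesis may differ):

```latex
\min_{w}\ \sum_{i=1}^{n} L\!\left(y_i, w^{\top} x_i\right) + \lambda\, w^{\top} \Sigma^{-1} w
```

where Σ is the covariance of the model coefficients (here, word correlations estimated from unlabeled text); Σ = I recovers the ordinary ridge penalty.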

Learning with unlabeled text

Infer the covariance operator:
Extract latent topics from unlabeled text (with resampling)
Observe the contribution of words in each topic, e.g., [gas: 0.3, gallon: 0.2, truck: 0.2, safety: 0.2, …]
Estimate the correlation (covariance) of words

13

Learning with unlabeled text

Infer the covariance operator:
Extract latent topics from unlabeled text (with resampling)
Observe the contribution of words in each topic, e.g., [gas: 0.3, gallon: 0.2, truck: 0.2, safety: 0.2, …]
Estimate the correlation (covariance) of words

For a new task, we learn with regularization

14
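A minimal sketch of the pipeline above: extract topics from unlabeled text, estimate a word covariance from the topic-word contributions, and learn a new task with the covariance-weighted penalty. The toy corpora, the number of topics, and the eigendecomposition-based square root are illustrative assumptions, not the thesis implementation.

```python
# Sketch: word-correlation regularizer estimated from unlabeled text.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.linear_model import LogisticRegression

unlabeled_docs = ["gas prices rise as truck traffic grows",
                  "election vote counts reported tonight",
                  "a gallon of gasoline for the vehicle",
                  "voters cast ballots in the election"]
labeled_docs = ["gasoline truck", "vote election"]
labels = np.array([1, 0])

vec = CountVectorizer()
X_unlab = vec.fit_transform(unlabeled_docs)
X_lab = vec.transform(labeled_docs).toarray()

# 1) Extract latent topics from the unlabeled text.
lda = LatentDirichletAllocation(n_components=3, random_state=0).fit(X_unlab)
topic_word = lda.components_ / lda.components_.sum(axis=1, keepdims=True)

# 2) Estimate the word covariance from the per-topic word contributions.
Sigma = np.cov(topic_word, rowvar=False) + 1e-3 * np.eye(topic_word.shape[1])

# 3) The penalty w' Sigma^{-1} w equals an ordinary L2 penalty after mapping
#    features x -> Sigma^{1/2} x, so any L2-regularized learner can be reused.
eigvals, eigvecs = np.linalg.eigh(Sigma)
Sigma_half = eigvecs @ np.diag(np.sqrt(np.clip(eigvals, 0, None))) @ eigvecs.T
clf = LogisticRegression(C=1.0).fit(X_lab @ Sigma_half, labels)
w = Sigma_half @ clf.coef_.ravel()   # coefficients in the original word space
```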

Experiments

Empirical results on 20 newsgroups: 190 1-vs-1 classification tasks, 2% labeled examples
For any task, the majority of unlabeled text (18/20) is irrelevant
Similar results on logistic regression and least squares

15

[1] V. Sindhwani and S. Keerthi. Large scale semi-supervised linear svms. In SIGIR, 2006

Outline

16

Part I: Encoding Input Information by Regularization

Learning with word correlation

A matrix-normal penalty for multi-task learning

Multi-task generalization

Learn compressible models

Projection penalties

Part II: Encoding Output Information by Output Codes

Composite likelihood for pairwise coding

Multi-label output codes with CCA

Maximum-margin output coding

Multi-task learning

Different but related prediction tasks
An example: landmine detection using radar images
Multiple tasks: different landmine fields (geographic conditions, landmine types)
Goal: information sharing among tasks

17

Regularization for multi-task learning

18

W = the matrix of model coefficients (one row per task, one column per feature dimension)

Our approach: view MTL as estimating a parameter matrix

Regularization for multi-task learning

19

W = the matrix of model coefficients (one row per task, one column per feature dimension)
How should we regularize the matrix W? (Gaussian prior)

Yi Zhang and Jeff Schneider. Learning Multiple Tasks with a Sparse Matrix-Normal Penalty. NIPS 2010

Our approach: view MTL as estimating a parameter matrix

A covariance operator for regularizing a matrix?
For a vector w, a Gaussian prior gives a covariance-weighted quadratic penalty
For a matrix W, we need covariance structure over both rows and columns

Matrix-normal distributions

Consider a 2 by 3 matrix W:

The full covariance is the Kronecker product of the row covariance and the column covariance

20

[Figure: the full covariance of the 2 by 3 matrix W factorizes as the Kronecker product of a 2 by 2 row covariance and a 3 by 3 column covariance]

Matrix-normal distributions

Consider a 2 by 3 matrix W:

The full covariance is the Kronecker product of the row covariance and the column covariance
The matrix-normal density offers a compact form for this Kronecker-structured covariance

21

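For reference (a standard definition, not recovered from the slide): an m-by-p matrix W is matrix-normal with mean M, row covariance Σ (m by m), and column covariance Ω (p by p) when vec(W) is Gaussian with covariance Ω ⊗ Σ, and its density has the compact form

```latex
p(W \mid M, \Sigma, \Omega)
= \frac{\exp\!\big(-\tfrac{1}{2}\,\mathrm{tr}\big[\Omega^{-1} (W-M)^{\top} \Sigma^{-1} (W-M)\big]\big)}
       {(2\pi)^{mp/2}\, \lvert \Omega \rvert^{m/2}\, \lvert \Sigma \rvert^{p/2}}
```

Taking the negative log with M = 0 gives a penalty proportional to tr[Ω⁻¹ Wᵀ Σ⁻¹ W], which is presumably the matrix-normal penalty used on the following slides.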

Joint learning of multiple tasks

Alternating optimization

Learning with a matrix-normal penalty

22

Matrix-normal prior

Joint learning of multiple tasks

Alternating optimization

Other recent work as variants of special cases:
Multi-task feature learning [Argyriou et al, NIPS 06]: learning with the feature covariance
Clustered multi-task learning [Jacob et al, NIPS 08]: learning with the task covariance and spectral constraints
Multi-task relationship learning [Zhang et al, UAI 10]: learning with the task covariance

Learning with a matrix-normal penalty

23

Matrix-normal prior
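A minimal sketch of the alternating scheme above, assuming squared loss per task, a matrix-normal penalty tr(Σ⁻¹ W Ω⁻¹ Wᵀ), and simple flip-flop covariance updates (the sparse variant on the later slides would use L1-penalized covariance estimation in the second step instead). The sizes, lambda, and ridge terms are illustrative assumptions.

```python
# Sketch: alternating optimization with a matrix-normal penalty.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
m, p, n = 4, 6, 30                           # tasks, features, samples per task
Xs = [rng.normal(size=(n, p)) for _ in range(m)]
ys = [X @ rng.normal(size=p) + 0.1 * rng.normal(size=n) for X in Xs]
lam = 1.0

Sigma = np.eye(m)                            # row (task) covariance
Omega = np.eye(p)                            # column (feature) covariance

def objective(w_flat, Sig_inv, Om_inv):
    W = w_flat.reshape(m, p)
    data = sum(np.sum((y - X @ W[t]) ** 2) for t, (X, y) in enumerate(zip(Xs, ys)))
    return data + lam * np.trace(Sig_inv @ W @ Om_inv @ W.T)

W = np.zeros((m, p))
for _ in range(10):
    Sig_inv, Om_inv = np.linalg.inv(Sigma), np.linalg.inv(Omega)
    # Step 1: estimate W with the matrix-normal penalty, covariances fixed.
    W = minimize(objective, W.ravel(), args=(Sig_inv, Om_inv),
                 method="L-BFGS-B").x.reshape(m, p)
    # Step 2: re-estimate row/column covariances from W (flip-flop MLE + ridge).
    Sigma = W @ Om_inv @ W.T / p + 1e-3 * np.eye(m)
    Omega = W.T @ np.linalg.inv(Sigma) @ W / m + 1e-3 * np.eye(p)
```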

Sparse covariance selection

Sparse covariance selection in matrix-normal penalties

Sparsity of the inverse row and column covariances corresponds to conditional independence among rows (tasks) and columns (feature dimensions) of W

24

Sparse covariance selection

Sparse covariance selection in matrix-normal penalties

Sparsity of the inverse row and column covariances corresponds to conditional independence among rows (tasks) and columns (feature dimensions) of W
Alternating optimization:
Estimating W: same as before
Estimating the row and column covariances: L1-penalized covariance estimation

25

Results on multi-task learning
Landmine detection: multiple landmine fields

Face recognition: multiple 1-vs-1 tasks

26

[1] Jacob, Bach, and Vert. Clustered multi-task learning: A convex formulation. NIPS, 2008
[2] Argyriou, Evgeniou, and Pontil. Multi-task feature learning. NIPS, 2006

[Bar charts comparing STL, MTL_Clust [1], MTL_Feat [2], and the proposed method: 1-AUC on landmine detection (30 samples per task) and classification error on face recognition (5 samples per subject)]

Outline

27

Part I: Encoding Input Information by Regularization

Learning with word correlation

A matrix-normal penalty for multi-task learning

Multi-task generalization

Go beyond covariance and correlation structures

Learn compressible models

Projection penalties

Part II: Encoding Output Information by Output Codes

Composite likelihood for pairwise coding

Multi-label output codes with CCA

Maximum-margin output coding

Learning compressible models

A compression operator P in the penalty: penalize the compressed coefficients Pw instead of w
Bias: model compressibility

28

Yi Zhang, Jeff Schneider and Artur Dubrawski. Learning Compressible Models. SDM 2010
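The penalty itself is not shown in the transcript; based on the bullet above, it presumably takes the form of an L1 penalty applied after the compression operator P (for example a DCT) rather than to w directly:

```latex
\min_{w}\ \sum_{i=1}^{n} L\!\left(y_i, w^{\top} x_i\right) + \lambda\, \lVert P w \rVert_1
```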

Energy compaction

Image energy is concentrated at a few frequencies

29

JPEG (2D-DCT), 46 : 1 compression

Energy compaction

Image energy is concentrated at a few frequencies
Models need to operate at relevant frequencies

30

JPEG (2D-DCT), 46 : 1 compression

2D-DCT

Sparse vs. compressible

Model coefficients w

Digit recognition:

31

[Figure: sparse vs. compressible models, shown as coefficients w, compressed coefficients Pw, and coefficients w displayed as an image]
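A minimal sketch of fitting a compressible model with a DCT as the compression operator P: because P is orthonormal, penalizing ||Pw||_1 is equivalent to an ordinary lasso on the transformed features X Pᵀ. The slides use a 2D-DCT on image-shaped coefficients; a 1D DCT, synthetic data, and the alpha value are illustrative simplifications here.

```python
# Sketch: a compressible (DCT-sparse) model via lasso in the transformed space.
import numpy as np
from scipy.fft import dct, idct
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, d = 200, 64
X = rng.normal(size=(n, d))
# A compressible (not sparse) ground truth: only 4 nonzero DCT coefficients.
w_true = idct(np.concatenate([rng.normal(size=4), np.zeros(d - 4)]), norm="ortho")
y = X @ w_true + 0.05 * rng.normal(size=n)

# Orthonormal DCT matrix: P @ w == dct(w, norm="ortho") and P.T @ P == I.
P = dct(np.eye(d), axis=0, norm="ortho")

# Fit u = P w by lasso on X P^T, then map back to the original space: w = P^T u.
lasso = Lasso(alpha=0.01).fit(X @ P.T, y)
w_hat = P.T @ lasso.coef_
```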

Outline

32

Part I: Encoding Input Information by Regularization

Learning with word correlation

A matrix-normal penalty for multi-task learning

Multi-task generalization

Go beyond covariance and correlation structures
Encode a dimension reduction
Learn compressible models
Projection penalties

Part II: Encoding Output Information by Output Codes

Composite likelihood for pairwise coding

Multi-label output codes with CCA

Maximum-margin output coding

Dimension reduction

Dimension reduction conveys information about the input space:
Feature selection: importance
Feature clustering: granularity
Feature extraction: more general structures

33

How to use a dimension reduction?

However, any reduction loses certain information, which may be relevant to a prediction task
Goal of projection penalties:
Encode useful information from a dimension reduction
Control the risk of potential information loss

34

Yi Zhang and Jeff Schneider. Projection Penalty: Dimension Reduction without Loss. ICML 2010

Projection penalties: the basic idea

The basic idea:
Observation: reducing the feature space restricts the model search to a model subspace MP
Solution: still search in the full model space M, and penalize the projection distance to the model subspace MP

35

Learn with a (linear) dimension reduction P

Projection penalties: linear cases

36

Projection penalties: linear cases

Learn with projection penalties

Optimization:

37

projection distance
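The formulas on these two slides did not survive extraction. Based on the description (search the full model space, but penalize the projection distance to the reduced model subspace), the linear case presumably has roughly the form

```latex
\min_{w,\,v}\ \sum_{i=1}^{n} L\!\left(y_i, w^{\top} x_i\right)
+ \lambda\, \big\lVert w - P^{\top} v \big\rVert_2^{2}
```

where P is the given linear reduction and Pᵀv ranges over the model subspace MP; as λ grows this recovers learning in the reduced space, while a finite λ keeps the full model space available.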

Projection penalties: nonlinear cases

38

Yi Zhang and Jeff Schneider. Projection Penalty: Dimension Reduction without Loss. ICML 2010

[Diagram: feature maps from the input space X into R^p and R^d, a dimension reduction P, and the model space M with subspace MP containing models w and wP]

Projection penalties: nonlinear cases

39

Yi Zhang and Jeff Schneider. Projection Penalty: Dimension Reduction without Loss. ICML 2010

[Diagram: feature maps from the input space X into R^p and R^d, a dimension reduction P, and the model space M with subspace MP containing models w and wP]

Empirical results
Text classification (20 newsgroups), using logistic regression
Dimension reduction: latent Dirichlet allocation

40

[Bar chart: classification error for Original, Reduction, and Projection Penalty with 2%, 5%, and 10% training data]

Empirical results
Text classification (20 newsgroups), using logistic regression
Dimension reduction: latent Dirichlet allocation

41

[Bar chart: classification error for Orig, Red, Proj with 2%, 5%, and 10% training data]

Similar results on face recognition, using SVM (poly-2); dimension reduction: KPCA, KDA, OLaplacianFace
Similar results on house price prediction, using regression; dimension reduction: PCA and partial least squares

Outline

42

Part II: Encoding Output Information by Output Codes

Composite likelihood for pairwise coding

Multi-label output codes with CCA

Maximum-margin output coding

Part I: Encoding Input Information by Regularization

Learning with word correlation

A matrix-normal penalty for multi-task learning

Multi-task generalization

Go beyond covariance and correlation structures
Encode a dimension reduction
Learn compressible models
Projection penalties

Outline

43

Part II: Encoding Output Information by Output Codes

Composite likelihood for pairwise coding

Multi-label output codes with CCA

Maximum-margin output coding

Part I: Encoding Input Information by Regularization

Learning with word correlation

A matrix-normal penalty for multi-task learning

Multi-task generalization

Go beyond covariance and correlation structures
Encode a dimension reduction
Learn compressible models
Projection penalties

Multi-label classification

Multi-label classification

Existence of certain label dependency
Example: classify an image into scenes (deserts, river, forest, etc.)
Multi-class problem is a special case: only one class is true

44

[Diagram: learn to predict labels y1, y2, …, yq from x, with dependency among the labels]

Output coding

d < q: compression, i.e., source coding
d > q: error-correcting codes, i.e., channel coding
Use the redundancy to correct prediction (“transmission”) errors

45

[Diagram: encode labels y1 … yq into a code z1 … zd, learn to predict z from x, then decode the predictions back to y]

Error-correcting output codes (ECOCs)

Multi-class ECOCs [Dietterich & Bakiri, 1994] [Allwein, Schapire & Singer 2001]

Encode into a (redundant) set of binary problems
Learn to predict the code
Decode the predictions
Our goal: design ECOCs for multi-label classification

46

[Diagram: encode labels y1 … yq into binary subproblems z1 … zt (e.g., y2 vs. y3, {y3, y4} vs. y7), learn to predict them from x, then decode]
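A minimal sketch of the classical multi-class ECOC recipe named above, using scikit-learn's built-in random-code wrapper: encode each class as a binary codeword, learn one binary classifier per bit, decode to the nearest codeword. The digits data and code_size are illustrative; this is the multi-class baseline, not the thesis's multi-label codes.

```python
# Sketch: multi-class error-correcting output codes with random codewords.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OutputCodeClassifier

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

ecoc = OutputCodeClassifier(LogisticRegression(max_iter=1000),
                            code_size=1.5, random_state=0)
ecoc.fit(X_tr, y_tr)
print("accuracy:", ecoc.score(X_te, y_te))
```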

Outline

47

Part II: Encoding Output Information by Output Codes

Composite likelihood for pairwise coding

Multi-label output codes with CCA

Maximum-margin output coding

Part I: Encoding Input Information by Regularization

Learning with word correlation

A matrix-normal penalty for multi-task learning

Multi-task generalization

Go beyond covariance and correlation structures
Encode a dimension reduction
Learn compressible models
Projection penalties

The composite likelihood (CL): a partial specification of the likelihood as the product of simple component likelihoods

e.g., pairwise likelihood:

e.g., full conditional likelihood

Estimation using composite likelihoods:
Computational and statistical efficiency
Robustness under model misspecification

Composite likelihood

48
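The two component examples above lost their formulas; written out (conditioned on x to match the multi-label setting), they are

```latex
\mathrm{CL}_{\text{pairwise}}(y \mid x) = \prod_{i<j} p\!\left(y_i, y_j \mid x\right),
\qquad
\mathrm{CL}_{\text{full-cond}}(y \mid x) = \prod_{i} p\!\left(y_i \mid y_{\setminus i},\, x\right)
```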

Multi-label problem decomposition

Problem decomposition methods

Decomposition into subproblems (encoding)
Decision making by combining subproblem predictions (decoding)
Examples: 1-vs-all, 1-vs-1, 1-vs-1 + 1-vs-all, etc.

49

[Diagram: decompose the labels y1 … yq into subproblems and learn to predict each from x]

1-vs-All (Binary Relevance)

Classify each label independently

The composite likelihood view

50


Pairwise label ranking [1]

1-vs-1 method (a.k.a. pairwise label ranking)
Subproblems: pairwise label comparisons
Decision making: label ranking by counting the number of winning comparisons, and thresholding

51

[Diagram: learn the pairwise comparisons y1 vs. y2, y1 vs. y3, …, yq-1 vs. yq from x]

[1] Hullermeier et. al. Artif. Intell., 2008

Pairwise label ranking [1]

1-vs-1 method (a.k.a. pairwise label ranking)
Subproblems: pairwise label comparisons
Decision making: label ranking by counting the number of winning comparisons, and thresholding

The composite likelihood view

52


[1] Hullermeier et. al. Artif. Intell., 2008
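A minimal sketch of 1-vs-1 pairwise label ranking as described above: one comparison classifier per label pair, then rank labels by winning comparisons and threshold. The synthetic data, base learner, and fixed threshold are illustrative assumptions (the calibrated variant on the next slide learns the threshold from 1-vs-all predictions instead).

```python
# Sketch: pairwise label ranking (1-vs-1) for multi-label classification.
import numpy as np
from itertools import combinations
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression

X, Y = make_multilabel_classification(n_samples=300, n_classes=6, n_labels=3,
                                      random_state=0)
q = Y.shape[1]

# Subproblems: for each pair (i, j), learn "is label i relevant rather than j?"
# using only examples where exactly one of the two labels is relevant.
pair_clfs = {}
for i, j in combinations(range(q), 2):
    mask = Y[:, i] != Y[:, j]
    if len(np.unique(Y[mask, i])) == 2:          # need both outcomes to train
        pair_clfs[(i, j)] = LogisticRegression(max_iter=1000).fit(X[mask], Y[mask, i])

def predict(x):
    votes = np.zeros(q)
    for (i, j), clf in pair_clfs.items():
        winner = i if clf.predict(x.reshape(1, -1))[0] == 1 else j
        votes[winner] += 1
    # Rank by votes; a label is predicted relevant if it wins more than half
    # of its comparisons (a crude stand-in for the calibrated threshold).
    return (votes > (q - 1) / 2.0).astype(int)

print(predict(X[0]), "true:", Y[0])
```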

Calibrated label ranking [2]

1-vs-1 + 1-vs-all (a.k.a. calibrated label ranking)
Subproblems: 1-vs-1 + 1-vs-all
Decision making: label ranking, and a smart thresholding based on 1-vs-1 and 1-vs-all predictions

53

[Diagram: learn the pairwise comparisons y1 vs. y2, …, yq-1 vs. yq and the individual labels y1 … yq from x]

[2] Furnkranz et. al. MLJ, 2008

Calibrated label ranking [2]

1-vs-1 + 1-vs-all (a.k.a. calibrated label ranking)
Subproblems: 1-vs-1 + 1-vs-all
Decision making: label ranking, and a smart thresholding based on 1-vs-1 and 1-vs-all predictions

The composite likelihood view

54


[2] Furnkranz et. al. MLJ, 2008

A composite likelihood view

A composite likelihood view for problem decomposition:
Choice of subproblems corresponds to the specification of a composite likelihood?
Decision making corresponds to inference on the composite likelihood?

55


A composite pairwise coding

Subproblems: individual and pairwise label densities
A pairwise label density conveys more information than a pairwise comparison

56

Yi Zhang and Jeff Schneider. A Composite Likelihood View for Multi-Label Classification. AISTATS 2012

[Figure: the four joint configurations of a label pair: (yi=0, yj=0), (yi=0, yj=1), (yi=1, yj=0), (yi=1, yj=1)]

A composite pairwise coding

Decision making: a robust mean-field approximation
Plain mean-field decoding is not robust to underestimation of label densities

57

Yi Zhang and Jeff Schneider. A Composite Likelihood View for Multi-Label Classification. AISTATS 2012

A composite pairwise coding

Decision making: a robust mean-field approximation
Plain mean-field decoding is not robust to underestimation of label densities

A composite divergence, robust and efficient to optimize

58

Yi Zhang and Jeff Schneider. A Composite Likelihood View for Multi-Label Classification. AISTATS 2012

Data sets

The Scene data: image scenes (beach, sunset, fall foliage, field, mountain and urban)
[Example image labeled: beach, urban]

59

[Boutell et. al., Pattern Recognition 2004]

Data sets

The Emotion data: music emotions (amazed, happy, relaxed, sad, etc.)
The Medical data: clinical text mapped to medical categories (ICD-9-CM codes)
The Yeast data: gene functional categories
The Enron data: email tags on topics, attachment types, and emotional tones

60

Empirical results

Similar results on other data sets (emotions, medical, etc)

61

[1] Hullermeier et al. Label ranking by learning pairwise preferences. Artif. Intell., 2008
[2] Furnkranz et al. Multi-label classification via calibrated label ranking. MLJ, 2008
[3] Read et al. Classifier chains for multi-label classification. ECML, 2009
[4] Tsoumakas et al. Random k-labelsets: an ensemble method for multilabel classification. ECML, 2007
[5] Zhang et al. Multi-label learning by exploiting label dependency. KDD, 2010

Outline

62

Part II: Encoding Output Information by Output Codes

Composite likelihood for pairwise coding

Multi-label output codes with CCA

Maximum-margin output coding

problem-dependent coding and code predictability

Part I: Encoding Input Information by Regularization

Learning with word correlation

A matrix-normal penalty for multi-task learning

Multi-task generalization

Go beyond covariance and correlation structures
Encode a dimension reduction
Learn compressible models
Projection penalties

Multi-label output coding

Design output coding for multi-label problems:
Problem-dependent encodings to exploit label dependency
Code predictability
Propose: multi-label ECOCs via CCA

63

[Diagram: encode labels y1 … yq into a code z1 … zt, learn to predict it from x, then decode; the encoding and decoding are to be designed]

Canonical correlation analysis

Given paired data (features x and labels y), CCA finds projection directions with maximum correlation:

64

Canonical correlation analysis

Given paired data (features x and labels y), CCA finds projection directions with maximum correlation:

Also known as “the most predictable criterion”:

CCA finds most predictable directions v in the label space

65
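The objective on these two slides lost its formula; the standard CCA criterion being referred to, for a feature-side direction u and a label-side direction v, is

```latex
\max_{u,\,v}\ \operatorname{corr}\!\left(u^{\top} x,\; v^{\top} y\right)
= \frac{u^{\top} C_{xy}\, v}{\sqrt{u^{\top} C_{xx}\, u}\,\sqrt{v^{\top} C_{yy}\, v}}
```

By the "most predictable criterion", each label-space direction v found this way is the direction whose projection vᵀy is best predicted (in the least-squares sense) by a linear function of x.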

Multi-label ECOCs using CCA

Encoding and learning:
Perform CCA
Code includes both original labels and label projections

66

Yi Zhang and Jeff Schneider. Multi-label Output Codes using Canonical Correlation Analysis. AISTATS 2011

[Diagram: the code consists of the original labels y1 … yq plus label projections z1 … zd; learn to predict both from x, then decode]

Multi-label ECOCs using CCA

Encoding and learning:
Perform CCA
Code includes both original labels and label projections
Learn classifiers for original labels
Learn regression for label projections

67

Yi Zhang and Jeff Schneider. Multi-label Output Codes using Canonical Correlation Analysis. AISTATS 2011


Multi-label ECOCs using CCA

Decoding:
Classifiers: Bernoulli on q original labels
Regression: Gaussian on d label projections

68

Yi Zhang and Jeff Schneider. Multi-label Output Codes using Canonical Correlation Analysis. AISTATS 2011


Multi-label ECOCs using CCA

Decoding:
Classifiers: Bernoulli on q original labels
Regression: Gaussian on d label projections

Mean-field approximation

69

Yi Zhang and Jeff Schneider. Multi-label Output Codes using Canonical Correlation Analysis. AISTATS 2011

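A minimal sketch of the CCA output-coding pipeline above: CCA label projections as extra code bits, per-label classifiers (Bernoulli) plus per-projection regressors (Gaussian), and decoding that combines both. For brevity it decodes by enumerating all label vectors, which is feasible for small q; the thesis uses a mean-field approximation instead. Data sizes, d, and the ridge terms are illustrative assumptions.

```python
# Sketch: multi-label output codes via CCA (encode, learn, decode).
import numpy as np
from itertools import product
from scipy.linalg import eigh
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression, Ridge

X, Y = make_multilabel_classification(n_samples=400, n_classes=6, n_labels=2,
                                      random_state=0)
q, d = Y.shape[1], 3

# Encoding: top-d CCA directions in the label space (generalized eigenproblem).
Xc, Yc = X - X.mean(0), Y - Y.mean(0)
Cxx = Xc.T @ Xc / len(X) + 1e-3 * np.eye(X.shape[1])
Cyy = Yc.T @ Yc / len(X) + 1e-3 * np.eye(q)
Cxy = Xc.T @ Yc / len(X)
_, V = eigh(Cxy.T @ np.linalg.solve(Cxx, Cxy), Cyy)
V = V[:, ::-1][:, :d]                       # label-space projection directions
Z = Yc @ V                                  # label projections = extra code bits

# Learning: classifiers for the q labels, regressors for the d projections.
clfs = [LogisticRegression(max_iter=1000).fit(X, Y[:, k]) for k in range(q)]
regs = [Ridge(alpha=1.0).fit(X, Z[:, j]) for j in range(d)]
sig2 = np.array([np.var(Z[:, j] - regs[j].predict(X)) + 1e-6 for j in range(d)])

# Decoding: combine Bernoulli and Gaussian scores over candidate labelings.
def decode(x):
    x = x.reshape(1, -1)
    logp = np.array([c.predict_log_proba(x)[0] for c in clfs])    # (q, 2)
    zhat = np.array([r.predict(x)[0] for r in regs])
    best, best_score = None, -np.inf
    for y in product([0, 1], repeat=q):
        y = np.array(y)
        z = (y - Y.mean(0)) @ V
        score = logp[np.arange(q), y].sum() - np.sum((zhat - z) ** 2 / (2 * sig2))
        if score > best_score:
            best, best_score = y, score
    return best

print(decode(X[0]), "true:", Y[0])
```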

Empirical results

Similar results on other criteria (macro/micro F-1 scores)
Similar results on other data (emotions)
Similar results on other base learners (decision trees, SVMs)

70

[1] Furnkranz et al. Multi-label classification via calibrated label ranking. MLJ, 2008
[2] D. Hsu et al. Multi-label prediction via compressed sensing. NIPS, 2009
[3] Zhang and Schneider. A composite likelihood view for multi-label classification. AISTATS 2012

Outline

71

Part II: Encoding Output Information by Output Codes

Composite likelihood for pairwise coding

Multi-label output codes with CCA

Maximum-margin output coding

problem-dependent coding and code predictability

Discriminative and predictable codes

Part I: Encoding Input Information by Regularization

Learning with word correlation

A matrix-normal penalty for multi-task learning

Multi-task generalization

Go beyond covariance and correlation structures
Encode a dimension reduction
Learn compressible models
Projection penalties

Recall: coding with CCA

CCA finds label projections z that are most predictable
Low “transmission errors” in channel coding

72

[Diagram: labels y1 … yq encoded into projections z1 … zd, predicted from x and decoded back to y]

A recent paper [1]: coding with PCA

Label projections z obtained by PCA: z has maximum sample variance, i.e., codewords are far away from each other. Minimum code distance?

73

[1] Tai and Lin, 2010


Goal: predictable and discriminative codes

Predictable: the prediction is close to the correct codeword
Discriminative: the prediction is far away from incorrect codewords

74


Maximum margin output coding

A max-margin formulation

75

z

xpredict

Maximum margin output coding

A max-margin formulation

Assume M is the best linear predictor (in closed form of X, Y, V)
Reformulate using metric learning
Deal with the exponentially large number of constraints: the cutting plane method with overgenerating

76

Maximum margin output coding

A max-margin formulation

Assume M is the best linear predictor, and define:

77

Maximum margin output coding

A max-margin formulation

Metric learning formulation: define the Mahalanobis metric:

and the notation:

78

Maximum margin output coding

The metric learning problem

An exponentially large number of constraints
Cutting plane method? No polynomial-time separation oracle!

79

Maximum margin output coding

The metric learning problem

An exponentially large number of constraints
Cutting plane method? No polynomial-time separation oracle!
Cutting plane method with overgenerating (relaxation): relax the discrete label space into a continuous domain, and linearize for the relaxed domain
New separation oracle: a box-constrained QP

80
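The formulation on these slides was lost in the transcript. As a schematic reconstruction only (the symbols and exact form below are assumptions based on the bullets above, not the thesis's precise objective): with codewords z(y) for candidate labelings y, a linear code predictor M, and a learned metric for decoding, "predictable and discriminative" corresponds to margin constraints of roughly the form

```latex
\big\lVert M x_i - z(y) \big\rVert_{\text{metric}}^{2}
- \big\lVert M x_i - z(y_i) \big\rVert_{\text{metric}}^{2}
\;\ge\; \Delta(y_i, y) - \xi_i
\qquad \forall\, i,\ \forall\, y \ne y_i
```

i.e., the prediction M x_i should be closer (under the learned metric) to the correct codeword than to any incorrect one by a loss-scaled margin, with slack ξ_i. There is one constraint per example and per candidate labeling, which is the exponentially large constraint set handled by overgenerating and cutting planes.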

Empirical results

Similar results on other data (emotions and medical)

81

[1] Furnkranz et al. Multi-label classification via calibrated label ranking. MLJ, 2008
[2] Zhang et al. Multi-label learning by exploiting label dependency. KDD, 2010
[3] D. Hsu et al. Multi-label prediction via compressed sensing. NIPS, 2009
[4] Tai and Lin. Multi-label Classification with Principal Label Space Transformation. Neur. Comp.
[5] Zhang and Schneider. Multi-label output codes via canonical correlation analysis. AISTATS 2011

Conclusion

Regularization to exploit input information:
Semi-supervised learning with word correlation
Multi-task learning with a matrix-normal penalty
Learning compressible models
Projection penalties for dimension reduction
Output coding to exploit output information:
Composite pairwise coding
Coding via CCA
Coding via a max-margin formulation

Future

82

Thank you! Questions?

83

Part II: Encoding Output Information by Output Codes

Composite likelihood for pairwise coding

Multi-label output codes with CCA

Maximum-margin output coding

problem-dependent coding and code predictability

Discriminative and predictable codes

Part I: Encoding Input Information by Regularization

Learning with word correlation

A matrix-normal penalty for multi-task learning

Multi-task generalization

Go beyond covariance and correlation structures
Encode a dimension reduction
Learn compressible models
Projection penalties

84

85

Local smoothness

Smoothness of model coefficients
Key property: certain orders of derivatives are sparse

86

Differentiation operator

Brain computer interaction

Classify Electroencephalography (EEG) signals
Sparse models vs. piecewise smooth models

87

Projection penalties: linear cases

Learn a linear model with a given linear reduction P

88

Projection penalties: linear cases

Learn a linear model with a given linear reduction P

89

Projection penalties: linear cases

Learn a linear model with projection penalties

90

projection distance

Projection penalties: RKHS cases

Learning in RKHS with projection penalties
Primal:
Solve for the model in the dual (see the next page); solve for v and b in the primal

91

[Diagram: feature map from X, model space M with subspace MP, models w and wP, and the reduction P]

Projection penalties: RKHS cases

Representer theorem for the model:
Dual:

92


Projection penalties: nonlinear cases

Learning linear models

Learning RKHS models

93


Empirical results

Face recognition (Yale), using SVM (poly-2)
Dimension reduction: KPCA, KDA, OLaplacian

94

[Bar charts: classification error for Orig, Red, Proj with 3, 5, and 7 training images per class]

Empirical results

Face recognition (Yale), using SVM (poly-2)
Dimension reduction: KPCA, KDA, OLaplacian

95

[Bar charts: classification error for Orig, Red, Proj with 3, 5, and 7 training images per class]

Empirical results

Face recognition (Yale), SVM (poly-2)
Dimension reduction: KPCA, KDA, OLaplacian

96

[Bar charts: classification error for Orig, Red, Proj with 3, 5, and 7 training images per class]

Empirical results

Price forecasting (Boston house), ridge regression
Dimension reduction: partial least squares

97

[Bar chart: 1-R2 for Orig, Red, Proj with 50 training samples]

Binary relevance

Binary relevance (a.k.a. 1-vs-all)
Subproblems: classify each label independently
Decision making: same
Assume no label dependency

98


Binary relevance

Binary relevance (a.k.a. 1-vs-all)
Subproblems: classify each label independently
Decision making: same
Assume no label dependency

The composite likelihood view

99


Empirical results

Emotion data (classify music into different emotions)
Evaluation measure: subset accuracy

100

[1] Furnkranz et al. Multi-label classification via calibrated label ranking. MLJ, 2008
[2] D. Hsu et al. Multi-label prediction via compressed sensing. NIPS, 2009
