
Page 1:

Electrical and Computer Engineering
LECTURE SET 10
Nonstandard Learning Approaches
Predictive Learning from Data

Page 2:

OUTLINE

• Motivation for non-standard approaches

- Learning with sparse high-dimensional data

- Formalizing application requirements

- Philosophical motivation

• New Learning Settings

- Transduction

- Universum Learning

- Learning Using Privileged Information

- Multi-Task Learning

• Summary

Page 3:

Sparse High-Dimensional Data

• Standard inductive learning: training data (x, y)
• High-dimensional, low sample size (HDLSS) data:
- gene microarray analysis
- medical imaging (e.g., sMRI, fMRI)
- object and face recognition
- text categorization and retrieval
- web search

• Sample size is smaller than the dimensionality of the input space: d ~ 10K–100K, n ~ 100s

• Standard learning methods usually fail for such HDLSS data.

Page 4:

How to improve generalization for HDLSS?

Conventional approaches use standard inductive learning + a priori knowledge:
• preprocessing and feature selection (preceding learning)
• model parameterization (~ selection of good kernels in SVM)
• informative prior distributions (in statistical methods)

Non-standard learning formulations:
• seek new generic formulations (not methods!) that better reflect application requirements
• a priori knowledge + additional data are used to derive new problem formulations

Page 5:

Formalizing Application Requirements

• Classical statistics: the parametric model is given (by experts)
• Modern applications: a complex iterative process

A non-standard (alternative) formulation may be better!

[Diagram: APPLICATION NEEDS (loss function; input, output and other variables; training/test data; admissible models) → FORMAL PROBLEM STATEMENT → LEARNING THEORY]

Page 6:

Philosophical Motivation

• Philosophical view 1 (Realism):
Learning ~ search for the truth (estimation of the true dependency from available data)

System identification ~ Inductive Learning, where a priori knowledge is about the true model

Page 7:

Philosophical Motivation (cont’d)

• Philosophical view 2 (Instrumentalism): Learning ~ search for instrumental knowledge (estimation of a useful dependency from available data)

VC-theoretical approach ~ focus on learning formulation

Page 8:

VC-theoretical approach

• Focus on the learning setting/formulation, not on learning method

• The learning formulation depends on:
(1) available data
(2) application requirements
(3) a priori knowledge (assumptions)

• Factors (1)-(3), combined using Vapnik's Keep-It-Direct (KID) Principle, yield a new learning formulation

Page 9:

Contrast these two approaches

• Conventional (statistics, data mining): a priori knowledge typically reflects properties of a true (good) model, i.e., a priori knowledge ~ properties of the true $f(\mathbf{x}, w)$

• Why should a priori knowledge be about the true model?

• VC-theoretic approach: a priori knowledge ~ how to use/incorporate available data into the problem formulation; often a priori knowledge ~ available data samples of a different type → new learning settings

Page 10:

Examples of Nonstandard Settings

• Standard inductive setting, e.g., digits 5 vs. 8:
finite training set $(\mathbf{x}_i, y_i)$ → predictive model derived using only the training data → prediction for all possible test inputs

• Possible modifications:
- Transduction: predict only for the given test points
- Universum Learning: available labeled data ~ examples of digits 5 and 8; unlabeled examples ~ other digits
- Learning Using Privileged Information (LUPI): training data provided by t different persons; the group label is known only for training data, but not available for test data
- Multi-Task Learning: training data ~ t groups (from different persons); test data ~ t groups (group label is known)

Page 11:

SVM-style Framework for New Settings

• Conceptual reasons: additional info/data → new type of SRM structure

• Technical reasons: new knowledge encoded as additional constraints on complexity (in an SVM-like setting)

• Practical reasons:
- new settings directly comparable to standard SVM
- standard SVM is a special case of a new setting
- optimization s/w for new settings may require only minor modification of existing standard SVM s/w

Page 12:

OUTLINE

• Motivation for non-standard approaches

• New Learning Settings

- Transduction

- Universum Learning

- Learning Using Privileged Information

- Multi-Task Learning

Note: all settings assume binary classification

• Summary

Page 13:

Transduction (Vapnik, 1982, 1995)

• How to incorporate unlabeled test data into the learning process? Assume binary classification.

• Estimating a function at given points
Given: labeled training data $(\mathbf{x}_i, y_i),\ i = 1,\dots,n$ and unlabeled test points $\mathbf{x}^*_j,\ j = 1,\dots,m$
Estimate: class labels $(y^*_1,\dots,y^*_m)$ at these test points

Goal of learning: minimization of risk on the test set:

$$R(y^*_1,\dots,y^*_m) = \frac{1}{m}\sum_{j=1}^{m}\int L(y^*_j,\, y)\, dP(y/\mathbf{x}^*_j)$$

where $\mathbf{y}^* = (f(\mathbf{x}^*_1),\dots,f(\mathbf{x}^*_m))$

Page 14:

Transduction vs Induction

[Diagram: induction goes from training data (plus a priori knowledge/assumptions) to an estimated function; deduction goes from the estimated function to predicted outputs; transduction goes from training data directly to predicted outputs]

Page 15:

Transduction based on size of margin

• Binary classification, linear parameterization, joint set of (training + working) samples
Note: working sample = unlabeled test point

• Simplest case: single unlabeled (working) sample

• Goal of learning:
(1) explain well the available data (~ joint set)
(2) achieve max falsifiability (~ large margin)

~ Classify test (working + training) samples by the largest possible margin hyperplane (see below)

Page 16:

Margin-based Local Learning

• Special case of transduction: single working point

• How to handle many unlabeled samples?

Page 17:

Transduction based on size of margin

• Transductive SVM learning has two objectives:
(TL1) separate labeled data using a large-margin hyperplane ~ as in standard SVM
(TL2) separate working data using a large-margin hyperplane

Page 18:

Loss function for unlabeled samples

• Non-convex loss function:

• Transductive SVM constructs a large-margin hyperplane for labeled samples AND forces this hyperplane to stay away from unlabeled samples
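For concreteness, a minimal sketch (my notation, not the lecture's) of the symmetric hinge loss commonly used for unlabeled samples in T-SVM: it is zero outside the margin and grows as a sample moves inside it, which is exactly what makes the problem non-convex.

```python
import numpy as np

def unlabeled_loss(f_x):
    # f_x is the SVM output (w . x) + b; penalize samples inside the
    # margin (|f(x)| < 1), so the hyperplane stays away from them
    return np.maximum(0.0, 1.0 - np.abs(f_x))
```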

Page 19:

Optimization formulation for SVM transduction

• Given: joint set of (training + working) samples
• Denote slack variables $\xi_i$ for training, $\xi^*_j$ for working samples
• Minimize

$$R(\mathbf{w}, b) = \frac{1}{2}(\mathbf{w} \cdot \mathbf{w}) + C\sum_{i=1}^{n}\xi_i + C^{*}\sum_{j=1}^{m}\xi^{*}_j$$

subject to

$$y_i[(\mathbf{w} \cdot \mathbf{x}_i) + b] \ge 1 - \xi_i, \qquad y^{*}_j[(\mathbf{w} \cdot \mathbf{x}^{*}_j) + b] \ge 1 - \xi^{*}_j,$$
$$\xi_i \ge 0,\ \xi^{*}_j \ge 0, \qquad i = 1,\dots,n,\ j = 1,\dots,m$$

where $y^{*}_j = \mathrm{sign}[(\mathbf{w} \cdot \mathbf{x}^{*}_j) + b],\ j = 1,\dots,m$

Solution (~ decision boundary): $f(\mathbf{x}) = (\mathbf{w}^{*} \cdot \mathbf{x}) + b^{*}$

• Unbalanced situation (small training set / large test set) → all unlabeled samples get assigned to one class
• Additional balance constraint:

$$\frac{1}{n}\sum_{i=1}^{n} y_i = \frac{1}{m}\sum_{j=1}^{m}\big[(\mathbf{w} \cdot \mathbf{x}^{*}_j) + b\big]$$
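Since this optimization is non-convex, practical T-SVM solvers rely on heuristics. Below is a minimal sketch in the spirit of gradual-annealing heuristics (not the exact algorithm referenced in the lecture); function and variable names are mine, and the balance constraint is omitted for brevity, although in practice it is what prevents the degenerate one-class assignment mentioned above.

```python
# Sketch of a T-SVM heuristic: start from the inductive SVM solution,
# then alternately (re)label the working set and retrain, while slowly
# increasing the weight C* of the working samples.
import numpy as np
from sklearn.svm import SVC

def tsvm_sketch(X_lab, y_lab, X_unl, C=1.0, C_star=0.5, n_steps=5):
    clf = SVC(kernel="linear", C=C).fit(X_lab, y_lab)   # (TL1) inductive start
    y_unl = clf.predict(X_unl)                          # tentative test labels
    for c_star in np.linspace(C_star / n_steps, C_star, n_steps):
        X = np.vstack([X_lab, X_unl])
        y = np.concatenate([y_lab, y_unl])
        w = np.concatenate([np.full(len(y_lab), C),     # per-sample weights
                            np.full(len(y_unl), c_star)])
        clf = SVC(kernel="linear", C=1.0).fit(X, y, sample_weight=w)
        y_unl = clf.predict(X_unl)                      # (TL2) relabel working set
    return clf, y_unl
```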

Page 20:

Optimization formulation (cont'd)

• Hyperparameters C and C* control the trade-off between explanation and falsifiability
• The soft-margin inductive SVM is a special case of soft-margin transduction with zero slacks for working samples ($\xi^{*}_j = 0$)
• A dual + kernel version of SVM transduction exists
• Transductive SVM optimization is not convex (~ non-convexity of the loss for unlabeled data), so different optimization heuristics yield different solutions
• An exact solution (via exhaustive search over test labels) is possible for a small number of test samples (m), but this solution is NOT very useful (~ inductive SVM)

Page 21:

Many applications for transduction

• Text categorization: classify word documents into a number of predetermined categories
• Email classification: spam vs. non-spam
• Web page classification
• Image database classification
• All these applications share:
- high-dimensional data
- small labeled training set (human-labeled)
- large unlabeled test set

Page 22:

Example application

• Prediction of molecular bioactivity for drug discovery
• Training data ~ 1,909 samples; test data ~ 634 samples
• Input space ~ 139,351-dimensional
• Prediction accuracy: SVM induction ~ 74.5%; transduction ~ 82.3%

Ref: J. Weston et al., KDD Cup 2001 data analysis: prediction of molecular bioactivity for drug design – binding to thrombin, Bioinformatics, 2003

Page 23:

Semi-Supervised Learning (SSL)

• SSL assumes availability of labeled + unlabeled data (similar to transduction)
• SSL has the goal of estimating an inductive model for predicting new (test) samples, which is different from transduction
• In machine learning, SSL and transduction are often used interchangeably, i.e., transduction can be used to estimate an SSL model
• SSL methods usually combine supervised and unsupervised learning methods into one algorithm

Page 24:

SSL and the Cluster Assumption

• Cluster Assumption: real-life application data often has clusters, due to (unknown) correlations between input variables. Discovering these clusters using unlabeled data helps supervised learning.

• Example: document classification and information retrieval
- individual words ~ input features (for classification)
- uneven co-occurrence of words (~ input features) implies clustering of documents in the input space
- unlabeled documents can be used to identify this cluster structure, so that just a few labeled examples are sufficient for constructing a good decision rule

Page 25:

Toy Example for Text Classification

• Data set: 5 documents that need to be classified into 2 topics, 'Economy' and 'Entertainment'
• Each document is defined by 6 features (words): Budget deficit, Music, Seafood, Sex, Interest Rates, Movies
- two labeled documents (L1, L2)
- need to classify three unlabeled documents (U3, U4, U5)

• Apply clustering → 2 clusters: (x1, x3) and (x2, x4, x5)

[Table: document-word incidence matrix for L1, L2 (labeled) and U3-U5 (unlabeled) over the six word features; L1 has one feature word present, L2 three, U3 two, U4 one, and U5 two]
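The cluster assumption can be demonstrated in a few lines, as sketched below. The 0/1 feature placements are illustrative stand-ins consistent with the two clusters named above, not the slide's exact table, and all names are mine.

```python
# Cluster-then-label: cluster all documents, then give each cluster the
# topic of the labeled document that falls inside it.
import numpy as np
from sklearn.cluster import KMeans

# rows: L1, L2, U3, U4, U5; columns: 6 word features (illustrative values)
X = np.array([[1, 0, 0, 0, 1, 0],
              [0, 1, 0, 1, 0, 1],
              [1, 0, 0, 0, 1, 0],
              [0, 1, 0, 0, 0, 1],
              [0, 0, 0, 1, 0, 1]])
labeled = {0: "Economy", 1: "Entertainment"}   # indices of L1 and L2

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
cluster_topic = {km.labels_[i]: topic for i, topic in labeled.items()}
print([cluster_topic[c] for c in km.labels_])  # topics for all five documents
```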

Page 26:

Self-Learning Method (example of SSL)

Given: initial labeled data set L and unlabeled set U
• Repeat:
(1) estimate a classifier using L
(2) classify a randomly chosen unlabeled sample using the decision rule estimated in Step (1)
(3) move this newly labeled sample to L

Iterate steps (1)-(3) until all unlabeled samples are classified (see the sketch below)
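A minimal sketch of this loop, using the 1-nearest-neighbor classifier from the illustration that follows; function and variable names are mine.

```python
# Self-learning (self-training): grow the labeled set one random
# unlabeled sample at a time, retraining the classifier on each pass.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def self_learning(X_lab, y_lab, X_unl, rng=np.random.default_rng(0)):
    X_lab, y_lab = X_lab.copy(), y_lab.copy()
    X_unl = list(X_unl)
    while X_unl:                                       # until U is empty
        clf = KNeighborsClassifier(n_neighbors=1).fit(X_lab, y_lab)   # (1)
        x = X_unl.pop(rng.integers(len(X_unl)))        # (2) random sample
        y = clf.predict(np.asarray(x).reshape(1, -1))  #     label it
        X_lab = np.vstack([X_lab, x])                  # (3) move it to L
        y_lab = np.append(y_lab, y)
    return KNeighborsClassifier(n_neighbors=1).fit(X_lab, y_lab)
```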

Page 27:

Illustration (using the 1-nearest-neighbor classifier)

Hyperbolas data:
- 10 labeled and 100 unlabeled samples (green)

[Figure: Iteration 1, scatter plot of unlabeled samples and the labeled samples of Class +1 / Class -1]

Page 28:

Illustration (after 50 iterations)

[Figure: Iteration 50, many previously unlabeled samples now assigned to Class +1 / Class -1]

Page 29:

Illustration (after 100 iterations)

All samples are labeled now:

[Figure: Iteration 100, all samples labeled as Class +1 / Class -1]

Page 30:

Comparison: SSL vs T-SVM

• Comparison 1, low-dimensional data:
- Hyperbolas data set (10 labeled, 100 unlabeled)
- 10 random realizations of training data

• Comparison 2, high-dimensional data:
- digits 5 vs. 8 (100 labeled, 1,866 unlabeled)
- 10 random realizations of training/validation data
Note: validation data set used for SVM model selection

• Methods used:
- self-learning algorithm (using 1-NN classification)
- nonlinear T-SVM (needs parameter tuning)

Page 31:

Comparison 1: SSL vs T-SVM and SVM

Methods used:
- self-learning algorithm (using 1-NN classification)
- nonlinear T-SVM (polynomial kernel, d = 3)

• The self-learning method is better than SVM or T-SVM. Why?

HYPERBOLAS DATA SET
Setting: σ = 0.025; no. of training and validation samples = 10 (5 per class); no. of test samples = 100 (50 per class).

                      Polynomial SVM (d=3) | T-SVM (d=3)  | Self-Learning
Prediction error (%)  11.8 (2.78)          | 10.9 (2.60)  | 4.0 (3.53)
Typical parameters    C = 700-7000         | C*/C = 10^-6 | N/A

Page 32:

Comparison 2: SSL vs T-SVM and SVM

Methods used:
- self-learning algorithm (using 1-NN classification)
- nonlinear T-SVM (RBF kernel)

• SVM or T-SVM is better than the self-learning method. Why?

MNIST DATA SET
Setting: no. of training and validation samples = 100; no. of test samples = 1,866.

                    RBF SVM     | RBF T-SVM     | Self-Learning
Test error (%)      5.79 (1.80) | 4.33 (1.34)   | 7.68 (1.26)
Typical parameters  C ≈ 0.3-7, gamma ≈ 5^-3 to 5^-2 | C*/C ≈ 0.01-1 | N/A

Page 33:

Explanation of T-SVM for the digits data set

Histogram of projections of labeled + unlabeled data:
- for standard SVM (RBF kernel) ~ test error 5.94%
- for T-SVM (RBF kernel) ~ test error 2.84%

[Figure: histogram of projections for RBF SVM with optimally tuned parameters]

Page 34:

Explanation of T-SVM (cont'd)

Histogram for T-SVM (with optimally tuned parameters)
Note: (1) test samples are pushed outside the margin borders
(2) more labeled samples project away from the margin

[Figure: histogram of projections for T-SVM]

Page 35:

Universum Learning (Vapnik 1998, 2006)

• Motivation: what is a priori knowledge?
- info about the space of admissible models
- info about admissible data samples

• Labeled training samples (as in the inductive setting) + unlabeled samples from the Universum

• Universum samples encode info about the region of input space where the application data lives:
- from a different distribution than the training/test data
- U-samples ~ neither positive nor negative class

• Examples of the Universum data
• Expect large improvement for small sample size n

Page 36:

Cultural Interpretation of the Universum

[Image: morphed face, neither Hillary nor Obama, but looks like both]

• Absurd examples, jokes, some art forms

Page 37:

Cultural Interpretation of the Universum

Marc Chagall:

FACES

Page 38:

Cultural Interpretation of the Universum

Marcel Duchamp (1919): Mona Lisa with Mustache

• Some art forms: surrealism, dadaism

Page 39:

More on Marcel Duchamp

Rrose Sélavy (Marcel Duchamp), 1921, Photo by Man Ray.

Page 40:

Main Idea of Universum Learning

• Handwritten digit recognition: digit 5 vs. 8

[Figure courtesy of J. Weston (NEC Labs)]

Page 41:

Learning with the Universum

• Inductive setting for binary classification
Given: labeled training data $(\mathbf{x}_i, y_i),\ i = 1,\dots,n$ and unlabeled Universum samples $\mathbf{x}^*_j,\ j = 1,\dots,m$
Goal of learning: minimization of prediction risk (as in the standard inductive setting)

• Two goals of Universum learning:
(UL1) separate/explain labeled training data using a large-margin hyperplane (as in standard SVM)
(UL2) maximize the number of contradictions on the Universum, i.e., Universum samples inside the margin

Goal (UL2) is achieved by using a special loss function for Universum samples

Page 42:

Inference through contradictions

Page 43:

ε-insensitive loss for Universum samples

[Figure: ε-insensitive loss plotted against the SVM output for a Universum sample; the loss is zero within ±ε of the decision boundary]
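For concreteness, a minimal sketch (my notation) of the ε-insensitive Universum loss pictured here: zero inside a tube of width ε around the decision boundary, linear outside it.

```python
import numpy as np

def universum_loss(f_x, eps=0.1):
    # f_x is the SVM output (w . x) + b for a Universum sample; samples
    # near the decision boundary (|f(x)| <= eps) incur no loss
    return np.maximum(0.0, np.abs(f_x) - eps)
```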

Page 44:

Universum SVM Formulation (U-SVM)

• Given: labeled training samples + unlabeled Universum samples
• Denote slack variables $\xi_i$ for training, $\xi^{*}_j$ for Universum samples
• Minimize

$$R(\mathbf{w}, b) = \frac{1}{2}(\mathbf{w} \cdot \mathbf{w}) + C\sum_{i=1}^{n}\xi_i + C^{*}\sum_{j=1}^{m}\xi^{*}_j$$

subject to

$$y_i[(\mathbf{w} \cdot \mathbf{x}_i) + b] \ge 1 - \xi_i, \quad \xi_i \ge 0,\ i = 1,\dots,n \qquad \text{(labeled data)}$$
$$|(\mathbf{w} \cdot \mathbf{x}^{*}_j) + b| \le \varepsilon + \xi^{*}_j, \quad \xi^{*}_j \ge 0,\ j = 1,\dots,m \qquad \text{(Universum)}$$

where the Universum samples use the ε-insensitive loss

• Convex optimization
• Hyperparameters $C, C^{*} \ge 0$ control the trade-off between minimization of errors and maximization of the number of contradictions
• When $C^{*} = 0$, this reduces to the standard soft-margin SVM
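Because the U-SVM problem above is convex, it can be prototyped directly with a generic solver. A minimal cvxpy sketch of the linear case, assuming the slide's notation for C, C* and ε; the function and array names are mine.

```python
# Linear U-SVM: standard hinge loss on labeled data plus an
# eps-insensitive penalty that pulls Universum samples into the margin.
import cvxpy as cp
import numpy as np

def usvm_fit(X, y, X_univ, C=1.0, C_star=0.1, eps=0.1):
    n, d = X.shape
    m = X_univ.shape[0]
    w, b = cp.Variable(d), cp.Variable()
    xi = cp.Variable(n, nonneg=True)        # slacks for labeled samples
    xi_u = cp.Variable(m, nonneg=True)      # slacks for Universum samples
    obj = 0.5 * cp.sum_squares(w) + C * cp.sum(xi) + C_star * cp.sum(xi_u)
    cons = [cp.multiply(y, X @ w + b) >= 1 - xi,     # margin constraints
            cp.abs(X_univ @ w + b) <= eps + xi_u]    # eps-insensitive tube
    cp.Problem(cp.Minimize(obj), cons).solve()
    return w.value, b.value
```

Setting C_star = 0 recovers the standard soft-margin SVM, which makes the two directly comparable on the same data.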

Page 45:

Application Study (Vapnik, 2006)

• Binary classification of handwritten digits 5 and 8
• For this binary classification problem, the following Universum sets were used:

U1: randomly selected other digits (0, 1, 2, 3, 4, 6, 7, 9)
U2: randomly mixing pixels from images of 5 and 8
U3: average of randomly selected examples of 5 and 8

• Training set sizes tried: 250, 500, ..., 3,000 samples
• Universum set size: 5,000 samples

• Prediction error improved over standard SVM, e.g., for 500 training samples: 1.4% vs. 2% (SVM)

Page 46:

Universum U3 via random averaging (RA)

[Figure: a Universum sample formed as the average of an example from Class 1 and an example from Class -1, shown relative to the separating hyperplane]

Page 47:

Random Averaging for digits 5 and 8

• Two randomly selected examples (one of digit 5, one of digit 8)

• Universum sample: their average (see the sketch below)
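A minimal sketch of random averaging; the function and array names are mine.

```python
# Random-averaging (RA) Universum: each Universum sample is the
# pixel-wise average of one random example from each class.
import numpy as np

def random_averaging(X_pos, X_neg, n_univ, rng=np.random.default_rng(0)):
    i = rng.integers(len(X_pos), size=n_univ)   # random picks from class +1
    j = rng.integers(len(X_neg), size=n_univ)   # random picks from class -1
    return 0.5 * (X_pos[i] + X_neg[j])
```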

Page 48:

Application Study: gender of human faces

• Binary classification setting
• Difficult problem:
- dimensionality ~ large (10K-20K)
- labeled sample size ~ small (~10-20)

• Humans perform very well on this task
• Issues:
- possible improvement (vs. standard SVM)
- how to choose the Universum?
- model parameter tuning

Page 49:

Male Faces: examples

Page 50:

Female Faces: examples

Page 51:

Universum Faces: neither male nor female

Page 52:

Empirical Study (Bai and Cherkassky, 2008)

• Gender classification of human faces (male/female)
• Data: 260 pictures of 52 individuals (20 females and 32 males, 5 pictures per individual) from Univ. of Essex
• Data representation and pre-processing: image size 46x59, converted into gray-scale, followed by standard image processing (histogram equalization)
• Training data: 5 female and 8 male photos
• Test data: remaining 39 photos (of other people)
• Experimental procedure: randomly select 13 training samples (and 39 test samples). Estimate and compare the inductive SVM classifier with an SVM classifier using N Universum samples (N = 100, 500, 1000).
- Report results for 4 partitions (of training and test data)

Page 53:

Empirical Study (cont'd)

• Universum generation:
U1 Random Average: average of male and female samples randomly selected from the training set
U2 Empirical Distribution: estimate the pixel-wise distribution of the training data (both male + female) and generate a new picture from this distribution
U3 Animal faces

Page 54:

Universum generation: examples

• U1 Averaging: [example images]

• U2 Empirical Distribution: [example images]

Page 55:

Results of gender classification

• Classification accuracy improves vs. standard SVM by ~2% with the U1 Universum, and by ~1% with U2

• The RA Universum works best when the number of Universum samples N = 500 or 1,000 (shown above for 4 partitionings of the data)

• The animal-face Universum degrades performance by 2-5%

Page 56:

Random Averaging Universum

• How to come up with a good Universum?
- usually application-specific

But:
• The RA Universum is generated from the training data
• Under what conditions is the RA Universum effective, i.e., better than standard SVM?
• Solution approach: analyze the histogram of projections of training & Universum samples onto the normal direction w of the optimally trained SVM model (for high-dimensional data); see the sketch below
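A minimal sketch of this diagnostic for a linear SVM, using scikit-learn's decision_function as the projection (w · x) + b, which is scaled so that the margin borders sit at ±1; plotting details and names are mine.

```python
# Compare histograms of SVM projections for training vs Universum data.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC

def projection_histograms(X_train, y_train, X_univ, C=1.0):
    clf = SVC(kernel="linear", C=C).fit(X_train, y_train)
    proj_train = clf.decision_function(X_train)   # (w . x) + b per sample
    proj_univ = clf.decision_function(X_univ)
    plt.hist(proj_train, bins=30, alpha=0.5, label="training")
    plt.hist(proj_univ, bins=30, alpha=0.5, label="Universum")
    for v in (-1, 0, 1):                          # margin borders and boundary
        plt.axvline(v, color="k", linestyle=":")
    plt.xlabel("projection onto w"); plt.legend(); plt.show()
```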

Page 57:

Histogram of projections and RA Universum

• Typical histogram (for high-dimensional data), Case 1:

[Figure: histogram of projections, Case 1; counts shown on a log scale]

Page 58:

Histogram of projections and RA Universum

• Typical histogram (for high-dimensional data), Case 2:

[Figure: histogram of projections, Case 2]

Page 59:

Histogram of projections and RA Universum

• Typical histogram (for high-dimensional data), Case 3:

[Figure: histogram of projections, Case 3]

Page 60:

Example: Handwritten digits 5 vs. 8

• MNIST data set: each digit ~ 28x28 = 784 pixels
• Training / validation sets ~ 1,000 samples each; test set ~ 1,866 samples
• For U-SVM: ~1,000 Universum samples (via RA)
• Comparison of test error rates for SVM and U-SVM:

                       SVM           | U-SVM
MNIST (RBF kernel)     1.37% (0.22%) | 1.20% (0.19%)
MNIST (linear kernel)  4.58% (0.34%) | 4.62% (0.37%)

Page 61:

Histogram of projections for MNIST data

[Figures: distribution of the SVM output for the MNIST data, for RBF-kernel SVM (left) and for linear SVM (right)]

Page 62:

Analyzing 'other digits' Universum

The same set-up, but using digit '1' or digit '3' Universa

Test error: RBF SVM ~ 1.47%; Digit-1 Universum ~ 1.31%; Digit-3 Universum ~ 1.01%

[Figures: histograms of projections for the Digit-1 Universum (left) and the Digit-3 Universum (right)]

Page 63:

Summary & Discussion

• Technical aspects: why does Universum data improve generalization when d >> n?
- U-SVM is solved in a low-dimensional subspace implicitly defined by the Universum data

• How to specify a good Universum?
- no formal procedure exists
- but conditions for the effectiveness of a particular Universum data set are known; see http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5936738

• Psychological/cultural aspects:
- relation to human learning (cultural Universum)?
- a new type of inference? Impact on psychology?

Page 64:

Learning Using Privileged Info (LUPI)

Given: training data (x, x*, y), where x* is privileged info available only for the training data

Estimate: predictive model y = f(x)

Note: x and x* live in two different Hilbert spaces

Additional info x* is common in many applications:
• Handwritten digit recognition: x* ~ person's label
• Brain imaging (fMRI): x* ~ human subject label (in a group of subjects performing the same cognitive task)
• Medical diagnosis: x* ~ medical history after the initial diagnosis
• Time series prediction: x* ~ future values of the training data

Page 65:

Learning Using Privileged Info (LUPI)

Given: training data (x, x*, y), where x* is privileged info available only for the training data

Estimate: y = f(x) ~ same as in standard induction

Main idea: additional info about the training samples:
standard learning uses x; LUPI uses x + x*

Page 66:

Many Application Settings for LUPI

• Medical diagnosis: privileged info ~ expensive test results, patient history after diagnosis, etc.
• Time series prediction: privileged info ~ future values of the training time series
• Additional group information (~ structured data)
• Feature selection, etc.

SVM+ is a formal mechanism for LUPI training

Page 67:

LUPI and SVM+ (Vapnik, 2006)

SVM+ is a new learning technology for LUPI:

• Improved prediction accuracy (proven both theoretically and empirically)
• Similar to learning with a human teacher: 'privileged info' ~ feedback from an experienced teacher (can be very beneficial)
• SVM+ is a promising new technology for 'difficult' problems
• Computationally difficult: no public-domain software is readily available

Page 68:

LUPI Formalization and Challenges

• LUPI training data in the form $(\mathbf{x}_i, \mathbf{x}^*_i, y_i),\ i = 1, 2, \dots, n$

• Privileged info $(\mathbf{x}^*_i)$:
- assumed to have informative value (for prediction)
- not available to the testing/predictive model $\hat{y} = f(\mathbf{x})$

• Interpretation of the privileged info:
- additional feedback from an expert teacher, different from 'correct answers' or y-labels
- this feedback is provided in a different space x*

• Challenge: how to handle/combine learning in two different feature spaces?

Page 69:

SVM+ Technology for LUPI

• LUPI training data in the form $(\mathbf{x}_i, \mathbf{x}^*_i, y_i),\ i = 1, 2, \dots, n$

• SVM+ learning takes place in two spaces:
- the decision space Z, where the decision rule $\hat{y} = f(\mathbf{z})$ is estimated
- the correction space Z*, where the privileged info is encoded in the form of additional constraints on errors (slack variables) in the decision space

• The SVM+ approach implements two (kernel) mappings:
- $\mathbf{z}(\mathbf{x}): \mathbf{x} \to Z$ into the decision space
- $\mathbf{z}^*(\mathbf{x}^*): \mathbf{x}^* \to Z^*$ into the correction space

• Privileged info $(\mathbf{x}^*_i)$:
- assumed to have informative value (for prediction)
- not available to the predictive model

• Interpretation of privileged info:
- additional feedback from an expert teacher, different from 'correct answers' or y-labels
- human interaction in a different feature space x*

• Challenge: how to handle/combine learning in two different feature spaces?

Page 70:

Two Objectives of SVM+ Learning

(SVM+ Goal 1) Separate labeled training data in the decision space Z using a large-margin hyperplane (as in standard SVM)

(SVM+ Goal 2) Incorporate the privileged information in the form of additional constraints on the slack variables in the decision space. These constraints are modeled as linear functions in the correction space Z*

Page 71:

SVM+ or Learning With Structured Data (hidden info = group label)

[Diagram: samples of Class 1 / Class -1 belonging to Group 1 and Group 2 are mapped into a common decision space, where a single decision function is estimated, and into a correcting space, where a separate correcting function models the slack variables $\xi^r$ for each group r]

Page 72:

SVM+ Formulation

Decision space: $\mathbf{x}_i \to \mathbf{z}_i = \mathbf{z}(\mathbf{x}_i) \in Z$; correcting space: $\mathbf{x}_i \to \mathbf{z}^r_i \in Z_r$

$$\min_{\mathbf{w}, b,\, \mathbf{w}_1,\dots,\mathbf{w}_t,\, d_1,\dots,d_t} \; \frac{1}{2}(\mathbf{w} \cdot \mathbf{w}) + \frac{\gamma}{2}\sum_{r=1}^{t}(\mathbf{w}_r \cdot \mathbf{w}_r) + C\sum_{r=1}^{t}\sum_{i \in T_r}\xi^r_i$$

subject to:

$$y_i[(\mathbf{w} \cdot \mathbf{z}_i) + b] \ge 1 - \xi^r_i, \quad i \in T_r,\ r = 1,\dots,t$$
$$\xi^r_i \ge 0, \quad i \in T_r,\ r = 1,\dots,t$$
$$\xi^r_i = (\mathbf{w}_r \cdot \mathbf{z}^r_i) + d_r, \quad i \in T_r,\ r = 1,\dots,t \quad \text{(correcting functions)}$$

Decision function: $y = \mathrm{sign}[f_Z(\mathbf{x})] = \mathrm{sign}[(\mathbf{w} \cdot \mathbf{z}(\mathbf{x})) + b]$
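A minimal cvxpy sketch of the linear SVM+ formulation above, with one correcting function per group; gamma and C follow the slide's notation, while the function and variable names are mine and the correcting space is simply taken to be the privileged-feature space.

```python
# Linear SVM+: slacks are not free variables but linear functions of the
# privileged features X_star, fitted separately for each group r.
import cvxpy as cp
import numpy as np

def svm_plus_fit(X, X_star, y, groups, C=1.0, gamma=1.0):
    n, d = X.shape
    w, b = cp.Variable(d), cp.Variable()
    obj = 0.5 * cp.sum_squares(w)
    cons = []
    for r in np.unique(groups):
        idx = np.where(groups == r)[0]
        w_r = cp.Variable(X_star.shape[1])
        d_r = cp.Variable()
        xi_r = X_star[idx] @ w_r + d_r               # correcting function
        obj += 0.5 * gamma * cp.sum_squares(w_r) + C * cp.sum(xi_r)
        cons += [cp.multiply(y[idx], X[idx] @ w + b) >= 1 - xi_r,
                 xi_r >= 0]
    cp.Problem(cp.Minimize(obj), cons).solve()
    return w.value, b.value
```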

Page 73:

SVM+ Formulation (cont'd)

• Map the input vectors simultaneously into:
- the decision space (standard SVM classifier)
- the correcting space (where correcting functions model slack variables for different groups)

• The decision space/function ~ the same for all groups
• The correcting functions ~ different for each group (but the correcting space may be the same)

• The SVM+ optimization formulation incorporates:
- the capacity $(\mathbf{w} \cdot \mathbf{w})$ of the decision function
- the capacity $(\mathbf{w}_r \cdot \mathbf{w}_r)$ of the correcting function for group r
- the relative importance (weight γ) of these two capacities

Page 74:

Multi-Task Learning (MTL)

• Application: handwritten digit recognition
Labeled training data provided by t persons (t > 1)
Goal 1: estimate a single classifier that generalizes well for future samples generated by these persons ~ LUPI
Goal 2: estimate t related classifiers (one per person) ~ MTL

• Application: medical diagnosis
Labeled training data provided by t groups of patients (t > 1), say men and women (t = 2)
Goal 1: estimate a classifier to predict/diagnose a disease using training data from t groups of patients ~ LUPI
Goal 2: estimate t classifiers specialized for each group of patients ~ MTL

Note: for MTL, the group info is known for both training and test data

Page 75:

Multi-task Learning

[Diagram: (a) Single-task learning: each task (Task 1, Task 2, ..., Task t) maps its own training data to its own predictive model, independently. (b) Multi-task learning (MTL): the training data of all tasks are combined through task-relatedness modeling, yielding one predictive model per task.]

Page 76:

Contrast Inductive Learning, MTL and LWSD

[Diagram: given training data for Group 1, ..., Group t:
- Inductive learning: group info ignored → a single model f; test data carries no group info
- Multi-task learning: group info used → models f1, ..., ft; test data comes per group (group info known)
- Learning with structured data: group info used during training → a single model f; test data carries no group info]

Page 77:

Different Ways of Using Group Information

[Diagram:
- sSVM: all data pooled into a single standard SVM → f(x)
- SVM+: SVM+ (group info used as structure during training) → f(x)
- mSVM: a separate SVM per group → f1(x), f2(x)
- svm+MTL: SVM+MTL → related models f1(x), f2(x)]

Page 78:

Problem setting for MTL

• Assume a binary classification problem
• Training data ~ a union of t related groups (tasks). Each group r has $n_r$ i.i.d. samples $(\mathbf{x}^r_i, y^r_i),\ i = 1, 2, \dots, n_r$, generated from a joint distribution $P_r(\mathbf{x}, y)$
• Test data: the task label for test samples is known

• Goal of learning: estimate t models $\{f_1, f_2, \dots, f_t\}$ such that the sum of expected losses over all tasks is minimized:

$$R(w) = \sum_{r=1}^{t}\int L\big(y, f_r(\mathbf{x}, w)\big)\, dP_r(\mathbf{x}, y)$$

Page 79:

SVM+ for Multi-Task Learning

• New learning formulation: SVM+MTL
• Define the decision function for each group as

$$f_r(\mathbf{x}) = (\mathbf{w} \cdot \mathbf{z}(\mathbf{x})) + b + (\mathbf{w}_r \cdot \mathbf{z}_r(\mathbf{x})) + d_r, \quad r = 1,\dots,t$$

• The common decision function $(\mathbf{w} \cdot \mathbf{z}(\mathbf{x})) + b$ models the relatedness among groups
• The correcting functions fine-tune the model for each group (task)

Page 80:

SVM+MTL Formulation

Decision space: $\mathbf{x}_i \to \mathbf{z}_i = \mathbf{z}(\mathbf{x}_i) \in Z$; correcting space: $\mathbf{x}_i \to \mathbf{z}^r_i = \mathbf{z}_r(\mathbf{x}_i) \in Z_r$

$$\min_{\mathbf{w}, b,\, \mathbf{w}_1,\dots,\mathbf{w}_t,\, d_1,\dots,d_t} \; \frac{1}{2}(\mathbf{w} \cdot \mathbf{w}) + \frac{\gamma}{2}\sum_{r=1}^{t}(\mathbf{w}_r \cdot \mathbf{w}_r) + C\sum_{r=1}^{t}\sum_{i \in T_r}\xi^r_i$$

subject to:

$$y_i\big[(\mathbf{w} \cdot \mathbf{z}_i) + b + (\mathbf{w}_r \cdot \mathbf{z}^r_i) + d_r\big] \ge 1 - \xi^r_i, \quad i \in T_r,\ r = 1,\dots,t$$
$$\xi^r_i \ge 0, \quad i \in T_r,\ r = 1,\dots,t$$

Decision functions: $f_r(\mathbf{x}) = \mathrm{sign}\big[(\mathbf{w} \cdot \mathbf{z}(\mathbf{x})) + b + (\mathbf{w}_r \cdot \mathbf{z}_r(\mathbf{x})) + d_r\big],\ r = 1,\dots,t$
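For comparison with the SVM+ sketch given earlier, here is a minimal cvxpy sketch of the linear SVM+MTL case, where the per-group correcting term now enters the decision function itself; names and defaults are mine.

```python
# Linear SVM+MTL: a shared (w, b) models task relatedness, while each
# group gets its own correcting term (w_r, d_r) added to the decision.
import cvxpy as cp
import numpy as np

def svm_mtl_fit(X, y, groups, C=1.0, gamma=1.0):
    n, d = X.shape
    w, b = cp.Variable(d), cp.Variable()
    obj = 0.5 * cp.sum_squares(w)
    cons, corr = [], {}
    for r in np.unique(groups):
        idx = np.where(groups == r)[0]
        w_r, d_r = cp.Variable(d), cp.Variable()
        xi_r = cp.Variable(len(idx), nonneg=True)
        score = X[idx] @ (w + w_r) + b + d_r         # common + correcting parts
        obj += 0.5 * gamma * cp.sum_squares(w_r) + C * cp.sum(xi_r)
        cons += [cp.multiply(y[idx], score) >= 1 - xi_r]
        corr[r] = (w_r, d_r)
    cp.Problem(cp.Minimize(obj), cons).solve()
    return w.value, b.value, {r: (u.value, v.value) for r, (u, v) in corr.items()}
```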

Page 81:

Empirical Validation

• Different ways of using group info → different learning settings:
- which one yields better generalization?
- how is performance affected by sample size?

• Empirical comparisons:
- for a synthetic data set

Page 82:

Different Ways of Using Group Information

[Diagram (repeated from Page 77):
- sSVM: all data pooled into a single standard SVM → f(x)
- SVM+: SVM+ (group info used as structure during training) → f(x)
- mSVM: a separate SVM per group → f1(x), f2(x)
- svm+MTL: SVM+MTL → related models f1(x), f2(x)]

Page 83:

Synthetic Data

• Generate $\mathbf{x} \in R^{20}$ with each component $x_i \sim \mathrm{uniform}(-1, 1),\ i = 1,\dots,20$ (see the sketch below)
• The coefficient vectors of the three tasks are specified as

$\boldsymbol{\beta}_1 = [1,1,1,1,1,1,1,1,1,0,0,0,0,0,1,0,0,0,0,0]$
$\boldsymbol{\beta}_2 = [1,1,1,1,1,1,1,1,0,1,0,1,0,0,0,0,0,0,0,0]$
$\boldsymbol{\beta}_3 = [1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0]$

• For each task r and each data vector, $y = \mathrm{sign}(\boldsymbol{\beta}_r \cdot \mathbf{x} - 0.5)$
• Details of methods used:
- linear SVM classifier (single parameter C)
- SVM+ and SVM+MTL classifiers (3 parameters: linear kernel for the decision space, RBF kernel for the correcting space, and parameter γ)
- independent validation set for model selection
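A minimal sketch of this generator for a single task, assuming the sign threshold of 0.5 stated above; the function name and defaults are mine.

```python
# Synthetic MTL data: 20-dimensional uniform inputs, labels given by a
# thresholded linear rule with a task-specific coefficient vector beta.
import numpy as np

def make_task_data(beta, n, rng=np.random.default_rng(0)):
    X = rng.uniform(-1, 1, size=(n, 20))   # x_i ~ uniform(-1, 1)
    y = np.sign(X @ beta - 0.5)            # task-specific labels in {-1, +1}
    return X, y
```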

Page 84:

Experimental Results

• Comparison results (over 10 trials): average test accuracy (%) ± std. dev.

# per group | sSVM       | SVM+       | mSVM       | svm+MTL
n = 100     | 88.11±0.65 | 88.31±0.84 | 91.18±1.26 | 91.47±1.03
n = 50      | 85.74±1.36 | 86.49±1.69 | 85.09±1.40 | 87.39±2.29
n = 15      | 80.10±3.42 | 80.84±3.16 | 70.73±2.69 | 79.24±2.81

Note 1: relative performance depends on sample size → the 'best' problem setting depends on sample size
Note 2: SVM+ is always better than sSVM; SVM+MTL is always better than mSVM

Page 85:

OUTLINE

• Motivation for non-standard approaches

• New Learning Settings

• Summary

- Advantages/limitations of non-standard settings

- Solving ill-posed problems

- Knowledge discovery & predictive learning

Page 86:

Advantages/Limitations of Nonstandard Settings

• Advantages:
- reflect application requirements
- follow a methodological framework (VC-theory)
- yield better generalization (but not always)
- new directions for research

• Limitations:
- formalization of application requirements → need to understand the application domain
- generally more complex learning formulations
- more difficult model selection
- just a few empirical comparisons (to date)

Page 87:

Ill-Posed Estimation Problems

• Vapnik's imperative: asking the right question (with finite data) → the most direct learning problem formulation ~ controlling the goals of inference in order to produce a better-posed problem

• Three types of such restrictions (Vapnik, 2006):
- regularization (constraining function smoothness)
- SRM ~ constraint on the VC-dimension of models
- constraints on the mode of inference

• Philosophical interpretation (of restrictions): find a model/type of inference that explains the available data well and has large falsifiability

Page 88:

Scientific Knowledge & Predictive Learning

• Scientific knowledge (last 3-4 centuries) is:
- objective
- about recurrent events (repeatable by others)
- quantifiable (described by math models)

• Search for assured knowledge
• Knowledge ~ causal, deterministic, logical

→ humans cannot reason well about:
- noisy/random data
- multivariate (high-dimensional) data

Page 89:

The Nature of Knowledge

• Idealism: knowledge is pure thought

• Realism: knowledge comes from experience

• Western History of knowledge

- Ancient Greece ~ idealism

- Classical science ~ realism

• But it is not possible to obtain assured knowledge from empirical data (Hume):

Whatever in knowledge is of empirical nature is never certain

Page 90:

Digital Age

• Most knowledge is in the form of data from sensors (not human sense perceptions)

• How to get assured knowledge from data?

• Popular view ~ Big Data:
more data → more knowledge
more connectivity → better knowledge

• Naive realism based on classical statistics + classical science

Page 91:

Philosophical Views on Science

• Karl Popper: Science starts from problems, and not from observations.

• Werner Heisenberg: What we observe is not nature itself, but nature exposed to our method of questioning.

• Albert Einstein:
- God is subtle, but he is not malicious.
- Reality is merely an illusion, albeit a very persistent one.

Page 92:

Predictive Data Modeling Philosophy

• Single ‘correct’ causal model cannot be estimated from data alone.

• Good predictive models:

- not unique

- depend on the problem setting ( ~ our method of questioning)

• Predictive Learning methodology is the most important aspect of learning from data.

Page 93:

Predictive Modeling Methodology

• The same methodology applies to financial, bio/medical, weather, marketing, and other applications

• Interpretation of predictive models:
- a big problem (not addressed in this course)
- reflects the need for assured knowledge
- requires understanding of the predictive methodology and its assumptions, rather than just an interpretable parameterization

Page 94:

Predictive Modeling Methodology (cont'd)

• The predictive methodology is also applicable to non-technical areas (everyday-life decisions)

• Lessons from philosophy and history:
- The amount of information grows exponentially, but the number of 'big ideas' always remains small
- Critical-thinking ability becomes even more important in the digital age
- Freedom of information does not guarantee freedom of thought