IDA 2015: Efficient Model Selection for Regularized Classification by Exploiting Unlabeled Data


Upload: george-balikas

Post on 13-Apr-2017


Introduction Quantification The proposed approach Experiment Framework Conclusion

Efficient Model Selection for Regularized Classification by Exploiting Unlabeled Data

Georgios Balikas (1), Ioannis Partalas (2), Eric Gaussier (1), Rohit Babbar (3), Massih-Reza Amini (1)

(1) University Grenoble Alpes

(2) Viseo R&D

(3) Max Planck Institute for Intelligent Systems

Intelligent Data Analysis 2015, Saint-Étienne

1/17 Balikas et al. Efficient Model Selection by Exploiting Unlabeled Data


Outline

1 Introduction

2 Quantification

3 The proposed approach

4 Experiment Framework

5 Conclusion


Model selection for text classification

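Figure: Pipeline. Documents Doc_1, ..., Doc_N → feature extraction → vectors $d_1, \dots, d_N \in \mathbb{R}^d$ → learning: select $h_\theta \in \mathcal{H}$, where $\theta$ denotes the hyper-parameters and $R(\theta) \in \mathbb{R}$. Open question: which $\theta$?

The task

Efficiently select the hyper-parameter value that minimizes the generalization error (using the empirical error as a proxy).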


Traditional Model Selection Methods

Figure: 5-fold cross-validation. Each of the five folds serves once as the validation set while the other four are used for training.

Figure: Hold-out. A single split into a training part and a validation part.

Extensions of the above, such as leave-one-out.

M. Mohri et al., Foundations of Machine Learning, MIT Press, 2012
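To make the cost of these schemes concrete, here is a minimal sketch of how k-fold splits are produced and how many models a grid search trains. The helper names `k_fold_indices` and `count_trainings` are illustrative and do not come from the slides; the folds are contiguous and unshuffled for simplicity.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, valid_idx) pairs for k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        valid = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, valid
        start += size


def count_trainings(n_hyperparameter_values, k):
    """Models trained by k-fold CV over a grid: one per (value, fold) pair."""
    return n_hyperparameter_values * k


folds = list(k_fold_indices(10, 5))
grid = [10.0 ** e for e in range(-4, 5)]  # an assumed grid of nine values
trainings_5cv = count_trainings(len(grid), 5)
```

Under these assumptions, a grid of nine values costs 45 trainings with 5-fold cross-validation, whereas a method that trains once per value needs nine.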


The issues

In large-scale problems:

– Resource intensive: $\sim 10^6$ to $10^8$ free parameters. Optimized k-CV can take up to several days.

– Power-law distribution of examples: small classes have only a few instances, and splitting them results in loss of information.

Figure: labeled documents per class.

R. Babbar, I. Partalas, E. Gaussier, M.-R. Amini, Re-ranking approach to classification in large-scale power-law distributed category systems, SIGIR 2014


Our contribution

We propose a bound that motivates efficient model selection. The resulting method:

– leverages unlabeled data for model selection;

– performs on par with (if not better than) traditional methods;

– is k times faster than k-fold cross-validation.


Quantification

Definition

In many classification scenarios, the real goal is to determine the prevalence of each class in the test set, a task called quantification.

Given a dataset:

– How many people liked the new iPhone?

– How many instances belong to class $y_i$?

A. Esuli and F. Sebastiani, Optimizing text quantifiers for multivariate loss functions, arXiv preprint arXiv:1502.05491


Quantification using general purpose learners

Classify and Count

– Aggregative method

– Classify each instance first

– Count instances per class

Probabilistic Classify and Count

– Non-aggregative method

– Get scores/probabilities for each instance

– Sum the probabilities per class

G. Forman, Counting positives accurately despite inaccurate classification, ECML 2005
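The two quantifiers can be sketched in a few lines. This is a minimal illustration, assuming hard predictions are given as a list of labels and soft predictions as a list of per-class probability dictionaries; the function names and the toy posteriors are hypothetical.

```python
from collections import Counter


def classify_and_count(predicted_labels, classes):
    """Classify and Count: class prevalence = fraction of instances assigned to it."""
    counts = Counter(predicted_labels)
    n = len(predicted_labels)
    return {c: counts[c] / n for c in classes}


def probabilistic_classify_and_count(posteriors, classes):
    """Probabilistic Classify and Count: average the per-instance probabilities."""
    n = len(posteriors)
    return {c: sum(p.get(c, 0.0) for p in posteriors) / n for c in classes}


# Toy posteriors for four instances over two classes (illustrative values).
posteriors = [
    {"a": 0.9, "b": 0.1},
    {"a": 0.6, "b": 0.4},
    {"a": 0.2, "b": 0.8},
    {"a": 0.7, "b": 0.3},
]
# Hard labels are the most probable class of each instance.
hard_labels = [max(p, key=p.get) for p in posteriors]

cc = classify_and_count(hard_labels, ["a", "b"])
pcc = probabilistic_classify_and_count(posteriors, ["a", "b"])
```

On this toy input the two estimates differ: Classify and Count gives 0.75 for class `a`, while the probabilistic variant gives about 0.6, because it keeps the uncertainty of each prediction.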


Our setting

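Mono-label, multi-class classification

– Observations $x \in \mathcal{X} \subseteq \mathbb{R}^d$, labels $y \in \mathcal{Y}$, $|\mathcal{Y}| > 2$

– $(x, y)$ drawn i.i.d. according to a fixed, unknown distribution $\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y}$

– $S_{train} = \{(x^{(i)}, y^{(i)})\}_{i=1}^{N}$, $S = \{x^{(i)}\}_{i=N+1}^{M}$

– Regularized classification: $\hat{w} = \arg\min_{w} R_{emp}(w) + \lambda\, Reg(w)$

– $h_\theta \in \mathcal{H}$; e.g., for SVMs, $\theta = \lambda$, taken from a set $\lambda_{values}$

– $\hat{p}_y$, $\hat{p}_y^{C(S)}$: the prior on $S_{train}$, and the prior estimated using quantification on $S$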


Accuracy bound

Theorem

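Let $S = \{x^{(j)}\}_{j=1}^{M}$ be a set generated i.i.d. with respect to $\mathcal{D}_{\mathcal{X}}$, $p_y$ the true prior probability for category $y \in \mathcal{Y}$, and $\hat{p}_y \triangleq \frac{N_y}{N}$ its empirical estimate obtained on $S_{train}$.

We consider here a classifier $C$ trained on $S_{train}$ and we assume that the quantification method used is accurate in the sense that:

$$\exists \epsilon,\ \epsilon \ll \min\{p_y, \hat{p}_y, \hat{p}_y^{C(S)}\},\ \forall y \in \mathcal{Y}: \left| \hat{p}_y^{C(S)} - \frac{M_y^{C(S)}}{|S|} \right| \le \epsilon$$

Let $B_A^{C(S)}$ be defined as:

$$B_A^{C(S)} \triangleq \frac{\sum_{y \in \mathcal{Y}} \min\{\hat{p}_y \times |S|,\ \hat{p}_y^{C(S)} \times |S|\}}{|S|}$$

Then for any $\delta \in ]0, 1]$, with probability at least $(1 - \delta)$:

$$A^{C(S)} \le B_A^{C(S)} + |\mathcal{Y}| \left( \sqrt{\frac{\log |\mathcal{Y}| + \log \frac{1}{\delta}}{2N}} + \epsilon \right)$$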
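A minimal sketch of how the two terms of the theorem could be computed, assuming the class priors are available as dictionaries keyed by class. The function names and the example priors are hypothetical. Since $|S|$ appears in both the numerator and the denominator, the first term reduces to a sum of per-class minima.

```python
import math


def accuracy_bound(train_priors, quantified_priors):
    """B_A: sum over classes of min(prior on S_train, prior quantified on S)."""
    return sum(
        min(train_priors[y], quantified_priors.get(y, 0.0)) for y in train_priors
    )


def slack(n_classes, n_train, delta, epsilon):
    """Additive term |Y| * (sqrt((log|Y| + log(1/delta)) / (2N)) + epsilon)."""
    root = math.sqrt((math.log(n_classes) + math.log(1.0 / delta)) / (2.0 * n_train))
    return n_classes * (root + epsilon)


# Illustrative priors for three classes (not taken from the slides).
train_priors = {"a": 0.70, "b": 0.20, "c": 0.10}
quantified_priors = {"a": 0.80, "b": 0.15, "c": 0.05}

b_a = accuracy_bound(train_priors, quantified_priors)  # 0.70 + 0.15 + 0.05
small_n = slack(n_classes=3, n_train=1_000, delta=0.05, epsilon=0.0)
large_n = slack(n_classes=3, n_train=100_000, delta=0.05, epsilon=0.0)
```

The slack term does not depend on the classifier, so when comparing hyper-parameter values on the same data only $B_A^{C(S)}$ changes; the slack shrinks as $N$ grows.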


Intuition

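$$B_A^{C(S)} \triangleq \frac{\sum_{y \in \mathcal{Y}} \min\{\underbrace{\hat{p}_y \times |S|}_{\text{prior prob. of } y},\ \underbrace{\hat{p}_y^{C(S)} \times |S|}_{\text{estimated prob. of } y \text{ on } S}\}}{|S|}$$

In power-law distributed category systems, this is an upper bound:

– $\hat{p}_y$ will be used for large classes due to false positives, and

– $\hat{p}_y^{C(S)}$ will be used for small classes due to false negatives.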
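A small worked example may help. The numbers below are illustrative only and assume $|S| = 100$, two classes, and class proportions on $S$ that match those of $S_{train}$.

```latex
% Illustrative numbers only (not from the slides): |S| = 100, two classes.
\hat{p}_{\text{large}} = 0.90, \quad \hat{p}_{\text{small}} = 0.10, \qquad
\hat{p}^{C(S)}_{\text{large}} = 0.97, \quad \hat{p}^{C(S)}_{\text{small}} = 0.03
\\
B_A^{C(S)} = \frac{\min\{90, 97\} + \min\{10, 3\}}{100} = \frac{90 + 3}{100} = 0.93
```

Under these assumptions, at most 90 of the 97 instances assigned to the large class can be correct (the rest are false positives), and at most 3 instances of the small class can be correct because only 3 were assigned to it (the rest are false negatives), so the accuracy cannot exceed 0.93.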


Model selection using the bound

Training Data

Estimate class priors

Quantification on unseen data


for λ in λvalues do
    Train on $S_{train}$
    Estimate $\hat{p}_y^{C(S)}$ on $S$
end for


Calculate the Bound

Select hyper-parameter value
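The whole procedure can be sketched as a single loop. This is a minimal illustration under stated assumptions, not the authors' implementation: the slides only say "select hyper-parameter value", so choosing the value with the largest bound is an assumption here, Classify and Count is assumed as the quantifier, and `select_lambda`, `toy_train` and `toy_predict` are hypothetical names. The toy "classifier" is a stand-in that simply uses the hyper-parameter as a decision threshold.

```python
def select_lambda(lambda_values, train_fn, predict_fn, train_set, unlabeled_set):
    """Return (best_lambda, best_bound); trains one model per lambda value."""
    n = len(train_set)
    classes = sorted({y for _, y in train_set})
    # Class priors estimated on the labeled training data.
    train_priors = {c: sum(1 for _, y in train_set if y == c) / n for c in classes}

    best_lambda, best_bound = None, float("-inf")
    for lam in lambda_values:
        model = train_fn(train_set, lam)
        predictions = [predict_fn(model, x) for x in unlabeled_set]
        m = len(predictions)
        # Quantification on the unlabeled data (Classify and Count assumed).
        quantified = {c: predictions.count(c) / m for c in classes}
        bound = sum(min(train_priors[c], quantified[c]) for c in classes)
        if bound > best_bound:  # assumption: keep the value with the largest bound
            best_lambda, best_bound = lam, bound
    return best_lambda, best_bound


def toy_train(train_set, lam):
    """Hypothetical stand-in for a learner: the 'model' is just a threshold."""
    return lam


def toy_predict(threshold, x):
    return "large" if x < threshold else "small"


train_set = [(i / 10, "large") for i in range(8)] + [(0.8, "small"), (0.9, "small")]
unlabeled_set = [i / 20 for i in range(20)]
best_lambda, best_bound = select_lambda(
    [0.25, 0.5, 0.75, 1.0], toy_train, toy_predict, train_set, unlabeled_set
)
```

No validation split is needed in this sketch: the labeled data is used in full for training, and the unlabeled set is only used for quantification.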


Datasets

Dataset    #Training  #Quantification  #Test  #Features  #Parameters
dmoz250        1,542            2,401  1,023     55,610        13.9M
dmoz500        2,137            3,042  1,356     77,274        38.6M
dmoz1000       6,806           10,785  4,510    138,879       138.8M
dmoz1500       9,039           14,002  5,958    170,828       256.2M
dmoz2500      12,832           19,188  8,342    212,073       530.1M

– Similar experimental settings on Wikipedia data

– SVMs and logistic regression, $\lambda \in \{10^{-4}, \dots, 10^{4}\}$

– 5-CV, hold-out (70%-30%), BoundUn, BoundTest


Results (1/2)

Figure: MaF measure optimization for wiki1500 for SVM. x-axis: λ values ($10^{-4}$ to $10^{3}$); y-axis: Accuracy (0.0 to 0.8); curves: 5-CV, H out, MaF, CC, PCC.


Results (2/2)

           BoundUn         BoundTest       Hold-out                          5-CV
Dataset    Acc     MaF     Acc     MaF     Acc              MaF              Acc     MaF
dmoz250    .8260   .6242   .8270   .6243   .8260 (±.0000)   .6242 (±.0000)   .8260   .6242
dmoz500    .7227   .5584   .7227   .5584   .7221 (±.0005)   .5558 (±.0022)   .7220   .5562
dmoz1000   .7302   .4883   .7302   .4892   .7301 (±.0001)   .4835 (±.0155)   .7299   .4883
dmoz1500   .7132   .4715   .7132   .4715   .6958 (±.0457)   .4065 (±.0998)   .7132   .4715
dmoz2500   .6352   .4301   .6350   .4306   .6350 (±.0001)   .3949 (±.0686)   .6352   .4301

wiki1500 for SVM on 4 cores: BoundUn (302 sec), 5-CV (1310 sec).
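As a quick check of the reported timings, the measured speed-up for this run works out as follows (simple arithmetic on the two numbers above):

```latex
\frac{t_{\text{5-CV}}}{t_{\text{BoundUn}}} = \frac{1310\ \text{sec}}{302\ \text{sec}} \approx 4.34
```

This is close to, though somewhat below, the factor of $k = 5$ expected for 5-fold cross-validation.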


Conclusions

– Performs as well as or better than traditional model selection methods.

– Is k times faster than k-CV.

– Requires unlabeled data from the same distribution as the training data.


Thank you

Georgios [email protected]

Ioannis [email protected]

Eric [email protected]

Rohit [email protected]

Massih-Reza [email protected]

This work is partially supported by the CIFRE N 28/2015 and by the LabEx PERSYVAL Lab ANR-11-LABX-0025.
