ida 2015: efficient model selection for regularized classification by exploiting unlabeled data


Efficient Model Selection for Regularized Classification by Exploiting Unlabeled Data

Georgios Balikas1 Ioannis Partalas2 Eric Gaussier1 Rohit Babbar3 Massih-Reza Amini1

1University Grenoble, Alpes

2Viseo R&D

3Max Planck Institute for Intelligent Systems

Intelligent Data Analysis 2015, Saint-Etienne



Outline

1 Introduction

2 Quantification

3 The proposed approach

4 Experiment Framework

5 Conclusion



Model selection for text classification

Doc1, ..., DocN → Feature Extraction → $d_1, \dots, d_N \in \mathbb{R}^d$ → Learning

Select $h_\theta \in \mathcal{H}$.

$\theta$: hyper-parameters, $R(\theta) \in \mathbb{R}$

Which $\theta$?

The task

Efficiently select the hyper-parameter value which minimizes the generalization error (using the empirical error as a proxy).



Traditional Model Selection Methods

Valid. Train Train Train Train

Train Valid. Train Train Train

Train Train Train Train Valid.

Figure: 5-fold cross-validation

Train Valid.

Figure: Hold-out

Extensions of the above such as Leave-one-out, etc.

M. Mohri et al., Foundations of Machine Learning, MIT Press, 2012
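The two schemes above can be sketched in a few lines of plain Python. This is only an illustration: the function names `k_fold_splits` and `hold_out_split`, the shuffling seed and the 30% validation fraction are assumptions made for the example, not part of the slides.

```python
import random


def k_fold_splits(n_examples, k, seed=0):
    """Yield (train_indices, valid_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of (nearly) equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        valid = folds[i]
        # Every fold except the i-th is used for training.
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, valid


def hold_out_split(n_examples, valid_fraction=0.3, seed=0):
    """Return a single (train_indices, valid_indices) split."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    n_valid = int(round(n_examples * valid_fraction))
    return indices[n_valid:], indices[:n_valid]


splits = list(k_fold_splits(10, k=5))
print(len(splits))                       # number of models to train per hyper-parameter value
print([len(valid) for _, valid in splits])
```

With k folds, every candidate hyper-parameter value requires k training runs, which is the cost the rest of the talk tries to avoid.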



The issues

In large scale problems:

Resource intensive: $\sim 10^6$ to $10^8$ free parameters. Optimized k-CV can take up to several days.

Power law distribution of examples. Only a few instances for small classes, splitting them results in loss of information.

[Figure: labeled documents per class]

R. Babbar, I. Partalas, E. Gaussier, M-R. Amini, Re-ranking approach to classification in large-scale power-law distributed category systems, SIGIR 2014



Our contribution

We propose a bound that motivates efficient model selection.

Leverages unlabeled data for model selection

Performs on par (if not better) with traditional methods

Is k times faster than k-fold cross-validation.



Quantification

Definition

In many classification scenarios, the real goal is determining the prevalence of each class in the test set, a task called quantification.

Given a dataset:

– How many people liked the new iPhone?

– How many instances belong to class $y_i$?

A. Esuli and F. Sebastiani, Optimizing text quantifiers for multivariate loss functions, arXiv preprint arXiv:1502.05491



Quantification using general purpose learners

Classify and Count

Aggregative method

Classify each instance first

Count instances/class

Probabilistic Classify and Count

Non-aggregative method

Get scores/probabilities for each instance

Sum over probabilities/class

G. Forman, Counting positives accurately despite inaccurate classification, ECML 2005
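A minimal sketch of the two quantifiers described above, in plain Python. The function names and the toy probabilities are illustrative assumptions; any classifier that outputs labels or per-class probabilities could feed them.

```python
from collections import Counter


def classify_and_count(predicted_labels, classes):
    """Classify and Count: fraction of instances assigned to each class."""
    counts = Counter(predicted_labels)
    n = len(predicted_labels)
    return {c: counts[c] / n for c in classes}


def probabilistic_classify_and_count(probabilities, classes):
    """Probabilistic Classify and Count: average the per-class probabilities."""
    n = len(probabilities)
    return {c: sum(p[c] for p in probabilities) / n for c in classes}


classes = ["a", "b"]
# Toy posterior probabilities for four instances (made up for the example).
probs = [
    {"a": 0.9, "b": 0.1},
    {"a": 0.6, "b": 0.4},
    {"a": 0.2, "b": 0.8},
    {"a": 0.7, "b": 0.3},
]
hard_labels = [max(p, key=p.get) for p in probs]  # arg-max decision per instance

cc = classify_and_count(hard_labels, classes)
pcc = probabilistic_classify_and_count(probs, classes)
print(cc)
print(pcc)
```

On this toy input the two estimates differ (class "a" gets 3 of 4 hard decisions but an average probability of about 0.6), which shows how the choice of quantifier changes the estimated prevalence.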



Our setting

Mono-label, multi-class classification

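Observations $x \in \mathcal{X} \subseteq \mathbb{R}^d$, labels $y \in \mathcal{Y}$, $|\mathcal{Y}| > 2$

$(x, y)$ i.i.d. according to a fixed, unknown $D$ over $\mathcal{X} \times \mathcal{Y}$

$S_{train} = \{(x^{(i)}, y^{(i)})\}_{i=1}^{N}$, $S = \{(x^{(i)})\}_{i=N+1}^{M}$

Regularized classification: $w = \arg\min R_{emp}(w) + \lambda\,\mathrm{Reg}(w)$

$h_\theta \in \mathcal{H}$, e.g., for SVMs $\theta = \lambda$, taken from a set $\lambda_{values}$

$\hat{p}_y$: prior on $S_{train}$; $\hat{p}_y^{C(S)}$: prior estimated using quantification on $S$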



Accuracy bound

Theorem

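Let $S = \{(x^{(j)})\}_{j=1}^{M}$ be a set generated i.i.d. with respect to $D_X$, $p_y$ the true prior probability for category $y \in \mathcal{Y}$ and $\frac{N_y}{N} \triangleq \hat{p}_y$ its empirical estimate obtained on $S_{train}$.

We consider here a classifier $C$ trained on $S_{train}$ and we assume that the quantification method used is accurate in the sense that:

$$\exists \epsilon,\ \epsilon \ll \min\{p_y, \hat{p}_y, \hat{p}_y^{C(S)}\},\ \forall y \in \mathcal{Y}: \left| \hat{p}_y^{C(S)} - \frac{M_y^{C(S)}}{|S|} \right| \le \epsilon$$

Let $B_A^{C(S)}$ be defined as:

$$B_A^{C(S)} \triangleq \frac{\sum_{y \in \mathcal{Y}} \min\{\hat{p}_y \times |S|,\ \hat{p}_y^{C(S)} \times |S|\}}{|S|}$$

Then for any $\delta \in ]0, 1]$, with probability at least $(1 - \delta)$:

$$A^{C(S)} \le B_A^{C(S)} + |\mathcal{Y}| \left( \sqrt{\frac{\log |\mathcal{Y}| + \log \frac{1}{\delta}}{2N}} + \epsilon \right)$$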
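To get a feel for the second term of the bound, the sketch below evaluates $|\mathcal{Y}| \left( \sqrt{(\log |\mathcal{Y}| + \log (1/\delta)) / (2N)} + \epsilon \right)$ numerically. The function name `confidence_term` and the example values are assumptions chosen for illustration only.

```python
import math


def confidence_term(n_classes, n_train, delta, eps):
    """Evaluate |Y| * (sqrt((log|Y| + log(1/delta)) / (2N)) + eps)."""
    if not 0.0 < delta <= 1.0:
        raise ValueError("delta must lie in ]0, 1]")
    deviation = math.sqrt(
        (math.log(n_classes) + math.log(1.0 / delta)) / (2.0 * n_train)
    )
    return n_classes * (deviation + eps)


# Hypothetical values: 10 classes, delta = 0.05, eps = 0.
for n_train in (1_000, 10_000, 100_000):
    print(n_train, confidence_term(10, n_train, delta=0.05, eps=0.0))
```

The term shrinks as N grows and grows with the number of classes. It does not depend on the classifier (assuming the same $\epsilon$ for every candidate), so when hyper-parameter values are compared, only $B_A^{C(S)}$ changes from one candidate to the next.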



Intuition

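$$B_A^{C(S)} \triangleq \frac{\sum_{y \in \mathcal{Y}} \min\{\hat{p}_y \times |S|,\ \hat{p}_y^{C(S)} \times |S|\}}{|S|}$$

where $\hat{p}_y$ is the prior prob. of $y$ and $\hat{p}_y^{C(S)}$ is the estimated prob. of $y$ on $S$.

In power-law distributed category systems this is an upper bound:

– $\hat{p}_y$ will be used for large classes due to false positives, and

– $\hat{p}_y^{C(S)}$ will be used for small classes due to false negatives.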
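The quantity $B_A^{C(S)}$ can be computed directly from the two sets of class proportions, since the factor $|S|$ cancels between numerator and denominator. The sketch below assumes the priors are given as dictionaries mapping class to proportion; the function name and the toy numbers are illustrative.

```python
def accuracy_bound(train_priors, quantified_priors):
    """Sum over classes of min(prior on S_train, prior quantified on S)."""
    classes = set(train_priors) | set(quantified_priors)
    return sum(
        min(train_priors.get(y, 0.0), quantified_priors.get(y, 0.0))
        for y in classes
    )


# Toy example: the classifier over-predicts the large class.
train_priors = {"big": 0.70, "mid": 0.20, "small": 0.10}
quantified = {"big": 0.80, "mid": 0.15, "small": 0.05}

print(accuracy_bound(train_priors, quantified))
```

Here the minimum picks the training prior for "big" (0.70) and the quantified estimates for "mid" and "small" (0.15 and 0.05), giving a value of about 0.90. A classifier whose predicted proportions match the training priors exactly would reach 1.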



Model selection using the bound

Training Data: estimate class priors

Unseen data: quantification

    for λ in λ_values do
        Train on S_train
        Estimate $\hat{p}_y^{C(S)}$ on S
    end for

Calculate the bound

Select hyper-parameter value
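The procedure above can be sketched end to end in plain Python. Several things here are assumptions made for the example and are not taken from the slides: the value with the largest bound is the one retained, Classify and Count is used as the quantifier, and `toy_train` is a hypothetical stand-in learner (nearest class mean on 1-D inputs, with means shrunk toward the global mean as λ grows).

```python
from collections import Counter


def class_priors(labels):
    """Empirical class proportions of a list of labels."""
    counts = Counter(labels)
    return {y: c / len(labels) for y, c in counts.items()}


def bound(train_priors, quantified):
    """Sum over classes of the smaller of the two proportions."""
    classes = set(train_priors) | set(quantified)
    return sum(min(train_priors.get(y, 0.0), quantified.get(y, 0.0)) for y in classes)


def select_by_bound(lambda_values, train_fn, x_train, y_train, x_unlabeled):
    train_priors = class_priors(y_train)
    scores = {}
    for lam in lambda_values:
        predict = train_fn(x_train, y_train, lam)      # one training run per value
        predicted = [predict(x) for x in x_unlabeled]  # Classify and Count on S
        scores[lam] = bound(train_priors, class_priors(predicted))
    best = max(scores, key=scores.get)  # assumption: keep the largest bound
    return best, scores


def toy_train(x_train, y_train, lam):
    """Hypothetical learner: nearest class mean, shrunk toward the global mean."""
    global_mean = sum(x_train) / len(x_train)
    means = {}
    for y in set(y_train):
        xs = [x for x, t in zip(x_train, y_train) if t == y]
        class_mean = sum(xs) / len(xs)
        means[y] = global_mean + (class_mean - global_mean) / (1.0 + lam)
    return lambda x: min(means, key=lambda y: abs(x - means[y]))


x_train = [0.0] * 8 + [10.0] * 2
y_train = ["big"] * 8 + ["small"] * 2
x_unlabeled = [0.0] * 6 + [3.0, 4.0, 10.0, 10.0]

best, scores = select_by_bound([0.01, 1, 100], toy_train, x_train, y_train, x_unlabeled)
print(best, scores)
```

Each candidate value costs a single training run plus one pass over the unlabeled set, compared with k training runs per value under k-fold cross-validation.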



Datasets

Dataset    #Training  #Quantification  #Test  #Features  #Parameters
dmoz250    1,542      2,401            1,023  55,610     13.9M
dmoz500    2,137      3,042            1,356  77,274     38.6M
dmoz1000   6,806      10,785           4,510  138,879    138.8M
dmoz1500   9,039      14,002           5,958  170,828    256.2M
dmoz2500   12,832     19,188           8,342  212,073    530.1M

– Similar experimental settings on Wikipedia data

– SVMs and Logistic Regression, $\lambda \in \{10^{-4}, \dots, 10^{4}\}$

– 5-CV, Hold-out (70%-30%), BoundUn, BoundTest



Results (1/2)

[Plot: accuracy (y-axis, 0.0 to 0.8) against λ values (x-axis, $10^{-4}$ to $10^{3}$); legend: 5-CV, H out, MaF, CC, PCC]

Figure: MaF measure optimization for wiki1500 for SVM.



Results (2/2)

          BoundUn        BoundTest      Hold-out                        5-CV
Dataset   Acc    MaF     Acc    MaF     Acc             MaF             Acc    MaF
dmoz250   .8260  .6242   .8270  .6243   .8260 (±.0000)  .6242 (±.0000)  .8260  .6242
dmoz500   .7227  .5584   .7227  .5584   .7221 (±.0005)  .5558 (±.0022)  .7220  .5562
dmoz1000  .7302  .4883   .7302  .4892   .7301 (±.0001)  .4835 (±.0155)  .7299  .4883
dmoz1500  .7132  .4715   .7132  .4715   .6958 (±.0457)  .4065 (±.0998)  .7132  .4715
dmoz2500  .6352  .4301   .6350  .4306   .6350 (±.0001)  .3949 (±.0686)  .6352  .4301

wiki1500 for SVM on 4 cores: BoundUn (302 sec), 5-CV (1310 sec).
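For readers unfamiliar with the two reported measures, the sketch below computes accuracy and a macro-averaged F1 score in plain Python. It assumes that "MaF" in the table denotes macro-averaged F1 and that classes are taken as the union of true and predicted labels; both are assumptions, and the toy labels are made up.

```python
def accuracy(y_true, y_pred):
    """Fraction of instances whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    classes = set(y_true) | set(y_pred)
    f1_scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy imbalanced case: a classifier that always predicts the large class.
y_true = ["big"] * 8 + ["small"] * 2
y_pred = ["big"] * 10

print(accuracy(y_true, y_pred))
print(macro_f1(y_true, y_pred))
```

On this toy input accuracy stays at 0.8 while the macro average drops to about 0.44, because the small class contributes an F1 of 0. This is why a macro-averaged measure is informative alongside accuracy when class sizes follow a power law.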



Conclusions

– Performs equally well or better than traditional model selection methods.

– Is k times faster than k-CV.

– It requires unlabeled data from the same distribution as the training data.



Thank you

Georgios Balikas

Ioannis Partalas

Eric Gaussier

Rohit Babbar

Massih-Reza Amini

This work is partially supported by the CIFRE N 28/2015 and by the LabEx PERSYVAL Lab ANR-11-LABX-0025.

