metrics matter, examples from binary and multilabel classification

Post on 27-Jun-2020


TRANSCRIPT

Page 1: Metrics Matter, Examples from Binary and Multilabel Classification (sanmi.cs.illinois.edu/documents/koyejo-metrics.pdf)

Metrics Matter, Examples from Binary and Multilabel Classification

Sanmi Koyejo

University of Illinois at Urbana-Champaign

Page 2:

Joint work with

B. Yan @UT Austin K. Zhong @UT Austin P. Ravikumar @CMU

N. Natarajan @MSR India I. Dhillon @UT Austin

Pages 3–9:

Learning with complex metrics

Goal: Train a DNN to optimize F-measure.

F1 = 2TP / (2TP + FN + FP)

Direct optimization: F-measure is not an average, so naïve SGD is not valid; the sample F-measure is non-differentiable.

Mixed combinatorial optimization: e.g. the cutting-plane method (Joachims, 2005); may require exponential complexity; most statistical properties unknown.

Convex lower bound: difficult to construct; most statistical properties unknown.

Logloss + thresholding: simple, the most common approach in practice; has good statistical properties!

Why does thresholding work?

Page 10:

The confusion matrix summarizes binary classifier mistakes

Y ∈ {0, 1} denotes labels, X ∈ X denotes instances; let X, Y ∼ P. The classifier is θ : X → {0, 1}.

Page 11:

Metrics trade off which kinds of mistakes are (most) acceptable

Case Study

A medical test determines that a patient has a 30% chance of having a fatal disease. Should the doctor treat the patient?

Choosing not to treat a sick patient (test is a false negative) could lead to serious issues.

Choosing to treat a healthy patient (test is a false positive) increases the risk of side effects.


Page 13:

Performance metrics

We express tradeoffs via a metric Φ : [0, 1]^4 → R

Examples

Accuracy (fraction of correct predictions) = TP + TN

Error Rate = 1 − Accuracy = FP + FN

For the medical diagnosis example, consider the weighted error = w1·FP + w2·FN, where w2 ≫ w1

and many more . . .

Recall = TP / (TP + FN),   Fβ = (1 + β²)TP / ((1 + β²)TP + β²FN + FP),

Precision = TP / (TP + FP),   Jaccard = TP / (TP + FN + FP).

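As a quick sanity check, these definitions can be evaluated directly from the four confusion counts. The helper below is an illustrative sketch; the function and variable names are mine, not from the talk.

```python
def metrics(tp, fp, fn, tn, beta=1.0):
    """Evaluate the example metrics from raw confusion counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total          # fraction of correct predictions
    error_rate = 1.0 - accuracy           # = (fp + fn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    f_beta = (1 + b2) * tp / ((1 + b2) * tp + b2 * fn + fp)
    jaccard = tp / (tp + fn + fp)
    return dict(accuracy=accuracy, error_rate=error_rate,
                precision=precision, recall=recall,
                f_beta=f_beta, jaccard=jaccard)

m = metrics(tp=30, fp=10, fn=20, tn=40)
```

For beta = 1, f_beta reduces to F1, the harmonic mean of precision and recall.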

Page 17:

No need for metrics if you can learn a "perfect" classifier!

When is a perfect classifier learnable?

the true mapping between inputs and labels is deterministic, i.e. there is no noise

the function class is sufficiently flexible (realizability) and the optimum is computable

we have sufficient data

In practice:

real-world uncertainty, e.g. hidden variables, measurement error

the true function is unknown, and optimization may be intractable

data are limited

Thus, in most realistic scenarios, all classifiers will make mistakes!


Page 22:

Utility & Regret

Population performance is measured via utility:

U(θ, P) = Φ(TP, FP, FN, TN)

We seek a classifier that maximizes this utility within some function class F.

The Bayes optimal classifier, when it exists, is given by:

θ* = argmax_{θ ∈ Θ} U(θ, P), where Θ = {f : X → {0, 1}}

The regret of the classifier θ is given by:

R(θ, P) = U(θ*, P) − U(θ, P)


Page 25:

Towards analysis of the classification procedure

In practice P(X, Y) is unknown; instead we observe Dn = {(Xi, Yi) ∼ P}, i = 1, …, n

The classification procedure estimates a classifier θn | Dn

Example

Empirical risk minimization via the SVM:

θn = sign( argmin_{f ∈ Hk} Σ_{(xi, yi) ∈ Dn} max(0, 1 − yi f(xi)) )

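This ERM step can be sketched in a few lines of numpy, assuming a linear function class and plain subgradient descent on the unregularized mean hinge loss — a simplification of a real SVM solver, with made-up synthetic data.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 2
X = rng.normal(size=(n, d))
w_true = np.array([2.0, -1.0])
y = np.sign(X @ w_true + 0.1 * rng.normal(size=n))   # labels in {-1, +1}

# subgradient descent on the mean hinge loss max(0, 1 - y f(x)), f(x) = <w, x>
w = np.zeros(d)
for _ in range(300):
    margins = y * (X @ w)
    active = margins < 1                  # points inside the margin contribute
    grad = -(y[active, None] * X[active]).sum(axis=0) / n
    w -= 0.1 * grad

theta = np.sign(X @ w)                    # theta_n = sign(argmin of the hinge risk)
train_acc = float((theta == y).mean())
```

The sign of the learned score recovers the thresholded classifier θn on this nearly separable data.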

Page 27:

Consistency

Consider the sequence of classifiers {θn(x), n → ∞}

A classification procedure is consistent when R(θn, P) → 0 as n → ∞, i.e. the procedure is eventually Bayes optimal.

Consistency is a desirable property: it implies stability of the classification procedure and is related to generalization performance.


Page 29:

Optimal Binary Classification with Decomposable Metrics

Page 30:

Consider the empirical accuracy:

ACC(θ, Dn) = (1/n) Σ_{(xi, yi) ∈ Dn} 1[yi = θ(xi)]

Observe that the classification problem

max_{θ ∈ F} ACC(θ, Dn)

is a combinatorial optimization problem

optimal classification is computationally hard for non-trivial F and Dn


Page 32:

Bayes Optimal Classifier

Population Accuracy

E_{X,Y∼P}[ 1[Y = θ(X)] ]

Easy to show that θ*(x) = sign( P(Y = 1 | x) − 1/2 )

Weighted Accuracy

E_{X,Y∼P}[ (1 − ρ)·1[Y = θ(X) = 1] + ρ·1[Y = θ(X) = 0] ]

Scott (2012) showed that θ*(x) = sign( P(Y = 1 | x) − ρ )

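The weighted-accuracy result can be checked by brute force on a tiny discrete instance: enumerate every {0, 1}-valued classifier on a four-point space and confirm the maximizer is exactly the rule that thresholds η(x) at ρ. The distribution below is a made-up illustration.

```python
import itertools

rho = 0.3                               # cost parameter in the weighted accuracy
px = [0.2, 0.3, 0.25, 0.25]             # P(X = x) on a four-point space
eta = [0.1, 0.25, 0.5, 0.9]             # eta(x) = P(Y = 1 | x)

def weighted_acc(theta):
    # U = E[(1 - rho) 1[Y = theta(X) = 1] + rho 1[Y = theta(X) = 0]]
    return sum(p * ((1 - rho) * e if t == 1 else rho * (1 - e))
               for p, e, t in zip(px, eta, theta))

# exhaustive search over all 2^4 classifiers theta : X -> {0, 1}
best = max(itertools.product([0, 1], repeat=len(px)), key=weighted_acc)
thresholded = tuple(int(e > rho) for e in eta)   # sign(eta(x) - rho)
```

Brute force recovers the thresholded rule: predict 1 exactly where η(x) > ρ.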

Page 34:

Where do surrogates come from?

Observe that there is no need to estimate P; instead, optimize any surrogate loss function L(θ, Dn) where:

θn = sign( argmin_f L(f, Dn) ) → θ*(x) as n → ∞

These are known as classification-calibrated surrogate losses (Bartlett et al., 2003; Scott, 2012)

research can focus on how to choose L, F to improve efficiency, sample complexity, robustness . . .

surrogates are often chosen to be convex, e.g. hinge loss, logistic loss


Page 37:

Non-decomposability

A common theme so far is decomposability, i.e. linearity wrt. the confusion matrix:

E[Φ(C)] = ⟨A, E[C]⟩ = Φ(E[C])

However, Fβ, Jaccard, AUC and other common utility functions are non-decomposable, i.e. non-linear wrt. C.

This implies that the averaging trick is no longer valid:

E[Φ(C)] ≠ Φ(E[C])

Primary source of difficulty for analysis, optimization, . . .

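A two-batch numeric example makes the failure of the averaging trick concrete for F1. The batch counts below are made up for illustration: averaging per-batch F1 scores gives a very different number than computing F1 on the pooled confusion counts.

```python
def f1(tp, fp, fn):
    return 2 * tp / (2 * tp + fp + fn)

# two evaluation batches of the same size: (TP, FP, FN) counts per batch
batches = [(90, 5, 5), (10, 40, 40)]

avg_of_f1 = sum(f1(*b) for b in batches) / len(batches)   # E[Phi(C)]
pooled = tuple(sum(c) for c in zip(*batches))             # E[C] (up to scale)
f1_of_avg = f1(*pooled)                                   # Phi(E[C])
```

Here avg_of_f1 ≈ 0.574 while f1_of_avg ≈ 0.690, so E[Φ(C)] ≠ Φ(E[C]).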

Page 40:

Optimal Binary Classification with Non-decomposable Metrics

Page 41:

The unreasonable effectiveness of thresholding

Theorem (Koyejo et al., 2014; Yan et al., 2016)

Let ηx = P(Y = 1 | X = x) and let U be differentiable wrt. the confusion matrix; then there exists a δ* such that:

θ*(x) = sign(ηx − δ*)

is a Bayes optimal classifier almost everywhere.¹

the result does not require concavity of U, or other "nice" properties

¹ Condition: P(ηx = δ*) = 0, easily satisfied e.g. when P(X) is continuous.


Page 43:

Proof Sketch

Let F = {f | f : X → [0, 1]} and Θ = {f | f : X → {0, 1}}

Consider the relaxed problem:

θ*_F = argmax_{θ ∈ F} U(θ, P)

Show that the optimal "relaxed" classifier is θ*_F = sign(ηx − δ*)

Observe that Θ ⊂ F. Thus U(θ*_F, P) ≥ U(θ*_Θ, P).

As a result, θ*_F ∈ Θ implies that θ*_F ≡ θ*_Θ.


Page 46:

Some recovered and new results

Fβ (Ye et al., 2012), monotonic metrics (Narasimhan et al., 2014)

Page 47:

Simulated examples

[Figure: two panels plotting η(x) over x ∈ {1, …, 10}. Left: F1 = 2TP / (2TP + FP + FN), optimal threshold δ* = 0.34. Right: Jaccard = TP / (TP + FP + FN), optimal threshold δ* = 0.39. In each panel, θ* thresholds η(x) at δ*.]

Finite sample space X, so we can exhaustively search for θ*

Page 48:

Algorithm 1 (Koyejo et al., 2014)

Step 1: Conditional probability estimation

Estimate ηx via a proper loss (Reid and Williamson, 2010), then

θδ(x) = sign(ηx − δ)

Step 2: Threshold search

max_δ U(θδ, Dn)

One-dimensional, efficiently computable using exhaustive search (Sergeyev, 1998).

θδ is consistent

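The two steps can be sketched end to end on synthetic data: fit logistic regression (a proper loss) for Step 1, then exhaustively search a one-dimensional grid for the F1-maximizing threshold in Step 2. This is an illustrative sketch, not the authors' code; the data model and learning rates are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 500, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.5, -2.0, 0.5])
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-(X @ w_true)))).astype(int)

# Step 1: estimate eta(x) = P(Y = 1 | x) by gradient descent on the log loss
w = np.zeros(d)
for _ in range(500):
    eta_hat = 1 / (1 + np.exp(-(X @ w)))
    w -= 0.5 * X.T @ (eta_hat - y) / n
eta_hat = 1 / (1 + np.exp(-(X @ w)))

# Step 2: one-dimensional exhaustive search for the F1-maximizing threshold;
# F1 only changes at observed values of eta_hat, so that grid suffices
def f1_at(delta):
    pred = (eta_hat >= delta).astype(int)
    tp = int(((pred == 1) & (y == 1)).sum())
    fp = int(((pred == 1) & (y == 0)).sum())
    fn = int(((pred == 0) & (y == 1)).sum())
    return 2 * tp / max(2 * tp + fp + fn, 1)

delta_star = max(np.unique(eta_hat), key=f1_at)
```

By construction the searched threshold does at least as well as the default 1/2.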

Page 50:

Algorithm 2 (Koyejo et al., 2014)

Step 1: Weighted classifier estimation

For a classification-calibrated loss (Scott, 2012),

fδ = argmin_{f ∈ F} Σ_{(xi, yi) ∈ Dn} ℓδ(f(xi), yi)

consistently estimates θδ(x) = sign(fδ(x))

Step 2: Threshold search

max_δ U(θδ, Dn)

θδ is consistent

Page 51:

Algorithm 3 (Yan et al., 2016)

Under additional assumptions, U(θδ, P) is differentiable and strictly locally quasi-concave wrt. δ

Online Algorithm

Iteratively update:

1. ηx via a proper loss (Reid and Williamson, 2010)
2. δt using normalized gradient ascent

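A toy version of this online loop is sketched below. Instead of the normalized gradient ascent of Yan et al., it uses the related fixed-point property that the F1-optimal threshold equals half of the optimal F1 value, which fits in a few lines; the data model, step size, and decay constant are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
w_true = np.array([2.0, -1.0])   # ground-truth logistic model for the stream
w = np.zeros(2)                  # running eta estimate via online logistic SGD
delta = 0.5                      # threshold, updated online
tp = fp = fn = 0.0               # exponentially decayed confusion counts
decay = 0.99

for t in range(3000):
    x = rng.normal(size=2)
    y = int(rng.uniform() < 1 / (1 + np.exp(-(x @ w_true))))

    # update the eta estimate with one SGD step on the logistic (proper) loss
    eta_hat = 1 / (1 + np.exp(-(x @ w)))
    w -= 0.1 * (eta_hat - y) * x

    # update decayed confusion counts at the current threshold, then move
    # delta toward (running F1) / 2, the F1 fixed point
    pred = int(eta_hat >= delta)
    tp = decay * tp + pred * y
    fp = decay * fp + pred * (1 - y)
    fn = decay * fn + (1 - pred) * y
    f1_running = 2 * tp / max(2 * tp + fp + fn, 1e-9)
    delta = max(f1_running / 2, 0.01)
```

Since F1 ≤ 1, the online threshold always stays in (0, 1/2], consistent with the fixed-point characterization.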

Page 53:

Online algorithm sample complexity

Let the η estimation error at step t be rt = ∫ |ηt − η| dµ. With appropriately chosen step sizes,

R(θδt, P) ≤ (C/t) Σ_{i=1}^t ri

Example: Online logistic regression

The parameter converges at rate O(1/√n) by the averaged stochastic gradient algorithm (Bach, 2014). Thus, the online algorithm achieves O(1/√n) regret.

Page 54:

Empirical Evaluation

Page 55:

Datasets

dataset      default  news20     rcv1     epsilon  kdda        kddb
# features   25       1,355,191  47,236   2,000    20,216,830  29,890,095
# test       9,000    4,996      677,399  100,000  510,302     748,401
# train      21,000   15,000     20,242   400,000  8,407,752   19,264,097
%pos         22%      67%        52%      50%      85%         86%

η estimation: logistic regression and boosted trees

Baselines: threshold search (Koyejo et al., 2014), SVMperf and STAMP/SPADE (Narasimhan et al., 2015)

Page 56:

Batch algorithm

Data set/Metric  LR+Plug-in     LR+Batch        XGB+Plug-in    XGB+Batch
news20-Q-Mean    0.948 (3.77s)  0.948 (0.001s)  0.874 (3.87s)  0.875 (0.003s)
news20-H-Mean    0.950 (3.70s)  0.950 (0.003s)  0.859 (3.61s)  0.860 (0.003s)
news20-F1        0.949 (3.49s)  0.948 (0.01s)   0.872 (5.07s)  0.874 (0.01s)
default-Q-Mean   0.664 (14.3s)  0.667 (0.19s)   0.688 (13.7s)  0.701 (0.22s)
default-H-Mean   0.665 (12.1s)  0.668 (0.17s)   0.693 (12.4s)  0.708 (0.18s)
default-F1       0.503 (14.2s)  0.497 (0.19s)   0.538 (16.2s)  0.538 (0.15s)

Page 57:

Online Complex Metric Optimization (OCMO)

Metric  Algorithm  RCV1            Epsilon          KDD-A          KDD-B
F1      OCMO       0.952 (0.01s)   0.804 (4.87s)    0.934 (2.43s)  0.941 (5.01s)
F1      sTAMP      0.923 (14.44s)  0.585 (133.23s)  -              -
F1      SVMperf    0.953 (1.72s)   0.872 (20.39s)   -              -
H-Mean  OCMO       0.964 (0.02s)   0.891 (4.85s)    0.764 (2.5s)   0.733 (5.16s)
H-Mean  sPADE      0.580 (15.74s)  0.578 (135.26s)  -              -
H-Mean  SVMperf    0.953 (1.72s)   0.872 (20.39s)   -              -
Q-Mean  OCMO       0.964 (0.01s)   0.889 (4.87s)    0.551 (2.11s)  0.506 (4.27s)
Q-Mean  sPADE      0.688 (15.83s)  0.632 (136.46s)  -              -
Q-Mean  SVMperf    0.950 (1.72s)   0.872 (20.39s)   -              -

'-' means the corresponding algorithm does not terminate within 100x the run time of OCMO.

Page 58:

Performance vs run time for various online algorithms

(a) F1 measure on rcv1 (b) H-Mean on rcv1 (c) Q-Mean on rcv1

Page 59:

Optimal Multilabel Classification with Non-decomposable Averaged Metrics

Page 60:

Multilabel Classification

Multiclass: only one class associated with each example

Multilabel: multiple classes associated with each example

Page 61:

Applications

Page 62:

The Multilabel Classification Problem

Inputs: X ∈ X; Labels: Y ∈ Y = {0, 1}^M (with M labels)

Classifier θ : X → Y

Example: Hamming Loss

U(θ) = E_{X,Y∼P}[ Σ_{m=1}^M 1[Ym = θm(X)] ] = Σ_{m=1}^M P(Ym = θm(X))

Optimal Prediction for Hamming Loss

θ*_m(x) = sign( P(Ym = 1 | x) − 1/2 )

Well-known convex surrogates, e.g. hinge loss (Bartlett et al., 2006)

Page 63:

Multilabel Confusion

Recall the binary confusion matrix.

Similar idea for multilabel classification, now across both labels m and examples n:

Cm,n:  TPm,n = 1[θm(x(n)) = 1, y(n)m = 1],  FPm,n = 1[θm(x(n)) = 1, y(n)m = 0],
       FNm,n = 1[θm(x(n)) = 0, y(n)m = 1],  TNm,n = 1[θm(x(n)) = 0, y(n)m = 0]

¹ We focus on linear-fractional metrics, e.g. Accuracy, Fβ, Precision, Recall, Jaccard.



Label Averaging

Most popular multilabel metrics are averaged metrics.
Some notation: let η_m(x) = P(Y_m = 1 | x)

Macro-Averaging

Average over examples for each label:

C_m = (1/N) Σ_{n=1}^N C_{m,n},   Ψ_macro := (1/M) Σ_{m=1}^M Ψ(C_m).

Bayes optimal classifier:

θ*_m(x) = sign(η_m(x) − δ*_m)  ∀m ∈ [M]

Separate threshold δ*_m for each label
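A minimal sketch of macro-averaged F1 with per-label thresholds. All numbers here (marginal estimates, labels, thresholds) are toy values chosen for illustration, not from the talk's experiments:

```python
import numpy as np

def f1(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0

# Toy marginal estimates eta[n, m] and labels y[n, m]; N = 4, M = 2
eta = np.array([[0.9, 0.2], [0.6, 0.7], [0.2, 0.8], [0.4, 0.6]])
y   = np.array([[1,   0  ], [1,   1  ], [0,   0  ], [1,   1  ]])
delta = np.array([0.3, 0.65])  # hypothetical per-label thresholds delta*_m

pred = (eta > delta).astype(int)  # broadcasting applies delta[m] to column m

# Macro-average: compute F1 per label (over examples), then average over labels
per_label = []
for m in range(y.shape[1]):
    tp = np.sum((pred[:, m] == 1) & (y[:, m] == 1))
    fp = np.sum((pred[:, m] == 1) & (y[:, m] == 0))
    fn = np.sum((pred[:, m] == 0) & (y[:, m] == 1))
    per_label.append(f1(tp, fp, fn))
macro_f1 = float(np.mean(per_label))  # -> 0.75 for this toy data
```

Because the metric is computed per label before averaging, each label is free to use its own optimal threshold δ*_m.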



Instance Average

Average over labels for each example:

C_n = (1/M) Σ_{m=1}^M C_{m,n},   Ψ_instance := (1/N) Σ_{n=1}^N Ψ(C_n).

Bayes optimal classifier:

θ*_m(x) = sign(η_m(x) − δ*)  ∀m ∈ [M]

Only requires the marginals η_m(x), i.e. label correlations have a weak effect on optimal classification

Note: marginals may still be deterministically coupled across labels, e.g. low rank, shared DNN representation

Shared threshold δ* across labels



Micro Average

Average over both examples and labels:

C = (1/(NM)) Σ_{n=1}^N Σ_{m=1}^M C_{m,n},   Ψ_micro := Ψ(C).

Bayes optimal classifier:

θ*_m(x) = sign(η_m(x) − δ*)  ∀m ∈ [M]

Bayes optimal is identical to instance averaging

Only requires the marginals η_m(x), i.e. label correlations have a weak effect on optimal classification

Shared threshold δ* across labels
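For contrast with macro averaging, micro averaging pools the confusion counts over all (example, label) pairs before applying Ψ, with a single shared threshold. A sketch on the same kind of toy data (all values are made-up for illustration):

```python
import numpy as np

# Toy marginal estimates eta[n, m] and labels y[n, m]; N = 4, M = 2
eta = np.array([[0.9, 0.2], [0.6, 0.7], [0.2, 0.8], [0.4, 0.6]])
y   = np.array([[1,   0  ], [1,   1  ], [0,   0  ], [1,   1  ]])
delta = 0.5  # hypothetical single shared threshold delta*

pred = (eta > delta).astype(int)

# Micro-average: pool the confusion counts over all (example, label) pairs first
tp = np.sum((pred == 1) & (y == 1))
fp = np.sum((pred == 1) & (y == 0))
fn = np.sum((pred == 0) & (y == 1))
micro_f1 = 2 * tp / (2 * tp + fp + fn)  # -> 0.8 for this toy data
```

Since the counts are pooled before Ψ is applied, every label shares the same threshold, matching the Bayes-optimal form on the slide.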



Simulated Micro-averaged F1

[Figure: two panels plotting the marginals η_0(x) and η_1(x) for x ∈ [1, 6], each thresholded by its optimal classifier θ*_0, θ*_1 at the shared value δ* = 0.40, for F1 = 2TP / (2TP + FP + FN).]


Empirical Evaluation


                      F1                                   Jaccard
Dataset      BR       Plugin   Macro-Thres      BR       Plugin   Macro-Thres
Scene        0.6559   0.6847   0.6631           0.4878   0.5151   0.5010
Birds        0.4040   0.4088   0.2871           0.2495   0.2648   0.1942
Emotions     0.5815   0.6554   0.6419           0.3982   0.4908   0.4790
Cal500       0.3647   0.4891   0.4160           0.2229   0.3225   0.2608

Table: Comparison of plugin-estimator methods on multilabel F1 and Jaccard metrics. Reported values correspond to the micro-averaged metric (F1 and Jaccard) computed on test data (with standard deviation, over 10 random validation sets for tuning thresholds). Plugin is consistent for micro-averaged metrics, and consistently performs best across datasets.


                      F1                                   Jaccard
Dataset      BR       Plugin   Macro-Thres      BR       Plugin   Macro-Thres
Scene        0.5695   0.6422   0.6303           0.5466   0.5976   0.5902
Birds        0.1209   0.1390   0.1390           0.1058   0.1239   0.1195
Emotions     0.4787   0.6241   0.6156           0.4078   0.5340   0.5173
Cal500       0.3632   0.4855   0.4135           0.2268   0.3252   0.2623

Table: Comparison of plugin-estimator methods on multilabel F1 and Jaccard metrics. Reported values correspond to the instance-averaged metric (F1 and Jaccard) computed on test data (with standard deviation, over 10 random validation sets for tuning thresholds). Plugin is consistent for instance-averaged metrics, and consistently performs best across datasets.


                      F1                                   Jaccard
Dataset      BR       Plugin   Macro-Thres      BR       Plugin   Macro-Thres
Scene        0.6601   0.6941   0.6737           0.5046   0.5373   0.5260
Birds        0.3366   0.3448   0.2971           0.2178   0.2341   0.2051
Emotions     0.5440   0.6450   0.6440           0.3982   0.4912   0.4900
Cal500       0.1293   0.2687   0.3226           0.0880   0.1834   0.2146

Table: Comparison of plugin-estimator methods on multilabel F1 and Jaccard metrics. Reported values correspond to the macro-averaged metric computed on test data (with standard deviation, over 10 random validation sets for tuning thresholds). Macro-Thres is consistent for macro-averaged metrics, and is competitive in three out of four datasets. Though not consistent for macro-averaged metrics, Plugin achieves the best performance in three out of four datasets.


Correlated Binary Decisions

The same procedure applies to more general correlated binary decisions using averaged metrics.

[Figure 4: Multi-scale models will be used to identify hub genes of gene modules that are associated with a particular neural or behavioral trait of interest. We will perturb the expression of single genes (or phenes) and then measure the outcomes (i.e. changes in the neural network structure, gene module structure, etc.) to determine causal relationships.]

Example application: point estimates of brain networks from posterior distributions


Conclusion


Conclusion and open questions

Optimal classifiers for a large family of metrics have a simple threshold form sign(P(Y = 1|X) − δ)

Proposed scalable algorithms for consistent estimation

Open Questions:

Can we elicit utility functions from feedback?

Can we characterize the entire family of utility metrics with thresholded optimal decision functions?

What about more general structured prediction?



Questions?

[email protected]


References


Francis R. Bach. Adaptivity of averaged stochastic gradient descent to local strong convexity for logistic regression. Journal of Machine Learning Research, 15(1):595–627, 2014.

Peter L. Bartlett, Michael I. Jordan, and Jon D. McAuliffe. Large margin classifiers: Convex loss, low noise, and convergence rates. In NIPS, pages 1173–1180, 2003.

Peter L. Bartlett, Michael I. Jordan, and Jon D. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138–156, 2006.

Elad Hazan, Kfir Levy, and Shai Shalev-Shwartz. Beyond convexity: Stochastic quasi-convex optimization. In Advances in Neural Information Processing Systems, pages 1585–1593, 2015.

Thorsten Joachims. A support vector method for multivariate performance measures. In Proceedings of the 22nd International Conference on Machine Learning, pages 377–384. ACM, 2005.

Oluwasanmi O. Koyejo, Nagarajan Natarajan, Pradeep K. Ravikumar, and Inderjit S. Dhillon. Consistent binary classification with generalized performance metrics. In Advances in Neural Information Processing Systems, pages 2744–2752, 2014.

Harikrishna Narasimhan, Rohit Vaish, and Shivani Agarwal. On the statistical consistency of plug-in classifiers for non-decomposable performance measures. In Advances in Neural Information Processing Systems, pages 1493–1501, 2014.

Harikrishna Narasimhan, Purushottam Kar, and Prateek Jain. Optimizing non-decomposable performance measures: A tale of two classes. In 32nd International Conference on Machine Learning (ICML), 2015.

Mark D. Reid and Robert C. Williamson. Composite binary losses. The Journal of Machine Learning Research, 9999:2387–2422, 2010.

Clayton Scott. Calibrated asymmetric surrogate losses. Electronic Journal of Statistics, 6:958–992, 2012.

Yaroslav D. Sergeyev. Global one-dimensional optimization using smooth auxiliary functions. Mathematical Programming, 81(1):127–146, 1998.

Bowei Yan, Kai Zhong, Oluwasanmi Koyejo, and Pradeep Ravikumar. Online classification with complex metrics. arXiv:1610.07116, 2016.

Nan Ye, Kian Ming A. Chai, Wee Sun Lee, and Hai Leong Chieu. Optimizing F-measures: A tale of two approaches. In Proceedings of the International Conference on Machine Learning, 2012.


Backup Slides


Two Step Normalized Gradient Descent for optimal threshold search

 1: Input: training sample {X_i, Y_i}_{i=1}^n, utility measure U, conditional probability estimator η̂, stepsize α.
 2: Randomly split the training sample into two subsets {X_i^(1), Y_i^(1)}_{i=1}^{n_1} and {X_i^(2), Y_i^(2)}_{i=1}^{n_2};
 3: Estimate η̂ on {X_i^(1), Y_i^(1)}_{i=1}^{n_1}.
 4: Initialize δ = 0.5;
 5: while not converged do
 6:   Evaluate TP, TN on {X_i^(2), Y_i^(2)}_{i=1}^{n_2} with f(x) = sign(η̂(x) − δ).
 7:   Calculate ∇U;
 8:   δ ← δ − α ∇U / ‖∇U‖.
 9: end while
10: Output: f(x) = sign(η̂(x) − δ).
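A runnable sketch of the two-step idea on synthetic data. Two simplifications are assumptions, not the talk's method: the class-probability estimate η̂ is taken as given rather than fitted on the first split, and the 1-D maximization over δ is done by direct grid search, since the empirical F-measure is piecewise constant in δ and the gradient step above applies to the smoothed population quantity:

```python
import numpy as np

rng = np.random.default_rng(0)

def f1_at(delta, eta, y):
    """Empirical F1 of the thresholded classifier sign(eta - delta)."""
    pred = (eta > delta).astype(int)
    tp = np.sum((pred == 1) & (y == 1))
    fp = np.sum((pred == 1) & (y == 0))
    fn = np.sum((pred == 0) & (y == 1))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0

# Synthetic data: eta_hat plays the role of the class-probability
# estimate from split 1 (here taken as known, an assumption)
x = rng.normal(size=2000)
eta_hat = 1.0 / (1.0 + np.exp(-3.0 * x))
y = (rng.random(2000) < eta_hat).astype(int)

# Step 2: tune delta on the held-out split by direct 1-D search
eta2, y2 = eta_hat[1000:], y[1000:]
grid = np.linspace(0.05, 0.95, 91)
scores = [f1_at(d, eta2, y2) for d in grid]
delta_star = grid[int(np.argmax(scores))]
```

The final classifier is f(x) = sign(η̂(x) − delta_star), exactly the plug-in form in step 10.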


Online Complex Metric Optimization (OCMO)

Require: online CPE with update g, metric U, stepsize α;
 1: Initialize η̂_0, δ_0 = 0.5;
 2: while data stream has points do
 3:   Receive data point (x_t, y_t)
 4:   η̂_t = g(η̂_{t−1});
 5:   δ_t^(0) = δ_t, TP_t^(0) = TP_{t−1}, TN_t^(0) = TN_{t−1};
 6:   for i = 1, ..., T_t do
 7:     if η̂_t(x_t) > δ_t^(i−1) then
 8:       TP_t^(i) ← (TP_{t−1} · (t − 1) + (1 + y_t)/2) / t, TN_t^(i) ← TN_{t−1} · (t − 1) / t;
 9:     else TP_t^(i) ← TP_{t−1} · (t − 1) / t, TN_t^(i) ← (TN_{t−1} · t + (1 − y_t)/2) / (t + 1);
10:     end if
11:     δ_t^(i) = δ_t^(i−1) − α ∇G(TP_t, TN_t) / ‖∇G(TP_t, TN_t)‖, TP_t = TP_t^(i), TN_t = TN_t^(i);
12:   end for
13:   δ_{t+1} = δ_t^(T_t);
14:   t = t + 1;
15: end while
16: Output (η̂_t, δ_t).
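A simplified streaming stand-in for OCMO, hedged heavily: this is not the paper's exact update. The probability model is held fixed (only the threshold adapts), and instead of the normalized gradient step on ∇G it uses the fixed-point property that the F1-optimal threshold equals half the optimal F1, updating δ from running confusion counts:

```python
import numpy as np

rng = np.random.default_rng(1)

# Held-fixed probability model (an assumption); only delta adapts online
delta = 0.5
tp = fp = fn = 0

for t in range(1, 20001):
    x = rng.normal()
    eta = 1.0 / (1.0 + np.exp(-3.0 * x))  # stand-in for eta_hat_t(x_t)
    y = int(rng.random() < eta)           # label drawn from the same model
    pred = int(eta > delta)
    tp += pred & y
    fp += pred & (1 - y)
    fn += (1 - pred) & y
    if 2 * tp + fp + fn > 0:
        # Fixed-point update: optimal F1 threshold = (optimal F1) / 2,
        # so set delta to half the running F1 estimate
        delta = tp / (2 * tp + fp + fn)
```

Like OCMO, it does constant work per stream element and maintains only scalar confusion statistics.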


Scaling up Classification with Complex Metrics


Additional properties of U

Informal theorem (Yan et al., 2016)

Suppose U is fractional-linear or monotonic; then, under weak conditions on P:

U(θ_δ, P) is differentiable wrt δ

U(θ_δ, P) is Lipschitz wrt δ

U(θ_δ, P) is strictly locally quasi-concave wrt δ

(The weak conditions: η_x is differentiable wrt x, and its characteristic function is absolutely integrable.)


Algorithms

Normalized Gradient Descent (Hazan et al., 2015)

Fix ε > 0, let f be strictly locally quasi-concave, and let x* ∈ argmax f(x). The NGD algorithm with number of iterations T ≥ κ²‖x_1 − x*‖²/ε² and step size η = ε/κ achieves f(x*) − f(x_T) ≤ ε.

Batch Algorithm

1. Estimate η̂_x via a proper loss (Reid and Williamson, 2010)

2. Solve max_δ U(θ_δ, D_n) using normalized gradient ascent

Online Algorithm

Interleave the η̂_t update and the δ_t update


Sample Complexity

Batch Algorithm

With an appropriately chosen step size, R(θ_δ̂, P) ≤ C ∫ |η̂ − η| dµ

Comparison to threshold search

The complexity of NGD is O(nt) = O(n/ε²), where t is the number of iterations and ε is the precision of the solution

When log n ≥ 1/ε², the batch algorithm has favorable computational complexity vs. threshold search

Online Algorithm

Let the η̂ estimation error at step t be r_t = ∫ |η̂_t − η| dµ; with an appropriately chosen step size, R(θ_δ_t, P) ≤ (C/t) Σ_{i=1}^t r_i
