


A Simple Geometric Interpretation of SVM Using Stochastic Adversaries

Roi Livni¹, Koby Crammer², Amir Globerson³

¹ ELSC-ICNC, The Edmond & Lily Safra Center for Brain Sciences, The Hebrew University, Israel
² Dept. of Electrical Engineering, The Technion, Israel
³ School of Computer Science and Engineering, The Hebrew University, Israel

• Problem: SVM with regularization does not have a clean geometric interpretation.
• Intuition: It should be related to geometric robustness.
• Approach: Require robustness to adversarial noise.
• Key tools: Infinite-dimensional linear programming.
• Results: SVM is optimal w.r.t. adversarial stochastic noise; generalization to multiclass and other losses; a simple and effective choice of regularization parameter.

[Poster figures: Illustration of the Primal Problem; Illustration of the Dual Problem]

Introduction

• In the typical robust classification setting, one wishes to minimize the loss with respect to a perturbed version of the sample.
• Here we focus on a stochastically perturbed version of the sample, where one wishes to minimize the expected loss.
• We replace each point in the sample with a distribution centered around it; an adversary gets to choose this distribution from a given set.
• The goal of the learner is to minimize the expected loss under the adversary's choice.

The main challenge in applying such a scheme is computing the worst-case expected loss. Here we show how this can be done in a variety of cases.

Generalization: General Norm

What happens if we replace the ℓ2 norm used in the main result below with a general norm ‖·‖?

The optimization is then equivalent to the hinge loss plus regularization by the dual norm ‖·‖∗.
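Schematically, and treating the exact scaling of the penalty as an assumption, the resulting problem has the form

min_w  Σᵢ [ ℓ_hinge(xᵢ, yᵢ, w) + σ ‖w‖∗ ],

so a bound of σ on the adversary's expected ‖·‖ distortion appears as a penalty in the dual norm ‖·‖∗.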

Interesting examples:
• ℓ1 bound – ℓ∞ regularization
• ℓ∞ bound – ℓ1 regularization

Generalization: Squared L2 Norm

What happens if we replace the second set of constraints, the bound on the expected ℓ2 distortion, with a bound on its square?

A smoothed version of the hinge loss!

For non-binary labels, the problem turns into an SDP. More expensive than the normed case!

For binary labels, we get a simpler equivalent problem in terms of this smoothed hinge loss.
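To give a feel for what a smoothed hinge looks like, here is a minimal numpy sketch of a quadratically smoothed (Huber-style) hinge; this is a generic illustration only, and the exact loss derived for the squared-ℓ2 constraint may differ.

import numpy as np

def smoothed_hinge(margin, gamma=1.0):
    # Generic Huber-style smoothed hinge (illustrative; not necessarily the
    # exact loss derived on the poster for the squared-L2 constraint).
    # margin = y * w.dot(x); gamma controls the width of the quadratic region.
    m = np.asarray(margin, dtype=float)
    out = np.zeros_like(m)
    quad = (m >= 1.0 - gamma) & (m < 1.0)   # quadratic transition region
    lin = m < 1.0 - gamma                   # linear (hinge-like) region
    out[quad] = (1.0 - m[quad]) ** 2 / (2.0 * gamma)
    out[lin] = 1.0 - m[lin] - gamma / 2.0
    return out

As gamma → 0 this recovers the ordinary hinge.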

Generalization: General Loss

Assume the following about the loss ℓ(x, y, w):

1. It is invariant to translation in w.
2. A vector r is a subgradient of the loss w.r.t. x if and only if it can be written as r = a·w_y + v, where v is a vector in a set V and a is a scalar.
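For example, the binary hinge loss ℓ(x, y, w) = max(0, 1 − y·wᵀx) satisfies assumption 2 (reading w_y as y·w in the binary case): every subgradient with respect to x has the form r = a·(y·w) with a ∈ [−1, 0], so the set V contains only the zero vector.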

The optimization again becomes an equivalent regularized problem, analogous to the cases above.

Experiments

We conducted two sets of experiments: UCI datasets and large-scale text classification. For choosing σ we used both cross-validation and heuristics inspired by our analysis.

• RSVM stands for Robust SVM and it refers to hinge loss plus ℓ2,∞ regularization.

• RSVM2 is the suggested loss when the squared ℓ2 norm is considered.

• (H) stands for a choice of regularization coefficient that approximates the spread around the sample points (one possible instantiation is sketched below).
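As a concrete illustration, here is a minimal sketch of one such spread heuristic, assuming σ is taken to be the average nearest-neighbour distance in the training set; the exact rule used in the experiments may differ.

import numpy as np

def sigma_spread_heuristic(X):
    # One possible instantiation (an assumption, not necessarily the rule used
    # in the experiments): estimate the spread around the sample points as the
    # average distance from each point to its nearest neighbour.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise distances
    np.fill_diagonal(d, np.inf)          # ignore self-distances
    return float(d.min(axis=1).mean())   # mean nearest-neighbour distance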

Future Work

We presented an analysis of minimax learning strategies where the adversaries are stochastic, and have shown that the optimal strategy corresponds to simple optimization problems with close links to SVM. Some questions for future research are:
• Address the semi-supervised setting, where unlabeled data is used to learn constraints on the adversarial noise.
• Extension to kernels.
• Derive generalization bounds for learning in this setting, specifically when the data is used to tune the regularization coefficient.

Stochastic Adversaries and Infinite Dimensional Linear Programming

The adversarial distribution is constrained to have the sample point as its mean, and its expected divergence from that point is also constrained.

The goal of the adversary is to find a distribution that maximizes the expected loss.

The difficulty here is that the optimization is over distributions. The dual is easier to work with: it involves only a few variables, but infinitely many constraints.
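Schematically, for a single sample point xᵢ with the expected-ℓ2 bound of the main result (normalization constants are treated as assumptions here), the adversary's primal problem is

max over distributions p:   E_{z∼p}[ ℓ(z, yᵢ, w) ]   s.t.   E_p[z] = xᵢ ,   E_p‖z − xᵢ‖₂ ≤ σ,

an optimization over the infinite-dimensional space of distributions. Its dual has one variable per moment constraint but one constraint for every possible perturbation z:

min over c, ν, λ ≥ 0:   c + σλ   s.t.   c + νᵀ(z − xᵢ) + λ‖z − xᵢ‖₂ ≥ ℓ(z, yᵢ, w)   for all z.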

In our analysis, we manage to “compress” those into just a few constraints.


Main Result: New Interpretation of SVM Regularization

Goal: minimize the hinge loss subject to a stochastic adversary with a bounded expected ℓ2 distortion.
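Written out, with the constraint set of the previous section (normalization treated as an assumption), the learner solves

min_w  Σᵢ  max_{pᵢ}  E_{z∼pᵢ}[ ℓ(z, yᵢ, w) ]   s.t.   E_{pᵢ}[z] = xᵢ ,   E_{pᵢ}‖z − xᵢ‖₂ ≤ σ   for each i.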

The above is equivalent to the simple problem:

Hinge loss plus ℓ2,∞ regularization. This is equivalent to SVM for binary classes; for multiclass, it says we should use the ℓ2,∞ norm instead of the Frobenius norm.

where ℓ(xᵢ, yᵢ, W) := max_y [ 1 − δ_{y,yᵢ} − (w_{yᵢ} − w_y)ᵀ xᵢ ], with δ the Kronecker delta and w_y the weight vector of class y.
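Putting this together, a schematic of the equivalent problem (the exact scaling of the regularization term is an assumption here) is

min_W  Σᵢ ℓ(xᵢ, yᵢ, W) + σ·m·‖W‖₂,∞ ,

with ‖W‖₂,∞ = max_y ‖w_y‖₂ and m the number of training points; for binary labels this reduces to the hinge loss plus an ℓ2 penalty on w, i.e., an SVM. A minimal numpy sketch of this objective (the function name and the σ·m scaling are illustrative assumptions):

import numpy as np

def rsvm_objective(W, X, y, sigma):
    # Multiclass hinge loss plus an l2,inf penalty on the class-weight matrix W.
    # W: (num_classes, dim), X: (m, dim), y: (m,) integer labels.
    # The sigma * m scaling of the regularizer is an assumption for illustration.
    m = X.shape[0]
    scores = X @ W.T                                   # (m, num_classes)
    margins = 1.0 + scores - scores[np.arange(m), y][:, None]
    margins[np.arange(m), y] = 0.0                     # no loss term for the true class
    hinge = margins.max(axis=1).sum()                  # sum of per-example multiclass hinges
    l2_inf = np.linalg.norm(W, axis=1).max()           # ||W||_{2,inf}: largest class-row l2 norm
    return hinge + sigma * m * l2_inf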