Efficient Large-Scale Structured Learning

Steve Branson, Oscar Beijbom, Serge Belongie

CVPR 2013, Portland, Oregon

UC San Diego UC San Diego Caltech

Overview
• Structured prediction
• Learning from larger datasets

[Figure: example applications (deformable part models, object detection, cost-sensitive learning over a class hierarchy with nodes Mammal, Primate, Hoofed Mammal, Gorilla, Orangutan, Odd-toed, Even-toed) and large datasets (Tiny Images).]

Overview
• Available tools for structured learning not as refined as tools for binary classification
• 2 sources of speed improvement
  – Faster stochastic dual optimization algorithms
  – Application-specific importance sampling routine


Summary
• Usually, train time = 1-10 times test time
• Publicly available software package
  – Fast algorithms for multiclass SVMs, DPMs
  – API to adapt to new applications
  – Support datasets too large to fit in memory
  – Network interface for online & active learning


Summary

Cost-sensitive multiclass SVM
• 10-50 times faster than SVMstruct
• As fast as 1-vs-all binary SVM

Deformable part models
• 50-1000 times faster than
  – SVMstruct
  – Mining hard negatives
  – SGD-PEGASOS


Binary vs. Structured

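[Diagram: Structured Dataset → Binary Dataset → Binary Learner (SVM, Boosting, Logistic Regression, etc.) → Binary Output → Structured Output. The application (Object Detection, Pose Registration, Attribute Prediction, etc.) handles the conversions. Structured label: $Y = (x, y, w, h)$; binary labels: $Y = -1$, $Y = +1$.]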



• Pros: binary classifier is application independent
• Cons: what is lost in terms of:
  – Accuracy at convergence?
  – Computational efficiency?


Structured Prediction Loss $\Delta(g(X), Y_{gt})$ ≈ Binary Loss $\Delta_{01}$ ≈ Convex Upper Bound

Source of Computational Speed


Structured Prediction Loss $\Delta(g(X), Y_{gt})$ ≈ $\ell(X; w)$: Convex Upper Bound on Structured Prediction Loss


Application-specific optimization algorithms that:
  – Converge to lower test error than binary solutions
  – Lower test error for all amounts of train time

Structured SVM
• SVMs w/ structured output [Tsochantaridis et al. ICML’04]
• Max-margin MRF [Taskar et al. NIPS’03]
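For reference, the convex upper bound $\ell(X; w)$ from the earlier slides can be written in the standard margin-rescaled structured SVM form. This is the textbook formulation, not copied from the slides; the joint feature map $\Psi$, the regularization constant $\lambda$, and the training set size $n$ are notation assumed here.

```latex
\min_{w} \; \frac{\lambda}{2}\,\lVert w \rVert^{2} + \frac{1}{n}\sum_{i=1}^{n} \ell(X_i; w),
\qquad
\ell(X_i; w) = \max_{Y}\Big[\langle w, \Psi(X_i, Y)\rangle + \Delta(Y, Y_i)\Big] - \langle w, \Psi(X_i, Y_i)\rangle

% With g(X) = \arg\max_{Y} \langle w, \Psi(X, Y) \rangle, the bound holds:
\ell(X_i; w) \;\ge\; \Delta\big(g(X_i), Y_i\big)
```

The bound follows by plugging $Y = g(X_i)$ into the max: the score difference is non-negative by definition of $g$, so only the loss term remains as a lower bound.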

Binary SVM Solvers

Faster Linear SVM Solvers: SVMperf (Cutting Plane), PEGASOS (SGD), LIBLINEAR

SVMstruct: $O\left(\frac{Tn}{\lambda\epsilon}\right)$

• SVMperf: quadratic to linear in trainset size
• PEGASOS: linear to independent in trainset size
• LIBLINEAR:
  – Faster on multiple passes
  – Detect convergence
  – Less sensitive to regularization/learning rate


Structured SVM Solvers

Faster Linear SVM Solvers: SVMperf (Cutting Plane), PEGASOS (SGD), LIBLINEAR

Applied to SSVMs:
• Cutting Plane: SVMstruct
• SGD [Ratliff et al. AIStats’07]
• [Shalev-Shwartz et al. JMLR’13]

Our Approach

• Use faster stochastic dual algorithms
• Incorporate application-specific importance sampling routine
  – Reduce train times when prediction time T is large
  – Incorporate tricks people use for binary methods

For t=1… do
  1. Choose random training example (X_i, Y_i)
  2. Draw a set of samples …,… from ImportanceSample()
  3. Approx. maximize Dual SSVM objective w.r.t. example i
end

[Diagram: Random Example → Importance Sample → Maximize Dual SSVM objective w.r.t. samples]

(Provably fast convergence for simple approx. solver)
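The loop above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' solver: the structured problem is replaced by plain multiclass classification with 0/1 loss, the sampler simply returns the k labels with the highest score + loss, and step 3 uses a Block-Coordinate Frank-Wolfe style exact line search as a stand-in for the approximate dual maximization. All function names (`psi`, `importance_sample`, `train`) are hypothetical.

```python
import random

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def psi(x, y_true, y_bar, num_classes):
    """Feature difference phi(x, y_true) - phi(x, y_bar) for block multiclass features."""
    d = len(x)
    out = [0.0] * (num_classes * d)
    for j, v in enumerate(x):
        out[y_true * d + j] += v
        out[y_bar * d + j] -= v
    return out

def importance_sample(w, x, y_true, num_classes, k=2):
    """Hypothetical sampler: the k labels with the highest score + 0/1 loss."""
    d = len(x)
    def value(y):
        return dot(w[y * d:(y + 1) * d], x) + (0.0 if y == y_true else 1.0)
    return sorted(range(num_classes), key=value, reverse=True)[:k]

def train(X, Y, num_classes, lam=0.1, iters=500, seed=0):
    rng = random.Random(seed)
    n, dim = len(X), num_classes * len(X[0])
    w = [0.0] * dim
    w_i = [[0.0] * dim for _ in range(n)]  # each example's contribution to w
    l_i = [0.0] * n                        # each example's dual loss term
    duals = []                             # dual objective after every iteration
    for _ in range(iters):
        i = rng.randrange(n)                                      # 1. random example
        samples = importance_sample(w, X[i], Y[i], num_classes)  # 2. importance sample
        for y_bar in samples:                                     # 3. approximate dual step
            w_s = [v / (lam * n) for v in psi(X[i], Y[i], y_bar, num_classes)]
            l_s = (0.0 if y_bar == Y[i] else 1.0) / n
            diff = [a - b for a, b in zip(w_i[i], w_s)]
            denom = lam * dot(diff, diff)
            if denom == 0.0:
                continue
            # Exact line search along the segment toward the sampled label, clipped to [0, 1].
            gamma = (lam * dot(diff, w) - l_i[i] + l_s) / denom
            gamma = min(1.0, max(0.0, gamma))
            for j in range(dim):
                new = (1.0 - gamma) * w_i[i][j] + gamma * w_s[j]
                w[j] += new - w_i[i][j]
                w_i[i][j] = new
            l_i[i] = (1.0 - gamma) * l_i[i] + gamma * l_s
        duals.append(sum(l_i) - 0.5 * lam * dot(w, w))
    return w, duals
```

Because each step is an exact line search on a concave dual, the recorded dual objective never decreases, which gives a cheap sanity check when adapting the sampler to a new problem.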

Recent Papers w/ Similar Ideas

• Augmenting cutting plane SSVM w/ m-best solutions
  A. Guzman-Rivera, P. Kohli, D. Batra. “DivMCuts…” AISTATS’13.
• Applying stochastic dual methods to SSVMs
  S. Lacoste-Julien, et al. “Block-Coordinate Frank-Wolfe…” JMLR’13.

Applying to New Problems

1. Define loss function
2. Implement feature extraction routine
3. Implement importance sampling routine that:
   a) Is fast
   b) Favors samples w/
      • High loss+ …
      • Uncorrelated features: small …
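The three routines listed above map naturally onto a small interface. The sketch below is hypothetical: the class and method names are illustrative and are not taken from the released package, and the toy subclass uses binary labels only to keep the example short.

```python
from abc import ABC, abstractmethod

class StructuredProblem(ABC):
    """Hypothetical interface; names are illustrative, not the package's API."""

    @abstractmethod
    def loss(self, y_true, y_pred):
        """1. Loss function: cost of predicting y_pred when the truth is y_true."""

    @abstractmethod
    def features(self, x, y):
        """2. Feature extraction: joint feature vector for input x and label y."""

    @abstractmethod
    def importance_sample(self, w, x, y_true):
        """3. Importance sampling: labels ordered by score + loss, highest first."""

class BinaryToy(StructuredProblem):
    """Toy instance with labels -1 / +1, 0/1 loss, and features y * x."""

    def loss(self, y_true, y_pred):
        return 0.0 if y_true == y_pred else 1.0

    def features(self, x, y):
        return [y * v for v in x]

    def importance_sample(self, w, x, y_true):
        def value(y):
            score = sum(a * b for a, b in zip(w, self.features(x, y)))
            return score + self.loss(y_true, y)
        return sorted([-1, +1], key=value, reverse=True)
```

A solver written against such an interface never needs to know what a label is, which is the point of keeping the three routines application-specific.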

Example: Object Detection

1. Loss function
2. Features
3. Importance sampling routine
   • Add sliding window & loss into dense score map
   • Greedy NMS
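A minimal sketch of the sampling routine for detection, assuming a fixed window size, a precomputed sliding-window score map, and a loss of 1 - IoU with the ground-truth box (the slide does not specify the loss, so this choice is an assumption). The function names and the `(x, y, w, h)` box layout are illustrative.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x, y, w, h)."""
    iw = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def sample_detections(score_map, gt_box, box_size, k=3, nms_thresh=0.5):
    """Add the loss to the dense score map, then pick k boxes with greedy NMS."""
    bw, bh = box_size
    candidates = []
    for y, row in enumerate(score_map):
        for x, score in enumerate(row):
            box = (x, y, bw, bh)
            loss = 1.0 - iou(box, gt_box)  # assumed loss: 1 - IoU with ground truth
            candidates.append((score + loss, box))
    candidates.sort(key=lambda c: -c[0])   # highest score + loss first
    kept = []
    for value, box in candidates:
        # Greedy NMS: keep a box only if it overlaps no kept box too much.
        if all(iou(box, other) <= nms_thresh for _, other in kept):
            kept.append((value, box))
            if len(kept) == k:
                break
    return kept
```

The NMS step is what keeps the returned samples from being near-duplicates of the single highest-scoring window.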

Example: Deformable Part Models

1. Loss function: sum of part losses
2. Features
3. Importance sampling routine
   • Dynamic programming
   • Modified NMS to return diverse set of poses
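Because the loss is a sum of part losses, it can be folded into each part's unary score, so loss-augmented inference stays a dynamic program. The sketch below illustrates this on a deliberately simplified model: a 1-D star with one root and independent parts, and a quadratic deformation cost. Those simplifications, the function name, and the parameters are assumptions for illustration; the modified NMS step is not covered.

```python
def loss_augmented_pose(root_score, part_score, part_loss, anchors, deform_weight=0.1):
    """Best root and part locations under score + loss - deformation cost.

    root_score: list of scores, one per location.
    part_score, part_loss: one list per part, one entry per location.
    anchors: ideal offset of each part from the root.
    """
    num_locs = len(root_score)
    best_value, best_root, best_parts = float("-inf"), None, None
    for r in range(num_locs):
        total, chosen = root_score[r], []
        for p in range(len(part_score)):
            def value(q, p=p):
                deform = deform_weight * (q - (r + anchors[p])) ** 2
                # The per-part loss is added to the part's unary score.
                return part_score[p][q] + part_loss[p][q] - deform
            q_best = max(range(num_locs), key=value)  # parts are independent given the root
            total += value(q_best)
            chosen.append(q_best)
        if total > best_value:
            best_value, best_root, best_parts = total, r, chosen
    return best_value, best_root, best_parts
```

For P parts and L locations this costs O(P L^2) instead of the O(L^(P+1)) needed to enumerate every pose.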

Cost-Sensitive Multiclass SVM

1. Loss function: class confusion cost
2. Features: e.g., bag-of-words
3. Importance sampling routine
   • Return all classes
   • Exact solution using 1 dot product per class

[Figure: class confusion cost matrix over the classes cat, dog, ant, fly, car, bus]
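For the multiclass case the sampling routine is exact and cheap, as a short sketch shows. The weight layout (one vector per class), the function names, and the cost values in the usage note are hypothetical.

```python
def loss_augmented_scores(W, x, y_true, cost):
    """One dot product per class, plus the confusion cost of that class.

    W: list of weight vectors, one per class.
    cost[y_true][y]: cost of predicting y when the truth is y_true.
    """
    return [sum(wj * xj for wj, xj in zip(W[y], x)) + cost[y_true][y]
            for y in range(len(W))]

def importance_sample(W, x, y_true, cost):
    """Return all classes with their values, highest score + cost first."""
    scores = loss_augmented_scores(W, x, y_true, cost)
    order = sorted(range(len(W)), key=lambda y: scores[y], reverse=True)
    return [(y, scores[y]) for y in order]
```

With three classes, identity weights, input `[2.0, 0.0, 0.0]`, true class 0, and a made-up cost row `[0, 1, 4]`, the scores are 2, 1, and 4, so class 2 comes first even though class 0 has the highest raw score: the costly confusion is the most violated constraint.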

Results: CUB-200-2011

• Pose mixture model, 312 part/pose detectors
• Occlusion/visibility model
• Tree-structured DPM w/ exact inference

Results: CUB-200-2011

[Plots: two panels, 5794 training examples and 400 training examples]

• ~100X faster than mining hard negatives and SVMstruct
• 10-50X faster than stochastic sub-gradient methods
• Close to convergence at 1 pass through training set

Results: ImageNet

[Plots: comparison to other fast linear SVM solvers; comparison to other methods for cost-sensitive SVMs]

• Faster than LIBLINEAR, PEGASOS
• 50X faster than SVMstruct

Conclusion
• Orders of magnitude faster than SVMstruct
• Publicly available software package
  – Fast algorithms for multiclass SVMs, DPMs
  – API to adapt to new applications
  – Support datasets too large to fit in memory
  – Network interface for online & active learning


Thanks!
