
1

Second Order Learning

Koby Crammer, Department of Electrical Engineering

ECML PKDD 2013 Prague

Thanks

• Mark Dredze
• Alex Kulesza
• Avihai Mejer
• Edward Moroshko
• Francesco Orabona
• Fernando Pereira
• Yoram Singer
• Nina Vaitz

2

3

Tutorial Context

Online Learning

Tutorial

Optimization Theory

Real-World Data

SVMs

4

Outline

• Background:
  – Online learning + notation
  – Perceptron
  – Stochastic-gradient descent
  – Passive-aggressive

• Second-Order Algorithms
  – Second-order Perceptron
  – Confidence-Weighted and AROW
  – AdaGrad

• Properties
  – Kernels
  – Analysis

• Empirical Evaluation
  – Synthetic
  – Real Data

5

Online Learning

Tyrannosaurus rex

6

Online Learning

Triceratops

7

Online Learning

Tyrannosaurus rex

Velociraptor

8

Formal Setting – Binary Classification

• Instances
  – Images, Sentences

• Labels
  – Parse trees, Names

• Prediction rule
  – Linear prediction rules

• Loss
  – No. of mistakes

9

Predictions

• Discrete predictions:
  – Hard to optimize

• Continuous predictions:

– Label

– Confidence

10

Loss Functions

• Natural loss:
  – Zero-one loss

• Real-valued-prediction losses (standard forms sketched below):
  – Hinge loss
  – Exponential loss (Boosting)
  – Log loss (Max Entropy, Boosting)
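The loss formulas on this slide were images that did not survive transcription. As a reference sketch (my notation, not necessarily the slide's), the standard forms for a labeled example $(x, y)$ with $y \in \{-1, +1\}$ and a linear predictor $w$ are:

```latex
\begin{align*}
\ell_{0/1}(w;(x,y))            &= \mathbf{1}\left[\, y\,(w \cdot x) \le 0 \,\right] \\
\ell_{\mathrm{hinge}}(w;(x,y)) &= \max\left(0,\; 1 - y\,(w \cdot x)\right) \\
\ell_{\exp}(w;(x,y))           &= \exp\left(-y\,(w \cdot x)\right) \\
\ell_{\log}(w;(x,y))           &= \log\left(1 + \exp\left(-y\,(w \cdot x)\right)\right)
\end{align*}
```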

11

Loss Functions

[Figure: zero-one loss and hinge loss plotted as functions of the margin]

Online Learning

Maintain model M
Get instance x
Predict label ŷ = M(x)
Get true label y
Suffer loss l(y, ŷ)
Update model M
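As a rough illustration of this protocol (my own sketch, not code from the tutorial), the loop can be written with predict, update, and loss as hypothetical hooks standing in for the algorithms discussed later:

```python
def online_learning(stream, predict, update, loss):
    """Generic online-learning protocol: predict, observe, suffer loss, update.
    `stream` yields (x, y) pairs; `predict`, `update`, `loss` are hypothetical hooks."""
    total_loss = 0.0
    for x, y in stream:
        y_hat = predict(x)             # predict with the current model
        total_loss += loss(y, y_hat)   # suffer loss against the revealed label
        update(x, y, y_hat)            # update the model
    return total_loss
```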

14

• Any Features

• W.l.o.g.

• Binary Classifiers of the form

Linear Classifiers

(Notation abuse)

15

• Prediction: $\mathrm{sign}(w \cdot x)$

• Confidence in prediction: $|w \cdot x|$

Linear Classifiers (cntd.)

16

Linear Classifiers

Input Instance to be classified

Weight vector of classifier

17

• Margin of an example with respect to the classifier: $y\,(w \cdot x)$

• Note: the margin is positive iff the example is classified correctly

• The set is separable iff there exists a weight vector with a positive margin on every example (sketched below)

Margin
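The formulas on this slide were dropped in transcription; a sketch in standard notation (with $u$ my symbol for a candidate separator) is:

```latex
\begin{align*}
\text{Margin of } (x,y) \text{ w.r.t. } w:\quad & \gamma(w;(x,y)) = y\,(w \cdot x) \\
\text{Note:}\quad & \gamma(w;(x,y)) > 0 \iff \mathrm{sign}(w \cdot x) = y \\
\text{Separability:}\quad & \{(x_i,y_i)\}_{i=1}^{m} \text{ is separable iff } \exists\, u:\ y_i\,(u \cdot x_i) > 0 \ \ \forall i
\end{align*}
```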

18

Geometrical Interpretation

19

Geometrical Interpretation

20

Geometrical Interpretation

21

Geometrical Interpretation

[Figure: examples at varying distances from the separator, labeled margin > 0, margin >> 0, margin < 0, margin << 0]

22

Hinge Loss

23

Why Online Learning?

• Fast
• Memory efficient – processes one example at a time
• Simple to implement
• Formal guarantees – mistake bounds
• Online-to-batch conversions
• No statistical assumptions
• Adaptive

• Not as good as a well-designed batch algorithm

24

Outline

• Background:
  – Online learning + notation
  – Perceptron
  – Stochastic-gradient descent
  – Passive-aggressive

• Second-Order Algorithms
  – Second-order Perceptron
  – Confidence-Weighted and AROW
  – AdaGrad

• Properties
  – Kernels
  – Analysis

• Empirical Evaluation
  – Synthetic
  – Real Data

25

The Perceptron Algorithm

• If No-Mistake

– Do nothing

• If Mistake

– Update

• Margin after update :

Rosenblatt, 1958
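The update rule and the post-update margin were shown as equations that are missing from this transcript; as a minimal sketch of the standard Perceptron (assuming labels in {−1, +1}):

```python
import numpy as np

def perceptron(X, y, epochs=1):
    """Minimal Perceptron sketch: predict, and update only on mistakes."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for i in range(n):
            y_hat = 1.0 if w @ X[i] >= 0 else -1.0
            if y_hat != y[i]:          # mistake: move w toward the correct side
                w += y[i] * X[i]       # update: w <- w + y_i x_i
    return w
```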

26

Geometrical Interpretation

27

Outline

• Background:
  – Online learning + notation
  – Perceptron
  – Stochastic-gradient descent
  – Passive-aggressive

• Second-Order Algorithms
  – Second-order Perceptron
  – Confidence-Weighted and AROW
  – AdaGrad

• Properties
  – Kernels
  – Analysis

• Empirical Evaluation
  – Synthetic
  – Real Data

Gradient Descent

• Consider the batch problem

• Simple algorithm:
  – Initialize
  – Iterate:
    – Compute the gradient
    – Take a gradient step

28

Stochastic Gradient Descent

• Consider the batch problem

• Simple algorithm:
  – Initialize
  – Iterate:
    – Pick a random index
    – Compute the gradient of that example's loss
    – Take a gradient step
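The objective and update were shown as equations that are missing here; as a rough sketch of the stochastic-gradient step, where `grad_i` is a hypothetical callback returning the gradient of example i's loss at the current weights:

```python
import numpy as np

def sgd(grad_i, n_examples, dim, eta=0.1, steps=1000, seed=0):
    """Stochastic gradient descent sketch: at each step pick a random example
    index i and step along the negative gradient of that example's loss."""
    rng = np.random.default_rng(seed)
    w = np.zeros(dim)
    for _ in range(steps):
        i = rng.integers(n_examples)   # pick a random index
        w -= eta * grad_i(w, i)        # gradient step on that example's loss
    return w
```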

30

31

Stochastic Gradient Descent

• “Hinge” loss

• The gradient

• Simple algorithm:
  – Initialize
  – Iterate:
    – Pick a random index
    – If the example suffers positive "hinge" loss, take a gradient step on it; else leave the weights unchanged

32

The Perceptron is a stochastic-gradient-descent algorithm on a sum of "hinge" losses, with a specific order of examples.

33

Outline

• Background:
  – Online learning + notation
  – Perceptron
  – Stochastic-gradient descent
  – Passive-aggressive

• Second-Order Algorithms
  – Second-order Perceptron
  – Confidence-Weighted and AROW
  – AdaGrad

• Properties
  – Kernels
  – Analysis

• Empirical Evaluation
  – Synthetic
  – Real Data

34

Motivation

• Perceptron: no guarantee on the margin after the update

• PA: enforce a minimal non-zero margin after the update

• In particular:
  – If the margin is large enough (at least one unit), do nothing
  – If the margin is less than a unit, update so that the margin after the update is enforced to be a unit

35

Input Space

36

Input Space vs. Version Space

• Input space:
  – Points are input data
  – One constraint is induced by a weight vector
  – Primal space
  – Half-space = all input examples that are classified correctly by a given predictor (weight vector)

• Version space:
  – Points are weight vectors
  – One constraint is induced by an input example
  – Dual space
  – Half-space = all predictors (weight vectors) that classify a given input example correctly

37

Weight Vector (Version) Space

The algorithm forces the weight vector to reside in this region

38

Passive Step

Nothing to do: the weight vector already resides on the desired side.

39

Aggressive Step

The algorithm projects the weight vector onto the desired half-space

40

Aggressive Update Step

• Set the new weight vector to be the solution of the following optimization problem (sketched below):

• Solution:
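The optimization problem and its closed-form solution were images in the original slides; the standard passive-aggressive formulation, which this slide appears to present, is:

```latex
\begin{align*}
w_{t+1} &= \arg\min_{w}\ \tfrac{1}{2}\,\lVert w - w_t \rVert^{2}
           \quad \text{s.t.} \quad y_t\,(w \cdot x_t) \ge 1 \\[4pt]
w_{t+1} &= w_t + \tau_t\, y_t\, x_t,
\qquad
\tau_t = \frac{\max\left(0,\ 1 - y_t\,(w_t \cdot x_t)\right)}{\lVert x_t \rVert^{2}}
\end{align*}
```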

41

Perceptron vs. PA

• Common update: $w_{t+1} = w_t + \alpha_t\, y_t\, x_t$

• Perceptron: $\alpha_t = 1$ (only on mistakes)

• Passive-Aggressive: $\alpha_t = \max\left(0,\ 1 - y_t\,(w_t \cdot x_t)\right) / \lVert x_t \rVert^{2}$

42

Perceptron vs. PA

[Figure: loss as a function of the margin, with regions labeled: error; no error, small margin; no error, large margin]

43

Perceptron vs. PA

44

Outline

• Background:
  – Online learning + notation
  – Perceptron
  – Stochastic-gradient descent
  – Passive-aggressive

• Second-Order Algorithms
  – Second-order Perceptron
  – Confidence-Weighted and AROW
  – AdaGrad

• Properties
  – Kernels
  – Analysis

• Empirical Evaluation
  – Synthetic
  – Real Data

45

Geometrical Assumption

• All examples are bounded in a ball of radius R

46

Separability

• There exists a unit vector that classifies the data correctly

• Simple case: positive points, negative points, and a separating hyperplane

• Bound is:

Perceptron’s Mistake Bound

• The number of mistakes the algorithm makes is bounded by $(R/\gamma)^2$, where $\gamma$ is the margin of the separating unit vector

47

48

Geometrical Motivation

SGD on such data

49

50

Outline

• Background:
  – Online learning + notation
  – Perceptron
  – Stochastic-gradient descent
  – Passive-aggressive

• Second-Order Algorithms
  – Second-order Perceptron
  – Confidence-Weighted and AROW
  – AdaGrad

• Properties
  – Kernels
  – Analysis

• Empirical Evaluation
  – Synthetic
  – Real Data

Second Order Perceptron

• Assume all inputs are given
• Compute a "whitening" matrix

• Run the Perceptron on the "whitened" data

• New "whitening" matrix

51

Nicolò Cesa-Bianchi, Alex Conconi, Claudio Gentile, 2005

Second Order Perceptron

• Bound:

• Same simple case:

• Thus

• Bound is :

52

Nicolò Cesa-Bianchi, Alex Conconi, Claudio Gentile, 2005

53

Second Order Perceptron

• If No-Mistake

– Do nothing

• If Mistake

– Update

Nicolò Cesa-Bianchi, Alex Conconi, Claudio Gentile, 2005
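The update equations on this slide were images; as a rough sketch of the second-order Perceptron along the lines of the Cesa-Bianchi, Conconi, and Gentile (2005) paper cited above, with a ridge parameter `a` (details may differ from the slides):

```python
import numpy as np

def second_order_perceptron(X, y, a=1.0):
    """Second-order Perceptron sketch: keep the correlation matrix of
    mistake examples and predict with a 'whitened' weight vector."""
    n, d = X.shape
    v = np.zeros(d)        # sum of y_t x_t over mistake rounds
    S = np.zeros((d, d))   # sum of x_t x_t^T over mistake rounds
    mistakes = 0
    for t in range(n):
        x_t, y_t = X[t], y[t]
        A = a * np.eye(d) + S + np.outer(x_t, x_t)   # include current example
        w = np.linalg.solve(A, v)                    # whitened weight vector
        y_hat = 1.0 if w @ x_t >= 0 else -1.0
        if y_hat != y_t:                             # mistake: absorb the example
            v += y_t * x_t
            S += np.outer(x_t, x_t)
            mistakes += 1
    return v, S, mistakes
```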

SGD on whitened data

54

55

Outline

• Background:
  – Online learning + notation
  – Perceptron
  – Stochastic-gradient descent
  – Passive-aggressive

• Second-Order Algorithms
  – Second-order Perceptron
  – Confidence-Weighted and AROW
  – AdaGrad

• Properties
  – Kernels
  – Analysis

• Empirical Evaluation
  – Synthetic
  – Real Data

56

• The weight vector is a linear combination of examples

• Two rate schedules (among many others):
  – Perceptron algorithm (conservative)
  – Passive-Aggressive

Span-based Update Rules

[Equation annotations: feature value of the input instance; target label (either −1 or +1); learning rate; weight of feature f]

57

Sentiment Classification

• Who needs this Simpsons book? You DOOOOOOOOThis is one of the most extraordinary volumes I've ever encountered … . Exhaustive, informative, and ridiculously entertaining, it is the best accompaniment to the best television show … . … Very highly recommended!

Pang, Lee, Vaithyanathan, EMNLP 2002

58

Sentiment Classification

• Many positive reviews with the word best

Wbest

• Later, a negative review: "boring book – best if you want to sleep in seconds"

• A linear update will reduce both

Wbest Wboring

• But best appeared more often than boring
• The model knows more about best than about boring
• Better to reduce the weights at different rates

Wboring Wbest

59

Natural Language Processing

• Big datasets, large number of features

• Many features are only weakly correlated with target label

• Linear classifiers: features are associated with word-counts

• Heavy-tailed feature distribution

[Figure: feature counts vs. feature rank, showing a heavy-tailed distribution]

Natural Language Processing

61

New Prediction Models

• Gaussian distributions over weight vectors

• The covariance is either full or diagonal

• In NLP we have many features and use a diagonal covariance

62

Classification

• Given a new example

• Stochastic:
  – Draw a weight vector
  – Make a prediction

• Collective:
  – Average weight vector
  – Average margin
  – Average prediction

63

The Margin is Random Variable

• The signed margin is a random 1-d Gaussian variable

• Thus (sketched below):
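The resulting distribution was shown as an equation image; under the Gaussian model $w \sim \mathcal{N}(\mu, \Sigma)$, the standard derivation (my reconstruction, following the CW papers, with $\Phi$ the standard normal CDF) gives:

```latex
\begin{align*}
M &= y\,(w \cdot x), \qquad w \sim \mathcal{N}(\mu, \Sigma) \\
M &\sim \mathcal{N}\!\left(y\,(\mu \cdot x),\; x^{\top} \Sigma\, x\right) \\
\Pr\left[\, M \ge 0 \,\right] &= \Phi\!\left(\frac{y\,(\mu \cdot x)}{\sqrt{x^{\top} \Sigma\, x}}\right)
\end{align*}
```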

64

[Figure: a single linear model (one weight vector) vs. a distribution over linear models, showing an example and the mean weight vector]

65

The algorithm forces most of the weight vectors (under the distribution) to reside in this region

Weight Vector (Version) Space

66

Nothing to do: most of the weight vectors already classify the example correctly

Passive Step

67

The mean is moved beyond the mistake line (large margin)

Aggressive Step

The covariance is shrunk in the direction of the input example

The algorithm projects the current Gaussian distribution on the half-space

68

Projection Update

• Vectors (aka PA):

• Distributions (New Update) :

Confidence Parameter

69

• Sum of two divergences of parameters :

• Convex in both arguments simultaneously

Divergence

Matrix Itakura-Saito Divergence

Mahalanobis Distance

70

Constraint

• Probabilistic constraint (sketched below):

• Equivalent margin constraint:

• Convex in one parameter, concave in the other

• Solutions:
  – Linear approximation
  – Change variables to get a convex formulation
  – Relax (AROW)

Dredze, Crammer, Pereira. ICML 2008

Crammer, Dredze, Pereira. NIPS 2008

Crammer, Dredze, Kulesza. NIPS 2009
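The constraint formulas were lost in transcription; as best I can reconstruct them from the CW papers cited above (with $\eta$ the confidence parameter and $\Phi$ the standard normal CDF):

```latex
\begin{align*}
\text{Probabilistic constraint:}\quad & \Pr_{w \sim \mathcal{N}(\mu, \Sigma)}\left[\, y_t\,(w \cdot x_t) \ge 0 \,\right] \ge \eta \\[4pt]
\text{Equivalent margin constraint:}\quad & y_t\,(\mu \cdot x_t) \ \ge\ \phi\, \sqrt{x_t^{\top} \Sigma\, x_t},
\qquad \phi = \Phi^{-1}(\eta)
\end{align*}
```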

71

Convexity

• Change variables
• Equivalent convex formulation

Crammer, Dredze, Pereira. NIPS 2008

72

AROW

• PA:

• CW :

• Similar update form as CW

Crammer, Dredze, Kulesza. NIPS 2009

73

• Optimization update can be solved analytically

• Coefficients depend on specific algorithm

The Update

Definitions

74

Updates

CW (Linearization)
CW (Change Variables)

AROW
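The per-algorithm coefficients were shown as a table of equations that did not survive transcription. As one concrete instance, here is a sketch of the AROW update following the NIPS 2009 paper cited above, with `r` a regularization parameter; CW uses a similar form with different coefficients:

```python
import numpy as np

def arow(X, y, r=1.0):
    """AROW sketch (full covariance): maintain mean mu and covariance Sigma
    over weight vectors; update only when the hinge loss is positive."""
    n, d = X.shape
    mu = np.zeros(d)
    Sigma = np.eye(d)
    for t in range(n):
        x_t, y_t = X[t], y[t]
        margin = y_t * (mu @ x_t)
        if margin < 1:                          # suffered hinge loss
            Sx = Sigma @ x_t
            beta = 1.0 / (x_t @ Sx + r)         # confidence-based step size
            alpha = (1.0 - margin) * beta
            mu += alpha * y_t * Sx              # move the mean
            Sigma -= beta * np.outer(Sx, Sx)    # shrink variance along x_t
    return mu, Sigma
```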

75

76

Per-feature Learning Rate

Per-feature Learning rate

Reducing the learning rate and the eigenvalues of the covariance matrix

77

Diagonal Matrix

• Given a matrix, define its diagonal part to be the matrix that keeps only the diagonal entries

• Make the matrix diagonal

• Make the inverse diagonal
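As a small NumPy illustration (my own) of the two options for keeping the model diagonal:

```python
import numpy as np

def diag_part(A):
    """Keep only the diagonal entries of A (zero elsewhere)."""
    return np.diag(np.diag(A))

# Two ways to maintain a diagonal covariance in a second-order update:
A = np.array([[2.0, 0.5], [0.5, 1.0]])
option1 = diag_part(A)                                   # make the matrix diagonal
option2 = np.linalg.inv(diag_part(np.linalg.inv(A)))     # make the inverse diagonal
```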

78

Outline

• Background:
  – Online learning + notation
  – Perceptron
  – Stochastic-gradient descent
  – Passive-aggressive

• Second-Order Algorithms
  – Second-order Perceptron
  – Confidence-Weighted and AROW
  – AdaGrad

• Properties
  – Kernels
  – Analysis

• Empirical Evaluation
  – Synthetic
  – Real Data

(Back to) Stochastic Gradient Descent

• Consider the batch problem

• Simple algorithm:
  – Initialize
  – Iterate:
    – Pick a random index
    – Compute the gradient of that example's loss
    – Take a gradient step

79

Adaptive Stochastic Gradient Descent

• Consider the batch problem

• Simple algorithm:
  – Initialize
  – Iterate:
    – Pick a random index
    – Compute the gradient of that example's loss
    – Take an adaptive gradient step
    – Update the adaptation matrix

80

Duchi, Hazan, Singer, 2010; McMahan, Streeter, 2010

Adaptive Stochastic Gradient Descent

• Very general! Can be used to solve problems with various regularizations

• The matrix A can be either full or diagonal

• Comes with convergence and regret bounds

• Similar performance to AROW

Duchi, Hazan, Singer, 2010; McMahan, Streeter, 2010
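The update formulas were images; as a rough sketch of the diagonal variant, here is an AdaGrad-style per-feature step on the hinge loss (my own illustration, not necessarily the slides' exact formulation):

```python
import numpy as np

def adagrad_hinge(X, y, eta=0.1, eps=1e-8, epochs=1):
    """Diagonal AdaGrad sketch on the hinge loss: accumulate squared gradients
    per feature and scale each coordinate's step size accordingly."""
    n, d = X.shape
    w = np.zeros(d)
    G = np.zeros(d)                       # running sum of squared gradients, per feature
    for _ in range(epochs):
        for i in np.random.permutation(n):
            margin = y[i] * (w @ X[i])
            if margin < 1:                # subgradient of the hinge loss is nonzero
                g = -y[i] * X[i]
                G += g * g
                w -= eta * g / (np.sqrt(G) + eps)   # per-feature learning rate
    return w
```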

Adaptive Stochastic Gradient Descent

SGD vs. AdaGrad

Duchi, Hazan, Singer, 2010; McMahan, Streeter, 2010

86

Outline

• Background:
  – Online learning + notation
  – Perceptron
  – Stochastic-gradient descent
  – Passive-aggressive

• Second-Order Algorithms
  – Second-order Perceptron
  – Confidence-Weighted and AROW
  – AdaGrad

• Properties
  – Kernels
  – Analysis

• Empirical Evaluation
  – Synthetic
  – Real Data

87

Kernels

Proof

• Show that we can write

• Induction

88

Proof (cntd)

• By update rule :

• Thus

89

Proof (cntd)

• By update rule :

90

Proof (cntd)

• Thus

91

92

Outline

• Background:
  – Online learning + notation
  – Perceptron
  – Stochastic-gradient descent
  – Passive-aggressive

• Second-Order Algorithms
  – Second-order Perceptron
  – Confidence-Weighted and AROW
  – AdaGrad

• Properties
  – Kernels
  – Analysis

• Empirical Evaluation
  – Synthetic
  – Real Data

94

Statistical Interpretation

• Margin Constraint :

• Distribution over weight-vectors :

• Assume input is corrupted with Gaussian noise

95

Statistical Interpretation

[Figure: version space and input space side by side, showing the mean weight vector with a good and a bad realization, and the input instance with a linear separator]

96

Mistake Bound

• For any reference weight vector, the number of mistakes made by AROW is upper bounded by a quantity that depends on:
  – the set of example indices with a mistake
  – the set of example indices with an update but not a mistake

Orabona and Crammer, NIPS 2010

97

Comment I

• Separable case and no updates:

where

98

Comment II

• For a large regularization parameter the bound becomes:

• When no updates are performed: Perceptron

Bound for Diagonal Algorithm

• The number of mistakes is bounded by

• The bound is low when a feature is either rare or non-informative
• Exactly as in NLP …

Orabona and Crammer, NIPS 2010

100

Outline

• Background:
  – Online learning + notation
  – Perceptron
  – Stochastic-gradient descent
  – Passive-aggressive

• Second-Order Algorithms
  – Second-order Perceptron
  – Confidence-Weighted and AROW
  – AdaGrad

• Properties
  – Kernels
  – Analysis

• Empirical Evaluation
  – Synthetic
  – Real Data

101

Synthetic Data

• 20 features
• 2 informative (rotated skewed Gaussian)
• 18 noisy
• Using a single feature is as good as a random prediction

102

Synthetic Data (cntd.)

Distribution after 50 examples (x1)

103

Synthetic Data (no noise)

[Figure: learning curves comparing Perceptron, PA, SOP, CW-full, and CW-diag]

104

Synthetic Data (10% noise)

105

Outline

• Background:
  – Online learning + notation
  – Perceptron
  – Stochastic-gradient descent
  – Passive-aggressive

• Second-Order Algorithms
  – Second-order Perceptron
  – Confidence-Weighted and AROW
  – AdaGrad

• Properties
  – Kernels
  – Analysis

• Empirical Evaluation
  – Synthetic
  – Real Data

106

Data

• Sentiment
  – Sentiment reviews from 6 Amazon domains (Blitzer et al.)
  – Classify a product review as either positive or negative

• Reuters, pairs of labels
  – Three divisions:
    • Insurance: Life vs. Non-Life; Business Services: Banking vs. Financial; Retail Distribution: Specialist Stores vs. Mixed Retail
  – Bag-of-words representation with binary features

• 20 Newsgroups, pairs of labels
  – Three divisions:
    • comp.sys.ibm.pc.hardware vs. comp.sys.mac.hardware, sci.electronics vs. sci.med, and talk.politics.guns vs. talk.politics.mideast
  – Bag-of-words representation with binary features

107

Experimental Design

• Online to batch:
  – Multiple passes over the training data
  – Evaluate on a different test set after each pass
  – Compute error/accuracy

• Set the parameter using held-out data
• 10-fold cross-validation
• ~2000 instances per problem
• Balanced class labels

108

Results vs Online- Sentiment

• StdDev and Variance – always better than baseline
• Variance – 5/6 significantly better

109

Results vs Online – 20NG + Reuters

• StdDev and Variance – always better than baseline
• Variance – 4/6 significantly better

110

Results vs Batch - Sentiment

• Always better than batch methods
• 3/6 significantly better

111

Results vs Batch - 20NG + Reuters

• 5/6 better than batch methods
• 3/5 significantly better, 1/1 significantly worse

112

113

114

115

Results - Sentiment

• CW is better (5/6 cases), statistically significant (4/6)

• CW benefits less from many passes

[Figure: accuracy vs. passes of training data, comparing PA and CW on each dataset]

116

Results – Reuters + 20NG

• CW is better (5/6 cases), statistically significant (4/6)

• CW benefits less from many passes

[Figure: accuracy vs. passes of training data, comparing PA and CW on each dataset]

117

Error Reduction by Multiple Passes

• PA benefits more from multiple passes (8/12)

• Amount of benefit is data dependent

Bayesian Logistic Regression

BLR

• Covariance

• Mean

CW/AROW

• Covariance

• Mean

118

T. Jaakkola and M. Jordan. 1997

Based on the variational approximation

Conceptually decoupled update

Function of the margin/hinge-loss

Algorithms Summary

1st Order                   2nd Order
Perceptron                  SOP
PA                          CW + AROW
SGD                         AdaGrad
Logistic Regression (LR)    BLR

• Different motivation, similar algorithms

• All algorithms can be kernelized

• Work well for data that is NOT isotropic / symmetric

• State-of-the-art results in various domains

• Accompanied by theory

119
