Source: alex.smola.org/drafts/covariate.pdf


Covariate Shift Correction & Propensity Scores

Alex J. Smola

Monday, September 6, 2010

The Problem... aka two problems and one hammer ...


Covariate Shift

• Basic setting
  • Training data is drawn from $p(x)\,p(y|x)$
  • Test data is drawn from $q(x)\,p(y|x)$
• Examples
  • Training data from last week, deploy today
  • Training data for the USA market, deploy in the UK
  • Training data for mithril, deploy on axonite
  • Speech recognition: adapt to the speaker
• No labels on the test set, but $p(y|x) = q(y|x)$


Covariate Shift

• Importance Sampler Identity
  $$E_{x \sim q(x)}[l(x, y, f(x))] = E_{x \sim p(x)}\left[\frac{dq(x)}{dp(x)}\, l(x, y, f(x))\right]$$
• Radon-Nikodym derivative (need measure theory to avoid $\infty/\infty$ division)
  $$\beta(x) := \frac{dq(x)}{dp(x)}$$
• Reweighted Empirical Risk
  $$\text{minimize}_f \sum_i \beta(x_i)\, l(x_i, y_i, f(x_i)) + \lambda\,\Omega[f]$$
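To make the recipe concrete, here is a minimal sketch of reweighted risk minimization with logistic loss, assuming the weights β(x_i) have already been estimated (passed in as an array) and features are plain vectors; `fit_weighted` and its hyperparameters are illustrative names, not from the slides.

```python
import numpy as np

def weighted_logistic_risk(theta, X, y, beta, lam):
    """Reweighted empirical risk:
    sum_i beta_i * log(1 + exp(-y_i <x_i, theta>)) + lam * ||theta||^2."""
    margins = y * (X @ theta)
    return np.dot(beta, np.log1p(np.exp(-margins))) + lam * np.dot(theta, theta)

def fit_weighted(X, y, beta, lam=1e-2, lr=0.1, steps=500):
    """Plain gradient descent on the reweighted risk (SGD would do as well)."""
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        margins = y * (X @ theta)
        # d/dtheta of log(1 + exp(-m_i)) is -y_i * x_i * sigmoid(-m_i)
        g = -(beta * y / (1.0 + np.exp(margins))) @ X + 2 * lam * theta
        theta -= lr * g / len(y)
    return theta
```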


Propensity Scores

• "What if" questions in experiments
  • Display ad a for user u; what about a'?
  • New feature for advertisers, but uneven opt-in
  • Efficacy of medical treatment (stent vs. drugs for coronary artery problems)
• More than 2 choices
  • Customized module / page layout
  • Real-valued dosage level of a drug


Propensity Scores

• Basic goal: changing the conditioning
  $$E_q[f(x)] = E_p[\beta(x) f(x)] \longrightarrow \frac{1}{m}\sum_{i=1}^m \beta(x_i)\, f_i$$
• Improvement estimation
  $$E_q[f(x) - g(x)] = E_p[\beta(x) f(x)] - E_q[g(x)] \longrightarrow \frac{1}{m}\sum_{i=1}^m \beta(x_i)\, f_i - \frac{1}{n}\sum_{i=1}^n g_i$$
  This yields the improvement when drawing from q.
• Doubly robust estimation (variance reduction): estimate $\hat{f}$; we can evaluate the estimate on q
  $$E_q[f(x)] = E_q[f(x) - \hat{f}(x)] + E_q[\hat{f}(x)] = E_p\big[\beta(x)\,[f(x) - \hat{f}(x)]\big] + E_q[\hat{f}(x)]$$
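A minimal sketch of these estimators, assuming β has been estimated and some regression model f̂ of the outcome is available; the function names and argument conventions (`f_p`: outcomes observed under p, `fhat_p`/`fhat_q`: model predictions on p- and q-samples) are illustrative.

```python
import numpy as np

def direct_estimate(beta, f_p):
    """E_q[f] ~ (1/m) sum_i beta(x_i) f_i, using outcomes observed under p."""
    return np.mean(beta * f_p)

def doubly_robust_estimate(beta, f_p, fhat_p, fhat_q):
    """E_q[f] = E_p[beta * (f - fhat)] + E_q[fhat]: the model fhat absorbs
    most of the signal, and the reweighted residual corrects its bias,
    which reduces the variance of the plain reweighted estimate."""
    return np.mean(beta * (f_p - fhat_p)) + np.mean(fhat_q)
```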


The goal

• Estimate the Radon-Nikodym derivative
  $$\beta(x) := \frac{dq(x)}{dp(x)}$$
  based on samples from p and q.


Logistic Regression... aka the idiot-proof simple method ...


Logistic Regression 101

• Logistic transfer function
  $$p(y|x, f) = \frac{1}{1 + e^{-y f(x)}}$$
• Samples
  $$Z = \{(x_1, y_1), \ldots, (x_m, y_m)\} \text{ where } (x_i, y_i) \sim p(y, x)$$
• Risk minimization
  $$\text{minimize}_f \sum_{i=1}^m \log\left[1 + \exp(-y_i f(x_i))\right] + \lambda\,\Omega[f]$$


Logistic to Radon-Nikodym

• Key idea: generate an artificial distribution from p and q, labeling samples from q with $y = 1$ and samples from p with $y = -1$:
  $$\rho(x, y) := \tfrac{1}{2}\delta_{y,1}\, q(x) + \tfrac{1}{2}\delta_{y,-1}\, p(x)$$
• Connection to the Radon-Nikodym derivative
  $$\rho(y|x) = \frac{1}{1 + e^{-y f(x)}} \implies \beta(x) = \frac{\rho(1|x)}{\rho(-1|x)} = \frac{1 + e^{f(x)}}{1 + e^{-f(x)}} = e^{f(x)}$$
  where $f(x) = \langle \phi(x), \theta \rangle$.
• Efficient optimization in the primal space: stochastic gradient descent in f (VW, Dios)
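A sketch of this recipe using scikit-learn's logistic regression rather than VW: label q-samples as 1 and p-samples as 0, and the classifier's odds give β up to a sample-size correction. `estimate_beta` is an illustrative name, not from the slides.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def estimate_beta(X_p, X_q):
    """Fit a classifier to distinguish samples of q (label 1) from p (label 0);
    beta(x) is the odds ratio, corrected for unequal sample sizes."""
    X = np.vstack([X_p, X_q])
    y = np.concatenate([np.zeros(len(X_p)), np.ones(len(X_q))])
    clf = LogisticRegression(max_iter=1000).fit(X, y)

    def beta(X_new):
        proba = clf.predict_proba(X_new)
        # odds of "drawn from q" vs "drawn from p"; the n_p/n_q factor
        # restores the 1/2-1/2 mixture assumed by rho(x, y)
        return (proba[:, 1] / proba[:, 0]) * (len(X_p) / len(X_q))

    return beta
```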

Moment Matching Theorem

• Maximum Entropy Estimation (primal)
  $$\text{maximize}_{p \in \mathcal{P}} \sum_{i=1}^m H(y|x_i) \text{ subject to } \left\| \sum_{i=1}^m \left( \phi(x_i, y_i) - E_{y|x_i}[\phi(x_i, y)] \right) \right\|_2 \le \epsilon$$
• Maximum a Posteriori Estimation (dual)
  $$\text{minimize}_\theta \sum_{i=1}^m \left[ g(\theta|x_i) - \langle \phi(x_i, y_i), \theta \rangle \right] + \frac{\lambda}{2}\|\theta\|^2$$
  Here g is the conditional log-partition function.
• Proof: analogous to Altun & Smola, 2006, via the Fenchel Duality Theorem and operators.


Mean Operators... aka Fortet & Mourier 1946 revisited ...


Mean operators

• Expectation map
  $$f \mapsto E_{x \sim p}[f(x)] = E_{x \sim p}[\langle \phi(x), f \rangle] = \langle f, E_{x \sim p}[\phi(x)] \rangle =: \langle f, \mu[p] \rangle$$
• Empirical average
  $$X \mapsto \mu[X] := \frac{1}{m}\sum_{i=1}^m \phi(x_i), \text{ hence } \langle f, \mu[X] \rangle = \frac{1}{m}\sum_{i=1}^m f(x_i)$$
• Convergence theorem (Altun & Smola, 2006)
  $$\Pr\left\{ \|\mu[X] - \mu[p]\| > \epsilon + \rho \right\} \le e^{-n \epsilon^2 R^{-2}}$$
  where $\rho^2 = n^{-1} E_{x, x' \sim p}\left[k(x, x) - k(x, x')\right]$.
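As a quick illustration, the squared RKHS distance between two empirical mean operators can be computed with the kernel trick alone, without explicit features; the RBF kernel choice and the function names below are assumptions for this sketch.

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    """Gaussian RBF kernel matrix k(a, b) = exp(-||a - b||^2 / (2 sigma^2))."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma**2))

def mean_embedding_distance(X, Xq, sigma=1.0):
    """||mu[X] - mu[Xq]||^2 in the RKHS, expanded via the kernel trick:
    mean(Kxx) - 2 * mean(Kxq) + mean(Kqq)."""
    Kxx = rbf_kernel(X, X, sigma)
    Kxq = rbf_kernel(X, Xq, sigma)
    Kqq = rbf_kernel(Xq, Xq, sigma)
    return Kxx.mean() - 2 * Kxq.mean() + Kqq.mean()
```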


Mean operators

• Key idea
  • Have empirical mean operators for p and q
  • Find a reweighted combination from X to X′:
  $$\text{minimize}_\beta \left\| \frac{1}{m}\sum_{i=1}^m \beta_i \phi(x_i) - \frac{1}{m'}\sum_{i=1}^{m'} \phi(x'_i) \right\|$$
• By Cauchy-Schwarz this gives the bound
  $$\left| \frac{1}{m}\sum_{i=1}^m \beta_i f(x_i) - \frac{1}{m'}\sum_{i=1}^{m'} f(x'_i) \right| \le \|f\| \left\| \frac{1}{m}\sum_{i=1}^m \beta_i \phi(x_i) - \frac{1}{m'}\sum_{i=1}^{m'} \phi(x'_i) \right\|$$


Guarantees

• The Radon-Nikodym derivative is the unique solution when plugging in the true distributions.
• For empirical averages the approximation error is small (upper bound obtained by plugging in the RND).
• We can find it by optimization:
  $$\Pr\left\{ \left\| \frac{1}{m}\sum_{i=1}^m \beta_i \phi(x_i) - \frac{1}{m'}\sum_{i=1}^{m'} \phi(x'_i) \right\| > \epsilon + \rho \right\} \le \exp\left( -\frac{\bar{m}\,\epsilon^2}{R^2} \right)$$
  where $\frac{1}{\bar{m}} = \frac{B^2}{m} + \frac{1}{m'}$ and $\rho \le R/\sqrt{\bar{m}}$.


Optimization template

• Constrained problem
  $$\text{minimize}_\beta\ \Omega[\beta] \text{ subject to } \left\| \frac{1}{m}\sum_{i=1}^m \beta_i \phi(x_i) - \frac{1}{m'}\sum_{i=1}^{m'} \phi(x'_i) \right\| \le \epsilon$$
• Quadratic penalty: Kernel Mean Matching
• L-infinity penalty: Bounded Mean Matching
• Entropy penalty: Entropy Mean Matching
  (Sugiyama, Bickel, Brefeld, Tsuboi, ...)


Optimization Problems... applied duality theory ...


Quadratic Program

• Quadratic penalty on the RND (this favors a large effective sample size)
  $$\text{minimize}_\alpha\ \tfrac{1}{2}\alpha^\top (K + \lambda I)\,\alpha - \alpha^\top u \text{ subject to } \alpha^\top \mathbf{1} = 1 \text{ and } \alpha_i \ge 0$$
  This looks like a single-class SVM.
• Bounded range of the RND (this bounds the variance in the McDiarmid tail bound)
  $$\text{minimize}_\alpha\ \tfrac{1}{2}\alpha^\top K \alpha - \alpha^\top u \text{ subject to } \alpha^\top \mathbf{1} = 1 \text{ and } \alpha_i \in [0, \lambda]$$
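A sketch of the first program using a generic constrained solver (scipy's SLSQP) rather than a dedicated QP solver; `kernel_mean_matching` and the final rescaling convention β_i = m·α_i (so the weights average to 1) are illustrative choices, not from the slides.

```python
import numpy as np
from scipy.optimize import minimize

def kernel_mean_matching(K, u, lam=1e-3):
    """Solve min_a 0.5 a'(K + lam I)a - a'u  s.t.  sum(a) = 1, a >= 0.
    K: kernel matrix on training points; u_i: mean of k(x_i, .) over
    the test sample."""
    m = len(u)
    Kr = K + lam * np.eye(m)
    obj = lambda a: 0.5 * a @ Kr @ a - a @ u
    grad = lambda a: Kr @ a - u
    res = minimize(obj, np.full(m, 1.0 / m), jac=grad, method="SLSQP",
                   bounds=[(0.0, None)] * m,
                   constraints=[{"type": "eq", "fun": lambda a: a.sum() - 1.0}])
    return res.x * m  # beta_i = m * alpha_i, so the weights average to 1
```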


Quadratic Program

• Problem: the optimization problem is cubic in the sample size.
• Solution: find the (ante-)primal problem and solve it via SGD
  $$\text{minimize}_{\theta, b}\ \tfrac{1}{2}\|\theta\|^2 + b + \frac{1}{n}\sum_{i=1}^n \left( u_i - \langle \phi(x_i), \theta \rangle - b \right)_+^2$$
  $$\text{minimize}_{\theta, b}\ \tfrac{1}{2}\|\theta\|^2 + b + \lambda \sum_{i=1}^n \left( u_i - \langle \phi(x_i), \theta \rangle - b \right)_+$$
  where $u_i = \frac{1}{n}\sum_{j=1}^n k(x_i, x'_j)$ and the $\beta_i$ are recovered via subdifferentials.
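A sketch of SGD on the second (hinge) primal with an explicit feature map; the per-sample subgradients follow directly from the objective, while the step size, epoch count, and names are illustrative assumptions.

```python
import numpy as np

def primal_sgd(Phi, u, lam=0.1, lr=0.01, epochs=50, seed=0):
    """SGD on: 0.5 ||theta||^2 + b + lam * sum_i (u_i - <phi_i, theta> - b)_+ .
    Phi: n x d matrix of explicit features phi(x_i); u as on the slide."""
    rng = np.random.default_rng(seed)
    n, d = Phi.shape
    theta, b = np.zeros(d), 0.0
    for _ in range(epochs):
        for i in rng.permutation(n):
            slack = u[i] - Phi[i] @ theta - b
            # subgradient of the hinge term at sample i (active if slack > 0),
            # with the 0.5||theta||^2 + b terms split evenly across samples
            g_theta = theta / n - (lam * Phi[i] if slack > 0 else 0.0)
            g_b = 1.0 / n - (lam if slack > 0 else 0.0)
            theta -= lr * g_theta
            b -= lr * g_b
    return theta, b
```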


Convex Program

• Minimum KL-divergence regularization, dual problem
  $$\text{minimize}_\theta\ g(\theta) - \langle \theta, \mu \rangle + \frac{1}{2\lambda}\|\theta\|^2 \text{ with } g(\theta) = \log \sum_{i=1}^n e^{\langle \phi(x_i), \theta \rangle}$$
  where $\beta(x) = e^{\langle \phi(x), \theta \rangle - g(\theta)}$.
• Problem: computing the normalization g is expensive.
• Solutions
  • MCMC sampler for the gradient of g
  • Retain an estimate of g (update parts frequently)
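A sketch with an explicit feature map, computing g and its gradient exactly via a softmax over the training sample; for large n one would switch to the MCMC or cached estimates mentioned above. `fit_entropy_weights` and its hyperparameters are illustrative.

```python
import numpy as np
from scipy.special import logsumexp

def fit_entropy_weights(Phi_p, Phi_q, lam=1.0, lr=0.1, steps=500):
    """Gradient descent on g(theta) - <theta, mu> + (1/(2*lam))||theta||^2,
    with g(theta) = log sum_i exp(<phi(x_i), theta>) over the training
    sample and mu the empirical mean of phi over the test sample."""
    mu = Phi_q.mean(axis=0)
    theta = np.zeros(Phi_p.shape[1])
    for _ in range(steps):
        scores = Phi_p @ theta
        w = np.exp(scores - logsumexp(scores))   # softmax over training points
        grad = Phi_p.T @ w - mu + theta / lam    # grad of g is E_w[phi(x)]
        theta -= lr * grad
    scores = Phi_p @ theta
    beta = np.exp(scores - logsumexp(scores))    # beta(x_i) = e^{<phi,theta> - g}
    return len(Phi_p) * beta, theta              # rescaled to average 1
```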


Conclusions... good/bad news ...


Experimental results

• All methods work well (much better than doing nothing).
• Online optimization is effective.
• Logistic regression works very well.
• Entropy regularization works best (even though we have theory for the norms but not for entropy).

