Transfer Learning
Weinan Zhang, Shanghai Jiao Tong University (http://wnzhang.net)
CS420, Machine Learning, Lecture 13
http://wnzhang.net/teaching/cs420/index.html


Page 1: Transfer Learning (wnzhang.net/teaching/past-courses/cs420-2017/)

Transfer Learning
Weinan Zhang

Shanghai Jiao Tong University
http://wnzhang.net

CS420, Machine Learning, Lecture 13

http://wnzhang.net/teaching/cs420/index.html

Page 2

Transfer Learning Materials

• Prof. Qiang Yang
  • Chair Professor, Department Head of CSE, HKUST
  • http://www.cs.ust.hk/~qyang/
  • S. J. Pan, Q. Yang. A survey on transfer learning. IEEE TKDE 2010.
  • 2800+ citations on this survey paper (May 2017)

Our course on TL is mainly based on the materials from Prof. Qiang Yang and his students

Page 3

Machine Learning Process

• Assumption: training and test data have the same distribution

[Pipeline figure: Raw Data → Data Formalization → Training Data → Model → Evaluation, with Test Data also formalized from raw data and fed into Evaluation]

Training objective:

$$\min_\theta \frac{1}{N} \sum_{(x_i, y_i) \in D_\text{train}} L(y_i, f_\theta(x_i)) + \lambda \|\theta\|_2^2$$

Test error:

$$\text{Test Error} = \frac{1}{N} \sum_{(x_i, y_i) \in D_\text{test}} L(y_i, f_\theta(x_i))$$
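The objective above can be sketched numerically; a minimal plain-Python sketch with made-up 1-D data and a grid search standing in for the optimiser (all data points and constants here are hypothetical):

```python
# Toy 1-D linear model f_theta(x) = theta * x with squared loss
# (all data points and constants below are made up).
train = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]   # (x_i, y_i)
test = [(1.5, 3.0), (2.5, 5.1)]
lam = 0.01                                      # L2 regularisation strength

def objective(theta, data, lam=0.0):
    n = len(data)
    empirical = sum((y - theta * x) ** 2 for x, y in data) / n
    return empirical + lam * theta ** 2

# Grid search stands in for the minimisation over theta.
theta_star = min((t / 100 for t in range(400)), key=lambda t: objective(t, train, lam))

test_error = objective(theta_star, test)        # no regulariser at test time
```

With training and test data drawn from the same distribution, a small training objective translates into a small test error, which is exactly the assumption transfer learning relaxes.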

Page 4

Practical Cases
• Data distributions p(x) change across different domains or vary over time:

$$\mathcal{X}_S \neq \mathcal{X}_T \quad \text{or} \quad p_S(x) \neq p_T(x)$$

[Figure: real images vs. cartoon images]

Page 5

Practical Cases

• Data dependencies p(y|x) can also differ:

$$\mathcal{Y}_S \neq \mathcal{Y}_T \quad \text{or} \quad p_S(y|x) \neq p_T(y|x)$$

[Figure: apple recognition vs. pear recognition]

Page 6

Transfer Learning

Page 7

Notation and Definition of TL

• Definition
  • Given a source domain $D_S$ with corresponding learning task $T_S$, and a target domain $D_T$ with corresponding learning task $T_T$,
  • transfer learning is the process of improving the target predictive function $f_T(\cdot)$ by using the related information from $D_S$ and $T_S$, where $D_S \neq D_T$ or $T_S \neq T_T$

• Notation
  • A domain: $D = \{\mathcal{X}, p(x)\}$
    • Feature space $\mathcal{X}$
    • Data distribution $p(x)$
  • A task: $T = \{\mathcal{Y}, f(\cdot)\}$
    • Label space $\mathcal{Y}$
    • Objective predictive function $f(\cdot)$

Page 8

Explanation
• $\mathcal{X}_S \neq \mathcal{X}_T$: heterogeneous transfer learning
  • Two sets of documents are described in different languages
• $p_S(x) \neq p_T(x)$: domain adaptation
  • Two sets of documents focus on different topics
• $\mathcal{Y}_S \neq \mathcal{Y}_T$
  • Source has two classes: positive or negative; target adds one class: neutral
• $p_S(y|x) \neq p_T(y|x)$
  • A word can have different meanings in two domains

Page 9

Categorization of Transfer Learning

Pan, Sinno Jialin, and Qiang Yang. "A survey on transfer learning." IEEE Transactions on knowledge and data engineering 22.10 (2010): 1345-1359.

Page 10

Transfer Learning Settings
• Homogeneous/heterogeneous transfer learning

[Taxonomy figure: transfer learning splits by feature space into homogeneous vs. heterogeneous transfer learning, and by supervision into supervised, semi-supervised, and unsupervised transfer learning]

Page 11

Transfer Learning Methods
• Instance Transfer
  • Reweight instances of target data according to source
• Feature Transfer
  • Map features of source and target data into a common space
• Parameter Transfer
  • Learn target model parameters according to the source model

Relational approaches are relatively unpopular, thus omitted in this talk

Page 12

Transfer Learning Methods
• Instance Transfer
  • Reweight instances of target data according to source
• Feature Transfer
  • Map features of source and target data into a common space
• Parameter Transfer
  • Learn target model parameters according to the source model

Relational approaches are relatively unpopular, thus omitted in this talk

Page 13

Instance-based Transfer Learning
• General assumption
  • Source and target domains have many overlapping features or even share the same feature space: $\mathcal{X}_S \simeq \mathcal{X}_T$
  • Label space should be the same: $\mathcal{Y}_S \simeq \mathcal{Y}_T$
• Example applications
  • Electronic medical records across different departments
  • Sentiment analysis over different topics

Page 14

Instance TL Case 1: Domain Adaptation

• Problem setting
  • Given source domain labeled data $D_S = \{x_{S_i}, y_{S_i}\}_{i=1}^{n_S}$ and target domain data $D_T = \{x_{T_i}\}_{i=1}^{n_T}$
  • Learn $f_T$ such that the loss on target data $\sum_i L(f_T(x_{T_i}), y_{T_i})$ is small, where $y_{T_i}$ is unknown

• Assumption
  • The same label space: $\mathcal{Y}_S = \mathcal{Y}_T$
  • The same dependency: $p(y_S|x_S) = p(y_T|x_T)$
  • (Almost) the same feature space: $\mathcal{X}_S \simeq \mathcal{X}_T$
  • Different data distributions: $p_S(x) \neq p_T(x)$

Page 15

Importance Sampling for Domain Adaptation

• Importance sampling:

$$\begin{aligned}
\theta^* &= \arg\min_\theta \mathbb{E}_{(x,y)\sim p_T}[L(y, f_\theta(x))] \\
&= \arg\min_\theta \int_{(x,y)} p_T(x)\, L(y, f_\theta(x))\,dx \\
&= \arg\min_\theta \int_{(x,y)} p_S(x) \frac{p_T(x)}{p_S(x)}\, L(y, f_\theta(x))\,dx \\
&= \arg\min_\theta \mathbb{E}_{(x,y)\sim p_S}\Big[\frac{p_T(x)}{p_S(x)}\, L(y, f_\theta(x))\Big]
\end{aligned}$$

• Re-weight each instance by $\beta(x) = \dfrac{p_T(x)}{p_S(x)}$
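The re-weighting identity above can be checked numerically. A minimal sketch with a hypothetical 1-D covariate shift where both densities are known Gaussians, so β(x) is available in closed form; the weighted average over source samples recovers a target-domain expectation:

```python
import math, random

random.seed(0)

def normal_pdf(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# Hypothetical covariate shift: source p_S = N(0,1), target p_T = N(1,1).
xs = [random.gauss(0.0, 1.0) for _ in range(20000)]   # samples from the source only

def beta(x):
    # Importance weight beta(x) = p_T(x) / p_S(x), known in closed form here.
    return normal_pdf(x, 1.0, 1.0) / normal_pdf(x, 0.0, 1.0)

# Self-normalised weighted average of x over source samples
# approximates the target-domain mean E_{p_T}[x] = 1.
weights = [beta(x) for x in xs]
est = sum(w * x for w, x in zip(weights, xs)) / sum(weights)
```

In practice neither density is known, which is why the next slides estimate the ratio directly.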

Page 16

Importance Sampling for Domain Adaptation

• How to estimate $\beta(x) = \dfrac{p_T(x)}{p_S(x)}$?
  • A simple solution would be to first estimate $p_S(x)$ and $p_T(x)$ respectively, and then calculate $\beta(x)$
    • May suffer from a huge-variance problem
  • A more practical solution is to estimate the ratio $\dfrac{p_T(x)}{p_S(x)}$ directly

Page 17

Importance Sampling for Domain Adaptation

• Imagine a rejection sampling process, and view the target domain as samples selected from the source domain with a selection variable $s \in \{0, 1\}$
• Probability density function (p.d.f.) relationship:

$$p_T(x) \propto p_S(x)\, p(s=1|x)$$

• And we estimate $p(s=1|x)$ with a binary classification model:

$$\beta(x) = \frac{p_T(x)}{p_S(x)} \propto p(s=1|x)$$

Zadrozny, Learning and Evaluating Classifiers under Sample Selection Bias, ICML 2004

Page 18

Importance Sampling for Domain Adaptation

• Imagine a rejection sampling process, and view the target domain as samples selected from the source domain with a selection variable $s \in \{0, 1\}$
• Estimate $p(s=1|x)$ with a binary classification model
  • Label instances from the target domain as 1
  • Label instances from the source domain as 0

$$\beta(x) = \frac{p_T(x)}{p_S(x)} \propto p(s=1|x)$$

Zadrozny, Learning and Evaluating Classifiers under Sample Selection Bias, ICML 2004
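A minimal sketch of this classifier trick, assuming made-up 1-D Gaussian source/target samples and a hand-rolled logistic regression (not the exact setup of the cited paper): with equal sample sizes, exp(w·x + b) recovers the density ratio up to a constant, and for these Gaussians the true log-ratio is 2x − 2, so w should land near 2.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D data: source ~ N(0,1) gets label s=0, target ~ N(2,1) gets s=1.
x_src = rng.normal(0.0, 1.0, 2000)
x_tgt = rng.normal(2.0, 1.0, 2000)
x = np.concatenate([x_src, x_tgt])
s = np.concatenate([np.zeros(2000), np.ones(2000)])

# Minimal logistic regression p(s=1|x) = sigmoid(w*x + b), fitted by gradient descent.
w, b = 0.0, 0.0
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-(w * x + b)))
    w -= 0.1 * np.mean((p - s) * x)
    b -= 0.1 * np.mean(p - s)

def beta(x):
    # beta(x) is proportional to p(s=1|x) / p(s=0|x) = exp(w*x + b); with equal
    # sample sizes the class prior cancels up to a constant factor.
    return np.exp(w * x + b)
```

Any off-the-shelf probabilistic classifier can play the same role; the only requirement is a calibrated estimate of p(s=1|x).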

Page 19

Importance Sampling for Domain Adaptation

• How to estimate $\beta(x) = \dfrac{p_T(x)}{p_S(x)}$?
• Build the estimator with a list of basis functions:

$$\hat{\beta}(x) = \sum_{l=1}^{b} \alpha_l \psi_l(x)$$

• The estimated target p.d.f.: $\hat{p}_T(x) = \hat{\beta}(x)\, p_S(x)$
• Minimize KL divergence:

$$\min_{\{\alpha_l\}_{l=1}^b} \text{KL}\big[p_T(x) \,\|\, \hat{p}_T(x)\big]$$

• Or minimize squared error:

$$\min_{\{\alpha_l\}_{l=1}^b} \int_x \big(\hat{\beta}(x) - \beta(x)\big)^2 p_S(x)\,dx$$

Sugiyama et al., Direct Importance Estimation with Model Selection and Its Application to Covariate Shift Adaptation, NIPS 2007
Kanamori et al., A Least-squares Approach to Direct Importance Estimation, JMLR 2009
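A rough sketch of the least-squares direct estimator, with hypothetical 1-D Gaussian samples, Gaussian basis functions centred on a few target points, and a small ridge term for numerical stability; details such as basis choice and regularisation tuning differ in the cited papers:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical 1-D samples: source N(0,1), target N(0.5,1).
x_s = rng.normal(0.0, 1.0, 3000)
x_t = rng.normal(0.5, 1.0, 3000)

# Gaussian basis functions psi_l centred on a few target points.
centres = x_t[:20]
width = 1.0

def psi(x):
    # Returns the basis matrix with shape (len(x), b).
    return np.exp(-(x[:, None] - centres[None, :]) ** 2 / (2 * width ** 2))

# Least-squares fit of beta_hat = psi @ alpha:
# expanding the squared-error objective gives 0.5 a^T H a - h^T a + const,
# with H = E_S[psi psi^T] and h = E_T[psi], both estimable from samples.
H = psi(x_s).T @ psi(x_s) / len(x_s)
h = psi(x_t).mean(axis=0)
alpha = np.linalg.solve(H + 1e-2 * np.eye(len(centres)), h)

def beta_hat(x):
    return psi(x) @ alpha           # estimated density ratio p_T(x) / p_S(x)

# Sanity check: E_S[beta_hat(x)] should be close to 1, since E_S[beta] = 1.
mean_ratio = beta_hat(x_s).mean()
```

The key point is that the squared-error objective only needs expectations under the two sample sets, so no intermediate density estimate is required.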

Page 20

Unbiased Training in Display Advertising
• In display advertising, the labeled data is observed by an advertiser only when she wins the auction; the observed data is thus biased.

Weinan Zhang et al. Bid-aware Gradient Descent for Unbiased Learning with Censored Data in Display Advertising. KDD 16

Page 21

Unbiased Learning Framework
• Data observation process: a bid request leads to a bid, and the data is observed only if the auction is won
• Importance sampling corrects for this censored observation

Weinan Zhang et al. Bid-aware Gradient Descent for Unbiased Learning with Censored Data in Display Advertising. KDD 16

Page 22

Performance Comparison on Yahoo! DSP

• A/B Testing on Yahoo! United States
  • 10.3% more clicks
  • 42.8% higher CTR
  • 9.3% lower eCPC
  • 2.97% AUC lift

Weinan Zhang et al. Bid-aware Gradient Descent for Unbiased Learning with Censored Data in Display Advertising. KDD 16

Page 23

Instance TL Case 2: Labels in 2 Domains

• Problem setting
  • Given source domain labeled data $D_S = \{x_{S_i}, y_{S_i}\}_{i=1}^{n_S}$
  • and very limited target domain labeled data $D_T = \{x_{T_i}, y_{T_i}\}_{i=1}^{n_T}$
  • Learn $f_T$ such that the loss on target data $\sum_i L(f_T(x_{T_i}), y_{T_i})$ is small

• Assumption
  • The same label space: $\mathcal{Y}_S = \mathcal{Y}_T$
  • Different dependencies: $p(y_S|x_S) \neq p(y_T|x_T)$
  • (Almost) the same feature space: $\mathcal{X}_S \simeq \mathcal{X}_T$
  • Different data distributions: $p_S(x) \neq p_T(x)$

Page 24

TrAdaBoost
• For each boosting iteration
  • Use the same strategy as AdaBoost to update the weights of target domain data
  • Use a new mechanism to decrease the weights of misclassified source domain data

Wenyuan Dai et al., Boosting for Transfer Learning, ICML 2007

Page 25

TrAdaBoost

• Source/target domain data D (combined):

$$x_i = \begin{cases} x_{S_i}, & i = 1, \ldots, n \\ x_{T_i}, & i = n+1, \ldots, n+m \end{cases}$$

• Initialize the weight vector $w^1$
• For t = 1, …, N rounds
  • Set $p^t = w^t / \big(\sum_{i=1}^{n+m} w_i^t\big)$
  • Learn the model $h_t$ based on the weighted data (D, $p^t$)
  • Calculate the error on target data:

$$\epsilon_t = \frac{\sum_{i=n+1}^{n+m} w_i^t \cdot |h_t(x_i) - c(x_i)|}{\sum_{i=n+1}^{n+m} w_i^t}$$

  • Set $\beta_t = \epsilon_t / (1 - \epsilon_t) < 1$ and $\beta = 1/\big(1 + \sqrt{2 \ln n / N}\big)$
  • Update the new weight vector
• Output the model $h_N$

Wenyuan Dai et al., Boosting for Transfer Learning, ICML 2007

Page 26

Distant Domain Transfer Learning

Ben Tan and Qiang Yang et al. Distant Domain Transfer Learning. AAAI 2017.

Page 27

Problem Setting
• Sufficient source domain data
• Limited target domain data
• A mixture of unlabeled data from multiple intermediate domains, which is large enough
• Homogeneous: same feature space but different distributions

Ben Tan and Qiang Yang et al. Distant Domain Transfer Learning. AAAI 2017.

Page 28

Selective Learning Algorithm

Ben Tan and Qiang Yang et al. Distant Domain Transfer Learning. AAAI 2017.

Page 29

Selective Learning Algorithm
• Instance selection via reconstruction error by an autoencoder (encoder $f_e$, decoder $f_d$)
  • Selection indicators $v_S^i, v_I^j \in \{0, 1\}$
  • Regularization term $R(v_S, v_I)$

• Overall objective function:

$$J_1(f_e, f_d, v_S, v_I) = \frac{1}{n_S}\sum_{i=1}^{n_S} v_S^i \left\| \hat{x}_S^i - x_S^i \right\|_2^2 + \frac{1}{n_I}\sum_{i=1}^{n_I} v_I^i \left\| \hat{x}_I^i - x_I^i \right\|_2^2 + \frac{1}{n_T}\sum_{i=1}^{n_T} \left\| \hat{x}_T^i - x_T^i \right\|_2^2 + R(v_S, v_I)$$

$$R(v_S, v_I) = -\frac{\lambda_S}{n_S}\sum_{i=1}^{n_S} v_S^i - \frac{\lambda_I}{n_I}\sum_{i=1}^{n_I} v_I^i$$

• Incorporation of label information, with a classifier $f_c$ on the hidden codes $h$ and an entropy function $g$ on the unlabeled intermediate data:

$$J_2(f_c, f_e, f_d) = \frac{1}{n_S}\sum_{i=1}^{n_S} v_S^i\, l(y_S^i, f_c(h_S^i)) + \frac{1}{n_T}\sum_{i=1}^{n_T} l(y_T^i, f_c(h_T^i)) + \frac{1}{n_I}\sum_{i=1}^{n_I} v_I^i\, g(f_c(h_I^i))$$

Page 30

Selective Learning Algorithm
• Update Θ: back propagation
• Update v

Ben Tan and Qiang Yang et al. Distant Domain Transfer Learning. AAAI 2017.

Page 31

DDTL by Selective Learning Algorithm

Ben Tan and Qiang Yang et al. Distant Domain Transfer Learning. AAAI 2017.

Page 32

Transfer Learning Methods
• Instance Transfer
  • Reweight instances of target data according to source
• Feature Transfer
  • Map features of source and target data into a common space
• Parameter Transfer
  • Learn target model parameters according to the source model

Relational approaches are relatively unpopular, thus omitted in this talk

Page 33

Feature-based Transfer Learning
• When source and target domains only have some overlapping features, i.e., many features have support in only one of $\mathcal{X}_S$ and $\mathcal{X}_T$
• Possible solutions
  • Encode application-specific knowledge
  • General approaches to learn the transformation $\varphi$

Page 34

General Feature-Based TL Approach

• Learning new data representations by minimizing the distance between two domain distributions

• Learning new data representations by multi-task learning

• Learning new data representations by self-taught learning

Page 35

Principal Component Analysis (PCA)

• PCA uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components

[Figure: data cloud with first and second principal components]

Page 36

Principal Component Analysis (PCA)

• PCA uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components

[Figure: source data with the first and second components kept and the noise components discarded]

Page 37

Transfer Component Analysis
• Motivation
  • Minimize the distance between domain distributions by projecting data onto the learned transfer components
  • The joint latent factors are those that cause the two domains' data distributions to differ

Pan, Sinno Jialin, et al. "Domain adaptation via transfer component analysis." IEEE Transactions on Neural Networks 22.2 (2011): 199-210.

[Figure: source and target domain data mapped to joint latent factors]

Page 38

Transfer Component Analysis
• Main idea
  • Learn a mapping $\varphi$ of the source and target domain data to the latent space spanned by the factors which can reduce the domain difference and preserve the original data structure:

$$\begin{aligned}
\min_\varphi \;& \text{Dist}(\varphi(X_S), \varphi(X_T)) + \lambda\, \Omega(\varphi) \\
\text{s.t.} \;& \text{constraints on } \varphi(X_S) \text{ and } \varphi(X_T)
\end{aligned}$$

Page 39

Transfer Component Analysis
• Maximum Mean Discrepancy (MMD)
  • Given the source and target domain data $X_S = \{x_{S_i}\}_{i=1}^{n_S}$ and $X_T = \{x_{T_i}\}_{i=1}^{n_T}$, drawn from $P_S(x)$ and $P_T(x)$ respectively:

$$\text{Dist}(\varphi(X_S), \varphi(X_T)) = \left\| \frac{1}{n_S}\sum_{i=1}^{n_S} \Phi(\varphi(x_{S_i})) - \frac{1}{n_T}\sum_{i=1}^{n_T} \Phi(\varphi(x_{T_i})) \right\|_{\mathcal{H}}$$

where $\varphi$ is the learned mapping and $\Phi$ is the feature map of the kernel function.

Page 40

Transfer Component Analysis
• An illustrative example: latent features learned by PCA and TCA

[Figure panels: original feature space, PCA, TCA]

Page 41

Maximum Mean Discrepancy

Page 42

MMD in Transfer Learning: Deep Adaptation Network

Long et al. Learning Transferable Features with Deep Adaptation Networks. ICML 2015.

• Multi-kernel MMD, e.g., several RBF kernels with different standard deviations:

$$K(x, x') = \exp\left( -\frac{\|x - x'\|^2}{2\sigma^2} \right)$$
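An empirical (biased) estimate of squared MMD with a single RBF kernel can be sketched as below, using made-up 1-D Gaussian samples; DAN itself uses multiple kernels and deep features, which this sketch omits:

```python
import numpy as np

rng = np.random.default_rng(0)

def rbf(a, b, sigma=1.0):
    # RBF kernel matrix between 1-D sample vectors a and b.
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * sigma ** 2))

def mmd2(x, y, sigma=1.0):
    # Biased empirical estimate of squared MMD with a single RBF kernel:
    # mean k(x,x') + mean k(y,y') - 2 mean k(x,y').
    return rbf(x, x, sigma).mean() + rbf(y, y, sigma).mean() - 2 * rbf(x, y, sigma).mean()

same = mmd2(rng.normal(0, 1, 500), rng.normal(0, 1, 500))   # matched distributions
diff = mmd2(rng.normal(0, 1, 500), rng.normal(2, 1, 500))   # shifted distribution
```

The estimate is near zero when the two samples come from the same distribution and clearly positive under a shift, which is why it works as a trainable domain-discrepancy penalty.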

Page 43

How transferable are features in deep neural networks? [NIPS 2014]

[Figure: four training setups — standard BP training on dataset A; standard BP training on dataset B; reuse the first n=3 layers' weights of baseB and train on B; reuse the first n=3 layers' weights of baseA and train on B]

Page 44

How transferable are features in deep neural networks? [NIPS 2014]

Page 45

How transferable are features in deep neural networks?

Page 46

Domain Adversarial Neural Network

• A domain classifier $\eta(x)$ tries to tell source features from target features:

$$\Pr_{x \sim \mathcal{D}_S^X}[\eta(x) = 1] + \Pr_{x \sim \mathcal{D}_S^X}[\eta(x) = 0] = 1$$

Ajakan, Hana, et al. "Domain-adversarial neural networks." JMLR 2016

[Figure: source domain vs. target domain samples]

Page 47

Domain Adversarial Neural Network

Page 48

Experiment Result

[Table residue: average scores across compared methods — 0.763, 0.761, 0.760, 0.813, 0.803, 0.801]

Page 49

Experiment Result

Page 50

Page 51

Transfer Learning Methods
• Instance Transfer
  • Reweight instances of target data according to source
• Feature Transfer
  • Map features of source and target data into a common space
• Parameter Transfer
  • Learn target model parameters according to the source model

Relational approaches are relatively unpopular, thus omitted in this talk

Page 52

Parameter-based Transfer Learning

• The θ-parameterized function $f_\theta(x)$ learned on the two domains:

$$\theta_S^* = \arg\min_\theta \sum_{i=1}^{n_S} L(y_{S_i}, f_\theta(x_{S_i})) + \lambda\, \Omega(\theta)$$

$$\theta_T^* = \arg\min_\theta \sum_{i=1}^{n_T} L(y_{T_i}, f_\theta(x_{T_i})) + \lambda\, \Omega(\theta)$$

• Motivation
  • A well-trained model $f_{\theta_S^*}(x)$ has learned a lot of structure on the source domain
  • If the two tasks are related, this structure can be transferred to learn the target model $f_{\theta_T^*}(x)$
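The simplest form of parameter transfer is warm-starting: initialise the target model at θ*_S and fine-tune. A minimal sketch with hypothetical 1-D linear tasks whose true slopes are close; under the same small step budget, the warm start ends up nearer the target solution than training from scratch:

```python
import numpy as np

rng = np.random.default_rng(0)

def fit(x, y, theta0, steps, lr=0.1):
    # Plain gradient descent on mean squared error for f_theta(x) = theta * x.
    theta = theta0
    for _ in range(steps):
        theta -= lr * 2 * np.mean((theta * x - y) * x)
    return theta

# Hypothetical related tasks: the target slope (2.2) is close to the source's (2.0).
x_s = rng.normal(0, 1, 1000); y_s = 2.0 * x_s + 0.1 * rng.normal(0, 1, 1000)
x_t = rng.normal(0, 1, 10);   y_t = 2.2 * x_t + 0.1 * rng.normal(0, 1, 10)

theta_s = fit(x_s, y_s, 0.0, 200)        # well-trained source model
theta_cold = fit(x_t, y_t, 0.0, 5)       # tiny budget, zero init
theta_warm = fit(x_t, y_t, theta_s, 5)   # same tiny budget, source init
```

The warm start only helps because the two tasks are related; if the source slope were far from the target's, the same initialisation could hurt.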

Page 53

Multi-Task or Collective Learning

• Minimize the joint loss on the two tasks plus a model-parameter distance:

$$\min_{\theta_S, \theta_T} \underbrace{\alpha \frac{1}{N_S} \sum_{i=1}^{N_S} L(y_i, f_{\theta_S}(x_i))}_{\text{source task loss}} + \underbrace{(1-\alpha) \frac{1}{N_T} \sum_{j=1}^{N_T} L(y_j, f_{\theta_T}(x_j))}_{\text{target task loss}} + \underbrace{\lambda\, \Omega(\theta_S, \theta_T)}_{\text{parameter distance}}$$

• Different parameter-distance definitions:

$$\Omega(\theta_S, \theta_T) = \|\theta_S - \theta_T\|^2 \qquad \text{or} \qquad \Omega(\theta_S, \theta_T) = \sum_{t \in \{S,T\}} \Big\| \theta_t - \frac{1}{2}\sum_{s \in \{S,T\}} \theta_s \Big\|^2$$
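The joint objective with the ||θ_S − θ_T||² distance can be sketched with simultaneous gradient steps on two hypothetical 1-D regression tasks (all constants made up); the penalty pulls the data-poor target parameter toward the data-rich source parameter:

```python
import numpy as np

rng = np.random.default_rng(1)

# Two hypothetical 1-D regression tasks whose true slopes differ slightly.
x_s = rng.normal(0, 1, 500); y_s = 2.0 * x_s + 0.1 * rng.normal(0, 1, 500)
x_t = rng.normal(0, 1, 20);  y_t = 2.3 * x_t + 0.1 * rng.normal(0, 1, 20)

alpha, lam, lr = 0.5, 1.0, 0.05
th_s, th_t = 0.0, 0.0
for _ in range(500):
    # Gradients of the joint loss w.r.t. each task's parameter.
    g_s = alpha * 2 * np.mean((th_s * x_s - y_s) * x_s) + 2 * lam * (th_s - th_t)
    g_t = (1 - alpha) * 2 * np.mean((th_t * x_t - y_t) * x_t) + 2 * lam * (th_t - th_s)
    th_s, th_t = th_s - lr * g_s, th_t - lr * g_t
```

With a large λ the two parameters are pushed together; with λ = 0 the tasks decouple into two independent regressions.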

Page 54

Hierarchical Bayesian Network
• Idea: source domain parameters, regarded as random variables, act as the prior of the target domain parameters

[Graphical model: hyperparameters → θ_S, which with x_S generates y_S (plate size n_S); θ_S transfers to θ_T, which with x_T generates y_T (plate size n_T)]

Page 55

Case Study: from Web Browsing to Ad Click

• Source task
  • Data: user-browsed webpage ids
  • Task: predict whether a user likes a webpage (logistic regression)
• Target task
  • Data: user-browsed webpage ids
  • Task: predict whether a user likes to click an ad (logistic regression)

$$\min_{\theta_S, \theta_T} \alpha \frac{1}{N_S} \sum_{i=1}^{N_S} L(y_i, f_{\theta_S}(x_i)) + (1-\alpha) \frac{1}{N_T} \sum_{j=1}^{N_T} L(y_j, f_{\theta_T}(x_j)) + \lambda \|\theta_S - \theta_T\|^2$$

[Perlich, Claudia, et al. "Machine learning for targeted display advertising: Transfer learning in action." Machine Learning 95.1 (2014): 103-127.]

Page 56

Case Study: from web browsing to ad click

[Zhang, Weinan et al. Implicit Look-alike Modelling in Display Ads: Transfer Collaborative Filtering to CTR Estimation. ECIR 2016]

• Illustrated in a hierarchical Bayesian graphical model

Page 57

Transfer Learning in Deep Learning

• Mostly, neural network reusing
  • Feed new data for domain adaptation
  • Build higher layers for training another task (feature transfer)

[Figure: a network trained on source data (feature representation → source task) is reused on target data, either adapted directly or with new higher layers for the target task]

Page 58

Net2Net Transfer
• Net2Net reuses information from an already trained model to speed up the training of a new model

[Chen, Tianqi, Ian Goodfellow, and Jonathon Shlens. "Net2Net: Accelerating learning via knowledge transfer." ICLR 2016.]

Page 59

Net2Net Transfer: Growing Network

• Wider

• Deeper
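The "wider" operation can be sketched in NumPy for a tiny two-layer net (random made-up weights): duplicate a hidden unit's incoming weights and split its outgoing weight, so the widened net computes exactly the same function as the original — this function-preserving property is the core of Net2WiderNet.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny 2-layer net: h = relu(W1 @ x), y = W2 @ h.
W1 = rng.normal(size=(3, 2))   # 3 hidden units, 2 inputs
W2 = rng.normal(size=(1, 3))   # 1 output

def forward(W1, W2, x):
    return W2 @ np.maximum(W1 @ x, 0.0)

# Net2WiderNet step: widen the hidden layer from 3 to 4 units by copying
# hidden unit 0 and halving its outgoing weight on both copies.
W1_wide = np.vstack([W1, W1[0:1]])          # duplicated incoming weights
W2_wide = np.hstack([W2, W2[:, 0:1]])       # duplicated outgoing weight
W2_wide[0, 0] *= 0.5
W2_wide[0, 3] *= 0.5

x = rng.normal(size=(2,))
```

Because the widened net starts at the same function, training continues from the parent's performance instead of from scratch; the "deeper" operation plays the same trick with an identity-initialised layer.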

Page 60

Net2Net over Inception-BN on ImageNet

Page 61

ResNet: Deep Residual Networks
• Difficulty of training DEEP networks

Page 62

ResNet: Deep Residual Networks

Page 63

Performance on ImageNet

• Thin curves denote training error, and bold curves denote validation error of the center crops.

• The residual networks have no extra parameters compared to their plain counterparts.

Plain networks of 18 and 34 layers. ResNets of 18 and 34 layers.
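The residual idea itself is tiny: a block computes y = x + F(x), so when F's weights are zero the block is the identity, which is what makes very deep stacks easy to optimise. A NumPy sketch with made-up weights:

```python
import numpy as np

rng = np.random.default_rng(0)

# A residual block computes y = x + F(x); the shortcut adds no parameters.
W1 = 0.1 * rng.normal(size=(4, 4))
W2 = 0.1 * rng.normal(size=(4, 4))

def residual_block(x, W1, W2):
    f = W2 @ np.maximum(W1 @ x, 0.0)   # F(x): a small two-layer transform
    return x + f                        # identity shortcut

x = rng.normal(size=(4,))
y = residual_block(x, W1, W2)

# With F's weights at zero the block is exactly the identity mapping.
identity_out = residual_block(x, np.zeros((4, 4)), np.zeros((4, 4)))
```

The block only has to learn the residual F(x) = y − x rather than the full mapping, which is easier when the desired mapping is close to identity.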

Page 64

Heterogeneous TL
• Different feature spaces
• Examples
  • Cross-language document classification
  • Cross-system recommendation
• Approaches
  • Symmetric transformation mapping
  • Asymmetric transformation mapping

Page 65

Cross-system Recommendation

Page 66

Transfer Learning via CodeBook

Li, Bin, Qiang Yang, and Xiangyang Xue. "Can Movies and Books Collaborate? Cross-Domain Collaborative Filtering for Sparsity Reduction." IJCAI. Vol. 9. 2009.

Page 67

Transfer Learning via CodeBook

Li, Bin, Qiang Yang, and Xiangyang Xue. "Can Movies and Books Collaborate? Cross-Domain Collaborative Filtering for Sparsity Reduction." IJCAI. Vol. 9. 2009.

Page 68

Cross-Language Text Classification
• A large number of labeled English webpages
• A small number of labeled Chinese webpages
• Solution: information bottleneck

Ling, Xiao, et al. "Can Chinese web pages be classified with English data source?" WWW 2008.

Page 69

Summary of CS420

1. ML Introduction
2. Linear Models
3. SVMs and Kernels
4. Neural Networks
5. Tree Models
6. Ensemble Models
7. Collaborative Filtering
8. Graphical Models
9. Unsupervised Learning
10. Model Selection
11. RL Introduction
12. Model-free RL
13. Transfer Learning
14. Poster Session

Page 70

Summary of CS420

• Play with the data and get your hands dirty!

[Figure: career triangle — academia (theoretical novelty, solid math), industry (large-scale practice, solid engineering), and startup (application novelty), connected by hands-on ML experience and communication]

Page 71

APPENDIX

Page 72

RKHS
• MMD function class $\mathcal{F}$: the unit ball in an RKHS $\mathcal{H}$
• Hilbert Space $\mathcal{H}$ of functions, with kernel function k
• Reproducing Kernel Hilbert Space
  • If k satisfies
    • (1) $k(x, \cdot) \in \mathcal{H}$ for all x
    • (2) $f(x) = \langle f, k(x, \cdot) \rangle_{\mathcal{H}}$ for all $f \in \mathcal{H}$ (reproducing property)
  • then k is a reproducing kernel
• Define $\Phi(x) = k(x, \cdot)$

Page 73

Transfer Component Analysis

Page 74

Transfer Component Analysis

Page 75

MMD in RKHS
• MMD function class $\mathcal{F}$: the unit ball in an RKHS
• Let $\mu_P = \mathbb{E}_{x \sim P}[k(x, \cdot)]$, called the mean embedding of the distribution P