preprocessing and dimensionality reduction


Preprocessing and Dimensionality Reduction

Jeremy Fix

CentraleSupelec


2019



Where to get data

You need datasets

You can use open datasets

For example, to experiment with a new ML algorithm:

• UCI ML Repo : http://archive.ics.uci.edu/ml/

• Kaggle competitions, e.g. https://www.kaggle.com/c/diabetic-retinopathy-detection

• specific well known datasets for specific ML problems


Some available datasets

Face expression classification

48x48 pixel grayscale images of faces
0=Angry, 1=Disgust, 2=Fear, 3=Happy, 4=Sad, 5=Surprise, 6=Neutral
28K Train; 3K for public test, another 3K for final test.

Kaggle, ICML 2013



Object localization/detection

PascalVOC2012: 20 classes, 20000 Train images, 20000 Test, 11000 Test
Avg image size: 469x387 pixels, RGB

Classes: person / bird, cat, cow, dog, horse, sheep / aeroplane, bicycle, boat, bus, car, motorbike, train / bottle, chair, dining table, potted plant, sofa, tv-monitor

http://host.robots.ox.ac.uk/pascal/VOC/




ImageNet, ILSVRC2014: 1000 classes, 1.2M Train images, 50K Valid, 100K Test
Avg image size: 482x415 pixels, RGB

ImageNet Large Scale Visual Recognition Challenge, Russakovsky et al. (2015)




Open Images Dataset: https://github.com/openimages/dataset

• ≈ 9M automatically labelled images, 4M human validated

• 80M bounding boxes, 6000 classes

• both meta labels (e.g. vehicle) and fine-grained labels (e.g. Honda NSX)



Object segmentation

COCO 2017: 200K images, 80 classes, 500K masks

http://cocodataset.org/



Recommendation systems

MovieLens, Netflix Prize, Anime Recommendations Database
MovieLens 20M:

• 27K movies rated by 138K users

• 5-star ratings in half-star increments (0.5, 1.0, ..., 5.0)

• 20M ratings

• metadata (e.g. genre)

• links to imdb to enrich metadata

https://grouplens.org/datasets/movielens/



Automatic speech recognition

TIMIT, VoxForge, ...
TIMIT:

• 630 speakers, eight American English dialects

• time-aligned orthographic, phonetic and word transcriptions

• 16kHz speech waveform file for each utterance

https://catalog.ldc.upenn.edu/ldc93s1



Sentiment analysis

Large Movie Review Dataset (IMDB)

• 25K reviews for training, 25K reviews for testing

• movie reviews (sentences), with rating ([1,10])

• aim : Are reviews on a given product positive/negative ?

Maas(2011), Learning Word Vectors for Sentiment Analysis

Automatic translation

Dataset from the european parliament (Europarl dataset)

• single language datasets (language model)

• parallel corpora (translation), e.g. French-English (2M sentences), Czech-English (650K sentences), ...

Europarl: A Parallel Corpus for Statistical Machine Translation, Philipp Koehn, MT Summit 2005


Make your own dataset

You need datasets

You have a specific problem

You may need to collect data on your own.

• Crawl the web? (e.g. Twitter API, ...)

• If supervised learning: assign labels (Mechanical Turk, domain experts, e.g. for classifying tumors)

• Ensure you have collected sufficient features


We need vectors, appropriately scaled, without missing values

Preprocessing




Preprocessing data

Data are not necessarily vectorial

• Ordinal or categorical: poor/fair/excellent; Male/Female

• Text documents : bag of words / word embeddings

Even if vectorial

• Missing data : check how missing values are indicated (-9, ’ ’,..) → Imputation of missing values

• Feature scaling




Your data might not be vectorial data

Ordinal and categorical features

Ordinal values have an order.

Ordinal feature value     poor    fair    excellent
Numerical feature value   -1      0       1

Categorical values do not have an order (use one-hot):

Categorical value   American       Spanish        German         French
Numerical value     [1, 0, 0, 0]   [0, 1, 0, 0]   [0, 0, 1, 0]   [0, 0, 0, 1]
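The two encodings above can be sketched as follows. This is only an illustration: the function names and the dictionary/list holding the categories are assumptions, not something the slides prescribe.

```python
import numpy as np

# Assumed lookup tables; the values follow the two tables above.
ORDINAL_SCALE = {"poor": -1, "fair": 0, "excellent": 1}
NATIONALITIES = ["American", "Spanish", "German", "French"]

def encode_ordinal(value):
    """Map an ordinal label to a number that preserves the order."""
    return ORDINAL_SCALE[value]

def encode_one_hot(value, categories=NATIONALITIES):
    """Map a categorical label to a one-hot vector (no order implied)."""
    vec = np.zeros(len(categories), dtype=int)
    vec[categories.index(value)] = 1
    return vec
```

One-hot coding avoids suggesting that, say, German is "between" Spanish and French, which a single integer code would.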





Vectorial representation of text documents

Bag Of Words

• define a vocabulary V, |V| = n

• for each document, build a vector x so that x_i is the frequency of the word V_i

e.g. V = {I, in, love, metz, machine learning, study}
I love machine learning and love metz too. → x = [1, 0, 2, 1, 1, 0]
I love studying machine learning in Metz. → x = [1, 1, 1, 1, 1, 1]
Does not take the word order into account → N-grams, but this leads to sparser representations
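A minimal sketch reproducing the two bag-of-words vectors above. The tokenisation is deliberately naive and the `STEMS` table is a hypothetical stand-in for a real stemmer (it only maps "studying" to "study"); multi-word vocabulary entries are glued into a single token.

```python
import re
from collections import Counter

VOCAB = ["i", "in", "love", "metz", "machine learning", "study"]
STEMS = {"studying": "study"}  # hypothetical toy normalisation, not a real stemmer

def bag_of_words(doc, vocab=VOCAB):
    """Count how often each vocabulary entry occurs in the document."""
    text = doc.lower()
    # Treat multi-word vocabulary entries as single tokens.
    for term in vocab:
        if " " in term:
            text = text.replace(term, term.replace(" ", "_"))
    tokens = [STEMS.get(t, t) for t in re.findall(r"[a-z_]+", text)]
    counts = Counter(tokens)
    return [counts[term.replace(" ", "_")] for term in vocab]
```

Calling `bag_of_words("I love machine learning and love metz too.")` gives the first vector of the example, and the second sentence gives the all-ones vector.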


Word/Sentence embeddings (e.g. word2vec, GloVe, fastText).
Continuous Bag of Words (CBOW):

• input and output coded with one-hot

• predict a word given its context

• hidden layer : word representation

Captures some semantic information.
For sentences: tweet2vec, sentence2vec, word vector average.

See also: Bayesian approaches (e.g. Latent Dirichlet Allocation).
Pennington (2014), GloVe: Global Vectors for Word Representation; Mikolov (2013), Efficient Estimation of Word Representations in Vector Space; https://fasttext.cc/



Some features might be missing

Missing features

• Completely drop the samples with missing attributes, or the dimensions that have missing values

• or try to impute, i.e. set a value in place of the missing attributes

For missing value imputation, there are plenty of methods:

• global: assign the mean, median or most frequent value of an attribute

• local: based on the k-nearest neighbors, decide which value to impute

The bias you may introduce by imputing a value may depend on the causes of the missing values, see [Silva(2014)].
Silva (2014). A brief review of the main approaches for treatment of missing data
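As an illustration of the "global" strategy, here is a minimal mean-imputation sketch. It assumes missing values have already been converted to NaN (a dataset using -9 or ' ' as a marker would need that conversion first); the function names are illustrative.

```python
import numpy as np

def fit_mean_imputer(X_train):
    """Per-feature means of the training set, ignoring NaN entries."""
    return np.nanmean(X_train, axis=0)

def impute(X, fill_values):
    """Return a copy of X where each NaN is replaced by its column's fill value."""
    X = np.array(X, dtype=float)          # copy, the input is left untouched
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = fill_values[cols]
    return X
```

The fill values are computed once on the training set and reused for any later data, for the same reason as the scaling statistics.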



Some vectorial data might not be appropriately scaled

Feature scaling

• dimensions with the largest variations will dominate Euclidean distances (e.g. nearest neighbors)

• when gradient descent is involved, feature scaling makes convergence faster (because the level sets of the loss become closer to circular)

• when regularization is involved, we would like to use a single regularization coefficient, independent of the scale of the features

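Given $x_i \in \mathbb{R}^d$, you can normalize by:

• min/max scaling:

$$\forall i, \forall j \in [0, d-1], \quad x'_{i,j} = \frac{x_{i,j} - \min_k x_{k,j}}{\max_k x_{k,j} - \min_k x_{k,j}}$$

• z-score normalization:

$$\forall i, \forall j \in [0, d-1], \quad x'_{i,j} = \frac{x_{i,j} - \mu_j}{\sigma_j}, \qquad \mu_j = \frac{1}{N}\sum_k x_{k,j}, \qquad \sigma_j = \sqrt{\frac{1}{N}\sum_k (x_{k,j} - \mu_j)^2}$$

Your statistics must be computed from the training set and also applied to the test data.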
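A minimal sketch of both normalizations, with the statistics fitted on the training set and then reused on the test set. The function names are illustrative, and the sketch assumes no feature is constant on the training set (otherwise the denominators vanish).

```python
import numpy as np

def fit_minmax(X_train):
    """Per-feature minimum and maximum over the training samples."""
    return X_train.min(axis=0), X_train.max(axis=0)

def apply_minmax(X, lo, hi):
    return (X - lo) / (hi - lo)

def fit_zscore(X_train):
    """Per-feature mean and standard deviation (1/N convention, as above)."""
    return X_train.mean(axis=0), X_train.std(axis=0)

def apply_zscore(X, mu, sigma):
    return (X - mu) / sigma
```

Note that test values may fall outside [0, 1] after min/max scaling, since the bounds come from the training data only.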


Dimensionality reduction



Dimensionality reduction : what/why/how ?

What

Optimally transform xi ∈ Rn into zi ∈ Rd so that d ≪ n.
It remains to define what “optimally transform” means.

Why

• visualization of the data

• interpretability of the predictor

• speed up the algorithms whose complexity depends on n

• data may occupy a manifold of lower dimensionality than n

• curse of dimensionality: data quickly become sparse, models may overfit


Dimensionality reduction: why ?

Data analysis/Visualization

How are your data distributed? How intertwined are your classes? Do we have discriminative features?

[Figure: t-SNE embedding of MNIST, Maaten et al.]



Interpretability of the predictor

e.g. Why does this predictor say the tumor is malignant ?

[Figure: two panels, annotated “Real risk = 0.92 ± 0.05” and “Real risk = 0.92 ± 0.06”]

UCI ML Breast Cancer Wisconsin (Diagnostic) dataset. Real risk estimated by 10-fold CV.



Speed up of the algorithms

Decreasing dimensionality decreases training/inference times.
For example:

• Linear regression: $y = \theta^T x + b$

• Logistic regression (classification): $P(y = 1 \mid x) = \frac{1}{1 + \exp(-\theta^T x)}$

For both, training (per sample and per gradient step) and inference are in O(n), with x ∈ Rn



The data may occupy a lower dimensional manifold

Swiss roll

→ you do not necessarily lose information by reducing the number of dimensions




You want to classify facial expressions of a single person, under controlled illumination:

• suppose a huge image resolution, e.g. 1024 × 1024 RGB pixels, $x \in \mathbb{R}^{1024 \times 1024 \times 3}$

• what is the dimensionality of the data manifold? ≈ 50





You may even have better predictors : Curse of dimensionality

The data become sparse exponentially fast as the dimension increases

Image from [Goodfellow, Bengio, Courville (2016): Deep learning]
See also [Hastie et al. (2017), The elements of statistical learning]


Dimensionality reduction : what/why/how ?


How

• select a subset of the original features : feature selection

• compute new features from the original ones: feature extraction


Feature selection

Select a subset of the original features/attributes/dimensions: $x_i \in \mathbb{R}^n \mapsto z \in \mathbb{R}^d$

Overview

• Embedded: the ML algorithm is designed to select a subset of the features, e.g. linear regression with L1 penalty

• Filters: dimensions are selected based on a heuristic

• Wrappers: dimensions are selected based on an estimation of the real risk

⇒ Notebook “Feature selection.ipynb”

Feature selection: embedded

Embedded: the loss to minimize embeds a penalty promoting sparsity.

Least Absolute Shrinkage and Selection Operator (LASSO)

Given a regression problem $(x_i, y_i)$, $x_i \in \mathbb{R}^n$, $y_i \in \mathbb{R}$, optimize w.r.t. $\theta$:

$$\frac{1}{N}\sum_{i=0}^{N-1} (y_i - \theta^T x_i)^2 + \lambda \|\theta\|_1 \tag{1}$$

Linear regression with L1 penalty. The L1 penalty promotes sparse predictors.

Tibshirani (1996). Regression shrinkage and Selection via the Lasso
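To see the sparsity effect of objective (1) numerically, here is a minimal sketch that minimises it by proximal gradient descent (ISTA). The slides do not say which solver is used, so the choice of ISTA, the function names and the iteration count are assumptions made for illustration.

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t * ||.||_1: shrink towards 0, clip small values to exactly 0."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=2000):
    """Minimise (1/N) * ||y - X theta||^2 + lam * ||theta||_1.

    X has one sample per row, shape (N, n).
    """
    N, n = X.shape
    theta = np.zeros(n)
    # Lipschitz constant of the gradient of the quadratic term.
    L = 2.0 * np.linalg.norm(X, 2) ** 2 / N
    for _ in range(n_iter):
        grad = 2.0 / N * X.T @ (X @ theta - y)
        theta = soft_threshold(theta - grad / L, lam / L)
    return theta
```

The soft-thresholding step is what sets coefficients to exactly zero, which is why the non-zero entries of `theta` can be read as the selected features.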



LASSO example

N = 30 points, $y_i = 0.5 + 0.4 \sin(2\pi x_i) + \mathcal{N}(0, 0.01)$
30 RBF features + constant term:

$$\phi(x) = \left[1, e^{-\frac{(x_0 - x)^2}{2\sigma^2}}, \dots, e^{-\frac{(x_{N-1} - x)^2}{2\sigma^2}}\right]$$

[Figure, left panel “Fit”: samples, true function, lreg, lreg_l1. Right panel “Parameters”.]

≈ 20% − 33% features selected


Decision tree example

Decision Tree with Gini impurity, max depth=2, 10-fold CV (0.92)
UCI ML Breast Cancer Wisconsin dataset: 569 samples, binary classification, 30 continuous features.

Feature selection: univariate filters
Principle: measure the correlation/dependency between each input feature, considered independently, and the target. E.g. chi-2, ANOVA test of independence, mutual information, Pearson correlation, ...

Example, continuous → discrete: ANOVA

Breast cancer, ANOVA F-values

[Figure: P(x14 | y) (lowest F) and P(x27 | y) (highest F)]
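A minimal sketch of the ANOVA filter: for each continuous feature, the one-way F statistic compares the variance between class means to the variance within classes. The function name is illustrative, and the sketch assumes every feature has a non-zero within-class variance.

```python
import numpy as np

def anova_f_values(X, y):
    """One-way ANOVA F statistic of each column of X against the class labels y."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    N, n_classes = X.shape[0], len(classes)
    grand_mean = X.mean(axis=0)
    ss_between = np.zeros(X.shape[1])
    ss_within = np.zeros(X.shape[1])
    for c in classes:
        Xc = X[y == c]
        ss_between += len(Xc) * (Xc.mean(axis=0) - grand_mean) ** 2
        ss_within += ((Xc - Xc.mean(axis=0)) ** 2).sum(axis=0)
    # Mean squares: divide each sum of squares by its degrees of freedom.
    return (ss_between / (n_classes - 1)) / (ss_within / (N - n_classes))
```

Features can then be ranked by decreasing F value and the top ones kept; a feature whose class-conditional means coincide gets F = 0.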



Feature selection: multivariate filters and wrappers

Overview

Denote χ a subset of the dimensions/attributes/features

• suppose we are provided a measure J(χ) of how good this subset is

• we optimize J(χ) over the possible subsets χ

If x ∈ Rn, we have $2^n$ possible subsets χ:

χ ∈ {∅, {x1}, {x2}, · · · , {x1, x2}, · · · }

http://featureselection.asu.edu/: algorithms and datasets, Python package scikit-feature.
John et al. (1994), Irrelevant features and the subset selection problem.

Feature selection: optimizing J(χ)

Tree search

Subsets, by size                               Number of sets
∅                                              1
{x0}, {x1}, ..., {xd−1}                        d
{x0, x1}, {x0, x2}, ..., {xd−2, xd−1}          d(d−1)/2
...                                            d!/(k!(d−k)!) for size k
Xd (all the features)                          1

Sequential Forward Search: start from ∅ and add one feature at a time.

Sequential Backward Search: start from Xd and remove one feature at a time.

If you allow undoing steps: “Sequential Floating Forward Search” / “Sequential Floating Backward Search”

Variants and extensions: Somol et al. (2010), Efficient Feature Subset Selection and Subset Size Optimization
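A minimal sketch of Sequential Forward Search. It assumes a user-supplied function `score` playing the role of J(χ), to be maximized; the function name, its signature and the fixed target size `k` are illustrative choices, not part of the slides.

```python
def sequential_forward_search(n_features, score, k):
    """Greedily grow a subset of k feature indices.

    At each step, add the feature whose inclusion gives the best score.
    Only about n_features * k subsets are evaluated, instead of 2**n_features.
    """
    selected = []
    remaining = list(range(n_features))
    while len(selected) < k and remaining:
        best = max(remaining, key=lambda j: score(selected + [j]))
        selected.append(best)
        remaining.remove(best)
    return selected
```

The search is greedy: a feature added early is never removed, which is exactly what the floating variants relax.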


Feature selection: quantifying the quality of a subset of features

We need to quantify the quality J(χ) of a subset of features.
Filters use a heuristic, to be maximized.

Filters

Heuristic: e.g. Correlation-based Feature Selection (CFS)
Strategy: keep features correlated with the label, yet uncorrelated with each other.
Given a training set $\{(x_i, y_i)\}$:

$$J_{CFS}(\chi) = \frac{k\, r(\chi, y)}{\sqrt{k + k(k-1)\, r(\chi, \chi)}}$$

$$r(\chi, y) = \frac{1}{k} \sum_{j \in \chi} r(x_{\cdot,j}, y)$$

$$r(\chi, \chi) = \frac{1}{k(k-1)} \sum_{(j_1, j_2) \in \chi^2,\ j_1 \neq j_2} r(x_{\cdot,j_1}, x_{\cdot,j_2})$$

with k = |χ| and r a measure of correlation
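A minimal sketch of a correlation-based merit, written in Hall's usual form $k\, r(\chi,y) / \sqrt{k + k(k-1)\, r(\chi,\chi)}$. It assumes r is the absolute Pearson correlation, which is one possible choice of correlation measure, and takes $r(\chi, \chi) = 0$ when the subset has a single feature.

```python
import numpy as np

def cfs_merit(X, y, subset):
    """Correlation-based merit of the columns of X listed in `subset`."""
    k = len(subset)
    corr = lambda a, b: abs(np.corrcoef(a, b)[0, 1])
    # Average feature/label correlation.
    r_xy = np.mean([corr(X[:, j], y) for j in subset])
    # Average correlation between distinct features (0 if there is no pair).
    if k == 1:
        r_xx = 0.0
    else:
        r_xx = np.mean([corr(X[:, a], X[:, b])
                        for a in subset for b in subset if a != b])
    return k * r_xy / np.sqrt(k + k * (k - 1) * r_xx)
```

With this form, adding an exact copy of an already selected feature leaves the merit unchanged, while adding an informative and uncorrelated feature increases it.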


Feature selection: quantifying the quality of a subset of features

We need to quantify the quality J(χ) of a subset of features.
Wrappers use an estimation of the real risk, to be minimized.

Wrappers

1 Train a predictor from the subset χ

2 J(χ) = estimation of the real risk (e.g. cross validation)

More theoretically grounded, but more computationally expensive.
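A minimal sketch of a wrapper criterion: J(χ) is the K-fold cross-validated squared error of a predictor trained on the subset χ only. The predictor (ordinary least squares with an intercept), the number of folds and the function name are assumptions made for illustration.

```python
import numpy as np

def cv_risk(X, y, subset, n_folds=5):
    """K-fold estimate of the squared-error risk using only the features in `subset`."""
    N = X.shape[0]
    Xs = np.column_stack([np.ones(N), X[:, subset]])   # intercept + selected features
    risks = []
    for test_idx in np.array_split(np.arange(N), n_folds):
        train_mask = np.ones(N, dtype=bool)
        train_mask[test_idx] = False
        # Step 1: train a predictor from the subset.
        theta, *_ = np.linalg.lstsq(Xs[train_mask], y[train_mask], rcond=None)
        # Step 2: evaluate it on the held-out fold.
        risks.append(np.mean((Xs[test_idx] @ theta - y[test_idx]) ** 2))
    return float(np.mean(risks))
```

Each evaluation of J(χ) retrains the predictor `n_folds` times, which is where the extra computational cost of wrappers comes from.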



Feature extraction

Given N samples $x_i \in \mathbb{R}^d$, we compute $r \ll d$ new features from the original d features.


Principal Component Analysis [Pearson(1901)]
Statement: find an affine transformation of the data minimizing the reconstruction error

Intuition and formalisation

[Figure: 2D point cloud with a line defined by w0 and w1; red segments join the points to the line]

In 1D, we seek a line (w0, w1) minimizing the sum of the squared lengths of the red segments. It is not unique!


Formally:

$$\min_{w_0, w_1, \dots, w_r \in \mathbb{R}^d} \sum_{i=0}^{N-1} \left\| x_i - \Big( w_0 + \sum_{j=1}^{r} \big(w_j^T (x_i - w_0)\big) w_j \Big) \right\|_2^2 \tag{2}$$

subject to $w_i^T w_j = \delta_{i,j}$ for $i, j \in \{1, \dots, r\}$.

→ matrix form?


Matrix formulation of PCA

Introduce $W = (w_1 | \dots | w_r) \in M_{d \times r}(\mathbb{R})$

$$(2) \Leftrightarrow \min_{w_0, w_1, \dots, w_r \in \mathbb{R}^d} \sum_{i=0}^{N-1} \left\| (I_d - WW^T)(x_i - w_0) \right\|_2^2$$

subject to $W^T W = I_r$


Simplification of the matrix formulation

• If M is idempotent, so is (I − M)

• $(I_d - WW^T)$ is symmetric and idempotent, since $W^T W = I_r$ gives $(WW^T)^2 = W (W^T W) W^T = WW^T$

• Hence $\|(I_d - WW^T) v\|_2^2 = v^T (I_d - WW^T) v$ for any $v$

$$(2) \Leftrightarrow \min_{w_0, w_1, \dots, w_r \in \mathbb{R}^d} \sum_{i=0}^{N-1} (x_i - w_0)^T (I_d - WW^T)(x_i - w_0)$$

subject to $W^T W = I_r$


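Remember: for $u: \mathbb{R}^n \mapsto \mathbb{R}^m$, $v: \mathbb{R}^n \mapsto \mathbb{R}^m$, $A \in M_{m,m}(\mathbb{R})$:

$$\frac{d\, u^T A v}{dx} = \frac{du}{dx} A v + \frac{dv}{dx} A^T u$$

Finding $w_0$

$$J = \sum_{i=0}^{N-1} (x_i - w_0)^T (I_d - WW^T)(x_i - w_0)$$

$$\frac{\partial J}{\partial w_0} = -2 (I_d - WW^T) \sum_{i=0}^{N-1} (x_i - w_0)$$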


$$\frac{\partial J}{\partial w_0} = 0 \Leftrightarrow \exists h \in \mathrm{span}\{w_1, \dots, w_r\},\ w_0 = h + \frac{1}{N} \sum_i x_i$$

Indeed, $(I_d - WW^T) h$ is the residual of the orthogonal projection of $h$ on the column vectors of $W$:

• if $h \in \mathrm{span}\{w_1, \dots, w_r\}$, the residual is 0

• if $h \in \mathrm{span}\{w_1, \dots, w_r\}^{\perp}$, the residual is $h$


$$\arg\min_{w_0} J \Rightarrow w_0 = h + \frac{1}{N} \sum_i x_i, \quad h \in \mathrm{span}\{w_1, \dots, w_r\}, \text{ e.g. } h = 0$$

The offset $w_0$

The offset $w_0$ is the mean of the data points, up to a translation in the space spanned by the principal component vectors.
Step 1: Center the data


Principal Component Analysis [Pearson(1901)]
Denote $\tilde{x}_i = x_i - \bar{x}$, $\bar{x} = \frac{1}{N} \sum_i x_i$

Deriving the first principal component

• $J = \sum_{i=0}^{N-1} \tilde{x}_i^T (I_d - WW^T) \tilde{x}_i$

• $\arg\min_{w_1, \dots, w_r} J = \arg\max_{w_1, \dots, w_r} \sum_{i=0}^{N-1} \tilde{x}_i^T WW^T \tilde{x}_i$

• $X = (\tilde{x}_0 | \dots | \tilde{x}_{N-1})$

• $\arg\min_{w_1, \dots, w_r} J = \arg\max_{w_1, \dots, w_r} \sum_{j=1}^{r} w_j^T XX^T w_j$

Our optimization problem turns out to be:

$$\arg\max_{w_1, \dots, w_r} \sum_{j=1}^{r} w_j^T XX^T w_j \quad \text{subject to } W^T W = I_r$$

We have a constrained optimization problem: use a Lagrangian.


Deriving the first principal component: Lagrangian

$$\arg\max_{w_1} w_1^T XX^T w_1 \quad \text{subject to } w_1^T w_1 = 1$$

• $L(w_1, \lambda_1) = w_1^T XX^T w_1 + \lambda_1 (1 - w_1^T w_1)$

• $\frac{\partial L}{\partial w_1} = 0 \Rightarrow XX^T w_1 = \lambda_1 w_1$: $w_1$ is an eigenvector, but for which $\lambda_1$?

• $w_1^T XX^T w_1 = \lambda_1$, so $\lambda_1$ is the largest eigenvalue of $XX^T$

First principal component vector

The first principal component vector is a normalized eigenvector associated with the largest eigenvalue of the “sample covariance matrix” $XX^T$


Deriving the second principal component: greedy

Suppose we have $w_1$, a normalized eigenvector of $XX^T$ associated with its largest eigenvalue. Denote $\lambda_1 \geq \lambda_2 \geq \dots \geq \lambda_d \geq 0$ the eigenvalues.
We want to optimize:

$$\arg\max_{w_2} w_1^T XX^T w_1 + w_2^T XX^T w_2 = \arg\max_{w_2} \lambda_1 + w_2^T XX^T w_2 = \arg\max_{w_2} w_2^T XX^T w_2$$

with $w_i^T w_j = \delta_{i,j}$.

And, since $w_1^T w_2 = 0$, $w_2^T XX^T w_2 = w_2^T (XX^T - \lambda_1 w_1 w_1^T) w_2$

$w_2$ is a normalized eigenvector associated with the largest eigenvalue of $(XX^T - \lambda_1 w_1 w_1^T)$, i.e. $\lambda_2$

And so on. But does the greedy algorithm find the optimum?


Deriving the other principal components: greedy

Does it make sense to use a greedy algorithm? (proof in lecture notes)

Theorem

For any symmetric positive semi-definite matrix $M \in M_{d \times d}(\mathbb{R})$, denote $\{\lambda_i\}_{i=1..d}$ its eigenvalues with $\lambda_1 \geq \lambda_2 \geq \dots \geq \lambda_d \geq 0$. For any set of $r \in [|1, d|]$ orthogonal unit vectors $\{v_1, \dots, v_r\}$, we have:

$$\sum_{j=1}^{r} v_j^T M v_j \leq \sum_{j=1}^{r} \lambda_j \tag{3}$$

And this upper bound is reached by eigenvectors associated with the r largest eigenvalues of M


PCA: recipe

Given $\{x_0, \dots, x_{N-1}\} \in \mathbb{R}^d$, to compute the r principal component vectors:

1 Center your data: $\tilde{x}_i = x_i - \bar{x}$

2 Build the matrix $X = [\tilde{x}_0 | \dots | \tilde{x}_{N-1}]$

3 Compute r normalized eigenvectors associated with the r largest eigenvalues of $XX^T$
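The three steps of the recipe can be sketched as follows. The sketch assumes the samples are stored one per row (so the matrix X of the slides is the transpose of the centered array); the function names are illustrative.

```python
import numpy as np

def pca_fit(samples, r):
    """Return the mean, the r principal component vectors and their eigenvalues.

    samples: array of shape (N, d), one sample per row.
    """
    mean = samples.mean(axis=0)
    X = (samples - mean).T                        # step 1 and 2: X has shape (d, N)
    eigvals, eigvecs = np.linalg.eigh(X @ X.T)    # step 3: symmetric matrix, ascending order
    order = np.argsort(eigvals)[::-1][:r]         # indices of the r largest eigenvalues
    return mean, eigvecs[:, order], eigvals[order]

def pca_transform(samples, mean, W):
    """Coordinates of the centered samples on the columns of W."""
    return (samples - mean) @ W
```

`np.linalg.eigh` is used because $XX^T$ is symmetric; it returns orthonormal eigenvectors, so no extra normalization is needed.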

PCA is a projection method

Given $x \in \mathbb{R}^d$, its principal components are its coordinates in the selected eigenspace:

$$x \mapsto \big((x - \bar{x})^T w_1, (x - \bar{x})^T w_2, \dots, (x - \bar{x})^T w_r\big)$$

If $x \in \{x_0, \dots, x_{N-1}\}$, you had better use the SVD, which directly gives you the principal components.


Singular Value Decomposition

For any matrix $M \in M_{d,N}(\mathbb{R})$, there exist an orthogonal matrix $U \in M_{d,d}(\mathbb{R})$, a diagonal matrix $D \in M_{d,N}(\mathbb{R})$, and an orthogonal matrix $V \in M_{N,N}(\mathbb{R})$, such that:

$$M = U D V^T$$

Orthogonal matrices: $U^T = U^{-1}$.


PCA with SVD

Given $X = U D V^T$:

$$XX^T = U D D^T U^{-1}$$

This is the diagonalization of $XX^T$

The projection vectors are the column vectors of U: $\{w_1, \dots, w_r\} = \{u_1, \dots, u_r\}$.
The principal components of the training set are the first r rows of:

$$U^T X = U^T U D V^T = D V^T$$
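A minimal sketch of PCA through the SVD. It uses NumPy's reduced SVD (`full_matrices=False`), which returns the singular values as a vector rather than the rectangular matrix D; the function name and the one-sample-per-row layout are assumptions.

```python
import numpy as np

def pca_svd(samples, r):
    """PCA of samples (shape (N, d)) via the SVD of the centered data matrix."""
    mean = samples.mean(axis=0)
    X = (samples - mean).T                          # (d, N), columns are centered samples
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    W = U[:, :r]                                    # projection vectors w_1..w_r
    Z = s[:r, None] * Vt[:r]                        # first r rows of D V^T, shape (r, N)
    return mean, W, Z, s
```

The squared singular values `s**2` are the eigenvalues of $XX^T$, and `Z` holds the principal components of the training samples without forming $XX^T$ explicitly.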

What is $XX^T$?

$$XX^T = \sum_{i=0}^{N-1} \tilde{x}_i \tilde{x}_i^T = \sum_i \Big(x_i - \frac{1}{N} \sum_j x_j\Big)\Big(x_i - \frac{1}{N} \sum_j x_j\Big)^T = (N - 1) \Sigma$$

with Σ the sample covariance matrix.
Σ is symmetric, positive semi-definite, i.e. its eigenvalues are all non-negative.


Equivalent formulations

There are two equivalent formulations of the PCA:

• Find an affine transformation minimizing the reconstruction error

• Find an affine transformation maximizing the variance of the projections


Principal Component Analysis

Maximizing the variance of the projections

Suppose your data are centered, i.e. $\frac{1}{N} \sum_i x_i = 0$.
Denote $z_i = W^T x_i \in \mathbb{R}^r$ the projection of $x_i$ over $w_1, \dots, w_r$. We have $\bar{z} = 0$.
The sample covariance matrix $\Sigma \in M_{r,r}(\mathbb{R})$ of the projections is:

$$\Sigma = \frac{1}{N-1} \sum_i z_i z_i^T = \frac{1}{N-1} \sum_i W^T x_i x_i^T W$$

We want to maximize $\sum_{j=1}^{r} \Sigma_{j,j}$ and:

$$\sum_{j=1}^{r} \Sigma_{j,j} = \frac{1}{N-1} \sum_j \sum_i (w_j^T x_i)(x_i^T w_j) = \frac{1}{N-1} \sum_j w_j^T XX^T w_j$$

This is the same optimization problem as before.


What is the fraction of variance we keep?

For any matrix M and any orthogonal matrix P:

$$\mathrm{Tr}(P^{-1} M P) = \mathrm{Tr}(M)$$

Therefore, $\mathrm{Tr}(XX^T) = \sum_{i=1}^{d} \lambda_i$. The variance of our data points is

$$\mathrm{Tr}\Big(\frac{1}{N-1} XX^T\Big) = \frac{1}{N-1} \sum_{i=1}^{d} \lambda_i$$

If we keep r principal components, we keep a fraction of the variance equal to:

$$\frac{\sum_{i=1}^{r} \lambda_i}{\sum_{i=1}^{d} \lambda_i}$$
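A small sketch of how this ratio is typically used: compute the kept fraction for a given r, or pick the smallest r reaching a target fraction. The function names and the idea of a target threshold are illustrative assumptions.

```python
import numpy as np

def kept_variance_fraction(eigvals, r):
    """Fraction of the total variance carried by the r largest eigenvalues."""
    lam = np.sort(np.asarray(eigvals, dtype=float))[::-1]
    return lam[:r].sum() / lam.sum()

def smallest_r(eigvals, target):
    """Smallest number of components whose kept fraction reaches `target`."""
    lam = np.sort(np.asarray(eigvals, dtype=float))[::-1]
    cumulative = np.cumsum(lam) / lam.sum()
    return int(np.searchsorted(cumulative, target) + 1)
```

The factor 1/(N − 1) cancels in the ratio, so the eigenvalues of $XX^T$ or of Σ can be used interchangeably.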

PCA on MNIST (28 × 28 images)

[Figure, left: PCA with 2 principal vectors, 17.05% of the total variance; projections on (w1, w2), one colour per digit class 0 to 9]

[Figure, right: the 10 first principal vectors]


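Sample covariance and Gram matrices

Definitions

The sample covariance matrix is:

$$\Sigma = \frac{1}{N-1} \sum_i (x_i - \bar{x})(x_i - \bar{x})^T = \frac{1}{N-1} XX^T$$

The Gram matrix is:

$$G = X^T X = \begin{pmatrix} \tilde{x}_0^T \tilde{x}_0 & \tilde{x}_0^T \tilde{x}_1 & \cdots & \tilde{x}_0^T \tilde{x}_{N-1} \\ \vdots & \vdots & \ddots & \vdots \\ \tilde{x}_{N-1}^T \tilde{x}_0 & \tilde{x}_{N-1}^T \tilde{x}_1 & \cdots & \tilde{x}_{N-1}^T \tilde{x}_{N-1} \end{pmatrix}$$

The Gram matrix is built from dot products.

The eigenvectors/eigenvalues of G and Σ are related!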


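Sample covariance and Gram matrices

Lemma

$\forall A \in \mathbb{R}^{n \times m}$, $\ker(A) = \ker(A^T A)$

Theorem (Rank-nullity)

$\forall A \in \mathbb{R}^{n \times m}$, $\mathrm{rk}(A) + \dim(\ker(A)) = m$.

Theorem

$\forall A \in \mathbb{R}^{n \times m}$, $\mathrm{rk}(A^T A) = \mathrm{rk}(A A^T) \leq \min(n, m)$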


Lemma (Eigenvalues of the covariance and Gram matrices)

The nonzero eigenvalues of the scaled covariance matrix $(N - 1)\Sigma = XX^T$ and of the Gram matrix $G = X^T X$ are the same:

$$\{\lambda \in \mathbb{R}^*, \exists v \neq 0, (N - 1)\Sigma v = \lambda v\} = \{\lambda \in \mathbb{R}^*, \exists v \neq 0, G v = \lambda v\}$$

And, during the proof, we show that:

• If $(\lambda, v)$ is an eigenpair of $XX^T$, then $(\lambda, X^T v)$ is an eigenpair of $X^T X$

• If $(\lambda, w)$ is an eigenpair of $X^T X$, then $(\lambda, X w)$ is an eigenpair of $XX^T$

There are several applications of this property:

• the eigenface algorithm, used when $N \ll d$

• the nonlinear PCA called Kernel PCA [Schoelkopf, 1999]


What to do when $N \ll d$
$G \in M_{N,N}(\mathbb{R})$, $\Sigma \in M_{d,d}(\mathbb{R})$.

Eigenface

If $N \ll d$, it is much more efficient to “diagonalize” G than Σ. In that case, the recipe is:

1 Center your data: $\tilde{x}_i = x_i - \bar{x}$

2 Build the matrix $X = [\tilde{x}_0 | \dots | \tilde{x}_{N-1}]$

3 Compute the r normalized eigenvectors $w_j \in \mathbb{R}^N$ of G associated with its r largest eigenvalues $\lambda_j$

4 Project your data on the r normalized eigenvectors of Σ given by:

$$\frac{X w_j}{\|X w_j\|_2} = \frac{1}{\sqrt{\lambda_j}} X w_j$$
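The recipe above can be sketched as follows, for samples stored one per row. The function name is illustrative, and the sketch assumes the r retained eigenvalues are strictly positive (r must not exceed the rank of the centered data, which is at most N − 1).

```python
import numpy as np

def pca_via_gram(samples, r):
    """PCA for N << d: diagonalize the (N, N) Gram matrix instead of the (d, d) covariance."""
    mean = samples.mean(axis=0)
    X = (samples - mean).T                     # (d, N), columns are centered samples
    G = X.T @ X                                # (N, N) Gram matrix
    lam, V = np.linalg.eigh(G)                 # ascending eigenvalues
    order = np.argsort(lam)[::-1][:r]
    lam, V = lam[order], V[:, order]
    W = X @ V / np.sqrt(lam)                   # unit-norm eigenvectors of X X^T, shape (d, r)
    return mean, W, lam
```

With, say, 100 face images of 10^4 pixels, this diagonalizes a 100 × 100 matrix rather than a 10^4 × 10^4 one.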

Toward a Kernel PCA
We can reformulate the PCA using only dot products.

PCA with only dot products

Computing the Gram matrix involves only dot products between the $\tilde{x}_i$.
Projecting a vector x on the vector $\frac{1}{\sqrt{\lambda_j}} X w_j$ reads:

$$\Big(\frac{1}{\sqrt{\lambda_j}} X w_j\Big)^T x = \frac{1}{\sqrt{\lambda_j}} w_j^T \begin{pmatrix} \langle \tilde{x}_0, x \rangle \\ \vdots \\ \langle \tilde{x}_{N-1}, x \rangle \end{pmatrix}$$

A linear algorithm involving only dot products can be made non-linear using the kernel trick (see SVM).
The only remaining difficulty is that we must ensure the vectors in the feature space are centered.


Non linear PCA

Kernel PCA [Scholkopf(1999)]

Consider a kernel $k: \mathbb{R}^d \times \mathbb{R}^d \mapsto \mathbb{R}$, $\langle \phi(x), \phi(x') \rangle = k(x, x')$, e.g.

• RBF kernel: $k(x, x') = \exp\Big(-\frac{\|x - x'\|_2^2}{2\sigma^2}\Big)$

We perform a PCA in the feature space, image of φ.
Compute the Gram matrix and its eigenvalues/eigenvectors $\lambda_j, w_j$.
For projecting a vector x, compute:

$$\frac{1}{\sqrt{\lambda_j}} w_j^T \begin{pmatrix} k(x_0, x) \\ \vdots \\ k(x_{N-1}, x) \end{pmatrix}$$

What about centering the $\phi(x_i)$?


Kernel PCA: centering in the feature space

It can be shown that, given the Gram matrix G, the matrix

$$\tilde{G} = \Big(I_N - \frac{1}{N} \mathbf{1}\Big) G \Big(I_N - \frac{1}{N} \mathbf{1}\Big)$$

where $\mathbf{1}$ is the N × N matrix of ones, is the matrix of the dot products of the feature vectors centered in the feature space.
The above transformation is called the double centering transformation.
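A minimal sketch of Kernel PCA on the training set with an RBF kernel and double centering. The function names are illustrative; the sketch only computes the components of the training samples (for which the projection reduces to $\sqrt{\lambda_j}\, w_j$) and assumes the r retained eigenvalues are strictly positive.

```python
import numpy as np

def rbf_gram(A, sigma):
    """Gram matrix of the RBF kernel between the rows of A (shape (N, d))."""
    sq = np.sum(A ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * A @ A.T    # squared pairwise distances
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma ** 2))

def double_center(G):
    """Gram matrix of the feature vectors after centering in the feature space."""
    N = G.shape[0]
    H = np.eye(N) - np.ones((N, N)) / N
    return H @ G @ H

def kernel_pca_train(G, r):
    """Components of the N training samples on the r leading kernel principal axes."""
    lam, V = np.linalg.eigh(double_center(G))
    order = np.argsort(lam)[::-1][:r]
    # (1/sqrt(lam_j)) w_j^T G~[:, i] = sqrt(lam_j) * w_j[i] for a training sample i.
    return V[:, order] * np.sqrt(lam[order])          # shape (N, r)
```

With a linear kernel, `double_center` applied to the raw dot products gives exactly the Gram matrix of the centered data, which is a convenient sanity check.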

Feature extraction: Manifold learning

Goal: for each $x_i \in \mathbb{R}^d$, associate $y_i \in \mathbb{R}^r$ so that the pairwise distances in $\mathbb{R}^d$ are as similar as possible to the pairwise distances in $\mathbb{R}^r$

Well suited for visualizing datasets in low dimensions.
Examples: LLE, MDS, Isomap, SNE, t-SNE, ...


Manifold learning

Overview

$x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}^r$, $r \ll d$, e.g. r = 2

1 Quantify the similarity between pairs of points in Rd

2 Quantify the similarity between pairs of points in Rr

3 Quantify the discrepancy between these similarities

4 Optimize with respect to yi


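Manifold learning

t-distributed Stochastic Neighbor Embedding (t-SNE) [van der Maaten(2008)]

Focuses on preserving local distances, allowing larger distances in $\mathbb{R}^d$ to be even larger in $\mathbb{R}^r$

• Similarity in $\mathbb{R}^d$

$$\forall i, j, \quad p_{i,j} = \frac{p_{i|j} + p_{j|i}}{2N}$$

$$\forall i, j, \quad p_{i|j} = \frac{\exp\Big(-\frac{\|x_i - x_j\|_2^2}{2\sigma_i^2}\Big)}{\sum_{k \neq i} \exp\Big(-\frac{\|x_i - x_k\|_2^2}{2\sigma_i^2}\Big)}$$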


• Similarity in $\mathbb{R}^r$

$$\forall i, j, \quad q_{i,j} = \frac{(1 + \|y_i - y_j\|_2^2)^{-1}}{\sum_{k \neq l} (1 + \|y_k - y_l\|_2^2)^{-1}}$$

• Minimize the discrepancy between $q_{i,j}$ and $p_{i,j}$, measured by the Kullback-Leibler divergence:

$$C = \sum_{i,j} p_{i,j} \log\Big(\frac{p_{i,j}}{q_{i,j}}\Big)$$

Complexity: $O(N^2)$. Optimized with Barnes-Hut: complexity $O(N \log N)$
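A minimal sketch of the quantities defined above: the similarities in the input space, the similarities in the embedding space and the cost C. It is only the O(N²) evaluation of the cost, not the optimization over the $y_i$, and it assumes a single bandwidth `sigma` shared by all points, whereas t-SNE tunes one $\sigma_i$ per point; the function names are illustrative.

```python
import numpy as np

def pairwise_sq_dists(A):
    """Squared Euclidean distances between the rows of A."""
    sq = np.sum(A ** 2, axis=1)
    return np.maximum(sq[:, None] + sq[None, :] - 2.0 * A @ A.T, 0.0)

def high_dim_similarities(X, sigma=1.0):
    """Symmetrized Gaussian similarities p_ij (assumption: same sigma for every point)."""
    N = X.shape[0]
    E = np.exp(-pairwise_sq_dists(X) / (2.0 * sigma ** 2))
    np.fill_diagonal(E, 0.0)                        # the sums exclude k = i
    cond = E / E.sum(axis=1, keepdims=True)         # each row is normalized separately
    return (cond + cond.T) / (2.0 * N)

def low_dim_similarities(Y):
    """Student-t similarities q_ij, normalized over all pairs k != l."""
    T = 1.0 / (1.0 + pairwise_sq_dists(Y))
    np.fill_diagonal(T, 0.0)
    return T / T.sum()

def kl_cost(P, Q):
    """Kullback-Leibler divergence sum_ij p_ij log(p_ij / q_ij), skipping zero terms."""
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / Q[mask])))
```

Both similarity matrices sum to 1, so they can be compared as probability distributions over pairs of points.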
