Datasets Preprocessing Dimensionality reduction
Preprocessing and Dimensionality Reduction
Jeremy Fix
CentraleSupelec
2019
Where to get data
You need datasets
You can use open datasets
For example, to experiment with a new ML algorithm:
• UCI ML Repo : http://archive.ics.uci.edu/ml/
• Kaggle competitions, e.g. https://www.kaggle.com/c/diabetic-retinopathy-detection
• specific well known datasets for specific ML problems
Where to get data
Some available datasets
Face expression classification
48x48 pixel grayscale images of faces
Labels: 0=Angry, 1=Disgust, 2=Fear, 3=Happy, 4=Sad, 5=Surprise, 6=Neutral
28K Train; 3K for public test, another 3K for final test.
Kaggle, ICML 2013
Where to get data
Some available datasets
Object localization/detection
PascalVOC2012: 20 classes, 20000 Train images, 20000 Test, 11000 Test
Avg image size : 469x387 pixels, RGB
Classes : person / bird, cat, cow, dog, horse, sheep / aeroplane, bicycle, boat, bus, car, motorbike, train / bottle, chair, dining table, potted plant, sofa, tv-monitor
http://host.robots.ox.ac.uk/pascal/VOC/
Where to get data
Some available datasets
Object localization/detection
ImageNet, ILSVRC2014: 1000 classes, 1.2M Train images, 50K Valid, 100K Test
Avg image size : 482x415 pixels, RGB
ImageNet Large Scale Visual Recognition Challenge, Russakovsky et al. (2015)
Where to get data
Some available datasets
Object localization/detection
Open Images Dataset: https://github.com/openimages/dataset
• ≈ 9M automatically labelled images, 4M human validated
• 80M bounding boxes, 6000 classes
• both meta labels (e.g. vehicle) and fine-grained labels (e.g. honda nsx)
Where to get data
Some available datasets
Object segmentation
COCO 2017: 200K images, 80 classes, 500K masks
http://cocodataset.org/
Where to get data
Some available datasets
Recommendation systems
MovieLens, Netflix Prize, Anime Recommendations Database
MovieLens 20M :
• 27K movies rated by 138K users
• 5-star ratings, in half-star increments (0.0, 0.5, ..)
• 20M ratings
• metadata (e.g. genre)
• links to imdb to enrich metadata
https://grouplens.org/datasets/movielens/
Where to get data
Some available datasets
Automatic speech recognition
Timit, VoxForge, ...
Timit :
• 630 speakers, eight American English dialects
• time-aligned orthographic, phonetic and word transcriptions
• 16kHz speech waveform file for each utterance
https://catalog.ldc.upenn.edu/ldc93s1
Where to get data
Some available datasets
Sentiment analysis
Large Movie Review Dataset (IMDB)
• 25K reviews for training, 25K reviews for testing
• movie reviews (sentences), with rating ([1,10])
• aim : Are reviews on a given product positive/negative ?
Maas(2011), Learning Word Vectors for Sentiment Analysis
Automatic translation
Dataset from the european parliament (Europarl dataset)
• single language datasets (language model)
• parallel corpora (translation), e.g. French-English (2M sentences), Czech-English (650K sentences), ..
Europarl: A Parallel Corpus for Statistical Machine Translation, Philipp Koehn, MT Summit 2005
Make your own dataset
You need datasets
You have a specific problem
You may need to collect data on your own.
• Crawl the web ? (e.g. Twitter API, ..)
• if supervised learning : assign labels (Mechanical Turk, domain experts, e.g. for classifying tumors)
• Ensure you collected sufficient features
We need vectors, appropriately scaled, without missing values
Preprocessing
We need vectors, appropriately scaled, without missing values
Preprocessing data
Data are not necessarily vectorial
• Ordinal or Categorical : poor/fair/excellent ; Male/Female
• Text documents : bag of words / word embeddings
Even if vectorial
• Missing data : check how missing values are indicated (-9, ’ ’,..) → Imputation of missing values
• Feature scaling
We need vectors, appropriately scaled, without missing values
Your data might not be vectorial data
Ordinal and categorical features
Ordinal values have an order.

Ordinal feature value      poor    fair    excellent
Numerical feature value    -1      0       1

Categorical values do not have an order (use one-hot) :

Categorical value    American        Spanish         German          French
Numerical value      [1, 0, 0, 0]    [0, 1, 0, 0]    [0, 0, 1, 0]    [0, 0, 0, 1]
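The two encodings above can be sketched in a few lines of Python. This is only an illustration: the names `ORDINAL_SCALE`, `NATIONALITIES`, `encode_ordinal` and `encode_one_hot` are hypothetical and not part of the lecture material.

```python
import numpy as np

# Hypothetical lookup tables mirroring the two tables of the slide.
ORDINAL_SCALE = {"poor": -1, "fair": 0, "excellent": 1}
NATIONALITIES = ["American", "Spanish", "German", "French"]

def encode_ordinal(value):
    """Map an ordinal value to a number that preserves its order."""
    return ORDINAL_SCALE[value]

def encode_one_hot(value, categories=NATIONALITIES):
    """Map a categorical value to a one-hot vector (no order implied)."""
    vec = np.zeros(len(categories), dtype=int)
    vec[categories.index(value)] = 1
    return vec

ordinal_example = encode_ordinal("fair")
one_hot_example = encode_one_hot("German")
```

One-hot coding avoids suggesting that, say, German is "larger" than Spanish, which an integer code would implicitly do.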
We need vectors, appropriately scaled, without missing values
Your data might not be vectorial data
Vectorial representation of text documents
Bag Of Words
• define a vocabulary V, |V| = n
• for each document, build a vector x so that $x_i$ is the frequency of the word $V_i$
e.g. V = {I, in, love, metz, machine learning, study}
I love machine learning and love metz too. → x = [1, 0, 2, 1, 1, 0]
I love studying machine learning in Metz. → x = [1, 1, 1, 1, 1, 1]
Does not take the word order into account → N-grams, but this leads to sparser representations
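As a minimal sketch, the bag-of-words vectors of the example can be reproduced as follows. The tokenizer is an assumption made for this illustration: it merges "machine learning" into a single token and applies a crude stemming rule so that "studying" counts as "study"; `VOCAB`, `tokenize` and `bag_of_words` are hypothetical names.

```python
import re
from collections import Counter

# Vocabulary of the slide, lower-cased, with "machine learning" as one token.
VOCAB = ["i", "in", "love", "metz", "machinelearning", "study"]

def tokenize(text):
    """Lower-case, merge the bigram 'machine learning', crude stemming."""
    text = text.lower().replace("machine learning", "machinelearning")
    tokens = re.findall(r"[a-z]+", text)
    # Crude stemming, just enough for this example (assumption).
    return ["study" if t.startswith("study") else t for t in tokens]

def bag_of_words(text, vocab=VOCAB):
    """Count how many times each vocabulary word occurs in the text."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocab]

x1 = bag_of_words("I love machine learning and love metz too.")
x2 = bag_of_words("I love studying machine learning in Metz.")
```

Words outside the vocabulary ("and", "too") are simply ignored, and the word order is lost.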
We need vectors, appropriately scaled, without missing values
Your data might not be vectorial data
Vectorial representation of text documents
Word/Sentence embeddings (e.g. word2vec, GloVe, fastText).
Continuous Bag of Words (CBOW) : predict a word given its context
• Input and output coded with one-hot
• predict a word given its context
• hidden layer : word representation
Captures some semantic information.
For sentences : tweet2vec, sentence2vec, word vector averaging
see also : Bayesian approaches (e.g. Latent Dirichlet Allocation)
Pennington(2014) GloVe: Global Vectors for Word Representation
Mikolov(2013) Efficient Estimation of Word Representations in Vector Space
https://fasttext.cc/
We need vectors, appropriately scaled, without missing values
Some features might be missing
Missing features
• Completely drop the samples with missing attributes, or the dimensions that have missing values
• or try to impute, i.e. set a value in place of the missing attributes
For missing value imputation, there are plenty of methods :
• global : assign the mean, median, or most frequent value of an attribute
• local : based on k-nearest neighbors, decide which value to impute
The bias you may introduce by imputing a value may depend on the causes of the missing values, see [Silva(2014)].
Silva(2014). A brief review of the main approaches for treatment of missing data
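A minimal NumPy sketch of one global and one local imputation strategy is given below. It assumes missing values are coded as NaN, and the local variant averages over the k nearest complete samples, which is only one of several possible choices; `impute_mean` and `impute_knn` are hypothetical names.

```python
import numpy as np

def impute_mean(X):
    """Global imputation: replace each NaN by the mean of its column."""
    X = np.asarray(X, dtype=float).copy()
    col_means = np.nanmean(X, axis=0)
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = col_means[cols]
    return X

def impute_knn(X, k=2):
    """Local imputation: average the missing entries over the k nearest
    complete samples, distances being computed on the observed dimensions."""
    X = np.asarray(X, dtype=float)
    out = X.copy()
    incomplete = np.isnan(X).any(axis=1)
    complete = X[~incomplete]
    for i in np.where(incomplete)[0]:
        miss = np.isnan(X[i])
        dists = np.linalg.norm(complete[:, ~miss] - X[i, ~miss], axis=1)
        nearest = complete[np.argsort(dists)[:k]]
        out[i, miss] = nearest[:, miss].mean(axis=0)
    return out

X = np.array([[1.0, 2.0], [2.0, np.nan], [5.0, 6.0], [1.5, 4.0]])
X_mean = impute_mean(X)
X_knn = impute_knn(X, k=2)
```

On this toy array the two strategies impute different values for the same entry, which illustrates that the choice of method is not neutral.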
We need vectors, appropriately scaled, without missing values
Some vectorial data might not be appropriately scaled
Feature scaling
• dimensions with the largest variations will dominate Euclidean distances (e.g. nearest neighbors)
• when gradient descent is involved, feature scaling makes convergence faster (because the level sets of the loss become closer to circular)
• when regularization is involved, we would like to use a single regularization coefficient, independent of the scale of the features
We need vectors, appropriately scaled, without missing values
Some vectorial data might not be appropriately scaled
Feature scaling
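Given $x_i \in \mathbb{R}^d$, you can normalize by :
• min/max scaling :
$$\forall i, \forall j \in [0, d-1], \quad x'_{i,j} = \frac{x_{i,j} - \min_k x_{k,j}}{\max_k x_{k,j} - \min_k x_{k,j}}$$
• z-score normalization :
$$\forall i, \forall j \in [0, d-1], \quad x'_{i,j} = \frac{x_{i,j} - \mu_j}{\sigma_j}, \qquad \mu_j = \frac{1}{N}\sum_k x_{k,j}, \qquad \sigma_j = \sqrt{\frac{1}{N}\sum_k (x_{k,j} - \mu_j)^2}$$
Your statistics must be computed from the training set and applied also to test data.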
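The following sketch illustrates both scalings, with the statistics fitted on the training set only and then reused on test data. The function names and the toy arrays are hypothetical; the guard against constant features is an addition not discussed in the slides.

```python
import numpy as np

def fit_zscore(X_train):
    """Per-feature mean and standard deviation of the training set."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0)               # 1/N normalisation
    sigma = np.where(sigma == 0, 1.0, sigma)  # guard: constant features
    return mu, sigma

def apply_zscore(X, mu, sigma):
    return (X - mu) / sigma

def fit_minmax(X_train):
    """Per-feature minimum and maximum of the training set."""
    return X_train.min(axis=0), X_train.max(axis=0)

def apply_minmax(X, lo, hi):
    span = np.where(hi == lo, 1.0, hi - lo)   # guard: constant features
    return (X - lo) / span

X_train = np.array([[0.0, 10.0], [2.0, 30.0]])
X_test = np.array([[4.0, 20.0]])

mu, sigma = fit_zscore(X_train)
Z_test = apply_zscore(X_test, mu, sigma)      # uses training statistics
lo, hi = fit_minmax(X_train)
M_test = apply_minmax(X_test, lo, hi)         # may fall outside [0, 1]
```

Note that scaled test values may fall outside the range seen in training; this is expected, since the test set must not influence the statistics.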
Dimensionality reduction
Dimensionality reduction : what/why/how ?
What
Optimally transform $x_i \in \mathbb{R}^n$ into $z_i \in \mathbb{R}^d$ so that $d \ll n$.
It remains to define what “optimally transform” means.
Why
• visualization of the data
• interpretability of the predictor
• speed up the algorithms whose complexity depends on n
• data may occupy a manifold of lower dimensionality than n
• curse of dimensionality : data quickly become sparse, models may overfit
Dimensionality reduction: why ?
Data analysis/Visualization
How are your data distributed ? How are your classes intertwined ? Do we have discriminative features ?
[Figure: t-SNE embedding of MNIST, van der Maaten et al.]
Dimensionality reduction: why ?
Interpretability of the predictor
e.g. Why does this predictor say the tumor is malignant ?
[Figure: two predictors, with real risk = 0.92 ± 0.05 and real risk = 0.92 ± 0.06]
UCI ML Breast Cancer Wisconsin (Diagnostic) dataset. Real risk estimated by 10-fold CV.
Dimensionality reduction: why ?
Speed up of the algorithms
Decreasing dimensionality decreases training/inference times.
For example :
• Linear regression : $y = \theta^T x + b$
• Logistic regression (classification) : $P(y = 1 \mid x) = \frac{1}{1 + \exp(\theta^T x)}$
Both training (per sample and per gradient step) and inference are in O(n), with $x \in \mathbb{R}^n$
Dimensionality reduction: why ?
The data may occupy a lower dimensional manifold
Swiss roll
→ you do not necessarily lose information by reducing the number of dimensions
Dimensionality reduction: why ?
The data may occupy a lower dimensional manifold
You want to classify facial expressions of a single person, under controlled illumination:
• suppose a huge image resolution, e.g. 1024 × 1024 RGB pixels, $x \in \mathbb{R}^{1024 \times 1024 \times 3}$
• what is the dimensionality of the data manifold ? ≈ 50
→ you do not necessarily lose information by reducing the number of dimensions
Dimensionality reduction: why ?
You may even have better predictors : Curse of dimensionality
The data become sparse exponentially fast as the dimension grows
[Figure from Goodfellow, Bengio, Courville (2016) : Deep learning]
See also [Hastie et al.(2017), The elements of statistical learning]
Dimensionality reduction : what/why/how ?
What
Optimally transform $x_i \in \mathbb{R}^n$ into $z_i \in \mathbb{R}^d$ so that $d \ll n$.
It remains to define what “optimally transform” means.
How
• select a subset of the original features : feature selection
• compute new features from the original ones : feature extraction
Feature selection
Feature selection
Select a subset of the original features/attributes/dimensions
$x_i \in \mathbb{R}^n \rightarrow z \in \mathbb{R}^d$
Feature selection
Feature selection
Overview
• Embedded : The ML algorithm is designed to select a subset of the features, e.g. linear regression with L1 penalty
• Filters : dimensions are selected based on a heuristic
• Wrappers : dimensions are selected based on an estimation of the real risk
⇒ Notebook "Feature selection.ipynb"
Feature selection
Feature selection: embedded
Embedded : the loss to minimize embeds a penalty promoting sparsity.
Least Absolute Shrinkage and Selection Operator (LASSO)
Given a regression problem $(x_i, y_i)$, $x_i \in \mathbb{R}^n$, $y_i \in \mathbb{R}$, optimize w.r.t. $\theta$:
$$\frac{1}{N}\sum_{i=0}^{N-1} (y_i - \theta^T x_i)^2 + \lambda |\theta|_1 \qquad (1)$$
Linear regression with L1 penalty. The L1 penalty promotes sparse predictors.
Tibshirani (1996). Regression shrinkage and Selection via the Lasso
Feature selection
Feature selection: embedded
LASSO example
N = 30 points, $y_i = 0.5 + 0.4 \sin(2\pi x_i) + \mathcal{N}(0, 0.01)$
30 RBF features + constant term:
$$\phi(x) = \left[1, e^{-\frac{(x_0 - x)^2}{2\sigma^2}}, \ldots, e^{-\frac{(x_{N-1} - x)^2}{2\sigma^2}}\right]$$
[Figure, left panel "Fit": samples, true function, and the fits lreg and lreg_l1. Right panel "Parameters": values of the coefficients of the two fits]
≈ 20% − 33% features selected
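A minimal sketch of how such a sparse fit could be obtained is given below. It minimizes the LASSO objective (1/N) Σ (y_i − θ^T φ(x_i))² + λ|θ|_1 by proximal gradient descent (ISTA), which is a technique swapped in for illustration and not necessarily the solver used for the figure. The RBF width `sigma`, the value of `lam`, the noise standard deviation (0.1, reading N(0, 0.01) as a variance) and the sampling of the x_i are all assumptions.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * |.|_1 : shrink towards 0, clip at 0."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(Phi, y, lam, n_iter=5000):
    """Minimize (1/N) ||y - Phi theta||^2 + lam |theta|_1 with ISTA."""
    N, n = Phi.shape
    theta = np.zeros(n)
    # Lipschitz constant of the gradient of the quadratic term.
    L = 2.0 * np.linalg.norm(Phi, 2) ** 2 / N
    for _ in range(n_iter):
        grad = -2.0 / N * Phi.T @ (y - Phi @ theta)
        theta = soft_threshold(theta - grad / L, lam / L)
    return theta

# Toy data in the spirit of the slide (all numerical choices are assumptions).
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 30)
y = 0.5 + 0.4 * np.sin(2 * np.pi * x) + rng.normal(0.0, 0.1, 30)
sigma = 0.1
# Row i holds the constant term and the 30 RBF features of sample x[i].
Phi = np.c_[np.ones(30), np.exp(-(x[None, :] - x[:, None]) ** 2 / (2 * sigma ** 2))]

theta = lasso_ista(Phi, y, lam=0.01)
n_selected = int(np.sum(np.abs(theta) > 1e-8))  # features with nonzero weight
```

The soft-thresholding step is what sets coefficients exactly to zero; increasing `lam` selects fewer features.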
Feature selection
Feature selection: embedded
Decision tree example
Decision Tree with Gini impurity, max depth = 2, 10-fold CV (0.92)
UCI ML Breast Cancer Wisconsin dataset : 569 samples, binary classification, 30 continuous features.
Feature selection
Feature selection: univariate filters
Principle : measure the correlation/dependency between each input feature, considered independently, and the target. E.g. chi-2, Anova test of independence, mutual information measure, Pearson correlation, ..
Example Continuous → Discrete : Anova
Breast cancer, F-values Anova
[Figure: class-conditional distributions $P(x_{14} \mid y)$ (lowest F) and $P(x_{27} \mid y)$ (highest F)]
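As an illustration of a univariate filter, the one-way Anova F-value of each feature can be computed with a few lines of NumPy: it is the ratio of the between-class variance to the within-class variance, computed feature by feature. The function name `anova_f_values` and the toy data are hypothetical.

```python
import numpy as np

def anova_f_values(X, y):
    """One-way Anova F-value of each column of X with respect to labels y."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    N, K = X.shape[0], len(classes)
    grand_mean = X.mean(axis=0)
    ss_between = np.zeros(X.shape[1])
    ss_within = np.zeros(X.shape[1])
    for c in classes:
        Xc = X[y == c]
        class_mean = Xc.mean(axis=0)
        ss_between += len(Xc) * (class_mean - grand_mean) ** 2
        ss_within += ((Xc - class_mean) ** 2).sum(axis=0)
    return (ss_between / (K - 1)) / (ss_within / (N - K))

# Feature 0 separates the two classes, feature 1 does not.
X = np.array([[0.0, 1.0], [1.0, 0.0], [10.0, 1.0], [11.0, 0.0]])
y = np.array([0, 0, 1, 1])
F = anova_f_values(X, y)
ranking = np.argsort(F)[::-1]   # features sorted by decreasing F-value
```

A filter would then keep the features with the highest F-values, each feature being scored independently of the others.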
Feature selection
Feature selection: multivariate filters and wrappers
Overview
Denote χ a subset of the dimensions/attributes/features
• suppose we are provided a measure J(χ) of how good this subset is
• we optimize J(χ) over the possible subsets χ
If $x \in \mathbb{R}^n$, we have $2^n$ possible subsets χ:
$$\chi \in \{\emptyset, \{x_1\}, \{x_2\}, \cdots, \{x_1, x_2\}, \cdots\}$$
http://featureselection.asu.edu/ : Algorithms and datasets, Python package scikit-feature.
John et al.(1994) Irrelevant features and the subset selection problem.
Feature selection
Feature selection: optimizing J(χ)
Tree search
[Figure: tree of the feature subsets, from the empty set ∅ to the full set $X_d$]

Subsets                                                      Number of sets
∅                                                            1
{x0}, {x1}, ..., {x_{d−1}}                                   d
{x0, x1}, {x0, x2}, ..., {x_{d−2}, x_{d−1}}                  d(d−1)/2
subsets of size k                                            d! / (k!(d−k)!)
X_d                                                          1

Sequential Forward Search : start from ∅ and add features one at a time.
Sequential Backward Search : start from X_d and remove features one at a time.
If you allow undoing steps : “Sequential Floating Forward Search” / “Sequential Floating Backward Search”
Variants and extensions : Somol et al.(2010) Efficient Feature Subset Selection and Subset Size Optimization
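Sequential Forward Search can be sketched as a greedy loop over a user-supplied criterion. In the sketch below, `J` is any function scoring a list of feature indices, to be maximized; the additive toy criterion is a hypothetical stand-in for a real filter or wrapper score.

```python
def sequential_forward_search(n_features, J, k):
    """Greedily grow a subset of k features, adding at each step the
    feature that maximizes the criterion J of the enlarged subset."""
    selected = []
    remaining = set(range(n_features))
    while len(selected) < k and remaining:
        best = max(sorted(remaining), key=lambda j: J(selected + [j]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy criterion (assumption): each feature has a fixed individual merit.
merits = [0.1, 0.9, 0.5]

def toy_criterion(subset):
    return sum(merits[j] for j in subset)

chosen = sequential_forward_search(3, toy_criterion, k=2)
```

This evaluates on the order of k·d subsets instead of the $2^d$ of an exhaustive search, at the price of possibly missing the best subset.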
Feature selection
Feature selection: quantifying the quality of a subset of features
We need to quantify the quality J(χ) of a subset of features. Filters use a heuristic to be maximized.
Filters
Heuristic : e.g. Correlation-based feature selection (CFS)
Strategy: keep features correlated with the label, yet uncorrelated with each other.
Given a training set $\{(x_i, y_i)\}$:
$$J_{CFS}(\chi) = \frac{k\, r(\chi, y)}{\sqrt{k(k-1)\, r(\chi, \chi)}}$$
$$r(\chi, y) = \frac{1}{k}\sum_{j \in \chi} r(x_{\cdot,j}, y), \qquad r(\chi, \chi) = \frac{1}{k(k-1)}\sum_{(j_1, j_2) \in \chi,\ j_1 \neq j_2} r(x_{\cdot,j_1}, x_{\cdot,j_2})$$
with $k = |\chi|$ and r a measure of correlation
Feature selection
Feature selection: quantifying the quality of a subset of features
We need to quantify the quality J(χ) of a subset of features. Wrappers use an estimation of the real risk to be minimized.
Wrappers
1 Train a predictor from the subset χ
2 J(χ) = estimation of the real risk (e.g. cross validation)
More theoretically grounded, but more computationally expensive.
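A wrapper criterion can be sketched as follows: train a predictor on the selected columns only and return its cross-validated risk. The choice of ordinary least squares as the predictor, of the mean squared error as the loss, and of contiguous (unshuffled) folds are assumptions made to keep the example short; `cv_risk` is a hypothetical name.

```python
import numpy as np

def cv_risk(X, y, subset, n_folds=5):
    """Estimate the real risk (mean squared error) of a least-squares
    predictor trained only on the features listed in `subset`."""
    X = np.asarray(X, dtype=float)[:, list(subset)]
    y = np.asarray(y, dtype=float)
    folds = np.array_split(np.arange(len(y)), n_folds)
    errors = []
    for test_idx in folds:
        train = np.ones(len(y), dtype=bool)
        train[test_idx] = False
        # Least squares with a bias term, fitted on the training folds.
        A_train = np.c_[X[train], np.ones(train.sum())]
        theta, *_ = np.linalg.lstsq(A_train, y[train], rcond=None)
        A_test = np.c_[X[test_idx], np.ones(len(test_idx))]
        errors.append(np.mean((A_test @ theta - y[test_idx]) ** 2))
    return float(np.mean(errors))

# Toy data: only feature 0 is informative.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = 2.0 * X[:, 0] + 1.0

risk_informative = cv_risk(X, y, [0])
risk_noise = cv_risk(X, y, [1])
```

Each evaluation of J(χ) requires `n_folds` trainings, which is why wrappers are more expensive than filters.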
Feature extraction
Feature extraction
Given N samples $x_i \in \mathbb{R}^d$, we compute $r \ll d$ new features from the original d features.
Feature extraction
Principal Component Analysis [Pearson(1901)]
Statement : Find an affine transformation of the data minimizing the reconstruction error
Intuition and formalisation
[Figure: 2D point cloud, a line defined by $w_0$ and $w_1$, and red segments between the points and the line]
In 1D, we seek a line $(w_0, w_1)$ minimizing the sum of the squared lengths of the red segments. It is not unique !
Feature extraction
Principal Component Analysis [Pearson(1901)]
Statement : Find an affine transformation of the data minimizing the reconstruction error
Formally :
$$\min_{w_0, w_1, \ldots, w_r \in \mathbb{R}^d} \sum_{i=0}^{N-1} \left\| x_i - \left( w_0 + \sum_{j=1}^{r} \left( w_j^T (x_i - w_0) \right) w_j \right) \right\|_2^2 \qquad (2)$$
subject to $w_i^T w_j = \delta_{i,j}$.
→ matrix form ?
Feature extraction
Principal Component Analysis [Pearson(1901)]
Matrix formulation of PCA
Introduce $W = (w_1 | \ldots | w_r) \in \mathcal{M}_{d \times r}(\mathbb{R})$
$$(2) \Leftrightarrow \min_{w_0, w_1, \ldots, w_r \in \mathbb{R}^d} \sum_{i=0}^{N-1} \left\| (I_d - WW^T)(x_i - w_0) \right\|_2^2$$
subject to $W^T W = I_r$
Feature extraction
Principal Component Analysis [Pearson(1901)]
Simplification of the matrix formulation
• If M is idempotent, so is (I−M)
• (Id −WWT ) is symmetric and idempotent
$$(2) \Leftrightarrow \min_{w_0, w_1, \ldots, w_r \in \mathbb{R}^d} \sum_{i=0}^{N-1} (x_i - w_0)^T (I_d - WW^T)(x_i - w_0)$$
subject to $W^T W = I_r$
Feature extraction
Principal Component Analysis [Pearson(1901)]
Remember : For $u : \mathbb{R}^n \mapsto \mathbb{R}^m$, $v : \mathbb{R}^n \mapsto \mathbb{R}^m$, $A \in \mathcal{M}_{m,m}(\mathbb{R})$:
$$\frac{d\, u^T A v}{dx} = \frac{du}{dx} A v + \frac{dv}{dx} A^T u$$
Finding $w_0$
$$J = \sum_{i=0}^{N-1} (x_i - w_0)^T (I_d - WW^T)(x_i - w_0)$$
$$\frac{\partial J}{\partial w_0} = -2 (I_d - WW^T) \sum_{i=0}^{N-1} (x_i - w_0)$$
Feature extraction
Principal Component Analysis [Pearson(1901)]
Finding $w_0$
$$J = \sum_{i=0}^{N-1} (x_i - w_0)^T (I_d - WW^T)(x_i - w_0)$$
$$\frac{\partial J}{\partial w_0} = -2 (I_d - WW^T) \sum_{i=0}^{N-1} (x_i - w_0)$$
$$\frac{\partial J}{\partial w_0} = 0 \Leftrightarrow \exists h \in \mathrm{span}\{w_1, \ldots, w_r\},\ w_0 = h + \frac{1}{N}\sum_i x_i$$
$(I_d - WW^T) h$ is the residual of the orthogonal projection of h on the column vectors of W.
If $h \in \mathrm{span}\{w_1, \ldots, w_r\}$, the residual is 0.
If $h \in \mathrm{span}\{w_1, \ldots, w_r\}^{\perp}$, the residual is h.
Feature extraction
Principal Component Analysis [Pearson(1901)]
Finding $w_0$
$$J = \sum_{i=0}^{N-1} (x_i - w_0)^T (I_d - WW^T)(x_i - w_0)$$
$$\arg\min_{w_0} J \Rightarrow w_0 = h + \frac{1}{N}\sum_i x_i, \qquad h \in \mathrm{span}\{w_1, \ldots, w_r\}, \text{ e.g. } h = 0$$
The offset $w_0$
The offset $w_0$ is the mean of the data points, up to a translation in the space spanned by the principal component vectors.
Step 1 : Center the data
Feature extraction
Principal Component Analysis [Pearson(1901)]
Denote $\tilde{x}_i = x_i - \bar{x}$, $\bar{x} = \frac{1}{N}\sum_i x_i$
Deriving the first principal component
• $J = \sum_{i=0}^{N-1} \tilde{x}_i^T (I_d - WW^T) \tilde{x}_i$
• $\arg\min_{w_1, \ldots, w_r} J = \arg\max_{w_1, \ldots, w_r} \sum_{i=0}^{N-1} \tilde{x}_i^T W W^T \tilde{x}_i$
• $X = (\tilde{x}_0 | \ldots | \tilde{x}_{N-1})$
• $\arg\min_{w_1, \ldots, w_r} J = \arg\max_{w_1, \ldots, w_r} \sum_{j=1}^{r} w_j^T X X^T w_j$
Our optimization problem turns out to be :
$$\arg\max_{w_1, \ldots, w_r} \sum_{j=1}^{r} w_j^T X X^T w_j \quad \text{subject to } W^T W = I_r$$
We have a constrained optimization problem : Lagrangian.
Feature extraction
Principal Component Analysis [Pearson(1901)]
Deriving the first principal component : Lagrangian
$$\arg\max_{w_1} w_1^T XX^T w_1 \quad \text{subject to } w_1^T w_1 = 1$$
• $L(w_1, \lambda_1) = w_1^T XX^T w_1 + \lambda_1 (1 - w_1^T w_1)$
• $\frac{\partial L}{\partial w_1} = 0 \Rightarrow XX^T w_1 = \lambda_1 w_1$ : $w_1$ is an eigenvector, but for which $\lambda_1$ ?
• $w_1^T XX^T w_1 = \lambda_1$, so $\lambda_1$ is the largest eigenvalue of $XX^T$
First principal component vector
The first principal component vector is a normalized eigenvector associated with the largest eigenvalue of the “sample covariance matrix” $XX^T$
Feature extraction
Principal Component Analysis [Pearson(1901)]
Deriving the second principal component : Greedy
Suppose we have $w_1$, a normalized eigenvector of $XX^T$ associated with its largest eigenvalue. Denote $\lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_d \geq 0$ the eigenvalues.
We want to optimize :
$$\arg\max_{w_2} w_1^T XX^T w_1 + w_2^T XX^T w_2 = \arg\max_{w_2} \lambda_1 + w_2^T XX^T w_2 = \arg\max_{w_2} w_2^T XX^T w_2$$
with $w_i^T w_j = \delta_{i,j}$.
And $w_2^T XX^T w_2 = w_2^T (XX^T - \lambda_1 w_1 w_1^T) w_2$, since $w_1^T w_2 = 0$.
$w_2$ is a normalized eigenvector associated with the largest eigenvalue of $(XX^T - \lambda_1 w_1 w_1^T)$, i.e. $\lambda_2$
And so on. But is the greedy algorithm finding the optimum ?
Feature extraction
Principal Component Analysis [Pearson(1901)]
Deriving the other principal components : Greedy
Does it make sense to use a greedy algorithm ? (proof in lecture notes)
Theorem
For any symmetric positive semi-definite matrix $M \in \mathcal{M}_{d \times d}(\mathbb{R})$, denote $\{\lambda_i\}_{i=1..d}$ its eigenvalues with $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_d \geq 0$. For any set of $r \in [|1, d|]$ orthogonal unit vectors $\{v_1, \ldots, v_r\}$, we have :
$$\sum_{j=1}^{r} v_j^T M v_j \leq \sum_{j=1}^{r} \lambda_j \qquad (3)$$
And this upper bound is reached by eigenvectors associated with the largest eigenvalues of M
Feature extraction
Principal Component Analysis [Pearson(1901)]
PCA : recipe
Given $\{x_0, \ldots, x_{N-1}\} \in \mathbb{R}^d$, to compute the r principal component vectors :
1 Center your data $\tilde{x}_i = x_i - \bar{x}$
2 Build the matrix $X = [\tilde{x}_0 | \ldots | \tilde{x}_{N-1}]$
3 Compute r normalized eigenvectors associated with the r largest eigenvalues of $XX^T$
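The recipe above can be sketched with NumPy as follows. The sketch assumes the samples are stored one per row, so the centered matrix is transposed to match the d × N convention of the slides; `pca_fit` and `pca_project` are hypothetical names.

```python
import numpy as np

def pca_fit(X_rows, r):
    """X_rows has shape (N, d), one sample per row.
    Returns the mean, the r principal vectors (d x r) and their eigenvalues."""
    mean = X_rows.mean(axis=0)
    X = (X_rows - mean).T                    # step 1 and 2 : centered, d x N
    eigvals, eigvecs = np.linalg.eigh(X @ X.T)   # symmetric matrix, d x d
    order = np.argsort(eigvals)[::-1][:r]    # step 3 : r largest eigenvalues
    return mean, eigvecs[:, order], eigvals[order]

def pca_project(x, mean, W):
    """Coordinates of x (or of each row of x) on the principal vectors."""
    return (x - mean) @ W

# Toy data lying exactly on the line y = x.
X_rows = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
mean, W, lam = pca_fit(X_rows, r=1)
Z = pca_project(X_rows, mean, W)
```

`np.linalg.eigh` returns unit-norm eigenvectors; their sign is arbitrary, so the projections are defined up to a sign flip per component.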
Feature extraction
Principal Component Analysis [Pearson(1901)]
PCA is a projection method
Given $x \in \mathbb{R}^d$, its principal components are its coordinates in the selected eigenspace :
$$x \mapsto \left( (x - \bar{x})^T w_1, (x - \bar{x})^T w_2, \ldots, (x - \bar{x})^T w_r \right)$$
If $x \in \{x_0, \ldots, x_{N-1}\}$, it is better to use the SVD, which directly gives you the principal components.
Feature extraction
Principal Component Analysis [Pearson(1901)]
Singular Value Decomposition
For any matrix $M \in \mathcal{M}_{d,N}(\mathbb{R})$, there exist an orthogonal matrix $U \in \mathcal{M}_{d,d}(\mathbb{R})$, a diagonal matrix $D \in \mathcal{M}_{d,N}(\mathbb{R})$, and an orthogonal matrix $V \in \mathcal{M}_{N,N}(\mathbb{R})$, such that :
$$M = U D V^T$$
Orthogonal matrices : $U^T = U^{-1}$.
Feature extraction
Principal Component Analysis [Pearson(1901)]
PCA with SVD
Given $X = U D V^T$ :
$$XX^T = U D D^T U^{-1}$$
This is the diagonalization of $XX^T$.
The projection vectors are the column vectors of U : $\{w_1, \ldots, w_r\} = \{u_1, \ldots, u_r\}$.
The principal components of the training set are the r first rows of :
$$U^T X = U^T U D V^T = D V^T$$
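A minimal NumPy sketch of PCA through the SVD follows. It assumes one sample per row on input and uses the thin SVD (`full_matrices=False`), so D is returned as a vector of singular values; `pca_svd` is a hypothetical name.

```python
import numpy as np

def pca_svd(X_rows, r):
    """PCA of the rows of X_rows (N x d) through the SVD of the centered
    d x N matrix X. Returns the mean, the projection vectors W (d x r),
    the principal components Z (r x N) and all the singular values."""
    mean = X_rows.mean(axis=0)
    X = (X_rows - mean).T                          # d x N, centered columns
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    W = U[:, :r]                                   # projection vectors
    Z = s[:r, None] * Vt[:r]                       # r first rows of D V^T
    return mean, W, Z, s

rng = np.random.default_rng(0)
X_rows = rng.normal(size=(20, 5))
mean, W, Z, s = pca_svd(X_rows, r=2)
X = (X_rows - mean).T
```

The squared singular values are the eigenvalues of $XX^T$, and the principal components of the training set are obtained without forming $XX^T$ explicitly.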
Feature extraction
Principal Component Analysis [Pearson(1901)]
What is XXT ?
$$XX^T = \sum_{i=0}^{N-1} \tilde{x}_i \tilde{x}_i^T = \sum_i \left( x_i - \frac{1}{N}\sum_j x_j \right) \left( x_i - \frac{1}{N}\sum_j x_j \right)^T = (N-1)\Sigma$$
with Σ the sample covariance matrix.
Σ is symmetric, positive semi-definite, i.e. its eigenvalues are all non-negative.
Feature extraction
Principal Component Analysis [Pearson(1901)]
Equivalent formulations
There are two equivalent formulations of the PCA :
• Find an affine transformation minimizing the reconstruction error
• Find an affine transformation maximizing the variance of the projections
Feature extraction
Principal Component Analysis
Maximizing the variance of the projections
Suppose your data are centered, i.e. $\frac{1}{N}\sum_i x_i = 0$.
Denote $z_i \in \mathbb{R}^r$ the projection of $x_i$ over $w_1, \ldots, w_r$. We have $\bar{z} = 0$.
The sample covariance matrix $\Sigma \in \mathcal{M}_{r,r}(\mathbb{R})$ of the projections is :
$$\Sigma = \frac{1}{N-1}\sum_i z_i z_i^T = \frac{1}{N-1}\sum_i W^T x_i x_i^T W$$
We want to maximize $\sum_{j=1}^{r} \Sigma_{j,j}$ and :
$$\sum_{j=1}^{r} \Sigma_{j,j} = \frac{1}{N-1}\sum_j \sum_i (w_j^T x_i)(x_i^T w_j) = \frac{1}{N-1}\sum_j w_j^T XX^T w_j$$
This is the same optimization problem as before.
Feature extraction
Principal Component Analysis [Pearson(1901)]
What is the fraction of variance we keep ?
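For any square matrix M and orthogonal matrix P :
$$\mathrm{Tr}(P^{-1} M P) = \mathrm{Tr}(M)$$
Therefore, $\mathrm{Tr}(XX^T) = \sum_{i=1}^{d} \lambda_i$, with $\lambda_1 \geq \ldots \geq \lambda_d$ the eigenvalues of $XX^T$. The variance of our data points is
$$\mathrm{Tr}\left(\frac{1}{N-1} XX^T\right) = \frac{1}{N-1}\sum_{i=1}^{d} \lambda_i$$
If we keep r principal components, we keep a fraction of the variance equal to :
$$\frac{\sum_{i=1}^{r} \lambda_i}{\sum_{i=1}^{d} \lambda_i}$$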
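The fraction of variance kept can be computed directly from the singular values of the centered data, since their squares are the eigenvalues of $XX^T$. The sketch below assumes one sample per row; `explained_variance_fraction` is a hypothetical name.

```python
import numpy as np

def explained_variance_fraction(X_rows, r):
    """Fraction of the total variance kept by the r first principal
    components of the rows of X_rows (N x d)."""
    Xc = X_rows - X_rows.mean(axis=0)
    s = np.linalg.svd(Xc, compute_uv=False)   # sorted in decreasing order
    lam = s ** 2                              # eigenvalues of X X^T
    return float(lam[:r].sum() / lam.sum())

# Toy data: variance 2 along the first axis, 0.5 along the second
# (up to the common 1/(N-1) factor, which cancels in the ratio).
X_rows = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 0.5], [0.0, -0.5]])
fraction_1 = explained_variance_fraction(X_rows, 1)
fraction_2 = explained_variance_fraction(X_rows, 2)
```

A common use is to pick the smallest r whose fraction exceeds a chosen threshold.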
Feature extraction
Principal Component Analysis [Pearson(1901)]
PCA on MNIST (28 × 28 images)
[Figure, left: PCA with 2 principal vectors (axes $w_1$, $w_2$), points colored by digit class 0 to 9; 17.05% of the total variance]
[Figure, right: the 10 first principal vectors]
Feature extraction
Sample covariance and Gram matrices
Definitions
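The sample covariance matrix is :
$$\Sigma = \frac{1}{N-1}\sum_i (x_i - \bar{x})(x_i - \bar{x})^T = \frac{1}{N-1} XX^T$$
The Gram matrix is :
$$G = X^T X = \begin{pmatrix} \tilde{x}_0^T \tilde{x}_0 & \tilde{x}_0^T \tilde{x}_1 & \cdots & \tilde{x}_0^T \tilde{x}_{N-1} \\ \vdots & \vdots & \ddots & \vdots \\ \tilde{x}_{N-1}^T \tilde{x}_0 & \tilde{x}_{N-1}^T \tilde{x}_1 & \cdots & \tilde{x}_{N-1}^T \tilde{x}_{N-1} \end{pmatrix}$$
The Gram matrix is built up from dot products.
The eigenvectors/eigenvalues of G and Σ are related !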
Feature extraction
Sample covariance and Gram matrices
Lemma
$\forall A \in \mathbb{R}^{n \times m}, \ker(A) = \ker(A^T A)$
Theorem (Rank-nullity)
$\forall A \in \mathbb{R}^{n \times m}, \mathrm{rk}(A) + \dim(\ker(A)) = m$.
Theorem
$\forall A \in \mathbb{R}^{n \times m}, \mathrm{rk}(A^T A) = \mathrm{rk}(A A^T) \leq \min(n, m)$
Feature extraction
Lemma (Eigenvalues of the covariance and gram matrices)
The nonzero eigenvalues of the scaled covariance matrix $(N-1)\Sigma = XX^T$ and of the Gram matrix $G = X^T X$ are the same :
$$\{\lambda \in \mathbb{R}^*, \exists v \neq 0, (N-1)\Sigma v = \lambda v\} = \{\lambda \in \mathbb{R}^*, \exists v \neq 0, G v = \lambda v\}$$
And, during the proof, we show that :
• If $(\lambda, v)$ is an eigenpair of $XX^T$, then $(\lambda, X^T v)$ is an eigenpair of $X^T X$
• If $(\lambda, w)$ is an eigenpair of $X^T X$, then $(\lambda, X w)$ is an eigenpair of $XX^T$
There are several applications of this property :
• the eigenface algorithm, used when $N \ll d$
• the nonlinear PCA called Kernel PCA [Schoelkopf, 1999]
Feature extraction
What to do when $N \ll d$
$G \in \mathcal{M}_{N,N}(\mathbb{R})$, $\Sigma \in \mathcal{M}_{d,d}(\mathbb{R})$.
Eigenface
If $N \ll d$, it is much more efficient to “diagonalize” G than Σ. In that case, the recipe is :
1 Center your data $\tilde{x}_i = x_i - \bar{x}$
2 Build the matrix $X = [\tilde{x}_0 | \ldots | \tilde{x}_{N-1}]$
3 Compute the r normalized eigenvectors $w_j \in \mathbb{R}^N$ of G, with eigenvalues $\lambda_j$
4 Project your data on the r normalized eigenvectors of Σ given by :
$$\frac{X w_j}{|X w_j|_2} = \frac{1}{\sqrt{\lambda_j}} X w_j$$
Datasets Preprocessing Dimensionality reduction
Feature extraction
Toward a Kernel PCA
We can reformulate the PCA using only dot products.
PCA with only dot products
Computing the Gram matrix involves only dot products between the $\tilde{x}_i$.
Projecting a vector x on the vector $\frac{1}{\sqrt{\lambda_j}} X w_j$ reads :
$$\left( \frac{1}{\sqrt{\lambda_j}} X w_j \right)^T x = \frac{1}{\sqrt{\lambda_j}} w_j^T \begin{pmatrix} \langle \tilde{x}_0, x \rangle \\ \vdots \\ \langle \tilde{x}_{N-1}, x \rangle \end{pmatrix}$$
A linear algorithm involving only dot products can be rendered non-linear using the kernel trick (see SVM).
The only remaining difficulty is that we must ensure the vectors in the feature space are centered.
Feature extraction
Non linear PCA
Kernel PCA [Scholkopf(1999)]
Consider a kernel $k : \mathbb{R}^d \times \mathbb{R}^d \mapsto \mathbb{R}$, with $\langle \phi(x), \phi(x') \rangle = k(x, x')$, e.g.
• RBF kernel : $k(x, x') = \exp\left( -\frac{|x - x'|_2^2}{2\sigma^2} \right)$
We perform a PCA in the feature space, image of φ.
Compute the Gram matrix and its eigenvalues/eigenvectors $\lambda_j, w_j$.
For projecting a vector x, compute :
$$\frac{1}{\sqrt{\lambda_j}} w_j^T \begin{pmatrix} k(x_0, x) \\ \vdots \\ k(x_{N-1}, x) \end{pmatrix}$$
What about centering the $\phi(x_i)$ ?
Feature extraction
Non linear PCA
Kernel PCA : centering in the feature space
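It can be shown that, given the Gram matrix G, the matrix :
$$\tilde{G} = \left( I_N - \frac{1}{N}\mathbf{1} \right) G \left( I_N - \frac{1}{N}\mathbf{1} \right)$$
with $\mathbf{1}$ the N × N matrix of ones, is the matrix of the dot products of the feature vectors centered in the feature space.
The above transformation is called the double centering transformation.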
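A minimal sketch of Kernel PCA on the training set, with an RBF kernel and double centering, is given below. It only returns the projections of the training points: for training point i, the projection on component j is $\frac{1}{\sqrt{\lambda_j}} w_j^T \tilde{G}_{\cdot,i} = \sqrt{\lambda_j}\, w_{j,i}$. Projecting new points would also require centering their kernel values, which is not covered here. The function names and the value of `sigma` are assumptions.

```python
import numpy as np

def rbf_gram(A, B, sigma):
    """Matrix of k(a_i, b_j) for the RBF kernel, rows of A against rows of B."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def double_center(G):
    """Gram matrix of the feature vectors centered in the feature space."""
    N = G.shape[0]
    C = np.eye(N) - np.ones((N, N)) / N
    return C @ G @ C

def kernel_pca_train(X_rows, r, sigma=1.0):
    """Projections (N x r) of the training points on the r first
    kernel principal components, and the associated eigenvalues."""
    G = double_center(rbf_gram(X_rows, X_rows, sigma))
    lam, V = np.linalg.eigh(G)
    order = np.argsort(lam)[::-1][:r]
    lam, V = lam[order], V[:, order]
    return V * np.sqrt(lam), lam          # column j is sqrt(lam_j) * w_j

rng = np.random.default_rng(0)
X_rows = rng.normal(size=(6, 3))
Z, lam = kernel_pca_train(X_rows, r=2, sigma=1.0)
```

With a linear kernel, `double_center` applied to the plain dot products gives exactly the Gram matrix of the centered data, which is a convenient sanity check.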
Feature extraction
Feature extraction : Manifold learning
Goal : For each $x_i \in \mathbb{R}^d$, associate $y_i \in \mathbb{R}^r$ so that the pairwise distances in $\mathbb{R}^d$ are as similar as possible to the pairwise distances in $\mathbb{R}^r$
Perfect for visualizing the datasets in low dimensions.
Examples : LLE, MDS, Isomap, SNE, t-SNE, ..
Feature extraction
Manifold learning
Overview
$x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}^r$, $r \ll d$, e.g. r = 2
1 Quantify the similarity between pairs of points in Rd
2 Quantify the similarity between pairs of points in Rr
3 Quantify the discrepancy between these similarities
4 Optimize with respect to yi
Feature extraction
Manifold learning
t-distributed Stochastic Neighbor Embedding (t-SNE) [van der Maaten(2008)]
Focuses on preserving local distances, allowing larger distances in $\mathbb{R}^d$ to be even larger in $\mathbb{R}^r$
• Similarity in $\mathbb{R}^d$
$$\forall i, j, \quad p_{i,j} = \frac{p_{i|j} + p_{j|i}}{2N}$$
$$\forall i, j, \quad p_{i|j} = \frac{\exp\left( -\frac{|x_i - x_j|^2}{2\sigma_i^2} \right)}{\sum_{k \neq i} \exp\left( -\frac{|x_i - x_k|^2}{2\sigma_i^2} \right)}$$
Feature extraction
Manifold learning
t-distributed Stochastic Neighbor Embedding (t-SNE) [van der Maaten(2008)]
Focuses on preserving local distances, allowing larger distances in $\mathbb{R}^d$ to be even larger in $\mathbb{R}^r$
• Similarity in $\mathbb{R}^r$
$$\forall i, j, \quad q_{i,j} = \frac{(1 + |y_i - y_j|_2^2)^{-1}}{\sum_{k \neq l} (1 + |y_k - y_l|_2^2)^{-1}}$$
• Make the $q_{i,j}$ as similar as possible to the $p_{i,j}$ by minimizing the Kullback-Leibler divergence :
$$C = \sum_{i,j} p_{i,j} \log\left( \frac{p_{i,j}}{q_{i,j}} \right)$$
Complexity : $O(N^2)$. Optimized with Barnes-Hut, complexity $O(N \log N)$
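The similarities and the cost can be sketched in NumPy as follows. This is a simplification: a single bandwidth `sigma` is used for all points, whereas t-SNE tunes one $\sigma_i$ per point, and the optimization of the $y_i$ by gradient descent is left out. The function names are hypothetical.

```python
import numpy as np

def squared_distances(A):
    """Matrix of squared Euclidean distances between the rows of A."""
    diff = A[:, None, :] - A[None, :, :]
    return (diff ** 2).sum(axis=-1)

def joint_p(X, sigma=1.0):
    """Symmetrized Gaussian similarities p_ij in the input space
    (single sigma for all points: a simplifying assumption)."""
    N = X.shape[0]
    logits = -squared_distances(X) / (2.0 * sigma ** 2)
    np.fill_diagonal(logits, -np.inf)             # exclude k = i from the sum
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    cond = np.exp(logits)
    cond /= cond.sum(axis=1, keepdims=True)       # each row sums to 1
    return (cond + cond.T) / (2.0 * N)

def joint_q(Y):
    """Student-t similarities q_ij in the embedding space."""
    num = 1.0 / (1.0 + squared_distances(Y))
    np.fill_diagonal(num, 0.0)                    # sum over k != l only
    return num / num.sum()

def kl_divergence(P, Q):
    """Cost C = sum_ij p_ij log(p_ij / q_ij), skipping the zero entries of P."""
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / Q[mask])))

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 5))       # points in R^d
Y = rng.normal(size=(8, 2))       # candidate embedding in R^r
P, Q = joint_p(X), joint_q(Y)
cost = kl_divergence(P, Q)
```

Building P and Q explicitly costs $O(N^2)$ in time and memory, which is the complexity the Barnes-Hut approximation reduces.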