Exploratory Analysis - Dimensionality Reduction


Page 1: Exploratory Analysis Dimensionality Reduction


Exploratory Analysis
Dimensionality Reduction

Davide Bacciu

Computational Intelligence & Machine Learning Group
Dipartimento di Informatica
Università di Pisa
[email protected]

Introduzione all’Intelligenza Artificiale - A.A. 2012/2013

Page 2: Exploratory Analysis Dimensionality Reduction


Lecture Outline

1 Exploratory Analysis

2 Dimensionality Reduction
    Curse of Dimensionality
    General View

3 Feature Extraction
    Finding Linear Projections
    Principal Component Analysis
    Applications and Advanced Issues

4 Conclusion

Page 3: Exploratory Analysis Dimensionality Reduction


Drowning in complex data

Slide credit goes to Percy Liang (Lawrence Berkeley National Laboratory)

Page 4: Exploratory Analysis Dimensionality Reduction


Exploratory Data Analysis (EDA)

Discover structure in data
    Find unknown patterns in the data that cannot be predicted using current expert knowledge
    Formulate new hypotheses about the causes of the observed phenomena

A mix of graphical and quantitative techniques
    Visualization
    Finding informative attributes in the data
    Finding natural groups in the data

An interdisciplinary approach
    Computer graphics
    Machine learning
    Data mining
    Statistics

Page 5: Exploratory Analysis Dimensionality Reduction


A Machine Learning Perspective

Often an unsupervised learning task
    Dimensionality reduction
        Feature Extraction
        Feature Selection
    Clustering

Tackles
    Large datasets...
    ...as well as high-dimensional data and small sample sizes

Exploits tools and models beyond statistics
    E.g. non-parametric neural models

Page 6: Exploratory Analysis Dimensionality Reduction


Finding Natural Groups in DNA Microarray

S.L. Pomeroy et al. (2002) Prediction of central nervous system embryonal tumour outcome based on gene expression, Nature, 415, 436-442

Page 7: Exploratory Analysis Dimensionality Reduction


Finding Informative Genes

S.L. Pomeroy et al. (2002) Prediction of central nervous system embryonal tumour outcome based on gene expression, Nature, 415, 436-442

Page 8: Exploratory Analysis Dimensionality Reduction


The Curse of Dimensionality

If the data lies in a high-dimensional space, then an enormous amount of data is required to learn a model

Curse of Dimensionality (Bellman, 1961)
    Some problems become intractable as the number of variables increases
        Huge amount of training data required
        Too many model parameters (complexity)

Given a fixed number of training samples, the predictive power decreases as the sample dimensionality increases (Hughes effect, 1968)

Page 9: Exploratory Analysis Dimensionality Reduction


A Simple Combinatorial Example (I)

A toy 1-dimensional classification task with 3 classes

Classes cannot be separated well: let's add another feature...

Better class separation, but still some errors. What if we add another feature?

Page 10: Exploratory Analysis Dimensionality Reduction


A Simple Combinatorial Example (II)

Classes are well separated

Exponential growth in the complexity of the learned model with increasing dimensionality
Exponential growth in the number of examples required to maintain a given sampling density
    3 samples per bin in 1-D; with 3 bins per axis, 81 samples are needed in 3-D to keep the same density
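The exponential growth can be checked with a few lines of Python; this is a minimal sketch assuming the example's setup of 3 bins per feature axis and 3 training samples per bin:

```python
# Sampling-density blow-up: 3 bins per axis, 3 samples per bin (assumed setup).
bins_per_axis = 3
samples_per_bin = 3

for dims in (1, 2, 3):
    n_bins = bins_per_axis ** dims
    print(f"{dims}-D: {n_bins} bins -> {n_bins * samples_per_bin} samples needed")
# 1-D: 3 bins -> 9 samples; 2-D: 9 bins -> 27; 3-D: 27 bins -> 81
```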

Page 11: Exploratory Analysis Dimensionality Reduction


Intrinsic Dimension

The intrinsic dimension of data is the minimum number of independent parameters needed to account for the observed properties of the data

Data might live on a lower-dimensional surface (fold) than expected

Page 12: Exploratory Analysis Dimensionality Reduction


What is the Intrinsic Dimension?

Might not be an easy question to answer...

It may increase due to noise
A data fold needs to be unfolded to reveal its intrinsic dimension

Page 13: Exploratory Analysis Dimensionality Reduction


Informative vs. Uninformative Features

Data can be made of several dimensions that are either unimportant or comprise only noise
    Irrelevant information might distract the learning model
    Learning resources (memory) are wasted to represent irrelevant portions of the input space

Dimensionality reduction aims at automatically finding a lower-dimensional representation of high-dimensional data
    Counteracts the curse of dimensionality
    Reduces the effect of unimportant attributes

Page 14: Exploratory Analysis Dimensionality Reduction


Why Dimensionality Reduction?

Data Visualization
    Projecting high-dimensional data onto a 2D/3D screen space
    Preserving topological relationships
    E.g. visualize semantically related textual documents

Data Compression
    Reducing storage requirements
    Reducing complexity
    E.g. stopword removal

Feature ranking and selection
    Identifying informative bits of information
    Noise reduction
    E.g. identify words correlated with document topics

Page 15: Exploratory Analysis Dimensionality Reduction


Flavors of Dimensionality Reduction

Feature Extraction - Create a lower-dimensional representation of x ∈ RD by combining the existing features with a given function f : RD → RD′

x = [x1, x2, . . . , xD]T   −− y = f(x) −−→   y = [y1, y2, . . . , yD′]T

Feature Selection - Choose a D′-dimensional subset of all the features (possibly the most informative)

x = [x1, x2, . . . , xD]T   −− select i1, . . . , iD′ −−→   y = [xi1, xi2, . . . , xiD′]T
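To make the two flavours concrete, here is a minimal numpy sketch (the matrix X, the map W, and the selected indices are all made up for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 100))      # D = 5 features, N = 100 samples stored as columns

# Feature extraction: combine all D features through a linear map y = f(x) = W^T x
W = rng.normal(size=(5, 2))        # D' = 2 new features (arbitrary coefficients here)
Y_extracted = W.T @ X              # shape (2, 100)

# Feature selection: keep a D'-dimensional subset of the original coordinates
selected = [0, 3]                  # chosen indices i_1, ..., i_D'
Y_selected = X[selected, :]        # shape (2, 100)

print(Y_extracted.shape, Y_selected.shape)
```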

Page 16: Exploratory Analysis Dimensionality Reduction


A Unique Formalization

Definition (Dimensionality Reduction)

Given an input feature space x ∈ RD, find a mapping f : RD → RD′ such that D′ < D and y = f(x) preserves most of the informative content in x.

Often the mapping f(x) is chosen as a linear function y = Wx
    y is a linear projection of x
    W ∈ RD′ × RD is the matrix of linear coefficients

| y1  |     | w11   w12   . . .  w1D  |  | x1 |
| y2  |  =  | w21   w22   . . .  w2D  |  | x2 |
| ... |     | . . . . . . . . .  . . .|  | ...|
| yD′ |     | wD′1  wD′2  . . .  wD′D |  | xD |

Page 17: Exploratory Analysis Dimensionality Reduction


Unsupervised vs. Supervised Dimensionality Reduction

The linear/nonlinear map y = f(x) is learned from the data based on an error function that we seek to minimize

Signal representation (Unsupervised)
    The goal is to represent the samples accurately in a lower-dimensional space
    Principal Component Analysis (PCA)

Classification (Supervised)
    The goal is to enhance the class-discriminatory information in the lower-dimensional space
    Linear Discriminant Analysis (LDA)

Page 18: Exploratory Analysis Dimensionality Reduction


Feature Extraction

Objective - Create a lower-dimensional representation of x ∈ RD by combining the existing features with a given function f : RD → RD′, while preserving as much information as possible

x = [x1, x2, . . . , xD]T   −− y = f(x) −−→   y = [y1, y2, . . . , yD′]T

where D′ ≪ D and, for visualization, D′ = 2 or D′ = 3.

Page 19: Exploratory Analysis Dimensionality Reduction


Linear Feature Extraction

Signal Representation (Unsupervised)
    Independent Component Analysis (ICA)
    Principal Component Analysis (PCA)
    Non-negative Matrix Factorization (NMF)

Classification (Supervised)
    Linear Discriminant Analysis (LDA)
    Canonical Correlation Analysis (CCA)
    Partial Least Squares (PLS)

We focus on unsupervised approaches exploiting linear mapping functions

Page 20: Exploratory Analysis Dimensionality Reduction


Linear Methods Setup

Given N samples xn ∈ RD, define the input data as the matrix

X = [ x1 | x2 | . . . | xN ] =
| x11   . . .  xN1 |
| . . . . . .  . . |    ∈ RD × RN     (each column is one sample)
| x1D   . . .  xND |

Choose D′ ≪ D projection directions wk

W = [ w1 | w2 | . . . | wD′ ] =
| w11   . . .  wD′1 |
| . . . . . .  . .  |    ∈ RD × RD′    (each column is one projection direction)
| w1D   . . .  wD′D |

Compute the projection of x along each direction wk as

y = [y1, . . . , yD′]T = WT x

Linear methods only differ in the criteria used for choosing W
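A minimal numpy sketch of this setup, with made-up data and two arbitrary orthonormal directions (it also mirrors the 3D-to-2D picture on the next slide):

```python
import numpy as np

rng = np.random.default_rng(1)
N, D, D_prime = 200, 3, 2

X = rng.normal(size=(D, N))                        # samples as columns, X in R^(D x N)

# Two orthonormal projection directions as the columns of W in R^(D x D')
W, _ = np.linalg.qr(rng.normal(size=(D, D_prime)))

Y = W.T @ X                                        # y_n = W^T x_n for every sample
print(Y.shape)                                     # (2, 200)
```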

Page 21: Exploratory Analysis Dimensionality Reduction


Linear Projection - A Graphical Interpretation

3D samples projected onto a hyperplane generated by 2 projection directions

2D projection of the input samples on the hyperplane

Page 22: Exploratory Analysis Dimensionality Reduction


Principal Component Analysis (PCA)

Orthogonal linear projection of high-dimensional data onto a low-dimensional subspace, preserving as much variance information as possible

Objectives
1. Minimize the projection error, i.e. the error ‖xn − x̃n‖ of the reconstructed samples
2. Maximize the variance of the projected data Y

The good news is that both objectives are equivalent!

Page 23: Exploratory Analysis Dimensionality Reduction


PCA - Two Operations

Encode - Project data onto the principal components

y = WT x,    with k-th component yk = wkT x

Decode - Reconstruct the projected data

x̃ = Wy = Σk=1..D′ yk wk
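A minimal sketch of the two operations, assuming an already centred D x N data matrix X and a matrix W whose D′ columns are principal components (both placeholders here):

```python
import numpy as np

def pca_encode(W: np.ndarray, X: np.ndarray) -> np.ndarray:
    """Project the (centred) columns of X onto the components: Y = W^T X."""
    return W.T @ X

def pca_decode(W: np.ndarray, Y: np.ndarray) -> np.ndarray:
    """Reconstruct the samples from their codes: X_tilde = W Y = sum_k y_k w_k."""
    return W @ Y
```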

Page 24: Exploratory Analysis Dimensionality Reduction


PCA - Variance Maximization

Given N samples {x1, . . . , xN} with xn ∈ RD

Goal - Project the data into a D′ < D dimensional space such that the variance of the projected data is maximized

For simplicity consider D′ = 1
    A single projection direction w1
    Assume a normalized vector, ‖w1‖2 = 1
        An orthonormal basis, as in numerical analysis
        Serves to select a single solution among the infinitely many possible w

Page 25: Exploratory Analysis Dimensionality Reduction


Variance Maximization - Input Space

Compute the mean of the input data {x1, . . . , xN}

x̄ = (1/N) Σn=1..N xn

Compute the covariance of the input data

S = (1/N) Σn=1..N (xn − x̄)(xn − x̄)T

Page 26: Exploratory Analysis Dimensionality Reduction


Variance Maximization - Projected Data

The mean of the projected data is w1T x̄

The variance of the projected data is

(1/N) Σn=1..N { w1T xn − w1T x̄ }²  =  (1/N) Σn=1..N { w1T (xn − x̄) }²
                                   =  (1/N) Σn=1..N w1T (xn − x̄)(xn − x̄)T w1
                                   =  w1T S w1
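A quick numerical check of this identity, with toy data and an arbitrary unit-norm direction (a sketch, not part of the derivation):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(4, 500))                    # D = 4, N = 500, samples as columns

x_bar = X.mean(axis=1, keepdims=True)            # empirical mean
S = (X - x_bar) @ (X - x_bar).T / X.shape[1]     # covariance S (1/N convention)

w1 = rng.normal(size=(4, 1))
w1 /= np.linalg.norm(w1)                         # normalization constraint on w1

proj = (w1.T @ X).ravel()                        # projected data w1^T x_n
print(np.isclose(proj.var(), (w1.T @ S @ w1).item()))   # True: variance equals w1^T S w1
```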

Page 27: Exploratory Analysis Dimensionality Reduction


PCA - Variance Maximization Problem

Goal - Maximize the variance of the projected data

L = maxw { wT S w }    subject to the normalization constraint ‖w‖2 = 1

How? Don't panic! No theoretical explanation here; for that you will need to take the Machine Learning course

Basically, it is an optimization problem that is solved by differentiating L to find its maximum

Page 28: Exploratory Analysis Dimensionality Reduction


PCA - Variance Maximization Solution

For D′ = 1 the solution is the first principal component w = u1 such that

S u1 = λ1 u1

where
    λ1 ∈ R is the first eigenvalue of S (i.e. the largest)
    u1 ∈ RD is the associated first eigenvector
    λ1 is the variance of the projected data, i.e. λ1 = u1T S u1

Maximize the variance ⇒ choose the eigenvector u with the largest associated eigenvalue λ
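A minimal numpy sketch of this solution on toy 2-D data (np.linalg.eigh is used here because S is symmetric; the data itself is made up):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.multivariate_normal([0, 0], [[3.0, 1.5], [1.5, 1.0]], size=1000).T   # D x N

x_bar = X.mean(axis=1, keepdims=True)
S = (X - x_bar) @ (X - x_bar).T / X.shape[1]

eigvals, eigvecs = np.linalg.eigh(S)        # eigenvalues in ascending order
u1, lam1 = eigvecs[:, -1], eigvals[-1]      # first principal component and its eigenvalue

proj_var = ((u1 @ (X - x_bar)) ** 2).mean()
print(np.isclose(proj_var, lam1))           # True: lambda_1 is the variance of the projections
```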

Page 29: Exploratory Analysis Dimensionality Reduction


PCA - More Principal Components

What if I want more than one projection direction (D′ > 1)? Choose each new direction wk as one that
    Maximizes the variance of the projected data
    Is subject to the normalization constraint ‖wk‖2 = 1
    Is orthogonal to those already selected, i.e. w1, . . . , wk−1

The solution is in the eigenvectors of the input covariance S
    The covariance S of a D-dimensional input space has D eigenvectors
    The eigenvector u1 of the largest eigenvalue λ1 is the first principal component
    The eigenvector u2 of the second-largest eigenvalue λ2 is the second principal component
    The eigenvector u3 of the third-largest eigenvalue λ3 is the third principal component
    . . .

Page 30: Exploratory Analysis Dimensionality Reduction


PCA Solution - Eigenvalue Decomposition

The PCA solution reduces to finding the eigenvalue decomposition of the covariance matrix of the input data

S = UΛUT

where
    U = [u1 | . . . | uD] is the D × D matrix of eigenvectors uk
    Λ is the D × D diagonal matrix whose k-th diagonal element is the eigenvalue λk

A D′ < D dimensional projection space is created by choosing the D′ eigenvectors {u1, . . . , uD′} corresponding to the D′ largest eigenvalues {λ1, . . . , λD′}

Page 31: Exploratory Analysis Dimensionality Reduction


Practical PCA (I)

Step 1 Organize Data - Put your N samples into a D × N matrix X
Step 2 Compute Means - Calculate the empirical mean of your data

x̄ = (1/N) Σn=1..N xn

Step 3 Preprocess Data - Subtract the mean x̄ from each input sample

X̄ = X − x̄

Page 32: Exploratory Analysis Dimensionality Reduction


Practical PCA (Ia)

Input data | Compute means | Rescale data

Page 33: Exploratory Analysis Dimensionality Reduction


Practical PCA (II)

Step 4 Compute Covariance - Calculate the covariance of the centred data

S = (1/N) X̄ X̄T

Step 5 Eigenvalue Decomposition - Compute the eigenvalue decomposition of the covariance

S = UΛUT

where Λ = diag(λ1, . . . , λD) and λ1 ≥ λ2 ≥ · · · ≥ λD

Eigenvalue decomposition can be obtained using standard linear algebra or numerical routines (e.g. Singular Value Decomposition (SVD))

Page 34: Exploratory Analysis Dimensionality Reduction


Practical PCA (III)

Step 6 Model Selection - Select D′ < D projection directions, associated with the first D′ eigenvalues, so as to maximize the amount of variance retained in the projection

W = UD′ = [ u1 | . . . | uD′ ] ∈ RD × RD′

Step 7 Encoding - Transform the centred data X̄ by projecting it onto the D′ principal components

Y = WT X̄

where Y ∈ RD′ × RN is a compressed representation of the input data
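Putting Steps 1-7 together, a minimal numpy sketch of the whole pipeline (the data is random and the function name is illustrative):

```python
import numpy as np

def pca_fit(X: np.ndarray, d_prime: int):
    """Steps 2-6: return the mean x_bar and the D x D' matrix W of top components."""
    x_bar = X.mean(axis=1, keepdims=True)             # Step 2: empirical mean
    Xc = X - x_bar                                    # Step 3: centre the data
    S = Xc @ Xc.T / X.shape[1]                        # Step 4: covariance (1/N)
    eigvals, eigvecs = np.linalg.eigh(S)              # Step 5: eigendecomposition
    order = np.argsort(eigvals)[::-1]                 # sort eigenvalues, largest first
    W = eigvecs[:, order[:d_prime]]                   # Step 6: keep the D' top eigenvectors
    return x_bar, W

rng = np.random.default_rng(4)
X = rng.normal(size=(10, 300))                        # Step 1: D x N data matrix

x_bar, W = pca_fit(X, d_prime=2)
Y = W.T @ (X - x_bar)                                 # Step 7: encode, Y is D' x N
print(Y.shape)                                        # (2, 300)
```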

Page 35: Exploratory Analysis Dimensionality Reduction


Practical PCA (IIIa)

Principal components | Data projected onto the principal components' plane

Page 36: Exploratory Analysis Dimensionality Reduction


Projecting New Data

Given N′ new input samples in X′ ∈ RD × RN′, they can be projected into the reduced space by
1. Subtracting the means

X̄′ = X′ − x̄

2. Projecting onto the known principal components

Y′ = WT X̄′
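Continuing the sketch above, new samples are mapped with the mean and the components estimated on the training data (all arrays here are placeholders):

```python
import numpy as np

rng = np.random.default_rng(4)
X_train = rng.normal(size=(10, 300))

# Fit on the training data, as in the pipeline sketch above
x_bar = X_train.mean(axis=1, keepdims=True)
S = (X_train - x_bar) @ (X_train - x_bar).T / X_train.shape[1]
eigvals, eigvecs = np.linalg.eigh(S)
W = eigvecs[:, np.argsort(eigvals)[::-1][:2]]

# Project previously unseen samples with the stored mean and components
X_new = rng.normal(size=(10, 5))            # 5 hypothetical new samples, same D = 10
Y_new = W.T @ (X_new - x_bar)               # 1) subtract the training mean, 2) project
print(Y_new.shape)                          # (2, 5)
```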

Page 37: Exploratory Analysis Dimensionality Reduction


Application Example - Eigenfaces (I)

Each sample xn ∈ RD is a face picture with D pixels

The value of the d-th feature xn(d) is the intensity level of the corresponding pixel

M. Turk and A. Pentland (1991) Face recognition using eigenfaces, Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 586-591

Page 38: Exploratory Analysis Dimensionality Reduction


Application Example - Eigenfaces (II)

What is a principal component? Clearly, an eigenface uk

Eigenvectors can be shown as images depicting primitive facial features

Page 39: Exploratory Analysis Dimensionality Reduction


Application Example - Eigenfaces (III)

We can easily visualize the reconstruction of an image projected onto its eigenfaces
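As a sketch of the idea (not the original Turk & Pentland code), eigenfaces are simply the principal components of flattened face images; FACES below is a random placeholder for a real N x (h*w) array of grayscale pictures:

```python
import numpy as np

h, w, N = 32, 32, 100
FACES = np.random.default_rng(5).random((N, h * w))   # placeholder for real face images

X = FACES.T                                           # D x N, with D = h * w pixels
x_bar = X.mean(axis=1, keepdims=True)
S = (X - x_bar) @ (X - x_bar).T / N
eigvals, eigvecs = np.linalg.eigh(S)

eigenfaces = eigvecs[:, np.argsort(eigvals)[::-1][:16]]   # top 16 components
# Each column, reshaped to h x w, can be displayed as an "eigenface" image; a face x
# is reconstructed as x_bar + eigenfaces @ (eigenfaces.T @ (x - x_bar)).
print(eigenfaces.T.reshape(16, h, w).shape)               # (16, 32, 32)
```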

Page 40: Exploratory Analysis Dimensionality Reduction


How Many Principal Components?

Eigenvalues measure the fraction of variance captured by the projection

They can be used to define a distortion measure
    Suppose we have selected K < D principal components
    The resulting distortion is

J = Σk=K+1..D λk

that is, the proportion of variance neglected by the projection
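A minimal sketch of this selection rule, assuming the eigenvalues are already sorted in descending order and using a hypothetical 95% retained-variance target:

```python
import numpy as np

eigvals = np.array([4.0, 2.5, 1.0, 0.3, 0.2])      # example eigenvalues, descending

retained = np.cumsum(eigvals) / eigvals.sum()      # fraction of variance kept for each K
K = int(np.searchsorted(retained, 0.95) + 1)       # smallest K reaching the target
distortion = eigvals[K:].sum()                     # J = sum of the neglected eigenvalues

print(K, retained[K - 1], distortion)              # 4 0.975 0.2
```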

Page 41: Exploratory Analysis Dimensionality Reduction


Is Variance That Informative?

Page 42: Exploratory Analysis Dimensionality Reduction


Linear Discriminant Analysis

Adding supervised class information into the projection function

Linear Discriminant Analysis (LDA)
    Performs dimensionality reduction while preserving as much of the class-discriminatory information as possible
    Maximum separation between the means of the projected classes
    Minimum variance within each projected class
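For completeness, a minimal sketch of a supervised projection using scikit-learn's LinearDiscriminantAnalysis; the two-class Gaussian data is synthetic, and note that scikit-learn stores samples as rows (N x D), unlike the D x N convention of these slides:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 3)),     # class 0
               rng.normal(3.0, 1.0, size=(100, 3))])    # class 1, shifted mean
y = np.array([0] * 100 + [1] * 100)

lda = LinearDiscriminantAnalysis(n_components=1)        # at most (n_classes - 1) components
Z = lda.fit_transform(X, y)                             # projection separating the classes
print(Z.shape)                                          # (200, 1)
```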

Page 43: Exploratory Analysis Dimensionality Reduction


Nonlinear Projections

To solve this problem you either unfold the roll (manifold approaches) or change the data representation (kernel methods)

Page 44: Exploratory Analysis Dimensionality Reduction


Nonlinear Feature Extraction

Signal Representation (Unsupervised)
    Manifold learning algorithms: e.g. ISOMAP
    Kernel Principal Component Analysis (KPCA)

Classification (Supervised)
    Kernel Discriminant Analysis (KDA)
    Kernel Canonical Correlation Analysis (KCCA)

Kernels allow using a linear model for a nonlinear problem
    A kernel induces a new space, by means of a non-linear mapping, where the original linear operations can be performed
    E.g. KPCA performs a linear PCA in the space created by the kernel rather than in the original data space
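A minimal sketch of this with scikit-learn's KernelPCA on a curved toy dataset (the make_moons data and the RBF kernel with gamma=15 are assumptions made for the example):

```python
from sklearn.datasets import make_moons
from sklearn.decomposition import PCA, KernelPCA

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)    # a curved 2-D dataset

Z_linear = PCA(n_components=2).fit_transform(X)                  # plain linear PCA
Z_kernel = KernelPCA(n_components=2, kernel="rbf", gamma=15).fit_transform(X)

# Z_kernel is a linear PCA carried out in the kernel-induced feature space;
# scatter-plotting it typically shows the two moons much better separated.
print(Z_linear.shape, Z_kernel.shape)                            # (300, 2) (300, 2)
```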

Page 45: Exploratory Analysis Dimensionality Reduction


Take-home Messages

Exploratory data analysis
    Find new patterns in data
    Formulate new hypotheses

Two key concepts
    Curse of dimensionality - Intractable problems
    Intrinsic dimension - Data lies in a lower-dimensional space

Dimensionality Reduction
    Feature Extraction - Create new features by combining input data
    Feature Selection - Extract a subset of informative input dimensions

Linear feature extraction ⇒ y = Wx
    Models differ in the criteria used to choose W

Page 46: Exploratory Analysis Dimensionality Reduction


Wrapping up PCA...

PCA is a linear transformation
    Defined by the matrix of eigenvectors W of the data covariance S
    Preserves as much variance as possible, measured by the eigenvalues Λ

A general linear transformation produces a rotation, translation and scaling of the space
    PCA rotates the data so that it is maximally decorrelated (orthonormal principal components)

PCA is linear
    It cannot fit curved surfaces well ⇒ nonlinear models

PCA does not account for class information ⇒ supervised models (LDA)