pcamgormley/courses/10601/slides/lecture26-pca.pdf · principal component analysis (pca) in case...
TRANSCRIPT
![Page 1: PCAmgormley/courses/10601/slides/lecture26-pca.pdf · Principal Component Analysis (PCA) In case where data lies on or near a low d-dimensional linear subspace, axes of this subspace](https://reader030.vdocuments.net/reader030/viewer/2022040615/5f0e848e7e708231d43fa126/html5/thumbnails/1.jpg)
PCA
1
10-601 Introduction to Machine Learning
Matt GormleyLecture 26
April 20, 2020
Machine Learning DepartmentSchool of Computer ScienceCarnegie Mellon University
![Page 2: PCAmgormley/courses/10601/slides/lecture26-pca.pdf · Principal Component Analysis (PCA) In case where data lies on or near a low d-dimensional linear subspace, axes of this subspace](https://reader030.vdocuments.net/reader030/viewer/2022040615/5f0e848e7e708231d43fa126/html5/thumbnails/2.jpg)
Reminders• Homework 8: Reinforcement Learning– Out: Fri, Apr 10– Due: Wed, Apr 22 at 11:59pm
• Homework 9: Learning Paradigms– Out: Wed, Apr. 22– Due: Wed, Apr. 29 at 11:59pm– Can only be submitted up to 3 days late,
so we can return grades before final exam
• Today’s In-Class Poll– http://poll.mlcourse.org
4
![Page 3: PCAmgormley/courses/10601/slides/lecture26-pca.pdf · Principal Component Analysis (PCA) In case where data lies on or near a low d-dimensional linear subspace, axes of this subspace](https://reader030.vdocuments.net/reader030/viewer/2022040615/5f0e848e7e708231d43fa126/html5/thumbnails/3.jpg)
ML Big Picture
5
Learning Paradigms:What data is available and when? What form of prediction?• supervised learning
• unsupervised learning• semi-supervised learning
• reinforcement learning• active learning• imitation learning
• domain adaptation• online learning
• density estimation• recommender systems• feature learning
• manifold learning• dimensionality reduction
• ensemble learning• distant supervision• hyperparameter optimization
Problem Formulation:What is the structure of our output prediction?boolean Binary Classification
categorical Multiclass Classification
ordinal Ordinal Classification
real Regression
ordering Ranking
multiple discrete Structured Prediction
multiple continuous (e.g. dynamical systems)
both discrete &
cont.
(e.g. mixed graphical models)
Theoretical Foundations:What principles guide learning?q probabilistic
q information theoretic
q evolutionary search
q ML as optimization
Facets of Building ML Systems:How to build systems that are robust, efficient, adaptive, effective?1. Data prep 2. Model selection3. Training (optimization /
search)4. Hyperparameter tuning on
validation data5. (Blind) Assessment on test
data
Big Ideas in ML:Which are the ideas driving development of the field?• inductive bias• generalization / overfitting• bias-variance decomposition• generative vs. discriminative• deep nets, graphical models• PAC learning• distant rewards
App
licat
ion
Are
asKe
y ch
alle
nges
?N
LP
, Sp
ee
ch, C
om
pu
ter
Vis
ion
, Ro
bo
tics
, Me
dic
ine
, S
ea
rch
![Page 4: PCAmgormley/courses/10601/slides/lecture26-pca.pdf · Principal Component Analysis (PCA) In case where data lies on or near a low d-dimensional linear subspace, axes of this subspace](https://reader030.vdocuments.net/reader030/viewer/2022040615/5f0e848e7e708231d43fa126/html5/thumbnails/4.jpg)
Learning Paradigms
6
![Page 5: PCAmgormley/courses/10601/slides/lecture26-pca.pdf · Principal Component Analysis (PCA) In case where data lies on or near a low d-dimensional linear subspace, axes of this subspace](https://reader030.vdocuments.net/reader030/viewer/2022040615/5f0e848e7e708231d43fa126/html5/thumbnails/5.jpg)
DIMENSIONALITY REDUCTION
7
![Page 6: PCAmgormley/courses/10601/slides/lecture26-pca.pdf · Principal Component Analysis (PCA) In case where data lies on or near a low d-dimensional linear subspace, axes of this subspace](https://reader030.vdocuments.net/reader030/viewer/2022040615/5f0e848e7e708231d43fa126/html5/thumbnails/6.jpg)
PCA Outline• Dimensionality Reduction
– High-dimensional data– Learning (low dimensional) representations
• Principal Component Analysis (PCA)– Examples: 2D and 3D– Data for PCA– PCA Definition– Objective functions for PCA– PCA, Eigenvectors, and Eigenvalues– Algorithms for finding Eigenvectors /
Eigenvalues• PCA Examples
– Face Recognition– Image Compression
8
![Page 7: PCAmgormley/courses/10601/slides/lecture26-pca.pdf · Principal Component Analysis (PCA) In case where data lies on or near a low d-dimensional linear subspace, axes of this subspace](https://reader030.vdocuments.net/reader030/viewer/2022040615/5f0e848e7e708231d43fa126/html5/thumbnails/7.jpg)
High Dimension Data
Examples of high dimensional data:– High resolution images (millions of pixels)
9
![Page 8: PCAmgormley/courses/10601/slides/lecture26-pca.pdf · Principal Component Analysis (PCA) In case where data lies on or near a low d-dimensional linear subspace, axes of this subspace](https://reader030.vdocuments.net/reader030/viewer/2022040615/5f0e848e7e708231d43fa126/html5/thumbnails/8.jpg)
High Dimension Data
Examples of high dimensional data:– Multilingual News Stories
(vocabulary of hundreds of thousands of words)
10
![Page 9: PCAmgormley/courses/10601/slides/lecture26-pca.pdf · Principal Component Analysis (PCA) In case where data lies on or near a low d-dimensional linear subspace, axes of this subspace](https://reader030.vdocuments.net/reader030/viewer/2022040615/5f0e848e7e708231d43fa126/html5/thumbnails/9.jpg)
High Dimension Data
Examples of high dimensional data:– Brain Imaging Data (100s of MBs per scan)
11Image from https://pixabay.com/en/brain-mrt-magnetic-resonance-imaging-1728449/
Image from (Wehbe et al., 2014)
![Page 10: PCAmgormley/courses/10601/slides/lecture26-pca.pdf · Principal Component Analysis (PCA) In case where data lies on or near a low d-dimensional linear subspace, axes of this subspace](https://reader030.vdocuments.net/reader030/viewer/2022040615/5f0e848e7e708231d43fa126/html5/thumbnails/10.jpg)
High Dimension Data
Examples of high dimensional data:– Customer Purchase Data
12
![Page 11: PCAmgormley/courses/10601/slides/lecture26-pca.pdf · Principal Component Analysis (PCA) In case where data lies on or near a low d-dimensional linear subspace, axes of this subspace](https://reader030.vdocuments.net/reader030/viewer/2022040615/5f0e848e7e708231d43fa126/html5/thumbnails/11.jpg)
PCA, Kernel PCA, ICA: Powerful unsupervised learning techniques for extracting hidden (potentially lower dimensional) structure from high dimensional datasets.
Learning Representations
Useful for:
• Visualization
• Further processing by machine learning algorithms
• More efficient use of resources (e.g., time, memory, communication)
• Statistical: fewer dimensions à better generalization
• Noise removal (improving data quality)
Slide from Nina Balcan
![Page 12: PCAmgormley/courses/10601/slides/lecture26-pca.pdf · Principal Component Analysis (PCA) In case where data lies on or near a low d-dimensional linear subspace, axes of this subspace](https://reader030.vdocuments.net/reader030/viewer/2022040615/5f0e848e7e708231d43fa126/html5/thumbnails/12.jpg)
Shortcut Example
17
https://www.youtube.com/watch?v=MlJN9pEfPfE
Photo from https://www.springcarnival.org/booth.shtml
![Page 13: PCAmgormley/courses/10601/slides/lecture26-pca.pdf · Principal Component Analysis (PCA) In case where data lies on or near a low d-dimensional linear subspace, axes of this subspace](https://reader030.vdocuments.net/reader030/viewer/2022040615/5f0e848e7e708231d43fa126/html5/thumbnails/13.jpg)
PRINCIPAL COMPONENT ANALYSIS (PCA)
18
![Page 14: PCAmgormley/courses/10601/slides/lecture26-pca.pdf · Principal Component Analysis (PCA) In case where data lies on or near a low d-dimensional linear subspace, axes of this subspace](https://reader030.vdocuments.net/reader030/viewer/2022040615/5f0e848e7e708231d43fa126/html5/thumbnails/14.jpg)
PCA Outline• Dimensionality Reduction– High-dimensional data– Learning (low dimensional) representations
• Principal Component Analysis (PCA)– Examples: 2D and 3D– Data for PCA– PCA Definition– Objective functions for PCA– PCA, Eigenvectors, and Eigenvalues– Algorithms for finding Eigenvectors / Eigenvalues
• PCA Examples– Face Recognition– Image Compression
19
![Page 15: PCAmgormley/courses/10601/slides/lecture26-pca.pdf · Principal Component Analysis (PCA) In case where data lies on or near a low d-dimensional linear subspace, axes of this subspace](https://reader030.vdocuments.net/reader030/viewer/2022040615/5f0e848e7e708231d43fa126/html5/thumbnails/15.jpg)
Principal Component Analysis (PCA)
In case where data lies on or near a low d-dimensional linear subspace, axes of this subspace are an effective representation of the data.
Identifying the axes is known as Principal Components Analysis, and can be obtained by using classic matrix computation tools (Eigen or Singular Value Decomposition).
Slide from Nina Balcan
![Page 16: PCAmgormley/courses/10601/slides/lecture26-pca.pdf · Principal Component Analysis (PCA) In case where data lies on or near a low d-dimensional linear subspace, axes of this subspace](https://reader030.vdocuments.net/reader030/viewer/2022040615/5f0e848e7e708231d43fa126/html5/thumbnails/16.jpg)
2D Gaussian dataset
Slide from Barnabas Poczos
![Page 17: PCAmgormley/courses/10601/slides/lecture26-pca.pdf · Principal Component Analysis (PCA) In case where data lies on or near a low d-dimensional linear subspace, axes of this subspace](https://reader030.vdocuments.net/reader030/viewer/2022040615/5f0e848e7e708231d43fa126/html5/thumbnails/17.jpg)
1st PCA axis
Slide from Barnabas Poczos
![Page 18: PCAmgormley/courses/10601/slides/lecture26-pca.pdf · Principal Component Analysis (PCA) In case where data lies on or near a low d-dimensional linear subspace, axes of this subspace](https://reader030.vdocuments.net/reader030/viewer/2022040615/5f0e848e7e708231d43fa126/html5/thumbnails/18.jpg)
2nd PCA axis
Slide from Barnabas Poczos
![Page 19: PCAmgormley/courses/10601/slides/lecture26-pca.pdf · Principal Component Analysis (PCA) In case where data lies on or near a low d-dimensional linear subspace, axes of this subspace](https://reader030.vdocuments.net/reader030/viewer/2022040615/5f0e848e7e708231d43fa126/html5/thumbnails/19.jpg)
Data for PCA
We assume the data is centered
24
=
�
����
( (1))T
( (2))T
...( (N))T
�
����D = { (i)}Ni=1
µ =1
N
N�
i=1
(i) = 0
Q: What if your data is
not centered?
A: Subtract off the
sample mean
![Page 20: PCAmgormley/courses/10601/slides/lecture26-pca.pdf · Principal Component Analysis (PCA) In case where data lies on or near a low d-dimensional linear subspace, axes of this subspace](https://reader030.vdocuments.net/reader030/viewer/2022040615/5f0e848e7e708231d43fa126/html5/thumbnails/20.jpg)
Sample Covariance Matrix
The sample covariance matrix is given by:
25
�jk =1
N
N�
i=1
(x(i)j � µj)(x
(i)k � µk)
Since the data matrix is centered, we rewrite as:
� =1
NT
=
�
����
( (1))T
( (2))T
...( (N))T
�
����
![Page 21: PCAmgormley/courses/10601/slides/lecture26-pca.pdf · Principal Component Analysis (PCA) In case where data lies on or near a low d-dimensional linear subspace, axes of this subspace](https://reader030.vdocuments.net/reader030/viewer/2022040615/5f0e848e7e708231d43fa126/html5/thumbnails/21.jpg)
Principal Component Analysis (PCA)
Whiteboard– Strawman: random linear projection– PCA Definition– Objective functions for PCA
26
![Page 22: PCAmgormley/courses/10601/slides/lecture26-pca.pdf · Principal Component Analysis (PCA) In case where data lies on or near a low d-dimensional linear subspace, axes of this subspace](https://reader030.vdocuments.net/reader030/viewer/2022040615/5f0e848e7e708231d43fa126/html5/thumbnails/22.jpg)
Maximizing the VarianceQuiz: Consider the two projections below
1. Which maximizes the variance?2. Which minimizes the reconstruction error?
27
Option A Option B
![Page 23: PCAmgormley/courses/10601/slides/lecture26-pca.pdf · Principal Component Analysis (PCA) In case where data lies on or near a low d-dimensional linear subspace, axes of this subspace](https://reader030.vdocuments.net/reader030/viewer/2022040615/5f0e848e7e708231d43fa126/html5/thumbnails/23.jpg)
Background: Eigenvectors & Eigenvalues
For a square matrix A (n x n matrix), the vector v (n x 1 matrix) is an eigenvectoriff there exists eigenvalue λ (scalar) such that:
Av = λv
28
Av = λv
v
The linear transformation A is only stretching vector v.
That is, λv is a scalar multiple of v.
![Page 24: PCAmgormley/courses/10601/slides/lecture26-pca.pdf · Principal Component Analysis (PCA) In case where data lies on or near a low d-dimensional linear subspace, axes of this subspace](https://reader030.vdocuments.net/reader030/viewer/2022040615/5f0e848e7e708231d43fa126/html5/thumbnails/24.jpg)
Principal Component Analysis (PCA)
Whiteboard– PCA, Eigenvectors, and Eigenvalues
29
![Page 25: PCAmgormley/courses/10601/slides/lecture26-pca.pdf · Principal Component Analysis (PCA) In case where data lies on or near a low d-dimensional linear subspace, axes of this subspace](https://reader030.vdocuments.net/reader030/viewer/2022040615/5f0e848e7e708231d43fa126/html5/thumbnails/25.jpg)
PCA
30
Equivalence of Maximizing Variance and Minimizing Reconstruction Error
![Page 26: PCAmgormley/courses/10601/slides/lecture26-pca.pdf · Principal Component Analysis (PCA) In case where data lies on or near a low d-dimensional linear subspace, axes of this subspace](https://reader030.vdocuments.net/reader030/viewer/2022040615/5f0e848e7e708231d43fa126/html5/thumbnails/26.jpg)
PCA: the First Principal Component
31
![Page 27: PCAmgormley/courses/10601/slides/lecture26-pca.pdf · Principal Component Analysis (PCA) In case where data lies on or near a low d-dimensional linear subspace, axes of this subspace](https://reader030.vdocuments.net/reader030/viewer/2022040615/5f0e848e7e708231d43fa126/html5/thumbnails/27.jpg)
Algorithms for PCA
How do we find principal components (i.e. eigenvectors)?• Power iteration (aka. Von Mises iteration)– finds each principal component one at a time in order
• Singular Value Decomposition (SVD)– finds all the principal components at once– two options:
• Option A: run SVD on XTX• Option B: run SVD on X
(not obvious why Option B should work…)
• Stochastic Methods (approximate)– very efficient for high dimensional datasets with lots of
points
32
![Page 28: PCAmgormley/courses/10601/slides/lecture26-pca.pdf · Principal Component Analysis (PCA) In case where data lies on or near a low d-dimensional linear subspace, axes of this subspace](https://reader030.vdocuments.net/reader030/viewer/2022040615/5f0e848e7e708231d43fa126/html5/thumbnails/28.jpg)
Background: SVD
Singular Value Decomposition (SVD)
33
![Page 29: PCAmgormley/courses/10601/slides/lecture26-pca.pdf · Principal Component Analysis (PCA) In case where data lies on or near a low d-dimensional linear subspace, axes of this subspace](https://reader030.vdocuments.net/reader030/viewer/2022040615/5f0e848e7e708231d43fa126/html5/thumbnails/29.jpg)
SVD for PCA
34
![Page 30: PCAmgormley/courses/10601/slides/lecture26-pca.pdf · Principal Component Analysis (PCA) In case where data lies on or near a low d-dimensional linear subspace, axes of this subspace](https://reader030.vdocuments.net/reader030/viewer/2022040615/5f0e848e7e708231d43fa126/html5/thumbnails/30.jpg)
Principal Component Analysis (PCA)X X" v = λv , so v (the first PC) is the eigenvector of
sample correlation/covariance matrix & &'
Sample variance of projection v'& &'v = (v'v = (Thus, the eigenvalue ( denotes the amount of variability captured along that dimension (aka amount of energy along that dimension).
Eigenvalues () ≥ (+ ≥ (, ≥ ⋯
• The 1st PC .) is the the eigenvector of the sample covariance matrix & &'associated with the largest eigenvalue
• The 2nd PC .+ is the the eigenvector of the sample covariance matrix & &' associated with the second largest eigenvalue
• And so on …
Slide from Nina Balcan
![Page 31: PCAmgormley/courses/10601/slides/lecture26-pca.pdf · Principal Component Analysis (PCA) In case where data lies on or near a low d-dimensional linear subspace, axes of this subspace](https://reader030.vdocuments.net/reader030/viewer/2022040615/5f0e848e7e708231d43fa126/html5/thumbnails/31.jpg)
• For M original dimensions, sample covariance matrix is MxM, and has up to M eigenvectors. So M PCs.
• Where does dimensionality reduction come from?Can ignore the components of lesser significance.
0
5
10
15
20
25
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10
Varia
nce
(%)
How Many PCs?
© Eric Xing @ CMU, 2006-2011 36
• You do lose some information, but if the eigenvalues are small, you don’t lose much– M dimensions in original data – calculate M eigenvectors and eigenvalues– choose only the first D eigenvectors, based on their eigenvalues– final data set has only D dimensions
Variance (%) = ratio of variance along given principal component to total
variance of all principal components
![Page 32: PCAmgormley/courses/10601/slides/lecture26-pca.pdf · Principal Component Analysis (PCA) In case where data lies on or near a low d-dimensional linear subspace, axes of this subspace](https://reader030.vdocuments.net/reader030/viewer/2022040615/5f0e848e7e708231d43fa126/html5/thumbnails/32.jpg)
PCA EXAMPLES
37
![Page 33: PCAmgormley/courses/10601/slides/lecture26-pca.pdf · Principal Component Analysis (PCA) In case where data lies on or near a low d-dimensional linear subspace, axes of this subspace](https://reader030.vdocuments.net/reader030/viewer/2022040615/5f0e848e7e708231d43fa126/html5/thumbnails/33.jpg)
Projecting MNIST digits
38
Task Setting:1. Take 25x25 images of digits and project them down to K components2. Report percent of variance explained for K components3. Then project back up to 25x25 image to visualize how much information was preserved
![Page 34: PCAmgormley/courses/10601/slides/lecture26-pca.pdf · Principal Component Analysis (PCA) In case where data lies on or near a low d-dimensional linear subspace, axes of this subspace](https://reader030.vdocuments.net/reader030/viewer/2022040615/5f0e848e7e708231d43fa126/html5/thumbnails/34.jpg)
Projecting MNIST digits
39
Task Setting:1. Take 25x25 images of digits and project them down to 2 components2. Plot the 2 dimensional points3. Here we look at all ten digits 0 - 9
![Page 35: PCAmgormley/courses/10601/slides/lecture26-pca.pdf · Principal Component Analysis (PCA) In case where data lies on or near a low d-dimensional linear subspace, axes of this subspace](https://reader030.vdocuments.net/reader030/viewer/2022040615/5f0e848e7e708231d43fa126/html5/thumbnails/35.jpg)
Projecting MNIST digits
40
Task Setting:1. Take 25x25 images of digits and project them down to 2 components2. Plot the 2 dimensional points3. Here we look at just four digits 0, 1, 2, 3
![Page 36: PCAmgormley/courses/10601/slides/lecture26-pca.pdf · Principal Component Analysis (PCA) In case where data lies on or near a low d-dimensional linear subspace, axes of this subspace](https://reader030.vdocuments.net/reader030/viewer/2022040615/5f0e848e7e708231d43fa126/html5/thumbnails/36.jpg)
Learning ObjectivesDimensionality Reduction / PCA
You should be able to…1. Define the sample mean, sample variance, and sample
covariance of a vector-valued dataset2. Identify examples of high dimensional data and common use
cases for dimensionality reduction3. Draw the principal components of a given toy dataset4. Establish the equivalence of minimization of reconstruction
error with maximization of variance5. Given a set of principal components, project from high to low
dimensional space and do the reverse to produce a reconstruction
6. Explain the connection between PCA, eigenvectors, eigenvalues, and covariance matrix
7. Use common methods in linear algebra to obtain the principal components
41