Dimension Reduction
Pamela K. Douglas
NITP Summer 2013
©2013 Pamela Douglas, UCLA NITP
Overview
! What is dimension reduction
! Motivation for performing reduction on your data
! Intuitive Description of Common Methods
! Applications in Neuroimaging
www.brainmapping.org ©2013 Pamela Douglas, UCLA NITP
Data Dimensions
! The number of dimensions in a given data set corresponds to the number of variables that are measured on each observation.
! In machine learning (ML) applications, the terms “feature” and/or “attribute” are often used to refer to the dimensions being classified
Examples
! fMRI voxels; EEG channels

Goal
! Given d-dimensional data, (x1, . . . , xd)^T, find a lower-dimensional representation s = (s1, . . . , sk) with k ≤ d that describes the original data
! Ideally this occurs with minimal information loss according to some criterion
Is Dimension Reduction Necessary?
! “In an application, whether it is classification or regression, observation data that we believe contain information are taken as inputs and fed to the system. Ideally, we should not need feature selection as a separate process; the classifier (or regressor) should be able to use whichever features are necessary, discarding the irrelevant. However, there are several reasons why we are interested in reducing dimensionality as a separate step.”
– Alpaydin
Sparse Representation Motivation
! Reduce complexity, which depends on the number of input dimensions, d, as well as N, the number of data exemplars or sample size
! Diminish computation (important beyond speed!)
! Simpler models tend to depend less on noise
! Potential for increased knowledge extraction/interpretability
Alpaydin, Machine Learning 2004
Douglas et al. 2012; Colby et al. 2012
Graph Theory Metrics for ASD classification prior to and post feature selection
ADHD 200 Initiative
! Public release of (n=973) subjects including structural, resting-state fMRI, and demographic information from ADHD subtypes
! Our team (3rd place) and others had 1,000s of neuroimaging features
! Winning team used only demographic features!
! Brief survey of Kaggle Big Data ML competitions – winning teams use feature selection
Judged on 200 held out samples
Overview
! What is dimension reduction
! Motivation for performing reduction on your data
! Intuitive Description of Common Methods
! Applications in Neuroimaging
Methods for Reducing Dimensions
! Two common approaches

Feature Selection / Feature Extraction (tend to be supervised)
- Extracting data from regions of interest (ROIs)
- A priori knowledge for ROI or EEG channel selection
- ML feature selection methods

Projection/Clustering Methods (tend to be unsupervised)
- NNMF & topic modeling
- SVD
- PCA
- ICA
Region of Interest Feature Extraction
! Reducing dimensions to selected ROIs or paths may be useful when the number of features is very large
! Applied commonly in functional connectivity analysis (e.g. rs-fcMRI)
! Historically, data are warped to a common canonical atlas and time courses are extracted from each ROI (Pro: easy to implement; see the sketch below)
! Example – Dosenbach et al. (2010) found that weakening of short-range connections was predictive of brain maturation
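A minimal numpy sketch (not from the talk) of ROI feature extraction: average the voxel time courses within each atlas label, then compute the pairwise ROI correlations commonly used as rs-fcMRI features. The `roi_time_courses` helper, the array shapes, and the random data are hypothetical stand-ins for preprocessed fMRI and an atlas volume.

```python
import numpy as np

def roi_time_courses(fmri_4d, atlas_3d):
    """Average the fMRI signal within each atlas label at every time point.

    fmri_4d  : ndarray, shape (X, Y, Z, T)
    atlas_3d : integer ndarray, shape (X, Y, Z); 0 = background
    Returns an (n_rois, T) array of mean ROI time courses.
    """
    labels = np.unique(atlas_3d)
    labels = labels[labels != 0]                      # drop background
    return np.array([fmri_4d[atlas_3d == lab].mean(axis=0) for lab in labels])

# Hypothetical example: random data standing in for preprocessed rs-fMRI
rng = np.random.default_rng(0)
fmri = rng.standard_normal((10, 10, 10, 120))         # (X, Y, Z, T)
atlas = rng.integers(0, 5, size=(10, 10, 10))         # 4 ROIs + background

ts = roi_time_courses(fmri, atlas)                    # (4, 120)
connectivity = np.corrcoef(ts)                        # Corr(ROI_i, ROI_j) feature matrix
print(connectivity.shape)                             # (4, 4)
```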
Rs-fcMRI Caution
! One should use caution: movement, slice timing, and data quality should be checked and corrected or “scrubbed”
! Power et al. (2012) found that long-range connections were specifically decreased with subject motion
! Alternative data-driven methods for defining ROIs may prove more successful
Pereira et al. 2013 PRNI
Feature Selection
! Reduces Complexity & diminishes risk of overfitting (e.g. Kohavi & John 1997)
Methods include (a code sketch follows below):
! Filtering – variable ranking, almost like a preprocessing step (e.g. t-test thresholding on training data)
! Wrapper Methods – “wrap” the induction algorithm within a nested cross-validation to assess the importance of a variable or subset of variables by removing it (e.g. SVM-RFE) (Maldonado & Weber 2009; Das et al. 2001; Guyon & Elisseeff 2003)
! Embedded Methods – usually use a two-part objective function, assessing goodness-of-fit with a penalty term for a large number of parameters
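The sketch below illustrates the three flavors of feature selection named above on synthetic data, using scikit-learn as an assumed toolbox (the talk does not prescribe an implementation): a univariate filter, SVM-RFE as a wrapper, and an l1-penalized model as an embedded method, each scored with cross-validation so selection happens inside the training folds.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif, RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Synthetic stand-in for a subjects-by-features matrix (e.g. voxelwise stats)
X, y = make_classification(n_samples=100, n_features=500, n_informative=10,
                           random_state=0)

# Filter: univariate ranking (ANOVA F-test, akin to t-test thresholding),
# placed in a pipeline so the ranking is fit on training folds only.
filter_clf = make_pipeline(SelectKBest(f_classif, k=50), LinearSVC(max_iter=5000))
print("filter CV accuracy:", cross_val_score(filter_clf, X, y, cv=5).mean())

# Wrapper: recursive feature elimination with a linear SVM (SVM-RFE).
rfe_clf = make_pipeline(RFE(LinearSVC(max_iter=5000), n_features_to_select=50, step=0.1),
                        LinearSVC(max_iter=5000))
print("SVM-RFE CV accuracy:", cross_val_score(rfe_clf, X, y, cv=5).mean())

# Embedded: an l1-penalized model selects features as part of fitting.
l1_clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
print("l1 CV accuracy:", cross_val_score(l1_clf, X, y, cv=5).mean())
```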
! Corr(ROI_i, ROI_j)
Douglas et al. 2010 OHBM
Parsimonious Dynamic Systems Models (state space models similar to DCM)
! Using physiologic knowledge to constrain a problem in state space modeling approaches is also useful
Example – Classic Linear State Space Model
! Classic linear model with two state variables (V1 and FFA), input u(t), and measured output Y(t):

$$\begin{bmatrix} \frac{dx_1}{dt} \\ \vdots \\ \frac{dx_n}{dt} \end{bmatrix} = \begin{bmatrix} a_{11} & \cdots & a_{1n} \\ \vdots & \ddots & \vdots \\ a_{n1} & \cdots & a_{nn} \end{bmatrix} \begin{bmatrix} x_1 \\ \vdots \\ x_n \end{bmatrix} + \begin{bmatrix} b_{11} & \cdots & b_{1n} \\ \vdots & \ddots & \vdots \\ b_{n1} & \cdots & b_{nn} \end{bmatrix} \begin{bmatrix} u_1 \\ \vdots \\ u_n \end{bmatrix}$$

(Diagram: two compartments, V1 and FFA, coupled by rates k_{v,f} and k_{f,v}, each with a leak term k_{0,v}, k_{0,f}; u(t) drives the system and Y(t) is measured)
! Analytically, the model is underdetermined or structurally unidentifiable
! Remove either leak term, and the model becomes uniquely identifiable
Inspired by work from Joe D. DiStefano III
! Adding nonlinearity to the model (like a bilinear term) can also make it identifiable
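A small, hypothetical numpy simulation of the two-state linear model above using forward-Euler integration; the coupling and leak values are made up for illustration and are not estimates from data.

```python
import numpy as np

# Two-state linear model dx/dt = A x + B u, with states x = [V1, FFA].
# Off-diagonal terms of A play the role of the coupling rates k_{f,v}, k_{v,f};
# the diagonal terms include the leak terms k_{0,v}, k_{0,f}. All values are
# illustrative only.
A = np.array([[-1.0, 0.3],
              [0.5, -0.8]])
B = np.array([[1.0, 0.0],
              [0.0, 0.0]])          # stimulus u drives V1 only

dt, T = 0.01, 1000
x = np.zeros(2)
u = np.zeros(2)
trajectory = np.zeros((T, 2))

for t in range(T):
    u[0] = 1.0 if (t * dt) % 2.0 < 1.0 else 0.0   # boxcar input
    x = x + dt * (A @ x + B @ u)                  # forward-Euler step
    trajectory[t] = x

print(trajectory[-1])   # final state of [V1, FFA]
```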
Sparsity & DCM for ML
! Depending on the goal, one may wish to use more or fewer variables for analysis
! Part I: Goal is to classify moderately aphasic patients from controls accurately
! Restricted analysis to ROIs selected from GLM analysis
! Tested a number of feature selection methods
! Accuracy result: DCM features > PCA > t-test > searchlight > all voxels in ROIs
! Part II: Investigate which features were informative
! Used an l1 approximation to the l0-norm regularizer to impose sparsity
! Using only 9 (highlighted below) out of 22 original features yielded the same balanced accuracy
Projection Methods
! We are interested in finding a mapping from original d-dimensional input space to a new (k<d) dimensional space with minimum loss of information
! Common linear projection methods include:
! Principal Component Analysis
! Independent Component Analysis
! Other close relatives: Multidimensional Scaling, Factor Analysis
A bit more intuition (matrix factorization & eigenvalues)
! Matrix factorization can be useful for a number of reasons
! A full-rank symmetric n × n matrix can be factored as

$$A = PDP^{-1} \quad \text{with } D \text{ diagonal, and } P^{-1} = P^T \text{ if } P \text{ has orthonormal columns}$$

$$D = \begin{bmatrix} \lambda_1 & & 0 \\ & \ddots & \\ 0 & & \lambda_n \end{bmatrix} \quad \text{where} \quad \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n \ge 0$$

! But most matrices that we work with (e.g. the A in Ax = b) are not square and symmetric
! Use the Singular Value Decomposition instead: for A of size n × m with rank r,

$$A = U\Sigma V^T \quad \text{where} \quad \Sigma = \begin{bmatrix} D & 0 \\ 0 & 0 \end{bmatrix}, \quad D \text{ is } r \times r$$

(the remaining rows and columns of Σ are zero)
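A quick numpy check of the two factorizations on made-up matrices (my addition): `eigh` recovers A = PDP^T for a symmetric matrix, while `svd` handles the rectangular case.

```python
import numpy as np

rng = np.random.default_rng(1)

# Symmetric matrix: eigendecomposition A = P D P^T with orthonormal P
M = rng.standard_normal((4, 4))
A_sym = M + M.T
eigvals, P = np.linalg.eigh(A_sym)                      # ascending eigenvalues
D = np.diag(eigvals)
print(np.allclose(A_sym, P @ D @ P.T))                  # True

# Rectangular matrix: use the SVD instead, A = U Sigma V^T
A = rng.standard_normal((2, 3))
U, s, Vt = np.linalg.svd(A)                              # s holds singular values
Sigma = np.zeros_like(A)
Sigma[:len(s), :len(s)] = np.diag(s)
print(np.allclose(A, U @ Sigma @ Vt))                    # True
```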
Singular Value Decomposition (a simple example)
! The matrix A is a linear transformation that maps the unit sphere {x: ||x|| = 1} onto an ellipse
! Example:

$$Ax = b, \qquad A = \begin{bmatrix} 4 & 11 & 14 \\ 8 & 7 & -2 \end{bmatrix}$$

! Calculate the eigenvalues and eigenvectors of $A^TA$:

$$\lambda_1 = 360, \quad \lambda_2 = 90, \quad \lambda_3 = 0$$

$$v_1 = \begin{bmatrix} 1/3 \\ 2/3 \\ 2/3 \end{bmatrix}, \quad v_2 = \begin{bmatrix} -2/3 \\ -1/3 \\ 2/3 \end{bmatrix}, \quad v_3 = \begin{bmatrix} 2/3 \\ -2/3 \\ 1/3 \end{bmatrix}$$

! The singular values are the square roots of the nonzero eigenvalues:

$$D = \begin{bmatrix} 6\sqrt{10} & 0 \\ 0 & 3\sqrt{10} \end{bmatrix}, \qquad \Sigma = \begin{bmatrix} 6\sqrt{10} & 0 & 0 \\ 0 & 3\sqrt{10} & 0 \end{bmatrix}$$

! Apply SVD from here: $A = U\Sigma V^T$

$$A = \begin{bmatrix} 3/\sqrt{10} & 1/\sqrt{10} \\ 1/\sqrt{10} & -3/\sqrt{10} \end{bmatrix} \begin{bmatrix} 6\sqrt{10} & 0 & 0 \\ 0 & 3\sqrt{10} & 0 \end{bmatrix} \begin{bmatrix} 1/3 & 2/3 & 2/3 \\ -2/3 & -1/3 & 2/3 \\ 2/3 & -2/3 & 1/3 \end{bmatrix}$$
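The worked example can be checked numerically; a short numpy sketch (my addition, not part of the slides):

```python
import numpy as np

# The 2x3 matrix from the worked example
A = np.array([[4.0, 11.0, 14.0],
              [8.0, 7.0, -2.0]])

# Eigen-decomposition of A^T A reproduces lambda = 360, 90, 0
eigvals = np.sort(np.linalg.eigvalsh(A.T @ A))[::-1]
print(eigvals)                         # approx [360.  90.   0.]

# The singular values are their square roots: 6*sqrt(10) and 3*sqrt(10)
U, s, Vt = np.linalg.svd(A)
print(s)                               # approx [18.97  9.49]

# Reconstruct A = U Sigma V^T
Sigma = np.zeros_like(A)
Sigma[:2, :2] = np.diag(s)
print(np.allclose(A, U @ Sigma @ Vt))  # True
```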
Principal Component Analysis (if calculated by hand)
! Matrix of Observations, X
! Example: heights and weights observed for N subjects

$$X = \begin{bmatrix} w_1 & w_2 & \cdots & w_N \\ h_1 & h_2 & \cdots & h_N \end{bmatrix} = \begin{bmatrix} X_1 & \cdots & X_N \end{bmatrix}$$

! First center/demean the data: $\hat{X}_k = X_k - \mu$

! Compute the covariance matrix

$$S = \frac{1}{N-1} B B^T \quad \text{where} \quad B = \begin{bmatrix} \hat{X}_1 & \cdots & \hat{X}_N \end{bmatrix}$$

! Diagonal entries $s_{jj}$ are variances ($\sigma^2$); off-diagonal entries $s_{i \neq j}$ are covariances; the total variance is $\sum \sigma^2 = \mathrm{tr}(S)$

! Effectively perform SVD (an eigendecomposition of S) from here:

$$D = \begin{bmatrix} \lambda_1 & & 0 \\ & \ddots & \\ 0 & & \lambda_n \end{bmatrix} \quad \text{where} \quad \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n \ge 0$$
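A hand-rolled PCA sketch following the steps above, on hypothetical height/weight-style data (center, form S = BB^T/(N−1), eigendecompose, sort):

```python
import numpy as np

# Hypothetical heights/weights-style data: 2 variables, N = 200 observations
rng = np.random.default_rng(2)
X = rng.multivariate_normal(mean=[170.0, 70.0],
                            cov=[[90.0, 40.0], [40.0, 30.0]],
                            size=200).T                # shape (2, N)

# 1) Center/demean each variable
B = X - X.mean(axis=1, keepdims=True)

# 2) Covariance matrix S = (1 / (N-1)) * B B^T
N = X.shape[1]
S = (B @ B.T) / (N - 1)

# 3) Eigendecomposition of S; sort eigenvalues in decreasing order
eigvals, eigvecs = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

print("fraction of variance explained:", eigvals / eigvals.sum())
scores = eigvecs.T @ B            # principal component scores, shape (2, N)
```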
Principal Component Analysis (in practice)
! In practice, iterative calculation of SVD is typically faster and more accurate than an eigenvalue decomposition of S.
! Typically proceeds as follows:

$$w^* = \underset{w:\,\|w\|=1}{\arg\max}\; w^T \Sigma w$$

! Subsequent components must be orthogonal
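In code, one would usually let a library do the SVD-based computation; a hedged sketch using scikit-learn's PCA on random stand-in data (the shapes are arbitrary):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
X = rng.standard_normal((100, 5000))        # e.g. 100 scans x 5000 voxel features

pca = PCA(n_components=10)                  # keep the first 10 components
scores = pca.fit_transform(X)               # projected data, shape (100, 10)

print(scores.shape)
print(pca.explained_variance_ratio_[:3])    # fraction of variance per component
```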
Interpreting PCA & ICA Results
! Why is having this decomposition useful?
! PCA is useful when most of the variation or dynamic range of the data can be explained via a linear combination of only a few of the new orthogonal variables
! The fraction of the total variance “captured” or explained by a certain variable is

$$\frac{\lambda_i}{\mathrm{tr}(D)}, \qquad D = \begin{bmatrix} \lambda_1 & & 0 \\ & \ddots & \\ 0 & & \lambda_n \end{bmatrix} \quad \text{where} \quad \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n \ge 0$$
! FSL MELODIC Output
The Scree Graph (Interpreting our PCA Results)
! Variance explained by a given eigenvalue, λ_i / tr(D), is generally monotonically decreasing and can be visualized with a scree graph
! Different methods exist for finding cut-off points (e.g. Levenberg-Marquardt)
(Figure: fraction of variance explained per component)
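A minimal matplotlib sketch of a scree graph, plotting the fraction of variance explained per component for random stand-in data (cut-off selection itself is not implemented here):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)
X = rng.standard_normal((100, 500))

pca = PCA(n_components=20).fit(X)

# Scree graph: fraction of variance explained (lambda_i / tr(D)) per component
plt.plot(np.arange(1, 21), pca.explained_variance_ratio_, "o-")
plt.xlabel("Component")
plt.ylabel("Fraction of variance explained")
plt.title("Scree graph")
plt.show()
```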
Independent Component Analysis
! Similar to PCA, but relaxes the orthogonality constraint
! Goal is to find a weight matrix W ≈ A^(-1) and components, s, of the data x
! Depending on the algorithm, weights may be initialized via PCA or probabilistic PCA
! Weights are learned according to a cost function
! Two common algorithms used in neuroimaging are:
! Infomax (Bell & Sejnowski 1995) – mutual information
! FastICA (Hyvarinen 1998) – negentropy as cost function (see the sketch below)
! Many others, including new work from Tülay Adali
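A small FastICA sketch using scikit-learn on a toy mixture (my illustration, not the talk's pipeline): two non-Gaussian sources are mixed into three channels and then unmixed; W ≈ A^(-1) is recovered up to sign, scale, and ordering.

```python
import numpy as np
from sklearn.decomposition import FastICA

# Hypothetical mixing of two non-Gaussian sources into three observed channels
rng = np.random.default_rng(5)
t = np.linspace(0, 8, 2000)
S_true = np.c_[np.sin(3 * t), np.sign(np.sin(5 * t))]       # sources, shape (2000, 2)
A_mix = np.array([[1.0, 0.5], [0.5, 2.0], [1.5, 1.0]])      # mixing matrix, shape (3, 2)
X = S_true @ A_mix.T                                         # observations, shape (2000, 3)

# FastICA estimates the unmixing weights W and the components s of the data x
ica = FastICA(n_components=2, random_state=0)
S_est = ica.fit_transform(X)          # estimated sources (up to sign/scale/order)
W = ica.components_                   # unmixing matrix
print(S_est.shape, W.shape)           # (2000, 2) (2, 3)
```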
Overview
! What is dimension reduction
! Motivation
! Intuitive Description of Common Methods
! Applications in Neuroimaging
ICA & Dual Regression
! ICA-based noise removal – Dual Regression Approach (a schematic sketch follows below)
! Group ICA – Identify RSN of interest
! DMN Subnetwork Identification
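A schematic numpy sketch of the dual-regression idea (this is not FSL's dual_regression tool, and the usual demeaning/normalization steps are omitted): stage 1 regresses the group IC spatial maps against a subject's data to obtain subject time courses; stage 2 regresses those time courses against the data to obtain subject-specific spatial maps. All sizes and variable names are hypothetical.

```python
import numpy as np

def dual_regression(subject_data, group_maps):
    """Schematic two-stage dual regression.

    subject_data : (T, V) array, one subject's time x voxel data
    group_maps   : (K, V) array, group-ICA spatial maps
    Returns (T, K) subject time courses and (K, V) subject-specific maps.
    """
    # Stage 1: regress group spatial maps onto the data -> subject time courses
    timecourses, *_ = np.linalg.lstsq(group_maps.T, subject_data.T, rcond=None)
    timecourses = timecourses.T                      # (T, K)

    # Stage 2: regress the time courses onto the data -> subject spatial maps
    subject_maps, *_ = np.linalg.lstsq(timecourses, subject_data, rcond=None)
    return timecourses, subject_maps                 # (T, K), (K, V)

# Hypothetical sizes: 150 time points, 3000 voxels, 10 group components
rng = np.random.default_rng(6)
data = rng.standard_normal((150, 3000))
maps = rng.standard_normal((10, 3000))
tcs, smaps = dual_regression(data, maps)
print(tcs.shape, smaps.shape)                        # (150, 10) (10, 3000)
```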
Clewef Notes on Dual Regression
Independent Components as Classifier Dimensions
Hypothesis: Brain cognitive states can be described as the superposition of current activity in IC basis images
Decision Tree & Interpreting Classifier Output
(Figure: decision tree over IC spatial masks – ICs 5, 13, 15, and 19 – separating Belief from Disbelief; columns: B > DB, common to B, DB, common to DB, IC spatial mask)
Douglas et al. (2010) NeuroImage; Douglas et al. (2013) Frontiers (in press)
Areas that were uniquely identified by ICs were the amygdala, right medial frontal gyrus, and cingulate cortex.
Newer Applications of ICA
! Multimodal Fusion
! Joint ICA
! Linked ICA
Thanks!