MTTS1 Dimensionality Reduction and Visualization Spring 2014 Jaakko Peltonen Lecture 7: Nonlinear dimensionality reduction, part 2


Page 1

MTTS1 Dimensionality Reduction and Visualization

Spring 2014 Jaakko Peltonen

Lecture 7: Nonlinear dimensionality reduction, part 2

Page 2

Brief Recap

Page 3

• Additional reading on PCA: any book on matrix algebra

• Additional reading on MDS:

- Borg, Groenen, Modern multidimensional scaling: theory and applications. Springer, 1997.

- Buja, Swayne, Littman, Dean, XGvis: interactive data visualization with multidimensional scaling, 1998. (XGVis and GGobi are open source visualization tools that include MDS; MDS is also available, e.g., in SAS toolkit and GNU R [e.g., cmdscale and isoMDS in package MASS])

• NIPS 2005 tutorial by Saul, Spectral methods for dimensionality reduction, http://www.nips.cc/Conferences/2005/Tutorials/

• Jarkko Venna 2007, Academic Dissertation

• Lee & Verleysen, 2007. Nonlinear dimensionality reduction. Springer.

• Contents (Today):

1. Dimension reduction methods: overview
2. Principal component analysis (PCA)
3. Multidimensional scaling (MDS)
4. Self-organizing Maps
5. Laplacian eigenmap + Curvilinear component analysis (CCA)

Literature on dimension reduction

Page 4

Dimension reduction methods

• What to do if... I have to analyze multidimensional data???

• Solution: Dimensionality Reduction from 17 dimensions to 1D, 2D or 3D

• Problem: How to project the data points faithfully into a low-dimensional space (1D–3D)?

[Figure: Original 3D data (left) and its 2D projection by Curvilinear Component Analysis (right)]

Dimension reduction

Page 5

Nonlinear data

Properties of PCA

• Strengths

–Eigenvector method

–No tuning parameters

–Non-iterative

–No local optima

• Weaknesses

–Limited to second order statistics

–Limited to linear projections

The first principal component is given by the red line. The green line on the right gives the “correct” non-linear dimension (which PCA is of course unable to find).
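As a quick recap of the eigenvector view above, here is a minimal Octave/Matlab sketch of PCA (illustrative data and variable names; it keeps the two leading components):

% PCA as a non-iterative eigenvector method (minimal sketch)
X  = randn(200, 5) * randn(5, 5);              % illustrative correlated 5-D data
Xc = X - repmat(mean(X), size(X, 1), 1);       % center the data
C  = cov(Xc);                                  % covariance matrix (second-order statistics only)
[V, D] = eig(C);                               % eigenvectors / eigenvalues
[evals, order] = sort(diag(D), 'descend');     % sort components by explained variance
W  = V(:, order(1:2));                         % two leading principal directions
Y  = Xc * W;                                   % linear 2-D projection of the data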

PCA

Page 6

Dimension reduction methods

• There are several methods, with different optimization goals and complexities

• We will go through some of them (most not in any detail, last ones not at all):

• Principal component analysis (PCA) - “simple” linear method that tries to preserve global distances

• Multidimensional scaling (MDS) - tries to preserve global distances

• Sammon’s projection - a variation of the MDS, pays more attention to short distances

• Isometric mapping of data manifolds (ISOMAP) - a graph-based method (of the MDS spirit)

• Curvilinear component analysis (CCA) - MDS-like method that tries to preserve distances in small neighborhoods

• Maximum variance unfolding - maximizes variance with the constraint that the short distances are preserved (an exercise in semidefinite programming)

• Self-organizing map (SOM) - a flexible and scalable method that fits a surface passing through the data points (originally developed at HUT)

• Independent component analysis (ICA) - a fast linear method, suitable for some applications

• NeRV (Neighbor Retrieval Visualizer, originally developed at HUT)


Dimension reduction

Page 7

Two measures of faithfulness - precision and recall

Page 8

Faithfully?

• Good Precision: Points that are close in the “reduced” space are close in the original space

• Good Recall: Points that are close in the original space are close in the “reduced” space

[Figure: mapping from the original space R^d to the display space R^2]

Page 9

Faithfully?

• Good Precision: Points that are close in the “reduced” space are close in the original space

• Good Recall: Points that are close in the original space are close in the “reduced” space


Page 10

Faithfully?

• Good Precision: Points that are close in the “reduced” space are close in the original space

• Good Recall: Points that are close in the original space are close in the “reduced” space

• In general, impossible to get both

Page 11

May be impossible to get both good recall and good precision

from Venna, Peltonen, Nybo, Aidos, and Kaski, “Information retrieval perspective to nonlinear dimensionality reduction for data visualization”, Journal of Machine Learning Research, 2010

Page 12

May be impossible to get both good recall and good precision

“Squashed flat sphere surface” yields good recall but bad precision

from Venna, Peltonen, Nybo, Aidos, and Kaski, “Information retrieval perspective to nonlinear dimensionality reduction for data visualization”, Journal of Machine Learning Research, 2010

Page 13

May be impossible to get both good recall and good precision

“Cut-open sphere surface” yields good precision but less good recall

from Venna, Peltonen, Nybo, Aidos, and Kaski, “Information retrieval perspective to nonlinear dimensionality reduction for data visualization”, Journal of Machine Learning Research, 2010

Page 14

Performance of MDS

• MDS tries to preserve the large distances at the expense of small ones; hence, it can “collapse” some small distances while preserving the large ones

• A projection is trustworthy (precision) if k closest neighbors of a sample on the projection are also close by in the original space. A projection preserves the original neighborhoods (recall) if all k closest neighbors of a sample in the original space are also close by in the projection.
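A minimal Octave/Matlab sketch of these two measures as plain overlaps of nearest-neighbour sets (a simplified proxy: the M1/M2 curves in the figures below additionally weight errors by rank; the function and variable names are illustrative):

% k-NN precision and recall of an embedding (simplified set-overlap version)
% X: N x D original data, Y: N x d embedding,
% k: neighbourhood size in the embedding (retrieved), r: in the original space (relevant)
function [precision, recall] = knn_precision_recall(X, Y, k, r)
  N  = size(X, 1);
  DX = bsxfun(@plus, sum(X.^2, 2), sum(X.^2, 2)') - 2 * (X * X');   % squared distances, original space
  DY = bsxfun(@plus, sum(Y.^2, 2), sum(Y.^2, 2)') - 2 * (Y * Y');   % squared distances, embedding
  DX(1:N+1:end) = Inf;  DY(1:N+1:end) = Inf;                        % a point is not its own neighbour
  precision = 0;  recall = 0;
  for i = 1:N
    [d1, ox] = sort(DX(i, :));  relevant  = ox(1:r);    % r nearest neighbours in the original space
    [d2, oy] = sort(DY(i, :));  retrieved = oy(1:k);    % k nearest neighbours in the embedding
    common = numel(intersect(relevant, retrieved));
    precision = precision + common / k;   % retrieved points that really are close in the original space
    recall    = recall    + common / r;   % original neighbours that remain close in the embedding
  end
  precision = precision / N;
  recall    = recall / N;
end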


Excerpt from Kaski et al., Trustworthiness and metrics in visualizing similarity of gene expression, BMC Bioinformatics 2003, 4:48 (http://www.biomedcentral.com/1471-2105/4/48):

[...] mapping and non-metric MDS were selected to represent MDS methods since they have beneficial properties; Sammon's mapping emphasizes the preservation of short distances which are the focus of our trustworthiness measure as well. Non-metric MDS tries to preserve rank orders of distances, which is the error measure we use. For hierarchical clustering, there are lots of variants; we compared all variants available in the Cluster program by Eisen [16]: centroid linkage, complete linkage, and single linkage. Complete linkage gave clearly better results than the other variants and is the only one included in the results below.

All methods used the same inner product (correlation) metric, which is the most commonly used metric for gene expression data sets. Additional justification for the choice is that the correlation metric works well for classification of the specific yeast dataset (preliminary studies). It is imperative to use the same metric for all methods to keep the results comparable. In principle, the whole study could be repeated for different metrics. However, it is unlikely that the conclusions would change; in an earlier experiment [17] on Euclidean metrics for non-biological data sets, the conclusions were the same.

Trustworthiness

The results are shown in Figure 2. We focus on trustworthiness of relatively small neighborhoods, of the order of some tens of genes, which are perceived to be most saliently proximate in displays such as Figure 8. In this range, hierarchical clustering is the best for the smallest neighborhoods (k < 10), and SOM after that. The excellent performance of hierarchical clustering at very small neighborhood sizes was to be expected as it explicitly connects the closest points first.

Preservation of the original neighborhoods

As discussed in the Background Section, all methods make a compromise between trustworthiness and preservation of the original proximities. The latter kinds of errors result from discontinuities in the projection; we measured them by how well neighborhoods of data points in the original data space were preserved. Non-parametric measures were again used to avoid biases. The neighborhood of size k of an expression profile is defined as those k profiles that have the smallest distance (here, strongest correlation) from the profile. If a profile becomes projected away from the neighborhood, the error is quantified by rank distances on the display. The measure M2 (Eq. 4; for details, see the Methods Section) summarizes the errors for all expression profiles. For these data sets, the SOM and multidimensional scaling (Sammon and non-metric MDS) are the best for preserving small (k < 50) original neighborhoods (Fig. 3). Hierarchical clustering is by far the worst.

Figure 2. Trustworthiness of the visualized similarities (neighborhoods of k nearest samples). Sammon: Sammon's mapping, NMDS: non-metric multidimensional scaling, SOM: self-organizing map, HC: hierarchical clustering, with the ultrametric distance measure and with the linear distance measure. RP: Random linear projection is the approximate worst possible practical result (the small standard deviation over different projections, approximately 0.01, is not shown). The theoretical worst case, estimated with random neighborhoods, is approximately M1 = 0.5. a) Yeast data. b) Mouse data.

[Figure 2 plots: trustworthiness M1 as a function of the neighborhood size k (5–50) for Sammon, NMDS, SOM, HC (ultrametric and linear order) and RP; panel a) yeast data, panel b) mouse data.]

Precision and recall as a function of the neighborhood size k for a yeast data set. Non-metric (ordinal) MDS (NMDS) is shown in blue. Higher precision and recall are better.


Improving the trustworthiness

Trustworthiness can be improved by discarding the least trustworthy data samples and analyzing them separately. Figure 4 shows the increase of trustworthiness as the number of discarded samples is increased. It is striking that although the performance of most of the other methods increases rapidly, they do not reach even the starting point of the SOM before nearly one third of the data set has been discarded. The ultrametric measure (see the Methods Section) of similarity for hierarchical clustering has the smallest improvement rate.

Visualization of functional similarity by learning metrics

A main problem in comparing gene expression profiles is to choose which properties to compare, that is, how to define the similarity measure or, equivalently, the metric. When comparing knock-out mutation profiles of genes, the relevant mutations need to be selected and scaled suitably for each gene.

There is not enough prior knowledge to do this manually, and our goal is to learn automatically the proper metric from interrelationships between the expression data set and another data set that is known to be relevant to gene function: the functional classification of the genes. In an additional study, the primary data are the gene expression profiles of human genes measured in different tissues, and the auxiliary data used to guide the learning are the activities of the homologous mouse genes in a set of tissues [13].

Details on how to learn the metrics are described in the Methods Section [14,15]. In summary, the metric is such that functional classes change uniformly in the new metric. If some of the knock-out mutations have only a weak correlation with the functional classes, they contribute only weakly in the measured similarity among expression profiles. The similarity measure focuses on those differences that are relevant for the functional classes.

The metric is defined as a local scaling of the expression space, which makes it very general; the contributions of the knock-out profiles to the similarities may be different for different genes.

We applied the new metric to one of the visualization methods, the SOM, and compared the results with the same method in the standard correlation metric. For technical details of combining the SOM and the learning metrics, see the Methods Section.

We began by measuring quantitatively whether SOMs in learning metrics represented the functional classes better than those in the standard inner product metric. In short, [...]

Figure 3. Capability of the visualizations to preserve the similarities (the neighborhoods of size k) of the original data space. Sammon: Sammon's mapping, NMDS: non-metric multidimensional scaling, SOM: self-organizing map, HC: hierarchical clustering, with the ultrametric distance measure and with the linear distance measure. RP: Random linear projection is the approximate worst possible practical result (the small standard deviation over different projections, about 0.01, is not shown). The theoretical worst case, estimated with random neighborhoods, is approximately M2 = 0.5. a) Yeast data. b) Mouse data.

[Figure 3 plots: neighborhood preservation M2 as a function of the neighborhood size k (5–50) for Sammon, NMDS, SOM, HC (ultrametric and linear order) and RP; panel a) yeast data, panel b) mouse data.]

Figures are from Kaski, Nikkilä, Oja, Venna, Törönen, Castrén, Trustworthiness and metrics in visualizing similarity of gene expression, BMC Bioinformatics 2003, 4:48. (Trustworthiness, Figure 2, corresponds to precision; preservation of the original neighborhoods, Figure 3, corresponds to recall.)

MDS

Page 15

More methods: Self-Organizing Map, Laplacian Eigenmap, Curvilinear Component Analysis

Page 16

Self-Organizing Maps

Page 17

Self-Organizing Maps

2013: about 68500 Google Scholar results for “Kohonen”.

Page 18

Self-Organizing Maps

Page 19

Vector Quantization

Page 20

Vector Quantization

Page 21

Vector Quantization

Page 22

Vector Quantization

• minimizes the distortion

E = (1/(N·V)) · Σ_{i=1..N} ‖x_i − v(x_i)‖²

• with N the number of points (samples), V the number of centroids (units), x the samples, v the centroids, and v(x_i) the centroid closest to x_i
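A minimal Octave/Matlab sketch of evaluating this distortion for given centroids (illustrative data and variable names; the normalization constant does not affect which centroids minimize E):

% mean quantization error: each sample is assigned to its nearest centroid v(x_i)
X  = randn(500, 2);                                              % illustrative samples, N x D
p  = randperm(size(X, 1));  C = X(p(1:8), :);                    % 8 centroids picked among the samples
D2 = bsxfun(@plus, sum(X.^2, 2), sum(C.^2, 2)') - 2 * (X * C');  % squared sample-to-centroid distances, N x V
E  = sum(min(D2, [], 2)) / (size(X, 1) * size(C, 1));            % E = (1/(N·V)) · Σ_i ||x_i − v(x_i)||²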

Page 23

Vector Quantization

• One of the many possible algorithms:

- random initialization among the samples

- iterations

v(x(t)) ← v(x(t)) + α(t) · (x(t) − v(x(t)))

with v(x(t)) the closest centroid of x(t), and the learning rate α(t) satisfying the Robbins–Monro conditions

∫₀^∞ α(t) dt = ∞,   ∫₀^∞ α(t)² dt < ∞
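A minimal Octave/Matlab sketch of this update loop (illustrative data; the linearly decaying learning rate is a common practical stand-in for a schedule satisfying the Robbins–Monro conditions above):

% online vector quantization (competitive learning), minimal sketch
X = randn(1000, 2);                                  % illustrative data, N x D
p = randperm(size(X, 1));  C = X(p(1:10), :);        % random initialization among the samples
T = 20000;
for t = 1:T
  alpha = 0.5 * (1 - t / T);                         % decreasing learning rate alpha(t)
  x = X(randi(size(X, 1)), :);                       % draw a sample x(t)
  [d, j] = min(sum(bsxfun(@minus, C, x).^2, 2));     % closest centroid v(x(t))
  C(j, :) = C(j, :) + alpha * (x - C(j, :));         % move it towards the sample
end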

Page 24

Vector Quantization

• Matlab examples

Page 25

Self-Organizing Maps

• I need three volunteers!


Page 26

SOM


Structure of the map (definition)

Map in the data space

Page 27

SOM

Structure of the map (definition)

Map in the data space

[Figure: 3×2 map grid with units (1,1)–(3,2), shown both as the grid structure and as the map in the data space]

Page 28

SOM

Page 29

SOM

• Random initialization or PCA initialization

• Iterations

• Example:
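The update formula itself did not survive on this slide, so as an assumed stand-in here is a minimal Octave/Matlab sketch of the standard sequential SOM iteration (winner and its grid neighbours move towards the sample, weighted by a Gaussian neighbourhood on the map grid; data and parameters are illustrative):

% sequential SOM training, minimal sketch
X = randn(1000, 3);                                   % illustrative data, N x D
[gx, gy] = meshgrid(1:5, 1:5);  G = [gx(:) gy(:)];    % coordinates of a 5 x 5 map grid
p = randperm(size(X, 1));  M = X(p(1:size(G, 1)), :); % prototype vectors, initialized from samples
T = 20000;
for t = 1:T
  alpha = 0.5 * (1 - t / T);                          % learning rate, decreasing
  sigma = 0.5 + 3 * (1 - t / T);                      % neighbourhood radius, shrinking
  x = X(randi(size(X, 1)), :);                        % draw a sample x(t)
  [d, c] = min(sum(bsxfun(@minus, M, x).^2, 2));      % best-matching unit (winner)
  h = exp(-sum(bsxfun(@minus, G, G(c, :)).^2, 2) / (2 * sigma^2));  % neighbourhood on the map grid
  M = M + alpha * bsxfun(@times, h, bsxfun(@minus, x, M));          % move winner and its neighbours
end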

Page 30

• Matlab Examples

SOM

Page 31

• And then? What about visualization?

SOM


Page 32

• And then? What about visualization?

SOM


Page 33

• And then? What about visualization?

SOM


Page 34

• And then? What about visualization?

SOM


Page 35

SOM

• Wealth Data Example: Excel + Matlab

Page 36

SOM

[Figure: SOM of the wealth data; the map grid is labelled with the countries in the data set, from Albania to Zimbabwe]

Page 37

SOM

From Juha Vesanto’s doctoral thesis, 2002

Values of the prototype vectors shown on the SOM grid.

Left: one of the feature values. Right: many of the feature values.

Page 38

Laplacian eigenmap

• Laplacian eigenmap is a spectral method, like PCA.

• construct k-nearest neighbors graph. Assign Wij=1, if i and j are neighbors, otherwise assign Wij=0. Define diagonal matrix D, Dii=∑jWij, and graph Laplacian, L=D-W.

• The embedding of data points is given by the eigenvectors of L, corresponding to the d smallest non-zero eigenvalues.
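A minimal Octave/Matlab sketch of this recipe (assumes the k-NN graph is symmetrized and connected; note that the printouts on the next slides show the sign-flipped Laplacian W − D, whose eigenvalues are the negatives of the ones used here):

% Laplacian eigenmap, minimal sketch: X is N x D, k neighbours, d output dimensions
function Y = laplacian_eigenmap(X, k, d)
  N  = size(X, 1);
  D2 = bsxfun(@plus, sum(X.^2, 2), sum(X.^2, 2)') - 2 * (X * X');  % squared distances
  D2(1:N+1:end) = -Inf;                       % make sure each point is its own nearest "neighbour"
  [dists, order] = sort(D2, 2);
  W = zeros(N);
  for i = 1:N
    W(i, order(i, 2:k+1)) = 1;                % connect i to its k nearest neighbours
  end
  W = max(W, W');                             % symmetrize: neighbours in either direction
  L = diag(sum(W, 2)) - W;                    % graph Laplacian L = D - W
  [V, E] = eig(L);                            % L is symmetric, eigenvalues are real and >= 0
  [evals, idx] = sort(diag(E), 'ascend');
  Y = V(:, idx(2:d+1));                       % skip the constant eigenvector (eigenvalue 0)
end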


Manifolds

Page 39

Laplacian eigenmap


octave:20> L
L =

  -2   1   0   0   0   1
   1  -2   1   0   0   0
   0   1  -2   1   0   0
   0   0   1  -2   1   0
   0   0   0   1  -2   1
   1   0   0   0   1  -2

octave:21> [V,D] = eig(L)
V =

  -0.408248   0.458032  -0.351483   0.551189  -0.171826   0.408248
   0.408248  -0.533409  -0.220926   0.424400   0.391430   0.408248
  -0.408248   0.075377   0.572409  -0.126788   0.563257   0.408248
   0.408248   0.458032  -0.351483  -0.551189   0.171826   0.408248
  -0.408248  -0.533409  -0.220926  -0.424400  -0.391430   0.408248
   0.408248   0.075377   0.572409   0.126788  -0.563257   0.408248

D =

  -4.00000   0.00000   0.00000   0.00000   0.00000   0.00000
   0.00000  -3.00000   0.00000   0.00000   0.00000   0.00000
   0.00000   0.00000  -3.00000   0.00000   0.00000   0.00000
   0.00000   0.00000   0.00000  -1.00000   0.00000   0.00000
   0.00000   0.00000   0.00000   0.00000  -1.00000   0.00000
   0.00000   0.00000   0.00000   0.00000   0.00000   0.00000

N=6, k=2

[Figure: ring graph over the 6 data points]

Manifolds

Page 40

Laplacian eigenmap

>> L
L =

  -1   1   0   0   0   0
   1  -2   1   0   0   0
   0   1  -2   1   0   0
   0   0   1  -2   1   0
   0   0   0   1  -2   1
   0   0   0   0   1  -1

>> [V,D] = eig(L)
V =

   0.1494   0.2887   0.4082  -0.5000   0.5577   0.4082
  -0.4082  -0.5774  -0.4082   0.0000   0.4082   0.4082
   0.5577   0.2887  -0.4082   0.5000   0.1494   0.4082
  -0.5577   0.2887   0.4082   0.5000  -0.1494   0.4082
   0.4082  -0.5774   0.4082   0.0000  -0.4082   0.4082
  -0.1494   0.2887  -0.4082  -0.5000  -0.5577   0.4082

D =

  -3.7321        0        0        0        0        0
        0  -3.0000        0        0        0        0
        0        0  -2.0000        0        0        0
        0        0        0  -1.0000        0        0
        0        0        0        0  -0.2679        0
        0        0        0        0        0  -0.0000

[Figure: the 6 data points connected as a chain]

Manifolds

Page 41


Laplacian eigenmap


>> L
L =

  -3   1   1   0   0   0   0   0   1
   1  -2   1   0   0   0   0   0   0
   1   1  -3   1   0   0   0   0   0
   0   0   1  -3   1   1   0   0   0
   0   0   0   1  -2   1   0   0   0
   0   0   0   1   1  -3   1   0   0
   0   0   0   0   0   1  -3   1   1
   0   0   0   0   0   0   1  -2   1
   1   0   0   0   0   0   1   1  -3

>> [V,D] = eig(L)
V =

  -0.4082   0.5014  -0.1803  -0.3099   0.3433  -0.2527   0.3982  -0.0451   0.3333
  -0.0000  -0.1232   0.2895   0.5573   0.1121   0.3482   0.5775   0.1093   0.3333
   0.4082  -0.2177  -0.4863  -0.2475  -0.4554  -0.0955   0.3542   0.1874   0.3333
  -0.4082  -0.0946   0.5243  -0.2475  -0.4554  -0.0955  -0.1601   0.3674   0.3333
  -0.0000  -0.1891  -0.2514   0.3798   0.3669  -0.4069  -0.3834   0.4455   0.3333
   0.4082   0.5300   0.0546  -0.1323   0.0885   0.5024  -0.3394   0.2130   0.3333
  -0.4082  -0.4068  -0.3441  -0.1323   0.0885   0.5024  -0.2381  -0.3223   0.3333
   0.0000   0.3123  -0.0380   0.4422  -0.4319  -0.2497  -0.1941  -0.5548   0.3333
   0.4082  -0.3123   0.4317  -0.3099   0.3433  -0.2527  -0.0148  -0.4005   0.3333

D =

  -5.0000        0        0        0        0        0        0        0        0
        0  -4.3028        0        0        0        0        0        0        0
        0        0  -4.3028        0        0        0        0        0        0
        0        0        0  -3.0000        0        0        0        0        0
        0        0        0        0  -3.0000        0        0        0        0
        0        0        0        0        0  -3.0000        0        0        0
        0        0        0        0        0        0  -0.6972        0        0
        0        0        0        0        0        0        0  -0.6972        0
        0        0        0        0        0        0        0        0  -0.0000

[Figure: graph over 9 data points and its two-dimensional Laplacian eigenmap embedding (axes X and Y)]

Manifolds

Page 42

Curvilinear component analysis (CCA)

• Demartines, Hérault, 1997.

• Curvilinear component analysis (CCA) is like (absolute) MDS, except that only short distances are taken into account.

• More formally, the cost function reads

E = Σ_{i<j} ( d(x_i, x_j) − d(y_i, y_j) )² · F( d(y_i, y_j), λ_y )

• where F(d, λ_y) equals unity if d < λ_y and zero otherwise; d denotes the Euclidean distance of points in the original space (x) and in the projection (y), respectively. (Actually, F(d, λ_y) could be any monotonically decreasing function of d.)
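A minimal Octave/Matlab sketch of evaluating this cost for a candidate projection (the actual CCA algorithm minimizes it with a stochastic per-point gradient rule; the step-function weight F and all names here are illustrative assumptions):

% CCA-style cost: weighted stress over pairs i < j, counting only short output-space distances
function E = cca_cost(X, Y, lambda)
  DX = sqrt(max(bsxfun(@plus, sum(X.^2, 2), sum(X.^2, 2)') - 2 * (X * X'), 0));  % d(x_i, x_j)
  DY = sqrt(max(bsxfun(@plus, sum(Y.^2, 2), sum(Y.^2, 2)') - 2 * (Y * Y'), 0));  % d(y_i, y_j)
  F    = DY < lambda;                        % F(d(y_i, y_j), lambda): 1 for short output distances, else 0
  Err  = (DX - DY).^2 .* F;
  mask = triu(ones(size(Err)), 1) > 0;       % each pair i < j counted once
  E    = sum(Err(mask));
end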

Manifolds

Page 43


Curvilinear component analysis (CCA)

• CCA performs generally well in terms of precision; it appears to be quite robust.

• Notice the outliers at right: they are a result of the small neighborhood.


[Figure: Original data (left) and its CCA projection (right)]

Figures by Jarkko Venna

Manifolds

Page 44

Laplacian eigenmap


[Figure: Original data and its projections by Laplacian eigenmap and by CCA]

Manifolds

Page 45

Methods that directly use precision and recall?

• We interpreted some methods in terms of precision and recall. Most methods do not optimize these directly; they use various other cost functions.

• The Neighbor Retrieval Visualizer (Venna & Kaski 2007; Venna, Peltonen, Nybo, Aidos, Kaski 2010) directly maximizes a user-specified tradeoff of precision & recall

• Stochastic Neighbor Embedding (Hinton & Roweis 2002) can also be interpreted in this way: it maximizes recall

• Details later in the Neighbor Embedding lectures

[Figure: NeRV result for maximum recall (left) and for maximum precision (right)]

Page 46

Next lecture:

• general mathematical framework for nonlinear dimensionality reduction

• more methods: Isomap, Locally Linear Embedding, Maximum Variance Unfolding

• Supervised nonlinear dimensionality reduction