DIMENSIONALITY REDUCTION BY RANDOM PROJECTION AND LATENT SEMANTIC INDEXING

Jessica Lin and Dimitrios Gunopulos

Ângelo Cardoso

IST/UTL

December 2009

Outline

1. Introduction
   1. Latent Semantic Indexing (LSI)
   2. Random Projection (RP)
2. Combining LSI and Random Projection
3. Experiments
   1. Dataset and pre-processing
   2. Document Similarity
   3. Document Clustering

Introduction: Latent Semantic Indexing

- Vector-space model
  - Term-to-document matrix where each entry is the relative frequency of a term in the document
- Find a subspace with k dimensions onto which to project the original term-to-document matrix
- SVD is the optimal solution in the mean squared error sense
- Speeds up queries
- Addresses synonymy
- Finds the intrinsic dimensionality of the data
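The projection described on this slide can be sketched with a truncated SVD. This is a minimal illustration, assuming NumPy is available; the function name `lsi_project` and the toy matrix are made up for the example and do not come from the paper.

```python
import numpy as np

def lsi_project(A, k):
    """Project the columns (documents) of a term-by-document matrix A
    onto the subspace spanned by the top-k left singular vectors."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    U_k = U[:, :k]          # orthonormal basis of the k-dimensional subspace
    docs_k = U_k.T @ A      # k x n coordinates of the documents
    A_k = U_k @ docs_k      # rank-k approximation of A
    return docs_k, A_k

# Toy term-by-document matrix (4 terms, 3 documents); values are made up.
A = np.array([[1.0, 1.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [0.0, 0.0, 1.0]])

docs_2, A_2 = lsi_project(A, 2)   # A has rank 2, so nothing is lost at k = 2
docs_1, A_1 = lsi_project(A, 1)   # at k = 1 the third document is lost
```

The rank-k matrix `A_k` is the one that minimizes the squared error to `A` among all matrices of rank k, which is the optimality property the slide refers to.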

Introduction: Random Projection

- What if we construct the projection subspace randomly?
- Johnson-Lindenstrauss lemma
  - If points in a vector space are projected onto a randomly selected subspace of suitably high dimension, then the distances between the points are approximately preserved
- Making the subspace orthogonal is computationally expensive
- However, we can rely on a result by Hecht-Nielsen:
  - In a high-dimensional space, there exists a much larger number of almost orthogonal than orthogonal directions
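Both claims on this slide can be checked numerically on synthetic data. The sketch below uses only the Python standard library and assumes a Gaussian projection matrix scaled by 1/sqrt(k), which is one common choice and not necessarily the construction used in the paper; all sizes and names are illustrative.

```python
import math
import random

def random_projection_matrix(k, d, rng):
    """k x d matrix with i.i.d. N(0, 1/k) entries; rows are NOT orthogonalized."""
    scale = 1.0 / math.sqrt(k)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(d)] for _ in range(k)]

def project(R, x):
    return [sum(r * v for r, v in zip(row, x)) for row in R]

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

rng = random.Random(0)
d, k = 1000, 200                      # made-up original and reduced dimensions
R = random_projection_matrix(k, d, rng)
x = [rng.random() for _ in range(d)]
y = [rng.random() for _ in range(d)]

# Ratio of the distance after projection to the distance before; close to 1.
ratio = euclidean(project(R, x), project(R, y)) / euclidean(x, y)

# Two random directions in 1000 dimensions are almost orthogonal: cosine near 0.
row_cosine = cosine(R[0], R[1])
```

Skipping the orthogonalization step is what keeps the projection cheap; the near-zero cosine between rows illustrates why this is usually acceptable in high dimensions.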

Combining LSI and Random Projection: Motivation

- LSI
  - Captures the underlying semantics
  - Highly accurate
  - Can improve retrieval performance
  - Computationally expensive
    - O(cmn), where m is the number of terms, c is the average number of terms per document and n is the number of documents
- Random Projection
  - Efficient in terms of computational time
  - Does not preserve as much information as LSI

Combining LSI and Random Projection: Algorithm

- Proposed in "Latent Semantic Indexing: A Probabilistic Analysis"; Papadimitriou, C.H., Raghavan, P., Tamaki, H. and Vempala, S.; Journal of Computer and System Sciences; 2000
- Idea
  - Improve Random Projection accuracy
  - Improve LSI computational time
- First, the data is pre-processed to a lower dimension k1 using Random Projection
- LSI is then applied to the reduced data, to further reduce it to the desired dimension k2
- Complexity is O(ml(l + c))
  - RP on the original data: O(mcl)
  - LSI on the reduced lower-dimensional data: O(ml²)
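The two steps can be sketched as follows. This is an illustrative outline assuming NumPy and a Gaussian projection matrix; the function name `rp_lsi`, the seed handling and the synthetic input are assumptions made for the example, not details taken from the paper.

```python
import numpy as np

def rp_lsi(A, k1, k2, seed=0):
    """Two-step reduction of an m x n term-by-document matrix A:
    random projection to k1 dimensions, then a rank-k2 SVD of the result."""
    rng = np.random.default_rng(seed)
    m = A.shape[0]
    R = rng.standard_normal((k1, m)) / np.sqrt(k1)   # k1 x m random map (assumed Gaussian)
    B = R @ A                                        # step 1: k1 x n, dense
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    return U[:, :k2].T @ B                           # step 2: k2 x n document coordinates

rng = np.random.default_rng(1)
A = rng.random((300, 40))        # synthetic stand-in for a term-by-document matrix
docs = rp_lsi(A, k1=60, k2=10)   # 40 documents described by 10 coordinates each
```

The SVD here runs on a k1 x n matrix instead of the original m x n one, which is where the intended saving comes from; note that `B` is dense even when `A` is sparse.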

Experiments – Similarity: Dataset and Pre-processing

- Two subsets of the Reuters text categorization collection
- Common and rare words are removed
- Porter stemming
- Term-document matrix representation
  - Normalized to unit length
- Sets
  - Larger subset
    - 10377 documents
    - 12113 terms
    - Term-document matrix density is 0.4%
  - Smaller subset
    - 1831 documents
    - 5414 terms
    - Term-document matrix density is 0.8%

Experiments – Similarity: Layout

- Three techniques for dimensionality reduction are compared
  - Latent Semantic Indexing (LSI)
  - Random Projection (RP)
  - Combination of Random Projection and LSI (RP_LSI)
- The dimensionality of the original data is reduced to k dimensions
  - k = 50, 100, 200, 300, 400, 500, 600

Experiments – Similarity: Metrics

- Euclidean distance
- Cosine of the angle between documents
- Determining the error
  - Randomly select 100 document pairs and calculate their distances before and after dimensionality reduction
  - Compute the correlation between the distance vectors before (x) and after (y) dimensionality reduction
  - Error is defined as [formula not preserved in the source]
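The pair-sampling and correlation steps can be sketched as below, using only the standard library. The sketch assumes Pearson correlation and Euclidean distance, stops at the correlation, and does not attempt to reproduce the error formula. The function names and the truncation used as a stand-in for a reduced representation are hypothetical.

```python
import math
import random

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def pearson(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def distance_agreement(original, reduced, n_pairs, rng):
    """Sample document pairs and correlate their distances before (x) and after (y)."""
    n = len(original)
    pairs = [tuple(rng.sample(range(n), 2)) for _ in range(n_pairs)]
    x = [euclidean(original[i], original[j]) for i, j in pairs]
    y = [euclidean(reduced[i], reduced[j]) for i, j in pairs]
    return pearson(x, y)

rng = random.Random(0)
docs = [[rng.random() for _ in range(20)] for _ in range(50)]   # synthetic documents
truncated = [d[:10] for d in docs]   # hypothetical stand-in for a reduced representation
corr = distance_agreement(docs, truncated, 100, rng)
```

A correlation of 1 means the reduction preserved the relative ordering and spacing of the sampled distances exactly; lower values indicate more distortion.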

Experiments – Similarity: Distance before and after dimensionality reduction

- The best technique in terms of error is LSI, as expected
- RP_LSI improves the accuracy of RP in terms of Euclidean distance and dot product

* RP_LSI: k1 = 600

Experiments – Similarity: RP_LSI k1 and k2 parameters

- The amount of the second reduction (the final dimension) matters more for achieving a small error than the amount of the first reduction
- This suggests that LSI plays a more important role in preserving similarity than RP

Experiments – Similarity: Running Time

- RP_LSI performs slightly worse than LSI on the larger dataset (more sparse)
- RP_LSI achieves a significant improvement over LSI on the smaller dataset (less sparse)

* RP_LSI: k1 = 600

Experiments – Clustering: Layout

- Clustering is applied to the data before and after dimensionality reduction
- Experiments are performed on the smaller dataset
- The clustering algorithm chosen is classic k-Means
  - Effective
  - Low computational cost
- Document vectors are normalized to unit length before clustering
- Centroids are normalized to unit length after clustering

Experiments – Clustering: k-Means

- The k-Means objective is to minimize the sum of intra-cluster errors
  - The quality of dimensionality reduction is evaluated using this criterion
  - Since the dimensionality of the data is reduced, this criterion has to be computed in the original space to make the comparison possible
- The number of clusters is set to 5
  - This is roughly the number of main topics in the dataset
- Initialization is random
  - k-Means is repeated 20 times for each experiment and the average is taken
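The evaluation protocol on this slide (cluster in the reduced space, score in the original space, average over 20 random restarts) can be sketched as follows. This is a simplified illustration using plain Euclidean k-means from the standard library; it omits the unit-length normalization of documents and centroids, and the synthetic data and the truncation used as a "reduced" representation are assumptions.

```python
import random

def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def mean(points):
    return [sum(col) / len(points) for col in zip(*points)]

def kmeans_labels(points, k, rng, n_iter=20):
    """Plain k-means with random initialization; returns one label per point."""
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(n_iter):
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = mean(members)
    return labels

def intra_cluster_error(points, labels, k):
    """Sum of squared distances of each point to the mean of its cluster."""
    total = 0.0
    for c in range(k):
        members = [p for p, lab in zip(points, labels) if lab == c]
        if members:
            centre = mean(members)
            total += sum(sq_dist(p, centre) for p in members)
    return total

rng = random.Random(0)
# Synthetic data: two groups of 8-dimensional points (made-up values).
original = [[rng.gauss(mu, 1.0) for _ in range(8)] for mu in (0.0, 5.0) for _ in range(30)]
reduced = [p[:3] for p in original]   # hypothetical stand-in for LSI / RP / RP_LSI output

k, runs = 2, 20
errors = []
for _ in range(runs):
    labels = kmeans_labels(reduced, k, rng)                   # cluster in the reduced space
    errors.append(intra_cluster_error(original, labels, k))   # score in the original space
average_error = sum(errors) / runs
baseline = intra_cluster_error(original, [0] * len(original), 1)   # everything in one cluster
```

Scoring with the original vectors keeps the numbers comparable across reduction methods and target dimensions, since distances measured inside different reduced spaces are not on the same scale.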

Experiments – Clustering: Results

- LSI and RP_LSI show results similar to the original data even for smaller dimensions
- RP shows significantly worse performance for smaller dimensions and more similar performance for larger dimensions
- LSI shows slightly better results than RP_LSI
- Clustering results using Euclidean distance are similar

Conclusion

- LSI and Random Projection were compared
- The combination of Random Projection and LSI was analyzed
- The sparseness of the data seems to play a central role in the effectiveness of this technique
- The technique appears to be more effective the less sparse the original data is
  - SVD complexity is linear in the sparseness of the data
  - Random Projection makes the data completely dense
  - The gain from first reducing the data dimensionality competes with the additional complexity added to the SVD calculation by making the data completely dense
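The loss of sparseness can be seen on a small synthetic example. The sketch below uses only the standard library and assumes a dense Gaussian projection matrix; the sizes are made up and far smaller than the Reuters subsets.

```python
import random

rng = random.Random(0)
m, n, k = 200, 30, 10   # terms, documents, target dimension (made-up sizes)

# Sparse term-by-document matrix: each document (column) stores only its
# nonzero terms, here exactly 2 out of 200, i.e. a density of 1%.
docs = [{t: 1.0 for t in rng.sample(range(m), 2)} for _ in range(n)]
density_before = sum(len(d) for d in docs) / (m * n)

# Dense Gaussian projection matrix R (k x m), applied column by column.
R = [[rng.gauss(0.0, 1.0) for _ in range(m)] for _ in range(k)]
projected = [[sum(R[i][t] * v for t, v in d.items()) for i in range(k)] for d in docs]

# Every projected entry is a sum of Gaussian values, so none of them is zero.
nonzeros_after = sum(1 for col in projected for x in col if x != 0.0)
density_after = nonzeros_after / (k * n)
```

The projected matrix has fewer rows but every entry is nonzero, so an SVD routine that exploits sparsity gains nothing from the reduced input.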

Conclusion

- Additional experiments are necessary to confirm that it is indeed the sparseness of the data that causes the discrepancy between the observed running time and what was previously expected
- Other dimensionality reduction algorithms that preserve the sparseness of the data might be useful in improving the running time of LSI

Questions
