# SVD and LSI Tutorial 4: Latent Semantic Indexing (LSI) How-to Calculations



A tutorial on Latent Semantic Indexing how-to calculations. Learn how LSI scores documents and queries. Includes SEO LSI myths and do-it-yourself procedures for ranking documents.

Introduction

In this tutorial you will learn how Singular Value Decomposition (SVD) is used in Latent Semantic Indexing (LSI) to score documents and queries. Do-it-yourself procedures using an online matrix calculator are described. After completing this tutorial, you should be able to

1. learn about basic LSI models, then move forward to advanced models.
2. understand how LSI ranks documents.
3. replicate all calculations and experiment with your own data.
4. get out of your head the many SEO myths and fallacies about LSI.
5. stay away from «LSI based» Snakeoil Search Marketers.

The latter is a sector of the search marketing industry (SEOs/SEMs) that claims to provide «LSI based» services while giving the marketing industry a black eye before the information retrieval community. These marketers like to come up with any possible pseudo-science argument in order to pitch their services. I have exposed their tactics and tricks in the blog

Latest SEO Incoherences (LSI) - My Case Against «LSI based» Snakeoil Marketers.

Before delving into LSI a little review is in order. Believe it or not, this is related to these snakeoil sellers.

Assumptions and Prerequisites

This is a tutorial series, so I assume you have read the previous parts. To replicate the calculations given below you must be familiar with linear algebra or understand the material covered in the earlier tutorials of this series.

If you are not familiar with linear algebra or have not read those tutorials, please STOP AND READ THEM.

You might also find the following fast tracks useful; they are designed to serve as quick references for the series:

• Latent Semantic Indexing (LSI) Fast Track Tutorial
• Singular Value Decomposition (SVD) Fast Track Tutorial

Warning: Term Count Model Ahead

One of the first term vector models was the Term Count Model. In this model the weight of term i in document j is defined as a local weight (L_ij):

Equation 1: w_ij = L_ij = tf_ij

where tf_ij is the term frequency, the number of times term i occurs in document j. So, let's say that document 1 repeats latent 3 times and semantic 2 times. The weights of these terms in document 1 are then 3 and 2, respectively. If we treat queries as documents, then query weights are defined the same way. Thus, the weight of term i in a query q is defined as w_iq = tf_iq.
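Equation 1 can be sketched in a few lines of code. The two-document collection below is hypothetical and only illustrates the raw-frequency weighting:

```python
# Term Count Model (Equation 1): w_ij = tf_ij, the raw frequency of
# term i in document j. The toy documents here are made up.
from collections import Counter

docs = {
    "d1": "latent latent latent semantic semantic",
    "d2": "latent semantic indexing indexing",
}

def term_count_weights(text):
    """Return the Equation 1 weight of every term in one document."""
    return Counter(text.split())

w_d1 = term_count_weights(docs["d1"])
print(w_d1["latent"], w_d1["semantic"])  # 3 2, as in the example above
```

Queries are weighted the same way: treat the query string as one more "document" and count occurrences.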

This model has many limitations and theoretical flaws. Among others:

13/02/2011 10:53

Page 1

http://www.miislita.com/information-retrieval-tutorial/svd-lsi-tutorial-4-lsi-how-to-calculations.html



• by repeating terms many times these become artificially relevant. Thus, the model is easy to game.

• long documents are favored since these tend to have more words and word occurrences. These tend to score high simply because they are longer, not because they are relevant.

• the relative global frequency of terms across a collection of documents is ignored.

• repetition does not mean that terms are relevant. These might be repeated within different contexts or topics.

There are other theoretical flaws, but these are the obvious ones. Today we can do better. Compared with Equation 1, a better description for term weights is given by

Equation 2: w_ij = L_ij × G_i × N_j

where

• L_ij is the local weight for term i in document j.

• G_i is the global weight for term i across all documents in the collection.

• N_j is the normalization factor for document j.

Local weights are functions of how many times each term occurs in a document, global weights are functions of how many documents in the collection contain each term, and the normalization factor corrects for discrepancies in the lengths of the documents across the collection. Advanced models incorporate global entropy measures, probabilistic measures and link analysis. Evidently, how documents are ranked is determined by which term weight scheme one uses with Equation 2. In 1999 Erica Chisholm and Tamara G. Kolda from Oak Ridge National Labs reviewed several term weight schemes that conform to Equation 2 in New Term Weighting Formulas for the Vector Space Method in Information Retrieval. Other schemes have been proposed since then.
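Equation 2 can be instantiated in many ways. The sketch below uses one common choice that I am assuming for illustration (it is not the only scheme, nor necessarily the one any given engine uses): raw term frequency for L_ij, inverse document frequency log(N/n_i) for G_i, and cosine (unit-length) normalization for N_j:

```python
# One illustrative instantiation of Equation 2 (an assumed scheme):
#   L_ij = tf_ij, G_i = log(N / n_i), N_j = 1 / ||weighted column j||.
import math

docs = [
    ["latent", "semantic", "latent"],
    ["semantic", "indexing"],
    ["singular", "value", "decomposition"],
]

terms = sorted({t for d in docs for t in d})
N = len(docs)

def n_i(term):
    """Number of documents in the collection containing the term."""
    return sum(term in d for d in docs)

def weights(doc):
    """Equation 2 weights for one document, as a vector over `terms`."""
    local = {t: doc.count(t) for t in terms}                              # L_ij
    glob = {t: math.log(N / n_i(t)) if n_i(t) else 0.0 for t in terms}    # G_i
    raw = [local[t] * glob[t] for t in terms]
    norm = math.sqrt(sum(x * x for x in raw)) or 1.0                      # 1 / N_j
    return [x / norm for x in raw]                                        # L_ij * G_i * N_j

A = [weights(d) for d in docs]  # one weight vector per document
```

With cosine normalization every document vector has unit length, so longer documents no longer win simply by being longer, which addresses one of the Term Count Model flaws listed above.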

What does this have to do with LSI and snakeoil marketers? A lot.

Snakeoil Marketers and LSI

SEOs like to copy/paste/quote early papers on LSI, not realizing that these describe LSI implementations wherein entries of the term-document matrix A are defined using the Term Count Model. Back then such implementations served a purpose. The first LSI papers described experiments under controlled conditions. Documents examined were similar in size and format and free from Web spam. Some of these papers describe experiments in which LSI was applied to collections of titles, abstracts, research papers and regulated examinations (TOEFL, SAT, GRE, etc.) free from commercial noise. By defining A using mere word occurrences those LSI implementations inherited some of the limitations and theoretical flaws associated with the Term Count Model.

It was immediately realized that performance could be improved by incorporating global and normalization weights. This is still the case these days. Today we deal with unstructured Web collections, wherein global weights are important and where documents come in different lengths and formats. This is why many LSI implementations today are not based on Equation 1, but on Equation 2.

Figure 1 shows how Equation 1 and Equation 2 are incorporated in a term-document matrix. When we talk about LSI and Term Vector Models, it turns out that this matrix is their common denominator. Thus, how we define term weights is critical.




Figure 1. Term-document matrix. How we define term weights affects Term Vector and LSI implementations.

What are the chances for end users to replicate search engine scores? Since end users do not have access to the corpus of a search engine, they cannot compute global weights, normalize all documents or compute global entropy values, among other limitations. Thus, their chances of replicating what a search engine does are close to zero.

This is why «LSI based» Snakeoil Marketers resort to primitive tools in which word occurrences and just a few search results are used. So, by default, whatever these tools score is a poor caricature of what a search engine like Google or Yahoo! probably scores. You get what you pay for. If you want to be fooled by these firms, be my guest. As a side note, let me mention that in the first issue of IR Watch - The Newsletter our subscribers learned how LSI models based on mere word occurrences can be gamed by spammers.

Debunking a Few More SEO Myths

Before scoring documents with LSI or with a term vector model we need to construct the term-document matrix A. This means that documents have already been indexed, unique terms have been extracted and term weights have been computed. Thus, the idea that LSI is document indexing is just another misconception. Our Document Indexing Tutorial explains what is/is not document indexing. Ruthven and Lalmas have summarized document indexing in detail in A survey on the use of relevance feedback for information access systems.
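Once A is built, the LSI scoring itself can be sketched as follows. This is a minimal sketch under stated assumptions: the tiny matrix and the query are made up for illustration, k = 2 is an arbitrary rank choice, and the query-folding formula q_k = qᵀ U_k S_k⁻¹ is the standard one from the LSI literature (the tutorial's own worked examples use an online matrix calculator instead):

```python
# Minimal LSI scoring sketch: truncate the SVD of a term-document
# matrix A to rank k, fold the query into the reduced space, and rank
# documents by cosine similarity. All numbers here are illustrative.
import numpy as np

# rows = terms (latent, semantic, indexing), columns = documents d1..d3
A = np.array([
    [1.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
    [0.0, 1.0, 1.0],
])

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                                   # keep the k largest singular values
Uk, Sk = U[:, :k], np.diag(s[:k])
Vk = Vt[:k, :].T                        # one row per document, k coordinates

# Fold the query into the reduced space: q_k = q^T U_k S_k^{-1}
q = np.array([1.0, 1.0, 0.0])           # query: "latent semantic"
qk = q @ Uk @ np.linalg.inv(Sk)

def cosine(a, b):
    """Cosine similarity between two reduced-space vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

scores = [cosine(qk, dk) for dk in Vk]
ranking = sorted(range(len(scores)), key=lambda j: -scores[j])
```

Note that every step after the SVD is cheap; the expensive parts, building A from an indexed collection and computing the decomposition, are exactly what end users cannot replicate without the engine's corpus.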

Many search marketers claim that their link based services and