
Page 1: Latent Semantic Indexing and Beyond

Latent Semantic Indexing and Beyond

Leif Grönqvist ([email protected])
School of Mathematics and Systems Engineering
The Swedish Graduate School of Language Technology

7 March 2003

Page 2: Latent Semantic Indexing and Beyond


Overview

• My background

• Introduction to vector space models and Latent Semantic Indexing

• A toy example

• Interpretation

• Some applications

• A concrete example and a small experiment

• Improvements of the model

• Various unsolved problems

• Conclusion: things I have to do

Page 3: Latent Semantic Indexing and Beyond


My Background

• 1986-1989: "4-årig teknisk" (a four-year technical secondary programme, electrical engineering)
• 1989-1993: MSc (official translation of "Filosofie Magister") in Computing Science, Göteborg University
• 1989-1993: 62 points in mechanics, electronics, etc.
• 1994-2001: Work at the Linguistics department in Göteborg
  • Various projects related to corpus linguistics
  • Some teaching on statistical methods (Göteborg and Uppsala), and on corpus linguistics in Göteborg, Sofia, and Beijing

• 1995: Consultant at Redwood Research, in Sollentuna, working on information retrieval in medical databases

• 1995-1996: Work at the department of Informatics in Göteborg (the Internet Project)

• 2001-2006: PhD Student in Computer Science / Language Technology

Page 4: Latent Semantic Indexing and Beyond


Vector Space Models

• If we had a way to map any term to a vector in a high-dimensional space, such that similarity in meaning between terms is reflected in the distance between their vectors… Then we could (see the sketch below):
  • For a given term t, find an ordered list of the terms most similar to t
  • For any two terms, find the similarity between them
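These two operations only need the vectors and a similarity function. A minimal sketch, assuming a hypothetical dictionary vectors that maps terms to numpy arrays (all names and the toy vectors below are illustrative, not from the talk):

    import numpy as np

    def cosine(u, v):
        # Cosine of the angle between two vectors
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    def most_similar(term, vectors, n=10):
        # Order all other terms by cosine similarity to the given term
        target = vectors[term]
        scores = [(cosine(target, vec), other)
                  for other, vec in vectors.items() if other != term]
        return sorted(scores, reverse=True)[:n]

    # Toy usage with made-up two-dimensional vectors:
    vectors = {"car": np.array([0.9, 0.1]),
               "auto": np.array([0.8, 0.2]),
               "banana": np.array([0.1, 0.9])}
    print(most_similar("car", vectors, n=2))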

Page 5: Latent Semantic Indexing and Beyond


Vector Space Models, cont.

• And if adding the meanings of terms is also reflected by adding the corresponding vectors, we can do some more things:
  • If we assume that it is possible to extract terms from a document, we can map documents to vectors too (a minimal sketch follows below)!
  • A set of terms (one or more terms) may be seen as a document as well
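A minimal sketch of that additivity assumption, with illustrative names and vectors: a document (or a query of a few terms) maps to the sum of its term vectors.

    import numpy as np

    vectors = {"car": np.array([0.9, 0.1]),
               "auto": np.array([0.8, 0.2]),
               "banana": np.array([0.1, 0.9])}

    def document_vector(terms, vectors):
        # Additivity assumption from the slide: a bag of terms maps
        # to the sum of the vectors of the terms we have vectors for
        known = [vectors[t] for t in terms if t in vectors]
        return np.sum(known, axis=0) if known else None

    # A query is just a very short "document":
    print(document_vector(["car", "banana"], vectors))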

Page 6: Latent Semantic Indexing and Beyond


Vector Space Models, cont.

• Now, for any term or document d, it is possible to find an ordered list of the terms or documents most similar to d

• Further, for any two terms or documents, we can find the similarity between them

• It is therefore meaningful to look at a term as a special case of a document – a very short one

Page 7: Latent Semantic Indexing and Beyond


Alternative data sources

• A useful data source for this kind of information would be a thesaurus, a WordNet, or any kind of knowledge database. But:
  • We don't have them for all languages
  • They are not domain specific, and domain-specific terms are not covered
  • In such data sources most of the words are missing
    • Especially names, compounds, technical terms, and numbers
    • My big newspaper corpus contains ~3,000,000 unique words

• A vector space model can be trained from raw, un-annotated corpus data!

Page 8: Latent Semantic Indexing and Beyond


Calculating a vector space

• The training process needs a large set of documents - the bigger the better. My data set used for experiments contains roughly 1.5 million newspaper articles and 0.5 billion running words but I will collect more…

• Step 1: Create a word-by-document matrix, where each element is the frequency of a word type in a specific document (a sketch of this step follows below)

• From here there are several ways to find a good vector space
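A minimal sketch of step 1 above, assuming the corpus is already tokenized into lists of words (the tiny corpus here is illustrative; at 3 million word types a sparse matrix representation would be needed):

    from collections import Counter

    docs = [["human", "interface", "computer"],
            ["graph", "trees", "minors", "graph"]]

    # One row per word type, one column per document
    vocab = sorted({w for doc in docs for w in doc})
    row = {w: i for i, w in enumerate(vocab)}

    # Word-by-document frequency matrix
    X = [[0] * len(docs) for _ in vocab]
    for j, doc in enumerate(docs):
        for w, freq in Counter(doc).items():
            X[row[w]][j] = freq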

Page 9: Latent Semantic Indexing and Beyond


Vector Space Algorithms

• Singular Value Decomposition (SVD)
  • A mathematically complicated way (based on eigenvalues) to find an optimal vector space with a specific number of dimensions
  • Computationally heavy - maybe 20 hours for my test set
  • Often uses the entire document as context

• Random Indexing (RI) – a minimal sketch follows below
  • Selects a number of dimensions randomly
  • Not as heavy to calculate, but it is less clear (to me) why it works
  • Uses a small context, typically 1+1 to 5+5 words

• Neural nets, Hyperspace Analogue to Language, etc.
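A minimal sketch of Random Indexing as described above: every word gets a fixed sparse random "index vector", and a word's context vector accumulates the index vectors of its neighbours within a small window. The dimensionality, sparsity, and window size below are illustrative choices, not the talk's settings:

    import numpy as np

    rng = np.random.default_rng(0)
    DIM, NONZERO, WINDOW = 1000, 10, 2   # a 2+2-word context

    def index_vector():
        # Sparse ternary vector: a few random +1/-1 entries, rest zeros
        v = np.zeros(DIM)
        pos = rng.choice(DIM, size=NONZERO, replace=False)
        v[pos] = rng.choice([-1.0, 1.0], size=NONZERO)
        return v

    def train(docs):
        index, context = {}, {}
        for doc in docs:
            for i, w in enumerate(doc):
                context.setdefault(w, np.zeros(DIM))
                lo, hi = max(0, i - WINDOW), min(len(doc), i + WINDOW + 1)
                for j in range(lo, hi):
                    if j != i:
                        if doc[j] not in index:
                            index[doc[j]] = index_vector()
                        context[w] += index[doc[j]]
        return context   # the trained word vectors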

Page 10: Latent Semantic Indexing and Beyond


The terminology I use

Some people use these terms in a sloppy way. For me:

• LSI = LSA: Latent Semantic Indexing and Latent Semantic Analysis are used in roughly the same way by most people

• Two ways to obtain the model used in LSA are SVD and RI – they both find the latent information

Page 11: Latent Semantic Indexing and Beyond


The distance measure

• Three easy-to-calculate distance measures (sketched below):
  • Cosine: the cosine of the angle between the vectors
  • Euclidean distance: just the distance as we all know it
  • Manhattan distance: the distance if you may walk only along the orthogonal axes

• Just as easy to calculate in n dimensions, where n >> 3

• The most commonly used is the cosine
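A minimal sketch of the three measures; all of them work unchanged in any number of dimensions:

    import numpy as np

    def cosine_sim(u, v):
        # Cosine of the angle: 1 = same direction, 0 = orthogonal
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

    def euclidean(u, v):
        # The ordinary straight-line distance
        return float(np.linalg.norm(u - v))

    def manhattan(u, v):
        # Walking only along the orthogonal axes
        return float(np.abs(u - v).sum())

    u, v = np.random.rand(300), np.random.rand(300)   # n >> 3 is no problem
    print(cosine_sim(u, v), euclidean(u, v), manhattan(u, v))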

Page 12: Latent Semantic Indexing and Beyond


A toy example

Page 13: Latent Semantic Indexing and Beyond


What SVD gives us

X = T0 S0 D0^T, where X is the term-by-document matrix of the toy example, T0 and D0 have orthonormal columns (the left and right singular vectors), and S0 is the diagonal matrix of singular values
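A minimal sketch of how this factorization can be computed with numpy (a random matrix stands in for the toy example's X; note that numpy returns D0 already transposed):

    import numpy as np

    X = np.random.rand(12, 9)   # stand-in for the 12-term x 9-document matrix

    T0, s, D0t = np.linalg.svd(X, full_matrices=False)
    S0 = np.diag(s)             # singular values on the diagonal, descending

    # The factorization reproduces X up to floating-point noise:
    assert np.allclose(X, T0 @ S0 @ D0t)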

Page 14: Latent Semantic Indexing and Beyond


And our example: T0

(rows = the 12 terms in the order human, interface, computer, user, system, response, time, EPS, survey, trees, graph, minors; columns = the 9 dimensions)

 .22 -.11  .29 -.41 -.11 -.34  .52 -.06 -.41
 .20 -.07  .14 -.55  .28  .50 -.07 -.01 -.11
 .24  .04 -.16 -.59 -.11 -.25 -.30  .06  .49
 .40  .06 -.34  .10  .33  .38  .00  .00  .01
 .64 -.17  .36  .33 -.16 -.21 -.17  .03  .27
 .27  .11 -.43  .07  .08 -.17  .28 -.02 -.05
 .27  .11 -.43  .07  .08 -.17  .28 -.02 -.05
 .30 -.14  .33  .19  .11  .27 -.03 -.02 -.17
 .21  .27 -.18 -.03 -.54  .08 -.47 -.04 -.58
 .01  .49  .23  .03 -.59 -.39 -.29  .25 -.23
 .04  .62  .22  .00 -.07  .11  .16 -.68  .23
 .03  .65  .14 -.01 -.30  .28  .34  .68 -.18

Page 15: Latent Semantic Indexing and Beyond


And our example: S0

S0 = diag(3.34, 2.54, 2.35, 1.64, 1.50, 1.31, 0.85, 0.56, 0.36)

(the singular values in descending order on the diagonal; all off-diagonal elements are zero)

Page 16: Latent Semantic Indexing and Beyond


And our example: D0

 .20  .61  .46  .54  .28  .00  .01  .02  .08
-.06  .17 -.13 -.23  .11  .19  .44  .62  .53
 .11 -.50  .21  .57 -.51  .10  .19  .25  .08
-.95 -.03  .04  .27  .15  .02  .02  .01 -.03
 .05 -.21  .38 -.21  .33  .39  .35  .15 -.60
-.08 -.26 -.72 -.37  .03 -.30 -.21  .00  .36
 .18 -.43 -.24  .26  .67 -.34 -.15  .25  .04
-.01  .05  .01 -.02 -.06  .45 -.76  .45 -.07
-.06  .24  .02 -.08 -.26 -.62  .02  .52 -.45

Page 17: Latent Semantic Indexing and Beyond


We can recalculate X with m=2

            C1   C2   C3   C4   C5   M1   M2   M3   M4
Human      .16  .40  .38  .47  .18 -.05 -.12 -.16 -.09
Interface  .14  .37  .33  .40  .16 -.03 -.07 -.10 -.04
Computer   .15  .51  .36  .41  .24  .02  .06  .09  .12
User       .26  .84  .61  .70  .39  .03  .08  .12  .19
System     .45 1.23 1.05 1.27  .56 -.07 -.15 -.21 -.05
Response   .16  .58  .38  .42  .28  .06  .13  .19  .22
Time       .16  .58  .38  .42  .28  .06  .13  .19  .22
EPS        .22  .55  .51  .63  .24 -.07 -.14 -.20 -.11
Survey     .10  .53  .23  .21  .27  .14  .44  .44  .42
Trees     -.06  .23 -.14 -.27  .14  .24  .77  .77  .66
Graph     -.06  .34 -.15 -.30  .20  .31  .98  .98  .85
Minors    -.04  .25 -.10 -.21  .15  .22  .71  .71  .62
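A minimal sketch of this recalculation: keep only the first m = 2 singular triplets and multiply the truncated factors back together (again with a random stand-in for the toy matrix):

    import numpy as np

    X = np.random.rand(12, 9)   # stand-in for the toy term-by-document matrix
    m = 2

    T0, s, D0t = np.linalg.svd(X, full_matrices=False)
    X_hat = T0[:, :m] @ np.diag(s[:m]) @ D0t[:m, :]

    # X_hat is the best rank-2 approximation of X in the least squares sense
    print(np.round(X_hat, 2))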

Page 18: Latent Semantic Indexing and Beyond


What does the SVD give?

• Dumais 1995: "The SVD program takes the ltc transformed term-document matrix as input, and calculates the best 'reduced-dimension' approximation to this matrix."

• Michael W Berry 1992: "This important result indicates that Ak is the best k-rank approximation (in a least squares sense) to the matrix A."

• Leif: What Berry says is that SVD gives the best projection from n to k dimensions, that is, the projection that preserves distances in the best possible way – so there are no problems with local maxima.

Page 19: Latent Semantic Indexing and Beyond


What does it really mean then?

• The fact that a word w is represented by a specific vector v means, in itself, exactly nothing!

• If two words a and b are represented by vectors close to each other (the angle between them is small), then:
  • a and b are often found in the same document, and/or
  • a is often found together with some word c, and c is often found together with b
  • And so on…

Page 20: Latent Semantic Indexing and Beyond


A naïve algorithm

• It is not trivial to see why SVD and RI work. Here is a naïve but more intuitive algorithm that gives a result similar to SVD, although it is too slow for practical use (a sketch follows below):

1. For each unique word, select a random point in a space of the chosen dimensionality
2. For each document D in the set: move the points corresponding to the words in D towards the mass center of those words/points
3. If any point made a "big" move since the last iteration, go back to step 2

Steps 1-3 could be run several times to have a chance of finding the global maximum
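A minimal sketch of the naïve algorithm; the dimensionality, step size, and "big move" threshold are illustrative:

    import numpy as np

    def naive_embedding(docs, dim=50, rate=0.1, threshold=1e-4, seed=0):
        rng = np.random.default_rng(seed)
        vocab = {w for doc in docs for w in doc}
        # Step 1: a random point for each unique word
        point = {w: rng.standard_normal(dim) for w in vocab}
        biggest_move = np.inf
        while biggest_move > threshold:   # step 3: repeat while points still move
            biggest_move = 0.0
            for doc in docs:              # step 2: pull each word in the document
                words = set(doc)          # towards the mass center of its words
                center = np.mean([point[w] for w in words], axis=0)
                for w in words:
                    move = rate * (center - point[w])
                    point[w] += move
                    biggest_move = max(biggest_move, float(np.linalg.norm(move)))
        return point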

Page 21: Latent Semantic Indexing and Beyond

(figure: plot of the resulting term vectors, with colored clusters)

Page 22: Latent Semantic Indexing and Beyond


Zoom into the blue cluster

Page 23: Latent Semantic Indexing and Beyond


And the red one

Page 24: Latent Semantic Indexing and Beyond


Some applications

• Automatic generation of a domain specific thesaurus

• Keyword extraction from documents

• Find sets of similar documents in a collection

• Find documents related to a given document or a set of terms

Page 25: Latent Semantic Indexing and Beyond


Problems and questions

• How can we interpret the similarities as different kinds of relations?

• How can we include document structure and phrases in the model?

• Terms are not really terms, but just words

• Ambiguous terms pollute the vector space

• How could we find the optimal number of dimensions for the vector space?

Page 26: Latent Semantic Indexing and Beyond


An example based on 5000 newspaper articles

Closest to "pelle svensson":
0.886 pelle
0.886 svensson
0.821 svenssons
0.795 ödsligt
0.789 skandal
0.786 frikännande
0.784 polismannens
0.781 tjänstetid
0.781 slutkörd
0.781 munsex
0.780 avstyra

Closest to "bengt johansson":
0.853 johansson
0.752 bengt
0.750 davidson
0.746 folkpartiledaren
0.737 kdsledaren
0.734 öresundsbroprojektet
0.728 centerledaren
0.725 irhammar
0.716 partiledarna
0.715 avgaser
0.709 lyckosamt

Page 27: Latent Semantic Indexing and Beyond


Bengt Johansson is just Bengt + Johansson – something is missing!

Closest to "bengt":
1.000 bengt
0.764 folkpartiledaren
0.749 westerberg
0.730 kdsledaren
0.713 riksdagsledamot
0.703 ändrats
0.697 ingbritt
0.692 irhammar
0.685 tolkningen
0.677 tolkar
0.674 partiledarna

Closest to "johansson":
1.000 johansson
0.789 olof
0.768 miljödepartementets
0.752 görel
0.751 thurdin
0.750 miljöminister
0.749 brofrågan
0.749 rosenbad
0.746 miljödepartementet
0.746 regeringssammanträdet
0.745 avgaser

Page 28: Latent Semantic Indexing and Beyond


A small experiment

• I want the model to know the difference between Bengt_Johansson (the name) and Bengt + Johansson (the sum of the parts)

1. Make a frequency list of all n-tuples up to n=5 with frequency > 1 (a sketch of this step follows below)
2. Keep all words in the bags, but add the tuples, with spaces replaced by _, as words
3. Run the LSI again

• Now Bengt_Johansson is a word, and Bengt_Johansson is NOT Bengt + Johansson
• The number of terms grows from 34,238 to 104,783
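A minimal sketch of steps 1-2 (the function names and the toy corpus are illustrative):

    from collections import Counter

    def ngrams(words, n_max=5):
        for n in range(2, n_max + 1):
            for i in range(len(words) - n + 1):
                yield "_".join(words[i:i + n])

    def add_phrases(docs, n_max=5):
        # Step 1: frequency list of all n-tuples up to n=5 over the corpus
        freq = Counter(g for doc in docs for g in ngrams(doc, n_max))
        # Step 2: keep the original words, add tuples with frequency > 1
        return [doc + [g for g in ngrams(doc, n_max) if freq[g] > 1]
                for doc in docs]

    docs = [["bengt", "johansson", "tränar", "landslaget"],
            ["förbundskapten", "bengt", "johansson"]]
    print(add_phrases(docs)[0])   # now contains "bengt_johansson"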

Page 29: Latent Semantic Indexing and Beyond


New results

• Some distances:

0.4371 bengt_johansson johansson

0.2566 bengt_johansson bengt

0.1014 bengt_johansson olof

0.0994 bengt_johansson folkpartiledaren

0.8014 johansson olof

0.5376 johansson folkpartiledaren

0.4850 johansson bengt

0.8438 bengt folkpartiledaren

0.4246 bengt olof

0.5616 folkpartiledaren olof

Page 30: Latent Semantic Indexing and Beyond


And the top list for Bengt_Johansson

1.000 bengt_johansson
0.997 handbollslandslag
0.995 gunnar_blombäck
0.993 fyrnationsturneringen_i_östergötland
0.991 fyrnationsturneringen
0.991 förbundskapten_bengt_johansson
0.991 förbundskapten_bengt
0.990 blombäck
0.974 carlen
0.972 åtta_mål
0.971 bänken
0.957 magnus_wislander
0.957 wislander
0.953 målet_stod
0.951 svenske_förbundskaptenen
0.949 orutinerade
0.948 vinna_den_här
0.945 magnus_andersson
0.945 matchen_spelades
0.942 förbundskaptenen
0.935 landskamp
0.935 glädjeämnen
0.933 vmlaget
0.927 halvlek
0.927 världsstjärnor
0.926 bottenlaget
0.924 brolin
0.923 uppvisningen
0.923 offensivt
0.922 jörgensen
0.921 landslag

Page 31: Latent Semantic Indexing and Beyond


The new vector space model

• It is clear that it is now possible to find terms closely related to Bengt Johansson – the handball coach

• But is the model better for single words or for document comparison?

What do you think?

• More “words” than before – hopefully it improves the result just as more data does

• At least no reason for a worse result... Or?

Page 32: Latent Semantic Indexing and Beyond


An example document

REGERINGSKRIS ELLER INTE PARTILEDARNA I SISTA MINUTEN-ÖVERLÄGGNINGAR OM BRON Under onsdagskvällen satt partiledarna i regeringen i sista minuten-överläggningar om Öresundsbron Centerledaren Olof Johansson var den förste som lämnade överläggningarna På torsdagen ska regeringen ge ett besked Det måste dock enligt statsminister Carl Bildt inte innebära ett ja eller ett nej till bron …

[English: GOVERNMENT CRISIS OR NOT: PARTY LEADERS IN LAST-MINUTE TALKS ABOUT THE BRIDGE. During Wednesday evening the government's party leaders sat in last-minute talks about the Öresund bridge. Centre Party leader Olof Johansson was the first to leave the talks. On Thursday the government will give an answer, though according to Prime Minister Carl Bildt it need not mean a yes or a no to the bridge …]

Page 33: Latent Semantic Indexing and Beyond


Closest terms in each model

With phrases added (note the _-joined terms):
0.986 underkänner
0.982 irhammar
0.977 partiledarna
0.970 godkände
0.962 delade_meningar
0.960 regeringssammanträde
0.957 riksdagsledamot
0.957 bengt_westerberg
0.954 materialet
0.952 diskuterade
0.950 folkpartiledaren
0.949 medierna
0.947 motsättningarna
0.946 vilar
0.944 socialminister_bengt_westerberg

Basic model:
0.967 partiledarna
0.921 miljökrav
0.921 underkänner
0.918 tolkar
0.897 meningar
0.888 centerledaren
0.886 regeringssammanträde
0.880 slottet
0.880 rosenbad
0.877 planminister
0.866 folkpartiledaren
0.855 thurdin
0.845 brokonsortiet
0.839 görel
0.826 irhammar

Page 34: Latent Semantic Indexing and Beyond


Closest document in both models

BILDT LOVAR BESKED OCH REGERINGSKRIS HOTAR Det blir ett besked under torsdagen men det måste inte innebära ett ja eller nej från regeringen till Öresundsbroprojektet Detta löfte framförde statsminister Carl Bildt under onsdagen i ett antal varianter Samtidigt skärptes tonen mellan honom och miljöminister Olof Johansson och stämningen tydde på annalkande regeringskris De båda har under den långa broprocessen undvikit att uttala sig kritiskt om varandra och därmed trappa upp motsättningarna Men nu menar Bildt att centern lämnar sned information utåt Johansson och planminister Görel Thurdin anser å andra sidan att regeringen bara kan säga nej till bron om man tar riktig hänsyn till underlaget för miljöprövningen …

[English: BILDT PROMISES AN ANSWER AND A GOVERNMENT CRISIS LOOMS. There will be an answer on Thursday, but it need not mean a yes or a no from the government to the Öresund bridge project. Prime Minister Carl Bildt delivered this promise on Wednesday in a number of variants. At the same time the tone between him and Environment Minister Olof Johansson sharpened, and the mood suggested an approaching government crisis. During the long bridge process the two have avoided criticizing each other and thereby escalating the conflict. But now Bildt says the Centre Party is giving the public skewed information, while Johansson and Planning Minister Görel Thurdin hold that the government can only say no to the bridge if proper account is taken of the material for the environmental review …]

Page 35: Latent Semantic Indexing and Beyond


Doc    Basic model    Phrases added
       Score  Rank    Score  Rank
2126   1.000  1       1.000  1
2127   .996   2       .999   2
2128   .848   5       .677   3
3767   .849   3       .534   7
211    .805   8       .526   8
156    .844   6       .525   9
215    .805   9       .522   10
2602   .848   4       .492   12
2367   .804   10      .434   19
2360   .838   7       .402   23
3481   .527   53      .673   4
1567   .456   73      .601   5
1371   .456   73      .601   5

Page 36: Latent Semantic Indexing and Beyond


Documents with better ranking in the basic model

2602  .848  4  .492  12
BRON KAN BLI VALFRÅGA SÄGER JOHANSSON Om det lutar åt ett ja i regeringen av politiska skäl då är naturligtvis den här frågan en viktig valfråga …
[English: THE BRIDGE MAY BECOME AN ELECTION ISSUE, SAYS JOHANSSON. If the government leans towards a yes for political reasons, then this question naturally becomes an important election issue …]

2367  .804  10  .434  19
INTE EN KRITISK RÖST BLAND CENTERPARTISTERNA TILL BROBESKEDET En etappseger för miljön och centern En eloge till Olof Johansson Görel Thurdin och Carl Bildt …
[English: NOT ONE CRITICAL VOICE AMONG THE CENTRE PARTY MEMBERS ON THE BRIDGE DECISION. A stage victory for the environment and the Centre Party. Praise to Olof Johansson, Görel Thurdin and Carl Bildt …]

Page 37: Latent Semantic Indexing and Beyond


Documents with better ranking in the phrase model

1567  .456  73  .601  5
ALF SVENSSON TOPPNAMN I STOCKHOLM Kds-ledaren Alf Svensson toppar kds riksdagslista för Stockholms stad och Michael Stjernström sakkunnig i statsrådsberedningen har en valbar andra plats …
[English: ALF SVENSSON TOP NAME IN STOCKHOLM. KDS leader Alf Svensson tops the KDS parliamentary list for the city of Stockholm, and Michael Stjernström, adviser in the Prime Minister's Office, has an electable second place …]

1371  .456  74  .601  6
BENGT WESTERBERG BARNPORREN MÅSTE STOPPAS Folkpartiledaren Bengt Westerberg lovade på onsdagen att regeringen ska göra allt för att stoppa barnporren …
[English: BENGT WESTERBERG: CHILD PORNOGRAPHY MUST BE STOPPED. Liberal Party leader Bengt Westerberg promised on Wednesday that the government will do everything to stop child pornography …]

Page 38: Latent Semantic Indexing and Beyond


Hmm, adding n-grams was maybe too simple...

• If the bad result is due to overtraining, it could help to remove the words the phrases are built from – but maybe not all of them

• Another way to try is to use a dependency parser to find more meaningful phrases, not just n-grams

Page 39: Latent Semantic Indexing and Beyond


The interpretation of similarities

• I haven't tried to solve this problem at all yet, but one idea is to:
  • Calculate vector spaces for various dimensionalities and context widths
  • Check whether the different settings find different kinds of relations

• With a data source like WordNet this could be done in a systematic way

Page 40: Latent Semantic Indexing and Beyond


How to select the number of dimensions

• Susan T Dumais 1995: "In previous experiments we found that performance improves as the number of dimensions is increased up to 200 or 300 dimensions, and decreases slowly after that to the level observed for the standard vector method (Dumais, 1991)."

• Jason I Hong 2000: "There does not seem to be a general consensus for an optimal number of dimensions; instead, the size of the concept space must be determined based on the specific collection of documents used."

• Thomas K Landauer 1997: "Near maximum performance of 45-53%, corrected for guessing, was obtained over a fairly broad region around 300 dimensions"

• Leif 2003: "We should try to do experiments similar to Dumais's and Landauer's, but relate the optimal dimensionality to measures like the number of documents, terms, nonzero elements, etc., because these could give us a formula that does not rely on hand-tagged data sets"

Page 41: Latent Semantic Indexing and Beyond


Performance for the SVD

• Dumais 1995: "The SVD takes only about 2 minutes on a Sparc10 for a 2k x 5k matrix, but this time increases to about 18-20 hours for a 60k x 80k matrix."

• Hong 2000: "The SVD algorithm is O(N^2 k^3), where N is the number of terms plus documents, and k is the number of dimensions in the concept space"; "However, if the collection is stable, SVD will only need to be performed once, which may be an acceptable cost."

• Leif: So even if a good computer today is 100 times faster than Dumais's in 1995, with data sets 20 times bigger and an optimized SVD implementation instead of a research prototype, it should still take around 20 hours.

Page 42: Latent Semantic Indexing and Beyond


What I still have to do something about

• Find a better LSI/SVD package than the one I have (old C code from 1990), or maybe write one myself...

• Get the phrases into the model in some way

When these things are done I could:

• Try to interpret various relations from similarities in a vector space model

• Try to solve the “number of optimal dimensions”-problem

• Explore what the length of the vectors means