Göteborg 26. Jan -04
Evaluation of Vector Space Models Obtained by Latent Semantic Indexing
Leif Grönqvist (leifg@ling.gu.se)
Växjö University (Mathematics and Systems Engineering)
GSLT (Graduate School of Language Technology)
Göteborg University (Department of Linguistics)
Outline of the talk
Vector space models in IR (short reminder since last seminar)
  The traditional model
  Latent semantic indexing (LSI)
  Singular value decomposition (SVD)
Evaluation
  Why
  How & Data sources
The traditional vector model
One dimension for each index term
A document is a vector in a very high dimensional space
The similarity between a document and a query is the cosine:

$$sim(d_j, q) = \frac{\vec{d_j} \cdot \vec{q}}{|\vec{d_j}| \, |\vec{q}|}$$

Gives us a degree of similarity instead of yes/no as for basic keyword search
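As a minimal sketch of the cosine measure on this slide, the function below compares a document vector and a query vector that share the same index terms. The three-term weights are hypothetical, and returning 0 for an all-zero vector is a convention chosen here, not something stated in the talk.

```python
import math

def cosine(d, q):
    """Cosine of the angle between two equal-length term-weight vectors."""
    dot = sum(dj * qj for dj, qj in zip(d, q))
    norm_d = math.sqrt(sum(dj * dj for dj in d))
    norm_q = math.sqrt(sum(qj * qj for qj in q))
    if norm_d == 0.0 or norm_q == 0.0:
        return 0.0  # convention chosen for this sketch: empty vectors match nothing
    return dot / (norm_d * norm_q)

# Hypothetical weights over three index terms.
doc = [2.0, 1.0, 0.0]
query = [1.0, 0.0, 0.0]
score = cosine(doc, query)  # a degree of similarity between 0 and 1
```

Because the score is graded, documents can be ranked by it instead of being split into matching and non-matching sets.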
The traditional vector model, cont.
Assumption used: all terms are unrelated
Could be fixed partially using different weights for each term
Still, we have a lot more dimensions than we want
How should we decide the index terms?
Similarity between terms is always 0
Very similar documents may have sim ≈ 0 if they:
  use a different vocabulary
  don’t use the index terms
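The vocabulary problem can be illustrated with a small bag-of-words sketch. The sentences, the stop list, and the helper names `bag_of_terms` and `cosine_sparse` are all hypothetical; the point is only that two documents about the same event share no index terms and therefore get a cosine of zero.

```python
import math
from collections import Counter

STOP_WORDS = {"the", "was"}  # hypothetical stop list for this example

def bag_of_terms(text):
    """Count index terms in a text, skipping stop words."""
    return Counter(t for t in text.lower().split() if t not in STOP_WORDS)

def cosine_sparse(a, b):
    """Cosine between two sparse term-count mappings."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

doc_a = bag_of_terms("the car was repaired")
doc_b = bag_of_terms("the automobile was fixed")  # same meaning, other words
doc_c = bag_of_terms("the car was fixed")         # shares one term with doc_a

sim_ab = cosine_sparse(doc_a, doc_b)  # no shared terms
sim_ac = cosine_sparse(doc_a, doc_c)  # one shared term
```

In the traditional model nothing links "car" to "automobile", so `sim_ab` stays at zero no matter how the terms are weighted.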
Latent semantic indexing (LSI)
Similar to factor analysis
Number of dimensions can be chosen as we like
We make some kind of projection from a vector space with all terms to the smaller dimensionality
Each dimension is a mix of terms
Impossible to know the meaning of the dimension
LSI, cont.
Distance between vectors is cosine just as before
Meaningful to calculate distance between all terms and/or documents
How can we do the projection? There are several ways:
  Singular value decomposition (SVD)
  Random indexing
  Neural nets, factor analysis, etc.
Why SVD?
I prefer SVD since:
  Michael W Berry 1992: “… This important result indicates that A_k is the best k-rank approximation (in a least squares sense) to the matrix A.”
  Leif 2003: What Berry says is that SVD gives the best projection from n to k dimensions, that is, the projection that keeps distances in the best possible way.
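The result quoted from Berry is commonly stated as the Eckart-Young theorem. As a sketch, assuming $\sigma_1 \ge \sigma_2 \ge \dots$ are the singular values of $A$ and $A_k$ keeps only the $k$ largest of them (the symbols $B$, $\sigma_i$ and the Frobenius norm $\|\cdot\|_F$ are notation introduced here, not taken from the slide):

$$\|A - A_k\|_F = \min_{\operatorname{rank}(B) \le k} \|A - B\|_F = \sqrt{\sum_{i > k} \sigma_i^2}$$

The theorem is a statement about reconstruction error of the matrix as a whole; reading it as "keeps distances in the best possible way" is an interpretation of that least squares property.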
A small example input to SVD
What SVD gives us
X=T0S0D0: X, T0, S0, D0 are matrices
Using the SVD
The matrices make it easy to project term and document vectors into an m-dimensional space (m ≤ min(terms, docs)) using ordinary linear algebra
We can select m easily just by using as many rows/columns of T0, S0, D0 as we want
It is possible to calculate a new (approximated) X, which will still be a t × d matrix
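The steps on this slide can be sketched with NumPy. The 5 × 4 count matrix is hypothetical, and scaling the term and document vectors by the singular values is one common convention rather than something the slides specify. Note that `numpy.linalg.svd` returns the right factor already transposed, so it is named `D0t` here.

```python
import numpy as np

# Hypothetical term-by-document count matrix X (t = 5 terms, d = 4 documents).
X = np.array([
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
], dtype=float)

# Full decomposition: X = T0 @ diag(s0) @ D0t, singular values in descending order.
T0, s0, D0t = np.linalg.svd(X, full_matrices=False)

m = 2  # number of dimensions to keep, m <= min(t, d)
T = T0[:, :m]        # first m columns of T0
S = np.diag(s0[:m])  # m largest singular values
Dt = D0t[:m, :]      # first m rows of the transposed document factor

X_hat = T @ S @ Dt     # approximated X, still t x d
term_vecs = T @ S      # one m-dimensional row per term
doc_vecs = (S @ Dt).T  # one m-dimensional row per document

# Frobenius-norm error of the approximation.
err = np.linalg.norm(X - X_hat)
```

Changing `m` only changes how many columns and rows are sliced out; the decomposition itself does not need to be recomputed.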
Some applications
Automatic generation of a domain specific thesaurus
Keyword extraction from documents
Find sets of similar documents in a collection
Find documents related to a given document or a set of terms
An example based on 50 000 newspaper articles
| Query: stefan edberg | Cosine | Query: bengt johansson | Cosine |
|---|---|---|---|
| edberg | 0.918 | johansson | 0.852 |
| cincinnatis | 0.887 | johanssons | 0.704 |
| edbergs | 0.883 | bengt | 0.678 |
| världsfemman | 0.883 | centerledare | 0.674 |
| stefans | 0.883 | miljöcentern | 0.667 |
| tennisspelarna | 0.863 | landsbygdscentern | 0.667 |
| stefan | 0.861 | implikationer | 0.645 |
| turneringsseger | 0.859 | ickesocialistisk | 0.643 |
| queensturneringen | 0.858 | centerledaren | 0.627 |
| växjöspelaren | 0.852 | regeringsalternativet | 0.620 |
| grästurnering | 0.847 | vagare | 0.616 |
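Lists like these can be produced by ranking all term vectors by their cosine with a query vector. The sketch below assumes a multiword query is represented as the sum of its term vectors, which the slides do not state, and uses hypothetical two-dimensional vectors in place of real LSI output.

```python
import math

def cosine(u, v):
    """Cosine between two equal-length dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def nearest_terms(query_terms, vectors, n=3):
    """Rank all terms by cosine with the summed query vector (an assumption)."""
    dims = len(next(iter(vectors.values())))
    query = [sum(vectors[t][i] for t in query_terms) for i in range(dims)]
    scored = [(cosine(query, vec), term) for term, vec in vectors.items()]
    scored.sort(reverse=True)  # highest cosine first
    return [(term, score) for score, term in scored[:n]]

# Hypothetical reduced term vectors; real ones would come from the SVD.
vectors = {
    "edberg": [0.9, 0.1],
    "stefan": [0.7, 0.4],
    "tennis": [0.8, 0.2],
    "johansson": [0.1, 0.9],
}
ranked = nearest_terms(["stefan", "edberg"], vectors, n=4)
```

With real vectors the same ranking step would run over the whole vocabulary, so an index or a precomputed normalisation would likely be needed for speed.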
Evaluation
We need evaluation metrics to be able to improve the model!
How can we evaluate millions of vectors?
  “similar terms have vectors with high cosine”
  What is similar?
Seems impossible to evaluate the model objectively…
Possible solution: look at specific applications!
  They may be much easier to evaluate
Applications using the model
Vector models may be evaluated using:
  A typical IR test suite of queries, documents, and relevance information
  Texts with lists of manually selected keywords (multiword units included)
  Selected terms in a thesaurus (with multiword units)
  The Test of English as a Foreign Language (TOEFL), which tests the ability to select synonyms from a set of alternatives
Still subjective, but the more the vector model improves these applications the better it is!
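A synonym test of the TOEFL kind could be scored roughly as follows: for each item, pick the alternative whose vector has the highest cosine with the stem word, then report the share of correct picks. The test items, the vectors, and the function names are hypothetical illustrations, not data from the talk.

```python
import math

def cosine(u, v):
    """Cosine between two equal-length dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def pick_synonym(stem, alternatives, vectors):
    """Return the alternative closest to the stem word in the vector space."""
    return max(alternatives, key=lambda alt: cosine(vectors[stem], vectors[alt]))

def synonym_accuracy(items, vectors):
    """Fraction of items where the closest alternative is the keyed answer."""
    correct = sum(
        pick_synonym(stem, alts, vectors) == answer
        for stem, alts, answer in items
    )
    return correct / len(items)

# Hypothetical term vectors and test items (stem, alternatives, correct answer).
vectors = {
    "big": [0.9, 0.1], "large": [0.8, 0.2],
    "fast": [0.1, 0.9], "quick": [0.2, 0.8],
    "red": [0.5, 0.5],
}
items = [
    ("big", ["large", "fast", "red"], "large"),
    ("fast", ["big", "quick", "red"], "quick"),
]
accuracy = synonym_accuracy(items, vectors)
```

The resulting accuracy gives one number per model, which makes it possible to compare, for example, different choices of dimensionality.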
Let’s look in detail at the first application
An IR testbed
There are such testbeds for English, but Swedish has other problems
Very different from English:
  Compounds without spaces
  “New” letters (åäö)
  Complex morphology
  Other stop words
  …
A new Swedish test collection
A group in Borås is building it:
  Per Ahlgren
  Johan Eklund
  Leif Grönqvist
It will contain:
  Documents
  Topics
  Relevance judgments
Document collection
Newspaper articles from GP and HD
Same year as the TT data in CLEF
161 000 articles, 40 million tokens
Good to have more than one newspaper:
  Same content, different author (not always)
10% of my newspaper article collection
Copyright is a problem
Topics
Borrowed from CLEF
52 of 90 topics, but not the most difficult
Examples (translated from Swedish):
  Films by the Kaurismäki brothers (Filmer av bröderna Kaurismäki)
    Description: Search for information about films directed by either of the two brothers Aki and Mika Kaurismäki.
    Narrative: Relevant documents name one or more titles of films directed by Aki or Mika Kaurismäki.
  Finland's first EU commissioner (Finlands första EU-kommissionär)
    Description: Who was appointed as the first EU commissioner for Finland in the European Union?
    Narrative: State the name of Finland's first EU commissioner. Relevant documents may also mention the subject areas of the new commissioner's mandate.
Relevance judgments
Only a subset for each topic
Selected by earlier experiments
Similar approach to TREC and CLEF
100 documents for 5 strategies: 100 ≤ N ≤ 500
Important to include relevant and irrelevant documents
A scale of relevance proposed by Sormunen:
  Irrelevant (0)
  Marginally relevant (1)
  Fairly relevant (2)
  Highly relevant (3)
Manually annotated
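One reading of the selection step is TREC-style pooling: take the top 100 documents from each of 5 retrieval strategies and judge the union, so the number of documents to judge per topic lies between 100 (all strategies agree) and 500 (no overlap). The sketch below follows that reading; the function name and the generated document ids are hypothetical.

```python
def build_pool(ranked_lists, depth=100):
    """Union of the top `depth` documents from each strategy's ranked list."""
    pool = set()
    for ranking in ranked_lists:
        pool.update(ranking[:depth])
    return pool

# Five hypothetical strategies whose rankings partly overlap:
# strategy i returns doc ids starting at i * 50.
rankings = [[f"doc{j + i * 50}" for j in range(200)] for i in range(5)]
pool = build_pool(rankings, depth=100)
pool_size = len(pool)  # between depth and depth * number of strategies
```

Only documents in the pool get a manual relevance grade; everything outside it stays unjudged.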
Relevance definitions
| Id | Tag | Description |
|---|---|---|
| 0 | Irrelevant | The document does not contain any information about the topic. |
| 1 | Marginally relevant | The document only points to the topic. It does not contain any other information, with respect to the topic, than the description of the topic. |
| 2 | Fairly relevant | The document contains more information than the description of the topic but the presentation is not exhaustive. In the case of a topic with several aspects, only some of the aspects are covered by the document. |
| 3 | Highly relevant | The document discusses all of the themes of the topic. In the case of a topic with several aspects, all or most of the aspects are covered by the document. |
Statistics
Some difficult topics got very few relevant documents
Statistics per relevance category
Evaluation metrics
Recall and precision are problematic:
  Ranked lists: how much better is position 1 than positions 5 and 10? How long should the lists be?
  Relevance scale: how much better is “highly relevant” than “fairly relevant”?
  What about the unknown documents that have not been judged?
Idea: different user types need different evaluation metrics
Too many unknown documents lead to a need for more manual judgments…
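The choices listed here show up directly as parameters when precision and recall are computed on a ranked list. The sketch below is one possible setup, not the metric used in the talk: `k` is the list length, `min_grade` is the threshold on the 0 to 3 relevance scale, and unjudged documents are counted as non-relevant and reported separately. All document ids and grades are hypothetical.

```python
def precision_recall_at_k(ranking, judgments, k, min_grade=1):
    """Precision and recall over the top k, plus the number of unjudged documents.

    judgments maps document id -> grade (0..3); ids missing from it are unjudged
    and are treated as non-relevant here (an assumption of this sketch).
    """
    if k <= 0:
        raise ValueError("k must be positive")
    relevant = {doc for doc, grade in judgments.items() if grade >= min_grade}
    top = ranking[:k]
    hits = sum(1 for doc in top if doc in relevant)
    unjudged = sum(1 for doc in top if doc not in judgments)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall, unjudged

# Hypothetical ranked list and graded judgments for one topic.
ranking = ["d1", "d2", "d3", "d4", "d5"]
judgments = {"d1": 3, "d2": 0, "d3": 1, "d5": 2, "d9": 3}

liberal = precision_recall_at_k(ranking, judgments, k=5, min_grade=1)
strict = precision_recall_at_k(ranking, judgments, k=5, min_grade=3)
```

The same ranking scores quite differently under the liberal and the strict threshold, which is one way to model different user types; a high unjudged count signals that more manual judgments may be needed.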
The End!
Thank you for listening
???