
Page 1: Ghislain Fourny Information Retrieval - Systems Group · 2019-07-15 · Term vocabulary and posting lists Tolerant retrieval Evaluation Scale up Index compression Vector space model

Ghislain Fourny

Information Retrieval

12. Wrap-Up

Picture copyright: johan2011/123RF Stock Photo


Lecture Overview

Introduction
Boolean queries
Term vocabulary and posting lists
Tolerant retrieval
Evaluation
Scale up
Index compression
Vector space model
Probabilistic information retrieval
Language models
Indexing the Web

Basics of Information Retrieval

Advanced topics

Alternate methodologies


Data Shapes: Text

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Etiam vel erat nec dui aliquet vulputate sed quis nulla. Donec eget ultricies magna, eu dignissim elit. Nullam sed urna nec nisl rhoncus ullamcorper placerat et enim. Integer varius ornare libero quis consequat. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aenean eu efficitur orci. Aenean ac posuere tellus. Ut id commodo turpis.

Praesent nec libero metus. Praesent at turpis placerat, congue ipsum eget, scelerisque justo. Ut volutpat, massa ac lacinia cursus, nisl dui volutpat arcu, quis interdum sapien turpis in tellus. Suspendisse potenti. Vestibulum pharetra justo massa, ac venenatis mi condimentum nec. Proin viverra tortor non orci suscipit rutrum. Phasellus sit amet euismod diam. Nullam convallis nunc sit amet diam suscipit dapibus. Integer porta hendrerit nunc. Quisque pharetra congue porta. Suspendisse vestibulum sed mi in euismod. Etiam a purus suscipit, accumsan nibh vel, posuere ipsum. Nulla nec tempor nibh, id venenatis lectus. Duis lobortis id urna eget tincidunt.


Boolean retrieval

Query: lawyer AND Penang AND NOT silver

Input: set of documents

Output: subset of documents


Document

Documents


Term

Sherlock
lawyer
Switzerland
Unterwalden nid dem Wald
ETH Zürich
person
watch
run
paper
book
...


Boolean retrieval

Query: lawyer AND Penang AND NOT silver

Input: set of documents

Output: subset of documents
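As a minimal sketch of how such a Boolean query can be evaluated, the snippet below treats each term as the set of documents containing it. The document identifiers and the three postings sets are invented for illustration and are not taken from the lecture.

```python
# Hypothetical postings: for each term, the set of document IDs containing it.
postings = {
    "lawyer": {1, 2, 4},
    "Penang": {2, 4, 5},
    "silver": {4, 6},
}
all_docs = set(range(1, 7))  # assumed collection of six documents

# lawyer AND Penang AND NOT silver
result = postings["lawyer"] & postings["Penang"] & (all_docs - postings["silver"])
print(sorted(result))
```

AND maps to set intersection and NOT to the complement with respect to the whole collection, so the output is always a subset of the input documents.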


Model and abstraction

Document as a list of words (with duplicates)

Simplification

Document as a set of words

Document as a vector of booleans

(0 1 0 1 0 1 1 1 0 0 1 1 0 1 0 1 0 1 0 1 0)


Incidence Matrix

[Figure: incidence matrix with one column per document (1 to 10) and one row per term (t, u, v, w, x, y)]


Warm up

Term   Document frequency   Postings list
a      6                    1 2 3 5 6 8
b      5                    3 4 7 8 9
c      5                    1 2 4 5 7
d      5                    1 3 5 8 9
e      4                    2 3 4 7
f      6                    1 2 4 5 8 9
g      4                    3 5 7 8


Intersection algorithm

List A: 1 2 4 5 8 9 10 12

List B: 1 3 4 6 7 8 11 12

Intersection of A and B: 1 4 8 12
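The intersection of two sorted postings lists can be sketched as a linear merge with two cursors. This is an illustrative version (the function name `intersect` is ours), run on lists A and B from the slide, where it yields 1, 4, 8 and 12.

```python
def intersect(a, b):
    """Intersect two sorted postings lists in O(len(a) + len(b))."""
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])  # document present in both lists
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # advance the cursor that points at the smaller ID
        else:
            j += 1
    return result

list_a = [1, 2, 4, 5, 8, 9, 10, 12]
list_b = [1, 3, 4, 6, 7, 8, 11, 12]
print(intersect(list_a, list_b))
```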


Index construction

Collect documents

Tokenizing

Linguistic preprocessing

Build the index (postings list)
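The four steps above can be sketched end to end on a toy collection. The tokenizer (a simple regular expression), the lowercasing step and the three one-line documents are simplifying assumptions for illustration, not the course's reference implementation.

```python
import re
from collections import defaultdict

# Stop words dropped during linguistic preprocessing (a common short list).
STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "by", "for", "from", "has",
    "he", "in", "is", "it", "its", "of", "on", "that", "the", "to", "was",
    "were", "will", "with",
}

def tokenize(text):
    """Split a document into tokens (assumption: runs of letters)."""
    return re.findall(r"[A-Za-z]+", text)

def normalize(token):
    """Linguistic preprocessing reduced to case folding for this sketch."""
    return token.lower()

def build_index(docs):
    """Map each term to the sorted list of IDs of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            term = normalize(token)
            if term not in STOP_WORDS:
                index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in sorted(index.items())}

# Step 1: collect documents (three toy documents).
docs = {
    1: "You come most carefully upon your hour",
    2: "My hour is almost come",
    3: "That it should come to this",
}
index = build_index(docs)
print(index["come"], index["hour"])
```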


Type

You come most carefully upon your hour

thine
be
time
Laertes
hour
thy
fair
Take

My hour is almost come

Possess it merely That it should come to this

Type = equivalence class of tokens (same character sequence)


Stop words

a, an, and, are, as, at, be, by, for, from, has, he, in,

is, it, its, of, on, that, the, to, was, were, will, with


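Query expansion

Upon indexing: the postings lists of Lift and Elevator are expanded; the query Lift is left as it is.

Upon querying: the query Lift is expanded to Lift OR Elevator; the postings lists are left as they are.

[Figure: postings lists of Lift and Elevator in both variants]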


Porter Stemmer

https://tartarus.org/martin/PorterStemmer/

(m>0) ENCI    -> ENCE    valenci        -> valence
(m>0) ANCI    -> ANCE    hesitanci      -> hesitance
(m>0) IZER    -> IZE     digitizer      -> digitize
(m>0) ABLI    -> ABLE    conformabli    -> conformable
(m>0) ALLI    -> AL      radicalli      -> radical
(m>0) ENTLI   -> ENT     differentli    -> different
(m>0) ELI     -> E       vileli         -> vile
(m>0) OUSLI   -> OUS     analogousli    -> analogous
(m>0) IZATION -> IZE     vietnamization -> vietnamize
(m>0) ATION   -> ATE     predication    -> predicate
(m>0) ATOR    -> ATE     operator       -> operate
(m>0) ALISM   -> AL      feudalism      -> feudal
(m>0) IVENESS -> IVE     decisiveness   -> decisive
(m>0) FULNESS -> FUL     hopefulness    -> hopeful
(m>0) OUSNESS -> OUS     callousness    -> callous
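To make the condition (m>0) concrete, here is a small sketch that applies only the suffix rules listed on this slide. It is not the full Porter stemmer: the measure m counts vowel-to-consonant transitions in the remaining stem, the longest matching suffix wins, and the helper names are ours.

```python
# Only the rules shown on the slide (one step of the Porter stemmer).
RULES = [
    ("enci", "ence"), ("anci", "ance"), ("izer", "ize"), ("abli", "able"),
    ("alli", "al"), ("entli", "ent"), ("eli", "e"), ("ousli", "ous"),
    ("ization", "ize"), ("ation", "ate"), ("ator", "ate"), ("alism", "al"),
    ("iveness", "ive"), ("fulness", "ful"), ("ousness", "ous"),
]

def is_consonant(word, i):
    """A letter is a consonant unless it is a, e, i, o, u, or y after a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """Porter's m: the number of vowel-run followed by consonant-run pairs."""
    m = 0
    prev_is_vowel = False
    for i in range(len(stem)):
        consonant = is_consonant(stem, i)
        if consonant and prev_is_vowel:
            m += 1
        prev_is_vowel = not consonant
    return m

def apply_rules(word):
    """Rewrite the longest matching suffix if the remaining stem has m > 0."""
    for suffix, replacement in sorted(RULES, key=lambda rule: -len(rule[0])):
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            return stem + replacement if measure(stem) > 0 else word
    return word

print(apply_rules("vietnamization"), apply_rules("operator"))
```

Trying the longest suffix first is what sends vietnamization through the IZATION rule rather than the ATION rule.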


Skip lists

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

In practice: $\sqrt{p}$ skip pointers, where p is the number of postings
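A sketch of intersection with skip pointers follows, assuming evenly spaced skips of length about the square root of the list length, stored implicitly at every position that is a multiple of that length. The function name and the second list are illustrative.

```python
import math

def intersect_with_skips(a, b):
    """Intersect sorted lists, following a skip when it cannot miss a match."""
    skip_a = max(1, math.isqrt(len(a)))
    skip_b = max(1, math.isqrt(len(b)))
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            # A skip pointer sits at every multiple of skip_a; take it only
            # if its target does not overshoot the other list's current ID.
            if i % skip_a == 0 and i + skip_a < len(a) and a[i + skip_a] <= b[j]:
                i += skip_a
            else:
                i += 1
        else:
            if j % skip_b == 0 and j + skip_b < len(b) and b[j + skip_b] <= a[i]:
                j += skip_b
            else:
                j += 1
    return result

long_list = list(range(1, 17))   # the 16 postings shown on the slide
short_list = [3, 8, 16, 20]      # hypothetical second list
print(intersect_with_skips(long_list, short_list))
```

With 16 postings the skip length is 4, so the cursor on the long list can jump from 9 straight to 13 while looking for 16.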


Bi-word indices (Phrase search feature)

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future.

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to
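Generating the bi-words of a document is a one-liner once the text is tokenized. The sketch below assumes a simple word tokenizer and reproduces the index entries listed above for the example sentence.

```python
import re

def biwords(text):
    """Return every pair of consecutive words as one index term."""
    words = re.findall(r"\w+", text)
    return [f"{first} {second}" for first, second in zip(words, words[1:])]

sentence = ("Help ETH Zurich to flexibly react to new challenges "
            "and to set new accents in the future.")
print(biwords(sentence)[:6])
```

A phrase query such as "ETH Zurich" then becomes a lookup of a single bi-word term.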


Positional index (phrase search feature)

Help C,1: 1

ETH C,1: 2

Zurich C,1: 3

to C,3: 4, 7, 11

flexibly C,1: 5

react C,1: 6
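Each entry above reads as term, document, frequency, then the positions. A minimal sketch of building such a positional index and answering a phrase query with it is given below; the tokenizer and function names are assumptions, and the document is called C as on the slide.

```python
import re
from collections import defaultdict

def build_positional_index(docs):
    """Map term -> {doc_id: [positions]}, with positions counted from 1."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, token in enumerate(re.findall(r"\w+", text), start=1):
            index[token].setdefault(doc_id, []).append(position)
    return index

def phrase_query(index, phrase):
    """Return the documents in which the words of the phrase are adjacent."""
    words = phrase.split()
    hits = []
    for doc_id, starts in index.get(words[0], {}).items():
        for start in starts:
            # Word number k of the phrase must occur at position start + k.
            if all(start + k in index.get(word, {}).get(doc_id, [])
                   for k, word in enumerate(words)):
                hits.append(doc_id)
                break
    return sorted(hits)

docs = {"C": "Help ETH Zurich to flexibly react to new challenges "
             "and to set new accents in the future."}
index = build_positional_index(docs)
print(index["to"]["C"], phrase_query(index, "ETH Zurich"))
```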

"ETH Zurich"|


Search structures

Hash tables

Trees (B, B+)


B+-tree

Root (2 children): possess

Inner nodes (4 children each): come, is, merely | that, thy, upon

Leaves: almost, be, carefully | come, fair, hour | is, it, Laertes | merely, most, my | possess, should, take | that, thine, this | thy, time, to | upon, you, your

But it's fine if the root has fewer.


Wildcard queries

foo*eth*bar

multiple wildcards


Permuterm index

plant

$plant

t$plan

nt$pla

ant$pl

lant$p

plant$

Rotations
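The rotations can be generated mechanically, as in the sketch below. The second helper shows one common way to use them for a query with a single wildcard (rotate the query so that the wildcard ends up last, then do a prefix lookup); that helper and its name are our illustration rather than something stated on the slide.

```python
def rotations(term):
    """All rotations of term + '$', where '$' marks the end of the term."""
    augmented = term + "$"
    return [augmented[i:] + augmented[:i] for i in range(len(augmented))]

def permuterm_prefix(query):
    """Rotate a query with exactly one '*' so that the '*' comes last."""
    before, after = query.split("*")
    return after + "$" + before

print(rotations("plant"))
print(permuterm_prefix("pl*t"))
```

For example, the query pl*t becomes the prefix t$pl, which matches the rotation t$plan of plant.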


k-grams

computer

1-grams: $, c, o, m, p, u, t, e, r, $
2-grams: $c, co, om, mp, pu, ut, te, er, r$
3-grams: $co, com, omp, mpu, put, ute, ter, er$
4-grams: $com, comp, ompu, mput, pute, uter, ter$
5-grams: $comp, compu, omput, mpute, puter, uter$
6-grams: $compu, comput, ompute, mputer, puter$
7-grams: $comput, compute, omputer, mputer$
...

Small k: not very useful. Large k: not space efficient. In between: usable zone.
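The k-gram lists above follow from padding the term with $ on both sides and sliding a window of width k. A short sketch (the function name is ours):

```python
def kgrams(term, k):
    """All k-grams of the term, padded with '$' at both ends."""
    padded = "$" + term + "$"
    return [padded[i:i + k] for i in range(len(padded) - k + 1)]

for k in (2, 3):
    print(k, kgrams("computer", k))
```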


Edit distance

     #  a  t  e
#    0  1  2  3
c    1  1  2  3
a    2  1  2  3
t    3  2  1  2
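The table can be filled with the usual dynamic program for Levenshtein distance, where insertions, deletions and substitutions each cost 1. The sketch below recomputes the table for cat (rows) against ate (columns).

```python
def edit_distance_table(source, target):
    """Return the full dynamic-programming table of edit distances."""
    rows, cols = len(source) + 1, len(target) + 1
    table = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        table[i][0] = i  # delete i characters
    for j in range(cols):
        table[0][j] = j  # insert j characters
    for i in range(1, rows):
        for j in range(1, cols):
            substitution = 0 if source[i - 1] == target[j - 1] else 1
            table[i][j] = min(
                table[i - 1][j] + 1,                 # deletion
                table[i][j - 1] + 1,                 # insertion
                table[i - 1][j - 1] + substitution,  # match or substitution
            )
    return table

for row in edit_distance_table("cat", "ate"):
    print(row)
```

The bottom-right cell, 2, is the edit distance between the two words.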


Jaccard coefficient

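3-grams only in the first set: $co, com, omp

3-grams in both sets: mpu, put, ute, ter, er$

3-grams only in the second set: $cm, cmp

Jaccard coefficient = 5 / 10 = 0.5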
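The coefficient is the size of the intersection of the two k-gram sets divided by the size of their union. The sketch below assumes the two words behind the figure are computer and the misspelling cmputer, which is what the 3-grams suggest.

```python
def kgrams(term, k=3):
    """Set of k-grams of the term, padded with '$' at both ends."""
    padded = "$" + term + "$"
    return {padded[i:i + k] for i in range(len(padded) - k + 1)}

def jaccard(first, second, k=3):
    """Jaccard coefficient of the k-gram sets of two terms."""
    a, b = kgrams(first, k), kgrams(second, k)
    return len(a & b) / len(a | b)

print(jaccard("computer", "cmputer"))
```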


Soundex algorithm

Change...          To...
A E H I O U W Y    0
B F P V            1
C G J K Q S X Z    2
D T                3
L                  4
M N                5
R                  6
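The table is only the substitution step. The sketch below wraps it in the usual four-step textbook procedure (keep the first letter, substitute the rest, collapse runs of identical digits, drop zeros and pad to four characters); those surrounding steps are an assumption here, and other Soundex variants treat H, W and the first letter's own code differently.

```python
# Letter-to-digit table from the slide.
CODES = {}
for letters, digit in [("AEHIOUWY", "0"), ("BFPV", "1"), ("CGJKQSXZ", "2"),
                       ("DT", "3"), ("L", "4"), ("MN", "5"), ("R", "6")]:
    for letter in letters:
        CODES[letter] = digit

def soundex(name):
    """Four-character code: first letter followed by three digits."""
    letters = [ch for ch in name.upper() if ch in CODES]
    if not letters:
        return ""
    digits = [CODES[ch] for ch in letters[1:]]
    # Collapse each run of identical consecutive digits into one digit.
    collapsed = [d for i, d in enumerate(digits) if i == 0 or d != digits[i - 1]]
    kept = [d for d in collapsed if d != "0"]
    return (letters[0] + "".join(kept) + "000")[:4]

print(soundex("Hermann"), soundex("Herman"))
```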


Memory hierarchy

Volatile:
Cache (CPU), level 1 and 2
Memory (RAM)

Non volatile:
Disk (Secondary storage)
Tapes, DVDs (Tertiary storage)


TermIDs

t1: 1 2 3 ...
t2: 3 4 7 ...
t3: 1 2 4 ...
t4: 1 3 5 ...
t5: 2 3 4 ...
t6: 1 2 4 ...
t7: 3 5 7 ...


Blocked Sort-Based Indexing


Single-Pass In-Memory Indexing


MapReduce

[Figure: MapReduce index construction over the terms ETH, computer, data, CPU and information, with documents 1 and 2]


Logarithmic Merging

n postings: I0, Z0

2n postings: I1, Z1

4n postings: I2


Heaps' law

[Figure: number of terms (M) plotted against number of tokens (T)]

$M = k\sqrt{T}$

$30 \le k \le 100$


Zipf's law

[Figure: term frequency plotted against rank]

$\text{Frequency} = \dfrac{k}{\text{Rank}}$


Compression: Front coding

8automat*a8○e9○ic10○ion

[Figure: dictionary table with the frequencies 6, 5, 5, 5, 4, 6, 4. Annotations: 4 bytes; 4 bytes (fewer bytes); only every k terms]


Variable byte encoding

decimal   binary    variable byte encoding
0         0         0000
1         1         0001
2         10        0010
3         11        0011
4         100       0100
5         101       0101
6         110       0110
7         111       0111
8         1000      1001 0000
9         1001      1001 0001
10        1010      1001 0010
11        1011      1001 0011
12        1100      1001 0100
13        1101      1001 0101
14        1110      1001 0110
15        1111      1001 0111
16        10000     1010 0000
17        10001     1010 0001
18        10010     1010 0010
19        10011     1010 0011
20        10100     1010 0100
21        10101     1010 0101
22        10110     1010 0110
23        10111     1010 0111
...
64        1000000   1001 1000 0000

0 to 7: fits on 3 bits

8 to 23: fits on 6 bits

50% less space
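The codes in the table use 4-bit units: one flag bit followed by 3 payload bits, with the flag set to 1 on every unit except the last. The sketch below reproduces that scheme as read off the table; a byte-based variant works the same way with 7 payload bits per unit, and the function names are ours.

```python
def encode(number):
    """Encode a non-negative integer as space-separated 4-bit units."""
    groups = []
    while True:
        groups.append(number & 0b111)  # lowest 3 payload bits
        number >>= 3
        if number == 0:
            break
    groups.reverse()
    # Flag bit 1 means "more units follow"; the last unit has flag bit 0.
    units = [format(0b1000 | group, "04b") for group in groups[:-1]]
    units.append(format(groups[-1], "04b"))
    return " ".join(units)

def decode(code):
    """Inverse of encode for a single number."""
    number = 0
    for unit in code.split():
        number = (number << 3) | (int(unit, 2) & 0b111)
    return number

for n in (7, 8, 23, 64):
    print(n, encode(n))
```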


Gamma encoding

19

Binary: 10011

Offset (binary without the leading 1): 0011

Length of the offset in unary: 11110

Gamma code (length, then offset): 111100011
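A sketch of gamma encoding and decoding for positive integers, using strings of 0 and 1 for readability (the function names are ours): the offset is the binary representation without its leading 1, preceded by the offset's length in unary.

```python
def gamma_encode(number):
    """Gamma code of a positive integer as a string of bits."""
    if number < 1:
        raise ValueError("gamma codes are defined for positive integers only")
    offset = bin(number)[3:]           # binary digits after the leading 1
    length = "1" * len(offset) + "0"   # unary length, terminated by 0
    return length + offset

def gamma_decode(code):
    """Decode a single gamma code back to its integer."""
    length = code.index("0")           # number of 1s before the first 0
    offset = code[length + 1:length + 1 + length]
    return int("1" + offset, 2)

print(gamma_encode(19), gamma_decode("111100011"))
```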


Ranked retrieval

Query: lawyer Penang silver

Input: set of documents

Output: ranked subset of documents (1, 2, 3, 4)


Parametric search

Title: Algorithms

Author

Publication Date

Language

Country

Cost: $ to $

Search


Parametric indices

Title

Author

Publication Date

Language

Country

Cost

Search structure Posting lists


Term frequency, inverse document frequency

idf
foo      5
bar      10
foobar   3

tf       A    B
foo      5    1
bar      0    4
foobar   2    1

tf-idf   A    B
foo      25   5
bar      0    40
foobar   6    3
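The third table is the cell-by-cell product of the first two. A small sketch that recomputes it from the idf and tf values given on the slide:

```python
idf = {"foo": 5, "bar": 10, "foobar": 3}
tf = {
    "A": {"foo": 5, "bar": 0, "foobar": 2},
    "B": {"foo": 1, "bar": 4, "foobar": 1},
}

# tf-idf weight of a term in a document = term frequency * idf of the term.
tf_idf = {
    doc: {term: count * idf[term] for term, count in counts.items()}
    for doc, counts in tf.items()
}
print(tf_idf)
```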


Model and abstraction

Document as a list of words (with duplicates)

Simplification

Document as a bag of words

Document as a vector of numbers

(0. 1.2 0.15 0.34 2.4 23.5 .43 24.5 0.13)


Vector-Space Model

d1

d2

d3

d4

d5

Documents = vectors in the first quadrant of $\mathbb{R}^M$


Queries as vectors

d1

d2

d3

d4

d5

Queries = points in the first quadrant of $\mathbb{R}^M$

q1

q2

d3 is a good result for q2!


Inner product as score

$\vec{x} \cdot \vec{y} = \sum_{i=1}^{M} x_i y_i$


Evidence accumulation

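[Figure: postings lists for the query terms ETH, computer and data. Each term carries $\mathrm{tf}_{t,q}$ and $\mathrm{idf}_t$, each posting carries $\mathrm{tf}_{t,d}$, each document carries $\|d\|$, and there is one accumulator per document (1 to 7).]

$\dfrac{\mathrm{tf}_{t,q} \times \mathrm{idf}_t \times \mathrm{tf}_{t,d} \times \mathrm{idf}_t}{\|q\| \times \|d\|}$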
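Evidence accumulation can be sketched as a term-at-a-time loop: for every query term, walk its postings list and add that term's contribution to the accumulator of each document, then normalize by the two vector lengths. The data below reuses the hypothetical foo/bar/foobar weights rather than the postings of this figure, and the function names are ours.

```python
import math
from collections import defaultdict

idf = {"foo": 5, "bar": 10, "foobar": 3}
postings = {  # term -> {document: term frequency}
    "foo": {"A": 5, "B": 1},
    "bar": {"B": 4},
    "foobar": {"A": 2, "B": 1},
}

def document_norms(postings, idf):
    """Euclidean length of each document's tf-idf vector."""
    squares = defaultdict(float)
    for term, docs in postings.items():
        for doc_id, tf_td in docs.items():
            squares[doc_id] += (tf_td * idf[term]) ** 2
    return {doc_id: math.sqrt(total) for doc_id, total in squares.items()}

def cosine_scores(query_tf, postings, idf):
    """Rank documents by cosine similarity, one query term at a time."""
    norms = document_norms(postings, idf)
    query_weights = {t: tf * idf[t] for t, tf in query_tf.items() if t in postings}
    query_norm = math.sqrt(sum(w * w for w in query_weights.values()))
    accumulators = defaultdict(float)
    for term, q_weight in query_weights.items():
        for doc_id, tf_td in postings[term].items():
            accumulators[doc_id] += q_weight * tf_td * idf[term]
    scores = {d: s / (query_norm * norms[d]) for d, s in accumulators.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

ranking = cosine_scores({"foo": 1}, postings, idf)
print(ranking)
```

Only documents that appear in at least one postings list of a query term ever get an accumulator, which is what keeps the loop cheap.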


SMART notation

atc.lnb

Query weights

l: Sublinear term frequency

n: Natural document frequency

b: Byte-size normalization


Probabilistic Information Retrieval

Sort by:

$P(R = 1 \mid D = d \wedge Q = q)$

$P(R = 1 \mid D = e \wedge Q = q)$

$P(R = 1 \mid D = f \wedge Q = q)$

$P(R = 1 \mid D = g \wedge Q = q)$


... falling back to Ranked Retrieval and evidence accumulation!

$\mathrm{RSV}_d = \sum_{k \mid d_k = 1 \wedge q_k = 1} \log \dfrac{N}{\mathrm{df}_t}$

This justifies idf weighting in the Vector-Space Model!


Language models

A user enters a query q.

Thought experiment: imagine that
• we picked a random document and built its model,
• we used this model to generate a new document,
• that document turns out to be q.

Which document is most likely to have been picked and to have generated q?
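The thought experiment can be sketched with the simplest possible document model: a unigram model estimated by maximum likelihood, with no smoothing, so any query word missing from a document gives probability zero. That choice of model, the two toy documents and the function name are assumptions for illustration.

```python
from collections import Counter

def query_likelihood(query, document):
    """P(q | model of document) under an unsmoothed unigram model."""
    counts = Counter(document.lower().split())
    total = sum(counts.values())
    probability = 1.0
    for word in query.lower().split():
        probability *= counts[word] / total  # Counter returns 0 if absent
    return probability

docs = {
    "d1": "my hour is almost come",
    "d2": "that it should come to this",
}
scores = {name: query_likelihood("come", text) for name, text in docs.items()}
best = max(scores, key=scores.get)
print(scores, best)
```

Here d1 wins because come makes up one word in five of d1 but only one in six of d2.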


Results

Returned results: Relevant | Not relevant

$\mathrm{Precision} = \dfrac{\#(\text{relevant returned results})}{\#(\text{returned results})}$


Results

Positives | Negatives

Relevant

$\mathrm{Recall} = \dfrac{\#(\text{relevant positives})}{\#(\text{relevant})}$


Specificity

Not relevant

$\mathrm{Specificity} = \dfrac{\#(\text{not relevant negatives})}{\#(\text{not relevant})}$


F measure: harmonic mean

$F_\alpha = \dfrac{1}{\dfrac{\alpha}{P} + \dfrac{1 - \alpha}{R}}$

Weighting: $\alpha = 1$, $\alpha = 0$
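The evaluation measures can be tried out on a made-up confusion matrix. The counts below are hypothetical; tp, fp, fn and tn stand for relevant returned, non-relevant returned, relevant not returned and non-relevant not returned documents.

```python
def precision(tp, fp):
    """Fraction of returned documents that are relevant."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Fraction of relevant documents that are returned."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """Fraction of non-relevant documents that are not returned."""
    return tn / (tn + fp)

def f_measure(p, r, alpha=0.5):
    """Weighted harmonic mean of precision p and recall r."""
    return 1 / (alpha / p + (1 - alpha) / r)

tp, fp, fn, tn = 30, 10, 20, 940  # hypothetical counts
p, r = precision(tp, fp), recall(tp, fn)
print(p, r, specificity(tn, fp), f_measure(p, r))
```

With alpha = 1 the measure reduces to precision and with alpha = 0 to recall; alpha = 0.5 gives the balanced value 2PR / (P + R).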


Precision-Recall curves

[Figure: precision plotted against recall]


ROC Curves

[Figure: recall (sensitivity) plotted against 1 - specificity]