
Ghislain Fourny

Information Retrieval
12. Wrap-Up

Picture copyright: johan2011/123RF Stock Photo

2

Introduction
Boolean queries
Term vocabulary and posting lists
Tolerant retrieval
Evaluation
Scale up
Index compression
Vector space model
Probabilistic information retrieval
Language models
Indexing the Web

Lecture Overview

Basics of Information Retrieval

Advanced topics

Alternate methodologies

3

Data Shapes: Text

(Two paragraphs of Lorem ipsum placeholder text, illustrating unstructured text as a data shape.)

4

Boolean retrieval

lawyer AND Penang AND NOT silver

Input: set of documents

Output: subset of documents

query

5

Document

Documents

6

Term

Sherlock, lawyer, Switzerland, Unterwalden nid dem Wald, ETH Zürich, person, watch, run, paper, book, ...

7

Boolean retrieval

lawyer AND Penang AND NOT silver

Input: set of documents

Output: subset of documents

query

8

Model and abstraction

Document as a list of words (with duplicates)

Simplification

Document as a set of words

Document as a vector of booleans

(0 1 0 1 0 1 1 1 0 0 1 1 0 1 0 1 0 1 0 1 0)

9

Incidence Matrix

[Figure: term-document incidence matrix; rows are the terms s, t, u, v, w, x, y, columns are the documents 1-10, and cell (t, d) is 1 iff term t occurs in document d.]

10

Warm up

Term  Posting list   # Postings
a     1 2 3 5 6 8    6
b     3 4 7 8 9      5
c     1 2 4 5 7      5
d     1 3 5 8 9      5
e     2 3 4 7        4
f     1 2 4 5 8 9    6
g     3 5 7 8        4

11

Intersection algorithm

List A: 1 2 4 5 8 9 10 12

List B: 1 3 4 6 7 8 11 12

Intersection of A and B: 1 4 8 12
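A minimal sketch of this two-pointer merge in Python (an illustration, not the lecture's exact pseudocode), using the two lists above:

```python
def intersect(a, b):
    """Two-pointer intersection of sorted posting lists, O(|a| + |b|)."""
    i, j, result = 0, 0, []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])   # posting in both lists
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1                # advance the pointer of the smaller head
        else:
            j += 1
    return result

print(intersect([1, 2, 4, 5, 8, 9, 10, 12],
                [1, 3, 4, 6, 7, 8, 11, 12]))  # [1, 4, 8, 12]
```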

12

Index construction

Collect documents

Tokenizing

Linguistic preprocessing

Build the index (postings list)
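A toy end-to-end sketch of these four steps; the tokenizer and stop-word list are simplified assumptions, not the course's exact preprocessing:

```python
from collections import defaultdict

STOP_WORDS = {"a", "an", "and", "the", "to", "of"}  # simplified assumption

def build_index(documents):
    """documents: dict docID -> text. Returns term -> sorted posting list."""
    index = defaultdict(set)
    for doc_id, text in sorted(documents.items()):     # collect documents
        tokens = text.lower().split()                  # tokenizing (naive)
        terms = [t for t in tokens if t not in STOP_WORDS]  # preprocessing (naive)
        for term in terms:
            index[term].add(doc_id)                    # build the postings
    return {term: sorted(ids) for term, ids in index.items()}

index = build_index({1: "Help ETH Zurich", 2: "ETH and the lawyer"})
print(index["eth"])  # [1, 2]
```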

13

Type

You come most carefully upon your hour

thine, betime, Laertes, hour, thy, fair, Take

My hour is almost come

Possess it merely

That it should come to this

Type = equivalence class (tokens with the same character sequence)

14

Stop words

a, an, and, are, as, at, be, by, for, from, has, he, in

is, it, its, of, on, that, the, to, was, were, will, with

15

Query expansion

Upon indexing: the postings of Lift are also stored under Elevator (and vice versa), so a query for either term finds both.

Upon querying: the query Lift is expanded to Lift OR Elevator, and the two posting lists are unioned.

[Figure: example posting lists for Lift and Elevator (docIDs 1, 5, 6, 41) before and after expansion.]

16

Porter Stemmer

https://tartarus.org/martin/PorterStemmer/

(m>0) ENCI -> ENCE        valenci -> valence
(m>0) ANCI -> ANCE        hesitanci -> hesitance
(m>0) IZER -> IZE         digitizer -> digitize
(m>0) ABLI -> ABLE        conformabli -> conformable
(m>0) ALLI -> AL          radicalli -> radical
(m>0) ENTLI -> ENT        differentli -> different
(m>0) ELI -> E            vileli -> vile
(m>0) OUSLI -> OUS        analogousli -> analogous
(m>0) IZATION -> IZE      vietnamization -> vietnamize
(m>0) ATION -> ATE        predication -> predicate
(m>0) ATOR -> ATE         operator -> operate
(m>0) ALISM -> AL         feudalism -> feudal
(m>0) IVENESS -> IVE      decisiveness -> decisive
(m>0) FULNESS -> FUL      hopefulness -> hopeful
(m>0) OUSNESS -> OUS      callousness -> callous
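A sketch of how a few of these rules apply, with a rough approximation of Porter's measure m (the count of vowel-consonant sequences in the stem); the treatment of y is simplified here:

```python
import re

def measure(stem):
    """Porter's m: number of vowel-consonant sequences in the stem.
    (The Porter rule that y after a consonant counts as a vowel is ignored.)"""
    pattern = re.sub(r"[aeiou]+", "V", stem)
    pattern = re.sub(r"[^V]+", "C", pattern)
    return pattern.count("VC")

# a few of the rules listed above: (suffix, replacement), applied only if m > 0;
# longer suffixes come first so e.g. -ization wins over -ation
RULES = [("enci", "ence"), ("anci", "ance"), ("izer", "ize"),
         ("alli", "al"), ("entli", "ent"), ("ousli", "ous"),
         ("ization", "ize"), ("ation", "ate"), ("ator", "ate")]

def apply_rules(word):
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            if measure(stem) > 0:
                return stem + replacement
    return word

print(apply_rules("valenci"))         # valence
print(apply_rules("vietnamization"))  # vietnamize
```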

17

Skip lists

[Figure: posting list 1 2 3 ... 16 with skip pointers laid over it.]

In practice: for a posting list of P postings, use about sqrt(P) evenly spaced skip pointers.
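A sketch of intersection with skip pointers, simulated on a plain Python list with a pointer every sqrt(P) positions; the demo lists are the ones from the earlier intersection example:

```python
import math

def intersect_with_skips(a, b):
    """Intersect sorted lists; list a carries skip pointers every sqrt(|a|) slots."""
    skip = int(math.sqrt(len(a))) or 1
    i, j, result = 0, 0, []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            # follow the skip pointer only from positions that carry one,
            # and only if the jump does not overshoot b[j]
            if i % skip == 0 and i + skip < len(a) and a[i + skip] <= b[j]:
                i += skip
            else:
                i += 1
        else:
            j += 1
    return result

print(intersect_with_skips([1, 2, 4, 5, 8, 9, 10, 12],
                           [1, 3, 4, 6, 7, 8, 11, 12]))  # [1, 4, 8, 12]
```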

18

Bi-word indices (Phrase search feature)

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future.

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to
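A sketch of generating the bi-word vocabulary from a text (tokenization kept naive):

```python
def biwords(text):
    """Consecutive token pairs: the entries of a bi-word index."""
    tokens = text.split()
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]

print(biwords("Help ETH Zurich to flexibly react")[:3])
# ['Help ETH', 'ETH Zurich', 'Zurich to']
```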

19

Positional index (phrase search feature)

Help      C, 1: 1
ETH       C, 1: 2
Zurich    C, 1: 3
to        C, 3: 4, 7, 11
flexibly  C, 1: 5
react     C, 1: 6

(per term: document ID, term frequency, then the list of positions)

Query: "ETH Zurich"

20

Search structures

Hash tables Trees (B, B+)

21

B+-tree

[Figure: B+-tree over the vocabulary almost, be, carefully, come, fair, hour, is, it, Laertes, merely, most, my, possess, should, take, that, thine, this, thy, time, to, upon, you, your; the root holds the keys come, is, merely, that, thy, upon.]

Nodes hold between 2 and 4 entries, but it's fine if the root has fewer.

22

Wildcard queries

foo*eth*bar (multiple wildcards)

23

Permuterm index

plant

$plant

t$plan

nt$pla

ant$pl

lant$p

plant$

Rotations
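A sketch of both sides of the permuterm idea: generating the rotations above for the dictionary, and rotating a single-* wildcard query so the wildcard lands at the end (the query pl*t is an invented example):

```python
def rotations(term):
    """All rotations of term + '$': the permuterm vocabulary entries."""
    augmented = term + "$"
    return [augmented[i:] + augmented[:i] for i in range(len(augmented))]

print(rotations("plant"))
# ['plant$', 'lant$p', 'ant$pl', 'nt$pla', 't$plan', '$plant']

def wildcard_key(query):
    """Rotate a single-* wildcard query so the * ends up last; the result is
    then used as a prefix-search key against the permuterm index."""
    before, after = query.split("*")
    return after + "$" + before   # X*Y matches terms whose rotation starts with Y$X

print(wildcard_key("pl*t"))  # 't$pl'
```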

24

k-grams

computer

1-grams: $, c, o, m, p, u, t, e, r, $
2-grams: $c, co, om, mp, pu, ut, te, er, r$
3-grams: $co, com, omp, mpu, put, ute, ter, er$
4-grams: $com, comp, ompu, mput, pute, uter, ter$
5-grams: $comp, compu, omput, mpute, puter, uter$
6-grams: $compu, comput, ompute, mputer, puter$
7-grams: $comput, compute, omputer, mputer$
...

Very small k: not very useful. Very large k: not space efficient. The usable zone lies in between (small k such as 2 or 3).

25

Edit distance

Distance between "cat" and "ate" (dynamic-programming table):

    #  a  t  e
#   0  1  2  3
c   1  1  2  3
a   2  1  2  3
t   3  2  1  2

The bottom-right cell gives the distance: 2.
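The table is the classic dynamic-programming computation; a sketch that reproduces it:

```python
def edit_distance(s, t):
    """Levenshtein distance (insert/delete/substitute, each cost 1)."""
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                 # first column: delete everything
    for j in range(n + 1):
        d[0][j] = j                 # first row: insert everything
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[m][n]

print(edit_distance("cat", "ate"))  # 2, the bottom-right cell of the table
```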

26

Jaccard coefficient

3-grams of computer: $co, com, omp, mpu, put, ute, ter, er$
3-grams of cmputer:  $cm, cmp, mpu, put, ute, ter, er$

J = |intersection| / |union| = 5 / 10 = 0.5
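A sketch combining the two previous slides, computing the Jaccard coefficient of the 3-gram sets (the misspelling cmputer is inferred from the grams shown above):

```python
def kgrams(term, k=3):
    """Set of k-grams of term, with $ marking beginning and end."""
    padded = "$" + term + "$"
    return {padded[i:i + k] for i in range(len(padded) - k + 1)}

def jaccard(a, b):
    """|A ∩ B| / |A ∪ B| for two sets."""
    return len(a & b) / len(a | b)

print(jaccard(kgrams("computer"), kgrams("cmputer")))  # 5 / 10 = 0.5
```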

27

Soundex algorithm

Change...          To...
A E H I O U W Y    0
B F P V            1
C G J K Q S X Z    2
D T                3
L                  4
M N                5
R                  6
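A sketch of the textbook Soundex variant built on this table; details such as the treatment of H and W differ between Soundex variants, so this follows the simple version:

```python
CODE = {ch: d for letters, d in
        [("AEIOUHWY", "0"), ("BFPV", "1"), ("CGJKQSXZ", "2"),
         ("DT", "3"), ("L", "4"), ("MN", "5"), ("R", "6")]
        for ch in letters}

def soundex(name):
    """Keep the first letter, code the rest, collapse adjacent duplicate
    digits, drop zeros, pad/trim to 4 characters."""
    name = name.upper()
    digits = [CODE[c] for c in name[1:] if c in CODE]
    collapsed = []
    for d in digits:
        if not collapsed or d != collapsed[-1]:
            collapsed.append(d)
    body = "".join(d for d in collapsed if d != "0")
    return (name[0] + body + "000")[:4]

print(soundex("Herman"))  # H655
print(soundex("Robert"))  # R163
```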

28

Memory hierarchy

Volatile:
Cache (CPU), levels 1 and 2
Memory (RAM)

Non-volatile:
Disk (secondary storage)
Tapes, DVDs (tertiary storage)

29

TermIDs

[Figure: termIDs t1-t7, each with a posting list (t1: 1 2 3, t2: 3 4 7, t3: 1 2 4, t4: 1 3 5, t5: 2 3 4, t6: 1 2 4, t7: 3 5 7); several such partial blocks, each covering t1-t7, are later merged termID by termID.]

30

Blocked Sort-Based Indexing

31

Single-Pass In-Memory Indexing

32

MapReduce

[Figure: the map phase emits (term, docID) pairs for terms ETH, computer, data, CPU, information from documents 1 and 2; the pairs are grouped by term and reduced into posting lists for ETH, computer, information, ...]
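A toy in-process sketch of this pipeline; the grouping dictionary stands in for the shuffle step that a real MapReduce framework performs between the two phases:

```python
from collections import defaultdict

def map_phase(doc_id, text):
    """Mapper: emit a (term, docID) pair for every token."""
    return [(token.lower(), doc_id) for token in text.split()]

def reduce_phase(term, doc_ids):
    """Reducer: turn the grouped docIDs into a sorted, deduplicated posting list."""
    return term, sorted(set(doc_ids))

docs = {1: "ETH computer data", 2: "computer CPU information ETH"}

# shuffle step: group the mappers' output by term (the framework's job in real MapReduce)
grouped = defaultdict(list)
for doc_id, text in docs.items():
    for term, d in map_phase(doc_id, text):
        grouped[term].append(d)

index = dict(reduce_phase(t, ids) for t, ids in grouped.items())
print(index["eth"])  # [1, 2]
```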

33

Logarithmic Merging

[Figure: the in-memory index Z0 (n postings) is merged with I0 into Z1 (2n postings), which in turn merges with I1 into I2 (4n postings); each level doubles in size.]

34

Heaps' law

$M = k\,T^{b}$ with $b \approx 0.5$ (i.e. $M \approx k\sqrt{T}$) and $30 \le k \le 100$

M = number of terms (vocabulary size), T = number of tokens
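A quick worked instance of the formula, with assumed in-range constants:

```latex
% Heaps' law with assumed constants k = 44, b = 0.5:
M = k\,T^{b} = 44 \cdot (10^{6})^{0.5} = 44 \cdot 1000 = 44\,000
\quad \text{terms for } T = 10^{6} \text{ tokens}
```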

35

[Plot: vocabulary size M versus number of tokens T (axis ticks up to 60,000,000), following the Heaps' law curve.]

Zipf's law

$\mathrm{Frequency} = \frac{k}{\mathrm{Rank}}$ (the i-th most frequent term occurs with frequency proportional to 1/i)
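A worked instance (the top frequency 60,000 is an invented number):

```latex
% Zipf's law f(r) = k / r, with an assumed top frequency k = 60\,000:
f(1) = 60\,000, \quad
f(2) = \tfrac{60\,000}{2} = 30\,000, \quad
f(3) = \tfrac{60\,000}{3} = 20\,000
```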

36

Compression: Front coding

[Figure: dictionary-as-a-string with per-term length bytes (6, 5, 5, 5, 4, 6, 4) and 4-byte term pointers.]

Front coding: 8automat*a 8○e 9○ic 10○ion

(automata, automate, automatic, automation: the common prefix automat is written once, marked by *; each further entry stores its length and only its suffix, ○ standing for the shared prefix, which costs fewer bytes)

Term pointers are kept only for every k-th term, so a lookup binary-searches to a block and scans at most k entries.

Variable byte encoding

Each unit (a nibble in this example) carries 3 payload bits plus 1 continuation bit: the high bit is 1 when another nibble follows.

decimal 0-7 (fit on 3 bits, one nibble): 0000, 0001, 0010, 0011, 0100, 0101, 0110, 0111
decimal 8-23 (fit on 6 bits, two nibbles): 1001 0000, 1001 0001, 1001 0010, ..., 1010 0111
decimal 64 (three nibbles): 1001 1000 0000

Encoding the common small gaps in one unit instead of two takes 50% less space.
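The slide illustrates the idea with 4-bit units; a sketch of the byte-level variant used in practice (7 payload bits per byte, high bit meaning "another byte follows", matching the slide's continuation convention):

```python
def vb_encode(n):
    """Variable byte code: 7 payload bits per byte; high bit = 1 if more follows."""
    chunks = [n & 0x7F]          # last byte, continuation bit clear
    n >>= 7
    while n > 0:
        chunks.append((n & 0x7F) | 0x80)  # earlier bytes, continuation bit set
        n >>= 7
    return bytes(reversed(chunks))

def vb_decode(data):
    """Decode a stream of VB-coded numbers (e.g. gaps of a posting list)."""
    numbers, n = [], 0
    for byte in data:
        n = (n << 7) | (byte & 0x7F)
        if not byte & 0x80:      # continuation bit clear: number complete
            numbers.append(n)
            n = 0
    return numbers

gaps = [5, 130, 8]
encoded = b"".join(vb_encode(g) for g in gaps)
print(vb_decode(encoded))  # [5, 130, 8]
```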

38

Gamma encoding

19 in binary: 10011
offset (binary without the leading 1): 0011
length of the offset in unary: 11110

Gamma code of 19: 11110 0011
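A sketch of gamma encoding and decoding; the example reproduces the code for 19:

```python
def gamma_encode(n):
    """Elias gamma code: unary length of the offset, then the offset
    (binary representation of n without its leading 1). Requires n >= 1."""
    binary = bin(n)[2:]                 # 19 -> '10011'
    offset = binary[1:]                 # drop the leading 1 -> '0011'
    length = "1" * len(offset) + "0"    # unary: '11110'
    return length + offset

def gamma_decode(code):
    """Inverse: read the unary length, then that many offset bits."""
    length = code.index("0")            # number of offset bits
    offset = code[length + 1 : length + 1 + length]
    return int("1" + offset, 2) if length else 1

print(gamma_encode(19))          # '111100011'
print(gamma_decode("111100011")) # 19
```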

39

Ranked retrieval

lawyer Penang silver

Input: set of documents

Output: ranked subset of documents (the returned documents carry ranks 1, 2, 3, 4)

query

40

Parametric search

[Search form: Title: "Algorithms"; Author; Publication Date; Language; Country; Cost: $ ... to $ ...; Search button]

41

Parametric indices

Title, Author, Publication Date, Language, Country, Cost: one search structure and one set of posting lists per field.

42

Term frequency, Inverse document frequency

idf:     foo 5, bar 10, foobar 3

tf:      A: foo 5, bar 0, foobar 2    B: foo 1, bar 4, foobar 1

tf-idf:  A: foo 25, bar 0, foobar 6   B: foo 5, bar 40, foobar 3
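A sketch reproducing the table's arithmetic; note the idf values (5, 10, 3) are taken as given here rather than derived from log(N/df):

```python
idf = {"foo": 5, "bar": 10, "foobar": 3}
tf = {"A": {"foo": 5, "bar": 0, "foobar": 2},
      "B": {"foo": 1, "bar": 4, "foobar": 1}}

# tf-idf weight of term t in document d: tf(t, d) * idf(t)
tf_idf = {doc: {t: freq * idf[t] for t, freq in freqs.items()}
          for doc, freqs in tf.items()}
print(tf_idf)
# {'A': {'foo': 25, 'bar': 0, 'foobar': 6}, 'B': {'foo': 5, 'bar': 40, 'foobar': 3}}
```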

43

Model and abstraction

Document as a list of words (with duplicates)

Simplification

Document as a bag of words

Document as a vector of numbers

(0.1 1.2 0.15 0.34 2.4 23.5 ... 0.13)

44

Vector-Space Model

[Figure: documents d1, ..., d5 drawn as vectors.]

Documents = vectors in the first quadrant of R^M

45

Queries as vectors

[Figure: documents d1, ..., d5 and queries q1, q2 in the same space.]

Queries = points in the first quadrant of R^M

d3 is a good result for q2!

46

Inner product as score

$\vec{x} \cdot \vec{y} = \sum_{i=1}^{M} x_i y_i$

47

Evidence accumulation

[Figure: posting lists for the query terms ETH, computer, data; each posting carries tf_{t,d}; idf_t and the document norms ||d|| are stored alongside; scores are accumulated document by document.]

$\mathrm{score}(q,d) = \sum_{t} \frac{\mathrm{tf}_{t,q} \cdot \mathrm{idf}_t \cdot \mathrm{tf}_{t,d} \cdot \mathrm{idf}_t}{\|q\| \cdot \|d\|}$
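A hedged sketch of term-at-a-time evidence accumulation implementing this formula; the toy postings, idf values, and norms below are invented for illustration:

```python
import math
from collections import defaultdict

# toy index: term -> {doc: tf}, plus idf per term (illustrative values)
postings = {"eth": {1: 2, 3: 1}, "computer": {1: 1, 2: 3}}
idf = {"eth": 1.5, "computer": 1.0}
doc_norm = {1: 2.9, 2: 3.1, 3: 1.4}   # precomputed ||d|| values (assumed)

def score(query_tf):
    """Accumulate sum_t (tf_tq * idf_t) * (tf_td * idf_t), normalized by ||q||*||d||."""
    scores = defaultdict(float)
    q_norm = math.sqrt(sum((tf * idf[t]) ** 2 for t, tf in query_tf.items()))
    for t, tf_q in query_tf.items():        # one posting list at a time
        w_q = tf_q * idf[t]
        for doc, tf_d in postings[t].items():
            scores[doc] += w_q * tf_d * idf[t]
    return {d: s / (q_norm * doc_norm[d]) for d, s in scores.items()}

print(sorted(score({"eth": 1, "computer": 1}).items(),
             key=lambda kv: -kv[1]))  # documents ranked by cosine score
```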

48

SMART notation

atc.lnb

The block after the dot gives the query weights:
l = sublinear term frequency
n = natural document frequency
b = byte-size normalization

49

Probabilistic Information Retrieval

Sort the documents d, e, f, g by:

$P(R = 1 \mid D = d \wedge Q = q)$
$P(R = 1 \mid D = e \wedge Q = q)$
$P(R = 1 \mid D = f \wedge Q = q)$
$P(R = 1 \mid D = g \wedge Q = q)$

50

... falling back to Ranked Retrieval and evidence accumulation!

$\mathrm{RSV}_d = \sum_{t \,:\, d_t = 1 \wedge q_t = 1} \log \frac{N}{\mathrm{df}_t}$

This justifies idf weighting in the Vector-Space Model!

51

Language models

Enter a query q.

Thought experiment. Imagine that:
• we picked a random document and built its model
• we used this model to generate a new document
• that document turns out to be q

What document is the most likely to have been picked and to have generated q?
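A sketch of the query-likelihood answer to that question: estimate a unigram model per document and rank by P(q | d). The Jelinek-Mercer smoothing against the collection model is an added assumption, so unseen words do not zero out the product:

```python
from collections import Counter

docs = {1: "eth computer data data", 2: "computer cpu information"}

models = {d: Counter(text.split()) for d, text in docs.items()}
collection = Counter(w for text in docs.values() for w in text.split())
c_total = sum(collection.values())

def query_likelihood(query, lam=0.5):
    """Rank documents by P(q | M_d) with Jelinek-Mercer smoothing (assumption)."""
    scores = {}
    for d, model in models.items():
        d_total = sum(model.values())
        p = 1.0
        for w in query.split():
            p *= lam * model[w] / d_total + (1 - lam) * collection[w] / c_total
        scores[d] = p
    return sorted(scores.items(), key=lambda kv: -kv[1])

print(query_likelihood("computer data"))  # document 1 ranks first
```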

52

Results

[Figure: 2x2 contingency of returned results versus relevant / not relevant documents.]

Precision = #(relevant results returned) / #(results returned)

53

Results

[Figure: positives and negatives among the relevant documents.]

Recall = #(relevant results returned) / #(relevant documents)

54

Specificity

Specificity = #(not-relevant documents not returned) / #(not-relevant documents)

55

F measure: harmonic mean

$F_\alpha = \frac{1}{\frac{\alpha}{P} + \frac{1 - \alpha}{R}}$

Weighting: $\alpha = 1$ weighs precision only; $\alpha = 0$ weighs recall only.
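A sketch computing precision, recall, specificity, and F_alpha from returned and relevant sets; the example sets are invented, and non-empty returned and relevant sets are assumed:

```python
def evaluate(returned, relevant, universe, alpha=0.5):
    tp = len(returned & relevant)                    # relevant results returned
    precision = tp / len(returned)
    recall = tp / len(relevant)
    not_relevant = universe - relevant
    specificity = len(not_relevant - returned) / len(not_relevant)
    f_alpha = 1 / (alpha / precision + (1 - alpha) / recall)
    return precision, recall, specificity, f_alpha

universe = set(range(1, 11))
print(evaluate(returned={1, 2, 3, 4}, relevant={2, 4, 6, 8, 10}, universe=universe))
# precision 0.5, recall 0.4, specificity 0.6, F_0.5 ≈ 0.444
```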

56

Precision-Recall curves

[Plot: precision (y-axis) against recall (x-axis).]

57

ROC Curves

[Plot: recall (sensitivity) on the y-axis against 1 - specificity on the x-axis.]
