
Ghislain Fourny

Information Retrieval
12. Wrap-Up

Picture copyright: johan2011/123RF Stock Photo

2

Introduction
Boolean queries
Term vocabulary and posting lists
Tolerant retrieval
Evaluation
Scale up
Index compression
Vector space model
Probabilistic information retrieval
Language models
Indexing the Web

Lecture Overview

Basics of Information Retrieval

Advanced topics

Alternate methodologies

3

Data Shapes: Text

(Two paragraphs of Lorem ipsum placeholder text, illustrating unstructured text as a data shape.)

4

Boolean retrieval

lawyer AND Penang AND NOT silver

Input: set of documents

Output: subset of documents

query

5

Document

Documents

6

Term

Sherlock, lawyer, Switzerland, Unterwalden nid dem Wald, ETH Zürich, person, watch, run, paper, book, ...

7

Boolean retrieval

lawyer AND Penang AND NOT silver

Input: set of documents

Output: subset of documents

query

8

Model and abstraction

Document as a list of words (with duplicates)

Simplification

Document as a set of words

Document as a vector of booleans

(0 1 0 1 0 1 1 1 0 0 1 1 0 1 0 1 0 1 0 1 0)

9

Incidence Matrix

[Figure: term-document incidence matrix; rows are the terms s, t, u, v, w, x, y, columns are the documents 1-10, and cell (t, d) is 1 iff term t occurs in document d.]

10

Warm up

Term  Posting list   # Postings
a     1 2 3 5 6 8    6
b     3 4 7 8 9      5
c     1 2 4 5 7      5
d     1 3 5 8 9      5
e     2 3 4 7        4
f     1 2 4 5 8 9    6
g     3 5 7 8        4

11

Intersection algorithm

List A: 1 2 4 5 8 9 10 12

List B: 1 3 4 6 7 8 11 12

Intersection of A and B: 1 4 8 12
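A minimal sketch of this two-pointer merge in Python (an illustration, not the lecture's exact pseudocode), using the two lists above:

```python
def intersect(a, b):
    """Two-pointer intersection of sorted posting lists, O(|a| + |b|)."""
    i, j, result = 0, 0, []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])   # posting in both lists
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1                # advance the pointer of the smaller head
        else:
            j += 1
    return result

print(intersect([1, 2, 4, 5, 8, 9, 10, 12],
                [1, 3, 4, 6, 7, 8, 11, 12]))  # [1, 4, 8, 12]
```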

12

Index construction

Collect documents

Tokenizing

Linguistic preprocessing

Build the index (postings list)
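A toy end-to-end sketch of these four steps; the tokenizer and stop-word list are simplified assumptions, not the course's exact preprocessing:

```python
from collections import defaultdict

STOP_WORDS = {"a", "an", "and", "the", "to", "of"}  # simplified assumption

def build_index(documents):
    """documents: dict docID -> text. Returns term -> sorted posting list."""
    index = defaultdict(set)
    for doc_id, text in sorted(documents.items()):     # collect documents
        tokens = text.lower().split()                  # tokenizing (naive)
        terms = [t for t in tokens if t not in STOP_WORDS]  # preprocessing (naive)
        for term in terms:
            index[term].add(doc_id)                    # build the postings
    return {term: sorted(ids) for term, ids in index.items()}

index = build_index({1: "Help ETH Zurich", 2: "ETH and the lawyer"})
print(index["eth"])  # [1, 2]
```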

13

Type

You come most carefully upon your hour

thine, betime, Laertes, hour, thy, fair, Take

My hour is almost come

Possess it merely

That it should come to this

Type = equivalence class (tokens with the same character sequence)

14

Stop words

a, an, and, are, as, at, be, by, for, from, has, he, in

is, it, its, of, on, that, the, to, was, were, will, with

15

Query expansion

Upon indexing: the postings of Lift are also stored under Elevator (and vice versa), so a query for either term finds both.

Upon querying: the query Lift is expanded to Lift OR Elevator, and the two posting lists are unioned.

[Figure: example posting lists for Lift and Elevator (docIDs 1, 5, 6, 41) before and after expansion.]

16

Porter Stemmer

https://tartarus.org/martin/PorterStemmer/

(m>0) ENCI -> ENCE        valenci -> valence
(m>0) ANCI -> ANCE        hesitanci -> hesitance
(m>0) IZER -> IZE         digitizer -> digitize
(m>0) ABLI -> ABLE        conformabli -> conformable
(m>0) ALLI -> AL          radicalli -> radical
(m>0) ENTLI -> ENT        differentli -> different
(m>0) ELI -> E            vileli -> vile
(m>0) OUSLI -> OUS        analogousli -> analogous
(m>0) IZATION -> IZE      vietnamization -> vietnamize
(m>0) ATION -> ATE        predication -> predicate
(m>0) ATOR -> ATE         operator -> operate
(m>0) ALISM -> AL         feudalism -> feudal
(m>0) IVENESS -> IVE      decisiveness -> decisive
(m>0) FULNESS -> FUL      hopefulness -> hopeful
(m>0) OUSNESS -> OUS      callousness -> callous
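A sketch of how a few of these rules apply, with a rough approximation of Porter's measure m (the count of vowel-consonant sequences in the stem); the treatment of y is simplified here:

```python
import re

def measure(stem):
    """Porter's m: number of vowel-consonant sequences in the stem.
    (The Porter rule that y after a consonant counts as a vowel is ignored.)"""
    pattern = re.sub(r"[aeiou]+", "V", stem)
    pattern = re.sub(r"[^V]+", "C", pattern)
    return pattern.count("VC")

# a few of the rules listed above: (suffix, replacement), applied only if m > 0;
# longer suffixes come first so e.g. -ization wins over -ation
RULES = [("enci", "ence"), ("anci", "ance"), ("izer", "ize"),
         ("alli", "al"), ("entli", "ent"), ("ousli", "ous"),
         ("ization", "ize"), ("ation", "ate"), ("ator", "ate")]

def apply_rules(word):
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            if measure(stem) > 0:
                return stem + replacement
    return word

print(apply_rules("valenci"))         # valence
print(apply_rules("vietnamization"))  # vietnamize
```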

17

Skip lists

[Figure: posting list 1 2 3 ... 16 with skip pointers laid over it.]

In practice: for a posting list of P postings, use about sqrt(P) evenly spaced skip pointers.
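A sketch of intersection with skip pointers, simulated on a plain Python list with a pointer every sqrt(P) positions; the demo lists are the ones from the earlier intersection example:

```python
import math

def intersect_with_skips(a, b):
    """Intersect sorted lists; list a carries skip pointers every sqrt(|a|) slots."""
    skip = int(math.sqrt(len(a))) or 1
    i, j, result = 0, 0, []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            # follow the skip pointer only from positions that carry one,
            # and only if the jump does not overshoot b[j]
            if i % skip == 0 and i + skip < len(a) and a[i + skip] <= b[j]:
                i += skip
            else:
                i += 1
        else:
            j += 1
    return result

print(intersect_with_skips([1, 2, 4, 5, 8, 9, 10, 12],
                           [1, 3, 4, 6, 7, 8, 11, 12]))  # [1, 4, 8, 12]
```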

18

Bi-word indices (Phrase search feature)

Help ETH Zurich to flexibly react to new challenges and to set new accents in the future.

Index

Help ETH

ETH Zurich

Zurich to

to flexibly

flexibly react

react to
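A sketch of generating the bi-word vocabulary from a text (tokenization kept naive):

```python
def biwords(text):
    """Consecutive token pairs: the entries of a bi-word index."""
    tokens = text.split()
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]

print(biwords("Help ETH Zurich to flexibly react")[:3])
# ['Help ETH', 'ETH Zurich', 'Zurich to']
```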

19

Positional index (phrase search feature)

Help      C, 1: 1
ETH       C, 1: 2
Zurich    C, 1: 3
to        C, 3: 4, 7, 11
flexibly  C, 1: 5
react     C, 1: 6

(per term: document ID, term frequency, then the list of positions)

Query: "ETH Zurich"

20

Search structures

Hash tables Trees (B, B+)

21

B+-tree

[Figure: B+-tree over the vocabulary almost, be, carefully, come, fair, hour, is, it, Laertes, merely, most, my, possess, should, take, that, thine, this, thy, time, to, upon, you, your; the root holds the keys come, is, merely, that, thy, upon.]

Nodes hold between 2 and 4 entries, but it's fine if the root has fewer.

22

Wildcard queries

foo*eth*bar (multiple wildcards)

23

Permuterm index

plant

$plant

t$plan

nt$pla

ant$pl

lant$p

plant$

Rotations
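A sketch of both sides of the permuterm idea: generating the rotations above for the dictionary, and rotating a single-* wildcard query so the wildcard lands at the end (the query pl*t is an invented example):

```python
def rotations(term):
    """All rotations of term + '$': the permuterm vocabulary entries."""
    augmented = term + "$"
    return [augmented[i:] + augmented[:i] for i in range(len(augmented))]

print(rotations("plant"))
# ['plant$', 'lant$p', 'ant$pl', 'nt$pla', 't$plan', '$plant']

def wildcard_key(query):
    """Rotate a single-* wildcard query so the * ends up last; the result is
    then used as a prefix-search key against the permuterm index."""
    before, after = query.split("*")
    return after + "$" + before   # X*Y matches terms whose rotation starts with Y$X

print(wildcard_key("pl*t"))  # 't$pl'
```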

24

k-grams

computer

1-grams: $, c, o, m, p, u, t, e, r, $
2-grams: $c, co, om, mp, pu, ut, te, er, r$
3-grams: $co, com, omp, mpu, put, ute, ter, er$
4-grams: $com, comp, ompu, mput, pute, uter, ter$
5-grams: $comp, compu, omput, mpute, puter, uter$
6-grams: $compu, comput, ompute, mputer, puter$
7-grams: $comput, compute, omputer, mputer$
...

Very small k: not very useful. Very large k: not space efficient. The usable zone lies in between (small k such as 2 or 3).

25

Edit distance

Distance between "cat" and "ate" (dynamic-programming table):

    #  a  t  e
#   0  1  2  3
c   1  1  2  3
a   2  1  2  3
t   3  2  1  2

The bottom-right cell gives the distance: 2.
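The table is the classic dynamic-programming computation; a sketch that reproduces it:

```python
def edit_distance(s, t):
    """Levenshtein distance (insert/delete/substitute, each cost 1)."""
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                 # first column: delete everything
    for j in range(n + 1):
        d[0][j] = j                 # first row: insert everything
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[m][n]

print(edit_distance("cat", "ate"))  # 2, the bottom-right cell of the table
```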

26

Jaccard coefficient

3-grams of computer: $co, com, omp, mpu, put, ute, ter, er$
3-grams of cmputer:  $cm, cmp, mpu, put, ute, ter, er$

J = |intersection| / |union| = 5 / 10 = 0.5
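A sketch combining the two previous slides, computing the Jaccard coefficient of the 3-gram sets (the misspelling cmputer is inferred from the grams shown above):

```python
def kgrams(term, k=3):
    """Set of k-grams of term, with $ marking beginning and end."""
    padded = "$" + term + "$"
    return {padded[i:i + k] for i in range(len(padded) - k + 1)}

def jaccard(a, b):
    """|A ∩ B| / |A ∪ B| for two sets."""
    return len(a & b) / len(a | b)

print(jaccard(kgrams("computer"), kgrams("cmputer")))  # 5 / 10 = 0.5
```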

27

Soundex algorithm

Change...          To...
A E H I O U W Y    0
B F P V            1
C G J K Q S X Z    2
D T                3
L                  4
M N                5
R                  6
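A sketch of the textbook Soundex variant built on this table; details such as the treatment of H and W differ between Soundex variants, so this follows the simple version:

```python
CODE = {ch: d for letters, d in
        [("AEIOUHWY", "0"), ("BFPV", "1"), ("CGJKQSXZ", "2"),
         ("DT", "3"), ("L", "4"), ("MN", "5"), ("R", "6")]
        for ch in letters}

def soundex(name):
    """Keep the first letter, code the rest, collapse adjacent duplicate
    digits, drop zeros, pad/trim to 4 characters."""
    name = name.upper()
    digits = [CODE[c] for c in name[1:] if c in CODE]
    collapsed = []
    for d in digits:
        if not collapsed or d != collapsed[-1]:
            collapsed.append(d)
    body = "".join(d for d in collapsed if d != "0")
    return (name[0] + body + "000")[:4]

print(soundex("Herman"))  # H655
print(soundex("Robert"))  # R163
```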

28

Memory hierarchy

Volatile:
Cache (CPU), levels 1 and 2
Memory (RAM)

Non-volatile:
Disk (secondary storage)
Tapes, DVDs (tertiary storage)

29

TermIDs

[Figure: termIDs t1-t7, each with a posting list (t1: 1 2 3, t2: 3 4 7, t3: 1 2 4, t4: 1 3 5, t5: 2 3 4, t6: 1 2 4, t7: 3 5 7); several such partial blocks, each covering t1-t7, are later merged termID by termID.]

30

Blocked Sort-Based Indexing

31

Single-Pass In-Memory Indexing

32

MapReduce

[Figure: the map phase emits (term, docID) pairs for terms ETH, computer, data, CPU, information from documents 1 and 2; the pairs are grouped by term and reduced into posting lists for ETH, computer, information, ...]
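A toy in-process sketch of this pipeline; the grouping dictionary stands in for the shuffle step that a real MapReduce framework performs between the two phases:

```python
from collections import defaultdict

def map_phase(doc_id, text):
    """Mapper: emit a (term, docID) pair for every token."""
    return [(token.lower(), doc_id) for token in text.split()]

def reduce_phase(term, doc_ids):
    """Reducer: turn the grouped docIDs into a sorted, deduplicated posting list."""
    return term, sorted(set(doc_ids))

docs = {1: "ETH computer data", 2: "computer CPU information ETH"}

# shuffle step: group the mappers' output by term (the framework's job in real MapReduce)
grouped = defaultdict(list)
for doc_id, text in docs.items():
    for term, d in map_phase(doc_id, text):
        grouped[term].append(d)

index = dict(reduce_phase(t, ids) for t, ids in grouped.items())
print(index["eth"])  # [1, 2]
```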

33

Logarithmic Merging

[Figure: the in-memory index Z0 (n postings) is merged with I0 into Z1 (2n postings), which in turn merges with I1 into I2 (4n postings); each level doubles in size.]

34

Heaps' law

$M = k\,T^{b}$ with $b \approx 0.5$ (i.e. $M \approx k\sqrt{T}$) and $30 \le k \le 100$

M = number of terms (vocabulary size), T = number of tokens
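A quick worked instance of the formula, with assumed in-range constants:

```latex
% Heaps' law with assumed constants k = 44, b = 0.5:
M = k\,T^{b} = 44 \cdot (10^{6})^{0.5} = 44 \cdot 1000 = 44\,000
\quad \text{terms for } T = 10^{6} \text{ tokens}
```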

35

[Plot: vocabulary size M versus number of tokens T (axis ticks up to 60,000,000), following the Heaps' law curve.]

Zipf's law

$\mathrm{Frequency} = \frac{k}{\mathrm{Rank}}$ (the i-th most frequent term occurs with frequency proportional to 1/i)
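A worked instance (the top frequency 60,000 is an invented number):

```latex
% Zipf's law f(r) = k / r, with an assumed top frequency k = 60\,000:
f(1) = 60\,000, \quad
f(2) = \tfrac{60\,000}{2} = 30\,000, \quad
f(3) = \tfrac{60\,000}{3} = 20\,000
```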

36

Compression: Front coding

[Figure: dictionary-as-a-string with per-term length bytes (6, 5, 5, 5, 4, 6, 4) and 4-byte term pointers.]

Front coding: 8automat*a 8○e 9○ic 10○ion

(automata, automate, automatic, automation: the common prefix automat is written once, marked by *; each further entry stores its length and only its suffix, ○ standing for the shared prefix, which costs fewer bytes)

Term pointers are kept only for every k-th term, so a lookup binary-searches to a block and scans at most k entries.

Variable byte encoding

Each unit (a nibble in this example) carries 3 payload bits plus 1 continuation bit: the high bit is 1 when another nibble follows.

decimal 0-7 (fit on 3 bits, one nibble): 0000, 0001, 0010, 0011, 0100, 0101, 0110, 0111
decimal 8-23 (fit on 6 bits, two nibbles): 1001 0000, 1001 0001, 1001 0010, ..., 1010 0111
decimal 64 (three nibbles): 1001 1000 0000

Encoding the common small gaps in one unit instead of two takes 50% less space.
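The slide illustrates the idea with 4-bit units; a sketch of the byte-level variant used in practice (7 payload bits per byte, high bit meaning "another byte follows", matching the slide's continuation convention):

```python
def vb_encode(n):
    """Variable byte code: 7 payload bits per byte; high bit = 1 if more follows."""
    chunks = [n & 0x7F]          # last byte, continuation bit clear
    n >>= 7
    while n > 0:
        chunks.append((n & 0x7F) | 0x80)  # earlier bytes, continuation bit set
        n >>= 7
    return bytes(reversed(chunks))

def vb_decode(data):
    """Decode a stream of VB-coded numbers (e.g. gaps of a posting list)."""
    numbers, n = [], 0
    for byte in data:
        n = (n << 7) | (byte & 0x7F)
        if not byte & 0x80:      # continuation bit clear: number complete
            numbers.append(n)
            n = 0
    return numbers

gaps = [5, 130, 8]
encoded = b"".join(vb_encode(g) for g in gaps)
print(vb_decode(encoded))  # [5, 130, 8]
```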

38

Gamma encoding

19 in binary: 10011
offset (binary without the leading 1): 0011
length of the offset in unary: 11110

Gamma code of 19: 11110 0011
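A sketch of gamma encoding and decoding; the example reproduces the code for 19:

```python
def gamma_encode(n):
    """Elias gamma code: unary length of the offset, then the offset
    (binary representation of n without its leading 1). Requires n >= 1."""
    binary = bin(n)[2:]                 # 19 -> '10011'
    offset = binary[1:]                 # drop the leading 1 -> '0011'
    length = "1" * len(offset) + "0"    # unary: '11110'
    return length + offset

def gamma_decode(code):
    """Inverse: read the unary length, then that many offset bits."""
    length = code.index("0")            # number of offset bits
    offset = code[length + 1 : length + 1 + length]
    return int("1" + offset, 2) if length else 1

print(gamma_encode(19))          # '111100011'
print(gamma_decode("111100011")) # 19
```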

39

Ranked retrieval

lawyer Penang silver

Input: set of documents

Output: ranked subset of documents (the returned documents carry ranks 1, 2, 3, 4)

query

40

Parametric search

[Search form: Title: "Algorithms"; Author; Publication Date; Language; Country; Cost: $ ... to $ ...; Search button]

41

Parametric indices

Title, Author, Publication Date, Language, Country, Cost: one search structure and one set of posting lists per field.

42

Term frequency, Inverse document frequency

idf:     foo 5, bar 10, foobar 3

tf:      A: foo 5, bar 0, foobar 2    B: foo 1, bar 4, foobar 1

tf-idf:  A: foo 25, bar 0, foobar 6   B: foo 5, bar 40, foobar 3
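A sketch reproducing the table's arithmetic; note the idf values (5, 10, 3) are taken as given here rather than derived from log(N/df):

```python
idf = {"foo": 5, "bar": 10, "foobar": 3}
tf = {"A": {"foo": 5, "bar": 0, "foobar": 2},
      "B": {"foo": 1, "bar": 4, "foobar": 1}}

# tf-idf weight of term t in document d: tf(t, d) * idf(t)
tf_idf = {doc: {t: freq * idf[t] for t, freq in freqs.items()}
          for doc, freqs in tf.items()}
print(tf_idf)
# {'A': {'foo': 25, 'bar': 0, 'foobar': 6}, 'B': {'foo': 5, 'bar': 40, 'foobar': 3}}
```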

43

Model and abstraction

Document as a list of words (with duplicates)

Simplification

Document as a bag of words

Document as a vector of numbers

(0.1 1.2 0.15 0.34 2.4 23.5 ... 0.13)

44

Vector-Space Model

[Figure: documents d1, ..., d5 drawn as vectors.]

Documents = vectors in the first quadrant of R^M

45

Queries as vectors

[Figure: documents d1, ..., d5 and queries q1, q2 in the same space.]

Queries = points in the first quadrant of R^M

d3 is a good result for q2!

46

Inner product as score

$\vec{x} \cdot \vec{y} = \sum_{i=1}^{M} x_i y_i$

47

Evidence accumulation

[Figure: posting lists for the query terms ETH, computer, data; each posting carries tf_{t,d}; idf_t and the document norms ||d|| are stored alongside; scores are accumulated document by document.]

$\mathrm{score}(q,d) = \sum_{t} \frac{\mathrm{tf}_{t,q} \cdot \mathrm{idf}_t \cdot \mathrm{tf}_{t,d} \cdot \mathrm{idf}_t}{\|q\| \cdot \|d\|}$
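A hedged sketch of term-at-a-time evidence accumulation implementing this formula; the toy postings, idf values, and norms below are invented for illustration:

```python
import math
from collections import defaultdict

# toy index: term -> {doc: tf}, plus idf per term (illustrative values)
postings = {"eth": {1: 2, 3: 1}, "computer": {1: 1, 2: 3}}
idf = {"eth": 1.5, "computer": 1.0}
doc_norm = {1: 2.9, 2: 3.1, 3: 1.4}   # precomputed ||d|| values (assumed)

def score(query_tf):
    """Accumulate sum_t (tf_tq * idf_t) * (tf_td * idf_t), normalized by ||q||*||d||."""
    scores = defaultdict(float)
    q_norm = math.sqrt(sum((tf * idf[t]) ** 2 for t, tf in query_tf.items()))
    for t, tf_q in query_tf.items():        # one posting list at a time
        w_q = tf_q * idf[t]
        for doc, tf_d in postings[t].items():
            scores[doc] += w_q * tf_d * idf[t]
    return {d: s / (q_norm * doc_norm[d]) for d, s in scores.items()}

print(sorted(score({"eth": 1, "computer": 1}).items(),
             key=lambda kv: -kv[1]))  # documents ranked by cosine score
```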

48

SMART notation

atc.lnb

The block after the dot gives the query weights:
l = sublinear term frequency
n = natural document frequency
b = byte-size normalization

49

Probabilistic Information Retrieval

Sort the documents d, e, f, g by:

$P(R = 1 \mid D = d \wedge Q = q)$
$P(R = 1 \mid D = e \wedge Q = q)$
$P(R = 1 \mid D = f \wedge Q = q)$
$P(R = 1 \mid D = g \wedge Q = q)$

50

... falling back to Ranked Retrieval and evidence accumulation!

$\mathrm{RSV}_d = \sum_{t \,:\, d_t = 1 \wedge q_t = 1} \log \frac{N}{\mathrm{df}_t}$

This justifies idf weighting in the Vector-Space Model!

51

Language models

Enter a query q.

Thought experiment. Imagine that:
• we picked a random document and built its model
• we used this model to generate a new document
• that document turns out to be q

What document is the most likely to have been picked and to have generated q?
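A sketch of the query-likelihood answer to that question: estimate a unigram model per document and rank by P(q | d). The Jelinek-Mercer smoothing against the collection model is an added assumption, so unseen words do not zero out the product:

```python
from collections import Counter

docs = {1: "eth computer data data", 2: "computer cpu information"}

models = {d: Counter(text.split()) for d, text in docs.items()}
collection = Counter(w for text in docs.values() for w in text.split())
c_total = sum(collection.values())

def query_likelihood(query, lam=0.5):
    """Rank documents by P(q | M_d) with Jelinek-Mercer smoothing (assumption)."""
    scores = {}
    for d, model in models.items():
        d_total = sum(model.values())
        p = 1.0
        for w in query.split():
            p *= lam * model[w] / d_total + (1 - lam) * collection[w] / c_total
        scores[d] = p
    return sorted(scores.items(), key=lambda kv: -kv[1])

print(query_likelihood("computer data"))  # document 1 ranks first
```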

52

Results

[Figure: 2x2 contingency of returned results versus relevant / not relevant documents.]

Precision = #(relevant results returned) / #(results returned)

53

Results

[Figure: positives and negatives among the relevant documents.]

Recall = #(relevant results returned) / #(relevant documents)

54

Specificity

Specificity = #(not-relevant documents not returned) / #(not-relevant documents)

55

F measure: harmonic mean

$F_\alpha = \frac{1}{\frac{\alpha}{P} + \frac{1 - \alpha}{R}}$

Weighting: $\alpha = 1$ weighs precision only; $\alpha = 0$ weighs recall only.
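A sketch computing precision, recall, specificity, and F_alpha from returned and relevant sets; the example sets are invented, and non-empty returned and relevant sets are assumed:

```python
def evaluate(returned, relevant, universe, alpha=0.5):
    tp = len(returned & relevant)                    # relevant results returned
    precision = tp / len(returned)
    recall = tp / len(relevant)
    not_relevant = universe - relevant
    specificity = len(not_relevant - returned) / len(not_relevant)
    f_alpha = 1 / (alpha / precision + (1 - alpha) / recall)
    return precision, recall, specificity, f_alpha

universe = set(range(1, 11))
print(evaluate(returned={1, 2, 3, 4}, relevant={2, 4, 6, 8, 10}, universe=universe))
# precision 0.5, recall 0.4, specificity 0.6, F_0.5 ≈ 0.444
```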

56

Precision-Recall curves

[Plot: precision (y-axis) against recall (x-axis).]

57

ROC Curves

[Plot: recall (sensitivity) on the y-axis against 1 - specificity on the x-axis.]
