Information Retrieval
Lecture 4

Recap of last time
  Postings pointer storage
  Dictionary storage
  Compression
  Wild-card queries

This lecture
  Query expansion
  What queries can we now process?
  Index construction
  Estimation of resources

Spell correction and expansion

Expansion
  Wild-cards were one way of “expansion”:
    i.e., a single “term” in the query hits many postings.
  There are other forms of expansion:
    Spell correction
    Thesauri
    Soundex (homonyms).

Spell correction
  Expand to terms within (say) edit distance 2
    Edit ∈ {Insert/Delete/Replace}
  Expand at query time from query
    e.g., Alanis Morisette
  Spell correction is expensive and slows the query (up to a factor of 100)
    Invoke only when index returns (near-)zero matches.
  What if docs contain mis-spellings?
    Why not at index time?
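
A minimal sketch of query-time expansion by edit distance, assuming unit cost for each insert, delete and replace. The toy vocabulary and the helper name expand_term are illustrative assumptions, not part of the lecture. Scanning the whole dictionary, as done here, is the naive approach, and it shows why spell correction is expensive.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit-cost insert, delete and replace."""
    prev = list(range(len(b) + 1))          # distance from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                # delete ca
                curr[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),   # replace (free when characters match)
            ))
        prev = curr
    return prev[-1]


def expand_term(term, vocabulary, max_dist=2):
    """Return the dictionary terms within max_dist edits of term (hypothetical helper)."""
    return sorted(t for t in vocabulary if edit_distance(term, t) <= max_dist)


# Toy vocabulary, assumed for illustration only.
vocabulary = {"morissette", "morisette", "moriset", "toronto", "more"}
candidates = expand_term("morisette", vocabulary)
```

Here candidates holds every vocabulary term within two edits of the query term; a real system would restrict the candidate set before computing distances.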

Thesauri
  Thesaurus: language-specific list of synonyms for terms likely to be queried
  Generally hand-crafted
  Machine learning methods can assist - more on this in later lectures.

Soundex
  Class of heuristics to expand a query into phonetic equivalents
    Language specific
    E.g., chebyshev → tchebycheff

How do we use thesauri/soundex?
  Can “expand” query to include equivalences
    Query car tyres → car tyres automobile tires
  Can expand index
    Index docs containing car under automobile, as well

Query expansion
  Usually do query expansion rather than index expansion
    No index blowup
    Query processing slowed down
      Docs frequently contain equivalences
    May retrieve more junk
      puma → jaguar retrieves documents on cars instead of on sneakers.
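
A small sketch of thesaurus-based query expansion, using the car tyres example from the earlier slide. The THESAURUS entries and function names are assumptions made for illustration. Each query term becomes a group of equivalents, and a document matches if it contains at least one term from every group.

```python
# Hypothetical hand-crafted thesaurus: term -> list of equivalents.
THESAURUS = {
    "car": ["automobile"],
    "tyres": ["tires"],
}


def expand_query(terms, thesaurus):
    """Turn each query term into a group: the term plus its equivalents."""
    return [[term] + thesaurus.get(term, []) for term in terms]


def matches(doc_terms, expanded):
    """AND over groups, OR within each group."""
    return all(any(t in doc_terms for t in group) for group in expanded)


expanded = expand_query(["car", "tyres"], THESAURUS)
hit = matches({"automobile", "tires", "sale"}, expanded)
miss = matches({"car", "wash"}, expanded)
```

The index is untouched, so there is no index blowup, but every query now consults more postings lists.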

Language detection
  Many of the components described require language detection
    For docs/paragraphs at indexing time
    For query terms at query time - much harder
  For docs/paragraphs, generally have enough text to apply machine learning methods
  For queries, generally lack sufficient text
    Augment with other cues, such as client properties/specification from application
    Domain of query origination, etc.

What queries can we process?
  We have
    Basic inverted index with skip pointers
    Wild-card index
    Spell-correction
  Queries such as
    an*er* AND (moriset /3 toronto)

Aside - results caching
  If 25% of your users are searching for britney AND spears then you probably do need spelling correction, but you don’t need to keep on intersecting those two postings lists
  Web query distribution is extremely skewed, and you can usefully cache results for common queries - more later.
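
One way to sketch results caching is with functools.lru_cache from the Python standard library. The toy postings in INDEX, the cache size, and the counter intersections_run are assumptions for illustration. The point is only that a repeated query does not trigger a second intersection.

```python
from functools import lru_cache

# Toy postings lists (sorted doc IDs), assumed for illustration.
INDEX = {"britney": [1, 4, 7, 9], "spears": [2, 4, 9, 12]}
intersections_run = 0


def intersect(p1, p2):
    """Merge-style intersection of two sorted postings lists."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out


@lru_cache(maxsize=1024)
def and_query(term1, term2):
    """Answer term1 AND term2; results for repeated queries come from the cache."""
    global intersections_run
    intersections_run += 1
    return tuple(intersect(INDEX.get(term1, []), INDEX.get(term2, [])))


first = and_query("britney", "spears")
second = and_query("britney", "spears")   # served from the cache
```

After both calls, intersections_run is still 1. A production cache would also need invalidation when the index changes, which this sketch ignores.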

Index construction
  Thus far, considered index space
  What about index construction time?
  What strategies can we use with limited main memory?

Somewhat bigger corpus
  Number of docs = n = 40M
  Number of terms = m = 1M
  Use Zipf to estimate number of postings entries:
    n + n/2 + n/3 + ... + n/m ~ n ln m = 560M entries
  No positional info yet
  Check for yourself
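
To check the estimate numerically, the sketch below evaluates both n ln m and the harmonic sum directly for the slide's n and m. The function names are illustrative. Both values land in the same ballpark as the slide's rounded figure of 560M.

```python
import math


def estimate_postings(n_docs, n_terms):
    """Closed-form approximation: n ln m."""
    return n_docs * math.log(n_terms)


def harmonic_postings(n_docs, n_terms):
    """Direct sum: n + n/2 + n/3 + ... + n/m."""
    return n_docs * sum(1.0 / k for k in range(1, n_terms + 1))


n, m = 40_000_000, 1_000_000
approx = estimate_postings(n, m)    # about 553 million
exact = harmonic_postings(n, m)     # about 576 million
```

The two differ by roughly n times Euler's constant (about 0.577 n), which is small relative to the total.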

Recall index construction
  Documents are parsed to extract words and these are saved with the Document ID.

  Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.
  Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious

  Term       Doc #
  I          1
  did        1
  enact      1
  julius     1
  caesar     1
  I          1
  was        1
  killed     1
  i'         1
  the        1
  capitol    1
  brutus     1
  killed     1
  me         1
  so         2
  let        2
  it         2
  be         2
  with       2
  caesar     2
  the        2
  noble      2
  brutus     2
  hath       2
  told       2
  you        2
  caesar     2
  was        2
  ambitious  2

Key step
  After all documents have been parsed the inverted file is sorted by terms
  We focus on this sort step.

  Parsed order          Sorted by term
  Term       Doc #      Term       Doc #
  I          1          ambitious  2
  did        1          be         2
  enact      1          brutus     1
  julius     1          brutus     2
  caesar     1          caesar     1
  I          1          caesar     2
  was        1          caesar     2
  killed     1          capitol    1
  i'         1          did        1
  the        1          enact      1
  capitol    1          hath       2
  brutus     1          I          1
  killed     1          I          1
  me         1          i'         1
  so         2          it         2
  let        2          julius     1
  it         2          killed     1
  be         2          killed     1
  with       2          let        2
  caesar     2          me         1
  the        2          noble      2
  noble      2          so         2
  brutus     2          the        1
  hath       2          the        2
  told       2          told       2
  you        2          was        1
  caesar     2          was        2
  was        2          with       2
  ambitious  2          you        2
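
The parse-then-sort step can be sketched in a few lines for the two example documents. The tokenizer regex and the lowercasing are simplifying assumptions (the slide keeps the capital I), so the output differs from the slide in that detail.

```python
import re

docs = {
    1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
}


def parse(doc_id, text):
    """Emit one (term, doc_id) pair per token; tokens are lowercased (assumption)."""
    return [(tok.lower(), doc_id) for tok in re.findall(r"[A-Za-z']+", text)]


# Parse one document at a time, collecting pairs in parse order.
pairs = [pair for doc_id, text in docs.items() for pair in parse(doc_id, text)]

# The key step: sort by term, then by doc ID within each term.
pairs.sort()
```

Sorting tuples orders by term first and by document ID second, which is exactly the order the postings lists need.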

Index construction
  As we build up the index, cannot exploit compression tricks
    parse docs one at a time.
    The final postings entry for any term is incomplete until the end.
    (actually you can exploit compression, but this becomes a lot more complex)
  At 10-12 bytes per postings entry, demands several temporary gigabytes

System parameters for design
  Disk seek ~ 1 millisecond
  Block transfer from disk ~ 1 microsecond per byte (following a seek)
  All other ops ~ 10 microseconds
    E.g., compare two postings entries and decide their merge order

Bottleneck
  Parse and build postings entries one doc at a time
  Now sort postings entries by term (then by doc within each term)
  Doing this with random disk seeks would be too slow
  If every comparison took 1 disk seek, and n items could be sorted with n log2 n comparisons, how long would this take?
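
A back-of-the-envelope answer, assuming n is the 560M postings entries estimated earlier and each comparison costs one 1 ms seek as in the system parameters:

```python
import math

SEEK_SECONDS = 1e-3            # disk seek ~ 1 millisecond
n = 560_000_000                # postings entries to sort

comparisons = n * math.log2(n)             # about 1.6e10 comparisons
seconds = comparisons * SEEK_SECONDS       # one seek per comparison
days = seconds / 86_400                    # on the order of six months
```

The result is roughly 190 days, which is why the sort has to be organized around sequential block transfers rather than random seeks.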

Sorting with fewer disk seeks
  12-byte (4+4+4) records (term, doc, freq).
  These are generated as we parse docs.
  Must now sort 560M such 12-byte records by term.
  Define a Block = 10M such records
    can “easily” fit a couple into memory.
  Will sort within blocks first, then merge the blocks into one long sorted order.
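
The two phases can be sketched in memory as follows. This is only an illustration of the control flow: a real implementation would read each block from disk and write each run back, and the function names are assumptions. The test data uses 56 tiny blocks so that the layer count mirrors the 56-run case.

```python
def make_sorted_runs(records, block_size):
    """Phase 1: cut the input into blocks and sort each block on its own."""
    return [sorted(records[i:i + block_size])
            for i in range(0, len(records), block_size)]


def merge_two(a, b):
    """Merge two sorted runs into one sorted run."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    out.extend(a[i:])
    out.extend(b[j:])
    return out


def merge_tree(runs):
    """Phase 2: merge runs pairwise, layer by layer; returns (sorted run, layers)."""
    layers = 0
    while len(runs) > 1:
        merged = [merge_two(runs[k], runs[k + 1])
                  for k in range(0, len(runs) - 1, 2)]
        if len(runs) % 2:              # an odd run out is carried to the next layer
            merged.append(runs[-1])
        runs = merged
        layers += 1
    return (runs[0] if runs else []), layers


records = [(i * 37) % 168 for i in range(168)]   # a scrambled 0..167
runs = make_sorted_runs(records, block_size=3)   # 56 runs
result, layers = merge_tree(runs)
```

With 56 runs the pairwise merge goes 56, 28, 14, 7, 4, 2, 1, that is, six layers.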

Sorting 56 blocks of 10M records
  First, read each block and sort within:
    Quicksort takes about 2 x (10M ln 10M) steps
  Exercise: estimate total time to read each block from disk and quicksort it.
  56 times this estimate - gives us 56 sorted runs of 10M records each.
  Need 2 copies of data on disk, throughout.

Merging 56 sorted runs
  Merge tree of log2 56 ~ 6 layers.
  During each layer, read into memory runs in blocks of 10M, merge, write back.
  [Figure: merging sorted runs on disk]

Merge tree
  Sorted runs: 1, 2, ..., 55, 56
  28 runs, 20M/run
  14 runs, 40M/run
  7 runs, 80M/run
  4 runs ... ?
  2 runs ... ?
  1 run ... ?

Merging 56 runs
  Time estimate for disk transfer:
    6 x (56 runs x 120MB x 10^-6 sec) x 2 ~ 22hrs.
      (10^-6 sec per byte is the disk block transfer time.)
  Work out how these transfers are staged, and the total time for merging.
  Why is this an overestimate?
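
The arithmetic behind the 22-hour figure, spelled out. Reading the final factor of 2 as one read pass plus one write pass per layer is an assumption; the slide gives only the number.

```python
LAYERS = 6                        # log2 56 ~ 6 merge layers
RUNS = 56
RUN_BYTES = 10_000_000 * 12       # 10M records x 12 bytes = 120 MB per run
SECONDS_PER_BYTE = 1e-6           # block transfer ~ 1 microsecond per byte
PASSES = 2                        # assumed: read + write

seconds = LAYERS * (RUNS * RUN_BYTES * SECONDS_PER_BYTE) * PASSES
hours = seconds / 3600            # about 22.4 hours
```

Each layer moves the full 6.72 GB of records, so the total depends only on the number of layers, not on how the runs are paired within a layer.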

Exercise - fill in this table

       Step                                            Time
  1    56 initial quicksorts of 10M records each
  2    Read 2 sorted blocks for merging, write back
  3    Merge 2 sorted blocks
  4    Add (2) + (3) = time to read/merge/write
  5    56 times (4) = total merge time                 ?

Large memory indexing
  Suppose instead that we had 16GB of memory for the above indexing task.
  Exercise: how much time to index?
  Repeat with a couple of values of n, m.
  In practice, spidering interlaced with indexing.
    Spidering bottlenecked by WAN speed and many other factors - more on this later.

Improvements on basic merge
  Compressed temporary files
    compress terms in temporary dictionary runs
  How do we merge compressed runs to generate a compressed run?
    Given two γ-encoded runs, merge them into a new γ-encoded run
    To do this, first γ-decode a run into a sequence of gaps, then actual records:
      33, 14, 107, 5 ... → 33, 47, 154, 159
      13, 12, 109, 5 ... → 13, 25, 134, 139

Merging compressed runs
  Now merge:
    13, 25, 33, 47, 134, 139, 154, 159
  Now generate new gap sequence
    13, 12, 8, 14, 87, 5, 15, 5
  Finish by γ-encoding the gap sequence
  But what was the point of all this?
    If we were to uncompress the entire run in memory, we save no memory
    How do we gain anything?

“Zipper” uncompress/decompress
  When merging two runs, bring their γ-encoded versions into memory
  Do NOT uncompress the entire gap sequence at once - only a small segment at a time
    Merge the uncompressed segments
    Compress merged segments again
  [Figure: compressed inputs → uncompressed segments → compressed, merged output]
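
The zipper idea maps naturally onto lazy generators: only one record per run is ever held in decoded form. The sketch below works on gap sequences directly and leaves out the γ bit-level coding itself, which is a deliberate simplification. The function names are illustrative, and heapq.merge is the standard-library lazy merge.

```python
from heapq import merge


def gaps_to_ids(gaps):
    """Lazily turn a gap sequence into the actual record IDs."""
    total = 0
    for gap in gaps:
        total += gap
        yield total


def ids_to_gaps(ids):
    """Lazily turn sorted record IDs back into a gap sequence."""
    prev = 0
    for doc_id in ids:
        yield doc_id - prev
        prev = doc_id


def zipper_merge(gaps_a, gaps_b):
    """Merge two gap-encoded runs without materializing either one."""
    return ids_to_gaps(merge(gaps_to_ids(gaps_a), gaps_to_ids(gaps_b)))


# The two runs from the slides.
merged_gaps = list(zipper_merge([33, 14, 107, 5], [13, 12, 109, 5]))
```

Because every stage is a generator, memory use stays constant regardless of run length; a γ-decoder on the input side and a γ-encoder on the output side would slot in at the two ends.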

Improving on binary merge tree
  Merge more than 2 runs at a time
    Merge k>2 runs at a time for a shallower tree
    maintain heap of candidates from each run
  [Figure: heap of candidates drawn from runs such as 1 5 ..., 2 4 ..., 3 6 ...]
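
A sketch of the k-way merge with a heap holding one candidate per run. The function name and the three toy runs are assumptions for illustration; the Python standard library offers the same behaviour as heapq.merge.

```python
import heapq

_END = object()   # sentinel marking an exhausted run


def k_way_merge(runs):
    """Merge k sorted runs by keeping one candidate per run in a min-heap."""
    heap = []
    for run_id, run in enumerate(runs):
        it = iter(run)
        first = next(it, _END)
        if first is not _END:
            # run_id breaks ties, so the iterators themselves are never compared.
            heap.append((first, run_id, it))
    heapq.heapify(heap)

    while heap:
        value, run_id, it = heapq.heappop(heap)
        yield value
        nxt = next(it, _END)
        if nxt is not _END:
            heapq.heappush(heap, (nxt, run_id, it))


merged = list(k_way_merge([[1, 5], [2, 4], [3, 6]]))
```

Each output record costs O(log k) heap work, and merging all 56 runs at once would replace the six binary layers with a single pass over the data.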

Dynamic indexing
  Docs come in over time
    postings updates for terms already in dictionary
    new terms added to dictionary
  Docs get deleted

Simplest approach
  Maintain “big” main index
  New docs go into “small” auxiliary index
  Search across both, merge results
  Deletions
    Invalidation bit-vector for deleted docs
    Filter docs output on a search result by this invalidation bit-vector
  Periodically, re-index into one main index
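
The main-plus-auxiliary scheme can be sketched as a small class. The class and method names are hypothetical, both indexes are plain dictionaries, and a Python integer stands in for the invalidation bit-vector.

```python
class TwoTierIndex:
    """Sketch: big main index, small auxiliary index, invalidation bit-vector."""

    def __init__(self, main):
        self.main = main      # term -> sorted list of doc IDs
        self.aux = {}         # same shape; receives new docs
        self.deleted = 0      # bit d is set when doc d has been deleted

    def _is_deleted(self, doc_id):
        return (self.deleted >> doc_id) & 1 == 1

    def add(self, doc_id, terms):
        for term in set(terms):
            self.aux.setdefault(term, []).append(doc_id)

    def delete(self, doc_id):
        self.deleted |= 1 << doc_id

    def search(self, term):
        """Search both indexes, merge results, filter out deleted docs."""
        hits = set(self.main.get(term, [])) | set(self.aux.get(term, []))
        return sorted(d for d in hits if not self._is_deleted(d))

    def reindex(self):
        """Fold the auxiliary index into the main one and drop deleted docs."""
        merged = {}
        for index in (self.main, self.aux):
            for term, doc_ids in index.items():
                merged.setdefault(term, set()).update(doc_ids)
        self.main = {}
        for term, doc_ids in merged.items():
            live = sorted(d for d in doc_ids if not self._is_deleted(d))
            if live:
                self.main[term] = live
        self.aux = {}
        self.deleted = 0


idx = TwoTierIndex({"caesar": [1, 2]})
idx.add(3, ["caesar", "brutus"])
before_delete = idx.search("caesar")
idx.delete(2)
after_delete = idx.search("caesar")
idx.reindex()
```

Deletion is cheap because it only sets a bit; the postings themselves are cleaned up at the next periodic re-index.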

More complex approach
  Fully dynamic updates
  Only one index at all times
    No big and small indices
  Active management of a pool of space

Fully dynamic updates
  Inserting a (variable-length) record
    e.g., a typical postings entry
  Maintain a pool of (say) 64KB chunks
  Chunk header maintains metadata on records in chunk, and its free space
  [Figure: chunk layout - Header, Record, Record, Record, Record, Free space]

Global tracking
  In memory, maintain a global record address table that says, for each record, the chunk it’s in.
  Define one chunk to be current.
  Insertion
    if current chunk has enough free space
      extend record and update metadata.
    else look in other chunks for enough space.
    else open new chunk.
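
The insertion rule can be sketched as below. Several details are assumptions the slides do not specify: records are tracked by size only, each record lives in a single chunk, and a newly opened chunk becomes the current one. All class and attribute names are hypothetical.

```python
CHUNK_SIZE = 64 * 1024    # 64KB chunks, as on the slide


class Chunk:
    def __init__(self):
        self.records = {}         # header metadata: record ID -> bytes used
        self.free = CHUNK_SIZE    # header metadata: remaining free space


class ChunkPool:
    def __init__(self):
        self.chunks = [Chunk()]
        self.current = 0          # index of the current chunk
        self.address = {}         # global record address table: record ID -> chunk

    def insert(self, record_id, size):
        """Place a record: current chunk first, then other chunks, else a new chunk."""
        if size > CHUNK_SIZE:
            raise ValueError("record larger than a chunk")
        order = [self.current] + [i for i in range(len(self.chunks))
                                  if i != self.current]
        for i in order:
            if self.chunks[i].free >= size:
                break
        else:
            self.chunks.append(Chunk())        # no chunk had room: open a new one
            i = len(self.chunks) - 1
            self.current = i                   # assumption: new chunk becomes current
        chunk = self.chunks[i]
        chunk.records[record_id] = size        # update chunk header metadata
        chunk.free -= size
        self.address[record_id] = i            # update global address table
        return i


pool = ChunkPool()
placed = [pool.insert("t1", 40_000), pool.insert("t2", 40_000),
          pool.insert("t3", 20_000), pool.insert("t4", 25_000)]
```

In this run the fourth record falls back to the leftover space in the first chunk, which shows the "look in other chunks" branch.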

Changes to dictionary
  New terms appear over time
    cannot use a static perfect hash for dictionary
  OK to use term character string w/pointers from postings as in lecture 2.

Index on disk vs. memory
  Most retrieval systems keep the dictionary in memory and the postings on disk
  Web search engines frequently keep both in memory
    massive memory requirement
    feasible for large web service installations, less so for standard usage where
      query loads are lighter
      users willing to wait 2 seconds for a response
  More on this when discussing deployment models

Distributed indexing
  Suppose we had several machines available to do the indexing
    how do we exploit the parallelism
  Two basic approaches
    stripe by dictionary as index is built up
    stripe by documents
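
The two striping schemes differ only in which field of a (term, doc) pair picks the machine. The sketch below assumes a simple modulo on the doc ID and a CRC32 hash on the term; both choices are illustrative, and a term-range split would serve equally well for striping by dictionary.

```python
import zlib


def stripe_by_document(pairs, n_machines):
    """Each machine indexes a subset of the documents."""
    shards = [[] for _ in range(n_machines)]
    for term, doc_id in pairs:
        shards[doc_id % n_machines].append((term, doc_id))
    return shards


def stripe_by_term(pairs, n_machines):
    """Each machine owns a subset of the dictionary (assumed: hash partitioning)."""
    shards = [[] for _ in range(n_machines)]
    for term, doc_id in pairs:
        machine = zlib.crc32(term.encode("utf-8")) % n_machines
        shards[machine].append((term, doc_id))
    return shards


pairs = [("brutus", 1), ("brutus", 2), ("caesar", 1), ("caesar", 2)]
by_doc = stripe_by_document(pairs, 2)
by_term = stripe_by_term(pairs, 2)
```

Striping by document spreads every term's postings over all machines, while striping by term keeps each postings list whole on one machine.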

Indexing in the real world
  Typically, don’t have all documents sitting on a local filesystem
    Documents need to be spidered
    Could be dispersed over a WAN with varying connectivity
    Must schedule distributed spiders/indexers
    Could be (secure content) in
      Databases
      Content management applications
      Email applications
  http often not the most efficient way of fetching these documents - native API fetching

Indexing in the real world
  Documents in a variety of formats
    word processing formats (e.g., MS Word)
    spreadsheets
    presentations
    publishing formats (e.g., pdf)
  Generally handled using format-specific “filters”
    convert format into text + meta-data

Indexing in the real world
  Documents in a variety of languages
    automatically detect language(s) in a document
    tokenization, stemming, are language-dependent

“Rich” documents
  (How) Do we index images?
  Researchers have devised Query Based on Image Content (QBIC) systems
    “show me a picture similar to this orange circle”
    watch for next week’s lecture on vector space retrieval
  In practice, image search based on meta-data such as file name e.g., monalisa.jpg

Passage/sentence retrieval
  Suppose we want to retrieve not an entire document matching a query, but only a passage/sentence - say, in a very long document
  Can index passages/sentences as mini-documents
  But then you lose the overall relevance assessment from the complete document - more on this as we study relevance ranking
  More on this when discussing XML search

Resources, and beyond
  MG 5.
  Thus far, documents either match a query or do not.
  It’s time to become more discriminating - how well does a document match a query?
  Gives rise to ranking and scoring