Overview – UCSD Cognitive Science
rik/courses/cogs188_s10/slides/3-wgt-match.pdf (transcript)
© R. K. Belew 1996-2001 · Finding Out About, Chapter 3: 25 Sept 01
Overview
The fascination with the subliminal, the camouflaged, and the encrypted is ancient. Getting a computer to munch away at long strings of letters from the Old Testament is not that different from killing animals and interpreting the entrails, or pouring out tea and reading the leaves. It does add the modern impersonal touch – a computer found it, not a person, so it must be “really there.” But computers find what people tell them to find. As the programmers like to say, “prophesy in, prophesy out.”
Zipf’s Law
[Figure: Zipf's first law. Frequency of words plotted against rank order of words; the curve is divided into "too common" words (top ranks), "significant" words (middle ranks), and "too rare" words (the long tail).]
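Zipf's first law says frequency times rank is roughly constant across the rank ordering. A minimal sketch of checking this on a token stream, using a made-up toy corpus (not from the chapter):

```python
from collections import Counter

def zipf_table(text, top=5):
    """Rank words by frequency; Zipf's first law predicts that
    freq * rank is roughly constant across ranks."""
    counts = Counter(text.lower().split())
    ranked = counts.most_common(top)
    return [(rank, word, freq, freq * rank)
            for rank, (word, freq) in enumerate(ranked, start=1)]

# toy corpus, for illustration only
toy = "the of the cat the of the sat"
for row in zipf_table(toy):
    print(row)   # (rank, word, freq, freq*rank)
```

On a real corpus (such as the token counts on the following slides) the freq * rank column stays within the same order of magnitude over many ranks.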
Other very clever people have provided other cognitive/linguistic explanations
Consequences ... (cont)

Token      Freq    Unstem-f
the        78428
of         50026
and        33834
a          31347
to         28666
in         21512
system     21488   8632
is         18781
model      14772   4796
for        14640
de         11923
network    10306   3965
this       10095
base        9838
that        9820
are         9792
learn       9293
world       8103
la          7678
author      7615
an          7593
knowledg    7410   5496
neural      7220   3912
with        7197
as          6964
on          6920
by          6886
process     6569   2900
design      6362   3308
del         6178
be          6045
develop     5891
integr      5633
domain      5630
based       5326
use         5226
intellig    5197
which       5158
control     5151   3288
expert      4953   2842
comput      4851
mechan      4818
escolar     4728
Consequences ... (cont2)

Token       Freq    Unstem-f
approach     4621   2535
from         4587
classifi     4556
algorithm    4533   2155
final        4436
systems      4387
can          4370
code         4116
robot        4103
intern       4097
applic       4055
perform      4051
percept      4047
method       4037   2003
enabl        4036
data         4013   3326
make         3984
increm       3947
incomplet    3890
secondli     3765
mo           3733
it           3697
used         3594
problem      3520
we           3276
these        3268
using        3268
learning     3266
was          3205
has          3051
or           2859
been         2715
research     2622
have         2609
two          2601
developed    2550
information  2461
networks     2449
time         2370
s            2350
new          2293
also         2259
performance  2244
results      2239
were         2216
such         2165
problems     2133
analysis     2045
models       2000
Function words follow Poisson distribution
Pr(n occurrences of w) = e^(−λ_w) · λ_w^n / n!
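The Poisson probability mass function above is easy to evaluate directly. A minimal sketch, where the rate λ_w is the word's mean per-document frequency (the value 2.0 below is an illustrative assumption, not from the chapter):

```python
import math

def poisson_pmf(n, lam):
    """Pr(n occurrences of word w in a document), under the Poisson
    model with rate lam = the word's mean per-document frequency."""
    return math.exp(-lam) * lam ** n / math.factorial(n)

# e.g. a function word averaging 2 occurrences per document
probs = [poisson_pmf(n, 2.0) for n in range(6)]
```

A content word, by contrast, tends to be "burstier" than this model predicts, which is one motivation for separating function words from significant words.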
Resolving Power
[Figure: Zipf's first law curve (frequency of words vs. rank order of words) overlaid with a "resolving power" curve. Upper and lower frequency cut-offs separate "too common" and "too rare" words from "significant" ones; resolving power peaks between the two cut-offs.]
Exhaustivity: Number of topics indexed
Specificity: ability to describe FOA information need precisely
Index: A balance between user and corpus
[Diagram: the INDEX as a balance between the user's Query (exhaustivity) and the Corpus (specificity).]
Not too exhaustive, not too specific...
[Diagram: the same INDEX balance, annotated. Exhaustivity governs representation of the Query; specificity governs discriminability of the Corpus. Few keywords per document leads to high precision; many keywords per document leads to high recall.]
Factors in index weighting
freq_kd ≡ N(occurrences of word_k in doc_d)

w_kd ∝ freq_kd · discrim_k
Information is reduction in uncertainty
Separate informative words from noise
Noise_k ≡ Σ_{d=1..NDoc} (freq_kd / freq_k) · log(freq_k / freq_kd)

Signal_k = log(freq_k) − Noise_k

w_kd = freq_kd · Signal_k
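The three formulas above can be computed in a few lines. A minimal sketch, assuming base-2 logs and that freq_k is the term's total frequency across the collection:

```python
import math

def signal_weights(freq_kd):
    """Noise/signal weighting for one term k across the collection.
    freq_kd: the term's frequency in each document.
    Noise_k  = sum_d (f_kd/f_k) * log2(f_k/f_kd)
    Signal_k = log2(f_k) - Noise_k
    Returns w_kd = freq_kd * Signal_k for each document."""
    f_k = sum(freq_kd)
    noise = sum((f / f_k) * math.log2(f_k / f) for f in freq_kd if f > 0)
    signal = math.log2(f_k) - noise
    return [f * signal for f in freq_kd]

# a term occurring once in every document is pure noise (zero signal) ...
even = signal_weights([1, 1, 1, 1])
# ... while a term concentrated in one document is weighted up
concentrated = signal_weights([8, 0, 0, 0])
```

Noise is maximal (and signal zero) when the term is spread perfectly evenly, which is exactly the behavior of function words.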
3.3.7 Inverse document frequency
Doc_k ≡ N(documents containing word_k)

w_kd = freq_kd · ( log(Norm / Doc_k) + 1 )

Norm = { NDoc           [Sparck Jones '72]
       { max_k Doc_k    [Sparck Jones '79]
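The IDF weighting above takes one line of code once Doc_k and Norm are in hand. A minimal sketch (the function name and the natural log are my choices; the slide leaves the log base unspecified):

```python
import math

def idf_weight(freq_kd, doc_k, norm):
    """w_kd = freq_kd * (log(Norm / Doc_k) + 1), where Doc_k is the
    number of documents containing word k, and Norm is NDoc
    (Sparck Jones '72) or max_k Doc_k (Sparck Jones '79)."""
    return freq_kd * (math.log(norm / doc_k) + 1)

# a term in every document collapses to plain term frequency ...
common = idf_weight(3, 1000, 1000)   # log(1) + 1 = 1, so weight 3.0
# ... while rarer terms are weighted up
rare = idf_weight(3, 10, 1000)
```

The "+1" keeps a term that appears in every document from being zeroed out entirely.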
Inter-document similarity
Sim(d_i, d_j) ≡ "similarity" betwixt documents

D* ≡ centroid; the average document

avgSim ≡ (1 / (2·NDoc)) Σ_{i,j} Sim(d_i, d_j) = α Σ_{i=1..NDoc} Sim(d_i, D*)
Removing keyword collapses document space

Sim_k ≡ avgSim when term_k is removed

Disc_k ≡ Sim_k − avgSim

w_kd = freq_kd · Disc_k
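The discrimination value Disc_k can be computed directly from the slide's definitions: compare average document-to-centroid similarity with and without term k. A minimal sketch using cosine similarity (the choice of cosine and the toy vectors are my assumptions):

```python
import math

def cosine(u, v):
    """Cosine similarity between two weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def discrimination_value(vectors, k):
    """Disc_k = (avg similarity to the centroid with term k removed)
    minus the baseline avg similarity.  vectors: one term-weight
    list per document."""
    def avg_sim_to_centroid(vs):
        n = len(vs)
        centroid = [sum(col) / n for col in zip(*vs)]
        return sum(cosine(v, centroid) for v in vs) / n

    without_k = [v[:k] + v[k + 1:] for v in vectors]
    return avg_sim_to_centroid(without_k) - avg_sim_to_centroid(vectors)

# term 0 is uniform across both documents (a poor discriminator);
# terms 1 and 2 separate the documents (good discriminators)
docs = [[5, 1, 0], [5, 0, 1]]
```

Removing a good discriminator makes the documents more alike (the space "collapses"), so Disc_k is positive; removing a uniform term makes them less alike, so Disc_k is negative.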
Sensitivity of IDF to “Document” Size
Pivot-Based Document Length Normalization
Summary: SMART Weighting Specification

w_kd = (freq_kd · collect_k) / norm
Frequency of KW in DOC
freq_kd = { {0,1}                                   binary
          { freq_kd / max_k(freq_kd)                max norm
          { 1/2 + (1/2) · freq_kd / max_k(freq_kd)  augmented
          { ln(freq_kd) + 1                         log
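The four term-frequency options above can be sketched as one small function (the scheme names are taken from the slide; the zero-frequency handling is my assumption):

```python
import math

def tf_component(freq_kd, max_freq_d, scheme):
    """SMART-style term-frequency component; max_freq_d is the
    largest term frequency in document d."""
    if scheme == "binary":
        return 1.0 if freq_kd > 0 else 0.0
    if scheme == "max_norm":
        return freq_kd / max_freq_d
    if scheme == "augmented":
        return 0.5 + 0.5 * freq_kd / max_freq_d
    if scheme == "log":
        return math.log(freq_kd) + 1 if freq_kd > 0 else 0.0
    raise ValueError(scheme)
```

The augmented form keeps every occurring term in the range [0.5, 1], damping the influence of raw counts; the log form damps them even more aggressively.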
Collection statistics of KW
collect_k = { 1                               none
            { log(NDoc / Doc_k)               idf
            { log((NDoc − Doc_k) / Doc_k)     probabilistic
Normalization
norm = { Σ_vector w_i              sum
       { √( Σ_vector w_i² )        cosine
       { ( Σ_vector w_i⁴ )^(1/4)   fourth
       { max_vector(w_i)           max
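The normalization options can likewise be sketched as one function. A minimal sketch, assuming each weight is divided by the returned factor (the slide's flattened formulas leave the roots implicit; the square root for cosine and fourth root for fourth are my reading):

```python
import math

def norm_factor(weights, scheme):
    """SMART-style normalization factor for one document's
    weight vector; each weight is divided by this factor."""
    if scheme == "sum":
        return sum(weights)
    if scheme == "cosine":
        return math.sqrt(sum(w * w for w in weights))
    if scheme == "fourth":
        return sum(w ** 4 for w in weights) ** 0.25
    if scheme == "max":
        return max(weights)
    raise ValueError(scheme)
```

Cosine normalization makes every document vector unit length, so long documents stop dominating simply by having more terms.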
3.5.1 Measures of association
Q = {kw ∈ query}
D = {kw ∈ document}

|Q ∩ D|                     shared features
2·|Q ∩ D| / (|Q| + |D|)     Dice coefficient
|Q ∩ D| / √(|Q|·|D|)        Cosine coefficient
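All three association measures are set operations on the query and document keyword sets. A minimal sketch, with made-up example sets:

```python
import math

def dice(q, d):
    """Dice coefficient: 2|Q ∩ D| / (|Q| + |D|)."""
    return 2 * len(q & d) / (len(q) + len(d))

def cosine_coeff(q, d):
    """Cosine coefficient for sets: |Q ∩ D| / sqrt(|Q| * |D|)."""
    return len(q & d) / math.sqrt(len(q) * len(d))

# hypothetical query and document keyword sets
q = {"neural", "network", "learning"}
d = {"neural", "network", "model", "data"}
```

Both coefficients normalize the raw overlap |Q ∩ D| so that long documents (or long queries) are not rewarded merely for size.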
Dissimilarity as “distance”
[Figure: two plots relating dissimilarity D to similarity S.]