WBIA Review

Page 1:

WBIA Review

http://net.pku.edu.cn/~wbia
黄连恩  [email protected]
北京大学信息工程学院

12/23/2014

Page 2:

Bow-tie structure of the Web

Strongly Connected Component (SCC): the core.
Upstream (IN): pages that can reach the core, but that the core cannot reach.
Downstream (OUT): pages the core can reach, but that cannot reach back into the core.
Tendrils & Tubes, and Disconnected components.

Page 3:

Power-law

Nature seems to create bell curves (a range around an average).

Human activity seems to create power laws (popularity skewing).

Page 4:

Power Law Distribution: Examples

From "Graph Structure in the Web" (AltaVista crawl, 1999)

Page 5:

Web Graph

Exercise: How would you store the Web graph?

Page 6:

PageRank

Why and how does it work?

Page 7:

Random walker model

[Figure: a page v and pages u1 ... u5 connected to it by links, illustrating the random walk on the Web graph.]

Page 8:

Damping Factor

p = (1 - β) L^T p + (β / N) 1_N

β is chosen between 0.1 and 0.2 and is called the damping factor (Page & Brin, 1997).

G = (1 - β) L^T + (β / N) 1_{N×N} is called the Google matrix.

Page 9:

[Example link matrix L for an 11-page graph; each row spreads a page's weight uniformly over its out-links (entries such as 1/2, 1/3, 1), and one row consists entirely of 1/11 entries.]

Page 10:

Solving a small example: take β = 0.15, so G = 0.85 · L^T + (0.15/11) · 1_{N×N}.
Start from P0 = (1/11, 1/11, ...)^T and iterate P1 = G·P0, P2 = G·P1, ...
Power iteration (about 50 iterations) gives P = (0.033, 0.384, 0.343, 0.039, 0.081, 0.039, 0.016, ...)^T.

You can try this in MATLAB.
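A minimal power-iteration sketch of this computation in Python/NumPy. The 5-page adjacency list is a made-up stand-in, since the slide's 11-page matrix is not fully recoverable here.

import numpy as np

def pagerank(out_links, n, beta=0.15, iters=50):
    """Power iteration on the Google matrix G = (1-beta) L^T + beta/n * 1."""
    L = np.zeros((n, n))
    for u, targets in out_links.items():
        if targets:                      # distribute u's weight over its out-links
            for v in targets:
                L[u, v] = 1.0 / len(targets)
        else:                            # dangling page: spread uniformly
            L[u, :] = 1.0 / n
    G = (1 - beta) * L.T + beta / n      # dense Google matrix (fine for tiny graphs)
    p = np.full(n, 1.0 / n)              # P0 = (1/n, ..., 1/n)^T
    for _ in range(iters):
        p = G @ p                        # P_{k+1} = G P_k
    return p

# Hypothetical 5-page toy graph (not the slide's 11-page example).
print(pagerank({0: [1, 2], 1: [2], 2: [0], 3: [2], 4: []}, n=5))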

Page 11:

Exercise: Write pseudocode for the PageRank algorithm.

Page 12:

HITS (Hyperlink-Induced Topic Search)

A page with high standing (large in-degree) has high authority; a page that knows many high-standing pages (large out-degree) is a strong hub (directory). How are the scores computed?

Power iteration on:

a = E^T h,  h = E a   =>   a = E^T E a,  h = E E^T h

Page 13:

Authority and Hub scores

For each page u ∈ V(q), define two scores: a[u] (authority) and h[u] (hub / directory quality). They are defined in terms of each other:

The a value of a page u depends on the h values of the pages v that point to it.
The h value of a page u depends on the a values of the pages v that it points to.

a = E^T h,  h = E a
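A small sketch of this power iteration (NumPy), normalizing a and h after each step; the 4-page adjacency matrix E is a hypothetical example.

import numpy as np

def hits(E, iters=50):
    """Iterate a = E^T h, h = E a, normalizing each vector every step."""
    n = E.shape[0]
    a = np.ones(n)
    h = np.ones(n)
    for _ in range(iters):
        a = E.T @ h
        a /= np.linalg.norm(a)
        h = E @ a
        h /= np.linalg.norm(h)
    return a, h

# Hypothetical 4-page graph: E[i, j] = 1 if page i links to page j.
E = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [1, 0, 0, 0],
              [0, 1, 1, 0]], dtype=float)
authority, hub = hits(E)
print(authority, hub)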

Page 14:

Web Spam

Term spamming: manipulating the text of web pages in order to appear relevant to queries.

Link spamming: creating link structures that boost PageRank or hub and authority scores.

Page 15:

TrustRank

Expecting that good pages point to other good pages, all pages reachable from a good seed page in M or fewer steps are marked as good.

t = α · L^T · t + (1 - α) · d / |d|

[Figure: a 7-page example graph (pages 1-7), each marked as a good page or a bad page.]

Page 16:

TrustRank in Action

Select the seed set using inverse PageRank: ranking = [2, 4, 5, 1, 3, 6, 7].
Invoke the oracle on the top L (= 3) pages and populate the static score distribution vector d = [0, 1, 0, 1, 0, 0, 0].
Normalize the distribution vector: d = [0, 1/2, 0, 1/2, 0, 0, 0].
Calculate TrustRank scores using biased PageRank with trust dampening and trust splitting.

t = α · L^T · t + (1 - α) · d / |d|

Results: t = [0, 0.18, 0.12, 0.15, 0.13, 0.05, 0.05]

[Figure: the 7-page graph annotated with the resulting trust scores.]
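A sketch of the biased-PageRank iteration above (t = α·L^T·t + (1-α)·d); the 4-page graph and seed vector are placeholders, not the slide's 7-page example.

import numpy as np

def trustrank(out_links, d, alpha=0.85, iters=50):
    """t = alpha * L^T * t + (1 - alpha) * d / |d|  (biased PageRank)."""
    n = len(d)
    L = np.zeros((n, n))
    for u, targets in out_links.items():
        for v in targets:
            L[u, v] = 1.0 / len(targets)
    d = np.asarray(d, dtype=float)
    d = d / d.sum()                     # normalized static trust distribution
    t = d.copy()                        # start from the seed distribution
    for _ in range(iters):
        t = alpha * (L.T @ t) + (1 - alpha) * d
    return t

# Hypothetical 4-page graph with page 0 as the only trusted seed.
print(trustrank({0: [1], 1: [2], 2: [0, 3], 3: [1]}, d=[1, 0, 0, 0]))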

Page 17:

Tokenization

"Friends, Romans, Countrymen, lend me your ears;"  =>  Friends | Romans | Countrymen | lend | me | your | ears

Token: an instance of a sequence of characters that are grouped together as a useful semantic unit for processing.
Type: the class of all tokens containing the same character sequence.
Term: a (normalized) type that is included in the system dictionary.

Page 18:

Stemming and lemmatization

Stemming: a crude heuristic process that chops off the ends of words.
  Democratic -> democa

Lemmatization: uses vocabulary and morphological analysis, and returns the base form of a word (the lemma).
  Democratic -> democracy
  Sang -> sing

Page 19:

Porter stemmer

The most common algorithm for stemming English; 5 phases of word reduction, for example:

SSES -> SS          caresses -> caress
IES  -> I           ponies -> poni
SS   -> SS
S    -> (dropped)   cats -> cat
EMENT -> (dropped)  replacement -> replac;  cement -> cement
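To experiment with these rules, NLTK's Porter stemmer can be used; a minimal sketch, assuming the nltk package is installed:

from nltk.stem.porter import PorterStemmer  # pip install nltk

stemmer = PorterStemmer()
for word in ["caresses", "ponies", "cats", "replacement", "cement"]:
    # Each word is reduced by the phase-by-phase suffix rules above.
    print(word, "->", stemmer.stem(word))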

Page 20:

Bag of words model

A document can now be viewed as the collection of terms in it, each with an associated weight.

"Mary is smarter than John" and "John is smarter than Mary" are equivalent in the bag of words model.

Page 21:

Term frequency and weighting

A word that appears often in a document is probably very descriptive of what the document is about.

Assign to each term in a document a weight that depends on the number of occurrences of that term in the document.

Term frequency (tf): assign the weight to be equal to the number of occurrences of term t in document d.

Page 22:

Inverse document frequency

N: the number of documents in the collection. Here idf_t = log(N / df_t), with the natural log.

• N = 1000; df[the] = 1000; idf[the] = 0
• N = 1000; df[some] = 100; idf[some] = 2.3
• N = 1000; df[car] = 10; idf[car] = 4.6
• N = 1000; df[merger] = 1; idf[merger] = 6.9

Page 23:

tf.idf weighting

Highest when t occurs many times within a small number of documents (thus lending high discriminating power to those documents).

Lower when the term occurs fewer times in a document, or occurs in many documents (thus offering a less pronounced relevance signal).

Lowest when the term occurs in virtually all documents.

Page 24:

tf × idf term weights

The tf × idf weight is computed from:

term frequency (tf), or wf: some measure of term density in a document.

inverse document frequency (idf): expresses the importance (rarity) of a term. The raw value is idf_t = 1/df_t; as with tf, it is usually smoothed:

idf_t = log(N / df_t)

Compute the tf.idf weight of each term in a document:

w_{t,d} = tf_{t,d} × log(N / df_t)
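A small sketch of w_{t,d} = tf_{t,d} × log(N/df_t) over a made-up three-document corpus:

import math
from collections import Counter

docs = ["new york times", "new york post", "los angeles times"]  # toy corpus
tokenized = [d.split() for d in docs]
N = len(tokenized)

# df_t: number of documents containing term t
df = Counter(t for doc in tokenized for t in set(doc))

def tf_idf(doc_tokens):
    tf = Counter(doc_tokens)
    # w_{t,d} = tf_{t,d} * log(N / df_t)
    return {t: tf[t] * math.log(N / df[t]) for t in tf}

for doc, toks in zip(docs, tokenized):
    print(doc, "->", tf_idf(toks))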

Page 25:

Document vector space representation

Each document is viewed as a vector with one component corresponding to each term in the dictionary

The value of each component is the tf-idf score for that word

For dictionary terms that do not occur in the document, the weights are 0

Page 26:

Documents as vectors

Each document j can be viewed as a vector with one dimension per term, whose value is the tf.idf weight. So we have a vector space:

terms are the axes; documents live in this space. It is high-dimensional: even after stemming, there may be 20,000+ dimensions.

          D1    D2    D3    D4    D5    D6   ...
中国      4.1   0.0   3.7   5.9   3.1   0.0
文化      4.5   4.5   0     0     11.6  0
日本      0     3.5   2.9   0     2.1   3.9
留学生    0     3.1   5.1   12.8  0     0
教育      2.9   0     0     2.2   0     0
北京      7.1   0     0     0     4.4   3.8
...

Page 27:

Cosine similarity

Page 28:

Cosine similarity

The "closeness" of vectors d1 and d2 can be measured by the angle between them; concretely, the cosine of the angle θ is used as the vector similarity. Vectors are normalized by length:

|d_j| = sqrt( Σ_{i=1..M} w_{i,j}^2 )

sim(d_j, d_k) = (d_j · d_k) / (|d_j| |d_k|) = Σ_{i=1..M} w_{i,j} w_{i,k} / ( sqrt(Σ_i w_{i,j}^2) · sqrt(Σ_i w_{i,k}^2) )

[Figure: two document vectors d1 and d2 in term space (axes t1, t2, t3), separated by angle θ.]
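A sketch of the cosine formula for sparse term-weight vectors (the example weights are illustrative, not taken from the earlier table):

import math

def cosine(d1, d2):
    """sim(d1, d2) = sum_i w_i1 * w_i2 / (|d1| * |d2|), for sparse dict vectors."""
    dot = sum(w * d2.get(t, 0.0) for t, w in d1.items())
    n1 = math.sqrt(sum(w * w for w in d1.values()))
    n2 = math.sqrt(sum(w * w for w in d2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

# Toy tf.idf vectors (made-up weights).
d1 = {"中国": 4.1, "文化": 4.5, "北京": 7.1}
d2 = {"中国": 3.7, "日本": 2.9, "留学生": 5.1}
print(cosine(d1, d2))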

Page 29:

Jaccard coefficient / Resemblance

Symmetric, reflexive, not transitive, not a metric.
Note r(A, A) = 1, but r(A, B) = 1 does not mean A and B are identical!
Forgives any number of occurrences and any permutations of the terms.

r(A, B) = |S(A) ∩ S(B)| / |S(A) ∪ S(B)|

Resemblance distance: d(A, B) = 1 - r(A, B)

Page 30:

Shingling

A contiguous subsequence contained in D is called a shingle. Given a document D, we define its w-shingling S(D, w) as the set of all unique shingles of size w contained in D.

D = (a, rose, is, a, rose, is, a, rose)
S(D, 4) = {(a, rose, is, a), (rose, is, a, rose), (is, a, rose, is)}
"a rose is a rose is a rose" => a_rose_is_a, rose_is_a_rose, is_a_rose_is

Why shingling? Compare S(D, 4) vs. S(D, 1). What is a good value for w?
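A minimal sketch of w-shingling and the Jaccard resemblance defined on the previous slide (it can also be used for the exercise that follows):

def shingles(tokens, w):
    """S(D, w): the set of all unique w-shingles (contiguous w-grams) in D."""
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}

def resemblance(a_tokens, b_tokens, w):
    """r(A, B) = |S(A) ∩ S(B)| / |S(A) ∪ S(B)|."""
    sa, sb = shingles(a_tokens, w), shingles(b_tokens, w)
    return len(sa & sb) / len(sa | sb)

doc = "a rose is a rose is a rose".split()
print(shingles(doc, 4))          # the three unique 4-shingles from the slide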

Page 31:

Shingling & Jaccard Coefficient

Doc1 = "to be or not to be, that is a question!"
Doc2 = "to be a question or not"

Let the window size w = 2. Resemblance r(A, B) = ?

Page 32:

Random permutation

Let Ω be a set (e.g., 1..N). Pick a permutation π : Ω → Ω uniformly at random.

π = {3, 7, 1, 4, 6, 2, 5},  A = {2, 3, 6},  MIN(π(A)) = ?
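A sketch of why random permutations matter here: the probability that A and B share the same minimum under a random permutation π equals their resemblance r(A, B). The sets and universe below are illustrative.

import random

def minhash_estimate(A, B, universe, trials=10_000, seed=0):
    """Estimate r(A, B) as the fraction of random permutations with equal minima."""
    rng = random.Random(seed)
    items = list(universe)
    hits = 0
    for _ in range(trials):
        perm = {x: i for i, x in enumerate(rng.sample(items, len(items)))}
        if min(perm[x] for x in A) == min(perm[x] for x in B):
            hits += 1
    return hits / trials

A, B = {2, 3, 6}, {3, 6, 7}
print(minhash_estimate(A, B, universe=range(1, 8)))   # approx |A ∩ B| / |A ∪ B| = 2/4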

Page 33:

Inverted index

For each term T: store a list of the documents (docIDs) that contain T.

中国    ->  2  4  8  16  32  64  128
文化    ->  2  3  5  8  13  21  34
留学生  ->  13  16

Dictionary -> Postings; postings are sorted by docID (more later on why).

Page 34:

Inverted index with counts

• supports better ranking algorithms
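A tiny sketch of an inverted index with counts (term -> postings of (docID, tf), sorted by docID); the three documents are made up:

from collections import defaultdict, Counter

def build_index(docs):
    """Inverted index with counts: term -> list of (docID, tf), postings sorted by docID."""
    index = defaultdict(list)
    for doc_id in sorted(docs):                 # keep postings sorted by docID
        counts = Counter(docs[doc_id].split())
        for term, tf in counts.items():
            index[term].append((doc_id, tf))
    return index

# Toy collection (docIDs are arbitrary small integers).
docs = {1: "中国 文化 中国", 2: "留学生 教育", 3: "中国 留学生"}
index = build_index(docs)
print(index["中国"])     # e.g. [(1, 2), (3, 1)]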

Page 35:

VS-based Retrieval (Sec. 6.4)

Columns headed 'n' are acronyms for weight schemes.

Why is the base of the log in idf immaterial?

Page 36:

tf-idf example: lnc.ltc (Sec. 6.4)

Document: "car insurance auto insurance"    Query: "best car insurance"

             Query                                     Document                     Prod
Term         tf-raw  tf-wt  df     idf  wt   n'lize    tf-raw  tf-wt  wt   n'lize
auto         0       0      5000   2.3  0    0         1       1      1    0.52    0
best         1       1      50000  1.3  1.3  0.34      0       0      0    0       0
car          1       1      10000  2.0  2.0  0.52      1       1      1    0.52    0.27
insurance    1       1      1000   3.0  3.0  0.78      2       1.3    1.3  0.68    0.53

Exercise: what is N, the number of docs?

Doc length = sqrt(1² + 0² + 1² + 1.3²) ≈ 1.92
Score = 0 + 0 + 0.27 + 0.53 = 0.8
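A short sketch reproducing the lnc.ltc score above (base-10 logs; the idf values are read off the table rather than computed from N):

import math

idf = {"auto": 2.3, "best": 1.3, "car": 2.0, "insurance": 3.0}   # from the table
doc_tf = {"car": 1, "insurance": 2, "auto": 1}                    # "car insurance auto insurance"
query_tf = {"best": 1, "car": 1, "insurance": 1}

def ltc(tf):   # query: logarithmic tf, idf, cosine normalization
    w = {t: (1 + math.log10(c)) * idf[t] for t, c in tf.items()}
    norm = math.sqrt(sum(x * x for x in w.values()))
    return {t: x / norm for t, x in w.items()}

def lnc(tf):   # document: logarithmic tf, no idf, cosine normalization
    w = {t: 1 + math.log10(c) for t, c in tf.items()}
    norm = math.sqrt(sum(x * x for x in w.values()))
    return {t: x / norm for t, x in w.items()}

q, d = ltc(query_tf), lnc(doc_tf)
score = sum(q[t] * d.get(t, 0.0) for t in q)
print(round(score, 2))    # approx 0.8, matching the slide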

Page 37:

Singular Value Decomposition

Apply singular value decomposition to the term-document matrix W.
r: the rank of the matrix. Σ: the diagonal matrix of singular values (in decreasing order). T, D: matrices with orthogonal, unit-length column vectors (TT' = I, DD' = I).

W_{t×d} = T_{t×r} · Σ_{r×r} · D^T_{r×d}

The singular values are obtained from the eigenvalues of W W^T; T and D consist of eigenvectors of W W^T and W^T W, respectively.

Page 38:

Latent Semantic Model

LSI retrieval process:

The query is mapped / projected into the LSI document space D^T; this is called "folding in". Since W = T Σ D^T, if q projected into D^T is q', then q = T Σ q'^T, so q' = (Σ^{-1} T^{-1} q)^T = q^T T Σ^{-1}. Folding in thus means multiplying a document/query vector by T Σ^{-1}.

The collection's document vectors are D^T; the similarity between the folded-in query and the documents is computed by dot product.
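A rough LSI sketch with NumPy's SVD: truncate to rank k, fold the query in by multiplying with T Σ^{-1}, then score by dot product. The tiny term-document matrix is made up.

import numpy as np

# Toy term-document matrix W (terms x documents), made-up counts.
W = np.array([[2.0, 0.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 2.0, 1.0],
              [0.0, 1.0, 2.0]])

k = 2
T, s, Dt = np.linalg.svd(W, full_matrices=False)   # W = T diag(s) D^T
T_k, S_k, Dt_k = T[:, :k], np.diag(s[:k]), Dt[:k, :]

docs = Dt_k.T                          # each row: a document in the k-dim LSI space
q = np.array([1.0, 0.0, 1.0, 0.0])     # query as a term vector
q_lsi = q @ T_k @ np.linalg.inv(S_k)   # "fold in": multiply by T Σ^{-1}

scores = docs @ q_lsi                  # dot-product similarity
print(scores)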

Page 39:

Stochastic Language Models

A statistical model used to generate text: a probability distribution over strings in a given language M.

P(w1 w2 w3 w4 | M) = P(w1 | M) · P(w2 | M, w1) · P(w3 | M, w1 w2) · P(w4 | M, w1 w2 w3)

Page 40:

Unigram model (captures likely topics):

P(w) = count(w) / #tokens

Bigram model (captures grammaticality):

P(w_i | w_{i-1}) = count(w_{i-1} w_i) / count(w_{i-1})

Page 41:

Bigram Model

Approximate P(unicorn | the mythical) by P(unicorn | mythical).

Markov assumption: the probability of a word depends only on a limited history.

Generalization: the probability of a word depends only on the n previous words (trigrams, 4-grams, ...). The higher n is, the more data is needed to train; backoff models help.

P(w_n | w_1 ... w_{n-1}) ≈ P(w_n | w_{n-1})

Page 42:

A Simple Example: bigram model

P(I want to eat Chinese food) = P(I | <start>) P(want | I) P(to | want) P(eat | to) P(Chinese | eat) P(food | Chinese) P(<end> | food)
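A maximum-likelihood bigram model sketch over a toy two-sentence corpus (no smoothing):

from collections import Counter

corpus = [["<s>", "I", "want", "to", "eat", "Chinese", "food", "</s>"],
          ["<s>", "I", "want", "Chinese", "food", "</s>"]]          # toy training data

unigram = Counter(w for sent in corpus for w in sent)
bigram = Counter((sent[i], sent[i + 1]) for sent in corpus for i in range(len(sent) - 1))

def p_bigram(w, prev):
    """P(w | prev) = count(prev w) / count(prev)."""
    return bigram[(prev, w)] / unigram[prev]

def sentence_prob(words):
    words = ["<s>"] + words + ["</s>"]
    p = 1.0
    for prev, w in zip(words, words[1:]):
        p *= p_bigram(w, prev)
    return p

print(sentence_prob(["I", "want", "Chinese", "food"]))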

Page 43:

LM-based Retrieval

Ranking formula:

p(Q, d) = p(d) · p(Q | d) ≈ p(d) · p(Q | M_d)

Unigram assumption: given a particular language model, the query terms occur independently.

Using the maximum likelihood estimate:

p̂(Q | M_d) = ∏_{t∈Q} p̂(t | M_d) = ∏_{t∈Q} tf(t, d) / dl_d

M_d: the language model of document d.
tf(t, d): the raw term frequency of term t in document d.
dl_d: the total number of tokens in document d.

Page 44:

Laplace smoothing

Also called add-one smoothing: just add one to all the counts! Very simple.

MLE estimate:      P(w) = c(w) / N

Laplace estimate:  P(w) = (c(w) + 1) / (N + V),  where V is the vocabulary size

Page 45:

Mixture model smoothing

P(w | d) = λ · P_mle(w | M_d) + (1 - λ) · P_mle(w | M_c)

The parameter λ is important. A high λ makes the query behave "conjunctive-like", which suits short queries; a low λ suits long queries better. Tune λ to optimize performance, for example by making it depend on document length (cf. Dirichlet prior or Witten-Bell smoothing).

Page 46:

Example

Document collection (2 documents):
d1: Xerox reports a profit but revenue is down
d2: Lucent narrows quarter loss but revenue decreases further

Model: MLE unigram from documents; λ = 1/2. Query: revenue down

P(Q | d1) = [(1/8 + 2/16)/2] × [(1/8 + 1/16)/2] = 1/8 × 3/32 = 3/256
P(Q | d2) = [(1/8 + 2/16)/2] × [(0 + 1/16)/2] = 1/8 × 1/32 = 1/256

Ranking: d1 > d2
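A sketch that reproduces this example with the mixture model P(w|d) = λ·P_mle(w|M_d) + (1-λ)·P_mle(w|M_c), λ = 1/2:

from collections import Counter

docs = {
    "d1": "Xerox reports a profit but revenue is down".lower().split(),
    "d2": "Lucent narrows quarter loss but revenue decreases further".lower().split(),
}
collection = [w for d in docs.values() for w in d]
cf = Counter(collection)                      # collection (background) counts
lam = 0.5

def p_query(query, doc):
    tf = Counter(doc)
    p = 1.0
    for w in query:
        p_d = tf[w] / len(doc)                # P_mle(w | M_d)
        p_c = cf[w] / len(collection)         # P_mle(w | M_c)
        p *= lam * p_d + (1 - lam) * p_c
    return p

query = "revenue down".split()
for name, doc in docs.items():
    print(name, p_query(query, doc))          # d1: 3/256, d2: 1/256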

Page 47:

What is relative entropy?

KL divergence / relative entropy

Page 48:

Relative entropy between two distributions

Cost in bits of coding using Q when the true distribution is P:

D_KL(P || Q) = -Σ_i P(i) log Q(i) - ( -Σ_i P(i) log P(i) )

H(P(x)) = -Σ_i P(i) log P(i)

Page 49:

D_KL(P || Q) = Σ_i P(i) log( P(i) / Q(i) )
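A direct sketch of D_KL(P || Q); with math.log it is measured in nats (use log base 2 for bits). The two distributions are toy values.

import math

def kl_divergence(p, q):
    """D_KL(P || Q) = sum_i P(i) * log(P(i) / Q(i)); terms with P(i) = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy distributions over the same three outcomes.
P = [0.5, 0.25, 0.25]
Q = [1/3, 1/3, 1/3]
print(kl_divergence(P, Q))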

Page 50:

Precision and Recall

Precision: the fraction of retrieved documents that are relevant = P(relevant | retrieved)
Recall: the fraction of relevant documents that are retrieved = P(retrieved | relevant)

Precision  P = tp / (tp + fp)
Recall     R = tp / (tp + fn)

                Relevant   Not Relevant
Retrieved       tp         fp
Not Retrieved   fn         tn

Page 51:

Accuracy

Given a query, the search engine classifies each document as "Relevant" or "Irrelevant".

Accuracy of an engine: the fraction of these classifications that are correct.

Accuracy = (tp + tn) / (tp + fp + tn + fn)

Is this a very useful evaluation measure in IR?

                Relevant   Not Relevant
Retrieved       tp         fp
Not Retrieved   fn         tn

Page 52:

A combined measure: F

The F measure (weighted harmonic mean) combines P and R:

F = 1 / ( α·(1/P) + (1-α)·(1/R) ) = (β² + 1) P R / (β² P + R)

The balanced F1 measure is most common (β = 1, or α = 1/2).

The harmonic mean is a conservative average: it heavily penalizes low values of P or R.
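A tiny helper for these formulas (the counts are purely illustrative):

def precision_recall_f1(tp, fp, fn):
    """P = tp/(tp+fp), R = tp/(tp+fn), F1 = harmonic mean of P and R."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# Toy counts.
print(precision_recall_f1(tp=4, fp=6, fn=4))   # (0.4, 0.5, ~0.44)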

Page 53:

MAP

Averaging across queries:

Micro-average: each relevant document is one point in the average.
Macro-average: each query is one point in the average.

The average of many queries' average precision values is called mean average precision (MAP); "average average precision" would sound weird. It is the most common measure.


Page 55:

Exercise 8-9 [**]: In a collection of 10,000 documents, a query has 8 relevant documents in total. Below are the relevance judgments (R = relevant, N = non-relevant) for a system's top 20 ranked results for this query; 6 of them are relevant: RRNNN NNNRN RNNNR NNNNR

a. What is the precision over the first 20 documents?
b. What is the F1 over the first 20 documents?
c. What is the interpolated precision at the 25% recall level?

Page 56:

KNN (Sec. 14.3)

[Figure: training documents from the classes Government, Science, and Arts plotted as points; for a new point, P(science | point) is estimated from its k nearest neighbors.]

Page 57:

Naïve Bayes

Maximum a posteriori hypothesis:
c_MAP = argmax_{c_j ∈ C} P(c_j | x_1, x_2, ..., x_n)

Bayes rule:
c_MAP = argmax_{c_j ∈ C} P(x_1, x_2, ..., x_n | c_j) P(c_j)

Conditional independence assumption:
c_NB = argmax_{c_j ∈ C} P̂(c_j) ∏_i P̂(x_i | c_j)

Add-one smoothed maximum likelihood estimates:
P̂(c_j) = N(C = c_j) / N
P̂(x_i | c_j) = ( N(X_i = x_i, C = c_j) + 1 ) / ( N(C = c_j) + k )

Page 58:

Parameter estimation

Binomial (Bernoulli) model:
P̂(X_w = t | c_j) = fraction of documents of topic c_j in which word w appears.

Multinomial model:
P̂(X_i = w | c_j) = fraction of times word w appears across all positions in the documents of topic c_j.
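A compact multinomial Naïve Bayes sketch with add-one smoothing, following the estimates above; the training documents are made up, not the slide's example.

import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs):
    """docs: list of (tokens, class). Returns priors and add-one smoothed likelihoods."""
    class_docs = defaultdict(list)
    for tokens, c in docs:
        class_docs[c].append(tokens)
    vocab = {t for tokens, _ in docs for t in tokens}
    priors, cond = {}, {}
    for c, dlist in class_docs.items():
        priors[c] = len(dlist) / len(docs)                 # P(c) = N_c / N
        counts = Counter(t for tokens in dlist for t in tokens)
        total = sum(counts.values())
        # P(w | c) = (count(w, c) + 1) / (total tokens in c + |V|)
        cond[c] = {w: (counts[w] + 1) / (total + len(vocab)) for w in vocab}
    return priors, cond

def classify(tokens, priors, cond):
    scores = {c: math.log(priors[c]) + sum(math.log(cond[c][t]) for t in tokens if t in cond[c])
              for c in priors}
    return max(scores, key=scores.get)

# Toy training data (made up).
train = [(["Chinese", "Beijing", "Chinese"], "China"),
         (["Chinese", "Shanghai"], "China"),
         (["Tokyo", "Japan", "Chinese"], "not China")]
priors, cond = train_multinomial_nb(train)
print(classify(["Chinese", "Chinese", "Tokyo"], priors, cond))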

Page 59:

NB Example

c(5) = ?

Page 60:

NB Example

c(5) = ?

Page 61:

Multinomial NB Classifier

Feature likelihood estimate
Posterior
Result: c(5) = China

Page 62:

NB Example

c(5) = ?

Page 63:

Bernoulli NB Classifier

Feature likelihood estimate
Posterior
Result: c(5) ≠ China

Page 64:

Exercise: your task is to classify words as English or not English. The words are generated from the following distribution:

(i) Compute the parameters of a multinomial NB classifier that uses the letters b, n, o, u, and z as features. When estimating the parameters, smooth zero probabilities to 0.01 and leave non-zero probabilities unchanged.
(ii) How does this classifier classify the word "zoo"?

Page 65:

Support Vector Machine (SVM)

SVMs maximize the margin around the separating hyperplane (a.k.a. large margin classifiers).

The decision function is fully specified by a subset of the training samples, the support vectors.

Solving SVMs is a quadratic programming problem.

Seen by many as the most successful current text classification method* (*but other discriminative methods often perform very similarly).

[Figure: a separating hyperplane with support vectors on the maximized margin; a narrower margin is also shown for contrast.]

Page 66:

χ² statistic (CHI)

The null hypothesis: Term (jaguar) is independent of Class (auto).

Then, what values are expected in this contingency table?

observed f_o:
                  Class = auto    Class ≠ auto
Term = jaguar     2               3
Term ≠ jaguar     500             9500

Page 67:

χ² statistic (CHI)

χ² is interested in (f_o - f_e)² / f_e summed over all table entries:

χ²(jaguar, auto) = Σ (O - E)² / E
                 = (2 - 0.25)²/0.25 + (3 - 4.75)²/4.75 + (500 - 502)²/502 + (9500 - 9498)²/9498
                 ≈ 12.9   (p < .001)

The null hypothesis is rejected with confidence .999, since 12.9 > 10.83 (the value for .999 confidence).

observed f_o (expected f_e):
                  Class = auto    Class ≠ auto
Term = jaguar     2 (0.25)        3 (4.75)
Term ≠ jaguar     500 (502)       9500 (9498)
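A sketch of the same computation for a 2×2 contingency table, deriving expected counts from the row and column totals:

def chi_square_2x2(observed):
    """observed[i][j]: counts for (term present/absent) x (class yes/no)."""
    row = [sum(r) for r in observed]
    col = [sum(c) for c in zip(*observed)]
    n = sum(row)
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / n          # f_e under independence
            chi2 += (observed[i][j] - expected) ** 2 / expected
    return chi2

# The jaguar/auto table from the slide.
print(chi_square_2x2([[2, 3], [500, 9500]]))        # about 12.8 (the slide's 12.9 uses rounded expected counts)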

Page 68:

K-Means

Assume documents are real-valued vectors. Clusters are based on centroids (a.k.a. the center of gravity or mean) of the points in a cluster ω:

μ(ω) = (1/|ω|) · Σ_{x ∈ ω} x

Instances are assigned to clusters by their distance to the cluster centroids: each instance goes to the nearest centroid.

Page 69:

K Means Example (K = 2)

Pick seeds -> Reassign clusters -> Compute centroids -> Reassign clusters -> Compute centroids -> Reassign clusters -> ... -> Converged!

[Figure: points in the plane; the two centroids (marked ×) move as clusters are reassigned and centroids recomputed, until convergence.]
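A plain K-means sketch matching the steps above (2-D toy points, a fixed number of iterations, no convergence check):

import random

def kmeans(points, k, iters=20, seed=0):
    """Plain K-means on 2-D points: assign to nearest centroid, recompute centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)                      # pick seeds
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                                   # reassign clusters
            j = min(range(k), key=lambda i: (p[0] - centroids[i][0]) ** 2
                                            + (p[1] - centroids[i][1]) ** 2)
            clusters[j].append(p)
        for i, c in enumerate(clusters):                   # recompute centroids
            if c:
                centroids[i] = (sum(x for x, _ in c) / len(c),
                                sum(y for _, y in c) / len(c))
    return centroids, clusters

pts = [(1, 1), (1.5, 2), (3, 4), (5, 7), (3.5, 5), (4.5, 5), (3.5, 4.5)]
print(kmeans(pts, k=2))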

Page 70:

Hierarchical Agglomerative Clustering (HAC)

Assume a similarity function for determining the similarity of two instances. Greedy algorithm:

Start with each instance in its own cluster.
Pick the two most similar clusters and merge them into a new cluster.
Repeat until only one cluster remains.

The merge history forms a binary tree or hierarchy (a dendrogram).
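A greedy single-link HAC sketch; the merge history it returns corresponds to the dendrogram. The 1-D instances and the similarity function are illustrative.

def hac(items, sim):
    """Greedy agglomerative clustering: repeatedly merge the two most similar clusters."""
    clusters = [[x] for x in items]          # start: each instance is its own cluster
    merges = []                              # the merge history (the dendrogram)
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single-link similarity between clusters i and j
                s = max(sim(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or s > best[0]:
                    best = (s, i, j)
        s, i, j = best
        merged = clusters[i] + clusters[j]
        merges.append((clusters[i], clusters[j], s))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

# Toy 1-D instances with similarity = negative distance.
print(hac([1.0, 1.2, 5.0, 5.1, 9.0], sim=lambda a, b: -abs(a - b)))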

Page 71:

Purity

[Figure: three clusters (Cluster I, II, III) of labeled points.]

Cluster I:   Purity = 1/6 · max(5, 1, 0) = 5/6
Cluster II:  Purity = 1/6 · max(1, 4, 1) = 4/6
Cluster III: Purity = 1/5 · max(2, 0, 3) = 3/5

Total: Purity = 1/17 · (5 + 4 + 3) = 12/17

Page 72:

Rand Index

View clustering evaluation as a series of decisions, one for each of the N(N - 1)/2 pairs of documents in the collection.

True positive (TP): the decision assigns two similar documents to the same cluster.
True negative (TN): the decision assigns two dissimilar documents to different clusters.
False positive (FP): the decision assigns two dissimilar documents to the same cluster.
False negative (FN): the decision assigns two similar documents to different clusters.

Page 73:

Rand Index

Number of points                    Same cluster in clustering    Different clusters in clustering
Same class in ground truth          TP                            FN
Different classes in ground truth   FP                            TN

RI = (TP + TN) / (TP + FP + FN + TN)
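A small sketch computing RI directly from the pair decisions (the class and cluster labels are toy values):

from itertools import combinations

def rand_index(labels_true, labels_pred):
    """RI = (TP + TN) / (number of document pairs)."""
    agree = 0
    pairs = list(combinations(range(len(labels_true)), 2))
    for i, j in pairs:
        same_class = labels_true[i] == labels_true[j]
        same_cluster = labels_pred[i] == labels_pred[j]
        if same_class == same_cluster:      # TP (both same) or TN (both different)
            agree += 1
    return agree / len(pairs)

# Toy labels: 6 documents, 2 ground-truth classes, 2 predicted clusters.
print(rand_index(["x", "x", "x", "o", "o", "o"],
                 [1, 1, 2, 2, 2, 2]))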

Page 74:

Rand index Example

[Figure: the same three clusters (Cluster I, II, III) as in the Purity example.]

Page 75:

Thank You!  Q&A