web mapping, modeling, mining, and minglingwaw2006/winterschool/menczer-tutori… · web mapping,...

124
Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling and Mining of Networked Information Spaces Filippo Menczer Informatics & Computer Science Indiana University, Bloomington

Upload: others

Post on 09-Oct-2020

9 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling

Web mapping, modeling, mining, and mingling

Tutorial for the 2006 MITACS Winter SchoolModelling and Mining of Networked Information Spaces

Filippo MenczerInformatics & Computer ScienceIndiana University, Bloomington

Page 2: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling

NaN:Networks & agents Network

http://homer.informatics.indiana.edu/~nan/

Page 3: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling

OutlineMapping

Topical localityContent, link, and semantic topologies in the Web

ModelingHow the Web evolves and why content mattersConsequences for navigation and crawling

MiningTopical Web crawlersAdaptive, intelligent crawling techniques

MinglingSocial Web search & recommendationDistributed collaborative peer search

Page 4: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling

The Web as a text corpus

Pages close in word vector space tend to be related

Cluster hypothesis (van Rijsbergen 1979)

The WebCrawler (Pinkerton 1994)

The whole first generation of search engines

weapons

mass

destruction

p1p2

Page 5: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling

Enter the Web’s link structure

Broder & al. 2000

p(i) =!

N+ (1 ! !)

!

j:j"i

p(j)

|" : j " "|

Brin & Page 1998

Barabasi & Albert 1999

Page 6: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling

Three network topologies

Text LinksMeaning

Page 7: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling

Connection between semantic topology (topicality or relevance) and link topology (hypertext)

G = Pr[rel(p)] ~ fraction of relevant pages (generality)

R = Pr[rel(p) | rel(q) AND link(q,p)]

Related nodes are “clustered” if R > G (modularity)

Necessary and sufficient condition for a random crawler to find pages related to start points

G = 5/15C = 2R = 3/6 = 2/4

The “link-cluster” conjecture

ICML 1997

Page 8: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling

• Stationary hit rate for a random crawler:

Link-cluster conjecture

η(t + 1) = η(t) ⋅R + (1 −η(t)) ⋅G ≥ η(t)

η t→∞ → η∗ =G

1− (R −G)η∗ >G⇔ R >G

η∗

G−1 =

R−G1 − (R−G)

Value added

Conjecture

Page 9: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling

Pages that link to each other tend to be related

Preservation of semantics (meaning)

A.k.a. topic drift

Link-cluster conjecture

L(q,δ) ≡path(q, p)

{p: path(q,p ) ≤δ }∑

{p : path(q, p) ≤ δ}

R(q,δ)G(q)

≡Pr rel(p) | rel(q)∧ path(q, p) ≤ δ[ ]

Pr[rel(p)]

JASIST 2004

Page 10: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling

10

Correlation of lexical and linkage topology

L(δ): average link distance

S(δ): average similarity to start (topic) page from pages up to distance δ

Correlationρ(L,S) = –0.76

The “link-content” conjecture

S(q,δ) ≡sim(q, p)

{p: path(q,p ) ≤δ }∑

{p : path(q, p) ≤ δ}

Page 11: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling

Heterogeneity of link-content correlation

S = c + (1− c)eaLb edu net

gov

com

signif. diff. a only (p<0.05)

signif. diff. a & b (p<0.05)

org

Page 12: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling

Discussion

Topic drift: Myth or reality?

Page 13: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling

Mapping the relationship between links, content, and semantic topologies

• Given any pair of pages, need ‘similarity’ or ‘proximity’ metric for each topology:– Content: textual/lexical (cosine) similarity– Link: co-citation/bibliographic coupling– Semantic: relatedness inferred from manual classification

• Data: Open Directory Project (dmoz.org)– ~ 1 M pages after cleanup– ~ 1.3*1012 page pairs!

Page 14: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling

Content similarity€

σ c p1, p2( ) =p1 ⋅ p2p1 ⋅ p2

term i

term j

term k

p1p2

p1 p2

σ l (p1, p2) =Up1

∩Up2

Up1∪Up2

Link similarity

Page 15: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling

Semantic similarity

• Information-theoretic measure based on classification tree (Lin 1998)

• Classic path distance in special case of balanced tree€

σ s(c1,c2) =2logPr[lca(c1,c2)]logPr[c1]+ logPr[c2]

top

lca

c1

c2

Page 16: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling

Individual metric distributions

semanticcontent

link

Page 17: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling

Correlations between

similarities

Games

Arts

Business

Society

All pairs

Science

Shopping

Adult

Health

Recreation

Kids and Teens

Computers

Sports

Reference

Home

News

0 0.1 0.2 0.3 0.4 0.5

content-linkcontent-semanticlink-semantic

IEEE Internet Computing 2005

Page 18: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling

| Retrieved & Relevant || Retrieved |

| Retrieved & Relevant || Relevant |

P(sc,sl ) =

σ s(p,q){p,q:σ c = sc ,σ l = sl }

{p,q :σ c = sc,σ l = sl}

R(sc,sl ) =

σ s(p,q){p,q:σ c = sc ,σ l = sl }

σ s(p,q){p,q}∑

Averagingsemantic similarity

Summingsemantic similarity

Precision =

Recall =

Page 19: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling

Business

σc

σllog Recall Precision

Page 20: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling

Science

σc

σllog Recall Precision

Page 21: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling

Adult

σc

σllog Recall Precision

Page 22: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling

Computers

σc

σllog Recall Precision

Page 23: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling

Reference

σc

σllog Recall Precision

Page 24: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling

Home

σc

σllog Recall Precision

Page 25: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling

News

σc

σllog Recall Precision

Page 26: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling

All pairs

σc

σllog Recall Precision

Page 27: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling

Discussion: So what?

Page 28: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling

So what?

Recall

Precision

P (f, !)=

!

p,q:f("c(p,q),"l(p,q))!!

"s(p, q)

|p, q : f("c(p, q), "l(p, q)) ! !|

R(f, !)=

!

p,q:f("c(p,q),"l(p,q))!!

"s(p, q)

!

p,q"s(p, q)

Approximate relevance by semantic similarity

Page 29: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling

1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1

00.2

0.40.6

0.81

!c 00.2

0.40.6

0.81

!l

0

0.2

0.4

0.6

0.8

1

(!c+!l)/2

1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1

00.2

0.40.6

0.81

!c 00.2

0.40.6

0.81

!l

0

0.2

0.4

0.6

0.8

1

!l if !c > 0.5

1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1

00.2

0.40.6

0.81

!c 00.2

0.40.6

0.81

!l

0

0.2

0.4

0.6

0.8

1

max(!c,!l)

1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1

00.2

0.40.6

0.81

!c 00.2

0.40.6

0.81

!l

0

0.2

0.4

0.6

0.8

1

!c !l

Rank by combining content and link similarity

Page 30: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling

All pairs!l · H(!c ! 0.5)

2

3!l +

1

3!c

!c · !l

max(σc, σl)

!l

!c

Proc. WWW 2004

Page 31: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling

News

!l

!c

(!l + !c)/2

!c · !l

Page 32: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling

What about the “hole”?

σc

σl

Precision

Page 33: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling

t2

t1

t4t3

t5 t6

t7 t8TSR

Edge Type

Page 34: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling

Better semantic similarity measureWork w/Ana Maguitman & al.

Include cross-links (symbolic) and see-also links (related)

Transitive closure of topic graph

Compute entropy based on fuzzy membership matrix

!s(t1, t2) = maxk

2 · min (Wk1, Wk2) · log Pr[tk]

log(Pr[t1|tk] · Pr[tk]) + log(Pr[t2|tk] · Pr[tk])

W = T+ ! G ! T+

Page 35: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling

Differences

Page 36: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling

0 0.2 0.4 0.6 0.8 1

σsT

0

5

10

15

20

rela

tive

diffe

renc

e %

0 0.1 0.2 0.3 0.40

0.1

0 0.2 0.4 0.6 0.8 1

0.2

0.4

0.6

0.8

1< σsG>

actual difference

Page 37: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling

mean stderr

tree 5.7% 0.8%

graph 84.7% 1.8%

Page 38: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling

Combining content & links

-0.1

0

0.1

0.2

0.3

0.4

0.5

0.6

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

Spea

rman

rank

cor

rela

tion

σc threshold

σc σlσc H(σl)

σl0.2 σc + 0.8 σl0.8 σc + 0.2 σlσc

Proc. WWW 2005, WWWJ 2006

Page 39: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling

Discussion: Is content really so bad? Why?

Page 40: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling

OutlineMapping

Topical localityContent, link, and semantic topologies in the Web

ModelingHow the Web evolves and why content mattersConsequences for navigation and crawling

MiningTopical Web crawlersAdaptive, intelligent crawling techniques

MinglingSocial Web search & recommendationDistributed collaborative peer search

Page 41: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling

Preferential attachment

Pr(i) ! k(i)

“BA” model (Barabasi & Albert 1999, de Solla Price 1976)

At each step t add new page p Create m new links from p to i (i<t)

Rich-get-richer

?

Pr(i) =k(i)

mt=! Pr(k) " k#!

Page 42: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling

Other growth models

Web copying

(Kleinberg, Kumar & al 1999, 2000)

same indegree distribution

no need to know degree

Pr(i) ! Pr(j) · Pr(j " i)

Page 43: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling

Other growth modelsRandom mixture

(Pennock & al. 2002,Cooper & Frieze 2001, Dorogovtsev & al 2000)

winners don’t take all

general indegree distribution

fits non-power-law cases

Pr(i) ! ! ·1

t+ (1 " !) ·

k(i)

mt

Page 44: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling

Networks, power laws and phase transitions

++

+

+

++ ++

+

+

+

+

+

+

+

+

+

++

+

o

o

oo

o

o

o

o

o

o

o

o

o

o

o

o

o

o

o

o

o

o

o

o

o

o

o

o

o

o

o

o

o

o

o

o

o

o

oo

o

o

o

o

o

o

o

o

o

o

o

o

o

o

o

o

o

o

o

o

o

o

o

o

o

o

o

o

o

o

o

o

o

o

o

o

o

o

o

o

o

oo

o

o

o

o

oo

o

o

o

o

o

o

o

o

o

o

o

o

o

o

o

o

o

o

o

o

o

o

o

o

oo

o

o

o

o

o

o

o

o

o

o

o

o

o

o

o

o

o

o

o

o

o

o

o

o

o

o

o

o

o

o

o

o

oo

o

o

o

o

o

o

o

o

o

o

o

o

o

ooo

o

o

o

o

o

o

o

o

o

o

o

o

o

o

o

o

o

o

o

o

o

oo

o

o

o

o

o

o

o

o

o

oo

o

o

o

o

o

o

o

o

o

Raissa D’Souza, Microsoft Research

Joint with: Noam Berger, Christian Borgs, JenniferChayes, and Bobby Kleinberg.

Other growth modelsMixture with Euclidean distance(HOT: Fabrikant, Koutsoupias & Papadimitriou 2002)

tradeoff between centrality and geometric locality

fits power-law in certain critical trade-off regimes

i = arg min(!rit + gi)

Page 45: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling

Discussion:What about content?

Page 46: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling

Link probability vs lexical distance

r =1 σ c −1

Pr(λ | ρ) =(p,q) : r = ρ∧σ l > λ

(p,q) : r = ρ

Phasetransition

Power law tail

Pr(λ | ρ) ~ ρ−α(λ)

ρ*

Proc. Natl. Acad. Sci. USA 99(22): 14014-14019, 2002

Page 47: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling

Local content-based growth model

γ=4.3±0.1

γ=4.26±0.07

• Similar to preferential attachment (BA)

• Use degree info (popularity/importance) only for nearby (similar/related) pages

Pr(pt → pi< t ) =k(i)mt

if r(pi, pt ) < ρ*

c[r(pi, pt )]−α otherwise

Page 48: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling

So, many models can predict degree distributions...

Which is “right” ?

Need an independent observation (other than degree) to validate models

Distribution of content similarity across linked pairs

Page 49: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling

None of these models is right!

Page 50: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling

Back to the mixture model

Bias choice by content similarity instead of uniform distribution

Pr(i) ! ! ·1

t+ (1 " !) ·

k(i)

mtdegree-uniform mixture

ai2

i1

i3t

bi2

i1

i3t

ci2

i1

i3t

Page 51: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling

Degree-similarity mixture model

Pr(i) ! ! · P̂r(i) + (1 " !) ·k(i)

mt

! = 0.2, " = 1.7

P̂r(i) ! [r(i, t)]"!

Page 52: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling

Build it...

(M.M.)

Page 53: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling

Both mixture models get the degree distribution right…

Page 54: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling

…but the degree-similarity mixture model predicts the similarity distribution better

Proc. Natl. Acad. Sci. USA 101: 5261-5265, 2004

Page 55: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling

0

4

DMOZ

0

0.25

0.5

0.75

1

!l

0

2

PNAS

0 0.25 0.5 0.75 1

!c

0

0.25

0.5

0.75

1

!l

Citation networks

15,785 articles

published in PNAS between 1997 and

2002

Page 56: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling

Citation networks

Page 57: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling

Discussion

What questions remain open?

What other factors should we consider?

Page 58: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling

Open QuestionsUnderstand distribution of content similarity across all pairs of pages

Using TF: exponential

Using TF-IDF: power law

Growth model to explain co-evolution of both link topology and content similarity

The role of search engines (keynote)

Pr(!) ! ""#!

Pr(!) ! "!"#

Page 59: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling

Efficient crawling algorithms?

Practice: can’t find them!

• Greedy algorithms based on location in geographical small world networks: ~ poly(N) (Kleinberg 2000)

• Greedy algorithms based on degree in power law networks: ~ N (Adamic, Huberman & al. 2001)

Theory: since the Web is a small world network, or has a scale free degree distribution, short paths exist between any two pages:

~ log N (Barabasi & Albert 1999)~ log N / log log N (Bollobas 2001)

Page 60: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling

Exception # 1

• Geographical networks (Kleinberg 2000)– Local links to all lattice neighbors– Long-range link probability distribution:

power law Pr ~ r–α

• r: lattice (Manhattan) distance• α: constant clustering exponent

t ~ log2 N⇔α = D

Page 61: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling

Is the Web a geographical network?

local links

long range links (power law tail)

Replace lattice distance by lexical distancer = (1 / σc) – 1

Page 62: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling

Exception # 2

• Hierarchical networks (Kleinberg 2002, Watts & al. 2002)– Nodes are classified at the leaves of tree– Link probability distribution: exponential tail

Pr ~ e–h

• h: tree distance (height of lowest common ancestor)

h=1h=2

t ~ logε N,ε ≥1

Page 63: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling

exponential tail

Is the Web a hierarchical network? Replace tree distance by semantic distance

h = 1 – σs

top

lca

c1

c2

Page 64: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling

Discussion: Does this stuff really have any

applications?

Page 65: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling

OutlineMapping

Topical localityContent, link, and semantic topologies in the Web

ModelingHow the Web evolves and why content mattersConsequences for navigation and crawling

MiningTopical Web crawlersAdaptive, intelligent crawling techniques

MinglingSocial Web search & recommendationDistributed collaborative peer search

Page 66: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling

Crawler applications• Universal Crawlers

– Search engines!

• Topical crawlers– Live search

(e.g., myspiders.informatics.indiana.edu)

– Topical search engines & portals

– Business intelligence (find competitors/partners)

– Distributed, collaborative search

Page 67: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling

Topical crawlers

• Seminal work– Cho & Garcia-Molina @ Stanford– Chakrabarti & al @ IBM Almaden / IIT– Amento & al @ ATT Shannon– Ben-Shaul & al @ IBM Haifa– many others…

Page 68: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling

spears

[sic]Topical crawlers

Page 69: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling

Evaluating topical crawlers

• Goal: build “better” crawlers to support applications • Build an unbiased evaluation framework

– Define common tasks of measurable difficulty– Identify topics, relevant targets– Identify appropriate performance measures

• Effectiveness: quality of crawler pages, order, etc.• Efficiency: separate CPU & memory of crawler algorithms from

bandwidth & common utilities

Information Retrieval 2005

Page 70: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling

Evaluating topical crawlers: Topics

• Automate evaluation using edited directories

• Different sources of relevance assessments

Keywords

Description

Targets

Page 71: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling

Topics and Targets

topic level ~ specificitydepth ~ generality

Page 72: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling

Evaluating topical crawlers: TasksStart from seeds, find targets

and/or pages similar to target descriptions

d=2

d=3

Page 73: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling

Performance matrix

targetdepth

“recall” “precision”

targetpages

targetdescriptions

d=0d=1

d=2€

Sct ∩TdTd

σ c (p,Dd )p∈Sc

t

∑€

Sct ∩TdSct

σ c (p,Dd )p∈Sc

t

Sct

Page 74: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling

Discussion:What assumption (targets)?

R(Relevant)

S(Crawled)

T(Targets)

T ∩ S

R ∩ S

Page 75: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling

Evaluating topical crawlers: Architecture

URL

Web

CommonData

Structures

Crawler 1Logic

MainKeywords

Seed URLs

Crawler NLogic

Private DataStructures

(limited resource)

Concurrent Fetch/Parse/Stem Modules

HTTP HTTP HTTP

Page 76: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling

Examples of crawling algorithms

• Breadth-First– Visit links in order encountered

• Best-First– Priority queue sorted by similarity– Variants:

– explore top N at a time– tag tree context– hub scores

• SharkSearch– Priority queue sorted by combination of similarity,

anchor text, similarity of parent, etc.

• InfoSpiders

Page 77: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling

InfoSpiders adaptive distributed algorithm using an evolving population of learning agents

keywordvector neural net

local frontier

offspring

Page 78: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling
Page 79: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling
Page 80: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling
Page 81: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling
Page 82: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling

Foreach agent thread: Pick & follow link from local frontier Evaluate new links, merge frontier Adjust link estimator E := E + payoff - cost

If E < 0: Die Elsif E > Selection_Threshold: Clone offspring Split energy with offspring Split frontier with offspring Mutate offspring

Evolutionary Local Selection Algorithm (ELSA)

selectivequery

expansion

matchresource

bias

reinforcementlearning

Page 83: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling

Action selection

Page 84: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling

Q-learningCompare estimated relevance of visited document with estimated relevance of link followed from previous page

Teaching input: E(D) + µ maxl(D) λl

Page 85: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling

Performance

ACM Trans. Internet Technology 2003

Pages crawled

Avg

tar

get

reca

ll

Page 86: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling

Exploration vs. Exploitation

Pages crawled

Avg

targ

et r

ecal

l

Page 87: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling

Efficiency & scalability

Link frontier size

Perf

orm

ance

/cos

t

Page 88: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling

DOM contextLink score = linear combination between page-based and context-based similarity score

Page 89: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling

Co-citation: hub scoresLink scorehub = linear combination between

link and hub score

Page 90: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling

Recall (159 ODP topics)

Split ODP URLs between seeds and

targets

Add 10 best hubs to seeds for 94 topics

0

5

10

15

20

25

30

35

40

45

0 2000 4000 6000 8000 10000

aver

age

targ

et re

call@

N (%

)

N (pages crawled)

Breadth-FirstNaive Best-First

DOMHub-Seeker

0

5

10

15

20

25

30

35

40

45

0 2000 4000 6000 8000 10000

aver

age

targ

et re

call@

N (%

)

N (pages crawled)

Breadth-FirstNaive Best-First

DOMHub-Seeker

ECDL 2003

Page 91: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling

Normative lessons about crawlers

• Remember links from previous pages• Exploration is important• Adaptation can help

– to focus search– to expand query– to learn link estimates

• Distributed algorithms – boost efficiency– but watch for premature exploitation

• Improve link evaluation by looking at – parent pages– DOM context

Page 92: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling

Discussion: Which crawler algorithm would

you use in your app?

Page 93: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling

OutlineMapping

Topical localityContent, link, and semantic topologies in the Web

ModelingHow the Web evolves and why content mattersConsequences for navigation and crawling

MiningTopical Web crawlersAdaptive, intelligent crawling techniques

MinglingSocial Web search & recommendationDistributed collaborative peer search

Page 94: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling

hierarchical

flat

centralized social

Structure

CollaborationWWW2006, AAAI2006

Page 95: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling

news research

+ = σ★ ☆★★☆★ ☆☆

σA★ ☆☆★

σB

fun

Page 96: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling

FA(x) FA(y)

+ = σ(x,y)

FB(a(x,y))

!(x, y) =1

N

N!

u=1

2 log

"|Fu[a(x,y)]|

|Ru|

#

log|Fu(x)|

|Ru| + log|Fu(y)||Ru|

Page 97: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling
Page 98: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling

10-16

10-14

10-12

10-10

10-8

10-6

10-4

10-1

100

101

102

103

104

Fra

ctio

n o

f U

RL

s

Degree

smin = 0

smin = 0.0025

smin = 0.0055

Pr(k) !! exp(-0.002*k)

Pr(k) !! k-2

degree

URLs

Page 99: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling
Page 100: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling

GoogleGaL

RelevantSet

UserStudy

Page 101: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling

c(x) =1

|U |!

y!U

"

#$1 + minx!y

!

(u,v)!x!y

%1

!(u, v)"1

&'

()

"1

pt+1(x) = (1 ! !) + ! ·!

y"U

"(x, y) · pt(y)"

z"U "(y, z)

Centrality&

Prestige

Page 102: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling

0.1

0.15

0.2

0.25

0.3

0.35

0.4

0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5

Pre

cis

ion

Recall

GiveALink s.pGiveALink s

u0

p=0.9

u3

p=0.5

u2

p=1.3

u1

p=1.3

s=0.1

s*p=0.13

s=0.1

s*p=0.13

s=0.2

s*p=0.11

s=0.6

Ranking by Similarity * Prestige

Page 103: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling

Ranking &

Recommendation by Novelty

!(x, y) =

!1 + min

z!U

"1

"(x, z)+

1

"(z, y)" 2

#$"1

"(x, y)

x

y

z

Page 104: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling

Personalization

!(x, A) = maxy!A

!"(x, y) · log

"N

N(y)

#$

Page 105: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling

spam

collaborative filtering, social semantic similarity, unlinked pages, multimedia content, trust

scalability,

density

Page 106: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling

Donate!givealink.org

Page 107: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling

Discussion: Critical mass?

Page 108: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling

Peer distributed crawling andcollaborative searching

• Crawl when idle– Start from bookmarks– Past queries or selected

documents as topics

• Index locally– User relevance feedback

is natural– No centralized

coordination bottleneck• Unlike grub.org,

hyperbee.com

bookmarks

localstorage

WWW

indexcrawler

peer

Page 109: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling

http://homer.informatics.indiana.edu/~nan/6S/

Page 110: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling

http://homer.informatics.indiana.edu/~nan/6S/

Page 111: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling

http://homer.informatics.indiana.edu/~nan/6S/

Page 112: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling

http://homer.informatics.indiana.edu/~nan/6S/

Page 113: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling

http://homer.informatics.indiana.edu/~nan/6S/

Page 114: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling

6S: peer distributed crawling and collaborative searching

AB

query

query

hit

hit

Data mining & referral opportunities

query

query

query

C

Emerging communities

Page 115: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling

Algorithm 1: Random Known

Page 116: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling

Algorithm 2: Greedy

Page 117: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling

Algorithm 3: Reinforcement

Page 118: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling

Learning Peers

A focused profile Wf of other peers for terms in queries

An expanded profile We of other peers for terms not in queries, but that co-occur with query terms and with higher frequency

Query routing:

wi,p(t + 1) = (1 ! !) · wi,p(t) + ! ·!

Sp + 1

Sl + 1! 1

"

!(p, Q) =!

i!Q

"" · w

fi,p + (1 " ") · we

i,p

#

Page 119: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0 0.05 0.1 0.15 0.2 0.25

Prec

ision

Recall

6SCentralized Search Engine

Google

Distributed vs CentralizedWTAS 2005

Page 120: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling

P@10WWW 2005

Page 121: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling

! " #! #" $! $"!$!

!

$!

%!

&!

'!

#!!

#$!

#%!

()*+,*-./*+./**+

01*+02*.34052*.678

39)-:*+.3;*<<,3,*5:=,0>*:*+

Pajek Pajek

Small-worldWWW 2004

Page 122: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling

Semantic similarity

Arts/Movies/Filmmaking

Business/Arts_and_Entertainment/Fashion

Business/E-Commerce/Developers

Business/Telecommunications/Call_Centers

Computers/Programming/Graphics

Health/Conditions_and_Diseases/Cancer

Health/Mental_Health/Grief,_Loss_and_Bereavement

Health/Professions/Midwifery

Health/Reproductive_Health/Birth_Control

Home/Family/Pregnancy

Shopping/Clothing/Accessories

Shopping/Clothing/Footwear

Shopping/Clothing/Uniforms

Shopping/Sports/Cycling

Shopping/Visual_Arts/Artist_Created_Prints

Society/Issues/Abortion

Society/People/Women

Sports/Cycling/Racing

P2PIR 2006

Page 123: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling

Discussion:

Would you use 6S? Why or why not?

Page 124: Web mapping, modeling, mining, and minglingwaw2006/winterschool/Menczer-tutori… · Web mapping, modeling, mining, and mingling Tutorial for the 2006 MITACS Winter School Modelling

Thank you!Questions?

informatics.indiana.edu/fil

Research supported by NSF CAREER Award IIS-0348940

homer.informatics.indiana.edu/~nan

www.dataandsearch.org