ACFOCS 2004 Chakrabarti 1
Using Graphs in Unstructured and Semistructured Data Mining
Soumen Chakrabarti
IIT Bombay
www.cse.iitb.ac.in/~soumen
ACFOCS 2004 Chakrabarti 2
Acknowledgments C. Faloutsos, CMU W. Cohen, CMU IBM Almaden (many colleagues) IIT Bombay (many students) S. Sarawagi, IIT Bombay S. Sudarshan, IIT Bombay
ACFOCS 2004 Chakrabarti 3
Graphs are everywhere Phone network, Internet, Web Databases, XML, email, blogs Web of trust (epinion) Text and language artifacts (WordNet) Commodity distribution networks
Internet Map [lumeta.com]
Food Web [Martinez1991]
Protein Interactions [genomebiology.com]
ACFOCS 2004 Chakrabarti 4
Why analyze graphs?
What properties do real-life graphs have? How important is a node, and what is "importance"? Who is the best customer to target in a social network? Who spread a raging rumor? How similar are two nodes? How do nodes influence each other? Can I predict some property of a node based on its neighborhood?
ACFOCS 2004 Chakrabarti 5
Outline, some more detail Part 1 (Modeling graphs)
What do real-life graphs look like? What laws govern their formation, evolution and
properties? What structural analyses are useful?
Part 2 (Analyzing graphs) Modeling data analysis problems using graphs Proposing parametric models Estimating parameters Applications from Web search and text mining
ACFOCS 2004 Chakrabarti 6
Modeling and generating realistic graphs
ACFOCS 2004 Chakrabarti 7
Questions What do real graphs look like?
Edges, communities, clustering effects
What properties of nodes, edges are important to model? Degree, paths, cycles, …
What local and global properties are important to measure?
How to artificially generate realistic graphs?
ACFOCS 2004 Chakrabarti 8
Modeling: why care? Algorithm design
Can skewed degree distribution make our algorithm faster?
Extrapolation How well will Pagerank work on the Web 10
years from now?
Sampling Make sure scaled-down algorithm shows same
performance/behavior on large-scale data
Deviation detection Is this page trying to spam the search engine?
ACFOCS 2004 Chakrabarti 9
Laws – degree distributions
Q: The average degree is ~10; what is the most probable degree?
[Plot sketch: count vs. degree, with the average at 10 and the mode marked "??"]
ACFOCS 2004 Chakrabarti 10
Laws – degree distributions
Q: The average degree is ~10; what is the most probable degree?
[Plot: the actual count-vs-degree distribution is heavily skewed; the mode is far below the average of 10]
ACFOCS 2004 Chakrabarti 11
Power-law: outdegree O
The plot is linear in log-log scale [FFF'99]: freq = degree^(-2.15). The exponent O = -2.15 is the slope.
[Log-log plot: frequency vs. outdegree, Nov '97 data, fitted slope -2.15]
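Measuring such an exponent amounts to a least-squares line fit in log-log space; a minimal sketch on synthetic data (the data is fabricated to follow the slide's -2.15 exponent, purely for illustration):

```python
import numpy as np

# Synthetic degree-frequency data following freq = degree^(-2.15),
# mimicking the shape of the Nov '97 outdegree plot.
degrees = np.arange(1, 1000)
freq = degrees ** -2.15

# The power-law exponent is the slope of the best-fit line in log-log space.
slope, intercept = np.polyfit(np.log(degrees), np.log(freq), 1)
```

On real, noisy counts the fit is usually restricted to the tail of the distribution.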
ACFOCS 2004 Chakrabarti 12
Power-law: rank R
The plot is a line in log-log scale; the exponent R = -0.74 is the slope. Rank: nodes in decreasing outdegree order.
[Log-log plot: outdegree vs. rank, Dec '98 data]
ACFOCS 2004 Chakrabarti 13
Eigenvalues
Let A be the adjacency matrix of the graph. An eigenvalue λ satisfies A v = λ v, where v is some (eigen)vector. Eigenvalues are strongly related to graph topology.
[Example: a 4-node graph on A, B, C, D with its 0/1 adjacency matrix]
ACFOCS 2004 Chakrabarti 14
Power-law: eigenvalues E
Plot the eigenvalues in decreasing order; the exponent E = -0.48 is the slope.
[Log-log plot: eigenvalue vs. rank of decreasing eigenvalue, Dec '98 data]
ACFOCS 2004 Chakrabarti 15
The Node Neighborhood
N(h) = # of pairs of nodes within h hops. Let the average degree be 3. How many neighbors should I expect within 1, 2, …, h hops?
Potential answer: 1 hop -> 3 neighbors; 2 hops -> 3 * 3; …; h hops -> 3^h
ACFOCS 2004 Chakrabarti 16
The Node Neighborhood
Same question: but the "potential answer" of 3^h neighbors within h hops overcounts. WE HAVE DUPLICATES!
ACFOCS 2004 Chakrabarti 17
The Node Neighborhood
And because the degree distribution is so skewed, the "avg" degree is meaningless!
ACFOCS 2004 Chakrabarti 18
Power-law: hop-plot H
Pairs of nodes as a function of hops: N(h) = h^H.
[Log-log plots of # of pairs vs. hops: Dec '98 data, H = 4.86; router-level '95 data, H = 2.83]
ACFOCS 2004 Chakrabarti 19
Observation
Q: What is the intuition behind the "hop exponent"? A: it is the "intrinsic" (fractal) dimensionality of the network: N(h) ~ h^1 for a chain, N(h) ~ h^2 for a grid, and so on.
ACFOCS 2004 Chakrabarti 20
Any other ‘laws’? The Web looks like a “bow-tie”
[Kumar+1999] IN, SCC, OUT, ‘tendrils’ Disconnected components
ACFOCS 2004 Chakrabarti 21
Generators How to generate graphs from a realistic
distribution? Difficulty: simultaneously preserving many
local and global properties seen in realistic graphs Erdos-Renyi: switch on each edge
independently with some probability Problem: degree distribution not power-law
Degree-based Process-based (“preferential attachment”)
ACFOCS 2004 Chakrabarti 22
Degree-based generator Fix the degree distribution (e.g., ‘Zipf’) Assign degrees to nodes
Add matching edges to satisfy degrees No direct control over
other properties “ACL model”
[AielloCL2000]
ACFOCS 2004 Chakrabarti 23
Process-based: Preferential attachment
Start with a clique of m nodes. Add one node v at every time step; v makes m links to old nodes. Suppose old node u has degree d(u); let p_u = d(u) / Σ_w d(w).
v invokes a multinomial distribution defined by the set of p's and links to whichever u's show up.
At time t, there are m+t nodes and mt links. What is the degree distribution?
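The growth process can be simulated directly; a minimal sketch (the repeated-targets list makes a uniform draw equivalent to choosing u with probability p_u = d(u)/Σ_w d(w); the function name and parameters are illustrative):

```python
import random

def preferential_attachment(m, t, seed=0):
    """Start with an m-clique; add t nodes, each linking to m old nodes
    chosen with probability proportional to their current degree."""
    rng = random.Random(seed)
    targets = []            # node u appears in this list d(u) times
    edges = []
    for u in range(m):      # initial clique
        for v in range(u + 1, m):
            edges.append((u, v))
            targets += [u, v]
    for step in range(t):
        v = m + step
        # m multinomial draws; duplicates collapse, so v may get < m links
        chosen = {rng.choice(targets) for _ in range(m)}
        for u in chosen:
            edges.append((v, u))
            targets += [v, u]
    return edges

edges = preferential_attachment(m=3, t=500)
```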
ACFOCS 2004 Chakrabarti 24
Preferential attachment: analysis
k_i(t) = degree of node i at time t: a discrete random variable. Approximate it as a continuous random variable and let κ_i(t) = E(k_i(t)), the expectation over random linking choices.
At each time step m degrees are added, and the total degree at time t is ≈ 2mt, so by linearity of expectation the infinitesimal expected growth rate of κ_i(t) is
  dκ_i(t)/dt = m · κ_i(t)/(2mt) = κ_i(t)/(2t)
Solving, with t_i the time at which node i was born (so κ_i(t_i) = m):
  κ_i(t) = m (t/t_i)^(1/2)
ACFOCS 2004 Chakrabarti 25
Preferential attachment, continued
The expected degree of each node grows as the square-root of its age: κ_i(t) = m (t/t_i)^(1/2).
Let the current time be t. A node must be old enough for its degree to be large: for κ_i(t) > k we need t_i < m² t / k².
Therefore the fraction of nodes with degree larger than k is (m² t / k²)/t = m²/k², and so
  Pr(degree = k) ≈ const/k³  (data is closer to exponent 2)
ACFOCS 2004 Chakrabarti 26
Bipartite cores
Basic preferential attachment does not explain dense/complete bipartite cores (100,000's of them in a ~20-million-page crawl), nor does it account for the influence of search engines. The story isn't over yet.
[Illustration: a 2:3 core (an n:m core); plot of number of cores vs. log m, one curve per n = 2, …, 7]
ACFOCS 2004 Chakrabarti 27
Other process-based generators
"Copying model" [KumarRRTU2000]: a new node v picks an old reference node r u.a.r.; v adds k new links to old nodes; for the ith link, w.p. α it links to an old node picked u.a.r., and w.p. 1-α it copies the ith link of r.
More difficult to analyze. Reference nodes also suggest graph compression techniques!
H.O.T.: connect to the closest high-connectivity neighbor [Fabrikant+2002]
Winner does not take all [Pennock+2002]
ACFOCS 2004 Chakrabarti 28
Reference-based graph compression Well-motivated: pack graph into limited fast
memory for, e.g., query-time Web analysis Standard approach
Assign integer IDs to URLs, lexicographic order Delta or Gamma encoding of outlink IDs
If link-copying is rampant, we should be able to compress Outlinks(u) by recording a reference node r plus the "correction" Outlinks(r) Δ Outlinks(u).
Finding r: what's optimal? What's practical? [Adler, Mitzenmacher, Boldi, Vigna 2002-2004]
ACFOCS 2004 Chakrabarti 29
Reference-based compression, cont'd
r is a candidate reference for u if Outlinks(r) ∩ Outlinks(u) is "large enough".
Given G, construct G' with a directed edge from r to u whose cost = the number of bits needed to write down Outlinks(u) given Outlinks(r).
Add a dummy node z (no outlinks in G), connected to each u in G', with cost(z, u) = # bits to write Outlinks(u) without a reference.
Compute the shortest-path tree rooted at z. In practice, pick a "recent" r; this achieves ~2.58 bits/link.
ACFOCS 2004 Chakrabarti 30
Summary: Power laws are everywhere
Bible: rank vs. word frequency Length of file transfers [Bestavros+] Web hit counts [Huberman] Click-stream data [Montgomery+01] Lotka’s law of publication count (CiteSeer data)
[Log-log plots: word frequency vs. rank in the Bible ("the", "a" at the top); citation count distribution from CiteSeer (J. Ullman among the most cited)]
ACFOCS 2004 Chakrabarti 31
Resources Generators
R-MAT [email protected] BRITE www.cs.bu.edu/brite/ INET topology.eecs.umich.edu/inet
Visualization tools Graphviz www.graphviz.org Pajek vlado.fmf.uni-lj.si/pub/networks/pajek
Kevin Bacon web site: www.cs.virginia.edu/oracle
Erdös numbers etc.
ACFOCS 2004 Chakrabarti 32
R-MAT: Recursive MATrix generator Goals
Power-law in- and out-degrees
Power-law eigenvalues Small diameter (“six
degrees of separation”) Simple, few parameters
Approach: subdivide the 2^n × 2^n adjacency matrix and choose a quadrant with probability (a, b, c, d).
[Diagram: "From" × "To" matrix split into quadrants with example probabilities a = 0.5, b = 0.1, c = 0.15, d = 0.25]
ACFOCS 2004 Chakrabarti 33
R-MAT algorithm, cont'd
Subdivide the adjacency matrix; choose a quadrant with probability (a, b, c, d); recurse till we reach a 1×1 cell.
By construction: rich gets richer for in- and out-degree; self-similar (communities within communities); small diameter.
[Diagram: recursive subdivision of the 2^n × 2^n matrix into quadrants a, b, c, d]
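The recursion can be sketched per sampled edge; an illustrative version (real R-MAT implementations also perturb the quadrant probabilities slightly at each level, which is omitted here):

```python
import random

def rmat_edge(n_levels, probs, rng):
    """Sample one edge of a 2^n_levels x 2^n_levels adjacency matrix by
    recursively picking quadrant a/b/c/d with the given probabilities."""
    a, b, c, d = probs
    row = col = 0
    for _ in range(n_levels):
        row, col = 2 * row, 2 * col
        r = rng.random()
        if r < a:
            pass                      # top-left quadrant
        elif r < a + b:
            col += 1                  # top-right
        elif r < a + b + c:
            row += 1                  # bottom-left
        else:
            row += 1; col += 1        # bottom-right
    return row, col

rng = random.Random(42)
edges = [rmat_edge(10, (0.5, 0.1, 0.15, 0.25), rng) for _ in range(5000)]
```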
ACFOCS 2004 Chakrabarti 34
Evaluation on clickstream data
[Panels comparing real clickstream data with R-MAT: count vs. indegree, count vs. outdegree, hop-plot, left and right "network value", and singular value vs. rank; R-MAT matches the data well]
ACFOCS 2004 Chakrabarti 35
Topic structure of the Web Measure correlations between link proximity
and content similarity How to characterize “topics”? Started with http://dmoz.org Keep pruning until all leaf topics
have enough (>300) samples. Approx 120k sample URLs, flattened to approx 482 topics. Train a text classifier; characterize a new document d as a vector of topic probabilities p_d = (Pr(c|d) over topics c).
[Example classifier output on a test doc: Arts 0.1, Computers 0.3, Science 0.6]
ACFOCS 2004 Chakrabarti 36
Sampling the background topic distrib. What fraction of Web pages are about
/Health/Diabetes? How to sample the Web? Invoke the “random surfer” model
(Pagerank) Walk from node to node Sample trail adjusting for Pagerank
Modify Web graph to do better sampling Self loops Bidirectional edges
ACFOCS 2004 Chakrabarti 37
Convergence
Start from pairs of diverse topics; run two random walks and sample from each; measure the distance between the sampled topic distributions:
  L1 distance |p1 - p2| = Σ_c |p1(c) - p2(c)|, which lies in [0, 2]
The distance falls below .05-.2 within 300-400 physical pages.
[Bar chart: the background distribution over top-level topics (Arts, Business, Computers, Games, Health, Home, Recreation, Reference, Science, Shopping, Society, Sports); line plot: distribution difference vs. #hops (0-1000) for strides of 30k and 75k]
ACFOCS 2004 Chakrabarti 38
Biases in topic directories Use Dmoz to train a
classifier Sample the Web Classify samples Diff Dmoz topic distribution
from Web sample topic distribution
Report maximum deviation in fractions
NOTE: Not exactly Dmoz
Dmoz over-represents: Games.Video_Games, Society.People, Arts.Celebrities…, Education.Colleges…, Travel.Reservations
Dmoz under-represents: …WWW…Directories!, Sports.Hockey, Society.Philosophy, Education…K12…, Recreation…Camping
ACFOCS 2004 Chakrabarti 39
Topic-specific degree distribution Preferential attachment:
connect u to v w.p. proportional to the degree of v, regardless of topic
More realistic: u has a topic, and links to v with related topics
Unclear if power-law should be upheld
[Diagram: intra-topic vs. inter-topic linkage]
ACFOCS 2004 Chakrabarti 40
Random forward walk without jumps
[Plots for /Arts/Music and /Sports/Soccer: L_1 distance vs. wander hops (0-20), measured from the background distribution and from hop 0]
The sampling walk is designed to mix topics well; how about walking forward without jumping?
Start from a page u_0 on a specific topic and take a forward random walk (u_0, u_1, …, u_i, …).
Compare (Pr(c|u_i) over topics c) with (Pr(c|u_0) over topics c) and with the background distribution.
ACFOCS 2004 Chakrabarti 41
Observations and implications
Forward walks wander away from the starting topic slowly, but do not converge to the background distribution.
Global Pagerank is OK also for topic-specific queries: with jump parameter d = .1-.2, topic drift is not too bad within path lengths of 5-10, and prestige is conferred mostly by same-topic neighbors. This also explains why focused crawling works.
[Diagram: w.p. d jump to a random node; w.p. (1-d) jump to an out-neighbor u.a.r.; a high-prestige node]
ACFOCS 2004 Chakrabarti 42
Citation matrix Given a page is about topic i, how likely is it
to link to topic j? Matrix C[i,j] = probability that page about topic i
links to a page about topic j.
Soft counting: C[i,j] += Pr(i|u)·Pr(j|v) for each link (u, v).
Applications: classifying Web pages into topics; focused crawling for topic-specific pages; finding relations between topics in a directory.
ACFOCS 2004 Chakrabarti 43
Citation, confusion, correction
[Citation matrices over the top-level topics (Arts, Business, Computers, Games, Health, Home, Recreation, Reference, Science, Shopping, Society, Sports): "from topic" × "to topic" counted with true topics and with guessed topics, alongside the classifier's confusion matrix (true topic × guessed topic)]
The classifier's confusion on held-out documents can be used to correct the citation matrix.
ACFOCS 2004 Chakrabarti 44
Fine-grained views of citation
Clear block structure derived from coarse-grain topics. Strong diagonals reflect tightly-knit topic communities; prominent off-diagonal entries raise design issues for taxonomy editors and maintainers.
ACFOCS 2004 Chakrabarti 45
Outline, some more detail Part 1 (Modeling graphs)
What do real-life graphs look like? What laws govern their formation, evolution and
properties? What structural analyses are useful?
Part 2 (Analyzing graphs) Modeling data analysis problems using graphs Proposing parametric models Estimating parameters Applications from Web search and text mining
ACFOCS 2004 Chakrabarti 46
Centrality and prestige
ACFOCS 2004 Chakrabarti 47
How important is a node? Degree, min-max radius, … Pagerank Maximum entropy network flows HITS and stochastic variants Stability and susceptibility to spamming Hypergraphs and nonlinear systems Using other hypertext properties Applications: Ranking, crawling, clustering,
detecting obsolete pages
ACFOCS 2004 Chakrabarti 48
Importance/prestige as Pagerank
A node is important if it is connected to important nodes. A "random surfer" walks along links forever; if the current node has 3 outlinks, he takes each with probability 1/3. Importance = steady-state probability (ssp) of a visit: "Maxwell's equation for the Web".
  PR(v) = Σ_{(u,v)∈E} PR(u) / OutDegree(u)
[Diagram: node u with OutDegree(u) = 3, one outlink leading to v]
ACFOCS 2004 Chakrabarti 49
(Simplified) Pagerank algorithm
Let A be the node adjacency matrix; column-normalize its transpose A^T (each entry of column u becomes 1/OutDegree(u)).
We want a vector p such that A^T p = p, i.e., p is the eigenvector corresponding to the largest eigenvalue, which is 1.
[Example: a 5-node graph and its "To × From" matrix A^T, with entries 1 and 1/2, applied to (p1, …, p5)]
ACFOCS 2004 Chakrabarti 50
Intuition
A as a vector transformation: the matrix A = [[2, 1], [1, 3]] maps x = (1, 0) to x' = (2, 1).
[Diagram: x and x' drawn in the plane]
ACFOCS 2004 Chakrabarti 51
Intuition
By definition, eigenvectors remain parallel to themselves ("fixed points"):
  [[2, 1], [1, 3]] · (0.52, 0.85) = 3.62 · (0.52, 0.85)
Here λ1 = 3.62 and v1 = (0.52, 0.85).
ACFOCS 2004 Chakrabarti 52
Convergence
Power iteration usually converges fast; the rate depends on the ratio of the two largest eigenvalues, λ1 : λ2.
ACFOCS 2004 Chakrabarti 53
Eigenvalues and epidemic networks
Will a virus spreading across an arbitrary network create an epidemic?
Susceptible-Infected-Susceptible (SIS) model: each node is either susceptible/healthy or infected & infectious; a susceptible node can be infected by a neighbor, an infected node can be cured internally, and cured nodes immediately become susceptible again.
[State diagram: susceptible/healthy ⇄ infected & infectious]
ACFOCS 2004 Chakrabarti 54
The infection model (virus)
Birth rate β: probability that an infected neighbor attacks.
Death rate δ: probability that an infected node heals.
[Diagram: node N with infected neighbors N1, N3 attacking with prob. β each, and infected N2 healing with prob. δ]
ACFOCS 2004 Chakrabarti 55
Working parameters
p_{i,t} = prob. that node i is infected at time t. The probability that node i does NOT receive an infection from its neighbors is
  ζ_{i,t} = Π_{j:(j,i)∈E} (1 - β·p_{j,t-1})
Node i is healthy at time t if it was healthy and did not get infected; or was infected and got cured, receiving no further attack; or was infected, got attacked by neighbors, and yet cured itself in this time step (a somewhat arbitrary modeling choice, taken with weight 1/2):
  1 - p_{i,t} = (1 - p_{i,t-1})·ζ_{i,t} + δ·p_{i,t-1}·ζ_{i,t} + (δ/2)·p_{i,t-1}·(1 - ζ_{i,t})
ACFOCS 2004 Chakrabarti 56
Time recurrence for p_{i,t}
Assuming the various probabilities are "suitably small", the no-infection probability ζ_{i,t} = Π_{j:(j,i)∈E} (1 - β·p_{j,t-1}) ≈ 1 - β·Σ_{j:(j,i)∈E} p_{j,t-1}, and the recurrence of the probability of infection can be approximated by the linear form
  p_{i,t} ≈ (1 - δ)·p_{i,t-1} + β·Σ_{j:(j,i)∈E} p_{j,t-1}
In other words, the (symmetric) transition matrix is
  S = (1 - δ)·I + β·A
ACFOCS 2004 Chakrabarti 57
Epidemic threshold
Virus "strength" s = β/δ. The epidemic threshold of a graph is the value τ such that if s = β/δ < τ, an epidemic cannot happen.
Problem: compute the epidemic threshold.
[Diagram: healthy node N, infected neighbors N1, N3 attacking with prob. β, infected N2 healing with prob. δ]
ACFOCS 2004 Chakrabarti 58
Epidemic threshold
What should τ depend on? The avg. degree? And/or the highest degree? And/or the variance of the degree? And/or the third moment of the degree?
ACFOCS 2004 Chakrabarti 59
Analysis
The eigenvectors of S and A are the same, and the eigenvalues are shifted and scaled:
  λ_{i,S} = 1 - δ + β·λ_{i,A}
From the spectral decomposition S = Σ_i λ_{i,S}·u_i·u_i^T and p(t) = S^t·p(0), a sufficient condition for the infection dying down is
  λ_{1,S} < 1, i.e., 1 - δ + β·λ_{1,A} < 1, or β/δ < 1/λ_{1,A}
ACFOCS 2004 Chakrabarti 60
Epidemic threshold
An epidemic must die down if
  β/δ < τ = 1/λ_{1,A}
where β is the attack probability, δ is the recovery probability, τ is the epidemic threshold, and λ_{1,A} is the largest eigenvalue of the adjacency matrix A.
Proof: [Wang+03]
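Both the threshold and the linearized recurrence are easy to check numerically; a small sketch (the star graph is chosen because its largest eigenvalue, 2, is known in closed form):

```python
import numpy as np

def epidemic_threshold(A):
    """tau = 1 / lambda_1(A): if beta/delta < tau, the infection dies down."""
    return 1.0 / np.linalg.eigvalsh(A).max()

def sis_step(p, A, beta, delta):
    """One step of the linearized recurrence p_t = ((1-delta)I + beta*A) p_{t-1}."""
    return (1 - delta) * p + beta * (A @ p)

A = np.zeros((5, 5))          # star: node 0 joined to nodes 1..4
A[0, 1:] = A[1:, 0] = 1.0
tau = epidemic_threshold(A)   # = 1/2 for this graph

p = np.full(5, 0.1)           # every node starts 10% infected
for _ in range(200):
    p = sis_step(p, A, beta=0.05, delta=0.2)   # beta/delta = 0.25 < tau
```

Running the same loop with beta/delta above tau makes the infection probabilities grow instead of vanish.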
ACFOCS 2004 Chakrabarti 61
Experiments
[Plots of the number of infected nodes over time: β/δ > τ (above threshold), β/δ = τ (at the threshold), β/δ < τ (below threshold)]
ACFOCS 2004 Chakrabarti 62
Remarks “Primal” problem: topology design
Design a network resilient to infection
“Dual” problem: viral marketing Selectively convert important customers who
can influence many others
Will come back to this later
ACFOCS 2004 Chakrabarti 63
Back to the random surfer
Practical issues with Pagerank [BrinP1997]: PR converges only if E is aperiodic and irreducible; make it so by teleporting:
  PR(v) = d/N + (1 - d)·Σ_{(u,v)∈E} PR(u) / OutDegree(u)
d is the (tuned) probability of "teleporting" to one of N pages uniformly at random (0.1-0.2 in practice).
(Possibly) unintended consequences: topic sensitivity, stability.
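The teleporting update lends itself to simple power iteration; a minimal sketch for small in-memory graphs (dangling-node mass is simply dropped here, which a production implementation would redistribute):

```python
def pagerank(out_links, d=0.15, iters=100):
    """Power iteration for PR(v) = d/N + (1-d) * sum_{(u,v) in E} PR(u)/OutDegree(u),
    with d the teleport probability (0.1-0.2 in practice)."""
    nodes = list(out_links)
    N = len(nodes)
    pr = {v: 1.0 / N for v in nodes}
    for _ in range(iters):
        nxt = {v: d / N for v in nodes}       # teleport mass
        for u in nodes:
            if out_links[u]:
                share = (1 - d) * pr[u] / len(out_links[u])
                for v in out_links[u]:
                    nxt[v] += share           # walk mass
        pr = nxt
    return pr

# a 3-node cycle is symmetric, so every page ends up with PR = 1/3
pr = pagerank({1: [2], 2: [3], 3: [1]})
```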
ACFOCS 2004 Chakrabarti 64
Prestige as network flow
y_ij = # surfers clicking from i to j per unit time.
Hits per unit time on page j: H_j = Σ_{(i,j)∈E} y_ij
Flow is conserved at every j: Σ_{(i,j)∈E} y_ij = Σ_{(j,k)∈E} y_jk
The total traffic is Y = Σ_j H_j = Σ_{i,j} y_ij; normalize p_ij = y_ij / Y, so p_ij can be interpreted as a probability.
Standard Pagerank corresponds to one solution: p_ij = H_i / (Y·OutDegree(i)). Many other solutions are possible.
ACFOCS 2004 Chakrabarti 65
Maximum entropy flow [Tomlin2003]
Flow conservation at each node r = 1, …, N is modeled using the feature
  f_r(x_ij) = +1 if j = r and (i,j) ∈ E;  -1 if i = r and (r,j) ∈ E;  0 otherwise
with the constraints E_p(f_r) = Σ_{i,j} p_ij·f_r(x_ij) = 0.
The goal is to maximize the entropy -Σ_{i,j} p_ij·log p_ij subject to Σ_{i,j} p_ij = 1.
The solution has the exponential form p_ij = exp(λ_0 + λ_j - λ_i); λ_i is the "hotness" of page i.
ACFOCS 2004 Chakrabarti 66
Maxent flow results
Two IBM intranet data sets with known top URLs. Compare the average rank of the known top URLs when sorted by Pagerank, by hotness λ_i, and by hit count H_i (smaller rank is better).
The λ_i ranking is better than Pagerank; the H_i ranking is worse.
[Plots: average rank (×10^6, ×10^8) vs. the depth up to which dmoz.org URLs are used as ground truth]
ACFOCS 2004 Chakrabarti 67
HITS [Kleinberg1997]
Two kinds of prestige: good hubs link to good authorities; good authorities are linked to by good hubs:
  a(v) = Σ_{(u,v)∈E} h(u);  h(u) = Σ_{(u,v)∈E} a(v)
In matrix notation the iterations amount to a = E^T h and h = E a, i.e., a ← E^T E a and h ← E E^T h, with interleaved normalization of h and a.
Note that scores are copied, not divided.
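The interleaved updates can be sketched directly from the edge list (L2 normalization is used here; a real implementation would use sparse matrix-vector products):

```python
import math

def hits(edges, iters=50):
    """a = E^T h then h = E a, normalizing after each half-step."""
    nodes = {u for e in edges for u in e}
    h = {v: 1.0 for v in nodes}
    a = {v: 1.0 for v in nodes}
    for _ in range(iters):
        a = {v: sum(h[u] for u, w in edges if w == v) for v in nodes}
        norm = math.sqrt(sum(x * x for x in a.values())) or 1.0
        a = {v: x / norm for v, x in a.items()}
        h = {u: sum(a[w] for x, w in edges if x == u) for u in nodes}
        norm = math.sqrt(sum(x * x for x in h.values())) or 1.0
        h = {u: x / norm for u, x in h.items()}
    return h, a

# one hub (0) pointing at two authorities (1 and 2)
h, a = hits([(0, 1), (0, 2)])
```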
ACFOCS 2004 Chakrabarti 68
HITS graph acquisition Steps
Get root set via keyword query Expand by one move forward and backward Drop same-site links (why?)
Each node is assigned both a hub and an authority score.
The graph tends to be "topic-specific", whereas Pagerank is usually run on "the entire Web" (but need not be).
[Diagram: root set with its IN and OUT expansions]
ACFOCS 2004 Chakrabarti 69
HITS vs. Pagerank, more comments
Both are dominant eigenvectors of different matrices. HITS: eigensystems of E E^T (for h) and E^T E (for a). Pagerank: eigensystem of (1 - d)·L^T + d·J, where J is the N×N matrix with all entries 1/N and
  L(i,j) = 1/OutDegree(i) if (i,j) ∈ E, 0 otherwise
HITS copies scores, Pagerank distributes them. HITS drops same-site links; Pagerank does (can?) not. Implications?
ACFOCS 2004 Chakrabarti 70
HITS: Dyadic interpretation [CohnC2000]
The graph includes many communities z; query = "Jaguar" gets auto, game, and animal links. Each URL is represented as two things: a document d and a citation c.
Maximize Π_{(d,c)∈E} Pr(d, c), where
  Pr(d, c) = Pr(d)·Pr(c|d) = Pr(d)·Σ_z Pr(z|d)·Pr(c|z)
Guess the number of aspects z and use [Hofmann 1999] to estimate Pr(c|z); the c's with the largest Pr(c|z) are the most authoritative URLs for aspect z.
ACFOCS 2004 Chakrabarti 71
Dyadic results for “Machine learning”
Clustering based on citations + ranking within clusters
ACFOCS 2004 Chakrabarti 72
Spamming link-based ranking Recipe for spamming HITS
Create a hub linking to genuine authorities Then mix in links to your customers’ sites Highly susceptible to adversarial behavior
Recipe for spamming Pagerank Buy a bunch of domains, cloak IP addresses Host a site at each domain Sprinkle a few links at random per page to other
sites you own Takes more work than spamming HITS
ACFOCS 2004 Chakrabarti 73
Stability of link analysis [NgZJ2001] Compute HITS authority
scores and Pagerank Delete 30% of nodes/links at
random Recompute and compare
ranks; repeat Pagerank ranks more stable
than HITS authority ranks Why? How to design more stable
algorithms?
[Tables: ranks of the same pages across repeated 30% deletions, HITS authority vs. Pagerank; the HITS ranks fluctuate wildly while the Pagerank ranks stay nearly constant]
ACFOCS 2004 Chakrabarti 74
Stability depends on graph and params
The authority score is the principal eigenvector of E^T E = S, say; let λ1 > λ2 be the first two eigenvalues.
There exists an S' that is close to S, ||S - S'||_F = O(λ1 - λ2), but whose principal eigenvector differs substantially: ||u_1 - u'_1||_2 = Ω(1).
Pagerank p is the eigenvector of (ε·U + (1 - ε)·E)^T, where U is the matrix full of 1/N and ε is the jump probability. If a set C of nodes is changed in any way, the new Pagerank vector p' satisfies
  ||p - p'|| ≤ (2/ε)·Σ_{u∈C} p_u
ACFOCS 2004 Chakrabarti 75
Randomized HITS
Each half-step, with probability ε, teleport to a node chosen uniformly at random:
  a^(t+1) = ε·(1/N)·1 + (1 - ε)·(row-normalized E)^T·h^(t)
  h^(t+1) = ε·(1/N)·1 + (1 - ε)·(column-normalized E)·a^(t+1)
Much more stable than HITS, and the results are more meaningful. ε near 1 will always stabilize; here ε was 0.2.
[Tables: ranks across repeated perturbations for Randomized HITS vs. Pagerank]
ACFOCS 2004 Chakrabarti 76
Another random walk variation of HITS
SALSA: Stochastic HITS [Lempel+2000]. Two separate random walks: from authority to authority via a hub, and from hub to hub via an authority.
Transition probability:
  Pr(a_i → a_j) = Σ_{h:(h,a_i)∈E,(h,a_j)∈E} (1/InDegree(a_i)) · (1/OutDegree(h))
If the transition graph is irreducible, the stationary authority score is simply proportional to InDegree(a). For disconnected components it depends on the relative size of the bipartite cores, which avoids dominance by larger cores.
[Example: two authorities a1, a2 sharing hubs, with transition probabilities 1/3 and 1/2 marked]
ACFOCS 2004 Chakrabarti 77
SALSA sample result (“movies”)
HITS: The Tightly-Knit Community (TKC) effect
SALSA: Less TKC influence (but no reinforcement!)
ACFOCS 2004 Chakrabarti 78
Links in relational data [GibsonKR1998]
An (attribute, value) pair is a node; each node v has weight w_v. Each tuple is a hyperedge; tuple r has weight x_r.
HITS-like iterations update the weights: for each tuple r = (u_1, …, u_k), set x_r = ⊕(w_{u1}, …, w_{uk}); then for each node v, w_v = Σ_{r: v∈r} x_r.
The combining operator ⊕ can be sum, max, product, L_p average, etc.
ACFOCS 2004 Chakrabarti 79
Distilling links in relational data
[Example: tuples over (Author, Author, Forum, Year), distilling the "Database Theory" community]
ACFOCS 2004 Chakrabarti 80
Searching and annotating graph data
ACFOCS 2004 Chakrabarti 81
Searching graph data
Nodes in the graph contain text. Examples: the random surfer made intelligent [RichardsonD2001]; topic-sensitive Pagerank [Haveliwala2002]; assigning image captions using random walks [PanYFD2004]; rotting pages and links [BarYossefBKT2004].
The query is a set of keywords, and all keywords may not match a single node: implicit joins [Hulgeri+2001, Agrawal+2002] or rank aggregation [Balmin+2004] are required.
ACFOCS 2004 Chakrabarti 82
Intelligent Web surfer
Walk wrt a single query word q:
  Pr_q(j) = (1 - β)·Pr'_q(j) + β·Σ_{(i,j)∈E} Pr_q(i)·Pr_q(i→j)
Pr'_q(j) = R_q(j) / Σ_{k∈V} R_q(k) is the probability of teleporting to node j, where R_q(k) is the relevance of node k wrt q.
Pick the out-link to walk on in proportion to the relevance of the target out-neighbor:
  Pr_q(i→j) = R_q(j) / Σ_{(i,k)∈E} R_q(k)
For a query Q (a set of words), pick a query word per some distribution, e.g. IDF:
  PR_Q(j) = Σ_{q∈Q} Pr(q)·Pr_q(j)
ACFOCS 2004 Chakrabarti 83
Implementing the intelligent surfer
PR_Q(j) approximates a walk that picks a query keyword using Pr(q) at every step.
Precompute and store Pr_q(j) for each keyword q in the lexicon: the space blowup = the average document length.
Query-dependent PR was rated better by volunteers.
ACFOCS 2004 Chakrabarti 84
Topic-sensitive Pagerank
High overhead for per-word Pagerank. Instead, compute Pageranks PR_c(j) for some collection of broad topics; topic c has sample page set S_c. Walk as in Pagerank, but jump to a node in S_c uniformly at random.
"Project" the query onto the set of topics: Pr(c|Q) ∝ Pr(c)·Π_{q∈Q} Pr(q|c)
Rank responses by projection-weighted Pageranks: Score(Q, j) = Σ_c Pr(c|Q)·PR_c(j)
ACFOCS 2004 Chakrabarti 85
Topic-sensitive Pagerank results
Users prefer topic-sensitive Pagerank on most queries to global Pagerank + keyword filter
ACFOCS 2004 Chakrabarti 86
Image captioning Segment images
into regions Image has
caption words Three-layer
graph: image, regions, caption words
Threshold on region similarity to connect regions (dotted)
ACFOCS 2004 Chakrabarti 87
Random walks with restarts
Find regions in test image Connect regions to other nodes in the region layer
using region similarity Random walk, restarting at test image node Pick words with largest visit probability
[Three-layer graph: images, regions, caption words; the test image's regions are linked (dotted) into the region layer]
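Numerically, the restart walk is a few lines of linear algebra; a sketch on an invented 5-node stand-in for the image/region/word layers (node 0 = test image, 1-2 = regions, 3-4 = caption words):

```python
import numpy as np

def random_walk_with_restart(A, restart, c=0.15, iters=200):
    """Iterate p = (1-c) W p + c e, where W column-normalizes the adjacency
    matrix and e restarts the walk at the test-image node."""
    W = A / A.sum(axis=0, keepdims=True)
    e = np.zeros(A.shape[0])
    e[restart] = 1.0
    p = e.copy()
    for _ in range(iters):
        p = (1 - c) * (W @ p) + c * e
    return p

A = np.zeros((5, 5))
for u, v in [(0, 1), (1, 2), (2, 3), (2, 4)]:   # image - regions - words
    A[u, v] = A[v, u] = 1.0
p = random_walk_with_restart(A, restart=0)
# words 3 and 4 attach symmetrically to region 2, so their visit
# probabilities tie; both stay far below the restart node's
```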
ACFOCS 2004 Chakrabarti 88
More random walks: "Link rot"
How stale is a page? Last-modified timestamps are unreliable, and automatic dead-link cleaners mask disuse. A page is "completely stale" if it is "dead"; let D be the set of pages which cannot be accessed (404 and other problems).
How stale is page u? Start with p ← u:
If p ∈ D, declare the decay value of u to be 1; otherwise, with probability α declare the decay value of u to be 0; with probability 1 - α choose an outlink v of p, set p ← v, and loop.
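The decay value can be estimated by Monte-Carlo simulation of exactly this walk; a sketch (the stop probability is written as alpha, and the two-page example graph is invented):

```python
import random

def decay_score(start, out_links, dead, alpha=0.3, walks=2000, seed=1):
    """Fraction of walks from `start` that hit a dead page before stopping."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(walks):
        p = start
        while True:
            if p in dead:
                hits += 1            # decay value 1 for this walk
                break
            if rng.random() < alpha or not out_links.get(p):
                break                # decay value 0 for this walk
            p = rng.choice(out_links[p])
    return hits / walks

# 'a' links to one dead and one live page: expected score = 0.7 * 0.5 = 0.35
score = decay_score('a', {'a': ['gone', 'live'], 'live': []}, dead={'gone'})
```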
ACFOCS 2004 Chakrabarti 89
Page staleness results
Decay score is correlated with, but generally larger than the fraction of dead outlinks on a page
Removing direct dead links automatically does not eliminate live but “rotting” pages
[Plot: decay score vs. fraction of dead outlinks (404s)]
ACFOCS 2004 Chakrabarti 90
Graph proximity search: two paradigms A single node as query response
Find node that matches query terms… …or is “near” nodes matching query terms
[Goldman+ 1998]
A connected subgraph as query response Single node may not match all keywords No natural “page boundary” [Bhalotia+2002]
[Agrawal+2002]
ACFOCS 2004 Chakrabarti 91
Single-node response examples
Query "Travolta, Cage" → Actor, or Face/Off
Query "Travolta, Cage, Movie" → Face/Off
Query "Kleiser, Movie" → Gathering, Grease
Query "Kleiser, Woo, Actor" → Travolta
[Entity graph: Travolta "acted-in" Face/Off and Grease; Cage "acted-in" Face/Off; Kleiser "directed" Grease and Gathering; Woo "directed" Face/Off; "is-a" edges to Actor, Movie, Director]
ACFOCS 2004 Chakrabarti 92
Basic search strategy Node subset A activated because they
match query keyword(s) Look for node near nodes that are activated Goodness of response node depends
Directly on degree of activation Inversely on distance from activated node(s)
ACFOCS 2004 Chakrabarti 93
Proximity query: screenshot
http://www.cse.iitb.ac.in/banks/
ACFOCS 2004 Chakrabarti 94
Ranking a single node response
Given the activated node set A, rank each node r in the "response set" R by proximity to the nodes a ∈ A. Nodes carry relevance scores (written A(a) for activated nodes and R(r) for response nodes) in [0,1]; edge costs are "specified by the system"; d(a, r) = cost of the shortest path from a to r.
Bond between a and r:
  b(a, r) = A(a)·R(r) / d(a, r)^t
The parameter t tunes the relative emphasis on distance vs. relevance score. Several ad-hoc choices are possible.
ACFOCS 2004 Chakrabarti 95
Scoring single response nodes
Additive: score(r) = Σ_{a∈A} b(a, r)
Belief: score(r) = 1 - Π_{a∈A} (1 - b(a, r))
Goal: list a limited number of nodes with the largest scores.
Performance issues: assume the graph is in memory? Precompute all-pairs shortest paths (Θ(|V|³))? Prune unpromising candidates?
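Both combining rules are one-liners over the bonds b(a, r) = A(a)·R(r)/d(a, r)^t; a sketch with invented relevance scores and distances:

```python
def bond(rel_a, rel_r, dist, t=2):
    """b(a, r) = A(a) * R(r) / d(a, r)^t."""
    return rel_a * rel_r / dist ** t

def score_additive(bonds):
    return sum(bonds)

def score_belief(bonds):
    """Treat each bond as independent evidence: 1 - prod(1 - b)."""
    p = 1.0
    for b in bonds:
        p *= 1.0 - b
    return 1.0 - p

bonds = [bond(1.0, 0.8, 2), bond(0.5, 0.8, 1)]   # bonds 0.2 and 0.4
```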
ACFOCS 2004 Chakrabarti 96
Hub indexing
Decompose the all-pairs shortest path problem using sparse vertex cuts {p, q}: store the |A|+|B| shortest paths to p, the |A|+|B| shortest paths to q, and d(p,q).
To find d(a, b), compare: d(a→p→b) not through q; d(a→q→b) not through p; d(a→p→q→b); d(a→q→p→b).
Greatest savings when |A| ≈ |B|. Heuristics to find cuts, e.g., large-degree nodes.
[Diagram: node sets A and B separated by cut vertices p and q]
ACFOCS 2004 Chakrabarti 97
ObjectRank [Balmin+2004] Given a data graph with nodes having text For each keyword precompute a keyword-
sensitive Pagerank [RichardsonD2001] Score of a node for multiple keyword search
based on fuzzy AND/OR Approximation to Pagerank of node with restarts
to nodes matching keywords
Use Fagin-merge [Fagin2002] to get best nodes in data graph
ACFOCS 2004 Chakrabarti 98
Connected subgraph as response Single node may not match all keywords
No natural “page boundary” On-the-fly joins make up a “response page”
Two scenarios Keyword search on relational data
• Keywords spread among normalized relations
Keyword search on XML-like or Web data• Keywords spread among DOM nodes and subtrees
ACFOCS 2004 Chakrabarti 99
Keyword search on relational data
Tuple = node; some columns have text. Foreign key constraints = edges in the schema graph. A query is a set of terms; there is no natural notion of a document, and normalization means joins may be needed to generate results. Cycles may exist in the schema graph: "Cites".
[Example schema: Paper(PaperID, PaperName), Writes(AuthorID, PaperID), Author(AuthorID, AuthorName), Cites(Citing, Cited). Sample rows: papers P1 DBXplorer, P2 BANKS; authors A1 Chaudhuri, A2 Sudarshan, A3 Hulgeri; P2 cites P1; writes (A1, P1), (A2, P2), (A3, P2)]
ACFOCS 2004 Chakrabarti 100
DBXplorer and DISCOVER Enumerate subsets of relations in schema graph
which, when joined, may contain rows which have all keywords in the query “Join trees” derived from schema graph
Output SQL query for each join tree Generate joins, checking rows for matches
[Agrawal+2001], [Hristidis+2002]
[Schema graph over tables T1-T5 annotated with keyword matches (K1, K2, K3), and the candidate join trees enumerated from it: T2; T2-T3; T2-T3-T4; T2-T3-T4-T5; …]
ACFOCS 2004 Chakrabarti 101
Discussion Exploits relational
schema information to contain search
Pushes final extraction of joined tuples into RDBMS
Faster than dealing with full data graph directly
Coarse-grained ranking based on schema tree
Does not model proximity or (dis) similarity of individual tuples
No recipe for data with less regular (e.g. XML) or ill-defined schema
ACFOCS 2004 Chakrabarti 102
Motivation from Web search
"Linux modem driver for a Thinkpad A22p": a hyperlink path matches the query collectively, where a conjunction query would fail.
"Projects where X and P work together": a conjunction may retrieve the wrong page.
This motivates a general notion of graph proximity.
[Example pages: "IBM Thinkpads: A20m, A22p" → "Thinkpad Drivers: Windows XP, Linux" → "Download, Installation tips: Modem, Ethernet"; home page of Professor X (Papers: VLDB…; Students: P, Q) → P's home page ("I work on the B project") → "The B System: Group members P, S, X"]
ACFOCS 2004 Chakrabarti 103
Data structures for search
Answer = a tree with at least one leaf containing each keyword in the query: the Group Steiner tree problem, NP-hard.
Query term t is found in source nodes S_t.
Single-source shortest path (SSSP) iterator: initialize with a source (near-)node; consider edges backwards; getNext() returns the next nearest node.
For each iterator, each visited node v maintains, for each term t, a set v.R_t of nodes in S_t which have reached v.
ACFOCS 2004 Chakrabarti 104
Generic expanding search
Near node sets S_t, with S = ∪_t S_t. For each source node s ∈ S, create an SSSP iterator with source s.
While more results are required: get the next iterator and its next-nearest node v; let t be the term for the iterator's source s; form crossProduct = {s} × Π_{t'≠t} v.R_t'; for each tuple of nodes in crossProduct, create an answer tree rooted at v with paths to each source node in the tuple; finally add s to v.R_t.
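A much-simplified stand-in for this search: run one backwards Dijkstra per term and pick the root minimizing the summed distances to the nearest source of each term. This skips the incremental iterators and the answer-tree enumeration of the full algorithm, but shows the core backward-expansion idea (the example graph is invented):

```python
import heapq

def backward_dijkstra(rev_adj, sources):
    """Shortest distance from every node TO some source, following edges
    backwards; rev_adj[v] lists (u, w) for each edge u -> v of weight w."""
    dist = {s: 0.0 for s in sources}
    pq = [(0.0, s) for s in sources]
    heapq.heapify(pq)
    while pq:
        d, v = heapq.heappop(pq)
        if d > dist.get(v, float('inf')):
            continue
        for u, w in rev_adj.get(v, []):
            nd = d + w
            if nd < dist.get(u, float('inf')):
                dist[u] = nd
                heapq.heappush(pq, (nd, u))
    return dist

def best_root(rev_adj, nodes, term_sources):
    """Root minimizing the total distance to the nearest source of each term."""
    dists = [backward_dijkstra(rev_adj, srcs) for srcs in term_sources]
    return min(nodes, key=lambda v: sum(d.get(v, float('inf')) for d in dists))

# 'root' links to both keyword nodes, so only it can reach every term
rev_adj = {'s1': [('root', 1.0)], 's2': [('root', 1.0)]}
best = best_root(rev_adj, ['root', 's1', 's2'], [{'s1'}, {'s2'}])
```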
ACFOCS 2004 Chakrabarti 105
Search example ("Vu Kleinberg")
[Bibliographic graph: authors Quoc Vu, Jon Kleinberg, Divyakant Agrawal, Eva Tardos with "writes" edges to the papers "A metric labeling problem", "Authoritative sources in a hyperlinked environment", and "Organizing Web pages by 'Information Unit'", plus "cites" edges among the papers]
ACFOCS 2004 Chakrabarti 106
First response
[The same graph with the first answer tree highlighted: a tree connecting Quoc Vu and Jon Kleinberg through the papers]
ACFOCS 2004 Chakrabarti 107
Subgraph search: screenshot
http://www.cse.iitb.ac.in/banks/
ACFOCS 2004 Chakrabarti 108
Similarity, neighborhood, influence
ACFOCS 2004 Chakrabarti 109
Why are two nodes similar?
 What is/are the best paths connecting two nodes, explaining why/how they are related?
  Graph of co-starring, citation, telephone calls, …
 Given a graph with nodes s and t and a budget of b nodes, find the "best" b nodes capturing the relationship between s and t
 [FaloutsosMT2004]: proposes a definition of goodness and how to efficiently select the best connections
[Figure: example connection subgraph between Negroponte and Palmisano, via Esther Dyson and Gerstner]
ACFOCS 2004 Chakrabarti 110
Simple proposals that do not work
 Shortest path
  Pizza boy p gets the same attention as g
 Network flow
  s→a→b→t is as good as s→g→t
 Voltage
  Connect +1V at s, ground t
  Both g and p will be at +0.5V
 Observations
  Must reward parallel paths
  Must reward short paths
  Must penalize/tax pizza boys
[Figure: example graph: direct path s-g-t, parallel path s-a-b-t, and a pendant "pizza boy" node p attached to g]
ACFOCS 2004 Chakrabarti 111
Resistive network with universal sink
 Connect +1V to s, ground t
 Introduce a universal sink
  Grounded
  Connected to every node
 The universal sink is a "tax collector"
  Penalizes pizza boys
  Penalizes long paths
 Goodness of a path is the electric current it carries
[Figure: the same example graph, with the universal sink connected to every node]
ACFOCS 2004 Chakrabarti 112
Resistive network algorithm
 Ohm's law: I(u,v) = C(u,v)·[V(u) − V(v)] for all u,v
 Kirchhoff's current law: Σ_u I(u,v) = 0 for all v ≠ s,t
 Boundary conditions (without sink): V(s) = 1, V(t) = 0
 Solution: V(u) = Σ_v C(u,v)·V(v) / Σ_w C(u,w) for u ≠ s,t
 Here C(u,v) is the conductance from u to v
 Add a grounded universal sink z with V(z) = 0
 Set C(u,z) ∝ Σ_{w≠z} C(u,w) for every node u
 Display the subgraph carrying high current
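A minimal numerical sketch of this scheme, solving the voltage equations directly with a linear solve instead of the iterative update. The graph is the pizza-boy example from the earlier slide; the unit conductances and the sink fraction alpha = 0.5 are invented for illustration.

```python
import numpy as np

# Toy graph: direct path s-g-t, parallel path s-a-b-t,
# pendant "pizza boy" p attached to g.  Unit conductances.
nodes = ["s", "a", "b", "g", "t", "p"]
idx = {n: i for i, n in enumerate(nodes)}
edges = [("s", "g"), ("g", "t"), ("g", "p"), ("s", "a"), ("a", "b"), ("b", "t")]

n = len(nodes)
C = np.zeros((n, n))
for u, v in edges:
    C[idx[u], idx[v]] = C[idx[v], idx[u]] = 1.0

alpha = 0.5                      # invented sink fraction: C(u,z) = alpha * sum_w C(u,w)
sink = alpha * C.sum(axis=1)     # conductance of each node to the grounded sink z

V = np.zeros(n)
V[idx["s"]] = 1.0                # boundary conditions: V(s)=1, V(t)=0
interior = [idx[u] for u in ("a", "b", "g", "p")]

# Kirchhoff at interior u: (sum_v C(u,v) + C(u,z)) V(u) = sum_v C(u,v) V(v)
A = np.diag(C.sum(axis=1) + sink) - C
A_int = A[np.ix_(interior, interior)]
b_int = C[np.ix_(interior, [idx["s"], idx["t"]])] @ np.array([1.0, 0.0])
V[interior] = np.linalg.solve(A_int, b_int)

I = C * (V[:, None] - V[None, :])   # Ohm's law: I(u,v) = C(u,v)(V(u) - V(v))
```

As the slide argues, the direct edge g→t then carries more current than the pendant edge g→p, so the pizza boy is taxed away.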
ACFOCS 2004 Chakrabarti 113
Distributions influenced via graphs
 Directed or undirected graph; nodes have
  Observable properties
  Some unobservable (random) state
 Edges indicate that distributions over unobservable states are coupled
 Many applications
  Hypertext classification (topics are clustered)
  Social network of customers buying products
  Hierarchical classification
  Labeling categorical sequences: POS/NE tagging, sense disambiguation, linkage analysis
ACFOCS 2004 Chakrabarti 114
Basic techniques
 Directed (acyclic) graphs: Bayesian networks
 Markov networks
  (Loopy) belief propagation
 Conditional Markov networks
  Avoid modeling the joint distribution over observable and hidden properties of nodes
 Some computationally simple special cases
ACFOCS 2004 Chakrabarti 115
Hypertext classification
 Want to assign labels to Web pages
 Text on a single page may be too little or misleading
 A page is not an isolated instance by itself
 Problem setup
  Web graph G = (V, E)
  Node u ∈ V is a page having text u_T
  Edges in E are hyperlinks
  Some nodes are labeled
  Make a collective guess at the missing labels
 Probabilistic model? Benefits?
ACFOCS 2004 Chakrabarti 116
Graph labeling model
 Let V_K be the nodes with known labels and f(V_K) their known label assignments
 Seek a labeling f of all unlabeled nodes so as to maximize
  Pr(f(V) | E, {u_T : u ∈ V}) = Pr(f(V))·Pr(E, {u_T : u ∈ V} | f(V)) / Pr(E, {u_T : u ∈ V})
 Luckily we don't need to worry about the denominator
 Let N(v) be the neighbors of v and N_K(v) ⊆ N(v) be the neighbors with known labels
 Markov assumption: f(v) is conditionally independent of the rest of G given f(N(v))
ACFOCS 2004 Chakrabarti 117
Markov graph labeling
 Markov assumption: the label of v does not depend on unknown labels outside the set N_U(v)
 Probability of labeling a specific node v, given edges, text and the partial labeling:
  Pr(f(v) | E, {u_T}, f(V_K))
   = Σ_{f(N_U(v))} Pr(f(v), f(N_U(v)) | E, {u_T}, f(V_K))
   = Σ_{f(N_U(v))} Pr(f(N_U(v)) | E, {u_T}, f(V_K)) · Pr(f(v) | f(N_U(v)), E, {u_T}, f(V_K))
 The sum ranges over all possible labelings of the unknown neighbors of v
 Circularity between f(v) and f(N_U(v)): resolved by some form of iterative Gibbs sampling or MCMC
ACFOCS 2004 Chakrabarti 118
Iterative labeling
 Label estimates of v in the next iteration: take the expectation over N_U(v) labelings, approximating the joint distribution over neighbors as a product of marginals
  Pr^(r+1)(f(v) | E, {u_T}, f(V_K))
   = Σ_{f(N_U(v))} [ ∏_{w ∈ N_U(v)} Pr^(r)(f(w) | E, {u_T}, f(V_K)) ] · Pr(f(v) | f(N_U(v)), E, {u_T}, f(V_K))
 The sum over all possible N_U(v) labelings is still too expensive to compute
 In practice, prune to the most likely configurations
 Let us look at the last term more carefully
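One cheap instance of this pruned iteration is a mean-field-style relaxation, where each unlabeled node's marginal is repeatedly refreshed from its text prior and its neighbors' current marginals. The four-node graph, text priors, and label-compatibility matrix below are all invented for illustration.

```python
import numpy as np

# Toy instance: 4 pages, undirected links, two topics {0, 1}.
links = [(0, 1), (1, 2), (2, 3)]
text_prior = np.array([[0.9, 0.1],   # node 0: text suggests topic 0
                       [0.5, 0.5],   # node 1: text uninformative
                       [0.5, 0.5],   # node 2: text uninformative
                       [0.1, 0.9]])  # node 3: text suggests topic 1
known = {0: 0, 3: 1}                 # f(V_K): clamped labels
compat = np.array([[0.8, 0.2],       # neighboring pages tend to share a topic
                   [0.2, 0.8]])

nbrs = {v: [] for v in range(4)}
for u, v in links:
    nbrs[u].append(v)
    nbrs[v].append(u)

p = text_prior.copy()
for v, c in known.items():           # clamp known labels
    p[v] = np.eye(2)[c]

for _ in range(20):                  # iterate marginal updates to a fixed point
    new_p = p.copy()
    for v in range(4):
        if v in known:
            continue
        score = text_prior[v].copy()
        for w in nbrs[v]:
            score *= compat @ p[w]   # expectation over the neighbor's marginal
        new_p[v] = score / score.sum()
    p = new_p
```

Node 1, linked to the clamped topic-0 node, ends up leaning toward topic 0; node 2 symmetrically leans toward topic 1.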
ACFOCS 2004 Chakrabarti 119
A generative node model
 By the Markov assumption:
  Pr(f(v) | f(N_U(v)), E, V_T, f(V_K)) = Pr(f(v) | f(N(v)), v_T)
 So we finally need a distribution coupling f(v), v_T (the text on v) and f(N(v)):
  Pr(f(N(v)), v_T | f(v))
 Can use a Bayes classifier as with ordinary text: estimate a parametric model for the class-conditional joint distribution between the text on page v and the labels of the neighbors of v
 Must make naïve Bayes assumptions to keep this practical
ACFOCS 2004 Chakrabarti 120
Pictorially… c = class, t = text, N = neighbors
 Text-only model: Pr[t | c]
 Using neighbors' text to judge my topic: Pr[t, t(N) | c] (hurts)
 Better model: Pr[t, c(N) | c]
 Estimate histograms and update based on neighbors' histograms
ACFOCS 2004 Chakrabarti 121
Generative model: results
 9600 patents from 12 classes marked by USPTO
 Patents have text and cite other patents
 Expand each test patent to include its neighborhood
 'Forget' a fraction of the neighbors' classes
[Chart: %Error vs. %Neighborhood known, for the Text, Link, and Text+Link models]
ACFOCS 2004 Chakrabarti 122
Detour: generative vs. discriminative
 x = feature vector, y = label, in {0,1} say
 A generative method models Pr(x|y) or Pr(x,y)
  "The generation of data x given label y"
  Use Bayes rule to get Pr(y|x)
  Inaccurate; x may have many dimensions
 A discriminative method directly estimates Pr(y|x)
  Simple linear case: want w s.t. w·x > 0 if y = 1 and w·x ≤ 0 if y = 0
  Cannot differentiate w.r.t. w; instead pick some smooth "loss function", e.g.
   Pr(y = 1 | x) = 1 / (1 + exp(−w·x))
  Works very well in practice
ACFOCS 2004 Chakrabarti 123
Discriminative node model
 OA(X) = direct "own" attributes of node X
 LD(X) = link-derived attributes of node X
  Mode-link: most frequent label of neighbors(X)
  Count-link: histogram of neighbor labels
  Binary-link: 0/1 histogram of neighbor labels
 Pr(c | w_o, OA(X)) = 1 / (1 + exp(−w_o^T OA(X)))   [local model params w_o]
 Pr(c | w_l, LD(X)) = 1 / (1 + exp(−w_l^T LD(X)))   [neighborhood model params w_l]
 Ĉ(X) = argmax_c Pr(c | OA(X)) · Pr(c | LD(X))
 Iterate as in the generative case
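The combination in the argmax above can be sketched with a few lines of Python. The weights and feature vectors are invented; the per-class sigmoid-then-normalize step is one simple way (an assumption, not necessarily the paper's exact recipe) of turning binary logistic scores into a distribution over classes.

```python
import math

def logistic_scores(w, x):
    # Per-class logistic sigma(w_c . x), then normalized across classes
    raw = [1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(wc, x))))
           for wc in w]
    z = sum(raw)
    return [r / z for r in raw]

def classify(w_o, w_l, oa, ld):
    # Combine the own-attribute and link-derived models by multiplying them
    p_o = logistic_scores(w_o, oa)   # from OA(X), e.g. text features
    p_l = logistic_scores(w_l, ld)   # from LD(X), e.g. count-link histogram
    scores = [a * b for a, b in zip(p_o, p_l)]
    return max(range(len(scores)), key=scores.__getitem__)
```

With text features mildly favoring class 0 but three neighbors labeled 1 (count-link histogram [0, 3]), the link-derived model dominates and class 1 wins.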
ACFOCS 2004 Chakrabarti 124
Discriminative model: results [LuG2003]
Binary-link and count-link outperform content-only at 95% confidence
Better to separately estimate wl and wo
In+Out+Cocitation better than any subset for LD
ACFOCS 2004 Chakrabarti 125
Undirected Markov networks
 Clique c ∈ C(G): a set of completely connected nodes
 Clique potential ψ_c(V_c): a function over all possible configurations of the nodes in V_c
 Decompose Pr(v) as (1/Z) ∏_{c ∈ C(G)} ψ_c(V_c)
 Parametric form: ψ_c(v_c) = exp(w_c · f_c(v_c))
  w_c: params of the model; f_c: feature functions
[Figure: Markov network in which each instance's label variable is coupled to its local feature variable and to the label variables of linked instances]
ACFOCS 2004 Chakrabarti 126
Conditional and relational networks
 x = vector of observed features at all nodes; y = vector of labels
 Pr(y | x) = (1/Z(x)) ∏_{c ∈ C(G)} ψ_c(x_c, y_c)
  where Z(x) = Σ_{y'} ∏_{c ∈ C(G)} ψ_c(x_c, y'_c)
 ψ_c(x_c, y_c) = exp(w_c · F_c(x_c, y_c))
 A set of "clique templates" specifies which links to use
 Other features possible, e.g. "in the same HTML section"
ACFOCS 2004 Chakrabarti 127
"Toy problem": hierarchical classification
 Obvious approaches
  Flatten to leaf topics, losing hierarchy info
  Level-by-level, compounding the error probability
 Cascaded generative model: Pr(c|d) = Pr(r|d)·Pr(c|d,r), where r is the parent of c
  Pr(c|d,r) estimated as Pr(c|r)·Pr(d|c)/Z(r)
  Estimate of Pr(d|c) makes naïve independence assumptions if d has high dimensionality
  Pr(c|d,r) tends to 0/1 for large dimensions, and mistakes made at shallow levels become irrevocable
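The cascade can be sketched directly: multiply the conditional child-given-parent probabilities down the tree. The taxonomy and the per-document probabilities below are invented for illustration.

```python
# Tiny made-up taxonomy: root -> {Arts, Science}, Science -> {Physics, Biology}.
# Pr(child | parent, d) for one document d (numbers invented).
children = {"root": ["Arts", "Science"], "Science": ["Physics", "Biology"]}
p_child = {("root", "Arts"): 0.2, ("root", "Science"): 0.8,
           ("Science", "Physics"): 0.7, ("Science", "Biology"): 0.3}

def leaf_probs(node="root", p=1.0):
    # Cascade: Pr(c|d) = Pr(parent|d) * Pr(c|d, parent), multiplied down the tree
    if node not in children:
        return {node: p}
    out = {}
    for c in children[node]:
        out.update(leaf_probs(c, p * p_child[(node, c)]))
    return out
```

Here Pr(Physics|d) = 0.8 × 0.7 = 0.56, and the leaf probabilities sum to 1, which also shows why a shallow mistake (the 0.8 vs. 0.2 split) caps everything below it.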
ACFOCS 2004 Chakrabarti 128
Global discriminative model
 Each node c in the taxonomy has an associated bit X_c
 Propose a parametric form
  Pr(X_c = 1 | d, x_r) = exp(w_c · F(d, x_r)) / (1 + exp(w_c · F(d, x_r)))
 Each training instance sets one root-to-leaf path to 1; all other nodes have X = 0
[Figure: taxonomy with bit x_r at each node, features F(d, x_r) and parameters w_c]
 %Accuracy:
  Dataset     SVM 1v1  Ctree
  Reuters 1%  41.9     47.3
  Reuters 5%  68       71
  News20 1%   17.2     21.8
ACFOCS 2004 Chakrabarti 129
Network value of customers
 Customer = node X_i in a graph, with neighbors N_i
 M is a marketing action (promotion, coupon)
 Want to predict Pr(X_i | X_K, Y, M)
  X_i: response of customer i; X_K: known responses of other customers; Y: product attributes; M: aggregate marketing action
 Broader objective is to design the action M
 Again, we approximate with a sum over the unknown neighbor configurations:
  Pr(X_i | X_K, Y, M) = Σ_{C(N_i^U)} Pr(N_i^U | X_K, Y, M) · Pr(X_i | N_i, X_K, Y, M)
ACFOCS 2004 Chakrabarti 130
Network value, continued
 Let the action be boolean: market to customer i or not
 c is the cost of marketing; r0 is the revenue without marketing, r1 with
 Expected lift in profit by marketing to customer i in isolation:
  ELP_i(X_K, Y, M) = r1·Pr(X_i = 1 | X_K, Y, f_i^1(M)) − r0·Pr(X_i = 1 | X_K, Y, f_i^0(M)) − c
  where f_i^1(M) (resp. f_i^0(M)) sets M_i to 1 (resp. 0)
 Global effect: marketing to i also changes the predicted responses of other customers
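Plugging numbers into the isolated-lift formula makes it concrete. All figures are invented: a $20 product sold at $18 under the promotion, $1 marketing cost, and buying probabilities of 0.1 without the action and 0.3 with it.

```python
def expected_lift(p1, p0, r1, r0, c):
    # ELP_i = r1 * Pr(X_i=1 | marketed) - r0 * Pr(X_i=1 | not marketed) - cost
    return r1 * p1 - r0 * p0 - c

# Invented numbers: marketing raises the buying probability from 0.1 to 0.3
lift = expected_lift(p1=0.3, p0=0.1, r1=18.0, r0=20.0, c=1.0)
```

Here the lift is 18·0.3 − 20·0.1 − 1 = 2.4, so marketing to this customer pays off even before counting any network effect.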
ACFOCS 2004 Chakrabarti 131
Special case: sequential networks Text modeled as sequence of tokens drawn
from a large but finite vocabulary Each token has attributes
Visible: allCaps, noCaps, hasXx, allDigits, hasDigit, isAbbrev, (part-of-speech, wnSense)
Not visible: part-of-speech, (isPersonName, isOrgName, isLocation, isDateTime), {starts|continues|ends}-noun-phrase
Visible (symbols) and invisible (states) attributes of nearby tokens are dependent
Application decides what is (not) visible Goal: Estimate invisible attributes
ACFOCS 2004 Chakrabarti 132
Hidden Markov model
 A generative sequential model for the joint distribution of states (s) and symbols (o)
 [Figure: chain … S_{t−1} → S_t → S_{t+1} …, each state S_t emitting symbol O_t]
 s = s_0, s_1, …, s_n; o = o_1, o_2, …, o_n
 Pr(s, o) = ∏_{t=1}^{|o|} Pr(s_t | s_{t−1}) · Pr(o_t | s_t)
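Concretely, the joint probability is just a chain of transition and emission factors. A toy two-state HMM over binary symbols, with all numbers invented:

```python
# Tiny made-up HMM: hidden states H, C; symbols 0, 1.
init  = {"H": 0.5, "C": 0.5}                       # Pr(s_1 | s_0)
trans = {("H", "H"): 0.7, ("H", "C"): 0.3,
         ("C", "H"): 0.4, ("C", "C"): 0.6}         # Pr(s_t | s_{t-1})
emit  = {("H", 1): 0.2, ("H", 0): 0.8,
         ("C", 1): 0.9, ("C", 0): 0.1}             # Pr(o_t | s_t)

def joint(states, symbols):
    # Pr(s, o) = prod_t Pr(s_t | s_{t-1}) * Pr(o_t | s_t)
    p = init[states[0]] * emit[(states[0], symbols[0])]
    for t in range(1, len(symbols)):
        p *= trans[(states[t-1], states[t])] * emit[(states[t], symbols[t])]
    return p
```

For example, joint(["H", "H"], [1, 1]) = 0.5 · 0.2 · 0.7 · 0.2 = 0.014; summing over all state sequences for a fixed o gives the marginal Pr(o).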
ACFOCS 2004 Chakrabarti 133
Using redundant token features
 Each o_t is usually a vector of features extracted from a token
 The features might have high dependence/redundancy: hasCap, hasDigit, isNoun, isPreposition
 A parametric model for Pr(o_t | s_t) needs to make naïve assumptions to be practical
 The overall joint model Pr(s, o) can then be very inaccurate
 (Same argument as in naïve Bayes vs. SVM or maximum-entropy text classifiers)
ACFOCS 2004 Chakrabarti 134
Discriminative graphical model
 Assume one-stage Markov dependence
 Propose a direct parametric form for the conditional probability of the state sequence given the symbol sequence
 [Figure: same chain structure as the HMM]
 Model: Pr(s | o) = (1/Z(o)) ∏_{t=1}^{|o|} Φ_t(s_{t−1}, s_t, o)
 Log-linear form: Φ_t(s_{t−1}, s_t, o) = exp( Σ_k λ_k f_k(s_{t−1}, s_t, o, t) )
 λ_k: parameters to fit; f_k: feature functions, which might depend on the whole of o
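For a tiny state space the conditional can be computed by brute force, which makes the log-linear form concrete. The tag set, feature functions, and weights below are invented, and a real implementation would compute Z(o) by dynamic programming rather than enumeration.

```python
import math
from itertools import product

STATES = ["N", "V"]    # made-up tag set

def feats(prev_s, s, obs, t):
    # Toy feature functions f_k(s_{t-1}, s_t, o, t)
    return [1.0 if s == "N" and obs[t].endswith("s") else 0.0,   # plural-ish word -> noun
            1.0 if (prev_s, s) == ("N", "V") else 0.0]           # noun followed by verb

lam = [1.5, 0.8]       # invented weights lambda_k

def score(states, obs):
    # Unnormalized exp(sum_t sum_k lambda_k f_k(...))
    total = 0.0
    for t in range(len(obs)):
        prev_s = states[t-1] if t > 0 else "START"
        total += sum(l * f for l, f in zip(lam, feats(prev_s, states[t], obs, t)))
    return math.exp(total)

def conditional(states, obs):
    # Pr(s | o) = score(s, o) / Z(o), with Z(o) enumerated over all sequences
    z = sum(score(list(s), obs) for s in product(STATES, repeat=len(obs)))
    return score(states, obs) / z
```

On obs = ["dogs", "bark"], both features fire for the sequence N, V, making it the most probable labeling.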
ACFOCS 2004 Chakrabarti 135
Feature functions and parameters
 Maximize the total conditional likelihood over all instances, penalizing large parameters:
  L = Σ_{(o,s) ∈ D} log[ (1/Z(o)) exp( Σ_{t=1}^{|o|} Σ_k λ_k f_k(s_{t−1}, s_t, o, t) ) ] − Σ_k λ_k² / (2σ²)
 Find ∂L/∂λ_k for each k and perform a gradient-based numerical optimization
 Efficient for a linear state dependence structure
ACFOCS 2004 Chakrabarti 136
Conditional vs. joint: results
The asbestos fiber , crocidolite, is unusually resilient once
it enters the lungs , with even brief exposures to it causing
symptoms that show up decades later , researchers said .
DT NN NN , NN , VBZ RB JJ IN
PRP VBZ DT NNS , IN RB JJ NNS TO PRP VBG
NNS WDT VBP RP NNS JJ , NNS VBD .
Penn Treebank: 45 tags, 1M words training data
 Algorithm/Features               %Error  %OOV error*
 HMM with words                   5.69    45.99
 CRF with words                   5.55    48.05
 CRF with words and orthography   4.27    23.76
 *OOV = out-of-vocabulary
 Orthography: use words, plus overlapping features: isCap, startsWithDigit, hasHyphen, endsWith -ing, -ogy, -ed, -s, -ly, -ion, -tion, -ity, -ies
ACFOCS 2004 Chakrabarti 137
Summary Graphs provide a powerful way to model
many kinds of data, at multiple levels Web pages, XML, relational data, images… Words, senses, phrases, parse trees…
A few broad paradigms for analysis Factors affecting graph evolution over time Eigen analysis, conductance, random walks Coupled distributions between node attributes
and graph neighborhood
Several new classes of model estimation and inferencing algorithms
ACFOCS 2004 Chakrabarti 138
References [BrinP1998] The Anatomy of a Large-Scale
Hypertextual Web Search Engine, WWW. [GoldmanSVG1998] Proximity search in databases.
VLDB, 26—37. [ChakrabartiDI1998] Enhanced hypertext
categorization using hyperlinks. SIGMOD. [BikelSW1999] An Algorithm that Learns What’s in a
Name. Machine Learning Journal. [GibsonKR1999]
Clustering categorical data: An approach based on dynamical systems. VLDB.
[Kleinberg1999] Authoritative sources in a hyperlinked environment. JACM 46.
ACFOCS 2004 Chakrabarti 139
References [CohnC2000] Probabilistically Identifying
Authoritative Documents, ICML. [LempelM2000] The stochastic approach for link-
structure analysis (SALSA) and the TKC effect. Computer Networks 33 (1-6): 387-401
[RichardsonD2001] The Intelligent Surfer: Probabilistic Combination of Link and Content Information in PageRank. NIPS 14 (1441-1448).
[LaffertyMP2001] Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. ICML.
[BorkarDS2001] Automatic text segmentation for extracting structured records. SIGMOD.
ACFOCS 2004 Chakrabarti 140
References [NgZJ2001] Stable algorithms for link analysis.
SIGIR. [Hulgeri+2001] Keyword Search in Databases.
IEEE Data Engineering Bulletin 24(3): 22-32. [Hristidis+2002] DISCOVER: Keyword Search in
Relational Databases. VLDB. [Agrawal+2002] DBXplorer: A system for keyword-
based search over relational databases. ICDE. [TaskarAK2002] Discriminative probabilistic
models for relational data. [Fagin2002] Combining fuzzy information: an
overview. SIGMOD Record 31(2), 109–118.
ACFOCS 2004 Chakrabarti 141
References [Chakrabarti2002] Mining the Web: Discovering
Knowledge from Hypertext Data [Tomlin2003] A New Paradigm for Ranking Pages
on the World Wide Web. WWW. [Haveliwala2003] Topic-Sensitive Pagerank: A
Context-Sensitive Ranking Algorithm for Web Search. IEEE TKDE.
[LuG2003] Link-based Classification. ICML. [FaloutsosMT2004] Connection Subgraphs in
Social Networks. SIAM-DM workshop. [PanYFD2004] GCap: Graph-based Automatic
Image Captioning. MDDE/CVPR.
ACFOCS 2004 Chakrabarti 142
References [Balmin+2004] Authority-Based Keyword Queries
in Databases using ObjectRank. VLDB. [BarYossefBKT2004] Sic transit gloria telae:
Towards an understanding of the Web’s decay. WWW2004.