Source: didawiki.cli.di.unipi.it/lib/exe/fetch.php/wma/lezione2.pdf
Mining di Dati Web (Web Data Mining)
Lesson 2 - Webgraph & its Models
Introduction
• A graph G=(V,E) is characterized by a set of nodes (vertices) V and a set of edges E whose elements are pairs (v1,v2), where v1 and v2 are vertices in V.
Directed Graph
• A graph G=(V,E) is directed (a.k.a. a digraph) if the edges in E are ordered pairs of vertices.
Features of a (Di)Graph
• The degree of a vertex is the number of edges incident to it
• The in-degree (out-degree) of a node in a digraph is the number of incoming (outgoing) edges.
[Figure: example graphs with nodes labeled in-degree 2 / out-degree 0, in-degree 1 / out-degree 1, in-degree 0 / out-degree 1, and degree 3]
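The in-degree and out-degree definitions can be sketched in a few lines of Python. This is a minimal illustration, assuming the digraph is given as a list of ordered pairs; the function name `degrees` and the toy edge list are hypothetical and not taken from the slides.

```python
from collections import defaultdict

def degrees(edges):
    """Return (in_degree, out_degree) dicts for a list of directed edges (u, v)."""
    in_deg = defaultdict(int)
    out_deg = defaultdict(int)
    for u, v in edges:
        out_deg[u] += 1  # the edge leaves u
        in_deg[v] += 1   # the edge enters v
        # Touch the opposite maps so that every node appears in both dicts.
        in_deg[u] += 0
        out_deg[v] += 0
    return dict(in_deg), dict(out_deg)

# Hypothetical toy digraph: a -> b, b -> c, a -> c
edges = [("a", "b"), ("b", "c"), ("a", "c")]
in_deg, out_deg = degrees(edges)
print("in-degree:", in_deg)
print("out-degree:", out_deg)
```

In an undirected graph the degree of a node is simply the number of edges incident to it; in this directed sketch it would be the sum of the two counts.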
Successor and Predecessor
• The successors of a node v are all the nodes that v points to.
• The predecessors of a node v are all the nodes that point to v.
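A small sketch of how successor and predecessor sets could be built from an edge list. The function name and the example edges are assumptions made for illustration only.

```python
from collections import defaultdict

def successors_and_predecessors(edges):
    """Map each node to the set of nodes it points to, and to the set pointing to it."""
    succ = defaultdict(set)
    pred = defaultdict(set)
    for u, v in edges:
        succ[u].add(v)  # v is a successor of u
        pred[v].add(u)  # u is a predecessor of v
    return succ, pred

# Hypothetical toy digraph: a -> b, b -> c, a -> c
succ, pred = successors_and_predecessors([("a", "b"), ("b", "c"), ("a", "c")])
print("successors of a:", succ["a"])
print("predecessors of c:", pred["c"])
```

With this representation the out-degree of v is `len(succ[v])` and the in-degree is `len(pred[v])`.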
Subset of Nodes
• A subset of nodes S of V is a connected component iff for every pair of vertices u,v in S, u is reachable from v, and S is maximal.
• A graph is connected iff for every pair of vertices u,v in V, u is reachable from v.
• A set of nodes S is a strongly connected component (SCC) of a digraph iff, for every pair of nodes A,B in S, there exists a directed path from A to B and from B to A, and the set is maximal.
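One standard way to compute strongly connected components is Kosaraju's two-pass algorithm. The slides do not name an algorithm, so the choice here is an assumption; the sketch below is iterative, and the function name and example graph are hypothetical.

```python
from collections import defaultdict

def strongly_connected_components(nodes, edges):
    """Kosaraju's algorithm: return a list of SCCs, each as a set of nodes."""
    graph = defaultdict(list)
    reverse = defaultdict(list)
    for u, v in edges:
        graph[u].append(v)
        reverse[v].append(u)

    # Pass 1: record nodes in order of DFS completion on the original graph.
    visited, order = set(), []
    for start in nodes:
        if start in visited:
            continue
        visited.add(start)
        stack = [(start, iter(graph[start]))]
        while stack:
            node, neighbours = stack[-1]
            for nxt in neighbours:
                if nxt not in visited:
                    visited.add(nxt)
                    stack.append((nxt, iter(graph[nxt])))
                    break
            else:
                # All neighbours explored: the node is finished.
                order.append(node)
                stack.pop()

    # Pass 2: explore the reversed graph in decreasing completion order.
    assigned, components = set(), []
    for start in reversed(order):
        if start in assigned:
            continue
        component, stack = set(), [start]
        assigned.add(start)
        while stack:
            node = stack.pop()
            component.add(node)
            for prev in reverse[node]:
                if prev not in assigned:
                    assigned.add(prev)
                    stack.append(prev)
        components.append(component)
    return components

# Hypothetical example: a 3-cycle 1 -> 2 -> 3 -> 1 plus an edge 3 -> 4.
components = strongly_connected_components([1, 2, 3, 4], [(1, 2), (2, 3), (3, 1), (3, 4)])
print(components)
```

Node 4 is reachable from the cycle but cannot reach it back, so it forms a component of its own.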
The Webgraph
A “sort of” Webgraph
Well...
The Size of Webgraph
• The web is really infinite
• Dynamic content, e.g. calendars, online organizers, etc.
• http://www.raingod.com/raingod/resources/Programming/JavaScript/Software/RandomStrings/index.html
• Static web contains syntactic duplication, mostly due to mirroring (~ 20-30%)
• Some servers are seldom connected.
Recent Measurement
• A. Gullì and A. Signorini. The Indexable Web is More than 11.5 Billion Pages. WWW2005.
• 2.3B pages are unknown to popular search engines.
• 35-120B pages are within the hidden web.
• The index intersection between the largest available search engines - namely Google, Yahoo!, MSN, Ask/Teoma - is estimated to be
We’ll dedicate a lesson to this at the end.
Let’s Characterize it Better
These are power-law distributions!
Power-laws (an Informal Definition)
• Power-law trends arise in many different natural contexts:
• Telephone call networks.
• Java program networks.
• E-mail networks.
• Scientific citations.
• Protein-protein interactions in a cell.
• http://wordcount.org/main.php (Zipf’s law)
• ...
Power-laws (an Informal Definition)
• Sometimes called heavy-tail or long-tail distributions.
• In a power law network many nodes have degree equal to 1 and very few of them have higher degrees.
Power-law
• Two discrete random variables x and y are related by a power-law when:
• $y(x) = K x^{-a}$
• where K and a are positive constants
• The constant a is often called the power law exponent.
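Taking logarithms of the power-law relation gives log y = log K - a log x, a straight line of slope -a on a log-log plot. A minimal sketch of recovering K and a by least squares on the logs follows; the function name and the sample data are hypothetical, and this simple fit is only an illustration, not a robust estimator for noisy empirical data.

```python
import math

def fit_power_law(xs, ys):
    """Least-squares fit of log y = log K - a * log x; returns (K, a)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x, mean_y = sum(lx) / n, sum(ly) / n
    slope = (sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly))
             / sum((u - mean_x) ** 2 for u in lx))
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope

# Synthetic, noise-free data generated with K = 3 and a = 1.5 (illustrative values).
xs = [1, 2, 4, 8, 16]
ys = [3.0 * x ** -1.5 for x in xs]
K, a = fit_power_law(xs, ys)
print(K, a)
```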
Power-law Distribution
• A discrete random variable is distributed according to a power-law when its probability mass function (often loosely called the pdf) is given by:
• $p(x) = K x^{-a}$
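If the variable ranges over the positive integers (an assumption; the slide does not state the support), the constant K is fixed by the requirement that the probabilities sum to 1, which needs a > 1:

```latex
\sum_{x=1}^{\infty} K x^{-a} = 1
\quad\Longrightarrow\quad
K = \frac{1}{\zeta(a)},
\qquad
\zeta(a) = \sum_{x=1}^{\infty} x^{-a}, \quad a > 1.
```

For a ≤ 1 the series diverges, so the distribution can only be normalized over a finite range of values.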
Examples of Power-laws
[Figure: log-log plot of $y = x^{-1.0}$, $y = x^{-1.5}$, and $y = x^{-2.0}$]
Semantic of Power-law Distributions
• Roughly speaking, a variable is distributed according to a power-law when a few values have a very high probability of occurring, whereas the majority of the values occur very rarely.
• For instance, word frequencies in English texts are distributed according to a power-law with parameter a=1 (Zipf's Law).
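To look at Zipf's law on a text, one can build a rank-frequency table and check whether frequency falls roughly as 1/rank. The sketch below uses only the standard library; the tokenizer (a simple regular expression) and the sample sentence are assumptions for illustration, and a sentence this short is far too small to exhibit the law.

```python
import re
from collections import Counter

def rank_frequency(text):
    """Return a list of (rank, word, count), most frequent word first."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    return [(rank, word, count)
            for rank, (word, count) in enumerate(counts.most_common(), start=1)]

# Hypothetical sample text; a real experiment would use a large corpus.
table = rank_frequency("the cat and the dog and the bird")
for rank, word, count in table:
    print(rank, word, count)
```

Under Zipf's law with a = 1, the product rank × count would stay roughly constant across the table.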
Wikipedia’s Word Distribution
From http://en.wikipedia.org/wiki/Zipf's_law
Diameter of a Graph
• Informally it is the “longest shortest path”
The diameter of the example graph is thus 6.
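The "longest shortest path" can be computed by running a breadth-first search from every node and taking the largest distance found. This is a minimal sketch, assuming a connected, unweighted, undirected graph stored as an adjacency dict; the function names are hypothetical, and the 7-node path used as input is an illustrative example rather than the graph pictured on the slide.

```python
from collections import deque

def eccentricity(adj, source):
    """Largest BFS distance from `source` to any node reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return max(dist.values())

def diameter(adj):
    """Longest shortest path over all pairs (assumes the graph is connected)."""
    return max(eccentricity(adj, s) for s in adj)

# Hypothetical example: a path 0 - 1 - 2 - 3 - 4 - 5 - 6.
path = {i: [j for j in (i - 1, i + 1) if 0 <= j < 7] for i in range(7)}
print(diameter(path))
```

Running one BFS per node costs O(|V| (|V| + |E|)), which is fine for small examples but too expensive for graphs of Web size.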
![Page 22: Mining di Dati Webdidawiki.cli.di.unipi.it/lib/exe/fetch.php/wma/lezione2.pdf• A graph is connected iff for every pair of vertices u,v in V, u is reachable from v. • A set of nodes](https://reader031.vdocuments.net/reader031/viewer/2022041722/5e4f5182eae8fa34a2001ad3/html5/thumbnails/22.jpg)
Diameter of Webgraphs
• In Webgraphs the diameter should be “as small as possible”
• If N is the number of nodes of the graph, Webgraphs exhibit logarithmic diameters - i.e. O(log N)
• This property is also known as:
• Scale-free: because doubling the number of nodes increases the diameter by only 1
• Small World: because any two nodes are linked by a path through very few vertices
Typically the diameter of a Webgraph is 19
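The "doubling adds a constant" behaviour follows directly from a logarithmic diameter. As a sketch, assuming the diameter grows as c log₂ N for some constant c (the constant and the base of the logarithm are assumptions, not stated on the slide):

```latex
\operatorname{diam}(N) \approx c \log_2 N
\quad\Longrightarrow\quad
\operatorname{diam}(2N) \approx c \left( \log_2 N + 1 \right) = \operatorname{diam}(N) + c .
```

With c = 1 this gives exactly the increase of 1 per doubling.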
![Page 23: Mining di Dati Webdidawiki.cli.di.unipi.it/lib/exe/fetch.php/wma/lezione2.pdf• A graph is connected iff for every pair of vertices u,v in V, u is reachable from v. • A set of nodes](https://reader031.vdocuments.net/reader031/viewer/2022041722/5e4f5182eae8fa34a2001ad3/html5/thumbnails/23.jpg)
Bipartite Cores
• Informally a bipartite core in a graph consists of two sets of nodes L and R such that every node in L links to every node in R.
[Figure: bipartite core with node sets L and R]
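Checking whether two given node sets form a bipartite core amounts to testing that every L-to-R edge is present. A minimal sketch, with a hypothetical function name and toy data:

```python
def is_bipartite_core(edges, left, right):
    """True iff every node in `left` links to every node in `right`."""
    edge_set = set(edges)
    return all((l, r) in edge_set for l in left for r in right)

# Hypothetical example: two pages on the left both link to two pages on the right.
full = [("a", "x"), ("a", "y"), ("b", "x"), ("b", "y")]
print(is_bipartite_core(full, {"a", "b"}, {"x", "y"}))
print(is_bipartite_core(full[:-1], {"a", "b"}, {"x", "y"}))
```

This only verifies a candidate pair of sets; finding all cores in a large graph is a much harder enumeration problem that this sketch does not address.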
![Page 24: Mining di Dati Webdidawiki.cli.di.unipi.it/lib/exe/fetch.php/wma/lezione2.pdf• A graph is connected iff for every pair of vertices u,v in V, u is reachable from v. • A set of nodes](https://reader031.vdocuments.net/reader031/viewer/2022041722/5e4f5182eae8fa34a2001ad3/html5/thumbnails/24.jpg)
Models of the Webgraph
• On-line property.
• The number of nodes and edges changes with time.
• Power law degree distribution.
• Small world property.
• Many bipartite substructures.
![Page 25: Mining di Dati Webdidawiki.cli.di.unipi.it/lib/exe/fetch.php/wma/lezione2.pdf• A graph is connected iff for every pair of vertices u,v in V, u is reachable from v. • A set of nodes](https://reader031.vdocuments.net/reader031/viewer/2022041722/5e4f5182eae8fa34a2001ad3/html5/thumbnails/25.jpg)
Random Graphs
• RGs are structures introduced by Paul Erdős and Alfréd Rényi.
• There are several models of RGs. We are concerned with the model $G_{n,p}$.
• A graph G = (V,E) in $G_{n,p}$ is such that |V|=n and each possible edge (u,v) is included independently with probability p.
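Sampling from this model is a one-liner: visit every unordered pair of nodes and keep it with probability p. The sketch below assumes an undirected graph on nodes 0..n-1; the function name and the `seed` parameter are illustrative choices.

```python
import random
from itertools import combinations

def sample_gnp(n, p, seed=None):
    """Sample an undirected graph on nodes 0..n-1, keeping each pair with probability p."""
    rng = random.Random(seed)
    return [(u, v) for u, v in combinations(range(n), 2) if rng.random() < p]

edges = sample_gnp(10, 0.3, seed=42)
print(len(edges), "edges out of", 10 * 9 // 2, "possible pairs")
```

The expected number of edges is p · n(n-1)/2; the quadratic loop over all pairs makes this suitable only for small n.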
![Page 26: Mining di Dati Webdidawiki.cli.di.unipi.it/lib/exe/fetch.php/wma/lezione2.pdf• A graph is connected iff for every pair of vertices u,v in V, u is reachable from v. • A set of nodes](https://reader031.vdocuments.net/reader031/viewer/2022041722/5e4f5182eae8fa34a2001ad3/html5/thumbnails/26.jpg)
Why Can't the Webgraph Be a Random Graph?
• Let $X_v$ be the degree of node v.
• Let $X_{v,w}$ be a r.v. equal to 1 if there is an edge joining v and w (v ≠ w), 0 otherwise.
• Then
$$X_v = \sum_{w \neq v} X_{v,w}$$
$$E[X_v] = \sum_{w \neq v} E[X_{v,w}] = \sum_{w \neq v} p = (n-1)\,p$$
• Thus $X_v$ is distributed as a Binomial(n-1, p), not a power-law.
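The binomial degree distribution can be checked numerically: its mean is (n-1)p and its tail drops off far faster than any power law. A minimal sketch with illustrative values n = 101 and p = 0.05 (both are assumptions chosen to keep the numbers small):

```python
import math

def degree_pmf(n, p, k):
    """P(X_v = k) when X_v ~ Binomial(n - 1, p)."""
    return math.comb(n - 1, k) * p ** k * (1 - p) ** (n - 1 - k)

n, p = 101, 0.05
pmf = [degree_pmf(n, p, k) for k in range(n)]
mean = sum(k * q for k, q in enumerate(pmf))

print("total probability:", sum(pmf))
print("mean degree:", mean, "expected (n-1)p:", (n - 1) * p)
# Compare the probability of degree 30 with a power-law-like value 30**-3.
print("P(X_v = 30):", pmf[30], "vs 30**-3:", 30 ** -3)
```

High-degree nodes are exponentially unlikely under this model, whereas the Webgraph is observed to have many of them.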
![Page 27: Mining di Dati Webdidawiki.cli.di.unipi.it/lib/exe/fetch.php/wma/lezione2.pdf• A graph is connected iff for every pair of vertices u,v in V, u is reachable from v. • A set of nodes](https://reader031.vdocuments.net/reader031/viewer/2022041722/5e4f5182eae8fa34a2001ad3/html5/thumbnails/27.jpg)
Preferential Attachment (PA)
• Parameter: m a positive integer
• At time 0, add a single edge
• At time t+1, add m edges from a new node $v_{t+1}$ to existing nodes
• the edge $(v_{t+1}, v_s)$ is added with probability $\mathrm{degree}(v_s)/(2t)$.
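The process above can be simulated with a simple trick: keep a list in which every node appears once per incident edge, so that a uniform draw from the list picks a node with probability proportional to its degree. The sketch below makes several assumptions not fixed by the slide: the m targets of a new node are drawn independently (so repeated edges are possible), they are drawn from the degrees as they stood before the new node arrived, and the function name and `seed` parameter are hypothetical.

```python
import random

def preferential_attachment(steps, m=1, seed=None):
    """Grow a graph for `steps` time steps; return the list of edges (new_node, target)."""
    rng = random.Random(seed)
    edges = [(0, 1)]    # time 0: a single edge between nodes 0 and 1
    endpoints = [0, 1]  # each node appears once per incident edge
    for new_node in range(2, steps + 2):
        # A uniform draw from `endpoints` selects a node with probability
        # proportional to its current degree.
        targets = [rng.choice(endpoints) for _ in range(m)]
        for target in targets:
            edges.append((new_node, target))
            endpoints.extend((new_node, target))
    return edges

edges = preferential_attachment(100, m=2, seed=0)
print(len(edges), "edges")
```

With m = 1, a graph holding t edges has total degree 2t, so the draw matches the degree(v_s)/(2t) probability stated above.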
![Page 28: Mining di Dati Webdidawiki.cli.di.unipi.it/lib/exe/fetch.php/wma/lezione2.pdf• A graph is connected iff for every pair of vertices u,v in V, u is reachable from v. • A set of nodes](https://reader031.vdocuments.net/reader031/viewer/2022041722/5e4f5182eae8fa34a2001ad3/html5/thumbnails/28.jpg)
An example
Generated with http://ccl.northwestern.edu/netlogo/models/PreferentialAttachment
![Page 29: Mining di Dati Webdidawiki.cli.di.unipi.it/lib/exe/fetch.php/wma/lezione2.pdf• A graph is connected iff for every pair of vertices u,v in V, u is reachable from v. • A set of nodes](https://reader031.vdocuments.net/reader031/viewer/2022041722/5e4f5182eae8fa34a2001ad3/html5/thumbnails/29.jpg)
PA in-degree
• Fix m a positive integer, fix an epsilon > 0. For k a non-negative integer, define
$$\alpha_{m,k} = \frac{2m(m+1)}{(k+m)(k+m+1)(k+m+2)}$$
Then with probability tending to 1 as t goes to infinity, for all k satisfying $0 \le k \le t^{1/5}$,
$$(1-\epsilon)\,\alpha_{m,k} \le p(k) \le (1+\epsilon)\,\alpha_{m,k}$$
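Two properties of the bounding quantity 2m(m+1) / ((k+m)(k+m+1)(k+m+2)) are easy to verify numerically: summed over all k ≥ 0 it telescopes to 1, and for large k it behaves like 2m(m+1) k⁻³, i.e. a power law with exponent 3. The sketch below uses exact rational arithmetic; the function name `alpha` and the value m = 2 are illustrative assumptions.

```python
from fractions import Fraction

def alpha(m, k):
    """The quantity 2m(m+1) / ((k+m)(k+m+1)(k+m+2)) as an exact fraction."""
    return Fraction(2 * m * (m + 1), (k + m) * (k + m + 1) * (k + m + 2))

m, terms = 2, 1000
partial = sum(alpha(m, k) for k in range(terms))
# Telescoping gives: partial sum = 1 - m(m+1) / ((m+terms)(m+terms+1)).
print("partial sum:", float(partial))

# For large k, k**3 * alpha(m, k) approaches 2m(m+1).
k = 10 ** 6
print("k^3 * alpha:", float(alpha(m, k) * k ** 3), "limit:", 2 * m * (m + 1))
```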
![Page 30: Mining di Dati Webdidawiki.cli.di.unipi.it/lib/exe/fetch.php/wma/lezione2.pdf• A graph is connected iff for every pair of vertices u,v in V, u is reachable from v. • A set of nodes](https://reader031.vdocuments.net/reader031/viewer/2022041722/5e4f5182eae8fa34a2001ad3/html5/thumbnails/30.jpg)
PA Diameter
• Fix an integer m≥2 and a positive real number epsilon. With probability tending to 1 as t goes to infinity, $G_m(t)$ is connected and
$$(1-\epsilon)\,\frac{\log t}{\log\log t} \le \operatorname{diam}(G_m(t)) \le (1+\epsilon)\,\frac{\log t}{\log\log t}$$
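To get a feel for how slowly log t / log log t grows, the short sketch below evaluates it for a few graph sizes and prints log t alongside for comparison. The function name is hypothetical, natural logarithms are assumed, and t must exceed e so that log log t is positive.

```python
import math

def pa_diameter_scale(t):
    """The quantity log t / log log t (natural logarithms, t > e)."""
    return math.log(t) / math.log(math.log(t))

for t in (10 ** 3, 10 ** 6, 10 ** 9):
    print(t, round(pa_diameter_scale(t), 2), round(math.log(t), 2))
```

The ratio stays in single digits even for a billion nodes, which is smaller than the plain log t growth.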
![Page 31: Mining di Dati Webdidawiki.cli.di.unipi.it/lib/exe/fetch.php/wma/lezione2.pdf• A graph is connected iff for every pair of vertices u,v in V, u is reachable from v. • A set of nodes](https://reader031.vdocuments.net/reader031/viewer/2022041722/5e4f5182eae8fa34a2001ad3/html5/thumbnails/31.jpg)
Scale-Free Networks
• Network analysis is in its infancy
• Many different examples of networks exist.
![Page 32: Mining di Dati Webdidawiki.cli.di.unipi.it/lib/exe/fetch.php/wma/lezione2.pdf• A graph is connected iff for every pair of vertices u,v in V, u is reachable from v. • A set of nodes](https://reader031.vdocuments.net/reader031/viewer/2022041722/5e4f5182eae8fa34a2001ad3/html5/thumbnails/32.jpg)
Co-authors Network
![Page 33: Mining di Dati Webdidawiki.cli.di.unipi.it/lib/exe/fetch.php/wma/lezione2.pdf• A graph is connected iff for every pair of vertices u,v in V, u is reachable from v. • A set of nodes](https://reader031.vdocuments.net/reader031/viewer/2022041722/5e4f5182eae8fa34a2001ad3/html5/thumbnails/33.jpg)
Those with Erdős number ≤ 2
![Page 34: Mining di Dati Webdidawiki.cli.di.unipi.it/lib/exe/fetch.php/wma/lezione2.pdf• A graph is connected iff for every pair of vertices u,v in V, u is reachable from v. • A set of nodes](https://reader031.vdocuments.net/reader031/viewer/2022041722/5e4f5182eae8fa34a2001ad3/html5/thumbnails/34.jpg)
Protein-Protein Interactions
![Page 35: Mining di Dati Webdidawiki.cli.di.unipi.it/lib/exe/fetch.php/wma/lezione2.pdf• A graph is connected iff for every pair of vertices u,v in V, u is reachable from v. • A set of nodes](https://reader031.vdocuments.net/reader031/viewer/2022041722/5e4f5182eae8fa34a2001ad3/html5/thumbnails/35.jpg)
Hollywood Network
![Page 36: Mining di Dati Webdidawiki.cli.di.unipi.it/lib/exe/fetch.php/wma/lezione2.pdf• A graph is connected iff for every pair of vertices u,v in V, u is reachable from v. • A set of nodes](https://reader031.vdocuments.net/reader031/viewer/2022041722/5e4f5182eae8fa34a2001ad3/html5/thumbnails/36.jpg)
The Lesson is Over