measurement and modeling of the web and related data sets
DESCRIPTION
- Web Measurement
- Self similarity on the web
- Extraction of information from large graphs
- A word on evolution
TRANSCRIPT
| May 2003 | Almaden Research Center, San Jose, CA © 2003 IBM Corporation
IMA Tutorial (part II): Measurement and modeling of the web and related data sets
Andrew Tomkins
IBM Almaden Research Center
May 5, 2003
Setup
- This hour: data analysis on the web
- Next hour: probabilistic generative models, particularly focused on models that generate distributions that are power laws in the limit
Context
- Data analysis on the web…
- …as a hyperlinked corpus
- Note: Many areas of document analysis are highly relevant to the web, and should not be ignored (but will be):
• Supervised/unsupervised classification (Jon – combinatorial side)
• Machine learning (Jon – a little)
• Information retrieval (Jon – dimensionality reduction)
• Information extraction
• NLP
• Discourse analysis
• Relationship induction
• etc
Focus Areas
- Web Measurement
- Self similarity on the web
- Extraction of information from large graphs
- A word on evolution
One view of the Internet: Inter-Domain Connectivity
Core: maximal clique of high-degree nodes
Shells: nodes in 1-neighborhood of core, or of previous shell, with degree > 1
Legs: 1-degree nodes
[Figure: the core, surrounded by shells 1, 2, and 3]
[Tauro, Palmer, Siganos, Faloutsos, 2001 Global Internet]
Another view of the web: the hyperlink graph
- Each static html page = a node
- Each hyperlink = a directed edge
- Currently ~$10^{10}$ nodes (mostly junk), $10^{11}$ edges
Getting started – structure at the hyperlink level
Measure properties of the link structure of the web.
Study a sample of the web that contains a reasonable fraction of the entire web.
Apply tools from graph theory to understand the structure.
[Broder, Kumar, Maghoul, Raghavan, Rajagopalan, Stata, Tomkins, Wiener, 2001]
Terminology
- SCC – strongly connected component
- WCC – “weakly connected component” – connected component in the underlying undirected graph
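The two definitions can be made concrete with a small sketch. The function names, the edge-list input format, and the choice of Kosaraju's algorithm and union-find are assumptions made for illustration; the slides do not say which algorithms were run on the crawl.

```python
from collections import defaultdict


def weakly_connected_components(nodes, edges):
    """Components of the underlying undirected graph, via union-find.

    Assumes every edge endpoint appears in `nodes`.
    """
    parent = {v: v for v in nodes}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for u, v in edges:
        parent[find(u)] = find(v)  # edge direction is ignored

    groups = defaultdict(set)
    for v in nodes:
        groups[find(v)].add(v)
    return list(groups.values())


def strongly_connected_components(nodes, edges):
    """Kosaraju's two-pass algorithm, written iteratively."""
    fwd, rev = defaultdict(list), defaultdict(list)
    for u, v in edges:
        fwd[u].append(v)
        rev[v].append(u)

    # Pass 1: record nodes in order of DFS finishing time on the forward graph.
    order, seen = [], set()
    for s in nodes:
        if s in seen:
            continue
        seen.add(s)
        stack = [(s, iter(fwd[s]))]
        while stack:
            v, it = stack[-1]
            for w in it:
                if w not in seen:
                    seen.add(w)
                    stack.append((w, iter(fwd[w])))
                    break
            else:
                order.append(v)  # all out-links of v explored
                stack.pop()

    # Pass 2: sweep the reversed graph in decreasing finishing time.
    comps, assigned = [], set()
    for s in reversed(order):
        if s in assigned:
            continue
        comp, stack = set(), [s]
        assigned.add(s)
        while stack:
            v = stack.pop()
            comp.add(v)
            for w in rev[v]:
                if w not in assigned:
                    assigned.add(w)
                    stack.append(w)
        comps.append(comp)
    return comps
```

On a 3-cycle with one extra out-link, the cycle is one SCC, the dangling page is its own SCC, and all four pages form a single WCC.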
Data
- AltaVista crawls, up to 500M pages
- Ran strong and weak connected component algorithms
- Ran directed breadth-first searches from 1000 random starting nodes, both forwards and backwards along links
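A minimal sketch of the forward and backward searches, assuming the graph is given as a list of (source, destination) pairs. The function names are hypothetical, and this is not the code used on the crawl.

```python
from collections import defaultdict, deque


def reachable_count(adj, start):
    """Number of nodes reachable from `start` by BFS (including `start`)."""
    seen = {start}
    queue = deque([start])
    while queue:
        v = queue.popleft()
        for w in adj.get(v, ()):
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return len(seen)


def forward_backward_reach(edges, start):
    """Reach along links (forwards) and against links (backwards)."""
    fwd, rev = defaultdict(list), defaultdict(list)
    for u, v in edges:
        fwd[u].append(v)
        rev[v].append(u)
    return reachable_count(fwd, start), reachable_count(rev, start)
```

Repeating this from many random start pages and looking at the two counts shows whether a start page reaches a large part of the graph, is reached by a large part of it, both, or neither.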
Breadth-first search from random starts
How many vertices are reachable from a random vertex?
A Picture of (~200M) pages.
Some distance measurements
- Pr[u reachable from v] ~ 1/4
- Max distance between 2 SCC nodes: 28
- Max distance between 2 nodes (if there is a path) > 900
- Avg distance between 2 SCC nodes: 16
Facts (about the crawl).
- Indegree and outdegree distributions satisfy a power law, consistent over time and scale.
- The distribution of indegrees on the web is given by a power law: a heavy-tailed distribution, with many high-indegree pages (eg, Yahoo)
Analysis of power law
$\Pr[\text{page has } k \text{ inlinks}] \approx k^{-2.1}$
$\Pr[\text{page has } k \text{ outlinks}] \approx k^{-2.7}$
Corollary: $\Pr[\text{page has} > k \text{ inlinks}] \approx 1/k$
Component sizes.
Component sizes follow a power law.
Other observed power laws in the web
- Depths of URLs
- Sizes of sites
- Eigenvalues of adjacency matrix of hyperlink graph [Mihail and Papadimitriou shed some light here]
- Many different traffic measures
- Linkage between hosts and domains
- Many of the above measures on particular subsets of the graph
- …
[Faloutsos, Faloutsos, Faloutsos 99]
[Bharat, Chang, Henzinger, Ruhl 02]
More Characterization: Self-Similarity
Ways to Slice the Web
- Domain (*.it)
- Host (www.ibm.com)
- Geography (pages with a geographical reference in the Western US)
- Content
• Keyword: Math, subdivided by Math Geometry
• Keyword: MP3, subdivided by MP3 Napster
We call these slices “Thematically Unified Communities”, or TUCs
Self-Similarity on the Web
- Pervasive: holds for all reasonable characteristics
- Robust: holds for all reasonable slices
- “Theorem:”
• TUCs share properties with the web at large
• TUCs are linked by a “navigational backbone”
In particular…
All TUCs have:
• Power laws for degree, SCC, and WCC distributions
• Similar exponents for power laws
• Similar “bow tie” structure
• Large number of dense subgraphs
Is this surprising? YES (for downsampling general graphs). Example:
This graph has 1 SCC containing all nodes
Remove any nonzero fraction of edges – graph has n components of size 1
Generally: a random subset of $n^{1/2}$ nodes in a graph with O(n) edges will contain only a constant number of edges in expectation
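The general claim follows from a one-line expectation. The sketch below assumes each node is kept independently with probability $p = n^{-1/2}$ (so about $n^{1/2}$ nodes survive) and that the graph has $m = O(n)$ edges; an edge survives only if both of its endpoints are kept.

```latex
% Assumption: independent node sampling with p = n^{-1/2}, and m = O(n) edges.
\mathbb{E}[\text{edges in the sample}]
  = m \cdot p^{2}
  = O(n) \cdot \frac{1}{n}
  = O(1)
```

So a generic sparse graph falls apart under this kind of downsampling, which is why structure surviving in slices of the web is notable.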
A structural explanation
Each TUC has a “bow tie” – how do they relate?
The Navigational Backbone
Each TUC contains a large SCC that is well-connected to the SCCs of other TUCs
Information Extraction from Large Graphs
Overview
[Figure: WWW, distilled into KB1, KB2, KB3]
Goal: Create higher-level "knowledge bases" of web information for further processing.
[Kumar, Raghavan, Rajagopalan, Tomkins 1999]
Many approaches to this problem
- Databases over the web:
• Web SQL, Lore, ParaSite, etc
- Data mining
• A priori, Query flocks, etc
- Information foraging
- Community extraction
• [Lawrence et al]
- Authority-based search
• HITS, and variants
General approach
- It’s hard (though getting easier) to analyze the content of all pages on the web
- It’s easier (though still hard) to analyze the graph
- How successfully can we extract useful semantic knowledge (ie, community structure) from links alone?
Web Communities
[Figure: two example communities. Fishing: Outdoor Magazine, Bill's Fishing Resources. Linux: Linux Links, LDP.]
Different communities appear to have very different structure.
Web Communities
[Figure: two example communities. Fishing: Outdoor Magazine, Bill's Fishing Resources. Linux: Linux Links, LDP.]
But both contain a common “footprint”: two pages that both point to three other pages in common
Communities and cores
Example: $K_{2,3}$
Definition: A "core" $K_{i,j}$ consists of i left nodes, j right nodes, and all left->right edges.
Critical facts:
1. Almost all communities contain a core [expected]
2. Almost all cores betoken a community [unexpected]
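The core definition can be checked directly on a small graph. This is a brute-force sketch for illustration only: `find_cores` and the dictionary-of-out-links input are hypothetical names, and the exhaustive search would not scale to a web crawl.

```python
from itertools import combinations


def find_cores(out_links, i, j):
    """Yield (left, right) node tuples forming a K_{i,j}.

    out_links maps each page to the pages it links to. Every left node
    must link to every right node.
    """
    for left in combinations(sorted(out_links), i):
        # Pages that all i left nodes point to.
        common = set.intersection(*(set(out_links[u]) for u in left))
        common -= set(left)  # keep the two sides disjoint
        for right in combinations(sorted(common), j):
            yield left, right
```

For example, two pages that both link to the same three pages yield exactly one $K_{2,3}$.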
Other footprint structures
- Newsgroup thread
- Web ring
- Corporate partnership
- Intranet fragment
Subgraph enumeration
Goal: Given a graph-theoretic "footprint" for structures of interest, find ALL occurrences of these footprints.
Enumerating cores
Preprocessing
Clean data by removing: mirrors (true and approximate), empty pages, too-popular pages, nepotistic pages

Inclusion/Exclusion Pruning
a belongs to a $K_{2,3}$ if and only if some node points to b1, b2, b3.
[Figure: node a and nodes b1, b2, b3]

Postprocessing
When no more pruning is possible, finish using database techniques
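One simple form of pruning can be sketched as follows. This is a degree-based simplification, not the inclusion/exclusion step from the slide: a left node of a $K_{i,j}$ needs out-degree at least j and a right node needs in-degree at least i, so edges that fail either test cannot be part of a core. The function name and edge-list format are assumptions.

```python
def prune_for_cores(edges, i, j):
    """Repeatedly drop edges that cannot belong to any K_{i,j}.

    Removing one edge can lower other degrees below the threshold,
    so the filter is iterated until nothing changes.
    """
    edges = set(edges)
    while True:
        outdeg, indeg = {}, {}
        for u, v in edges:
            outdeg[u] = outdeg.get(u, 0) + 1
            indeg[v] = indeg.get(v, 0) + 1
        kept = {(u, v) for u, v in edges if outdeg[u] >= j and indeg[v] >= i}
        if kept == edges:
            return kept
        edges = kept
```

Whatever survives is usually far smaller than the input, which is what makes a final exhaustive pass affordable.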
Results for cores
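[Figure: number of cores found by Elimination/Generation, in thousands (vertical axis 0 to 100), for i = 3, 4, 5, 6; horizontal axis ticks 3, 5, 7, 9]
[Figure: number of cores found during postprocessing, in thousands (vertical axis 0 to 80), for i = 3, 4; horizontal axis ticks 3, 5, 7, 9]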
The cores are interesting
Explicit communities:
• Yahoo!, Excite, Infoseek
• webrings
• news groups
• mailing lists
Implicit communities:
• japanese elementary schools
• turkish student associations
• oil spills off the coast of japan
• australian fire brigades
(1) Implicit communities are defined by cores.
(2) There are an order of magnitude more of these ($10^5$+).
(3) Can grow the core to the community using further processing.
Elementary Schools in Japan
The American School in Japan; The Link Page; [Japanese title]; Kids' Space; [Japanese titles]; KEIMEI GAKUEN Home Page (Japanese); Shiranuma Home Page; fuzoku-es.fukui-u.ac.jp; welcome to Miasa E&J school; [Japanese title]; http://www...p/~m_maru/index.html; fukui haruyama-es HomePage; Torisu primary school; goo; Yakumo Elementary, Hokkaido, Japan; FUZOKU Home Page; Kamishibun Elementary School...
schools; LINK Page-13; [Japanese titles]; 100 Schools Home Pages (English); K-12 from Japan 10/...rnet and Education ); http://www...iglobe.ne.jp/~IKESAN; [Japanese titles]; Koulutus ja oppilaitokset; TOYODA HOMEPAGE; Education; Cay's Homepage(Japanese); [Japanese title]; UNIVERSITY; [Japanese title]; DRAGON97-TOP; [Japanese titles]
So…
- Possible to extract order-of-magnitude more communities than currently known.
- Few (4%) of these appear coincidental.
- Entirely automatic extraction.
- Open question: how to use implicit communities?
A word on evolution
A word on evolution
- Phenomenon to characterize: A topic in a temporal stream occurs in a “burst of activity”
- Model source as multi-state
- Each state has certain emission properties
- Traversal between states is controlled by a Markov Model
- Determine most likely underlying state sequence over time, given observable output
[Kleinberg02]
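The last step, recovering the most likely hidden state sequence, is a standard Viterbi decoding. The sketch below is a simplified two-state version under stated assumptions: a single symmetric `switch` probability, a fixed starting state, and per-step relative likelihoods supplied by the caller. It is not the cost formulation of [Kleinberg02], and all names are hypothetical.

```python
import math


def most_likely_states(likelihoods, switch=0.01, start=0):
    """Viterbi decoding for a two-state chain.

    likelihoods[t][s]: relative likelihood of observation t under state s.
    switch: assumed probability of changing state between steps.
    start: state the chain is assumed to begin in.
    """
    log_trans = [[math.log(1 - switch if a == b else switch) for b in (0, 1)]
                 for a in (0, 1)]
    neg_inf = float("-inf")
    score = [math.log(likelihoods[0][s]) if s == start else neg_inf
             for s in (0, 1)]
    back = []  # back[t][s]: best predecessor of state s at step t + 1
    for obs in likelihoods[1:]:
        prev, score, ptr = score, [], []
        for s in (0, 1):
            best = max((0, 1), key=lambda a: prev[a] + log_trans[a][s])
            ptr.append(best)
            score.append(prev[best] + log_trans[best][s] + math.log(obs[s]))
        back.append(ptr)
    # Trace back from the best final state.
    state = max((0, 1), key=lambda s: score[s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]
```

The cost of switching is what keeps the decoded sequence from flipping state at every noisy observation: a step that mildly favors the other state is absorbed, while a sustained run forces a transition.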
Example
[Figure: utterances in a conversation plotted against time: “I’ve been thinking about your idea with the asparagus…”, “Uh huh”, “I think I see…”, “Uh huh”, “Yeah, that’s what I’m saying…”, “So then I said ‘Hey, let’s give it a try’”, “And anyway she said maybe, okay?”]
State 1: Output rate: very low
State 2: Output rate: very high
Transition probabilities between states 1 and 2: 0.005 and 0.01

Pr[1] ~    1   10    5   10    2    1    2
Pr[2] ~    5    2    5    2    7   10   10
Most likely “hidden” sequence: 1 1 1 1 2 2 2
More bursts
- Infinite chain of increasingly high-output states
- Allows hierarchical bursts
- Example 1: email messages
- Example 2: conference titles
Integrating bursts and graph analysis
Number of communities identified automatically as exhibiting “bursty” behavior – measure of cohesiveness of the blogspace
Number of blog pages that belong to a community
Number of blog communities
Wired magazine publishes an article on weblogs that impacts the tech community
Newsweek magazine publishes an article that reaches the population at large, responding to emergence, and triggering mainstream adoption
[KNRT03]
IMA Tutorial (part III): Generative and probabilistic models of data
May 5, 2003
Probabilistic generative models
Observation: These distributions have the same form:
1. Fraction of laptops that fail catastrophically during tutorials, by city
2. Fraction of pairs of shoes that spontaneously de-sole during periods of stress, by city
Conclusion: The distribution arises because the same stochastic process is at work, and this process can be understood beyond the context of each example
Models for Power Laws
- Power laws arise in many different areas of human endeavor, the “hallmark of human activity” (they also occur in nature)
- Can we find the underlying process (processes?) that accounts for this prevalence?
An Introduction to the Power Law
Definition: a distribution is said to have a power law if $\Pr[X \ge x] \sim c\,x^{-\alpha}$
Normally: $0 < \alpha \le 2$ (Var(X) infinite)
Sometimes: $0 < \alpha \le 1$ (Mean(X) infinite)
[Figure: an exponentially-decaying distribution compared with a power law distribution]
Early Observations: Pareto on Income
[Pareto1897] observed that the random variable X denoting the income of an individual has a power law distribution
More strongly, he observed that $\Pr[X \ge x] = (x/k)^{-\alpha}$ for $x \ge k$
For density function f, note that $\ln f(x) = (-\alpha-1)\ln(x) + c$ for constant c
Thus, in a plot of log(value) versus log(probability), power laws display a linear tail, while Pareto distributions are linear always.
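The log-density line follows by differentiating the Pareto tail. The sketch assumes the tail has the form $(x/k)^{-\alpha}$ on $x \ge k$ with exponent $\alpha > 0$ and minimum income $k > 0$.

```latex
% Assumption: Pr[X >= x] = (x/k)^{-\alpha} for x >= k.
f(x) = -\frac{d}{dx}\left(\frac{x}{k}\right)^{-\alpha}
     = \alpha\, k^{\alpha}\, x^{-\alpha-1}
\qquad\Longrightarrow\qquad
\ln f(x) = (-\alpha-1)\ln x + \ln\!\left(\alpha k^{\alpha}\right)
```

The additive constant is $\ln(\alpha k^{\alpha})$, so the log-log plot is a straight line with slope $-\alpha-1$ over the whole support.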
Early Observations: Yule/Zipf
- [Yule26] observed (and explained) power laws in the context of number of species within a genus
- [Zipf32] and [Estoup16] studied the relative frequency of words in natural language, beginning a cottage industry that continues to this day.
- A “Yule-Zipf” distribution is typically characterized by rank rather than value:
• The ith most frequent word in English occurs with probability proportional to 1/i.
- This characterization relies on finite vocabulary
Early Observations: Lotka on Citations
[Lotka25] presented the first occurrence of power laws in the context of graph theory, showing a power law for the indegree of the citation graph
Ranks versus Values
Commonly encountered phrasings of the power law in the context of word counts:
1. Pr[word has count >= W] has some form
2. Number of words with count >= W has some form
3. The frequency of the word with rank r has some form
• The first two forms are clearly identical.
• What about the third?
Equivalence of rank versus value formulation
- Given: number of words occurring t times ~ $t^{-\alpha}$
- Approach:
• Consider single most frequent word, with count T
• Characterize word occurring t times in terms of T
• Approximate rank of words occurring t times by counting number of words occurring at each more frequent count.
- Conclusion: Rank-j word occurs about $(cj + d)^{-\beta}$ times for some exponent $\beta > 0$ (power law)
- But... high ranks correspond to low values: must keep straight the “head” and the “tail”
[Bookstein90, Adamic99]
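The approach above can be written out as a short calculation. It is a sketch under two assumptions: the number of words occurring exactly $t$ times is $C\,t^{-\alpha}$ with $\alpha > 1$, and the sum over more frequent counts is approximated by an integral up to the top count $T$.

```latex
% Assumptions: count of words occurring t times is C t^{-\alpha}, \alpha > 1;
% sum over more frequent counts approximated by an integral.
r(t) \;\approx\; \int_{t}^{T} C\, s^{-\alpha}\, ds
     \;=\; \frac{C}{\alpha-1}\left(t^{1-\alpha} - T^{1-\alpha}\right)
\qquad\Longrightarrow\qquad
t \;\approx\; \left(\frac{\alpha-1}{C}\, r + T^{1-\alpha}\right)^{-1/(\alpha-1)}
```

Under these assumptions the count at rank $r$ has the shifted power-law form $(c\,r + d)$ raised to the power $-1/(\alpha-1)$, with $c = (\alpha-1)/C$ and $d = T^{1-\alpha}$.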
Early modeling work
The characterization of power laws is a limiting statement
Early modeling work showed approaches that provide the correct form of the tail in the limit
Later work introduced the rate of convergence of a process to its limiting distribution
A model of Simon
Following Simon [1955], described in terms of word frequencies
Consider a book being written. Initially, the book contains a single word, “the.”
At time t, the book contains t words. The process of Simon generates the (t+1)st word based on the current book.
Constructing a book: snapshot at time t
When in the course of human events, it becomes necessary…
Current word frequencies:
Rank Word Count
1 “the” 1000
2 “of” 600
3 “from” 300
… “...” …
4,791 “necessary” 5
“...” “...”
11,325 “neccesary” 1
Let f(i,t) be the number of words of count i at time t
The Generative Model
Assumptions:
1. Constant probability that a neologism will be introduced at any timestep
2. Probability of re-using a word of count i is proportional to $i \cdot f(i,t)$, that is, number of occurrences of count i words.
Algorithm:
• With probability $\alpha$ a new word is introduced into the text
• With remaining probability, a word with count i is introduced with probability proportional to $i \cdot f(i,t)$
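The algorithm is short enough to simulate. In this sketch the new-word probability is called `alpha` (an assumed name), words are represented by integer ids, and reuse is implemented by copying a uniformly random earlier token, which selects each word with probability proportional to its current count.

```python
import random
from collections import Counter


def simon_book(steps, alpha, seed=0):
    """Simulate Simon's process and return a Counter of word -> count."""
    rng = random.Random(seed)
    book = [0]      # the book starts with a single word (id 0)
    next_id = 1
    for _ in range(steps):
        if rng.random() < alpha:
            book.append(next_id)            # neologism
            next_id += 1
        else:
            book.append(rng.choice(book))   # reuse, proportional to count
    return Counter(book)
```

Tabulating `Counter(simon_book(...).values())` gives f(i,t), the number of words of each count, which can then be inspected on a log-log plot.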
Constructing a book: snapshot at time t
Current word frequencies:
Rank Word Count
1 “the” 1000
2 “of” 600
3 “from” 300
… “...” …
4,791 “necessary” 5
“...” “...”
11,325 “neccesary” 1
Let f(i,t) be the number of words of count i at time t
$\Pr[\text{“the”}] = (1-\alpha) \cdot 1000 / K$
$\Pr[\text{“of”}] = (1-\alpha) \cdot 600 / K$
$\Pr[\text{some count-1 word}] = (1-\alpha) \cdot 1 \cdot f(1,t) / K$
$K = \sum_i i \cdot f(i,t)$
What’s going on?
[Figure: buckets labeled 1, 2, 3, 4, 5, 6, …; each ball is one unique word (which occurs 1 or more times)]
Each word in bucket i occurs i times in the current document
What’s going on?
[Figure: buckets labeled 1, 2, 3, 4, 5, 6]
With probability $\alpha$ a new word is introduced into the text
What’s going on?
[Figure: buckets labeled 1, 2, 3, 4, 5, 6]
With probability $1-\alpha$ an existing word is reused
How many times do words in this bucket occur?
What’s going on?
[Figure: buckets 2, 3, and 4]
Size of bucket 3 at time t+1 depends only on sizes of buckets 2 and 3 at time t
Must show: fraction of balls in 3rd bucket approaches some limiting value
Models for power laws in the web graph
- Retelling the Simon model: “preferential attachment”
• Barabasi et al
• Kumar et al
- Other models for the web graph:
• [Aiello, Chung, Lu], [Huberman et al]
Why create such a model?
- Evaluate algorithms and heuristics
- Get insight into page creation
- Estimate hard-to-sample parameters
- Help understand web structure
- Cost modeling for query optimization
- To find “surprises” means we must understand what is typical.
Random graph models
                G(n,p)    Web
indeg > 1000    0         100000
K23's           0         125000
4-cliques       0         many
Traditional random graphs [Bollobas 85] are not like the web!
Is there a better model?
Desiderata for a graph model
- Succinct description
- Insight into page creation
- No a priori set of "topics", but...
- ... topics should emerge naturally
- Reflect structural phenomena
- Dynamic page arrivals
- Should mirror web's "rich get richer" property, and manifest link correlation.
Page creation on the web
Some page creators will link to other sites without regard to existing topics, but
Most page creators will be drawn to pages covering existing topics they care about, and will link to pages within these topics
Model idea: new pages add links by "copying" them from existing pages
Generally, would require…
Separate processes for:
• Node creation
• Node deletion
• Edge creation
• Edge deletion
A specific model
- Nodes are created in a sequence of discrete time steps
• e.g. at each time step, a new node is created with $d$ ($\ge 1$) out-links
- Probabilistic copying
• links go to random nodes with probability $\alpha$
• copy d links from a random node with probability $1-\alpha$
Example
New node arrives
With probability $\alpha$, it links to a uniformly-chosen page
Example
With probability $(1-\alpha)$, it decides to copy a link.
To copy, it first chooses a page uniformly
Then chooses a uniform out-edge from that page
Then links to the destination of that edge ("copies" the edge)
Under copying, your rate of getting new inlinks is proportional to your in-degree.
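A minimal simulation of the copying step, under assumptions that go beyond the slides: each of the d out-links is decided independently, the probability of a uniform link is called `alpha`, and the process is seeded with a single node carrying d self-loops so that there is always an edge to copy.

```python
import random


def copying_model(n, d, alpha, seed=0):
    """Grow an n-node graph; return (out_links, in_degrees)."""
    rng = random.Random(seed)
    out = [[0] * d]  # seed graph: node 0 with d self-loops (an assumption)
    for new in range(1, n):
        links = []
        for _ in range(d):
            if rng.random() < alpha:
                # Link to a uniformly chosen existing page.
                links.append(rng.randrange(new))
            else:
                # Copy: uniform page, then the destination of a uniform out-edge.
                page = rng.randrange(new)
                links.append(rng.choice(out[page]))
        out.append(links)
    indeg = [0] * n
    for links in out:
        for v in links:
            indeg[v] += 1
    return out, indeg
```

Because a copied edge lands on a page with probability proportional to how many edges already point there, pages with high in-degree keep gaining links, which is the "rich get richer" effect.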
Degree sequences in this model
$\Pr[\text{page has } k \text{ inlinks}] \approx k^{-\frac{2-\alpha}{1-\alpha}}$
($\alpha = 1/11$ matches web)
- Heavy-tailed inverse polynomial degree sequences.
- Pages like netscape and yahoo exist.
- Many cores, cliques, and other dense subgraphs
Model extensions
- Component size distributions.
- More complex copying.
- Tighter lower tail bounds.
- More structure results.
A model of Mandelbrot
Key idea: Generate frequencies of English words to maximize information transferred per unit cost
Approach:
• Say word i occurs with probability p(i)
• Set the transmission cost of word i to be log(i)
• Average information per word: $-\sum_i p(i) \log p(i)$
• Cost of a word with probability p(j): log(j)
• Average cost per word: $\sum_j p(j) \log(j)$
• Choose probabilities p(i) to maximize information/cost
Result: $p(j) = c\, j^{-\alpha}$
Discussion of Mandelbrot’s model
Trade-offs between communication cost (log(p(j))) and information.
Are there other tradeoff-based models that drive similar properties?
Heuristically Optimized Trade-offs
- Goal: construction of trees (note: models to generate trees with power law behavior were first proposed in [Yule26])
- Idea: New nodes must trade off connecting to nearby nodes, and connecting to central nodes.
- Model:
• Points arrive uniformly within the unit square
• New point arrives, and computes two measures for candidate connection points j
  – d(j): distance from new node to existing node j (“nearness”)
  – h(j): distance from node j to root of tree (“centrality”)
• New destination chosen to minimize $\alpha\, d(j) + h(j)$
- Result: for a wide variety of values of $\alpha$, distribution of degrees has a power law
[Fabrikant, Koutsoupias, Papadimitriou 2002]
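The model can be simulated in a few lines. In this sketch the weight on nearness is called `alpha`, h(j) is taken to be the hop count from j to the root, and the first point is the root; these choices are assumptions made for illustration.

```python
import math
import random


def hot_tree(n, alpha, seed=0):
    """Grow a tree on n random points; return the parent of each node."""
    rng = random.Random(seed)
    pts = [(rng.random(), rng.random())]  # node 0 is the root
    hops = [0]                            # h(j): hops from j to the root
    parent = [None]
    for _ in range(1, n):
        p = (rng.random(), rng.random())
        # Attach to the existing node minimizing alpha * d(j) + h(j).
        best = min(range(len(pts)),
                   key=lambda j: alpha * math.dist(p, pts[j]) + hops[j])
        pts.append(p)
        hops.append(hops[best] + 1)
        parent.append(best)
    return parent
```

The two extremes show the trade-off: with `alpha = 0` only centrality matters and every node attaches to the root (a star), while a very large `alpha` makes each node attach to its nearest neighbor.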
Monkeys on Typewriters
Consider a creation model divorced from concerns of information and cost
Model:
• Monkey types randomly, hits space bar with probability q, character chosen uniformly with remaining probability
Result:
• Rank j word occurs with probability $q\, j^{\log(1-q) - 1} = c\, j^{-\alpha}$
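The result can be checked numerically. The sketch assumes an alphabet of `n` equally likely letters and takes the logarithm to base `n`; both are assumptions, since the slide does not state the base. A specific word of length L has probability $q\,((1-q)/n)^L$, and there are $n^L$ words of that length, so its rank is of order $n^L$.

```python
import math


def word_probability(length, q, n):
    """Exact probability of one specific word of `length` letters, then a space."""
    return q * ((1 - q) / n) ** length


def power_law_prediction(rank, q, n):
    """q * rank^(log_n(1 - q) - 1), with the logarithm taken base n."""
    return q * rank ** (math.log(1 - q, n) - 1)
```

At rank $n^L$ the two expressions agree, which is where the power-law form comes from: a geometric decay in word length becomes a polynomial decay in rank.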
Other Distributions
“Power law” means a clean characterization of a particular property on distribution upper tails
Often used to mean “heavy tailed,” meaning bounded away from an exponentially decaying distribution
There are other forms of heavy-tailed distributions
A commonly-occurring example: lognormal distribution
Quick characterization of lognormal distributions
- Let X be a normally-distributed random variable
- Let $Y = e^X$
- Then Y is lognormal (equivalently, ln Y is normally distributed)
- Properties:
• Often occur in situations of multiplicative growth
• Prop2
Concern: There is a growing sequence of papers dating back several decades questioning whether certain observed values are best described by power law or lognormal (or other) distributions.
One final direction…
- The Central Limit Theorem tells us how sums of independent random variables behave in the limit
- Example: $\ln X_j = \ln X_0 + \sum_{k=1}^{j} \ln F_k$
- $X_j$ well-approximated by a lognormal variable
- Thus, lognormal variables arise in situations of multiplicative growth
- Examples in biology, ecology, economics,…
- Example: [Huberman et al]: growth of web sites
- Similarly: the same result applies to the product of lognormal variables
- Each of these generative models is evolutionary
- What is the role of time?
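A small simulation illustrates the multiplicative-growth argument. The growth factors here are drawn uniformly from [0.5, 1.5], an arbitrary choice made for this sketch, and the function names are hypothetical.

```python
import math
import random
import statistics


def simulate_growth(steps, trials, seed=0):
    """Final sizes after `steps` rounds of X <- F * X, starting from X = 1."""
    rng = random.Random(seed)
    finals = []
    for _ in range(trials):
        x = 1.0
        for _ in range(steps):
            x *= rng.uniform(0.5, 1.5)  # independent growth factor F
        finals.append(x)
    return finals


def skewness(xs):
    """Sample skewness: near 0 for a symmetric distribution."""
    m = statistics.fmean(xs)
    s = statistics.pstdev(xs)
    return sum(((x - m) / s) ** 3 for x in xs) / len(xs)
```

The logarithms of the final sizes are sums of independent terms and look close to normal (skewness near 0), while the sizes themselves are strongly right-skewed, with the mean well above the median.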
product of lognormal variables Each of these generative models is evolutionary What is the role of time?