
Stochastic Modeling of Web Evolution

M. Vafopoulos
Joint work with S. Amarantidis, I. Antoniou

2010/06/09, SMTDA 2010, http://www.smtda.net/, Chania, Crete, Greece

Aristotle University, Department of Mathematics Master in Web Science

supported by Municipality of Veria

2

Contents
• What is the Web?
• What are the issues and problems?
• The Web as a Complex System
• Query-Web Models
• Stochastic Models and the Web

3

What is the Web?

Internet ≠ Web
Web: a system of interlinked hypertext documents (HTML) with unique addresses (URIs), accessed via the Internet (HTTP)

4

Web milestones

1989: Tim Berners-Lee presents the idea at CERN
1994: Dertouzos (MIT) and Metakides (EU) create W3C, appointing TBL as director

5

Why is the Web so successful?

It is based on an architecture (HTTP, URI, HTML) which is:

• simple, free or cheap, open source, extensible
• tolerant
• networked
• fun & powerful
• universal

6

Why is it so successful?
• New experience of exploring & editing huge amounts of information, people and abilities, anytime, from anywhere
• The biggest human system with no central authority and control, but with log data (Yotta* Bytes/sec)
• Has not yet revealed its full potential…

*Yotta = 10^24

We knew the Web was big...

• 1 trillion unique URIs (Google blog 7/25/2008)
• 2 billion users
• Google: 300 million searches/day
• US: 15 billion searches/month
• 72% of the Web population are active on at least 1 social network…

7

Source: blog.usaseopros.com/2009/04/15/google-searches-per-day-reaches-293-million-in-march-2009/

Web: the new continent

• Facebook: 400 million active users
– 50% of our active users log on to Facebook in any given day
– 35 million users update their status each day
– 60 million status updates posted each day
– 3 billion photos uploaded to the site each month
• Twitter: 75 million active users
– 141 employees
• YouTube: 350 million daily visitors
• Flickr: 35 million daily visitors

8

Web: the new continent

• Online advertising spending in the UK has overtaken television expenditure for the first time [4 billion Euros/year] (30/9/2009, BBC)
• In the US, spending on digital marketing will overtake that of print for the first time in 2010
• Amazon.com: 50 million daily visitors
– 60 billion dollars market capitalization
– 24,000 employees

9

10

Web generations (era / action / platform / basic value source):

• Pre Web, 1980s, calculate: the desktop is the platform; value source: computations [no network effect]
• Web 1.0, 1990s, read: Surfing Web, the browser is the platform; value source: hyper-linking of documents
• Web 2.0, 2000s, write: Social Web, the Web is the platform; value source: the social dimension of linkage properties
• Web 3.0, 2010s, discover: Semantic Web, the Graph is the platform; value source: URI-based semantic linkages
• Web 4.0, 2020s, execute: Metacomputing, the network is the platform, Web of things (embedded systems, RFID); value source: connection & production in a global computing system for everything
• Web 2w, combine all: almost everything is (or could be) a Web service; value source: new inter-creativity

Web: What are the issues and related problems?

• Safe surfing (navigating)
• Find relevant and credible information (example: research)
• Create successful e-business
• Reduce tax evasion
• Enable local economic development
• Communicate with friends, colleagues, customers, citizens, voters, …

11

12

Need to study the Web

The Web is the largest human information construct in history. The Web is transforming society…

It is time to study it systematically as a stand-alone socio-technical artifact.

How to study the Web?

• Analyze the interplay among the:
– Structure
– Function
– Evolution

of the Web as a highly inter-connected large complex system

13

14

Web Modeling

• understand,
• measure, and
• model its evolution

in order to optimize its social benefit through effective policies

What is the Structure of the Web?
The Web as a Graph:
• Nodes: the websites (URIs), more than 1 trillion
• Links: the hyperlinks, 5 links per page (average)
• Weights: link assessment

The WWW graph is a Directed Evolving Graph

[Figure: a small weighted directed graph, four nodes with edge weights 1.2, 2.1, 0.2 and 0.5]

Statistical Analysis of Graphs:The degree distribution

P(k) = P(d ≤ k) is the distribution function of the random variable d that counts the degree of a randomly chosen node.
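A minimal sketch of how this distribution function can be computed empirically from a directed edge list; the tiny graph and the choice of in-degree (natural for the Web graph) are illustrative assumptions:

```python
from collections import Counter

# Empirical distribution function P(k) = P(d <= k) for node in-degrees
# of a directed graph given as an edge list (a toy example, as an assumption).
edges = [(1, 2), (1, 3), (2, 3), (3, 1), (4, 3)]

in_deg = Counter(dst for _, dst in edges)          # in-degree per node
nodes = {u for e in edges for u in e}
degrees = [in_deg.get(v, 0) for v in nodes]

n = len(degrees)
P = {k: sum(d <= k for d in degrees) / n for k in range(max(degrees) + 1)}
print(P)  # {0: 0.25, 1: 0.75, 2: 0.75, 3: 1.0}
```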

Statistical Analysis of the Web Graph: four major findings
• power-law degree distribution (self-similarity)
– internet traffic: Faloutsos
– Web links: Barabási
• small-world property (the diameter is much smaller than the order of the graph): easy communication
• many dense bipartite subgraphs
• on-line property (the number of nodes and edges changes with time)

Distribution of links on the World-Wide Web, P(k) ∼ k^(−γ) power law. a, Outgoing links (URLs found on an HTML document); b, Incoming links; c, Average of the shortest path between two documents as a function of system size [Barabási et al. 1999]

Small World Property: Social Communication Networks, Watts-Strogatz (1998)

Short average path lengths and high clustering.
WWW average distance (shortest path) between 2 documents:
⟨ℓ⟩ = 0.35 + 2.06 log₁₀(n)
⟨ℓ⟩ = 18.6, n = 8 × 10^8 (1999)
⟨ℓ⟩ = 18.9, n = 10^9 (2009)
Two randomly chosen documents on the Web are on average 19 clicks away from each other (Small World).
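A two-line check of these numbers; the formula and the crawl sizes are taken from the slide:

```python
import math

# The slide's small-world fit: <l> = 0.35 + 2.06 * log10(n)
def avg_distance(n):
    return 0.35 + 2.06 * math.log10(n)

print(round(avg_distance(8e8), 2))  # 18.69, close to the quoted 18.6 (1999)
print(round(avg_distance(1e9), 2))  # 18.89, i.e. the quoted 18.9 (2009)
```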

Web dynamics

• Search (PageRank, HITS, Markov matrices)
• Traffic
• Evolution
– graph generators
– games
– mechanism design (auctions)
– Queries-Search Engine-Web

20

Search: The Hyperlink Matrix
The PageRank vector π is an eigenvector of the hyperlink Markov matrix M for the eigenvalue 1.

π is a stationary distribution: Mπ = π

π = (π(κ)), where π(κ) is the PageRank of web page κ

dim M = the number of web pages that can be crawled by search engines.

Basis of Google’s Algorithm

• If the Markov matrix M is ergodic, the stationary distribution π is unique.
• If the Markov matrix M is mixing, then π is calculated as the limit, for every initial probability distribution ρ:

π = lim (n→∞) Mⁿρ

• The 2nd eigenvalue of M estimates the speed of convergence.
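A minimal power-iteration sketch of this limit. The 4-page column-stochastic matrix is an illustrative assumption; the matrix actually used by Google also adds a damping/teleportation term to guarantee the mixing property:

```python
import numpy as np

# Power iteration: pi = lim M^n rho for a mixing Markov matrix M.
# Columns of M sum to 1; M[i, j] is the probability of moving from
# page j to page i. The 4-page matrix is an illustrative assumption.
M = np.array([[0.0, 0.5, 0.0, 1.0],
              [0.5, 0.0, 0.0, 0.0],
              [0.5, 0.5, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.0]])

rho = np.full(4, 0.25)            # any initial probability distribution
for _ in range(200):
    rho = M @ rho                 # converges at a rate set by |lambda_2|

print(rho)                        # the PageRank vector pi
print(np.allclose(M @ rho, rho))  # stationarity M pi = pi -> True
```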

Internet Traffic
Prigogine and Herman 1971: a stochastic model of vehicular traffic dynamics based on statistical physics, between the macroscopic "fluid dynamics" model and the individual-vehicle model (1st-order SDE)

f₀ is the "desired" velocity distribution function
x and v are the position and velocity of the "vehicles"
⟨v⟩ is the average velocity
c is the concentration of the "vehicles"
P is the probability of "passing", in the sense of increase of flow
T is the relaxation time.

∂f(x,v,t)/∂t + v ∂f(x,v,t)/∂x = [f₀(x,v,t) − f(x,v,t)]/T + c (1 − P) (⟨v⟩ − v) f(x,v,t)
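A hedged numerical sketch of the spatially homogeneous case (∂f/∂x = 0) of this equation, integrated by explicit Euler on a velocity grid; the grid, the parameter values and both distributions are illustrative assumptions:

```python
import numpy as np

# Spatially homogeneous Prigogine-Herman relaxation, explicit Euler:
# df/dt = (f0 - f)/T + c*(1 - P)*(<v> - v)*f
v = np.linspace(0.0, 2.0, 101)            # velocity grid (assumption)
dv = v[1] - v[0]

f0 = np.exp(-((v - 1.2) ** 2) / 0.05)     # "desired" velocity distribution
f0 /= f0.sum() * dv
f = np.exp(-((v - 0.5) ** 2) / 0.10)      # initial distribution (assumption)
f /= f.sum() * dv

T, c, P, dt = 1.0, 0.5, 0.3, 0.001        # relaxation time, concentration, ...
for _ in range(5000):
    v_mean = (v * f).sum() / f.sum()      # <v>, the average velocity
    f += dt * ((f0 - f) / T + c * (1 - P) * (v_mean - v) * f)

# The mean drifts from the initial ~0.5 toward the desired ~1.2,
# retarded by the interaction term.
print("final mean velocity:", (v * f).sum() / f.sum())
```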

Adaptation of the Prigogine-Herman Model for Internet Traffic [Antoniou, Ivanov 2002, 2003]

• Vehicles = the Information Packages
• Statistics of Information Packages: Log-Normal Distribution

The Origin of Power Law in Network Structure and Network Traffic

Kolmogorov 1941, The local structure of turbulence in incompressible viscous fluid for very large Reynolds numbers, Dokl. Akad. Nauk SSSR 30, 301: the origin of self-similar stochastic processes. Model of homogeneous fragmentation: applying a variant of the central limit theorem, Kolmogorov found that the logarithms of the grain sizes are normally distributed,

before Fractals and modern scale-free models.

Wavelet Analysis of data [Antoniou, Ivanov 2002]
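A short simulation of the fragmentation argument: each grain's size is a product of independent breakage ratios, so its logarithm is a sum of i.i.d. terms and the central limit theorem makes it approximately normal. The sample sizes and the Uniform(0.1, 0.9) ratios are assumptions:

```python
import numpy as np

# Homogeneous fragmentation sketch: after 30 breakage rounds a grain's
# size is s = r1*r2*...*r30, so log s is a sum of i.i.d. terms and is
# approximately normal (CLT) -> grain sizes are log-normal.
rng = np.random.default_rng(42)
log_sizes = np.log(rng.uniform(0.1, 0.9, size=(100_000, 30))).sum(axis=1)

mean, std = log_sizes.mean(), log_sizes.std()
skew = ((log_sizes - mean) ** 3).mean() / std ** 3
print(f"log-size mean={mean:.2f}, std={std:.2f}, skewness={skew:.3f}")  # skew ~ 0
```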

Evolution: Graph Generators

• Erdős-Rényi (ER) model [Erdős, Rényi '60]
• Small-world model [Watts, Strogatz '98]
• Preferential Attachment [Barabási, Albert '99] (sketch below)
• Edge Copying models [Kumar et al. '99], [Kleinberg et al. '99]
• Forest Fire model [Leskovec, Faloutsos '05]
• Kronecker graphs [Leskovec, Chakrabarti, Kleinberg, Faloutsos '07]
• Optimization-based models [Carlson, Doyle '00], [Fabrikant et al. '02]
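As referenced above, a minimal preferential-attachment sketch; attaching a single edge per new node is a simplifying assumption:

```python
import random
from collections import Counter

# Preferential attachment [Barabasi, Albert '99]: each new node links to
# an existing node chosen with probability proportional to its degree.
# The repeated-endpoints list does the degree weighting implicitly.
def preferential_attachment(n_nodes, seed=0):
    random.seed(seed)
    edges = [(0, 1)]
    endpoints = [0, 1]                     # a node appears once per degree unit
    for new in range(2, n_nodes):
        target = random.choice(endpoints)  # degree-proportional choice
        edges.append((new, target))
        endpoints += [new, target]
    return edges

deg = Counter(u for e in preferential_attachment(10_000) for u in e)
print(deg.most_common(3))  # a few heavily linked hubs: the power-law tail
```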

Evolution: Game theoretic models

• Stageman (2004) Information Goods and Advertising: An Economic Model of the Internet

• Zsolt Katona and Miklos Sarvary (2007) Network Formation and the Structure of the Commercial World Wide Web

• Kumar (2009), Why do Consumers Contribute to Connected Goods

27

28

Evolution: Queries-Search Engine-Web
Kouroupas, Koutsoupias, Papadimitriou, Sideri (KKPS) 2005

An economics-inspired model (utility) that explains scale-free behavior.

In the Web three types of entities exist:
• Documents, i.e. web pages, created by authors [n]
• Users [m]
• Topics [k]
• k ≤ m ≤ n

29

the KKPS model

• The Search Engine recommends Documents to the Users
• A User obtains satisfaction (Utility) after being presented with some Documents by a Search Engine
• Users choose and endorse those that have the highest Utility for them, and then
• Search Engines make better recommendations based on these endorsements

30

Documents

• For each topic t ≤ k there is a Document vector Dₜ of length n (relevance of Document d for Topic t)
• For Dₜ the value 0 is very probable, so that about k − 1 of every k entries are 0

31

User-Query

• There are Users that can be thought of as simple Queries asked by individuals.
• For each topic t there is a User vector Rₜ of length m (relevance of User-Query i for Topic t)
• with about m/k non-zero entries

32

User-Query

• the number of Documents proposed by the Search Engine is fixed and denoted by α

• the number of endorsements per User-Query is also fixed and denoted by b

• b ≤ α ≤ n

the algorithm

Step 1: A User-Query, for a specific Topic, is entered in the Search Engine

Step 2: The Search Engine recommends α relevant Documents. The listing order is defined by a rule; in the very first operation of the Search Engine, the rule is random listing according to some probability distribution.

Step 3: Among the α recommended Documents, b are endorsed on the basis of highest Utility. In this way, the bipartite graph S = ([m], [n], L) of Document endorsements is formed. Compute the in-degree of the Documents from the endorsements.

33

the algorithm

Step 4: Repeat Step 1 for another Topic.
Step 5: Repeat Step 2. The rule for Document listing is decreasing in-degree for the specific User-Query, as computed in Step 3.
Step 6: Repeat Step 3.
Step 7: Repeat Steps 4, 5, 6 for the number of iterations necessary for statistical convergence ("that is, until very few changes are observed in the www state"). A simulation sketch follows.
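A runnable sketch of Steps 1-7. The linear Utility U = RDᵀ, the exact sparsity scheme, and the use of each query's dominant Topic are assumptions filled in from the slides' description, not the verbatim KKPS specification:

```python
import numpy as np

# A minimal sketch of the KKPS Queries-Search Engine-Web loop.
rng = np.random.default_rng(0)
m, n, k = 200, 1000, 10       # Users-Queries, Documents, Topics (k<=m<=n)
a, b = 10, 4                  # recommended / endorsed per query (b <= a)

D = rng.random((n, k)) * (rng.random((n, k)) < 1 / k)  # ~(k-1)/k zero entries
R = rng.random((m, k)) * (rng.random((m, k)) < 1 / k)  # sparse query relevance
U = R @ D.T                   # assumed Utility of Document j for User-Query i

score = rng.random(n)                      # Step 2: initial random listing rule
ideal = np.sort(U, axis=1)[:, -b:].sum()   # best achievable total utility
for it in range(7):                        # Step 7: iterate toward convergence
    indeg = np.zeros(n)
    total = 0.0
    for i in range(m):                     # Step 1: a User-Query for a Topic
        t = int(np.argmax(R[i]))                   # the query's dominant topic
        cand = np.flatnonzero(D[:, t])             # Documents relevant to t
        rec = cand[np.argsort(-score[cand])[:a]]   # Step 2: recommend a docs
        end = rec[np.argsort(-U[i, rec])[:b]]      # Step 3: endorse top b
        indeg[end] += 1                            # endorsement graph in-degree
        total += U[i, end].sum()
    score = indeg                          # Step 5: rank by decreasing in-degree
    print(f"iteration {it}: efficiency = {total / ideal:.3f}")
```

The printed efficiency can be compared with the behavior reported in the results slides below.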

34

35

utility
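The formula on this slide did not survive extraction. A later discussion slide describes the Utility as a time-invariant linear function of R and D, so a plausible reconstruction (an assumption, not the verbatim KKPS definition) is the inner product over Topics:

U_ij = Σₜ R_it D_jt, i.e. U = R Dᵀ (the Utility of Document j for User-Query i)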

36

results of statistical experimentsKKPS

• for a wide range of values of the parameters m, n, k, α, b, the in-degree of the Documents is power-law distributed
• the price of anarchy (efficiency of the algorithm) improves radically during the first 2-3 iterations; later the improvement proceeds at a slower rate

37

results of statistical experimentsKKPS

• When the number of Topics k increases, the efficiency of the algorithm increases
• When α (the number of Documents recommended by the Search Engine) increases, the efficiency of the algorithm also increases
• Increasing b (the number of endorsed Documents per User) causes the efficiency of the algorithm to decrease

38

results of statistical experimentsAmarantidis-Antoniou-Vafopoulos

We extend the investigation in two directions:

• for Uniform, Poisson and Normal initial random distributions of Document in-degree (Step 2), and
• for different values of α, b and k

39

results of statistical experimentsAmarantidis-Antoniou-Vafopoulos

In the case α = b, the validity of the power law becomes less significant as b increases.

(b: the number of endorsed Documents per User; α: the number of Documents recommended by the Search Engine)

40

[Figure: power-law exponent vs. α = b (from 2 to 20) for Uniform, Poisson and Normal initial distributions; the exponent decreases from about 2.2 towards 1.5]
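For reference, exponents like those in the figure can be estimated by a least-squares fit on the log-log in-degree histogram; a minimal sketch, with a synthetic Zipf sample standing in for the simulated Document in-degrees:

```python
import numpy as np

# Estimate a power-law exponent gamma from in-degree data:
# fit log-frequency vs log-degree by least squares.
rng = np.random.default_rng(1)
indegree = rng.zipf(a=2.0, size=50_000)   # synthetic stand-in data

ks, counts = np.unique(indegree, return_counts=True)
keep = counts > 5                          # drop the noisy sparse tail
x = np.log(ks[keep])
y = np.log(counts[keep] / counts.sum())
slope, _ = np.polyfit(x, y, 1)
print(f"estimated exponent gamma = {-slope:.2f}")  # ~2 for this sample
```

Least-squares on the log-log histogram is the simplest estimator; maximum-likelihood fits are generally preferred in careful work.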

results of statistical experimentsAmarantidis-Antoniou-Vafopoulos

An increase in the number of Topics k results in a faster decay of the power-law exponent.

41

[Figure: power-law exponent vs. α = b (from 2 to 20) for Uniform, Poisson and Normal initial distributions at a larger number of Topics; the exponent decays from about 2.2 towards 1]

Power law for the case b ≤ α

efficiency of the search algorithm

In the case α = b, the efficiency of the search algorithm increases when the number of Topics k increases

[confirmation of KKPS results]

[Figure: efficiency vs. iterations (1 to 6) for k = 30, 60, 90, 120; efficiency rises from about 0.47 to 0.53]

efficiency of the search algorithm

[Figure: efficiency vs. iterations (1 to 7) for α = 2, 4, 6, 8, 10; efficiency between about 0.45 and 0.6]

In the case b ≤ α, the efficiency of the search algorithm increases when α, the number of Documents recommended by the Search Engine, increases

[confirmation of KKPS results]

efficiency of the search algorithm

b ≤ α: the efficiency of the search algorithm increases when b, the number of endorsed Documents per User-Query, increases

[KKPS results not confirmed]

[Figure: efficiency vs. iterations (1 to 7) for b = 2, 4, 6, 8; efficiency between about 0.49 and 0.64]

46

Discussion of statistical experimentsAmarantidis-Antoniou-Vafopoulos

• α = b: all recommended Documents are endorsed according to the highest in-degree criterion
• Utility is useful only in terms of establishing compatibility between the Utility matrix and the Users-Queries and Documents bipartite graph

47

Discussion of statistical experimentsAmarantidis-Antoniou-Vafopoulos

• For the origin of the power-law distribution of the in-degree of Documents, two mechanisms are identified in the KKPS model:
– Users-Queries endorse only a small fraction (b) of the Documents presented
– assuming a small fraction of poly-topic Documents, the algorithm creates a high number of endorsements for them
• The above mechanisms are not exhaustive for the real Web graph.

Indexing algorithms, crawler design, and Document structure and evolution should be examined as possible additional mechanisms contributing to the manifestation of the power-law distribution.

48

Discussion on the Endorsement Mechanism

“The endorsement mechanism does not need to be specified, as soon as it is observable by the Search Engine. For example, endorsing a Document may entail clicking it, or pointing a hyperlink to it.”

This KKPS hypothesis does not take into account the fundamental difference between clicking a link (browsing) and creating a hyperlink.

49

discussion

Web traffic is observable by the website owner or administrator through the corresponding log file, and by authorized third parties (like search engine cookies, which can trace clicking behavior) or malicious software.

50

discussion

On the contrary, creating a hyperlink results in a more "permanent" link between two Documents, which is observable by all Users-Queries and Search Engines.

Therefore, the KKPS algorithm actually examines Web traffic, and not the hyperlink structure of Documents, which is the basis of the Search Engine's in-degree algorithm.

51

discussion

Web traffic, as well as Web content editing, is not taken into account in the in-degree-based algorithms of Search Engines (e.g. PageRank).

These algorithms were built for Web 1.0, where Web content updates and traffic monetization were not so significant.

52

discussion

In the present Web 2.0 era of rapid change, the Web graph, content and traffic should all be taken into account in efficient search algorithms.

Therefore, birth-death processes for Documents and links, together with Web traffic, should be introduced into Web models, combined with content update (Web 2.0) and semantic markup (Web 3.0) for Documents.

53

discussion

The discrimination between Users and Queries could facilitate extensions of the KKPS model:

• teleportation (a direct visit to a Document which bypasses Search Engines)
• different types of Users, and
• relevance feedback between Documents and Queries

54

discussion

• KKPS: Utility is defined to be a time-invariant linear function of R and D which, by construction, does not affect the www state when α = b
• This does not take into account the dynamic interdependence of the Utility on the www state; in reality, the evolution of the www state will change both R and D
• A future extension of the KKPS model should account for user behavior by incorporating Web browsing and editing preferences

55

discussion

It would be useful to offer deeper insight into the Web's economic aspects within the KKPS model:

• valuation mechanisms for Web traffic and link structures, and
• monetizing the search procedure (sponsored search; digital, excludable, anti-rival goods, etc.)

Stochastic Models and the Web

• Webmetrics: statistical models of the Web's function, structure & evolution, used to evaluate individual, business and public policies

56

57

Master in web science

The programme is based on Web assessment, mathematical modeling and operation, combined with business applications and societal transformations in the knowledge society. Apart from academic, research and training careers, Web Science studies offer remarkable opportunities in business.

Michalis Vafopoulos

58

Master in web science

Michalis Vafopoulos

Winter semester: Web Science; Web Technologies; Networks and Discrete Mathematics; Information Processing and Networks

Spring semester: Economics and Business in the Web; Knowledge Processing in the Web; Statistical Analysis of Networks; Mathematical Modeling of the Web