
Models and Structure of The Web Graph

Stefano Leonardi

Università di Roma “La Sapienza”

thanks to A. Broder, E. Koutsoupias, P. Raghavan, D. Donato, G. Caldarelli, L. Salete Buriol

Complex Networks

• Self-organized and operated by a multitude of distributed agents without a central plan

• Dynamically evolving in time

• Connections established on the basis of local knowledge

• Nevertheless efficient and robust, i.e.
  – every other vertex can be reached in a few hops
  – resilient to random faults

• Examples:
  – The physical Internet
  – The Autonomous System graph
  – The Web Graph
  – The graphs of phone calls
  – Food webs
  – E-mail
  – etc.

Outline of the Lecture

• Small World Phenomena
  – The Milgram Experiment
  – Models of the Small World

• The Web Graph
  – Power laws
  – The Bow Tie Structure
  – Topological properties

• Representation and Compression of the Web Graph

• Graph Models for the Web
  – Preferential Attachment
  – Copying model
  – Multi-layer

• The Internet Graph
  – Geography matters

• Experiments

• Algorithms

• Mathematical Models

Small World Phenomena

Properties of Small World Networks

• Surprising properties of Small World:

– The graph of acquaintances has small diameter

– There is a simple local algorithm that can route a message in a few steps

– The number of edges is small, linear in the number of vertices

– The network is resilient to edge faults

Small World Phenomena

• A small world is a network with
  – a large fraction of short chains of acquaintances
  – a small fraction of 'shortcuts' linking clusters with one another: a superposition of structured clusters
  – short distances, with no high-degree nodes needed

• There are lots of small worlds:
  – spread of diseases
  – electric power grids
  – phone calls at a given time
  – etc.

The model of Watts and Strogatz [1998]

1. Take a ring of n vertices with every vertex connected to the next k=2 nodes

2. Randomly rewire every edge to a random destination with probability p

3. The resulting graph has, with high probability, diameter O(log n)

   diameter = max_{u,v} distance(u,v)
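A minimal Python sketch of this construction (the parameter values and the sampled-diameter check below are illustrative choices, not from the slides); a few rewired shortcuts are enough to collapse the diameter:

import random
from collections import deque

def watts_strogatz(n, k=2, p=0.1, seed=0):
    """Ring of n vertices, each linked to its next k neighbours; every edge is
    then rewired to a random destination with probability p (undirected)."""
    rng = random.Random(seed)
    adj = {u: set() for u in range(n)}
    for u in range(n):
        for j in range(1, k + 1):
            v = (u + j) % n
            if rng.random() < p:              # rewire this edge with probability p
                v = rng.randrange(n)
                while v == u:
                    v = rng.randrange(n)
            adj[u].add(v)
            adj[v].add(u)
    return adj

def eccentricity(adj, source):
    """Largest BFS distance from `source` within its connected component."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return max(dist.values())

if __name__ == "__main__":
    n = 400
    for p in (0.0, 0.1):
        g = watts_strogatz(n, p=p)
        # sampled diameter: max eccentricity over a handful of start vertices
        print(p, max(eccentricity(g, s) for s in range(0, n, 40)))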

The model of Jon Kleinberg [2000]

• Consider a 2-dimensional grid
• For each node u add an edge (u,v) to a vertex v selected with probability proportional to d(u,v)^{-r}
  – If r = 0, v is selected uniformly at random, as in WS
  – If r = 2, the probability that v lies at distance between x and 2x is about the same for every scale x

Routing in the Small World

• Define a local routing algorithm that knows:
  – its position in the grid
  – the position in the grid of the destination
  – the set of neighbours, short range and long range
  – the neighbours of all the vertices that have seen the message

• If r = 2, the expected delivery time is O(log² n)
• If r ≠ 2, the expected delivery time is Ω(n^ε), where ε depends on r
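The greedy routing rule can be sketched as follows (a minimal Python sketch; the 30x30 grid size and the random seed are arbitrary illustration choices). Each node forwards the message to whichever of its known neighbours, grid or long-range, is closest to the target:

import random

GRID = 30                      # hypothetical grid side, for illustration

def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def long_range_contacts(r=2.0, seed=0):
    """Give every grid node one long-range contact v, chosen with probability
    proportional to d(u, v)**(-r), as in Kleinberg's model."""
    rng = random.Random(seed)
    nodes = [(i, j) for i in range(GRID) for j in range(GRID)]
    contacts = {}
    for u in nodes:
        weights = [0.0 if v == u else manhattan(u, v) ** (-r) for v in nodes]
        contacts[u] = rng.choices(nodes, weights=weights, k=1)[0]
    return contacts

def greedy_route(contacts, s, t):
    """Local routing: always forward to the known neighbour closest to t."""
    hops, u = 0, s
    while u != t:
        i, j = u
        candidates = [(i + di, j + dj)
                      for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1))
                      if 0 <= i + di < GRID and 0 <= j + dj < GRID]
        candidates.append(contacts[u])
        u = min(candidates, key=lambda v: manhattan(v, t))
        hops += 1
    return hops

contacts = long_range_contacts(r=2.0)
print(greedy_route(contacts, (0, 0), (GRID - 1, GRID - 1)))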


The Web Graph

Web graph

•Notation: G = (V, E) is given by

– a set of vertices (nodes) denoted V

– a set of edges (links) = pairs of nodes denoted E

•The page graph (directed)

  – V = static web pages (4.2 B, the number of pages indexed by Google on March 5)
  – E = static hyperlinks (30 B?)

Why is it interesting to study the Web Graph?

•It is the largest artifact ever conceived by humankind

•Exploit the structure of the Web for
  – Crawl strategies
  – Search
  – Spam detection
  – Discovering communities on the web
  – Classification/organization

•Predict the evolution of the Web
  – Mathematical models
  – Sociological understanding

Many other web/internet related graphs

• Physical network graph
  – V = routers
  – E = communication links

• The host graph (directed)
  – V = hosts
  – E = (A,B) if there is an edge from a page on host A to a page on host B

• The "cosine" graph (undirected, weighted)
  – V = static web pages
  – E = cosine distance among the term vectors associated with the pages

• Co-citation graph (undirected, weighted)
  – V = static web pages
  – E = (x,y), weighted by the number of pages that refer to both x and y

• Communication graphs (which hosts talk to which hosts at a given time)

• Routing graph (how packets move)

• Etc.

Observing Web Graph

• It is a huge, ever-expanding graph
• We do not know which percentage of it we know
• The only way to discover the graph structure of the web as hypertext is via large scale crawls

• Warning: the picture might be distorted by
  – size limitations of the crawl
  – crawling rules
  – perturbations of the "natural" process of birth and death of nodes and links

Naïve solution

•Keep crawling; when you stop seeing new pages, stop

•Extremely simple but wrong: crawling is complicated because the web is complicated
  – spamming
  – duplicates
  – mirrors

•First example of a complication: soft 404
  – When a page does not exist, the server is supposed to return the error code "404"
  – Many servers do not return an error code, but keep the visitor on the site, or simply send him to the home page

The Static Public Web

•Static
  – not the result of a cgi-bin script
  – no "?" in the URL
  – doesn't change very often
  – etc.

•Public
  – no password required
  – no robots.txt exclusion
  – no "noindex" meta tag
  – etc.

Static vs. dynamic pages

•"Legitimate static" pages built on the fly
  – "includes", headers, navigation bars, decompressed text, etc.

•"Dynamic" pages that appear static
  – browseable catalogs (hierarchy built from a DB)

•Huge amounts of catalog & search-results pages
  – Shall we count all the Amazon pages?

•Very low reliability servers
•Seldom connected servers
•Spider traps -- infinite URL descent
  – www.x.com/home/home/home/…./home/home.html

•Spammer games

In some sense, the “static” web is infinite …

The static web is whatever pages can be found in at least one major search engine.

Large scale crawls

•[KRRT99] Alexa crawl, 1997, 200M pages
  – Trawling the Web for emerging cyber-communities, R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins

•[BMRRSTW 2000] AltaVista crawls, 500M pages
  – Graph structure in the web, A. Broder, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, and J. Wiener

•[LLMMS03] WebBase Stanford project, 2001, 400M pages
  – Algorithms and Experiments for the Webgraph, L. Laura, S. Leonardi, S. Millozzi, U. Meyer, J. Sibeyn, et al.

Graph properties of the Web

– Global structure -- how does the web look from far away?

– Connectivity -- how many connections?

– Connected components -- how large?

– Reachability -- can one go from here to there? How many hops?

– Dense subgraphs -- are there good clusters?

Basic definitions

•In-degree = number of incoming links – www.yahoo.com has in-degree ~700,000

•Out-degree = number of outgoing links– www.altavista.com has out-degree 198

•Distribution = #nodes with given in/out-degree

Power laws

•Inverse polynomial tail (for large k)
  – Pr(X = k) ~ 1/k^α
  – The graph of such a function is a line on a log-log scale: log Pr(X = k) = -α log k + const
  – Very large values are possible & probable

•Exponential tail (for large k)
  – Pr(X = k) ~ e^{-k}
  – The graph of such a function is a line on a log scale: log Pr(X = k) = -k + const
  – No very large values

Power laws: Examples

– Inverse poly tail: the distribution of wealth, popularity of books, etc.

– Inverse exp tail: the distribution of age, etc.

– Internet network topology [Faloutsos et al. 99]

Power laws on the web

– Web page access statistics [Glassman 97, Huberman et al. 98, Adamic & Huberman 99]

– Client-side access patterns [Barford et al.]

– Site popularity

– Statistical physics: signature of a self-similar/fractal structure

The In-degree distribution

[Figures: in-degree distributions from the Altavista crawl, 1999 (Broder et al.) and the WebBase crawl, 2001 (Laura et al., 2003)]

The in-degree follows a power law distribution:

  Pr[in-degree(u) = k] ∝ 1/k^{2.1}

Graph structure of the Web

[Figures: out-degree distributions, Altavista 1999 and WebBase 2001]

Does the out-degree follow a power law distribution? Pages with a large number of outlinks are less frequent.

What is good about 2.1

– Expected number of nodes with in-degree k: ~ n / k^{2.1}

– Expected contribution to the total number of edges: ~ (n / k^{2.1}) · k = ~ n / k^{1.1}

– Summing over all k, the expected number of edges is ~ n

– This is good news: our computational power is such that we can deal with work linear in the size of the web, but not much more than that.

– The average number of incoming links is 7!!
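A throwaway numeric check of the "summing over all k" step (a sketch: since the series converges, the number of edges grows only linearly with n):

# If the number of nodes with in-degree k is ~ n / k**2.1, the expected number
# of edges is ~ n * sum_k 1/k**1.1; the series converges, so edges ~ c * n.
N = 1_000_000
partial = sum(k ** -1.1 for k in range(1, N + 1))
tail = 10 * N ** -0.1        # integral bound on what the truncation leaves out
print(f"partial sum up to {N}: {partial:.2f} (remaining tail < {tail:.2f})")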


More definitions

•Weakly connected components (WCC)
  – a set of nodes such that from any node one can reach any other node via an undirected path.

•Strongly connected components (SCC)
  – a set of nodes such that from any node one can reach any other node via a directed path.

The Bow Tie [Broder et al., 2000]

[Figure: the Bow Tie, with example pages such as www.ibm.com, …/~newbie/, /…/…/leaf.htm, and regions CORE, IN, OUT, TENDRILS, TUBES, DISC]

Experiments (1)

WebBase ‘01

Region      Size (number of nodes)
SCC         44,713,193
IN          14,441,156
OUT         53,330,475
TENDRILS    17,108,859
DISC.        6,172,183

  Pr[size(SCC) = s] ∝ 1/s^α

[Figure: Altavista '99]

Experiments (2): SCC size distribution, region by region

[Figure] Power law, exponent 2.07

Experiments (3): In-degree distribution, region by region

[Figure] Power law, exponent 2.1

Experiments (4): In-degree distribution in the SCC graph

[Figure] Power law, exponent 2.1

Experiments (5): WCC size distribution inside IN

[Figure] Power law, exponent 1.8

What did we learn from all these power laws?

• The second largest SCC has fewer than 10,000 nodes!

• IN and OUT have millions of access points to the CORE and thousands of relatively large weakly connected components

• This may help in designing better crawling strategies, at least for IN and OUT: the load can be split between the robots without much overlap.

• While a power law with exponent 2.1 is a universal feature of the Web, there is no fractal structure: IN and OUT do not show any bow-tie phenomenon of their own

Algorithmic issues

•Apply standard linear-time algorithms for WCC and SCC
  – Hard to do if you can't store the entire graph in memory!!
  – WCC is easy if you can store V (semi-external algorithm)
  – No one knows how to do DFS in semi-external memory, so SCC is hard??
  – It might be possible to compute approximate SCCs, based on the low diameter.

•Random sampling for connectivity information
  – Find all nodes reachable from one given node (breadth-first search)
  – BFS is also hard; it is simpler on low-diameter graphs

Find the CORE

• Iterate the following process:

– Pick a random vertex v

– Compute all nodes reachable from v: O(v)

– Compute all nodes that reach v: I(v)

– Compute SCC(v):= I(v) ∩ O(v)

– Check whether it is the largest SCC

If the CORE is about ¼ of the vertices, after 20 iterations the probability of not finding the CORE is < 1%.
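A minimal sketch of this sampling procedure (Python; `succ` and `pred` are assumed forward/backward adjacency maps, and the tiny graph at the end is a hypothetical example):

import random
from collections import deque

def reachable(adj, source):
    """BFS over the adjacency map `adj`; returns the set of visited vertices."""
    seen = {source}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen

def sample_core(succ, pred, trials=20, seed=0):
    """Repeat the process above: pick a random v, compute O(v) and I(v) by a
    forward and a backward BFS, intersect them to get SCC(v), keep the largest."""
    rng = random.Random(seed)
    nodes = list(succ)
    best = set()
    for _ in range(trials):
        v = rng.choice(nodes)
        scc_v = reachable(succ, v) & reachable(pred, v)
        if len(scc_v) > len(best):
            best = scc_v
    return best

# Tiny hypothetical example: edges 1->2, 2->3, 3->1, 3->4, 5->1.
succ = {1: [2], 2: [3], 3: [1, 4], 4: [], 5: [1]}
pred = {1: [3, 5], 2: [1], 3: [2], 4: [3], 5: []}
print(sorted(sample_core(succ, pred)))     # -> [1, 2, 3]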

Find OUT

• Forward BFS from the SCC: the visited vertices are SCC ∪ OUT

Find IN

• Backward BFS from the SCC: the visited vertices are SCC ∪ IN

Find TENDRILS and TUBES

• Forward BFS from IN gives TENDRILS_IN (the new vertices reached outside IN, SCC and OUT)
• Backward BFS from OUT gives TENDRILS_OUT

  TENDRILS_IN ∪ TENDRILS_OUT = TENDRILS,  TENDRILS_IN ∩ TENDRILS_OUT = TUBES

Find DISC

• DISCONNECTED: what is left.
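Once the CORE is known, the remaining regions can be obtained with a few multi-source BFS passes. The following is an illustrative sketch, not the authors' implementation; `succ`/`pred` are assumed adjacency maps whose keys are all the vertices:

from collections import deque

def bfs_set(adj, sources):
    """All vertices reachable from any vertex of `sources` in the map `adj`."""
    seen = set(sources)
    queue = deque(sources)
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen

def bow_tie(succ, pred, core):
    """Split the vertices into the bow-tie regions, given the CORE (largest SCC)
    and forward/backward adjacency maps succ/pred (assumed inputs)."""
    forward = bfs_set(succ, core)                 # CORE U OUT
    backward = bfs_set(pred, core)                # CORE U IN
    out_region, in_region = forward - core, backward - core

    tendrils_in = bfs_set(succ, in_region) - forward - backward
    tendrils_out = bfs_set(pred, out_region) - forward - backward
    tendrils = tendrils_in | tendrils_out
    tubes = tendrils_in & tendrils_out            # counted inside TENDRILS, as above

    disc = set(succ) - forward - backward - tendrils
    return {"CORE": set(core), "IN": in_region, "OUT": out_region,
            "TENDRILS": tendrils, "TUBES": tubes, "DISC": disc}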

(2) Compute SCCs

• Classical algorithm:
  – DFS(G)
  – transpose G into G^T
  – DFS(G^T), visiting vertices in decreasing order of f[v] (the time at which the first visit of v ended)
  – every tree of the second DFS forest is an SCC.

• DFS is hard to compute in secondary memory: no locality

DFS

DFS(u: vertex)
  color[u] ← GRAY
  d[u] ← time ← time + 1
  foreach v in succ[u] do
    if color[v] = WHITE then
      p[v] ← u
      DFS(v)
  endFor
  color[u] ← BLACK
  f[u] ← time ← time + 1

Classical Approach

main() {
  foreach vertex v do
    color[v] ← WHITE
  endFor
  time ← 0
  foreach vertex v do
    if color[v] = WHITE then DFS(v)
  endFor
}

Semi-External DFS (1)

(J.F. Sibeyn, J. Abello, U. Meyer)

Compute the DFS forest in several iterations, until there are no forward edges:
a forest is a DFS forest if and only if there are no forward edges.

Memory space: (12 + 1/8) bytes per vertex

Semi-External DFS (2)

Data structures

[Figure: adjacency-list layout with an array of n+1 pointers and an array of n+k adjacent vertices]

• Adjacency lists store the partial DFS tree plus the next batch of edges
  – n+1 integers point into the successor array
  – n+k integers store up to n+k successors (k ≥ n)

Semi-External DFS (3)

While the DFS forest changes {
  – add the next k edges to the current DFS forest
  – compute a DFS on the n+k edges
  – update the DFS forest
}

Computation of SCC

Is the Web a small world?

– Based on a simple model, [Barabási et al.] predicted that most pages are within 19 links of each other, justifying the model by crawling nd.edu (1999)

•Well, not really!

Distance measurements

•Experimental data (Altavista)
  – Maximum directed distance between 2 CORE nodes: 28
  – Maximum directed distance between 2 nodes, given there is a path: > 900
  – Average directed distance between 2 SCC nodes: 16

•Experimental data (WebBase)
  – Depth of IN = 8
  – Depth of OUT = 112
  – Max backward and forward BFS depth in the core = 8

More structure in the Web Graph

• Insights from hubs and authorities: dense bipartite subgraph ⇒ Web community

• A large number of bipartite cliques, the cores of hidden Web communities, can be found in the Webgraph [R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins, 99]

[Figure: a (4,3) clique; hubs/fans on one side, authorities/centers on the other]

Disjoint bipartite cliques

[Figures: number of disjoint bipartite cliques in the 200M Alexa crawl, 1997 (Kumar et al.) and in the 200M WebBase crawl, 2001]

More cyber-communities and/or a better algorithm for finding disjoint bipartite cliques

Approach

• Find all disjoint (i,j) cores, 3 ≤ i ≤ 10, 3 ≤ j ≤ 10
• Expand such cores into full communities

• Enumerating all dense bipartite subgraphs is very expensive

• We run heuristics to approximately enumerate disjoint bipartite cliques of small size

Pre-processing

• Remove all vertices with |I(v)| > 50: we are not interested in popular pages such as Yahoo, CNN, etc.

• Use iterative pruning:
  – remove all centers with |I(v)| < j
  – remove all fans with |O(v)| < i

Enumeration of (i,j) cliques

1. For a vertex v, enumerate the size-j subsets S of O(v)

2. If |∩_{u∈S} I(u)| ≥ i, then an (i,j) clique has been found

3. Remove its i fans and j centers

4. Repeat until the graph is empty
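A rough single-pass sketch of this heuristic (not the authors' semi-external implementation; `out_adj`/`in_adj` map vertices to successor/predecessor sets and are assumed inputs):

from itertools import combinations

def disjoint_ij_cliques(out_adj, in_adj, i=3, j=3, max_indegree=50):
    """Heuristic sketch of the enumeration above.  out_adj / in_adj map every
    vertex to the set of its successors / predecessors (assumed inputs)."""
    # Pre-processing: drop overly popular centers, keep centers with >= j fans.
    centers = {v for v, preds in in_adj.items() if j <= len(preds) <= max_indegree}

    found, used = [], set()
    for v, succs in out_adj.items():
        if v in used:
            continue
        candidates = sorted((succs & centers) - used)
        for S in combinations(candidates, j):
            fans = set.intersection(*(in_adj[c] for c in S)) - used - set(S)
            if len(fans) >= i:                       # an (i, j) clique core
                fan_side = sorted(fans)[:i]
                found.append((fan_side, list(S)))
                used.update(fan_side, S)             # keep the cliques disjoint
                break
    return found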

Semi-external algorithm

• The lists of predecessors and successors are stored in N/B blocks
• Every block contains the adjacency lists of B vertices and fits in main memory

• Keep two 1-bit arrays, Fan() and Center(), in main memory

• Phases I and II are easily implemented in streaming fashion

• Phase III. Problem: given S, computing ∩_{u∈S} I(u) needs access to later blocks

Semi-external algorithm

Phase III. If we cannot decide on a set S ⊆ O(v) within the current block, store S and the partial intersection ∩_{u∈S} I(u) with the next block containing a vertex of S.

When moving a new block to main memory, explore all vertices in the block and continue the exploration of the sets S inherited from previous blocks.

Computation of disjoint (4,4) cliques

PageRank

• PageRank measures the steady-state visit rate of a random walk on the Web Graph

• Imagine a random surfer:
  – start from a random page
  – at any time choose an outgoing link with equal probability

[Figure: a page with three outgoing links, each chosen with probability 1/3]

Problem

The Web is full of dead ends

• Teleporting
  – with probability α choose a random outgoing edge
  – with probability 1-α continue from a random node

• There is a long-term visit rate for every page

Page Rank Computation

• Let a be the PageRank vector and A the (row-normalized) adjacency matrix

• If we start from distribution a, after one step we are at distribution aA

• PageRank is the left eigenvector of A:

a = a A

• In practice, start from any vector a

• Repeatedly apply

  a = aA,  a = aA²,  …,  a = aA^k

  until a is stable
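A small in-memory sketch of the power iteration with teleporting (the 4-page `web` example is hypothetical; real Web-scale computations use external memory, as mentioned on the next slides):

def pagerank(succ, alpha=0.85, iters=50):
    """Power iteration: repeatedly apply the teleporting random-walk step.
    `succ` maps every node to the list of its out-links (assumed input);
    alpha is the probability of following a link rather than teleporting."""
    nodes = list(succ)
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        new = {u: (1.0 - alpha) / n for u in nodes}
        for u in nodes:
            out = succ[u]
            if out:                               # spread rank along out-links
                share = alpha * rank[u] / len(out)
                for v in out:
                    new[v] += share
            else:                                 # dead end: teleport everywhere
                for v in nodes:
                    new[v] += alpha * rank[u] / n
        rank = new
    return rank

# Toy hypothetical 4-page web
web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
print(pagerank(web))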

Pagerank distribution

• Pagerank is distributed with a power law with exponent 2.1 [Pandurangan, Raghavan, Upfal, 2002], measured on a sample of 100,000 vertices from brown.edu

• In-degree and Pagerank are not correlated!!

• We compute Pagerank on the WebBase crawl

• We confirm that Pagerank is distributed with exponent 2.1

• Pagerank/in-degree correlation = 0.3

• Efficient external-memory computation of Pagerank based on [T.H. Haveliwala, 1999]

Efficient computation of Pagerank

Models of the Web Graph

Why study models for the Web Graph?

• To better understand the process of content creation

• To test Web applications • To predict its evolution

…and a challenging mathematical problem

Standard Theory of Random Graphs (Erdős and Rényi 1960)

• A random graph is constructed by starting with n vertices.

• Each pair of vertices is connected by an edge with probability p

• Degrees are Poisson distributed:

  P(k) = e^{-pN} (pN)^k / k!
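A quick G(n,p) sketch that checks the Poisson formula above empirically (the parameter values are illustrative):

import math
import random
from collections import Counter

def gnp(n, p, seed=0):
    """Erdos-Renyi G(n, p): each of the n*(n-1)/2 possible edges is present
    independently with probability p."""
    rng = random.Random(seed)
    adj = {u: set() for u in range(n)}
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                adj[u].add(v)
                adj[v].add(u)
    return adj

n, p = 2000, 0.004               # expected degree c = p*n = 8
g = gnp(n, p)
degrees = Counter(len(nbrs) for nbrs in g.values())
c = p * n
for k in (4, 8, 12):             # compare the observed fraction with the formula
    poisson = math.exp(-c) * c ** k / math.factorial(k)
    print(k, round(degrees[k] / n, 4), round(poisson, 4))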

Properties of Random Graphs

• The probability that the degree is far from its expectation c = pn drops exponentially fast

• Threshold phenomena:
  – If c < 1, the largest component has size O(log n)
  – If c > 1, the largest component has size O(n), and all others have size O(log n)

• If c = ω(log n), the graph is connected and the diameter is O(log n / log log n)

• We look for random graph models that capture the properties of the Web

Features of a good model

• Must evolve over time: pages appear and disappear over time

• Content creation is sometimes independent of the current Web, sometimes dependent on it: some links are random, others are copied from existing links

• Content creators are biased towards good pages

• Must be able to reproduce the relevant observables, i.e. statistical and topological structure

Details of Models

• Vertices are created in discrete time steps

• At every time step a new vertex is created that connects with d edges to existing vertices

• d = 7 in the simulations

• Edges are selected according to specific probability distributions

Evolving Network [Albert, Barabási, 1999]

• Growing network with preferential attachment:
  1. Growth: at every time step a new node enters the system with d edges
  2. Preferential attachment: the probability of being connected depends on the degree, P(k) ∝ k

• The resulting degree distribution is P(k) ~ k^{-α}, α = 2
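A minimal sketch of growth with preferential attachment (degree-proportional sampling is done via a list of edge endpoints; the sizes below are illustrative, not from the slides):

import random
from collections import Counter

def preferential_attachment(n, d=7, seed=0):
    """Growth + preferential attachment sketch.  Each new node attaches d edges
    to existing nodes chosen with probability proportional to their degree;
    the pool `endpoints` holds every node once per incident edge endpoint."""
    rng = random.Random(seed)
    endpoints = list(range(d))          # seed nodes, each counted once initially
    edges = []
    for u in range(d, n):
        chosen = set()
        while len(chosen) < d:
            chosen.add(rng.choice(endpoints))
        for v in chosen:
            edges.append((u, v))
            endpoints.extend((u, v))    # both endpoints re-enter the pool
    return edges

edges = preferential_attachment(20_000)
indeg = Counter(v for _, v in edges)
print("max in-degree:", max(indeg.values()))   # a few hubs collect many links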

A formal argument [Bollobás, Riordan, Spencer, Tusnády, 2001]

• The fraction of nodes that have in-degree k is proportional to k^{-3} if d = 1.

• Consider 2n nodes and make a random pairing

• Start from the left side; identify a vertex with all consecutive left endpoints, until a right endpoint is reached

• Preferential attachment fails to capture other aspects of the Web graph, such as the large number of small bipartite cliques

The Copying Model [Kumar, Raghavan, Rajagopalan, Sivakumar, Tomkins, Upfal, 2000]

• It is an evolving model: vertices are added one by one, and each points with d = 7 edges to existing vertices
  – When inserting vertex v, choose at random a prototype vertex u in the graph
  – With probability α copy the j-th link of u; with probability 1-α choose a random endpoint
  – The in-degree follows a power law with exponent 2.1 if α = 0.8

• The aim is to model the process of copying links from other related pages

• It tries to form Web communities whose hubs point to good authorities
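A minimal sketch of the copying process described above (Python; the seed vertex with self-links is an arbitrary bootstrap choice, and the sizes are illustrative):

import random
from collections import Counter

def copying_model(n, d=7, alpha=0.8, seed=0):
    """Copying-model sketch: every new vertex picks a random prototype u; each
    of its d links is, with probability alpha, a copy of u's corresponding
    link and, with probability 1-alpha, a uniformly random existing vertex."""
    rng = random.Random(seed)
    out_links = {0: [0] * d}                       # seed vertex (self-links)
    for v in range(1, n):
        proto = rng.randrange(v)                   # random prototype among 0..v-1
        out_links[v] = [out_links[proto][j] if rng.random() < alpha
                        else rng.randrange(v)
                        for j in range(d)]
    return out_links

links = copying_model(50_000)
indeg = Counter(t for targets in links.values() for t in targets)
# the in-degree histogram shows a heavy, power-law-like tail
print("max in-degree:", max(indeg.values()))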

Properties of the Copying model

• Let N_{t,k} be the number of nodes of degree k at time t.

• lim_{t→∞} N_{t,k}/t ~ k^{-(2-α)/(1-α)}

• Let Q(i,j,t) be the expected number of cliques K(i,j) at time t

• Theorem: Q(i,j,t) is Ω(t) for small values of i and j

A Multi-Layer model of the Graph

• The Web is produced by the superposition of multiple independent regions, Thematically Unified Clusters [Dill et al., Self-Similarity in the Web, 2002]

• Different regions differ in size and aggregation criteria, for instance topic, geography or domain.

• Regions are connected together by a "connectivity backbone" formed by pages that belong to multiple regions, e.g. documents that are relevant for multiple topics

Multi-layer model [Caldarelli, De Los Rios, Laura, Leonardi, Millozzi 2002]

• Model the Web as the superposition of different thematically unified clusters, generated by independent stochastic processes

• Different regions may follow different stochastic models of aggregation

Multi-layer model, details

• Every new vertex is assigned to a constant number c = 3 of layers, chosen at random out of L = 100 layers.

• Every vertex is connected with d = 7 edges, distributed over its c layers.

• Within every layer, edges are inserted using either the Copying model with α = 0.8 or the Evolving Network model

• The final graph is obtained by merging the graphs created in all the layers

Properties of the Multi-layer model

• The in-degree distribution follows a power law with exponent α = 2.1

• The result is stable under a large variation of the parameters:
  – the total # of layers L
  – the # of layers to which every page is assigned
  – the stochastic model used in a single layer

What about SCCs?

• All models presented so far produce directed graphs without cycles

• Rewiring the Copying model and the Evolving Network model:
  – generate a graph according to the model on N vertices
  – insert a number of random edges, from 0.01N up to 3N

• In classical random graphs, a giant connected component suddenly emerges at c = 1

Size of the largest SCC / # of SCCs

• We observe the # of SCCs of size 1 and the size of the largest SCC

• Both measures show a smooth transition as the number of rewired edges increases.

• There is no threshold phenomenon on SCCs for power-law graphs

• A similar result was recently proved by Bollobás and Riordan for WCCs in undirected graphs

Copying model with rewiring

Evolving Network model with rewiring

Efficient computation of SCC

Conclusions on Models for the Web

• In-degree: all models give a power law distribution for specific parameters. The Multi-layer model achieves exponent α = 2.1 for a broad range of parameters

• Bipartite cliques: the Copying model manages to form a large number of bipartite cliques

• All models have a high correlation between PageRank and in-degree

• No model replicates the Bow Tie structure and the out-degree distribution

The Internet Graph

The map of the Internet, Burch, Cheswick [1999]

Faloutsos, Faloutsos, Faloutsos [1999]: plot of the frequency of the out-degrees in the Autonomous System graph

How many vantage points do we need to sample most of the connections between Autonomous Systems?

The Web and the Internet are different!

• A model for a physical network should take geography into account.

• Heuristically Optimized Trade-offs (Carlson, Doyle).

• The power law is the outcome of human activity, i.e. a compromise between different contrasting objectives.

• Network growth is a compromise between cost and centrality, i.e. between distance and good positioning in the network.

The FKP model [Fabrikant, Koutsoupias, Papadimitriou, 2002]

• Vertices arrive one by one, uniformly at random in the unit square

• Vertex i connects to a previous vertex j < i (forming a tree), where
  – d(i,j) is the distance between i and j
  – h_j is a measure of the centrality of node j:
    • avg distance to the other nodes, or
    • avg # of hops to the other vertices

• i chooses the j that minimizes α·d(i,j) + h_j, where α depends on n
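A small sketch of the FKP construction (here h_j is taken to be the hop distance of j from the first vertex, one simple stand-in for the centrality measures above; the values of α and n are illustrative):

import random
from collections import Counter

def fkp_tree(n, alpha=4.0, seed=0):
    """FKP sketch: points arrive uniformly at random in the unit square and
    vertex i attaches to the earlier vertex j minimising alpha*d(i,j) + h_j,
    with h_j the hop distance of j from vertex 0."""
    rng = random.Random(seed)
    pts = [(rng.random(), rng.random()) for _ in range(n)]
    parent, hops = [None] * n, [0] * n
    for i in range(1, n):
        xi, yi = pts[i]
        best_j, best_cost = 0, float("inf")
        for j in range(i):
            xj, yj = pts[j]
            cost = alpha * ((xi - xj) ** 2 + (yi - yj) ** 2) ** 0.5 + hops[j]
            if cost < best_cost:
                best_j, best_cost = j, cost
        parent[i] = best_j
        hops[i] = hops[best_j] + 1
    return parent

deg = Counter(p for p in fkp_tree(3000, alpha=10.0) if p is not None)
print("largest in-degree:", max(deg.values()))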

Results

• The in-degree distribution depends on α:
  – if α < 1/√N, the tree is a star
  – if α = Ω(√N), the degrees are exponentially distributed
  – if 4 ≤ α = o(√N), the degrees follow a power law

Challenges

• Exploit the knowledge of the Web graph to design better crawling strategies.

• Design models for the Dynamically Evolving Web: e.g. model the rate of arrival of new connections over time.

• On-line algorithms with sub-linear space to maintain topological and statistical information

• Data structures able to answer queries with time arguments

Web Graph representation and compression

Thanks to Luciana Salete Buriol and Debora Donato

Main features of Web Graphs

Locality: most of the hyperlinks are local, i.e. they point to other URLs on the same host. The literature reports that on average 80% of the hyperlinks are local.

Consecutivity: links within the same page are likely to be consecutive with respect to the lexicographic order.

URL normalization: convert hostnames to lower case, canonicalize port numbers (re-introducing them where needed), and add a trailing slash to all URLs that do not have one.

Main features of Web Graphs

Similarity: pages on the same host tend to have many hyperlinks pointing to the same pages.

Consecutivity is the dual distance-one similarity.

Literature

Connectivity Server (1998) – Digital Systems Research Center and Stanford University – K. Bharat, A. Broder, M. Henzinger, P. Kumar, S. Venkatasubramanian;

Link Database (2001) – Compaq Systems Research Center – K. Randall, R. Stata, R. Wickremesinghe, J. Wiener;

WebGraph Framework (2002) – Università degli Studi di Milano – P. Boldi, S. Vigna.

Connectivity Server

➢ Tool for web graph visualisation, analysis (connectivity, page ranking) and URL compression.

➢ Used by AltaVista;

➢ Links are represented by an outgoing and an incoming adjacency list;

➢ Composed of:

URL Database: URL, fingerprint, URL-id;

Host Database: group of URLs based on the hostname portion;

Link Database: URL, outlinks, inlinks.

Connectivity Server: URL compression

URLs are sorted lexicographically and stored as delta-encoded entries (70% reduction).

[Figure: URL delta encoding and indexing of the delta encoding]

Link1: first version of Link Database

No compression: simple representation of outgoing and incoming adjacency lists of links.

Avg. inlink size: 34 bits

Avg. outlink size: 24 bits

Link2: second version of Link Database

Single list compression and starts compression

Avg. inlink size: 8.9 bits

Avg. outlink size: 11.03 bits

Delta Encoding of the Adjacency Lists

Each array element is 32 bits long.

[Figure: delta-encoded adjacency list. Examples: -3 = 101-104 (first item), 42 = 174-132 (subsequent items)]

Nybble Code

The low-order bit of each nybble indicates whether or not there are more nybbles in the string

The least-significant data bit encodes the sign.

The remaining bits provide an unsigned number

28 = 0111 1000

-28 = 1111 0010

Starts array compression

• The URLs are divided into three partitions based on their degree;

• Elements of starts are indices into the nybble array;

• The literature reports that 74% of the entries are in the low-degree partition.

Starts array compression

Z(x) = max (indegree(x), outdegree(x))

P = the number of pages in each block.

Entry range          Partition                  # bits per entry
Z(x) > 254           high-degree partition      32
24 ≤ Z(x) ≤ 254      medium-degree partition    (32 + P*16)/P
Z(x) < 24            low-degree partition       (32 + P*8)/P

Link3: third version of Link Database

Interlist compression with representative list

Avg. inlink size: 5.66 bits

Avg. outlink size: 5.61 bits

Interlist Compression

ref: relative index of the representative adjacency list;

deletes: set of URL-ids to delete from the representative list;

adds: set of URL-ids to add to the representative list.

LimitSelect-K-L: chooses the best representative adjacency list from among the previous K (8) URL-ids' adjacency lists and only allows chains of fewer than L (4) hops.

ζ-codes (WebGraph Framework)

Interlist compression with representative list

Avg. inlink size: 3.08 bits

Avg. outlink size: 2.89 bits

Compressing Gaps

Successor list S(x) = {s_1 - x, s_2 - s_1 - 1, ..., s_k - s_{k-1} - 1}

For negative entries:

Adjacency list with compressed gaps.

Uncompressed adjacency list
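A tiny sketch of the gap encoding and its inverse (the successor list is assumed sorted; the 2g / 2|g|-1 mapping used for the possibly negative first gap is a common convention and an assumption here, and the example list is hypothetical):

def gap_encode(x, successors):
    """Gap-encode the sorted successor list of page x as in the formula above:
    first entry s1 - x, then s_k - s_(k-1) - 1 for the remaining entries.
    The possibly negative first gap is mapped to a non-negative code with the
    2g / 2|g|-1 trick (an assumption; the exact convention may differ)."""
    gaps = [successors[0] - x] + [b - a - 1
                                  for a, b in zip(successors, successors[1:])]
    first = 2 * gaps[0] if gaps[0] >= 0 else 2 * abs(gaps[0]) - 1
    return [first] + gaps[1:]

def gap_decode(x, codes):
    """Invert gap_encode."""
    first = codes[0] // 2 if codes[0] % 2 == 0 else -(codes[0] + 1) // 2
    succ = [x + first]
    for g in codes[1:]:
        succ.append(succ[-1] + g + 1)
    return succ

codes = gap_encode(17, [13, 15, 16, 112, 174])
print(codes, gap_decode(17, codes))   # round-trip recovers the successor list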

Using copy lists

Each bit on the copy list informs whether the corresponding successor of y is also a successor of x;

The reference list index ref. is chosen as the value between 0 and W (window size) that gives the best compression.

Uncompressed adjacency list

Adjacency list with copy lists.

Using copy blocks

The last block is omitted;

The first copy block is 0 if the copy list starts with 0;

The length is decremented by one for all blocks except the first one.

Adjacency list with copy blocks.

Adjacency list with copy lists.

Compressing intervals

Intervals: represented by their left extreme and length;

Interval lengths: decremented by the threshold L_min;

Residuals: compressed using differences.

[Figure: adjacency list with copy lists vs. adjacency list with intervals]

Compressing intervals

[Figure: adjacency list with copy lists vs. adjacency list with intervals, worked example]

50 = ?

0 = (15-15)*2

600 = (316-16)*2

5 = |13-15|*2-1

3018 = 3041-22-1

Compression comparison

Using different computers and compilers.

Method     Inlink size  Outlink size  Access time  # pages (million)  # links (million)  Database
Huff.      15.2         15.4          -            112                320                WebBase
Link1      34           24            13           61                 1000               Web Crawler Mercator
Link2      8.9          11.03         47           61                 1000               Web Crawler Mercator
Link3      5.66         5.61          248          61                 1000               Web Crawler Mercator
ζ-codes    3.25         2.18          206          18.5               300                .uk domain
S-Node     5.07         5.63          298          -                  900                WebBase

Conclusions

The compression techniques are specialized for Web Graphs.

The average link size decreases as the graph grows.

The average link access time increases as the graph grows.

The ζ-codes seem to have the best trade-off between average bit size and access time.