Wikipedia as a Complex Network 1/30/07 Ken Bannister


Introduction

Talk is based on:
Wikipedias: Collaborative web-based encyclopedias as complex networks
by V. Zlatić, M. Božičević, H. Štefančić, and M. Domazet
Physical Review E 74, 016115, 2006

However, to explain it all we look at some other resources as well.


Network Characteristics and Structure

Complex Network Characteristics
● Mean shortest path length -- low?
● Clustering coefficient -- high?
● Node degree distribution -- power law?
● Triad significance profile -- Hey, Newman didn't mention that!

Topology
● Like a bow tie


Wikipedia as a Graph

Article == Node
Hyperlink to another article == Directed (outgoing) Link
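The mapping above can be sketched in a few lines of Python. This is only an illustration: the function name `build_link_graph` and the article titles in the usage lines are made up for the example, and the paper does not say how its graph was stored.

```python
def build_link_graph(links):
    """Build a directed graph from (source_article, target_article) pairs.

    Returns two dicts mapping each article to a set of articles:
    out_links[a] = articles that a links to, in_links[a] = articles linking to a.
    """
    out_links = {}
    in_links = {}
    for source, target in links:
        # Make sure both endpoints exist as nodes, even if they have no links
        # in one of the two directions.
        for article in (source, target):
            out_links.setdefault(article, set())
            in_links.setdefault(article, set())
        out_links[source].add(target)
        in_links[target].add(source)
    return out_links, in_links


# Hypothetical toy data: four hyperlinks between three articles.
toy_links = [("Graph", "Node"), ("Graph", "Edge"), ("Node", "Edge"), ("Edge", "Graph")]
out_links, in_links = build_link_graph(toy_links)
out_degree = {a: len(t) for a, t in out_links.items()}
in_degree = {a: len(s) for a, s in in_links.items()}
```

Storing both directions makes in-degree and out-degree lookups equally cheap, which is convenient for the degree distributions discussed later in the talk.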


Wikipedia Snapshot

Paper based on data from 7 January 2005.
Wikipedia article count [1]: 430k then; 1.4M now (January 2007).


Wikipedia Snapshot

Wikipedia internal links [2]: 8.5M then; 32.1M now.


Mean Shortest Path Length

From Newman [3]:

Wikipedia shows small world effect.

             type       n         m           l     l (random)
Wikipedia    directed   434 000   8 500 000   4.9   3.6
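For a graph small enough to search exhaustively, the mean shortest path length l can be computed with one breadth-first search per node. The sketch below is an assumption-laden illustration, not the paper's procedure: it averages only over ordered pairs where the target is reachable from the source, and the function names are invented here.

```python
from collections import deque


def bfs_distances(out_links, source):
    """Return shortest directed path lengths from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in out_links.get(v, ()):
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist


def mean_shortest_path(out_links):
    """Average shortest path length over all ordered, reachable pairs (s != t).

    Assumes every node appears as a key of out_links.
    """
    total = 0
    pairs = 0
    for source in out_links:
        for target, d in bfs_distances(out_links, source).items():
            if target != source:
                total += d
                pairs += 1
    return total / pairs if pairs else float("nan")


# Directed 3-cycle: from each node the other two are at distance 1 and 2.
cycle = {"A": {"B"}, "B": {"C"}, "C": {"A"}}
print(mean_shortest_path(cycle))
```

Running all-pairs BFS costs on the order of n x m steps, so for a graph the size of Wikipedia one would more likely sample source nodes than visit every one.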


Clustering Coefficient

Let's review.

Two ways to calculate, from [3]:
● 3 x triangles / connected triples
● triangles for vertex i / triples centered on i, averaged over all vertices
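Both definitions can be computed side by side on a small undirected graph. The sketch below is illustrative only; the function names are invented, and it follows the convention of assigning a local coefficient of 0 to vertices with fewer than two neighbors.

```python
from itertools import combinations


def to_undirected(edges):
    """Turn a list of (a, b) pairs into an undirected adjacency dict of sets."""
    adj = {}
    for a, b in edges:
        if a == b:
            continue  # ignore self-loops
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def clustering_coefficients(adj):
    """Return (global_c, mean_local_c) for an undirected graph.

    global_c     = 3 x triangles / connected triples
    mean_local_c = average over vertices of
                   (triangles through i) / (triples centered on i)
    """
    closed = 0   # linked neighbor pairs summed over centers; equals 3 x triangles
    triples = 0  # connected triples summed over centers
    local = []
    for v, nbrs in adj.items():
        k = len(nbrs)
        centered = k * (k - 1) // 2
        linked = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        triples += centered
        closed += linked
        local.append(linked / centered if centered else 0.0)
    global_c = closed / triples if triples else 0.0
    mean_local_c = sum(local) / len(local) if local else 0.0
    return global_c, mean_local_c


# A triangle (0, 1, 2) with one extra vertex 3 hanging off vertex 2.
adj = to_undirected([(0, 1), (1, 2), (0, 2), (2, 3)])
print(clustering_coefficients(adj))
```

On this toy graph there is 1 triangle and 5 connected triples, so the first definition gives 3/5 = 0.6, while the local values 1, 1, 1/3, 0 average to about 0.58. The two definitions generally disagree, which is worth remembering when comparing published numbers.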


Significance of Clustering Coefficient

High clustering coefficient found in diverse real-world networks, as shown by Watts and Strogatz [4].


Watts, Strogatz 1998

“Collective dynamics of ‘small-world’ networks” -- What a letter!
3 pages in Nature; 3739 citations on Google Scholar.

Why so popular?
● Identifies an important phenomenon with two simple statistics
  ● shortest path length
  ● clustering coefficient
● Develops a simple model to recreate the phenomenon
  ● one-dimensional lattice
  ● add/replace some links with random links
● Shows insights from the model with a relevant simulation
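The model in the list above can be sketched as follows. This is a minimal version of the "replace" variant under stated assumptions: the parameter names n, k, p and the function name are chosen here for illustration, k is taken to be even and smaller than n, and the rewiring keeps one endpoint of each lattice link fixed.

```python
import random


def watts_strogatz(n, k, p, seed=None):
    """Ring lattice of n nodes, each linked to its k nearest neighbors,
    with every lattice link rewired to a random target with probability p.

    Returns an undirected adjacency dict of sets. Assumes k is even and k < n.
    """
    rng = random.Random(seed)
    adj = {v: set() for v in range(n)}

    # One-dimensional lattice: link each node to k/2 neighbors on each side.
    for v in range(n):
        for j in range(1, k // 2 + 1):
            w = (v + j) % n
            adj[v].add(w)
            adj[w].add(v)

    # Replace some lattice links with random links.
    for j in range(1, k // 2 + 1):
        for v in range(n):
            if rng.random() >= p:
                continue
            w = (v + j) % n
            # New endpoint must not create a self-loop or a duplicate link.
            candidates = [u for u in range(n) if u != v and u not in adj[v]]
            if not candidates:
                continue
            new = rng.choice(candidates)
            adj[v].discard(w)
            adj[w].discard(v)
            adj[v].add(new)
            adj[new].add(v)
    return adj


lattice = watts_strogatz(20, 4, 0.0, seed=1)      # pure lattice: clumpy, long paths
small_world = watts_strogatz(20, 4, 0.1, seed=1)  # a few shortcuts between clumps
```

Because each rewiring removes one link and adds one, the number of links stays at n x k / 2 for every p, so differences in path length and clustering between runs come from where the links go rather than how many there are.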

Main insight -- small-world network:
● High clustering coefficient means clumps of nodes
● Low path length means connections between clumps

Notice the distinction between small-world ...
● effect -- small shortest path length
● network -- small shortest path length + high clustering coefficient


Clustering in Wikipedia

Clustering coefficient for Wikipedias vs. random graph.
Assumes undirected graph.

Coefficients are around 2x to 5x greater than random/expected.

Clustering coefficient trends larger in smaller Wikipedias, i.e., negative correlation with size.


Node Degree Distribution

[Figure: degree distributions for English Wikipedia.] In-degree γ = 2.21; out-degree γ = 2.65.

Looking for power law distribution: $P(k) \sim k^{-\gamma}$

[Figure: degree distributions for the WWW in 1999, from [5].] Interestingly, in-degree γ = 2.1; out-degree γ = 2.7.
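One common way to estimate γ from a list of degrees is the maximum-likelihood estimate γ ≈ 1 + n / Σ ln(k_i / k_min). The sketch below uses that estimator as an assumption: it is a continuous approximation, and it is not necessarily how the exponents quoted above were obtained. The function names and the synthetic data are illustrative only.

```python
import math
import random


def estimate_gamma(degrees, k_min=1.0):
    """Maximum-likelihood estimate of gamma for P(k) ~ k^(-gamma), k >= k_min.

    Uses the continuous approximation gamma = 1 + n / sum(ln(k_i / k_min)).
    """
    tail = [k for k in degrees if k >= k_min]
    log_sum = sum(math.log(k / k_min) for k in tail)
    if log_sum <= 0:
        raise ValueError("need at least one degree strictly above k_min")
    return 1.0 + len(tail) / log_sum


def sample_power_law(gamma, k_min, size, seed=0):
    """Draw synthetic values from a continuous power law by inverse transform."""
    rng = random.Random(seed)
    exponent = -1.0 / (gamma - 1.0)
    return [k_min * (1.0 - rng.random()) ** exponent for _ in range(size)]


# Synthetic check: values drawn with gamma = 2.2 should give an estimate near 2.2.
synthetic = sample_power_law(2.2, 1.0, 20000)
print(estimate_gamma(synthetic, 1.0))
```

Fitting a straight line to a log-log histogram is the other familiar approach, but it is sensitive to binning, which is one reason to prefer a likelihood-based estimate when the raw degrees are available.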


Triad Significance Profile (TSP)

TSP is a way to analyze the structure of 3-node subgraphs in a (directed) graph. Introduced by Milo, et al. in 2002 [6] (cited by 588 in Google Scholar), and expanded in 2004 [7].
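To make "3-node subgraphs" concrete, the sketch below counts two of the 13 connected triad types in a small directed graph: the feed-forward triad (a→b, b→c, a→c and nothing else) and the clique triad (links in both directions between all three nodes). It is a brute-force illustration with invented names, usable only on small graphs, and is not the algorithm used in [6] or [7].

```python
from itertools import combinations, permutations


def count_triads(edges):
    """Count feed-forward and clique triads as induced 3-node subgraphs."""
    edge_set = {(a, b) for a, b in edges if a != b}
    nodes = list({v for edge in edge_set for v in edge})
    counts = {"feed_forward": 0, "clique": 0}
    for trio in combinations(nodes, 3):
        # All directed links that exist among these three nodes.
        sub = {(a, b) for a, b in permutations(trio, 2) if (a, b) in edge_set}
        if len(sub) == 6:
            counts["clique"] += 1
        elif len(sub) == 3 and any(
            {(a, b), (b, c), (a, c)} == sub for a, b, c in permutations(trio)
        ):
            counts["feed_forward"] += 1
    return counts


toy = [
    (0, 1), (1, 2), (0, 2),                          # feed-forward triad
    (3, 4), (4, 3), (4, 5), (5, 4), (3, 5), (5, 3),  # clique triad
    (6, 7), (7, 8), (8, 6),                          # 3-cycle, counted as neither
]
print(count_triads(toy))
```

Checking every triple costs on the order of n^3 steps, so real motif-detection tools enumerate subgraphs along existing links instead.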

For each subgraph i, the statistical significance is described by the Z score:

$$Z_i = \frac{N^{\mathrm{real}}_i - \langle N^{\mathrm{rand}}_i \rangle}{\mathrm{std}(N^{\mathrm{rand}}_i)}$$

where $N^{\mathrm{real}}_i$ is the number of times the subgraph appears in the network, and $\langle N^{\mathrm{rand}}_i \rangle$ and $\mathrm{std}(N^{\mathrm{rand}}_i)$ are the mean and standard deviation of its appearances in the randomized network ensemble.

For comparison between networks, this vector of Z scores may then be normalized to length 1: $SP_i = Z_i / \sqrt{\sum_j Z_j^2}$.
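Given subgraph counts for the real network and for each network in a randomized ensemble, the Z scores and the normalized profile are a few lines of arithmetic. The sketch below assumes the counts are already available; the function names and the numbers in the usage lines are invented, and the sample standard deviation is an assumption since the slides do not say which variant is meant.

```python
import math
import statistics


def z_scores(real_counts, random_counts):
    """Z_i = (N_real_i - mean(N_rand_i)) / std(N_rand_i) for each subgraph i.

    real_counts:   list with one count per subgraph type
    random_counts: list of such lists, one per randomized network (at least two)
    """
    scores = []
    for i, n_real in enumerate(real_counts):
        sample = [counts[i] for counts in random_counts]
        mean = statistics.mean(sample)
        std = statistics.stdev(sample)  # sample standard deviation (assumption)
        scores.append((n_real - mean) / std if std > 0 else 0.0)
    return scores


def normalize(scores):
    """Scale the vector of Z scores to length 1 so networks can be compared."""
    norm = math.sqrt(sum(z * z for z in scores))
    return [z / norm for z in scores] if norm > 0 else list(scores)


# Made-up counts for two subgraph types and an ensemble of three random networks.
real = [12, 3]
ensemble = [[8, 5], [10, 7], [12, 9]]
z = z_scores(real, ensemble)  # [1.0, -2.0]
profile = normalize(z)
```

Normalizing matters because raw Z scores tend to grow with network size, so without it a large network would dominate any comparison.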


TSP Families

From [7].
Row 1: Gene/transcription network in bacteria. Peak on feed-forward triad (7), used in signal processing.
Row 2: Various biological information-processing networks.
Row 3: WWW and social networking. Peak on clique triad (13) for clustering.
Row 4: Word adjacency networks in text. Few triads due to ordering of word categories, like preposition->noun.


TSP Comparisons

Correlation coefficient matrix from [7]


TSP and Correlation Matrix for Wikipedia

Strong resemblance to WWW in Milo, et al.

Differs from Milo, et al. in comparison to social networks.


Bow-tie Topology

Graph topology formed by links between nodes in the WWW, as described by Broder, et al. in [5]. Crawled 200M pages in 1999.


Bow-tie Components in WWW and Wikipedia

Wikipedia data from Capocci, et al. in [8]. Notice much larger strongly-connected component and much smaller tendrils.

Percentage of Total
             SCC    IN     OUT    TENDRIL  TUBE   DISC
Wikipedia    82.4   6.6    6.7    0.6      0.02   3.7
WWW          27.7   21.3   21.2   21.5     0?     8.2

Remember the high incidence of clique triads in TSP, where there are directed links between all three nodes? This incidence matches well with the proportion of links in the SCC.
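The main bow-tie components can be recovered with plain reachability searches: the SCC is the largest set of nodes that can all reach each other, IN is everything that can reach the SCC, and OUT is everything the SCC can reach. The sketch below is a simplified illustration with invented names; it lumps tendrils, tubes, and disconnected nodes into a single OTHER group, and its search for the largest SCC is only practical for small graphs.

```python
from collections import deque


def reachable(adj, start):
    """Return the set of nodes reachable from start by following adj."""
    seen = {start}
    queue = deque([start])
    while queue:
        v = queue.popleft()
        for w in adj.get(v, ()):
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return seen


def bow_tie(edges):
    """Split the nodes of a directed graph into SCC, IN, OUT, and OTHER."""
    nodes, fwd, bwd = set(), {}, {}
    for a, b in edges:
        nodes.update((a, b))
        fwd.setdefault(a, set()).add(b)
        bwd.setdefault(b, set()).add(a)
    if not nodes:
        return {"SCC": set(), "IN": set(), "OUT": set(), "OTHER": set()}

    # Largest strongly connected component: for each unassigned node, its
    # component is the intersection of forward and backward reachability.
    largest, assigned = set(), set()
    for v in nodes:
        if v in assigned:
            continue
        component = reachable(fwd, v) & reachable(bwd, v)
        assigned |= component
        if len(component) > len(largest):
            largest = component

    core = next(iter(largest))
    out_part = reachable(fwd, core) - largest
    in_part = reachable(bwd, core) - largest
    other = nodes - largest - in_part - out_part
    return {"SCC": largest, "IN": in_part, "OUT": out_part, "OTHER": other}


toy = [(1, 2), (2, 3), (3, 1),  # strongly connected core
       (0, 1),                  # 0 links into the core
       (3, 4),                  # the core links out to 4
       (0, 5),                  # tendril hanging off IN
       (6, 7)]                  # disconnected pair
parts = bow_tie(toy)
shares = {name: 100.0 * len(members) / 8 for name, members in parts.items()}
```

For a crawl with hundreds of millions of pages one would use a linear-time SCC algorithm such as Tarjan's or Kosaraju's rather than the repeated searches used here.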


Complex Network Review

Complex Network Characteristics
● Mean shortest path length -- low? -- Yes
● Clustering coefficient -- high? -- Yes; higher for smaller Wikipedias
● Node degree distribution -- power law? -- Yes
● Triad significance profile -- Same family as WWW

Topology
● Bow tie -- Much larger SCC than WWW


References

[1] http://stats.wikimedia.org/EN/PlotsPngArticlesTotal.htm
[2] http://stats.wikimedia.org/EN/PlotsPngDatabaseLinks.htm
[3] M. E. J. Newman. The structure and function of complex networks. http://www-personal.umich.edu/~mejn/courses/2004/cscs535/review.pdf
[4] D. J. Watts and S. H. Strogatz. Collective dynamics of 'small-world' networks. Nature 393(6684), 440-442, 1998.
[5] A. Z. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, and J. Wiener. Graph structure in the Web. Computer Networks 33, 309, 2000.
[6] R. Milo, S. Shen-Orr, S. Itzkovitz, N. Kashtan, D. Chklovskii, and U. Alon. Network Motifs: Simple Building Blocks of Complex Networks. Science 298, 824, 2002.
[7] R. Milo, S. Itzkovitz, N. Kashtan, R. Levitt, S. Shen-Orr, I. Ayzenshtat, M. Sheffer, and U. Alon. Superfamilies of Evolved and Designed Networks. Science 303, 1538, 2004.
[8] A. Capocci, V. D. P. Servedio, F. Colaiori, L. S. Buriol, D. Donato, S. Leonardi, and G. Caldarelli. Preferential attachment in the growth of social networks: The internet encyclopedia Wikipedia. Physical Review E 74, 036116, 2006.