Wikipedia as a complex network
impact.asu.edu/cse591sp07/wikipedia.pdf
2
Introduction
Talk is based on:
Wikipedias: Collaborative web-based encyclopedias as complex networks
by V. Zlatić, M. Božičević, H. Štefančić, and M. Domazet
Physical Review E 74, 016115, 2006
However, to explain it all we look at some other resources as well.
3
Network Characteristics and Structure
Complex Network Characteristics
● Mean shortest path length -- low?
● Clustering coefficient -- high?
● Node degree distribution -- power law?
● Triad significance profile -- Hey, Newman didn't mention that!
Topology
● Like a bow tie
5
Wikipedia Snapshot
Paper based on data from 7 January 2005.
Wikipedia article count [1]: 430k then; 1.4M now.
7
Mean Shortest Path Length
From Newman [3]:
Wikipedia shows small world effect.
type         directed
n            434 000
m            8 500 000
l            4.9
l (random)   3.6
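A minimal sketch of how a mean shortest path length like the l above could be measured on a directed link graph. It assumes the graph fits in memory as a dict of adjacency lists and averages over ordered pairs that are actually connected; the paper's exact averaging convention is not stated on the slide, so treat that choice (and the name `mean_shortest_path`) as an assumption.

```python
from collections import deque


def mean_shortest_path(adj):
    """Mean directed shortest path length over reachable ordered pairs.

    `adj` maps each node to an iterable of the nodes it links to.
    Unreachable pairs are skipped (an assumption, not the paper's stated rule).
    """
    total = 0   # sum of all finite shortest path lengths
    pairs = 0   # number of ordered pairs (source, target) that are connected
    for source in adj:
        # Breadth-first search gives shortest paths in an unweighted graph.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj.get(u, ()):
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1  # exclude the source itself
    return total / pairs if pairs else float("nan")


# Hypothetical toy graph: a directed 4-cycle.
cycle = {0: [1], 1: [2], 2: [3], 3: [0]}
print(mean_shortest_path(cycle))
```

One BFS per node is fine for toy graphs; at Wikipedia scale one would more likely sample source nodes than visit all of them.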
8
Clustering Coefficient
Let's review.
Two ways to calculate, from [3]:
● 3x triangles / connected triples
● triangles for vertex i / triples centered on i
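Both definitions can be sketched in a few lines for an undirected graph. The function name `clustering` and the dict-of-sets input are assumptions made for illustration; vertices with fewer than two neighbours are given a local coefficient of 0, which is the usual convention. The toy graph is the five-vertex example from Newman's review [3] (one triangle, eight connected triples).

```python
from itertools import combinations


def clustering(adj):
    """Return (global_c, mean_local_c) for an undirected graph.

    `adj` maps each vertex to the set of its neighbours (symmetric).
    global_c     = 3 x triangles / connected triples
    mean_local_c = average over i of
                   triangles through i / triples centered on i
    """
    closed = 0    # closed triples; each triangle is counted once per corner
    triples = 0   # connected triples, counted at their center vertex
    local = []
    for i, nbrs in adj.items():
        k = len(nbrs)
        centered = k * (k - 1) // 2
        linked = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        triples += centered
        closed += linked
        local.append(linked / centered if centered else 0.0)
    global_c = closed / triples if triples else 0.0
    mean_local_c = sum(local) / len(local) if local else 0.0
    return global_c, mean_local_c


# Hub "h" linked to four vertices; "a" and "b" are also linked to each other.
toy = {
    "h": {"a", "b", "c", "d"},
    "a": {"h", "b"},
    "b": {"h", "a"},
    "c": {"h"},
    "d": {"h"},
}
print(clustering(toy))  # the two definitions give different values here
```

The two values differ because the second definition weights every vertex equally, so low-degree vertices count for more.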
9
Significance of Clustering Coefficient
High clustering coefficients are found in diverse real-world networks, as shown by Watts and Strogatz [4].
10
Watts, Strogatz 1998
“Collective dynamics of ‘small-world’ networks” -- What a letter!
3 pages in Nature; 3739 citations on Google Scholar.
Why so popular?
● Identifies an important phenomenon with two simple statistics
  ● shortest path length
  ● clustering coefficient
● Develops a simple model to recreate the phenomenon
  ● one-dimensional lattice
  ● add/replace some links with random links
● Shows insights from the model with a relevant simulation
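The model in the list above (a one-dimensional lattice with some links replaced by random ones) can be sketched as follows. This is an illustrative reading of the construction, not the code from the letter; the function name, the `seed` parameter, and the rule of never creating duplicate links or self-loops are assumptions.

```python
import random


def watts_strogatz(n, k, p, seed=None):
    """Ring lattice of n nodes, each linked to its k nearest neighbours,
    with every lattice link rewired with probability p.

    Returns a dict mapping each node to the set of its neighbours.
    """
    if k % 2 or not 0 < k < n:
        raise ValueError("k must be even and satisfy 0 < k < n")
    rng = random.Random(seed)
    adj = {i: set() for i in range(n)}
    # Step 1: one-dimensional lattice with periodic boundary (a ring).
    for i in range(n):
        for step in range(1, k // 2 + 1):
            j = (i + step) % n
            adj[i].add(j)
            adj[j].add(i)
    # Step 2: replace some lattice links with links to random nodes.
    for step in range(1, k // 2 + 1):
        for i in range(n):
            j = (i + step) % n
            if rng.random() < p and j in adj[i]:
                choices = [v for v in range(n) if v != i and v not in adj[i]]
                if choices:
                    new = rng.choice(choices)
                    adj[i].discard(j)
                    adj[j].discard(i)
                    adj[i].add(new)
                    adj[new].add(i)
    return adj


lattice = watts_strogatz(20, 4, 0.0)           # p = 0: pure lattice
rewired = watts_strogatz(20, 4, 0.1, seed=1)   # a few random shortcuts
```

With p = 0 the graph is highly clustered but has long paths; a small p adds the shortcuts between clumps that shorten paths while most of the local clustering survives.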
Main insight -- small-world network:
● High clustering coefficient means clumps of nodes
● Low path length means connections between clumps
Notice the distinction between small-world ...
● effect -- small shortest path length
● network -- small shortest path length + high clustering coefficient
11
Clustering in Wikipedia
Clustering coefficient for Wikipedias vs. random graph.
Assumes undirected graph.
Coefficients are around 2x to 5x greater than random/expected.
Clustering coefficient trends larger in smaller Wikipedias, i.e., negative correlation with size.
12
Node Degree Distribution
Diagram above for English Wikipedia.
In-degree γ = 2.21; out-degree γ = 2.65.
Looking for power law distribution: $P(k) \sim k^{-\gamma}$
Diagram above from [5] is for WWW in 1999.
Interestingly, in-degree γ = 2.1; out-degree γ = 2.7.
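One common way to estimate an exponent like the γ values quoted above is a maximum-likelihood fit to the tail of the degree sequence, using the continuous approximation γ ≈ 1 + n / Σ ln(k_i / k_min). The sketch below assumes that estimator; the slides do not say how the paper's exponents were fitted, and `estimate_gamma` and `k_min` are hypothetical names.

```python
import math


def estimate_gamma(degrees, k_min=1):
    """Maximum-likelihood estimate of gamma in P(k) ~ k^(-gamma).

    Uses the continuous approximation
        gamma = 1 + n / sum(ln(k_i / k_min))
    over the n values with k_i >= k_min. For integer degrees this is
    only approximate, and the result depends on the chosen k_min.
    """
    tail = [k for k in degrees if k >= k_min]
    log_sum = sum(math.log(k / k_min) for k in tail)
    if log_sum == 0:
        raise ValueError("need at least one value above k_min")
    return 1.0 + len(tail) / log_sum


# Hypothetical in-degree sample, for illustration only.
sample = [1, 1, 1, 2, 1, 3, 1, 2, 8, 1, 1, 4, 2, 1, 15, 1, 2, 1, 6, 1]
print(estimate_gamma(sample, k_min=1))
```

Fitting a straight line to a log-log histogram is the other familiar approach, but it is sensitive to binning, which is why the likelihood-based estimate is often preferred.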
13
Triad Significance Profile (TSP)
TSP is a way to analyze the structure of 3-node subgraphs in a (directed) graph. Introduced by Milo, et al. in 2002 [6] (cited by 588 in Google Scholar), and expanded in 2004 [7].
For each subgraph i, the statistical significance is described by the Z score:
$Z_i = (N^{real}_i - \langle N^{rand}_i \rangle) / \mathrm{std}(N^{rand}_i)$
where $N^{real}_i$ is the number of times the subgraph appears in the network, and $\langle N^{rand}_i \rangle$ and $\mathrm{std}(N^{rand}_i)$ are the mean and standard deviation of its appearances in the randomized network ensemble.
For comparison between networks, this vector of Z scores may then be normalized to length 1.
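The Z score and the normalization can be sketched as below, assuming the subgraph counts are already available. The function name, the input layout, the use of the population standard deviation, and the choice to report Z = 0 when the random counts do not vary are all assumptions made for this illustration.

```python
from math import sqrt
from statistics import mean, pstdev


def significance_profile(n_real, n_rand):
    """Return (z, sp) for a list of subgraph types.

    n_real[i]    -- count of subgraph i in the real network
    n_rand[r][i] -- count of subgraph i in the r-th randomized network
    z[i]  = (n_real[i] - mean_r n_rand[r][i]) / std_r n_rand[r][i]
    sp[i] = z[i] / sqrt(sum_j z[j]^2), i.e. z normalized to length 1
    """
    z = []
    for i, real in enumerate(n_real):
        samples = [counts[i] for counts in n_rand]
        sd = pstdev(samples)
        # Assumption: a subgraph whose random count never varies gets Z = 0.
        z.append((real - mean(samples)) / sd if sd > 0 else 0.0)
    norm = sqrt(sum(v * v for v in z))
    sp = [v / norm for v in z] if norm > 0 else list(z)
    return z, sp


# Hypothetical counts for two subgraph types and two randomized networks.
z, sp = significance_profile([10, 2], [[4, 4], [6, 6]])
print(z, sp)
```

Normalizing removes the dependence of Z on network size, so profiles from networks of very different sizes can be compared.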
14
TSP Families
From [7].
Row 1: Gene/transcription network in bacteria. Peak on feed-forward triad (7), used in signal processing.
Row 2: Various biological information-processing networks.
Row 3: WWW and social networking. Peak on clique triad (13) for clustering.
Row 4: Word adjacency networks in text. Few triads due to ordering of word categories, like preposition->noun.
16
TSP and Correlation Matrix for Wikipedia
Strong resemblance to WWW in Milo, et al.
Differs from Milo, et al. in comparison to social networks.
17
Bow-tie Topology
Graph topology formed by links between nodes in the WWW, as described by Broder, et al. in [5]. Crawled 200M pages in 1999.
18
Bow-tie Components in WWW and Wikipedia
Wikipedia data from Capocci, et al. in [8]. Notice much larger strongly-connected component and much smaller tendrils.
Percentage of total   SCC    IN     OUT    TENDRIL   TUBE   DISC
Wikipedia             82.4   6.6    6.7    0.6       0.02   3.7
WWW                   27.7   21.3   21.2   21.5      0?     8.2
Remember the high incidence of clique triads in TSP, where there are directed links between all three nodes? This incidence matches well with the proportion of links in the SCC.
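Percentages like those in the table can be obtained by splitting the nodes of a directed graph around its largest strongly connected component. The sketch below is an illustration, not the procedure of Broder et al. or Capocci et al.: it finds the SCC by brute-force reachability (suitable only for small graphs) and lumps tendrils, tubes, and disconnected nodes into one OTHER set. All names are hypothetical.

```python
from collections import defaultdict


def reachable(adj, sources):
    """Every node reachable from `sources` by following links in `adj`."""
    seen = set(sources)
    stack = list(sources)
    while stack:
        u = stack.pop()
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


def bow_tie(edges):
    """Split the nodes of a directed graph into SCC, IN, OUT and OTHER."""
    fwd, rev, nodes = defaultdict(set), defaultdict(set), set()
    for u, v in edges:
        fwd[u].add(v)
        rev[v].add(u)
        nodes.update((u, v))
    # The strongly connected component of u is the set of nodes that u
    # can reach and that can also reach u. Keep the largest one.
    scc, assigned = set(), set()
    for u in nodes:
        if u in assigned:
            continue
        comp = reachable(fwd, [u]) & reachable(rev, [u])
        assigned |= comp
        if len(comp) > len(scc):
            scc = comp
    out_part = reachable(fwd, scc) - scc   # reached from the SCC
    in_part = reachable(rev, scc) - scc    # can reach the SCC
    other = nodes - scc - in_part - out_part
    return {"SCC": scc, "IN": in_part, "OUT": out_part, "OTHER": other}


# Hypothetical toy graph: a 3-cycle core with one IN node, one OUT node,
# a tendril hanging off IN, and a disconnected pair.
toy = [(1, 2), (2, 3), (3, 1), (0, 1), (3, 4), (0, 7), (5, 6)]
parts = bow_tie(toy)
print({name: len(members) / 8 for name, members in parts.items()})
```

For graphs of Wikipedia's size a linear-time SCC algorithm such as Tarjan's or Kosaraju's would replace the brute-force loop.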
19
Complex Network Review
Complex Network Characteristics
● Mean shortest path length -- low? -- Yes
● Clustering coefficient -- high? -- Yes; higher for smaller Wiki
● Node degree distribution -- power law? -- Yes
● Triad significance profile -- Same family as WWW
Topology
● Bow tie -- Much larger SCC than WWW
20
References
[1] http://stats.wikimedia.org/EN/PlotsPngArticlesTotal.htm
[2] http://stats.wikimedia.org/EN/PlotsPngDatabaseLinks.htm
[3] M. E. J. Newman. The structure and function of complex networks. http://www-personal.umich.edu/~mejn/courses/2004/cscs535/review.pdf
[4] D. J. Watts and S. H. Strogatz. Collective dynamics of 'small-world' networks. Nature 393(6684), 440-442, 1998.
[5] A. Z. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, S. Stata, A. Tomkins, and J. Weiner. Graph structure in the Web. Computer Networks 33, 309, 2000.
[6] R. Milo, S. Shen-Orr, S. Itzkovitz, N. Kashtan, D. Chklovskii, and U. Alon. Network Motifs: Simple Building Blocks of Complex Networks. Science 298, 824, 2002.
[7] R. Milo, S. Itzkovitz, N. Kashtan, R. Levitt, S. Shen-Orr, I. Ayzenshtat, M. Sheffer, and U. Alon. Superfamilies of Evolved and Designed Networks. Science 303, 1538, 2004.
[8] A. Capocci, V. D. P. Servedio, F. Colaiori, L. S. Buriol, D. Donato, S. Leonardi, and G. Caldarelli. Preferential attachment in the growth of social networks: The internet encyclopedia Wikipedia. Physical Review E 74, 036116, 2006.