network sampling, community detection

Network Sampling, Community Detection (Lecture 18)

Roberval Gomes Mariano (advisee), José Macedo (co-advisor)

http://bl.ocks.org/mbostock/7607999

The social network of friendships within a 34-person karate club

When people are influenced by the behaviors of their neighbors in the network (a kind of informational or social contagion).

The spread of an epidemic disease (such as tuberculosis)

FACEBOOK

AIDS Blog Network

Eric D. Kolaczyk

• Statistical Analysis of Network Data
• Lecture 1: Network Mapping & Characterization
• Dept of Mathematics and Statistics, Boston University
• [email protected]

Centrality measures

Closeness

Betweenness

Used to compute the importance of vertices within the graph

Eigenvector

Eigenvector centrality (e.g., Google's PageRank) measures the importance of a vertex in the network by treating each edge as a vote.
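As a rough illustration of two of the measures named above, the sketch below computes closeness centrality with breadth-first search and eigenvector centrality with power iteration. The toy graph and all function names are assumptions made for this example, not taken from the slides, and the iteration runs on A + I rather than A so that it also settles on bipartite graphs.

```python
from collections import deque

def bfs_distances(adj, source):
    """Return hop distances from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def closeness(adj, node):
    """(n - 1) divided by the sum of distances to all other nodes (connected graph assumed)."""
    dist = bfs_distances(adj, node)
    total = sum(d for v, d in dist.items() if v != node)
    return (len(adj) - 1) / total

def eigenvector_centrality(adj, iterations=200):
    """Power iteration on A + I: each node collects 'votes' from its neighbours."""
    x = {v: 1.0 for v in adj}
    for _ in range(iterations):
        nxt = {v: x[v] + sum(x[u] for u in adj[v]) for v in adj}
        norm = sum(val * val for val in nxt.values()) ** 0.5
        x = {v: val / norm for v, val in nxt.items()}
    return x

# Hypothetical toy graph: a triangle a-b-c with a pendant node d attached to c.
toy = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
closeness_scores = {v: closeness(toy, v) for v in toy}
eigen_scores = eigenvector_centrality(toy)
```

In this toy graph node c touches every other node, so both measures rank it first; betweenness is left out to keep the sketch short.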

Watts-Strogatz Model

How small is the world?
• How many contacts separate one person from another, anywhere in the world?
• 1990: John Guare and the play “Six Degrees of Separation”.
• Karinthy believed that the world was “shrinking”.
• 1960s: The Small World Experiment by Stanley Milgram and Jeffrey Travers.

Exponential Growth Model

• You: 1
• Your friends: 100
• The friends of your friends: 10,000
• In just 5 steps, 100^5 (10 billion) people would be reached, which is more than the world's population.
• This model ignores the probability that people have friends in common.
• It does not represent real-world social relationships well.
• Real world: many triangles.
• How is it possible, then, for social networks to have many triangles at the local scale and still form a small world?
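The arithmetic behind the exponential growth model can be checked in a few lines. The figure of 100 friends per person comes from the slides; the world population value is a rough assumption added only for the comparison.

```python
FRIENDS_PER_PERSON = 100          # figure used in the slides
WORLD_POPULATION = 8_000_000_000  # rough assumption, not from the slides

# People reached after 0, 1, ..., 5 steps if nobody shares any friends.
reached = [FRIENDS_PER_PERSON ** step for step in range(6)]

# First step at which the naive count exceeds the assumed world population.
steps_to_exceed = next(s for s, n in enumerate(reached) if n > WORLD_POPULATION)
```

The count overshoots precisely because the model treats every friendship as leading to a brand-new person, which is the assumption the triangle argument rejects.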

Watts-Strogatz Model
• Created by Duncan J. Watts and Steven Strogatz in 1998.
• Introduces the concept of the clustering coefficient.
• Can simulate both random networks and real-world social networks.

Clustering coefficient
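The formula on this slide did not survive extraction. A commonly used definition, which is the local clustering coefficient averaged over all nodes, is sketched here; the symbols $k_i$ (number of neighbours of node $i$) and $e_i$ (number of edges among those neighbours) are introduced only for this illustration.

```latex
C_i = \frac{2 e_i}{k_i (k_i - 1)}, \qquad
C = \frac{1}{N} \sum_{i=1}^{N} C_i
```

$C_i$ is the fraction of possible triangles around node $i$ that actually exist, so $C_i = 1$ when all of a node's neighbours know each other.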

Graph construction

• The Watts-Strogatz model uses the following parameters:
• N: number of nodes;
• K: degree of each node;
• P (0 <= P <= 1): rewiring probability that shapes the graph; the larger P, the greater the disorder;
• C: clustering coefficient;
• L: average distance between two nodes of the network.

Clustering coefficient

Study of the graph
• p = 0 (ordered): $L \propto n / k$, $C \approx 3/4$
• p = 1 (random): $L \propto \ln n / \ln k$, $C \approx k / n \to 0$

• What happens for intermediate values of P?
• Small P: L decreases considerably while C stays practically constant.
• Small P characterizes small-world networks.
• Necessary conditions: some source of order; a little randomness.
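The behaviour described above can be explored with a minimal simulation. This is a sketch under stated assumptions: the rewiring rule (move one end of each lattice edge to a uniformly chosen non-neighbour with probability p), the sizes n = 200 and k = 6, and the seed are choices made for this example rather than details from the slides.

```python
import random
from collections import deque

def watts_strogatz(n, k, p, rng):
    """Ring lattice (n nodes, k nearest neighbours each) with edges rewired with probability p."""
    adj = {v: set() for v in range(n)}
    for v in range(n):
        for offset in range(1, k // 2 + 1):
            u = (v + offset) % n
            adj[v].add(u)
            adj[u].add(v)
    for offset in range(1, k // 2 + 1):
        for v in range(n):
            u = (v + offset) % n
            if u not in adj[v] or rng.random() >= p:
                continue
            candidates = [w for w in range(n) if w != v and w not in adj[v]]
            if candidates:
                w = rng.choice(candidates)
                adj[v].remove(u)
                adj[u].remove(v)
                adj[v].add(w)
                adj[w].add(v)
    return adj

def average_clustering(adj):
    """Mean over nodes of the fraction of neighbour pairs that are themselves linked."""
    total = 0.0
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            continue
        links = sum(1 for a in nbrs for b in nbrs if a < b and b in adj[a])
        total += 2 * links / (k * (k - 1))
    return total / len(adj)

def average_path_length(adj):
    """Mean BFS distance over all ordered pairs of mutually reachable nodes."""
    total, pairs = 0, 0
    for s in adj:
        dist = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs

rng = random.Random(42)
results = {}
for p in (0.0, 0.05, 1.0):
    graph = watts_strogatz(200, 6, p, rng)
    results[p] = (average_path_length(graph), average_clustering(graph))  # (L, C)
```

Comparing the three entries of `results` shows the pattern from the slide: a small p already shortens L, while C only collapses once p approaches 1.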

Why is the real world small?
• “I know the Governor!” “And I know Xuxa, now ask whether they know who I am!”
• Order: highly clustered networks of local friends.
• Randomness: some contact that falls outside the local network.

Bacon Number
http://oracleofbacon.org/

Facebook
• Experiment with 5.8 million users.
• Average distance: 5.73.
• Maximum distance: 12.

Twitter
• Study of the relation in which one user follows another.
• 5.2 billion relations.
• Average distance: 4.67.

Class Size Paradox

• Simple example: each student takes one course; suppose there is one course with 100 students and fifty courses with 2 students.
– Dean calculates: (100+50*2)/51 = 3.92
– Students calculate: (100*100+100*2)/200 = 51
• Class Size Paradox in Networks
– Average number of friends of a person’s friends is greater than the average number of friends of a person!
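Both averages in the example, and the network version of the paradox, can be reproduced directly. The star graph and the function name below are hypothetical illustrations; "average number of friends of friends" is taken here to mean the average degree found at the end of a randomly chosen friendship, which is one common reading.

```python
class_sizes = [100] + [2] * 50   # one course with 100 students, fifty with 2

# The dean averages over courses; students average over their own experience.
dean_average = sum(class_sizes) / len(class_sizes)
student_average = sum(s * s for s in class_sizes) / sum(class_sizes)

def friendship_paradox(adj):
    """Return (mean degree, mean degree of a friend reached along a random edge)."""
    degrees = [len(nbrs) for nbrs in adj.values()]
    mean_degree = sum(degrees) / len(degrees)
    mean_friend_degree = sum(d * d for d in degrees) / sum(degrees)
    return mean_degree, mean_friend_degree

# Hypothetical star network: one hub with four friends who only know the hub.
star = {0: {1, 2, 3, 4}, 1: {0}, 2: {0}, 3: {0}, 4: {0}}
mean_degree, mean_friend_degree = friendship_paradox(star)
```

The same size-weighting drives both effects: large classes and high-degree people are counted once per member or per friendship, so they dominate the experienced average.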

Communities

Community Detection Algorithms

Metric called Modularity.

Community: It is formed by individuals such that those within a group interact with each other more frequently than with those outside the group

a.k.a. group, cluster, cohesive subgroup, module in different contexts

Community detection: discovering groups in a network where individuals’ group memberships are not explicitly given

Complete Mutuality: Cliques
• Clique: a maximum complete subgraph in which all nodes are adjacent to each other
• NP-hard to find the maximum clique in a network
• Straightforward implementation to find cliques is very expensive in time complexity

Nodes 5, 6, 7 and 8 form a clique

Finding the Maximum Clique

• In a clique of size k, each node maintains degree >= k-1
– Nodes with degree < k-1 will not be included in the maximum clique
• Recursively apply the following pruning procedure
– Sample a sub-network from the given network, and find a clique in the sub-network, say, by a greedy approach
– Suppose the clique above is of size k; in order to find a larger clique, all nodes with degree <= k-1 should be removed.
• Repeat until the network is small enough
• Many nodes will be pruned as social media networks follow a power law distribution for node degrees

Maximum Clique Example

• Suppose we sample a sub-network with nodes {1-9} and find a clique {1, 2, 3} of size 3
• In order to find a clique of size > 3, remove all nodes with degree <= 3-1 = 2
– Remove nodes 2 and 9
– Remove nodes 1 and 3
– Remove node 4
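The pruning rounds in this example can be traced in code. The figure with the 9-node network is not part of the extracted text, so the edge list below is an assumption, chosen to be consistent with the cliques and removals listed in the slides; the function names are likewise hypothetical.

```python
# Assumed edge list for the 9-node example network (the original figure is missing).
EDGES = [(1, 2), (1, 3), (1, 4), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6),
         (5, 7), (5, 8), (6, 7), (6, 8), (7, 8), (7, 9)]

def build_adjacency(edges):
    """Turn an undirected edge list into a dict of neighbour sets."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def prune_for_larger_clique(adj, k):
    """Repeatedly drop nodes with degree <= k - 1; return survivors and removal rounds."""
    adj = {v: set(nbrs) for v, nbrs in adj.items()}   # work on a copy
    rounds = []
    while True:
        doomed = sorted(v for v, nbrs in adj.items() if len(nbrs) <= k - 1)
        if not doomed:
            return adj, rounds
        rounds.append(doomed)
        for v in doomed:
            for u in adj.pop(v):
                if u in adj:
                    adj[u].discard(v)

survivors, rounds = prune_for_larger_clique(build_adjacency(EDGES), k=3)
```

Under the assumed edge list, `rounds` reproduces the three removal steps from the slide and the surviving nodes are the 4-clique mentioned earlier.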


Clique Percolation Method (CPM)
• Clique is a very strict definition, unstable
• Normally use cliques as a core or a seed to find larger communities
• CPM is such a method to find overlapping communities
– Input
• A parameter k, and a network
– Procedure
• Find out all cliques of size k in a given network
• Construct a clique graph. Two cliques are adjacent if they share k-1 nodes
• Each connected component in the clique graph forms a community

CPM Example
Cliques of size 3: {1, 2, 3}, {1, 3, 4}, {4, 5, 6}, {5, 6, 7}, {5, 6, 8}, {5, 7, 8}, {6, 7, 8}

Communities: {1, 2, 3, 4}, {4, 5, 6, 7, 8}
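A brute-force version of the CPM procedure is short enough to sketch. The edge list is an assumption (the figure is missing from the extracted text) picked so that its triangles match the cliques of size 3 listed above; enumerating all k-subsets only scales to toy graphs, and the function names are hypothetical.

```python
from itertools import combinations

# Assumed edge list for the 9-node example network (the original figure is missing).
EDGES = [(1, 2), (1, 3), (1, 4), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6),
         (5, 7), (5, 8), (6, 7), (6, 8), (7, 8), (7, 9)]

def build_adjacency(edges):
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def k_cliques(adj, k):
    """All node subsets of size k whose members are pairwise adjacent."""
    return [frozenset(c) for c in combinations(sorted(adj), k)
            if all(v in adj[u] for u, v in combinations(c, 2))]

def clique_percolation(adj, k):
    """Union the k-cliques that are linked through shared (k-1)-node overlaps."""
    cliques = k_cliques(adj, k)
    parent = list(range(len(cliques)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Two cliques are adjacent in the clique graph if they share k-1 nodes.
    for i, j in combinations(range(len(cliques)), 2):
        if len(cliques[i] & cliques[j]) >= k - 1:
            parent[find(i)] = find(j)

    # Each connected component of the clique graph becomes one community.
    groups = {}
    for i, clique in enumerate(cliques):
        groups.setdefault(find(i), set()).update(clique)
    return sorted(sorted(g) for g in groups.values())

adjacency = build_adjacency(EDGES)
triangles = k_cliques(adjacency, 3)
communities = clique_percolation(adjacency, 3)
```

Node 4 ends up in both communities, which is the overlap CPM is designed to allow.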


Reachability: k-clique, k-club
• Any node in a group should be reachable in k hops
• k-clique: a maximal subgraph in which the largest geodesic distance between any two nodes <= k
• k-club: a substructure of diameter <= k
• A k-clique might have diameter larger than k in the subgraph
– E.g. {1, 2, 3, 4, 5}
• Commonly used in traditional SNA
• Often involves combinatorial optimization

Cliques: {1, 2, 3}
2-cliques: {1, 2, 3, 4, 5}, {2, 3, 4, 5, 6}
2-clubs: {1, 2, 3, 4}, {1, 2, 3, 5}, {2, 3, 4, 5, 6}
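The difference between the two definitions is where distances are measured: in the whole network for a k-clique, inside the group only for a k-club. The sketch below checks just that distance condition (not maximality). The 6-node edge list is an assumption, since the figure is missing; it was chosen to agree with the groups listed above, and the function names are hypothetical.

```python
from collections import deque
from itertools import combinations

# Assumed edge list for the 6-node example network (the original figure is missing).
EDGES = [(1, 2), (1, 3), (2, 3), (2, 4), (3, 5), (4, 6), (5, 6)]

def build_adjacency(edges):
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def hop_distance(adj, source, target, allowed):
    """BFS distance from source to target using only nodes in `allowed`."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if u == target:
            return dist[u]
        for v in adj[u]:
            if v in allowed and v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return float("inf")

def max_distance(adj, group, allowed):
    """Largest pairwise distance between group members, routing through `allowed` nodes."""
    return max(hop_distance(adj, u, v, allowed)
               for u, v in combinations(sorted(group), 2))

def within_k_in_network(adj, group, k):
    """k-clique distance condition: paths may leave the group."""
    return max_distance(adj, group, set(adj)) <= k

def within_k_in_subgraph(adj, group, k):
    """k-club distance condition: paths must stay inside the group."""
    return max_distance(adj, group, set(group)) <= k

adjacency = build_adjacency(EDGES)
group = {1, 2, 3, 4, 5}
diameter_inside = max_distance(adjacency, group, group)
```

In the assumed graph, nodes 4 and 5 are two hops apart through node 6, but three hops apart once node 6 is excluded, which is why {1, 2, 3, 4, 5} passes the 2-clique condition and fails the 2-club one.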


Group-Centric Community Detection: Density-Based Groups

• The group-centric criterion requires the whole group to satisfy a certain condition
– E.g., the group density >= a given threshold
• A subgraph $G_s(V_s, E_s)$ is a $\gamma$-dense quasi-clique if

$$\frac{2|E_s|}{|V_s|(|V_s|-1)} \geq \gamma$$

where the denominator is the maximum number of degrees.
• A similar strategy to that of cliques can be used
– Sample a subgraph, and find a maximal quasi-clique (say, of size $|V_s|$)
– Remove nodes with degree less than the average degree
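A density check of this kind is easy to write down. The sketch assumes density is measured as 2|E_s| / (|V_s|(|V_s| - 1)), that is, existing internal edges over possible ones; the edge list stands in for the missing 9-node figure, and the threshold values are arbitrary examples.

```python
# Assumed edge list for the 9-node example network (the original figure is missing).
EDGES = [(1, 2), (1, 3), (1, 4), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6),
         (5, 7), (5, 8), (6, 7), (6, 8), (7, 8), (7, 9)]

def build_adjacency(edges):
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def density(adj, group):
    """Internal edges divided by the number of possible edges (group size >= 2)."""
    group = set(group)
    internal = sum(1 for u in group for v in adj[u] if v in group) // 2
    possible = len(group) * (len(group) - 1) // 2
    return internal / possible

def is_quasi_clique(adj, group, gamma):
    """True when the group's density reaches the threshold gamma."""
    return density(adj, group) >= gamma

adjacency = build_adjacency(EDGES)
group_density = density(adjacency, {4, 5, 6, 7, 8})
clique_density = density(adjacency, {5, 6, 7, 8})
```

A true clique has density 1, so lowering gamma below 1 is what admits groups such as {4, 5, 6, 7, 8} that miss only a few internal edges.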


Network-Centric Community Detection

• Network-centric criterion needs to consider the connections within a network globally
• Goal: partition nodes of a network into disjoint sets
• Approaches:
– (1) Clustering based on vertex similarity
– (2) Latent space models (multi-dimensional scaling)
– (3) Block model approximation
– (4) Spectral clustering
– (5) Modularity maximization

Clustering based on Vertex Similarity

• Apply k-means or similarity-based clustering to nodes
• Vertex similarity is defined in terms of the similarity of their neighborhood
• Structural equivalence: two nodes are structurally equivalent iff they are connected to the same set of actors
• Structural equivalence is too strict for practical use.

Nodes 1 and 3 are structurally equivalent; so are nodes 5 and 6.

(1) Clustering based on vertex similarity

Vertex Similarity

• Jaccard Similarity

• Cosine similarity
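The formulas for these two measures are missing from the extracted slide. The sketch below uses the usual neighbourhood-based definitions, Jaccard as |N_i ∩ N_j| / |N_i ∪ N_j| and cosine as |N_i ∩ N_j| / sqrt(|N_i| |N_j|), and adds a structural-equivalence check that ignores the two nodes themselves. The edge list is an assumption standing in for the missing 9-node figure.

```python
import math

# Assumed edge list for the 9-node example network (the original figure is missing).
EDGES = [(1, 2), (1, 3), (1, 4), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6),
         (5, 7), (5, 8), (6, 7), (6, 8), (7, 8), (7, 9)]

def build_adjacency(edges):
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def jaccard(adj, u, v):
    """Shared neighbours divided by the size of the combined neighbourhood."""
    return len(adj[u] & adj[v]) / len(adj[u] | adj[v])

def cosine(adj, u, v):
    """Shared neighbours divided by the geometric mean of the two degrees."""
    return len(adj[u] & adj[v]) / math.sqrt(len(adj[u]) * len(adj[v]))

def structurally_equivalent(adj, u, v):
    """Same neighbours once the two nodes themselves are set aside."""
    return adj[u] - {v} == adj[v] - {u}

adjacency = build_adjacency(EDGES)
jaccard_4_6 = jaccard(adjacency, 4, 6)
cosine_4_6 = cosine(adjacency, 4, 6)
```

Unlike structural equivalence, both scores are graded, so they can feed k-means or any other similarity-based clustering.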


Cut

• Most interactions are within group whereas interactions between groups are few
• Community detection can therefore be cast as a minimum cut problem
• Cut: A partition of vertices of a graph into two disjoint sets
• Minimum cut problem: find a graph partition such that the number of edges between the two sets is minimized

(4) Spectral clustering

Ratio Cut & Normalized Cut

• Minimum cut often returns an imbalanced partition, with one set being a singleton, e.g. node 9
• Change the objective function to consider community size

$C_i$: a community
$|C_i|$: number of nodes in $C_i$
$vol(C_i)$: sum of degrees in $C_i$


Ratio Cut & Normalized Cut Example

For partition in red:

For partition in green:

Both ratio cut and normalized cut prefer a balanced partition
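The worked values for the two partitions are missing from the extracted slide, so the sketch below recomputes both objectives under stated assumptions: each objective is the average over the k communities of cut(C_i) / |C_i| (ratio cut) or cut(C_i) / vol(C_i) (normalized cut), the edge list stands in for the missing 9-node figure, and the two partitions are hypothetical choices (one isolating node 9, one splitting the graph roughly in half).

```python
# Assumed edge list for the 9-node example network (the original figure is missing).
EDGES = [(1, 2), (1, 3), (1, 4), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6),
         (5, 7), (5, 8), (6, 7), (6, 8), (7, 8), (7, 9)]

def build_adjacency(edges):
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def cut_size(adj, part):
    """Number of edges with exactly one endpoint inside `part`."""
    return sum(1 for u in part for v in adj[u] if v not in part)

def volume(adj, part):
    """Sum of the degrees of the nodes in `part`."""
    return sum(len(adj[u]) for u in part)

def ratio_cut(adj, partition):
    return sum(cut_size(adj, c) / len(c) for c in partition) / len(partition)

def normalized_cut(adj, partition):
    return sum(cut_size(adj, c) / volume(adj, c) for c in partition) / len(partition)

adjacency = build_adjacency(EDGES)
singleton = [{9}, {1, 2, 3, 4, 5, 6, 7, 8}]     # the kind of split minimum cut returns
balanced = [{1, 2, 3, 4}, {5, 6, 7, 8, 9}]      # cuts two edges instead of one

scores = {
    "singleton": (ratio_cut(adjacency, singleton), normalized_cut(adjacency, singleton)),
    "balanced": (ratio_cut(adjacency, balanced), normalized_cut(adjacency, balanced)),
}
```

Although the balanced split cuts more edges, dividing by community size or volume makes it score lower (better) on both objectives.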


Girvan-Newman Algorithm

Bickel-Chen on Community Detection
Inconsistency result for Newman-Girvan.

http://www.stat.berkeley.edu/~bickel/Bickel%20Chen%2021068.full.pdf

Agglomerative Hierarchical Clustering

• Initialize each node as a community
• Merge communities successively into larger communities following a certain criterion
– E.g., based on modularity increase

Dendrogram according to Agglomerative Clustering based on Modularity
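One way to make the merge criterion concrete is a naive greedy loop that, at every step, merges the pair of communities giving the highest modularity. This is a sketch under assumptions, not the method behind the dendrogram: it uses the standard Newman-Girvan modularity, recomputes it from scratch for every candidate merge (fine only for toy graphs), and runs on an edge list assumed in place of the missing 9-node figure.

```python
from itertools import combinations

# Assumed edge list for the 9-node example network (the original figure is missing).
EDGES = [(1, 2), (1, 3), (1, 4), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6),
         (5, 7), (5, 8), (6, 7), (6, 8), (7, 8), (7, 9)]

def build_adjacency(edges):
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def modularity(adj, communities):
    """Q = sum over communities of (internal edge share) - (degree share)^2."""
    two_m = sum(len(nbrs) for nbrs in adj.values())
    q = 0.0
    for community in communities:
        # Each internal edge is seen from both endpoints, so this equals 2 * e_c.
        internal = sum(1 for u in community for v in adj[u] if v in community)
        vol = sum(len(adj[u]) for u in community)
        q += internal / two_m - (vol / two_m) ** 2
    return q

def greedy_agglomerative(adj):
    """Start from singletons, always apply the best merge, remember the best partition seen."""
    communities = [frozenset([v]) for v in sorted(adj)]
    best_score, best_partition = modularity(adj, communities), communities
    while len(communities) > 1:
        candidates = []
        for a, b in combinations(communities, 2):
            merged = [c for c in communities if c not in (a, b)] + [a | b]
            candidates.append((modularity(adj, merged), merged))
        score, communities = max(candidates, key=lambda item: item[0])
        if score > best_score:
            best_score, best_partition = score, communities
    return best_score, best_partition

adjacency = build_adjacency(EDGES)
two_groups = [frozenset({1, 2, 3, 4}), frozenset({5, 6, 7, 8, 9})]
two_group_score = modularity(adjacency, two_groups)
best_score, best_partition = greedy_agglomerative(adjacency)
```

The sequence of merges is what a dendrogram records; cutting it at the level with the highest modularity yields `best_partition`.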

Sampling

• Types of Sampling
– Simple Random Sampling
– Stratified Sampling
– Cluster Sampling
• Sample Size
• Incorporating Sample Design
• Design vs. Model-Based Sampling
– Statisticians tend to favor the design-based point of view because it makes no assumptions about the mechanism that generates the data in the survey, as we explain below
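The three sampling types listed above differ mainly in what gets randomized. A minimal sketch, with a made-up population of 100 units, two hypothetical strata and ten hypothetical clusters, might look like this; all names and sizes are assumptions for illustration.

```python
import random

def simple_random_sample(population, n, rng):
    """Every unit has the same chance of selection."""
    return rng.sample(population, n)

def stratified_sample(strata, fraction, rng):
    """Draw the same fraction independently inside every stratum."""
    return {name: rng.sample(units, max(1, round(fraction * len(units))))
            for name, units in strata.items()}

def cluster_sample(clusters, n_clusters, rng):
    """Pick whole clusters at random and keep every unit inside them."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [unit for name in chosen for unit in clusters[name]]

rng = random.Random(0)
population = list(range(100))
strata = {"urban": list(range(0, 60)), "rural": list(range(60, 100))}
clusters = {f"village_{i}": list(range(i * 10, i * 10 + 10)) for i in range(10)}

srs = simple_random_sample(population, 10, rng)
stratified = stratified_sample(strata, 0.1, rng)
clustered = cluster_sample(clusters, 3, rng)
```

Stratification guarantees that each stratum is represented in proportion, while cluster sampling trades some precision for the convenience of visiting only a few groups.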

Respondent Driven Sampling (RDS)

• Sampling scheme for hard-to-reach populations, based on link-tracing across a social network with coupon incentives
• Becoming very widely used all over the world; hundreds of studies done or ongoing, e.g., CDC National HIV Behavioral Surveillance (NHBS) studies of injection drug users

• RDS as sampling vs. RDS estimation
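The sampling side of RDS can be pictured as a coupon-limited link-tracing process. The simulation below is a simplified sketch: it assumes every coupon is accepted, recruits are drawn uniformly from a recruiter's not-yet-sampled contacts, and the network is a small hypothetical edge list. Real RDS studies involve refusals and non-uniform recruitment, which is part of what makes estimation hard.

```python
import random
from collections import deque

# Hypothetical contact network used only for this illustration.
EDGES = [(1, 2), (1, 3), (1, 4), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6),
         (5, 7), (5, 8), (6, 7), (6, 8), (7, 8), (7, 9)]

def build_adjacency(edges):
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def respondent_driven_sample(adj, seeds, coupons, target_size, rng):
    """Each respondent passes up to `coupons` coupons to unsampled contacts."""
    sampled = list(seeds)
    seen = set(seeds)
    recruiter_of = {s: None for s in seeds}
    queue = deque(seeds)
    while queue and len(sampled) < target_size:
        recruiter = queue.popleft()
        eligible = sorted(v for v in adj[recruiter] if v not in seen)
        rng.shuffle(eligible)
        for recruit in eligible[:coupons]:
            if len(sampled) >= target_size:
                break
            seen.add(recruit)
            sampled.append(recruit)
            recruiter_of[recruit] = recruiter
            queue.append(recruit)
    return sampled, recruiter_of

adjacency = build_adjacency(EDGES)
sampled, recruiter_of = respondent_driven_sample(
    adjacency, seeds=[1], coupons=2, target_size=6, rng=random.Random(1))
```

The `recruiter_of` map is the recruitment tree; who ends up in the sample depends on the network and the seeds, not on a known sampling frame.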

To Model or Not To Model; Design-based vs. model-based
• Model the underlying network?
• What about unknown nodes?
• The recruitment process?
• Coupon refusal?
• The outcome variables (such as HIV status)?

References
• Modelo de Mundo Pequeno. Arthur Freitas Ramos & Hugo Neiva de Melo. (000 - mundo-pequeno.pptx)
• Collective dynamics of ‘small-world’ networks. Duncan J. Watts & Steven H. Strogatz. (001 - w_s_NATURE_0.pdf)
• Statistical Analysis of Network Data, Lecture 1: Network Mapping & Characterization. Eric D. Kolaczyk. (003 - kolaczyk_yes13_1.pdf)
• Tutorial: Statistical Analysis of Network Data. Eric D. Kolaczyk. (004 - Kolaczyk-CN.pdf)
• The Watts–Strogatz network model developed by including degree distribution: theory and computer simulation. Y. W. Chen, L. F. Zhang and J. P. Huang. (005 - JPA-1.pdf)
• Where Online Friends Meet: Social Communities in Location-based Networks. Chloë Brown, Vincenzo Nicosia, Salvatore Scellato, Anastasios Noulas, Cecilia Mascolo. (006 - icwsm12_brown.pdf)
• Mining of Massive Datasets. Anand Rajaraman, Jure Leskovec. (001 – book.pdf)
• Machine Learning for Hackers. Drew Conway and John Myles White. (002 - Machine_Learning_for_Hackers.pdf)
• pandas: powerful Python data analysis toolkit, Release 0.13.1. Wes McKinney & PyData Development Team. (003 – pandas.pdf)
• Provenance Data in Social Media. Geoffrey Barbier, Zhuo Feng, Pritam Gundecha, Huan Liu. Synthesis Lectures on Data Mining and Knowledge Discovery #7, Morgan & Claypool Publishers. (004 - Provenance-Data-in-Social-Media(2).pdf)
• Python for Data Analysis. Wes McKinney. (005 - Python_for_Data_Analysis.pdf)
• Living Standards Analytics. Dominique Haughton, Jonathan Haughton. (006 - bok%3A978-1-4614-0385-2.pdf)
• A nonparametric view of network models and Newman–Girvan and other modularities. Peter J. Bickel and Aiyou Chen. (007 - Bickel%20Chen%2021068.full 2.pdf)
• Collective dynamics of ‘small-world’ networks. Duncan J. Watts & Steven H. Strogatz. (008 - 393440a0.pdf)
• Communities in Networks. Mason A. Porter, Jukka-Pekka Onnela, and Peter J. Mucha. (009 - 0902.3788v2 a.pdf)
• Managing and Mining Graph Data. Haixun Wang, Charu C. Aggarwal. Beijing, China. Springer. (007 - bok%3A978-1-4419-6045-0.pdf)
• Social Network Data Analytics. Charu C. Aggarwal. Springer. (008 - bok%3A978-1-4419-8462-3.pdf)
• RDS Analysis Tools 7.2 RDSAT 7.1 User Manual. Springer. (009 - RDSAT_7.1-Manual_2012-11-25.pdf)
• Statistical Analysis of Network Data: Methods and Models. Eric D. Kolaczyk. Springer. (010 - bok_978-0-387-88146-1)
• Networks, Crowds, and Markets: Reasoning about a Highly Connected World. David Easley & Jon Kleinberg. (011 - networks-book.pdf)
