network sampling, community detection

Network Sampling, Community Detection (Lecture 18). Roberval Gomes Mariano (advisee), José Macedo (co-advisor)

Upload: roberval-mariano

Post on 24-May-2015


DESCRIPTION

Presents examples of networks and covers approaches to community detection, as well as the problem of hidden communities in samples and related topics.

TRANSCRIPT

Page 1: Network sampling, community detection

Network Sampling, Community Detection (Lecture 18)

Roberval Gomes Mariano (advisee), José Macedo (co-advisor)

Page 2: Network sampling, community detection
Page 3: Network sampling, community detection
Page 4: Network sampling, community detection
Page 5: Network sampling, community detection
Page 6: Network sampling, community detection
Page 7: Network sampling, community detection
Page 8: Network sampling, community detection
Page 9: Network sampling, community detection
Page 11: Network sampling, community detection

The social network of friendships within a 34-person karate club

People influenced by the behaviors of their neighbors in the network (a kind of informational or social contagion)

The spread of an epidemic disease (such as tuberculosis)

Page 12: Network sampling, community detection

FACEBOOK

Page 13: Network sampling, community detection

AIDS Blog Network

Page 14: Network sampling, community detection

Eric D. Kolaczyk

• Statistical Analysis of Network Data
• Lecture 1: Network Mapping & Characterization
• Dept. of Mathematics and Statistics, Boston University

Page 15: Network sampling, community detection

Centrality measures

Closeness

Betweenness

Used to compute the importance of vertices within the graph

Eigenvector

Eigenvector centrality (e.g., Google's PageRank) measures the importance of a vertex in the network by treating each edge as a vote.
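A minimal sketch of two of the measures named on this slide, in plain Python. The function names and the 5-node example graph are hypothetical, closeness is taken as (n - 1) divided by the sum of shortest-path distances on a connected graph, and the eigenvector score is approximated by power iteration on A + I (an implementation choice assumed here, not something the slide specifies). Betweenness is left out to keep the sketch short.

```python
from collections import deque


def bfs_distances(adj, source):
    """Shortest-path (hop) distances from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def closeness_centrality(adj):
    """(n - 1) divided by the sum of distances to all other nodes.

    Assumes the graph is connected.
    """
    n = len(adj)
    return {v: (n - 1) / sum(bfs_distances(adj, v).values()) for v in adj}


def eigenvector_centrality(adj, iterations=200):
    """Power iteration on A + I, rescaled so the largest score is 1.

    Adding the identity keeps the same eigenvectors but avoids the
    oscillation that plain power iteration shows on bipartite graphs.
    """
    score = {v: 1.0 for v in adj}
    for _ in range(iterations):
        new = {v: score[v] + sum(score[u] for u in adj[v]) for v in adj}
        top = max(new.values())
        score = {v: s / top for v, s in new.items()}
    return score


# Hypothetical 5-node graph: a triangle (1, 2, 3) with two pendant nodes.
example = {0: {1}, 1: {0, 2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
closeness = closeness_centrality(example)
eigen = eigenvector_centrality(example)
for node in sorted(example):
    print(node, round(closeness[node], 3), round(eigen[node], 3))
```

Each neighbor's score is added to a node's own score on every pass, which is the "each edge is a vote" idea: votes from important vertices count for more.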

Page 16: Network sampling, community detection

Watts-Strogatz Model

Page 17: Network sampling, community detection

How small is the world?
• How many contacts separate one person from another, anywhere in the world?
• 1990: John Guare and the play "Six Degrees of Separation".
• Karinthy believed the world was "shrinking".
• 1960s: The Small World Experiment, by Stanley Milgram and Jeffrey Travers.

Page 18: Network sampling, community detection

Exponential Growth Model

• You: 1
• Your friends: 100
• Your friends' friends: 10,000

Page 19: Network sampling, community detection

Exponential Growth Model

• In just 5 steps, 100^5 (10 billion) people would be reached, which is more than the world population.

• This model ignores the likelihood that people have friends in common.

• It does not represent social relationships in the real world.

• Real world: many triangles.

• How, then, can social networks have many triangles at the local scale and still form a small world?
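The arithmetic behind the exponential growth model can be checked in a few lines. This is only an illustration: the branching factor of 100 comes from the slide, while the world population figure is an assumed round number and the function name is hypothetical.

```python
FRIENDS_PER_PERSON = 100          # from the slide
WORLD_POPULATION = 8_000_000_000  # assumption: rough round figure


def reached_at_step(step, branching=FRIENDS_PER_PERSON):
    """People reached at exactly `step` hops if no two people share a friend."""
    return branching ** step


for step in range(6):
    reached = reached_at_step(step)
    note = " (exceeds world population)" if reached > WORLD_POPULATION else ""
    print(f"step {step}: {reached:,}{note}")
```

The count only grows this fast because the model assumes no overlap between friend circles, which is exactly the assumption that triangles in real networks violate.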

Page 20: Network sampling, community detection

Watts-Strogatz Model
• Created by Duncan J. Watts and Steven Strogatz in 1998.

• Introduces the concept of the clustering coefficient.

• Can simulate both random networks and real-world social networks.

Page 21: Network sampling, community detection

Clustering coefficient

Page 22: Network sampling, community detection

Graph construction

• The Watts-Strogatz model uses the following parameters (N, K, P) and measured quantities (C, L):

• N: number of nodes;

• K: degree of each node;

• P (0 <= P <= 1): coefficient that shapes the graph; the larger P, the greater the disorder;

• C: clustering coefficient;

• L: average distance between two nodes in the network.
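The construction can be sketched as follows, assuming the usual procedure from the Watts-Strogatz paper: start from a ring lattice in which every node is linked to its K nearest neighbors, then rewire each lattice edge with probability P to a uniformly chosen new endpoint. Function names are hypothetical, K is assumed even, and N is assumed larger than K.

```python
import random


def ring_lattice(n, k):
    """Ring of n nodes, each linked to its k nearest neighbors (k/2 per side)."""
    adj = {v: set() for v in range(n)}
    for v in range(n):
        for offset in range(1, k // 2 + 1):
            u = (v + offset) % n
            adj[v].add(u)
            adj[u].add(v)
    return adj


def watts_strogatz(n, k, p, seed=None):
    """Rewire each lattice edge (v, v + offset) with probability p."""
    rng = random.Random(seed)
    adj = ring_lattice(n, k)
    for offset in range(1, k // 2 + 1):
        for v in range(n):
            u = (v + offset) % n
            if rng.random() < p:
                # New endpoint: no self-loop and no duplicate edge.
                candidates = [w for w in range(n) if w != v and w not in adj[v]]
                if candidates:
                    w = rng.choice(candidates)
                    adj[v].discard(u)
                    adj[u].discard(v)
                    adj[v].add(w)
                    adj[w].add(v)
    return adj


ordered = watts_strogatz(20, 4, 0.0, seed=1)      # p = 0: pure ring lattice
small_world = watts_strogatz(20, 4, 0.1, seed=1)  # a few random shortcuts
print(sum(len(nbrs) for nbrs in small_world.values()) // 2, "edges")
```

Rewiring moves one endpoint of an edge rather than adding edges, so the number of edges (N * K / 2) stays the same for every value of P.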

Page 23: Network sampling, community detection

Clustering coefficient

Page 24: Network sampling, community detection

Study of the graph

p = 0 (ordered): $L \propto n / k$, $C \approx 3/4$
p = 1 (random): $L \propto \ln n / \ln k$, $C \approx k / n \to 0$

• What happens for intermediate values of P?

• Small P: L drops considerably while C stays practically constant.
• Small P characterizes small-world networks.
• Necessary conditions: some source of order; a little randomness.
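A rough sketch of how C and L can be measured for the ordered case (p = 0). It assumes C is the average of the local clustering coefficients and L is the mean shortest-path length over all connected pairs; the function names are hypothetical. For a ring lattice the clustering coefficient is 3(k - 2) / (4(k - 1)), which tends to the 3/4 quoted above as k grows.

```python
from collections import deque


def ring_lattice(n, k):
    """Ring of n nodes, each linked to its k nearest neighbors (k/2 per side)."""
    adj = {v: set() for v in range(n)}
    for v in range(n):
        for offset in range(1, k // 2 + 1):
            u = (v + offset) % n
            adj[v].add(u)
            adj[u].add(v)
    return adj


def clustering_coefficient(adj):
    """Average over nodes of (links among neighbors) / (possible links)."""
    total = 0.0
    for nbrs in adj.values():
        d = len(nbrs)
        if d < 2:
            continue  # local coefficient taken as 0 for degree < 2
        links = sum(1 for a in nbrs for b in nbrs if a < b and b in adj[a])
        total += links / (d * (d - 1) / 2)
    return total / len(adj)


def average_path_length(adj):
    """Mean hop distance over all ordered pairs of mutually reachable nodes."""
    total, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs


lattice = ring_lattice(20, 4)
print("C =", clustering_coefficient(lattice))
print("L =", average_path_length(lattice))
```

Running the same two functions on rewired graphs for increasing p is one way to see L fall quickly while C stays nearly unchanged.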

Page 25: Network sampling, community detection

Why is the real world small?
• "I know the Governor!" "And I know Xuxa. Now ask whether he knows me!"
• Order: highly clustered networks of local friends.
• Randomness: some contact that reaches outside the local network.

Bacon number: http://oracleofbacon.org/

Facebook
• Experiment with 5.8 million users.
• Average distance: 5.73.
• Maximum distance: 12.

Twitter
• Study of the "follows" relation between users.
• 5.2 billion relationships.
• Average distance: 4.67.

Page 26: Network sampling, community detection

Class Size Paradox

• Simple example: each student takes one course; suppose there is one course with 100 students and fifty courses with 2 students each.
– Dean calculates: (100 + 50*2)/51 = 3.92
– Students calculate: (100*100 + 100*2)/200 = 51

• Class Size Paradox in Networks
– The average number of friends of a person's friends is greater than the average number of friends of a person (whenever degrees are not all equal).
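The two averages above can be reproduced directly. The sketch below is illustrative: the function names are hypothetical, and the network version uses an assumed star graph, with "average number of friends of a friend" taken as the average degree of the node found at a random end of a random edge.

```python
def average_size(sizes):
    """Average over groups: what the dean computes."""
    return sum(sizes) / len(sizes)


def experienced_size(sizes):
    """Average over members: a group of size s is counted s times."""
    return sum(s * s for s in sizes) / sum(sizes)


# One course with 100 students, fifty courses with 2 students each.
class_sizes = [100] + [2] * 50
print("dean:", round(average_size(class_sizes), 2))
print("students:", experienced_size(class_sizes))

# Same arithmetic on node degrees of a hypothetical star network:
# one hub with 4 friends, four leaves with 1 friend each.
star_degrees = [4, 1, 1, 1, 1]
print("mean degree:", average_size(star_degrees))
print("mean degree of a friend:", experienced_size(star_degrees))
```

Both paradoxes come from the same size-biased sampling: large classes and high-degree nodes are counted once per member or per friendship, not once overall.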

Page 27: Network sampling, community detection

Communities

Page 28: Network sampling, community detection

Community Detection Algorithms

Metric called Modularity.

Community: It is formed by individuals such that those within a group interact with each other more frequently than with those outside the group

a.k.a. group, cluster, cohesive subgroup, module in different contexts

Community detection: discovering groups in a network where individuals’ group memberships are not explicitly given
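The slide names modularity without defining it. A minimal sketch, assuming Newman's standard definition: for each community, take the fraction of edges that fall inside it minus the fraction expected if edges were placed at random with the same degrees, i.e. Q = sum over communities of L_c/m - (d_c/2m)^2, where m is the number of edges, L_c the number of internal edges and d_c the total degree of the community. The 9-node edge list is an assumption, reconstructed to be consistent with the clique and CPM examples given later in these slides (the original figure is not in the transcript).

```python
# Assumed toy network (reconstructed, not taken from the transcript).
EDGES = [(1, 2), (1, 3), (1, 4), (2, 3), (3, 4), (4, 5), (4, 6),
         (5, 6), (5, 7), (5, 8), (6, 7), (6, 8), (7, 8), (7, 9)]


def build_adjacency(edges):
    """Undirected adjacency sets from an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def modularity(adj, communities):
    """Q = sum_c [ L_c / m - (d_c / 2m)^2 ] for a partition of the nodes."""
    m = sum(len(nbrs) for nbrs in adj.values()) / 2
    q = 0.0
    for community in communities:
        members = set(community)
        # Each internal edge is seen from both endpoints, hence the / 2.
        internal = sum(1 for u in members for v in adj[u] if v in members) / 2
        degree_sum = sum(len(adj[u]) for u in members)
        q += internal / m - (degree_sum / (2 * m)) ** 2
    return q


adj = build_adjacency(EDGES)
print(modularity(adj, [{1, 2, 3, 4}, {5, 6, 7, 8, 9}]))
print(modularity(adj, [set(adj)]))  # a single community always gives Q = 0
```

A higher Q means more edges fall within groups than a degree-preserving random placement would produce.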

Page 29: Network sampling, community detection

Complete Mutuality: Cliques
• Clique: a maximal complete subgraph, i.e., one in which all nodes are adjacent to each other
• NP-hard to find the maximum clique in a network
• A straightforward implementation to find cliques is very expensive in time complexity

Nodes 5, 6, 7 and 8 form a clique

Page 30: Network sampling, community detection

Finding the Maximum Clique

• In a clique of size k, each node maintains degree >= k-1
– Nodes with degree < k-1 will not be included in the maximum clique

• Recursively apply the following pruning procedure
– Sample a sub-network from the given network, and find a clique in the sub-network, say, by a greedy approach
– Suppose the clique above is of size k; in order to find a larger clique, all nodes with degree <= k-1 should be removed

• Repeat until the network is small enough
• Many nodes will be pruned, as social media networks follow a power law distribution for node degrees

Page 31: Network sampling, community detection

Maximum Clique Example

• Suppose we sample a sub-network with nodes {1-9} and find a clique {1, 2, 3} of size 3

• In order to find a clique of size > 3, remove all nodes with degree <= 3-1 = 2
– Remove nodes 2 and 9
– Remove nodes 1 and 3
– Remove node 4
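The pruning rounds in this example can be reproduced with a short sketch. The edge list is an assumption: the figure is not part of the transcript, so the network below was reconstructed to be consistent with the examples stated in the slides (the clique {5, 6, 7, 8}, the removal order above, and the size-3 cliques listed for CPM). Function names are hypothetical.

```python
# Assumed toy network (reconstructed, not taken from the transcript).
EDGES = [(1, 2), (1, 3), (1, 4), (2, 3), (3, 4), (4, 5), (4, 6),
         (5, 6), (5, 7), (5, 8), (6, 7), (6, 8), (7, 8), (7, 9)]


def build_adjacency(edges):
    """Undirected adjacency sets from an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def prune_for_larger_clique(adj, k):
    """Repeatedly drop nodes with degree <= k - 1.

    Such nodes cannot belong to a clique larger than k. Returns the list
    of nodes removed in each round and the nodes that survive.
    """
    adj = {u: set(nbrs) for u, nbrs in adj.items()}  # work on a copy
    rounds = []
    while True:
        doomed = sorted(u for u, nbrs in adj.items() if len(nbrs) <= k - 1)
        if not doomed:
            break
        for u in doomed:
            for v in adj.pop(u):
                if v in adj:
                    adj[v].discard(u)
        rounds.append(doomed)
    return rounds, sorted(adj)


rounds, remaining = prune_for_larger_clique(build_adjacency(EDGES), k=3)
print("removed per round:", rounds)
print("remaining:", remaining)
```

Each round lowers the degrees of the surviving nodes, which is why removals cascade until only the nodes that could still form a larger clique are left.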

Page 32: Network sampling, community detection

Clique Percolation Method (CPM)
• Clique is a very strict definition, unstable
• Normally use cliques as a core or a seed to find larger communities

• CPM is such a method to find overlapping communities
– Input
• A parameter k, and a network
– Procedure
• Find out all cliques of size k in a given network
• Construct a clique graph. Two cliques are adjacent if they share k-1 nodes
• Each connected component in the clique graph forms a community

Page 33: Network sampling, community detection

CPM Example

Cliques of size 3: {1, 2, 3}, {1, 3, 4}, {4, 5, 6}, {5, 6, 7}, {5, 6, 8}, {5, 7, 8}, {6, 7, 8}

Communities: {1, 2, 3, 4}, {4, 5, 6, 7, 8}
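A brute-force sketch of the CPM procedure for small networks. The edge list is an assumption, reconstructed so that its size-3 cliques match the ones listed in this example (the figure itself is not in the transcript). Cliques are enumerated by testing every k-subset of nodes, which is only practical for toy graphs, and the function names are hypothetical.

```python
from itertools import combinations

# Assumed toy network (reconstructed, not taken from the transcript).
EDGES = [(1, 2), (1, 3), (1, 4), (2, 3), (3, 4), (4, 5), (4, 6),
         (5, 6), (5, 7), (5, 8), (6, 7), (6, 8), (7, 8), (7, 9)]


def build_adjacency(edges):
    """Undirected adjacency sets from an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def k_cliques(adj, k):
    """All node sets of size k in which every pair is connected."""
    return [frozenset(c) for c in combinations(sorted(adj), k)
            if all(v in adj[u] for u, v in combinations(c, 2))]


def clique_percolation(adj, k):
    """Communities = connected components of the clique graph."""
    cliques = k_cliques(adj, k)
    parent = list(range(len(cliques)))  # union-find over clique indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(cliques)), 2):
        if len(cliques[i] & cliques[j]) >= k - 1:  # adjacent cliques
            parent[find(i)] = find(j)

    groups = {}
    for i, clique in enumerate(cliques):
        groups.setdefault(find(i), set()).update(clique)
    return sorted(sorted(group) for group in groups.values())


adj = build_adjacency(EDGES)
print(sorted(sorted(c) for c in k_cliques(adj, 3)))
print(clique_percolation(adj, 3))
```

Node 4 ends up in both communities, which is the overlapping behavior CPM is designed to allow; node 9 belongs to no 3-clique and is left unassigned.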

Page 34: Network sampling, community detection

Reachability: k-clique, k-club
• Any node in a group should be reachable in k hops
• k-clique: a maximal subgraph in which the largest geodesic distance between any two nodes <= k
• k-club: a substructure of diameter <= k

• A k-clique might have diameter larger than k in the subgraph
– E.g. {1, 2, 3, 4, 5}

• Commonly used in traditional SNA
• Often involves combinatorial optimization

Cliques: {1, 2, 3}
2-cliques: {1, 2, 3, 4, 5}, {2, 3, 4, 5, 6}
2-clubs: {1, 2, 3, 4}, {1, 2, 3, 5}, {2, 3, 4, 5, 6}

Page 35: Network sampling, community detection

Group-Centric Community Detection: Density-Based Groups

• The group-centric criterion requires the whole group to satisfy a certain condition
– E.g., the group density >= a given threshold

• A subgraph $G_s(V_s, E_s)$ is a $\gamma$-dense quasi-clique if

$$\frac{2|E_s|}{|V_s|(|V_s| - 1)} \geq \gamma$$

where the denominator is the maximum number of degrees.

• A similar strategy to that of cliques can be used
– Sample a subgraph, and find a maximal quasi-clique (say, of size $|V_s|$)
– Remove nodes with degree less than the average degree
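A small sketch of the density test, assuming density is the sum of degrees inside the subgraph, 2|E_s|, divided by the maximum possible sum of degrees, |V_s|(|V_s| - 1). The edge list is an assumed toy network reconstructed to match the clique examples in these slides; the function names and the threshold value are hypothetical.

```python
# Assumed toy network (reconstructed, not taken from the transcript).
EDGES = [(1, 2), (1, 3), (1, 4), (2, 3), (3, 4), (4, 5), (4, 6),
         (5, 6), (5, 7), (5, 8), (6, 7), (6, 8), (7, 8), (7, 9)]


def build_adjacency(edges):
    """Undirected adjacency sets from an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def density(adj, nodes):
    """2|E_s| / (|V_s| (|V_s| - 1)) for the subgraph induced by `nodes`."""
    nodes = set(nodes)
    n = len(nodes)
    if n < 2:
        return 0.0
    # Counting (u, v) from both ends gives 2|E_s| directly.
    degree_sum = sum(1 for u in nodes for v in adj[u] if v in nodes)
    return degree_sum / (n * (n - 1))


def is_quasi_clique(adj, nodes, gamma):
    """True if the induced subgraph has density at least gamma."""
    return density(adj, nodes) >= gamma


adj = build_adjacency(EDGES)
print(density(adj, {5, 6, 7, 8}))             # a clique
print(density(adj, {4, 5, 6, 7, 8}))          # dense, but not complete
print(is_quasi_clique(adj, {4, 5, 6, 7, 8}, gamma=0.75))
```

A clique is the special case with density exactly 1; lowering the threshold admits groups that are missing a few edges.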

Page 36: Network sampling, community detection

Network-Centric Community Detection

• Network-centric criterion needs to consider the connections within a network globally

• Goal: partition nodes of a network into disjoint sets
• Approaches:
– (1) Clustering based on vertex similarity
– (2) Latent space models (multi-dimensional scaling)
– (3) Block model approximation
– (4) Spectral clustering
– (5) Modularity maximization

Page 37: Network sampling, community detection

Clustering based on Vertex Similarity

• Apply k-means or similarity-based clustering to nodes
• Vertex similarity is defined in terms of the similarity of their neighborhood
• Structural equivalence: two nodes are structurally equivalent iff they are connected to the same set of actors

• Structural equivalence is too strict for practical use.

Nodes 1 and 3 are structurally equivalent; so are nodes 5 and 6.

(1) Clustering based on vertex similarity

Page 38: Network sampling, community detection

Vertex Similarity

• Jaccard Similarity

• Cosine similarity
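The formulas for these two measures did not survive in the transcript. The sketch below assumes the standard neighborhood-based definitions, with N_i the set of neighbors of node i: Jaccard(i, j) = |N_i ∩ N_j| / |N_i ∪ N_j| and Cosine(i, j) = |N_i ∩ N_j| / sqrt(|N_i| |N_j|). The edge list is an assumed toy network reconstructed to match the other examples in these slides, and the function names are hypothetical.

```python
import math

# Assumed toy network (reconstructed, not taken from the transcript).
EDGES = [(1, 2), (1, 3), (1, 4), (2, 3), (3, 4), (4, 5), (4, 6),
         (5, 6), (5, 7), (5, 8), (6, 7), (6, 8), (7, 8), (7, 9)]


def build_adjacency(edges):
    """Undirected adjacency sets from an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def jaccard_similarity(adj, i, j):
    """Shared neighbors divided by the size of the combined neighborhood."""
    union = adj[i] | adj[j]
    return len(adj[i] & adj[j]) / len(union) if union else 0.0


def cosine_similarity(adj, i, j):
    """Shared neighbors divided by the geometric mean of the two degrees."""
    if not adj[i] or not adj[j]:
        return 0.0
    return len(adj[i] & adj[j]) / math.sqrt(len(adj[i]) * len(adj[j]))


adj = build_adjacency(EDGES)
print(jaccard_similarity(adj, 4, 6))
print(cosine_similarity(adj, 4, 6))
```

Unlike structural equivalence, both scores are graded, so two nodes that share most but not all neighbors still come out as similar.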


Page 39: Network sampling, community detection

Cut

• Most interactions are within groups, whereas interactions between groups are few

• Community detection can therefore be cast as a minimum cut problem
• Cut: a partition of the vertices of a graph into two disjoint sets
• Minimum cut problem: find a graph partition such that the number of edges between the two sets is minimized

(4) Spectral clustering

Page 40: Network sampling, community detection

Ratio Cut & Normalized Cut

• Minimum cut often returns an imbalanced partition, with one set being a singleton, e.g. node 9

• Change the objective function to consider community size

$C_i$: a community
$|C_i|$: number of nodes in $C_i$
$vol(C_i)$: sum of degrees in $C_i$

Page 41: Network sampling, community detection

Ratio Cut & Normalized Cut Example

For partition in red:

For partition in green:

Both ratio cut and normalized cut prefer a balanced partition
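The objective functions themselves are missing from the transcript. The sketch below assumes the standard forms for a partition into k communities: ratio cut averages cut(C_i, rest) / |C_i| over the communities, and normalized cut averages cut(C_i, rest) / vol(C_i). The edge list and the two partitions compared are assumptions (a singleton split at node 9 versus a more balanced split), since the colored figure is not available; function names are hypothetical.

```python
# Assumed toy network (reconstructed, not taken from the transcript).
EDGES = [(1, 2), (1, 3), (1, 4), (2, 3), (3, 4), (4, 5), (4, 6),
         (5, 6), (5, 7), (5, 8), (6, 7), (6, 8), (7, 8), (7, 9)]


def build_adjacency(edges):
    """Undirected adjacency sets from an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def cut_size(adj, community):
    """Number of edges leaving the community."""
    return sum(1 for u in community for v in adj[u] if v not in community)


def ratio_cut(adj, partition):
    """Average of cut(C_i, rest) / |C_i| over the communities."""
    return sum(cut_size(adj, c) / len(c) for c in partition) / len(partition)


def normalized_cut(adj, partition):
    """Average of cut(C_i, rest) / vol(C_i) over the communities."""
    def vol(community):
        return sum(len(adj[u]) for u in community)
    return sum(cut_size(adj, c) / vol(c) for c in partition) / len(partition)


adj = build_adjacency(EDGES)
singleton = [{9}, {1, 2, 3, 4, 5, 6, 7, 8}]  # what minimum cut would return
balanced = [{1, 2, 3, 4}, {5, 6, 7, 8, 9}]
for name, partition in [("singleton", singleton), ("balanced", balanced)]:
    print(name, ratio_cut(adj, partition), normalized_cut(adj, partition))
```

The singleton split cuts only one edge, yet both objectives score it worse than the balanced split because the cut is divided by a very small community.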

Page 42: Network sampling, community detection

Girvan-Newman Algorithm

Page 43: Network sampling, community detection

Bickel-Chen on Community Detection

Inconsistency result for Newman-Girvan.

http://www.stat.berkeley.edu/~bickel/Bickel%20Chen%2021068.full.pdf

Page 44: Network sampling, community detection

Agglomerative Hierarchical Clustering

• Initialize each node as a community
• Merge communities successively into larger communities following a certain criterion
– E.g., based on modularity increase

Dendrogram according to Agglomerative Clustering based on Modularity
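A naive sketch of agglomerative clustering driven by modularity increase. It assumes Newman's standard modularity and a simple greedy rule: at each step, try every pair of communities, apply the merge that gives the highest modularity, and stop when no merge improves it. Modularity is recomputed from scratch for every candidate, which is fine for toy graphs only. The example graph (two triangles joined by one edge) and the function names are hypothetical.

```python
def modularity(adj, communities):
    """Q = sum_c [ L_c / m - (d_c / 2m)^2 ] for a partition of the nodes."""
    m = sum(len(nbrs) for nbrs in adj.values()) / 2
    q = 0.0
    for community in communities:
        internal = sum(1 for u in community for v in adj[u] if v in community) / 2
        degree_sum = sum(len(adj[u]) for u in community)
        q += internal / m - (degree_sum / (2 * m)) ** 2
    return q


def greedy_agglomerative(adj):
    """Merge communities while modularity increases.

    Returns the final communities, their modularity, and the merge
    history (the information a dendrogram would display).
    """
    communities = [frozenset([v]) for v in sorted(adj)]
    current_q = modularity(adj, communities)
    merges = []
    while len(communities) > 1:
        best = None
        for i in range(len(communities)):
            for j in range(i + 1, len(communities)):
                rest = [c for idx, c in enumerate(communities) if idx not in (i, j)]
                candidate = rest + [communities[i] | communities[j]]
                q = modularity(adj, candidate)
                if best is None or q > best[0]:
                    best = (q, candidate, (communities[i], communities[j]))
        if best[0] <= current_q:
            break  # no merge improves modularity
        current_q, communities = best[0], best[1]
        merges.append(best[2])
    return communities, current_q, merges


# Hypothetical graph: triangles {0, 1, 2} and {3, 4, 5} joined by edge 2-3.
graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
         3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
final, q, history = greedy_agglomerative(graph)
print([sorted(c) for c in final], round(q, 4))
for a, b in history:
    print("merged", sorted(a), "+", sorted(b))
```

Reading the merge history from first to last gives the dendrogram from the leaves upward.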

Page 45: Network sampling, community detection

Sampling

• Types of Sampling
– Simple Random Sampling
– Stratified Sampling
– Cluster Sampling

• Sample Size
• Incorporating Sample Design
• Design vs. Model-Based Sampling
– Statisticians tend to favor the design-based point of view because it makes no assumptions about the mechanism that generates the data in the survey
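The three sampling types can be contrasted in a short sketch. Everything concrete here is an assumption made for illustration: a population of 100 numbered units, two strata, clusters of ten consecutive units, proportional allocation for the stratified sample, and the function names.

```python
import random


def simple_random_sample(units, n, rng):
    """Every subset of size n is equally likely."""
    return rng.sample(units, n)


def stratified_sample(units, stratum_of, fraction, rng):
    """Sample the same fraction independently inside each stratum."""
    strata = {}
    for unit in units:
        strata.setdefault(stratum_of(unit), []).append(unit)
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, round(len(members) * fraction)))
    return sample


def cluster_sample(units, cluster_of, n_clusters, rng):
    """Pick whole clusters at random and keep every unit inside them."""
    clusters = {}
    for unit in units:
        clusters.setdefault(cluster_of(unit), []).append(unit)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [unit for c in chosen for unit in clusters[c]]


rng = random.Random(42)
population = list(range(100))


def stratum_of(unit):
    return "A" if unit < 60 else "B"  # hypothetical 60/40 strata


def cluster_of(unit):
    return unit // 10                 # hypothetical clusters of ten


print(sorted(simple_random_sample(population, 10, rng)))
print(sorted(stratified_sample(population, stratum_of, 0.1, rng)))
print(sorted(cluster_sample(population, cluster_of, 2, rng)))
```

The stratified sample always reproduces the 60/40 split, while the cluster sample trades that control for the convenience of visiting only two clusters.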

Page 46: Network sampling, community detection

Respondent Driven Sampling (RDS)

• Sampling scheme for hard-to-reach populations, based on link-tracing across a social network with coupon incentives

• Becoming extremely widely used all over the world; hundreds of studies done or ongoing, e.g., CDC National HIV Behavioral Surveillance (NHBS) studies of injection drug users

• RDS as sampling vs. RDS estimation
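The sampling side of RDS can be illustrated with a toy simulation of coupon-based link tracing. This is a simplified sketch under stated assumptions, not the protocol of any particular study: every contacted person accepts, each respondent hands coupons to randomly chosen contacts who are not yet in the sample, and recruitment proceeds in waves until a target size is reached. The network, function name, and parameters are hypothetical, and the estimation side is not covered.

```python
import random
from collections import deque

# Hypothetical contact network, as adjacency sets.
NETWORK = {1: {2, 3, 4}, 2: {1, 3}, 3: {1, 2, 4}, 4: {1, 3, 5, 6},
           5: {4, 6, 7, 8}, 6: {4, 5, 7, 8}, 7: {5, 6, 8, 9},
           8: {5, 6, 7}, 9: {7}}


def rds_sample(adj, seeds, coupons, target_size, rng):
    """Link-tracing recruitment: each respondent recruits up to `coupons` peers.

    Returns the sample in recruitment order and a map from each
    respondent to their recruiter (None for seeds).
    """
    sampled = list(seeds)
    recruiter = {seed: None for seed in seeds}
    queue = deque(seeds)
    while queue and len(sampled) < target_size:
        person = queue.popleft()
        # Contacts not yet in the sample, in a random order.
        eligible = [v for v in sorted(adj[person]) if v not in recruiter]
        rng.shuffle(eligible)
        for recruit in eligible[:coupons]:
            if len(sampled) >= target_size:
                break
            recruiter[recruit] = person
            sampled.append(recruit)
            queue.append(recruit)
    return sampled, recruiter


sample, recruited_by = rds_sample(NETWORK, seeds=[1], coupons=2,
                                  target_size=6, rng=random.Random(7))
print(sample)
print(recruited_by)
```

Because recruitment follows network links, well-connected people are more likely to be reached, which is one reason estimation from RDS data is treated as a separate problem from the sampling itself.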

Page 47: Network sampling, community detection
Page 48: Network sampling, community detection

To Model or Not To Model; Design-based vs. model-based

• Model the underlying network?
• What about unknown nodes?
• The recruitment process?
• Coupon refusal?
• The outcome variables (such as HIV status)?

Page 49: Network sampling, community detection

References
• Modelo de Mundo Pequeno. Arthur Freitas Ramos & Hugo Neiva de Melo (000 - mundo-pequeno.pptx)
• Collective dynamics of 'small-world' networks. Duncan J. Watts & Steven H. Strogatz (001 - w_s_NATURE_0.pdf)
• Statistical Analysis of Network Data, Lecture 1: Network Mapping & Characterization. Eric D. Kolaczyk (003 - kolaczyk_yes13_1.pdf)
• Tutorial: Statistical Analysis of Network Data. Eric D. Kolaczyk (004 - Kolaczyk-CN.pdf)
• The Watts–Strogatz network model developed by including degree distribution: theory and computer simulation. Y. W. Chen, L. F. Zhang and J. P. Huang (005 - JPA-1.pdf)
• Where Online Friends Meet: Social Communities in Location-based Networks. Chloë Brown, Vincenzo Nicosia, Salvatore Scellato, Anastasios Noulas, Cecilia Mascolo (006 - icwsm12_brown.pdf)
• Mining of Massive Datasets. Anand Rajaraman, Jure Leskovec (001 – book.pdf)
• Machine Learning for Hackers. Drew Conway and John Myles White (002 - Machine_Learning_for_Hackers.pdf)
• pandas: powerful Python data analysis toolkit, Release 0.13.1. Wes McKinney & PyData Development Team (003 – pandas.pdf)
• Provenance Data in Social Media. Geoffrey Barbier, Zhuo Feng, Pritam Gundecha, Huan Liu. Synthesis Lectures on Data Mining and Knowledge Discovery #7, Morgan & Claypool Publishers (004 - Provenance-Data-in-Social-Media(2).pdf)
• Python for Data Analysis. Wes McKinney (005 - Python_for_Data_Analysis.pdf)
• Living Standards Analytics. Dominique Haughton, Jonathan Haughton (006 - bok%3A978-1-4614-0385-2.pdf)
• A nonparametric view of network models and Newman–Girvan and other modularities. Peter J. Bickel and Aiyou Chen (007 - Bickel%20Chen%2021068.full 2.pdf)
• Collective dynamics of 'small-world' networks. Duncan J. Watts & Steven H. Strogatz (008 - 393440a0.pdf)
• Communities in Networks. Mason A. Porter, Jukka-Pekka Onnela, and Peter J. Mucha (009 - 0902.3788v2 a.pdf)
• Managing and Mining Graph Data. Haixun Wang, Charu C. Aggarwal. Beijing, China. Springer. (007 - bok%3A978-1-4419-6045-0.pdf)
• Social Network Data Analytics. Charu C. Aggarwal. Springer. (008 - bok%3A978-1-4419-8462-3.pdf)
• RDS Analysis Tools 7.2 RDSAT 7.1 User Manual. Springer. (009 - RDSAT_7.1-Manual_2012-11-25.pdf)
• Statistical Analysis of Network Data: Methods and Models. Eric D. Kolaczyk. Springer. (010 - bok_978-0-387-88146-1)
• Networks, Crowds, and Markets: Reasoning about a Highly Connected World. David Easley & Jon Kleinberg (011 - networks-book.pdf)