social network analysis - leiden universityliacs.leidenuniv.nl/~takesfw/pdf/sna.pdf · social...

51

Upload: ledat

Post on 26-Feb-2019

228 views

Category:

Documents


0 download

TRANSCRIPT

Social Network Analysis

Challenges in Computer Science – April 1, 2014

Frank Takes ([email protected])

LIACS, Leiden University

Overview

Context

Social Network Analysis

Online Social Networks

Friendship Graph

Centrality Measures

Example

Conclusions

Data Science

Data Science is the study of the generalizable extraction of knowledge from data, yet the key word is science (Wikipedia)

Builds on techniques and theories from many fields, including machine learning, computer programming, statistics, data engineering, pattern recognition and learning, visualization, …

Goal is extracting meaning from data and creating data products

Data science is a buzzword, often used interchangeably with analytics or big data…

Big Data?

Online Social Network

Friendship Graph

8 million users

1 billion friendships between users

10GB on disk

Is this Big Data?

No.

Three V’s of Big Data

Volume

Velocity

Volatile

Or….?

From Data to Networks

Unstructured data: numeric measurementsfrom a temperature sensor, textual contexts of a news article

Structured data: data organized according to a model or data structure, for example:

Database: tables with rows and columns

Graph/Network: nodes and edges

Graphs

C

D

E

B

AVertex/Node/Knoop

Relationship/Edge/Link/Tak

Distance/Afstandd(C, E) = d(E, C) = 2

n = 5 nodesm = 6 edges

Social Network Analysis

Social Network Analysis (SNA): the study of social networks to understand their structure and behavior.

Social Network: a social structure of people, related (directly or indirectly) to each other through a common relation or interest.

Social Networks != Social Media

Social Network Analysis

Social Network Analysis (SNA) Sociology

Algorithms

Data Mining

Social Networks Real-life (explicit)

Online (explicit)

Derived (implicit) e-mail networks, citation networks, co-author networks, terrorist collaboration networks

Online Social Networks

History

1997: SixDegrees.com

2000: Friendster

2003: LinkedIn & MySpace

2004: Hyves

2005: Facebook

2006: Twitter

................

2010: The Social Network (movie)

Online Social Networks

User (node) has a profile

Profiles have attributes (labels/annotations)

Explicit links (edges) Social Links / Friendship links

User groups

Implicit links Social messaging

Common attributes

Directed vs. undirected links

2011

Example: Facebook

More than 1 billion active users

Average user has 130 friends

Estimated 100 billion social links

Over 600 million interactive objects (pages, groups and events)

More than 45 billion pieces of content (web links, news stories, blog posts, notes, photo albums, etc.) shared each month

OSNA Research Topics

User behavior

Privacy & Anonymity

Trust & Authorities

Diffusion of information

Sampling & Crawling

Community Detection

Friendship Graph

Friendship Graph

Friendship Graph

Friendship Graph

Static analysis Who is the most important person in a network?

Can we distinguish between groups of people in the network?

What is the average distance between two peoplein the network?

Dynamic analysis Who are likely to become friends next?

How does the social network evolve over time?

Centrality measures

Degree centrality

Betweenness centrality

Closeness centrality

Graph centrality (eccentricity centrality)

Eigenvector centrality

Random walk centrality

Hyperlink Induced Topic Search (HITS)

PageRank

Centrality

B

C

D

E

A

Degree Centrality:C has the highest degree

Who has a central position in this graph?

E

F

G

H

B

C

D

E

A

E

F

G

H

Centrality

B

C

D

E

A

Betweenness Centrality:E is part of the largest

number of shortest paths

Who has a central position in this graph?

E

F

G

H

B

C

D

E

A

E

F

G

H

Google PageRank

Google PageRank

4 webpages A, B, C and D (N = 4)

Initially: PR(A) = PR(B) = PR(C) = PR(D) = 1/n

L(A) is the outdegree of page A

Now if B, C and D each link to A, the simple PageRank PR(A) of a page A is equal to:

Google PageRank

A B

C D

0.25 0.25

0.25 0.25

B

C D

0.458 0.083

0.208 0

A

Google PageRank

PageRank as suggested by Larry Page in 1999

N = number of pages, pi and pj are pages

M(pi) is the set of pages linking to pi

L(pj) is the outdegree of pj

d = 0.85, 85% chance to follow a link, 15% chance to jump to a random page (random surfer)

t = 0

t = t + 1

Frequent Subgraphs

B

C

B

A

A

B

C

B

A

A

Frequent Subgraph: A-B-C

B

C

B

A

A

B

C

B

A

A

What pattern occurs frequently in this graph?

Clustering

htt

p:/

/bay

ram

ann

ako

v.w

ord

pre

ss.c

om

Six Degrees of Separation"I read somewhere that everybody on this planet is separated by only six other people. Six degrees of separation. Between us and everybody else on this planet. The president of the United States. A gondolier in Venice. Fill in the names. I find that A) tremendously comforting that we're so close and B) like Chinese water torture that we're so close. Because you have to find the right six people to make the connection. It's not just big names. It's anyone. A native in a rain forest. A Tierra del Fuegan. An Eskimo. I am bound to everyone on this planet by a trail of six people. It's a profound thought."

John Guare, 1990

Six Degrees of Separation

Stanley Milgram, 1969

300 brieven van Omaha naar Boston

Geadresseerd aan 300 willekeurige mensen, met het verzoek debrief door te sturenrichting de uiteindelijkgeadresseerde.

Na gemiddeld 5.5 stap-pen kwam de brief bij de geadresseerde aan.

Six Degrees of Separation

Testen op een online sociaal netwerk

Dataset

8 miljoen gebruikers

900 miljoen onderlinge vriendschappen

9GB in text (datafile), 4GB in memory

Alle afstanden vergelijken: 8M x 8M = 64 x 1012

Sampling: onderlinge afstand van paren van 1000 willekeurige gebruikers bepalen mb.v. kortstepad-algoritme van Dijkstra

Gemiddelde afstanden

Netwerk Gebruikers Vriendschappen Gemiddelde afstand

Flickr 1.800.000 22.600.000 5.67

Hyves 8.000.000 900.000.000 4.75

LiveJournal 5.300.000 77.400.000 5.88

Orkut 3.100.000 223.500.000 4.25

YouTube 1.160.000 4.950.000 5.10

1000 samples

Random gekozen

Voor datasets van sociale netwerken:

Distance Distribution

Data storage in memory

n nodes, m links, k links per node on average

1 < k < n < m < n2

Adjacency Matrix Sorted Adjecency List

Size 8M x 8M x 1bit = 64 Tbit = 8 TbyteO(n2) space

900M x 8 bytes (INT pairs) = 7.2 GbyteO(m) space

Link existence O(1) time O(log k) time

Link addition O(1) time O(k log k) time

Link deletion O(1) time O(1) time

Neighborhood O(n) time O(1) time

Friendship Graph Analysis

Static analysis

Densely connected core

Fringe of low-degree nodes

Few isolated communities & singletons

Static properties

Node degree distribution, average distance, diameter

Edge/node ratio, level of symmetry

Number of cliques, k-cliques, etc.

Small world phenomanon

Degree Distribution

Small World Networks

Class of networks with certain properties:

Sparse graphs

Highly connected

Short average node-to-node distance: d ~ log(n)

Fat tailed power law node degree distribution

Densely connected core with many (near-)cliques

Existence of hubs: nodes with a very high degree

Fringe of low(er)-degree nodes

Regular vs. Small World

Small World Networks

Other examples of small world networks

Web graphs

Gene networks

E-mail networks

Telephone call graphs

Information networks

Internet topology networks

Scientific co-authorship networks

Corporate networks (interlocks or ownerships)

Collaboration @ LIACS

Friendship Graph Analysis

Static analysis

Dynamic analysis

Network evolution

Network modelling

Network growth

Link Prediction

Triadic Closure

Preferential Attachment

Link Prediction

B

C

D

E

A

B

C

D

E

A

Two principles: preferential attachment andtriadic closure

Triadic Closure

B

C

A

B

C

A

Two principles: preferential attachment andtriadic closure

Preferential Attachment

Nodes with a large degree acquire new links at a faster rate.

B

C D

J

A E F

G

Conclusions

When you hear “big data”, then it is almostnever really big data.

Online social networks are an excellent domain of study for data (or graph-) miners.

Social Network Analysis is important for many areas of research, not only computer science.

(Small world) networks are everywhere.

Try this at home

Graph visualization: http://www.gephi.orghttp://nodexl.codeplex.com

Network datasets: http://snap.stanford.eduhttp://konect.uni-koblenz.de/networks