introduction to network analysis

Introduction to Network Analysis Marko Grobelnik, Dunja Mladenic JSI Parts of the presentation taken from the tutorial “Structure and function of real-world graphs and networks” by Jure Leskovec, CMU/JSI


Post on 22-Feb-2016


TRANSCRIPT

Page 1: Introduction to  Network Analysis

Introduction to Network Analysis

Marko Grobelnik, Dunja Mladenic JSI

Parts of the presentation taken from the tutorial “Structure and function of real-world graphs and networks” by Jure Leskovec, CMU/JSI

Page 2: Introduction to  Network Analysis

Outline

What are networks?
◦ …few examples

Network properties
◦ Small worlds
◦ Power law
◦ Long tail
◦ Network resilience
◦ Structure of networks

Applications
◦ Mining e-mail server logs
◦ Mining MSN Messenger data

Page 3: Introduction to  Network Analysis

Networks & Computer Science

[Diagram: (complex) networks at the intersection of statistics, computer systems, theory and algorithms, and machine learning / data mining]

Page 4: Introduction to  Network Analysis

Networks & Science

[Diagram: (complex) networks connect computer science (statistics, computer systems, theory and algorithms, machine learning / data mining) with the social sciences, biology, physics, and industry & applications]

Page 5: Introduction to  Network Analysis

Networks (graphs)

Vertex / Node

Page 6: Introduction to  Network Analysis

Networks (graphs)

Vertex / Node

Edge / Link

Page 7: Introduction to  Network Analysis

Networks (graphs)

Vertex / Node

Edge / Link

Direction

Page 8: Introduction to  Network Analysis

Networks (graphs)

Vertex / Node

Edge / Link

Direction

Probabilities (edge labels in the figure: 0.3, 0.6, 0.1)
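A graph with directed, weighted edges can be held in a dictionary of dictionaries. The sketch below is a minimal illustration in Python. The node names A to D are made up, and reading 0.3, 0.6 and 0.1 as the outgoing transition probabilities of a single node is an assumption, since the slide only shows the numbers as edge labels.

```python
# Directed graph as {source: {target: probability}}.
# Node names are hypothetical; 0.3, 0.6, 0.1 are the edge labels from the slide,
# assumed here to be the outgoing probabilities of node "A".
graph = {
    "A": {"B": 0.3, "C": 0.6, "D": 0.1},
    "B": {"C": 1.0},
    "C": {},
    "D": {"A": 1.0},
}


def edges(g):
    """Return all directed edges as (source, target, probability) triples."""
    return [(u, v, p) for u, targets in g.items() for v, p in targets.items()]


def out_probability_sum(g, u):
    """Sum of the probabilities on the edges leaving node u."""
    return sum(g[u].values())
```

With this layout, looking up the neighbours of a node is a single dictionary access, which is why adjacency maps are a common choice for sparse networks.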

Page 9: Introduction to  Network Analysis

Dynamic Networks (graphs)

Vertex / Node

Edge / Link

Direction

Probabilities (edge labels in the figure: 0.3, 0.6, 0.1)

…in dynamic networks all the elements of the graph are changing

…dealing with dynamic networks is an active research topic

Page 10: Introduction to  Network Analysis

Example of Dynamic Graph (1/3)

Query

Active topic during a limited time period

Page 11: Introduction to  Network Analysis

Example of Dynamic Graph (2/3)

On 1996-08-30, Clinton and Chicago are connected

Page 12: Introduction to  Network Analysis

Example of Dynamic Graph (3/3)

On 1996-10-02, Clinton and Chicago are NOT connected
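One simple way to model such a dynamic graph is to attach a validity interval to every edge and take a snapshot for a given day. The sketch below assumes this interval representation; only the Clinton and Chicago link and the two query dates come from the slides, while the interval boundaries and the second edge are hypothetical.

```python
from datetime import date

# Time-stamped edges: (node, node, first active day, last active day).
# The interval boundaries and the "Washington" edge are made up for illustration.
timed_edges = [
    ("Clinton", "Chicago", date(1996, 8, 20), date(1996, 9, 5)),
    ("Clinton", "Washington", date(1996, 8, 1), date(1996, 11, 5)),
]


def snapshot(edge_list, day):
    """Return the set of undirected edges that are active on the given day."""
    return {
        frozenset((u, v))
        for u, v, start, end in edge_list
        if start <= day <= end
    }


def connected(edge_list, day, u, v):
    """True if u and v are directly linked in the snapshot for that day."""
    return frozenset((u, v)) in snapshot(edge_list, day)
```

Querying `connected(timed_edges, date(1996, 8, 30), "Clinton", "Chicago")` and the same pair on `date(1996, 10, 2)` mirrors the two slides above.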

Page 13: Introduction to  Network Analysis

Networks of the real-world (1)

Information networks:
◦ World Wide Web: hyperlinks
◦ Citation networks
◦ Blog networks

Social networks: people + interactions
◦ Organizational networks
◦ Communication networks
◦ Collaboration networks
◦ Sexual networks

Technological networks:
◦ Power grid
◦ Airline, road, river networks
◦ Telephone networks
◦ Internet
◦ Autonomous systems

[Figures: Florence families; Karate club network; Collaboration network; Friendship network]

Page 14: Introduction to  Network Analysis

Networks of the real-world (2)

Biological networks
◦ Metabolic networks
◦ Food web
◦ Neural networks
◦ Gene regulatory networks

Language networks
◦ Semantic networks

Software networks

…

[Figures: Yeast protein interactions; Semantic network; Language network; XFree86 network]

Page 15: Introduction to  Network Analysis

Types of networks

Directed/undirected
Multigraphs (multiple edges between nodes)
Hypergraphs (edges connecting multiple nodes)
Bipartite graphs (e.g., papers to authors)
Weighted networks
Different types of nodes and edges
Evolving networks:
◦ Nodes and edges only added
◦ Nodes and edges added and removed
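Several of these types can be derived from one another. As a small illustration, the sketch below projects a bipartite papers-to-authors graph onto a weighted co-authorship network, where the weight counts shared papers. The paper and author names are hypothetical, and weighting by shared papers is one common choice rather than something the slides prescribe.

```python
from collections import defaultdict
from itertools import combinations

# Bipartite graph: each paper is linked to its authors (hypothetical data).
paper_authors = {
    "paper1": ["Ana", "Bor"],
    "paper2": ["Bor", "Cene", "Ana"],
    "paper3": ["Dana"],
}


def coauthor_projection(papers):
    """Project the bipartite graph onto authors.

    Returns {(author_a, author_b): number of shared papers} with author_a < author_b.
    """
    weights = defaultdict(int)
    for authors in papers.values():
        for a, b in combinations(sorted(set(authors)), 2):
            weights[(a, b)] += 1
    return dict(weights)


coauthors = coauthor_projection(paper_authors)
```

Single-author papers such as `paper3` contribute no edges, so their authors end up as isolated nodes in the projection.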

Page 16: Introduction to  Network Analysis

Traditional approach

Sociologists were the first to study networks:
◦ Study of patterns of connections between people to understand the functioning of society
◦ People are nodes, interactions are edges
◦ Questionnaires are used to collect link data (hard to obtain, inaccurate, subjective)
◦ Typical questions: centrality and connectivity

Limited to small graphs (~10 nodes) and to properties of individual nodes and edges

Page 17: Introduction to  Network Analysis

New approach (1)

Large networks (e.g., web, internet, on-line social networks) with millions of nodes
Many traditional questions are no longer useful:
◦ Traditional: What happens if a node U is removed?
◦ Now: What percentage of nodes needs to be removed to affect network connectivity?
Focus moves from a single node to the study of statistical properties of the network as a whole
We cannot draw (plot) the network and examine it

Page 18: Introduction to  Network Analysis

New approach (2)

What does the network “look like” even if I can’t look at it?
Need for statistical methods and tools to quantify large networks
3 parts/goals:
◦ Statistical properties of large networks
◦ Models that help understand these properties
◦ Predict behavior of networked systems based on measured structural properties and local rules governing individual nodes

Page 19: Introduction to  Network Analysis

Statistical properties of networks

Features common to networks of different types:
◦ Properties of static networks: small-world effect, transitivity or clustering, degree distributions (scale-free networks), network resilience, community structure, subgraphs or motifs
◦ Temporal properties: densification, shrinking diameter

Page 20: Introduction to  Network Analysis

Small-world effect

Six degrees of separation (Milgram, 1960s)
◦ Random people in Nebraska were asked to send letters to stockbrokers in Boston
◦ Letters could only be passed to first-name acquaintances
◦ Only 25% of the letters reached the goal
◦ But they reached it in about 6 steps

Measuring path lengths:
◦ Diameter (longest shortest path): max d_ij
◦ Effective diameter: distance at which 90% of all connected pairs of nodes can be reached
◦ Mean geodesic (shortest) distance l
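The three path-length measures can be computed on a small unweighted graph with breadth-first search from every node. The following is a minimal sketch, assuming an adjacency-list dictionary and at least one connected pair. It runs in O(n·m) time, so it only suits small graphs; the five-node path graph is a made-up example.

```python
import math
from collections import deque


def bfs_distances(adj, source):
    """Shortest hop distances from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def path_length_stats(adj, quantile=0.9):
    """Return (diameter, effective diameter, mean geodesic distance).

    All three are taken over ordered pairs of distinct, connected nodes.
    """
    lengths = []
    for s in adj:
        lengths.extend(d for t, d in bfs_distances(adj, s).items() if t != s)
    lengths.sort()
    diameter = lengths[-1]
    # Smallest distance within which at least `quantile` of the pairs are reached.
    effective = lengths[math.ceil(quantile * len(lengths)) - 1]
    mean = sum(lengths) / len(lengths)
    return diameter, effective, mean


# Hypothetical example: the path graph 0 - 1 - 2 - 3 - 4.
path_graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
stats = path_length_stats(path_graph)
```

For networks with millions of nodes the exact all-pairs computation is impractical, and the measures are usually estimated from a sample of source nodes instead.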

Page 21: Introduction to  Network Analysis

Small World Networks on Web

An empirical observation for the Web graph is that its diameter is small relative to the size of the network
◦ …this property is called “Small World”
◦ …formally, small-world networks have a diameter exponentially smaller than their size
By simulation it was shown that for a Web size of 1B pages the diameter is approx. 19 steps
◦ …empirical studies confirmed the findings

Page 22: Introduction to  Network Analysis

Small World on FP5-IST (collaboration network)

The network represents collaboration between institutions on FP5-IST projects funded by the European Union
◦ …there are 7886 organizations collaborating on 2786 projects
◦ …in the network, each node is an organization; two organizations are connected if they collaborate on at least one project

Small-world properties of the collaboration network:
◦ The main connected part of the network contains 94% of the nodes
◦ Max distance between any two organizations is 7 steps, meaning that any organization can be reached in up to 7 steps from any other organization
◦ Average distance between any two organizations is 3.15 steps (with standard deviation 0.38)
◦ 38% (2770) of organizations have avg. distance 3 or less

Page 23: Introduction to  Network Analysis

Connectedness of the most connected institution

• 1856 collaborations
• avg. distance is 1.95
• max. distance is 4

Page 24: Introduction to  Network Analysis

Connectedness of a semi-connected institution

• 179 collaborations
• avg. distance is 2.42
• max. distance is 4

Page 25: Introduction to  Network Analysis

Connectedness of the min. connected institution

• 8 collaborations
• max. distance is 7

Page 26: Introduction to  Network Analysis

Small World effect on MSN Messenger Network

Microsoft Messenger network
◦ 180 million people
◦ 1.3 billion edges
◦ Edge if two people exchanged at least one message in a one-month period

Distribution of shortest path lengths: pick a random node and count how many nodes are at distance 1, 2, 3, ... hops

[Plot: number of nodes (log scale, 10^0 to 10^8) vs. distance in hops (0 to 30); the value 7 is marked on the plot]

Page 27: Introduction to  Network Analysis

What is Power Law?

A power law describes relations between the objects in the network
◦ …it is very characteristic of networks generated within some kind of social process
◦ …it describes scale invariance found in many natural phenomena (including physics, biology, sociology, economy and linguistics)

Page 28: Introduction to  Network Analysis

Power-Law on the Web

In the context of the Web, the power law appears in many cases:
◦ Web page sizes
◦ Web page connectivity
◦ Web connected components’ size
◦ Web page access statistics
◦ Web browsing behavior

Formally, the power law describing web page degrees is:

(This property has been preserved as the Web has grown)
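For orientation, the general form of a power-law degree distribution can be written as follows, where P(k) is the fraction of pages with degree k. The exponent is left symbolic as γ because its value for the Web is not preserved in this transcript; the second line is the equivalent log-log form, with c a constant.

```latex
P(k) \propto k^{-\gamma}, \qquad \gamma > 0
\log P(k) = -\gamma \log k + c
```

The second form explains why power laws show up as straight lines with slope −γ on log-log plots of degree against count.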

Page 29: Introduction to  Network Analysis
Page 30: Introduction to  Network Analysis

Degree distribution

Number of people a person talks to on Microsoft Messenger

[Plot: count vs. node degree; an X marks the highest degree]
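A degree distribution like the one plotted here is a histogram over node degrees. The sketch below shows the computation on a tiny, hypothetical undirected edge list; it assumes each edge is listed once and that there are no self-loops.

```python
from collections import Counter


def degree_distribution(edge_list):
    """Return (histogram, degree) for an undirected edge list.

    degree maps node -> number of neighbours;
    histogram maps degree value -> number of nodes with that degree.
    """
    degree = Counter()
    for u, v in edge_list:
        degree[u] += 1
        degree[v] += 1
    return Counter(degree.values()), degree


# Hypothetical "who talks to whom" edges.
talk_edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c")]
histogram, degree = degree_distribution(talk_edges)
highest_degree = max(degree.values())
```

Plotting `histogram` with both axes on a logarithmic scale gives the kind of view shown on the slide.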

Page 31: Introduction to  Network Analysis

Detour: how long is the long tail?

This is not directly related to graphs, but it nicely explains the “long tail” effect. It shows that there is a big market for niche products.

Page 32: Introduction to  Network Analysis

Network resilience

We observe how the connectivity (length of the paths) of the network changes as vertices are removed
It is important for epidemiology
◦ Removal of vertices corresponds to vaccination
Real-world networks are resilient to random attacks
◦ One has to remove all web pages of degree > 5 to disconnect the web
◦ …but this is a very small percentage of web pages
A random network has better resilience to targeted attacks
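The contrast between random and targeted attacks can be explored with a small simulation that removes k nodes and measures the largest remaining connected component. The sketch below is an illustration under simplifying assumptions: the targeted attack ranks nodes by their initial degree (it is not recomputed after each removal), and the hub-and-spoke test network is hypothetical.

```python
import random


def largest_component_size(adj, removed):
    """Size of the largest connected component once `removed` nodes are deleted."""
    seen = set(removed)
    best = 0
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:
            u = stack.pop()
            size += 1
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        best = max(best, size)
    return best


def attack(adj, k, targeted, seed=0):
    """Remove k nodes, highest initial degree first or in random order."""
    if targeted:
        order = sorted(adj, key=lambda u: len(adj[u]), reverse=True)
    else:
        order = list(adj)
        random.Random(seed).shuffle(order)
    return largest_component_size(adj, order[:k])


# Hypothetical hub-and-spoke network: node 0 is linked to nodes 1..19.
star = {0: list(range(1, 20))}
star.update({i: [0] for i in range(1, 20)})
```

In this extreme example, removing the single hub shatters the network into isolated nodes, while removing any one spoke leaves the other 19 nodes connected.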

Page 33: Introduction to  Network Analysis

Network motifs (1)

What are the building blocks (motifs) of networks?
Do motifs have specific roles in networks?
Network motif detection process:
◦ Count how many times each subgraph appears
◦ Compute statistical significance for each subgraph: the probability of it appearing in a random network as often as in the real network

[Figure: 3-node motifs]
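The two-step detection process can be sketched for a single motif, the feed-forward loop (a→b, b→c, a→c). The code below counts it in a directed graph and estimates how often a random graph contains at least as many. The null model used here, random directed graphs with the same number of nodes and edges, is a simplifying assumption; motif studies often use degree-preserving randomization instead. The example edges are hypothetical.

```python
import random


def count_feed_forward(edge_list):
    """Count ordered triples (a, b, c) with edges a->b, b->c and a->c."""
    edge_set = set(edge_list)
    successors = {}
    for u, v in edge_set:
        successors.setdefault(u, set()).add(v)
    count = 0
    for a, b in edge_set:
        for c in successors.get(b, ()):
            if c != a and (a, c) in edge_set:
                count += 1
    return count


def random_graph(nodes, m, rng):
    """Pick m distinct directed edges (no self-loops) uniformly at random."""
    possible = [(u, v) for u in nodes for v in nodes if u != v]
    return rng.sample(possible, m)


def motif_p_value(edge_list, trials=200, seed=0):
    """Return (real count, share of random graphs with at least that many)."""
    rng = random.Random(seed)
    nodes = sorted({x for e in edge_list for x in e})
    m = len(set(edge_list))
    real = count_feed_forward(edge_list)
    hits = sum(
        count_feed_forward(random_graph(nodes, m, rng)) >= real
        for _ in range(trials)
    )
    return real, hits / trials


# Hypothetical directed graph containing one feed-forward loop (x, y, z).
example_edges = [("x", "y"), ("y", "z"), ("x", "z"), ("z", "w")]
```

A small share returned by `motif_p_value` suggests the subgraph is over-represented relative to the chosen null model.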

Page 34: Introduction to  Network Analysis

Network motifs (2)

Biological networks
◦ Feed-forward loop
◦ Bi-fan motif

Web graph:
◦ Feedback with two mutual dyads
◦ Mutual dyad
◦ Fully connected triad

Page 35: Introduction to  Network Analysis

Shrinking diameters

Intuition says that distances between the nodes slowly grow as the network grows (like log n)
But as the network grows, the distances between nodes slowly decrease

[Plots: Internet; Citations]

Page 36: Introduction to  Network Analysis

Structure of the Web – “Bow Tie” model

In November 1999, a large-scale study using AltaVista crawls of over 200M nodes and 1.5B links reported a “bow tie” structure of web links
◦ …we suspect that, because of the scale-free nature of the Web, this structure is still preserved

Page 37: Introduction to  Network Analysis

SCC – strongly connected component, where pages can reach each other via directed paths

IN – pages that can reach the core via a directed path, but cannot be reached from the core

OUT – pages that can be reached from the core via a directed path, but cannot reach the core in a similar way

TENDRILS – disconnected components reachable only via a directed path from IN and OUT, but not from or to the core
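The bow-tie regions follow directly from forward and backward reachability. The sketch below assumes we already know one page that lies in the core (the `core_node` argument is that assumption) and lumps tendrils and fully disconnected pages together as OTHER; the link list is hypothetical.

```python
def reachable(adj, start):
    """All nodes reachable from start by following edges in adj."""
    seen = {start}
    stack = [start]
    while stack:
        u = stack.pop()
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


def bow_tie(links, core_node):
    """Split nodes into SCC, IN, OUT and OTHER relative to the core's component."""
    forward, backward, nodes = {}, {}, set()
    for u, v in links:
        forward.setdefault(u, []).append(v)
        backward.setdefault(v, []).append(u)
        nodes.update((u, v))
    downstream = reachable(forward, core_node)  # pages the core can reach
    upstream = reachable(backward, core_node)   # pages that can reach the core
    scc = downstream & upstream
    return {
        "SCC": scc,
        "IN": upstream - scc,
        "OUT": downstream - scc,
        "OTHER": nodes - upstream - downstream,  # tendrils and disconnected pages
    }


# Hypothetical links: a <-> b form the core, i points in, o hangs off the core,
# t is a tendril reachable from i, and d1 -> d2 is disconnected.
links = [("i", "a"), ("a", "b"), ("b", "a"), ("b", "o"), ("i", "t"), ("d1", "d2")]
parts = bow_tie(links, "a")
```

Two graph traversals are enough here, which is what makes this decomposition feasible even for crawls with hundreds of millions of pages.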

Page 38: Introduction to  Network Analysis

Mining email server logs

Page 39: Introduction to  Network Analysis

Ontology generation from social networks data

We address the problem of how to construct a taxonomy from social network data.
◦ …we adapt the approach used when dealing with text
As an example we use the e-mail graph of a mid-size research institution
◦ …communication records of 770 people at JSI
The experiments and evaluation show our approach to be useful and applicable in real-life situations
◦ …the approach could be easily reused in case studies (and elsewhere)

Page 40: Introduction to  Network Analysis

Architecture

The main contribution of the deliverable is an architecture and software consisting of 5 major steps:
1. We start with log files from the institutional e-mail server, where the data include information about e-mail transactions with three fields: time, sender and the list of receivers.
2. After cleaning, we get the data in the form of e-mail transactions, which include the e-mail addresses of sender and receiver.
3. From the set of e-mail transactions we construct a graph where vertices are e-mail addresses, connected if there is a transaction between them.
4. The e-mail graph is transformed into a sparse matrix, which allows data manipulation and analysis operations to be performed.
5. The sparse matrix representation of the graph is analyzed with ontology learning tools, producing an ontological structure corresponding to the organizational structure of the institution the e-mails came from.
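Steps 3 and 4 can be sketched in a few lines: each address gets an integer index, and the graph is stored as a sparse matrix in coordinate form. This is a minimal illustration, not the software described in the deliverable. The addresses are hypothetical, and using the number of transactions as the matrix value is an assumption (the slides only say that vertices are connected if a transaction exists).

```python
from collections import defaultdict

# Cleaned transactions as (sender, receiver) pairs (hypothetical addresses).
transactions = [
    ("ana@example.org", "bor@example.org"),
    ("ana@example.org", "bor@example.org"),
    ("bor@example.org", "cene@example.org"),
]


def build_sparse_matrix(pairs):
    """Return (index, matrix).

    index maps e-mail address -> row/column number;
    matrix maps (row, column) -> number of transactions, storing non-zeros only.
    """
    index = {}
    counts = defaultdict(int)
    for sender, receiver in pairs:
        i = index.setdefault(sender, len(index))
        j = index.setdefault(receiver, len(index))
        counts[(i, j)] += 1
    return index, dict(counts)


index, matrix = build_sparse_matrix(transactions)
```

Each row of such a matrix can then be treated like a document vector, which fits the earlier remark about adapting an approach originally used for text.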

Page 41: Introduction to  Network Analysis

Data used for Experimentation

The data is a collection of log files with e-mail transactions from the local e-mail spam filter software Amavis (http://www.amavis.org/):
◦ Each line of the log files denotes one event at the spam filter software
◦ We were interested in the events on successful e-mail transactions, which carry information on time, sender, and list of receivers
◦ An example of a successful e-mail transaction is the following line:

2005 Mar 28 13:59:05 patsy amavis[33972]: (33972-01-3) Passed CLEAN, [217.32.164.151] [193.113.30.29] <[email protected]> -> <[email protected]>, Message-ID: <21DA6754A9238B48B92F39637EF307FD0D4781C8@i2km41-ukdy.domain1.systemhost.net>, Hits: -1.668, 6389 ms
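A line of this shape can be reduced to the three fields of interest (time, sender, receivers) with a regular expression. The sketch below is a hedged illustration: the pattern is inferred from the single example line above, the assumption that several receivers appear as a comma-separated list of `<address>` items is not confirmed by the source, and the addresses in `sample` are placeholders because the originals are masked.

```python
import re
from datetime import datetime

# Pattern inferred from the one example line; other Amavis versions may differ.
LINE_RE = re.compile(
    r"^(?P<time>\d{4} \w{3} +\d{1,2} \d{2}:\d{2}:\d{2}) .*?Passed CLEAN,.*?"
    r"<(?P<sender>[^>]*)> -> (?P<receivers>(?:<[^>]*>,?)+)"
)


def parse_line(line):
    """Return (time, sender, [receivers]) or None if the line is not a clean pass."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    when = datetime.strptime(match.group("time"), "%Y %b %d %H:%M:%S")
    receivers = re.findall(r"<([^>]*)>", match.group("receivers"))
    return when, match.group("sender"), receivers


# Same layout as the example line, with placeholder addresses.
sample = (
    "2005 Mar 28 13:59:05 patsy amavis[33972]: (33972-01-3) Passed CLEAN, "
    "[217.32.164.151] [193.113.30.29] <alice@example.org> -> "
    "<bob@example.org>,<carol@example.org>, Message-ID: <id@example.org>, "
    "Hits: -1.668, 6389 ms"
)
parsed = parse_line(sample)
```

Lines for other events (for example blocked or infected mail) do not match the pattern and yield `None`, which makes filtering for successful transactions a one-line check.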

Page 42: Introduction to  Network Analysis

Some statistics about the data

The log files include e-mail data from Sep 5th 2003 to Mar 28th 2005:
◦ …this sums up to 12.8 GB of data
◦ After filtering out the successful e-mail transactions, 564 MB remain, containing approx. 2.7 million successful e-mail transactions used for further processing
◦ The whole dataset contains references to approx. 45,000 e-mail addresses
◦ …after the data cleaning phase the number is reduced to approx. 17,000 e-mail addresses
◦ …out of which 770 e-mail addresses are internal to the home institution (with the “ijs.si” domain name)

Page 43: Introduction to  Network Analysis

Organizational structure of JSI produced from cleaned e-mail transactions with OntoGen in <5 minutes

Page 44: Introduction to  Network Analysis

Organizational structure of JSI visualized from e-mail transactions with Document-Atlas

Page 45: Introduction to  Network Analysis

Evaluation

Part of the clustering results for the “Jozef Stefan Institute” e-mail data into 10 clusters (C-0, C-1, …, C-9), showing the distribution of the clustered e-mails over the Institute departments.

Page 46: Introduction to  Network Analysis

Analysis of MSN Messenger Communication Network

By Jure Leskovec

Page 47: Introduction to  Network Analysis

Data that we have: Communication

For every conversation (session) we have a list of users who participated in the conversation
There can be multiple people per conversation
For each conversation and each user:
◦ User Id
◦ Time Joined
◦ Time Left
◦ Number of Messages Sent
◦ Number of Messages Received

Page 48: Introduction to  Network Analysis

Data that we have: Demographics

For every user (self reported):
◦ Age
◦ Gender
◦ Location (Country, ZIP)
◦ Language
◦ IP address (we can do reverse GeoIP lookup)

Page 49: Introduction to  Network Analysis

Facts about the data

150 GB of compressed logs per day
◦ Just copying over the network takes 8 to 10 hours
◦ Parsing and processing takes another 4 to 6 hours
After parsing, collapsing, saving as binary and compressing: ~40 GB per day
Collected data for all of June 2006: 1.3 TB of data

Page 50: Introduction to  Network Analysis

User age distribution (self reported)

[Plot: count vs. age]

Page 51: Introduction to  Network Analysis

Number of participants in the conversation

[Plot: count vs. conversation size; limit of 20 users per session]

Page 52: Introduction to  Network Analysis

Activity per day

Data for June 1:
◦ 982,005,323 sessions (conversations)
◦ 980,219,231 2-user conversations
◦ 471,837,591 conversations with 0 exchanged messages
◦ 508,315,719 “good” sessions
◦ 63,949,711 different users talking
◦ 65,921 unknown users talking (users who never log in)

Page 53: Introduction to  Network Analysis

Data statistics

Over June 2006:
242,720,596 users logged in
179,792,538 users engaged in conversations
17,510,905 new users (never logged in before)
More than 30 billion conversations

Page 54: Introduction to  Network Analysis

Age: Number of conversations

[Heat map: age vs. age, scale from Low to High]

Page 55: Introduction to  Network Analysis

Age: Conversation duration

[Heat map: age vs. age, scale from Low to High]

Page 56: Introduction to  Network Analysis

Age: Sent messages per session

[Heat map: age vs. age, scale from Low to High]

Page 57: Introduction to  Network Analysis

Age: Messages per second

[Heat map: age vs. age, scale from Low to High]

Page 58: Introduction to  Network Analysis

Where are the users coming from?

Page 59: Introduction to  Network Analysis

Communication network

Using only 2-user conversations from June 2006 we build a graph:
◦ 179,792,538 nodes
◦ 1,342,246,427 edges
◦ 15,010,572,090 2-user conversations

Page 60: Introduction to  Network Analysis

7-degrees of separation

[Plot: number of nodes (log scale, 10^0 to 10^8) vs. distance in hops (0 to 30)]

Pick a random node, count how many nodes are at distance 1, 2, 3, ... hops

Hops  Nodes
1     10
2     78
3     396
4     8648
5     3299252
6     28395849
7     79059497
8     52995778
9     10321008
10    1955007
11    518410
12    149945
13    44616
14    13740
15    4476
16    1542
17    536
18    167
19    71
20    29
21    16
22    10
23    3
24    2
25    3
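The table can be summarized with the same measures introduced earlier for path lengths. The sketch below computes the total number of reached nodes, the mean distance, the most common distance and the 90% effective distance. Note that these figures describe distances from the one sampled node only, so treating them as network-wide estimates is an assumption.

```python
# Hops -> number of nodes at that distance, copied from the table above.
hops_to_nodes = {
    1: 10, 2: 78, 3: 396, 4: 8648, 5: 3299252, 6: 28395849,
    7: 79059497, 8: 52995778, 9: 10321008, 10: 1955007, 11: 518410,
    12: 149945, 13: 44616, 14: 13740, 15: 4476, 16: 1542, 17: 536,
    18: 167, 19: 71, 20: 29, 21: 16, 22: 10, 23: 3, 24: 2, 25: 3,
}


def hop_stats(table, quantile=0.9):
    """Return (total nodes, mean hops, most common hops, effective distance)."""
    total = sum(table.values())
    mean = sum(h * n for h, n in table.items()) / total
    mode = max(table, key=table.get)
    running = 0
    effective = None
    for h in sorted(table):
        running += table[h]
        if running >= quantile * total:
            effective = h  # first distance covering `quantile` of reached nodes
            break
    return total, mean, mode, effective


total, mean_hops, mode_hops, effective_hops = hop_stats(hops_to_nodes)
```

The most common distance comes out at 7 hops, which matches the slide title.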

Page 61: Introduction to  Network Analysis

…relation to ACTIVE project

In ACTIVE we will perform analytics along three main dimensions:
◦ content (text, tags, semi-structured data)
◦ social network (graph of social linkages)
◦ time

The content dimension is well studied and covered by many text-mining methods

…the static social network analysis aspect will be covered well by the existing methods

…core research will happen on “dynamic social networks”

Page 62: Introduction to  Network Analysis

Conclusion

Network analysis is a very active research topic at the intersection of several areas
◦ …the area deals primarily with graph representation, fundamental to many problems in nature and society
◦ …a currently hot research topic in network analysis is dealing with “dynamic networks”
◦ …in ACTIVE we will perform research and provide solutions for large dynamic social networks extracted from enterprise data