IBM Research 5/27/2007 | Information Flow Prediction and People Mining | Ching-Yung Lin © 2007 IBM Corporation http://w3.ibm.com/ibm/presentations Information Flow Prediction and People Mining Ching-Yung Lin IBM T. J. Watson Research Center May 27, 2007


Page 1:

IBM Research

5/27/2007 | Information Flow Prediction and People Mining | Ching-Yung Lin © 2007 IBM Corporation

Information Flow Prediction and People Mining

Ching-Yung Lin

IBM T. J. Watson Research Center

May 27, 2007

Page 2:

10 Gbit/s Continuous Feed Coming into System

Types of Data

• Speech, text, moving images, still images, coded application data, machine-to-machine binary communication

System Mechanisms

• Telephony: 9.6 Gbit/s (including VoIP)

• Internet
  - Email: 250 Mbit/s (about 500 pieces per second)
  - Dynamic web pages: 50 Mbit/s
  - Instant messaging: 200 Kbit/s
  - Static web pages: 100 Kbit/s
  - Transactional data: TBD

• TV: 40 Mbit/s (equivalent to about 10 stations)

• Radio: 2 Mbit/s (equivalent to about 20 stations)

Data Flow through an Internet Gateway
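As a quick arithmetic check of the rates listed on this slide, the sketch below sums the quoted figures and derives the average email size implied by 250 Mbit/s at about 500 pieces per second. The dictionary keys and function names are illustrative, decimal units (1 Gbit = 1000 Mbit) are assumed, and the TBD transactional feed is left out.

```python
# Feed rates quoted on the slide, converted to Mbit/s.
# "Transactional data: TBD" is omitted because no figure is given.
RATES_MBIT_S = {
    "telephony": 9600.0,
    "email": 250.0,
    "dynamic_web": 50.0,
    "instant_messaging": 0.2,
    "static_web": 0.1,
    "tv": 40.0,
    "radio": 2.0,
}


def total_gbit_s(rates):
    """Sum all feeds and convert Mbit/s to Gbit/s (decimal units assumed)."""
    return sum(rates.values()) / 1000.0


def avg_email_kbytes(rate_mbit_s=250.0, emails_per_s=500.0):
    """Average email size in kilobytes implied by a bit rate and a message rate."""
    bytes_per_second = rate_mbit_s * 1e6 / 8
    return bytes_per_second / emails_per_s / 1e3


if __name__ == "__main__":
    print(f"total feed: {total_gbit_s(RATES_MBIT_S):.2f} Gbit/s")
    print(f"average email size: {avg_email_kbytes():.1f} KB")
```

The listed feeds add up to roughly 9.94 Gbit/s, consistent with the 10 Gbit/s headline figure.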

Page 3:

Network Monitoring and Stream Analysis

[Figure: dataflow graph from inputs to interested MM streams. Rates annotated on the figure: 200-500 MB/s, ~100 MB/s per PE, and 10 MB/s. Protocol nodes: ip, http, ntp, udp, tcp, ftp, rtp, rtsp, sess-video, sess-audio. Processing stages: interest routing (keywords, id), packet content analysis, advanced content analysis, interest filtering.]

By IBM Dense Information Gliding Team

Page 4:

Borrow this from Hoover...

Page 5:

One of the issues – Speech Recognition, Speaker & Social Network Detection

[Figure: Streams A, B, C, and D feed speaker detection, which yields a "talks to" network among Olivier, Mihalis, Ching-Yung, Upendra, and Deepak. A second panel, "Denoising & Social Network Analysis", shows the network after denoising using the social network, a fusion technique, and an iterative method.]

What can be achieved by combining content analysis and social network analysis?

Page 6:

Challenge – every node in the network is unique

Photo Source: New York Times, 3/2/2005

Page 7:

Part I: Dynamic Probabilistic Complex Network and Information Flow

Page 8:

The Most Difficult Challenge: State of the Art?

Social networks in the sociological and statistical fields: focus on (1) overall network characteristics, (2) dynamic random graphs, (3) binary edges, etc. They do not consider probabilistic nodes/edges or individual nodes/edges.

Epidemic networks and computer virus networks: focus on (1) overall network characteristics (when will an outbreak occur?), (2) regular / random graphs. They do not focus on individual nodes/edges.

(Computer) communication networks: focus on (1) packet transmission, where information is not duplicated, or (2) broadcasting, which does not consider individual nodes/edges or complex network topology.

WWW: focus on (1) topology description, (2) binary edges and ranked nodes (e.g., Google PageRank). It does not consider probabilistic edges.

Our Objectives: Find important people, community structures, or information flow in a network that is dynamic, probabilistic, and complex, in order to allocate resources in a large-scale mining system.

Page 9:

What is a Dynamic Probabilistic Complex Network?

Page 10:

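Modeling a Dynamic Probabilistic Complex Network

[Assumption] A DPCN can be represented by a Dynamic Transition Matrix P(t), a Dynamic Vertex Status Random Vector Q(t), and two dependency functions f_M and g_M.

\[
\mathbf{P}(t) =
\begin{bmatrix}
\mathbf{p}_{1,1}(t) & \mathbf{p}_{2,1}(t) & \cdots & \mathbf{p}_{N,1}(t) \\
\mathbf{p}_{1,2}(t) & \mathbf{p}_{2,2}(t) & \cdots & \mathbf{p}_{N,2}(t) \\
\vdots & \vdots & \ddots & \vdots \\
\mathbf{p}_{1,N}(t) & \mathbf{p}_{2,N}(t) & \cdots & \mathbf{p}_{N,N}(t)
\end{bmatrix},
\qquad
\mathbf{Q}(t) =
\begin{bmatrix}
\mathbf{q}_{1}(t) \\ \mathbf{q}_{2}(t) \\ \vdots \\ \mathbf{q}_{N}(t)
\end{bmatrix},
\]

where

\[
\mathbf{p}_{i,j}(t) =
\begin{bmatrix}
\Pr(y_{i,j}(t) = SE_1) \\ \Pr(y_{i,j}(t) = SE_2) \\ \vdots \\ \Pr(y_{i,j}(t) = SE_E)
\end{bmatrix},
\qquad
\sum_{e=1}^{E} \Pr(y_{i,j}(t) = SE_e) = 1,
\]

y_{i,j}(t): the status value of edge i → j at time t,

\[
\mathbf{q}_{i}(t) =
\begin{bmatrix}
\Pr(x_i(t) = SV_1) \\ \Pr(x_i(t) = SV_2) \\ \vdots \\ \Pr(x_i(t) = SV_V)
\end{bmatrix},
\qquad
\sum_{v=1}^{V} \Pr(x_i(t) = SV_v) = 1,
\]

x_i(t): the status value of vertex i at time t, and

\[
\mathbf{P}(t+\Delta t) = f_M(\mathbf{Q}(t), \mathbf{P}(t)),
\qquad
\mathbf{Q}(t+\Delta t) = g_M(\mathbf{P}(t+\Delta t), \mathbf{Q}(t), \mathbf{P}(t)).
\]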

Page 11:

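Information Flow in Dynamic Probabilistic Complex Network (let’s call it the Behavioral Information Flow (BIF) Model)

[Assumption] Edges can be represented by a four-state S-D-A-R (Susceptible-Dormant-Active-Removed) Markov model. Nodes can be represented by a three-state S-A-I (Susceptible-Active-Informed) model.

\[
\mathbf{P}(t) =
\begin{bmatrix}
\mathbf{p}_{1,1}(t) & \mathbf{p}_{2,1}(t) & \cdots & \mathbf{p}_{N,1}(t) \\
\mathbf{p}_{1,2}(t) & \mathbf{p}_{2,2}(t) & \cdots & \mathbf{p}_{N,2}(t) \\
\vdots & \vdots & \ddots & \vdots \\
\mathbf{p}_{1,N}(t) & \mathbf{p}_{2,N}(t) & \cdots & \mathbf{p}_{N,N}(t)
\end{bmatrix},
\qquad
\mathbf{Q}(t) =
\begin{bmatrix}
\mathbf{q}_{1}(t) \\ \mathbf{q}_{2}(t) \\ \vdots \\ \mathbf{q}_{N}(t)
\end{bmatrix},
\]

where

\[
\mathbf{p}_{i,j}(t) =
\begin{bmatrix}
\Pr(y_{i,j}(t) = S) \\ \Pr(y_{i,j}(t) = D) \\ \Pr(y_{i,j}(t) = A) \\ \Pr(y_{i,j}(t) = R)
\end{bmatrix},
\qquad
\mathbf{q}_{i}(t) =
\begin{bmatrix}
\Pr(x_i(t) = S) \\ \Pr(x_i(t) = A) \\ \Pr(x_i(t) = I)
\end{bmatrix},
\]

the four edge-state probabilities sum to 1, the three node-state probabilities sum to 1, and

\[
\mathbf{P}(t+\Delta t) = f(\mathbf{M}, \mathbf{Q}(t), \mathbf{P}(t)),
\qquad
\mathbf{Q}(t+\Delta t) = g(\mathbf{P}(t+\Delta t), \mathbf{Q}(t), \mathbf{P}(t)).
\]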

Page 12:

Major Difference between BIF and Prior Modeling Methods in Epidemic Research and Computer Virus Fields

Prior models: model human nodes as S-I-R (Susceptible, Infected, and Removed).

They did not consider individual nodes' behavior or differences in network structure/topology, and they did not consider edge status.

We propose to model edge status as an (autonomous) S-D-A-R Markov model (Susceptible, Dormant, Active, and Removed).

We propose to model human node behavior as S-A-I (Susceptible, Active, and Informed).

Page 13:

Edges are Markov State Machines, Nodes are not

State transitions of edges: S-D-A-R model (Susceptible, Dormant, Active, and Removed). This indicates how the state of an edge changes over time.

[Figure, edge view: state chain S, D, A, R, with the transition out of S fired by the trigger.]

States of nodes: S-A-I model (Susceptible, Active, and Informed). The trigger occurs when the start node of the edge changes from state S to state I.

[Figure, node view and network view: node states S, A, and I, with the trigger at state I.]

Page 14:

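Edge State Probability and Network Configuration Model

Nodes and Edges

\[
\mathbf{P}(t+\Delta t) = f(\mathbf{M}, \mathbf{Q}(t), \mathbf{P}(t)),
\]

\[
\mathbf{M} =
\begin{bmatrix}
(\alpha_{1,1}, \beta_{1,1}, \gamma_{1,1}) & (\alpha_{2,1}, \beta_{2,1}, \gamma_{2,1}) & \cdots & (\alpha_{N,1}, \beta_{N,1}, \gamma_{N,1}) \\
(\alpha_{1,2}, \beta_{1,2}, \gamma_{1,2}) & (\alpha_{2,2}, \beta_{2,2}, \gamma_{2,2}) & \cdots & (\alpha_{N,2}, \beta_{N,2}, \gamma_{N,2}) \\
\vdots & \vdots & \ddots & \vdots \\
(\alpha_{1,N}, \beta_{1,N}, \gamma_{1,N}) & (\alpha_{2,N}, \beta_{2,N}, \gamma_{2,N}) & \cdots & (\alpha_{N,N}, \beta_{N,N}, \gamma_{N,N})
\end{bmatrix}
\]

α_{i,j} = 0: no edge between i and j. Our KDD 2005 paper is a special case in which α_{i,j} = 1 or 0 and (β_{i,j}, γ_{i,j}) are not modeled.

M is the Network Configuration Model, which is learned by training. It includes the network topology information, the long-term edge probability, and the delay parameters.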

Page 15:

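Define Edge State Probability Update Function

\[
\mathbf{P}(t+\Delta t) = f(\mathbf{M}, \mathbf{Q}(t), \mathbf{P}(t))
\]

Given three different cases:

1. On trigger (x_i(t+Δt) = I, x_i(t) ≠ I):

\[
\mathbf{p}_{i,j}(t+\Delta t) = \mathbf{F}_{i,j}\,\mathbf{p}_{i,j}(t),
\qquad
\mathbf{F}_{i,j} =
\begin{bmatrix}
0 & 0 & 0 & 0 \\
\alpha_{i,j} & 1-\beta_{i,j} & 0 & 0 \\
0 & \beta_{i,j} & 1-\gamma_{i,j} & 0 \\
1-\alpha_{i,j} & 0 & \gamma_{i,j} & 1
\end{bmatrix}
\]

2. No trigger, node not informed yet (x_i(t+Δt) ≠ I, x_i(t) ≠ I):

\[
\mathbf{p}_{i,j}(t+\Delta t) = \mathbf{p}_{i,j}(t)
\]

3. No trigger, node has been informed (x_i(t+Δt) = I, x_i(t) = I):

\[
\mathbf{p}_{i,j}(t+\Delta t) = \mathbf{F}_{i,j}\,\mathbf{p}_{i,j}(t)
\]

[Figure: edge state chain S, D, A, R, with the transition out of S fired by the trigger.]

Therefore, taking the probabilities of the node states into account, we get the Edge State Probability Update function f(.) such that:

\[
\mathbf{p}_{i,j}(t+\Delta t) = \Pr(x_i = I)\,\mathbf{F}_{i,j}\,\mathbf{p}_{i,j}(t) + \bigl(1 - \Pr(x_i = I)\bigr)\,\mathbf{p}_{i,j}(t)
\]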
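One way to read the three cases on this slide is as a mixture: with the probability that the start node is informed, the edge advances one step along its S-D-A-R chain; otherwise it stays put. The sketch below is a minimal illustration under that reading. The parameter names `alpha`, `beta`, `gamma`, the exact layout of the transition matrix, and the function names are assumptions made for this example, since the symbols on the original slide did not survive extraction.

```python
def transition_matrix(alpha, beta, gamma):
    """Assumed column-stochastic S-D-A-R matrix.

    Columns are the current state (S, D, A, R); rows are the next state.
    alpha: probability the edge is used at all (S -> D, else S -> R).
    beta:  per-step probability of leaving Dormant (D -> A).
    gamma: per-step probability of finishing transmission (A -> R).
    """
    return [
        [0.0,         0.0,        0.0,         0.0],  # next state S
        [alpha,       1.0 - beta, 0.0,         0.0],  # next state D
        [0.0,         beta,       1.0 - gamma, 0.0],  # next state A
        [1.0 - alpha, 0.0,        gamma,       1.0],  # next state R
    ]


def mat_vec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def update_edge(p, pr_informed, alpha, beta, gamma):
    """Mix the 'advance' and 'stay' cases by Pr(start node is informed)."""
    advanced = mat_vec(transition_matrix(alpha, beta, gamma), p)
    return [pr_informed * a + (1.0 - pr_informed) * s
            for a, s in zip(advanced, p)]


if __name__ == "__main__":
    p0 = [1.0, 0.0, 0.0, 0.0]  # edge starts fully Susceptible
    print(update_edge(p0, pr_informed=0.5, alpha=0.5, beta=0.25, gamma=0.5))
```

Because every column of the assumed matrix sums to 1, the updated vector remains a probability distribution over the four edge states.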

Page 16:

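Nodes: State Transitions Determined by Incoming Edges

Node State Probability Update Function g(.):

\[
\mathbf{Q}(t+\Delta t) = g(\mathbf{P}(t+\Delta t), \mathbf{Q}(t), \mathbf{P}(t)),
\]

\[
\mathbf{q}_{i}(t+\Delta t) = \mathbf{Q}_{i}\,\mathbf{q}_{i}(t),
\qquad
\mathbf{Q}_{i} =
\begin{bmatrix}
\prod_{n \in \Omega_{V,i}} (1-\lambda_{n,i}) & 0 & 0 \\
1-\prod_{n \in \Omega_{V,i}} (1-\lambda_{n,i}) & \prod_{n \in \Omega_{V,i}} (1-\mu_{n,i}) & 0 \\
0 & 1-\prod_{n \in \Omega_{V,i}} (1-\mu_{n,i}) & 1
\end{bmatrix},
\]

where

\[
\Pr\bigl(\exists\, n \in \{1 \ldots N\}:\ y_{n,i}(t+\Delta t) = R,\ y_{n,i}(t) = A\bigr) = 1 - \prod_{n \in \Omega_{V,i}} (1-\mu_{n,i}),
\]

λ_{n,i} is the corresponding per-edge probability for the S → A transition, and Ω_{V,i} is the set of all source nodes of the incoming edges of Node i:

\[
\Omega_{V,i} = \{\, n \mid n \in \{1 \ldots N\},\ \alpha_{n,i} > 0 \,\}.
\]

[Figure, network view: node states S, A, and I, with the trigger at state I.]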
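The node update can be sketched as follows: a node leaves a state as soon as at least one incoming edge causes the transition, so the probability of staying is a product over incoming edges. This is a minimal illustration only. The argument names `activate_probs` and `inform_probs` are hypothetical stand-ins for the per-edge terms whose symbols were lost from the slide, and the lower-triangular S-A-I structure is an assumption drawn from the surviving layout.

```python
from math import prod


def update_node(q, activate_probs, inform_probs):
    """One step of an assumed S-A-I node update.

    q:              [Pr(S), Pr(A), Pr(I)] for the node at time t.
    activate_probs: per-incoming-edge probabilities of causing S -> A.
    inform_probs:   per-incoming-edge probabilities of causing A -> I,
                    e.g. the chance that the edge moves from A to R.
    """
    stay_s = prod(1.0 - x for x in activate_probs)  # no edge activates the node
    stay_a = prod(1.0 - x for x in inform_probs)    # no edge finishes informing
    s, a, i = q
    return [
        stay_s * s,
        (1.0 - stay_s) * s + stay_a * a,
        (1.0 - stay_a) * a + i,
    ]


if __name__ == "__main__":
    # A susceptible node with two incoming edges, each activating with prob. 0.5.
    print(update_node([1.0, 0.0, 0.0], [0.5, 0.5], [0.5]))
```

With no incoming edges both products are empty and equal 1, so the node state is left unchanged.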

Page 17:

An Application of Information Flow Prediction – find important people

Who are the most likely people to talk about this information at a specific time given the current observation?

\[
(m, n) = \arg\max_{m,n \in \{1 \ldots N\}} \Pr\bigl(y_{m,n}(t) = A\bigr)
\quad \text{given } (\mathbf{P}(t_0), \mathbf{Q}(t_0)) \text{ or } \mathbf{Q}(t_0)
\]

For a given concrete observation, the values in the given priors P(t_0), Q(t_0) are either 0 or 1.

For speaker recognition results, the priors can be confidence values between 0 and 1.

Page 18:

Case Study I – Switchboard data from 679 people

Monte Carlo Method: simulate each DPCN information flow 1000 times.

It takes 12 seconds to predict the process with MC simulation. (For a given model, testing all 679 nodes takes a PC 130 minutes to calculate the probabilities when the information flow starts from each of the 679 different seeds.)
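The Monte Carlo idea can be sketched as below: run many randomized spreads from one seed and report, for every node, the fraction of runs in which it received the information. This is a deliberately simplified independent-cascade-style simulation, not the BIF model itself. It ignores the Dormant and Active delays and uses a single hypothetical firing probability per edge; the graph format and function names are assumptions made for the example.

```python
import random


def simulate_once(edges, seed, rng):
    """One randomized spread; each edge fires at most once, when its source is first informed.

    edges: {source: [(target, probability), ...]} (hypothetical format).
    """
    informed = {seed}
    frontier = [seed]
    while frontier:
        source = frontier.pop()
        for target, prob in edges.get(source, []):
            if target not in informed and rng.random() < prob:
                informed.add(target)
                frontier.append(target)
    return informed


def receive_probabilities(edges, seed, runs=1000, rng_seed=0):
    """Estimate Pr(node receives the information) as a frequency over runs."""
    rng = random.Random(rng_seed)  # fixed seed keeps the sketch reproducible
    counts = {}
    for _ in range(runs):
        for node in simulate_once(edges, seed, rng):
            counts[node] = counts.get(node, 0) + 1
    return {node: count / runs for node, count in counts.items()}


if __name__ == "__main__":
    toy_edges = {0: [(1, 1.0), (2, 0.0)], 1: [(3, 0.5)]}
    print(receive_probabilities(toy_edges, seed=0))
```

Repeating `receive_probabilities` for every possible seed corresponds to the all-seeds experiment described on the slide.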

[Figure: "The Probabilities of the Nodes Receiving Information". x-axis: node ID (1 to 679); y-axis: probability (0 to 0.3); series: Seed ID 100.]

Page 19:

The distribution histogram of the alpha values of the edges in the Enron dataset.

[Figure: histogram with a logarithmic count axis (1 to 100000). Series: All Topics, Market Opportunity, California Market, North America Product.]

Page 20:

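Noise Factor I – Impact of Classification Error from Speaker Recognition

Assume the classification precision rate on the speaker (node) i is ρ_i, and the false alarm rate on the speaker i is φ_i.

Then the expected number of times that the node is counted is:

\[
\hat{K}_i = \rho_i K_i + 2\varphi_i Z
\]

And the expected number of times that the link is counted is:

\[
\hat{L}_{i,j} = \rho_i \rho_j L_{i,j} + \varphi_i \varphi_j Z
\]

Therefore,

\[
\hat{\alpha}_{i,j} = \frac{\hat{L}_{i,j}}{\hat{K}_i} = \frac{\rho_i \rho_j L_{i,j} + \varphi_i \varphi_j Z}{\rho_i K_i + 2\varphi_i Z}
\]

If we assume a universal precision and false alarm rate at all speakers, then:

\[
\hat{\alpha}_{i,j} = \frac{\rho^2 L_{i,j} + \varphi^2 Z}{\rho K_i + 2\varphi Z}
\]

Assume the average waiting time of links and the average transmission duration of links are the same regardless of the links observed, then:

\[
\hat{\beta}_{i,j} = \beta_{i,j} \quad \text{and} \quad \hat{\gamma}_{i,j} = \gamma_{i,j}
\]

If we assume the false alarm rate is small and can be neglected when the number of nodes is large, then:

\[
\hat{\alpha}_{i,j} = \rho\, \alpha_{i,j}
\]

[Figure: truth versus detected counts, with labels K, Z, ρ_i K, and φ_i 2Z.]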

Page 21:

Speaker Recognition Accuracy can be Improved by Fusion of Original Speaker Recognition and Predicted Node Probability

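We can use this fusion method to combine both the speaker recognition result and the estimated node probability:

\[
\hat{\rho}_i = \frac{\rho_i \pi_i}{\rho_i \pi_i + \sum_{k} \varphi_{i,k}\, \pi_k},
\]

which is guaranteed to be increasing when π_i > π_k. Here π_i is the node probability predicted by BIF, and φ_{i,k} is the false alarm rate of the Speaker i recognizer on speaker k.

[Figure: Before fusion, the Speaker i Recognizer has precision ρ_i and false alarm rates φ_{i,k1}, φ_{i,k2}, φ_{i,k3}. After fusion with the BIF prediction, the recognizer output is combined with the predicted node probabilities.]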
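The fusion step can be read as a Bayes-style reweighting: the recognizer's precision for speaker i is weighted by the predicted probability that node i is active, and normalized against the false-alarm mass coming from the other candidate speakers. The sketch below is a minimal illustration under that reading; the function name, argument names, and toy numbers are assumptions, not values from the talk.

```python
def fuse(precision, node_prob, false_alarms):
    """Fuse a recognizer precision with predicted node probabilities.

    precision:    precision rate of the recognizer for speaker i.
    node_prob:    predicted probability that node i is the speaker.
    false_alarms: list of (false_alarm_rate_i_k, node_prob_k) pairs for
                  the other speakers k that can be mistaken for i.
    """
    numerator = precision * node_prob
    denominator = numerator + sum(rate * prob_k for rate, prob_k in false_alarms)
    return numerator / denominator if denominator > 0 else 0.0


if __name__ == "__main__":
    # Equal priors: fusion leaves the balance at 0.5.
    print(fuse(0.5, 0.5, [(0.5, 0.5)]))
    # Node i predicted more likely than its confuser: the fused value rises.
    print(fuse(0.5, 0.75, [(0.5, 0.25)]))
```

In the second call the predicted probability of node i exceeds that of the confusable node, so the fused value is higher than in the first, which matches the monotonicity claim on the slide.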

Page 22:

Recognition Result from Switchboard-2 Telephone Conversation Set

Improvement in recognition accuracy on Node 171. The x-axis is the number of times the model has been updated based on the recognition result after fusion. The y-axis represents the recognition accuracy. In the six test cases, Node 171 is usually confused with Node 218 or Node 164. In the first two cases, there is no false alarm from the classification of Node 218 or 164. In the next two cases, they are usually confused with each other. In the last two cases, the false alarm probability from Node 218 or 164 is 0.3.

[Figure: recognition accuracy (0 to 1.2) versus update step (0 to 5). Series: Node 218, no false alarm; Node 164, no false alarm; Node 218, mutually confused; Node 164, mutually confused; Node 218, prob. false alarm = 0.3; Node 164, prob. false alarm = 0.3.]

Page 23:

Case Study (II) – our experiments on Enron Emails

Page 24:

Modeling and Predicting Topic-Related Personal Information Flow

Content-Time-Relation Model: combines content, time, and social relation information with Dirichlet allocations and a causal Bayesian network. [Song et al., KDD, August 2005] (1st paper combining content analysis and social network analysis)

Given the sender and the time of an email:

1. Get the probability of a topic given the sender

2. Get the probability of the receiver given the sender and the topic

3. Get the probability of a word given the topic

[Figure: plate diagram with variables a, d, z, w, r, S, and t and plates A, N, D, T, and Tm. Boxes represent iteration; a legend marks the observations.]

a: sender/author, z: topic, S: social network (Exponential Random Graph Model / p* model), D: document/email, r: receivers, w: content words, N: word set, T: topic
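The three lookups listed on this slide can be chained by summing over topics, which gives the probability that a sender addresses a given receiver with a given word. The sketch below only illustrates that chaining on toy tables. The dictionary layout, the function name, and all numbers are hypothetical; learning the tables (the Dirichlet allocation and Bayesian network part of the model) is not shown.

```python
def receiver_word_prob(sender, receiver, word, p_topic, p_receiver, p_word):
    """Sum over topics z of P(z | sender) * P(receiver | sender, z) * P(word | z).

    p_topic:    {sender: {topic: prob}}
    p_receiver: {(sender, topic): {receiver: prob}}
    p_word:     {topic: {word: prob}}
    """
    total = 0.0
    for topic, prob_topic in p_topic[sender].items():
        prob_receiver = p_receiver.get((sender, topic), {}).get(receiver, 0.0)
        prob_word = p_word.get(topic, {}).get(word, 0.0)
        total += prob_topic * prob_receiver * prob_word
    return total


if __name__ == "__main__":
    # Toy tables, invented for illustration only.
    p_topic = {"a": {"t1": 0.5, "t2": 0.5}}
    p_receiver = {("a", "t1"): {"b": 1.0}, ("a", "t2"): {"b": 0.5, "c": 0.5}}
    p_word = {"t1": {"power": 0.5}, "t2": {"power": 0.25}}
    print(receiver_word_prob("a", "b", "power", p_topic, p_receiver, p_word))
```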

Page 25:

Corporate Topic Trend Analysis Example: Yearly repeating events

[Figure: "Topic Trend Comparison". x-axis: month (Jan to Nov); y-axis: popularity (0 to 0.03); series: Topic45 (y2000), Topic45 (y2001), Topic19 (y2000), Topic19 (y2001).]

Topic 45, which concerns a scheduling issue, peaks from June to September. Topic 19 concerns a meeting issue. The trends repeat from year to year.

Page 26:

Topic Detection and Key People Detection of “California Power” Match Their Real-Life Roles

[Figure (a): "Topic Analysis for Topic 61". x-axis: Jan-00 to Oct-01; y-axis: popularity (0 to 0.018).]

Key Words
power        0.089361
California   0.088160
electrical   0.087345
price        0.055940
energy       0.048817
generator    0.035345
market       0.033314
until        0.030681

Key People
Jeff_Dasovich     0.249863
James_Steffes     0.139212
Richard_Shapiro   0.096179
Mary_Hain         0.078131
Richard_Sanders   0.052866
Steven_Kean       0.044745
Vince_Kaminski    0.035953

The “California Energy Crisis” event occurred in exactly this time period. The key people were active in this event, except Vince_Kaminski …

Page 27:

Social Network of Enron Managers

If we try to find social networks based on all communications, it is difficult.

Page 28:

Information Flow in Enron – California Market

Actor 151 (Rosalee Fleming, assistant to Enron CEO Ken L.) is the key information spreader of this issue.

Page 29:

Information Flow in Enron – Market Opportunities

Rosalee Fleming also played an important role in “Market Opportunities.” She received info from Actor 119 (Mike Carson) and Actor 23 (James Steffes, VP of Gov. Affairs of Enron). Actor 68 (Rod Hayslett, CFO) is also a major information spreader.

Page 30:

Information Flow in Enron – North American Products

Two disjoint communities can be observed. Actor 21 (Keith Holst) and Actor 142 (Dan Hyvl) are the main bridges between the two communities.

Page 31:

This kind of analysis is wonderful, but...

We cannot wait until our company has a scandal and goes bankrupt...

What kinds of valuable applications can come out of network analysis?

Page 32:

Part II: Small Blue

Page 33:

Social Network -- A key differentiator for corporate performance

Informal social networks within formal organizations are a major factor affecting companies’ performance:

Krackhardt (CMU, 2005) showed that companies with strong informal networks perform five or six times better than those with weak networks.

Brydon (VisiblePath, 2006) reported the following performance gains for companies utilizing social networks:

• 16x in sales

• 4x in marketing

• 10x in hiring

Page 34:

We hope social network and expertise mining can dramatically increase our colleagues’ knowledge and collaboration

Page 35:

Social Networks -- Beyond the organizational chart

Source: Cross, R., Parker, A., Prusak, L. & Borgatti, S.P. 2001. Knowing What We Know: Supporting Knowledge Creation and Sharing in Social Networks. Organizational Dynamics 30(2): 100-120.

Organization charts are not the best indicator of how work gets done

Senior people are not always central; peripheral people can represent untapped knowledge

Making the network visible makes it actionable and becomes the basis for a collaboration action plan

Provided by Drs. Tony Mobbs and Kate Ehrlich, IBM

Page 36:

Group and Roles

[Figure: network diagram of Andy, Bob, Carl, Darren, Earl, Frank, Indojit, Gerry, Harry, Jeff, Sam, Karen, Leo, Ming, and Neo, grouped into Marketing, Finance, and Manufacturing.]

Central people: Sam. Could be a bottleneck or holding the group together.

Peripheral people: Earl. Goes to others, but no one goes to him for information. At risk of leaving. Potentially unrealized expertise.

Sub-groups: Group split by function. Very little information shared across groups.

This slide is excerpted from SNA Theory, Concepts and Practice by Dr. T. Mobbs, BCS and Dr. K. Ehrlich, Research
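The central and peripheral roles described on this slide can be approximated with simple degree counts on a "who goes to whom for information" graph. The sketch below is a minimal illustration: the edge list is hypothetical (the actual links in the figure are not recoverable from the transcript), and in-degree is only one of several centrality measures that could be used.

```python
def degree_roles(asks):
    """Find the most sought-after person and the people nobody turns to.

    asks: iterable of (seeker, source) pairs meaning
          'seeker goes to source for information'.
    Returns (central, peripheral): the person with the highest in-degree,
    and the sorted list of people who ask others but are never asked.
    """
    in_degree, out_degree = {}, {}
    for seeker, source in asks:
        out_degree[seeker] = out_degree.get(seeker, 0) + 1
        in_degree[source] = in_degree.get(source, 0) + 1
        in_degree.setdefault(seeker, 0)
        out_degree.setdefault(source, 0)
    central = max(in_degree, key=in_degree.get)
    peripheral = sorted(
        person for person in in_degree
        if in_degree[person] == 0 and out_degree[person] > 0
    )
    return central, peripheral


if __name__ == "__main__":
    # Hypothetical links, invented for illustration only.
    toy_asks = [("Earl", "Sam"), ("Andy", "Sam"), ("Karen", "Sam"), ("Sam", "Andy")]
    print(degree_roles(toy_asks))
```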

Page 37:

Some Roles are especially critical

[Figure: the same network of Andy, Bob, Carl, Darren, Earl, Frank, Indojit, Gerry, Harry, Jeff, Karen, Leo, Ming, and Neo, grouped into Marketing, Finance, and Manufacturing, with Sam no longer present.]

What happens if Sam leaves the group through layoffs, job reassignment, attrition, merger, or retirement?

This slide is excerpted from SNA Theory, Concepts and Practice by Dr. T. Mobbs, BCS and Dr. K. Ehrlich, Research

Page 38:

Relationships are multi-dimensional and (traditionally) uncovered through network questions

Communication: How often do you communicate with this person?

Innovation: How often do you turn to this person for new ideas?

Advice: How often do you seek advice from this person before making an important decision?

Awareness: I am aware of this person’s knowledge and skills.

Learning: How likely are you to rely on this person for advice on new methods and processes?

Valued Expertise: How likely are you to turn to this person for specialized expertise?

Trust: I believe there is a high personal cost in seeking advice or support from this person.

Access: I believe this person will respond to my request in a reasonable and timely manner.

Energy: I generally feel energized when I interact with this person.

Categories shown on the slide: Actions, Awareness, Emotional

Provided by Drs. Tony Mobbs and Kate Ehrlich, IBM

Page 39:

Personal Network preferred source for information and collaboration

• Under-utilisation of electronic products and services.

• Content has lower performance impact / not realising full potential benefits.

• Widely inconsistent working practices.

Existing Resources Provided (reached through W3 stubs and clients): PSN, Methods, Education, Communities, other w3 content, KnowledgeView, Project Repositories, Collaboration/Project Tools

Standalone, disparate, poor integration, large number of sources, steep learning curve (identify, understand & synthesise into specific work context), difficult to locate, choose & use.

GBS Practitioner with task in project / delivery environment

Forces:
• Time constrained
• Delivery activity focus
• What gets measured gets done
• Expedience
• Perceived value (return on time investment)

Personal Network (preferred / primary mode)

High reliance on:
• 50% ~ 75%: Personal networks (Gartner Report, 2006)
• Hard-drive materials
• What has worked for them previously (personal experience)

leads to

• fast turnaround of request
• specific response
• small number of relevant items returned
• recommendation of quality
• ability to quickly understand the supplied resource & determine relevant parts
• additional context / value-add info not available in electronic materials

Who knows what? How to reach them? Who plays what hidden roles?

Page 40:

Mining Expertise, Interests and Social Network

People can be “known” by:

public resources:
• publications
• personal webpages
• blogs
• presentations
• wiki

organizational resources:
• patent applications
• bluepages

personal resources:
• emails
• instant messaging
• meeting
• phone calls
• face-to-face interactions

Expertise can also be inferred from her friends’ recommendations or expertise.

[Figure annotations: the resources range from private to public; timely & abundant resources for expertise modeling.]

Page 41:

SmallBlue Find

SmallBlue Reach

SmallBlue Ego

SmallBlue Connect

SmallBlue Expand

SmallBlue Inference Engines and Servers

SmallBlue Clients (Distributed Automatic Social Sensors)

Private & Personalized

Public & Personalized

Public

External Data

Bluepages, BlueGroups, CommunityMap, BlogCentral, IBM Forum, KnowledgeView, Social Bookmark

My friends’ social values to me

Evolution of my Ego net

My personal network (Ego net) inferred from my Notes emails in server/local/archive and SameTime chats

Inference of my understanding on my friends’ expertise

Corporate-wise ranked experts

Ranked experts in my extended personal network, in a business unit and/or in a country

Only Public Information is shown

My social paths to her: which friends can introduce her, which friends work with her, .. trust, awareness, collaboration.

Her public postings, profiles, and communities to judge whether she is the right person.

Who I may want to know..

Which communities I may want to join..

Which documents I may want to look at

how to reach a person

social network analysis of Top-K experts

SNA of a formal group, a bluegroup or a community

social network analysis of a list of people

Other IBMers’ EgoNets

Other IBMers’ Expertise Inferences

I cannot see their communications, EgoNets nor Expertise Inferences

social network info

user searches for experts or a person

social network analysis (SNA): who are the key persons in this network? who are the major hubs? who are the major bridges?

Page 42: IBM Research 5/27/2007 | Information Flow Prediction and People Mining | Ching-Yung Lin Presentation subtitle: 20pt Arial Regular, teal R045 | G182 | B179


Major Use of SmallBlue Find

Find out who the experts are for any search term. (Right now, zillions of possible terms.)

Rank them based on collaborative expert recommendation.

Can show experts based on:

the whole corporation

business unit

country

my personal proximity

Page 43: IBM Research 5/27/2007 | Information Flow Prediction and People Mining | Ching-Yung Lin Presentation subtitle: 20pt Arial Regular, teal R045 | G182 | B179


Collaborative Expert Recommendation

Combine everyone’s knowledge of the expertise of our colleagues.

The more recommendations from more colleagues, the higher the score.

The more recommendations from my trusted colleagues, the higher the score.

The higher the recommendation scores from colleagues, the higher the overall score.

Combining all IBMers’ knowledge, we can make an advanced expert finding search engine.

Utilizing the expert search engine, we can enhance all IBMers’ knowledge and social connections.
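
The scoring rules on this slide can be illustrated with a small sketch. The slides do not give SmallBlue's actual formula, so this is only one possible reading: each recommendation contributes its score multiplied by how much the searcher trusts the recommender. The function name `rank_experts`, the `default_trust` parameter, and the sample names and numbers are all hypothetical.

```python
from collections import defaultdict


def rank_experts(recommendations, trust, default_trust=0.1):
    """Rank candidates by a trust-weighted sum of recommendation scores.

    recommendations: iterable of (recommender, candidate, score), score >= 0
    trust: dict mapping recommender -> the searcher's trust in them (>= 0)
    default_trust: assumed weight for recommenders the searcher does not know
    """
    totals = defaultdict(float)
    for recommender, candidate, score in recommendations:
        weight = trust.get(recommender, default_trust)
        # More recommenders, more trusted recommenders, and higher scores
        # can each only increase a candidate's total.
        totals[candidate] += weight * score
    # Highest total first; ties broken alphabetically for stable output.
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical data: who recommends whom for one search term.
recs = [
    ("bob", "alice", 0.9),
    ("carol", "alice", 0.6),
    ("dave", "erin", 0.9),
]
my_trust = {"bob": 1.0, "carol": 0.5}  # dave is unknown to the searcher
ranking = rank_experts(recs, my_trust)
print(ranking)
```

Because the weights depend on the searcher's own trust values, two users issuing the same query could see different rankings, which matches the "my personal proximity" view listed for SmallBlue Find.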

Page 44: IBM Research 5/27/2007 | Information Flow Prediction and People Mining | Ching-Yung Lin Presentation subtitle: 20pt Arial Regular, teal R045 | G182 | B179


SmallBlue Reach Paths help users to reach another person

SmallBlue Reach Paths show the shortest paths for me to reach a person up to 6 degrees away.

SmallBlue Reach Paths can be initiated from any one of three SmallBlue applications.

Can be used for:

Access -- knowing who can help introduce me to this person.

Trust -- knowing who in my social networks knows this person.

Get Familiar with – knowing what kinds of people are in contact with this person.

Initiate Communication – who do we know in common.
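
Finding the shortest paths to a person within 6 degrees can be sketched with a depth-limited breadth-first search. This is a generic illustration, not SmallBlue's implementation; the function `shortest_paths`, the adjacency-dict graph format, and the sample names are assumptions made for the example.

```python
from collections import deque


def shortest_paths(graph, source, target, max_degrees=6):
    """Return every shortest path from source to target as a list of nodes.

    graph: dict mapping a person to an iterable of direct contacts.
    Returns [] when the target is more than max_degrees hops away.
    """
    if source == target:
        return [[source]]
    dist = {source: 0}
    parents = {source: []}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        # Do not expand past the target or beyond the degree limit.
        if node == target or dist[node] >= max_degrees:
            continue
        for nbr in graph.get(node, ()):
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                parents[nbr] = [node]
                queue.append(nbr)
            elif dist[nbr] == dist[node] + 1:
                # Another equally short route into nbr.
                parents[nbr].append(node)
    if target not in dist:
        return []

    def build(node):
        # Walk parent links back to the source to enumerate the paths.
        if node == source:
            return [[source]]
        return [path + [node] for p in parents[node] for path in build(p)]

    return build(target)


# Hypothetical network: two different friends can introduce me to carol,
# who in turn knows dana.
graph = {
    "me": ["alice", "bob"],
    "alice": ["me", "carol"],
    "bob": ["me", "carol"],
    "carol": ["alice", "bob", "dana"],
    "dana": ["carol"],
}
for path in shortest_paths(graph, "me", "dana"):
    print(" -> ".join(path))
```

Returning all shortest paths rather than a single one fits the uses listed above: each alternative path names a different colleague who could make the introduction.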

Page 45: IBM Research 5/27/2007 | Information Flow Prediction and People Mining | Ching-Yung Lin Presentation subtitle: 20pt Arial Regular, teal R045 | G182 | B179


SmallBlue Ego

How healthy is my personal social capital?

What is the social value of Alice to me?

What are the changes and trends in my social capital? For instance, I have to talk to Alice soon. She is valuable to me in terms of social connections, and she is drifting out of my Ego net circle.
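
The slides do not say how SmallBlue Ego measures social value or detects that someone is leaving the Ego net. As a rough sketch of the "Alice is drifting away" example, one could compare interaction counts between an earlier and a recent period. The function `drifting_contacts`, its thresholds, and the sample counts are all hypothetical.

```python
def drifting_contacts(earlier, recent, min_earlier=5, drop_ratio=0.5):
    """Flag contacts whose interaction count fell sharply between two periods.

    earlier, recent: dicts mapping contact -> number of interactions
    min_earlier: ignore contacts who were never very active (must be >= 1)
    drop_ratio: flag when recent <= drop_ratio * earlier
    """
    if min_earlier < 1:
        raise ValueError("min_earlier must be at least 1")
    flagged = []
    for contact, before in earlier.items():
        after = recent.get(contact, 0)
        if before >= min_earlier and after <= drop_ratio * before:
            flagged.append((contact, before, after))
    # Steepest relative drop first.
    flagged.sort(key=lambda item: (item[2] / item[1], item[0]))
    return flagged


# Hypothetical message counts for two consecutive quarters.
last_quarter = {"alice": 20, "bob": 10, "carol": 2}
this_quarter = {"alice": 3, "bob": 9}
print(drifting_contacts(last_quarter, this_quarter))
```

A real measure of social value would presumably also weigh whom each contact connects me to, which raw interaction counts do not capture.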

Page 46: IBM Research 5/27/2007 | Information Flow Prediction and People Mining | Ching-Yung Lin Presentation subtitle: 20pt Arial Regular, teal R045 | G182 | B179


SmallBlue Connect

Enterprise Social Network Analysis Tool

Showing Social Networks of people based on:

expertise key words

formal hierarchy

Any list of emails

Utilizing Social Network Analysis to show:

who are the important hubs among experts

who are the important bridges linking groups
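
Hubs and bridges are commonly identified with degree and betweenness centrality. The slides do not state which measures SmallBlue Connect uses, so the sketch below is an illustration of those two standard measures (betweenness via Brandes' algorithm for unweighted, undirected graphs). The helper names and the sample network are hypothetical.

```python
from collections import deque
from itertools import combinations


def build_graph(edges):
    """Build an undirected adjacency dict from (u, v) pairs."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph


def degree_centrality(graph):
    """Number of direct contacts per person; high values suggest hubs."""
    return {node: len(nbrs) for node, nbrs in graph.items()}


def betweenness_centrality(graph):
    """Brandes' algorithm; high values suggest bridges between groups."""
    score = {v: 0.0 for v in graph}
    for s in graph:
        order = []                         # nodes in BFS visiting order
        preds = {v: [] for v in graph}     # predecessors on shortest paths
        sigma = {v: 0 for v in graph}      # number of shortest paths from s
        dist = {v: -1 for v in graph}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in graph}
        for w in reversed(order):          # accumulate from the farthest nodes
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each unordered pair was counted from both endpoints.
    return {v: c / 2 for v, c in score.items()}


# Hypothetical network: two tightly knit groups of four, linked only via "x".
edges = (list(combinations("abcd", 2)) + list(combinations("efgh", 2))
         + [("a", "x"), ("x", "e")])
network = build_graph(edges)
degree = degree_centrality(network)
between = betweenness_centrality(network)
print("most connected:", sorted(degree, key=degree.get, reverse=True)[:2])
print("top bridge:", max(between, key=between.get))
```

In this example the person with the fewest direct contacts is the most important bridge, which is why the slide lists hubs and bridges as separate questions.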

Page 47: IBM Research 5/27/2007 | Information Flow Prediction and People Mining | Ching-Yung Lin Presentation subtitle: 20pt Arial Regular, teal R045 | G182 | B179


Privacy Consideration – Bottom Line

Employees’ communications (e.g., time, from, to, cc, subject, content of emails, SameTime, etc.) are NOT searchable or retrievable by anyone.

Employees’ knowledge of other employees is INFERRED. Only the aggregated inferred knowledge is searchable. It is NOT possible to guess which part of the aggregated inferred knowledge was contributed by whom.

In the social network analysis graphs, relationships between people are modeled as generic multimodal relationships. They give NO clue to the content of their communications.

Only employees’ outgoing emails and instant messages are used, and within those only the portion authored by the employee.

Anyone can suggest keywords that should not be searchable, specify search terms that should not find them, or ask to be removed from the system.

Page 48: IBM Research 5/27/2007 | Information Flow Prediction and People Mining | Ching-Yung Lin Presentation subtitle: 20pt Arial Regular, teal R045 | G182 | B179


Preliminary User Evaluation

Scores: 5 – very satisfied

                        5     4     3     2     1
Capability             24%   42%   17%   17%    0%
Usability              28%   33%    5%   25%   10%
Search                 10%   43%   23%   22%    2%
Reliability            28%   38%   17%   12%    5%
Performance            15%   45%   25%   13%    3%
Privacy                29%   34%   34%    3%    0%
Personal Network       15%   50%   13%   23%    0%
Overall Satisfaction   17%   49%   17%   15%    2%

Page 49: IBM Research 5/27/2007 | Information Flow Prediction and People Mining | Ching-Yung Lin Presentation subtitle: 20pt Arial Regular, teal R045 | G182 | B179


Demo

Page 50: IBM Research 5/27/2007 | Information Flow Prediction and People Mining | Ching-Yung Lin Presentation subtitle: 20pt Arial Regular, teal R045 | G182 | B179


Coincidence ??

SmallBlue Ego Trial Release (8/21)

SmallBlue Find and Connect Trial Release (9/20)

SmallBlue on TAP (11/07)

Page 51: IBM Research 5/27/2007 | Information Flow Prediction and People Mining | Ching-Yung Lin Presentation subtitle: 20pt Arial Regular, teal R045 | G182 | B179


Acknowledgements

Thanks to the SmallBlue Team Members:

Vicky Griffits-Fisher, Kate Ehrlich, Christopher Desforges, Michael Ackerbaruer, Reynold Khachatourian, Irina Fedulova, Ekaterina Zaytseva, Jeffrey Borden, Jennifer Xu, Yi Gu, Jie Lu, Dima Rekesh, Belle Tseng, Xiaodan Song

Contact: Ching-Yung Lin ([email protected]) ( http://www.research.ibm.com/people/c/cylin )