(icmia 2013) personalized community detection using collaborative similarity measure

28
PERSONALIZED COMMUNITY DETECTION USING COLLABORATIVE SIMILARITY MEASURE Presenter: Waqas Nawaz * P u b l i c a l l y a v a i l a b l e e m a i l d a t a s e t 5th International Conference on Data Mining and Intelligent Information Technology Applications ICMIA2013, June 18 - 20, 2013, Jeju Island, Republic of Korea Waqas Nawaz, Yongkoo Han, Kifayat-Ullah Khan, Young-Koo Lee The Case of Enron* Email Database Data and Knowledge Engineering Lab, Department of Computer Engineering, Kyung Hee University, Yongin-si, Gyeonggi-do 446-701, Republic of Korea

Upload: waqas-nawaz

Post on 18-Aug-2015

29 views

Category:

Data & Analytics


0 download

TRANSCRIPT

* Publically available em

ail dataset

PERSONALIZED COMMUNITY DETECTION USING COLLABORATIVE

SIMILARITY MEASURE

Presenter: Waqas Nawaz

5th International Conference on Data Mining and Intelligent Information Technology ApplicationsICMIA2013, June 18 - 20, 2013, Jeju Island, Republic of Korea

Waqas Nawaz, Yongkoo Han, Kifayat-Ullah Khan, Young-Koo Lee

The Case of Enron* Email Database

Data and Knowledge Engineering Lab,Department of Computer Engineering, Kyung Hee University,

Yongin-si, Gyeonggi-do 446-701, Republic of Korea

2

AGENDA

Motivation Problem Statement Related Work Proposed Idea

System Architecture Preprocessing Partitioning

Applications References

3

MOTIVATION

Email is an asynchronous and most prevalent means of communication among others (vocal, visual)

Users: Worldwide email service providers namely Hotmail, Yahoo and Gmail has 330, 302 and 193 million users respectively in 2009-11 [1]

Communication: Yahoo email network deliver around 38 thousand emails per second Globally[2]

It contains significant information: sender, receivers, contents, time etc.

to depict one’s communication behavior or pattern

4

PERSONALIZATION

Personal Interaction Network (Net) from Host’s (user ‘A’) Repository

A

B

@B

C

@

@

C

F

G

@G

G G

72

4

5Host A

B@

BC

@

@

E

D

@D

B C

42

3

5Host

User ‘A’ as Sender User ‘A’ as Receiver

5

PROBLEM STATEMENT FORMULATION (1/2)

Email provides a sophisticated bridge among individuals or groups to interact and share useful information

Is there any possibility to identify the groups having similar communication pattern?

Can we partition the individuals or sources separately with irregular activity?

Grouping the strongly connected individuals with analogous interaction patterns (behaviors) is admissible?

6

PROBLEM STATEMENT FORMULATION (2/2)

By Using Personalized Information (Restricted Information) Find the K group of users Application: Email Perspective

User Point of View Automatic email group prediction or management Email Strainer (e.g. Assist spam filtration )

Service Provider Point of View Efficiently anticipate the global network structure

using personalized information

7

RELATED WORK In literature Email database has been utilized to:

(CEAS-2005) Detect Worm Propagation: Statistically Analyze features to classify normal (legitimate) and abnormal (worms affected) emails [4]

(KDD-2005) Predict Organizational Structure Discover of important nodes (Email users) to predict hidden organizational structure [6] [7]*

(CEAS-2007) Characterization of Email: Measure the flow and frequency of user email toward the identification of communities of interest [12]

(ACM-2008*) Predict Organizational Structure Classify emails within the organization to determine its structure [10]

(ICADIWT-2008) Determine Popularity of Individuals in Community: Find popularity of individuals based on ranking strategy (Page Rank) and evaluated using organization hierarchy [11]

(CEC*-2009) Detect Email Communication Networks: Detect key users based on Evolutionary approach using SNA [8]

(KDD-2009) Email Prioritization: Personalize email prioritization based on social network analysis [5]

(IEEE*-2010) Identify Hidden Relationships: Predict sensitive relationships based on frequency of mutual private communication [9]

* BM

EI-2

01

0

8

PROBLEM STATEMENT INVESTIGATION

Key factors: Users Interaction

Email exchange (bi-directional) Behavioral Patterns

Email Metadata or Properties Sender, Receiver Time stamp (granularity? Day, Week, Month, Year) content's information

Subject, Body, Attachment(type, size , count) etc. Strength (Frequency: no. of sent and received)

9

PROPOSED ARCHITECTURE Input: Email Data Source Output: User Grouping Two Stages:

Preprocessing Partitioning

10

PROPOSED SOLUTION (1/5) Preprocessing – Emails Network to Graph

Mapping Email address Nodes or Vertices Email Meta Data Vertex Attributes No. of Emails Weights on edges

11

PROPOSED SOLUTION (2/5) Preprocessing – Behavioral Feature Extraction

Extracted Information (Email Metadata) Avg. No. of Emails (sent and received)

Avg. no. of recipients in To, Cc, and Bcc Ratio of sent and received emails Min, Max and Avg. attachments and size Dominating type of attachment

Power point, PDF, Word, Excel … Time stamp (Granularity Daily, Weekly, Monthly)

Morning, Noon, After-Noon, Evening, Night, Late-Night Avg. length or size of email subject and body Ratio of emails with attachments

Behavioral Feature Set

12

PROPOSED SOLUTION (3/5) Preprocessing – Feature Scaling

We have scale down or mapped related set of features from real space nominal space ID space for time efficiency (bypassing the extra computation for clustering)

Parallel coordinates are utilized

13

Morning <4am-

12pm>; 7

After Noon

<12pm-4pm>; 4

Evening <4pm-

8pm>; 4

Night <8pm-

12am>; 4

Late Night

(12am-4am); 4

PROPOSED SOLUTION (4/5) Preprocessing – Time Feature Scaling

First postulate the granularity level Day, Week, Month etc.

Communication Time Slices (CTS) Splitting the complete day (days level, real space) into

mutually exclusive time intervals and assign labels (nominal space) accordingly

14

PROPOSED SOLUTION (5/5) Partitioning – Intra-Graph Clustering*

1. Collaborative Similarity Measure (CSM)

2. K-Medoid Clustering

Collaborative Similarity = CSimሺ𝑣𝑎,𝑣𝑏ሻ= =

ە۔

𝛼∗𝑆𝐼𝑀ሺ𝑣𝑎,𝑣𝑏ሻ𝑠𝑡𝑟𝑢𝑐𝑡ۓ + (1− 𝛼) ∗𝑆𝐼𝑀ሺ𝑣𝑎,𝑣𝑏ሻ𝑐𝑜𝑛𝑡𝑒𝑥𝑡 , 𝑣𝑎 ↔ 𝑣𝑏

ෑ� 𝐶𝑆𝑖𝑚൫𝑣𝑝𝑖 ,𝑣𝑝𝑖+1൯𝑞

𝑖=1 , 𝑣𝑎 ⋯𝑣𝑏, 𝑣𝑝 𝑖𝑠 𝑜𝑛 𝑝𝑎𝑡ℎ 𝑣𝑎 𝑎𝑛𝑑 𝑣𝑏 (1− 𝛼) ∗𝑆𝐼𝑀ሺ𝑣𝑎,𝑣𝑏ሻ𝑐𝑜𝑛𝑡𝑒𝑥𝑡 , 𝑜𝑡ℎ𝑒𝑟𝑤𝑖𝑠𝑒

𝑆𝐼𝑀ሺ𝑣𝑎,𝑣𝑏ሻ𝑠𝑡𝑟𝑢𝑐𝑡𝑤𝑒𝑖𝑔ℎ𝑡𝑒𝑑 = ቐ

a2𝑁𝑎 ∩ 𝑁𝑏a2a2𝑁𝑎 ∪ 𝑁𝑏a2∗(𝑤𝑎𝑏), 𝑣𝑎 ↔ 𝑣𝑏 0, 𝑜𝑡ℎ𝑒𝑟𝑤𝑖𝑠𝑒

𝑆𝐼𝑀ሺ𝑣𝑎,𝑣𝑏ሻ𝑐𝑜𝑛𝑡𝑒𝑥𝑡𝑤𝑒𝑖𝑔ℎ𝑡𝑒𝑑 = ς (𝑤𝑎𝑖)𝑀𝑖=1, 𝑣𝑎&& 𝑣𝑏←𝑎𝑖 ς (𝑤𝑎𝑗)𝑀𝑗=1, 𝑣𝑎| | 𝑣𝑏←𝑎𝑗 , 𝑣𝑎 ↔ 𝑣𝑏 | | 𝑣𝑎 ⋯𝑣𝑏

*Publish

ed in

DA

SFA

A W

orksh

op 2

01

2

15

APPLICATIONS

Automatically formed group of users with identical patterns can be applied to data (email) network: Part of security mechanism to assist in

detection of abnormal activity (e.g. spam filtering)

Known fraudulent accounts(individual or group) can facilitate to determine other fraudulent users

Identify the source of virus or infected group Assist Email Management System (Grouping

or Filtering) E.g. auto generated mailing list Automatic group level email prioritization

Community analysis or formation based on analogous communication patterns

16

REFERENCES

1. http://www.email-marketing-reports.com/metrics/email-statistics.htm

2. http://visualize.yahoo.com/3. CEAS 2005 – Second Conference on Email and Anti-Spam, July

21-22, 2005, Stanford University, California, USA, 2005.4. Steve Martin, Anil Sewani et. al., “Analyzing Behavioral

Features for Email Classification”, In CEAS [3].5. Shinjae Yoo et. al., “Mining Social Networks for Personalized

Email Prioritization”, KDD June 28 – July 1. 2009.6. J. Shetty, and J. Adibi, “Discovering Important Nodes through

Graph Entropy The Case of Enron Email Database”, KDD, Chicago, Illinois, 2005.

7. H. Yang et. al., “Discovering Important Nodes through Comprehensive Assessment Theory on Enron Email Database”, IEEE 3rd International Conference on BMEI, 2010.

8. G. Wilson and W. Banzhaf, “Discovery of Email Communication Networks from the Enron Corpus with Genetic Algorithm using SNA”, IEEE Congress on Evolutionary Computation, 2009.

17

REFERENCES (CONT…)

9. Hanhe Lin, “Predicting Sensitive Relationships form Email Corpus”, IEEE 4th Int. Conf. on Genetic and Evolutionary Computing, 2010.

10. K. Yelupula et. al., “Social Network Analysis for Email Classification”, ACM – SE , 28 – 29 March. 2008.

11. Anton Timofieiev et. al., “Social communities detection in Enron Corpus using h-Index”, IEEE ICADIWT, 4 – 6 Aug. 2008.

12. L. Johansen, M. Rowell, K. Butler, and P. McDaniel. “Email Communities of Interest”, In 4th Conference on Email and Anti-Spam (CEAS), Mountain View, CA, July 2007.

18

RELATED WORK (1/11)

In literature Email database has been utilized to: (CEAS-2005) Detect Worm Propagation: Statistically

Analyze features to classify normal (legitimate) and abnormal (worms affected) emails [4] Many researchers examine incoming email exclusively, which

does not provide detailed information about individual user’s behavior; Outgoing emails need to be utilized

Features extracted from outgoing emails to distinguish between normal and abnormal email activity

Examined the contributions of these features to their classification or the sensitivity of their model to those feature

Features includes presence of HTML, script tags, embedded images, hyperlinks, types of attachments, no of sent, etc.

Application: Novel worm detection based on SVM and Naïve Bayes classifiers.

19

RELATED WORK (2/11)

In literature Email database has been utilized to: (CEAS-2005) Detect Worm Propagation:

Statistically Analyze features to classify normal (legitimate) and abnormal (worms affected) emails [4]

20

RELATED WORK (3/11) (KDD-2005) Predict Organizational Structure

Discover of important nodes (Email users) to predict hidden organizational structure [6] [7] Event based entropy to find influential users Focused on text mining and natural language

processing

21

* BM

EI-2010

RELATED WORK (4/11) (KDD-2005) Predict Organizational Structure

Discover of important nodes (Email users) to predict hidden organizational structure [6] [7]*

22

RELATED WORK (5/11)

(CEAS-2007) Characterization of Email: Measure the flow and frequency of user email toward the identification of communities of interest [12]

A large number of communications indicates as association whereas a fewer number of communication does not

Frequency Thresholding

k-means Clusterings of Inbound weightsEmail communications between users

23

* 28-29 March

RELATED WORK (6/11)

(ACM-2008*) Predict Organizational Structure Classify emails within the organization to determine its structure [10] The analysis of email flows within the

organization by neglecting email contents Ranking individuals based on no. of emails sent

and received Algorithm steps

1. Get the statistical view of email conversations2. WEKA tool (k-Means) is utilized for clustering of

organization3. Validation of the results with original organization

structure

24

* 28-29 March

RELATED WORK (7/11)

(ACM-2008*) Predict Organizational Structure Classify emails within the organization to determine its structure [10]

25

* ICA

DIW

T, 4 – 6 Aug

RELATED WORK (8/11)

(IEEE*-2008) Determine Popularity of Individuals in Community: Find popularity of individuals based on ranking strategy (Page Rank) and evaluated using organization hierarchy [11] Detect abnormal activities Topologically identifying the communities

(triangle estimation measure the density of triangles in the network)

Importance of individual is estimated using h-index Organization hierarchy comparison for analysis

26

RELATED WORK (9/11) (CEC*-2009) Detect Email Communication

Networks: Detect key users based on Evolutionary approach using SNA [8] Detect important actors w.r.t email transactions Topological metrics used for network analysis

Degree (mean) Density Proximity Prestige

* IEEE C

ongre

ss on E

volu

tionary

Com

puta

tion

Hypothetical Networks

27

RELATED WORK (10/11) (KDD-2009) Email Prioritization: Personalize

email prioritization based on social network analysis [5] Automatically determine the importance level of

non-spam email messages Topological features

In/out/total degree centrality, Clustering coefficient, Clique count, Betweenness centrality

An example of bipartite email network: circles are contact persons and rectangles are email messages. Some email messages have human-assigned importance values but others do not. The network enables us to propagate the partially available importance values from messages to persons, and vice versa

28

RELATED WORK (11/11) (IEEE*-2010) Identify Hidden Relationships:

Predict sensitive relationships based on frequency of mutual private communication [9] Relationship means belonging to same

department, workgroup, or being friends(sensitive)

Email network modeled as directed weighted graph

Two Steps1. Counting mutual privacy communication among

users2. Evaluate the cluster factor information between

every pair of email users

* 4th

Int. C

onf. o

n G

enetic a

nd E

volu

tionary

C

om

putin

g