How Many Folders Do You Really Need? Classifying Email into a Handful of Categories

2014/1/23 (Fri.), Chang Wei-Yuan @ MakeLab Group Meeting. Paper: Mihajlo Grbovic, Guy Halawi, Zohar Karnin, Yoelle Maarek (Yahoo Labs), CIKM '14.

Upload: chang-wei-yuan

Post on 18-Jul-2015

TRANSCRIPT

Page 2: Outline

- Introduction
- Method
  - Discovering Latent Categories
  - Modeling Data
  - Training Data
  - Classification Mechanism
- Experiment
- Conclusion
- Thought

Page 4: Introduction

- Traditional email classification is still a mostly manual task.

Page 5: Introduction

- Recently, automatic classification has started to appear in some Web mail clients, e.g. Inbox.

Page 6: Introduction

- Current email traffic is dominated by non-spam, machine-generated email.
  - Social networks
  - Commerce sites
  - Official institutions

Page 7: Introduction

- Goal
  - Automatically distinguish between personal and machine-generated email.
  - Classify messages into latent categories, without requiring users to have defined any folders.

Page 9: Overview

[Figure: pipeline overview. Components: Mail raw data, LDA, Latent categories, Extracting Features, Aggregation Level, Training data, Classifier, Mail testing data.]

Page 11: Discovering Latent Categories

- All messages have the potential to be classified.
  - by retrieving the most popular folders from users
- The paper applies LDA to these "document folders" to find latent categories.
  - Latent topics map to "latent categories".

Page 12

[Figure: groups of messages (msg).]

Page 13

[Figure: the same groups of messages (msg), with LDA applied.]

Page 14: Discovering Latent Categories

- The objective was to choose a value of K (the number of LDA topics)
  - such that each individual topic, and the overall set of topics, achieves significant coverage.
- K = 6 was examined further
  - as a good balance between total and individual coverage.

Page 15: Discovering Latent Categories

[Figure: a message (msg) receives per-category proportions (travel %, social %, …) and is labeled travel.]
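The step from per-category proportions to a single label can be sketched as taking the largest proportion. This is a minimal sketch, not the paper's procedure: the function name `dominant_category` and the numbers are hypothetical, and the slide does not say how ties or weak proportions are handled.

```python
def dominant_category(proportions):
    """Return the category with the largest proportion for one message."""
    if not proportions:
        raise ValueError("empty topic distribution")
    return max(proportions, key=proportions.get)


# Hypothetical per-category proportions for one message (illustration only).
msg_topics = {"travel": 0.72, "social": 0.18, "shopping": 0.10}
label = dominant_category(msg_topics)
```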

Page 17: Modeling Data

- Original method: each individual message is a single data point,
  - with various features extracted from the message header and body.

Page 18: Modeling Data

- Extracting features
  - Content features
    - the message subject and body
  - Address features
    - the sender email address, including the subdomain
  - Behavioral features
    - the sender's and recipient's actions on a given message

[Figure: a message (msg) record with the fields subject, body, action, time, sender, address, domain.]
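The three feature groups above can be sketched as a function over one message record. This is an illustrative sketch only: the record layout, the field names, and the naive subdomain split are assumptions, not the paper's feature set.

```python
import re


def extract_features(msg):
    """Split one message record into content, address, and behavioral features."""
    text = (msg["subject"] + " " + msg["body"]).lower()
    host = msg["sender"].partition("@")[2].lower()
    labels = host.split(".")
    return {
        # Content features: distinct lowercase tokens from subject and body.
        "content": sorted(set(re.findall(r"[a-z0-9]+", text))),
        "address": {
            "sender": msg["sender"].lower(),
            "host": host,
            # Everything left of the last two labels, e.g. "mail" in
            # mail.example.com. Naive: ignores suffixes such as .co.uk.
            "subdomain": ".".join(labels[:-2]),
        },
        # Behavioral features: actions recorded for this message.
        "behavior": {"actions": list(msg.get("actions", []))},
    }


# Hypothetical message record (illustration only).
msg = {
    "subject": "Your flight is confirmed",
    "body": "Flight 12 departs at 9",
    "sender": "Booking@mail.example.com",
    "actions": ["opened"],
}
features = extract_features(msg)
```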

Page 19: Modeling Data

- Extended method: aggregating messages at higher levels
  - address / mail domain level
- The paper considers three levels of aggregation.

[Figure: a message (msg) record with the fields subject, body, action, time, address, sender, domain, aggregated at the sender level and at the domain level.]

Page 20: Modeling Data

- Aggregation levels

[Figure: example messages labeled "msg: shopping" and "msg: traveling".]
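Aggregating messages at the sender or domain level can be sketched as a group-by over message records. This is a sketch under assumptions: the record fields and the pooled statistics (message count, subject words, action counts) are hypothetical choices for illustration, and the message level itself is just the unaggregated records.

```python
from collections import Counter, defaultdict


def aggregate(messages, level):
    """Pool message records by "sender" or "domain"."""
    if level not in ("sender", "domain"):
        raise ValueError("level must be 'sender' or 'domain'")
    groups = defaultdict(
        lambda: {"count": 0, "words": Counter(), "actions": Counter()}
    )
    for msg in messages:
        group = groups[msg[level]]
        group["count"] += 1
        group["words"].update(msg["subject"].lower().split())
        group["actions"][msg["action"]] += 1
    return dict(groups)


# Hypothetical message records (illustration only).
messages = [
    {"sender": "deals@shop.example", "domain": "shop.example",
     "subject": "Weekend sale", "action": "deleted"},
    {"sender": "deals@shop.example", "domain": "shop.example",
     "subject": "Sale ends today", "action": "opened"},
    {"sender": "orders@shop.example", "domain": "shop.example",
     "subject": "Order shipped", "action": "opened"},
]
by_sender = aggregate(messages, "sender")
by_domain = aggregate(messages, "domain")
```

Pooling at a higher level gives each sender or domain more evidence than any single message carries, at the cost of treating all of its messages alike.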

Page 22: Training Data

- Labeling techniques
  - The 6 latent categories are used as labels.
  - A two-stage classifier is built, at the message level and the sender level.

[Figure: a message-level (msg) record with the fields subject, action, …, sender, domain, category, and a sender-level record with the fields sender, domain, category.]

Page 23: Training Data

[Figure: the same two records, annotated "known by LDA" and "unknown".]

Page 24

[Figure: a sender and the candidate labels human, travel, social, career.]

Page 25

[Figure: the same sender-labeling diagram, annotated "heuristic-based":]

- Domain: gmail.com, yahoo.com
- Sender: <first name>.<last name>
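The two heuristics above can be sketched as a check on a sender address. This is a minimal sketch: the slide does not say how the two signals are combined, so treating either one as sufficient is an assumption, and `looks_human` and the regular expression are hypothetical.

```python
import re

# Domains named on the slide.
PERSONAL_DOMAINS = {"gmail.com", "yahoo.com"}
# Assumed shape of "<first name>.<last name>": two alphabetic parts.
FIRST_LAST = re.compile(r"^[a-z]+\.[a-z]+$")


def looks_human(address):
    """Return True if the sender address matches either heuristic."""
    local, sep, domain = address.strip().lower().partition("@")
    if not sep:
        return False  # not an email address
    return domain in PERSONAL_DOMAINS or bool(FIRST_LAST.match(local))
```

For example, `looks_human("jane.doe@corp.example")` matches on the local part, while `looks_human("noreply@shop.example")` matches neither signal.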

Page 26

[Figure: the sender-labeling diagram (human, travel, social, career), annotated "automatic voting". A sender's messages (msg) fall into folder1, folder2, and folder3, with category proportions "travel 96%", "travel 88%", and "shopping 70%, travel 20%".]

Page 27

[Figure: automatic voting, continued. The three folders are labeled travel, travel, and shopping.]
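The voting in the figure can be sketched as follows: each folder takes its dominant category, and the sender takes the majority label across folders. This is a sketch under assumptions: the `min_share` cutoff and the function names are hypothetical, and the slides do not state how ties or weak majorities are resolved.

```python
from collections import Counter


def folder_label(proportions):
    """Label a folder with its dominant category."""
    return max(proportions, key=proportions.get)


def vote_sender_label(folder_proportions, min_share=0.5):
    """Return the majority label across folders, or None without a majority."""
    votes = Counter(folder_label(p) for p in folder_proportions if p)
    if not votes:
        return None
    label, count = votes.most_common(1)[0]
    return label if count / sum(votes.values()) > min_share else None


# Proportions taken from the figure.
folders = [
    {"travel": 0.96},
    {"travel": 0.88},
    {"shopping": 0.70, "travel": 0.20},
]
sender_label = vote_sender_label(folders)
```

With the figure's numbers, two of three folders vote travel, so the sender is labeled travel.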

Page 29: Classification Mechanism

- Offline creation of the classified senders table and the message-level classifier
  - The training set is used to train a logistic regression model.
    - For each category, a separate model is trained in a one-vs-all manner.
  - The classification process is run periodically to account for new senders.
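One-vs-all logistic regression can be sketched in plain Python: one binary model per category, each trained to separate its category from all others, with the highest-scoring model deciding the label. This is a toy sketch, not the paper's training setup: the features, the data, and the hyperparameters (`epochs`, `lr`) are made up for illustration.

```python
import math


def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
    return 1.0 / (1.0 + math.exp(-z))


def train_binary(rows, targets, epochs=500, lr=0.5):
    """Fit weights (bias last) by stochastic gradient descent on the log loss."""
    weights = [0.0] * (len(rows[0]) + 1)
    for _ in range(epochs):
        for x, y in zip(rows, targets):
            xb = list(x) + [1.0]
            error = sigmoid(sum(w * v for w, v in zip(weights, xb))) - y
            weights = [w - lr * error * v for w, v in zip(weights, xb)]
    return weights


def train_one_vs_all(rows, labels):
    """Train one binary model per category: that category vs. all others."""
    return {
        c: train_binary(rows, [1.0 if label == c else 0.0 for label in labels])
        for c in sorted(set(labels))
    }


def predict(models, x):
    """Return the category whose model is most confident, with its score."""
    xb = list(x) + [1.0]
    scores = {
        c: sigmoid(sum(w * v for w, v in zip(ws, xb)))
        for c, ws in models.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]


# Toy features: [mentions "flight", mentions "sale", mentions "invoice"].
rows = [[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 1, 0], [0, 0, 1], [0, 0, 1]]
labels = ["travel", "travel", "shopping", "shopping", "finance", "finance"]
models = train_one_vs_all(rows, labels)
category, confidence = predict(models, [1, 0, 0])
```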

Page 30: Classification Mechanism

[Figure: sender data split into 35% training data and 65% testing data; the figure also shows two classifiers, the senders table, and msg training data.]

Page 32: Classification Mechanism

- Online light-weight classification
- The initial classification
  - hard-coded rules designed to classify quickly
- This phase requires very few resources and covers 32% of the email traffic.

Page 33: Classification Mechanism

- Online sender-based classification
- The second phase of the classification cascade
  - looks for senders with known categories
  - using the senders table
- The amount of traffic not covered by this phase is roughly 8%.

Page 34: Classification Mechanism

- Online heavy-weight classification
- Only 8% of the traffic ends up in this last phase,
- so slightly heavier computation can be afforded for the classifier.
  - It uses all relevant features, pertaining to the message body, subject line, and sender name.
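The three online phases form a cascade in which each message stops at the first phase that can label it. This is a minimal sketch of that control flow only: the example rule, the senders table entry, and the keyword stand-in for the heavy-weight model are all hypothetical.

```python
def classify(msg, rules, senders_table, heavy_classifier):
    """Run the cascade and return (category, phase that decided)."""
    # Phase 1: light-weight, hard-coded rules.
    for rule in rules:
        category = rule(msg)
        if category is not None:
            return category, "light-weight"
    # Phase 2: look the sender up in the offline senders table.
    category = senders_table.get(msg["sender"])
    if category is not None:
        return category, "sender-based"
    # Phase 3: fall back to the heavier message-level classifier.
    return heavy_classifier(msg), "heavy-weight"


def reply_rule(msg):
    # Hypothetical rule: treat replies as personal mail.
    return "human" if msg["subject"].lower().startswith("re:") else None


def keyword_classifier(msg):
    # Stand-in for a trained message-level model (illustration only).
    text = (msg["subject"] + " " + msg["body"]).lower()
    return "travel" if "flight" in text else "shopping"


senders_table = {"deals@shop.example": "shopping"}
rules = [reply_rule]

result = classify(
    {"sender": "new@air.example", "subject": "Your flight", "body": "Gate 4"},
    rules, senders_table, keyword_classifier,
)
```

Ordering the phases from cheapest to most expensive means the heavy-weight model only runs on the traffic the first two phases leave unlabeled.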

Page 35: One-vs-all

[Figure: a message (msg) is passed to one classifier per category (social, human, career, shopping, travel, finance); each answers "Yes" with a confidence, or "No".]

Page 36: Semi-supervised

[Figure: the pipeline overview diagram (Mail raw data, LDA, Latent categories, Extracting Features, Aggregation Level, Training data, Classifier, Mail testing data).]

Page 41: Experiment

- The paper estimates the actual volume of machine-generated messages on a very large Yahoo mail dataset.
- The dataset was built for the purpose of this work:
  - 6 months of email traffic
  - more than 500 billion messages

Page 42: Experiment

- 5 sender-based classifiers for the machine latent categories
  - Shopping, Financial, Travel, Career, and Social
- 1 sender-based machine-vs-human classifier

Page 47: Conclusion

- We presented a Web-scale categorization approach:
  - offline learning
  - online classification
- Discovered latent categories.
- Discriminated between human and machine-generated email.
- Built a scalable online system that can be applied in Web mail.

Page 48: Future Work

- Discussing how categories should be exposed to users.

Page 50: Thought

- Extending multiclass classification to multi-label classification.

Page 52: Overview

[Figure: the pipeline overview diagram, annotated with the question "k ?".]

Page 53: Overview

[Figure: the pipeline overview diagram, annotated with the question "threshold ?".]

Page 54

Thanks for listening.
2014/01/23 (Tue.) @ MakeLab Group Meeting
[email protected]