e-mail mining: extracting collaborative activities from e-mail akiko murakami koichi takeda
E-mail Mining: Extracting Collaborative Activities from E-mail
Akiko Murakami
Koichi Takeda
Contents
Overview of our text mining work
  Text Mining for individual text
  Text Mining for discussion text
  Text Mining for e-mail
Discussion on e-mail mining
  Pair-mail
  Three levels of e-mail mining targets
Preliminary study of e-mail mining
Text Mining
Text mining has become one of the most influential areas of natural language processing research.
Text mining has been extended to various domains:
  CRM (Customer Relationship Management)
  Biomedical domain
  Web pages
  Discussion records
  Patents
Text Mining for Individual Text
Unstructured Data:
  Call Taker: James
  Date: Aug. 30, 2002
  Duration: 10 min.
  CustomerID: ADC00123
  Q: cust sys has stopped working.
  A: checked cust bios and it need updated. …
Structured Data:
  [Call Taker] James
  [Date] 2002/08/30
  [Duration] 10 min.
  [CustomerID] ADC00123
  [Noun] Customer
  [Software] BIOS
  [Subj...Verb] customer system..stop
  [SW..Problem] BIOS..need
Original Data, Meta Data
Linguistic Analysis: Tagging, Dependency Analysis, Named Entity Extraction, Intention Analysis
Category Dictionary
Synonym Dictionary
Category Item
Visualization & Interactive Mining
Mining
IBM TAKMI (Nasukawa, Nagano, 1999)
Mining target: individual text
Mining unit:
  > texts
  > category-labeled items extracted from text using NLP
TAKMI Client GUI
Mining History
Document List
Distribution Analysis View
Other Mining Views
Text Mining for Discussion Records
Mail A
Mail B
Mail C
Quotation from Mail A
Comment on the quotation
Quotation from Mail B
Comment on the quotation
Thread Summary
Discussion Mining (Murakami, Nagao, 2001)
Linguistic Annotation
Mining target: discussion records
Mining unit:
  > summarized texts based on thread structure
  > mail graph structures
Discussion Mining
Text Mining for E-mail
Private e-mail data
Mail messages carry various structured data:
  Sender (From), Receiver (To, Cc, Bcc), Time Stamp, unique Mail ID, Referential ID, etc.
Independent and relational documents are mixed in e-mail data.
  Independent: F.Y.I., invitation, CFP, etc.
  Relational: mailing list, inquiry, request, etc.
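The structured fields listed above can be read from standard mail headers. The following is a minimal sketch using Python's standard `email` package; the sample message, the addresses, and the function name `extract_header_fields` are illustrative assumptions, not part of the original study.

```python
from email import message_from_string
from email.utils import getaddresses, parsedate_to_datetime

# Hypothetical sample message (addresses and IDs are made up).
RAW = """\
From: Alice <alice@example.com>
To: Bob <bob@example.com>
Cc: Carol <carol@example.com>
Date: Thu, 01 May 2003 10:00:00 +0900
Message-ID: <m2@example.com>
In-Reply-To: <m1@example.com>
References: <m1@example.com>
Subject: Re: draft

Thank you for the draft.
"""


def extract_header_fields(raw):
    """Return the structured fields of one mail message as a dict."""
    msg = message_from_string(raw)

    def addrs(name):
        # getaddresses yields (display name, address) pairs.
        return [addr for _, addr in getaddresses(msg.get_all(name, []))]

    return {
        "sender": addrs("From"),
        "to": addrs("To"),
        "cc": addrs("Cc"),
        "bcc": addrs("Bcc"),
        "timestamp": parsedate_to_datetime(msg["Date"]),
        "message_id": msg["Message-ID"],      # unique mail ID
        "in_reply_to": msg["In-Reply-To"],    # referential ID (direct parent)
        "references": (msg["References"] or "").split(),
    }


fields = extract_header_fields(RAW)
```

A message with no `In-Reply-To` header yields `None` for that field, which is one simple way to separate independent mail from relational mail.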
Properties of e-mail messages
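[Figure: e-mail messages arranged on two axes, Private vs. Public and Independent vs. Relative. Message types shown: private mail without c.c., private mail with c.c., F.Y.I., spam, memo, mailing list, schedule. Other labels: Discussion Mining; E-mail Mining; Text Mining; Discussion, BBS, ...; Discourse; Paper, Report, ...]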
E-mail mining
Not suitable for annotation
  Need to consider scalability
AND
Lack of information such as discourse structure
  Threads are shorter than those in discussion records.
  Participants are fewer than in discussions.
A new concept of the e-mail mining target is required.
Pair-mail
A pair-mail is formed by a reference link (reply-to information).
Each reply-to link forms one pair-mail.
A pair-mail carries reference-type information based on the contents of the previous and next mail:
  Question/Answer, Imperative/Action, Action/Regards, etc.
[Figure: two mails connected by a reply-to link]
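As a rough illustration of how pair-mails could be formed from reply-to links, here is a minimal sketch. The `Mail` and `PairMail` classes and the keyword heuristic in `guess_reference_type` are hypothetical; the slides do not say how the reference type is determined.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Mail:
    message_id: str
    in_reply_to: Optional[str]  # ID of the mail this one replies to, if any
    body: str


@dataclass(frozen=True)
class PairMail:
    previous: Mail
    reply: Mail
    reference_type: str


def guess_reference_type(previous, reply):
    # Hypothetical keyword heuristic, for illustration only.
    prev_text, reply_text = previous.body.lower(), reply.body.lower()
    if "?" in prev_text:
        return "Question/Answer"
    if "please" in prev_text:
        return "Imperative/Action"
    if "thank" in reply_text or "regards" in reply_text:
        return "Action/Regards"
    return "Unknown"


def build_pair_mails(mails):
    """Create one PairMail per reply-to link whose parent is in the data."""
    by_id = {m.message_id: m for m in mails}
    pairs = []
    for m in mails:
        parent = by_id.get(m.in_reply_to)
        if parent is not None:
            pairs.append(PairMail(parent, m, guess_reference_type(parent, m)))
    return pairs


mails = [
    Mail("<1>", None, "Could you review the draft?"),
    Mail("<2>", "<1>", "Yes, I will review it today."),
    Mail("<3>", "<2>", "Thank you for the quick review."),
    Mail("<4>", None, "F.Y.I.: call for papers attached."),
]
pairs = build_pair_mails(mails)
```

In this sample, the thread of three messages yields two pair-mails, and the independent F.Y.I. message yields none.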
Mining Target -mining units-
Three levels of mining target in mail data
1st level: e-mail
  an individual e-mail as a single substance
2nd level: pair-mail
  a pair of e-mails linked by a reply-to relation
3rd level: thread
  a chain of e-mail messages
[Axis alongside the three levels: Scalability, High → Low]
Preliminary study
Examples of mail mining
Mail data for one month (May 2003)
Business-related mail
  discussion with co-author of my paper
  meeting invitations
  (mail magazines and mailing-list messages are received in another account)
Including messages I sent
Volume: 380 mail messages (19 mail messages per working day)
Thread Properties
Extracting thread structure based on the header information (Reference ID).
The average thread length is 1.60 mail messages (238 threads), but most mail messages are of the individual type.
The average length excluding individual mail is 3.09 mail messages (68 threads).
Most threads are shorter than 3 messages.
There are only 16 long threads (over 4 messages).
The average number of participants in a long thread (more than 4 messages) is 3.5.
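The thread statistics above could be computed along the following lines. This is a sketch under assumptions: each message is reduced to a message ID and a parent ID taken from the header, a message whose parent is missing from the data is treated as a thread root, and the function names are illustrative.

```python
from collections import defaultdict


def group_threads(parent_of):
    """Group message IDs into threads.

    parent_of maps each message ID to its parent's ID (or None).
    """
    def root(mid):
        seen = {mid}
        while True:
            parent = parent_of.get(mid)
            # Stop at a root, a parent outside the data, or a cycle.
            if parent is None or parent not in parent_of or parent in seen:
                return mid
            seen.add(parent)
            mid = parent

    threads = defaultdict(list)
    for mid in parent_of:
        threads[root(mid)].append(mid)
    return list(threads.values())


def thread_stats(threads):
    """Average thread length, with and without single-message threads."""
    lengths = [len(t) for t in threads]
    multi = [n for n in lengths if n > 1]
    return {
        "threads": len(lengths),
        "avg_length": sum(lengths) / len(lengths),
        "multi_threads": len(multi),
        "avg_length_multi": sum(multi) / len(multi) if multi else 0.0,
    }


# Toy data: one 3-message thread, one individual mail, one 2-message thread.
parent_of = {"a": None, "b": "a", "c": "b", "d": None, "e": None, "f": "e"}
stats = thread_stats(group_threads(parent_of))
```

If the 238 threads include single-message ones, the reported figures agree with each other: 238 × 1.60 is about 381 messages, and 170 individual mails plus 68 × 3.09 (about 210) is about 380, matching the stated volume.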
Changes in numbers of thread participants
[Figure: number of participants (y-axis, 0 to 8) for each mail thread (x-axis, 1 to 11); series: Total Participants, C.C., B.C.C.]
Expansion of the number of participants → general information
No members in the c.c. field → special topics between sender and receiver
Considering pair-mail properties (e.g., the shift in the number of participants) helps to extract the relevant information.
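One way to observe the shift in the number of participants is to compare each mail with the next one in its thread. The sketch below assumes each mail is a dict with `from`, `to`, `cc`, and `bcc` keys; that layout and the example addresses are hypothetical.

```python
def participants(mail):
    """All distinct addresses involved in one mail."""
    return {mail["from"], *mail["to"], *mail["cc"], *mail["bcc"]}


def participant_shifts(thread):
    """Describe how participation changes across each pair-mail in a thread."""
    shifts = []
    for previous, reply in zip(thread, thread[1:]):
        before, after = participants(previous), participants(reply)
        shifts.append({
            "delta": len(after) - len(before),   # > 0 means expansion
            "added": sorted(after - before),
            "no_cc": not reply["cc"],            # reply has an empty c.c. field
        })
    return shifts


thread = [
    {"from": "a@example.com", "to": ["b@example.com"], "cc": [], "bcc": []},
    {"from": "b@example.com", "to": ["a@example.com"],
     "cc": ["c@example.com", "d@example.com"], "bcc": []},
    {"from": "a@example.com", "to": ["b@example.com"], "cc": [], "bcc": []},
]
shifts = participant_shifts(thread)
```

A positive `delta` marks an expansion of participants, and `no_cc` marks a reply exchanged only between sender and receiver.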
Pair-mail Extraction
Pair-mails are extracted when the second mail contains a certain expression, e.g., a gratitude expression such as “Thank you”.
These pair-mails contain some relation to the expression.
In this example, a “gratitude” expression is the result of some “action” in the previous mail.
[Figure: previous mail (Action) → reply mail (“thank you...”)]
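A minimal sketch of this kind of extraction is shown below. The pattern list in `GRATITUDE` and the representation of a pair-mail as a `(previous_body, reply_body)` tuple are assumptions made for illustration; the expressions actually used in the study are not given.

```python
import re

# Hypothetical list of gratitude expressions.
GRATITUDE = re.compile(r"\b(thank you|thanks|many thanks)\b", re.IGNORECASE)


def extract_gratitude_pairs(pairs):
    """Keep pair-mails whose second mail contains a gratitude expression.

    pairs: iterable of (previous_body, reply_body) strings.
    Returns (previous_body, reply_body, matched_expression) tuples.
    """
    hits = []
    for previous_body, reply_body in pairs:
        match = GRATITUDE.search(reply_body)
        if match:
            hits.append((previous_body, reply_body, match.group(0)))
    return hits


pairs = [
    ("I have uploaded the slides to the server.",
     "Thank you for uploading them."),
    ("Is the meeting at 3 pm?", "Yes, in room B."),
]
hits = extract_gratitude_pairs(pairs)
```

The previous mail of each hit is then the place to look for the action that prompted the gratitude. The sketch searches the whole reply body, so text quoted from the previous mail would also match.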
Result of pair-mail extraction
[Figure legend: action described in mail; data; information in previous mail; real world action; platitudinous expression; attachment text]
Most of the expressions are found in the previous mail as attachment text, so data cleansing is required.
In the remaining results, we can find the action described in the previous mail.
About 40% express gratitude for actions described in the mail (8% for information), and 10% express gratitude for real-world actions.
5% are platitudinous expressions.
106 pair-mails were extracted.
Summary
Text Mining for e-mail
  Text mining for individual and relational text
Introduce the new mining unit
  Three levels of e-mail mining targets: single mail, pair-mail, thread
Preliminary study of e-mail mining
  Pair-mail information is important in threads.
  Needs data cleansing: remove signatures, attachments, ...
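A simple form of the data cleansing mentioned above might look like the following sketch. It assumes plain-text bodies, the common "-- " signature separator, and ">"-prefixed quoted lines; these conventions and the function name `cleanse_body` are assumptions, and file attachments are not handled here.

```python
def cleanse_body(body):
    """Drop quoted lines and everything after the signature separator."""
    kept = []
    for line in body.splitlines():
        if line in ("-- ", "--"):
            # Conventional signature separator: ignore the rest of the mail.
            break
        if line.lstrip().startswith(">"):
            # Line quoted from a previous mail.
            continue
        kept.append(line)
    return "\n".join(kept).strip()


body = "\n".join([
    "Thank you for the update.",
    "",
    "> I have sent the revised draft.",
    "> Thank you for waiting.",
    "-- ",
    "A. Sender",
])
clean = cleanse_body(body)
```

After cleansing, an expression search on the reply only sees text the sender actually wrote, not text carried over from the previous mail.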