icsm2009 bettenburg presentation
An Empirical Study on the Risks of Using Off-the-Shelf Techniques for Processing Mailing List Data
Nicolas Bettenburg, Emad Shihab, Ahmed E. Hassan
Queen’s University, Canada
1Tuesday, November 17, 2009
Development Repositories
SOURCE CODE
COMMUNICATION ARCHIVES
BUG DATABASES
The Importance of Mailing List Archives
• Email popular form of communication
• Mailing lists to distribute messages
• Messages contain valuable information
• Discussions of source code
• Development decisions
• Error reports
• User support requests
Mining the Mailing Lists of 23 Open-Source Projects
• Summarizing developer mailing lists
• Using off-the-shelf tools
• Data from around 500,000 emails
• Unexpected results from experiments
[Figure: two term lists extracted from mailing list messages. The first is dominated by source code and patch fragments (e.g. palloc(bufsize);, getopt(argc,, +extern), PGP signature tokens (SIGNATURE, -----END, GnuPG) and stray numbers; the second consists mostly of natural-language words and configuration names (e.g. PGDATA, pg_hba.conf, /etc/postgresql).]
While Mining Mailing Lists of 23 Open-Source Projects
• Don’t treat mail archives as textual data
• Changing technologies
• Up to 98% of messages contain noise
Additional processing and cleaning needed!
From [email protected] Wed Jan 21 08:11:26 1998
Date: Mon, 27 Jan 1997 12:50:44 -0500 (EST)
From: "Brian E. Gallew" <[email protected]>
Subject: Re: [HACKERS] configure

- ---559023410-851401618-854387445=:824
Content-Type: TEXT/PLAIN; CHARSET=US-ASCII

> If you can grab a copy and run it on your machine, and send me
> the output, that would help alot.

Here is a gzip'ed tar of the results.

=====================================================================
| Please do not shoot at the thermonuclear weapons! -- Deacon |
=====================================================================
| Finger [email protected] for my public key. |
=====================================================================

- ---559023410-851401618-854387445=:824
Content-Type: APPLICATION/x-gzip
Content-Transfer-Encoding: BASE64
Content-Description: m88k-dg-dgux5.4R3.10.tar.gz

H4sIAHDq7DICA+xba3vaSLLOV/MrepzsGHiQuNomOJ4MdnDMrC8csB17HQ8WUgM9FpJWFxsmyX8/Vd0tIYGwye5kP+w5fp4EaHW9XV1dXbdu6bY1ZCNV1/QxffWD/sql0k6tRl4RUq6IT1KWn4R/L5UI2ansVirVSrlWhpZqqVZ5RUqv/gN/gedrLiGvHNvzRy71VvUbUYu6mvnqv+zvNbkYM48MmUkJfGrEG1PTJJ7uMscn
Resolving Multiple Sender Identities
• Participants send mail from different addresses
• Up to 21% of addresses are aliases
• Such aliases bias identity-based analyses
• Manual inspection and correction tedious
• No fully automated approach to resolve identities
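A minimal sketch of one possible alias heuristic, assuming that addresses sharing the same display name belong to the same person. This is not the method from the talk; the helper names `normalize_name` and `group_aliases` and the sample addresses are hypothetical. Because no fully automated approach is reliable, the groups it produces are only candidates for manual review.

```python
import re
from collections import defaultdict
from email.utils import parseaddr


def normalize_name(display_name):
    """Lowercase a display name and reduce it to sorted alphabetic tokens."""
    tokens = re.findall(r"[a-z]+", display_name.lower())
    return " ".join(sorted(tokens))


def group_aliases(from_headers):
    """Map each normalized sender name to the set of addresses seen with it.

    Hypothetical helper: a name-matching heuristic only, so the groups
    still need manual review (different people can share a name).
    """
    groups = defaultdict(set)
    for header in from_headers:
        name, address = parseaddr(header)
        if not address:
            continue  # unparseable From header
        # Fall back to the address itself when no display name is given.
        key = normalize_name(name) or address.lower()
        groups[key].add(address.lower())
    return dict(groups)


# Made-up sample data for illustration.
headers = [
    '"Jane Q. Doe" <jane@example.org>',
    '"Doe, Jane Q." <jdoe@example.com>',
    "bob@example.net",
]
aliases = group_aliases(headers)
```

Here the two "Jane Q. Doe" headers end up in one group with two addresses, while the bare address forms a group of its own.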
Reconstructing Discussion Threads
[Figure: messages A, B, C and D shown as a linear sequence and as a thread hierarchy]
• Mail stored sequentially in archives
• Logical grouping: discussion topics
• Required information erroneous or missing
• Essential for social network and topic analysis
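A minimal sketch of turning a linear sequence into a thread hierarchy, assuming each message carries a Message-ID and, for replies, the In-Reply-To value of its parent. The dictionary keys `id` and `parent` and the functions `build_threads` and `render` are hypothetical. A message whose parent is missing from the archive is treated as a new root here; real archives with erroneous or missing headers would likely need fallbacks (for example subject matching) that this sketch does not attempt.

```python
from collections import defaultdict


def build_threads(messages):
    """Return (roots, children) from a list of message dicts.

    Each dict is assumed to have 'id' (Message-ID) and 'parent'
    (In-Reply-To, or None). Both field names are hypothetical.
    A message whose parent is not in the archive becomes a root.
    """
    known_ids = {m["id"] for m in messages}
    children = defaultdict(list)
    roots = []
    for m in messages:
        parent = m.get("parent")
        if parent in known_ids and parent != m["id"]:
            children[parent].append(m["id"])
        else:
            roots.append(m["id"])
    return roots, dict(children)


def render(roots, children, depth=0):
    """Produce an indented outline of the thread hierarchy."""
    lines = []
    for msg_id in roots:
        lines.append("  " * depth + msg_id)
        lines.extend(render(children.get(msg_id, []), children, depth + 1))
    return lines


# Messages in archive (linear) order; B and C reply to A, D replies to C.
archive = [
    {"id": "A", "parent": None},
    {"id": "B", "parent": "A"},
    {"id": "C", "parent": "A"},
    {"id": "D", "parent": "C"},
]
roots, children = build_threads(archive)
outline = render(roots, children)
```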
Attachments
• MIME standard defines extensions to email
• Binary data encoded as text
• Around 10% of messages have attachments
• Extract attachments and store separately
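A minimal sketch of separating the text body from attachments with Python's standard `email` package. The function `split_message` and the sample message are made up for illustration; the talk does not say which tooling the authors used.

```python
import gzip
from email import message_from_bytes, policy
from email.message import EmailMessage


def split_message(raw_bytes):
    """Return (body_text, attachments) for one raw MIME message.

    attachments is a list of (filename, decoded_bytes) tuples.
    """
    msg = message_from_bytes(raw_bytes, policy=policy.default)
    body_part = msg.get_body(preferencelist=("plain",))
    body_text = body_part.get_content() if body_part is not None else ""
    attachments = [
        # decode=True undoes the base64 transfer encoding.
        (part.get_filename(), part.get_payload(decode=True))
        for part in msg.iter_attachments()
    ]
    return body_text, attachments


# Build a small example message; all values here are made up.
example = EmailMessage()
example["From"] = "sender@example.org"
example["Subject"] = "Re: configure"
example.set_content("Here is a gzip'ed tar of the results.\n")
example.add_attachment(
    gzip.compress(b"build log"),
    maintype="application",
    subtype="x-gzip",
    filename="results.tar.gz",
)

body, files = split_message(example.as_bytes())
```

Storing `files` separately keeps the base64 payload out of any later text analysis of `body`.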
Quotes and Signatures
• Duplicate information
• Unrelated to actual message
• Removing signatures is challenging
• Quoted text may or may not be desirable
• Signatures impact text mining approaches
• No perfect method for signature removal
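A minimal sketch of a cleaning heuristic, assuming quoted lines start with ">" and signatures follow the conventional "-- " delimiter line. The function `clean_body` and the sample texts are hypothetical. It also illustrates why no method is perfect: a boxed signature without the delimiter passes through untouched.

```python
import re

QUOTE_RE = re.compile(r"^\s*>")


def clean_body(text, keep_quotes=False):
    """Drop quoted lines and everything after a '-- ' signature delimiter.

    Heuristic only: signatures without the delimiter (such as ASCII-art
    boxes) are not detected.
    """
    kept = []
    for line in text.splitlines():
        if line in ("-- ", "--"):
            break  # the rest of the message is treated as signature
        if not keep_quotes and QUOTE_RE.match(line):
            continue
        kept.append(line)
    return "\n".join(kept).strip()


# Made-up sample messages.
message = (
    "> can you send me the output?\n"
    "Here is a gzip'ed tar of the results.\n"
    "-- \n"
    "Brian\n"
)
cleaned = clean_body(message)
with_quotes = clean_body(message, keep_quotes=True)

boxed = "Results attached.\n=====\n| Finger me for my key. |\n====="
boxed_cleaned = clean_body(boxed)  # signature box is not recognized
```

The `keep_quotes` switch reflects that quoted text may or may not be desirable, depending on the analysis.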
[Figure: the signature block of the example message ("Please do not shoot at the thermonuclear weapons! -- Deacon", followed by a public-key note), framed by rows of "=" characters]
More Risks Presented in the Paper
(1) Mailing lists contain valuable information on a project.
(2) Data needs pre-processing before applying traditional tools.
(3) Manual data processing is often not feasible or requires much effort.
(4) Off-the-shelf tools were not designed to prepare data for mining.