Page 1: Icsm2009 bettenburg presentation

An Empirical Study on the Risks of Using Off-the-Shelf Techniques for Processing Mailing List Data

Nicolas Bettenburg, Emad Shihab, Ahmed E. Hassan

Queen’s University, Canada

Tuesday, November 17, 2009

Page 2: Icsm2009 bettenburg presentation

Development Repositories

SOURCE CODE

COMMUNICATION ARCHIVES

BUG DATABASES


Page 4: Icsm2009 bettenburg presentation

The Importance of Mailing List Archives

• Email popular form of communication

• Mailing lists to distribute messages

• Messages contain valuable information

• Discussions of source code

• Development decisions

• Error reports

• User support requests


Page 5: Icsm2009 bettenburg presentation

Mining the Mailing Lists of 23 Open-Source Projects

• Summarizing developer mailing lists

• Using off-the-shelf tools

• Data from around 500,000 emails

• Unexpected results from experiments


Page 7: Icsm2009 bettenburg presentation

[Term list 1: words extracted from a PostgreSQL mailing-list discussion. Natural-language terms (things, configuration, symlinks, reasonable., convenient) are mixed with source-code and patch fragments (palloc(bufsize);, getopt(argc,, + SetDataDir(potential_DataDir);, #include, @@, +++, diff) and PGP signature residue (SIGNATURE, -----END, GnuPG, Version:, http://www.gnupg.org).]

[Term list 2: words from the same discussion, dominated by natural-language terms (configuration, opinion, environment, consistency, symlinks., Debian, Apache) and file paths (/etc/postgresql, /etc/pgsql/mydb.conf, /path/default.conf), largely free of the code and signature fragments seen in the first list.]


Page 9: Icsm2009 bettenburg presentation

While Mining the Mailing Lists of 23 Open-Source Projects

• Don’t treat mail archives as plain textual data

• Changing technologies

• Up to 98% of messages contain noise

Additional processing and cleaning needed!
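To illustrate why a mail archive should not be tokenized as plain text, here is a minimal sketch using Python's standard `email` package. The sample message, its addresses, and the function names `naive_tokens` and `body_tokens` are made up for illustration; the slides do not say which tools the study used.

```python
from email import message_from_string

# Hypothetical two-part message: a short text body plus a base64 attachment.
RAW = """\
From: alice@example.org
Subject: Re: configure
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="XYZ"

--XYZ
Content-Type: text/plain; charset=us-ascii

Here is a gzip'ed tar of the results.

--XYZ
Content-Type: application/x-gzip
Content-Transfer-Encoding: base64
Content-Description: results.tar.gz

H4sIAHDq7DICA+xba3vaSLLOV/MrepzsGHiQuNomOJ4MdnDMrC8csB17HQ8W

--XYZ--
"""


def naive_tokens(raw):
    """Treat the whole message as text: headers, boundaries and base64 become tokens."""
    return raw.split()


def body_tokens(raw):
    """Parse the MIME structure first and tokenize only the text/plain parts."""
    msg = message_from_string(raw)
    tokens = []
    for part in msg.walk():
        if part.get_content_type() != "text/plain":
            continue
        if part.get_content_disposition() == "attachment":
            continue
        payload = part.get_payload(decode=True)
        charset = part.get_content_charset() or "us-ascii"
        tokens.extend(payload.decode(charset, errors="replace").split())
    return tokens


if __name__ == "__main__":
    print(len(naive_tokens(RAW)), "tokens from the raw text")
    print(len(body_tokens(RAW)), "tokens from the parsed body")
```

The naive tokenizer picks up header names, MIME boundaries, and the encoded attachment, which is the kind of noise the term lists earlier in the deck show.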


Page 10: Icsm2009 bettenburg presentation

From [email protected] Wed Jan 21 08:11:26 1998
Date: Mon, 27 Jan 1997 12:50:44 -0500 (EST)
From: "Brian E. Gallew" <[email protected]>
Subject: Re: [HACKERS] configure

- ---559023410-851401618-854387445=:824
Content-Type: TEXT/PLAIN; CHARSET=US-ASCII

> If you can grab a copy and run it on your machine, and send me
> the output, that would help alot.

Here is a gzip'ed tar of the results.

=====================================================================
| Please do not shoot at the thermonuclear weapons! -- Deacon |
=====================================================================
| Finger [email protected] for my public key. |
=====================================================================

- ---559023410-851401618-854387445=:824
Content-Type: APPLICATION/x-gzip
Content-Transfer-Encoding: BASE64
Content-Description: m88k-dg-dgux5.4R3.10.tar.gz

H4sIAHDq7DICA+xba3vaSLLOV/MrepzsGHiQuNomOJ4MdnDMrC8csB17HQ8WUgM9FpJWFxsmyX8/Vd0tIYGwye5kP+w5fp4EaHW9XV1dXbdu6bY1ZCNV1/QxffWD/sql0k6tRl4RUq6IT1KWn4R/L5UI2ansVirVSrlWhpZqqVZ5RUqv/gN/gedrLiGvHNvzRy71VvUbUYu6mvnqv+zvNbkYM48MmUkJfGrEG1PTJJ7uMscn


Page 12: Icsm2009 bettenburg presentation

Resolving Multiple Sender Identities

• Participants send mail from different addresses

• Up to 21% of addresses are aliases

• Such aliases bias identity-based analyses

• Manual inspection and correction tedious

• No fully automated approach to resolve identities
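As a rough illustration of the alias problem, the sketch below groups sender addresses that share a normalized display name. This is only a heuristic under stated assumptions, not the approach used in the study: the names `name_key` and `group_aliases` and all sample addresses are hypothetical, and the result would still need manual review because distinct people can share a name.

```python
import re
from collections import defaultdict
from email.utils import parseaddr


def name_key(display_name, address):
    """Crude identity key: sorted name words, or the address local part if no name."""
    name = re.sub(r"[^a-z ]", "", display_name.lower())
    # Drop single-letter initials and sort, so "Jane Q. Doe" and "DOE Jane" collide.
    words = sorted(w for w in name.split() if len(w) > 1)
    if words:
        return " ".join(words)
    return address.split("@")[0].lower()


def group_aliases(from_headers):
    """Return only the keys that map to more than one distinct address."""
    groups = defaultdict(set)
    for header in from_headers:
        display_name, address = parseaddr(header)
        if address:
            groups[name_key(display_name, address)].add(address.lower())
    return {key: addrs for key, addrs in groups.items() if len(addrs) > 1}


if __name__ == "__main__":
    senders = [
        '"Jane Q. Doe" <jane@example.org>',
        "Jane Doe <jdoe@example.net>",
        "DOE Jane <jd@example.com>",
        "carol@example.com",
    ]
    for key, addresses in group_aliases(senders).items():
        print(key, sorted(addresses))
```

A heuristic like this only produces candidate groups. It cannot catch aliases with unrelated names, which is consistent with the slide's point that no fully automated approach exists.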


Page 13: Icsm2009 bettenburg presentation

Reconstructing Discussion Threads

[Figure: messages A, B, C, D shown once as a linear sequence and once as a thread hierarchy]

• Mail stored sequentially in archives

• Logical grouping: discussion topics

• Required information erroneous or missing

• Essential for social network and topic analysis
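One common way to rebuild a thread hierarchy from a linear archive is to follow the standard `Message-ID`, `In-Reply-To`, and `References` headers. The sketch below assumes those headers are present and correct, which the slide warns is often not the case; the function name `build_threads` is hypothetical, and messages whose parent cannot be found are simply treated as thread roots.

```python
from collections import defaultdict
from email import message_from_string


def build_threads(raw_messages):
    """Return (roots, children): root Message-IDs and a parent -> replies mapping."""
    parsed = []
    known = set()
    for raw in raw_messages:
        msg = message_from_string(raw)
        msg_id = (msg.get("Message-ID") or "").strip()
        # Prefer In-Reply-To; fall back to the last entry of References.
        parent = (msg.get("In-Reply-To") or "").strip()
        if not parent:
            refs = (msg.get("References") or "").split()
            parent = refs[-1] if refs else ""
        parsed.append((msg_id, parent))
        known.add(msg_id)

    roots = []
    children = defaultdict(list)
    for msg_id, parent in parsed:
        if parent and parent in known:
            children[parent].append(msg_id)
        else:
            # Missing or dangling parent: start a new thread.
            roots.append(msg_id)
    return roots, dict(children)


if __name__ == "__main__":
    archive = [
        "Message-ID: <a@x>\nSubject: configure\n\nfirst\n",
        "Message-ID: <b@x>\nIn-Reply-To: <a@x>\nSubject: Re: configure\n\nsecond\n",
        "Message-ID: <c@x>\nReferences: <a@x> <b@x>\nSubject: Re: configure\n\nthird\n",
        "Message-ID: <d@x>\nIn-Reply-To: <a@x>\nSubject: Re: configure\n\nfourth\n",
    ]
    print(build_threads(archive))
```

When the headers are erroneous or missing, a fallback such as matching normalized subject lines would be needed; that is left out here.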


Page 16: Icsm2009 bettenburg presentation

Attachments

• MIME standard defines extensions to email

• Binary data encoded as text

• Around 10% of messages have attachments

• Extract attachments and store separately
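Extracting attachments and storing them separately could look roughly like the sketch below, which uses Python's standard `email` package. The function name `extract_attachments`, the fallback file name pattern, and the rule that every non-text part counts as an attachment are assumptions made for illustration.

```python
from email import message_from_string
from pathlib import Path


def extract_attachments(raw, out_dir):
    """Return (text, saved_paths): the plain-text body and the files written to out_dir."""
    msg = message_from_string(raw)
    text_parts = []
    saved = []
    for index, part in enumerate(msg.walk()):
        if part.is_multipart():
            continue  # containers carry no payload of their own
        # decode=True undoes base64 / quoted-printable transfer encoding.
        payload = part.get_payload(decode=True) or b""
        if part.get_content_type() == "text/plain" and part.get_filename() is None:
            charset = part.get_content_charset() or "us-ascii"
            text_parts.append(payload.decode(charset, errors="replace"))
        else:
            # Keep only the final path component of the declared file name.
            name = Path(part.get_filename() or f"part-{index}.bin").name
            target = Path(out_dir) / name
            target.write_bytes(payload)
            saved.append(target)
    return "".join(text_parts), saved


if __name__ == "__main__":
    import tempfile

    raw = (
        "MIME-Version: 1.0\n"
        'Content-Type: multipart/mixed; boundary="B"\n'
        "\n"
        "--B\n"
        "Content-Type: text/plain; charset=us-ascii\n"
        "\n"
        "Here is the result.\n"
        "--B\n"
        "Content-Type: application/octet-stream\n"
        "Content-Transfer-Encoding: base64\n"
        'Content-Disposition: attachment; filename="result.bin"\n'
        "\n"
        "aGVsbG8=\n"
        "--B--\n"
    )
    with tempfile.TemporaryDirectory() as tmp:
        text, files = extract_attachments(raw, tmp)
        print(repr(text), [f.name for f in files])
```

Only the returned text would then go to the text-mining step, while the decoded binary data stays on disk for separate analysis.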


Page 19: Icsm2009 bettenburg presentation

Quotes and Signatures

• Duplicate information

• Unrelated to actual message

• Removing signatures is challenging

• Quoted text may or may not be desirable

• Signatures impact text mining approaches

• No perfect method for signature removal
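A simple heuristic for quotes and signatures is sketched below: drop lines that start with `>` and cut the message at the conventional `-- ` signature delimiter. The function name `strip_quotes_and_signature` and the `keep_quotes` switch are hypothetical, and accepting a bare `--` as delimiter is an assumption for archives that strip trailing spaces. As the slide says, no such method is perfect.

```python
def strip_quotes_and_signature(body, keep_quotes=False):
    """Remove '>'-quoted lines (optionally) and everything after a signature delimiter."""
    kept = []
    for line in body.splitlines():
        # "-- " on its own line conventionally starts the signature;
        # a bare "--" is also accepted here (an assumption, may over-cut).
        if line in ("-- ", "--"):
            break
        if not keep_quotes and line.lstrip().startswith(">"):
            continue
        kept.append(line)
    return "\n".join(kept).strip()


if __name__ == "__main__":
    body = (
        "> If you can grab a copy and run it on your machine,\n"
        "> that would help.\n"
        "\n"
        "Here is a gzip'ed tar of the results.\n"
        "\n"
        "-- \n"
        "Jane Doe\n"
    )
    print(strip_quotes_and_signature(body))
    print(strip_quotes_and_signature(body, keep_quotes=True))
```

A decorated signature without the delimiter, such as a box drawn with `=` characters, passes through this filter untouched, so signature text can still leak into the mined terms.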

[Figure: the boxed signature of the example message, reading "Please do not shoot at the thermonuclear weapons! -- Deacon" and "Finger gee[email protected].edu for my public key."]


Page 20: Icsm2009 bettenburg presentation

More Risks Presented in the Paper


Page 21: Icsm2009 bettenburg presentation

(1) Mailing Lists contain valuable information on a project.

(2) Data Needs Pre-Processing before applying traditional tools.

(3) Manual Data Processing is often not feasible or requires much effort.

(4) Off-the-Shelf tools were not designed to prepare data for mining.
