The Role of Metadata in Machine Learning for TAR

Amanda Jones, Marzieh Bazrafshan, Fernando Delgado, Tania Lihatsh, Tami Schuyler


Post on 03-Jan-2016


Page 1: The Role of Metadata in Machine Learning for TAR Amanda Jones Marzieh Bazrafshan Fernando Delgado Tania Lihatsh Tami Schuyler ajones@h5.com mbazrafshan@h5.com


Page 2

Metadata Use in TAR – Lack of Consensus

It is generally agreed across the industry that metadata is a critical component of ESI for eDiscovery.

• Some view incorporation of metadata into machine learning algorithm development for TAR as a matter of course.

• Others view it as atypical of, if not incompatible with, machine learning approaches to document classification.

Page 3

Metadata in TAR – Goals of the Study

If metadata provides information that is vital for manual document review in eDiscovery, though, why would it be any less valuable for TAR?

Goals of the current study:

1. Establish the potential benefit of incorporating metadata into TAR algorithm development processes.

2. Establish the potential benefit of leveraging widely inclusive sets of metadata, as opposed to limited pre-determined sets.

3. Establish the potential benefit of integrating metadata using techniques that preserve the added layer of information associated with metadata values.

Page 4

Metadata in TAR – Data & Methods

3 Distinct Data Sets:

• 1 drawn from Topic 301 of the TREC 2010 Interactive Task

• 2 proprietary business data sets

• Random sample of 4500 individually labeled documents for each

• Split into a 3000-document Control Set and a 1500-document Training Set
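The slides do not say how each 4500-document sample was divided, so the following is only a minimal sketch that assumes a simple random partition into disjoint sets. The function name, the fixed seed, and the document IDs are hypothetical.

```python
import random


def split_control_training(doc_ids, n_control=3000, n_training=1500, seed=0):
    """Randomly partition labeled documents into disjoint control and training sets."""
    if len(doc_ids) < n_control + n_training:
        raise ValueError("not enough documents for the requested split")
    rng = random.Random(seed)  # fixed seed is an assumption, used for reproducibility
    shuffled = list(doc_ids)
    rng.shuffle(shuffled)
    control = shuffled[:n_control]
    training = shuffled[n_control:n_control + n_training]
    return control, training


# Hypothetical IDs standing in for one 4500-document labeled sample.
sample = [f"doc{i:04d}" for i in range(4500)]
control_set, training_set = split_control_training(sample)
```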

All machine learning models developed using an open source Support Vector Machine (SVM) implementation

Performance Metric: Area Under the Receiver Operating Characteristic Curve (AUROC)
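For readers less familiar with the metric, AUROC can be read as the probability that a randomly chosen responsive document is scored above a randomly chosen non-responsive one. The sketch below computes it directly from that definition in plain Python; it is an illustration of the metric, not the evaluation code used in the study, and the labels and scores are made up.

```python
def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up control-set labels (1 = responsive) and model scores.
example_labels = [1, 1, 0, 1, 0]
example_scores = [0.9, 0.8, 0.7, 0.4, 0.2]
example_auroc = auroc(example_labels, example_scores)  # 5 of 6 pairs ranked correctly
```

The pairwise loop is quadratic in the number of documents, which is acceptable for a 3000-document Control Set; a rank-based formulation would scale better.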

Page 5

Metadata in TAR – Metadata Choices

Metadata availability varied across data sets

Fields were chosen opportunistically based on availability and amenability to feature transformation

• Fields that were populated for fewer than 5% of the documents were omitted

• Continuous metadata values were transformed into categorical values.

o For example, date values were collapsed into simple Month-Year values; file size values were assigned to categories ranging from very small to very large.
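The two preparation rules above can be sketched as follows. This is an illustration under stated assumptions: the field names, the month-year string format, and the size thresholds are hypothetical, since the slides give only the 5% cutoff and the "very small" to "very large" range.

```python
from datetime import date


def populated_fields(docs, min_fraction=0.05):
    """Return metadata fields populated for at least min_fraction of the documents."""
    counts = {}
    for doc in docs:
        for field, value in doc.items():
            if value not in (None, ""):
                counts[field] = counts.get(field, 0) + 1
    threshold = min_fraction * len(docs)
    return {field for field, count in counts.items() if count >= threshold}


def month_year(d):
    """Collapse a full date into a coarse Month-Year category, e.g. '03-2009'."""
    return f"{d.month:02d}-{d.year}"


# Hypothetical byte thresholds; the study does not report its actual cut points.
SIZE_BINS = [
    (10_000, "very_small"),
    (100_000, "small"),
    (1_000_000, "medium"),
    (10_000_000, "large"),
]


def size_category(n_bytes):
    """Map a continuous file size onto one of five ordered categories."""
    for upper_bound, name in SIZE_BINS:
        if n_bytes < upper_bound:
            return name
    return "very_large"


# 40 hypothetical documents: every one has an author, only one has a Bates prefix.
docs = [{"author": "jones", "bates_prefix": ""} for _ in range(40)]
docs[0]["bates_prefix"] = "ABC"
kept = populated_fields(docs)  # bates_prefix is populated for 2.5% of docs, so it is dropped
```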

Page 6

Metadata in TAR – Metadata Choices

Standard Metadata:

• Author

• Sender, Recipient, Cc

• Subject, Title, File Name

• Document Type, File Extension

• Sent Date, Created Date

• Sender Domain, Recipient Domain

Extended Metadata (all of the above, plus):

• All Custodians, Primary Custodian

• Record Type

• Attachment Name

• Bates Prefix

• Drop Id

• Company/Organization

• Native File Size, Text Size

• Normalized Date, Parent Date

• Family Count, Attachment Count

• Recipient Count, Cc Count, Combined Recipient Count

• Page Count

Page 7

Metadata in TAR – Incremental Testing

Hypothesis 1: Incorporating metadata into the machine learning process will lead to improved model performance.

• Text from Standard Metadata added to the body text of documents

o There was a general trend of improvement across the three data sets. The improvement was highly significant for Data Set 3.

Hypothesis 2: Incorporating the text from Extended Metadata will lead to superior results as compared to incorporating Standard Metadata alone.

• Text from Extended Metadata added to the body text of documents – compared to models based on addition of Standard Metadata

o There was a general trend of improvement across the three data sets. The improvement was highly significant for Data Sets 1 and 3.
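In both hypotheses above, metadata is added to the document as plain text, so a metadata value becomes indistinguishable from the same word in the body. A minimal sketch of that step, assuming each document's metadata is available as a dictionary; the function name and field keys are hypothetical.

```python
def add_plain_metadata(body_text, metadata, fields):
    """Append the values of the chosen metadata fields to the body text as plain words."""
    parts = [body_text]
    for field in fields:
        value = metadata.get(field)
        if value:  # skip fields that are missing or empty for this document
            parts.append(str(value))
    return " ".join(parts)


# Hypothetical document; the field keys mirror a few Standard Metadata fields.
example_metadata = {"author": "jones", "subject": "Q3 forecast"}
combined = add_plain_metadata(
    "please review", example_metadata, ["author", "subject", "cc"]
)
```

Switching from Standard to Extended Metadata in this scheme only changes the list of fields passed in.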

Page 8

Metadata in TAR – Incremental Testing

Hypothesis 3: Using metadata values in ways that preserve both attribute and value information will result in superior performance.

• Extended Metadata values prefixed to indicate their field origins added to body text – compared to models with Extended Metadata added as plain text

o Improvements varied across the three data sets but were significant for Data Set 2.

• Dual modeling – prefixed metadata values and simple body text modeled independently, scores from two models multiplied to arrive at a final score – dual models compared to single models with prefixed Extended Metadata

o There was a general trend of improvement across the three data sets. The improvement was highly significant for Data Sets 2 and 3.
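The two techniques above can be sketched as follows. The prefix format and function names are hypothetical, and the score combination assumes both models emit scores in the range 0 to 1; the slides state only that the two scores were multiplied, not how they were scaled.

```python
import re


def tag_metadata(metadata, fields):
    """Turn metadata values into tokens prefixed with their field of origin."""
    tokens = []
    for field in fields:
        value = metadata.get(field)
        if not value:
            continue
        for word in re.findall(r"\w+", str(value).lower()):
            # 'jones' in the author field becomes 'AUTHOR_jones', which is
            # distinct from 'jones' appearing in the body text.
            tokens.append(f"{field.upper()}_{word}")
    return tokens


def dual_score(text_score, metadata_score):
    """Combine the body-text model score and the metadata model score by multiplication."""
    return text_score * metadata_score


tagged = tag_metadata({"author": "Amanda Jones", "subject": "Q3"}, ["author", "subject"])
final_score = dual_score(0.8, 0.5)
```

With a product, a document needs a reasonably high score from both models to rank near the top, which is one plausible reading of why the dual model behaves differently from a single combined model.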

Page 9

Overall Findings – MD Can Improve TAR

Stepping back from the incremental pairwise comparisons, clearer answers and more striking differences emerge.

Models incorporating Extended Metadata significantly outperformed models based on body text alone in each condition for every data set.

[Figure 1 - Extended Metadata vs. Text Only: AUROC (axis 0.75 to 1) for Data Sets 1, 2, and 3 under four conditions: Text Only; Ext Plain MD Added to Text; Ext Tagged MD Added to Text; Ext MD and Text - Dual Model.]

Page 10

Overall Findings – More MD Is Better

Similarly strong trends can be observed when each model created using Standard Metadata is compared to its Extended Metadata counterpart.

Extended Metadata improvements were highly significant in all cases for Data Sets 1 and 3 and significant for the dual model in Data Set 2.

[Figure - Extended Metadata vs. Standard Metadata: AUROC (axis 0.75 to 1) for Data Sets 1, 2, and 3, comparing Std and Ext versions of three conditions: Plain MD Added to Text; Tagged MD Added to Text; MD and Text - Dual Model.]

Page 11

Metadata in TAR – Conclusions

Incorporating metadata as an integral component of machine learning processes for TAR in eDiscovery will benefit the community of practice.

• Neglecting this resource is – at best – a missed opportunity. In an information retrieval effort, why leave information on the table?

To realize the full potential of using metadata in machine learning for TAR, practitioners should not rely solely on a limited intuitive set of metadata.

• Examining the contributions of specific metadata fields at a more granular level could be very worthwhile.

o Is “all available” always the best choice?

There are still more questions than answers when it comes to the use of metadata in modeling for TAR.

o More effective algorithms?

o Better techniques for capturing the full metadata contribution?

Page 12

Questions?