m12s06 - will technology-assisted predictive modeling and auto-classification end the 'end...

Cohasset Associates, Inc. NOTES 2012 Managing Electronic Records Conference 6.1 Will Technology-Assisted Predictive Modeling and Auto-Classification End the ‘End-User’ Burden in Records Management? 2012 Managing Electronic Records Conference Chicago, IL May 7, 2012 Jason R. Baron, Esq. Director of Litigation Office of General Counsel National Archives and Records Administration Dave Lewis, Ph.D. David D. Lewis Consulting, LLC Chicago, IL A New Era of Government “[P]roper records management is the backbone of open Government.” President Obama’s Memorandum dated November 28, 2011 re “Managing Government Records” http://www.whitehouse.gov/the-press-office/2011/11/28/presidential-memorandum-managing-government-records

Upload: mer-conference

Post on 28-Nov-2014


Category:

Education



DESCRIPTION

From the MER Conference 2012

Speakers: Jason R. Baron, Esq.; Dave Lewis, Ph.D.

2012 is the year we will see great strides by information professionals in using automation (in the form of "predictive" and "technology-assisted" search, filtering, and auto-classification) to achieve efficiencies and cut costs in records management as well as in legal settings. The strategic use of these new methods is absolutely necessary given the massive, exponential increases in electronically stored information, in the form of records within corporate networks and repositories.

This session addresses the latest technological developments from two perspectives:

- A longtime advocate of smart technology in the public recordkeeping sector, and
- A leading information scientist.

The session includes a state-of-the-art overview of the latest developments in technology-assisted review, with an emphasis on how these technologies can and will enhance electronic records management by helping to end the era of excessive reliance on end-user RM.

You will learn:

- What technology-assisted review and predictive analytics are all about: using advanced search, filtering, and auto-classification as part of a defensible electronic records management program.
- How these technologies also add value to overall corporate information governance.

TRANSCRIPT

Page 1: M12S06 - Will Technology-Assisted Predictive Modeling and Auto-Classification End the 'End User' Burden in Records Management?

Cohasset Associates, Inc.

NOTES

2012 Managing Electronic Records Conference 6.1

Will Technology-Assisted Predictive Modeling and Auto-Classification End the ‘End-User’ Burden in Records Management?

2012 Managing Electronic Records Conference

Chicago, IL

May 7, 2012

Jason R. Baron, Esq.

Director of Litigation

Office of General Counsel

National Archives and Records Administration

Dave Lewis, Ph.D.

David D. Lewis Consulting, LLC

Chicago, IL

A New Era of Government

“[P]roper records management is the backbone of open Government.”

President Obama’s Memorandum dated November 28, 2011 re “Managing Government Records”

http://www.whitehouse.gov/the-press-office/2011/11/28/presidential-memorandum-managing-government-records


Reality: The era of Big Data has just begun….

Lehman Brothers Investigation

-- 350 billion page universe (3 petabytes)

-- Examiner narrowed collection by selecting key custodians, using dozens of Boolean searches

-- Reviewed 5 million docs (40 million pages) using 70 contract attorneys

Source: Report of Anton R. Valukas, Examiner, In re Lehman Brothers Holdings Inc., et al., Chapter 11 Case No. 08-13555 (U.S. Bankruptcy Ct. S.D.N.Y. March 11, 2010), Vol. 7, Appx. 5, at http://lehmanreport.jenner.com/.

Process Optimization Problem 1: The transactional toll of user-based recordkeeping schemes (“as is” RM)

…. and the need for better, automated solutions ….


Impact of Technology on E-Records Management: Snapshot 2012 (“As is”)

-- A universe of proprietary products exists in the marketplace: document management and records management applications (RMAs)

-- DoD 5015.2 version 3 compliant products

-- However, scalability issues exist

-- Agencies must prepare to confront significant front-end process issues when transitioning to electronic recordkeeping

-- Records schedule simplification is key

RM wish list for 2012….

-- RM’s “easy button”: the elusive goal of zero extra keystrokes to comply with RM requirements (capture)

-- A technology app that automatically tags records in compliance with RM policies and practices (categorize)

-- Supervised learning RM with minimal records officer or end user involvement (learn)

-- Rule-based and role-based RM

-- Advanced search

Electronic Archiving As The First Step

What is it?

-- 100% snapshot of (typically) email, plus in some cases other selected ESI applications

How does it differ from an RMA?

-- Goal is preservation of evidence, not records management per se

-- NARA Bulletin 2008-05


A Possible Path Forward?

-- Email archiving in the short term, synced to existing proprietary software on the email system

-- Designation of key senior officials as creating permanent records, consistent with existing records schedules

-- Additional designations of permanent records by agency component

-- “Smart” filters/categorical rules built in based on content, to the extent feasible

-- By default, records go into designated temporary record buckets and are disposed of under existing records schedules
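The path sketched above (role-based designations, content-based "smart" filters, and a temporary default) can be illustrated with a few lines of Python. This is only a minimal sketch under stated assumptions: the role list, the keyword rules, the `Email` type, and the `disposition` function are all hypothetical and do not come from the slides or from any NARA guidance.

```python
from dataclasses import dataclass

# Hypothetical role list and keyword rules; a real agency would derive
# these from its approved records schedules.
PERMANENT_ROLES = {"secretary", "deputy secretary", "general counsel"}
PERMANENT_KEYWORDS = ("policy decision", "final rule")


@dataclass
class Email:
    sender_role: str
    subject: str
    body: str


def disposition(msg: Email) -> str:
    """Return 'permanent' or 'temporary' for an archived email."""
    # Role-based rule: designated senior officials create permanent records.
    if msg.sender_role.lower() in PERMANENT_ROLES:
        return "permanent"
    # Content-based "smart" filter: crude keyword match on subject and body.
    text = f"{msg.subject} {msg.body}".lower()
    if any(keyword in text for keyword in PERMANENT_KEYWORDS):
        return "permanent"
    # Default bucket: temporary, disposed of under the existing schedule.
    return "temporary"
```

The point of the ordering is that no end user has to file anything: the role rule fires first, the content rule second, and everything else falls into the temporary bucket.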

A pyramid approach combines disposition policy with automated tools to bring FRA email under records management, preservation, and access

[Figure: pyramid with a slider. Legend: permanent or top officials; temporary or staff and support.]

The position of the “set-point” for email capture depends on policy and resources: setting it higher allows use of tools now available to get 100% of email at lower volumes;* setting it lower means more records will be captured and smarter tools are needed to distinguish and disposition temporary and non-record material.

Implementing an email archiving policy is feasible now, since tools are readily available to capture 100% of email traffic at the individual or organizational level, in formats that can be archived.



How To Avoid A Train Wreck With Email Archiving….


Capture E-mail But Utilize Records Management!

Functional Requirements for Categorization Products in the Federal workplace

Ease of use …. Scalability …. Archiving in native formats….. Metadata preservation … Seamless integration with existing software apps …. Versioning …. Compatibility with big bucket records schedules …. Advanced search capabilities …. Ease of training / machine learning using records officers or end users …. Cost

Process Optimization Problem 2: The Coming Age of Dark Archives (and the inability to provide access)



Emerging New Strategies: “Predictive Analytics”

Improved review and case assessment: cluster docs through use of software, with minimal human intervention at the front end to code a “seeded” data set

Slide adapted from Gartner Conference, June 23, 2010, Washington, D.C.

Language Processing Technologies

-- Retrieval / Search

-- Classification

-- Question Answering

-- Summarization

-- Entity Recognition

-- Information Extraction

-- Machine Translation

[Diagram labels: 1. Information Retrieval; 2. Natural Language Processing]

Text Classification

-- Deciding which of several groups a text belongs to

-- Crudest form of language understanding...

-- ...but often can be automated with high accuracy


Why Classify?

Reduce infinite variety of text...

...to finite set of classes...

...to specify an action for every possible input.


Other Advantages of Text Classification

-- Supervised learning: Classifiers (rules) can be learned by imitating manual classifications

-- Straightforward numerical measures of quality (e.g. recall: 85% +/- 4%; precision: 75% +/- 3%)

-- Objective reason why a decision was made (the classification rule)
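To make the "numerical measures of quality" above concrete, here is a minimal sketch of how recall and precision, each with a +/- margin, might be computed from a manually checked sample. It assumes the checked documents are a simple random sample and uses the normal approximation to a binomial proportion; the function names are hypothetical, and the counts in the usage note are invented for illustration and are not the figures behind the slide.

```python
import math


def proportion_with_margin(successes: int, trials: int, z: float = 1.96):
    """Return (estimate, margin) via the normal approximation to a binomial proportion."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    # z = 1.96 corresponds to a roughly 95% confidence level.
    margin = z * math.sqrt(p * (1 - p) / trials)
    return p, margin


def recall_precision(true_pos: int, false_pos: int, false_neg: int):
    """Recall = TP / (TP + FN); precision = TP / (TP + FP)."""
    recall = proportion_with_margin(true_pos, true_pos + false_neg)
    precision = proportion_with_margin(true_pos, true_pos + false_pos)
    return recall, precision
```

For example, `recall_precision(255, 85, 45)` gives a recall estimate of 0.85 and a precision estimate of 0.75, each with a margin of roughly four to five percentage points. Larger samples shrink the margin.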

Variations on Classification

-- Binary vs. multiclass

-- Hierarchical

-- Probabilistic (e.g. 83% / 17%)

-- Graded / ordered / fuzzy
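As an illustration of a classifier that is both learned from manual classifications and probabilistic, the sketch below implements a small multinomial naive Bayes model in plain Python. Naive Bayes is chosen here only because it is compact; the slides do not say which learning method the speakers had in mind, and the function names and labels are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train(labeled_docs):
    """labeled_docs: list of (text, label) pairs coded by a human reviewer."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocab = set()
    for text, label in labeled_docs:
        words = text.lower().split()
        word_counts[label].update(words)
        label_counts[label] += 1
        vocab.update(words)
    return word_counts, label_counts, vocab


def predict_proba(model, text):
    """Return {label: probability} for a new text."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    log_scores = {}
    for label in label_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(label_counts[label] / total_docs)
        for word in text.lower().split():
            if word not in vocab:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocab))
            )
        log_scores[label] = score
    # Normalise the log scores into probabilities that sum to 1.
    top = max(log_scores.values())
    exp_scores = {label: math.exp(s - top) for label, s in log_scores.items()}
    norm = sum(exp_scores.values())
    return {label: value / norm for label, value in exp_scores.items()}
```

Trained on a handful of hand-labeled examples (say, labels "record" and "non-record"), `predict_proba` returns one probability per class, which is the kind of 83% / 17% split the slide refers to.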


Defining Sets of Classes

Tradeoff among:

-- Ideal classes to implement policy

-- Classes you can teach people to assign

-- Classes you can teach software to assign

Be skeptical of automatic discovery of classes

Text Retrieval Systems

AKA search engines, semi-structured databases, text databases, etc.

Classification vs. Search

-- Classification: autonomous, long term, organizational, structured

-- Search: interactive, transitory, personal, independent ??

Some Distinctions Among Search Approaches

-- Exact Match vs. Ranked Retrieval vs. Browsing

-- "Concepts" vs. "Keywords"

-- Text Representations

-- Matching Aids

Exact Match Search

-- Query specifies conditions document must meet, e.g. budget AND Knoxville AND (revised OR preliminary)

-- Variants: Boolean, SQL, Faceted

-- Often (ambiguously) called "keyword" search
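The example query above can be evaluated with a few lines of Python. This is a minimal sketch that assumes whitespace tokenization and case-insensitive matching; the sample documents and the helper names are invented for illustration.

```python
def tokens(text: str) -> set[str]:
    """Lowercase a document and split it into a set of words."""
    return set(text.lower().split())


def matches(doc: str) -> bool:
    """Evaluate: budget AND Knoxville AND (revised OR preliminary)."""
    words = tokens(doc)
    return (
        "budget" in words
        and "knoxville" in words
        and ("revised" in words or "preliminary" in words)
    )


# Invented sample documents.
docs = [
    "Revised budget for the Knoxville office",
    "Knoxville budget approved",
    "Preliminary budget for Chicago",
]
hits = [d for d in docs if matches(d)]
```

Only the first document satisfies every condition. An exact-match system returns it and nothing else, with no notion of a "close" match.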

A Faceted Search Interface



Ranked Retrieval

-- Query specifies important attributes of desired documents

-- System statistically weights those attributes

-- Results returned in order of strength of match
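One common way to weight query attributes statistically is TF-IDF, sketched below. The slides do not name a specific weighting scheme, so treat this as one illustrative choice; the `rank` function and its whitespace tokenization are assumptions.

```python
import math
from collections import Counter


def rank(query: str, docs: list[str]) -> list[tuple[float, str]]:
    """Score each document by the summed TF-IDF weight of the query terms."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents each word appears.
    doc_freq = Counter(word for words in tokenized for word in set(words))
    scored = []
    for doc, words in zip(docs, tokenized):
        term_freq = Counter(words)
        score = 0.0
        for term in set(query.lower().split()):
            if term in term_freq:
                # Rare terms get more weight than common ones.
                idf = math.log(n_docs / doc_freq[term])
                score += term_freq[term] * idf
        scored.append((score, doc))
    # Strongest match first.
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

Unlike the exact-match case, every document receives a score, so partial matches still appear, just lower in the list.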

Statistical Evidence in Ranked Retrieval

-- Corpus statistics: word (and metadata) counts

-- Unsupervised learning: clustering, LSI/LSA, etc.; finds (maybe useless) patterns

-- Supervised learning, aka "relevance feedback": learn indicators of user interest

Browsing

-- Hierarchies

-- Networks

-- Clusters

-- Spaces / Maps / Dimensions

Make great pictures / demos; unclear if useful for finding information


Visual Analysis Examples

(Presentation by Dr. Victoria Lemieux, Univ. British Columbia, at Society of American Archivists Annual Mtg. 2010, Washington, D.C.)

With acknowledgments to Jeffrey Heer, Exploring Enron, http://hci.stanford.edu/jheer/projects/enron/, Adam Perer, Contrasting Portraits, http://hcil.cs.umd.edu/trs/2006-08/2006-08.pdf, and Fernanda Viegas, Email Conversations, http://fernandaviegas.com/email.html



What Evidence Can The Search Software Use?

-- Words, phrases, etc.

-- Manually assigned categories

-- Metadata: author, organization, creation date, change date, access date, length, file type, ...

-- Contextual information (links, attachments, ...)

What Resources Aid Matching?

-- Linguistic analysis: at word level or higher

-- Clusters / spaces / ...

-- Thesauri / semantic nets / concept maps / ...

Questions to ask: Suited to your task? Modifiable? How is text determined to belong to category?

Concepts v. Keywords

Supreme Court of Information Retrieval, Case No. 1-tfidf-0-2902, 2009

Search software marketing:

-- Them = keyword search = bad

-- Us = concept search = good

Reality:

-- Both terms have referred to dozens of different technologies...

-- ...including some of the same ones!

-- Conceptual search is an aspiration, not a technology


Example of Boolean search string from U.S. v. Philip Morris

(((master settlement agreement OR msa) AND NOT (medical savings account OR metropolitan standard area)) OR s. 1415 OR (ets AND NOT educational testing service) OR (liggett AND NOT sharon a. liggett) OR atco OR lorillard OR (pmi AND NOT presidential management intern) OR pm usa OR rjr OR (b&w AND NOT photo*) OR phillip morris OR batco OR ftc test method OR star scientific OR vector group OR joe camel OR (marlboro AND NOT upper marlboro)) AND NOT (tobacco* OR cigarette* OR smoking OR tar OR nicotine OR smokeless OR synar amendment OR philip morris OR r.j. reynolds OR ("brown and williamson") OR ("brown & williamson") OR bat industries OR liggett group)

U.S. v. Philip Morris E-mail Winnowing Process

-- 20 million email records

-- 200,000 hits based on keyword terms used (1%)

-- 100,000 relevant emails

-- 80,000 produced to opposing party

-- 20,000 placed on privilege logs

A PROBLEM: only a handful entered as exhibits at trial

A BIGGER PROBLEM: the 1% figure does not scale
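The arithmetic behind the winnowing figures above can be checked with a short script. The stage counts are the ones on the slide; the one-billion-record universe at the end is a hypothetical figure, added only to show what a fixed 1% hit rate would imply at a larger scale.

```python
# Stage counts as given on the slide.
stages = [
    ("email records", 20_000_000),
    ("hits on keyword terms", 200_000),
    ("relevant emails", 100_000),
    ("produced to opposing party", 80_000),
    ("placed on privilege logs", 20_000),
]

universe = stages[0][1]
for name, count in stages:
    # Print each stage as a count and as a share of the starting universe.
    print(f"{name:<28} {count:>12,} {count / universe:8.3%}")

# Hypothetical larger universe (an assumption, not a figure from the slides).
hypothetical_universe = 1_000_000_000
hits_at_one_percent = hypothetical_universe // 100
```

At one billion records, a 1% hit rate would still leave ten million documents for human review, which is the sense in which the 1% figure does not scale.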

Judicial endorsement of predictive analytics in document review by Judge Peck in Da Silva Moore v. Publicis Groupe (S.D.N.Y. Feb. 24, 2012)

This opinion appears to be the first in which a Court has approved of the use of computer-assisted review. . . . What the Bar should take away from this Opinion is that computer-assisted review is an available tool and should be seriously considered for use in large-data-volume cases where it may save the producing party (or both parties) significant amounts of legal fees in document review. Counsel no longer have to worry about being the ‘first’ or ‘guinea pig’ for judicial acceptance of computer-assisted review . . . Computer-assisted review can now be considered judicially-approved for use in appropriate cases.


Social Networking/Links Analysis Example

From Marc Smith, posted on Flickr under Creative Commons License

Judicial second guessing of failure to use e-search capabilities: Capitol Records v. MP3 Tunes, 261 F.R.D. 44 (S.D.N.Y. 2009)

“In [a prior case] the Court notes its dismay that the party opposing discovery of its ESI had organized its files in a manner which seemed to serve no purpose other than ‘to discourage audits. . .’ Similarly, in this case, [the party] host[ed] no ediscovery software on their servers and apparently are unable to conduct centralized email searches of groups of users without downloading them to a separate file and relying on the services of an outside vendor.”

Judicial second guessing of failure to use e-search capabilities: Capitol Records v. MP3 Tunes (cont’d)

Court went on to add:

“The day undoubtedly will come when burden arguments based on a large organization’s lack of internal ediscovery software will be received about as well as the contention that a party should be spared from retrieving paper documents because it had filed them sequentially, but in no apparent groupings, in an effort to avoid the added expense of file folders or indices.”


Problem 3: Innovative Thinking


The records management world of tomorrow….

References

Background Law Review Referencing Autocategorization & Advanced Search

J. Baron, “Law in the Age of Exabytes: Some Further Thoughts on ‘Information Inflation’ and Current Issues in E-Discovery Search,” 17 Richmond J. Law & Technology (2011), see http://law.richmond.edu

Latest “Predictive Coding” Case Law to follow in blogs online:

Da Silva Moore v. Publicis Groupe & MSL Group, 11 Civ. 1279 (S.D.N.Y.) (Peck, M.J.) (Opinion dated Feb. 24, 2012)

Kleen Products, LLC v. Packaging Corp. of America, 10 C 5711 (N.D. Ill.) (Nolan, M.J.)


Jason R. Baron

Director of Litigation

Office of General Counsel

National Archives and Records Administration

(301) 837-1499

Email: [email protected]

Dave Lewis, Ph.D.

David D. Lewis Consulting, LLC

Chicago, IL

Email: [email protected]

http://www.DavidDLewis.com