m12s06 - will technology-assisted predictive modeling and auto-classification end the 'end...
DESCRIPTION
From the MER Conference 2012
Speakers: Jason R. Baron, Esq.; Dave Lewis, Ph.D.
2012 is the year we will see great strides by information professionals in using automation (in the form of "predictive" and "technology-assisted" search, filtering, and auto-classification) for the purpose of achieving efficiencies and cutting costs in records management as well as in legal settings. The strategic use of these new methods is absolutely necessary given the massive, exponential increases in electronically stored information, in the form of records within corporate networks and repositories.
This session addresses the latest technological developments from two perspectives:
- A longtime advocate of smart technology in the public recordkeeping sector, and
- A leading information scientist.
The session includes a state-of-the-art overview of the latest developments in technology-assisted review, with an emphasis on how these technologies can and will enhance electronic records management by helping to end the era of excessive reliance on end user RM.
You will learn:
- What technology-assisted review and predictive analytics are all about using advanced search, filtering, and auto-classification as part of a defensible electronic records management program.
- How these technologies also add value to overall corporate information governance.
TRANSCRIPT
Cohasset Associates, Inc.
NOTES
2012 Managing Electronic Records Conference 6.1
Will Technology-Assisted Predictive Modeling and Auto-Classification End the ‘End-User’ Burden in
Records Management?
2012 Managing Electronic Records Conference
Chicago, IL
May 7, 2012
Jason R. Baron, Esq.
Director of Litigation
Office of General Counsel
National Archives and Records Administration
Dave Lewis, Ph.D.
David D. Lewis Consulting, LLC
Chicago, IL
A New Era of Government
“[P]roper records management is the backbone of open Government.”
President Obama’s Memorandum dated November 28, 2011 re “Managing Government Records”
http://www.whitehouse.gov/the-press-office/2011/11/28/presidential-memorandum-managing-government-records
Reality: The era of Big Data has just begun….
Lehman Brothers Investigation
-- 350 billion page universe (3 petabytes)
-- Examiner narrowed collection by selecting key custodians, using dozens of Boolean searches
-- Reviewed 5 million docs (40 million pages using 70 contract attorneys)
Source: Report of Anton R. Valukas, Examiner, In re Lehman Brothers Holdings Inc., et al., Chapter 11 Case No. 08-13555 (U.S. Bankruptcy Ct. S.D.N.Y. March 11, 2010), Vol. 7, Appx. 5, at http://lehmanreport.jenner.com/.
Process Optimization Problem 1: The transactional toll of user-based recordkeeping schemes (“as is” RM)
…. and the need for better, automated solutions ….
Impact of Technology on E-Records Management: Snapshot 2012 (“As is”)
A universe of proprietary products exists in the marketplace: document management and records management applications (RMAs)
DoD 5015.2 version 3 compliant products
However, scalability issues exist
Agencies must prepare to confront significant front-end process issues when transitioning to electronic recordkeeping
Records schedule simplification is key
RM wish list for 2012….
RM’s “easy button”: the elusive goal of zero extra keystrokes to comply with RM requirements (capture)
A technology app that automatically tags records in compliance with RM policies and practices (categorize)
Supervised learning RM with minimal records officer or end user involvement (learn)
Rule-based and role-based RM
Advanced search
Electronic Archiving As The First Step
What is it?
100% snapshot of (typically) email, plus in some cases other selected ESI applications
How does it differ from an RMA?
Goal is preservation of evidence, not records management per se
NARA Bulletin 2008-05
A Possible Path Forward?
Email archiving in short term, synced to existing proprietary software on email system
Designation of key senior officials as creating permanent records, consistent with existing records schedules
Additional designations of permanent records by agency component
“Smart” filters/categorical rules built in based on content, to the extent feasible to do
By default, all other email falls into designated temporary record buckets, disposed of under existing records schedules.
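The path above amounts to a short chain of rules: role first, then content, then a default bucket. A minimal sketch of that chain follows, assuming hypothetical role names, keywords, and bucket labels that do not come from any agency schedule or product.

```python
# Hypothetical sketch of role-based plus content-based email disposition.
# Role names, keywords, and bucket labels are illustrative assumptions only.

PERMANENT_ROLES = {"agency head", "deputy secretary", "general counsel"}
PERMANENT_KEYWORDS = {"policy decision", "final rule"}


def disposition(sender_role: str, body: str) -> str:
    """Return 'permanent' or 'temporary' for one email."""
    # Rule 1: designated senior officials create permanent records.
    if sender_role.lower() in PERMANENT_ROLES:
        return "permanent"
    # Rule 2: a simple content filter catches designated subject matter.
    text = body.lower()
    if any(keyword in text for keyword in PERMANENT_KEYWORDS):
        return "permanent"
    # Default: temporary bucket, disposed of under the existing schedule.
    return "temporary"
```

For example, `disposition("staff", "Draft of the final rule attached")` returns `"permanent"` through the content rule, while an ordinary staff message falls through to `"temporary"`. A real "smart" filter would likely replace the keyword test with a trained classifier.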
A pyramid approach combines disposition policy with automated tools to bring FRA email under records management, preservation, and access
= permanent or top officials
= temporary or staff and support
slider
The position of the “set-point” for email capture depends on policy and resources: setting it higher allows use of tools now available to get 100% of email at lower volumes;* setting it lower means more records will be captured and smarter tools are needed to distinguish and disposition temporary- and non-record.
Implementing an email archiving policy is feasible now, since tools are readily available to capture 100% of email traffic at the individual or organizational level, in formats that can be archived.
How To Avoid A Train Wreck With Email Archiving….
Capture E-mail But Utilize Records Management!
Functional Requirements for Categorization Products in the Federal workplace
Ease of use …. Scalability …. Archiving in native formats….. Metadata preservation … Seamless integration with existing software apps …. Versioning …. Compatibility with big bucket records schedules …. Advanced search capabilities …. Ease of training / machine learning using records officers or end users …. Cost
Process Optimization Problem 2: The Coming Age of Dark Archives (and the inability to provide access)
Emerging New Strategies: “Predictive Analytics”
Improved review and case assessment: cluster docs thru use of software with minimal human intervention at front end to code “seeded” data set
Slide adapted from Gartner Conference, June 23, 2010, Washington, D.C.
Language Processing Technologies
Retrieval / Search
Classification
Question Answering
Summarization
Entity Recognition
Information Extraction
Machine Translation
...
(Diagram labels: 1. Information Retrieval; 2. Natural Language Processing)
Text Classification
Deciding which of several groups a text belongs to
Crudest form of language understanding...
...but often can be automated with high accuracy
Why Classify?
Reduce infinite variety of text...
...to finite set of classes...
...to specify an action for every possible input.
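The point of the slide can be shown in a few lines: once every text lands in one of a finite set of classes, a lookup table is enough to give every possible input an action. This is a toy sketch; the class names, keyword rules, and actions are assumptions made for illustration.

```python
# Illustrative only: class names, keyword rules, and actions are assumptions.
ACTIONS = {
    "permanent": "transfer to archive",
    "temporary": "retain, then dispose under schedule",
    "non-record": "no retention required",
}


def classify(text: str) -> str:
    """Toy keyword classifier standing in for any real classification rule."""
    lowered = text.lower()
    if "directive" in lowered or "decision memo" in lowered:
        return "permanent"
    if "lunch" in lowered or "newsletter" in lowered:
        return "non-record"
    return "temporary"


def action_for(text: str) -> str:
    # Every input lands in exactly one class, so every input has an action.
    return ACTIONS[classify(text)]
```

However varied the input text, `action_for` can only ever return one of the three entries in `ACTIONS`.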
Other Advantages of Text Classification
Supervised learning: Classifiers (rules) can be learned by imitating manual classifications
Straightforward numerical measures of quality
recall: 85% +/- 4%
precision: 75% +/- 3%
Objective reason why a decision was made
classification rule
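Recall and precision figures with a "+/-" margin, like those on the slide, can be computed from a manually reviewed sample. The sketch below assumes a normal-approximation 95% interval; the slide does not say how its margins were derived, and the sample size used in the note afterwards is a hypothetical choice.

```python
import math


def precision_recall(predicted, actual):
    """Compare classifier decisions with manual decisions (lists of booleans)."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def margin_of_error(proportion, sample_size, z=1.96):
    """Half-width of a normal-approximation interval (z=1.96 for about 95%)."""
    return z * math.sqrt(proportion * (1 - proportion) / sample_size)
```

Under this assumption, a recall estimate of 85% measured on about 300 truly relevant sample documents gives `margin_of_error(0.85, 300)` of roughly 0.04, which is the order of magnitude shown on the slide.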
Variations on Classification
Binary vs. multiclass
Hierarchical
Probabilistic (e.g., 83% / 17%)
Graded / ordered / fuzzy
Defining Sets of Classes
Tradeoff among:
Ideal classes to implement policy
Classes you can teach people to assign
Classes you can teach software to assign
Be skeptical of automatic discovery of classes
Text Retrieval Systems
AKA search engines, semi-structured databases, text databases, etc.
Classification vs. Search
Classification: autonomous, long term, organizational, structured
Search: interactive, transitory, personal, independent (??)
Some Distinctions Among Search Approaches
Exact Match vs. Ranked Retrieval vs. Browsing
"Concepts" vs. "Keywords"
Text Representations
Matching Aids
Exact Match Search
Query specifies conditions document must meet
Example: budget AND Knoxville AND (revised or preliminary)
Variants: Boolean, SQL, Faceted
Often (ambiguously) called "keyword" search
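To make the exact-match idea concrete, here is a minimal sketch of the slide's example query evaluated against a document's words. The tokenizer and function names are assumptions; real Boolean engines add stemming, wildcards, and proximity operators.

```python
import re


def tokens(text: str) -> set:
    """Lowercase the text and return its distinct alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def matches(text: str) -> bool:
    """Evaluate: budget AND Knoxville AND (revised OR preliminary)."""
    words = tokens(text)
    return (
        "budget" in words
        and "knoxville" in words
        and ("revised" in words or "preliminary" in words)
    )
```

A document either meets every condition or is excluded: `matches("Preliminary budgets for Knoxville")` is `False` because "budgets" is not the exact term "budget", which is the brittleness the later slides on ranked retrieval respond to.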
A Faceted Search Interface
Ranked Retrieval
Query specifies important attributes of desired documents
System statistically weights those attributes
Results returned in order of strength of match
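One common way to "statistically weight" query attributes is TF-IDF, where rare terms count for more than common ones. The sketch below is a simplified illustration of that idea, not the weighting used by any particular product; the smoothing in the IDF formula is an assumption.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def rank(query, documents):
    """Return (score, index) pairs sorted by descending TF-IDF match."""
    doc_tokens = [tokenize(d) for d in documents]
    n_docs = len(documents)
    # Document frequency: how many documents contain each term.
    df = Counter(term for toks in doc_tokens for term in set(toks))
    scored = []
    for index, toks in enumerate(doc_tokens):
        tf = Counter(toks)
        # Term frequency times a smoothed inverse document frequency.
        score = sum(
            tf[term] * math.log((1 + n_docs) / (1 + df[term]))
            for term in set(tokenize(query))
        )
        scored.append((score, index))
    # Strongest match first; ties broken by original position.
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))
```

Unlike exact match, every document gets a score, so a document containing only some query terms still appears, just lower in the list.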
Statistical Evidence in Ranked Retrieval
Corpus statistics: word (and metadata) counts
Unsupervised learning: clustering, LSI/LSA, etc.; finds (maybe useless) patterns
Supervised learning, aka "relevance feedback": learn indicators of user interest
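As a rough illustration of supervised learning from relevance feedback, the sketch below learns term weights from documents a reviewer has marked relevant or non-relevant, then scores new text. The weighting scheme is deliberately simple and is an assumption for illustration, not the method of any specific tool.

```python
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def learn_weights(relevant, non_relevant):
    """Weight = share of relevant docs containing a term
    minus share of non-relevant docs containing it."""
    weights = Counter()
    for doc in relevant:
        for term in set(tokenize(doc)):
            weights[term] += 1 / len(relevant)
    for doc in non_relevant:
        for term in set(tokenize(doc)):
            weights[term] -= 1 / len(non_relevant)
    return weights


def score(text, weights):
    """Sum the learned weights of the distinct terms in a new document."""
    return sum(weights[term] for term in set(tokenize(text)))
```

Terms the reviewer's relevant examples share gain positive weight, so unseen documents that resemble them score higher; terms typical of rejected documents pull the score down.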
Browsing
Hierarchies
Networks
Clusters
Spaces / Maps / Dimensions
make great pictures / demos
unclear if useful for finding information
Visual Analysis Examples
(Presentation by Dr. Victoria Lemieux, Univ. British Columbia, at Society of American Archivists Annual Mtg. 2010, Washington, D.C.)
With acknowledgments to Jeffrey Heer, Exploring Enron, http://hci.stanford.edu/jheer/projects/enron/, Adam Perer, Contrasting Portraits, http://hcil.cs.umd.edu/trs/2006-08/2006-08.pdf, and Fernanda Viegas, Email Conversations, http://fernandaviegas.com/email.html
What Evidence Can The Search Software Use?
Words, phrases, etc.
Manually assigned categories
Metadata
Author, organization, creation date, change date, access date, length, file type,...
Contextual information (links, attachments,...)
What Resources Aid Matching?
Linguistic analysis: at word level or higher
Clusters / spaces / ...
Thesauri / semantic nets / concept maps / ...
Suited to your task?
Modifiable?
How is text determined to belong to category?
Concepts v. Keywords
Supreme Court of Information Retrieval, Case No. 1-tfidf-0-2902, 2009
Search software marketing:
Them = keyword search = bad
Us = concept search = good
Reality: Both terms have referred to dozens of different technologies...
...including some of the same ones!
Conceptual search is an aspiration, not a technology
Example of Boolean search string from U.S. v. Philip Morris
(((master settlement agreement OR msa) AND NOT (medical savings account OR metropolitan standard area)) OR s. 1415 OR (ets AND NOT educational testing service) OR (liggett AND NOT sharon a. liggett) OR atco OR lorillard OR (pmi AND NOT presidential management intern) OR pm usa OR rjr OR (b&w AND NOT photo*) OR phillip morris OR batco OR ftc test method OR star scientific OR vector group OR joe camel OR (marlboro AND NOT upper marlboro)) AND NOT (tobacco* OR cigarette* OR smoking OR tar OR nicotine OR smokeless OR synar amendment OR philip morris OR r.j. reynolds OR ("brown and williamson") OR ("brown & williamson") OR bat industries OR liggett group)
U.S. v. Philip Morris E-mail Winnowing Process
20 million email records
200,000 hits based on keyword terms used (1%)
100,000 relevant emails
80,000 produced to opposing party
20,000 placed on privilege logs
A PROBLEM: only a handful entered as exhibits at trial
A BIGGER PROBLEM: the 1% figure does not scale
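The arithmetic behind "does not scale" can be worked through directly. The sketch below reads the slide's figures as successive stages of one funnel and then applies the same 1% hit rate to a larger collection; the one-billion-email collection is a hypothetical figure chosen for illustration.

```python
# Stage counts as read from the slide; the labels are a reconstruction.
stages = [
    ("email records", 20_000_000),
    ("hits on keyword terms", 200_000),
    ("relevant emails", 100_000),
    ("produced to opposing party", 80_000),
    ("placed on privilege logs", 20_000),
]

total = stages[0][1]
# Each stage as a share of the full collection.
shares = {label: count / total for label, count in stages}


def projected_hits(collection_size, hit_rate=0.01):
    """Documents needing human review if the keyword hit rate stays constant."""
    return round(collection_size * hit_rate)
```

Here `shares["hits on keyword terms"]` is 0.01, the slide's 1%. At that same rate, `projected_hits(1_000_000_000)` comes to 10 million documents for manual review, fifty times the 200,000 reviewed in this case.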
Judicial endorsement of predictive analytics in document review by Judge Peck in Da Silva Moore v. Publicis Groupe (S.D.N.Y. Feb. 24, 2012)
“This opinion appears to be the first in which a Court has approved of the use of computer-assisted review. . . . What the Bar should take away from this Opinion is that computer-assisted review is an available tool and should be seriously considered for use in large-data-volume cases where it may save the producing party (or both parties) significant amounts of legal fees in document review. Counsel no longer have to worry about being the ‘first’ or ‘guinea pig’ for judicial acceptance of computer-assisted review . . . Computer-assisted review can now be considered judicially-approved for use in appropriate cases.”
Social Networking/Links Analysis Example
From Marc Smith, posted on Flickr under Creative Commons License
Judicial second guessing of failure to use e-search capabilities: Capitol Records v. MP3 Tunes, 261 F.R.D. 44 (S.D.N.Y. 2009)
“In [a prior case] the Court notes its dismay that the party opposing discovery of its ESI had organized its files in a manner which seemed to serve no purpose other than ‘to discourage audits. . .’ Similarly, in this case, [the party] host[ed] no ediscovery software on their servers and apparently are unable to conduct centralized email searches of groups of users without downloading them to a separate file and relying on the services of an outside vendor.”
Judicial second guessing of failure to use e-search capabilities: Capitol Records v. MP3 Tunes (cont’d)
Court went on to add:
“The day undoubtedly will come when burden arguments based on a large organization’s lack of internal ediscovery software will be received about as well as the contention that a party should be spared from retrieving paper documents because it had filed them sequentially, but in no apparent groupings, in an effort to avoid the added expense of file folders or indices.”
Problem 3: Innovative Thinking
The records management world of tomorrow….
References
Background Law Review Referencing Autocategorization & Advanced Search
J. Baron, “Law in the Age of Exabytes: Some Further Thoughts on ‘Information Inflation’ and Current Issues in E-Discovery Search,” 17 Richmond J. Law & Technology (2011), see http://law.richmond.edu
Latest “Predictive Coding” Case Law to follow in blogs online:
Da Silva Moore v. Publicis Groupe & MSL Group, 11 Civ. 1279 (S.D.N.Y.) (Peck, M.J.) (Opinion dated Feb. 24, 2012)
Kleen Products, LLC v. Packaging Corp. of America, 10 C 5711 (N.D. Ill.) (Nolan, M.J.)
Jason R. Baron
Director of Litigation
Office of General Counsel
National Archives and Records Administration
(301) 837-1499
Email: [email protected]
Dave Lewis, Ph.D.
David D. Lewis Consulting, LLC
Chicago, IL
Email: [email protected]
http://www.DavidDLewis.com