Continuous Active Learning: A Workshop · 2018-08-18
TRANSCRIPT
Caroline Sweeney, Dorsey & Whitney LLP
David Grant, FTI Technology
Continuous Active Learning:
A Workshop
AGENDA
• Introductions
• Quick CAL overview
• Scenarios
• Preparation Checklist
• Open Q&A
TODAY’S SPEAKERS
Caroline Sweeney
Global Director, E-Discovery & Client Technology, Dorsey & Whitney LLP
Caroline Sweeney is responsible for the delivery of Dorsey’s e-
discovery services, including LegalMine Managed Review services,
litigation technology support, and trial technology support. Caroline
is a member of Dorsey’s Electronic Discovery Practice Group and
the Cybersecurity, Privacy and Social Media Practice Group. She
has extensive experience consulting with attorneys and clients with
regard to e-discovery, including identification, preservation,
collection, processing, review and production of electronically
stored information. Her 25+ years of experience in the litigation
support industry include working in both law firm and litigation
support vendor environments. Caroline is actively involved in the
e-discovery community. She is ACEDS certified, currently Co-
President of the Twin Cities ACEDS chapter, and participated in the
development of the litigation support certification test for the
Organization of Legal Professionals (OLP).
TODAY’S SPEAKERS
David Grant
Senior Managing Director, FTI Technology
David Grant is a senior managing director in the FTI Technology
practice and is based in New York. Mr. Grant focuses on discovery
readiness projects involving proactive planning for ongoing
litigation needs, and on planning and managing discovery
strategies for and across major litigations involving large data
volumes, tight deadlines and international data collection. Mr. Grant
has worked on a wide range of matters including multiple Hart-
Scott-Rodino second requests, international cartel investigations,
securities class actions, large-scale commercial litigation and a
number of product liability MDLs. Mr. Grant graduated with First
Class Honours from the University of New South Wales in Sydney,
Australia, with a major in Politics and Philosophy from its Arts
faculty and a major in Law from its Law School.
USES FOR PREDICTIVE MODELS & CAL
• Sometimes recall more relevant docs than keywords, with less review
• Exclude likely irrelevant docs from human review, reducing costs / speeding compliance
• Identify reviewers with high error rates
• Prioritize key or likely relevant docs into review first
• Issue/key coding
PREDICTIVE CODING MODEL BASED WORKFLOW
Provide Seed Set → Train and Refine Predictive Model (iterative: refine → refine → refine) → Validate Model Performance (sampling) → Apply Scores and Codes to Population → Finalize Production Defensibility
A. Mapper QC – visual coding verification
B. Mines QC – assess completeness of production
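The model-based workflow above hinges on a supervised classifier trained on the seed set and then applied to the whole population. The following is a minimal sketch of that idea, not any vendor's implementation; it assumes scikit-learn is available, and the variable names (seed_texts, seed_labels, population_texts) and sample documents are hypothetical.

```python
# Minimal sketch of a model-based predictive coding step (illustrative only):
# train a classifier on an attorney-coded seed set, then score the population.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical inputs: seed_texts/seed_labels are the attorney-coded seed set
# (1 = responsive, 0 = non-responsive); population_texts is the wider collection.
seed_texts = ["memo about improper payment", "fantasy football draft picks"]
seed_labels = [1, 0]
population_texts = ["payment reporting question", "week 1 matchup preview"]

vectorizer = TfidfVectorizer(stop_words="english")
X_seed = vectorizer.fit_transform(seed_texts)

model = LogisticRegression(max_iter=1000)
model.fit(X_seed, seed_labels)

# Score every document in the population; higher scores are reviewed (or produced) first.
scores = model.predict_proba(vectorizer.transform(population_texts))[:, 1]
for doc, score in sorted(zip(population_texts, scores), key=lambda p: -p[1]):
    print(f"{score:.2f}  {doc}")
```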
PREDICTIVE CODING CONTINUAL WORKFLOW
Predictive Coding Continual Workflow for culling (“Continuous Active Learning” or CAL)
Provide Seed Sample (or Seed Set) → Measure Achieved Recall Compared to Sample Prevalence, Stop Reviewing If Done (sampling) → Finalize Production Defensibility
A. Mapper QC – visual coding verification
B. Mines QC – assess completeness of production
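The stopping decision in the continual workflow compares the responsive documents found so far against the total implied by the control sample's prevalence. A back-of-the-envelope sketch with assumed figures (none of these numbers come from the slides):

```python
# Illustrative stopping check for a CAL review (assumed numbers).
population_size = 1_000_000      # documents subject to the process
sample_size = 2_000              # random control sample reviewed up front
sample_responsive = 300          # responsive documents found in that sample

prevalence = sample_responsive / sample_size                 # ~15% responsive
estimated_total_responsive = prevalence * population_size    # ~150,000 responsive docs expected

responsive_found_in_review = 128_000   # responsive docs coded so far in the prioritized review
achieved_recall = responsive_found_in_review / estimated_total_responsive

target_recall = 0.80
print(f"Estimated achieved recall: {achieved_recall:.1%}")
if achieved_recall >= target_recall:
    print("Target met - candidate stopping point (validate before stopping).")
else:
    print("Keep reviewing.")
```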
DIFFERENT OPTIONS: PROS / CONS

Keywords
PROS:
― Known quantity
― Easy to explain
― No sharing issues re NR docs
CONS:
― Usually more docs, so more costly and time consuming
― Takes time to negotiate
― Doesn’t prioritize useful docs
― Typically lower recall

Model Based PC
PROS:
― Typically fewer docs
― Production without review possible if desired
― Predictable number of docs need review, so can easily manage to deadline
― Standard approach in some circumstances (2nd requests)
CONS:
― Needs upfront model training
― Slightly more complex
― Validation sample required at end
― Training / measurement sample NR doc sharing issues

Continual Training PC
PROS:
― Typically fewer docs
― Just start review – no upfront training
― No training set to be produced / haggled over
― Decide when to stop later in the process
― More similar to standard review-all or keyword approaches (large review team upfront)
CONS:
― Don’t know how many docs need review to meet target – harder to manage
― Measurement sample might be haggled over (NR doc sharing)

Note other options:
― Prioritize any review set using Continual Training for easier management
― Keywords followed by PC (but note lower recall caused by ‘culling twice’ – see the sketch below)
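To see why keywords followed by PC ("culling twice") tends to lower recall, note that the losses multiply; a quick illustration with assumed recall figures:

```python
# Assumed figures for illustration only.
keyword_recall = 0.75   # share of responsive docs the negotiated keywords capture
pc_recall = 0.85        # recall of the PC step, measured against the keyword-culled set

overall_recall = keyword_recall * pc_recall
print(f"Overall recall against the full population: {overall_recall:.0%}")  # ~64%
```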
Option A: Test Predictive Coding Model Based Workflow for culling (“Simple Active/Passive Learning” or SAL / SPL)
Document Collection → Attorney Expert Review of Training Seed Set → Train and Refine Predictive Model (iterative: refine → refine → refine) → Apply Scores and Codes to Population
Comparison Sample results: Responsive 300, Non-Responsive 700 (Responsive ≈ 30%); Remaining Set ≈ 1 million documents
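In Option A, the comparison sample (300 responsive of 1,000 reviewed) yields a prevalence estimate for the remaining ~1 million documents. A minimal sketch of that estimate with a textbook normal-approximation margin of error; the 95% z-value is standard statistics and the projection is illustrative, not from any specific tool:

```python
import math

# Figures from the Option A example: 1,000-document comparison sample, 300 coded responsive.
sample_size = 1_000
responsive_in_sample = 300

prevalence = responsive_in_sample / sample_size              # 0.30
z = 1.96                                                     # 95% confidence
margin = z * math.sqrt(prevalence * (1 - prevalence) / sample_size)

print(f"Estimated prevalence: {prevalence:.1%} +/- {margin:.1%}")
# Against ~1 million remaining documents, that implies roughly 272,000-328,000 responsive docs.
```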
Option B: Test keywords as a culling tool; prove term burden (imprecision); refine keywords
― Run the modified/new search term that you would like to test against documents clustered as Relevant / Not-Relevant
― If a search term retrieves too many non-relevant documents, try to refine the search term
― If a search term only retrieves part of the relevant documents in a cluster, analyze the documents in that cluster that the search term did not retrieve
― Identify new concepts without actual review of documents (find missing search terms)
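The term-burden test in Option B amounts to measuring, for each candidate search term, how many documents it retrieves and what share of a reviewed sample of those hits is actually relevant. A small sketch with hypothetical documents and terms:

```python
# Hypothetical documents with review decisions (True = relevant), for illustration only.
reviewed_docs = [
    ("payment made to outside attorney", True),
    ("flight delay at the airport", False),
    ("improper payment reporting question", True),
    ("fantasy football draft payment pool", False),
]

terms = ["payment", "attorney", "flight"]

for term in terms:
    hits = [(text, relevant) for text, relevant in reviewed_docs if term in text]
    relevant_hits = sum(1 for _, relevant in hits if relevant)
    precision = relevant_hits / len(hits) if hits else 0.0
    # Low-precision, high-volume terms are refinement candidates (add qualifiers or
    # proximity limits); relevant docs a term misses point to missing search terms.
    print(f"{term!r}: {len(hits)} hits, precision {precision:.0%}")
```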
DISCUSSION SCENARIOS: PREDICTIVE MODELS & CAL
• HSR 2nd requests
• Investigations
• Symmetric litigation
• Asymmetric litigation
DISCUSSION SCENARIOS: PREDICTIVE MODELS & CAL
• Quality Control
• Inbound Production Review
OTHER FORMS OF AI TO COMBINE WITH CAL:
SUPERVISED AND UNSUPERVISED MACHINE LEARNING
Supervised Machine Learning: reproduces known knowledge through an algorithm learning to make decisions the way humans do
Unsupervised Machine Learning (data mining): summarizes the content of document sets and groups them into sets of trend similarity so humans can discover previously unknown knowledge
In eDiscovery, for example, content grouping tells you what you have and helps identify key facts (the goal of discovery), while standard predictive coding automatically reproduces attorney decisions against a large population to identify documents that must be produced (production is a means to reach the goal of discovery).
Unsupervised Example – Content Grouping and Summarization (“Find Facts Fast”)
Each cluster provides information about the clustered docs (Concepts, Density of Concepts, Most Representative Documents)
Example cluster concepts:
• request, investigation, improper, attorney, payment, reporting
• layover, flight, delay, airport, boarding, ticket
• fantasy, draft, football, quarterback, suspension, week 1, matchup
‘Zoom’ to document-level grouping and summary by content / index data (Doc Mapper)
Legend: dot = one document; clusters = groups of similar documents, labeled with cluster concepts (e.g. request, investigation, improper, attorney, payment, reporting); spine = group of clusters that share a similar concept
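Content grouping of this kind can be approximated with standard unsupervised tools. The sketch below clusters a few toy documents (echoing the concept groups in the figure) with k-means over TF-IDF vectors and prints each cluster's top terms; scikit-learn is assumed and the texts are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Toy documents echoing the three concept groups in the example figure.
docs = [
    "attorney investigation into improper payment reporting",
    "request for investigation of improper attorney payment",
    "flight delay and layover at the airport before boarding",
    "boarding ticket reissued after the flight delay",
    "fantasy football draft and week 1 quarterback matchup",
    "quarterback suspension shakes up fantasy football week 1",
]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
terms = vectorizer.get_feature_names_out()

# Print the highest-weighted terms per cluster centroid (the "cluster concepts").
for i, centroid in enumerate(km.cluster_centers_):
    top = centroid.argsort()[::-1][:4]
    print(f"Cluster {i}: {', '.join(terms[j] for j in top)}")
```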
COMPLICATIONS
• Work product / transparency
• Training process and review requirements
• Population requirements
• Measuring success
• Timing of issues prioritized
CHECKLIST
Overall goals
• Review a small set of training documents and a small random sample to test whether predictive coding can be an effective solution
• Is predictive coding being used strictly for prioritization purposes, or also for document culling?
• If keywords are being used to pre-cull the set before it is subjected to predictive coding, confirm whether the full population should be sampled first to understand keyword recall, and explain that the recall estimated by the PC process is only against the keyword-culled set (see the sketch below)
• Discuss needs related to disclosures to ensure that the proper metrics are recorded throughout the process and their importance in measuring success. Assess the model’s performance before negotiating the target recall or precision with other parties
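Estimating keyword recall requires a random sample drawn from the full population before keyword culling: of the responsive documents found in that sample, how many do the negotiated keywords actually retrieve? A minimal sketch with assumed sample counts:

```python
# Assumed results from reviewing a random sample drawn from the FULL (pre-keyword) population.
sample_responsive = 120                # responsive documents found in the sample
responsive_with_keyword_hits = 90      # of those, how many the negotiated keywords retrieve

keyword_recall = responsive_with_keyword_hits / sample_responsive
print(f"Estimated keyword recall against the full population: {keyword_recall:.0%}")  # 75%
```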
CHECKLIST (CONTINUED)
Population creation
• Any date or file type culling to be applied before a population is defined; for example, database, graphic or multimedia file types may need to be excluded from the population since the model cannot analyze or learn from these document types
• Similarly, confirm how to treat unresolved exceptions (e.g. encrypted docs)
• This can have a significant impact on overall prevalence and thus on sample and training document sizes
• Understand the composition of each population if documents are being processed on a rolling basis, and confirm what is to be subjected to the PC process
• How to manage foreign language documents in the population
• Due to the language-agnostic nature of the process, determine the client’s comfort level with using a single model that includes all languages
• Or create separate English and FL models with distinct populations for FL docs
CHECKLIST (CONTINUED)
Determine the sample size
• The size of a comparison/validation sample drawn from a population is determined by the prevalence of positive documents and the desired confidence level / margin of error (see the sketch below)
• The desired recall level will also factor into calculations associated with sample sizing
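A common way to turn a confidence level, margin of error and expected prevalence into a sample size is the standard proportion formula; the sketch below applies it with assumed inputs (use p = 0.5 for the worst case when prevalence is unknown):

```python
import math

confidence_z = 1.96          # 95% confidence
margin_of_error = 0.02       # +/- 2%
expected_prevalence = 0.15   # rough prior estimate of the responsive rate (assumed)

# n = z^2 * p * (1 - p) / e^2
n = math.ceil(confidence_z**2 * expected_prevalence * (1 - expected_prevalence)
              / margin_of_error**2)
print(f"Sample size needed: {n} documents")   # ~1,225
```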
Sample and training document review
• Identify issues that define positive/negative document sets for model building
• Identify the teachers: attorneys/associates or selected experienced contract reviewers after a few days of
active 1L review
CHECKLIST (CONTINUED)
Model building and prediction
• Iteratively refine the model and assess performance based on a representative random sample
• Documents predicted with high scores can be prioritized for review during the refinement process
• If a model version projects the desired level of recall and precision (or there is no significant improvement over a few consecutive rounds of training review – see the sketch below), discuss the results and determine whether the model can be used to predict the population
• Discuss if and how the documents that are predicted positive by the model (and their families) should be reviewed, i.e. linear review or analytical tools
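The "no significant improvement over a few consecutive rounds" stopping rule can be expressed as a simple check on recall measured against the control sample each round. The sketch below is hypothetical; the per-round numbers, improvement threshold and patience value are assumptions:

```python
# Hypothetical recall measured against the control sample after each training round.
recall_by_round = [0.52, 0.66, 0.74, 0.78, 0.785, 0.79]

target_recall = 0.80
min_improvement = 0.01    # "significant" improvement threshold (assumed)
patience = 2              # rounds without improvement before stopping (assumed)

stalled = 0
for round_no, recall in enumerate(recall_by_round, start=1):
    improved = round_no == 1 or recall - recall_by_round[round_no - 2] >= min_improvement
    stalled = 0 if improved else stalled + 1
    print(f"Round {round_no}: recall {recall:.1%}")
    if recall >= target_recall or stalled >= patience:
        print("Stop training - discuss results and decide whether to predict the population.")
        break
```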
OPEN Q&A