Continuous Active Learning: A Workshop · 2018-08-18
TRANSCRIPT
Caroline Sweeney, Dorsey & Whitney LLP
David Grant, FTI Technology
Continuous Active Learning:
A Workshop
AGENDA
• Introductions
• Quick CAL overview
• Scenarios
• Preparation Checklist
• Open Q&A
TODAY’S SPEAKERS
Caroline Sweeney
Global Director, E-Discovery & Client Technology, Dorsey & Whitney LLP
Caroline Sweeney is responsible for the delivery of Dorsey’s e-
discovery services, including LegalMine Managed Review services,
litigation technology support, and trial technology support. Caroline
is a member of Dorsey’s Electronic Discovery Practice Group and
the Cybersecurity, Privacy and Social Media Practice Group. She
has extensive experience consulting with attorneys and clients with
regard to e-discovery, including identification, preservation,
collection, processing, review and production of electronically
stored information. Her 25+ years of experience in the litigation
support industry include working in both law firm and litigation
support vendor environments. Caroline is actively involved in the
e-discovery community. She is ACEDS certified, currently Co-
President of the Twin Cities ACEDS chapter, and participated in the
development of the litigation support certification test for the
Organization of Legal Professionals (OLP).
TODAY’S SPEAKERS
David Grant
Senior Managing Director, FTI Technology
David Grant is a senior managing director in the FTI Technology
practice and is based in New York. Mr. Grant focuses on discovery
readiness projects involving proactive planning for ongoing
litigation needs, and on planning and managing discovery
strategies for and across major litigations involving large data
volumes, tight deadlines and international data collection. Mr. Grant
has worked on a wide range of matters including multiple Hart-
Scott-Rodino second requests, international cartel investigations,
securities class actions, large-scale commercial litigation and a
number of product liability MDLs. Mr. Grant graduated with First
Class Honours from the University of New South Wales in Sydney,
Australia, with a major in Politics and Philosophy from its Arts
faculty and a major in Law from its Law School.
USES FOR PREDICTIVE MODELS & CAL
• Sometimes recall more relevant docs than keywords, with less review
• Exclude likely irrelevant docs from human review, reducing costs / speeding compliance
• Identify reviewers with high error rates
• Prioritize key or likely relevant docs into review first
• Issue/key coding
PREDICTIVE CODING MODEL BASED WORKFLOW
Provide Seed Set → Train and Refine Predictive Model (iterative: refine → refine → refine) → Validate Model Performance (sampling) → Apply Scores and Codes to Population → Finalize Production Defensibility
A. Mapper QC – visual coding verification
B. Mines QC – assess completeness of production
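The model-based workflow above hinges on a supervised classifier trained on the seed set and then applied to the whole population. The following is a minimal sketch of that idea, not any vendor's implementation; it assumes scikit-learn is available, and the variable names (seed_texts, seed_labels, population_texts) and sample documents are hypothetical.

```python
# Minimal sketch of a model-based predictive coding step (illustrative only):
# train a classifier on an attorney-coded seed set, then score the population.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical inputs: seed_texts/seed_labels are the attorney-coded seed set
# (1 = responsive, 0 = non-responsive); population_texts is the wider collection.
seed_texts = ["memo about improper payment", "fantasy football draft picks"]
seed_labels = [1, 0]
population_texts = ["payment reporting question", "week 1 matchup preview"]

vectorizer = TfidfVectorizer(stop_words="english")
X_seed = vectorizer.fit_transform(seed_texts)

model = LogisticRegression(max_iter=1000)
model.fit(X_seed, seed_labels)

# Score every document in the population; higher scores are reviewed (or produced) first.
scores = model.predict_proba(vectorizer.transform(population_texts))[:, 1]
for doc, score in sorted(zip(population_texts, scores), key=lambda p: -p[1]):
    print(f"{score:.2f}  {doc}")
```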
PREDICTIVE CODING CONTINUAL WORKFLOW
Predictive Coding Continual Workflow for culling (“Continuous Active Learning” or CAL)
Provide Seed Sample (or Seed Set) → Measure Achieved Recall Compared to Sample Prevalence, Stop Reviewing If Done (sampling) → Finalize Production Defensibility
A. Mapper QC – visual coding verification
B. Mines QC – assess completeness of production
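The stopping decision in the continual workflow compares the responsive documents found so far against the total implied by the control sample's prevalence. A back-of-the-envelope sketch with assumed figures (none of these numbers come from the slides):

```python
# Illustrative stopping check for a CAL review (assumed numbers).
population_size = 1_000_000      # documents subject to the process
sample_size = 2_000              # random control sample reviewed up front
sample_responsive = 300          # responsive documents found in that sample

prevalence = sample_responsive / sample_size                 # ~15% responsive
estimated_total_responsive = prevalence * population_size    # ~150,000 responsive docs expected

responsive_found_in_review = 128_000   # responsive docs coded so far in the prioritized review
achieved_recall = responsive_found_in_review / estimated_total_responsive

target_recall = 0.80
print(f"Estimated achieved recall: {achieved_recall:.1%}")
if achieved_recall >= target_recall:
    print("Target met - candidate stopping point (validate before stopping).")
else:
    print("Keep reviewing.")
```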
DIFFERENT OPTIONS: PROS / CONS

Keywords
PROS:
― Known quantity
― Easy to explain
― No sharing issues re NR docs
CONS:
― Usually more docs, so more costly and time consuming
― Takes time to negotiate
― Doesn’t prioritize useful docs
― Typically lower recall

Model Based PC
PROS:
― Typically fewer docs
― Production without review possible if desired
― Predictable number of docs need review, so can easily manage to deadline
― Standard approach in some circumstances (2nd requests)
CONS:
― Needs upfront model training
― Slightly more complex
― Validation sample required at end
― Training / measurement sample NR doc sharing issues

Continual Training PC
PROS:
― Typically fewer docs
― Just start review – no upfront training
― No training set to be produced / haggled over
― Decide when to stop later in the process
― More similar to standard review-all or keyword approaches (large review team upfront)
CONS:
― Don’t know how many docs need review to meet target – harder to manage
― Measurement sample might be haggled over (NR doc sharing)

Note other options:
― Prioritize any review set using Continual Training for easier management
― Keywords followed by PC (but note lower recall caused by ‘culling twice’ – see the sketch below)
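To see why keywords followed by PC ("culling twice") tends to lower recall, note that the losses multiply; a quick illustration with assumed recall figures:

```python
# Assumed figures for illustration only.
keyword_recall = 0.75   # share of responsive docs the negotiated keywords capture
pc_recall = 0.85        # recall of the PC step, measured against the keyword-culled set

overall_recall = keyword_recall * pc_recall
print(f"Overall recall against the full population: {overall_recall:.0%}")  # ~64%
```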
Option A: Test Predictive Coding Model Based Workflow for culling (“Simple Active/Passive Learning” or SAL / SPL)
Document Collection → Attorney Expert Review of Training Seed Set → Train and Refine Predictive Model (iterative: refine → refine → refine) → Apply Scores and Codes to Population
Comparison Sample results: Responsive 300, Non-Responsive 700 (Responsive ≈ 30%); Remaining Set ≈ 1 million documents
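In Option A, the comparison sample (300 responsive of 1,000 reviewed) yields a prevalence estimate for the remaining ~1 million documents. A minimal sketch of that estimate with a textbook normal-approximation margin of error; the 95% z-value is standard statistics and the projection is illustrative, not from any specific tool:

```python
import math

# Figures from the Option A example: 1,000-document comparison sample, 300 coded responsive.
sample_size = 1_000
responsive_in_sample = 300

prevalence = responsive_in_sample / sample_size              # 0.30
z = 1.96                                                     # 95% confidence
margin = z * math.sqrt(prevalence * (1 - prevalence) / sample_size)

print(f"Estimated prevalence: {prevalence:.1%} +/- {margin:.1%}")
# Against ~1 million remaining documents, that implies roughly 272,000-328,000 responsive docs.
```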
Option B: Test keywords as a culling tool; prove term burden (imprecision); refine keywords
― Run the modified/new search term that you would like to test against documents clustered as Relevant / Not-Relevant
― If a search term retrieves too many non-relevant documents, try to refine the search term
― If a search term only retrieves part of the relevant documents in a cluster, analyze the documents in that cluster that the search term did not retrieve
― Identify new concepts without actual review of documents (find missing search terms)
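The term-burden test in Option B amounts to measuring, for each candidate search term, how many documents it retrieves and what share of a reviewed sample of those hits is actually relevant. A small sketch with hypothetical documents and terms:

```python
# Hypothetical documents with review decisions (True = relevant), for illustration only.
reviewed_docs = [
    ("payment made to outside attorney", True),
    ("flight delay at the airport", False),
    ("improper payment reporting question", True),
    ("fantasy football draft payment pool", False),
]

terms = ["payment", "attorney", "flight"]

for term in terms:
    hits = [(text, relevant) for text, relevant in reviewed_docs if term in text]
    relevant_hits = sum(1 for _, relevant in hits if relevant)
    precision = relevant_hits / len(hits) if hits else 0.0
    # Low-precision, high-volume terms are refinement candidates (add qualifiers or
    # proximity limits); relevant docs a term misses point to missing search terms.
    print(f"{term!r}: {len(hits)} hits, precision {precision:.0%}")
```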
DISCUSSION SCENARIOS: PREDICTIVE MODELS & CAL
• HSR 2nd requests
• Investigations
• Symmetric litigation
• Asymmetric litigation
DISCUSSION SCENARIOS: PREDICTIVE MODELS & CAL
• Quality Control
• Inbound Production Review
OTHER FORMS OF AI TO COMBINE WITH CAL:
SUPERVISED AND UNSUPERVISED MACHINE LEARNING
Supervised Machine Learning: reproduces known knowledge through an algorithm learning to make decisions the way humans do
Unsupervised Machine Learning (data mining): summarizes the content of document sets and groups them into sets of trend similarity so humans can discover previously unknown knowledge
In eDiscovery, for example, content grouping tells you what you have and helps identify key facts (the goal of discovery), while standard predictive coding automatically reproduces attorney decisions against a large population to identify documents that must be produced (production is a means to reach the goal of discovery).
Unsupervised Example – Content Grouping and Summarization (“Find Facts Fast”)
Each cluster provides information about the clustered docs (Concepts, Density of Concepts, Most Representative Documents)
Example cluster concepts:
• request, investigation, improper, attorney, payment, reporting
• layover, flight, delay, airport, boarding, ticket
• fantasy, draft, football, quarterback, suspension, week 1, matchup
‘Zoom’ to document-level grouping and summary by content / index data (Doc Mapper)
Legend: dot = one document; clusters = groups of similar documents, labeled with cluster concepts (e.g. request, investigation, improper, attorney, payment, reporting); spine = group of clusters that share a similar concept
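Content grouping of this kind can be approximated with standard unsupervised tools. The sketch below clusters a few toy documents (echoing the concept groups in the figure) with k-means over TF-IDF vectors and prints each cluster's top terms; scikit-learn is assumed and the texts are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Toy documents echoing the three concept groups in the example figure.
docs = [
    "attorney investigation into improper payment reporting",
    "request for investigation of improper attorney payment",
    "flight delay and layover at the airport before boarding",
    "boarding ticket reissued after the flight delay",
    "fantasy football draft and week 1 quarterback matchup",
    "quarterback suspension shakes up fantasy football week 1",
]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
terms = vectorizer.get_feature_names_out()

# Print the highest-weighted terms per cluster centroid (the "cluster concepts").
for i, centroid in enumerate(km.cluster_centers_):
    top = centroid.argsort()[::-1][:4]
    print(f"Cluster {i}: {', '.join(terms[j] for j in top)}")
```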
COMPLICATIONS
• Work product / transparency
• Training process and review requirements
• Population requirements
• Measuring success
• Timing of issues prioritized
CHECKLIST
Overall goals
• Review a small set of training documents and a small random sample to test whether predictive coding can be an effective solution
• Is predictive coding being used strictly for prioritization purposes, or also for document culling?
• If keywords are being used to pre-cull the set before it is subjected to predictive coding, confirm whether the full population should be sampled first to understand keyword recall, and explain that the recall estimated by the PC process is only against the keyword-culled set (see the sketch below)
• Discuss needs related to disclosures to ensure that the proper metrics are recorded throughout the process and their importance in measuring success. Assess the model’s performance before negotiating the target recall or precision with other parties
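Estimating keyword recall requires a random sample drawn from the full population before keyword culling: of the responsive documents found in that sample, how many do the negotiated keywords actually retrieve? A minimal sketch with assumed sample counts:

```python
# Assumed results from reviewing a random sample drawn from the FULL (pre-keyword) population.
sample_responsive = 120                # responsive documents found in the sample
responsive_with_keyword_hits = 90      # of those, how many the negotiated keywords retrieve

keyword_recall = responsive_with_keyword_hits / sample_responsive
print(f"Estimated keyword recall against the full population: {keyword_recall:.0%}")  # 75%
```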
CHECKLIST (CONTINUED)
Population creation
• Any date or file type culling to be applied before a population is defined; for example, database, graphic or multimedia file types may need to be excluded from the population since the model cannot analyze or learn from these document types
• Similarly, confirm how to treat unresolved exceptions (e.g. encrypted docs)
• This can have a significant impact on overall prevalence and thus on sample and training document sizes
• Understand the composition of each population if documents are being processed on a rolling basis, and confirm what is to be subjected to the PC process
• How to manage foreign language documents in the population
• Due to the language-agnostic nature of the process, determine the client’s comfort level with using a single model that includes all languages
• Or create separate English and FL models with distinct populations for FL docs
CHECKLIST (CONTINUED)
Determine the sample size
• The size of a comparison/validation sample drawn from a population is determined by the prevalence of positive documents and the desired confidence level / margin of error (see the sketch below)
• The desired recall level will also factor into calculations associated with sample sizing
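A common way to turn a confidence level, margin of error and expected prevalence into a sample size is the standard proportion formula; the sketch below applies it with assumed inputs (use p = 0.5 for the worst case when prevalence is unknown):

```python
import math

confidence_z = 1.96          # 95% confidence
margin_of_error = 0.02       # +/- 2%
expected_prevalence = 0.15   # rough prior estimate of the responsive rate (assumed)

# n = z^2 * p * (1 - p) / e^2
n = math.ceil(confidence_z**2 * expected_prevalence * (1 - expected_prevalence)
              / margin_of_error**2)
print(f"Sample size needed: {n} documents")   # ~1,225
```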
Sample and training document review
• Identify issues that define positive/negative document sets for model building
• Identify the teachers: attorneys/associates or selected experienced contract reviewers after a few days of
active 1L review
CHECKLIST (CONTINUED)
Model building and prediction
• Iteratively refine the model and assess performance based on a representative random sample
• Documents predicted with high scores can be prioritized for review during the refinement process
• If a model version projects the desired level of recall and precision (or there is no significant improvement over a few consecutive rounds of training review – see the sketch below), discuss the results and determine whether the model can be used to predict the population
• Discuss if and how the documents that are predicted positive by the model (and their families) should be reviewed, i.e. linear review or analytical tools
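The "no significant improvement over a few consecutive rounds" stopping rule can be expressed as a simple check on recall measured against the control sample each round. The sketch below is hypothetical; the per-round numbers, improvement threshold and patience value are assumptions:

```python
# Hypothetical recall measured against the control sample after each training round.
recall_by_round = [0.52, 0.66, 0.74, 0.78, 0.785, 0.79]

target_recall = 0.80
min_improvement = 0.01    # "significant" improvement threshold (assumed)
patience = 2              # rounds without improvement before stopping (assumed)

stalled = 0
for round_no, recall in enumerate(recall_by_round, start=1):
    improved = round_no == 1 or recall - recall_by_round[round_no - 2] >= min_improvement
    stalled = 0 if improved else stalled + 1
    print(f"Round {round_no}: recall {recall:.1%}")
    if recall >= target_recall or stalled >= patience:
        print("Stop training - discuss results and decide whether to predict the population.")
        break
```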
OPEN Q&A