predictive analytics: email management...
TRANSCRIPT
August 18, 2014
Thank you for being here today
Presenters:
Jason R. Baron
Of Counsel, Drinker Biddle & Reath
Avi Elkoni
Consultant, Equivio
Neil Etheridge
VP, Product Management, Recommind Inc.
Mark Olson
National Analytics Manager, Dataskill
Sandra E. Serkes
President & CEO, Valora Technologies Inc.
Why All The Fuss About Big Data?
How to deal? Email Management as a Case in Point
Valora’s Analytics & Data Mining Solutions for Email Management
Predictive Analytics: Email Management Magic?
What is “Big Data” Anyway?
4,079,442,984,960 bytes of data since 1/1/2014. That’s 25 GB per day!
Sandy’s Digital Footprint
EMC Digital Footprint Calculator
Hartford Union HS Digital Footprint Calculator
• Any amount of data that is overwhelming
• Any data whose contents, source or purpose are unknown
– Cannot answer why or whether you should have it
• Data that can harm the organization (liability)
• Data that can help the organization (asset)
Why talk about this now?
• Big Data is the “talk of the town”
• Clients, investors, media, employees, management, government
• Increasing data breach events keep the conversation alive
• People now expect that organizations are routinely collecting and mining data on their behavior, purchases, searches, posts, etc.
• They are starting to demand ethical, compliant & competent management of that data
• Costs to perform large-scale, complex analytics & hosting have come down enough to be financially viable for most organizations
608,087,870 – total number of records containing sensitive personal information involved in security breaches in the United States since January 2005
Source: Privacy Rights Clearinghouse, June 2013
Facebook Tinkers With Users’ Emotions in News Feed Experiment 6/24/2014
Google removes search results in wake of EU privacy ruling 6/26/2014
Enterprise Email Management is a very good Case In Point
Universal Issue
Involves several key IG problems:
• Storage/hosting
• Content analysis & classification
• Context – correspondence, notification & record, date/time/file signatures, transmission & attachments, custodianship, etc.
• Administration, management & maintenance
Elements of Backfile and Day Forward records management
ESI is generally easier & lower cost to tackle than paper files
Because of Context, EEM is a hot button issue with real budgets available
• Investor & media attention
• Customer concerns
• Risk & compliance danger zone
How a computer classifies an email with data mining (analytics)
Implied matter: Passaro (34-6788)
Author
Author Validation & Contact Info
Matter indicator & validation
Doc Type & Implied attachment range
Additional Info Analytics Determines
• What DocType is this?
– An email with an attachment
• Who created this? Who is the author?
– Stuart Trumbull, Partner at DCH
• Who is receiving this? Why?
– Roberta Halstrom, paralegal
– Work instruction/direction
• What is the Author-Recipient relationship?
– Supervisor-subordinate
• What are important words, patterns & concepts?
– “please file”
– “Motion in Limine”
– “Passaro matter”
– “34-6788”
• How is attachment related?
– Author match
– Passaro match
– Key Motion content
• What else is known about this party?
– Wrote 14 emails that day
– 94% of “Passaro” mentions include him as auth/recip/cc
– 7 instances as Pleading Author w/ Passaro matter(s)
– Halstrom & assistant/associate on 48% of Trumbull+Passaro content
• What other context can be inferred?
– Tuesday = 5/13/14
– 15 date-correlated instances of 5/13/14 with Passaro docket
– Tone is neutral-friendly, professionally appropriate
• What presents better visually?
– Topics over time
– Relationship between Trumbull & others
– Passaro matter against other matters
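The questions above (document type, author, recipient, matter indicators, key phrases, attachment relationship) can be sketched as a small metadata extraction step. This is a minimal illustration using Python's standard `email` package, not the presenter's product: the matter-number pattern, the phrase list, the email addresses and the attachment file name are all assumptions made up for the example.

```python
import re
from email.message import EmailMessage
from email.utils import parseaddr

# Assumed pattern for matter numbers shaped like "34-6788" (hypothetical format rule).
MATTER_NUMBER = re.compile(r"\b\d{2}-\d{4}\b")
# Assumed phrases of interest, taken from the slide's example.
KEY_PHRASES = ("please file", "motion in limine")


def extract_context(msg: EmailMessage) -> dict:
    """Pull simple context signals (who, what, attachments) out of one email."""
    body_part = msg.get_body(preferencelist=("plain",))
    body = body_part.get_content() if body_part is not None else ""
    text = f"{msg.get('Subject', '')}\n{body}"
    lowered = text.lower()
    attachments = [part.get_filename() for part in msg.iter_attachments()]
    return {
        "author": parseaddr(msg.get("From", ""))[1],
        "recipient": parseaddr(msg.get("To", ""))[1],
        "doc_type": "email with attachment" if attachments else "email",
        "matter_numbers": sorted(set(MATTER_NUMBER.findall(text))),
        "key_phrases": [p for p in KEY_PHRASES if p in lowered],
        "attachments": attachments,
    }


def build_sample() -> EmailMessage:
    """Build a sample message; addresses and file name are invented placeholders."""
    msg = EmailMessage()
    msg["From"] = "Stuart Trumbull <strumbull@example.com>"
    msg["To"] = "Roberta Halstrom <rhalstrom@example.com>"
    msg["Subject"] = "Passaro matter (34-6788)"
    msg.set_content("Roberta, please file the attached Motion in Limine on Tuesday.")
    msg.add_attachment(
        b"draft text",
        maintype="application",
        subtype="octet-stream",
        filename="motion_in_limine.doc",
    )
    return msg


context = extract_context(build_sample())
```

Signals such as the author-recipient relationship or the share of "Passaro" mentions per person would need a corpus-wide pass on top of per-message records like this one.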
Drawbacks to Classification without Context (Classification alone)
• Treats all document contents the same
• Misses the notion of document context (who, what, where, when and how)
• Email particularly has important attributes as a communication mechanism
• Makes retain/delete decisions on content only
• Oversimplifies inherent or explicit decisioning hierarchies
• Duplicative content vs. “best” content
• Assumes all content instances are equal
• Assumes Backfile/Batch methodology only
• Weak solution for Day Forward document creation or intake
• Does not inform creation or approval of email content
• Focused on cleanup, rather than asset management
• Assumes no future value of past email or content, ignores business value
• Does not prioritize results for future use
• Relegates to IT tool, rather than IG strategy
• Unable to adapt to ongoing maintenance or contextual changes
• Assumes technology is independent of policy creation & enforcement
• Ignores evolving capabilities to define policy and ensure compliance
• Ignores PII, PHI & other content sensitivities
• Assumes all content instances are equal
• Exists outside of data visualization
• Missed opportunity for information presentation, forward-use asset management
Comments indicate actual usage is lower
• “We are in early experiments”
• “Very preliminary exploration”
• “We continue to experiment with predictive coding”
Use of predictive coding
76% reported using/exploring predictive coding or other technologies
Is your company using or exploring “predictive coding” or other technologies for preservation, collection or production of ESI?
Yes
No
The Magic of Predictive coding
• AKA supervised classification or TAR
• Trainable software
• Form of machine learning
• Widely used in industry/academia since 1970s
• Now well established in e-discovery
• Used for ECA, culling, prioritized review, QA
• Court approved
• Useful for sorting through volumes of documents including emails!
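To make "trainable software" concrete, here is a minimal sketch of supervised classification: a reviewer labels a few example emails, and the software learns to label new ones. A tiny naive Bayes classifier is used here purely as a simple stand-in; the slides do not say which algorithm any vendor uses, and the training sentences and the "retain"/"junk" labels are invented for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase, split on whitespace, keep purely alphabetic tokens."""
    return [t for t in text.lower().split() if t.isalpha()]


class NaiveBayes:
    """Multinomial naive Bayes with add-one smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> token counts
        self.doc_counts = Counter()              # label -> number of training docs
        self.vocab = set()

    def train(self, text, label):
        self.doc_counts[label] += 1
        for tok in tokenize(text):
            self.word_counts[label][tok] += 1
            self.vocab.add(tok)

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label in self.doc_counts:
            # Log prior plus smoothed log likelihood of each token.
            score = math.log(self.doc_counts[label] / total_docs)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for tok in tokenize(text):
                score += math.log((self.word_counts[label][tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical reviewer-labeled training examples.
model = NaiveBayes()
model.train("please file the motion before the hearing", "retain")
model.train("signed contract attached for the matter", "retain")
model.train("lunch menu for friday", "junk")
model.train("free tickets for the game tonight", "junk")

label = model.predict("please file the signed motion")
```

In practice the training set would be far larger, and the output would be checked against a reviewed sample before anyone relied on it.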
Predictive coding for IG - scenarios
Disposition scenarios
• Email (and other Records) retention
• Enables implementation of retention schedules
• Efficient / controlled / consistent
• Replaces “Trusted custodian” or “Do nothing”
• Legacy data remediation
• System migration
• Data hygiene
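As a rough sketch of how a predicted category could drive a retention schedule: once each email carries a category, a lookup table gives the retention period and the disposition follows from the dates. The categories, retention periods and the "review" fallback below are hypothetical; a real schedule comes from the organization's records policy.

```python
from datetime import date, timedelta

# Hypothetical retention schedule: category -> retention period in days.
RETENTION_DAYS = {
    "contract": 7 * 365,
    "correspondence": 3 * 365,
    "junk": 0,
}


def disposition(category: str, sent_on: date, today: date) -> str:
    """Return 'retain', 'delete', or 'review' for one classified email."""
    if category not in RETENTION_DAYS:
        # Unknown categories go to a person rather than being deleted.
        return "review"
    expires_on = sent_on + timedelta(days=RETENTION_DAYS[category])
    return "delete" if today >= expires_on else "retain"


decision = disposition("correspondence", date(2010, 1, 1), date(2014, 8, 18))
```

Applying the same rule to every message is what makes the outcome consistent, in contrast to leaving the decision to each custodian.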
Predictive coding for IG - scenarios
Detection scenarios
• Pre-litigation
• E-discovery
• Investigations: regulatory / internal
• “Early warning” systems
Predictive coding for IG - challenges
• Handling low richness
• Training multiple categories
• Quantification of risk
• Federated architecture for centralized IG management
• Add-on approach to optimize use of legacy archiving and RIM systems
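The "low richness" challenge can be illustrated with simple arithmetic, taking richness to mean the fraction of documents that belong to the category of interest. If richness is p, a random sample of n documents contains at least one such document with probability 1 - (1 - p)^n. The sketch below, with hypothetical function names, shows how quickly the required sample grows as richness falls.

```python
import math


def docs_needed(richness: float, confidence: float = 0.95) -> int:
    """Smallest random sample size n with P(at least one relevant doc) >= confidence."""
    if not 0 < richness < 1 or not 0 < confidence < 1:
        raise ValueError("richness and confidence must be strictly between 0 and 1")
    return math.ceil(math.log(1 - confidence) / math.log(1 - richness))


def expected_relevant(sample_size: int, richness: float) -> float:
    """Expected number of relevant documents in a random sample."""
    return sample_size * richness


n_one_percent = docs_needed(0.01)  # richness of 1%
```

At 1% richness, about 300 randomly sampled documents are needed just to be reasonably sure of seeing a single relevant one, which is why training on rare categories is hard with random sampling alone.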
Sedona guidelines say:
Defensibility
The requirement is for
reasonableness, not perfection.
“We have to quantify the
imperfection.”
Corporate management says:
Defensibility
• Transparency: standard / repeatable / auditable
• Validation: standard step in the process
• Quantification
• “Right” statistics for IG environment
• ROI
• Risk retention trade-off
Risk of under-retention vs. Cost & risk of over-retention
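One way to "quantify the imperfection" is to validate against a random sample and report a confidence interval rather than a single percentage. The sketch below computes a Wilson score interval for a success rate; the choice of the Wilson interval and the sample figures (95 correct out of 100 reviewed) are illustrative assumptions, not numbers from the presentation.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95% confidence)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Hypothetical validation sample: 95 of 100 reviewed documents were handled correctly.
low, high = wilson_interval(95, 100)
```

For this hypothetical sample the interval runs from roughly 0.89 to 0.98, so a reported "95%" is better read as "somewhere between 89% and 98%". A larger sample narrows the range, which feeds directly into the under-retention versus over-retention trade-off.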
Case study
Drinker POC
Client: International bank
Categories: 2 retention, 3 junk
Days invested: 3 days
Defensible deletion: 6%
Projected deletion: 45%
Retention success rate: 95%
A HISTORY OF INNOVATION
USE ANALYTICS AND MACHINE LEARNING TO BETTER CONNECT PEOPLE AND INFORMATION
Timeline (2002 to 2014): CONCEPT SEARCH, CATEGORIZATION, PHRASE EXTRACTION, SMART FILTERS, DYNAMIC JOINS, ENTITY EXTRACTION, FOOTER DETECTION, LANGUAGE DETECTION, PREDICTIVE CODING, BEST MATCHES, HYPERGRAPH, EASY UPLOAD
INTELLIGENT EMAIL FILING – APPROACHES
SUGGESTED FILING
o Great for matter-centric filing
o Filing location suggested to the user
o No training = minimal set up time
AUTOMATIC FILING
o Great for topic-centric filing
o 100% adoption - no end users
o System is “trained” then deployed, sampling required
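The suggestion-based approach, in which the system proposes a filing location and the user confirms or edits it, can be sketched as a ranking step with no training phase. This is an illustrative keyword-overlap ranking, not Recommind's method; the matter names and keyword lists are hypothetical.

```python
def suggest_locations(email_text, matter_keywords, top_k=3):
    """Rank candidate filing locations by keyword overlap with the email text."""
    tokens = set(email_text.lower().split())
    scored = []
    for location, keywords in matter_keywords.items():
        overlap = len(tokens & {k.lower() for k in keywords})
        if overlap:
            scored.append((overlap, location))
    # Highest overlap first; ties broken alphabetically for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [location for _, location in scored[:top_k]]


# Hypothetical client-matter folders and their indicative terms.
MATTERS = {
    "1001-0001 Acme lease": ["acme", "lease", "landlord"],
    "1002-0007 Borealis merger": ["borealis", "merger", "diligence"],
    "General correspondence": ["lunch", "newsletter"],
}

suggestions = suggest_locations("Draft lease amendment for Acme landlord review", MATTERS)
```

The user still makes the final call, which is why this style needs little setup; the fully automatic style removes the user and therefore needs training and sampling to verify accuracy.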
CASE STUDY - U.S. DEPARTMENT OF ENERGY
Category Driven Filing Needed
Previous efforts with manual filing had been unsuccessful for applying categories
Automated Process Implemented
Recommind predictive analytics categorize emails with no burden on end users
Higher Accuracy Attained
Recommind exceeded 70% project accuracy target with high 80’s to low 90’s across all categories
CASE STUDY – DAVIES WARD
Client-Matter Classification Needed
The firm needed a way to assign C/M’s to all emails, beyond what DMS could offer
Decisiv Email Implemented
Decisiv suggests likely C/M’s, with user confirmation or editing
High Accuracy Maintained
The combination of Recommind’s analytics and user intervention yields strong results
Presenter:
Mark R. Olson Director, North American Watson Analytics, DataSkill, Inc.
Agenda
• Business Drivers for email management and many solutions
• Email Classification as the key activity and Dataskill’s application for Email Management
• About Dataskill
Summary of typical client conversations about email management
Business Drivers ‘First to scream’ principle
All drivers for implementing an email management solution are ultimately financial, but they can still be classified depending on “who screams first”:
• Technical. IT complains that it is running out of storage and wants to buy more and more ‘disk’. The email system slows down due to the large volume of email data.
• Business. Line-of-business personnel cannot find or access relevant emails easily. They have to rely on personal knowledge, which leads to continuity issues when staff leave.
• Compliance. General Counsel requires legal hold management and defensible disposal capabilities to protect the organization and help make responding to legal enquiries more efficient and cost effective.
Dataskill’s Natural Language Approach
Address the problem, not the symptom
The real problem is that the information required to classify emails is contained within the body of the email, in natural language.
“Mark and Denise are going to Paul and Sue’s for dinner.
Mark and Denise are going to Sue Paul over dinner.”
Normal classification and ILG solutions cannot understand the difference between these two sentences.
Dataskill Acumi for Email Can.
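The two sentences above make a useful test case. A minimal sketch, assuming a simple keyword (bag-of-words) view with a hypothetical stopword list, shows that both sentences reduce to the same set of content words, while even a crude word-order feature tells them apart. This only illustrates why word order matters; it is not a description of how Acumi works.

```python
import re

# Assumed stopword list for this illustration only.
STOPWORDS = {"and", "are", "to", "for", "over"}


def tokens(sentence):
    """Lowercase word tokens with possessive endings stripped."""
    cleaned = sentence.lower().replace("’s", "").replace("'s", "")
    return re.findall(r"[a-z]+", cleaned)


def content_words(sentence):
    """Order-free keyword view: the set of non-stopword tokens."""
    return {t for t in tokens(sentence) if t not in STOPWORDS}


def bigrams(sentence):
    """Order-aware view: the set of adjacent word pairs."""
    toks = tokens(sentence)
    return set(zip(toks, toks[1:]))


social = "Mark and Denise are going to Paul and Sue’s for dinner."
legal = "Mark and Denise are going to Sue Paul over dinner."

same_keywords = content_words(social) == content_words(legal)  # True: identical keyword sets
verb_cue_social = ("to", "sue") in bigrams(social)             # False: "Sue" is a name here
verb_cue_legal = ("to", "sue") in bigrams(legal)               # True: "to sue" reads as a verb
```

A keyword-only classifier sees these as the same message; telling a dinner invitation from a litigation threat requires treating "sue" as a verb in one sentence and a name in the other.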
ACUMI for Email Management
Because Acumi uses Natural Language Processing, Language Dictionaries, and Legal Industry Annotators, it can do a better job of classifying mail and documents.
And if that is not good enough, you can train it to be better.
What it does REALLY WELL