ii-sdv 2015, 20 - 21 april, in nice

Text Mining: It's About Time

Andrew Hinton & David Milward

II-SDV, Nice, France, 20th April 2015

Overview: Text Mining, It's About Time!

Introducing text mining

Why text mining has now come of age

Speed to insight

Real time data & timeliness of data

Temporal nature of data

Future challenges

© Linguamatics 2015

What is Text Mining?

"Text Mining is the discovery by computer of new, previously unknown information, by automatically extracting and relating information from different written resources, to reveal otherwise 'hidden' meanings. A key element is the linking together of the extracted information to form new facts or new hypotheses to be explored further by more conventional means of experimentation."

Marti Hearst, UC Berkeley


How is it being used in Life Sciences?

Advanced text analytics delivers value along the pipeline


Gene-disease mapping

Target ID/selection

Mutation/expression analysis

Toxicity analysis and prediction

Biomarker discovery

Drug repurposing

Patent analysis

KOL identification

Opportunity scouting

Trial site selection and study design

Safety

Competitive intelligence

Pharmacovigilance

Social media analysis

Comparative Effectiveness

Regulatory Submission QC

HEOR

SAR

Challenges

Most of the information required is in free text

Ever-increasing amounts of text data to examine


[Chart: number of PubMed records, axis from 0 to 25,000,000]

− Different kinds of documents: external literature, patents, EHRs, internal reports, blogs, presentations

− Different formats: HTML, PDF, XML, Word, PPT, Wiki

Challenges in Unstructured Data


Different word, same meaning

cyclosporine

ciclosporin

Neoral

Sandimmune

Different expression, same meaning

Non-smoker

Does not smoke

Does not drink or smoke

Denies tobacco use

Different grammar, same meaning

5mg/kg of cyclosporine per day

5mg/kg per day of cyclosporine

cyclosporine 5mg/kg per day

Same word, different context

Diagnosed with diabetes

Family history of diabetes

No family history of diabetes

NLP
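
The four cases above can be illustrated with a minimal Python sketch. It assumes a tiny hand-written synonym table and a few regular-expression cues; the names `SYNONYMS`, `normalize_drug`, `classify_mention` and `is_non_smoker` are hypothetical and are not taken from any product. Production NLP relies on ontologies and linguistic analysis rather than short pattern lists like these.

```python
import re

# Hypothetical synonym table: every surface form maps to one canonical name.
SYNONYMS = {
    "cyclosporine": "cyclosporine",
    "ciclosporin": "cyclosporine",
    "neoral": "cyclosporine",
    "sandimmune": "cyclosporine",
}

def normalize_drug(term):
    """Return the canonical drug name, or None if the term is unknown."""
    return SYNONYMS.get(term.strip().lower())

# Illustrative cues only. Order matters: the negated family-history
# pattern must be tested before the plain family-history pattern.
CONTEXT_PATTERNS = [
    ("no_family_history", re.compile(r"\bno family history of\b", re.I)),
    ("family_history", re.compile(r"\bfamily history of\b", re.I)),
    ("patient", re.compile(r"\bdiagnosed with\b", re.I)),
]

def classify_mention(sentence):
    """Label the context in which a condition is mentioned."""
    for label, pattern in CONTEXT_PATTERNS:
        if pattern.search(sentence):
            return label
    return "unknown"

# Different expressions that all mean "non-smoker" (from the slide).
NON_SMOKER = re.compile(
    r"non-smoker|does not (drink or )?smoke|denies tobacco use", re.I
)

def is_non_smoker(sentence):
    """True if the sentence matches one of the non-smoker phrasings."""
    return NON_SMOKER.search(sentence) is not None
```

For example, `normalize_drug("Neoral")` and `normalize_drug("ciclosporin")` both resolve to the same concept, while `classify_mention` separates a diagnosis from a (negated) family history.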

From Words to Meaning


“Among them, nimesulide, a selective COX2 inhibitor, …”

Entrez Gene ID: 5743

inhibits


Identifying entities and relations

Linguistics to establish relationships

Finding Indirect Relationships

Treatment has been applied in clinical trials


Thalidomide in advanced hepatocellular carcinoma as antiangiogenic treatment approach: a phase I/II trial. Pinter et al. Eur J Gastroenterol Hepatol. 2008 20(10):1012-9

Phase II study of temozolomide, thalidomide, and celecoxib for newly diagnosed glioblastoma in adults. Kesari et al. Neuro Oncol. 2008 10(3):300-8
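
Indirect relationships of the kind shown on this slide (a drug related to a gene, and that gene related to an angiogenic process) can be found by joining two sets of extracted relations on the shared gene. The sketch below is one simple way to do that join in Python; the function name `indirect_links`, the gene placeholders `GENE_A` and `GENE_B`, and the relation labels are all invented for illustration and are not extracted facts.

```python
from collections import defaultdict

def indirect_links(drug_gene, gene_process):
    """Join (drug, relation, gene) with (gene, relation, process) on the gene."""
    by_gene = defaultdict(list)
    for gene, relation, process in gene_process:
        by_gene[gene].append((relation, process))

    chains = []
    for drug, rel1, gene in drug_gene:
        for rel2, process in by_gene.get(gene, []):
            chains.append((drug, rel1, gene, rel2, process))
    return chains

# Placeholder triples: the gene names and relation labels are hypothetical.
drug_gene = [
    ("thalidomide", "RELATION_1", "GENE_A"),
    ("thalidomide", "RELATION_1", "GENE_B"),
]
gene_process = [("GENE_A", "RELATION_2", "angiogenic process")]

chains = indirect_links(drug_gene, gene_process)
```

Only `GENE_A` appears in both sets, so only that chain links the drug to the process; `GENE_B` is dropped because nothing connects it onward.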

<Thalidomide>-<Relationship>-<Gene> <Gene>-<Relationship>-<Angiogenic Process>

Modes of Use


Reusable pipelines

• Decision support

• Knowledge capture

• Classification/mark-up

• Capture and re-use strategies

• Semantic categorization


Speed to Insight


Time = Money


Speed to insight Example I: Patents


Business Impact and value


Leveraging Text Analytics in Patents to Empower Business Decisions


Not for profit

Education

Research Biotech

Pharma

Medical devices

ICT

Funders

Approvers

Government

Patient

Payers

Prescribers

Providers

Dispensers

Electronic Health Record

EHRs & Healthcare Challenges


The challenge is to unlock the value of the huge investment being made in EHRs

“Natural language processing (NLP) and visualization dashboards are the technologies most suitable to improve EHR usability. NLP can produce readable summaries of unstructured text, helping clinicians retrieve information needed for point-of-care decision making”

Frost and Sullivan, 2014


SPEED TO INSIGHT EXAMPLE: COHORT SELECTION

MINING PATIENT RECORDS FOR DISEASE COMORBIDITIES

CHALLENGE

Identifying disease comorbidities for study via patient narratives and disease codes is often slow and manual. Finding 700 patients with HIV and hepatitis C took 5 medical students 4 months.

SOLUTION

Text mining queries over disease codes and terminology identified 1,100 patients in less than half a day.

BENEFIT

Patient groups can be quickly identified from both structured and unstructured text. Identifying new disease cohorts is easy and can be quickly iterated to select new groups for study.

Real time data & timeliness of data


Patent Analytics with I2E

Comprehensive Effective Search For Patent Landscaping

CHALLENGE

Patents are a valuable source of novel data. Identifying drug targets for specific indications is often slow and manual, as patents are long and their language is obscure. Finding targets for 3 therapeutic areas took 50 FTE days.

SOLUTION

A pipeline was built that used queries to extract target, indication, invention type and organisations, and fed the results into a database. Recall was 10 times that of the manual process, with good precision, and the pipeline also produced target relevance scores.

BENEFIT

The integrated process drastically reduces the FTEs required to keep the organization up-to-date on recent findings published in the patent literature.


Temporal Nature of Data


Diagnose Cancer Earlier: Pulmonary Nodule


Early diagnosis of lung cancer is limited because predictive models rely on a combination of structured and textual data

For example:

Cancer Risk (values listed from low to high risk)

Risk factor: Low / Intermediate / High

Nodule size, diameter (mm): <8 / 8 to 20 / >20

Age, yr: <45 / 45 to 60 / >60

Prior cancer history: No prior cancer / Prior cancer history

Tobacco use (pack/day): Never smoked / 1 / >1

Smoking cessation: Quit >7 yr ago / Quit <7 yr ago / Never quit

Chronic obstructive lung disease: No COPD / COPD

Asbestos exposure: No exposure / Exposure

Nodule characteristics: Smooth / Lobulated / Spiculated
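
As an illustration of how the structured part of such a model might be encoded, the Python sketch below maps the two purely numeric rows of the table above (nodule diameter and age) to their risk bands. The function names `band`, `nodule_size_risk` and `age_risk` are hypothetical, the thresholds are copied from the table, and no attempt is made to combine factors into an overall score. The remaining rows (smoking, COPD, asbestos exposure, nodule characteristics) are typically found in free text and would need to be extracted first.

```python
def band(value, low_below, high_above):
    """Return the risk band for a numeric value given the two table thresholds."""
    if value < low_below:
        return "Low"
    if value > high_above:
        return "High"
    return "Intermediate"

def nodule_size_risk(diameter_mm):
    """<8 mm is Low, 8 to 20 mm is Intermediate, >20 mm is High."""
    return band(diameter_mm, 8, 20)

def age_risk(age_years):
    """<45 yr is Low, 45 to 60 yr is Intermediate, >60 yr is High."""
    return band(age_years, 45, 60)
```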

Temporal Nature of Data

What is the challenge?

− Temporal attributes of an individual “event”, e.g. cancer vs. previous history of cancer (before vs. after)

− Emerging hypotheses, e.g. “X may represent a novel technique for Y”

− Temporal nature of corpora, e.g. Published literature-Grants-Patents

Examples

− I2B2 Challenge

− Opposition based searching

− Patents-NIH-Grants-Medline


I2B2 2014 Cardiac Risk Factors

The challenge is to extract a fixed set of cardiac risk factors

Risk factors include:

− medications, mentions of diabetes, hypertension, hyperlipidaemia, obesity, glucose/LDL/A1C/BMI test results, “cardiac events”, family history of Coronary Artery Disease, smoking etc.

Each annotation must also be given a temporal relation to the document, e.g.

− the patient had a heart attack BEFORE the day of the report

− the patient’s LDL was tested DURING the day of the report

There might be multiple annotations if the risk factor is ongoing

− Diabetes is probably going to be BEFORE, DURING and AFTER

Precision: 89.8%, Recall: 93.8%, F1-score: 91.7%
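
For readers unfamiliar with the metric, the F1-score is the harmonic mean of precision and recall. The short Python sketch below (the function name `f1_score` is ours) applies that standard formula to the figures quoted above; it evaluates to roughly 0.9176, which agrees with the reported 91.7% to within what is presumably rounding of the published precision and recall.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision and recall as reported on the slide, expressed as fractions.
f1 = f1_score(0.898, 0.938)
print(f"F1 = {f1:.4f}")
```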


Key Insights: Temporal Data

Events tend to have a "default" time if no appropriate language or dates are mentioned

− "Medications: Metformin, Aspirin" -> presumed to be continuing

Language to express temporal relations also depends on what you are trying to extract

− "Patient discontinued metformin" -> the patient took the drug before the report but is not continuing it

− "Starting a course of metformin" -> the patient will start the course after the report but did not take it before the report

− "Avoid metformin" -> the patient will stop taking the drug but took it before the report

− "Patient had Myocardial Infarction this morning" -> use deictic expressions such as "this morning" to establish the relation to the report (the event occurred on the date the report was written)

− "previous A1c was 6.5" -> use temporal adjectives and the tense of the verb for test results
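
The medication examples above can be expressed as a small rule table with a default. The Python sketch below is one hypothetical encoding: the names `RULES`, `DEFAULT` and `medication_relations` are ours, the trigger words are limited to those in the examples, and a continuing medication is assumed to map to BEFORE, DURING and AFTER by analogy with an ongoing risk factor.

```python
import re

BEFORE, DURING, AFTER = "BEFORE", "DURING", "AFTER"

# Ordered (pattern, relations) rules, checked top to bottom.
# Trigger words come from the slide examples only.
RULES = [
    # Taken before the report, not continuing afterwards.
    (re.compile(r"\b(discontinued|avoid)\b", re.I), {BEFORE}),
    # Not taken before the report, will be taken afterwards.
    (re.compile(r"\bstarting\b", re.I), {AFTER}),
]

# Assumption: with no temporal language, the medication is presumed
# to be continuing, i.e. before, during and after the report.
DEFAULT = {BEFORE, DURING, AFTER}

def medication_relations(text):
    """Return the set of temporal relations implied by a medication mention."""
    for pattern, relations in RULES:
        if pattern.search(text):
            return set(relations)
    return set(DEFAULT)
```

A bare list such as "Medications: Metformin, Aspirin" falls through to the default, while "Patient discontinued metformin" yields only BEFORE.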


Key Insights: Temporal Data

Reports are often written after the events they describe, so you can't always rely on the tense of the verb

− "Her BP today was 120/90"

A date within a few words of the event often implies the event took place in the past

− "10/12 Pt brought in after Myocardial Infarction"

− "LDL from 10/11/09 120"
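
The "date within a few words of the event" heuristic can be sketched as a token-distance check. The Python below is a simplified illustration, not the method used in the challenge entry: the function name `date_near_event`, the slash-separated date pattern and the default window of 5 tokens are all assumptions. Note that a reading such as "120/90" is rejected here only because its first part has three digits; a real system would need a more careful way to tell dates from measurements.

```python
import re

# Assumed date shape: d/m or d/m/y with 1-2 digit day and month.
DATE = re.compile(r"\d{1,2}/\d{1,2}(?:/\d{2,4})?")

def date_near_event(text, event, window=5):
    """True if a date token lies within `window` tokens of the event phrase."""
    tokens = [t.strip(".,;:") for t in text.split()]
    lowered = [t.lower() for t in tokens]
    target = [t.lower() for t in event.split()]
    n = len(target)
    date_positions = [i for i, t in enumerate(tokens) if DATE.fullmatch(t)]

    for start in range(len(tokens) - n + 1):
        if lowered[start:start + n] != target:
            continue
        end = start + n - 1
        for pos in date_positions:
            # Distance from the date to the nearest token of the event phrase.
            if min(abs(pos - start), abs(pos - end)) <= window:
                return True
    return False
```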


Opposition Searching

A single search over multiple data sources on different servers, providing a single set of results

Information from differently structured data is brought together and ordered by year
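
Bringing differently structured results together and ordering them by year amounts to normalizing each source to a common record and then sorting. The Python sketch below shows the idea under invented assumptions: the two input schemas (`publication_date`/`title` for patents, `fiscal_year`/`project_title` for grants), the function names and the sample records are all hypothetical placeholders, not real data or a real API.

```python
def from_patent(rec):
    """Normalize a hypothetical patent record with an ISO publication date."""
    return {
        "source": "patent",
        "year": int(rec["publication_date"][:4]),
        "title": rec["title"],
    }

def from_grant(rec):
    """Normalize a hypothetical grant record with a numeric fiscal year."""
    return {
        "source": "grant",
        "year": rec["fiscal_year"],
        "title": rec["project_title"],
    }

def merged_timeline(patents, grants):
    """Combine both sources into one list ordered by year (stable sort)."""
    hits = [from_patent(p) for p in patents] + [from_grant(g) for g in grants]
    return sorted(hits, key=lambda hit: hit["year"])

# Placeholder records for illustration only.
patents = [{"publication_date": "2012-05-01", "title": "Patent P1"}]
grants = [{"fiscal_year": 2010, "project_title": "Grant G1"}]

timeline = merged_timeline(patents, grants)
```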


Connected Data Technology


Single Query over Multiple Data Sources and Network Locations

Challenges: A Big Data Future

High indexing performance

• Millions of documents, TBs of storage

• Ontologies with 100,000s of terms

• Handles large documents with ease

• Open, configurable pipeline

• Advanced table processing

VOLUME VARIETY VELOCITY VISUALIZATION

Connected data technology

• Unified heterogeneous document types across federated servers

• Connect – Normalize – Use

• Structured, semi-structured and unstructured

Distributed indexing and querying

• Multi-processor

• Multi-machine

Integrates in enterprise applications, portals, pipelines and workflows

• Open web services API

• Public query language

Strong integrated visualization


Conclusion: Text Mining, It’s About Time To Start Using It!

Use of text mining demonstrates clear value in the Pharma & Healthcare sectors

− Time to insight

− Timeliness of data

− Real time data

− Temporal data

Technology improvements make possible real-time text mining that is both agile and scalable in a world of big data


Question Time


Thanks to:

David Milward, James Cormack, Jane Reed, Simon Beaulah & Phil Hastings

Linguamatics Text Mining Summit

October 12-14 2015, Newport RI

www.linguamatics.com/textminingsummit

Featuring customer use cases in the life sciences and healthcare, hands-on training, and a new healthcare hackathon

