II-SDV 2015, 20-21 April, Nice
TRANSCRIPT
Overview: Text Mining, It's About Time !
Introducing text mining
Why text mining has now come of age
Speed to insight
Real time data & timeliness of data
Temporal nature of data
Future challenges
© Linguamatics 2015
What is Text Mining?
"Text Mining is the discovery by computer of new, previously unknown information, by
automatically extracting and relating information from different written resources,
to reveal otherwise "hidden" meanings. A key element is the linking together of the extracted information to form new facts or new hypotheses to be explored further by
more conventional means of experimentation."
Marti Hearst, UC Berkeley
How is it being used in Life Sciences?
Advanced text analytics delivers value along the pipeline
Gene-disease mapping
Target ID/selection
Mutation/expression analysis
Toxicity analysis and prediction
Biomarker discovery
Drug repurposing
Patent analysis
KOL identification
Opportunity scouting
Trial site selection and study design
Safety
Competitive intelligence
Pharmacovigilance
Social media analysis
Comparative Effectiveness
Regulatory Submission QC
HEOR
SAR
Challenges
Most of the information required is in free text
Ever-increasing amounts of text data to examine
[Chart: PubMed Records, axis scale 0 to 25,000,000]
− Different kinds of documents
− External literature, patents, EHRs, internal reports, blogs, presentations
− Different formats
− HTML, PDF, XML, Word, PPT, Wiki
Challenges in Unstructured Data
Different word, same meaning
cyclosporine
ciclosporin
Neoral
Sandimmune
Different expression, same meaning
Non-smoker
Does not smoke
Does not drink or smoke
Denies tobacco use
Different grammar, same meaning
5mg/kg of cyclosporine per day
5mg/kg per day of cyclosporine
cyclosporine 5mg/kg per day
Same word, different context
Diagnosed with diabetes
Family history of diabetes
No family history of diabetes
NLP
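The four kinds of variation above can be illustrated with a minimal Python sketch. It is not the method used by any particular product: the synonym table and patterns are hand-written from the slide's examples purely as an assumption, where a real system would rely on curated ontologies and linguistic analysis.

```python
import re

# Illustrative synonym table built from the slide's examples (assumption:
# a real system would use a curated ontology, not a hand-written dict).
DRUG_SYNONYMS = {
    "cyclosporine": "cyclosporine",
    "ciclosporin": "cyclosporine",
    "neoral": "cyclosporine",
    "sandimmune": "cyclosporine",
}

# Different expressions that all mean "non-smoker".
NON_SMOKER_PATTERNS = [
    r"\bnon-?smoker\b",
    r"\bdoes not (?:drink or )?smoke\b",
    r"\bdenies tobacco use\b",
]


def normalize_drug(term):
    """Map a drug mention to its preferred name, or None if unknown."""
    return DRUG_SYNONYMS.get(term.strip().lower())


def is_non_smoker(text):
    """True if any of the listed non-smoker phrasings appears in the text."""
    lowered = text.lower()
    return any(re.search(p, lowered) for p in NON_SMOKER_PATTERNS)


def diabetes_context(text):
    """Classify a diabetes mention by context; None if diabetes is not mentioned."""
    lowered = text.lower()
    if "diabetes" not in lowered:
        return None
    # Check the negated form first, because it contains the positive form.
    if re.search(r"\bno family history of\b", lowered):
        return "no_family_history"
    if re.search(r"\bfamily history of\b", lowered):
        return "family_history"
    return "patient"


print(normalize_drug("Neoral"))
print(is_non_smoker("Does not drink or smoke"))
print(diabetes_context("No family history of diabetes"))
```

Keyword rules like these break quickly on real text, which is the motivation for the linguistic processing described next.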
From Words to Meaning
“Among them, nimesulide, a selective COX2 inhibitor, …”
Entrez Gene ID: 5743
inhibits
Identifying entities and relations
Linguistics to establish relationships
Finding Indirect Relationships
Treatment has been applied in clinical trials
Thalidomide in advanced hepatocellular carcinoma as antiangiogenic treatment approach: a phase I/II trial. Pinter et al. Eur J Gastroenterol Hepatol. 2008 20(10):1012-9
Phase II study of temozolomide, thalidomide, and celecoxib for newly diagnosed glioblastoma in adults. Kesari et al. Neuro Oncol. 2008 10(3):300-8
<Thalidomide>-<Relationship>-<Gene> <Gene>-<Relationship>-<Angiogenic Process>
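The two-step pattern above amounts to joining two sets of extracted relations on the shared gene. A minimal sketch of that join follows, assuming relations have already been extracted as simple triples. The gene names `GENE_A` and `GENE_B` and the relation verbs are placeholders, not real findings.

```python
from collections import defaultdict


def indirect_links(drug_gene, gene_process):
    """Join (drug, relation, gene) triples with (gene, relation, process) triples on the gene."""
    by_gene = defaultdict(list)
    for gene, relation, process in gene_process:
        by_gene[gene].append((relation, process))

    links = []
    for drug, relation1, gene in drug_gene:
        for relation2, process in by_gene.get(gene, []):
            links.append((drug, relation1, gene, relation2, process))
    return links


# Placeholder triples: GENE_A and GENE_B are hypothetical, not real findings.
drug_gene = [
    ("thalidomide", "inhibits", "GENE_A"),
    ("thalidomide", "binds", "GENE_B"),
]
gene_process = [
    ("GENE_A", "promotes", "angiogenesis"),
]

for link in indirect_links(drug_gene, gene_process):
    print(" -> ".join(link))
```

Only drugs and processes that share an intermediate gene are linked, so each result is a hypothesis to check against the literature rather than an established fact.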
Modes of Use
Reusable pipelines
• Decision support
• Knowledge capture
• Classification/mark-up
• Capture and re-use strategies
• Semantic categorization
Not for profit
Education
Research Biotech
Pharma
Medical devices
ICT
Funders
Approvers
Government
Patient
Payers
Prescribers
Providers
Dispensers
Electronic Health Record
EHRs & Healthcare Challenges
The challenge is to unlock the value of the huge investment being made in EHRs
“Natural language processing (NLP) and visualization dashboards are the technologies most suitable to improve EHR usability. NLP can produce readable summaries of unstructured text, helping clinicians retrieve information needed for point-of-care decision making”
Frost and Sullivan, 2014
SPEED TO INSIGHT EXAMPLE: COHORT SELECTION
MINING PATIENT RECORDS FOR DISEASE COMORBIDITIES
CHALLENGE
Identifying disease comorbidities for study via patient narratives and disease codes is often slow and manual. To find 700 patients with HIV and Hepatitis C took 5 medical students 4 months.
SOLUTION
Text mining queries over disease codes and terminology identified 1100 patients in less than half a day.
BENEFIT
Patient groups can be quickly identified from both structured and unstructured text. Identifying new disease cohorts is easy and can be quickly iterated to select new groups for study.
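A minimal sketch of the idea of combining structured codes with narrative text for cohort selection. Everything here is an assumption for illustration: the `PatientRecord` type, the sample records and the patterns are hypothetical, and the ICD-10 codes (B20 for HIV disease, B18.2 for chronic viral hepatitis C) merely stand in for whatever coding system the records use.

```python
import re
from dataclasses import dataclass, field

# Illustrative ICD-10 codes: B20 (HIV disease), B18.2 (chronic viral hepatitis C).
HIV_CODES = {"B20"}
HCV_CODES = {"B18.2"}

# Naive text patterns; they do not handle negation such as "no evidence of HIV".
HIV_TEXT = re.compile(r"\bHIV\b", re.IGNORECASE)
HCV_TEXT = re.compile(r"\b(?:hepatitis C|HCV)\b", re.IGNORECASE)


@dataclass
class PatientRecord:
    """Hypothetical record holding structured codes and a free-text narrative."""
    patient_id: str
    codes: set = field(default_factory=set)
    narrative: str = ""


def has_condition(record, codes, pattern):
    """A condition counts if it is coded OR mentioned in the narrative."""
    return bool(record.codes & codes) or bool(pattern.search(record.narrative))


def select_cohort(records):
    """Return IDs of patients with evidence of both HIV and hepatitis C."""
    return [
        r.patient_id
        for r in records
        if has_condition(r, HIV_CODES, HIV_TEXT)
        and has_condition(r, HCV_CODES, HCV_TEXT)
    ]


# Hypothetical records, not real patient data.
records = [
    PatientRecord("p1", {"B20"}, "Chronic hepatitis C, on treatment."),
    PatientRecord("p2", {"B18.2"}, "No other conditions noted."),
    PatientRecord("p3", set(), "HIV-positive with HCV co-infection."),
]
print(select_cohort(records))
```

Patient `p3` has no codes at all and is found only through the narrative, which is the kind of case a code-only search would miss.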
Patent Analytics with I2E: Comprehensive Effective Search For Patent Landscaping
CHALLENGE
Patents are a valuable source of novel data. Identifying drug targets for specific indications is often slow and manual, as patents are long and the language obtuse. To find targets for 3 therapeutic areas took 50 FTE days.
SOLUTION
A pipeline was built that used queries to extract target, indication, invention type and organisations and feed them into a database. Recall was 10x that of the manual process, with good precision, and target relevance scores were also provided.
BENEFIT
The integrated process drastically reduces the FTEs required to keep the organization up-to-date on recent findings published in the patent literature.
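The "extract, then feed into a database" step can be sketched with Python's standard `sqlite3` module. This shows only the loading and querying side. The table name, columns and rows are hypothetical placeholders, and the extraction queries themselves are not shown.

```python
import sqlite3


def load_extractions(rows):
    """Load extracted patent facts into an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE patent_facts ("
        "patent_id TEXT, target TEXT, indication TEXT, "
        "invention_type TEXT, organisation TEXT, relevance REAL)"
    )
    conn.executemany("INSERT INTO patent_facts VALUES (?, ?, ?, ?, ?, ?)", rows)
    conn.commit()
    return conn


def targets_for(conn, indication):
    """List targets for one indication, highest relevance score first."""
    cursor = conn.execute(
        "SELECT target, MAX(relevance) AS best FROM patent_facts "
        "WHERE indication = ? GROUP BY target ORDER BY best DESC",
        (indication,),
    )
    return [row[0] for row in cursor]


# Placeholder rows: all identifiers and scores are hypothetical.
rows = [
    ("PATENT-1", "TARGET_X", "INDICATION_1", "composition", "ORG_A", 0.9),
    ("PATENT-2", "TARGET_Y", "INDICATION_1", "use", "ORG_B", 0.4),
    ("PATENT-3", "TARGET_X", "INDICATION_2", "use", "ORG_A", 0.7),
]
conn = load_extractions(rows)
print(targets_for(conn, "INDICATION_1"))
```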
Diagnose Cancer Earlier: Pulmonary Nodule
Early diagnosis of lung cancer is limited because predictive
models rely on a combination of structured and textual data
For example:
Cancer Risk
Low Intermediate High
Nodule size, diameter (mm) <8 8 to 20 >20
Age, yr <45 45 to 60 >60
Prior cancer history No prior cancer Prior cancer history
Tobacco use (pack/day) Never smoked 1 >1
Smoking cessation Quit > 7 yr ago Quit <7 yr ago Never quit
Chronic obstructive lung disease No COPD COPD
Asbestos exposure No exposure Exposure
Nodule characteristics Smooth Lobulated Spiculated
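To illustrate why such a model needs both structured and textual data, the sketch below maps three of the factors above to their risk tier: age and nodule size as structured values, and the nodule characteristic pulled from report text. The function names and the sample report are assumptions. The slide does not say how the factors combine into an overall risk, so the sketch does not attempt that step.

```python
import re

# Tier for each nodule characteristic, as listed in the table.
MARGIN_TIER = {"smooth": "low", "lobulated": "intermediate", "spiculated": "high"}


def size_tier(diameter_mm):
    """Nodule diameter in mm: <8 low, 8 to 20 intermediate, >20 high."""
    if diameter_mm < 8:
        return "low"
    if diameter_mm <= 20:
        return "intermediate"
    return "high"


def age_tier(age_years):
    """Age in years: <45 low, 45 to 60 intermediate, >60 high."""
    if age_years < 45:
        return "low"
    if age_years <= 60:
        return "intermediate"
    return "high"


def margin_tier(report_text):
    """Find the first nodule characteristic mentioned in free text, if any."""
    match = re.search(r"\b(smooth|lobulated|spiculated)\b", report_text.lower())
    return MARGIN_TIER[match.group(1)] if match else None


def risk_factors(age_years, diameter_mm, report_text):
    """Per-factor tiers only; no overall score is computed."""
    return {
        "age": age_tier(age_years),
        "nodule_size": size_tier(diameter_mm),
        "nodule_characteristics": margin_tier(report_text),
    }


# Hypothetical example, not real patient data.
print(risk_factors(62, 12, "12 mm spiculated nodule in the right upper lobe"))
```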
Temporal Nature of Data
What is the challenge?
− Temporal attributes of an individual “event” e.g. cancer ‘v’ previous history of cancer (before ‘v’ after)
− Emerging hypotheses e.g. “X may represent a novel technique for Y”
− Temporal nature of corpora e.g. Published literature-Grants-Patents
Examples
− I2B2 Challenge
− Opposition based searching
− Patents-NIH-Grants-Medline
I2B2 2014 Cardiac Risk Factors
The challenge is to extract a fixed set of Cardiac Risk factors
Risk factors include:
− medications, mentions of diabetes, hypertension, hyperlipidaemia, obesity, glucose/LDL/A1C/BMI test results, “cardiac events”, family history of Coronary Artery Disease, smoking etc.
Each annotation must also be given a temporal relation to the document i.e.
− the patient had a heart attack BEFORE the day of the report
− the patient’s LDL was tested DURING the day of the report
There might be multiple annotations if the risk factor is ongoing
− Diabetes is probably going to be BEFORE, DURING and AFTER
Precision: 89.8%, Recall: 93.8%, F1-score: 91.7%
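For reference, the F1-score is the harmonic mean of precision and recall. A small sketch of the calculation follows. It uses the rounded percentages quoted above as inputs, which is an assumption, since the official score would be computed from the underlying annotation counts.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded figures from the slide, expressed as fractions.
f1 = f1_score(0.898, 0.938)
print(f"F1 = {f1:.4f}")
```

With the rounded inputs this gives roughly 0.9176, which agrees with the reported 91.7% to within the rounding of the precision and recall figures.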
Key Insights: Temporal Data
Events tend to have a "default" time if no appropriate language or dates are mentioned
− "Medications: Metformin, Aspirin" -> presumed to be continuing
Language to express temporal relations also depends on what you are trying to extract
− "Patient discontinued metformin" -> the patient took the drug
before the report but is not continuing it
− "Starting a course of metformin" -> the patient will start the
course after the report but did not take it before the report
− "Avoid metformin" -> the patient will stop taking the drug but
took it before the report
− "Patient had Myocardial Infarction this morning" -> use pronouns
to establish relation to report (on the date the report was written)
− "previous A1c was 6.5" -> use temporal adjectives and the tense
of the verb for test results
Key Insights: Temporal Data
Reports are often written after the event was described, however, so you can't always rely on the tense of the verb
− "Her BP today was 120/90"
Extracting a date within a few words of the event often implies the event took place in the past
− "10/12 Pt brought in after Myocardial Infarction"
− "LDL from 10/11/09 120"
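The insights on these two slides can be read as an ordered set of rules with a default. The sketch below is an assumption about how such rules might look and not the system that produced the I2B2 results. The patterns are keyed to the slide's example sentences, and the default of BEFORE, DURING and AFTER reflects the "presumed to be continuing" case.

```python
import re

# Ordered rules: the first matching pattern wins. Patterns are illustrative
# and derived only from the example sentences on the slides.
RULES = [
    (re.compile(r"\bdiscontinued\b"), {"BEFORE"}),
    (re.compile(r"\bstarting\b"), {"AFTER"}),
    (re.compile(r"\bavoid\b"), {"BEFORE"}),
    (re.compile(r"\b(?:today|this morning)\b"), {"DURING"}),
    (re.compile(r"\bprevious\b"), {"BEFORE"}),
    # A nearby date such as 10/12 or 10/11/09 suggests a past event.
    (re.compile(r"\b\d{1,2}/\d{1,2}(?:/\d{2,4})?\b"), {"BEFORE"}),
]

# Default when no temporal language or date is found: presumed continuing.
DEFAULT = {"BEFORE", "DURING", "AFTER"}


def temporal_relations(sentence):
    """Return the set of temporal relations of an event to the report date."""
    lowered = sentence.lower()
    for pattern, relations in RULES:
        if pattern.search(lowered):
            return set(relations)
    return set(DEFAULT)


for example in [
    "Medications: Metformin, Aspirin",
    "Patient discontinued metformin",
    "Starting a course of metformin",
    "Her BP today was 120/90",
    "LDL from 10/11/09 120",
]:
    print(example, "->", sorted(temporal_relations(example)))
```

Rule order matters: "today" is checked before the date pattern so that a present-day reading is not mistaken for a past one, in line with the point that verb tense alone is unreliable.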
Opposition Searching
Single search over multiple data on different servers providing a single set of results
Information from differently structured data is brought together and ordered by year
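A minimal sketch of bringing differently structured results together and ordering them by year. The `Hit` type, the field names (`invention_title`, `priority_date`, `pub_year`) and the sample records are all hypothetical. In practice each source would be queried on its own server before the merge.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """Common shape for results, whatever the source structure."""
    source: str
    title: str
    year: int


def normalize_patent(record):
    """Hypothetical patent record: the year is taken from an ISO priority date."""
    return Hit("patents", record["invention_title"], int(record["priority_date"][:4]))


def normalize_article(record):
    """Hypothetical literature record with an explicit publication year."""
    return Hit("medline", record["title"], record["pub_year"])


def merge_by_year(*result_sets):
    """Combine per-source result lists into one list ordered by year."""
    hits = [hit for results in result_sets for hit in results]
    return sorted(hits, key=lambda hit: hit.year)


# Placeholder records, not real documents.
patents = [
    normalize_patent({"invention_title": "Example patent A", "priority_date": "2011-03-01"}),
]
articles = [
    normalize_article({"title": "Example article B", "pub_year": 2008}),
    normalize_article({"title": "Example article C", "pub_year": 2013}),
]
merged = merge_by_year(patents, articles)
for hit in merged:
    print(hit.year, hit.source, hit.title)
```

Normalizing every source to one record shape before sorting is what lets a single ordered result set span patents and literature.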
Connected Data Technology
Single Query over Multiple Data Sources and Network Locations
Challenges: A Big Data Future
High indexing performance
• Millions of documents, TBs of storage
• Ontologies with 100,000s of terms
• Handles large documents with ease
• Open, configurable pipeline
• Advanced table processing
VOLUME VARIETY VELOCITY VISUALIZATION
Connected data technology
• Unified heterogeneous document types across federated servers
• Connect – Normalize – Use
• Structured, semi-structured and unstructured
Distributed indexing and querying
• Multi-processor
• Multi-machine
Integrates in enterprise applications, portals, pipelines and workflows
• Open web services API
• Public query language
Strong integrated visualization
Conclusion: Text mining; It’s About Time To Start Using It !
Use of text mining demonstrates clear value in the Pharma & Healthcare sectors
− Time to insight
− Timeliness of data
− Real time data
− Temporal data
Technology improvements make possible real-time text mining that is both agile and scalable in a world of big data
Question Time
Thanks to :
David Milward, James Cormack, Jane Reed, Simon Beaulah & Phil Hastings