Evidence Finder: a semantic search tool for the PMC Corpus of Biomedical Research Papers
C.J. Rupp National Centre for Text Mining University of Manchester www.nactem.ac.uk
Now: Spatial Humanities Project History [email protected]
2/6/2013 C.J. Rupp 1
Outline
What is UKPMC?
Text Mining for Biomedicine
What is PubMed Central?
What Does UKPMC add?
Medie: a point of Comparison
Lean Fact Extraction
What is Evidence Finder?
Complementary Search
Fact Summary
The UKPMC Team at NaCTeM
C.J. Rupp Parsing, Relation Extraction, Indexing.
Chikashi Nobata Named Entity Recognition (NER)
Bill Black Project Manager
Prof. Sophia Ananiadou Director
Jock McNaught Deputy Director
Matt Machin Web Application, Interfaces, GWT
Jacob Carter Databases
C.J. & Bill Design
What is UKPMC?
• A repository of 2.4 million full text journal articles in Biomedicine and Health Science
• Available for free on the web with no access restrictions
• Launched in January 2007, funded by the 8 largest funders of medical research in the UK
• Delivered by a consortium: The British Library, EBI and Manchester University
• This is the UK portal on the PubMed Central repository
• Now extended Europe-wide.
Text Mining for Biomedicine
• There's a lot of work on Text Mining for Biomedicine
This field has money
But it also has one of the best problems
• The rate of Biomedical publication has soared
To inhuman proportions
So it's appealing to look for machine assistance
• The selling points are:
Handle information overload, and
Avoid overlooking information.
Data Deluge
[Figure: three charts. Medline, total articles per year; Medline, new articles per year; EMBL Database, total entries per year.]
What is PubMed Central?
PubMed Central (PMC) is the U.S. National Institutes of Health (NIH) digital archive of biomedical and life sciences journal literature.
Around 2 million full-text, published articles
Contrast with PubMed: c. 22 Million abstracts
Many PMC articles are Open Access
Mixed format corpus: XML, PDF, OCR-ed
What does UKPMC add?
There are two main areas where UKPMC offers an extended service:
1. Additional literature, including UK-specific documents, such as NHS guidelines
2. A range of text mining services
This is where NaCTeM comes in
Our Mission
Provide a more Intelligent Search tool for UKPMC
Showcase Text Mining Technologies
Use existing Resources, specifically:
Enju: deep syntactic parser
Biolexicon: domain lexicon
NER tools: for genes, diseases, etc.
Enju Parser
A syntactic parser for English.
With a wide-coverage probabilistic HPSG grammar
An efficient parsing algorithm
Trained on Biomedical text (PubMed abstracts)
Which provides phrase structures and predicate-argument structures.
The BioLexicon
A Lexical Database for Biomedicine
2.2 M entries (mainly biomedical terms)
658 domain-relevant verbs
Syntactic subcategorisation frames specified for all verbs (1760 frames)
Collected automatically from a dependency-parsed corpus of 6M tokens on the topic of E. coli
Includes strongly selected modifiers, reflecting the importance of location, time, manner etc. in the description of biomedical facts
Also, Semantic frames specified for 168 verbs (856 frames)
NER (Named Entity Recognition)
Dictionary-Based NER for significant classes of entity:
Genes and Proteins
Drugs and Diseases
Metabolites
including NeMine, trained for gene/protein disambiguation
Dictionaries include UMLS, DrugBank, HMDB
Medie: a Point of Comparison
There was an existing system with a similar specification:
Defined on PubMed Abstracts
With a powerful query language
GCL (Generalised Concordance Lists) based on Region Algebra
Using a tabular format for queries
Medie
Medie: Result
Tabular Format
Formal Query
Notes on Medie
Medie seems fairly intuitive
It stores a lot of information from the Enju parse
So there's expressive power under the hood
But the average user doesn't get to use it
Also, non-linguists may be put off by explicit grammatical terminology in the interface
UKPMC Engagement
We did some focus group studies
These showed a marked preference for a simple interface (predictably?)
How do you get as close as possible to a Google-style interface
And still show off your deep linguistic analysis?
Design Constraints
Intuitive interface
Tailor the information stored to the requirements of the functionality
Make best use of our own specialised resources
Provide a simple web service to link with keyword and metadata searches
Lean Fact Extraction
We extract a database of facts that may provide answers to queries
We rely on specialised linguistic and domain knowledge to underwrite the quality of the fact entries
Facts should be seen as units of evidence
Validity is the authors' problem
Ours is relevance
What is a Fact?
Each entry in the fact database is the conjunction of:
A named entity (NE), according to the NER
Occurring within an argument (or modifier) position, according to the Enju analysis
That is designated as domain relevant in the BioLexicon
That's a recipe!
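The recipe can be sketched in a few lines. This is a minimal illustration in Python, not the project's own code: the `Predicate`, `Argument` and `Fact` types, the `DOMAIN_VERBS` set and the `extract_facts` function are all hypothetical stand-ins for the real Enju, NER and BioLexicon outputs, and the substring match is a simplification.

```python
from dataclasses import dataclass

# Assumed excerpt of domain-relevant verbs; the real BioLexicon is far larger.
DOMAIN_VERBS = {"activate", "inhibit", "result"}

@dataclass(frozen=True)
class Argument:
    role: str   # position assigned by the parser, e.g. "ARG1", "ARG2", "MOD"
    text: str   # the phrase filling that position

@dataclass(frozen=True)
class Predicate:
    verb: str          # verb lemma
    arguments: tuple   # tuple of Argument

@dataclass(frozen=True)
class Fact:
    doc_id: str
    verb: str
    role: str
    entity: str
    sentence: str

def extract_facts(doc_id, sentence, predicates, entities):
    """Return one Fact per NE found inside an argument of a domain verb."""
    facts = []
    for pred in predicates:
        if pred.verb not in DOMAIN_VERBS:        # verb must be domain relevant
            continue
        for arg in pred.arguments:               # position comes from the parse
            for ne in entities:                  # entity comes from the NER
                if ne.lower() in arg.text.lower():
                    facts.append(Fact(doc_id, pred.verb, arg.role, ne, sentence))
    return facts
```

All three conditions must hold at once, so a named entity in a sentence whose verb is not in the lexicon yields no fact.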
Explanation
The BioLexicon extends our scope with predicted modifiers, as well as arguments
We take phrases containing NEs to generalise and improve yield
The parse assigns syntactic roles
We also handle some negation
Mainly explicit negation on the verb.
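As a rough illustration of explicit negation on the verb, the sketch below checks for a negator shortly before the verb. It is an assumption-laden simplification: the real system works from the Enju parse, whereas this hypothetical `is_negated` function uses a plain token window, and the `NEGATORS` list is invented for the example.

```python
# Assumed list of explicit negators; not taken from the actual system.
NEGATORS = {"not", "n't", "never", "no"}

def is_negated(tokens, verb_index, window=3):
    """Return True if a negator occurs within `window` tokens before the verb."""
    start = max(0, verb_index - window)
    return any(tok.lower() in NEGATORS for tok in tokens[start:verb_index])
```

For the tokens of "Ciprofloxacin did not inhibit growth" with the verb at index 3, the negator falls inside the window and the fact would be flagged as negated.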
A Simplified Fact Table
Document ID   Verb     Arg1            Arg2   Sentence
PMC2845863    result   ciprofloxacin   -      Treatment wi..
PMC2817234    result   ciprofloxacin   PAE    Treatment of..
PMC2738812    result   ciprofloxacin   -      the combin…
PMC2847397    result   ciprofloxacin   -      An in vivo ex..
In practice, the table fields hold identifiers, which may be normalised forms or cross-references. In particular, NEs are mapped to a canonical identifier in the database and to a canonical written form in generated questions. (PAE, here, represents another NE in an (oblique) object position. Otherwise, it’s just text.)
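To make the table concrete, here is a small sketch that loads the four rows above into an in-memory SQLite database and looks facts up by verb and entity. The schema, the column names and the `find_docs` helper are assumptions for illustration; the real database uses canonical identifiers rather than plain strings.

```python
import sqlite3

# Hypothetical schema; column names follow the simplified table above.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE fact (doc_id TEXT, verb TEXT, arg1 TEXT, arg2 TEXT, sentence TEXT)"
)
conn.executemany(
    "INSERT INTO fact VALUES (?, ?, ?, ?, ?)",
    [
        ("PMC2845863", "result", "ciprofloxacin", None, "Treatment wi.."),
        ("PMC2817234", "result", "ciprofloxacin", "PAE", "Treatment of.."),
        ("PMC2738812", "result", "ciprofloxacin", None, "the combin…"),
        ("PMC2847397", "result", "ciprofloxacin", None, "An in vivo ex.."),
    ],
)

def find_docs(conn, verb, entity):
    """Return IDs of documents holding a fact that links `entity` to `verb`."""
    rows = conn.execute(
        "SELECT doc_id FROM fact WHERE verb = ? AND (arg1 = ? OR arg2 = ?) "
        "ORDER BY doc_id",
        (verb, entity, entity),
    )
    return [doc_id for (doc_id,) in rows]
```

Keeping the document ID in every row is what lets a lookup like this be joined with other UKPMC services, such as metadata.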
Sentence Snippets
The database also contains the sentence where each fact was found
As well as the document ID to coordinate with other UKPMC services, e.g. metadata
Because of copyright issues (with the HTML webpages)
We were not given access to present results in situ, with highlighting and links in the text
Some Sentence Snippets (about Ciprofloxacin)
Treatment with ciprofloxacin, ceftriaxone or pivmecillinam resulted in a cure rate of >99% while assessing clinical failure, bacteriological failure and bacteriological relapse.
Treatment of the malaria parasites with ciprofloxacin, an inhibitor of the bacterial DNA gyrase, and other antibiotics including chloramphenicol, clindamycin, tetracycline and rifampicin resulted in the arrest of growth in the second asexual cycle, while the parasites in the current cell cycle appeared relatively unaffected (Geary et al. 1988; McFadden & Roos 1999; Surolia et al. 2004; Ramya et al. 2007).
the combination of ciprofloxacin and 5-FU resulted in a synergistic prolongation of the postantibiotic effect (PAE) in comparison with the PAE induced by the drugs alone.
An in vivo exposure to ciprofloxacin resulted in predominately efflux-mediated resistant mutants, suggesting that efflux plays a central role in emergence of fluoroquinolone resistance.
We Have all the Answers
Well actually we don't!
But we have all the answers we are prepared to offer
How do we provide these to the user, in response to a relevant query?
This must be coordinated with searches based on:
A keyword in the text or (literary) metadata
What is Evidence Finder?
The Concept:
This is a complementary search tool for UKPMC.
To search the repository from a different perspective.
We retrieve documents,
But we search on evidence, rather than publication history, or keywords.
We provide a structured answer using generated questions
More than a Keyword!
Evidence Finder extends a keyword search
A search on a keyword produces a potentially large set of possible answers from the fact database
Generating questions around the relations in those facts can structure the result into smaller answer sets: the Jeopardy® solution!?
And help the user refine their query:
• “This is what you could have asked”
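The structuring step can be pictured as grouping keyword hits by the relation they take part in. The sketch below is a guess at the idea rather than the actual implementation: `group_hits` is a hypothetical name, facts are reduced to `(doc_id, verb, role, entity)` tuples, and the document IDs in the usage note are made up.

```python
from collections import defaultdict

def group_hits(facts, keyword):
    """Group facts whose entity matches `keyword` by (verb, role).

    Each group is a smaller answer set that one generated question could cover.
    """
    groups = defaultdict(set)
    for doc_id, verb, role, entity in facts:
        if entity.lower() == keyword.lower():
            groups[(verb, role)].add(doc_id)
    # Largest answer sets first; ties broken alphabetically for stable output.
    return sorted(
        ((key, sorted(docs)) for key, docs in groups.items()),
        key=lambda item: (-len(item[1]), item[0]),
    )
```

With facts `("D1", "result", "ARG1", "ciprofloxacin")`, `("D2", "result", "ARG1", "ciprofloxacin")` and `("D3", "inhibit", "ARG2", "ciprofloxacin")`, the keyword "ciprofloxacin" gives two groups, one per relation, instead of one flat list of three hits.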
Generating questions
Entity1 activates Entity2
Entity2 is activated by Entity1
Entity1 cooperate to activate Entity2
Entity1 play key roles by activating Entity2
activate
  ARG1 Entity1
  ARG2 Entity2
We deal with syntactic variability by deep semantic parsing
Turning these into questions suggests how they can be accessed in a search application
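Once the surface variants are reduced to one predicate-argument structure, questions can be generated from it by leaving out one argument at a time. The following is only a sketch under assumptions: the `Frame` type, the question templates and the naive `third_person` inflection are invented for illustration and only cope with regular verbs.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Frame:
    verb: str   # verb lemma, e.g. "activate"
    arg1: str
    arg2: str

def third_person(verb):
    """Naive third-person singular form; adequate for regular verbs only."""
    return verb + ("es" if verb.endswith(("s", "sh", "ch", "x", "z")) else "s")

def questions_for(frame):
    """Generate one question per argument, with that argument as the answer."""
    return [
        f"What does {frame.arg1} {frame.verb}?",               # answer: ARG2
        f"What {third_person(frame.verb)} {frame.arg2}?",      # answer: ARG1
    ]
```

For `Frame("activate", "Entity1", "Entity2")` this yields "What does Entity1 activate?" and "What activates Entity2?", whichever of the four surface forms the fact came from.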
Complementary Search
Complementary Search
Evidence Finder Result
What to expect from Evidence Finder
Suggests questions for you
Clicking on a question will return sets of documents with evidence snippets
Shows where answers may be in the text
Answers should immediately show you if you want to look at the whole document
Helps you look at similar facts in other documents
Evidence Finder: Result
Evidence Finder: Result
Document Metadata
Generated Questions
Evidence Sentences
Fact Summary
“More Like This” Query
What is Evidence Finder?
The Implementation:
A Web Service by NaCTeM
1. Suggested questions corresponding to a search term
2. Paged ‘answers’ to question: Document Metadata from EBI WS, extended with matching analyzed sentences.
3. All the analyzed factual sentences in a doc., each with a more like this query attached.
The Platform:
• Java supported by Eclipse, using Google Web Toolkit (GWT)
• Web Service running under Apache Tomcat
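The paged answers in point 2 amount to slicing a result list. A minimal sketch follows, in Python for brevity rather than the Java of the real service; the `page` function, its 1-based page numbering and the default page size are assumptions, not the service's actual interface.

```python
def page(items, page_number, page_size=10):
    """Return one page of `items` (1-based page number) and the page count."""
    if page_number < 1 or page_size < 1:
        raise ValueError("page_number and page_size must be positive")
    total_pages = (len(items) + page_size - 1) // page_size   # ceiling division
    start = (page_number - 1) * page_size
    return items[start:start + page_size], total_pages
```

A request past the last page simply returns an empty list together with the page count, which lets a client detect the end of the answers.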
UKPMC Evidence Finder
[Architecture diagram with two sides, Indexing and Searching. Indexing components: New doc set, XML Converter, NER for UKPMC, Consolidate NER data, Enju parser, Fact extractor, Store, Fact DB. Searching components: Query from user, Web User Interface, Web interface, EVF, Web Service, Search, Retrieved facts, Document Data From Europe PMC.]
Statistics and observations
2.4M articles fully parsed
67.36 million indexed facts
Representing 1.7 million documents
Relies on NEs indexed by NaCTeM
Search results ranked by date, newest first.
Other rankings possible
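The default ranking is easy to picture. In this sketch the result records, their `published` field and the `rank_newest_first` function are hypothetical; swapping the sort key is all another ranking would need.

```python
from datetime import date

def rank_newest_first(results):
    """Sort result records by publication date, newest first."""
    return sorted(results, key=lambda r: r["published"], reverse=True)

# Made-up records, for illustration only.
results = [
    {"doc_id": "A", "published": date(2009, 5, 1)},
    {"doc_id": "B", "published": date(2012, 1, 15)},
    {"doc_id": "C", "published": date(2010, 11, 30)},
]
ranked_ids = [r["doc_id"] for r in rank_newest_first(results)]
```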
What is Evidence Finder for?
An Evidence-based search:
Starts from the bottom
Locates specific statements
It may find unexpected or overlooked facts
It may find trivial and boring facts
It's not an antidote to literature or Google search
It may not be able to handle complex queries (yet).
Extensions?
Structure within phrases
Select NEs with the “Head” line
More negation operators
• “lack of”, “fail to”, “avoid”
More normalisation
e.g. Acronym resolution
Relation sets from other domains
– Refine the medical verb dictionary
Thanks for your patience and stamina
Services to try:
http://labs.europepmc.org/evf
http://www.nactem.ac.uk/MEDIE/