
Evidence Finder: a semantic search tool for the PMC Corpus of Biomedical Research Papers

C.J. Rupp, National Centre for Text Mining, University of Manchester, www.nactem.ac.uk

Now: Spatial Humanities Project, History

2/6/2013 C.J. Rupp 1

http://www.nactem.ac.uk/


Outline

What is UKPMC

Text Mining for Biomedicine

What is PubMed Central?

What Does UKPMC add?

Medie: a point of Comparison

Lean Fact Extraction

What is Evidence Finder?

Complementary Search

Fact Summary


The UKPMC Team at NaCTeM

C.J. Rupp Parsing, Relation Extraction, Indexing.

Chikashi Nobata Named Entity Recognition (NER)

Bill Black Project Manager

Prof. Sophia Ananiadou Director

Jock McNaught Deputy Director

Matt Machin Web Application, Interfaces, GWT

Jacob Carter Databases

C.J. & Bill Design


What is UKPMC?

• A repository of 2.4 million full text journal articles in Biomedicine and Health Science

• Available for free on the web with no access restrictions

• Launched in January 2007, funded by the 8 largest funders of medical research in the UK

• Delivered by a consortium: The British Library, EBI and Manchester University

• This is the UK portal on the PubMed Central repository

• Now extended Europe-wide.


Text Mining for Biomedicine

• There's a lot of work on Text Mining for Biomedicine

This field has money

But it also has one of the best problems

• The rate of Biomedical publication has soared

To inhuman proportions

So it's appealing to look for machine assistance

• The selling points are:

Handle information overload, and

Avoid overlooking information.


Data Deluge

[Figure: three charts. Medline, total articles per year; Medline, new articles per year; EMBL Database, total entries per year.]


What is PubMed Central?

PubMed Central (PMC) is the U.S. National Institutes of Health (NIH) digital archive of biomedical and life sciences journal literature.

Around 2 million full-text published articles

Contrast with PubMed: c. 22 million abstracts

Many PMC articles are Open Access

Mixed format corpus: XML, PDF, OCR-ed


What does UKPMC add?

There are two main areas where UKPMC offers an extended service:

1. Additional literature, including UK-specific documents, such as NHS guidelines

2. A range of text mining services

This is where NaCTeM comes in


Our Mission

Provide a more Intelligent Search tool for UKPMC

Showcase Text Mining Technologies

Use existing Resources, specifically:

Enju: deep syntactic parser

Biolexicon: domain lexicon

NER tools: for genes, diseases, etc.


Enju Parser

A syntactic parser for English.

With a wide-coverage probabilistic HPSG grammar

An efficient parsing algorithm

Trained on Biomedical text (PubMed abstracts)

Which provides phrase structures and predicate-argument structures.


The BioLexicon

A Lexical Database for Biomedicine

2.2 M entries (mainly biomedical terms)

658 domain-relevant verbs

Syntactic subcategorisation frames specified for all verbs (1760 frames)

Collected automatically from a dependency-parsed corpus of 6M tokens on the topic of E. coli

Includes strongly selected modifiers, reflecting the importance of location, time, manner, etc. in the description of biomedical facts

Also, Semantic frames specified for 168 verbs (856 frames)


NER (Named Entity Recognition)

Dictionary-Based NER for significant classes of entity:

Genes and Proteins

Drugs and Diseases

Metabolites

including NEMine, trained for gene/protein disambiguation

Dictionaries include UMLS, DrugBank, HMDB


Medie: a Point of Comparison

There was an existing system with a similar specification:

Defined on PubMed Abstracts

With a powerful query language

GCL (Generalised Concordance Lists) based on Region Algebra

Using a tabular format for queries


Medie


Medie: Result


Tabular Format


Formal Query


Notes on Medie

While it seems fairly intuitive

Medie stores a lot of information from the Enju parse

So there's expressive power under the hood

But the average user doesn't get to use it

Also non-linguists may be put off by explicit grammatical terminology in the interface


UKPMC Engagement

We did some focus group studies

These showed a marked preference for a simple interface (predictably?)

How do you get as close as possible to a Google-style interface

And still show off your deep linguistic analysis?


Design Constraints

Intuitive interface

Tailor the information stored to the requirements of the functionality

Make the best use of our own specialised resources

Provide a simple web service to link with keyword and metadata searches


Lean Fact Extraction

We extract a database of facts that may provide answers to queries

We rely on specialised linguistic and domain knowledge to underwrite the quality of the fact entries

Facts should be seen as units of evidence

Validity is the authors' problem

Ours is relevance


What is a Fact?

Each entry in the fact database is the conjunction of:

A named entity (NE), according to the NER

Occurring within an argument (or modifier) position, according to the Enju analysis

That is designated as domain relevant in the BioLexicon

That's a recipe!
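The recipe can be sketched in a few lines of Python. This is an illustrative sketch only, not the NaCTeM implementation: the `Argument` record, the `DOMAIN_VERBS` set and the `extract_facts` helper are hypothetical stand-ins for the Enju output, the BioLexicon verb list and the fact extractor.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical stand-in for the BioLexicon's list of domain-relevant verbs.
DOMAIN_VERBS = {"activate", "inhibit", "result"}


@dataclass(frozen=True)
class Argument:
    """One argument or modifier slot of a predicate, as a parser might report it."""
    verb: str                     # lemma of the governing verb
    role: str                     # e.g. "ARG1", "ARG2", "MOD"
    phrase: str                   # text of the phrase filling the slot
    entity: Optional[str] = None  # named entity found inside the phrase, if any


def extract_facts(arguments):
    """Keep only the slots that satisfy all three conditions of the recipe."""
    facts = []
    for arg in arguments:
        has_entity = arg.entity is not None            # 1. an NE, according to the NER
        in_slot = arg.role.startswith(("ARG", "MOD"))  # 2. in an argument or modifier position
        relevant = arg.verb in DOMAIN_VERBS            # 3. of a domain-relevant verb
        if has_entity and in_slot and relevant:
            facts.append((arg.verb, arg.role, arg.entity))
    return facts


parsed = [
    Argument("result", "ARG1", "exposure to ciprofloxacin", "ciprofloxacin"),
    Argument("result", "ARG2", "resistant mutants"),                # no NE: dropped
    Argument("suggest", "ARG1", "ciprofloxacin", "ciprofloxacin"),  # verb not listed: dropped
]
print(extract_facts(parsed))
```

Only the first slot survives, because it is the only one where all three conditions hold at once.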


Explanation

The BioLexicon extends our scope with predicted modifiers, as well as arguments

We take phrases containing NEs to generalise and improve yield

The parse assigns syntactic roles

We also handle some negation

Mainly explicit negation on the verb.
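As a rough illustration of explicit negation on the verb, the sketch below flags a verb when a negator appears just before it. This token-window heuristic is an assumption made for the example; the system described here reads negation off the parse, and the `NEGATORS` list and `is_negated` helper are hypothetical.

```python
# Hypothetical list of explicit negators, for illustration only.
NEGATORS = {"not", "n't", "never"}


def is_negated(tokens, verb_index, window=3):
    """Return True if a negator occurs within `window` tokens before the verb."""
    start = max(0, verb_index - window)
    return any(tok.lower() in NEGATORS for tok in tokens[start:verb_index])


tokens = ["Ciprofloxacin", "did", "not", "inhibit", "growth"]
print(is_negated(tokens, 3))
```

A fact extracted from a negated verb can then be stored with a negation flag rather than discarded.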


A Simplified Fact Table

Document ID | Verb   | Arg1          | Arg2 | Sentence
PMC2845863  | result | ciprofloxacin | -    | Treatment wi..
PMC2817234  | result | ciprofloxacin | PAE  | Treatment of..
PMC2738812  | result | ciprofloxacin | -    | the combin…
PMC2847397  | result | ciprofloxacin | -    | An in vivo ex..

In practice, tables are populated with identifiers in fields that may be normalised or cross-referenced. In particular, NEs are mapped to a canonical identifier in the database and to a canonical written form in generated questions. (Here, PAE represents another NE in an (oblique) object position. Otherwise, the field is just text.)
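One way to picture the normalised form is a pair of tables, sketched here with Python's built-in `sqlite3` module. The schema, the column names and the entity identifiers `E1` and `E2` are assumptions for illustration; the slides do not describe the actual database layout. The sentence values are the truncated snippets from the table above, and an Arg2 of "-" is modelled as NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE entity (
    entity_id      TEXT PRIMARY KEY,   -- canonical identifier (hypothetical values)
    canonical_form TEXT NOT NULL       -- written form used in generated questions
);
CREATE TABLE fact (
    doc_id   TEXT NOT NULL,
    verb     TEXT NOT NULL,
    arg1_id  TEXT REFERENCES entity(entity_id),
    arg2_id  TEXT REFERENCES entity(entity_id),  -- NULL when Arg2 holds no NE
    sentence TEXT NOT NULL
);
""")
conn.executemany(
    "INSERT INTO entity VALUES (?, ?)",
    [("E1", "ciprofloxacin"), ("E2", "PAE")],
)
conn.executemany(
    "INSERT INTO fact VALUES (?, ?, ?, ?, ?)",
    [
        ("PMC2845863", "result", "E1", None, "Treatment wi.."),
        ("PMC2817234", "result", "E1", "E2", "Treatment of.."),
        ("PMC2738812", "result", "E1", None, "the combin…"),
        ("PMC2847397", "result", "E1", None, "An in vivo ex.."),
    ],
)


def facts_about(conn, canonical_form):
    """Return (doc_id, verb, arg1, arg2) rows whose Arg1 is the given entity."""
    rows = conn.execute(
        """
        SELECT f.doc_id, f.verb, a1.canonical_form, a2.canonical_form
        FROM fact f
        JOIN entity a1 ON a1.entity_id = f.arg1_id
        LEFT JOIN entity a2 ON a2.entity_id = f.arg2_id
        WHERE a1.canonical_form = ?
        ORDER BY f.doc_id
        """,
        (canonical_form,),
    )
    return rows.fetchall()


for row in facts_about(conn, "ciprofloxacin"):
    print(row)
```

Storing identifiers rather than surface strings means that variant spellings of an entity resolve to the same rows, while the canonical form is looked up only when a question is displayed.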


Sentence Snippets

The database also contains the sentence where each fact was found

As well as the document ID, to coordinate with other UKPMC services, e.g. metadata

Because of copyright issues (with the HTML webpages)

We were not given access to present results in situ, with highlighting and links in the text


Some Sentence Snippets (about Ciprofloxacin)

Treatment with ciprofloxacin, ceftriaxone or pivmecillinam resulted in a cure rate of >99% while assessing clinical failure, bacteriological failure and bacteriological relapse.

Treatment of the malaria parasites with ciprofloxacin, an inhibitor of the bacterial DNA gyrase, and other antibiotics including chloramphenicol, clindamycin, tetracycline and rifampicin resulted in the arrest of growth in the second asexual cycle, while the parasites in the current cell cycle appeared relatively unaffected (Geary et al. 1988; McFadden & Roos 1999; Surolia et al. 2004; Ramya et al. 2007).

the combination of ciprofloxacin and 5-FU resulted in a synergistic prolongation of the postantibiotic effect (PAE) in comparison with the PAE induced by the drugs alone.

An in vivo exposure to ciprofloxacin resulted in predominately efflux-mediated resistant mutants, suggesting that efflux plays a central role in emergence of fluoroquinolone resistance.


We Have all the Answers

Well actually we don't!

But we have all the answers we are prepared to offer

How do we provide these to the user, in response to a relevant query?

This must be coordinated with searches based on:

A keyword in the text or (literary) metadata



The Concept:

This is a complementary search tool for UKPMC.

To search the repository from a different perspective.

We retrieve documents,

But we search on evidence, rather than publication history, or keywords.

We provide a structured answer using generated questions


More than a Keyword!

Evidence Finder extends a keyword search

A keyword search produces a potentially large set of possible answers from the fact database

Generating questions around the relations in those facts can structure the result into smaller answer sets: the Jeopardy® solution!?

And help the user refine their query:

• “This is what you could have asked”


Generating questions

Entity1 activates Entity2

Entity2 is activated by Entity1

Entity1 cooperate to activate Entity2

Entity1 play key roles by activating Entity2

All four map to the same predicate-argument structure:

activate

ARG1 Entity1

ARG2 Entity2

We deal with syntactic variability by deep semantic parsing

Turning these into questions suggests how they can be accessed in a search application
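A minimal sketch of how normalised facts might be turned into suggested questions for a search term. The question templates, the naive "+s" verb inflection and the `suggest_questions` helper are assumptions for illustration, not the generation rules used by Evidence Finder.

```python
from collections import defaultdict


def suggest_questions(facts, term):
    """Group the facts that mention `term` under one generated question per relation."""
    answers = defaultdict(set)
    for doc_id, verb, arg1, arg2 in facts:
        if arg1 == term:
            answers[f"What does {term} {verb}?"].add(doc_id)
        elif arg2 == term:
            # Naive third-person form; real inflection needs a lexicon.
            answers[f"What {verb}s {term}?"].add(doc_id)
    # Largest answer sets first; ties broken alphabetically for stable output.
    return sorted(answers.items(), key=lambda item: (-len(item[1]), item[0]))


facts = [
    ("doc1", "activate", "Entity1", "Entity2"),  # "Entity1 activates Entity2"
    ("doc2", "activate", "Entity1", "Entity2"),  # "Entity2 is activated by Entity1"
    ("doc3", "inhibit", "Entity3", "Entity1"),
]
for question, docs in suggest_questions(facts, "Entity1"):
    print(question, sorted(docs))
```

Because active and passive sentences share one predicate-argument structure, both land under the same question, which is what lets a large keyword result split into smaller answer sets.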



Evidence Finder Result

What to expect from Evidence Finder

Suggests questions for you

Clicking on a question will return sets of documents with evidence snippets

Shows where answers may be in the text

Answers should immediately show you whether you want to look at the whole document

Helps you look at similar facts in other documents


Evidence Finder: Result


Evidence Finder: Result

Document Metadata

Generated Questions

Evidence Sentences


Fact Summary


“More Like This” Query


The Implementation:

A Web Service by NaCTeM, offering three operations:

1. Suggested questions corresponding to a search term

2. Paged ‘answers’ to a question: document metadata from the EBI web service, extended with the matching analysed sentences

3. All the analysed factual sentences in a document, each with a “more like this” query attached

The Platform:

• Java supported by Eclipse, using Google Web Toolkit (GWT)

• Web Service running under Apache Tomcat
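The second and third operations can be sketched as plain functions. The real service is written in Java; this Python sketch only illustrates the shape of the responses, and every name in it (`paged_answers`, `facts_in_document`, the dictionary keys, the form of the "more like this" query) is hypothetical.

```python
def paged_answers(matches, metadata, page, page_size=10):
    """One page of documents, each with its metadata and matching sentences.

    `matches` maps a document ID to the analysed sentences that answer the question;
    `metadata` maps a document ID to metadata fetched from another service.
    """
    doc_ids = sorted(matches)  # deterministic order for the sketch
    start = (page - 1) * page_size
    return [
        {"doc_id": d, "metadata": metadata.get(d, {}), "sentences": matches[d]}
        for d in doc_ids[start:start + page_size]
    ]


def facts_in_document(facts, doc_id):
    """Every fact found in one document, each paired with a 'more like this' query."""
    return [
        {
            "sentence": f["sentence"],
            "more_like_this": {"verb": f["verb"], "arg1": f["arg1"]},
        }
        for f in facts
        if f["doc_id"] == doc_id
    ]


matches = {"PMC2": ["s2"], "PMC1": ["s1a", "s1b"], "PMC3": ["s3"]}
metadata = {"PMC1": {"title": "T1"}}
print(paged_answers(matches, metadata, page=1, page_size=2))
```

Keeping the metadata lookup separate from the fact lookup mirrors the split between document metadata and the locally indexed sentences.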


UKPMC Evidence Finder

[Figure: architecture diagram in two parts, Indexing and Searching. Labels: New doc set; XML Converter; NER for UKPMC; Consolidate NER data; Enju parser; Fact extractor; Store; EVF Fact DB; Search; Web Service; Retrieved facts; Document Data From Europe PMC; Web User Interface; Web interface; Query from user.]

Statistics and observations

2.4M articles fully parsed

67.36 million indexed facts

Representing 1.7 million documents

Relies on NEs indexed by NaCTeM

Search results ranked by date, newest first.

Other rankings possible
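A small sketch of the default ranking and of one alternative. The record fields, the sample values and the "number of facts" ranking are invented for illustration; the slides only state that results are ranked newest first and that other rankings are possible.

```python
from datetime import date


def rank_results(results, key=None, reverse=True):
    """Sort results; by default newest publication date first."""
    if key is None:
        key = lambda r: r["published"]
    return sorted(results, key=key, reverse=reverse)


# Made-up sample records, not real documents.
results = [
    {"doc": "A", "published": date(2009, 5, 1), "facts": 7},
    {"doc": "B", "published": date(2012, 11, 20), "facts": 2},
    {"doc": "C", "published": date(2011, 3, 14), "facts": 4},
]

by_date = [r["doc"] for r in rank_results(results)]
by_facts = [r["doc"] for r in rank_results(results, key=lambda r: r["facts"])]
print(by_date, by_facts)
```

Passing the sort key as a parameter keeps the default behaviour while leaving room for other orderings.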


What is Evidence Finder for?

An Evidence-based search:

Starts from the bottom

Locates specific statements

It may find unexpected or overlooked facts

It may find trivial and boring facts

It's not an antidote to literature or Google search

It may not be able to handle complex queries (yet).


Extensions?

Structure within phrases

Select NEs with the “Head” line

More negation operators

• “lack of”, “fail to”, “avoid”

More normalisation

e.g. Acronym resolution

Relation sets from other domains

– Refine the medical verb dictionary


Thanks for your patience and stamina

Services to try:

http://labs.europepmc.org/evf

http://www.nactem.ac.uk/MEDIE/




http://test.labs.europepmc.org/evf


