
April 21, 2005 EPSRC E-Science Meeting, NeSC

Real-time Text Mining for the Biomedical Literature

a collaboration between Discovery Net & myGrid

Rob Gaizauskas, Department of Computer Science, University of Sheffield

Moustafa M. Ghanem, Department of Computing, Imperial College London


Outline

• Context
  – Workflows, Services and Text Mining
  – Discovery Net & myGrid
• Aims and Objectives of New Project
• Architecture of New System
  – Integration of Existing Components
• Approach to Text Mining
  – Data Resources & Evaluation
  – Techniques for GO Tagging
• Interface and Results Presentation
• Lessons Learnt So Far, Conclusions and Broader Applicability of Work


Workflows, Web Services and Text Mining for Bioinformatics

Workflows
– useful computational models for processes that require repeated execution of a series of complex analytical tasks
– e.g. a biologist researching the genetic basis of a disease repeatedly
  • maps a reactive spot in microarray data to a gene sequence
  • uses a sequence alignment tool to find proteins/DNA of similar structure
  • mines info about these homologues from remote DBs
  • annotates the unknown gene sequence with this discovered info



Web services
– Processing resources that
  • are available via the Internet
  • use standardised messaging formats, such as XML
  • enable communication between applications without being tied to a particular operating system/programming language
– Useful for bioinformatics, where the data used in research is
  • heterogeneous in nature: DB records, numerical results, NL texts
  • distributed across the internet in research institutions around the world
  • available on a variety of platforms and via non-uniform interfaces



Text mining
– any process of revealing information (regularities, patterns or trends) in textual data
– includes more established research areas such as information extraction (IE), information retrieval (IR), natural language processing (NLP), knowledge discovery from databases (KDD) and traditional data mining (DM)
– relevant to bioinformatics because of
  • explosive growth of the biomedical literature
  • availability of some information in textual form only, e.g. clinical records



[Diagram relating Workflows, Web services and Text mining to Bioinformatics]


Discovery Net & myGrid

• Discovery Net: An e-Science testbed for High Throughput Informatics
  – £2.2M EPSRC Pilot Project
  – Started Oct 01, ended March 05
  – Service-based infrastructure/workflow model for Life Sciences, Environmental Modelling and Geo-hazard Modelling
  – Infrastructure for mixed data mining / text mining
  – Machine learning methods for text mining
• myGrid: Directly Supporting the e-Scientist
  – £3.5M EPSRC Pilot Project
  – Started Oct 01, ends June 05
  – Service-based infrastructure/workflow model for Life Sciences
  – Infrastructure for Text Collection Server, Text Services Workflow Server and Interface/Browsing Client
  – Service-based Terminology Servers


myGrid

• Overall aim: develop an e-biologist’s workbench, a platform allowing biologists to execute, analyze and repeat multi-stage in silico experiments involving distributed data, code and processing resources
  – Workflow model for composing/executing processing components
  – Web services for distribution
• Problem: how to integrate text mining into a biological workflow?
  – Most text mining runs off-line and supports interactive browsing of results
  – Most workflows run end to end with no user intervention
  – What are the inputs to text mining to be?
• Solution: tap off the result of a workflow step and treat it as an implicit query


A myGrid example studying the Genetic Basis of Disease

Graves’ Disease
– an autoimmune condition affecting tissues in the thyroid and orbit
– being investigated using micro-array methods
  • micro-array shows which genes are differentially expressed in normal patients vs patients with the disease = candidate genes
  • sequence alignment search (e.g. BLAST) finds genes/proteins with similar structure
  • function of these “homologues” may suggest function of candidate gene
– key step for text mining follows the BLAST search
  • for homologous proteins, the BLAST report contains references to proteins in the SWISSPROT protein database
  • Swissprot records contain ids of abstracts describing the protein in the Medline abstract database
  • abstracts can be mined directly or used as “seed” documents to assemble a set of related abstracts
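
The step from a Swissprot record to Medline abstract ids could be sketched as below. This is only an illustration, assuming the record is in the Swiss-Prot flat-file format, where literature cross-references appear on RX lines. The function name extract_pubmed_ids and the sample record (including its ids) are made up for this sketch and are not part of myGrid.

```python
import re

# Matches "PubMed=12345;" or "MEDLINE=12345;" entries on an RX line.
RX_ID = re.compile(r"(PubMed|MEDLINE)=(\d+);")


def extract_pubmed_ids(record: str) -> list[str]:
    """Return the PubMed ids found on RX lines, in order, without duplicates."""
    ids = []
    for line in record.splitlines():
        if not line.startswith("RX"):
            continue
        for source, ident in RX_ID.findall(line):
            if source == "PubMed" and ident not in ids:
                ids.append(ident)
    return ids


# Hypothetical record fragment; the ids are placeholders, not real citations.
SAMPLE_RECORD = """\
ID   EXAMPLE_HUMAN   STANDARD;   PRT;   100 AA.
RN   [1]
RX   MEDLINE=00000001; PubMed=1111111;
RN   [2]
RX   PubMed=2222222;
"""

print(extract_pubmed_ids(SAMPLE_RECORD))
```

The returned ids would then be the implicit query handed to the abstract retrieval step.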


myGrid Text Services Architecture

[Architecture diagram]
Components: User Client; Workflow Server (Workflow Enactment); Medline Server
Medline: pre-processed offline to extract biomedical terms + indexed
Input: Swissprot/Blast record
Initial Workflow steps: Extract PubMed Id; Get Medline Abstract; Cluster Abstracts; Get Related Abstracts
Data exchanged: Workflow definition + parameters; PubMed Ids; Medline Abstracts; Term-annotated Medline abstracts; Clustered PubMed Ids + titles


myGrid Text Services Architecture

A 3-way division of labour is a sensible way to deliver distributed text mining services
– Providers of e-archives, such as Medline, will make archives available via a web-services interface
  • Cannot offer tailored services for every application
  • Will provide core, common services
– Specialist workflow designers will add value to the basic services from the archive to meet their organization’s needs
– Users will prefer to execute predefined workflows via standard light clients such as a browser

The architecture is appropriate for many research areas, not just bioinformatics


myGrid Interface/Browsing Client

[Screenshot with labelled areas: MeSH Tree; Abstract Titles; Abstract body; Free text search; Search scope restrictors; Linked terms; Get Related Abstracts]


Discovery Net: Adding text mining to e-Science workflows

[Workflow screenshot with labelled stages: Find Relevant Genes from Online Databases; Find Associations between Frequent Terms; Gene Expression Analysis]

The DNet Workflow server executes a DPML workflow and uses Discovery Net’s InfoGrid data access and integration wrappers and web services


Text Mining in e-Science workflows

Problem: how to develop new distributed text mining applications using a workflow?
– Most text mining applications require the integration of a mixture of components (services) for text processing tasks (e.g. parsing and cleaning), natural language processing (e.g. named entity recognition), and statistics and data mining (e.g. classification, clustering, etc.).
– There are many design alternatives, and end users may want to prototype and compare alternative implementations.
– Once an application is developed, most workflows run end to end with no user intervention.

Solution: extend the service infrastructure to allow composition of text mining services.


Building text mining applications from workflows

Text Mining Pipelines: using workflow technologies to build text mining applications and services using finer grain components/services

[Pipeline diagram: Text documents; Retrieval/Storage; Text docs; Text Processing; Text docs; Feature Extraction; Numerical Feature Vectors; Data Mining]

• Retrieval/Storage: retrieve and organize relevant documents
  – Indexing, Access Drivers, Storage
• Text Processing: pre-process documents to enhance the ease of feature extraction
  – Stemming, stop-word filters, pattern filters, lexicon matching, ontologies, NLP parsing, etc.
• Feature Extraction: features are summarized into vector forms which are suitable for data mining
  – Statistical: word counts, pattern extraction & counts, etc.
  – Domain-specific: gene name counts, etc.
  – NLP-specific: phrase counts, etc.
• Data Mining: results can be document characterization or hidden relationship extraction
  – Classification, clustering, association, statistical analysis, visual analysis, etc.
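
The text processing and feature extraction stages of such a pipeline can be sketched in a few lines. This is a toy illustration rather than Discovery Net code: the stop-word list, the suffix-stripping crude_stem function (a stand-in for a real stemmer) and the two sample sentences are all assumptions made for the example.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "of", "a", "in", "is", "and", "to"}  # tiny illustrative list


def crude_stem(token: str) -> str:
    """Strip a few common suffixes; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def process_text(doc: str) -> list[str]:
    """Text processing stage: tokenize, drop stop words, stem."""
    tokens = re.findall(r"[a-z]+", doc.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


def extract_features(docs: list[list[str]]) -> tuple[list[str], list[list[int]]]:
    """Feature extraction stage: word counts over a shared vocabulary."""
    vocabulary = sorted({t for doc in docs for t in doc})
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append([counts[term] for term in vocabulary])
    return vocabulary, vectors


documents = [
    "The gene is expressed in the thyroid",
    "Expression of the gene in mutants",
]
processed = [process_text(d) for d in documents]
vocabulary, vectors = extract_features(processed)
print(vocabulary)
print(vectors)
```

Each document ends up as one numerical feature vector over the shared vocabulary, which is the form the data mining stage consumes.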


Simplified Document Classification Workflow

Examples of Extracted Patterns
  GENE_NAME protein
  GENE_NAME express
  express GENE_NAME
  GENE_NAME mutant
  GENE_NAME activity
  activity GENE_NAME
  GENE_NAME drosophila

Examples of Pattern Definitions
  delet\s([a-z]*(\s)+)*genenam+\s
  depend\s([a-z]*(\s)+)*genenam+\s
  describ\s([a-z]*(\s)+)*genenam+\s
  detect\s([a-z]*(\s)+)*genenam+\s
  determin\s([a-z]*(\s)+)*genenam+\s
  differ\s([a-z]*(\s)+)*genenam+\s
  disc\s([a-z]*(\s)+)*genenam+\s
  dna\s([a-z]*(\s)+)*genenam+\s

Predictive accuracy of relevance prediction, using Support Vector Machine classification
  Overall accuracy: 84.5%
  Precision: 78.11%
  Recall: 73.40%
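
A pattern definition of this kind can be tried out directly as a regular expression. The sketch below assumes that the text has already been lower-cased and stemmed, and that gene names have been replaced by the placeholder token genenam; both assumptions are inferred from the shape of the patterns, and the sample sentences are invented.

```python
import re

# One pattern definition from the list above: the stem "detect", then any
# number of lower-case words, then the gene-name placeholder, then whitespace.
PATTERN = re.compile(r"detect\s([a-z]*(\s)+)*genenam+\s")


def has_pattern(stemmed_text: str) -> bool:
    """True if the pattern occurs anywhere in the stemmed, normalised text."""
    return PATTERN.search(stemmed_text) is not None


# Hypothetical stemmed sentences with gene names replaced by "genenam".
print(has_pattern("we detect the mutant genenam in yeast "))
print(has_pattern("genenam was not detect in yeast "))
```

A 0/1 value or a count per pattern could then serve as one feature in the document vector passed to the classifier.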


Text Meta Data Model

Build classifier training phase using a workflow co-ordinating distributed services

Build prediction phase using a workflow co-ordinating distributed services

Metadata Model: service interfaces only tell you how to invoke a remote service; it is up to you to decide what information flows between services!

Text                Start  End  Annot. Type     Attributes
Insulin             1      7    token           pos:noun, stem:insulin
resistance          9      18   token           pos:noun, stem:resist
Insulin resistance  1      18   compound token  disease:insulin resistance
plays               20     24   token           pos:verb, stem:plai
major               26     30   token           pos:adj, stem:major
role                32     35   token           pos:noun, stem:role
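
One way to picture this metadata model is as stand-off annotations over the document text. The sketch below is an assumption about representation, not the project's actual data structure: the Annotation class is hypothetical, offsets are read as 1-based and inclusive, and the source sentence is reconstructed from the offsets in the table.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    start: int        # 1-based offset of the first character
    end: int          # 1-based offset of the last character (inclusive)
    annot_type: str
    attributes: dict[str, str] = field(default_factory=dict)

    def covered_text(self, document: str) -> str:
        """Return the span of the document that this annotation covers."""
        return document[self.start - 1 : self.end]


# Assumed source sentence, reconstructed from the offsets in the table.
DOCUMENT = "Insulin resistance plays major role"

annotations = [
    Annotation(1, 7, "token", {"pos": "noun", "stem": "insulin"}),
    Annotation(9, 18, "token", {"pos": "noun", "stem": "resist"}),
    Annotation(1, 18, "compound token", {"disease": "insulin resistance"}),
    Annotation(20, 24, "token", {"pos": "verb", "stem": "plai"}),
    Annotation(26, 30, "token", {"pos": "adj", "stem": "major"}),
    Annotation(32, 35, "token", {"pos": "noun", "stem": "role"}),
]

for a in annotations:
    print(a.covered_text(DOCUMENT), "|", a.annot_type, "|", a.attributes)
```

Because annotations only point into the text, services can add layers (tokens, compound tokens, disease mentions) without rewriting the document that flows between them.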


Aims & Objectives of New Project

• Aim: to develop a unified real-time e-Science text-mining infrastructure that leverages the technologies and methods developed by both Discovery Net and myGrid
  – Software engineering challenge: integrate complementary service-based text mining capabilities with different metadata models into a single framework
  – Application challenge: annotate biomedical abstracts with semantic categories from the Gene Ontology
• Deliverables:
  – D1: A GO Annotation Service
  – D2: A Generic Shared Infrastructure for Grid-enabled Biomedical Document Categorization
  – D3: Infrastructure for Semantic Document Annotation
  – D4: A Detailed Case Study (analysing/evaluating the GO annotator)
  – D5: Developing a common framework for representing + exchanging information about:
    1. Data: biomedical documents/doc collections + metadata, biomedical dictionaries
    2. Intermediate data: Document indexes and Document feature vectors
    3. Text Analysis Results


GO TAG: A Novel Application

• The GO TAG Application: Automatic Assignment of GO (Gene Ontology) Codes to Medline Documents


A Machine Learning Approach

Overview of Training Phase


Run-time System

Overview of Run-time System


GO Annotator – Version 1

• Version 1a:
  – Direct search for GO Annotation descriptions and synonyms in document text
  – If a description is found, the document is labelled with this GO Annotation
  – The description is also marked up in the document
• Version 1b:
  – 1a + search for gene names extracted from the yeast genome DB
  – If a gene name is found, the document is labelled with the GO annotation(s) associated with the gene in the DB
  – The gene name is also marked up in the document
• Termino web-service, hosted at Sheffield, provides the lookup capability
• This is wrapped in a DiscoveryNet workflow to include PubMed query, results visualization and performance calculations
• Workflow is deployed as a web application for end users which includes an applet to interactively browse results
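
The Version 1a idea, direct lookup of GO descriptions and synonyms in the text, can be sketched as below. This is not the Termino service or its API. The three-entry lexicon, its synonym, and the function names annotate and label_document are assumptions chosen for illustration; a real lexicon would be built from the full Gene Ontology.

```python
import re

# Tiny illustrative lexicon: GO id -> description and synonyms.
GO_LEXICON = {
    "GO:0005634": ["nucleus"],
    "GO:0003677": ["DNA binding"],
    "GO:0006915": ["apoptotic process", "apoptosis"],
}


def annotate(text: str) -> list[tuple[str, int, int, str]]:
    """Return (GO id, start, end, matched text) for each lexicon hit."""
    hits = []
    for go_id, names in GO_LEXICON.items():
        for name in names:
            pattern = re.compile(r"\b" + re.escape(name) + r"\b", re.IGNORECASE)
            for m in pattern.finditer(text):
                hits.append((go_id, m.start(), m.end(), m.group(0)))
    return sorted(hits, key=lambda h: h[1])


def label_document(text: str) -> set[str]:
    """Label the document with every GO id whose description was found."""
    return {go_id for go_id, _, _, _ in annotate(text)}


abstract = "Apoptosis was accompanied by changes in the nucleus."
print(annotate(abstract))
print(label_document(abstract))
```

The character spans returned by annotate are what would be used to mark up the matched description in the document, while label_document gives the document-level labels.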


GO Annotator – Version 1: Underlying Discovery Net Workflow

• Enter query and retrieve abstracts from PubMed.
• Use Termino to mark up abstracts with GO Annotations when a match for a GO Annotation description is found.
• Tabulate GO Annotations by PMID.
• Join PMIDs and matching GO Annotations with abstracts and titles.


Workflow Deployment


GO Annotator – Version 2

• Use the Saccharomyces (Yeast) Genome Database as a source of papers expertly curated with GO Annotations
• Train a classifier using these papers
• Hierarchical classification
• Training data sufficient to classify over 2000 GO Annotations
• Classifier is then applied to assign GO Annotations to unseen papers
• Main Issues:
  – Choice of features to be extracted from the training documents
  – Choice of feature reduction methods to produce accurate classification
  – Choice of classification algorithm to be used


GO Annotator – Version 2: Underlying DiscoveryNet Workflow

• Papers expertly curated with GO Annotations from SGD database.
• Generate vector of features (frequent phrases) for each paper. This is used to train the classifier.
• Generate a Naïve Bayesian classification model.
• Generate vector of features (frequent phrases) for each paper in the test data set. This is used to test the classifier.
• Apply classification model to test data to evaluate classification accuracy.
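
The train-then-apply steps above can be illustrated with a small multinomial Naïve Bayes classifier over phrase features. This is a generic textbook sketch with Laplace smoothing, not the Discovery Net component. The training phrases and the labels label_A and label_B are invented placeholders for GO annotations, and the sketch assigns a single label per paper, whereas real GO tagging is multi-label.

```python
import math
from collections import Counter, defaultdict


def train(docs: list[tuple[list[str], str]]):
    """Fit class counts and per-class feature counts from (features, label) pairs."""
    class_counts = Counter()
    feature_counts = defaultdict(Counter)
    vocabulary = set()
    for features, label in docs:
        class_counts[label] += 1
        feature_counts[label].update(features)
        vocabulary.update(features)
    return class_counts, feature_counts, vocabulary


def predict(model, features: list[str]) -> str:
    """Return the label with the highest log posterior (Laplace smoothing)."""
    class_counts, feature_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        score = math.log(class_counts[label] / total_docs)
        denom = sum(feature_counts[label].values()) + len(vocabulary)
        for f in features:
            if f in vocabulary:  # ignore phrases never seen in training
                score += math.log((feature_counts[label][f] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training data: frequent phrases per paper, plus a placeholder label.
training = [
    (["dna repair", "double strand break"], "label_A"),
    (["dna repair", "nucleotide excision"], "label_A"),
    (["protein transport", "golgi apparatus"], "label_B"),
    (["protein transport", "vesicle fusion"], "label_B"),
]
model = train(training)
print(predict(model, ["dna repair"]))
print(predict(model, ["golgi apparatus", "protein transport"]))
```

Evaluating on held-out papers, as in the last workflow step, amounts to calling predict on each test vector and comparing the result with the curated annotation.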


Interface + Results Presentation

[Screenshot with labelled areas: GO Hierarchy; Abstract Titles; Abstract Bodies; GO Labels/Gene Names]


Achievements to date

• Infrastructure Interoperability
  – More than just remote web service invocation: interoperable metadata models
• Mark 1 System Implemented
  – Annotation based on terminology lookups
  – 15% Recall & 5% Precision (exact matches for 18,000 GO terms)
    • Measures inadequate due to incompleteness of gold standard
• In process of finalising Training Data Sets and Evaluation Metrics
  – 4,922 papers referencing 2,455 GO Terms
• Mark 2 Systems in Progress
  – Naïve Bayesian Approach
  – 41% Recall and 27% Precision
• User Interfaces
• Mark 3, 4, … Systems and Evaluation
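
For reference, precision and recall for document annotation can be computed as in the sketch below. It assumes micro-averaging over (document, GO term) pairs, which the slides do not specify, and the document ids and terms t1 to t5 are invented.

```python
def precision_recall(
    predicted: dict[str, set[str]], gold: dict[str, set[str]]
) -> tuple[float, float]:
    """Micro-averaged precision and recall over (document, GO term) pairs."""
    true_pos = sum(
        len(predicted.get(doc, set()) & terms) for doc, terms in gold.items()
    )
    n_predicted = sum(len(terms) for terms in predicted.values())
    n_gold = sum(len(terms) for terms in gold.values())
    precision = true_pos / n_predicted if n_predicted else 0.0
    recall = true_pos / n_gold if n_gold else 0.0
    return precision, recall


# Hypothetical gold standard and system output.
gold = {"doc1": {"t1", "t2"}, "doc2": {"t3"}}
predicted = {"doc1": {"t1", "t4"}, "doc2": {"t3", "t5"}}

precision, recall = precision_recall(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Under this definition, a prediction that is correct but absent from an incomplete gold standard is counted as a false positive, which is one way the incompleteness noted above can depress the reported precision.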


Implementation Options

• Feature Vector Options
  – Bag of words
  – Frequent Phrases
  – Key Phrases (Gene Names, Protein Names, MeSH terms, etc.)
• Classifier Options
  – Bayesian Classifiers
  – Support Vector Machines
  – Drag Push (a novel centroid based method)


Lessons Learnt and Challenges to Face

• Infrastructure
  – Interoperability Issues
  – Performance Issues:
    • Communication vs Persistence of remote server
    • Off-line vs on-line feature extraction
• Text Mining
  – Usability Issues
  – Evaluation Issues
