national centre for text mining: activities in biotext mining john mcnaught deputy director

58
National Centre for Text Mining: Activities in biotext mining John McNaught Deputy Director www.nactem.ac.uk

Upload: jeremy-page

Post on 11-Jan-2016

217 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: National Centre for Text Mining: Activities in biotext mining John McNaught Deputy Director

National Centre for Text Mining:Activities in biotext mining

John McNaughtDeputy Director

www.nactem.ac.uk

Page 2: National Centre for Text Mining: Activities in biotext mining John McNaught Deputy Director

Outline

Overview of NaCTeM, role in e-Infrastructure

Quick tour of NaCTeM services/tools Some challenges for text mining in biology

Page 3: National Centre for Text Mining: Activities in biotext mining John McNaught Deputy Director

UK National Centre for Text Mining (NaCTeM)

1st national text mining centre in the world www.nactem.ac.uk

Location: Manchester Interdisciplinary Biocentre (MIB)

Remit: Provision of text mining services to support UK research

Funded by: the JISC, BBSRC, EPSRC Initial focus: Biology

Page 4: National Centre for Text Mining: Activities in biotext mining John McNaught Deputy Director

Why is there a need in the UK for a national centre for text mining?

Some researchers knew they wanted TM TM key component of e-Science UK policy to involve more researchers

(from all domains) in doing e-science and e-research• TM seen as key technology for researchers

• And one applicable in every domain (broad interest/support from major funding bodies)

Page 5: National Centre for Text Mining: Activities in biotext mining John McNaught Deputy Director

Embedding Text Mining within e-Science in the UK

e-Science […] enables new research and increases productivity through shared e‑Infrastructure, the development of computational and logical models and new ways to discover and use the growing range of distributed and interoperable resources. It supports multidisciplinary and collaborative working and a culture that adopts the emerging methods.

M. Atkinson (2007) Beyond e-Science

Page 6: National Centre for Text Mining: Activities in biotext mining John McNaught Deputy Director

e-Research Infrastructure

Access to information, data resources, distributed computing essential for bio-scientists

e-Infrastructure provides services and facilities enabling advanced research

Deploying e-Infrastructure increases the pace and efficiency of new research methods and techniques

Page 7: National Centre for Text Mining: Activities in biotext mining John McNaught Deputy Director

E-Infrastructure and Text Mining: for whom?

End-users

Software engineers

Generic tool developers

Service and resource providers

Workflow

developers

Text miners

BLAST, Swiss-Prot,

Page 8: National Centre for Text Mining: Activities in biotext mining John McNaught Deputy Director

What users want to do with their data (minimally)

Easier access to data

Share data with their peers

Annotate data with metadata

Manage data across locations

Integrate data within workflows, Web Services

Aids for semantic metadata creation; enriching data with related metadata e.g. experimental results

TEXT MINING CAN SUPPORT USERS

Page 9: National Centre for Text Mining: Activities in biotext mining John McNaught Deputy Director

METABOLIC MODEL IN SBML

CREATE MODEL

VISUALISE

STORE MODEL IN

DB

RUN BASE MODEL

SENSITIVITY ANALYSES

SCAN PARAMETER

SPACE

COMPARE WITH ‘REAL’

DATA

DIFFERENT METABOLIC

MODEL IN SBML

TEXT MININGANNOTATE

STORE NEW MODEL IN DB

COMPARE MODELS

STORE DIFFERENCES

AS NEW MODEL IN DBSYSTEMS BIOLOGY WORKFLOWS – D.B.KELL, MCISB

Page 10: National Centre for Text Mining: Activities in biotext mining John McNaught Deputy Director

Science is data-driven

“the current scientific literature, were it to be presented in semantically accessible form, contains huge amounts of undiscovered science”

Peter Murray-Rust, Data-driven science

Page 11: National Centre for Text Mining: Activities in biotext mining John McNaught Deputy Director

What do we provide and build?

Resources: lexica, terminologies, thesauri, grammars, annotated corpora

Tools: tokenisers, taggers, chunkers, parsers, NE recognisers, semantic analysers

Services• Proof of concept: evolving to large scale Grid-enabled

services

• Service provision through

Customised solutions

Page 12: National Centre for Text Mining: Activities in biotext mining John McNaught Deputy Director

A complex problem

TM involves• Many components (converters, analysers,

miners, visualisers, ...)

• Many resources (grammars, ontologies, lexicons, terminologies, thesauri, CVs)

• Many combinations of components and resources for different applications

• Many different user requirements and scenarios, training needs

Need to be active in all areas to effectively support researchers

Page 13: National Centre for Text Mining: Activities in biotext mining John McNaught Deputy Director

Development strategy Re-use tools where possible

• Forge strategic relationships (UTokyo, IBM) Use integration framework (UIMA) Develop generic TM tools Customise for specific

domains/scenarios (MCISB, pharmas)• While actively engaging with user

communities (requirements, evaluation) Encapsulate in services

Page 14: National Centre for Text Mining: Activities in biotext mining John McNaught Deputy Director

How do we provide services?

Modes of use Demonstrators: for small-scale online use Batch mode: upload data, get email with link

to download site when job done Web Services Embedding text mining Web Services into

Workflows

Some services are compositions of tools Individual tools to process user data

Page 15: National Centre for Text Mining: Activities in biotext mining John McNaught Deputy Director

Services based on pre-processed collections Pre-process (analyse and index) popular

collections• MEDLINE• Evolving to handle full text (UKPMC)

Provide advanced search interfaces to these• Based on user scenarios

Rapid results for end-user Regular up-dating of analyses carried out

Page 16: National Centre for Text Mining: Activities in biotext mining John McNaught Deputy Director

NaCTeM services and tools

TerMine AcroMine MEDIE InfoPubMed KLEIO FACTA

Page 17: National Centre for Text Mining: Activities in biotext mining John McNaught Deputy Director
Page 18: National Centre for Text Mining: Activities in biotext mining John McNaught Deputy Director
Page 19: National Centre for Text Mining: Activities in biotext mining John McNaught Deputy Director

TerMine

C-value is a hybrid technique; extracts multi-word terms; language independent

Combines linguistic filters and statistics• total frequency of occurrence of string in corpus

• frequency of string as part of longer candidate terms (nested terms)

• number of these longer candidate terms

• length of string (in number of words)

Frantzi, K., Ananiadou, S. & Mima, H. (2000) International Journal of Digital Libraries 3(2)

Page 20: National Centre for Text Mining: Activities in biotext mining John McNaught Deputy Director
Page 21: National Centre for Text Mining: Activities in biotext mining John McNaught Deputy Director

The importance ofacronym recognition

Acronyms are among the most productive type of term variation

Acronyms are used more frequently than full terms• 5,477 documents could be retrieved by using the

acronym JNK while only 3,773 documents could be retrieved by using its full term, c-jun N-terminal kinase [Wren et al. 05]

No rules or exact patterns for the creation of acronyms from their full form

Page 22: National Centre for Text Mining: Activities in biotext mining John McNaught Deputy Director

Top 20 acronyms in MEDLINE

Page 24: National Centre for Text Mining: Activities in biotext mining John McNaught Deputy Director

MEDIE

An interactive intelligent IR system retrieving events

Performs a semantic search System components

• GENIA tagger

• Enju (HPSG parser)

• Dictionary-based named entity recognition

Page 26: National Centre for Text Mining: Activities in biotext mining John McNaught Deputy Director

Semantic structure

CRP measurement

does

NP13

VP17

VP15

So

NP1

DT2

A

VP16

AV19

not

VP21

VP22

exclude

NP25

NP24

AJ26

deep

NP28

NP29

vein

NP31

thrombosis

serum

NP4

AJ5

normal

NP7

NP8 NP10

NP11

ARG1

ARG1

MOD

MOD

ARG1

ARG2

ARG1

ARG1

ARG2

ARG1

MOD

Predicate argument relations

Page 27: National Centre for Text Mining: Activities in biotext mining John McNaught Deputy Director

Abstraction of surface expressions

Page 28: National Centre for Text Mining: Activities in biotext mining John McNaught Deputy Director

Info-PubMed

An interactive IE system and an efficient PubMed search tool, helping users to find information about biomedical entities such as genes, proteins,and the interactions between them.

System components• MEDIE

• Extraction of protein-protein interactions

• Multi-window interface on a browser

Page 29: National Centre for Text Mining: Activities in biotext mining John McNaught Deputy Director

Info-PubMed

Page 30: National Centre for Text Mining: Activities in biotext mining John McNaught Deputy Director

KLEIO

• Semantically enriched information retrieval system for biology

• Offers textual and metadata searches across MEDLINE

• Provides enhanced searching functionality by leveraging terminology management technologies

http://www.nactem.ac.uk/software/kleio/

Page 31: National Centre for Text Mining: Activities in biotext mining John McNaught Deputy Director

KLEIO architecture

MEDLINE

Named Entity Recognition

Acronym Disambiguation

Integrator

MEDLINE Indexed by

Lucene

User

AcronymDictionary

AcronymExtraction

AcroMine

BioThesaurus

TermNormaliser

Dictionary lookup based on string

similarity

Page 32: National Centre for Text Mining: Activities in biotext mining John McNaught Deputy Director

32 different ABC acronyms in MEDLINE...

Page 33: National Centre for Text Mining: Activities in biotext mining John McNaught Deputy Director

Exploit named entity recognition to focus search

Page 34: National Centre for Text Mining: Activities in biotext mining John McNaught Deputy Director

See detail, link to BioDBs and other resources

Page 35: National Centre for Text Mining: Activities in biotext mining John McNaught Deputy Director

Refine query to human

Page 36: National Centre for Text Mining: Activities in biotext mining John McNaught Deputy Director

FACTA: finding associated concepts

Page 37: National Centre for Text Mining: Activities in biotext mining John McNaught Deputy Director

Nicotine and AD

Page 38: National Centre for Text Mining: Activities in biotext mining John McNaught Deputy Director

Challenges for TM in Biology

Terminology, terminology, terminology Named entities: same name, different

species Linking forms with concepts Event extraction

• Corpus annotation (who does it with what accuracy and can they agree)

(scientific argumentation, opinion mining)

Page 39: National Centre for Text Mining: Activities in biotext mining John McNaught Deputy Director

Tackling terminology: BOOTStrep

EC collaborative project: Resource building for TM• Reusable lexical, terminological, conceptual resources for

biological domain (gene regulation)

• Tool-based methodology to create and maintain resources as fully automatically as possible

• Process BioDBs for terms then find variants in text

Relations (including facts) among entities extracted by side-effect of text analysis• Raw fact store built up as process texts for resource

building

Page 40: National Centre for Text Mining: Activities in biotext mining John McNaught Deputy Director

BOOTStrep BioLexiconType Entries Variants

cell 842 1,400

chemicals 19,637 106,302

enzymes 4,016 11,674

diseases 19,457 33,161

genes/proteins 1,640,608 3,048,920

GO concepts 25,219 81,642

molecular role concepts 8,850 60,408

operons 2,672 3,145

protein complexes 2,104 2,647

protein domains 16,840 33,880

SO concepts 1,431 2,326

species 482,992 669,481

Transcr. factors 160 795

Page 41: National Centre for Text Mining: Activities in biotext mining John McNaught Deputy Director

BOOTStrep BioLexicon Also contains

• Terminological verbs (759 base, 4,556 inflected forms)

• Terminological adjectives (1,258)

• Terminological adverbs (130)

• Nominalized verbs (1,771) Verbs have syntactic and semantic

frames (entities they occur with)• allows finer-grained analysis

All items have associated linguistic info

Page 42: National Centre for Text Mining: Activities in biotext mining John McNaught Deputy Director

Challenge: Complex analysis currently requires highly customised solutions

Page 43: National Centre for Text Mining: Activities in biotext mining John McNaught Deputy Director

Challenge: Dealing with full text

Need to be able to handle very large amounts of text

Other issues besides linguistic/NLP ones (already hard)• Efficiency, scalability, distributed processing

Porting TM tools to UK and European Grid environment

Page 44: National Centre for Text Mining: Activities in biotext mining John McNaught Deputy Director

Need for processing full texts

Will allow researchers to discover hidden relationships from text that were not known before

• an abstract’s length is on average 3% of the entire article

• an abstract includes only 20% of the useful information that can be learned from text

Page 45: National Centre for Text Mining: Activities in biotext mining John McNaught Deputy Director

Parallelising TM

TM applications are data independent

• Scale linearly in an ideal world HPC implementation Scaled linearly to 100 processors Porting to DEISA to scale over 1000s

processors to process TBs of data in reasonable time

Page 46: National Centre for Text Mining: Activities in biotext mining John McNaught Deputy Director

TM of full texts forUK PubMed Central (UKPMC) Free archive of life sciences journals British Library, European Bioinformatics

Institute & UManchester (NaCTeM, Mimas)

Phase 3 tasks: integration of UKPMC in biomedical DB infrastructure with TM solutions for improved search and knowledge discovery

Page 47: National Centre for Text Mining: Activities in biotext mining John McNaught Deputy Director

NaCTeM in UKPMC

TM “behind the scenes” on full texts Named entity recognition

• Link entities in texts to bioDB entries Fact extraction

• E.g., protein-protein interactions Add extracted info as semantic metadata

• Index for efficient access Semantic search capability

• Based on user needs, evaluation workshops

Page 48: National Centre for Text Mining: Activities in biotext mining John McNaught Deputy Director

Challenge: Tool interoperability A plethora of NLP tools out there...

• Part-of-speech tagging: • TnT tagger, Tree tagger, SVM Tool, GENIA…

• Chunking: • YamCha, GENIA, …

• Named entity recognition: • LingPipe, ABNER, Stanford NE Recognizer, GENIA, …

• Syntactic parsing: • Enju, OpenNLP, Charniak parser, Collins parser…

• Semantic role labeling• Shalmaneser

• Co-reference resolution...

Page 49: National Centre for Text Mining: Activities in biotext mining John McNaught Deputy Director

So where is the problem?

Increasing contents of annotation

Increasing layers of annotation

More applications of annotation

Various annotation schemes

tokens phrases entities Predicate-arguments

morphology syntax semantics RDF

Linguistics NLP Text Mining Semantic web

TEI PTB XCES SciXML?

Page 50: National Centre for Text Mining: Activities in biotext mining John McNaught Deputy Director

Architecture for interoperability

A good architecture should Accommodate various annotation schemas, formats

of NLP tools. Provide a common exchange annotation schema. Facilitate easy communication between NLP tools. Extend to incorporate new tools

Exchangeschema

Termextractor

POStagger

Chunker

Parser

+ tools

UIMA PLATFORM

http://incubator.apache.org/uima/

Page 51: National Centre for Text Mining: Activities in biotext mining John McNaught Deputy Director

Uses of our tools and services

Searching Metadata creation Controlled vocabularies Ontology building Data integration Linking repositories Database curation

Reviewing Gene – disease mining Enriching pathway

models Indexing Document classification

… …

Page 52: National Centre for Text Mining: Activities in biotext mining John McNaught Deputy Director

NaCTeM phase II (2008-2011)

TM supporting service provision

Web Services Embedding TM

within workflows Adaptive learning Integration of data /

text mining

Issues Full paper

processing open access

collections IPR in data derived

via text mining Interoperability Education and

training

Page 53: National Centre for Text Mining: Activities in biotext mining John McNaught Deputy Director

NaCTeM phase II (2008-2011)

Service exemplars

• Intelligent semantic searching for construction of biological networks

• Support for qualitative data analysis for social sciences

• Intelligent semantic search of gene-disease associations for health

e-Research and e-Science Knowledge discovery Collaborative research E-publishing Personalised

searching

Page 54: National Centre for Text Mining: Activities in biotext mining John McNaught Deputy Director

Research feeding into services or adding to knowledge

REFINE (BBSRC): enriching SBML models and integration of TM with visualisation

ONDEX (BBSRC): aiding data integration

Disease association mining (Pfizer, NHS)

Page 55: National Centre for Text Mining: Activities in biotext mining John McNaught Deputy Director

ONDEX (BBSRC 2008-2011)

Page 56: National Centre for Text Mining: Activities in biotext mining John McNaught Deputy Director

Acknowledgments

Text Mining Team: 14 members NaCTeM funding agencies:

Wellcome Trust Close collaboration with University of

Tokyo

Page 57: National Centre for Text Mining: Activities in biotext mining John McNaught Deputy Director

Acknowledgements

User group

Systems Biology Centres Middleware provider

http://taverna.sourceforge.net

Usability and evaluation Service provision MIMAS

Page 58: National Centre for Text Mining: Activities in biotext mining John McNaught Deputy Director

Further reading

Visit out site at www.nactem.ac.uk for TM briefing paper and other publications on our work

Book: Ananiadou, S. & McNaught, J. (eds) (2006) Text Mining for Biology and Biomedicine. Norwood, MA: Artech House.