
Mining and meaning in the chemical sciences

Richard Kidd
kiddr@rsc.org

Overview

• Why are we doing this?
• The conventional text-mining paradigm
• How we do it
• Where text-mining and annotation could happen in future
• Standards
• Challenges

Why are we doing this?

A solution looking for many problems

• Enhanced reader experience
• Current awareness
• Information retrieval (pre-indexing)

Enhanced HTML

Conventional text-mining paradigm

There is a corpus of text (PubMed abstracts, internal reports, PDFs).

There is a resource (WordNet, FrameNet, the NTU Sentiment Dictionary).

Text-mining software is trained, using the resource, on a subset of the corpus and tested on the remainder.

This all happens after publication.
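The train-and-test cycle described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not how Oscar or any production system works: the toy corpus, the `RESOURCE` set and every function name are hypothetical, and "training" is reduced to keeping the resource terms that appear in the training annotations.

```python
import random

# Hypothetical toy corpus: (text, gold-standard terms) pairs.
CORPUS = [
    ("Separation used affinity chromatography.", {"affinity chromatography"}),
    ("The product was analysed by mass spectrometry.", {"mass spectrometry"}),
    ("Affinity chromatography gave the pure protein.", {"affinity chromatography"}),
    ("Mass spectrometry confirmed the structure.", {"mass spectrometry"}),
    ("Samples were studied by affinity chromatography and mass spectrometry.",
     {"affinity chromatography", "mass spectrometry"}),
]

# Hypothetical static resource (a flat term list standing in for a dictionary).
RESOURCE = {"affinity chromatography", "mass spectrometry", "distribution"}


def split_corpus(docs, train_fraction=0.8, seed=0):
    """Shuffle reproducibly, then split into training and test subsets."""
    shuffled = list(docs)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def train(train_docs, resource):
    """Toy 'training': keep resource terms seen in the training annotations."""
    vocabulary = set()
    for _text, gold_terms in train_docs:
        vocabulary.update(term for term in gold_terms if term in resource)
    return vocabulary


def tag(text, vocabulary):
    """Return every vocabulary term that occurs in the text (case-insensitive)."""
    lowered = text.lower()
    return {term for term in vocabulary if term in lowered}


def evaluate(test_docs, vocabulary):
    """Compute precision and recall of the tagger against gold annotations."""
    tp = fp = fn = 0
    for text, gold_terms in test_docs:
        predicted = tag(text, vocabulary)
        tp += len(predicted & gold_terms)
        fp += len(predicted - gold_terms)
        fn += len(gold_terms - predicted)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


train_docs, test_docs = split_corpus(CORPUS)
vocabulary = train(train_docs, RESOURCE)
precision, recall = evaluate(test_docs, vocabulary)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Because the resource is fixed before the split, any term missing from it can never be found, which is the limitation the following slides address.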

Resources, conventionally

Static
Probably developed for a single use case
Possibly inconveniently licensed
Developed by a single institution

The kind of resources we want

Dynamic
Multiple use cases
Open
Developed by multiple institutions

Text mining (Oscar)

http://www.sciborg.org.uk/

http://oscar3-chem.sourceforge.net/

Manual QA

Enhanced HTML

Enhanced RSS

Database

Resources we use

Static
IUPAC Gold Book

Dynamic
OBO biomedical ontologies, especially: ChEBI

RSC ontologies (http://www.rsc.org/ontologies)

CMO, RXNO, MOP (and more to come)

InChI Identifier

Live resource update (stage one)

Integr. Biol., 2009, doi:10.1039/b905580k

affinity chromatography (CMO:0001006)

A chromatography method where the separation is caused by differing analyte–ligand interactions.

(source: IUPAC Orange Book 9.2.1.5)

Live resource update (stage two)

immobilized metal affinity chromatography (CMO:0002255)

A chromatography method where the separation is caused by differing analyte–ligand interactions. Proteins containing amino acids with a specific affinity for metal ions (e.g. His which has an affinity for Co and Zn ions) are retained by the column.

metal oxide affinity chromatography (CMO:0002256)

A chromatography method where the separation is caused by differing analyte–ligand interactions. Phosphorylated proteins and peptides are retained by metal oxide particles because of their affinity for the phosphate group.
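The two stages can be illustrated with a small sketch of how an annotation changes once more specific terms are added to the ontology. The term IDs and labels come from the slides; the parent links, the `Term` class and the `annotate` function are assumptions made for illustration, and the longest-match rule is only one plausible way to prefer the more specific term.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Term:
    term_id: str
    label: str
    parent_id: Optional[str] = None  # assumed parent link, not stated on the slides


# Stage one: only the general term exists.
STAGE_ONE = [Term("CMO:0001006", "affinity chromatography")]

# Stage two: two more specific terms are added (parent links are an assumption).
STAGE_TWO = STAGE_ONE + [
    Term("CMO:0002255", "immobilized metal affinity chromatography", "CMO:0001006"),
    Term("CMO:0002256", "metal oxide affinity chromatography", "CMO:0001006"),
]


def annotate(text, terms):
    """Return IDs of matched terms, dropping matches nested inside a longer match."""
    lowered = text.lower()
    spans = []
    for term in terms:
        start = lowered.find(term.label)
        while start != -1:
            spans.append((start, start + len(term.label), term.term_id))
            start = lowered.find(term.label, start + 1)
    kept = []
    for span in spans:
        nested = any(
            other[0] <= span[0] and span[1] <= other[1]
            and (other[1] - other[0]) > (span[1] - span[0])
            for other in spans
        )
        if not nested:
            kept.append(span)
    return sorted({term_id for _start, _end, term_id in kept})


sentence = "The protein was purified by immobilized metal affinity chromatography."
print(annotate(sentence, STAGE_ONE))  # only the general term can match
print(annotate(sentence, STAGE_TWO))  # the more specific term replaces it
```

Re-running the annotator after each ontology release is what makes the resource "live": the same sentence receives a more precise identifier without any change to the article text.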

But beware of ambiguity

distribution (noun)

Does this mean:
(a) Spreading something out (a process)?
(b) The way something is spread out (a quality)?

RSC ontology development

Annotations to a particular ontology are a moving target.

And we can’t guarantee completeness for any given resource–corpus combination.

(Unless we build a corpus-specific resource, which is bad.)

Compounds?

All kinds of problems...

Different names, systematic and common
No names
Images (specific and generic)

Best dictionary wins for names
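As a rough illustration of why dictionary coverage matters, the sketch below maps several names for one compound to a single identifier and measures how many names in a list can be resolved. The dictionary contents and the function names are hypothetical; the identifier is the standard InChI for acetic acid.

```python
import re

# Standard InChI for acetic acid, used here as the single shared identifier.
ACETIC_ACID_INCHI = "InChI=1S/C2H4O2/c1-2(3)4/h1H3,(H,3,4)"

# Hypothetical name dictionary: common and systematic names point to one InChI.
NAME_DICTIONARY = {
    "acetic acid": ACETIC_ACID_INCHI,
    "ethanoic acid": ACETIC_ACID_INCHI,
}


def normalise(name):
    """Lower-case the name and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", name.strip().lower())


def lookup(name, dictionary=NAME_DICTIONARY):
    """Return the identifier for a name, or None if the dictionary lacks it."""
    return dictionary.get(normalise(name))


def coverage(names, dictionary=NAME_DICTIONARY):
    """Fraction of the given names that the dictionary can resolve."""
    if not names:
        return 0.0
    resolved = sum(1 for name in names if lookup(name, dictionary) is not None)
    return resolved / len(names)


mentions = ["Acetic  acid", "ethanoic acid", "an unnamed intermediate"]
print([lookup(m) for m in mentions])
print(coverage(mentions))
```

A name the dictionary has never seen simply returns `None`, so a larger dictionary raises coverage without any change to the matching code.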

Stephen Arnold

Search: The Three Curves of Despair

March 2008

ChemMantis

Deposit structures…build dictionaries

A free to access online database for chemists

Website and web services

A link farm for over 22 million compounds integrated to 200 data sources

A curation platform for the public to improve the quality of data online

A deposition platform for the public to annotate and extend the data

Annotation: where and when?

Pre-publication?

(by authors)

At publication?

(by editors)

Prospect

After publication?

(by the crowd)

ChemMantis

Authoring: Word ontology plugin

http://ucsdbiolit.codeplex.com/

<?xml version="1.0" ?>
<cml version="3" convention="org-synth-report" xmlns="http://www.xml-cml.org/schema">
  <molecule id="m1">
    <atomArray>
      <atom id="a1" elementType="C" x2="-2.9149999618530273" y2="0.7699999809265137" />
      <atom id="a2" elementType="C" x2="-1.5813208400249916" y2="1.5399999809265137" />
      <atom id="a3" elementType="O" x2="-0.24764171819695613" y2="0.7699999809265134" />
      <atom id="a4" elementType="O" x2="-1.5813208400249912" y2="3.0799999809265137" />
      <atom id="a5" elementType="H" x2="-4.248679083681063" y2="1.5399999809265137" />
      <atom id="a6" elementType="H" x2="-2.914999961853028" y2="-0.7700000190734864" />
      <atom id="a7" elementType="H" x2="-4.248679083681063" y2="-1.907348645691087E-8" />
      <atom id="a8" elementType="H" x2="1.0860374036310796" y2="1.5399999809265132" />
    </atomArray>
    <bondArray>
      <bond atomRefs2="a1 a2" order="1" />
      <bond atomRefs2="a2 a3" order="1" />
      <bond atomRefs2="a2 a4" order="2" />
      <bond atomRefs2="a1 a5" order="1" />
      <bond atomRefs2="a1 a6" order="1" />
      <bond atomRefs2="a1 a7" order="1" />
      <bond atomRefs2="a3 a8" order="1" />
    </bondArray>
  </molecule>
</cml>
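A CML fragment like the one above can be read with any namespace-aware XML parser. The sketch below uses Python's standard `xml.etree.ElementTree`; the embedded document repeats the atoms and bonds from the slide with the 2D coordinates omitted for brevity, and the `summarise` function is a hypothetical helper, not part of any CML toolkit.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Namespace taken from the xmlns attribute of the CML document.
CML_NS = {"cml": "http://www.xml-cml.org/schema"}

# Same atoms and bonds as the slide; x2/y2 coordinates left out for brevity.
CML_DOC = """<?xml version="1.0" ?>
<cml version="3" convention="org-synth-report" xmlns="http://www.xml-cml.org/schema">
  <molecule id="m1">
    <atomArray>
      <atom id="a1" elementType="C" />
      <atom id="a2" elementType="C" />
      <atom id="a3" elementType="O" />
      <atom id="a4" elementType="O" />
      <atom id="a5" elementType="H" />
      <atom id="a6" elementType="H" />
      <atom id="a7" elementType="H" />
      <atom id="a8" elementType="H" />
    </atomArray>
    <bondArray>
      <bond atomRefs2="a1 a2" order="1" />
      <bond atomRefs2="a2 a3" order="1" />
      <bond atomRefs2="a2 a4" order="2" />
      <bond atomRefs2="a1 a5" order="1" />
      <bond atomRefs2="a1 a6" order="1" />
      <bond atomRefs2="a1 a7" order="1" />
      <bond atomRefs2="a3 a8" order="1" />
    </bondArray>
  </molecule>
</cml>"""


def summarise(cml_text):
    """Return (element counts, number of bonds, atom pairs joined by double bonds)."""
    root = ET.fromstring(cml_text)
    atoms = root.findall(".//cml:atom", CML_NS)
    bonds = root.findall(".//cml:bond", CML_NS)
    element_counts = Counter(atom.get("elementType") for atom in atoms)
    double_bonds = [
        tuple(bond.get("atomRefs2").split())
        for bond in bonds
        if bond.get("order") == "2"
    ]
    return element_counts, len(bonds), double_bonds


counts, bond_count, double_bonds = summarise(CML_DOC)
print(dict(counts), bond_count, double_bonds)
```

The counts (two carbons, two oxygens, four hydrogens, one C=O double bond) are consistent with the fragment describing acetic acid, which is the kind of check a downstream tool can run because the structure is stored as data rather than as an image.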


Relationships: Navigate and link referenced chemistry

• Peter Murray-Rust
• Joe Townsend
• Jim Downing

Available soon: http://research.microsoft.com/chem4word/

Data: Semantics stored in Chemistry Markup Language

Intent: Recognizes chemical dictionary and ontology terms

Author and edit 1D and 2D chemistry.

Intelligence: Verifies validity of authored chemistry

Authoring: Chem4Word – Chemistry Drawing in Word

oreChem

TREC Chemistry

• Combination of 1m+ patents
• 36k RSC articles
• Test runs on defined tasks, prior art
• 8 runs in year one (none from UK)

InfoChem

• Chemisches Zentralblatt
• Digitised
• Structure searchable
• 98k unique names, 48k unique structures
• Unique problems
• OCR interpretations
• Text searchable from FIZ Chemie

• To fund development and support of the IUPAC InChI standard
• Working groups set up by IUPAC Subcommittee: reactions, organometallics, polymers, Markush, business rules for structure input, Resolver protocol

Current members of the Trust:

ACD/Labs
ChemAxon
Elsevier
FIZ Chemie
Informa / Taylor & Francis
NPG
OpenEye
RSC
Symyx Technologies
Thomson-Reuters
Wiley-Blackwell

RInChI

• Reactions
• Jonathan Goodman
• http://www-rinchi.ch.cam.ac.uk/

inchis.chemspider.com

NCI Resolver

• So we need a Resolver Protocol

Semantic Enrichment of the

Scientific Literature (SESL)

• Pistoia-funded
• EBI
• Elsevier, NPG, OUP, RSC
• Oct 2009 – Oct 2010

[Diagram: Biomedical Knowledge Service Framework. Labels: content suppliers (Corpus 1 to Corpus 5, standard public vocabularies); supplier firewall; common service broker; service layer; integrator; transform / translate; assertion and metadata management; consumer firewall; multiple consumers; knowledge applications; business; effort required to fit DBs to service layer.]

SESL deliverables

• Pilot to deliver target-disease assertions
• Publication of data, application and web service standards
• So: to deliver standards for semantic delivery

Standards - are we there yet?

• InChI Trust
  - Compound standards
  - Reaction InChIs
  - Resolver Protocol
• Pistoia/EBI
  - Semantic standards for web services
• Microsoft/Academia
  - oreChem
  - Chem4Word
• Semantic markup by publishers

Challenges for RSC

Open problems

• Chemical structures from images
• Productive identifiers for productively-named entities

Putting ChemMantis and Prospect together

• Backfile (to 1841)
• Microsoft Word as well as XML
• Name to structure conversion

...and for text miners and repositories

Inputs and outputs

• Who’s putting the data in?
• Who’s curating it?
• Who wants to use it?
• ...and what for?

Standards implementation

• Compelling use cases now here
