Turning mining inside-out Colin Batchelor [email protected] 2009-08-16


Turning mining inside-out

Colin Batchelor

[email protected]

2

Overview

Why are we doing this?

The conventional text-mining paradigm

How we do it

Where text-mining and annotation could happen in future

Challenges

3

Why are we doing this?

A solution looking for many problems

Enhanced reader experience

Current awareness

Information retrieval (pre-indexing)

4

Enhanced HTML

5

Enhanced HTML

6

Conventional text-mining paradigm

There is a corpus of text (PubMed abstracts, internal reports, PDFs).

There is a resource (WordNet, FrameNet, the NTU Sentiment Dictionary).

Text-mining software is trained, using the resource, on a subset of the corpus and tested on the remainder.

This all happens after publication.
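The train-and-test step of this paradigm can be sketched as follows. This is only a minimal illustration, assuming a simple random hold-out split; the function name `split_corpus`, the 80/20 ratio and the placeholder document names are assumptions made for the example, not details from the talk.

```python
import random


def split_corpus(documents, train_fraction=0.8, seed=0):
    """Shuffle the corpus and split it into training and test portions.

    The 80/20 default and the fixed seed are illustrative assumptions.
    """
    shuffled = list(documents)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical corpus: placeholder identifiers standing in for abstracts.
corpus = [f"abstract-{i}" for i in range(10)]
train, test = split_corpus(corpus)
```

In this setup the resource and the corpus are both frozen before the split, which is the point the following slides contrast with a dynamic resource.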

7

Resources, conventionally

Static

Probably developed for a single use case

Possibly inconveniently licensed

Developed by a single institution

8

The kind of resources we want

Dynamic

Multiple use cases

Open

Developed by multiple institutions

9

10

Text mining (Oscar)

http://www.sciborg.org.uk/

http://oscar3-chem.sourceforge.net/

Manual QA

Enhanced HTML

Enhanced RSS

Database

11

Resources we use

Static

IUPAC Gold Book

Dynamic

OBO biomedical ontologies, especially:

ChEBI (see "The ontology, dictionary and database of chemical entities of biological interest", Christoph Steinbeck, 1550 today)

RSC ontologies (http://www.rsc.org/ontologies)

CMO, RXNO, MOP (and more to come)

12

Live resource update (stage one)

Integr. Biol., 2009, doi:10.1039/b905580k

affinity chromatography (CMO:0001006)

A chromatography method where the separation is caused by differing analyte–ligand interactions.

(source: IUPAC Orange Book 9.2.1.5)

13

Live resource update (stage two)

immobilized metal affinity chromatography (CMO:0002255)

A chromatography method where the separation is caused by differing analyte–ligand interactions. Proteins containing amino acids with a specific affinity for metal ions (e.g. His which has an affinity for Co and Zn ions) are retained by the column.

metal oxide affinity chromatography (CMO:0002256)

A chromatography method where the separation is caused by differing analyte–ligand interactions. Phosphorylated proteins and peptides are retained by metal oxide particles because of their affinity for the phosphate group.
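The effect of the two stages on annotation can be sketched roughly as below. This is a toy model, not how Oscar or the RSC pipeline works: the `Term` and `Ontology` classes are hypothetical, matching is naive longest-substring lookup, and the parent link from CMO:0002255 to CMO:0001006 is an assumption suggested by the shared first sentence of the definitions.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Term:
    term_id: str
    name: str
    parent_id: Optional[str] = None  # assumed is_a parent, if any


class Ontology:
    """Toy in-memory ontology (hypothetical, for illustration only)."""

    def __init__(self) -> None:
        self.terms: Dict[str, Term] = {}

    def add(self, term: Term) -> None:
        self.terms[term.term_id] = term

    def annotate(self, text: str) -> Optional[Term]:
        """Return the term with the longest name occurring in the text."""
        lowered = text.lower()
        matches = [t for t in self.terms.values() if t.name in lowered]
        return max(matches, key=lambda t: len(t.name), default=None)


phrase = "purified by immobilized metal affinity chromatography"

# Stage one: only the general term exists, so the phrase maps to it.
onto = Ontology()
onto.add(Term("CMO:0001006", "affinity chromatography"))
stage_one = onto.annotate(phrase)

# Stage two: the more specific term is added and now wins the match.
onto.add(Term("CMO:0002255", "immobilized metal affinity chromatography",
              parent_id="CMO:0001006"))
stage_two = onto.annotate(phrase)
```

Because the resource changes between the two calls, the same text receives a different, more specific annotation the second time.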

14

But beware of ambiguity

distribution (noun)

Does this mean:

(a) Spreading something out (a process)?

(b) The way something is spread out (a quality)?

15

External trackers, downloads

Name reactions

http://rxno.googlecode.com/

Chemical methods and apparatus

http://rsc-cmo.googlecode.com/

16

17

How do we evaluate this? (1)

Annotations to a particular ontology are a moving target.

And we can’t guarantee completeness for any given resource–corpus combination.

(Unless we build a corpus-specific resource, which is bad.)

18

How do we evaluate this? (2)

Calculate inter-annotator agreement

Focus on principles, independently of the resource as it currently exists.

Example: EXACT vs. CLASS vs. PART.

Peter Corbett, Colin Batchelor and Ann Copestake (2008), “Pyridines, pyridine and pyridine rings”, Proc. BERBMTM08 at LREC 2008, Marrakech, Morocco.
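One common way to quantify inter-annotator agreement is Cohen's kappa, sketched below. The slides do not say which measure was used, so the choice of kappa is an assumption, and the two label sequences are invented purely to show the calculation over EXACT, CLASS and PART labels.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Invented example data, not taken from the cited paper.
annotator_a = ["EXACT", "EXACT", "CLASS", "PART", "CLASS", "EXACT"]
annotator_b = ["EXACT", "CLASS", "CLASS", "PART", "CLASS", "EXACT"]
kappa = cohen_kappa(annotator_a, annotator_b)
```

A measure of this kind depends only on the labelling scheme and the annotators, which is what makes it usable while the underlying ontology keeps changing.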

19

Compare and contrast…

20

21

Annotation: where and when?

Pre-publication?

(by authors)

?

At publication?

(by editors)

Prospect

After publication?

(by the crowd)

ChemMantis

22

Authoring: Word ontology plugin

http://ucsdbiolit.codeplex.com/

23

Authoring: Word chemistry plugin

http://research.microsoft.com/en-us/projects/chem4word/default.aspx

24

25

26

Challenges

Open problems

Chemical structures from images

Productive identifiers for productively-named entities

Putting ChemMantis and Prospect together

Backfile (to 1841)

Microsoft Word as well as XML

Name to structure conversion