Colin Batchelor
TRANSCRIPT
2
Overview
Why are we doing this?
The conventional text-mining paradigm
How we do it
Where text-mining and annotation could happen in future
Challenges
3
Why are we doing this?
A solution looking for many problems
Enhanced reader experience
Current awareness
Information retrieval (pre-indexing)
6
Conventional text-mining paradigm
There is a corpus of text (PubMed abstracts, internal reports, PDFs).
There is a resource (WordNet, FrameNet, the NTU Sentiment Dictionary).
Text-mining software is trained, using the resource, on a subset of the corpus and tested on the remainder.
This all happens after publication.
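The train-on-a-subset, test-on-the-remainder step can be sketched as follows. This is a minimal illustration only: the function name `split_corpus`, the 80/20 ratio and the document IDs are assumptions made for the example, not details from the talk.

```python
import random


def split_corpus(documents, train_fraction=0.8, seed=0):
    """Shuffle the documents and split them into training and test portions."""
    shuffled = list(documents)
    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical document IDs standing in for abstracts or reports.
corpus = [f"abstract-{i}" for i in range(10)]
train, test = split_corpus(corpus)
```

Every document ends up in exactly one of the two portions, so the software is never tested on text it was trained on.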
7
Resources, conventionally
Static
Probably developed for a single use case
Possibly inconveniently licensed
Developed by a single institution
10
Text mining (Oscar)
http://www.sciborg.org.uk/
http://oscar3-chem.sourceforge.net/
Manual QA
Enhanced HTML
Enhanced RSS
Database
11
Resources we use
Static: IUPAC Gold Book
Dynamic: OBO biomedical ontologies, especially:
ChEBI (see The ontology, dictionary and database of chemical entities of biological interest, Christoph Steinbeck, 1550 today)
RSC ontologies (http://www.rsc.org/ontologies)
CMO, RXNO, MOP (and more to come)
12
Live resource update (stage one)
Integr. Biol., 2009, doi:10.1039/b905580k
affinity chromatography (CMO:0001006)
A chromatography method where the separation is caused by differing analyte–ligand interactions.
(source: IUPAC Orange Book 9.2.1.5)
13
Live resource update (stage two)
immobilized metal affinity chromatography (CMO:0002255)
A chromatography method where the separation is caused by differing analyte–ligand interactions. Proteins containing amino acids with a specific affinity for metal ions (e.g. His which has an affinity for Co and Zn ions) are retained by the column.
metal oxide affinity chromatography (CMO:0002256)
A chromatography method where the separation is caused by differing analyte–ligand interactions. Phosphorylated proteins and peptides are retained by metal oxide particles because of their affinity for the phosphate group.
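One way to picture the effect of the two stages is a dictionary annotator that prefers the longest matching label. The sketch below is an assumption for illustration, not a description of how Oscar works: `annotate` is a hypothetical helper, the example sentence is invented, and only the labels and CMO identifiers come from the slides.

```python
def annotate(text, lexicon):
    """Return (label, identifier) pairs, preferring the longest label at each position.

    Simplification: matching is case-insensitive and ignores word boundaries.
    """
    lowered = text.lower()
    labels = sorted(lexicon, key=len, reverse=True)
    hits, pos = [], 0
    while pos < len(lowered):
        for label in labels:
            if lowered.startswith(label, pos):
                hits.append((label, lexicon[label]))
                pos += len(label)
                break
        else:
            # No label starts here, so move on by one character.
            pos += 1
    return hits


# Stage one: only the parent term is in the resource.
stage_one = {"affinity chromatography": "CMO:0001006"}

# Stage two: the more specific child terms have been added.
stage_two = {
    **stage_one,
    "immobilized metal affinity chromatography": "CMO:0002255",
    "metal oxide affinity chromatography": "CMO:0002256",
}

# Invented example sentence.
sentence = "Proteins were purified by immobilized metal affinity chromatography."
before = annotate(sentence, stage_one)
after = annotate(sentence, stage_two)
```

With the stage-one resource the sentence is annotated with the parent term; once the child terms exist, the same sentence picks up the more specific CMO:0002255 without any change to the annotator itself.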
14
But beware of ambiguity
distribution (noun)
Does this mean:
(a) Spreading something out (a process)?
(b) The way something is spread out (a quality)?
15
External trackers, downloads
Name reactions: http://rxno.googlecode.com/
Chemical methods and apparatus: http://rsc-cmo.googlecode.com/
17
How do we evaluate this? (1)
Annotations to a particular ontology are a moving target.
And we can’t guarantee completeness for any given resource–corpus combination.
(Unless we build a corpus-specific resource, which is bad.)
18
How do we evaluate this? (2)
Calculate inter-annotator agreement
Focus on principles, independently of the resource as it actually exists.
Example: EXACT vs. CLASS vs. PART.
Peter Corbett, Colin Batchelor and Ann Copestake (2008), “Pyridines, pyridine and pyridine rings”, Proc. BERBMTM08 at LREC 2008, Marrakech, Morocco.
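Inter-annotator agreement can be computed in several ways; one common choice is a pairwise F-score over the two annotators' sets of annotations. The sketch below assumes each annotation is a (start, end, subtype) tuple, and the offsets and labels are invented for illustration. It is not necessarily the measure used in the cited paper.

```python
def pairwise_f1(annotations_a, annotations_b):
    """F1 between two annotation sets, treating annotator A as the reference."""
    a, b = set(annotations_a), set(annotations_b)
    if not a and not b:
        return 1.0  # two empty sets agree trivially
    overlap = len(a & b)
    if overlap == 0:
        return 0.0
    precision = overlap / len(b)
    recall = overlap / len(a)
    return 2 * precision * recall / (precision + recall)


# Hypothetical annotations: (start offset, end offset, subtype).
annotator_1 = {(0, 8, "EXACT"), (20, 29, "CLASS"), (40, 53, "PART")}
annotator_2 = {(0, 8, "EXACT"), (20, 29, "EXACT"), (40, 53, "PART")}

# Strict agreement: span and subtype must both match.
strict = pairwise_f1(annotator_1, annotator_2)

# Span-only agreement: drop the subtype before comparing.
spans_only = pairwise_f1(
    {(s, e) for s, e, _ in annotator_1},
    {(s, e) for s, e, _ in annotator_2},
)
```

In this invented case the annotators agree on every span but disagree on one subtype, so the span-only score is higher than the strict one. Comparing the two separates disagreement about the principle (EXACT vs. CLASS vs. PART) from disagreement about what to annotate at all.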
21
Annotation: where and when?
Pre-publication?
(by authors)
?
At publication?
(by editors)
Prospect
After publication?
(by the crowd)
ChemMantis
23
Authoring: Word chemistry plugin
http://research.microsoft.com/en-us/projects/chem4word/default.aspx