spectra-t project

Alan Tonge Semantic Web Data Repositories from Chemistry e-Thesis Data Mining Open Repositories 2008 Southampton University 2 April 2008 SPECTRa-T Project


Page 1: SPECTRa-T Project

Alan Tonge

Semantic Web Data Repositories from Chemistry e-Thesis Data Mining

Open Repositories 2008

Southampton University

2 April 2008

SPECTRa-T Project

Page 2: SPECTRa-T Project

Project Overview

Submission, Preservation and Exposure of Chemistry Teaching and Research Data - in Theses

• 12-month project between University of Cambridge and Imperial College London to develop text- and data-mining tools to extract chemical data from e-theses

• Part of the JISC Digital Repositories programme

Page 3: SPECTRa-T Project

Background

Chemistry is an experimental science

Synthetic Organic Chemistry is the basis of Pharmaceutical and Agrochemical industries

Where does the information to make this molecule come from?

Systematic Name: Ethyl 4,5-epoxy-hex-2-enoate

Molecular Formula: C8H12O3

Page 4: SPECTRa-T Project

Search chemical patent & journal abstracting services, e.g.:

• Chemical Abstracts (9000+ journals, 12,000 structures/day)

• Beilstein (180 core journals)

• Patents (CAS, Derwent, MDL) (400,000/annum)

Academic chemistry publications are largely derived from PhD theses

• Perhaps ~10K published per year worldwide

• A synthetic thesis contains 50-60 preparations, only 20% of which are published in detail

Page 5: SPECTRa-T Project

• List of Starting Materials & Reagents

• Recipe: Reactions Conditions & Work-up

• Product Characterization – spectroscopic & physical properties

Page 6: SPECTRa-T Project

Sample preparation from synthetic chemistry thesis

Page 7: SPECTRa-T Project

The Problem

• ~80% of (academic) synthetic preparations remain locked in theses

• Manual abstraction (cf. journals/patents) is not an option

The Solution

• OSCAR3: automatic high-throughput chemical name and chemical term recognition. OSCAR (Open Source Chemistry Analysis Routines) is an extensible open-source framework that can identify much of the chemical terminology in electronic articles

• Semantic Web: deposit extracted terms in a searchable RDF triplestore

Page 8: SPECTRa-T Project

OSCAR name recognition:

1. Dictionary of chemical names/terms (ChEBI Ontology)

2. Rules; chemical suffix filters

3. Regular expressions to recognise data and formulae
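The three recognition strategies listed on this slide can be illustrated with a toy classifier. This is only a sketch under stated assumptions, not OSCAR3's actual code: the dictionary entries, the suffix list, and the formula pattern are all hypothetical stand-ins chosen for illustration.

```python
import re

# Hypothetical mini-dictionary; OSCAR3 draws on the ChEBI ontology instead.
DICTIONARY = {"chloroform", "benzene", "ethanol"}

# Illustrative suffixes only, not OSCAR3's actual rule set.
SUFFIXES = ("ane", "ene", "yne", "ol", "one", "oate")

# Crude molecular-formula pattern: two or more element-like symbols,
# each with an optional count (e.g. C8H12O3, CDCl3).
FORMULA = re.compile(r"^(?:[A-Z][a-z]?\d*){2,}$")


def classify(token):
    """Return how a token was recognised, or None if it does not look chemical."""
    word = token.strip(".,;:()")
    if word.lower() in DICTIONARY:
        return "dictionary"
    # Require at least one digit so ordinary capitalised words are not formulae.
    if any(ch.isdigit() for ch in word) and FORMULA.match(word):
        return "formula"
    if len(word) > 5 and word.lower().endswith(SUFFIXES):
        return "suffix"
    return None


if __name__ == "__main__":
    for tok in ["Chloroform", "C8H12O3", "hexanone", "thesis."]:
        print(tok, "->", classify(tok))
```

A suffix filter this naive also accepts ordinary words such as "someone", which is one reason a real recogniser combines several kinds of evidence rather than relying on a single rule.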

Page 9: SPECTRa-T Project
Page 10: SPECTRa-T Project

Input: PDF Legacy Format

PDF is the de facto format for electronic document deposition in digital repositories

Problem:

• irregular word order

• line-breaks: loss of continuous text; paragraphs difficult to identify

• loss of subscripts and superscripts

• non-printing characters

• erroneous character assignment with OCR

PDF is a Page Description Format, optimized for human, not machine, readability

Page 11: SPECTRa-T Project
Page 12: SPECTRa-T Project

OSCAR used ‘as is’ on PDF e-theses:

• Gives 5000 terms / thesis (80% duplicates)

• Cannot identify chemical objects (spectra assignments; properties)

Programmatic modifications to:

• Remove linebreaks from extended chemical names

• Remove text fragments derived from Figures and Tables

• Correct whitespace in chemical names

Workflow: PDF → UTF-8 text → OSCAR3 → SAF XML → XSLT → RDF statements
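The line-break and whitespace repairs named on this slide might look roughly like the following. It is a minimal sketch, assuming the text span has already been identified as a candidate chemical name; the function name and the specific regular expressions are illustrative and are not taken from the project's code.

```python
import re


def repair_name_linebreaks(text):
    """Rejoin a chemical name that a PDF text dump split or padded."""
    # Hyphen at a line end inside a name: keep the hyphen, drop the break.
    text = re.sub(r"(?<=[\w\)\]])-\s*\n\s*(?=[\w\(\[])", "-", text)
    # No whitespace just inside brackets: "[4.2.1 ]" -> "[4.2.1]".
    text = re.sub(r"([\(\[])\s+", r"\1", text)
    text = re.sub(r"\s+([\)\]])", r"\1", text)
    # No whitespace inside locant lists: "4, 5-epoxy" -> "4,5-epoxy".
    text = re.sub(r"(?<=\d),\s+(?=\d)", ",", text)
    return text


if __name__ == "__main__":
    broken = "Ethyl 4, 5-epoxy-\nhex-2-enoate"
    print(repair_name_linebreaks(broken))
```

Applied to running prose, the same rules would wrongly alter ordinary hyphenated words and comma-separated numbers, so a filter like this is only safe on text already flagged as part of a chemical name.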

Page 13: SPECTRa-T Project

Input: MS Office Open XML – ‘docx’

• No information loss from student’s deposited thesis (written with MS software)

• Identification of experimental sections no longer a problem -> Chemical Objects

• Conversion of chemical objects (COs) into Chemical Markup Language (CML)

Workflow (diagram):

DocX → Extract chemical terms (OSCAR3) → RDF statements

DocX → Extract chemical objects → CML data files → Data Repository (URI)

RDF statements and CML data files are linked together

Page 14: SPECTRa-T Project

Sample preparation from synthetic chemistry thesis


Page 15: SPECTRa-T Project

CML Infra-Red ASSIGNMENTS
<cml:spectrum type="cml:ir">
  <cml:conditionList>
    <cml:condition title="the form of the IR spectrum" dictRef="cml:irform">film</cml:condition>
  </cml:conditionList>
  <cml:peakList>
    <cml:peak id="p1" xValue="3446" title="OH" />
    <cml:peak id="p2" xValue="3062" title="unassigned" />
    <cml:peak id="p3" xValue="3029" title="unassigned" />
    <cml:peak id="p4" xValue="2922" title="unassigned" />
    <cml:peak id="p5" xValue="1672" title="C=O" />
    <cml:peak id="p6" xValue="1604" title="C=C" />
    <cml:peak id="p7" xValue="1496" title="unassigned" />
    <cml:peak id="p8" xValue="1454" title="unassigned" />
    <cml:peak id="p9" xValue="1366" title="unassigned" />
    <cml:peak id="p10" xValue="1299" title="unassigned" />
    <cml:peak id="p11" xValue="1135" title="unassigned" />
    <cml:peak id="p12" xValue="1078" title="unassigned" />
    <cml:peak id="p13" xValue="974" title="unassigned" />
  </cml:peakList>
</cml:spectrum>

CML C-13 NMR ASSIGNMENTS
<cml:spectrum type="cml:cnmr">
  <cml:parameterList>
    <cml:parameter dictRef="cml:frequency" units="units:MHz">50</cml:parameter>
  </cml:parameterList>
  <cml:substanceList>
    <cml:substance ref="" />
  </cml:substanceList>
  <cml:peakList>
    <cml:peak xValue="198.6" integral="" peakMultiplicity="" title="C=O" />
    <cml:peak xValue="198.5" integral="" peakMultiplicity="" title="" />
    <cml:peak xValue="145.0" integral="" peakMultiplicity="" title="C" />
    <cml:peak xValue="142.7" integral="" peakMultiplicity="" title="C" />
    <cml:peak xValue="137.3" integral="" peakMultiplicity="" title="CH2" />
    <cml:peak xValue="136.7" integral="" peakMultiplicity="" title="CH2" />
    <cml:peak xValue="129.1" integral="" peakMultiplicity="" title="" />
    <cml:peak xValue="128.6" integral="" peakMultiplicity="" title="" />
    <cml:peak xValue="126.7" integral="" peakMultiplicity="" title="" />
    <cml:peak xValue="124.0" integral="" peakMultiplicity="" title="aryl-C" />
    <cml:peak xValue="62.5" integral="" peakMultiplicity="" title="CH" />
    <cml:peak xValue="59.0" integral="" peakMultiplicity="" title="CH" />
    <cml:peak xValue="55.2" integral="" peakMultiplicity="" title="CH" />
    <cml:peak xValue="54.9" integral="" peakMultiplicity="" title="CH" />
    <cml:peak xValue="38.5" integral="" peakMultiplicity="" title="CH2" />
    <cml:peak xValue="32.8" integral="" peakMultiplicity="" title="CH2" />
    <cml:peak xValue="26.1" integral="" peakMultiplicity="" title="CH3" />
    <cml:peak xValue="26.0" integral="" peakMultiplicity="" title="CH3" />
  </cml:peakList>
</cml:spectrum>
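Once peak assignments are held in CML like this, they can be read back with any XML parser. The sketch below uses Python's standard library to list the assigned peaks of a shortened copy of the IR spectrum. The slide fragments do not show a namespace declaration, so binding the `cml:` prefix to `http://www.xml-cml.org/schema` is an assumption here, as is the helper name `assigned_peaks`.

```python
import xml.etree.ElementTree as ET

# Assumed namespace for the cml: prefix; the slides show no declaration.
CML_NS = "http://www.xml-cml.org/schema"

# Shortened copy of the IR spectrum, with the namespace declared.
SAMPLE = f"""<cml:spectrum xmlns:cml="{CML_NS}" type="cml:ir">
  <cml:peakList>
    <cml:peak id="p1" xValue="3446" title="OH" />
    <cml:peak id="p2" xValue="3062" title="unassigned" />
    <cml:peak id="p5" xValue="1672" title="C=O" />
  </cml:peakList>
</cml:spectrum>"""


def assigned_peaks(cml_text):
    """Return (position, assignment) pairs for peaks that carry an assignment."""
    root = ET.fromstring(cml_text)
    peaks = root.iter(f"{{{CML_NS}}}peak")
    return [
        (float(p.get("xValue")), p.get("title"))
        for p in peaks
        if p.get("title") not in (None, "", "unassigned")
    ]


if __name__ == "__main__":
    for position, label in assigned_peaks(SAMPLE):
        print(position, label)
```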

Page 16: SPECTRa-T Project

RDF - Resource Description Framework.

A component of the Semantic Web, it is based upon the idea of making statements about resources/data in the form of a subject-predicate-object (or resource-property-value) expression (called a triple), e.g.:

My_thesis has_chemical_entity 2,4-dinitrobenzene

The value of one property can in turn be used as the resource for another.
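The chaining idea can be shown with plain tuples. This is a minimal sketch rather than a real triplestore: the second triple and the predicate `has_spectrum` are hypothetical, added only so that the object of the first statement can serve as the subject of another.

```python
# Each statement is a (subject, predicate, object) tuple.
TRIPLES = [
    ("My_thesis", "has_chemical_entity", "2,4-dinitrobenzene"),
    # Hypothetical second statement: the object above becomes a subject.
    ("2,4-dinitrobenzene", "has_spectrum", "spectrum_1"),
]


def objects(triples, subject, predicate):
    """All objects stated for a given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def follow(triples, start, *predicates):
    """Follow a chain of predicates outward from a starting resource."""
    frontier = [start]
    for predicate in predicates:
        frontier = [o for node in frontier for o in objects(triples, node, predicate)]
    return frontier


if __name__ == "__main__":
    print(follow(TRIPLES, "My_thesis", "has_chemical_entity", "has_spectrum"))
```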

Page 17: SPECTRa-T Project

RDF TRIPLESTORE ENTRY
<rdf:RDF xmlns:dc="http://purl.org/dc/elements/1.1/"
         xmlns:dcrdf="http://purl.org/metadata/dublin_core#"
         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:spectra-t="http://wwmm.ch.cam.ac.uk/spectra-t#">
  <rdf:Description rdf:about="file:/C:/spectra-t-theses/Juergen_Harter.docx">
    <spectra-t:hasChemicalName>
      <rdf:Description>
        <spectra-t:chemicalName>CDCl3</spectra-t:chemicalName>
        <spectra-t:hasSMILES>ClC([2H])(Cl)Cl</spectra-t:hasSMILES>
        <spectra-t:hasInChI>InChI=1/CHCl3/c2-1(3)4/h1H/i1D</spectra-t:hasInChI>
      </rdf:Description>
    </spectra-t:hasChemicalName>
    <spectra-t:hasChemicalName>
      <rdf:Description>
        <spectra-t:chemicalName>1-Benzyloxy-but-3-yne</spectra-t:chemicalName>
        <spectra-t:hasSMILES>C#CCCOCC1=CC=CC=C1</spectra-t:hasSMILES>
        <spectra-t:hasInChI>InChI=1/C11H12O/c1-2-3-9-12-10-11-7-5-4-6-8-11/h1,4-8H,3,9-10H2</spectra-t:hasInChI>
        <spectra-t:hasHNMRSpectrum>http://ch.cam.ac.uk:8182/1ea7f8cd07/data-0.cml</spectra-t:hasHNMRSpectrum>
        <spectra-t:hasCMLMolecule>http://ch.cam.ac.uk:8182/1ea7f8cd07/data-0.cml</spectra-t:hasCMLMolecule>
        <spectra-t:hasPreparation>http://ch.cam.ac.uk:8182/1ea7f8cd07/preparation-0.sci.xml</spectra-t:hasPreparation>
      </rdf:Description>
    </spectra-t:hasChemicalName>
    <spectra-t:hasChemicalName>
      <rdf:Description>
        <spectra-t:chemicalName>(3E,5S,6S)-8-(p-Methoxy-benzyloxy)-5,6-epoxy-6-methyl-oct-3-en-2-one</spectra-t:chemicalName>
        <spectra-t:hasHNMRSpectrum>http://fiwlt.ch.cam.ac.uk:8182/8f2d98b04/data-20.cml</spectra-t:hasHNMRSpectrum>
        <spectra-t:hasIRSpectrum>http://fiwlt.ch.cam.ac.uk:8182/8f2d98b04/data-20.cml</spectra-t:hasIRSpectrum>
        <spectra-t:hasMassSpectrum>http://fiwlt.ch.cam.ac.uk:8182/8f2d98b04/data-20.cml</spectra-t:hasMassSpectrum>
        <spectra-t:hasHRMSSpectrum>http://fiwlt.ch.cam.ac.uk:8182/8f2d98b04/data-20.cml</spectra-t:hasHRMSSpectrum>
        <spectra-t:hasPreparation>http://fiwlt.ch.cam.ac.uk:8182/8f2d98b04/preparation-20.sci.xml</spectra-t:hasPreparation>
      </rdf:Description>
    </spectra-t:hasChemicalName>
  </rdf:Description>
</rdf:RDF>

SPARQL QUERY
PREFIX st: <http://wwmm.ch.cam.ac.uk/spectra-t#>
PREFIX dcrdf: <http://purl.org/metadata/dublin_core#>
CONSTRUCT {
  ?thesis st:hasBicycloMoleculeAndHNMR ?chemical .
  ?thesis dcrdf:author ?author
}
WHERE {
  ?thesis dcrdf:creator ?author .
  ?thesis st:hasChemicalName ?annot .
  ?annot st:chemicalName ?chemical .
  ?annot st:hasHNMRSpectrum ?hnmr .
  FILTER regex(?chemical, ".*bicyclo.*") .
}

RESULT
<rdf:Description rdf:about="file:/C:/spectra-t-articles/B207708F.docx">
  <st:hasBicycloMoleculeAndHNMR>5-Acetyl-7,8-bis(trimethylsilyl)bicyclo[4.2.1]nona-4,7-diene</st:hasBicycloMoleculeAndHNMR>
  <dcrdf:author>N.R.Champness</dcrdf:author>
  <st:hasBicycloMoleculeAndHNMR>5-Acetyl-bicyclo[4.2.1]nona-4,7-diene</st:hasBicycloMoleculeAndHNMR>
  <dcrdf:author>N.R.Champness</dcrdf:author>
  <st:hasBicycloMoleculeAndHNMR>5-Phenyl-bicyclo[4.2.1]nona-3,7-diene</st:hasBicycloMoleculeAndHNMR>
  <dcrdf:author>N.R.Champness</dcrdf:author>
  <st:hasBicycloMoleculeAndHNMR>5-Acetyl-7,8-bis(trimethylsilyl)bicyclo[4.2.1]nona-4,7-diene</st:hasBicycloMoleculeAndHNMR>
  <dcrdf:author>N.R.Champness</dcrdf:author>
  <st:hasBicycloMoleculeAndHNMR>5-Acetyl-bicyclo[4.2.1]nona-4,7-diene</st:hasBicycloMoleculeAndHNMR>
  <dcrdf:author>N.R.Champness</dcrdf:author>
  <st:hasBicycloMoleculeAndHNMR>5-Phenyl-bicyclo[4.2.1]nona-3,7-diene</st:hasBicycloMoleculeAndHNMR>
  <dcrdf:author>N.R.Champness</dcrdf:author>
</rdf:Description>
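To make the logic of the query concrete, the same pattern match can be imitated over tuples in plain Python: find a document with a creator, follow `hasChemicalName` to an annotation, require that the annotation has an H NMR spectrum, and keep names containing "bicyclo". This is an illustrative sketch, not how a triplestore evaluates SPARQL; the sample data (`doc1`, `A. Author`, the blank-node labels) and both helper functions are hypothetical.

```python
import re

# Hypothetical sample data shaped like the triplestore entry on this slide.
TRIPLES = [
    ("doc1", "dcrdf:creator", "A. Author"),
    ("doc1", "st:hasChemicalName", "_:a1"),
    ("_:a1", "st:chemicalName", "5-Phenyl-bicyclo[4.2.1]nona-3,7-diene"),
    ("_:a1", "st:hasHNMRSpectrum", "data-0.cml"),
    ("doc1", "st:hasChemicalName", "_:a2"),
    ("_:a2", "st:chemicalName", "CDCl3"),
]


def match(triples, s=None, p=None, o=None):
    """Triples matching a pattern; None acts as a wildcard (like a ?variable)."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def bicyclo_with_hnmr(triples):
    """Imitate the CONSTRUCT query: build new triples from matched bindings."""
    out = set()
    for thesis, _, author in match(triples, p="dcrdf:creator"):
        for _, _, annot in match(triples, s=thesis, p="st:hasChemicalName"):
            # The annotation must have at least one H NMR spectrum.
            if not match(triples, s=annot, p="st:hasHNMRSpectrum"):
                continue
            for _, _, name in match(triples, s=annot, p="st:chemicalName"):
                # SPARQL regex() matches anywhere in the string, like re.search.
                if re.search("bicyclo", name):
                    out.add((thesis, "st:hasBicycloMoleculeAndHNMR", name))
                    out.add((thesis, "dcrdf:author", author))
    return out


if __name__ == "__main__":
    for triple in sorted(bicyclo_with_hnmr(TRIPLES)):
        print(triple)
```

Collecting the constructed triples in a set mirrors the fact that an RDF graph holds each statement at most once.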

Page 18: SPECTRa-T Project

Caveats (Proof-of-concept):

Single subject area (synthetic organic chemistry)

Single institution docx (limited variation in document structure)

Limited thesis availability

Solutions :

Domain ontology development

Make your e-theses public!

Message to repository managers:

PDF is a limited format for data extraction from e-theses

Docx allows chemical data object extraction (~80% precision / recall)
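For readers unfamiliar with the two measures quoted above: precision is the share of extracted items that are correct, and recall is the share of the correct items that were extracted. The sketch below computes both from two sets; the function name and the example item labels are hypothetical and are not project data.

```python
def precision_recall(extracted, reference):
    """Precision and recall of an extracted set against a reference set."""
    extracted, reference = set(extracted), set(reference)
    true_positives = len(extracted & reference)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical labels: four of five extracted objects are correct,
    # and four of five reference objects were found.
    found = {"ir-1", "nmr-1", "nmr-2", "ms-1", "spurious"}
    truth = {"ir-1", "nmr-1", "nmr-2", "ms-1", "missed"}
    print(precision_recall(found, truth))
```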

Page 19: SPECTRa-T Project

Acknowledgements

• Project Director: Peter Morgan, UL Cambridge

• Chemistry leads: Henry Rzepa, Peter Murray-Rust

• Developers: Jim Downing, Diana Stewart, Joe Townsend, Matt Harvey

• Project Manager: Alan Tonge

http://www.lib.cam.ac.uk/spectra-t/

Page 20: SPECTRa-T Project

SPECTRa Tools Workshop

Autumn 2008

Unilever Centre, Cambridge, UK

Contact: Peter Murray-Rust ([email protected])

Peter Morgan ([email protected])