Capturing Chemistry in XML/CML
J. A. Townsend*, S. E. Adams*, J. M. Goodman*, P. Murray-Rust*, C. A. Waudby*
Capturing Chemistry in XML/CML
ACS March 2004
* Unilever Centre for Molecular Informatics, University of Cambridge
The Agony Of Publication - Loss
The World
Sad
The Scientist
The Lab
Journals
Web Pages
The Vision-1
Human-readable:
mp 65-66 C
Machine-readable:
<scalar dictRef="ccml:mp"
units="units:c"
minValue="65"
maxValue="66" />
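A few lines of code could turn the human-readable range into the machine-readable element. The sketch below is illustrative only: it uses Python's standard `xml.etree.ElementTree`, the function name `melting_range_to_scalar` is hypothetical, and the attribute names and values are copied from the slide rather than checked against the CML schema.

```python
import xml.etree.ElementTree as ET


def melting_range_to_scalar(min_value, max_value):
    """Build a <scalar> element like the one on the slide (hypothetical helper)."""
    scalar = ET.Element(
        "scalar",
        {
            "dictRef": "ccml:mp",   # dictionary reference, as shown on the slide
            "units": "units:c",     # degrees Celsius, as shown on the slide
            "minValue": str(min_value),
            "maxValue": str(max_value),
        },
    )
    # Serialise to text; an element with no children is written self-closed.
    return ET.tostring(scalar, encoding="unicode")


xml_text = melting_range_to_scalar(65, 66)
print(xml_text)
```

Parsing `xml_text` back with `ET.fromstring` recovers the same four attributes, which is the round trip the slide contrasts with the free-text form.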
The Vision-2
•Chemists can carry on doing what they want
But also
•Reuse chemistry
•Archive data
•Ensure validity of data
•Create new sources of data / molecules
Our Approach
•Let chemists use familiar programs …
•… and document templates
•Focus on Journal Articles, Theses, CompChem
•Create data for knowledge-based discovery
•Let computers do the work
•Evolution…
Machine Parsing of Chemistry
Structured (CompChem)
Semi-Structured (Articles)
Unstructured (Discussion)
Structured documents and data in XML
MACHINE PARSING
?
How?
Article
semi-structured
Abstract
Discussion
Experimental
Add Structure
Parse with Regular Expressions
Legacy to CML converters
Regular Expressions
Melting point: two possible syntaxes
m.p. > 23.5 °C
mp 23.5 – 25 °C
[Mm]\.?p\p{Punct}?\s+>?\s?\d*\.?\d?\s?-\s?\d*?\.?\d?\s°?\s?C
[Mm]: capital or lowercase ‘m’
\.?: maybe ‘.’
p: lowercase ‘p’
\p{Punct}?: any punctuation
\s?: maybe whitespace
\d*: 0 or more digits
°?: maybe degrees sign
C: capital ‘C’
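The same idea can be sketched in Python. This is an adaptation, not the pattern used on the slide: Python's built-in `re` module does not support `\p{...}` classes, so an explicit punctuation set stands in for `\p{Punct}`. As transcribed, the slide's pattern also appears to make the hyphen mandatory, so here the range part is optional and an en dash is accepted. The names `MP_PATTERN` and `parse_melting_point` are hypothetical.

```python
import re

MP_PATTERN = re.compile(
    r"""
    \b[Mm]\.?p[.:,;]?                  # 'mp', 'Mp', 'm.p.', 'mp:' ...
    \s+
    (?P<gt>>\s?)?                      # optional '>' marking a lower bound
    (?P<low>\d+(?:\.\d+)?)             # first (or only) temperature
    (?:\s?[-\u2013]\s?                 # hyphen or en dash between the values
       (?P<high>\d+(?:\.\d+)?))?       # optional upper end of the range
    \s?°?\s?C\b                        # optional degree sign, capital 'C'
    """,
    re.VERBOSE,
)


def parse_melting_point(text):
    """Return the first melting point found in text, or None if there is none."""
    match = MP_PATTERN.search(text)
    if match is None:
        return None
    high = match.group("high")
    return {
        "min": float(match.group("low")),
        "max": float(high) if high is not None else None,
        "lower_bound_only": match.group("gt") is not None,
    }


for sample in ("m.p. > 23.5 °C", "mp 23.5 \u2013 25 °C", "mp 65-66 C"):
    print(sample, "->", parse_melting_point(sample))
```

Named groups keep the extracted numbers separate from the surrounding syntax, so the result can be passed straight to whatever writes the `minValue` and `maxValue` attributes.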
CML - XML For Chemistry
•Based on W3C XML Schemas
•300+ components
•Customisable
•Extensible through dictionaries
•Openly available software
J. Chem. Inf. Comp. Sci., 2003, 43, 757
The CML Family
Controlled XML Namespaces:
CMLCore – compounds and properties
CMLReact – reactions
CMLSpect – spectra*
CMLComp – compChem
CMLCryst – crystallography and condensed matter
Interoperates with HTML, MathML, SVG, *AniML+, *ThermoML$, etc.
+spectra: ANSI/JCAMP
$thermochemistry: NIST
J. Chem. Inf. Comp. Sci., 2003, 43, 757
Case Studies
Parsing output from 750,000 MOPAC jobs
High-throughput parsing of journals
CompChem Logs
Coordinates
Molecular Formula
Calculation Type
Point Group
Dipole
Total Energy
Loss From CompChem
Coordinates
Molecular Formula
Calculation Type
Dipole
Total Energy
Ionisation Potential
Parsing Data
CompChemOutput
Coordinates
Energy Levels
Vibrations
Coordinates
Energy Level
Vibration
CML File
CMLCore
CMLCore
CMLComp
CMLSpect
Input/jobControl General
Parsers
Display Process 1
CompChem Log
Xindice
CML
XSLT
Display Process 2
CML File
CMLCore
CMLCore
CMLComp
CMLSpect
compChemOutput
3D structure, electronic properties
Coordinates
Energy Levels
Vibrations
Input/jobControl XSLT
Display
Normal modes
2D structure, thermodynamic properties
Parsing Data
Dictionary Entry:
The pointgroup of a molecule ...
The Schoenflies convention is normally used, but Hermann Mauguin is also allowed.
D [debye]
ParentSI: c.m
Multiplier: 3.335641E-30
CGS units for electric dipole
Dictionaries
<scalar dictRef="ccml:mp"
units="units:c"
minValue="65"
maxValue="66" />
Linked to CML schema
Accesses CCML namespace
Units dictionary
id="celsius" name="Celsius" parentSI="k" multiplierToSI="1" constantToSI="273.15" abbreviation="C" unitType="temp"
id="meltrange" term="Melting range" definition="Minimum and maximum values of melting range in degrees Celsius"
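A units dictionary entry of this kind carries enough information to convert a value to its parent SI unit. The sketch below assumes the conversion rule is SI value = value × multiplierToSI + constantToSI, which is the natural reading of the attribute names but is not stated on the slides. The `UnitEntry` class and its field names are hypothetical; the numbers are the Celsius entry above and the debye entry from the earlier dictionary slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UnitEntry:
    """One units-dictionary entry (field names mirror the slide's attributes)."""
    id: str
    parent_si: str
    multiplier_to_si: float
    constant_to_si: float = 0.0

    def to_si(self, value: float) -> float:
        # Assumed rule: SI value = value * multiplierToSI + constantToSI
        return value * self.multiplier_to_si + self.constant_to_si


CELSIUS = UnitEntry(id="celsius", parent_si="k", multiplier_to_si=1.0, constant_to_si=273.15)
DEBYE = UnitEntry(id="debye", parent_si="c.m", multiplier_to_si=3.335641e-30)

# Convert the melting range 65-66 C from the <scalar> element to kelvin.
melting_range_k = [CELSIUS.to_si(t) for t in (65, 66)]
print(melting_range_k)
print(DEBYE.to_si(1.0))
```

Keeping the multiplier and constant in the dictionary rather than in code means a new unit needs only a new entry, which fits the "extensible through dictionaries" point made earlier in the talk.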
OSCAR
Open Source Chemistry Analysis Routines
Sponsored by the Royal Society of Chemistry (Cambridge)
Mounted on http://www.rsc.org/
Article Structure
Front Matter
Abstract
Introduction
Discussion
Experimental
References
Results
Article
Synthesis
Set up
Analysis
Compound Name
Article Experimental
Information Checked / Extracted
•Chemical name
•Yield
•Boiling / Melting point
•Carbon NMR
•Hydrogen NMR
•Infra Red spectrometry
•Mass spectrometry
•Elemental Analysis
•Optical Rotation
•Refractive Index
•Rf value
•Ultra Violet spectrometry
•Nature (colour, state, modifiers, description, etc.)
OSCAR Parsing Data
H NMR
Nature
HRMS
OSCAR Data Found
Results from one paper
OSCAR Error Checking
Serious Error
Warning Type 1
Warning Type 2
~30 errors / warnings searched for
This article has:
4 errors
2 warnings (type 1)
30 warnings (type 2)
Elemental analysis, incorrect – calculations are for a different molecular formula
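A check of this kind can be sketched by recomputing the percentage composition from the stated molecular formula and comparing it with the reported values. This is a sketch under stated assumptions, not OSCAR's actual routine: the function names are hypothetical, only C, H, N and O are covered, the formula parser handles flat formulae without brackets, and the 0.4 percentage-point tolerance is a common journal convention assumed here.

```python
import re

# Standard atomic masses for the elements handled by this sketch.
ATOMIC_MASS = {"C": 12.011, "H": 1.008, "N": 14.007, "O": 15.999}


def parse_formula(formula):
    """Count atoms in a flat formula such as 'C6H6' (no brackets or charges)."""
    counts = {}
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] = counts.get(symbol, 0) + (int(number) if number else 1)
    return counts


def percent_composition(formula):
    """Mass percentage of each element; raises KeyError for unsupported elements."""
    counts = parse_formula(formula)
    total = sum(ATOMIC_MASS[el] * n for el, n in counts.items())
    return {el: 100.0 * ATOMIC_MASS[el] * n / total for el, n in counts.items()}


def check_elemental_analysis(formula, reported, tolerance=0.4):
    """Map each reported element to True if it agrees with the formula."""
    calculated = percent_composition(formula)
    return {
        el: abs(calculated.get(el, 0.0) - pct) <= tolerance
        for el, pct in reported.items()
    }


reported = {"C": 92.26, "H": 7.74}                   # values calculated for C6H6
print(check_elemental_analysis("C6H6", reported))    # consistent formula
print(check_elemental_analysis("C7H8", reported))    # different formula
```

When the reported figures were calculated for a different molecular formula, every element falls outside the tolerance at once, which is the pattern of error described above.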
OSCAR Data Presentation
OSCAR Speed
A typical paper contains ca. 20 compounds
JOC (Feb 2004) contains ~600 compounds
OSCAR could extract and tabulate in under 5 minutes
OBC (Feb 2004) contains ~300 compounds
OSCAR could extract and tabulate in under 3 minutes
High throughput, high precision
OSCAR Accuracy
92 % of Data Correctly Identified
3 % incorrect author entry
5 % missed
437 items, ~10,000 data fields in test set, working with current Regular Expressions
False-positives: 3 %
XML-CML Databases
CML
Journals
Theses
CompChem
XMLDb can support > 250,000 molecules
Millisecond retrieval on INChI, properties
Xindice
Capturing Molecules
Encourage chemists to
•Autogenerate IUPAC INChI universal identifier
•Embed MDLMol or Chemdraw files in MSWord
•Autoconvert to CML connection table
•Next phase:
•Parse chemical names into CML using modern NLP+
•Learning-machine rather than rule-based
+Natural Language Processing
NLP & Parsing Names
KEY: Locant; Characteristic Group; Mono valent parent hydride; Multiplier; Heterocyclic parent hydride
Thank You
Unilever
RSC
Jonathan Goodman
Sam Adams
Fraser Norton
Chris Waudby
Yong Zhang