icic 2013 conference proceedings josef eiblmaier infochem
DESCRIPTION
Extraction of structural information from ChemDraw CDX files: easy, or an underestimated, difficult challenge?
Josef Eiblmaier (InfoChem, Germany)
In the past decade various systems for the automatic identification and extraction of chemistry-related information from unstructured sources have emerged. They have opened up new possibilities for organizing, querying, and analyzing chemical content to support the research and development process. Patent authorities and scientific publishers make available, on a large scale, not only full text and images, but also ChemDraw CDX files for many sources. The chemical information contained in these CDX files is primarily intended for layout purposes for publications, but it is often erroneously considered to be readily available as input for structure and reaction database building processes. Unfortunately, automatic work-up of chemical structures and reactions from these CDX files entails serious obstacles and problems; consequently, the information produced is often incorrect or incomplete and thus not properly available to information professionals via structure and reaction searching. This talk will present different approaches to extracting reactions and structures correctly from CDX files and will describe the main difficulties and drawbacks encountered.
TRANSCRIPT
InfoChem GmbH © 2013 Dr. Josef Eiblmaier ICIC 2013 Vienna, October 13 – 16
Extraction of structural information from
ChemDraw CDX files: easy, or an
underestimated, difficult challenge?
Josef Eiblmaier, Hans Kraut, Sascha Hausberg, Peter Loew
ICIC 2013 Vienna, October 13 – 16
Outline
» ChemDraw files: Relevance and the Challenge
» Approach
» Projects
• InfoChem ChemProspector
• Wiley Smart Article
• Thieme Science of Synthesis Update / Pharmaceutical Substances
» Conclusion / Outlook
Patents, Journal Articles and MRWs: a Buried Treasure?
» Reactions (CDX files)
» Chemical structures (images)
» Markush structures (text, images, CDX)
» Chemical structures (CDX files)
» Chemical names/fragments (text)
Manuscript submission → Publishing → Database production
Manuscript → Article → Database → …
Database production, e.g. SciFinder, Reaxys, SPRESI, eEROS, ...
Manual Indexing
CDX Scheme vs. Database Record
ChemDraw file                                     Database
Purpose: presentation / publishing (no search)    Purpose: search / retrieval
Unstructured                                      Structured
Structures: no strict rules                       Structures: strict rules
General rules: none                               Database rules: strict
[Scheme: reaction sequence with components labeled by role (Reactant, Product, Reagent, Solvent, Catalyst). Legible labels: SOCl2; LiOH; H2O, THF; Pd(OAc)2; Cl-CO2Et, Et3N; acetone, H2O]
Source: Thieme Pharmaceutical Substances, Ticagrelor (in production)
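To make the contrast concrete: a database record gives every component an explicit role and can be checked against strict rules, whereas a CDX scheme only places labels near an arrow. The following is a minimal sketch, not InfoChem's schema. The names `Role`, `Component` and `ReactionRecord` are hypothetical, and the roles assigned to the slide's labels are for illustration only.

```python
from dataclasses import dataclass, field
from enum import Enum


class Role(Enum):
    REACTANT = "reactant"
    PRODUCT = "product"
    REAGENT = "reagent"
    SOLVENT = "solvent"
    CATALYST = "catalyst"


@dataclass(frozen=True)
class Component:
    label: str  # text exactly as drawn in the scheme
    role: Role


@dataclass
class ReactionRecord:
    components: list[Component] = field(default_factory=list)

    def by_role(self, role: Role) -> list[str]:
        """Return the labels of all components with the given role."""
        return [c.label for c in self.components if c.role is role]

    def violations(self) -> list[str]:
        """Check one example of a strict rule: reactant and product are mandatory."""
        problems = []
        for required in (Role.REACTANT, Role.PRODUCT):
            if not self.by_role(required):
                problems.append(f"missing {required.value}")
        return problems


# Labels taken from the slide; the role assignment is an illustrative assumption.
record = ReactionRecord([
    Component("Pd(OAc)2", Role.CATALYST),
    Component("THF", Role.SOLVENT),
    Component("H2O", Role.SOLVENT),
    Component("LiOH", Role.REAGENT),
])
print(record.by_role(Role.SOLVENT))  # ['THF', 'H2O']
print(record.violations())           # ['missing reactant', 'missing product']
```

A drawing containing only these labels is perfectly acceptable for publication, but the record built from it fails the database check until reactant and product structures have been resolved.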
CDX Scheme Processing: what does that mean?
ICSchemeProcessor converts a CDX scheme into:
» Chemical structures (SD files)
» Reactions (RD files)
» Conditions (RD files)
[Scheme: reaction sequence with Reagent, Solvent and Catalyst labels: SOCl2; LiOH; H2O, THF; Pd(OAc)2; Cl-CO2Et, Et3N; acetone, H2O]
Source: Thieme Pharmaceutical Substances, Ticagrelor (in production)
But: CDX files are often an optical illusion!
Authors are very inventive in pursuit of a 'perfect' layout!
Appearances are deceiving!
» Usage of graphical symbols
• Polymer supports
• Heteroatoms
C Grid:
Optical illusions 2
» Unresolvable labels
• Labels not defined
• Element symbols used as R-group labels
• Ambiguous fragment labels (e.g. molecular formula)
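The three label problems above can be illustrated with a small classifier. This is a sketch under stated assumptions: the tables `ELEMENTS` and `ALIASES` are tiny illustrative subsets, the patterns `R_GROUP` and `FORMULA` are hypothetical heuristics, and none of it is taken from ICSchemeProcessor.

```python
import re

# Illustrative subsets only; a real system would need complete tables.
ELEMENTS = {"H", "B", "C", "N", "O", "F", "P", "S", "Cl", "Br", "I", "Y", "Ar"}
ALIASES = {"Me": "C", "Et": "CC", "Ph": "c1ccccc1", "OMe": "OC"}  # label -> SMILES fragment
R_GROUP = re.compile(r"^(R\d*|R'+|X|Y|Z|Ar)$")      # common generic labels (assumed)
FORMULA = re.compile(r"^(?:[A-Z][a-z]?\d*){2,}$")   # e.g. C3H7: connectivity unknown


def classify(label, defined_r_groups=frozenset()):
    """Sort a drawn label into one of the problem classes named on the slide."""
    if label in ALIASES:
        return "alias"
    is_element = label in ELEMENTS
    is_r_group = bool(R_GROUP.match(label))
    if is_element and is_r_group:
        # Element symbol used as R-group label: only a definition can decide.
        return "r-group" if label in defined_r_groups else "ambiguous"
    if is_element:
        return "element"
    if is_r_group:
        return "r-group" if label in defined_r_groups else "undefined"
    if FORMULA.match(label):
        return "formula"
    return "undefined"


print(classify("Ph"))                            # alias
print(classify("Y"))                             # ambiguous (yttrium or R-group?)
print(classify("Y", defined_r_groups={"Y"}))     # r-group
print(classify("R1"))                            # undefined (no definition found)
print(classify("C3H7"))                          # formula (n-propyl or isopropyl?)
```

Only the "alias" and "element" results can be converted without further context; every other class needs either a definition elsewhere in the scheme or a human decision.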
Optical illusions 3
» Variable points of attachment
Optical illusions 4
» Reaction arrows / forked arrows / brackets
Approach
» The algorithmic approach:
• A set of rules is applied in the software (generic, not project-specific). The software should recognize all cases that might occur!
• Project- (title-) specific rules can be added, but the drawing conventions must not change; otherwise further development is necessary
• Manual post-correction is required (cost- and time-intensive)
• The problem is unbounded: unprecedented issues cannot be handled
» The templating approach:
• Software is developed to recognize a defined set of problems (PS)
• All content must be manually pre-templated (cost-intensive) according to the capabilities of the software
» The hybrid approach:
• Depending on the source, the focus can be placed on either approach
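The interplay of generic rules, project-specific rules and manual post-correction can be sketched as a rule chain with an error report. This is an assumption-laden illustration, not the actual software design: `expand_alias`, `source_specific`, `Result` and `process` are hypothetical names, and the alias tables are placeholders.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

# A rule returns a resolved value, or None if it does not apply to the label.
Rule = Callable[[str], Optional[str]]


def expand_alias(label: str) -> Optional[str]:
    """Generic rule: abbreviations that mean the same in every source."""
    return {"Et": "CC", "Me": "C"}.get(label)


def source_specific(label: str) -> Optional[str]:
    """Project rule: an abbreviation table maintained per title (assumed content)."""
    return {"TBS": "[Si](C)(C)C(C)(C)C"}.get(label)


@dataclass
class Result:
    resolved: dict[str, str] = field(default_factory=dict)
    error_report: list[str] = field(default_factory=list)  # input for manual post-correction


def process(labels, generic_rules, project_rules) -> Result:
    result = Result()
    for label in labels:
        # Generic rules are tried first, then the project-specific ones.
        for rule in (*generic_rules, *project_rules):
            value = rule(label)
            if value is not None:
                result.resolved[label] = value
                break
        else:
            # No rule applied: an unprecedented case that needs a human.
            result.error_report.append(label)
    return result


result = process(["Et", "TBS", "Q"], [expand_alias], [source_specific])
print(result.resolved)      # {'Et': 'CC', 'TBS': '[Si](C)(C)C(C)(C)C'}
print(result.error_report)  # ['Q']
```

Shifting the balance between the approaches then amounts to deciding how much goes into the rule lists and how much is caught upstream by templating, so that the error report stays short.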
Templating
» Templating: Guidelines for authors and typesetters
• Syntax definitions for tables, R-groups etc.
• Syntax rules for captions
• Reaction arrangement, forked arrows
• Rules for reaction conditions
(reactants, catalysts, solvents, yields, temperature)
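The benefit of syntax rules for reaction conditions is that a caption becomes deterministically parseable. The sketch below assumes a purely hypothetical caption syntax (`key: value` fields separated by semicolons); the actual guideline syntax is not given in the slides, and `parse_conditions` is an illustrative name.

```python
import re

# Field names follow the slide; the "key: value; ..." syntax itself is an assumption.
LIST_FIELDS = ("reactants", "catalysts", "solvents")
NUMBER_FIELDS = ("yield", "temperature")
NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def parse_conditions(caption: str) -> dict:
    """Parse a templated caption; reject anything that violates the syntax."""
    parsed = {}
    for part in caption.split(";"):
        key, sep, value = part.partition(":")
        key = key.strip().lower()
        if not sep or key not in LIST_FIELDS + NUMBER_FIELDS:
            raise ValueError(f"caption violates template syntax: {part.strip()!r}")
        if key in parsed:
            raise ValueError(f"duplicate field: {key!r}")
        if key in NUMBER_FIELDS:
            match = NUMBER.search(value)
            if match is None:
                raise ValueError(f"no number in field {key!r}")
            parsed[key] = float(match.group())
        else:
            parsed[key] = [v.strip() for v in value.split(",") if v.strip()]
    return parsed


caption = "catalysts: Pd(OAc)2; solvents: THF, H2O; temperature: 60 °C; yield: 85%"
print(parse_conditions(caption))
# {'catalysts': ['Pd(OAc)2'], 'solvents': ['THF', 'H2O'], 'temperature': 60.0, 'yield': 85.0}
```

A free-form caption such as "Pd(OAc)2, THF, 60 °C" raises `ValueError` here instead of being guessed at, which is the trade-off of templating: effort moves to authors and typesetters, and the software stays simple.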
Examples:
» Algorithmic detection of features
» Resolution of repeating groups
» Enumeration of R-groups
» Resolution of aliases/labels
• source specific alias databases
• continuously extended
» Table Enumeration
• compound enumeration
• reaction factual data: Caption/Yield
» Variable points of attachment
» Forked arrows
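R-group and table enumeration from the list above can be sketched with plain string templates. Everything here is an illustrative assumption: the SMILES template, the entry labels "1a"/"1b" and the substituent values are invented, and real enumeration works on connection tables rather than strings.

```python
from itertools import product

# Hypothetical template: "{R4}" and "{R5}" mark R-group positions in a SMILES string.
TEMPLATE = "{R4}C(=O)C({R5})O"


def enumerate_rows(template, rows):
    """Table enumeration: one concrete structure per table row."""
    return {entry: template.format(**groups) for entry, groups in rows.items()}


def enumerate_all(template, choices):
    """R-group enumeration: every combination of the listed substituents."""
    names = sorted(choices)
    return [
        template.format(**dict(zip(names, combo)))
        for combo in product(*(choices[name] for name in names))
    ]


# Invented table rows: entry label -> substituents as SMILES fragments.
rows = {
    "1a": {"R4": "C", "R5": "C"},
    "1b": {"R4": "c1ccccc1", "R5": "C"},
}
print(enumerate_rows(TEMPLATE, rows))
# {'1a': 'CC(=O)C(C)O', '1b': 'c1ccccc1C(=O)C(C)O'}

print(enumerate_all(TEMPLATE, {"R4": ["C", "CC"], "R5": ["C", "c1ccccc1"]}))
# ['CC(=O)C(C)O', 'CC(=O)C(c1ccccc1)O', 'CCC(=O)C(C)O', 'CCC(=O)C(c1ccccc1)O']
```

The table variant keeps each structure tied to its entry label, so factual data from the same row (caption, yield) can be attached to the right compound; the combinatorial variant is what a generic "R4 = Me, Et" definition without a table implies.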
Projects
Successful Application of CDX Processing:
Chemistry Enrichment Workflow* (Wiley Smart Article)
*Reinhard Neudert: Enhancing the User Experience for Wiley Chemistry Content, ICIC 2012, Berlin, October 14 – 17
Templating*
Author's CDX File → Templating → CDX Template → ICSchemeProcessor → Enumerated structures
CDX-Templating Guidelines (Structures)
*Reinhard Neudert: Enhancing the User Experience for Wiley Chemistry Content, ICIC 2012, Berlin, October 14 – 17
Workflow Science of Synthesis Update
[Figure: workflow diagram built around an example reaction scheme (substituents R4 and R5, compound labels 39 and 40), which appears three times in the diagram. Diagram labels:]
» Scheme
» ICSchemeProcessor
» CDX-Templating Guidelines (Reactions)
» Error Report
» Correct / extend process
» Scheme correction not possible
» Manual data entry
Sample Pharmaceutical Substances Update
Source: Thieme Pharmaceutical Substances, Abiraterone
Conclusion
» Algorithmic processing is desirable wherever possible
• generic: can be applied to other content as well
• cheaper (human work is costly!)
» 100% conversion (without human interaction) is never possible
» Solutions are project / source specific
» Relevance of automatic extraction will continuously increase
» Authors / Publishers play an essential role in a successful conversion
Acknowledgements
» Wiley
Michael Forster
Reinhard Neudert
» Thieme
Guido Herrmann
Rolf Hoppe
Klaus Köberlein
» InfoChem
Hans Kraut, Sascha Hausberg, Thomas Menke, Manuela Rauh
Fanny Irlinger, Huyen Ngyen, Dagmar Kunzmann
Thank you!
Questions?