Transcript
Page 1: ICIC 2013 Conference Proceedings Josef Eiblmaier Infochem

InfoChem GmbH © 2013 Dr. Josef Eiblmaier ICIC 2013 Vienna, October 13 – 16


Extraction of structural information from ChemDraw CDX files: easy, or an underestimated, difficult challenge?

Josef Eiblmaier, Hans Kraut, Sascha Hausberg, Peter Loew

ICIC 2013 Vienna, October 13 – 16

Page 2: ICIC 2013 Conference Proceedings Josef Eiblmaier Infochem


Outline

» ChemDraw files: Relevance and the Challenge

» Approach

» Projects

» InfoChem ChemProspector

» Wiley Smart Article

» Thieme Science of Synthesis Update / Pharmaceutical Substances

» Conclusion / Outlook

© cora / PIXELIO, www.pixelio.de

Page 3: ICIC 2013 Conference Proceedings Josef Eiblmaier Infochem


Patents, Journal Articles and MRWs: a Buried Treasure?

• Reactions (CDX files)

• Chemical structures (images)

• Markush structures (text, images, CDX)

• Chemical structures (CDX files)

• Chemical names/fragments (text)

Page 4: ICIC 2013 Conference Proceedings Josef Eiblmaier Infochem


Manuscript submission

Publishing

Database production, e.g. SciFinder, Reaxys, SPRESI, eEROS, ...

Manuscript Article Database …

Manual Indexing

Page 5: ICIC 2013 Conference Proceedings Josef Eiblmaier Infochem


CDX Scheme vs. Database Record

ChemDraw file | Database

Purpose: presentation / publishing, no search | Purpose: search / retrieval

Unstructured | Structured

Structures: no strict rules | Structures: strict rules

General rules: none | Database rules: strict

Roles a database record must distinguish: Reactant, Product, Reagent, Solvent, Catalyst

Labels shown in the example scheme: SOCl2; LiOH; H2O, THF; Pd(OAc)2; Cl-Co2Et, Et3N; Acetone, H2O

Source: Thieme Pharmaceutical Substances, Ticagrelor (in production)
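To make the contrast concrete, here is a minimal sketch of what a "structured" record with strict rules could look like. The class name `ReactionRecord`, the `add` and `is_searchable` methods, the placeholder structure names, and the assignment of the slide's labels to roles are all assumptions made for illustration; none of this is InfoChem's actual data model.

```python
from dataclasses import dataclass, field

# The five roles named on the slide.
ROLES = ("reactant", "product", "reagent", "solvent", "catalyst")


@dataclass
class ReactionRecord:
    """Hypothetical database-style record: every component carries exactly one role."""

    components: dict = field(default_factory=lambda: {r: [] for r in ROLES})

    def add(self, role, label):
        # Strict rule: unknown roles are rejected instead of being stored as free text.
        if role not in self.components:
            raise ValueError(f"unknown role: {role!r}")
        self.components[role].append(label)

    def is_searchable(self):
        # Assumed minimal rule: a reaction needs at least one reactant and one product.
        return bool(self.components["reactant"]) and bool(self.components["product"])


record = ReactionRecord()
record.add("reactant", "structure-1")  # placeholder names, not real structures
record.add("product", "structure-2")
record.add("reagent", "SOCl2")
record.add("solvent", "THF")
record.add("catalyst", "Pd(OAc)2")
print(record.is_searchable())
```

A drawn scheme carries the same labels, but nothing in the drawing says which role each label plays; that assignment is the part that has to be inferred.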

Page 6: ICIC 2013 Conference Proceedings Josef Eiblmaier Infochem


CDX Scheme Processing, what does that mean?

[Figure: a CDX scheme is passed through ICSchemeProcessor, which produces:]

• Chemical structures (SD files)

• Reactions (RD files)

• Conditions (RD files)

Role labels in the example scheme: Reagent, Solvent, Catalyst

Labels shown in the example scheme: SOCl2; LiOH; H2O, THF; Pd(OAc)2; Cl-Co2Et, Et3N; Acetone, H2O

Source: Thieme Pharmaceutical Substances, Ticagrelor (in production)

Page 7: ICIC 2013 Conference Proceedings Josef Eiblmaier Infochem


But: CDX files are often an optical illusion!

Authors are very inventive in pursuit of a 'perfect' layout!

Appearances are deceiving!

» Usage of graphical symbols

• Polymer supports

• Heteroatoms


Page 8: ICIC 2013 Conference Proceedings Josef Eiblmaier Infochem


Optical illusions 2

» Unresolvable labels

• Labels not defined

• Element symbols used as R-group labels

• Ambiguous fragment labels (e.g. molecular formula)
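The three cases above can be sketched as a small label triage step. This is an illustration only: the alias table, the element subset, the regular expressions, and the function name `triage_label` are hypothetical, and a real system would need far larger, source-specific dictionaries.

```python
import re

# Hypothetical alias table (label -> SMILES fragment); illustrative entries only.
ALIASES = {"Me": "C", "Et": "CC", "Ph": "c1ccccc1", "OMe": "OC"}
# Small subset of element symbols, including some that authors reuse as variables.
ELEMENTS = {"H", "B", "C", "N", "O", "F", "P", "S", "Cl", "Br", "I", "Y", "W", "U", "V", "K"}
R_GROUP = re.compile(r"^R\d*'*$")
FORMULA = re.compile(r"^(?:[A-Z][a-z]?\d*){2,}$")


def triage_label(label, defined_variables=()):
    """Classify a text label found on an atom position of a drawn structure."""
    if label in defined_variables:
        # e.g. the scheme defines "Y = O, S": Y is a variable here, not yttrium.
        if label in ELEMENTS:
            return "element-symbol-as-r-group"
        return "r-group"
    if label in ALIASES:
        return "alias"
    if label in ELEMENTS:
        return "element"
    if R_GROUP.match(label):
        return "undefined-label"  # looks like an R-group, but nothing defines it
    if FORMULA.match(label):
        return "ambiguous-formula"  # e.g. C3H7: n-propyl or isopropyl?
    return "undefined-label"


for lbl in ("Ph", "N", "C3H7", "R4"):
    print(lbl, "->", triage_label(lbl))
print("Y ->", triage_label("Y", defined_variables={"Y"}))
```

Only the "alias", "element", and "r-group" outcomes can be resolved automatically in this sketch; the rest would have to be reported for correction.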

Page 9: ICIC 2013 Conference Proceedings Josef Eiblmaier Infochem


Optical illusions 3

» Variable points of attachment
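A variable point of attachment is a single drawing that stands for several concrete structures. A minimal sketch of the enumeration, assuming the molecule is given as a plain list of atom-index bonds (the function name and the data layout are hypothetical and unrelated to the CDX format itself):

```python
def enumerate_attachments(core_bonds, candidate_atoms, substituent_atom):
    """Return one explicit bond list per candidate attachment atom."""
    return [sorted(core_bonds + [(atom, substituent_atom)]) for atom in candidate_atoms]


# Six-membered ring, atoms 1..6; atom 7 is a substituent drawn with a bond
# pointing into the ring, i.e. attached to "any of" atoms 2, 3, or 4.
ring = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (1, 6)]
isomers = enumerate_attachments(ring, candidate_atoms=[2, 3, 4], substituent_atom=7)
print(len(isomers))
```

The sketch does not remove symmetry-equivalent results; a real implementation would canonicalize each structure and drop duplicates.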

Page 10: ICIC 2013 Conference Proceedings Josef Eiblmaier Infochem


Optical illusions 4

» Reaction arrows / forked arrows / brackets

Page 11: ICIC 2013 Conference Proceedings Josef Eiblmaier Infochem


Approach

© Gerd Altmann / PIXELIO, www.pixelio.de

Page 12: ICIC 2013 Conference Proceedings Josef Eiblmaier Infochem


Approach

» The algorithmic approach:

• A set of rules is applied in the software (generic, not project-specific). The software should recognize all cases that might occur!

• Project- (title-) specific rules (drawing conventions must not change; otherwise further development is necessary)

• Manual post-correction required (cost- and time-intensive)

• The problem is infinite; unprecedented issues cannot be handled

» The templating approach:

• Software is developed to recognize a defined set of problems (PS)

• All content must be manually pre-templated (cost-intensive) according to the capabilities of the software

» The hybrid approach:

• Depending on the source, the focus can be placed on either approach
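The algorithmic side of this slide can be pictured as a rule chain with a manual fallback. The sketch below is an assumption about how such a chain might be organized, not a description of ICSchemeProcessor; `process_scheme`, the two example rules, and the filled-circle convention for polymer supports are hypothetical.

```python
def process_scheme(items, generic_rules, project_rules):
    """Try generic rules first, then project-specific ones; queue whatever nothing handles."""
    resolved, manual_queue = {}, []
    for item in items:
        for rule in (*generic_rules, *project_rules):
            result = rule(item)
            if result is not None:
                resolved[item] = result
                break
        else:
            # No rule matched: an unprecedented case that needs manual post-correction.
            manual_queue.append(item)
    return resolved, manual_queue


def common_abbreviation(label):
    # Generic rule: abbreviations that mean the same in every source.
    return {"Me": "methyl", "Et": "ethyl", "Ph": "phenyl"}.get(label)


def house_style_label(label):
    # Hypothetical title-specific convention: a filled circle denotes a polymer support.
    return "polymer support" if label == "●" else None


resolved, manual = process_scheme(["Ph", "●", "X*"], [common_abbreviation], [house_style_label])
print(resolved)
print(manual)
```

In a templating-heavy project the manual queue would ideally stay empty, because the content has already been brought into a form the rules can handle.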

Page 13: ICIC 2013 Conference Proceedings Josef Eiblmaier Infochem


Templating

» Templating: Guidelines for authors and typesetters

• Syntax definitions for tables, R-groups etc.

• Syntax rules for captions

• Reaction arrangement, forked arrows

• Rules for reaction conditions (reactants, catalysts, solvents, yields, temperature)
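The slides do not spell out the guideline syntax, so the following sketch invents one purely for illustration: semicolon-separated `key: value` fields with the five keys named above. It shows why a fixed syntax helps: conforming text parses into typed data, and anything else is reported instead of guessed. The function name, the field syntax, and the example values (60 °C, 85%) are all made up.

```python
import re

# Hypothetical field names, taken from the categories listed on the slide.
ALLOWED_KEYS = {"reactants", "catalysts", "solvents", "yield", "temperature"}


def parse_conditions(caption):
    """Parse 'key: value; key: value' text into a dict plus a list of error messages."""
    parsed, errors = {}, []
    for part in filter(None, (p.strip() for p in caption.split(";"))):
        key, sep, value = part.partition(":")
        key, value = key.strip().lower(), value.strip()
        if not sep or key not in ALLOWED_KEYS:
            errors.append(f"unrecognized field: {part!r}")
        elif key == "yield":
            m = re.fullmatch(r"(\d+(?:\.\d+)?)\s*%", value)
            if m:
                parsed[key] = float(m.group(1))
            else:
                errors.append(f"bad yield: {value!r}")
        elif key == "temperature":
            m = re.fullmatch(r"(-?\d+(?:\.\d+)?)\s*°C", value)
            if m:
                parsed[key] = float(m.group(1))
            else:
                errors.append(f"bad temperature: {value!r}")
        else:
            parsed[key] = [v.strip() for v in value.split(",") if v.strip()]
    return parsed, errors


# Illustrative input; the numeric values are invented for the example.
ok, ok_errors = parse_conditions("catalysts: Pd(OAc)2; solvents: H2O, THF; temperature: 60 °C; yield: 85%")
bad, bad_errors = parse_conditions("reflux overnight")
print(ok, ok_errors)
print(bad, bad_errors)
```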

Page 14: ICIC 2013 Conference Proceedings Josef Eiblmaier Infochem


Examples:

» Algorithmic detection of features

» Resolution of repeating groups

» Enumeration of R-groups

» Resolution of aliases/labels

• source specific alias databases

• continuously extended

» Table Enumeration

• compound enumeration

• reaction factual data: Caption/Yield

» Variable points of attachment

» Forked arrows
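Two of the features listed above, repeating groups and table enumeration, are simple enough to sketch on plain strings. The function names, the placeholder template, and the use of string concatenation for SMILES are simplifying assumptions; real structures need proper attachment-point handling rather than text substitution.

```python
import re


def expand_repeats(label):
    """Expand '(unit)n' with a numeric n, e.g. '(CH2)3' -> 'CH2CH2CH2'."""
    return re.sub(r"\(([^()]+)\)(\d+)", lambda m: m.group(1) * int(m.group(2)), label)


def enumerate_table(template, rows):
    """Fill the R-group placeholders once per table row; keep other columns as factual data."""
    records = []
    for row in rows:
        r_values = {k: v for k, v in row.items() if k.startswith("R")}
        facts = {k: v for k, v in row.items() if not k.startswith("R")}
        records.append({"structure": template.format(**r_values), **facts})
    return records


print(expand_repeats("HO(CH2)3OH"))

# Hypothetical scheme: an amide template with two R-groups and a two-row table.
# A yield or caption column would be carried along the same way as 'entry'.
rows = [
    {"entry": "a", "R4": "C", "R5": "c1ccccc1"},
    {"entry": "b", "R4": "CC", "R5": "c1ccccc1"},
]
compounds = enumerate_table("{R4}C(=O)N{R5}", rows)
for compound in compounds:
    print(compound)
```

Repeat counts given as a variable (for example "n") cannot be expanded this way and would have to be flagged instead.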

Page 15: ICIC 2013 Conference Proceedings Josef Eiblmaier Infochem


Projects

Page 16: ICIC 2013 Conference Proceedings Josef Eiblmaier Infochem


Successful Application of CDX Processing:

Chemistry Enrichment Workflow* (Wiley Smart Article)

*Reinhard Neudert: Enhancing the User Experience for Wiley Chemistry Content, ICIC 2012, October 14 – 17, Berlin

Page 17: ICIC 2013 Conference Proceedings Josef Eiblmaier Infochem


Templating*

[Figure: templating workflow. Recoverable labels:]

• Author's CDX File

• CDX Template

• Enumerated structures

• Templating

• ICSchemeProcessor

• CDX-Templating Guidelines (Structures)

*Reinhard Neudert: Enhancing the User Experience for Wiley Chemistry Content, ICIC 2012, October 14 – 17, Berlin

Page 18: ICIC 2013 Conference Proceedings Josef Eiblmaier Infochem


Workflow Science of Synthesis Update

[Figure: workflow diagram in which the same example reaction scheme (compounds 39 and 40, R-groups R4 and R5) appears at three stages. Recoverable box labels:]

• Scheme

• ICSchemeProcessor

• CDX-Templating Guidelines (Reactions)

• Error Report

• Correct / extend process

• Manual data entry

• Scheme correction not possible
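One plausible reading of this workflow is a loop: process the scheme, and if an error report comes back, either correct the scheme and retry or, where correction is not possible, fall back to manual data entry. The sketch below encodes that reading; the control flow is an interpretation of the box labels, and every name in it (`run_update_workflow`, `toy_process`, `toy_correct`, the "?" marker) is hypothetical.

```python
def run_update_workflow(scheme, process, correct, max_rounds=3):
    """Process a scheme; on errors try to correct and retry, else fall back to manual entry."""
    errors = []
    for _ in range(max_rounds):
        records, errors = process(scheme)
        if not errors:
            return {"status": "converted", "records": records}
        corrected = correct(scheme, errors)
        if corrected is None:  # "Scheme correction not possible"
            break
        scheme = corrected
    return {"status": "manual data entry", "errors": errors}


def toy_process(scheme):
    # Toy stand-in for the processor: labels starting with '?' count as errors.
    errors = [label for label in scheme if label.startswith("?")]
    return [label for label in scheme if not label.startswith("?")], errors


def toy_correct(scheme, errors):
    # Toy stand-in for a correction step guided by the templating guidelines.
    fixes = {"?Ph": "Ph"}
    if any(error not in fixes for error in errors):
        return None
    return [fixes.get(label, label) for label in scheme]


fixable = run_update_workflow(["Me", "?Ph"], toy_process, toy_correct)
unfixable = run_update_workflow(["?X"], toy_process, toy_correct)
print(fixable)
print(unfixable)
```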

Page 19: ICIC 2013 Conference Proceedings Josef Eiblmaier Infochem


Sample Pharmaceutical Substances Update

Source: Thieme Pharmaceutical Substances, Abiraterone

Page 20: ICIC 2013 Conference Proceedings Josef Eiblmaier Infochem


Conclusion

» As much algorithmic processing as possible is desirable

• generic: can be applied to other content as well

• cheaper (humans cost money!)

» 100% conversion (without human interaction) is never possible

» Solutions are project- / source-specific

» The relevance of automatic extraction will continue to increase

» Authors / publishers play an essential role in a successful conversion

Page 21: ICIC 2013 Conference Proceedings Josef Eiblmaier Infochem


Acknowledgements

» Wiley

Michael Forster

Reinhard Neudert

» Thieme

Guido Herrmann

Rolf Hoppe

Klaus Köberlein

» InfoChem

Hans Kraut, Sascha Hausberg, Thomas Menke, Manuela Rauh

Fanny Irlinger, Huyen Ngyen, Dagmar Kunzmann

Page 22: ICIC 2013 Conference Proceedings Josef Eiblmaier Infochem


© Thomas Link / Flickr

Thank you!

Page 23: ICIC 2013 Conference Proceedings Josef Eiblmaier Infochem


Questions?

