Analysing and Classifying

Names of Chemical Compounds

with CHEMorph

Stefanie Anstein, Gerhard Kremer

IMS, University of Stuttgart

April 11, 2006


Introduction System Details Conclusion

Example Analysis

[Figure: structural formula of 7-hydroxyheptan-2-one]

7-hydroxyheptan-2-one

compd(ane(7*C),pref([1*[7]-hydroxy]),suff([1*[2]-one]))

CC(=O)CCCCCO ALCOHOL,KETONE,...

Stefanie Anstein, Gerhard Kremer CHEMorph 2 / 13


Motivation & Background

life sciences ... and the amount of biomedical data

terminology ... and biochemical nomenclature


Challenges

term reference

coreferences

R-0.1.7.3 (IUPAC nomenclature of organic compounds):

Addition of the vowel “o”.

For euphonic reasons, the vowel “o” is sometimes inserted between consonants.


Modules Overview

[Diagram: name → parser → semantic representation; SMILES string generator → SMILES string; classifier → classes]


Name Types

                  fully specified                 underspecified

systematic        7-hydroxyheptan-2-one           heptanone

trivial           benzene                         ∅

semi-systematic   benzene-1,3,5-triacetic acid    dihydrobenzene

class             ∅                               alcohol

semi-systematic   ∅                               2-deoxysugar


Parser

7 - hydroxy hept an - 2 - one

[Parse tree, given as node label: semantic value]

organic compound
  prefix: [??*[7]-hydroxy]
    locant: ??*[7]
      loc: [7]
      hyphen: ∅
    pref: hydroxy
  parent nonsugar: ane(7*'C')
    mult: 7
    parent suffix: λ(X,ane(X*'C'))
  suffix: [??*[2]-one]
    locant: ??*[2]
      hyphen: ∅
      loc: [2]
      hyphen: ∅
    suff: one

compd( ane(7*C) , pref( [??*[7]-hydroxy] ) , suff( [??*[2]-one] ) )
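
The segmentation and analysis on this slide can be illustrated with a small sketch. It is not the CHEMorph grammar: the four-entry lexicon, the function names `tokenize` and `analyse`, and the dictionary output format are all assumptions made for illustration, and only the morphemes of the example name are covered.

```python
import re

# Hypothetical mini-lexicon (an assumption, not the CHEMorph lexicon):
# morpheme -> (category, semantic value)
LEXICON = {
    "hydroxy": ("pref", "hydroxy"),
    "hept": ("mult", 7),
    "an": ("parent_suffix", "ane"),
    "one": ("suff", "one"),
}

# Locants, hyphens, then lexicon entries (longest first to prefer long matches).
TOKEN = re.compile(
    r"\d+|-|" + "|".join(sorted(LEXICON, key=len, reverse=True))
)


def tokenize(name):
    """Split a compound name into locants, hyphens and known morphemes."""
    tokens, pos = [], 0
    while pos < len(name):
        match = TOKEN.match(name, pos)
        if match is None:
            raise ValueError(f"cannot segment {name!r} at position {pos}")
        tokens.append(match.group())
        pos = match.end()
    return tokens


def analyse(name):
    """Collect parent chain, prefixes and suffixes with their locants.

    A missing locant is recorded as None, which leaves the position
    underspecified.
    """
    n_carbons, locant = None, None
    prefixes, suffixes = [], []
    for tok in tokenize(name):
        if tok == "-":
            continue
        if tok.isdigit():
            locant = int(tok)
            continue
        category, value = LEXICON[tok]
        if category == "mult":
            n_carbons = value
        elif category == "pref":
            prefixes.append((locant, value))
            locant = None
        elif category == "suff":
            suffixes.append((locant, value))
            locant = None
        # "an" (parent_suffix) only marks a saturated parent chain here.
    return {"parent": ("ane", n_carbons), "pref": prefixes, "suff": suffixes}


print(tokenize("7-hydroxyheptan-2-one"))
print(analyse("7-hydroxyheptan-2-one"))
```

Calling `analyse("hydroxyheptan-2-one")` on a name without the first locant yields `(None, "hydroxy")` in the prefix list, which is one simple way to carry an unknown position forward.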


SMILES String Generator

representation of single chain elements

consistency check

underspecification:

underspecified( CC(=O)CCCCC , [{1,3,4,5,6,7}-hydroxy] )
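
A minimal sketch of how such an underspecified result could be expanded into concrete candidates, assuming an unbranched carbon chain of at least two atoms. The valence-counting rule and the names `chain_smiles`, `free_positions` and `expand_underspecified` are assumptions for illustration; the slides do not say how CHEMorph computes the position set.

```python
# Bonds consumed on a carbon by each substituent fragment (illustrative subset).
BOND_COST = {"=O": 2, "O": 1}


def chain_smiles(n_carbons, substituents):
    """Write an unbranched carbon chain with substituents as a SMILES string.

    substituents maps a 1-based position to a list of fragments,
    e.g. {2: ["=O"], 7: ["O"]}.
    """
    parts = []
    for pos in range(1, n_carbons + 1):
        atom = "C"
        frags = substituents.get(pos, [])
        for i, frag in enumerate(frags):
            # The last fragment on the last atom needs no branch parentheses.
            is_last = pos == n_carbons and i == len(frags) - 1
            atom += frag if is_last else f"({frag})"
        parts.append(atom)
    return "".join(parts)


def free_positions(n_carbons, substituents, needed=1):
    """Positions whose carbon still has at least `needed` free valences.

    Assumes n_carbons >= 2, so terminal carbons have one chain bond.
    """
    free = []
    for pos in range(1, n_carbons + 1):
        chain_bonds = 1 if pos in (1, n_carbons) else 2
        used = chain_bonds + sum(BOND_COST[f] for f in substituents.get(pos, []))
        if 4 - used >= needed:
            free.append(pos)
    return free


def expand_underspecified(n_carbons, substituents, fragment="O"):
    """Place `fragment` at every free position; return {position: SMILES}."""
    results = {}
    for pos in free_positions(n_carbons, substituents, BOND_COST[fragment]):
        placed = {p: list(f) for p, f in substituents.items()}
        placed.setdefault(pos, []).append(fragment)
        results[pos] = chain_smiles(n_carbons, placed)
    return results


ketone = {2: ["=O"]}                      # heptan-2-one
print(chain_smiles(7, ketone))            # backbone without the hydroxy group
print(free_positions(7, ketone))          # candidate hydroxy positions
print(expand_underspecified(7, ketone))   # one SMILES string per candidate
```

Under these assumptions position 2 is excluded because the carbonyl carbon has no free valence left, which reproduces the set {1,3,4,5,6,7} shown on the slide.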


Classifier

morpheme              class

hydroxy- | -ol        ALCOHOL

cyclo- & -ane         CYCLOALKANE

compd( ane(7*C) , pref([1*[7]-hydroxy]) , suff([1*[2]-one]) )

→ ALKANE, ALCOHOL, KETONE

compd( ene(??*[??],ane(4*'C')) , pref([]) , suff([]) )

→ ALKENE
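
The morpheme-to-class table reads as a set of rules in which "|" means any of the listed morphemes and "&" means all of them. A small sketch of that reading follows. Only the ALCOHOL and CYCLOALKANE rules come from the table; the KETONE, ALKANE and ALKENE rules are assumptions inferred from the two example outputs, and the flat morpheme list used as input is a simplification of the compd(...) term.

```python
# (class, quantifier, morphemes): `any` models "|" and `all` models "&".
RULES = [
    ("ALCOHOL", any, ["hydroxy", "ol"]),      # from the slide
    ("CYCLOALKANE", all, ["cyclo", "ane"]),   # from the slide
    ("KETONE", any, ["one"]),                 # assumption
    ("ALKANE", any, ["ane"]),                 # assumption
    ("ALKENE", any, ["ene"]),                 # assumption
]


def classify(morphemes):
    """Return every class whose morpheme condition holds for the input."""
    present = set(morphemes)
    return [
        cls
        for cls, quantifier, required in RULES
        if quantifier(m in present for m in required)
    ]


# 7-hydroxyheptan-2-one: parent "ane", prefix "hydroxy", suffix "one"
print(classify(["hydroxy", "ane", "one"]))
# an alkene whose analysis contributes only the "ene" morpheme
print(classify(["ene"]))
```

Keeping the rules as data rather than code means that new classes can be added by extending the table, in the spirit of the extendable lexicon the talk describes.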


Results & Applications

SMILES string and classification

underspecification

term reference

coreference resolution

database curation and ontology acquisition


Conclusion & Outlook

feasible, extendable and transferable approach

extend grammar and lexicon

elaborate SMILES and classification

sophisticated linguistic analysis → database curation

term identification → text processing applications


Acknowledgements

Stefanie Anstein

Uwe Reyle

Jasmin Saric

EML Research gGmbH


Thank you very much.


IUPAC Nomenclatures

Amino Acids and Peptides      EC 5 Isomerases             Phosphorus containing compds
Biochemical thermodynamics    EC 6 Ligases                Polymerized amino acids
Branched nucleic acids        Folic acid                  Polypeptide conformation
Carbohydrates                 Glycolipids                 Polynucleotide conformation
Carotenoids                   Glycoproteins               Polysaccharide conformation
Corrinoids (vitamin B12)      myo-Inositol numbering      Prenol nomenclature
Cyclitols                     Lignan Nomenclature         Pyridoxal (vitamin B6)
Electron transport proteins   Lipid Nomenclature          Quinones w. an Isoprenoid Chain
Enzyme kinetics               Multienzymes                Retinoids
Enzyme nomenclature           Multiple forms of enzymes   Steroids
EC 1 Oxidoreductases          Nucleic acid constituents   Tetrapyrroles
EC 2 Transferases             Nucleic acid sequence       Tocopherols (vitamin E)
EC 3 Hydrolases               Organic Chemistry           Translation Factors
EC 4 Lyases                   Peptide hormones            Vitamin D


KEGG: Kyoto Encyclopedia of Genes and Genomes


[Figure: class hierarchy for 7-HYDROXYHEPTAN-2-ONE. Nodes: PRIMARY ALCOHOL, ALCOHOL, 7-HYDROXYHEPTANE, HYDROXYHEPTANE, HEPTANE, 7-HYDROXYALKANE, HYDROXYALKANE, 7-HYDROXYKETONE, HYDROXYKETONE, HYDROXYHEPTAN-2-ONE, HEPTAN-2-ONE, KETONE, ALKANE]
