Digitalization and Chemical Entity Recognition of Chemisches Zentralblatt: Unrivaled Historical Information Meets Modern Technology




Page 1: Digitalization and Chemical Entity Recognition of Chemisches Zentralblatt:

InfoChem / ETH Zürich. Copyright © 2009 Brändle, Eigner-Pitto. Fraunhofer Symposium on Text Mining, Bonn, October 5-6, 2009

Digitalization and Chemical Entity Recognition of Chemisches Zentralblatt:

Unrivaled Historical Information Meets Modern Technology

M. Brändle (ETH Zürich), V. Eigner-Pitto (InfoChem GmbH)

Page 2: Digitalization and Chemical Entity Recognition of Chemisches Zentralblatt:

Historical Importance of Chemisches Zentralblatt

Timeline (figure): 1817 Gmelin Handbook …; 1830-1969 Chemisches Zentralblatt; 1881 Beilstein Handbook …; 1907 Chemical Abstracts … (coverage extended back to 1840); further axis labels 1771 and 1772.

Chemisches Zentralblatt (1830-1969):

• First and oldest abstracts journal in chemistry

• Covers chemical literature from 1830 to 1969

• Describes the „birth“ of chemistry as a science (vs. alchemy)

Chemical Abstracts (1907 …):

• Largest single abstracts source in chemistry

• Currently >31 million papers and patents

• Content 1840-1906 added retrospectively

Page 3: Digitalization and Chemical Entity Recognition of Chemisches Zentralblatt:

Chemisches Zentralblatt: Content

• Covers 140 years of chemistry

• About 3.6 million abstracts of journal articles and patents

• 900‘000 pages (115‘000 for time period 1830-1906)

  • 700‘000 pages with abstracts

  • 200‘000 pages of indexes („Register“)

• Indexes and year of introduction:

  • Author: 1830

  • Subject, alphabetic: 1830

  • Subject, systematic: 1863

  • Patent: 1897

  • Formula: 1925

  • General indexes: 1883

Page 4: Digitalization and Chemical Entity Recognition of Chemisches Zentralblatt:

History of Chemisches Zentralblatt: Rise

1830 „Pharmaceutisches Central-Blatt“: 403 abstracts, 544 pages, 10 journals; weekly after 8 months.

1850 Title changes to „Chemisch-Pharmaceutisches Central-Blatt“

1856 „Chemisches Central-Blatt“

1864 Introduction of a systematic table of contents (classification of chemistry)

1879 First patent abstracts in „kleinen Mittheilungen“

1883 1st edition of General Index

1884 In-text images

1888 273 journals excerpted

Page 5: Digitalization and Chemical Entity Recognition of Chemisches Zentralblatt:

History of Chemisches Zentralblatt: Prosperity

1897 Holding passes to Deutsche Chemische Gesellschaft for DM 15‘000. Introduction of patent index.

1901 Editorial office moves from Leipzig to Berlin.

1919 Takes over abstracts from Angew. Chem. Split into scientific (I/III) and technical part (II/IV).

1921 Begins to cover foreign patents.

1924 CZ is reunified into one journal of abstracts.

1925 Introduction of formula index.

1929 Centennial: Richard Willstätter accentuates „timeliness, exactness, completeness“ as attributes and requirements for quality of CZ.

Page 6: Digitalization and Chemical Entity Recognition of Chemisches Zentralblatt:

History of Chemisches Zentralblatt: Decline

1940-1945 WW II: Difficulties in collecting information. 1944 bombing of editorial office.

1947-1949 Double production of CZ in East and West Germany.

1950 Reunification of CZ under East and West German organisations.

1954 Trying to fill gap by supplement volumes.

1961 Berlin Wall does not hinder production (editorial offices in East Berlin and West Berlin).

1967 Introduction of SRD (Schnellreferatedienst, quick abstract service) for organic chemistry.

1969 GDR office declares itself unable to afford production of SRD and of the journal. CZ ceases publication. SRD continued as „Chemischer Informationsdienst“ (ChemInform).

Page 7: Digitalization and Chemical Entity Recognition of Chemisches Zentralblatt:

Chemisches Zentralblatt vs. CA: Quantity

[Figure: two charts comparing Chemisches Zentralblatt and CA, one for pages and one for abstracts; annotations mark WW I, WW II and the CA format change]

Page 8: Digitalization and Chemical Entity Recognition of Chemisches Zentralblatt:

Chemisches Zentralblatt vs. CA: Quality

• Many textbooks on chemical literature claim better quality of Chemisches Zentralblatt than CA for the pre-WW II period

• H. Skolnik, The literature matrix of chemistry, 1982: „outstanding A/I service“

• R.E. Maizell, How to find chemical information, 3rd ed. 1998, citing E.J. Crane: „[..] has value because of [..] good abstracts“

• M. Mücke, Die chemische Literatur, 1982 (translated from German): „CA was indeed numerically [..] superior to Chemisches Zentralblatt, but the reverse was true as regards the quality of the abstracts.“

• R.T. Bottle, J.F. Rowland, Information Sources in Chemistry, 4th ed. 1993: „Before WW II, many chemists regarded CZ as superior in coverage to CA; its abstracts were longer and more informative [...]“

• A.S.K. Atsu, Comparative coverage of chemical abstracting services in the period 1906-1940, M. Sc. Thesis, City University, London (1976)

Page 9: Digitalization and Chemical Entity Recognition of Chemisches Zentralblatt:

Chemisches Zentralblatt vs. CA: Quality

Example: Hans Fischer, Georg Stangler, Synthese des Mesoporphyrins, Mesohämins und über die Konstitution des Hämins, Justus Liebigs Ann. Chem. 459 (1927), 53-98.

                      CZ I (1928), 528    CA 22:11339 (1928), 1363
Length (pages)        7.5                 1
Length (words)        3,882               690
Length (chars)        24,308              4,695
Compounds             ~ 120               ~ 70
Structure formulas    ✔                   ✕

Page 10: Digitalization and Chemical Entity Recognition of Chemisches Zentralblatt:

Chemisches Zentralblatt: Digitalization

• Relevant for documentation of prior art

• Continuous and growing demand for the information

• FIZ Chemie Berlin has scanned the whole work and offers a full-text searchable database for the web and the dataset for integration in intranets

• ETH Zurich has bought the digitalized raw material (PDFs with OCRed text in the background) from FIZ and is creating a database offering full-text search

  • 900‘000 PDF pages, 1.3 TB

  • Raw text content incl. search index about 10 GB

• CAS has performed automatic translation (German → English) of the 1897-1907 volumes and included them in CAplus

Page 11: Digitalization and Chemical Entity Recognition of Chemisches Zentralblatt:

Reasons for buying digitalized Chem. Zentralblatt

www.infochembio.ethz.ch/en/holdings.html

Page 12: Digitalization and Chemical Entity Recognition of Chemisches Zentralblatt:

Reasons for buying digitalized Chem. Zentralblatt

• Space

  • Loss of compact shelving space in basement (432 m → 194 m, -55%)

  • Disposal of printed Beilstein, CA, Chem. Zentralblatt

• Access

  • e-books, e-journals, end-user databases at workbench of chemist

  • Chemists are trained on electronic sources; print and µ-film are cumbersome

• Restoration costs due to deterioration of acid-containing paper

  • 17K€/t for deacidification: Chem. Zentralblatt 1.6 t → 27K€

  • Digitalization and operation costs much higher (10x), but can be divided

• Ease of use: Search / Browse / Print
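The two figures quoted on this slide can be checked with a few lines of arithmetic. This is only a sanity check of the slide's own numbers; the variable names are illustrative and not part of the original material.

```python
# Figures quoted on the slide; variable names are illustrative only.
shelf_before_m = 432
shelf_after_m = 194
deacidification_eur_per_tonne = 17_000
zentralblatt_mass_tonnes = 1.6

# Relative change in compact shelving length, in percent.
shelf_change_pct = (shelf_after_m - shelf_before_m) / shelf_before_m * 100

# Deacidification cost for the printed Chem. Zentralblatt, in euros.
deacidification_cost_eur = deacidification_eur_per_tonne * zentralblatt_mass_tonnes

print(f"Shelving change: {shelf_change_pct:.0f}%")
print(f"Deacidification cost: {deacidification_cost_eur / 1000:.1f}K€")
```

Both values agree with the rounded figures on the slide (-55% and about 27K€).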

Page 13: Digitalization and Chemical Entity Recognition of Chemisches Zentralblatt:

Quality of Obtained Raw Data

• Errors upon conversion

• Visual inspection of pages: Cover Flow / Quick Look technology

Page 14: Digitalization and Chemical Entity Recognition of Chemisches Zentralblatt:

Quality of Raw Data Observed: Page Errors

• File errors (conversion)

  • Unreadable directories (missing content)

  • Defective PDF files (missing content)

• Errors during scanning (visual inspection)

  • Duplicate pages (shifting page index)

  • Missing pages (shifting page index, missing content)

  • Issues scanned in wrong order (minor)

  • Two pages on one (shifting page index)

  • Wrong volume (missing content)
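The slides report that these errors were found by visual inspection. As a complementary idea only, and not the authors' method, a page-index shift could also be flagged automatically if the printed page number of each scan were available. The sketch below assumes a hypothetical list of printed page numbers, one per scanned PDF page, with `None` where no number could be read; the function name is made up for illustration.

```python
from typing import Optional


def find_page_index_problems(printed_numbers: list[Optional[int]]) -> list[tuple[int, str]]:
    """Flag scans whose printed page number breaks the expected +1 sequence."""
    problems = []
    expected = None  # page number we expect on the current scan
    for scan_index, number in enumerate(printed_numbers):
        if number is not None and expected is not None and number != expected:
            if number == expected - 1:
                problems.append((scan_index, "duplicate page"))
            elif number > expected:
                missing = number - expected
                problems.append((scan_index, f"{missing} page(s) missing before this scan"))
            else:
                problems.append((scan_index, "out of order"))
        if number is not None:
            expected = number + 1
        elif expected is not None:
            expected += 1  # unreadable number: assume the sequence continues
    return problems


# Hypothetical sequence: page 2 scanned twice, page 4 missing.
print(find_page_index_problems([1, 2, 2, 3, 5, 6]))
```

A check like this would not catch a wrong volume or two pages scanned onto one image, so it could at best narrow down where visual inspection is needed.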

Page 15: Digitalization and Chemical Entity Recognition of Chemisches Zentralblatt:

Quality of Raw Data Observed: OCR

• ETH works with OCR from FIZ Chemie

• page word index, 346 million „words“

• 8.8% with only 1 character

• slightly expanded fonts, e.g. for author names, sum formulas

• Abbreviations (journal names, Zentralblatt = C), numbers

• element names in structure formulas
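The share of one-character „words“ is a simple indicator of OCR fragmentation. The sketch below shows how such a share could be computed on a token list; the sample string is invented to mimic letter-spaced text and split formulas, and the function name is hypothetical. The last lines apply the slide's own figures (8.8% of 346 million).

```python
def single_character_share(tokens: list[str]) -> float:
    """Return the fraction of tokens that consist of exactly one character."""
    if not tokens:
        return 0.0
    return sum(1 for token in tokens if len(token) == 1) / len(tokens)


# Invented sample: a split sum formula and a letter-spaced word.
sample_tokens = "C 6 H 5 Benzoesäure wird mit K a l i geschmolzen".split()
share = single_character_share(sample_tokens)
print(f"{share:.1%} of {len(sample_tokens)} tokens have one character")

# Applying the slide's figures: 8.8% of 346 million index entries.
single_char_words = round(346_000_000 * 0.088)
print(f"about {single_char_words / 1e6:.1f} million one-character entries")
```

On the full index this corresponds to roughly 30 million one-character entries.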

Page 16: Digitalization and Chemical Entity Recognition of Chemisches Zentralblatt:

Planned Tasks ETH Zürich

• Adding navigation structure, provide DB search and browse for ETH members (Q4/09)

• Mining and Markup (Q1/10)

  • Bibliographic references

  • Authors

  • General Subject Headings

• Reference linking to journal articles and patents (Q1/10)

Page 17: Digitalization and Chemical Entity Recognition of Chemisches Zentralblatt:

Chemisches Zentralblatt: Conclusion

• Covers chemical literature from 1830 to 1969

• Very good abstract quality

• Better quality (length, details) than CA for pre-WW II period 1907-1940

• Contains also important patent information

• Invaluable information in indexes (e.g. synonyms of ancient chemical names)

• Only comprehensive abstract journal on the market up to 1907

• More comprehensive than CA for 19th century literature

• Complements Beilstein and Gmelin handbooks for 19th century literature

Page 18: Digitalization and Chemical Entity Recognition of Chemisches Zentralblatt:

Importance of Chemisches Zentralblatt: Example

Org. Lett., 2006, 8 (19), pp 4279–4281

Chemisches Zentralblatt, 1904, 2, 1145

The authors have retracted this paper on November 15, 2007 (Org. Lett. 2007, 24, 5139)

Page 19: Digitalization and Chemical Entity Recognition of Chemisches Zentralblatt:

InfoChem Motivation

• Text search in Chemisches Zentralblatt:

• Abstracts in German language

• High number of old German chemical names

• Chemists think in structures!!!

• Language-independent structure search would help ALL scientists to access this historical source and to use the relevant information of this art

• Required technology for structure search projects

• Optimized German-English dictionaries

• 30 million SPRESI names

Page 20: Digitalization and Chemical Entity Recognition of Chemisches Zentralblatt:

Overview of Approach and Applied Technology

[Workflow diagram; recoverable elements:]

• .tiff documents

• OCR

• PDF documents (text under image)

• NER with ICANNOTATOR, using SPRESI dictionaries

• N2S (names converted to structures)

• Database with link to original literature

• Combined search on federated search system (ICFEDSEARCH)

• Manual abstraction of sample set for evaluation, with quantitative comparison

Page 21: Digitalization and Chemical Entity Recognition of Chemisches Zentralblatt:

Challenges OCR (1)

[Figure with panels labelled 1830, 1870, 1910, 1930 and 1969]

Page 22: Digitalization and Chemical Entity Recognition of Chemisches Zentralblatt:

Challenges OCR (2)

• Bad quality of original source: dirty (blotted, stained) pages, print from back page showing through

Page 23: Digitalization and Chemical Entity Recognition of Chemisches Zentralblatt:

Challenges OCR (3)

• Tables: extremely small fonts, beginning and end of columns not recognizable

Page 24: Digitalization and Chemical Entity Recognition of Chemisches Zentralblatt:

Challenges OCR (4)

• Ambiguous old fonts (h=b; c=e; ligatures)

• Spaced text

Specific rules, large German dictionaries and extensive training are applied to correct systematic mistakes of standard OCR process
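The slides do not describe the correction rules in detail. As a minimal sketch of the general idea, assuming only the two character confusions named above (h/b and c/e) and a toy stand-in for a large German dictionary, a token that is not in the dictionary can be tested against all variants obtained by swapping confusable letters. The names `CONFUSIONS`, `LEXICON` and `correct_token` are hypothetical and do not refer to InfoChem's software.

```python
from itertools import product

# Confusable letters taken from the slide (h=b, c=e); ligatures are not handled.
CONFUSIONS = {"h": "hb", "b": "bh", "c": "ce", "e": "ec"}

# Toy stand-in for a large German dictionary (assumption for illustration).
LEXICON = {"schwefel", "benzol", "chlor"}


def correct_token(token: str, lexicon: set[str] = LEXICON, max_ambiguous: int = 12) -> str:
    """Return the unique dictionary word reachable by h/b and c/e swaps, else the token."""
    lowered = token.lower()
    if lowered in lexicon:
        return token
    options = [CONFUSIONS.get(char, char) for char in lowered]
    # Skip tokens with too many ambiguous letters to keep the search small.
    if sum(len(option) > 1 for option in options) > max_ambiguous:
        return token
    matches = {"".join(chars) for chars in product(*options)} & lexicon
    # Only correct when exactly one dictionary word matches.
    return matches.pop() if len(matches) == 1 else token


for raw in ["sehwcfel", "hcnzol", "wasser"]:
    print(raw, "->", correct_token(raw))
```

Corrected tokens come back in lower case. A production system would also need context and training data, which the slide mentions but this sketch omits.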

Page 25: Digitalization and Chemical Entity Recognition of Chemisches Zentralblatt:

Challenges Annotation (1)

• Names lack position, valence or stoichiometric information

  • Pimarsäure: is it the R or L form?

  • Platinchlorid: in which oxidation state, II, III or IV?

• Chemical names that indicate a chemical class

  • Nitrolsäure (nitrolic acid): R–C(=N–OH)–NO2

  • Lactonsäure (lactonic acid): any of several acids with a lactone ring bearing the carboxylic group

• Mixed compounds

  • Eunole: Naphthole + Eucalyptusöl

  • Pikrotoxin: Pikrotoxinin + Pikrotin

NO solution: correct structure information is not available in the original source

Page 26: Digitalization and Chemical Entity Recognition of Chemisches Zentralblatt:

Challenges Annotation (2)

• Obsolete German language

  • Schwefelsaures Natrium, Chlorür, Bromür

• Historical names

  • Pelopeum = Columbium = Niobium

• Different spelling for the same name:

  • Dibrom… = Bibrom…

  • Ätzkali = Aetzkali
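Spelling variants like these are commonly collapsed onto one lookup key before a dictionary search. The sketch below is an illustration under stated assumptions, not the rule set used in the project: it covers only the two variants shown on the slide (Ae for Ä, and Bi- for Di- before „brom“), and the function name is hypothetical.

```python
import re
import unicodedata


def normalize_name(name: str) -> str:
    """Map spelling variants of a German chemical name onto one lookup key."""
    key = unicodedata.normalize("NFC", name).lower()
    # "Aetzkali" vs. "Ätzkali": treat the digraph "ae" as the umlaut "ä".
    # Assumption: acceptable for a toy example, too aggressive for real text.
    key = key.replace("ae", "ä")
    # "Bibrom…" vs. "Dibrom…": old prefix "bi" for "di", only before "brom".
    key = re.sub(r"^bi(?=brom)", "di", key)
    return key


for variant in ["Aetzkali", "Ätzkali", "Bibrombenzol", "Dibrombenzol"]:
    print(variant, "->", normalize_name(variant))
```

Restricting the prefix rule to a known stem avoids rewriting unrelated names that happen to start with „Bi“.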

Page 27: Digitalization and Chemical Entity Recognition of Chemisches Zentralblatt:

Solutions in Annotation Process

• Correction of German-specific grammar

• Translation into English of chemical names that were not available

• Research in old sources:

  • Beilstein

  • Brockhaus Encyclopedia

  • German-English dictionaries of chemistry

  • Meyers Encyclopedia

  • Pierer Encyclopedia

• References to very old books, journals, articles

  • “Naturwissenschaftliche Exzerpte und Notizen Mitte 1877 bis Anfang 1883” by Karl Marx

Page 28: Digitalization and Chemical Entity Recognition of Chemisches Zentralblatt:

Results Annotation Chemisches Zentralblatt

• 120,000 pages covering time period 1830-1907

• 2.4 million chemical names with associated structure

• 98,000 unique names

• 47,000 unique structures

Quantitative comparison with manually abstracted sample set

• Recall 51%

• Precision 87%
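For readers unfamiliar with the two measures: recall is the share of manually abstracted compounds that the automatic annotation also found, and precision is the share of automatically found compounds that are correct. The slides do not say how matches were counted, so the sketch below uses the standard set-based definitions on an invented toy sample; the function name and the compound names are illustrative only.

```python
def precision_recall(predicted: set[str], reference: set[str]) -> tuple[float, float]:
    """Standard set-based precision and recall of predicted against reference."""
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


# Invented toy sample, not data from the study.
manual = {"Benzol", "Anilin", "Phenol", "Toluol"}
automatic = {"Benzol", "Anilin", "Wasserglas"}

precision, recall = precision_recall(automatic, manual)
print(f"precision = {precision:.0%}, recall = {recall:.0%}")
```

High precision with moderate recall, as reported on the slide, means that most annotated structures are correct while about half of the compounds in the text are not yet captured.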

Page 29: Digitalization and Chemical Entity Recognition of Chemisches Zentralblatt:

Federated Search Prototype

Page 30: Digitalization and Chemical Entity Recognition of Chemisches Zentralblatt:

Federated Search Prototype

Page 31: Digitalization and Chemical Entity Recognition of Chemisches Zentralblatt:

Federated Search Prototype

Page 32: Digitalization and Chemical Entity Recognition of Chemisches Zentralblatt:

Summary

• Described the history, content and present-day importance of Chemisches Zentralblatt

• Illustrated how the challenges of the OCR and annotation processes have been addressed

• Time period 1830-1907 contains 98,000 unique names and 47,000 unique structures

• Quantitative comparison shows over 50% recall and nearly 90% precision

• The generated structure-searchable Chemisches Zentralblatt database is integrated in ICFEDSEARCH

Page 33: Digitalization and Chemical Entity Recognition of Chemisches Zentralblatt:

Outlook

Chemisches Zentralblatt    Phase 1, Q2 2009    Phase 2, Q4 2009
Pages                      120,000             900,000
Time period                1830-1907           1830-1969
Unique names               98,000              ca. 1 million
Unique structures          47,000              ca. 500,000
Recall                     50%                 ?

Page 34: Digitalization and Chemical Entity Recognition of Chemisches Zentralblatt:

Acknowledgements

• Prof. Dr. Deplanque, Mr. Heineke and FIZ Chemie Team Berlin

• Ms. Langanke

• InfoChem Team

• Chemistry Biology Pharmacy Information Center (ETH Zürich)

Thank you!

InfoChem GmbH: www.infochem.de, www.spresi.com, [email protected]

ETH Zürich: www.infochembio.ethz.ch, [email protected]
