Recent developments in the CLiDE tool for extraction of chemical structure data from patents and other documents
Aniko T. Valko
Keymodule Ltd.
Peter Johnson, Vilmos A. Valko
Summary
1) About CLiDE
● What is CLiDE for?
2) Performance against a benchmark set of images
● About the benchmark set
● Performance of CLiDE
● Enhancements made in CLiDE
● Comparison with selected systems
3) Performance against selected patents
● About patents
● Performance of CLiDE
● Comparison with selected systems
4) Conclusions and future work
Part 1:
About CLiDE
What is CLiDE for?
CLiDE is an Optical Chemical Structure Recognition (OCSR) software application that converts structure diagrams into computer-readable structures (i.e. connection tables).
● Document formats: PDF, DOC, DOCX, HTML
● Image formats: BMP, GIF, JPEG, PBM, PGM, PNG, PNM, PPM, TIFF, XBM, XPM
● Structure formats: Molfile, RGfile, SDfile, CDX, CML, MRV
● XML
● Figure labels: valence-violated atom, non-interpreted atom, clashing atoms, small bond angle, atoms at which CLiDE broke up the structure
Part 2:
Performance against a benchmark set of images
Benchmark set
● Images of isolated structures, one structure per image
● #images: 5735
● US Patent Office Complex Work Unit
● Example images: US07321045-20080122-C00150, US07320974-20080122-C00070, US07323286-20080129-C00108, US07317070-20080108-C00008, US07316739-20080108-C00281, US07314700-20080101-C00001, US07320972-20080122-C00016, US07314876-20080101-C00035, US07314576-20080101-C00035, US07314511-20080101-C00002
● Available on the OSRA web site
● Verification set: each image is associated with a Molfile meant to describe the correct connection table
Test runs on the benchmark set
● Test environment
CPU: 3 GHz Core 2 Duo
Memory: 4 GB
Linux distribution: Ubuntu 10.04 (64-bit)
● Test run per image
1) CLiDE was run on an image
2) CLiDE analysed the image and generated a connection table
3) The connection table extracted by CLiDE was compared (using canonical SMILES) to the corresponding connection table from the verification set (the so-called ‘ground truth’)
● Performance measurements
1) Accuracy rate: the percentage of images that were correctly processed by CLiDE
2) Runtime: the total runtime measured over all the test runs
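The accuracy rate defined above can be illustrated with a minimal Python sketch. It assumes that canonical SMILES strings have already been generated for both the extracted and the ground-truth connection tables by the same cheminformatics toolkit (canonical forms from different toolkits are generally not comparable). The function name `accuracy_rate`, the dictionary layout and the file names are hypothetical and are not part of CLiDE.

```python
def accuracy_rate(extracted, ground_truth):
    """Return the percentage of images whose extracted canonical SMILES
    equals the ground-truth canonical SMILES.

    Both arguments map an image name to a canonical SMILES string.
    A missing entry or None means no structure was produced for that image.
    """
    if not ground_truth:
        raise ValueError("ground truth is empty")
    correct = 0
    for image, truth in ground_truth.items():
        candidate = extracted.get(image)
        # Plain string equality is only meaningful for canonical SMILES.
        if candidate is not None and candidate == truth:
            correct += 1
    return 100.0 * correct / len(ground_truth)


# Hypothetical toy data: two of the four images match the ground truth.
truth = {"a.TIF": "CCO", "b.TIF": "c1ccccc1", "c.TIF": "CC(=O)O", "d.TIF": "CN"}
found = {"a.TIF": "CCO", "b.TIF": "c1ccccc1", "c.TIF": "CC(=O)N", "d.TIF": None}
print(accuracy_rate(found, truth))  # 50.0
```

Comparing strings rather than connection tables keeps the check simple, but it is only as reliable as the canonicalisation step that precedes it.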
Performance against benchmark set
Accuracy rate by CLiDE version:
● 3.2.0: 57.62%
● 4.2.0: 59.00%
● 4.4.0: 58.91%
● 5.0.0: 59.30%
● 5.2.0: 59.30%
● 5.2.1: 81.81%
● 5.4.0: 82.75%
● 5.5.0: 84.60%
● 5.5.1: 85.78%
● 5.5.2: 86.55%
● 5.5.3: 87.79%
● 5.5.4: 87.96%
Enhancements noted on the chart:
● Optimization and improvements in CLiDE’s document segmentation method (see later)
● Auto correction of atom labels
● Better handling of aromatic rings
● Parsing chemical formulas
● Avoidance of loss of characters in atom labels
● Better handling of thick bonds
● Further improvements to chemical formula parsing
Overall: accuracy rate 57.62% → 87.96%; runtime 44 min → 20 min
Enhancements in CLiDE
Corrections in atom labels
59.30% → 81.81%
● Auto correction of OCR errors in atom labels
● Avoidance of misinterpretation of ‘Cl’ labels as carbons
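One way to picture auto correction of OCR errors in atom labels is a look-alike substitution table checked against a list of known labels. The sketch below is only an illustration of that idea and is not CLiDE's actual method; the names `KNOWN_LABELS`, `LOOKALIKES` and `correct_label`, and the particular character confusions listed, are assumptions.

```python
from itertools import product

# Small illustrative subset of valid atom labels (assumption).
KNOWN_LABELS = {"H", "B", "C", "N", "O", "F", "P", "S", "I", "Si", "Cl", "Br"}

# Hypothetical character confusions an OCR engine might make.
LOOKALIKES = {"0": "O", "1": "l", "I": "l", "5": "S", "8": "B"}


def correct_label(raw):
    """Return a corrected atom label, or None if no unique correction exists."""
    if raw in KNOWN_LABELS:
        return raw
    # For every character, allow either the character itself or its look-alike.
    options = [(ch, LOOKALIKES[ch]) if ch in LOOKALIKES else (ch,) for ch in raw]
    candidates = {"".join(chars) for chars in product(*options)} & KNOWN_LABELS
    # Accept the correction only when it is unambiguous.
    return candidates.pop() if len(candidates) == 1 else None


print(correct_label("C1"))  # Cl
print(correct_label("0"))   # O
print(correct_label("Xx"))  # None
```

Returning `None` for ambiguous or unknown labels leaves the decision to a later stage instead of silently guessing an element.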
Enhancements in CLiDE
Chemical formula parsing
82.75% → 84.60%
Two-step process:
1) Parsing the chemical formula into a sub connection table
2) Generating atom coordinates for the sub connection table
Super Atom Database: over 1000 super atoms, e.g. Me, Ph, Boc, TBDMS
Problem categories:
● Super atoms in chemical formulas (―CO2Ph)
● Left- and right-aligned chemical formulas (―CH2NH2 vs NH2CH2―)
● Branching in chemical formulas (―OC(CH3)3)
● Chemical formulas with multiple attachments (―OCH2CH2O―)
Future work:
● Variables in chemical formulas (―CO2R, ―NHZ, ―SiR3)
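The first stage of formula parsing can be sketched as a small recursive-descent tokenizer. The Python below only splits a condensed formula into symbols, counts and parenthesised branches; it does not build a connection table or generate coordinates, and it is not CLiDE's implementation. The names `SUPER_ATOMS`, `ELEMENTS` and `parse_formula` are hypothetical, and the symbol sets are tiny illustrative subsets.

```python
SUPER_ATOMS = {"Me", "Ph", "Boc", "TBDMS"}  # illustrative subset only
ELEMENTS = {"H", "B", "C", "N", "O", "F", "P", "S", "I", "Si", "Cl", "Br"}
# Longest symbols first, so that 'TBDMS' or 'Ph' wins over a single letter.
SYMBOLS = sorted(SUPER_ATOMS | ELEMENTS, key=len, reverse=True)


def parse_formula(text):
    """Parse a condensed formula into a nested list of (unit, count) pairs.

    A unit is a symbol string or, for a parenthesised branch, a nested list.
    """
    pos = 0

    def read_symbol():
        nonlocal pos
        for symbol in SYMBOLS:
            if text.startswith(symbol, pos):
                pos += len(symbol)
                return symbol
        raise ValueError(f"unknown symbol at position {pos} in {text!r}")

    def read_count():
        nonlocal pos
        start = pos
        while pos < len(text) and text[pos].isdigit():
            pos += 1
        return int(text[start:pos]) if pos > start else 1

    def read_group():
        nonlocal pos
        items = []
        while pos < len(text) and text[pos] != ")":
            if text[pos] == "(":
                pos += 1  # consume '('
                unit = read_group()
                if pos >= len(text):
                    raise ValueError(f"missing ')' in {text!r}")
                pos += 1  # consume ')'
            else:
                unit = read_symbol()
            items.append((unit, read_count()))
        return items

    result = read_group()
    if pos != len(text):
        raise ValueError(f"unexpected ')' in {text!r}")
    return result


print(parse_formula("CO2Ph"))     # [('C', 1), ('O', 2), ('Ph', 1)]
print(parse_formula("OC(CH3)3"))  # [('O', 1), ('C', 1), ([('C', 1), ('H', 3)], 3)]
```

A formula containing a variable such as `CO2R` raises `ValueError` in this sketch, because `R` is in neither symbol set.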
Enhancements in CLiDE
Avoidance of loss of characters from atom labels
84.60% → 85.78%
Enhancements in CLiDE
Better handling of thick bonds (stereo indicators)
85.78% → 86.55%
Comparison with selected systems
Accuracy rate and runtime (hour:min) on the benchmark set:
● OSRA 1.3.6: 74.48%, 05:51
● OSRA 1.3.8: 70.27%, 04:50
● OSRA 1.3.9: 68.68%, 04:50
● OSRA 1.4.0: 68.68%, 04:54
● Imago 1.0: 43.20%, 00:15
● Imago 2.0 beta: 61.28%, 01:52
● CLiDE 3.2.0: 57.62%, 00:44
● CLiDE 5.5.4: 87.96%, 00:20
Is the benchmark set correct?
Verification set
#Molfiles to be corrected: 117
● Anomalies: 10
● Stereo bonds: 22
● Incorrect sub connection tables for chemical formulae (e.g. NC, H3CO2S, OCF3): 63
● Errors in atom label: 14
● Other kinds of error: 17
Example image and Molfile pairs:
● US07314693-20080101-C00370.TIF, US07314693-20080101-C00370.MOL
● US07316472-20080108-C00239.TIF, US07316472-20080108-C00239.MOL
● US07314872-20080101-C00024.TIF, US07314872-20080101-C00024.MOL
Is the benchmark set correct?
Input images
#images to be excluded: 16
● Incorrect chemical formula: 1
● Disconnected atom: 1
● Incorrect or ambiguous stereo bond: 6
● Arrow with unknown meaning: 8
Example images:
● US07314693-20080101-C00112.TIF
● US07314874-20080101-C00551.TIF
● USRE039991-20080101-C00187.TIF
● USRE039991-20080101-C00188.TIF
● US07320974-20080122-C00022.TIF
Performance after corrections
● #images: 5735
● #corrected Molfiles: 117
● #excluded images: 16
Accuracy rate before → after corrections:
● CLiDE 5.5.4: 87.96% → 90.11%
● OSRA 1.4.0: 68.68% → 69.84%
● Imago 2.0 beta: 61.28% → 61.91%
Part 3:
Performance against selected patents
About patents
Patents and their # non-Markush structures:
● US6410540: 218
● WO2008099019: 668
Challenges:
● Chemical structure diagrams have to be identified within the document page
● Interpretation of Markush structures
Markush structures were excluded from our tests
Challenge: Document segmentation
Page 65 of US6410540
Underlined text
[Figure panels: 5.5.4, 4.4.0]
Challenge: Document segmentation
Page 188 of WO2008019099
Table
[Figure panels: 5.5.4, 4.4.0]
Performance measurements
● Accuracy rate
● Runtime
● #Garbage structures: the number of structures that were assigned to non-chemical structure diagrams
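The accuracy rate and the garbage-structure count could be computed for a whole document along the following lines. This is a hedged Python sketch: it assumes every segmented region has been manually annotated as a chemical structure diagram or not, and that canonical SMILES strings are available for comparison. The `Region` class and `document_metrics` function are hypothetical names, not part of any of the tools discussed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Region:
    is_chemical_diagram: bool       # manual annotation (assumed available)
    extracted: Optional[str]        # canonical SMILES produced by the tool, if any
    expected: Optional[str] = None  # ground-truth canonical SMILES for diagrams


def document_metrics(regions):
    """Return (accuracy rate in percent, number of garbage structures)."""
    diagrams = [r for r in regions if r.is_chemical_diagram]
    correct = sum(
        1 for r in diagrams if r.extracted is not None and r.extracted == r.expected
    )
    # A garbage structure is any structure emitted for a non-chemical region.
    garbage = sum(
        1 for r in regions if not r.is_chemical_diagram and r.extracted is not None
    )
    accuracy = 100.0 * correct / len(diagrams) if diagrams else 0.0
    return accuracy, garbage


# Hypothetical toy page: two diagrams (one correct) and two non-chemical regions.
page = [
    Region(True, "CCO", "CCO"),
    Region(True, "CCN", "CCO"),
    Region(False, "C"),   # e.g. underlined text misread as a structure
    Region(False, None),  # correctly ignored
]
print(document_metrics(page))  # (50.0, 1)
```

Keeping the two numbers separate matters because a tool can score well on real diagrams while still emitting many structures for tables or underlined text.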
Performance of CLiDE
US6410540 (CLiDE 4.4.0 → 5.5.4)
● Accuracy rate: 88.90% → 87.15%
● Runtime (hour:min): 00:22 → 00:06
● #Garbage structures: 145 → 81
WO2008019099 (CLiDE 4.4.0 → 5.5.4)
● Accuracy rate: 57.63% → 74.25%
● Runtime (hour:min): 01:34 → 00:04
● #Garbage structures: 1225 → 29
Comparison with selected systems
Comparison with OSRA
US6410540 (CLiDE vs OSRA)
● Accuracy rate: 87.15% vs 70.18%
● Runtime (hour:min): 00:06 vs 00:09
● #Garbage structures: 81 vs 29
WO2008019099 (CLiDE vs OSRA)
● Accuracy rate: 74.25% vs 51.64%
● Runtime (hour:min): 00:04 vs 00:58
● #Garbage structures: 29 vs 183
Part 4:
Conclusions and future work
Conclusions
● There has been considerable progress in OCSR, but many problems remain to be solved
● The test sets showcased the diversity and the frequency of the problem types
● Regarding performance:
• CLiDE has greatly improved during the last few years
• CLiDE compares well with the other OCSR systems available to us for testing
● In favourable cases, OCSR as exemplified by CLiDE now approaches OCR in accuracy (90%)
Future work
Short-term goals:
● Further improvements to structure recognition
● Filtering out garbage structures
● Identification and exclusion of non-chemical structure diagrams
● Further improvements to document segmentation
Long-term goals:
● Contextual document analysis, aimed at linking structures to text data
Flavours of CLiDE
CLiDE is released in three variants, designed for individual user needs
● CLiDE Standard: designed for the individual chemist who wishes to convert selected images into editable structures for use in reports etc.
● CLiDE Professional: GUI enterprise version to process whole documents with interactive editing
● CLiDE Batch: unsupervised extraction for database creation etc.
Acknowledgment
● Peter Johnson, Keymodule Ltd. and University of Leeds
● Anthony P. Cook, University of Leeds
● Vilmos A. Valko, Keymodule Ltd.
● Reseller agents
• SimBioSys Inc. (North America)
• NeoTrident Technology Ltd. (China)
• Hulinks Inc. (Japan)
● All users who gave us constructive feedback
Thank you for your attention