Imago OCR: Open-source toolkit for chemical structure image recognition

Imago OCR Open-source toolkit for chemical structure image recognition 14/08/2012 GGA Software Services LLC 1 http://ggasoftware.com/opensource/imago/

Upload: mikhail-rybalkin

Post on 08-Jul-2015

DESCRIPTION

http://ggasoftware.com/opensource/imago Presentation at the symposium “Hunting for Hidden Treasures: Chemical Information in Patents and Other Documents”, 244th ACS National Meeting & Exposition.

TRANSCRIPT

Page 1: Imago OCR: Open-source toolkit for chemical structure image recognition

Imago OCR

Open-source toolkit for chemical structure image recognition

14/08/2012 GGA Software Services LLC 1

http://ggasoftware.com/opensource/imago/

Page 2: Imago OCR: Open-source toolkit for chemical structure image recognition

Project goals

• Perform optical chemical structure recognition on a wide range of raster images:

– different image formats

– varying scan quality (or even photos)

– complex structures and uncommon features

• Provide a complete toolset for embedding the recognition engine in any other application

Page 3: Imago OCR: Open-source toolkit for chemical structure image recognition

Applications

• Automated processing of articles and patents

– similarity analysis

• Chemical database search (PubChem, etc.)

• “Deep Web” indexing

– development of a universal chemical search engine;

– conversion of human-readable data to machine-readable formats

Page 4: Imago OCR: Open-source toolkit for chemical structure image recognition

Use case

Source image → imago → MOL format

Input:

• BMP, DIB, JPG, JPE, PNG, PBM, PGM, PPM, SR, RAS, TIFF;

• Images from scanner/camera;

• PDF document

Output:

• MDL Molfile;

• SMILES (requires Indigo);

• Rendered image (requires Indigo)

Page 5: Imago OCR: Open-source toolkit for chemical structure image recognition

Supported features

• Multiple bonds

• Single-up & single-down bonds

• Bridged bonds

• Aromatic rings

Page 6: Imago OCR: Open-source toolkit for chemical structure image recognition

Supported features

• Superatom labels, charges, isotopes

• Abbreviation expansion

• R-group handling

• Query features

Page 7: Imago OCR: Open-source toolkit for chemical structure image recognition

Engine structure

Image loader → Prefilter & Binarization → Vectorization & Separation → Logical layout analyzer → Molecule export

Processing levels: raster level, primitives level, structural level

Page 8: Imago OCR: Open-source toolkit for chemical structure image recognition

Preliminary filters

• Pass-through filter

– For rendered images (only binarization)

• Cross-correlation based filter

– For scanned images (quite fast)

• Logical analysis based filter

– For low-quality photos

– Takes some time for processing

• Imago can auto-detect a suitable filter

Page 9: Imago OCR: Open-source toolkit for chemical structure image recognition

Cross-correlation based filter

[Figure panels: source image, strong threshold, weak threshold, filter result]

Filter result: an image assembled from those weak-threshold segments that pass the restriction on the cross-correlation (CC) value with the corresponding strong-threshold segments

Page 10: Imago OCR: Open-source toolkit for chemical structure image recognition

Logical analysis based filter

• Removes noise (spots, light glares)

• Suitable for out-of-focus images

• Can process low-contrast images

• Removes unusual artifacts

• Deals with multicolor photos

• Keywords: Wiener filtering, wave algorithm, weak segmentation

Page 11: Imago OCR: Open-source toolkit for chemical structure image recognition

Preliminary separation

• Separate labels and graphics:

• Hu moments classifier (d1)

• Contours analysis (d2)

• Approximation criteria (d3)

• An object is a symbol if f(d1, d2, d3) > c0

Page 12: Imago OCR: Open-source toolkit for chemical structure image recognition

Vectorization

• Convert pixels to a matching polyline:

• Minimization of mean distance between original and vectorized structure

– Penalty for extra segments
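The slide states only the objective, not the search procedure. A minimal sketch of such an objective, assuming the cost is the mean pixel-to-polyline distance plus a fixed penalty per segment. The function names, the `segment_penalty` values, and the idea of choosing among precomputed candidate polylines are illustrative assumptions, not Imago's actual implementation:

```python
import math

def point_segment_distance(p, a, b):
    """Euclidean distance from point p to the line segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    length_sq = dx * dx + dy * dy
    if length_sq == 0:  # degenerate segment: a == b
        return math.hypot(px - ax, py - ay)
    # Projection parameter of p onto the segment, clamped to [0, 1].
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / length_sq))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def polyline_cost(pixels, polyline, segment_penalty):
    """Mean distance from pixels to the polyline plus a per-segment penalty."""
    segments = list(zip(polyline, polyline[1:]))
    mean_dist = sum(
        min(point_segment_distance(p, a, b) for a, b in segments)
        for p in pixels
    ) / len(pixels)
    return mean_dist + segment_penalty * len(segments)

def best_polyline(pixels, candidates, segment_penalty):
    """Pick the candidate polyline with the lowest cost (hypothetical helper)."""
    return min(candidates, key=lambda c: polyline_cost(pixels, c, segment_penalty))

# An L-shaped stroke and two candidate approximations.
pixels = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2)]
one_segment = [(0, 0), (2, 2)]
two_segments = [(0, 0), (2, 0), (2, 2)]

# A small penalty favors the exact two-segment fit,
# a large penalty favors the coarser single segment.
print(best_polyline(pixels, [one_segment, two_segments], 0.25))
print(best_polyline(pixels, [one_segment, two_segments], 2.0))
```

The penalty term is what keeps the optimizer from fitting every pixel wiggle with its own short segment.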

Page 13: Imago OCR: Open-source toolkit for chemical structure image recognition

Logical layout analysis

• Mapping labels to bonds

– Group labels into superatoms

• Finding multiple bonds

– Dissolving of short edges

– Connection of bridged bonds

• Removal of clearly unrelated captions

• Detection of aromatic rings

– Figuring out the orientation of stereo bonds and aromatizing the molecule if circles are present

Page 14: Imago OCR: Open-source toolkit for chemical structure image recognition

Adaptive methods or particular cases?

• Adaptive methods

– Based on optimization of some function

– Wider input class range

– Probably better results in hard cases

• Particular-case methods

– Based on some criteria

– Stability

– Good performance

– Easier implementation

Page 15: Imago OCR: Open-source toolkit for chemical structure image recognition

Particular case methods

• What is it?

• Line? Tested line criteria: no.

• Character? Tested against ‘A’: no. … Tested against ‘Z’: no.

• Ring? no.

• Unrecognizable object – ignore.

Page 16: Imago OCR: Open-source toolkit for chemical structure image recognition

Adaptive methods

• What is it?

• Line? approximation: d=1.6

• Character? Compared with ‘C’: d=6.1 … Compared with ‘L’: d=3.2

• Ring? approximation: d=653.3

• Final decision depends on neighbors

Page 17: Imago OCR: Open-source toolkit for chemical structure image recognition

Decision tree

Label with d=0.1 (almost surely recognized)

Then the object is a bond, and the segment group is recognized as bond + label with d = 0.1 + 1.6 = 1.7

Bond with d=0.0

“C” with d=0.1

Then the object is the letter ‘l’, and the segment group is recognized as bond + two-character label with d = 0.0 + 0.1 + 3.2 = 3.3
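The comparison on this slide amounts to summing the metric values along each branch and keeping the branch with the smaller total. A small sketch of that selection, using the numbers from the slide; the data layout and the names `hypotheses` and `total_distance` are assumptions made for illustration:

```python
def total_distance(parts):
    """Sum the metric values of all parts in one interpretation."""
    return sum(distance for _, distance in parts)

# Each hypothesis is a list of (interpretation, metric value) pairs.
hypotheses = {
    "bond + label": [
        ("neighbor label", 0.1),
        ("object as bond", 1.6),
    ],
    "bond + two-character label": [
        ("neighbor bond", 0.0),
        ("neighbor 'C'", 0.1),
        ("object as letter 'l'", 3.2),
    ],
}

# The interpretation with the smallest summed distance wins.
best = min(hypotheses, key=lambda name: total_distance(hypotheses[name]))

for name, parts in hypotheses.items():
    print(f"{name}: d = {total_distance(parts):.1f}")
print("selected:", best)
```

With these values the object is read as a bond (total 1.7) rather than as the letter ‘l’ (total 3.3), which is how the decision for one object depends on its neighbors.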

Page 18: Imago OCR: Open-source toolkit for chemical structure image recognition

Metrics

• For symbols

– Distance between sets of Fourier descriptors

• For graphics

– Distance between approximated and source image

• For single-up bonds

– f(average fill, relative size, etc.)

• For single-down bonds

– f(distance between segments, line thickness, etc.)

• … (every recognition method has a metric function)

Page 19: Imago OCR: Open-source toolkit for chemical structure image recognition

Labels correction

• Any recognized symbol can have alternatives:

[symbol image]: A (metric value of 3.2), R (4.9), P (5.0)

• Imago keeps information about probable captions (periodic table, abbreviations)

• Labels correction: select the combination of symbol alternatives that forms a probable caption and has the minimal sum of metric values

• Allows recognition of partially broken labels
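One way to read the correction rule is as a search over all combinations of per-symbol alternatives, restricted to captions that appear in a dictionary. The sketch below assumes exactly that; the `KNOWN_CAPTIONS` set, the function name, and the example metric values are hypothetical and stand in for Imago's periodic table and abbreviation data:

```python
from itertools import product

# Hypothetical stand-in for the stored table of probable captions.
KNOWN_CAPTIONS = {"Cl", "Br", "OH", "NH", "Ph", "Me"}

def correct_label(alternatives, known=KNOWN_CAPTIONS):
    """Return (caption, total metric) for the cheapest known caption, or None.

    alternatives: one list per symbol position, each holding
    (character, metric value) candidates.
    """
    best = None
    for combo in product(*alternatives):
        text = "".join(char for char, _ in combo)
        if text not in known:
            continue  # not a probable caption
        score = sum(metric for _, metric in combo)
        if best is None or score < best[1]:
            best = (text, score)
    return best

# The top-ranked characters spell "G1", which is not a known caption,
# so the cheapest valid combination is chosen instead.
label = [
    [("G", 0.9), ("C", 1.2)],
    [("1", 0.8), ("l", 1.0)],
]
print(correct_label(label))
```

An exhaustive product grows quickly with label length, so a real implementation would need to bound the number of alternatives per position.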

Page 20: Imago OCR: Open-source toolkit for chemical structure image recognition

Recognition

• Image recognition is a search for the vectorized result that gives the minimal distance between the vectorized form and the original image

• Can be formalized depending on metrics

• Search is exhaustive

– Needs some restrictions to achieve good speed

Page 21: Imago OCR: Open-source toolkit for chemical structure image recognition

Trade-off: restricted adaptive methods

• Limit metric values: d < 0.5 – surely; d > 10.0 – impossible

• Limit Euclidean distances for the neighbor search (up to 100 pixels)

• Limit the number of alternatives (not more than 10)

• Assume the image filling rate is less than 10%

• Assume the distance between single-down bond segments is in the range 5..10 pixels

• Assume the symbol aspect ratio is in the range 0.5..2.0

• Some more assumptions with “magic” constants

• Gains speed and stability
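The slide lists the limits but not how they are applied. A minimal sketch of one plausible use, assuming that a candidate below the "surely" threshold ends the search, candidates above the "impossible" threshold are dropped, and the remainder is capped at ten. The constant and function names are illustrative, not taken from Imago:

```python
SURE_BELOW = 0.5          # d < 0.5: accept without considering others
IMPOSSIBLE_ABOVE = 10.0   # d > 10.0: discard
MAX_ALTERNATIVES = 10     # keep at most this many candidates

def prune_alternatives(alternatives):
    """Restrict a list of (candidate, metric value) pairs using the limits."""
    ranked = sorted(alternatives, key=lambda item: item[1])
    if ranked and ranked[0][1] < SURE_BELOW:
        return ranked[:1]  # a sure match makes the other branches unnecessary
    feasible = [item for item in ranked if item[1] <= IMPOSSIBLE_ABOVE]
    return feasible[:MAX_ALTERNATIVES]

print(prune_alternatives([("L", 3.2), ("C", 0.1)]))
print(prune_alternatives([("R", 4.9), ("X", 12.0), ("A", 3.2)]))
```

Each limit removes branches from the exhaustive search, which is where the speed gain comes from.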

Page 22: Imago OCR: Open-source toolkit for chemical structure image recognition

Configuration clusters

• For scanned images

– Strict adaptive method limits (fast, <300 ms per image)

• For photos and low-quality images

– Flexible limits (less than a second per image on average)

• For high-resolution images

– up to 5 seconds

• For handwritten structures

– up to 10 seconds in complex cases

• Imago supports auto-detection of a suitable configuration cluster

Page 23: Imago OCR: Open-source toolkit for chemical structure image recognition

Configuration cluster creation

• Improves the recognition success rate for a specific image type:

– different render type

– images captured differently (scanner type, lighting conditions, etc.)

• Process is automated

– a test set of the target image type is required

– takes some time

– machine learning application

Page 24: Imago OCR: Open-source toolkit for chemical structure image recognition

Machine learning

• Test set: a number of pairs (image; related MDL molfile)

• Imago will tune the method parameters to gain the best score on the test collection

– Metrics included

– No information directly related to the test set (such as a character table) is stored

• Criteria for the complete set are formed from a small subset of the same type

Page 25: Imago OCR: Open-source toolkit for chemical structure image recognition

Learning effectiveness

• Used the Img2Structure test set with a different renderer:

• Initial results (before training): 202/944 correct, similarity value: 74.54%

• Trained on a set of 50 images with the new renderer

• Trained results: 831/944 correct, similarity value: 98.33% on the whole set
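For reference, the exact-match counts on this slide can be converted to percentages with a few lines of arithmetic. Note that this exact-match rate is a different quantity from the similarity value quoted on the slide; the function name is illustrative:

```python
TOTAL_IMAGES = 944

def exact_match_rate(correct, total=TOTAL_IMAGES):
    """Percentage of structures recognized without any error."""
    return 100.0 * correct / total

before = exact_match_rate(202)
after = exact_match_rate(831)

print(f"before training: {before:.2f}%")                 # 21.40%
print(f"after training:  {after:.2f}%")                  # 88.03%
print(f"gain: {after - before:.2f} percentage points")   # 66.63
```

So training on 50 images raised the share of fully correct structures from roughly one in five to almost nine in ten on the 944-image set.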

Page 26: Imago OCR: Open-source toolkit for chemical structure image recognition

Comparison: overall scores 1

• Image2Structure set from TREC 2011 Chemical IR Track (removed ambiguous & partial structures): original files

                         OSRA 1.4.0    Imago 1.0    Imago 2.0 beta 1
Absolutely correct       769 / 944     540 / 944    861 / 944
Almost correct (1)       +31           +49          +43
Average time             2.54 s        0.20 s       0.31 s
Average similarity (2)   94.57%        89.59%       98.26%

(1) similarity value is greater than 95%

(2) ratio of correct elements (atoms and bonds); extra and missing elements are counted too

Page 27: Imago OCR: Open-source toolkit for chemical structure image recognition

Comparison: overall scores 2

• Image2Structure re-rendered using appropriate molfiles

                         OSRA 1.4.0    Imago 1.0    Imago 2.0 beta 1
Absolutely correct       796 / 944     604 / 944    831 / 944
Almost correct (1)       +20           +58          +29
Average time             4.57 s        0.47 s       1.24 s
Average similarity (2)   93.45%        95.38%       98.33%

(1) similarity value is greater than 95%

(2) ratio of correct elements (atoms and bonds); extra and missing elements are counted too

Page 28: Imago OCR: Open-source toolkit for chemical structure image recognition

Common issues resolved

[Figure: source images compared with OSRA and Imago results for three cases: large gap, lines too close, no more symbols]

Page 29: Imago OCR: Open-source toolkit for chemical structure image recognition

Imago Library

• API: a set of methods for

– Image loading

– Configuration cluster setup

– Retrieving molfile results

– Partial processing (filtering, approximation, validation)

• Bindings for C/C++, Java

• Cross-platform implementation (Windows, Linux, Mac)

• Dependencies:

– Boost library (Boost Software License)

– OpenCV library (BSD license)

– Indigo (optional)

Page 30: Imago OCR: Open-source toolkit for chemical structure image recognition

Thank you for your attention!

• Imago OCR: http://ggasoftware.com/opensource/imago/

• Try the Imago recognition engine online: http://ggasoftware.com/opensource/imago/online/

Page 31: Imago OCR: Open-source toolkit for chemical structure image recognition

Appendix A

Imago: technical details

Page 32: Imago OCR: Open-source toolkit for chemical structure image recognition

Pass-through prefilter

• Count black, white, and other pixels

• If (black + white) > t0 ∙ others,

– recolor the other pixels to black → image is binarized

– else schedule another prefilter call

• Perform an accurate image downscale when the image is too large (>5 Mpix)
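The counting test described above can be sketched in a few lines. This is an illustration under stated assumptions, not Imago's code: the image is a plain 2D list of 8-bit grayscale values, "black" and "white" mean exactly 0 and 255, the default `t0` is arbitrary, and the downscale step is left out:

```python
def pass_through_prefilter(pixels, t0=10.0, black=0, white=255):
    """Binarize an almost-binary image, or report that another filter is needed.

    Returns (binarized, image). When binarized is False the input is
    returned unchanged and the caller should schedule another prefilter.
    """
    flat = [p for row in pixels for p in row]
    n_black = sum(1 for p in flat if p == black)
    n_white = sum(1 for p in flat if p == white)
    n_other = len(flat) - n_black - n_white

    if n_black + n_white > t0 * n_other:
        # Few intermediate tones: recolor them to black, as on the slide.
        result = [[p if p == white else black for p in row] for row in pixels]
        return True, result
    return False, pixels

rendered = [
    [255, 255, 255, 255, 255],
    [255,   0,   0,   0, 255],
    [255, 255, 128, 255, 255],
    [255, 255, 255, 255, 255],
]
photo = [[10, 200], [90, 255]]

print(pass_through_prefilter(rendered))
print(pass_through_prefilter(photo)[0])
```

A rendered image has almost no intermediate tones (for example, only anti-aliased edges), so it passes the test, while a photo does not.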

Page 33: Imago OCR: Open-source toolkit for chemical structure image recognition

Cross-correlation prefilter

• Smooth the source image → smoothed

– Pyramidal reduce 2x, then pyramidal upsample 2x

• Apply an adaptive-threshold binarization filter to the smoothed image:

– With threshold t0 → strong

– With threshold t1 → weak

• Segment the (strong, weak) images using the wavemap algorithm

• For each weak segment, find the appropriate strong segment and calculate the intersection:

– If the ratio of the intersection area to the original segment area is less than c0, remove this segment (bad segment)

• If the reassembled image contains a rectangular structure R, crop the image to the inner dimensions of R (locate molecules)

• Calculate the average pixel intensity of the good segments and try to add other pixels whose intensity passes this boundary (if they do not affect segment connectivity)
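The segment-matching step can be illustrated on its own. The sketch below assumes segments are represented as sets of (x, y) pixel coordinates and that the "appropriate" strong segment is the one with the largest overlap; both are assumptions, as are the function name and the default `c0`. Smoothing, thresholding, segmentation, and cropping are not shown:

```python
def filter_weak_segments(weak_segments, strong_segments, c0=0.5):
    """Keep weak-threshold segments that are sufficiently backed by a strong one.

    Each segment is a set of (x, y) pixel coordinates. A weak segment is
    dropped when its best intersection with any strong segment covers less
    than the fraction c0 of its own area.
    """
    kept = []
    for weak in weak_segments:
        if not weak:
            continue  # empty segments carry no information
        best_overlap = max(
            (len(weak & strong) for strong in strong_segments), default=0
        )
        if best_overlap / len(weak) >= c0:
            kept.append(weak)
    return kept

stroke_weak = {(0, 0), (0, 1), (0, 2), (0, 3)}
stroke_strong = {(0, 1), (0, 2), (0, 3)}
noise_weak = {(5, 5), (5, 6), (5, 7)}

good = filter_weak_segments([stroke_weak, noise_weak], [stroke_strong])
print(good == [stroke_weak])
```

The weak threshold preserves faint strokes in full, while the strong threshold acts as evidence that a segment is real ink rather than noise.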

Page 34: Imago OCR: Open-source toolkit for chemical structure image recognition

Separator details

• Given a set of binarized segments, classify them into two main groups: letters and chemical bond representations

• Classification result is based on the value of C = k0 ∙ r0 + k1 ∙ r1 + k2 ∙ r2

– where (r0, r1, r2) are the submethod results

– and (k0, k1, k2) are configurable weight constants
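The combined score is a plain weighted sum. A minimal sketch follows, with two assumptions that the slides do not state explicitly: a larger C means "bond" (because each r value in this appendix is described as a bond probability), and the decision uses a fixed threshold. The weights, the threshold, and the function names are hypothetical:

```python
def separator_score(r, k):
    """C = k0*r0 + k1*r1 + k2*r2 for submethod results r and weights k."""
    return sum(ki * ri for ki, ri in zip(k, r))

def classify_segment(r, k=(0.5, 0.25, 0.25), threshold=0.5):
    """Return 'bond' or 'symbol' (assumed direction: higher C means bond)."""
    return "bond" if separator_score(r, k) > threshold else "symbol"

# r0: Hu moments, r1: contour analysis, r2: approximation criteria.
print(classify_segment((1.0, 0.8, 0.6)))
print(classify_segment((0.0, 0.3, 0.2)))
```

Because the weights are configurable, a submethod that is unreliable for a given image type can be given less influence without changing any code.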

Page 35: Imago OCR: Open-source toolkit for chemical structure image recognition

Separator: Hu moments

• Hu moments usually differ for characters and bonds, so a classification tree can be computed

• Note: some objects cannot be classified that way

symbols: r0 = 0

bonds: r0 = 1

Page 36: Imago OCR: Open-source toolkit for chemical structure image recognition

Separator: contours analysis

• Extract the outer contour of the binarized segment S:

– approximate the chain contour using the Teh-Chin chain approximation algorithm;

– taking the line thickness as an approximation parameter, approximate the polygon once again;

– calculate the offsets of the contour points by a clockwise step;

– the output is a chain of sequential vectors normalized by their perimeters;

• Compare the resulting chain to the set of patterns describing valid structures

– The set consists of 8x8 matrices where cell (j, k) denotes the probability of changing the jth direction to the kth.

• The result of this method is r1 – the probability of {S is a bond}
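To make the 8x8 pattern format concrete, the sketch below builds such a matrix from a chain of contour vectors. It assumes the eight directions are 45-degree sectors and that each row holds conditional probabilities (row j sums to 1 when direction j occurs); the slides do not specify either detail, and the comparison against stored patterns is not shown:

```python
import math

def direction_index(dx, dy):
    """Quantize a vector to one of 8 directions (0 = +x, steps of 45 degrees)."""
    angle = math.atan2(dy, dx)
    return round(angle / (math.pi / 4)) % 8

def transition_matrix(vectors):
    """8x8 matrix: cell (j, k) estimates P(next direction is k | current is j).

    The chain is treated as a closed contour, so the last vector is
    followed by the first one.
    """
    dirs = [direction_index(dx, dy) for dx, dy in vectors]
    counts = [[0] * 8 for _ in range(8)]
    for j, k in zip(dirs, dirs[1:] + dirs[:1]):
        counts[j][k] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix

# Contour of an axis-aligned square: right, up, left, down.
square = [(1, 0), (0, 1), (-1, 0), (0, -1)]
m = transition_matrix(square)
print(m[0][2], m[2][4], m[4][6], m[6][0])
```

For the square, every direction is always followed by a 90-degree turn, so each occupied row has a single entry equal to 1.0.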

Page 37: Imago OCR: Open-source toolkit for chemical structure image recognition

Separator: approximation criteria

• For a given segment S we calculate its best approximation with n line segments (d0) and the closest distance to the most probable character (d1)

– If d1 < d0 and n > n0, then the segment probably represents a character

• Check its width/height ratio and height/average_height ratio: penalty p0 if these criteria are not matched

• Result is r2 = 1 - (d1 [+ p0]) – probability of {S is a bond}

– Otherwise, the result is r2 = d0 – probability of {S is a bond}

Page 38: Imago OCR: Open-source toolkit for chemical structure image recognition

Bonds skeleton analysis

• Dissolve short edges

• Join closest vertices

• Dissolve intermediate vertices

• Find multiple edges

• Connect bridged bonds

• Shrink short bonds

• Detect and mark suspicious edges

Page 39: Imago OCR: Open-source toolkit for chemical structure image recognition

Basic labels analysis

• Location analysis: check against the baseline

– Subscripts are below the line:

– Capitals are mostly above the line:

• Calculate distances to all possible characters:

• Adjust distances using topological features

• Select the best result candidate and calculate recognition quality:

Page 40: Imago OCR: Open-source toolkit for chemical structure image recognition

Superatoms analysis

• Concatenate recognized characters into labels

• Check chemical validity

• If the validity check fails, try to find the most probable alternative using other distance map elements

• If no such alternative is found, try to recognize the less probable characters as bonds

• Handle R-group semantics and special characters: X, Q, A

Page 41: Imago OCR: Open-source toolkit for chemical structure image recognition

Appendix B

Imago: workflow features

Page 42: Imago OCR: Open-source toolkit for chemical structure image recognition

Related continuous integration system

[Screenshot: versions list, results estimation, test sets]

Page 43: Imago OCR: Open-source toolkit for chemical structure image recognition

Explanation: continuous integration

• Some logically grounded changes may decrease the recognition rate → a convenient tracking tool is required

• Good way to improve overall stability

• Useful visual representation of the machine-learning progress

Page 44: Imago OCR: Open-source toolkit for chemical structure image recognition

Embedded HTML-based logging system

[Screenshot: embedded images, performance counters, variables and parameters dump, call hierarchy]

Page 45: Imago OCR: Open-source toolkit for chemical structure image recognition

Explanation: logging system

• Structured logs (reports) offer

– A convenient way to detect bugs;

– An exact visual representation of the internal processes;

• Several improvements may be evident just by looking through logs

• The performance decrease is comparable to that of (usual) plaintext logs

• Stability is not affected

Page 46: Imago OCR: Open-source toolkit for chemical structure image recognition

Authors

• Rostislav Chutkov

• Michael Rybalkin

• Kliton Andrea

• Victor Smolov

• GGA Software Services LLC
