image to network

Project Name Image to Network

September 2014 – December 2014Akrita Agarwal

Problem Statement

• Take network/pathway-like images related to ES OR (i)PSCs (MeSH tagged) from the database at NCBI and create dynamic networks from them.

• Identify the text, nodes and arrows in the system to depict the flow of the network.

Proposed Plan-of-Action

• Read images into the C++ code and detect text areas to identify nodes.

• Isolate the text area and process the text using Tesseract – a Google API– Train tesseract for current data for better accuracy.– Test tesseract with current data– Seek semantic help using meta-maps to improve

text recognition

Proposed Plan-of-Action

• Process the nodes and store them in a graph.• Detect arrowheads – arrows ( -> ) and

inhibitors ( -|)• Introduce Links in graph by detecting the

orientation (direction) of arrowheads.• Create GPML files• Process GPML files (maybe using Cytoscape)

Detect text areas to identify nodes

• Used OpenCV in C++ to process images.• Found image contours to detect text areas in

the image. Here the focus is just on detecting text area, not identifying words.

• Once the contours are detected, we draw rectangles around them.

• Images in the following slides.

Test images

Isolate text area Process text using Tesseract

• Tesseract takes a text area as input.• Generates output in the form of a text file

containing identified text.• Created a graph with each text area as a node.• The left top and right bottom coordinates of

each node (rectangle) are its properties.• Stored graph in a csv file.• Images in the following slides.

Test Images

Detect arrowheads – arrows ( -> ) and inhibitors ( -|)

Approach 1 – Polynomial• Arrows can be treated as a polynomial with 6

or 7 edges -

• The irregularities in the arrows led to a bad approximation on arrow-detection.

• Smoothening filters improved the results.• Low accuracy

Approach 2 – Cascade Filters• Cascade filters use brute force object training to

find objects in images.• In our approach I used 600 positive images (arrows)

and 1000 negative images (text, images, diagrams, other non-arrow images) to train the classifier.

• The results are highly over-fitted and require further parameter tuning for better results.

• Needs more work.• Good approach but parameters vary with images,

thus not very reliable.• Image in the following slide.

Current issues

• All algorithms give good results for all the images. Might work well on some, not very well on others.

• Text detection can be improved by training Tesseract more.

• Arrow detection is still not giving good results.

Resources

• The code is available on Github - here• Images can be found on NCBI - here• Google Tesseract OCR - here • Cascade classifiers can be tricky to work with.

A good tutorial is - here• Node-OpenCV - here

https://github.com/akritaag/Image-to-network

http://www.ncbi.nlm.nih.gov/

https://code.google.com/p/tesseract-ocr/

http://coding-robin.de/2013/07/22/train-your-own-opencv-haar-classifier.html

https://github.com/peterbraden/node-opencv

image to network

Education