Download - Image to network
Problem Statement
• Take network/pathway-like images related to ES OR (i)PSCs (MeSH tagged) from the database at NCBI and create dynamic networks from them.
• Identify the text, nodes and arrows in the system to depict the flow of the network.
Proposed Plan-of-Action
• Read images into the C++ code and detect text areas to identify nodes.
• Isolate the text area and process the text using Tesseract – a Google API– Train tesseract for current data for better accuracy.– Test tesseract with current data– Seek semantic help using meta-maps to improve
text recognition
Proposed Plan-of-Action
• Process the nodes and store them in a graph.• Detect arrowheads – arrows ( -> ) and
inhibitors ( -|)• Introduce Links in graph by detecting the
orientation (direction) of arrowheads.• Create GPML files• Process GPML files (maybe using Cytoscape)
Detect text areas to identify nodes
• Used OpenCV in C++ to process images.• Found image contours to detect text areas in
the image. Here the focus is just on detecting text area, not identifying words.
• Once the contours are detected, we draw rectangles around them.
• Images in the following slides.
Isolate text area Process text using Tesseract
• Tesseract takes a text area as input.• Generates output in the form of a text file
containing identified text.• Created a graph with each text area as a node.• The left top and right bottom coordinates of
each node (rectangle) are its properties.• Stored graph in a csv file.• Images in the following slides.
Detect arrowheads – arrows ( -> ) and inhibitors ( -|)
Approach 1 – Polynomial• Arrows can be treated as a polynomial with 6
or 7 edges -
• The irregularities in the arrows led to a bad approximation on arrow-detection.
• Smoothening filters improved the results.• Low accuracy
Approach 2 – Cascade Filters• Cascade filters use brute force object training to
find objects in images.• In our approach I used 600 positive images (arrows)
and 1000 negative images (text, images, diagrams, other non-arrow images) to train the classifier.
• The results are highly over-fitted and require further parameter tuning for better results.
• Needs more work.• Good approach but parameters vary with images,
thus not very reliable.• Image in the following slide.
Current issues
• All algorithms give good results for all the images. Might work well on some, not very well on others.
• Text detection can be improved by training Tesseract more.
• Arrow detection is still not giving good results.
Resources
• The code is available on Github - here• Images can be found on NCBI - here• Google Tesseract OCR - here • Cascade classifiers can be tricky to work with.
A good tutorial is - here• Node-OpenCV - here