OPTICAL CHARACTER RECOGNITION TOOL
Bijay Dahal {2008/BCT/509}
Kabindra Shrestha {2008/BCT/516}
Raj Kumar Shrestha {2008/BCT/527}
OBJECTIVES
To convert alphanumeric characters in an image into plain text.
To gain a general understanding of image processing.
TOOLS/TECHNOLOGY USED
S.N  Tools                      Description
1    JDK 6                      Development Kit for Java Programming
2    NetBeans 7.0               IDE for Java Application Development
3    Microsoft Windows & Linux  OS platforms for the application
4    Tortoise SVN               Version Control Software for Project Mgmt.
5    Sourceforge                Project Management and Configuration
6    Microsoft Office           Documentation
OVERVIEW
Takes an image as input and converts it into plain text.
Recognizes alphanumeric characters only.
The recognized text can be edited and saved.
[Screenshots: loaded image; converted, editable text]
SYSTEM ARCHITECTURE
Processing stages, in order:
1 Get Image
2 Binarization
3 Thinning (bold strokes reduced to thin strokes)
4 Line Segment
5 Character Segment
6 Feature Extraction
7 Matrix Matching
8 Save Text
METHODOLOGY/ALGORITHMS
Otsu Binarization Algorithm
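Otsu's method picks the gray-level threshold that maximizes the variance between the two resulting classes (ink and background). The following is a minimal sketch, not the project's actual code: the class name `OtsuSketch`, the flat `int[]` of 0-255 gray values, and the convention that dark pixels are ink are all assumptions made for illustration.

```java
final class OtsuSketch {
    /** Returns the threshold t (0..255) that maximizes between-class variance. */
    static int otsuThreshold(int[] gray) {
        int[] hist = new int[256];
        for (int v : gray) hist[v]++;              // gray values assumed to be in 0..255

        long total = gray.length;
        long sumAll = 0;
        for (int i = 0; i < 256; i++) sumAll += (long) i * hist[i];

        long weightDark = 0;                        // pixels with value <= t
        long sumDark = 0;                           // sum of their gray values
        double best = -1.0;
        int bestT = 0;
        for (int t = 0; t < 256; t++) {
            weightDark += hist[t];
            sumDark += (long) t * hist[t];
            if (weightDark == 0) continue;          // no dark pixels yet
            long weightLight = total - weightDark;
            if (weightLight == 0) break;            // every pixel is already dark
            double meanDark = (double) sumDark / weightDark;
            double meanLight = (double) (sumAll - sumDark) / weightLight;
            double diff = meanDark - meanLight;
            double between = (double) weightDark * weightLight * diff * diff;
            if (between > best) {
                best = between;
                bestT = t;
            }
        }
        return bestT;
    }

    /** Marks a pixel as ink (true) when its gray value is at or below the threshold. */
    static boolean[] binarize(int[] gray) {
        int t = otsuThreshold(gray);
        boolean[] ink = new boolean[gray.length];
        for (int i = 0; i < gray.length; i++) ink[i] = gray[i] <= t;
        return ink;
    }

    public static void main(String[] args) {
        int[] sample = {10, 10, 10, 200, 200, 200};
        System.out.println("threshold = " + otsuThreshold(sample));
    }
}
```

Working from a histogram keeps the search to 256 candidate thresholds regardless of image size.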
Hilditch Skeletonization Algorithm (Thinning)
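Thinning peels boundary pixels off a binarized stroke until a thin skeleton remains. The sketch below uses one commonly described four-condition formulation of Hilditch's rule, applied to all pixels in parallel on each pass. It is an assumption that the project used this variant; the class name `HilditchSketch` and the `boolean[][]` image (true = ink) are hypothetical. Here B(p) is the number of ink neighbours and A(p) is the number of 0-to-1 transitions around the pixel.

```java
final class HilditchSketch {
    // Offsets of the eight neighbours p2..p9, clockwise starting from north.
    private static final int[] DR = {-1, -1, 0, 1, 1, 1, 0, -1};
    private static final int[] DC = {0, 1, 1, 1, 0, -1, -1, -1};

    // 1 if (r, c) lies inside the image and is ink, otherwise 0.
    private static int at(boolean[][] img, int r, int c) {
        boolean inside = r >= 0 && r < img.length && c >= 0 && c < img[0].length;
        return (inside && img[r][c]) ? 1 : 0;
    }

    // B(p): number of ink pixels among the eight neighbours.
    private static int inkNeighbours(boolean[][] img, int r, int c) {
        int n = 0;
        for (int i = 0; i < 8; i++) n += at(img, r + DR[i], c + DC[i]);
        return n;
    }

    // A(p): number of 0 -> 1 transitions in the cycle p2, p3, ..., p9, p2.
    private static int transitions(boolean[][] img, int r, int c) {
        int n = 0;
        for (int i = 0; i < 8; i++) {
            int j = (i + 1) % 8;
            if (at(img, r + DR[i], c + DC[i]) == 0 && at(img, r + DR[j], c + DC[j]) == 1) n++;
        }
        return n;
    }

    private static boolean[][] copy(boolean[][] src) {
        boolean[][] out = new boolean[src.length][];
        for (int r = 0; r < src.length; r++) out[r] = src[r].clone();
        return out;
    }

    /** Repeats deletion passes until a pass removes nothing; the input is not modified. */
    static boolean[][] thin(boolean[][] input) {
        boolean[][] img = copy(input);
        boolean changed = true;
        while (changed) {
            changed = false;
            boolean[][] next = copy(img);           // decisions read img, deletions go to next
            for (int r = 0; r < img.length; r++) {
                for (int c = 0; c < img[0].length; c++) {
                    if (!img[r][c]) continue;
                    int p2 = at(img, r - 1, c), p4 = at(img, r, c + 1);
                    int p6 = at(img, r + 1, c), p8 = at(img, r, c - 1);
                    int b = inkNeighbours(img, r, c);
                    boolean removable = b >= 2 && b <= 6
                            && transitions(img, r, c) == 1
                            && (p2 * p4 * p8 == 0 || transitions(img, r - 1, c) != 1)
                            && (p2 * p4 * p6 == 0 || transitions(img, r, c + 1) != 1);
                    if (removable) {
                        next[r][c] = false;
                        changed = true;
                    }
                }
            }
            img = next;
        }
        return img;
    }

    public static void main(String[] args) {
        boolean[][] bar = new boolean[5][9];
        for (int r = 1; r <= 3; r++) for (int c = 1; c <= 7; c++) bar[r][c] = true;
        for (boolean[] row : thin(bar)) {
            StringBuilder sb = new StringBuilder();
            for (boolean px : row) sb.append(px ? '#' : '.');
            System.out.println(sb);
        }
    }
}
```

Pixels are only ever removed, never added, so the result is always a subset of the input stroke.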
ALGORITHMS (CONTD…)
Generic Segmentation
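The slides do not say how the line and character segments were found. One simple approach, assumed here purely for illustration, is projection profiles: count the ink pixels in each row to split the page into lines, then count the ink pixels in each column of a line to split it into characters. The class and method names below are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class ProjectionSegmentationSketch {
    /** Returns [start, end) index ranges where the profile is non-zero. */
    static List<int[]> inkRuns(int[] profile) {
        List<int[]> runs = new ArrayList<int[]>();
        int start = -1;
        for (int i = 0; i <= profile.length; i++) {
            boolean ink = i < profile.length && profile[i] > 0;
            if (ink && start < 0) {
                start = i;                          // a run begins
            } else if (!ink && start >= 0) {
                runs.add(new int[] {start, i});     // a run ends just before i
                start = -1;
            }
        }
        return runs;
    }

    /** Text lines as [topRow, bottomRow) ranges, separated by blank rows. */
    static List<int[]> segmentLines(boolean[][] img) {
        int[] rows = new int[img.length];
        for (int r = 0; r < img.length; r++)
            for (int c = 0; c < img[r].length; c++)
                if (img[r][c]) rows[r]++;
        return inkRuns(rows);
    }

    /** Characters in rows [top, bottom) as [leftCol, rightCol) ranges, separated by blank columns. */
    static List<int[]> segmentChars(boolean[][] img, int top, int bottom) {
        int[] cols = new int[img[0].length];
        for (int r = top; r < bottom; r++)
            for (int c = 0; c < img[r].length; c++)
                if (img[r][c]) cols[c]++;
        return inkRuns(cols);
    }

    public static void main(String[] args) {
        boolean[][] img = new boolean[5][7];
        img[1][0] = true; img[1][1] = true; img[1][4] = true; img[3][2] = true;
        for (int[] line : segmentLines(img)) {
            int count = segmentChars(img, line[0], line[1]).size();
            System.out.println("rows " + line[0] + ".." + (line[1] - 1) + ": " + count + " segment(s)");
        }
    }
}
```

This sketch only separates characters that have at least one blank column between them; touching characters would need a different strategy.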
Feature Extraction (Zoning)
Based on zones
• 5 horizontal and 5 vertical zones => 25 features
Based on upper and lower profiles
• 10 vertical zones => 20 features
Based on left and right profiles
• 10 horizontal zones => 20 features
Total number of features
• 25 + 20 + 20 = 65
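The 65-element feature vector described above can be sketched as follows. The slides give only the zone counts, so the exact definitions are assumptions: each of the 25 grid zones contributes its ink density, and each profile feature is the average blank distance from the relevant edge to the first ink pixel, normalized by the glyph height or width. The class name `ZoningFeaturesSketch` and the ordering of the vector are also hypothetical.

```java
final class ZoningFeaturesSketch {
    /** Builds 25 zone + 20 upper/lower + 20 left/right features for one glyph (true = ink). */
    static double[] extract(boolean[][] glyph) {
        int h = glyph.length, w = glyph[0].length;
        double[] f = new double[65];
        int k = 0;

        // 25 features: ink density in each cell of a 5 x 5 grid.
        for (int zr = 0; zr < 5; zr++) {
            for (int zc = 0; zc < 5; zc++) {
                int r0 = zr * h / 5, r1 = (zr + 1) * h / 5;
                int c0 = zc * w / 5, c1 = (zc + 1) * w / 5;
                int ink = 0;
                for (int r = r0; r < r1; r++)
                    for (int c = c0; c < c1; c++)
                        if (glyph[r][c]) ink++;
                int area = (r1 - r0) * (c1 - c0);
                f[k++] = area == 0 ? 0.0 : (double) ink / area;
            }
        }

        // 20 features: upper and lower profile for each of 10 vertical strips.
        for (int z = 0; z < 10; z++) {
            int c0 = z * w / 10, c1 = (z + 1) * w / 10;
            double upper = 0, lower = 0;
            for (int c = c0; c < c1; c++) {
                upper += gap(glyph, 0, c, 1, 0);        // walk down from the top edge
                lower += gap(glyph, h - 1, c, -1, 0);   // walk up from the bottom edge
            }
            int n = c1 - c0;
            f[k++] = n == 0 ? 0.0 : upper / (n * (double) h);
            f[k++] = n == 0 ? 0.0 : lower / (n * (double) h);
        }

        // 20 features: left and right profile for each of 10 horizontal strips.
        for (int z = 0; z < 10; z++) {
            int r0 = z * h / 10, r1 = (z + 1) * h / 10;
            double left = 0, right = 0;
            for (int r = r0; r < r1; r++) {
                left += gap(glyph, r, 0, 0, 1);         // walk right from the left edge
                right += gap(glyph, r, w - 1, 0, -1);   // walk left from the right edge
            }
            int n = r1 - r0;
            f[k++] = n == 0 ? 0.0 : left / (n * (double) w);
            f[k++] = n == 0 ? 0.0 : right / (n * (double) w);
        }
        return f;
    }

    // Blank pixels walked from (r, c) in direction (dr, dc) before reaching ink or leaving the image.
    private static int gap(boolean[][] g, int r, int c, int dr, int dc) {
        int steps = 0;
        while (r >= 0 && r < g.length && c >= 0 && c < g[0].length && !g[r][c]) {
            r += dr;
            c += dc;
            steps++;
        }
        return steps;
    }

    public static void main(String[] args) {
        boolean[][] glyph = new boolean[10][10];
        glyph[0][0] = true;
        System.out.println("feature count = " + extract(glyph).length);
    }
}
```

Normalizing every feature to the range 0 to 1 keeps the three feature groups comparable when they are later matched against stored templates. Glyphs smaller than 10 x 10 pixels leave some strips empty, so scaling each character to a fixed size first is advisable.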
SCHEDULE
ID  Task Name                         Start       Finish      Duration
1   System Analysis                   6/20/2011   7/3/2011    14d
2   System Design                     7/4/2011    7/17/2011   14d
3   Coding                            7/18/2011   9/17/2011   62d
4   Testing                           9/18/2011   9/27/2011   10d
5   Debugging                         9/28/2011   10/11/2011  14d
6   Efficiency & Performance Testing  10/12/2011  10/15/2011  4d
7   Documentation                     6/20/2011   11/16/2011  150d
OFF DAYS:
Exam Time: (25 Days)
Dashain Holidays: (15 Days)
Tihar Holidays: (3 Days)
CHALLENGES/PROBLEMS FACED
• Choosing the correct algorithm.
• Algorithms were hard to implement.
• Once implemented, the output was not accurate.
• Accuracy of matrix matching.
CONCLUSION
Text in an image is converted to a text file.
The simplest algorithms were used; accuracy is about 40%-60%.
LIMITATION
Cannot recognize text in noisy images.
Cannot detect inclined text in an image.
Matrix matching is slow.
Poor thinning and noise make some text unrecognizable.
FUTURE ENHANCEMENT
Image input from a scanner.
Recognition of PDF and other image formats.
Nepali / Devanagari font support.
Support for different fonts.
Output in PDF or Word file format.
Skew correction and noise reduction.
Handwriting recognition.
Neural network.
REFERENCES
Bates, K. S. (2010). Head First Java. O'Reilly.
Improving Optical Character Recognition: http://www.csc.villanova.edu/~mdamian/csc3990/csrs2008/07-csrs2008-AJPalkovic.PDF
Evaluation of OCR Algorithms for Images: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.89.9539&rep=rep1&type=PDF
Otsu Thresholding - The Lab Book Pages: http://www.labbookpages.co.uk/software/imgProc/otsuThreshold.html
Image Segmentation: http://people.cs.uchicago.edu/~pff/segment/
Hilditch Algorithm: http://cis.k.hosei.ac.jp/~wakahara/Hilditch.c
Skeletonization: http://cgm.cs.mcgill.ca/~godfried/teaching/projects97/azar/skeleton.html
Java OCR | Ron Cemer's Blog: http://www.roncemer.com/software-development/java-ocr
THANK YOU …