towards e-library of telugu literaturerajarja/mtp/elibrary.pdf · towards e-library of telugu...

45
Outline Introduction Work done in stage1 Binarization and Skew detection Segmentation and Noise removal Line Segmenta Towards e-Library of Telugu literature Rajesh Babu Arja (Roll no: 09305914) Guide : Prof. J.Saketha Nath Co-Guide : Prof. Parag Chaudhuri Department of Computer Science and Engineering Indian Institute of Technology Bombay October 31, 2011 Rajesh Babu Arja (Roll no: 09305914) Guide : Prof. J.Saketha NathCo-Guide : Prof. Parag Chaudhuri Department of Compute Towards e-Library of Telugu literature

Upload: doankiet

Post on 03-Apr-2018

219 views

Category:

Documents


3 download

TRANSCRIPT

Page 1: Towards e-Library of Telugu literaturerajarja/MTP/eLibrary.pdf · Towards e-Library of Telugu literature

Outline Introduction Work done in stage1 Binarization and Skew detection Segmentation and Noise removal Line Segmentation Classification Post Processing Conclusion and Future Work

Towards e-Library of Telugu literature

Rajesh Babu Arja(Roll no: 09305914)

Guide : Prof. J.Saketha NathCo-Guide : Prof. Parag Chaudhuri

Department of Computer Science and EngineeringIndian Institute of Technology Bombay

October 31, 2011

Rajesh Babu Arja (Roll no: 09305914) Guide : Prof. J.Saketha NathCo-Guide : Prof. Parag Chaudhuri Department of Computer Science and Engineering Indian Institute of Technology Bombay

Towards e-Library of Telugu literature

Page 2: Towards e-Library of Telugu literaturerajarja/MTP/eLibrary.pdf · Towards e-Library of Telugu literature

Outline Introduction Work done in stage1 Binarization and Skew detection Segmentation and Noise removal Line Segmentation Classification Post Processing Conclusion and Future Work

1 Introduction

2 Work done in stage1

3 Binarization and Skew detection

4 Segmentation and Noise removal

5 Line Segmentation

6 Classification

7 Post Processing

8 Conclusion and Future Work

Rajesh Babu Arja (Roll no: 09305914) Guide : Prof. J.Saketha NathCo-Guide : Prof. Parag Chaudhuri Department of Computer Science and Engineering Indian Institute of Technology Bombay

Towards e-Library of Telugu literature

Page 3: Towards e-Library of Telugu literaturerajarja/MTP/eLibrary.pdf · Towards e-Library of Telugu literature

Outline Introduction Work done in stage1 Binarization and Skew detection Segmentation and Noise removal Line Segmentation Classification Post Processing Conclusion and Future Work

What we want to do?

What we want to do?

Digitalize Telugu literature

Why we want to do?

For easy cataloging of documents

To enable search on huge literature

To preserve the literature of the Telugu

To encourage transliteration of Telugu documents to otherlanguages

Rajesh Babu Arja (Roll no: 09305914) Guide : Prof. J.Saketha NathCo-Guide : Prof. Parag Chaudhuri Department of Computer Science and Engineering Indian Institute of Technology Bombay

Towards e-Library of Telugu literature

Page 4: Towards e-Library of Telugu literaturerajarja/MTP/eLibrary.pdf · Towards e-Library of Telugu literature

Outline Introduction Work done in stage1 Binarization and Skew detection Segmentation and Noise removal Line Segmentation Classification Post Processing Conclusion and Future Work

How can we do?

How can we do?

By Implementing OCR for Telugu

Rajesh Babu Arja (Roll no: 09305914) Guide : Prof. J.Saketha NathCo-Guide : Prof. Parag Chaudhuri Department of Computer Science and Engineering Indian Institute of Technology Bombay

Towards e-Library of Telugu literature

Page 5: Towards e-Library of Telugu literaturerajarja/MTP/eLibrary.pdf · Towards e-Library of Telugu literature

Outline Introduction Work done in stage1 Binarization and Skew detection Segmentation and Noise removal Line Segmentation Classification Post Processing Conclusion and Future Work

What is OCR?

Figure: Optical Character Recognition(OCR)

Rajesh Babu Arja (Roll no: 09305914) Guide : Prof. J.Saketha NathCo-Guide : Prof. Parag Chaudhuri Department of Computer Science and Engineering Indian Institute of Technology Bombay

Towards e-Library of Telugu literature

Page 6: Towards e-Library of Telugu literaturerajarja/MTP/eLibrary.pdf · Towards e-Library of Telugu literature

Outline Introduction Work done in stage1 Binarization and Skew detection Segmentation and Noise removal Line Segmentation Classification Post Processing Conclusion and Future Work

Applications

Library cataloging

Base for many NLP applications.

Applications in banks and post offices.

Rajesh Babu Arja (Roll no: 09305914) Guide : Prof. J.Saketha NathCo-Guide : Prof. Parag Chaudhuri Department of Computer Science and Engineering Indian Institute of Technology Bombay

Towards e-Library of Telugu literature

Page 7: Towards e-Library of Telugu literaturerajarja/MTP/eLibrary.pdf · Towards e-Library of Telugu literature

Outline Introduction Work done in stage1 Binarization and Skew detection Segmentation and Noise removal Line Segmentation Classification Post Processing Conclusion and Future Work

Our Scope

Our scope

Scanned Telugu text documents

Documents with printed Text format

Documents without any images

Challenges involved

Resolution of document is not in our control

Noise can be present in the scanned document

Rajesh Babu Arja (Roll no: 09305914) Guide : Prof. J.Saketha NathCo-Guide : Prof. Parag Chaudhuri Department of Computer Science and Engineering Indian Institute of Technology Bombay

Towards e-Library of Telugu literature

Page 8: Towards e-Library of Telugu literaturerajarja/MTP/eLibrary.pdf · Towards e-Library of Telugu literature

Outline Introduction Work done in stage1 Binarization and Skew detection Segmentation and Noise removal Line Segmentation Classification Post Processing Conclusion and Future Work

Why Telugu OCR is difficult?

Telugu vs English

English Telugu

No. of classes 52 more than 400No. of connected components per character 1 1-5Dependency among connected components No Yes

Confusion Characters No Yes

Table: Comparison of Telugu and English languages

Rajesh Babu Arja (Roll no: 09305914) Guide : Prof. J.Saketha NathCo-Guide : Prof. Parag Chaudhuri Department of Computer Science and Engineering Indian Institute of Technology Bombay

Towards e-Library of Telugu literature

Page 9: Towards e-Library of Telugu literaturerajarja/MTP/eLibrary.pdf · Towards e-Library of Telugu literature

Outline Introduction Work done in stage1 Binarization and Skew detection Segmentation and Noise removal Line Segmentation Classification Post Processing Conclusion and Future Work

Confusion characters

Figure: Confusion Characters example

Rajesh Babu Arja (Roll no: 09305914) Guide : Prof. J.Saketha NathCo-Guide : Prof. Parag Chaudhuri Department of Computer Science and Engineering Indian Institute of Technology Bombay

Towards e-Library of Telugu literature

Page 10: Towards e-Library of Telugu literaturerajarja/MTP/eLibrary.pdf · Towards e-Library of Telugu literature

Outline Introduction Work done in stage1 Binarization and Skew detection Segmentation and Noise removal Line Segmentation Classification Post Processing Conclusion and Future Work

Why Telugu OCR is difficult?

Telugu vs Other Indian Languages

Devanagari Telugu

Word Segmentation easy difficultConfusion Characters less more

Table: Comparison of Telugu and Devanagari languages

Tamil Telugu

Vowel modifiers connected disconnected

Table: Comparison of Telugu and Tamil languages

Rajesh Babu Arja (Roll no: 09305914) Guide : Prof. J.Saketha NathCo-Guide : Prof. Parag Chaudhuri Department of Computer Science and Engineering Indian Institute of Technology Bombay

Towards e-Library of Telugu literature

Page 11: Towards e-Library of Telugu literaturerajarja/MTP/eLibrary.pdf · Towards e-Library of Telugu literature

Outline Introduction Work done in stage1 Binarization and Skew detection Segmentation and Noise removal Line Segmentation Classification Post Processing Conclusion and Future Work

Comparison of Telugu OCR

S.No Features Classification Font Ind Noise free Open Source

[1] Primitive Decision trees No No n/a[2] Circular Template match Yes No n/a[3] Wavelet neural networks No No n/a[4] Fringe map Template matching No No Yes[5] Gradient dir Nearest neighbor No No n/a6 ? ? Yes Yes Yes

Table: Comparison of Telugu OCR literature

Rajesh Babu Arja (Roll no: 09305914) Guide : Prof. J.Saketha NathCo-Guide : Prof. Parag Chaudhuri Department of Computer Science and Engineering Indian Institute of Technology Bombay

Towards e-Library of Telugu literature

Page 12: Towards e-Library of Telugu literaturerajarja/MTP/eLibrary.pdf · Towards e-Library of Telugu literature

Outline Introduction Work done in stage1 Binarization and Skew detection Segmentation and Noise removal Line Segmentation Classification Post Processing Conclusion and Future Work

Comparison with state of art Telugu OCR

S.No Input Output Font Ind Noise free Open Source

1 PGM ACI No No Yes2 any image format txt/pdf Yes Yes Yes

Rajesh Babu Arja (Roll no: 09305914) Guide : Prof. J.Saketha NathCo-Guide : Prof. Parag Chaudhuri Department of Computer Science and Engineering Indian Institute of Technology Bombay

Towards e-Library of Telugu literature

Page 13: Towards e-Library of Telugu literaturerajarja/MTP/eLibrary.pdf · Towards e-Library of Telugu literature

Outline Introduction Work done in stage1 Binarization and Skew detection Segmentation and Noise removal Line Segmentation Classification Post Processing Conclusion and Future Work

Our Goal

Our goal is to implement OCR with following features

Works as font independent system

Works on noisy data

Capable of digitalizing documents in DLI.

Open Source project

Web interface for OCR

Rajesh Babu Arja (Roll no: 09305914) Guide : Prof. J.Saketha NathCo-Guide : Prof. Parag Chaudhuri Department of Computer Science and Engineering Indian Institute of Technology Bombay

Towards e-Library of Telugu literature

Page 14: Towards e-Library of Telugu literaturerajarja/MTP/eLibrary.pdf · Towards e-Library of Telugu literature

Outline Introduction Work done in stage1 Binarization and Skew detection Segmentation and Noise removal Line Segmentation Classification Post Processing Conclusion and Future Work

Work done in stage1

Implemented end to end system

Explored various stages of OCR

Implemented each stage with basic methodologies

Observed the problems involved in each stage

Identified potential improvements to do in each stage

Rajesh Babu Arja (Roll no: 09305914) Guide : Prof. J.Saketha NathCo-Guide : Prof. Parag Chaudhuri Department of Computer Science and Engineering Indian Institute of Technology Bombay

Towards e-Library of Telugu literature

Page 15: Towards e-Library of Telugu literaturerajarja/MTP/eLibrary.pdf · Towards e-Library of Telugu literature

Outline Introduction Work done in stage1 Binarization and Skew detection Segmentation and Noise removal Line Segmentation Classification Post Processing Conclusion and Future Work

Stages of OCR

Stages of OCR

Binarization

Skew detection and Correction

Noise removal

Segmentation of lines, words and characters

Feature extraction and classification

Rendering data

Rajesh Babu Arja (Roll no: 09305914) Guide : Prof. J.Saketha NathCo-Guide : Prof. Parag Chaudhuri Department of Computer Science and Engineering Indian Institute of Technology Bombay

Towards e-Library of Telugu literature

Page 16: Towards e-Library of Telugu literaturerajarja/MTP/eLibrary.pdf · Towards e-Library of Telugu literature

Outline Introduction Work done in stage1 Binarization and Skew detection Segmentation and Noise removal Line Segmentation Classification Post Processing Conclusion and Future Work

What is Binarization?

Figure: Binarization example

Rajesh Babu Arja (Roll no: 09305914) Guide : Prof. J.Saketha NathCo-Guide : Prof. Parag Chaudhuri Department of Computer Science and Engineering Indian Institute of Technology Bombay

Towards e-Library of Telugu literature

Page 17: Towards e-Library of Telugu literaturerajarja/MTP/eLibrary.pdf · Towards e-Library of Telugu literature

Outline Introduction Work done in stage1 Binarization and Skew detection Segmentation and Noise removal Line Segmentation Classification Post Processing Conclusion and Future Work

Binarization

Our Methodology:

Using global thresholding

Using Java Advanced Imaging API [1]

Results:

Obtained more than 90% accuracy.

Rajesh Babu Arja (Roll no: 09305914) Guide : Prof. J.Saketha NathCo-Guide : Prof. Parag Chaudhuri Department of Computer Science and Engineering Indian Institute of Technology Bombay

Towards e-Library of Telugu literature

Page 18: Towards e-Library of Telugu literaturerajarja/MTP/eLibrary.pdf · Towards e-Library of Telugu literature

Outline Introduction Work done in stage1 Binarization and Skew detection Segmentation and Noise removal Line Segmentation Classification Post Processing Conclusion and Future Work

What is Skew Detection?

Figure: Skew Correction example

Rajesh Babu Arja (Roll no: 09305914) Guide : Prof. J.Saketha NathCo-Guide : Prof. Parag Chaudhuri Department of Computer Science and Engineering Indian Institute of Technology Bombay

Towards e-Library of Telugu literature

Page 19: Towards e-Library of Telugu literaturerajarja/MTP/eLibrary.pdf · Towards e-Library of Telugu literature

Outline Introduction Work done in stage1 Binarization and Skew detection Segmentation and Noise removal Line Segmentation Classification Post Processing Conclusion and Future Work

Skew Detection and Correction

Our Methodology:

Using Hough’s transformation

Using Java Deskew Implementation [2]

Results:

Correcting skew angle of ±20.

Limitations

Don’t work for multi-skewed document

Rajesh Babu Arja (Roll no: 09305914) Guide : Prof. J.Saketha NathCo-Guide : Prof. Parag Chaudhuri Department of Computer Science and Engineering Indian Institute of Technology Bombay

Towards e-Library of Telugu literature

Page 20: Towards e-Library of Telugu literaturerajarja/MTP/eLibrary.pdf · Towards e-Library of Telugu literature

Outline Introduction Work done in stage1 Binarization and Skew detection Segmentation and Noise removal Line Segmentation Classification Post Processing Conclusion and Future Work

What Segmentation?

Line segmentation example Connected component segmentation

Rajesh Babu Arja (Roll no: 09305914) Guide : Prof. J.Saketha NathCo-Guide : Prof. Parag Chaudhuri Department of Computer Science and Engineering Indian Institute of Technology Bombay

Towards e-Library of Telugu literature

Page 21: Towards e-Library of Telugu literaturerajarja/MTP/eLibrary.pdf · Towards e-Library of Telugu literature

Outline Introduction Work done in stage1 Binarization and Skew detection Segmentation and Noise removal Line Segmentation Classification Post Processing Conclusion and Future Work

Why Segmentation?

Advantages of Segmentation

To make processing easier in next stages

To minimize number of classes

Different types of segmentation

Line segmentation

Word segmentation

Connected component segmentation

Rajesh Babu Arja (Roll no: 09305914) Guide : Prof. J.Saketha NathCo-Guide : Prof. Parag Chaudhuri Department of Computer Science and Engineering Indian Institute of Technology Bombay

Towards e-Library of Telugu literature

Page 22: Towards e-Library of Telugu literaturerajarja/MTP/eLibrary.pdf · Towards e-Library of Telugu literature

Outline Introduction Work done in stage1 Binarization and Skew detection Segmentation and Noise removal Line Segmentation Classification Post Processing Conclusion and Future Work

Connected component segmentation

Our Methodology

Found 8-way connected components

Implemented similar algorithm as flood fill [3]

Figure: Connected components example

Rajesh Babu Arja (Roll no: 09305914) Guide : Prof. J.Saketha NathCo-Guide : Prof. Parag Chaudhuri Department of Computer Science and Engineering Indian Institute of Technology Bombay

Towards e-Library of Telugu literature

Page 23: Towards e-Library of Telugu literaturerajarja/MTP/eLibrary.pdf · Towards e-Library of Telugu literature

Outline Introduction Work done in stage1 Binarization and Skew detection Segmentation and Noise removal Line Segmentation Classification Post Processing Conclusion and Future Work

Why Line Segmentation?

Characters are not labeled sequentially

To bundle the consonant modifiers with correspondingcharacters

To process document in order

Rajesh Babu Arja (Roll no: 09305914) Guide : Prof. J.Saketha NathCo-Guide : Prof. Parag Chaudhuri Department of Computer Science and Engineering Indian Institute of Technology Bombay

Towards e-Library of Telugu literature

Page 24: Towards e-Library of Telugu literaturerajarja/MTP/eLibrary.pdf · Towards e-Library of Telugu literature

Outline Introduction Work done in stage1 Binarization and Skew detection Segmentation and Noise removal Line Segmentation Classification Post Processing Conclusion and Future Work

Our methodology

Our Methodology

We found each line consists of four states

We found histogram of density of black pixels in each line

Used Hidden Markov Model(HMM) for segmentation of lines

Basing on state changes found out line boundaries

Rajesh Babu Arja (Roll no: 09305914) Guide : Prof. J.Saketha NathCo-Guide : Prof. Parag Chaudhuri Department of Computer Science and Engineering Indian Institute of Technology Bombay

Towards e-Library of Telugu literature

Page 25: Towards e-Library of Telugu literaturerajarja/MTP/eLibrary.pdf · Towards e-Library of Telugu literature

Outline Introduction Work done in stage1 Binarization and Skew detection Segmentation and Noise removal Line Segmentation Classification Post Processing Conclusion and Future Work

Example

Before Line segmentation After Line segmentation

Rajesh Babu Arja (Roll no: 09305914) Guide : Prof. J.Saketha NathCo-Guide : Prof. Parag Chaudhuri Department of Computer Science and Engineering Indian Institute of Technology Bombay

Towards e-Library of Telugu literature

Page 26: Towards e-Library of Telugu literaturerajarja/MTP/eLibrary.pdf · Towards e-Library of Telugu literature

Outline Introduction Work done in stage1 Binarization and Skew detection Segmentation and Noise removal Line Segmentation Classification Post Processing Conclusion and Future Work

Example

Before Line segmentation [5] After Line segmentation

Rajesh Babu Arja (Roll no: 09305914) Guide : Prof. J.Saketha NathCo-Guide : Prof. Parag Chaudhuri Department of Computer Science and Engineering Indian Institute of Technology Bombay

Towards e-Library of Telugu literature

Page 27: Towards e-Library of Telugu literaturerajarja/MTP/eLibrary.pdf · Towards e-Library of Telugu literature

Outline Introduction Work done in stage1 Binarization and Skew detection Segmentation and Noise removal Line Segmentation Classification Post Processing Conclusion and Future Work

Observations and Future Work

Observations

Works well with uniform spacing documents

Non-Uniform spacing documents are not segmented properly

Planned Improvements:

Planning to apply Conditional random fields(CRF) for linesegmentation.

Rajesh Babu Arja (Roll no: 09305914) Guide : Prof. J.Saketha NathCo-Guide : Prof. Parag Chaudhuri Department of Computer Science and Engineering Indian Institute of Technology Bombay

Towards e-Library of Telugu literature

Page 28: Towards e-Library of Telugu literaturerajarja/MTP/eLibrary.pdf · Towards e-Library of Telugu literature

Outline Introduction Work done in stage1 Binarization and Skew detection Segmentation and Noise removal Line Segmentation Classification Post Processing Conclusion and Future Work

Feature Vector generation and classification

Our Methodology

Using 0/1 feature vector

Generated synthetic data for three Telugu fonts

Using euclidean distance for classification

Rajesh Babu Arja (Roll no: 09305914) Guide : Prof. J.Saketha NathCo-Guide : Prof. Parag Chaudhuri Department of Computer Science and Engineering Indian Institute of Technology Bombay

Towards e-Library of Telugu literature

Page 29: Towards e-Library of Telugu literaturerajarja/MTP/eLibrary.pdf · Towards e-Library of Telugu literature

Outline Introduction Work done in stage1 Binarization and Skew detection Segmentation and Noise removal Line Segmentation Classification Post Processing Conclusion and Future Work

Feature Vector generation and classification

Observations

Dividing the characters into groups will improve accuracy.

Experiments

Found accuracy without classifying connected componentsinto groups

Found accuracy with classifying connected components intogroups

Rajesh Babu Arja (Roll no: 09305914) Guide : Prof. J.Saketha NathCo-Guide : Prof. Parag Chaudhuri Department of Computer Science and Engineering Indian Institute of Technology Bombay

Towards e-Library of Telugu literature

Page 30: Towards e-Library of Telugu literaturerajarja/MTP/eLibrary.pdf · Towards e-Library of Telugu literature

Outline Introduction Work done in stage1 Binarization and Skew detection Segmentation and Noise removal Line Segmentation Classification Post Processing Conclusion and Future Work

Feature Vector generation and classification

Results

Method Without grouping With grouping

0/1 33.75% 41.58%SIFT 10.40% 15.97%

Table: Classification Accuracy statistics

Rajesh Babu Arja (Roll no: 09305914) Guide : Prof. J.Saketha NathCo-Guide : Prof. Parag Chaudhuri Department of Computer Science and Engineering Indian Institute of Technology Bombay

Towards e-Library of Telugu literature

Page 31: Towards e-Library of Telugu literaturerajarja/MTP/eLibrary.pdf · Towards e-Library of Telugu literature

Outline Introduction Work done in stage1 Binarization and Skew detection Segmentation and Noise removal Line Segmentation Classification Post Processing Conclusion and Future Work

Post Processing

Our Methodology

Property file created which maps each character lable tocorresponding Unicode

Unicode corresponding to classified character lable is picked

Identified the character position and additional care takenaccordingly

Rajesh Babu Arja (Roll no: 09305914) Guide : Prof. J.Saketha NathCo-Guide : Prof. Parag Chaudhuri Department of Computer Science and Engineering Indian Institute of Technology Bombay

Towards e-Library of Telugu literature

Page 32: Towards e-Library of Telugu literaturerajarja/MTP/eLibrary.pdf · Towards e-Library of Telugu literature

Outline Introduction Work done in stage1 Binarization and Skew detection Segmentation and Noise removal Line Segmentation Classification Post Processing Conclusion and Future Work

Example

Figure: Example for rendered output text

Rajesh Babu Arja (Roll no: 09305914) Guide : Prof. J.Saketha NathCo-Guide : Prof. Parag Chaudhuri Department of Computer Science and Engineering Indian Institute of Technology Bombay

Towards e-Library of Telugu literature

Page 33: Towards e-Library of Telugu literaturerajarja/MTP/eLibrary.pdf · Towards e-Library of Telugu literature

Outline Introduction Work done in stage1 Binarization and Skew detection Segmentation and Noise removal Line Segmentation Classification Post Processing Conclusion and Future Work

Observations

Observations

Word spacing is not completely perfect.

Paragraph beginning with space are not identified.

Headings with large font are not taken care

Rajesh Babu Arja (Roll no: 09305914) Guide : Prof. J.Saketha NathCo-Guide : Prof. Parag Chaudhuri Department of Computer Science and Engineering Indian Institute of Technology Bombay

Towards e-Library of Telugu literature

Page 34: Towards e-Library of Telugu literaturerajarja/MTP/eLibrary.pdf · Towards e-Library of Telugu literature

Outline Introduction Work done in stage1 Binarization and Skew detection Segmentation and Noise removal Line Segmentation Classification Post Processing Conclusion and Future Work

Noise

Sources of Noise

Quality of printer

Quality of scanner

Age of the document

Advantages of Noise removal

Improves recognition results.

Facilitates better processing in next stages.

Rajesh Babu Arja (Roll no: 09305914) Guide : Prof. J.Saketha NathCo-Guide : Prof. Parag Chaudhuri Department of Computer Science and Engineering Indian Institute of Technology Bombay

Towards e-Library of Telugu literature

Page 35: Towards e-Library of Telugu literaturerajarja/MTP/eLibrary.pdf · Towards e-Library of Telugu literature

Outline Introduction Work done in stage1 Binarization and Skew detection Segmentation and Noise removal Line Segmentation Classification Post Processing Conclusion and Future Work

Noise Removal

Our Methodology

Found connected components(CC)

Identified three features for CC

Extracted three features for all CC

Applied Expectation Maximization(EM) algorithm forclustering

Using Weka’s [4] EM algorithm API.

Rajesh Babu Arja (Roll no: 09305914) Guide : Prof. J.Saketha NathCo-Guide : Prof. Parag Chaudhuri Department of Computer Science and Engineering Indian Institute of Technology Bombay

Towards e-Library of Telugu literature

Page 36: Towards e-Library of Telugu literaturerajarja/MTP/eLibrary.pdf · Towards e-Library of Telugu literature

Outline Introduction Work done in stage1 Binarization and Skew detection Segmentation and Noise removal Line Segmentation Classification Post Processing Conclusion and Future Work

Experiments and Results

Experiments

EM algorithm with two clusters

EM algorithm with three clusters

K-means clustering with two clusters

Results

Method Noise Clusters removed Text clusters removed

EM 2-cluster 70% 0.01%

Table: Noise Removal statistics

Rajesh Babu Arja (Roll no: 09305914) Guide : Prof. J.Saketha NathCo-Guide : Prof. Parag Chaudhuri Department of Computer Science and Engineering Indian Institute of Technology Bombay

Towards e-Library of Telugu literature

Page 37: Towards e-Library of Telugu literaturerajarja/MTP/eLibrary.pdf · Towards e-Library of Telugu literature

Outline Introduction Work done in stage1 Binarization and Skew detection Segmentation and Noise removal Line Segmentation Classification Post Processing Conclusion and Future Work

Example

Before Noise removal After Noise removal

Rajesh Babu Arja (Roll no: 09305914) Guide : Prof. J.Saketha NathCo-Guide : Prof. Parag Chaudhuri Department of Computer Science and Engineering Indian Institute of Technology Bombay

Towards e-Library of Telugu literature

Page 38: Towards e-Library of Telugu literaturerajarja/MTP/eLibrary.pdf · Towards e-Library of Telugu literature

Outline Introduction Work done in stage1 Binarization and Skew detection Segmentation and Noise removal Line Segmentation Classification Post Processing Conclusion and Future Work

Example

Before Noise removal After Noise removal

Rajesh Babu Arja (Roll no: 09305914) Guide : Prof. J.Saketha NathCo-Guide : Prof. Parag Chaudhuri Department of Computer Science and Engineering Indian Institute of Technology Bombay

Towards e-Library of Telugu literature

Page 39: Towards e-Library of Telugu literaturerajarja/MTP/eLibrary.pdf · Towards e-Library of Telugu literature

Outline Introduction Work done in stage1 Binarization and Skew detection Segmentation and Noise removal Line Segmentation Classification Post Processing Conclusion and Future Work

Observations

Observations

Underlines and hand written characters are not removed

Non-Telugu characters are not completely removed

In some documents border noise is not removed

Joint Telugu characters are removed

Rajesh Babu Arja (Roll no: 09305914) Guide : Prof. J.Saketha NathCo-Guide : Prof. Parag Chaudhuri Department of Computer Science and Engineering Indian Institute of Technology Bombay

Towards e-Library of Telugu literature

Page 40: Towards e-Library of Telugu literaturerajarja/MTP/eLibrary.pdf · Towards e-Library of Telugu literature

Outline Introduction Work done in stage1 Binarization and Skew detection Segmentation and Noise removal Line Segmentation Classification Post Processing Conclusion and Future Work

Future Work

Planned Improvements:

Including more features like thickness of character

Including the structural features of character shapes

Including position feature to remove border noise

Trying to implement methods to find overlapping characters

Rajesh Babu Arja (Roll no: 09305914) Guide : Prof. J.Saketha NathCo-Guide : Prof. Parag Chaudhuri Department of Computer Science and Engineering Indian Institute of Technology Bombay

Towards e-Library of Telugu literature

Page 41: Towards e-Library of Telugu literaturerajarja/MTP/eLibrary.pdf · Towards e-Library of Telugu literature

Outline Introduction Work done in stage1 Binarization and Skew detection Segmentation and Noise removal Line Segmentation Classification Post Processing Conclusion and Future Work

Conclusion

Created end to end functioning basic OCR.

Binarization with global thresholding giving 90% aboveaccuracy.

Able to correct skew angle of ±20

Able to remove above 70% noise from document

Able to segment lines with more than 95% accuracy

Able to classify characters with more than 30% accuracy

Able to render the characters by finding matching Unicode

Rajesh Babu Arja (Roll no: 09305914) Guide : Prof. J.Saketha NathCo-Guide : Prof. Parag Chaudhuri Department of Computer Science and Engineering Indian Institute of Technology Bombay

Towards e-Library of Telugu literature

Page 42: Towards e-Library of Telugu literaturerajarja/MTP/eLibrary.pdf · Towards e-Library of Telugu literature

Outline Introduction Work done in stage1 Binarization and Skew detection Segmentation and Noise removal Line Segmentation Classification Post Processing Conclusion and Future Work

Future Work

Future Work

Consider more structural features for improving accuracy andbetter noise removal.

Apply CRF for line segmentation

Will use language models for correcting confusion charactersand broken characters

Improve classification models.

Implement page layout analysis module

Implement a web based Telugu OCR

Rajesh Babu Arja (Roll no: 09305914) Guide : Prof. J.Saketha NathCo-Guide : Prof. Parag Chaudhuri Department of Computer Science and Engineering Indian Institute of Technology Bombay

Towards e-Library of Telugu literature

Page 43: Towards e-Library of Telugu literaturerajarja/MTP/eLibrary.pdf · Towards e-Library of Telugu literature

Outline Introduction Work done in stage1 Binarization and Skew detection Segmentation and Noise removal Line Segmentation Classification Post Processing Conclusion and Future Work

References

Java Advanced Imaging API.http://java.sun.com/javase/technologies/desktop/media/.

Java Deskew by Hough Transformation. http://www.jdeskew.com.

Flood fill Algorithm. http://en.wikipedia.org/wiki/Flood fill

WEKA Java API. http://www.cs.waikato.ac.nz/ml/weka/

Digital Library of India. http://www.dli.gov.in/

Drishti Telugu OCR. http://www.ildc.in/Telugu/htm/lin ocr spell.htm

Rajesh Babu Arja (Roll no: 09305914) Guide : Prof. J.Saketha NathCo-Guide : Prof. Parag Chaudhuri Department of Computer Science and Engineering Indian Institute of Technology Bombay

Towards e-Library of Telugu literature

Page 44: Towards e-Library of Telugu literaturerajarja/MTP/eLibrary.pdf · Towards e-Library of Telugu literature

Outline Introduction Work done in stage1 Binarization and Skew detection Segmentation and Noise removal Line Segmentation Classification Post Processing Conclusion and Future Work

References

Rajasekaran S.N.S. Deekshatulu B.L. 1977 Recognition of printed Telugucharacters. Comput. Graphics Image Processing,6 pgs.335–360.

Rao P. V. S. and T. M. Ajitha 1995 Telugu Script Recognition - a Feature BasedApproach. Proce.of ICDAR, IEEE pgs.323-326,.

Pujari Arun K , C Dhanunjaya Naidu B C Jinaga 2002 An Adaptive CharacterRecognizer for Telugu Scripts using Multiresolution Analysis and AssociativeMemory. ICVGIP, Ahmedabad.

Negi Atul, Chakravarthy Bhagvati and.Krishna B 2001 An OCR system forTelugu. Proc. Of 6th Int. Conf. on Document Analysis and Recognition IEEEComp. Soc. Press, USA,. Pgs. 1110-1114.

Lakshmi C V, C Patvardhan 2003 A high accuracy OCR for printed Telugu text.

Conference on Convergent Technologies for Asia-Pacific Region (TENCON

)Volume 2, Issue, 15-17 Page(s): 725 - 729

Rajesh Babu Arja (Roll no: 09305914) Guide : Prof. J.Saketha NathCo-Guide : Prof. Parag Chaudhuri Department of Computer Science and Engineering Indian Institute of Technology Bombay

Towards e-Library of Telugu literature

Page 45: Towards e-Library of Telugu literaturerajarja/MTP/eLibrary.pdf · Towards e-Library of Telugu literature

Outline Introduction Work done in stage1 Binarization and Skew detection Segmentation and Noise removal Line Segmentation Classification Post Processing Conclusion and Future Work

Thank You

Rajesh Babu Arja (Roll no: 09305914) Guide : Prof. J.Saketha NathCo-Guide : Prof. Parag Chaudhuri Department of Computer Science and Engineering Indian Institute of Technology Bombay

Towards e-Library of Telugu literature