Download - OPTICAL CURSIVE HANDWRITTEN RECOGNITION USING VPP & …

OPTICAL CURSIVE HANDWRITTEN RECOGNITION

USING VPP & TDP NATIVE SEGMENTATION

ALGORITHMS AND NEURAL NETWORKS (PYTORCH)

A Project report submitted in partial fulfillment of the requirements for

the award of the degree of

BACHELOR OF TECHNOLOGY

IN

COMPUTER SCIENCE ENGINEERING

Submitted by

G. TIRUMALESH 316126510081

K. PRATIMA

316126510087

K. L. SRINIVAS

316126510088

N. ARUN

316126510100

Y. HEMANTH 316126510120

Under the guidance of

B. SIVA JYOTHI (Assistant Professor)

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

ANIL NEERUKONDA INSTITUTE OF TECHNOLOGY AND SCIENCES

(UGC AUTONOMOUS)

(Permanently Affiliated to AU, Approved by AICTE and Accredited by NBA & NAAC with ‘A’ Grade)

Sangivalasa, bheemili mandal, Visakhapatnam dist. (A.P)

2019-2020



(UGC AUTONOMOUS)

(Affiliated to AU, Approved by AICTE and Accredited by NBA & NAAC with ‘A’

Grade)


BONAFIDE CERTIFICATE

This is to certify that the project report entitled “OPTICAL CURSIVE

HANDWRITTEN RECOGNITION USING VPP & TDP NATIVE

SEGMENTATION ALGORITHMS AND NEURAL NETWORKS

(PYTORCH)” submitted by G.TIRUMALESH (316126510081), K.PRATIMA

(316126510087), K.L.SRINIVAS (316126510088), N.ARUN (316126510100),

Y.HEMANTH (316126510120) in partial fulfillment of the requirements for the

award of the degree of Bachelor of Technology in Computer Science Engineering

of Anil Neerukonda Institute of technology and sciences (A), Visakhapatnam is a

record of bonafide work carried out under my guidance and supervision.

Project Guide Head of the Department

B. SHIVA JYOTHI Dr. R. SHIVARANJANI

ASSISTANT PROFESSOR HEAD OF THE DEPATMENT

DEPT. OF CSE DEPT. OF CSE

DECLARATION

We, G.TIRUMALESH, K.PRATIMA, K.L.SRINIVAS, N.ARUN,

Y.HEMANTH, of final semester B.Tech., in the department of Computer

Science and Engineering from ANITS, Visakhapatnam, hereby declare that the

project work entitled “OPTICAL CURSIVE HANDWRITTEN

RECOGNITION USING VPP & TDP NATIVE SEGMENTATION

ALGORITHMS AND NEURAL NETWORKS (PYTORCH)” is carried

out by us and submitted in partial fulfillment of the requirements for the award

of Bachelor of Technology in Computer Science Engineering , under Anil

Neerukonda Institute of Technology & Sciences(A) during the academic year

2016-2020 and has not been submitted to any other university for the award of

any kind of degree.


K. PRATIMA 316126510087

K. L. SRINIVAS 316126510088

N. ARUN 316126510100

Y. HEMANTH 316126510120

ACKNOWLEDGEMENT

An endeavor over a long period can be advice and support of many well-

wishers. We take this opportunity to express our gratitude and appreciation to all of

them.

We owe our tributes to Dr. R. SHIVARANJANI, Head of the

Department, Computer Science & Engineering for her valuable support and

guidance during the period of project implementation.

We wish to express our sincere thanks and gratitude to our project guide B.

SIVA JYOTHI, Assistant Professor, Department of Computer Science &

Engineering, ANITS, for the simulating discussions, in analyzing problems

associated with our project work and for guiding us throughout the project. Project

meeting were highly informative. We express our sincere thanks for the

encouragement, untiring guidance and the confidence they had shown in us. We are

immensely indebted for their valuable guidance throughout our project.

We also thank all the staff members of CSE department for their valuable

advices.

We also supporting staff for providing resources as and when required.


K. PRATIMA 316126510087

K. L. SRINIVAS 316126510088

N. ARUN 316126510100

Y. HEMANTH 316126510120

i

ABSTRACT

In the field of Artificial Intelligence, scientists brought a revolutionary

change in the field of image processing and one of the biggest challenges in it is to

identify documents in hand-written formats. One of the most widely used techniques

for the validity of these types of documents is Character Recognition. Optical

Character Recognition (OCR) is an extensively employed method to transform the

data of any form (handwritten) into electronic format. There are millions of

techniques introduced now that can be used to recognize handwriting of any form

and language. A number of techniques are available for feature extraction and

training of CR systems in the literature, each with its own superiorities and

weaknesses. We explore these techniques to design an optimal cursive handwritten

recognition system based on character recognition. This aims to represent the

process of Converting handwritten text to computer typed document which is an

optical cursive handwritten recognition (OCR) by using segmentation algorithms

like VPP (vertical projection profile), TDP (top down profile) and the other

histogram (vertical and horizontal) projection algorithms to achieve the solution. For

feature extraction and character recognition pytorch which is an open source

machine learning tool library in python used for computer vision and natural

language processing.

ii

CONTENTS

Page No.

ABSTRACT i

LIST OF FIGURES v

LIST OF SYMBOLS vii

LIST OF TABLES viii

LIST OF ABBREVATIONS ix

CHAPTER 1 INTRODUCTION 1

1.1 Prerequisites 2

1.1.1 Software Requirements 2

1.1.2 Hardware Requirements 2

1.1.3 Data Requirements 2

1.2 Python 3

1.2.1 Machine learning in python 4

1.2.1.1 Numpy 5

1.2.1.2 Pandas 5

1.2.1.3 Opencv 6

1.2.1.4 Sk-Learn 7

1.2.1.5 Matplotlib 8

1.2.2 Neural networks in python 9

1.2.2.1 Loss Function 10

1.3 Image Processing 12

1.3.1 Types of Images 13

1.3.2 Brightness and Contrast 14

1.4 Convolutional Neural Network 15

1.4.1 Convolutional Layer 15

1.4.2 Pooling Layer 17

1.4.3 Classification 17

1.5 Problem Statement 18

iii

CHAPTER 2 LITERATURE SURVEY 19

2.1 Existing Methods for character recognition system 19

CHAPTER 3 METHODOLOGY 22

3.1 System architecture 22

3.2 Algorithm 23

3.2.1 Algorithm for line segmentation 24

3.2.2 Algorithm for word segmentation 24

3.2.3 Algorithm for character segmentation 25

3.2.3.1 Character segmentation using VPP 25

3.2.3.2 Touching character segmentation 25

3.3 Proposed Work 26

3.3.1 Image scanning 26

3.3.2 Pre-processing 26

3.3.3 Segmentation 28

3.3.3.1 Line segmentation 28

3.3.3.2 Word segmentation 29

3.3.3.3 Character segmentation 30

3.3.4 Feature extraction 33


3.3.6 Post-processing 36

CHAPTER 4 PYTORCH 37

4.1 Pytorch Library Tools 37

4.2 Pytorch in research 39

4.3 Training an image classifier using pytorch 40

CHAPTER 5 SAMPLE CODE ELABORATION 42

5.1 Pre-processing the data 42

5.1.1 Data Organization 42

5.1.2 Image Sizing and Shaping 42

5.1.3 Image Blurring Kernel Filter 42

5.1.4 Applying Kernel Filters and Contours on Image 43

iv

5.2 Line Segmentation 44

5.2.1 Calculating Line Intensity(Horizontal Histograms) 44

5.2.3 Evaluating Threshold for Line Segmentation 45

5.2.4 Segmenting Paragraph into sentences 45

5.3 Word Segmentation 46

5.3.1 Combining two Missegmented Words 46

5.3.2 Segmenting Sentence into Words 46

5.4 Character Segmentation 47

5.4.1 Evaluating VPP Intensity 47

5.4.2 First Level Character Segmentation Using VPP 48

5.4.3 Segmenting Word into Characters under VPP 49

5.4.4 Evaluating VPP and TDP Average Intensity 49

5.4.5 Connected Components Segmentation 50

5.4.6 Further Required Segmentation on Connected Components. 52

5.5 Training the model 53

5.5.1 Data Loader 53

5.5.2 Defining Transforms and Parameters 53

5.5.3 Importing the Model 54

5.5.4 Training the Model 54

5.5.5 Testing the model 56

CHAPTER 6 RESULTS AND DISCUSSIONS 58

6.1 Input & Output 58

6.2 Training Datasets 58

6.3 Experimental Results and Analysis 60

CHAPTER 7 CONCLUSION 62

REFERENCES 63

v

LIST OF FIGURES

Fig. No Topic Name Page No.

1.1 Flow-chart for OCR 2

1.2 Machine learning overview 4

1.3 Architecture of a 2-layer Neural Network 9

1.4 Illustration of flow of network 10

1.5 A CNN Sequence 15

1.6 Convolution operation with kernel 16

1.7 Performing Pooling operation 17

1.8 Describing Classification Process 17

3.1 Block Diagram of Proposed System 22

3.2 Steps in pre-processing 23

3.3 Scanned input image 26

3.4 Pre-processing 27

3.5 Blur image (for noise removal) 27

3.6 Binary image in color contrast 28

3.7 Horizontal projection graph 28

3.8 Segmented lines from the image 29

3.9 segmented line 29

3.10 First level word segmentation 30

3.11 Second level word segmentation 30

3.12 segmented word 30

3.13 VPP intensity graph for word ‘MOVE’ 31

3.14 First level VPP character segmentation 31

3.15 Touching characters (connected component) 31

3.16 VPP intensity graph for word ‘MO’ 32

3.17 TDP intensity graph for word ‘MO’ 32

3.18 Combined intensity graph for word ‘MO’ 32

vi


3.19 Final level character segmentation 33

3.20 Feature extraction of characters 33

3.21 Zero padding 34

3.22 Convolution layer 34

3.23 Max pooling and Average pooling 34

3.24 Flatten the image 35

3.25 ReLU activation function 35

3.26 Fully connected layer with classes (x and o)

along with probabilities 35

3.27 Final result 36

6.1 Input image 58

vii

LIST OF SYMBOLS

V Input layer

Ŷ Output layer

W Weights

B biases

σ Activation function

Summation

vii

LIST OF TABLES

Table. No Table Name Page No.

6.1 Breakdown of the number of available training and 60

testing samples in the NIST special database 19

using the original training and testing splits.

6.2 Testing and Accuracy 61

ix

LIST OF ABBREVATIONS

OCR

Optical Character Recognition

VPP Vertical Projection Profile

TDP Top Down Profile

CNN Convolutional Neural Network

RNN Recurrent Neural Network

RGB Red Green Blue

HTR Hand written Text Recognition

GPU Graphical Processing Unit

ReLU Rectified Linear Unit

MSE Mean Square Error

MAE Mean Absolute Error

MBE Mean Bias Error

SVM Support Vector Machine

1

1. INTRODUCTION

Optical character recognition is also called as optical character reader and it

is abbreviated as OCR. OCR translates the images into machine readable format

such as ASCII or Unicode. Character recognition can be classified into two types

based on the type of the text i.e. machine printed text and handwritten text. Character

recognition of handwritten text is more challenging than machine printed text.

Because, machine printed characters are straight with uniform alignment and

spacing. While the handwritten characters are not uniform and greatly varies in

shape and size. There are many advantages of OCR. When a printed text is

converted to machine readable text then we can search through it with keywords,

compress, edit and send it, and can store in much less space. OCR has numerous

applications. It is used by blind and visually impaired persons. In banking and legal

department, it is used to digitize the documents. Barcode recognition technique is

used in retail industry which is also related to OCR. It is widely used in education,

finance and automatic detection of number plate. The main challenge in the

recognition of handwritten characters is that every person on the earth has different

handwriting. There are various other factors also which causes difference in

handwriting such as multi-orientations, skewness of the text lines, overlapping

characters, connected components, pressure points etc. Many scripts are there with

their intrinsic variations. A single character can be written in many forms, so it is a

challenging task to recognize a particular handwritten character.

There are six steps in OCR:

They are as follows:

• Image acquisition

• Pre-processing

• Segmentation

• Feature extraction

• Classification

• Post-processing

2

Fig. 1.1 Flow-chart for OCR

1.1 Prerequisites:

1.1.1 Software requirements:

1. Python version - 3.0

2. Python IDE – Pycharm

3. Data science libraries – Matplotlib, numpy, PIL, Pytorch, Pandas

1.1.2 Hardware requirements:

1. CPU - 8 to 16 with each octa core processors in a distributed

network.

2. RAM - 128 to 256 GB

3. Storage – 30 to 50 GB

4. Entirely Organized in a cloud network

1.1.3 Data Requirements:

1. NIST DATA SET – Characters

2. IAM DATASET – Forms, Sentences, Words

3

1.2 Python:

Python is a popular programming language. It was created by Guido van Rossum, and

released in 1991.

It is used for:

web development (server-side),

software development,

mathematics,

system scripting.

Python does the things as follow:

Python can be used on a server to create web applications.

Python can be used alongside software to create workflows.

Python can connect to database systems. It can also read and modify files.

Python can be used to handle big data and perform complex mathematics.

Python can be used for rapid prototyping, or for production-ready software

development.

Advantages of python are mentioned below:

Python works on different platforms (Windows, Mac, Linux, Raspberry Pi,

etc).

Python has a simple syntax similar to the English language.

Python has syntax that allows developers to write programs with fewer lines

than some other programming languages.

Python runs on an interpreter system, meaning that code can be executed as

soon as it is written. This means that prototyping can be very quick.

Python can be treated in a procedural way, an object-orientated way or a

functional way.

4

1.2.1 Machine learning in python :

Fig 1.2 Machine learning overview

Machine learning is learning based on experience. As an example, it is like a

person who learns to play chess through observation as others play. In this way,

computers can be programmed through the provision of information which they are

trained, acquiring the ability to identify elements or their characteristics with high

probability.

There are various stages of machine learning:

data collection

data sorting

data analysis

algorithm development

checking algorithm generated

the use of an algorithm to further conclusions

Machine learning algorithms are divided into two groups:

Unsupervised learning

Supervised learning

With Unsupervised learning, your machine receives only a set of input data.

Thereafter, the machine is up to determine the relationship between the entered data

and any other hypothetical data. Unlike supervised learning, where the machine is

provided with some verification data for learning, independent Unsupervised learning

implies that the computer itself will find patterns and relationships between different

data sets. Unsupervised learning can be further divided into clustering and association.

Supervised learning implies the computer ability to recognize elements based

on the provided samples. The computer studies it and develops the ability to recognize

5

new data based on this data. For example, you can train your computer to filter spam

messages based on previously received information.

Some Supervised learning algorithms include:

Decision trees

Support-vector machine

Naive Bayes classifier

k-nearest neighbours

linear regression

1.2.1.1 Numpy :

NumPy is the fundamental package needed for scientific computing with

Python. This package contains:

a powerful N-dimensional array object

sophisticated (broadcasting) functions

basic linear algebra functions

basic Fourier transforms

sophisticated random number capabilities

tools for integrating Fortran code

tools for integrating C/C++ code

Besides its obvious scientific uses, NumPy can also be used as an efficient

multi-dimensional container of generic data. Arbitrary data types can be defined. This

allows NumPy to seamlessly and speedily integrate with a wide variety of databases.

NumPy is a successor for two earlier scientific Python libraries: Numeric and

Numarray.

1.2.1.2 Pandas :

Pandas is a popular Python package for data science, and with good reason: it

offers powerful, expressive and flexible data structures that make data manipulation

and analysis easy, among many other things. The DataFrame is one of these structures.

Those who are familiar with R know the data frame as a way to store data in

rectangular grids that can easily be overviewed. Each row of these grids corresponds

to measurements or values of an instance, while each column is a vector containing

data for a specific variable. This means that a data frame’s rows do not need to

contain, but can contain, the same type of values: they can be numeric, character,

logical, etc.

6

Now, DataFrames in Python are very similar: they come with the Pandas

library, and they are defined as two-dimensional labeled data structures with columns

of potentially different types.

In general, you could say that the Pandas DataFrame consists of three main

components: the data, the index, and the columns.

Firstly, the DataFrame can contain data that is:

a Pandas DataFrame

a Pandas Series: a one-dimensional labeled array capable of holding any data

type with axis labels or index. An example of a Series object is one column

from a DataFrame.

a NumPy ndarray, which can be a record or structured

a two-dimensional ndarray

dictionaries of one-dimensional ndarray’s, lists, dictionaries or Series.

The difference between np.ndarray and np.array() . The former is an actual

data type, while the latter is a function to make arrays from other data structures.

Structured arrays allow users to manipulate the data by named fields: in the

example below, a structured array of three tuples is created. The first element of each

tuple will be called foo and will be of type int, while the second element will be

named bar and will be a float.

Record arrays, on the other hand, expand the properties of structured arrays.

They allow users to access fields of structured arrays by attribute rather than by index.

You see below that the foo values are accessed in the r2 record array.

1.2.1.3 Opencv :

OpenCV (Open Source Computer Vision Library) is an open source computer

vision and machine learning software library. OpenCV was built to provide a common

infrastructure for computer vision applications and to accelerate the use of machine

perception in the commercial products. Being a BSD-licensed product, OpenCV

makes it easy for businesses to utilize and modify the code.

The library has more than 2500 optimized algorithms, which includes a

comprehensive set of both classic and state-of-the-art computer vision and machine

learning algorithms. These algorithms can be used to detect and recognize faces,

identify objects, classify human actions in videos, track camera movements, track

moving objects, extract 3D models of objects, produce 3D point clouds from stereo

cameras, stitch images together to produce a high resolution image of an entire scene,

find similar images from an image database, remove red eyes from images taken using

7

flash, follow eye movements, recognize scenery and establish markers to overlay it

with augmented reality, etc. OpenCV has more than 47 thousand people of user

community and estimated number of downloads exceeding 18 million. The library is

used extensively in companies, research groups and by governmental bodies.

Along with well-established companies like Google, Yahoo, Microsoft, Intel,

IBM, Sony, Honda, Toyota that employ the library, there are many startups such as

Applied Minds, VideoSurf, and Zeitera, that make extensive use of OpenCV.

OpenCV’s deployed uses span the range from stitching streetview images together,

detecting intrusions in surveillance video in Israel, monitoring mine equipment in

China, helping robots navigate and pick up objects at Willow Garage, detection of

swimming pool drowning accidents in Europe, running interactive art in Spain and

New York, checking runways for debris in Turkey, inspecting labels on products in

factories around the world on to rapid face detection in Japan.

It has C++, Python, Java and MATLAB interfaces and supports Windows,

Linux, Android and Mac OS. OpenCV leans mostly towards real-time vision

applications and takes advantage of MMX and SSE instructions when available. A

full-featured CUDA and OpenCL interfaces are being actively developed right now.

There are over 500 algorithms and about 10 times as many functions that compose or

support those algorithms. OpenCV is written natively in C++ and has a templated

interface that works seamlessly with STL containers.

1.2.1.4 Sk-Learn :

Scikit-learn provides a range of supervised and unsupervised learning

algorithms via a consistent interface in Python.

It is licensed under a permissive simplified BSD license and is distributed

under many Linux distributions, encouraging academic and commercial use. The

library is built upon the SciPy (Scientific Python) that must be installed before you can

use scikit-learn. This stack that includes:

NumPy: Base n-dimensional array package

SciPy: Fundamental library for scientific computing

Matplotlib: Comprehensive 2D/3D plotting

IPython: Enhanced interactive console

Sympy: Symbolic mathematics

Pandas: Data structures and analysis

Extensions or modules for SciPy care conventionally named SciKits. As such,

the module provides learning algorithms and is named scikit-learn. The vision for the

library is a level of robustness and support required for use in production systems. This

8

means a deep focus on concerns such as ease of use, code quality, collaboration,

documentation and performance.

Although the interface is Python, c-libraries are leverage for performance such

as numpy for arrays and matrix operations, LAPACK, LibSVM and the careful use of

python.The library is focused on modelling data. It is not focused on loading,

manipulating and summarizing data. For these features, refer to NumPy and Pandas.

Some popular groups of models provided by scikit-learn include:

Clustering: for grouping unlabelled data such as K-Means.

Cross Validation: for estimating the performance of supervised models on unseen

data.

Datasets: for test datasets and for generating datasets with specific properties for

investigating model behaviour.

Dimensionality Reduction: for reducing the number of attributes in data for

summarization, visualization and feature selection such as Principal component

analysis.

Ensemble methods: for combining the predictions of multiple supervised models.

Feature extraction: for defining attributes in image and text data.

Feature selection: for identifying meaningful attributes from which to create

supervised models.

Parameter Tuning: for getting the most out of supervised models.

Manifold Learning: For summarizing and depicting complex multi-dimensional

data.

Supervised Models: a vast array not limited to generalized linear models,

discriminate analysis, naive bayes, lazy methods, neural networks, support vector

machines and decision trees.

1.2.1.5 Matplotlib:

Matplotlib is an amazing visualization library in Python for 2D plots of arrays.

Matplotlib is a multi-platform data visualization library built on NumPy arrays and

designed to work with the broader SciPy stack. It was introduced by John Hunter in

the year 2002.

One of the greatest benefits of visualization is that it allows us visual access to

huge amounts of data in easily digestible visuals. Matplotlib consists of several plots

like line, bar, scatter, histogram etc

9

1.2.2 Neural networks in python :

Neural Networks as a mathematical function that maps a given input to a

desired output.

Neural Networks consist of the following components:

An input layer, x

An arbitrary amount of hidden layers

An output layer, ŷ

A set of weights and biases between each layer, W and b

A choice of activation function for each hidden layer, σ.

Fig 1.3 Architecture of a 2-layer Neural Network

The output ŷ of a simple 2-layer Neural Network is:

1.1

The weights W and the biases b are the only variables that affects the output ŷ.

Calculating the predicted output ŷ, known as feedforward

Updating the weights and biases, known as backpropagation

The sequential graph below illustrates the process.

10

Fig 1.4 Illustration of flow of network

1.2.2.1 Loss Function:

There are many available loss functions, and the nature of our problem should

dictate our choice of loss function.

The common loss functions are mentioned below:

Regression loss: Mean Square Error/Quadratic Loss/L2 Loss

1.2

That is, the sum-of-squares error is simply the sum of the difference between

each predicted value and the actual value. The difference is squared so that we measure

the absolute value of the difference.

Mean Absolute Error/L1 Loss:

1.3 1.3

It is measured as the average of sum of absolute differences between

predictions and actual observations. Like MSE, this as well measures the magnitude of

error without considering their direction. Unlike MSE, MAE needs more complicated

tools such as linear programming to compute the gradients. Plus MAE is more robust to

outliers since it does not make use of square.

11

Mean Bias error:

1.4

This is much less common in machine learning domain as compared to it is

counterpart. This is same as MSE with the only difference that we don’t take absolute

values. Clearly there’s a need for caution as positive and negative errors could cancel

each other out. Although less accurate in practice, it could determine if the model has

positive bias or negative bias.

Classification Losses: Hinge Loss/Multi class SVM Loss

1.5

the score of correct category should be greater than sum of scores of all

incorrect categories by some safety margin (usually one). And hence hinge loss is used

for maximum-margin classification, most notably for SVM. Although not

differentiable, it’s a convex function which makes it easy to work with usual convex

optimizers used in machine learning domain.

Cross Entropy Loss/Negative Log Likelihood:

1.6

This is the most common setting for classification problems. Cross-entropy loss

increases as the predicted probability diverges from the actual label. when actual label

is 1 (y(i) = 1), second half of function disappears whereas in case actual label is 0 (y(i)

= 0) first half is dropped off. In short, we are just multiplying the log of the actual

predicted probability for the ground truth class. An important aspect of this is that cross

entropy loss penalizes heavily the predictions that are confident but wrong.

12

Finally, This loss function helps us to find the best set of weights and biases that

minimizes the loss function.

1.3 Image Processing:

Image processing is a method to perform some operations on an image, in

order to get an enhanced image or to extract some useful information from it. It is a

type of signal processing in which input is an image and output may be image or

characteristics/features associated with that image. Nowadays, image processing is

among rapidly growing technologies. It forms core research area within engineering

and computer science disciplines too.

Image processing basically includes the following three steps:

Importing the image via image acquisition tools;

Analysing and manipulating the image;

Output in which result can be altered image or report that is based on image

analysis.

There are two types of methods used for image processing namely, analogue

and digital image processing. Analogue image processing can be used for the hard

copies like printouts and photographs. Image analysts use various fundamentals of

interpretation while using these visual techniques. Digital image processing techniques

help in manipulation of the digital images by using computers. The three general

phases that all types of data have to undergo while using digital technique are pre-

processing, enhancement, and display, information extraction.

An image is nothing more than a two dimensional signal. It is defined by the

mathematical function f(x,y) where x and y are the two co-ordinates horizontally and

vertically.

The value of f(x,y) at any point is gives the pixel value at that point of an

image.

Some of the major fields in which digital image processing is widely used are

mentioned below

Image sharpening and restoration

Medical field

Remote sensing

Transmission and encoding

Machine/Robot vision

Color processing

13

Pattern recognition

Video processing

Microscopic Imaging

1.3.1 Types of Images:

1. The binary image:

The binary image as it name states, contain only two pixel values 0 and 1.Here

0 refers to black colour and 1 refers to white colour. It is also known as

Monochrome.

The resulting image that is formed hence consist of only black and white

colour and thus can also be called as Black and White image.

Binary images have a format of PBM ( Portable bit map ).

2. 2, 3, 4,5, 6 bit colour format:

The images with a colour format of 2, 3, 4, 5 and 6 bit are not widely used

today. They were used in old times for old TV displays, or monitor displays.

But each of these colours have more than two grey levels, and hence has grey

colour unlike the binary image.

In a 2 bit 4, in a 3 bit 8, in a 4 bit 16, in a 5 bit 32, in a 6 bit 64 different

colours are present.

3. 8 bit colour format

8 bit colour format is one of the most famous image format. It has 256

different shades of colours in it. It is commonly known as Grayscale image.

The range of the colours in 8 bit vary from 0-255. Where 0 stands for black,

and 255 stands for white, and 127 stands for grey colour.

This format was used initially by early models of the operating systems UNIX

and the early colour Macintoshes.

The format of these images are PGM ( Portable Grey Map ). This format is not

supported by default from windows. In order to see grey scale image, you need to

have an image viewer or image processing toolbox such as Matlab.

4. 16 bit colour format:

It is a colour image format. It has 65,536 different colours in it. It is also

known as High colour format.

14

It has been used by Microsoft in their systems that support more then 8 bit

colour format.

The distribution of colour in a colour image is not as simple as it was in

grayscale image. A 16 bit format is actually divided into three further formats which

are Red , Green and Blue. The famous (RGB) format.


24 bit colour format also known as true colour format. Like 16 bit colour

format, in a 24 bit colour format, the 24 bits are again distributed in three different

formats of Red, Green and Blue.

It is the most common used format. Its format is PPM ( Portable pixMap)

which is supported by Linux operating system. The famous windows has its own

format for it which is BMP ( Bitmap ).

1.3.2 Brightness and Contrast:

Brightness is a visual perception in which a source appears to be reflecting

light. Brightness is a subjective property of an object which is being observed.

Brightness is an absolute term and different from lightness. A colour screens use three

colours i.e., RGB scheme (red, green and blue) the brightness of the screen depends

upon the sum of the amplitude of red green and blue pixels, and it is divided by 3.

The perception of brightness depends upon the optical illusions to appear

brighter or darker. When the brightness is decreased, the colour appears dull, and

when brightness increases, the colour is clearer.

Contrast is a colour which makes an object distinguishable. We can say that

contrast is determined by the colour and brightness of the object. Contrast is the

difference between the maximum and minimum pixel intensity of an image.

15

1.4 Convolution Neural Network:

Fig 1.5 A CNN Sequence

A Convolutional Neural Network (ConvNet/CNN) is a Deep Learning

algorithm which can take in an input image, assign importance (learnable weights

and biases) to various aspects/objects in the image and be able to differentiate one

from the other. The pre-processing required in a ConvNet is much lower as

compared to other classification algorithms. While in primitive methods filters are

hand-engineered, with enough training, ConvNets have the ability to learn these

filters/characteristics.

A ConvNet is able to successfully capture the Spatial and Temporal

dependencies in an image through the application of relevant filters. The architecture

performs a better fitting to the image dataset due to the reduction in the number of

parameters involved and reusability of weights. In other words, the network can be

trained to understand the sophistication of the image better.

The role of the ConvNet is to reduce the images into a form which is easier

to process, without losing features which are critical for getting a good prediction.

1.4.1. Convolution Layer:

A filter (or kernel) is an integral component of the layered architecture.

Generally, it refers to an operator applied to the entirety of the image such

that it transforms the information encoded in the pixels. In practice, however, a

16

kernel is a smaller-sized matrix in comparison to the input dimensions of the image,

that consists of real valued entries.

The real values of the kernel matrix change with each learning iteration over

the training set, indicating that the network is learning to identify which regions are

of significance for extracting features from the data.

Fig 1.6 Convolution operation with kernel

In Fig1.6, we are convoluting a 5x5x1 image with a 3x3x1 kernel (which

change each iteration to extract a significant features) to get a 3x3x1 convolved

feature. The filter moves to the right with a certain stride value till it parses the

complete width. In case of images with multiple channels (e.g. RGB) the kernel has

a same depth as that of input image. Matrix multiplication is performed between Kn

and in stack ([K1, I1]; [K2, I2]; [K3, I3]) and all results are summed with bias to

give us a squashed 1 depth channel convoluted feature output.

The objective of the Convolution Operation is to extract the high-level

features such as edges, from the input image, the first Convolutional Layer is

responsible for capturing the Low-Level features such as edges, color, gradient

orientation, etc.

17

1.4.2 Pooling Layer:

(a) Convoluted Output (b) Pooling Output

Fig 1.7 Performing Pooling operation

The Pooling layer is responsible for reducing the spatial size of the Convolved

Feature. This is to decrease the computational power required to process the data

through dimensionality reduction. Furthermore, it is useful for extracting dominant

features which are rotational and positional invariant, thus maintaining the process of

effectively training of the model.

There are two types of Pooling: Max Pooling and Average Pooling. Max

Pooling returns the maximum value from the portion of the image covered by the

Kernel. On the other hand, Average Pooling returns the average of all the values

from the portion of the image covered by the Kernel.

In Fig. 1.7 we perform maximum pooling operation by considering

convoluted feature output obtained from convolution layer.

1.4.3 Classification – Fully Connected Layer (FC LAYER):

Fig 1.8 Describing Classification Process

18

The output of the convolutional layer is converted into a suitable form for our

Multi-Level Perceptron, we shall flatten the image into a column vector. The flattened

output is fed to a feedforward neural network and back propagation applied to every

iteration of training.

1.5 Problem Statement:








shape and size. This aims to represent the process of Converting handwritten text to

computer typed document which is an optical cursive handwritten recognition

(OCR) by using segmentation algorithms like VPP (vertical projection profile), TDP

(top down profile) and the other histogram (vertical and horizontal) projection

algorithms to achieve the solution. For feature extraction and character recognition

pytorch which is an open source machine learning tool library in python used for

computer vision and natural language processing.

19

2. LITERATURE SURVEY

2.1 Existing Methods for character recognition system:

In literature of cursive English handwriting recognition, earlier study

highlighted that an off-line handwritten document analysis through segmentation,

skew recognition and writing pressure detection for cursive handwritten document.

There are many algorithms for line, word and character segmentation [19]. The

proposed segmentation method is based on modified horizontal and vertical

projection that can segment the text lines and words even if the presence of

overlapped and multi-skewed text lines. For character segmentation there are methods

like multi-layer perceptron [20]. The existing method was tested on more than 550

text images of IAM database and sample handwriting image which are written by the

different writer on the different background. Using the existing method 93.65% lines

and 91.56% word are correctly segmented from the IAM dataset. Existing work also

normalizes 92% lines and words perfectly with very small error rate. Existing skew

normalization method deals with the exact skew angle and extremely efficient with

compare to on hand techniques. [4]

Each and every pixel in an image represents some information. The pixels

which contributes to the text has more information energy. Based on this information

energy, the text- lines are segmented with 92% accuracy.[5] Artificial Neural

Network is used to recognize the characters. The study includes the performance of

convex hull feature set i.e. 125 features are computed by considering various bays

attributes of the convex hull of a pattern, for effective recognition of isolated

handwritten Bangla basic characters and digits. The recognition rate is 76.86% for

handwritten Bangla characters and 99.45% for Bangla numerals. [3]

The work includes the study of different segmentation techniques for

handwritten character recognition. Three levels of segmentation are presented i.e.

text-lines, word and character segmentation. The need and factors which affects the

segmentation process are discussed. [6]

The work contains the new approach which uses the sequence of

segmentation and recognition algorithms for the OCR of cursive handwriting.

Hidden Markov Model (HMM) is used for recognition with accuracy 92.3% with

lexicon size 50. Lexi con and HMM are combined for word-level segmentation [1].

In this work, various segmentation levels are discussed. Hough transform is used for

text-line segmentation. For the division of vertically connected components,

skeletonization is used [7]. The experiments are carried out.

In this work, the novel connectivity strength function is used for

segmentation process. Connectivity strength parameter is used to decide the

20

components of the text-line. It is a language adaptive approach with accuracy

97.30% [2].

In most of the existing systems recognition accuracy is heavily dependent on

the quality of the input document. In handwritten text adjacent characters tend to be

touched or overlapped. Therefore it is essential to segment a given string correctly into

its character components. In most of the existing segmentation algorithms, human

writing is evaluated empirically to deduce rules [21]. But there is no guarantee for the

optimum results of these heuristic rules in all styles of writing. Moreover handwriting

varies from person to person and even for the same person it varies depending on

mood, speed etc. This requires incorporating artificial neural networks, hidden Markov

models and statistical classifiers to extract segmentation rules based on numerical data.

[22][23][24].

After segmentation next crucial step is representation of character classes by

features. These features should have high discriminative abilities so that they are

different for different character classes (for example 26 uppercase and 26 lowercase

characters in case of English language and 10 digits). Also, these features should be

independent of the intra class variations.

The different representation methods can be categorized into three major classes [21]:

1. Global transformation and series expansion: includes Fourier transform, Gabor

transform, wavelet, moments and KarhuenLoeve Expansion.

2. Statistical representation: Zoning, crossing and distances, projections.

3. Geometrical and topological representation: Extracting and counting

topological structures, geometrical properties, coding, graphs and trees etc.

Features which depend on Fourier transform are suitable for recognizing

handwritten numerals where 96% accuracy has been achieved [25]. Gradient features

have been widely used in CR for machine and hand printed binary character images.

But these features are not invariant to deformations in the characters. In [26], a new

gradient feature is used where at each pixel, gradient is mapped onto 12 direction

codes with an angle span of 30 degree between the directions

In [27], a redesigned direction feature [28] with a view to describe the

character contour more effectively is developed. Also, an additional global feature was

introduced in this technique to improve the recognition accuracy for those characters

that were most frequently confused with patterns of similar appearances. But the

disadvantage of this technique is its failure to deal with changes in stroke width as

these features are extracted from non-thinned character images. Another crucial

module in a character recognition system is its pattern recognition module which

assigns an unknown sample to a predefined class. Numerous techniques for character

recognition can be classified into four general approaches of pattern recognition: [21]

21

1. Template Matching : Direct matching, deformable and elastic matching,

relaxation matching.

2. Statistical techniques : Parametric recognition, Non-pararmetric recognition,

HMM, fuzzy set reasoning.

3. Structural techniques: Grammatical methods, graphical methods.

4. Neural networks : Multilayer perceptron, radial basis function, support vector

machine

Character recognition technique has to cope with the high variability of the

handwritten cursive letters and their intrinsic ambiguity (letters like “e” and “l” or “u”

and “n” can have the same shape). Also it should be able to adapt to changes in the

input data. Template matching, statistical techniques and structural techniques can be

used when the input data is uniform over time whereas neural network (NN) classifier

can learn changes in the input data. Also NN has parallel structure because of which it

can perform computation at a higher rate than classical techniques. Therefore, we

choose neural networks for character recognition in our system.

The features that are used for training the neural network classifier also play a

very important role. The choice of a good feature vector can significantly enhance the

performance of a character classifier whereas a poor one may degrade its performance

considerably. It is found in the literature that generally separate classifiers are used for

the upper and the lower case English character classes to improve the recognition

accuracy. Moreover, good recognition accuracy could be achieved only for

handwritten numerals.

In this paper, we focus on developing a OCR system for recognition of

handwritten English words. We first segment the words into individual characters and

then represent these characters by features that have good discriminative abilities. We

also explore different neural network classifiers to find the best classifier for the OCR

system. We combine different OCR techniques in parallel so that recognition accuracy

of the system can be improved.

22

3. METHODOLOGY

3.1 System Architecture:

Fig. 3.1 Block Diagram of Proposed System

A. Image acquisition:

Images can be obtained by taking the photograph or by scanning the input

document.

B. Pre-processing:

Pre-processing techniques can be applied after image acquisition and

segmentation. These are used to remove the noise from an image and enhances the

images for further processing. Pre-processing techniques include noise removal,

skew correction, cropping and resizing, normalization, thinning, binarization,

skeletonization. Morphological operations such as dilation, erosion can also be

applied to the input scanned image.

The steps in pre-processing are shown in figure 3.2 below:

23

Fig. 3.2 Steps in pre-processing

C. Segmentation:

Segmentation is of three types i.e. line, word and character segmentation.

Line segmentation separates the lines from a paragraph. Word segmentation

separates the words from a line and character segmentation separates the characters

from a word.

D. Feature Extraction:

Feature extraction is an important step in the recognition process. In this

process, all the essential information about a character which is present in an image

is extracted.

E. Classification:

In classification, an unknown sample is assigned to the predefined class.

According to the extracted features, characters are classified and recognized.

F. Post-processing:

To achieve more accuracy, various post-processing techniques are used, for

example, matching a recognized word with a dictionary word.

3.2 Algorithm(s):

Horizontal projection method is used to segment a line from a paragraph. As

a first step horizontal histogram of an image is created. The average height of a

rising section is assumed as threshold. Then the height of each rising section is

checked whether it is greater or equal to the threshold, then each line is segmented

from a binary image.

24

3.2.1 Algorithm for Line Segmentation:

1) Read a handwritten document image as a multidimensional array.

2) Check the image is a binary image or not. If binary image then stores it

into a 2-d array IMG [] [] with size MN and go to Step 4, otherwise go to

Step 3.

3) Convert the image to binary image and store into a 2-d array IMG [] [].

4) Construct the horizontal projection histogram of the image IMG [] [] and

store into a 2-d array HPH [] [].

5) Measure the height, starting row position and ending row position of

each horizontally rising section of horizontal projection histogram image

and store into 3d array LH [] [] [] sequentially.

6) Count the number of rising points by counting the rows of the 3-d array

LH [] [] []. Then measure the threshold (Ti) value by calculating

average height of rising sections from the 3-d array LH [] [] [].

7) Select each rising section from 3-d array LH [] [] [] and check the height

of that rising section is less than the threshold or not. If yes then those

rising points is not considered as a line and go to Step 9, otherwise rising

section is treated as a line and go to Step 8.

8) Find the rising sections starting and ending rows number from the array

LH [] [] []. Let starting and ending row are r1 and r2 respectively. Extract

the line segment between r1 and r2 from the original binary image

denoted by IMG [] [].

9) Go to Step 7 for next rising sections till all rising section are not under

consideration, otherwise go to next Step.

10) End.

3.2.2 Algorithm for Word Segmentation:

1) Read a segmented binary line as 2-d binary image LN [] [].

2) Construct the vertical projection histogram of the line LN [] [] and store

into a 2-d array LVP [] [].

3) From the vertical projection histogram (LVP [] []), measures width of

each inter- word and intra-word gaps and store the width into 1-d array

GAPSW [].

4) Count total number gaps as TGP by calculating the size of GAPSW [].

Add width of all gaps by adding the elements of GAPSW [] and store into

TWD.

5) Calculate the threshold (Ti) as follows: Ti = TWD/ TGP Where, Ti is the

threshold value denoting average width of inter-word gaps, TWD denotes

total width of all gaps and TGP denotes the total number of gaps.

25

6) For each i (1<= i<= sizeof (GAPSW [])), if GAPSW[i]> = Ti then this gap

is treated as inter-word gaps, otherwise it is treated as an intra-word gap.

Depending on inter- word gaps width, words are segmented from the line.

7) End

3.2.3 Algorithm for Character Segmentation:

3.2.3.1 Character Segmentation Using VPP:

At first, proposed algorithm uses VPP of the binary image obtained after word

segmentation for the character segmentation. VPP is to represent the total number of

white pixels in vertical direction of the binary image to graph. Because boundaries of

the characters are certainly regions composed of background in the vertical direction

as value of VPP is zero, the text region is separated at these regions. When length of

width in separated character image is longer than 0.8 times length of height (feature of

the printed character in the slab image), it is judged that the separated character images

are touching character.

Width > 0.8 * Height: Touching Character 3.1

3.2.3.2 Touching Character Segmentation:

Boundaries of touching characters are located at the valley points of VPP

(Vertical Projection Profile) or TDP (top down profile). TDP is to represent position of

first white pixel at respective column to graph. Because all valley points are not

boundaries of touching characters, all candidate boundary points are extracted. For

VPP and TDP analysis, binary image, feature binary image, and gray image are used.

White pixels in the feature binary image are composed of peak, hillside, and ridge

points of topographic feature in gray image [10].

All candidate boundary points extracted are combined to Calculate the score

graph. Real boundary regions of characters have a large value at the score graph.

Cost Graph = (VPP + TDP)/2 3.2

Combined boundary points are selected from the score graph recursively.

When finding the combined boundary points more than real character boundary points,

we should choose correct boundary points. After making up all cases which are able to

separate the touching character from the combined boundary points, the proposed

algorithm selects the correct case that has minimum distance between the separated

character images and the representative images using recognition-based method. The

representative image displays the recognition result of the separated character image.

26

3.3 Proposed Work:

The process for the optical cursive handwritten recognition and the required

algorithms for various levels of segmentation and character recognition using pytorch

are as follow

It comprises of six steps.

1. Image scanning

2. Pre-processing

3. Segmentation

4. Feature extraction

5. Classification

6. Post-processing

3.3.1 Image Scanning:

The input image can be obtained either by scanning the already existing

handwritten image file (png, jpg) or by capturing the image instantly to provide input

data to the model.

Fig 3.3 Scanned input image

3.3.2 Pre-Processing:

The main goal here is to make the input image free from noise. As a first step

to move on convert the RGB image to a gray scale image and gently sharpen the given

input image to avoid loss of edges. Calculate the mean gray intensity value to reduce

the brightness of the obtained grey scale image on a threshold value of less than 0.65

and the contrast is increased to distinguish the character boundaries[8]. The text which

27

is present in the obtained result may turn dim and blur because of improper scanning

of the text image.

Fig 3.4 Pre-processing

To overcome this, binarization plays a key role by converting the gray scale

image where the values ranges between 0 and 255 to a binary image by making up a

threshold value simply to decide like an on or off (0 or 1). Since burned characters

look like dim in the text region, the characters disappear in the binary image. When

converting to binary image, we apply Otsu’s binarization [9] method not to the whole

region but to the respective local regions.

Fig 3.5 Blur image (for noise removal)

28

Fig 3.6 Binary image in color contrast

3.3.3 Segmentation:

Basically, there are three levels of segmentation. Line segmentation, Word

segmentation and Character segmentation.

3.3.3.1 Line Segmentation:

Horizontal histogram projections are used in segmenting the entire script

present in the input image into individual lines as shown in the figure below

The primary task here is to extract each individual line from the given input

image. This can be obtained by applying horizontal histogram projection to the pre-

processed image and then generate the threshold line value by adjusting the average

value for those horizontal projections. Graphical representation of horizontal

histogram projection is shown in the figure 3.7 below.[6]

Fig 3.7 Horizontal projection graph

29

Finally, Lines can be segmented from the given input script by obtaining the

break points by making use of the average threshold line value obtained from the

above graph and having comparisons with each and every horizontal projection.

Fig 3.8 Segmented lines from the image

3.3.3.2 Word Segmentation

Each word is treated as an object (Contour – in terms of image processing).

Contour can be explained simply as a curve joining all the continuous points (along

the boundary), having same color or intensity. Here, Contours are useful for object

detection where each object is a word.

Fig 3.9 segmented line

The main reason for making use of contours here is each word can be treated as

a curve joining all the continuous points along the boundary since it is cursively

written. But sometimes there may be gaps in between the letters of a single word

which causes the word to be split into two or more words as they are not continuous

points joining as a curve.

This type of words can be identified by making use of the minimum threshold

value which is obtained by taking the average separation distance between the words

and can be rejoined back as a single word (Contour) where the separation distance

between the words less than the minimum threshold value.[7]

Minimum threshold value = (sum of separation distances between words in the

line)/(no of words in the line)

30

Fig 3.10 First level word segmentation

Fig 3.11 Second level word segmentation

3.3.3.3 Character Segmentation

Segmenting each word into individual characters can be obtained by making use

of two native algorithms:

1. VPP (Vertical Projection Profile)

2. TDP (Top Down Profile)

Fig 3.12 segmented word

VPP is a plot which maintains the total number of white pixels in vertical

direction of the binary image. Characters can be segmented at the point where the VPP

value is zero (0) for some certain no of times (Threshold). But in the case of touching

characters, the VPP value can never be zero even though the characters should be

segmented (connected components).[1]

31

Fig 3.13 VPP intensity graph for word ‘MOVE’

Connected components can be identified by making use of characters width

and height. When the width of the character is greater than 0.8 times the height of the

character then it is identified as a connected component otherwise it is a single

character as per basic font size measurement.[1]

Width > 0.8 * height (Connected component)

Connected component single character single character

Fig 3.14 First level VPP character segmentation

TDP is a plot which maintains the first white pixel in vertical direction of the

binary image. Touching characters can be segmented into individual characters by

taking the combined value of both VPP AND TDP. Then obtain the minimum value in

the graph where we can segment them into individual characters and continue this

process recursively until no more touching character found in a single word. [2]

Fig 3.15 touching characters (connected component)

32

Fig 3.16 VPP intensity graph for word ‘MO’

Fig 3.17 TDP intensity graph for word ‘MO’

Fig 3.18 Combined intensity graph for word ‘MO’

33

Fig 3.19 Final level character segmentation

3.3.4 Feature Extraction:

The main goal here is to extract the features from the segmented characters

which are required to train the data.

This process comprises of zero padding, convolution layer, activation function,

max pooling and flatten. As a first step add zeros to the image to overcome the loss of

edges termed as zero padding. Then apply multiple layers of convolution and max

pooling filters(kernels) to obtain an image with reduced in size where each move is a

stride. Max pooling comes with selecting the max value from the filter and replace it

with the remaining and the same way the average pooling works by taking the average

pixel value. Now the activation function comes into picture where the ReLU activation

function identifies all the negative pixel values and replace it with zero without any

change in the positive pixel values. Finally flatten the image by reshaping the image

obtained.

Fig 3.20 Feature extraction of characters

34

Fig 3.21 Zero padding

Fig 3.22 Convolution layer

Fig 3.23 Max pooling and Average pooling

35

Fig 3.24 Flatten the image

Fig 3.25 ReLU activation function

3.3.5 Classification:

Finally, Classification is done using a fully connected layer where we get the

probabilities of each and every class for the given input character. Classify the given

input character to their respective class by selecting the class with the maximum

probability. In total there are classified into 62 classes (0 to 9, a to z, A TO Z)

Fig 3.26 Fully connected layer with classes (x and o) along with probabilities

36

3.3.6 Post-processing:

As a final step obtain the accuracy for all the levels of segmentation and the

character recognition by minimizing the error rate. Then combine all the recognized

characters into word, words into lines and lines into the original script present in the

image.

Final Result = M O V E

Fig 3.27 Final result

37

4. PYTORCH

4.1 Pytorch library tool:

Pytorch is an open source machine learning library based on a torch library

used for applications such as computer vision and natural language processing. It’s a

Python-based scientific computing package targeted at two sets of audiences:

A replacement for NumPy to use the power of GPUs

a deep learning research platform that provides maximum flexibility

and speed

It is a Python-based scientific computing package that uses the power of

graphics processing units. It is also one of the preferred deep learning research

platforms built to provide maximum flexibility and speed. It is known for providing

two of the most high-level features; namely, tensor computations with strong GPU

acceleration support and building deep neural networks on a tape-based autograd

systems.

There are many existing Python libraries which have the potential to change

how deep learning and artificial intelligence are performed, and this is one such

library. One of the key reasons behind PyTorch’s success is it is completely and one

can build neural network models effortlessly. It is still a young player when compared

to its other competitors, however, it is gaining momentum fast.

Since its release in January 2016, many researchers have continued to

increasingly adopt PyTorch. It has quickly become a go-to library because of its ease

in building extremely complex neural networks. It is giving a tough competition to

tensor flow especially when used for research work. However, there is still some time

before it is adopted by the masses due to its still “new” and “under construction” tags.

PyTorch creators envisioned this library to be highly imperative which can

allow them to run all the numerical computations quickly. This is an ideal

methodology which fits perfectly with the Python programming style. It has allowed

deep learning scientists, machine learning developers, and neural network debuggers

to run and test part of the code in real time. they don’t have to wait for the entire code

to be executed to check whether it works or not.

You can always use your favorite Python packages such as NumPy, SciPy, and

Cython to extend PyTorch functionalities and services when required. PyTorch is a

dynamic library (very flexible and you can use as per your requirements and changes)

which is currently adopted by many of the researchers, students, and artificial

38

intelligence developers. In the recent Kaggle competition, PyTorch library was used

by nearly all of the top 10 finishers.

Some of the key highlights of PyTorch includes:

Simple Interface: It offers easy to use API thus, it is very simple to operate

and run like Python.

Pythonic in nature: This library, being Pythonic, smoothly integrates with

the Python data science stack. it can leverage all the services and

functionalities offered by the Python environment.

Computational graphs: In addition to this, PyTorch provides an excellent

platform which offers dynamic computational graphs thus you can change

them during runtime. This is highly useful when you have no idea how

much memory will be required for creating a neural network model.

It is an optimized tensor library for deep learning using CPU’S and GPU’S.

The feature extraction and the classification stages are implemented under pytorch

library tool. As a first step install all the required modules/packages to train the data

using pip namely efficient net-pytorch and torch summary. Then include all the

required modules by importing them into the python script (torch, torch.vision,

torch.nn, torch.util, torch.autograd, torch.optim, torchvision.transforms, Efficient Net).

Torch: The torch package contains data structures for multi-dimensional

tensors and mathematical operations over these are defined. Additionally, it provides

many utilities for efficient serializing of Tensors and arbitrary types, and other useful

utilities.

Torchvision: This package consists of popular datasets, model architectures,

and common image transformations for computer vision.

Torch.nn: A kind of Tensor that is to be considered a module parameter.

Parameters are Tensor subclasses, that have a very special property when used

with Module s - when they’re assigned as Module attributes they are automatically

added to the list of its parameters, and will appear e.g. in parameter () iterator.

Assigning a Tensor doesn’t have such effect. This is because one might want to cache

some temporary state, like last hidden state of the RNN, in the model. If there was no

such class as Parameter, these temporaries would get registered too.

Torch.autograd: This provides classes and functions implementing automatic

differentiation of arbitrary scalar valued functions. It requires minimal changes to the

existing code - you only need to declare Tensors for which gradients should be

computed with the ‘requires grad = True’ keyword.

39

Torch.optim: This is a package implementing various optimization

algorithms. Most commonly used methods are already supported, and the interface is

general enough, so that more sophisticated ones can be also easily integrated in the

future.

4.2 Pytorch in research:

Anyone who is working in the field of deep learning and artificial intelligence

has likely worked with Tensorflow before, Google’s most popular open source

library. However, the latest deep learning framework – PyTorch solves major

problems in terms of research work. Arguably PyTorch is Tensorflow’s biggest

competitor to date, and it is currently a much-favored deep learning and artificial

intelligence library in the research community.

Dynamic Computational graphs:

It avoids static graphs that are used in frameworks such as TensorFlow,

thus allowing the developers and researchers to change how the network behaves on

the fly. The early adopters are preferring PyTorch because it is more intuitive to learn

when compared to TensorFlow.

Different back-end support:

PyTorch uses different backends for CPU, GPU and for various

functional features rather than using a single back-end. It uses tensor backend TH for

CPU and THC for GPU. While neural network backends such as THNN and

THCUNN for CPU and GPU respectively. Using separate backends makes it very easy

to deploy PyTorch on constrained systems.

Imperative style:

PyTorch library is specially designed to be intuitive and easy to use.

When you execute a line of code, it gets executed thus allowing you to perform real-

time tracking of how your neural network models are built. Because of its excellent

imperative architecture and fast and lean approach it has increased overall PyTorch

adoption in the community.

Highly extensible:

PyTorch is deeply integrated with the C++ code, and it shares

some C++ backend with the deep learning framework, Torch. Thus, allowing users to

program in C/C++ by using an extension API based on cFFI for Python and compiled

for CPU for GPU operation. This feature has extended the PyTorch usage for new and

experimental use cases thus making them a preferable choice for research use.

40

Python-Approach:

PyTorch is a native Python package by design. Its functionalities are

built as Python classes. Hence, all its code can seamlessly integrate with Python

packages and modules. Similar to NumPy, this Python-based library enables GPU-

accelerated tensor computations plus provides rich options of APIs for neural network

applications. PyTorch provides a complete end-to-end research framework which

comes with the most common building blocks for carrying out daily deep learning

research. It allows chaining of high-level neural network modules because it supports

Keras-like API in its torch.nn package.

4.3 Training an image classifier using Pytorch:

Generally, when you have to deal with image, text, audio or video data, you

can use standard python packages that load data into a numpy array. Then you can

convert this array into a torch.*tensor.

For images, packages such as Pillow, OpenCV are useful

For audio, packages such as scipy and librosa

For text, either raw Python or Cython based loading, or NLTK and

SpaCy are useful

Specifically for vision, we have created a package called torchvision, that has

data loaders for common datasets such as Imagenet, CIFAR10, MNIST, etc. and data

transformers for images, viz., torchvision.datasets and torch.utils.data.DataLoader.

This provides a huge convenience and avoids writing boilerplate code.

It includes the steps as follow:

Load and normalizing the training data and test datasets using

torchvision

Define a Convolutional Neural Network

Define a loss function

Train the network on the training data

Test the network on the test data

Using torchvision, it’s extremely easy to load Data. The output of torchvision

datasets are PILImage images of range [0, 1]. We transform them to Tensors of

normalized range [-1, 1].

Now define the Convolution neural network and essential loss functions and

optimizers.

41

Training: We simply have to loop over our data iterator, and feed the inputs to

the network and optimize. We have trained the network for around 6 passes over the

training dataset. But we need to check if the network has learnt anything at all.

We will check this by predicting the class label that the neural network outputs,

and checking it against the ground-truth. If the prediction is correct, we add the sample

to the list of correct predictions.

Training on GPU: Just like how you transfer a Tensor onto the GPU, you

transfer the neural net onto the GPU. Let’s first define our device as the first visible

cuda device if we have CUDA available.

42

INPUT_FILENAME=[]

OUTPUT_CLASS=[]

for DIRNAME, _, FILENAMES in os.WALK('/KAGGLE/input/nist-

CHARACTERS-DATASET/CHARACTERS/TEST_IMAGES'):

for FILENAME in FILENAMES:

INPUT_FILENAME.APPEND(FILENAME.split('.')[0])

OUTPUT_CLASS.APPEND(DIRNAME.split('/')[-1])

TESTDATA=pd.DATAFRAME({'FILENAME':INPUT_FILENAME,

'CLASS':OUTPUT_CLASS})

TESTDATA.to_csv('test.csv')

SAMPLEDATA=pd.DATAFRAME({'FILENAME':INPUT_FILENAME,'CLASS'

:[0 for _ in RANGE(len(OUTPUT_CLASS))]})

SAMPLEDATA.to_csv('SAMPLE_SUBMISSION.CSV')

def PREPAREIMAGE(IMAGE,req_height):

if IMAGE.ndim == 3:

IMAGE=cv2.cvtColor(IMAGE,cv2.COLOR_BGR2GRAY)

height=IMAGE.SHAPE[0]

FACTOR=req_height/height

print("Resized by FACTOR : ",FACTOR)

return cv2.resize(IMAGE,dsize=None,fx=FACTOR,fy=FACTOR)

def CREATEKERNELFILTER(kernelSize,SIGMA,THETA):

HALFSIZE=kernelSize//2

kernel=np.zeros([kernelSize,kernelSize])

5. SAMPLE CODE ELABORATION

5.1 Pre-processing the data:

5.1.1 Data Organization:

5.1.2 Image Sizing and Shaping:

5.1.3 Image Blurring Kernel Filter:

43

5.1.4 Applying Kernal Filters and Contours on image:

def Pre_Processing_Sentence(sentence):

print("RESIZED SENTENCE: ")

UPDATED_SENTENCE=PREPAREIMAGE(sentence,50)

SHOW_IMAGE(UPDATED_SENTENCE,CMAP='GRAY')

print("BLURRED SENTENCE: ")

blurred_sentence=cv2.GAUSSIANBLUR(sentence,(5,5),0)

SHOW_IMAGE(blurred_sentence,CMAP='GRAY')

print("FILTERED SENTENCE: ")

kernelSize=25

SIGMA=11

THETA=7

MINAREA=150

kernel=CREATEKERNELFILTER(kernelSize,SIGMA,THETA)

Filtered_sentence=cv2.filter2D(sentence,-

1,kernel,borderType=cv2.BORDER_REPLICATE)

SHOW_IMAGE(Filtered_sentence,CMAP='GRAY')

print("THRES SENTENCE: ")

THRES_VALUE,Thres_sentence=cv2.threshold

(Filtered_sentence,0,255,cv2.THRESH_BINARY+cv2.THRESH_OTSU)

Thres_sentence=255-Thres_sentence

SHOW_IMAGE(Thres_sentence,CMAP='GRAY')

cv2. version

components,HIERARCHY=cv2.findContours(

Thres_sentence,cv2.RETR_EXTERNAL,cv2.CHAIN_APPROX_SIMPLE)

SIGMAX=SIGMA

SIGMAY=SIGMA*THETA

for i in RANGE(kernelSize):

for j in RANGE(kernelSize):

x=i-HALFSIZE

y=j-HALFSIZE

expTerm=np.exp(-((x**2)/(2*(SIGMAX**2)))-

((y**2)/(2*(SIGMAY**2))))

kernel[i,j]=(1/(2*MATH.pi*SIGMAX*SIGMAY))*expTerm

return kernel

44

def LINE_SEGMENTATION(IMAGE):

Sentences=[]

Line_intensity=[0 for line in RANGE(len(IMAGE))]

for line in RANGE(len(IMAGE)):

count=0

for pixel in RANGE(len(IMAGE[0])):

if(IMAGE[line][pixel]<128):

count+=1

Line_intensity[line]=count

print("LINE INTENSITY: ")

print(Line_intensity)

plt.plot(Line_intensity)

plt.xticks([])

plt.show()

return EVALUATING_THRESHOLD(IMAGE,Line_intensity)

5.2 Line Segmentation:

5.2.1 : Calculating Line Intensity(Horizontal Histogram)

print('NO OF COMPONENTS : ',len(components))

print("CONTOURED SENTENCE: ")

SHOW_IMAGE(cv2.DRAWCONTOURS(sentence,components,-

1,(255,0,0),5))

Words=[]

for contour in components:

if(cv2.CONTOURAREA(contour) >= MINAREA):

(x,y,w,h)=cv2.boundingRect(contour)

Words.APPEND([(x,y,w,h),sentence[y:y+h,x:x+w]])

print('NO OF COMPONENTS AFTER FILTERING: ',len(Words))

Words.sort(key=SORT_IMAGES)

return Words

45

def Segmenting_Lines(IMAGE,Line_Segments,Line_Threshold):

Sentences=[]

print(Line_Segments)

for index in RANGE(1,len(Line_Segments),2):

if(Line_Segments[index]>Line_Threshold):

y=Line_Segments[index-1][0]-5

h=Line_Segments[index-1][1]-

5.2.2 : Evaluating Threshold for Line Segmentation:

def EVALUATING_THRESHOLD(IMAGE,Line_intensity):

Line_Segments=[]

LINE_SEPERATION=0

START_FLAG=0

Zero_count=0

START_INDEX=0

End_index=0

SET_FLAG=0

for line in RANGE(len(Line_intensity)):

if(Line_intensity[line]==0):

Zero_count+=1

if(SET_FLAG==0):

End_index=line

SET_FLAG=1

else:

if(SET_FLAG==1 AND START_FLAG==1):

SET_FLAG=0

Line_Segments.APPEND([START_INDEX,End_index])

START_INDEX=line

Line_Segments.APPEND(Zero_count)

LINE_SEPERATION=LINE_SEPERATION+(Zero_count**2)

Zero_count=0

if(START_FLAG==0):

START_FLAG=1

SET_FLAG=0

START_INDEX=line

Zero_count=0


Line_Threshold=MATH.sqrt(LINE_SEPERATION)/6

print("LINE THRESHOLD : ",Line_Threshold)

print("LINE SEGMENTS : ",Line_Segments)

return Segmenting_Lines(IMAGE,Line_Segments,Line_Threshold)

5.2.3 : Segmenting Paragraph into Sentences:

46

def combine_Words(Word1,Word2,sentence):

WORD=[[]]

WORD[0].APPEND(Word1[0][0])

# X-AXIS POSITION

WORD[0].APPEND(min(Word1[0][1],Word2[0][1]))

# Y-AXIS POSITION

WORD[0].APPEND((Word2[0][0]-Word1[0][0])+Word2[0][2])

# WIDTH

WORD[0].APPEND(MAX(Word1[0][1]+

Word1[0][3],Word2[0][1]+Word2[0][3])-WORD[0][1])

# HEIGHT

WORD.APPEND(sentence[WORD[0][1]:WORD[0][1]+

WORD[0][3],WORD[0][0]:WORD[0][0]+WORD[0][2]])

return WORD

def WORD_SEGMENTATION(Words,sentence):

FINAL_WORDS=[]

word=[]

FINAL_FLAG=0

WORD_SEPERATION_SUM=0

SEPERATION=[]

for word_no in RANGE(len(Words)-1):

5.3 Word Segmentation:

5.3.1 : Combining two Missegmented Words:

5.3.2 : Segmenting Sentence into Words:

Line_Segments[index-1][0]+10

Sentences.APPEND(IMAGE[y:y+h])

y=Line_Segments[-1][0]-5

h=Line_Segments[-1][1]-Line_Segments[-1][0]+10


return Sentences

47

def EVALUATING_VPP_INTENSITY(PRE_PROCESSED_BINARY_IMAGE):

VPP_Intensity=[0 for col in

RANGE(len(PRE_PROCESSED_BINARY_IMAGE[0]))]

for row in RANGE(len(PRE_PROCESSED_BINARY_IMAGE)):

for col in RANGE(len(PRE_PROCESSED_BINARY_IMAGE[row])):

if(PRE_PROCESSED_BINARY_IMAGE[row][col]==0):

VPP_Intensity[col]+=1

print(VPP_Intensity)

plt.plot(VPP_Intensity)

plt.xticks([])

DISTANCE=Words[word_no+1][0][0

]-(Words[word_no][0][0]+Words[word_no][0][2])

SEPERATION.APPEND(DISTANCE)

WORD_SEPERATION_SUM=WORD_SEPERATION_SUM+DISTANCE

WORD_AVERAGE_THRESHOLD=MATH.sqrt(

WORD_SEPERATION_SUM/(len(Words)-1))

print('WORDS SEPERATION : ',SEPERATION)

print('AVERAGE THRESHOLD FOR WORD SEPERATION :

',WORD_AVERAGE_THRESHOLD)

for index in RANGE(len(SEPERATION)):

if(len(word)==0):

word=Words[index]

if(SEPERATION[index]>WORD_AVERAGE_THRESHOLD):

FINAL_WORDS.APPEND(word)

word=[]

FINAL_FLAG=0

else:

word=combine_Words(word,Words[index+1],sentence)

FINAL_FLAG=1

if(FINAL_FLAG==0):

FINAL_WORDS.APPEND(Words[-1])

else:


return FINAL_WORDS

5.4 Character Segmentation:

5.4.1 : Evaluating VPP Intensity:

48

5.4.2 : First Level Character Segmentation Using VPP:

def FIRST_LEVEL_CHARACTER_SEGMENTATION_UNDER_VPP

(PRE_PROCESSED_BINARY_IMAGE):

VPP_Intensity=EVALUATING_VPP_INTENSITY

(PRE_PROCESSED_BINARY_IMAGE)

CHARACTER_SEGMENTS=[]

CHARACTER_SEPERATION=

0 START_FLAG=0

Zero_count=0

s=START_INDEX=0

End_index=0

SET_FLAG=0

for col in RANGE(len(VPP_Intensity)):

if(VPP_Intensity[col]==0):

Zero_count+=1

if(SET_FLAG==0):

End_index=col

SET_FLAG=1

else:


SET_FLAG=0

CHARACTER_SEGMENTS.APPEND

([START_INDEX,End_index])

START_INDEX=col

CHARACTER_SEGMENTS.APPEND(Zero_count)

CHARACTER_SEPERATION=CHARACTER_SEPERATIO

N

+(Zero_count**2)

Zero_count=0

if(START_FLAG==0):

START_FLAG=1

SET_FLAG=0

START_INDEX=col

Zero_count=0

CHARACTER_SEGMENTS.APPEND([START_INDEX,End_index])

CHARACTER_THRESHOLD=MATH.sqrt(CHARACTER_SEPERATION)/3

print("CHARACTER THRESHOLD : ",CHARACTER_THRESHOLD)

print("CHARACTER SEGMENTS : ",CHARACTER_SEGMENTS)

return CHARACTER_SEGMENTATION(CHARACTER_SEGMENTS,

plt.show()

return VPP_Intensity

49

def GENERATE_VPP_AND_TDP_AVERAGE(SEGMENT):

IMAGE=SEGMENT[1]

SHOW_IMAGE(IMAGE,CMAP='GRAY')

VPP_Intensity=[0 for col in RANGE(len(IMAGE[0]))]

for row in RANGE(len(IMAGE)):

for col in RANGE(len(IMAGE[row])):

if(IMAGE[row][col]==0):


CHARACTER_THRESHOLD,PRE_PROCESSED_BINARY_IMAGE)

5.4.3 : Segmenting Word into Characters under VPP:

def CHARACTER_SEGMENTATION(CHARACTER_SEGMENTS,

CHARACTER_THRESHOLD,PRE_PROCESSED_BINARY_IMAGE):

SEGMENTED_CHARACTERS=[]

for index in RANGE(1,len(CHARACTER_SEGMENTS),2):

if(CHARACTER_SEGMENTS[index]>CHARACTER_THRESHOLD):

x=CHARACTER_SEGMENTS[index-1][0]

y=0

Touching=0

w=CHARACTER_SEGMENTS[index-1][1]-

CHARACTER_SEGMENTS[index-1][0]

h=len(PRE_PROCESSED_BINARY_IMAGE)

if(w>0.675*h):

Touching=1

SEGMENTED_CHARACTERS.APPEND([[x,y,w,h],

PRE_PROCESSED_BINARY_IMAGE[y:y+h,x:x+w],Touching])

x=CHARACTER_SEGMENTS[-1][0]

y=0

Touching=0

w=CHARACTER_SEGMENTS[-1][1]-CHARACTER_SEGMENTS[-1][0]


if(w>0.675*h):

Touching=1



return SEGMENTED_CHARACTERS

5.4.4 : Evaluating VPP and TDP Average Intensity:

50

def TOUCHING_CHARACTER_SEGMENTATION(SEGMENT,

AVERAGE_INTENSITY,FINAL_SEGMENTED_CHARACTERS):

AVERAGE_THRESHOLD=20

TOUCHING_CHARACTERS_BREAKPOINTS=[]

col=0

while(col<len(AVERAGE_INTENSITY)):

if(col==0):

while(col<len(AVERAGE_INTENSITY) AND

AVERAGE_INTENSITY[col]<AVERAGE_THRESHOLD):

col+=1

if(col<len(AVERAGE_INTENSITY) AND


MIN_VALUE=AVERAGE_INTENSITY[col]

min_point=col



5.4.5 : Connected Components Segmentation:



plt.xticks([])

plt.show()

TDP_Intensity=[0 for col in RANGE(len(IMAGE[0]))]

for col in RANGE(len(IMAGE[0])):



TDP_Intensity[col]=len(IMAGE)-row

BREAK

print(TDP_Intensity)

plt.plot(TDP_Intensity)

plt.xticks([])

plt.show()

AVERAGE_INTENSITY=np.ADD(TDP_Intensity,VPP_Intensity)

print(AVERAGE_INTENSITY)

plt.plot(AVERAGE_INTENSITY)

plt.xticks([])

plt.show()

return AVERAGE_INTENSITY

51

if(AVERAGE_INTENSITY[col]<MIN_VALUE):


min_point=col

col+=1

if(col<len(AVERAGE_INTENSITY)):

TOUCHING_CHARACTERS_BREAKPOINTS.APPEND(min_point)

col+=1

print("CHARACTERS BREAK POINTS

",TOUCHING_CHARACTERS_BREAKPOINTS)

if(len(TOUCHING_CHARACTERS_BREAKPOINTS)==0):

REQUIRED_FURTHER_SEGMENTATION(SEGMENT,

AVERAGE_INTENSITY,FINAL_SEGMENTED_CHARACTERS)

else:

x_point=0

y_point=SEGMENT[0][1]

height=SEGMENT[0][3]

for BREAK_POINT in TOUCHING_CHARACTERS_BREAKPOINTS:

width=BREAK_POINT-x_point

if(width>0.8*height):

REQUIRED_FURTHER_SEGMENTATION([[x_point,

y_point,width,height],SEGMENT[1][y_point:

y_point+height,x_point:x_point+width],1],

AVERAGE_INTENSITY[x_point:x_point+width

],FINAL_SEGMENTED_CHARACTERS)

else:

FINAL_SEGMENTED_CHARACTERS.APPEND([[x_point,


y_point+height,x_point:x_point+width],0])

x_point=BREAK_POINT

width=SEGMENT[0][2]-x_point







else:




52

5.4.6 : Further Required Segmentation on Connected Components:

def REQUIRED_FURTHER_SEGMENTATION(IMAGE_SEGMENT,


Index_Limit=int(0.4*IMAGE_SEGMENT[0][3])

MIN_VALUE=AVERAGE_INTENSITY[Index_Limit]

BREAK_POINT=Index_Limit

for col in RANGE(Index_Limit,len(AVERAGE_INTENSITY)-

Index_Limit):



BREAK_POINT=col

x_point=0

y_point=IMAGE_SEGMENT[0][1]

height=IMAGE_SEGMENT[0][3]








else: FINAL_SEGMENTED_CHARACTERS.APPEND([[x_point,



x_point=BREAK_POINT

width=IMAGE_SEGMENT[0][2]-x_point







else:




53

CLASS DATASET(DATA.DATASET):

def init (self,CSV_PATH,IMAGES_PATH,TRANSFORM=None):

self.TRAIN_SET=pd.READ_CSV(CSV_PATH)

self.TRAIN_PATH=IMAGES_PATH

self.TRANSFORM=TRANSFORM

def len (self):

return len(self.TRAIN_SET)

def getitem (self,idx):

FILE_NAME=self.TRAIN_SET.iloc[idx][1]+'.png'

LABEL=self.TRAIN_SET.iloc[idx][2]

img=IMAGE.open(os.PATH.join(self.TRAIN_PATH,FILE_NAME))

if self.TRANSFORM is not None:

img=self.TRANSFORM(img)

return img,LABEL

PARAMS = {'BATCH_SIZE': 16,

'shuffle': True

}

epochs = 6

LEARNING_RATE=1e-

3

TRANSFORM_TRAIN = TRANSFORMS.Compose([TRANSFORMS.

Resize((224,224)),TRANSFORMS.RANDOMAPPLY([

torchvision.TRANSFORMS.RANDOMROTATION(10),

TRANSFORMS.RANDOMHORIZONTALFLIP()],0.7),

TRANSFORMS.ToTensor()])

TRAINING_SET=DATASET(os.PATH.join(BASE_PATH,

'TRAIN.CSV'),os.PATH.join(BASE_PATH,

'TRAIN_IMAGES/'),TRANSFORM=TRANSFORM_TRAIN)

TRAINING_GENERATOR=DATA.DATALOADER(TRAINING_SET,**PARAMS)

USE_CUDA = torch.CUDA.IS_AVAILABLE()

device = torch.device("CUDA:0" if USE_CUDA else "cpu")

5.5 Training the model:

5.5.1 : Data Loader:

5.5.2 : Defining Transforms and Parameters:

54

model = EfficientNet.FROM_PRETRAINED('efficientnet-b0',

NUM_CLASSES=62)

model.to(device)

print(SUMMARY(model, input_size=(3, 512, 512)))

PATH_SAVE='./Weights/'

if(not os.PATH.exists(PATH_SAVE)):

os.mkdir(PATH_SAVE)

criterion = nn.CrossEntropyLoss()

LR_DECAY=0.99

optimizer = optim.ADAM(model.PARAMETERS(), lr=LEARNING_RATE)

eye = torch.eye(62).to(device)

CLASSES=[i for i in RANGE(62)]

HISTORY_ACCURACY=[]

history_loss=[]

epochs = 1

for epoch in RANGE(epochs):

running_loss = 0.0

correct=0

TOTAL=0

CLASS_CORRECT = list(0. for _ in CLASSES)

CLASS_TOTAL = list(0. for _ in CLASSES)

for i, DATA in ENUMERATE(TRAINING_GENERATOR, 0):

inputs, LABELS = DATA

t0 = time()

inputs, LABELS = inputs.to(device), LABELS.to(device)

5.5.3 : Importing the Model:

5.5.4 : Traning the Model:

print(device)

55

LABELS = eye[LABELS]

optimizer.ZERO_GRAD()

outputs = model(inputs)

loss = criterion(outputs, torch.MAX(LABELS, 1)[1])

_, predicted = torch.MAX(outputs, 1)

_, LABELS = torch.MAX(LABELS, 1)

c = (predicted == LABELS.DATA).squeeze()

correct += (predicted == LABELS).sum().item()

TOTAL += LABELS.size(0)

ACCURACY = FLOAT(correct) / FLOAT(TOTAL)

HISTORY_ACCURACY.APPEND(ACCURACY)

history_loss.APPEND(loss)

loss.BACKWARD()

optimizer.step()

for j in RANGE(LABELS.size(0)):

LABEL = LABELS[j]

CLASS_CORRECT[LABEL] += c[j].item()

CLASS_TOTAL[LABEL] += 1

running_loss += loss.item()

if(i%100==99):

print( "Epoch : ",epoch+1," BATCH : ", i+1," Loss :

",running_loss/(i+1)," ACCURACY : ",

ACCURACY,"Time ",round(time()-t0, 2),"s" )

for k in RANGE(len(CLASSES)):

if(CLASS_TOTAL[k]!=0):

print('ACCURACY of %5s : %2d %%' % (

CLASSES[k], 100 * CLASS_CORRECT[k] / CLASS_TOTAL[k]))

print('[%d epoch] ACCURACY of the network

on the TRAINING IMAGES: %d %%' %

(epoch+1, 100 * correct / TOTAL))

if epoch%10==9:

torch.SAVE(model.STATE_DICT(),

os.PATH.join(PATH_SAVE,str(epoch+1)+'.pth'))


os.PATH.join(PATH_SAVE,'FINAL_EPOCH'+'.pth'))

56

model.LOAD_STATE_DICT(torch.LOAD('/KAGGLE/WORKING

/Weights/FINAL_EPOCH.pth'))

model.EVAL()

TEST_TRANSFORMS = TRANSFORMS.Compose(

[TRANSFORMS.Resize(512),

TRANSFORMS.ToTensor(),])

def PREDICT_IMAGE(IMAGE):

IMAGE_TENSOR = TEST_TRANSFORMS(IMAGE)

IMAGE_TENSOR = IMAGE_TENSOR.unsqueeze_(0)

input = VARIABLE(IMAGE_TENSOR)

input = input.to(device)

output = model(input)

index = output.DATA.cpu().numpy().ARGMAX()

return index

IMG_TEST_PATH=os.PATH.join(BASE_PATH,'TEST_IMAGES/')

for i in RANGE(len(submission)):

img=IMAGE.open(IMG_TEST_PATH+submission.iloc[i][1]+'.png')

submission['CLASS'][i]=PREDICT_IMAGE(img)

if(i%10==0 or i==len(submission)-1):

print('[',32*'=','>]

',round((i+1)*100/len(submission),2),' % Complete')

Result=[[0 for _ in RANGE(62)] for i in RANGE(62)]

TOTAL_DATA=[0 for i in RANGE(62)]

CORRECT_DATA=[0 for i in RANGE(62)]


Result[TEST_DATASET['CLASS'][i]][submission['CLASS'][i]]+=1

if(TEST_DATASET['CLASS'][i]==submission['CLASS'][i]):

CORRECT_DATA[TEST_DATASET['CLASS'][i]]+=1

TOTAL_DATA[TEST_DATASET['CLASS'][i]]+=1

for i in Result:

for j in i:

print(str(10000+j)[1:],end=" ")

print()

5.5.5 : Testing the Model:

57

for i in RANGE(62):

print(i,'-',TOTAL_DATA[i],CORRECT_DATA[i

],(CORRECT_DATA[i]*100)/TOTAL_DATA[i])

print("TOTAL",'-',sum(TOTAL_DATA),sum

(CORRECT_DATA),(sum(CORRECT_DATA)*100)/sum(TOTAL_DATA))

58

6. RESULTS AND DISCUSSIONS

6.1 Input & Output:

INPUT:

Fig 6.1 Input image

OUTPUT:

A MOVE to stop Mr. Gaitskell from nominating any more Labour

life Peers is to be made at a meeting of Labour M P’s tomorrow. Mr. Michael Foot has

put down a resolution on the subject and he is to be backed by Mr. Will Griffiths, MP

for Manchester exchange.

6.2 Training Datasets:

IAM DATASET: The IAM Handwriting Database contains forms of

handwritten English text which can be used to train and test handwritten text

recognizers and to perform writer identification and verification experiments.

The database was first published in [13] at the ICDAR 1999. Using this

database an HMM based recognition system for handwritten sentences was developed

and published in [14] at the ICPR 2000. The segmentation scheme used in the second

version of the database is documented in [15] and has been published in the ICPR

2002. The IAM-database as of October 2002 is described in [16]. We use the database

extensively in our own research.

59

The database contains forms of unconstrained handwritten text, which were

scanned at a resolution of 300dpi and saved as PNG images with 256 gray levels.

The IAM Handwriting Database 3.0 is structured as follows:

657 writers contributed samples of their handwriting

1'539 pages of scanned text

5'685 isolated and labeled sentences

13'353 isolated and labeled text lines

115'320 isolated and labeled words

The words have been extracted from pages of scanned text using an automatic

segmentation scheme and were verified manually. The segmentation scheme has been

developed at our institute [15].

All form, line and word images are provided as PNG files and the

corresponding form label files, including segmentation information and variety of

estimated parameters (from the preprocessing steps described in [14]), are included in

the image files as meta-information in XML format

NIST DATASET: The EMNIST dataset is a set of handwritten character

digits derived from the NIST Special Dataset 2019 and converted to a 28x28 pixel

image format and dataset structure that directly matches the MNIST dataset.

The dataset is provided in two file formats. Both versions of the dataset contain

identical information, and are provided entirely for the sake of convenience. The first

dataset is provided in a Matlab format that is accessible through both Matlab and

Python (using the scipy.io.loadmat function). The second version of the dataset is

provided in the same binary format as the original MNIST dataset.[18]

There are six different splits provided in this dataset. A short summary of the

dataset is provided below:

EMNIST ByClass: 814,255 characters. 62 unbalanced classes.

EMNIST ByMerge: 814,255 characters. 47 unbalanced classes.

EMNIST Balanced: 131,600 characters. 47 balanced classes.

EMNIST Letters: 145,600 characters. 26 balanced classes.

EMNIST Digits: 280,000 characters. 10 balanced classes.

EMNIST MNIST: 70,000 characters. 10 balanced classes.

The full complement of the NIST Special Database 19 is available in the

ByClass and ByMerge splits. The EMNIST Balanced dataset contains a set of

characters with an equal number of samples per class. The EMNIST Letters dataset

merges a balanced set of the uppercase and lowercase letters into a single 26-class

60

task. The EMNIST Digits and EMNIST MNIST dataset provide balanced handwritten

digit datasets directly compatible with the original MNIST dataset.

Type Classes Training Testing Total

BY CLASS Digits 10 344,307 58,646 402,953

BY MERGE

Uppercase 26 208,363 11,941 220,304

Lowercase 26 178,998 12,000 190,998

Total 62 731,668 82,587 814,255

Digits 10 344,307 58,646 402,953

Letters 37 387,361 23,941 411,302

Total 47 731,668 82,587 814,255

Table-6.1

Breakdown of the number of available training and testing samples in the NIST

special database 19 using the original training and testing splits.

6.3 Experimental Results and Analysis:

Segmentation: This system is trained under around 1600 text images

(paragraphs) of IAM dataset with almost 5678 labeled sentences, 13353 isolated and

labelled text lines, 115,320 isolated and labeled words with an accuracy of around

98% in terms of line segmentation, 93 % in terms of word segmentation and 88% in

terms of character segmentation.

Character Recognition: This system is trained under NIST dataset. This

represents the most useful organization from a classification perspective as it contains

the segmented digits and characters arranged by class. There are 62 classes comprising

[0-9], [a-z] and [A-Z]. The data is also split into a suggested training and testing set

around 731,668 images and 82,587 images respectively with an accuracy of around

96% in character recognition.

61

Type Testing Accuracy

Line Segmentation 1,539 98%

Word Segmentation

Character Segmentation

Character Recognition

5,685 93%

115,320 88%

82,587 96%

Table -6.2: Testing and Accuracy

62

7. CONCLUSION

This paper mainly carries out a study on segmenting the connected components

(touching characters). We improved a performance of binarization in pre-processing,

and proposed new method separating the touching character using combined profile

analysis. Finally, because the proposed algorithm shows a good performance in the

experimental results, it is effective that the algorithm is applied to character

recognition system.

The proposed method for segmenting the connected components (touching

character) is by using VPP (Vertical projection profile) and TDP (Top down Profile)

and various other histogram projections (horizontal and vertical) for the line and word

segmentation respectively. pytorch is a python library tool used for the recognition of

the segmented characters. [4]

There are many more challenges involved in the optical cursive handwritten

recognition like the skewness, pressure detection etc. can be treated as a future study.

63

REFERENCES

[1] Nafiz arica, Student Member, IEEE, and Fatos T. Yarman-Vuraj, Senior Member

IEEE, “Optical Character Recognition for Cursive Handwriting”, IEEE transaction on

pattern analysis and machine intelligence, vol. 24 no. 6, June 2002

[2] Subhash Panwar and Neeta Naina, “A Novel Segmentation Methodology for

Cursive Handwritten Documents”, IETE JOURNAL OF RESEARCH, VOL 60-NO 6

NOV-DEC 2014

[3] Nibaran Das Sandip Pramanik, Subhadio Basu, Punam Kumar Saha, “Recognition

of handwritten Bangla basic characters and digits using convex hull feature set”, 2009

International conference on Artificial intelligence and pattern recognition (AIPRL-09)

[4] Abhishek Bala and Rajib Saha, “An IMPROVED Method for Handwritten

Document Analysis using Segmentation, Baseline Recognition and Writing Pressure

Detection”, 6th

International Conference on Advances in Computing Communication,

ICACC 2016, 6-8 September 2016, Cochin, India, Elesevier-2016

[5] Kanchan Keisham and Sunanda Dixit, “Recognition of Handwritten English Text

Using Energy Minimisation”, Information Systems Design and Intelligence

Applications, Advances in Intelligent Systems and Computing, Bangalore, India,

Spinger-2016

[6] Namrata Dave,” Segmentation Methods for Hand written Character Recognition”,

International Journal of Signal Processing, Image Processing and Pattern Recognition

Vol.8, No.4 (2015), pp.155-164

[7] G. Louloudisa, B.Gatosb, I.Pratikakisb, C.Halatsis, “Text line and word

segmentation of handwritten documents” a Department of Informatics and

Telecommunications, University of Athens, Greece Computational Intelligence

Laboratory, Institute of Informatics and Telecommunications, National Center for

Scientific Research Demokritos, 15310Athens, Greece.

[8] Rafael C. Gonzalez and Richard E. Woods, “Digital Image Processing, Second

Edition”, Prentice Hall.

[9] N. Otsu, “A Threshold Selection Method from GrayLevel Histogram”, IEEE

Trans, Systems, Man and Cybernetics, Vol. 1, No. 9, 1979, pp. 62-69.

[10] Seong-Whan Lee and Young Joon Kim, “Direct Extraction of Topographic

Features for Gray Scale Character Recogition”, IEEE Trans, On Pattern Analysis and

Machine Intelligence, Vol.17, No. 7, Jul., 1995.

[11] Seong-Whan, Dong-June Lee, and Hee-Seon Park, “A New Methodology for

Gray-Scale Character Segmentation and Recognition”, IEEE Trans. On Pattern

Analysis and Machine Intelligence, Vol. 18, No. 10, Oct., 1996.

64

[12] A. Ariyoshi, “A Character Segmentation Method for Japanese Documents Coping

with Touching Character Problems”, Proc. 31th Int’l Conf. Pattern Recognition, The

Hague, Netherlands, Aug., 1992. 313-316.

[13] U. Marti and H. Bunke. A full English sentence database for off-line handwriting

recognition. In Proc. of the 5th Int. Conf. on Document Analysis and Recognition,

pages 705 - 708, 1999.

[14] U. Marti and H. Bunke. Handwritten Sentence Recognition. In Proc. of the 15th

Int. Conf. on Pattern Recognition, Volume 3, pages 467 - 470, 2000.

[15] M. Zimmermann and H. Bunke. Automatic Segmentation of the IAM Off-line

Database for Handwritten English Text. In Proc. of the 16th Int. Conf. on Pattern

Recognition, Volume 4, pages 35 - 39, 2000.

[16] U. Marti and H. Bunke. The IAM-database: An English Sentence Database for

Off-line Handwriting Recognition. Int. Journal on Document Analysis and


[17] S. Johansson, G.N. Leech and H. Goodluck. Manual of Information to accompany

the Lancaster-Oslo/Bergen Corpus of British English, for use with digital Computers

Department of English, University of Oslo, Norway, 1978.

[18] Cohen, G., Afshar, S., Tapson, J., & van Schaik, A. (2017). EMNIST: an

extension of MNIST to handwritten letters.

[19] Richard G. Casey and Eric Lecolinet, “A Survey of Methods and Strategies in

Character Segmentation”, IEEE Trans. On Pattern Analysis and Machine Intelligence,

Vol. 18, No. 7, Jul., 1996.

[20] Jin Hak Bae, Kee Chul Jung, Jin Wook Kim, and Hang Joon Kim, “Segmentation

of touching characters using an MLP”, Pattern Recognition Letters, Vol. 19, No. 8,

1998, pp. 701-709.

[21] N.Aricaand, F.Yarman-Vural,“Opticalcharacterrecognitionforcursive

handwriting,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.

24, pp. 801 –813, Jun 2002.

[22] M. Blumenstein and B. Verma, “Neural-based solutions for the segmentation

and recognition of difficult handwritten words from a benchmark database,” in

Proceedings of the Fifth International Conference on Document Analysis and

Recognition, ICDAR ’99, pp. 281 –284, Sept 1999.

[23] Y. Tay, M. Khalid, R. Yusof, and C. Viard-Gaudin, “Offline cursive handwriting

recognition system based on hybrid markov model and neural networks,” in

65

Proceedings of the IEEE International Symposium on Computational Intelligence in

Robotics and Automation, 2003, vol. 3, pp. 1190 – 1195, July 2003.

[24] G.Kim, V.Govindaraju, S.Srihari,“Asegmentationandrecognition strategy for

handwritten phrases,” in Proceedings of the 13th Interna- tional Conference on Pattern

Recognition, 1996, vol. 4, pp. 510 –514, Aug 1996.

[25] Y. Y. Chung and M. T. Wong, “Handwritten character recognition by fourier

descriptors and neural network,” in Proceedings of IEEE Region 10 Annual

Conference on Speech and Image Technologies for Computing and

Telecommunications, TENCON ’97, vol. 1, pp. 391 –394, Dec 1997.

[26] B. S. Moni and G. Raju, “Modified quadratic classifier and directional features

for handwritten malayalam character recognition,” in Computa- tional Science - New

Dimensions and Perspectives, NCCSE 2011, IJCA Special Issue, vol. 1, pp. 30 –34,

Feb 2011.

[27] M. Blumenstein, X. Y. Liu, and B. Verma, “An investigation of the modified

direction feature for cursive character recognition,” Pattern Recognition, vol. 40, no. 2,

pp. 376 – 388, 2007.

[28] M. Blumenstein, B. Verma, and H. Basli, “A novel feature extraction technique

for the recognition of segmented handwritten characters,” in Proceedings of the

Seventh International Conference on Document Analysis and Recognition, 2003, vol.

1, pp. 137 – 141, Aug 2003.

66

Base Paper

2011 International Conference on Computer Applications and Industrial Electronics (ICCAIE 2011)

Offline Handwritten Character Recognition Using

Neural Network

Anshul Gupta Department of Electronics and

Electrical Engineering

IITGuwahati

Guwahati, India

Email: [email protected]

Manisha Srivastava

Department of Electronics and


IITGuwahati

Guwahati, India


Chitralekha Mahanta



IITGuwahati

Guwahati, India


Abstract—Character Recognition (CR) has been an active area

of research in the past and due to its diverse applications it continues to be a challenging research topic. In this paper, we focus especially on offline recognition of handwritten En- glish words by first detecting individual characters. The main approaches for offline handwritten word recognition can be divided into two classes, holistic and segmentation based. The holistic approach is used in recognition of limited size vocabulary where global features extracted from the entire word image are considered. As the size of the vocabulary increases, the complexity of holistic based algorithms also increases and correspondingly the recognition rate decreases rapidly. The segmentation based strategies, on the other hand, employ bottom-up approaches, starting from the stroke or the character level and going towards producing a meaningful word. After segmentation the problem gets reduced to the recognition of simple isolated characters or strokes and hence the system can be employed for unlimited vocabulary. We here adopt segmentation based handwritten word recognition where neural networks are used to identify individual characters. A number of techniques are available for feature extraction and training of CR systems in the literature, each with its own superiorities and weaknesses. We explore these techniques to design an optimal offline handwritten English word recognition system based on character recognition. Post processing technique that uses lexicon is employed to improve the overall recognition accuracy.

Index Terms—Offline, handwritten, character, recognition, neural network.

I. INTRODUCTION

It is really a challenging issue to develop a practical hand-

written character recognition (CR) system which can maintain

high recognition accuracy. A generic character recognition

system is shown in Fig. 1.

Fig. 1. Generic CR system

In most of the existing systems recognition accuracy is

heavily dependent on the quality of the input document.

In handwritten text adjacent characters tend to be touched

or overlapped. Therefore it is essential to segment a given

string correctly into its character components. In most of the

existing segmentation algorithms, human writing is evaluated

empirically to deduce rules [1]. But there is no guarantee

for the optimum results of these heuristic rules in all styles

of writing. Moreover handwriting varies from person to

person and even for the same person it varies depending on

mood, speed etc. This requires incorporating artificial neural

networks, hidden Markov models and statistical classifiers to

extract segmentation rules based on numerical data. [2][3][4].

After segmentation next crucial step is representation of

character classes by features. These features should have high

discriminative abilities so that they are different for different

character classes (for example 26 uppercase and 26 lowercase

characters in case of English language). Also, these features

should be independent of the intra class variations.

The different representation methods can be categorized into

three major classes [1]:

1. Global transformation and series expansion: includes

Fourier transform, Gabor transform, wavelet, mo-

ments and Karhuen-Loeve Expansion.

2. Statistical representation: Zoning, crossing and dis-

tances, projections.

3. Geometrical and topological representation: Extract-

ing and counting topological structures, geometrical

properties, coding, graphs and trees etc.

Features which depend on Fourier transform are suitable

for recognizing handwritten numerals where 96% accuracy

has been achieved [5]. Gradient features have been widely

used in CR for machine and hand printed binary character

images. But these features are not invariant to deformations

in the characters. In [6], a new gradient feature is used where

at each pixel, gradient is mapped onto 12 direction codes

with an angle span of 30 degree between the directions.

In [7], a redesigned direction feature [8] with a view to

978-1-4577-2059-8/11/$26.00 ©2011 IEEE 102

mailto:[email protected]



103

describe the character contour more effectively is developed.

Also, an additional global feature was introduced in this

technique to improve the recognition accuracy for those

characters that were most frequently confused with patterns

of similar appearances. But the disadvantage of this technique

is its failure to deal with changes in stroke width as these

features are extracted from non thinned character images.

Another crucial module in a character recognition system is

its pattern recognition module which assigns an unknown

sample to a predefined class. Numerous techniques for

character recognition can be classified into four general

approaches of pattern recognition: [1]

1. Template Matching : Direct matching, deformable

and elastic matching, relaxation matching

2. Statistical techniques : Parametric recognition, Non-

pararmetric recognition, HMM, fuzzy set reasoning

3. Structural techniques: Grammatical methods, graph-

ical methods

4. Neural networks : Multilayer perceptron, radial basis

function, support vector machine

Character recognition technique has to cope with the high

variability of the handwritten cursive letters and their intrinsic

ambiguity (letters like “e” and “l” or “u” and “n” can have

the same shape). Also it should be able to adapt to changes

in the input data. Template matching, statistical techniques

and structural techniques can be used when the input data is

uniform over time whereas neural network (NN) classifier

can learn changes in the input data. Also NN has parallel

structure because of which it can perform computation at a

higher rate than classical techniques. Therefore, we choose

neural networks for character recognition in our system.

The features that are used for training the neural network

classifier also play a very important role. The choice of a

good feature vector can significantly enhance the performance

of a character classifier whereas a poor one may degrade

its performance considerably. It is found in the literature

that generally separate classifiers are used for the upper

and the lower case English character classes to improve the

recognition accuracy. Moreover, good recognition accuracy

could be achieved only for handwritten numerals.

In this paper, we focus on developing a CR system for

recognition of handwritten English words. We first segment

the words into individual characters and then represent these

characters by features that have good discriminative abilities.

We also explore different neural network classifiers to find

the best classifier for the CR system. We combine different

CR techniques in parallel so that recognition accuracy of the

system can be improved.

The organization of the paper is as follows: Section II

deals with segmentation of words into individual characters

where a heuristic algorithm is used to first oversegment the

word followed by verification using neural network. Feature

extraction of handwritten characters is discussed in Section

III. Section IV describes selection procedure of a suitable

classifier. This is done by testing multilayer perceptron (MLP),

radial basis function (RBF) and support vector machine (SVM)

and selecting the one that has the maximum accuracy. In Sec-

tion V post processing is discussed where different character

recognition techniques are combined in parallel by using a

variation of the Borda count. Section VI presents results and

discussion. Conclusions are drawn in Section VII.

II. SEGMENTATION

In this paper segmentation algorithm used is similar to [2],

where heuristics and artificial intelligence are used for the

segmentation of a handwritten word. Here gray level image

is first converted into the binary image. Next slant detection

similar to the one used in [9] is employed and then slant

from −45◦ to 45

◦. The horizontal projection is taken at each correction is done. The method involves rotating the image rotation to calculate Wigner - Ville distribution (WVD - a joint

function of time and frequency). The angle which presents

the maximum intensity after applying WVD is taken as the

estimated slant angle.

For both the training and the testing phases, a heuristic

algorithm is used to locate prospective segmentation points in

the handwritten words. Each word is inspected in an attempt to

locate characteristic representative of the segmentation points.

A. Segmentation using a heuristic algorithm

A simple heuristic segmentation algorithm is implemented

which scans handwritten words to identify valid segmentation

points between characters. The segmentation is based on

locating the minima or arcs between letters, common in

handwritten cursive script. For this a histogram of vertical

pixel densities is examined which may indicate the location of

possible segmentation points in the word. However in the case

of letters as “a” and “o”, an erroneous segmentation point

could be identified. Therefore a “hole seeking” component

is incorporated which prunes the segmentation points those

pass through a “hole”. Finally, the algorithm performs a

check to see if one segmentation point is not too close

to another by ascertaining that the distance between the

previous segmentation point and the position being checked

is equal to or greater than the average character width.

Conversely if the contour in a region has sparse segmentation

points then a new segmentation point is inserted in that region.

B. Manual marking of segmentation points

We created our own database to train the neural network

for segmentation. Altogether 26 English words were chosen

which contained all the upper and lower case alphabets and

then 10 different samples of each word were collected on

paper from different writers. The images were then scanned

and preprocessed to create a list of 260 words. Prior to ANN

training, the heuristic feature detector was used to segment

all the words. The segmentation point outputs obtained by

104

L

L

using the heuristic feature detector can be categorized into

“correct” and “incorrect” segmentation point classes. The

feature extractor then extracts a matrix of pixels representing

the segmentation area and breaks it down into small windows

of equal size 5x5 pixels and analyzes the density of black

and white pixels. The density value for the black pixels

for each 5x5 window is written to the training file to

k = 5 are considered. The feature vectors made up from these

moduli are then normalized to 1 to compensate for image

scaling. To spread the input data more evenly over the input

space, the mean and the standard deviation vectors are found

over the whole set of training data. The jth

component of

input vector i is calculated as: . . Σ Σ

1 represent the value of that window. Accompanying each

matrix the desired output is also stored in the training file (0.1 ipj = (ipoj − ioj ) α − 1 + 1 σnoj

(3)

for an incorrect segmentation point and 0.9 for a correct point).

C. Training of the Artificial Neural Network (ANN)

For this step, a multilayer feedforward neural network

trained with the back propagation algorithm is used. The ANN

is presented with the training file prepared in the previous step.

D. Testing phase of the segmentation technique

Like ANN training, the words used for testing are also

segmented using the heuristic algorithm. The segmentation

points are automatically extracted and are fed into the trained

ANN. The ANN then verifies each segmentation point as

correct or incorrect. Finally, upon ANN verification, each word

used for testing should only contain valid segmentation points.

III. FEATURE EXTRACTION

A compact and characteristic representation of the character

image is required in the CR system. For this purpose, a set

of features is extracted for each class that helps to distinguish

it from other classes, while remaining invariant to intra class

differences.

pattern p, ioj is the mean of the jth

components of the original where ipoj is the jth

component of the original vector or vectors and σnoj is the corresponding standard deviation.

Coefficient α linearly controls the degree of standard deviation

compensation. We have also used Fourier descriptors for

extracting the following two features:

1) Fourier angle: It is mentioned in [10] that if the moduli

alone are not successful in discriminating all the classes then

adding angles of Fourier descriptors can improve the results.

Experiments can be done to incorporate angles in the training

set.

2) Fourier magnitude [11]: The Fourier coefficients de-

rived from equations (1) and (2) are not rotation or shift

invariant (in fact, it is noted that a shift will occur if the starting

point of the boundary following is arbitrary). In order to derive

a set of Fourier descriptors which have the invariant property

performed. For each n a set of invariant descriptors r[n] is with respect to rotation and shift, the following operations are computed as : √

r[n] = |a[n]|2 + |b[n]|2 (4) It is easy to show that r[n] is invariant to rotation or shift. A further refinement can be made by computing a new set of

descriptors as follows:

A. Fourier Descriptors

The method adopted is similar to [10] where boundary

detection is done at first. After obtaining a boundary image,

Fourier descriptors are found. This involves finding the

discrete Fourier coefficients a[k] and b[k] for 0 < k < L − 1, where L is the total number of boundary points found by applying the following:

s[n] = r[n]/r[1] (5) Thus dependence of r[n] on the size of the character is also eliminated. The Fourier coefficients |a[n]|, |b[n]|, their phases and the invariant descriptors s[n]; n = 2; 3, were derived for all the character specimens and stored in files for application in

reconstruction and recognition. We will be using the following

set of features in our final system:

1. Magnitude, s(k), |a[k]| and |b[k]|

a[k] =

b[k] =

. 1 Σ ΣL L m=1

. 1 Σ ΣL

L

x[m]e(−jk( 2π )m) (1)

y[m]e(−jk( 2π )m) (2)

2. Phase, |a[k]| and |b[k]|

3. Magnitude, phase, s(k), |a[k]| and |b[k]| IV. CLASSIFIER SELECTION

Classification can be done using various methods like clus-

tering, Bayesian classification, artificial neural networks etc. m=1

where x[m] and y[m] are the x and y coordinates respectively of the mth

boundary point. The values for k = 0 are discarded as they contain information only about the position of the image. The coefficients for high values of k describe

high frequency features in the image but do not contain

much information about the overall shape of the character

and so these high frequency components are also discarded.

Therefore, the first five coefficients beginning from k = 1 to

out of which artificial neural networks have been widely used. For our case we will use them to classify 52 character classes:

26 lower cases and 26 upper cases. We have considered

three networks: Multi-layer perceptron (MLP), radial basis

function (RBF) and support vector machine (SVM). Results

of character classification by these classifiers are given below.

We have used neural network toolbox in the Matlab platform

for testing the classifiers. The character database used for the

training and testing is taken from The Chars74K dataset.

105

A. Multilayer Perceptron (MLP)

Table I shows the MLP configuration that produced the best

results in our case. Fig.2 illustrates the validation performance

of the MLP network. Results obtained are poor on validation

and testing data.

TABLE I

MLP CONFIGURATION

No. of hidden layers

and respective acti-

vation function

No.

hidden

nodes

of Training algorithm

Learning Rate

Mom- entum

3[tansig tansig tan- [80 50 traingdx Adaptive 0.9

sig] 50]

Fig. 2. Validation performance of the MLP network

B. Radial Basis Function (RBF)

Table II shows the RBF network used. Fig.3 illustrates the

validation performance of the RBF network. The results are

good on training data but suffer from overlearning.

TABLE II

RBF CONFIGURATION

No. of hidden nodes Type of radial basis function Target Er-

ror

Adaptive addition and pruning of hidden neurons

Gaussian radial basis function 0.001

Fig. 3. Validation performance of the RBF network

Although the RBF network produced good results on the

validation dataset but it required 1800 neurons for this per-

formance. As a result this network suffered from overlearning

and showed very poor results on the test data.

C. Support vector machine (SVM) data is 98.86% and it achieves the optimum learning. The recognition result on the test data is 62.93%. It is observed In the case of the SVM, the recognition rate on the training

Table III shows the recognition rate (%) on the training data that on the test data SVM outperforms the other two networks. produced by the SVM for all the three feature vectors. This

testing is performed on the Chars74K dataset.

TABLE III SVM CONFIGURATION

Fourier with magnitude s(k),|a(k)| and |b(k)|

Fourier with phase,

|a(k)| and |b(k)|.

Fourier with magnitude s(k), |a(k)|, |b(k)| and phase

86.66% 98.74% 98.04%

Now we build a CR system using all the three sets of

features in parallel. Our proposed system is shown in Fig.4.

Fig. 4. Block diagram of the proposed CR system

V. POST PROCESSING

It has been found that in many real word applications, it

is better to fuse multiple techniques to improve the results.

Fusion takes advantages of different techniques by emphasiz-

ing on the strengths and avoiding the weaknesses of individual

techniques. We here use a fusion method based on Borda count

that is inspired from [12] to combine the following techniques

in parallel: 1. SVM on Moduli of Fourier Coefficients |a(k)|, |b(k)| and magnitude s(k)

2. SVM on Moduli of Fourier Coefficients |a(k)|, |b(k)| and phase

3. SVM on Moduli of Fourier Coefficients |a(k)|,

|b(k)|, phase and magnitude s(k) A rank is assigned and used in the calculation of the Borda count instead of calculating the number of strings below the

predicted string. The output string from a given technique is

compared with all the words in a lexicon. Then the lexicon

words are ranked according to their similarity with the output

106

string. The similarity between the output string and the lexicon

words are found by finding the number of matching characters

and their relative positions. The rank for a particular string can

be calculated using the following formulae:

Rank = 1-(position of the string in the top N strings)/N.

The rank is 0, if the string is not in the top N choices. We have

taken N=3.Therefore only the top three words are considered

from each technique to calculate the rank.

Secondly the confidence values produced by different tech-

niques are considered. The confidence value for all the three

predicted words for any given technique is the confidence that

the classifier has in its output string even if the string is not

a valid lexicon word. This is reasonable because the top three

strings are chosen based on its similarity with the output string.

The classifier’s confidence in its output string can be estimated

by summing up the scores of each of the predicted characters

of the output string. Final Boda count of a lexicon word = (rank ∗ confidence)tech1 + (rank ∗ confidence)tech2 + (rank ∗ confidence)tech3

VI. RESULTS AND DISCUSSION

The proposed CR system was tested on a database consist-

ing of 26 word images. All of these images were given as input

to the proposed CR system. The lexicon used also consisted

of the same 26 words that were used for testing. Out of these

26 words, the proposed system correctly recognized 21 word

images. Figs. 5-7 show some results from the 21 correctly

recognized handwritten words.

Fig. 5. Result on “Moderated”.

Fig. 6. Result on “Puzzle”.

It is evident from these figures that the proposed CR system

107

Fig. 7. Result on “Rolled”.

produces fairly good results on the test samples presented to

it. The segmentation method used was efficient. The heuristic

algorithm is based on rules which are deduced empirically

and there is no guarantee about their optimum results for

different styles of writing. So their validation using neural

network becomes essential. We tried different Fourier features

like moduli of Fourier coefficients, magnitude, phase and

their various combinations as feature vectors. The feature

vector formed using moduli of Fourier coefficients and phase

produced the best recognition accuracy of 98.74% on the

training dataset using SVM as the classifier. We have used

three combinations of Fourier descriptors in parallel for our

final system. Moreover our character recognition network has

52 output classes whereas in most of the literature separate

classifiers were used for upper and lower case characters. We

tested MLP and RBF neural networks that have been used

in the past for character recognition. We also tried support

vector machine (SVM) as classifier on the same feature set

and achieved 98% classification accuracy on the training data

set and 62.93% on the test data set. Finally, we selected SVM

as it outperformed MLP and RBF. Post processing which uses

lexicon becomes imperative as there is no other way to find out

the errors that have crept in at any of the previous stages. The

only way to do that is to verify whether the predicted word is

a valid lexicon word or not. Thus incorporating lexicon in our

final system using Borda Count improved the overall efficiency

of the system.

VII. CONCLUSION

This paper carries out a study of various feature based clas-

sification techniques for offline handwritten character recogni-

tion. After experimentation, it proposes an optimal character

recognition technique. The proposed method involves segmen-

tation of a handwritten word by using heuristics and artificial

intelligence. Three combinations of Fourier descriptors are

used in parallel as feature vectors. Support vector machine

is used as the classifier. Post processing is carried out by

employing lexicon to verify the validity of the predicted word.

The results obtained by using the proposed CR system are

found to be satisfactory.

REFERENCES

[1] N. Arica and F. Yarman-Vural, “Optical character recognition for cursive

handwriting,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, pp. 801 –813, Jun 2002.

[2] M. Blumenstein and B. Verma, “Neural-based solutions for the segmentation and recognition of difficult handwritten words from a benchmark database,” in Proceedings of the Fifth International Conference on Document Analysis and Recognition, ICDAR ’99, pp. 281 –284, Sept 1999.

[3] Y. Tay, M. Khalid, R. Yusof, and C. Viard-Gaudin, “Offline cursive handwriting recognition system based on hybrid markov model and neural networks,” in Proceedings of the IEEE International Symposium on Computational Intelligence in Robotics and Automation, 2003, vol. 3, pp. 1190 – 1195, July 2003.

[4] G. Kim, V. Govindaraju, and S. Srihari, “A segmentation and recognition strategy for handwritten phrases,” in Proceedings of the 13th Interna- tional Conference on Pattern Recognition, 1996, vol. 4, pp. 510 –514, Aug 1996.

[5] Y. Y. Chung and M. T. Wong, “Handwritten character recognition by

fourier descriptors and neural network,” in Proceedings of IEEE Region 10 Annual Conference on Speech and Image Technologies for Computing and Telecommunications, TENCON ’97, vol. 1, pp. 391 –394, Dec 1997.

[6] B. S. Moni and G. Raju, “Modified quadratic classifier and directional features for handwritten malayalam character recognition,” in Computa- tional Science - New Dimensions and Perspectives, NCCSE 2011, IJCA Special Issue, vol. 1, pp. 30 –34, Feb 2011.

[7] M. Blumenstein, X. Y. Liu, and B. Verma, “An investigation of the modified direction feature for cursive character recognition,” Pattern Recognition, vol. 40, no. 2, pp. 376 – 388, 2007.

[8] M. Blumenstein, B. Verma, and H. Basli, “A novel feature extraction technique for the recognition of segmented handwritten characters,” in Proceedings of the Seventh International Conference on Document Analysis and Recognition, 2003, vol. 1, pp. 137 – 141, Aug 2003.

[9] E. Kavallieratou, N. Fakotakis, and G. Kokkinakis, “Skew angle esti- mation for printed and handwritten documents using the wigner-ville distribution,” Image and Vision Computing, vol. 20, no. 11, pp. 813 – 824, 2002.

[10] I. P. Morns and S. S. Dlay, “Character recognition using fourier descriptors and a new form of dynamic semisupervised neural network,” Microelectronics Journal, vol. 28, no. 1, pp. 73 – 84, 1997.

[11] M. Shridhar and A. Badreldin, “High accuracy character recognition algorithm using fourier and topological descriptors,” Pattern Recognition, vol. 17, no. 5, pp. 515 – 524, 1984.

[12] B. Verma, P. Gader, and W. Chen, “Fusion of multiple handwritten word recognition techniques,” Pattern Recognition Letters, vol. 22, no. 9, pp. 991 – 998, 2001.

Project Paper

JETIR2004233 Journal of Emerging Technologies and Innovative Research (JETIR) www.jetir.org 191

© 2020 JETIR April 2020, Volume 7, Issue 4 www.jetir.org (ISSN-2349-5162)

OPTICAL CURSIVE HANDWRITTEN

RECOGNITION USING VPP & TDP NATIVE

SEGMENTATION ALGORITHMS AND NEURAL

NETWORKS (PYTORCH)

1G. Tirumalesh,

2K. L. Srinivas,

3K. Pratima,

4N. Arun,

5Y. Hemanth

1student,

2student,

3student,

4student,

5student

Department of Computer Science and Engineering,

Anil Neerukonda Institute of Technology and Sciences, Visakhapatnam, India.

Abstract- In the domain of Artificial Intelligence, scientist brought ultra-modern changes in many fields, one of

them is image processing. This paper aims to represent the process of Converting handwritten text to computer

typed document which is an optical cursive handwritten recognition (OCR) by using segmentation based

algorithms like VPP (vertical projection profile), TDP (top down profile) and the other histogram (vertical and

horizontal) projection algorithms to achieve the solution. Several other approaches were also available for the

segmentation of text into individual characters. For feature extraction and character recognition pytorch which is

an open source machine learning tool library in python used for computer vision and natural language processing.

Keywords: Vertical Projection Profile (VPP), Top Down Profile (TDP), Pytorch.

1. INTRODUCTION:

OCR means Optical Character Recognition, it is also known as optical character reader. OCR translates the

text in the given images to machine readable format. Character recognition is classified into two types based on the

text. They are machine printed text and handwritten text. It is difficult to work with handwritten text mainly in the

case of cursive handwritten because it varies from person to person and there is no perfect line spacing and size of

the character and margins etc in handwritten text. A single character is written in many styles so it is difficult to

identify and translate the script into machine readable format or ASCII format. In this scenario, there is a step by

step process to convert given script into each individual character by starting with line segmentation followed by

word segmentation and finally the character segmentation. Finally predicting each individual character under a

trained model using pytorch to recognize the character and combined together again to generate the original

machine-readable script to the end user.

1.1 Literature Survey

Previous handwritten recognition uses various segmentation algorithms like heuristic, skew recognition

techniques and written pressure detection. Almost every segmentation algorithm is based on horizontal and

vertical projections to segment the script into individual characters. Even when there are text lines or characters

which are overlapped on each other can be separated by adjusting the threshold value. The existed method is tested

on more than 1000 text images of IAM datasets and by using these existing method 91.55% lines and 90.5%

words are correctly segmented from the IAM dataset and also normalize 92% lines and words perfectly with

invisible error rate.[9]

2. PROPOSED ALGORITHM

The process for the optical cursive handwritten recognition and the required algorithms for various level of

segmentation and character recognition using pytorch are as follow


1. Image scanning

http://www.jetir.org/



2. Pre-processing

3. Segmentation


5. Classification

6. Post-processing

2.1 Image scanning:

The input image can be obtained either by scanning the already existing handwritten image file (png, jpg) or by

capturing the image instantly to provide input data to the model.


2.2 Image pre-processing:

The main goal here is to make the input image free from noise. As a first step to move on convert the RGB

image to a gray scale image and gently sharpen the given input image to avoid loss of edges. Calculate the mean

gray intensity value to reduce the brightness of the obtained grey scale image on a threshold value of less than 0.65

and the contrast is increased to distinguish the character boundaries. The text which is present in the obtained

result may turn dim and blur because of improper scanning of the text image. To overcome this, binarization plays

a key role by converting the gray scale image where the values ranges between 0 and 255 to a binary image by

making up a threshold value simply to decide like an on or off (0 or 1).






2.3 Segmentation

Basically, there are three levels of segmentation. Line segmentation, Word segmentation and Character

segmentation.

2.3.1 Line segmentation

Horizontal histogram projections are used in segmenting the entire script present in the input image into

individual lines as shown in the figure below

The primary task here is to extract each individual line from the given input image. This can be obtained by

applying horizontal histogram projection to the pre-processed image and then generate the threshold line value by

adjusting the average value for those horizontal projections. Graphical representation of horizontal histogram

projection is shown in the figure 2.4 below.[6]





Finally, Lines can be segmented from the given input script by obtaining the break points by making use of the

average threshold line value obtained from the above graph and having comparisons with each and every

horizontal projections.


2.3.2 Word segmentation

Each word is treated as an object (Contour – in terms of image processing). Contour can be explained simply

as a curve joining all the continuous points (along the boundary), having same color or intensity. Here, Contours

are useful for object detection where each object is a word.


The main reason for making use of contours here is each word can be treated as a curve joining all the

continuous points along the boundary since it is cursively written. But sometimes there may be gaps in between

the letters of a single word which causes the word to be split into two or more words as they are not continuous


This type of words can be identified by making use of the minimum threshold value which is obtained by

taking the average separation distance between the words and can be rejoined back as a single word (Contour)

where the separation distance between the words less than the minimum threshold value.[7]




Minimum threshold value = sum of separation distances between words in the line/

no of words in the line



2.3.3 Character segmentation

Segmenting each word into individual characters can be obtained by making use of two native algorithms:

1) VPP (Vertical Projection Profile)

2) TDP (Top Down Profile)


VPP is a plot which maintains the total number of white pixels in vertical direction of the binary image. Characters

can be segmented at the point where the VPP value is zero (0) for some certain no of times (Threshold). But in the

case of touching characters, the VPP value can never be zero even though the characters should be segmented

(connected components).[1]





Connected components can be identified by making use of characters width and height. When the width of

the character is greater than 0.8 times the height of the character then it is identified as a connected component

otherwise it is a single character as per basic font size measurement.[1]




TDP is a plot which maintains the first white pixel in vertical direction of the binary image. Touching

characters can be segmented into individual characters by taking the combined value of both VPP AND TDP.

Then obtain the minimum value in the graph where we can segment them into individual characters and continue

this process recursively until no more touching character found in a single word. [2]







FIG 2.15 Combined intensity graph for word ‘MO’


2.4 Feature Extraction:

The main goal here is to extract the features from the segmented characters which are required to train the

data.

This process comprises of zero padding, convolution layer, activation function, max pooling and flatten. As a

first step add zeros to the image to overcome the loss of edges termed as zero padding. Then apply multiple layers

of convolution and max pooling filters(kernels) to obtain an image with reduced in size where each move is a

stride. Max pooling comes with selecting the max value from the filter and replace it with the remaining and the

same way the average pooling works by taking the average pixel value. Now the activation function comes into

picture where the ReLU activation function identifies all the negative pixel values and replace it with zero without

any change in the positive pixel values. Finally flatten the image by reshaping the image obtained.







2.5 Classification:

Finally, Classification is done using a fully connected layer where we get the probabilities of each and every

class for the given input character. Classify the given input character to their respective class by selecting the class

with the maximum probability. In total there are classified into 62 classes (0 to 9, a to z, A TO Z)





2.6 Post-processing:

As a final step obtain the accuracy for all the levels of segmentation and the character recognition by

minimizing the error rate. Then combine all the recognized characters into word, words into lines and lines into the

original script present in the image.



3. PYTORCH LIBRARY TOOL:

Pytorch is an open source machine learning library based on torch library used for applications such as


It is a very popular framework for deep learning. The feature extraction and the classification stages are

implemented under pytorch library tool. As a first step install all the required modules/packages to train the data

using pip namely efficientnet-pytorch and torchsummary. Then include all the required modules by importing

them into the python script (torch, torchvision, torch.nn, torch.util, torch.autograd, torch.optim,

torchvision.transforms, Efficientnet).

Prepare the Dataset for training the model from the NIST dataset (700000 images) by dividing them into two

around 600000 images as training data and remaining as test data.

Create a class named DATASET for listing all the training images and their respective labels for training.

Download the pretrained model efficientnet-b0 and assign all the required parameters like batch size, learning rate,




error rate, no of classes and transformations if any. Finally train the model under certain epcohs until the loss gets

minimized and accuracy gets increased.

Finally predict each input segmented character under the trained model to be classified into one of the class.

4. EXPERIMENTAL RESULTS AND ANALYSIS:

This system is trained under around 1600 text images (paragraphs) of IAM dataset with almost 5678 labelled

sentences, 13353 isolated and labelled text lines, 115,320 isolated and labelled words with an accuracy of around

98% in terms of line segmentation, 93 % in terms of word segmentation and 88% in terms of character

segmentation.[8]

In terms of character recognition, this system is trained under around 623,000 images of 62 different characters

(0 to 9, a to z, A to Z) with an accuracy of around 96% in character recognition.

5. CONCLUSION:

This paper mainly carries out a study on segmenting the connected components (touching characters). There are

many more challenges involved in the optical cursive handwritten recognition like the skewness, pressure

detection etc can be treated as a future study. The proposed method for segmenting the connected components

(touching character) is by using VPP (Vertical projection profile) and TDP (Top Down Profile) and various other

histogram projections (horizontal and vertical) for the line and word segmentation respectively. pytorch is a

python library tool used for the recognition of the segmented characters. [4]

6. ACKNOWLEDGMENT:

The project team members would like to express our thanks to our guide B. shiva jyoti Assistant professor of

Computer Science and Engineering Department, Anits for her valuable suggestions and guidance in completing

out project model.

7. REFERENCES:

[1] Nafiz arica, Student Member, IEEE, and Fatos T. Yarman-Vuraj, Senior Member IEEE, “Optical Character

Recognition for Cursive Handwriting”, IEEE transaction on pattern analysis and machine intelligence, vol. 24 no.

6, June 2002

[2] Subhash Panwar and Neeta Naina, “A Novel Segmentation Methodology for Cursive Handwritten

Documents”, IETE JOURNAL OF RESEARCH, VOL 60-NO 6 NOV-DEC 2014

[3] Nibaran Das Sandip Pramanik, Subhadio Basu, Punam Kumar Saha, “Recognition of handwritten Bangla basic

characters and digits using convex hull based feature set”, 2009 International conference on Artificial intelligence

and pattern recognition (AIPRL-09)

[4] Abhishek Bala and Rajib Saha, “An IMPROVED Method for Handwritten Document Analysis using

Segmentation, Baseline Recognition and Writing Pressure Detection”, 6th

International Conference on Advances

In Computing Communication, ICACC 2016, 6-8 September 2016, Cochin, India, Elesevier-2016

[5] Kanchan Keisham and Sunanda Dixit, “Recognition of Handwritten English Text Using Energy

Minimisation”, Information Systems Design and Intelligence Applications, Advances in Intelligent Systems and

Computing, Bangalore, India, Spinger-2016




[6] Namrata Dave,” Segmentation Methods for Hand written Character Recognition”, International Journal of

Signal Processing, Image Processing and Pattern Recognition Vol.8, No.4 (2015), pp.155-164

[7]G. Louloudisa, B.Gatosb, I.Pratikakisb, C.Halatsis, “Text line and word segmentation of handwritten

documents” a Department of Informatics and Telecommunications, University of Athens, Greece Computational

Intelligence Laboratory, Institute of Informatics and Telecommunications, National Center for Scientific Research

Demokritos, 15310Athens, Greece

[8] IAM dataset http://www.fki.inf.unibe.ch/databases/iam-handwriting-database

[9] Offline handwritten character recognition using neural networks

https://www.researchgate.net/publication/239765657_Offline_handwritten_ch

aracter_recognition_using_neural_network


http://www.fki.inf.unibe.ch/databases/iam-handwriting-database

http://www.researchgate.net/publication/239765657_Offline_handwritten_ch




(UGC AUTONOMOUS)

(Affiliated to AU, Approved by AICTE and Accredited by NBA & NAAC with ‘A’

Grade)


BONAFIDE CERTIFICATE

This is to certify that the project report entitled “OPTICAL CURSIVE

HANDWRITTEN RECOGNITION USING VPP & TDP NATIVE

SEGMENTATION ALGORITHMS AND NEURAL NETWORKS

(PYTORCH)” submitted by G.TIRUMALESH (316126510081), K.PRATIMA

(316126510087), K.L.SRINIVAS (316126510088), N.ARUN (316126510100),

Y.HEMANTH (316126510120) in partial fulfillment of the requirements for the

award of the degree of Bachelor of Technology in Computer Science Engineering

of Anil Neerukonda Institute of technology and sciences (A), Visakhapatnam is a

record of bonafide work carried out under my guidance and supervision.

Project Guide Head of the Department

B. SHIVA JYOTHI Dr. R. SIVARANJANI

ASSISTANT PROFESSOR HEAD OF THE DEPATMENT

DEPT. OF CSE DEPT. OF CSE

DECLARATION

We, G.TIRUMALESH, K.PRATIMA, K.L.SRINIVAS, N.ARUN,

Y.HEMANTH, of final semester B.Tech., in the department of Computer

Science and Engineering from ANITS, Visakhapatnam, hereby declare that the

project work entitled “OPTICAL CURSIVE HANDWRITTEN

RECOGNITION USING VPP & TDP NATIVE SEGMENTATION

ALGORITHMS AND NEURAL NETWORKS (PYTORCH)” is carried

out by us and submitted in partial fulfillment of the requirements for the award

of Bachelor of Technology in Computer Science Engineering , under Anil

Neerukonda Institute of Technology & Sciences(A) during the academic year

2016-2020 and has not been submitted to any other university for the award of

any kind of degree.


K. PRATIMA 316126510087

K. L. SRINIVAS 316126510088

N. ARUN 316126510100

Y. HEMANTH 316126510120

ACKNOWLEDGEMENT

An endeavor over a long period can be advice and support of many well-

wishers. We take this opportunity to express our gratitude and appreciation to all of

them.

We owe our tributes to Dr. R. SHIVARANJANI, Head of the

Department, Computer Science & Engineering for her valuable support and

guidance during the period of project implementation.

We wish to express our sincere thanks and gratitude to our project guide B.

SIVA JYOTHI, Assistant Professor, Department of Computer Science &

Engineering, ANITS, for the simulating discussions, in analyzing problems

associated with our project work and for guiding us throughout the project. Project

meeting were highly informative. We express our sincere thanks for the

encouragement, untiring guidance and the confidence they had shown in us. We are

immensely indebted for their valuable guidance throughout our project.

We also thank all the staff members of CSE department for their valuable

advices.

We also supporting staff for providing resources as and when required.


K. PRATIMA 316126510087

K. L. SRINIVAS 316126510088

N. ARUN 316126510100

Y. HEMANTH 316126510120

i

ABSTRACT

In the field of Artificial Intelligence, scientists brought a revolutionary

change in the field of image processing and one of the biggest challenges in it is to

identify documents in hand-written formats. One of the most widely used techniques

for the validity of these types of documents is Character Recognition. Optical

Character Recognition (OCR) is an extensively employed method to transform the

data of any form (handwritten) into electronic format. There are millions of

techniques introduced now that can be used to recognize handwriting of any form

and language. A number of techniques are available for feature extraction and

training of CR systems in the literature, each with its own superiorities and

weaknesses. We explore these techniques to design an optimal cursive handwritten

recognition system based on character recognition. This aims to represent the

process of Converting handwritten text to computer typed document which is an

optical cursive handwritten recognition (OCR) by using segmentation algorithms

like VPP (vertical projection profile), TDP (top down profile) and the other

histogram (vertical and horizontal) projection algorithms to achieve the solution. For

feature extraction and character recognition pytorch which is an open source

machine learning tool library in python used for computer vision and natural

language processing.

ii

CONTENTS

Page No.

ABSTRACT i

LIST OF FIGURES v

LIST OF SYMBOLS vii

LIST OF TABLES viii

LIST OF ABBREVATIONS ix

CHAPTER 1 INTRODUCTION 1

1.1 Prerequisites 2

1.1.1 Software Requirements 2

1.1.2 Hardware Requirements 2

1.1.3 Data Requirements 2

1.2 Python 3

1.2.1 Machine learning in python 4

1.2.1.1 Numpy 5

1.2.1.2 Pandas 5

1.2.1.3 Opencv 6

1.2.1.4 Sk-Learn 7

1.2.1.5 Matplotlib 8

1.2.2 Neural networks in python 9

1.2.2.1 Loss Function 10

1.3 Image Processing 12

1.3.1 Types of Images 13

1.3.2 Brightness and Contrast 14

1.4 Convolutional Neural Network 15

1.4.1 Convolutional Layer 15

1.4.2 Pooling Layer 17


1.5 Problem Statement 18

iii

CHAPTER 2 LITERATURE SURVEY 19

2.1 Existing Methods for character recognition system 19

CHAPTER 3 METHODOLOGY 22

3.1 System architecture 22

3.2 Algorithm 23

3.2.1 Algorithm for line segmentation 24

3.2.2 Algorithm for word segmentation 24

3.2.3 Algorithm for character segmentation 25

3.2.3.1 Character segmentation using VPP 25

3.2.3.2 Touching character segmentation 25

3.3 Proposed Work 26

3.3.1 Image scanning 26

3.3.2 Pre-processing 26

3.3.3 Segmentation 28

3.3.3.1 Line segmentation 28

3.3.3.2 Word segmentation 29

3.3.3.3 Character segmentation 30

3.3.4 Feature extraction 33


3.3.6 Post-processing 36

CHAPTER 4 PYTORCH 37

4.1 Pytorch Library Tools 37

4.2 Pytorch in research 39

4.3 Training an image classifier using pytorch 40

CHAPTER 5 SAMPLE CODE ELABORATION 42

5.1 Pre-processing the data 42

5.1.1 Data Organization 42

5.1.2 Image Sizing and Shaping 42

5.1.3 Image Blurring Kernel Filter 42

5.1.4 Applying Kernel Filters and Contours on Image 43

iv

5.2 Line Segmentation 44

5.2.1 Calculating Line Intensity(Horizontal Histograms) 44

5.2.3 Evaluating Threshold for Line Segmentation 45

5.2.4 Segmenting Paragraph into sentences 45

5.3 Word Segmentation 46

5.3.1 Combining two Missegmented Words 46

5.3.2 Segmenting Sentence into Words 46

5.4 Character Segmentation 47

5.4.1 Evaluating VPP Intensity 47

5.4.2 First Level Character Segmentation Using VPP 48

5.4.3 Segmenting Word into Characters under VPP 49

5.4.4 Evaluating VPP and TDP Average Intensity 49

5.4.5 Connected Components Segmentation 50

5.4.6 Further Required Segmentation on Connected Components. 52

5.5 Training the model 53

5.5.1 Data Loader 53

5.5.2 Defining Transforms and Parameters 53

5.5.3 Importing the Model 54

5.5.4 Training the Model 54

5.5.5 Testing the model 56

CHAPTER 6 RESULTS AND DISCUSSIONS 58

6.1 Input & Output 58

6.2 Training Datasets 58

6.3 Experimental Results and Analysis 60

CHAPTER 7 CONCLUSION 62

REFERENCES 63

v

LIST OF FIGURES


1.1 Flow-chart for OCR 2

1.2 Machine learning overview 4

1.3 Architecture of a 2-layer Neural Network 9

1.4 Illustration of flow of network 10

1.5 A CNN Sequence 15

1.6 Convolution operation with kernel 16

1.7 Performing Pooling operation 17

1.8 Describing Classification Process 17

3.1 Block Diagram of Proposed System 22

3.2 Steps in pre-processing 23

3.3 Scanned input image 26

3.4 Pre-processing 27

3.5 Blur image (for noise removal) 27

3.6 Binary image in color contrast 28

3.7 Horizontal projection graph 28

3.8 Segmented lines from the image 29

3.9 segmented line 29

3.10 First level word segmentation 30

3.11 Second level word segmentation 30

3.12 segmented word 30

3.13 VPP intensity graph for word ‘MOVE’ 31

3.14 First level VPP character segmentation 31

3.15 Touching characters (connected component) 31

3.16 VPP intensity graph for word ‘MO’ 32

3.17 TDP intensity graph for word ‘MO’ 32

3.18 Combined intensity graph for word ‘MO’ 32

vi


3.19 Final level character segmentation 33

3.20 Feature extraction of characters 33

3.21 Zero padding 34

3.22 Convolution layer 34

3.23 Max pooling and Average pooling 34

3.24 Flatten the image 35

3.25 ReLU activation function 35

3.26 Fully connected layer with classes (x and o)

along with probabilities 35

3.27 Final result 36

6.1 Input image 58

vii

LIST OF SYMBOLS

X Input layer

Ŷ Output layer

W Weights

B biases

σ Activation function

Summation

vii

LIST OF TABLES

Table. No Table Name Page No.

6.1 Breakdown of the number of available training and 60

testing samples in the NIST special database 19

using the original training and testing splits.

6.2 Testing and Accuracy 61

ix

LIST OF ABBREVATIONS

OCR

Optical Character Recognition

VPP Vertical Projection Profile

TDP Top Down Profile

CNN Convolutional Neural Network

RNN Recurrent Neural Network

RGB Red Green Blue

HTR Hand written Text Recognition

GPU Graphical Processing Unit

ReLU Rectified Linear Unit

MSE Mean Square Error

MAE Mean Absolute Error

MBE Mean Bias Error

SVM Support Vector Machine

1

1. INTRODUCTION








shape and size. There are many advantages of OCR. When a printed text is

converted to machine readable text then we can search through it with keywords,

compress, edit and send it, and can store in much less space. OCR has numerous

applications. It is used by blind and visually impaired persons. In banking and legal

department, it is used to digitize the documents. Barcode recognition technique is

used in retail industry which is also related to OCR. It is widely used in education,

finance and automatic detection of number plate. The main challenge in the

recognition of handwritten characters is that every person on the earth has different

handwriting. There are various other factors also which causes difference in

handwriting such as multi-orientations, skewness of the text lines, overlapping

characters, connected components, pressure points etc. Many scripts are there with

their intrinsic variations. A single character can be written in many forms, so it is a

challenging task to recognize a particular handwritten character.

There are six steps in OCR:

They are as follows:

• Image acquisition

• Pre-processing

• Segmentation

• Feature extraction

• Classification

• Post-processing