OPTICAL CURSIVE HANDWRITTEN RECOGNITION
USING VPP & TDP NATIVE SEGMENTATION
ALGORITHMS AND NEURAL NETWORKS (PYTORCH)
A Project report submitted in partial fulfillment of the requirements for
the award of the degree of
BACHELOR OF TECHNOLOGY
IN
COMPUTER SCIENCE ENGINEERING
Submitted by
G. TIRUMALESH 316126510081
K. PRATIMA 316126510087
K. L. SRINIVAS 316126510088
N. ARUN 316126510100
Y. HEMANTH 316126510120
Under the guidance of
B. SIVA JYOTHI (Assistant Professor)
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
ANIL NEERUKONDA INSTITUTE OF TECHNOLOGY AND SCIENCES
(UGC AUTONOMOUS)
(Permanently Affiliated to AU, Approved by AICTE and Accredited by NBA & NAAC with ‘A’ Grade)
Sangivalasa, Bheemili Mandal, Visakhapatnam District (A.P.)
2019-2020
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
ANIL NEERUKONDA INSTITUTE OF TECHNOLOGY AND SCIENCES
(UGC AUTONOMOUS)
(Affiliated to AU, Approved by AICTE and Accredited by NBA & NAAC with ‘A’ Grade)
Sangivalasa, Bheemili Mandal, Visakhapatnam District (A.P.)
BONAFIDE CERTIFICATE
This is to certify that the project report entitled “OPTICAL CURSIVE
HANDWRITTEN RECOGNITION USING VPP & TDP NATIVE
SEGMENTATION ALGORITHMS AND NEURAL NETWORKS
(PYTORCH)” submitted by G.TIRUMALESH (316126510081), K.PRATIMA
(316126510087), K.L.SRINIVAS (316126510088), N.ARUN (316126510100),
Y.HEMANTH (316126510120) in partial fulfillment of the requirements for the
award of the degree of Bachelor of Technology in Computer Science Engineering
of Anil Neerukonda Institute of Technology and Sciences (A), Visakhapatnam is a
record of bonafide work carried out under my guidance and supervision.
Project Guide Head of the Department
B. SIVA JYOTHI Dr. R. SHIVARANJANI
ASSISTANT PROFESSOR HEAD OF THE DEPARTMENT
DEPT. OF CSE DEPT. OF CSE
DECLARATION
We, G.TIRUMALESH, K.PRATIMA, K.L.SRINIVAS, N.ARUN,
Y.HEMANTH, of final semester B.Tech., in the department of Computer
Science and Engineering from ANITS, Visakhapatnam, hereby declare that the
project work entitled “OPTICAL CURSIVE HANDWRITTEN
RECOGNITION USING VPP & TDP NATIVE SEGMENTATION
ALGORITHMS AND NEURAL NETWORKS (PYTORCH)” is carried
out by us and submitted in partial fulfillment of the requirements for the award
of Bachelor of Technology in Computer Science Engineering, under Anil
Neerukonda Institute of Technology & Sciences (A) during the academic years
2016-2020, and has not been submitted to any other university for the award of
any kind of degree.
G. TIRUMALESH 316126510081
K. PRATIMA 316126510087
K. L. SRINIVAS 316126510088
N. ARUN 316126510100
Y. HEMANTH 316126510120
ACKNOWLEDGEMENT
An endeavor over a long period can be successful only with the advice and support of many well-
wishers. We take this opportunity to express our gratitude and appreciation to all of
them.
We owe our tributes to Dr. R. SHIVARANJANI, Head of the
Department, Computer Science & Engineering for her valuable support and
guidance during the period of project implementation.
We wish to express our sincere thanks and gratitude to our project guide B.
SIVA JYOTHI, Assistant Professor, Department of Computer Science &
Engineering, ANITS, for the stimulating discussions in analyzing problems
associated with our project work and for guiding us throughout the project. Project
meetings were highly informative. We express our sincere thanks for the
encouragement, untiring guidance and the confidence they had shown in us. We are
immensely indebted for their valuable guidance throughout our project.
We also thank all the staff members of CSE department for their valuable
advices.
We also thank the supporting staff for providing resources as and when required.
G. TIRUMALESH 316126510081
K. PRATIMA 316126510087
K. L. SRINIVAS 316126510088
N. ARUN 316126510100
Y. HEMANTH 316126510120
ABSTRACT
In the field of Artificial Intelligence, scientists have brought a revolutionary
change to image processing, and one of the biggest challenges in it is to identify
documents in handwritten formats. One of the most widely used techniques for the
validation of these types of documents is character recognition. Optical Character
Recognition (OCR) is an extensively employed method to transform handwritten data
of any form into an electronic format. Numerous techniques have been introduced
that can be used to recognize handwriting of any form and language. A number of
techniques are available in the literature for feature extraction and for training CR
systems, each with its own strengths and weaknesses. We explore these techniques
to design an optimal cursive handwriting recognition system based on character
recognition. This work represents the process of converting handwritten text to a
computer-typed document, i.e., optical cursive handwriting recognition (OCR), by
using segmentation algorithms such as VPP (vertical projection profile) and TDP
(top-down profile) and other histogram (vertical and horizontal) projection
algorithms to achieve the solution. For feature extraction and character recognition
we use PyTorch, an open-source machine learning library in Python used for
computer vision and natural language processing.
CONTENTS

ABSTRACT
LIST OF FIGURES
LIST OF SYMBOLS
LIST OF TABLES
LIST OF ABBREVIATIONS
CHAPTER 1 INTRODUCTION
1.1 Prerequisites
1.1.1 Software Requirements
1.1.2 Hardware Requirements
1.1.3 Data Requirements
1.2 Python
1.2.1 Machine Learning in Python
1.2.1.1 Numpy
1.2.1.2 Pandas
1.2.1.3 Opencv
1.2.1.4 Sk-Learn
1.2.1.5 Matplotlib
1.2.2 Neural Networks in Python
1.2.2.1 Loss Function
1.3 Image Processing
1.3.1 Types of Images
1.3.2 Brightness and Contrast
1.4 Convolutional Neural Network
1.4.1 Convolutional Layer
1.4.2 Pooling Layer
1.4.3 Classification
1.5 Problem Statement
CHAPTER 2 LITERATURE SURVEY
2.1 Existing Methods for Character Recognition System
CHAPTER 3 METHODOLOGY
3.1 System Architecture
3.2 Algorithm
3.2.1 Algorithm for Line Segmentation
3.2.2 Algorithm for Word Segmentation
3.2.3 Algorithm for Character Segmentation
3.2.3.1 Character Segmentation Using VPP
3.2.3.2 Touching Character Segmentation
3.3 Proposed Work
3.3.1 Image Scanning
3.3.2 Pre-processing
3.3.3 Segmentation
3.3.3.1 Line Segmentation
3.3.3.2 Word Segmentation
3.3.3.3 Character Segmentation
3.3.4 Feature Extraction
3.3.5 Classification
3.3.6 Post-processing
CHAPTER 4 PYTORCH
4.1 Pytorch Library Tools
4.2 Pytorch in Research
4.3 Training an Image Classifier Using Pytorch
CHAPTER 5 SAMPLE CODE ELABORATION
5.1 Pre-processing the Data
5.1.1 Data Organization
5.1.2 Image Sizing and Shaping
5.1.3 Image Blurring Kernel Filter
5.1.4 Applying Kernel Filters and Contours on Image
5.2 Line Segmentation
5.2.1 Calculating Line Intensity (Horizontal Histograms)
5.2.3 Evaluating Threshold for Line Segmentation
5.2.4 Segmenting Paragraph into Sentences
5.3 Word Segmentation
5.3.1 Combining Two Missegmented Words
5.3.2 Segmenting Sentence into Words
5.4 Character Segmentation
5.4.1 Evaluating VPP Intensity
5.4.2 First Level Character Segmentation Using VPP
5.4.3 Segmenting Word into Characters under VPP
5.4.4 Evaluating VPP and TDP Average Intensity
5.4.5 Connected Components Segmentation
5.4.6 Further Required Segmentation on Connected Components
5.5 Training the Model
5.5.1 Data Loader
5.5.2 Defining Transforms and Parameters
5.5.3 Importing the Model
5.5.4 Training the Model
5.5.5 Testing the Model
CHAPTER 6 RESULTS AND DISCUSSIONS
6.1 Input & Output
6.2 Training Datasets
6.3 Experimental Results and Analysis
CHAPTER 7 CONCLUSION
REFERENCES
LIST OF FIGURES

Fig. No Topic Name
1.1 Flow-chart for OCR
1.2 Machine learning overview
1.3 Architecture of a 2-layer Neural Network
1.4 Illustration of flow of network
1.5 A CNN Sequence
1.6 Convolution operation with kernel
1.7 Performing Pooling operation
1.8 Describing Classification Process
3.1 Block Diagram of Proposed System
3.2 Steps in pre-processing
3.3 Scanned input image
3.4 Pre-processing
3.5 Blur image (for noise removal)
3.6 Binary image in color contrast
3.7 Horizontal projection graph
3.8 Segmented lines from the image
3.9 Segmented line
3.10 First level word segmentation
3.11 Second level word segmentation
3.12 Segmented word
3.13 VPP intensity graph for word ‘MOVE’
3.14 First level VPP character segmentation
3.15 Touching characters (connected component)
3.16 VPP intensity graph for word ‘MO’
3.17 TDP intensity graph for word ‘MO’
3.18 Combined intensity graph for word ‘MO’
3.19 Final level character segmentation
3.20 Feature extraction of characters
3.21 Zero padding
3.22 Convolution layer
3.23 Max pooling and Average pooling
3.24 Flatten the image
3.25 ReLU activation function
3.26 Fully connected layer with classes (x and o) along with probabilities
3.27 Final result
6.1 Input image
LIST OF SYMBOLS

x Input layer
ŷ Output layer
W Weights
b Biases
σ Activation function
Σ Summation
LIST OF TABLES

Table No. Table Name
6.1 Breakdown of the number of available training and testing samples in the
NIST special database 19, using the original training and testing splits
6.2 Testing and Accuracy
LIST OF ABBREVIATIONS

OCR Optical Character Recognition
VPP Vertical Projection Profile
TDP Top Down Profile
CNN Convolutional Neural Network
RNN Recurrent Neural Network
RGB Red Green Blue
HTR Handwritten Text Recognition
GPU Graphics Processing Unit
ReLU Rectified Linear Unit
MSE Mean Square Error
MAE Mean Absolute Error
MBE Mean Bias Error
SVM Support Vector Machine
1. INTRODUCTION
Optical character recognition, also called optical character reading and
abbreviated as OCR, translates images into a machine-readable format such as ASCII
or Unicode. Character recognition can be classified into two types based on the type
of the text, i.e., machine-printed text and handwritten text. Character recognition of
handwritten text is more challenging than that of machine-printed text, because
machine-printed characters are straight, with uniform alignment and spacing, while
handwritten characters are not uniform and vary greatly in shape and size. There are
many advantages of OCR. When a printed text is converted to machine-readable text,
we can search through it with keywords, compress it, edit it, send it, and store it in
much less space. OCR has numerous applications. It is used by blind and visually
impaired persons. In banking and legal departments, it is used to digitize documents.
The barcode recognition technique used in the retail industry is also related to OCR.
It is widely used in education, finance and automatic number-plate detection. The
main challenge in the recognition of handwritten characters is that every person on
earth has different handwriting. There are various other factors which cause
differences in handwriting, such as multiple orientations, skewness of the text lines,
overlapping characters, connected components, pressure points, etc. Many scripts
exist, each with its intrinsic variations. A single character can be written in many
forms, so it is a challenging task to recognize a particular handwritten character.
There are six steps in OCR, as follows:
• Image acquisition
• Pre-processing
• Segmentation
• Feature extraction
• Classification
• Post-processing
Fig. 1.1 Flow-chart for OCR
1.1 Prerequisites:
1.1.1 Software requirements:
1. Python version - 3.0
2. Python IDE – Pycharm
3. Data science libraries – Matplotlib, numpy, PIL, Pytorch, Pandas
1.1.2 Hardware requirements:
1. CPU – 8 to 16 nodes, each with an octa-core processor, in a distributed
network
2. RAM – 128 to 256 GB
3. Storage – 30 to 50 GB
4. Entirely organized in a cloud network
1.1.3 Data Requirements:
1. NIST DATA SET – Characters
2. IAM DATASET – Forms, Sentences, Words
1.2 Python:
Python is a popular programming language. It was created by Guido van Rossum, and
released in 1991.
It is used for:
web development (server-side),
software development,
mathematics,
system scripting.
Python can do the following:
Python can be used on a server to create web applications.
Python can be used alongside software to create workflows.
Python can connect to database systems. It can also read and modify files.
Python can be used to handle big data and perform complex mathematics.
Python can be used for rapid prototyping, or for production-ready software
development.
Advantages of python are mentioned below:
Python works on different platforms (Windows, Mac, Linux, Raspberry Pi,
etc).
Python has a simple syntax similar to the English language.
Python has syntax that allows developers to write programs with fewer lines
than some other programming languages.
Python runs on an interpreter system, meaning that code can be executed as
soon as it is written. This means that prototyping can be very quick.
Python can be treated in a procedural way, an object-oriented way or a
functional way.
1.2.1 Machine learning in python :
Fig 1.2 Machine learning overview
Machine learning is learning based on experience. As an example, it is like a
person who learns to play chess through observing others play. In this way,
computers can be programmed through the provision of information on which they are
trained, acquiring the ability to identify elements or their characteristics with high
probability.
There are various stages of machine learning:
data collection
data sorting
data analysis
algorithm development
checking the generated algorithm
using the algorithm to draw further conclusions
Machine learning algorithms are divided into two groups:
Unsupervised learning
Supervised learning
With unsupervised learning, your machine receives only a set of input data.
Thereafter, it is up to the machine to determine the relationship between the entered
data and any other hypothetical data. Unlike supervised learning, where the machine
is provided with some verification data for learning, unsupervised learning implies
that the computer itself will find patterns and relationships between different data
sets. Unsupervised learning can be further divided into clustering and association.
Supervised learning implies the computer's ability to recognize elements based
on the provided samples. The computer studies the samples and develops the ability to recognize
new data based on this data. For example, you can train your computer to filter spam
messages based on previously received information.
Some Supervised learning algorithms include:
Decision trees
Support-vector machine
Naive Bayes classifier
k-nearest neighbours
linear regression
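As an illustration of supervised learning, the following is a minimal sketch that
trains a k-nearest neighbours classifier with scikit-learn on its bundled digits
dataset. The dataset and parameter choices here are illustrative assumptions, not
part of this project's pipeline.

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)          # 8x8 digit images, flattened
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = KNeighborsClassifier(n_neighbors=5)    # learns from labelled samples
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))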
1.2.1.1 Numpy :
NumPy is the fundamental package needed for scientific computing with
Python. This package contains:
a powerful N-dimensional array object
sophisticated (broadcasting) functions
basic linear algebra functions
basic Fourier transforms
sophisticated random number capabilities
tools for integrating Fortran code
tools for integrating C/C++ code
Besides its obvious scientific uses, NumPy can also be used as an efficient
multi-dimensional container of generic data. Arbitrary data types can be defined. This
allows NumPy to seamlessly and speedily integrate with a wide variety of databases.
NumPy is a successor for two earlier scientific Python libraries: Numeric and
Numarray.
1.2.1.2 Pandas :
Pandas is a popular Python package for data science, and with good reason: it
offers powerful, expressive and flexible data structures that make data manipulation
and analysis easy, among many other things. The DataFrame is one of these structures.
Those who are familiar with R know the data frame as a way to store data in
rectangular grids that can easily be overviewed. Each row of these grids corresponds
to measurements or values of an instance, while each column is a vector containing
data for a specific variable. This means that a data frame’s rows do not need to
contain, but can contain, the same type of values: they can be numeric, character,
logical, etc.
Now, DataFrames in Python are very similar: they come with the Pandas
library, and they are defined as two-dimensional labeled data structures with columns
of potentially different types.
In general, you could say that the Pandas DataFrame consists of three main
components: the data, the index, and the columns.
Firstly, the DataFrame can contain data that is:
a Pandas DataFrame
a Pandas Series: a one-dimensional labeled array capable of holding any data
type with axis labels or index. An example of a Series object is one column
from a DataFrame.
a NumPy ndarray, which can be a record array or a structured array
a two-dimensional ndarray
dictionaries of one-dimensional ndarray’s, lists, dictionaries or Series.
Note the difference between np.ndarray and np.array(): the former is an actual
data type, while the latter is a function that makes arrays from other data structures.
Structured arrays allow users to manipulate the data by named fields: in the
example below, a structured array of three tuples is created. The first element of each
tuple will be called foo and will be of type int, while the second element will be
named bar and will be a float.
Record arrays, on the other hand, expand the properties of structured arrays.
They allow users to access fields of structured arrays by attribute rather than by index.
You see below that the foo values are accessed in the r2 record array.
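Since the original example is referenced above but not shown, here is a minimal
reconstruction of it; the exact values are illustrative assumptions.

import numpy as np

# Structured array of three tuples: field 'foo' is an int, field 'bar' a float.
x = np.array([(1, 2.0), (3, 4.0), (5, 6.0)],
             dtype=[('foo', 'i4'), ('bar', 'f4')])
print(x['foo'])        # fields are accessed by name -> [1 3 5]

# Record array: the same data viewed so that fields become attributes.
r2 = x.view(np.recarray)
print(r2.foo)          # attribute access -> [1 3 5]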
1.2.1.3 Opencv :
OpenCV (Open Source Computer Vision Library) is an open source computer
vision and machine learning software library. OpenCV was built to provide a common
infrastructure for computer vision applications and to accelerate the use of machine
perception in commercial products. Being a BSD-licensed product, OpenCV
makes it easy for businesses to utilize and modify the code.
The library has more than 2500 optimized algorithms, which includes a
comprehensive set of both classic and state-of-the-art computer vision and machine
learning algorithms. These algorithms can be used to detect and recognize faces,
identify objects, classify human actions in videos, track camera movements, track
moving objects, extract 3D models of objects, produce 3D point clouds from stereo
cameras, stitch images together to produce a high resolution image of an entire scene,
find similar images from an image database, remove red eyes from images taken using
flash, follow eye movements, recognize scenery and establish markers to overlay it
with augmented reality, etc. OpenCV has a user community of more than 47 thousand
people and an estimated number of downloads exceeding 18 million. The library is
used extensively in companies, research groups and by governmental bodies.
Along with well-established companies like Google, Yahoo, Microsoft, Intel,
IBM, Sony, Honda, Toyota that employ the library, there are many startups such as
Applied Minds, VideoSurf, and Zeitera, that make extensive use of OpenCV.
OpenCV’s deployed uses span the range from stitching streetview images together,
detecting intrusions in surveillance video in Israel, monitoring mine equipment in
China, helping robots navigate and pick up objects at Willow Garage, detection of
swimming pool drowning accidents in Europe, running interactive art in Spain and
New York, checking runways for debris in Turkey, inspecting labels on products in
factories around the world on to rapid face detection in Japan.
It has C++, Python, Java and MATLAB interfaces and supports Windows,
Linux, Android and Mac OS. OpenCV leans mostly towards real-time vision
applications and takes advantage of MMX and SSE instructions when available.
Full-featured CUDA and OpenCL interfaces are being actively developed right now.
There are over 500 algorithms and about 10 times as many functions that compose or
support those algorithms. OpenCV is written natively in C++ and has a templated
interface that works seamlessly with STL containers.
1.2.1.4 Sk-Learn :
Scikit-learn provides a range of supervised and unsupervised learning
algorithms via a consistent interface in Python.
It is licensed under a permissive simplified BSD license and is distributed
with many Linux distributions, encouraging academic and commercial use. The
library is built upon SciPy (Scientific Python), which must be installed before you
can use scikit-learn. The SciPy stack includes:
NumPy: Base n-dimensional array package
SciPy: Fundamental library for scientific computing
Matplotlib: Comprehensive 2D/3D plotting
IPython: Enhanced interactive console
Sympy: Symbolic mathematics
Pandas: Data structures and analysis
Extensions or modules for SciPy are conventionally named SciKits. As such,
the module provides learning algorithms and is named scikit-learn. The vision for the
library is a level of robustness and support required for use in production systems. This
means a deep focus on concerns such as ease of use, code quality, collaboration,
documentation and performance.
Although the interface is Python, C libraries are leveraged for performance, such
as NumPy for arrays and matrix operations, LAPACK, LibSVM and the careful use
of Python. The library is focused on modelling data. It is not focused on loading,
manipulating and summarizing data. For these features, refer to NumPy and Pandas.
Some popular groups of models provided by scikit-learn include:
Clustering: for grouping unlabelled data such as K-Means.
Cross Validation: for estimating the performance of supervised models on unseen
data.
Datasets: for test datasets and for generating datasets with specific properties for
investigating model behaviour.
Dimensionality Reduction: for reducing the number of attributes in data for
summarization, visualization and feature selection such as Principal component
analysis.
Ensemble methods: for combining the predictions of multiple supervised models.
Feature extraction: for defining attributes in image and text data.
Feature selection: for identifying meaningful attributes from which to create
supervised models.
Parameter Tuning: for getting the most out of supervised models.
Manifold Learning: For summarizing and depicting complex multi-dimensional
data.
Supervised Models: a vast array not limited to generalized linear models,
discriminate analysis, naive bayes, lazy methods, neural networks, support vector
machines and decision trees.
1.2.1.5 Matplotlib:
Matplotlib is an amazing visualization library in Python for 2D plots of arrays.
Matplotlib is a multi-platform data visualization library built on NumPy arrays and
designed to work with the broader SciPy stack. It was introduced by John Hunter in
the year 2002.
One of the greatest benefits of visualization is that it allows us visual access to
huge amounts of data in easily digestible visuals. Matplotlib provides several plot
types, such as line, bar, scatter and histogram plots.
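A minimal sketch of two such plots (the values here are arbitrary illustrations):

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2 * np.pi, 100)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.plot(x, np.sin(x))                      # line plot
ax2.hist(np.random.randn(1000), bins=30)    # histogram
plt.show()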
1.2.2 Neural networks in python :
A neural network is a mathematical function that maps a given input to a
desired output.
Neural Networks consist of the following components:
An input layer, x
An arbitrary amount of hidden layers
An output layer, ŷ
A set of weights and biases between each layer, W and b
A choice of activation function for each hidden layer, σ.
Fig 1.3 Architecture of a 2-layer Neural Network
The output ŷ of a simple 2-layer neural network is:

ŷ = σ(W2 · σ(W1 · x + b1) + b2)    (1.1)

The weights W and the biases b are the only variables that affect the output ŷ.
Training the network therefore consists of two steps:
Calculating the predicted output ŷ, known as feedforward
Updating the weights and biases, known as backpropagation
The sequential graph below illustrates the process.
Fig 1.4 Illustration of flow of network
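A minimal NumPy sketch of the feedforward pass of equation (1.1); the layer sizes
and random values are assumptions for illustration only.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.random(3)                     # assumed: 3 inputs, 4 hidden units, 1 output
W1, b1 = rng.random((4, 3)), rng.random(4)
W2, b2 = rng.random((1, 4)), rng.random(1)

# Feedforward, matching eq. (1.1): y_hat = sigma(W2 * sigma(W1*x + b1) + b2)
hidden = sigmoid(W1 @ x + b1)
y_hat = sigmoid(W2 @ hidden + b2)
print(y_hat)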
1.2.2.1 Loss Function:
There are many available loss functions, and the nature of our problem should
dictate our choice of loss function.
The common loss functions are mentioned below:
Regression losses: Mean Square Error / Quadratic Loss / L2 Loss

MSE = (1/n) Σ (y_i − ŷ_i)²    (1.2)

That is, the mean square error is the average of the squared differences between
each predicted value and the actual value. The difference is squared so that the
magnitude of the error is measured regardless of its sign.
Mean Absolute Error / L1 Loss:

MAE = (1/n) Σ |y_i − ŷ_i|    (1.3)

It is measured as the average of the sum of absolute differences between
predictions and actual observations. Like MSE, this as well measures the magnitude
of the error without considering its direction. Unlike MSE, MAE needs more
complicated tools such as linear programming to compute the gradients. Plus, MAE
is more robust to outliers since it does not make use of the square.
Mean Bias Error:

MBE = (1/n) Σ (y_i − ŷ_i)    (1.4)

This is much less common in the machine learning domain compared to its
counterparts. It is the same as MAE, with the only difference that we do not take
absolute values. Clearly there is a need for caution, as positive and negative errors
could cancel each other out. Although less accurate in practice, it can determine
whether the model has a positive or a negative bias.
Classification losses: Hinge Loss / Multi-class SVM Loss

SVMLoss = Σ_{j ≠ y_i} max(0, s_j − s_{y_i} + 1)    (1.5)

The score of the correct category should be greater than the sum of the scores of
all incorrect categories by some safety margin (usually one). Hence hinge loss is
used for maximum-margin classification, most notably for SVMs. Although not
differentiable, it is a convex function, which makes it easy to work with the usual
convex optimizers used in the machine learning domain.
Cross-Entropy Loss / Negative Log Likelihood:

CrossEntropyLoss = −(y_i log(ŷ_i) + (1 − y_i) log(1 − ŷ_i))    (1.6)

This is the most common setting for classification problems. Cross-entropy loss
increases as the predicted probability diverges from the actual label. When the
actual label is 1 (y_i = 1), the second half of the function disappears, whereas when
the actual label is 0 (y_i = 0) the first half is dropped off. In short, we are just taking
the log of the predicted probability for the ground-truth class. An important aspect
of this is that cross-entropy loss heavily penalizes predictions that are confident but
wrong.
Finally, the loss function guides training: it helps us find the best set of weights
and biases, namely the set that minimizes the loss.
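As a sketch, the losses above are available ready-made in PyTorch; the tensors here
are toy values for illustration.

import torch
import torch.nn as nn

pred = torch.tensor([2.5, 0.0, 2.0])
target = torch.tensor([3.0, -0.5, 2.0])
print(nn.MSELoss()(pred, target))    # mean square error (L2 loss)
print(nn.L1Loss()(pred, target))     # mean absolute error (L1 loss)

# Cross-entropy over raw class scores (logits) for a 3-class toy problem.
logits = torch.tensor([[1.2, 0.3, -0.8]])
label = torch.tensor([0])            # ground-truth class index
print(nn.CrossEntropyLoss()(logits, label))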
1.3 Image Processing:
Image processing is a method to perform some operations on an image, in
order to get an enhanced image or to extract some useful information from it. It is a
type of signal processing in which the input is an image and the output may be an
image or characteristics/features associated with that image. Nowadays, image
processing is among the most rapidly growing technologies. It forms a core research
area within the engineering and computer science disciplines too.
Image processing basically includes the following three steps:
Importing the image via image acquisition tools;
Analysing and manipulating the image;
Output, in which the result can be an altered image or a report that is based on
image analysis.
There are two types of methods used for image processing namely, analogue
and digital image processing. Analogue image processing can be used for the hard
copies like printouts and photographs. Image analysts use various fundamentals of
interpretation while using these visual techniques. Digital image processing techniques
help in the manipulation of digital images by using computers. The three general
phases that all types of data have to undergo while using the digital technique are
pre-processing, enhancement and display, and information extraction.
An image is nothing more than a two-dimensional signal. It is defined by the
mathematical function f(x,y), where x and y are the two coordinates, horizontal and
vertical. The value of f(x,y) at any point gives the pixel value at that point of the
image.
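For example, an image can be read and indexed as the function f(x,y); the file name
below is a placeholder.

import cv2

img = cv2.imread("page.png", cv2.IMREAD_GRAYSCALE)   # placeholder file name
value = img[10, 20]       # f at row 10, column 20: intensity in 0..255
print(img.shape, value)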
Some of the major fields in which digital image processing is widely used are
mentioned below
Image sharpening and restoration
Medical field
Remote sensing
Transmission and encoding
Machine/Robot vision
Color processing
Pattern recognition
Video processing
Microscopic Imaging
1.3.1 Types of Images:
1. The binary image:
The binary image, as its name states, contains only two pixel values, 0 and 1.
Here 0 refers to black and 1 refers to white. It is also known as monochrome.
The resulting image therefore consists of only black and white and thus can
also be called a black-and-white image.
Binary images have the PBM (Portable Bit Map) format.
2. 2, 3, 4, 5, 6 bit colour formats:
Images with a colour format of 2, 3, 4, 5 or 6 bits are not widely used today.
They were used in old times for old TV or monitor displays.
Each of these formats has more than two grey levels, and hence has grey
shades, unlike the binary image.
A 2-bit format has 4, a 3-bit 8, a 4-bit 16, a 5-bit 32 and a 6-bit 64 different
colours.
3. 8 bit colour format:
The 8-bit colour format is one of the most famous image formats. It has 256
different shades of colours in it and is commonly known as the grayscale image.
The range of colours in 8 bits varies from 0 to 255, where 0 stands for black,
255 stands for white, and 127 stands for grey.
This format was used initially by early models of the operating system UNIX
and the early colour Macintoshes.
The format of these images is PGM (Portable Grey Map). This format is not
supported by default on Windows. In order to see a grayscale image, you need to
have an image viewer or an image processing toolbox such as Matlab.
4. 16 bit colour format:
It is a colour image format with 65,536 different colours in it. It is also known
as the high colour format.
It has been used by Microsoft in systems that support more than the 8-bit
colour format.
The distribution of colour in a colour image is not as simple as it was in a
grayscale image. A 16-bit format is actually divided into three further channels,
Red, Green and Blue: the famous RGB format.
5. 24 bit colour format:
The 24-bit colour format is also known as the true colour format. Like the
16-bit colour format, in a 24-bit colour format the 24 bits are again distributed
among the three channels of Red, Green and Blue.
It is the most commonly used format. Its format is PPM (Portable Pixel Map),
which is supported by the Linux operating system. Windows has its own format for
it, which is BMP (Bitmap).
1.3.2 Brightness and Contrast:
Brightness is a visual perception in which a source appears to be reflecting
light. Brightness is a subjective property of the object being observed, and it is
different from lightness. Colour screens use three colours, i.e., the RGB scheme
(red, green and blue); the brightness of the screen depends upon the sum of the
amplitudes of the red, green and blue pixels, divided by 3.
The perception of brightness depends upon optical illusions that make things
appear brighter or darker. When the brightness is decreased, the colour appears dull,
and when the brightness is increased, the colour is clearer.
Contrast is what makes an object distinguishable. We can say that contrast is
determined by the colour and brightness of the object. Contrast is the difference
between the maximum and minimum pixel intensity of an image.
1.4 Convolutional Neural Network:
Fig 1.5 A CNN Sequence
A Convolutional Neural Network (ConvNet/CNN) is a deep learning
algorithm which can take in an input image, assign importance (learnable weights
and biases) to various aspects/objects in the image, and differentiate one from the
other. The pre-processing required in a ConvNet is much lower as compared to
other classification algorithms. While in primitive methods filters are
hand-engineered, with enough training ConvNets have the ability to learn these
filters/characteristics.
A ConvNet is able to successfully capture the Spatial and Temporal
dependencies in an image through the application of relevant filters. The architecture
performs a better fitting to the image dataset due to the reduction in the number of
parameters involved and reusability of weights. In other words, the network can be
trained to understand the sophistication of the image better.
The role of the ConvNet is to reduce the images into a form which is easier
to process, without losing features which are critical for getting a good prediction.
1.4.1. Convolution Layer:
A filter (or kernel) is an integral component of the layered architecture.
Generally, it refers to an operator applied to the entirety of the image such that it
transforms the information encoded in the pixels. In practice, however, a kernel is a
smaller-sized matrix, in comparison to the input dimensions of the image, that
consists of real-valued entries.
The real values of the kernel matrix change with each learning iteration over
the training set, indicating that the network is learning to identify which regions are
of significance for extracting features from the data.
Fig 1.6 Convolution operation with kernel
In Fig. 1.6, we convolve a 5x5x1 image with a 3x3x1 kernel (which changes
each iteration to extract significant features) to get a 3x3x1 convolved feature. The
filter moves to the right with a certain stride value until it parses the complete
width. In the case of images with multiple channels (e.g. RGB), the kernel has the
same depth as the input image. Matrix multiplication is performed between the Kn
and In stacks ([K1, I1]; [K2, I2]; [K3, I3]) and all results are summed with the bias
to give us a squashed one-depth-channel convolved feature output.
The objective of the convolution operation is to extract high-level features
such as edges from the input image. The first convolutional layer is responsible for
capturing low-level features such as edges, colour, gradient orientation, etc.
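A minimal PyTorch sketch of the operation in Fig. 1.6: a 5x5 single-channel input
convolved with a learnable 3x3 kernel at stride 1 yields a 3x3 feature map. The
input values here are random, for illustration only.

import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=3, stride=1)
x = torch.randn(1, 1, 5, 5)     # (batch, channels, height, width)
print(conv(x).shape)            # torch.Size([1, 1, 3, 3])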
1.4.2 Pooling Layer:
(a) Convoluted Output (b) Pooling Output
Fig 1.7 Performing Pooling operation
The pooling layer is responsible for reducing the spatial size of the convolved
feature. This decreases the computational power required to process the data,
through dimensionality reduction. Furthermore, it is useful for extracting dominant
features which are rotationally and positionally invariant, thus maintaining the
process of effectively training the model.
There are two types of Pooling: Max Pooling and Average Pooling. Max
Pooling returns the maximum value from the portion of the image covered by the
Kernel. On the other hand, Average Pooling returns the average of all the values
from the portion of the image covered by the Kernel.
In Fig. 1.7 we perform the max pooling operation by considering the
convolved feature output obtained from the convolution layer.
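Both pooling variants are one-liners in PyTorch; a sketch on a random 4x4 input:

import torch
import torch.nn as nn

x = torch.randn(1, 1, 4, 4)
print(nn.MaxPool2d(kernel_size=2)(x))   # maximum of each 2x2 window
print(nn.AvgPool2d(kernel_size=2)(x))   # average of each 2x2 window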
1.4.3 Classification – Fully Connected Layer (FC LAYER):
Fig 1.8 Describing Classification Process
The output of the convolutional layer is converted into a suitable form for our
multi-level perceptron by flattening the image into a column vector. The flattened
output is fed to a feed-forward neural network, and backpropagation is applied to
every iteration of training.
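A minimal sketch of this flatten-and-classify step in PyTorch; the feature-map sizes
and class count are assumptions for illustration.

import torch
import torch.nn as nn

features = torch.randn(1, 16, 7, 7)      # assumed pooled feature maps
flat = features.flatten(start_dim=1)     # column vector per sample -> (1, 784)
fc = nn.Linear(16 * 7 * 7, 10)           # fully connected layer, 10 classes assumed
probs = torch.softmax(fc(flat), dim=1)   # class probabilities
print(probs.argmax(dim=1))               # predicted class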
1.5 Problem Statement:
Optical character recognition, also called optical character reading and
abbreviated as OCR, translates images into a machine-readable format such as
ASCII or Unicode. Character recognition can be classified into two types based on
the type of the text, i.e., machine-printed text and handwritten text. Character
recognition of handwritten text is more challenging than that of machine-printed
text, because machine-printed characters are straight, with uniform alignment and
spacing, while handwritten characters are not uniform and vary greatly in shape and
size. This work aims to represent the process of converting handwritten text to a
computer-typed document, i.e., optical cursive handwriting recognition (OCR), by
using segmentation algorithms such as VPP (vertical projection profile) and TDP
(top-down profile) and other histogram (vertical and horizontal) projection
algorithms to achieve the solution. For feature extraction and character recognition
we use PyTorch, an open-source machine learning library in Python used for
computer vision and natural language processing.
2. LITERATURE SURVEY
2.1 Existing Methods for character recognition system:
In the literature of cursive English handwriting recognition, earlier studies
highlighted off-line handwritten document analysis through segmentation, skew
recognition and writing-pressure detection for cursive handwritten documents.
There are many algorithms for line, word and character segmentation [19]. The
proposed segmentation method is based on modified horizontal and vertical
projections that can segment the text lines and words even in the presence of
overlapped and multi-skewed text lines. For character segmentation there are
methods like the multi-layer perceptron [20]. The existing method was tested on
more than 550 text images from the IAM database and on sample handwriting
images written by different writers on different backgrounds. Using the existing
method, 93.65% of lines and 91.56% of words are correctly segmented from the
IAM dataset. Existing work also normalizes 92% of lines and words perfectly, with
a very small error rate. The existing skew normalization method deals with the exact
skew angle and is extremely efficient compared to techniques on hand. [4]
Each and every pixel in an image represents some information. The pixels
which contribute to the text have more information energy. Based on this
information energy, the text lines are segmented with 92% accuracy. [5] An
Artificial Neural Network is used to recognize the characters. The study includes
the performance of a convex hull feature set, i.e., 125 features computed by
considering various bay attributes of the convex hull of a pattern, for effective
recognition of isolated handwritten Bangla basic characters and digits. The
recognition rate is 76.86% for handwritten Bangla characters and 99.45% for
Bangla numerals. [3]
The work includes the study of different segmentation techniques for
handwritten character recognition. Three levels of segmentation are presented, i.e.,
text-line, word and character segmentation. The need for segmentation and the
factors which affect the segmentation process are discussed. [6]
The work contains a new approach which uses a sequence of segmentation
and recognition algorithms for the OCR of cursive handwriting. A Hidden Markov
Model (HMM) is used for recognition, with 92.3% accuracy at a lexicon size of 50.
The lexicon and HMM are combined for word-level segmentation [1]. In this work,
various segmentation levels are discussed. The Hough transform is used for
text-line segmentation. For the division of vertically connected components,
skeletonization is used, and experiments are carried out on this method [7].
In this work, a novel connectivity strength function is used for the
segmentation process. The connectivity strength parameter is used to decide the
components of the text line. It is a language-adaptive approach with an accuracy of
97.30% [2].
In most of the existing systems, recognition accuracy is heavily dependent on
the quality of the input document. In handwritten text, adjacent characters tend to
be touching or overlapped. Therefore it is essential to segment a given string
correctly into its character components. In most of the existing segmentation
algorithms, human writing is evaluated empirically to deduce rules [21]. But there
is no guarantee that these heuristic rules give optimum results in all styles of
writing. Moreover, handwriting varies from person to person, and even for the same
person it varies depending on mood, speed, etc. This requires incorporating
artificial neural networks, hidden Markov models and statistical classifiers to
extract segmentation rules based on numerical data [22][23][24].
After segmentation, the next crucial step is the representation of character
classes by features. These features should have high discriminative ability so that
they differ between character classes (for example, the 26 uppercase and 26
lowercase characters in the case of the English language, and the 10 digits). Also,
these features should be independent of the intra-class variations.
The different representation methods can be categorized into three major classes [21]:
1. Global transformation and series expansion: includes the Fourier transform,
Gabor transform, wavelets, moments and the Karhunen-Loeve expansion.
2. Statistical representation: Zoning, crossing and distances, projections.
3. Geometrical and topological representation: Extracting and counting
topological structures, geometrical properties, coding, graphs and trees etc.
Features which depend on the Fourier transform are suitable for recognizing
handwritten numerals, where 96% accuracy has been achieved [25]. Gradient
features have been widely used in CR for machine- and hand-printed binary
character images. But these features are not invariant to deformations in the
characters. In [26], a new gradient feature is used where, at each pixel, the gradient
is mapped onto 12 direction codes with an angle span of 30 degrees between the
directions.
In [27], a redesigned direction feature [28], with a view to describing the
character contour more effectively, is developed. Also, an additional global feature
was introduced in this technique to improve the recognition accuracy for those
characters that were most frequently confused with patterns of similar appearance.
But the disadvantage of this technique is its failure to deal with changes in stroke
width, as these features are extracted from non-thinned character images. Another
crucial module in a character recognition system is its pattern recognition module,
which assigns an unknown sample to a predefined class. Numerous techniques for
character recognition can be classified into four general approaches of pattern
recognition: [21]
1. Template Matching : Direct matching, deformable and elastic matching,
relaxation matching.
2. Statistical techniques : Parametric recognition, non-parametric recognition,
HMM, fuzzy set reasoning.
3. Structural techniques: Grammatical methods, graphical methods.
4. Neural networks : Multilayer perceptron, radial basis function, support vector
machine
Character recognition techniques have to cope with the high variability of
handwritten cursive letters and their intrinsic ambiguity (letters like “e” and “l” or
“u” and “n” can have the same shape). They should also be able to adapt to changes
in the input data. Template matching, statistical techniques and structural
techniques can be used when the input data is uniform over time, whereas a neural
network (NN) classifier can learn changes in the input data. An NN also has a
parallel structure, because of which it can perform computation at a higher rate than
classical techniques. Therefore, we choose neural networks for character
recognition in our system.
The features that are used for training the neural network classifier also play a
very important role. The choice of a good feature vector can significantly enhance the
performance of a character classifier whereas a poor one may degrade its performance
considerably. It is found in the literature that generally separate classifiers are used for
the upper and the lower case English character classes to improve the recognition
accuracy. Moreover, good recognition accuracy could be achieved only for
handwritten numerals.
In this work, we focus on developing an OCR system for the recognition of
handwritten English words. We first segment the words into individual characters
and then represent these characters by features that have good discriminative
abilities. We also explore different neural network classifiers to find the best
classifier for the OCR system. We combine different OCR techniques in parallel so
that the recognition accuracy of the system can be improved.
3. METHODOLOGY
3.1 System Architecture:
Fig. 3.1 Block Diagram of Proposed System
A. Image acquisition:
Images can be obtained by taking a photograph or by scanning the input
document.
B. Pre-processing:
Pre-processing techniques can be applied after image acquisition and
segmentation. These are used to remove the noise from an image and enhance the
image for further processing. Pre-processing techniques include noise removal,
skew correction, cropping and resizing, normalization, thinning, binarization and
skeletonization. Morphological operations such as dilation and erosion can also be
applied to the input scanned image.
The steps in pre-processing are shown in figure 3.2 below:
Fig. 3.2 Steps in pre-processing
C. Segmentation:
Segmentation is of three types i.e. line, word and character segmentation.
Line segmentation separates the lines from a paragraph. Word segmentation
separates the words from a line and character segmentation separates the characters
from a word.
D. Feature Extraction:
Feature extraction is an important step in the recognition process. In this
process, all the essential information about a character which is present in an image
is extracted.
E. Classification:
In classification, an unknown sample is assigned to the predefined class.
According to the extracted features, characters are classified and recognized.
F. Post-processing:
To achieve more accuracy, various post-processing techniques are used, for
example, matching a recognized word with a dictionary word.
3.2 Algorithm(s):
The horizontal projection method is used to segment a line from a paragraph.
As a first step, the horizontal histogram of the image is created. The average height
of a rising section is taken as the threshold. Then the height of each rising section is
checked; if it is greater than or equal to the threshold, the line is segmented from the
binary image.
3.2.1 Algorithm for Line Segmentation:
1) Read a handwritten document image as a multidimensional array.
2) Check whether the image is a binary image or not. If it is a binary image,
store it into a 2-d array IMG[][] with size M×N and go to Step 4; otherwise
go to Step 3.
3) Convert the image to a binary image and store it into a 2-d array IMG[][].
4) Construct the horizontal projection histogram of the image IMG[][] and
store it into a 2-d array HPH[][].
5) Measure the height, starting row position and ending row position of each
horizontally rising section of the horizontal projection histogram and store
them into a 3-d array LH[][][] sequentially.
6) Count the number of rising sections by counting the rows of the 3-d array
LH[][][]. Then measure the threshold (Ti) value by calculating the average
height of the rising sections from the 3-d array LH[][][].
7) Select each rising section from the 3-d array LH[][][] and check whether the
height of that rising section is less than the threshold or not. If yes, that
rising section is not considered as a line; go to Step 9. Otherwise the rising
section is treated as a line; go to Step 8.
8) Find the rising section's starting and ending row numbers from the array
LH[][][]. Let the starting and ending rows be r1 and r2 respectively. Extract
the line segment between r1 and r2 from the original binary image denoted
by IMG[][].
9) Go to Step 7 for the next rising section until all rising sections have been
considered; otherwise go to the next step.
10) End.
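A minimal NumPy sketch of Steps 4-9 above, assuming a binary image in which
text pixels are 1 and background pixels are 0:

import numpy as np

def segment_lines(img_bin):
    hph = img_bin.sum(axis=1)      # horizontal projection histogram (Step 4)
    rising, start = [], None       # (start_row, end_row) of each rising section
    for r, v in enumerate(hph):
        if v > 0 and start is None:
            start = r
        elif v == 0 and start is not None:
            rising.append((start, r - 1))
            start = None
    if start is not None:
        rising.append((start, len(hph) - 1))
    # Threshold Ti = average height of the rising sections (Step 6).
    ti = np.mean([r2 - r1 + 1 for r1, r2 in rising])
    # Keep only sections at least Ti high as lines (Steps 7-9).
    return [img_bin[r1:r2 + 1] for r1, r2 in rising if (r2 - r1 + 1) >= ti]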
3.2.2 Algorithm for Word Segmentation:
1) Read a segmented binary line as a 2-d binary image LN[][].
2) Construct the vertical projection histogram of the line LN[][] and store it
into a 2-d array LVP[][].
3) From the vertical projection histogram (LVP[][]), measure the width of
each inter-word and intra-word gap and store the widths into a 1-d array
GAPSW[].
4) Count the total number of gaps as TGP by calculating the size of GAPSW[].
Add the widths of all gaps by adding the elements of GAPSW[] and store
the sum into TWD.
5) Calculate the threshold (Ti) as follows: Ti = TWD / TGP, where Ti is the
threshold value denoting the average width of the gaps, TWD denotes the
total width of all gaps and TGP denotes the total number of gaps.
6) For each i (1 <= i <= sizeof(GAPSW[])), if GAPSW[i] >= Ti then the gap is
treated as an inter-word gap; otherwise it is treated as an intra-word gap.
Depending on the inter-word gap widths, words are segmented from the
line.
7) End
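A minimal NumPy sketch of the word-segmentation steps above, under the same
binary-image assumption (text pixels are 1):

import numpy as np

def segment_words(line_bin):
    lvp = line_bin.sum(axis=0)       # vertical projection histogram (Step 2)
    gaps, start = [], None           # (start_col, end_col) of each gap (Step 3)
    for c, v in enumerate(lvp):
        if v == 0 and start is None:
            start = c
        elif v > 0 and start is not None:
            gaps.append((start, c - 1))
            start = None
    widths = [c2 - c1 + 1 for c1, c2 in gaps]
    ti = sum(widths) / max(len(widths), 1)    # Ti = TWD / TGP (Steps 4-5)
    # Cut at the end of each gap at least Ti wide (Step 6).
    cuts = [0] + [c2 + 1 for (c1, c2), w in zip(gaps, widths) if w >= ti]
    cuts.append(line_bin.shape[1])
    words = [line_bin[:, a:b] for a, b in zip(cuts, cuts[1:])]
    return [w for w in words if w.any()]      # drop empty slices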
3.2.3 Algorithm for Character Segmentation:
3.2.3.1 Character Segmentation Using VPP:
At first, the proposed algorithm uses the VPP of the binary image obtained
after word segmentation for character segmentation. The VPP represents, as a
graph, the total number of white pixels in the vertical direction of the binary image.
Because the boundaries of the characters are regions composed of background in
the vertical direction, where the value of the VPP is zero, the text region is
separated at these regions. When the width of a separated character image is longer
than 0.8 times its height (a feature of the printed character in the slab image), the
separated character image is judged to be a touching character.
Width > 0.8 * Height: Touching Character    (3.1)
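A sketch of the VPP and of the touching-character test of eq. (3.1), again assuming
white (text) pixels are 1:

import numpy as np

def vpp(img_bin):
    # Vertical projection profile: white pixels per column.
    return img_bin.sum(axis=0)

def is_touching(char_img):
    # Eq. (3.1): Width > 0.8 * Height flags a touching-character segment.
    h, w = char_img.shape
    return w > 0.8 * h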
3.2.3.2 Touching Character Segmentation:
Boundaries of touching characters are located at the valley points of the VPP
(vertical projection profile) or the TDP (top-down profile). The TDP represents, as
a graph, the position of the first white pixel in each column. Because not all valley
points are boundaries of touching characters, all candidate boundary points are
extracted. For VPP and TDP analysis, the binary image, the feature binary image
and the gray image are used. White pixels in the feature binary image are composed
of peak, hillside and ridge points of the topographic features of the gray image [10].
All extracted candidate boundary points are combined to calculate the score
graph. Real boundary regions of characters have a large value in the score graph.
Cost Graph = (VPP + TDP) / 2    (3.2)
Combined boundary points are selected from the score graph recursively.
When more combined boundary points are found than real character boundary
points, we should choose the correct boundary points. After making up all cases
which are able to separate the touching character from the combined boundary
points, the proposed algorithm selects the correct case, the one that has the
minimum distance between the separated character images and the representative
images, using a recognition-based method. The representative image displays the
recognition result of the separated character image.
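A sketch of the combined profile of eq. (3.2); the handling of empty columns in the
TDP is an assumption of this sketch:

import numpy as np

def tdp(img_bin):
    # Top-down profile: row index of the first white pixel in each column.
    h, _ = img_bin.shape
    first = np.argmax(img_bin > 0, axis=0).astype(float)
    first[img_bin.sum(axis=0) == 0] = h   # assumed: empty columns fall to the bottom
    return first

def score_graph(img_bin):
    # Cost Graph = (VPP + TDP) / 2, eq. (3.2); candidate boundary points are
    # taken at the valleys (local minima) of this combined profile.
    return (img_bin.sum(axis=0).astype(float) + tdp(img_bin)) / 2.0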
3.3 Proposed Work:
The process for optical cursive handwriting recognition and the required
algorithms for the various levels of segmentation and for character recognition
using PyTorch are as follows.
It comprises six steps:
1. Image scanning
2. Pre-processing
3. Segmentation
4. Feature extraction
5. Classification
6. Post-processing
3.3.1 Image Scanning:
The input image can be obtained either by scanning an already existing
handwritten image file (png, jpg) or by capturing the image instantly, to provide the
input data to the model.
Fig 3.3 Scanned input image
3.3.2 Pre-Processing:
The main goal here is to make the input image free from noise. As a first step,
convert the RGB image to a grayscale image and gently sharpen the given input
image to avoid loss of edges. Calculate the mean gray intensity value to reduce the
brightness of the obtained grayscale image on a threshold value of less than 0.65,
and increase the contrast to distinguish the character boundaries [8]. The text which
is present in the obtained result may turn dim and blurred because of improper
scanning of the text image.
Fig 3.4 Pre-processing
To overcome this, binarization plays a key role by converting the grayscale
image, where the values range between 0 and 255, to a binary image by making up
a threshold value, simply to decide like on or off (0 or 1). Since burned characters
look dim in the text region, the characters can disappear in the binary image. When
converting to a binary image, we apply Otsu's binarization method [9] not to the
whole region but to the respective local regions.
Fig 3.5 Blur image (for noise removal)
Fig 3.6 Binary image in color contrast
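A minimal OpenCV sketch of this pre-processing chain; the file name and kernel
size are placeholders, and while the text above applies Otsu per local region, this
sketch shows only the global form.

import cv2

img = cv2.imread("scan.png")                    # placeholder input scan
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)    # RGB -> grayscale
blur = cv2.GaussianBlur(gray, (5, 5), 0)        # gentle blur for noise removal
# Otsu's method picks the binarization threshold automatically.
_, binary = cv2.threshold(blur, 0, 255,
                          cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)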
3.3.3 Segmentation:
Basically, there are three levels of segmentation. Line segmentation, Word
segmentation and Character segmentation.
3.3.3.1 Line Segmentation:
Horizontal histogram projections are used in segmenting the entire script
present in the input image into individual lines, as shown in the figures below.
The primary task here is to extract each individual line from the given input
image. This can be achieved by applying the horizontal histogram projection to the
pre-processed image and then generating the threshold line value by taking the
average value of those horizontal projections. A graphical representation of the
horizontal histogram projection is shown in figure 3.7 below. [6]
Fig 3.7 Horizontal projection graph
Finally, lines can be segmented from the given input script by obtaining the
break points, making use of the average threshold line value obtained from the
above graph and comparing it with each and every horizontal projection.
Fig 3.8 Segmented lines from the image
3.3.3.2 Word Segmentation
Each word is treated as an object (a contour, in terms of image processing).
A contour can be explained simply as a curve joining all the continuous points
(along the boundary) having the same colour or intensity. Here, contours are useful
for object detection, where each object is a word.
Fig 3.9 segmented line
The main reason for making use of contours here is that each word can be
treated as a curve joining all the continuous points along the boundary, since it is
cursively written. But sometimes there may be gaps between the letters of a single
word, which cause the word to be split into two or more words, as they are not
continuous points joining as a curve.
Words of this type can be identified by making use of a minimum threshold
value, obtained by taking the average separation distance between the words; such
words can be rejoined into a single word (contour) where the separation distance
between them is less than the minimum threshold value. [7]
Minimum threshold value = (sum of separation distances between words in the
line) / (number of words in the line)
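A sketch of the contour step with OpenCV (4.x return signature); merging of boxes
closer than the minimum threshold value would follow as described above.

import cv2

def word_boxes(line_binary):
    contours, _ = cv2.findContours(line_binary, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    # Each contour's bounding box (x, y, w, h) is a candidate word,
    # sorted left to right.
    return sorted(cv2.boundingRect(c) for c in contours)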
Fig 3.10 First level word segmentation
Fig 3.11 Second level word segmentation
3.3.3.3 Character Segmentation
Segmenting each word into individual characters can be obtained by making use
of two native algorithms:
1. VPP (Vertical Projection Profile)
2. TDP (Top Down Profile)
Fig 3.12 segmented word
The VPP is a plot which maintains the total number of white pixels in the
vertical direction of the binary image. Characters can be segmented at the points
where the VPP value is zero (0) for a certain number of columns (threshold). But in
the case of touching characters, the VPP value may never be zero even though the
characters should be segmented (connected components). [1]
Fig 3.13 VPP intensity graph for word ‘MOVE’
Connected components can be identified by making use of the character's
width and height. When the width of the character is greater than 0.8 times the
height of the character, it is identified as a connected component; otherwise it is a
single character, as per basic font size measurement. [1]
Width > 0.8 * height (connected component)
Fig 3.14 First level VPP character segmentation (connected component and single characters)
The TDP is a plot which maintains the first white pixel in the vertical direction
of the binary image. Touching characters can be segmented into individual
characters by taking the combined value of both VPP and TDP, then obtaining the
minimum value in the graph, at which we can segment them into individual
characters, and continuing this process recursively until no more touching
characters are found in the word. [2]
Fig 3.15 Touching characters (connected component)
Fig 3.16 VPP intensity graph for word ‘MO’
Fig 3.17 TDP intensity graph for word ‘MO’
Fig 3.18 Combined intensity graph for word ‘MO’
Fig 3.19 Final level character segmentation
3.3.4 Feature Extraction:
The main goal here is to extract from the segmented characters the features
which are required to train the model.
This process comprises zero padding, convolution layers, activation
functions, max pooling and flattening. As a first step, add zeros around the image to
overcome the loss of edges, termed zero padding. Then apply multiple layers of
convolution and max pooling filters (kernels) to obtain an image reduced in size,
where each move of the filter is a stride. Max pooling selects the maximum value
within the filter window, and in the same way average pooling works by taking the
average pixel value. Now the activation function comes into the picture: the ReLU
activation function identifies all the negative pixel values and replaces them with
zero, without any change to the positive pixel values. Finally, flatten the image by
reshaping the feature maps obtained.
Fig 3.20 Feature extraction of characters
Fig 3.21 Zero padding
Fig 3.22 Convolution layer
Fig 3.23 Max pooling and Average pooling
Fig 3.24 Flatten the image
Fig 3.25 ReLU activation function
3.3.5 Classification:
Finally, Classification is done using a fully connected layer where we get the
probabilities of each and every class for the given input character. Classify the given
input character to their respective class by selecting the class with the maximum
probability. In total there are classified into 62 classes (0 to 9, a to z, A TO Z)
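A minimal sketch of this step (the feature size of 1568 is an assumption carried over from the example above):

import torch
import torch.nn as nn

classifier = nn.Linear(1568, 62)                  # fully connected layer, 62 classes
feats = torch.randn(1, 1568)                      # flattened features of one character
probs = torch.softmax(classifier(feats), dim=1)   # probability of every class
predicted_class = probs.argmax(dim=1)             # class with the maximum probability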
Fig 3.26 Fully connected layer with classes (x and o) along with probabilities
3.3.6 Post-processing:
As a final step, obtain the accuracy for all the levels of segmentation and for the character recognition by minimizing the error rate. Then combine all the recognized characters into words, the words into lines, and the lines into the original script present in the image.
Final Result = M O V E
Fig 3.27 Final result
4. PYTORCH
4.1 Pytorch library tool:
Pytorch is an open source machine learning library based on the Torch library, used for applications such as computer vision and natural language processing. It is a Python-based scientific computing package targeted at two sets of audiences:
a replacement for NumPy that uses the power of GPUs
a deep learning research platform that provides maximum flexibility and speed
It is known for providing two of its most notable high-level features: tensor computation with strong GPU acceleration support, and the ability to build deep neural networks on a tape-based autograd system.
There are many existing Python libraries which have the potential to change how deep learning and artificial intelligence are performed, and this is one such library. One of the key reasons behind PyTorch's success is that it is completely Pythonic, so one can build neural network models effortlessly. It is still a young player when compared to its competitors; however, it is gaining momentum fast.
Since its release in January 2016, many researchers have continued to increasingly adopt PyTorch. It has quickly become a go-to library because of the ease with which extremely complex neural networks can be built. It gives tough competition to TensorFlow, especially when used for research work. However, there is still some time before it is adopted by the masses, due to its still “new” and “under construction” tags.
PyTorch's creators envisioned the library to be highly imperative, allowing all numerical computations to run quickly. This is an ideal methodology which fits perfectly with the Python programming style. It allows deep learning scientists, machine learning developers, and neural network debuggers to run and test parts of the code in real time; they don't have to wait for the entire code to be executed to check whether it works or not.
You can always use your favorite Python packages such as NumPy, SciPy, and Cython to extend PyTorch functionalities and services when required. PyTorch is a dynamic library (very flexible, and usable as per your requirements and changes) which is currently adopted by many researchers, students, and artificial intelligence developers. In recent Kaggle competitions, the PyTorch library has been used by nearly all of the top 10 finishers.
Some of the key highlights of PyTorch include:
Simple interface: It offers an easy-to-use API, so it is very simple to operate and runs like ordinary Python.
Pythonic in nature: This library, being Pythonic, smoothly integrates with the Python data science stack and can leverage all the services and functionalities offered by the Python environment.
Computational graphs: In addition, PyTorch provides an excellent platform which offers dynamic computational graphs, so you can change them during runtime. This is highly useful when you have no idea how much memory will be required for creating a neural network model.
It is an optimized tensor library for deep learning using CPUs and GPUs. The feature extraction and classification stages are implemented with the PyTorch library tool. As a first step, install all the required modules/packages for training the data using pip, namely efficientnet-pytorch and torchsummary. Then include all the required modules by importing them into the Python script (torch, torchvision, torch.nn, torch.utils, torch.autograd, torch.optim, torchvision.transforms, EfficientNet).
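A sketch of this setup (module names as assumed above; the pip command is run once in a shell):

# pip install efficientnet_pytorch torchsummary
import torch
import torchvision
import torch.nn as nn
from torch.utils import data
from torch.autograd import Variable
import torch.optim as optim
from torchvision import transforms
from efficientnet_pytorch import EfficientNet
from torchsummary import summary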
Torch: The torch package contains data structures for multi-dimensional tensors and defines mathematical operations over them. Additionally, it provides many utilities for efficient serialization of Tensors and arbitrary types, and other useful utilities.
Torchvision: This package consists of popular datasets, model architectures,
and common image transformations for computer vision.
Torch.nn: This package provides the basic building blocks for neural networks. Its Parameter class is a kind of Tensor that is to be considered a module parameter. Parameters are Tensor subclasses that have a very special property when used with Modules: when they are assigned as Module attributes, they are automatically added to the list of the module's parameters and will appear, e.g., in the parameters() iterator. Assigning an ordinary Tensor does not have such an effect. This is because one might want to cache some temporary state, like the last hidden state of an RNN, in the model. If there were no such class as Parameter, these temporaries would get registered too.
Torch.autograd: This provides classes and functions implementing automatic differentiation of arbitrary scalar-valued functions. It requires minimal changes to existing code: you only need to declare the Tensors for which gradients should be computed with the requires_grad=True keyword.
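A minimal sketch of this behaviour (values chosen only for illustration):

import torch

# Gradients are tracked only for tensors created with requires_grad=True
x = torch.tensor([2.0, 3.0], requires_grad=True)
y = (x ** 2).sum()    # scalar-valued function y = x1^2 + x2^2
y.backward()          # automatic differentiation
print(x.grad)         # tensor([4., 6.]), i.e. dy/dx = 2x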
Torch.optim: This is a package implementing various optimization algorithms. Most commonly used methods are already supported, and the interface is general enough that more sophisticated ones can also be easily integrated in the future.
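A typical usage pattern looks like the following sketch (the model and data are placeholders):

import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(4, 2)                        # placeholder model
optimizer = optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

inputs = torch.randn(8, 4)                     # dummy batch
targets = torch.randint(0, 2, (8,))
optimizer.zero_grad()                          # clear old gradients
loss = criterion(model(inputs), targets)
loss.backward()                                # compute gradients
optimizer.step()                               # update the parameters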
4.2 Pytorch in research:
Anyone who is working in the field of deep learning and artificial intelligence has likely worked with TensorFlow, Google's most popular open source library, before. However, the latest deep learning framework, PyTorch, solves major problems in terms of research work. Arguably PyTorch is TensorFlow's biggest competitor to date, and it is currently a much-favored deep learning and artificial intelligence library in the research community.
Dynamic Computational graphs:
It avoids the static graphs that are used in frameworks such as TensorFlow, thus allowing developers and researchers to change how the network behaves on the fly. Early adopters prefer PyTorch because it is more intuitive to learn when compared to TensorFlow.
Different back-end support:
PyTorch uses different backends for the CPU, the GPU and various functional features, rather than a single back-end. It uses the tensor backend TH for the CPU and THC for the GPU, while the neural network backends THNN and THCUNN serve the CPU and GPU respectively. Using separate backends makes it very easy to deploy PyTorch on constrained systems.
Imperative style:
The PyTorch library is specially designed to be intuitive and easy to use. When you execute a line of code, it runs immediately, allowing you to perform real-time tracking of how your neural network models are built. Its excellent imperative architecture and its fast, lean approach have increased overall PyTorch adoption in the community.
Highly extensible:
PyTorch is deeply integrated with C++ code, and it shares some C++ backends with the deep learning framework Torch. Users can thus program in C/C++ by using an extension API based on cFFI for Python, compiled for CPU or GPU operation. This feature has extended PyTorch usage to new and experimental use cases, making it a preferable choice for research use.
Python-Approach:
PyTorch is a native Python package by design. Its functionalities are built as Python classes; hence, all its code can seamlessly integrate with Python packages and modules. Similar to NumPy, this Python-based library enables GPU-accelerated tensor computations and provides rich options of APIs for neural network applications. PyTorch provides a complete end-to-end research framework which comes with the most common building blocks for carrying out everyday deep learning research. It allows chaining of high-level neural network modules because it supports a Keras-like API in its torch.nn package.
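For example, high-level modules can be chained as in the following sketch (layer sizes are arbitrary assumptions):

import torch.nn as nn

# Keras-like chaining of modules with nn.Sequential from torch.nn
net = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(16 * 14 * 14, 62))   # 62 classes, assuming a 28x28 input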
4.3 Training an image classifier using Pytorch:
Generally, when you have to deal with image, text, audio or video data, you can use standard Python packages that load the data into a NumPy array. You can then convert this array into a torch.*Tensor.
For images, packages such as Pillow and OpenCV are useful
For audio, packages such as SciPy and Librosa
For text, either raw Python or Cython based loading, or NLTK and spaCy, are useful
Specifically for vision, a package called torchvision has been created, which has data loaders for common datasets such as ImageNet, CIFAR10, MNIST, etc., and data transformers for images, viz., torchvision.datasets and torch.utils.data.DataLoader. This provides a huge convenience and avoids writing boilerplate code.
It includes the following steps:
Load and normalize the training and test datasets using torchvision
Define a Convolutional Neural Network
Define a loss function
Train the network on the training data
Test the network on the test data
Using torchvision, it is extremely easy to load data. The output of the torchvision datasets are PILImage images in the range [0, 1]. We transform them to Tensors of normalized range [-1, 1].
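A sketch of such a transform (the 0.5 constants follow from mapping [0, 1] onto [-1, 1]):

import torchvision.transforms as transforms

# ToTensor gives values in [0, 1]; Normalize computes (x - 0.5) / 0.5 -> [-1, 1]
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])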
Now define the convolutional neural network along with the essential loss function and optimizer.
Training: We simply have to loop over our data iterator, feed the inputs to the network and optimize. We train the network for around 6 passes over the training dataset, but we need to check whether the network has learnt anything at all. We check this by predicting the class label that the neural network outputs and comparing it against the ground truth. If the prediction is correct, we add the sample to the list of correct predictions.
Training on GPU: Just as you transfer a Tensor onto the GPU, you transfer the neural net onto the GPU. Let us first define our device as the first visible CUDA device, if CUDA is available.
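A sketch of this device selection (net and inputs are assumed to be defined elsewhere):

import torch

# Use the first visible CUDA device if available, else fall back to the CPU
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
net.to(device)               # move the model's parameters to the device
inputs = inputs.to(device)   # tensors must be moved explicitly as well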
5. SAMPLE CODE ELABORATION

5.1 Pre-processing the data:

5.1.1 Data Organization:

input_filename = []
output_class = []
# Walk the dataset directory; the folder name of each image is its class label
for dirname, _, filenames in os.walk('/kaggle/input/nist-characters-dataset/characters/test_images'):
    for filename in filenames:
        input_filename.append(filename.split('.')[0])
        output_class.append(dirname.split('/')[-1])
testdata = pd.DataFrame({'filename': input_filename, 'class': output_class})
testdata.to_csv('test.csv')
sampledata = pd.DataFrame({'filename': input_filename, 'class': [0 for _ in range(len(output_class))]})
sampledata.to_csv('sample_submission.csv')

5.1.2 Image Sizing and Shaping:

def prepare_image(image, req_height):
    # Convert to grayscale if necessary and rescale to the required height
    if image.ndim == 3:
        image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    height = image.shape[0]
    factor = req_height / height
    print("Resized by factor: ", factor)
    return cv2.resize(image, dsize=None, fx=factor, fy=factor)

5.1.3 Image Blurring Kernel Filter:

def create_kernel_filter(kernel_size, sigma, theta):
    # Anisotropic Gaussian kernel that smears neighbouring characters
    # together so that each word forms one connected blob
    half_size = kernel_size // 2
    kernel = np.zeros([kernel_size, kernel_size])
    sigma_x = sigma
    sigma_y = sigma * theta
    for i in range(kernel_size):
        for j in range(kernel_size):
            x = i - half_size
            y = j - half_size
            exp_term = np.exp(-((x**2) / (2 * (sigma_x**2))) - ((y**2) / (2 * (sigma_y**2))))
            kernel[i, j] = (1 / (2 * math.pi * sigma_x * sigma_y)) * exp_term
    return kernel

5.1.4 Applying Kernel Filters and Contours on image:

def pre_processing_sentence(sentence):
    # show_image() and sort_images() are display/sorting helpers defined
    # elsewhere in the project
    print("RESIZED SENTENCE: ")
    updated_sentence = prepare_image(sentence, 50)
    show_image(updated_sentence, cmap='gray')
    print("BLURRED SENTENCE: ")
    blurred_sentence = cv2.GaussianBlur(sentence, (5, 5), 0)
    show_image(blurred_sentence, cmap='gray')
    print("FILTERED SENTENCE: ")
    kernel_size = 25
    sigma = 11
    theta = 7
    min_area = 150
    kernel = create_kernel_filter(kernel_size, sigma, theta)
    filtered_sentence = cv2.filter2D(sentence, -1, kernel, borderType=cv2.BORDER_REPLICATE)
    show_image(filtered_sentence, cmap='gray')
    print("THRES SENTENCE: ")
    thres_value, thres_sentence = cv2.threshold(filtered_sentence, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    thres_sentence = 255 - thres_sentence
    show_image(thres_sentence, cmap='gray')
    components, hierarchy = cv2.findContours(thres_sentence, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    print('NO OF COMPONENTS : ', len(components))
    print("CONTOURED SENTENCE: ")
    show_image(cv2.drawContours(sentence, components, -1, (255, 0, 0), 5))
    words = []
    for contour in components:
        if cv2.contourArea(contour) >= min_area:
            (x, y, w, h) = cv2.boundingRect(contour)
            words.append([(x, y, w, h), sentence[y:y+h, x:x+w]])
    print('NO OF COMPONENTS AFTER FILTERING: ', len(words))
    words.sort(key=sort_images)
    return words

5.2 Line Segmentation:

5.2.1 Calculating Line Intensity (Horizontal Histogram):

def line_segmentation(image):
    # Horizontal projection profile: count the dark pixels in every row
    line_intensity = [0 for line in range(len(image))]
    for line in range(len(image)):
        count = 0
        for pixel in range(len(image[0])):
            if image[line][pixel] < 128:
                count += 1
        line_intensity[line] = count
    print("LINE INTENSITY: ")
    print(line_intensity)
    plt.plot(line_intensity)
    plt.xticks([])
    plt.show()
    return evaluating_threshold(image, line_intensity)
5.2.2 Evaluating Threshold for Line Segmentation:

def evaluating_threshold(image, line_intensity):
    # Collect [start, end] bands of text rows and the blank-row gaps between
    # them; the line threshold is derived from the gap sizes
    line_segments = []
    line_seperation = 0
    start_flag = 0
    zero_count = 0
    start_index = 0
    end_index = 0
    set_flag = 0
    for line in range(len(line_intensity)):
        if line_intensity[line] == 0:
            zero_count += 1
            if set_flag == 0:
                end_index = line
                set_flag = 1
        else:
            if set_flag == 1 and start_flag == 1:
                set_flag = 0
                line_segments.append([start_index, end_index])
                start_index = line
                line_segments.append(zero_count)
                line_seperation = line_seperation + (zero_count**2)
                zero_count = 0
            if start_flag == 0:
                start_flag = 1
                set_flag = 0
                start_index = line
                zero_count = 0
    line_segments.append([start_index, end_index])
    line_threshold = math.sqrt(line_seperation) / 6
    print("LINE THRESHOLD : ", line_threshold)
    print("LINE SEGMENTS : ", line_segments)
    return segmenting_lines(image, line_segments, line_threshold)

5.2.3 Segmenting Paragraph into Sentences:

def segmenting_lines(image, line_segments, line_threshold):
    sentences = []
    print(line_segments)
    for index in range(1, len(line_segments), 2):
        if line_segments[index] > line_threshold:
            y = line_segments[index-1][0] - 5
            h = line_segments[index-1][1] - line_segments[index-1][0] + 10
            sentences.append(image[y:y+h])
    y = line_segments[-1][0] - 5
    h = line_segments[-1][1] - line_segments[-1][0] + 10
    sentences.append(image[y:y+h])
    return sentences

5.3 Word Segmentation:

5.3.1 Combining two Missegmented Words:

def combine_words(word1, word2, sentence):
    word = [[]]
    word[0].append(word1[0][0])                                  # x-axis position
    word[0].append(min(word1[0][1], word2[0][1]))                # y-axis position
    word[0].append((word2[0][0] - word1[0][0]) + word2[0][2])    # width
    word[0].append(max(word1[0][1] + word1[0][3], word2[0][1] + word2[0][3]) - word[0][1])  # height
    word.append(sentence[word[0][1]:word[0][1]+word[0][3], word[0][0]:word[0][0]+word[0][2]])
    return word

5.3.2 Segmenting Sentence into Words:

def word_segmentation(words, sentence):
    final_words = []
    word = []
    final_flag = 0
    word_seperation_sum = 0
    seperation = []
    # Gaps between consecutive bounding boxes give the separation distances
    for word_no in range(len(words)-1):
        distance = words[word_no+1][0][0] - (words[word_no][0][0] + words[word_no][0][2])
        seperation.append(distance)
        word_seperation_sum = word_seperation_sum + distance
    word_average_threshold = math.sqrt(word_seperation_sum / (len(words)-1))
    print('WORDS SEPERATION : ', seperation)
    print('AVERAGE THRESHOLD FOR WORD SEPERATION : ', word_average_threshold)
    for index in range(len(seperation)):
        if len(word) == 0:
            word = words[index]
        if seperation[index] > word_average_threshold:
            final_words.append(word)
            word = []
            final_flag = 0
        else:
            word = combine_words(word, words[index+1], sentence)
            final_flag = 1
    if final_flag == 0:
        final_words.append(words[-1])
    else:
        final_words.append(word)
    return final_words

5.4 Character Segmentation:

5.4.1 Evaluating VPP Intensity:

def evaluating_vpp_intensity(pre_processed_binary_image):
    # Vertical projection profile: count the dark pixels in every column
    vpp_intensity = [0 for col in range(len(pre_processed_binary_image[0]))]
    for row in range(len(pre_processed_binary_image)):
        for col in range(len(pre_processed_binary_image[row])):
            if pre_processed_binary_image[row][col] == 0:
                vpp_intensity[col] += 1
    print(vpp_intensity)
    plt.plot(vpp_intensity)
    plt.xticks([])
    plt.show()
    return vpp_intensity
5.4.2 First Level Character Segmentation Using VPP:

def first_level_character_segmentation_under_vpp(pre_processed_binary_image):
    vpp_intensity = evaluating_vpp_intensity(pre_processed_binary_image)
    character_segments = []
    character_seperation = 0
    start_flag = 0
    zero_count = 0
    start_index = 0
    end_index = 0
    set_flag = 0
    for col in range(len(vpp_intensity)):
        if vpp_intensity[col] == 0:
            zero_count += 1
            if set_flag == 0:
                end_index = col
                set_flag = 1
        else:
            if set_flag == 1 and start_flag == 1:
                set_flag = 0
                character_segments.append([start_index, end_index])
                start_index = col
                character_segments.append(zero_count)
                character_seperation = character_seperation + (zero_count**2)
                zero_count = 0
            if start_flag == 0:
                start_flag = 1
                set_flag = 0
                start_index = col
                zero_count = 0
    character_segments.append([start_index, end_index])
    character_threshold = math.sqrt(character_seperation) / 3
    print("CHARACTER THRESHOLD : ", character_threshold)
    print("CHARACTER SEGMENTS : ", character_segments)
    return character_segmentation(character_segments, character_threshold, pre_processed_binary_image)

5.4.3 Segmenting Word into Characters under VPP:

def character_segmentation(character_segments, character_threshold, pre_processed_binary_image):
    segmented_characters = []
    for index in range(1, len(character_segments), 2):
        if character_segments[index] > character_threshold:
            x = character_segments[index-1][0]
            y = 0
            touching = 0
            w = character_segments[index-1][1] - character_segments[index-1][0]
            h = len(pre_processed_binary_image)
            # A segment much wider than it is tall is flagged as touching
            if w > 0.675 * h:
                touching = 1
            segmented_characters.append([[x, y, w, h], pre_processed_binary_image[y:y+h, x:x+w], touching])
    x = character_segments[-1][0]
    y = 0
    touching = 0
    w = character_segments[-1][1] - character_segments[-1][0]
    h = len(pre_processed_binary_image)
    if w > 0.675 * h:
        touching = 1
    segmented_characters.append([[x, y, w, h], pre_processed_binary_image[y:y+h, x:x+w], touching])
    return segmented_characters

5.4.4 Evaluating VPP and TDP Average Intensity:

def generate_vpp_and_tdp_average(segment):
    image = segment[1]
    show_image(image, cmap='gray')
    vpp_intensity = [0 for col in range(len(image[0]))]
    for row in range(len(image)):
        for col in range(len(image[row])):
            if image[row][col] == 0:
                vpp_intensity[col] += 1
    print(vpp_intensity)
    plt.plot(vpp_intensity)
    plt.xticks([])
    plt.show()
    # TDP: height of the first dark pixel from the top, per column
    tdp_intensity = [0 for col in range(len(image[0]))]
    for col in range(len(image[0])):
        for row in range(len(image)):
            if image[row][col] == 0:
                tdp_intensity[col] = len(image) - row
                break
    print(tdp_intensity)
    plt.plot(tdp_intensity)
    plt.xticks([])
    plt.show()
    average_intensity = np.add(tdp_intensity, vpp_intensity)
    print(average_intensity)
    plt.plot(average_intensity)
    plt.xticks([])
    plt.show()
    return average_intensity

5.4.5 Connected Components Segmentation:

def touching_character_segmentation(segment, average_intensity, final_segmented_characters):
    average_threshold = 20
    touching_characters_breakpoints = []
    col = 0
    while col < len(average_intensity):
        if col == 0:
            # Skip the leading low-intensity margin
            while col < len(average_intensity) and average_intensity[col] < average_threshold:
                col += 1
        if col < len(average_intensity) and average_intensity[col] < average_threshold:
            # Track the weakest point of this low-intensity valley
            min_value = average_intensity[col]
            min_point = col
            while col < len(average_intensity) and average_intensity[col] < average_threshold:
                if average_intensity[col] < min_value:
                    min_value = average_intensity[col]
                    min_point = col
                col += 1
            if col < len(average_intensity):
                touching_characters_breakpoints.append(min_point)
        col += 1
    print("CHARACTERS BREAK POINTS ", touching_characters_breakpoints)
    if len(touching_characters_breakpoints) == 0:
        required_further_segmentation(segment, average_intensity, final_segmented_characters)
    else:
        x_point = 0
        y_point = segment[0][1]
        height = segment[0][3]
        for break_point in touching_characters_breakpoints:
            width = break_point - x_point
            if width > 0.8 * height:
                required_further_segmentation([[x_point, y_point, width, height], segment[1][y_point:y_point+height, x_point:x_point+width], 1], average_intensity[x_point:x_point+width], final_segmented_characters)
            else:
                final_segmented_characters.append([[x_point, y_point, width, height], segment[1][y_point:y_point+height, x_point:x_point+width], 0])
            x_point = break_point
        width = segment[0][2] - x_point
        if width > 0.8 * height:
            required_further_segmentation([[x_point, y_point, width, height], segment[1][y_point:y_point+height, x_point:x_point+width], 1], average_intensity[x_point:x_point+width], final_segmented_characters)
        else:
            final_segmented_characters.append([[x_point, y_point, width, height], segment[1][y_point:y_point+height, x_point:x_point+width], 0])

5.4.6 Further Required Segmentation on Connected Components:

def required_further_segmentation(image_segment, average_intensity, final_segmented_characters):
    # Search for the weakest combined-profile point, keeping away from both
    # edges of the segment
    index_limit = int(0.4 * image_segment[0][3])
    min_value = average_intensity[index_limit]
    break_point = index_limit
    for col in range(index_limit, len(average_intensity) - index_limit):
        if average_intensity[col] < min_value:
            min_value = average_intensity[col]
            break_point = col
    x_point = 0
    y_point = image_segment[0][1]
    height = image_segment[0][3]
    width = break_point - x_point
    if width > 0.8 * height:
        required_further_segmentation([[x_point, y_point, width, height], image_segment[1][y_point:y_point+height, x_point:x_point+width], 1], average_intensity[x_point:x_point+width], final_segmented_characters)
    else:
        final_segmented_characters.append([[x_point, y_point, width, height], image_segment[1][y_point:y_point+height, x_point:x_point+width], 0])
    x_point = break_point
    width = image_segment[0][2] - x_point
    if width > 0.8 * height:
        required_further_segmentation([[x_point, y_point, width, height], image_segment[1][y_point:y_point+height, x_point:x_point+width], 1], average_intensity[x_point:x_point+width], final_segmented_characters)
    else:
        final_segmented_characters.append([[x_point, y_point, width, height], image_segment[1][y_point:y_point+height, x_point:x_point+width], 0])
5.5 Training the model:

5.5.1 Data Loader:

class Dataset(data.Dataset):
    def __init__(self, csv_path, images_path, transform=None):
        self.train_set = pd.read_csv(csv_path)
        self.train_path = images_path
        self.transform = transform

    def __len__(self):
        return len(self.train_set)

    def __getitem__(self, idx):
        file_name = self.train_set.iloc[idx][1] + '.png'
        label = self.train_set.iloc[idx][2]
        img = Image.open(os.path.join(self.train_path, file_name))
        if self.transform is not None:
            img = self.transform(img)
        return img, label

5.5.2 Defining Transforms and Parameters:

params = {'batch_size': 16, 'shuffle': True}
epochs = 6
learning_rate = 1e-3
transform_train = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.RandomApply([
        torchvision.transforms.RandomRotation(10),
        transforms.RandomHorizontalFlip()], 0.7),
    transforms.ToTensor()])
training_set = Dataset(os.path.join(base_path, 'train.csv'), os.path.join(base_path, 'train_images/'), transform=transform_train)
training_generator = data.DataLoader(training_set, **params)
use_cuda = torch.cuda.is_available()
device = torch.device("cuda:0" if use_cuda else "cpu")
print(device)

5.5.3 Importing the Model:

model = EfficientNet.from_pretrained('efficientnet-b0', num_classes=62)
model.to(device)
summary(model, input_size=(3, 512, 512))
path_save = './Weights/'
if not os.path.exists(path_save):
    os.mkdir(path_save)
criterion = nn.CrossEntropyLoss()
lr_decay = 0.99
optimizer = optim.Adam(model.parameters(), lr=learning_rate)
eye = torch.eye(62).to(device)
classes = [i for i in range(62)]
history_accuracy = []
history_loss = []

5.5.4 Training the Model:

epochs = 1    # overrides the earlier value
for epoch in range(epochs):
    running_loss = 0.0
    correct = 0
    total = 0
    class_correct = list(0. for _ in classes)
    class_total = list(0. for _ in classes)
    for i, batch in enumerate(training_generator, 0):
        inputs, labels = batch
        t0 = time()
        inputs, labels = inputs.to(device), labels.to(device)
        labels = eye[labels]    # one-hot encode the labels
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, torch.max(labels, 1)[1])
        _, predicted = torch.max(outputs, 1)
        _, labels = torch.max(labels, 1)
        c = (predicted == labels.data).squeeze()
        correct += (predicted == labels).sum().item()
        total += labels.size(0)
        accuracy = float(correct) / float(total)
        history_accuracy.append(accuracy)
        history_loss.append(loss)
        loss.backward()
        optimizer.step()
        for j in range(labels.size(0)):
            label = labels[j]
            class_correct[label] += c[j].item()
            class_total[label] += 1
        running_loss += loss.item()
        if i % 100 == 99:
            print("Epoch : ", epoch+1, " Batch : ", i+1, " Loss : ", running_loss/(i+1), " Accuracy : ", accuracy, " Time ", round(time()-t0, 2), "s")
    for k in range(len(classes)):
        if class_total[k] != 0:
            print('Accuracy of %5s : %2d %%' % (classes[k], 100 * class_correct[k] / class_total[k]))
    print('[%d epoch] Accuracy of the network on the training images: %d %%' % (epoch+1, 100 * correct / total))
    if epoch % 10 == 9:
        torch.save(model.state_dict(), os.path.join(path_save, str(epoch+1)+'.pth'))
torch.save(model.state_dict(), os.path.join(path_save, 'final_epoch'+'.pth'))

5.5.5 Testing the Model:

model.load_state_dict(torch.load('/kaggle/working/Weights/final_epoch.pth'))
model.eval()
test_transforms = transforms.Compose([
    transforms.Resize(512),
    transforms.ToTensor()])

def predict_image(image):
    image_tensor = test_transforms(image)
    image_tensor = image_tensor.unsqueeze_(0)
    input = Variable(image_tensor)
    input = input.to(device)
    output = model(input)
    index = output.data.cpu().numpy().argmax()
    return index

img_test_path = os.path.join(base_path, 'test_images/')
for i in range(len(submission)):
    img = Image.open(img_test_path + submission.iloc[i][1] + '.png')
    submission['class'][i] = predict_image(img)
    if i % 10 == 0 or i == len(submission)-1:
        print('[', 32*'=', '>] ', round((i+1)*100/len(submission), 2), ' % Complete')

# Confusion matrix and per-class accuracy over the 62 classes
result = [[0 for _ in range(62)] for i in range(62)]
total_data = [0 for i in range(62)]
correct_data = [0 for i in range(62)]
for i in range(len(submission)):
    result[test_dataset['class'][i]][submission['class'][i]] += 1
    if test_dataset['class'][i] == submission['class'][i]:
        correct_data[test_dataset['class'][i]] += 1
    total_data[test_dataset['class'][i]] += 1
for i in result:
    for j in i:
        print(str(10000+j)[1:], end=" ")
    print()
for i in range(62):
    print(i, '-', total_data[i], correct_data[i], (correct_data[i]*100)/total_data[i])
print("TOTAL", '-', sum(total_data), sum(correct_data), (sum(correct_data)*100)/sum(total_data))
6. RESULTS AND DISCUSSIONS
6.1 Input & Output:
INPUT:
Fig 6.1 Input image
OUTPUT:
A MOVE to stop Mr. Gaitskell from nominating any more Labour
life Peers is to be made at a meeting of Labour M P’s tomorrow. Mr. Michael Foot has
put down a resolution on the subject and he is to be backed by Mr. Will Griffiths, MP
for Manchester exchange.
6.2 Training Datasets:
IAM DATASET: The IAM Handwriting Database contains forms of
handwritten English text which can be used to train and test handwritten text
recognizers and to perform writer identification and verification experiments.
The database was first published in [13] at the ICDAR 1999. Using this
database an HMM based recognition system for handwritten sentences was developed
and published in [14] at the ICPR 2000. The segmentation scheme used in the second
version of the database is documented in [15] and has been published in the ICPR
2002. The IAM-database as of October 2002 is described in [16]. We use the database
extensively in our own research.
The database contains forms of unconstrained handwritten text, which were
scanned at a resolution of 300dpi and saved as PNG images with 256 gray levels.
The IAM Handwriting Database 3.0 is structured as follows:
657 writers contributed samples of their handwriting
1,539 pages of scanned text
5,685 isolated and labeled sentences
13,353 isolated and labeled text lines
115,320 isolated and labeled words
The words have been extracted from pages of scanned text using an automatic
segmentation scheme and were verified manually. The segmentation scheme has been
developed at our institute [15].
All form, line and word images are provided as PNG files, and the corresponding form label files, including segmentation information and a variety of estimated parameters (from the preprocessing steps described in [14]), are included with the image files as meta-information in XML format.
NIST DATASET: The EMNIST dataset is a set of handwritten character and digit samples derived from NIST Special Database 19 and converted to a 28x28 pixel image format and dataset structure that directly matches the MNIST dataset.
The dataset is provided in two file formats. Both versions of the dataset contain
identical information, and are provided entirely for the sake of convenience. The first
dataset is provided in a Matlab format that is accessible through both Matlab and
Python (using the scipy.io.loadmat function). The second version of the dataset is
provided in the same binary format as the original MNIST dataset.[18]
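For example, the Matlab version can be read from Python roughly as follows (the file name and the internal field layout are assumptions based on the distributed archive):

from scipy.io import loadmat

emnist = loadmat('emnist-byclass.mat')    # assumed file name
dataset = emnist['dataset']
train = dataset['train'][0, 0]            # MATLAB structs load as record arrays
images = train['images'][0, 0]            # assumed field layout
labels = train['labels'][0, 0]
print(images.shape, labels.shape)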
There are six different splits provided in this dataset. A short summary of the
dataset is provided below:
EMNIST ByClass: 814,255 characters. 62 unbalanced classes.
EMNIST ByMerge: 814,255 characters. 47 unbalanced classes.
EMNIST Balanced: 131,600 characters. 47 balanced classes.
EMNIST Letters: 145,600 characters. 26 balanced classes.
EMNIST Digits: 280,000 characters. 10 balanced classes.
EMNIST MNIST: 70,000 characters. 10 balanced classes.
The full complement of the NIST Special Database 19 is available in the
ByClass and ByMerge splits. The EMNIST Balanced dataset contains a set of
characters with an equal number of samples per class. The EMNIST Letters dataset
merges a balanced set of the uppercase and lowercase letters into a single 26-class
task. The EMNIST Digits and EMNIST MNIST dataset provide balanced handwritten
digit datasets directly compatible with the original MNIST dataset.
Type         Classes   Training   Testing   Total
By Class:
  Digits        10     344,307    58,646    402,953
  Uppercase     26     208,363    11,941    220,304
  Lowercase     26     178,998    12,000    190,998
  Total         62     731,668    82,587    814,255
By Merge:
  Digits        10     344,307    58,646    402,953
  Letters       37     387,361    23,941    411,302
  Total         47     731,668    82,587    814,255
Table-6.1
Breakdown of the number of available training and testing samples in the NIST
special database 19 using the original training and testing splits.
6.3 Experimental Results and Analysis:
Segmentation: This system is trained under around 1600 text images
(paragraphs) of IAM dataset with almost 5678 labeled sentences, 13353 isolated and
labelled text lines, 115,320 isolated and labeled words with an accuracy of around
98% in terms of line segmentation, 93 % in terms of word segmentation and 88% in
terms of character segmentation.
Character Recognition: This system was trained on the NIST dataset, which represents the most useful organization from a classification perspective as it contains the segmented digits and characters arranged by class. There are 62 classes comprising [0-9], [a-z] and [A-Z]. The data is also split into a suggested training and testing set of around 731,668 images and 82,587 images respectively, with an accuracy of around 96% in character recognition.
Type                      Testing    Accuracy
Line Segmentation           1,539       98%
Word Segmentation           5,685       93%
Character Segmentation    115,320       88%
Character Recognition      82,587       96%
Table -6.2: Testing and Accuracy
7. CONCLUSION
This project mainly carries out a study on segmenting connected components (touching characters). We improved the performance of binarization in pre-processing and proposed a new method for separating touching characters using combined profile analysis. Since the proposed algorithm shows good performance in the experimental results, it can be applied effectively in a character recognition system.
The proposed method segments connected components (touching characters) by using VPP (Vertical Projection Profile) and TDP (Top Down Profile), with other histogram projections (horizontal and vertical) used for line and word segmentation respectively. PyTorch is the Python library tool used for recognition of the segmented characters. [4]
There are many more challenges involved in optical cursive handwritten recognition, such as skew and pressure detection, which can be treated as future work.
REFERENCES
[1] Nafiz Arica and Fatos T. Yarman-Vural, "Optical Character Recognition for Cursive Handwriting", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 24, No. 6, June 2002.
[2] Subhash Panwar and Neeta Nain, "A Novel Segmentation Methodology for Cursive Handwritten Documents", IETE Journal of Research, Vol. 60, No. 6, Nov-Dec 2014.
[3] Nibaran Das, Sandip Pramanik, Subhadip Basu, Punam Kumar Saha, "Recognition of handwritten Bangla basic characters and digits using convex hull feature set", 2009 International Conference on Artificial Intelligence and Pattern Recognition (AIPR-09).
[4] Abhishek Bala and Rajib Saha, "An Improved Method for Handwritten Document Analysis using Segmentation, Baseline Recognition and Writing Pressure Detection", 6th International Conference on Advances in Computing & Communications, ICACC 2016, 6-8 September 2016, Cochin, India, Elsevier, 2016.
[5] Kanchan Keisham and Sunanda Dixit, "Recognition of Handwritten English Text Using Energy Minimisation", Information Systems Design and Intelligent Applications, Advances in Intelligent Systems and Computing, Bangalore, India, Springer, 2016.
[6] Namrata Dave, "Segmentation Methods for Hand Written Character Recognition", International Journal of Signal Processing, Image Processing and Pattern Recognition, Vol. 8, No. 4 (2015), pp. 155-164.
[7] G. Louloudis, B. Gatos, I. Pratikakis, C. Halatsis, "Text line and word segmentation of handwritten documents", Department of Informatics and Telecommunications, University of Athens, Greece; Computational Intelligence Laboratory, Institute of Informatics and Telecommunications, National Center for Scientific Research "Demokritos", 15310 Athens, Greece.
[8] Rafael C. Gonzalez and Richard E. Woods, "Digital Image Processing, Second Edition", Prentice Hall.
[9] N. Otsu, "A Threshold Selection Method from Gray-Level Histograms", IEEE Transactions on Systems, Man and Cybernetics, Vol. 9, No. 1, 1979, pp. 62-66.
[10] Seong-Whan Lee and Young Joon Kim, "Direct Extraction of Topographic Features for Gray Scale Character Recognition", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 17, No. 7, July 1995.
[11] Seong-Whan Lee, Dong-June Lee, and Hee-Seon Park, "A New Methodology for Gray-Scale Character Segmentation and Recognition", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 18, No. 10, October 1996.
[12] A. Ariyoshi, "A Character Segmentation Method for Japanese Documents Coping with Touching Character Problems", Proc. 11th Int'l Conf. on Pattern Recognition, The Hague, Netherlands, August 1992, pp. 313-316.
[13] U. Marti and H. Bunke, "A full English sentence database for off-line handwriting recognition", in Proc. of the 5th Int. Conf. on Document Analysis and Recognition, pages 705-708, 1999.
[14] U. Marti and H. Bunke, "Handwritten Sentence Recognition", in Proc. of the 15th Int. Conf. on Pattern Recognition, Volume 3, pages 467-470, 2000.
[15] M. Zimmermann and H. Bunke, "Automatic Segmentation of the IAM Off-line Database for Handwritten English Text", in Proc. of the 16th Int. Conf. on Pattern Recognition, Volume 4, pages 35-39, 2002.
[16] U. Marti and H. Bunke, "The IAM-database: An English Sentence Database for Off-line Handwriting Recognition", Int. Journal on Document Analysis and Recognition, Volume 5, pages 39-46, 2002.
[17] S. Johansson, G. N. Leech and H. Goodluck, "Manual of Information to accompany the Lancaster-Oslo/Bergen Corpus of British English, for use with digital Computers", Department of English, University of Oslo, Norway, 1978.
[18] G. Cohen, S. Afshar, J. Tapson, and A. van Schaik, "EMNIST: an extension of MNIST to handwritten letters", 2017.
[19] Richard G. Casey and Eric Lecolinet, "A Survey of Methods and Strategies in Character Segmentation", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 18, No. 7, July 1996.
[20] Jin Hak Bae, Kee Chul Jung, Jin Wook Kim, and Hang Joon Kim, "Segmentation of touching characters using an MLP", Pattern Recognition Letters, Vol. 19, No. 8, 1998, pp. 701-709.
[21] N. Arica and F. Yarman-Vural, "Optical character recognition for cursive handwriting", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, pp. 801-813, June 2002.
[22] M. Blumenstein and B. Verma, "Neural-based solutions for the segmentation and recognition of difficult handwritten words from a benchmark database", in Proceedings of the Fifth International Conference on Document Analysis and Recognition, ICDAR '99, pp. 281-284, September 1999.
[23] Y. Tay, M. Khalid, R. Yusof, and C. Viard-Gaudin, "Offline cursive handwriting recognition system based on hybrid markov model and neural networks", in Proceedings of the IEEE International Symposium on Computational Intelligence in Robotics and Automation, 2003, vol. 3, pp. 1190-1195, July 2003.
[24] G. Kim, V. Govindaraju, S. Srihari, "A segmentation and recognition strategy for handwritten phrases", in Proceedings of the 13th International Conference on Pattern Recognition, 1996, vol. 4, pp. 510-514, August 1996.
[25] Y. Y. Chung and M. T. Wong, "Handwritten character recognition by fourier descriptors and neural network", in Proceedings of IEEE Region 10 Annual Conference on Speech and Image Technologies for Computing and Telecommunications, TENCON '97, vol. 1, pp. 391-394, December 1997.
[26] B. S. Moni and G. Raju, "Modified quadratic classifier and directional features for handwritten Malayalam character recognition", in Computational Science - New Dimensions and Perspectives, NCCSE 2011, IJCA Special Issue, vol. 1, pp. 30-34, February 2011.
[27] M. Blumenstein, X. Y. Liu, and B. Verma, "An investigation of the modified direction feature for cursive character recognition", Pattern Recognition, vol. 40, no. 2, pp. 376-388, 2007.
[28] M. Blumenstein, B. Verma, and H. Basli, "A novel feature extraction technique for the recognition of segmented handwritten characters", in Proceedings of the Seventh International Conference on Document Analysis and Recognition, 2003, vol. 1, pp. 137-141, August 2003.
Base Paper
2011 International Conference on Computer Applications and Industrial Electronics (ICCAIE 2011)
Offline Handwritten Character Recognition Using
Neural Network
Anshul Gupta, Department of Electronics and Electrical Engineering, IIT Guwahati, Guwahati, India. Email: [email protected]
Manisha Srivastava, Department of Electronics and Electrical Engineering, IIT Guwahati, Guwahati, India. Email: [email protected]
Chitralekha Mahanta, Department of Electronics and Electrical Engineering, IIT Guwahati, Guwahati, India. Email: [email protected]
Abstract—Character Recognition (CR) has been an active area
of research in the past and due to its diverse applications it continues to be a challenging research topic. In this paper, we focus especially on offline recognition of handwritten English words by first detecting individual characters. The main approaches for offline handwritten word recognition can be divided into two classes, holistic and segmentation based. The holistic approach is used in recognition of limited size vocabulary where global features extracted from the entire word image are considered. As the size of the vocabulary increases, the complexity of holistic based algorithms also increases and correspondingly the recognition rate decreases rapidly. The segmentation based strategies, on the other hand, employ bottom-up approaches, starting from the stroke or the character level and going towards producing a meaningful word. After segmentation the problem gets reduced to the recognition of simple isolated characters or strokes and hence the system can be employed for unlimited vocabulary. We here adopt segmentation based handwritten word recognition where neural networks are used to identify individual characters. A number of techniques are available for feature extraction and training of CR systems in the literature, each with its own superiorities and weaknesses. We explore these techniques to design an optimal offline handwritten English word recognition system based on character recognition. Post processing technique that uses lexicon is employed to improve the overall recognition accuracy.
Index Terms—Offline, handwritten, character, recognition, neural network.
I. INTRODUCTION
It is really a challenging issue to develop a practical handwritten character recognition (CR) system which can maintain
high recognition accuracy. A generic character recognition
system is shown in Fig. 1.
Fig. 1. Generic CR system
In most of the existing systems recognition accuracy is
heavily dependent on the quality of the input document.
In handwritten text, adjacent characters tend to touch or overlap. Therefore it is essential to segment a given
string correctly into its character components. In most of the
existing segmentation algorithms, human writing is evaluated
empirically to deduce rules [1]. But there is no guarantee
for the optimum results of these heuristic rules in all styles
of writing. Moreover handwriting varies from person to
person and even for the same person it varies depending on
mood, speed etc. This requires incorporating artificial neural
networks, hidden Markov models and statistical classifiers to
extract segmentation rules based on numerical data. [2][3][4].
After segmentation next crucial step is representation of
character classes by features. These features should have high
discriminative abilities so that they are different for different
character classes (for example 26 uppercase and 26 lowercase
characters in case of English language). Also, these features
should be independent of the intra class variations.
The different representation methods can be categorized into three major classes [1]:
1. Global transformation and series expansion: includes Fourier transform, Gabor transform, wavelets, moments and Karhunen-Loeve expansion.
2. Statistical representation: zoning, crossings and distances, projections.
3. Geometrical and topological representation: extracting and counting topological structures, geometrical properties, coding, graphs and trees etc.
Features which depend on Fourier transform are suitable
for recognizing handwritten numerals where 96% accuracy
has been achieved [5]. Gradient features have been widely
used in CR for machine and hand printed binary character
images. But these features are not invariant to deformations
in the characters. In [6], a new gradient feature is used where
at each pixel, gradient is mapped onto 12 direction codes
with an angle span of 30 degree between the directions.
In [7], a redesigned direction feature [8] is developed with a view to describing the character contour more effectively.
Also, an additional global feature was introduced in this
technique to improve the recognition accuracy for those
characters that were most frequently confused with patterns
of similar appearances. But the disadvantage of this technique
is its failure to deal with changes in stroke width as these
features are extracted from non thinned character images.
Another crucial module in a character recognition system is
its pattern recognition module which assigns an unknown
sample to a predefined class. Numerous techniques for
character recognition can be classified into four general
approaches of pattern recognition: [1]
1. Template matching: direct matching, deformable and elastic matching, relaxation matching
2. Statistical techniques: parametric recognition, non-parametric recognition, HMM, fuzzy set reasoning
3. Structural techniques: grammatical methods, graphical methods
4. Neural networks: multilayer perceptron, radial basis function, support vector machine
Character recognition technique has to cope with the high
variability of the handwritten cursive letters and their intrinsic
ambiguity (letters like “e” and “l” or “u” and “n” can have
the same shape). Also it should be able to adapt to changes
in the input data. Template matching, statistical techniques
and structural techniques can be used when the input data is
uniform over time whereas neural network (NN) classifier
can learn changes in the input data. Also NN has parallel
structure because of which it can perform computation at a
higher rate than classical techniques. Therefore, we choose
neural networks for character recognition in our system.
The features that are used for training the neural network
classifier also play a very important role. The choice of a
good feature vector can significantly enhance the performance
of a character classifier whereas a poor one may degrade
its performance considerably. It is found in the literature
that generally separate classifiers are used for the upper
and the lower case English character classes to improve the
recognition accuracy. Moreover, good recognition accuracy
could be achieved only for handwritten numerals.
In this paper, we focus on developing a CR system for
recognition of handwritten English words. We first segment
the words into individual characters and then represent these
characters by features that have good discriminative abilities.
We also explore different neural network classifiers to find
the best classifier for the CR system. We combine different
CR techniques in parallel so that recognition accuracy of the
system can be improved.
The organization of the paper is as follows: Section II
deals with segmentation of words into individual characters
where a heuristic algorithm is used to first oversegment the
word followed by verification using neural network. Feature
extraction of handwritten characters is discussed in Section
III. Section IV describes selection procedure of a suitable
classifier. This is done by testing multilayer perceptron (MLP),
radial basis function (RBF) and support vector machine (SVM)
and selecting the one that has the maximum accuracy. In Section V post processing is discussed, where different character
recognition techniques are combined in parallel by using a
variation of the Borda count. Section VI presents results and
discussion. Conclusions are drawn in Section VII.
II. SEGMENTATION
In this paper segmentation algorithm used is similar to [2],
where heuristics and artificial intelligence are used for the
segmentation of a handwritten word. Here gray level image
is first converted into the binary image. Next slant detection
similar to the one used in [9] is employed and then slant correction is done. The method involves rotating the image from −45° to 45°. The horizontal projection is taken at each rotation to calculate the Wigner-Ville distribution (WVD, a joint function of time and frequency). The angle which presents
the maximum intensity after applying WVD is taken as the
estimated slant angle.
For both the training and the testing phases, a heuristic
algorithm is used to locate prospective segmentation points in
the handwritten words. Each word is inspected in an attempt to
locate characteristic representative of the segmentation points.
A. Segmentation using a heuristic algorithm
A simple heuristic segmentation algorithm is implemented
which scans handwritten words to identify valid segmentation
points between characters. The segmentation is based on
locating the minima or arcs between letters, common in
handwritten cursive script. For this a histogram of vertical
pixel densities is examined which may indicate the location of
possible segmentation points in the word. However, in the case of letters such as “a” and “o”, an erroneous segmentation point could be identified. Therefore a “hole seeking” component is incorporated which prunes segmentation points that pass through a “hole”. Finally, the algorithm performs a
to another by ascertaining that the distance between the
previous segmentation point and the position being checked
is equal to or greater than the average character width.
Conversely if the contour in a region has sparse segmentation
points then a new segmentation point is inserted in that region.
B. Manual marking of segmentation points
We created our own database to train the neural network
for segmentation. Altogether 26 English words were chosen
which contained all the upper and lower case alphabets and
then 10 different samples of each word were collected on
paper from different writers. The images were then scanned
and preprocessed to create a list of 260 words. Prior to ANN
training, the heuristic feature detector was used to segment
all the words. The segmentation point outputs obtained by
using the heuristic feature detector can be categorized into “correct” and “incorrect” segmentation point classes. The feature extractor then extracts a matrix of pixels representing the segmentation area and breaks it down into small windows of equal size 5x5 pixels and analyzes the density of black and white pixels. The density value for the black pixels for each 5x5 window is written to the training file to represent the value of that window. Accompanying each matrix, the desired output is also stored in the training file (0.1 for an incorrect segmentation point and 0.9 for a correct point).
C. Training of the Artificial Neural Network (ANN)
For this step, a multilayer feedforward neural network trained with the back propagation algorithm is used. The ANN is presented with the training file prepared in the previous step.
D. Testing phase of the segmentation technique
Like ANN training, the words used for testing are also segmented using the heuristic algorithm. The segmentation points are automatically extracted and are fed into the trained ANN. The ANN then verifies each segmentation point as correct or incorrect. Finally, upon ANN verification, each word used for testing should only contain valid segmentation points.
III. FEATURE EXTRACTION
A compact and characteristic representation of the character image is required in the CR system. For this purpose, a set of features is extracted for each class that helps to distinguish it from other classes, while remaining invariant to intra-class differences.
A. Fourier Descriptors
The method adopted is similar to [10] where boundary detection is done at first. After obtaining a boundary image, Fourier descriptors are found. This involves finding the discrete Fourier coefficients a[k] and b[k] for 0 < k < L − 1, where L is the total number of boundary points, by applying the following:
a[k] = (1/L) Σ_{m=1..L} x[m] e^(−jk(2π/L)m)    (1)
b[k] = (1/L) Σ_{m=1..L} y[m] e^(−jk(2π/L)m)    (2)
where x[m] and y[m] are the x and y coordinates respectively of the m-th boundary point. The values for k = 0 are discarded as they contain information only about the position of the image. The coefficients for high values of k describe high frequency features in the image but do not contain much information about the overall shape of the character, so these high frequency components are also discarded. Therefore, the first five coefficients, from k = 1 to k = 5, are considered. The feature vectors made up from these moduli are then normalized to 1 to compensate for image scaling. To spread the input data more evenly over the input space, the mean and the standard deviation vectors are found over the whole set of training data. The j-th component of input vector i is calculated as:
i_pj = (i_poj − i_oj)(α/σ_noj − 1) + 1    (3)
where i_poj is the j-th component of the original vector of pattern p, i_oj is the mean of the j-th components of the original vectors, and σ_noj is the corresponding standard deviation. Coefficient α linearly controls the degree of standard deviation compensation. We have also used Fourier descriptors for extracting the following two features:
1) Fourier angle: It is mentioned in [10] that if the moduli alone are not successful in discriminating all the classes then adding angles of Fourier descriptors can improve the results. Experiments can be done to incorporate angles in the training set.
2) Fourier magnitude [11]: The Fourier coefficients derived from equations (1) and (2) are not rotation or shift invariant (in fact, it is noted that a shift will occur if the starting point of the boundary following is arbitrary). In order to derive a set of Fourier descriptors which have the invariant property with respect to rotation and shift, the following operations are performed. For each n a set of invariant descriptors r[n] is computed as:
r[n] = √(|a[n]|² + |b[n]|²)    (4)
It is easy to show that r[n] is invariant to rotation or shift. A further refinement can be made by computing a new set of descriptors as follows:
s[n] = r[n]/r[1]    (5)
Thus dependence of r[n] on the size of the character is also eliminated. The Fourier coefficients |a[n]|, |b[n]|, their phases and the invariant descriptors s[n], n = 2, 3, were derived for all the character specimens and stored in files for application in reconstruction and recognition. We will be using the following sets of features in our final system:
1. Magnitude: s(k), |a[k]| and |b[k]|
2. Phase: |a[k]| and |b[k]|
3. Magnitude and phase: s(k), |a[k]| and |b[k]|
IV. CLASSIFIER SELECTION
Classification can be done using various methods like clustering, Bayesian classification, artificial neural networks etc., out of which artificial neural networks have been widely used. For our case we will use them to classify 52 character classes: 26 lower case and 26 upper case. We have considered three networks: multi-layer perceptron (MLP), radial basis function (RBF) and support vector machine (SVM). Results of character classification by these classifiers are given below. We have used the neural network toolbox in the Matlab platform for testing the classifiers. The character database used for the training and testing is taken from The Chars74K dataset.
A. Multilayer Perceptron (MLP)
Table I shows the MLP configuration that produced the best
results in our case. Fig.2 illustrates the validation performance
of the MLP network. Results obtained are poor on validation
and testing data.
TABLE I
MLP CONFIGURATION
Hidden layers and activation functions: 3 [tansig tansig tansig]
No. of hidden nodes: [80 50 50]
Training algorithm: traingdx
Learning rate: adaptive
Momentum: 0.9
Fig. 2. Validation performance of the MLP network
B. Radial Basis Function (RBF)
Table II shows the RBF network used. Fig.3 illustrates the
validation performance of the RBF network. The results are
good on training data but suffer from overlearning.
TABLE II
RBF CONFIGURATION
No. of hidden nodes: adaptive addition and pruning of hidden neurons
Type of radial basis function: Gaussian radial basis function
Target error: 0.001
Fig. 3. Validation performance of the RBF network
Although the RBF network produced good results on the validation dataset, it required 1800 neurons for this performance. As a result, this network suffered from overlearning and showed very poor results on the test data.
C. Support Vector Machine (SVM)
In the case of the SVM, the recognition rate on the training data is 98.86% and it achieves the optimum learning. The recognition result on the test data is 62.93%. It is observed that on the test data SVM outperforms the other two networks. Table III shows the recognition rate (%) on the training data produced by the SVM for all the three feature vectors. This testing is performed on the Chars74K dataset.
TABLE III
SVM RECOGNITION RATE (%) ON TRAINING DATA
Fourier with magnitude s(k), |a(k)| and |b(k)|: 86.66%
Fourier with phase, |a(k)| and |b(k)|: 98.74%
Fourier with magnitude s(k), |a(k)|, |b(k)| and phase: 98.04%
Now we build a CR system using all the three sets of
features in parallel. Our proposed system is shown in Fig.4.
Fig. 4. Block diagram of the proposed CR system
V. POST PROCESSING
It has been found that in many real-world applications, it is better to fuse multiple techniques to improve the results. Fusion takes advantage of different techniques by emphasizing their strengths and avoiding the weaknesses of individual techniques. We here use a fusion method based on the Borda count, inspired from [12], to combine the following techniques in parallel:
1. SVM on moduli of Fourier coefficients |a(k)|, |b(k)| and magnitude s(k)
2. SVM on moduli of Fourier coefficients |a(k)|, |b(k)| and phase
3. SVM on moduli of Fourier coefficients |a(k)|, |b(k)|, phase and magnitude s(k)
A rank is assigned and used in the calculation of the Borda count instead of calculating the number of strings below the
predicted string. The output string from a given technique is
compared with all the words in a lexicon. Then the lexicon
words are ranked according to their similarity with the output
string. The similarity between the output string and a lexicon
word is found from the number of matching characters and their
relative positions. The rank for a particular string is calculated as:

Rank = 1 - (position of the string in the top N strings) / N

The rank is 0 if the string is not in the top N choices. We have
taken N = 3; therefore only the top three words from each technique
are considered in calculating the rank.
Secondly, the confidence values produced by the different tech-
niques are considered. The confidence value for each of the three
predicted words of a given technique is the confidence that the
classifier has in its output string, even if the string is not
a valid lexicon word. This is reasonable because the top three
strings are chosen based on their similarity with the output string.
The classifier's confidence in its output string is estimated
by summing the scores of each of the predicted characters
of the output string. The final Borda count of a lexicon word is then:

Final Borda count = (rank x confidence)_tech1 + (rank x confidence)_tech2 + (rank x confidence)_tech3
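The rank-and-confidence fusion above can be sketched in a few lines. The following Python fragment (Python for consistency with the rest of this report) is illustrative only: the toy lexicon, the character-match similarity measure and the 0-indexed reading of "position" are our assumptions, not the paper's exact implementation.

# Illustrative sketch of the Borda-count fusion described above.

def rank_lexicon(output_string, lexicon, n=3):
    """Rank lexicon words by a simple same-position character match,
    then assign Rank = 1 - position/N to the top-N words (0 otherwise)."""
    def similarity(word):
        return sum(1 for a, b in zip(word, output_string) if a == b)
    top = sorted(lexicon, key=similarity, reverse=True)[:n]
    return {word: 1 - i / n for i, word in enumerate(top)}  # position 0-indexed

def borda_fuse(tech_outputs, lexicon, n=3):
    """tech_outputs: one (output_string, confidence) pair per technique.
    Returns the lexicon word with the highest summed rank * confidence."""
    scores = {word: 0.0 for word in lexicon}
    for output_string, confidence in tech_outputs:
        for word, rank in rank_lexicon(output_string, lexicon, n).items():
            scores[word] += rank * confidence
    return max(scores, key=scores.get)

lexicon = ["Moderated", "Puzzle", "Rolled"]        # toy lexicon
outputs = [("Modersted", 0.92), ("Moderated", 0.88), ("Muderated", 0.75)]
print(borda_fuse(outputs, lexicon))                # -> Moderated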
VI. RESULTS AND DISCUSSION
The proposed CR system was tested on a database consist-
ing of 26 word images. All of these images were given as input
to the proposed CR system. The lexicon used also consisted
of the same 26 words that were used for testing. Out of these
26 words, the proposed system correctly recognized 21 word
images. Figs. 5-7 show some results from the 21 correctly
recognized handwritten words.
Fig. 5. Result on “Moderated”.
Fig. 6. Result on “Puzzle”.
Fig. 7. Result on “Rolled”.
It is evident from these figures that the proposed CR system
produces fairly good results on the test samples presented to
it. The segmentation method used was efficient. The heuristic
algorithm is based on rules which are deduced empirically
and there is no guarantee of their optimal results for
different styles of writing, so their validation using a neural
network becomes essential. We tried different Fourier features
like moduli of Fourier coefficients, magnitude, phase and
their various combinations as feature vectors. The feature
vector formed using moduli of Fourier coefficients and phase
produced the best recognition accuracy of 98.74% on the
training dataset using SVM as the classifier. We have used
three combinations of Fourier descriptors in parallel for our
final system. Moreover our character recognition network has
52 output classes whereas in most of the literature separate
classifiers were used for upper and lower case characters. We
tested MLP and RBF neural networks that have been used
in the past for character recognition. We also tried support
vector machine (SVM) as classifier on the same feature set
and achieved 98% classification accuracy on the training data
set and 62.93% on the test data set. Finally, we selected SVM
as it outperformed MLP and RBF. Post-processing using a lexicon
becomes imperative, as there is no other way to find the errors
that have crept in at any of the previous stages; the only way
is to verify whether the predicted word is a valid lexicon word
or not. Thus, incorporating a lexicon in our final system using
the Borda count improved the overall efficiency of the system.
VII. CONCLUSION
This paper carries out a study of various feature based clas-
sification techniques for offline handwritten character recogni-
tion. After experimentation, it proposes an optimal character
recognition technique. The proposed method involves segmen-
tation of a handwritten word by using heuristics and artificial
intelligence. Three combinations of Fourier descriptors are
used in parallel as feature vectors. Support vector machine
is used as the classifier. Post processing is carried out by
employing lexicon to verify the validity of the predicted word.
The results obtained by using the proposed CR system are
found to be satisfactory.
REFERENCES
[1] N. Arica and F. Yarman-Vural, “Optical character recognition for cursive handwriting,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, pp. 801–813, Jun 2002.
[2] M. Blumenstein and B. Verma, “Neural-based solutions for the segmentation and recognition of difficult handwritten words from a benchmark database,” in Proceedings of the Fifth International Conference on Document Analysis and Recognition, ICDAR ’99, pp. 281–284, Sept 1999.
[3] Y. Tay, M. Khalid, R. Yusof, and C. Viard-Gaudin, “Offline cursive handwriting recognition system based on hybrid markov model and neural networks,” in Proceedings of the IEEE International Symposium on Computational Intelligence in Robotics and Automation, 2003, vol. 3, pp. 1190–1195, July 2003.
[4] G. Kim, V. Govindaraju, and S. Srihari, “A segmentation and recognition strategy for handwritten phrases,” in Proceedings of the 13th International Conference on Pattern Recognition, 1996, vol. 4, pp. 510–514, Aug 1996.
[5] Y. Y. Chung and M. T. Wong, “Handwritten character recognition by fourier descriptors and neural network,” in Proceedings of IEEE Region 10 Annual Conference on Speech and Image Technologies for Computing and Telecommunications, TENCON ’97, vol. 1, pp. 391–394, Dec 1997.
[6] B. S. Moni and G. Raju, “Modified quadratic classifier and directional features for handwritten malayalam character recognition,” in Computational Science - New Dimensions and Perspectives, NCCSE 2011, IJCA Special Issue, vol. 1, pp. 30–34, Feb 2011.
[7] M. Blumenstein, X. Y. Liu, and B. Verma, “An investigation of the modified direction feature for cursive character recognition,” Pattern Recognition, vol. 40, no. 2, pp. 376–388, 2007.
[8] M. Blumenstein, B. Verma, and H. Basli, “A novel feature extraction technique for the recognition of segmented handwritten characters,” in Proceedings of the Seventh International Conference on Document Analysis and Recognition, 2003, vol. 1, pp. 137–141, Aug 2003.
[9] E. Kavallieratou, N. Fakotakis, and G. Kokkinakis, “Skew angle estimation for printed and handwritten documents using the wigner-ville distribution,” Image and Vision Computing, vol. 20, no. 11, pp. 813–824, 2002.
[10] I. P. Morns and S. S. Dlay, “Character recognition using fourier descriptors and a new form of dynamic semisupervised neural network,” Microelectronics Journal, vol. 28, no. 1, pp. 73–84, 1997.
[11] M. Shridhar and A. Badreldin, “High accuracy character recognition algorithm using fourier and topological descriptors,” Pattern Recognition, vol. 17, no. 5, pp. 515–524, 1984.
[12] B. Verma, P. Gader, and W. Chen, “Fusion of multiple handwritten word recognition techniques,” Pattern Recognition Letters, vol. 22, no. 9, pp. 991–998, 2001.
Project Paper
OPTICAL CURSIVE HANDWRITTEN
RECOGNITION USING VPP & TDP NATIVE
SEGMENTATION ALGORITHMS AND NEURAL
NETWORKS (PYTORCH)
G. Tirumalesh, K. L. Srinivas, K. Pratima, N. Arun, Y. Hemanth (Students)
Department of Computer Science and Engineering,
Anil Neerukonda Institute of Technology and Sciences, Visakhapatnam, India.
Abstract- In the domain of Artificial Intelligence, scientists have brought about far-reaching changes in many fields, one of
them being image processing. This paper presents the process of converting handwritten text to a computer-typed
document, that is, optical cursive handwritten recognition (OCR), using segmentation-based algorithms such as
VPP (vertical projection profile), TDP (top-down profile) and other histogram (vertical and horizontal)
projection algorithms. Several other approaches are also available for segmenting text into individual
characters. For feature extraction and character recognition we use PyTorch, an open-source machine learning
library in Python used for computer vision and natural language processing.
Keywords: Vertical Projection Profile (VPP), Top Down Profile (TDP), Pytorch.
1. INTRODUCTION:
OCR stands for Optical Character Recognition; it is also known as an optical character reader. OCR translates the
text in a given image to a machine-readable format. Character recognition is classified into two types based on the
text: machine-printed text and handwritten text. It is difficult to work with handwritten text, especially cursive
handwriting, because it varies from person to person and there is no consistent line spacing, character size or
margin. A single character may be written in many styles, so it is difficult to identify and translate the script into
a machine-readable or ASCII format. In this scenario there is a step-by-step process to split a given script into
individual characters, starting with line segmentation, followed by word segmentation and finally character
segmentation. Each individual character is then predicted by a model trained using PyTorch, and the recognized
characters are combined again to return the original script to the end user in machine-readable form.
1.1 Literature Survey
Previous handwritten recognition work uses various segmentation algorithms such as heuristics, skew-recognition
techniques and writing-pressure detection. Almost every segmentation algorithm is based on horizontal and
vertical projections to segment the script into individual characters. Even text lines or characters that overlap
each other can be separated by adjusting the threshold value. The existing method was tested on more than 1000
text images of the IAM dataset; using it, 91.55% of lines and 90.5% of words are correctly segmented from the
IAM dataset, and 92% of lines and words are normalized perfectly with a very small error rate. [9]
2. PROPOSED ALGORITHM
The process for optical cursive handwritten recognition and the required algorithms for the various levels of
segmentation and for character recognition using PyTorch are as follows. The process comprises six steps:
1. Image scanning
2. Pre-processing
3. Segmentation
4. Feature extraction
5. Classification
6. Post-processing
2.1 Image scanning:
The input image can be obtained either by scanning the already existing handwritten image file (png, jpg) or by
capturing the image instantly to provide input data to the model.
Fig 2.1 Scanned input image
2.2 Image pre-processing:
The main goal here is to make the input image free from noise. As a first step, the RGB image is converted to a
grayscale image and gently sharpened to avoid loss of edges. The mean gray-intensity value is calculated and,
against a threshold value of 0.65, the brightness of the grayscale image is reduced and the contrast is increased
to distinguish the character boundaries. The text in the resulting image may turn dim and blurred because of
improper scanning of the text image. To overcome this, binarization plays a key role: the grayscale image, whose
values range between 0 and 255, is converted to a binary image using a threshold that simply decides between on
and off (0 or 1).
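A minimal OpenCV sketch of these pre-processing steps, assuming a hypothetical input file; treating the 0.65 value as a simple mean-intensity test is one possible reading of the description above.

import cv2

img = cv2.imread("scanned_page.png")              # hypothetical input image
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)      # RGB to grayscale
blur = cv2.GaussianBlur(gray, (5, 5), 0)          # gentle blur for noise removal

mean_intensity = blur.mean() / 255.0              # normalized mean gray value
if mean_intensity < 0.65:                         # one reading of the 0.65 rule
    blur = cv2.convertScaleAbs(blur, alpha=1.3, beta=-20)  # more contrast, less brightness

# Binarization: Otsu picks the on/off threshold; text pixels become white.
_, binary = cv2.threshold(blur, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)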
Fig 2.2 Blur image (for noise removal)
Fig 2.3 Binary image in color contrast
2.3 Segmentation
Basically, there are three levels of segmentation: line segmentation, word segmentation and character
segmentation.
2.3.1 Line segmentation
Horizontal histogram projections are used to segment the entire script in the input image into
individual lines, as shown in the figure below.
The primary task here is to extract each individual line from the given input image. This is done by
applying a horizontal histogram projection to the pre-processed image and then generating the threshold line value
by adjusting the average value of those horizontal projections. A graphical representation of the horizontal
histogram projection is shown in Fig 2.4 below. [6]
Fig 2.4 Horizontal projection graph
Finally, lines can be segmented from the given input script by obtaining break points: the average
threshold line value obtained from the above graph is compared with each horizontal projection.
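The projection-and-break-point logic can be sketched as follows; the fraction of the mean used as the cut-off is an illustrative tuning choice, not a value fixed by this paper.

import numpy as np

def segment_lines(binary):
    proj = (binary > 0).sum(axis=1)           # white pixels in each row
    cutoff = 0.1 * proj.mean()                # fraction of the average projection
    rows_with_text = proj > cutoff
    lines, start = [], None
    for y, has_text in enumerate(rows_with_text):
        if has_text and start is None:
            start = y                         # a text line begins
        elif not has_text and start is not None:
            lines.append(binary[start:y])     # break point: cut out the line
            start = None
    if start is not None:
        lines.append(binary[start:])
    return lines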
Fig 2.5 Segmented lines from the image
2.3.2 Word segmentation
Each word is treated as an object (a contour, in image-processing terms). A contour can be described simply
as a curve joining all the continuous points along a boundary that have the same color or intensity. Here, contours
are used for object detection, where each object is a word.
Fig 2.6 segmented line
The main reason for using contours here is that each word, being cursively written, can be treated as a curve
joining all the continuous points along its boundary. Sometimes, however, there are gaps between the letters of a
single word, which causes the word to be split into two or more pieces, as the points no longer join into a single
curve.
Such words can be identified by using a minimum threshold value, obtained by taking the average separation
distance between the words, and rejoined into a single word (contour) when the separation distance between the
pieces is less than this minimum threshold value. [7]
Minimum threshold value = (sum of separation distances between words in the line) / (number of words in the line)
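A contour-based sketch of this two-level word segmentation, using OpenCV bounding boxes; averaging over the inter-word gaps is our reading of the threshold formula above.

import cv2

def segment_words(line_img):
    contours, _ = cv2.findContours(line_img, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    boxes = sorted(cv2.boundingRect(c) for c in contours)  # (x, y, w, h), left to right
    if len(boxes) < 2:
        return boxes
    gaps = [boxes[i + 1][0] - (boxes[i][0] + boxes[i][2]) for i in range(len(boxes) - 1)]
    min_threshold = sum(gaps) / len(gaps)     # average separation distance
    merged = [boxes[0]]
    for (x, y, w, h), gap in zip(boxes[1:], gaps):
        px, py, pw, ph = merged[-1]
        if gap < min_threshold:               # too close: pieces of one word, rejoin
            nx, ny = min(px, x), min(py, y)
            merged[-1] = (nx, ny, max(px + pw, x + w) - nx, max(py + ph, y + h) - ny)
        else:
            merged.append((x, y, w, h))
    return merged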
Fig 2.7 First level word segmentation
Fig 2.8 Second level word segmentation
2.3.3 Character segmentation
Segmenting each word into individual characters is achieved by making use of two native algorithms:
1) VPP (Vertical Projection Profile)
2) TDP (Top Down Profile)
Fig 2.9 segmented word
VPP is a plot of the total number of white pixels in each column (the vertical direction) of the binary image.
Characters can be segmented at points where the VPP value stays zero for a certain number of columns (a threshold).
In the case of touching characters, however, the VPP value never reaches zero even though the characters should be
segmented (connected components). [1]
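A small sketch of first-level VPP segmentation; the zero-run length used as the threshold is an illustrative parameter.

import numpy as np

def segment_characters_vpp(word_img, zero_run=2):
    vpp = (word_img > 0).sum(axis=0)          # white pixels in each column
    chars, start, zeros = [], None, 0
    for x, v in enumerate(vpp):
        if v > 0:
            if start is None:
                start = x                     # a character begins
            zeros = 0
        elif start is not None:
            zeros += 1
            if zeros >= zero_run:             # VPP stayed zero long enough: cut
                chars.append(word_img[:, start:x - zeros + 1])
                start, zeros = None, 0
    if start is not None:
        chars.append(word_img[:, start:])
    return chars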
Fig 2.10 VPP intensity graph for word ‘MOVE’
Connected components can be identified by making use of the character's width and height: when the width of
the character is greater than 0.8 times its height, it is identified as a connected component; otherwise it is a
single character, as per basic font-size measurement. [1]

Width > 0.8 x height (connected component)
Fig 2.11 First level VPP character segmentation
TDP is a plot of the first white pixel in each column (the vertical direction) of the binary image. Touching
characters can be segmented into individual characters by taking the combined value of both VPP and TDP,
obtaining the minimum value in the combined graph, segmenting there, and repeating this process recursively
until no more touching characters are found in the word. [2]
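The width/height rule and the recursive VPP + TDP split can be sketched as below. Summing the two profiles is one simple way of combining them, and the central search window and minimum-width guard are our assumptions, not values fixed by this paper.

import numpy as np

def split_touching(char_img):
    h, w = char_img.shape
    if w <= 0.8 * h or w < 8:                 # single character (width/height rule)
        return [char_img]
    vpp = (char_img > 0).sum(axis=0)          # white pixels per column
    tdp = np.argmax(char_img > 0, axis=0)     # row of the first white pixel per column
    combined = vpp + tdp                      # combined VPP + TDP profile
    # cut at the minimum of the combined profile, away from the borders
    cut = w // 4 + int(np.argmin(combined[w // 4: 3 * w // 4]))
    return split_touching(char_img[:, :cut]) + split_touching(char_img[:, cut:])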
Fig 2.12 touching characters (connected component)
Fig 2.13 VPP intensity graph for word ‘MO’
Fig 2.14 TDP intensity graph for word ‘MO’
Fig 2.15 Combined intensity graph for word ‘MO’
Fig 2.16 Final level character segmentation
2.4 Feature Extraction:
The main goal here is to extract from the segmented characters the features required to train the
model.
This process comprises zero padding, convolution layers, an activation function, max pooling and flattening. As a
first step, zeros are added around the image to avoid loss of edges; this is termed zero padding. Multiple layers
of convolution and max-pooling filters (kernels) are then applied to obtain an image reduced in size, where each
move of the filter is a stride. Max pooling selects the maximum value within the filter window and replaces the
rest with it; average pooling works the same way but takes the average pixel value. The activation function then
comes into the picture: the ReLU activation function replaces all negative pixel values with zero, without any
change to the positive pixel values. Finally, the image is flattened by reshaping it.
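In PyTorch, this whole pipeline (zero padding, convolution, ReLU, max pooling, flattening) is only a few lines. The channel counts and the 32x32 input size below are illustrative choices, not the trained model's actual configuration.

import torch
import torch.nn as nn

features = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1),   # padding=1 adds the zero border
    nn.ReLU(),                                    # negative values replaced with zero
    nn.MaxPool2d(2),                              # keep the max of each 2x2 patch
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),                                 # reshape into a feature vector
)

x = torch.randn(1, 1, 32, 32)                     # one grayscale character image
print(features(x).shape)                          # torch.Size([1, 2048]) = 32 * 8 * 8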
Fig 2.17 Feature extraction of characters
Fig 2.18 Zero padding
Fig 2.19 Convolution layer
Fig 2.20 Max pooling and Average pooling
Fig 2.21 Flatten the image
Fig 2.22 ReLU activation function
2.5 Classification:
Finally, classification is done using a fully connected layer, where we get the probability of each class for the
given input character. The input character is classified into its respective class by selecting the class
with the maximum probability. In total there are 62 classes (0 to 9, a to z, A to Z).
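A matching classification sketch: a fully connected layer produces 62 class scores, softmax turns them into probabilities, and the class with the maximum probability is selected. The 2048-dimensional input matches the feature-extraction sketch above.

import torch
import torch.nn as nn

classifier = nn.Linear(2048, 62)            # one score per class: 0-9, a-z, A-Z
feature_vector = torch.randn(1, 2048)       # flattened features of one character

logits = classifier(feature_vector)
probs = torch.softmax(logits, dim=1)        # probability of each of the 62 classes
predicted_class = probs.argmax(dim=1)       # class with the maximum probability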
Fig 2.23 Fully connected layer with classes (x and o) along with probabilities
2.6 Post-processing:
As a final step, the accuracy of each level of segmentation and of the character recognition is obtained by
minimizing the error rate. All the recognized characters are then combined back into words, the words into lines
and the lines into the original script present in the image.
Final Result = M O V E
Fig 2.24 Final result
3. PYTORCH LIBRARY TOOL:
PyTorch is an open-source machine learning library, based on the Torch library, used for applications such as
computer vision and natural language processing.
It is a very popular framework for deep learning. The feature-extraction and classification stages are
implemented with the PyTorch library. As a first step, install all the modules/packages required to train the data
using pip, namely efficientnet-pytorch and torchsummary. Then import all the required modules into the Python
script (torch, torchvision, torch.nn, torch.utils, torch.autograd, torch.optim, torchvision.transforms,
EfficientNet).
Prepare the dataset for training the model from the NIST dataset (700,000 images) by dividing it in two:
around 600,000 images as training data and the remainder as test data.
Create a class named DATASET listing all the training images and their respective labels for training.
Download the pretrained model efficientnet-b0 and assign all the required parameters such as batch size, learning rate,
error rate, number of classes and transformations, if any. Finally, train the model over a number of epochs until
the loss is minimized and the accuracy increases.
Finally, each segmented input character is predicted by the trained model and classified into one of the classes.
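A condensed sketch of this training setup; the dummy tensors stand in for the DATASET class, the hyperparameters are illustrative, and the efficientnet-pytorch package mentioned above is assumed to be installed.

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from efficientnet_pytorch import EfficientNet

model = EfficientNet.from_pretrained("efficientnet-b0", num_classes=62)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# Dummy stand-in for the DATASET class listing images and their labels.
train_dataset = TensorDataset(torch.randn(256, 3, 224, 224),
                              torch.randint(0, 62, (256,)))
loader = DataLoader(train_dataset, batch_size=64, shuffle=True)

model.train()
for epoch in range(10):                     # train until the loss settles
    for images, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()                     # backpropagate and update the weights
        optimizer.step()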
4. EXPERIMENTAL RESULTS AND ANALYSIS:
This system is trained on around 1600 text images (paragraphs) of the IAM dataset, with almost 5678 labelled
sentences, 13353 isolated and labelled text lines and 115,320 isolated and labelled words, achieving an accuracy
of around 98% for line segmentation, 93% for word segmentation and 88% for character segmentation. [8]
For character recognition, the system is trained on around 623,000 images of 62 different characters
(0 to 9, a to z, A to Z), with an accuracy of around 96%.
5. CONCLUSION:
This paper mainly carries out a study on segmenting connected components (touching characters). Many more
challenges involved in optical cursive handwritten recognition, such as skewness and pressure detection, can be
treated as future study. The proposed method segments the connected components (touching characters) using VPP
(vertical projection profile) and TDP (top-down profile), with various other histogram projections (horizontal and
vertical) used for line and word segmentation respectively. PyTorch is the Python library used for recognizing the
segmented characters. [4]
6. ACKNOWLEDGMENT:
The project team members would like to express their thanks to their guide B. Siva Jyothi, Assistant Professor,
Department of Computer Science and Engineering, ANITS, for her valuable suggestions and guidance in completing
the project.
7. REFERENCES:
[1] Nafiz Arica, Student Member, IEEE, and Fatos T. Yarman-Vural, Senior Member, IEEE, “Optical Character Recognition for Cursive Handwriting,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 6, June 2002.
[2] Subhash Panwar and Neeta Nain, “A Novel Segmentation Methodology for Cursive Handwritten Documents,” IETE Journal of Research, vol. 60, no. 6, Nov-Dec 2014.
[3] Nibaran Das, Sandip Pramanik, Subhadip Basu, Punam Kumar Saha, “Recognition of handwritten Bangla basic characters and digits using convex hull based feature set,” 2009 International Conference on Artificial Intelligence and Pattern Recognition (AIPR-09).
[4] Abhishek Bala and Rajib Saha, “An Improved Method for Handwritten Document Analysis using Segmentation, Baseline Recognition and Writing Pressure Detection,” 6th International Conference on Advances in Computing & Communication, ICACC 2016, 6-8 September 2016, Cochin, India, Elsevier, 2016.
[5] Kanchan Keisham and Sunanda Dixit, “Recognition of Handwritten English Text Using Energy Minimisation,” Information Systems Design and Intelligent Applications, Advances in Intelligent Systems and Computing, Bangalore, India, Springer, 2016.
[6] Namrata Dave, “Segmentation Methods for Hand Written Character Recognition,” International Journal of Signal Processing, Image Processing and Pattern Recognition, vol. 8, no. 4 (2015), pp. 155-164.
[7] G. Louloudis, B. Gatos, I. Pratikakis, C. Halatsis, “Text line and word segmentation of handwritten documents,” Department of Informatics and Telecommunications, University of Athens, Greece; Computational Intelligence Laboratory, Institute of Informatics and Telecommunications, National Center for Scientific Research Demokritos, 15310 Athens, Greece.
[8] IAM dataset, http://www.fki.inf.unibe.ch/databases/iam-handwriting-database
[9] Offline handwritten character recognition using neural networks, https://www.researchgate.net/publication/239765657_Offline_handwritten_character_recognition_using_neural_network
ABSTRACT
In the field of Artificial Intelligence, scientists have brought about revolutionary
changes in image processing, and one of the biggest challenges in that field is
identifying documents in handwritten formats. One of the most widely used techniques
for processing such documents is character recognition. Optical Character
Recognition (OCR) is an extensively employed method for transforming handwritten
data into electronic format. Numerous techniques now exist for recognizing
handwriting of any form and language. A number of techniques are available in the
literature for feature extraction and for training CR systems, each with its own
strengths and weaknesses. We explore these techniques to design an optimal cursive
handwritten recognition system based on character recognition. This work presents the
process of converting handwritten text to a computer-typed document, that is,
optical cursive handwritten recognition (OCR), using segmentation algorithms
such as VPP (vertical projection profile), TDP (top-down profile) and other
histogram (vertical and horizontal) projection algorithms. For feature extraction
and character recognition we use PyTorch, an open-source machine learning library
in Python used for computer vision and natural language processing.
CONTENTS
Page No.
ABSTRACT i
LIST OF FIGURES v
LIST OF SYMBOLS vii
LIST OF TABLES viii
LIST OF ABBREVIATIONS ix
CHAPTER 1 INTRODUCTION 1
1.1 Prerequisites 2
1.1.1 Software Requirements 2
1.1.2 Hardware Requirements 2
1.1.3 Data Requirements 2
1.2 Python 3
1.2.1 Machine learning in python 4
1.2.1.1 Numpy 5
1.2.1.2 Pandas 5
1.2.1.3 Opencv 6
1.2.1.4 Sk-Learn 7
1.2.1.5 Matplotlib 8
1.2.2 Neural networks in python 9
1.2.2.1 Loss Function 10
1.3 Image Processing 12
1.3.1 Types of Images 13
1.3.2 Brightness and Contrast 14
1.4 Convolutional Neural Network 15
1.4.1 Convolutional Layer 15
1.4.2 Pooling Layer 17
1.4.3 Classification 17
1.5 Problem Statement 18
CHAPTER 2 LITERATURE SURVEY 19
2.1 Existing Methods for character recognition system 19
CHAPTER 3 METHODOLOGY 22
3.1 System architecture 22
3.2 Algorithm 23
3.2.1 Algorithm for line segmentation 24
3.2.2 Algorithm for word segmentation 24
3.2.3 Algorithm for character segmentation 25
3.2.3.1 Character segmentation using VPP 25
3.2.3.2 Touching character segmentation 25
3.3 Proposed Work 26
3.3.1 Image scanning 26
3.3.2 Pre-processing 26
3.3.3 Segmentation 28
3.3.3.1 Line segmentation 28
3.3.3.2 Word segmentation 29
3.3.3.3 Character segmentation 30
3.3.4 Feature extraction 33
3.3.5 Classification 35
3.3.6 Post-processing 36
CHAPTER 4 PYTORCH 37
4.1 Pytorch Library Tools 37
4.2 Pytorch in research 39
4.3 Training an image classifier using pytorch 40
CHAPTER 5 SAMPLE CODE ELABORATION 42
5.1 Pre-processing the data 42
5.1.1 Data Organization 42
5.1.2 Image Sizing and Shaping 42
5.1.3 Image Blurring Kernel Filter 42
5.1.4 Applying Kernel Filters and Contours on Image 43
5.2 Line Segmentation 44
5.2.1 Calculating Line Intensity(Horizontal Histograms) 44
5.2.3 Evaluating Threshold for Line Segmentation 45
5.2.4 Segmenting Paragraph into sentences 45
5.3 Word Segmentation 46
5.3.1 Combining two Missegmented Words 46
5.3.2 Segmenting Sentence into Words 46
5.4 Character Segmentation 47
5.4.1 Evaluating VPP Intensity 47
5.4.2 First Level Character Segmentation Using VPP 48
5.4.3 Segmenting Word into Characters under VPP 49
5.4.4 Evaluating VPP and TDP Average Intensity 49
5.4.5 Connected Components Segmentation 50
5.4.6 Further Required Segmentation on Connected Components. 52
5.5 Training the model 53
5.5.1 Data Loader 53
5.5.2 Defining Transforms and Parameters 53
5.5.3 Importing the Model 54
5.5.4 Training the Model 54
5.5.5 Testing the model 56
CHAPTER 6 RESULTS AND DISCUSSIONS 58
6.1 Input & Output 58
6.2 Training Datasets 58
6.3 Experimental Results and Analysis 60
CHAPTER 7 CONCLUSION 62
REFERENCES 63
LIST OF FIGURES
Fig. No Topic Name Page No.
1.1 Flow-chart for OCR 2
1.2 Machine learning overview 4
1.3 Architecture of a 2-layer Neural Network 9
1.4 Illustration of flow of network 10
1.5 A CNN Sequence 15
1.6 Convolution operation with kernel 16
1.7 Performing Pooling operation 17
1.8 Describing Classification Process 17
3.1 Block Diagram of Proposed System 22
3.2 Steps in pre-processing 23
3.3 Scanned input image 26
3.4 Pre-processing 27
3.5 Blur image (for noise removal) 27
3.6 Binary image in color contrast 28
3.7 Horizontal projection graph 28
3.8 Segmented lines from the image 29
3.9 segmented line 29
3.10 First level word segmentation 30
3.11 Second level word segmentation 30
3.12 segmented word 30
3.13 VPP intensity graph for word ‘MOVE’ 31
3.14 First level VPP character segmentation 31
3.15 Touching characters (connected component) 31
3.16 VPP intensity graph for word ‘MO’ 32
3.17 TDP intensity graph for word ‘MO’ 32
3.18 Combined intensity graph for word ‘MO’ 32
3.19 Final level character segmentation 33
3.20 Feature extraction of characters 33
3.21 Zero padding 34
3.22 Convolution layer 34
3.23 Max pooling and Average pooling 34
3.24 Flatten the image 35
3.25 ReLU activation function 35
3.26 Fully connected layer with classes (x and o)
along with probabilities 35
3.27 Final result 36
6.1 Input image 58
LIST OF SYMBOLS
X Input layer
Ŷ Output layer
W Weights
B biases
σ Activation function
Σ Summation
LIST OF TABLES
Table. No Table Name Page No.
6.1 Breakdown of the number of available training and testing
samples in the NIST Special Database 19, using the original
training and testing splits. 60
6.2 Testing and Accuracy 61
LIST OF ABBREVIATIONS
OCR Optical Character Recognition
VPP Vertical Projection Profile
TDP Top Down Profile
CNN Convolutional Neural Network
RNN Recurrent Neural Network
RGB Red Green Blue
HTR Hand written Text Recognition
GPU Graphical Processing Unit
ReLU Rectified Linear Unit
MSE Mean Square Error
MAE Mean Absolute Error
MBE Mean Bias Error
SVM Support Vector Machine
1. INTRODUCTION
Optical character recognition, abbreviated OCR, is also called optical character
reading. OCR translates images into a machine-readable format such as ASCII or
Unicode. Character recognition can be classified into two types based on the type
of text: machine-printed text and handwritten text. Character recognition of
handwritten text is more challenging than of machine-printed text, because
machine-printed characters are straight, with uniform alignment and spacing,
while handwritten characters are not uniform and vary greatly in shape and size.
There are many advantages of OCR. When a printed text is converted to
machine-readable text, we can search through it with keywords, compress it, edit
it, send it, and store it in much less space. OCR has numerous applications. It
is used by blind and visually impaired persons; in banking and legal departments
it is used to digitize documents; barcode recognition, used in the retail
industry, is also related to OCR; and it is widely used in education, finance and
automatic number-plate detection. The main challenge in the recognition of
handwritten characters is that every person on earth has a different handwriting.
Various other factors also cause differences in handwriting, such as multiple
orientations, skewness of the text lines, overlapping characters, connected
components, pressure points, etc. Many scripts exist, each with its intrinsic
variations. A single character can be written in many forms, so recognizing a
particular handwritten character is a challenging task.
There are six steps in OCR, as follows:
• Image acquisition
• Pre-processing
• Segmentation
• Feature extraction
• Classification
• Post-processing
Fig. 1.1 Flow-chart for OCR
1.1 Prerequisites:
1.1.1 Software requirements:
1. Python version - 3.0
2. Python IDE – Pycharm
3. Data science libraries – Matplotlib, numpy, PIL, Pytorch, Pandas
1.1.2 Hardware requirements:
1. CPU - 8 to 16 octa-core processors in a distributed network.
2. RAM - 128 to 256 GB
3. Storage - 30 to 50 GB
4. Entirely organized in a cloud network
1.1.3 Data Requirements:
1. NIST DATA SET – Characters
2. IAM DATASET – Forms, Sentences, Words
1.2 Python:
Python is a popular programming language. It was created by Guido van Rossum, and
released in 1991.
It is used for:
web development (server-side),
software development,
mathematics,
system scripting.
Python can do the following:
Python can be used on a server to create web applications.
Python can be used alongside software to create workflows.
Python can connect to database systems. It can also read and modify files.
Python can be used to handle big data and perform complex mathematics.
Python can be used for rapid prototyping, or for production-ready software
development.
Advantages of python are mentioned below:
Python works on different platforms (Windows, Mac, Linux, Raspberry Pi,
etc).
Python has a simple syntax similar to the English language.
Python has syntax that allows developers to write programs with fewer lines
than some other programming languages.
Python runs on an interpreter system, meaning that code can be executed as
soon as it is written. This means that prototyping can be very quick.
Python can be treated in a procedural way, an object-orientated way or a
functional way.
1.2.1 Machine learning in python :
Fig 1.2 Machine learning overview
Machine learning is learning based on experience. As an example, it is like a
person who learns to play chess through observation as others play. In this way,
computers can be programmed through the provision of information which they are
trained, acquiring the ability to identify elements or their characteristics with high
probability.
There are various stages of machine learning:
data collection
data sorting
data analysis
algorithm development
checking algorithm generated
the use of an algorithm to further conclusions
Machine learning algorithms are divided into two groups:
Unsupervised learning
Supervised learning
With unsupervised learning, the machine receives only a set of input data; it is
then up to the machine to determine the relationships between the entered data
and any other hypothetical data. Unlike supervised learning, where the machine is
provided with some verification data for learning, unsupervised learning implies
that the computer itself will find patterns and relationships between different
data sets. Unsupervised learning can be further divided into clustering and association.
Supervised learning implies the computer's ability to recognize elements based
on provided samples: the computer studies them and develops the ability to recognize
new data based on this data. For example, you can train your computer to filter spam
messages based on previously received information.
Some Supervised learning algorithms include:
Decision trees
Support-vector machine
Naive Bayes classifier
k-nearest neighbours
linear regression
1.2.1.1 Numpy :
NumPy is the fundamental package needed for scientific computing with
Python. This package contains:
a powerful N-dimensional array object
sophisticated (broadcasting) functions
basic linear algebra functions
basic Fourier transforms
sophisticated random number capabilities
tools for integrating Fortran code
tools for integrating C/C++ code
Besides its obvious scientific uses, NumPy can also be used as an efficient
multi-dimensional container of generic data. Arbitrary data types can be defined. This
allows NumPy to seamlessly and speedily integrate with a wide variety of databases.
NumPy is a successor for two earlier scientific Python libraries: Numeric and
Numarray.
1.2.1.2 Pandas :
Pandas is a popular Python package for data science, and with good reason: it
offers powerful, expressive and flexible data structures that make data manipulation
and analysis easy, among many other things. The DataFrame is one of these structures.
Those who are familiar with R know the data frame as a way to store data in
rectangular grids that can easily be overviewed. Each row of these grids corresponds
to measurements or values of an instance, while each column is a vector containing
data for a specific variable. This means that a data frame’s rows do not need to
contain, but can contain, the same type of values: they can be numeric, character,
logical, etc.
Now, DataFrames in Python are very similar: they come with the Pandas
library, and they are defined as two-dimensional labeled data structures with columns
of potentially different types.
In general, you could say that the Pandas DataFrame consists of three main
components: the data, the index, and the columns.
Firstly, the DataFrame can contain data that is:
a Pandas DataFrame
a Pandas Series: a one-dimensional labeled array capable of holding any data
type with axis labels or index. An example of a Series object is one column
from a DataFrame.
a NumPy ndarray, which can be a record or structured
a two-dimensional ndarray
dictionaries of one-dimensional ndarray’s, lists, dictionaries or Series.
Note the difference between np.ndarray and np.array(): the former is an actual
data type, while the latter is a function that makes arrays from other data structures.
Structured arrays allow users to manipulate the data by named fields: in the
example below, a structured array of three tuples is created. The first element of each
tuple will be called foo and will be of type int, while the second element will be
named bar and will be a float.
Record arrays, on the other hand, expand the properties of structured arrays.
They allow users to access fields of structured arrays by attribute rather than by index.
You see below that the foo values are accessed in the r2 record array.
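The structured-array and record-array example referred to above appears to have been lost in extraction; the following minimal NumPy reconstruction matches what the text describes (the foo/bar field names and the r2 variable come from that description).

import numpy as np

# A structured array of three tuples: field `foo` is an int, `bar` a float.
arr = np.array([(1, 2.0), (3, 4.0), (5, 6.0)],
               dtype=[("foo", "i4"), ("bar", "f4")])
print(arr["foo"])            # fields accessed by name: [1 3 5]

# A record array exposes the same fields as attributes.
r2 = arr.view(np.recarray)
print(r2.foo)                # [1 3 5]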
1.2.1.3 Opencv :
OpenCV (Open Source Computer Vision Library) is an open source computer
vision and machine learning software library. OpenCV was built to provide a common
infrastructure for computer vision applications and to accelerate the use of machine
perception in the commercial products. Being a BSD-licensed product, OpenCV
makes it easy for businesses to utilize and modify the code.
The library has more than 2500 optimized algorithms, which includes a
comprehensive set of both classic and state-of-the-art computer vision and machine
learning algorithms. These algorithms can be used to detect and recognize faces,
identify objects, classify human actions in videos, track camera movements, track
moving objects, extract 3D models of objects, produce 3D point clouds from stereo
cameras, stitch images together to produce a high resolution image of an entire scene,
find similar images from an image database, remove red eyes from images taken using
flash, follow eye movements, recognize scenery and establish markers to overlay it
with augmented reality, etc. OpenCV has a user community of more than 47 thousand
people and an estimated number of downloads exceeding 18 million. The library is
used extensively in companies, research groups and by governmental bodies.
Along with well-established companies like Google, Yahoo, Microsoft, Intel,
IBM, Sony, Honda, Toyota that employ the library, there are many startups such as
Applied Minds, VideoSurf, and Zeitera, that make extensive use of OpenCV.
OpenCV’s deployed uses span the range from stitching streetview images together,
detecting intrusions in surveillance video in Israel, monitoring mine equipment in
China, helping robots navigate and pick up objects at Willow Garage, detection of
swimming pool drowning accidents in Europe, running interactive art in Spain and
New York, checking runways for debris in Turkey, inspecting labels on products in
factories around the world on to rapid face detection in Japan.
It has C++, Python, Java and MATLAB interfaces and supports Windows,
Linux, Android and Mac OS. OpenCV leans mostly towards real-time vision
applications and takes advantage of MMX and SSE instructions when available.
Full-featured CUDA and OpenCL interfaces are being actively developed.
There are over 500 algorithms and about 10 times as many functions that compose or
support those algorithms. OpenCV is written natively in C++ and has a templated
interface that works seamlessly with STL containers.
1.2.1.4 Sk-Learn :
Scikit-learn provides a range of supervised and unsupervised learning
algorithms via a consistent interface in Python.
It is licensed under a permissive simplified BSD license and is distributed
under many Linux distributions, encouraging academic and commercial use. The
library is built upon SciPy (Scientific Python), which must be installed before you can
use scikit-learn. The SciPy stack includes:
NumPy: Base n-dimensional array package
SciPy: Fundamental library for scientific computing
Matplotlib: Comprehensive 2D/3D plotting
IPython: Enhanced interactive console
Sympy: Symbolic mathematics
Pandas: Data structures and analysis
Extensions or modules for SciPy are conventionally named SciKits; as such, the
module that provides learning algorithms is named scikit-learn. The vision for the
library is a level of robustness and support required for use in production systems. This
means a deep focus on concerns such as ease of use, code quality, collaboration,
documentation and performance.
Although the interface is Python, C libraries are leveraged for performance, such
as NumPy for arrays and matrix operations, LAPACK and LibSVM, together with the
careful use of Python. The library is focused on modelling data; it is not focused
on loading, manipulating and summarizing data. For these features, refer to NumPy and Pandas.
Some popular groups of models provided by scikit-learn include:
Clustering: for grouping unlabelled data such as K-Means.
Cross Validation: for estimating the performance of supervised models on unseen
data.
Datasets: for test datasets and for generating datasets with specific properties for
investigating model behaviour.
Dimensionality Reduction: for reducing the number of attributes in data for
summarization, visualization and feature selection such as Principal component
analysis.
Ensemble methods: for combining the predictions of multiple supervised models.
Feature extraction: for defining attributes in image and text data.
Feature selection: for identifying meaningful attributes from which to create
supervised models.
Parameter Tuning: for getting the most out of supervised models.
Manifold Learning: For summarizing and depicting complex multi-dimensional
data.
Supervised Models: a vast array not limited to generalized linear models,
discriminate analysis, naive bayes, lazy methods, neural networks, support vector
machines and decision trees.
1.2.1.5 Matplotlib:
Matplotlib is an amazing visualization library in Python for 2D plots of arrays.
Matplotlib is a multi-platform data visualization library built on NumPy arrays and
designed to work with the broader SciPy stack. It was introduced by John Hunter in
the year 2002.
One of the greatest benefits of visualization is that it allows us visual access to
huge amounts of data in easily digestible visuals. Matplotlib provides several
kinds of plots, such as line, bar, scatter and histogram plots.
1.2.2 Neural networks in python :
A neural network is a mathematical function that maps a given input to a
desired output.
Neural Networks consist of the following components:
An input layer, x
An arbitrary amount of hidden layers
An output layer, ŷ
A set of weights and biases between each layer, W and b
A choice of activation function for each hidden layer, σ.
Fig 1.3 Architecture of a 2-layer Neural Network
The output ŷ of a simple 2-layer Neural Network is:

\hat{y} = \sigma( W_2 \, \sigma( W_1 x + b_1 ) + b_2 )    (1.1)
The weights W and the biases b are the only variables that affect the output ŷ.
Training therefore alternates between two steps:
Calculating the predicted output ŷ, known as feedforward
Updating the weights and biases, known as backpropagation
The sequential graph below illustrates the process.
Fig 1.4 Illustration of flow of network
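A tiny NumPy sketch of one feedforward pass and one backpropagation step for this 2-layer network, using the sigmoid as σ and the sum-of-squares loss; the layer sizes and the 0.1 learning rate are illustrative.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.random.rand(3, 1)                  # input layer x
y = np.array([[1.0]])                     # desired output
W1, b1 = np.random.rand(4, 3), np.zeros((4, 1))
W2, b2 = np.random.rand(1, 4), np.zeros((1, 1))

# Feedforward: ŷ = σ(W2 σ(W1 x + b1) + b2)
h = sigmoid(W1 @ x + b1)
y_hat = sigmoid(W2 @ h + b2)

# Backpropagation of the loss (y - ŷ)², then one gradient step
d_out = 2 * (y_hat - y) * y_hat * (1 - y_hat)
d_h = (W2.T @ d_out) * h * (1 - h)
W2 -= 0.1 * d_out @ h.T
b2 -= 0.1 * d_out
W1 -= 0.1 * d_h @ x.T
b1 -= 0.1 * d_h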
1.2.2.1 Loss Function:
There are many available loss functions, and the nature of our problem should
dictate our choice of loss function.
The common loss functions are mentioned below:
Regression loss: Mean Square Error/Quadratic Loss/L2 Loss

MSE = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2    (1.2)
That is, the sum-of-squares error is the mean of the squared differences between
each predicted value and the actual value; the difference is squared so that we
measure its magnitude regardless of sign.
Mean Absolute Error/L1 Loss:

MAE = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|    (1.3)
It is measured as the average of the sum of absolute differences between
predictions and actual observations. Like MSE, this too measures the magnitude of
the error without considering its direction. Unlike MSE, MAE needs more complicated
tools, such as linear programming, to compute the gradients. On the other hand, MAE
is more robust to outliers since it does not make use of the square.
Mean Bias Error:

MBE = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)    (1.4)
This is much less common in the machine learning domain than its counterparts. It
is the same as MSE, with the only difference that we do not take absolute or
squared values. Clearly there is a need for caution, as positive and negative
errors can cancel each other out. Although less accurate in practice, it can
determine whether the model has a positive or negative bias.
Classification losses: Hinge Loss/Multi-class SVM Loss

L_i = \sum_{j \neq y_i} \max\left( 0, \; s_j - s_{y_i} + 1 \right)    (1.5)
The score of the correct category should be greater than the scores of the
incorrect categories by some safety margin (usually one). Hence hinge loss is used
for maximum-margin classification, most notably for SVMs. Although not
differentiable, it is a convex function, which makes it easy to work with the usual
convex optimizers used in the machine learning domain.
Cross Entropy Loss/Negative Log Likelihood:

L = -\frac{1}{n} \sum_{i=1}^{n} \left[ y^{(i)} \log \hat{y}^{(i)} + \left( 1 - y^{(i)} \right) \log\left( 1 - \hat{y}^{(i)} \right) \right]    (1.6)
This is the most common setting for classification problems. Cross-entropy loss
increases as the predicted probability diverges from the actual label. When the
actual label is 1 (y(i) = 1), the second half of the function disappears, whereas
when the actual label is 0 (y(i) = 0) the first half is dropped. In short, we are
just taking the log of the predicted probability for the ground-truth class. An
important aspect of this is that cross-entropy loss heavily penalizes predictions
that are confident but wrong.
Finally, this loss function helps us to find the best set of weights and biases,
namely those that minimize it.
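The losses above are one-liners in NumPy; the y and ŷ values here are arbitrary illustrations.

import numpy as np

y = np.array([1.0, 0.0, 1.0, 1.0])          # actual labels
y_hat = np.array([0.9, 0.2, 0.8, 0.6])      # predicted probabilities

mse = np.mean((y - y_hat) ** 2)             # Eq. 1.2
mae = np.mean(np.abs(y - y_hat))            # Eq. 1.3
mbe = np.mean(y - y_hat)                    # Eq. 1.4

# Binary cross entropy (Eq. 1.6); confident wrong predictions are punished hard
ce = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))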
1.3 Image Processing:
Image processing is a method to perform some operations on an image, in
order to get an enhanced image or to extract some useful information from it. It is a
type of signal processing in which input is an image and output may be image or
characteristics/features associated with that image. Nowadays, image processing is
among rapidly growing technologies. It forms core research area within engineering
and computer science disciplines too.
Image processing basically includes the following three steps:
Importing the image via image acquisition tools;
Analysing and manipulating the image;
Output in which result can be altered image or report that is based on image
analysis.
There are two types of methods used for image processing namely, analogue
and digital image processing. Analogue image processing can be used for the hard
copies like printouts and photographs. Image analysts use various fundamentals of
interpretation while using these visual techniques. Digital image processing techniques
help in manipulation of the digital images by using computers. The three general
phases that all types of data have to undergo while using the digital technique are
pre-processing, enhancement and display, and information extraction.
An image is nothing more than a two dimensional signal. It is defined by the
mathematical function f(x,y) where x and y are the two co-ordinates horizontally and
vertically.
The value of f(x,y) at any point gives the pixel value at that point of the image.
Some of the major fields in which digital image processing is widely used are
mentioned below
Image sharpening and restoration
Medical field
Remote sensing
Transmission and encoding
Machine/Robot vision
Color processing
Pattern recognition
Video processing
Microscopic Imaging
1.3.1 Types of Images:
1. The binary image:
The binary image, as its name states, contains only two pixel values, 0 and 1.
Here 0 refers to black and 1 refers to white. It is also known as monochrome.
The resulting image therefore consists of only black and white and can also be
called a black-and-white image.
Binary images have the PBM (Portable Bit Map) format.
2. 2, 3, 4, 5, 6 bit colour format:
Images with a colour format of 2, 3, 4, 5 or 6 bits are not widely used today.
They were used in earlier times for old TV or monitor displays. Each of these
formats has more than two gray levels, and hence shades of gray, unlike the
binary image.
A 2-bit format has 4 different colours, a 3-bit format 8, a 4-bit 16, a 5-bit 32
and a 6-bit 64.
3. 8 bit colour format
The 8-bit colour format is one of the most famous image formats. It has 256
different shades of colour and is commonly known as the grayscale image.
The range of colours in 8 bits varies from 0 to 255, where 0 stands for black,
255 for white and 127 for gray.
This format was used initially by early models of the UNIX operating systems and
the early colour Macintoshes.
The format of these images is PGM (Portable Gray Map). This format is not
supported by default on Windows; in order to see a grayscale image, you need an
image viewer or an image-processing toolbox such as MATLAB.
4. 16 bit colour format:
It is a colour image format with 65,536 different colours, also known as the high
colour format.
It has been used by Microsoft in systems that support more than the 8-bit colour
format.
The distribution of colour in a colour image is not as simple as in a grayscale
image: a 16-bit format is actually divided into three further channels, Red,
Green and Blue, the famous RGB format.
5. 24 bit colour format:
The 24-bit colour format, also known as the true colour format, distributes its
24 bits, like the 16-bit format, across the three channels Red, Green and Blue.
It is the most commonly used format. Its format is PPM (Portable PixMap), which
is supported by the Linux operating system; Windows has its own format for it,
BMP (Bitmap).
1.3.2 Brightness and Contrast:
Brightness is a visual perception in which a source appears to be reflecting
light. Brightness is a subjective property of an object which is being observed.
Brightness is an absolute term and different from lightness. Colour screens use
three colours, the RGB scheme (red, green and blue); the brightness of the screen
depends on the sum of the amplitudes of the red, green and blue pixels, divided by 3.
The perception of brightness can be affected by optical illusions, making regions
appear brighter or darker. When the brightness is decreased the colour appears
dull, and when brightness increases the colour is clearer.
Contrast is what makes an object distinguishable; we can say that contrast is
determined by the colour and brightness of the object. Contrast is the difference
between the maximum and minimum pixel intensity of an image.
1.4 Convolutional Neural Network:
Fig 1.5 A CNN Sequence
A Convolutional Neural Network (ConvNet/CNN) is a Deep Learning
algorithm which can take in an input image, assign importance (learnable weights
and biases) to various aspects/objects in the image and be able to differentiate one
from the other. The pre-processing required in a ConvNet is much lower as
compared to other classification algorithms. While in primitive methods filters are
hand-engineered, with enough training, ConvNets have the ability to learn these
filters/characteristics.
A ConvNet is able to successfully capture the Spatial and Temporal
dependencies in an image through the application of relevant filters. The architecture
performs a better fitting to the image dataset due to the reduction in the number of
parameters involved and reusability of weights. In other words, the network can be
trained to understand the sophistication of the image better.
The role of the ConvNet is to reduce the images into a form which is easier
to process, without losing features which are critical for getting a good prediction.
1.4.1. Convolution Layer:
A filter (or kernel) is an integral component of the layered architecture.
Generally, it refers to an operator applied to the entirety of the image such
that it transforms the information encoded in the pixels. In practice, however, a
kernel is a smaller-sized matrix in comparison to the input dimensions of the image,
that consists of real valued entries.
The real values of the kernel matrix change with each learning iteration over
the training set, indicating that the network is learning to identify which regions are
of significance for extracting features from the data.
Fig 1.6 Convolution operation with kernel
In Fig 1.6, we are convolving a 5x5x1 image with a 3x3x1 kernel (which changes
each iteration to extract significant features) to get a 3x3x1 convolved feature.
The filter moves to the right with a certain stride value until it parses the
complete width. In the case of images with multiple channels (e.g. RGB), the
kernel has the same depth as the input image. Matrix multiplication is performed
between each Kn and In pair in the stack ([K1, I1]; [K2, I2]; [K3, I3]) and all
the results are summed with the bias to give a squashed one-depth-channel
convolved feature output.
The objective of the convolution operation is to extract high-level features,
such as edges, from the input image; the first convolutional layer is responsible
for capturing low-level features such as edges, colour and gradient orientation.
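Reproducing the Fig 1.6 setup in PyTorch: a 5x5x1 image convolved with a 3x3x1 kernel at stride 1 and no padding gives a 3x3x1 convolved feature.

import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=3, stride=1)
image = torch.randn(1, 1, 5, 5)       # a batch of one 5x5 single-channel image
feature = conv(image)
print(feature.shape)                  # torch.Size([1, 1, 3, 3])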
1.4.2 Pooling Layer:
(a) Convoluted Output (b) Pooling Output
Fig 1.7 Performing Pooling operation
The Pooling layer is responsible for reducing the spatial size of the Convolved
Feature. This is to decrease the computational power required to process the data
through dimensionality reduction. Furthermore, it is useful for extracting dominant
features which are rotationally and positionally invariant, thus keeping the
training of the model effective.
There are two types of Pooling: Max Pooling and Average Pooling. Max
Pooling returns the maximum value from the portion of the image covered by the
Kernel. On the other hand, Average Pooling returns the average of all the values
from the portion of the image covered by the Kernel.
In Fig. 1.7 we perform the max pooling operation on the convolved feature
output obtained from the convolution layer.
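Both pooling variants are available in PyTorch; a minimal sketch (illustrative only):

import torch
import torch.nn as nn

feature = torch.rand(1, 1, 4, 4)         # a convolved feature map
max_pool = nn.MaxPool2d(kernel_size=2)   # keeps the maximum of each 2x2 window
avg_pool = nn.AvgPool2d(kernel_size=2)   # keeps the average of each 2x2 window
print(max_pool(feature).shape, avg_pool(feature).shape)  # both [1, 1, 2, 2]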
1.4.3 Classification – Fully Connected Layer (FC LAYER):
Fig 1.8 Describing Classification Process
To convert the output of the convolutional layers into a form suitable for a
multi-layer perceptron, we flatten the feature maps into a column vector. The flattened
output is fed to a feed-forward neural network, and backpropagation is applied on
every iteration of training.
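A minimal sketch of this step (the tensor shape and layer size are illustrative; 62 is the number of character classes used later in this project):

import torch
import torch.nn as nn

pooled = torch.rand(1, 8, 5, 5)          # output of the final pooling layer (assumed shape)
flat = pooled.view(pooled.size(0), -1)   # flatten into a column vector per sample
fc = nn.Linear(8 * 5 * 5, 62)            # 62 classes: 0-9, a-z, A-Z
scores = fc(flat)                        # class scores used for backpropagation
print(scores.shape)                      # torch.Size([1, 62])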
1.5 Problem Statement:
Optical character recognition (OCR), also called optical character reading,
translates images of text into a machine-readable format such as ASCII or Unicode.
Character recognition can be classified into two types based on the type of the text,
i.e. machine-printed text and handwritten text. Character recognition of handwritten
text is more challenging than that of machine-printed text, because machine-printed
characters are straight, with uniform alignment and spacing, while handwritten
characters are not uniform and vary greatly in shape and size. This project aims to
convert handwritten text into a computer-typed document, i.e. optical cursive
handwritten recognition, using segmentation algorithms such as VPP (vertical
projection profile), TDP (top-down profile) and other histogram (vertical and
horizontal) projection algorithms. For feature extraction and character recognition
we use PyTorch, an open-source machine learning library in Python used for
computer vision and natural language processing.
2. LITERATURE SURVEY
2.1 Existing Methods for character recognition system:
In the literature on cursive English handwriting recognition, an earlier study
presented off-line handwritten document analysis through segmentation, skew
recognition and writing-pressure detection for cursive handwritten documents.
There are many algorithms for line, word and character segmentation [19]. The
proposed segmentation method is based on modified horizontal and vertical
projections that can segment text lines and words even in the presence of
overlapped and multi-skewed text lines. For character segmentation there are methods
such as the multi-layer perceptron [20]. The existing method was tested on more than
550 text images of the IAM database and on sample handwriting images written by
different writers on different backgrounds. Using the existing method, 93.65% of lines
and 91.56% of words are correctly segmented from the IAM dataset. The existing work
also normalizes 92% of lines and words perfectly with a very small error rate. The
existing skew-normalization method finds the exact skew angle and is extremely
efficient compared to the techniques at hand. [4]
Each and every pixel in an image represents some information; the pixels
that contribute to the text carry more information energy. Based on this information
energy, the text-lines are segmented with 92% accuracy [5]. An Artificial Neural
Network is used to recognize the characters. The study covers the performance of a
convex hull feature set, i.e. 125 features computed by considering various bays
attributes of the convex hull of a pattern, for effective recognition of isolated
handwritten Bangla basic characters and digits. The recognition rate is 76.86% for
handwritten Bangla characters and 99.45% for Bangla numerals. [3]
The work includes the study of different segmentation techniques for
handwritten character recognition. Three levels of segmentation are presented i.e.
text-lines, word and character segmentation. The need and factors which affects the
segmentation process are discussed. [6]
The work introduces a new approach which uses a sequence of
segmentation and recognition algorithms for the OCR of cursive handwriting.
A Hidden Markov Model (HMM) is used for recognition, with 92.3% accuracy at a
lexicon size of 50; the lexicon and HMM are combined for word-level segmentation [1].
In this work, various segmentation levels are discussed. The Hough transform is used
for text-line segmentation, and skeletonization is used for the division of vertically
connected components; experiments are carried out to evaluate the approach [7].
In this work, a novel connectivity strength function is used for the
segmentation process. The connectivity strength parameter is used to decide the
components of the text-line. It is a language-adaptive approach with an accuracy
of 97.30% [2].
In most of the existing systems recognition accuracy is heavily dependent on
the quality of the input document. In handwritten text, adjacent characters tend to
touch or overlap, so it is essential to segment a given string correctly into its
character components. In most of the existing segmentation algorithms, human
writing is evaluated empirically to deduce rules [21], but there is no guarantee that
these heuristic rules produce optimum results in all styles of writing. Moreover, handwriting
varies from person to person and even for the same person it varies depending on
mood, speed etc. This requires incorporating artificial neural networks, hidden Markov
models and statistical classifiers to extract segmentation rules based on numerical data.
[22][23][24].
After segmentation next crucial step is representation of character classes by
features. These features should have high discriminative abilities so that they are
different for different character classes (for example 26 uppercase and 26 lowercase
characters in case of English language and 10 digits). Also, these features should be
independent of the intra class variations.
The different representation methods can be categorized into three major classes [21]:
1. Global transformation and series expansion: includes Fourier transform, Gabor
transform, wavelet, moments and Karhunen-Loeve expansion.
2. Statistical representation: Zoning, crossing and distances, projections.
3. Geometrical and topological representation: Extracting and counting
topological structures, geometrical properties, coding, graphs and trees etc.
Features which depend on Fourier transform are suitable for recognizing
handwritten numerals where 96% accuracy has been achieved [25]. Gradient features
have been widely used in CR for machine and hand printed binary character images.
But these features are not invariant to deformations in the characters. In [26], a new
gradient feature is used in which, at each pixel, the gradient is mapped onto 12
direction codes with an angle span of 30 degrees between the directions.
In [27], a redesigned direction feature [28] with a view to describe the
character contour more effectively is developed. Also, an additional global feature was
introduced in this technique to improve the recognition accuracy for those characters
that were most frequently confused with patterns of similar appearances. But the
disadvantage of this technique is its failure to deal with changes in stroke width as
these features are extracted from non-thinned character images. Another crucial
module in a character recognition system is its pattern recognition module which
assigns an unknown sample to a predefined class. Numerous techniques for character
recognition can be classified into four general approaches of pattern recognition: [21]
1. Template Matching : Direct matching, deformable and elastic matching,
relaxation matching.
2. Statistical techniques : Parametric recognition, non-parametric recognition,
HMM, fuzzy set reasoning.
3. Structural techniques: Grammatical methods, graphical methods.
4. Neural networks : Multilayer perceptron, radial basis function, support vector
machine
Character recognition technique has to cope with the high variability of the
handwritten cursive letters and their intrinsic ambiguity (letters like “e” and “l” or “u”
and “n” can have the same shape). Also it should be able to adapt to changes in the
input data. Template matching, statistical techniques and structural techniques can be
used when the input data is uniform over time whereas neural network (NN) classifier
can learn changes in the input data. NNs also have a parallel structure, thanks to
which they can perform computation at a higher rate than classical techniques.
Therefore, we choose neural networks for character recognition in our system.
The features that are used for training the neural network classifier also play a
very important role. The choice of a good feature vector can significantly enhance the
performance of a character classifier whereas a poor one may degrade its performance
considerably. It is found in the literature that generally separate classifiers are used for
the upper and the lower case English character classes to improve the recognition
accuracy. Moreover, good recognition accuracy could be achieved only for
handwritten numerals.
In this paper, we focus on developing an OCR system for the recognition of
handwritten English words. We first segment the words into individual characters and
then represent these characters by features that have good discriminative abilities. We
also explore different neural network classifiers to find the best classifier for the OCR
system. We combine different OCR techniques in parallel so that recognition accuracy
of the system can be improved.
3. METHODOLOGY
3.1 System Architecture:
Fig. 3.1 Block Diagram of Proposed System
A. Image acquisition:
Images can be obtained by taking the photograph or by scanning the input
document.
B. Pre-processing:
Pre-processing techniques are applied after image acquisition and before
segmentation. These are used to remove noise from the image and enhance it for
further processing. Pre-processing techniques include noise removal,
skew correction, cropping and resizing, normalization, thinning, binarization,
skeletonization. Morphological operations such as dilation, erosion can also be
applied to the input scanned image.
The steps in pre-processing are shown in figure 3.2 below:
Fig. 3.2 Steps in pre-processing
C. Segmentation:
Segmentation is of three types i.e. line, word and character segmentation.
Line segmentation separates the lines from a paragraph. Word segmentation
separates the words from a line and character segmentation separates the characters
from a word.
D. Feature Extraction:
Feature extraction is an important step in the recognition process. In this
process, all the essential information about a character which is present in an image
is extracted.
E. Classification:
In classification, an unknown sample is assigned to a predefined class.
According to the extracted features, characters are classified and recognized.
F. Post-processing:
To achieve more accuracy, various post-processing techniques are used, for
example, matching a recognized word with a dictionary word.
3.2 Algorithm(s):
The horizontal projection method is used to segment a line from a paragraph. As
a first step, the horizontal histogram of the image is created. The average height of
the rising sections is taken as the threshold. The height of each rising section is then
checked: if it is greater than or equal to the threshold, the corresponding line is
segmented from the binary image.
3.2.1 Algorithm for Line Segmentation:
1) Read a handwritten document image as a multidimensional array.
2) Check whether the image is a binary image. If it is, store it into a 2-d array
IMG[][] of size MxN and go to Step 4; otherwise go to Step 3.
3) Convert the image to a binary image and store it into the 2-d array IMG[][].
4) Construct the horizontal projection histogram of the image IMG[][] and
store it into a 2-d array HPH[][].
5) Measure the height, starting row position and ending row position of
each horizontally rising section of the horizontal projection histogram
and store them into a 3-d array LH[][][] sequentially.
6) Count the number of rising sections by counting the rows of the 3-d array
LH[][][]. Then compute the threshold (Ti) as the average height of the
rising sections in LH[][][].
7) Select each rising section from the 3-d array LH[][][] and check whether
its height is less than the threshold. If it is, the rising section is not
considered a line; go to Step 9. Otherwise the rising section is treated as
a line; go to Step 8.
8) Find the rising section's starting and ending row numbers from the array
LH[][][]. Let the starting and ending rows be r1 and r2 respectively. Extract
the line segment between r1 and r2 from the original binary image IMG[][].
9) Go to Step 7 for the next rising section until all rising sections have been
considered; otherwise go to the next step.
10) End.
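The steps above can be sketched compactly in Python (a minimal illustration assuming img is a binary NumPy array whose text pixels are 1 and which contains at least one text row; the full implementation used by this project appears in chapter 5):

import numpy as np

def segment_lines(img):
    # Step 4: horizontal projection histogram (text pixels per row)
    hph = img.sum(axis=1)
    # Steps 5-6: locate rising sections; average height is the threshold
    rows = np.flatnonzero(hph > 0)
    sections, start = [], rows[0]
    for prev, cur in zip(rows, rows[1:]):
        if cur != prev + 1:
            sections.append((start, prev))
            start = cur
    sections.append((start, rows[-1]))
    threshold = sum(r2 - r1 + 1 for r1, r2 in sections) / len(sections)
    # Steps 7-8: keep only rising sections at least as tall as the threshold
    return [img[r1:r2 + 1] for r1, r2 in sections if r2 - r1 + 1 >= threshold]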
3.2.2 Algorithm for Word Segmentation:
1) Read a segmented binary line as a 2-d binary image LN[][].
2) Construct the vertical projection histogram of the line LN[][] and store
it into a 2-d array LVP[][].
3) From the vertical projection histogram (LVP[][]), measure the width of
each inter-word and intra-word gap and store the widths into a 1-d array
GAPSW[].
4) Count the total number of gaps as TGP by calculating the size of GAPSW[].
Add the widths of all gaps by summing the elements of GAPSW[] and store
the result into TWD.
5) Calculate the threshold (Ti) as follows: Ti = TWD / TGP, where Ti is the
threshold value denoting the average gap width, TWD denotes the total
width of all gaps and TGP denotes the total number of gaps.
6) For each i (1 <= i <= sizeof(GAPSW[])), if GAPSW[i] >= Ti then the gap
is treated as an inter-word gap, otherwise it is treated as an intra-word gap.
Depending on the inter-word gap widths, words are segmented from the line.
7) End.
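A minimal sketch of steps 5 and 6 (illustrative only; the full implementation appears in chapter 5):

def classify_gaps(gap_widths):
    # Step 5: threshold Ti = total gap width / number of gaps
    ti = sum(gap_widths) / len(gap_widths)
    # Step 6: gaps at least as wide as Ti separate two words
    return ["inter-word" if w >= ti else "intra-word" for w in gap_widths]

print(classify_gaps([2, 3, 11, 2, 9]))   # wide gaps become word boundaries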
3.2.3 Algorithm for Character Segmentation:
3.2.3.1 Character Segmentation Using VPP:
First, the proposed algorithm uses the VPP of the binary image obtained after word
segmentation for character segmentation. The VPP represents, as a graph, the total
number of white pixels in each vertical column of the binary image. Since the
boundaries of characters are regions composed of background in the vertical
direction, where the VPP value is zero, the text region is separated at these points.
When the width of a separated character image is longer than 0.8 times its height
(a feature of the printed character in the slab image), the separated character image
is judged to contain touching characters:
Width > 0.8 * Height : Touching Character 3.1
3.2.3.2 Touching Character Segmentation:
The boundaries of touching characters are located at the valley points of the VPP
(vertical projection profile) or the TDP (top-down profile). The TDP represents, as a
graph, the position of the first white pixel in each column. Because not all valley
points are boundaries of touching characters, all candidate boundary points are
extracted. For the VPP and TDP analysis, the binary image, a feature binary image
and the gray image are used. White pixels in the feature binary image are composed
of the peak, hillside and ridge points of the topographic features of the gray image [10].
All extracted candidate boundary points are combined to calculate the score
graph. Real boundary regions of characters have a large value in the score graph:
Score Graph = (VPP + TDP)/2 3.2
Combined boundary points are selected from the score graph recursively.
When more combined boundary points are found than real character boundary points,
the correct boundary points must be chosen. After enumerating all the cases in which
the touching character can be separated at the combined boundary points, the proposed
algorithm selects the case with the minimum distance between the separated
character images and the representative images, using a recognition-based method. The
representative image displays the recognition result of the separated character image.
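A minimal sketch of the score-graph computation (illustrative only; word is assumed to be a binary NumPy array with text pixels equal to 1, and the helper names are not the project's own):

import numpy as np

def score_graph(word):
    vpp = word.sum(axis=0)                 # text pixels per column
    first = np.argmax(word, axis=0)        # row of the first text pixel per column
    tdp = np.where(word.any(axis=0), word.shape[0] - first, 0)
    return (vpp + tdp) / 2.0               # equation (3.2)

def boundary_candidate(word):
    # touching characters split at the valley (minimum) of the score graph
    return int(np.argmin(score_graph(word)))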
3.3 Proposed Work:
The process for optical cursive handwritten recognition, together with the required
algorithms for the various levels of segmentation and for character recognition using
PyTorch, is as follows. It comprises six steps:
1. Image scanning
2. Pre-processing
3. Segmentation
4. Feature extraction
5. Classification
6. Post-processing
3.3.1 Image Scanning:
The input image can be obtained either by scanning an already existing
handwritten image file (png, jpg) or by capturing the image directly, to provide the
input data to the model.
Fig 3.3 Scanned input image
3.3.2 Pre-Processing:
The main goal here is to make the input image free from noise. As a first step,
convert the RGB image to a grayscale image and gently sharpen it to avoid losing
edges. Calculate the mean gray-intensity value and, on a threshold value of less than
0.65, reduce the brightness of the grayscale image; the contrast is increased to
distinguish the character boundaries [8]. The text present in the obtained result may
turn dim and blurred because of improper scanning of the text image.
Fig 3.4 Pre-processing
To overcome this, binarization plays a key role: it converts the grayscale
image, whose values range between 0 and 255, to a binary image using a threshold
value that simply decides on or off (0 or 1). Since burned characters look dim in the
text region, such characters can disappear in the binary image. Therefore, when
converting to a binary image, we apply Otsu's binarization method [9] not to the
whole region but to the respective local regions.
Fig 3.5 Blur image (for noise removal)
Fig 3.6 Binary image in color contrast
3.3.3 Segmentation:
Basically, there are three levels of segmentation. Line segmentation, Word
segmentation and Character segmentation.
3.3.3.1 Line Segmentation:
Horizontal histogram projections are used to segment the entire script
present in the input image into individual lines, as shown in the figure below.
The primary task here is to extract each individual line from the given input
image. This is done by applying the horizontal histogram projection to the
pre-processed image and then generating the threshold value as the average of
those horizontal projections. A graphical representation of the horizontal
histogram projection is shown in figure 3.7 below. [6]
Fig 3.7 Horizontal projection graph
Finally, lines can be segmented from the given input script by locating the
break points, comparing each horizontal projection against the average threshold
value obtained from the above graph.
Fig 3.8 Segmented lines from the image
3.3.3.2 Word Segmentation
Each word is treated as an object (Contour – in terms of image processing).
Contour can be explained simply as a curve joining all the continuous points (along
the boundary), having same color or intensity. Here, Contours are useful for object
detection where each object is a word.
Fig 3.9 segmented line
The main reason for using contours here is that each word, being cursively
written, can be treated as a curve joining all the continuous points along its
boundary. Sometimes, however, there are gaps between the letters of a single word,
which causes the word to be split into two or more words, as the points are no
longer continuous along one curve.
Such words can be identified using a minimum threshold value, obtained by
taking the average separation distance between the words, and rejoined into a single
word (contour) wherever the separation distance between two words is less than the
minimum threshold value. [7]
Minimum threshold value = (sum of separation distances between words in the
line) / (no. of words in the line)
Fig 3.10 First level word segmentation
Fig 3.11 Second level word segmentation
3.3.3.3 Character Segmentation
Segmenting each word into individual characters can be obtained by making use
of two native algorithms:
1. VPP (Vertical Projection Profile)
2. TDP (Top Down Profile)
Fig 3.12 segmented word
The VPP is a plot of the total number of white pixels in the vertical
direction of the binary image. Characters can be segmented at points where the VPP
value is zero for a certain number of consecutive columns (threshold). In the case of
touching characters, however, the VPP value may never be zero even where the
characters should be segmented (connected components). [1]
Fig 3.13 VPP intensity graph for word ‘MOVE’
Connected components can be identified by making use of characters width
and height. When the width of the character is greater than 0.8 times the height of the
character then it is identified as a connected component otherwise it is a single
character as per basic font size measurement.[1]
Width > 0.8 * height (Connected component)
Fig 3.14 First level VPP character segmentation
(connected component / single character / single character)
The TDP is a plot of the first white pixel in the vertical direction of the
binary image. Touching characters can be segmented into individual characters by
taking the combined value of both the VPP and the TDP: the minimum value of the
combined graph gives the point at which to split, and this process continues
recursively until no more touching characters are found in the word. [2]
Fig 3.15 touching characters (connected component)
Fig 3.16 VPP intensity graph for word ‘MO’
Fig 3.17 TDP intensity graph for word ‘MO’
Fig 3.18 Combined intensity graph for word ‘MO’
Fig 3.19 Final level character segmentation
3.3.4 Feature Extraction:
The main goal here is to extract the features from the segmented characters
which are required to train the data.
This process comprises zero padding, convolution layers, an activation function,
max pooling and flattening. As a first step, zeros are added around the image to
overcome the loss of edges; this is termed zero padding. Then multiple layers of
convolution and max-pooling filters (kernels) are applied to obtain an image of
reduced size, where each move of a filter is a stride. Max pooling selects the
maximum value inside the filter window and discards the remaining values; in the
same way, average pooling takes the average pixel value. Next the activation
function comes into the picture: the ReLU activation function replaces all negative
pixel values with zero, without any change to the positive pixel values. Finally, the
image obtained is flattened by reshaping it, as sketched below.
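The whole pipeline just described can be expressed as a stack of PyTorch layers; this is a minimal sketch with illustrative layer sizes, not the project's exact network:

import torch
import torch.nn as nn

features = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1),   # zero padding preserves the edges
    nn.ReLU(),                                   # negative pixel values replaced by zero
    nn.MaxPool2d(2),                             # keep the max value of each window
    nn.Conv2d(8, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),                                # reshape the feature maps into a vector
)
out = features(torch.rand(1, 1, 28, 28))
print(out.shape)                                 # torch.Size([1, 784])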
Fig 3.20 Feature extraction of characters
Fig 3.21 Zero padding
Fig 3.22 Convolution layer
Fig 3.23 Max pooling and Average pooling
Fig 3.24 Flatten the image
Fig 3.25 ReLU activation function
3.3.5 Classification:
Finally, classification is done using a fully connected layer, where we get the
probabilities of each class for the given input character. The given input character
is assigned to the class with the maximum probability. In total, characters are
classified into 62 classes (0 to 9, a to z, A to Z).
Fig 3.26 Fully connected layer with classes (x and o) along with probabilities
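As a brief sketch of this step (illustrative only), the 62 output scores of the fully connected layer can be turned into class probabilities with a softmax, and the predicted class is the one with the maximum probability:

import torch
import torch.nn as nn

scores = torch.rand(1, 62)                     # fully connected output for one character
probs = nn.functional.softmax(scores, dim=1)   # probability of each of the 62 classes
predicted_class = probs.argmax(dim=1)          # class with the maximum probability
print(predicted_class.item())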
3.3.6 Post-processing:
As a final step, obtain the accuracy for all the levels of segmentation and for
character recognition while minimizing the error rate. Then combine all the recognized
characters into words, the words into lines and the lines into the original script
present in the image.
Final Result = M O V E
Fig 3.27 Final result
4. PYTORCH
4.1 Pytorch library tool:
PyTorch is an open-source machine learning library based on the Torch library,
used for applications such as computer vision and natural language processing. It is
a Python-based scientific computing package targeted at two sets of audiences:
A replacement for NumPy that uses the power of GPUs
A deep learning research platform that provides maximum flexibility
and speed
It is known for providing two of its most notable high-level features: tensor
computation with strong GPU acceleration support, and deep neural networks built
on a tape-based autograd system.
There are many existing Python libraries which have the potential to change
how deep learning and artificial intelligence are performed, and this is one such
library. One of the key reasons behind PyTorch's success is that it is completely
Pythonic and one can build neural network models effortlessly. It is still a young
player compared to its competitors; however, it is gaining momentum fast.
Since its release in January 2016, many researchers have increasingly
adopted PyTorch. It has quickly become a go-to library because of the ease with
which extremely complex neural networks can be built. It gives tough competition
to TensorFlow, especially for research work. However, there is still some time
before it is adopted by the masses, due to its still "new" and "under construction" tags.
PyTorch's creators envisioned the library to be highly imperative, allowing
all numerical computations to run quickly. This is an ideal methodology which fits
perfectly with the Python programming style. It has allowed deep learning scientists,
machine learning developers and neural network debuggers to run and test parts of
their code in real time; they no longer have to wait for the entire program to execute
to check whether it works.
You can always use your favorite Python packages such as NumPy, SciPy and
Cython to extend PyTorch's functionality when required. PyTorch is a dynamic
library (very flexible, usable as per your requirements and changes) which is
currently adopted by many researchers, students and artificial intelligence
developers. In recent Kaggle competitions, the PyTorch library was used by nearly
all of the top-10 finishers.
Some of the key highlights of PyTorch include:
Simple interface: It offers an easy-to-use API; it is thus very simple to
operate and runs like Python.
Pythonic in nature: This library, being Pythonic, smoothly integrates with
the Python data science stack and can leverage all the services and
functionality offered by the Python environment.
Computational graphs: In addition, PyTorch provides an excellent
platform offering dynamic computational graphs, so you can change
them during runtime. This is highly useful when you have no idea how
much memory will be required for creating a neural network model.
It is an optimized tensor library for deep learning using CPUs and GPUs.
The feature extraction and classification stages are implemented with the PyTorch
library tool. As a first step, install all the required modules/packages to train the data
using pip, namely efficientnet-pytorch and torchsummary. Then include all the
required modules by importing them into the Python script (torch, torchvision,
torch.nn, torch.utils, torch.autograd, torch.optim, torchvision.transforms, EfficientNet).
Torch: The torch package contains data structures for multi-dimensional
tensors and mathematical operations over these are defined. Additionally, it provides
many utilities for efficient serializing of Tensors and arbitrary types, and other useful
utilities.
Torchvision: This package consists of popular datasets, model architectures,
and common image transformations for computer vision.
Torch.nn: Provides the building blocks for neural networks, including
Parameter, a kind of Tensor that is to be considered a module parameter. Parameters
are Tensor subclasses that have a very special property when used with Modules:
when they are assigned as Module attributes they are automatically added to the list
of the module's parameters, and will appear e.g. in the parameters() iterator.
Assigning a plain Tensor does not have such an effect. This is because one might
want to cache some temporary state, like the last hidden state of an RNN, in the
model; if there were no such class as Parameter, these temporaries would get
registered too.
Torch.autograd: Provides classes and functions implementing automatic
differentiation of arbitrary scalar-valued functions. It requires minimal changes to
existing code: you only need to declare the Tensors for which gradients should be
computed with the requires_grad=True keyword argument.
Torch.optim: This is a package implementing various optimization
algorithms. Most commonly used methods are already supported, and the interface is
general enough, so that more sophisticated ones can be also easily integrated in the
future.
4.2 Pytorch in research:
Anyone who is working in the field of deep learning and artificial intelligence
has likely worked with Tensorflow before, Google’s most popular open source
library. However, the latest deep learning framework – PyTorch solves major
problems in terms of research work. Arguably PyTorch is Tensorflow’s biggest
competitor to date, and it is currently a much-favored deep learning and artificial
intelligence library in the research community.
Dynamic Computational graphs:
It avoids static graphs that are used in frameworks such as TensorFlow,
thus allowing the developers and researchers to change how the network behaves on
the fly. The early adopters are preferring PyTorch because it is more intuitive to learn
when compared to TensorFlow.
Different back-end support:
PyTorch uses different backends for CPU, GPU and various functional
features rather than a single backend. It uses the tensor backend TH for CPU and
THC for GPU, while the neural-network backends THNN and THCUNN serve CPU
and GPU respectively. Using separate backends makes it very easy to deploy
PyTorch on constrained systems.
Imperative style:
PyTorch library is specially designed to be intuitive and easy to use.
When you execute a line of code, it gets executed thus allowing you to perform real-
time tracking of how your neural network models are built. Because of its excellent
imperative architecture and fast and lean approach it has increased overall PyTorch
adoption in the community.
Highly extensible:
PyTorch is deeply integrated with C++ code, and it shares some of its
C++ backend with the deep learning framework Torch. Users can thus program in
C/C++ using an extension API based on cFFI for Python, compiled for CPU or GPU
operation. This feature has extended PyTorch's usage to new and experimental use
cases, making it a preferable choice for research use.
Python-Approach:
PyTorch is a native Python package by design. Its functionalities are
built as Python classes. Hence, all its code can seamlessly integrate with Python
packages and modules. Similar to NumPy, this Python-based library enables GPU-
accelerated tensor computations plus provides rich options of APIs for neural network
applications. PyTorch provides a complete end-to-end research framework which
comes with the most common building blocks for carrying out daily deep learning
research. It allows chaining of high-level neural network modules because it supports
Keras-like API in its torch.nn package.
4.3 Training an image classifier using Pytorch:
Generally, when you have to deal with image, text, audio or video data, you
can use standard Python packages that load the data into a NumPy array. You can
then convert this array into a torch.*Tensor.
For images, packages such as Pillow, OpenCV are useful
For audio, packages such as scipy and librosa
For text, either raw Python or Cython based loading, or NLTK and
SpaCy are useful
Specifically for vision, there is a package called torchvision that has
data loaders for common datasets such as ImageNet, CIFAR10 and MNIST, and data
transformers for images, viz. torchvision.datasets and torch.utils.data.DataLoader.
This provides a huge convenience and avoids writing boilerplate code.
It includes the following steps:
Load and normalize the training and test datasets using
torchvision
Define a Convolutional Neural Network
Define a loss function
Train the network on the training data
Test the network on the test data
Using torchvision, it is extremely easy to load data. The output of the
torchvision datasets are PILImage images of range [0, 1]; we transform them to
Tensors of normalized range [-1, 1].
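For example, the [0, 1] to [-1, 1] mapping can be written as follows (a minimal sketch for a single-channel grayscale image; an RGB image would use three means and standard deviations):

import torchvision.transforms as transforms

transform = transforms.Compose([
    transforms.ToTensor(),                 # PILImage in [0, 1] -> Tensor
    transforms.Normalize((0.5,), (0.5,)),  # (x - 0.5) / 0.5 maps [0, 1] -> [-1, 1]
])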
Now define the convolutional neural network and the essential loss function and
optimizer.
Training: We simply loop over our data iterator, feed the inputs to the
network and optimize. We trained the network for around 6 passes over the training
dataset, but we need to check whether the network has learnt anything at all.
We check this by predicting the class label that the neural network outputs
and comparing it against the ground truth. If the prediction is correct, we add the
sample to the list of correct predictions.
Training on GPU: Just as you transfer a Tensor onto the GPU, you
transfer the neural net onto the GPU. Let us first define our device as the first visible
CUDA device, if CUDA is available.
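A minimal sketch of that device selection (net, inputs and labels stand for the model and batch from the surrounding training loop):

import torch

# first visible CUDA device if available, otherwise fall back to the CPU
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
net = net.to(device)                                   # move the model's parameters
inputs, labels = inputs.to(device), labels.to(device)  # move each batch as well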
5. SAMPLE CODE ELABORATION
5.1 Pre-processing the data:

import os
import math
import cv2
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

5.1.1 Data Organization:

INPUT_FILENAME=[]
OUTPUT_CLASS=[]
# walk the test-image folders; each folder name is the class label
for DIRNAME, _, FILENAMES in os.walk('/kaggle/input/nist-characters-dataset/characters/test_images'):
    for FILENAME in FILENAMES:
        INPUT_FILENAME.append(FILENAME.split('.')[0])
        OUTPUT_CLASS.append(DIRNAME.split('/')[-1])
TESTDATA=pd.DataFrame({'FILENAME':INPUT_FILENAME,'CLASS':OUTPUT_CLASS})
TESTDATA.to_csv('test.csv')
SAMPLEDATA=pd.DataFrame({'FILENAME':INPUT_FILENAME,'CLASS':[0 for _ in range(len(OUTPUT_CLASS))]})
SAMPLEDATA.to_csv('sample_submission.csv')

5.1.2 Image Sizing and Shaping:

def PREPAREIMAGE(IMAGE,req_height):
    # convert to grayscale and rescale the image to the required height
    if IMAGE.ndim == 3:
        IMAGE=cv2.cvtColor(IMAGE,cv2.COLOR_BGR2GRAY)
    height=IMAGE.shape[0]
    FACTOR=req_height/height
    print("Resized by FACTOR : ",FACTOR)
    return cv2.resize(IMAGE,dsize=None,fx=FACTOR,fy=FACTOR)

5.1.3 Image Blurring Kernel Filter:

def CREATEKERNELFILTER(kernelSize,SIGMA,THETA):
    # anisotropic Gaussian kernel that smears the letters of a word together
    HALFSIZE=kernelSize//2
    kernel=np.zeros([kernelSize,kernelSize])
    SIGMAX=SIGMA
    SIGMAY=SIGMA*THETA
    for i in range(kernelSize):
        for j in range(kernelSize):
            x=i-HALFSIZE
            y=j-HALFSIZE
            expTerm=np.exp(-((x**2)/(2*(SIGMAX**2)))-((y**2)/(2*(SIGMAY**2))))
            kernel[i,j]=(1/(2*math.pi*SIGMAX*SIGMAY))*expTerm
    return kernel

5.1.4 Applying Kernel Filters and Contours on image:

def Pre_Processing_Sentence(sentence):
    # SHOW_IMAGE and SORT_IMAGES are small display/sorting helpers defined
    # elsewhere in the notebook
    print("RESIZED SENTENCE: ")
    UPDATED_SENTENCE=PREPAREIMAGE(sentence,50)
    SHOW_IMAGE(UPDATED_SENTENCE,cmap='gray')
    print("BLURRED SENTENCE: ")
    blurred_sentence=cv2.GaussianBlur(sentence,(5,5),0)
    SHOW_IMAGE(blurred_sentence,cmap='gray')
    print("FILTERED SENTENCE: ")
    kernelSize=25
    SIGMA=11
    THETA=7
    MINAREA=150
    kernel=CREATEKERNELFILTER(kernelSize,SIGMA,THETA)
    Filtered_sentence=cv2.filter2D(sentence,-1,kernel,borderType=cv2.BORDER_REPLICATE)
    SHOW_IMAGE(Filtered_sentence,cmap='gray')
    print("THRES SENTENCE: ")
    # Otsu's method picks the binarization threshold automatically
    THRES_VALUE,Thres_sentence=cv2.threshold(Filtered_sentence,0,255,cv2.THRESH_BINARY+cv2.THRESH_OTSU)
    Thres_sentence=255-Thres_sentence
    SHOW_IMAGE(Thres_sentence,cmap='gray')
    # every external contour is a candidate word
    components,HIERARCHY=cv2.findContours(Thres_sentence,cv2.RETR_EXTERNAL,cv2.CHAIN_APPROX_SIMPLE)
    print('NO OF COMPONENTS : ',len(components))
    print("CONTOURED SENTENCE: ")
    SHOW_IMAGE(cv2.drawContours(sentence,components,-1,(255,0,0),5))
    Words=[]
    for contour in components:
        if(cv2.contourArea(contour) >= MINAREA):
            (x,y,w,h)=cv2.boundingRect(contour)
            Words.append([(x,y,w,h),sentence[y:y+h,x:x+w]])
    print('NO OF COMPONENTS AFTER FILTERING: ',len(Words))
    Words.sort(key=SORT_IMAGES)
    return Words

5.2 Line Segmentation:
5.2.1 : Calculating Line Intensity (Horizontal Histogram):

def LINE_SEGMENTATION(IMAGE):
    # horizontal histogram: count the dark (text) pixels in every row
    Line_intensity=[0 for line in range(len(IMAGE))]
    for line in range(len(IMAGE)):
        count=0
        for pixel in range(len(IMAGE[0])):
            if(IMAGE[line][pixel]<128):
                count+=1
        Line_intensity[line]=count
    print("LINE INTENSITY: ")
    print(Line_intensity)
    plt.plot(Line_intensity)
    plt.xticks([])
    plt.show()
    return EVALUATING_THRESHOLD(IMAGE,Line_intensity)
5.2.2 : Evaluating Threshold for Line Segmentation:

def EVALUATING_THRESHOLD(IMAGE,Line_intensity):
    # collect [start, end] row pairs of text regions, alternating with the
    # widths of the blank runs that separate them
    Line_Segments=[]
    LINE_SEPERATION=0
    START_FLAG=0
    Zero_count=0
    START_INDEX=0
    End_index=0
    SET_FLAG=0
    for line in range(len(Line_intensity)):
        if(Line_intensity[line]==0):
            Zero_count+=1
            if(SET_FLAG==0):
                End_index=line
                SET_FLAG=1
        else:
            if(SET_FLAG==1 and START_FLAG==1):
                SET_FLAG=0
                Line_Segments.append([START_INDEX,End_index])
                START_INDEX=line
                Line_Segments.append(Zero_count)
                LINE_SEPERATION=LINE_SEPERATION+(Zero_count**2)
                Zero_count=0
            if(START_FLAG==0):
                START_FLAG=1
                SET_FLAG=0
                START_INDEX=line
                Zero_count=0
    Line_Segments.append([START_INDEX,End_index])
    Line_Threshold=math.sqrt(LINE_SEPERATION)/6
    print("LINE THRESHOLD : ",Line_Threshold)
    print("LINE SEGMENTS : ",Line_Segments)
    return Segmenting_Lines(IMAGE,Line_Segments,Line_Threshold)

5.2.3 : Segmenting Paragraph into Sentences:

def Segmenting_Lines(IMAGE,Line_Segments,Line_Threshold):
    Sentences=[]
    print(Line_Segments)
    # blank runs wider than the threshold separate two sentences
    for index in range(1,len(Line_Segments),2):
        if(Line_Segments[index]>Line_Threshold):
            y=Line_Segments[index-1][0]-5
            h=Line_Segments[index-1][1]-Line_Segments[index-1][0]+10
            Sentences.append(IMAGE[y:y+h])
    y=Line_Segments[-1][0]-5
    h=Line_Segments[-1][1]-Line_Segments[-1][0]+10
    Sentences.append(IMAGE[y:y+h])
    return Sentences

5.3 Word Segmentation:
5.3.1 : Combining two Missegmented Words:

def combine_Words(Word1,Word2,sentence):
    # merge two bounding boxes that belong to the same cursive word
    WORD=[[]]
    WORD[0].append(Word1[0][0])                                # x-axis position
    WORD[0].append(min(Word1[0][1],Word2[0][1]))               # y-axis position
    WORD[0].append((Word2[0][0]-Word1[0][0])+Word2[0][2])      # width
    WORD[0].append(max(Word1[0][1]+Word1[0][3],Word2[0][1]+Word2[0][3])-WORD[0][1])  # height
    WORD.append(sentence[WORD[0][1]:WORD[0][1]+WORD[0][3],WORD[0][0]:WORD[0][0]+WORD[0][2]])
    return WORD

5.3.2 : Segmenting Sentence into Words:

def WORD_SEGMENTATION(Words,sentence):
    FINAL_WORDS=[]
    word=[]
    FINAL_FLAG=0
    WORD_SEPERATION_SUM=0
    SEPERATION=[]
    # horizontal distance between consecutive bounding boxes
    for word_no in range(len(Words)-1):
        DISTANCE=Words[word_no+1][0][0]-(Words[word_no][0][0]+Words[word_no][0][2])
        SEPERATION.append(DISTANCE)
        WORD_SEPERATION_SUM=WORD_SEPERATION_SUM+DISTANCE
    WORD_AVERAGE_THRESHOLD=math.sqrt(WORD_SEPERATION_SUM/(len(Words)-1))
    print('WORDS SEPERATION : ',SEPERATION)
    print('AVERAGE THRESHOLD FOR WORD SEPERATION : ',WORD_AVERAGE_THRESHOLD)
    for index in range(len(SEPERATION)):
        if(len(word)==0):
            word=Words[index]
        if(SEPERATION[index]>WORD_AVERAGE_THRESHOLD):
            FINAL_WORDS.append(word)
            word=[]
            FINAL_FLAG=0
        else:
            # narrow gap: the next box belongs to the same word, rejoin them
            word=combine_Words(word,Words[index+1],sentence)
            FINAL_FLAG=1
    if(FINAL_FLAG==0):
        FINAL_WORDS.append(Words[-1])
    else:
        FINAL_WORDS.append(word)
    return FINAL_WORDS

5.4 Character Segmentation:
5.4.1 : Evaluating VPP Intensity:

def EVALUATING_VPP_INTENSITY(PRE_PROCESSED_BINARY_IMAGE):
    # vertical projection profile: dark (text) pixels per column
    VPP_Intensity=[0 for col in range(len(PRE_PROCESSED_BINARY_IMAGE[0]))]
    for row in range(len(PRE_PROCESSED_BINARY_IMAGE)):
        for col in range(len(PRE_PROCESSED_BINARY_IMAGE[row])):
            if(PRE_PROCESSED_BINARY_IMAGE[row][col]==0):
                VPP_Intensity[col]+=1
    print(VPP_Intensity)
    plt.plot(VPP_Intensity)
    plt.xticks([])
    plt.show()
    return VPP_Intensity
5.4.2 : First Level Character Segmentation Using VPP:

def FIRST_LEVEL_CHARACTER_SEGMENTATION_UNDER_VPP(PRE_PROCESSED_BINARY_IMAGE):
    VPP_Intensity=EVALUATING_VPP_INTENSITY(PRE_PROCESSED_BINARY_IMAGE)
    CHARACTER_SEGMENTS=[]
    CHARACTER_SEPERATION=0
    START_FLAG=0
    Zero_count=0
    START_INDEX=0
    End_index=0
    SET_FLAG=0
    # same rising-section logic as line segmentation, applied column-wise
    for col in range(len(VPP_Intensity)):
        if(VPP_Intensity[col]==0):
            Zero_count+=1
            if(SET_FLAG==0):
                End_index=col
                SET_FLAG=1
        else:
            if(SET_FLAG==1 and START_FLAG==1):
                SET_FLAG=0
                CHARACTER_SEGMENTS.append([START_INDEX,End_index])
                START_INDEX=col
                CHARACTER_SEGMENTS.append(Zero_count)
                CHARACTER_SEPERATION=CHARACTER_SEPERATION+(Zero_count**2)
                Zero_count=0
            if(START_FLAG==0):
                START_FLAG=1
                SET_FLAG=0
                START_INDEX=col
                Zero_count=0
    CHARACTER_SEGMENTS.append([START_INDEX,End_index])
    CHARACTER_THRESHOLD=math.sqrt(CHARACTER_SEPERATION)/3
    print("CHARACTER THRESHOLD : ",CHARACTER_THRESHOLD)
    print("CHARACTER SEGMENTS : ",CHARACTER_SEGMENTS)
    return CHARACTER_SEGMENTATION(CHARACTER_SEGMENTS,CHARACTER_THRESHOLD,PRE_PROCESSED_BINARY_IMAGE)

5.4.3 : Segmenting Word into Characters under VPP:

def CHARACTER_SEGMENTATION(CHARACTER_SEGMENTS,CHARACTER_THRESHOLD,PRE_PROCESSED_BINARY_IMAGE):
    SEGMENTED_CHARACTERS=[]
    for index in range(1,len(CHARACTER_SEGMENTS),2):
        if(CHARACTER_SEGMENTS[index]>CHARACTER_THRESHOLD):
            x=CHARACTER_SEGMENTS[index-1][0]
            y=0
            Touching=0
            w=CHARACTER_SEGMENTS[index-1][1]-CHARACTER_SEGMENTS[index-1][0]
            h=len(PRE_PROCESSED_BINARY_IMAGE)
            # wide segments are flagged as touching characters
            if(w>0.675*h):
                Touching=1
            SEGMENTED_CHARACTERS.append([[x,y,w,h],PRE_PROCESSED_BINARY_IMAGE[y:y+h,x:x+w],Touching])
    x=CHARACTER_SEGMENTS[-1][0]
    y=0
    Touching=0
    w=CHARACTER_SEGMENTS[-1][1]-CHARACTER_SEGMENTS[-1][0]
    h=len(PRE_PROCESSED_BINARY_IMAGE)
    if(w>0.675*h):
        Touching=1
    SEGMENTED_CHARACTERS.append([[x,y,w,h],PRE_PROCESSED_BINARY_IMAGE[y:y+h,x:x+w],Touching])
    return SEGMENTED_CHARACTERS

5.4.4 : Evaluating VPP and TDP Average Intensity:

def GENERATE_VPP_AND_TDP_AVERAGE(SEGMENT):
    IMAGE=SEGMENT[1]
    SHOW_IMAGE(IMAGE,cmap='gray')
    VPP_Intensity=[0 for col in range(len(IMAGE[0]))]
    for row in range(len(IMAGE)):
        for col in range(len(IMAGE[row])):
            if(IMAGE[row][col]==0):
                VPP_Intensity[col]+=1
    print(VPP_Intensity)
    plt.plot(VPP_Intensity)
    plt.xticks([])
    plt.show()
    # top-down profile: height of the first text pixel in every column
    TDP_Intensity=[0 for col in range(len(IMAGE[0]))]
    for col in range(len(IMAGE[0])):
        for row in range(len(IMAGE)):
            if(IMAGE[row][col]==0):
                TDP_Intensity[col]=len(IMAGE)-row
                break
    print(TDP_Intensity)
    plt.plot(TDP_Intensity)
    plt.xticks([])
    plt.show()
    # combined score graph of the two profiles
    AVERAGE_INTENSITY=np.add(TDP_Intensity,VPP_Intensity)
    print(AVERAGE_INTENSITY)
    plt.plot(AVERAGE_INTENSITY)
    plt.xticks([])
    plt.show()
    return AVERAGE_INTENSITY
5.4.5 : Connected Components Segmentation:

def TOUCHING_CHARACTER_SEGMENTATION(SEGMENT,AVERAGE_INTENSITY,FINAL_SEGMENTED_CHARACTERS):
    AVERAGE_THRESHOLD=20
    TOUCHING_CHARACTERS_BREAKPOINTS=[]
    col=0
    while(col<len(AVERAGE_INTENSITY)):
        if(col==0):
            # skip the low-intensity region at the left border
            while(col<len(AVERAGE_INTENSITY) and AVERAGE_INTENSITY[col]<AVERAGE_THRESHOLD):
                col+=1
        if(col<len(AVERAGE_INTENSITY) and AVERAGE_INTENSITY[col]<AVERAGE_THRESHOLD):
            # take the minimum of each valley as a candidate breakpoint
            MIN_VALUE=AVERAGE_INTENSITY[col]
            min_point=col
            while(col<len(AVERAGE_INTENSITY) and AVERAGE_INTENSITY[col]<AVERAGE_THRESHOLD):
                if(AVERAGE_INTENSITY[col]<MIN_VALUE):
                    MIN_VALUE=AVERAGE_INTENSITY[col]
                    min_point=col
                col+=1
            if(col<len(AVERAGE_INTENSITY)):
                TOUCHING_CHARACTERS_BREAKPOINTS.append(min_point)
        col+=1
    print("CHARACTERS BREAK POINTS ",TOUCHING_CHARACTERS_BREAKPOINTS)
    if(len(TOUCHING_CHARACTERS_BREAKPOINTS)==0):
        REQUIRED_FURTHER_SEGMENTATION(SEGMENT,AVERAGE_INTENSITY,FINAL_SEGMENTED_CHARACTERS)
    else:
        x_point=0
        y_point=SEGMENT[0][1]
        height=SEGMENT[0][3]
        for BREAK_POINT in TOUCHING_CHARACTERS_BREAKPOINTS:
            width=BREAK_POINT-x_point
            if(width>0.8*height):
                # a piece that is still too wide is split again recursively
                REQUIRED_FURTHER_SEGMENTATION([[x_point,y_point,width,height],
                    SEGMENT[1][y_point:y_point+height,x_point:x_point+width],1],
                    AVERAGE_INTENSITY[x_point:x_point+width],FINAL_SEGMENTED_CHARACTERS)
            else:
                FINAL_SEGMENTED_CHARACTERS.append([[x_point,y_point,width,height],
                    SEGMENT[1][y_point:y_point+height,x_point:x_point+width],0])
            x_point=BREAK_POINT
        width=SEGMENT[0][2]-x_point
        if(width>0.8*height):
            REQUIRED_FURTHER_SEGMENTATION([[x_point,y_point,width,height],
                SEGMENT[1][y_point:y_point+height,x_point:x_point+width],1],
                AVERAGE_INTENSITY[x_point:x_point+width],FINAL_SEGMENTED_CHARACTERS)
        else:
            FINAL_SEGMENTED_CHARACTERS.append([[x_point,y_point,width,height],
                SEGMENT[1][y_point:y_point+height,x_point:x_point+width],0])

5.4.6 : Further Required Segmentation on Connected Components:

def REQUIRED_FURTHER_SEGMENTATION(IMAGE_SEGMENT,AVERAGE_INTENSITY,FINAL_SEGMENTED_CHARACTERS):
    # search for the weakest column away from both borders of the segment
    Index_Limit=int(0.4*IMAGE_SEGMENT[0][3])
    MIN_VALUE=AVERAGE_INTENSITY[Index_Limit]
    BREAK_POINT=Index_Limit
    for col in range(Index_Limit,len(AVERAGE_INTENSITY)-Index_Limit):
        if(AVERAGE_INTENSITY[col]<MIN_VALUE):
            MIN_VALUE=AVERAGE_INTENSITY[col]
            BREAK_POINT=col
    x_point=0
    y_point=IMAGE_SEGMENT[0][1]
    height=IMAGE_SEGMENT[0][3]
    width=BREAK_POINT-x_point
    if(width>0.8*height):
        REQUIRED_FURTHER_SEGMENTATION([[x_point,y_point,width,height],
            IMAGE_SEGMENT[1][y_point:y_point+height,x_point:x_point+width],1],
            AVERAGE_INTENSITY[x_point:x_point+width],FINAL_SEGMENTED_CHARACTERS)
    else:
        FINAL_SEGMENTED_CHARACTERS.append([[x_point,y_point,width,height],
            IMAGE_SEGMENT[1][y_point:y_point+height,x_point:x_point+width],0])
    x_point=BREAK_POINT
    width=IMAGE_SEGMENT[0][2]-x_point
    if(width>0.8*height):
        REQUIRED_FURTHER_SEGMENTATION([[x_point,y_point,width,height],
            IMAGE_SEGMENT[1][y_point:y_point+height,x_point:x_point+width],1],
            AVERAGE_INTENSITY[x_point:x_point+width],FINAL_SEGMENTED_CHARACTERS)
    else:
        FINAL_SEGMENTED_CHARACTERS.append([[x_point,y_point,width,height],
            IMAGE_SEGMENT[1][y_point:y_point+height,x_point:x_point+width],0])
5.5 Training the model:

import torch
import torchvision
from torch import nn, optim
from torch.utils import data
from torch.autograd import Variable
from torchvision import transforms
from efficientnet_pytorch import EfficientNet
from torchsummary import summary
from PIL import Image
from time import time

5.5.1 : Data Loader:

class DATASET(data.Dataset):
    def __init__(self,CSV_PATH,IMAGES_PATH,TRANSFORM=None):
        self.TRAIN_SET=pd.read_csv(CSV_PATH)
        self.TRAIN_PATH=IMAGES_PATH
        self.TRANSFORM=TRANSFORM
    def __len__(self):
        return len(self.TRAIN_SET)
    def __getitem__(self,idx):
        FILE_NAME=self.TRAIN_SET.iloc[idx][1]+'.png'
        LABEL=self.TRAIN_SET.iloc[idx][2]
        img=Image.open(os.path.join(self.TRAIN_PATH,FILE_NAME))
        if self.TRANSFORM is not None:
            img=self.TRANSFORM(img)
        return img,LABEL

5.5.2 : Defining Transforms and Parameters:

# BASE_PATH points at the dataset root (defined earlier in the notebook)
PARAMS = {'batch_size': 16,
          'shuffle': True}
epochs = 6
LEARNING_RATE=1e-3
TRANSFORM_TRAIN = transforms.Compose([transforms.Resize((224,224)),
    transforms.RandomApply([
        torchvision.transforms.RandomRotation(10),
        transforms.RandomHorizontalFlip()],0.7),
    transforms.ToTensor()])
TRAINING_SET=DATASET(os.path.join(BASE_PATH,'train.csv'),
    os.path.join(BASE_PATH,'train_images/'),TRANSFORM=TRANSFORM_TRAIN)
TRAINING_GENERATOR=data.DataLoader(TRAINING_SET,**PARAMS)
USE_CUDA = torch.cuda.is_available()
device = torch.device("cuda:0" if USE_CUDA else "cpu")
print(device)

5.5.3 : Importing the Model:

model = EfficientNet.from_pretrained('efficientnet-b0', num_classes=62)
model.to(device)
print(summary(model, input_size=(3, 512, 512)))
PATH_SAVE='./Weights/'
if(not os.path.exists(PATH_SAVE)):
    os.mkdir(PATH_SAVE)
criterion = nn.CrossEntropyLoss()
LR_DECAY=0.99
optimizer = optim.Adam(model.parameters(), lr=LEARNING_RATE)
eye = torch.eye(62).to(device)      # identity matrix used to one-hot encode labels
CLASSES=[i for i in range(62)]
HISTORY_ACCURACY=[]
history_loss=[]

5.5.4 : Training the Model:

epochs = 1
for epoch in range(epochs):
    running_loss = 0.0
    correct=0
    TOTAL=0
    CLASS_CORRECT = list(0. for _ in CLASSES)
    CLASS_TOTAL = list(0. for _ in CLASSES)
    for i, DATA in enumerate(TRAINING_GENERATOR, 0):
        inputs, LABELS = DATA
        t0 = time()
        inputs, LABELS = inputs.to(device), LABELS.to(device)
        LABELS = eye[LABELS]
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, torch.max(LABELS, 1)[1])
        _, predicted = torch.max(outputs, 1)
        _, LABELS = torch.max(LABELS, 1)
        c = (predicted == LABELS.data).squeeze()
        correct += (predicted == LABELS).sum().item()
        TOTAL += LABELS.size(0)
        ACCURACY = float(correct) / float(TOTAL)
        HISTORY_ACCURACY.append(ACCURACY)
        history_loss.append(loss)
        loss.backward()
        optimizer.step()
        for j in range(LABELS.size(0)):
            LABEL = LABELS[j]
            CLASS_CORRECT[LABEL] += c[j].item()
            CLASS_TOTAL[LABEL] += 1
        running_loss += loss.item()
        if(i%100==99):
            print("Epoch : ",epoch+1," BATCH : ", i+1," Loss : ",
                  running_loss/(i+1)," ACCURACY : ",ACCURACY,
                  "Time ",round(time()-t0, 2),"s")
    for k in range(len(CLASSES)):
        if(CLASS_TOTAL[k]!=0):
            print('ACCURACY of %5s : %2d %%' % (
                CLASSES[k], 100 * CLASS_CORRECT[k] / CLASS_TOTAL[k]))
    print('[%d epoch] ACCURACY of the network on the TRAINING IMAGES: %d %%' %
          (epoch+1, 100 * correct / TOTAL))
    if epoch%10==9:
        torch.save(model.state_dict(), os.path.join(PATH_SAVE,str(epoch+1)+'.pth'))
torch.save(model.state_dict(), os.path.join(PATH_SAVE,'FINAL_EPOCH'+'.pth'))

5.5.5 : Testing the Model:

model.load_state_dict(torch.load('/kaggle/working/Weights/FINAL_EPOCH.pth'))
model.eval()
TEST_TRANSFORMS = transforms.Compose([transforms.Resize(512),
                                      transforms.ToTensor()])

def PREDICT_IMAGE(IMAGE):
    IMAGE_TENSOR = TEST_TRANSFORMS(IMAGE)
    IMAGE_TENSOR = IMAGE_TENSOR.unsqueeze_(0)
    input = Variable(IMAGE_TENSOR)
    input = input.to(device)
    output = model(input)
    index = output.data.cpu().numpy().argmax()
    return index

# submission is the sample-submission DataFrame created in section 5.1.1;
# TEST_DATASET holds the ground-truth classes read from test.csv
IMG_TEST_PATH=os.path.join(BASE_PATH,'test_images/')
for i in range(len(submission)):
    img=Image.open(IMG_TEST_PATH+submission.iloc[i][1]+'.png')
    submission['CLASS'][i]=PREDICT_IMAGE(img)
    if(i%10==0 or i==len(submission)-1):
        print('[',32*'=','>] ',round((i+1)*100/len(submission),2),' % Complete')

# 62x62 confusion matrix and per-class accuracy
Result=[[0 for _ in range(62)] for i in range(62)]
TOTAL_DATA=[0 for i in range(62)]
CORRECT_DATA=[0 for i in range(62)]
for i in range(len(submission)):
    Result[TEST_DATASET['CLASS'][i]][submission['CLASS'][i]]+=1
    if(TEST_DATASET['CLASS'][i]==submission['CLASS'][i]):
        CORRECT_DATA[TEST_DATASET['CLASS'][i]]+=1
    TOTAL_DATA[TEST_DATASET['CLASS'][i]]+=1
for i in Result:
    for j in i:
        print(str(10000+j)[1:],end=" ")
    print()
for i in range(62):
    print(i,'-',TOTAL_DATA[i],CORRECT_DATA[i],(CORRECT_DATA[i]*100)/TOTAL_DATA[i])
print("TOTAL",'-',sum(TOTAL_DATA),sum(CORRECT_DATA),(sum(CORRECT_DATA)*100)/sum(TOTAL_DATA))
6. RESULTS AND DISCUSSIONS
6.1 Input & Output:
INPUT:
Fig 6.1 Input image
OUTPUT:
A MOVE to stop Mr. Gaitskell from nominating any more Labour
life Peers is to be made at a meeting of Labour M P’s tomorrow. Mr. Michael Foot has
put down a resolution on the subject and he is to be backed by Mr. Will Griffiths, MP
for Manchester exchange.
6.2 Training Datasets:
IAM DATASET: The IAM Handwriting Database contains forms of
handwritten English text which can be used to train and test handwritten text
recognizers and to perform writer identification and verification experiments.
The database was first published in [13] at ICDAR 1999. Using this database,
an HMM-based recognition system for handwritten sentences was developed and
published in [14] at ICPR 2000. The segmentation scheme used in the second
version of the database is documented in [15] and was published at ICPR 2002.
The IAM database as of October 2002 is described in [16]. The database is used
extensively in handwriting-recognition research.
The database contains forms of unconstrained handwritten text, which were
scanned at a resolution of 300dpi and saved as PNG images with 256 gray levels.
The IAM Handwriting Database 3.0 is structured as follows:
657 writers contributed samples of their handwriting
1,539 pages of scanned text
5,685 isolated and labeled sentences
13,353 isolated and labeled text lines
115,320 isolated and labeled words
The words have been extracted from pages of scanned text using an automatic
segmentation scheme and were verified manually. The segmentation scheme has been
developed at our institute [15].
All form, line and word images are provided as PNG files, and the
corresponding form label files, including segmentation information and a variety of
estimated parameters (from the preprocessing steps described in [14]), are included
in the image files as meta-information in XML format.
NIST DATASET: The EMNIST dataset is a set of handwritten character
and digit images derived from NIST Special Database 19, converted to a 28x28 pixel
image format and a dataset structure that directly matches the MNIST dataset.
The dataset is provided in two file formats. Both versions of the dataset contain
identical information, and are provided entirely for the sake of convenience. The first
dataset is provided in a Matlab format that is accessible through both Matlab and
Python (using the scipy.io.loadmat function). The second version of the dataset is
provided in the same binary format as the original MNIST dataset.[18]
There are six different splits provided in this dataset. A short summary of the
dataset is provided below:
EMNIST ByClass: 814,255 characters. 62 unbalanced classes.
EMNIST ByMerge: 814,255 characters. 47 unbalanced classes.
EMNIST Balanced: 131,600 characters. 47 balanced classes.
EMNIST Letters: 145,600 characters. 26 balanced classes.
EMNIST Digits: 280,000 characters. 10 balanced classes.
EMNIST MNIST: 70,000 characters. 10 balanced classes.
The full complement of NIST Special Database 19 is available in the
ByClass and ByMerge splits. The EMNIST Balanced dataset contains a set of
characters with an equal number of samples per class. The EMNIST Letters dataset
merges a balanced set of the uppercase and lowercase letters into a single 26-class
task. The EMNIST Digits and EMNIST MNIST datasets provide balanced handwritten
digit datasets directly compatible with the original MNIST dataset.
Type                    Classes   Training   Testing     Total
BY CLASS
  Digits                   10      344,307    58,646    402,953
  Uppercase                26      208,363    11,941    220,304
  Lowercase                26      178,998    12,000    190,998
  Total                    62      731,668    82,587    814,255
BY MERGE
  Digits                   10      344,307    58,646    402,953
  Letters                  37      387,361    23,941    411,302
  Total                    47      731,668    82,587    814,255
Table 6.1: Breakdown of the number of available training and testing samples in
NIST Special Database 19, using the original training and testing splits.
6.3 Experimental Results and Analysis:
Segmentation: The system is trained on around 1,600 text images
(paragraphs) of the IAM dataset, with almost 5,678 labelled sentences, 13,353
isolated and labelled text lines and 115,320 isolated and labelled words, reaching an
accuracy of around 98% for line segmentation, 93% for word segmentation and 88%
for character segmentation.
Character Recognition: The recognizer is trained on the NIST dataset. This
represents the most useful organization from a classification perspective, as it
contains the segmented digits and characters arranged by class. There are 62 classes
comprising [0-9], [a-z] and [A-Z]. The data is also split into a suggested training set
of around 731,668 images and a testing set of 82,587 images, with an accuracy of
around 96% in character recognition.
Type                      Testing    Accuracy
Line Segmentation           1,539       98%
Word Segmentation           5,685       93%
Character Segmentation    115,320       88%
Character Recognition      82,587       96%
Table 6.2: Testing and Accuracy
7. CONCLUSION
This paper mainly carries out a study on segmenting connected components
(touching characters). We improved the performance of binarization in pre-processing
and proposed a new method for separating touching characters using combined
profile analysis. Since the proposed algorithm shows good performance in the
experimental results, it can be applied effectively to a character recognition system.
The proposed method segments connected components (touching characters)
using the VPP (vertical projection profile) and TDP (top-down profile), and uses
other histogram projections (horizontal and vertical) for line and word segmentation
respectively. PyTorch, a Python library tool, is used for the recognition of the
segmented characters. [4]
Many further challenges are involved in optical cursive handwritten
recognition, such as skewness and pressure detection; these can be treated as
future study.
REFERENCES
[1] Nafiz Arica, Student Member, IEEE, and Fatos T. Yarman-Vural, Senior Member,
IEEE, "Optical Character Recognition for Cursive Handwriting", IEEE Transactions
on Pattern Analysis and Machine Intelligence, Vol. 24, No. 6, June 2002.
[2] Subhash Panwar and Neeta Nain, "A Novel Segmentation Methodology for
Cursive Handwritten Documents", IETE Journal of Research, Vol. 60, No. 6,
Nov-Dec 2014.
[3] Nibaran Das, Sandip Pramanik, Subhadip Basu, Punam Kumar Saha, "Recognition
of handwritten Bangla basic characters and digits using convex hull feature set", 2009
International Conference on Artificial Intelligence and Pattern Recognition (AIPR-09).
[4] Abhishek Bala and Rajib Saha, "An Improved Method for Handwritten
Document Analysis using Segmentation, Baseline Recognition and Writing Pressure
Detection", 6th International Conference on Advances in Computing & Communication,
ICACC 2016, 6-8 September 2016, Cochin, India, Elsevier, 2016.
[5] Kanchan Keisham and Sunanda Dixit, "Recognition of Handwritten English Text
Using Energy Minimisation", Information Systems Design and Intelligent
Applications, Advances in Intelligent Systems and Computing, Bangalore, India,
Springer, 2016.
[6] Namrata Dave, "Segmentation Methods for Hand Written Character Recognition",
International Journal of Signal Processing, Image Processing and Pattern Recognition,
Vol. 8, No. 4 (2015), pp. 155-164.
[7] G. Louloudis, B. Gatos, I. Pratikakis, C. Halatsis, "Text line and word
segmentation of handwritten documents", Department of Informatics and
Telecommunications, University of Athens, Greece; Computational Intelligence
Laboratory, Institute of Informatics and Telecommunications, National Center for
Scientific Research "Demokritos", 15310 Athens, Greece.
[8] Rafael C. Gonzalez and Richard E. Woods, “Digital Image Processing, Second
Edition”, Prentice Hall.
[9] N. Otsu, "A Threshold Selection Method from Gray-Level Histograms", IEEE
Trans. Systems, Man and Cybernetics, Vol. 9, No. 1, 1979, pp. 62-66.
[10] Seong-Whan Lee and Young Joon Kim, "Direct Extraction of Topographic
Features for Gray Scale Character Recognition", IEEE Trans. on Pattern Analysis and
Machine Intelligence, Vol. 17, No. 7, July 1995.
[11] Seong-Whan Lee, Dong-June Lee, and Hee-Seon Park, "A New Methodology for
Gray-Scale Character Segmentation and Recognition", IEEE Trans. on Pattern
Analysis and Machine Intelligence, Vol. 18, No. 10, Oct. 1996.
[12] A. Ariyoshi, "A Character Segmentation Method for Japanese Documents Coping
with Touching Character Problems", Proc. 11th Int'l Conf. Pattern Recognition, The
Hague, Netherlands, Aug. 1992, pp. 313-316.
[13] U. Marti and H. Bunke. A full English sentence database for off-line handwriting
recognition. In Proc. of the 5th Int. Conf. on Document Analysis and Recognition,
pages 705 - 708, 1999.
[14] U. Marti and H. Bunke. Handwritten Sentence Recognition. In Proc. of the 15th
Int. Conf. on Pattern Recognition, Volume 3, pages 467 - 470, 2000.
[15] M. Zimmermann and H. Bunke. Automatic Segmentation of the IAM Off-line
Database for Handwritten English Text. In Proc. of the 16th Int. Conf. on Pattern
Recognition, Volume 4, pages 35-39, 2002.
[16] U. Marti and H. Bunke. The IAM-database: An English Sentence Database for
Off-line Handwriting Recognition. Int. Journal on Document Analysis and
Recognition, Volume 5, pages 39 - 46, 2002.
[17] S. Johansson, G.N. Leech and H. Goodluck. Manual of Information to Accompany
the Lancaster-Oslo/Bergen Corpus of British English, for Use with Digital Computers.
Department of English, University of Oslo, Norway, 1978.
[18] Cohen, G., Afshar, S., Tapson, J., & van Schaik, A. (2017). EMNIST: an
extension of MNIST to handwritten letters. arXiv preprint arXiv:1702.05373.
[19] Richard G. Casey and Eric Lecolinet, “A Survey of Methods and Strategies in
Character Segmentation”, IEEE Trans. On Pattern Analysis and Machine Intelligence,
Vol. 18, No. 7, Jul., 1996.
[20] Jin Hak Bae, Kee Chul Jung, Jin Wook Kim, and Hang Joon Kim, “Segmentation
of touching characters using an MLP”, Pattern Recognition Letters, Vol. 19, No. 8,
1998, pp. 701-709.
[21] N. Arica and F. Yarman-Vural, "Optical character recognition for cursive
handwriting," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.
24, pp. 801-813, Jun 2002.
[22] M. Blumenstein and B. Verma, "Neural-based solutions for the segmentation
and recognition of difficult handwritten words from a benchmark database," in
Proceedings of the Fifth International Conference on Document Analysis and
Recognition, ICDAR '99, pp. 281-284, Sept 1999.
[23] Y. Tay, M. Khalid, R. Yusof, and C. Viard-Gaudin, "Offline cursive handwriting
recognition system based on hybrid markov model and neural networks," in
Proceedings of the IEEE International Symposium on Computational Intelligence in
Robotics and Automation, 2003, vol. 3, pp. 1190-1195, July 2003.
[24] G. Kim, V. Govindaraju, and S. Srihari, "A segmentation and recognition strategy
for handwritten phrases," in Proceedings of the 13th International Conference on Pattern
Recognition, 1996, vol. 4, pp. 510-514, Aug 1996.
[25] Y. Y. Chung and M. T. Wong, “Handwritten character recognition by fourier
descriptors and neural network,” in Proceedings of IEEE Region 10 Annual
Conference on Speech and Image Technologies for Computing and
Telecommunications, TENCON ’97, vol. 1, pp. 391 –394, Dec 1997.
[26] B. S. Moni and G. Raju, "Modified quadratic classifier and directional features
for handwritten malayalam character recognition," in Computational Science - New
Dimensions and Perspectives, NCCSE 2011, IJCA Special Issue, vol. 1, pp. 30-34,
Feb 2011.
[27] M. Blumenstein, X. Y. Liu, and B. Verma, “An investigation of the modified
direction feature for cursive character recognition,” Pattern Recognition, vol. 40, no. 2,
pp. 376 – 388, 2007.
[28] M. Blumenstein, B. Verma, and H. Basli, “A novel feature extraction technique
for the recognition of segmented handwritten characters,” in Proceedings of the
Seventh International Conference on Document Analysis and Recognition, 2003, vol.
1, pp. 137 – 141, Aug 2003.
Base Paper
2011 International Conference on Computer Applications and Industrial Electronics (ICCAIE 2011)
Offline Handwritten Character Recognition Using
Neural Network
Anshul Gupta, Manisha Srivastava, Chitralekha Mahanta
Department of Electronics and Electrical Engineering
IIT Guwahati, Guwahati, India
Email: [email protected]
Abstract—Character Recognition (CR) has been an active area
of research in the past and due to its diverse applications it continues to be a challenging research topic. In this paper, we focus especially on offline recognition of handwritten English words by first detecting individual characters. The main approaches for offline handwritten word recognition can be divided into two classes, holistic and segmentation based. The holistic approach is used in recognition of limited size vocabulary where global features extracted from the entire word image are considered. As the size of the vocabulary increases, the complexity of holistic based algorithms also increases and correspondingly the recognition rate decreases rapidly. The segmentation based strategies, on the other hand, employ bottom-up approaches, starting from the stroke or the character level and going towards producing a meaningful word. After segmentation the problem gets reduced to the recognition of simple isolated characters or strokes and hence the system can be employed for unlimited vocabulary. We here adopt segmentation based handwritten word recognition where neural networks are used to identify individual characters. A number of techniques are available for feature extraction and training of CR systems in the literature, each with its own superiorities and weaknesses. We explore these techniques to design an optimal offline handwritten English word recognition system based on character recognition. Post processing technique that uses lexicon is employed to improve the overall recognition accuracy.
Index Terms—Offline, handwritten, character, recognition, neural network.
I. INTRODUCTION
It is really a challenging issue to develop a practical handwritten
character recognition (CR) system which can maintain
high recognition accuracy. A generic character recognition
system is shown in Fig. 1.
Fig. 1. Generic CR system
In most of the existing systems recognition accuracy is
heavily dependent on the quality of the input document.
In handwritten text adjacent characters tend to be touched
or overlapped. Therefore it is essential to segment a given
string correctly into its character components. In most of the
existing segmentation algorithms, human writing is evaluated
empirically to deduce rules [1]. But there is no guarantee
for the optimum results of these heuristic rules in all styles
of writing. Moreover handwriting varies from person to
person and even for the same person it varies depending on
mood, speed etc. This requires incorporating artificial neural
networks, hidden Markov models and statistical classifiers to
extract segmentation rules based on numerical data. [2][3][4].
After segmentation next crucial step is representation of
character classes by features. These features should have high
discriminative abilities so that they are different for different
character classes (for example 26 uppercase and 26 lowercase
characters in case of English language). Also, these features
should be independent of the intra class variations.
The different representation methods can be categorized into
three major classes [1]:
1. Global transformation and series expansion: includes
Fourier transform, Gabor transform, wavelet, moments
and Karhunen-Loeve expansion.
2. Statistical representation: Zoning, crossing and
distances, projections.
3. Geometrical and topological representation: Extracting
and counting topological structures, geometrical
properties, coding, graphs and trees etc.
Features which depend on Fourier transform are suitable
for recognizing handwritten numerals where 96% accuracy
has been achieved [5]. Gradient features have been widely
used in CR for machine and hand printed binary character
images. But these features are not invariant to deformations
in the characters. In [6], a new gradient feature is used where
at each pixel, gradient is mapped onto 12 direction codes
with an angle span of 30 degree between the directions.
In [7], a redesigned direction feature [8] with a view to
describe the character contour more effectively is developed.
Also, an additional global feature was introduced in this
technique to improve the recognition accuracy for those
characters that were most frequently confused with patterns
of similar appearances. But the disadvantage of this technique
is its failure to deal with changes in stroke width as these
features are extracted from non thinned character images.
Another crucial module in a character recognition system is
its pattern recognition module which assigns an unknown
sample to a predefined class. Numerous techniques for
character recognition can be classified into four general
approaches of pattern recognition: [1]
1. Template Matching: Direct matching, deformable
and elastic matching, relaxation matching
2. Statistical techniques: Parametric recognition,
Non-parametric recognition, HMM, fuzzy set reasoning
3. Structural techniques: Grammatical methods,
graphical methods
4. Neural networks: Multilayer perceptron, radial basis
function, support vector machine
Character recognition technique has to cope with the high
variability of the handwritten cursive letters and their intrinsic
ambiguity (letters like “e” and “l” or “u” and “n” can have
the same shape). Also it should be able to adapt to changes
in the input data. Template matching, statistical techniques
and structural techniques can be used when the input data is
uniform over time whereas neural network (NN) classifier
can learn changes in the input data. Also NN has parallel
structure because of which it can perform computation at a
higher rate than classical techniques. Therefore, we choose
neural networks for character recognition in our system.
The features that are used for training the neural network
classifier also play a very important role. The choice of a
good feature vector can significantly enhance the performance
of a character classifier whereas a poor one may degrade
its performance considerably. It is found in the literature
that generally separate classifiers are used for the upper
and the lower case English character classes to improve the
recognition accuracy. Moreover, good recognition accuracy
could be achieved only for handwritten numerals.
In this paper, we focus on developing a CR system for
recognition of handwritten English words. We first segment
the words into individual characters and then represent these
characters by features that have good discriminative abilities.
We also explore different neural network classifiers to find
the best classifier for the CR system. We combine different
CR techniques in parallel so that recognition accuracy of the
system can be improved.
The organization of the paper is as follows: Section II
deals with segmentation of words into individual characters
where a heuristic algorithm is used to first oversegment the
word followed by verification using neural network. Feature
extraction of handwritten characters is discussed in Section
III. Section IV describes selection procedure of a suitable
classifier. This is done by testing multilayer perceptron (MLP),
radial basis function (RBF) and support vector machine (SVM)
and selecting the one that has the maximum accuracy. In
Section V post processing is discussed where different character
recognition techniques are combined in parallel by using a
variation of the Borda count. Section VI presents results and
discussion. Conclusions are drawn in Section VII.
II. SEGMENTATION
In this paper segmentation algorithm used is similar to [2],
where heuristics and artificial intelligence are used for the
segmentation of a handwritten word. Here gray level image
is first converted into a binary image. Next, slant detection
similar to the one used in [9] is employed and then slant
correction is done. The method involves rotating the image
from −45° to 45°. The horizontal projection is taken at each
rotation to calculate the Wigner-Ville distribution (WVD, a joint
function of time and frequency). The angle which presents
the maximum intensity after applying the WVD is taken as the
estimated slant angle.
For both the training and the testing phases, a heuristic
algorithm is used to locate prospective segmentation points in
the handwritten words. Each word is inspected in an attempt to
locate characteristic representative of the segmentation points.
A. Segmentation using a heuristic algorithm
A simple heuristic segmentation algorithm is implemented
which scans handwritten words to identify valid segmentation
points between characters. The segmentation is based on
locating the minima or arcs between letters, common in
handwritten cursive script. For this a histogram of vertical
pixel densities is examined which may indicate the location of
possible segmentation points in the word. However, in the case
of letters such as "a" and "o", an erroneous segmentation point
could be identified. Therefore a "hole seeking" component
is incorporated which prunes the segmentation points that
pass through a "hole". Finally, the algorithm performs a
check to see if one segmentation point is not too close
to another by ascertaining that the distance between the
previous segmentation point and the position being checked
is equal to or greater than the average character width.
Conversely if the contour in a region has sparse segmentation
points then a new segmentation point is inserted in that region.
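A minimal Python sketch of this heuristic (not the authors' code) is given below; the ink-equals-1 binary convention, the near-empty-column test, and the use of the image height as a stand-in for the average character width are assumptions, and the "hole seeking" check is omitted for brevity.

import numpy as np

def heuristic_segmentation_points(binary_word):
    # Propose candidate segmentation columns in a binary word image.
    # Columns where the vertical pixel density is locally minimal (a valley
    # between letters) are proposed; candidates closer together than the
    # assumed average character width are pruned.
    density = binary_word.sum(axis=0)          # vertical pixel density per column
    height, width = binary_word.shape
    avg_char_width = height                    # crude assumption: roughly square characters
    points = []
    for x in range(1, width - 1):
        is_valley = density[x] <= density[x - 1] and density[x] <= density[x + 1]
        if is_valley and density[x] <= 1:      # near-empty column suggests a boundary
            if not points or x - points[-1] >= avg_char_width:
                points.append(x)               # keep only well-spaced candidates
    return points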
B. Manual marking of segmentation points
We created our own database to train the neural network
for segmentation. Altogether 26 English words were chosen
which contained all the upper and lower case alphabets and
then 10 different samples of each word were collected on
paper from different writers. The images were then scanned
and preprocessed to create a list of 260 words. Prior to ANN
training, the heuristic feature detector was used to segment
all the words. The segmentation point outputs obtained by
using the heuristic feature detector can be categorized into
“correct” and “incorrect” segmentation point classes. The
feature extractor then extracts a matrix of pixels representing
the segmentation area and breaks it down into small windows
of equal size 5x5 pixels and analyzes the density of black
and white pixels. The density value for the black pixels
for each 5x5 window is written to the training file to
represent the value of that window. Accompanying each
matrix the desired output is also stored in the training file (0.1
for an incorrect segmentation point and 0.9 for a correct point).
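The windowed density feature can be sketched as follows, assuming the segmentation area arrives as a binary NumPy matrix with black (ink) pixels equal to 1; the edge-cropping behaviour is an assumption.

import numpy as np

def window_densities(seg_area, win=5):
    # Crop to a multiple of the window size, split into win x win blocks,
    # and return one black-pixel density value per block, as described above.
    h, w = seg_area.shape
    h, w = h - h % win, w - w % win
    blocks = seg_area[:h, :w].reshape(h // win, win, w // win, win)
    return blocks.mean(axis=(1, 3)).ravel()

# Training targets as in the text: 0.9 for a correct segmentation point
# and 0.1 for an incorrect one.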
C. Training of the Artificial Neural Network (ANN)
For this step, a multilayer feedforward neural network
trained with the back propagation algorithm is used. The ANN
is presented with the training file prepared in the previous step.
D. Testing phase of the segmentation technique
Like ANN training, the words used for testing are also
segmented using the heuristic algorithm. The segmentation
points are automatically extracted and are fed into the trained
ANN. The ANN then verifies each segmentation point as
correct or incorrect. Finally, upon ANN verification, each word
used for testing should only contain valid segmentation points.
III. FEATURE EXTRACTION
A compact and characteristic representation of the character
image is required in the CR system. For this purpose, a set
of features is extracted for each class that helps to distinguish
it from other classes, while remaining invariant to intra class
differences.
A. Fourier Descriptors
The method adopted is similar to [10] where boundary
detection is done at first. After obtaining a boundary image,
Fourier descriptors are found. This involves finding the
discrete Fourier coefficients a[k] and b[k] for 0 < k < L − 1,
where L is the total number of boundary points, by applying
the following:

a[k] = (1/L) Σ_{m=1}^{L} x[m] e^(−jk(2π/L)m)    (1)

b[k] = (1/L) Σ_{m=1}^{L} y[m] e^(−jk(2π/L)m)    (2)

where x[m] and y[m] are the x and y coordinates respectively
of the mth boundary point. The values for k = 0 are discarded
as they contain information only about the position of the
image. The coefficients for high values of k describe high
frequency features in the image but do not contain much
information about the overall shape of the character, and so
these high frequency components are also discarded. Therefore,
the first five coefficients beginning from k = 1 to k = 5 are
considered. The feature vectors made up from these moduli are
then normalized to 1 to compensate for image scaling. To
spread the input data more evenly over the input space, the
mean and the standard deviation vectors are found over the
whole set of training data. The jth component of input vector
i is calculated as:

i_pj = (i_poj − i_oj) · (α(1/σ_noj − 1) + 1)    (3)

where i_poj is the jth component of the original vector of
pattern p, i_oj is the mean of the jth components of the original
vectors and σ_noj is the corresponding standard deviation.
Coefficient α linearly controls the degree of standard deviation
compensation. We have also used Fourier descriptors for
extracting the following two features:
1) Fourier angle: It is mentioned in [10] that if the moduli
alone are not successful in discriminating all the classes then
adding angles of Fourier descriptors can improve the results.
Experiments can be done to incorporate angles in the training
set.
2) Fourier magnitude [11]: The Fourier coefficients derived
from equations (1) and (2) are not rotation or shift invariant
(in fact, it is noted that a shift will occur if the starting point
of the boundary following is arbitrary). In order to derive a set
of Fourier descriptors which have the invariant property with
respect to rotation and shift, the following operations are
performed. For each n a set of invariant descriptors r[n] is
computed as:

r[n] = √(|a[n]|² + |b[n]|²)    (4)

It is easy to show that r[n] is invariant to rotation or shift. A
further refinement can be made by computing a new set of
descriptors as follows:

s[n] = r[n]/r[1]    (5)

Thus dependence of r[n] on the size of the character is also
eliminated. The Fourier coefficients |a[n]|, |b[n]|, their phases
and the invariant descriptors s[n], n = 2, 3, were derived for
all the character specimens and stored in files for application
in reconstruction and recognition. We will be using the
following set of features in our final system:
1. Magnitude, s(k), |a[k]| and |b[k]|
2. Phase, |a[k]| and |b[k]|
3. Magnitude, phase, s(k), |a[k]| and |b[k]|
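The descriptor pipeline of equations (1)-(5) can be sketched as follows, assuming an ordered list of boundary points; this is an illustration, not the authors' implementation.

import numpy as np

def fourier_descriptors(boundary, k_max=5):
    # boundary: ordered (x, y) boundary points; returns |a[k]|, |b[k]|,
    # the rotation/shift-invariant r[k] (eq. 4) and the size-normalized
    # s[k] (eq. 5) for k = 1 .. k_max, following equations (1) and (2).
    pts = np.asarray(boundary, dtype=float)
    L = len(pts)
    m = np.arange(1, L + 1)
    ks = np.arange(1, k_max + 1)
    basis = np.exp(-1j * np.outer(ks, m) * 2 * np.pi / L)   # e^(-jk(2pi/L)m)
    a = basis @ pts[:, 0] / L                               # equation (1)
    b = basis @ pts[:, 1] / L                               # equation (2)
    r = np.sqrt(np.abs(a) ** 2 + np.abs(b) ** 2)            # equation (4)
    s = r / r[0]                                            # equation (5); r[0] holds r[1]
    return np.abs(a), np.abs(b), r, s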
IV. CLASSIFIER SELECTION
Classification can be done using various methods like
clustering, Bayesian classification, artificial neural networks
etc., out of which artificial neural networks have been widely
used. For our case we will use them to classify 52 character
classes: 26 lower cases and 26 upper cases. We have considered
three networks: multilayer perceptron (MLP), radial basis
function (RBF) and support vector machine (SVM). Results
of character classification by these classifiers are given below.
We have used the neural network toolbox in the Matlab platform
for testing the classifiers. The character database used for the
training and testing is taken from the Chars74K dataset.
A. Multilayer Perceptron (MLP)
Table I shows the MLP configuration that produced the best
results in our case. Fig.2 illustrates the validation performance
of the MLP network. Results obtained are poor on validation
and testing data.
TABLE I
MLP CONFIGURATION
No. of hidden layers and respective activation functions: 3 [tansig tansig tansig]
No. of hidden nodes: [80 50 50]
Training algorithm: traingdx
Learning rate: Adaptive
Momentum: 0.9
Fig. 2. Validation performance of the MLP network
B. Radial Basis Function (RBF)
Table II shows the RBF network used. Fig.3 illustrates the
validation performance of the RBF network. The results are
good on training data but suffer from overlearning.
TABLE II
RBF CONFIGURATION
No. of hidden nodes: Adaptive addition and pruning of hidden neurons
Type of radial basis function: Gaussian radial basis function
Target error: 0.001
Fig. 3. Validation performance of the RBF network
Although the RBF network produced good results on the
validation dataset, it required 1800 neurons for this
performance. As a result this network suffered from overlearning
and showed very poor results on the test data.
C. Support vector machine (SVM)
In the case of the SVM, the recognition rate on the training
data is 98.86% and it achieves the optimum learning. The
recognition result on the test data is 62.93%. It is observed
that on the test data SVM outperforms the other two networks.
Table III shows the recognition rate (%) on the training data
produced by the SVM for all the three feature vectors. This
testing is performed on the Chars74K dataset.
TABLE III
SVM CONFIGURATION
Fourier with magnitude s(k), |a(k)| and |b(k)|: 86.66%
Fourier with phase, |a(k)| and |b(k)|: 98.74%
Fourier with magnitude s(k), |a(k)|, |b(k)| and phase: 98.04%
Now we build a CR system using all the three sets of
features in parallel. Our proposed system is shown in Fig.4.
Fig. 4. Block diagram of the proposed CR system
V. POST PROCESSING
It has been found that in many real word applications, it
is better to fuse multiple techniques to improve the results.
Fusion takes advantage of different techniques by emphasizing
the strengths and avoiding the weaknesses of individual
techniques. We here use a fusion method based on Borda count
that is inspired from [12] to combine the following techniques
in parallel:
1. SVM on moduli of Fourier coefficients |a(k)|, |b(k)| and magnitude s(k)
2. SVM on moduli of Fourier coefficients |a(k)|, |b(k)| and phase
3. SVM on moduli of Fourier coefficients |a(k)|, |b(k)|, phase and magnitude s(k)
A rank is assigned and used in the calculation of the Borda
count instead of calculating the number of strings below the
predicted string. The output string from a given technique is
compared with all the words in a lexicon. Then the lexicon
words are ranked according to their similarity with the output
string. The similarity between the output string and the lexicon
words are found by finding the number of matching characters
and their relative positions. The rank for a particular string can
be calculated using the following formula:
Rank = 1 − (position of the string in the top N strings)/N
The rank is 0 if the string is not in the top N choices. We have
taken N = 3; therefore only the top three words are considered
from each technique to calculate the rank.
Secondly, the confidence values produced by different
techniques are considered. The confidence value for all the three
predicted words for any given technique is the confidence that
the classifier has in its output string even if the string is not
a valid lexicon word. This is reasonable because the top three
strings are chosen based on its similarity with the output string.
The classifier's confidence in its output string can be estimated
by summing up the scores of each of the predicted characters
of the output string. The final Borda count of a lexicon word is:
Final Borda count = (rank × confidence)tech1 + (rank × confidence)tech2 + (rank × confidence)tech3
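The fusion rule can be sketched as follows; the (top_words, confidence) input format is an assumption about how each technique's output would be packaged.

def borda_rank(position, n=3):
    # Rank = 1 - position/N for a 1-indexed position in the top-N list,
    # and 0 when the string is outside the top N (N = 3 in the paper).
    return 1.0 - position / n if 1 <= position <= n else 0.0

def fused_borda_count(per_technique):
    # per_technique: list of (top_words, confidence) pairs, one per technique;
    # top_words are that technique's top-N lexicon words in similarity order,
    # confidence is the summed character score of its output string.
    totals = {}
    for top_words, confidence in per_technique:
        for position, word in enumerate(top_words, start=1):
            totals[word] = totals.get(word, 0.0) + borda_rank(position) * confidence
    return max(totals, key=totals.get)   # lexicon word with the highest fused count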
VI. RESULTS AND DISCUSSION
The proposed CR system was tested on a database consisting
of 26 word images. All of these images were given as input
to the proposed CR system. The lexicon used also consisted
of the same 26 words that were used for testing. Out of these
26 words, the proposed system correctly recognized 21 word
images. Figs. 5-7 show some results from the 21 correctly
recognized handwritten words.
Fig. 5. Result on “Moderated”.
Fig. 6. Result on “Puzzle”.
Fig. 7. Result on "Rolled".
It is evident from these figures that the proposed CR system
produces fairly good results on the test samples presented to
it. The segmentation method used was efficient. The heuristic
algorithm is based on rules which are deduced empirically
and there is no guarantee about their optimum results for
different styles of writing. So their validation using neural
network becomes essential. We tried different Fourier features
like moduli of Fourier coefficients, magnitude, phase and
their various combinations as feature vectors. The feature
vector formed using moduli of Fourier coefficients and phase
produced the best recognition accuracy of 98.74% on the
training dataset using SVM as the classifier. We have used
three combinations of Fourier descriptors in parallel for our
final system. Moreover our character recognition network has
52 output classes whereas in most of the literature separate
classifiers were used for upper and lower case characters. We
tested MLP and RBF neural networks that have been used
in the past for character recognition. We also tried support
vector machine (SVM) as classifier on the same feature set
and achieved 98% classification accuracy on the training data
set and 62.93% on the test data set. Finally, we selected SVM
as it outperformed MLP and RBF. Post processing which uses
lexicon becomes imperative as there is no other way to find out
the errors that have crept in at any of the previous stages. The
only way to do that is to verify whether the predicted word is
a valid lexicon word or not. Thus incorporating lexicon in our
final system using Borda Count improved the overall efficiency
of the system.
VII. CONCLUSION
This paper carries out a study of various feature based
classification techniques for offline handwritten character
recognition. After experimentation, it proposes an optimal character
recognition technique. The proposed method involves
segmentation of a handwritten word by using heuristics and artificial
intelligence. Three combinations of Fourier descriptors are
used in parallel as feature vectors. Support vector machine
is used as the classifier. Post processing is carried out by
employing lexicon to verify the validity of the predicted word.
The results obtained by using the proposed CR system are
found to be satisfactory.
REFERENCES
[1] N. Arica and F. Yarman-Vural, “Optical character recognition for cursive
handwriting,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, pp. 801 –813, Jun 2002.
[2] M. Blumenstein and B. Verma, "Neural-based solutions for the segmentation and recognition of difficult handwritten words from a benchmark database," in Proceedings of the Fifth International Conference on Document Analysis and Recognition, ICDAR '99, pp. 281 –284, Sept 1999.
[3] Y. Tay, M. Khalid, R. Yusof, and C. Viard-Gaudin, “Offline cursive handwriting recognition system based on hybrid markov model and neural networks,” in Proceedings of the IEEE International Symposium on Computational Intelligence in Robotics and Automation, 2003, vol. 3, pp. 1190 – 1195, July 2003.
[4] G. Kim, V. Govindaraju, and S. Srihari, "A segmentation and recognition strategy for handwritten phrases," in Proceedings of the 13th International Conference on Pattern Recognition, 1996, vol. 4, pp. 510 –514, Aug 1996.
[5] Y. Y. Chung and M. T. Wong, “Handwritten character recognition by
fourier descriptors and neural network,” in Proceedings of IEEE Region 10 Annual Conference on Speech and Image Technologies for Computing and Telecommunications, TENCON ’97, vol. 1, pp. 391 –394, Dec 1997.
[6] B. S. Moni and G. Raju, "Modified quadratic classifier and directional features for handwritten malayalam character recognition," in Computational Science - New Dimensions and Perspectives, NCCSE 2011, IJCA Special Issue, vol. 1, pp. 30 –34, Feb 2011.
[7] M. Blumenstein, X. Y. Liu, and B. Verma, “An investigation of the modified direction feature for cursive character recognition,” Pattern Recognition, vol. 40, no. 2, pp. 376 – 388, 2007.
[8] M. Blumenstein, B. Verma, and H. Basli, “A novel feature extraction technique for the recognition of segmented handwritten characters,” in Proceedings of the Seventh International Conference on Document Analysis and Recognition, 2003, vol. 1, pp. 137 – 141, Aug 2003.
[9] E. Kavallieratou, N. Fakotakis, and G. Kokkinakis, "Skew angle estimation for printed and handwritten documents using the wigner-ville distribution," Image and Vision Computing, vol. 20, no. 11, pp. 813 – 824, 2002.
[10] I. P. Morns and S. S. Dlay, “Character recognition using fourier descriptors and a new form of dynamic semisupervised neural network,” Microelectronics Journal, vol. 28, no. 1, pp. 73 – 84, 1997.
[11] M. Shridhar and A. Badreldin, "High accuracy character recognition algorithm using fourier and topological descriptors," Pattern Recognition, vol. 17, no. 5, pp. 515 – 524, 1984.
[12] B. Verma, P. Gader, and W. Chen, “Fusion of multiple handwritten word recognition techniques,” Pattern Recognition Letters, vol. 22, no. 9, pp. 991 – 998, 2001.
Project Paper
© 2020 JETIR, April 2020, Volume 7, Issue 4, www.jetir.org (ISSN-2349-5162)
OPTICAL CURSIVE HANDWRITTEN
RECOGNITION USING VPP & TDP NATIVE
SEGMENTATION ALGORITHMS AND NEURAL
NETWORKS (PYTORCH)
G. Tirumalesh, K. L. Srinivas, K. Pratima, N. Arun, Y. Hemanth (Students)
Department of Computer Science and Engineering,
Anil Neerukonda Institute of Technology and Sciences, Visakhapatnam, India.
Abstract- In the domain of Artificial Intelligence, scientists have brought ultra-modern changes to many fields, one of
them being image processing. This paper presents the process of converting handwritten text into a computer-typed
document, i.e. optical cursive handwritten recognition (OCR), by using segmentation based algorithms such as VPP
(vertical projection profile), TDP (top down profile) and other histogram (vertical and horizontal) projection
algorithms. Several other approaches are also available for the segmentation of text into individual characters.
For feature extraction and character recognition, PyTorch, an open source machine learning library in Python used
for computer vision and natural language processing, is employed.
Keywords: Vertical Projection Profile (VPP), Top Down Profile (TDP), Pytorch.
1. INTRODUCTION:
OCR means Optical Character Recognition; it is also known as optical character reader. OCR translates the
text in the given images into a machine-readable format. Character recognition is classified into two types based
on the text: machine printed text and handwritten text. It is difficult to work with handwritten text, mainly in the
case of cursive handwriting, because it varies from person to person and there is no uniform line spacing,
character size or margin in handwritten text. A single character is written in many styles, so it is difficult to
identify and translate the script into machine-readable or ASCII format. In this scenario, there is a step by
step process to convert the given script into individual characters, starting with line segmentation followed by
word segmentation and finally character segmentation. Each individual character is then predicted under a
trained PyTorch model, and the recognized characters are combined together again to generate the original
machine-readable script for the end user.
1.1 Literature Survey
Previous handwritten recognition work uses various segmentation algorithms like heuristics, skew recognition
techniques and writing pressure detection. Almost every segmentation algorithm is based on horizontal and
vertical projections to segment the script into individual characters. Even text lines or characters
which overlap each other can be separated by adjusting the threshold value. The existing method was tested
on more than 1000 text images of the IAM dataset; using it, 91.55% of lines and 90.5% of
words are correctly segmented, and it also normalizes 92% of lines and words perfectly with
a negligible error rate.[9]
2. PROPOSED ALGORITHM
The process for optical cursive handwritten recognition and the required algorithms for the various levels of
segmentation and character recognition using PyTorch are as follows.
It comprises six steps.
1. Image scanning
2. Pre-processing
3. Segmentation
4. Feature extraction
5. Classification
6. Post-processing
2.1 Image scanning:
The input image can be obtained either by scanning an already existing handwritten image file (png, jpg) or by
capturing the image instantly to provide input data to the model.
Fig 2.1 Scanned input image
2.2 Image pre-processing:
The main goal here is to make the input image free from noise. As a first step, convert the RGB
image to a grayscale image and gently sharpen the input image to avoid loss of edges. Calculate the mean
gray intensity value and reduce the brightness of the obtained grayscale image when this value exceeds a
threshold of 0.65, while the contrast is increased to distinguish the character boundaries. The text present in the
obtained result may turn dim and blurred because of improper scanning of the text image. To overcome this,
binarization plays a key role by converting the grayscale image, where values range between 0 and 255, into a
binary image using a threshold value that simply decides between on and off (0 or 1).
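A minimal pre-processing sketch using OpenCV is shown below; the 0.65 mean-intensity threshold follows the text, while the blur kernel, the contrast/brightness constants and the use of Otsu's method for binarization are illustrative assumptions.

import cv2

def preprocess(path, brightness_threshold=0.65):
    gray = cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2GRAY)
    gray = cv2.GaussianBlur(gray, (3, 3), 0)            # light blur to remove scanning noise
    if gray.mean() / 255.0 > brightness_threshold:      # overly bright scan
        gray = cv2.convertScaleAbs(gray, alpha=1.2, beta=-30)  # raise contrast, cut brightness
    # Binarize: ink becomes 1 and background 0 (inverted Otsu threshold).
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    return binary // 255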
Fig 2.2 Blur image (for noise removal)
Fig 2.3 Binary image in color contrast
2.3 Segmentation
Basically, there are three levels of segmentation. Line segmentation, Word segmentation and Character
segmentation.
2.3.1 Line segmentation
Horizontal histogram projections are used to segment the entire script present in the input image into
individual lines, as shown in the figure below.
The primary task here is to extract each individual line from the given input image. This can be obtained by
applying horizontal histogram projection to the pre-processed image and then generating the threshold value by
adjusting the average of those horizontal projections. A graphical representation of the horizontal histogram
projection is shown in figure 2.4 below.[6]
Fig 2.4 Horizontal projection graph
Finally, lines can be segmented from the given input script by obtaining break points: the average threshold
value obtained from the above graph is compared against each horizontal projection. A minimal sketch of this
procedure is given after figure 2.5.
Fig 2.5 Segmented lines from the image
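The sketch below assumes a binary page image with ink pixels equal to 1; the fraction of the average projection used as the threshold is an assumption.

import numpy as np

def segment_lines(binary_img, factor=0.1):
    proj = binary_img.sum(axis=1)              # horizontal projection: ink pixels per row
    threshold = proj.mean() * factor           # threshold derived from the average projection
    lines, in_line, start = [], False, 0
    for y, value in enumerate(proj):
        if value > threshold and not in_line:  # a text line begins
            in_line, start = True, y
        elif value <= threshold and in_line:   # projection valley: the line ends
            in_line = False
            lines.append(binary_img[start:y])
    if in_line:
        lines.append(binary_img[start:])
    return lines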
2.3.2 Word segmentation
Each word is treated as an object (a contour, in terms of image processing). A contour can be explained simply
as a curve joining all the continuous points (along the boundary) having the same color or intensity. Here, contours
are useful for object detection, where each object is a word.
Fig 2.6 segmented line
The main reason for making use of contours here is that each word can be treated as a curve joining all the
continuous points along the boundary, since it is cursively written. But sometimes there may be gaps between
the letters of a single word, which causes the word to be split into two or more words as the points are no
longer continuous along one curve.
Such words can be identified by making use of a minimum threshold value, obtained by taking the average
separation distance between the words, and can be rejoined back into a single word (contour) wherever the
separation distance between the words is less than the minimum threshold value.[7]
Minimum threshold value = (sum of separation distances between words in the line) / (no. of words in the line)
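A contour-based sketch of this word segmentation step follows; OpenCV contours are assumed, and the exact box-merging details are illustrative.

import cv2
import numpy as np

def segment_words(line_img):
    img = (line_img * 255).astype(np.uint8)
    contours, _ = cv2.findContours(img, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    boxes = sorted(cv2.boundingRect(c) for c in contours)   # (x, y, w, h), left to right
    if not boxes:
        return []
    gaps = [boxes[i + 1][0] - (boxes[i][0] + boxes[i][2]) for i in range(len(boxes) - 1)]
    min_threshold = sum(gaps) / len(gaps) if gaps else 0    # average separation distance
    words = [list(boxes[0])]
    for x, y, w, h in boxes[1:]:
        last = words[-1]
        if x - (last[0] + last[2]) < min_threshold:         # gap below threshold: rejoin
            top, bottom = min(last[1], y), max(last[1] + last[3], y + h)
            last[2] = max(last[0] + last[2], x + w) - last[0]
            last[1], last[3] = top, bottom - top
        else:
            words.append([x, y, w, h])                      # a new word starts here
    return words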
Fig 2.7 First level word segmentation
Fig 2.8 Second level word segmentation
2.3.3 Character segmentation
Segmenting each word into individual characters can be obtained by making use of two native algorithms:
1) VPP (Vertical Projection Profile)
2) TDP (Top Down Profile)
Fig 2.9 segmented word
VPP is a plot which maintains the total number of white pixels in the vertical direction of the binary image.
Characters can be segmented at the points where the VPP value is zero (0) for a certain number of consecutive
columns (threshold). But in the case of touching characters, the VPP value may never be zero even though the
characters should be segmented (connected components).[1]
Fig 2.10 VPP intensity graph for word ‘MOVE’
Connected components can be identified by making use of the character's width and height. When the width of
the character is greater than 0.8 times the height of the character, it is identified as a connected component;
otherwise it is a single character, as per basic font size measurement.[1]
Width > 0.8 * height (connected component)
Fig 2.11 First level VPP character segmentation (connected component vs. single characters)
TDP is a plot which maintains the first white pixel in the vertical direction of the binary image. Touching
characters can be segmented into individual characters by taking the combined value of both VPP and TDP,
then obtaining the minimum value in the combined graph, where the component is split; this process continues
recursively until no more touching characters are found in a single word (a sketch follows figure 2.16). [2]
Fig 2.12 touching characters (connected component)
Fig 2.13 VPP intensity graph for word ‘MO’
Fig 2.14 TDP intensity graph for word ‘MO’
FIG 2.15 Combined intensity graph for word ‘MO’
Fig 2.16 Final level character segmentation
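A recursive sketch of the VPP/TDP splitting rule described above; the margin that keeps cuts away from the component edges is an assumption.

import numpy as np

def vpp(img):
    return img.sum(axis=0)                     # white (text) pixels per column

def tdp(img):
    profile = np.argmax(img, axis=0)           # row of the first text pixel per column
    profile[img.sum(axis=0) == 0] = img.shape[0]   # empty columns fall to the bottom
    return profile

def split_touching(img):
    # A component is treated as a connected component when width > 0.8 * height;
    # it is cut at the column where the combined VPP + TDP value is minimal,
    # and the procedure recurses until only single characters remain.
    h, w = img.shape
    if w <= 0.8 * h or w < 4:
        return [img]
    combined = vpp(img) + tdp(img)
    margin = w // 4                            # assumed margin to avoid edge cuts
    cut = margin + int(np.argmin(combined[margin:w - margin]))
    return split_touching(img[:, :cut]) + split_touching(img[:, cut:])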
2.4 Feature Extraction:
The main goal here is to extract from the segmented characters the features required to train the model.
This process comprises zero padding, convolution layers, an activation function, max pooling and flattening. As a
first step, add zeros around the image to overcome the loss of edges; this is termed zero padding. Then apply
multiple layers of convolution and max pooling filters (kernels) to obtain an image reduced in size, where each
move of a filter is a stride. Max pooling selects the maximum value within the filter window, while average
pooling works the same way by taking the average pixel value. Next, the activation function comes into the
picture: the ReLU activation function identifies all the negative pixel values and replaces them with zero, without
any change to the positive pixel values. Finally, flatten the image by reshaping the output obtained.
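The described stack maps directly onto PyTorch modules; the sketch below is illustrative (the report's trained model is a pretrained EfficientNet-B0, and the layer sizes here are assumptions for 32x32 character images).

import torch
import torch.nn as nn

class CharCNN(nn.Module):
    def __init__(self, num_classes=62):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),   # padding=1 is the zero padding
            nn.ReLU(),                                    # negatives -> 0, positives unchanged
            nn.MaxPool2d(2),                              # keep the max of each 2x2 window
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 8 * 8, num_classes)  # fully connected layer

    def forward(self, x):
        x = self.features(x)
        x = torch.flatten(x, 1)            # flatten before classification
        return self.classifier(x)          # class scores; softmax gives probabilities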
Fig 2.17 Feature extraction of characters
Fig 2.18 Zero padding
Fig 2.19 Convolution layer
Fig 2.20 Max pooling and Average pooling
Fig 2.21 Flatten the image
Fig 2.22 ReLU activation function
2.5 Classification:
Finally, classification is done using a fully connected layer, where we get the probabilities of each and every
class for the given input character. The input character is classified into its respective class by selecting the class
with the maximum probability. In total, characters are classified into 62 classes (0 to 9, a to z, A to Z).
Fig 2.23 Fully connected layer with classes (x and o) along with probabilities
2.6 Post-processing:
As a final step, obtain the accuracy for all the levels of segmentation and for the character recognition by
minimizing the error rate. Then combine all the recognized characters into words, words into lines and lines into
the original script present in the image.
Final Result = M O V E
Fig 2.24 Final result
3. PYTORCH LIBRARY TOOL:
PyTorch is an open source machine learning library based on the Torch library, used for applications such as
computer vision and natural language processing.
It is a very popular framework for deep learning. The feature extraction and classification stages are
implemented using the PyTorch library. As a first step, install all the required modules/packages to train the data
using pip, namely efficientnet-pytorch and torchsummary. Then include all the required modules by importing
them into the Python script (torch, torchvision, torch.nn, torch.utils, torch.autograd, torch.optim,
torchvision.transforms, EfficientNet).
Prepare the dataset for training the model from the NIST dataset (700,000 images) by dividing it into two parts:
around 600,000 images as training data and the remainder as test data.
Create a class named DATASET listing all the training images and their respective labels for training.
Download the pretrained model efficientnet-b0 and assign all the required parameters such as batch size, learning
rate, error rate, number of classes and transformations, if any. Finally, train the model over a certain number of
epochs until the loss is minimized and the accuracy increases.
Finally, predict each input segmented character under the trained model so that it is classified into one of the
classes.
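A minimal training sketch under this setup follows; train_dataset stands in for the DATASET class described above (assumed to yield 3-channel image tensors and labels, as EfficientNet expects), and the batch size, learning rate and epoch count are illustrative choices.

import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from efficientnet_pytorch import EfficientNet

model = EfficientNet.from_pretrained('efficientnet-b0', num_classes=62)
loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

model.train()
for epoch in range(10):                        # train until the loss stops improving
    for images, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)   # classification loss over 62 classes
        loss.backward()                        # backpropagate the error
        optimizer.step()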
4. EXPERIMENTAL RESULTS AND ANALYSIS:
This system is trained on around 1,600 text images (paragraphs) of the IAM dataset, with almost 5,678 labelled
sentences, 13,353 isolated and labelled text lines and 115,320 isolated and labelled words, achieving an accuracy
of around 98% for line segmentation, 93% for word segmentation and 88% for character segmentation.[8]
In terms of character recognition, the system is trained on around 623,000 images of 62 different characters
(0 to 9, a to z, A to Z), with an accuracy of around 96% in character recognition.
5. CONCLUSION:
This paper mainly carries out a study on segmenting connected components (touching characters). Many more
challenges involved in optical cursive handwritten recognition, such as skew correction and pressure detection,
can be treated as future work. The proposed method for segmenting connected components (touching characters)
uses VPP (Vertical Projection Profile) and TDP (Top Down Profile), together with other histogram projections
(horizontal and vertical) for line and word segmentation respectively. PyTorch is a Python library used for the
recognition of the segmented characters. [4]
6. ACKNOWLEDGMENT:
The project team members would like to express their thanks to their guide B. Siva Jyothi, Assistant Professor,
Department of Computer Science and Engineering, ANITS, for her valuable suggestions and guidance in
completing our project model.
7. REFERENCES:
[1] Nafiz Arica, Student Member, IEEE, and Fatos T. Yarman-Vural, Senior Member, IEEE, "Optical Character
Recognition for Cursive Handwriting", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 24,
No. 6, June 2002.
[2] Subhash Panwar and Neeta Nain, "A Novel Segmentation Methodology for Cursive Handwritten
Documents", IETE Journal of Research, Vol. 60, No. 6, Nov-Dec 2014.
[3] Nibaran Das, Sandip Pramanik, Subhadip Basu, Punam Kumar Saha, "Recognition of handwritten Bangla basic
characters and digits using convex hull based feature set", 2009 International Conference on Artificial Intelligence
and Pattern Recognition (AIPR-09).
[4] Abhishek Bala and Rajib Saha, "An Improved Method for Handwritten Document Analysis using
Segmentation, Baseline Recognition and Writing Pressure Detection", 6th International Conference on Advances
in Computing and Communications, ICACC 2016, 6-8 September 2016, Cochin, India, Elsevier, 2016.
[5] Kanchan Keisham and Sunanda Dixit, "Recognition of Handwritten English Text Using Energy
Minimisation", Information Systems Design and Intelligent Applications, Advances in Intelligent Systems and
Computing, Bangalore, India, Springer, 2016.
[6] Namrata Dave, "Segmentation Methods for Hand Written Character Recognition", International Journal of
Signal Processing, Image Processing and Pattern Recognition, Vol. 8, No. 4 (2015), pp. 155-164.
[7] G. Louloudis, B. Gatos, I. Pratikakis, C. Halatsis, "Text line and word segmentation of handwritten
documents", Department of Informatics and Telecommunications, University of Athens, Greece; Computational
Intelligence Laboratory, Institute of Informatics and Telecommunications, National Center for Scientific Research
Demokritos, 15310 Athens, Greece.
[8] IAM dataset http://www.fki.inf.unibe.ch/databases/iam-handwriting-database
[9] Offline handwritten character recognition using neural networks,
https://www.researchgate.net/publication/239765657_Offline_handwritten_character_recognition_using_neural_network