DRAFT
Improving Automated Postal
Address Recognition
David Lomas
Submitted for the Degree of Master of Science
University of York
Department of Computer Science
June 1996
Abstract
Improving the efficiency of an automated address recognition system is key to
improving the overall efficiency of the UK’s mail delivery system. It is clear that
Optical Character Recognition (OCR) is fundamental to such a system. However, an
extensive survey of the current research shows that most groups involved in this
area agree that the way to improve current systems’ performance is to incorporate
context information into the recognition process. The problem then becomes one of
efficiently processing the large volume of data and refining it to an address. This
represents a need for the efficient searching of large databases with partial or
incomplete queries. A technique using Correlation Matrix Memories (CMMs) would seem
ideal as it allows this type of query to be made extremely efficiently. One major
problem with this method is identified and a solution proposed. The final section also
contains details of a number of questions raised during this research; it is
intended to follow these up over the course of the next 3 years.
Contents

List of Figures
1. Introduction
2. OCR
   2.1 Introduction
   2.2 Machine Printed Character Recognition
   2.3 Hand Printed Character Recognition
       2.3.1 Printed Writing
       2.3.2 Cursive Writing
   2.4 Summary
3. Verification
   3.1 Introduction
   3.2 Review
   3.3 Summary
4. Partial Matching
   4.1 Introduction
   4.2 Review
   4.3 Correlation Matrix Memories
       4.3.1 Storage Capacity of a CMM
       4.3.2 Coding of Input and Output Patterns
5. Ghosting
   5.1 Introduction
   5.2 Problems Caused by Ghosting
   5.3 Maximum Ghosting Sets
       5.3.1 Generating the Sets
   5.4 Analysis of some Maximum-Ghosting Sets
       5.4.1 Quadratic Model
       5.4.2 Cubic Model
       5.4.3 Exponential Model
       5.4.4 Set Size Ratio Model
       5.4.5 Comparison of Models
   5.5 Conclusions
   5.6 Summary
6. Analysis of PAF
   6.1 Introduction
   6.2 Format of the Postcode
   6.3 Missing Characters
7. Feasibility
   7.1 Introduction
   7.2 Speed of Database Access
   7.3 Other Factors
8. Conclusions and Further Work
   8.1 Code Generation
   8.2 Values of k
   8.3 Strategies for Verification
   8.4 OCR
   8.5 Information Feedback
       8.5.1 Algorithmic Processing of Feedback
       8.5.2 Asynchronous Processing of Feedback
   8.6 System Design
   8.7 Summary
9. References
List of Figures

Fig. 2.1  Example of poor image quality from scanning machine printed text (taken from [5 Mulgaonkar et al.])
Fig. 2.2  Example of touching handwritten characters (taken from [19 Hendrawan, Leedham])
Fig. 2.3  Diagram of how the features of an image ‘vote’ for the objects which could have generated them
Fig. 2.4  Table of improvements to OCR system using a combination of 3 networks over a single network
Fig. 2.5  The four reference lines used by the system described in [8 Yanikoglu, Sandon]
Fig. 2.6  Summary of results for the OCR systems reviewed
Fig. 3.1  A diagram of the first stage of the SNN method for retrieving valid postcodes
Fig. 3.2  Diagram of the matrix formed at each node of the SNN
Fig. 3.3  Block diagram of the way information is processed in [39 Lucas]
Fig. 4.1  Diagram of a simple correlation matrix memory
Fig. 4.2  A CMM during recall
Fig. 4.3  Example input pattern coding for a CMM to use partial matching
Fig. 4.4  Result of recalling ‘C?T’ from a CMM
Fig. 4.5  Superimposition of 2 7-segment number patterns
Fig. 5.1  Example of superimposed codes generating a ghost
Fig. 5.2  Example of orthogonal codes which can ghost any other code
Fig. 5.3  Times to complete exhaustive search of some small code sets
Fig. 5.4  Graphs of set size against code width for k3s2g1 and k3s2g2
Fig. 5.5  Graphs of set size against code width for k4s2g1 and k4s2g2
Fig. 5.6  Graphs of quadratic functions against experimental data for sets k3s2g2 and k4s2g2
Fig. 5.7  Graphs of cubic functions against experimental data for sets k3s2g2 and k4s2g2
Fig. 5.8  Combined graphs showing exponential functions against experimental data for k3s2g2 and k4s2g2
Fig. 5.9  Graphs of ratio functions against experimental data for sets k3s2g2 and k4s2g2
Fig. 5.10 Table of predicted k3s2g2 and k4s2g2 set sizes for various widths
Fig. 5.11 Table of predicted code widths for a storage requirement of 866026 associations
Fig. 6.1  The syntax of the postcodes
Fig. 6.2  Analysis of five-character postcodes
Fig. 6.3  Analysis of six-character postcodes
Fig. 6.4  Analysis of seven-character postcodes
Fig. 7.1  Estimated code widths for the 3 classes of postcode, using Eqn. 5.14
Fig. 7.2  Time taken to search each database for one specific postcode
Fig. 7.3  Total size of codes required to represent each class of postcode
Fig. 7.4  Overall time to recover actual postcode
Fig. 8.1  Representation of 3 bit binary codes as vertices of a 3-dimensional cube
Fig. 8.2  System outline of automated address recognition system
1. Introduction
The research presented here was sponsored by The Post Office and is driven by a
need to increase the performance of the automatic sorting machines employed
throughout the country. The sorting process consists of two stages known as out-
bound and inbound. The outbound stage involves identifying the destination post-
town for the mail piece. The inbound stage (which is performed on mail already in
the correct town, or mail which has arrived from outbound sorting in a different
town) involves sorting the mail into delivery rounds which can then be collected by
the delivery personnel. The automatic recognition of the address is the first stage of
these sorting processes, and from this a machine readable code is printed on the mail
piece in the form of phosphor dots which can be read by all the other automated
machines in the sorting path. The aim of this year’s research was to find a way of
improving the performance of the address recognition system.
An extensive survey of the relevant work showed that current OCR technology is
very close to the theoretical limit in terms of recognition rate for machine printed
characters, and therefore, the only improvement which can be made is the speed
with which that recognition is performed. However there is also a limit to which
increasing the recognition speed is advantageous in this application as the overall
goal is not to increase throughput but to increase the reliability of the system. It was
therefore proposed that spending time trying to increase the reliability of recognising
individual characters was not the best way to set about increasing the reliability
of recognising the address as a whole. What was needed was a system capable of
verifying and correcting the small number of errors which were made by the OCR
system. This in turn led to an evaluation of database technology, specifically using
Correlation Matrix Memories (CMMs) as the engine for the database. Some problems
which occur when CMMs are used to perform partial matching were investigated,
and a possible solution proposed.
The actual sorting machines are supplied to The Post Office by the German company,
AEG. It is hoped that, at some point, the results of this research can be integrated into
the new machinery. This would almost certainly require the involvement of AEG,
but the situation between The Post Office and AEG is politically sensitive at present
and no attempt has been made to contact AEG so far. However, the new machine is
intended to be highly modular in design, and it should present few problems for the
address recognition system to be upgraded or even replaced by a more powerful
system in the future.
2. OCR
Optical Character Recognition is the area of computing that concerns itself with the
ability of computers to interpret printed characters. The characters may be from a
standard alphabet or one designed with computer recognition in mind. They may be
produced by machine or by a human writer. There are also a number of methods
with which the document may be translated into machine-readable form, such as
scanning, imaging using a camera or direct entry by writing onto a touch-sensitive
screen.
2.1 Introduction
The OCR research currently being carried out is split into 3 areas. These are:
• Machine Printed Character Recognition (MPCR)
• On-line Hand Printed Character Recognition (OnHPCR)
• Off-line Hand Printed Character Recognition (OffHPCR)
The strategies applied to OnHPCR use information which is only available when the
automated system can be used during the process of writing the characters. Typi-
cally, the characters are written onto some sensitive screen using a wand, and the
computer system imitates the ink by colouring in pixels which the wand passes over.
This allows the system to record stroke information such as order, direction and
speed of the strokes which are used to form each character. There are a number of
commercial systems available at present in the form of portable computers which
use this type of OCR, and they usually include recognition of other characters or ges-
tures, which allow the user to command the machine in certain ways. For example,
they allow a cross to be written over a word in a word-processor thus indicating the
‘delete word’ function. These sorts of characters, and particularly the real-time stroke
information, are obviously not available from a scanned image of hand printed
characters, so the techniques used are not applicable here. Therefore, on-line hand
printed character recognition will be disregarded for the remainder of this report,
and off-line hand printed character recognition, which is what is being considered
here, will be referred to simply as hand printed character recognition (HPCR).
There is an obvious difference between machine printed characters and hand printed
characters. The variation found in hand printed characters is far greater than in
machine print. The only real variations in machine print are font style and size,
and the number of these is, for all intents and purposes, finite. However, with hand
printed characters, even given the same writer and character, there can be huge vari-
ations in the form of the character image. Not only does the character change every
time it is written, but it can also change shape simply because of the character it is
next to. As a result of this, the recognition rate for MPCR is much higher than for
HPCR, as it turns out to be a much simpler problem.
Apart from simply being able to recognise characters, the system must be able to
extract them from the image. It is very rare that characters appear alone and isolated
from those they are associated with in forming a word. This would only happen on
forms with boxes for characters, and then only if the writer had carefully followed
the outlines of the boxes and kept each character completely within each box. In real-
ity characters tend to touch one another, even with machine printed characters. It
may appear at first sight that machine printed text would be simple to segment into
individual characters, but there are various reasons why this is not so.
Fig. 2.1 - Example of poor image quality from scanning machine printed text (taken from [5 Mulgaonkar et al.])
As can be seen in Fig. 2.1, the image quality of a scanned, machine printed document
is not always perfect. Dot-matrix printers, especially high speed ones can produce
very smudged text, so much so that the characters actually run into each other. This
effect is made worse by the scanning procedure which has to quantise the image into
pixels. If the gap between two characters is smaller than the pixel size, they will be
imaged as touching. Secondly, in this particular application, the imaging has to be
very fast, as the scanning process is on-line within the sorting machine. This leads to
reduced resolution being employed, and also, as the mail piece is moving, tends to
smear characters along the horizontal axis. Again, this tends to render them as
touching. Characters in proportionally spaced fonts can also overlap in the sense
that there is no vertical white space between characters. This is due to kerning,
where the characters are moved closer together to give a more pleasing appearance
to the human reader. Usually there is still separation between them but it is no longer
trivial to find it, and it sometimes does not even follow a straight let alone vertical
path.
With handprinting, the problem of touching characters is much worse, as can be
seen in Fig. 2.2. Firstly, it is much more natural for people to write ‘joined-up’. This
means there is no intentional break in the characters. Secondly, even if the writer is
deliberately writing separate characters, there is a tendency for some characters to be
joined together unintentionally, simply because people are in the habit of joining
them together. The same problem can occur as with proportionally spaced machine
print, when characters are not physically joined, but their enclosing rectangles over-
lap.
Fig. 2.2 - Example of touching handwritten characters (taken from [19 Hendrawan, Leedham])
The upshot of all this is that it is as much of a problem, if not more, to segment the
image into individual characters as it is to recognise the characters themselves. In the
approaches to OCR reviewed here, some operate only on isolated (segmented) char-
acters, and some attempt both segmentation and recognition. In some, the two proc-
esses are independent, and in others, they are integrated.
The remainder of this section will consider each of these areas in turn: Machine
Printed Character Recognition, Hand Printed Character Recognition and Cursive
Writing Recognition. Some relevant publications are
reviewed and the details of the particular method are presented. A summary of the
results achieved by each of the systems reviewed is presented at the end of the sec-
tion, along with a discussion of some of the more salient points with respect to the
application of automated mail sorting.
2.2 Machine Printed Character Recognition
In [22 Wang, Jean], the authors present a multi-resolution neural network system
which is capable of recognising isolated machine printed characters in any font.
They describe a number of different configurations which are all based on the idea of
using a low resolution neural network which can operate at high speed to perform
an initial attempt at recognising the characters. This network is intended to recognise
around 85% of the characters at a resolution of only 12×8 pixels. The second network
uses a resolution of 24×20 pixels, and a more complex neural network (four hidden
layers instead of one). Consequently this network is more computationally expen-
sive, but is only used on the 15% of characters which cannot be recognised by the
first network. The results show that the first network can operate at 50 times the
speed of the second, but the slow network is still being used 15% of the time which
limits the overall speed-up. In order to reduce this limiting factor, a third network
was introduced which worked at the same resolution as the second network, but had
only one hidden layer as with the first network. This new network was used
between the first and second networks, and was able to recognise 80% of the rejects
from the first network. This reduced the use of the slow second network to only 3%
of characters, and represented a speed-up of between 14 and 20 times for the whole
system. They also used a weighted voting scheme when none of the networks could
successfully recognise a character to allow evidence from the 3 networks to be drawn
together and an overall decision made.
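The staged arrangement described here can be sketched as a generic classifier cascade. The sketch below is an illustration of the general technique, not the networks of [22]: the classifiers, thresholds and vote weights are hypothetical stand-ins.

```python
def cascade_classify(image, stages):
    """Run a cascade of classifiers, cheapest first.

    `stages` is a list of (classify, threshold, weight) triples, where
    classify(image) returns a (label, confidence) pair. The first stage
    confident enough wins, so expensive stages only see hard inputs.
    If no stage is confident, a weighted vote over all stages decides.
    Returns (label, index of the deciding stage).
    """
    votes = {}
    for i, (classify, threshold, weight) in enumerate(stages):
        label, confidence = classify(image)
        if confidence >= threshold:
            return label, i
        # not confident: accumulate weighted evidence for the final vote
        votes[label] = votes.get(label, 0.0) + weight * confidence
    # no stage was confident: combine evidence from all stages
    return max(votes, key=votes.get), len(stages)
```

The point of the arrangement is that the slow stage is only charged against the small fraction of inputs the fast stages reject, which is where the 14-20 times overall speed-up comes from.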
Various configurations of the networks were tested, with the best achieving a 99.81%
recognition rate on a test set which included characters from first and second genera-
tion photocopies, at speeds of 107-148 characters per second on a DEC workstation
rated at 42 MIPS. It is interesting to note that on a random subset of the test set, the
authors themselves only achieved a 99.83% recognition rate, and it is noted in [22]:
“Although there are 62 classes (A-Z, a-z, 0-9) in each font, some of them cannot be distinguished from each other after normalisation and they are considered equivalent for recognition purposes.”
This potentially poses a serious problem for address recognition, as two valid post-
codes could be generated by the characters which could not be distinguished. In
actual fact however, this is not likely to occur, as the main candidates for confusion
are ‘l’ (ell) & ‘1’ (one), and ‘O’ (oh) & ‘0’ (zero). In the first case, postcodes are not
generally written in lower case although it is not inconceivable that this could hap-
pen. Secondly, there are no postcodes which contain either ‘O’ or ‘0’ in a character
position where both would be legal, therefore the confusion can be resolved by using
grammatical rules of the postcodes.
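This kind of grammatical resolution can be illustrated with a toy check. The template below (one character class per position) is a deliberately simplified illustration, not the real postcode syntax, which is set out later in Fig. 6.1.

```python
# Toy illustration of resolving OCR confusions by position class.
# The confusion table and the 'A'/'9' template notation are invented
# for this sketch, not taken from the PAF specification.
CONFUSIONS = {'O': '0', '0': 'O', 'L': '1', '1': 'L', 'I': '1'}

def resolve(chars, template):
    """template: one class per position, 'A' = letter, '9' = digit.

    Any character which is illegal for its position but has a known
    confusable twin is swapped for that twin.
    """
    out = []
    for ch, cls in zip(chars, template):
        ch = ch.upper()
        legal = ch.isalpha() if cls == 'A' else ch.isdigit()
        if not legal and ch in CONFUSIONS:
            ch = CONFUSIONS[ch]  # swap to the confusable twin
        out.append(ch)
    return ''.join(out)

resolve('Y0', 'AA')  # zero is illegal in a letter position -> 'YO'
resolve('1O', '99')  # oh is illegal in a digit position   -> '10'
```

Because no postcode position admits both ‘O’ and ‘0’, the swap is always unambiguous in this case.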
In [4 Wang, Jean], the neural network system mentioned above is integrated with a
character segmentation system which allows scanned printed documents to be ana-
lysed. The segmentation system uses a hybrid of neural networks and conventional
algorithms to determine the best cut position for merged characters. There is a two
stage approach to the segmentation. Firstly, if a character is rejected from the OCR
system, or has a large aspect ratio (i.e. is much wider than ‘normal’ characters), then
it is immediately segmented. The second approach is used if the character is classi-
fied during OCR, but then fails a subsequent spelling check. Each character in the
word is then examined by a neural network which is trained to identify touching
characters. The training of this network is on character pairs which are generated by
an algorithm designed to produce likely touching pairs — for example it doesn’t
generate touching upper-case characters (as it is suggested that these would be
rejected anyway on the grounds that the resulting image would be too wide), or
pairs of characters in different fonts (as there is rarely a font change in the middle of
a word). However the output of this network is a simple yes/no answer when applied
to a character pair, indicating whether or not the network considers the pair to be
touching. The actual segmentation is left to a later stage. It would seem that there is
an opportunity missed here to allow the network to assist in the segmentation by
providing a method for it to suggest a suitable cutting point. While this would
undoubtedly complicate the network and require more training data (a cut point
associated with each character pair), it would seem that the benefit to the segmenta-
tion algorithm would outweigh this initial effort. A possible counter argument is
that the network is trained on only a small subset of touching pairs, and is required
to generalise over the whole set of possible pairs it might encounter while processing
a document. It would be difficult to see how it could infer the correct cutting point in
an unseen character pair from this. However, there are only 26 different initial char-
acters for each possible pair. Providing the network is trained with at least one exam-
ple from each of these 26 classes, it will be able to suggest the correct cutting point.
The second character in the touching pair would not influence the width of the first
character, and so the cutting point would be correct no matter what the second char-
acter was. It would still be possible for the network to generalise over the set of all
touching pairs even though it has only been trained on a small subset. The fact that
the subset contains one example for every possible initial character allows the cut-
ting point to be suggested by the network even for unseen pairs.
The actual segmentation is carried out by a shortest path algorithm which attempts
to find the least cost curve from the top to the bottom of the character pair image.
The cost is defined in terms of the number of pixels involved in the path, and the
number of those that are set (i.e. form part of the actual character image rather than
the background). An extra penalty is also applied to paths which take diagonal steps,
in an attempt to keep the path as vertical as possible. Once a cut is proposed, the
segmented images are passed back to the OCR system and another classification is
performed. This procedure is repeated until either the characters are classified with high
confidence or no more low cost cuts can be made. The system also checks for touch-
ing character triples in this way. As soon as the left hand portion of the image is rec-
ognised, the remaining portion is deemed to be a touching pair and is segmented
repeatedly until it is classified or cannot be segmented any more.
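A least-cost cut of this kind can be sketched with dynamic programming. The cost weights below are arbitrary illustrative choices, not the values used in [4]; the structure (pixel cost plus a diagonal-step penalty, one step per row) follows the description above.

```python
def best_cut(image, ink_cost=10, diag_cost=1):
    """Least-cost top-to-bottom cut through a binary image (rows of 0/1).

    Each step moves down one row, staying in the same column or moving
    one column left or right; crossing a set (ink) pixel and taking a
    diagonal step are both penalised. Returns (total cost, column per row).
    """
    h, w = len(image), len(image[0])
    cost = [image[0][c] * ink_cost for c in range(w)]
    back = []
    for r in range(1, h):
        new, prev = [0] * w, [0] * w
        for c in range(w):
            best, arg = float('inf'), c
            for pc in (c - 1, c, c + 1):        # reachable predecessors
                if 0 <= pc < w:
                    step = cost[pc] + (diag_cost if pc != c else 0)
                    if step < best:
                        best, arg = step, pc
            new[c] = best + image[r][c] * ink_cost
            prev[c] = arg
        cost = new
        back.append(prev)
    c = min(range(w), key=cost.__getitem__)     # cheapest bottom cell
    total = cost[c]
    path = [c]
    for prev in reversed(back):                 # recover the cut column per row
        c = prev[c]
        path.append(c)
    return total, path[::-1]
```

A cut through an all-white column costs nothing, so the path naturally follows the gap between two touching characters where one exists.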
Overall, this system performs admirably. It achieved a character recognition rate of
99.71% on various documents scanned at 300dpi. The speed of the recognition was
not reported. However, it is unlikely to be high, given the multitude of complex
stages involved in the process.
A spelling checker is used to catch errors which are very difficult to detect from the
image. One such error is the character pair ‘rt’ being classified as ‘n’. However as
with the character recognition there are some images which simply cannot be distin-
guished by any of these methods, cf. ‘close’ & ‘dose’, ‘stern’ & ‘stem’, and the only
solution is to employ some kind of context information which will show that one
word is permissible in a particular context whereas the other is not.
In [7 Liang et al.], a discrimination function is presented for segmenting touching
characters. The function relies on the pixel and profile projections of the character
shapes. The function is implemented as a dynamic recursive system which repeat-
edly segments the image and attempts OCR on the results until the OCR system clas-
sifies the segments with high confidence. This is then taken to be the correct
segmentation. It is a very similar approach to the previous one, although it uses a
very different implementation.
The OCR system uses a minimum distance classifier applied to the border chain
codes of the character images. The authors also developed a novel solution to the
problem of large chain code variations due to relatively small input image changes.
A chain code is stored as a histogram with four bins: one for horizontal lines, one for
vertical lines and one each for the 2 orientations of diagonal line which are found in
the border of the character image. The image is split into 16 (4×4) rectangles and the
histogram for each calculated. The large variation occurs when the edge of a charac-
ter moves from one of these rectangles to another, and all the information associated
with that edge moves from one histogram to another. Their solution to this problem
was to make the 16 rectangles overlap by an amount which could be altered until the
system performed best. This way, small variations in the position of the edges of a
character are unlikely to move the edge outside the rectangle, and the edge
information will be included in more than one histogram. Their experiments show
that there is a 23% decrease in the Euclidean distance between input patterns and the
stored patterns when rectangles of 64×80 pixels are overlapped by 8 pixels
horizontally and 10 pixels vertically.
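The overlapping-zone idea can be sketched as follows. The zone geometry and direction coding here are simplified stand-ins for the 4×4 grid of 64×80-pixel zones described above, and the function signature is invented for illustration.

```python
def zone_histograms(segments, width, height, zones=4, overlap=2):
    """Build a 4-bin direction histogram for each of zones×zones
    overlapping rectangles covering a width×height image.

    `segments` is a list of (x, y, direction) border chain-code elements,
    direction in {'H', 'V', 'D1', 'D2'} (horizontal, vertical, and the
    two diagonal orientations). Because each zone is grown by `overlap`
    pixels on every side, a segment near a zone boundary falls into more
    than one histogram, damping the large feature-vector jumps that occur
    when an edge crosses a hard zone boundary.
    """
    bins = {'H': 0, 'V': 1, 'D1': 2, 'D2': 3}
    zw, zh = width / zones, height / zones
    hist = [[[0] * 4 for _ in range(zones)] for _ in range(zones)]
    for x, y, d in segments:
        for zy in range(zones):
            for zx in range(zones):
                if (zx * zw - overlap <= x < (zx + 1) * zw + overlap and
                        zy * zh - overlap <= y < (zy + 1) * zh + overlap):
                    hist[zy][zx][bins[d]] += 1
    return hist
```

A segment sitting within `overlap` pixels of a zone edge is counted in both neighbouring histograms, which is exactly the redundancy the authors exploit.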
Finally, a contextual analysis of the image of the text line allows the system to pre-
vent characters such as ‘h’ being split into ‘l’ and ‘I’. This is achieved by recognising
the fact that the ‘I’ would have to be much smaller than the ‘l’ for them to form the
image of an ‘h’ when touching. This variation in font size within a word is disal-
lowed.
Overall the system performs very well, with an average character recognition rate of
99.65% from 300dpi scanned images of a multi-column newspaper type publication.
The speed of recognition was not reported.
In [5 Mulgaonkar et al.], it is stated that
“... half the errors in character recognition are due to [poor] segmentation.”
The approach they adopt is to avoid completely the segmentation step and use a fea-
ture voting scheme similar to the one in [29 O’Keefe, Austin]. This is noted in [29] to
be akin to the Generalised Hough Transform which is used to recognise arbitrary
objects in images by accumulating evidence for them in some kind of array. In [5] the
array is 1-dimensional, representing the line of characters which make up the word
being recognised. In [29] however, the array is 2-dimensional and each cell in the
array represents evidence for an object at a particular location in the image. The
method used for collecting evidence in [5] is a simple sequential search through a
library of features which have been previously extracted from examples of the
objects (in this case, characters) that the system is to recognise. As soon as the match
between the current input and a character from the library is above a certain thresh-
old the character is considered to have been classified. It is noted in the report that
this is a very inefficient method of searching the library for a matching character. The
authors suggest this could be improved using hashing and indexing techniques,
however they are not specific about how they plan to implement this. The recogniser
used in [29] is a neural network which is much more suited to the fuzzy matching
which has to be performed on the input. This system has to handle a 2-dimensional
input image and so has to be much more efficient to allow images to be processed in
reasonable time. It is likely that a similar approach would be of great benefit to the
OCR task tackled in [5].
The input window of the classifier is scanned over the input image and whenever a
known feature is recognised, a ‘vote’ for the object/character which could have gen-
erated that feature is stored in an accumulator array. The array has an entry for each
position in the original image which could contain an object or character. A vote is
placed in the entry associated with the position in the input image where the object/
character would have to be in order for the feature to be present at the location it was
found. With reference to Fig. 2.3, the input window currently contains a feature
which the system has been trained on. This feature happens to be the lower left cor-
ner of a square. It is possible to infer from this that the centre of the square must lie
on the dotted line drawn from the apex of the corner, extending up and to the right.
All the accumulator elements which lie on this line are then incremented. This proc-
ess is repeated for the other corners, and the accumulator entries in the centre of the
square will have been incremented four times, whereas the others will have been
incremented a maximum of twice. This local maximum is used to locate the centre of
the square.
Fig. 2.3 - Diagram of how the features of an image ‘vote’ for the objects which could have generated them.
At the end of the recognition process the accumulator will contain local maxima or
‘peaks’ where the objects are most likely to be, as they were voted for by the most
features. A thresholding algorithm then decides which of these objects are suffi-
ciently evident to include them in the final output. In the OCR system in [5], a lexi-
con is used to restrict the outputs to valid words. The best matching valid word is
selected based upon the characters with highest confidence in the output array. The
authors reported a word recognition rate of 80% but it was expected that this could
be improved by using more character features during the recognition stage (they
listed several in the report, but only used one — contours of characters — during the
tests).
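The voting scheme just described is essentially a generalised Hough transform. A minimal sketch follows, assuming a toy 20×20 accumulator and an invented corner feature set; none of the names or values come from [5] or [29]:

```python
# Sketch of the voting (generalised Hough) scheme: each detected corner
# feature votes along the ray on which the square's centre must lie; where
# all four rays cross, the accumulator peaks. Toy data, for illustration only.
from collections import Counter

GRID = 20  # accumulator is GRID x GRID, one entry per possible centre position

# Direction from each corner type towards the square's centre.
CORNER_DIRS = {
    "lower_left":  (1, 1),   # centre lies up and to the right of this corner
    "lower_right": (-1, 1),
    "upper_left":  (1, -1),
    "upper_right": (-1, -1),
}

def vote(features):
    """Accumulate votes; features is a list of (x, y, corner_type)."""
    acc = Counter()
    for x, y, kind in features:
        dx, dy = CORNER_DIRS[kind]
        cx, cy = x + dx, y + dy          # centre is at least one cell away
        while 0 <= cx < GRID and 0 <= cy < GRID:
            acc[(cx, cy)] += 1           # every point on the ray gets a vote
            cx, cy = cx + dx, cy + dy
    return acc

# Four corners of a square centred on (10, 10) with half-width 4.
corners = [(6, 6, "lower_left"), (14, 6, "lower_right"),
           (6, 14, "upper_left"), (14, 14, "upper_right")]
acc = vote(corners)
centre, peak = max(acc.items(), key=lambda kv: kv[1])
```

The peak of four votes at the true centre is exactly the local maximum the text describes; every other accumulator entry receives at most two votes (where a pair of rays overlaps).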
The system in [29] is a very generalised system capable of recognising arbitrary
objects at arbitrary positions on the document, whereas the OCR system in [5]
requires the document to have been segmented into text lines before recognition can
begin. This is because it only employs a 1-dimensional accumulator array which rep-
resents the line of characters. However it is likely that the system in [29] would not
require this step as it could locate characters at any position within the image of the
document. It would be interesting to compare the performance of the system
described in [29] trained on character images, on a similar document to the one used
in [5].
2.3 Hand Printed Character Recognition
There is really a further subdivision necessary here — that of hand printed charac-
ters versus cursive characters. Hand printed characters tend to be separated from
their neighbours, whereas cursive characters are almost always joined. In
[8 Yanikoglu, Sandon] it is noted that recognition rates of the order of 95% are
achievable for hand printed characters (Martin et al.), but as low as 36% for cursive
writing (Edelman et al.). Their work also assumed isolated words and did not
attempt word segmentation. The results are often much higher for hand printed dig-
its as they tend to be clearly separated. Also, as there are only 10 classes to distin-
guish, the problem is inherently easier. Recognition rates of up to 98% are reported
for handprinted digits (Baptista et al. and Burr). Many different techniques were
used by the systems whose results were reported, including neural networks, radial
basis functions, syntactic and elastic matching. There was no clear method which
achieved better results than all the others. The recognition rate achievable seems to
depend quite heavily on the restrictions which are put on the scope of the recogni-
tion system. For example, one result of a 48% word recognition rate for cursive writ-
ing is reported (Srihari et al.) but the notes which accompany the result show that the
system was only tested on writing supplied by the author who also wrote the train-
ing set.
2.3.1 Printed Writing
A system is presented in [26 Burges et al.] which is applied to both printed digits
and cursive handwriting. The system and its results for printed digits are presented
here. A description of its performance when applied to cursive writing is given in
section 2.3.2 (page 31).
The input images are segmented into ‘cells’ by first locating ‘definite cut’ points
where there is a large amount of white space between adjacent characters. Possible
cuts are then identified using a method named “Modulated Gradient Hit and Deflect”.
This algorithm produces a set of possible segmentation points within the text line.
The segments thereby created are known as ‘cells’. It is assumed by the rest of the
system that it is possible to construct the correct segmentation of the line into charac-
ters by merging some of the cells. Thus the cells represent an over-segmentation of
the text line and the goal of the next stage is to identify which cells should be joined
together to form characters. This is achieved using an exhaustive scan of the possible
combinations by applying them to the character recogniser and using the generated
confidence value to indicate whether this combination is likely to be a good one (i.e.
representing a character). Once a set of combinations is found such that the confi-
dence for each segment is above some threshold, this is taken as the correct segmen-
tation of the text line and the characters are output from the classifier. Obviously it is
not necessary to test every possible combination of cells as it is assumed that the
number of characters in the final word is known1. There is no point trying combinations which produce either more or fewer characters than this.
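The cell-merging search can be sketched as follows; the toy confidence function stands in for the real character recogniser, and all names and values are illustrative rather than taken from [26]:

```python
# Sketch of the cell-merging search: the n cells from over-segmentation are
# grouped into exactly k contiguous runs (k = known character count), each
# run is scored by the classifier, and the grouping whose worst confidence
# is highest wins. The toy scorer below is purely illustrative.
from itertools import combinations

def groupings(n_cells, k):
    """All ways to split cells 0..n-1 into k contiguous runs."""
    for cuts in combinations(range(1, n_cells), k - 1):
        bounds = (0,) + cuts + (n_cells,)
        yield [tuple(range(bounds[i], bounds[i + 1])) for i in range(k)]

def best_grouping(n_cells, k, confidence, threshold=0.5):
    best, best_score = None, -1.0
    for g in groupings(n_cells, k):
        scores = [confidence(run) for run in g]
        if min(scores) >= threshold and min(scores) > best_score:
            best, best_score = g, min(scores)
    return best

# Toy confidence: pretend cells (0,1), (2,), (3,4) are the true characters.
TRUE = {(0, 1), (2,), (3, 4)}
conf = lambda run: 0.9 if run in TRUE else 0.1

result = best_grouping(5, 3, conf)
```

Knowing k in advance keeps the search small: 5 cells into 3 characters gives only 6 candidate groupings rather than every possible subset.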
The system achieved a ZIP code recognition rate of 82.7% with no rejects and 96%
with 25% rejects. It is noted in their paper that the test set contains around 3% of
images which are not in the lexicon and so could not possibly be recognised correctly
by the system. These were not removed for the tests and would seem only to compli-
cate the interpretation which can be placed on these results.
A novel approach to the problem of increasing the reliability of an OCR system is
presented in [23 Drucker et al.]. Three conventional neural networks were used.
Their architecture is irrelevant to the method which can be applied to any trainable
classifier. The only requirement is that a very large training set must be available and
it will not necessarily be known in advance how many training examples will be
required. The training procedure is outlined below.
The first network (NET1) is trained as a normal classifier using some examples from
the large initial training set. The training set for the second network (NET2) is
formed by passing more example characters (unseen by NET1) through NET1 until
it incorrectly classifies one. This character image is added to the training set for
NET2. The process is repeated, but this time the first character to be correctly classi-
fied by NET1 is added to the training set for NET2. The selection of characters is
alternated between those which were and were not correctly classified by NET1. In
this way, the training set for NET2 is always made up of an equal number of charac-
ters which were classified correctly and incorrectly by NET1. When a sufficiently
large training set has been generated, NET2 can be trained. The training set for the
third and final network (NET3) is now generated by passing more unseen (by either
NET1 or NET2) character images through both NET1 and NET2. This is repeated
until the networks disagree on the classification of the image. This image is then
1. The tests were performed on U.S. ZIP codes which contain either 5 or 9 characters. A discriminator was used prior to the segmentation to identify which format the target image was, and so it is known in advance how many segments will make up the final word.
added to the training set for NET3. All other images (those that networks 1 and 2
agree on) are discarded. Thus the training set for NET3 contains only images whose
classification NET1 and NET2 disagree on. Once a sufficient number of these images
has been collected, NET3 can be trained.
During the recognition phase the character image is applied to networks 1 and 2. If
they agree then this is taken as the correct answer. If they disagree however, the
image is applied to network 3 and its output is taken as correct.
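The training and arbitration procedure above can be sketched as follows, with trivial stand-in "networks" in place of real classifiers; all of the toy functions and data are invented for illustration and do not come from [23]:

```python
# Sketch of the three-network scheme: NET2 trains on a stream filtered so
# that examples NET1 gets right and wrong alternate, NET3 trains only on
# examples where NET1 and NET2 disagree, and NET3 arbitrates at test time.

def build_net2_set(stream, net1, size):
    out, want_correct = [], False      # start with a misclassified example
    for x, label in stream:
        if (net1(x) == label) == want_correct:
            out.append((x, label))
            want_correct = not want_correct   # alternate right/wrong
            if len(out) == size:
                break
    return out

def build_net3_set(stream, net1, net2, size):
    out = []
    for x, label in stream:
        if net1(x) != net2(x):         # keep only disagreements
            out.append((x, label))
            if len(out) == size:
                break
    return out

def classify(x, net1, net2, net3):
    a, b = net1(x), net2(x)
    return a if a == b else net3(x)    # NET3 arbitrates disagreements

# Toy example: labels are parity; net1 errs on multiples of 3, net2 on 5s.
net1 = lambda x: (x % 2) if x % 3 else 1 - (x % 2)
net2 = lambda x: (x % 2) if x % 5 else 1 - (x % 2)
net3 = lambda x: x % 2                 # pretend NET3 learned the hard cases
stream = [(x, x % 2) for x in range(1, 100)]

net2_set = build_net2_set(stream, net1, size=6)
net3_set = build_net3_set(stream, net1, net2, size=4)
decision = classify(9, net1, net2, net3)
```

Note how much of the stream is discarded: only disagreements reach NET3's training set, which is why an unexpectedly large pool of examples may be needed.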
While this method can be used in principle to increase the performance of any neural
network system it is applied here to the recognition of hand printed characters and
digits. However it can be seen from the description above that this system will
always produce an output, whether or not there is high confidence in that output. It
is stated in [23] and would also appear to be common sense that in a mail sorting application it is much more desirable to reject a piece of mail and have it sorted by hand than to misclassify it and have it delivered to the wrong address. The voting scheme described above is not suitable for this and so a modified one is presented.
The input image is applied to all 3 networks and their outputs are summed. This
total is then thresholded and the confidence (the difference between the highest scor-
ing character and the next highest) can be determined. If this is too small (i.e. below
the threshold), the character is rejected. The threshold was set so that there was only
a 1% error rate on characters accepted from the validation sets used during training.
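A minimal sketch of this summing-and-thresholding scheme, with invented scores and an invented threshold value:

```python
# Sketch of the modified voting scheme: the three networks' output vectors
# are summed and the margin between the best and second-best class decides
# acceptance. Scores and threshold below are illustrative only.

def accept(outputs, threshold):
    """outputs: list of per-network score vectors (one score per class)."""
    total = [sum(col) for col in zip(*outputs)]
    ranked = sorted(total, reverse=True)
    margin = ranked[0] - ranked[1]          # the confidence measure
    if margin < threshold:
        return None                         # reject: sort this item by hand
    return total.index(ranked[0])

# Three networks scoring classes 0..2; class 1 wins clearly here.
nets = [[0.1, 0.8, 0.1], [0.2, 0.7, 0.1], [0.3, 0.5, 0.2]]
decision = accept(nets, threshold=0.5)
```

Returning `None` rather than a forced guess is the point of the modification: a rejected item goes to hand sorting instead of being delivered to the wrong address.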
The resulting system was tested on four databases, two of digits and two of alpha-
betics (one each of upper and lower case). The results for the system described above
are shown in Fig. 2.4.
Fig. 2.4 - Table of improvements to OCR system using a combination of 3 networks over a single network
Clearly these are not remarkable improvements, especially for the lower case
characters. Another drawback is the large (and unknown) number of training
images that will be required in order to generate the three sets for training the three
networks. However it is a method which can be applied to any classifier, and using a
sieving procedure presented in [23], the system’s efficiency can be improved.
Instead of the computational requirement going up by a factor of 3 (three networks
are now being used rather than one), it is limited to an increase by a factor of 1.75.
This is achieved by preventing the invocation of the second and third networks if the
confidence of the first is high enough. This process is referred to in the paper as “sieving”. The potential problem is that the first network may misclassify an image with
high confidence and this would have been caught had all three networks been used.
However in their tests they showed that the previous figures are still accurate, apart
from the lower case characters. The reasons they give for this failure are difficult to
comprehend however. They state:
“However, for the lower case alphabets, this procedure does not produce reasonable results (achieving a 4.0% error rate by rejecting 7.2% of the images) and sieving does not work.”
Database  Contents               Single Network      3 Networks
                                 Error    Reject     Error    Reject
1         Digits                 4.9%     11.5%      3.6%     6.6%
2         Digits                 1.4%     1%         0.8%     ~
3         Upper Case Characters  4%       9.2%       2.4%     3.1%
4         Lower Case Characters  9.8%     29%        8.1%     21%
The error and reject rates reported in this statement are far better than the results
listed earlier for the lower case characters. However it is not clear what they refer to.
It would seem that they do not refer to the sieving procedure, as it is stated at the end
that this procedure does not work. It is possible that there was an error in their table
of results, and these figures are much worse than the actual results for lower case
characters. This would seem unlikely though as all other reports suggest that lower
case handprinted characters are the most difficult to classify.
In [18 Leedham], a comparison is made of several approaches to HPCR and the
results on three datasets of characters reported. The databases used were:
• USPS/CEDAR database, which contains a mixture of cursive words
(which were not attempted) and segmented handprinted characters
(alphabetics and numerics)
• CENPARMI database, containing only segmented digits
• Royal Mail/Essex database, which contains segmented postcode charac-
ters (alphabetics and numerics)
The algorithms tested are from Essex, Brunel, Manchester and Kent Universities.
Results from other algorithms were taken from published sources.
Five of these algorithms were tested on the Royal Mail/Essex alphanumeric charac-
ters, with results from 63.4% to 98.7% character recognition rate. The highest score
went to a 2-level classifier developed by a group at Kent University and is listed as
“Binary Weighted Scheme / Least Mean Squared with Complex 2-dimensional Moments”.
The actual details of this approach are not given though. The problem was simplified
by aggregating the characters ‘I’ (eye) & ‘1’ (one) and ‘O’ (owe) & ‘0’ (zero) into the
same classes, which seems to be a fairly common approach to simple OCR as there is
often no way of discriminating between these characters without contextual infor-
mation.
The performance of seven different algorithms applied to the USPS/CEDAR data-
base was reported. These results were obtained directly from CEDAR’s tests, and no
actual evaluation was done. The best score was obtained by a GSC algorithm (Gradi-
ent, Structural and Concavity). This used image processing techniques, such as Sobel
operators to determine image gradients, and an eight point star operator to deter-
mine concavity in the image. The result was formed into a 448 bit feature vector, and
a K Nearest Neighbour algorithm was used to classify the result. This algorithm
achieved a character recognition rate of 97%. However on a Sun Sparcstation 2 it
only processed 2 characters per second which is plainly well below the performance
required for a real time application to mail sorting. This is probably due to the exces-
sive processing which has to be performed on the character image (convolution of
filters tends to be a time consuming process). They also investigated the advantages
of combining the results from a number of different algorithms using methods such
as majority vote and neural networks. The highest results obtained this way were
using a neural network to combine the outputs and achieved a character recognition
rate of 97.5%, an increase of only 0.5% over the best single method. It is unlikely that
this represents a good trade-off, as the increase in computation (caused by evaluating
more than one classification and then combining the results) would almost certainly
outweigh the slight increase in performance.
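The K Nearest Neighbour step over binary feature vectors can be sketched as below; the tiny 8-bit vectors and Hamming distance stand in for the 448-bit GSC vectors, and the real feature extraction (Sobel gradients, star operator) is not reproduced here:

```python
# Sketch of K-nearest-neighbour classification over binary feature vectors
# of the kind the GSC method produces. Toy 8-bit vectors, illustration only.
from collections import Counter

def hamming(a, b):
    """Number of bit positions where the two vectors differ."""
    return sum(x != y for x, y in zip(a, b))

def knn_classify(query, training, k=3):
    """training: list of (bit_vector, label); majority vote of k nearest."""
    nearest = sorted(training, key=lambda tl: hamming(query, tl[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

train = [
    ((0, 0, 0, 1, 1, 0, 0, 0), "0"),
    ((0, 0, 1, 1, 1, 0, 0, 0), "0"),
    ((1, 1, 1, 0, 0, 1, 1, 1), "1"),
    ((1, 1, 0, 0, 0, 1, 1, 1), "1"),
]
label = knn_classify((0, 0, 0, 1, 1, 1, 0, 0), train)
```

The cost profile is visible even in this sketch: every query is compared against the whole training set, which is consistent with the slow 2-characters-per-second figure reported.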
The results for the CENPARMI database are reported for eight different algorithms
tested at Concordia University, Montreal. The best results are achieved by a statisti-
cal method which uses a network of 231 2×2 classifiers and obtained a 98.3% recogni-
tion rate. Several methods were then investigated to combine the outputs of the
various algorithms and the best combination reported was a voting system incorpo-
rating 3 of the algorithms. However this achieved a 98.5% recognition rate, which is
an improvement of only 0.2%. This would appear insignificant and the only possible
advantage which could be gained would be a reduction in the substitution errors (in
favour of rejection), but these figures are not given and so it is not possible to draw
any conclusions from this result. However it is also reported that another group
incorporated four of the classifiers mentioned above using Bayesian Formalism to
combine the outputs and achieved a recognition rate of 99.2%, which is getting close
to the maximum that could be expected.
The results for numeral recognition tend to be slightly better than for alphabetics or
alphanumerics because fewer classes are involved. This may be of use however as
some character positions within the postcode are restricted to numerals only. There
is no reason why different classifiers should not be used to increase the recognition
success in this way (see section 6.2, “Format of the Postcode” on page 84).
Interestingly, one of the algorithms was trained on the training set from one database
and tested on the test set of another. This produced a recognition rate of only 50.3%
for handwritten numerals. Sadly the algorithm was not tested on the corresponding
matching training/test sets so no exact conclusions can be drawn from this but it is
stated in [18] that the expected performance of this algorithm would be around 80%.
This means a dramatic reduction in performance when tested against alternative test
sets. This may simply be a ‘feature’ of this particular algorithm but it may mean that
the databases commonly used to compare algorithms are not particularly universal
— that is to say they exhibit some characteristic which is peculiar to that database. A
classifier trained on one set would then be good at classifying images with the same
characteristic but may be very poor at classifying images without that characteristic.
The characteristic, whatever it may be, could be caused by the authors of the test
images, the actual scanning process or any constraints which were placed on the
type of images which were to be included in the data set, etc. Further investigation
would be required to ascertain whether this behaviour was common to many classi-
fiers and many databases. If proved correct, it would place greater emphasis on col-
lecting a representative training set for the application being designed and could
mean that totally universal classifiers would be impossible without a totally univer-
sal training set which would be very difficult, if not impossible, to collect.
Two systems are presented in [25 Martin et al.] and [30 Martin, Rashid] which again
are similar to the approaches described in [5] and [29] (see page 17). In the first, a
neural network character classifier is scanned over the text line and its output is
thresholded so that it provides positive outputs when its input area is centred over a
character. This is similar to the voting systems described earlier but the thresholding
is performed at the recognition stage rather than after the whole image has been seen
by the recogniser. In this way there is less information to store and the network pro-
duces an on-line output of characters as it sees them. The drawback is that there can
be no feedback to the recogniser from all other characters in the image; if the final
output is found not to be a valid word from the lexicon, the recognition must be
repeated. The main point of interest in this system is outlined in the second sections
of [25] and [30], which indicate how the scanning process can be improved. A second neural network is trained to recognise how wide certain characters are. This
allows it to move the input window of the classifier network by large amounts rather
than having to scan it slowly across the word image. It is reported to be similar to the
way the human eye behaves when reading text. Although humans tend to recognise
words rather than characters, the eye jumps from one word to the next rather than
scanning the text line smoothly. It was shown in [25] and [30] that this can improve
the efficiency of the recognition process by 4 or 5 times. It does not however influence the actual recognition rate as this is purely dependent on the classifier. It was
tested on several different length numbers and achieved a word recognition rate of
94.23% for 2-digit numbers with a 1% error rate, and 63.26% for 6-digit numbers
again with a 1% error rate (the remaining percentages are made up of rejects). No
lexicon was used to assist in the recognition of the numbers and this is the main rea-
son for the fairly low recognition rate once the number of digits starts to increase.
These figures represent a character classification rate of somewhere between 92%
and 97%. However when this is applied to 6 digits the overall classification rate falls
quickly as can be seen.
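The character-rate range quoted here can be checked by inverting the word-rate formula: if each of n characters is classified independently with probability p, the whole word is correct with probability p to the power n, so p can be recovered from the reported word rates:

```python
# Recover the implied per-character rate p from a word rate w over n
# characters: w = p**n, so p = w**(1/n).

per_char_2 = 0.9423 ** (1 / 2)   # from the 2-digit word recognition rate
per_char_6 = 0.6326 ** (1 / 6)   # from the 6-digit word recognition rate
```

These come out at roughly 97.1% and 92.7% respectively, matching the "between 92% and 97%" estimate in the text, and show why the word rate collapses as the digit count grows.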
The method for moving the input window by recognising the width of the input
character is essentially the improvement which was suggested on page 15 to the sys-
tem presented in [4]. However instead of moving the input window it would be
used to suggest the correct segmentation point for touching characters.
2.3.2 Cursive Writing
Segmentation is more of an issue with cursive writing recognition as there is a
greater tendency for adjacent characters to join together. Several groups have pro-
posed solutions to this problem. A system is presented in [8 Yanikoglu, Sandon] for
recognising cursive handwriting which uses a minimum cost cut method for segmentation, and a neural network for character recognition. The segmentation step
consists of the following stages:
• First segment the page into text lines by computing the horizontal histo-
gram of the page and identifying the baselines of the text
• Then find the reference lines of each text line. There are four lines associ-
ated with each text line which are shown pictorially in Fig. 2.5
Fig. 2.5 - The four reference lines used by the system described in [8 Yanikoglu, Sandon]
[Figure 2.5 labels: the Ascender, Body, Baseline and Descender reference lines, illustrated on the example word ‘pool’.]
• Finally segment the line into characters by looking for minima in the ver-
tical pixel histogram of the text line
The first stage includes a check for a skewed page by computing the horizontal histogram at -10° and +10° from the horizontal and using this information to shear the
image of the page accordingly. It is obvious that this could be improved by using
more computations at intermediate angles and this would represent a trade-off
between the time taken to process the page and the reliability of the results. It is pre-
sumed that these angles were found to give the most satisfactory results, however no
comparisons were presented to show the trade-off mentioned.
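The skew check can be sketched as follows: shear the pixel coordinates at each candidate angle and keep the angle giving the sharpest (highest variance) horizontal histogram. The bitmap and the three-angle set are illustrative, matching the ±10° scheme described above:

```python
# Sketch of skew detection via horizontal histograms: a deskewed text line
# concentrates its ink into few histogram rows, maximising the variance.
import math

def row_histogram(points, shear_deg, height=20):
    """points: set of ink pixels (x, y); shear maps y -> y + x*tan(angle)."""
    hist = [0] * height
    t = math.tan(math.radians(shear_deg))
    for x, y in points:
        row = round(y + x * t)
        if 0 <= row < height:
            hist[row] += 1
    return hist

def variance(hist):
    mean = sum(hist) / len(hist)
    return sum((h - mean) ** 2 for h in hist) / len(hist)

def best_shear(points, angles=(-10, 0, 10)):
    return max(angles, key=lambda a: variance(row_histogram(points, a)))

# A "text line" drawn with a 10-degree downward skew: shearing by +10
# degrees re-aligns it into a single histogram row.
line = {(x, round(10 - x * math.tan(math.radians(10)))) for x in range(15)}
angle = best_shear(line)
```

Trying more intermediate angles is exactly the time-versus-reliability trade-off the text mentions: each extra angle costs one more pass over the ink pixels.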
The results from the second stage are used during the actual character recognition process to give a rough indication of the geometry of the characters — for example, the
width of a character is roughly equal to the body height for a given text line. The segmentation of characters is performed by looking for the least cost cut point within the
line. The cost of a cut is determined by, among other things, the number of pixels it
must go through, the height above the baseline at which the cut is made and the dis-
tance from the last cut (relative to the approximate width of the character). Four cuts
are made at 0°, 10°, 20° and 30° to the vertical. It seems odd that these cuts look for characters which are from vertical to slanted right and none look for left slanted characters. While it is more common for handwriting to slant to the right, it would have
been a simple matter to include a cut which could handle left slanting characters
because, as it stands, the system will not recognise these at all.
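The cut-cost idea might be sketched as below; the weights and candidate values are invented, since the paper lists the cost factors but not their exact combination:

```python
# Sketch of cut-cost scoring: each candidate cut is penalised for the ink
# pixels it crosses, its height above the baseline, and how far it lands
# from the expected character width. Weights are illustrative guesses.

def cut_cost(ink_pixels_crossed, height_above_baseline, dist_from_last_cut,
             expected_char_width, w_ink=2.0, w_height=0.5, w_dist=1.0):
    # Cuts through ink are expensive; cuts far from the expected character
    # width (in either direction) are also penalised.
    width_error = abs(dist_from_last_cut - expected_char_width)
    return (w_ink * ink_pixels_crossed
            + w_height * height_above_baseline
            + w_dist * width_error / expected_char_width)

candidates = [
    # (ink pixels crossed, height above baseline, distance from last cut)
    (5, 2, 12),   # cuts through a stroke
    (0, 1, 11),   # clean cut near the expected width
    (0, 3, 25),   # clean but far too wide
]
costs = [cut_cost(i, h, d, expected_char_width=12) for i, h, d in candidates]
best = costs.index(min(costs))
```

Under these assumed weights the clean, correctly spaced cut wins by a wide margin, which is the behaviour the cost factors are meant to encourage.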
After segmentation the character image is slant and size normalised and then pre-
sented to the neural network for classification. The network has an input size of
20×50 pixels (greyscale) and 26 outputs, one for each character. It is not clear whether
upper case characters are ignored or not recognised — certainly all the examples
shown in the report are only of lower case characters.
A Hidden Markov Model is used to maximise the probability of the recognised word
given some analysis of character pairs found in a large dictionary of English words.
For their tests they assumed independence of probability of words, since their lexi-
con used for word validation was small and using actual written English word prob-
abilities reduced performance in the small scale test. However, given these caveats,
the system achieved overall 61% word recognition with 71% of words being in the
top three suggested by the system. It is important to note however that this figure
was arrived at by averaging the results from three tests. In two of the tests the words
were written by authors who had also written training sets and one of them was on
hand printed rather than cursive characters. The results for these tests were 93%
word recognition for hand printed characters and 70% word recognition for cursive
characters. In the third test the author had not written a training set and the system
only managed 28% word recognition. This result indicates the huge variation
between character images from different writers and would indicate that a much
more versatile system would be required for recognising handwritten addresses on
mail pieces.
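The character-pair rescoring described at the start of this passage can be illustrated with a simple bigram score standing in for the full Hidden Markov Model; all probabilities below are invented toy values, not statistics from any real dictionary:

```python
# Sketch of bigram rescoring: each candidate word combines the recogniser's
# per-position character probabilities with character-pair statistics, and
# the highest scoring candidate wins. Log probabilities avoid underflow.
import math

def word_score(word, char_probs, bigram):
    """char_probs[i] maps characters to P(char | image at position i)."""
    score = sum(math.log(char_probs[i].get(c, 1e-9))
                for i, c in enumerate(word))
    score += sum(math.log(bigram.get(pair, 1e-9))
                 for pair in zip(word, word[1:]))
    return score

# The recogniser slightly prefers the nonsense string "qork" per-character,
# but the bigram statistics favour the real word "york".
char_probs = [{"q": 0.5, "y": 0.4}, {"o": 0.9}, {"r": 0.9}, {"k": 0.9}]
bigram = {("y", "o"): 0.05, ("o", "r"): 0.04, ("r", "k"): 0.03,
          ("q", "o"): 0.0001}
best = max(["qork", "york"], key=lambda w: word_score(w, char_probs, bigram))
```

This is the mechanism by which language statistics overrule a marginally stronger but linguistically implausible character hypothesis.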
The system mentioned earlier in [26 Burges et al.] was applied to cursive words using
lexicons of 10, 100 and 1000 words. There are a number of differences between the
two systems; these are outlined below but, otherwise, it can be taken to be the same
as the one described on page 21.
The complex segmentation scheme is replaced by a neural network, and segmenta-
tion is now implicit in the character recognition. The neural network now has 104
outputs which are assigned to 4 sets of 26 outputs. The 26 outputs in each set repre-
sent the characters of the alphabet and the 4 sets represent different widths of charac-
ters. The neural network is thus able to recognise varying width characters within
the text line. This is obviously important if it is being used as the segmentation algo-
rithm, as the only way to ensure they are all of equal width would be to size normal-
ise them and this can only be done on segmented characters! The input of the
network is scanned over the text line and its outputs are recorded in an array which,
again, is very similar to the approach described on page 12.
The word recognition rates for images from the three lexicons were 86%, 68% and
47% respectively. The authors of [26] noted however that if the constraints were
relaxed so that the system produced its top few choices then the results went up to
93% (top 2), 82% (top 3) and 74% (top 6). This does show that as the lexicon size
increases the performance drops rapidly and it becomes necessary to accept a ‘top-n’
type output from the recogniser in order to obtain the correct word with any reliabil-
ity.
In [6 Seni, Cohen], a complex system is presented which attempts to segment totally
unconstrained cursive handwriting. It is one of the few systems which attempts the
segmentation stage of the process without attempting OCR and as noted at the end
of this section, this may be the reason for its apparently poor performance given the
encouraging results which are mentioned below. The system is applied to recognis-
ing portions of addresses written on mail pieces which had been previously seg-
mented into text lines. The system used connected components from the text line to
identify inter-word gaps. A connected component may be noise, a character frag-
ment, a whole character or a number of touching characters. Eight different algo-
rithms for detecting these inter-word gaps were tested and the results for each
tabulated. The results are percentages of the total number of text lines (1453) which
are correctly segmented into words and range from 78.5% to 87.4%. The top scoring
algorithm is a hybrid of 3 of the others and so it is not surprising that it scores better
than all the others. This algorithm also identified 97.1% of all inter-character gaps
within the words. They noted that while punctuation within a text line tends to
reduce the inter-word gaps, it also gives a good indication of the existence of an
inter-word gap as punctuation does not generally occur within a word. Three punc-
tuation detection algorithms are evaluated with the highest scoring being a K-near-
est-neighbour method. The other methods were 2 variations on discriminant
functions and all three methods scored between 97.38% and 97.84%. The percentage
indicates correct classification of each connected component as either a comma, a
period or neither of these. The only appreciable difference between the three meth-
ods is reported as being the distribution of false-positives, false-negatives and sensi-
tivity of the algorithms. No results for these measures were given however.
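The inter-word gap detection can be sketched from connected-component bounding boxes; thresholding on a multiple of the median gap is an illustrative choice, not one of the eight algorithms tested in [6]:

```python
# Sketch of inter-word gap detection: connected components are reduced to
# horizontal intervals, and a gap between successive bounding boxes is
# called inter-word when it clearly exceeds the typical (median) gap.
from statistics import median

def word_gaps(components, factor=2.0):
    """components: (x_left, x_right) bounding boxes; returns gap indices."""
    boxes = sorted(components)
    gaps = [boxes[i + 1][0] - boxes[i][1] for i in range(len(boxes) - 1)]
    threshold = factor * median(gaps)
    return [i for i, g in enumerate(gaps) if g > threshold]

# Three "words" of components: gaps of 2 within words, 10 between words.
comps = [(0, 4), (6, 9), (19, 24), (26, 30), (40, 44), (46, 50)]
breaks = word_gaps(comps)
```

A component may of course be noise, a fragment, or several touching characters, which is why real gap classifiers need far more than bounding-box spacing; this sketch only shows the basic interval logic.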
The punctuation detection algorithms and the previous eight gaps classification
algorithms were combined and tested on a final test set which had not been used
during the training of any of the algorithms. It is apparent now how difficult the
problem actually is, with the best combination managing to correctly parse only 39%
of these unseen text lines. However this result is akin to attempting to classify a
whole word using only a character recogniser. Even if the recogniser is 90% reliable
on each character, the chance of correctly recognising an eight character word is
only 43%. It is obvious then that some method of integrating the word segmentation
algorithms reported here into a text recognition system which could recognise these
words and provide feedback to support or contradict the segmentation would be
necessary to increase the overall reliability.
2.4 Summary
Several other approaches to OCR were reviewed including systems which use many
small neural networks to recognise individual characters [3 Kertesz, Kertesz], neural
networks applied to the cartesian and polar coordinates of the character images
[2 Lee, Choi], a single large backpropagation network [21 LeCun et al.] and a system
which maps each character image into a feature space known as ‘holograph’ and
compares the features using simple matrix functions [20 Gorsky]. All these systems
performed reasonably well but none achieved a significant advantage over the oth-
ers in terms of recognition accuracy or efficiency. A summary of the results presented
throughout this section is shown in Fig. 2.6. If one approach had to be selected for
the application described in the remainder of this report then purely on recognition
rate the one in [4] would seem to be the best choice for machine printed text. For
handwritten text the choice is not so clear as a decision would have to be made as to
whether to simply attempt to classify the postcode, in which case a segmented char-
acter recogniser such as the ones reported in [8] would probably be sufficient. How-
ever there is little detail on these methods, and in particular the restrictions placed
on the methods, which would enable a judgement to be made on the reported
results. If an attempt was to be made on other address information such as the post-
town, a word recogniser would be needed and the performance of any presented
here on untrained writing would seem to be nowhere near the performance which
would be necessary to produce useful information. Another option for the character
classifier was mentioned on page 20. The system presented in [29] would benefit
from the hardware architecture which would also be used for the partial matching
which is described in section 4, and may represent a huge performance increase over
any of the other methods.
It is clear then that there is a great deal of interest in this area of machine vision,
probably due to the commercial interest that would be shown by organisations such
as The Post Office, banks and building societies in any system capable of reading
addresses, cheque details, etc. However as was shown with machine printed text,
which has almost reached its maximum attainable level of performance, a system
based solely on character recognition will not be sufficient. With addresses there is
plenty of other information on the mail piece apart from the postcode which would
aid the successful recognition of the address. With cheques the amount is written in
both words and figures. Bringing these two fields together to verify the recognition
process would be the only way of improving the accuracy significantly.
The next section looks at some of the research currently being carried out in the area
of OCR verification. The aim of a verification system is primarily to constrain the
output of the OCR system so that it conforms to some specification of a valid output
for the particular application. There are a number of ways this can be achieved.
Some systems follow sequentially from the OCR system and some are an integral
part of the OCR system. However the purpose of all of them is to improve the per-
formance of the OCR system to a level which it would be difficult if not impossible to
attain using purely OCR.
The following table summarises the results reported earlier in this section.

Source                 | Target† | Results†† | Notes
[22 Wang, Jean]        | MC      | 99.81% C  | Does not differentiate certain characters, such as ‘I’ and ‘1’
[4 Wang, Jean]         | MW      | 99.71% C  | Uses character recogniser from above
[7 Liang et al.]       | MW      | 99.65% C  | Uses character contextual classes to split touching characters and merge broken character components
[5 Mulgaonkar et al.]  | MW      | 80% W     | Scans the text line with the recogniser and maintains a voting array to avoid an explicit segmentation step
[8 Yanikoglu, Sandon]  | HC      | 95% C     | Results reported for Martin et al. — only uppercase characters
                       | HD      | 98% C     | Results reported for Guyon et al.
                       | HW      | 95% W     | Results reported for Burr — possibly only for a single writer though
                       | CW      | 48% W     | Results reported for Srihari et al. — only for a single writer
[26 Burges et al.]     | HD      | 82.7% W   | Applied to ZIP codes, hence the result is a word recognition rate as there was a lexicon to check against
                       | CW      | 86% W     | Only a 10 word lexicon
                       |         | 68% W     | 100 word lexicon
                       |         | 47% W     | 1000 word lexicon
[23 Drucker et al.]    | HD      | 89.8% C   | This system uses a performance improving scheme which could in theory be used on any neural network. It improves the performance of a single network (using 3 of the same type) from 83.6% & 97.8% (digits), 86.8% (upper case characters) and 61.2% (lower case characters) respectively
                       |         | 99% C     |
                       | HC      | 94.6% C   |
                       |         | 70.9% C   |
[18 Leedham]           | HC      | 98.7% C   | Reported for a group at Kent University — aggregates ‘I’ and ‘1’ etc. as before
                       |         | 97% C     | Reported for a group at CEDAR, the authors of the character database used
                       | HD      | 98.3% C   | Reported for a group at Concordia University, Montreal
[25 Martin et al.]     | HD      | 94.23% W  | These results are for 2 and 6 digit numbers respectively. However the main interest is in the novel method used to improve the speed of recognition 4-5 times
[30 Martin, Rashid]    |         | 63.26% W  |
[8 Yanikoglu, Sandon]  | CW      | 61% W     | The result is an average of several tests — 93% segmented character recognition, 70% word recognition and 28% word recognition for an author who had not written a training set

Fig. 2.6 - Summary of results for the OCR systems reviewed

† The codes in this column are M for machine printed, H for handprinted and C for cursive, followed by C for characters, D for digits and W for words.
†† The codes in this column after the percentages are C for character recognition rate and W for word recognition rate.
3. Verification
The dictionary definition of verification is the process of establishing the truth or
validity of something.
3.1 Introduction
With respect to OCR systems, verification is the process of establishing the truth of
the output of the OCR module. Usually this output will be in the form of a character
corresponding to a section of the image of the input document and a confidence
value with which the OCR system classified the image as being that character. In
order to determine the truth of that classification some other information from the
input image will usually be required. The main strategies for verification of the out-
put from a classifier are twofold. Firstly, the output words can be checked against
some database of valid words. It is likely in most cases where automated recognition
is being employed that there are some constraints on the words which will appear in
the document being analysed. This is certainly the case for postcodes, which are the
main target being considered in this report. Postcodes follow a syntax which
describes how many characters they may be composed of and what characters may
appear in certain locations. There is also a database of all valid addresses currently
being used within the UK. While this is updated from time to time with new post-
codes as they are needed, the overall syntax is not changed. This allows two checks
to be made on the validity of the postcode either during or after the recognition proc-
ess.
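These two checks might be sketched as follows. The regular expression below is only a simplified approximation of the postcode grammar (the real specification further restricts which letters may appear in certain positions), and the function names and the representation of the database are purely illustrative.

```python
import re

# A simplified approximation of the UK postcode syntax: an outward code
# (one or two letters, one or two digits, optionally a trailing letter)
# followed by an inward code (a digit and two letters).  The real
# specification is stricter about which letters may appear where.
POSTCODE_SYNTAX = re.compile(r"^[A-Z]{1,2}[0-9][0-9A-Z]? ?[0-9][A-Z]{2}$")

def syntax_valid(code: str) -> bool:
    """First check: does the string conform to the postcode grammar?"""
    return POSTCODE_SYNTAX.match(code.upper()) is not None

def fully_valid(code: str, database: set) -> bool:
    """Second check: is it also a postcode currently in use?
    (The database is assumed to store codes without spaces.)"""
    return syntax_valid(code) and code.upper().replace(" ", "") in database
```

The syntax check can be applied during recognition, while the database lookup is naturally applied afterwards.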
Secondly, output words from the classifier can be checked against other information
on the document image. For example, with analysis of cheques, the main target for
automated recognition is the amount. The account numbers are already machine
readable and the name field can be so varied that there would seem to be little point
in attempting OCR at present. However the amount field appears in two places on
the cheque and in two different forms. These two fields can then be recognised sepa-
rately and the results compared to allow verification of the recognition process. The
same is true of addresses on mail pieces, however the verification process will be
more complex.
3.2 Review
The majority of this chapter will review verification systems developed by research-
ers in this field. There are many diverse techniques employed to accomplish the task
and these are described along with the results for that particular implementation.
The overall goal of all of the systems presented here is that of improving the per-
formance of an OCR system.
It was shown in [9 Kabir, Downton] that recognition of the outward postcode (the
section which dictates to which town the mail piece will be sent for further sorting)
was improved by 120% over simple OCR when combined with
syntax and context information available from the rest of the address image. The
character recogniser was based on a template matching scheme, using a similarity
function which effectively computed the cosine of the angle between the character
vector and the template vector. The overall performance of this system was a charac-
ter classification rate of 62%. Two approaches to improving this performance are
investigated, which are the dictionary lookup method to represent the valid post-
codes and a Markov model to represent valid syntax within the postcode. They are
in fact combined into a hybrid system which employs features of both algorithms. At
each stage, the most likely prefix according to the Markov model is searched for in
the dictionary and any invalid possibilities are discarded. This means that at each
stage the most likely valid postcode is the one being considered. In fact, they only
considered the outward section of the postcode in their tests but the recognition rate
went from 25% using simple OCR to 55% using the Markov model and dictionary
search algorithms. They also mention the inadequacy of their sample database of
addresses and propose using random addresses from the database of all possible
addresses within the UK; they state:
“In particular, random selection of postcodes from the CD-ROM database will, in the limit, enable us to estimate the a priori probability of occurrence of each character class in each postcode character position, and thus include this information in the character recognition model.”
However it is clear that this statement is incorrect, as the distribution of postcodes in
the live mail stream is almost certainly not uniform across the UK. What would
actually need to be done would be to collect random samples of addresses from each
sorting office and this would, in the limit, give the true a priori probability for each
character in each position within the postcode.
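A minimal sketch of this kind of hybrid search is given below, assuming the dictionary is held as a set of every valid prefix (including the complete codes). [9] gives no implementation detail, so all names, data structures and the start symbol used here are illustrative only.

```python
import heapq
import math

def best_valid_code(char_scores, transition, prefixes, full_codes):
    """Best-first search for the most likely dictionary-valid code.

    char_scores : list of {char: confidence} dicts, one per position
    transition  : {(prev, cur): p} Markov transition probabilities
    prefixes    : set of every prefix of a valid code (codes included)
    full_codes  : set of complete valid codes
    """
    # Priority queue of (negative log-likelihood, prefix); best first.
    heap = [(0.0, "")]
    n = len(char_scores)
    while heap:
        cost, prefix = heapq.heappop(heap)
        if len(prefix) == n:
            if prefix in full_codes:
                return prefix       # most likely valid code
            continue
        for ch, conf in char_scores[len(prefix)].items():
            cand = prefix + ch
            if cand not in prefixes:      # dictionary pruning
                continue
            prev = prefix[-1] if prefix else "^"   # "^" = start symbol
            p = conf * transition.get((prev, ch), 1e-6)
            heapq.heappush(heap, (cost - math.log(p), cand))
    return None
```

At every step the frontier contains only dictionary-valid prefixes, so the most likely valid code is always the one being extended, mirroring the behaviour described above.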
In [13 Leedham, Jones] and [19 Hendrawan, Leedham], such a system for the verifi-
cation of Australian and British addresses respectively is considered. In [13], a data-
base of addresses was collected which consisted of 200 mythical but realistic
addresses, scanned at 200dpi in accordance with Australian sorting machines. The
addresses were also written in ‘Post Office Preferred’ format, which means the post-
code (a 4-digit numeric code) is written to the lower right of the address. The system
comprises a character locator and classifier for actually recognising characters from
the postcode, a feature analyser for extracting other features from the address, such
as posttown information, and a database for matching the word information with
the postcode.
The postcode location is performed by assuming the postcode lies within a small
window on the address image. The size and position of this window are adjusted to
include all the postcode characters but exclude other parts of the image; however, no
details are given as to how this is achieved. The vertical pixel histogram of this
window is then used to segment the characters. Checks are made to prevent individual
characters being split into several segments, but again there are no details as to how
this is performed. The character’s height was then checked to ensure it was reasonable to
assume it was a character and not a dash or other mark on the image. The OCR was
performed using a characteristic loci method which achieved a 42% postcode recog-
nition rate which equates to an 80.5% character recognition rate. It has to be said,
after the results of the last chapter, that this is a fairly poor performance even for
handwritten character recognition, when it is only numeric characters that have to be
considered. Using the best results for handprinted digits from the previous section, a
character recognition rate of up to 99% could be expected and this would instantly
yield an improvement from 42% to 96% word recognition rate. Even using a more
conservative estimate of 95% character recognition rate, this yields a postcode recog-
nition rate of over 81%, which is nearly twice the current value. The authors also
state that the 52% error rate is unacceptable and needs to be reduced to around 0.1%
for a real application. However they seem to be ignoring the possibility of rejects
from the automated system, which would almost certainly have to be used to
achieve an error rate as low as 0.1%. It is not clear whether rejects are possible from
their OCR system, but if not (and hence the error rate and success rate summing to
100%), this would give another reason to change the OCR method used.
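The rough figures above follow directly from the assumption that each of the four digits of the postcode is recognised independently:

```python
def postcode_rate(char_rate, length=4):
    # Assuming each character is recognised independently, the whole
    # postcode is correct only if every character is, so the postcode
    # rate is the character rate raised to the power of its length.
    return char_rate ** length

# The estimates quoted above, for a 4-digit Australian postcode:
postcode_rate(0.99)   # about 0.96
postcode_rate(0.95)   # about 0.81
```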
The other address image information is obtained via a number of stages. The image
is first smeared horizontally and vertically in an attempt to make all the characters
within a word connected (as they are not going to be classified by a character recog-
niser, it doesn’t matter if they are slightly distorted by this process and having a
word as a connected component simplifies the word segmentation step). The image
is segmented into lines by considering the horizontal histogram of the address
image. The tops and bottoms of characters tend to show up as peaks in the histo-
gram. They note also that a disconnected top stroke from a letter ‘T’ can sometimes
cause the line to be split into several smaller apparent lines if only the histogram is
considered. They overcome this by then making a second pass over the image and
merging lines which appear to be from the same actual text line. They do not com-
ment on how this is achieved, however it could be done using the height of the seg-
mented line — the segment containing only a horizontal stroke from a ‘T’ would be
considerably smaller in height than one which contained the body of the ‘T’. Once
this has been done, an 8-connected region growing and labelling process is applied
to the image to attempt to label each word in the address. The growing is prevented
from moving far outside the line segmentation points found in the previous step to
avoid joining text lines together. Components whose bounding boxes overlap hori-
zontally are then joined, as the authors state that this is almost always due to a word
being split into two or more pieces at an earlier stage (smearing or region labelling).
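The line segmentation stage might be sketched as follows. The merge rule for short fragments is our guess at the detail left unreported in [13], and min_height is an assumed parameter.

```python
def segment_lines(image, min_height=5):
    """Split a binary address image (a list of pixel rows, 1 = ink)
    into horizontal text lines.

    A sketch of the approach described above: take the horizontal
    projection histogram, cut wherever a row contains no ink, then
    merge any segment too short to be a full line (such as the
    detached top stroke of a 'T') into its neighbour.
    """
    profile = [sum(row) for row in image]   # ink pixels per row
    lines, start = [], None
    for y, count in enumerate(profile):
        if count > 0 and start is None:
            start = y
        elif count == 0 and start is not None:
            lines.append((start, y))
            start = None
    if start is not None:
        lines.append((start, len(profile)))
    # Second pass: merge fragments too small to be a full text line.
    merged = []
    for seg in lines:
        if merged and (seg[1] - seg[0] < min_height
                       or merged[-1][1] - merged[-1][0] < min_height):
            merged[-1] = (merged[-1][0], seg[1])
        else:
            merged.append(seg)
    return merged
```

The second pass implements the height heuristic suggested above: a segment containing only the horizontal stroke of a ‘T’ is far shorter than one containing the body of the line, so it is merged into the adjacent segment.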
OCR was reportedly attempted on the first and last character of each word. Some
details are given about the method used to locate the first character — the character
is segmented using either a search for a white gap (as the initial character tends to be
upper case and therefore disconnected from the rest of the word), component label-
ling and finally “character splitting techniques” which are otherwise unspecified. If all
these methods fail the character is simply split at a certain width relative to the
height of the current line (to give it a fixed aspect ratio). No mention is made of the
techniques used for the last character of the word, although it is possible that they
are the same as the above.
The word is then tested for upper/mixed/undetermined case characters. Again, no
details of the method are given other than the shape of the horizontal histogram of
the horizontally smeared image is used, and the technique correctly identifies the
case of 78% of the words in the address image database. The number of characters in
the word is estimated for upper and mixed case words by counting the number of
times a stroke crossed the horizontal centre of the word. The value chosen was the
rounded value of half of the number of line crossings found and was correct to
within 1 character for over 90% of the words. For mixed case words, the ascender/
descender sequence was obtained by scanning horizontally along the top and bot-
tom of the word. The system correctly identified 55% of the ascender/descender
sequences and the rest “with minor errors”. For upper case words, lobe features such
as the closed lobes in ‘A’, ‘B’ etc., the upward open lobes in ‘V’, ‘W’ etc., and the
downward open lobes in ‘N’, ‘M’ etc. are extracted. Once again, no details as to how
the extraction is performed, what is done with characters such as ‘W’ and ‘M’ which
have both upward and downward lobes or the performance of the extraction system
are given.
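The letter counting rule can be sketched as below, assuming a binary word image represented as a list of pixel rows. [13] does not give the exact counting rule, so the transition count used here is an assumption.

```python
def estimate_letter_count(word_image):
    """Estimate the number of letters in an upper case word.

    A sketch of the rule described above: scan along the horizontal
    centre row of the binary word image (1 = ink), count the strokes
    crossing that row, and take the letter count as the rounded value
    of half the number of crossings.
    """
    centre = word_image[len(word_image) // 2]
    # A crossing is a background-to-ink transition along the row.
    crossings = sum(1 for prev, cur in zip([0] + centre[:-1], centre)
                    if prev == 0 and cur == 1)
    return int(round(crossings / 2))
```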
The results of the verification section were not reported. However in [19], the same
system is applied to British postcodes. The opening paragraphs state that “Out of the
120 address images analysed 71 (51.2%) were segmented without any errors.” However 71
out of 120 is 59%! This is clearly a typographical error, however, the rest of the results
presented here are presumably accurate. Once the words were segmented, the initial
character algorithms correctly isolated the first character of 71.4% of the 329 words
attempted. There is more detail in [19] about the actual methods but it is not clear if
they are exactly the same as in [13] above. The initial search for white space is used if
the character is clearly separated from the rest of the word. If this fails the compo-
nent labelling scheme is used for characters which are physically separated but
whose bounding box overlaps that of the rest of the word. Finally the vertical histo-
gram profile of the initial part of the word is used to split the character, which is now
assumed to be touching the rest of the word. The profiles of all 26 characters are used
but it is not clear how one is selected, as the character recognition is not performed
until after the character has been segmented. The aspect ratio method used as a last
resort gives the character a width of 0.9 times its height. OCR is attempted on the
character using a method developed by one of the author’s colleagues, Robert Tregidigo.
No results were reported for this stage, though. This is unfortunate, as the first
stage of a verification process would probably be a comparison of the initial part of
the postcode with the initial letter of the posttown. The performance of the OCR on
the initial letter would have a huge influence on the reliability of this type of verifica-
tion.
The word case classification was performed and achieved an average classification rate
of 71.8%. Of these, 36 words were correctly classified as mixed case and the
ascender/descender sequence within the word was estimated as before. The algo-
rithm correctly analysed 55.6% of these words. The number of letters in the words
was also estimated as before and it is reported that 89.3% of the words were esti-
mated from 0 to +2 characters of their actual length. However closer analysis of the
graphs shows that only 27.7% were correct and 46.4% were in the +1 band (i.e.
reported 1 more character than there actually was). This would indicate that some
adjustment of the algorithm is required. The distribution looks fairly normal from
the graphs, and it would make sense for the mode of the results to be correct. In fact,
by simply subtracting 1 from the estimated lengths, the results immediately become
89.3% correct to within 1 character which would appear to be better than 0 to +2
characters.
The results of the verification process are reported in [12 Hendrawan, Leedham]. It
was assumed that each line of the address contains only one field such as posttown
or county but a comma detection algorithm was used to check if more than one field
was on the same line, separated by a comma. Similarly, hyphens were detected and
removed so that hyphenated place names such as ‘Clacton-on-sea’ always appeared
the same whether they were written with the hyphens or not. From the OCR of the
postcode (for which the results were not presented in [19]), a search is made which
lists in order of likelihood the possible addresses from the database of all valid
addresses. These candidate addresses are then matched against the features
extracted from the address image. For each corresponding line in the address (image
and candidate), the following features are used:
• Number of words on the line
• First character of each word
• Number of letters in each word
and for mixed case addresses as indicated by the case discrimination algorithm,
• Number of ascenders/descenders in each word
• Ascender/descender sequence for each word
Each of these factors was given a weight which was determined heuristically. The
results for each line were summed and normalised into the range 0 to 1, and the val-
ues were then weighted according to which line they were on and the number of
lines in the address. These weights were also determined heuristically. Finally the
weighted value for each line was summed and this represented the verification value
of the address. A threshold was then used to decide at what point the address was
considered verified correctly.
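A sketch of this weighting scheme is given below. The actual weights and threshold in [12] were determined heuristically and are not reported, so the values and names used here are placeholders only.

```python
def verification_value(line_scores, line_weights, threshold=0.5):
    """Combine per-line match scores into a single verification value.

    line_scores  : aggregated feature-match score for each address line
    line_weights : heuristic weight for each line (assumed to sum to 1)
    Returns the weighted value and whether it clears the threshold.
    """
    top = max(line_scores) or 1.0
    normalised = [s / top for s in line_scores]   # into the range 0..1
    value = sum(w * s for w, s in zip(line_weights, normalised))
    return value, value >= threshold
```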
The results of simple OCR on the postcode show that 40.89% of addresses are cor-
rectly identified but once again they imply that this means an error rate of 59.11%.
This must mean that the OCR system is unable to reject an address if its recognition
of the postcode is below a confidence threshold. Clearly this would not improve the
success rate, but rejection is preferred over error in this application, as manual sort-
ing is preferred over delivery to the wrong address. The results for the verification
system indicate that at a certain threshold, 38.18% of the addresses are verified cor-
rectly with an error rate of 4.89%. It is assumed then that the verification stage is
intended to identify which postcodes were incorrectly classified by the OCR system
by rejecting some of the addresses (presumably for manual sorting). So now the cor-
rect address classification rate is 38.18% with an error rate of 4.89% and presumably
a reject rate of 56.93% which has not actually improved the recognition performance
of the system at all. One of the main reasons cited for errors is the fact that the
address image is compared line by line with the address candidate from the data-
base. This means that if extra information is included in the address or one line is
missed out (which does not necessarily mean the address is incomplete), the com-
parison gets out of step resulting in a low verification value. This is because the sys-
tem is implying an ordering in the address that does not really exist — the address
consists of all the information together and is not a hierarchy. This means an order
independent comparison with the database may be beneficial (see section 8.3, “Strat-
egies for Verification” on page 101).
A very interesting report of a system is given in [39 Lucas], and although not strictly
a verification system, it is described here as it could form the basis for the verification
system described above. In fact, it is more accurate to say that it performs validation
rather than verification. The distinction is quite subtle, but validation is really lim-
ited, in this application, to ensuring that the postcodes which are returned by the
OCR module are real postcodes — i.e. they exist in the database of all valid post-
codes. Verification would involve ensuring that the postcode matched the other
address information, such as posttown, on the mail piece.
This system tackles almost exactly the same problem as will be discussed for the
remainder of this report; that is the validation of OCRed characters against a data-
base of valid words (in this case, postcodes). The problem is approached in a very
different way however. The system described uses a syntactic neural network (SNN)
Improving Automated Postal Address Recognition 3. Verification
46
to parse the grammar of postcodes, to identify the valid ones from the list of charac-
ter confidences from the OCR system. It also employs a lazy multiplication scheme
to allow efficient best first retrieval of valid codes. The real problem is to find the best
path through a set of lists of real numbers, returned by the character classifier.
Fig. 3.1 shows a possible output from the OCR module.

Fig. 3.1 - A diagram of the first stage of the SNN method for retrieving valid postcodes

  Input characters: [images of the handwritten postcode]

  Position:  1        2        3        4        5        6        7
             S 0.95   O 0.94   I 0.91   2 0.88   6 0.92   B 0.91   L 0.80
             5 0.60   0 0.89   1 0.85   Z 0.50   C 0.67   8 0.82   I 0.65
             B 0.10   Q 0.70   L 0.20   7 0.10   S 0.30   R 0.44   1 0.51
             H 0.04   D 0.15            S 0.08            E 0.32
             E 0.03

  The output of the classifier is a sorted list of the confidence values of the top few characters in each position. Since there are only a few characters in each list, the penalty for sorting the lists is negligible.

As each list is sorted, it can be seen that the most likely¹ postcode can be found by
simply taking the top line, which in this case is ‘SOI26BL’. Clearly this is not a valid
postcode and a check must be made for this. Disregarding this problem for the
moment, the next best output from the classifier is not trivial to find. What is really
needed is the full cartesian product of all the character confidences, sorted into order.
This would then give every possible output of the classifier in confidence order but is
clearly very costly to produce. The cartesian product of the above example has 8640
(5 × 4 × 3 × 4 × 3 × 4 × 3) possible postcodes involving 51840 real multiplications, and
this list would also have to be sorted after it had been generated.

1. This definition of ‘most likely’ assumes independent probability among the characters, which is not necessarily the case. However for now it will be assumed to be true.
The system described in [39] offers a way of improving the efficiency of generating
this list by implementing a lazy evaluation of the cartesian product. The overall
structure of the system is a binary tree. Each node is a processing element which
accepts two inputs from lower level nodes and passes the combination of these as its
output to the next higher level node. The lowest level nodes take their inputs directly
from the OCR system in the form of an ordered list of characters and confidence val-
ues (see Fig. 3.1). The highest level node outputs valid postcodes. At each node, the
following kind of matrix is formed from the two input sources:
Fig. 3.2 - Diagram of the matrix formed at each node of the SNN

                        1st Character
                 S      5      B      H      E
            O  [##]   [::]
  2nd       0  [::]
  Character Q
            D

  The matrix represents the inputs to the node which accepts characters 1 & 2 from the left hand end of the postcode shown in Fig. 3.1. The top left square (marked ##) is guaranteed to be the best output at first. After this, only the two lighter coloured squares (marked ::) need be considered, as one of these is guaranteed to be the next best.

The dark grey square in Fig. 3.2 is bound to be the best output at first because the lists
are ordered and the product of the top of each list will always be higher than any
other product within the list. To produce the second output, only the 2 lighter coloured
squares (which represent, from top to bottom, the sequences ‘5O’ and ‘S0’) need be
considered as they are bound to be higher than any other product from the 2 lists.
Again this is a property of the fact that the lists are ordered.

So, 4 nodes would be required to accept a 7-character postcode, with the last node
taking its inputs from the last character and a null list (which effectively just returns
the list of characters in order). Above these nodes 2 more nodes are required, taking
inputs from nodes 1 & 2 and 3 & 4 respectively. These nodes implicitly form
sequences of four characters, as each input represents a character pair from the low-
est level of the tree. The final node takes its two inputs from the middle level of the
tree and outputs complete postcodes along with their confidences. The overall struc-
ture is shown in Fig. 3.3.

Fig. 3.3 - Block diagram of the way information is processed in [39 Lucas]

  [The figure shows a binary tree: ordered lists of characters and confidences (plus a null list for the final character) feed the lowest level of processing elements (nodes), pairs of outputs are combined at each level above, and the top level node outputs valid postcodes.]

At each node, the lazy evaluation of the cartesian product of the input pair is performed
as shown in Fig. 3.2, and a check is made to ensure that, at each level, only a
valid postcode is being formulated. This means that the system has to be trained on
valid postcodes before it can be used (hence the term neural network). During training,
the lowest nodes for example are trained on valid character pairs for their
respective position within the postcode. The middle nodes are trained on valid
prefixes and suffixes, but can assume that the inputs (character pairs) are already valid
so, in fact, they only have to know which pairs can go with which to make valid 4-
character sequences. The top level node takes valid 4 and 3 character sequences and
knows how these can be combined to produce valid postcodes. As the numerical
product is passed up at each node, it is a simple matter to produce the overall post-
code confidence along with the postcode itself, from the top level node.
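The lazy evaluation performed at each node can be sketched with a priority queue, as below. This is not the implementation in [39], which gives only the idea, but it has the key property illustrated in Fig. 3.2: only the neighbours of an emitted cell ever join the frontier, so the full cartesian product is never materialised.

```python
import heapq

def lazy_product(left, right, valid_pair):
    """Yield pairings of two sorted (string, confidence) lists in
    best-first order, as one node of the tree described above.

    valid_pair discards sequences which cannot form part of a valid
    postcode (in a trained node this check would come from training).
    """
    heap = [(-left[0][1] * right[0][1], 0, 0)]
    seen = {(0, 0)}
    while heap:
        neg, i, j = heapq.heappop(heap)
        s = left[i][0] + right[j][0]
        if valid_pair(s):
            yield s, -neg
        # Only the two neighbours of the emitted cell can be next best.
        for ni, nj in ((i + 1, j), (i, j + 1)):
            if ni < len(left) and nj < len(right) and (ni, nj) not in seen:
                seen.add((ni, nj))
                heapq.heappush(heap, (-left[ni][1] * right[nj][1], ni, nj))
```

Fed with positions 1 and 2 of Fig. 3.1, the first output is ‘SO’, and the second is whichever of ‘5O’ and ‘S0’ has the higher product, exactly as in Fig. 3.2.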
The results presented in [39] show that the system performs well when compared
with a trie implementation of the same search. The SNN system also displays the
unusual property of performing a faster search as more data is added to the system.
However, taken in context, this is inevitable as the system spends most of its time
discarding invalid postcodes. Hence as more valid postcodes are added to the data-
base it takes less time before a valid one is found from the possibilities suggested by
the character classifier. In the limit, if all combinations of characters could form valid
postcodes, the system would produce the next best postcode on each cycle of com-
putation.
There is however one major disadvantage to this approach. As stated before, the
probabilities of the characters are assumed to be independent so that the probability
of the postcode can be made equal to the product of the confidences of the individual
characters. This is obviously not the case as there are some character pairs which
would be much more likely than others. More importantly, the probability of each
character is influenced by all the other characters in the postcode. By splitting the
postcode into pairs of characters in this way and then combining them into pairs of
pairs and so on until the postcode is finally output, this dependence cannot easily be
modelled. Another assumption mentioned in the report is that all postcodes are
assumed equiprobable. However it is also stated that,
“... a priori postcode probabilities can easily be modelled in theory (while retaining best-first retrieval characteristics) by having a top level node in the SNN taking one set of inputs from the data at hand, and the other set from the pre-compiled set of possible postcodes, which are also retrieved most likely first.”
It is not at all clear what this statement means. However if it is taken to mean that the
top level node takes one input from the output of the existing system and the other
from the list of postcode probabilities, then it is unclear how this helps. The true post-
code probability (ignoring for the moment the character probability interdepend-
ence) is the product of the postcode confidence from the existing SNN system and
the probability of that postcode occurring. For example, a very common postcode
would have a high probability in the pre-compiled list and should be accepted
before a very uncommon one, even if the confidence of the uncommon one was
slightly higher according to the SNN system. So it is not clear how this final list can
be output in best first order without retrieving all the valid postcodes from the SNN
system. If the pre-compiled list of postcode probabilities is complete, as it must be to
ensure that every address can be handled by the system, this means that all 1.6 mil-
lion postcodes would have to be retrieved from the SNN system and multiplied with
their corresponding probability of occurring, and the results of this sorted to give the
actual most probable postcode. In fact, using the system described in [39], it would
be possible to improve on this by retrieving postcodes from the SNN system until
the one which matches the top entry in the pre-compiled list is returned, and this has
to be the most probable postcode. However there is no way of telling how many
recalls will have to be made from the SNN system before this postcode is returned. It
is clear then that the phrase “in theory” in the above quote is essential, as the practical
implications would seem to outweigh the undoubtedly efficient system when real
probability values are required.
It is possible to imagine another tree akin to the one described above which was
trained to recognise posttowns, by combining character pairs until they form a valid
posttown name. These two trees can then be thought of as producing ordered lists of
postcodes and ordered lists of posttowns which could be combined in the same way
to eventually produce addresses. In this way, a hierarchy of trees could be used to
perform verification rather than simply validation. However this is a fairly sweeping
statement about how the system could be extended, and would require a great deal
of further work to ensure the practicality of such a system.
3.3 Summary
We have seen some of the attempts which have been made towards the verification of
automated address recognition. It is clear that this is a quite complex problem, especially
given the requirement for an on-line solution. It would appear that although there is
undoubtedly a great deal of value in a system which could improve the automated
address recognition rate, there is no immediately obvious solution. The complexity
of the task is due to the fact that the address/postcode combination was not really
designed for this kind of automation. With the infrastructure so firmly embedded in
the market place it would be difficult to change the style of addressing to any great
degree, and so this adds to the value of a system which can be reliably incorporated
into the existing processes.
There are currently around 1.6 million postcodes in use in the UK. In the case of
restricting the recognised postcodes to these valid ones, this represents a not insub-
stantial amount of data which will have to be searched to validate the postcode.
However, as shown above, there are efficient searching methods which can be
employed. The problem is compounded, though, by the fact that the character recogniser will undoubtedly fail to recognise one or more characters from the postcode
some of the time. This will then require a search of the database to determine what
possible valid postcodes the image could represent. In effect, this will produce a list
of possible characters which could occur at the position which currently cannot be
recognised. This information will have to be fed back to the character recogniser in
order for it to make a second attempt at classification, now that there is more infor-
mation available in the form of a restricted set of possibilities.
When this idea is extended to cover other features from the address, the database
which was 1.6 million records of a few characters each becomes considerably larger,
as information such as posttown name, building name or company name, P.O. Box
numbers, etc. are added to it. It is clear then that one of the most crucial parts of this
system will be a very efficient method of extracting valid addresses from the data-
base given the character recogniser’s first attempt at classifying characters from the
address image. The next chapter looks at some of the methods which can be
employed to solve this problem. A system using Correlation Matrix Memories was
found to give the best performance, and a detailed discussion of this type of system
is presented.
4. Partial Matching
If a database is queried by supplying a key which uniquely identifies the record
being sought, only one record should be returned by the database system. If how-
ever the key is not fully specified, it is possible that more than one record will match
the partial key. This is then a partial match query.
4.1 Introduction
From the previous discussion it is clear that the verification process will require a
partial match to be made on the database. This is because the OCR system is bound
at some point to fail to recognise a character and this means, for example, that the
postcode will have one or more characters missing. This forms a postcode template
which may match a number of possible postcodes in the database. The problem is
very similar to occluded object recognition, where an object must be identified even
if some of its features are unknown. The features of a postcode are the characters
which make up that postcode and when some of those features are missing, one
postcode may ‘look’ very much like several others (see section 6, “Analysis of PAF”
on page 83). What is required then is a system which can provide some sort of list of
all the postcodes which match the template given by the OCR system.
In this section, a very brief review of some of the more common methods for partial
match searching of a database is given. One of the best methods, a technique using
Correlation Matrix Memories, is then looked at in greater detail. One major problem
with using this technique is identified, which will then lead into the next section.
4.2 Review
There are many conventional systems which could in principle perform the task of
taking a partial postcode and returning all valid postcodes which fit the template.
For example, SQL databases can be queried in this way. There has been much inter-
est in partial match search algorithms in the past ([34 Rivest], [35 Burkhard],
[36 Kim, Pramanik]) such as hashing tables and tree/trie structures. A review of the
current methods is given in [37 Kennedy]. The review starts with conventional tech-
niques such as the Inverted File Technique, where an index is held for every attribute
which may form part of the partial match. The search is then performed by retriev-
ing all records from the file using the index for each attribute specified in the query
and then performing an intersection operation on the results. This was shown to be a
very inefficient method, as the more fully specified a query is, the more data is
retrieved from the database prior to the intersection operation. Next, Hash Coding
Techniques were investigated. These included Standard Hashing, Address Genera-
tion Hashing and Hashing with Descriptors. Standard Hashing uses similar techniques to the Inverted File method, but uses a hash function instead of the index for
each attribute. The same problem of excessive data retrieval for well specified que-
ries is noted. Address Generation Hashing uses the attributes to generate parts of the
address within the database of the corresponding records. It means that no intersec-
tion operation has to be performed as with Standard Hashing, but many false
matches may be returned. This is because the number of bits of the address allocated
to a particular attribute will most likely be less than the number of possible values
the attribute could take (the corollary to this is that different values of the same
attribute will hash to the same address, hence the false matches). The next technique,
Hashing with Descriptors, overcomes this problem. In this method, the attribute val-
ues for each record are hashed and the results concatenated together. This forms a
descriptor for that record. The whole file is split into a number of ‘pages’, and all the
descriptors from each record within a page are bitwise ORed to form a descriptor for
that page. There is no mention however of how the file is split, how many pages
there should be or whether the records within each page have something in common. It was stated though that this method significantly reduced the number of false pages accessed compared to the previous method.
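As a rough sketch of the Hashing with Descriptors scheme just described (the record layout, hash function and page size are all invented for illustration, and a verification step is added to discard false page hits):

```python
def field_hash(value, bits=8):
    # Hash one attribute value to a small bit field with one bit set.
    return 1 << (hash(value) % bits)

def descriptor(record, bits=8):
    # Concatenate the hashed attribute fields into a record descriptor.
    d = 0
    for i, field in enumerate(record):
        d |= field_hash(field, bits) << (bits * i)
    return d

# Invented (postcode, posttown) records, split into pages of two.
records = [('SW1A 1AA', 'LONDON'), ('YO1 5DD', 'YORK'),
           ('M1 1AA', 'MANCHESTER'), ('YO10 5DD', 'YORK')]
pages = [records[i:i + 2] for i in range(0, len(records), 2)]

# A page descriptor is the bitwise OR of its records' descriptors.
page_desc = []
for page in pages:
    d = 0
    for rec in page:
        d |= descriptor(rec)
    page_desc.append(d)

def query_by_town(town):
    # Partial match: only the posttown attribute is specified.
    q = field_hash(town) << 8
    hits = []
    for page, d in zip(pages, page_desc):
        if d & q == q:                                 # page may match...
            hits += [r for r in page if r[1] == town]  # ...so verify it
    return hits

print(query_by_town('YORK'))   # -> [('YO1 5DD', 'YORK'), ('YO10 5DD', 'YORK')]
```

Only pages whose descriptor contains the query code are opened; the verification pass makes the final answer exact despite hash collisions.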
The report then goes on to consider superimposed-coding techniques. These are sim-
ilar to the Hashing with Descriptors method outlined above but instead of concate-
nating the hashed attributes, they are superimposed or bitwise ORed on top of one
another. These superimposed codes are then used to form the index to the file but
only one index is needed for all the attributes. A query is processed by forming the
superimposed code of the attributes in the query and then searching the index for all
index codes which contain the query code. These records are then retrieved. A more
advanced method is two-level superimposed coding, which simply treats the index
codes as records, which are then hashed and superimposed to form a hierarchical
structure (albeit only a two level one). The query is made by forming codes for both
indexes. The higher level one is searched first (as it is smaller) and this results in a
subset of the second level index. This subset is then searched using the second code
from the query to get the actual records. This method was shown in the worst case to
be no worse than one-level superimposed coding, but usually to be much more effi-
cient, as the size of the index which has to be searched is usually much smaller.
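One-level superimposed coding can likewise be sketched in a few lines (the hash and code width are invented; a verification step discards the false matches that ORing can introduce):

```python
def signature(attrs, bits=32):
    # Superimpose (bitwise OR) one hashed bit per attribute into a
    # single code, rather than concatenating the hashed fields.
    s = 0
    for a in attrs:
        s |= 1 << (hash(a) % bits)
    return s

table = [('SW1A 1AA', 'LONDON'), ('YO1 5DD', 'YORK'), ('HU5 2EH', 'HULL')]
index = [(signature(rec), rec) for rec in table]

def search(*attrs):
    # A record qualifies if its index code contains every query bit;
    # qualifying records are then verified against the actual values.
    q = signature(attrs)
    return [rec for code, rec in index
            if code & q == q and all(a in rec for a in attrs)]

print(search('YORK'))   # -> [('YO1 5DD', 'YORK')]
```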
A number of variations on the superimposed coding techniques were also reviewed
which involved various trade-offs between storage, disk accesses and performance.
However none were shown to have any significant advantage over the others. They
all simply represent a kind of tuning which could be performed for a particular
application.
It was shown however that a system based on Correlation Matrix Memories (CMMs)
can outperform other conventional partial match algorithms for certain classes of
problem. These problems are ones of the form:
“Return all records which match n from m attributes where n ≤ m”
This means that while say 4 attributes can be provided to the search algorithm, it can
be asked to return all records which contain any 2 of those attributes. While the
inherent ordering of the characters within a postcode does not require such a general matching algorithm, as it can be accomplished simply by a wildcard type search,
there are some extensions to this idea which would require such a searching capabil-
ity (see section 8.3, “Strategies for Verification” on page 101). It should be stated
however that a system capable of performing these types of extended queries is per-
fectly capable of making the standard partial match queries simply by ensuring that
the values of n and m above are equal. That way the only records returned are the
ones which contain all the attributes passed to the search algorithm.
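The semantics of the "n from m" query can be pinned down with a naive linear scan (purely illustrative; the CMM system achieves the same effect far more efficiently):

```python
def n_of_m(records, attrs, n):
    # Return every record containing at least n of the m supplied attributes.
    return [r for r in records if sum(a in r for a in attrs) >= n]

recs = [{'YO1 5DD', 'YORK', 'MAIN STREET'},
        {'YO1 5DD', 'LEEDS'},
        {'HU5 2EH', 'HULL'}]

# Any 2 of 4 attributes:
print(n_of_m(recs, ['YO1 5DD', 'YORK', 'HULL', 'MAIN STREET'], 2))
# With n = m the query reduces to a standard partial match:
print(n_of_m(recs, ['HU5 2EH', 'HULL'], 2))
```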
In fact, the system proposed in [37] deals with much more abstract entities than char-
acters from a postcode, and in particular can be made sensitive or insensitive to the
ordering of the attributes passed to it. This is ideal for a reasoning system, where the
order in which the information is presented is irrelevant (and is one of the main
strengths of the system). However the ordering of the characters is an essential part
of the postcode. We do not want to recognise some of the characters and then
retrieve a list of all postcodes which contain those characters in any order; in fact, we
require a list of postcodes which have the recognised characters in specific positions.
To accomplish this, while still retaining the speed advantages of the new system, we
simply omit the binding and superimposing stages detailed in [37] which are what
allows the system to produce results for arbitrary orderings of attributes. It may be
that at a later stage, the order independence capabilities will be exploited. It may be
possible to use the system to bring together other information from the address
image. There is no ordering inherent in the postcode, post town and street name, yet
they are all attributes of some record within the database. A search may need to be
made using any or all of these, depending on what can be recognised from the
address image, and this is discussed in section 8.3.
The remainder of this chapter will describe the operation of CMMs in greater detail
and, in particular, show how they can be used within this application.
4.3 Correlation Matrix Memories
These were proposed in [31 Willshaw et al.] in 1969 and were based on the image
recall properties of holograms, although the original idea came from Steinbuch matrices. The basic structure of the associative network is shown below.
Fig. 4.1 - Diagram of a simple correlation matrix memory. (The horizontal lines represent the inputs to the matrix, the vertical lines the outputs; each dot represents a bit set to 1 in the binary matrix.)
The memories can store an association between two binary patterns or numbers. Pat-
terns to be associated are presented to the matrix as binary strings. One pattern is
applied to the horizontal lines and the other to the vertical lines. Where two 1’s in the
patterns coincide, that position in the matrix is set to a 1. During recall, the input pat-
terns are applied to the horizontal lines and the rows of the matrix which have 1’s
applied to them are summed vertically to form the output. This output is then
thresholded according to a certain algorithm and the original pattern is thus recalled.
The operation is shown in Fig. 4.2.
Fig. 4.2 - A CMM during recall. (The input pattern is applied horizontally and the output pattern appears vertically; with five input bits set to one, the threshold value is five.)
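The training and recall steps just described can be sketched as follows (a toy model with invented sizes and patterns, not the implementation discussed here):

```python
def train(matrix, in_pat, out_pat):
    # Where a 1 in the input pattern coincides with a 1 in the output
    # pattern, set that cell of the binary matrix to 1.
    for i, a in enumerate(in_pat):
        for j, b in enumerate(out_pat):
            if a and b:
                matrix[i][j] = 1

def recall(matrix, in_pat):
    # Sum the rows selected by the 1s of the input pattern, then
    # threshold at the number of input bits set to 1.
    sums = [sum(matrix[i][j] for i, a in enumerate(in_pat) if a)
            for j in range(len(matrix[0]))]
    t = sum(in_pat)
    return [1 if s >= t else 0 for s in sums]

# Store one association in an 8 x 8 matrix and recall it.
M = [[0] * 8 for _ in range(8)]
x = [1, 0, 1, 0, 0, 1, 0, 0]
y = [0, 1, 0, 0, 1, 0, 0, 1]
train(M, x, y)
print(recall(M, x))   # -> [0, 1, 0, 0, 1, 0, 0, 1]
```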
There are many issues connected with the performance of such a system, for exam-
ple:
• Number of associations which can be stored
• Size of array to represent number of input and output patterns required
• Number of bits set in input and output patterns
• Thresholding algorithms
• Coding of actual inputs to input patterns, and similarly for outputs to
output patterns
These issues will be dealt with in turn along with a method for using CMMs for
recalling more than one pattern at a time. This is essential for performing partial
matching on the database.
4.3.1 Storage Capacity of a CMM
The basic equation for the error-free storage capacity of a CMM is shown below (from [32 Nadal, Toulouse]).

N = \frac{(\log 2)^3 \, w^2}{(\log w)^2}        Eqn. 4.1
The value N is the maximum number of associations which can be stored by a CMM
whose input and output sizes are both w, while guaranteeing that there will be no
errors in the output pattern. The equation is based on having log₂ w bits set to 1 in
both the input and output patterns, and with a random distribution of patterns. This
serves as a rule-of-thumb when estimating the size of CMM required for a certain
application. However it only caters for square matrices, and some more work is
required to find the general solution for w × h matrices where w and h are the width
and height of the matrix respectively. It may also be advantageous to have some
other number of bits set to 1 rather than the function of w given above. Again, the equation for the number of associations which can be stored would need some alteration to reflect that.
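As a worked instance of this rule of thumb (the matrix size is purely illustrative):

```python
from math import log

def cmm_capacity(w):
    # Error-free capacity of a square w x w CMM with log2(w) bits set
    # per pattern (Eqn. 4.1, after [32 Nadal, Toulouse]).
    return (log(2) ** 3) * w ** 2 / log(w) ** 2

# For a 1024 x 1024 matrix, roughly 7,268 associations can be stored
# before errors must appear in recalled patterns.
print(round(cmm_capacity(1024)))   # -> 7268
```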
4.3.2 Coding of Input and Output Patterns
To recall an output pattern after an input pattern has been summed through the
matrix, a suitable threshold must be applied to the raw totals. It is clear that the cor-
rect value to threshold at is the number of bits set to 1 in the input pattern. However
this should actually be the number of bits set to 1 in the original input pattern. If a
noisy pattern is being applied to the CMM, there may be more or less 1’s in the pat-
tern than there were during the training phase (when the associations were stored).
The matrix will still recall the correct output pattern, but the threshold value must be
set correctly. If the input is noisy, there is no easy way to determine how many 1’s
there should have been. A solut ion to th is problem was proposed in
[33 Austin, Stonham], where every output pattern used has the same number of bits
set – the position of the bits is the only thing that changes from one pattern to
another. This is known as k-bit coding.
Using their scheme, the maximum number of patterns P which can be generated is given by the following equation.

P = \binom{w}{k}        Eqn. 4.2

where \binom{w}{k} is the combinatorial (choose) operator, w is the width of the code and k is the number of bits set to 1 in that code.
The thresholding problem is now simply to select the k highest responding outputs,
thus producing a k-bit binary pattern. This property is key to the operation of the
ADAM associative memory system described in [33]. The maximum number of pat-
terns which can be generated in this way is considerably more than the number of
associations which can be stored in the matrix, given that the matrix has input and
output sizes which are of the same order of magnitude. Therefore it does not put any
restrictions on the capacity of the network.
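Both points can be demonstrated briefly (the raw output sums reuse the example values shown in Fig. 4.2):

```python
from math import comb

# Eqn. 4.2: number of distinct codes of width w with k bits set.
print(comb(10, 3))   # -> 120

def k_max_threshold(sums, k):
    # Keep the k highest-responding outputs (a real system must also
    # break ties at the cutoff, which this sketch ignores).
    cutoff = sorted(sums, reverse=True)[k - 1]
    return [1 if s >= cutoff else 0 for s in sums]

print(k_max_threshold([5, 2, 5, 5, 2, 5, 5, 2], 5))   # -> [1, 0, 1, 1, 0, 1, 1, 0]
```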
The interesting and essential property of CMMs in this application comes into play
when an incomplete input pattern is applied. By carefully controlling the threshold-
ing process, the correct output pattern can still be recalled. However if the incom-
plete input pattern now matches not one but two or more original input patterns,
then the output patterns associated with each of these will be returned bitwise ORed
on top of one another. This is the way that a CMM can be made to perform partial
matching. The actual process involved here stems from the fact that the CMM is a
type of neural network which forms associations between patterns. When being
tested, the input pattern is matched against all the patterns trained into the network,
and the output pattern associated with the stored pattern which most closely
matches the input pattern is generated. When an incomplete input pattern is applied
to the network, it may be that this partial pattern is equally similar to 2 or more
stored patterns. In this case, the network has no way to distinguish them. Its
response is to assume that the input pattern could be any one of the similar patterns,
and to output all the output patterns which match. However as it only has one out-
put array, the outputs are superimposed on top of one another and they then have to
be separated into the individual output patterns. By carefully controlling the way the
actual data is mapped to the different input and output patterns, it is possible to
define a method for performing partial match type queries. For example, let us
assume that the input data are words from some dictionary. All the words are three
characters long. A simple mapping would be to give each character a field in the
input pattern, say 1 in 26 bits representing the character of the alphabet. These 26 bit
words are then simply concatenated to form the actual input to the CMM. Some
examples are shown in Fig. 4.3.
Fig. 4.3 - Example input pattern coding for a CMM to use partial matching
Note that the entries in the table are 1-dimensional binary strings — they are only
split across lines to prevent the table from being too wide. The final input pattern can
be seen then to be a 78 (26 × 3) bit pattern. Now once these patterns have been associ-
ated with suitable output patterns in the CMM (suitable meaning that there is a one-
to-one mapping between the output patterns and the original words), it can be used
to perform partial matching such as ‘C?T’, meaning all words which have ‘C’ at the
Word   Character 1 bit pattern      Character 2 bit pattern      Character 3 bit pattern
CAT    00100000000000000000000000   10000000000000000000000000   00000000000000000010000000
COT    00100000000000000000000000   00000000000000100000000000   00000000000000000010000000
DOG    00010000000000000000000000   00000000000000100000000000   00000010000000000000000000

CMM input pattern (the three fields concatenated):
CAT    001000000000000000000000001000000000000000000000000000000000000000000010000000
COT    001000000000000000000000000000000000000010000000000000000000000000000010000000
DOG    000100000000000000000000000000000000000010000000000000000010000000000000000000
beginning, ‘T’ at the end and any other letter in the middle position. This is achieved
by taking the patterns for ‘C’ and ‘T’ and putting a string of 26 zeros between them.
This gives a 78 bit input pattern, but the total number of 1’s on the input is now 2
instead of 3. This means that an adjustment to the thresholding must be made in
order to compensate. There is nothing particularly subtle in this — the total number
of expected 1’s is known (3, as this is the number of characters in the words this
CMM will recognise) and the number of characters missing is known. When the out-
put of the CMM is thresholded accordingly, the result will be the patterns for ‘CAT’
and ‘COT’ superimposed on top of one another (see Fig. 4.4). As the output patterns
are directly mappable onto the original words, it is a simple matter to search the out-
put of the CMM for known output patterns, and this will give us back the list of
words.
Fig. 4.4 - Result of recalling ‘C?T’ from a CMM
Seven different methods which can be used to separate the output into its constitu-
ent codes are discussed in [37]. It is shown that overall, the method with best per-
formance is Middle Bit Indexing [38 Filer]. However this assumes various
parameters for a specific application ([45 Austin et al.]), and may need to be re-eval-
uated for a different application. There would be no benefit in undertaking this work
at the current time.
There is one issue which will be of importance to any application using CMM tech-
niques which has not yet been considered — that of ghosting. Ghosting is an unde-
sirable feature of the way the outputs are generated by the CMM. Because they are
Words                  Example Output Codes
CAT                    00000100000000010000
COT                    00010000000010000000
Superimposed Result    00010100000010010000†

† This code contains the codes for ‘CAT’ and ‘COT’, ORed together.
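The whole ‘C?T’ example can be run end to end (the 20-bit output codes with 2 bits set are invented for this sketch and differ from those shown in Fig. 4.4):

```python
def encode(word):
    # Three concatenated 26-bit fields, one bit per character; a '?'
    # leaves its field empty.
    pat = []
    for ch in word:
        field = [0] * 26
        if ch != '?':
            field[ord(ch) - ord('A')] = 1
        pat.extend(field)
    return pat

out_codes = {'CAT': {5, 15}, 'COT': {3, 12}, 'DOG': {0, 9}}  # bit positions

# Train: set matrix cells where input and output 1s coincide.
M = [[0] * 20 for _ in range(78)]
for word, bits in out_codes.items():
    for i, a in enumerate(encode(word)):
        if a:
            for j in bits:
                M[i][j] = 1

# Recall 'C?T': only 2 of the 3 expected input bits are present, so
# the threshold is lowered from 3 to 2.
q = encode('C?T')
sums = [sum(M[i][j] for i, a in enumerate(q) if a) for j in range(20)]
result = {j for j, s in enumerate(sums) if s >= 2}

# Separate the superimposed output by testing each known output code.
matches = sorted(w for w, bits in out_codes.items() if bits <= result)
print(matches)   # -> ['CAT', 'COT']
```

The superimposed result contains exactly the codes for ‘CAT’ and ‘COT’, and scanning the known output codes recovers the word list.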
superimposed, it is not always easy to tell what patterns are there. This effect can be
shown by a simple example using a familiar 7-segment display used in digital
watches, etc. Suppose that this is the output of the CMM, and the actual patterns are
‘1’ and ‘2’. Superimposing these is shown below:
Fig. 4.5 - Superimposition of 2 7-segment number patterns
It is now not clear whether the final pattern contained just a ‘1’ and a ‘2’ as this same
pattern would be made by ‘2’ and ‘3’ or ‘2’ and ‘7’ (in their 7-segment form). So the
four numbers which could be extracted from the pattern are ‘1’, ‘2’, ‘3’ and ‘7’. If only
two of these were actually used to make the pattern in the first place, the remaining
two are called ghosts. The problem arises because the numbers which are used to
make the final pattern are hidden within the internal workings of the CMM and
there is no way to find out directly which numbers were used and which weren’t.
The next chapter gives a detailed discussion of how and why ghosting occurs, and
presents a method for reducing its undesirable effects.
5. Ghosting
Ghosting is the term given to a property of images which are superimposed. The
images may be binary numbers or line drawings. The effect is the same, and it is that
once two or more images are superimposed, it is not always possible to know for certain which of a number of possible original images were used to make the superimposition.
5.1 Introduction
It is a simple property of binary numbers that given an arbitrary set of fixed width
numbers, it is possible in principle for some combination of codes ORed together to
include codes from the set which were not among those ORed together. An example
is shown in Fig. 5.1.
Code 1: 0010010
Code 2: 1000100
Code 3: 0000110
1 OR 2: 1010110

Fig. 5.1 - Example of superimposed codes generating a ghost. (Codes 1 and 2 ORed together produce a result which includes code 3.)
Given this result, if the ORed code were to be separated up into its constituent codes
using the techniques referred to in section 4, it would be impossible to tell whether
or not code 3 was included in the ORing operation. If it was not, as in this case,
it is known as a ‘Ghosted Code’ or ‘Ghost’. This is simply a binary representation of the
example given at the end of the previous section — if the parts of the 7-segment display were arranged in a row they could be thought of as forming binary numbers where the lit segments represent a ‘1’ and the unlit segments represent a ‘0’.
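The effect is easy to reproduce with the codes of Fig. 5.1 written as integers (a minimal sketch):

```python
def ghosts(code_set, chosen):
    # A ghost: a code from the set, not among those chosen, that is
    # nevertheless included in the OR of the chosen codes.
    union = 0
    for c in chosen:
        union |= c
    return [c for c in code_set if c not in chosen and c & union == c]

code1, code2, code3 = 0b0010010, 0b1000100, 0b0000110
print(ghosts([code1, code2, code3], [code1, code2]))   # -> [6], i.e. code 3
```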
In this chapter, the causes of ghosting are explored and a definition for a set of binary
numbers which exhibit a desirable property when used with CMMs is given. This
property is that a particular set will exhibit a known worst case ghosting no matter
what combination of codes are superimposed. The size of the set thus determines the
number of records which can be stored by the CMM system, and it is therefore desir-
able to maximise the size of the set while retaining the maximum ghosting property.
While the definition of the set is rigorous, there is no obvious efficient method for
generating such sets. In the absence of this, a brute force algorithm was used to gen-
erate some small sets for experimentation. The term small applies both to the width
of the binary numbers and to the number of elements within the set. Because of the
algorithm used, the time taken to generate the sets increases factorially with the
width of the code and so it was only practical to produce small sets. The experiments
were designed to investigate how the sets might behave as the width of the numbers
increases. Without a sound mathematical basis for the generation of the sets, 4 differ-
ent models are tested to give rough estimates for the expected size of set given a par-
ticular width. The deficiencies of these models are pointed out, but it is shown that
their predictions are quite encouraging.
5.2 Problems Caused by Ghosting
When a partial match retrieval is performed on a database stored using CMMs,
ghosting may occur as described above. The reason this is a problem is clear when it
is taken up one level of abstraction. The codes returned by the CMM represent
records from the database. Once the ORed code is separated into its constituent
codes, the records can be uniquely identified. If one of the codes is a ghost, this
means the CMM has returned a record which should not be in the set of records
which correspond to the query performed. In effect, it has returned all the correct
records as well as some extra, incorrect records. This can be likened to the false
matches which are obtained when using some of the database systems mentioned on
page 54 in section 4. These incorrect records will have to be identified and removed
before the system can return the actual result of the query. It is obvious therefore that
the effect of ghosting should be reduced as much as possible in a system designed to
perform partial matching, as it represents extra work which must be carried out by
the system and will thus reduce performance.
In fact, it can be shown simply that the effect of ghosting can be prevented by ensur-
ing that the output codes conform to some specification. However this drastically
reduces the number of codes which can be generated. For partial matching, where
any number of codes may be returned by the system, it can be shown empirically
that the number of codes, N, which would be usable to guarantee no ghosts is given by Eqn. 5.1.

N = w - k + 1        Eqn. 5.1

In this equation, w is the width of the codes and k is the number of bits set to 1.
This means that the number of usable codes is linear with the code width which in
turn means that in any practical partial match retrieval system using CMMs, some
level of ghosting will have to be tolerated. It is obvious that prior knowledge of the
extent to which ghosted codes will be generated is very important in assessing the
performance of the system. The next section therefore deals with sets of codes which
display a property whereby the maximum number of ghosts that will be generated is
fixed for that set.
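Eqn. 5.1 can be checked by exhaustive search at toy sizes (a sketch that is only feasible for very small w and k; with no limit on how many codes may be superimposed, a set is ghost-free exactly when no code is covered by the OR of all the others):

```python
from itertools import combinations

def ghost_free(codes):
    # No code may be included in the OR of all the other codes.
    for c in codes:
        union = 0
        for d in codes:
            if d != c:
                union |= d
        if c & union == c:
            return False
    return True

def largest_ghost_free(w, k):
    # Exhaustive search over subsets of all weight-k codes of width w.
    pool = [sum(1 << p for p in pos) for pos in combinations(range(w), k)]
    best = 1
    for r in range(2, len(pool) + 1):
        if any(ghost_free(sub) for sub in combinations(pool, r)):
            best = r
        else:
            break
    return best

# Both agree with N = w - k + 1 (Eqn. 5.1).
print(largest_ghost_free(4, 2), largest_ghost_free(5, 2))   # -> 3 4
```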
5.3 Maximum Ghosting Sets
There are basically four parameters which define the sets of codes being discussed.
These are:
• w — the width of the code in bits
• k — the number of bits set to 1 in each code
• s — the maximum number of codes which will be superimposed, which
is the maximum number of records which will be returned by the partial
match
• g — the maximum number of ghosts which will be generated when no
more than s codes are superimposed
The sets can be specified by an identifying sequence such as w10k3s2g2, where each
number indicates the value of the parameter immediately preceding it. Such a set
would consist of codes which are 10 bits wide, each having 3 of those bits set to 1, with the guarantee that when no more than 2 codes are superimposed, no more than 2 ghosts will be generated. A formal specification of the sets now follows.
A code, c, can be represented as a set of integers which denote the positions of the bits set to 1 within that code.
Eqn. 5.2
The includes operator as defined in terms of binary patterns in Fig. 5.1, is now sim-
ply the subset relation.
Eqn. 5.3
To specify the sets mentioned above, let S denote the set of codes. Then, for S to be a
set with parameters w, k, s, g as explained above, Eqn. 5.4 must hold.
Eqn. 5.4
This equation states that for all combinations of s distinct codes from S, the number
of ghosts which will be generated by ORing together those s codes (achieved using
the set union operator), will be less than or equal to g.
c = \{ p_1, p_2, \ldots, p_k \}        Eqn. 5.2

c_a \text{ includes } c_b \iff c_a \supseteq c_b        Eqn. 5.3

\forall x_1, x_2, \ldots, x_s \in S \ (x_1 \neq x_2 \neq \ldots \neq x_s):\quad \mathrm{card}\{\, y \in S \mid y \notin \{x_1, \ldots, x_s\} \wedge \bigcup_{i=1}^{s} x_i \supseteq y \,\} \leq g        Eqn. 5.4
The representation of a code as a set can be freely converted to a real binary code b simply by taking the sum of 2 raised to the power of every element of the set.

b = \sum_{i=1}^{k} 2^{c[i]}        Eqn. 5.5
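The specification translates almost directly into code (a sketch, with codes held as integers via Eqn. 5.5; the example codes are those of Fig. 5.1):

```python
from itertools import combinations

def to_binary(positions):
    # Eqn. 5.5: a code-as-set-of-positions becomes an integer.
    return sum(2 ** p for p in positions)

def max_ghosts(codes, s):
    # The quantity bounded by g in Eqn. 5.4: the worst-case number of
    # ghosts over every choice of s distinct codes from the set.
    worst = 0
    for chosen in combinations(codes, s):
        union = 0
        for c in chosen:
            union |= c
        worst = max(worst, sum(1 for y in codes
                               if y not in chosen and y & union == y))
    return worst

codes = [to_binary({1, 4}), to_binary({2, 6}), to_binary({1, 2})]
print(max_ghosts(codes, 2))   # -> 1 (codes 1 and 2 can ghost code 3)
```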
5.3.1 Generating the Sets
The sets can be generated easily enough by simply taking the set of all possible codes
which can be generated within the bounds of w and k and adding them one by one to
the set, checking each time that the conditions set by s and g are not broken. This is
basically a brute force algorithm and as such is not very efficient. An improvement
can be made to this algorithm by considering the Hamming distance between codes
as they are added to the set S. The example in Fig. 5.2 shows that codes with a large
Hamming distance tend to produce smaller sets. By ensuring that codes are added to
the set in least-Hamming-distance-first order, then in general, larger sets will be produced.
Code 1: 111000
Code 2: 000111
1 OR 2: 111111

Fig. 5.2 - Example of orthogonal codes which can ghost any other code. (These 2 codes when ORed produce a code which can ghost any other in the set, as it has all its bits set to 1; no more codes could be added here without increasing the ghosting.)
Even if codes are taken in this quasi-sorted order there are still plenty of different
orderings of codes to be considered. What is intriguing is that the order the codes are
added to the set can have a marked effect on the final size of the set. This implies that
there is some other feature of the ordering which should be taken into consideration
when generating the sets, but this feature is not immediately obvious. In the interim,
it is sufficient to use a random ordering along with the heuristic described above,
and run many iterations of the generation program to obtain the best set within some
time limits. It is impractical to run with every possible ordering of codes, simply
because of the number of combinations involved. The table in Fig. 5.3 shows the relative increase in time taken to execute an exhaustive search on a Silicon Graphics R8000 based machine.

Fig. 5.3 - Times to complete exhaustive search of some small code sets.
However, it is possible to generate sub-optimal sets using the random search
method. These are only sub-optimal in that they are not necessarily the largest set
possible, but they do conform to the ghosting specification as described earlier. With
these codes, it is possible to perform some analysis which might give an insight into
a possible efficient algorithm for generating them and some mathematical specifica-
tions which would allow them to be modelled in order to determine other parame-
ters such as the required width of code for a certain database application, etc.
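The random search procedure described above might be sketched as follows (a simplified toy, not the program used for the experiments; the parameters are kept deliberately tiny so that it runs quickly):

```python
import random
from itertools import combinations

def max_ghosts(codes, s):
    # Worst-case ghost count over all choices of s codes (Eqn. 5.4).
    worst = 0
    for chosen in combinations(codes, s):
        union = 0
        for c in chosen:
            union |= c
        worst = max(worst, sum(1 for y in codes
                               if y not in chosen and y & union == y))
    return worst

def hamming(a, b):
    return bin(a ^ b).count('1')

def generate_set(w, k, s, g, rounds=3, cap=12, seed=1):
    # Random restarts; within each round, candidates are tried in
    # least-Hamming-distance-first order relative to the chosen codes,
    # and kept only if the (s, g) ghosting bound still holds.
    rng = random.Random(seed)
    all_codes = [sum(1 << p for p in pos) for pos in combinations(range(w), k)]
    best = []
    for _ in range(rounds):
        pool = all_codes[:]
        rng.shuffle(pool)
        chosen = [pool.pop()]
        added = True
        while added and len(chosen) < cap:   # cap keeps the sketch fast
            added = False
            pool.sort(key=lambda c: min(hamming(c, x) for x in chosen))
            for i, c in enumerate(pool):
                if max_ghosts(chosen + [c], s) <= g:
                    chosen.append(pool.pop(i))
                    added = True
                    break
        if len(chosen) > len(best):
            best = chosen
    return best

result = generate_set(8, 3, 2, 1)
print(len(result), max_ghosts(result, 2))
```

The returned set always satisfies the ghosting bound; only its size depends on the ordering, which is exactly the behaviour observed in the experiments.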
5.4 Analysis of some Maximum-Ghosting Sets
Because of the computational problems involved in generating these codes, only
very small codes have been analysed. They were generated by running the random
search method mentioned above a number of times and using the best set found
over all the runs. The number of runs used to produce each set for given values of
the parameters was dependent on the parameters themselves. For example, for the
smallest sets such as w10k3s2g1, 10000 iterations could be used. However for sets
such as w35k4s2g2, only 15 iterations were possible. Even then it could take over 24
hours to complete one run on the same Silicon Graphics R8000 machine. One prob-
lem with the small codes used is that it is already known that some of the models
used to determine CMM performance do not work well at very small code sizes.
Set Specification    Time Taken
w4k2s2g2             0.072 seconds
w5k3s2g2             8 minutes
w6k3s2g2             7.7 million years†

† This test was not actually performed! It was extrapolated from the previous
test, which would give a conservative estimate of the true figure.
Nevertheless, it is hoped that this analysis will give at least some indication
of how the larger codes would behave.
The following graphs show how the set size varies with code width, given that the
remaining parameters are fixed. The vertical lines show the points which were
actually calculated; the main curve shows the trend between these points.
Fig. 5.4 - Graphs of set size against code width for k3s2g1 and k3s2g2
(Both plots show Set Size in codes against Code Width in bits: k3s2g1 over
widths 10-70 with set sizes up to about 250 codes, k3s2g2 over widths 10-50
with set sizes up to about 800 codes.)
Fig. 5.5 - Graphs of set size against code width for k4s2g1 and k4s2g2
(Both plots show Set Size in codes against Code Width in bits: k4s2g1 over
widths 10-20 with set sizes 10-35 codes, k4s2g2 over widths 10-35 with set
sizes up to about 550 codes.)
The first graph in Fig. 5.5 exhibits some undesirable behaviour in that it should
ideally be a smooth curve. The reason for these results is simply that the amount
of time taken to generate the sets meant that it was not feasible to run as many
iterations as would be necessary to give smooth data points. It just happened that
the runs for code widths 12 and 15 produced larger sets than the others in the
given time. However, given enough runs, it is expected that the other points would
gradually move up to smooth out the curve. Other than this anomaly, the figures
tend to show that overall there is a more than linear increase in the size of set
with a linear increase in the width of code. This is really the only useful
result, and any other outcome would
have basically indicated that further work would be fruitless — a linear increase
would indicate that the size of the CMM would grow at least as fast as the size of the
problem, and a less than linear increase would indicate that the CMM would grow
more quickly than the size of the problem. Neither of these outcomes would be use-
ful in practical terms. However these results show that, in fact, for a linear increase in
code width, a more than linear increase in set size is obtainable and hence a more
than linear increase in the number of associations which would be possible, while
still guaranteeing the maximum ghosting property of the set.
It would now be useful to be able to model the curves with a function, so that set size
values can be predicted for higher code widths, rather than having to run the gener-
ation program which takes exponentially more time as the code width increases.
Four models were put forward to match the k3s2g2 and k4s2g2 curves of
Fig. 5.4 and Fig. 5.5. These are outlined in turn, along with their merits and
predicted results.
5.4.1 Quadratic Model
As a first step, it was decided to model the set size as a simple polynomial function.
An attempt was made to fit a quadratic function to the curves for the data sets
k3s2g2 and k4s2g2. These two graphs are shown in Fig. 5.6.
Fig. 5.6 - Graphs of quadratic functions against experimental data for sets k3s2g2 and k4s2g2
It can be seen that the first graph of Fig. 5.6 fits the data points quite precisely, having
an average correlation of 0.7. However the second graph shows the curve only
roughly fits the data points, and has an average correlation of 26.8. This could be
simply because the data points are not accurate enough to fit a smooth function to
them, or it could be because the data does not actually represent a quadratic
function, the first result being pure coincidence. Without further research into
the mathematical behaviour of the sets, there is no way to settle this question.
5.4.2 Cubic Model
This model is similar to the previous one, but uses a polynomial of one higher
degree. The graphs for these functions are shown below.
Fig. 5.7 - Graphs of cubic functions against experimental data for sets k3s2g2 and k4s2g2
The average correlations of these two functions to the data sets are 0.7 and 18.6, a
slight improvement for the k4s2g2 set.
It would be possible to go on increasing the degree of the polynomial and get ever
closer results, but it would seem that an alternative model may be more useful.
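The polynomial fits above, and the log-log line fit used for the exponential model in the next subsection, can be sketched with NumPy. The data points below are illustrative stand-ins, not the thesis's measured values, and the mean absolute residual is our assumption for what the "average correlation" figure of merit measures.

```python
import numpy as np

# illustrative (code width, set size) points standing in for the k3s2g2 data
w = np.array([10.0, 15, 20, 25, 30, 35, 40, 45, 50])
size = np.array([5.0, 25, 60, 120, 210, 330, 470, 620, 780])

quad = np.poly1d(np.polyfit(w, size, 2))    # quadratic model (cf. Eqn. 5.7)
cubic = np.poly1d(np.polyfit(w, size, 3))   # cubic model (cf. Eqn. 5.9)

# 'exponential' (power-law) model card(S) = w**a / b, fitted as a straight
# line in log-log space: log(size) = a*log(w) - log(b)
a, c = np.polyfit(np.log(w), np.log(size), 1)
power = lambda x: x ** a / np.exp(-c)

def avg_residual(model):
    """Mean absolute deviation between the fitted model and the data."""
    return float(np.mean(np.abs(model(w) - size)))

for name, model in [("quadratic", quad), ("cubic", cubic), ("power", power)]:
    print(name, round(avg_residual(model), 3))
```

Because the cubic fit has an extra free coefficient, its least-squares error can never exceed the quadratic's, which mirrors the observation below that the cubic model cannot perform worse.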
5.4.3 Exponential Model
This model uses an exponential function, which can be fitted to the data by taking
the logarithm of both axes and fitting a straight line to the result. The coefficients of
this line can then be used to calculate the required exponential coefficients. The
result of this analysis is shown in the graph below.
Fig. 5.8 - Combined graphs1 showing exponential functions against experimental
data for k3s2g2 and k4s2g2

1. These graphs are combined to illustrate a feature which is clearer when the
point at which the two lines cross can be seen (see section 8.2, “Values of k”
on page 100).

The average correlation figures for these two functions are 0.9 and 0.5 for the
two data sets respectively. These are correlations based on fitting linear
functions to logarithmic data points, and so must be converted to correlations
against the real data values before they can be compared to the other models.
When this is done, the actual average correlations are 2.4 and 1.6.

5.4.4 Set Size Ratio Model

An alternative approach to the problem is to investigate how the ratio of set
size to total possible codes varies with the code width. The total possible
codes N which
can be generated for given values of w and k is a simple combinatorial function
shown in Eqn. 5.6.

N = w! / (k! (w - k)!)                                             Eqn. 5.6

If this ratio was constant with code width, it would provide an easy method for
predicting the behaviour of the set size. The resulting graphs are shown in
Fig. 5.9.

Fig. 5.9 - Graphs of ratio functions against experimental data for sets k3s2g2
and k4s2g2

These models seem quite accurate, having an average correlation between the
experimental data and the fitted functions of 0.01 and 0.4 respectively. It is
probable that the difficulty in producing the experimental data1 for the k4s2g2
set is the reason this data does not fit quite as well, however it certainly
shows the correct trend.

1. This difficulty is the fact that as the values of the different parameters
increase linearly, the time taken to compute the results for a given set of
parameters increases combinatorially, and so the experiment could not be run
the same number of times for k=4 as for k=3.
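Eqn. 5.6 is the standard binomial coefficient; as a quick sketch in Python:

```python
from math import factorial

def total_codes(w, k):
    """Eqn. 5.6: the number N of w-bit codes with exactly k bits set."""
    return factorial(w) // (factorial(k) * factorial(w - k))
```

For example, total_codes(6, 3) gives the 20 candidate codes available to the w6k3 sets discussed earlier.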
5.4.5 Comparison of Models
The four models which have been described all have their merits and demerits. The
quadratic function is very simple, but has the lowest overall quality of fit with
the data analysed.
The cubic function performed better as would be expected. Indeed it would not be
possible for it to perform worse than the quadratic, as by setting the coefficient of the
cubic term to zero, the function becomes a quadratic. The identical value for the
average correlation on the k3s2g2 data set is due to this, as the cubic coefficient is
very small (0.0016). However the correlation between the function and the second
data set is improved over the quadratic model by the introduction of the small but
significant cubic term.
The exponential functions, looking at Fig. 5.8, would seem to give a very good
approximation. The fact that the data points are in a very straight line would suggest
that they are in fact modelled by an exponential function. The more consistent values
of the average correlation between these functions and the data sets would also sug-
gest that they are a more accurate model of the actual data.
However the best model is undoubtedly the ratio model. Apart from the very close
correlation between the data and the functions, it seems intuitive that the set will be
influenced in some way by the total number of codes which are considered when
generating it.
The fact that the set sizes for wider codes are likely to increase by more than
the set sizes for smaller codes, as more experiments are performed, could mean
that the curves edge closer and closer to a polynomial function. But they could
also simply adjust the parameters of the exponential function to allow a greater
accuracy of fit. As the ratio function takes the total number of codes into
account, it may be unaffected by this larger increase in set size. All that can
be said is that, without further mathematical analysis, the four models will all
give roughly accurate predictions of the path of the curve, providing that the
code width is not allowed to increase too far. The wider the code when using
these functions, the less confidence can be placed on the calculated result.
The functions are shown below with the coefficients for k3s2g2 and k4s2g2
respectively, taken to 3 decimal places.

Quadratic Model:
card(S) = 0.346w^2 - 3.602w + 22.035                               Eqn. 5.7
card(S) = 0.812w^2 - 17.424w + 114.817                             Eqn. 5.8

Cubic Model:
card(S) = 0.002w^3 + 0.207w^2 - 0.075w - 4.172                     Eqn. 5.9
card(S) = 0.018w^3 - 0.379w^2 + 6.488w - 32.248                    Eqn. 5.10

Exponential Model:
card(S) = w^2.318 / 11.546                                         Eqn. 5.11
card(S) = w^2.871 / 57.443                                         Eqn. 5.12

Ratio Model:
card(S) = w! / ((3.198w + 11.406)(w - 3)!)                         Eqn. 5.13
card(S) = w! / ((85.68w - 472.824)(w - 4)!)                        Eqn. 5.14
These 4 sets of 2 equations allow some possible values to be predicted as shown in
Fig. 5.10.
Fig. 5.10 - Table of predicted k3s2g2 and k4s2g2 set sizes for various widths
5.5 Conclusions
It is clear from the large variations in predicted sizes that none of these models can
be used for any accurate predictions of width of code required without first finding a
way of showing how the sets should behave for larger values of w. This is not only
because of the different predictions these equations give, but also because the grand
unified theory of maximum ghosting sets must use an equation which has not only
w, but k, s, and g as variables as well. It is clear that k would not remain fixed as the
code size varied, but due to the computational problems outlined earlier it was not
possible to generate experimental data for larger values of k. The values of s and g
may remain fixed, as they are problem dependent, and this would usually be known
beforehand. In fact, as k is usually taken as being a function of w, it may not be neces-
sary to involve k in the equation at all. However if, for a particular problem, the
value of k has to be set to an ‘unconventional’ value, it would still be useful to be able
to model the maximum ghosting sets.
As an example, part of the PAF file (see section 6, “Analysis of PAF” on page 83)
which would be stored in one single CMM contains 866026 records. This means that
the maximum ghosting set must contain at least 866026 codes in order to train each
(Table for Fig. 5.10 - predicted set sizes:)

Code  |          Set Size for k3s2g2              |          Set Size for k4s2g2
Width | Eqn. 5.13  Eqn. 5.7  Eqn. 5.9   Eqn. 5.11 | Eqn. 5.14  Eqn. 5.8  Eqn. 5.10   Eqn. 5.12
500   | 77155      84721     301708     156235    | 1457561    194402    2158461     976125
1000  | 310650     342420    2206920    779055    | 11665813   794690    17627455    7141063
2000  | 1246682    1376818   16827845   3884686   | 93348397   3213266   142496943   52242025
3000  | 2808106    3103216   55862770   9943413   | 315075763  7255842   482608431   167331617
4000  | 4994921    5521614   131311695  193706193 | 746875924  12922418  1145961920  382188066
record into the memory. Using the equations above, the estimated code widths are
shown in Fig. 5.11.
Fig. 5.11 - Table of predicted code widths for a storage requirement of 866026 associations
While the values predicted vary as expected from model to model, the overall trend
is reasonable, with all models predicting a narrower code for k=4 than for k=3. It can
safely be assumed that if k was increased even more, the codes could get narrower
still. Providing that one of these models can be assumed to be a fairly close approxi-
mation to the actual behaviour of the sets, Fig. 5.11 also shows that the size of code is
not impractically large — code widths of 1000 - 2000 bits are not uncommon when
using CMMs for this type of application. This is perhaps the most important result
for the purposes of this research.
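The code widths of Fig. 5.11 can be reproduced by inverting the models; for the ratio model of Eqn. 5.13, for example, a simple upward search suffices (a sketch; `width_needed` is our name, not the thesis's):

```python
def eqn_5_13(w):
    # ratio model for k3s2g2, with w!/(w-3)! written as w(w-1)(w-2)
    return w * (w - 1) * (w - 2) / (3.198 * w + 11.406)

def width_needed(model, records):
    """Smallest code width whose predicted set size reaches `records`."""
    w = 5
    while model(w) < records:
        w += 1
    return w

# width_needed(eqn_5_13, 866026) -> 1668, the Fig. 5.11 entry for Eqn. 5.13
```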
5.6 Summary
In this section, the problems associated with ghosting when using CMMs to perform
partial match queries were explained. It was shown that it is not possible to obtain a
practical solution to this problem and so some level of ghosting will have to be
accepted. A method for guaranteeing the maximum number of ghosts which will
ever be produced by the CMM was presented — that of the maximum ghosting code
set. If such a set is used when training associations into the CMM, and the maximum
(Table for Fig. 5.11 - predicted code widths:)

Formula     Code Width
Eqn. 5.13   1668
Eqn. 5.14   421
Eqn. 5.7    1577
Eqn. 5.8    1022
Eqn. 5.9    774
Eqn. 5.10   361
Eqn. 5.11   1047
Eqn. 5.12   480
number of valid responses which will be returned by the CMM is known, then the
maximum number of ghosts which will have to be removed from the CMM’s
response is also known. The ghosts can only be removed by back-checking with the
original query. In the example given in section 4 on page 61, the response for the
query ‘C?T’ might have produced 3 codes which, when expanded, referred to the
words ‘COT’, ‘CAT’ and ‘PIN’. In this case, ‘PIN’ was obviously produced by a
ghost, as it plainly does not satisfy the query. Each of the output words has to be
checked in this way to remove the ghosts. Knowing how many there are to search for
allows the worst case performance to be assessed.
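The back-checking step is a straightforward filter (a sketch, with '?' marking an unrecognised character):

```python
def remove_ghosts(query, candidates):
    """Keep only candidate words that genuinely satisfy the partial-match
    query; anything else was a ghost produced by superimposition."""
    def satisfies(word):
        return (len(word) == len(query) and
                all(q == '?' or q == ch for q, ch in zip(query, word)))
    return [word for word in candidates if satisfies(word)]

# remove_ghosts('C?T', ['COT', 'CAT', 'PIN']) -> ['COT', 'CAT']
```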
In the absence of a mathematical model of these sets, experiments were performed to
establish the approximate set size for some small codes. This data was then analysed
in a number of ways to try and predict the set sizes for larger codes. While this anal-
ysis is in no way intended to allow sound arguments to be made about the behav-
iour of the larger sets, it does give rough indications of the kind of sizes that will be
possible. It can also be argued that the models presented in this section give an
under-estimate of the set size for a given code width. This is because the number of
bits set in each code was not varied, but as larger codes are used, more bits would be
set. This was shown to increase the set size for codes of width 16 bits.
Whatever database engine is chosen for this application, it will be used to store and
search a database which contains address information used by The Post Office for
sorting mail. The contents of this database will have to be coded into an appropriate
form for the database system. Some knowledge of the kind of information contained
within the database will be essential for the coding to be performed in an efficient
manner. It will also be useful to know what kind of outputs will be obtained from the
database when queries of the kind required by a verification system are made. The
next chapter presents a detailed discussion of the database, and the kind of searches
which will be made.
6. Analysis of PAF
The Postal Address File (PAF) is a database which contains address information such
as postcode, posttown, building name/number, latitude/longitude, etc. for every
mail delivery address in the United Kingdom.
6.1 Introduction
This chapter gives an indication of how this database will be used during the OCR
and verification process. The kinds of queries which are likely to be made are
explained and some of the potential results are presented. This includes an analysis
of the format of the postcode and how missing characters within a postcode (for
example, failure of the OCR system to recognise one character) will impact the verifi-
cation process. The PAF itself holds over 25 million addresses or ‘delivery points’, as
for most domestic addresses the postcode is shared by a number of buildings. In its
fully expanded form, the database’s size is around 7.5 gigabytes. As mail pieces pass
through the automated sorting machines, the address image is scanned by a camera
and fed into a computer. From there, it must be segmented into lines, the different
lines identified (specifically the line containing the postcode), OCR performed on
that line, the resulting postcode searched for in the database, and a machine-reada-
ble version of the address printed on the mail piece. This is in the form of a binary
pattern of phosphor dots which can be read easily at a later stage. All this has to be
carried out over 10 times per second, as this is the speed at which mail passes
through the sorting machines. It is clearly not a trivial problem! Some immediate
reductions can be made in the amount of work which has to be done, however.
Firstly, there is no reason why each of the operations described above cannot be
pipelined, giving a substantial increase in the overall performance of the
address recognition system.
Secondly, it is not necessary at this stage to search the entire database as only the
postcode is being recognised (however this may not be the case in the final system —
see section 8.3, “Strategies for Verification” on page 101). Finally, there is no real rea-
son why the machine-readable code printed on the mail piece cannot simply be a
unique number which is stored in a separate database to be cross-referenced at a
later time by the address recognition system. Indeed, this is what currently happens
to mail pieces which cannot be identified by the automated address recognition sys-
tem. The image of the mail piece is tagged with the machine-readable code printed
on the mail piece and is then displayed to an operator who visually recognises the
postcode and keys it into a terminal. This is then associated with that machine-read-
able code and when the final sorting machine reads the code, it simply looks up the
postcode keyed in by the operator. This is typically done at a much later stage than
the initial address recognition and so could still be done off-line, automatically.
6.2 Format of the Postcode
The postcode follows a fairly rigorous syntax format, although it is subject to change
from time to time. There are 3 different lengths of postcode — 5, 6 and 7 characters.
The formats for each are shown below.
Fig. 6.1 - The syntax of the postcodes
This format is very unlikely to change — it is the set of characters within each posi-
tion which can change. For example, at the time [9 Kabir, Downton] was written, it
was stated that in the first 6 character format in Fig. 6.1, the second character, being
Number of     Character          Number of
Characters    Codes†             Postcodes

5             A N N A A          45649

6             A A N N A A
              A N N N A A        866026
              A N A N A A

7             A A N N N A A
              A A N A N A A      717693

† ‘A’ represents an alphabetic character, ‘N’ represents a numeric character.
alphabetic, could not be an ‘I’ (eye). This was presumably designed to prevent possi-
ble clashes with other similar postcodes which have a ‘1’ (one) in this position. How-
ever there is at least one postcode now in use which does indeed have an ‘I’ (eye) in
this position, and it remains to be seen whether it is similar enough to others in the
database to cause problems.
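The six syntactic patterns of Fig. 6.1 translate directly into a validity check (a sketch which ignores the per-position character-set restrictions just discussed):

```python
import re

# A = alphabetic, N = numeric, following Fig. 6.1
FORMATS = ["ANNAA",                       # 5 characters
           "AANNAA", "ANNNAA", "ANANAA",  # 6 characters
           "AANNNAA", "AANANAA"]          # 7 characters
_CLASS = {"A": "[A-Z]", "N": "[0-9]"}
VALID = re.compile("|".join("".join(_CLASS[c] for c in f) for f in FORMATS))

def is_valid_syntax(postcode):
    """True if the postcode (spaces ignored) matches one of the six formats."""
    return VALID.fullmatch(postcode.replace(" ", "").upper()) is not None
```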
It is clear from the table that there are other possible examples of valid postcodes, as
far as the syntax is concerned, which could not easily be distinguished by OCR.
These would be ones where the only distinguishing characters were in a position
which could be either alphabetic or numeric, and were either a ‘1’ (one) or an ‘I’
(eye), or a ‘0’ (zero) or an ‘O’ (oh). According to an initial scan of the data
there are no such clashes, but there are other possibilities such as ‘5’/‘S’ which
could be difficult depending on the font used or the style of handwriting.
The main reason for making the distinction between the classes of characters permit-
ted is to refine the OCR system by specifically employing an alphabetic or numeric
character recogniser at each character position, rather than having one alphanumeric
recogniser, which is bound to be less reliable. Obviously some character positions
can be either alphabetic or numeric and so would require a discriminator to decide
either which recogniser to apply, or if both were applied, which one to believe. This
would probably be based on the relative confidence of each recogniser. By starting
the recognition process at the right and moving to the left, it can be seen from Fig. 6.1
that there are only two character positions which could be alphabetic or numeric1. If
the recognition process were started from the left, the number of undecided charac-
ters would be four, so this is another simple way to reduce the complexity of the
problem.
1. We assume that the length of the postcode is unknown at this time; therefore recognition proceeds from one end of the postcode to the other until the block located as the postcode is exhausted. The number of characters is then counted implicitly in the recognition process.
6.3 Missing Characters
Because the reliability of the OCR system can never reach 100%, it is inevitable that
some characters are going to be unrecognisable. It is also desirable that the
system reject characters rather than mis-classify them, as a mis-classified
character could well generate a valid but incorrect postcode. This type of error
would be very
difficult to detect without further cross-referencing with the rest of the address. So
the system must be able to deal with cases where not all of the characters of the post-
code are known, and this represents a type of partial match (see section 4, “Partial
Matching” on page 53).
From the discussions in section 5, it is clearly desirable to know how many valid
postcodes could be generated when a search of this kind is made on the database, as
this determines the superimposition factor that will be present in the output of the
CMM. The three graphs shown in Fig. 6.2, Fig. 6.3 and Fig. 6.4 show what happens
when single characters are unrecognised in the different classes of postcode. The
basic question being asked here is probably better expressed in English:
Take a 6 character postcode, ‘S10 4FP’ for example. If the first character is unrecognisable, so that the result of OCR is ‘?10 4FP’, how many postcodes match the ‘10 4FP’ part, and so can be considered as candidates for the actual postcode on the mail piece?
This question is repeated for each postcode within each class, and for each of the
three classes. The ‘Field’ entry on the graphs indicates which character within the
postcode is being left out. The number of matches indicates how many postcodes
matched against the remaining partial postcode. For example in 7-character post-
codes, Fig. 6.4 shows that with the first field missing, 483761 postcodes matched
against a single entry in the database. This means that 483761 of the 717693 7-charac-
ter postcodes will still be unique, even if the first character cannot be recognised. The
worst case for this field is that 3220 of the 7-character postcodes match 5 entries in
the database, when the first character cannot be recognised. This means that there
will be 4 incorrect postcodes and the 1 correct postcode returned by the search
algorithm.
These values can be interpreted directly as probabilities, providing it is assumed
there is an even distribution of postcodes, which may not be the case in a real sorting
office. So there is a 67.4% probability that a seven character postcode can be uniquely
identified given that its first character is unknown, and a 0.45% probability that it
will be one of 5 possible postcodes.
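The analysis behind Fig. 6.2-6.4 amounts to a histogram of partial-match counts (a sketch; the thesis ran this over the full PAF):

```python
from collections import Counter

def match_histogram(postcodes, field):
    """Blank out character `field` (0-based) of every postcode and count, for
    each postcode, how many postcodes share the resulting partial pattern.
    Returns {number of matches: number of postcodes with that many}."""
    def partial(pc):
        return pc[:field] + "?" + pc[field + 1:]
    counts = Counter(partial(pc) for pc in postcodes)
    return Counter(counts[partial(pc)] for pc in postcodes)
```

With the real data, the entry for 1 match on field 1 of the 7-character postcodes would read 483761, as quoted above.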
Fig. 6.2 - Analysis of five-character postcodes
(Number of Postcodes, 0-20000, against Number of Matches, 0-20; one curve per
field 1-5.)
Fig. 6.3 - Analysis of six-character postcodes
(Number of Postcodes, 0-350000, against Number of Matches, 0-20; one curve per
field 1-6.)

Fig. 6.4 - Analysis of seven-character postcodes
(Number of Postcodes, 0-600000, against Number of Matches, 0-20; one curve per
field 1-7.)
It is clear from the shape of these graphs that identifying a postcode is relatively easy
if one of its first few characters is unknown, but specifically the last two characters of
each class seem to be fairly evenly distributed among postcodes. This means that
there will on average be more possibilities to consider if one of the last two charac-
ters is unknown rather than one of the first few. However the graphs also show that
no matter which character is missing, there are never more than 20 possibilities
which could be returned by the search algorithm, and so a brute force search through
these to identify the correct one should not be out of the question. How this
might be achieved, however, is left for future work.
It is also possible that there may be more than one character in a particular postcode
which cannot be recognised by the OCR system. This may be due to one of the fol-
lowing reasons:
• Due to the <100% reliability of the OCR system, there will be a small
number of occasions when one character cannot be recognised. When dis-
tributed over the input sequence of characters, this should mean that
there will usually be no more than one failure in any particular postcode.
But it is inevitable that, eventually, two or more of the statistically
predicted failures will occur in one postcode.
• The address image is very badly formed — so much so that maybe only a
handful of all the characters in the address could be recognised automati-
cally.
These two events will together contribute to the reject rate of the system. In the first
case, it is unlikely that a postcode with two characters missing could be searched for
and the possibilities considered, as there would probably be too many of them. It is
easy to see in the worst case that there would be 400 (20 × 20) possibilities if the last
two characters were the two which could not be recognised. In the second case, if
only a few of the characters can be recognised, there would seem to be little point in
continuing with the automated recognition procedure as there would be no way to
check the possibilities which were returned against other information in the address,
even if it were in principle possible to deal with that many alternatives. In
fact, a recogniser with a greater than 86% recognition rate (still a fairly
modest target given the results presented in section 2) would on average fail on
fewer than 1 in 7 characters. Since postcodes have no more than 7 characters, on
average there will be no more than 1 character unrecognised in any given
postcode. If all characters can be recognised, there will be fewer possibilities
to deal with (unless of course
the characters do not represent a valid postcode — this is a separate issue and is
dealt with in section 8.3), whereas if more than one character is missing, it may be
very difficult to make any kind of automatic interpretation of the address, simply
because of the number of possibilities involved. Once a system has been designed,
it will be possible to calculate the total cycle time for recognition and
database lookup, and hence an upper bound on the number of possibilities which
can be considered.
It has been argued that the main target for automated mail sorting must be to cor-
rectly recognise the postcode on a mail piece. The main consideration has been the
possibilities for recovery if, for some reason, one of the characters within the post-
code cannot be recognised. The results presented in this chapter show that there will
be little problem in identifying the postcode, even if one of the characters is missing.
This can be achieved either by passing hints to the OCR system once the set of possi-
ble characters is known or by attempting to integrate information from other parts of
the address. It has been shown that even if this is one of the last characters, there will
never be more than 20 possible postcodes to choose from. It is likely that a system
could be made to run fast enough to consider 20 possibilities — if not then a radical
alteration in the approach will be necessary to afford any benefit to the current sys-
tem.
In the case when more than one character is missing, it is difficult to say whether the
system would be able to correctly process the mail piece. It is likely that considera-
tion of all the possible valid characters will be too time consuming to be practical in
an on-line system. However the alternative method of incorporating other address
information into the database search may still allow the single correct address record
to be identified. This is certainly a topic for further research.
Now that the problems and possible solutions have been identified, the next chapter
presents some discussions about the feasibility of the proposed approach to this
application.
7. Feasibility
It is obviously important to assess whether a system based on the work presented
here would actually form a feasible solution to the problem of improving automated
address recognition. The following section gives some outlines of the measures
involved in determining whether or not the system will perform as required.
7.1 Introduction
At present, the only consideration which can reasonably be addressed is whether or
not the system is likely to perform its function within the strict time requirements of
the on-line mail sorting process. Some analysis is given of the CMM approach to
database searching, followed by a discussion of the implications of this for the mail
sorting application. In particular, a hardware implementation of the CMM process is
discussed and its performance analysed.
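Before looking at timings, it is worth recalling what the SAT processor actually computes: a binary matrix-vector sum followed by a threshold. The sketch below is a minimal software model of that operation; the sizes and the stored association are purely illustrative, and the training step shown is the standard CMM outer-product store.

```python
def cmm_recall(matrix, input_bits, threshold):
    """Software model of the Sum-and-Threshold operation: sum the rows of
    the binary matrix selected by the set input bits, then threshold.
    matrix[i][j] is the binary weight from input i to output j."""
    width = len(matrix[0])
    sums = [0] * width
    for i in input_bits:              # only set input bits contribute
        for j in range(width):
            sums[j] += matrix[i][j]
    return [1 if s >= threshold else 0 for s in sums]

# Train a tiny 4-input x 6-output CMM with one association, then recall it.
matrix = [[0] * 6 for _ in range(4)]
stored_in, stored_out = [0, 2], [1, 4]    # 2 input bits -> 2 output bits
for i in stored_in:
    for j in stored_out:
        matrix[i][j] = 1

out = cmm_recall(matrix, [0, 2], threshold=2)   # -> [0, 1, 0, 0, 1, 0]
```

Thresholding at the number of set input bits (here 2) recovers exactly the stored output pattern; it is this fixed sum-and-threshold structure that makes the operation so amenable to dedicated hardware.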
7.2 Speed of Database Access
The central problem in the partial matching exercise is how quickly the database can
be searched for the required address information. In [44 Austin et al.], a dedicated
piece of hardware known as the ‘Sum and Threshold’ processor (SAT) is presented,
which is capable of performing the CMM operations detailed in the previous sec-
tions at very high speed. The main equation which determines the speed with which
a database search can be made is shown in Eqn. 7.1. It gives the cycle time (CT) of the
unit given a number of parameters governing the matrix involved.
Eqn. 7.1:

CT = (50α/16)(3.5β/δ + 34) + (ρ/σ)(2σ/16)(3.5ϒ + 35) + ι(4.5α + 3φ) ns

This equation covers a two stage process which actually implements the ADAM
structure. However, for this application, only the first stage of the operation is
required and some of the variables become irrelevant. Since the results of the first
stage are available within the SAT processor, it is reasonable to ignore the second
stage components of the above equation, and the simplified version, for the first
stage only, becomes:
Eqn. 7.2:

CT = (50α/16)(β + 34) + 4.5α + 3φ ns
The coefficients are:
• α — Output size
• β — Number of bits set in input pattern
• φ — Number of bits set in output pattern
The values for these variables can be calculated as follows1. The output size, α, is the
required width of code to represent all the records in the database. From the results
of section 5, using the worst case estimator for k=4 (Eqn. 5.14), and the size of the
database for 5, 6 and 7-character postcodes, the approximate code widths are shown
in Fig. 7.1.
Fig. 7.1 - Estimated code widths for the 3 classes of postcode, using Eqn. 5.14
The number of bits set, φ, has to be 4, as this is the value of k in Eqn. 5.14. Obviously,
with these widths of code, a larger value of k would be better (assuming the log2 rule
holds). However, as explained before, it was not possible to generate equations for
higher values of k due to the excessive amount of time this would take. Using a
1. In all the following discussions, it is implied that 3 separate CMMs will be used to represent the 3 classes of postcode (5, 6, and 7-characters). This can be achieved using only one physical SAT processor by setting up each CMM in the processor's memory, and then simply adjusting pointers within the SAT so that the correct CMM is actually evaluated for the given search to be performed.
Fig. 7.1:

  Postcode Width    Number of Postcodes    Estimated Code Width
  5 characters      45649                  158 bits
  6 characters      866026                 421 bits
  7 characters      717693                 395 bits
higher value of k would almost certainly reduce the width of the code and so these
results are sure to give a worst case estimate of the speed of operation.
The number of input bits set, β, is 5, 6 or 7, depending on the postcode width being used. This is
because a very simple coding scheme can be used on the input, where each character
position within the postcode is represented by a 1-in-n bit binary code. The equation
only needs to know how many bits are set to 1 on the input, not the total input size.
The value of n will be different for different positions (for example, a purely numeric
character position can be represented by a 1-in-10 bit binary code, whereas a purely
alphabetic field would need a 1-in-26 bit binary code), however, no matter what the
size of the code, it will always have 1 bit set, and so the value of β will always be
equal to the number of characters in the postcode. Evaluating Eqn. 7.2 for the given
parameters yields the results shown in Fig. 7.2.
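The 1-in-n input coding described above is straightforward to construct. A sketch follows; the full digit and letter alphabets used here are illustrative, since the real postcode format definitions restrict which letters may appear at each position.

```python
def one_in_n_code(postcode, alphabets):
    """Concatenate a 1-in-n binary code for each character position.
    alphabets[i] lists the characters valid at position i."""
    bits = []
    for ch, alphabet in zip(postcode, alphabets):
        code = [0] * len(alphabet)
        code[alphabet.index(ch)] = 1    # exactly one bit set per position
        bits.extend(code)
    return bits

DIGITS = "0123456789"
LETTERS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

# 'ANNAA' format for a 5-character postcode: 26+10+10+26+26 = 98 input bits,
# of which exactly 5 (one per character position) are set.
alphabets = [LETTERS, DIGITS, DIGITS, LETTERS, LETTERS]
vec = one_in_n_code("M25AB", alphabets)
```

As the text notes, whatever the per-position alphabet sizes, each position contributes exactly one set bit, so β always equals the number of characters in the postcode.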
Fig. 7.2 - Time taken to search each database for one specific postcode
The result of the CMM operation is a code which will uniquely identify a record in
the main PAF and, obviously, this will have to be retrieved in order to get the actual
postcode and other address details for verification purposes. It would be possible
however to use a second CMM which takes the output of the first and returns actual
postcodes. In this case the variables shown before would take the following values.
The output size, α, would be dependent on the width of the postcode. From Fig. 6.1
it can be seen that the worst case (i.e. largest required code) to represent each post-
code, is ‘ANNAA’ for 5-character postcodes, ‘ABBNAA’ for 6-character postcodes,
and ‘AANBNAA’ for 7-character postcodes. The letters ‘A’, ‘N’ and ‘B’ represent
Fig. 7.2:

  Postcode Width    Number of Postcodes    Time to Search for 1 Postcode
  5 characters      45649                  55.4 µs
  6 characters      866026                 148 µs
  7 characters      717693                 140.1 µs
‘alphabetic’, ‘numeric’ and ‘both’ fields within the postcode (for example, the third
character of a 7-character postcode can only be a numeric, and so can be represented
as a 1-in-10 bit binary code). Summing all these positions gives the following sizes:
Fig. 7.3 - Total size of codes required to represent each class of postcode
In fact these sizes represent the actual input sizes used on the first CMM. However,
as explained before, it is only the number of bits set to 1 on the input which affects
the speed of the processor.
The input bits for the second CMM, β, is the number of bits set on the output of the
first CMM. This is the value of k, and is thus 4 in this example, as Eqn. 5.14 predicts
codes with 4 bits set.
The value of φ is the number of characters in the postcode, as there is a direct rela-
tionship between bits in the output code and characters in the postcode that it repre-
sents — this will be 5, 6, or 7 depending on which class of postcode is being searched
for. Re-evaluating for the second CMM gives the following results:
Fig. 7.4 - Overall time to recover actual postcode
It is then trivial to convert the output of the second CMM into an ASCII interpreta-
tion of the postcode, and this can either be used directly to compare with the results
Fig. 7.3:

  Postcode Width    Format of Postcode    Total Size of Code Required
  5 characters      ANNAA                 98 bits (5 set)
  6 characters      ABBNAA                160 bits (6 set)
  7 characters      AANBNAA               160 bits (7 set)

Fig. 7.4:

  Postcode Width    Time to Recover Postcode    Total Time to Recover Postcode
                    from First CMM Output       from OCR Output
  5 characters      34.4 µs                     89.8 µs
  6 characters      55.9 µs                     203.9 µs
  7 characters      56.1 µs                     196.2 µs
from OCR, or used as a key to locate a record in the PAF. Note that if a partial match
is being evaluated, then the second CMM will have to be evaluated for every post-
code returned by the partial match. There will also be an overhead associated with
separating the superimposed codes. It was shown in [37 Kennedy] that the average
time taken to separate 5 superimposed codes of width 400 bits is less than 2ms per
code, using a technique known as Middle Bit Indexing (see [38 Filer]). This time will
vary with the width of the codes and the number of codes superimposed, but should
give a reasonable indication of the order of magnitude of the problem. So the total
time will be the time taken to evaluate the first CMM, plus the number of records
returned × 2ms, plus the number of records returned × the time taken to evaluate the
second CMM. Note again though that the retrieval of the superimposed codes can be
pipelined with the second CMM to improve the efficiency still further. The worst
case time, for a 6-character fully specified postcode, equates to around 4900 postcode
searches per second. In [44] it is shown that the SAT processor achieves an average
speed-up over a Silicon Graphics R460SC Indy workstation by a factor of 5. This
means the worst case time using a conventional machine would still only be
1019.5µs, or around 980 postcodes per second. However the SAT processor is also
over 20 times cheaper than the workstation, which gives a price/performance ratio
of more than 100:1 in favour of the SAT. There would obviously be additional costs
involved with providing a host for the SAT, however it is also likely that there would
be interfacing costs involved if the workstation was used, and these could only be
assessed if the details of the actual sorting machine were available.
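The total-time rule stated above can be made concrete. The sketch below uses the figures quoted in the text for the 6-character case (148 µs for the first CMM search, 55.9 µs for the second CMM, under 2 ms per code for separation); the assumption that a fully specified postcode returns a single, unsuperimposed code needing no separation step is mine, made to reproduce the 203.9 µs worst case of Fig. 7.4.

```python
def partial_match_time_us(t_first, t_second, n_records, t_separate=2000.0):
    """Total search time in microseconds: first-CMM search, then for each
    returned record, separate its code from the superimposed output and
    evaluate the second CMM.  A fully specified postcode is assumed to
    return a single, unsuperimposed code, so no separation is needed."""
    separation = t_separate * n_records if n_records > 1 else 0.0
    return t_first + separation + n_records * t_second

# Fully specified 6-character postcode: 148 + 55.9 = 203.9 us (Fig. 7.4).
full = partial_match_time_us(148.0, 55.9, 1)
rate = 1_000_000 / full            # around 4900 searches per second

# A partial match returning, say, 5 candidate postcodes.
partial = partial_match_time_us(148.0, 55.9, 5)
```

Note that this sketch ignores the pipelining of code separation with the second CMM mentioned in the text, so it slightly overstates the partial-match time.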
7.3 Other Factors
The worst case time of 203.9µs per postcode looks promising. However, the system
will effectively form a pipeline comprising OCR, database search, verification and
machine-readable code printing (possibly with iteration of the database search and
verification steps), so the overall cycle time for the whole system will be the longest time
required to execute any one of those parts. As the machine-readable code printing
can really be made to go as fast as the mail moves through the machine, this is not
likely to be the longest part of the process. The database search is very fast, as shown
in the previous tables, and so the slowest part of the pipeline will probably be the
OCR (which also involves address location and line segmentation etc.). It ought to be
possible then to tune the iteration of the next steps to take nearly as much time as the
OCR stage, so as to allow the maximum amount of work to be done without slowing
the overall pipeline down.
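The pipeline argument can be stated very simply: throughput is limited by the slowest stage. A minimal sketch, with purely illustrative stage timings (the OCR figure in particular is not taken from any measurement):

```python
def pipeline_cycle_time(stage_times):
    """A pipeline's cycle time is set by its slowest stage."""
    return max(stage_times)

# Illustrative per-piece stage times in ms: OCR (including address location
# and line segmentation), database search, verification, code printing.
stages = {"ocr": 80.0, "search": 0.21, "verify": 5.0, "print": 1.0}
cycle = pipeline_cycle_time(stages.values())   # dominated by the OCR stage
```

On such figures the tuning suggested above would let the search/verification iteration expand to fill most of the OCR stage's time without reducing throughput.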
It has already been noted that the system must process around 10 mail pieces per second. However the actual time a mail piece spends in the sorting machine (from
breaking the beam as it enters the machine and triggering the imaging process, to
leaving the machine with its machine-readable code printed on it) is around 1 sec-
ond. This means that the pipeline will be processing up to 10 mail pieces at any one
time, but the overall time for the pipeline will be 1 second per mail piece. Given the
times for the database searching and the fact that a character recogniser could be
implemented on the same dedicated hardware, this does not look unfeasible. But it is
also possible to introduce a delay line1 into the mail path which effectively increases
the processing time for each mail piece to 7 seconds. However to maintain the aver-
age throughput of 10 mail pieces per second, it is clear that the number of stages in
the pipeline would have to be increased to take advantage of this, as it means that
there will be on average 70 mail pieces in the sorting machine at any one time. This
may mean that the different CMMs used for the various parts of the system (OCR,
PAF search and postcode retrieval) will all have to be working at the same time and
1. This is a simple mechanical device which forces the mail to take an arduous path through the machinery. This increases the time it takes for the mail piece to move from the scanner which reads the address, to the printer which prints the machine-readable code on the mail piece.
this will require more physical SAT processors. However they are designed in such a
way that once the host and interface have been set up, many SATs can be added at
little extra cost. There are other tasks which may not be suitable for implementation
on the SAT, such as the address location and segmentation and the superimposed
code separation and verification. However it is not clear yet exactly how these parts
of the system would be implemented and so it is difficult to give accurate timings for
the whole pipeline. All that can be said is that given the current state of technology, it
would be surprising to find that the task could not be completed within 1, let alone 7
seconds.
8. Conclusions and Further Work
It has been shown that for a realistic improvement in the reliability of automated
address recognition, the main target area has to be the integration of address infor-
mation rather than improving the performance of an OCR system. It has also been
shown that the crux of this issue is the efficient retrieval of a valid address record
from the Postal Address File. This address has to have the highest probability of
being the one that was intended by the author of the address, given the (possibly
incomplete) information obtained from the address image. This amounts to a partial
match search of the database. A number of approaches to this have been proposed.
One in particular was considered in detail, and some of the problems with this
method identified.
Many of the issues raised during the course of this research would warrant further
investigation. This section details some of the more interesting questions which were
raised. As this research was intended as preparatory work for a longer study of a
system for improving automated address recognition, some of the topics discussed
in this section will be taken up over the next 3 years.
8.1 Code Generation
If the maximum ghost code sets are to be used as an effective way of reducing the
problems associated with partial match searching using CMMs, an efficient way of
generating them is essential. It is believed that in order to obtain such a method, a
more complete understanding of the behaviour of the sets is required. One possible
model for the sets which was not covered in the main text involves the use of hyper-
cubes. As shown in Fig. 8.1, a three bit code can be represented as the vertices on a 3-
dimensional cube. For wider codes, more dimensions are required and so they
become difficult to envisage. Questions about these hypercubes can still be asked
though.
• What subset of vertices is represented by a maximum ghosting set?
• What geometric features are displayed by such a subset?
• Do different sets have common features when represented in this way?
In particular, if the answer to the last question is in the affirmative, this may be the
key to understanding what makes these sets exhibit the properties they do.
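These questions can be explored computationally for small widths. The sketch below represents w-bit codes as hypercube vertices and computes one simple geometric signature of a vertex subset, its multiset of pairwise Hamming (edge) distances; the example set is purely illustrative, not a genuine maximum ghosting set.

```python
from itertools import combinations

def hamming(u, v):
    """Hamming distance = shortest edge path between two hypercube vertices."""
    return bin(u ^ v).count("1")

def distance_profile(code_set):
    """Sorted pairwise Hamming distances for a set of codes: one simple
    geometric feature by which vertex subsets can be compared."""
    return sorted(hamming(u, v) for u, v in combinations(code_set, 2))

# Three vertices of the 3-cube (codes 011, 101, 110): an equilateral
# triangle on the cube, all pairwise distances equal to 2.
profile = distance_profile([0b011, 0b101, 0b110])   # -> [2, 2, 2]
```

Comparing such profiles across sets is one way to approach the third question above: two maximum ghosting sets with identical profiles would share at least this geometric feature.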
Fig. 8.1 - Representation of 3-bit binary codes as vertices of a 3-dimensional cube (bit positions 1, 2 and 3 are mapped to the x, y and z axes; each bit of each code gives the position of that code on the relevant axis)
8.2 Values of k
It was mentioned in section 4.3 that the optimum value for k is log2(w). This allows the
maximum storage within the CMM — any higher than this and errors in the output
start to affect the reliability of the system. It is not clear however why this should be
the case. It has always been assumed that the problems occur because of saturation
of the matrix. However the graph in Fig. 5.8 on page 75 shows that the size of the sets
of maximum ghosting codes for k=4 starts below the size of sets for k=3, but eventu-
ally becomes higher. The point at which the lines cross represents the point at which
one should stop using codes with 3 bits set and start using codes with 4 bits set. This
value is at around 2.7 on a logarithmic scale, which gives an actual code width of e^2.7,
approximately 15. Using the log2 rule above, the number of bits set in each code
goes from 3 at w=15 to 4 at w=16. This is a surprising coincidence, and may indicate
that the log2 rule holds not because of saturation in the matrix but because of exces-
sive ghosting on the outputs. In order to confirm this, another line would need to be
plotted on the graph in Fig. 5.8 for k=5. If it were to cross the k=4 line at 3.5 (which is
the logarithm of 32, the code width where the number of bits set goes from 4 to 5 as
determined by the log2 rule), it would provide more than coincidental evidence for a
link between ghosting and the maximum storage capacity of a CMM. As mentioned
before however, this is not practical given the current method of generating the max-
imum ghosting sets because of the large amount of time it would take.
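The ghosting phenomenon itself is easy to reproduce for small codes, which is where such a k=5 experiment would have to begin. The sketch below superimposes (ORs) a set of k-of-w codes and enumerates the spurious k-bit patterns ('ghosts') that can be read out of the union but were never stored; exhaustive generation of this kind is precisely the exponential cost referred to above.

```python
from itertools import combinations

def ghosts(stored, k):
    """Return the k-bit codes recoverable from the superimposed (OR-ed)
    stored codes that are not themselves stored codes.  Each code is a
    frozenset of set-bit positions."""
    union = set()
    for code in stored:
        union |= set(code)                      # superimpose the codes
    candidates = {frozenset(c) for c in combinations(sorted(union), k)}
    return candidates - set(stored)

# Two 2-of-3 codes: {0,1} and {1,2}.  Their union {0,1,2} also contains
# the unstored pattern {0,2} -- a ghost.
g = ghosts([frozenset({0, 1}), frozenset({1, 2})], k=2)
```

Plotting the size of the worst-case (maximum) ghosting set against code width for k=5, as proposed above, would require running this kind of enumeration over all candidate sets, hence the impractical run time.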
8.3 Strategies for Verification
In section 3, some initial ideas concerning the verification of postcode recognition
were put forward. There are many unanswered questions with regard to how this
may be done. The first stage, which has so far been overlooked, is to find the actual
address block on the mail piece. This can be done either using a simple line finding
algorithm, or a more complex locator such as the one in [1 Wolf, Platt], which
achieves 98.2% success at finding the address block when allowed to propose its top
5 choices. It was not reported how often the correct block was the first choice though.
This system would also require a line segmentation algorithm, but that task would
be considerably simplified by the fact that the box is assumed to contain only
address information. When presented with the entire image of the mail piece, line
segmentation algorithms can be easily confused by graphics on the envelope.
Once the address has been segmented into lines, it is fair to assume that the postcode
will be on the bottom line (either on its own, or following another word such as the
posttown), or on the second line from the bottom. In that case, the bottom line would probably
contain the country name, as it would for international mail. However there is very little else
which can be assumed about the format of the address. It is likely that the recipient's
name is on the first line, but this is not useful when identifying the address within
the PAF. Other information which would be useful when identifying the address is,
for example:
• Street Name and Number of Premise
• Posttown Name
• PO Box Number
• Premise Name (building or company name)
Unfortunately there is no standard way of writing an address and so the system can-
not make any assumptions about what information will and will not be present for
any given address. There is also likely to be information which is not helpful, such as
the county, which is included on many addresses but does not actually add any
information. In fact for most domestic addresses, the only two pieces of information
that are needed are the house number and the postcode. For large organisations, just
the postcode is sufficient. However the goal is to verify this against other redundant
information in the address, and the automated system needs some way of identify-
ing what information there is, and how it could be used.
One solution would be to apply OCR to every segmentable word and check each
word in a large dictionary of valid words which could appear on addresses. This dictionary would have to include all the postcodes, posttowns, street names, annotations such as 'P.O. Box' and possibly others. Once all the useless information, such as the recipient's name, has been discarded, a search akin to the one described in
[37 Kennedy] and mentioned in section 4 would be performed on some database of
‘address words’. This type of search allows the information being searched for to be
presented unordered and incomplete, as the output of this type of system would
almost certainly be. The search would need to allow the record(s) which matched
against the highest number of input words to be returned, and these would then be
taken as the candidate addresses. There may be scope for further refinement of the
input words once some candidate addresses are available, or the system could sim-
ply accept the address which matched the most inputs, providing there was only one
such address.
Another possibility is to make more assumptions about the address. For example, if
the posttown is included on the address, it is usually placed immediately above or to
the left of the postcode. If the county is included, it may be between the posttown
and the postcode. Using a database of hints such as these, the approach described
above could be refined slightly to avoid having to perform OCR on the entire
address, which could result in a substantial improvement in performance.
All these types of approach will probably improve the reliability of the system, at the
expense of reducing its performance in terms of speed. The target of any system
must be to recognise the address and code the mail piece in real time, as any off-line
system would necessarily incur the expense of buffering the address image informa-
tion and the machine-readable code database. However this will eventually become
the less expensive option as more and more processing power is required to perform
the increasingly complex sequence of operations involved in actually recognising the address. There is obviously a trade-off to be made here between the complexity (and, hopefully, reliability) of the system and the cost associated with making
this system on-line. In order to make this decision, there has to be some way of meas-
uring the cost of moving the recognition system off-line and this would require a
more detailed analysis of the particular application.
There is a problem which will undoubtedly occur at some time during the operation
of the system — namely, that all the characters are recognised by the OCR system (in
that each confidence is above some threshold), but the set of characters returned does
not represent a valid postcode. This could be caused either by substitution errors in
the OCR system or a genuine error on the mail piece. The job of the verification sys-
tem will be to identify which character(s) are in error. This could be done simply by
finding the character with the lowest confidence from OCR, and removing it from
the postcode, which then forms a partial match postcode with one character missing.
This assumes however that substitution errors are characterised by low confidence
within the OCR system, and this may not be the case. It may be possible to try the
postcode with each character missing in turn, and search all the possibilities. However,
as was shown in section 6, if one of the last few characters is missing, this can
represent a large number of possibilities to search. Another approach might be to try
and relate parts of the postcode to other information from the address — specifically
the first portion of the postcode can be matched against the posttown. It is more
important that this section of the address is recognised correctly as this determines
which town within the UK the mail piece is sent to. If this is incorrect it can double
the delivery time. If, however, it arrives in the correct town, misclassifying the second section of the postcode/address will only result in it being sent on the wrong delivery round, delaying it by only half a day to a day.
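The two 'remove the suspect character' strategies above can be sketched directly. The sketch assumes OCR output arrives as (character, confidence) pairs, with '?' marking the missing position; the characters and confidence values are illustrative.

```python
def partial_query(ocr_output):
    """Blank the lowest-confidence character to form a partial match query.
    ocr_output is a list of (character, confidence) pairs."""
    worst = min(range(len(ocr_output)), key=lambda i: ocr_output[i][1])
    return "".join("?" if i == worst else ch
                   for i, (ch, _) in enumerate(ocr_output))

def all_partial_queries(ocr_output):
    """Alternative strategy: try the postcode with each character
    missing in turn."""
    chars = [ch for ch, _ in ocr_output]
    return ["".join("?" if i == j else ch for i, ch in enumerate(chars))
            for j in range(len(chars))]

ocr = [("Y", 0.98), ("0", 0.41), ("1", 0.95), ("5", 0.90),
       ("D", 0.97), ("D", 0.96)]
q = partial_query(ocr)              # -> "Y?15DD"
```

The first strategy issues a single query but relies on substitution errors having low confidence; the second is robust to that assumption but, as section 6 showed, can multiply the number of candidate postcodes to search.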
8.4 OCR
Several OCR techniques were discussed in section 2; however, these need to be
implemented and tested for this particular application. Specifically, the type of hard-
ware used to implement the CMMs would probably be a custom chip, and so any
OCR system which could be implemented using CMMs would almost certainly ben-
efit from the performance improvements of this chip over standard workstations.
Whether the best OCR technique for the job could use CMMs represents another
trade-off which would require consideration within the framework of the specific
project.
One type of search which can be performed using CMMs, and which has not been
mentioned so far, is a probability-based search. This allows the inputs and outputs to the
CMM to be real values rather than binary bits, while keeping the internal weights of
the CMM binary. This has the advantages of inputting real probabilities to the CMM
and producing results according to those probabilities, while retaining the size and
performance advantages of a binary weighted network. So far, it has been assumed
that the output of the OCR system for each character position would simply be the
character with the highest confidence. If the OCR is allowed to output its top few
choices, along with their confidences, these could be overlaid onto the input to the
CMM and a search performed implicitly on all possible combinations of all the input
characters. The output would be biased by the confidences attached to the input
characters and would return the most likely postcode(s). Obviously this requires a
different type of OCR system (real outputs for a number of characters rather than
just the most likely character), and the potential benefits of the probability based
search would have to be weighed against the added complexity of this search and
the different requirements placed on the OCR system.
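A probability based search of this kind can be sketched with a binary weight matrix and real-valued inputs: each output (record) is scored by summing the confidences on the input bits it is connected to. The tiny matrix and confidence values below are illustrative only.

```python
def probability_search(weights, confidences):
    """Real-valued inputs over binary weights: score each output (record)
    by summing the confidences of the inputs connected to it.
    weights[j] is the binary input row for record j."""
    return [sum(c * w for c, w in zip(confidences, row)) for row in weights]

# Two records over a 4-bit input space (weights remain strictly binary).
weights = [
    [1, 0, 1, 0],   # record 0 keyed on inputs 0 and 2
    [0, 1, 0, 1],   # record 1 keyed on inputs 1 and 3
]
# The OCR offers competing character choices with confidences, rather
# than one hard decision per position.
confidences = [0.9, 0.1, 0.8, 0.2]
scores = probability_search(weights, confidences)
best = max(range(len(scores)), key=scores.__getitem__)   # record 0 wins
```

All combinations of the offered characters are effectively searched in one pass, with the output biased by the input confidences, while the stored matrix stays binary.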
8.5 Information Feedback
There is clearly a loop in the overall system design (see Fig. 8.2 on page 108), and this
represents the feeding back of information from the database system to the OCR sys-
tem. This information is in the form of valid addresses which the output of the OCR
system points towards. There are a number of ways that this feedback could be han-
dled. Below are outlined two alternatives, but it is quite conceivable that more could
be investigated.
8.5.1 Algorithmic Processing of Feedback
The output of the database system is likely to be in the form of a list of valid
addresses. This information has to be correlated with the information found in the
address image by the OCR system. It is also likely that some of the characters which
were suggested by the OCR system would be ruled out by the database search
because they represent invalid addresses. If the system is going to iterate round this
loop of recognition and searching, there needs to be some control over the informa-
tion flow. This can be achieved by taking each address as returned by the database
search and comparing it with the characters found in the address image. For exam-
ple, the OCR system may have given very low confidence values to some characters
in the posttown name, but very high confidence to the characters in the postcode.
The database search should then have indicated what posttown corresponds to that
postcode. The OCR system can now be given extra information in terms of what
characters should be present in the posttown. If it knows what characters it is expect-
ing, a bias can be given to those characters and another attempt made at classifying
them. This could also help to resolve cases where the OCR system returned two
characters with very similar confidence, but only one of them is suggested by the
database search. It can now be given a higher confidence.
The output of this iteration would be a new set of information to be passed to the
database search system, and the loop can be continued until either a single address is
found with high enough overall confidence1, or some fixed maximum number of
iterations is reached without resolving the address. The latter case would then result
in a reject of this mail piece from the automated system.
1. The overall confidence would have to take into account the individual confidences from the OCR system on the various parts of the address and the number of potential addresses which the database system suggests would be valid given this output from the OCR system.
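The control loop just described might be organised as in the following sketch. The acceptance rule used (a single candidate whose characters all exceed a confidence threshold) is one possible reading of the overall-confidence requirement, and `toy_ocr` and `toy_search` are illustrative stand-ins, not the real modules.

```python
def recognise(image, ocr, search, threshold=0.9, max_iterations=5):
    """Iterate OCR and database search, feeding candidate addresses back
    as hints, until a single confident address emerges or we give up."""
    hints = None
    for _ in range(max_iterations):
        characters, confidences = ocr(image, hints)
        candidates = search(characters)
        if len(candidates) == 1 and min(confidences) >= threshold:
            return candidates[0]       # single, confident address: accept
        hints = candidates             # bias the next OCR pass
    return None                        # reject: hand over to OCR/VCS operators

# Toy stand-ins: the OCR confuses '0'/'O' until hinted; the search maps
# either reading onto the single valid postcode record.
def toy_ocr(image, hints):
    if hints is None:
        return "Y015DD", [0.95, 0.40, 0.95, 0.95, 0.95, 0.95]
    return "YO15DD", [0.95, 0.92, 0.95, 0.95, 0.95, 0.95]

def toy_search(characters):
    return ["YO1 5DD"] if characters in ("YO15DD", "Y015DD") else []

result = recognise(None, toy_ocr, toy_search)   # -> "YO1 5DD"
```

The fixed iteration limit implements the reject path described above: mail pieces whose address cannot be resolved are passed out of the automated system.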
8.5.2 Asynchronous Processing of Feedback
This method would depend upon the actual implementation of the OCR and data-
base systems, but could result in a greater increase in performance of the overall sys-
tem. It would require that the OCR system be able to take inputs not only from the
address image, but from other sources as well, specifically the database system. Ini-
tially, there would be no output from the database system and this would therefore
have no influence on the OCR system. As the OCR system started to produce out-
puts, these would be fed as they arrived to the database system for searching. When
the search has been completed, the outputs from the database would feed back to the
OCR system and affect its recognition in such a way as to bias it towards the address
features associated with the addresses returned from the search. In turn the OCR
system would produce new outputs, which would again feed into the search. Given
appropriate constraints on the flow of information, the whole system would eventu-
ally settle on the final output address using a kind of relaxation process.
It is possible that some of the work currently underway at the University of York
involving the use of the ADAM network and Cellular Automata (CA) could be use-
ful as a framework for this information flow model, and it would be interesting to
investigate whether this kind of application is suited to a CA-type implementation. If
so, it might be possible to implement the feedback system on the same custom
hardware as the high-speed database lookup system. This would obviously be
advantageous as far as communication efficiency was concerned.
Tight control of the process would be needed to make sure the system converged
onto an address or rejected the mail piece within a given time, rather than oscillating
or diverging. However this removes the burden of actually trying to decide before-
hand which pieces of information would be useful in the feedback loop and building
them into the control process — this system could be tuned or even evolved by
adjusting the parameters controlling the relaxation process.
8.6 System Design
The eventual aim of this research is to provide an improved automated address rec-
ognition system. It is clear that there will be many component parts to such a system
and there are alternatives for the implementation of each component. In order to
properly assess the impact of the choice of one component implementation over
another, it is necessary to have an overall view of the system and how it will interact
with the existing hardware of the sorting offices. It is also crucial to completely modularise the system to allow alternative approaches to each component to be implemented and tested. Without this, it will be very difficult to assess the performance of
the system objectively. An outline of the system is shown in Fig. 8.2.
Fig. 8.2 - System outline of the automated address recognition system: mail stream → camera → OCR ↔ search engine (backed by the PAF index) → machine-readable code system
There are almost certainly parts of this system which are already in place. For exam-
ple, the camera which images the mail piece and the system for printing the machine
readable code on the mail piece are already in use. The exact interfaces would have
to be specified to ensure any new system would work within these modules. Some
kind of control mechanism would be required to handle the loop between the OCR
and Search Engine. This could be as simple as a threshold which must be reached by
the address recognition system before it is taken as correct. However there must be
some way for the system to identify when an address cannot be recognised. Then,
the image of the mail piece must be passed on to the OCR/VCS1 system, which
currently handles the mail that cannot be automatically recognised.
The interface to this system would require specification as well.
There is clearly a lot of work to be done as far as the system is concerned. This report
has concentrated mainly on the components of that system in isolation and no
attempt has been made to integrate them. This is left for the actual implementation,
as there are many issues concerning the components which must be resolved before
that can reasonably be addressed.
8.7 Summary
Many diverse issues related to the automated recognition of postal addresses have
been considered, from the initial OCR of the characters which make up the address,
through to an outline of a system for generating the most likely address record from
the PAF. It is not surprising that many more questions have been raised than have
been answered; as this report is intended to provide a foundation for further work,
that is perhaps its most useful result. The issues to be addressed are:
• The implementation of an OCR module
• Which parts of the address image are to be considered when attempting
to interpret the address
• The method of searching the PAF for the matching record
• If the CMM method is to be used, the problem of ghosting and its possi-
ble solutions
1. OCR/VCS stands for Optical Character Recognition/Video Coding System. It is the name for the system which takes address images from mail pieces which cannot be recognised by the automated system, and presents them on video screens to human operators who key in the address information by hand.
• The integration of the verification stage with the OCR module, to provide
greater reliability of recognition
• The speed with which the whole operation can be performed
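The CMM method and the ghosting problem listed above can be illustrated with a minimal Willshaw-style correlation matrix memory. The binary codes below are toy examples; a real system would use sparse codings of address fields, and ghosting (spurious output bits) appears as more patterns are superimposed in the matrix.

```python
import numpy as np

class CMM:
    """Minimal binary correlation matrix memory (Willshaw-style sketch)."""

    def __init__(self, in_bits, out_bits):
        self.M = np.zeros((out_bits, in_bits), dtype=np.uint8)

    def train(self, x, y):
        # Hebbian storage: OR in the outer product of the binary pair.
        self.M |= np.outer(y, x).astype(np.uint8)

    def recall(self, x, threshold=None):
        # Sum matched input bits per output line, then threshold.
        s = self.M @ x
        if threshold is None:
            threshold = int(x.sum())  # exact-match (Willshaw) threshold
        return (s >= threshold).astype(np.uint8)

# Toy input/output codes (illustrative only).
x1 = np.array([1, 1, 0, 0, 0, 0], dtype=np.uint8)
y1 = np.array([1, 0, 0, 1], dtype=np.uint8)
x2 = np.array([0, 0, 1, 1, 0, 0], dtype=np.uint8)
y2 = np.array([0, 1, 1, 0], dtype=np.uint8)

m = CMM(6, 4)
m.train(x1, y1)
m.train(x2, y2)

full1 = m.recall(x1)                       # exact query recalls y1
full2 = m.recall(x2)                       # exact query recalls y2
partial = np.array([1, 0, 0, 0, 0, 0], dtype=np.uint8)
part1 = m.recall(partial, threshold=1)     # partial query, relaxed threshold
```

The partial query shows why the technique suits incomplete address data: lowering the threshold recovers the stored pattern from a fragment of the input, at the cost of admitting ghost bits once the matrix becomes heavily loaded.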
There is also the question of applying these searching methods to other problems
within the Post Office. While the automated recognition of addresses is clearly key to
one of its main operations, and as such formed an ideal framework for the research
carried out so far, it is by no means the only area in which this kind of technology
could be of benefit. It is intended to obtain as wide a picture as possible of other
potential uses of such a system, both to guide future work and to give the sponsor a
practical realisation of the research.
9. References

[1] WOLF, PLATT
Wolf R., Platt J. C.
Postal Address Block Location Using A Convolutional Locator Network
Submission to Advances in Neural Information Processing 6, 1994
[2] LEE, CHOI
Lee S., Choi Y.
Robust Recognition of Handwritten Numerals based on Dual Cooperative
Network
International Joint Conference on Neural Networks Vol. 3 pp 760-768, 1992
[3] KERTESZ, KERTESZ
Kertesz A., Kertesz V.
Dynamically Connected Neural Network for Character Recognition
International Joint Conference on Neural Networks Vol. 3 pp 672-676, 1992
[4] WANG, JEAN
Wang J., Jean J. S. N.
Segmentation of Merged Characters by Neural Networks and Shortest Path
Pattern Recognition Vol. 27 No. 5 pp 649-658, 1994
[5] MULGAONKAR ET AL.
Mulgaonkar P. G., Chen C., DeCurtins J. L.
Word Recognition in a Segmentation-Free Approach to OCR
SPIE Vol. 2103 pp 135-141, 1994
[6] SENI, COHEN
Seni G., Cohen E.
External Word Segmentation of Off-Line Handwritten Text Lines
Pattern Recognition Vol. 27 No. 1 pp 41-52, 1994
[7] LIANG ET AL.
Liang S., Shridhar M., Ahmadi A.
Segmentation of Touching Characters in Printed Document Recognition
Pattern Recognition Vol. 27 No. 6 pp 825-840, 1994
[8] YANIKOGLU, SANDON
Yanikoglu B. A., Sandon P. A.
Off-Line Cursive Handwriting Recognition Using Neural Networks
SPIE Vol. 1965 Application of Artificial Neural Networks IV pp 577-588, 1993
[9] KABIR, DOWNTON
Kabir E., Downton A. C.
Syntax and Context in OCR of Handwritten British Postcodes
Draft Paper, University of Essex, Colchester
[10] KABIR ET AL.
Kabir E., Downton A. C., Birch R.
Recognition and Verification of Postcodes in Handwritten and Hand Printed
Addresses
Submission to 10ICPR, University of Essex, Colchester
[11] DOWNTON ET AL.
Downton A. C., Kabir E., Guillevic D.
Syntactic and Contextual Post-Processing of Handwritten Addresses for OCR
Draft Paper for 9ICPR, University of Essex, Colchester
[12] HENDRAWAN, LEEDHAM
Hendrawan, Leedham C. G.
Verification of Constrained Postcode Recognition Using Global Features Extracted
From The Handwritten Address - Verification
Commercial Report, University of Essex, Colchester, 1991
[13] LEEDHAM, JONES
Leedham C. G., Jones P. E.
Automatic Sorting of Australian Handwritten Letter Mail Using OCR and Address
Feature Verification
TENCON '92 Vol. 1 pp 287-291, 1992
[14] TREGIDGO, DOWNTON
Tregidgo R. W. S., Downton A. C.
Generalised Parallelism for Embedded Vision Systems: An Application to Real
Time OCR of Postal Addresses
Submission to 6th International Conference on Image Analysis and Processing,
University of Essex, Colchester
[15] TREGIDGO, DOWNTON
Tregidgo R. W. S., Downton A. C.
Scalable Parallelism for Embedded Vision Applications: The Generalised Tree
Pipeline
Submission to Transputer Applications '91, University of Essex, Colchester, 1991
[16] TREGIDGO, DOWNTON
Tregidgo R. W. S., Downton A. C.
A Design Philosophy for Scalable Parallel Embedded Vision Systems
University of Essex, Colchester
[17] ROVNER ET AL.
Rovner R. M., Gillies A. M., Ganzberger M. J., Hepp D. J.
Strategies for the Automatic Interpretation of Handwritten Addresses
SPIE Vol. 2103 pp 174-185, 1994
[18] LEEDHAM
Leedham C. G.
Comparison of Optical Recogniser Performance in Postal Applications
Commercial Report, University of Essex, Colchester, 1993
[19] HENDRAWAN, LEEDHAM
Hendrawan, Leedham C. G.
Verification of Constrained Postcode Recognition Using Global Features Extracted
From The Handwritten Address - Address Segmentation and Feature Extraction
Commercial Report, University of Essex, Colchester, 1991
[20] GORSKY
Gorsky N. D.
Experiments with Handwriting Recognition Using Holographic Representation of
Line Images
Pattern Recognition Letters 15 pp 853-859, 1994
[21] LECUN ET AL.
LeCun Y., Boser B., Denker J. S., Henderson D., Howard R. E., Hubbard W., Jackel
L. D.
Handwritten Digit Recognition with a Back-Propagation Network
Neural Information Processing Systems Vol 2, 1990
[22] WANG, JEAN
Wang J., Jean J. S. N.
Multi-resolution Neural Networks for Omnifont Character Recognition
IEEE International Conference on Neural Networks pp 1588-1593, 1993
[23] DRUCKER ET AL.
Drucker H., Schapire R., Simard P.
Boosting Performance in Neural Networks
International Journal of Pattern Recognition and Artificial Intelligence Vol. 7 No. 4
pp 705-719, 1993
[24] GUPTA ET AL.
Gupta A., Nagendraprasad M. V., Liu A., Wang P. S. P., Ayyadurai S.
An Integrated Architecture for Recognition of Totally Unconstrained Handwritten
Numerals
International Journal of Pattern Recognition and Artificial Intelligence Vol. 7 No. 4
pp 757-773, 1993
[25] MARTIN ET AL.
Martin G. L., Rashid M., Pittman J. A.
Integrated Segmentation and Recognition through Exhaustive Scans or Learned
Saccadic Jumps
International Journal of Pattern Recognition and Artificial Intelligence Vol. 7 No. 4
pp 831-847, 1993
[26] BURGES ET AL.
Burges C. J. C., Ben J. I., Denker J. S., LeCun Y., Nohl C. R.
Off Line Recognition of Handwritten Postal Words using Neural Networks
International Journal of Pattern Recognition and Artificial Intelligence Vol. 7 No. 4
pp 689-704, 1993
[27] YOUNG, FU
Young T. Y., Fu K.
Handbook of Pattern Recognition and Image Analysis
Orlando, Academic Press, 1986-1994
[28] HARTIGAN
Hartigan J. A.
Clustering Algorithms
Yale University, 1975
[29] O’KEEFE, AUSTIN
O’Keefe S. E. M., Austin J.
Application of the ADAM Associative Memory to the Analysis of Document
Images
Proceedings of the Weightless Neural Network Workshop pp 17-22, 1995
[30] MARTIN, RASHID
Martin G. L., Rashid M.
Recognizing Overlapping Hand-Printed Characters by Centered-Object Integrated
Segmentation and Recognition
Advances in Neural Information Processing Systems Vol 4 pp 504-511, 1992
[31] WILLSHAW ET AL.
Willshaw D. J., Buneman O. P., Longuet-Higgins H. C.
Non-Holographic Associative Memory
Nature Vol 222 pp 960-962, 1969
[32] NADAL, TOULOUSE
Nadal J., Toulouse G.
Information Storage in Sparsely Coded Memory Nets
Network I pp 61-74, 1990
[33] AUSTIN, STONHAM
Austin J., Stonham T.
An Associative Memory for use in Image Recognition and Occlusion Analysis
Image and Vision Computing Vol. 5 No. 4 pp 251-261, 1987
[34] RIVEST
Rivest R. L.
Partial-Match Retrieval Algorithms
SIAM Journal of Computing Vol. 5 No. 1 pp 19-50, 1976
[35] BURKHARD
Burkhard W. A.
Partial Match Retrieval
BIT 16 pp 13-31, 1976
[36] KIM, PRAMANIK
Kim M. H., Pramanik S.
Optimal File Distribution for Partial Match Retrieval
Proceedings of Sigmod International Conference on Management of Data pp 173-182, 1988
[37] KENNEDY
Kennedy J. V.
An Exploration into Novel Architectures for Uncertain Reasoning
First Year Report, University of York, 1995
[38] FILER
Filer R.
Symbolic Reasoning in an Associative Neural Network
Masters Thesis, University of York, 1994
[39] LUCAS
Lucas S. M.
Rapid Best-First Retrieval from Massive Dictionaries
Submission to IEEE International Conference on Neural Networks, 1995
[40] LUCAS
Lucas S. M.
High Performance OCR with Syntactic Neural Networks
Artificial Neural Networks Publication No. 409 pp 133-138, 1995
[41] ELLIMAN, LANCASTER
Elliman D. G., Lancaster I. T.
A Review of Segmentation and Contextual Analysis Techniques for Text
Recognition
Pattern Recognition Vol. 23 No. 3/4 pp 337-346, 1990
[42] CHAHAL
Chahal S.
Discrimination of Handwritten from Machine Printed Text
SPIE Vol 2238 pp 190-197, 1994
[43] AUSTIN
Austin J.
Reasoning with Correlation Matrix Memories
Draft Paper, University of York, 1994
[44] AUSTIN ET AL.
Austin J., Kennedy J. V., Pack R., Cass B.
C-NNAP: An Architecture for the Parallel Processing of Binary Neural Networks
Proceedings of the Weightless Neural Network Workshop pp 23-28, 1995
[45] AUSTIN ET AL.
Austin J., Kennedy J. V., Lees K.
The Advanced Uncertain Reasoning Architecture, AURA
Proceedings of the Weightless Neural Network Workshop, 1995