DRAFT
Improving Automated Postal
Address Recognition
David Lomas
Submitted for the Degree of Master of Science
University of York
Department of Computer Science
June 1996
Abstract
Improving the efficiency of an automated address recognition system is key to
improving the overall efficiency of the UK’s mail delivery system. It is clear that
Optical Character Recognition (OCR) is fundamental to such a system. However, an
extensive survey of the current research shows that most groups involved in this
area agree that the way to improve current systems’ performance is to incorporate
context information into the recognition process. The problem then becomes one of
efficiently processing the large volume of data and refining it to an address. This
represents a need for the efficient searching of large databases with partial or
incomplete queries. A technique using Correlation Matrix Memories (CMMs) would seem
ideal as it allows this type of query to be made extremely efficiently. One major
problem with this method is identified and a solution proposed. The final section also
contains details of a number of questions raised during this research; it is
intended to follow these up over the course of the next 3 years.
Contents

List of Figures
1. Introduction
2. OCR
   2.1 Introduction
   2.2 Machine Printed Character Recognition
   2.3 Hand Printed Character Recognition
       2.3.1 Printed Writing
       2.3.2 Cursive Writing
   2.4 Summary
3. Verification
   3.1 Introduction
   3.2 Review
   3.3 Summary
4. Partial Matching
   4.1 Introduction
   4.2 Review
   4.3 Correlation Matrix Memories
       4.3.1 Storage Capacity of a CMM
       4.3.2 Coding of Input and Output Patterns
5. Ghosting
   5.1 Introduction
   5.2 Problems Caused by Ghosting
   5.3 Maximum Ghosting Sets
       5.3.1 Generating the Sets
   5.4 Analysis of some Maximum-Ghosting Sets
       5.4.1 Quadratic Model
       5.4.2 Cubic Model
       5.4.3 Exponential Model
       5.4.4 Set Size Ratio Model
       5.4.5 Comparison of Models
   5.5 Conclusions
   5.6 Summary
6. Analysis of PAF
   6.1 Introduction
   6.2 Format of the Postcode
   6.3 Missing Characters
7. Feasibility
   7.1 Introduction
   7.2 Speed of Database Access
   7.3 Other Factors
8. Conclusions and Further Work
   8.1 Code Generation
   8.2 Values of k
   8.3 Strategies for Verification
   8.4 OCR
   8.5 Information Feedback
       8.5.1 Algorithmic Processing of Feedback
       8.5.2 Asynchronous Processing of Feedback
   8.6 System Design
   8.7 Summary
9. References
List of Figures

Fig. 2.1  Example of poor image quality from scanning machine printed text (taken from [5 Mulgaonkar et al.])
Fig. 2.2  Example of touching handwritten characters (taken from [19 Hendrawan, Leedham])
Fig. 2.3  Diagram of how the features of an image ‘vote’ for the objects which could have generated them
Fig. 2.4  Table of improvements to OCR system using a combination of 3 networks over a single network
Fig. 2.5  The four reference lines used by the system described in [8 Yanikoglu, Sandon]
Fig. 2.6  Summary of results for the OCR systems reviewed
Fig. 3.1  A diagram of the first stage of the SNN method for retrieving valid postcodes
Fig. 3.2  Diagram of the matrix formed at each node of the SNN
Fig. 3.3  Block diagram of the way information is processed in [39 Lucas]
Fig. 4.1  Diagram of a simple correlation matrix memory
Fig. 4.2  A CMM during recall
Fig. 4.3  Example input pattern coding for a CMM to use partial matching
Fig. 4.4  Result of recalling ‘C?T’ from a CMM
Fig. 4.5  Superimposition of 2 7-segment number patterns
Fig. 5.1  Example of superimposed codes generating a ghost
Fig. 5.2  Example of orthogonal codes which can ghost any other code
Fig. 5.3  Times to complete exhaustive search of some small code sets
Fig. 5.4  Graphs of set size against code width for k3s2g1 and k3s2g2
Fig. 5.5  Graphs of set size against code width for k4s2g1 and k4s2g2
Fig. 5.6  Graphs of quadratic functions against experimental data for sets k3s2g2 and k4s2g2
Fig. 5.7  Graphs of cubic functions against experimental data for sets k3s2g2 and k4s2g2
Fig. 5.8  Combined graphs showing exponential functions against experimental data for k3s2g2 and k4s2g2
Fig. 5.9  Graphs of ratio functions against experimental data for sets k3s2g2 and k4s2g2
Fig. 5.10 Table of predicted k3s2g2 and k4s2g2 set sizes for various widths
Fig. 5.11 Table of predicted code widths for a storage requirement of 866026 associations
Fig. 6.1  The syntax of the postcodes
Fig. 6.2  Analysis of five-character postcodes
Fig. 6.3  Analysis of six-character postcodes
Fig. 6.4  Analysis of seven-character postcodes
Fig. 7.1  Estimated code widths for the 3 classes of postcode, using Eqn. 5.14
Fig. 7.2  Time taken to search each database for one specific postcode
Fig. 7.3  Total size of codes required to represent each class of postcode
Fig. 7.4  Overall time to recover actual postcode
Fig. 8.1  Representation of 3 bit binary codes as vertices of a 3-dimensional cube
Fig. 8.2  System outline of automated address recognition system
1. Introduction
The research presented here was sponsored by The Post Office and is driven by a
need to increase the performance of the automatic sorting machines employed
throughout the country. The sorting process consists of two stages known as out-
bound and inbound. The outbound stage involves identifying the destination post-
town for the mail piece. The inbound stage (which is performed on mail already in
the correct town, or mail which has arrived from outbound sorting in a different
town) involves sorting the mail into delivery rounds which can then be collected by
the delivery personnel. The automatic recognition of the address is the first stage of
these sorting processes, and from this a machine readable code is printed on the mail
piece in the form of phosphor dots which can be read by all the other automated
machines in the sorting path. The aim of this year’s research was to find a way of
improving the performance of the address recognition system.
An extensive survey of the relevant work showed that current OCR technology is
very close to the theoretical limit in terms of recognition rate for machine printed
characters, and therefore, the only improvement which can be made is the speed
with which that recognition is performed. However there is also a limit to which
increasing the recognition speed is advantageous in this application as the overall
goal is not to increase throughput but to increase the reliability of the system. It was
therefore proposed that spending time trying to increase the reliability of recognising
individual characters was not the best way to set about increasing the reliability
of recognising the address as a whole. What was needed was a system capable of
verifying and correcting the small number of errors which were made by the OCR
system. This in turn led to an evaluation of database technology, specifically using
Correlation Matrix Memories (CMMs) as the engine for the database. Some problems
which occur when CMMs are used to perform partial matching were investigated,
and a possible solution proposed.
The actual sorting machines are supplied to The Post Office by the German company,
AEG. It is hoped that, at some point, the results of this research can be integrated into
the new machinery. This would almost certainly require the involvement of AEG,
but the situation between The Post Office and AEG is politically sensitive at present
and no attempt has been made to contact AEG so far. However, the new machine is
intended to be highly modular in design, and it should present few problems for the
address recognition system to be upgraded or even replaced by a more powerful
system in the future.
2. OCR
Optical Character Recognition is the area of computing that concerns itself with the
ability of computers to interpret printed characters. The characters may be from a
standard alphabet or one designed with computer recognition in mind. They may be
produced by machine or by a human writer. There are also a number of methods
with which the document may be translated into machine-readable form, such as
scanning, imaging using a camera or direct entry by writing onto a touch-sensitive
screen.
2.1 Introduction
The OCR research currently being carried out is split into 3 areas. These are:
• Machine Printed Character Recognition (MPCR)
• On-line Hand Printed Character Recognition (OnHPCR)
• Off-line Hand Printed Character Recognition (OffHPCR)
The strategies applied to OnHPCR use information which is only available when the
automated system can be used during the process of writing the characters. Typi-
cally, the characters are written onto some sensitive screen using a wand, and the
computer system imitates the ink by colouring in pixels which the wand passes over.
This allows the system to record stroke information such as order, direction and
speed of the strokes which are used to form each character. There are a number of
commercial systems available at present in the form of portable computers which
use this type of OCR, and they usually include recognition of other characters or ges-
tures, which allow the user to command the machine in certain ways. For example,
they allow a cross to be written over a word in a word-processor thus indicating the
‘delete word’ function. These sorts of characters, and particularly the real-time stroke
information, are obviously not available from a scanned image of hand printed
characters, so the techniques used are not applicable here. Therefore, on-line hand
printed character recognition will be disregarded for the remainder of this report,
and off-line hand printed character recognition, which is what is being considered
here, will be referred to simply as hand printed character recognition (HPCR).
There is an obvious difference between machine printed characters and hand printed
characters. The variation found in hand printed characters is far greater than in
machine print. The only real variations in machine print are font style and size,
and the number of these is, for all intents and purposes, finite. However, with hand
printed characters, even given the same writer and character, there can be huge vari-
ations in the form of the character image. Not only does the character change every
time it is written, but it can also change shape simply because of the character it is
next to. As a result of this, the recognition rate for MPCR is much higher than for
HPCR, as it turns out to be a much simpler problem.
Apart from simply being able to recognise characters, the system must be able to
extract them from the image. It is very rare that characters appear alone and isolated
from those they are associated with in forming a word. This would only happen on
forms with boxes for characters, and then only if the writer had carefully followed
the outlines of the boxes and kept each character completely within each box. In real-
ity characters tend to touch one another, even with machine printed characters. It
may appear at first sight that machine printed text would be simple to segment into
individual characters, but there are various reasons why this is not so.
Fig. 2.1 - Example of poor image quality from scanning machine printed text (taken from [5 Mulgaonkar et al.])
As can be seen in Fig. 2.1, the image quality of a scanned, machine printed document
is not always perfect. Dot-matrix printers, especially high speed ones can produce
very smudged text, so much so that the characters actually run into each other. This
effect is made worse by the scanning procedure which has to quantise the image into
pixels. If the gap between two characters is smaller than the pixel size, they will be
imaged as touching. Secondly, in this particular application, the imaging has to be
very fast, as the scanning process is on-line within the sorting machine. This leads to
reduced resolution being employed, and also, as the mail piece is moving, tends to
smear characters along the horizontal axis. Again, this tends to render them as
touching. Characters in proportionally spaced fonts can also overlap in the sense
that there is no vertical white space between characters. This is due to kerning,
where the characters are moved closer together to give a more pleasing appearance
to the human reader. Usually there is still separation between them but it is no longer
trivial to find it, and it sometimes does not even follow a straight let alone vertical
path.
With handprinting, the problem of touching characters is much worse, as can be
seen in Fig. 2.2. Firstly, it is much more natural for people to write ‘joined-up’. This
means there is no intentional break in the characters. Secondly, even if the writer is
deliberately writing separate characters, there is a tendency for some characters to be
joined together unintentionally, simply because people are in the habit of joining
them together. The same problem can occur as with proportionally spaced machine
print, when characters are not physically joined, but their enclosing rectangles over-
lap.
Fig. 2.2 - Example of touching handwritten characters (taken from [19 Hendrawan, Leedham])
The upshot of all this is that it is as much of a problem, if not more, to segment the
image into individual characters as it is to recognise the characters themselves. In the
approaches to OCR reviewed here, some operate only on isolated (segmented) char-
acters, and some attempt both segmentation and recognition. In some, the two proc-
esses are independent, and in others, they are integrated.
The remainder of this section will consider each of these areas in turn: Machine
Printed Character Recognition, Hand Printed Character Recognition and Cursive
Writing Recognition. Some relevant publications are
reviewed and the details of the particular method are presented. A summary of the
results achieved by each of the systems reviewed is presented at the end of the sec-
tion, along with a discussion of some of the more salient points with respect to the
application of automated mail sorting.
2.2 Machine Printed Character Recognition
In [22 Wang, Jean], the authors present a multi-resolution neural network system
which is capable of recognising isolated machine printed characters in any font.
They describe a number of different configurations which are all based on the idea of
using a low resolution neural network which can operate at high speed to perform
an initial attempt at recognising the characters. This network is intended to recognise
around 85% of the characters at a resolution of only 12×8 pixels. The second network
uses a resolution of 24×20 pixels, and a more complex neural network (four hidden
layers instead of one). Consequently this network is more computationally expen-
sive, but is only used on the 15% of characters which cannot be recognised by the
first network. The results show that the first network can operate at 50 times the
speed of the second, but the slow network is still being used 15% of the time which
limits the overall speed-up. In order to reduce this limiting factor, a third network
was introduced which worked at the same resolution as the second network, but had
only one hidden layer as with the first network. This new network was used
between the first and second networks, and was able to recognise 80% of the rejects
from the first network. This reduced the use of the slow second network to only 3%
of characters, and represented a speed-up of between 14 and 20 times for the whole
system. They also used a weighted voting scheme when none of the networks could
successfully recognise a character to allow evidence from the 3 networks to be drawn
together and an overall decision made.
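The staged arrangement described here can be sketched as a generic classifier cascade. The sketch below is an illustration of the general technique, not the networks of [22]: the classifiers, thresholds and vote weights are hypothetical stand-ins.

```python
def cascade_classify(image, stages):
    """Run a cascade of classifiers, cheapest first.

    `stages` is a list of (classify, threshold, weight) triples, where
    classify(image) returns a (label, confidence) pair. The first stage
    confident enough wins, so expensive stages only see hard inputs.
    If no stage is confident, a weighted vote over all stages decides.
    Returns (label, index of the deciding stage).
    """
    votes = {}
    for i, (classify, threshold, weight) in enumerate(stages):
        label, confidence = classify(image)
        if confidence >= threshold:
            return label, i
        # not confident: accumulate weighted evidence for the final vote
        votes[label] = votes.get(label, 0.0) + weight * confidence
    # no stage was confident: combine evidence from all stages
    return max(votes, key=votes.get), len(stages)
```

The point of the arrangement is that the slow stage is only charged against the small fraction of inputs the fast stages reject, which is where the 14-20 times overall speed-up comes from.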
Various configurations of the networks were tested, with the best achieving a 99.81%
recognition rate on a test set which included characters from first and second genera-
tion photocopies, at speeds of 107-148 characters per second on a DEC workstation
rated at 42 MIPS. It is interesting to note that on a random subset of the test set, the
authors themselves only achieved a 99.83% recognition rate, and it is noted in [22]:
“Although there are 62 classes (A-Z, a-z, 0-9) in each font, some of them cannot be distinguished from each other after normalisation and they are considered equivalent for recognition purposes.”
This potentially poses a serious problem for address recognition, as two valid post-
codes could be generated by the characters which could not be distinguished. In
actual fact however, this is not likely to occur, as the main candidates for confusion
are ‘l’ (ell) & ‘1’ (one), and ‘O’ (oh) & ‘0’ (zero). In the first case, postcodes are not
generally written in lower case although it is not inconceivable that this could hap-
pen. Secondly, there are no postcodes which contain either ‘O’ or ‘0’ in a character
position where both would be legal, therefore the confusion can be resolved by using
grammatical rules of the postcodes.
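This kind of grammatical resolution can be illustrated with a toy check. The template below (one character class per position) is a deliberately simplified illustration, not the real postcode syntax, which is set out later in Fig. 6.1.

```python
# Toy illustration of resolving OCR confusions by position class.
# The confusion table and the 'A'/'9' template notation are invented
# for this sketch, not taken from the PAF specification.
CONFUSIONS = {'O': '0', '0': 'O', 'L': '1', '1': 'L', 'I': '1'}

def resolve(chars, template):
    """template: one class per position, 'A' = letter, '9' = digit.

    Any character which is illegal for its position but has a known
    confusable twin is swapped for that twin.
    """
    out = []
    for ch, cls in zip(chars, template):
        ch = ch.upper()
        legal = ch.isalpha() if cls == 'A' else ch.isdigit()
        if not legal and ch in CONFUSIONS:
            ch = CONFUSIONS[ch]  # swap to the confusable twin
        out.append(ch)
    return ''.join(out)

resolve('Y0', 'AA')  # zero is illegal in a letter position -> 'YO'
resolve('1O', '99')  # oh is illegal in a digit position   -> '10'
```

Because no postcode position admits both ‘O’ and ‘0’, the swap is always unambiguous in this case.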
In [4 Wang, Jean], the neural network system mentioned above is integrated with a
character segmentation system which allows scanned printed documents to be ana-
lysed. The segmentation system uses a hybrid of neural networks and conventional
algorithms to determine the best cut position for merged characters. There is a two
stage approach to the segmentation. Firstly, if a character is rejected from the OCR
system, or has a large aspect ratio (i.e. is much wider than ‘normal’ characters), then
it is immediately segmented. The second approach is used if the character is classi-
fied during OCR, but then fails a subsequent spelling check. Each character in the
word is then examined by a neural network which is trained to identify touching
characters. The training of this network is on character pairs which are generated by
an algorithm designed to produce likely touching pairs — for example it doesn’t
generate touching upper-case characters (as it is suggested that these would be
rejected anyway on the grounds that the resulting image would be too wide), or
pairs of characters in different fonts (as there is rarely a font change in the middle of
a word). However the output of this network is a simple yes/no answer when applied
to a character pair, indicating whether or not the network considers the pair to be
touching. The actual segmentation is left to a later stage. It would seem that there is
an opportunity missed here to allow the network to assist in the segmentation by
providing a method for it to suggest a suitable cutting point. While this would
undoubtedly complicate the network and require more training data (a cut point
associated with each character pair), it would seem that the benefit to the segmenta-
tion algorithm would outweigh this initial effort. A possible counter argument is
that the network is trained on only a small subset of touching pairs, and is required
to generalise over the whole set of possible pairs it might encounter while processing
a document. It would be difficult to see how it could infer the correct cutting point in
an unseen character pair from this. However, there are only 26 different initial char-
acters for each possible pair. Providing the network is trained with at least one exam-
ple from each of these 26 classes, it will be able to suggest the correct cutting point.
The second character in the touching pair would not influence the width of the first
character, and so the cutting point would be correct no matter what the second char-
acter was. It would still be possible for the network to generalise over the set of all
touching pairs even though it has only been trained on a small subset. The fact that
the subset contains one example for every possible initial character allows the cut-
ting point to be suggested by the network even for unseen pairs.
The actual segmentation is carried out by a shortest path algorithm which attempts
to find the least cost curve from the top to the bottom of the character pair image.
The cost is defined in terms of the number of pixels involved in the path, and the
number of those that are set (i.e. form part of the actual character image rather than
the background). An extra penalty is also applied to paths which take diagonal steps,
in an attempt to keep the path as vertical as possible. Once a cut is proposed, the
segmented images are passed back to the OCR system and another classification is
performed. This procedure is repeated until either the characters are classified with high
confidence or no more low cost cuts can be made. The system also checks for touch-
ing character triples in this way. As soon as the left hand portion of the image is rec-
ognised, the remaining portion is deemed to be a touching pair and is segmented
repeatedly until it is classified or cannot be segmented any more.
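A least-cost cut of this kind can be sketched with dynamic programming. The cost weights below are arbitrary illustrative choices, not the values used in [4]; the structure (pixel cost plus a diagonal-step penalty, one step per row) follows the description above.

```python
def best_cut(image, ink_cost=10, diag_cost=1):
    """Least-cost top-to-bottom cut through a binary image (rows of 0/1).

    Each step moves down one row, staying in the same column or moving
    one column left or right; crossing a set (ink) pixel and taking a
    diagonal step are both penalised. Returns (total cost, column per row).
    """
    h, w = len(image), len(image[0])
    cost = [image[0][c] * ink_cost for c in range(w)]
    back = []
    for r in range(1, h):
        new, prev = [0] * w, [0] * w
        for c in range(w):
            best, arg = float('inf'), c
            for pc in (c - 1, c, c + 1):        # reachable predecessors
                if 0 <= pc < w:
                    step = cost[pc] + (diag_cost if pc != c else 0)
                    if step < best:
                        best, arg = step, pc
            new[c] = best + image[r][c] * ink_cost
            prev[c] = arg
        cost = new
        back.append(prev)
    c = min(range(w), key=cost.__getitem__)     # cheapest bottom cell
    total = cost[c]
    path = [c]
    for prev in reversed(back):                 # recover the cut column per row
        c = prev[c]
        path.append(c)
    return total, path[::-1]
```

A cut through an all-white column costs nothing, so the path naturally follows the gap between two touching characters where one exists.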
Overall, this system performs admirably. It achieved a character recognition rate of
99.71% on various documents scanned at 300dpi. The speed of the recognition was
not reported. However, it is unlikely to be high, given the multitude of complex
stages involved in the process.
A spelling checker is used to catch errors which are very difficult to detect from the
image. One such error is the character pair ‘rt’ being classified as ‘n’. However as
with the character recognition there are some images which simply cannot be distin-
guished by any of these methods, cf. ‘close’ & ‘dose’, ‘stern’ & ‘stem’, and the only
solution is to employ some kind of context information which will show that one
word is permissible in a particular context whereas the other is not.
In [7 Liang et al.], a discrimination function is presented for segmenting touching
characters. The function relies on the pixel and profile projections of the character
shapes. The function is implemented as a dynamic recursive system which repeat-
edly segments the image and attempts OCR on the results until the OCR system clas-
sifies the segments with high confidence. This is then taken to be the correct
segmentation. It is a very similar approach to the previous one, although it uses a
very different implementation.
The OCR system uses a minimum distance classifier applied to the border chain
codes of the character images. The authors also developed a novel solution to the
problem of large chain code variations due to relatively small input image changes.
A chain code is stored as a histogram with four bins: one for horizontal lines, one for
vertical lines and one each for the 2 orientations of diagonal line which are found in
the border of the character image. The image is split into 16 (4×4) rectangles and the
histogram for each calculated. The large variation occurs when the edge of a charac-
ter moves from one of these rectangles to another, and all the information associated
with that edge moves from one histogram to another. Their solution to this problem
was to make the 16 rectangles overlap by an amount which could be altered until the
system performed best. This way, small variations in the position of the edges of a
character are unlikely to move the edge outside the rectangle, and the edge
information will be included in more than one histogram. Their experiments show
that there is a 23% decrease in the Euclidean distance between input patterns and the
stored patterns when rectangles of 64×80 pixels are overlapped by 8 pixels
horizontally and 10 pixels vertically.
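The overlapping-zone idea can be sketched as follows. The zone geometry and direction coding here are simplified stand-ins for the 4×4 grid of 64×80-pixel zones described above, and the function signature is invented for illustration.

```python
def zone_histograms(segments, width, height, zones=4, overlap=2):
    """Build a 4-bin direction histogram for each of zones×zones
    overlapping rectangles covering a width×height image.

    `segments` is a list of (x, y, direction) border chain-code elements,
    direction in {'H', 'V', 'D1', 'D2'} (horizontal, vertical, and the
    two diagonal orientations). Because each zone is grown by `overlap`
    pixels on every side, a segment near a zone boundary falls into more
    than one histogram, damping the large feature-vector jumps that occur
    when an edge crosses a hard zone boundary.
    """
    bins = {'H': 0, 'V': 1, 'D1': 2, 'D2': 3}
    zw, zh = width / zones, height / zones
    hist = [[[0] * 4 for _ in range(zones)] for _ in range(zones)]
    for x, y, d in segments:
        for zy in range(zones):
            for zx in range(zones):
                if (zx * zw - overlap <= x < (zx + 1) * zw + overlap and
                        zy * zh - overlap <= y < (zy + 1) * zh + overlap):
                    hist[zy][zx][bins[d]] += 1
    return hist
```

A segment sitting within `overlap` pixels of a zone edge is counted in both neighbouring histograms, which is exactly the redundancy the authors exploit.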
Finally, a contextual analysis of the image of the text line allows the system to pre-
vent characters such as ‘h’ being split into ‘l’ and ‘I’. This is achieved by recognising
the fact that the ‘I’ would have to be much smaller than the ‘l’ for them to form the
image of an ‘h’ when touching. This variation in font size within a word is disal-
lowed.
Overall the system performs very well, with an average character recognition rate of
99.65% from 300dpi scanned images of a multi-column newspaper type publication.
The speed of recognition was not reported.
In [5 Mulgaonkar et al.], it is stated that
“... half the errors in character recognition are due to [poor] segmentation.”
The approach they adopt is to avoid completely the segmentation step and use a fea-
ture voting scheme similar to the one in [29 O’Keefe, Austin]. This is noted in [29] to
be akin to the Generalised Hough Transform which is used to recognise arbitrary
objects in images by accumulating evidence for them in some kind of array. In [5] the
array is 1-dimensional, representing the line of characters which make up the word
being recognised. In [29] however, the array is 2-dimensional and each cell in the
array represents evidence for an object at a particular location in the image. The
method used for collecting evidence in [5] is a simple sequential search through a
library of features which have been previously extracted from examples of the
objects (in this case, characters) that the system is to recognise. As soon as the match
between the current input and a character from the library is above a certain thresh-
old the character is considered to have been classified. It is noted in the report that
this is a very inefficient method of searching the library for a matching character. The
authors suggest this could be improved using hashing and indexing techniques,
however they are not specific about how they plan to implement this. The recogniser
used in [29] is a neural network which is much more suited to the fuzzy matching
which has to be performed on the input. This system has to handle a 2-dimensional
input image and so has to be much more efficient to allow images to be processed in
reasonable time. It is likely that a similar approach would be of great benefit to the
OCR task tackled in [5].
The input window of the classifier is scanned over the input image and whenever a
known feature is recognised, a ‘vote’ for the object/character which could have gen-
erated that feature is stored in an accumulator array. The array has an entry for each
position in the original image which could contain an object or character. A vote is
placed in the entry associated with the position in the input image where the object/
character would have to be in order for the feature to be present at the location it was
found. With reference to Fig. 2.3, the input window currently contains a feature
which the system has been trained on. This feature happens to be the lower left cor-
ner of a square. It is possible to infer from this that the centre of the square must lie
on the dotted line drawn from the apex of the corner, extending up and to the right.
All the accumulator elements which lie on this line are then incremented. This proc-
ess is repeated for the other corners, and the accumulator entries in the centre of the
square will have been incremented four times, whereas the others will have been
incremented a maximum of twice. This local maximum is used to locate the centre of
the square.
Fig. 2.3 - Diagram of how the features of an image ‘vote’ for the objects which could have generated them.
At the end of the recognition process the accumulator will contain local maxima or
‘peaks’ where the objects are most likely to be, as they were voted for by the most
features. A thresholding algorithm then decides which of these objects are suffi-
ciently evident to include them in the final output. In the OCR system in [5], a lexi-
con is used to restrict the outputs to valid words. The best matching valid word is
selected based upon the characters with highest confidence in the output array. The
authors reported a word recognition rate of 80% but it was expected that this could
be improved by using more character features during the recognition stage (they
listed several in the report, but only used one — contours of characters — during the
tests).
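The voting scheme just described is essentially a generalised Hough transform. A minimal sketch follows, assuming a toy 20×20 accumulator and an invented corner feature set; none of the names or values come from [5] or [29]:

```python
# Sketch of the voting (generalised Hough) scheme: each detected corner
# feature votes along the ray on which the square's centre must lie; where
# all four rays cross, the accumulator peaks. Toy data, for illustration only.
from collections import Counter

GRID = 20  # accumulator is GRID x GRID, one entry per possible centre position

# Direction from each corner type towards the square's centre.
CORNER_DIRS = {
    "lower_left":  (1, 1),   # centre lies up and to the right of this corner
    "lower_right": (-1, 1),
    "upper_left":  (1, -1),
    "upper_right": (-1, -1),
}

def vote(features):
    """Accumulate votes; features is a list of (x, y, corner_type)."""
    acc = Counter()
    for x, y, kind in features:
        dx, dy = CORNER_DIRS[kind]
        cx, cy = x + dx, y + dy          # centre is at least one cell away
        while 0 <= cx < GRID and 0 <= cy < GRID:
            acc[(cx, cy)] += 1           # every point on the ray gets a vote
            cx, cy = cx + dx, cy + dy
    return acc

# Four corners of a square centred on (10, 10) with half-width 4.
corners = [(6, 6, "lower_left"), (14, 6, "lower_right"),
           (6, 14, "upper_left"), (14, 14, "upper_right")]
acc = vote(corners)
centre, peak = max(acc.items(), key=lambda kv: kv[1])
```

The peak of four votes at the true centre is exactly the local maximum the text describes; every other accumulator entry receives at most two votes (where a pair of rays overlaps).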
The system in [29] is a very generalised system capable of recognising arbitrary
objects at arbitrary positions on the document, whereas the OCR system in [5]
requires the document to have been segmented into text lines before recognition can
begin. This is because it only employs a 1-dimensional accumulator array which rep-
resents the line of characters. However it is likely that the system in [29] would not
require this step as it could locate characters at any position within the image of the
document. It would be interesting to compare the performance of the system
described in [29] trained on character images, on a similar document to the one used
in [5].
2.3 Hand Printed Character Recognition
There is really a further subdivision necessary here — that of hand printed charac-
ters versus cursive characters. Hand printed characters tend to be separated from
their neighbours, whereas cursive characters are almost always joined. In
[8 Yanikoglu, Sandon] it is noted that recognition rates of the order of 95% are
achievable for hand printed characters (Martin et al.), but as low as 36% for cursive
writing (Edelman et al.). Their work also assumed isolated words and did not
attempt word segmentation. The results are often much higher for hand printed dig-
its as they tend to be clearly separated. Also, as there are only 10 classes to distin-
guish, the problem is inherently easier. Recognition rates of up to 98% are reported
for handprinted digits (Baptista et al. and Burr). Many different techniques were
used by the systems whose results were reported, including neural networks, radial
basis functions, syntactic and elastic matching. There was no clear method which
achieved better results than all the others. The recognition rate achievable seems to
depend quite heavily on the restrictions which are put on the scope of the recogni-
tion system. For example, one result of a 48% word recognition rate for cursive writ-
ing is reported (Srihari et al.) but the notes which accompany the result show that the
system was only tested on writing supplied by the author who also wrote the train-
ing set.
2.3.1 Printed Writing
A system is presented in [26 Burges et al.] which is applied to both printed digits
and cursive handwriting. The system and its results for printed digits are presented
here. A description of its performance when applied to cursive writing is given in
section 2.3.2 (page 31).
The input images are segmented into ‘cells’ by first locating ‘definite cut’ points
where there is a large amount of white space between adjacent characters. Possible
cuts are then identified using a method named “Modulated Gradient Hit and Deflect”.
This algorithm produces a set of possible segmentation points within the text line.
The segments thereby created are known as ‘cells’. It is assumed by the rest of the
system that it is possible to construct the correct segmentation of the line into charac-
ters by merging some of the cells. Thus the cells represent an over-segmentation of
the text line and the goal of the next stage is to identify which cells should be joined
together to form characters. This is achieved using an exhaustive scan of the possible
combinations by applying them to the character recogniser and using the generated
confidence value to indicate whether this combination is likely to be a good one (i.e.
representing a character). Once a set of combinations is found such that the confi-
dence for each segment is above some threshold, this is taken as the correct segmen-
tation of the text line and the characters are output from the classifier. Obviously it is
not necessary to test every possible combination of cells as it is assumed that the
number of characters in the final word is known1. There is no point trying combinations which produce either more or fewer characters than this.
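The cell-merging search can be sketched as follows; the toy confidence function stands in for the real character recogniser, and all names and values are illustrative rather than taken from [26]:

```python
# Sketch of the cell-merging search: the n cells from over-segmentation are
# grouped into exactly k contiguous runs (k = known character count), each
# run is scored by the classifier, and the grouping whose worst confidence
# is highest wins. The toy scorer below is purely illustrative.
from itertools import combinations

def groupings(n_cells, k):
    """All ways to split cells 0..n-1 into k contiguous runs."""
    for cuts in combinations(range(1, n_cells), k - 1):
        bounds = (0,) + cuts + (n_cells,)
        yield [tuple(range(bounds[i], bounds[i + 1])) for i in range(k)]

def best_grouping(n_cells, k, confidence, threshold=0.5):
    best, best_score = None, -1.0
    for g in groupings(n_cells, k):
        scores = [confidence(run) for run in g]
        if min(scores) >= threshold and min(scores) > best_score:
            best, best_score = g, min(scores)
    return best

# Toy confidence: pretend cells (0,1), (2,), (3,4) are the true characters.
TRUE = {(0, 1), (2,), (3, 4)}
conf = lambda run: 0.9 if run in TRUE else 0.1

result = best_grouping(5, 3, conf)
```

Knowing k in advance keeps the search small: 5 cells into 3 characters gives only 6 candidate groupings rather than every possible subset.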
The system achieved a ZIP code recognition rate of 82.7% with no rejects and 96%
with 25% rejects. It is noted in their paper that the test set contains around 3% of
images which are not in the lexicon and so could not possibly be recognised correctly
by the system. These were not removed for the tests and would seem only to compli-
cate the interpretation which can be placed on these results.
A novel approach to the problem of increasing the reliability of an OCR system is
presented in [23 Drucker et al.]. Three conventional neural networks were used.
Their architecture is irrelevant to the method which can be applied to any trainable
classifier. The only requirement is that a very large training set must be available and
it will not necessarily be known in advance how many training examples will be
required. The training procedure is outlined below.
The first network (NET1) is trained as a normal classifier using some examples from
the large initial training set. The training set for the second network (NET2) is
formed by passing more example characters (unseen by NET1) through NET1 until
it incorrectly classifies one. This character image is added to the training set for
NET2. The process is repeated, but this time the first character to be correctly classi-
fied by NET1 is added to the training set for NET2. The selection of characters is
alternated between those which were and were not correctly classified by NET1. In
this way, the training set for NET2 is always made up of an equal number of charac-
ters which were classified correctly and incorrectly by NET1. When a sufficiently
large training set has been generated, NET2 can be trained. The training set for the
third and final network (NET3) is now generated by passing more unseen (by either
NET1 or NET2) character images through both NET1 and NET2. This is repeated
until the networks disagree on the classification of the image. This image is then
1. The tests were performed on U.S. ZIP codes which contain either 5 or 9 characters. A discriminator was used prior to the segmentation to identify which format the target image was, and so it is known in advance how many segments will make up the final word.
added to the training set for NET3. All other images (those that networks 1 and 2
agree on) are discarded. Thus the training set for NET3 contains only images whose
classification NET1 and NET2 disagree on. Once a sufficient number of these images
has been collected, NET3 can be trained.
During the recognition phase the character image is applied to networks 1 and 2. If
they agree then this is taken as the correct answer. If they disagree however, the
image is applied to network 3 and its output is taken as correct.
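The training and arbitration procedure above can be sketched as follows, with trivial stand-in "networks" in place of real classifiers; all of the toy functions and data are invented for illustration and do not come from [23]:

```python
# Sketch of the three-network scheme: NET2 trains on a stream filtered so
# that examples NET1 gets right and wrong alternate, NET3 trains only on
# examples where NET1 and NET2 disagree, and NET3 arbitrates at test time.

def build_net2_set(stream, net1, size):
    out, want_correct = [], False      # start with a misclassified example
    for x, label in stream:
        if (net1(x) == label) == want_correct:
            out.append((x, label))
            want_correct = not want_correct   # alternate right/wrong
            if len(out) == size:
                break
    return out

def build_net3_set(stream, net1, net2, size):
    out = []
    for x, label in stream:
        if net1(x) != net2(x):         # keep only disagreements
            out.append((x, label))
            if len(out) == size:
                break
    return out

def classify(x, net1, net2, net3):
    a, b = net1(x), net2(x)
    return a if a == b else net3(x)    # NET3 arbitrates disagreements

# Toy example: labels are parity; net1 errs on multiples of 3, net2 on 5s.
net1 = lambda x: (x % 2) if x % 3 else 1 - (x % 2)
net2 = lambda x: (x % 2) if x % 5 else 1 - (x % 2)
net3 = lambda x: x % 2                 # pretend NET3 learned the hard cases
stream = [(x, x % 2) for x in range(1, 100)]

net2_set = build_net2_set(stream, net1, size=6)
net3_set = build_net3_set(stream, net1, net2, size=4)
decision = classify(9, net1, net2, net3)
```

Note how much of the stream is discarded: only disagreements reach NET3's training set, which is why an unexpectedly large pool of examples may be needed.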
While this method can be used in principle to increase the performance of any neural
network system it is applied here to the recognition of hand printed characters and
digits. However it can be seen from the description above that this system will
always produce an output, whether or not there is high confidence in that output. It
is stated in [23] and would also appear to be common sense that in a mail sorting application it is much more desirable to reject a piece of mail and have it sorted by hand than to misclassify it and have it delivered to the wrong address. The voting scheme described above is not suitable for this and so a modified one is presented.
The input image is applied to all 3 networks and their outputs are summed. This
total is then thresholded and the confidence (the difference between the highest scor-
ing character and the next highest) can be determined. If this is too small (i.e. below
the threshold), the character is rejected. The threshold was set so that there was only
a 1% error rate on characters accepted from the validation sets used during training.
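A minimal sketch of this summing-and-thresholding scheme, with invented scores and an invented threshold value:

```python
# Sketch of the modified voting scheme: the three networks' output vectors
# are summed and the margin between the best and second-best class decides
# acceptance. Scores and threshold below are illustrative only.

def accept(outputs, threshold):
    """outputs: list of per-network score vectors (one score per class)."""
    total = [sum(col) for col in zip(*outputs)]
    ranked = sorted(total, reverse=True)
    margin = ranked[0] - ranked[1]          # the confidence measure
    if margin < threshold:
        return None                         # reject: sort this item by hand
    return total.index(ranked[0])

# Three networks scoring classes 0..2; class 1 wins clearly here.
nets = [[0.1, 0.8, 0.1], [0.2, 0.7, 0.1], [0.3, 0.5, 0.2]]
decision = accept(nets, threshold=0.5)
```

Returning `None` rather than a forced guess is the point of the modification: a rejected item goes to hand sorting instead of being delivered to the wrong address.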
The resulting system was tested on four databases, two of digits and two of alpha-
betics (one each of upper and lower case). The results for the system described above
are shown in Fig. 2.4.
Fig. 2.4 - Table of improvements to OCR system using a combination of 3 networks over a single network
Clearly these are not remarkable improvements, especially for the lower case
characters. Another drawback is the large (and unknown) number of training
images that will be required in order to generate the three sets for training the three
networks. However it is a method which can be applied to any classifier, and using a
sieving procedure presented in [23], the system’s efficiency can be improved.
Instead of the computational requirement going up by a factor of 3 (three networks
are now being used rather than one), it is limited to an increase by a factor of 1.75.
This is achieved by preventing the invocation of the second and third networks if the
confidence of the first is high enough. This process is referred to in the paper as “sieving”. The potential problem is that the first network may misclassify an image with
high confidence and this would have been caught had all three networks been used.
However in their tests they showed that the previous figures are still accurate, apart
from the lower case characters. The reasons they give for this failure are difficult to
comprehend however. They state:
“However, for the lower case alphabets, this procedure does not produce reasonable results (achieving a 4.0% error rate by rejecting 7.2% of the images) and sieving does not work.”
Database  Contents               Single Network      3 Networks
                                 Error    Reject     Error    Reject
1         Digits                 4.9%     11.5%      3.6%     6.6%
2         Digits                 1.4%     1%         0.8%     ~
3         Upper Case Characters  4%       9.2%       2.4%     3.1%
4         Lower Case Characters  9.8%     29%        8.1%     21%
The error and reject rates reported in this statement are far better than the results
listed earlier for the lower case characters. However it is not clear what they refer to.
It would seem that they do not refer to the sieving procedure, as it is stated at the end
that this procedure does not work. It is possible that there was an error in their table
of results, and these figures are much worse than the actual results for lower case
characters. This would seem unlikely though as all other reports suggest that lower
case handprinted characters are the most difficult to classify.
In [18 Leedham], a comparison is made of several approaches to HPCR and the
results on three datasets of characters reported. The databases used were:
• USPS/CEDAR database, which contains a mixture of cursive words
(which were not attempted) and segmented handprinted characters
(alphabetics and numerics)
• CENPARMI database, containing only segmented digits
• Royal Mail/Essex database, which contains segmented postcode charac-
ters (alphabetics and numerics)
The algorithms tested are from Essex, Brunel, Manchester and Kent Universities.
Results from other algorithms were taken from published sources.
Five of these algorithms were tested on the Royal Mail/Essex alphanumeric charac-
ters, with results from 63.4% to 98.7% character recognition rate. The highest score
went to a 2-level classifier developed by a group at Kent University and is listed as
“Binary Weighted Scheme / Least Mean Squared with Complex 2-dimensional Moments”.
The actual details of this approach are not given though. The problem was simplified
by aggregating the characters ‘I’ (eye) & ‘1’ (one) and ‘O’ (owe) & ‘0’ (zero) into the
same classes, which seems to be a fairly common approach to simple OCR as there is
often no way of discriminating between these characters without contextual infor-
mation.
The performance of seven different algorithms applied to the USPS/CEDAR data-
base was reported. These results were obtained directly from CEDAR’s tests, and no
actual evaluation was done. The best score was obtained by a GSC algorithm (Gradi-
ent, Structural and Concavity). This used image processing techniques, such as Sobel
operators to determine image gradients, and an eight point star operator to deter-
mine concavity in the image. The result was formed into a 448 bit feature vector, and
a K Nearest Neighbour algorithm was used to classify the result. This algorithm
achieved a character recognition rate of 97%. However on a Sun Sparcstation 2 it
only processed 2 characters per second which is plainly well below the performance
required for a real time application to mail sorting. This is probably due to the exces-
sive processing which has to be performed on the character image (convolution of
filters tends to be a time consuming process). They also investigated the advantages
of combining the results from a number of different algorithms using methods such
as majority vote and neural networks. The highest results obtained this way were
using a neural network to combine the outputs and achieved a character recognition
rate of 97.5%, an increase of only 0.5% over the best single method. It is unlikely that
this represents a good trade-off, as the increase in computation (caused by evaluating
more than one classification and then combining the results) would almost certainly
outweigh the slight increase in performance.
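The K Nearest Neighbour step over binary feature vectors can be sketched as below; the tiny 8-bit vectors and Hamming distance stand in for the 448-bit GSC vectors, and the real feature extraction (Sobel gradients, star operator) is not reproduced here:

```python
# Sketch of K-nearest-neighbour classification over binary feature vectors
# of the kind the GSC method produces. Toy 8-bit vectors, illustration only.
from collections import Counter

def hamming(a, b):
    """Number of bit positions where the two vectors differ."""
    return sum(x != y for x, y in zip(a, b))

def knn_classify(query, training, k=3):
    """training: list of (bit_vector, label); majority vote of k nearest."""
    nearest = sorted(training, key=lambda tl: hamming(query, tl[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

train = [
    ((0, 0, 0, 1, 1, 0, 0, 0), "0"),
    ((0, 0, 1, 1, 1, 0, 0, 0), "0"),
    ((1, 1, 1, 0, 0, 1, 1, 1), "1"),
    ((1, 1, 0, 0, 0, 1, 1, 1), "1"),
]
label = knn_classify((0, 0, 0, 1, 1, 1, 0, 0), train)
```

The cost profile is visible even in this sketch: every query is compared against the whole training set, which is consistent with the slow 2-characters-per-second figure reported.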
The results for the CENPARMI database are reported for eight different algorithms
tested at Concordia University, Montreal. The best results are achieved by a statisti-
cal method which uses a network of 231 2×2 classifiers and obtained a 98.3% recogni-
tion rate. Several methods were then investigated to combine the outputs of the
various algorithms and the best combination reported was a voting system incorpo-
rating 3 of the algorithms. However this achieved a 98.5% recognition rate, which is
an improvement of only 0.2%. This would appear insignificant and the only possible
advantage which could be gained would be a reduction in the substitution errors (in
favour of rejection), but these figures are not given and so it is not possible to draw
any conclusions from this result. However it is also reported that another group
incorporated four of the classifiers mentioned above using Bayesian Formalism to
combine the outputs and achieved a recognition rate of 99.2%, which is getting close
to the maximum that could be expected.
The results for numeral recognition tend to be slightly better than for alphabetics or
alphanumerics because fewer classes are involved. This may be of use however as
some character positions within the postcode are restricted to numerals only. There
is no reason why different classifiers should not be used to increase the recognition
success in this way (see section 6.2, “Format of the Postcode” on page 84).
Interestingly, one of the algorithms was trained on the training set from one database
and tested on the test set of another. This produced a recognition rate of only 50.3%
for handwritten numerals. Sadly the algorithm was not tested on the corresponding
matching training/test sets so no exact conclusions can be drawn from this but it is
stated in [18] that the expected performance of this algorithm would be around 80%.
This means a dramatic reduction in performance when tested against alternative test
sets. This may simply be a ‘feature’ of this particular algorithm but it may mean that
the databases commonly used to compare algorithms are not particularly universal
— that is to say they exhibit some characteristic which is peculiar to that database. A
classifier trained on one set would then be good at classifying images with the same
characteristic but may be very poor at classifying images without that characteristic.
The characteristic, whatever it may be, could be caused by the authors of the test
images, the actual scanning process or any constraints which were placed on the
type of images which were to be included in the data set, etc. Further investigation
would be required to ascertain whether this behaviour was common to many classi-
fiers and many databases. If proved correct, it would place greater emphasis on col-
lecting a representative training set for the application being designed and could
mean that totally universal classifiers would be impossible without a totally univer-
sal training set which would be very difficult, if not impossible, to collect.
Two systems are presented in [25 Martin et al.] and [30 Martin, Rashid] which again
are similar to the approaches described in [5] and [29] (see page 17). In the first, a
neural network character classifier is scanned over the text line and its output is
thresholded so that it provides positive outputs when its input area is centred over a
character. This is similar to the voting systems described earlier but the thresholding
is performed at the recognition stage rather than after the whole image has been seen
by the recogniser. In this way there is less information to store and the network pro-
duces an on-line output of characters as it sees them. The drawback is that there can
be no feedback to the recogniser from all other characters in the image; if the final
output is found not to be a valid word from the lexicon, the recognition must be
repeated. The main point of interest in this system is outlined in the second sections
of [25] and [30], which indicate how the scanning process can be improved. A second neural network is trained to recognise how wide certain characters are. This
allows it to move the input window of the classifier network by large amounts rather
than having to scan it slowly across the word image. It is reported to be similar to the
way the human eye behaves when reading text. Although humans tend to recognise
words rather than characters, the eye jumps from one word to the next rather than
scanning the text line smoothly. It was shown in [25] and [30] that this can improve
the efficiency of the recognition process by 4 or 5 times. It does not however influence the actual recognition rate as this is purely dependent on the classifier. It was
tested on several different length numbers and achieved a word recognition rate of
94.23% for 2-digit numbers with a 1% error rate, and 63.26% for 6-digit numbers
again with a 1% error rate (the remaining percentages are made up of rejects). No
lexicon was used to assist in the recognition of the numbers and this is the main rea-
son for the fairly low recognition rate once the number of digits starts to increase.
These figures represent a character classification rate of somewhere between 92%
and 97%. However when this is applied to 6 digits the overall classification rate falls
quickly as can be seen.
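The character-rate range quoted here can be checked by inverting the word-rate formula: if each of n characters is classified independently with probability p, the whole word is correct with probability p to the power n, so p can be recovered from the reported word rates:

```python
# Recover the implied per-character rate p from a word rate w over n
# characters: w = p**n, so p = w**(1/n).

per_char_2 = 0.9423 ** (1 / 2)   # from the 2-digit word recognition rate
per_char_6 = 0.6326 ** (1 / 6)   # from the 6-digit word recognition rate
```

These come out at roughly 97.1% and 92.7% respectively, matching the "between 92% and 97%" estimate in the text, and show why the word rate collapses as the digit count grows.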
The method for moving the input window by recognising the width of the input
character is essentially the improvement which was suggested on page 15 to the sys-
tem presented in [4]. However instead of moving the input window it would be
used to suggest the correct segmentation point for touching characters.
2.3.2 Cursive Writing
Segmentation is more of an issue with cursive writing recognition as there is a
greater tendency for adjacent characters to join together. Several groups have pro-
posed solutions to this problem. A system is presented in [8 Yanikoglu, Sandon] for
recognising cursive handwriting which uses a minimum cost cut method for segmentation, and a neural network for character recognition. The segmentation step
consists of the following stages:
• First segment the page into text lines by computing the horizontal histo-
gram of the page and identifying the baselines of the text
• Then find the reference lines of each text line. There are four lines associ-
ated with each text line which are shown pictorially in Fig. 2.5
Fig. 2.5 - The four reference lines used by the system described in [8 Yanikoglu, Sandon]
[Figure 2.5 labels: the Ascender, Body, Baseline and Descender reference lines, illustrated on the example word ‘pool’.]
• Finally segment the line into characters by looking for minima in the ver-
tical pixel histogram of the text line
The first stage includes a check for a skewed page by computing the horizontal histogram at -10° and +10° from the horizontal and using this information to shear the
image of the page accordingly. It is obvious that this could be improved by using
more computations at intermediate angles and this would represent a trade-off
between the time taken to process the page and the reliability of the results. It is pre-
sumed that these angles were found to give the most satisfactory results, however no
comparisons were presented to show the trade-off mentioned.
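The skew check can be sketched as follows: shear the pixel coordinates at each candidate angle and keep the angle giving the sharpest (highest variance) horizontal histogram. The bitmap and the three-angle set are illustrative, matching the ±10° scheme described above:

```python
# Sketch of skew detection via horizontal histograms: a deskewed text line
# concentrates its ink into few histogram rows, maximising the variance.
import math

def row_histogram(points, shear_deg, height=20):
    """points: set of ink pixels (x, y); shear maps y -> y + x*tan(angle)."""
    hist = [0] * height
    t = math.tan(math.radians(shear_deg))
    for x, y in points:
        row = round(y + x * t)
        if 0 <= row < height:
            hist[row] += 1
    return hist

def variance(hist):
    mean = sum(hist) / len(hist)
    return sum((h - mean) ** 2 for h in hist) / len(hist)

def best_shear(points, angles=(-10, 0, 10)):
    return max(angles, key=lambda a: variance(row_histogram(points, a)))

# A "text line" drawn with a 10-degree downward skew: shearing by +10
# degrees re-aligns it into a single histogram row.
line = {(x, round(10 - x * math.tan(math.radians(10)))) for x in range(15)}
angle = best_shear(line)
```

Trying more intermediate angles is exactly the time-versus-reliability trade-off the text mentions: each extra angle costs one more pass over the ink pixels.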
The results from the second stage are used during the actual character recognition process to give a rough indication of the geometry of the characters — for example, the
width of a character is roughly equal to the body height for a given text line. The segmentation of characters is performed by looking for the least cost cut point within the
line. The cost of a cut is determined by, among other things, the number of pixels it
must go through, the height above the baseline at which the cut is made and the dis-
tance from the last cut (relative to the approximate width of the character). Four cuts
are made at 0°, 10°, 20° and 30° to the vertical. It seems odd that these cuts look for characters which are from vertical to slanted right and none look for left slanted characters. While it is more common for handwriting to slant to the right, it would have
been a simple matter to include a cut which could handle left slanting characters
because, as it stands, the system will not recognise these at all.
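The cut-cost idea might be sketched as below; the weights and candidate values are invented, since the paper lists the cost factors but not their exact combination:

```python
# Sketch of cut-cost scoring: each candidate cut is penalised for the ink
# pixels it crosses, its height above the baseline, and how far it lands
# from the expected character width. Weights are illustrative guesses.

def cut_cost(ink_pixels_crossed, height_above_baseline, dist_from_last_cut,
             expected_char_width, w_ink=2.0, w_height=0.5, w_dist=1.0):
    # Cuts through ink are expensive; cuts far from the expected character
    # width (in either direction) are also penalised.
    width_error = abs(dist_from_last_cut - expected_char_width)
    return (w_ink * ink_pixels_crossed
            + w_height * height_above_baseline
            + w_dist * width_error / expected_char_width)

candidates = [
    # (ink pixels crossed, height above baseline, distance from last cut)
    (5, 2, 12),   # cuts through a stroke
    (0, 1, 11),   # clean cut near the expected width
    (0, 3, 25),   # clean but far too wide
]
costs = [cut_cost(i, h, d, expected_char_width=12) for i, h, d in candidates]
best = costs.index(min(costs))
```

Under these assumed weights the clean, correctly spaced cut wins by a wide margin, which is the behaviour the cost factors are meant to encourage.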
After segmentation the character image is slant and size normalised and then pre-
sented to the neural network for classification. The network has an input size of
20×50 pixels (greyscale) and 26 outputs, one for each character. It is not clear whether
upper case characters are ignored or not recognised — certainly all the examples
shown in the report are only of lower case characters.
A Hidden Markov Model is used to maximise the probability of the recognised word
given some analysis of character pairs found in a large dictionary of English words.
For their tests they assumed independence of probability of words, since their lexi-
con used for word validation was small and using actual written English word prob-
abilities reduced performance in the small scale test. However, given these caveats,
the system achieved overall 61% word recognition with 71% of words being in the
top three suggested by the system. It is important to note however that this figure
was arrived at by averaging the results from three tests. In two of the tests the words
were written by authors who had also written training sets and one of them was on
hand printed rather than cursive characters. The results for these tests were 93%
word recognition for hand printed characters and 70% word recognition for cursive
characters. In the third test the author had not written a training set and the system
only managed 28% word recognition. This result indicates the huge variation
between character images from different writers and would indicate that a much
more versatile system would be required for recognising handwritten addresses on
mail pieces.
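The character-pair rescoring described at the start of this passage can be illustrated with a simple bigram score standing in for the full Hidden Markov Model; all probabilities below are invented toy values, not statistics from any real dictionary:

```python
# Sketch of bigram rescoring: each candidate word combines the recogniser's
# per-position character probabilities with character-pair statistics, and
# the highest scoring candidate wins. Log probabilities avoid underflow.
import math

def word_score(word, char_probs, bigram):
    """char_probs[i] maps characters to P(char | image at position i)."""
    score = sum(math.log(char_probs[i].get(c, 1e-9))
                for i, c in enumerate(word))
    score += sum(math.log(bigram.get(pair, 1e-9))
                 for pair in zip(word, word[1:]))
    return score

# The recogniser slightly prefers the nonsense string "qork" per-character,
# but the bigram statistics favour the real word "york".
char_probs = [{"q": 0.5, "y": 0.4}, {"o": 0.9}, {"r": 0.9}, {"k": 0.9}]
bigram = {("y", "o"): 0.05, ("o", "r"): 0.04, ("r", "k"): 0.03,
          ("q", "o"): 0.0001}
best = max(["qork", "york"], key=lambda w: word_score(w, char_probs, bigram))
```

This is the mechanism by which language statistics overrule a marginally stronger but linguistically implausible character hypothesis.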
The system mentioned earlier in [26 Burges et al.] was applied to cursive words using
lexicons of 10, 100 and 1000 words. There are a number of differences between the
two systems; these are outlined below but, otherwise, it can be taken to be the same
as the one described on page 21.
The complex segmentation scheme is replaced by a neural network, and segmenta-
tion is now implicit in the character recognition. The neural network now has 104
outputs which are assigned to 4 sets of 26 outputs. The 26 outputs in each set repre-
sent the characters of the alphabet and the 4 sets represent different widths of charac-
ters. The neural network is thus able to recognise varying width characters within
the text line. This is obviously important if it is being used as the segmentation algo-
rithm, as the only way to ensure they are all of equal width would be to size normal-
ise them and this can only be done on segmented characters! The input of the
network is scanned over the text line and its outputs are recorded in an array which,
again, is very similar to the approach described on page 12.
The word recognition rates for images from the three lexicons were 86%, 68% and
47% respectively. The authors of [26] noted however that if the constraints were
relaxed so that the system produced its top few choices then the results went up to
93% (top 2), 82% (top 3) and 74% (top 6). This does show that as the lexicon size
increases the performance drops rapidly and it becomes necessary to accept a ‘top-n’
type output from the recogniser in order to obtain the correct word with any reliabil-
ity.
In [6 Seni, Cohen], a complex system is presented which attempts to segment totally
unconstrained cursive handwriting. It is one of the few systems which attempts the
segmentation stage of the process without attempting OCR and as noted at the end
of this section, this may be the reason for its apparently poor performance given the
encouraging results which are mentioned below. The system is applied to recognis-
ing portions of addresses written on mail pieces which had been previously seg-
mented into text lines. The system used connected components from the text line to
identify inter-word gaps. A connected component may be noise, a character frag-
ment, a whole character or a number of touching characters. Eight different algo-
rithms for detecting these inter-word gaps were tested and the results for each
tabulated. The results are percentages of the total number of text lines (1453) which
are correctly segmented into words and range from 78.5% to 87.4%. The top scoring
algorithm is a hybrid of 3 of the others and so it is not surprising that it scores better
than all the others. This algorithm also identified 97.1% of all inter-character gaps
within the words. They noted that while punctuation within a text line tends to
reduce the inter-word gaps, it also gives a good indication of the existence of an
inter-word gap as punctuation does not generally occur within a word. Three punc-
tuation detection algorithms are evaluated with the highest scoring being a K-near-
est-neighbour method. The other methods were 2 variations on discriminant
functions and all three methods scored between 97.38% and 97.84%. The percentage
indicates correct classification of each connected component as either a comma, a
period or neither of these. The only appreciable difference between the three meth-
ods is reported as being the distribution of false-positives, false-negatives and sensi-
tivity of the algorithms. No results for these measures were given however.
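The inter-word gap detection can be sketched from connected-component bounding boxes; thresholding on a multiple of the median gap is an illustrative choice, not one of the eight algorithms tested in [6]:

```python
# Sketch of inter-word gap detection: connected components are reduced to
# horizontal intervals, and a gap between successive bounding boxes is
# called inter-word when it clearly exceeds the typical (median) gap.
from statistics import median

def word_gaps(components, factor=2.0):
    """components: (x_left, x_right) bounding boxes; returns gap indices."""
    boxes = sorted(components)
    gaps = [boxes[i + 1][0] - boxes[i][1] for i in range(len(boxes) - 1)]
    threshold = factor * median(gaps)
    return [i for i, g in enumerate(gaps) if g > threshold]

# Three "words" of components: gaps of 2 within words, 10 between words.
comps = [(0, 4), (6, 9), (19, 24), (26, 30), (40, 44), (46, 50)]
breaks = word_gaps(comps)
```

A component may of course be noise, a fragment, or several touching characters, which is why real gap classifiers need far more than bounding-box spacing; this sketch only shows the basic interval logic.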
The punctuation detection algorithms and the previous eight gaps classification
algorithms were combined and tested on a final test set which had not been used
during the training of any of the algorithms. It is apparent now how difficult the
problem actually is, with the best combination managing to correctly parse only 39%
of these unseen text lines. However this result is akin to attempting to classify a
whole word using only a character recogniser. Even if the recogniser is 90% reliable
on each character, the chance of correctly recognising an eight character word is
only 43%. It is obvious then that some method of integrating the word segmentation
algorithms reported here into a text recognition system which could recognise these
words and provide feedback to support or contradict the segmentation would be
necessary to increase the overall reliability.
2.4 Summary
Several other approaches to OCR were reviewed including systems which use many
small neural networks to recognise individual characters [3 Kertesz, Kertesz], neural
networks applied to the cartesian and polar coordinates of the character images
[2 Lee, Choi], a single large backpropagation network [21 LeCun et al.] and a system
which maps each character image into a feature space known as ‘holograph’ and
compares the features using simple matrix functions [20 Gorsky]. All these systems
performed reasonably well but none achieved a significant advantage over the oth-
ers in terms of recognition accuracy or efficiency. A summary of the results presented
throughout this section is shown in Fig. 2.6. If one approach had to be selected for
the application described in the remainder of this report then purely on recognition
rate the one in [4] would seem to be the best choice for machine printed text. For
handwritten text the choice is not so clear as a decision would have to be made as to
whether to simply attempt to classify the postcode, in which case a segmented char-
acter recogniser such as the ones reported in [8] would probably be sufficient. How-
ever there is little detail on these methods, and in particular the restrictions placed
on the methods, which would enable a judgement to be made on the reported
results. If an attempt was to be made on other address information such as the post-
town, a word recogniser would be needed and the performance of any presented
here on untrained writing would seem to be nowhere near the performance which
would be necessary to produce useful information. Another option for the character
classifier was mentioned on page 20. The system presented in [29] would benefit
from the hardware architecture which would also be used for the partial matching
which is described in section 4, and may represent a huge performance increase over
any of the other methods.
It is clear then that there is a great deal of interest in this area of machine vision,
probably due to the commercial interest that would be shown by organisations such
as The Post Office, banks and building societies in any system capable of reading
addresses, cheque details, etc. However as was shown with machine printed text,
which has almost reached its maximum attainable level of performance, a system
based solely on character recognition will not be sufficient. With addresses there is
plenty of other information on the mail piece apart from the postcode which would
aid the successful recognition of the address. With cheques the amount is written in
both words and figures. Bringing these two fields together to verify the recognition
process would be the only way of improving the accuracy significantly.
The next section looks at some of the research currently being carried out in the area
of OCR verification. The aim of a verification system is primarily to constrain the
output of the OCR system so that it conforms to some specification of a valid output
for the particular application. There are a number of ways this can be achieved.
Some systems follow sequentially from the OCR system and some are an integral
part of the OCR system. However the purpose of all of them is to improve the per-
formance of the OCR system to a level which it would be difficult if not impossible to
attain using purely OCR.
The following table summarises the results reported earlier in this section.

Source                 | Target† | Results†† | Notes
[22 Wang, Jean]        | MC      | 99.81% C  | Does not differentiate certain characters, such as ‘I’ and ‘1’
[4 Wang, Jean]         | MW      | 99.71% C  | Uses character recogniser from above
[7 Liang et al.]       | MW      | 99.65% C  | Uses character contextual classes to split touching characters and merge broken character components
[5 Mulgaonkar et al.]  | MW      | 80% W     | Scans the text line with the recogniser and maintains a voting array to avoid an explicit segmentation step
[8 Yanikoglu, Sandon]  | HC      | 95% C     | Results reported for Martin et al. — only uppercase characters
                       | HD      | 98% C     | Results reported for Guyon et al.
                       | HW      | 95% W     | Results reported for Burr — possibly only for a single writer though
                       | CW      | 48% W     | Results reported for Srihari et al. — only for a single writer
[26 Burges et al.]     | HD      | 82.7% W   | Applied to ZIP codes, hence the result is a word recognition rate as there was a lexicon to check against
                       | CW      | 86% W     | Only a 10 word lexicon
                       |         | 68% W     | 100 word lexicon
                       |         | 47% W     | 1000 word lexicon
[23 Drucker et al.]    | HD      | 89.8% C   | This system uses a performance improving scheme which could in theory be used on any neural network. It improves the performance of a single network (using 3 of the same type) from 83.6% & 97.8% (digits), 86.8% (upper case characters) and 61.2% (lower case characters) respectively
                       |         | 99% C     |
                       | HC      | 94.6% C   |
                       |         | 70.9% C   |
[18 Leedham]           | HC      | 98.7% C   | Reported for a group at Kent University — aggregates ‘I’ and ‘1’ etc. as before
                       |         | 97% C     | Reported for a group at CEDAR, the authors of the character database used
                       | HD      | 98.3% C   | Reported for a group at Concordia University, Montreal
[25 Martin et al.]     | HD      | 94.23% W  | These results are for 2 and 6 digit numbers respectively. However the main interest is in the novel method used to improve the speed of recognition 4-5 times
[30 Martin, Rashid]    |         | 63.26% W  |
[8 Yanikoglu, Sandon]  | CW      | 61% W     | The result is an average of several tests — 93% segmented character recognition, 70% word recognition and 28% word recognition for an author who had not written a training set

Fig. 2.6 - Summary of results for the OCR systems reviewed

† The codes in this column are M for machine printed, H for handprinted and C for cursive, followed by C for characters, D for digits and W for words.
†† The codes in this column after the percentages are C for character recognition rate and W for word recognition rate.
3. Verification
The dictionary definition of verification is the process of establishing the truth or
validity of something.
3.1 Introduction
With respect to OCR systems, verification is the process of establishing the truth of
the output of the OCR module. Usually this output will be in the form of a character
corresponding to a section of the image of the input document and a confidence
value with which the OCR system classified the image as being that character. In
order to determine the truth of that classification some other information from the
input image will usually be required. The main strategies for verification of the out-
put from a classifier are twofold. Firstly, the output words can be checked against
some database of valid words. It is likely in most cases where automated recognition
is being employed that there are some constraints on the words which will appear in
the document being analysed. This is certainly the case for postcodes, which are the
main target being considered in this report. Postcodes follow a syntax which
describes how many characters they may be composed of and what characters may
appear in certain locations. There is also a database of all valid addresses currently
being used within the UK. While this is updated from time to time with new post-
codes as they are needed, the overall syntax is not changed. This allows two checks
to be made on the validity of the postcode either during or after the recognition proc-
ess.
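These two checks might be sketched as follows. The regular expression below is only a simplified approximation of the postcode grammar (the real specification further restricts which letters may appear in certain positions), and the function names and the representation of the database are purely illustrative.

```python
import re

# A simplified approximation of the UK postcode syntax: an outward code
# (one or two letters, one or two digits, optionally a trailing letter)
# followed by an inward code (a digit and two letters).  The real
# specification is stricter about which letters may appear where.
POSTCODE_SYNTAX = re.compile(r"^[A-Z]{1,2}[0-9][0-9A-Z]? ?[0-9][A-Z]{2}$")

def syntax_valid(code: str) -> bool:
    """First check: does the string conform to the postcode grammar?"""
    return POSTCODE_SYNTAX.match(code.upper()) is not None

def fully_valid(code: str, database: set) -> bool:
    """Second check: is it also a postcode currently in use?
    (The database is assumed to store codes without spaces.)"""
    return syntax_valid(code) and code.upper().replace(" ", "") in database
```

The syntax check can be applied during recognition, while the database lookup is naturally applied afterwards.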
Secondly, output words from the classifier can be checked against other information
on the document image. For example, with analysis of cheques, the main target for
automated recognition is the amount. The account numbers are already machine
readable and the name field can be so varied that there would seem to be little point
in attempting OCR at present. However the amount field appears in two places on
the cheque and in two different forms. These two fields can then be recognised sepa-
rately and the results compared to allow verification of the recognition process. The
same is true of addresses on mail pieces, however the verification process will be
more complex.
3.2 Review
The majority of this chapter will review verification systems developed by research-
ers in this field. There are many diverse techniques employed to accomplish the task
and these are described along with the results for that particular implementation.
The overall goal of all of the systems presented here is that of improving the per-
formance of an OCR system.
It was shown in [9 Kabir, Downton] that recognition of the outward postcode (the
section which dictates to which town the mail piece will be sent for further sorting)
was improved by 120% over simple OCR when combined with
syntax and context information available from the rest of the address image. The
character recogniser was based on a template matching scheme, using a similarity
function which effectively computed the cosine of the angle between the character
vector and the template vector. The overall performance of this system was a charac-
ter classification rate of 62%. Two approaches to improving this performance are
investigated, which are the dictionary lookup method to represent the valid post-
codes and a Markov model to represent valid syntax within the postcode. They are
in fact combined into a hybrid system which employs features of both algorithms. At
each stage, the most likely prefix according to the Markov model is searched for in
the dictionary and any invalid possibilities are discarded. This means that at each
stage the most likely valid postcode is the one being considered. In fact, they only
considered the outward section of the postcode in their tests but the recognition rate
went from 25% using simple OCR to 55% using the Markov model and dictionary
search algorithms. They also mention the inadequacy of their sample database of
addresses and propose using random addresses from the database of all possible
addresses within the UK; they state:
“In particular, random selection of postcodes from the CD-ROM database will, in the limit, enable us to estimate the a priori probability of occurrence of each character class in each postcode character position, and thus include this information in the character recognition model.”
However it is clear that this statement is incorrect, as the distribution of postcodes in
the live mail stream is almost certainly not uniform across the UK. What would
actually need to be done would be to collect random samples of addresses from each
sorting office and this would, in the limit, give the true a priori probability for each
character in each position within the postcode.
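A minimal sketch of this kind of hybrid search is given below, assuming the dictionary is held as a set of every valid prefix (including the complete codes). [9] gives no implementation detail, so all names, data structures and the start symbol used here are illustrative only.

```python
import heapq
import math

def best_valid_code(char_scores, transition, prefixes, full_codes):
    """Best-first search for the most likely dictionary-valid code.

    char_scores : list of {char: confidence} dicts, one per position
    transition  : {(prev, cur): p} Markov transition probabilities
    prefixes    : set of every prefix of a valid code (codes included)
    full_codes  : set of complete valid codes
    """
    # Priority queue of (negative log-likelihood, prefix); best first.
    heap = [(0.0, "")]
    n = len(char_scores)
    while heap:
        cost, prefix = heapq.heappop(heap)
        if len(prefix) == n:
            if prefix in full_codes:
                return prefix       # most likely valid code
            continue
        for ch, conf in char_scores[len(prefix)].items():
            cand = prefix + ch
            if cand not in prefixes:      # dictionary pruning
                continue
            prev = prefix[-1] if prefix else "^"   # "^" = start symbol
            p = conf * transition.get((prev, ch), 1e-6)
            heapq.heappush(heap, (cost - math.log(p), cand))
    return None
```

At every step the frontier contains only dictionary-valid prefixes, so the most likely valid code is always the one being extended, mirroring the behaviour described above.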
In [13 Leedham, Jones] and [19 Hendrawan, Leedham], such a system for the verifi-
cation of Australian and British addresses respectively is considered. In [13], a data-
base of addresses was collected which consisted of 200 mythical but realistic
addresses, scanned at 200dpi in accordance with Australian sorting machines. The
addresses were also written in ‘Post Office Preferred’ format, which means the post-
code (a 4-digit numeric code) is written to the lower right of the address. The system
comprises a character locator and classifier for actually recognising characters from
the postcode, a feature analyser for extracting other features from the address, such
as posttown information, and a database for matching the word information with
the postcode.
The postcode location is performed by assuming the postcode lies within a small
window on the address image. The size and position of this window are adjusted to
include all the postcode characters but exclude other parts of the image; however, no
details are given as to how this is achieved. The vertical pixel histogram of this
window is then used to segment the characters. Checks are made to prevent individual
characters being split into several segments, but again there are no details as to how
this is performed. The character’s height was then checked to ensure it was reasonable to
assume it was a character and not a dash or other mark on the image. The OCR was
performed using a characteristic loci method which achieved a 42% postcode recog-
nition rate which equates to an 80.5% character recognition rate. It has to be said,
after the results of the last chapter, that this is a fairly poor performance even for
handwritten character recognition, when it is only numeric characters that have to be
considered. Using the best results for handprinted digits from the previous section, a
character recognition rate of up to 99% could be expected and this would instantly
yield an improvement from 42% to 96% word recognition rate. Even using a more
conservative estimate of 95% character recognition rate, this yields a postcode recog-
nition rate of over 81%, which is nearly twice the current value. The authors also
state that the 52% error rate is unacceptable and needs to be reduced to around 0.1%
for a real application. However they seem to be ignoring the possibility of rejects
from the automated system, which would almost certainly have to be used to
achieve an error rate as low as 0.1%. It is not clear whether rejects are possible from
their OCR system, but if not (and hence the error rate and success rate summing to
100%), this would give another reason to change the OCR method used.
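The rough figures above follow directly from the assumption that each of the four digits of the postcode is recognised independently:

```python
def postcode_rate(char_rate, length=4):
    # Assuming each character is recognised independently, the whole
    # postcode is correct only if every character is, so the postcode
    # rate is the character rate raised to the power of its length.
    return char_rate ** length

# The estimates quoted above, for a 4-digit Australian postcode:
postcode_rate(0.99)   # about 0.96
postcode_rate(0.95)   # about 0.81
```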
The other address image information is obtained via a number of stages. The image
is first smeared horizontally and vertically in an attempt to make all the characters
within a word connected (as they are not going to be classified by a character recog-
niser, it doesn’t matter if they are slightly distorted by this process and having a
word as a connected component simplifies the word segmentation step). The image
is segmented into lines by considering the horizontal histogram of the address
image. The tops and bottoms of characters tend to show up as peaks in the histo-
gram. They note also that a disconnected top stroke from a letter ‘T’ can sometimes
cause the line to be split into several smaller apparent lines if only the histogram is
considered. They overcome this by then making a second pass over the image and
merging lines which appear to be from the same actual text line. They do not com-
ment on how this is achieved, however it could be done using the height of the seg-
mented line — the segment containing only a horizontal stroke from a ‘T’ would be
considerably smaller in height than one which contained the body of the ‘T’. Once
this has been done, an 8-connected region growing and labelling process is applied
to the image to attempt to label each word in the address. The growing is prevented
from moving far outside the line segmentation points found in the previous step to
avoid joining text lines together. Components whose bounding boxes overlap hori-
zontally are then joined, as the authors state that this is almost always due to a word
being split into two or more pieces at an earlier stage (smearing or region labelling).
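The line segmentation stage might be sketched as follows. The merge rule for short fragments is our guess at the detail left unreported in [13], and min_height is an assumed parameter.

```python
def segment_lines(image, min_height=5):
    """Split a binary address image (a list of pixel rows, 1 = ink)
    into horizontal text lines.

    A sketch of the approach described above: take the horizontal
    projection histogram, cut wherever a row contains no ink, then
    merge any segment too short to be a full line (such as the
    detached top stroke of a 'T') into its neighbour.
    """
    profile = [sum(row) for row in image]   # ink pixels per row
    lines, start = [], None
    for y, count in enumerate(profile):
        if count > 0 and start is None:
            start = y
        elif count == 0 and start is not None:
            lines.append((start, y))
            start = None
    if start is not None:
        lines.append((start, len(profile)))
    # Second pass: merge fragments too small to be a full text line.
    merged = []
    for seg in lines:
        if merged and (seg[1] - seg[0] < min_height
                       or merged[-1][1] - merged[-1][0] < min_height):
            merged[-1] = (merged[-1][0], seg[1])
        else:
            merged.append(seg)
    return merged
```

The second pass implements the height heuristic suggested above: a segment containing only the horizontal stroke of a ‘T’ is far shorter than one containing the body of the line, so it is merged into the adjacent segment.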
OCR was reportedly attempted on the first and last character of each word. Some
details are given about the method used to locate the first character — the character
is segmented using either a search for a white gap (as the initial character tends to be
upper case and therefore disconnected from the rest of the word), component label-
ling and finally “character splitting techniques” which are otherwise unspecified. If all
these methods fail the character is simply split at a certain width relative to the
height of the current line (to give it a fixed aspect ratio). No mention is made of the
techniques used for the last character of the word, although it is possible that they
are the same as the above.
The word is then tested for upper/mixed/undetermined case characters. Again, no
details of the method are given other than the shape of the horizontal histogram of
the horizontally smeared image is used, and the technique correctly identifies the
case of 78% of the words in the address image database. The number of characters in
the word is estimated for upper and mixed case words by counting the number of
times a stroke crossed the horizontal centre of the word. The value chosen was the
rounded value of half of the number of line crossings found and was correct to
within 1 character for over 90% of the words. For mixed case words, the ascender/
descender sequence was obtained by scanning horizontally along the top and bot-
tom of the word. The system correctly identified 55% of the ascender/descender
sequences and the rest “with minor errors”. For upper case words, lobe features such
as the closed lobes in ‘A’, ‘B’ etc., the upward open lobes in ‘V’, ‘W’ etc., and the
downward open lobes in ‘N’, ‘M’ etc. are extracted. Once again, no details as to how
the extraction is performed, what is done with characters such as ‘W’ and ‘M’ which
have both upward and downward lobes or the performance of the extraction system
are given.
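The letter counting rule can be sketched as below, assuming a binary word image represented as a list of pixel rows. [13] does not give the exact counting rule, so the transition count used here is an assumption.

```python
def estimate_letter_count(word_image):
    """Estimate the number of letters in an upper case word.

    A sketch of the rule described above: scan along the horizontal
    centre row of the binary word image (1 = ink), count the strokes
    crossing that row, and take the letter count as the rounded value
    of half the number of crossings.
    """
    centre = word_image[len(word_image) // 2]
    # A crossing is a background-to-ink transition along the row.
    crossings = sum(1 for prev, cur in zip([0] + centre[:-1], centre)
                    if prev == 0 and cur == 1)
    return int(round(crossings / 2))
```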
The results of the verification section were not reported. However in [19], the same
system is applied to British postcodes. The opening paragraphs state that “Out of the
120 address images analysed 71 (51.2%) were segmented without any errors.” However 71
out of 120 is 59%! This is clearly a typographical error, however, the rest of the results
presented here are presumably accurate. Once the words were segmented, the initial
character algorithms correctly isolated the first character of 71.4% of the 329 words
attempted. There is more detail in [19] about the actual methods but it is not clear if
they are exactly the same as in [13] above. The initial search for white space is used if
the character is clearly separated from the rest of the word. If this fails the compo-
nent labelling scheme is used for characters which are physically separated but
whose bounding box overlaps that of the rest of the word. Finally the vertical histo-
gram profile of the initial part of the word is used to split the character, which is now
assumed to be touching the rest of the word. The profiles of all 26 characters are used
but it is not clear how one is selected, as the character recognition is not performed
until after the character has been segmented. The aspect ratio method used as a last
resort gives the character a width of 0.9 times its height. OCR is attempted on the
character using a method developed by one of the author’s colleagues, Robert Tregidigo.
No results were reported for this stage, though. This is unfortunate, as the first
stage of a verification process would probably be a comparison of the initial part of
the postcode with the initial letter of the posttown. The performance of the OCR on
the initial letter would have a huge influence on the reliability of this type of verifica-
tion.
The word case classification was performed and achieved an average classification rate
of 71.8%. Of these, 36 words were correctly classified as mixed case and the
ascender/descender sequence within the word was estimated as before. The algo-
rithm correctly analysed 55.6% of these words. The number of letters in the words
was also estimated as before and it is reported that 89.3% of the words were esti-
mated from 0 to +2 characters of their actual length. However closer analysis of the
graphs shows that only 27.7% were correct and 46.4% were in the +1 band (i.e.
reported 1 more character than there actually was). This would indicate that some
adjustment of the algorithm is required. The distribution looks fairly normal from
the graphs, and it would make sense for the mode of the results to be correct. In fact,
by simply subtracting 1 from the estimated lengths, the results immediately become
89.3% correct to within 1 character which would appear to be better than 0 to +2
characters.
The results of the verification process are reported in [12 Hendrawan, Leedham]. It
was assumed that each line of the address contains only one field such as posttown
or county but a comma detection algorithm was used to check if more than one field
was on the same line, separated by a comma. Similarly, hyphens were detected and
removed so that hyphenated place names such as ‘Clacton-on-sea’ always appeared
the same whether they were written with the hyphens or not. From the OCR of the
postcode (for which the results were not presented in [19]), a search is made which
lists in order of likelihood the possible addresses from the database of all valid
addresses. These candidate addresses are then matched against the features
extracted from the address image. For each corresponding line in the address (image
and candidate), the following features are used:
• Number of words on the line
• First character of each word
• Number of letters in each word
and for mixed case addresses as indicated by the case discrimination algorithm,
• Number of ascenders/descenders in each word
• Ascender/descender sequence for each word
Each of these factors was given a weight which was determined heuristically. The
results for each line were summed and normalised into the range 0 to 1, and the val-
ues were then weighted according to which line they were on and the number of
lines in the address. These weights were also determined heuristically. Finally the
weighted value for each line was summed and this represented the verification value
of the address. A threshold was then used to decide at what point the address was
considered verified correctly.
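A sketch of this weighting scheme is given below. The actual weights and threshold in [12] were determined heuristically and are not reported, so the values and names used here are placeholders only.

```python
def verification_value(line_scores, line_weights, threshold=0.5):
    """Combine per-line match scores into a single verification value.

    line_scores  : aggregated feature-match score for each address line
    line_weights : heuristic weight for each line (assumed to sum to 1)
    Returns the weighted value and whether it clears the threshold.
    """
    top = max(line_scores) or 1.0
    normalised = [s / top for s in line_scores]   # into the range 0..1
    value = sum(w * s for w, s in zip(line_weights, normalised))
    return value, value >= threshold
```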
The results of simple OCR on the postcode show that 40.89% of addresses are cor-
rectly identified but once again they imply that this means an error rate of 59.11%.
This must mean that the OCR system is unable to reject an address if its recognition
of the postcode is below a confidence threshold. Clearly this would not improve the
success rate, but rejection is preferred over error in this application, as manual sort-
ing is preferred over delivery to the wrong address. The results for the verification
system indicate that at a certain threshold, 38.18% of the addresses are verified cor-
rectly with an error rate of 4.89%. It is assumed then that the verification stage is
intended to identify which postcodes were incorrectly classified by the OCR system
by rejecting some of the addresses (presumably for manual sorting). So now the cor-
rect address classification rate is 38.18% with an error rate of 4.89% and presumably
a reject rate of 56.93% which has not actually improved the recognition performance
of the system at all. One of the main reasons cited for errors is the fact that the
address image is compared line by line with the address candidate from the data-
base. This means that if extra information is included in the address or one line is
missed out (which does not necessarily mean the address is incomplete), the com-
parison gets out of step resulting in a low verification value. This is because the sys-
tem is implying an ordering in the address that does not really exist — the address
consists of all the information together and is not a hierarchy. This means an order
independent comparison with the database may be beneficial (see section 8.3, “Strat-
egies for Verification” on page 101).
A very interesting report of a system is given in [39 Lucas], and although not strictly
a verification system, it is described here as it could form the basis for the verification
system described above. In fact, it is more accurate to say that it performs validation
rather than verification. The distinction is quite subtle, but validation is really lim-
ited, in this application, to ensuring that the postcodes which are returned by the
OCR module are real postcodes — i.e. they exist in the database of all valid post-
codes. Verification would involve ensuring that the postcode matched the other
address information, such as posttown, on the mail piece.
This system tackles almost exactly the same problem as will be discussed for the
remainder of this report; that is the validation of OCRed characters against a data-
base of valid words (in this case, postcodes). The problem is approached in a very
different way however. The system described uses a syntactic neural network (SNN)
Improving Automated Postal Address Recognition 3. Verification
46
to parse the grammar of postcodes, to identify the valid ones from the list of charac-
ter confidences from the OCR system. It also employs a lazy multiplication scheme
to allow efficient best first retrieval of valid codes. The real problem is to find the best
path through a set of lists of real numbers, returned by the character classifier.
Fig. 3.1 shows a possible output from the OCR module.

Fig. 3.1 - A diagram of the first stage of the SNN method for retrieving valid postcodes

  Input characters: [images of the handwritten postcode]

  Position:  1        2        3        4        5        6        7
             S 0.95   O 0.94   I 0.91   2 0.88   6 0.92   B 0.91   L 0.80
             5 0.60   0 0.89   1 0.85   Z 0.50   C 0.67   8 0.82   I 0.65
             B 0.10   Q 0.70   L 0.20   7 0.10   S 0.30   R 0.44   1 0.51
             H 0.04   D 0.15            S 0.08            E 0.32
             E 0.03

  The output of the classifier is a sorted list of the confidence values of the top few characters in each position. Since there are only a few characters in each list, the penalty for sorting the lists is negligible.

As each list is sorted, it can be seen that the most likely¹ postcode can be found by
simply taking the top line, which in this case is ‘SOI26BL’. Clearly this is not a valid
postcode and a check must be made for this. Disregarding this problem for the
moment, the next best output from the classifier is not trivial to find. What is really
needed is the full cartesian product of all the character confidences, sorted into order.
This would then give every possible output of the classifier in confidence order but is
clearly very costly to produce. The cartesian product of the above example has 8640
(5 × 4 × 3 × 4 × 3 × 4 × 3) possible postcodes involving 51840 real multiplications, and
this list would also have to be sorted after it had been generated.

1. This definition of ‘most likely’ assumes independent probability among the characters, which is not necessarily the case. However for now it will be assumed to be true.
The system described in [39] offers a way of improving the efficiency of generating
this list by implementing a lazy evaluation of the cartesian product. The overall
structure of the system is a binary tree. Each node is a processing element which
accepts two inputs from lower level nodes and passes the combination of these as its
output to the next higher level node. The lowest level nodes take their inputs directly
from the OCR system in the form of an ordered list of characters and confidence val-
ues (see Fig. 3.1). The highest level node outputs valid postcodes. At each node, the
following kind of matrix is formed from the two input sources:
Fig. 3.2 - Diagram of the matrix formed at each node of the SNN

                        1st Character
                 S      5      B      H      E
            O  [##]   [::]
  2nd       0  [::]
  Character Q
            D

  The matrix represents the inputs to the node which accepts characters 1 & 2 from the left hand end of the postcode shown in Fig. 3.1. The top left square (marked ##) is guaranteed to be the best output at first. After this, only the two lighter coloured squares (marked ::) need be considered, as one of these is guaranteed to be the next best.

The dark grey square in Fig. 3.2 is bound to be the best output at first because the lists
are ordered and the product of the top of each list will always be higher than any
other product within the list. To produce the second output, only the 2 lighter coloured
squares (which represent, from top to bottom, the sequences ‘5O’ and ‘S0’) need be
considered as they are bound to be higher than any other product from the 2 lists.
Again this is a property of the fact that the lists are ordered.

So, 4 nodes would be required to accept a 7-character postcode, with the last node
taking its inputs from the last character and a null list (which effectively just returns
the list of characters in order). Above these nodes 2 more nodes are required, taking
inputs from nodes 1 & 2 and 3 & 4 respectively. These nodes implicitly form
sequences of four characters, as each input represents a character pair from the low-
est level of the tree. The final node takes its two inputs from the middle level of the
tree and outputs complete postcodes along with their confidences. The overall struc-
ture is shown in Fig. 3.3.

Fig. 3.3 - Block diagram of the way information is processed in [39 Lucas]

  [The figure shows a binary tree: ordered lists of characters and confidences (plus a null list for the final character) feed the lowest level of processing elements (nodes), pairs of outputs are combined at each level above, and the top level node outputs valid postcodes.]

At each node, the lazy evaluation of the cartesian product of the input pair is performed
as shown in Fig. 3.2, and a check is made to ensure that, at each level, only a
valid postcode is being formulated. This means that the system has to be trained on
valid postcodes before it can be used (hence the term neural network). During training,
the lowest nodes for example are trained on valid character pairs for their
respective position within the postcode. The middle nodes are trained on valid
prefixes and suffixes, but can assume that the inputs (character pairs) are already valid
so, in fact, they only have to know which pairs can go with which to make valid 4-
character sequences. The top level node takes valid 4 and 3 character sequences and
knows how these can be combined to produce valid postcodes. As the numerical
product is passed up at each node, it is a simple matter to produce the overall post-
code confidence along with the postcode itself, from the top level node.
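The lazy evaluation performed at each node can be sketched with a priority queue, as below. This is not the implementation in [39], which gives only the idea, but it has the key property illustrated in Fig. 3.2: only the neighbours of an emitted cell ever join the frontier, so the full cartesian product is never materialised.

```python
import heapq

def lazy_product(left, right, valid_pair):
    """Yield pairings of two sorted (string, confidence) lists in
    best-first order, as one node of the tree described above.

    valid_pair discards sequences which cannot form part of a valid
    postcode (in a trained node this check would come from training).
    """
    heap = [(-left[0][1] * right[0][1], 0, 0)]
    seen = {(0, 0)}
    while heap:
        neg, i, j = heapq.heappop(heap)
        s = left[i][0] + right[j][0]
        if valid_pair(s):
            yield s, -neg
        # Only the two neighbours of the emitted cell can be next best.
        for ni, nj in ((i + 1, j), (i, j + 1)):
            if ni < len(left) and nj < len(right) and (ni, nj) not in seen:
                seen.add((ni, nj))
                heapq.heappush(heap, (-left[ni][1] * right[nj][1], ni, nj))
```

Fed with positions 1 and 2 of Fig. 3.1, the first output is ‘SO’, and the second is whichever of ‘5O’ and ‘S0’ has the higher product, exactly as in Fig. 3.2.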
The results presented in [39] show that the system performs well when compared
with a trie implementation of the same search. The SNN system also displays the
unusual property of performing a faster search as more data is added to the system.
However, taken in context, this is inevitable as the system spends most of its time
discarding invalid postcodes. Hence as more valid postcodes are added to the data-
base it takes less time before a valid one is found from the possibilities suggested by
the character classifier. In the limit, if all combinations of characters could form valid
postcodes, the system would produce the next best postcode on each cycle of com-
putation.
There is however one major disadvantage to this approach. As stated before, the
probabilities of the characters are assumed to be independent so that the probability
of the postcode can be made equal to the product of the confidences of the individual
characters. This is obviously not the case as there are some character pairs which
would be much more likely than others. More importantly, the probability of each
character is influenced by all the other characters in the postcode. By splitting the
postcode into pairs of characters in this way and then combining them into pairs of
pairs and so on until the postcode is finally output, this dependence cannot easily be
modelled. Another assumption mentioned in the report is that all postcodes are
assumed equiprobable. However it is also stated that,
“... a priori postcode probabilities can easily be modelled in theory (while retaining best-first retrieval characteristics) by having a top level node in the SNN taking one set of inputs from the data at hand, and the other set from the pre-compiled set of possible postcodes, which are also retrieved most likely first.”
It is not at all clear what this statement means. However if it is taken to mean that the
top level node takes one input from the output of the existing system and the other
from the list of postcode probabilities, then it is unclear how this helps. The true post-
code probability (ignoring for the moment the character probability interdepend-
ence) is the product of the postcode confidence from the existing SNN system and
the probability of that postcode occurring. For example, a very common postcode
would have a high probability in the pre-compiled list and should be accepted
before a very uncommon one, even if the confidence of the uncommon one was
slightly higher according to the SNN system. So it is not clear how this final list can
be output in best first order without retrieving all the valid postcodes from the SNN
system. If the pre-compiled list of postcode probabilities is complete, as it must be to
ensure that every address can be handled by the system, this means that all 1.6 mil-
lion postcodes would have to be retrieved from the SNN system and multiplied with
their corresponding probability of occurring, and the results of this sorted to give the
actual most probable postcode. In fact, using the system described in [39], it would
be possible to improve on this by retrieving postcodes from the SNN system until
the one which matches the top entry in the pre-compiled list is returned, and this has
to be the most probable postcode. However there is no way of telling how many
recalls will have to be made from the SNN system before this postcode is returned. It
is clear then that the phrase “in theory” in the above quote is essential, as the practical
implications would seem to outweigh the undoubtedly efficient system when real
probability values are required.
It is possible to imagine another tree akin to the one described above which was
trained to recognise posttowns, by combining character pairs until they form a valid
posttown name. These two trees can then be thought of as producing ordered lists of
postcodes and ordered lists of posttowns which could be combined in the same way
to eventually produce addresses. In this way, a hierarchy of trees could be used to
perform verification rather than simply validation. However this is a fairly sweeping
statement about how the system could be extended, and would require a great deal
of further work to ensure the practicality of such a system.
3.3 Summary
We have seen some of the attempts which have been made towards the verification of
automated address recognition. It is clear that this is a quite complex problem, especially
given the requirement for an on-line solution. It would appear that although there is
undoubtedly a great deal of value in a system which could improve the automated
address recognition rate, there is no immediately obvious solution. The complexity
of the task is due to the fact that the address/postcode combination was not really
designed for this kind of automation. With the infrastructure so firmly embedded in
the market place it would be difficult to change the style of addressing to any great
degree, and so this adds to the value of a system which can be reliably incorporated
into the existing processes.
There are currently around 1.6 million postcodes in use in the UK. In the case of
restricting the recognised postcodes to these valid ones, this represents a not insub-
stantial amount of data which will have to be searched to validate the postcode.
However, as shown above, there are efficient searching methods which can be
employed. The problem is compounded, though, by the fact that the character recogniser will undoubtedly fail to recognise one or more characters from the postcode
some of the time. This will then require a search of the database to determine what
possible valid postcodes the image could represent. In effect, this will produce a list
of possible characters which could occur at the position which currently cannot be
recognised. This information will have to be fed back to the character recogniser in
order for it to make a second attempt at classification, now that there is more infor-
mation available in the form of a restricted set of possibilities.
When this idea is extended to cover other features from the address, the database
which was 1.6 million records of a few characters each becomes considerably larger,
as information such as posttown name, building name or company name, P.O. Box
numbers, etc. are added to it. It is clear then that one of the most crucial parts of this
system will be a very efficient method of extracting valid addresses from the data-
base given the character recogniser’s first attempt at classifying characters from the
address image. The next chapter looks at some of the methods which can be
employed to solve this problem. A system using Correlation Matrix Memories was
found to give the best performance, and a detailed discussion of this type of system
is presented.
4. Partial Matching
If a database is queried by supplying a key which uniquely identifies the record
being sought, only one record should be returned by the database system. If how-
ever the key is not fully specified, it is possible that more than one record will match
the partial key. This is then a partial match query.
4.1 Introduction
From the previous discussion it is clear that the verification process will require a
partial match to be made on the database. This is because the OCR system is bound
at some point to fail to recognise a character and this means, for example, that the
postcode will have one or more characters missing. This forms a postcode template
which may match a number of possible postcodes in the database. The problem is
very similar to occluded object recognition, where an object must be identified even
if some of its features are unknown. The features of a postcode are the characters
which make up that postcode and when some of those features are missing, one
postcode may ‘look’ very much like several others (see section 6, “Analysis of PAF”
on page 83). What is required then is a system which can provide some sort of list of
all the postcodes which match the template given by the OCR system.
In this section, a very brief review of some of the more common methods for partial
match searching of a database is given. One of the best methods, a technique using
Correlation Matrix Memories, is then looked at in greater detail. One major problem
with using this technique is identified, which will then lead into the next section.
4.2 Review
There are many conventional systems which could in principle perform the task of
taking a partial postcode and returning all valid postcodes which fit the template.
For example, SQL databases can be queried in this way. There has been much inter-
est in partial match search algorithms in the past ([34 Rivest], [35 Burkhard],
[36 Kim, Pramanik]) such as hashing tables and tree/trie structures. A review of the
current methods is given in [37 Kennedy]. The review starts with conventional tech-
niques such as the Inverted File Technique, where an index is held for every attribute
which may form part of the partial match. The search is then performed by retriev-
ing all records from the file using the index for each attribute specified in the query
and then performing an intersection operation on the results. This was shown to be a
very inefficient method, as the more fully specified a query is, the more data is
retrieved from the database prior to the intersection operation. Next, Hash Coding
Techniques were investigated. These included Standard Hashing, Address Genera-
tion Hashing and Hashing with Descriptors. Standard Hashing uses similar techniques to the Inverted File method, but uses a hash function instead of the index for
each attribute. The same problem of excessive data retrieval for well specified que-
ries is noted. Address Generation Hashing uses the attributes to generate parts of the
address within the database of the corresponding records. It means that no intersec-
tion operation has to be performed as with Standard Hashing, but many false
matches may be returned. This is because the number of bits of the address allocated
to a particular attribute will most likely be less than the number of possible values
the attribute could take (the corollary to this is that different values of the same
attribute will hash to the same address, hence the false matches). The next technique,
Hashing with Descriptors, overcomes this problem. In this method, the attribute val-
ues for each record are hashed and the results concatenated together. This forms a
descriptor for that record. The whole file is split into a number of ‘pages’, and all the
descriptors from each record within a page are bitwise ORed to form a descriptor for
that page. There is no mention however of how the file is split, how many pages
there should be or whether the records within each page have something in common. It was stated though that this method significantly reduced the number of false pages accessed compared to the previous method.
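As a rough sketch of the Hashing with Descriptors scheme just described (the record layout, hash function and page size are all invented for illustration, and a verification step is added to discard false page hits):

```python
def field_hash(value, bits=8):
    # Hash one attribute value to a small bit field with one bit set.
    return 1 << (hash(value) % bits)

def descriptor(record, bits=8):
    # Concatenate the hashed attribute fields into a record descriptor.
    d = 0
    for i, field in enumerate(record):
        d |= field_hash(field, bits) << (bits * i)
    return d

# Invented (postcode, posttown) records, split into pages of two.
records = [('SW1A 1AA', 'LONDON'), ('YO1 5DD', 'YORK'),
           ('M1 1AA', 'MANCHESTER'), ('YO10 5DD', 'YORK')]
pages = [records[i:i + 2] for i in range(0, len(records), 2)]

# A page descriptor is the bitwise OR of its records' descriptors.
page_desc = []
for page in pages:
    d = 0
    for rec in page:
        d |= descriptor(rec)
    page_desc.append(d)

def query_by_town(town):
    # Partial match: only the posttown attribute is specified.
    q = field_hash(town) << 8
    hits = []
    for page, d in zip(pages, page_desc):
        if d & q == q:                                 # page may match...
            hits += [r for r in page if r[1] == town]  # ...so verify it
    return hits

print(query_by_town('YORK'))   # -> [('YO1 5DD', 'YORK'), ('YO10 5DD', 'YORK')]
```

Only pages whose descriptor contains the query code are opened; the verification pass makes the final answer exact despite hash collisions.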
The report then goes on to consider superimposed-coding techniques. These are sim-
ilar to the Hashing with Descriptors method outlined above but instead of concate-
nating the hashed attributes, they are superimposed or bitwise ORed on top of one
another. These superimposed codes are then used to form the index to the file but
only one index is needed for all the attributes. A query is processed by forming the
superimposed code of the attributes in the query and then searching the index for all
index codes which contain the query code. These records are then retrieved. A more
advanced method is two-level superimposed coding, which simply treats the index
codes as records, which are then hashed and superimposed to form a hierarchical
structure (albeit only a two level one). The query is made by forming codes for both
indexes. The higher level one is searched first (as it is smaller) and this results in a
subset of the second level index. This subset is then searched using the second code
from the query to get the actual records. This method was shown in the worst case to
be no worse than one-level superimposed coding, but usually to be much more effi-
cient, as the size of the index which has to be searched is usually much smaller.
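One-level superimposed coding can likewise be sketched in a few lines (the hash and code width are invented; a verification step discards the false matches that ORing can introduce):

```python
def signature(attrs, bits=32):
    # Superimpose (bitwise OR) one hashed bit per attribute into a
    # single code, rather than concatenating the hashed fields.
    s = 0
    for a in attrs:
        s |= 1 << (hash(a) % bits)
    return s

table = [('SW1A 1AA', 'LONDON'), ('YO1 5DD', 'YORK'), ('HU5 2EH', 'HULL')]
index = [(signature(rec), rec) for rec in table]

def search(*attrs):
    # A record qualifies if its index code contains every query bit;
    # qualifying records are then verified against the actual values.
    q = signature(attrs)
    return [rec for code, rec in index
            if code & q == q and all(a in rec for a in attrs)]

print(search('YORK'))   # -> [('YO1 5DD', 'YORK')]
```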
A number of variations on the superimposed coding techniques were also reviewed
which involved various trade-offs between storage, disk accesses and performance.
However none were shown to have any significant advantage over the others. They
all simply represent a kind of tuning which could be performed for a particular
application.
It was shown however that a system based on Correlation Matrix Memories (CMMs)
can outperform other conventional partial match algorithms for certain classes of
problem. These problems are ones of the form:
“Return all records which match n from m attributes where n ≤ m”
This means that while say 4 attributes can be provided to the search algorithm, it can
be asked to return all records which contain any 2 of those attributes. While the
inherent ordering of the characters within a postcode does not require such a general matching algorithm, as it can be accomplished simply by a wildcard type search,
there are some extensions to this idea which would require such a searching capabil-
ity (see section 8.3, “Strategies for Verification” on page 101). It should be stated
however that a system capable of performing these types of extended queries is per-
fectly capable of making the standard partial match queries simply by ensuring that
the values of n and m above are equal. That way the only records returned are the
ones which contain all the attributes passed to the search algorithm.
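The semantics of the "n from m" query can be pinned down with a naive linear scan (purely illustrative; the CMM system achieves the same effect far more efficiently):

```python
def n_of_m(records, attrs, n):
    # Return every record containing at least n of the m supplied attributes.
    return [r for r in records if sum(a in r for a in attrs) >= n]

recs = [{'YO1 5DD', 'YORK', 'MAIN STREET'},
        {'YO1 5DD', 'LEEDS'},
        {'HU5 2EH', 'HULL'}]

# Any 2 of 4 attributes:
print(n_of_m(recs, ['YO1 5DD', 'YORK', 'HULL', 'MAIN STREET'], 2))
# With n = m the query reduces to a standard partial match:
print(n_of_m(recs, ['HU5 2EH', 'HULL'], 2))
```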
In fact, the system proposed in [37] deals with much more abstract entities than char-
acters from a postcode, and in particular can be made sensitive or insensitive to the
ordering of the attributes passed to it. This is ideal for a reasoning system, where the
order in which the information is presented is irrelevant (and is one of the main
strengths of the system). However the ordering of the characters is an essential part
of the postcode. We do not want to recognise some of the characters and then
retrieve a list of all postcodes which contain those characters in any order; in fact, we
require a list of postcodes which have the recognised characters in specific positions.
To accomplish this, while still retaining the speed advantages of the new system, we
simply omit the binding and superimposing stages detailed in [37] which are what
allows the system to produce results for arbitrary orderings of attributes. It may be
that at a later stage, the order independence capabilities will be exploited. It may be
possible to use the system to bring together other information from the address
image. There is no ordering inherent in the postcode, post town and street name, yet
they are all attributes of some record within the database. A search may need to be
made using any or all of these, depending on what can be recognised from the
address image, and this is discussed in section 8.3.
The remainder of this chapter will describe the operation of CMMs in greater detail
and, in particular, show how they can be used within this application.
4.3 Correlation Matrix Memories
These were proposed in [31 Willshaw et al.] in 1969 and were based on the image
recall properties of holograms, although the original idea came from Steinbuch matrices. The basic structure of the associative network is shown below.
Fig. 4.1 - Diagram of a simple correlation matrix memory. (The horizontal lines represent the inputs to the matrix, the vertical lines the outputs; each dot represents a bit set to 1 in the binary matrix.)
The memories can store an association between two binary patterns or numbers. Pat-
terns to be associated are presented to the matrix as binary strings. One pattern is
applied to the horizontal lines and the other to the vertical lines. Where two 1’s in the
patterns coincide, that position in the matrix is set to a 1. During recall, the input pat-
terns are applied to the horizontal lines and the rows of the matrix which have 1’s
applied to them are summed vertically to form the output. This output is then
thresholded according to a certain algorithm and the original pattern is thus recalled.
The operation is shown in Fig. 4.2.
Fig. 4.2 - A CMM during recall. (The input pattern is applied horizontally and the output pattern appears vertically; with five input bits set to one, the threshold value is five.)
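The training and recall steps just described can be sketched as follows (a toy model with invented sizes and patterns, not the implementation discussed here):

```python
def train(matrix, in_pat, out_pat):
    # Where a 1 in the input pattern coincides with a 1 in the output
    # pattern, set that cell of the binary matrix to 1.
    for i, a in enumerate(in_pat):
        for j, b in enumerate(out_pat):
            if a and b:
                matrix[i][j] = 1

def recall(matrix, in_pat):
    # Sum the rows selected by the 1s of the input pattern, then
    # threshold at the number of input bits set to 1.
    sums = [sum(matrix[i][j] for i, a in enumerate(in_pat) if a)
            for j in range(len(matrix[0]))]
    t = sum(in_pat)
    return [1 if s >= t else 0 for s in sums]

# Store one association in an 8 x 8 matrix and recall it.
M = [[0] * 8 for _ in range(8)]
x = [1, 0, 1, 0, 0, 1, 0, 0]
y = [0, 1, 0, 0, 1, 0, 0, 1]
train(M, x, y)
print(recall(M, x))   # -> [0, 1, 0, 0, 1, 0, 0, 1]
```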
There are many issues connected with the performance of such a system, for exam-
ple:
• Number of associations which can be stored
• Size of array to represent number of input and output patterns required
• Number of bits set in input and output patterns
• Thresholding algorithms
• Coding of actual inputs to input patterns, and similarly for outputs to
output patterns
These issues will be dealt with in turn along with a method for using CMMs for
recalling more than one pattern at a time. This is essential for performing partial
matching on the database.
4.3.1 Storage Capacity of a CMM
The basic equation for the error-free storage capacity of a CMM is shown below (from [32 Nadal, Toulouse]).

N = \frac{(\log 2)^3 \, w^2}{(\log w)^2}        Eqn. 4.1
The value N is the maximum number of associations which can be stored by a CMM
whose input and output sizes are both w, while guaranteeing that there will be no
errors in the output pattern. The equation is based on having log₂ w bits set to 1 in
both the input and output patterns, and with a random distribution of patterns. This
serves as a rule-of-thumb when estimating the size of CMM required for a certain
application. However it only caters for square matrices, and some more work is
required to find the general solution for w × h matrices where w and h are the width
and height of the matrix respectively. It may also be advantageous to have some
other number of bits set to 1 rather than the function of w given above. Again, the equation for the number of associations which can be stored would need some alteration to reflect that.
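As a worked instance of this rule of thumb (the matrix size is purely illustrative):

```python
from math import log

def cmm_capacity(w):
    # Error-free capacity of a square w x w CMM with log2(w) bits set
    # per pattern (Eqn. 4.1, after [32 Nadal, Toulouse]).
    return (log(2) ** 3) * w ** 2 / log(w) ** 2

# For a 1024 x 1024 matrix, roughly 7,268 associations can be stored
# before errors must appear in recalled patterns.
print(round(cmm_capacity(1024)))   # -> 7268
```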
4.3.2 Coding of Input and Output Patterns
To recall an output pattern after an input pattern has been summed through the
matrix, a suitable threshold must be applied to the raw totals. It is clear that the cor-
rect value to threshold at is the number of bits set to 1 in the input pattern. However
this should actually be the number of bits set to 1 in the original input pattern. If a
noisy pattern is being applied to the CMM, there may be more or less 1’s in the pat-
tern than there were during the training phase (when the associations were stored).
The matrix will still recall the correct output pattern, but the threshold value must be
set correctly. If the input is noisy, there is no easy way to determine how many 1’s
there should have been. A solut ion to th is problem was proposed in
[33 Austin, Stonham], where every output pattern used has the same number of bits
set – the position of the bits is the only thing that changes from one pattern to
another. This is known as k-bit coding.
Using their scheme, the maximum number of patterns P which can be generated is given by the following equation.

P = \binom{w}{k}        Eqn. 4.2

where \binom{w}{k} is the combinatorial (choose) operator, w is the width of the code and k is the number of bits set to 1 in that code.
The thresholding problem is now simply to select the k highest responding outputs,
thus producing a k-bit binary pattern. This property is key to the operation of the
ADAM associative memory system described in [33]. The maximum number of pat-
terns which can be generated in this way is considerably more than the number of
associations which can be stored in the matrix, given that the matrix has input and
output sizes which are of the same order of magnitude. Therefore it does not put any
restrictions on the capacity of the network.
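Both points can be demonstrated briefly (the raw output sums reuse the example values shown in Fig. 4.2):

```python
from math import comb

# Eqn. 4.2: number of distinct codes of width w with k bits set.
print(comb(10, 3))   # -> 120

def k_max_threshold(sums, k):
    # Keep the k highest-responding outputs (a real system must also
    # break ties at the cutoff, which this sketch ignores).
    cutoff = sorted(sums, reverse=True)[k - 1]
    return [1 if s >= cutoff else 0 for s in sums]

print(k_max_threshold([5, 2, 5, 5, 2, 5, 5, 2], 5))   # -> [1, 0, 1, 1, 0, 1, 1, 0]
```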
The interesting and essential property of CMMs in this application comes into play
when an incomplete input pattern is applied. By carefully controlling the threshold-
ing process, the correct output pattern can still be recalled. However if the incom-
plete input pattern now matches not one but two or more original input patterns,
then the output patterns associated with each of these will be returned bitwise ORed
on top of one another. This is the way that a CMM can be made to perform partial
matching. The actual process involved here stems from the fact that the CMM is a
type of neural network which forms associations between patterns. When being
tested, the input pattern is matched against all the patterns trained into the network,
and the output pattern associated with the stored pattern which most closely
matches the input pattern is generated. When an incomplete input pattern is applied
to the network, it may be that this partial pattern is equally similar to 2 or more
stored patterns. In this case, the network has no way to distinguish them. Its
response is to assume that the input pattern could be any one of the similar patterns,
and to output all the output patterns which match. However as it only has one out-
put array, the outputs are superimposed on top of one another and they then have to
be separated into the individual output patterns. By carefully controlling the way the
actual data is mapped to the different input and output patterns, it is possible to
define a method for performing partial match type queries. For example, let us
assume that the input data are words from some dictionary. All the words are three
characters long. A simple mapping would be to give each character a field in the
input pattern, say 1 in 26 bits representing the character of the alphabet. These 26 bit
words are then simply concatenated to form the actual input to the CMM. Some
examples are shown in Fig. 4.3.
Fig. 4.3 - Example input pattern coding for a CMM to use partial matching
Note that the entries in the table are 1-dimensional binary strings — they are only
split across lines to prevent the table from being too wide. The final input pattern can
be seen then to be a 78 (26 × 3) bit pattern. Now once these patterns have been associ-
ated with suitable output patterns in the CMM (suitable meaning that there is a one-
to-one mapping between the output patterns and the original words), it can be used
to perform partial matching such as ‘C?T’, meaning all words which have ‘C’ at the
Word   Character 1 bit pattern      Character 2 bit pattern      Character 3 bit pattern
CAT    00100000000000000000000000   10000000000000000000000000   00000000000000000010000000
COT    00100000000000000000000000   00000000000000100000000000   00000000000000000010000000
DOG    00010000000000000000000000   00000000000000100000000000   00000010000000000000000000

CMM input pattern (the three fields concatenated):
CAT    001000000000000000000000001000000000000000000000000000000000000000000010000000
COT    001000000000000000000000000000000000000010000000000000000000000000000010000000
DOG    000100000000000000000000000000000000000010000000000000000010000000000000000000
beginning, ‘T’ at the end and any other letter in the middle position. This is achieved
by taking the patterns for ‘C’ and ‘T’ and putting a string of 26 zeros between them.
This gives a 78 bit input pattern, but the total number of 1’s on the input is now 2
instead of 3. This means that an adjustment to the thresholding must be made in
order to compensate. There is nothing particularly subtle in this — the total number
of expected 1’s is known (3, as this is the number of characters in the words this
CMM will recognise) and the number of characters missing is known. When the out-
put of the CMM is thresholded accordingly, the result will be the patterns for ‘CAT’
and ‘COT’ superimposed on top of one another (see Fig. 4.4). As the output patterns
are directly mappable onto the original words, it is a simple matter to search the out-
put of the CMM for known output patterns, and this will give us back the list of
words.
Fig. 4.4 - Result of recalling ‘C?T’ from a CMM
Seven different methods which can be used to separate the output into its constitu-
ent codes are discussed in [37]. It is shown that overall, the method with best per-
formance is Middle Bit Indexing [38 Filer]. However this assumes various
parameters for a specific application ([45 Austin et al.]), and may need to be re-eval-
uated for a different application. There would be no benefit in undertaking this work
at the current time.
There is one issue which will be of importance to any application using CMM tech-
niques which has not yet been considered — that of ghosting. Ghosting is an unde-
sirable feature of the way the outputs are generated by the CMM. Because they are
Words                  Example Output Codes
CAT                    00000100000000010000
COT                    00010000000010000000
Superimposed Result    00010100000010010000†

† This code contains the codes for ‘CAT’ and ‘COT’, ORed together.
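The whole ‘C?T’ example can be run end to end (the 20-bit output codes with 2 bits set are invented for this sketch and differ from those shown in Fig. 4.4):

```python
def encode(word):
    # Three concatenated 26-bit fields, one bit per character; a '?'
    # leaves its field empty.
    pat = []
    for ch in word:
        field = [0] * 26
        if ch != '?':
            field[ord(ch) - ord('A')] = 1
        pat.extend(field)
    return pat

out_codes = {'CAT': {5, 15}, 'COT': {3, 12}, 'DOG': {0, 9}}  # bit positions

# Train: set matrix cells where input and output 1s coincide.
M = [[0] * 20 for _ in range(78)]
for word, bits in out_codes.items():
    for i, a in enumerate(encode(word)):
        if a:
            for j in bits:
                M[i][j] = 1

# Recall 'C?T': only 2 of the 3 expected input bits are present, so
# the threshold is lowered from 3 to 2.
q = encode('C?T')
sums = [sum(M[i][j] for i, a in enumerate(q) if a) for j in range(20)]
result = {j for j, s in enumerate(sums) if s >= 2}

# Separate the superimposed output by testing each known output code.
matches = sorted(w for w, bits in out_codes.items() if bits <= result)
print(matches)   # -> ['CAT', 'COT']
```

The superimposed result contains exactly the codes for ‘CAT’ and ‘COT’, and scanning the known output codes recovers the word list.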
superimposed, it is not always easy to tell what patterns are there. This effect can be
shown by a simple example using a familiar 7-segment display used in digital
watches, etc. Suppose that this is the output of the CMM, and the actual patterns are
‘1’ and ‘2’. Superimposing these is shown below:
Fig. 4.5 - Superimposition of 2 7-segment number patterns
It is now not clear whether the final pattern contained just a ‘1’ and a ‘2’ as this same
pattern would be made by ‘2’ and ‘3’ or ‘2’ and ‘7’ (in their 7-segment form). So the
four numbers which could be extracted from the pattern are ‘1’, ‘2’, ‘3’ and ‘7’. If only
two of these were actually used to make the pattern in the first place, the remaining
two are called ghosts. The problem arises because the numbers which are used to
make the final pattern are hidden within the internal workings of the CMM and
there is no way to find out directly which numbers were used and which weren’t.
The next chapter gives a detailed discussion of how and why ghosting occurs, and
presents a method for reducing its undesirable effects.
5. Ghosting
Ghosting is the term given to a property of images which are superimposed. The
images may be binary numbers or line drawings. The effect is the same, and it is that
once two or more images are superimposed, it is not always possible to know for certain which of a number of possible original images were used to make the superimposition.
5.1 Introduction
It is a simple property of binary numbers that given an arbitrary set of fixed width
numbers, it is possible in principle for some combination of codes ORed together to
include codes from the set which were not among those ORed together. An example
is shown in Fig. 5.1.
Code 1: 0010010
Code 2: 1000100
Code 3: 0000110
1 OR 2: 1010110

Fig. 5.1 - Example of superimposed codes generating a ghost. (Codes 1 and 2 ORed together produce a result which includes code 3.)
Given this result, if the ORed code were to be separated up into its constituent codes
using the techniques referred to in section 4, it would be impossible to tell whether
or not code 3 was included in the ORing operation. If it was not, as in this case,
it is known as a ‘Ghosted Code’ or ‘Ghost’. This is simply a binary representation of the
example given at the end of the previous section — if the parts of the 7-segment display were arranged in a row they could be thought of as forming binary numbers where the lit segments represent a ‘1’ and the unlit segments represent a ‘0’.
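The effect is easy to reproduce with the codes of Fig. 5.1 written as integers (a minimal sketch):

```python
def ghosts(code_set, chosen):
    # A ghost: a code from the set, not among those chosen, that is
    # nevertheless included in the OR of the chosen codes.
    union = 0
    for c in chosen:
        union |= c
    return [c for c in code_set if c not in chosen and c & union == c]

code1, code2, code3 = 0b0010010, 0b1000100, 0b0000110
print(ghosts([code1, code2, code3], [code1, code2]))   # -> [6], i.e. code 3
```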
In this chapter, the causes of ghosting are explored and a definition for a set of binary
numbers which exhibit a desirable property when used with CMMs is given. This
property is that a particular set will exhibit a known worst case ghosting no matter
what combination of codes are superimposed. The size of the set thus determines the
number of records which can be stored by the CMM system, and it is therefore desir-
able to maximise the size of the set while retaining the maximum ghosting property.
While the definition of the set is rigorous, there is no obvious efficient method for
generating such sets. In the absence of this, a brute force algorithm was used to gen-
erate some small sets for experimentation. The term small applies both to the width
of the binary numbers and to the number of elements within the set. Because of the
algorithm used, the time taken to generate the sets increases factorially with the
width of the code and so it was only practical to produce small sets. The experiments
were designed to investigate how the sets might behave as the width of the numbers
increases. Without a sound mathematical basis for the generation of the sets, 4 differ-
ent models are tested to give rough estimates for the expected size of set given a par-
ticular width. The deficiencies of these models are pointed out, but it is shown that
their predictions are quite encouraging.
5.2 Problems Caused by Ghosting
When a partial match retrieval is performed on a database stored using CMMs,
ghosting may occur as described above. The reason this is a problem is clear when it
is taken up one level of abstraction. The codes returned by the CMM represent
records from the database. Once the ORed code is separated into its constituent
codes, the records can be uniquely identified. If one of the codes is a ghost, this
means the CMM has returned a record which should not be in the set of records
which correspond to the query performed. In effect, it has returned all the correct
records as well as some extra, incorrect records. This can be likened to the false
matches which are obtained when using some of the database systems mentioned on
page 54 in section 4. These incorrect records will have to be identified and removed
before the system can return the actual result of the query. It is obvious therefore that
the effect of ghosting should be reduced as much as possible in a system designed to
perform partial matching, as it represents extra work which must be carried out by
the system and will thus reduce performance.
In fact, it can be shown simply that the effect of ghosting can be prevented by ensur-
ing that the output codes conform to some specification. However this drastically
reduces the number of codes which can be generated. For partial matching, where
any number of codes may be returned by the system, it can be shown empirically
that the number of codes, N, which would be usable to guarantee no ghosts is given by Eqn. 5.1.

N = w - k + 1        Eqn. 5.1

In this equation, w is the width of the codes and k is the number of bits set to 1.
This means that the number of usable codes is linear with the code width which in
turn means that in any practical partial match retrieval system using CMMs, some
level of ghosting will have to be tolerated. It is obvious that prior knowledge of the
extent to which ghosted codes will be generated is very important in assessing the
performance of the system. The next section therefore deals with sets of codes which
display a property whereby the maximum number of ghosts that will be generated is
fixed for that set.
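Eqn. 5.1 can be checked by exhaustive search at toy sizes (a sketch that is only feasible for very small w and k; with no limit on how many codes may be superimposed, a set is ghost-free exactly when no code is covered by the OR of all the others):

```python
from itertools import combinations

def ghost_free(codes):
    # No code may be included in the OR of all the other codes.
    for c in codes:
        union = 0
        for d in codes:
            if d != c:
                union |= d
        if c & union == c:
            return False
    return True

def largest_ghost_free(w, k):
    # Exhaustive search over subsets of all weight-k codes of width w.
    pool = [sum(1 << p for p in pos) for pos in combinations(range(w), k)]
    best = 1
    for r in range(2, len(pool) + 1):
        if any(ghost_free(sub) for sub in combinations(pool, r)):
            best = r
        else:
            break
    return best

# Both agree with N = w - k + 1 (Eqn. 5.1).
print(largest_ghost_free(4, 2), largest_ghost_free(5, 2))   # -> 3 4
```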
5.3 Maximum Ghosting Sets
There are basically four parameters which define the sets of codes being discussed.
These are:
• w — the width of the code in bits
• k — the number of bits set to 1 in each code
• s — the maximum number of codes which will be superimposed, which
is the maximum number of records which will be returned by the partial
match
• g — the maximum number of ghosts which will be generated when no
more than s codes are superimposed
The sets can be specified by an identifying sequence such as w10k3s2g2, where each
number indicates the value of the parameter immediately preceding it. Such a set
would consist of codes which are 10 bits wide, each having 3 of those bits set to 1, with the guarantee that when no more than 2 codes are superimposed, no more than 2 ghosts will be generated. A formal specification of the sets now follows.
A code, c, can be represented as a set of integers which denote the positions of the bits set to 1 within that code.
Eqn. 5.2
The includes operator as defined in terms of binary patterns in Fig. 5.1, is now sim-
ply the subset relation.
Eqn. 5.3
To specify the sets mentioned above, let S denote the set of codes. Then, for S to be a
set with parameters w, k, s, g as explained above, Eqn. 5.4 must hold.
Eqn. 5.4
This equation states that for all combinations of s distinct codes from S, the number
of ghosts which will be generated by ORing together those s codes (achieved using
the set union operator), will be less than or equal to g.
c = \{ p_1, p_2, \ldots, p_k \}        Eqn. 5.2

c_a \text{ includes } c_b \iff c_a \supseteq c_b        Eqn. 5.3

\forall x_1, x_2, \ldots, x_s \in S \ (x_1 \neq x_2 \neq \ldots \neq x_s):\quad \mathrm{card}\{\, y \in S \mid y \notin \{x_1, \ldots, x_s\} \wedge \bigcup_{i=1}^{s} x_i \supseteq y \,\} \leq g        Eqn. 5.4
The representation of a code as a set can be freely converted to a real binary code b simply by taking the sum of 2 raised to the power of every element of the set.

b = \sum_{i=1}^{k} 2^{c[i]}        Eqn. 5.5
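The specification translates almost directly into code (a sketch, with codes held as integers via Eqn. 5.5; the example codes are those of Fig. 5.1):

```python
from itertools import combinations

def to_binary(positions):
    # Eqn. 5.5: a code-as-set-of-positions becomes an integer.
    return sum(2 ** p for p in positions)

def max_ghosts(codes, s):
    # The quantity bounded by g in Eqn. 5.4: the worst-case number of
    # ghosts over every choice of s distinct codes from the set.
    worst = 0
    for chosen in combinations(codes, s):
        union = 0
        for c in chosen:
            union |= c
        worst = max(worst, sum(1 for y in codes
                               if y not in chosen and y & union == y))
    return worst

codes = [to_binary({1, 4}), to_binary({2, 6}), to_binary({1, 2})]
print(max_ghosts(codes, 2))   # -> 1 (codes 1 and 2 can ghost code 3)
```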
5.3.1 Generating the Sets
The sets can be generated easily enough by simply taking the set of all possible codes
which can be generated within the bounds of w and k and adding them one by one to
the set, checking each time that the conditions set by s and g are not broken. This is
basically a brute force algorithm and as such is not very efficient. An improvement
can be made to this algorithm by considering the Hamming distance between codes
as they are added to the set S. The example in Fig. 5.2 shows that codes with a large
Hamming distance tend to produce smaller sets. By ensuring that codes are added to
the set in least-Hamming-distance-first order, then in general, larger sets will be produced.
Code 1: 111000
Code 2: 000111
1 OR 2: 111111

Fig. 5.2 - Example of orthogonal codes which can ghost any other code. (These 2 codes when ORed produce a code which can ghost any other in the set, as it has all its bits set to 1; no more codes could be added here without increasing the ghosting.)
Even if codes are taken in this quasi-sorted order there are still plenty of different
orderings of codes to be considered. What is intriguing is that the order the codes are
added to the set can have a marked effect on the final size of the set. This implies that
there is some other feature of the ordering which should be taken into consideration
when generating the sets, but this feature is not immediately obvious. In the interim,
it is sufficient to use a random ordering along with the heuristic described above,
and run many iterations of the generation program to obtain the best set within some
time limits. It is impractical to run with every possible ordering of codes, simply
because of the number of combinations involved. The table in Fig. 5.3 shows the relative increase in time taken to execute an exhaustive search on a Silicon Graphics R8000 based machine.

Fig. 5.3 - Times to complete exhaustive search of some small code sets.
However, it is possible to generate sub-optimal sets using the random search
method. These are only sub-optimal in that they are not necessarily the largest set
possible, but they do conform to the ghosting specification as described earlier. With
these codes, it is possible to perform some analysis which might give an insight into
a possible efficient algorithm for generating them and some mathematical specifica-
tions which would allow them to be modelled in order to determine other parame-
ters such as the required width of code for a certain database application, etc.
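The random search procedure described above might be sketched as follows (a simplified toy, not the program used for the experiments; the parameters are kept deliberately tiny so that it runs quickly):

```python
import random
from itertools import combinations

def max_ghosts(codes, s):
    # Worst-case ghost count over all choices of s codes (Eqn. 5.4).
    worst = 0
    for chosen in combinations(codes, s):
        union = 0
        for c in chosen:
            union |= c
        worst = max(worst, sum(1 for y in codes
                               if y not in chosen and y & union == y))
    return worst

def hamming(a, b):
    return bin(a ^ b).count('1')

def generate_set(w, k, s, g, rounds=3, cap=12, seed=1):
    # Random restarts; within each round, candidates are tried in
    # least-Hamming-distance-first order relative to the chosen codes,
    # and kept only if the (s, g) ghosting bound still holds.
    rng = random.Random(seed)
    all_codes = [sum(1 << p for p in pos) for pos in combinations(range(w), k)]
    best = []
    for _ in range(rounds):
        pool = all_codes[:]
        rng.shuffle(pool)
        chosen = [pool.pop()]
        added = True
        while added and len(chosen) < cap:   # cap keeps the sketch fast
            added = False
            pool.sort(key=lambda c: min(hamming(c, x) for x in chosen))
            for i, c in enumerate(pool):
                if max_ghosts(chosen + [c], s) <= g:
                    chosen.append(pool.pop(i))
                    added = True
                    break
        if len(chosen) > len(best):
            best = chosen
    return best

result = generate_set(8, 3, 2, 1)
print(len(result), max_ghosts(result, 2))
```

The returned set always satisfies the ghosting bound; only its size depends on the ordering, which is exactly the behaviour observed in the experiments.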
5.4 Analysis of some Maximum-Ghosting Sets
Because of the computational problems involved in generating these codes, only
very small codes have been analysed. They were generated by running the random
search method mentioned above a number of times and using the best set found
over all the runs. The number of runs used to produce each set for given values of
the parameters was dependent on the parameters themselves. For example, for the
smallest sets such as w10k3s2g1, 10000 iterations could be used. However for sets
such as w35k4s2g2, only 15 iterations were possible. Even then it could take over 24
hours to complete one run on the same Silicon Graphics R8000 machine. One prob-
lem with the small codes used is that it is already known that some of the models
used to determine CMM performance do not work well at very small code sizes.
Set Specification    Time Taken
w4k2s2g2             0.072 seconds
w5k3s2g2             8 minutes
w6k3s2g2             7.7 million years†

† This test was not actually performed! It was extrapolated from the previous
test, which would give a conservative estimate of the true figure.
Nevertheless, it is hoped that this analysis will give at least some indication
of how the larger codes would behave.
The following graphs show how the set size varies with code width, given that the
remaining parameters are fixed. The vertical lines show the points which were
actually calculated; the main curve shows the trend between these points.
Fig. 5.4 - Graphs of set size against code width for k3s2g1 and k3s2g2
(Both plots show Set Size in codes against Code Width in bits: k3s2g1 over
widths 10-70 with set sizes up to about 250 codes, k3s2g2 over widths 10-50
with set sizes up to about 800 codes.)
Fig. 5.5 - Graphs of set size against code width for k4s2g1 and k4s2g2
(Both plots show Set Size in codes against Code Width in bits: k4s2g1 over
widths 10-20 with set sizes 10-35 codes, k4s2g2 over widths 10-35 with set
sizes up to about 550 codes.)
The first graph in Fig. 5.5 exhibits some undesirable behaviour in that it should
ideally be a smooth curve. The reason for these results is simply that the amount
of time taken to generate the sets meant that it was not feasible to run as many
iterations as would be necessary to give smooth data points. It just happened that
the runs for code widths 12 and 15 produced larger sets than the others in the
given time. However, given enough runs, it is expected that the other points would
gradually move up to smooth out the curve. Other than this anomaly, the figures
tend to show that overall there is a more than linear increase in the size of set
with a linear increase in the width of code. This is really the only useful
result, and any other outcome would
have basically indicated that further work would be fruitless — a linear increase
would indicate that the size of the CMM would grow at least as fast as the size of the
problem, and a less than linear increase would indicate that the CMM would grow
more quickly than the size of the problem. Neither of these outcomes would be use-
ful in practical terms. However these results show that, in fact, for a linear increase in
code width, a more than linear increase in set size is obtainable and hence a more
than linear increase in the number of associations which would be possible, while
still guaranteeing the maximum ghosting property of the set.
It would now be useful to be able to model the curves with a function, so that set size
values can be predicted for higher code widths, rather than having to run the gener-
ation program which takes exponentially more time as the code width increases.
Four models were put forward to match the k3s2g2 and k4s2g2 curves of
Fig. 5.4 and Fig. 5.5. These are outlined in turn, along with their merits and
predicted results.
5.4.1 Quadratic Model
As a first step, it was decided to model the set size as a simple polynomial function.
An attempt was made to fit a quadratic function to the curves for the data sets
k3s2g2 and k4s2g2. These two graphs are shown in Fig. 5.6.
Fig. 5.6 - Graphs of quadratic functions against experimental data for sets k3s2g2 and k4s2g2
It can be seen that the first graph of Fig. 5.6 fits the data points quite precisely, having
an average correlation of 0.7. However the second graph shows the curve only
roughly fits the data points, and has an average correlation of 26.8. This could be
simply because the data points are not accurate enough to fit a smooth function to
them, or it could be because the data does not actually represent a quadratic
function, the first result being pure coincidence. Without further research into
the mathematical behaviour of the sets, there is no way to settle this question.
5.4.2 Cubic Model
This model is similar to the previous one, but uses a polynomial of one higher
degree. The graphs for these functions are shown below.
Fig. 5.7 - Graphs of cubic functions against experimental data for sets k3s2g2 and k4s2g2
The average correlations of these two functions to the data sets are 0.7 and 18.6, a
slight improvement for the k4s2g2 set.
It would be possible to go on increasing the degree of the polynomial and get ever
closer results, but it would seem that an alternative model may be more useful.
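The polynomial fits above, and the log-log line fit used for the exponential model in the next subsection, can be sketched with NumPy. The data points below are illustrative stand-ins, not the thesis's measured values, and the mean absolute residual is our assumption for what the "average correlation" figure of merit measures.

```python
import numpy as np

# illustrative (code width, set size) points standing in for the k3s2g2 data
w = np.array([10.0, 15, 20, 25, 30, 35, 40, 45, 50])
size = np.array([5.0, 25, 60, 120, 210, 330, 470, 620, 780])

quad = np.poly1d(np.polyfit(w, size, 2))    # quadratic model (cf. Eqn. 5.7)
cubic = np.poly1d(np.polyfit(w, size, 3))   # cubic model (cf. Eqn. 5.9)

# 'exponential' (power-law) model card(S) = w**a / b, fitted as a straight
# line in log-log space: log(size) = a*log(w) - log(b)
a, c = np.polyfit(np.log(w), np.log(size), 1)
power = lambda x: x ** a / np.exp(-c)

def avg_residual(model):
    """Mean absolute deviation between the fitted model and the data."""
    return float(np.mean(np.abs(model(w) - size)))

for name, model in [("quadratic", quad), ("cubic", cubic), ("power", power)]:
    print(name, round(avg_residual(model), 3))
```

Because the cubic fit has an extra free coefficient, its least-squares error can never exceed the quadratic's, which mirrors the observation below that the cubic model cannot perform worse.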
5.4.3 Exponential Model
This model uses an exponential function, which can be fitted to the data by taking
the logarithm of both axes and fitting a straight line to the result. The coefficients of
this line can then be used to calculate the required exponential coefficients. The
result of this analysis is shown in the graph below.
Fig. 5.8 - Combined graphs1 showing exponential functions against experimental
data for k3s2g2 and k4s2g2

1. These graphs are combined to illustrate a feature which is clearer when the
point at which the two lines cross can be seen (see section 8.2, “Values of k”
on page 100).

The average correlation figures for these two functions are 0.9 and 0.5 for the
two data sets respectively. These are correlations based on fitting linear
functions to logarithmic data points, and so must be converted to correlations
against the real data values before they can be compared to the other models.
When this is done, the actual average correlations are 2.4 and 1.6.

5.4.4 Set Size Ratio Model

An alternative approach to the problem is to investigate how the ratio of set
size to total possible codes varies with the code width. The total possible
codes N which
can be generated for given values of w and k is a simple combinatorial function
shown in Eqn. 5.6.

N = w! / (k! (w - k)!)                                             Eqn. 5.6

If this ratio was constant with code width, it would provide an easy method for
predicting the behaviour of the set size. The resulting graphs are shown in
Fig. 5.9.

Fig. 5.9 - Graphs of ratio functions against experimental data for sets k3s2g2
and k4s2g2

These models seem quite accurate, having an average correlation between the
experimental data and the fitted functions of 0.01 and 0.4 respectively. It is
probable that the difficulty in producing the experimental data1 for the k4s2g2
set is the reason this data does not fit quite as well, however it certainly
shows the correct trend.

1. This difficulty is the fact that as the values of the different parameters
increase linearly, the time taken to compute the results for a given set of
parameters increases combinatorially, and so the experiment could not be run
the same number of times for k=4 as for k=3.
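Eqn. 5.6 is the standard binomial coefficient; as a quick sketch in Python:

```python
from math import factorial

def total_codes(w, k):
    """Eqn. 5.6: the number N of w-bit codes with exactly k bits set."""
    return factorial(w) // (factorial(k) * factorial(w - k))
```

For example, total_codes(6, 3) gives the 20 candidate codes available to the w6k3 sets discussed earlier.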
5.4.5 Comparison of Models
The four models which have been described all have their merits and demerits. The
quadratic function is very simple, but has the lowest overall quality of fit with
the data analysed.
The cubic function performed better as would be expected. Indeed it would not be
possible for it to perform worse than the quadratic, as by setting the coefficient of the
cubic term to zero, the function becomes a quadratic. The identical value for the
average correlation on the k3s2g2 data set is due to this, as the cubic coefficient is
very small (0.0016). However the correlation between the function and the second
data set is improved over the quadratic model by the introduction of the small but
significant cubic term.
The exponential functions, looking at Fig. 5.8, would seem to give a very good
approximation. The fact that the data points are in a very straight line would suggest
that they are in fact modelled by an exponential function. The more consistent values
of the average correlation between these functions and the data sets would also sug-
gest that they are a more accurate model of the actual data.
However the best model is undoubtedly the ratio model. Apart from the very close
correlation between the data and the functions, it seems intuitive that the set will be
influenced in some way by the total number of codes which are considered when
generating it.
The fact that the set sizes for wider codes are likely to increase by more than
the set sizes for smaller codes, as more experiments are performed, could mean
that the curves edge closer and closer to a polynomial function. But they could
also simply adjust the parameters of the exponential function to allow a greater
accuracy of fit. As the ratio function takes the total number of codes into
account, it may be unaffected by this larger increase in set size. All that can
be said is that, without further mathematical analysis, the four models will all
give roughly accurate predictions of the path of the curve, providing that the
code width is not allowed to increase too far. The wider the code when using
these functions, the less confidence can be placed on the calculated result.
The functions are shown below with the coefficients for k3s2g2 and k4s2g2
respectively, taken to 3 decimal places.

Quadratic Model:
card(S) = 0.346w^2 - 3.602w + 22.035                               Eqn. 5.7
card(S) = 0.812w^2 - 17.424w + 114.817                             Eqn. 5.8

Cubic Model:
card(S) = 0.002w^3 + 0.207w^2 - 0.075w - 4.172                     Eqn. 5.9
card(S) = 0.018w^3 - 0.379w^2 + 6.488w - 32.248                    Eqn. 5.10

Exponential Model:
card(S) = w^2.318 / 11.546                                         Eqn. 5.11
card(S) = w^2.871 / 57.443                                         Eqn. 5.12

Ratio Model:
card(S) = w! / ((3.198w + 11.406)(w - 3)!)                         Eqn. 5.13
card(S) = w! / ((85.68w - 472.824)(w - 4)!)                        Eqn. 5.14
These 4 sets of 2 equations allow some possible values to be predicted as shown in
Fig. 5.10.
Fig. 5.10 - Table of predicted k3s2g2 and k4s2g2 set sizes for various widths
5.5 Conclusions
It is clear from the large variations in predicted sizes that none of these models can
be used for any accurate predictions of width of code required without first finding a
way of showing how the sets should behave for larger values of w. This is not only
because of the different predictions these equations give, but also because the grand
unified theory of maximum ghosting sets must use an equation which has not only
w, but k, s, and g as variables as well. It is clear that k would not remain fixed as the
code size varied, but due to the computational problems outlined earlier it was not
possible to generate experimental data for larger values of k. The values of s and g
may remain fixed, as they are problem dependent, and this would usually be known
beforehand. In fact, as k is usually taken as being a function of w, it may not be neces-
sary to involve k in the equation at all. However if, for a particular problem, the
value of k has to be set to an ‘unconventional’ value, it would still be useful to be able
to model the maximum ghosting sets.
As an example, part of the PAF file (see section 6, “Analysis of PAF” on page 83)
which would be stored in one single CMM contains 866026 records. This means that
the maximum ghosting set must contain at least 866026 codes in order to train each
(Table for Fig. 5.10 - predicted set sizes:)

Code  |          Set Size for k3s2g2              |          Set Size for k4s2g2
Width | Eqn. 5.13  Eqn. 5.7  Eqn. 5.9   Eqn. 5.11 | Eqn. 5.14  Eqn. 5.8  Eqn. 5.10   Eqn. 5.12
500   | 77155      84721     301708     156235    | 1457561    194402    2158461     976125
1000  | 310650     342420    2206920    779055    | 11665813   794690    17627455    7141063
2000  | 1246682    1376818   16827845   3884686   | 93348397   3213266   142496943   52242025
3000  | 2808106    3103216   55862770   9943413   | 315075763  7255842   482608431   167331617
4000  | 4994921    5521614   131311695  193706193 | 746875924  12922418  1145961920  382188066
record into the memory. Using the equations above, the estimated code widths are
shown in Fig. 5.11.
Fig. 5.11 - Table of predicted code widths for a storage requirement of 866026 associations
While the values predicted vary as expected from model to model, the overall trend
is reasonable, with all models predicting a narrower code for k=4 than for k=3. It can
safely be assumed that if k was increased even more, the codes could get narrower
still. Providing that one of these models can be assumed to be a fairly close approxi-
mation to the actual behaviour of the sets, Fig. 5.11 also shows that the size of code is
not impractically large — code widths of 1000 - 2000 bits are not uncommon when
using CMMs for this type of application. This is perhaps the most important result
for the purposes of this research.
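The code widths of Fig. 5.11 can be reproduced by inverting the models; for the ratio model of Eqn. 5.13, for example, a simple upward search suffices (a sketch; `width_needed` is our name, not the thesis's):

```python
def eqn_5_13(w):
    # ratio model for k3s2g2, with w!/(w-3)! written as w(w-1)(w-2)
    return w * (w - 1) * (w - 2) / (3.198 * w + 11.406)

def width_needed(model, records):
    """Smallest code width whose predicted set size reaches `records`."""
    w = 5
    while model(w) < records:
        w += 1
    return w

# width_needed(eqn_5_13, 866026) -> 1668, the Fig. 5.11 entry for Eqn. 5.13
```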
5.6 Summary
In this section, the problems associated with ghosting when using CMMs to perform
partial match queries were explained. It was shown that it is not possible to obtain a
practical solution to this problem and so some level of ghosting will have to be
accepted. A method for guaranteeing the maximum number of ghosts which will
ever be produced by the CMM was presented — that of the maximum ghosting code
set. If such a set is used when training associations into the CMM, and the maximum
(Table for Fig. 5.11 - predicted code widths:)

Formula     Code Width
Eqn. 5.13   1668
Eqn. 5.14   421
Eqn. 5.7    1577
Eqn. 5.8    1022
Eqn. 5.9    774
Eqn. 5.10   361
Eqn. 5.11   1047
Eqn. 5.12   480
number of valid responses which will be returned by the CMM is known, then the
maximum number of ghosts which will have to be removed from the CMM’s
response is also known. The ghosts can only be removed by back-checking with the
original query. In the example given in section 4 on page 61, the response for the
query ‘C?T’ might have produced 3 codes which, when expanded, referred to the
words ‘COT’, ‘CAT’ and ‘PIN’. In this case, ‘PIN’ was obviously produced by a
ghost, as it plainly does not satisfy the query. Each of the output words has to be
checked in this way to remove the ghosts. Knowing how many there are to search for
allows the worst case performance to be assessed.
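The back-checking step is a straightforward filter (a sketch, with '?' marking an unrecognised character):

```python
def remove_ghosts(query, candidates):
    """Keep only candidate words that genuinely satisfy the partial-match
    query; anything else was a ghost produced by superimposition."""
    def satisfies(word):
        return (len(word) == len(query) and
                all(q == '?' or q == ch for q, ch in zip(query, word)))
    return [word for word in candidates if satisfies(word)]

# remove_ghosts('C?T', ['COT', 'CAT', 'PIN']) -> ['COT', 'CAT']
```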
In the absence of a mathematical model of these sets, experiments were performed to
establish the approximate set size for some small codes. This data was then analysed
in a number of ways to try and predict the set sizes for larger codes. While this anal-
ysis is in no way intended to allow sound arguments to be made about the behav-
iour of the larger sets, it does give rough indications of the kind of sizes that will be
possible. It can also be argued that the models presented in this section give an
under-estimate of the set size for a given code width. This is because the number of
bits set in each code was not varied, but as larger codes are used, more bits would be
set. This was shown to increase the set size for codes of width 16 bits.
Whatever database engine is chosen for this application, it will be used to store and
search a database which contains address information used by The Post Office for
sorting mail. The contents of this database will have to be coded into an appropriate
form for the database system. Some knowledge of the kind of information contained
within the database will be essential for the coding to be performed in an efficient
manner. It will also be useful to know what kind of outputs will be obtained from the
database when queries of the kind required by a verification system are made. The
next chapter presents a detailed discussion of the database, and the kind of searches
which will be made.
6. Analysis of PAF
The Postal Address File (PAF) is a database which contains address information such
as postcode, posttown, building name/number, latitude/longitude, etc. for every
mail delivery address in the United Kingdom.
6.1 Introduction
This chapter gives an indication of how this database will be used during the OCR
and verification process. The kinds of queries which are likely to be made are
explained and some of the potential results are presented. This includes an analysis
of the format of the postcode and how missing characters within a postcode (for
example, failure of the OCR system to recognise one character) will impact the verifi-
cation process. The PAF itself holds over 25 million addresses or ‘delivery points’, as
for most domestic addresses the postcode is shared by a number of buildings. In its
fully expanded form, the database’s size is around 7.5 gigabytes. As mail pieces pass
through the automated sorting machines, the address image is scanned by a camera
and fed into a computer. From there, it must be segmented into lines, the different
lines identified (specifically the line containing the postcode), OCR performed on
that line, the resulting postcode searched for in the database, and a machine-reada-
ble version of the address printed on the mail piece. This is in the form of a binary
pattern of phosphor dots which can be read easily at a later stage. All this has to be
carried out over 10 times per second, as this is the speed at which mail passes
through the sorting machines. It is clearly not a trivial problem! Some immediate
reductions can be made in the amount of work which has to be done, however.
Firstly, there is no reason why each of the operations described above cannot be
pipelined, giving a substantial increase in the overall performance of the
address recognition system.
Secondly, it is not necessary at this stage to search the entire database as only the
postcode is being recognised (however this may not be the case in the final system —
see section 8.3, “Strategies for Verification” on page 101). Finally, there is no real rea-
son why the machine-readable code printed on the mail piece cannot simply be a
unique number which is stored in a separate database to be cross-referenced at a
later time by the address recognition system. Indeed, this is what currently happens
to mail pieces which cannot be identified by the automated address recognition sys-
tem. The image of the mail piece is tagged with the machine-readable code printed
on the mail piece and is then displayed to an operator who visually recognises the
postcode and keys it into a terminal. This is then associated with that machine-read-
able code and when the final sorting machine reads the code, it simply looks up the
postcode keyed in by the operator. This is typically done at a much later stage than
the initial address recognition and so could still be done off-line, automatically.
6.2 Format of the Postcode
The postcode follows a fairly rigorous syntax format, although it is subject to change
from time to time. There are 3 different lengths of postcode — 5, 6 and 7 characters.
The formats for each are shown below.
Fig. 6.1 - The syntax of the postcodes
This format is very unlikely to change — it is the set of characters within each posi-
tion which can change. For example, at the time [9 Kabir, Downton] was written, it
was stated that in the first 6 character format in Fig. 6.1, the second character, being
Number of     Character          Number of
Characters    Codes†             Postcodes

5             A N N A A          45649

6             A A N N A A
              A N N N A A        866026
              A N A N A A

7             A A N N N A A
              A A N A N A A      717693

† ‘A’ represents an alphabetic character, ‘N’ represents a numeric character.
alphabetic, could not be an ‘I’ (eye). This was presumably designed to prevent possi-
ble clashes with other similar postcodes which have a ‘1’ (one) in this position. How-
ever there is at least one postcode now in use which does indeed have an ‘I’ (eye) in
this position, and it remains to be seen whether it is similar enough to others in the
database to cause problems.
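The six syntactic patterns of Fig. 6.1 translate directly into a validity check (a sketch which ignores the per-position character-set restrictions just discussed):

```python
import re

# A = alphabetic, N = numeric, following Fig. 6.1
FORMATS = ["ANNAA",                       # 5 characters
           "AANNAA", "ANNNAA", "ANANAA",  # 6 characters
           "AANNNAA", "AANANAA"]          # 7 characters
_CLASS = {"A": "[A-Z]", "N": "[0-9]"}
VALID = re.compile("|".join("".join(_CLASS[c] for c in f) for f in FORMATS))

def is_valid_syntax(postcode):
    """True if the postcode (spaces ignored) matches one of the six formats."""
    return VALID.fullmatch(postcode.replace(" ", "").upper()) is not None
```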
It is clear from the table that there are other possible examples of valid postcodes, as
far as the syntax is concerned, which could not easily be distinguished by OCR.
These would be ones where the only distinguishing characters were in a position
which could be either alphabetic or numeric, and were either a ‘1’ (one) or an ‘I’
(eye), or a ‘0’ (zero) or an ‘O’ (oh). According to an initial scan of the data
there are no such clashes, but there are other possibilities such as ‘5’/‘S’ which
could be difficult depending on the font used or the style of handwriting.
The main reason for making the distinction between the classes of characters permit-
ted is to refine the OCR system by specifically employing an alphabetic or numeric
character recogniser at each character position, rather than having one alphanumeric
recogniser, which is bound to be less reliable. Obviously some character positions
can be either alphabetic or numeric and so would require a discriminator to decide
either which recogniser to apply, or if both were applied, which one to believe. This
would probably be based on the relative confidence of each recogniser. By starting
the recognition process at the right and moving to the left, it can be seen from Fig. 6.1
that there are only two character positions which could be alphabetic or numeric1. If
the recognition process were started from the left, the number of undecided charac-
ters would be four, so this is another simple way to reduce the complexity of the
problem.
1. We assume that the length of the postcode is unknown at this time; therefore recognition proceeds from one end of the postcode to the other until the block located as the postcode is exhausted. The number of characters is then counted implicitly in the recognition process.
6.3 Missing Characters
Because the reliability of the OCR system can never reach 100%, it is inevitable that
some characters are going to be unrecognisable. It is also desirable that the
system reject characters rather than mis-classify them, as a mis-classified
character could well generate a valid but incorrect postcode. This type of error
would be very
difficult to detect without further cross-referencing with the rest of the address. So
the system must be able to deal with cases where not all of the characters of the post-
code are known, and this represents a type of partial match (see section 4, “Partial
Matching” on page 53).
From the discussions in section 5, it is clearly desirable to know how many valid
postcodes could be generated when a search of this kind is made on the database, as
this determines the superimposition factor that will be present in the output of the
CMM. The three graphs shown in Fig. 6.2, Fig. 6.3 and Fig. 6.4 show what happens
when single characters are unrecognised in the different classes of postcode. The
basic question being asked here is probably better expressed in English:
Take a 6 character postcode, ‘S10 4FP’ for example. If the first character is unrecognisable, so that the result of OCR is ‘?10 4FP’, how many postcodes match the ‘10 4FP’ part, and so can be considered as candidates for the actual postcode on the mail piece?
This question is repeated for each postcode within each class, and for each of the
three classes. The ‘Field’ entry on the graphs indicates which character within the
postcode is being left out. The number of matches indicates how many postcodes
matched against the remaining partial postcode. For example in 7-character post-
codes, Fig. 6.4 shows that with the first field missing, 483761 postcodes matched
against a single entry in the database. This means that 483761 of the 717693 7-charac-
ter postcodes will still be unique, even if the first character cannot be recognised. The
worst case for this field is that 3220 of the 7-character postcodes match 5 entries in
the database, when the first character cannot be recognised. This means that there
will be 4 incorrect postcodes and the 1 correct postcode returned by the search
algorithm.
These values can be interpreted directly as probabilities, providing it is assumed
there is an even distribution of postcodes, which may not be the case in a real sorting
office. So there is a 67.4% probability that a seven character postcode can be uniquely
identified given that its first character is unknown, and a 0.45% probability that it
will be one of 5 possible postcodes.
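The analysis behind Fig. 6.2-6.4 amounts to a histogram of partial-match counts (a sketch; the thesis ran this over the full PAF):

```python
from collections import Counter

def match_histogram(postcodes, field):
    """Blank out character `field` (0-based) of every postcode and count, for
    each postcode, how many postcodes share the resulting partial pattern.
    Returns {number of matches: number of postcodes with that many}."""
    def partial(pc):
        return pc[:field] + "?" + pc[field + 1:]
    counts = Counter(partial(pc) for pc in postcodes)
    return Counter(counts[partial(pc)] for pc in postcodes)
```

With the real data, the entry for 1 match on field 1 of the 7-character postcodes would read 483761, as quoted above.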
Fig. 6.2 - Analysis of five-character postcodes
(Number of Postcodes, 0-20000, against Number of Matches, 0-20; one curve per
field 1-5.)
Fig. 6.3 - Analysis of six-character postcodes
(Number of Postcodes, 0-350000, against Number of Matches, 0-20; one curve per
field 1-6.)

Fig. 6.4 - Analysis of seven-character postcodes
(Number of Postcodes, 0-600000, against Number of Matches, 0-20; one curve per
field 1-7.)
It is clear from the shape of these graphs that identifying a postcode is relatively easy
if one of its first few characters is unknown, but specifically the last two characters of
each class seem to be fairly evenly distributed among postcodes. This means that
there will on average be more possibilities to consider if one of the last two charac-
ters is unknown rather than one of the first few. However the graphs also show that
no matter which character is missing, there are never more than 20 possibilities
which could be returned by the search algorithm, and so a brute force search through
these to identify the correct one should not be out of the question. How this
might be achieved, however, is left for future work.
It is also possible that there may be more than one character in a particular postcode
which cannot be recognised by the OCR system. This may be due to one of the fol-
lowing reasons:
• Due to the <100% reliability of the OCR system, there will be a small
number of occasions when one character cannot be recognised. When dis-
tributed over the input sequence of characters, this should mean that
there will usually be no more than one failure in any particular postcode.
But it is inevitable that, eventually, two or more of the statistically
predicted failures will occur in one postcode.
• The address image is very badly formed — so much so that maybe only a
handful of all the characters in the address could be recognised automati-
cally.
These two events will together contribute to the reject rate of the system. In the first
case, it is unlikely that a postcode with two characters missing could be searched for
and the possibilities considered, as there would probably be too many of them. It is
easy to see in the worst case that there would be 400 (20 × 20) possibilities if the last
two characters were the two which could not be recognised. In the second case, if
only a few of the characters can be recognised, there would seem to be little point in
continuing with the automated recognition procedure as there would be no way to
check the possibilities which were returned against other information in the address,
even if it were in principle possible to deal with that many alternatives. In
fact, a recogniser with a greater than 86% recognition rate (still a fairly
modest target given the results presented in section 2) would on average fail on
fewer than 1 in 7 characters. Since postcodes have no more than 7 characters, on
average there will be no more than 1 character unrecognised in any given
postcode. If all characters can be recognised, there will be fewer possibilities
to deal with (unless of course
the characters do not represent a valid postcode — this is a separate issue and is
dealt with in section 8.3), whereas if more than one character is missing, it may be
very difficult to make any kind of automatic interpretation of the address, simply
because of the number of possibilities involved. Once a system has been designed,
it will be possible to calculate the total cycle time for recognition and
database lookup, and hence an upper bound on the number of possibilities which
can be considered.
It has been argued that the main target for automated mail sorting must be to cor-
rectly recognise the postcode on a mail piece. The main consideration has been the
possibilities for recovery if, for some reason, one of the characters within the post-
code cannot be recognised. The results presented in this chapter show that there will
be little problem in identifying the postcode, even if one of the characters is missing.
This can be achieved either by passing hints to the OCR system once the set of possi-
ble characters is known or by attempting to integrate information from other parts of
the address. It has been shown that even if this is one of the last characters, there will
never be more than 20 possible postcodes to choose from. It is likely that a system
could be made to run fast enough to consider 20 possibilities — if not then a radical
alteration in the approach will be necessary to afford any benefit to the current sys-
tem.
In the case when more than one character is missing, it is difficult to say whether the
system would be able to correctly process the mail piece. It is likely that considera-
tion of all the possible valid characters will be too time consuming to be practical in
an on-line system. However the alternative method of incorporating other address
information into the database search may still allow the single correct address record
to be identified. This is certainly a topic for further research.
Now that the problems and possible solutions have been identified, the next chapter
presents some discussions about the feasibility of the proposed approach to this
application.
7. Feasibility
It is obviously important to assess whether a system based on the work presented
here would actually form a feasible solution to the problem of improving automated
address recognition. The following section gives some outlines of the measures
involved in determining whether or not the system will perform as required.
7.1 Introduction
At present, the only consideration which can reasonably be addressed is whether or
not the system is likely to perform its function within the strict time requirements of
the on-line mail sorting process. Some analysis is given of the CMM approach to
database searching, followed by a discussion of the implications of this for the mail
sorting application. In particular, a hardware implementation of the CMM process is
discussed and its performance analysed.
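Before looking at timings, it is worth recalling what the SAT processor actually computes: a binary matrix-vector sum followed by a threshold. The sketch below is a minimal software model of that operation; the sizes and the stored association are purely illustrative, and the training step shown is the standard CMM outer-product store.

```python
def cmm_recall(matrix, input_bits, threshold):
    """Software model of the Sum-and-Threshold operation: sum the rows of
    the binary matrix selected by the set input bits, then threshold.
    matrix[i][j] is the binary weight from input i to output j."""
    width = len(matrix[0])
    sums = [0] * width
    for i in input_bits:              # only set input bits contribute
        for j in range(width):
            sums[j] += matrix[i][j]
    return [1 if s >= threshold else 0 for s in sums]

# Train a tiny 4-input x 6-output CMM with one association, then recall it.
matrix = [[0] * 6 for _ in range(4)]
stored_in, stored_out = [0, 2], [1, 4]    # 2 input bits -> 2 output bits
for i in stored_in:
    for j in stored_out:
        matrix[i][j] = 1

out = cmm_recall(matrix, [0, 2], threshold=2)   # -> [0, 1, 0, 0, 1, 0]
```

Thresholding at the number of set input bits (here 2) recovers exactly the stored output pattern; it is this fixed sum-and-threshold structure that makes the operation so amenable to dedicated hardware.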
7.2 Speed of Database Access
The central problem in the partial matching exercise is how quickly the database can
be searched for the required address information. In [44 Austin et al.], a dedicated
piece of hardware known as the ‘Sum and Threshold’ processor (SAT) is presented,
which is capable of performing the CMM operations detailed in the previous sec-
tions at very high speed. The main equation which determines the speed with which
a database search can be made is shown in Eqn. 7.1. It gives the cycle time (CT) of the
unit given a number of parameters governing the matrix involved.
Eqn. 7.1:

CT = (50α/16)(3.5β/δ + 34) + (ρ/σ)(2σ/16)(3.5ϒ + 35) + ι(4.5α + 3φ) ns

This equation covers a two stage process which actually implements the ADAM
structure. However, for this application, only the first stage of the operation is
required and some of the variables become irrelevant. Since the results of the first
stage are available within the SAT processor, it is reasonable to ignore the second
stage components of the above equation, and the simplified version, for the first
stage only, becomes:
Eqn. 7.2:

CT = (50α/16)(β + 34) + 4.5α + 3φ ns
The coefficients are:
• α — Output size
• β — Number of bits set in input pattern
• φ — Number of bits set in output pattern
The values for these variables can be calculated as follows1. The output size, α, is the
required width of code to represent all the records in the database. From the results
of section 5, using the worst case estimator for k=4 (Eqn. 5.14), and the size of the
database for 5, 6 and 7-character postcodes, the approximate code widths are shown
in Fig. 7.1.
Fig. 7.1 - Estimated code widths for the 3 classes of postcode, using Eqn. 5.14
The number of bits set, φ, has to be 4, as this is the value of k in Eqn. 5.14. Obviously,
with these widths of code, a larger value of k would be better (assuming the log2 rule
holds). However, as explained before, it was not possible to generate equations for
higher values of k due to the excessive amount of time this would take. Using a
1. In all the following discussions, it is implied that 3 separate CMMs will be used to represent the 3 classes of postcode (5, 6, and 7-characters). This can be achieved using only one physical SAT processor by setting up each CMM in the processor's memory, and then simply adjusting pointers within the SAT so that the correct CMM is actually evaluated for the given search to be performed.
Fig. 7.1:

  Postcode Width    Number of Postcodes    Estimated Code Width
  5 characters      45649                  158 bits
  6 characters      866026                 421 bits
  7 characters      717693                 395 bits
higher value of k would almost certainly reduce the width of the code and so these
results are sure to give a worst case estimate of the speed of operation.
The number of input bits set, β, is 5, 6 or 7, depending on the postcode width being used. This is
because a very simple coding scheme can be used on the input, where each character
position within the postcode is represented by a 1-in-n bit binary code. The equation
only needs to know how many bits are set to 1 on the input, not the total input size.
The value of n will be different for different positions (for example, a purely numeric
character position can be represented by a 1-in-10 bit binary code, whereas a purely
alphabetic field would need a 1-in-26 bit binary code), however, no matter what the
size of the code, it will always have 1 bit set, and so the value of β will always be
equal to the number of characters in the postcode. Evaluating Eqn. 7.2 for the given
parameters yields the results shown in Fig. 7.2.
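The 1-in-n input coding described above is straightforward to construct. A sketch follows; the full digit and letter alphabets used here are illustrative, since the real postcode format definitions restrict which letters may appear at each position.

```python
def one_in_n_code(postcode, alphabets):
    """Concatenate a 1-in-n binary code for each character position.
    alphabets[i] lists the characters valid at position i."""
    bits = []
    for ch, alphabet in zip(postcode, alphabets):
        code = [0] * len(alphabet)
        code[alphabet.index(ch)] = 1    # exactly one bit set per position
        bits.extend(code)
    return bits

DIGITS = "0123456789"
LETTERS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

# 'ANNAA' format for a 5-character postcode: 26+10+10+26+26 = 98 input bits,
# of which exactly 5 (one per character position) are set.
alphabets = [LETTERS, DIGITS, DIGITS, LETTERS, LETTERS]
vec = one_in_n_code("M25AB", alphabets)
```

As the text notes, whatever the per-position alphabet sizes, each position contributes exactly one set bit, so β always equals the number of characters in the postcode.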
Fig. 7.2 - Time taken to search each database for one specific postcode
The result of the CMM operation is a code which will uniquely identify a record in
the main PAF and, obviously, this will have to be retrieved in order to get the actual
postcode and other address details for verification purposes. It would be possible
however to use a second CMM which takes the output of the first and returns actual
postcodes. In this case the variables shown before would take the following values.
The output size, α, would be dependent on the width of the postcode. From Fig. 6.1
it can be seen that the worst case (i.e. largest required code) to represent each post-
code, is ‘ANNAA’ for 5-character postcodes, ‘ABBNAA’ for 6-character postcodes,
and ‘AANBNAA’ for 7-character postcodes. The letters ‘A’, ‘N’ and ‘B’ represent
Fig. 7.2:

  Postcode Width    Number of Postcodes    Time to Search for 1 Postcode
  5 characters      45649                  55.4 µs
  6 characters      866026                 148 µs
  7 characters      717693                 140.1 µs
‘alphabetic’, ‘numeric’ and ‘both’ fields within the postcode (for example, the third
character of a 7-character postcode can only be a numeric, and so can be represented
as a 1-in-10 bit binary code). Summing all these positions gives the following sizes:
Fig. 7.3 - Total size of codes required to represent each class of postcode
In fact these sizes represent the actual input sizes used on the first CMM. However,
as explained before, it is only the number of bits set to 1 on the input which affects
the speed of the processor.
The input bits for the second CMM, β, is the number of bits set on the output of the
first CMM. This is the value of k, and is thus 4 in this example, as Eqn. 5.14 predicts
codes with 4 bits set.
The value of φ is the number of characters in the postcode, as there is a direct rela-
tionship between bits in the output code and characters in the postcode that it repre-
sents — this will be 5, 6, or 7 depending on which class of postcode is being searched
for. Re-evaluating for the second CMM gives the following results:
Fig. 7.4 - Overall time to recover actual postcode
It is then trivial to convert the output of the second CMM into an ASCII interpreta-
tion of the postcode, and this can either be used directly to compare with the results
Fig. 7.3:

  Postcode Width    Format of Postcode    Total Size of Code Required
  5 characters      ANNAA                 98 bits (5 set)
  6 characters      ABBNAA                160 bits (6 set)
  7 characters      AANBNAA               160 bits (7 set)

Fig. 7.4:

  Postcode Width    Time to Recover Postcode    Total Time to Recover Postcode
                    from First CMM Output       from OCR Output
  5 characters      34.4 µs                     89.8 µs
  6 characters      55.9 µs                     203.9 µs
  7 characters      56.1 µs                     196.2 µs
from OCR, or used as a key to locate a record in the PAF. Note that if a partial match
is being evaluated, then the second CMM will have to be evaluated for every post-
code returned by the partial match. There will also be an overhead associated with
separating the superimposed codes. It was shown in [37 Kennedy] that the average
time taken to separate 5 superimposed codes of width 400 bits is less than 2ms per
code, using a technique known as Middle Bit Indexing (see [38 Filer]). This time will
vary with the width of the codes and the number of codes superimposed, but should
give a reasonable indication of the order of magnitude of the problem. So the total
time will be the time taken to evaluate the first CMM, plus the number of records
returned × 2ms, plus the number of records returned × the time taken to evaluate the
second CMM. Note again though that the retrieval of the superimposed codes can be
pipelined with the second CMM to improve the efficiency still further. The worst
case time, for a 6-character fully specified postcode, equates to around 4900 postcode
searches per second. In [44] it is shown that the SAT processor achieves an average
speed-up over a Silicon Graphics R460SC Indy workstation by a factor of 5. This
means the worst case time using a conventional machine would still only be
1019.5µs, or around 980 postcodes per second. However the SAT processor is also
over 20 times cheaper than the workstation, which gives a price/performance ratio
of more than 100:1 in favour of the SAT. There would obviously be additional costs
involved with providing a host for the SAT, however it is also likely that there would
be interfacing costs involved if the workstation was used, and these could only be
assessed if the details of the actual sorting machine were available.
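The total-time rule stated above can be made concrete. The sketch below uses the figures quoted in the text for the 6-character case (148 µs for the first CMM search, 55.9 µs for the second CMM, under 2 ms per code for separation); the assumption that a fully specified postcode returns a single, unsuperimposed code needing no separation step is mine, made to reproduce the 203.9 µs worst case of Fig. 7.4.

```python
def partial_match_time_us(t_first, t_second, n_records, t_separate=2000.0):
    """Total search time in microseconds: first-CMM search, then for each
    returned record, separate its code from the superimposed output and
    evaluate the second CMM.  A fully specified postcode is assumed to
    return a single, unsuperimposed code, so no separation is needed."""
    separation = t_separate * n_records if n_records > 1 else 0.0
    return t_first + separation + n_records * t_second

# Fully specified 6-character postcode: 148 + 55.9 = 203.9 us (Fig. 7.4).
full = partial_match_time_us(148.0, 55.9, 1)
rate = 1_000_000 / full            # around 4900 searches per second

# A partial match returning, say, 5 candidate postcodes.
partial = partial_match_time_us(148.0, 55.9, 5)
```

Note that this sketch ignores the pipelining of code separation with the second CMM mentioned in the text, so it slightly overstates the partial-match time.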
7.3 Other Factors
The worst case time of 203.9µs per postcode looks promising. However, the system
will effectively form a pipeline comprising OCR, database search, verification and
machine-readable code printing (possibly with iteration of the database search and
verification steps), so the overall cycle time for the whole system will be the longest time
required to execute any one of those parts. As the machine-readable code printing
can really be made to go as fast as the mail moves through the machine, this is not
likely to be the longest part of the process. The database search is very fast, as shown
in the previous tables, and so the slowest part of the pipeline will probably be the
OCR (which also involves address location and line segmentation etc.). It ought to be
possible then to tune the iteration of the next steps to take nearly as much time as the
OCR stage, so as to allow the maximum amount of work to be done without slowing
the overall pipeline down.
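The pipeline argument can be stated very simply: throughput is limited by the slowest stage. A minimal sketch, with purely illustrative stage timings (the OCR figure in particular is not taken from any measurement):

```python
def pipeline_cycle_time(stage_times):
    """A pipeline's cycle time is set by its slowest stage."""
    return max(stage_times)

# Illustrative per-piece stage times in ms: OCR (including address location
# and line segmentation), database search, verification, code printing.
stages = {"ocr": 80.0, "search": 0.21, "verify": 5.0, "print": 1.0}
cycle = pipeline_cycle_time(stages.values())   # dominated by the OCR stage
```

On such figures the tuning suggested above would let the search/verification iteration expand to fill most of the OCR stage's time without reducing throughput.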
It has already been noted that the system must process around 10 mail pieces per second. However the actual time a mail piece spends in the sorting machine (from
breaking the beam as it enters the machine and triggering the imaging process, to
leaving the machine with its machine-readable code printed on it) is around 1 sec-
ond. This means that the pipeline will be processing up to 10 mail pieces at any one
time, but the overall time for the pipeline will be 1 second per mail piece. Given the
times for the database searching and the fact that a character recogniser could be
implemented on the same dedicated hardware, this does not look unfeasible. But it is
also possible to introduce a delay line1 into the mail path which effectively increases
the processing time for each mail piece to 7 seconds. However to maintain the aver-
age throughput of 10 mail pieces per second, it is clear that the number of stages in
the pipeline would have to be increased to take advantage of this, as it means that
there will be on average 70 mail pieces in the sorting machine at any one time. This
may mean that the different CMMs used for the various parts of the system (OCR,
PAF search and postcode retrieval) will all have to be working at the same time and
1. This is a simple mechanical device which forces the mail to take an arduous path through the machinery. This increases the time it takes for the mail piece to move from the scanner which reads the address, to the printer which prints the machine-readable code on the mail piece.
this will require more physical SAT processors. However they are designed in such a
way that once the host and interface have been set up, many SATs can be added at
little extra cost. There are other tasks which may not be suitable for implementation
on the SAT, such as the address location and segmentation and the superimposed
code separation and verification. However it is not clear yet exactly how these parts
of the system would be implemented and so it is difficult to give accurate timings for
the whole pipeline. All that can be said is that given the current state of technology, it
would be surprising to find that the task could not be completed within 1, let alone 7
seconds.
8. Conclusions and Further Work
It has been shown that for a realistic improvement in the reliability of automated
address recognition, the main target area has to be the integration of address infor-
mation rather than improving the performance of an OCR system. It has also been
shown that the crux of this issue is the efficient retrieval of a valid address record
from the Postal Address File. This address has to have the highest probability of
being the one that was intended by the author of the address, given the (possibly
incomplete) information obtained from the address image. This amounts to a partial
match search of the database. A number of approaches to this have been proposed.
One in particular was considered in detail, and some of the problems with this
method identified.
Many of the issues raised during the course of this research would warrant further
investigation. This section details some of the more interesting questions which were
raised. As this research was intended as preparatory work for a longer study of a
system for improving automated address recognition, some of the topics discussed
in this section will be taken up over the next 3 years.
8.1 Code Generation
If the maximum ghost code sets are to be used as an effective way of reducing the
problems associated with partial match searching using CMMs, an efficient way of
generating them is essential. It is believed that in order to obtain such a method, a
more complete understanding of the behaviour of the sets is required. One possible
model for the sets which was not covered in the main text involves the use of hyper-
cubes. As shown in Fig. 8.1, a three bit code can be represented as the vertices on a 3-
dimensional cube. For wider codes, more dimensions are required and so they
become difficult to envisage. Questions about these hypercubes can still be asked
though.
• What subset of vertices is represented by a maximum ghosting set?
• What geometric features are displayed by such a subset?
• Do different sets have common features when represented in this way?
In particular, if the answer to the last question is in the affirmative, this may be the
key to understanding what makes these sets exhibit the properties they do.
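These questions can be explored computationally for small widths. The sketch below represents w-bit codes as hypercube vertices and computes one simple geometric signature of a vertex subset, its multiset of pairwise Hamming (edge) distances; the example set is purely illustrative, not a genuine maximum ghosting set.

```python
from itertools import combinations

def hamming(u, v):
    """Hamming distance = shortest edge path between two hypercube vertices."""
    return bin(u ^ v).count("1")

def distance_profile(code_set):
    """Sorted pairwise Hamming distances for a set of codes: one simple
    geometric feature by which vertex subsets can be compared."""
    return sorted(hamming(u, v) for u, v in combinations(code_set, 2))

# Three vertices of the 3-cube (codes 011, 101, 110): an equilateral
# triangle on the cube, all pairwise distances equal to 2.
profile = distance_profile([0b011, 0b101, 0b110])   # -> [2, 2, 2]
```

Comparing such profiles across sets is one way to approach the third question above: two maximum ghosting sets with identical profiles would share at least this geometric feature.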
Fig. 8.1 - Representation of 3-bit binary codes as vertices of a 3-dimensional cube (bit positions 1, 2 and 3 are mapped to the x, y and z axes; each bit of each code gives the position of that code on the relevant axis)
8.2 Values of k
It was mentioned in section 4.3 that the optimum value for k is log2(w). This allows the
maximum storage within the CMM — any higher than this and errors in the output
start to affect the reliability of the system. It is not clear however why this should be
the case. It has always been assumed that the problems occur because of saturation
of the matrix. However the graph in Fig. 5.8 on page 75 shows that the size of the sets
of maximum ghosting codes for k=4 starts below the size of sets for k=3, but eventu-
ally becomes higher. The point at which the lines cross represents the point at which
one should stop using codes with 3 bits set and start using codes with 4 bits set. This
value is at around 2.7 on a logarithmic scale, which gives an actual code width of e^2.7,
approximately 15. Using the log2 rule above, the number of bits set in each code
goes from 3 at w=15 to 4 at w=16. This is a surprising coincidence, and may indicate
that the log2 rule holds not because of saturation in the matrix but because of exces-
sive ghosting on the outputs. In order to confirm this, another line would need to be
plotted on the graph in Fig. 5.8 for k=5. If it were to cross the k=4 line at 3.5 (which is
the logarithm of 32, the code width where the number of bits set goes from 4 to 5 as
determined by the log2 rule), it would provide more than coincidental evidence for a
link between ghosting and the maximum storage capacity of a CMM. As mentioned
before however, this is not practical given the current method of generating the max-
imum ghosting sets because of the large amount of time it would take.
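The ghosting phenomenon itself is easy to reproduce for small codes, which is where such a k=5 experiment would have to begin. The sketch below superimposes (ORs) a set of k-of-w codes and enumerates the spurious k-bit patterns ('ghosts') that can be read out of the union but were never stored; exhaustive generation of this kind is precisely the exponential cost referred to above.

```python
from itertools import combinations

def ghosts(stored, k):
    """Return the k-bit codes recoverable from the superimposed (OR-ed)
    stored codes that are not themselves stored codes.  Each code is a
    frozenset of set-bit positions."""
    union = set()
    for code in stored:
        union |= set(code)                      # superimpose the codes
    candidates = {frozenset(c) for c in combinations(sorted(union), k)}
    return candidates - set(stored)

# Two 2-of-3 codes: {0,1} and {1,2}.  Their union {0,1,2} also contains
# the unstored pattern {0,2} -- a ghost.
g = ghosts([frozenset({0, 1}), frozenset({1, 2})], k=2)
```

Plotting the size of the worst-case (maximum) ghosting set against code width for k=5, as proposed above, would require running this kind of enumeration over all candidate sets, hence the impractical run time.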
8.3 Strategies for Verification
In section 3, some initial ideas concerning the verification of postcode recognition
were put forward. There are many unanswered questions with regard to how this
may be done. The first stage, which has so far been overlooked, is to find the actual
address block on the mail piece. This can be done either using a simple line finding
algorithm, or a more complex locator such as the one in [1 Wolf, Platt], which
achieves 98.2% success at finding the address block when allowed to propose its top
5 choices. It was not reported how often the correct block was the first choice though.
This system would also require a line segmentation algorithm, but that task would
be considerably simplified by the fact that the box is assumed to contain only
address information. When presented with the entire image of the mail piece, line
segmentation algorithms can be easily confused by graphics on the envelope.
Once the address has been segmented into lines, it is fair to assume that the postcode
will be on the bottom line (either on its own, or following another word such as the
posttown), or on the second line from the bottom. In that case, the bottom line would probably
contain the country name, as it would for international mail. However there is very little else
which can be assumed about the format of the address. It is likely that the recipient's
name is on the first line, but this is not useful when identifying the address within
the PAF. Other information which would be useful when identifying the address is,
for example:
• Street Name and Number of Premise
• Posttown Name
• PO Box Number
• Premise Name (building or company name)
Unfortunately there is no standard way of writing an address and so the system can-
not make any assumptions about what information will and will not be present for
any given address. There is also likely to be information which is not helpful, such as
the county, which is included on many addresses but does not actually add any
information. In fact for most domestic addresses, the only two pieces of information
that are needed are the house number and the postcode. For large organisations, just
the postcode is sufficient. However the goal is to verify this against other redundant
information in the address, and the automated system needs some way of identify-
ing what information there is, and how it could be used.
One solution would be to apply OCR to every segmentable word and check each
word in a large dictionary of valid words which could appear on addresses. This dictionary would have to include all the postcodes, posttowns, street names, annotations such as 'P.O. Box' and possibly others. Once all the useless information, such as the recipient's name, has been discarded, a search akin to the one described in
[37 Kennedy] and mentioned in section 4 would be performed on some database of
‘address words’. This type of search allows the information being searched for to be
presented unordered and incomplete, as the output of this type of system would
almost certainly be. The search would need to allow the record(s) which matched
against the highest number of input words to be returned, and these would then be
taken as the candidate addresses. There may be scope for further refinement of the
input words once some candidate addresses are available, or the system could sim-
ply accept the address which matched the most inputs, providing there was only one
such address.
Another possibility is to make more assumptions about the address. For example, if
the posttown is included on the address, it is usually placed immediately above or to
the left of the postcode. If the county is included, it may be between the posttown
and the postcode. Using a database of hints such as these, the approach described
above could be refined slightly to avoid having to perform OCR on the entire
address, which could result in a substantial improvement in performance.
All these types of approach will probably improve the reliability of the system, at the
expense of reducing its performance in terms of speed. The target of any system
must be to recognise the address and code the mail piece in real time, as any off-line
system would necessarily incur the expense of buffering the address image informa-
tion and the machine-readable code database. However this will eventually become
the less expensive option as more and more processing power is required to perform
the increasingly complex sequence of operations involved in actually recognising the address. There is obviously a trade-off to be made here between the complexity (and, hopefully, reliability) of the system and the cost associated with making
this system on-line. In order to make this decision, there has to be some way of meas-
uring the cost of moving the recognition system off-line and this would require a
more detailed analysis of the particular application.
There is a problem which will undoubtedly occur at some time during the operation
of the system — namely, that all the characters are recognised by the OCR system (in
that each confidence is above some threshold), but the set of characters returned does
not represent a valid postcode. This could be caused either by substitution errors in
the OCR system or a genuine error on the mail piece. The job of the verification sys-
tem will be to identify which character(s) are in error. This could be done simply by
finding the character with the lowest confidence from OCR, and removing it from
the postcode, which then forms a partial match postcode with one character missing.
This assumes however that substitution errors are characterised by low confidence
within the OCR system, and this may not be the case. It may be possible to try the
postcode with each character missing in turn, and search all the possibilities. However,
as was shown in section 6, if one of the last few characters is missing, this can
represent a large number of possibilities to search. Another approach might be to try
and relate parts of the postcode to other information from the address — specifically
the first portion of the postcode can be matched against the posttown. It is more
important that this section of the address is recognised correctly as this determines
which town within the UK the mail piece is sent to. If this is incorrect it can double
the delivery time. If, however, it arrives in the correct town, misclassifying the second section of the postcode/address will only result in it being sent on the wrong delivery round, delaying it by only half a day to a day.
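The two 'remove the suspect character' strategies above can be sketched directly. The sketch assumes OCR output arrives as (character, confidence) pairs, with '?' marking the missing position; the characters and confidence values are illustrative.

```python
def partial_query(ocr_output):
    """Blank the lowest-confidence character to form a partial match query.
    ocr_output is a list of (character, confidence) pairs."""
    worst = min(range(len(ocr_output)), key=lambda i: ocr_output[i][1])
    return "".join("?" if i == worst else ch
                   for i, (ch, _) in enumerate(ocr_output))

def all_partial_queries(ocr_output):
    """Alternative strategy: try the postcode with each character
    missing in turn."""
    chars = [ch for ch, _ in ocr_output]
    return ["".join("?" if i == j else ch for i, ch in enumerate(chars))
            for j in range(len(chars))]

ocr = [("Y", 0.98), ("0", 0.41), ("1", 0.95), ("5", 0.90),
       ("D", 0.97), ("D", 0.96)]
q = partial_query(ocr)              # -> "Y?15DD"
```

The first strategy issues a single query but relies on substitution errors having low confidence; the second is robust to that assumption but, as section 6 showed, can multiply the number of candidate postcodes to search.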
8.4 OCR
Several OCR techniques were discussed in section 2; however, these need to be
implemented and tested for this particular application. Specifically, the type of hard-
ware used to implement the CMMs would probably be a custom chip, and so any
OCR system which could be implemented using CMMs would almost certainly ben-
efit from the performance improvements of this chip over standard workstations.
Whether the best OCR technique for the job could use CMMs represents another
trade-off which would require consideration within the framework of the specific
project.
One type of search which can be performed using CMMs, and which has not been
mentioned so far, is a probability-based search. This allows the inputs and outputs to the
CMM to be real values rather than binary bits, while keeping the internal weights of
the CMM binary. This has the advantages of inputting real probabilities to the CMM
and producing results according to those probabilities, while retaining the size and
performance advantages of a binary weighted network. So far, it has been assumed
that the output of the OCR system for each character position would simply be the
character with the highest confidence. If the OCR is allowed to output its top few
choices, along with their confidences, these could be overlaid onto the input to the
CMM and a search performed implicitly on all possible combinations of all the input
characters. The output would be biased by the confidences attached to the input
characters and would return the most likely postcode(s). Obviously this requires a
different type of OCR system (real outputs for a number of characters rather than
just the most likely character), and the potential benefits of the probability based
search would have to be weighed against the added complexity of this search and
the different requirements placed on the OCR system.
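A probability based search of this kind can be sketched with a binary weight matrix and real-valued inputs: each output (record) is scored by summing the confidences on the input bits it is connected to. The tiny matrix and confidence values below are illustrative only.

```python
def probability_search(weights, confidences):
    """Real-valued inputs over binary weights: score each output (record)
    by summing the confidences of the inputs connected to it.
    weights[j] is the binary input row for record j."""
    return [sum(c * w for c, w in zip(confidences, row)) for row in weights]

# Two records over a 4-bit input space (weights remain strictly binary).
weights = [
    [1, 0, 1, 0],   # record 0 keyed on inputs 0 and 2
    [0, 1, 0, 1],   # record 1 keyed on inputs 1 and 3
]
# The OCR offers competing character choices with confidences, rather
# than one hard decision per position.
confidences = [0.9, 0.1, 0.8, 0.2]
scores = probability_search(weights, confidences)
best = max(range(len(scores)), key=scores.__getitem__)   # record 0 wins
```

All combinations of the offered characters are effectively searched in one pass, with the output biased by the input confidences, while the stored matrix stays binary.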
8.5 Information Feedback
There is clearly a loop in the overall system design (see Fig. 8.2 on page 108), and this
represents the feeding back of information from the database system to the OCR sys-
tem. This information is in the form of valid addresses which the output of the OCR
system points towards. There are a number of ways that this feedback could be han-
dled. Below are outlined two alternatives, but it is quite conceivable that more could
be investigated.
8.5.1 Algorithmic Processing of Feedback
The output of the database system is likely to be in the form of a list of valid
addresses. This information has to be correlated with the information found in the
address image by the OCR system. It is also likely that some of the characters which
were suggested by the OCR system would be ruled out by the database search
because they represent invalid addresses. If the system is going to iterate round this
loop of recognition and searching, there needs to be some control over the informa-
tion flow. This can be achieved by taking each address as returned by the database
search and comparing it with the characters found in the address image. For exam-
ple, the OCR system may have given very low confidence values to some characters
in the posttown name, but very high confidence to the characters in the postcode.
The database search should then have indicated what posttown corresponds to that
postcode. The OCR system can now be given extra information in terms of what
characters should be present in the posttown. If it knows what characters it is expect-
ing, a bias can be given to those characters and another attempt made at classifying
them. This could also help to resolve cases where the OCR system returned two
characters with very similar confidence, but only one of them is suggested by the
database search. It can now be given a higher confidence.
The output of this iteration would be a new set of information to be passed to the
database search system, and the loop can be continued until either a single address is
found with high enough overall confidence1, or some fixed maximum number of
iterations is reached without resolving the address. The latter case would then result
in a reject of this mail piece from the automated system.
1. The overall confidence would have to take into account the individual confidences from the OCR system on the various parts of the address and the number of potential addresses which the database system suggests would be valid given this output from the OCR system.
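The control loop just described might be organised as in the following sketch. The acceptance rule used (a single candidate whose characters all exceed a confidence threshold) is one possible reading of the overall-confidence requirement, and `toy_ocr` and `toy_search` are illustrative stand-ins, not the real modules.

```python
def recognise(image, ocr, search, threshold=0.9, max_iterations=5):
    """Iterate OCR and database search, feeding candidate addresses back
    as hints, until a single confident address emerges or we give up."""
    hints = None
    for _ in range(max_iterations):
        characters, confidences = ocr(image, hints)
        candidates = search(characters)
        if len(candidates) == 1 and min(confidences) >= threshold:
            return candidates[0]       # single, confident address: accept
        hints = candidates             # bias the next OCR pass
    return None                        # reject: hand over to OCR/VCS operators

# Toy stand-ins: the OCR confuses '0'/'O' until hinted; the search maps
# either reading onto the single valid postcode record.
def toy_ocr(image, hints):
    if hints is None:
        return "Y015DD", [0.95, 0.40, 0.95, 0.95, 0.95, 0.95]
    return "YO15DD", [0.95, 0.92, 0.95, 0.95, 0.95, 0.95]

def toy_search(characters):
    return ["YO1 5DD"] if characters in ("YO15DD", "Y015DD") else []

result = recognise(None, toy_ocr, toy_search)   # -> "YO1 5DD"
```

The fixed iteration limit implements the reject path described above: mail pieces whose address cannot be resolved are passed out of the automated system.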
8.5.2 Asynchronous Processing of Feedback
This method would depend upon the actual implementation of the OCR and data-
base systems, but could result in a greater increase in performance of the overall sys-
tem. It would require that the OCR system be able to take inputs not only from the
address image, but from other sources as well, specifically the database system. Ini-
tially, there would be no output from the database system and this would therefore
have no influence on the OCR system. As the OCR system started to produce out-
puts, these would be fed as they arrived to the database system for searching. When
the search has been completed, the outputs from the database would feed back to the
OCR system and affect its recognition in such a way as to bias it towards the address
features associated with the addresses returned from the search. In turn the OCR
system would produce new outputs, which would again feed into the search. Given
appropriate constraints on the flow of information, the whole system would eventu-
ally settle on the final output address using a kind of relaxation process.
It is possible that some of the work currently underway at the University of York
involving the use of the ADAM network and Cellular Automata (CA) could be use-
ful as a framework for this information flow model, and it would be interesting to
investigate whether this kind of application is suited to a CA-type implementation. If
so, it might be possible to implement the feedback system on the same custom
hardware as the high-speed database lookup system. This would obviously be
advantageous as far as communication efficiency was concerned.
Tight control of the process would be needed to make sure the system converged
onto an address or rejected the mail piece within a given time, rather than oscillating
or diverging. However this removes the burden of actually trying to decide before-
hand which pieces of information would be useful in the feedback loop and building
them into the control process — this system could be tuned or even evolved by
adjusting the parameters controlling the relaxation process.
8.6 System Design
The eventual aim of this research is to provide an improved automated address rec-
ognition system. It is clear that there will be many component parts to such a system
and there are alternatives for the implementation of each component. In order to
properly assess the impact of the choice of one component implementation over
another, it is necessary to have an overall view of the system and how it will interact
with the existing hardware of the sorting offices. It is also crucial to completely modularise the system to allow alternative approaches to each component to be implemented and tested. Without this, it will be very difficult to assess the performance of
the system objectively. An outline of the system is shown in Fig. 8.2.
Fig. 8.2 - System outline of the automated address recognition system: mail stream → camera → OCR ↔ search engine (backed by the PAF index) → machine-readable code system
There are almost certainly parts of this system which are already in place. For exam-
ple, the camera which images the mail piece and the system for printing the machine
readable code on the mail piece are already in use. The exact interfaces would have
to be specified to ensure any new system would work within these modules. Some
kind of control mechanism would be required to handle the loop between the OCR
and Search Engine. This could be as simple as a threshold which must be reached by
the address recognition system before it is taken as correct. However there must be
some way for the system to identify when an address cannot be recognised. Then,
the image of the mail piece must be passed on to the OCR/VCS1 system, which
currently handles the mail that cannot be automatically recognised.
The interface to this system would require specification as well.
There is clearly a lot of work to be done as far as the system is concerned. This report
has concentrated mainly on the components of that system in isolation and no
attempt has been made to integrate them. This is left for the actual implementation,
as there are many issues concerning the components which must be resolved before
that can reasonably be addressed.
8.7 Summary
Many diverse issues related to the automated recognition of postal addresses have
been considered, from the initial OCR of the characters which make up the address,
through to an outline of a system for generating the most likely address record from
the PAF. It is not surprising that many more questions have been raised than have
been answered; as this report is intended to provide a foundation for further work,
that is perhaps its most useful result. The issues to be addressed are:
• The implementation of an OCR module
• Which parts of the address image are to be considered when attempting
to interpret the address
• The method of searching the PAF for the matching record
• If the CMM method is to be used, the problem of ghosting and its possi-
ble solutions
1. OCR/VCS stands for Optical Character Recognition/Video Coding System. It is the name for the system which takes address images from mail pieces which cannot be recognised by the automated system, and presents them on video screens to human operators who key in the address information by hand.
• The integration of the verification stage with the OCR module, to provide
greater reliability of recognition
• The speed with which the whole operation can be performed
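The CMM method and the ghosting problem listed above can be illustrated with a minimal Willshaw-style correlation matrix memory. The binary codes below are toy examples; a real system would use sparse codings of address fields, and ghosting (spurious output bits) appears as more patterns are superimposed in the matrix.

```python
import numpy as np

class CMM:
    """Minimal binary correlation matrix memory (Willshaw-style sketch)."""

    def __init__(self, in_bits, out_bits):
        self.M = np.zeros((out_bits, in_bits), dtype=np.uint8)

    def train(self, x, y):
        # Hebbian storage: OR in the outer product of the binary pair.
        self.M |= np.outer(y, x).astype(np.uint8)

    def recall(self, x, threshold=None):
        # Sum matched input bits per output line, then threshold.
        s = self.M @ x
        if threshold is None:
            threshold = int(x.sum())  # exact-match (Willshaw) threshold
        return (s >= threshold).astype(np.uint8)

# Toy input/output codes (illustrative only).
x1 = np.array([1, 1, 0, 0, 0, 0], dtype=np.uint8)
y1 = np.array([1, 0, 0, 1], dtype=np.uint8)
x2 = np.array([0, 0, 1, 1, 0, 0], dtype=np.uint8)
y2 = np.array([0, 1, 1, 0], dtype=np.uint8)

m = CMM(6, 4)
m.train(x1, y1)
m.train(x2, y2)

full1 = m.recall(x1)                       # exact query recalls y1
full2 = m.recall(x2)                       # exact query recalls y2
partial = np.array([1, 0, 0, 0, 0, 0], dtype=np.uint8)
part1 = m.recall(partial, threshold=1)     # partial query, relaxed threshold
```

The partial query shows why the technique suits incomplete address data: lowering the threshold recovers the stored pattern from a fragment of the input, at the cost of admitting ghost bits once the matrix becomes heavily loaded.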
There is also the question of applying these searching methods to other problems
within the Post Office. While the automated recognition of addresses is clearly key to
one of its main operations, and as such formed an ideal framework for the research
carried out so far, it is by no means the only area in which this kind of technology
could be of benefit. It is intended to obtain as wide a picture as possible of other
potential uses of such a system, both to guide future work and to give the sponsor a
practical realisation of the research.
9. References

[1] WOLF, PLATT
Wolf R., Platt J. C.
Postal Address Block Location Using A Convolutional Locator Network
Submission to Advances in Neural Information Processing 6, 1994
[2] LEE, CHOI
Lee S., Choi Y.
Robust Recognition of Handwritten Numerals based on Dual Cooperative
Network
International Joint Conference on Neural Networks Vol. 3 pp 760-768, 1992
[3] KERTESZ, KERTESZ
Kertesz A., Kertesz V.
Dynamically Connected Neural Network for Character Recognition
International Joint Conference on Neural Networks Vol. 3 pp 672-676, 1992
[4] WANG, JEAN
Wang J., Jean J. S. N.
Segmentation of Merged Characters by Neural Networks and Shortest Path
Pattern Recognition Vol. 27 No. 5 pp 649-658, 1994
[5] MULGAONKAR ET AL.
Mulgaonkar P. G., Chen C., DeCurtins J. L.
Word Recognition in a Segmentation-Free Approach to OCR
SPIE Vol. 2103 pp 135-141, 1994
[6] SENI, COHEN
Seni G., Cohen E.
External Word Segmentation of Off-Line Handwritten Text Lines
Pattern Recognition Vol. 27 No. 1 pp 41-52, 1994
[7] LIANG ET AL.
Liang S., Shridhar M., Ahmadi A.
Segmentation of Touching Characters in Printed Document Recognition
Pattern Recognition Vol. 27 No. 6 pp 825-840, 1994
[8] YANIKOGLU, SANDON
Yanikoglu B. A., Sandon P. A.
Off-Line Cursive Handwriting Recognition Using Neural Networks
SPIE Vol. 1965 Application of Artificial Neural Networks IV pp 577-588, 1993
[9] KABIR, DOWNTON
Kabir E., Downton A. C.
Syntax and Context in OCR of Handwritten British Postcodes
Draft Paper, University of Essex, Colchester
[10] KABIR ET AL.
Kabir E., Downton A. C., Birch R.
Recognition and Verification of Postcodes in Handwritten and Hand Printed
Addresses
Submission to 10ICPR, University of Essex, Colchester
[11] DOWNTON ET AL.
Downton A. C., Kabir E., Guillevic D.
Syntactic and Contextual Post-Processing of Handwritten Addresses for OCR
Draft Paper for 9ICPR, University of Essex, Colchester
[12] HENDRAWAN, LEEDHAM
Hendrawan, Leedham C. G.
Verification of Constrained Postcode Recognition Using Global Features Extracted
From The Handwritten Address - Verification
Commercial Report, University of Essex, Colchester, 1991
[13] LEEDHAM, JONES
Leedham C. G., Jones P. E.
Automatic Sorting of Australian Handwritten Letter Mail Using OCR and Address
Feature Verification
TENCON '92 Vol. 1 pp 287-291, 1992
[14] TREGIDGO, DOWNTON
Tregidgo R. W. S., Downton A. C.
Generalised Parallelism for Embedded Vision Systems: An Application to Real
Time OCR of Postal Addresses
Submission to 6th International Conference on Image Analysis and Processing,
University of Essex, Colchester
[15] TREGIDGO, DOWNTON
Tregidgo R. W. S., Downton A. C.
Scalable Parallelism for Embedded Vision Applications: The Generalised Tree
Pipeline
Submission to Transputer Applications '91, University of Essex, Colchester, 1991
[16] TREGIDGO, DOWNTON
Tregidgo R. W. S., Downton A. C.
A Design Philosophy for Scalable Parallel Embedded Vision Systems
University of Essex, Colchester
[17] ROVNER ET AL.
Rovner R. M., Gillies A. M., Ganzberger M. J., Hepp D. J.
Strategies for the Automatic Interpretation of Handwritten Addresses
SPIE Vol. 2103 pp 174-185, 1994
[18] LEEDHAM
Leedham C. G.
Comparison of Optical Recogniser Performance in Postal Applications
Commercial Report, University of Essex, Colchester, 1993
[19] HENDRAWAN, LEEDHAM
Hendrawan, Leedham C. G.
Verification of Constrained Postcode Recognition Using Global Features Extracted
From The Handwritten Address - Address Segmentation and Feature Extraction
Commercial Report, University of Essex, Colchester, 1991
[20] GORSKY
Gorsky N. D.
Experiments with Handwriting Recognition Using Holographic Representation of
Line Images
Pattern Recognition Letters 15 pp 853-859, 1994
[21] LECUN ET AL.
LeCun Y., Boser B., Denker J. S., Henderson D., Howard R. E., Hubbard W., Jackel
L. D.
Handwritten Digit Recognition with a Back-Propagation Network
Neural Information Processing Systems Vol 2, 1990
[22] WANG, JEAN
Wang J., Jean J. S. N.
Multi-resolution Neural Networks for Omnifont Character Recognition
IEEE International Conference on Neural Networks pp 1588-1593, 1993
[23] DRUCKER ET AL.
Drucker H., Schapire R., Simard P.
Boosting Performance in Neural Networks
International Journal of Pattern Recognition and Artificial Intelligence Vol. 7 No. 4
pp 705-719, 1993
[24] GUPTA ET AL.
Gupta A., Nagendraprasad M. V., Liu A., Wang P. S. P., Ayyadurai S.
An Integrated Architecture for Recognition of Totally Unconstrained Handwritten
Numerals
International Journal of Pattern Recognition and Artificial Intelligence Vol. 7 No. 4
pp 757-773, 1993
[25] MARTIN ET AL.
Martin G. L., Rashid M., Pittman J. A.
Integrated Segmentation and Recognition through Exhaustive Scans or Learned
Saccadic Jumps
International Journal of Pattern Recognition and Artificial Intelligence Vol. 7 No. 4
pp 831-847, 1993
[26] BURGES ET AL.
Burges C. J. C., Ben J. I., Denker J. S., LeCun Y., Nohl C. R.
Off Line Recognition of Handwritten Postal Words using Neural Networks
International Journal of Pattern Recognition and Artificial Intelligence Vol. 7 No. 4
pp 689-704, 1993
[27] YOUNG, FU
Young T. Y., Fu K.
Handbook of Pattern Recognition and Image Analysis
Orlando, Academic Press, 1986-1994
[28] HARTIGAN
Hartigan J. A.
Clustering Algorithms
Yale University, 1975
[29] O’KEEFE, AUSTIN
O’Keefe S. E. M., Austin J.
Application of the ADAM Associative Memory to the Analysis of Document
Images
Proceedings of the Weightless Neural Network Workshop pp 17-22, 1995
[30] MARTIN, RASHID
Martin G. L., Rashid M.
Recognizing Overlapping Hand-Printed Characters by Centered-Object Integrated
Segmentation and Recognition
Advances in Neural Information Processing Systems Vol 4 pp 504-511, 1992
[31] WILLSHAW ET AL.
Willshaw D. J., Buneman O. P., Longuet-Higgins H. C.
Non-Holographic Associative Memory
Nature Vol 222 pp 960-962, 1969
[32] NADAL, TOULOUSE
Nadal J., Toulouse G.
Information Storage in Sparsely Coded Memory Nets
Network I pp 61-74, 1990
[33] AUSTIN, STONHAM
Austin J., Stonham T.
An Associative Memory for use in Image Recognition and Occlusion Analysis
Image and Vision Computing Vol. 5 No. 4 pp 251-261, 1987
[34] RIVEST
Rivest R. L.
Partial-Match Retrieval Algorithms
SIAM Journal of Computing Vol. 5 No. 1 pp 19-50, 1976
[35] BURKHARD
Burkhard W. A.
Partial Match Retrieval
BIT 16 pp 13-31, 1976
[36] KIM, PRAMANIK
Kim M. H., Pramanik S.
Optimal File Distribution for Partial Match Retrieval
Proceedings of Sigmod International Conference on Management of Data pp 173-182, 1988
[37] KENNEDY
Kennedy J. V.
An Exploration into Novel Architectures for Uncertain Reasoning
First Year Report, University of York, 1995
[38] FILER
Filer R.
Symbolic Reasoning in an Associative Neural Network
Masters Thesis, University of York, 1994
[39] LUCAS
Lucas S. M.
Rapid Best-First Retrieval from Massive Dictionaries
Submission to IEEE International Conference on Neural Networks, 1995
[40] LUCAS
Lucas S. M.
High Performance OCR with Syntactic Neural Networks
Artificial Neural Networks Publication No. 409 pp 133-138, 1995
[41] ELLIMAN, LANCASTER
Elliman D. G., Lancaster I. T.
A Review of Segmentation and Contextual Analysis Techniques for Text
Recognition
Pattern Recognition Vol. 23 No. 3/4 pp 337-346, 1990
[42] CHAHAL
Chahal S.
Discrimination of Handwritten from Machine Printed Text
SPIE Vol 2238 pp 190-197, 1994
[43] AUSTIN
Austin J.
Reasoning with Correlation Matrix Memories
Draft Paper, University of York, 1994
[44] AUSTIN ET AL.
Austin J., Kennedy J. V., Pack R., Cass B.
C-NNAP: An Architecture for the Parallel Processing of Binary Neural Networks
Proceedings of the Weightless Neural Network Workshop pp 23-28, 1995
[45] AUSTIN ET AL.
Austin J., Kennedy J. V., Lees K.
The Advanced Uncertain Reasoning Architecture, AURA
Proceedings of the Weightless Neural Network Workshop, 1995