Data Mining, Information Theory and Image Interpretation

Sargur N. Srihari

Center of Excellence for Document Analysis and Recognition

and

Department of Computer Science and Engineering

State University of New York at Buffalo

Buffalo, NY 14260

USA

Data Mining

• Search for Valuable Information in Large Volumes of Data

• Knowledge Discovery in Databases (KDD)

• Discovery of Hidden Knowledge, Unexpected Patterns and new rules from Large Databases

Information Theory

• Definitions of Information:

– Communication Theory

  • Entropy (Shannon-Weaver)

  • Stochastic Uncertainty

  • Bits

– Information Science

  • Data of Value in Decision Making

Image Interpretation

• Use of knowledge in assigning meaning to an image

• Pattern Recognition using Knowledge

• Processing Atoms (Physical) as Bits (Information)

Address Interpretation Model

[Diagram: Address Interpretation (AI) maps an address image (x) to an interpretation I(x). Knowledge source (K): mailstream (S) and postal address directory (D).]

Typical American Address

Address Directory Size: 139 million records

Assignment Strategies

Typical street address

Database query: ZIP Code 14221, primary number 276

Results:

Lexicon entry (street name)   ZIP+4 add-on
AMHERSTON DR                  7006
BELVOIR RD                    3604
CADMAN DR                     6948
CLEARFIELD DR                 2336
FORESTVIEW DR                 1438
HARDING RD                    7111
HUNTERS LN                    3330
MCNAIR RD                     3718
MEADOWVIEW LN                 3557
OLD LYME DR                   2250
RANCH TRL                     2340
RANCH TRL W                   2246
SHERBROOKE AVE                3421
SUNDOWN TRL                   2242
TENNYSON TER                  5916

Word Recognizer selects (after lexicon expansion): MEADOWVIEW LN, add-on 3557

Address encoding: delivery point 142213557
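The query-then-encode step on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `DIRECTORY`, `query_directory` and `encode_delivery_point` are hypothetical names, the directory is a toy dictionary holding just the fifteen entries shown on the slide, and the word recognizer is replaced by a hard-coded choice.

```python
# Toy directory keyed by (ZIP Code, primary number); the lexicon entries and
# ZIP+4 add-ons are the ones listed on the slide for ZIP 14221, number 276.
DIRECTORY = {
    ("14221", "276"): {
        "AMHERSTON DR": "7006", "BELVOIR RD": "3604", "CADMAN DR": "6948",
        "CLEARFIELD DR": "2336", "FORESTVIEW DR": "1438", "HARDING RD": "7111",
        "HUNTERS LN": "3330", "MCNAIR RD": "3718", "MEADOWVIEW LN": "3557",
        "OLD LYME DR": "2250", "RANCH TRL": "2340", "RANCH TRL W": "2246",
        "SHERBROOKE AVE": "3421", "SUNDOWN TRL": "2242", "TENNYSON TER": "5916",
    }
}


def query_directory(directory, zip_code, primary_number):
    """Return the street-name lexicon (street -> add-on) for a query key."""
    return directory.get((zip_code, primary_number), {})


def encode_delivery_point(zip_code, add_on):
    """Concatenate a 5-digit ZIP Code and a 4-digit add-on."""
    if len(zip_code) != 5 or len(add_on) != 4:
        raise ValueError("expected a 5-digit ZIP Code and a 4-digit add-on")
    return zip_code + add_on


lexicon = query_directory(DIRECTORY, "14221", "276")
recognized_street = "MEADOWVIEW LN"  # stand-in for the word recognizer's choice
delivery_point = encode_delivery_point("14221", lexicon[recognized_street])
print(len(lexicon), delivery_point)
```

The smaller the lexicon returned by the query, the easier the recognizer's job, which is why the size of this lexicon is studied in the slides that follow.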

Australian Address

Delivery Point ID: 66568882

Postal Directory Size: 9.4 million records

Canadian Address

Postal code: H1X 3B3

Postal Directory: 12.7 million records

United Kingdom Address

Postcode: TN23 1EU (unique postcode)

Delivery Point Suffix: 1A (default)

Address Directory Size: 26 million records

Motivation for Information Theoretic Study

• Understand information interaction in postal address fields to overcome uncertainty in fields

• Compare the efficiency of assignment strategies

• Rank processing priority for determining a component value

• Select most effective component to help recover an ambiguous component

Address Fields in a US Postal Address

• Address fields (example address):

Sargur N. Srihari
520 Lee Entrance STE 202
Amherst NY 14228-2583

Field   Name                          Example value
f1      city name                     Amherst
f2      state abbr.                   NY
f3      5-digit ZIP Code              14228
f4      4-digit ZIP+4 add-on          2583
f5      primary number                520
f6      street name                   Lee Entrance
f7      secondary designator abbr.    STE
f8      secondary number              202

• Delivery point: 142282583

Probability Distribution of Street Name Lexicon Size |f6|

[Plot 1: size of street name lexicon against log (number of ZIP Codes); marked point (3.80, 1)]

|f3| = 41,840; mean |f6| = 95.04; max |f6| = 1,513

No. of ZIP Codes with |f6| = 1: 6,264 (14.97%)

[Plot 2: size of street name lexicon against log (no. of (ZIP, primary) pairs); marked point (7.53, 1)]

|(f3, f5)| = 49,347,888; mean |f6| = 2.21; max |f6| = 542

No. of (ZIP, primary) pairs with |f6| = 1: 34,102,092 (69.11%)
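The statistics on this slide (mean, maximum, and number of keys whose street-name lexicon has size 1) could be gathered along the following lines. This is a minimal sketch: the six records are invented for illustration, and `lexicon_sizes` and `summarize` are hypothetical helper names, not part of any postal software.

```python
from collections import Counter, defaultdict

# Hypothetical toy directory: (ZIP Code, primary number, street name).
RECORDS = [
    ("14221", "276", "MEADOWVIEW LN"),
    ("14221", "276", "RANCH TRL"),
    ("14221", "10", "HARDING RD"),
    ("14228", "520", "LEE ENTRANCE"),
    ("14260", "1", "MAIN ST"),
    ("14260", "2", "MAIN ST"),
]


def lexicon_sizes(records, key):
    """Map each key (e.g. ZIP, or (ZIP, primary)) to its number of distinct street names."""
    streets = defaultdict(set)
    for record in records:
        streets[key(record)].add(record[2])
    return {k: len(names) for k, names in streets.items()}


def summarize(sizes):
    """Summary statistics of the lexicon-size distribution."""
    values = list(sizes.values())
    histogram = Counter(values)
    return {
        "keys": len(values),
        "mean": sum(values) / len(values),
        "max": max(values),
        "singletons": histogram[1],  # keys whose lexicon holds one street name
        "histogram": dict(histogram),
    }


by_zip = summarize(lexicon_sizes(RECORDS, key=lambda r: r[0]))
by_zip_primary = summarize(lexicon_sizes(RECORDS, key=lambda r: (r[0], r[1])))
print(by_zip)
print(by_zip_primary)
```

As on the slide, adding the primary number to the key shrinks the lexicon: more keys end up with a single candidate street name.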

Number of Address Records for Different Countries

Country          No. of address records in directory (million)
Australia        9.4
Canada           12.7
United Kingdom   26
USA              139
Italy
France

Definitions

• A component c is an address field fi, a portion of fi (e.g., a digit), or a combination of components.

1. Entropy H(x) = information provided by component x (assuming uniform distribution)

H(x) = log2 |x| bits

2. Conditional entropy Hx(y) = uncertainty of component y when component x is known

$$H_x(y) = -\sum_{i,j} p_{ij} \log_2 \frac{p_{ij}}{\sum_k p_{ik}}$$

where xi is a value of component x, yj is a value of component y, and pij = p(xi, yj) is their joint probability

3. Redundancy of component x to y

Rx(y) = (H(x) + H(y) - H(x, y)) / H(y)

0 <= Rx(y) <= 1

A higher value of Rx(y) indicates that more of the information in y is shared by x.

Example of Information Measure

Value sets:

field A: (a, b, c, d)
field B: B1 (0, 1, 9), B2 (0, 1)
field C: (e, f)

Address records:

Field A   Field B1   Field B2   Field C
a         1          0          e
a         1          1          e
b         1          1          e
c         0          0          f
d         9          1          f

Joint probabilities are taken from the records: p(a, 10) = 1/5, p(a, e) = 2/5, etc.

Information measure:

x     H(x)     Hx(B)   Rx(B1)
A     2        0.4     1
B1    log2 3   0.55    1
B2    1        0.95    0.37
C     1        0.95    0.63
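The three measures can be checked against this example with a short Python sketch. It assumes, as the table values suggest, that H(x) and H(x, y) count distinct values (uniform distribution) while the conditional entropy uses joint probabilities estimated from the five records; the function names are illustrative only.

```python
from collections import Counter
from math import log2

# The five address records of the example.
RECORDS = [
    {"A": "a", "B1": "1", "B2": "0", "C": "e"},
    {"A": "a", "B1": "1", "B2": "1", "C": "e"},
    {"A": "b", "B1": "1", "B2": "1", "C": "e"},
    {"A": "c", "B1": "0", "B2": "0", "C": "f"},
    {"A": "d", "B1": "9", "B2": "1", "C": "f"},
]


def project(records, fields):
    """Values of a component (a list of field names) for every record."""
    return [tuple(r[f] for f in fields) for r in records]


def entropy(records, fields):
    """H(x) = log2 |x|, with |x| the number of distinct values of x."""
    return log2(len(set(project(records, fields))))


def conditional_entropy(records, x, y):
    """H_x(y) = -sum_ij p_ij log2(p_ij / sum_k p_ik), p from record counts."""
    n = len(records)
    joint = Counter(project(records, x + y))
    marginal = Counter(project(records, x))
    h = 0.0
    for key, count in joint.items():
        h -= (count / n) * log2(count / marginal[key[:len(x)]])
    return h


def redundancy(records, x, y):
    """R_x(y) = (H(x) + H(y) - H(x, y)) / H(y)."""
    h_x, h_y = entropy(records, x), entropy(records, y)
    return (h_x + h_y - entropy(records, x + y)) / h_y


B = ["B1", "B2"]
for name in ["A", "B1", "B2", "C"]:
    x = [name]
    print(name,
          round(entropy(RECORDS, x), 2),
          round(conditional_entropy(RECORDS, x, B), 2),
          round(redundancy(RECORDS, x, ["B1"]), 2))
```

Rounded to two decimals, the printed rows agree with the information-measure table (log2 3 appears as 1.58).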

Measure of Information from National City State File, D1 (July 1997)

• Measure:

– H(x); x: any combination of f1, f2, and f3i

– Hx(f3); x: any combination of f1, f2, and f3i

Value sets:

Field                              No. of values
f1 (City name)                     39,795
f2 (State abbr.)                   62
f3 (ZIP Code; digits f31 to f35)   42,880

D1 = 79,343
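Under the uniform-distribution definition H(x) = log2 |x|, the value-set sizes above translate directly into bits. The following small sketch does that arithmetic; the dictionary name is arbitrary and the single-digit entry assumes all ten digit values occur.

```python
from math import log2

# Value-set sizes reported for the National City State File (D1).
VALUE_SET_SIZES = {
    "f1 (city name)": 39_795,
    "f2 (state abbr.)": 62,
    "f3 (ZIP Code)": 42_880,
    "f3i (one ZIP digit)": 10,  # assumption: all ten digit values occur
}

entropy_bits = {name: log2(size) for name, size in VALUE_SET_SIZES.items()}

for name, bits in entropy_bits.items():
    print(f"H({name}) = {bits:.2f} bits")
```

The ZIP Code entry comes out at 15.39 bits, the starting uncertainty H(f3) used in the ranking slides.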

Measure of Information from Delivery Point Files, D2 (July 1997)

• Measure:

– H(x); x: any combination of f3, f4, f5, f6, f7, f8, and f9

– Hx(f4); x: f3 with any combination of f3 ~ f9

Value sets:

Field                  No. of values
f4 (ZIP+4 add-on)      9,999
f5 (Primary No.)       1,155,740
f6 (Street name)       1,220,880
f7 (Secondary Abbr.)   24
f8 (Sec. No.)          123,829
f9 (Building/firm)     946,199

D2 = 139,080,291

Measure of Information from D

[Figure: uncertainty in ZIP Code when city, state or a digit is known; axis: uncertainty in component]

• To determine f3 (5-digit ZIP) from f1, f2 and f3i:

- City name reduces uncertainty the most

Propagation of Uncertainty for Assignment Strategies

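Ranking Processing Priority for Confirming ZIP Code

f1: City name; f2: State abbreviation; f3: ZIP Code

[Figure: uncertainty in f3 (bits) as components become known, starting from H(f3) = 15.39. Values are listed per panel as extracted.]

Knowing 1 component (candidates: 1st to 5th ZIP digit, state, city): 12.08, 12.07, 12.09, 12.12, 12.07, 9.98, 2.01. Selected: city, Hf1(f3).

Knowing 2 components (candidates: 1st to 5th ZIP digit, state): 1.02, 1.22, 1.20, 1.17, 0.89, 0.63. Selected: 5th digit, Hf1f35(f3).

Knowing 3 components (candidates: 1st to 4th ZIP digit, state): 0.37, 0.36, 0.33, 0.10, 0.33. Selected: 4th digit, Hf1f35f34(f3).

Knowing 4 components (candidates: 1st to 3rd ZIP digit, state): 0.03, 0.03, 0.01, 0.02. Selected: 3rd digit, Hf1f35f34f33(f3).

Knowing 5 components (candidates: 1st and 2nd ZIP digit, state): 0.002, 0.001, 0.000. Selected: state, Hf1f35f34f33f2(f3).

Processing flow: city, 5th, 4th, 3rd, state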
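A processing flow of this kind can be derived greedily: at each step, add the component that leaves the least uncertainty in the target. The sketch below shows the idea on an invented toy directory with two-digit "ZIP Codes"; the data, the field names `d1` and `d2`, and the tie-breaking rule (first candidate in list order) are all assumptions for illustration, not the study's data or procedure.

```python
from collections import Counter
from math import log2

# Hypothetical toy directory: city -> (state, two-digit ZIP Codes).
CITY_ZIPS = {
    "A": ("NY", ["10", "21"]),
    "B": ("NY", ["11", "22"]),
    "C": ("TX", ["12", "20"]),
    "D": ("TX", ["30", "31", "32"]),
}
RECORDS = [
    {"city": city, "state": state, "zip": z, "d1": z[0], "d2": z[1]}
    for city, (state, zips) in CITY_ZIPS.items()
    for z in zips
]


def conditional_entropy(records, known, target):
    """Uncertainty (bits) of the target field when the known fields are given."""
    n = len(records)
    joint = Counter(tuple(r[f] for f in known) + (r[target],) for r in records)
    marginal = Counter(tuple(r[f] for f in known) for r in records)
    h = 0.0
    for key, count in joint.items():
        h -= (count / n) * log2(count / marginal[key[:-1]])
    return h


def greedy_flow(records, candidates, target):
    """Order components so each step minimizes the remaining uncertainty."""
    known, flow = [], []
    remaining = list(candidates)
    while remaining:
        # min() keeps the first candidate on ties, so list order breaks ties.
        best = min(remaining,
                   key=lambda c: conditional_entropy(records, known + [c], target))
        known.append(best)
        remaining.remove(best)
        flow.append((best, conditional_entropy(records, known, target)))
    return flow


flow = greedy_flow(RECORDS, ["city", "state", "d1", "d2"], "zip")
for component, bits in flow:
    print(f"{component}: {bits:.3f} bits left")
```

On this toy data the city is chosen first and the last digit second, loosely mirroring the "city, 5th, ..." pattern on the slide; once the uncertainty reaches zero the remaining order is decided only by the tie-break.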

Modeling Processing Cost

• For component y

Location rate = l(y), 0 <= l(y) <= 1

Recognition rate = r(y), 0 <= r(y) <= 1

Processing speed = s(y), in msec

Existence rate = e(y), 0 <= e(y) <= 1

Patron rate = p(y), 0 <= p(y) <= 1

Lexicon size of y, given x: |yx| = 2^(H(x, y) - H(x))

• Cost of processing component y given component x

$$\mathrm{Cost}_x(y) = H_x(y) \cdot \frac{(1 + \log |y_x|)\, s(y)}{l(y)\, r(y)\, e(y)\, p(y)}$$

Example Cost Table

Component y   Location rate l(y)   Recognition rate r(y)   Processing speed s(y)   Existence rate e(y)   Patron rate p(y)
City          1                    0.75                    24.88                   1                     1
State         1                    0.75                    24.88                   1                     1
ZIP digit     1                    0.85                    11.21                   1                     1
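The cost model and the example rates can be combined in a short sketch. Several readings here are assumptions rather than statements from the slides: the logarithm in the cost formula is taken as base 10 (the slide writes only "log"), the uncertainty term is the remaining ZIP Code uncertainty after the component is known (2.01 bits for city, 12.07 bits for a digit), and the lexicon sizes are the full value-set sizes (39,795 city names, 10 digit values). The function and parameter names are hypothetical.

```python
from math import log10


def lexicon_size(h_xy, h_x):
    """|y_x| = 2 ** (H(x, y) - H(x)): lexicon size of y once x is known."""
    return 2 ** (h_xy - h_x)


def processing_cost(uncertainty_bits, lexicon, speed_ms,
                    location=1.0, recognition=1.0, existence=1.0, patron=1.0):
    """Uncertainty * (1 + log|lexicon|) * s / (l * r * e * p).

    Assumption: base-10 logarithm; the slide does not state the base.
    """
    rates = location * recognition * existence * patron
    return uncertainty_bits * (1 + log10(lexicon)) * speed_ms / rates


# Rates from the example cost table; uncertainties from the ranking figure.
city_cost = processing_cost(2.01, 39_795, 24.88, recognition=0.75)
digit_cost = processing_cost(12.07, 10, 11.21, recognition=0.85)
print(round(city_cost, 2), round(digit_cost, 2))
```

With these readings the city example evaluates to roughly 373.4 and the digit example to roughly 318.4, close to values that appear in the cost-based ranking figure; that agreement is the only basis for the base-10 assumption.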

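Ranking Processing Priority for Confirming ZIP Code Based on Cost

[Figure: cost of each candidate component at every processing step. Values are listed per panel as extracted.]

Process 1st component (candidates: 1st to 5th ZIP digit, state, city): 318.57, 318.76, 319.63, 318.31, 1027.6, 373.39, 318.31. Selected: 2nd digit.

Process 2nd component (candidates: 1st, 3rd, 4th, 5th ZIP digit, state, city): 232.01, 231.69, 232.09, 230.87, 692.16, 188.21. Selected: city.

Process 3rd component (candidates: 1st, 3rd, 4th, 5th ZIP digit, state): 26.57, 25.71, 15.82, 9.46, 44.88. Selected: 5th digit.

Process 4th component (candidates: 1st, 3rd, 4th ZIP digit, state): 8.56, 7.62, 0.73, 14.08. Selected: 4th digit.

Process 5th component (candidates: state, 1st, 3rd ZIP digit): 0.55, 0.896, 0.02. Selected: 3rd digit.

Remaining candidates (state, 1st ZIP digit): 0, 0.

Processing flow based on cost: 2nd, city, 5th, 4th, 3rd, 1st

Processing flow based on Hx(y): city, 5th, 4th, 3rd, state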

Recovery of 1st ZIP-Code Digit, f31, from State Abbr. (f2) and Other ZIP-Code Digits (f32-f35)

• Usage: If recognition of a component (e.g., f31) fails, it is most likely to be recovered from the known component that has the largest redundancy with it (here f2).

• There are 62 state abbreviations. For 60 of them the 1st ZIP digit is unique. For NY and TX, there are two valid 1st ZIP-Code digits.

x     H(x)   H(x, f31)   Rx(f31)   Ranking
f2    5.95   6.00        0.99      1
f32   3.32   6.64        0         2
f33   3.32   6.64        0         2
f34   3.32   6.64        0         2
f35   3.32   6.64        0         2
f31   3.32   -           -         -

Example: f2 = NY, f31 = ?, f32 = 4, f33 = 2, f34 = 2, f35 = 8
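The redundancy values in this table follow from simple counts under the uniform-distribution definition of H. The sketch below recomputes them; it assumes 64 valid (state, 1st digit) pairs (60 states with one digit plus NY and TX with two each) and that all 100 combinations of two ZIP digits occur, which is what H(x, f31) = 6.64 implies. The function name is illustrative.

```python
from math import log2


def redundancy(n_x, n_y, n_xy):
    """R_x(y) = (H(x) + H(y) - H(x, y)) / H(y) with H = log2 of a value count."""
    h_x, h_y, h_xy = log2(n_x), log2(n_y), log2(n_xy)
    return (h_x + h_y - h_xy) / h_y


N_STATES, N_DIGIT_VALUES = 62, 10
state_digit_pairs = 60 * 1 + 2 * 2  # NY and TX each have two valid 1st digits

redundancies = {
    "f2": redundancy(N_STATES, N_DIGIT_VALUES, state_digit_pairs),
    # Assumption: every pair of digit values occurs, so |(f3i, f31)| = 100.
    "f32": redundancy(N_DIGIT_VALUES, N_DIGIT_VALUES, 100),
}

# Rank components by how much of f31's information they share.
ranking = sorted(redundancies, key=redundancies.get, reverse=True)
print({k: round(v, 2) for k, v in redundancies.items()}, ranking)
```

The state abbreviation shares almost all of the first digit's information, while another ZIP digit shares none, so the state is the component to consult when f31 cannot be read.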

Measure of Information from Mail Stream, S

• Eighteen sets of mail pieces, each from a different mail processing site

• We measure

– Information provided: H(f2), H(f3i)

– Uncertainty of f3: Hf2(f3), Hf3i(f3)

• Each set is measured separately

• Results are reported as the average over these sets

Comparison of ZIP-Code Uncertainty from D and S

Comparison of Results from D and S

• ZIP-Code uncertainty from S < ZIP-Code uncertainty from D

• Information from S is more effective for determining a ZIP Code

• The most effective processing flow for using f3i and f2 to determine f3 (consistent between S and D) is

f2 -> f35 -> f34 -> f33 -> f32 -> f31

UK Address Interpretation: Field Recognition & Database Query

• Fields of interest:

– Locality

– Post town

– County

– Outward postcode

• Target: Outward postcode

• Control flow: Based on data mining

[Diagram nodes: Locality; Post town/county; Outward postcode]

UK Address Interpretation: Last Line Parsing & Resolution

[Flowchart: address block image -> chaincode generation -> pre-scan digit recognition -> line segmentation -> word separation -> last line parsing (shape, syntax) -> field recognition & database query -> field assignment. Decision nodes: "Outward postcode assigned?" (Y/N) and "Other choices?" (Y/N). Other nodes: candidate outward postcodes, last line resolution, assigned outward postcode.]

Discussion (Reliability of information)

• When selecting an effective processing flow for address interpretation, the prediction is accurate only when the measured information is representative of the current processing situation.

• Using unreliable information to determine a candidate value may cause errors.

• A processing flow chosen from unreliable information is less effective.

Reliability of information

• Measure of information from D

– Does not reflect the current processing situation

– Full coverage of all valid values

• Measure of information from S

– Assumes that the site-specific preceding history represents the current processing situation

– Mail distribution could be season-specific

– Should consider the coverage of valid samples

– Should consider the information bias if valid samples come from the AI engine

Complexity of collecting mail information (S)

• Information from mail streams should be collected automatically, and only high-confidence information should be collected

• Address interpretation is not ideal

• Some error cases would be collected

• Address interpretation may always reject certain patterns of mail pieces, resulting in biased collected information

Conclusion

• Information content of postal addresses can be measured

• The efficiency of assignment strategies can be compared

• Redundancy of two components can be measured

– An uncertain component has a higher probability of recovery when another component with larger redundancy is known

• Information measures can suggest the most effective processing flow

• Information Theory is an effective tool for Data Mining
