Post on 21-Dec-2015
Data Mining, Information Theory and Image Interpretation
Sargur N. Srihari
Center of Excellence for Document Analysis and Recognition
and
Department of Computer Science and Engineering
State University of New York at Buffalo
Buffalo, NY 14260
USA
Data Mining
• Search for Valuable Information in Large Volumes of Data
• Knowledge Discovery in Databases (KDD)
• Discovery of Hidden Knowledge, Unexpected Patterns and new rules from Large Databases
Information Theory
• Definitions of Information:
– Communication Theory
  • Entropy (Shannon-Weaver)
  • Stochastic Uncertainty
  • Bits
– Information Science
  • Data of Value in Decision Making
Image Interpretation
• Use of knowledge in assigning meaning to an image
• Pattern Recognition using Knowledge
• Processing Atoms (Physical) as Bits (Information)
Address Interpretation Model
[Diagram: Address image (x) -> Address Interpretation (AI) -> Interpretation I(x). Knowledge source (K): Mailstream (S), Postal address Directory (D).]
Typical American Address
Address Directory Size: 139 million records
Assignment Strategies
Typical street address
ZIP Code: 14221
Primary number: 276
Database query results:
Lexicon entry (Street name)    ZIP+4 add-on
AMHERSTON DR                   7006
BELVOIR RD                     3604
CADMAN DR                      6948
CLEARFIELD DR                  2336
FORESTVIEW DR                  1438
HARDING RD                     7111
HUNTERS LN                     3330
MCNAIR RD                      3718
MEADOWVIEW LN                  3557
OLD LYME DR                    2250
RANCH TRL                      2340
RANCH TRL W                    2246
SHERBROOKE AVE                 3421
SUNDOWN TRL                    2242
TENNYSON TER                   5916
Word Recognizer selects (after lexicon expansion)
Delivery point: 142213557
Address encoding
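The assignment step above can be sketched in a few lines. This is only an illustration under stated assumptions: the dictionary layout and the function names `encode_delivery_point` and `assign` are hypothetical, and the word recognizer is reduced to a string that is already recognized. The lexicon values are the ones listed on the slide.

```python
# Hypothetical sketch; names and data layout are illustrative only.

# Lexicon returned by the directory query for ZIP Code 14221, primary number 276:
# street name -> ZIP+4 add-on (values copied from the slide).
LEXICON_14221_276 = {
    "AMHERSTON DR": "7006", "BELVOIR RD": "3604", "CADMAN DR": "6948",
    "CLEARFIELD DR": "2336", "FORESTVIEW DR": "1438", "HARDING RD": "7111",
    "HUNTERS LN": "3330", "MCNAIR RD": "3718", "MEADOWVIEW LN": "3557",
    "OLD LYME DR": "2250", "RANCH TRL": "2340", "RANCH TRL W": "2246",
    "SHERBROOKE AVE": "3421", "SUNDOWN TRL": "2242", "TENNYSON TER": "5916",
}


def encode_delivery_point(zip_code: str, add_on: str) -> str:
    """Concatenate the 5-digit ZIP Code and the 4-digit ZIP+4 add-on."""
    if len(zip_code) != 5 or len(add_on) != 4:
        raise ValueError("expected a 5-digit ZIP Code and a 4-digit add-on")
    return zip_code + add_on


def assign(zip_code: str, lexicon: dict, recognized_street: str):
    """Return the delivery point, or None if the street is not in the lexicon."""
    add_on = lexicon.get(recognized_street)
    if add_on is None:
        return None
    return encode_delivery_point(zip_code, add_on)


# Stand-in for the word recognizer's choice among the lexicon entries.
delivery_point = assign("14221", LEXICON_14221_276, "MEADOWVIEW LN")
print(delivery_point)  # 142213557
```

The smaller the lexicon returned by the query, the easier the word recognizer's choice, which is why the size of the street name lexicon matters later in the talk.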
Australian Address
Delivery Point ID: 66568882
Postal Directory Size: 9.4 million records
Canadian Address
Postal code: H1X 3B3
Postal Directory: 12.7 million records
United Kingdom Address
Postcode: TN23 1EU (unique postcode)
Delivery Point Suffix: 1A (default)
Address Directory Size: 26 million records
Motivation for Information Theoretic Study
• Understand information interaction in postal address fields to overcome uncertainty in fields
• Compare the efficiency of assignment strategies
• Rank processing priority for determining a component value
• Select the most effective component to help recover an ambiguous component
Address Fields in a US Postal Address
Example address:
Sargur N. Srihari
520 Lee Entrance STE 202
Amherst NY 14228-2583
• Address fields
f1  city name                   Amherst
f2  state abbr.                 NY
f3  5-digit ZIP Code            14228
f4  4-digit ZIP+4 add-on        2583
f5  primary number              520
f6  street name                 Lee Entrance
f7  secondary designator abbr.  STE
f8  secondary number            202
• Delivery point: 142282583
Probability Distribution of Street Name Lexicon Size | f6 |
[Plot 1: size of street name lexicon vs. log (Number of ZIP Codes); marked point (3.80, 1)]
No. of ZIP Codes with | f6 | = 1 => 6,264 (14.97%)
| f3 | = 41,840
Mean | f6 | = 95.04
Max | f6 | = 1,513
[Plot 2: size of street name lexicon vs. log (No. of (ZIP, primary) pairs); marked point (7.53, 1)]
No. of (ZIP, primary) pairs with | f6 | = 1 => 34,102,092 (69.11%)
| (f3 , f5 ) | = 49,347,888
Mean | f6 | = 2.21
Max | f6 | = 542
Number of Address Records for Different Countries
Country          No. of address records in directory (million)
Australia        9.4
Canada           12.7
United Kingdom   26
USA              139
Italy
France
Definitions
• A component c is an address field fi, a portion of fi (e.g., a digit), or a combination of components.
1. Entropy H(x) = information provided by component x (assuming uniform distribution)
H(x) = log2 | x | bits
2. Conditional entropy Hx(y) = uncertainty of component y when component x is known
H_x(y) = -\sum_{i,j} p_{ij} \log_2 \left( \frac{p_{ij}}{\sum_k p_{ik}} \right)
where xi is a value of component x; yj is a value of component y; pij is the joint probability p(xi, yj)
3. Redundancy of component x to y
Rx(y) = (H(x) + H(y) - H(x, y)) / H(y)
0 <= Rx(y) <= 1
A higher value of Rx(y) indicates that more of the information in y is shared by x.
Example of Information Measure
Value sets:
field A: (a,b,c,d)
field B = B1 B2, with B1: (0,1,9) and B2: (0,1)
field C: (e,f)
Address records
A   B1  B2  C
a   1   0   e
a   1   1   e
b   1   1   e
c   0   0   f
d   9   1   f
Information measure
x    H(x)    Hx(B)  Rx(B1)
A    2       0.4    1
B1   log2 3  0.55   1
B2   1       0.95   0.37
C    1       0.95   0.63
pa10 = 1/5, pae = 2/5, etc.
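The three measures can be checked against the five example records with a short script. This is a minimal sketch, not the author's code: the function names and the column-index encoding are assumptions made for illustration. H(x) uses the uniform assumption (log2 of the number of distinct values), while Hx(y) uses the empirical joint probabilities of the records.

```python
from collections import Counter
from math import log2

# Address records from the example, as (A, B1, B2, C).
RECORDS = [
    ("a", "1", "0", "e"),
    ("a", "1", "1", "e"),
    ("b", "1", "1", "e"),
    ("c", "0", "0", "f"),
    ("d", "9", "1", "f"),
]
# Components are tuples of column indices; B is the combination of B1 and B2.
A, B1, B2, C = (0,), (1,), (2,), (3,)
B = (1, 2)


def project(record, cols):
    """Value of a component (one or more columns) in a record."""
    return tuple(record[c] for c in cols)


def entropy(records, cols):
    """H(x) = log2 |x|, where |x| is the number of distinct values of x."""
    return log2(len({project(r, cols) for r in records}))


def conditional_entropy(records, x_cols, y_cols):
    """Hx(y) = -sum_ij p_ij * log2(p_ij / sum_k p_ik)."""
    n = len(records)
    joint = Counter((project(r, x_cols), project(r, y_cols)) for r in records)
    marginal = Counter(project(r, x_cols) for r in records)
    return -sum((c / n) * log2(c / marginal[x]) for (x, _), c in joint.items())


def redundancy(records, x_cols, y_cols):
    """Rx(y) = (H(x) + H(y) - H(x, y)) / H(y)."""
    h_x = entropy(records, x_cols)
    h_y = entropy(records, y_cols)
    h_xy = entropy(records, tuple(x_cols) + tuple(y_cols))
    return (h_x + h_y - h_xy) / h_y


for name, cols in [("A", A), ("B1", B1), ("B2", B2), ("C", C)]:
    print(name,
          round(entropy(RECORDS, cols), 2),
          round(conditional_entropy(RECORDS, cols, B), 2),
          round(redundancy(RECORDS, cols, B1), 2))
```

The printed rows agree with the information measure table above once log2 3 is rounded to 1.58.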
Measure of Information from National City State File, D1 (July 1997)
• Measure:
– H(x); x: any combination of f1, f2, and f3i
– Hx(f3); x: any combination of f1, f2, and f3i
Value sets
Field f1  City name    39,795
Field f2  State abbr.  62
Field f3  ZIP Code     42,880  (digits f31 f32 f33 f34 f35)
D1 = 79,343
Measure of Information from Delivery Point Files, D2 (July 1997)
• Measure:
– H(x); x: any combination of f3, f4, f5, f6, f7, f8, and f9
– Hx(f4); x: f3 with any combination of f3 ~ f9
Value sets
f4 (ZIP+4 add-on)     9,999
f5 (Primary No.)      1,155,740
f6 (Street name)      1,220,880
f7 (Secondary Abbr.)  24
f8 (Sec. No.)         123,829
f9 (Building/firm)    946,199
D2 = 139,080,291
Measure of Information from D
[Charts: uncertainty in component; uncertainty in ZIP Code when City, State or a digit is known]
• To determine f3 (5-digit ZIP) from f1, f2 and f3i:
– City name reduces uncertainty the most
Propagation of Uncertainty for Assignment Strategies
Ranking Processing Priority for Confirming ZIP Code
f1: City name
f2: State abbreviation
f3: ZIP Code
[Figure: uncertainty of f3 (bits) as more components become known; H(f3) = 15.39]
knowing 1 component, Hf1(f3): candidates 1st, 2nd, 3rd, 4th, 5th, state, city; values 12.08, 12.07, 12.09, 12.12, 12.07, 9.98, 2.01
knowing 2 components, Hf1f35(f3): candidates 1st, 2nd, 3rd, 4th, 5th, state; values 1.02, 1.22, 1.20, 1.17, 0.89, 0.63
knowing 3 components, Hf1f35f34(f3): candidates 1st, 2nd, 3rd, 4th, state; values 0.37, 0.36, 0.33, 0.10, 0.33
knowing 4 components, Hf1f35f34f33(f3): candidates 1st, 2nd, 3rd, state; values 0.03, 0.03, 0.01, 0.02
knowing 5 components, Hf1f35f34f33f2(f3): candidates 1st, 2nd, state; values 0.002, 0.001, 0.000
Processing flow: city, 5th, 4th, 3rd, state
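The processing flow above reads as a greedy selection: at each stage, the component that leaves the least uncertainty in the target is processed next. A minimal sketch of that reading follows. It is an assumption, not the author's implementation, and since the postal files are not available here it runs on the five toy records from the earlier example, with B (B1 and B2 together) as the target. The names `greedy_flow` and `COLUMNS` are hypothetical.

```python
from collections import Counter
from math import log2

# Toy records from the earlier example, as (A, B1, B2, C).
RECORDS = [
    ("a", "1", "0", "e"),
    ("a", "1", "1", "e"),
    ("b", "1", "1", "e"),
    ("c", "0", "0", "f"),
    ("d", "9", "1", "f"),
]
COLUMNS = {"A": 0, "B1": 1, "B2": 2, "C": 3}


def value(record, names):
    """Combined value of the named components in one record."""
    return tuple(record[COLUMNS[name]] for name in names)


def conditional_entropy(records, known, target):
    """Uncertainty (bits) of `target` when the `known` components are given."""
    n = len(records)
    joint = Counter((value(r, known), value(r, target)) for r in records)
    marginal = Counter(value(r, known) for r in records)
    h = -sum((c / n) * log2(c / marginal[x]) for (x, _), c in joint.items())
    return max(0.0, h)  # avoid printing -0.0


def greedy_flow(records, candidates, target):
    """Order candidates so that each step minimizes the remaining uncertainty."""
    known, flow = [], []
    remaining = list(candidates)
    while remaining:
        # Ties are broken by the order in which candidates were listed.
        best = min(remaining,
                   key=lambda c: conditional_entropy(records, known + [c], target))
        known.append(best)
        remaining.remove(best)
        flow.append((best, conditional_entropy(records, known, target)))
    return flow


flow = greedy_flow(RECORDS, ["A", "B1", "B2", "C"], ["B1", "B2"])
for name, bits in flow:
    print(f"{name}: {bits:.2f} bits remaining")
```

On the toy data the flow starts with A (0.40 bits remaining) and then B2, after which no uncertainty is left; the last two components tie at zero.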
Modeling Processing Cost
• For component y
Location rate = l(y), 0 <= l(y) <= 1
Recognition rate = r(y), 0 <= r(y) <= 1
Processing speed = s(y) in msec
Existence rate = e(y), 0 <= e(y) <= 1
Patron rate = p(y), 0 <= p(y) <= 1
Lexicon size of y, given x = | yx | = 2^(H(x,y) - H(x))
• Cost of processing component y given component x
Cost_x(y) = H_x(y) \cdot \frac{(1 + \log |y_x|) \cdot s(y)}{l(y) \cdot r(y) \cdot e(y) \cdot p(y)}
Example Cost Table
Component y  Location rate l(y)  Recognition rate r(y)  Processing speed s(y)  Existence rate e(y)  Patron rate p(y)
City         1                   0.75                   24.88                  1                    1
State        1                   0.75                   24.88                  1                    1
ZIP digit    1                   0.85                   11.21                  1                    1
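The cost model can be evaluated directly from the table. The sketch below rests on two assumptions that the slides do not state: the logarithm is taken base 10, and the uncertainty term is the uncertainty left in the ZIP Code once the component is known (2.01 bits for city, 12.08 bits for the 1st digit). With these choices the city cost comes out at about 373.39 and the 1st-digit cost at about 318.6, close to the values 373.39 and 318.57 that appear in the cost-based ranking figure. The function name `processing_cost` is hypothetical.

```python
from math import log10


def processing_cost(uncertainty_bits, lexicon_size, speed_ms,
                    location_rate=1.0, recognition_rate=1.0,
                    existence_rate=1.0, patron_rate=1.0):
    """Cost = H * (1 + log|y_x|) * s(y) / (l(y) * r(y) * e(y) * p(y)).

    Assumption: log is base 10 (the slide does not give the base).
    """
    success = location_rate * recognition_rate * existence_rate * patron_rate
    if success <= 0:
        raise ValueError("the product of the rates must be positive")
    return uncertainty_bits * (1 + log10(lexicon_size)) * speed_ms / success


# City: 39,795 city names, recognition rate 0.75, 24.88 msec,
# 2.01 bits of ZIP-Code uncertainty left once the city is known.
city_cost = processing_cost(2.01, 39_795, 24.88, recognition_rate=0.75)

# 1st ZIP digit: 10 digit values, recognition rate 0.85, 11.21 msec,
# 12.08 bits of ZIP-Code uncertainty left once the digit is known.
digit_cost = processing_cost(12.08, 10, 11.21, recognition_rate=0.85)

print(round(city_cost, 2), round(digit_cost, 2))
```

A slow or unreliable recognizer raises the cost of a component even when it removes a lot of uncertainty, which is why a cost-based flow can differ from the flow based on Hx(y) alone.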
Ranking Processing Priority for Confirming ZIP Code Based on Cost
[Figure: cost of each candidate component at each processing stage]
process 1st component: candidates 1st, 2nd, 3rd, 4th, 5th, state, city; costs 318.57, 318.76, 319.63, 318.31, 1027.6, 373.39, 318.31
process 2nd component: candidates 1st, 3rd, 4th, state, city, 5th; costs 232.01, 231.69, 232.09, 230.87, 692.16, 188.21
process 3rd component: candidates 1st, 3rd, 4th, 5th, state; costs 26.57, 25.71, 15.82, 9.46, 44.88
process 4th component: candidates 1st, 3rd, 4th, state; costs 8.56, 7.62, 0.73, 14.08
process 5th component: candidates state, 1st, 3rd; costs 0.55, 0.896, 0.02
process 6th component: candidates state, 1st; costs 0, 0
Processing flow based on cost: 2nd, city, 5th, 4th, 3rd, 1st
Processing flow based on Hx(y): city, 5th, 4th, 3rd, state
Recovery of 1st ZIP-Code Digit, f31, from State Abbr. (f2) and Other ZIP-Code Digits (f32-f35)
• Usage: If recognition of a component (e.g., f31) fails, it is most likely to be recovered from the known component with the largest redundancy (here f2).
• There are 62 state abbreviations. In 60 of them, the 1st ZIP digit is unique. For NY and TX, there are two valid 1st ZIP-Code digits.
x    H(x)  H(x, f31)  Rx(f31)  Ranking
f2   5.95  6.00       0.99     1
f32  3.32  6.64       0        2
f33  3.32  6.64       0        2
f34  3.32  6.64       0        2
f35  3.32  6.64       0        2
f31  3.32  -          -        -
Example: f2 = NY, f31 = ?, f32 = 4, f33 = 2, f34 = 2, f35 = 8
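The redundancy values in the table follow from value counts alone, because H(x) = log2 | x | under the uniform assumption. A minimal sketch, with the hypothetical helper name `redundancy_from_counts`: 62 state abbreviations and 10 digit values are taken from the slides, and the 64 valid (state, 1st digit) pairs follow from 60 states with one valid digit plus NY and TX with two each.

```python
from math import log2


def redundancy_from_counts(n_x, n_y, n_xy):
    """Rx(y) = (H(x) + H(y) - H(x, y)) / H(y), with H = log2 of a value count."""
    h_x, h_y, h_xy = log2(n_x), log2(n_y), log2(n_xy)
    return (h_x + h_y - h_xy) / h_y


# State abbreviation vs. 1st ZIP digit: 62 states, 10 digits,
# 60 * 1 + 2 * 2 = 64 valid (state, digit) pairs.
r_state = redundancy_from_counts(62, 10, 64)

# Another ZIP digit vs. 1st ZIP digit: all 10 * 10 = 100 pairs occur,
# so the two digits share no information.
r_digit = redundancy_from_counts(10, 10, 100)

print(round(r_state, 2), round(abs(r_digit), 2))  # 0.99 0.0
```

For the NY example, knowing the state narrows the missing 1st digit to two candidates, whereas the other four digits do not narrow it at all.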
Measure of Information from Mail Stream, S
• Eighteen sets of mail pieces, each from a mail processing site
• We measure
– Information provided: H(f2), H(f3i)
– Uncertainty of f3: Hf2(f3), Hf3i(f3)
• Each set is measured separately
• Results are reported as the average over these sets
Comparison of ZIP-Code Uncertainty from D and S
Comparison of Results from D and S
• ZIP-Code uncertainty from S < from D
• Information from S is more effective for determining a ZIP Code
• The most effective processing flow for using f3i and f2 to determine f3 is consistent between S and D:
f2 -> f35 -> f34 -> f33 -> f32 -> f31
UK Address Interpretation: Field Recognition & Database Query
• Fields of interest:
– Locality
– Post town
– County
– Outward postcode
• Target: Outward postcode
• Control flow: Based on data mining
[Diagram boxes: Locality; Post town/county; Outward postcode]
UK Address Interpretation: Last Line Parsing & Resolution
[Flowchart boxes: Address block image; Chaincode generation; Pre-scan digit recognition; Line segmentation; Word separation; Last line parsing (shape, syntax); Field recognition & Database query; Field assignment; Last line resolution]
[Decisions (Y/N): Outward postcode assigned; Other choices]
[Outputs: Candidate outward postcodes; Assigned outward postcode]
Discussion (Reliability of information)
• When selecting an effective processing flow in address interpretation, the prediction is accurate when the information is representative of the current processing situation
• Using unreliable information to determine a candidate value may cause errors
• A processing flow chosen from unreliable information is less effective
Reliability of information
• Measure of information from D
– Not reflecting the current processing situation
– Full coverage of all valid values
• Measure of information from S
– Assuming that site-specific preceding history represents the current processing situation
– Mail distribution could be season-specific
– Should consider the coverage of valid samples
– Should consider the information bias if valid samples are from the AI engine
Complexity of collecting mail information (S)
• Information from mail streams should be collected automatically, and only high-confidence information should be collected
• Address interpretation is not ideal
• Some error cases would be collected
• Address interpretation may always reject certain patterns of mail pieces, resulting in biased collected information
Conclusion
• Information content of postal addresses can be measured
• The efficiency of assignment strategies can be compared
• Redundancy of two components can be measured
– An uncertain component has a higher probability of recovery when another component with larger redundancy is known
• Information measures can suggest the most effective processing flow
• Information Theory is an effective tool for Data Mining