distance functions and ie
DESCRIPTION
Distance functions and IE. William W. Cohen CALD. Announcements. March 25 Thus – talk from Carlos Guestrin (Assistant Prof in Cald as of fall 2004) on max-margin Markov nets time constraints? Writeups: nothing today “distance metrics for text” – three papers due next Monday, 3/22. - PowerPoint PPT PresentationTRANSCRIPT
![Page 1: Distance functions and IE](https://reader035.vdocuments.net/reader035/viewer/2022062323/5681679d550346895ddce362/html5/thumbnails/1.jpg)
Distance functions and IE
William W. CohenCALD
![Page 2: Distance functions and IE](https://reader035.vdocuments.net/reader035/viewer/2022062323/5681679d550346895ddce362/html5/thumbnails/2.jpg)
Announcements
• March 25 Thus – talk from Carlos Guestrin (Assistant Prof in Cald as of fall 2004) on max-margin Markov nets– time constraints?
• Writeups:– nothing today– “distance metrics for text” – three papers due
next Monday, 3/22
![Page 3: Distance functions and IE](https://reader035.vdocuments.net/reader035/viewer/2022062323/5681679d550346895ddce362/html5/thumbnails/3.jpg)
Record linkage: definition
• Record linkage: determine if pairs of data records describe the same entity – I.e., find record pairs that are co-referent– Entities: usually people (or organizations or…)– Data records: names, addresses, job titles, birth
dates, … • Main applications:
– Joining two heterogeneous relations– Removing duplicates from a single relation
![Page 4: Distance functions and IE](https://reader035.vdocuments.net/reader035/viewer/2022062323/5681679d550346895ddce362/html5/thumbnails/4.jpg)
Record linkage: terminology
• The term “record linkage” is possibly co-referent with:– For DB people: data matching, merge/purge,
duplicate detection, data cleansing, ETL (extraction, transfer, and loading), de-duping
– For AI/ML people: reference matching, database hardening, object consolidation,
– In NLP: co-reference/anaphora resolution– Statistical matching, clustering, language
modeling, …
![Page 5: Distance functions and IE](https://reader035.vdocuments.net/reader035/viewer/2022062323/5681679d550346895ddce362/html5/thumbnails/5.jpg)
Motivation
• Q: What does this have to do with IE?• A: Quite a lot, actually...
Webfind (Monge & Elkan, 1995)
![Page 6: Distance functions and IE](https://reader035.vdocuments.net/reader035/viewer/2022062323/5681679d550346895ddce362/html5/thumbnails/6.jpg)
Finding a technical paper c. 1995
• Start with citation: " Experience With a Learning Personal Assistant", T.M. Mitchell, R. Caruana, D. Freitag, J. McDermott, and D. Zabowski, Communications of the ACM, Vol. 37, No. 7, pp. 81-91, July 1994.
• Find author’s institution (w/ INSPEC)• Find web host (w/ NETFIND)• Find author’s home page and (hopefully) the paper by browsing
![Page 7: Distance functions and IE](https://reader035.vdocuments.net/reader035/viewer/2022062323/5681679d550346895ddce362/html5/thumbnails/7.jpg)
Automatically finding a technical paper c. 1995 with WebFind
• Start with citation: " Experience With a Learning Personal Assistant", T.M. Mitchell, R. Caruana, D. Freitag, J. McDermott, and D. Zabowski, Communications of the ACM, Vol. 37, No. 7, pp. 81-91, July 1994.
• Find author’s institution (w/ automated search against INSPEC)
• Find web host (w/ auto search on NETFIND)• Find author’s home page and (hopefully) the paper by
heuristic spidering
![Page 8: Distance functions and IE](https://reader035.vdocuments.net/reader035/viewer/2022062323/5681679d550346895ddce362/html5/thumbnails/8.jpg)
The data integration problem
![Page 9: Distance functions and IE](https://reader035.vdocuments.net/reader035/viewer/2022062323/5681679d550346895ddce362/html5/thumbnails/9.jpg)
The data integration problem
• Control flow (modulo details about querying– Extract (author, department) pairs from DB1– Extract (department ,www server) pairs from DB2 – Execute the two-step plan to get paper:
• author -> department -> wwwServer– two steps means matching (linking, integrating,
deduping, ....) department names in DB1/DB2– issues are completely different if user is executing a
one-step plan:• one-step plan is retrieval
![Page 10: Distance functions and IE](https://reader035.vdocuments.net/reader035/viewer/2022062323/5681679d550346895ddce362/html5/thumbnails/10.jpg)
String distance metrics: Levenshtein
• Edit-distance metrics– Distance is shortest sequence of edit
commands that transform s to t.– Simplest set of operations:
• Copy character from s over to t• Delete a character in s (cost 1)• Insert a character in t (cost 1)• Substitute one character for another (cost 1)
– This is “Levenshtein distance”
![Page 11: Distance functions and IE](https://reader035.vdocuments.net/reader035/viewer/2022062323/5681679d550346895ddce362/html5/thumbnails/11.jpg)
Levenshtein distance - example
• distance(“William Cohen”, “Willliam Cohon”)
W I L L I A M _ C O H E N
W I L L L I A M _ C O H O NC C C C I C C C C C C C S C
0 0 0 0 1 1 1 1 1 1 1 1 2 2
s
t
op
cost
alignment
![Page 12: Distance functions and IE](https://reader035.vdocuments.net/reader035/viewer/2022062323/5681679d550346895ddce362/html5/thumbnails/12.jpg)
Levenshtein distance - example
• distance(“William Cohen”, “Willliam Cohon”)
W I L L I A M _ C O H E N
W I L L L I A M _ C O H O NC C C C I C C C C C C C S C
0 0 0 0 1 1 1 1 1 1 1 1 2 2
s
t
op
cost
alignment
gap
![Page 13: Distance functions and IE](https://reader035.vdocuments.net/reader035/viewer/2022062323/5681679d550346895ddce362/html5/thumbnails/13.jpg)
Computing Levenshtein distance - 1 D(i,j) = score of best alignment from s1..si to t1..tj
= min
D(i-1,j-1), if si=tj //copyD(i-1,j-1)+1, if si!=tj //substituteD(i-1,j)+1 //insertD(i,j-1)+1 //delete
![Page 14: Distance functions and IE](https://reader035.vdocuments.net/reader035/viewer/2022062323/5681679d550346895ddce362/html5/thumbnails/14.jpg)
Computing Levenshtein distance - 2D(i,j) = score of best alignment from s1..si to t1..tj
= minD(i-1,j-1) + d(si,tj) //subst/copyD(i-1,j)+1 //insertD(i,j-1)+1 //delete
(simplify by letting d(c,d)=0 if c=d, 1 else)
also let D(i,0)=i (for i inserts) and D(0,j)=j
![Page 15: Distance functions and IE](https://reader035.vdocuments.net/reader035/viewer/2022062323/5681679d550346895ddce362/html5/thumbnails/15.jpg)
Computing Levenshtein distance - 3
D(i,j)= minD(i-1,j-1) + d(si,tj) //subst/copyD(i-1,j)+1 //insertD(i,j-1)+1 //delete
C O H E NM 1 2 3 4 5C 1 2 3 4 5C 2 2 3 4 5O 3 2 3 4 5H 4 3 2 3 4N 5 4 3 3 3 = D(s,t)
![Page 16: Distance functions and IE](https://reader035.vdocuments.net/reader035/viewer/2022062323/5681679d550346895ddce362/html5/thumbnails/16.jpg)
Computing Levenshtein distance – 4
D(i,j) = minD(i-1,j-1) + d(si,tj) //subst/copyD(i-1,j)+1 //insertD(i,j-1)+1 //delete
C O H E NM 1 2 3 4 5
C 1 2 3 4 5
C 2 3 3 4 5
O 3 2 3 4 5
H 4 3 2 3 4
N 5 4 3 3 3
A trace indicates where the min value came from, and can be used to find edit operations and/or a best alignment (may be more than 1)
![Page 17: Distance functions and IE](https://reader035.vdocuments.net/reader035/viewer/2022062323/5681679d550346895ddce362/html5/thumbnails/17.jpg)
Needleman-Wunch distance
D(i,j) = minD(i-1,j-1) + d(si,tj) //subst/copyD(i-1,j) + G //insertD(i,j-1) + G //delete
d(c,d) is an arbitrary distance function on
characters (e.g. related to typo frequencies,
amino acid substitutibility, etc)
William Cohen
Wukkuan Cigeb
G = “gap cost”
![Page 18: Distance functions and IE](https://reader035.vdocuments.net/reader035/viewer/2022062323/5681679d550346895ddce362/html5/thumbnails/18.jpg)
Smith-Waterman distance - 1
D(i,j) = max
0 //start overD(i-1,j-1) - d(si,tj) //subst/copyD(i-1,j) - G //insertD(i,j-1) - G //delete
Distance is maximum over all i,j in table of D(i,j)
![Page 19: Distance functions and IE](https://reader035.vdocuments.net/reader035/viewer/2022062323/5681679d550346895ddce362/html5/thumbnails/19.jpg)
Smith-Waterman distance - 2
D(i,j) = max
0 //start overD(i-1,j-1) - d(si,tj) //subst/copyD(i-1,j) - G //insertD(i,j-1) - G //delete
G = 1
d(c,c) = -2
d(c,d) = +1
C O H E NM -1 -2 -3 -4 -5
C +1 0 -1 -2 -3
C 0 0 -1 -2 -3
O -1 +2 +1 0 -1
H -2 +1 +4 +3 +2
N -3 0 +3 +3 +5
![Page 20: Distance functions and IE](https://reader035.vdocuments.net/reader035/viewer/2022062323/5681679d550346895ddce362/html5/thumbnails/20.jpg)
Smith-Waterman distance - 3
D(i,j) = max
0 //start overD(i-1,j-1) - d(si,tj) //subst/copyD(i-1,j) - G //insertD(i,j-1) - G //delete
G = 1
d(c,c) = -2
d(c,d) = +1
C O H E NM 0 0 0 0 0
C +1 0 0 0 0
C 0 0 0 0 0
O 0 +2 +1 0 0
H 0 +1 +4 +3 +2
N 0 0 +3 +3 +5
![Page 21: Distance functions and IE](https://reader035.vdocuments.net/reader035/viewer/2022062323/5681679d550346895ddce362/html5/thumbnails/21.jpg)
Smith-Waterman distance: Monge & Elkan’s WEBFIND (1996)
![Page 22: Distance functions and IE](https://reader035.vdocuments.net/reader035/viewer/2022062323/5681679d550346895ddce362/html5/thumbnails/22.jpg)
Smith-Waterman distance in Monge & Elkan’s WEBFIND (1996)
Used a standard version of Smith-Waterman with hand-tuned weights for inserts and character substitutions.
Split large text fields by separators like commas, etc, and explore different pairings (since S-W assigns a large cost to large transpositions)
Result competitive with plausible competitors.
![Page 23: Distance functions and IE](https://reader035.vdocuments.net/reader035/viewer/2022062323/5681679d550346895ddce362/html5/thumbnails/23.jpg)
Smith-Waterman distance in Monge & Elkan’s WEBFIND (1996)
• String s=A1 A2 ... AK, string t=B1 B2 ... BL
• sim’ is editDistance scaled to [0,1]
• Monge-Elkan’s “recursive matching scheme” is average maximal similarity of Ai to Bj:
![Page 24: Distance functions and IE](https://reader035.vdocuments.net/reader035/viewer/2022062323/5681679d550346895ddce362/html5/thumbnails/24.jpg)
Smith-Waterman distance: Monge & Elkan’s WEBFIND (1996)
computer science department
stanford university
palo alto
california
Dept. of Comput. Sci.
Stanford Univ.
CA
USA
0.51
0.92
0.5
1.0
![Page 25: Distance functions and IE](https://reader035.vdocuments.net/reader035/viewer/2022062323/5681679d550346895ddce362/html5/thumbnails/25.jpg)
![Page 26: Distance functions and IE](https://reader035.vdocuments.net/reader035/viewer/2022062323/5681679d550346895ddce362/html5/thumbnails/26.jpg)
![Page 27: Distance functions and IE](https://reader035.vdocuments.net/reader035/viewer/2022062323/5681679d550346895ddce362/html5/thumbnails/27.jpg)
![Page 28: Distance functions and IE](https://reader035.vdocuments.net/reader035/viewer/2022062323/5681679d550346895ddce362/html5/thumbnails/28.jpg)
Results: S-W from Monge & Elkan
![Page 29: Distance functions and IE](https://reader035.vdocuments.net/reader035/viewer/2022062323/5681679d550346895ddce362/html5/thumbnails/29.jpg)
Affine gap distances
• Smith-Waterman fails on some pairs that seem quite similar:
William W. Cohen
William W. ‘Don’t call me Dubya’ Cohen
Intuitively, a single long insertion is “cheaper” than a lot of short insertions
Intuitively, are springlest hulongru poinstertimon extisn’t “cheaper” than a lot of short insertions
![Page 30: Distance functions and IE](https://reader035.vdocuments.net/reader035/viewer/2022062323/5681679d550346895ddce362/html5/thumbnails/30.jpg)
Affine gap distances - 2
• Idea: – Current cost of a “gap” of n characters: nG– Make this cost: A + (n-1)B, where A is cost of
“opening” a gap, and B is cost of “continuing” a gap.
![Page 31: Distance functions and IE](https://reader035.vdocuments.net/reader035/viewer/2022062323/5681679d550346895ddce362/html5/thumbnails/31.jpg)
Affine gap distances - 3
D(i,j) = maxD(i-1,j-1) + d(si,tj) //subst/copyD(i-1,j)-1 //insertD(i,j-1)-1 //delete
IS(i,j) = max D(i-1,j) - AIS(i-1,j) - B
IT(i,j) = max D(i,j-1) - AIT(i,j-1) - B
Best score in which si is aligned with a ‘gap’
Best score in which tj is aligned with a ‘gap’
D(i-1,j-1) + d(si,tj)
IS(I-1,j-1) + d(si,tj)
IT(I-1,j-1) + d(si,tj)
![Page 32: Distance functions and IE](https://reader035.vdocuments.net/reader035/viewer/2022062323/5681679d550346895ddce362/html5/thumbnails/32.jpg)
Affine gap distances - 4
-B
-B
-d(si,tj) D
IS
IT-d(si,tj)
-d(si,tj)
-A
-A
![Page 33: Distance functions and IE](https://reader035.vdocuments.net/reader035/viewer/2022062323/5681679d550346895ddce362/html5/thumbnails/33.jpg)
Affine gap distances – experiments (from McCallum,Nigam,Ungar KDD2000)
• Goal is to match data like this:
![Page 34: Distance functions and IE](https://reader035.vdocuments.net/reader035/viewer/2022062323/5681679d550346895ddce362/html5/thumbnails/34.jpg)
Affine gap distances – experiments (from McCallum,Nigam,Ungar KDD2000)
• Hand-tuned edit distance• Lower costs for affine gaps• Even lower cost for affine gaps near a “.”• HMM-based normalization to group title,
author, booktitle, etc into fields
![Page 35: Distance functions and IE](https://reader035.vdocuments.net/reader035/viewer/2022062323/5681679d550346895ddce362/html5/thumbnails/35.jpg)
Affine gap distances – experiments
TFIDF Edit Distance
Adaptive
Cora 0.751 0.839 0.945
0.721 0.964
OrgName1 0.925 0.633 0.9230.366 0.950 0.776
Orgname2 0.958 0.571 0.9580.778 0.912 0.984
Restaurant 0.981 0.827 1.0000.967 0.867 0.950
Parks 0.976 0.967 0.9840.967 0.967 0.967