Assessing the APE-INV benchmark
Andrea Maurino
DISCo - Dip. di Informatica, Sistemistica e Comunicazione
Università di Milano Bicocca
viale Sarca 336/14, 20124, Milano (Italy)
Index
• The Benchmark
• The Methodology
• Address verification
• Deduplication
• Schema Aware Deduplication
• Ongoing work
••• ITIS Lab ••• http://www.itis.disco.unimib.it ••• 2
BENCHMARK
Benchmark
• 1997 tuples
• French inventors only
• October 2009 version
The methodology
Postal address verification
Deduplication
Schema aware deduplication
Postal Address Verification
• Goal
  • To assess addresses in the RAW table
  • To provide a standardized version of the addresses
• Input
  • Raw table
• Output
  • CorrectedRaw_langLat table
• Tool
  • AST version 1
  • AST version 2
Postal Address Verification
• The problem: are the following addresses correct?
  “THOMSON-CSF - SCPI 173, bld Hausmann 75360 Paris Cedex 08”
  “9 avenue Saint Jacques 91600 Savigny Sur Org”
  • For the postman?
  • According to the official list of addresses?
  • According to the clean_address table?
• Answer
  « AVENUE SAINT JACQUES 9 91600 SAVIGNY SUR ORGE ESSONNE ILE-DE-FRANCE FR »
  « BOULEVARD HAUSSMANN 173 75379 PARIS PARIS ILE-DE-FRANCE FR »
Postal Address Verification
• Syntactic accuracy is the closeness of a value v to the elements of the corresponding definition domain D
• Who stores the domain D (all possible French addresses)?
  • French post office
  • Google Maps
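The definition of syntactic accuracy can be sketched in code: a value v is scored by its distance to the closest element of D. This is a minimal sketch, assuming edit distance as the closeness measure; the two-element `DOMAIN` list and the function names are hypothetical stand-ins, since the real D would be held by the post office or Google Maps.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def closest_in_domain(value: str, domain: list) -> tuple:
    """Return the domain element closest to `value` and its distance (case-insensitive)."""
    best = min(domain, key=lambda d: edit_distance(value.upper(), d.upper()))
    return best, edit_distance(value.upper(), best.upper())


# Hypothetical domain D, for illustration only.
DOMAIN = ["SAVIGNY SUR ORGE", "PARIS"]
match, distance = closest_in_domain("Savigny Sur Org", DOMAIN)
```

A distance of 0 means the value belongs to D; a small non-zero distance suggests a typo that a standardization step could repair.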
AST 1.0
• AST (AddresS Tool) 1.0 uses the Google API to
  • assess whether an address is correct
  • produce a standardized version of the address
• 15,000 requests per IP per day (for free)
• 5 seconds between two requests
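The two limits translate into a lower bound on processing time for the 1997-tuple benchmark. The sketch below is only a back-of-the-envelope estimate: it assumes the 5-second gap is enforced per IP and that each client runs from its own IP, neither of which is stated on the slide, and all names are hypothetical.

```python
import math

ADDRESSES = 1997      # benchmark size
DAILY_QUOTA = 15000   # free requests per IP per day
MIN_GAP_SECONDS = 5   # pause required between two requests


def min_wait_seconds(n_requests: int, n_clients: int = 1) -> int:
    """Lower bound on waiting time when requests are split evenly across clients.

    Assumption: every client has its own IP, so the gap applies per client.
    """
    per_client = math.ceil(n_requests / n_clients)
    return max(per_client - 1, 0) * MIN_GAP_SECONDS


fits_daily_quota = ADDRESSES <= DAILY_QUOTA
single_client = min_wait_seconds(ADDRESSES)      # about 2.8 hours of waiting
three_clients = min_wait_seconds(ADDRESSES, 3)   # under one hour
```

Under these assumptions the daily quota is not the bottleneck for this benchmark; the pause between requests is.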
AST 1.0
[Architecture diagram: RAW Table, AST1.0 server with request queuing, several AST1.0 clients, Corrected RAW Table]
AST 1.0
• An address a is accurate if applying AST 1.0 to it returns exactly one answer with accuracy level 6.
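The accuracy rule for AST 1.0 can be written as a small predicate. This is a sketch only: the shape of `answers` (a list of dictionaries with an `"accuracy"` key) is a hypothetical representation of the geocoder response, not the actual API format.

```python
ACCURATE_LEVEL = 6  # accuracy level required by the rule


def is_accurate(answers: list) -> bool:
    """True when there is exactly one answer and its accuracy level is 6.

    `answers` is a hypothetical list of dicts such as {"accuracy": 6, "address": "..."}.
    """
    return len(answers) == 1 and answers[0]["accuracy"] == ACCURATE_LEVEL
```

Zero answers, several candidate answers, or a single answer at a coarser level all make the address non-accurate.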
AST 1.0
• According to AST, the Raw table includes 78.6% accurate data (1569 addresses)
• To evaluate the quality of these results, we manually compared the standardized version of the accurate addresses with the clean address table.
• The results are the following:
         Values  % Raw table  % AST results
True +   1412    70.7%        90%
False +  157     8%           10%
False -  209     10.5%
AST 2.0
• Version 2.0 uses the Web version of the Google API and exploits the “did you mean” feature to improve results
• An address a is accurate if applying AST 2.0 to it returns exactly one answer and no “did you mean” sentence is present in the answer page.
AST 2.0
[Architecture diagram: RAW Table, AST2.0 server with request queuing, AST2.0 wrapper (HTTP request, HTML page), Corrected RAW Table]
AST 2.0
• No limit on the number of requests per IP or on the time between two requests
• The Raw table includes 64.7% accurate data (1292 addresses)
• The results are the following:
         Values  % Raw table  % AST results
True +   1239    62%          96%
False +  51      2.4%         4%
False -  382     19.1%
Comments
• Results produced by AST 1.0 can be further improved by applying AST 2.0 to the non-accurate addresses identified by AST 1.0
• The coverage of addresses shown by Google Maps is good, but
  • For small towns there is no information
  • Historical data are not available (Stalingrad - Volgograd)
• The benchmark could also include latitude and longitude information
  • How to measure this kind of precision?
• AST 1.0 + AST 2.0 is a good way to easily (and freely) assess postal addresses for the majority of countries in the world.
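The combination suggested in the first comment amounts to a two-stage cascade: run the first tool on everything, then the second tool only on what the first one rejects. A minimal sketch follows; `ast1` and `ast2` are hypothetical callables that return a standardized address or `None`, not the real tools' interfaces.

```python
def cascade(addresses, ast1, ast2):
    """Apply ast1 to every address and ast2 only to those ast1 rejects.

    ast1 / ast2: hypothetical callables, address -> standardized string or None.
    Returns (accurate, rejected): a dict of standardized addresses and a list
    of addresses neither tool could confirm.
    """
    accurate, rejected = {}, []
    for address in addresses:
        result = ast1(address)
        if result is None:          # non-accurate for the first tool
            result = ast2(address)  # give the second tool a chance
        if result is None:
            rejected.append(address)
        else:
            accurate[address] = result
    return accurate, rejected
```

Because the second stage only sees the leftovers, the stricter first-stage results are kept untouched.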
Final result for Postal Address Verification
         Values  % Raw table  % AST results
True +   1481    74.2%        94%
False +  157     8%           6%
False -  209     10.5%
DEDUPLICATION
Deduplication
• Space reduction
  • Sorted neighborhood method
• Comparison functions
  • Edit distance, Jaccard …
• Decision model
  • Fellegi-Sunter
• Tool
  • Fril
  • Febrl
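The first two ingredients can be illustrated briefly. The sketch below shows a generic sorted neighborhood pass (sort by a key, compare each record only with the next window - 1 records) and a token-based Jaccard similarity. It is an illustration of the techniques, not the implementation used by Fril or Febrl, and the function names are hypothetical.

```python
def sorted_neighborhood_pairs(records, key, window):
    """Candidate pairs from the sorted neighborhood method.

    Records are sorted by `key`; each record is paired only with the
    `window - 1` records that follow it in sorted order.
    """
    ordered = sorted(records, key=key)
    pairs = []
    for i, left in enumerate(ordered):
        for right in ordered[i + 1 : i + window]:
            pairs.append((left, right))
    return pairs


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the lower-cased word sets of two strings."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)
```

With n records and window w, the number of comparisons grows roughly as n × (w - 1) instead of n × (n - 1) / 2, which is the point of the space reduction step.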
Febrl
• Febrl (Freely Extensible Biomedical Record Linkage) is a free, Python-based tool for data standardisation and probabilistic record linkage.
Febrl
• Environment: PowerEdge R710 with two Intel Xeon X5550 processors (2.66 GHz, 8 MB cache), 16 GB memory, four 450 GB SAS hard disks at 15,000 rpm
• Database: MySQL
• Index: SNM, window size = 10
• Comparison functions
  • Person_name: edit distance (threshold 0.45)
  • Person_address: longest common substring (threshold 0.45)
• Decision model
  • Fellegi-Sunter
    • Score < 0.45: non-match
    • 0.45 <= score <= 0.75: possible match
    • Score > 0.75: match
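The three score bands of the decision model can be expressed as a simple classifier. The sketch assumes that scores below 0.45 denote non-matches (the lowest band), and it only reproduces the thresholding step; it is not Febrl's API, and a full Fellegi-Sunter model would also derive the score from per-field agreement weights.

```python
LOWER, UPPER = 0.45, 0.75  # thresholds from the configuration


def classify(score: float) -> str:
    """Map a similarity score to one of the three decision classes."""
    if score < LOWER:
        return "non-match"       # assumed meaning of the lowest band
    if score <= UPPER:
        return "possible match"  # left for clerical review
    return "match"
```

For example, `classify(0.6)` falls in the middle band and would be flagged as a possible match.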
Febrl
Table: CorrectedRaw
Time: 671000 ms
                      Values  % found matching pairs  % real matching pairs
Matching pairs found  8537    100
True +                8074    94.58                   92.83
False +               463     5.42
False -               624                             7.17

Table: Raw
Time: 44000 ms
                      Values  % found matching pairs  % real matching pairs
Matching pairs found  266415  100
True +                8698    3.26                    100
False +               257717  96.74
False -               0                               0
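The two percentage columns can be recomputed from the raw counts: the share of found pairs that are true matches corresponds to precision, and the share of real pairs that were found corresponds to recall. A minimal sketch, using the Febrl counts reported for the two tables (the function name is hypothetical):

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple:
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Febrl on CorrectedRaw: 8074 true +, 463 false +, 624 false -
p_clean, r_clean = precision_recall(8074, 463, 624)

# Febrl on Raw: 8698 true +, 257717 false +, 0 false -
p_raw, r_raw = precision_recall(8698, 257717, 0)
```

On the Raw table every real pair is found, but almost all found pairs are wrong; on CorrectedRaw a small loss in recall buys a very large gain in precision.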
Fril
• FRIL (Fine-grained Record Integration and Linkage)
• Java based
• Two search methods: nested loop join (NLJ) and the sorted neighborhood method (SNM)
• Comparison functions: edit distance, Soundex, Q-gram, and equality
Fril
• Environment: HP Pavilion DV6-2125EL with one Intel Core i3-330 (2.13 GHz), 4 GB RAM, one 500 GB SATA hard disk at 7200 rpm
• DB server hosted on the PowerEdge R710
• Search index: SNM
• Comparison functions:
  • Person_name: edit distance (threshold = 0.3)
  • Address: edit distance (threshold = 0.8) (very high threshold!)
• Decision model
  • Fellegi-Sunter
Fril
Table: CorrectedRaw
Time: 8026 ms
                      Values  % found matching pairs  % real matching pairs
Matching pairs found  4199    100
True +                3881    92.43                   44.62
False +               318     7.57
False -               4817                            55.38

Table: Raw
Time: 8453 ms
                      Values  % found matching pairs  % real matching pairs
Matching pairs found  5252    100
True +                2664    50.72                   30.63
False +               2588    29.75
False -               6034                            69.37
Evaluation
• Febrl is much more time-consuming than Fril (it does not use threads)
• Febrl does not accept DB connections
• Fril is fast, with good usability, but not so precise
• Thresholds are sometimes too high (probably the benchmark does not include much noise)
• Using a cleaned version of the raw table significantly increases the precision of the results
SCHEMA AWARE DEDUPLICATION
Schema aware deduplication
• Data do not live alone
  • The Inventor table is one of the tables of PATSTAT
  • More information can be used for deduplication
• In the literature
  • Group Linkage (a.k.a. Group ER)
  • Inter-relationship Deduplication
• We introduce an approach
  • Domain independent
  • Exploiting context information via schema analysis
  • Covering multiple types of record linkage:
    • Scattered information
    • Dirty data
Some results
Table: CorrectedRaw
                           Values  % found matching pairs
Considered possible pairs  1851    100
True +                     242     13.1
True -                     46      2.5
False +                    0       0
False -                    281     15.2
Possible match             80      4.3
Cannot tell                1202    64.9
• Some preliminary results
  • Not so good, but the level of interconnectivity is low
  • On average 2 co-inventors for each inventor
ONGOING WORK
Toward a new record linkage
• “Panta rei” (Heraclitus): everything flows, everything is constantly changing.
• Databases may keep track of these never-ending changes
• Examples
  • People change names
    • Xin Dong → Xin Luna Dong
  • People change jobs
    • Halevy moves from Univ. of Washington to Google
  • Nations change
    • Yugoslavia → Serbia-Montenegro → Serbia, Kosovo
An example
person_id  person_name        person_address                              appln_filing_date
110670     ABELE, MANLIO, G.  5 EAST 22ND STREET;NEW YORK, NY 10016       18/10/1990
110670     ABELE, MANLIO, G.  5 EAST 22ND STREET;NEW YORK, NY 10016       06/04/1992
110671     ABELE, MANLIO, G.  5 EAST 22ND STREET, 205;NEW YORK, NY 10010  20/02/1991
110672     ABELE, Manlio, G.  5 East 22nd Street, 205,New York, NY 10010  20/02/1991
110674     ABELE, Manlio, G.  5 East 22nd Street,New York, NY 10016       18/10/1990
110674     ABELE, Manlio, G.  5 East 22nd Street,New York, NY 10016       06/04/1992
110674     ABELE, Manlio, G.  5 East 22nd Street,New York, NY 10016       12/04/1995
110674     ABELE, Manlio, G.  5 East 22nd Street,New York, NY 10016       12/03/1996
110675     Abele, Manlio      250 East 54th St.,New York, NY 10022        19/03/2004
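Several of these rows differ only in letter case and punctuation even though they carry different person_id values. The sketch below makes that visible with a simple normalization (upper-casing and collapsing punctuation); the `normalize` helper is hypothetical and is not part of the approach presented here.

```python
import re


def normalize(text: str) -> str:
    """Upper-case the text and replace every run of non-alphanumerics with one space."""
    return re.sub(r"[^A-Z0-9]+", " ", text.upper()).strip()


# Two rows from the example, with different person_id values.
row_a = (110670, "ABELE, MANLIO, G.", "5 EAST 22ND STREET;NEW YORK, NY 10016")
row_b = (110674, "ABELE, Manlio, G.", "5 East 22nd Street,New York, NY 10016")

same_name = normalize(row_a[1]) == normalize(row_b[1])
same_address = normalize(row_a[2]) == normalize(row_b[2])
```

Rows such as 110675, with a shorter name and a different address filed years later, are not resolved by normalization alone.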
Another example
person_id  person_name         person_address                    appln_filing_date
113870     Aberman,Harold M.   Montclair                         15/10/1997
113872     Aberman,Harold      Montclair                         02/09/1999
113870     Aberman,Harold M.   Montclair                         14/03/2000
113877     ABERMAN, Harold M.  14 Dewberry Way,Irvine, CA 92612  13/03/2001
113869     Aberman,Harold M.   Irvine                            28/02/2001
113869     Aberman,Harold M.   Irvine                            28/08/2003
Temporal Record Linkage
• Temporal record linkage is a new research area concerned with discovering whether two records represent the same real-world object described at two different timestamps
• Work done in collaboration with AT&T Research Labs
Preliminary results
• The results are under publication, so we can only present some preliminary ones
• PEI ADDS data from early binding results