
Assessing the APE-INV benchmark

Andrea Maurino
DISCo - Dip. di Informatica, Sistemistica e Comunicazione
Università di Milano Bicocca
viale Sarca 336/14, 20124, Milano (Italy)

Index

• The Benchmark
• The Methodology
• Address verification
• Deduplication
• Schema Aware Deduplication
• Ongoing work

••• ITIS Lab ••• http://www.itis.disco.unimib.it ••• 2

BENCHMARK

Benchmark

• 1997 tuples
• French inventors only
• October 2009 version


The methodology

Postal address verification

Deduplication

Schema aware deduplication

Postal Address Verification

• Goal
  • To assess addresses in the RAW table
  • To provide a standardized version of the addresses
• Input
  • Raw table
• Output
  • CorrectedRaw_langLat table
• Tool
  • AST version 1
  • AST version 2

Postal Address Verification

• The problem: are the following addresses correct?
  “THOMSON-CSF - SCPI 173, bld Hausmann 75360 Paris Cedex 08”
  “9 avenue Saint Jacques 91600 Savigny Sur Org”
  • For the postman?
  • According to the official list of addresses?
  • According to the clean_address table?
• Answer
  « AVENUE SAINT JACQUES 9 91600 SAVIGNY SUR ORGE ESSONNE ILE-DE-FRANCE FR »
  « BOULEVARD HAUSSMANN 173 75379 PARIS PARIS ILE-DE-FRANCE FR »

Postal Address Verification

• Syntactic accuracy is the closeness of a value v to the elements of the corresponding definition domain D
• Who stores the domain D (all possible French addresses)?
  • French post office
  • Google Maps
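The definition of syntactic accuracy above can be illustrated with a small sketch. It assumes that "closeness" is measured with a normalized edit distance and that the domain D is a tiny in-memory list; both are assumptions made only for illustration, since the slides delegate D to an external service.

```python
def edit_distance(a, b):
    """Levenshtein distance computed with two rows of the DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute
        prev = curr
    return prev[-1]


def syntactic_accuracy(value, domain):
    """Return the closest element of the domain and a closeness in [0, 1]."""
    def closeness(candidate):
        longest = max(len(value), len(candidate)) or 1
        return 1.0 - edit_distance(value, candidate) / longest

    best = max(domain, key=closeness)
    return best, closeness(best)


# Hypothetical toy domain D; a real domain would hold all valid values.
DOMAIN = ["SAVIGNY SUR ORGE", "PARIS", "MILANO"]
best, score = syntactic_accuracy("SAVIGNY SUR ORG", DOMAIN)
```

A value that belongs to D gets closeness 1.0; the truncated town name from the previous slide is one edit away from its closest domain element.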

AST 1.0

• AST (AddresS Tool) 1.0 exploits the Google API to
  • assess whether an address is correct
  • produce a standardized version of the address
• 15000 requests per IP per day (for free)
• 5 seconds between two requests

[Diagram: RAW Table, AST 1.0 server (queuing), three AST 1.0 clients, Corrected RAW Table]
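The two limits above (15000 requests per IP per day, 5 seconds between requests) suggest why the work is spread over several clients. The following is a minimal sketch of such a schedule, assuming that each client has its own IP and that the server hands out addresses round-robin; the function name and the round-robin policy are hypothetical and are not taken from the slides.

```python
MIN_GAP_S = 5                    # from the slide: 5 seconds between two requests
DAILY_QUOTA = 15000              # from the slide: free requests per IP per day
SECONDS_PER_DAY = 24 * 60 * 60


def schedule(addresses, n_clients):
    """Assign each address to a client and a send time (seconds from start).

    Assumption: one IP per client, addresses dealt out round-robin.
    """
    plan = []
    for i, address in enumerate(addresses):
        client = i % n_clients
        k = i // n_clients                   # k-th request of this client
        day, slot = divmod(k, DAILY_QUOTA)   # move to the next day once the quota is used
        plan.append((address, client, day * SECONDS_PER_DAY + slot * MIN_GAP_S))
    return plan


# The benchmark holds 1997 tuples; the diagram shows three clients.
plan = schedule([f"address-{i}" for i in range(1997)], n_clients=3)
last_send_time = plan[-1][2]
```

Under these assumptions the last of the 1997 requests is sent 3325 seconds after the start, against 9980 seconds with a single client.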

AST 1.0

An address a is accurate if the application of AST 1.0 to it returns exactly one answer with an accuracy level 6.
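The accuracy rule can be written as a small predicate. This is a sketch only: the `Candidate` record and its field names are hypothetical stand-ins for whatever the geocoding service returns.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """One answer returned for an address (hypothetical field names)."""
    standardized: str
    accuracy_level: int


REQUIRED_LEVEL = 6  # accuracy level required by the definition above


def is_accurate(candidates):
    """True only when there is exactly one answer and it has the required level."""
    return len(candidates) == 1 and candidates[0].accuracy_level == REQUIRED_LEVEL
```

An address with no answer, with several candidate answers, or with a single answer at another level is therefore counted as non-accurate.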


AST 1.0

• According to AST, the Raw table contains 78.6% accurate addresses (1569 addresses)
• To evaluate the quality of these results, we manually compared the standardized versions of the accurate addresses with the clean address table.
• The results are the following:

          Values   % Raw table   % AST results
True +    1412     70.7%         90%
False +   157      8%            10%
False -   209      10.5%
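The two percentage columns use different denominators, which the short computation below makes explicit. It assumes that "% Raw table" is taken over the 1997 benchmark tuples and "% AST results" over the 1569 addresses that AST 1.0 flagged as accurate; this reading is consistent with the figures but is not stated on the slide.

```python
TOTAL_TUPLES = 1997                      # benchmark size
TRUE_POS, FALSE_POS, FALSE_NEG = 1412, 157, 209

flagged_accurate = TRUE_POS + FALSE_POS  # addresses AST 1.0 judged accurate


def pct(part, whole):
    """Percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


share_of_raw = {
    "flagged": pct(flagged_accurate, TOTAL_TUPLES),
    "true+": pct(TRUE_POS, TOTAL_TUPLES),
    "false+": pct(FALSE_POS, TOTAL_TUPLES),
    "false-": pct(FALSE_NEG, TOTAL_TUPLES),
}
share_of_ast = {
    "true+": pct(TRUE_POS, flagged_accurate),
    "false+": pct(FALSE_POS, flagged_accurate),
}
```

With these denominators the computed shares match the table up to rounding (the false-positive share of the raw table comes out as 7.9%, shown as 8% on the slide).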

AST 2.0

• Version 2.0 uses the Web version of the Google API and exploits the “did you mean” feature to improve results
• An address a is accurate if the application of the AST 2.0 tool to it returns exactly one answer and no “did you mean” sentence is present in the answer page.

AST 2.0

[Diagram: RAW Table, AST 2.0 server (queuing), HTTP request, HTML page, AST 2.0 wrapper, Corrected RAW Table]

AST 2.0

• No limit on the number of requests per IP or on the time between two requests
• The Raw table contains 64.7% accurate addresses (1292 addresses)
• The results are the following:

          Values   % Raw table   % AST results
True +    1239     62%           96%
False +   51       2.4%          4%
False -   382      19.1%

Comments

• The results produced by AST 1.0 can be further improved by applying AST 2.0 to the non-accurate addresses identified by AST 1.0
• The coverage of addresses shown by Google Maps is good, but
  • for small towns there is no information
  • historical data are not available (Stalingrad - Volgograd)
• The benchmark could also include latitude and longitude information
  • How to measure this kind of precision?
• AST 1.0 + AST 2.0 is a good way to easily (and freely) assess postal addresses for the majority of countries in the world.
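The combination of the two tools described in the first comment can be sketched as a simple cascade. The two verifier callables are hypothetical: each is assumed to return a standardized address when it judges the input accurate and `None` otherwise.

```python
def cascade(addresses, verify_v1, verify_v2):
    """Apply the first verifier to every address, the second only to its rejects."""
    accurate, rejected = {}, []
    for address in addresses:
        standardized = verify_v1(address)
        if standardized is None:
            # Only addresses rejected by the first tool reach the second one.
            standardized = verify_v2(address)
        if standardized is None:
            rejected.append(address)
        else:
            accurate[address] = standardized
    return accurate, rejected


# Stub verifiers standing in for the two AST versions (illustration only).
stub_v1 = {"a": "A"}.get
stub_v2 = {"b": "B"}.get
accurate, rejected = cascade(["a", "b", "c"], stub_v1, stub_v2)
```

The order matters for cost: the second tool is queried only for the remainder left by the first one.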

Final result for Postal Address Verification

          Values   % Raw table   % AST results
True +    1481     74.2%         94%
False +   157      8%            6%
False -   209      10.5%

DEDUPLICATION

Deduplication

• Space reduction
  • Sorted neighborhood method
• Comparison functions
  • Edit distance, Jaccard …
• Decision model
  • Fellegi-Sunter
• Tool
  • Fril
  • Febrl
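The space-reduction step can be illustrated with a minimal sketch of the sorted neighborhood method: records are sorted on a key and only records that fall inside the same sliding window are compared. The record layout and the sorting key below are assumptions chosen for illustration, not the configuration used in the experiments.

```python
def sorted_neighborhood_pairs(records, key, window):
    """Yield candidate pairs of record ids that share a sliding window.

    records: dict mapping record id -> record
    key:     function that builds the sorting key from a record
    window:  number of consecutive sorted records compared with each other
    """
    ordered = sorted(records, key=lambda rid: key(records[rid]))
    for i, left in enumerate(ordered):
        for right in ordered[i + 1 : i + window]:
            yield left, right


# Hypothetical toy records: id -> inventor name.
RECORDS = {
    1: "ABELE, MANLIO",
    2: "Abele, Manlio",
    3: "Aberman, Harold",
    4: "Maurino, Andrea",
    5: "ABERMAN, Harold M.",
}
candidate_pairs = list(sorted_neighborhood_pairs(RECORDS, key=str.lower, window=2))
```

With a window of 2 only four of the ten possible pairs are compared; a larger window trades more comparisons for a lower risk of missing a true match.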

Febrl

• Febrl (Freely Extensible Biomedical Record Linkage) is a freeware, Python-based tool for data standardisation and probabilistic record linkage.

Febrl

• Environment: PowerEdge R710 with two Intel Xeon X5550 processors (2.66 GHz, 8 MB cache), 16 GB memory, 4 HDs of 450 GB each (SAS, 15,000 rpm)
• Database: MySQL
• Index: SNM, window size = 10
• Comparison functions
  • Person_name: edit distance (threshold 0.45)
  • Person_address: longest common substring (threshold 0.45)
• Decision model
  • Fellegi-Sunter
    • score < 0.45: non-match
    • 0.45 <= score <= 0.75: possible match
    • score > 0.75: match
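The three score bands of the decision model can be sketched as follows, reading the lowest band as a non-match. The longest-common-substring similarity shown here is a simplified stand-in built on Python's `difflib`, not Febrl's own comparison function, and how field similarities are combined into a single score is left out because the slides do not specify it.

```python
from difflib import SequenceMatcher

LOWER, UPPER = 0.45, 0.75  # thresholds from the slide


def lcs_similarity(a, b):
    """Length of the longest common substring divided by the longer length."""
    if not a or not b:
        return 0.0
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return match.size / max(len(a), len(b))


def classify(score):
    """Map a similarity score to one of the three decision classes."""
    if score < LOWER:
        return "non-match"
    if score <= UPPER:
        return "possible match"
    return "match"
```

Pairs in the middle band are the ones that would normally be passed to a human reviewer.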

Febrl

Table: CorrectedRaw (time: 671000 ms)

                       Values   % found matching pairs   % real matching pairs
Matching pairs found   8537     100
True +                 8074     94.58                    92.83
False +                463      5.42
False -                624                               7.17

Table: Raw (time: 44000 ms)

                       Values   % found matching pairs   % real matching pairs
Matching pairs found   266415   100
True +                 8698     3.26                     100
False +                257717   96.74
False -                0                                 0
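The two percentage columns of the "True +" rows correspond to what is usually called precision (share of found pairs that are real) and recall (share of real pairs that are found). The sketch below recomputes them from the counts in the tables; the terms precision and recall are this note's labels, not the slide's.

```python
def precision_recall(true_pos, false_pos, false_neg):
    """Return (precision, recall) as percentages rounded to two decimals."""
    precision = 100 * true_pos / (true_pos + false_pos)
    recall = 100 * true_pos / (true_pos + false_neg)
    return round(precision, 2), round(recall, 2)


# Counts taken from the two Febrl tables.
febrl_corrected = precision_recall(8074, 463, 624)
febrl_raw = precision_recall(8698, 257717, 0)
```

On the Raw table every real pair is found, but only about 3% of the reported pairs are real; on CorrectedRaw both figures are above 92%.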

Fril

• FRIL (Fine-grained Record Integration and Linkage)
  • Java based
  • Two search methods: nested loop join (NLJ) and the sorted neighborhood method (SNM)
  • Comparison functions: edit distance, Soundex, Q-gram, and equality
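As an illustration of a q-gram comparison function, the sketch below splits each string into overlapping q-grams and compares the two sets with the Jaccard coefficient. This is one common formulation and an assumption of this note; it is not claimed to be the exact formula implemented in FRIL.

```python
def qgrams(text, q=2):
    """Overlapping q-grams of the lower-cased text, padded with '#'."""
    padded = "#" * (q - 1) + text.lower() + "#" * (q - 1)
    return [padded[i : i + q] for i in range(len(padded) - q + 1)]


def qgram_similarity(a, b, q=2):
    """Jaccard coefficient of the two q-gram sets, in [0, 1]."""
    grams_a, grams_b = set(qgrams(a, q)), set(qgrams(b, q))
    return len(grams_a & grams_b) / len(grams_a | grams_b)
```

Unlike edit distance, this measure is insensitive to where in the string a difference occurs, which can help with reordered name parts.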

Fril

• Environment: HP Pavilion DV6-2125EL with one Intel Core i3-330 (2.13 GHz), 4 GB RAM, one 500 GB SATA HD (7200 rpm)
• DB server hosted on the PowerEdge R710
• Search index: SNM
• Comparison functions:
  • Person_name: edit distance (threshold = 0.3)
  • Address: edit distance (threshold = 0.8) (very high threshold!)
• Decision model
  • Fellegi-Sunter

Fril

Table: CorrectedRaw (time: 8026 ms)

                       Values   % found matching pairs   % real matching pairs
Matching pairs found   4199     100
True +                 3881     92.43                    44.62
False +                318      7.57
False -                4817                              55.38

Table: Raw (time: 8453 ms)

                       Values   % found matching pairs   % real matching pairs
Matching pairs found   5252     100
True +                 2664     50.72                    30.63
False +                2588     29.75
False -                6034                              69.37

Evaluation

• Febrl is much more time-consuming than Fril (it does not use threads)
• Febrl does not accept DB connections
• Fril is fast, with good usability, but not so precise
• Thresholds are sometimes too high (probably the benchmark does not include much noise)
• Using a cleaned version of the raw table significantly increases the precision of the results

SCHEMA AWARE DEDUPLICATION

Schema aware deduplication

• Data do not live alone
  • The Inventor table is one of the tables of PATSTAT
  • More information can be used for deduplication
• In the literature
  • Group Linkage (a.k.a. Group ER)
  • Inter-relationship Deduplication
• We introduce an approach
  • Domain independent
  • Exploiting context information via schema analysis
  • Covering multiple types of record linkage:
    • scattered information
    • dirty data

Some results

Table: CorrectedRaw

                            Values   % found matching pairs
Considered possible pairs   1851     100
True +                      242      13.1
True -                      46       2.5
False +                     0        0
False -                     281      15.2
Possible match              80       4.3
Cannot tell                 1202     64.9

• Some preliminary results
  • Not so good, but the level of interconnectivity is low
  • On average, 2 co-inventors for each inventor
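The slides do not describe the schema-aware approach in detail, so the following is only an illustration of how context reachable through the schema could add evidence, and of why low interconnectivity leaves many pairs undecided. It assumes, hypothetically, that the context of an inventor record is the set of its co-inventors and that two contexts are compared with the Jaccard coefficient.

```python
def context_evidence(coinventors_a, coinventors_b):
    """Overlap between two co-inventor sets, or None when there is no context.

    Returns a value in [0, 1], or None when either record has no co-inventors,
    in which case the context cannot say anything about the pair.
    """
    if not coinventors_a or not coinventors_b:
        return None
    shared = coinventors_a & coinventors_b
    return len(shared) / len(coinventors_a | coinventors_b)


# Hypothetical co-inventor sets for three candidate pairs.
strong = context_evidence({"X", "Y"}, {"X", "Y"})
partial = context_evidence({"X", "Y"}, {"Y", "Z"})
undecided = context_evidence(set(), {"X"})
```

With only about two co-inventors per inventor, the sets are small and often disjoint or empty, so an evidence function of this kind would frequently return no usable signal.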

ONGOING WORK

Toward a new record linkage

• “Panta rei” (Heraclitus): everything flows, everything is constantly changing.
• Databases may keep track of these never-ending changes
• Examples
  • People change names
    • Xin Dong → Xin Luna Dong
  • People change jobs
    • Halevy moves from Univ. of Washington to Google
  • Nations change
    • YUGOSLAVIA → Serbia-Montenegro → Serbia → Kosovo

An example

person_id   person_name         person_address                               appln_filing_date
110670      ABELE, MANLIO, G.   5 EAST 22ND STREET;NEW YORK, NY 10016        18/10/1990
110670      ABELE, MANLIO, G.   5 EAST 22ND STREET;NEW YORK, NY 10016        06/04/1992
110671      ABELE, MANLIO, G.   5 EAST 22ND STREET, 205;NEW YORK, NY 10010   20/02/1991
110672      ABELE, Manlio, G.   5 East 22nd Street, 205,New York, NY 10010   20/02/1991
110674      ABELE, Manlio, G.   5 East 22nd Street,New York, NY 10016        18/10/1990
110674      ABELE, Manlio, G.   5 East 22nd Street,New York, NY 10016        06/04/1992
110674      ABELE, Manlio, G.   5 East 22nd Street,New York, NY 10016        12/04/1995
110674      ABELE, Manlio, G.   5 East 22nd Street,New York, NY 10016        12/03/1996
110675      Abele, Manlio       250 East 54th St.,New York, NY 10022         19/03/2004

Another example

person_id   person_name          person_address                     appln_filing_date
113870      Aberman,Harold M.    Montclair                          15/10/1997
113872      Aberman,Harold       Montclair                          02/09/1999
113870      Aberman,Harold M.    Montclair                          14/03/2000
113877      ABERMAN, Harold M.   14 Dewberry Way,Irvine, CA 92612   13/03/2001
113869      Aberman,Harold M.    Irvine                             28/02/2001
113869      Aberman,Harold M.    Irvine                             28/08/2003

Temporal Record Linkage

• Temporal record linkage is a new research area concerned with discovering whether two records represent the same real-world object described at two different timestamps
• Work done in collaboration with AT&T Research Labs
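One way to picture the role of timestamps is to let the weight of a disagreement shrink as the time gap between two records grows, since an address that differs after many years is weak evidence against a match. The sketch below is purely illustrative: the exponential form, the half-life value and the function names are assumptions of this note and are not the method developed in the work mentioned above.

```python
from datetime import datetime


def gap_years(date_a, date_b):
    """Absolute gap in years between two dd/mm/yyyy filing dates."""
    parsed_a = datetime.strptime(date_a, "%d/%m/%Y")
    parsed_b = datetime.strptime(date_b, "%d/%m/%Y")
    return abs((parsed_b - parsed_a).days) / 365.25


def disagreement_penalty(gap, base_penalty=1.0, half_life_years=5.0):
    """Penalty for a differing value, halved every half_life_years (assumed)."""
    return base_penalty * 0.5 ** (gap / half_life_years)


# Two filing dates in the dd/mm/yyyy format used by the example tables.
gap = gap_years("18/10/1990", "19/03/2004")
penalty = disagreement_penalty(gap)
```

Two records filed on the same day keep the full penalty for a differing address, while records filed more than a decade apart are penalized far less.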

Preliminary results

• The results are under publication, so, sorry, we can provide only some preliminary results
• PEI ADDS data from early binding results