Rule-Based Method for Entity Resolution IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING JANUARY 2015



Rule-Based Method for Entity Resolution

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING

JANUARY 2015


Introduction

In many applications, a real-world entity may appear in multiple data sources, so the entity may have quite different descriptions. For example, there are several ways to represent a person’s name or a mailing address. Thus, it is necessary to identify the records referring to the same real-world entity, which is called Entity Resolution (ER). ER is one of the most important problems in data cleaning and arises in many applications such as information integration and information retrieval. Because of its importance, it has attracted much attention in the literature.


• Traditional ER approaches
  - Similarity comparison among records.
  - Can’t identify records correctly in some cases.


Observation: The existence and nonexistence of some attribute-value pairs are both useful to identify records.
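
To make the observation concrete, here is a minimal sketch of a rule that mixes a positive clause (a pair must be present) with a negative clause (a pair must be absent). The record layout (a dict mapping attribute names to sets of values), the `Clause` and `Rule` names, and the sample values are all illustrative assumptions, not the paper's notation.

```python
from dataclasses import dataclass

# Hypothetical record layout: attribute name -> set of values.
Record = dict[str, set[str]]


@dataclass(frozen=True)
class Clause:
    attribute: str
    value: str
    negated: bool = False  # True means the pair must NOT appear in the record

    def holds(self, record: Record) -> bool:
        present = self.value in record.get(self.attribute, set())
        return not present if self.negated else present


@dataclass(frozen=True)
class Rule:
    clauses: tuple[Clause, ...]
    entity: str  # entity the rule points to when every clause holds

    def matches(self, record: Record) -> bool:
        return all(clause.holds(record) for clause in self.clauses)


# Illustrative rule: the name is present AND a given coauthor is absent.
rule = Rule(
    clauses=(
        Clause("name", "wei wang"),
        Clause("coauthors", "lin", negated=True),
    ),
    entity="e1",
)

r1 = {"name": {"wei wang"}, "coauthors": {"zhang"}}
r2 = {"name": {"wei wang"}, "coauthors": {"lin", "zhang"}}

print(rule.matches(r1))  # True: name present, "lin" absent
print(rule.matches(r2))  # False: the negative clause fails
```

Both records share the same name, so a comparison on the name alone cannot separate them; the absence of a coauthor is what distinguishes them in this toy example.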


Contribution


Syntax


Semantics


Properties of ER-Rule Set


Algorithm

• Rule Discovery (DiscR) - To get rules from a training data set
• Rule-based entity resolution (R-ER) - To determine which entity each record in the new data set refers to
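
The second step can be sketched as follows. This is a simplified illustration, not the paper's R-ER algorithm: rules are plain predicate functions, the weights are supplied by hand, and assigning each record to the entity with the largest total weight of satisfied rules is an assumption made for the example. All names (`resolve`, `rules`, the sample records) are hypothetical.

```python
from collections import defaultdict
from typing import Callable, Optional

Record = dict[str, set[str]]
# A rule here is (predicate over a record, target entity, weight).
Rule = tuple[Callable[[Record], bool], str, float]


def resolve(record: Record, rules: list[Rule]) -> Optional[str]:
    """Return the entity with the highest total weight of satisfied rules,
    or None when no rule matches (on ties, the entity seen first wins)."""
    scores: dict[str, float] = defaultdict(float)
    for predicate, entity, weight in rules:
        if predicate(record):
            scores[entity] += weight
    if not scores:
        return None
    return max(scores, key=scores.get)


# Hand-written rules standing in for the output of a rule discovery step.
rules: list[Rule] = [
    (lambda r: "zhang" in r.get("coauthors", set()), "e1", 0.8),
    (lambda r: "lin" not in r.get("coauthors", set()), "e1", 0.3),
    (lambda r: "lin" in r.get("coauthors", set()), "e2", 0.9),
]

new_records = [
    {"name": {"wei wang"}, "coauthors": {"zhang"}},
    {"name": {"wei wang"}, "coauthors": {"lin"}},
]

assignments = [resolve(rec, rules) for rec in new_records]
print(assignments)  # ['e1', 'e2']
```

Returning `None` for a record that satisfies no rule keeps unmatched records visible instead of forcing them onto an arbitrary entity.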


Rule Discovery

• Several definitions before the algorithm


Rule Discovery


• Rule requirements


Gen-PR


Gen-SingleNR

First step:


Second step:


Rule-based entity resolution

• we define the weight of each ER-rule r as:


Rule update

• Invalid rules
• Useless rules


Evaluation

• The effectiveness of our rule learning algorithm (DiscR) and our rule-based ER approach
• The impact of training data size on ER accuracy and the number of generated rules
• The impact of rule length threshold on ER accuracy
• The scalability of DiscR and R-ER with the size of data


• Algorithms compared against: GHOST and CFR


Summary

• DiscR and R-ER can achieve high accuracy using a small training data set;
• updating rules does help identify records;
• the number of generated rules scales well with the training data size on both data sets;
• rules with length larger than 2 are seldom needed to identify records;
• both DiscR and R-ER scale well with the size of data.


Thank you!