
Entity Matching across Heterogeneous Sources

[Figure: motivating example. Source 1: Siri's Wiki page, with keywords such as "Siri (Software)", "intelligent personal assistant", "knowledge navigator", "natural language user interface", "iOS", "iPhone", "iPad", "iPod", "voice control", and "Cydia". Source 2: patents, from areas including 1. Electrical computers; 2. Static information; 3. Information storage; 4. Data processing; 5. Active solid-state devices; 6. Computer graphics processing; 7. Molecular biology and microbiology; 8. Semiconductor device manufacturing. Example patents: "Universal interface for retrieval of information in a computer system" (rank candidate, descriptors, ranking module, search engine, relevant area, object), "Method for improving voice recognition" (heuristic algorithms, speech recognition, distribution system, data source, text-to-speech, Apple server), and "Voice menu system" (media, graphical user interface, synchronize, customized processor, host device, database). Shared topics connect the two sources, e.g. "voice control" (0.83), "ranking" (0.54), and "process" (0.11).]

Yang Yang¹, Yizhou Sun², Jie Tang¹, Bo Ma¹, and Juanzi Li¹
¹Department of Computer Science and Technology, Tsinghua University
²College of Computer and Information Science, Northeastern University

{sherlockbourne, mabox}@gmail.com, {jietang, lijuanzi}@tsinghua.edu.cn, [email protected]

Problem: given an entity from a source domain, how do we find its matched entities in the target domain?

Experimental Results

Code & Data: http://aminer.org/document-match (KEG, Tsinghua University)

Product-Patent Matching

Method   P@3    P@20   MAP    R@3    R@20   MRR
CS+LDA   0.111  0.083  0.109  0.011  0.046  0.053
RW+LDA   0.111  0.117  0.123  0.033  0.233  0.429
RTM      0.501  0.233  0.416  0.057  0.141  0.171
RW+CST   0.667  0.167  0.341  0.200  0.333  0.668
CST      0.667  0.250  0.445  0.171  0.457  0.683
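For reference, the ranking metrics in the table above follow their standard definitions; MAP and MRR are the means of average_precision and reciprocal_rank over all query entities. This is a reference sketch in Python, not the authors' evaluation code.

    def precision_at_k(ranked, relevant, k):
        """Fraction of the top-k ranked candidates that are true matches."""
        return sum(1 for c in ranked[:k] if c in relevant) / k

    def recall_at_k(ranked, relevant, k):
        """Fraction of all true matches retrieved within the top k."""
        return sum(1 for c in ranked[:k] if c in relevant) / len(relevant)

    def average_precision(ranked, relevant):
        """Precision averaged over the rank positions of true matches."""
        hits, total = 0, 0.0
        for i, c in enumerate(ranked, start=1):
            if c in relevant:
                hits += 1
                total += hits / i
        return total / len(relevant)

    def reciprocal_rank(ranked, relevant):
        """1 / rank of the first true match (0 if none is retrieved)."""
        for i, c in enumerate(ranked, start=1):
            if c in relevant:
                return 1.0 / i
        return 0.0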

Cross-lingual Matching

Method      Precision  Recall  F1-Measure  F2-Measure
Title Only  1.000      0.410   0.581       0.465
SVM-S       0.957      0.563   0.709       0.613
LFG         0.661      0.820   0.732       0.782
LFG+LDA     0.652      0.805   0.721       0.769
LFG+CST     0.682      0.849   0.757       0.809

Baselines for product-patent matching:
• Content Similarity based on LDA (CS+LDA): cosine similarity between two articles' topic distributions extracted by LDA (a minimal sketch of the scoring step follows this list);
• Random Walk based on LDA (RW+LDA): a random walk on a graph whose edges indicate the topic similarity between articles;
• Relational Topic Model (RTM): models the links between documents;
• Random Walk based on CST (RW+CST): uses CST instead of LDA, for comparison with RW+LDA.
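A minimal sketch of the CS+LDA scoring step, assuming the topic distributions have already been inferred by any LDA implementation (e.g. gensim); variable names are illustrative, not the authors' code:

    import numpy as np

    def cs_lda_rank(source_theta, target_thetas):
        """Rank target entities by cosine similarity between LDA topic
        distributions (the scoring step of the CS+LDA baseline).
        source_theta: (K,) topic distribution of the query entity;
        target_thetas: (N, K) topic distributions of the target entities."""
        s = np.asarray(source_theta, dtype=float)
        t = np.asarray(target_thetas, dtype=float)
        sims = t @ s / (np.linalg.norm(t, axis=1) * np.linalg.norm(s) + 1e-12)
        return np.argsort(-sims)  # target indices, best match first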

Baselines for cross-lingual matching:
• Title Only: considers only the (translated) titles of articles;
• SVM-S: a well-known cross-lingual Wikipedia matching toolkit;
• LFG: mainly considers the structural information of Wiki articles;
• LFG+LDA: adds content features (topic distributions) to LFG via LDA;
• LFG+CST: adds content features to LFG via CST.

Proposed Model

Cross-Source Topic Model (CST)

Baseline methods (CS+LDA, RW+LDA):
1. Learn LDA on the source domain and the target domain separately.
2. Given an entity, rank the others as candidates according to their topic similarity.

Proposed model (CST):
1. Cross-Sampling-Based Entity Generation (sketched below);
2. Matching Relation Generation.
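To make cross-sampling concrete, the following is a hypothetical sketch of the idea, not the paper's exact generative process: when generating a word for an entity that has a matched counterpart in the other source, the topic can be drawn either from the entity's own topic distribution or from its counterpart's, with the mixing governed by the cross-sampling ratio studied in the parameter analysis below. All names here are illustrative.

    import numpy as np

    rng = np.random.default_rng(0)

    def generate_entity_words(theta_self, theta_matched, phi, n_words, ratio=0.5):
        """Hypothetical cross-sampling step: each word's topic z is drawn
        from the matched entity's topic mixture with probability `ratio`,
        otherwise from the entity's own mixture; the word is then drawn
        from the topic-word distribution phi[z]."""
        words = []
        for _ in range(n_words):
            theta = theta_matched if rng.random() < ratio else theta_self
            z = rng.choice(len(theta), p=theta)               # topic assignment
            words.append(rng.choice(phi.shape[1], p=phi[z]))  # word id
        return words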

Paper ID: 548

Parameter Analysis

[Figure: four panels plot performance (MAP / F1) on the product-patent and cross-lingual tasks against (a) the number of topics, (b) the cross-sampling ratio e1 : e2, (c) precision, and (d) the number of iterations (convergence).]

Challenge 1: there is little content overlap between the two sources.

Challenge 2: how to model the topic-level relevance probability (see the sketch below).
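One standard way to model topic-level relevance, used by the RTM baseline and plausibly close in spirit to CST's matching-relation generation (which involves a regression parameter), is a logistic regression on the element-wise product of two entities' topic distributions. In this sketch, eta and nu are hypothetical learned parameters:

    import numpy as np

    def match_probability(theta1, theta2, eta, nu=0.0):
        """RTM-style topic-level relevance:
        sigmoid(eta . (theta1 * theta2) + nu), where eta (per-topic
        weights) and nu (bias) are learned regression parameters.
        Illustrative only; not necessarily CST's exact likelihood."""
        s = float(np.dot(eta, np.asarray(theta1) * np.asarray(theta2)) + nu)
        return 1.0 / (1.0 + np.exp(-s))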

Model Learning

Variational inference is used for model learning:
• E-step: update the variational parameters;
• M-step: update the model parameters (the matching relations, topic distributions, and regression parameters).
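The paper gives the exact update equations; purely to illustrate the E-step/M-step alternation, here is EM for a much simpler model (a mixture of unigrams), with the same loop structure of inferring posteriors and then re-estimating parameters:

    import numpy as np

    def em_mixture_of_unigrams(X, K=2, iters=50, seed=0):
        """EM for a mixture of unigrams -- far simpler than CST's
        variational EM, shown only to illustrate the alternation.
        X: (n_docs, vocab_size) matrix of word counts."""
        X = np.asarray(X, dtype=float)
        rng = np.random.default_rng(seed)
        _, vocab = X.shape
        pi = np.full(K, 1.0 / K)                     # mixture weights
        phi = rng.dirichlet(np.ones(vocab), size=K)  # component-word distributions
        for _ in range(iters):
            # E-step: posterior responsibility of each component per document
            log_r = np.log(pi) + X @ np.log(phi).T
            log_r -= log_r.max(axis=1, keepdims=True)
            r = np.exp(log_r)
            r /= r.sum(axis=1, keepdims=True)
            # M-step: re-estimate parameters from the expected counts
            pi = r.mean(axis=0)
            phi = r.T @ X + 1e-9
            phi /= phi.sum(axis=1, keepdims=True)
        return pi, phi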

Application: Competitor Analysis (pminer.org)

Product-patent matching is the core problem behind disputes such as Apple vs. Samsung; CST integrates topic extraction and entity matching into a unified framework.


[Figure: topics relevant to both Apple and Samsung.]

Task (Product-Patent Matching): given a Wiki article describing a product, find all relevant patents.

Dataset:
• 13,085 Wiki articles;
• 15,00 patents from USPTO;
• 1,060 matching relations in total.

Task (Cross-lingual Matching): given an English Wiki article, find all Chinese articles reporting the same content.

Dataset:
• 2,000 English articles from Wikipedia;
• 2,000 Chinese articles from Baidu Baike;
• each English article corresponds to one Chinese article.