Seungwon Hwang: Entity Graph Mining and Matching

Entity Graph Mining and Matching. Seung-won Hwang, Associate Professor, Department of Computer Science and Engineering, POSTECH, Korea

Posted on 22-Oct-2014

DESCRIPTION

This talk introduces the problem of matching web-scale entity graphs, such as multilingual name graphs and social network graphs, to solve difficult problems such as name translation or social ID finding. While existing approaches focus on using either textual (or phonetic) similarity or Web co-occurrences, this approach combines the strengths of the two and significantly outperforms the state of the art. We present our evaluation results using real-life entity graphs.

TRANSCRIPT

Page 1: Seungwon Hwang: Entity Graph Mining and Matching

Information & Database Systems Lab

Entity Graph Mining and Matching

Seung-won Hwang

Associate Professor, Department of Computer Science and Engineering

POSTECH, Korea

Page 2: Seungwon Hwang: Entity Graph Mining and Matching

Mining Human Intelligence from the Web: Click Graph

Language-agnostic/data-intensive: e.g., Arabic corpus?

Are q1 and q2 similar?

Are u3 and u4 similar?

Page 3: Seungwon Hwang: Entity Graph Mining and Matching

Mining at Finer Granularity: Named Entity (NE) Graph

Person name, Place name, Organization name, Product name

Newspapers, Web sites, TV programs, …

[Figure: example NE graph with nodes Jobs, Gates, MS, Apple, and Mac, and edge labels such as "complicated", "co-founder", and "tenure"]

Page 4: Seungwon Hwang: Entity Graph Mining and Matching

Case I: Matching names with Twitter accounts [EDBT11]

Page 5: Seungwon Hwang: Entity Graph Mining and Matching

Case II: Entity Translation [EMNLP10, CIKM11]

What are the features?

How are the features combined? (using translation as an application scenario)

[Figure: NE graphs Ge = (Ve, Ee) and Gc = (Vc, Ec), extracted from an English corpus and a Chinese corpus respectively]

Page 6: Seungwon Hwang: Entity Graph Mining and Matching

NE Translation

Goal

Find, for an NE in the source language, its counterpart NE in the target language
Ex) “Obama” (English) → “奥巴马” (Chinese)

Resources: comparable corpora

[Figure: NEs from Xinhua News Agency (English), labeled NEE, and from Xinhua News Agency (Chinese), labeled NEC, each with its features; the task is to find the matching (NEE, NEC) pairs]

Page 7: Seungwon Hwang: Entity Graph Mining and Matching

NE Translation Similarity Features

Entity Name Similarity (E): S. Wan [1], L. Haizhou [2], K. Knight [3]

Pronunciation similarity between named entities
Ex) “Obama” and “奥巴马” (pronounced Aobama)

Entity Context Similarity (EC): M. Diab [4], H. Ji [5], K. Yu [6]

Contextual word similarity between named entities
Ex) The president (总统) Obama (奥巴马)
“As president, Obama signed economic stimulus legislation …”

Relationship Similarity (R): G.-w. You [7]

Co-occurrence similarity between pairs of named entities
Ex) (“Jackie Chan”, “Bill Gates”) vs. (“成龙”, “比尔·盖茨”)

Page 8: Seungwon Hwang: Entity Graph Mining and Matching

Motivation: Taxonomy Table

                          Entity         Relationship
Using Entity Names        E [1,2,3]      R
Using Textual Context     EC [4,5,6]     ?

Prior combinations: Shao [8] (E + EC), You [7] (E + R)

Research questions: Why is RC not used? Can all four categories be combined?

Page 9: Seungwon Hwang: Entity Graph Mining and Matching

In this paper…

We propose a new NE translation similarity feature

Relationship Context similarity (RC)
Contextual word similarity between pairs of named entities
Ex) pair (“Barack”, “Michelle”): “spouse”

We propose new holistic approaches

Combining all of E, EC, R, and RC

We validate our proposed approach using extensive experiments

Page 10: Seungwon Hwang: Entity Graph Mining and Matching

Our Framework

We abstract this problem as…

Graph matching of two NE relationship graphs extracted from comparable corpora

[Figure: NE relationship graphs Ge = (Ve, Ee) from the English corpus and Gc = (Vc, Ec) from the Chinese corpus]

Populate a decision matrix R, a |Ve|-by-|Vc| matrix

Page 11: Seungwon Hwang: Entity Graph Mining and Matching

Our Framework Overview – 3 Steps

1. Initialization
   Construct NE relationship graphs
   Build an initial pairwise similarity matrix R0
   Use Entity (E) and Entity Context (EC) similarities

2. Iterative reinforcement
   Build a final pairwise similarity matrix R∞
   Use Relationship (R) and Relationship Context (RC) similarities

3. Matching
   Find 1:1 matching from R∞
   Build a binary hard decision matrix R*

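[Figure: two example similarity matrices with rows Obama and Jackie Chan and columns including 奥巴马 and 成龙; the Obama row reads .99, .1, .2 in both, while the Jackie Chan entry shown changes from .1 to .99]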

Page 12: Seungwon Hwang: Entity Graph Mining and Matching

Initialization

Constructing NE relationship graphs G = (N, E)

Extract NEs using an entity tagger for each document in each corpus
Regard NEs that appear more than δ times as nodes
Connect two nodes when they co-occur more than δ times
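The construction steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each document has already been reduced to the NEs returned by an entity tagger, counts frequency and co-occurrence at the document level, and reuses one threshold `delta` for both nodes and edges. The name `build_ne_graph` is hypothetical.

```python
from collections import Counter
from itertools import combinations

def build_ne_graph(docs, delta):
    """Build an NE relationship graph from tagged documents.

    docs:  iterable of per-document NE collections (tagger output, assumed given).
    delta: frequency threshold, applied to both nodes and edges (an assumption).
    Returns (nodes, edges); edges maps a sorted NE pair to its co-occurrence count.
    """
    docs = [set(d) for d in docs]
    # Count in how many documents each NE appears.
    ne_count = Counter(ne for d in docs for ne in d)
    nodes = {ne for ne, c in ne_count.items() if c > delta}
    # Count document-level co-occurrences between surviving nodes.
    pair_count = Counter()
    for d in docs:
        for a, b in combinations(sorted(d & nodes), 2):
            pair_count[(a, b)] += 1
    edges = {pair: c for pair, c in pair_count.items() if c > delta}
    return nodes, edges
```

Running it once per corpus would give the two graphs that the later steps compare.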

Initializing R0

Computing entity similarity matrix SE

Use Edit-Distance (ED) between ‘ei’ and the Pinyin representation of ‘cj’
Ex) ED(“Obama”, “奥巴马”) = ED(“Obama”, “Aobama”)

$$S^{E}_{ij} = 1 - \frac{ED(e_i, PY_{c_j})}{Len(e_i) + Len(PY_{c_j})}$$
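As an illustration of the entity name similarity, the sketch below computes a Levenshtein edit distance and normalizes it by the summed lengths of the two strings, i.e. 1 - ED / (Len + Len). That normalization is an assumed reading of the slide's formula. The Pinyin string is passed in by the caller, so no romanization library is involved, and lowercasing before comparison is a further assumption. The function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance via dynamic programming with a rolling row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def name_similarity(e, pinyin_c):
    """Assumed form: 1 - ED(e, PY_c) / (Len(e) + Len(PY_c))."""
    e, pinyin_c = e.lower(), pinyin_c.lower()
    total = len(e) + len(pinyin_c)
    if total == 0:
        return 1.0
    return 1.0 - edit_distance(e, pinyin_c) / total
```

For the slide's example, `name_similarity("Obama", "Aobama")` uses an edit distance of 1 over a combined length of 11.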

Page 13: Seungwon Hwang: Entity Graph Mining and Matching

Initialization

Initializing R0

Computing entity context similarity matrix SEC

Context word

ex) “As president, Obama signed economic stimulus legislation …”

Context window of length l around the NE at position i in document d

$$CW(NE_i, d) = \{ w_{i-l/2}, w_{i-l/2+1}, \ldots, (NE_i), \ldots, w_{i+l/2-1}, w_{i+l/2} \}$$

Correlation between an NE and a context word: log-odds ratio
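A context window of this kind can be sketched as below. The sketch assumes whitespace-tokenized text and a single-token NE, and it only collects raw context word counts; the log-odds ratio itself is not implemented because the slide does not give its exact definition. Both function names are hypothetical.

```python
from collections import Counter

def context_window(tokens, i, l):
    """Return up to l // 2 tokens on each side of position i, excluding tokens[i]."""
    half = l // 2
    left = tokens[max(0, i - half):i]
    right = tokens[i + 1:i + 1 + half]
    return left + right

def context_counts(tokens, ne, l):
    """Count context words over every occurrence of `ne` in one document."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == ne:
            counts.update(context_window(tokens, i, l))
    return counts
```

The resulting counts are the kind of statistic from which an NE-to-word association score could then be derived.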

Page 14: Seungwon Hwang: Entity Graph Mining and Matching

Initialization

Initializing R0

Computing entity context similarity matrix SEC

Projected Context Association Vector

[Figure: the context association vector of “Obama” (e.g., President: 0.9) and that of “奥巴马” (e.g., 总统: 0.85) are aligned through a bilingual dictionary (e.g., (President, 总统)); aligned dimensions include president/总统 and USA/美國]
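The dictionary-based projection can be sketched as follows. This is an assumption-laden illustration: it projects from English into the Chinese vocabulary (the slide does not state the direction), drops words without a dictionary entry, and keeps the strongest score when several source words map to the same target word. The name `project_vector` and the dictionary layout are hypothetical.

```python
def project_vector(ca_source, dictionary):
    """Map a source-language context association vector into the target vocabulary.

    ca_source:  {source_word: association_score}
    dictionary: {source_word: [target_word, ...]} (bilingual dictionary, assumed given)
    """
    projected = {}
    for word, score in ca_source.items():
        for target in dictionary.get(word, []):  # words without an entry are dropped
            # Keep the strongest score when several source words share a target.
            projected[target] = max(projected.get(target, float("-inf")), score)
    return projected
```

After projection, both vectors live in the same vocabulary and can be compared dimension by dimension.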

Page 15: Seungwon Hwang: Entity Graph Mining and Matching

Initialization

Initializing R0

Computing entity context similarity matrix SEC

Context Similarity between ‘ei’ and ‘cj’

Compute cosine similarity between the two context association (CA) vectors

$$S^{EC}_{ij} = \frac{CA_{e_i} \cdot CA_{c_j}}{\lVert CA_{e_i} \rVert \, \lVert CA_{c_j} \rVert}$$

Merging SE and SEC

Min-Max normalization in range [0:1]
Merge

$$R^{0}_{ij} = S^{E}_{ij} + S^{EC}_{ij}$$
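The cosine comparison and the merge step can be sketched as below. The cosine and min-max parts follow their standard definitions. The merge is written as an element-wise sum of the two normalized matrices, which is an assumption, since the slide's operator did not survive extraction. Vectors are represented as sparse dicts, and all names are hypothetical.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity of two sparse vectors given as {dimension: value} dicts."""
    dot = sum(val * v.get(k, 0.0) for k, val in u.items())
    norm = np.sqrt(sum(x * x for x in u.values())) * np.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm > 0 else 0.0

def min_max(m):
    """Rescale a matrix linearly into [0, 1]; a constant matrix maps to zeros."""
    m = np.asarray(m, dtype=float)
    span = m.max() - m.min()
    return (m - m.min()) / span if span > 0 else np.zeros_like(m)

def merge(s_e, s_ec):
    """Assumed merge: element-wise sum of the two min-max normalized matrices."""
    return min_max(s_e) + min_max(s_ec)
```

Normalizing before merging keeps either feature from dominating only because of its scale.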

Page 16: Seungwon Hwang: Entity Graph Mining and Matching

Reinforcement

Intuition

Two NEs with a strong relationship
  Co-occur frequently → have an edge
  Share similar context → have similar relationship context

1. Align neighbors using relationship (R) and relationship context (RC) similarity

2. Update the similarity score

[Figure: node X with its neighbor NEs in the English NE graph and node Y with its neighbor NEs in the Chinese NE graph; each edge is annotated with its context]

Page 17: Seungwon Hwang: Entity Graph Mining and Matching

Reinforcement

Iterative Approach

$$R^{t+1}_{ij} = \lambda \sum_{(u,v)_k \in B^{t}(i,j,\theta)} \frac{R^{t}_{uv} + \gamma S^{RC}_{(i,u),(j,v)}}{2^{k}} + (1-\lambda) R^{0}_{ij}$$

R^0_ij: Entity-based Similarity (E & EC)

Summation term: Relationship-based Similarity (R & RC)

B^t(i, j, θ): Ordered set of aligned neighbor pairs of (i, j) at iteration t

S^RC_(i,u),(j,v): Relationship Context (RC) Similarity between relation pair (i, u) and (j, v)

R^t_uv: Relationship (R) Similarity of i’s neighbor u and j’s neighbor v
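One iteration of the reinforcement can be sketched as follows. The sketch assumes the update has the form R^{t+1}_ij = λ Σ_k (R^t_uv + γ S^RC) / 2^k + (1 - λ) R^0_ij, where (u, v)_k is the k-th aligned neighbor pair. The neighbor alignment is done greedily by descending score with a cut-off `theta`, which is also an assumption. The RC similarity is supplied by the caller as a function `s_rc(i, u, j, v)`. All names and defaults other than λ = 0.15 are hypothetical.

```python
import numpy as np

def reinforce_once(R_t, R0, nbrs_e, nbrs_c, s_rc, lam=0.15, gamma=1.0, theta=0.0):
    """Apply one assumed reinforcement update to every (i, j) entry.

    R_t, R0: |Ve|-by-|Vc| arrays (current and initial similarity).
    nbrs_e:  {i: [neighbor indices of i in the English graph]}
    nbrs_c:  {j: [neighbor indices of j in the Chinese graph]}
    s_rc:    callable (i, u, j, v) -> RC similarity of relations (i, u) and (j, v).
    """
    R_next = np.empty_like(R0, dtype=float)
    for i in range(R0.shape[0]):
        for j in range(R0.shape[1]):
            # Score every candidate neighbor pair (u, v) of (i, j), best first.
            cand = sorted(
                ((R_t[u, v] + gamma * s_rc(i, u, j, v), u, v)
                 for u in nbrs_e[i] for v in nbrs_c[j]),
                reverse=True,
            )
            used_u, used_v, total, k = set(), set(), 0.0, 0
            for score, u, v in cand:
                if score < theta:
                    break  # remaining candidates are weaker still
                if u in used_u or v in used_v:
                    continue  # keep the neighbor alignment one-to-one
                used_u.add(u)
                used_v.add(v)
                k += 1
                total += score / 2 ** k  # k-th aligned pair is discounted by 2^k
            R_next[i, j] = lam * total + (1 - lam) * R0[i, j]
    return R_next
```

Calling this repeatedly from `R_t = R0` until the matrix stops changing would approximate R∞. Setting `gamma=0` removes the RC term and leaves the E+EC+R variant.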

Page 18: Seungwon Hwang: Entity Graph Mining and Matching

Matching

Finding 1:1 matching using greedy algorithm

Steps

1. Find a translation pair with the highest final similarity score

2. Select the pair and remove the corresponding row and column from R∞

3. Repeat 1. and 2. until the similarity score < threshold
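The greedy steps above can be sketched as follows. This is a minimal illustration, assuming R∞ is available as a dense array; instead of physically deleting rows and columns it masks them with negative infinity, which has the same effect. The name `greedy_match` is hypothetical.

```python
import numpy as np

def greedy_match(R_inf, threshold):
    """Greedy 1:1 matching: repeatedly take the best remaining pair.

    Returns a list of (row, column, score) tuples in selection order.
    """
    R = np.array(R_inf, dtype=float)  # work on a copy
    pairs = []
    for _ in range(min(R.shape)):     # at most one match per row and per column
        i, j = np.unravel_index(np.argmax(R), R.shape)
        if R[i, j] < threshold:
            break
        pairs.append((int(i), int(j), float(R[i, j])))
        R[i, :] = -np.inf             # "remove" the selected row
        R[:, j] = -np.inf             # "remove" the selected column
    return pairs
```

The selected pairs correspond to the 1 entries of the binary hard decision matrix R*.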

Page 19: Seungwon Hwang: Entity Graph Mining and Matching

Experiments

Dataset

English Gigaword Corpus: Xinhua News Agency, 2008.01~2008.12, 100,746 news documents

Chinese Gigaword Corpus: Xinhua News Agency, 2008.01~2008.12, 88,029 news documents

Approaches

EC: considers the Entity Context similarity feature only
E: considers the Entity name similarity feature only
Shao (E+EC): combines Entity name & Entity Context similarities
You (E+R): combines Entity name & Relationship similarities
Ours
  E+EC+R (when γ = 0)
  E+EC+R+RC

Measure

Precision, Recall, and F1-score
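For reference, the three measures can be computed over sets of translation pairs as in the sketch below. It uses the standard definitions. The function name and the representation of a translation as an (English, Chinese) tuple are assumptions made for illustration.

```python
def precision_recall_f1(predicted, gold):
    """Standard precision, recall, and F1 over sets of (source, target) pairs."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    p = correct / len(predicted) if predicted else 0.0
    r = correct / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r > 0 else 0.0
    return p, r, f1
```

Because matching stops at a score threshold, raising the threshold trades recall for precision.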

Page 20: Seungwon Hwang: Entity Graph Mining and Matching

Experiments

Effectiveness of overall framework

500 person named entities
Set λ = 0.15
5-fold cross-validation for threshold parameter learning

Other type of NE (100 location named entities)

Page 21: Seungwon Hwang: Entity Graph Mining and Matching

Directions

Graph matching
Graph cleansing [VLDB11]
Scalable entity search

[Figure: “US Presidents” example with name variants: Bill Clinton, William J Clinton, George W. Bush, George H.W. Bush, Dubya]

Page 22: Seungwon Hwang: Entity Graph Mining and Matching

Thanks. Questions?

Visit: www.postech.ac.kr/~swhwang for these papers