RELIN: Relatedness and Informativeness-based Centrality for Entity Summarization

RELIN: Relatedness and Informativeness-based Centrality for Entity Summarization. Gong Cheng (1), Thanh Tran (2), Yuzhong Qu (1). (1) State Key Laboratory for Novel Software Technology, Nanjing University, China; (2) Institute AIFB, Karlsruhe Institute of Technology, Germany. [email protected] Presented at ISWC 2011. ws.nju.edu.cn


Page 1: RELIN: Relatedness and Informativeness-based Centrality for Entity Summarization

ws.nju.edu.cn

RELIN: Relatedness and Informativeness-based Centrality for Entity Summarization

Gong Cheng (1), Thanh Tran (2), Yuzhong Qu (1)

(1) State Key Laboratory for Novel Software Technology, Nanjing University, China
(2) Institute AIFB, Karlsruhe Institute of Technology, Germany

[email protected]

Presented at ISWC2011

Page 2: RELIN: Relatedness and Informativeness-based Centrality for Entity Summarization

Gong Cheng (程龚 ) [email protected] 2 of 30

Motivation

DBpedia describes 3.64M entities with 1B RDF triples.

1B/3.64M = 281 RDF triples per entity

A piece of lengthy entity description is unacceptable in tasks that require quick identification of the underlying entity.

Page 3: RELIN: Relatedness and Informativeness-based Centrality for Entity Summarization

Entity search --- find entities that match an information need

Page 4: RELIN: Relatedness and Informativeness-based Centrality for Entity Summarization

Pay-as-you-go data integration --- judge whether two entities denote the same thing

sameAs?

Page 5: RELIN: Relatedness and Informativeness-based Centrality for Entity Summarization

Motivation

DBpedia describes 3.64M entities with 1B RDF triples.

1B/3.64M = 281 RDF triples per entity

A piece of lengthy entity description is unacceptable in tasks that require quick identification of the underlying entity.

Problem: to summarize lengthy entity descriptions

Page 6: RELIN: Relatedness and Informativeness-based Centrality for Entity Summarization

Outline

Problem statement

The RELIN model

Implementation

Experiments

Conclusions

Page 7: RELIN: Relatedness and Informativeness-based Centrality for Entity Summarization

Data graph

Page 8: RELIN: Relatedness and Informativeness-based Centrality for Entity Summarization

Feature set

Page 9: RELIN: Relatedness and Informativeness-based Centrality for Entity Summarization

Entity summarization

Entity summarization = feature ranking

Entity summary = k top-ranked features
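A minimal sketch of this definition, assuming every feature (property-value pair) already has a ranking score. The helper name `summarize`, the feature values, and the scores are invented for illustration and do not come from the slides.

```python
from typing import Dict, List, Tuple

Feature = Tuple[str, str]  # a (property, value) pair


def summarize(scores: Dict[Feature, float], k: int) -> List[Feature]:
    """Return the k features with the highest ranking scores."""
    ranked = sorted(scores, key=lambda feature: scores[feature], reverse=True)
    return ranked[:k]


# Invented scores for one entity description (illustration only).
example_scores = {
    ("type", "City"): 0.30,
    ("country", "Germany"): 0.25,
    ("label", "Karlsruhe"): 0.20,
    ("wikiPageID", "12345"): 0.10,
}
top2 = summarize(example_scores, k=2)
```

How the scores are computed is the subject of the ranking model; this sketch covers only the final top-k selection.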

Page 10: RELIN: Relatedness and Informativeness-based Centrality for Entity Summarization

Outline

Problem statement

The RELIN model

Implementation

Experiments

Conclusions

Page 11: RELIN: Relatedness and Informativeness-based Centrality for Entity Summarization

Centrality-based ranking: concepts

Widely applied to text summarization and ontology summarization

By constructing a graph
- Nodes: data elements to be ranked
- Edges: connecting related nodes

and then measuring node centrality, e.g. degree, PageRank, …

[Figure: example graph over features f1 to f5; two features are connected when relatedness ≥ threshold and left unconnected when relatedness < threshold]
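The thresholded graph and the simplest centrality measure (degree) can be sketched as follows. This is an illustration under assumptions: the relatedness matrix and the threshold value are invented, and `degree_centrality` is a hypothetical helper name.

```python
from typing import List


def degree_centrality(relatedness: List[List[float]], threshold: float) -> List[int]:
    """Count, for each node, the other nodes whose relatedness reaches the threshold."""
    n = len(relatedness)
    degrees = [0] * n
    for i in range(n):
        for j in range(n):
            # An edge exists only between distinct, sufficiently related nodes.
            if i != j and relatedness[i][j] >= threshold:
                degrees[i] += 1
    return degrees


# Invented symmetric relatedness values for three features.
relatedness = [
    [1.0, 0.8, 0.2],
    [0.8, 1.0, 0.6],
    [0.2, 0.6, 1.0],
]
degrees = degree_centrality(relatedness, threshold=0.5)
```

With these numbers the second feature is connected to both others and would be ranked first by degree.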

Page 12: RELIN: Relatedness and Informativeness-based Centrality for Entity Summarization

PageRank

Simulating the behavior of a random surfer who navigates from node to node

Two types of action
- Following a random edge (with a uniform probability distribution)
- Jumping at random (with a uniform probability distribution)

Ranking based on the stationary distribution of such a Markov chain

[Figure: example graph over features f1 to f5]
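A minimal power-iteration sketch of this random surfer, in plain Python. The jump probability of 0.15, the fixed iteration count, the handling of nodes without edges, and the example graph are assumptions made for illustration; they are not stated on the slide.

```python
from typing import List


def pagerank(adjacency: List[List[int]], jump_prob: float = 0.15,
             iterations: int = 100) -> List[float]:
    """Approximate the stationary distribution of the random surfer."""
    n = len(adjacency)
    x = [1.0 / n] * n
    for _ in range(iterations):
        # Random jump: every node receives the same share (uniform distribution).
        nxt = [jump_prob / n] * n
        for q in range(n):
            neighbors = [p for p in range(n) if adjacency[q][p]]
            if neighbors:
                # Follow one of the outgoing edges, chosen uniformly.
                share = (1.0 - jump_prob) * x[q] / len(neighbors)
                for p in neighbors:
                    nxt[p] += share
            else:
                # Assumption: a node without edges spreads its mass uniformly.
                for p in range(n):
                    nxt[p] += (1.0 - jump_prob) * x[q] / n
        x = nxt
    return x


# Node 0 is linked with nodes 1 and 2; nodes 1 and 2 are not linked to each other.
star = [
    [0, 1, 1],
    [1, 0, 0],
    [1, 0, 0],
]
ranks = pagerank(star)
```

In this small graph the hub node 0 ends up with the largest probability, and the two leaves tie.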

Page 13: RELIN: Relatedness and Informativeness-based Centrality for Entity Summarization

Centrality-based ranking for entity summarization: problems

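How to define a good feature
- Not only capturing the main themes of the entity description
- But also distinguishing the entity from others

Loss of information
- Float-valued function → boolean-valued function (an edge exists or not, depending on the threshold)

[Figure: example graph over features f1 to f5; two features are connected when relatedness ≥ threshold and left unconnected when relatedness < threshold]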

Page 14: RELIN: Relatedness and Informativeness-based Centrality for Entity Summarization

RELIN: concepts

An extension of PageRank
- Following a random edge within a complete graph, with a probability proportional to the relatedness between the two associated nodes, i.e. no threshold needed
- Jumping at random, with a probability proportional to the amount of information carried by the target that helps to identify the entity

Page 15: RELIN: Relatedness and Informativeness-based Centrality for Entity Summarization

RELIN: RELatedness and INformativeness-based centrality

Two kinds of action
- Relational move --- more likely to a feature that carries related information about the theme currently under investigation
- Informational jump --- more likely to a feature that provides a large amount of new information for clarifying the identity of the underlying entity

Two non-uniform probability distributions
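One way to read "more likely to" is that raw relatedness or informativeness scores are scaled into probability distributions over the candidate targets. A minimal sketch, assuming non-negative scores; `to_distribution` and all numbers are hypothetical.

```python
from typing import List


def to_distribution(scores: List[float]) -> List[float]:
    """Scale non-negative scores so that they sum to 1."""
    total = sum(scores)
    if total <= 0:
        raise ValueError("at least one score must be positive")
    return [score / total for score in scores]


# Invented relatedness of three candidate targets to the current feature.
move_probs = to_distribution([2.0, 1.0, 1.0])
# Invented informativeness of the same three targets.
jump_probs = to_distribution([1.0, 3.0, 4.0])
```

Both results are non-uniform: a relational move favors the first target, an informational jump the third.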

Page 16: RELIN: Relatedness and Informativeness-based Centrality for Entity Summarization

Formalization

Actions (given the current feature fq)

- P(M|fq): the probability of performing a relational move from fq
- P(J|fq): the probability of performing an informational jump from fq
- subject to P(M|fq) + P(J|fq) = 1

Targets for actions (given FS, the feature set)
- P(fp|fq,M): the probability of performing a relational move from fq to fp
- P(fp|fq,J): the probability of performing an informational jump from fq to fp
- subject to

$$\sum_{f_p \in FS} P(f_p \mid f_q, M) = 1 \quad \text{and} \quad \sum_{f_p \in FS} P(f_p \mid f_q, J) = 1$$

Result
- x(t): |FS|-dimensional vector
- xp(t): the probability that the surfer visits fp at step t

$$x_p(t+1) = \sum_{f_q \in FS} x_q(t) \left[ P(M \mid f_q)\, P(f_p \mid f_q, M) + P(J \mid f_q)\, P(f_p \mid f_q, J) \right]$$

Finally,

$$\mathbf{x} = \lim_{t \to \infty} \mathbf{x}(t)$$
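The iteration defined on this slide can be sketched in plain Python. This is an illustration, not the authors' implementation: the function name `relin_rank`, the fixed number of iterations used in place of a convergence test, and all example probabilities are assumptions.

```python
from typing import List

Matrix = List[List[float]]


def relin_rank(move: Matrix, jump: Matrix, p_jump: List[float],
               iterations: int = 200) -> List[float]:
    """Iterate x(t+1) from x(t) with two kinds of action.

    move[q][p] stands for P(fp|fq,M), jump[q][p] for P(fp|fq,J),
    and p_jump[q] for P(J|fq), so P(M|fq) = 1 - p_jump[q].
    Every row of move and jump is assumed to sum to 1.
    """
    n = len(move)
    x = [1.0 / n] * n
    for _ in range(iterations):
        x = [
            sum(
                x[q] * ((1.0 - p_jump[q]) * move[q][p] + p_jump[q] * jump[q][p])
                for q in range(n)
            )
            for p in range(n)
        ]
    return x


# Invented example with three features.
uniform_move = [[1 / 3, 1 / 3, 1 / 3]] * 3
informative_jump = [[0.7, 0.2, 0.1]] * 3
scores = relin_rank(uniform_move, informative_jump, p_jump=[0.5, 0.5, 0.5])
```

Because the move distribution is uniform in this example, the final ranking follows the jump distribution: the first feature scores highest.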

Page 17: RELIN: Relatedness and Informativeness-based Centrality for Entity Summarization

Outline

Problem statement

The RELIN model

Implementation

Experiments

Conclusions

Page 18: RELIN: Relatedness and Informativeness-based Centrality for Entity Summarization

Actions

P(M|fq) = 1 – λ

P(J|fq) = λ

λ: to be tuned in experiments

Page 19: RELIN: Relatedness and Informativeness-based Centrality for Entity Summarization

Relatedness --- P(fp|fq,M)

Relatedness between features (i.e. property-value pairs) combines
- Relatedness between properties (i.e. resources)
- Relatedness between values (i.e. resources)

Relatedness between resources = relatedness between resource names
- URI: label or local name
- Literal: lexical form

Distributional relatedness between resource names
- More related = more often co-occur in certain contexts (e.g. documents)
- Estimated via “pointwise mutual information + Google”

$$P(s_i, s_j) = \frac{\mathrm{Hits}(s_i, s_j)}{N}, \qquad P(s_j) = \frac{\mathrm{Hits}(s_j)}{N}$$
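A sketch of pointwise mutual information computed from hit counts, using the two probability estimates on this slide. The slide does not show which PMI variant or normalization is used, so the plain base-2 definition here is an assumption. The hit counts and N are invented, and no search engine is queried.

```python
import math


def pmi(hits_i: int, hits_j: int, hits_ij: int, n: int) -> float:
    """Plain pointwise mutual information of two names, in bits."""
    p_i = hits_i / n    # P(s_i) = Hits(s_i) / N
    p_j = hits_j / n    # P(s_j) = Hits(s_j) / N
    p_ij = hits_ij / n  # P(s_i, s_j) = Hits(s_i, s_j) / N
    if p_ij == 0:
        # The names never co-occur; treated here as negative infinity.
        return float("-inf")
    return math.log2(p_ij / (p_i * p_j))


# Invented counts: the names co-occur 25 times more often than chance predicts.
related = pmi(hits_i=100, hits_j=200, hits_ij=50, n=10_000)
# Invented counts: the names co-occur exactly as often as chance predicts.
unrelated = pmi(hits_i=100, hits_j=100, hits_ij=1, n=10_000)
```

A positive value means the two names co-occur more often than independence would predict, and a value near zero means they look independent.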

Page 20: RELIN: Relatedness and Informativeness-based Centrality for Entity Summarization

Informativeness --- P(fp|fq,J)

Self-information

- o: informational jump from fq to fp
- P(fp|fq): the probability that fp belongs to a feature set given fq also does so

Estimated via a statistical analysis of the data set

Approximation: P(fp|fq) = P(fp)
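Under the approximation P(fp|fq) = P(fp), informativeness can be sketched as the self-information of a feature. The formula on the slide did not survive extraction, so this sketch assumes the standard definition, −log2 P(fp), with P(fp) estimated as the fraction of entities whose feature set contains fp. The function name and the tiny data set are invented.

```python
import math
from collections import Counter
from typing import Dict, Iterable, List, Tuple

Feature = Tuple[str, str]  # a (property, value) pair


def feature_informativeness(feature_sets: List[Iterable[Feature]]) -> Dict[Feature, float]:
    """Self-information (in bits) of each feature, estimated from the data set."""
    n = len(feature_sets)
    # Count each feature at most once per entity.
    counts = Counter(f for fs in feature_sets for f in set(fs))
    return {f: -math.log2(c / n) for f, c in counts.items()}


# Invented data set with four entities.
data = [
    [("type", "City"), ("label", "A")],
    [("type", "City"), ("label", "B")],
    [("type", "City"), ("label", "C")],
    [("type", "City"), ("label", "D")],
]
info = feature_informativeness(data)
```

A feature shared by every entity carries no information for identification, while a feature unique to one of four entities carries 2 bits.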

Page 21: RELIN: Relatedness and Informativeness-based Centrality for Entity Summarization

Outline

Problem statement

The RELIN model

Implementation

Experiments

Conclusions

Page 22: RELIN: Relatedness and Informativeness-based Centrality for Entity Summarization

Experiments

Intrinsic evaluation

Extrinsic evaluation

Page 23: RELIN: Relatedness and Informativeness-based Centrality for Entity Summarization

Intrinsic evaluation --- design

Task
- To manually construct ideal entity summaries as the gold standard

Participants
- 24 students majoring in computer science

Test cases
- 149 entity descriptions randomly selected from DBpedia 3.4

Assignment
- 4.43 participants per entity description

Output
- Top-5 features
- Top-10 features

Page 24: RELIN: Relatedness and Informativeness-based Centrality for Entity Summarization

Intrinsic evaluation --- results

Metric: overlap between summaries

Agreement between participants about ideal summaries
- 2.91 when k=5
- 7.86 when k=10

Quality of summaries computed under different approach settings

[Figure: quality of computed summaries, baselines vs. ours]
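The overlap metric can be sketched as the number of features two summaries share, with agreement taken as the average over all pairs of summaries for the same entity. The slide does not spell out the averaging, so the pairwise mean is an assumption, and the example summaries are invented.

```python
from itertools import combinations
from typing import Iterable, List


def overlap(summary_a: Iterable[str], summary_b: Iterable[str]) -> int:
    """Number of features that appear in both summaries."""
    return len(set(summary_a) & set(summary_b))


def average_agreement(summaries: List[Iterable[str]]) -> float:
    """Mean overlap over all pairs of summaries of the same entity."""
    pairs = list(combinations(summaries, 2))
    return sum(overlap(a, b) for a, b in pairs) / len(pairs)


# Three invented top-3 summaries of one entity by different participants.
ideal = [
    ["f1", "f2", "f3"],
    ["f2", "f3", "f4"],
    ["f3", "f4", "f5"],
]
agreement = average_agreement(ideal)
```

For summaries of size k the overlap ranges from 0 to k, which is why the agreement values are reported separately for k=5 and k=10.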

Page 25: RELIN: Relatedness and Informativeness-based Centrality for Entity Summarization

Extrinsic evaluation --- design

Task
- To manually confirm entity mappings by using summaries

Participants
- 19 students majoring in computer science

Test cases
- 47 pairs of entity descriptions (DBpedia 3.4 ↔ Freebase Dec. 2009)
- Gold-standard judgments based on owl:sameAs links: 24 correct and 23 incorrect

Assignment
- 3.62 participants per pair, per approach setting

Output
- Judgment: correct or incorrect

Page 26: RELIN: Relatedness and Informativeness-based Centrality for Entity Summarization

Extrinsic evaluation --- results

Metrics
- Accuracy of the judgments
  - 1.0 = consistent with the gold standard
  - 0.0 = inconsistent
- Time spent
  - Normalized by the average time per judgment spent by the participant
  - 1.0 = medium efficiency
  - Smaller value = higher efficiency
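Both metrics can be sketched in a few lines. The function names, the judgments, and the timings are invented for illustration; only the definitions (per-judgment accuracy of 1.0 or 0.0, and time divided by the participant's own average) follow the slide.

```python
from typing import List


def accuracy(judgments: List[bool], gold: List[bool]) -> float:
    """Mean of per-judgment scores: 1.0 if consistent with the gold standard, else 0.0."""
    scores = [1.0 if j == g else 0.0 for j, g in zip(judgments, gold)]
    return sum(scores) / len(scores)


def normalized_times(seconds: List[float]) -> List[float]:
    """Divide each judgment time by the participant's own average time."""
    average = sum(seconds) / len(seconds)
    return [s / average for s in seconds]


# One invented participant: three judgments and the time spent on each.
acc = accuracy([True, False, True], gold=[True, True, True])
times = normalized_times([10.0, 20.0, 30.0])
```

Normalizing per participant removes differences in individual working speed, so values below 1.0 mean a judgment was quicker than that participant's own average.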

Results

Page 27: RELIN: Relatedness and Informativeness-based Centrality for Entity Summarization

Discussion

Automatically computed summaries are still not as good as handcrafted ones.

User-specific notion of informativeness
- Longitude and latitude are highly informative, but …

Information redundancy
- Longitude + latitude = point

What if multiple sources …

Summarization = what + how (to present)

                                                            k=5    k=10
Agreement between ideal summaries                           2.91   7.86
Agreement between computed summaries and ideal summaries    2.40   4.88

Page 28: RELIN: Relatedness and Informativeness-based Centrality for Entity Summarization

Outline

Problem statement

The RELIN model

Implementation

Experiments

Conclusions

Page 29: RELIN: Relatedness and Informativeness-based Centrality for Entity Summarization

Conclusions

Problem of entity summarization
- Extractive
- About identifying the entity that underlies a lengthy description

The RELIN model
- Variant of the random surfer model
- Non-uniform probability distributions
- Informativeness + relatedness

Implementation
- Based on linguistic and information theory concepts
- Using information captured by the labels of nodes and edges in the data graph

Experiments
- Closer to handcrafted ideal summaries
- Assisting users in confirming entity mappings more accurately

Page 30: RELIN: Relatedness and Informativeness-based Centrality for Entity Summarization

Future work --- application-specific entity summarization

sameAs?

Page 31: RELIN: Relatedness and Informativeness-based Centrality for Entity Summarization

Related work --- summarization

Paradigm
- Extractive
  - Text
  - Ontology
- Non-extractive
  - Database
  - Graph

Approach
- Centrality-based
- Centroid-based

Measure
- PageRank-like
- Others
  - Degree
  - Betweenness
  - …

Model
- RELIN
  - Relatedness
  - Informativeness
  - Non-uniform probability distribution
- PageRank
  - Relatedness
  - Uniform probability distribution

Page 32: RELIN: Relatedness and Informativeness-based Centrality for Entity Summarization

Related work --- ranking

Different goals --- to best identify the underlying entity
- B. Aleman-Meza et al., Ranking Complex Relationships on the Semantic Web. IEEE Internet Comput. 2005.
- R. Delbru et al., Hierarchical Link Analysis for Ranking Web Data. ESWC 2010.
- T. Franz et al., TripleRank: Ranking Semantic Web Data By Tensor Decomposition. ISWC 2009.

Exploitation of data semantics at different levels --- use labels of nodes and edges
- T. Penin et al., Snippet Generation for Semantic Web Search Engines. ASWC 2009.
- X. Zhang et al., Ontology Summarization Based on RDF Sentence Graph. WWW 2007.