similarity on dbpedia

Post on 15-Jun-2015

DESCRIPTION

Overview on the notion of similarity and methods for defining similarity on DBpedia.

TRANSCRIPT

Similarity on DBpedia

UIMR

PhD student: Samantha Lam

Supervisor: Conor Hayes

Similarity

How similar are the following films:

(Unsatisfactory) Answer: it depends!

DBpedia Graph

Films - nodes - on DBpedia.

Some things about DBpedia:

Big, rich, dense Knowledge Base

→ 3.77m nodes, 400m edges (EN)

Lots of prior work (as we shall see...)

But very heterogeneous - vocabularies, categories

It is a graph

Similarity in general

Cognitive Science - Tversky (1977) - psychology - featural.

E.g. film: genre, language, director

Modelling of human thought and semantic relations: how do we relate things to each other? (Collins & Quillian, 1969)
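
A featural view of similarity can be sketched with Tversky's ratio model, which compares shared features against the features unique to each item. The snippet below is a minimal illustration only: the two films, their feature values, and the weights `alpha` and `beta` are made-up assumptions, and features are encoded as (attribute, value) pairs for genre, language, and director.

```python
def tversky(a, b, alpha=0.5, beta=0.5):
    """Tversky ratio model on feature sets, using set size as the salience function."""
    a, b = set(a), set(b)
    common = len(a & b)
    denom = common + alpha * len(a - b) + beta * len(b - a)
    return common / denom if denom else 0.0


# Hypothetical films described by (attribute, value) features.
film_x = {("genre", "anime"), ("language", "ja"), ("director", "D1")}
film_y = {("genre", "anime"), ("language", "ja"), ("director", "D2")}

print(tversky(film_x, film_y))                    # alpha = beta = 0.5 (Dice-like)
print(tversky(film_x, film_y, alpha=1, beta=1))   # alpha = beta = 1 (Jaccard)
```

Unequal `alpha` and `beta` make the measure asymmetric, which is the property Tversky used to model human similarity judgements.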

Semantic

The notion of semantic networks is derived from the hierarchical semantic memory model [Collins & Quillian, 1969]

Semantic Similarity

Different techniques:

Word frequency: Latent semantic analysis (doesn’t actually use semantic net structure)

Rada (1989) - average shortest path length

Resnik (1999) - information content of the least common subsumer (lcs)

Unfortunately...

Word frequency N/A

Often assumes hierarchical/tree structure of taxonomy/ontology. (Both Rada and Resnik assume the taxonomy is an is-A hierarchy)
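
To make the two taxonomy-based measures concrete, here is a small sketch on a toy is-A tree. Everything in it is an assumption for illustration: the concept names, the instance counts, and the helper names. The Rada-style function returns the path length between a single pair of concepts (averaging over sets of concepts is left out), and the Resnik-style function returns the information content, -log p, of the least common subsumer.

```python
import math

# Hypothetical is-A taxonomy (child -> parent) and direct instance counts.
PARENT = {"film": None, "animation": "film", "live_action": "film",
          "anime": "animation", "cartoon": "animation"}
COUNT = {"film": 0, "animation": 0, "live_action": 4, "anime": 3, "cartoon": 1}


def ancestors(concept):
    """Return the chain from the concept itself up to the root."""
    chain = []
    while concept is not None:
        chain.append(concept)
        concept = PARENT[concept]
    return chain


def lcs(a, b):
    """Least common subsumer: the deepest concept that is an ancestor of both."""
    seen = set(ancestors(a))
    return next(c for c in ancestors(b) if c in seen)


def rada_distance(a, b):
    """Number of is-A edges on the path between a and b in the tree."""
    common = lcs(a, b)
    return ancestors(a).index(common) + ancestors(b).index(common)


def probability(concept):
    """Share of all instances that fall under the concept or its descendants."""
    total = sum(COUNT.values())
    subsumed = sum(n for c, n in COUNT.items() if concept in ancestors(c))
    return subsumed / total


def resnik(a, b):
    """Information content of the least common subsumer."""
    return -math.log(probability(lcs(a, b)))


print(rada_distance("anime", "cartoon"), resnik("anime", "cartoon"))
print(rada_distance("anime", "live_action"), resnik("anime", "live_action"))
```

Both functions rely on every concept having exactly one parent, which is the tree assumption that DBpedia does not satisfy.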

Semantic Similarity

Remember, DBpedia is not as ‘neat’:

(Image source: http://www.visualdataweb.org/relfinder/)

On DBpedia/Wikipedia

Recent applications:

Gabrilovich & Markovitch (2007) - express text as a weighted vector of Wikipedia articles, Explicit Semantic Analysis (ESA)

Witten & Milne (2008) - the Wikipedia Link-based Measure - similarity of neighbours

Passant (2010) - Linked Data Semantic Distance ← uses paths!

Mirizzi et al. (2012) - use DBpedia for movie recommendation with a Vector Space Model

Similarity

Important:

Properties can be related to each other

type 1, e.g. influenced

node, e.g. director

type 2, e.g. collaborated with

node type 2, e.g. film

Network Similarity

Social Network Analysis

Established field - notions of influence, centrality, rank etc.

Often applied to small networks

Note: Ranking is often based on similarity

Network Similarity

Homogeneous network measures:

PageRank - Brin & Page (1998) - random surfer with teleportation

SimRank - Jeh & Widom (2002) - iteratively ‘inherits’ rank of neighbours

σact - Thiel & Berthold (2010) - node similarities from spreading activation with a decay factor
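
As an illustration of a homogeneous measure, the following is a naive sketch of the SimRank recurrence: two distinct nodes are similar to the extent that their in-neighbours are similar, damped by a decay constant. The graph, the node names, and the parameter values are assumptions chosen for the example, and the quadratic loop is only suitable for tiny graphs.

```python
from itertools import product


def simrank(in_neighbours, decay=0.8, iterations=10):
    """Naive SimRank. in_neighbours maps every node to a list of its in-neighbours."""
    nodes = list(in_neighbours)
    sim = {(a, b): 1.0 if a == b else 0.0 for a, b in product(nodes, nodes)}
    for _ in range(iterations):
        updated = {}
        for a, b in product(nodes, nodes):
            if a == b:
                updated[(a, b)] = 1.0  # a node is maximally similar to itself
                continue
            ia, ib = in_neighbours[a], in_neighbours[b]
            if not ia or not ib:
                updated[(a, b)] = 0.0  # no in-links means no evidence
                continue
            total = sum(sim[(x, y)] for x in ia for y in ib)
            updated[(a, b)] = decay * total / (len(ia) * len(ib))
        sim = updated
    return sim


# Hypothetical graph: one director node links to two film nodes.
graph = {"director": [], "film_a": ["director"], "film_b": ["director"]}
scores = simrank(graph)
print(scores[("film_a", "film_b")])
```

Note that the recurrence treats every link the same way, which is why it presupposes a single link type.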

Network Similarity

Heterogeneous network measures:

PathSim - Sun et al. (2011) - count instances of a ‘meta-path’ (specific link pattern)
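
A minimal sketch of the PathSim score for a symmetric meta-path, here assumed to be Film - Actor - Film: the number of path instances between two films is normalised by the number of instances from each film back to itself. The films, actors, and link counts are invented for illustration.

```python
def pathsim(links, x, y):
    """PathSim for a symmetric two-step meta-path.

    links[obj] maps each middle node (e.g. an actor) to the number of
    links between obj and that middle node.
    """
    def path_count(a, b):
        # Path instances a -> middle -> b along the meta-path.
        return sum(n * links[b].get(mid, 0) for mid, n in links[a].items())

    denom = path_count(x, x) + path_count(y, y)
    return 2 * path_count(x, y) / denom if denom else 0.0


# Hypothetical film -> actor links.
links = {
    "film_a": {"actor_1": 1, "actor_2": 1},
    "film_b": {"actor_2": 1, "actor_3": 1},
    "film_c": {"actor_4": 1},
}

print(pathsim(links, "film_a", "film_b"))
print(pathsim(links, "film_a", "film_c"))
```

The score depends entirely on which meta-path is chosen, so a different link pattern (for example Film - Director - Film) would need its own `links` table.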

Network Similarity

Applicability to DBpedia:

PageRank, SimRank - N/A - assumes homogeneous links!

Spreading Activation - possible with constraints

Apply PathSim - but how to learn such meta-paths?

Another idea:

Count node-disjoint paths.

Why? View each path as one distinct ‘reason’.
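
Counting node-disjoint paths can be sketched as a unit-capacity maximum flow in which every node is split into an "in" and an "out" half, so that each node can carry at most one path. The code below is one possible way to do this on a small undirected graph and is not necessarily the procedure used in the trial; the edge list and node names are hypothetical.

```python
from collections import defaultdict, deque


def count_node_disjoint_paths(edges, source, target):
    """Maximum number of source-target paths that share no intermediate node."""
    cap = defaultdict(dict)  # residual capacities between split nodes

    def add_arc(u, v, c):
        cap[u][v] = cap[u].get(v, 0) + c
        cap[v].setdefault(u, 0)  # reverse arc for the residual graph

    unique = {tuple(sorted(e)) for e in edges}  # ignore duplicate edges
    for node in {n for e in unique for n in e}:
        add_arc((node, "in"), (node, "out"), 1)  # each node used at most once
    for u, v in unique:
        add_arc((u, "out"), (v, "in"), 1)
        add_arc((v, "out"), (u, "in"), 1)

    start, goal = (source, "out"), (target, "in")
    flow = 0
    while True:
        # Breadth-first search for an augmenting path in the residual graph.
        parent = {start: None}
        queue = deque([start])
        while queue and goal not in parent:
            u = queue.popleft()
            for v, c in cap[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if goal not in parent:
            return flow
        v = goal
        while parent[v] is not None:  # push one unit of flow along the path
            u = parent[v]
            cap[u][v] -= 1
            cap[v][u] += 1
            v = u
        flow += 1


# Hypothetical graph: two films linked through category-like nodes.
edges = [("film_a", "c1"), ("c1", "film_b"),
         ("film_a", "c2"), ("c2", "film_b"),
         ("film_a", "d"), ("d", "c1")]
print(count_node_disjoint_paths(edges, "film_a", "film_b"))
```

In this toy graph the route through `d` reuses `c1`, so it does not count as an additional distinct ‘reason’.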

Similarity

         Totoro  GITS  Matrix
Totoro       44     1       0
GITS          1    35       2
Matrix        0     2      58

Totoro – GITS

Category:Anime films

GITS – Matrix

Category:Brain-computer interfacing in fiction

Matrix → Category:The Matrix (franchise) → Category:Media franchises ← GITS

Similarity

How similar are the following films:

Answer: it still depends - on the path you take

Summary

Similarity, useful concept in many areas, hard to define

how are films similar?

DBpedia, richly linked KB

film information available here

→ Problem: How to define similarity on DBpedia?

Past methods - don’t exploit linkedness

Network analysis methods can aid this

Test trial with node-disjoint paths: GITS is more similar to Matrix than to Totoro

Ongoing/Future Work

Mining DBpedia as Network

Analyse structured and related data

Similarity as complement to – reasoning, retrieval, querying

Also useful in NLP, recommender systems, knowledge discovery

→ Examples: work we do in UIMR

Ioana Hulpus (2011/2012)

Graph-based topic analysis with the support of Linked Data

Benjamin Heitmann (2011/2012)

Spreading activation for cross-domain recommendation

Challenges/Discussion

Challenges:

Topology of DBpedia graph

Standard SNA measures for homogeneous networks, e.g. density, degree distribution - how to apply to DBpedia?

What does a path actually mean?

Which subgraphs to use?

How do metrics vary with different subgraphs, e.g. different ontologies/categories?

Scalability (not a problem, but a challenge)

Evaluation - how do we confirm something is similar?

Thanks for listening! Questions/Suggestions?
