Similarity on DBpedia

UIMR
PhD student: Samantha Lam
Supervisor: Conor Hayes

Upload: samantha-lam

Post on 15-Jun-2015

DESCRIPTION

Overview of the notion of similarity and of methods for defining similarity on DBpedia.

TRANSCRIPT

Page 1: Similarity on DBpedia

Similarity on DBpedia

UIMR

PhD student: Samantha Lam

Supervisor: Conor Hayes

Page 3: Similarity on DBpedia

Similarity

How similar are the following films?

(Unsatisfactory) answer: it depends!

Page 4: Similarity on DBpedia

DBpedia Graph

Films are nodes on DBpedia.

Some things about DBpedia:

Big, rich, dense Knowledge Base

→ 3.77m nodes, 400m edges (EN)

Lots of prior work (as we shall see...)

But very heterogeneous - vocabularies, categories

It is a graph
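Because DBpedia is a graph whose links carry different vocabularies, a toy version can be sketched as a list of (subject, predicate, object) triples. The resource and predicate names below are illustrative assumptions; they are not taken from the slides and have not been checked against DBpedia.

```python
from collections import Counter, defaultdict

# Hypothetical triples in the spirit of DBpedia; all names are illustrative only.
TRIPLES = [
    ("Film_A", "director", "Person_X"),
    ("Film_B", "director", "Person_X"),
    ("Film_A", "subject", "Category:Anime_films"),
    ("Film_B", "subject", "Category:Anime_films"),
    ("Film_B", "language", "Japanese"),
]


def build_graph(triples):
    """Return an undirected adjacency map whose edges keep their predicate label."""
    adjacency = defaultdict(set)
    for subject, predicate, obj in triples:
        adjacency[subject].add((predicate, obj))
        adjacency[obj].add((predicate, subject))
    return adjacency


graph = build_graph(TRIPLES)
# Counting edges per predicate gives a first feel for how heterogeneous the links are.
edges_per_predicate = Counter(predicate for _, predicate, _ in TRIPLES)
print(len(graph), dict(edges_per_predicate))
```

Keeping the predicate on every edge is the point of the sketch: measures that ignore it treat a shared director and a shared category as the same kind of link.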

Page 7: Similarity on DBpedia

Similarity in general

Cognitive Science - Tversky (1977) - psychology - featural.

E.g. film: genre, language, director

Modelling of human thought and semantic relations: how do we relate things to each other? (Collins & Quillian, 1969)
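A featural view of similarity can be sketched with Tversky's ratio model, which compares the features two items share against the features unique to each. The sketch assumes plain set size as the salience function, and the film feature values are made up for illustration.

```python
def tversky_similarity(features_a, features_b, alpha=0.5, beta=0.5):
    """Tversky's ratio model, using set size as the salience function."""
    a, b = set(features_a), set(features_b)
    common = len(a & b)
    only_a = len(a - b)
    only_b = len(b - a)
    denominator = common + alpha * only_a + beta * only_b
    return common / denominator if denominator else 0.0


# Hypothetical feature sets (genre, language, director); values are made up.
film_1 = {"genre:animation", "language:ja", "director:X"}
film_2 = {"genre:animation", "language:ja", "director:Y"}
film_3 = {"genre:action", "language:en", "director:Z"}

print(tversky_similarity(film_1, film_2))
print(tversky_similarity(film_1, film_3))
```

With alpha = beta = 1 the score reduces to the Jaccard index, and with alpha = beta = 0.5 to the Dice coefficient; unequal weights make the measure asymmetric.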

Page 8: Similarity on DBpedia

Semantic

The notion of semantic networks is derived from the hierarchical semantic memory model [Collins & Quillian, 1969]

Page 9: Similarity on DBpedia

Semantic Similarity

Different techniques:

Word frequency: Latent semantic analysis (doesn’t actually use semantic net structure)

Rada (1989) - average shortest path length

Resnik (1999) - information content of the least common subsumer (lcs)

Unfortunately...

Word frequency N/A

Often assumes a hierarchical/tree structure of the taxonomy/ontology. (Both Rada and Resnik assume the taxonomy is an is-A hierarchy)
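The two taxonomy-based measures can be illustrated on a toy is-A tree. This is a minimal sketch: the taxonomy, the usage counts, and all function names are hypothetical, and it relies on exactly the tree assumption that makes these measures awkward on DBpedia.

```python
import math

# Hypothetical is-A taxonomy (child -> parent) with made-up usage counts.
PARENT = {
    "anime_film": "animated_film",
    "cartoon": "animated_film",
    "animated_film": "film",
    "live_action_film": "film",
    "film": None,
}
COUNTS = {"anime_film": 10, "cartoon": 30, "animated_film": 0,
          "live_action_film": 60, "film": 0}
ROOT = "film"


def ancestors(node):
    """Return the chain from node up to the root, node included."""
    chain = []
    while node is not None:
        chain.append(node)
        node = PARENT[node]
    return chain


def least_common_subsumer(a, b):
    """First ancestor of a that is also an ancestor of b."""
    ancestors_b = set(ancestors(b))
    return next(n for n in ancestors(a) if n in ancestors_b)


def rada_distance(a, b):
    """Number of is-A edges on the shortest path between a and b in a tree."""
    lcs = least_common_subsumer(a, b)
    return ancestors(a).index(lcs) + ancestors(b).index(lcs)


def subtree_count(node):
    """Own count plus the counts of all descendants."""
    return COUNTS[node] + sum(
        subtree_count(child) for child, parent in PARENT.items() if parent == node
    )


def resnik_similarity(a, b):
    """Information content, -log p, of the least common subsumer."""
    lcs = least_common_subsumer(a, b)
    return math.log(subtree_count(ROOT) / subtree_count(lcs))


print(rada_distance("anime_film", "cartoon"))
print(resnik_similarity("anime_film", "cartoon"))
```

Both functions depend on every node having a single parent; once a node can sit under several categories at once, neither the shortest path through the subsumer nor the subsumer itself is well defined without further choices.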

Page 11: Similarity on DBpedia

Semantic Similarity

Remember, DBpedia is not as ‘neat’:

(Image source: http://www.visualdataweb.org/relfinder/)

Page 16: Similarity on DBpedia

On DBpedia/Wikipedia

Recent applications:

Gabrilovich & Markovitch (2007) - express text as a weighted vector of Wikipedia articles, Explicit Semantic Analysis (ESA)

Milne & Witten (2008) - the Wikipedia Link-based Measure - similarity of neighbours

Passant (2010) - Linked Data Semantic Distance ← uses paths!

Mirizzi et al. (2012) - use DBpedia for movie recommendation with a Vector Space Model
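The "similarity of neighbours" idea behind the link-based measure can be sketched by comparing the sets of articles that link to two pages. The formula below follows the normalised-distance form commonly given for this measure; treat the exact form, the function name, and the sample data as assumptions rather than something stated on the slides.

```python
import math


def link_relatedness(links_a, links_b, total_articles):
    """Relatedness in [0, 1] from the overlap of two incoming-link sets.

    Assumes total_articles is larger than either link set.
    """
    a, b = set(links_a), set(links_b)
    overlap = len(a & b)
    if overlap == 0:
        return 0.0
    larger, smaller = max(len(a), len(b)), min(len(a), len(b))
    distance = (math.log(larger) - math.log(overlap)) / (
        math.log(total_articles) - math.log(smaller)
    )
    return max(0.0, 1.0 - distance)


# Hypothetical incoming links, identified by made-up article ids.
links_to_film_1 = {1, 2, 3, 4}
links_to_film_2 = {3, 4, 5, 6}
print(link_relatedness(links_to_film_1, links_to_film_2, total_articles=1000))
```

Only direct neighbours are compared here, which is the contrast with the path-based distance in the list above.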

Page 17: Similarity on DBpedia

Similarity

Important: properties can be related to each other.

[Diagram labels: type 1, e.g. influenced; node, e.g. director; type 2, e.g. collaborated with; node type 2, e.g. film]

Page 18: Similarity on DBpedia

Network Similarity

Social Network Analysis

Established field - notions of influence, centrality, rank etc.

Often applied to small networks

Note: Ranking is often based on similarity

Page 19: Similarity on DBpedia

Network Similarity

Homogeneous network measures:

PageRank - Brin & Page (1998) - random surfer with teleportation

SimRank - Jeh & Widom (2002) - iteratively ‘inherits’ similarity from neighbours

σact - Thiel & Berthold (2010) - node similarities from spreading activation with a decay factor
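As an illustration of how a homogeneous measure propagates scores, here is a naive SimRank sketch: two nodes are similar if their in-neighbours are similar, and every node is maximally similar to itself. The graph, the decay value, and the iteration count are assumptions chosen for the example.

```python
from itertools import product


def simrank(in_neighbours, decay=0.8, iterations=10):
    """Naive SimRank over a dict mapping node -> list of in-neighbours."""
    nodes = list(in_neighbours)
    sim = {(a, b): 1.0 if a == b else 0.0 for a, b in product(nodes, nodes)}
    for _ in range(iterations):
        updated = {}
        for a, b in product(nodes, nodes):
            if a == b:
                updated[(a, b)] = 1.0
                continue
            in_a, in_b = in_neighbours[a], in_neighbours[b]
            if not in_a or not in_b:
                # A node with no in-neighbours is similar only to itself.
                updated[(a, b)] = 0.0
                continue
            total = sum(sim[(i, j)] for i, j in product(in_a, in_b))
            updated[(a, b)] = decay * total / (len(in_a) * len(in_b))
        sim = updated
    return sim


# Hypothetical graph: both films are pointed to by the same director node.
IN_NEIGHBOURS = {
    "director": [],
    "film_1": ["director"],
    "film_2": ["director"],
}
scores = simrank(IN_NEIGHBOURS)
print(scores[("film_1", "film_2")])  # 0.8
```

Every edge is treated alike in this update, so a link through a director and a link through a category would contribute in exactly the same way.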

Page 22: Similarity on DBpedia

Network Similarity

Heterogeneous network measures:

PathSim - Sun & Han (2009) - count instances of a ‘meta-path’ (a specific link pattern)
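Counting meta-path instances can be sketched for one symmetric pattern, film → category ← film. The score below normalises the number of path instances between two films by the number of instances from each film back to itself; the category memberships are hypothetical.

```python
def path_count(memberships, x, y):
    """Count instances of the meta-path film -> category <- film between x and y."""
    return len(memberships[x] & memberships[y])


def pathsim(memberships, x, y):
    """PathSim score for the film-category-film meta-path."""
    denominator = path_count(memberships, x, x) + path_count(memberships, y, y)
    if denominator == 0:
        return 0.0
    return 2 * path_count(memberships, x, y) / denominator


# Hypothetical category memberships; names are made up for illustration.
MEMBERSHIPS = {
    "film_1": {"cat_a", "cat_b"},
    "film_2": {"cat_b", "cat_c", "cat_d"},
    "film_3": {"cat_e"},
}
print(pathsim(MEMBERSHIPS, "film_1", "film_2"))
```

The meta-path is fixed by hand here; choosing which link patterns to count is the open part of the approach.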

Page 23: Similarity on DBpedia

Network Similarity

Applicability to DBpedia:

PageRank, SimRank - N/A - they assume homogeneous links!

Spreading Activation - possible with constraints

Apply PathSim - but how to learn such meta-paths?

Another idea:

Count node-disjoint paths.

Why? View each path as one distinct ‘reason’.
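One way to count node-disjoint paths is the standard reduction to unit-capacity maximum flow: each intermediate node is split into an "in" and an "out" copy joined by a single arc, so no node can be used twice. The slides do not say how the counts were computed, so the method, the function name, and the toy graph are all assumptions.

```python
from collections import defaultdict, deque


def count_node_disjoint_paths(adjacency, source, target):
    """Maximum number of source-target paths that share no intermediate node.

    adjacency maps each node to the set of its neighbours (undirected, so
    every edge is listed in both directions; no self-loops).
    """
    capacity = defaultdict(int)
    residual_graph = defaultdict(set)

    def add_arc(u, v):
        capacity[(u, v)] = 1
        residual_graph[u].add(v)
        residual_graph[v].add(u)  # reverse arc for the residual network

    for node, neighbours in adjacency.items():
        if node not in (source, target):
            add_arc((node, "in"), (node, "out"))
        for neighbour in neighbours:
            add_arc((node, "out"), (neighbour, "in"))

    start, goal = (source, "out"), (target, "in")
    flow = 0
    while True:
        # Breadth-first search for an augmenting path in the residual network.
        parent = {start: None}
        queue = deque([start])
        while queue and goal not in parent:
            u = queue.popleft()
            for v in residual_graph[u]:
                if v not in parent and capacity[(u, v)] > 0:
                    parent[v] = u
                    queue.append(v)
        if goal not in parent:
            return flow
        # Each augmenting path carries exactly one unit of flow.
        v = goal
        while parent[v] is not None:
            u = parent[v]
            capacity[(u, v)] -= 1
            capacity[(v, u)] += 1
            v = u
        flow += 1


# Hypothetical graph: two films joined via a category, a person, and a category chain.
GRAPH = {
    "film_a": {"cat_1", "cat_2", "person"},
    "film_b": {"cat_1", "cat_3", "person"},
    "cat_1": {"film_a", "film_b"},
    "cat_2": {"film_a", "cat_3"},
    "cat_3": {"cat_2", "film_b"},
    "person": {"film_a", "film_b"},
}
print(count_node_disjoint_paths(GRAPH, "film_a", "film_b"))  # 3
```

Each of the three paths found passes through different intermediate nodes, which matches the reading of every path as one distinct "reason" for the two films to be related.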

Page 25: Similarity on DBpedia

Similarity

          Totoro   GITS   Matrix
Totoro        44      1        0
GITS           1     35        2
Matrix         0      2       58

Totoro – GITS:

Category:Anime films

GITS – Matrix:

Category:Brain-computer interfacing in fiction

Matrix → Category:The Matrix (franchise) → Category:Media franchises ← GITS

Page 27: Similarity on DBpedia

Similarity

How similar are the following films?

Answer: it still depends - on the path you take

Page 28: Similarity on DBpedia

Summary

Similarity, useful concept in many areas, hard to define

how are films similar?

DBpedia, richly linked KB

film information available here

→ Problem: How to define similarity on DBpedia?

Past methods - don’t exploit linkedness

Network analysis methods can aid this

Test trial with node-disjoint paths: GITS is more similar to Matrix than to Totoro

Page 31: Similarity on DBpedia

Ongoing/Future Work

Mining DBpedia as Network

Analyse structured and related data

Similarity as complement to – reasoning, retrieval, querying

Also useful in NLP, recommender systems, knowledge discovery

→ Examples: work we do in UIMR

Page 33: Similarity on DBpedia

Ioana Hulpus (2011/2012)

Graph-based topic analysis with the support of Linked Data

Page 35: Similarity on DBpedia

Benjamin Heitmann (2011/2012)

Spreading activation for cross-domain recommendation

Page 36: Similarity on DBpedia

Challenges/Discussion

Challenges:

Topology of DBpedia graph

Standard SNA measures for homogeneous networks, e.g. density, degree distribution - how to apply to DBpedia?

What does a path actually mean?

Which subgraphs to use?

How do metrics vary with different subgraphs, e.g. different ontologies/categories?

Scalability (not a problem, but a challenge)

Evaluation - how do we confirm something is similar?

Thanks for listening! Questions/Suggestions?
