Information-theoretic analysis of entity dynamics on the linked open data cloud

www.moving-project.eu TraininG towards a society of data-saVvy inforMation prOfessionals to enable open leadership INnovation Chifumi Nishioka and Ansgar Scherp ZBW -- Leibniz Information Centre for Economics and Kiel University, Germany Information-theoretic Analysis of Entity Dynamics on the Linked Open Data cloud

Upload: moving-project

Post on 11-Feb-2017


TRANSCRIPT

Page 1: Information-theoretic analysis of entity dynamics on the linked open data cloud

www.moving-project.eu

TraininG towards a society of data-saVvy inforMation prOfessionals to enable open leadership INnovation

Chifumi Nishioka and Ansgar Scherp

ZBW -- Leibniz Information Centre for Economics and Kiel University, Germany

Information-theoretic Analysis of Entity Dynamics on the Linked Open Data cloud

Page 2: Information-theoretic analysis of entity dynamics on the linked open data cloud

Motivation

• Understanding the dynamics of the LOD cloud is important for many applications
  • e.g., SPARQL query caching, crawling strategies, term recommendations
• Related work
  • Evolution of LOD documents [Käfer et al. 13]
  • Dynamics of LOD sources [Dividino et al. 14]
• Entities on the LOD cloud
  • Used by many applications
    • Knowledge graphs in search engines
    • Document modeling [Schuhmacher and Ponzetto 14]

Come to the presentation of “TermPicker” by Johann Schaible at 14:30 on 1st June (Wednesday)

We conduct an analysis focusing on entities

Page 3: Information-theoretic analysis of entity dynamics on the linked open data cloud

Research Goals

Goal 1: Represent the temporal dynamics of entities
• Measure the changes in entities between two points in time
• Represent the temporal dynamics of entities as time series

Goal 2: Find the representative temporal patterns of entity dynamics
• Time-series clustering
• Periodicity detection

Goal 3: Find out which features of entities are more likely to define the temporal dynamics of entities
• Evaluate four different features of entities

Page 4: Information-theoretic analysis of entity dynamics on the linked open data cloud

Formalization

• $X_t$: snapshot of LOD documents at point in time $t$
  • A snapshot is a collection of triples $x$
• $x$: triple
  • $x = (s, p, o)$: subject, predicate, and object

Page 5: Information-theoretic analysis of entity dynamics on the linked open data cloud

Entity and Entity Representations

• Entities are represented by a set of triples
• Entity Representation: Out
  • Set of triples with a common subject URI
  • e.g., db:John_Brown is defined by two triples
• Entity Representation: InOut
  • Set of triples with a common subject URI or object URI
  • e.g., db:John_Brown is defined by three triples

db:Anne_Smith db:spouseOf db:John_Brown
db:John_Brown db:birthplace db:Los_Angels
db:John_Brown db:works db:Green_University
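The two entity representations can be sketched in a few lines of Python. This is only an illustration: the function names `entity_out` and `entity_inout` are hypothetical, and triples are modeled as plain `(subject, predicate, object)` tuples rather than parsed RDF.

```python
# Toy snapshot containing the three example triples from the slide.
snapshot = {
    ("db:Anne_Smith", "db:spouseOf", "db:John_Brown"),
    ("db:John_Brown", "db:birthplace", "db:Los_Angels"),
    ("db:John_Brown", "db:works", "db:Green_University"),
}

def entity_out(snapshot, uri):
    """Out: triples whose subject is the entity URI (hypothetical helper)."""
    return {t for t in snapshot if t[0] == uri}

def entity_inout(snapshot, uri):
    """InOut: triples whose subject or object is the entity URI (hypothetical helper)."""
    return {t for t in snapshot if t[0] == uri or t[2] == uri}

out_triples = entity_out(snapshot, "db:John_Brown")
inout_triples = entity_inout(snapshot, "db:John_Brown")
print(len(out_triples), len(inout_triples))  # 2 3
```

The counts match the slide: two triples under Out and three under InOut, because InOut also picks up the db:spouseOf triple in which db:John_Brown is the object.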

Page 6: Information-theoretic analysis of entity dynamics on the linked open data cloud

Triple Weighting

Each triple of an entity has a different importance for the entity

• Example: Barack Obama
  • <Barack_Obama, dbp:vicePresident, Joe_Biden> is more important than <Barack_Obama, rdf:type, foaf:Person>
• Baseline
  • All triples have the same weight

$$w_{\mathrm{baseline}}(x) = 1$$

• Combined Information Content (combIC) [Schuhmacher and Ponzetto 14]
  • $\mathit{IC}(v) = -\log(P(v))$
  • $\mathit{pred}(x)$ and $\mathit{obj}(x)$ return the predicate and the object of a triple $x$, respectively

$$w_{\mathrm{combIC}}(x) = \mathit{IC}(\mathit{pred}(x)) + \mathit{IC}(\mathit{obj}(x))$$
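A minimal sketch of the two weighting schemes follows. The slide does not say how P(v) is estimated, so the code assumes it is the relative frequency of a predicate (or object) among the triples of a toy collection. The triples about Joe_Biden and Anne_Smith are made up for the example, and the function names are hypothetical.

```python
import math
from collections import Counter

# Toy triple collection; only the first two triples come from the slide.
triples = [
    ("Barack_Obama", "dbp:vicePresident", "Joe_Biden"),
    ("Barack_Obama", "rdf:type", "foaf:Person"),
    ("Joe_Biden", "rdf:type", "foaf:Person"),
    ("Anne_Smith", "rdf:type", "foaf:Person"),
]

# Assumption: P(v) is the relative frequency of v over all triples.
pred_counts = Counter(p for _, p, _ in triples)
obj_counts = Counter(o for _, _, o in triples)
total = len(triples)

def ic(count):
    """Information content IC(v) = -log(P(v)) with P(v) = count / total."""
    return -math.log(count / total)

def w_baseline(triple):
    """Baseline: every triple has weight 1."""
    return 1.0

def w_combic(triple):
    """combIC: IC of the predicate plus IC of the object."""
    _, p, o = triple
    return ic(pred_counts[p]) + ic(obj_counts[o])

rare = w_combic(triples[0])    # dbp:vicePresident / Joe_Biden, both rare
common = w_combic(triples[1])  # rdf:type / foaf:Person, both frequent
print(rare > common)  # True
```

Under this assumption the rare vice-president triple receives a higher weight than the frequent rdf:type triple, which is the behavior the Barack Obama example asks for.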

Page 7: Information-theoretic analysis of entity dynamics on the linked open data cloud

Measuring Entity Dynamics

Goal 1: Represent the temporal dynamics of entities

1. Measure the amount of change in entities between two successive snapshots by one of two distance measures

• Cosine distance

$$\delta_{\mathrm{cosd}}(E_{t_1}, E_{t_2}) = 1 - \frac{E_{t_1} \cdot E_{t_2}}{\|E_{t_1}\| \, \|E_{t_2}\|}$$

• Euclidean distance

$$\delta_{\mathrm{euc}}(E_{t_1}, E_{t_2}) = \sqrt{\sum_{i=1}^{m} (E_{t_1,i} - E_{t_2,i})^2}$$

where $m$ is the number of dimensions of the entity vectors
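The two distance measures above can be written directly in plain Python. This is a sketch using only the standard library. It assumes entity vectors are equal-length lists of numbers and that neither vector is all zeros when the cosine distance is computed.

```python
import math

def cosine_distance(a, b):
    """1 - (a . b) / (||a|| * ||b||); assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Square root of the summed squared component differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

unchanged = cosine_distance([1, 1, 0], [1, 1, 0])  # identical vectors
disjoint = cosine_distance([1, 0], [0, 1])         # no shared component
moved = euclidean_distance([1, 0], [0, 1])
```

An entity that does not change between two snapshots has a distance of (numerically close to) zero under both measures. An entity whose triples are replaced completely reaches the maximum cosine distance of 1.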

Page 8: Information-theoretic analysis of entity dynamics on the linked open data cloud

Vector Representation of Entities

• Represent an entity $E$ by one-hot encoding
  • Extract all unique triples from the different snapshots
  • Fix the order of the triples
  • e.g., db:Anne_Smith at $t_1$ is (1,1,1,0,0) and at $t_2$ is (1,0,1,1,1)
• Cosine distance: $\delta_{\mathrm{cosd}}(E_{t_1}, E_{t_2}) = 1 - \frac{2}{\sqrt{3} \cdot \sqrt{4}} = 0.42$
• Euclidean distance: $\delta_{\mathrm{euc}}(E_{t_1}, E_{t_2}) = \sqrt{3} \approx 1.73$

$t_1$
1 db:Anne_Smith db:birthplace db:New_York
2 db:Anne_Smith db:works db:Green_University
3 db:Anne_Smith db:spouseOf db:John_Brown

$t_2$
1 db:Anne_Smith db:birthplace db:New_York
4 db:Anne_Smith db:works db:Royal_University
3 db:Anne_Smith db:spouseOf db:John_Brown
5 db:Anne_Smith db:degree db:Master_of_Science
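The encoding step for the db:Anne_Smith example can be reproduced as follows. This is a sketch under one assumption: the fixed order of triples is the order of first appearance across the snapshots, which yields the numbering 1 to 5 used above. The helper names `fixed_order` and `one_hot` are hypothetical.

```python
import math

t1 = [
    ("db:Anne_Smith", "db:birthplace", "db:New_York"),
    ("db:Anne_Smith", "db:works", "db:Green_University"),
    ("db:Anne_Smith", "db:spouseOf", "db:John_Brown"),
]
t2 = [
    ("db:Anne_Smith", "db:birthplace", "db:New_York"),
    ("db:Anne_Smith", "db:works", "db:Royal_University"),
    ("db:Anne_Smith", "db:spouseOf", "db:John_Brown"),
    ("db:Anne_Smith", "db:degree", "db:Master_of_Science"),
]

def fixed_order(*snapshots):
    """List every unique triple once, in order of first appearance."""
    order = []
    for snap in snapshots:
        for triple in snap:
            if triple not in order:
                order.append(triple)
    return order

def one_hot(snapshot, order):
    """1 where the triple is present in the snapshot, 0 otherwise."""
    present = set(snapshot)
    return [1 if triple in present else 0 for triple in order]

order = fixed_order(t1, t2)
v1 = one_hot(t1, order)  # [1, 1, 1, 0, 0]
v2 = one_hot(t2, order)  # [1, 0, 1, 1, 1]

dot = sum(a * b for a, b in zip(v1, v2))              # 2 shared triples
cosd = 1 - dot / (math.sqrt(sum(v1)) * math.sqrt(sum(v2)))
squared = sum((a - b) ** 2 for a, b in zip(v1, v2))   # 3 differing triples
print(round(cosd, 2), round(math.sqrt(squared), 2))   # 0.42 1.73
```

For 0/1 vectors the squared norm equals the number of triples, and the squared Euclidean distance equals the number of triples present in exactly one of the two snapshots (here 3).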

Page 9: Information-theoretic analysis of entity dynamics on the linked open data cloud

Temporal Dynamics of Entities

2. Represent the temporal dynamics of an entity by a time series of the amount of change in the entity between two successive snapshots

• Temporal dynamics of an entity $E$

$$\Delta(E) = (\delta(E_{t_1}, E_{t_2}), \delta(E_{t_2}, E_{t_3}), \dots, \delta(E_{t_{n-1}}, E_{t_n}))$$

• $n$: the number of snapshots

Subsequently, we mine the resulting time series to find patterns in the temporal dynamics of entities
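As a sketch of how such a time series could be built, the code below represents each snapshot of an entity as a set of triple identifiers and uses the Euclidean distance with baseline weights. For 0/1 vectors, that distance is the square root of the size of the symmetric difference of the two sets. The identifiers "x1" to "x5" and the function names are placeholders invented for the example.

```python
import math

def delta_euc(triples_a, triples_b):
    """Euclidean distance of one-hot vectors: sqrt of the number of
    triples that occur in exactly one of the two snapshots."""
    return math.sqrt(len(triples_a ^ triples_b))

def temporal_dynamics(snapshots):
    """Distances between each pair of successive snapshots (n - 1 values)."""
    return [delta_euc(a, b) for a, b in zip(snapshots, snapshots[1:])]

# One entity observed in n = 3 snapshots (placeholder triple identifiers).
entity_over_time = [
    {"x1", "x2", "x3"},
    {"x1", "x3", "x4", "x5"},
    {"x1", "x3", "x4", "x5"},
]
series = temporal_dynamics(entity_over_time)
print(len(series))  # 2
```

With n snapshots the series has n - 1 entries. Here the first entry is the square root of 3 and the second is 0, because the entity does not change between the last two snapshots.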

Page 10: Information-theoretic analysis of entity dynamics on the linked open data cloud

Time-series Clustering

Goal 2: Find the representative temporal patterns of entity dynamics

• Clustering algorithm: k-means++ [Arthur and Vassilvitskii 07]
  • Introduces an improved initial seeding into k-means
• Distance measure: Euclidean distance
  • The most efficient measure for the distance between time series with a reasonably high accuracy [Wang et al. 13]
• Optimization of the number of clusters: average silhouette
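The clustering step can be illustrated with a small pure-Python k-means that uses k-means++ style seeding and the Euclidean distance. This is a teaching sketch, not the implementation used in the study. The toy time series are invented, the number of clusters is fixed by hand instead of being chosen by the average silhouette, and the seeding assumes the input contains at least k distinct series.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length series."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeanspp_seeds(series, k, rng):
    """First seed uniformly at random; each further seed with probability
    proportional to the squared distance to the nearest chosen seed."""
    centroids = [list(rng.choice(series))]
    while len(centroids) < k:
        d2 = [min(sq_dist(s, c) for c in centroids) for s in series]
        centroids.append(list(rng.choices(series, weights=d2, k=1)[0]))
    return centroids

def kmeans(series, k, iterations=20, seed=0):
    rng = random.Random(seed)
    centroids = kmeanspp_seeds(series, k, rng)
    labels = [0] * len(series)
    for _ in range(iterations):
        # Assignment step: nearest centroid for every series.
        labels = [min(range(k), key=lambda j: sq_dist(s, centroids[j]))
                  for s in series]
        # Update step: centroid = component-wise mean of its members.
        for j in range(k):
            members = [s for s, lab in zip(series, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Toy data: two nearly static entities and two highly dynamic ones.
toy = [[0.0, 0.0, 0.0], [0.1, 0.1, 0.1], [0.9, 0.9, 0.9], [1.0, 1.0, 1.0]]
labels, centroids = kmeans(toy, k=2)
```

On this toy input the two nearly static series end up in one cluster and the two dynamic series in the other. Each centroid is the component-wise mean of the series in its cluster.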

Page 11: Information-theoretic analysis of entity dynamics on the linked open data cloud

Periodicity Detection

We see the centroids of the resulting clusters as patterns of entity dynamics

We assume that the amount of change in entities has some periodicity

• Periodicity Detection
  • The task of detecting periodicity in a time series
  • Example 1: (1, 3, 2, 1, 3, 2) -> periodicity of three
  • Example 2: (1, 2, 1, 2, 1, 2) -> periodicity of two
  • Employ a convolution-based algorithm [Elfeky et al. 05]
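The two examples can be checked with a naive exact-match detector. Note that this is a deliberately simple stand-in and not the convolution-based algorithm of [Elfeky et al. 05]. It only finds a period if the series repeats exactly, and the function name `smallest_period` is hypothetical.

```python
def smallest_period(series):
    """Return the smallest p (at most half the length) such that every
    value equals the value p positions later, or None if there is none."""
    n = len(series)
    for p in range(1, n // 2 + 1):
        if all(series[i] == series[i + p] for i in range(n - p)):
            return p
    return None

print(smallest_period((1, 3, 2, 1, 3, 2)))  # 3
print(smallest_period((1, 2, 1, 2, 1, 2)))  # 2
print(smallest_period((1, 2, 3, 4)))        # None
```

Real change time series are noisy, so an exact-match test like this would rarely fire. That is one reason to use an algorithm that tolerates imperfect repetitions.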

Page 12: Information-theoretic analysis of entity dynamics on the linked open data cloud

Dataset

• Dynamic Linked Data Observatory (DyLDO) dataset [Käfer et al. 12]
  • Weekly snapshots of a fixed set of LOD documents
  • 165 snapshots over three years (05/2012 to 07/2015)
• Entities in the DyLDO dataset
  • Almost 75% of entities appear in only one snapshot
  • Focus on entities that appear in >70% of snapshots

Entity representation   # of unique entities in 165 snapshots   # of entities that appear in >70% of snapshots
Out                     27,788,902                              2,909,700
InOut                   29,097,929                              2,950,533
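The filtering step (keeping only entities that appear in more than 70% of the snapshots) could look like the sketch below. It assumes each snapshot is available as a set of entity URIs. The URIs and the function name `frequent_entities` are invented for the example.

```python
from collections import Counter

def frequent_entities(snapshots, min_ratio=0.7):
    """Entities that occur in more than min_ratio of all snapshots."""
    counts = Counter(uri for snap in snapshots for uri in set(snap))
    return {uri for uri, c in counts.items() if c / len(snapshots) > min_ratio}

# Four toy snapshots: db:A occurs in all, db:B in half, db:C in one.
snapshots = [
    {"db:A", "db:B"},
    {"db:A"},
    {"db:A", "db:C"},
    {"db:A", "db:B"},
]
kept = frequent_entities(snapshots)
print(kept)  # {'db:A'}
```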

Page 13: Information-theoretic analysis of entity dynamics on the linked open data cloud

Patterns of Entity Dynamics (1/3)

• Analysis with respect to eight conditions
  • The conditions combine two entity representations, two distance measures, and two triple weighting methods
• Result of clustering
  • The number of clusters is smaller when using combIC
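The eight conditions are simply the Cartesian product of the three binary choices (2 x 2 x 2 = 8). A short sketch enumerates them, with labels written the way the result panels are titled:

```python
from itertools import product

representations = ["Out", "InOut"]
distances = ["Cosine", "Euclidean"]
weightings = ["Baseline", "CombIC"]

# One label per condition, e.g. "Out Cosine Baseline".
conditions = [" ".join(c) for c in product(representations, distances, weightings)]
print(len(conditions))  # 8
```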

Page 14: Information-theoretic analysis of entity dynamics on the linked open data cloud

Patterns of Entity Dynamics (2/3)

[Figure panels: Out Cosine Baseline, Out Cosine CombIC, Out Euclidean Baseline, Out Euclidean CombIC]

Page 15: Information-theoretic analysis of entity dynamics on the linked open data cloud

Patterns of Entity Dynamics (3/3)

[Figure panels: InOut Cosine Baseline, InOut Cosine CombIC, InOut Euclidean Baseline, InOut Euclidean CombIC]

Page 16: Information-theoretic analysis of entity dynamics on the linked open data cloud

Periodicity of Entity Dynamics

We observe periodicities in the temporal dynamics of entities

• e.g., “Periodicity of 56” indicates that the amount of entity change varies with a one-year cycle
• Different patterns have different periodicities

Page 17: Information-theoretic analysis of entity dynamics on the linked open data cloud

Features for Entity Dynamics (1/2)

Goal 3: Find out which features of entities are more likely to define the temporal dynamics of entities

• Four features of entities
  • RDF type ($f_1$)
  • Property ($f_2$)
  • Union of RDF types and properties ($f_3$)
  • Pay level domain (PLD) of the entity URI ($f_4$)
    • e.g., http://dbpedia.org/resource/The_Beatles -> dbpedia.org
• Evaluate the four features by RandIndex
  • RandIndex: a metric for comparing two clusterings
  • Measures how well the clustering by a feature agrees with the clustering by entity dynamics (i.e., time-series vectors)
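To make the evaluation concrete, the sketch below computes the Rand index between a grouping of entities by a feature and a grouping by their dynamics. The Rand index is the fraction of entity pairs on which two clusterings agree, meaning both put the pair together or both keep it apart. The `naive_pld` helper is a rough approximation that returns the host name of the URI. A real PLD extraction would need a public suffix list. The example URIs other than the dbpedia.org one, and the dynamics labels, are invented.

```python
from itertools import combinations
from urllib.parse import urlparse

def naive_pld(uri):
    """Approximate the pay level domain by the URI's host name (assumption)."""
    return urlparse(uri).hostname

def rand_index(labels_a, labels_b):
    """Fraction of item pairs treated the same way by both clusterings."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)

uris = [
    "http://dbpedia.org/resource/The_Beatles",
    "http://dbpedia.org/resource/Kiel",
    "http://example.org/a",
    "http://example.org/b",
]
by_pld = [naive_pld(u) for u in uris]  # clustering induced by feature f4
by_dynamics = [0, 0, 1, 0]             # invented time-series cluster labels
score = rand_index(by_pld, by_dynamics)
print(score)  # 0.5
```

A value of 1.0 means the feature reproduces the dynamics-based clustering exactly. In this toy case the two groupings agree on three of the six entity pairs.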

Page 18: Information-theoretic analysis of entity dynamics on the linked open data cloud

Features for Entity Dynamics (2/2)

• Entities that share a common PLD are more likely to have similar temporal dynamics when the baseline is employed for triple weighting
• When using combIC, entities that have a common RDF type or ECS are more likely to belong to the same cluster

Page 19: Information-theoretic analysis of entity dynamics on the linked open data cloud

Thank you for your attention!

Project consortium and funding agency

MOVING is funded by the EU Horizon 2020 Programme under the project number INSO-4-2015: 693092

Page 20: Information-theoretic analysis of entity dynamics on the linked open data cloud

Conclusion

• Temporal dynamics of entities on the LOD cloud
  • Represent the temporal dynamics of entities as time series
  • Find the representative temporal patterns of entity dynamics
  • Find out which features of entities are more likely to define the temporal dynamics of entities
• Future work
  • e.g., SPARQL query caching

Page 21: Information-theoretic analysis of entity dynamics on the linked open data cloud

Appendix

Page 22: Information-theoretic analysis of entity dynamics on the linked open data cloud

References

• [Arthur and Vassilvitskii 07] D. Arthur and S. Vassilvitskii. k-means++: The advantages of careful seeding. SODA, 2007.

• [Elfeky et al. 05] M.G. Elfeky, W.G. Aref, and A.K. Elmagarmid. Periodicity detection in time series databases. IEEE TKDE, 2005.

• [Käfer et al. 12] T. Käfer, J. Umbrich, A. Hogan, and A. Polleres. Towards a dynamic linked data observatory. LDOW, 2012.

• [Käfer et al. 13] T. Käfer, A. Abdelrahman, J. Umbrich, P. O’Byrne, and A. Hogan. Observing linked data dynamics. ESWC, 2013.

• [Neumann and Moerkotte 11] T. Neumann and G. Moerkotte. Characteristic sets: Accurate cardinality estimation for RDF queries with multiple joins. ICDE, 2011.

• [Schuhmacher and Ponzetto 14] M. Schuhmacher and S.P. Ponzetto. Knowledge-based graph document modeling. WSDM, 2014.

• [Wang et al. 13] X. Wang, A. Mueen, H. Ding, G. Trajcevski, P. Scheuermann, and E. Keogh. Experimental comparison of representation methods and distance measures for time series data. Data Mining and Knowledge Discovery, 2013.

• [Yang and Leskovec 11] J. Yang and J. Leskovec. Patterns of temporal variation in online media. WSDM, 2011.


Page 23: Information-theoretic analysis of entity dynamics on the linked open data cloud

Entities in the DyLDO dataset

• Distribution of the number of appearances of entities in the 165 snapshots

[Figure panels: Entity representation Out, Entity representation InOut]

Page 24: Information-theoretic analysis of entity dynamics on the linked open data cloud

Resulting Temporal Patterns (1/4)

[Figure panels: Out Cosine Baseline, Out Cosine CombIC]

Page 25: Information-theoretic analysis of entity dynamics on the linked open data cloud

Resulting Temporal Patterns (2/4)

[Figure panels: Out Euclidean Baseline, Out Euclidean CombIC]

Page 26: Information-theoretic analysis of entity dynamics on the linked open data cloud

Resulting Temporal Patterns (3/4)

[Figure panels: InOut Cosine Baseline, InOut Cosine CombIC]

Page 27: Information-theoretic analysis of entity dynamics on the linked open data cloud

Resulting Temporal Patterns (4/4)

[Figure panels: InOut Euclidean Baseline, InOut Euclidean CombIC]

Page 28: Information-theoretic analysis of entity dynamics on the linked open data cloud

Presentation

• PROFILES 2016