
KIT – The Research University in the Helmholtz Association

INSTITUTE OF APPLIED INFORMATICS AND FORMAL DESCRIPTION METHODS (AIFB)

www.kit.edu

PageRank on Wikipedia: Towards General Importance Scores for Entities

Andreas Thalhammer and Achim Rettinger

Know@LOD 2016 30.05.2016


Outline

Motivation

Background

Experiments

Conclusions

06.06.2016 Andreas Thalhammer and Achim Rettinger – PageRank on Wikipedia: Towards

General Importance Scores for Entities


MOTIVATION


PageRank on Wikipedia


[3]

Wikipedia pages link to each other.

Wikipedia can be regarded as a link graph.

PageRank needs a link graph as an input.

The output of PageRank is a ranking score for each Web page.


Idea: Combine Wikipedia PageRank scores with DBpedia/Wikidata

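# rdf: and dbo: are declared explicitly so the query also runs on endpoints without predefined prefixes.
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX v: <http://purl.org/voc/vrank#>

SELECT ?e ?r
FROM <http://dbpedia.org>
FROM <http://people.aifb.kit.edu/ath/#DBpedia_PageRank>
WHERE {
  ?e rdf:type dbo:Scientist ;
     v:hasRank/v:rankValue ?r .
}
ORDER BY DESC(?r) LIMIT 5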


Every resource can be ranked.

SPARQL result sets are often large.

PageRank scores capture a general notion of importance.

dbpedia:Carl_Linnaeus 551.791

dbpedia:Charles_Darwin 215.028

dbpedia:Albert_Einstein 186.549

dbpedia:Isaac_Newton 167.811

dbpedia:Sigmund_Freud 140.245

Done. Wait!


Who is Carl Linnaeus?


Main idea: Change the input of the PageRank algorithm.


BACKGROUND


The PageRank algorithm

Important parameters:

l(w) – returns all pages that link to w.

c(w) – the number of outgoing links of w.

d – the damping factor

Traditional PageRank [1]:
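In the notation above, the formulation commonly attributed to [1] can be written as follows. The name pr(·) for the score of a page is an assumption made here for illustration.

```latex
pr(p) = (1 - d) + d \sum_{w \in l(p)} \frac{pr(w)}{c(w)}
```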

Variant: Weighted Links Rank (WLRank) [2]:

Link weights (lw): relative position of a link in the article
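A minimal sketch of both variants, not the authors' implementation. It assumes the non-normalised form in which each page starts from (1 - d) and receives d times the score shares of the pages linking to it. For the weighted variant it assumes that the uniform share 1/c(w) is replaced by the weight of a link divided by the total weight of all outgoing links of w. The function name `pagerank`, the `link_weight` callback, and the toy graph and weights are hypothetical.

```python
def pagerank(out_links, d=0.85, iterations=100, link_weight=None):
    """Iteratively score pages.

    out_links: dict mapping each page to the list of pages it links to.
    link_weight: optional function (source, target) -> positive weight;
                 if omitted, every link weighs the same (plain PageRank).
    """
    if link_weight is None:
        link_weight = lambda source, target: 1.0

    # Collect every page that appears as a source or as a target.
    pages = set(out_links)
    for targets in out_links.values():
        pages.update(targets)

    # Share of a source's score that flows along each link.
    # With uniform weights this is exactly 1 / c(source).
    share = {}
    for source, targets in out_links.items():
        total = sum(link_weight(source, t) for t in targets)
        for t in targets:
            share[(source, t)] = share.get((source, t), 0.0) + link_weight(source, t) / total

    scores = {page: 1.0 for page in pages}
    for _ in range(iterations):
        updated = {page: 1.0 - d for page in pages}
        for (source, target), fraction in share.items():
            updated[target] += d * scores[source] * fraction
        scores = updated
    return scores


# Toy graph (hypothetical): a links to b and c, b links to c, c links to a.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
plain = pagerank(graph)

# Hypothetical weights: the link a->b counts three times as much as a->c,
# e.g. because it appears earlier in the article text.
weights = {("a", "b"): 3.0}
weighted = pagerank(graph, link_weight=lambda s, t: weights.get((s, t), 1.0))
```

Pages without outgoing links simply pass on no score in this sketch; a production implementation would have to decide how to treat them.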


[3]


The Wikipedia link graph

We use non-rendered versions of articles.

Wiki syntax:

Link: [[brown bear|bears]]

Template: {{ ... }}

Link in a template: {{.... [[brown bear|bears]] ...}}

DBpedia Extraction Framework

Extracts links from the wikitext.

Doesn’t consider:

links from category pages;

links in <ref> tags.
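The distinction between links in the article text and links inside templates can be sketched with a small scanner over the wikitext. This is an illustration only and not the DBpedia Extraction Framework: the function name `split_links` is hypothetical, the pattern covers only simple [[target|label]] links, and <ref> tags are not handled.

```python
import re

# Matches a template opener, a template closer, or a simple wiki link.
# Group 1 captures the link target without section anchor or label.
TOKEN = re.compile(r"\{\{|\}\}|\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]")


def split_links(wikitext):
    """Return (article_text_links, template_links) as lists of link targets."""
    depth = 0  # current nesting depth of {{ ... }}
    text_links, template_links = [], []
    for match in TOKEN.finditer(wikitext):
        token = match.group(0)
        if token == "{{":
            depth += 1
        elif token == "}}":
            depth = max(depth - 1, 0)
        else:
            target = match.group(1).strip()
            (template_links if depth > 0 else text_links).append(target)
    return text_links, template_links


sample = (
    "The [[brown bear|bears]] live in forests. "
    "{{Infobox animal | habitat = [[forest]] }} "
    "See also [[Alaska]]."
)
text_links, template_links = split_links(sample)
```

Keeping the two lists separate is what allows one link graph to be built from article text only and another from template links only.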


EXPERIMENTS


Compared rankings

Newly constructed rankings:

ALL – all links from the article text and from the templates

ATL – article text links

ATL-RP – article text links with WLRank and relative position

ABL – abstract links

TEL – template links

Reference rankings:

DBP 2014 – PageRank on “Pagelinks” dataset

DBP 2015-04 – PageRank on “Pagelinks” dataset

DBP-U 2015-04 – PageRank on “Pagelinks unredirected” dataset

TOWR-PV – “The Open Wikipedia Ranking” page views

SUB – “SubjectiveEye3D” by Paul Houle, page views


Setup

Measures (pairwise):

Kendall’s tau

Spearman’s rho
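Both measures compare two rankings of the same items pairwise. The sketch below uses the textbook definitions for rankings without tied scores (Spearman's rho from squared rank differences, Kendall's tau as the normalised difference between concordant and discordant pairs). The function names and the toy scores are hypothetical and not taken from the experiments.

```python
from itertools import combinations


def ranks(scores):
    """Map each item to its rank (1 = highest score); assumes no tied scores."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {item: position for position, item in enumerate(ordered, start=1)}


def spearman_rho(scores_a, scores_b):
    """Spearman's rho over the items present in both score dictionaries."""
    common = sorted(set(scores_a) & set(scores_b))
    rank_a = ranks({item: scores_a[item] for item in common})
    rank_b = ranks({item: scores_b[item] for item in common})
    n = len(common)
    squared = sum((rank_a[item] - rank_b[item]) ** 2 for item in common)
    return 1 - 6 * squared / (n * (n * n - 1))


def kendall_tau(scores_a, scores_b):
    """Kendall's tau (tau-a) over the items present in both dictionaries."""
    common = sorted(set(scores_a) & set(scores_b))
    concordant = discordant = 0
    for x, y in combinations(common, 2):
        agreement = (scores_a[x] - scores_a[y]) * (scores_b[x] - scores_b[y])
        if agreement > 0:
            concordant += 1
        elif agreement < 0:
            discordant += 1
    pairs = len(common) * (len(common) - 1) / 2
    return (concordant - discordant) / pairs


# Toy scores (made up): the two rankings disagree only on the order of A and B.
link_based = {"A": 5.0, "B": 4.0, "C": 3.0, "D": 2.0}
view_based = {"A": 4.0, "B": 5.0, "C": 3.0, "D": 2.0}
rho = spearman_rho(link_based, view_based)
tau = kendall_tau(link_based, view_based)
```

For rankings with many tied scores, a tie-aware variant of both measures would be the safer choice.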

Hypotheses:

1. Links in templates are created in a “please fill out” manner and have a rather negative influence on the general salience that PageRank scores should represent.

2. Links near the beginning of an article are clicked more often, so WLRank with position-based weights correlates more strongly with page views.


Results

[Tables: pairwise Spearman’s rho and pairwise Kendall’s tau between the compared rankings; input graph size per ranking. The values are not preserved in this transcript.]


SUMMARY


Conclusions

Hypothesis 1 (removing template links):

Subjectively, the PageRank scores got “better”: the PageRank of Carl Linnaeus was reduced.

Objectively, the PageRank scores do not change strongly.

Using only template links is not recommended for computing PageRank (note that DBpedia extracts most semantic links from there).

Hypothesis 2 (ranking with relative positions):

WLRank with weights on the link positions in the article text correlates more strongly with pageview-based rankings.


Further insights

Rank  Pageview-based (SubjectiveEye3D)  Link-based (DBpedia PageRank)
1     YouTube                           Category:Living_people
2     Searching                         United_States
3     Facebook                          List_of_sovereign_states
4     United_States                     Animal
5     Undefined                         France
6     Lists_of_deaths_by_year           United_Kingdom
7     Wikipedia                         World_War_II
8     The_Beatles                       Germany
9     Barack_Obama                      Canada
10    Web_search_engine                 India
11    Google                            Iran
12    Michael_Jackson                   Association_football
13    Sex                               England
14    Lady_Gaga                         Australia
15    World_War_II                      Arthropod
16    United_Kingdom                    Insect
17    Eminem                            Russia
18    Lil_Wayne                         Japan
19    Adolf_Hitler                      China
20    India                             Italy
21    Justin_Bieber                     English_language
22    How_I_Met_Your_Mother             Poland
23    The_Big_Bang_Theory               London
24    World_War_I                       Spain
25    Miley_Cyrus                       New_York_City
26    Glee_(TV_series)                  Catholic_Church
27    Favicon                           World_War_I
28    Canada                            Bakhsh
29    Sex_position                      Latin
30    Kim_Kardashian                    Village

Page views or PageRank?

(it depends, but in general)

Both!


Resources

Entity rankings:

DBpedia PageRank: http://people.aifb.kit.edu/ath/#DBpedia_PageRank

SubjectiveEye3D: https://github.com/paulhoule/telepath/wiki/SubjectiveEye3D

The Open Wikipedia Ranking: http://wikirank.di.unimi.it/

Applications (of DBpedia PageRank):

LinkSUM entity summarization system: http://km.aifb.kit.edu/services/link/

DBtrends: http://dbtrends.aksw.org


Questions?

[email protected]

@thalhamm


References

[1] S. Brin and L. Page. The Anatomy of a Large-scale Hypertextual Web Search Engine. In Proceedings of the Seventh International Conference on World Wide Web 7, WWW7, pages 107–117. Elsevier Science Publishers B. V., Amsterdam, The Netherlands, 1998.

[2] R. Baeza-Yates and E. Davis. Web Page Ranking Using Link Attributes. In Proceedings of the 13th International World Wide Web Conference on Alternate Track Papers & Posters, WWW Alt. ’04, pages 328–329, New York, NY, USA, 2004. ACM.

[3] Artwork by Felipe Micaroni Lalli ([email protected]).
