Wikipedia as a Time Machine - WWW 2014 TempWeb Workshop Presentation

Wikipedia as a Time Machine. Stewart Whiting and Joemon M. Jose (University of Glasgow, Scotland, UK); Omar Alonso (Microsoft Bing, Mountain View, CA, USA). Temporal Web Analytics Workshop 2014

Upload: stewhir

Post on 08-Jul-2015


DESCRIPTION

An overview of using Wikipedia time signal data. These are the slides for the TempWeb workshop paper: http://www.stewh.com/wp-content/uploads/2014/02/w14temp07-whiting.pdf

TRANSCRIPT

Page 1: Wikipedia as a Time Machine - WWW 2014 TempWeb Workshop Presentation

Wikipedia as a Time Machine

Stewart Whiting and Joemon M. Jose

University of Glasgow, Scotland, UK

Omar Alonso

Microsoft Bing, Mountain View, CA, USA

Temporal Web Analytics Workshop 2014

Page 2: Wikipedia as a Time Machine - WWW 2014 TempWeb Workshop Presentation

Understanding Wikipedia

Huge collaborative encyclopaedic effort

6th most visited website on the internet [Alexa]

Anyone can create and edit content

Moderator-curated

Reflects time-based news, culture and phenomena

English Wikipedia started in 2001

Now contains 4.5M+ articles

~20.4 revisions per article

Vast amounts of open data

Rich structure (article hierarchy, linking, taxonomies – semantics)

Page 3: Wikipedia as a Time Machine - WWW 2014 TempWeb Workshop Presentation

Wikipedia as a Time Machine

Wikipedia offers a great deal of time information:

Text content

People write about the past/present/future

Explicit/implicit structure

Meta-data signals

Pulse of real-time activity

Side-effects of temporal user interest - without needing a query log!

Insight into:

Story

Temporal sequencing

Entity relationships

Impact

Page 4: Wikipedia as a Time Machine - WWW 2014 TempWeb Workshop Presentation

This Talk

How can we discover, understand and track past, present and future temporal topics using Wikipedia?

And how can this knowledge be exploited in time-aware information retrieval tasks?

Page 5: Wikipedia as a Time Machine - WWW 2014 TempWeb Workshop Presentation

IR & Wikipedia

Wikipedia text and structure used extensively in many non-temporal IR tasks

Semantic Similarity/Relatedness Measures [GabrilovichEtAl2007 – Wiki. Explicit Semantic Analysis] [StrubeEtAl2006 – WikiRelate!]

External Collection Query Expansion [XuEtAl2009]

Query Intent Modelling [HuEtAl2009]

Cross-Lingual IR [PotthastEtAl2008]

Entity Tasks – Recognition, Disambiguation etc. [Many!]

Page 6: Wikipedia as a Time Machine - WWW 2014 TempWeb Workshop Presentation

Time-aware IR & Wikipedia

Using Wikipedia temporal signals in time-aware IR tasks

Event/Topic Detection & Tracking

Detection/tracking: [CiglanNorvag2010, OsborneEtAl2012, SteinerEtAl2013]

Summarisation: [GeorgescuEtAl2013, WhitingEtAl2012]

Evaluation (ground-truth): [McMinnEtAl2013]

Event Visualisation [WattenbergEtAl2007]

Temporal Semantics - Entity/Fact Extraction [WangEtAl2010, BalogNorvag2012]

Temporal Query Intent Modelling

Ambiguous intents: [ZhouEtAl2013]

Multi-faceted intents: [WhitingEtAl2013]

There are many opportunities…

Page 7: Wikipedia as a Time Machine - WWW 2014 TempWeb Workshop Presentation

Wikipedia Characteristics

How quickly does Wikipedia reflect the world?

What topic coverage does it offer?

Is Wikipedia content high-quality?

Can it be trusted?

Page 8: Wikipedia as a Time Machine - WWW 2014 TempWeb Workshop Presentation

Freshness/Timeliness

Latency

‘Mainstream’ events: very small (<30 mins? <2 hours? Depends who you ask…)

KBA filtering task at TREC: improve event coverage/speed

Pope Benedict XVI’s Resignation

EN and FR articles updated at 10:58 and 11:00

Reuters broke the news at 10:59, following the Vatican announcement at 10:57:47

Whitney Houston’s Death

Reported on Twitter at 00:15 UTC by the niece of the hotel worker who found her

Spread through Twitter, confirmed by AP via Twitter at 00:57 UTC

WH’s article updated to ‘has died’ at 01:01 UTC
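The latencies implied by these timestamps can be worked out directly. The following is a minimal sketch, assuming all times in each example fall on the same day, that minute-level times mean hh:mm:00, and that the two Pope Benedict update times are listed in the same order as the languages (EN first). The function name `seconds_between` is hypothetical. Because the article update times are only given to the minute, the results are accurate to within a minute at best.

```python
from datetime import datetime


def seconds_between(earlier: str, later: str) -> int:
    """Seconds from `earlier` to `later`; both are HH:MM or HH:MM:SS on the same day."""
    def parse(value: str) -> datetime:
        fmt = "%H:%M:%S" if value.count(":") == 2 else "%H:%M"
        return datetime.strptime(value, fmt)

    return int((parse(later) - parse(earlier)).total_seconds())


# Times as quoted on the slide (assumption: same calendar day for each pair).
pope_en_latency = seconds_between("10:57:47", "10:58")  # Vatican announcement -> EN article
houston_from_first_tweet = seconds_between("00:15", "01:01")  # first tweet -> article update
houston_from_ap = seconds_between("00:57", "01:01")  # AP confirmation -> article update

print(pope_en_latency, houston_from_first_tweet, houston_from_ap)
```

Measured against the first unconfirmed tweet the Houston article lags by 46 minutes, but measured against the AP confirmation it lags by only 4 minutes, which is why the answer to "how fast?" depends on which reference event is chosen.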

Page 9: Wikipedia as a Time Machine - WWW 2014 TempWeb Workshop Presentation

Topic Coverage

Not all topics covered representatively

Events may only appear as a sentence or sub-section of main article (e.g. a celebrity in a scandal)

Separate article(s) created for major events, e.g. 39th G8 Summit, 2013 North India Floods

See Also: Response to…, Criticisms of… etc.

Meta-data signals quantify impact

[Figure source: An Analysis of Topical Coverage of Wikipedia, Halavais and Lackaff, 2008]

Page 10: Wikipedia as a Time Machine - WWW 2014 TempWeb Workshop Presentation

Content Quality

Ideally, facts verified by 3rd party through citations

Plenty of editorial guidelines

“Wikipedia is not a newspaper”

Bots make lots of changes

Talk pages contain temporal discourse

Sometimes prominent articles are locked: far fewer edits (but pre-verified)

Period  Digest

1  {{death}} (Refers to the article ’infobox’ with birth and death dates.)

2  Houston died on February 11, 2012. Publicist Kristen Foster said Saturday that the singer had died, but the cause of her death was unknown. She died in [[Ottawa]], [[Canada]].

3  [Similar to previous.]

4  On February 11, 2012, publicist Kristen Foster revealed Houston had died aged 48. A cause of death was not immediately given. She died in her Beverly Hills home.

5  [Similar to previous.]

6  [Similar to previous.]

7  On February 11, 2012, publicist Kristen Foster revealed Houston had died from unspecified causes at the age of 48, with unconfirmed reports suggesting her death occurred in her room at the [[Beverly Hilton Hotel]].

8  Houston released her new album, ”[[I Look to You]]”, on August 2009. The album’s first two singles are "I Look to You" and "Million Dollar Bill". The album entered the [[Billboard 200]] at No. 1...

9  Local police said there were "no obvious signs of criminal intent." Two days prior to her death, witnesses reported seeing Houston behave erratically. They were rumored that she died of drug overdose.

Page 11: Wikipedia as a Time Machine - WWW 2014 TempWeb Workshop Presentation

Data Sources

Several openly available Wikipedia data sources

Page APIs: easy random access to revisions etc. (slow!)

Article Creation/Change IRC Channels: all updates, no full-text

Article Creation/Change RSS/Atom Feeds: not all updates, but includes full-text content

XML Article Dumps (monthly): all article/page revisions (EN is 7TB decompressed!), or current article revision only

Need a cluster to derive more useful datasets

Page View Dumps (hourly): measure of article popularity, since end 2007

See stats.grok.se for an easier interface

[Figure: May 2013 daily article changes RSS feed volume (in log scale) for Wikipedia EN, FR, IT, DE and ES]
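As an illustration of the page API route, the sketch below builds a MediaWiki `action=query` request for an article's revision history and pulls the timestamps out of a JSON response. It is a hedged example rather than the authors' tooling: the helper names `revisions_url` and `revision_timestamps` are hypothetical, the sample response is invented to show the `formatversion=2` shape, and no network request is made here (the URL could be fetched with `urllib.request.urlopen`).

```python
import json
from typing import List
from urllib.parse import urlencode

API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def revisions_url(title: str, limit: int = 50) -> str:
    """Build a MediaWiki API URL asking for recent revisions of one article."""
    params = {
        "action": "query",
        "prop": "revisions",
        "titles": title,
        "rvprop": "timestamp|user|comment",
        "rvlimit": limit,
        "format": "json",
        "formatversion": 2,
    }
    return API_ENDPOINT + "?" + urlencode(params)


def revision_timestamps(response_text: str) -> List[str]:
    """Extract revision timestamps from a formatversion=2 JSON response."""
    data = json.loads(response_text)
    stamps = []
    for page in data.get("query", {}).get("pages", []):
        for revision in page.get("revisions", []):
            stamps.append(revision["timestamp"])
    return stamps


# Invented sample response, only to show the expected structure.
SAMPLE_RESPONSE = json.dumps({
    "query": {"pages": [{
        "title": "Arab Spring",
        "revisions": [
            {"timestamp": "2011-01-28T10:02:00Z", "user": "ExampleUser", "comment": "update"},
            {"timestamp": "2011-01-27T21:15:30Z", "user": "ExampleUser2", "comment": "create"},
        ],
    }]}
})

print(revisions_url("Arab Spring", limit=2))
print(revision_timestamps(SAMPLE_RESPONSE))
```

Long histories are returned in batches, so a real crawler would need to follow the API's continuation parameters, which is one reason this route is slow compared with the dumps.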

Page 12: Wikipedia as a Time Machine - WWW 2014 TempWeb Workshop Presentation

Current Events Portal

Manually curated list of recent/ongoing mainstream events

Ad-hoc taxonomy, e.g. finance, sports, deaths, politics etc.

Used as a ground-truth for automated TDT evaluation

May 2013: Avg. 15 (±6) articles per day

Page 13: Wikipedia as a Time Machine - WWW 2014 TempWeb Workshop Presentation

Temporal Expressions

Using a temporal tagger (e.g. HeidelTime)

Extracted dates in article content

YEAR, MONTH-YEAR and DAY-MONTH-YEAR

[Figure: Year mentions in Wikipedia English from 1900 to 2020]

Visualises past and future time coverage

9/11, 2001 is a large spike

1st/2nd World Wars also prominent

‘Recentism’ - biased coverage of recent information
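A rough version of the year-mention histogram can be sketched without a full temporal tagger. The example below swaps HeidelTime for a plain regular expression that only catches four-digit YEAR mentions between 1900 and 2020; it ignores MONTH-YEAR and DAY-MONTH-YEAR expressions and will also count four-digit numbers that are not years. The function name `year_mentions` and the sample sentence are assumptions for illustration.

```python
import re
from collections import Counter

# Four-digit years from 1900 to 2020, not embedded in a longer number or word.
YEAR_PATTERN = re.compile(r"\b(19\d{2}|20[01]\d|2020)\b")


def year_mentions(text: str) -> Counter:
    """Count how often each year in 1900-2020 is mentioned in `text`."""
    return Counter(int(year) for year in YEAR_PATTERN.findall(text))


sample = (
    "The 9/11 attacks took place in 2001. The First World War began in 1914; "
    "the Second ended in 1945. By 2001 standards, 1850 is out of range, "
    "and 2020 is still in the future."
)
counts = year_mentions(sample)
for year in sorted(counts):
    print(year, counts[year])
```

Summing such counters over every article would give the kind of per-year distribution in which spikes and recentism become visible.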

Page 14: Wikipedia as a Time Machine - WWW 2014 TempWeb Workshop Presentation

Page Edit Stream

[Figure: ‘Arab Spring’ daily article edit frequency and length (in characters) from 27th January 2011 to 23rd March 2012]

Derived from historic revision dumps, RSS or IRC feeds

Changed text can be mined for summaries, inc. references

Look for links, sections, images in markup
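Daily edit frequency is a simple aggregation once revision timestamps are available. This is a minimal sketch, assuming timestamps in the ISO 8601 UTC form used by MediaWiki (for example `2011-01-27T09:15:00Z`); the function name `daily_edit_counts` and the sample timestamps are made up for illustration.

```python
from collections import Counter
from datetime import datetime
from typing import Dict, Iterable


def daily_edit_counts(timestamps: Iterable[str]) -> Dict[str, int]:
    """Map each UTC day (YYYY-MM-DD) to the number of revisions made on it."""
    counts = Counter()
    for stamp in timestamps:
        day = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ").date()
        counts[day.isoformat()] += 1
    # Sort by day so the result reads as a time series.
    return dict(sorted(counts.items()))


sample_stamps = [
    "2011-01-28T00:01:59Z",
    "2011-01-27T09:15:00Z",
    "2011-01-27T18:40:12Z",
]
print(daily_edit_counts(sample_stamps))
```

Days with no edits are simply absent from the result, so a plotting step would need to fill them with zeros.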

Page 15: Wikipedia as a Time Machine - WWW 2014 TempWeb Workshop Presentation

Temporal Article Structure

Changes in article (sub-)sections

Finer-grained interest over time

People edit what is changing

Evolving section hierarchy

A temporal directed acyclic graph

[Figure: Cumulative ‘Arab Spring’ article section edit frequency since 27th January 2011]
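One way to track the evolving section hierarchy is to parse the `== Heading ==` markup of each revision into root-to-section paths and compare consecutive revisions. The sketch below is an assumption about how this could be done, not the authors' MapReduce code; `section_paths` and `added_sections` are hypothetical names and the wikitext snippets are invented.

```python
import re
from typing import List, Tuple

# A heading line: 2-6 '=' signs, the title, then the same number of '=' signs.
HEADING = re.compile(r"^(={2,6})[ \t]*(.+?)[ \t]*\1[ \t]*$", re.MULTILINE)


def section_paths(wikitext: str) -> List[Tuple[str, ...]]:
    """Return one ('root', ..., section) path per heading, in document order."""
    paths = []
    stack = []  # (level, title) pairs for the currently open sections
    for match in HEADING.finditer(wikitext):
        level = len(match.group(1))
        # Close sections at the same or a deeper level before opening this one.
        while stack and stack[-1][0] >= level:
            stack.pop()
        stack.append((level, match.group(2)))
        paths.append(("root",) + tuple(title for _, title in stack))
    return paths


def added_sections(old_text: str, new_text: str) -> List[Tuple[str, ...]]:
    """Section paths present in the new revision but not in the old one."""
    old_paths = set(section_paths(old_text))
    return [path for path in section_paths(new_text) if path not in old_paths]


old_revision = "Intro\n== Background ==\ntext\n== Countries ==\n=== Tunisia ===\n"
new_revision = old_revision + "=== Egypt ===\n== Reactions ==\n"
print(added_sections(old_revision, new_revision))
```

Counting, per path, how many revisions touch the text under it would give a cumulative section edit frequency of the kind shown for the ‘Arab Spring’ article.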

Page 16: Wikipedia as a Time Machine - WWW 2014 TempWeb Workshop Presentation

Temporal Link Graph

[Figure: Cumulative ‘Arab Spring’ article in- and out-link degree since 27th January 2011]

Links created using [[article/redirect|name]] Wiki markup

Need to be careful with namespaces, languages, link naming and redirects

Can also include external ‘citation’ links
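Out-links can be pulled from a revision's markup with a regular expression, with some of the caveats above handled explicitly. The following is a simplified sketch: it skips a few non-article namespaces by prefix, drops section anchors, and resolves redirects through a caller-supplied mapping. The names `article_links` and `SKIPPED_PREFIXES`, the prefix list and the redirect table are all assumptions, and interlanguage prefixes and templates are not handled.

```python
import re
from typing import Dict, List, Optional, Tuple

# [[target]], [[target|label]] or [[target#section|label]]
WIKILINK = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|([^\[\]]*))?\]\]")
SKIPPED_PREFIXES = ("file:", "image:", "category:")  # illustrative, not exhaustive


def article_links(
    wikitext: str, redirects: Optional[Dict[str, str]] = None
) -> List[Tuple[str, str]]:
    """Return (resolved target article, displayed label) for each article link."""
    redirects = redirects or {}
    links = []
    for target, label in WIKILINK.findall(wikitext):
        target = target.strip()
        if target.lower().startswith(SKIPPED_PREFIXES):
            continue  # media and category links are not article out-links
        resolved = redirects.get(target, target)
        links.append((resolved, label.strip() or target))
    return links


sample_markup = (
    "Protests began in [[Tunisia]] and spread to [[Egypt|the Egyptian capital]]. "
    "[[Category:2011 protests]] [[File:Map.png|thumb|Map]] "
    "See [[Tunisian Revolution#Background|background]]."
)
print(article_links(sample_markup))
```

Running this over every revision of an article, and recording when each target first appears, would give the cumulative out-link degree over time; in-links require the same pass over all other articles.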

Page 17: Wikipedia as a Time Machine - WWW 2014 TempWeb Workshop Presentation

Page View Stream

Page views are very sensitive

Little correlation between page edit and viewing activity

More edits than interest at first

Correlations between articles are interesting [CiglanNorvag2010]

[Figure: ‘Arab Spring’ article daily edit frequency and page views from 27th January 2011 to 23rd March 2012]
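The relationship between edit and view activity, or between the view series of two articles, can be checked with a correlation coefficient. The sketch below uses Pearson correlation as one plausible choice; the slides do not say which measure was used, and the function name `pearson` and the example series are invented.

```python
from math import sqrt
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation of two equal-length daily series (0.0 if either is constant)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length and at least two points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for a constant series
    return covariance / sqrt(var_x * var_y)


# Made-up daily counts for one article: edits fall while views rise.
daily_edits = [40, 35, 20, 10, 5]
daily_views = [100, 400, 900, 1500, 2000]
print(round(pearson(daily_edits, daily_views), 3))
```

Both series need to be aligned on the same days first, with missing days filled with zeros, before the coefficient is meaningful.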

Page 18: Wikipedia as a Time Machine - WWW 2014 TempWeb Workshop Presentation

Final Remarks

I have various distilled datasets with me (and can arrange download + C# MapReduce code)

ArticleEditTimestamps

SampleEventSummarisation

DisambiguationPages

TemporalLinkGraphWithSections

RedirectPages

TemporalSectionChanges

TimeExpressions

120GB total, or select

Wikipedia temporal datasets cover a wide range of events, culture and phenomena

Temporal meta-data and content signals openly available

Informative power – hugely valuable for time-aware IR research

Probably won’t beat Twitter for speed, but Wiki has structure and quality control

Many open research questions and opportunities for time-aware IR!

Page 19: Wikipedia as a Time Machine - WWW 2014 TempWeb Workshop Presentation

Some Research Questions

1. How fast does Wikipedia respond to events of different types in different countries?

2. How can Wikipedia data supplement query log, Twitter and news feed streams to improve time-aware IR?

3. What do temporal correlations between linked article page views mean – is this reflected in the text content?

4. Can event similarity be measured on temporal and topical dimensions?

5. Can this temporal knowledge be used to predict interest in topics that become associated in similar ways? (E.g. actors selected by famous shows, or directors etc.)