sheila kinsella phd defense


Upload: sheila-kinsella

Post on 11-May-2015


DESCRIPTION

"Augmenting Social Media Items with Metadata using Related Web Content" - Slides from the public presentation part of my PhD defense. Presented at DERI, NUI Galway, on August 30, 2011.

TRANSCRIPT

Page 1: Sheila Kinsella PhD Defense

Copyright 2010 Digital Enterprise Research Institute. All rights reserved.

Digital Enterprise Research Institute www.deri.ie

Augmenting Social Media Items with Metadata using Related Web Content

Sheila Kinsella

Page 2: Sheila Kinsella PhD Defense

Outline

Motivation
Example Scenario
Tag prediction, Geolocation, Topic classification (each: Approach, Evaluation, Summary)
Combining the approaches
Impact
Conclusions

Page 3: Sheila Kinsella PhD Defense

Motivation

Social media is an important information source
– e.g., real-time citizen journalism, Q&A sites, niche topics
Search and navigation can be challenging
– Short and informal posts
– Items are not curated and often lack metadata
– Users conversing share a common context and therefore omit relevant information, e.g. location
Making use of related Web data can help us to infer such context information
– e.g., hyperlinks, posts with similar content

Page 4: Sheila Kinsella PhD Defense

Example Scenario: Adding metadata to a blog post

tags? topic? location?

Last night I saw Connacht play at The Sportsground. The match started well for Connacht with a great try but after half time the opposition closed the gap. Finally we managed to hold out for the win. It was a great game from both sides. Here's a clip of the first try.

Page 5: Sheila Kinsella PhD Defense

Last night I saw Connacht play at The Sportsground. The match started well for Connacht with a great try but after half time the opposition closed the gap. Finally we managed to hold out for the win. It was a great game from both sides. Here's a clip of the first try.

Example Scenario: Possible clues from content

tags? topic? location?

Page 6: Sheila Kinsella PhD Defense

Example Scenario: Possible clues from content

tags? topic? location?

Page 7: Sheila Kinsella PhD Defense

Example Scenario: Possible clues from content

tags? topic? location?

Page 8: Sheila Kinsella PhD Defense

Example Scenario: Exploiting related Web content

tags from anchortext

"...didn’t see the match but here’s a summary from John..." (href)
"...This review of the Connacht match shows that they are getting back in form!..." (href)

Last night I saw Connacht play at The Sportsground. The match started well for Connacht with a great try but after half time the opposition closed the gap. Finally we managed to hold out for the win. It was a great game from both sides. Here's a clip of the first try.

Page 9: Sheila Kinsella PhD Defense

Last night I saw Connacht play at The Sportsground. The match started well for Connacht with a great try but after half time the opposition closed the gap. Finally we managed to hold out for the win. It was a great game from both sides. Here's a clip of the first try.

Example Scenario: Exploiting related Web content

location from geotagged social media

JohnSmith (John Smith): I’m at the Galway Sportsground

Page 10: Sheila Kinsella PhD Defense

Example Scenario: Exploiting related Web content

topic from hyperlinked objects

href to a YouTube video
Title: Fionn Carr try
Category: Sport
Tags: rugby, try, carr, connacht

Last night I saw Connacht play at The Sportsground. The match started well for Connacht with a great try but after half time the opposition closed the gap. Finally we managed to hold out for the win. It was a great game from both sides. Here's a clip of the first try.

Page 11: Sheila Kinsella PhD Defense

Example Scenario: Overview of approaches

Last night I saw Connacht play at The Sportsground. The match started well for Connacht with a great try but after half time the opposition closed the gap. Finally we managed to hold out for the win. It was a great game from both sides. Here's a clip of the first try.

TAG PREDICTION
"...didn’t see the match but here’s a summary from John..." (href)
"...This review of the Connacht match shows that they are getting back in form!..." (href)

GEOLOCATION
JohnSmith (John Smith): I’m at the Galway Sportsground

TOPIC CLASSIFICATION
href to a YouTube video
Title: Fionn Carr try
Category: Sport
Tags: rugby, try, carr, connacht

Page 12: Sheila Kinsella PhD Defense

Tag Prediction: Approach

Aim: Automatic tag generation based on anchortext

1. Data collection and preprocessing
– Retrieve document and extract META information
– Retrieve inlinking documents and extract anchortext
– Preprocessing (e.g. stemming, stopword removal)

2. Tag indexing and ranking
– Generate term vectors from the preprocessed annotations
– Ranking: tf and tf-idf
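The tag indexing and ranking step can be illustrated with a minimal sketch. It is not the implementation used in the thesis: the toy corpus, the tiny stopword list and the function names are all assumptions made for illustration, and stemming is left out to keep the example short.

```python
import math
from collections import Counter

# Hypothetical toy data: target URL -> anchortexts of inlinking pages (lowercased).
ANCHORTEXTS = {
    "http://example.org/post1": ["review of the connacht match", "summary of the match"],
    "http://example.org/post2": ["summary of the budget", "budget review"],
    "http://example.org/post3": ["the weather forecast"],
}
STOPWORDS = {"of", "the", "a", "an", "and"}  # tiny illustrative list


def term_vector(anchors):
    """Bag-of-words term frequencies over all anchortexts of one document."""
    return Counter(t for a in anchors for t in a.split() if t not in STOPWORDS)


def rank_tags(url, corpus, k=5, use_idf=True):
    """Return the top-k candidate tags for `url`, ranked by tf or tf-idf."""
    vectors = {u: term_vector(a) for u, a in corpus.items()}
    n_docs = len(vectors)
    # Document frequency: in how many documents' anchortext each term occurs.
    df = Counter(t for v in vectors.values() for t in v)
    scores = {}
    for term, tf in vectors[url].items():
        idf = math.log(n_docs / df[term]) if use_idf else 1.0
        scores[term] = tf * idf
    # Sort by score (descending), then alphabetically for a stable order.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]
```

With tf-idf, terms such as "review" and "summary" that occur in the anchortext of several documents are pushed below terms specific to one document, such as "connacht".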

Page 13: Sheila Kinsella PhD Defense

Tag Prediction: Evaluation (1)

Datasets:
– Web: WEBSPAM-2007, 12M pages from .uk domain
– Delicious: 2007 crawl containing tags for 4.5M URLs
– Overlap between datasets: 192k URLs

Goals:
– Compare overlap of predicted tags and Delicious tags
– Assess relevance of predicted tags and relevance of Delicious tags

Page 14: Sheila Kinsella PhD Defense

Tag Prediction: Evaluation (2)

Automatic Evaluation
Relative precision@k (average proportion of predicted tags that are also among Delicious tags)
Relative recall@k (average proportion of Delicious tags that can also be inferred from anchortext)

k=1   k=2   k=3   k=4   k=5
0.41  0.35  0.32  0.29  0.28

k=1   k=2   k=3   k=4   k=5
0.48  0.45  0.42  0.39  0.37
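The two relative measures can be sketched as below. The slide gives only the one-line definitions, so the exact cut-offs are an assumption: here precision@k looks at the top-k predicted tags, and recall@k looks at the top-k Delicious tags (assumed to be ordered, e.g. by how often users assigned them). Function names are hypothetical.

```python
def relative_precision_at_k(predicted, reference, k):
    """Fraction of the top-k predicted tags that also occur among the reference tags."""
    top = predicted[:k]
    if not top:
        return 0.0
    ref = set(reference)
    return sum(t in ref for t in top) / len(top)


def relative_recall_at_k(predicted, reference, k):
    """Fraction of the top-k reference tags that also occur among the predicted terms."""
    top = reference[:k]
    if not top:
        return 0.0
    pred = set(predicted)
    return sum(t in pred for t in top) / len(top)


def mean_over_documents(metric, docs, k):
    """Average a per-document metric over (predicted, reference) pairs."""
    return sum(metric(p, r, k) for p, r in docs) / len(docs)


# Hypothetical example: (predicted tags, Delicious tags) for two documents.
docs = [
    (["match", "connacht", "review"], ["connacht", "rugby", "match"]),
    (["budget", "tax"], ["budget", "finance"]),
]
print(mean_over_documents(relative_precision_at_k, docs, 2))  # 0.75
print(mean_over_documents(relative_recall_at_k, docs, 2))     # 0.5
```

The measures are "relative" because Delicious tags serve as the reference set rather than as a complete ground truth; a predicted tag missing from Delicious may still be relevant.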

Page 15: Sheila Kinsella PhD Defense

Tag Prediction: Evaluation (3)

Human Evaluation
80 documents, each assessed by 3 judges
– 0: not relevant; 1: quite relevant; 2: very relevant
Evaluator agreement
– In 85% of cases, judges at least almost agree
– i.e., two agree and the third differs by just one point

Page 16: Sheila Kinsella PhD Defense

Tag Prediction: Evaluation (4)

Human Evaluation
Precision@k (average proportion of tags judged relevant by evaluators)
– Relevance threshold: 1
Not feasible to measure recall

           k=1   k=2   k=3   k=4   k=5
Delicious  0.86  0.84  0.82  0.80  0.78
Predicted  0.78  0.76  0.69  0.67  0.66

Page 17: Sheila Kinsella PhD Defense

Tag Prediction: Summary

Substantial overlap between tags assigned on a social bookmarking site and terms from anchortext

Human evaluators rate relevance of terms from anchortext as not much lower than tags

This approach can provide useful and novel annotations for untagged social media items, if other users link to them with anchortext

Page 18: Sheila Kinsella PhD Defense

Geolocation: Approach (1)

Aim: Location prediction based on models built from geotagged social media
– Enables detection of implicit location clues such as slang, venues, other terms of local relevance

1. Reverse Geocoding
Filter geotagged tweets from Twitter stream
Reverse-geocode each coordinate to corresponding places
– Postal code, City, State, Country
– Yahoo! Geoplanet service
Aggregate all of the text from each place together for model building

Page 19: Sheila Kinsella PhD Defense

Geolocation: Approach (2)

2. Language Modelling
Approach from information retrieval: given a query, find the most relevant document in a collection
Model each document and query as a bag of words
For each document, calculate the probability that a random sampling would result in the query
Based on the intuition that users create queries by guessing words that would occur in the document
For our geolocation task: estimates the probability that a random sampling of a location would result in the social media post

Page 20: Sheila Kinsella PhD Defense

Geolocation: Evaluation (1)

Dataset
– Twitter Firehose stream
– 7.3 million geotagged tweets posted during Summer 2010
– Retweets removed, #hashtags and @usernames preserved

Baseline: Yahoo! Placemaker
– Identifies and disambiguates placenames in text and returns the spatial entity most likely to encompass them

Place type   # Tweets  # Distinct places
Country      7.3m      222
State        7.3m      2.3k
City         6.3m      72.6k
Postal code  7.2m      104.7k

Page 21: Sheila Kinsella PhD Defense

Geolocation: Evaluation (2)

Prediction Methods
Trivial Classifier
– Each tweet is assigned to the most common place in the training set
Placemaker (Tweet)
– Each tweet is submitted to Placemaker and the most probable candidate is selected. Allows detection of explicit geographic references in the tweet
Language Model
– Locations are ranked according to their query likelihood and the location whose model ranks highest is selected
Placemaker (Location)
– The location field from the tweet is submitted to Placemaker and the most probable candidate is selected. Allows detection of explicit geographic references in the self-reported location

Page 22: Sheila Kinsella PhD Defense

Geolocation: Evaluation (3)

Tweet location prediction accuracy
Common location-focused services removed

Method                 Zip    Town   State  Country
Trivial Classifier     0.005  0.061  0.060  0.434
Placemaker (Tweet)     0.018  0.060  0.076  0.120
Language Model         0.052  0.217  0.246  0.514
Placemaker (Location)  0.017  0.269  0.401  0.518

Page 23: Sheila Kinsella PhD Defense

Geolocation: Summary

Language models of geotagged tweets enable the location of non-geotagged items to be predicted

The approach gives large improvements compared to parsing for explicit placenames
– City-level accuracy: 21.7% versus 6%

The approach can be used to detect implicit geographical information in social media posts

Page 24: Sheila Kinsella PhD Defense

Topic Classification: Approach

Aim: Improve topic classification using structured data from hyperlinks

1. Identify sources of structured data from hyperlinks
– Based on domains, e.g., wikipedia.org
2. Retrieve structured data for these hyperlinks
– From Linked Data/APIs, e.g., dbpedia.org
3. Perform text classification
– Requires set of already categorised posts for training
– Post content and external metadata as sources of textual features
– Compare accuracy achieved by different metadata types
4. Related to IR studies that classify documents based on fielded text from hyperlinked pages, but they consider structural rather than semantic fields
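Step 1 of the approach, spotting hyperlinks that point at sources of structured data, can be sketched as a simple domain filter. The domain table, the URL pattern and the function name below are illustrative assumptions; the slides name wikipedia.org as an example domain and YouTube as a linked source, but do not describe the matching code.

```python
import re
from urllib.parse import urlparse

# Assumed mapping from registered domain to a source label (illustrative only).
STRUCTURED_SOURCES = {
    "wikipedia.org": "Wikipedia",
    "youtube.com": "YouTube",
}
# Rough URL matcher: stops at whitespace, brackets, angle brackets and quotes.
URL_PATTERN = re.compile(r"https?://[^\s\]\[<>'\"]+")


def structured_links(post_text):
    """Return (source label, URL) pairs for hyperlinks whose domain is a known source."""
    found = []
    for url in URL_PATTERN.findall(post_text):
        host = (urlparse(url).hostname or "").lower()
        for domain, label in STRUCTURED_SOURCES.items():
            # Match the domain itself or any subdomain (en.wikipedia.org, www.youtube.com).
            if host == domain or host.endswith("." + domain):
                found.append((label, url))
    return found
```

The URLs returned by such a filter would then be passed to step 2, where metadata for each object is retrieved from the corresponding Linked Data source or API.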

Page 25: Sheila Kinsella PhD Defense

Topic Classification: Evaluation (1)

Datasets

                     Forum          Twitter
Data source          message board  microblogging site
Ground truth topics  forums         #hashtags
# classes (topics)   10             6
# posts              6,626          2,415

External data sources: Linked Data, Web APIs

Page 26: Sheila Kinsella PhD Defense

Topic Classification: Evaluation (2)

Experimental Setup
Multinomial Naïve Bayes classifier (WEKA)
10-fold cross-validation
Compared classification accuracy for different post representations based on:
– post content
– hyperlinked HTML pages
– hyperlinked object metadata
– combinations of these
Experimented to find optimal ways of combining feature vectors (e.g., weightings)
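The idea of combining weighted feature vectors and classifying them with Multinomial Naïve Bayes can be illustrated with a small pure-Python stand-in. The experiments themselves used WEKA with 10-fold cross-validation; the weights, the training examples and the class below are assumptions for illustration only.

```python
import math
from collections import Counter, defaultdict


def combine(content, metadata, w_content=1.0, w_metadata=2.0):
    """Weighted sum of two bag-of-words vectors (the weights are illustrative)."""
    vec = Counter()
    for term, n in Counter(content.split()).items():
        vec[term] += w_content * n
    for term, n in Counter(metadata.split()).items():
        vec[term] += w_metadata * n
    return vec


class MultinomialNB:
    """Minimal multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, vectors, labels):
        self.class_counts = Counter(labels)
        self.term_counts = defaultdict(Counter)
        for vec, label in zip(vectors, labels):
            self.term_counts[label].update(vec)
        self.vocab = {t for c in self.term_counts.values() for t in c}
        return self

    def predict(self, vec):
        n = sum(self.class_counts.values())
        best, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            total = sum(self.term_counts[label].values()) + len(self.vocab)
            score = math.log(count / n)  # class prior
            for term, weight in vec.items():
                if term in self.vocab:  # ignore terms never seen in training
                    score += weight * math.log((self.term_counts[label][term] + 1) / total)
            if score > best_score:
                best, best_score = label, score
        return best


# Hypothetical training posts: (post content, metadata of the hyperlinked object, topic).
train = [
    ("great try in the match", "rugby sport connacht", "sport"),
    ("new album out today", "music rock band", "music"),
]
clf = MultinomialNB().fit([combine(c, m) for c, m, _ in train], [t for _, _, t in train])
print(clf.predict(combine("here is a clip", "rugby try")))  # sport
```

In the example, the post text "here is a clip" carries no topical terms at all, so the prediction rests entirely on the metadata of the linked object, which is the situation the approach is designed for.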

Page 27: Sheila Kinsella PhD Defense

Topic Classification: Evaluation (3)

Results (micro-averaged F1)

Data Source          Forum  Twitter
Content (no URLs)    0.745  0.722
Content (with URLs)  0.811  0.759
HTML                 0.730  0.645
Metadata             0.835  0.683
Content + HTML       0.832  0.784
Content + Metadata   0.899  0.820

Page 28: Sheila Kinsella PhD Defense

Topic Classification: Evaluation (4)

Results: comparing metadata types

YouTube (Content, no URLs: 0.709)
Metadata type  Metadata only  Content+Metadata
Tag            0.838          0.864
Title          0.773          0.824
Description    0.752          0.810
Category       0.514          0.753

Wikipedia (Content, no URLs: 0.761)
Metadata type  Metadata only  Content+Metadata
Category       0.811          0.851
Description    0.798          0.850
Title          0.685          0.809

Page 29: Sheila Kinsella PhD Defense

Topic Classification: Summary

Topic classification in social media can be improved by making use of structured metadata from hyperlinked objects

The most useful metadata types can be found experimentally, but for different objects, the usefulness of metadata types varies

The categories assigned by this approach would allow a user to browse social media posts with hyperlinks by topic, even if the text of the post itself is not sufficient for accurate automatic categorisation of the post.

Page 30: Sheila Kinsella PhD Defense

Combining the approaches (1)

tags, topic, location

Page 31: Sheila Kinsella PhD Defense

Combining the approaches (2)

Last night I watched Connacht play at The Sportsground. The match started well for Connacht with a great try but after half time the opposition closed the gap. Finally we managed to hold out for the win. It was a great game from both sides. Here's a clip of the first try.

tags? topic? location?

Page 32: Sheila Kinsella PhD Defense

Combining the approaches (3)

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix ex: <http://example.org/> .
@prefix content: <http://purl.org/rss/1.0/modules/content/> .
@prefix dc: <http://purl.org/dc/terms/> .
@prefix sioc: <http://rdfs.org/sioc/ns#> .

ex:post1 rdf:type sioc:Post .
ex:post1 content:encoded "Last night I watched Connacht play at The Sportsground. The match started well for Connacht with a great try but after half time the opposition closed the gap. Finally we managed to hold out for the win. It was a great game from both sides. Here's a [url='http://www.youtube.com/watch?v=[...]']clip of the first try.[/url]" .
ex:post1 sioc:links_to <http://www.youtube.com/watch?v=[...]> .
ex:post1 dc:subject "connacht" .
ex:post1 dc:subject "match" .
ex:post1 dc:subject "review" .
ex:post1 dc:subject "summary" .
ex:post1 dc:spatial <http://sws.geonames.org/2964180/> .
ex:post1 sioc:topic <http://www.dmoz.org/Sports/Football/Rugby_Union/> .

Page 33: Sheila Kinsella PhD Defense

Combining the approaches (4)

Use-case 1: Local search
A blogger is looking for media to enhance a post about the Volvo Ocean Race

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX sioc: <http://rdfs.org/sioc/ns#>
PREFIX dc: <http://purl.org/dc/terms/>

SELECT ?post WHERE {
  ?post rdf:type sioc:Post .
  ?post dc:spatial <http://sws.geonames.org/2964180/> .
  ?post dc:created ?date .
  FILTER (str(?date) > "2009-05-23T00:00:00") .
  FILTER (str(?date) < "2009-06-06T23:59:59") .
  FILTER EXISTS {
    { ?post dc:subject "volvooceanrace" } UNION
    { ?post dc:subject "vor" } UNION
    { ?post dc:subject "oceanrace" } UNION
    { ?post dc:subject "yacht" }
  }
}

Page 34: Sheila Kinsella PhD Defense

Combining the approaches (5)

Use-case 2: Local browsing
A sports fan wants to follow conversations about sports in their local area

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX sioc: <http://rdfs.org/sioc/ns#>
PREFIX dc: <http://purl.org/dc/terms/>
PREFIX gn: <http://www.geonames.org/ontology#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT ?post WHERE {
  ?post rdf:type sioc:Post .
  ?post dc:spatial ?location .
  ?location gn:parentADM1 <http://sws.geonames.org/2963597/> .
  ?post sioc:topic ?topic .
  ?topic skos:broader+ <http://www.dmoz.org/Sports/> .
}

Page 35: Sheila Kinsella PhD Defense

Impact

5 conference papers
– ESWC 2011, ECIR 2011, I-Semantics 2010, IV 2008, ASNA 2008
2 workshop papers
– WIDM @ CIKM 2008, SMUC @ CIKM 2011
2 book chapters
– Advances in Computers 76 (Elsevier)
– Reasoning Web (Springer)
Tutorial
– "Combining the Social and the Semantic Web", ESWC 2011

Page 36: Sheila Kinsella PhD Defense

Summary

Proposed approaches for automatically generating metadata for social media posts using related Web content
– Tags, location and topic
Evaluated the accuracy of each approach
Illustrated how the approaches can be used in combination in order to semantically enrich social media posts and enable enhanced search and browsing in a social media dataset