Sheila Kinsella PhD Defense
DESCRIPTION
"Augmenting Social Media Items with Metadata using Related Web Content" - Slides from the public presentation part of my PhD defense. Presented at DERI, NUI Galway, on August 30, 2011.
TRANSCRIPT
Copyright 2010 Digital Enterprise Research Institute. All rights reserved.
Digital Enterprise Research Institute www.deri.ie
Augmenting Social Media Items with Metadata using Related Web Content
Sheila Kinsella
Outline
Motivation
Example Scenario
Tag prediction, Geolocation, Topic classification
– Approach
– Evaluation
– Summary
Combining the approaches
Impact
Conclusions
Motivation
Social media is an important information source
– e.g., real-time citizen journalism, Q&A sites, niche topics
Search and navigation can be challenging
– Short and informal posts
– Items are not curated and often lack metadata
– Users conversing share a common context and therefore omit relevant information, e.g. location
Making use of related Web data can help us to infer such context information
– e.g., hyperlinks, posts with similar content
Example Scenario: Adding metadata to a blog post
tags? topic? location?
Last night I saw Connacht play at The Sportsground. The match started well for Connacht with a great try but after half time the opposition closed the gap. Finally we managed to hold out for the win. It was a great game from both sides. Here's a clip of the first try.
Example Scenario: Possible clues from content
tags? topic? location?
Example Scenario: Exploiting related Web content
Tags from anchortext
"...didn’t see the match but here’s a summary from John..." (href to the post)
"...This review of the Connacht match shows that they are getting back in form!..." (href to the post)
Example Scenario: Exploiting related Web content
Location from geotagged social media
JohnSmith (John Smith): "I’m at the Galway Sportsground"
Example Scenario: Exploiting related Web content
Topic from hyperlinked objects
href: YouTube
Title: Fionn Carr try
Category: Sport
Tags: rugby, try, carr, connacht
Example Scenario: Overview of approaches
TAG PREDICTION, GEOLOCATION, TOPIC CLASSIFICATION
Tag Prediction: Approach
Aim: Automatic tag generation based on anchortext
1. Data collection and preprocessing
– Retrieve document and extract META information
– Retrieve inlinking documents and extract anchortext
– Preprocessing (e.g. stemming, stopword removal)
2. Tag indexing and ranking
– Generate term vectors from the preprocessed annotations
– Ranking: tf and tf-idf
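The two steps above can be sketched roughly as follows. This is a minimal illustration, not the thesis implementation: the stopword list, the `tokenize` helper, the `rank_tags` function and the example URLs are all hypothetical, stemming is omitted, and document frequency is assumed to be counted over the anchortext collections of the target documents.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list; a real system would use a
# fuller list and also apply stemming.
STOPWORDS = {"a", "an", "and", "are", "but", "from", "here", "in", "is",
             "of", "s", "t", "that", "the", "they", "this", "to"}

def tokenize(text):
    """Lowercase, split on non-letters and drop stopwords (no stemming here)."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]

def rank_tags(anchor_texts_by_url, target_url, k=5, use_idf=True):
    """Rank candidate tags for target_url from the anchortext of inlinking pages."""
    # One term-frequency vector per target document.
    tf = {url: Counter(tok for anchor in anchors for tok in tokenize(anchor))
          for url, anchors in anchor_texts_by_url.items()}
    n_docs = len(tf)
    # Document frequency: number of target documents whose anchortext has the term.
    df = Counter(term for vec in tf.values() for term in vec)
    scores = {}
    for term, freq in tf[target_url].items():
        idf = math.log(n_docs / df[term]) if use_idf else 1.0
        scores[term] = freq * idf
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda term: (-scores[term], term))[:k]

anchors = {
    "http://example.org/post": ["summary of the match",
                                "review of the Connacht match"],
    "http://example.org/other": ["match report", "news"],
}
tf_top = rank_tags(anchors, "http://example.org/post", k=1, use_idf=False)
tfidf_top = rank_tags(anchors, "http://example.org/post", k=3)
```

With plain tf, the most frequent term wins; with tf-idf, terms that appear in the anchortext of many documents are pushed down.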
Tag Prediction: Evaluation (1)
Datasets:
– Web: WEBSPAM-2007, 12M pages from .uk domain
– Delicious: 2007 crawl containing tags for 4.5M URLs
– Overlap between datasets: 192k URLs
Goals:
– Compare overlap of predicted tags and Delicious tags
– Assess relevance of predicted tags and relevance of Delicious tags
Tag Prediction: Evaluation (2)
Automatic Evaluation
Relative precision@k (Average proportion of predicted tags that are also among Delicious tags)
Relative recall@k (Average proportion of Delicious tags that can also be inferred from anchortext)

k=1   k=2   k=3   k=4   k=5
0.41  0.35  0.32  0.29  0.28

k=1   k=2   k=3   k=4   k=5
0.48  0.45  0.42  0.39  0.37
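A minimal sketch of how the two relative measures could be computed for one document. The function names are hypothetical, and the exact definition of recall@k is an assumption (here: the share of the top-k Delicious tags that also appear among the anchortext terms); the slide does not spell it out.

```python
def relative_precision_at_k(predicted, delicious, k):
    """Share of the top-k predicted tags that are also Delicious tags."""
    top = predicted[:k]
    if not top:
        return 0.0
    reference = set(delicious)
    return sum(tag in reference for tag in top) / len(top)

def relative_recall_at_k(predicted, delicious, k):
    """Share of the top-k Delicious tags found among the predicted terms.

    Assumption: k cuts the Delicious tag list, not the predicted list.
    """
    top = delicious[:k]
    if not top:
        return 0.0
    inferred = set(predicted)
    return sum(tag in inferred for tag in top) / len(top)

predicted = ["connacht", "match", "review", "summary"]  # ranked anchortext terms
delicious = ["rugby", "connacht", "match"]              # ranked user tags
p_at_2 = relative_precision_at_k(predicted, delicious, 2)
r_at_3 = relative_recall_at_k(predicted, delicious, 3)
```

Averaging these per-document values over all documents gives the figures reported per k.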
Tag Prediction: Evaluation (3)
Human Evaluation
80 documents, each assessed by 3 judges
– 0: not relevant; 1: quite relevant; 2: very relevant
Evaluator agreement
– In 85% of cases, judges at least almost agree
– i.e., two agree and the third differs by just one point
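The agreement criterion can be made concrete with a small sketch. The function names are hypothetical, and it is assumed that full agreement between all three judges also counts as "at least almost agree".

```python
def at_least_almost_agree(scores):
    """True if all three judges agree, or two agree and the third is one point off."""
    low, mid, high = sorted(scores)
    return (low == mid or mid == high) and (high - low) <= 1

def agreement_rate(judgements):
    """Fraction of items on which the three judges at least almost agree."""
    return sum(at_least_almost_agree(j) for j in judgements) / len(judgements)

# Made-up relevance scores (0, 1 or 2) from three judges for four tags.
judgements = [(2, 2, 2), (1, 2, 2), (0, 0, 2), (0, 1, 2)]
rate = agreement_rate(judgements)
```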
Tag Prediction: Evaluation (4)
Human Evaluation
Precision@k (Average proportion of tags judged relevant by evaluators)
– Relevance threshold: 1
Not feasible to measure recall

           k=1   k=2   k=3   k=4   k=5
Delicious  0.86  0.84  0.82  0.80  0.78
Predicted  0.78  0.76  0.69  0.67  0.66
Tag Prediction: Summary
Substantial overlap between tags assigned on a social bookmarking site and terms from anchortext
Human evaluators rate relevance of terms from anchortext as not much lower than tags
This approach can provide useful and novel annotations for untagged social media items, if other users link to them with anchortext
Geolocation: Approach (1)
Aim: Location prediction based on models built from geotagged social media
– Enables detection of implicit location clues such as slang, venues, other terms of local relevance
1. Reverse Geocoding
– Filter geotagged tweets from Twitter stream
– Reverse-geocode each coordinate to corresponding places (Postal code, City, State, Country) using the Yahoo! Geoplanet service
– Aggregate all of the text from each place together for model building
Geolocation: Approach (2)
2. Language Modelling
– Approach from information retrieval: given a query, find the most relevant document in a collection
– Model each document and query as bag of words
– For each document, calculate probability that a random sampling would result in the query
– Based on the intuition that users create queries by guessing words that would occur in the document
– For our geolocation task: estimates the probability that a random sampling of a location would result in the social media post
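A minimal sketch of the query-likelihood idea applied to locations: each place is a bag of words built from its aggregated posts, and a new post is assigned to the place most likely to generate it. The function names and the toy data are hypothetical, and the Jelinek-Mercer smoothing with `lam=0.5` is an assumption, since the slides do not state which smoothing was used.

```python
import math
from collections import Counter

def build_models(geotagged_posts):
    """Aggregate the words of all posts from each place into one bag of words."""
    models = {}
    for place, text in geotagged_posts:
        models.setdefault(place, Counter()).update(text.lower().split())
    return models

def log_likelihood(tokens, place_counts, collection_counts, lam=0.5):
    """log P(post | place), smoothed with the whole collection (assumed smoothing)."""
    place_total = sum(place_counts.values())
    collection_total = sum(collection_counts.values())
    score = 0.0
    for tok in tokens:
        p_collection = collection_counts[tok] / collection_total
        if p_collection == 0:
            continue  # word never seen in any place; it cannot discriminate
        p_place = place_counts[tok] / place_total
        score += math.log(lam * p_place + (1 - lam) * p_collection)
    return score

def predict_place(text, models):
    """Return the place whose language model ranks highest for the post."""
    collection = Counter()
    for counts in models.values():
        collection.update(counts)
    tokens = text.lower().split()
    return max(models, key=lambda place: log_likelihood(tokens, models[place], collection))

posts = [
    ("Galway", "great try at the sportsground"),
    ("Galway", "connacht win at the sportsground"),
    ("Dublin", "leinster match at the rds"),
    ("Dublin", "great night in dublin"),
]
models = build_models(posts)
guess = predict_place("heading to the sportsground for connacht", models)
```

Note that the test post never names Galway; it is placed there only through locally relevant terms such as the venue and the team.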
Geolocation: Evaluation (1)
Dataset
– Twitter Firehose stream
– 7.3 million geotagged tweets posted during Summer 2010
– Retweets removed, #hashtags and @usernames preserved
Baseline: Yahoo! Placemaker
– Identifies and disambiguates placenames in text and returns the spatial entity most likely to encompass them

Place type   # Tweets  # Distinct places
Country      7.3m      222
State        7.3m      2.3k
City         6.3m      72.6k
Postal code  7.2m      104.7k
Geolocation: Evaluation (2)
Prediction Methods
Trivial Classifier
– Each tweet is assigned to the most common place in the training set
Placemaker (Tweet)
– Each tweet is submitted to Placemaker and the most probable candidate is selected. Allows detection of explicit geographic references in the tweet
Language Model
– Locations are ranked according to their query likelihood and the location whose model ranks highest is selected
Placemaker (Location)
– The location field from the tweet is submitted to Placemaker and the most probable candidate is selected. Allows detection of explicit geographic references in the self-reported location
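As a point of reference for the comparison, the trivial classifier and the accuracy measure can be sketched as below. The names `trivial_classifier` and `accuracy` and the toy data are hypothetical; the Placemaker-based methods are not shown because they depend on an external service.

```python
from collections import Counter

def trivial_classifier(training_places):
    """Build a predictor that always answers with the most common training place."""
    most_common_place, _ = Counter(training_places).most_common(1)[0]
    return lambda tweet_text: most_common_place

def accuracy(predict, test_set):
    """Fraction of (text, true_place) pairs for which the prediction is correct."""
    correct = sum(predict(text) == place for text, place in test_set)
    return correct / len(test_set)

predict = trivial_classifier(["US", "US", "IE"])
test_set = [("a", "US"), ("b", "IE"), ("c", "US"), ("d", "BR")]
baseline_accuracy = accuracy(predict, test_set)
```

Any other prediction method with the same `predict(text)` shape can be passed to `accuracy` for a like-for-like comparison.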
Geolocation: Evaluation (3)
Tweet location prediction accuracy (common location-focused services removed)

                       Zip    Town   State  Country
Trivial Classifier     0.005  0.061  0.060  0.434
Placemaker (Tweet)     0.018  0.060  0.076  0.120
Language Model         0.052  0.217  0.246  0.514
Placemaker (Location)  0.017  0.269  0.401  0.518
Geolocation: Summary
Language models of geotagged tweets enable the location of non-geotagged items to be predicted
The approach gives large improvements compared to parsing for explicit placenames
– City level accuracy: 21.7% versus 6%
The approach can be used to detect implicit geographical information in social media posts
Topic Classification: Approach
Aim: Improve topic classification using structured data from hyperlinks
1. Identify sources of structured data from hyperlinks
– Based on domains, e.g., wikipedia.org
2. Retrieve structured data for these hyperlinks
– From Linked Data/APIs, e.g., dbpedia.org
3. Perform text classification
– Requires set of already categorised posts for training
– Post content and external metadata as sources of textual features
– Compare accuracy achieved by different metadata types
Related to IR studies that classify documents based on fielded text from hyperlinked pages, but they consider structural rather than semantic fields
Topic Classification: Evaluation (1)
Datasets

                     Forum          Twitter
Data source          message board  microblogging site
Ground truth topics  forums         #hashtags
# classes (topics)   10             6
# posts              6,626          2,415

External data sources
– Linked Data
– Web APIs
Topic Classification: Evaluation (2)
Experimental Setup
Multinomial Naïve Bayes classifier (WEKA)
10-fold cross-validation
Compared classification accuracy for different post representations based on:
– post content
– hyperlinked HTML pages
– hyperlinked object metadata
– combinations of these
Experimented to find optimal ways of combining feature vectors (e.g., weightings)
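The setup can be illustrated with a small pure-Python stand-in for the classifier. This is not WEKA and not the experimental code: the `features` helper, the `metadata_weight` value of 2.0, the Laplace smoothing and the toy training posts are all assumptions chosen for illustration.

```python
import math
from collections import Counter, defaultdict

def features(content, metadata="", metadata_weight=2.0):
    """Bag of words over post content plus (assumed) up-weighted metadata terms."""
    vec = Counter(content.lower().split())
    for tok in metadata.lower().split():
        vec[tok] += metadata_weight
    return vec

class MultinomialNB:
    """Tiny multinomial Naive Bayes with Laplace smoothing."""

    def fit(self, vectors, labels):
        self.class_counts = Counter(labels)
        self.term_counts = defaultdict(Counter)
        for vec, label in zip(vectors, labels):
            self.term_counts[label].update(vec)
        self.vocab = {t for counts in self.term_counts.values() for t in counts}
        return self

    def predict(self, vec):
        n = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            total = sum(self.term_counts[label].values()) + len(self.vocab)
            score = math.log(count / n)
            for term, weight in vec.items():
                if term in self.vocab:  # ignore terms never seen in training
                    score += weight * math.log((self.term_counts[label][term] + 1) / total)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

train = [
    (features("great try in the match", "rugby connacht"), "Sport"),
    (features("new album out today", "music video"), "Music"),
]
model = MultinomialNB().fit([vec for vec, _ in train], [label for _, label in train])
# The post text alone carries no topical words; the metadata of the linked
# object supplies them.
topic = model.predict(features("here is a clip", "rugby try"))
```

Changing `metadata_weight` is one simple way to experiment with how strongly the external metadata should count relative to the post content.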
Topic Classification: Evaluation (3)
Results (micro-averaged F1)

Data Source          Forum  Twitter
Content (no URLs)    0.745  0.722
Content (with URLs)  0.811  0.759
HTML                 0.730  0.645
Metadata             0.835  0.683
Content + HTML       0.832  0.784
Content + Metadata   0.899  0.820
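For reference, micro-averaged F1 pools true positives, false positives and false negatives over all classes before computing precision and recall. A minimal sketch follows; the function name and the toy labels are hypothetical.

```python
def micro_f1(gold, predicted):
    """Micro-averaged F1: pool TP/FP/FN over all classes, then compute F1."""
    tp = fp = fn = 0
    for label in set(gold) | set(predicted):
        for g, p in zip(gold, predicted):
            if p == label and g == label:
                tp += 1
            elif p == label:
                fp += 1
            elif g == label:
                fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

gold = ["sport", "sport", "music", "music", "news"]
predicted = ["sport", "music", "music", "music", "sport"]
score = micro_f1(gold, predicted)
```

When every post has exactly one true topic and receives exactly one predicted topic, as in this toy example, micro-averaged F1 coincides with plain accuracy.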
Topic Classification: Evaluation (4)
Results – comparing metadata types

YouTube (Content, no URLs: 0.709)
Metadata type  Metadata only  Content+M’data
Tag            0.838          0.864
Title          0.773          0.824
Description    0.752          0.810
Category       0.514          0.753

Wikipedia (Content, no URLs: 0.761)
Metadata type  Metadata only  Content+M’data
Category       0.811          0.851
Description    0.798          0.850
Title          0.685          0.809
Topic Classification: Summary
Topic classification in social media can be improved by making use of structured metadata from hyperlinked objects
The most useful metadata types can be found experimentally, but for different objects, the usefulness of metadata types varies
The categories assigned by this approach would allow a user to browse social media posts with hyperlinks by topic, even if the text of the post itself is not sufficient for accurate automatic categorisation of the post.
Combining the approaches (1)
tags, topic, location
Combining the approaches (2)
Last night I watched Connacht play at The Sportsground. The match started well for Connacht with a great try but after half time the opposition closed the gap. Finally we managed to hold out for the win. It was a great game from both sides. Here's a clip of the first try.
tags? topic? location?
Combining the approaches (3)
@prefix ex: <http://example.org/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix content: <http://purl.org/rss/1.0/modules/content/> .
@prefix dc: <http://purl.org/dc/terms/> .
@prefix sioc: <http://rdfs.org/sioc/ns#> .

ex:post1 rdf:type sioc:Post .
ex:post1 content:encoded "Last night I watched Connacht play at The Sportsground. The match started well for Connacht with a great try but after half time the opposition closed the gap. Finally we managed to hold out for the win. It was a great game from both sides. Here's a [url='http://www.youtube.com/watch?v=[...]']clip of the first try.[/url]" .
ex:post1 sioc:links_to <http://www.youtube.com/watch?v=[...]> .
ex:post1 dc:subject "connacht" .
ex:post1 dc:subject "match" .
ex:post1 dc:subject "review" .
ex:post1 dc:subject "summary" .
ex:post1 dc:spatial <http://sws.geonames.org/2964180/> .
ex:post1 sioc:topic <http://www.dmoz.org/Sports/Football/Rugby_Union/> .
Combining the approaches (4)
Use-case 1: Local search
A blogger is looking for media to enhance a post about the Volvo Ocean Race

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX sioc: <http://rdfs.org/sioc/ns#>
PREFIX dc: <http://purl.org/dc/terms/>

SELECT ?post WHERE {
  ?post rdf:type sioc:Post .
  ?post dc:spatial <http://sws.geonames.org/2964180/> .
  ?post dc:created ?date .
  FILTER (str(?date) > "2009-05-23T00:00:00") .
  FILTER (str(?date) < "2009-06-06T23:59:59") .
  FILTER EXISTS {
    { ?post dc:subject "volvooceanrace" } UNION
    { ?post dc:subject "vor" } UNION
    { ?post dc:subject "oceanrace" } UNION
    { ?post dc:subject "yacht" }
  }
}
Combining the approaches (5)
Use-case 2: Local browsing
A sports fan wants to follow conversations about sports in their local area

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX sioc: <http://rdfs.org/sioc/ns#>
PREFIX dc: <http://purl.org/dc/terms/>
PREFIX gn: <http://www.geonames.org/ontology#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT ?post WHERE {
  ?post rdf:type sioc:Post .
  ?post dc:spatial ?location .
  ?location gn:parentADM1 <http://sws.geonames.org/2963597/> .
  ?post sioc:topic ?topic .
  ?topic skos:broader+ <http://www.dmoz.org/Sports/> .
}
Impact
5 conference papers
– ESWC 2011, ECIR 2011, I-Semantics 2010, IV 2008, ASNA 2008
2 workshop papers
– WIDM @ CIKM 2008, SMUC @ CIKM 2011
2 book chapters
– Advances in Computers 76 (Elsevier)
– Reasoning Web (Springer)
Tutorial
– "Combining the Social and the Semantic Web", ESWC 2011
Summary
Proposed approaches for automatically generating metadata for social media posts using related Web content
– Tags, location and topic
Evaluated the accuracy of each approach
Illustrated how the approaches can be used in combination in order to semantically enrich social media posts and enable enhanced search and browsing in a social media dataset