Mining the Web for Points of Interest
Adam Rae, Vanessa Murdock, Adrian Popescu, Hugues Bouchard
SIGIR 2012, Portland, Oregon, Entities Session
Using social media to increase our knowledge of the world
[Slide graphic: a check-in message reading "I’m at Adam’s Bar…", followed by "?" and "!"]
Contents
§ Motivation
§ Point Of Interest (POI) extraction using user generated data
§ POI localisation using social media
§ Conclusions
Motivation
§ Geographic Points of Interest are valuable representations of important places in the world around us.
§ Browsing and search of POIs increasingly important
› Web search
› Mobile
› Navigation
Where do POIs come from?
§ Editing listings coming from NMAs, commercial directories, etc.
› Costly process
› Expensive to maintain freshness
› Coverage
§ Do they reflect the kind of places that people are interested in looking for?
Can we get them from the web?
§ Un/semi-structured mentions of POIs throughout text on web
› Lots of context
§ Structured mentions of POIs in micro-blogging systems and Wikipedia articles
› Easy to extract
When is a POI not a POI?
1 The White House is at 1600 Pennsylvania Avenue, Washington DC.
2 The White House released a statement today suggesting the moon is made of cheese.
3 The people living in the white house at the end of the street turned out to be Martians.
Europe According to Foursquare
The World According to Foursquare
The World According to Gowalla
The World According to Wikipedia
Can we bootstrap using social media?
§ Train Conditional Random Fields (CRF) using web snippets bootstrapped from structured mentions in micro-blog entries
› Extract POI, use as query to search engine
› Resultant snippets filtered to those that contain POI
› Sanitise
§ Also from geocoded Wikipedia articles (according to Yago2)
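The bootstrapping steps listed above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: `fetch_snippets` is a hypothetical stand-in for the search engine call, the clean-up in `sanitise` is a guess at what "Sanitise" involves, and the whitespace tokeniser and `B-POI`/`I-POI`/`O` label scheme are assumptions.

```python
import re
from typing import Callable, Iterable, List, Tuple

def sanitise(snippet: str) -> str:
    """Assumed clean-up: drop HTML-like tags and collapse whitespace."""
    no_tags = re.sub(r"<[^>]+>", " ", snippet)
    return re.sub(r"\s+", " ", no_tags).strip()

def label_snippet(snippet: str, poi: str) -> List[Tuple[str, str]]:
    """Tag each token of the snippet as B-POI, I-POI or O (assumed scheme)."""
    tokens = snippet.split()
    labels = ["O"] * len(tokens)
    poi_tokens = poi.lower().split()
    n = len(poi_tokens)
    if n == 0:
        return list(zip(tokens, labels))
    # Compare case-insensitively, ignoring punctuation attached to tokens.
    lowered = [t.lower().strip(".,;:!?\"'()") for t in tokens]
    for i in range(len(tokens) - n + 1):
        if lowered[i:i + n] == poi_tokens:
            labels[i] = "B-POI"
            for k in range(i + 1, i + n):
                labels[k] = "I-POI"
    return list(zip(tokens, labels))

def bootstrap(poi: str,
              fetch_snippets: Callable[[str], Iterable[str]]
              ) -> List[List[Tuple[str, str]]]:
    """Use the POI as a query, keep only snippets that mention it, and label them."""
    examples = []
    for raw in fetch_snippets(poi):  # hypothetical search call
        tagged = label_snippet(sanitise(raw), poi)
        if any(label != "O" for _, label in tagged):
            examples.append(tagged)
    return examples
```

Each returned example is a token/label sequence of the kind a sequential tagger can be trained on; snippets that do not contain the POI are discarded.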
Ground Truth Data
§ Created by manual assessors given explicit instructions
› 1,337 examples of POIs in (some) context
› 1,066 unique POIs
› Inter-assessor agreement:

Ground Truth Assessor   Precision   Recall   F-Measure
1                       0.749       0.792    0.770
2                       0.814       0.716    0.762
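The F-Measure values in the agreement table are consistent with the balanced harmonic mean of precision and recall. A short check, assuming that equal weighting (beta = 1) is what was used:

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (beta=1 is an assumption)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)

# Precision and recall for the two assessors, taken from the table.
for assessor, p, r in [(1, 0.749, 0.792), (2, 0.814, 0.716)]:
    print(f"assessor {assessor}: F = {f_measure(p, r):.3f}")
```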
Sequential Tagging Model
\[
p(Y \mid X, \lambda) = \frac{1}{Z(X)} \exp\left( \sum_{j} \lambda_j F_j(Y, X) \right)
\]

\[
\operatorname*{argmax}_{\Lambda} \left\{ \frac{1}{Z(X)} \exp\left( \sum_{j} \lambda_j F_j(Y, X) \right) \right\}
\]
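To make the model concrete, here is a toy sketch that evaluates p(Y | X, λ) by brute-force enumeration over all label sequences. Everything specific in it is an assumption for illustration: the two-label set, the capitalisation and transition features, and the hand-set weights. Training (the argmax over Λ) is not shown, and a real CRF would use dynamic programming rather than enumeration.

```python
import itertools
import math
from typing import Dict, Sequence, Tuple

LABELS = ("O", "POI")  # assumed label set

def features(y: Sequence[str], x: Sequence[str]) -> Dict[tuple, float]:
    """Global features F_j(Y, X): counts of local indicator features (hypothetical)."""
    counts: Dict[tuple, float] = {}
    prev = "START"
    for token, label in zip(x, y):
        for key in (("cap", token[:1].isupper(), label), ("trans", prev, label)):
            counts[key] = counts.get(key, 0.0) + 1.0
        prev = label
    return counts

def score(y, x, lam: Dict[tuple, float]) -> float:
    """Sum over j of lambda_j * F_j(Y, X)."""
    return sum(lam.get(k, 0.0) * v for k, v in features(y, x).items())

def prob(y, x, lam) -> float:
    """p(Y | X, lambda), with Z(X) computed by enumerating every labelling."""
    z = sum(math.exp(score(c, x, lam))
            for c in itertools.product(LABELS, repeat=len(x)))
    return math.exp(score(y, x, lam)) / z

def decode(x, lam) -> Tuple[str, ...]:
    """Most probable labelling for X under fixed weights."""
    return max(itertools.product(LABELS, repeat=len(x)),
               key=lambda c: score(c, x, lam))

# Hand-set weights, purely illustrative.
weights = {("cap", True, "POI"): 2.0,
           ("cap", False, "O"): 2.0,
           ("trans", "POI", "POI"): 0.5}
sentence = ["the", "Marriott", "Hotel"]
print(decode(sentence, weights), prob(decode(sentence, weights), sentence, weights))
```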
Features
§ Lexical
› Word identity, shape, position, etc.
§ Grammatical
› Part of Speech, Apache OpenNLP
§ Statistical
› Normalised Point-wise Mutual Information of mobile search query logs
§ Geographic
› Gazetteer attributes from Yahoo! Placemaker
› http://developer.yahoo.com/geo/placemaker/
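Two of the feature families above can be illustrated briefly. The sketch below shows one common way to compute a word-shape feature and normalised point-wise mutual information from co-occurrence counts. The slides do not give the exact definitions used, so the shape encoding, the NPMI formula (PMI divided by -log p(x, y)) and all function names are assumptions.

```python
import math
import re
from typing import Dict, List

def word_shape(token: str) -> str:
    """Map upper case to X, lower case to x and digits to d (assumed encoding)."""
    shape = re.sub(r"[A-Z]", "X", token)
    shape = re.sub(r"[a-z]", "x", shape)
    return re.sub(r"[0-9]", "d", shape)

def lexical_features(tokens: List[str], i: int) -> Dict[str, object]:
    """Word identity, shape and position for the token at index i."""
    return {"word": tokens[i].lower(),
            "shape": word_shape(tokens[i]),
            "position": i}

def npmi(count_xy: int, count_x: int, count_y: int, total: int) -> float:
    """Normalised PMI in [-1, 1], from counts over `total` observations."""
    if count_xy == 0:
        return -1.0  # the two terms never co-occur
    p_xy = count_xy / total
    if p_xy == 1.0:
        return 1.0  # avoid dividing by -log(1) = 0
    pmi = math.log(p_xy / ((count_x / total) * (count_y / total)))
    return pmi / -math.log(p_xy)

print(lexical_features(["the", "Marriott", "Hotel"], 1))
print(npmi(count_xy=10, count_x=10, count_y=10, total=100))
```

Here the counts would come from query logs, for example how often two terms appear in the same query; that reading of the slide is also an assumption.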
Process Overview
[Process diagram]
§ Example input text: "… was only after he had left the Marriott Hotel that he remembered…"
§ Sources
› Geocoded Wikipedia Articles
› Check-Ins (Foursquare)
› Check-Ins (Gowalla)
§ Steps
› Extract Article Titles / Extract POI Mentions
› Search Engine (Bing)
› Snippet Processing
› CRF Model Training
§ Intermediate data
› Wikipedia Bootstrapped Raw Web Snippets
› Foursquare Bootstrapped Raw Web Snippets
› Gowalla Bootstrapped Raw Web Snippets
§ Outputs
› Wikipedia based POI Tagger
› Foursquare based POI Tagger
› Gowalla based POI Tagger
Results
Training Data    Testing Data    Precision   Recall
Y! Placemaker    Manual Data     0.237       0.228
Wikipedia        Manual Data     0.514       0.337
Foursquare       Manual Data     0.276       0.655
Gowalla          Manual Data     0.360       0.414
Wikipedia        10-fold CV      0.879       0.955
Foursquare       10-fold CV      0.689       0.468
Gowalla          10-fold CV      0.857       0.868
Language Modelling
§ Partition the world into 1 km cells
§ For each, create model from Flickr photos taken in that area
§ Treat problem as IR, match a POI (query) against the cells (documents)
› Return centroid of best matching cell

\[
P(t \mid \theta_L) = \frac{c_{user}(t, L)}{|L|}
\]

\[
|L| = \sum_{t_i \in L} c_{user}(t_i, L)
\]
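A rough sketch of this cell-matching idea follows. It assumes that c_user(t, L) counts the distinct users who applied tag t to photos in cell L, that a cell is identified by its centroid coordinates, and that unseen terms get a small additive floor (`epsilon`) so the log-likelihood stays finite. None of these details is stated on the slide, and the function names are hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, Hashable, Iterable, List, Tuple

Photo = Tuple[str, Hashable, List[str]]  # (user_id, cell_id, tags)

def build_cell_models(photos: Iterable[Photo]) -> Dict[Hashable, Dict[str, int]]:
    """For each cell, count the distinct users of each tag: c_user(t, L)."""
    users = defaultdict(set)
    for user, cell, tags in photos:
        for tag in tags:
            users[(cell, tag)].add(user)
    models: Dict[Hashable, Dict[str, int]] = defaultdict(dict)
    for (cell, tag), who in users.items():
        models[cell][tag] = len(who)
    return dict(models)

def tag_prob(tag: str, model: Dict[str, int]) -> float:
    """P(t | theta_L) = c_user(t, L) / |L|, with |L| the sum of user counts."""
    size = sum(model.values())
    return model.get(tag, 0) / size if size else 0.0

def best_cell(query_terms: List[str],
              models: Dict[Hashable, Dict[str, int]],
              epsilon: float = 1e-6) -> Hashable:
    """Return the cell whose model gives the query the highest log-likelihood."""
    def loglik(model: Dict[str, int]) -> float:
        return sum(math.log(tag_prob(t, model) + epsilon) for t in query_terms)
    return max(models, key=lambda cell: loglik(models[cell]))

# Toy data: cells are keyed by (lat, lon) centroids, which is an assumption.
photos = [("u1", (48.858, 2.294), ["eiffel", "tower"]),
          ("u2", (48.858, 2.294), ["eiffel"]),
          ("u1", (48.858, 2.294), ["eiffel"]),
          ("u3", (51.505, -0.075), ["tower", "bridge"])]
models = build_cell_models(photos)
print(best_cell(["eiffel", "tower"], models))
```

Counting users rather than raw tag occurrences keeps one prolific photographer from dominating a cell's model, which is one plausible motivation for the c_user notation.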
Performance
                        Placemaker   Cascade   Geo Scope   # Examples
Placemaker POIs         0.29         0.29      0.29        134
Placemaker Other Locs   4.19         2.90      2.12        131
All Known Locs          1.17         0.82      0.79        265
New Locations           -            439.0     5.88        130
All Data                -            1.20      0.96        395
Conclusions and Implications
§ POIs are valuable, but useful ones difficult to define
§ Generating evaluation data is hard
§ Can use web snippets bootstrapped with check-ins, and articles on Wikipedia, to train POI tagger
› Up to 88% precision on unlabelled data
› Reflect the POIs users visit
› Easily updated
› Can be located accurately using hybrid gazetteer + Flickr language model technique
Benefits of this approach
§ Discover POIs:
› that we already know about (replace/extend existing sources)
› we didn’t already know about (novel POIs)
› of more diverse types (increasing coverage)
› that are fresher
§ Increase relevance of local and hyperlocal search using wisdom of the crowds
Research Areas
- Automatic POI detection in UGC
- Learning how users refer to places
- Localising media
- Generating evaluation data
  - (This is hard)
- Multi-source combination
- Quality & Credibility