Pinpointing Locational Focus in Microblogs
Jie Yin, Sarvnaz Karimi, John Lingad
November 2014
DIGITAL PRODUCTIVITY FLAGSHIP
Where is it happening?
For those monitoring social media to:
• send help in emergencies
• avoid certain areas (e.g., for traffic)
• recommend services (ads)
CSIRO: positive impact | Pinpointing Locational Focus in Tweets | Sarvnaz Karimi
Find it on the map!
Locational focus
Locational focus: Macquarie Centre, North Ryde, New South Wales, Australia
Location mentions: Sydney, Macquarie Centre
Author Location: Brisbane, Australia
Some tweets mention multiple locations: Not easy to identify the focus
Mary river, Queensland, Australia
Some tweets have no locational focus
There is an unknown location (Ambiguity)
No specific focus (world level?)
To find locational focus, we have two tasks:
1. Find mentions of locations
2. Aggregate these to infer the main focus
Finding location mentions
1. Where to look for location mentions
• Location mentions can be in the tweet text and/or in hashtags
• Some hashtags are concatenated words or abbreviations, e.g., #QLDflood = QLD + flood
• Tweet text may mention a geographical location, such as Sydney, or a Point-of-Interest (POI), such as an organisation's name or a shop
• Authors' locations in their profiles (not exactly a location mention)
2. How to find these mentions?
• Hashtag segmentation
• Named Entity Recognition
Location mention extraction
• Related work
  - NER tools for formal text, such as Stanford NER and OpenNLP, are highly accurate (solved problem).
  - NER specific to Twitter: TwiNER [Wang et al., 2012], TwitterNLP [Ritter et al., 2011]
  - Retrained NER tools for Twitter [Lingad et al., 2013]: Location and Organisation entities only
• In this work:
  - Segmented the hashtags using a simple greedy maximal matching heuristic, with an English dictionary augmented with place name abbreviations
  - Used the retrained Stanford NER, keeping LOC and ORG entities
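Greedy maximal matching for hashtags can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `segment_hashtag`, the toy `VOCABULARY`, and the fallback for unmatched characters are all assumptions.

```python
def segment_hashtag(hashtag, vocabulary):
    """Split a hashtag into known words by greedy longest-first matching."""
    text = hashtag.lstrip("#").lower()
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate starting at position i first.
        for j in range(len(text), i, -1):
            if text[i:j] in vocabulary:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No dictionary word starts here (assumed fallback):
            # keep the single character and move on.
            tokens.append(text[i])
            i += 1
    return tokens


# Hypothetical toy vocabulary: a few English words plus place name abbreviations.
VOCABULARY = {"flood", "fire", "bush", "qld", "nsw", "vic"}

print(segment_hashtag("#QLDflood", VOCABULARY))
```

With a real dictionary the same loop applies; only the vocabulary set changes.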
Inferring locational focus
Given a list of location mentions, determine what the focus is.
For example:
If mentions are VIC, NSW, QLD, WA, then the focus is Australia
If mentions are Swanston St, RMIT, then the focus is RMIT University, Melbourne, VIC, Australia
• Requires knowledge of the geographical locations as well as POIs and their relationships/hierarchy.
  - Gazetteer Australia 2010, GeoNames New Zealand, OpenStreetMap
Gazetteer as a tree
Specific POI
City/Suburb/Town/Non-Specific POI (e.g., river, highway)
State/Territory/Region
Country
Inference algorithm: Where on the map?
• Step 1: Query the gazetteer with the location mentions, and return matching results (partial or exact match) as full paths in the gazetteer tree
• Step 2: Create an inference tree using the returned paths
• Step 3: Propagate the scores in the tree
• Step 4: Find a maximum scoring path
Goal: Find the lowest level of granularity possible
Assumption: The more matches found within a geographical region, the more likely that region on the map is the focus
Querying the gazetteer tree
• Location mentions: Sydney, Macquarie Centre
• Author Location: Brisbane, Australia
• Gazetteer querying returns:
  - brisbane, queensland, australia
  - south brisbane, queensland, australia
  - macquarie centre, north ryde, new south wales, australia
  - macquarie university, macquarie park, new south wales, australia
  - ...
Each returned result gets a matching score based on the Jaccard similarity of the query and the matched node.
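The matching score can be illustrated with a small Jaccard similarity function. The slides do not say how the query and node names are tokenised; this sketch assumes lowercase, whitespace-separated words, and the function name `jaccard` is only illustrative.

```python
def jaccard(query, candidate):
    """Jaccard similarity between the lowercase word sets of two strings."""
    a = set(query.lower().split())
    b = set(candidate.lower().split())
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# An exact match scores higher than a partial match on the same query.
exact = jaccard("Brisbane", "brisbane")
partial = jaccard("Brisbane", "south brisbane")
print(exact, partial)
```

Under this assumption, "brisbane" outranks "south brisbane" for the query "Brisbane" because the partial match has an extra word in the union.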
Building the inference tree
Paths added to the tree:
- brisbane, queensland, australia
- macquarie centre, north ryde, new south wales, australia

[Figure: inference tree]
earth
  australia
    queensland
      brisbane (leaf score)
    new south wales
      north ryde
        macquarie centre (leaf score)
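Merging the returned paths into one tree can be sketched with nested dictionaries. This is an assumed representation rather than the one used in the work; `build_inference_tree` is a hypothetical helper, and the paths are the two shown on this slide.

```python
def build_inference_tree(paths):
    """Merge gazetteer paths (most specific place first) into a nested dict rooted at 'earth'."""
    tree = {"earth": {}}
    for path in paths:
        node = tree["earth"]
        # Walk from the broadest level (country) down to the most specific one.
        for name in reversed([part.strip() for part in path.split(",")]):
            node = node.setdefault(name, {})
    return tree


paths = [
    "brisbane, queensland, australia",
    "macquarie centre, north ryde, new south wales, australia",
]
tree = build_inference_tree(paths)
print(tree)
```

Paths that share a prefix (here, australia) share the same branch, so each additional match inside a region adds a child under that region's node.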
Propagating scores to the parents and finding the maximal path
• More branches within a sub-tree increase the chance that their parent is on the maximal path
• Bottom-up scoring of parents from the leaves to the root
  - Parent score = current score + 0.5 * score of the highest scoring child
• Top-down selection of the maximal path, with entropy as the termination condition: if the entropy of the children's scores is higher than a pre-defined threshold, the algorithm stops at that level.
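The two passes can be sketched as below. The parent update follows the rule on this slide; everything else is an assumption made for illustration: the `Node` class, the leaf scores, normalising child scores to sum to 1 before computing Shannon entropy in bits, and the threshold values.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    score: float = 0.0  # own score before propagation (non-zero for leaves)
    children: list = field(default_factory=list)


def propagate(node):
    """Bottom-up pass: parent score = current score + 0.5 * best child score."""
    if node.children:
        node.score += 0.5 * max(propagate(child) for child in node.children)
    return node.score


def entropy(scores):
    """Shannon entropy (bits) of scores normalised to sum to 1 (assumed normalisation)."""
    total = sum(scores)
    if total <= 0:
        return 0.0
    return -sum((s / total) * math.log2(s / total) for s in scores if s > 0)


def maximal_path(root, threshold):
    """Top-down pass: follow the best child until the children are too evenly scored."""
    path, node = [], root
    while node.children:
        if entropy([c.score for c in node.children]) > threshold:
            break
        node = max(node.children, key=lambda c: c.score)
        path.append(node.name)
    return path


# Hypothetical leaf scores, chosen only to illustrate the stopping rule.
ryde = Node("north ryde", children=[Node("macquarie centre", 4.0)])
nsw = Node("new south wales", children=[ryde, Node("sydney", 2.0)])
qld = Node("queensland", children=[Node("brisbane", 1.0)])
earth = Node("earth", children=[Node("australia", children=[qld, nsw])])

propagate(earth)
print(maximal_path(earth, threshold=0.95))
```

In this toy tree, north ryde and sydney end up with equal scores, so the entropy of their scores is at its maximum and the selection stops at the state level instead of guessing between them.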
[Figure: inference tree annotated with propagated scores]
Paths in the tree:
- brisbane, queensland, australia
- macquarie centre, north ryde, new south wales, australia
- sydney, new south wales, australia
Nodes: earth, australia, queensland, brisbane, new south wales, north ryde, macquarie centre, macquarie university, sydney
Score labels: A, B, C, D, 0.5*A, 0.5*Max(B,C), 0.5*Max(0.5*B,D)
Leaf score = w * 2^level * Jaccard similarity
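A small sketch of the leaf score formula follows. The slide does not define w or the level numbering, so two assumptions are made here: w is taken to be the weight of the information source (using the 0.6 / 0.3 / 0.1 values reported with the results), and levels run from 1 (country) to 4 (POI) as in the results tables. The names `SOURCE_WEIGHTS` and `leaf_score` are illustrative.

```python
# Assumed source weights; the slide only calls the weight w.
SOURCE_WEIGHTS = {"text": 0.6, "hashtag": 0.3, "user_location": 0.1}


def leaf_score(source, level, jaccard_similarity):
    """Leaf score = w * 2^level * Jaccard similarity."""
    return SOURCE_WEIGHTS[source] * (2 ** level) * jaccard_similarity


# Assumed levels: 1 = country, 2 = state, 3 = city/suburb, 4 = POI.
poi_from_text = leaf_score("text", level=4, jaccard_similarity=1.0)
city_from_profile = leaf_score("user_location", level=3, jaccard_similarity=1.0)
print(poi_from_text > city_from_profile)
```

Because of the 2^level factor, a match at a deeper level outweighs an equally good match higher up, which fits the stated goal of reaching the lowest level of granularity possible.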
Dataset & annotation
• Queried Twitter with keywords such as fire, earthquake, storm, hurricane
• Randomly sampled 7,000 tweets
• Two annotation steps:
  1. Identify location mentions
  2. Identify locational focus (based on tweet and author location)
• Three annotators per tweet; only tweets with majority agreement (2 out of 3) were kept in the final set.
• Tweets whose focus was not within Australia or New Zealand were removed.
• A small set of tweets for which the focus was marked as impossible to detect was also removed.
• Final set: 1,398 tweets (80 kept for parameter tuning)
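The 2-out-of-3 agreement filter can be sketched as a simple majority vote. The function name `majority_label` and the example labels are hypothetical; the annotation tooling actually used is not described in the slides.

```python
from collections import Counter


def majority_label(labels, min_votes=2):
    """Return the label chosen by at least min_votes annotators, or None if there is none."""
    label, votes = Counter(labels).most_common(1)[0]
    return label if votes >= min_votes else None


# Hypothetical annotations for two tweets, three annotators each.
agreed = majority_label(["brisbane, queensland", "brisbane, queensland", "queensland"])
disagreed = majority_label(["brisbane, queensland", "queensland", "australia"])
print(agreed, disagreed)
```

Tweets for which the function returns None would be the ones dropped for lack of majority agreement.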
Baseline: Yahoo! PlaceFinder*
• A service that accepted queries and returned a list of matching places in the form of country, state, city, POI
• A query to the service was similar to a database query: SELECT * FROM geo.placefinder WHERE text = query text
• We chose the query text to be either:
  (a) the tweet (text and hashtags) and the user location from their profile, or
  (b) the list of location mentions from one tweet (human annotated)
* As it was called in Jan 2014
Accuracy with manual location mentions (without NER)
Focus level              All    Text   Hashtag   User Location
Level 1 - Country        89.9   35.3   45.2      71.6
Level 2 - State          73.5   29.3   37.4      36.3
Level 3 - City/Suburb    51.0   24.5   12.4       4.9
Level 4 - POI            29.7   11.7    8.1       1.8
No focus                 58.5   95.8   96.4      63.2
User location was most useful at the country level, but did not contribute much at other levels of granularity.
POI was the hardest level, with only ~30% correctly identified.
All = 0.6 text + 0.3 hashtag + 0.1 user location
Accuracy with location mentions extracted using NER
Setting              Level 1   Level 2   Level 3   Level 4   No Focus
PlaceFinder (a)      87.9      58.6      22.9      21.0       0.3
PlaceFinder (b)      87.8      59.1      23.5      18.8      25.5
Our Alg. No NER      89.9      73.5      51.0      29.7      58.5
Our Alg. With NER    91.3      65.7      47.0      24.9      53.4
(a) The whole tweet was queried
(b) Location mentions were queried (no NER)
Country-level focus was the easiest, with all settings performing similarly.
PlaceFinder was consistently worse at the other levels, but that could also be an effect of our gazetteer hierarchy.
The sources of errors in our algorithm
• Annotation mistakes: human annotators missed some of the mentions.
• Some streets and POIs were missing from the gazetteer.
• Heavily misspelled place names that were not corrected in our pre-processing step.
• We favoured lower granularities in our scoring, which introduced wrong POIs that were not needed.
• Gazetteer bias: if one mention had many matches in a region, the path could wrongly get stronger.
What we learnt and what’s next?
• Finding locational focus is difficult, even for humans (low agreement in annotation)
• Our method was accurate (90%) at country level, but accuracy dropped for state, city, and POI levels (to about 30% for POIs).
• All three information sources (text, hashtag, and author location) contribute to finding the focus, but at different levels.
• How to make it better?
  - Incorporate some context, e.g., tweets that share hashtags, are replies, or are temporally close
  - Learn the weights of the different information sources