Social + Mobile + Commerce
Entity Extraction, Linking, Classification, and Tagging for Social Media:
A Wikipedia Based Approach
Aug 27th, 2013
Abhishek Gattani, Digvijay Lamba, Nikesh Garera, Mitul Tiwari3, Xiaoyong Chai, Sanjib Das1, Sri Subramaniam, Anand Rajaraman2, Venky Harinarayan2, AnHai Doan1
@WalmartLabs, 1University of Wisconsin-Madison, 2Cambrian Ventures, 3LinkedIn
The Problem
“Obama gave an immigration speech while on vacation in Hawaii”
Entity Extraction “Obama” is a Person, “Hawaii” is a Location
Entity Linking “Obama” -> en.wikipedia.org/wiki/Barack_Obama, “Hawaii” -> en.wikipedia.org/wiki/Hawaii
Classification “Politics”, “Travel”
Tagging “Politics”, “Travel”, “Immigration”, “President Obama”, “Hawaii”
On Social Media Data
Short Sentences Ungrammatical, misspelled, lots of acronyms
Social Context From previous conversation/interests: “Go Giants!!”
Large Scale 10s of thousands of updates a second
Lots of Topics New topics and themes every day. Large scale of topics
Why? – Use cases
• Used extensively at Kosmix and later at @WalmartLabs
– Twitter event monitoring
– In-context ads
– User query parsing
– Product search and recommendations
– Social mining
• Use cases
– Central topic detection for a web page or tweet
– Getting a stream of tweets/messages about a topic
• Small team at scale
– About 3 engineers at a time
– Processing the entire Twitter firehose
Based on a Knowledge Base
Published: Building, maintaining, and using knowledge bases: A report from the trenches. In SIGMOD, 2013.
• Global: Covers a wide range of topics. Includes WordNet, Wikipedia, Chrome, Adam, MusicBrainz, Yahoo Stocks etc.
• Taxonomy: Converted Wikipedia graph to a hierarchical taxonomy with IsA edges which are transitive
• Large: 6.5 million hierarchical concepts with 165 million relationships
• Real Time: Constantly updated from sources, analyst curation, event detection
• Rich: Synonyms, homonyms, relationships, etc.
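Because IsA edges are transitive, a concept's lineage is every ancestor reachable by following them. A minimal sketch, assuming a toy edge map; the `ISA` dictionary and the `lineage` function are illustrative names, not the actual KB schema:

```python
# Hypothetical toy taxonomy: concept -> list of direct IsA parents.
ISA = {
    "Barack Obama": ["Presidents"],
    "Presidents": ["Politicians"],
    "Politicians": ["People"],
    "People": [],
}

def lineage(concept, isa=ISA):
    """Return every ancestor reachable through IsA edges (transitive closure)."""
    seen = []
    stack = list(isa.get(concept, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.append(parent)
            stack.extend(isa.get(parent, []))
    return seen

print(lineage("Barack Obama"))  # ['Presidents', 'Politicians', 'People']
```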
Annotate with Contexts
A Real Time User Context
What topics does this user talk about?
A Real Time Social Context
What topics are usually in context of a Hashtag, Domain, or KB Node
A Web Context
Topics in a link in a tweet. What are the topics in a KB Node’s Wiki Page?
Compute the context at scale
Every social conversation takes place in a context that changes what it means
Example Contexts
Barack Obama Social: Putin, Russia, White House, SOPA, Syria, Homeownership, Immigration, Edward Snowden, Al Qaeda
Web: President, White House, Senate, Illinois, Democratic, United States, US Military, War, Michelle Obama, Lawyer, African American
www.whitehouse.gov Social: Petition, Barack Obama, Change, Healthcare, SOPA
#Politics Social: Barack Obama, Russia, Rick Scott, State Dept, Egypt, Snowden, War, Washington, House of Representatives
@Whitehouse User: Barack Obama, Housing Market, Homeownership, Mortgage Rates, Phoenix, Americans, Middle Class Families
Key Differentiators – why it works?
The Knowledge Base
Interleave several problems
Use of Context
Scale
Rule Based
How: First Find Candidate Mentions
“RT Stephen lets watch. Politics of Love is about Obama’s election @EricSu”
Step 1: Pre-Process – clean up tweet
“Stephen lets watch. Politics of Love is about Obama’s election”
Step 2: Find Mentions – All in KB + detectors
[“Stephen”, “lets”, “watch”, “Politics”, “Politics of Love”, “is”, “about”, “Obama”, “Election”]
Step 3: Initial Rules – Remove obvious bad cases
[“Stephen”, “watch”, “Politics”, “Politics of Love”, “Obama”, “Election”]
Step 4: Initial scoring – Quick and dirty
[“Obama”: 10, “Politics of Love”: 9, “Stephen”: 7, “watch”: 7, “Politics”: 6, “Election”: 6]
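Steps 1 to 3 can be sketched as a dictionary lookup over word n-grams. This is only an illustration under stated assumptions: `KB_NAMES` and `STOPWORDS` are tiny hypothetical stand-ins for the real KB and rule set, and the function names are invented for this example.

```python
import re

# Hypothetical stand-in for KB surface forms (lowercased).
KB_NAMES = {"stephen", "lets", "watch", "politics", "politics of love",
            "is", "about", "obama", "election"}
# Hypothetical words removed by the "obvious bad cases" rule.
STOPWORDS = {"lets", "is", "about"}

def preprocess(tweet):
    """Step 1: drop the RT marker and @user handles, collapse whitespace."""
    tweet = re.sub(r"^RT\s+", "", tweet)
    tweet = re.sub(r"@\w+", "", tweet)
    return " ".join(tweet.split())

def find_mentions(text, max_len=3):
    """Step 2: collect every word n-gram (up to max_len words) found in the KB."""
    text = text.replace("’s", "").replace("'s", "")  # strip possessives
    words = re.findall(r"[A-Za-z]+", text)
    found = []
    for i in range(len(words)):
        for n in range(1, max_len + 1):
            window = words[i:i + n]
            if len(window) < n:
                break  # ran past the end of the tweet
            gram = " ".join(window)
            if gram.lower() in KB_NAMES and gram not in found:
                found.append(gram)
    return found

def initial_rules(mentions):
    """Step 3: remove obvious bad cases (here, just a stopword list)."""
    return [m for m in mentions if m.lower() not in STOPWORDS]

clean = preprocess("RT Stephen lets watch. Politics of Love is about Obama’s election @EricSu")
print(initial_rules(find_mentions(clean)))
```

Overlapping candidates such as “Politics” and “Politics of Love” are both kept at this stage; later steps decide between them.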
How: Add mention features
Step 5: Tag and Classify – Quick and dirty
“Obama”: Presidents, Politicians, People; Politics, Places, Geography
“Politics of Love”: Movies, Political Movies, Entertainment, Politics
“Stephen”: Names, People
“watch”: Verb, English Words, Language, Fashion Accessories, Clothing
“Politics”: Politics
“Election”: Political Events, Politics, Government
Tweet: Politics, People, Movies, Entertainment, etc.
Step 6: Add features
Contexts, similarity to the tweet, similarity to user or website, popularity measures, is it interesting?, social signals
How: Finalize mentions
Step 7: Apply Rules
“Obama”: Boost popular stuff and proper nouns
“Politics of Love”: Boost proper nouns, boost due to “Watch”
“Stephen”: Delete out-of-context names
“watch”: Remove verbs
“Politics”: Boost tags which are also mentions
“Election”: Boost mentions in the central topic
Step 8: Disambiguate
KB has many meanings – Pick One
Obama: Barack Obama. Popularity, Context, Social Popularity
Watch: verb. Clothing is not in context
Context is most important! We combine many contexts for the best results.
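A minimal sketch of picking one meaning by combining context overlap with a popularity prior. The candidate data, the Jaccard similarity, and the 0.7 context weight are all assumptions made for illustration; the slides do not specify the actual scoring function.

```python
# Hypothetical candidate meanings for the surface form "watch".
CANDIDATES = {
    "watch (verb)": {"popularity": 0.6,
                     "topics": {"Verb", "English Words", "Language"}},
    "watch (accessory)": {"popularity": 0.4,
                          "topics": {"Fashion Accessories", "Clothing"}},
}

def jaccard(a, b):
    """Overlap between two topic sets, in [0, 1]."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def disambiguate(candidates, context_topics, context_weight=0.7):
    """Pick the meaning with the best mix of context overlap and popularity."""
    def score(name):
        c = candidates[name]
        return (context_weight * jaccard(c["topics"], context_topics)
                + (1 - context_weight) * c["popularity"])
    return max(candidates, key=score)

# Tweet-level topics from the tagging step: clothing is not in context.
print(disambiguate(CANDIDATES, {"Politics", "People", "Movies", "Entertainment"}))
# A shopping-related context flips the choice.
print(disambiguate(CANDIDATES, {"Clothing", "Fashion Accessories", "Shopping"}))
```

With no topical overlap, the popularity prior decides; once the context mentions clothing, the accessory meaning wins despite its lower prior.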
How: Finalize
Step 9: Rescore
Logistic Regression model on all the features
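As a rough illustration of rescoring with logistic regression, the sketch below maps a mention's features to a score in (0, 1). The feature names, weights, and bias are hypothetical; a real model would learn them from labeled tweets.

```python
import math

# Hypothetical weights and bias, chosen only for illustration.
WEIGHTS = {"context_similarity": 2.0, "popularity": 1.0, "is_proper_noun": 0.5}
BIAS = -1.5

def rescore(features):
    """Logistic regression score: sigmoid of the weighted feature sum."""
    z = BIAS + sum(w * features.get(name, 0.0) for name, w in WEIGHTS.items())
    return 1.0 / (1.0 + math.exp(-z))

strong = rescore({"context_similarity": 0.9, "popularity": 0.8, "is_proper_noun": 1.0})
weak = rescore({"context_similarity": 0.1, "popularity": 0.2, "is_proper_noun": 0.0})
print(strong > weak)  # True
```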
Step 10: Re-tag
Use latest scores and only picked meanings
Step 11: Editorial Rules
A regular-expression-like language for analysts to pick/block
Does it work? – Evaluation of Entity Extraction
• For 500 English tweets, we hand-curated a list of mentions.
• For 99 of those, we built a comprehensive list of tags.
• Entity extraction:
– Works well for people, organizations, locations
– Works great for unique names
– Works badly for media: albums, songs
• Generic problem:
– Too many movies, books, albums and songs have “generic” names
– Inception, It’s Friday, etc.
– Even when popular they are often used “in conversation”
– Very hard to disambiguate.
– Very hard to find which ones are generic.
Does it work? – Evaluation of Tagging
• Tagging/Classification:
– Works well for Travel/Sports
– Bad for Products and Social sciences
• N Lineages problem:
– Note that all mentions have multiple lineages in the KB.
– Usually, one IsA lineage goes to “People” or “Product”
– A ContainedIn lineage goes to the topic like “SocialScience”
– Detecting which is primary is a hard problem.
– Is Camera in Photography? Or Electronics?
– Is War History? Or Politics?
– How far do we go?
Comparison with existing systems
• The first such comparison effort that we know of.
• OpenCalais
– Industrial Entity Extraction system
• StanNER-3: (From Stanford)
– This is a 3-class (Person, Organization, Location) named entity recognizer. The system uses a CRF-based model which has been trained on a mixture of CoNLL, MUC and ACE named entity corpora.
• StanNER-3-cl: (From Stanford)
– This is the caseless version of the StanNER-3 system, which means it ignores capitalization in text.
• StanNER-4: (From Stanford)
– This is a 4-class (Person, Organization, Location, Misc) named entity recognizer for English text. This system uses a CRF-based model which has been trained on the CoNLL corpora.
For People, Organization, Location
• Details in the paper.
• We are far better in almost all respects:
– Overall: 85% precision vs 78% for the best of the other systems.
– Overall: 68% recall vs 40% for StanNER-3 and 28% for OpenCalais
– Significantly better on Organizations
• Why? - Bigger Knowledge Base
– The larger knowledge base allows a more comprehensive disambiguation.
– Is “Emilie Sloan” referring to a person or organization?
• Why? - Common interjections
– LOL, ROFL, Haha interpreted as organizations by other systems.
– Acronyms misinterpreted
• Vs OpenCalais
– Recall is a major difference, with a significantly smaller set of entities recognized by OpenCalais
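For reference, precision and recall figures like those above follow the standard definitions. A minimal sketch; the mention sets below are toy data invented for illustration, not data from the evaluation:

```python
def precision_recall(predicted, gold):
    """Precision = correct / predicted; recall = correct / gold."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall

# Toy example: the system finds two real entities and mistakes "LOL" for one.
gold = {"Obama", "Hawaii", "Walmart", "Wisconsin"}
predicted = {"Obama", "Hawaii", "LOL"}
p, r = precision_recall(predicted, gold)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.67 recall=0.50
```

A system that tags interjections as organizations loses precision, while a system that recognizes only a small set of entities loses recall.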
Q&A