how oracle uses crowdflower for sentiment analysis
How Oracle Uses CrowdFlower's Data Enrichment Platform for Sentiment Analysis
Before we get started
#RichData
The housekeeping items:
• Webinar slides, recording, and Q&A will be emailed
• Enter questions in the chat on the webinar panel
• Or ask your questions on Twitter
  - @CrowdFlower
  - Use #RichData
Meet the Data Scientists
• Randall Sparks, Principal Member of Technical Staff, Oracle Data Cloud – Social Platform Group
• Pallika Kanani, Senior Research Staff Member, Oracle Labs
• Lukas Biewald | @L2K, CEO and Founder, CrowdFlower
Overview
What will be covered today?
People-Powered Feedback
• Train and perfect your algorithms to build sentiment & other models that classify text
Insights: Why CrowdFlower?
• Multiple language support
• World-wide contributor network
• Data enrichment capabilities
• Test Question Infrastructure
• Support for tracking contributor agreement and data quality
Use Cases
• Real examples of data collection and data modeling done by Oracle
Randall Sparks
• Oracle Data Cloud – Social Platform Group
• Use case: Social Media Analytics
• Data Collection, Data Modeling Process
• Use case: Multiple Languages
About Us
• Oracle Data Cloud – Social Platform Group
  – Data Service supporting multiple applications
  – Monitoring & Analysis of Social Media Streams & other text sources
• Categorization of social media streams to topics + enrichments
  – Key words/phrases, Semantic vectors (LSA)
• Enrichments
  – Themes within a topic, related terms appearing in messages
  – Demographics, Location, Indicators of intent, etc.
  – Sentiment
• Social Relationship Management (SRM) Product
What We Do
• Collect, filter, & analyze a large volume of streaming social media content from multiple content sources via multiple suppliers/aggregators
• Multiple (30+) languages — big data collection challenge
• Process
  – Collect content streamed from multiple suppliers/aggregators
  – Text filtering, normalization, tokenization, chunking, etc. (NLP)
  – “Categorize” messages (match snippets to “Topics”)
  – Topics: combinations of keywords/phrases + semantic filters: vector comparison of words & texts in “semantic space” using Latent Semantic Analysis (LSA)
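The "keywords/phrases + semantic filter" idea can be illustrated with a minimal sketch. It is not the Oracle pipeline: plain term-count vectors stand in for an LSA space (no SVD is performed), and the keyword set, exemplar text, and threshold are all hypothetical values chosen for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a message and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def matches_topic(message, keywords, exemplar, threshold=0.2):
    """Return True if the message hits a keyword AND is close to the exemplar.

    keywords  -- hypothetical set of trigger words for the topic
    exemplar  -- hypothetical text describing the intended sense of the topic
    threshold -- assumed similarity cutoff, not a value from the talk
    """
    tokens = tokenize(message)
    if not keywords.intersection(tokens):
        return False  # keyword filter failed, no need to compare vectors
    similarity = cosine(Counter(tokens), Counter(tokenize(exemplar)))
    return similarity >= threshold


# Hypothetical topic: Subway, the restaurant
SUBWAY_KEYWORDS = {"subway"}
SUBWAY_EXEMPLAR = "subway sandwich restaurant lunch footlong order eat"

food = "Grabbed a footlong sandwich at Subway for lunch"
transit = "The subway train was delayed at Times Square again"

print(matches_topic(food, SUBWAY_KEYWORDS, SUBWAY_EXEMPLAR))
print(matches_topic(transit, SUBWAY_KEYWORDS, SUBWAY_EXEMPLAR))
```

Both messages contain the keyword, but only the first is close enough to the exemplar to pass the semantic filter. A real LSA filter would compare vectors in a reduced-dimension space learned from a large corpus, so it could also match messages that share no surface words with the exemplar.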
Use Case: Social Media Analytics
Keywords/phrases + Semantic filters
Use Case: Social Media Analytics — Example View
• Media Types of matched “snippets”
Why Do We Need Sentiment Data?
• Train sentiment model (Machine Learning)
  – Training data: 1000s of human-annotated items
  – Features: words
    • also: n-grams, phrases, known negation/intensification patterns, etc.
    • punctuation, emoticons, emoji, other metadata
  – Various algorithms:
    • Decision Trees, Logistic Regression, Support Vector Machine (SVM), etc.
• Analyze model
  – held-out test set
  – accuracy, precision/recall, etc.
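As a rough illustration of the feature types and evaluation measures listed above, here is a minimal sketch. The negation and emoticon lists, the feature naming scheme, and the sample labels are assumptions made for the example, not details from the talk.

```python
import re

NEGATIONS = {"not", "no", "never"}       # assumed, illustrative list
EMOTICONS = {":)", ":(", ":D", ";)"}     # assumed, illustrative list


def extract_features(text):
    """Return a set of features: unigrams, bigrams, negated words, emoticons."""
    words = []
    features = set()
    for token in text.split():
        if token in EMOTICONS:
            features.add("EMOTICON=" + token)
        else:
            words.extend(re.findall(r"[a-z']+", token.lower()))
    features.update(words)                                         # unigrams
    features.update(f"{a}_{b}" for a, b in zip(words, words[1:]))  # bigrams
    # Mark the word that directly follows a negation word
    for prev, cur in zip(words, words[1:]):
        if prev in NEGATIONS:
            features.add("NOT_" + cur)
    return features


def accuracy(gold, predicted):
    """Fraction of items where the predicted label equals the gold label."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def precision_recall(gold, predicted, label):
    """Precision and recall for one label over parallel lists of labels."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == label and p == label)
    fp = sum(1 for g, p in pairs if g != label and p == label)
    fn = sum(1 for g, p in pairs if g == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical held-out labels and model predictions
gold = ["pos", "pos", "neg", "neg", "neu"]
predicted = ["pos", "neg", "neg", "neg", "pos"]

print(sorted(extract_features("Not good :(")))
print(accuracy(gold, predicted))
print(precision_recall(gold, predicted, "pos"))
```

The feature sets would feed whichever classifier is chosen (decision tree, logistic regression, SVM); the metrics are computed only on items the model never saw during training.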
Data Collection & Modeling Process
• Generate “gold” test item data:
  – Transform into (our) standard format for upload to CrowdFlower
  – Define CrowdFlower job to generate test questions & upload data
  – Run job & download results
  – Select “gold” test items based on analysis of contributor agreement
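The last step, selecting gold items by contributor agreement, might look something like the sketch below. It assumes the downloaded results have been reduced to (item, contributor, label) tuples and uses plain unweighted agreement; the slides do not describe how agreement was actually computed, and the thresholds here are hypothetical.

```python
from collections import Counter, defaultdict


def select_gold(judgments, min_judgments=3, min_agreement=1.0):
    """Pick items whose contributors agree strongly enough to serve as gold.

    judgments -- iterable of (item_id, contributor_id, label) tuples
    Returns {item_id: (majority_label, agreement)} for qualifying items.
    Both thresholds are assumed values for illustration.
    """
    labels_by_item = defaultdict(list)
    for item_id, _contributor_id, label in judgments:
        labels_by_item[item_id].append(label)

    gold = {}
    for item_id, labels in labels_by_item.items():
        if len(labels) < min_judgments:
            continue  # too few judgments to trust the agreement figure
        label, count = Counter(labels).most_common(1)[0]
        agreement = count / len(labels)
        if agreement >= min_agreement:
            gold[item_id] = (label, agreement)
    return gold


# Hypothetical judgments from a test-question generation job
judgments = [
    ("t1", "c1", "positive"), ("t1", "c2", "positive"), ("t1", "c3", "positive"),
    ("t2", "c1", "positive"), ("t2", "c2", "negative"), ("t2", "c3", "positive"),
    ("t3", "c1", "negative"), ("t3", "c2", "negative"),
]

print(select_gold(judgments))
```

Only the unanimously labeled item with enough judgments survives. Lowering `min_agreement` trades gold-set size against the risk of ambiguous test questions.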
Data Collection & Modeling Process (continued)
• Generate full training & test data sets:
  – Define main CrowdFlower job, upload data & test items
  – Launch & monitor job (remove problematic test questions)
  – Download & analyze results
  – Select (high-agreement) items for ML sentiment model training
  – Build sentiment model, test, & deploy
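One way to spot "problematic test questions" while a job runs is to look for questions that most contributors miss, since a widely missed question may itself be wrong or ambiguous. The sketch below is an assumption about how such monitoring could be done; the input tuple layout and both thresholds are hypothetical and not taken from CrowdFlower or the talk.

```python
from collections import defaultdict


def flag_test_questions(responses, max_miss_rate=0.5, min_responses=5):
    """Return {question_id: miss_rate} for test questions worth reviewing.

    responses -- iterable of (question_id, given_label, expected_label)
    A question is flagged when it has enough responses and more than
    max_miss_rate of them disagree with the expected answer.
    """
    totals = defaultdict(int)
    misses = defaultdict(int)
    for question_id, given, expected in responses:
        totals[question_id] += 1
        if given != expected:
            misses[question_id] += 1

    flagged = {}
    for question_id, total in totals.items():
        miss_rate = misses[question_id] / total
        if total >= min_responses and miss_rate > max_miss_rate:
            flagged[question_id] = miss_rate
    return flagged


# Hypothetical responses to three test questions
responses = (
    [("q1", "neutral", "negative")] * 4 + [("q1", "negative", "negative")]
    + [("q2", "positive", "positive")] * 4 + [("q2", "neutral", "positive")]
    + [("q3", "negative", "positive")] * 2
)

print(flag_test_questions(responses))
```

Flagged questions are candidates for manual review, not automatic removal: a high miss rate can also mean the task instructions are unclear.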
An Example Of How We Collect Data
12+ Languages. Target: 30
Pallika Kanani
• About Oracle Labs
• Power of human-annotated data
• Use case – Language understanding
• Use case – Wisdom of the crowd
• Use case – Data quality
Information Retrieval and Machine Learning Group
• Strong research program, publications
• Develop core Information Retrieval, Statistical Natural Language Processing and Machine Learning technologies
• Help solve complex and challenging business problems across Oracle
• Utilize CrowdFlower platform for a wide variety of relevance ranking and NLP problems
Data Annotation
• First step in building search / NLP / machine learning application
• Many Machine Learning techniques require some human-annotated data
• Even for unsupervised methods, need annotated data for proper evaluation
Use Case: Language Understanding
• Goal: Get a better understanding of what our customers are talking about
• Extract useful information from raw text
• Language is all about context: Disambiguating extracted information is crucial, and people are good at understanding context
  – Are people talking about New York subway or Subway, the restaurant?
CrowdFlower as a data enrichment platform
Before
• Data collection for Machine Learning used to be tedious
  – Long iterations typically lasting weeks and months
  – Prohibitively high costs
  – Difficult to innovate: overfitting to existing corpora
After
• Try out new tasks at previously unimaginable speed
• Designing a job for a new NLP task can take as little as a day; getting results can be a matter of hours
• Rapid prototyping due to affordable cost for early trials (and final data collection)
Rapid Feedback
• Rapid debugging of the data collection process
• Works like debugging software with humans in the loop
Wisdom of the Crowd
• Incorrect test questions due to lack of knowledge of pop culture
• The crowd set me straight
“‘Say Something’ is the name of a song. Please fix your test question”
Data Quality
• Good quality data even for tricky tasks
• Example: Ran a task for finding relevant URLs from Wikipedia, and got excellent results
TWITTER.COM/[email protected]
Q & A
What’s next?
• Look out for a follow-up email with a copy of these slides, a recording of the webinar, Q&A recap, and other fun stuff
• View and share this presentation on Slideshare
  - Follow us for more such events
• Next webinar:
  - CrowdFlower User Webinar: Graphical Editor and Visual Reports
  - September 10th 2015 – 10:00 AM PST
  - Register at: http://www.crowdflower.com/events
Rich Data Summit
What is Rich Data Summit?
The leading conference for data scientists focused on turning big data into rich, meaningful data
• Data Scientists – 300+
• Sessions focused on Data Science – 5
• Hands-on Workshops – 9
Qualified webinar attendees will receive a 30% discount coupon
Interested? Email us at [email protected]
www.richdatasummit.com
@RichDataSummit
Thank you.