How Oracle Uses CrowdFlower's Data Enrichment Platform for Sentiment Analysis

Before we get started


#RichData

The housekeeping items:

• Webinar slides, recording, and Q&A will be emailed

• Enter questions in chat on webinar panel

• Or ask your questions on Twitter: @CrowdFlower, use #RichData

Meet the Data Scientists


Randall Sparks, Principal Member of Technical Staff, Oracle Data Cloud, Social Platform Group

Pallika Kanani, Senior Research Staff Member, Oracle Labs

Lukas Biewald | @L2K, CEO and Founder, CrowdFlower


Overview

What will be covered today?

Insights
• Train and perfect your algorithms to build sentiment & other models that classify text

Why CrowdFlower?
• Multiple language support
• World-wide contributor network
• Data enrichment capabilities

People-Powered Feedback
• Test question infrastructure
• Support for tracking contributor agreement and data quality

Use Cases
• Real examples of data collection and data modeling done by Oracle


Randall Sparks

• Oracle Data Cloud – Social Platform Group

• Use case: Social Media Analytics

• Data Collection, Data Modeling Process

• Use case: Multiple Languages

About Us

• Oracle Data Cloud, Social Platform Group
  – Data service supporting multiple applications
  – Monitoring & analysis of social media streams & other text sources

• Categorization of social media streams to topics + enrichments
  – Key words/phrases, semantic vectors (LSA)

• Enrichments
  – Themes within a topic, related terms appearing in messages
  – Demographics, location, indicators of intent, etc.
  – Sentiment

• Social Relationship Management (SRM) product

What We Do

• Collect, filter, & analyze a large volume of streaming social media content from multiple content sources via multiple suppliers/aggregators

• Multiple (30+) languages: a big data collection challenge

• Process
  – Collect content streamed from multiple suppliers/aggregators
  – Text filtering, normalization, tokenization, chunking, etc. (NLP)
  – “Categorize” messages (match snippets to “Topics”)
  – Topics: combinations of keywords/phrases + semantic filters: vector comparison of words & texts in “semantic space” using Latent Semantic Analysis (LSA)
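The topic-matching step can be pictured with a minimal Python sketch. Everything here is an assumption for illustration: the `matches_topic` helper, the rule that a snippet needs both a keyword hit and a vector match, the 0.5 threshold, and the toy 3-dimensional vectors standing in for real LSA vectors. The slides do not say how keywords and semantic filters are actually combined.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def matches_topic(snippet, snippet_vec, topic, threshold=0.5):
    """Assumed rule: the snippet must contain a topic keyword AND its vector
    must be close to the topic vector in the (assumed) semantic space."""
    text = snippet.lower()
    has_keyword = any(kw in text for kw in topic["keywords"])
    similar = cosine(snippet_vec, topic["vector"]) >= threshold
    return has_keyword and similar

# Hypothetical topic about the sandwich chain, with a made-up topic vector.
topic = {"keywords": ["subway"], "vector": [0.9, 0.1, 0.0]}

food_vec = [0.8, 0.2, 0.1]     # made-up vector for a food-related snippet
transit_vec = [0.0, 0.2, 0.9]  # made-up vector for a transit-related snippet

print(matches_topic("Lunch at Subway today", food_vec, topic))
print(matches_topic("The subway was delayed again", transit_vec, topic))
```

Both snippets contain the keyword, but only the first one is close to the topic vector, so the semantic filter is what separates them in this toy setup.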


Use Case: Social Media Analytics

Keywords/phrases + semantic filters


Use Case: Social Media Analytics — Example View


Use Case: Social Media Analytics — Example View

• Media Types of matched “snippets”


Why Do We Need Sentiment Data?

• Train sentiment model (machine learning)
  – Training data: 1000s of human-annotated items
  – Features: words
    • also: n-grams, phrases, known negation/intensification patterns, etc.
    • punctuation, emoticons, emoji, other metadata
  – Various algorithms:
    • Decision Trees, Logistic Regression, Support Vector Machine (SVM), etc.

• Analyze model
  – Held-out test set
  – Accuracy, precision/recall, etc.
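As a rough illustration of the feature types and evaluation metrics listed above, here is a small standard-library Python sketch. The feature names, the tiny negation list, and the emoticon pattern are all assumptions made for the example. They are not the features Oracle actually uses, and a real system would feed such features into one of the algorithms named on the slide.

```python
import re
from collections import Counter

NEGATIONS = {"not", "no", "never"}        # illustrative, far from exhaustive
EMOTICONS = re.compile(r"[:;]-?[)(DP]")   # matches e.g. :) ;-) :( :D

def extract_features(text):
    """Count unigrams, bigrams, negated words, and emoticons in a message."""
    feats = Counter()
    for emo in EMOTICONS.findall(text):
        feats["EMO=" + emo] += 1
    # Remove emoticons before tokenizing so ":D" does not leave a stray "d".
    tokens = re.findall(r"[a-z']+", EMOTICONS.sub(" ", text).lower())
    for i, tok in enumerate(tokens):
        feats["w=" + tok] += 1
        if i > 0:
            feats["bi=" + tokens[i - 1] + "_" + tok] += 1
            if tokens[i - 1] in NEGATIONS:
                feats["neg=" + tok] += 1
    return feats

def precision_recall(gold, predicted, label):
    """Precision and recall for one label over parallel lists of labels."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == label and p == label)
    fp = sum(1 for g, p in pairs if g != label and p == label)
    fn = sum(1 for g, p in pairs if g == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

print(sorted(extract_features("Not good :(").items()))

# Toy held-out test set: human labels vs. hypothetical model predictions.
gold = ["pos", "neg", "pos", "neg"]
pred = ["pos", "pos", "pos", "neg"]
print(precision_recall(gold, pred, "pos"))
```

Reporting precision and recall per label, rather than accuracy alone, shows whether a model is over-predicting one sentiment class.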


Data Collection & Modeling Process

• Generate “gold” test item data:
  – Transform into (our) standard format for upload to CrowdFlower
  – Define CrowdFlower job to generate test questions & upload data
  – Run job & download results
  – Select “gold” test items based on analysis of contributor agreement
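Selecting “gold” items by contributor agreement could look roughly like the Python sketch below. The data layout (a dict of item id to list of labels), the `select_gold` helper, and the thresholds are hypothetical. The slides do not specify the agreement measure or cutoff used, and this is not a CrowdFlower API.

```python
from collections import Counter

def agreement(labels):
    """Return (majority_label, fraction of judgments that chose it)."""
    label, votes = Counter(labels).most_common(1)[0]
    return label, votes / len(labels)

def select_gold(judgments, min_agreement=1.0, min_judgments=3):
    """Keep items with enough judgments whose agreement meets the threshold.
    Both thresholds are assumptions chosen for illustration."""
    gold = {}
    for item_id, labels in judgments.items():
        if len(labels) < min_judgments:
            continue
        label, score = agreement(labels)
        if score >= min_agreement:
            gold[item_id] = label
    return gold

# Made-up judgments downloaded from a hypothetical job.
judgments = {
    "msg-1": ["positive", "positive", "positive"],
    "msg-2": ["positive", "negative", "neutral"],
    "msg-3": ["negative", "negative", "negative", "negative"],
    "msg-4": ["neutral", "neutral"],
}

print(select_gold(judgments))
# {'msg-1': 'positive', 'msg-3': 'negative'}
```

Lowering `min_agreement` would make the same function usable for picking high-agreement training items, at the cost of noisier labels.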


Data Collection & Modeling Process (continued)

• Generate full training & test data sets:
  – Define main CrowdFlower job, upload data & test items
  – Launch & monitor job (remove problematic test questions)
  – Download & analyze results
  – Select (high-agreement) items for ML sentiment model training
  – Build sentiment model, test, & deploy
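One way to spot problematic test questions while monitoring a job is to look for questions that most contributors miss. The Python sketch below assumes a hypothetical data layout, helper name, and miss-rate threshold. It is not how CrowdFlower itself flags test questions.

```python
from collections import Counter

def flag_test_questions(test_questions, max_miss_rate=0.5):
    """Return (question_id, miss_rate, most common wrong answer) for every
    test question that more than max_miss_rate of contributors missed."""
    flagged = []
    for qid, q in test_questions.items():
        answers = q["answers"]
        if not answers:
            continue
        misses = [a for a in answers if a != q["gold"]]
        miss_rate = len(misses) / len(answers)
        if miss_rate > max_miss_rate:
            crowd_answer = Counter(misses).most_common(1)[0][0]
            flagged.append((qid, round(miss_rate, 2), crowd_answer))
    return flagged

# Made-up contributor answers to two test questions.
test_questions = {
    "tq-1": {"gold": "negative",
             "answers": ["neutral", "neutral", "neutral", "negative"]},
    "tq-2": {"gold": "positive",
             "answers": ["positive", "positive", "negative"]},
}

print(flag_test_questions(test_questions))
# [('tq-1', 0.75, 'neutral')]
```

A flagged question is only a candidate for review: a high miss rate can mean the gold answer is wrong, or simply that the item is hard.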


An Example Of How We Collect Data


12+ Languages. Target: 30


Pallika Kanani

• About Oracle Labs

• Power of human-annotated data

• Use case – Language understanding

• Use case – Wisdom of the crowd

• Use case – Data quality


Information Retrieval and Machine Learning Group

• Strong research program, publications

• Develop core Information Retrieval, Statistical Natural Language Processing and Machine Learning technologies

• Help solve complex and challenging business problems across Oracle

• Utilize CrowdFlower platform for a wide variety of relevance ranking and NLP problems

Data Annotation

• First step in building search / NLP / machine learning application

• Many Machine Learning techniques require some human-annotated data

• Even for unsupervised methods, need annotated data for proper evaluation


Use Case: Language Understanding

• Goal: Get a better understanding of what our customers are talking about

• Extract useful information from raw text

• Language is all about context: disambiguating extracted information is crucial, and people are good at understanding context
  – Are people talking about the New York subway or Subway, the restaurant?


CrowdFlower as a data enrichment platform

Before

• Data collection for Machine Learning used to be tedious
  – Long iterations, typically lasting weeks or months
  – Prohibitively high costs
  – Difficult to innovate: overfitting to existing corpora

After

• Try out new tasks at previously unimaginable speed

• Designing a job for a new NLP task can take as little as a day; getting results can be a matter of hours

• Rapid prototyping due to affordable cost for early trials (and final data collection)


Rapid Feedback

• Rapid debugging of the data collection process

• Works like debugging software with humans in the loop


Wisdom of the Crowd

• Incorrect test questions due to lack of knowledge of pop culture

• The crowd set me straight

“‘Say Something’ is the name of a song. Please fix your test question”


Data Quality

• Good quality data even for tricky tasks

• Example: Ran a task for finding relevant URLs from Wikipedia, and got excellent results


TWITTER.COM/[email protected]

Q & A

What’s next?


• Look out for a follow-up email with a copy of these slides, a recording of the webinar, Q&A recap, and other fun stuff

• View and share this presentation on Slideshare
  - Follow us for more such events

• Next webinar:
  - CrowdFlower User Webinar: Graphical Editor and Visual Reports
  - September 10th 2015 – 10:00 AM PST
  - Register at: http://www.crowdflower.com/events

Rich Data Summit

What is Rich Data Summit?

The leading conference for data scientists focused on turning big data into rich, meaningful data

• Data Scientists – 300+
• Sessions focused on Data Science – 5
• Hands-on Workshops – 9

Qualified webinar attendees will receive a 30% discount coupon

Interested? Email us at [email protected]

www.richdatasummit.com

@RichDataSummit


Thank you.
