Post on 07-Feb-2017
Understanding Voice of Members via Text Mining
– How LinkedIn built a text analytics platform at scale
Chi-Yi Kuan, Weidong Zhang, Tiger Zhang
Who are we?
Chi-Yi Kuan (www.linkedin.com/in/chiyikuan)
• Director, Analytics at LinkedIn
• Big data evangelist and practitioner
Weidong Zhang (www.linkedin.com/in/weidongzhang1)
• Manager, Analytics Platform & Apps at LinkedIn
• Build big data and analytics products
Tiger Zhang (www.linkedin.com/in/tigerzhang)
• Sr. Staff, Analytics at LinkedIn
• Text mining scientist and big data enthusiast
Strata + Hadoop World, 12/8/2016
Knowledge Schools Skills Jobs Companies Members
467M 7M 6M 3B 27k 200k Endorsements Daily posts
467M 2B Billions
LinkedIn Big Data
467+ million members = a lot of data
Voices: drive actionable intelligence from member voices…
What’s trending Products
Home Page Mobile Inbox
Sentiments Value Props
Hire Market Sell
Relevance filtering: identify content that is relevant to the LinkedIn brand and products/services
Classification: structure unstructured textual data into well-defined categories
Topic mining: find the most significant topics and stories in a certain time window
…creating impact across business metrics
Developed game-changing solutions to drive Voice of Member impact
Improved analytics efficiency with unstructured data by 20X
Drove end-to-end technology integration on big data, embedding NLP solutions
Piloting operational solutions to scale advanced analytics impact for broader organization
LinkedIn Hadoop Ecosystem
HDFS
Map-Reduce Tez Spark
Pig Hive Scalding
YARN
Azkaban
Design Principles for Voices Platform
Scalability Availability Easy to Use Process Platform
Data Systems
Application Framework
Kafka, Hadoop
Spark Gobblin
Elasticsearch NoSQL
Phoenix Elasticsearch
Highcharts
E2E Voices Platform Architecture
Data Processing at Scale – with Generic ETL
Smart IDs – for Viral Mentions with Threading
High Availability – through Heterogeneous Data
Machine learning based analytic engine to surface insights to everyday business users
Customized Feeds
Central navigation
Trending insights
Social analytics & topic mining
Deep dives
Sentiment solutions
Text mining is a crowded space
Our solution targets unique use cases for LinkedIn
Member info • Identity • Behavior • Social
Social data
Customer feedback • Customer service • Group updates • Network updates
Survey results
What’s trending
Products
Sentiments
Value Propositions
PYMK Group
Home Page Mobile Inbox
Identity Network
Hire Market Sell
Relevance solution
Topic mining
Text Classification
1) Focusing on relevant data
Relevant:
▪ Product insights, launches, and events
▪ Horizontal themes
▪ PR and marketing campaigns
▪ Brand and value
▪ LinkedIn's strategy, financial performance, international etc.
Non-relevant:
▪ Status updates, e.g. "I posted something on LinkedIn"
▪ Social mentions, e.g. "Please connect with me on LinkedIn" or "Follow me on LinkedIn"
▪ Self-promoting materials, e.g. "share on LinkedIn"
▪ Spam
Keyword-based approach
[Figure: relevance prediction power of keyword rules (whitelist, blacklist); values shown: 56%, 10%, 60%, 6%, 19%, 35%]
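A minimal sketch of how a whitelist/blacklist keyword filter might work. The blacklist phrases come from the non-relevant examples in this deck; the whitelist terms and the function name `keyword_relevance` are hypothetical, since the deck does not publish the actual rules.

```python
# Blacklist phrases are taken from the deck's non-relevant examples.
BLACKLIST = (
    "connect with me on linkedin",
    "follow me on linkedin",
    "share on linkedin",
)
# Hypothetical whitelist terms, chosen only for illustration.
WHITELIST = ("launch", "new feature", "earnings", "campaign")


def keyword_relevance(text):
    """Label one mention as 'non-relevant', 'relevant', or 'unknown'."""
    lowered = text.lower()
    # Blacklist is checked first so social mentions are dropped early.
    if any(phrase in lowered for phrase in BLACKLIST):
        return "non-relevant"
    if any(phrase in lowered for phrase in WHITELIST):
        return "relevant"
    # Neither list matched: the rules cannot decide.
    return "unknown"
```

Mentions that match neither list stay undecided, which is the coverage gap that a keyword-only approach leaves open.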
2) Leveraging text classification engine
Generic text classification framework
▪ Feature generation
▪ Feature selection
▪ Machine learning algorithms:
– Naïve Bayes (NB)
– Logistic Regression (LR)
– Support Vector Machines (SVM) (LibLinear)
▪ Cross-validation and evaluation
Applications
▪ LinkedIn relevance
▪ Sentiment analysis
▪ Product categorization
▪ Value proposition classification
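To make the framework concrete, here is a small pure-Python sketch of one listed algorithm, Naïve Bayes, applied to relevance. It is not the Voices implementation: the bag-of-words features, the class name `NaiveBayes`, and the four training sentences are illustrative assumptions, and feature selection and cross-validation are left out.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Feature generation: lowercase bag-of-words tokens."""
    return text.lower().split()


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_docs / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in tokenize(text):
                if word not in self.vocab:
                    continue  # skip words never seen in training
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data, invented for this example.
train_texts = [
    "love the new linkedin mobile app",
    "linkedin learning launch is a great product",
    "please connect with me on linkedin",
    "follow me on linkedin",
]
train_labels = ["relevant", "relevant", "non-relevant", "non-relevant"]
model = NaiveBayes().fit(train_texts, train_labels)
```

The same fit/predict shape would apply to the other listed applications (sentiment, product categorization, value propositions) by swapping the labels.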
Machine learning approach increases overall relevance by 40%
[Figure: relevance prediction power of keyword rules (whitelist, blacklist) and SVM; values shown: 56%, 6%, 19%, 40%, 100%, 35%]
SVM: great gain in balancing precision and recall
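For readers less familiar with the trade-off named above, the sketch below computes precision, recall, and F1 from a list of true and predicted labels. The labels are invented for illustration and are not results from the deck.

```python
def precision_recall_f1(y_true, y_pred, positive="relevant"):
    """Return (precision, recall, F1) for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Invented example: a strict rule that flags only one of four relevant items.
y_true = ["relevant"] * 4 + ["non-relevant"] * 2
y_pred = ["relevant"] + ["non-relevant"] * 5
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```

In this toy case precision is perfect while recall is low, the kind of imbalance a strict keyword rule tends to produce.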
3) Enabling topic mining HIGH SPARK
Part-of-speech (POS) tagging (Stanford CoreNLP), e.g. "This is great."
POS pattern matching
Topic pruning
- Stemming (e.g. forms ending in "-ing" and "-s" reduce to the same stem)
- Removing stop words
- Merging synonyms
- Clustering (optional)
Topic ranking: TF-IDF weighting and DF ranking
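The pruning and ranking steps can be sketched in plain Python. This is one possible reading of "TF-IDF weighting and DF ranking", not the method used in Voices: each term carries a TF-IDF weight and terms are ordered by document frequency. POS tagging and pattern matching are skipped, so single tokens stand in for candidate topics. The stop-word list, the toy stemmer `crude_stem`, and the sample documents are all assumptions.

```python
import math
from collections import Counter

# Tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"a", "are", "in", "is", "the"}


def crude_stem(word):
    """Toy suffix stripper standing in for a real stemmer."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def prune(doc):
    """Lowercase, drop stop words, then stem what is left."""
    return [crude_stem(w) for w in doc.lower().split() if w not in STOP_WORDS]


def rank_topics(docs):
    """Return (term, tfidf_weight, df) rows, highest DF first."""
    pruned = [prune(d) for d in docs]
    n_docs = len(pruned)
    tf = Counter(t for terms in pruned for t in terms)
    df = Counter(t for terms in pruned for t in set(terms))
    rows = [(t, tf[t] * math.log(n_docs / df[t]), df[t]) for t in df]
    # Rank by DF, then by TF-IDF weight, then alphabetically for stability.
    return sorted(rows, key=lambda r: (-r[2], -r[1], r[0]))


# Invented sample documents.
docs = [
    "profile posts are loading slowly",
    "loading a post takes forever",
    "posting is broken in the app",
    "job search works well",
]
ranked = rank_topics(docs)
```

Here "posts", "post", and "posting" collapse to one stem, so that topic is counted in three documents and ranks first.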
Trending Insights – identify organic trending topics
Didi and Kuaidi merger
Product release
LinkedIn’s customer support has evolved into an intelligence platform…
Scaling to have a broader impact across LinkedIn
Reactive Support
▪ GCO cases
▪ Issue resolution
▪ Support focused
Multi-channel Feedback
▪ Internal data (GCO, surveys, site feedback)
▪ App review
▪ LI.com
▪ Social data
Intelligent Insights
▪ Product insight
▪ Member insight
▪ Launch tracking
Predictive Anticipation
▪ Social sentiment
▪ Brand tracking
▪ Viral mentions
This is what the future could look like
1. From the first time we pick up an isolated comment…
2. Machine determines if there is significant reach…
3. …and whether it is a trending topic…
4. …breaks down into sentiment and drivers…
5. …geographic locations…
6. (For LI data) deep dive into MLC segmentation…
7. …and audience segmentation…
8. …generates automatic reporting, alerts and escalations…
9. …and closes the feedback loop with support and PR solutions
Best customer experience starts from understanding Voices of members!
Thank You!
Engineering blogs for Voices
Part I. Voices: a Text Analytics Platform for Understanding Member Feedback
Part II. Technical Details for Topic Mining
References
1. LibLinear: a library for large linear classification, available at https://www.csie.ntu.edu.tw/~cjlin/liblinear/
2. LingPipe: a Java-based toolkit for processing text using computational linguistics, available at http://alias-i.com/lingpipe/
3. NLTK: a leading platform for building Python programs to work with human language data, available at http://www.nltk.org/
4. Stanford CoreNLP: an open source project led by the Stanford NLP group, available at http://nlp.stanford.edu/software/