Copyright ©2014 Visible Technologies, Inc. All rights reserved.
Data Mining and Engineering
Presented by Lucas Parker, Senior Software Development Engineer, Research & Development
About Visible
Our Mission: Customer Value
Global Authoritative Content
• Most comprehensive global content sourcing model
• Clean & accurate data
Powerful Search & Discovery
• Only the most pertinent results, based on your criteria
• Pivot and drill to identify discussion drivers
Engagement/Social CRM
• Social media workflow and engagement for individual or team
• Integrate with CRM for continuous relationship management
Sophisticated Social Analytics
• Measure, compare, and contrast program & communication results
• Segment results by product attributes, reputation drivers, etc.
Actionable Insights
• Discovery & analytics to uncover insights in real time
• Holistic consumer insights, integrated with other market research
What We Do
• Domain is “social media”
  • Twitter, Facebook, forums, blogs, etc.
• Huge data sets, lots of noise.
• Enrichment, aggregation, reporting.
Visible – Target Business Groups
• Customer Servicing
  • Interactions between business and customer.
• Marketing
  • Brand effort, campaigns, periodic messaging.
• Corporate Communications
  • PR, reputation of company and stakeholders.
• Research
  • Audience definition, demographics, psychographics.
Articulating the Problem
“Marketing analysts need to understand the impact of their campaigns and we can provide them an avenue to do so.”
- Surf and turf
“We should totally Hadoop something!”
- Knuckle sandwich
Feature Engineering
• Your concise data features are easy to grasp, but do they provide for an adequate model?
• Your 600-dimension model is totally awesome, but does it scale?
• How much is “good enough”?
Proposing Solutions to the Business
• Understand scale issues.
• Provide alternatives:
  • There is no such thing as a perfect system.
• Communicate clearly about real and opportunity costs.
The Hazards of Third Party Data
• Data might not be available forever.
• Vendors might change terms.
• Entrenchment can impede growth and change as quality degrades over time (data sources can decay; vendors may slack on maintenance).
Productionalizing Prototypes
• Isn’t that a fancy word?
• Strike a balance between awesome and simple.
  • This is almost impossible to get right.
  • Even if you get it right once, it won’t last.
• Better for everybody if you give me as simple a mechanism as possible.
Expanding and Maintaining 1
• Data drift
  • How does data change organically over time?
• Bit rot
  • Does anybody even remember how to refit the model?
• Split maintenance
  • Keeping the research model up to date with the production model never happens.
Expanding and Maintaining 2
• Horizontal expansion exposes original scope assumptions.
• “We have it in English. What do you mean we can’t get it in Swahili?”
• Value trumps veracity. Sacrifices of purity cause degradation.
• Business needs result in accretion of surrounding goo.
Document Tone: “NLP” versus “Statistical”
• NLP/Probabilistic Grammars:
  • Effective.
  • Slow.
  • Costly reference grammars. Consider a vendor.
• Vector space modeling (term vectors/n-grams):
  • Very fast at runtime.
  • Works best with lots of training data.
  • Can fit yourself, so long as you can afford to maintain it.
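As a rough illustration of the vector space approach, the sketch below turns each document into a sparse vector of word unigram and bigram counts and assigns the tone label whose summed training vector is most similar by cosine similarity. The labels, training sentences, and function names are all invented for this example, and nearest-centroid matching is only one of several ways such a model could be fitted; it is not a description of Visible's production system.

```python
import math
from collections import Counter


def term_vector(text):
    """Map a document to a sparse vector of unigram and bigram counts."""
    tokens = text.lower().split()
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return Counter(tokens + bigrams)


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def fit_centroids(labeled_docs):
    """Sum the term vectors of each label's training documents."""
    centroids = {}
    for label, text in labeled_docs:
        centroids.setdefault(label, Counter()).update(term_vector(text))
    return centroids


def classify(text, centroids):
    """Return the label whose centroid is most similar to the document."""
    vector = term_vector(text)
    return max(centroids, key=lambda label: cosine(vector, centroids[label]))


# Toy training data, invented for illustration only.
training = [
    ("positive", "great flight and friendly crew"),
    ("positive", "friendly staff great service"),
    ("negative", "terrible delay and rude staff"),
    ("negative", "lost luggage terrible service"),
]
centroids = fit_centroids(training)
print(classify("the crew was friendly", centroids))
```

Classification here is a handful of dictionary lookups per document, which is why this family of models is fast at runtime. The cost moves to fitting: with only four training sentences the model knows almost no vocabulary, which is the "lots of training data" caveat in miniature.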
Language-Detection: Features
• Supports 53 languages.
• Fitted on Wikipedia corpora.
• Classic “one-versus-all” classification.
Language-Detection: Mechanism
• Determines the frequency with which n-grams of 1-3 characters appear inside of a labeled corpus.
“To what extent does each 1-3 character n-gram participate in a label?”
"tho":134583,"thr":87801,"the":3415279,"thi":110969,"tha":240340
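Counts like the ones above could be produced by a sketch along these lines, which tallies every 1-3 character n-gram per language label. The function names, the `(label, text)` input shape, and the two-sentence toy corpus are assumptions made for illustration; the actual library's internals and its Wikipedia-derived profiles are not shown here.

```python
from collections import Counter


def char_ngrams(text, min_n=1, max_n=3):
    """Yield every character n-gram of length min_n..max_n in text."""
    for n in range(min_n, max_n + 1):
        for i in range(len(text) - n + 1):
            yield text[i:i + n]


def build_profiles(labeled_docs):
    """Count n-gram occurrences per language label.

    labeled_docs is an iterable of (label, text) pairs (a hypothetical
    input shape; a real corpus would be streamed from disk).
    """
    profiles = {}
    for label, text in labeled_docs:
        profiles.setdefault(label, Counter()).update(char_ngrams(text))
    return profiles


# Toy corpus, invented for illustration only.
corpus = [
    ("en", "the thing that they thought"),
    ("de", "das ding und der gedanke"),
]
profiles = build_profiles(corpus)
print(profiles["en"]["th"], profiles["de"]["th"])
```

Even on this tiny corpus the bigram "th" appears only under the English label, which is the kind of signal the per-label frequency tables capture at scale.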
Language-Detection: Practicalities
• Downsides?
• Twitter and Facebook!
• Letter casing (“I love you” versus “i love you”).
• Mixed-language documents (e.g. Chinese documents with English words).
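The letter-casing issue can be seen directly in the features. The sketch below, restricted to 3-character n-grams for brevity, compares the two spellings from the slide; the helper name `trigrams` is invented for this example.

```python
def trigrams(text):
    """Return the set of 3-character n-grams in text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


cased = trigrams("I love you")
lowered = trigrams("i love you")

# N-grams that exist only in the capitalized spelling.
only_in_cased = cased - lowered

# Case-folding both inputs first makes the feature sets identical.
folded_equal = trigrams("I love you".casefold()) == trigrams("i love you".casefold())

print(sorted(only_in_cased), folded_equal)
```

Folding case before extracting n-grams removes the mismatch, but it is a trade-off rather than a free fix: capitalization can itself carry language signal (German capitalizes nouns, for example), so whether to normalize is a judgment call for short, informal text.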
Overview
• Airline passengers found sewing needles in sandwiches.
• Airline attempted to redirect the conversation and measure the results.
• Visible tracked this event in social media.
Delta Airlines: Needle Sandwiches
[Figure: prominent terms at a week view, a month view, and a three-month view. Annotated events:]
• Purchased a refinery to reduce fuel costs
• Passengers found needles in their in-flight sandwiches
• Free tickets given away as a promotion
Delta Volumes Over Time
[Figure: volume over time. Annotated events:]
• Purchased a refinery to reduce fuel costs
• Needles found in in-flight turkey sandwiches
• Free tickets given away as a promotion
PR Case Study: Conclusion
• Contest didn’t pay off in the long term.
• Attempts to redirect the conversation may be ham-fisted.
• Thoughts? Conjecture?