data mining & engineering

Upload: visible-technologies

Post on 20-Aug-2015

Copyright ©2014 Visible Technologies, Inc. All rights reserved.

Data Mining and Engineering

Presented by Lucas Parker, Senior Software Development Engineer, Research & Development

About Visible

Our Mission: Customer Value

Global Authoritative Content
• Most comprehensive global content sourcing model
• Clean & accurate data

Powerful Search & Discovery
• Only the most pertinent results, based on your criteria
• Pivot and drill to identify discussion drivers

Engagement/Social CRM
• Social Media workflow and engagement for individual or team
• Integrate with CRM for continuous relationship management

Sophisticated Social Analytics
• Measure, compare, and contrast program & communication results
• Segment results by product attributes, reputation drivers, etc.

Actionable Insights
• Discovery & analytics to uncover insights in real time
• Holistic consumer insights, integrated with other market research

What We Do

• Domain is “social media”
  • Twitter, Facebook, forums, blogs, etc.

• Huge data sets, lots of noise.

• Enrichment, aggregation, reporting.

Visible – Target Business Groups

• Customer Servicing
  • Interactions between business and customer.

• Marketing
  • Brand effort, campaigns, periodic messaging.

• Corporate Communications
  • PR, reputation of company and stakeholders.

• Research
  • Audience definition, demographics, psychographics.

Data Mining Meets Engineering

Articulating the Problem

“Marketing analysts need to understand the impact of their campaigns and we can provide them an avenue to do so.”

- Surf and turf

“We should totally Hadoop something!”

- Knuckle sandwich

Feature Engineering

• Your concise data features are easy to grasp, but do they provide for an adequate model?

• Your 600-dimension model is totally awesome, but does it scale?

• How much is “good enough”?

Proposing Solutions to the Business

• Understand scale issues.

• Provide alternatives:
  • There is no such thing as a perfect system.

• Communicate clearly about real and opportunity costs.

The Hazards of Third Party Data

• Data might not be available forever.

• Vendors might change terms.

• Entrenchment can impede growth and change as quality degrades over time (data sources can decay, vendors may slack on maintenance).

Bonini’s Paradox

Productionalizing Prototypes

• Isn’t that a fancy word?

• Strike balance between awesome and simple.
  • This is almost impossible to get right.
  • Even if you get it right once, it won’t last.

• Better for everybody if you give me as simple a mechanism as possible.

Expanding and Maintaining 1

• Data drift
  • How does data change organically over time?

• Bit rot
  • Does anybody even remember how to refit the model?

• Split maintenance
  • Keeping the research model up to date with the production model never happens.
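
One way to put a number on data drift, sketched under assumptions: compare the term distribution of an older window of documents against a newer one. The slides do not name a drift metric; Jensen-Shannon divergence is chosen here only for illustration, and the function names and sample documents are hypothetical.

```python
import math
from collections import Counter

def term_distribution(documents):
    """Relative frequency of each whitespace-separated, lowercased term."""
    counts = Counter(word for doc in documents for word in doc.lower().split())
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical, 1 for disjoint distributions."""
    def kl(a, m):
        # Kullback-Leibler divergence of a from the mixture m.
        return sum(pa * math.log2(pa / m[t]) for t, pa in a.items() if pa > 0)
    terms = set(p) | set(q)
    mixture = {t: 0.5 * (p.get(t, 0.0) + q.get(t, 0.0)) for t in terms}
    return 0.5 * kl(p, mixture) + 0.5 * kl(q, mixture)

# Hypothetical documents standing in for two time windows.
old_window = ["delta flight delayed again", "great flight on delta"]
new_window = ["needles found in delta sandwiches", "needles in my sandwich on delta"]

drift = js_divergence(term_distribution(old_window), term_distribution(new_window))
print(f"JS divergence between windows: {drift:.3f}")
```

A rising value between successive windows is a hint that the fitted model may no longer match the data; what threshold should trigger a refit is a judgment call the slides leave open.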

Expanding and Maintaining 2

• Horizontal expansion exposes original scope assumptions.

• “We have it in English. What do you mean we can’t get it in Swahili?”

• Value trumps veracity. Sacrifices of purity cause degradation.

• Business needs result in accretion of surrounding goo.

Document Tone: “NLP” versus “Statistical”

• NLP/Probabilistic Grammars:
  • Effective.
  • Slow.
  • Costly reference grammars. Consider a vendor.

• Vector space modeling (term vectors/n-grams)
  • Very fast at runtime.
  • Works best with lots of training data.
  • You can fit it yourself, so long as you can afford to maintain it.
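
A minimal sketch of the vector space approach, assuming a bag-of-words term vector per document and cosine similarity against one summed vector per tone label. This is an illustration only, not Visible's production model; the labels, training sentences, and function names are hypothetical.

```python
import math
from collections import Counter

def term_vector(text):
    """Bag-of-words term vector: lowercased token -> count."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def fit_centroids(labeled_docs):
    """Sum the term vectors of all training documents sharing a label."""
    centroids = {}
    for label, text in labeled_docs:
        centroids.setdefault(label, Counter()).update(term_vector(text))
    return centroids

def classify(text, centroids):
    """Pick the label whose summed vector is most similar to the document."""
    vec = term_vector(text)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))

# Hypothetical training data; a real model needs far more of it.
training = [
    ("positive", "love this airline great flight great crew"),
    ("negative", "terrible flight lost my bag awful service"),
]
centroids = fit_centroids(training)
print(classify("great crew and great service", centroids))
```

Runtime cost is one sparse dot product per label, which is why this family of models is fast; the maintenance cost is in refitting the vectors as the data changes.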

Engineering Case Study: Language Detection

Language-Detection

Language-Detection: Features

• Supports 53 languages.

• Fitted on Wikipedia corpora.

• Classic “one-versus-all” classification.

Language-Detection: Mechanism

• Determines the frequency with which n-grams of 1-3 characters appear inside of a labeled corpus.

“To what extent does each 1-3 character n-gram participate in a label?”

"tho":134583,"thr":87801,"the":3415279,"thi":110969,"tha":240340

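The counting step can be sketched as follows. Per-label n-gram counts like the ones shown on the slide are built from a labeled corpus, then a document is scored against each label. The scoring rule here (add-one smoothed log-likelihood, highest score wins) is an assumption made for illustration; the slides state only that n-gram frequencies are counted per label. The tiny corpus and all function names are hypothetical.

```python
import math
from collections import Counter

def char_ngrams(text, n_min=1, n_max=3):
    """Yield every character n-gram of length n_min..n_max in text."""
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            yield text[i:i + n]

def fit_profiles(labeled_corpus):
    """Count n-gram occurrences per label: {label: Counter({ngram: count})}."""
    profiles = {}
    for label, text in labeled_corpus:
        profiles.setdefault(label, Counter()).update(char_ngrams(text))
    return profiles

def score(text, profile, vocab_size):
    """Add-one smoothed log-likelihood of the text's n-grams under one profile."""
    total = sum(profile.values())
    return sum(
        math.log((profile[gram] + 1) / (total + vocab_size))
        for gram in char_ngrams(text)
    )

def detect(text, profiles):
    """Return the label whose profile gives the text the highest score."""
    vocab = set()
    for profile in profiles.values():
        vocab.update(profile)
    return max(profiles, key=lambda label: score(text, profiles[label], len(vocab)))

# Hypothetical toy corpus; the slides describe fitting on Wikipedia corpora.
corpus = [
    ("en", "the quick brown fox jumps over the lazy dog and then the other thing"),
    ("es", "el rápido zorro marrón salta sobre el perro perezoso y luego la otra cosa"),
]
profiles = fit_profiles(corpus)
print(profiles["en"]["the"], detect("the other dog", profiles))
```
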
Language-Detection: Practicalities

• Downsides?

• Twitter and Facebook!

• Letter casing (“I love you” versus “i love you”).

• Mixed-language documents (e.g. Chinese documents with English words).
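
Some of these downsides can be softened before n-gram extraction. The sketch below is one hypothetical preprocessing pass for social text: it drops URLs, @mentions, and hashtags, and folds case so that “I love you” and “i love you” produce the same n-grams. Case folding only helps if the profiles were fitted on case-folded text as well, and nothing here addresses mixed-language documents.

```python
import re

URL_RE = re.compile(r"https?://\S+")
TAG_RE = re.compile(r"[@#]\w+")       # @mentions and #hashtags
WHITESPACE_RE = re.compile(r"\s+")

def normalize(text):
    """Strip social-media noise and fold case before n-gram extraction."""
    text = URL_RE.sub(" ", text)
    text = TAG_RE.sub(" ", text)
    text = text.casefold()
    return WHITESPACE_RE.sub(" ", text).strip()

print(normalize("I love you @Delta http://t.co/abc #travel"))
```

Dropping hashtags discards some real words along with the noise; whether that trade is worth it depends on the data.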

PR Case Study: Delta Airlines and “Needle Sandwiches”

Overview

• Airline passengers found sewing needles in sandwiches.

• Airline attempted to redirect the conversation and measure the results.

• Visible tracked this event in social media.

Delta Airlines: Needle Sandwiches

[Figure: prominent terms at week, month, and three-month views. Annotated events: purchased a refinery to reduce fuel costs; passengers found needles in their in-flight sandwiches; free tickets given away as a promotion.]
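
A rough sketch of how prominent terms for a week or month view might be computed: count non-stopword terms in posts that fall inside a trailing window. This is not Visible's method, which the slides do not describe; the stopword list, function name, dates, and posts are all hypothetical.

```python
from collections import Counter
from datetime import date, timedelta

# Hypothetical, deliberately tiny stopword list.
STOPWORDS = frozenset({"a", "an", "the", "in", "on", "my", "to", "of", "and", "for"})

def prominent_terms(posts, end, days, top_n=3):
    """Most frequent non-stopword terms in posts dated within `days` of `end`."""
    start = end - timedelta(days=days)
    counts = Counter(
        word
        for posted, text in posts
        if start < posted <= end
        for word in text.lower().split()
        if word not in STOPWORDS
    )
    return [term for term, _ in counts.most_common(top_n)]

# Hypothetical posts with made-up dates, for illustration only.
posts = [
    (date(2000, 1, 4), "free tickets"),
    (date(2000, 1, 5), "free tickets promotion"),
    (date(2000, 1, 29), "needles on my flight"),
    (date(2000, 1, 30), "needles in sandwiches"),
]
end = date(2000, 1, 31)
print("week: ", prominent_terms(posts, end, days=7))
print("month:", prominent_terms(posts, end, days=31))
```

Widening the window lets older events back into the ranking, which is the effect the week, month, and three-month views illustrate.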

Delta Volumes Over Time

[Figure: volume over time, annotated with three events: purchased a refinery to reduce fuel costs; needles found in in-flight turkey sandwiches; free tickets given away as a promotion.]

Delta Volumes Over Time

Month View

3 Month View

PR Case Study: Conclusion

• Contest didn’t pay off in the long term.

• Attempts to redirect the conversation may be ham-fisted.

• Thoughts? Conjecture?

Conclusion

Questions?

Thank You

www.visibletechnologies.com

[email protected]

Twitter: @Visible

Phone: (888) 852-0320