Machine Learning to Engage the Customer [email protected] @chris_biow

Upload: mongodb

Post on 25-Jul-2015


Page 1: Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Spark, IBM Watson, and MongoDB

Machine Learning to Engage the Customer

[email protected]

@chris_biow

Page 2

Trigger Warning

This presentation, and the materials to which it links, contain triggers. These will be triggering reactive, asynchronous, and message-driven environments.

A safe room is available in Empire West, where Alan Viars is presenting Modernizing National Health Care.

Page 3

Objectionable Content

Language Impurities
• Basic Linear Algebra Subprograms (BLAS - Fortran)
• Node-RED visual programming
• Node.js
• Scala, with Perlish accent (Ehrmegerd, nerl perlish!)
• Java, C++, Prolog
• Twitter: unfiltered, live feed
• Machine recommendations
• Degenerate cases

Page 4

Let’s Try It

Page 5

Node-RED Twitter with Watson Resonance

Page 6

db.tweets.aggregate([
  {$group: {
    _id: { hour: {$hour: "$date"}, minute: {$minute: "$date"} },
    total: {$sum: "$sentiment.score"},
    average: {$avg: "$sentiment.score"},
    count: {$sum: 1},
    happyTalk: {$push: "$sentiment.positive"}
  }},
  {$unwind: "$happyTalk"},
  {$unwind: "$happyTalk"},
  {$group: {
    _id: "$_id",
    total: {$first: "$total"},
    average: {$first: "$average"},
    count: {$first: "$count"},
    happyTalk: {$addToSet: "$happyTalk"}
  }},
  {$sort: {_id: -1}}
])
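To make the shape of the result easier to see, here is a plain-JavaScript sketch of roughly what the pipeline above computes, run over an in-memory array instead of a collection. The tweet shape (a `date` plus a `sentiment` object with a numeric `score` and an array of `positive` words) is an assumption inferred from the field paths in the pipeline, and `rollupByMinute` and `sampleTweets` are names made up for this illustration.

```javascript
// Sketch only: mimics the per-minute sentiment rollup on an in-memory array.
// The tweet shape is assumed from the pipeline's field paths.
function rollupByMinute(tweets) {
  const buckets = new Map();
  for (const t of tweets) {
    // $hour and $minute work in UTC, so the sketch does too.
    const hour = t.date.getUTCHours();
    const minute = t.date.getUTCMinutes();
    const key = hour * 60 + minute;
    if (!buckets.has(key)) {
      buckets.set(key, { _id: { hour, minute }, total: 0, count: 0, happyTalk: new Set() });
    }
    const b = buckets.get(key);
    b.total += t.sentiment.score;
    b.count += 1;
    // Stands in for $push, the two $unwind stages, and $addToSet:
    // flatten every tweet's positive words into one de-duplicated set.
    for (const word of t.sentiment.positive) b.happyTalk.add(word);
  }
  return [...buckets.entries()]
    .sort((a, b) => b[0] - a[0]) // newest minute first, like {$sort: {_id: -1}}
    .map(([, b]) => ({
      _id: b._id,
      total: b.total,
      average: b.total / b.count,
      count: b.count,
      happyTalk: [...b.happyTalk],
    }));
}

// Made-up sample data for illustration.
const sampleTweets = [
  { date: new Date(Date.UTC(2015, 5, 2, 14, 30, 5)), sentiment: { score: 3, positive: ["love", "great"] } },
  { date: new Date(Date.UTC(2015, 5, 2, 14, 30, 40)), sentiment: { score: -2, positive: ["great"] } },
  { date: new Date(Date.UTC(2015, 5, 2, 14, 31, 10)), sentiment: { score: 1, positive: ["nice"] } },
];
const rollup = rollupByMinute(sampleTweets);
console.log(JSON.stringify(rollup, null, 2));
```

One difference worth knowing: in the real pipeline, `$unwind` drops a document whose array is empty, so a minute with no positive words at all would disappear from the output, whereas this sketch keeps it with an empty `happyTalk`.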

Page 7

The What and the Why

Page 8

Machine Learning

• What: depends who you ask
– learning that is done by machines [my lab partner]
– algorithms that can learn from and make predictions on data [Wikipedia, just now]
– induction and … other algorithms that can be said to “learn” [Kohavi 1998 goo.gl/WvEmNJ]
– whatever the heck we’re selling [cloud vendors]
– common cognitive framework, ingests content, observe, interpret, evaluate, decide [IBM Watson]
– predictive analytics [Microsoft Azure, AWS]
– algorithmic grab-bag [Mahout, MLlib]

• Why: depends what you want
– Engagement, discovery, decision [Watson]
– Prediction: maintenance, demand, resource allocation [Azure]
– Analytics: fraud, personalization, marketing, churn, support [AWS]

Page 9

Apache Mahout Samsara

• Architectures: standalone, MapReduce, Spark, H2O
• Languages: DSL shell, Java
• Functions
– Collaborative filtering
– Classification
– Clustering
– Dimensionality reduction
– Topic models
– Miscellany

Example: Create topic grouping for Wikipedia articles

Page 10

Spark MLlib

• Languages: Scala, Java, Python
• Clusters: EC2, YARN, Mesos, standalone
• Linear algebra: Breeze (Scala) / Fortran BLAS
• Data: vector, point, matrix
• Functions
– Basic stats
– Classification and regression
– Collaborative filtering
– Clustering
– Dimensionality reduction (remove variables)
– Feature extraction & transformation
– Frequent pattern mining
– Optimization (local min/max)

Example: interactive drill-down categories for large result set

Page 11

The Magic of Alternating Least Squares Latent Factoring

Which is the real me?

Movies recommended for you:
 1: The Sound of Music (1965)
 2: Snow White and the Seven Dwarfs (1937)
 3: Beauty and the Beast (1991)
 4: Charlie Brown Christmas, A (1965)
 5: Bambi (1942)
 6: Seven Brides for Seven Brothers (1954)
 7: Mary Poppins (1964)
 8: Pinocchio (1940)
 9: Gone with the Wind (1939)
10: The Wizard of Oz (1939)

Movies recommended for you:
 1: Maradona by Kusturica (2008)
 2: Shadows of Forgotten Ancestors (1964)
 3: Rosario Tijeras (2005)
 4: Constantine's Sword (2007)
 5: Titicut Follies (1967)
 6: Lady Chatterley (2006)
 7: August Evening (2007)
 8: Power of Nightmares: The Rise of the Politics of Fear, The (2004)
 9: Sun Alley (Sonnenallee) (1999)
10: Who's Singin' Over There? (a.k.a. Who Sings Over There) (Ko to tamo peva) (1980)
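For readers who have not met alternating least squares, the idea can be sketched with a toy rank-1 factorization in plain JavaScript: hold the movie factors fixed and solve for each user's factor in closed form, then swap roles, and repeat. This is not MLlib's implementation (which fits several latent factors per user and movie on a cluster); the function `alsRank1`, the regularization value, and the ratings matrix are all made up for illustration.

```javascript
// Toy rank-1 ALS: ratings[i][j] is user i's rating of movie j, null = unrated.
// Each rating is approximated by u[i] * v[j].
function alsRank1(ratings, iterations = 20, lambda = 0.01) {
  const nUsers = ratings.length;
  const nItems = ratings[0].length;
  const u = new Array(nUsers).fill(1);
  const v = new Array(nItems).fill(1);
  for (let it = 0; it < iterations; it++) {
    // Movie factors fixed: each user's regularized least-squares
    // problem has the closed form sum(r * v) / (sum(v * v) + lambda).
    for (let i = 0; i < nUsers; i++) {
      let num = 0;
      let den = lambda;
      for (let j = 0; j < nItems; j++) {
        if (ratings[i][j] !== null) {
          num += ratings[i][j] * v[j];
          den += v[j] * v[j];
        }
      }
      u[i] = num / den;
    }
    // User factors fixed: same closed form for each movie.
    for (let j = 0; j < nItems; j++) {
      let num = 0;
      let den = lambda;
      for (let i = 0; i < nUsers; i++) {
        if (ratings[i][j] !== null) {
          num += ratings[i][j] * u[i];
          den += u[i] * u[i];
        }
      }
      v[j] = num / den;
    }
  }
  return { u, v, predict: (i, j) => u[i] * v[j] };
}

// Made-up ratings generated from user factors [1, 2, 1.5] and movie
// factors [2, 1, 2.5], with two entries hidden (their true values are 2.5 and 3).
const ratings = [
  [2, 1, null],
  [4, 2, 5],
  [null, 1.5, 3.75],
];
const model = alsRank1(ratings);
console.log(model.predict(0, 2), model.predict(2, 0));
```

A single factor can only express one taste axis, so it cannot produce two recommendation lists as different as the ones above; that takes a higher rank, where each closed-form step becomes a small linear solve instead of a division.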

Page 12

Watson Developer Cloud

• Presented as services for Bluemix
• RESTful calls
• Node.js
• Node-RED

Example: Message resonance for email solicitation

Page 13

Microsoft Azure

• R and Python
• Flowchart GUI
• Correlation, modeling, trend projection, forecasting
• HDInsight cloud Hadoop
• Publishing for profit via Machine Learning Gallery
– Voice recognition
– Customer churn prediction
– Text extraction: sentiment and key phrase
– Contributor donation propensity
– Frequently bought together
– Classifier
– Clustering
– Linear regression
– … 35 total in market [goo.gl/LhMbUu]

Example: Retail forecasting

Page 14

AWS

• Create models
• Generate predictions
• Data: S3, Redshift, RDS
• APIs: Java, .NET, Python, PHP, Node, Ruby
• Mobile SDK
• Use cases
– Fraud detection
– Content personalization
– Marketing propensity modeling
– Document classification
– Customer churn prediction
– Customer support solutions

Example: Marketing response prediction

Page 15

MongoDB

• Next-gen database
– Document model
– Scalable
– Highly available
– Secondary indexes
• Agile with schema and query types
• Subsecond query response over multiple indexes
• Low-second aggregation framework for basic analytics

Example: Number of articles by author

• In-database mapReduce
• Hadoop connector
– Mongo[Input|Output]Format
– mongo.[input|output].uri or BSON
– mongo.input.query
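The "number of articles by author" example mentioned above might look like the following aggregation pipeline. The deck does not show the actual collection, so the collection name `articles` and the field name `author` are assumptions; the `countByAuthor` helper is a hypothetical in-memory equivalent included only so the result shape can be checked without a server.

```javascript
// Pipeline that could be passed to db.articles.aggregate(...).
// `articles` and `author` are assumed names, not taken from the deck.
const articlesByAuthor = [
  { $group: { _id: "$author", count: { $sum: 1 } } },
  { $sort: { count: -1 } },
];

// In-memory equivalent of the pipeline, for illustration only.
function countByAuthor(docs) {
  const counts = new Map();
  for (const d of docs) counts.set(d.author, (counts.get(d.author) || 0) + 1);
  return [...counts]
    .map(([_id, count]) => ({ _id, count }))
    .sort((a, b) => b.count - a.count); // most prolific author first
}

// Made-up sample documents.
const sampleArticles = [
  { title: "Schema design", author: "ada" },
  { title: "Sharding", author: "grace" },
  { title: "Indexes", author: "ada" },
  { title: "Aggregation", author: "ada" },
];
console.log(countByAuthor(sampleArticles));
```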

[Icons: Agility, Aggregation Framework, Documents, High Availability, Secondary Indexing, Scalability]

Page 16

MongoDB Data Operations Spectrum

• Retrieve Nothing – infinitely fast
• Document Retrieval – 1 ms if in cache, ~10 ms from spinning disk
• .find() – per-document cost similar to single document
– _id range
– any secondary index range, can be composite key
– intersect two indexes
– covered indexes even faster
• .count(), .distinct(), .group() – fast, may be covered
• .aggregate() – retrieval cost like find, plus pipeline operations
– $match
– $group
– $project
– $redact
• .mapReduce() – in-database JavaScript
• Hadoop Connector
– mongo.input.query for indexed partial scan
– full scan

[Arrow alongside the list: Faster (top) to Slower (bottom)]

Page 17

Page 18

Using Spark

Page 19

Topic Detection

• Grouping documents according to topics, especially over time
– Google News

• Latent Dirichlet Allocation
– Corpus of M documents, each of N words; Wij is the word at position j in document i
– Documents have (latent) topic distributions with Dirichlet prior α; θi for document i
– Topics have word distributions with Dirichlet prior β; φk for topic k
– Zij is the topic contributing to the word at position j in document i
– Remove stopwords!

• Tweets
– Large, terse corpus
– Highly sensitive to number of iterations (10 returned little more than word distribution)
– Requires some iterative stopwording
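The deck does not say how the iterative stopwording was done. One plausible approach, sketched below in plain JavaScript, is to run a pass over the corpus, add any word that shows up in too large a share of the tweets to the stopword set, and repeat until the topics look sensible. The tokenizer, the threshold, the function names, and the sample tweets are all assumptions for illustration.

```javascript
// Split a tweet into lowercase tokens, keeping hashtags and @mentions.
function tokenize(text) {
  return text.toLowerCase().match(/[a-z0-9#@']+/g) || [];
}

// One round of corpus-driven stopwording: any word appearing in more than
// maxDocFraction of the documents is added to the stopword set.
function extendStopwords(docs, stopwords, maxDocFraction) {
  const docFreq = new Map();
  for (const doc of docs) {
    // Count each word once per document, skipping known stopwords.
    const seen = new Set(tokenize(doc).filter((w) => !stopwords.has(w)));
    for (const w of seen) docFreq.set(w, (docFreq.get(w) || 0) + 1);
  }
  const extended = new Set(stopwords);
  for (const [w, n] of docFreq) {
    if (n / docs.length > maxDocFraction) extended.add(w);
  }
  return extended;
}

function removeStopwords(doc, stopwords) {
  return tokenize(doc).filter((w) => !stopwords.has(w));
}

// Made-up sample corpus and starting stopword list.
const tweetTexts = [
  "RT @mongodb: Spark and MongoDB together",
  "RT loving the Watson demo",
  "RT Spark MLlib topics on tweets",
  "the keynote was great",
];
const baseStopwords = new Set(["the", "and", "on", "was"]);
const extendedStopwords = extendStopwords(tweetTexts, baseStopwords, 0.5);
console.log(removeStopwords(tweetTexts[1], extendedStopwords));
```

In this sample, "rt" appears in three of the four tweets and so gets added, which is the kind of Twitter-specific noise a generic stopword list would miss.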

"Smoothed LDA" by Slxu.public - Own work. Licensed under CC BY-SA 3.0 via Wikimedia Commons - http://commons.wikimedia.org/wiki/File:Smoothed_LDA.png#/media/File:Smoothed_LDA.png

"Dirichlet distributions" by en:User:ThG - en:Image:Dirichlet_distributions.png. Licensed under Public Domain via Wikimedia Commons - http://commons.wikimedia.org/wiki/File:Dirichlet_distributions.png#/media/File:Dirichlet_distributions.png

Page 20

*
*           Form  C := alpha*A**H*B + beta*C.
*
              DO 120 J = 1,N
                  DO 110 I = 1,M
                      TEMP = ZERO
                      DO 100 L = 1,K
                          TEMP = TEMP + CONJG(A(L,I))*B(L,J)
  100                 CONTINUE
                      IF (BETA.EQ.ZERO) THEN
                          C(I,J) = ALPHA*TEMP
                      ELSE
                          C(I,J) = ALPHA*TEMP + BETA*C(I,J)
                      END IF
  110             CONTINUE
  120         CONTINUE
          ELSE
*
*           Form  C := alpha*A**T*B + beta*C
*
              DO 150 J = 1,N
                  DO 140 I = 1,M
                      TEMP = ZERO
                      DO 130 L = 1,K
                          TEMP = TEMP + A(L,I)*B(L,J)
  130                 CONTINUE
                      IF (BETA.EQ.ZERO) THEN
                          C(I,J) = ALPHA*TEMP
                      ELSE
                          C(I,J) = ALPHA*TEMP + BETA*C(I,J)
                      END IF
  140             CONTINUE
  150         CONTINUE
          END IF
      ELSE IF (NOTA) THEN
          IF (CONJB) THEN
*
*           Form  C := alpha*A*B**H + beta*C.
*
              DO 200 J = 1,N
                  IF (BETA.EQ.ZERO) THEN
                      DO 160 I = 1,M
                          C(I,J) = ZERO
  160                 CONTINUE
                  ELSE IF (BETA.NE.ONE) THEN
                      DO 170 I = 1,M
                          C(I,J) = BETA*C(I,J)
  170                 CONTINUE
                  END IF
                  DO 190 L = 1,K
                      IF (B(J,L).NE.ZERO) THEN
                          TEMP = ALPHA*CONJG(B(J,L))
                          DO 180 I = 1,M
                              C(I,J) = C(I,J) + TEMP*A(I,L)
  180                     CONTINUE
                      END IF
  190             CONTINUE
  200         CONTINUE
          ELSE
*
*           Form  C := alpha*A*B**T + beta*C
*
              DO 250 J = 1,N
                  IF (BETA.EQ.ZERO) THEN
                      DO 210 I = 1,M
                          C(I,J) = ZERO
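For readers who do not speak Fortran, the transposed-A case in the listing above (C := alpha*A**T*B + beta*C) can be sketched in plain JavaScript for real-valued matrices stored as arrays of rows. The function name `gemmTN` and the sample matrices are made up; the loop order and the beta-equals-zero special case follow the Fortran.

```javascript
// Computes C := alpha * transpose(A) * B + beta * C in place.
// A is K x M, B is K x N, C is M x N, all as arrays of rows.
function gemmTN(alpha, A, B, beta, C) {
  const K = A.length;
  const M = A[0].length;
  const N = B[0].length;
  for (let j = 0; j < N; j++) {
    for (let i = 0; i < M; i++) {
      // Dot product of column i of A with column j of B.
      let temp = 0;
      for (let l = 0; l < K; l++) temp += A[l][i] * B[l][j];
      // As in the Fortran, C's old contents are never read when beta is zero.
      C[i][j] = beta === 0 ? alpha * temp : alpha * temp + beta * C[i][j];
    }
  }
  return C;
}

// Made-up 2 x 2 example.
const matA = [[1, 2], [3, 4]];
const matB = [[5, 6], [7, 8]];
const matC = gemmTN(2, matA, matB, 1, [[1, 1], [1, 1]]);
console.log(matC);
```

The beta-equals-zero branch matters because it lets callers pass an uninitialized C without stale or NaN values leaking into the result.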

Page 21

Create the Resilient Distributed Dataset (RDD)

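// General form of the call, shown with Java-style class literals:
//   rdd = sc.newAPIHadoopRDD(config, MongoInputFormat.class, Object.class, BSONObject.class)

import org.apache.hadoop.conf.Configuration
import org.bson.BSONObject

// Hadoop configuration that tells the MongoDB connector what to read and where to write.
val config = new Configuration()
config.set("mongo.input.uri", "mongodb://127.0.0.1:27017/marketdata.minbars")
// Only read bars whose _id (a date) is later than the given epoch-millisecond timestamp.
config.set("mongo.input.query", """{"_id":{"$gt":{"$date":1182470400000}}}""")
config.set("mongo.output.uri", "mongodb://127.0.0.1:27017/marketdata.fiveminutebars")

// sc is the SparkContext (predefined in spark-shell).
val minBarRawRDD = sc.newAPIHadoopRDD(
  config,
  classOf[com.mongodb.hadoop.MongoInputFormat],
  classOf[Object],
  classOf[BSONObject])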

Page 22

// groupBars is not shown on the slide; presumably it holds the one-minute bars
// grouped into five-minute windows.
val fiveMinBars = groupBars.map(
  g => (
    g.head.get("_id"),
    // Start from a copy of the first bar, then overwrite the aggregated fields.
    new BasicBSONObject(g.head.toMap()).
      append("Close", g.last.get("Close")).
      append("High", g.map(b => b.get("High").toString.toFloat).reduceLeft(math.max)).
      append("Low", g.map(b => b.get("Low").toString.toFloat).reduceLeft(math.min)).
      append("Volume", g.map(b => b.get("Volume").toString.toInt).foldLeft(0)(_ + _))
  )
)

Operate through Spark on the RDD Object
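The same roll-up can be sketched without Spark in plain JavaScript, which may help when checking the logic on a handful of bars. The grouping step is not shown in the deck, so bucketing by five-minute windows of the timestamp is an assumption, as are the function name `toFiveMinuteBars`, the use of a Date as `_id`, and the sample data; the field names (Open, High, Low, Close, Volume) follow the Scala code above.

```javascript
const FIVE_MINUTES_MS = 5 * 60 * 1000;

// Roll one-minute bars up into five-minute bars.
function toFiveMinuteBars(minuteBars) {
  const groups = new Map();
  // Sort by time so "first" and "last" within a window are meaningful.
  const sorted = [...minuteBars].sort((a, b) => a._id - b._id);
  for (const bar of sorted) {
    const bucket = Math.floor(bar._id.getTime() / FIVE_MINUTES_MS);
    if (!groups.has(bucket)) groups.set(bucket, []);
    groups.get(bucket).push(bar);
  }
  return [...groups.values()].map((g) => ({
    ...g[0], // _id and Open come from the first bar in the window
    Close: g[g.length - 1].Close,
    High: Math.max(...g.map((b) => b.High)),
    Low: Math.min(...g.map((b) => b.Low)),
    Volume: g.reduce((sum, b) => sum + b.Volume, 0),
  }));
}

// Made-up sample bars: three in the 14:30 window, one in the 14:35 window.
const at = (minute) => new Date(Date.UTC(2015, 5, 2, 14, minute, 0));
const sampleBars = [
  { _id: at(30), Open: 10, High: 11, Low: 9, Close: 10.5, Volume: 100 },
  { _id: at(31), Open: 10.5, High: 12, Low: 10, Close: 11, Volume: 200 },
  { _id: at(34), Open: 11, High: 11.5, Low: 8.5, Close: 9, Volume: 50 },
  { _id: at(35), Open: 9, High: 9.5, Low: 8, Close: 9.2, Volume: 70 },
];
const fiveMinuteBars = toFiveMinuteBars(sampleBars);
console.log(fiveMinuteBars);
```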

Page 23

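import org.apache.hadoop.conf.Configuration
import com.mongodb.hadoop.MongoOutputFormat

// Create a separate Configuration for saving data back to MongoDB.
val outputConfig = new Configuration()
outputConfig.set("mongo.output.format", "com.mongodb.hadoop.MongoOutputFormat")
// mongoPort is defined elsewhere (not shown on the slide).
outputConfig.set("mongo.output.uri", "mongodb://" + mongoPort + "/marketdata.fiveminutebars")

// The path argument is required by the API; the data itself goes to mongo.output.uri.
fiveMinBars.saveAsNewAPIHadoopFile(
  "file:///dummy",
  classOf[Any],
  classOf[Any],
  classOf[MongoOutputFormat[_,_]],
  outputConfig)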

Put It Back Where You Found It

Page 24