LA HUG - Agile Analytics Applications on HDP


Upload: hortonworks

Post on 26-Jan-2015


TRANSCRIPT

Page 1: LA HUG - Agile Analytics Applications on HDP

2© Hortonworks Inc. 2012

Russell Jurney (@rjurney) - Hadoop Evangelist @Hortonworks

Formerly Viz, Data Science at Ning, LinkedIn

HBase Dashboards, Career Explorer, InMaps

Agile Analytics Applications on HDP

Page 2: LA HUG - Agile Analytics Applications on HDP

3© Hortonworks Inc. 2012

About me... Bearding.

• I’m going to beat this guy

• Seriously

• Bearding is my #1 natural talent

• Salty Sea Beard

• Fortified with Pacific Ocean Minerals

Page 3: LA HUG - Agile Analytics Applications on HDP

4© Hortonworks Inc. 2012

Agile Data - The Book (July, 2013)

Read on Safari Rough Cuts

Early Release Here

Code Here

Page 4: LA HUG - Agile Analytics Applications on HDP

5© Hortonworks Inc. 2012

We go fast... but don’t worry!

• Examples for EVERYTHING on the Hortonworks blog: http://hortonworks.com/blog/authors/russell_jurney

• Download the slides - click the links - read examples!

• If it's not on the blog, it's in the book!

• Order now: http://shop.oreilly.com/product/0636920025054.do

• Read the book Friday on Safari Rough Cuts


Page 5: LA HUG - Agile Analytics Applications on HDP

6© Hortonworks Inc. 2012

HDP Sandbox - Talk Lessons Coming!

Page 6: LA HUG - Agile Analytics Applications on HDP

8© Hortonworks Inc. 2012

Agile Application Development: Check

• LAMP stack mature

• Post-Rails frameworks to choose from

• Enable rapid feedback and agility

+ NoSQL

Page 7: LA HUG - Agile Analytics Applications on HDP

9© Hortonworks Inc. 2012

Data Warehousing

Page 8: LA HUG - Agile Analytics Applications on HDP

10© Hortonworks Inc. 2012

Scientific Computing / HPC

• ‘Smart kid’ only: MPI, Globus, etc. until Hadoop


Tubes and Mercury (old school) Cores and Spindles (new school)

UNIVAC and Deep Blue both fill a warehouse. We’re back...

Page 9: LA HUG - Agile Analytics Applications on HDP

11© Hortonworks Inc. 2012

Data Science?

Application Development Data Warehousing

Scientific Computing / HPC

Page 10: LA HUG - Agile Analytics Applications on HDP

12© Hortonworks Inc. 2012

Data Center as Computer

• Warehouse Scale Computers and applications


“A key challenge for architects of WSCs is to smooth out these discrepancies in a cost efficient manner.” Click here for a paper on operating a ‘data center as computer.’

Page 11: LA HUG - Agile Analytics Applications on HDP

13© Hortonworks Inc. 2012

Page 12: LA HUG - Agile Analytics Applications on HDP

14© Hortonworks Inc. 2012

Page 13: LA HUG - Agile Analytics Applications on HDP

15© Hortonworks Inc. 2012

Page 14: LA HUG - Agile Analytics Applications on HDP

16© Hortonworks Inc. 2012

Page 15: LA HUG - Agile Analytics Applications on HDP

17© Hortonworks Inc. 2012

Page 16: LA HUG - Agile Analytics Applications on HDP

18

Page 17: LA HUG - Agile Analytics Applications on HDP

Tez – Faster MapReduce!

19

Page 18: LA HUG - Agile Analytics Applications on HDP

20© Hortonworks Inc. 2012

Hadoop to the Rescue!

Page 19: LA HUG - Agile Analytics Applications on HDP

21© Hortonworks Inc. 2012

Hadoop to the Rescue!

• Easy to use! (Pig, Hive, Cascading)

• CHEAP: 1% the cost of SAN/NAS

• A department can afford its own Hadoop cluster!

• Dump all your data in one place: Hadoop DFS

• Silos come CRASHING DOWN!

• JOIN like crazy!

• ETL like whoah!

• An army of mappers and reducers at your command

• OMGWTFBBQ IT'S SO GREAT! I FEEL AWESOME!

Page 20: LA HUG - Agile Analytics Applications on HDP

22© Hortonworks Inc. 2012

NOW WHAT?

?

Page 21: LA HUG - Agile Analytics Applications on HDP

23© Hortonworks Inc. 2012

Analytics Apps: It takes a Team

• Broad skill-set

• Nobody has them all

• Inherently collaborative

Page 22: LA HUG - Agile Analytics Applications on HDP

24© Hortonworks Inc. 2012

Data Science Team

• 3-4 team members with broad, diverse skill-sets that overlap

• Transactional overhead dominates at 5+ people

• Expert researchers: lend 25-50% of their time to teams

• Creative workers. Run like a studio, not an assembly line

• Total freedom... with goals and deliverables.

• Work environment matters most


Page 23: LA HUG - Agile Analytics Applications on HDP

25© Hortonworks Inc. 2012

How to get insight into product?

• Back-end has gotten t-h-i-c-k-e-r

• Generating $$$ insight can take 10-100x app dev

• Timeline disjoint: analytics vs agile app-dev/design

• How do you ship insights efficiently?

• How do you collaborate on research vs developer timeline?

Page 24: LA HUG - Agile Analytics Applications on HDP

26© Hortonworks Inc. 2012

The Wrong Way - Part One

“We made a great design. Your job is to predict the future for it.”

Page 25: LA HUG - Agile Analytics Applications on HDP

27© Hortonworks Inc. 2012

The Wrong Way - Part Two

“What's taking you so long to reliably predict the future?”

Page 26: LA HUG - Agile Analytics Applications on HDP

28© Hortonworks Inc. 2012

The Wrong Way - Part Three

“The users don’t understand what 86% true means.”

Page 27: LA HUG - Agile Analytics Applications on HDP

29© Hortonworks Inc. 2012

The Wrong Way - Part Four

GHJIAEHGIEhjagigehganbanbigaebjnain!!!!!RJ(@J?!!

Page 28: LA HUG - Agile Analytics Applications on HDP

30© Hortonworks Inc. 2012

The Wrong Way - Inevitable Conclusion

Plane Mountain

Page 29: LA HUG - Agile Analytics Applications on HDP

31© Hortonworks Inc. 2012

Reminds me of... the waterfall model

:(

Page 30: LA HUG - Agile Analytics Applications on HDP

32© Hortonworks Inc. 2012

Chief Problem

You can’t design insight in analytics applications.

You discover it.

You discover by exploring.

Page 31: LA HUG - Agile Analytics Applications on HDP

33© Hortonworks Inc. 2012

-> Strategy

So make an app for exploring your data.

Which becomes a palette for what you ship.

Iterate and publish intermediate results.

Page 32: LA HUG - Agile Analytics Applications on HDP

34© Hortonworks Inc. 2012

Data Design

• Not the 1st query that = insight, it's the 15th, or the 150th

• Capturing “Ah ha!” moments

• Slow to do those in batch...

• Faster, better context in an interactive web application.

• Pre-designed charts wind up terrible. So bad.

• Easy to invest man-years in the wrong statistical models

• Semantics of presenting predictions are complex, delicate

• Opportunity lies at intersection of data & design

Page 33: LA HUG - Agile Analytics Applications on HDP

35© Hortonworks Inc. 2012

How do we get back to Agile?

Page 34: LA HUG - Agile Analytics Applications on HDP

36© Hortonworks Inc. 2012

Statement of Principles

(then tricks, with code)

Page 35: LA HUG - Agile Analytics Applications on HDP

37© Hortonworks Inc. 2012

Set up an environment where...

• Insights repeatedly produced

• Iterative work shared with entire team

• Interactive from day 0

• Data model is consistent end-to-end

• Minimal impedance between layers

• Scope and depth of insights grow

• Insights form the palette for what you ship

• Until the application pays for itself and more


Page 36: LA HUG - Agile Analytics Applications on HDP

38© Hortonworks Inc. 2012

Value document > relation

Most data is dirty. Most data is semi-structured or un-structured. Rejoice!

Page 37: LA HUG - Agile Analytics Applications on HDP

39© Hortonworks Inc. 2012

Value document > relation

Note: Hive/ArrayQL/NewSQL’s support of documents/array types blur this distinction.

Page 38: LA HUG - Agile Analytics Applications on HDP

40© Hortonworks Inc. 2012

Relational Data = Legacy Format

• Why JOIN? Storage is fundamentally cheap!

• Duplicate that JOIN data in one big record type!

• ETL once to document format on import, NOT every job

• Not zero JOINs, but far fewer JOINs

• Semi-structured documents preserve data’s actual structure

• Column compressed document formats beat JOINs! (paper coming)
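The "ETL once to document format" idea can be sketched in Python. This is only an illustration, not the deck's own code: the field names mirror the Enron example used later in the talk, while the sample rows and the function name `to_documents` are assumptions made up for the sketch.

```python
from collections import defaultdict


def to_documents(messages, recipients):
    """Fold recipient rows into their parent message: one document per email."""
    headers = defaultdict(lambda: {"tos": [], "ccs": [], "bccs": []})
    for row in recipients:
        # 'to' -> 'tos', 'cc' -> 'ccs', 'bcc' -> 'bccs'
        key = row["reciptype"] + "s"
        headers[row["message_id"]][key].append(
            {"address": row["address"], "name": row["name"]}
        )
    docs = []
    for msg in messages:
        doc = dict(msg)                          # copy the message fields
        doc.update(headers[msg["message_id"]])   # embed the recipient lists
        docs.append(doc)
    return docs


# Hypothetical sample rows standing in for the two relational tables.
messages = [{"message_id": "m1", "subject": "hello", "body": "hi"}]
recipients = [
    {"message_id": "m1", "reciptype": "to", "address": "a@example.com", "name": None},
    {"message_id": "m1", "reciptype": "cc", "address": "b@example.com", "name": None},
]
documents = to_documents(messages, recipients)
```

The join happens once, at import time; every later job reads the recipients straight out of the email document.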


Page 39: LA HUG - Agile Analytics Applications on HDP

41© Hortonworks Inc. 2012

Value imperative > declarative

• We don’t know what we want to SELECT.

• Data is dirty - check each step, clean iteratively.

• 85% of data scientist’s time spent munging. See: ETL.

• Imperative is optimized for our process.

• Process = iterative, snowballing insight

• Efficiency matters, self optimize


Page 40: LA HUG - Agile Analytics Applications on HDP

42© Hortonworks Inc. 2012

Value dataflow > SELECT

Page 41: LA HUG - Agile Analytics Applications on HDP

43© Hortonworks Inc. 2012

Ex. dataflow: ETL + email sent count

(I can’t read this either. Get a big version here.)

Page 42: LA HUG - Agile Analytics Applications on HDP

44© Hortonworks Inc. 2012

Value Pig > Hive (for app-dev)

• Pigs eat ANYTHING

• Pig is optimized for refining data, as opposed to consuming it

• Pig is imperative, iterative

• Pig is dataflows, and SQLish (but not SQL)

• Code modularization/re-use: Pig Macros

• ILLUSTRATE speeds dev time (even UDFs)

• Easy UDFs in Java, JRuby, Jython, Javascript

• Pig Streaming = use any tool, period.

• Easily prepare our data as it will appear in our app.

• If you prefer Hive, use Hive.

But actually, I wish Pig and Hive were one tool. Pig, then Hive, then Pig, then Hive... See: HCatalog for Pig/Hive integration, and this post.

Page 43: LA HUG - Agile Analytics Applications on HDP

45© Hortonworks Inc. 2012

Localhost vs Petabyte scale: same tools

• Simplicity essential to scalability: highest level tools we can

• Prepare a good sample - tricky with joins, easy with documents

• Local mode: pig -l /tmp -x local -v -w

• Frequent use of ILLUSTRATE

• 1st: Iterate, debug & publish locally

• 2nd: Run on cluster, publish to team/customer

• Consider skipping Object-Relational-Mapping (ORM)

• We do not trust ‘databases,’ only HDFS @ n=3.

• Everything we serve in our app is re-creatable via Hadoop.


Page 44: LA HUG - Agile Analytics Applications on HDP

46© Hortonworks Inc. 2012

Data-Value Pyramid

Climb it. Do not skip steps. See here.

Page 45: LA HUG - Agile Analytics Applications on HDP

47© Hortonworks Inc. 2012

0/1) Display atomic records on the web

Page 46: LA HUG - Agile Analytics Applications on HDP

48© Hortonworks Inc. 2012

0.0) Document-serialize events

• Protobuf

• Thrift

• JSON

• Avro - I use Avro because the schema is onboard.


Page 47: LA HUG - Agile Analytics Applications on HDP

49© Hortonworks Inc. 2012

0.1) Documents via Relation ETL

enron_messages = load '/enron/enron_messages.tsv' as (

message_id:chararray,

sql_date:chararray,

from_address:chararray,

from_name:chararray,

subject:chararray,

body:chararray

);

 

enron_recipients = load '/enron/enron_recipients.tsv' as ( message_id:chararray, reciptype:chararray, address:chararray, name:chararray);

 

split enron_recipients into tos IF reciptype=='to', ccs IF reciptype=='cc', bccs IF reciptype=='bcc';

 

headers = cogroup tos by message_id, ccs by message_id, bccs by message_id parallel 10;

with_headers = join headers by group, enron_messages by message_id parallel 10;

emails = foreach with_headers generate enron_messages::message_id as message_id,

CustomFormatToISO(enron_messages::sql_date, 'yyyy-MM-dd HH:mm:ss') as date,

TOTUPLE(enron_messages::from_address, enron_messages::from_name) as from:tuple(address:chararray, name:chararray),

enron_messages::subject as subject,

enron_messages::body as body,

headers::tos.(address, name) as tos,

headers::ccs.(address, name) as ccs,

headers::bccs.(address, name) as bccs;

store emails into '/enron/emails.avro' using AvroStorage(

Example here.

Page 48: LA HUG - Agile Analytics Applications on HDP

50© Hortonworks Inc. 2012

0.2) Serialize events from streams

import imaplib

class GmailSlurper(object):
  ...
  def init_imap(self, username, password):
    self.username = username
    self.password = password
    # Close any connection left over from a previous login
    try:
      self.imap.shutdown()
    except Exception:
      pass
    self.imap = imaplib.IMAP4_SSL('imap.gmail.com', 993)
    self.imap.login(username, password)
    self.imap.is_readonly = True
  ...
  def write(self, record):
    # Append one email record to the Avro file
    self.avro_writer.append(record)
  ...
  def slurp(self):
    if(self.imap and self.imap_folder):
      for email_id in self.id_list:
        (status, email_hash, charset) = self.fetch_email(email_id)
        # Only keep emails that were fetched and parsed completely
        if(status == 'OK' and charset and 'thread_id' in email_hash and 'froms' in email_hash):
          print(email_id, charset, email_hash['thread_id'])
          self.write(email_hash)

Scrape your own gmail in Python and Ruby.

Page 49: LA HUG - Agile Analytics Applications on HDP

51© Hortonworks Inc. 2012

0.3) ETL Logs

log_data = LOAD 'access_log' USING org.apache.pig.piggybank.storage.apachelog.CommonLogLoader AS (remoteAddr, remoteLogname, user, time, method, uri, proto, bytes);
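For checking a sample of log lines locally before running the Pig loader, a rough Python equivalent can help. This is a sketch under assumptions: the regular expression targets the standard Apache Common Log Format, the group names reuse the field names from the Pig statement above, and the `status` group and the sample line are additions for illustration.

```python
import re

# Common Log Format: host ident user [time] "method uri proto" status bytes
CLF = re.compile(
    r'(?P<remoteAddr>\S+) (?P<remoteLogname>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<method>\S+) (?P<uri>\S+) (?P<proto>[^"]+)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
)


def parse_line(line):
    """Return a dict of named fields, or None if the line does not match."""
    match = CLF.match(line)
    return match.groupdict() if match else None


# Hypothetical sample line in Common Log Format.
sample = '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326'
record = parse_line(sample)
```

All values come back as strings; casting `bytes` to an integer is left to the caller because the format allows `-` for an empty response.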

Page 50: LA HUG - Agile Analytics Applications on HDP

52© Hortonworks Inc. 2012

1) Plumb atomic events -> browser

(Example stack that enables high productivity)

Page 51: LA HUG - Agile Analytics Applications on HDP

53© Hortonworks Inc. 2012

Lots of Stack Options with Examples

• Pig with Voldemort, Ruby, Sinatra: example

• Pig with ElasticSearch: example

• Pig with MongoDB, Node.js: example

• Pig with Cassandra, Python Streaming, Flask: example

• Pig with HBase, JRuby, Sinatra: example

• Pig with Hive via HCatalog: example (trivial on HDP)

• Up next: Accumulo, Redis, MySQL, etc.


Page 52: LA HUG - Agile Analytics Applications on HDP

54© Hortonworks Inc. 2012

1.1) cat our Avro serialized events

me$ cat_avro ~/Data/enron.avro

{ u'bccs': [], u'body': u'scamming people, blah blah', u'ccs': [], u'date': u'2000-08-28T01:50:00.000Z', u'from': {u'address': u'[email protected]', u'name': None}, u'message_id': u'<1731.10095812390082.JavaMail.evans@thyme>', u'subject': u'Re: Enron trade for frop futures', u'tos': [ {u'address': u'[email protected]', u'name': None} ]}

Get cat_avro in python, ruby

Page 53: LA HUG - Agile Analytics Applications on HDP

55© Hortonworks Inc. 2012

1.2) Load our events in Pig

me$ pig -l /tmp -x local -v -w

grunt> enron_emails = LOAD '/enron/emails.avro' USING AvroStorage();
grunt> describe enron_emails

emails: {
  message_id: chararray,
  datetime: chararray,
  from:tuple(address:chararray,name:chararray),
  subject: chararray,
  body: chararray,
  tos: {to: (address: chararray,name: chararray)},
  ccs: {cc: (address: chararray,name: chararray)},
  bccs: {bcc: (address: chararray,name: chararray)}
}

 

Page 54: LA HUG - Agile Analytics Applications on HDP

56© Hortonworks Inc. 2012

1.3) ILLUSTRATE our events in Pig

grunt> illustrate enron_emails 

| emails | field                                                 | value                                      |
|        | message_id:chararray                                  | <1731.10095812390082.JavaMail.evans@thyme> |
|        | datetime:chararray                                    | 2001-01-09T06:38:00.000Z                   |
|        | from:tuple(address:chararray,name:chararray)          | ([email protected], J.R. Bob Dobbs)        |
|        | subject:chararray                                     | Re: Enron trade for frop futures           |
|        | body:chararray                                        | scamming people, blah blah                 |
|        | tos:bag{to:tuple(address:chararray,name:chararray)}   | {([email protected],)}                     |
|        | ccs:bag{cc:tuple(address:chararray,name:chararray)}   | {}                                         |
|        | bccs:bag{bcc:tuple(address:chararray,name:chararray)} | {}                                         |

Upgrade to Pig 0.10+

Page 55: LA HUG - Agile Analytics Applications on HDP

57© Hortonworks Inc. 2012

1.4) Publish our events to a ‘database’

From Avro to MongoDB in one command:

pig -l /tmp -x local -v -w -param avros=enron.avro \
  -param mongourl='mongodb://localhost/enron.emails' avro_to_mongo.pig

Which does this:

/* MongoDB libraries and configuration */
register /me/mongo-hadoop/mongo-2.7.3.jar
register /me/mongo-hadoop/core/target/mongo-hadoop-core-1.1.0-SNAPSHOT.jar
register /me/mongo-hadoop/pig/target/mongo-hadoop-pig-1.1.0-SNAPSHOT.jar

/* Set speculative execution off to avoid chance of duplicate records in Mongo */
set mapred.map.tasks.speculative.execution false
set mapred.reduce.tasks.speculative.execution false
define MongoStorage com.mongodb.hadoop.pig.MongoStorage(); /* Shortcut */

/* By default, lets have 5 reducers */
set default_parallel 5

avros = load '$avros' using AvroStorage();
store avros into '$mongourl' using MongoStorage();

Full instructions here.

Page 56: LA HUG - Agile Analytics Applications on HDP

58© Hortonworks Inc. 2012

1.5) Check events in our ‘database’

$ mongo enron
MongoDB shell version: 2.0.2
connecting to: enron
> show collections
emails
system.indexes
> db.emails.findOne({message_id: "<1731.10095812390082.JavaMail.evans@thyme>"})
{
  "_id" : ObjectId("502b4ae703643a6a49c8d180"),
  "message_id" : "<1731.10095812390082.JavaMail.evans@thyme>",
  "date" : "2001-01-09T06:38:00.000Z",
  "from" : { "address" : "[email protected]", "name" : "J.R. Bob Dobbs" },
  "subject" : "Re: Enron trade for frop futures",
  "body" : "Scamming more people...",
  "tos" : [ { "address" : "connie@enron", "name" : null } ],
  "ccs" : [ ],
  "bccs" : [ ]
}

Page 57: LA HUG - Agile Analytics Applications on HDP

59© Hortonworks Inc. 2012

1.6) Publish events on the web

require 'rubygems'
require 'sinatra'
require 'mongo'
require 'json'

connection = Mongo::Connection.new
database = connection['agile_data']
collection = database['emails']

get '/email/:message_id' do |message_id|
  data = collection.find_one({:message_id => message_id})
  JSON.generate(data)
end

Page 58: LA HUG - Agile Analytics Applications on HDP

60© Hortonworks Inc. 2012

1.6) Publish events on the web

Page 59: LA HUG - Agile Analytics Applications on HDP

61© Hortonworks Inc. 2012

What's the point?

• A designer can work against real data.

• An application developer can work against real data.

• A product manager can think in terms of real data.

• Entire team is grounded in reality!

• You’ll see how ugly your data really is.

• You’ll see how much work you have yet to do.

• Ship early and often!

• Feels agile, don’t it? Keep it up!


Page 60: LA HUG - Agile Analytics Applications on HDP

62© Hortonworks Inc. 2012

1.7) Wrap events with Bootstrap

<link href="/static/bootstrap/docs/assets/css/bootstrap.css" rel="stylesheet">

</head>

<body>

<div class="container" style="margin-top: 100px;">

<table class="table table-striped table-bordered table-condensed">

<thead>

{% for key in data['keys'] %}

<th>{{ key }}</th>

{% endfor %}

</thead>

<tbody>

<tr>

{% for value in data['values'] %}

<td>{{ value }}</td>

{% endfor %}

</tr>

</tbody>

</table>

</div>

</body>

Complete example here with code here.

Page 61: LA HUG - Agile Analytics Applications on HDP

63© Hortonworks Inc. 2012

1.7) Wrap events with Bootstrap

Page 62: LA HUG - Agile Analytics Applications on HDP

64© Hortonworks Inc. 2012

Refine. Add links between documents.

Not the Mona Lisa, but coming along... See: here

Page 63: LA HUG - Agile Analytics Applications on HDP

66© Hortonworks Inc. 2012

1.8) List links to sorted events

Use your ‘database’, if it can sort:

mongo enron
> db.emails.ensureIndex({message_id: 1})
> db.emails.find().sort({date: -1}).limit(10).pretty()
{
  "_id" : ObjectId("4f7a5da2414e4dd0645d1176"),
  "message_id" : "<CA+bvURyn-rLcH_JXeuzhyq8T9RNq+YJ_Hkvhnrpk8zfYshL-wA@mail.gmail.com>",
  "from" : [
...

Use Pig, serve/cache a bag/array of email documents:

pig -l /tmp -x local -v -w

emails_per_user = foreach (group emails by from.address) {
  sorted = order emails by date desc;
  last_1000 = limit sorted 1000;
  generate group as from_address, last_1000 as emails;
};
store emails_per_user into '$mongourl' using MongoStorage();

Page 64: LA HUG - Agile Analytics Applications on HDP

67© Hortonworks Inc. 2012

1.8) List links to sorted documents

Page 65: LA HUG - Agile Analytics Applications on HDP

68© Hortonworks Inc. 2012

1.9) Make it searchable...

If you have a list, search is easy with ElasticSearch and Wonderdog...

/* Load ElasticSearch integration */

register '/me/wonderdog/target/wonderdog-1.0-SNAPSHOT.jar';

register '/me/elasticsearch-0.18.6/lib/*';

define ElasticSearch com.infochimps.elasticsearch.pig.ElasticSearchStorage();

emails = load '/me/tmp/emails' using AvroStorage();

store emails into 'es://email/email?json=false&size=1000' using
  ElasticSearch('/me/elasticsearch-0.18.6/config/elasticsearch.yml', '/me/elasticsearch-0.18.6/plugins');

Test it with curl:

curl -XGET 'http://localhost:9200/email/email/_search?q=hadoop&pretty=true&size=1'

ElasticSearch has no security features. Take note. Isolate.

Page 66: LA HUG - Agile Analytics Applications on HDP

69© Hortonworks Inc. 2012

From now on we speed up...

Don’t worry, it's in the book and on the blog.

http://hortonworks.com/blog/

Page 67: LA HUG - Agile Analytics Applications on HDP

70© Hortonworks Inc. 2012

2) Create Simple Charts

Page 68: LA HUG - Agile Analytics Applications on HDP

71© Hortonworks Inc. 2012

2) Create Simple Tables and Charts

Page 69: LA HUG - Agile Analytics Applications on HDP

72© Hortonworks Inc. 2012

2) Create Simple Charts

• Start with an HTML table on general principle.

• Then use nvd3.js - reusable charts for d3.js

• Aggregating by properties and displaying the results is the first step in entity resolution

• Start extracting entities. Ex: people, places, topics, time series

• Group documents by entities, rank and count.

• Publish top N, time series, etc.

• Fill a page with charts.

• Add a chart to your event page.

Page 70: LA HUG - Agile Analytics Applications on HDP

73© Hortonworks Inc. 2012

2.1) Top N (of anything) in Pig

pig -l /tmp -x local -v -w

top_things = foreach (group things by key) {
  sorted = order things by arbitrary_rank desc;
  top_10_things = limit sorted 10;
  generate group as key, top_10_things as top_10_things;
};
store top_things into '$mongourl' using MongoStorage();

Remember, this is the same structure the browser gets as json.

This would make a good Pig Macro.
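To check the shape of the top-N output on a small local sample, the same grouping can be sketched in Python. The field names `key` and `arbitrary_rank` come from the Pig snippet above; the function name `top_n_per_key` and the sample records are assumptions for illustration.

```python
import heapq
from collections import defaultdict


def top_n_per_key(things, n=10):
    """Group records by 'key' and keep the n highest-ranked records per group."""
    grouped = defaultdict(list)
    for thing in things:
        grouped[thing["key"]].append(thing)
    return {
        key: heapq.nlargest(n, rows, key=lambda row: row["arbitrary_rank"])
        for key, rows in grouped.items()
    }


# Hypothetical sample records.
things = [
    {"key": "a", "arbitrary_rank": 3},
    {"key": "a", "arbitrary_rank": 9},
    {"key": "b", "arbitrary_rank": 1},
]
top_things = top_n_per_key(things, n=1)
```

Each value is a list of records sorted from highest to lowest rank, which is the nested structure the browser would receive as JSON.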

Page 71: LA HUG - Agile Analytics Applications on HDP

74© Hortonworks Inc. 2012

2.2) Time Series (of anything) in Pig

pig -l /tmp -x local -v -w

/* Group by our key and date rounded to the month, get a total */
things_by_month = foreach (group things by (key, ISOToMonth(datetime)))
  generate flatten(group) as (key, month),
           COUNT_STAR(things) as total;

/* Sort our totals per key by month to get a time series */
things_timeseries = foreach (group things_by_month by key) {
  timeseries = order things_by_month by month;
  generate group as key, timeseries as timeseries;
};

store things_timeseries into '$mongourl' using MongoStorage();

Yet another good Pig Macro.
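The monthly roll-up can also be sketched in plain Python for quick local checks. This assumes ISO 8601 timestamps, so that the first seven characters give the month; the function name `monthly_timeseries` and the sample records are hypothetical.

```python
from collections import Counter, defaultdict


def monthly_timeseries(things):
    """Count records per (key, month), then list the totals per key in month order."""
    # ISO 8601 timestamps start with YYYY-MM, so a 7-character prefix is the month.
    totals = Counter((thing["key"], thing["datetime"][:7]) for thing in things)
    series = defaultdict(list)
    for (key, month), total in sorted(totals.items()):
        series[key].append({"month": month, "total": total})
    return dict(series)


# Hypothetical sample records.
things = [
    {"key": "a", "datetime": "2001-01-09T06:38:00.000Z"},
    {"key": "a", "datetime": "2001-01-20T10:00:00.000Z"},
    {"key": "a", "datetime": "2001-02-01T00:00:00.000Z"},
]
things_timeseries = monthly_timeseries(things)
```

Months with no records are simply absent from the series, so a chart built on it may need the gaps filled in.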

Page 72: LA HUG - Agile Analytics Applications on HDP

75© Hortonworks Inc. 2012

Data processing in our stack

A new feature in our application might begin at any layer... great!

Any team member can add new features, no problemo!

I’m creative!

I know Pig!

I’m creative too!

I <3 Javascript!

omghi2u!

where r my legs?

send halp

Page 73: LA HUG - Agile Analytics Applications on HDP

76© Hortonworks Inc. 2012

Data processing in our stack

... but we shift the data-processing towards batch, as we are able.

Ex: Overall total emails calculated in each layer

See real example here.

Page 74: LA HUG - Agile Analytics Applications on HDP

77© Hortonworks Inc. 2012

3) Exploring with Reports

Page 75: LA HUG - Agile Analytics Applications on HDP

78© Hortonworks Inc. 2012

3) Exploring with Reports

Page 76: LA HUG - Agile Analytics Applications on HDP

79© Hortonworks Inc. 2012

3.0) From charts to reports...

• Extract entities from properties we aggregated by in charts (Step 2)

• Each entity gets its own type of web page

• Each unique entity gets its own web page

• Link to entities as they appear in atomic event documents (Step 1)

• Link most related entities together, same and between types.

• More visualizations!

• Parameterize results via forms.

Page 77: LA HUG - Agile Analytics Applications on HDP

80© Hortonworks Inc. 2012

3.1) Looks like this...

Page 78: LA HUG - Agile Analytics Applications on HDP

81© Hortonworks Inc. 2012

3.2) Cultivate common keyspaces

Page 79: LA HUG - Agile Analytics Applications on HDP

82© Hortonworks Inc. 2012

3.3) Get people clicking. Learn.

• Explore this web of generated pages, charts and links!

• Everyone on the team gets to know your data.

• Keep trying out different charts, metrics, entities, links.

• See what's interesting.

• Figure out what data needs cleaning and clean it.

• Start thinking about predictions & recommendations.


‘People’ could be just your team, if data is sensitive.

Page 80: LA HUG - Agile Analytics Applications on HDP

83© Hortonworks Inc. 2012

4) Predictions and Recommendations

Page 81: LA HUG - Agile Analytics Applications on HDP

84© Hortonworks Inc. 2012

4.0) Preparation

• We’ve already extracted entities, their properties and relationships

• Our charts show where our signal is rich

• We’ve cleaned our data to make it presentable

• The entire team has an intuitive understanding of the data

• They got that understanding by exploring the data

• We are all on the same page!


Page 82: LA HUG - Agile Analytics Applications on HDP

85© Hortonworks Inc. 2012

4.2) Think in different perspectives

• Networks

• Time Series / Distributions

• Natural Language Processing

• Conditional Probabilities / Bayesian Inference

• Check out Chapter 2 of the book... See here.

Page 83: LA HUG - Agile Analytics Applications on HDP

86© Hortonworks Inc. 2012

4.3) Networks

Page 84: LA HUG - Agile Analytics Applications on HDP

87© Hortonworks Inc. 2012

4.3.1) Weighted Email Networks in Pig

DEFINE header_pairs(email, col1, col2) RETURNS pairs {
  filtered = FILTER $email BY ($col1 IS NOT NULL) AND ($col2 IS NOT NULL);
  flat = FOREACH filtered GENERATE FLATTEN($col1) AS $col1, FLATTEN($col2) AS $col2;
  $pairs = FOREACH flat GENERATE LOWER($col1) AS ego1, LOWER($col2) AS ego2;
}

/* Get email address pairs for each type of connection, and union them together */
emails = LOAD '/me/Data/enron.avro' USING AvroStorage();
from_to = header_pairs(emails, from, to);
from_cc = header_pairs(emails, from, cc);
from_bcc = header_pairs(emails, from, bcc);
pairs = UNION from_to, from_cc, from_bcc;

/* Get a count of emails over these edges. */
pair_groups = GROUP pairs BY (ego1, ego2);
sent_counts = FOREACH pair_groups GENERATE FLATTEN(group) AS (ego1, ego2), COUNT_STAR(pairs) AS total;
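The edge-counting logic can be sketched in Python for a small sample. This is an illustration, not a port of the macro: it assumes the document layout from the earlier Avro example (a `from` record plus `tos`, `ccs` and `bccs` lists), and the sample emails are made up.

```python
from collections import Counter
from itertools import chain


def sent_counts(emails):
    """Count emails sent over each (sender, recipient) edge, case-insensitively."""
    counts = Counter()
    for email in emails:
        sender = email["from"]["address"].lower()
        recipients = chain(
            email.get("tos", []), email.get("ccs", []), email.get("bccs", [])
        )
        for recipient in recipients:
            if recipient["address"]:  # skip null addresses, as the FILTER does
                counts[(sender, recipient["address"].lower())] += 1
    return counts


# Hypothetical sample emails.
emails = [
    {"from": {"address": "Bob@Example.com"},
     "tos": [{"address": "connie@example.com"}], "ccs": [], "bccs": []},
    {"from": {"address": "bob@example.com"},
     "tos": [{"address": "connie@example.com"}],
     "ccs": [{"address": "dave@example.com"}], "bccs": []},
]
edges = sent_counts(emails)
```

The resulting `(ego1, ego2) -> total` mapping is a weighted edge list, the same shape a tool such as Gephi expects.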

Page 85: LA HUG - Agile Analytics Applications on HDP

88© Hortonworks Inc. 2012

4.3.2) Networks Viz with Gephi

Page 86: LA HUG - Agile Analytics Applications on HDP

89© Hortonworks Inc. 2012

4.3.3) Gephi = Easy

Page 87: LA HUG - Agile Analytics Applications on HDP

90© Hortonworks Inc. 2012

4.3.4) Social Network Analysis

Page 88: LA HUG - Agile Analytics Applications on HDP

91© Hortonworks Inc. 2012

4.4) Time Series & Distributions

pig -l /tmp -x local -v -w

/* Count things per day */

things_per_day = foreach (group things by (key, ISOToDay(datetime)))

generate flatten(group) as (key, day),

COUNT_STAR(things) as total;

/* Sort our totals per key by day to get a sorted time series */

things_timeseries = foreach (group things_per_day by key) {

timeseries = order things_per_day by day;

generate group as key, timeseries as timeseries;

};

store things_timeseries into '$mongourl' using MongoStorage();

Page 89: LA HUG - Agile Analytics Applications on HDP

92© Hortonworks Inc. 2012

4.4.1) Smooth Sparse Data

See here.

Page 90: LA HUG - Agile Analytics Applications on HDP

93© Hortonworks Inc. 2012

4.4.2) Regress to find Trends

JRuby Linear Regression UDF

Pig to use the UDF

Trend Line in your Application
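The slide's JRuby UDF is not reproduced in this transcript, so the following is only a sketch of the underlying technique: an ordinary least-squares fit of a straight line to a time series, written in Python. It assumes evenly spaced observations indexed 0, 1, 2, and so on; the function name `linear_trend` and the sample values are hypothetical.

```python
def linear_trend(ys):
    """Fit y = slope * x + intercept by least squares, with x = 0, 1, 2, ..."""
    n = len(ys)
    if n < 2:
        raise ValueError("need at least two points to fit a trend line")
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical monthly totals that grow by 2 each month.
slope, intercept = linear_trend([1, 3, 5, 7])
```

A positive slope marks a rising series and a negative one a falling series; the two numbers are all a chart needs to draw the trend line.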

Page 91: LA HUG - Agile Analytics Applications on HDP

94© Hortonworks Inc. 2012

4.5.1) Natural Language Processing

Example with code here and macro here.

import 'tfidf.macro';
my_tf_idf_scores = tf_idf(id_body, 'message_id', 'body');

/* Get the top 10 Tf*Idf scores per message */
per_message_cassandra = foreach (group my_tf_idf_scores by message_id) {
  sorted = order my_tf_idf_scores by value desc;
  top_10_topics = limit sorted 10;
  generate group, top_10_topics.(score, value);
};
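The macro's internals are not shown here, so this Python sketch uses one common TF-IDF formulation (term frequency times the log of inverse document frequency) with naive whitespace tokenization. It may differ from what `tfidf.macro` computes; the function names and sample documents are assumptions.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs maps a document id to its text; returns {doc_id: {term: score}}."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in tokenized.values():
        doc_freq.update(set(tokens))
    n_docs = len(tokenized)
    scores = {}
    for doc_id, tokens in tokenized.items():
        counts = Counter(tokens)
        scores[doc_id] = {
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return scores


def top_topics(scores, n=10):
    """Keep the n highest-scoring terms per document."""
    return {
        doc_id: sorted(terms, key=terms.get, reverse=True)[:n]
        for doc_id, terms in scores.items()
    }


# Hypothetical sample documents.
docs = {"m1": "hadoop pig hadoop", "m2": "pig hive"}
topics = top_topics(tf_idf(docs), n=1)
```

Terms that appear in every document score zero, which is why "pig" never surfaces as a topic in this sample.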

Page 92: LA HUG - Agile Analytics Applications on HDP

95© Hortonworks Inc. 2012

4.5.2) NLP: Extract Topics!

Page 93: LA HUG - Agile Analytics Applications on HDP

96© Hortonworks Inc. 2012

4.5.3) NLP for All: Extract Topics!

• TF-IDF in Pig - 2 lines of code with Pig Macros: • http://hortonworks.com/blog/pig-macro-for-tf-idf-makes-topic-summarization-2-lines-of-pig/

• LDA with Pig and the Lucene Tokenizer: • http://thedatachef.blogspot.be/2012/03/topic-discovery-with-apache-pig-and.html


Page 94: LA HUG - Agile Analytics Applications on HDP

97© Hortonworks Inc. 2012

4.6) Probability & Bayesian Inference

Page 95: LA HUG - Agile Analytics Applications on HDP

98© Hortonworks Inc. 2012

4.6.1) Gmail Suggested Recipients

Page 96: LA HUG - Agile Analytics Applications on HDP

99© Hortonworks Inc. 2012

4.6.1) Reproducing it with Pig...

Page 97: LA HUG - Agile Analytics Applications on HDP

100© Hortonworks Inc. 2012

4.6.2) Step 1: COUNT(From -> To)

Page 98: LA HUG - Agile Analytics Applications on HDP

101© Hortonworks Inc. 2012

4.6.2) Step 2: COUNT(From, To, Cc)/Total

P(cc | to) = Probability of cc’ing someone, given that you’ve to’d someone
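The two counting steps can be sketched in Python. This reads the slides as estimating P(cc | from, to) = COUNT(from, to, cc) / COUNT(from, to), which is an interpretation of the slide titles rather than the author's code; the flat email layout (plain address strings) and the sample data are assumptions.

```python
from collections import Counter


def cc_probabilities(emails):
    """Estimate P(cc | from, to) from co-occurrence counts."""
    pair_totals = Counter()    # Step 1: (from, to) -> number of emails
    triple_totals = Counter()  # Step 2: (from, to, cc) -> number of emails
    for email in emails:
        sender = email["from"]
        for to in email["tos"]:
            pair_totals[(sender, to)] += 1
            for cc in email["ccs"]:
                triple_totals[(sender, to, cc)] += 1
    return {
        triple: count / pair_totals[triple[:2]]
        for triple, count in triple_totals.items()
    }


# Hypothetical sample: "a" wrote to "b" twice and cc'd "c" on one of those.
emails = [
    {"from": "a", "tos": ["b"], "ccs": ["c"]},
    {"from": "a", "tos": ["b"], "ccs": []},
]
probabilities = cc_probabilities(emails)
```

Ranking the candidates for a given sender and recipient by this probability gives a simple suggested-recipients list.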

Page 99: LA HUG - Agile Analytics Applications on HDP

102© Hortonworks Inc. 2012

4.6.3) Wait - Stop Here! It works!

They match...

Page 100: LA HUG - Agile Analytics Applications on HDP

103© Hortonworks Inc. 2012

4.4) Add predictions to reports

Page 101: LA HUG - Agile Analytics Applications on HDP

104© Hortonworks Inc. 2012

5) Enable new actions

Page 102: LA HUG - Agile Analytics Applications on HDP

105© Hortonworks Inc. 2012

Why doesn’t Kate reply to my emails?

• What time is best to catch her?

• Are they too long?

• Are they meant to be replied to (contain original content)?

• Are they nice? (sentiment analysis)

• Do I reply to her emails (reciprocity)?

• Do I cc the wrong people (my mom) ?


Page 103: LA HUG - Agile Analytics Applications on HDP

106© Hortonworks Inc. 2012

Example: LinkedIn InMaps

<------ personalization drives engagement

Shared at http://inmaps.linkedinlabs.com/share/Russell_Jurney/316288748096695765986412570341480077402

Page 104: LA HUG - Agile Analytics Applications on HDP

107© Hortonworks Inc. 2012

Example: Packetpig and PacketLoop

snort_alerts = LOAD '$pcap'  USING com.packetloop.packetpig.loaders.pcap.detection.SnortLoader('$snortconfig');

countries = FOREACH snort_alerts  GENERATE    com.packetloop.packetpig.udf.geoip.Country(src) as country,    priority;

countries = GROUP countries BY country;

countries = FOREACH countries  GENERATE    group,    AVG(countries.priority) as average_severity;

STORE countries into 'output/choropleth_countries' using PigStorage(',');

Code here.

Page 105: LA HUG - Agile Analytics Applications on HDP

108© Hortonworks Inc. 2012

Example: Packetpig and PacketLoop

Page 106: LA HUG - Agile Analytics Applications on HDP

109© Hortonworks Inc. 2012

Thank You!

Questions & Answers

Slides: http://slidesha.re/T943VU

Follow: @hortonworks and @rjurney

Read: hortonworks.com/blog