Functional programming for optimization problems in Big Data



DESCRIPTION

Invited talk for the INFORMS chapter at Stanford. 2013-03-06.

TRANSCRIPT

Page 1: Functional programming for optimization problems in Big Data

Paco Nathan, Concurrent, Inc., San Francisco, CA, @pacoid

“Functional programming for optimization problems in Big Data”

Copyright © 2013, Concurrent, Inc.


Page 2: Functional programming for optimization problems in Big Data

The Workflow Abstraction

[Workflow diagram: Document Collection, Tokenize, Scrub token, Regex token, Stop Word List, HashJoin Left (RHS), GroupBy token, Count, Word Count; M/R mark the map and reduce phases]

1. Data Science
2. Functional Programming
3. Workflow Abstraction
4. Typical Use Cases
5. Open Data Example

Let’s consider a trendline subsequent to the 1997 Q3 inflection point which enabled huge ecommerce successes and commercialized Big Data. Where did Big Data come from, and where is this kind of work headed?

Page 3: Functional programming for optimization problems in Big Data

Q3 1997: inflection point

Four independent teams were working toward horizontal scale-out of workflows based on commodity hardware.

This effort prepared the way for huge Internet successes in the 1997 holiday season… AMZN, EBAY, Inktomi (YHOO Search), then GOOG

MapReduce and the Apache Hadoop open source stack emerged from this.

Q3 1997: Greg Linden, et al., @ Amazon, Randy Shoup, et al., @ eBay -- independent teams arrived at the same conclusion:

parallelize workloads onto clusters of commodity servers to scale-out horizontally. Google and Inktomi (YHOO Search) were working along the same lines.

Page 4: Functional programming for optimization problems in Big Data

[Diagram, circa 1996 architecture: Web App, Customers, transactions, RDBMS, Stakeholder, Product, strategy, Engineering, requirements, BI Analysts, SQL Query result sets, Excel pivot tables, PowerPoint slide decks, optimized code]

Circa 1996: pre-inflection point

Perl and C++ for CGI :) Feedback loops shown in red represent data innovations at the time… these are rather static.

Characterized by slow, manual processes: data modeling / business intelligence; “throw it over the wall”… this thinking led to impossible silos

Page 5: Functional programming for optimization problems in Big Data

[Diagram, circa 2001 architecture: Web Apps, Customers, customer transactions, RDBMS, SQL Query result sets, Logs, event history, aggregation, DW, ETL, Middleware, servlets, models, Algorithmic Modeling, recommenders + classifiers, dashboards, Stakeholder, Product, Engineering, UX]

Circa 2001: post big ecommerce successes

Machine data (unstructured logs) captured social interactions. Data from aggregated logs fed into algorithmic modeling to produce recommenders, classifiers, and other predictive models -- e.g., ad networks automating parts of the marketing funnel, as in our case study.

LinkedIn, Facebook, Twitter, Apple, etc., followed early successes. Algorithmic modeling, leveraging machine data, allowed for Big Data to become monetized.

Page 6: Functional programming for optimization problems in Big Data

[Diagram, circa 2013 architecture: Web Apps, Mobile, etc.; Customers; social interactions; transactions, content; Log Events; RDBMS; History; Data Products; In-Memory Data Grid; Hadoop, etc.; Cluster Scheduler; Planner; DW; near-time and batch workflows; services; Use Cases Across Topologies; roles: Data Scientist, App Dev, Ops, Domain Expert; Prod, Eng, Ops; s/w dev, data science, discovery + modeling, dashboard metrics, business process, optimized capacity taps, introduced capability, existing SDLC]

Circa 2013: clusters everywhere

Here’s what our more savvy customers are using for architecture and process today: traditional SDLC, but also Data Science inter-disciplinary teams. Also, machine data (app history) driving planners and schedulers for advanced multi-tenant cluster computing fabric.

Not unlike a practice at LLL, where much more data gets collected about the machine than about the experiment.

We see this feeding into cluster optimization in YARN, Apache Mesos, etc.

Page 7: Functional programming for optimization problems in Big Data

by Leo Breiman

Statistical Modeling: The Two Cultures
Statistical Science, 2001

bit.ly/eUTh9L

references…

Leo Breiman wrote an excellent paper in 2001, “Two Cultures”, chronicling this evolution and the sea change from data modeling (silos, manual process) to algorithmic modeling (machine data for automation/optimization)

Page 8: Functional programming for optimization problems in Big Data

Amazon: “Early Amazon: Splitting the website” – Greg Linden
glinden.blogspot.com/2006/02/early-amazon-splitting-website.html

eBay: “The eBay Architecture” – Randy Shoup, Dan Pritchett
addsimplicity.com/adding_simplicity_an_engi/2006/11/you_scaled_your.html
addsimplicity.com.nyud.net:8080/downloads/eBaySDForum2006-11-29.pdf

Inktomi (YHOO Search): “Inktomi’s Wild Ride” – Eric Brewer (0:05:31 ff)
youtube.com/watch?v=E91oEn1bnXM

Google: “Underneath the Covers at Google” – Jeff Dean (0:06:54 ff)
youtube.com/watch?v=qsan-GQaeyk
perspectives.mvdirona.com/2008/06/11/JeffDeanOnGoogleInfrastructure.aspx

references…

In their own words…

Page 9: Functional programming for optimization problems in Big Data

core values

Data Science teams develop actionable insights, building confidence for decisions

that work may influence a few decisions worth billions (e.g., M&A) or billions of small decisions (e.g., AdWords)

probably somewhere in-between… solving for pattern, at scale.

by definition, this is a multi-disciplinary pursuit which requires teams, not sole players


Page 10: Functional programming for optimization problems in Big Data

team process = needs

• discovery: help people ask the right questions
• modeling: allow automation to place informed bets
• integration: build smarts into product features
• apps: deliver products at scale to customers
• systems: keep infrastructure running, cost-effective

Gephi


Page 11: Functional programming for optimization problems in Big Data

team composition = roles

• Domain Expert: business process, stakeholder
• Data Scientist: data prep, discovery, modeling, etc.
• App Dev: software engineering, automation
• Ops: systems engineering, access

[Venn diagram of the four roles overlapping as “data science”, an introduced capability, alongside the workflow diagram thumbnail]

This is an example of multi-disciplinary team composition for data science, while other emerging problem spaces will require other, more specific kinds of team roles

Page 12: Functional programming for optimization problems in Big Data

matrix: evaluate needs × roles

[Matrix: the needs (discovery, modeling, integration, apps, systems) are evaluated against the roles (stakeholder, scientist, developer, ops)]


Page 13: Functional programming for optimization problems in Big Data

most valuable skills

approximately 80% of the costs for data-related projects get spent on data preparation – mostly on cleaning up data quality issues: ETL, log file analysis, etc.

unfortunately, data-related budgets for many companies tend to go into frameworks which can only be used after clean up

most valuable skills:

‣ learn to use programmable tools that prepare data

‣ learn to generate compelling data visualizations

‣ learn to estimate the confidence for reported results

‣ learn to automate work, making analysis repeatable

the rest of the skills – modeling, algorithms, etc. – those are secondary

D3


Page 14: Functional programming for optimization problems in Big Data

in a nutshell, what we do…

‣ estimate probability

‣ calculate analytic variance

‣ manipulate order complexity

‣ leverage use of learning theory

+ collab with DevOps, Stakeholders

+ reduce work to cron entries

[Background graphic: a cloud of raw event-log message types captured as machine data, e.g., “Unique Registration”, “Launched games lobby”, “Create New Pet”, “Buy an Item: web”, “Customer Made Purchase Cart Page Step 2”, “Address space remaining: 512M”, “NUI:ChatMode”, “Website Login”, etc.]

science in data science?


Page 15: Functional programming for optimization problems in Big Data

references…

by DJ Patil

Data Jujitsu
O’Reilly, 2012
amazon.com/dp/B008HMN5BE

Building Data Science Teams
O’Reilly, 2011
amazon.com/dp/B005O4U3ZE


Page 16: Functional programming for optimization problems in Big Data

The Workflow Abstraction

[Workflow diagram, as on Page 2]

1. Data Science
2. Functional Programming
3. Workflow Abstraction
4. Typical Use Cases
5. Open Data Example

Origin and overview of the Cascading API as a workflow abstraction for Enterprise Big Data apps.

Page 17: Functional programming for optimization problems in Big Data

Cascading – origins

API author Chris Wensel worked as a system architect at an Enterprise firm well-known for several popular data products.

Wensel was following the Nutch open source project – before Hadoop even had a name.

He noted that it would become difficult to find Java developers to write complex Enterprise apps directly in Apache Hadoop – a potential blocker for leveraging this new open source technology.

Cascading initially grew from interaction with the Nutch project, before Hadoop had a name

API author Chris Wensel recognized that MapReduce would be too complex for J2EE developers to perform substantial work in an Enterprise context without an abstraction layer.

Page 18: Functional programming for optimization problems in Big Data

Cascading – functional programming

Key insight: MapReduce is based on functional programming – back to LISP in the 1970s. Apache Hadoop use cases are mostly about data pipelines, which are functional in nature.

To ease staffing problems as “Main Street” Enterprise firms began to embrace Hadoop, Cascading was introduced in late 2007, as a new Java API to implement functional programming for large-scale data workflows.

Years later, Enterprise app deployments on Hadoop are limited by staffing issues: difficulty of retraining staff, scarcity of Hadoop experts.

Page 19: Functional programming for optimization problems in Big Data

examples…

• Twitter, Etsy, eBay, YieldBot, uSwitch, etc., have invested in functional programming open source projects atop Cascading – used for their large-scale production deployments

• new case studies for Cascading apps are mostly based on domain-specific languages (DSLs) in JVM languages which emphasize functional programming:

Cascalog in Clojure (2010)
Scalding in Scala (2012)

github.com/nathanmarz/cascalog/wiki

github.com/twitter/scalding/wiki

Many case studies, many Enterprise production deployments now for 5+ years.

Page 20: Functional programming for optimization problems in Big Data

void map (String doc_id, String text):
  for each word w in segment(text):
    emit(w, "1");

void reduce (String word, Iterator group):
  int count = 0;
  for each pc in group:
    count += Int(pc);
  emit(word, String(count));

The Ubiquitous Word Count

Definition: count how often each word appears in a collection of text documents

This simple program provides an excellent test case for parallel processing, since it:

• requires a minimal amount of code

• demonstrates use of both symbolic and numeric values

• shows a dependency graph of tuples as an abstraction

• is not many steps away from useful search indexing

• serves as a “Hello World” for Hadoop apps

Any distributed computing framework which can run Word Count efficiently in parallel at scale can handle much larger and more interesting compute problems.


Taking a wild guess, most people who’ve written any MapReduce code have seen this example app already...

Page 21: Functional programming for optimization problems in Big Data

1 map, 1 reduce, 18 lines of code: gist.github.com/3900702

word count – conceptual flow diagram

cascading.org/category/impatient

Based on a Cascading implementation of Word Count, this is a conceptual flow diagram: the pattern language in use to specify the business process, using a literate programming methodology to describe a data workflow.

Page 22: Functional programming for optimization problems in Big Data

word count – Cascading app in Java

String docPath = args[ 0 ];
String wcPath = args[ 1 ];
Properties properties = new Properties();
AppProps.setApplicationJarClass( properties, Main.class );
HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

// create source and sink taps
Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

// specify a regex to split "document" text lines into token stream
Fields token = new Fields( "token" );
Fields text = new Fields( "text" );
RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
// only returns "token"
Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );

// determine the word counts
Pipe wcPipe = new Pipe( "wc", docPipe );
wcPipe = new GroupBy( wcPipe, token );
wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

// connect the taps, pipes, etc., into a flow
FlowDef flowDef = FlowDef.flowDef()
  .setName( "wc" )
  .addSource( docPipe, docTap )
  .addTailSink( wcPipe, wcTap );

// write a DOT file and run the flow
Flow wcFlow = flowConnector.connect( flowDef );
wcFlow.writeDOT( "dot/wc.dot" );
wcFlow.complete();


Based on a Cascading implementation of Word Count, here is sample code -- approx 1/3 the code size of the Word Count example from Apache Hadoop

2nd to last line: generates a DOT file for the flow diagram

Page 23: Functional programming for optimization problems in Big Data

word count – generated flow diagram

[Generated flow diagram (rendered from the DOT output): Hfs tap 'data/rain.txt' with fields ['doc_id', 'text'] → Each('token') RegexSplitGenerator → GroupBy('wc') by ['token'] → Every('wc') Count ['count'] → Hfs tap 'output/wc' with fields ['token', 'count']; the GroupBy marks the map/reduce boundary]

As a concrete example of literate programming in Cascading, here is the DOT representation of the flow plan -- generated by the app itself.

Page 24: Functional programming for optimization problems in Big Data

(ns impatient.core
  (:use [cascalog.api]
        [cascalog.more-taps :only (hfs-delimited)])
  (:require [clojure.string :as s]
            [cascalog.ops :as c])
  (:gen-class))

(defmapcatop split [line]
  "reads in a line of string and splits it by regex"
  (s/split line #"[\[\]\\\(\),.)\s]+"))

(defn -main [in out & args]
  (?<- (hfs-delimited out)
       [?word ?count]
       ((hfs-delimited in :skip-header? true) _ ?line)
       (split ?line :> ?word)
       (c/count ?count)))

; Paul Lam
; github.com/Quantisan/Impatient

word count – Cascalog / Clojure

Here is the same Word Count app written in Clojure, using Cascalog.

Page 25: Functional programming for optimization problems in Big Data

github.com/nathanmarz/cascalog/wiki

• implements Datalog in Clojure, with predicates backed by Cascading – for a highly declarative language

• run ad-hoc queries from the Clojure REPL – approx. 10:1 code reduction compared with SQL

• composable subqueries, used for test-driven development (TDD) practices at scale

• Leiningen build: simple, no surprises, in Clojure itself

• more new deployments than other Cascading DSLs – Climate Corp is largest use case: 90% Clojure/Cascalog

• has a learning curve, limited number of Clojure developers

• aggregators are the magic, and those take effort to learn

word count – Cascalog / Clojure

From what we see about language features, customer case studies, and best practices in general -- Cascalog represents some of the most sophisticated uses of Cascading, as well as some of the largest deployments.

Great for large-scale, complex apps, where small teams must limit the complexities in their process.

Page 26: Functional programming for optimization problems in Big Data

import com.twitter.scalding._

class WordCount(args : Args) extends Job(args) {
  Tsv(args("doc"), ('doc_id, 'text), skipHeader = true)
    .read
    .flatMap('text -> 'token) { text : String => text.split("[ \\[\\]\\(\\),.]") }
    .groupBy('token) { _.size('count) }
    .write(Tsv(args("wc"), writeHeader = true))
}

word count – Scalding / Scala

Here is the same Word Count app written in Scala, using Scalding.

Very compact, easy to understand; however, also more imperative than Cascalog.

Page 27: Functional programming for optimization problems in Big Data

github.com/twitter/scalding/wiki

• extends the Scala collections API so that distributed lists become “pipes” backed by Cascading

• code is compact, easy to understand

• nearly 1:1 between elements of conceptual flow diagram and function calls

• extensive libraries are available for linear algebra, abstract algebra, machine learning – e.g., Matrix API, Algebird, etc.

• significant investments by Twitter, Etsy, eBay, etc.

• great for data services at scale

• less learning curve than Cascalog, not as much of a high-level language

word count – Scalding / Scala

If you wanted to see what a data services architecture for machine learning work at, say, Google scale would look like as an open source project -- that’s Scalding. That’s what they’re doing.

Page 28: Functional programming for optimization problems in Big Data

github.com/twitter/scalding/wiki

• extends the Scala collections API so that distributed lists become “pipes” backed by Cascading

• code is compact, easy to understand

• nearly 1:1 between elements of conceptual flow diagram and function calls

• extensive libraries are available for linear algebra, abstract algebra, machine learning – e.g., Matrix API, Algebird, etc.

• significant investments by Twitter, Etsy, eBay, etc.

• great for data services at scale (imagine SOA infra @ Google as an open source project)

• less learning curve than Cascalog, not as much of a high-level language

word count – Scalding / Scala

Cascalog and Scalding DSLs leverage the functional aspects of MapReduce, helping to limit complexity in process

Arguably, using a functional programming language to build flows is better than trying to represent functional programming constructs within Java…

Page 29: Functional programming for optimization problems in Big Data

The Workflow Abstraction

[Workflow diagram, as on Page 2]

1. Data Science
2. Functional Programming
3. Workflow Abstraction
4. Typical Use Cases
5. Open Data Example

CS theory related to data workflow abstraction, to manage complexity

Page 30: Functional programming for optimization problems in Big Data

Cascading workflows – pattern language

Cascading uses a “plumbing” metaphor in the Java API, to define workflows out of familiar elements: Pipes, Taps, Tuple Flows, Filters, Joins, Traps, etc.

[Workflow diagram: Document Collection, Tokenize, Scrub token, Regex token, Stop Word List, HashJoin Left (RHS), GroupBy token, Count, Word Count; M/R mark the map and reduce phases]

Data is represented as flows of tuples. Operations within the tuple flows bring functional programming aspects into Java apps.

In formal terms, this provides a pattern language.

A pattern language, based on the metaphor of “plumbing”

Page 31: Functional programming for optimization problems in Big Data

references…

pattern language: a structured method for solving large, complex design problems, where the syntax of the language promotes the use of best practices.

amazon.com/dp/0195019199

design patterns: the notion originated in consensus negotiation for architecture, later applied in OOP software engineering by “Gang of Four”.

amazon.com/dp/0201633612

Chris Alexander originated the use of pattern language in a project called “The Oregon Experiment”, in the 1970s.

Page 32: Functional programming for optimization problems in Big Data

Cascading workflows – literate programming

Cascading workflows generate their own visual documentation: flow diagrams

In formal terms, flow diagrams leverage a methodology called literate programming

Provides intuitive, visual representations for apps, great for cross-team collaboration.


Formally speaking, the pattern language in Cascading gets leveraged as a visual representation used for literate programming.

Several good examples exist, but the phenomenon of different developers troubleshooting a program together over the “cascading-users” email list is most telling -- expert developers generally ask a novice to provide a flow diagram first

Page 33: Functional programming for optimization problems in Big Data

references…

by Don Knuth

Literate Programming
Univ of Chicago Press, 1992

literateprogramming.com/

“Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want a computer to do.”

Don Knuth originated the notion of literate programming, or code as “literature” which explains itself.

Page 34: Functional programming for optimization problems in Big Data

examples…

• Scalding apps have nearly 1:1 correspondence between function calls and the elements in their flow diagrams – excellent elision and literate representation

• noticed on cascading-users email list: when troubleshooting issues, Cascading experts ask novices to provide an app’s flow diagram (generated as a DOT file), sometimes in lieu of showing code

In formal terms, a flow diagram is a directed, acyclic graph (DAG) on which lots of interesting math applies for query optimization, predictive models about app execution, parallel efficiency metrics, etc.
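To make the DAG point concrete, here is a minimal sketch in Scala (the graph representation and function are illustrative, not part of Cascading; node names follow the conceptual flow diagram), showing that once a workflow is expressed as a DAG, standard graph algorithms such as topological ordering apply directly:

import scala.collection.mutable

// Toy topological sort over a workflow DAG (Kahn's algorithm).
def topoSort(edges: Map[String, Set[String]]): List[String] = {
  val nodes = edges.keySet ++ edges.values.flatten
  val inDegree = mutable.Map(nodes.toSeq.map(n => n -> 0): _*)
  for ((_, outs) <- edges; n <- outs) inDegree(n) += 1
  val queue = mutable.Queue(nodes.toSeq.filter(inDegree(_) == 0): _*)
  val order = mutable.ListBuffer.empty[String]
  while (queue.nonEmpty) {
    val n = queue.dequeue()
    order += n
    for (m <- edges.getOrElse(n, Set.empty)) {
      inDegree(m) -= 1
      if (inDegree(m) == 0) queue.enqueue(m)
    }
  }
  order.toList
}

// Word Count flow from the conceptual diagram, as adjacency lists
val wordCountFlow = Map(
  "Document Collection" -> Set("Tokenize"),
  "Tokenize"            -> Set("GroupBy token"),
  "GroupBy token"       -> Set("Count"),
  "Count"               -> Set("Word Count")
)
// topoSort(wordCountFlow) yields the execution order:
// List(Document Collection, Tokenize, GroupBy token, Count, Word Count)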


Literate programming examples observed on the email list are some of the best illustrations of this methodology.

Page 35: Functional programming for optimization problems in Big Data

Cascading workflows – business process

Following the essence of literate programming, Cascading workflows provide statements of business process

This recalls a sense of business process management for Enterprise apps (think BPM/BPEL for Big Data)

As a separation of concerns between business process and implementation details (Hadoop, etc.)

This is especially apparent in large-scale Cascalog apps:

“Specify what you require, not how to achieve it.”

By virtue of the pattern language, the flow planner used in a Cascading app determines how to translate business process into efficient, parallel jobs at scale.

Business Stakeholder POV: business process management for workflow orchestration (think BPM/BPEL)

Page 36: Functional programming for optimization problems in Big Data

references…

by Edgar Codd

“A relational model of data for large shared data banks”
Communications of the ACM, 1970
dl.acm.org/citation.cfm?id=362685

Rather than arguing between SQL vs. NoSQL…structured vs. unstructured data frameworks… this approach focuses on:

the process of structuring data

That’s what apps do – Making Data Work

Focus on *the process of structuring data*, which must happen before the large-scale joins, predictive models, visualizations, etc.

Just because your data is loaded into a “structured” store, that does not imply that your app has finished structuring it for the purpose of making data work.

BTW, anybody notice that the O’Reilly “animal” for the Cascading book is an Atlantic Cod? (pun intended)

Page 37: Functional programming for optimization problems in Big Data

Cascading workflows – functional relational programming

The combination of functional programming, pattern language, DSLs, literate programming, business process, etc., traces back to the original definition of the relational model (Codd, 1970) prior to SQL.

Cascalog, in particular, implements more of what Codd intended for a “data sublanguage” and is considered to be close to a full implementation of the functional relational programming paradigm defined in:

Moseley & Marks, 2006
“Out of the Tar Pit”
goo.gl/SKspn

A more contemporary statement along similar lines...

Page 38: Functional programming for optimization problems in Big Data

Two Avenues…

scale ➞ complexity

Enterprise: must contend with complexity at scale every day…

incumbents extend current practices and infrastructure investments – using J2EE, ANSI SQL, SAS, etc. – to migrate workflows onto Apache Hadoop while leveraging existing staff

Start-ups: crave complexity and scale to become viable…

new ventures move into Enterprise space to compete using relatively lean staff, while leveraging sophisticated engineering practices, e.g., Cascalog and Scalding

Enterprise data workflows are observed in two modes: start-ups approaching complexity and incumbent firms grappling with complexity

Page 39: Functional programming for optimization problems in Big Data

Cascading workflows – functional relational programming

The combination of functional programming, pattern language, DSLs, literate programming, business process, etc., traces back to the original definition of the relational model (Codd, 1970) prior to SQL.

Cascalog, in particular, implements more of what Codd intended for a “data sublanguage” and is considered to be close to a full implementation of the functional relational programming paradigm defined in:

Moseley & Marks, 2006
“Out of the Tar Pit”
goo.gl/SKspn

several theoretical aspects converge into software engineering practices which mitigate the complexity of building and maintaining Enterprise data workflows


Page 40: Functional programming for optimization problems in Big Data

The Workflow Abstraction

[Workflow diagram, as on Page 2]

1. Data Science
2. Functional Programming
3. Workflow Abstraction
4. Typical Use Cases
5. Open Data Example

Here are a few use cases to consider, for Enterprise data workflows

Page 41: Functional programming for optimization problems in Big Data

Cascading – deployments

• 5+ year history of Enterprise production deployments, ASL 2 license, GitHub src, http://conjars.org

• partners: Amazon AWS, Microsoft Azure, Hortonworks, MapR, EMC, SpringSource, Cloudera

• case studies: Climate Corp, Twitter, Etsy, Williams-Sonoma, uSwitch, Airbnb, Nokia, YieldBot, Square, Harvard, etc.

• use cases: ETL, marketing funnel, anti-fraud, social media, retail pricing, search analytics, recommenders, eCRM, utility grids, genomics, climatology, etc.

Several published case studies about Cascading, Cascalog, Scalding, etc. Wide range of use cases.

Significant investment by Twitter, Etsy, and other firms for OSS based on Cascading. Partnerships with the various Hadoop distro vendors, cloud providers, etc.

Page 42: Functional programming for optimization problems in Big Data

Finance: Ecommerce Risk

Problem:

<1% chargeback rate allowed by Visa, others follow

• may leverage CAPTURE/AUTH wait period

• Cybersource, Vindicia, others haven’t stopped fraud

>15% chargeback rate common for mobile in US:

• not much info shared with merchant

• carrier as judge/jury/executioner; customer assumed correct

most common: professional fraud (identity theft, etc.)

• patterns of attack change all the time

• widespread use of IP proxies, to mask location

• global market for stolen credit card info

other common case is friendly fraud

• teenager billing to parent’s cell phone

stat.berkeley.edu


Page 43: Functional programming for optimization problems in Big Data

KPI:

chargeback rate (CB)

• ground truth for how much fraud the bank/carrier claims

• 7-120 day latencies from the bank

false positive rate (FP)

• estimated cost: predicts customer support issues

• complaints due to incorrect fraud scores on valid orders (or lies)

false negative rate (FN)

• estimated risk: how much fraud may pass undetected in future orders

• changes with new product features/services/inventory/marketing

stat.berkeley.edu

Finance: Ecommerce Risk


Page 44: Functional programming for optimization problems in Big Data

Data Science Issues:

• chargeback limits imply few training cases

• sparse data implies lots of missing values – must impute

• long latency on chargebacks – “good” flips to “bad”

• most detection occurs within large-scale batch, decisions required during real-time event processing

• not just one pattern to detect – many, ever-changing

• many unknowns: blocked orders scare off professional fraud, inferences cannot be confirmed

• cannot simply use raw data as input – requires lots of data preparation and statistical modeling

• each ecommerce firm has shopping/policy nuances which get exploited differently – hard to generalize solutions

stat.berkeley.edu

Finance: Ecommerce Risk


Page 45: Functional programming for optimization problems in Big Data

Predictive Analytics:

batch

• cluster/segment customers for expected behaviors

• adjust for seasonal variation

• geospatial indexing / bayesian point estimates (fraud by lat/lng)

• impute missing values (“guesses” to fill-in sparse data)

• run anti-fraud classifier (customer 360)

real-time

• exponential smoothing (estimators for velocity)

• calculate running medians (anomaly detection)

• run anti-fraud classifier (per order)

stat.berkeley.edu

Finance: Ecommerce Risk
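As a rough illustration of two of the real-time estimators listed above (exponential smoothing and running medians), here is a minimal sketch in Scala; the class names, smoothing factor, and window size are assumptions for illustration, not the production code:

// Exponential smoothing: a cheap running estimator, e.g. for order velocity.
final class ExpSmoother(alpha: Double = 0.3) {
  private var est: Option[Double] = None
  def update(x: Double): Double = {
    est = Some(est.fold(x)(prev => alpha * x + (1.0 - alpha) * prev))
    est.get
  }
}

// Running median over a sliding window, a simple basis for anomaly detection:
// flag an order when its value strays far from the recent median.
final class RunningMedian(window: Int = 101) {
  private val buf = scala.collection.mutable.Queue.empty[Double]
  def update(x: Double): Double = {
    buf.enqueue(x)
    if (buf.size > window) buf.dequeue()
    val sorted = buf.toVector.sorted
    val n = sorted.size
    if (n % 2 == 1) sorted(n / 2) else (sorted(n / 2 - 1) + sorted(n / 2)) / 2.0
  }
}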


Page 46: Functional programming for optimization problems in Big Data

1. Data Preparation (batch)

‣ ETL from bank, log sessionization, customer profiles, etc.

- large-scale joins of customers + orders

‣ apply time window

- too long: patterns lose currency

- too short: not enough wait for chargebacks

‣ segment customers

- temporary fraud (identity theft which has been resolved)

- confirmed fraud (chargebacks from the bank)

- estimated fraud (blocked/banned by Customer Support)

- valid orders (but different clusters of expected behavior)

‣ subsample to rebalance data

- produce training set + test holdout

- adjust balance for FP/FN bias (company risk profile)

stat.berkeley.edu

Finance: Ecommerce Risk
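As a sketch of the “subsample to rebalance” and “training set + test holdout” steps listed above, in Scala (the ratio, holdout fraction, seed, and function names are illustrative assumptions):

import scala.util.Random

// Downsample the majority class (valid orders) so fraud cases aren't swamped.
def rebalance[T](fraud: Seq[T], valid: Seq[T], ratio: Double = 3.0, seed: Long = 42L): Seq[T] = {
  val rng = new Random(seed)
  val keep = math.min((fraud.size * ratio).toInt, valid.size)
  fraud ++ rng.shuffle(valid).take(keep)
}

// Carve out a test holdout from the rebalanced rows.
def trainTestSplit[T](rows: Seq[T], holdout: Double = 0.2, seed: Long = 42L): (Seq[T], Seq[T]) = {
  val shuffled = new Random(seed).shuffle(rows)
  val cut = (rows.size * (1.0 - holdout)).toInt
  (shuffled.take(cut), shuffled.drop(cut))   // (training set, test holdout)
}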


Page 47: Functional programming for optimization problems in Big Data

2. Model Creation (analyst)

‣ distinguish between different IV data types

- continuous (e.g., age)

- boolean (e.g., paid lead)

- categorical (e.g., gender)

- computed (e.g., geo risk, velocities)

‣ use geospatial smoothing for lat/lng

‣ determine distributions for IV

‣ adjust IV for seasonal variation, where appropriate

‣ impute missing values based on density functions / medians

‣ factor analysis: determine which IV to keep (too many creates problems)

‣ train model: random forest (RF) classifiers predict likely fraud

‣ calculate the confusion matrix (TP/FP/TN/FN)

stat.berkeley.edu

Finance: Ecommerce Risk
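A minimal sketch of the confusion-matrix step named above, deriving the FP/FN rates that show up as KPIs on the earlier slides (types and names here are illustrative, not from the deck):

// Tally TP/FP/TN/FN from (actualFraud, predictedFraud) pairs and derive the rates.
case class Confusion(tp: Int, fp: Int, tn: Int, fn: Int) {
  def falsePositiveRate: Double = fp.toDouble / (fp + tn)   // valid orders flagged as fraud
  def falseNegativeRate: Double = fn.toDouble / (fn + tp)   // fraud that slipped through
}

def confusion(labeled: Seq[(Boolean, Boolean)]): Confusion =
  labeled.foldLeft(Confusion(0, 0, 0, 0)) {
    case (c, (true,  true))  => c.copy(tp = c.tp + 1)
    case (c, (false, true))  => c.copy(fp = c.fp + 1)
    case (c, (false, false)) => c.copy(tn = c.tn + 1)
    case (c, (true,  false)) => c.copy(fn = c.fn + 1)
  }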


Page 48: Functional programming for optimization problems in Big Data

3. Test Model (analyst/batch loop)

‣ calculate estimated fraud rates

‣ identify potential found fraud cases

‣ report to Customer Support for review

‣ generate risk vs. benefit curves

‣ visualize estimated impact of new model

4. Decision (stakeholder)

‣ decide risk vs. benefit (minimize fraud + customer support costs)

‣ coordinate with bank/carrier if there are current issues

‣ determine go/no-go, when to deploy in production, size of rollout

stat.berkeley.edu

Finance: Ecommerce Risk


Page 49: Functional programming for optimization problems in Big Data

5. Production Deployment (near-time)

‣ run model on in-memory grid / transaction processing

‣ A/B test to verify model in production (progressive rollout)

‣ detect anomalies

- use running medians on continuous IVs

- use exponential smoothing on computed IVs (velocities)

- trigger notifications

‣ monitor KPI and other metrics in dashboards

stat.berkeley.edu

Finance: Ecommerce Risk


Page 50: Functional programming for optimization problems in Big Data

Cascading apps

[Architecture diagram: analyst’s laptop (data prep, PMML model); batch workloads on Hadoop and real-time workloads on an in-memory data grid (IMDG); ETL from partner data / DW, chargebacks, etc., Customer DB; customer transactions; training data sets; segment customers; detect fraudsters; predict model costs; risk classifier, dimension: customer 360; risk classifier, dimension: per-order; score new orders; anomaly detection; velocity metrics]

Finance: Ecommerce Risk


Page 51: Functional programming for optimization problems in Big Data

Ecommerce: Marketing Funnel

Wikipedia

Problem:

• must optimize large ad spend budget

• different vendors report different kinds of metrics

• some campaigns are much smaller than others

• seasonal variation distorts performance

• inherent latency in spend vs. effect

• ads channels cannot scale up immediately

• must “scrub” leads to dispute payments/refunds

• hard to predict ROI for incremental ad spend

• many issues of diminishing returns in general


Page 52: Functional programming for optimization problems in Big Data

Wikipedia

KPI:

cost per paying user (CPP)

• must align metrics for different ad channels

• generally need to estimate to end-of-month

customer lifetime value (LTV)

• big differences based on geographic region, age, gender, etc.

• assumes that new customers behave like previous customers

return on investment (ROI)

• relationship between CPP and LTV

• adjust to invest in marketing (>CPP) vs. extract profit (>LTV)

other metrics

• reach: how many people get a brand message

• customer satisfaction: would recommend to a friend, etc.

Ecommerce: Marketing Funnel
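As simple illustrative arithmetic for the relationship between CPP, LTV, and ROI described above (the exact ROI definition varies by team; the figures and formula here are assumptions, not from the deck):

val cpp = 12.50   // cost per paying user: ad spend / paying conversions
val ltv = 45.00   // estimated customer lifetime value
val roi = (ltv - cpp) / cpp   // = 2.6: each $1 of acquisition spend returns ~$2.60 over the customer lifetime
// roi well above 0 argues for investing more in marketing; near or below 0 argues for extracting profit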


Page 53: Functional programming for optimization problems in Big Data

Wikipedia

Predictive Analytics:

batch

• log aggregation, followed with cohort analysis

• bayesian point estimates compare different-sized ad tests

• time series analysis normalizes for seasonal variation

• geolocation adjusts for regional cost/benefit

• customer lifetime value estimates ROI of new leads

• linear programming models estimate elasticity of demand

real-time

• determine whether this is actually a new customer…

• new: modify initial UX based on ad channel, region, friends, etc.

• old: recommend products/services/friends based on behaviors

• adjust spend on poorly performing channels

• track back to top referring sites/partners

Ecommerce: Marketing Funnel
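One way to read the “bayesian point estimates compare different-sized ad tests” bullet above: with a uniform Beta(1, 1) prior, the posterior mean shrinks small campaigns toward the prior instead of over-trusting tiny samples. A minimal sketch, as an illustration rather than the deck’s actual model:

// Posterior mean of a conversion rate under a Beta(1, 1) prior.
def posteriorMean(conversions: Long, impressions: Long): Double =
  (conversions + 1.0) / (impressions + 2.0)

val smallTest = posteriorMean(3, 40)      // ≈ 0.095, pulled toward the prior
val largeTest = posteriorMean(220, 4000)  // ≈ 0.055, dominated by the data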


Page 54: Functional programming for optimization problems in Big Data

Airlines

Problem:

• minimize schedule delays

• re-route around weather and airport conditions

• manage supplier channels and inventories to minimize AOG

KPI:

forecast future passenger demand

customer loyalty

aircraft on ground (AOG)

mean time between failures (MTBF)


Page 55: Functional programming for optimization problems in Big Data

Predictive Analytics:

batch

• predict “last mile” failures

• optimize capacity utilization

• operations research problem to optimize stocking / minimize fuel waste

• boost customer loyalty by adjusting incentives in frequent flyer programs

real-time

• forecast schedule delays

• monitor factors for travel conditions: weather, airports, etc.

Airlines


Page 56: Functional programming for optimization problems in Big Data

The Workflow Abstraction

[Workflow diagram, as on Page 2]

1. Data Science
2. Functional Programming
3. Workflow Abstraction
4. Typical Use Cases
5. Open Data Example

Cascalog app for mobile data API (recommender service) based on City of Palo Alto Open Data

Page 57: Functional programming for optimization problems in Big Data

Palo Alto is quite a pleasant place

• temperate weather

• lots of parks, enormous trees

• great coffeehouses

• walkable downtown

• not particularly crowded

On a nice summer day, who wants to be stuck indoors on a phone call?

Instead, take it outside – go for a walk

An example open source project: github.com/Cascading/CoPA/wiki

Palo Alto is generally quite a pleasant place: the weather is temperate, there are lots of parks with enormous trees, most of downtown is quite walkable, and it's not particularly crowded.

On a summer day in Palo Alto, one of the last things anybody really wants is to be stuck in an office on a long phone call. Instead people walk outside and take their calls, probably heading toward a favorite espresso bar or a frozen yogurt shop. On a hot summer day in Palo Alto, knowing a nice quiet route to walk in the shade would be great.

Page 58: Functional programming for optimization problems in Big Data

1. Open Data about municipal infrastructure(GIS data: trees, roads, parks)

2. Big Data about where people like to walk(smartphone GPS logs)

3. some curated metadata(which surfaces the value)

⇒4. personalized recommendations:

“Find a shady spot on a summer day in which to walk near downtown Palo Alto. While on a long conference call. Sipping a latte or enjoying some fro-yo.”


We merge unstructured geo data about municipal infrastructure (GIS data: trees, roads, parks) + unstructured data about where people like to walk (smartphone GPS logs) + a little metadata (curated) => personalized recommendations:

"Find a shady spot on a summer day in which to walk near downtown Palo Alto. While on a long conference call. Sippin’ a latte or enjoying some fro-yo."

Page 59: Functional programming for optimization problems in Big Data

The City of Palo Alto recently began to support Open Data to give the local community greater visibility into how their city government operates

This effort is intended to encourage students, entrepreneurs, local organizations, etc., to build new apps which contribute to the public good

paloalto.opendata.junar.com/dashboards/7576/geographic-information/

discovery

The City of Palo Alto has recently begun to support Open Data to give the local community greater visibility into how their city government functions.

This effort is intended to encourage students, entrepreneurs, local organizations, etc., to build new apps which contribute to the public good.

http://paloalto.opendata.junar.com/dashboards/7576/geographic-information/

Page 60: Functional programming for optimization problems in Big Data

GIS about trees in Palo Alto:

discovery

(trees map overlay)

Page 61: Functional programming for optimization problems in Big Data

Geographic_Information,,,

"Tree: 29 site 2 at 203 ADDISON AV, on ADDISON AV 44 from pl"," Private: -1 Tree ID: 29 Street_Name: ADDISON AV Situs Number: 203 Tree Site: 2 Species: Celtis australis Source: davey tree Protected: Designated: Heritage: Appraised Value: Hardscape: None Identifier: 40 Active Numeric: 1 Location Feature ID: 13872 Provisional: Install Date: ","37.4409634615283,-122.15648458861,0.0 ","Point""Wilkie Way from West Meadow Drive to Victoria Place"," Sequence: 20 Street_Name: Wilkie Way From Street PMMS: West Meadow Drive To Street PMMS: Victoria Place Street ID: 598 (Wilkie Wy, Palo Alto) From Street ID PMMS: 689 To Street ID PMMS: 567 Year Constructed: 1950 Traffic Count: 596 Traffic Index: residential local Traffic Class: local residential Traffic Date: 08/24/90 Paving Length: 208 Paving Width: 40 Paving Area: 8320 Surface Type: asphalt concrete Surface Thickness: 2.0 Base Type Pvmt: crusher run base Base Thickness: 6.0 Soil Class: 2 Soil Value: 15 Curb Type: Curb Thickness: Gutter Width: 36.0 Book: 22 Page: 1 District Number: 18 Land Use PMMS: 1 Overlay Year: 1990 Overlay Thickness: 1.5 Base Failure Year: 1990 Base Failure Thickness: 6 Surface Treatment Year: Surface Treatment Type: Alligator Severity: none Alligator Extent: 0 Block Severity: none Block Extent: 0 Longitude and Transverse Severity: none Longitude and Transverse Extent: 0 Ravelling Severity: none Ravelling Extent: 0 Ridability Severity: none Trench Severity: none Trench Extent: 0 Rutting Severity: none Rutting Extent: 0 Road Performance: UL (Urban Local) Bike Lane: 0 Bus Route: 0 Truck Route: 0 Remediation: Deduct Value: 100 Priority: Pavement Condition: excellent Street Cut Fee per SqFt: 10.00 Source Date: 6/10/2009 User Modified By: mnicols Identifier System: 21410 ","-122.1249640794,37.4155803115645,0.0 -122.124661859039,37.4154224594993,0.0 -122.124587720719,37.4153758330704,0.0 -122.12451895942,37.4153242300888,0.0 -122.124456098457,37.4152680432944,0.0 -122.124399616238,37.4152077003122,0.0 -122.124374937753,37.4151774433318,0.0 ","Line"

discovery

(unstructured data…)

here’s what we have to work with -- raw GIS export as CSV, with plenty o’ errors too, for good measure

this illustrates a great example of “unstructured data”

Alligator Severity!

Page 62: Functional programming for optimization problems in Big Data

(defn parse-gis [line]
  "leverages parse-csv for complex CSV format in GIS export"
  (first (csv/parse-csv line)))

(defn etl-gis [gis trap]
  "subquery to parse data sets from the GIS source tap"
  (<- [?blurb ?misc ?geo ?kind]
      (gis ?line)
      (parse-gis ?line :> ?blurb ?misc ?geo ?kind)
      (:trap (hfs-textline trap))
      ))

discovery

(specify what you require, not how to achieve it…

80/20 rule of data prep cost)

Let's use Cascalog to begin our process of structuring that data

since the GIS export is vaguely in CSV format, here's a simple way to clean up the data

referring back to DJ Patil’s “Data Jujitsu”, that clean up usually accounts for 80% of project costs

Page 63: Functional programming for optimization problems in Big Data

discovery

(ad-hoc queries get refined into composable predicates)

Identifier: 474  Tree ID: 412  Tree: 412 site 1 at 115 HAWTHORNE AV
Tree Site: 1  Street_Name: HAWTHORNE AV  Situs Number: 115  Private: -1
Species: Liquidambar styraciflua  Source: davey tree  Hardscape: None
37.446001565119,-122.167713417554,0.0
Point

First we load `lein repl` to get an interactive prompt for Clojure… bring Cascalog libraries into Clojure… define functions to use… and execute queries

then we convert the queries into composable, logical predicates

Let’s take a peek at the results...

TSV output becomes more structured, while the “bad” data has been trapped into a data set for review

[bold/colors added for clarity]

Page 64: Functional programming for optimization problems in Big Data

discovery

(curate valuable metadata)

since we can find species and geolocation for each tree,

let’s add some metadata to infer other valuable data results, e.g., tree height

based on Wikipedia.org, Calflora.org, USDA.gov, etc.

Page 65: Functional programming for optimization problems in Big Data

(defn get-trees [src trap tree_meta]
  "subquery to parse/filter the tree data"
  (<- [?blurb ?tree_id ?situs ?tree_site
       ?species ?wikipedia ?calflora ?avg_height
       ?tree_lat ?tree_lng ?tree_alt ?geohash]
      (src ?blurb ?misc ?geo ?kind)
      (re-matches #"^\s+Private.*Tree ID.*" ?misc)
      (parse-tree ?misc :> _ ?priv ?tree_id ?situs ?tree_site ?raw_species)
      ((c/comp s/trim s/lower-case) ?raw_species :> ?species)
      (tree_meta ?species ?wikipedia ?calflora ?min_height ?max_height)
      (avg ?min_height ?max_height :> ?avg_height)
      (geo-tree ?geo :> _ ?tree_lat ?tree_lng ?tree_alt)
      (read-string ?tree_lat :> ?lat)
      (read-string ?tree_lng :> ?lng)
      (geohash ?lat ?lng :> ?geohash)
      (:trap (hfs-textline trap))
      ))

discovery

?blurb       Tree: 412 site 1 at 115 HAWTHORNE AV, on HAWTHORNE AV 22 from pl
?tree_id     412
?situs       115
?tree_site   1
?species     liquidambar styraciflua
?wikipedia   http://en.wikipedia.org/wiki/Liquidambar_styraciflua
?calflora    http://calflora.org/cgi-bin/species_query.cgi?where-calrecnum=8598
?avg_height  27.5
?tree_lat    37.446001565119
?tree_lng    -122.167713417554
?tree_alt    0.0
?geohash     9q9jh0

Next, refine the data about trees: join with metadata, calculate estimators, etc.

Now we have a data product about trees in Palo Alto, which has been enriched by our process. BTW, those geolocation fields are especially important...

Page 66: Functional programming for optimization problems in Big Data

# run analysis and visualization in R
library(ggplot2)

dat_folder <- '~/src/concur/CoPA/out/tree'
data <- read.table(file=paste(dat_folder, "part-00000", sep="/"), sep="\t",
                   quote="", na.strings="NULL", header=FALSE, encoding="UTF8")
summary(data)

# top 20 species by frequency
t <- head(sort(table(data$V5), decreasing=TRUE), n=20)
trees <- as.data.frame.table(t)
colnames(trees) <- c("species", "count")

# density plot of estimated tree heights
m <- ggplot(data, aes(x=V8))
m <- m + ggtitle("Estimated Tree Height (meters)")
m + geom_histogram(aes(y = ..density.., fill = ..count..)) + geom_density()

par(mar = c(7, 4, 4, 2) + 0.1)
plot(trees, xaxt="n", xlab="")
axis(1, labels=FALSE)
text(1:nrow(trees), par("usr")[3] - 0.25, srt=45, adj=1, labels=trees$species, xpd=TRUE)
grid(nx=nrow(trees))

discovery

Another aspect of the “Discovery” phase is to poke at the data: run summary stats, visualize the data, etc.

We’ll use RStudio for that...

Page 67: Functional programming for optimization problems in Big Data

discovery

sweetgum

Analysis of the tree data:

some analysis and visualizations from RStudio:

* frequency of species * density plot of tree heights in Palo Alto

Page 68: Functional programming for optimization problems in Big Data

[Conceptual flow diagram, gis ⇒ tree: GIS export, Regex parse-gis, src, Regex parse-tree, Scrub species, Tree Metadata, Join, Estimate height, Geohash, Failure Traps, tree; M marks map stages]

discovery

(flow diagram, gis ⇒ tree)

Here’s a conceptual flow diagram, which shows a directed, acyclic graph (DAG) of data taps, tuple streams, operations, joins, assertions, aggregations, etc.

Page 69: Functional programming for optimization problems in Big Data

discovery

In addition, the road data provides:

• traffic class (arterial, truck route, residential, etc.)

• traffic counts distribution

• surface type (asphalt, cement; age)

This leads to estimators for noise, sunlight reflection, etc.

more analysis and visualizations from RStudio:

* frequency of traffic classes * density plot of traffic counts

Page 70: Functional programming for optimization problems in Big Data

9q9jh0

geohash with 6-digit resolution

approximates a 5-block square

centered lat: 37.445, lng: -122.162

modeling

Shifting into the modeling phase, we use “geohash” codes for “cheap and dirty” geospatial indexing suited for parallel processing (Hadoop)

much more effective methods exist; however, this is simple to show

6-digit resolution on a geohash generates approximately a 5-block square
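A minimal sketch of geohash encoding, in Scala (this is the standard public-domain algorithm, not code from the CoPA project), reproducing the 6-digit cell shown on the slide:

object Geohash {
  private val base32 = "0123456789bcdefghjkmnpqrstuvwxyz"

  // Encode a (lat, lng) pair into a geohash of the given length by alternately
  // bisecting the longitude and latitude ranges, 5 bits per output character.
  def encode(lat: Double, lng: Double, precision: Int = 6): String = {
    var latRange = (-90.0, 90.0)
    var lngRange = (-180.0, 180.0)
    val sb = new StringBuilder
    var isLng = true
    var bit = 0
    var ch = 0
    while (sb.length < precision) {
      val (lo, hi) = if (isLng) lngRange else latRange
      val mid = (lo + hi) / 2
      val v = if (isLng) lng else lat
      if (v >= mid) {
        ch = (ch << 1) | 1
        if (isLng) lngRange = (mid, hi) else latRange = (mid, hi)
      } else {
        ch = ch << 1
        if (isLng) lngRange = (lo, mid) else latRange = (lo, mid)
      }
      isLng = !isLng
      bit += 1
      if (bit == 5) { sb.append(base32.charAt(ch)); bit = 0; ch = 0 }
    }
    sb.toString
  }

  def main(args: Array[String]): Unit =
    println(encode(37.445, -122.162))  // prints "9q9jh0", the cell on the slide
}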

Page 71: Functional programming for optimization problems in Big Data

Each road in the GIS export is listed as a block between two cross roads, and each may have multiple road segments to represent turns:

" -122.161776959558,37.4518836690781,0.0 " -122.161390381489,37.4516410983794,0.0 " -122.160786011735,37.4512589903357,0.0 " -122.160531178368,37.4510977281699,0.0

modeling

( lat0, lng0, alt0 )

( lat1, lng1, alt1 )

( lat2, lng2, alt2 )

( lat3, lng3, alt3 )

NB: segments in the raw GIS have the order of geo coordinates scrambled: (lng, lat, alt)

Each road is listed in the GIS export as a block between two cross roads, and each may have multiple road segments to represent turns

Page 72: Functional programming for optimization problems in Big Data

Our app analyzes each road segment as a data tuple, calculating a center point, then uses a geohash to define a boundary:

modeling

9q9jh0

( lat, lng, alt )

Then uses a geohash to specify a boundary
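A small Scala sketch of that step, with the names here assumed for illustration: parse the raw GIS segment points (which arrive as “lng,lat,alt” strings, per the previous slide’s note), reorder them to (lat, lng), and take a simple center point before geohashing:

case class Point(lat: Double, lng: Double)

// Raw GIS coordinates arrive as "lng,lat,alt"; reorder to (lat, lng).
def parseGisPoint(s: String): Point = {
  val Array(lng, lat, _) = s.trim.split(",").map(_.toDouble)
  Point(lat, lng)
}

// Simple center: average the parsed points of the segment.
def centerPoint(points: Seq[Point]): Point =
  Point(points.map(_.lat).sum / points.size, points.map(_.lng).sum / points.size)

val segment = Seq(
  "-122.161776959558,37.4518836690781,0.0",
  "-122.160531178368,37.4510977281699,0.0"
).map(parseGisPoint)

val mid = centerPoint(segment)
// feed (mid.lat, mid.lng) into the 6-digit geohash to get the segment's boundary key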

Page 73: Functional programming for optimization problems in Big Data

9q9jh0

Query to join a road segment tuple with all the trees within its geohash boundary:

modeling

Query to join the road segment tuple with trees within its geohash boundary

Page 74: Functional programming for optimization problems in Big Data


Use distance-to-midpoint to filter trees which are too far away to provide shade. Calculate a sum of moments for tree height × distance from road segment, as an estimator for shade:

modeling

∑( h·d )

Also calculate estimators for traffic frequency and noise

Use distance to midpoint to filter out trees which are too far away to provide shade

Calculate a sum of moments for tree height × distance from center; approximate, but pretty good

also calculate estimators for traffic frequency and noise
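The `tree-distance` predicate used in the next slide’s query isn’t shown in the deck; a plausible stand-in is the haversine great-circle distance. A minimal Scala sketch (the function name and the choice of meters are assumptions; note the query’s own comment says its distance units are not meters):

import math.{sin, cos, atan2, sqrt, toRadians}

// Haversine great-circle distance in meters between two (lat, lng) points.
def treeDistance(lat1: Double, lng1: Double, lat2: Double, lng2: Double): Double = {
  val earthRadiusM = 6371000.0
  val dLat = toRadians(lat2 - lat1)
  val dLng = toRadians(lng2 - lng1)
  val a = sin(dLat / 2) * sin(dLat / 2) +
          cos(toRadians(lat1)) * cos(toRadians(lat2)) * sin(dLng / 2) * sin(dLng / 2)
  2 * earthRadiusM * atan2(sqrt(a), sqrt(1 - a))
}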

Page 75: Functional programming for optimization problems in Big Data

(defn get-shade [trees roads]
  "subquery to join tree and road estimates, maximize for shade"
  (<- [?road_name ?geohash ?road_lat ?road_lng
       ?road_alt ?road_metric ?tree_metric]
      (roads ?road_name _ _ _
             ?albedo ?road_lat ?road_lng ?road_alt ?geohash
             ?traffic_count _ ?traffic_class _ _ _ _)
      (road-metric ?traffic_class ?traffic_count ?albedo :> ?road_metric)
      (trees _ _ _ _ _ _ _ ?avg_height ?tree_lat ?tree_lng ?tree_alt ?geohash)
      (read-string ?avg_height :> ?height)
      ;; limit to trees which are higher than people
      (> ?height 2.0)
      (tree-distance ?tree_lat ?tree_lng ?road_lat ?road_lng :> ?distance)
      ;; limit to trees within a one-block radius (not meters)
      (<= ?distance 25.0)
      (/ ?height ?distance :> ?tree_moment)
      (c/sum ?tree_moment :> ?sum_tree_moment)
      ;; magic number 200000.0 used to scale tree moment based on median
      (/ ?sum_tree_moment 200000.0 :> ?tree_metric)
      ))

modeling

We also filter these estimators, based on a few magic numbers obtained during analysis in R

Page 76: Functional programming for optimization problems in Big Data

[Conceptual flow diagram, shade: the tree and road branches feed Filter height, Estimate traffic, Join, Calculate distance, Filter distance, Sum moment, Filter sum_moment, producing shade; M/R mark map and reduce stages]

(flow diagram, shade)

modeling

A conceptual flow diagram, showing the DAG for the join of road + tree => estimators for shade

Page 77: Functional programming for optimization problems in Big Data

(defn get-gps [gps_logs trap]
  "subquery to aggregate and rank GPS tracks per user"
  (<- [?uuid ?geohash ?gps_count ?recent_visit]
      (gps_logs ?date ?uuid ?gps_lat ?gps_lng ?alt
                ?speed ?heading ?elapsed ?distance)
      (read-string ?gps_lat :> ?lat)
      (read-string ?gps_lng :> ?lng)
      (geohash ?lat ?lng :> ?geohash)
      (c/count :> ?gps_count)
      (date-num ?date :> ?visit)
      (c/max ?visit :> ?recent_visit)
      ))

modeling

?uuid                             ?geohash  ?gps_count  ?recent_visit
cf660e041e994929b37cc5645209c8ae  9q8yym    7           1972376866448
342ac6fd3f5f44c6b97724d618d587cf  9q9htz    4           1972376690969
32cc09e69bc042f1ad22fc16ee275e21  9q9hv3    3           1972376670935
342ac6fd3f5f44c6b97724d618d587cf  9q9hv3    3           1972376691356
342ac6fd3f5f44c6b97724d618d587cf  9q9hwn    13          1972376690782
342ac6fd3f5f44c6b97724d618d587cf  9q9hwp    58          1972376690965
482dc171ef0342b79134d77de0f31c4f  9q9jh0    15          1972376952532
b1b4d653f5d9468a8dd18a77edcc5143  9q9jh0    18          1972376945348

Here’s a Cascalog function to aggregate GPS tracks per user

in other words, behavioral targeting; this shows aggregation in Cascalog -- the subtle but hard part

now we have a data product about walkable road segments in Palo Alto

Page 78: Functional programming for optimization problems in Big Data

Recommenders often combine multiple signals, via weighted averages, to rank personalized results:

• GPS of person ∩ road segment

• frequency and recency of visit

• traffic class and rate

• road albedo (sunlight reflection)

• tree shade estimator

Adjusting the mix allows for further personalization at the end use

modeling

(defn get-reco [tracks shades]
  "subquery to recommend road segments based on GPS tracks"
  (<- [?uuid ?road ?geohash ?lat ?lng ?alt
       ?gps_count ?recent_visit ?road_metric ?tree_metric]
      (tracks ?uuid ?geohash ?gps_count ?recent_visit)
      (shades ?road ?geohash ?lat ?lng ?alt ?road_metric ?tree_metric)
      ))

One approach to building commercial recommender systems is to take a vector of different preference metrics, combine them into a single sortable value, then rank the results before making personalized suggestions.

The resulting data in the "reco" output set produces exactly that. “tracks” represents behavioral targeting, while “shades” represents our inventory

Overall, the resulting app needs to enable feedback loops involving customers, their GPS tracks, their profile settings, etc.
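A minimal Scala sketch of the “weighted averages of multiple signals” ranking described above (the field names and weights are illustrative assumptions; adjusting the weights is the personalization knob mentioned on the slide):

case class Reco(road: String, gpsCount: Double, recency: Double,
                roadMetric: Double, treeMetric: Double)

// Combine the signals from the "reco" output into one sortable score.
def score(r: Reco,
          wVisits: Double = 0.2, wRecency: Double = 0.2,
          wTraffic: Double = 0.3, wShade: Double = 0.3): Double =
  wVisits * r.gpsCount + wRecency * r.recency +
  wTraffic * r.roadMetric + wShade * r.treeMetric

// Rank road segments, best first, before returning personalized suggestions.
def rank(recos: Seq[Reco]): Seq[Reco] = recos.sortBy(r => -score(r))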

Page 79: Functional programming for optimization problems in Big Data

‣ addr: 115 HAWTHORNE AVE
‣ lat/lng: 37.446, -122.168
‣ geohash: 9q9jh0
‣ tree: 413 site 2
‣ species: Liquidambar styraciflua
‣ est. height: 23 m
‣ shade metric: 4.363
‣ traffic: local residential, light traffic
‣ recent visit: 1972376952532
‣ a short walk from my train stop ✔

apps

One of the top recommendations for me is about two blocks from my train stop, where a couple of really big American Sweetgum trees provide ample shade on a residential street with not much traffic

Page 80: Functional programming for optimization problems in Big Data

Enterprise Data Workflows with Cascading

O’Reilly, 2013
amazon.com/dp/1449358721

references…

Some of this material comes from an upcoming O’Reilly book: “Enterprise Data Workflows with Cascading”. Should be in Rough Cuts soon - scheduled to be out in print this June.

Page 81: Functional programming for optimization problems in Big Data

blog, dev community, code/wiki/gists, maven repo, commercial products, career opportunities:

cascading.org

zest.to/group11

github.com/Cascading

conjars.org

goo.gl/KQtUL

concurrentinc.com

join us for very interesting work!

drill-down…

Copyright © 2013, Concurrent, Inc.

Links to our open source projects, developer community, etc…

contact me @pacoid
http://concurrentinc.com/
(we're hiring too!)