Paco Nathan, Concurrent, Inc., San Francisco, CA – @pacoid
“Functional programming for optimization problems in Big Data”
Copyright ©2013, Concurrent, Inc.
The Workflow Abstraction
[flow diagram: Document Collection → Tokenize (Regex token) → Scrub token → HashJoin Left against Stop Word List (RHS, Regex token) → GroupBy token → Count → Word Count; map (M) and reduce (R) phases marked]
1. Data Science
2. Functional Programming
3. Workflow Abstraction
4. Typical Use Cases
5. Open Data Example
Let's consider a trendline subsequent to the 1997 Q3 inflection point, which enabled huge ecommerce successes and commercialized Big Data. Where did Big Data come from, and where is this kind of work headed?
Q3 1997: inflection point
Four independent teams were working toward horizontal scale-out of workflows based on commodity hardware.
This effort prepared the way for huge Internet successes in the 1997 holiday season… AMZN, EBAY, Inktomi (YHOO Search), then GOOG
MapReduce and the Apache Hadoop open source stack emerged from this.
Q3 1997: Greg Linden, et al., @ Amazon; Randy Shoup, et al., @ eBay -- independent teams arrived at the same conclusion: parallelize workloads onto clusters of commodity servers to scale out horizontally. Google and Inktomi (YHOO Search) were working along the same lines.
[architecture diagram: Customers → transactions → Web App → RDBMS → SQL query result sets → BI Analysts → Excel pivot tables, PowerPoint slide decks → Stakeholder → Product strategy → Engineering requirements → optimized code]
Circa 1996: pre-inflection point
Perl and C++ for CGI :) Feedback loops shown in red represent data innovations at the time… these are rather static.
Characterized by slow, manual processes: data modeling / business intelligence; "throw it over the wall"… this thinking led to impossible silos
[architecture diagram: customer transactions → Web Apps (servlets, middleware, models) → RDBMS → DW, ETL → SQL query result sets; Logs → event history → aggregation → dashboards; Algorithmic Modeling → recommenders + classifiers; feedback loops among Stakeholder, Product, Engineering, UX, Customers]
Circa 2001: after the big ecommerce successes
Machine data (unstructured logs) captured social interactions. Data from aggregated logs fed into algorithmic modeling to produce recommenders, classifiers, and other predictive models -- e.g., ad networks automating parts of the marketing funnel, as in our case study.
LinkedIn, Facebook, Twitter, Apple, etc., followed early successes. Algorithmic modeling, leveraging machine data, allowed for Big Data to become monetized.
[architecture diagram, use cases across topologies: Web Apps, Mobile, etc. → transactions, content, social interactions → Log Events → In-Memory Data Grid (near time) and Hadoop, etc. (batch) → workflows → RDBMS, DW, Data Products → Customers; app history feeds a Planner and Cluster Scheduler for optimized capacity taps; s/w dev and data science (discovery + modeling) sit alongside Ops dashboard metrics and business process; roles: Domain Expert, Data Scientist, App Dev, Ops – an introduced capability next to the existing SDLC]
Circa 2013: clusters everywhere
Here's what our more savvy customers are using for architecture and process today: traditional SDLC, but also Data Science inter-disciplinary teams. Also, machine data (app history) driving planners and schedulers for advanced multi-tenant cluster computing fabric.
Not unlike a practice at LLL, where much more data gets collected about the machine than about the experiment.
We see this feeding into cluster optimization in YARN, Apache Mesos, etc.
by Leo Breiman
"Statistical Modeling: The Two Cultures"
Statistical Science, 2001
bit.ly/eUTh9L
references…
Leo Breiman wrote an excellent paper in 2001, "Two Cultures", chronicling this evolution and the sea change from data modeling (silos, manual process) to algorithmic modeling (machine data for automation/optimization).
Amazon
"Early Amazon: Splitting the website" – Greg Linden
glinden.blogspot.com/2006/02/early-amazon-splitting-website.html

eBay
"The eBay Architecture" – Randy Shoup, Dan Pritchett
addsimplicity.com/adding_simplicity_an_engi/2006/11/you_scaled_your.html
addsimplicity.com.nyud.net:8080/downloads/eBaySDForum2006-11-29.pdf

Inktomi (YHOO Search)
"Inktomi's Wild Ride" – Eric Brewer (0:05:31 ff)
youtube.com/watch?v=E91oEn1bnXM

Google
"Underneath the Covers at Google" – Jeff Dean (0:06:54 ff)
youtube.com/watch?v=qsan-GQaeyk
perspectives.mvdirona.com/2008/06/11/JeffDeanOnGoogleInfrastructure.aspx
references…
In their own words…
core values
Data Science teams develop actionable insights, building confidence for decisions
that work may influence a few decisions worth billions (e.g., M&A) or billions of small decisions (e.g., AdWords)
probably somewhere in-between… solving for pattern, at scale.
by definition, this is a multi-disciplinary pursuit which requires teams, not sole players
team process = needs
• discovery – help people ask the right questions
• modeling – allow automation to place informed bets
• integration – deliver products at scale to customers
• apps – build smarts into product features
• systems – keep infrastructure running, cost-effective
Gephi
team composition = roles
• Domain Expert – business process, stakeholder
• Data Scientist – data prep, discovery, modeling, etc.
• App Dev – software engineering, automation
• Ops – systems engineering, access
[diagram: Domain Expert, Data Scientist, App Dev, Ops overlapping as the introduced capability of a data science team]
This is an example of multi-disciplinary team composition for data science, while other emerging problem spaces will require other, more specific kinds of team roles.
matrix: evaluate needs × roles
[matrix: roles (stakeholder, scientist, developer, ops) × needs (discovery, modeling, integration, apps, systems) – each need spans the adjacent roles that share it]
most valuable skills
approximately 80% of the costs for data-related projects get spent on data preparation – mostly on cleaning up data quality issues: ETL, log file analysis, etc.
unfortunately, data-related budgets for many companies tend to go into frameworks which can only be used after clean up
most valuable skills:
‣ learn to use programmable tools that prepare data
‣ learn to generate compelling data visualizations
‣ learn to estimate the confidence for reported results
‣ learn to automate work, making analysis repeatable
the rest of the skills – modeling, algorithms, etc. – those are secondary
D3
in a nutshell, what we do…
‣ estimate probability
‣ calculate analytic variance
‣ manipulate order complexity
‣ leverage use of learning theory
+ collab with DevOps, Stakeholders
+ reduce work to cron entries
Unique Registration
Launched games lobby
NUI:TutorialMode
Birthday Message
Chat PublicRoom voice
Launched heyzap game
ConnectivityTest: test suite started
Create New Pet
Movie View Started: client, community
NUI:MovieMode
Buy an Item: web
Put on Clothing
Address space remaining: 512M
Customer Made Purchase Cart Page Step 2
Feed Pet
Play Pet
Chat Now
Edit Panel
Client Inventory Panel Flip Product Over
Add Friend
Open 3D Window
Change Seat
Type a Bubble
Visit Own Homepage
Take a Snapshot
NUI:BuyCreditsMode
NUI:MyProfileClicked
Address space remaining: 1G
Leave a Message
NUI:ChatMode
NUI:FriendsMode
Website Login
Add Buddy
NUI:PublicRoomMode
NUI:MyRoomMode
Client Inventory Panel Remove Product
Client Inventory Panel Apply Product
NUI:DressUpMode
science in data science?
references…
by DJ Patil
Data Jujitsu – O'Reilly, 2012 – amazon.com/dp/B008HMN5BE
Building Data Science Teams – O'Reilly, 2011 – amazon.com/dp/B005O4U3ZE
The Workflow Abstraction
1. Data Science
2. Functional Programming
3. Workflow Abstraction
4. Typical Use Cases
5. Open Data Example
Origin and overview of the Cascading API as a workflow abstraction for Enterprise Big Data apps.
Cascading – origins
API author Chris Wensel worked as a system architect at an Enterprise firm well-known for several popular data products.
Wensel was following the Nutch open source project – before Hadoop even had a name.
He noted that it would become difficult to find Java developers to write complex Enterprise apps directly in Apache Hadoop – a potential blocker for leveraging this new open source technology.
Cascading initially grew from interaction with the Nutch project, before Hadoop had a name.
API author Chris Wensel recognized that MapReduce would be too complex for J2EE developers to perform substantial work in an Enterprise context without an abstraction layer.
Cascading – functional programming
Key insight: MapReduce is based on functional programming – dating back to LISP in the 1970s. Apache Hadoop use cases are mostly about data pipelines, which are functional in nature.
To ease staffing problems as “Main Street” Enterprise firms began to embrace Hadoop, Cascading was introduced in late 2007, as a new Java API to implement functional programming for large-scale data workflows.
Years later, Enterprise app deployments on Hadoop are limited by staffing issues: difficulty of retraining staff, scarcity of Hadoop experts.
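Before moving on, here is a minimal sketch of the key insight above – Word Count expressed as map and reduce higher-order functions over plain Scala collections. The object and names are purely illustrative (not any Cascading API), but the shape is exactly what MapReduce distributes across a cluster:

object WordCountLocal {
  // "map" phase: emit a (token, 1) pair for each token in each document
  def mapPhase(docs: Seq[String]): Seq[(String, Int)] =
    docs.flatMap(_.toLowerCase.split("[ \\[\\]\\(\\),.]+"))
        .filter(_.nonEmpty)
        .map(token => (token, 1))

  // "reduce" phase: group the pairs by token, then fold the counts
  def reducePhase(pairs: Seq[(String, Int)]): Map[String, Int] =
    pairs.groupBy(_._1)
         .map { case (token, group) => (token, group.map(_._2).sum) }

  def main(args: Array[String]): Unit = {
    val docs = Seq("A rain shadow is a dry area", "on the lee side of a mountain")
    reducePhase(mapPhase(docs)).foreach(println)
  }
}

A stateless flatMap followed by a grouped fold is the functional core; everything Hadoop adds is distribution, fault tolerance, and I/O.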
examples…
• Twitter, Etsy, eBay, YieldBot, uSwitch, etc., have invested in functional programming open source projects atop Cascading – used for their large-scale production deployments
• new case studies for Cascading apps are mostly based on domain-specific languages (DSLs) in JVM languages which emphasize functional programming:
Cascalog in Clojure (2010) – github.com/nathanmarz/cascalog/wiki
Scalding in Scala (2012) – github.com/twitter/scalding/wiki
Many case studies, many Enterprise production deployments now for 5+ years.
void map (String doc_id, String text):
  for each word w in segment(text):
    emit(w, "1");

void reduce (String word, Iterator group):
  int count = 0;
  for each pc in group:
    count += Int(pc);
  emit(word, String(count));
The Ubiquitous Word Count
Definition: count how often each word appears in a collection of text documents
This simple program provides an excellent test case for parallel processing, since it:
• requires a minimal amount of code
• demonstrates use of both symbolic and numeric values
• shows a dependency graph of tuples as an abstraction
• is not many steps away from useful search indexing
• serves as a “Hello World” for Hadoop apps
Any distributed computing framework which can run Word Count efficiently in parallel at scale can handle much larger and more interesting compute problems.
Taking a wild guess, most people who've written any MapReduce code have seen this example app already...
[flow diagram: Document Collection → Tokenize → GroupBy token → Count → Word Count; map (M) and reduce (R) phases marked]
1 map, 1 reduce, 18 lines of code: gist.github.com/3900702
word count – conceptual flow diagram
cascading.org/category/impatient
Based on a Cascading implementation of Word Count, this is a conceptual flow diagram: the pattern language in use to specify the business process, using a literate programming methodology to describe a data workflow.
word count – Cascading app in Java
String docPath = args[ 0 ];
String wcPath = args[ 1 ];
Properties properties = new Properties();
AppProps.setApplicationJarClass( properties, Main.class );
HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

// create source and sink taps
Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

// specify a regex to split "document" text lines into a token stream
Fields token = new Fields( "token" );
Fields text = new Fields( "text" );
RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
// only returns "token"
Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );

// determine the word counts
Pipe wcPipe = new Pipe( "wc", docPipe );
wcPipe = new GroupBy( wcPipe, token );
wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

// connect the taps, pipes, etc., into a flow
FlowDef flowDef = FlowDef.flowDef()
  .setName( "wc" )
  .addSource( docPipe, docTap )
  .addTailSink( wcPipe, wcTap );

// write a DOT file and run the flow
Flow wcFlow = flowConnector.connect( flowDef );
wcFlow.writeDOT( "dot/wc.dot" );
wcFlow.complete();
Based on a Cascading implementation of Word Count, here is sample code -- approx 1/3 the code size of the Word Count example from Apache Hadoop. The 2nd-to-last line generates a DOT file for the flow diagram.
word count – generated flow diagram

[generated flow plan (DOT): [head] → Hfs['TextDelimited[['doc_id', 'text']]']['data/rain.txt'] → map: Each('token')[RegexSplitGenerator[decl:'token']] → reduce: GroupBy('wc')[by:['token']] → Every('wc')[Count[decl:'count']] → Hfs['TextDelimited[['token', 'count']]']['output/wc'] → [tail]]
As a concrete example of literate programming in Cascading, here is the DOT representation of the flow plan -- generated by the app itself.
(ns impatient.core
  (:use [cascalog.api]
        [cascalog.more-taps :only (hfs-delimited)])
  (:require [clojure.string :as s]
            [cascalog.ops :as c])
  (:gen-class))

(defmapcatop split [line]
  "reads in a line of string and splits it by regex"
  (s/split line #"[\[\]\\\(\),.)\s]+"))

(defn -main [in out & args]
  (?<- (hfs-delimited out)
       [?word ?count]
       ((hfs-delimited in :skip-header? true) _ ?line)
       (split ?line :> ?word)
       (c/count ?count)))

; Paul Lam
; github.com/Quantisan/Impatient
word count – Cascalog / Clojure
Here is the same Word Count app written in Clojure, using Cascalog.
github.com/nathanmarz/cascalog/wiki
• implements Datalog in Clojure, with predicates backed by Cascading – for a highly declarative language
• run ad-hoc queries from the Clojure REPL – approx. 10:1 code reduction compared with SQL
• composable subqueries, used for test-driven development (TDD) practices at scale
• Leiningen build: simple, no surprises, in Clojure itself
• more new deployments than other Cascading DSLs – Climate Corp is largest use case: 90% Clojure/Cascalog
• has a learning curve, limited number of Clojure developers
• aggregators are the magic, and those take effort to learn
word count – Cascalog / Clojure
From what we see about language features, customer case studies, and best practices in general -- Cascalog represents some of the most sophisticated uses of Cascading, as well as some of the largest deployments. Great for large-scale, complex apps, where small teams must limit the complexities in their process.
import com.twitter.scalding._

class WordCount(args : Args) extends Job(args) {
  Tsv(args("doc"), ('doc_id, 'text), skipHeader = true)
    .read
    .flatMap('text -> 'token) { text : String => text.split("[ \\[\\]\\(\\),.]") }
    .groupBy('token) { _.size('count) }
    .write(Tsv(args("wc"), writeHeader = true))
}
word count – Scalding / Scala
Here is the same Word Count app written in Scala, using Scalding. Very compact, easy to understand; however, also more imperative than Cascalog.
github.com/twitter/scalding/wiki
• extends the Scala collections API so that distributed lists become “pipes” backed by Cascading
• code is compact, easy to understand
• nearly 1:1 between elements of conceptual flow diagram and function calls
• extensive libraries are available for linear algebra, abstract algebra, machine learning – e.g., Matrix API, Algebird, etc.
• significant investments by Twitter, Etsy, eBay, etc.
• great for data services at scale (imagine SOA infra @ Google as an open source project)
• less learning curve than Cascalog, not as much of a high-level language
word count – Scalding / Scala
If you wanted to see what a data services architecture for machine learning work at, say, Google scale would look like as an open source project -- that's Scalding. That's what they're doing.
Cascalog and Scalding DSLs leverage the functional aspects of MapReduce, helping to limit complexity in process
Arguably, using a functional programming language to build flows is better than trying to represent functional programming constructs within Java…
The Workflow Abstraction
1. Data Science
2. Functional Programming
3. Workflow Abstraction
4. Typical Use Cases
5. Open Data Example
CS theory related to data workflow abstraction, to manage complexity.
Cascading workflows – pattern language
Cascading uses a “plumbing” metaphor in the Java API, to define workflows out of familiar elements: Pipes, Taps, Tuple Flows, Filters, Joins, Traps, etc.
Data is represented as flows of tuples. Operations within the tuple flows bring functional programming aspects into Java apps.
In formal terms, this provides a pattern language.
A pattern language, based on the metaphor of "plumbing".
references…
pattern language: a structured method for solving large, complex design problems, where the syntax of the language promotes the use of best practices.
amazon.com/dp/0195019199
design patterns: the notion originated in consensus negotiation for architecture, later applied in OOP software engineering by “Gang of Four”.
amazon.com/dp/0201633612
Christopher Alexander originated the use of pattern language in a project called "The Oregon Experiment", in the 1970s.
Cascading workflows – literate programming
Cascading workflows generate their own visual documentation: flow diagrams
In formal terms, flow diagrams leverage a methodology called literate programming
Provides intuitive, visual representations for apps, great for cross-team collaboration.
Formally speaking, the pattern language in Cascading gets leveraged as a visual representation used for literate programming. Several good examples exist, but the phenomenon of different developers troubleshooting a program together over the "cascading-users" email list is most telling -- expert developers generally ask a novice to provide a flow diagram first.
references…
by Don Knuth
Literate Programming
Univ of Chicago Press, 1992
literateprogramming.com/
“Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want a computer to do.”
Don Knuth originated the notion of literate programming, or code as "literature" which explains itself.
examples…
• Scalding apps have nearly 1:1 correspondence between function calls and the elements in their flow diagrams – excellent elision and literate representation
• noticed on cascading-users email list: when troubleshooting issues, Cascading experts ask novices to provide an app’s flow diagram (generated as a DOT file), sometimes in lieu of showing code
In formal terms, a flow diagram is a directed, acyclic graph (DAG) on which lots of interesting math applies for query optimization, predictive models about app execution, parallel efficiency metrics, etc. (see the sketch below)
Literate programming examples observed on the email list are some of the best illustrations of this methodology.
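As one small illustration of the math available on such DAGs, here is a sketch (plain Scala; the graph literal and step names are hypothetical, not Cascading's planner API) that topologically orders the steps of a word count flow – the same ordering problem a flow planner solves before scheduling parallel jobs:

object FlowDag {
  // hypothetical word count flow, as adjacency lists: step -> downstream steps
  val edges: Map[String, List[String]] = Map(
    "source"   -> List("tokenize"),
    "tokenize" -> List("groupby"),
    "groupby"  -> List("count"),
    "count"    -> List("sink"),
    "sink"     -> Nil
  )

  // depth-first topological sort; assumes the graph is acyclic (a DAG)
  def topoSort(graph: Map[String, List[String]]): List[String] = {
    var visited = Set.empty[String]
    var order   = List.empty[String]
    def visit(node: String): Unit =
      if (!visited(node)) {
        visited += node
        graph.getOrElse(node, Nil).foreach(visit)
        order = node :: order  // prepend after visiting children
      }
    graph.keys.foreach(visit)
    order
  }

  def main(args: Array[String]): Unit =
    println(topoSort(edges).mkString(" -> "))  // source -> tokenize -> groupby -> count -> sink
}

The same traversal underpins query optimization and parallel-efficiency analysis on real flow graphs.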
Cascading workflows – business process
Following the essence of literate programming, Cascading workflows provide statements of business process
This recalls a sense of business process management for Enterprise apps (think BPM/BPEL for Big Data)
As a separation of concerns between business process and implementation details (Hadoop, etc.)
This is especially apparent in large-scale Cascalog apps:
“Specify what you require, not how to achieve it.”
By virtue of the pattern language, the flow planner used in a Cascading app determines how to translate business process into efficient, parallel jobs at scale.
Business Stakeholder POV: business process management for workflow orchestration (think BPM/BPEL).
references…
by Edgar Codd
"A relational model of data for large shared data banks"
Communications of the ACM, 1970
dl.acm.org/citation.cfm?id=362685
Rather than arguing between SQL vs. NoSQL…structured vs. unstructured data frameworks… this approach focuses on:
the process of structuring data
That’s what apps do – Making Data Work
Focus on *the process of structuring data*, which must happen before the large-scale joins, predictive models, visualizations, etc. Just because your data is loaded into a "structured" store, that does not imply that your app has finished structuring it for the purpose of making data work. BTW, did anybody notice that the O'Reilly "animal" for the Cascading book is an Atlantic Cod? (pun intended)
Cascading workflows – functional relational programming
The combination of functional programming, pattern language, DSLs, literate programming, business process, etc., traces back to the original definition of the relational model (Codd, 1970) prior to SQL.
Cascalog, in particular, implements more of what Codd intended for a “data sublanguage” and is considered to be close to a full implementation of the functional relational programming paradigm defined in:
Moseley & Marks, 2006
"Out of the Tar Pit"
goo.gl/SKspn
A more contemporary statement along similar lines...
Two Avenues…
[chart: complexity ➞ vs. scale ➞]
Enterprise: must contend with complexity at scale everyday…
incumbents extend current practices and infrastructure investments – using J2EE, ANSI SQL, SAS, etc. – to migrate workflows onto Apache Hadoop while leveraging existing staff
Start-ups: crave complexity and scale to become viable…
new ventures move into Enterprise space to compete using relatively lean staff, while leveraging sophisticated engineering practices, e.g., Cascalog and Scalding
Enterprise data workflows are observed in two modes: start-ups approaching complexity and incumbent firms grappling with complexity.
Cascading workflows – functional relational programming
several theoretical aspects converge into software engineering practices which mitigate the complexity of building and maintaining Enterprise data workflows
The Workflow Abstraction
1. Data Science
2. Functional Programming
3. Workflow Abstraction
4. Typical Use Cases
5. Open Data Example
Here are a few use cases to consider, for Enterprise data workflows.
Cascading – deployments
• 5+ year history of Enterprise production deployments, ASL 2 license, GitHub src, http://conjars.org
• partners: Amazon AWS, Microsoft Azure, Hortonworks, MapR, EMC, SpringSource, Cloudera
• case studies: Climate Corp, Twitter, Etsy, Williams-Sonoma, uSwitch, Airbnb, Nokia, YieldBot, Square, Harvard, etc.
• use cases: ETL, marketing funnel, anti-fraud, social media, retail pricing, search analytics, recommenders, eCRM, utility grids, genomics, climatology, etc.
Several published case studies about Cascading, Cascalog, Scalding, etc. Wide range of use cases. Significant investment by Twitter, Etsy, and other firms for OSS based on Cascading. Partnerships with the various Hadoop distro vendors, cloud providers, etc.
Finance: Ecommerce Risk
Problem:
<1% chargeback rate allowed by Visa, others follow
• may leverage CAPTURE/AUTH wait period
• Cybersource, Vindicia, others haven’t stopped fraud
>15% chargeback rate common for mobile in US:
• not much info shared with merchant
• carrier as judge/jury/executioner; customer assumed correct
most common: professional fraud (identity theft, etc.)
• patterns of attack change all the time
• widespread use of IP proxies, to mask location
• global market for stolen credit card info
other common case is friendly fraud
• teenager billing to parent’s cell phone
KPI:
chargeback rate (CB)
• ground truth for how much fraud the bank/carrier claims
• 7-120 day latencies from the bank
false positive rate (FP)
• estimated cost: predicts customer support issues
• complaints due to incorrect fraud scores on valid orders (or lies)
false negative rate (FN)
• estimated risk: how much fraud may pass undetected in future orders
• changes with new product features/services/inventory/marketing
Finance: Ecommerce Risk
Data Science Issues:
• chargeback limits imply few training cases
• sparse data implies lots of missing values – must impute
• long latency on chargebacks – “good” flips to “bad”
• most detection occurs within large-scale batch, decisions required during real-time event processing
• not just one pattern to detect – many, ever-changing
• many unknowns: blocked orders scare off professional fraud, inferences cannot be confirmed
• cannot simply use raw data as input – requires lots of data preparation and statistical modeling
• each ecommerce firm has shopping/policy nuances which get exploited differently – hard to generalize solutions
Finance: Ecommerce Risk
Predictive Analytics:
batch
• cluster/segment customers for expected behaviors
• adjust for seasonal variation
• geospatial indexing / bayesian point estimates (fraud by lat/lng)
• impute missing values (“guesses” to fill-in sparse data)
• run anti-fraud classifier (customer 360)
real-time
• exponential smoothing (estimators for velocity)
• calculate running medians (anomaly detection)
• run anti-fraud classifier (per order)
Finance: Ecommerce Risk
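As a rough sketch of the two real-time estimators named above (plain Scala; the smoothing factor, window size, and alert threshold are illustrative assumptions, not values from any production system):

object RealTimeEstimators {
  // exponential smoothing: a cheap velocity estimator over an event stream
  def smooth(values: Seq[Double], alpha: Double = 0.3): Double =
    values.reduceLeft((est, x) => alpha * x + (1 - alpha) * est)

  // running median over a sliding window: a robust baseline for anomaly detection
  def runningMedian(values: Seq[Double], window: Int = 5): Seq[Double] =
    values.sliding(window).map { w =>
      val s = w.sorted
      if (s.length % 2 == 1) s(s.length / 2)
      else (s(s.length / 2 - 1) + s(s.length / 2)) / 2.0
    }.toSeq

  def main(args: Array[String]): Unit = {
    // orders per minute for one account, with a suspicious burst at the end
    val rate = Seq(4.0, 5.0, 4.0, 6.0, 5.0, 5.0, 40.0)
    val baseline = runningMedian(rate).last
    println(s"smoothed velocity: ${smooth(rate)}")
    if (rate.last > 5.0 * baseline)  // illustrative threshold
      println(s"anomaly: ${rate.last} vs median baseline $baseline")
  }
}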
1. Data Preparation (batch)
‣ ETL from bank, log sessionization, customer profiles, etc.
- large-scale joins of customers + orders
‣ apply time window
- too long: patterns lose currency
- too short: not enough wait for chargebacks
‣ segment customers
- temporary fraud (identity theft which has been resolved)
- confirmed fraud (chargebacks from the bank)
- estimated fraud (blocked/banned by Customer Support)
- valid orders (but different clusters of expected behavior)
‣ subsample to rebalance data
- produce training set + test holdout
- adjust balance for FP/FN bias (company risk profile)
Finance: Ecommerce Risk
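A sketch of the rebalancing step above (plain Scala; the 1:1 class ratio, 80/20 split, and fixed seed are illustrative choices – the real balance gets adjusted to the company's FP/FN risk profile):

import scala.util.Random

object Rebalance {
  // downsample the majority (valid) class so the rare fraud cases carry weight,
  // then split into a training set plus a test holdout
  def rebalanceAndSplit[T](fraud: Seq[T], valid: Seq[T], ratio: Double = 1.0,
                           holdout: Double = 0.2, seed: Long = 42L): (Seq[T], Seq[T]) = {
    val rng      = new Random(seed)
    val keep     = math.min(valid.length, (fraud.length * ratio).toInt)
    val balanced = rng.shuffle(fraud ++ rng.shuffle(valid).take(keep))
    val cut      = (balanced.length * (1.0 - holdout)).toInt
    (balanced.take(cut), balanced.drop(cut))
  }

  def main(args: Array[String]): Unit = {
    val fraud = Seq.fill(30)("fraud")
    val valid = Seq.fill(10000)("valid")
    val (train, test) = rebalanceAndSplit(fraud, valid)
    val fraudInTrain = train.count(_ == "fraud")
    println(s"train=${train.size} test=${test.size} fraudInTrain=$fraudInTrain")
  }
}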
2. Model Creation (analyst)
‣ distinguish between different IV data types
- continuous (e.g., age)
- boolean (e.g., paid lead)
- categorical (e.g., gender)
- computed (e.g., geo risk, velocities)
‣ use geospatial smoothing for lat/lng
‣ determine distributions for IV
‣ adjust IV for seasonal variation, where appropriate
‣ impute missing values based on density functions / medians
‣ factor analysis: determine which IV to keep (too many creates problems)
‣ train model: random forest (RF) classifiers predict likely fraud
‣ calculate the confusion matrix (TP/FP/TN/FN)
Finance: Ecommerce Risk
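A minimal sketch of that last step (plain Scala; labels are booleans where true = fraud, and the toy data is invented), computing the confusion matrix plus the FP/FN rates tracked as KPI:

object ConfusionMatrix {
  case class Counts(tp: Int, fp: Int, tn: Int, fn: Int) {
    def fpRate: Double = fp.toDouble / (fp + tn)  // valid orders flagged as fraud
    def fnRate: Double = fn.toDouble / (fn + tp)  // fraud that slipped through
  }

  def confusion(actual: Seq[Boolean], predicted: Seq[Boolean]): Counts = {
    val pairs = actual.zip(predicted)
    Counts(
      tp = pairs.count { case (a, p) => a && p },
      fp = pairs.count { case (a, p) => !a && p },
      tn = pairs.count { case (a, p) => !a && !p },
      fn = pairs.count { case (a, p) => a && !p }
    )
  }

  def main(args: Array[String]): Unit = {
    val actual    = Seq(true, false, false, true, false, true)
    val predicted = Seq(true, false, true, false, false, true)
    val c = confusion(actual, predicted)
    println(f"$c FP rate=${c.fpRate}%.2f FN rate=${c.fnRate}%.2f")
  }
}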
3. Test Model (analyst/batch loop)
‣ calculate estimated fraud rates
‣ identify potential found fraud cases
‣ report to Customer Support for review
‣ generate risk vs. benefit curves
‣ visualize estimated impact of new model
4. Decision (stakeholder)
‣ decide risk vs. benefit (minimize fraud + customer support costs)
‣ coordinate with bank/carrier if there are current issues
‣ determine go/no-go, when to deploy in production, size of rollout
Finance: Ecommerce Risk
5. Production Deployment (near-time)
‣ run model on in-memory grid / transaction processing
‣ A/B test to verify model in production (progressive rollout)
‣ detect anomalies
- use running medians on continuous IVs
- use exponential smoothing on computed IVs (velocities)
- trigger notifications
‣ monitor KPI and other metrics in dashboards
Finance: Ecommerce Risk
Cascading apps
[workflow diagram: customer transactions, partner data (DW), and chargebacks, etc. flow through ETL into a Customer DB; data prep on an analyst's laptop yields training data sets; batch workloads (Hadoop) segment customers, detect fraudsters, and predict model costs; PMML models feed two risk classifiers (dimensions: per-order and customer 360); real-time workloads (IMDG) score new orders, compute velocity metrics, and run anomaly detection]
Finance: Ecommerce Risk
Ecommerce: Marketing Funnel
Problem:
• must optimize large ad spend budget
• different vendors report different kinds of metrics
• some campaigns are much smaller than others
• seasonal variation distorts performance
• inherent latency in spend vs. effect
• ads channels cannot scale up immediately
• must “scrub” leads to dispute payments/refunds
• hard to predict ROI for incremental ad spend
• many issues of diminishing returns in general
KPI:
cost per paying user (CPP)
• must align metrics for different ad channels
• generally need to estimate to end-of-month
customer lifetime value (LTV)
• big differences based on geographic region, age, gender, etc.
• assumes that new customers behave like previous customers
return on investment (ROI)
• relationship between CPP and LTV
• adjust to invest in marketing (>CPP) vs. extract profit (>LTV)
other metrics
• reach: how many people get a brand message
• customer satisfaction: would recommend to a friend, etc.
Ecommerce: Marketing Funnel
Predictive Analytics:
batch
• log aggregation, followed with cohort analysis
• bayesian point estimates compare different-sized ad tests (see the sketch after this slide)
• time series analysis normalizes for seasonal variation
• geolocation adjusts for regional cost/benefit
• customer lifetime value estimates ROI of new leads
• linear programming models estimate elasticity of demand
real-time
• determine whether this is actually a new customer…
• new: modify initial UX based on ad channel, region, friends, etc.
• old: recommend products/services/friends based on behaviors
• adjust spend on poorly performing channels
• track back to top referring sites/partners
Ecommerce: Marketing Funnel
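As one hedged illustration of those Bayesian point estimates (plain Scala; the uniform Beta(1,1) prior is an assumption, chosen only to show the shrinkage effect when comparing different-sized campaigns):

object AdTestEstimate {
  // posterior mean of a Beta-Binomial model for conversion rate:
  // (alpha + conversions) / (alpha + beta + impressions)
  def posteriorMean(conversions: Int, impressions: Int,
                    alpha: Double = 1.0, beta: Double = 1.0): Double =
    (alpha + conversions) / (alpha + beta + impressions)

  def main(args: Array[String]): Unit = {
    // small campaign: 3/40 converts (7.5% raw) but carries little evidence,
    // so the estimate gets pulled toward the prior mean
    println(posteriorMean(3, 40))       // ~ 0.095
    // large campaign: 600/10000 converts (6.0% raw), barely moved by the prior
    println(posteriorMean(600, 10000))  // ~ 0.060
  }
}

Comparing posteriors rather than raw rates keeps a lucky small test from beating a solid large one.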
Airlines
Problem:
• minimize schedule delays
• re-route around weather and airport conditions
• manage supplier channels and inventories to minimize AOG
KPI:
forecast future passenger demand
customer loyalty
aircraft on ground (AOG)
mean time between failures (MTBF)
Predictive Analytics:
batch
• predict “last mile” failures
• optimize capacity utilization
• operations research problem to optimize stocking / minimize fuel waste
• boost customer loyalty by adjusting incentives in frequent flyer programs
real-time
• forecast schedule delays
• monitor factors for travel conditions: weather, airports, etc.
Airlines
The Workflow Abstraction
1. Data Science
2. Functional Programming
3. Workflow Abstraction
4. Typical Use Cases
5. Open Data Example
Cascalog app for a mobile data API (recommender service) based on City of Palo Alto Open Data.
Palo Alto is quite a pleasant place
• temperate weather
• lots of parks, enormous trees
• great coffeehouses
• walkable downtown
• not particularly crowded
On a nice summer day, who wants to be stuck indoors on a phone call?
Instead, take it outside – go for a walk
An example open source project: github.com/Cascading/CoPA/wiki
Palo Alto is generally quite a pleasant place: the weather is temperate, there are lots of parks with enormous trees, most of downtown is quite walkable, and it's not particularly crowded. On a summer day in Palo Alto, one of the last things anybody really wants is to be stuck in an office on a long phone call. Instead people walk outside and take their calls, probably heading toward a favorite espresso bar or a frozen yogurt shop. On a hot summer day in Palo Alto, knowing a nice quiet route to walk in the shade would be great.
1. Open Data about municipal infrastructure(GIS data: trees, roads, parks)
✚
2. Big Data about where people like to walk(smartphone GPS logs)
✚
3. some curated metadata(which surfaces the value)
⇒ 4. personalized recommendations:
“Find a shady spot on a summer day in which to walk near downtown Palo Alto. While on a long conference call. Sipping a latte or enjoying some fro-yo.”
The City of Palo Alto recently began to support Open Data to give the local community greater visibility into how their city government operates
This effort is intended to encourage students, entrepreneurs, local organizations, etc., to build new apps which contribute to the public good
paloalto.opendata.junar.com/dashboards/7576/geographic-information/
discovery
GIS about trees in Palo Alto:
discovery
(trees map overlay)
Geographic_Information,,,
"Tree: 29 site 2 at 203 ADDISON AV, on ADDISON AV 44 from pl"," Private: -1 Tree ID: 29 Street_Name: ADDISON AV Situs Number: 203 Tree Site: 2 Species: Celtis australis Source: davey tree Protected: Designated: Heritage: Appraised Value: Hardscape: None Identifier: 40 Active Numeric: 1 Location Feature ID: 13872 Provisional: Install Date: ","37.4409634615283,-122.15648458861,0.0 ","Point""Wilkie Way from West Meadow Drive to Victoria Place"," Sequence: 20 Street_Name: Wilkie Way From Street PMMS: West Meadow Drive To Street PMMS: Victoria Place Street ID: 598 (Wilkie Wy, Palo Alto) From Street ID PMMS: 689 To Street ID PMMS: 567 Year Constructed: 1950 Traffic Count: 596 Traffic Index: residential local Traffic Class: local residential Traffic Date: 08/24/90 Paving Length: 208 Paving Width: 40 Paving Area: 8320 Surface Type: asphalt concrete Surface Thickness: 2.0 Base Type Pvmt: crusher run base Base Thickness: 6.0 Soil Class: 2 Soil Value: 15 Curb Type: Curb Thickness: Gutter Width: 36.0 Book: 22 Page: 1 District Number: 18 Land Use PMMS: 1 Overlay Year: 1990 Overlay Thickness: 1.5 Base Failure Year: 1990 Base Failure Thickness: 6 Surface Treatment Year: Surface Treatment Type: Alligator Severity: none Alligator Extent: 0 Block Severity: none Block Extent: 0 Longitude and Transverse Severity: none Longitude and Transverse Extent: 0 Ravelling Severity: none Ravelling Extent: 0 Ridability Severity: none Trench Severity: none Trench Extent: 0 Rutting Severity: none Rutting Extent: 0 Road Performance: UL (Urban Local) Bike Lane: 0 Bus Route: 0 Truck Route: 0 Remediation: Deduct Value: 100 Priority: Pavement Condition: excellent Street Cut Fee per SqFt: 10.00 Source Date: 6/10/2009 User Modified By: mnicols Identifier System: 21410 ","-122.1249640794,37.4155803115645,0.0 -122.124661859039,37.4154224594993,0.0 -122.124587720719,37.4153758330704,0.0 -122.12451895942,37.4153242300888,0.0 -122.124456098457,37.4152680432944,0.0 -122.124399616238,37.4152077003122,0.0 -122.124374937753,37.4151774433318,0.0 ","Line"
discovery
(unstructured data…)
Here's what we have to work with -- raw GIS export as CSV, with plenty o' errors too, for good measure. This illustrates a great example of "unstructured data". Alligator Severity!
(defn parse-gis [line]
  "leverages parse-csv for complex CSV format in GIS export"
  (first (csv/parse-csv line)))

(defn etl-gis [gis trap]
  "subquery to parse data sets from the GIS source tap"
  (<- [?blurb ?misc ?geo ?kind]
      (gis ?line)
      (parse-gis ?line :> ?blurb ?misc ?geo ?kind)
      (:trap (hfs-textline trap))))
discovery
(specify what you require, not how to achieve it…
80/20 rule of data prep cost)
Let's use Cascalog to begin our process of structuring that data. Since the GIS export is vaguely in CSV format, here's a simple way to clean up the data. Referring back to DJ Patil's "Data Jujitsu", that clean up usually accounts for 80% of project costs.
discovery
(ad-hoc queries get refined into composable predicates)
Identifier: 474
Tree ID: 412
Tree: 412 site 1 at 115 HAWTHORNE AV
Tree Site: 1
Street_Name: HAWTHORNE AV
Situs Number: 115
Private: -1
Species: Liquidambar styraciflua
Source: davey tree
Hardscape: None
37.446001565119,-122.167713417554,0.0
Point
First we load `lein repl` to get an interactive prompt for Clojure… bring Cascalog libraries into Clojure… define functions to use… and execute queries. Then we convert the queries into composable, logical predicates. Let's take a peek at the results... TSV output becomes more structured, while the "bad" data has been trapped into a data set for review.
discovery
(curate valuable metadata)
Since we can find species and geolocation for each tree, let's add some metadata to infer other valuable data results, e.g., tree height – based on Wikipedia.org, Calflora.org, USDA.gov, etc.
(defn get-trees [src trap tree_meta]
  "subquery to parse/filter the tree data"
  (<- [?blurb ?tree_id ?situs ?tree_site
       ?species ?wikipedia ?calflora ?avg_height
       ?tree_lat ?tree_lng ?tree_alt ?geohash]
      (src ?blurb ?misc ?geo ?kind)
      (re-matches #"^\s+Private.*Tree ID.*" ?misc)
      (parse-tree ?misc :> _ ?priv ?tree_id ?situs ?tree_site ?raw_species)
      ((c/comp s/trim s/lower-case) ?raw_species :> ?species)
      (tree_meta ?species ?wikipedia ?calflora ?min_height ?max_height)
      (avg ?min_height ?max_height :> ?avg_height)
      (geo-tree ?geo :> _ ?tree_lat ?tree_lng ?tree_alt)
      (read-string ?tree_lat :> ?lat)
      (read-string ?tree_lng :> ?lng)
      (geohash ?lat ?lng :> ?geohash)
      (:trap (hfs-textline trap))))
discovery
?blurb      Tree: 412 site 1 at 115 HAWTHORNE AV, on HAWTHORNE AV 22 from pl
?tree_id    412
?situs      115
?tree_site  1
?species    liquidambar styraciflua
?wikipedia  http://en.wikipedia.org/wiki/Liquidambar_styraciflua
?calflora   http://calflora.org/cgi-bin/species_query.cgi?where-calrecnum=8598
?avg_height 27.5
?tree_lat   37.446001565119
?tree_lng   -122.167713417554
?tree_alt   0.0
?geohash    9q9jh0
Next, refine the data about trees: join with metadata, calculate estimators, etc. Now we have a data product about trees in Palo Alto, which has been enriched by our process. BTW, those geolocation fields are especially important...
# run analysis and visualization in R
library(ggplot2)

dat_folder <- '~/src/concur/CoPA/out/tree'
data <- read.table(file=paste(dat_folder, "part-00000", sep="/"),
  sep="\t", quote="", na.strings="NULL", header=FALSE, encoding="UTF8")
summary(data)

t <- head(sort(table(data$V5), decreasing=TRUE), n=20)
trees <- as.data.frame.table(t)
colnames(trees) <- c("species", "count")

m <- ggplot(data, aes(x=V8))
m <- m + ggtitle("Estimated Tree Height (meters)")
m + geom_histogram(aes(y = ..density.., fill = ..count..)) + geom_density()

par(mar = c(7, 4, 4, 2) + 0.1)
plot(trees, xaxt="n", xlab="")
axis(1, labels=FALSE)
text(1:nrow(trees), par("usr")[3] - 0.25, srt=45, adj=1,
  labels=trees$species, xpd=TRUE)
grid(nx=nrow(trees))
discovery
Another aspect of the "Discovery" phase is to poke at the data: run summary stats, visualize the data, etc. We'll use RStudio for that...
discovery
Analysis of the tree data:
[charts from RStudio: frequency of species (label: sweetgum); density plot of tree heights in Palo Alto]
discovery
[flow diagram: GIS export → Regex parse-gis → src → Regex parse-tree → Scrub species → Geohash → Join with Tree Metadata → Estimate height → tree; failure traps capture bad records; map (M) phases marked]
(flow diagram, gis ⇒ tree)
Here's a conceptual flow diagram, which shows a directed, acyclic graph (DAG) of data taps, tuple streams, operations, joins, assertions, aggregations, etc.
discovery
In addition, the road data provides:
• traffic class (arterial, truck route, residential, etc.)
• traffic counts distribution
• surface type (asphalt, cement; age)
This leads to estimators for noise, sunlight reflection, etc.
More analysis and visualizations from RStudio: frequency of traffic classes; density plot of traffic counts.
9q9jh0
geohash with 6-digit resolution
approximates a 5-block square
centered lat: 37.445, lng: -122.162
modeling
Shifting into the modeling phase, we use "geohash" codes for cheap-and-dirty geospatial indexing suited for parallel processing (Hadoop). Much more effective methods exist; however, this is simple to show. 6-digit resolution on a geohash generates approximately a 5-block square.
Each road in the GIS export is listed as a block between two cross roads, and each may have multiple road segments to represent turns:
" -122.161776959558,37.4518836690781,0.0 " -122.161390381489,37.4516410983794,0.0 " -122.160786011735,37.4512589903357,0.0 " -122.160531178368,37.4510977281699,0.0
modeling
( lat0, lng0, alt0 )
( lat1, lng1, alt1 )
( lat2, lng2, alt2 )
( lat3, lng3, alt3 )
NB: segments in the raw GIS have the order of geo coordinates scrambled: (lng, lat, alt)
Our app analyzes each road segment as a data tuple, calculating a center point, then uses a geohash to define a boundary:
modeling
9q9jh0
( lat, lng, alt )
9q9jh0
Query to join a road segment tuple with all the trees within its geohash boundary:
modeling
Use distance-to-midpoint to filter trees which are too far away to provide shade. Calculate a sum of moments for tree height × distance from road segment, as an estimator for shade:
modeling
∑( h·d )
Also calculate estimators for traffic frequency and noise
Use distance to midpoint to filter out trees which are too far away to provide shade. Calculate a sum of moments for tree height × distance from center; approximate, but pretty good. Also calculate estimators for traffic frequency and noise.
(defn get-shade [trees roads]
  "subquery to join tree and road estimates, maximize for shade"
  (<- [?road_name ?geohash ?road_lat ?road_lng
       ?road_alt ?road_metric ?tree_metric]
      (roads ?road_name _ _ _
             ?albedo ?road_lat ?road_lng ?road_alt ?geohash
             ?traffic_count _ ?traffic_class _ _ _ _)
      (road-metric ?traffic_class ?traffic_count ?albedo :> ?road_metric)
      (trees _ _ _ _ _ _ _
             ?avg_height ?tree_lat ?tree_lng ?tree_alt ?geohash)
      (read-string ?avg_height :> ?height)
      ;; limit to trees which are higher than people
      (> ?height 2.0)
      (tree-distance ?tree_lat ?tree_lng ?road_lat ?road_lng :> ?distance)
      ;; limit to trees within a one-block radius (not meters)
      (<= ?distance 25.0)
      (/ ?height ?distance :> ?tree_moment)
      (c/sum ?tree_moment :> ?sum_tree_moment)
      ;; magic number 200000.0 used to scale tree moment based on median
      (/ ?sum_tree_moment 200000.0 :> ?tree_metric)))
modeling
We also filter these estimators, based on a few magic numbers obtained during analysis in R.
[flow diagram: road → Estimate traffic → Join with tree → Calculate distance → Filter height → Filter distance → Sum moment → Filter sum_moment → shade; map (M) and reduce (R) phases marked]
(flow diagram, shade)
modeling
A conceptual flow diagram, showing the DAG for the join of road + tree => estimators for shade.
(defn get-gps [gps_logs trap]
  "subquery to aggregate and rank GPS tracks per user"
  (<- [?uuid ?geohash ?gps_count ?recent_visit]
      (gps_logs ?date ?uuid ?gps_lat ?gps_lng ?alt
                ?speed ?heading ?elapsed ?distance)
      (read-string ?gps_lat :> ?lat)
      (read-string ?gps_lng :> ?lng)
      (geohash ?lat ?lng :> ?geohash)
      (c/count :> ?gps_count)
      (date-num ?date :> ?visit)
      (c/max ?visit :> ?recent_visit)))
modeling
?uuid                             ?geohash  ?gps_count  ?recent_visit
cf660e041e994929b37cc5645209c8ae  9q8yym    7           1972376866448
342ac6fd3f5f44c6b97724d618d587cf  9q9htz    4           1972376690969
32cc09e69bc042f1ad22fc16ee275e21  9q9hv3    3           1972376670935
342ac6fd3f5f44c6b97724d618d587cf  9q9hv3    3           1972376691356
342ac6fd3f5f44c6b97724d618d587cf  9q9hwn    13          1972376690782
342ac6fd3f5f44c6b97724d618d587cf  9q9hwp    58          1972376690965
482dc171ef0342b79134d77de0f31c4f  9q9jh0    15          1972376952532
b1b4d653f5d9468a8dd18a77edcc5143  9q9jh0    18          1972376945348
Here's a Cascalog function to aggregate GPS tracks per user – in other words, behavioral targeting. This shows aggregation in Cascalog -- the subtle but hard part. Now we have a data product about walkable road segments in Palo Alto.
Recommenders often combine multiple signals, via weighted averages, to rank personalized results:
• GPS of person ∩ road segment
• frequency and recency of visit
• traffic class and rate
• road albedo (sunlight reflection)
• tree shade estimator
Adjusting the mix allows for further personalization at the point of end use (see the ranking sketch below)
modeling
(defn get-reco [tracks shades]
  "subquery to recommend road segments based on GPS tracks"
  (<- [?uuid ?road ?geohash ?lat ?lng ?alt
       ?gps_count ?recent_visit ?road_metric ?tree_metric]
      (tracks ?uuid ?geohash ?gps_count ?recent_visit)
      (shades ?road ?geohash ?lat ?lng ?alt ?road_metric ?tree_metric)))
One approach to building commercial recommender systems is to take a vector of different preference metrics, combine them into a single sortable value, then rank the results before making personalized suggestions. The resulting data in the "reco" output set produces exactly that. "tracks" represents behavioral targeting, while "shades" represents our inventory. Overall, the resulting app needs to enable feedback loops involving customers, their GPS tracks, their profile settings, etc.
‣ addr: 115 HAWTHORNE AVE
‣ lat/lng: 37.446, -122.168
‣ geohash: 9q9jh0
‣ tree: 413 site 2
‣ species: Liquidambar styraciflua
‣ est. height: 23 m
‣ shade metric: 4.363
‣ traffic: local residential, light traffic
‣ recent visit: 1972376952532
‣ a short walk from my train stop ✔
apps
One of the top recommendations for me is about two blocks from my train stop, where a couple of really big American Sweetgum trees provide ample shade on a residential street with not much traffic.
Enterprise Data Workflows with Cascading
O'Reilly, 2013
amazon.com/dp/1449358721
references…
Some of this material comes from an upcoming O'Reilly book: "Enterprise Data Workflows with Cascading". Should be in Rough Cuts soon – scheduled to be out in print this June.
blog, dev community, code/wiki/gists, maven repo, commercial products, career opportunities:
cascading.org
zest.to/group11
github.com/Cascading
conjars.org
goo.gl/KQtUL
concurrentinc.com
join us for very interesting work!
drill-down…
Copyright ©2013, Concurrent, Inc.
Links to our open source projects, developer community, etc…
contact me @pacoid
http://concurrentinc.com/
(we're hiring too!)