TRANSCRIPT
THE FUTURE IS NOW
Scalable Predictive Pipelines with Spark and Scala
Dimitris Papadopoulos
About Schibsted
Event Tracking Data
Data Science Tasks
(diagram: Data → Preprocessing → Model → Results)
Outline
1. Using Spark ML Pipelines
2. Scalable Pipelines
Pipeline
Not a pipe
Pipeline Stage
● One or more inputs
● Strictly one output
● Closed under concatenation
● Standalone and runnable
● Spark™ ML inside
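The properties above can be sketched in plain Scala. This is a toy model with hypothetical names, not the actual implementation from the talk: a stage maps its input to strictly one output, and concatenating two stages yields another stage (closure under concatenation).

```scala
// Toy model of a pipeline stage: one (typed) input, strictly one output.
trait Stage[In, Out] { self =>
  def run(in: In): Out

  // Concatenating two stages yields another Stage: closed under concatenation.
  def andThen[Next](next: Stage[Out, Next]): Stage[In, Next] =
    new Stage[In, Next] { def run(in: In): Next = next.run(self.run(in)) }
}

object StageDemo {
  // Two tiny standalone, runnable stages.
  val parse: Stage[String, Int] = new Stage[String, Int] { def run(s: String) = s.trim.toInt }
  val double: Stage[Int, Int]   = new Stage[Int, Int]   { def run(n: Int)    = n * 2 }

  // The concatenation is itself a Stage, so it can be concatenated further.
  val pipeline: Stage[String, Int] = parse.andThen(double)
}
```

For example, `StageDemo.pipeline.run(" 21 ")` parses and then doubles, returning 42.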
Spark ML Pipelines
● Using a Pipeline to train a model
● Using a PipelineModel to get predictions
Peek inside a Spark pipeline
● It’s a Pipeline, using the plain Spark API
● From a DataFrame to a Model
● Instantiating a Pipeline, and running it!
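The code screenshots from these slides are not preserved in the transcript. As a rough stand-in (hypothetical names, not the actual Spark ML API), the fit/transform pattern behind Pipeline and PipelineModel can be modeled in plain Scala: an Estimator is fit on training data and yields a Transformer (the model); a Pipeline fits its stages in turn, producing a PipelineModel that applies the fitted models in sequence.

```scala
// Toy model of Spark ML's Estimator / Transformer / Pipeline design.
trait Transformer { def transform(data: Seq[Double]): Seq[Double] }
trait Estimator  { def fit(data: Seq[Double]): Transformer }

// Example estimator: learns the mean of the training data; its fitted
// model subtracts that mean from every value (a tiny "standardizer").
object MeanCenterer extends Estimator {
  def fit(data: Seq[Double]): Transformer = {
    val mean = data.sum / data.size
    new Transformer { def transform(d: Seq[Double]) = d.map(_ - mean) }
  }
}

final case class SimplePipeline(stages: Seq[Estimator]) {
  // Fit each stage on the progressively transformed training data,
  // collecting the fitted models into a PipelineModel.
  def fit(train: Seq[Double]): PipelineModel = {
    val (models, _) = stages.foldLeft((Vector.empty[Transformer], train)) {
      case ((ms, d), est) =>
        val m = est.fit(d)
        (ms :+ m, m.transform(d))
    }
    PipelineModel(models)
  }
}

final case class PipelineModel(models: Seq[Transformer]) {
  def transform(data: Seq[Double]): Seq[Double] =
    models.foldLeft(data)((d, m) => m.transform(d))
}
```

Training produces a model, and the model is then used for predictions: `SimplePipeline(Seq(MeanCenterer)).fit(Seq(1.0, 2.0, 3.0)).transform(Seq(2.0))` centers new data around the learned mean.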
Example Pipeline
● EventCoalescer: collects raw pulse events (JSON) into substantially fewer files (Parquet)
● UserAccountPreprocessor: harmonises schemas of user accounts across sites (if necessary); provides ground truth data for training
● EventPreprocessor: aggregates events per user
● GenderPredictor: creates labels and features, trains a classifier, and computes predictions
● GenderPerformanceEvaluator: computes performance metrics, e.g. accuracy and area under ROC
Scalable Pipelines: pain points
(the example pipeline above)
● Input: 1 day’s / 7 days’ worth of events data. Larger lookbacks are needed for better accuracy.
More data for better performance
(plot: performance of three different pipelines vs. lookback length in days: 1, 7, 30, 45)
Scalable Pipelines: pain points
What will happen if we try to process 30 days’ worth of data (e.g. 3B events)???
Scalable Pipelines: pain points
● Memory- and processing-heavy: in one use case, for a 7-day lookback (~7 × 100M events) we used to need 20 Spark executors with 22G of memory each.
● Not easily scalable: as the lookback increases, and as more and more sites are incorporated into our pipelines.
● Redundant processing: for K days of lookback, we repeat the processing of K − 2 days’ worth of data when we run the pipeline every day, in a rolling-window fashion.
“What will happen if we try to process 30 days’ worth of data (e.g. 3B events)???”
Saved by Algebra
● The operations (op), along with the corresponding data structures (S), that we are interested in are monoids.
○ Associative: for all A, B, C in S, (A op B) op C = A op (B op C)
○ Identity element: there exists E in S such that for every A in S, E op A = A op E = A
● Examples:
○ Summation: 1 + 2 + 3 + 4 = (1 + 2) + (3 + 4)
○ String array concatenation: [“foo”] + [“bar”] + [“baz”] = [“foo”, “bar”] + [“baz”]
Scalable Pipelines: in monoids fashion
● Split the aggregations into smaller chunks
○ i.e. pre-process events per user and single day (not over the entire lookback)
● Make one-day (or multiple-day) aggregates and combine them
○ i.e. aggregate over the pre-processed events per user and day
● It’s like trying to eat an elephant: one piece at a time!
Scalable Pipelines: building blocks
● Imagine we had a MapAggregator, for aggregating maps of [String -> Double].
● The spec for such an aggregator implemented in Scala on Spark could look like this. :-)
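The spec itself is not in the transcript. A minimal sketch of the idea in plain Scala (MapAggregator is the name from the slides; the implementation here is an assumption): merging maps of [String -> Double] by summing the values of shared keys is a monoid, with the empty map as identity, so partial aggregates can be combined in any grouping.

```scala
// Merging feature-count maps by summing values is a monoid:
// associative, with the empty map as the identity element.
object MapAggregator {
  val empty: Map[String, Double] = Map.empty

  def merge(a: Map[String, Double], b: Map[String, Double]): Map[String, Double] =
    b.foldLeft(a) { case (acc, (k, v)) => acc.updated(k, acc.getOrElse(k, 0.0) + v) }
}
```

Spec-style, the monoid laws read: `merge(merge(a, b), c) == merge(a, merge(b, c))` and `merge(empty, a) == a == merge(a, empty)`.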
Scalable Pipelines: building blocks
● In Spark we can define our own functions, known as User Defined Functions (UDFs).
● A UDF takes one or more columns as arguments and returns some output. It is executed for each row of the DataFrame, and it can also be parameterized, e.g. val myUDF = udf((myArg: myType) => ...)
● Since Spark 1.5, we can also define our own User Defined Aggregate Functions (UDAFs).
● UDAFs compute custom calculations over groups of input data (in contrast, UDFs compute a value from a single input row). Examples: calculating the geometric mean, or the product of the values, for every group.
● A UDAF maintains an aggregation buffer to store intermediate results for every group of input data. It updates this buffer for every input row; once it has processed all input rows, it generates a result value from the values of the aggregation buffer.
Scalable Pipelines: UDAF
A User Defined Aggregate Function: implementation of the abstract methods
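The actual UDAF code was shown on the slide; as a stand-in, its lifecycle can be modeled without a cluster. This is a plain-Scala sketch (names are illustrative, not Spark's API) of the geometric-mean example mentioned above: a buffer is updated per input row, buffers from different partitions are merged, and evaluate produces the final value.

```scala
// Buffer for a geometric-mean aggregation: sum of logs plus a count.
final case class GeoMeanBuffer(logSum: Double, count: Long)

object GeoMean {
  // Initial (empty) aggregation buffer.
  val zero = GeoMeanBuffer(0.0, 0L)

  // Called once per input row of the group.
  def update(b: GeoMeanBuffer, x: Double): GeoMeanBuffer =
    GeoMeanBuffer(b.logSum + math.log(x), b.count + 1)

  // Combines partial buffers (e.g. from different partitions).
  def merge(a: GeoMeanBuffer, b: GeoMeanBuffer): GeoMeanBuffer =
    GeoMeanBuffer(a.logSum + b.logSum, a.count + b.count)

  // Produces the final result from the buffer.
  def evaluate(b: GeoMeanBuffer): Double = math.exp(b.logSum / b.count)
}
```

For the rows 2.0 and 8.0, `evaluate` returns their geometric mean, 4.0, and merging two half-buffers gives the same result as updating one buffer with all rows.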
Scalable Pipelines: adding a new stage
● EventCoalescer: collects raw pulse events (JSON) into substantially fewer files (Parquet)
● UserAccountPreprocessor: harmonises schemas of user accounts across sites (if necessary); provides ground truth data for training
● EventPreprocessor: aggregates events per user and day
● GenderPredictor: creates labels and features, trains a classifier, and computes predictions
● GenderPerformanceEvaluator: computes performance metrics, e.g. accuracy and area under ROC
● EventAggregator (new): aggregates pre-processed events per user over multiple days (lookback)
Scalable Pipelines: Aggregating Events
● It’s a Transformer
● DataFrame in, DataFrame out
● Aggregating maps of feature frequency counts
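The Transformer code itself is on the slide, not in the transcript. A hypothetical sketch of what the EventAggregator stage computes, without Spark: take (user, day, featureCounts) rows and, per user, merge the daily frequency maps by summing counts over the lookback window.

```scala
// Plain-Scala model of the EventAggregator's computation (names assumed).
object EventAggregatorSketch {
  type Counts = Map[String, Double]

  // Monoidal merge of two frequency-count maps: sum counts for shared keys.
  def mergeCounts(a: Counts, b: Counts): Counts =
    b.foldLeft(a) { case (acc, (k, v)) => acc.updated(k, acc.getOrElse(k, 0.0) + v) }

  // Rows are (userId, day, featureCounts); aggregate counts per user
  // across all days in the lookback.
  def aggregatePerUser(rows: Seq[(String, String, Counts)]): Map[String, Counts] =
    rows.groupBy(_._1).map { case (user, rs) =>
      user -> rs.map(_._3).foldLeft(Map.empty[String, Double])(mergeCounts)
    }
}
```

Because the merge is associative, the same result is obtained whether the daily pre-aggregates are combined one day at a time or in larger chunks, which is exactly what makes the rolling-window setup cheap.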
Scalable Pipelines: closing remarks
● With User Defined Aggregate Functions, we have reduced the workload of our pipelines by a factor of 20!
● Obvious gains: freeing up resources that can be used for running even more pipelines, faster, over even more input data.
● Needless to say, more factors contribute towards a scalable pipeline:
○ Performance tuning of the Spark cluster
○ Use of a workflow manager (e.g. Luigi) for pipeline orchestration
● But each one of these is a topic for a separate talk (Carlos? Hint, hint!) :-)
Q/A
Thank you!
Shameless plug
We are hiring!
Across all our hubs
in London, Oslo, Stockholm, Barcelona
for Data Science, Engineering, UX and Product roles
https://jobs.lever.co/[email protected]