
Page 1: Spark Workflow Management

Spark Workflow Management

Romi Kuntsman

Senior Big Data Engineer @ Totango

[email protected]

https://il.linkedin.com/in/romik

„Big things are happening here“ Meetup, 2015-04-29

Page 2: Spark Workflow Management

Agenda

● Totango and Customer Success

● Totango architecture overview

● Apache Spark computing framework

● Luigi workflow Engine

● Luigi in Totango

Page 3: Spark Workflow Management

Totango and Customer Success

Your customers' success is your success

Page 4: Spark Workflow Management

SaaS Customer Journey

DECREASE VALUE

DECREASE VALUE

CHURN

CHURN

GROW VALUE

FIRST VALUE

START

INCREASE USERS

INCREASE USAGE

EXPAND FUNCTIONALITY

CHURN

ONGOING VALUE

Page 5: Spark Workflow Management

Customer Success Platform

● Analytics for SaaS companies

● Clear view of the customer journey

● Proactively prevent churn

● Increase upsell

● Track feature, module and total usage

● Health score based on usage patterns

● Improve conversion from trial to paying

Page 6: Spark Workflow Management

Health Console

Page 7: Spark Workflow Management

Module Statistics

Page 8: Spark Workflow Management

Feature Adoption

Page 9: Spark Workflow Management

About Totango

● Founded in 2010

● Size: ~50 (half R&D)

● Offices in Tel Aviv and San Mateo, CA

● 120+ customers

● ~70 million events per day

● ~1.5 billion indexed documents per month

● Hosted on Amazon Web Services

Page 10: Spark Workflow Management

Totango Architecture Overview

From usage information to actionable analytics

Page 11: Spark Workflow Management

Terminology

● Service – Totango's customer (e.g. Zendesk)

● Account – Service's (Zendesk's) customer

● SDR (Service Data Record) – User activity event (e.g. user Joe from account Acme did activity Login in module Application)

Page 12: Spark Workflow Management

SDR reception

● Clients send SDRs to the gateway, where they are collected, filtered, packaged and finally stored in S3 for daily/hourly batch processing.

● Realtime processing is also notified.

Page 13: Spark Workflow Management

Batch Workflow

Page 14: Spark Workflow Management

Account Data Flow

1) Raw Data (SDRs)

2) Account Aging (MySQL - legacy)

3) Activity Aggregations (Hadoop – legacy)

4) Metrics (Spark)

5) Health (Spark)

6) Alerts (Spark)

7) Indexing to Elasticsearch

Page 15: Spark Workflow Management

Data Structure

● Account documents are stored on Amazon S3

● Hierarchical directory structure per task parameter, e.g. /s-1234/prod/2015-04-27/account/metrics

● Documents have a predefined JSON schema; the JSON is mapped directly to a Java document class (see the sketch below)

● Each file is an immutable collection of documents, with one object per line – easily partitioned by lines
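
To make "JSON mapped directly to a Java document class" concrete, here is a minimal sketch (not Totango's actual code) of decoding one line into a document object with Jackson; the AccountDocument fields shown are hypothetical:

import com.fasterxml.jackson.databind.ObjectMapper;

// Hypothetical document class matching the predefined JSON schema
class AccountDocument {
    public String accountId;
    public double healthScore;
}

class JsonLineDecoder {
    private static final ObjectMapper MAPPER = new ObjectMapper();

    // Each line of an S3 file is one complete, self-contained JSON document
    static AccountDocument decode(String line) throws Exception {
        return MAPPER.readValue(line, AccountDocument.class);
    }
}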

Page 16: Spark Workflow Management

Apache Spark

One tool to rule all data transformations

Page 17: Spark Workflow Management

Resilient Distributed Datasets

● RDDs – distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant way

● Initial RDD is created from stable storage

● Programmer defines a transformation from an immutable input object to a new output object

● Transformation function class can (read: should!) be built and tested separately from Spark

Page 18: Spark Workflow Management

Transformation flow

Read: inputRows = sparkContext.textFile(inputPath)

Decode: inputDocuments = inputRows.map(new JsonToAccountDocument())

Transform: docsWithHealth = inputDocuments.map(new AugmentDocumentWithHealth(healthCalcMetadata))

… other transformations may be done, all in memory …

Encode: outputRows = docsWithHealth.map(new AccountDocumentToJson())

Write: outputRows.saveAsTextFile(outputPath)
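
Assembled into runnable code with the Spark 1.x Java API, the flow above might look roughly like the sketch below; the HealthCalcMetadata factory and the path arguments are assumptions, while the function class names follow the slides:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class AccountHealthFlow {
    public static void main(String[] args) {
        String inputPath = args[0];   // directory of JSON lines, e.g. on S3
        String outputPath = args[1];

        JavaSparkContext sparkContext = new JavaSparkContext(new SparkConf().setAppName("account-health"));
        HealthCalcMetadata healthCalcMetadata = HealthCalcMetadata.load(); // hypothetical immutable metadata

        // Read: one JSON document per line
        JavaRDD<String> inputRows = sparkContext.textFile(inputPath);
        // Decode: JSON -> AccountDocument
        JavaRDD<AccountDocument> inputDocuments = inputRows.map(new JsonToAccountDocument());
        // Transform: augment each document with its health score
        JavaRDD<AccountDocument> docsWithHealth = inputDocuments.map(new AugmentDocumentWithHealth(healthCalcMetadata));
        // ... other in-memory transformations may be chained here ...
        // Encode: AccountDocument -> JSON
        JavaRDD<String> outputRows = docsWithHealth.map(new AccountDocumentToJson());
        // Write: the output files form a new immutable collection of documents
        outputRows.saveAsTextFile(outputPath);

        sparkContext.stop();
    }
}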

Page 19: Spark Workflow Management

Examples (Java)

class AugmentDocumentWithHealth implements Function<AccountDocument, AccountDocument> {
    public AccountDocument call(final AccountDocument document) throws Exception {
        // … return document with health …
    }
}

class AccountHealthToAlerts implements FlatMapFunction<AccountDocument, EventDocument> {
    public Iterable<EventDocument> call(final AccountDocument document) throws Exception {
        // … generate alerts …
    }
}

Page 20: Spark Workflow Management

Transformation function

● Passed as a parameter to a Spark transformation: map, reduce, filter, flatMap, mapPartitions

● Can (read: should!!) be checked in unit tests (see the sketch below)

● Serializable – sent to the Spark workers serialized

● Function must be idempotent!

● May be passed immutable metadata
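
Since the function classes are plain serializable Java objects, they can be exercised without a SparkContext; a minimal JUnit 4 sketch (the metadata factory and the assertions are assumptions) might look like this:

import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertNotNull;
import org.junit.Test;

public class AugmentDocumentWithHealthTest {

    @Test
    public void addsHealthToDocument() throws Exception {
        HealthCalcMetadata metadata = HealthCalcMetadata.forTest(); // hypothetical immutable metadata
        AccountDocument input = new AccountDocument();

        // Call the transformation function directly - no Spark cluster involved
        AugmentDocumentWithHealth function = new AugmentDocumentWithHealth(metadata);
        AccountDocument withHealth = function.call(input);
        assertNotNull(withHealth);

        // Idempotence: applying the function again should yield the same health score
        assertEquals(withHealth.healthScore, function.call(withHealth).healthScore, 0.0);
    }
}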

Page 21: Spark Workflow Management

Luigi Workflow Engine

You build the tasks, it takes care of the plumbing

Page 22: Spark Workflow Management

Why a workflow engine?

● Managing many ETL jobs

● Dependencies between jobs

● Continue pipeline from point of failure

● Separate workflow per service per date

● Overview and drill-down status Web UI

● Manual intervention

Page 23: Spark Workflow Management

Workflow engines

● Azkaban, by LinkedIn (mostly for Hadoop)

● Oozie, by Apache (only for Hadoop)

● Amazon Simple Workflow Service (too generic)

● Amazon Data Pipeline (deeply tied to AWS)

● Luigi, by Spotify (customizable) – our choice!

Page 24: Spark Workflow Management

What is Luigi

● Like Makefile – but in Python, and for data

● Dependencies are managed directly in code

● Generic and easily extendable

● Visualization of task status and dependency

● Command-line interface

Page 25: Spark Workflow Management

Luigi Task Structure

● Extend luigi.Task

Implement 4 methods:

● def input(self) (optional)

● def output(self)

● def requires(self)

● def run(self)

Page 26: Spark Workflow Management

Luigi Task Example

Page 27: Spark Workflow Management

Luigi Predefined Tasks

● HadoopJobTask

● SparkSubmitTask

● CopyToIndex (Elasticsearch)

● HiveQueryTask

● PigJobTask

● CopyToTable (RDBMS)

● … many others

Page 28: Spark Workflow Management

Luigi Task Parameters

Page 29: Spark Workflow Management

Luigi Command-line

Page 30: Spark Workflow Management

Luigi Task List

Page 31: Spark Workflow Management

Luigi Dependency Graph

Page 32: Spark Workflow Management

Luigi Dependency Graph

Page 33: Spark Workflow Management

Luigi in Totango

This is how we do it

Page 34: Spark Workflow Management

Our codebase is in Java

The Java class is called inside the task's run() method
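
Concretely, a task's run() method ends up launching spark-submit with the compiled Java class; the class, jar and path names below are purely illustrative, not Totango's actual ones:

spark-submit \
  --class AccountHealthFlow \
  /path/to/spark-jobs.jar \
  s3://bucket/s-1234/prod/2015-04-27/account/metrics \
  s3://bucket/s-1234/prod/2015-04-27/account/health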

Page 35: Spark Workflow Management

Jenkins for Luigi

Page 36: Spark Workflow Management

Gameboy

● Totango-specific controller for Luigi

● Provides high level overview

● Enable manual re-run of specific tasks

● Monitor progress, performance, run time, queue, worker load, etc.

Page 37: Spark Workflow Management

Gameboy

Page 38: Spark Workflow Management

Gameboy

Page 39: Spark Workflow Management

Gameboy

Page 40: Spark Workflow Management

Summary

● Typical data flow – from raw data to insights

● We use Spark for fast in-memory transformations; all our code is in Java

● Our batch processing pipeline consists of a series of tasks, which are managed in Luigi

● We don't use all of Luigi's Python abilities, and we've added some new management abilities

Page 41: Spark Workflow Management

Questions?

The end is only the beginning