Serverless Data Architecture at scale on Google Cloud Platform

Lorenzo Ridi

Machine Learning/Data Science Meetup

Rome, 02-02-2017

Hi, I’m Lorenzo!

I’ve been a Research Fellow @UniFI

I am a Software Engineer @Noovle

I am a Google Cloud Platform Qualified Developer

I am a Google Cloud Platform Authorized Trainer

Google’s Mission

“Organize the world’s information and make it universally accessible and useful.”

Google’s Data Research

[Timeline figure, 2002-2016. Technologies shown: GFS, MapReduce, TensorFlow, BigTable, Dremel, Colossus, Flume, Megastore, Spanner, Millwheel, PubSub, F1]

Google’s Data Products

[Timeline figure, 2002-2016. Products shown: ML, PubSub, DataFlow, DataStore, Cloud Storage, BigQuery, BigTable, DataProc]

Pre-Trained Machine Learning Models

Fully trained ML models from Google Cloud that allow a general developer to take advantage of rich machine learning capabilities with simple REST-based services.

● Cloud Natural Language (GA)
● Cloud Speech (Beta)
● Cloud Translate (GA)
● Cloud Vision (GA)

Stay tuned...

tensorflow.org

github.com/tensorflow

Open Source Software Library for Machine Learning.

Cloud Machine Learning

Managed service that enables you to easily build machine learning models that work on any type of data, of any size.

Use your own data to train models

Cracking Black Friday

Adding Machine Learning to a serverless data analysis pipeline

Black Friday (ˈblæk fraɪdɪ), noun

The day following Thanksgiving Day in the United States. Since 1932, it has been regarded as the beginning of the Christmas shopping season.

[Chart: Black Friday in the US, 2012 - 2016]

source: Google Trends, November 23rd 2016

[Chart: Black Friday in Italy, 2012 - 2016]

source: Google Trends, November 23rd 2016

What are we doing

Tweets about Black Friday → Processing + analytics → insights $$

● Hashtags (blackfriday, blackfriday2016)
● Brands & vendors (walmart, bestbuy)
● Negative hashtags (notonedime, blackoutblackfriday)

ingest → process → store → analyze

How we’re gonna do it

Reqs (boss):
● Do it fast
● Make it work

go serverless

ingest (Pub/Sub + Container Engine (Kubernetes)) → process → store → analyze

What is Google Cloud Pub/Sub?

● Google Cloud Pub/Sub is a fully-managed real-time messaging service.
  ○ Guaranteed delivery
    ■ “At least once” semantics
  ○ Reliable at scale
    ■ Messages are replicated in different zones
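“At least once” means a consumer may occasionally see the same message twice, so consumers are usually written to tolerate redelivery. A minimal sketch of that idea, in plain Python with no Pub/Sub client involved; the `(message_id, payload)` tuple shape and the function name are assumptions made for illustration only:

```python
from typing import Callable, Iterable, Set, Tuple

Message = Tuple[str, str]  # (message_id, payload); shape is an assumption for this sketch


def consume_idempotently(
    deliveries: Iterable[Message],
    handle: Callable[[str], None],
) -> int:
    """Call handle() once per distinct message_id and return how many were handled."""
    seen: Set[str] = set()
    handled = 0
    for message_id, payload in deliveries:
        if message_id in seen:
            # A redelivery: at-least-once delivery allows duplicates, so skip it.
            continue
        seen.add(message_id)
        handle(payload)
        handled += 1
    return handled


# Usage: message "m1" is delivered twice but processed once.
processed = []
count = consume_idempotently(
    [("m1", "tweet A"), ("m2", "tweet B"), ("m1", "tweet A")],
    processed.append,
)
```

In a real consumer the set of seen ids would have to be bounded or stored externally; here it is kept in memory to keep the sketch short.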

From Twitter to Pub/Sub

$ gcloud beta pubsub topics create blackfridaytweets
Created topic [blackfridaytweets].

SHELL


[Diagram: a Pub/Sub Topic with Subscriptions A, B and C, feeding Consumers A, B and C]

● Reliable AND scalable delivery
● Decouples producer and consumer(s)
● Absorbs shocks and changes
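The decoupling can be pictured with a toy in-memory model: each subscription keeps its own queue, so a slow consumer neither blocks the producer nor affects other consumers. This is only a sketch of the concept; the class and method names are invented here and are not the Pub/Sub API:

```python
from collections import deque
from typing import Deque, Dict, Optional


class InMemoryTopic:
    """Toy stand-in for a topic: every subscription gets its own copy of each message."""

    def __init__(self) -> None:
        self._queues: Dict[str, Deque[str]] = {}

    def subscribe(self, name: str) -> None:
        # Create an empty queue for a new subscription (no-op if it already exists).
        self._queues.setdefault(name, deque())

    def publish(self, message: str) -> None:
        # The producer does not know who consumes; it only appends to every queue.
        for queue in self._queues.values():
            queue.append(message)

    def pull(self, name: str) -> Optional[str]:
        queue = self._queues[name]
        return queue.popleft() if queue else None


topic = InMemoryTopic()
topic.subscribe("subscription-a")
topic.subscribe("subscription-b")
topic.publish("tweet 1")
topic.publish("tweet 2")

# Consumer A drains its queue; consumer B has not pulled yet and loses nothing.
a_first = topic.pull("subscription-a")
a_second = topic.pull("subscription-a")
a_third = topic.pull("subscription-a")
b_first = topic.pull("subscription-b")
```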


● Simple Python application using the Tweepy library

# somewhere in the code, track a given set of keywords
stream = Stream(auth, listener)
stream.filter(track=['blackfriday', [...]])

[...]

# somewhere else, write messages to Pub/Sub
for line in data_lines:
    pub = base64.urlsafe_b64encode(line)
    messages.append({'data': pub})
body = {'messages': messages}
# 'blackfridaytweets' is our Pub/Sub topic
resp = client.projects().topics().publish(
    topic='blackfridaytweets', body=body).execute(num_retries=NUM_RETRIES)

PYTHON
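The snippet above passes each line straight to `base64.urlsafe_b64encode`, which under Python 3 accepts bytes only. A possible Python 3 variant of just the body-building step is sketched below, assuming tweets arrive as `str`; the helper name `build_publish_body` and the sample tweet are hypothetical, and no Pub/Sub call is made:

```python
import base64
from typing import Dict, Iterable, List


def build_publish_body(data_lines: Iterable[str]) -> Dict[str, List[Dict[str, str]]]:
    """Wrap each line in the {'messages': [{'data': <base64>}]} shape used above."""
    messages = []
    for line in data_lines:
        # b64encode works on bytes, so encode the text first and decode the result
        # back to str so the body can be serialized as JSON.
        encoded = base64.urlsafe_b64encode(line.encode("utf-8")).decode("ascii")
        messages.append({"data": encoded})
    return {"messages": messages}


body = build_publish_body(['{"text": "#blackfriday deals"}'])
decoded = base64.urlsafe_b64decode(body["messages"][0]["data"]).decode("utf-8")
```

Decoding the payload again, as in the last line, is a quick way to check that the round trip preserves the tweet.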


[Diagram: App + Libs running directly on VMs]

● hard to scale
● hard to make fault-tolerant
● difficult to deploy & update

[Diagram: App + Libs packaged in a Container]

FROM google/python

# install/update libs
RUN pip install --upgrade pip
RUN pip install pyopenssl ndg-httpsclient pyasn1
RUN pip install tweepy
RUN pip install --upgrade google-api-python-client
RUN pip install python-dateutil

# copy scripts
ADD twitter-to-pubsub.py /twitter-to-pubsub.py
ADD utils.py /utils.py

# execute script
CMD python twitter-to-pubsub.py

DOCKERFILE


[Diagram: App + Libs in a Container]

easy to deploy but still doesn’t scale

[Diagram: App + Libs in a Container, wrapped in a Pod]

What is Kubernetes (K8S)?

● An orchestration tool for managing a cluster of containers across multiple hosts
  ○ Scaling, rolling upgrades, A/B testing, etc.
● Declarative – not procedural
  ○ Auto-scales and self-heals to desired state
● Supports multiple container runtimes, currently Docker and CoreOS rkt
● Open-source: http://github.com/kubernetes
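“Declarative” means you state the desired number of replicas and a control loop keeps nudging the actual state toward it. The following is a toy illustration of one such reconciliation pass in plain Python; it is not how Kubernetes is implemented, and the function and pod names are made up for the example:

```python
from typing import List


def reconcile(desired_replicas: int, running_pods: List[str]) -> List[str]:
    """Return the pod list after one reconciliation pass toward the desired count."""
    pods = list(running_pods)
    # Too few pods: start replacements until the desired count is reached.
    next_index = 0
    while len(pods) < desired_replicas:
        name = f"pod-{next_index}"
        if name not in pods:
            pods.append(name)
        next_index += 1
    # Too many pods: stop the extras.
    return pods[:desired_replicas]


healed = reconcile(2, ["pod-1"])               # pod-0 crashed, a replacement is started
scaled_up = reconcile(3, ["pod-0"])            # desired count was raised
scaled_down = reconcile(1, ["pod-0", "pod-1"])  # desired count was lowered
```

The caller never says "start a pod" or "stop a pod"; it only changes the desired count, which is the point of the declarative style.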


apiVersion: v1
kind: ReplicationController
metadata:
  [...]
spec:
  replicas: 1
  template:
    metadata:
      labels:
        name: twitter-stream
    spec:
      containers:
      - name: twitter-to-pubsub
        image: gcr.io/codemotion-2016-demo/pubsub_pipeline
        env:
        - name: PUBSUB_TOPIC
          value: ...

YAML

Todo: use Deployments!


[Diagram: Pods A and B running on Nodes 1 and 2]

Container Engine manages the K8S master for us!

Nodes autoscaling courtesy of Kubernetes 1.3


$ gcloud container clusters create codemotion-2016-demo-cluster
Creating cluster cluster-1...done.
Created [...projects/codemotion-2016-demo/.../clusters/codemotion-2016-demo-cluster].

$ gcloud container clusters get-credentials codemotion-2016-demo-cluster
Fetching cluster endpoint and auth data.
kubeconfig entry generated for cluster-1.

$ kubectl create -f ~/git/kube-pubsub-bq/pubsub/twitter-stream.yaml
replicationcontroller "twitter-stream" created.

SHELL

https://console.cloud.google.com/compute/instanceGroups/details/europe-west1-b/gke-codemotion-2016-demo-clu-default-pool-813a39fa-grp?project=codemotion-2016-demo&graph=GCE_CPU&duration=PT6H

ingest (Pub/Sub + Kubernetes) → process (Dataflow) → store (BigQuery) → analyze


What is Google Cloud Dataflow?

● Cloud Dataflow is a collection of open source SDKs to implement parallel processing pipelines (Apache Beam).
  ○ same programming model for streaming and batch pipelines
● Cloud Dataflow is a managed service to run parallel processing pipelines on Google Cloud Platform

What is Google BigQuery?

● Google BigQuery is a fully-managed Analytic Data Warehouse solution allowing real-time analysis of Petabyte-scale datasets.
● Enterprise-grade features
  ○ Batch and streaming (100K rows/sec) data ingestion
  ○ JDBC/ODBC connectors
  ○ Rich SQL-2011-compliant query language (new!)
  ○ Supports updates and deletes (new!)

From Pub/Sub to BigQuery

[Diagram: Pub/Sub Topic → Subscription → Dataflow Pipeline (Read tweets from Pub/Sub → Format tweets for BigQuery → Write tweets on BigQuery) → BigQuery Table]


● A Dataflow pipeline is a Java program (python too!).

// TwitterProcessor.java

public static void main(String[] args) {

  Pipeline p = Pipeline.create();

  // Reads from Pub/Sub
  PCollection<String> tweets = p.apply(PubsubIO.Read.topic("...blackfridaytweets"));

  // Transforms each tweet (JSON) into a BigQuery record
  PCollection<TableRow> formattedTweets = tweets.apply(ParDo.of(new DoFormat()));

  // Writes on BigQuery
  formattedTweets.apply(BigQueryIO.Write.to(tableReference));

  p.run();

}

JAVA


● A Dataflow pipeline is a Java program (python too!).

// TwitterProcessor.java

// Do Function (to be used within a ParDo)
private static final class DoFormat extends DoFn<String, TableRow> {
  private static final long serialVersionUID = 1L;

  @Override
  public void processElement(DoFn<String, TableRow>.ProcessContext c) throws IOException {
    // c.element() is the input element; c.output() writes the output
    c.output(createTableRow(c.element()));
  }
}

// Helper method
private static TableRow createTableRow(String tweet) throws IOException {
  return JacksonFactory.getDefaultInstance().fromString(tweet, TableRow.class);
}

JAVA
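The formatting step is essentially "parse one JSON string, emit one row". A rough Python rendering of the helper is sketched below using only the standard library; it is not Dataflow Python SDK code, and the function name and the sample tweet fields are assumptions for illustration:

```python
import json
from typing import Any, Dict


def create_table_row(tweet: str) -> Dict[str, Any]:
    """Parse one tweet (a JSON string) into a dict usable as a table row."""
    row = json.loads(tweet)
    if not isinstance(row, dict):
        # A tweet is expected to be a JSON object, not a list or a bare value.
        raise ValueError("expected a JSON object per tweet")
    return row


row = create_table_row('{"id": 1, "text": "Best #blackfriday ever", "lang": "en"}')
```

Malformed input raises an exception here, just as the Java helper declares `IOException`; how a pipeline should react to such records is a separate design decision.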


● Use Maven to build, deploy or update the Pipeline.

$ mvn compile exec:java -Dexec.mainClass=it.noovle.dataflow.TwitterProcessor -Dexec.args="--streaming"

[...]

INFO: To cancel the job using the 'gcloud' tool, run:
> gcloud alpha dataflow jobs --project=codemotion-2016-demo cancel 2016-11-19_15_49_53-5264074060979116717
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 18.131s
[INFO] Finished at: Sun Nov 20 00:49:54 CET 2016
[INFO] Final Memory: 28M/362M
[INFO] ------------------------------------------------------------------------

SHELL


● You can monitor your pipelines from Cloud Console.

From Pub/Sub to BigQuery

● Data start flowing into BigQuery tables. You can run queries from the CLI or the Web Interface.

Yay! It works! What now?

ingest (Pub/Sub + Kubernetes) → process (Dataflow) → store (BigQuery) → analyze (DataStudio)

Add magic here: enrich (Natural Language API)

Sentiment Analysis with Natural Language API

Text → Polarity: [-1,1], Magnitude: [0,+inf)

sentiment = polarity x magnitude
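As a small worked example of the formula, the sketch below multiplies the two values and rejects inputs outside the stated ranges. The function name and the sample numbers are illustrative only and do not come from the API:

```python
def sentiment_score(polarity: float, magnitude: float) -> float:
    """Combine polarity in [-1, 1] and magnitude in [0, +inf) into one signed score."""
    if not -1.0 <= polarity <= 1.0:
        raise ValueError("polarity must be in [-1, 1]")
    if magnitude < 0:
        raise ValueError("magnitude must be non-negative")
    return polarity * magnitude


mildly_positive = sentiment_score(0.5, 0.4)
strongly_negative = sentiment_score(-1.0, 2.0)
```

The sign of the result comes from the polarity and its size from the magnitude, so a long, clearly negative text scores further below zero than a short one.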


[Diagram: Pub/Sub Topic → Dataflow Pipeline → BigQuery Tables. The pipeline reads tweets from Pub/Sub, then splits into two branches: one writes tweets on BigQuery; the other filters and evaluates sentiment, then writes tweets on BigQuery]



● We just add the additional necessary steps.

// TwitterProcessor.java

public static void main(String[] args) {

  Pipeline p = Pipeline.create();

  PCollection<String> tweets = p.apply(PubsubIO.Read.topic("...blackfridaytweets"));

  // New steps: filter and evaluate sentiment, format, write to a second table
  PCollection<String> sentTweets = tweets.apply(ParDo.of(new DoFilterAndProcess()));
  PCollection<TableRow> formSentTweets = sentTweets.apply(ParDo.of(new DoFormat()));
  formSentTweets.apply(BigQueryIO.Write.to(sentTableReference));

  PCollection<TableRow> formattedTweets = tweets.apply(ParDo.of(new DoFormat()));

  formattedTweets.apply(BigQueryIO.Write.to(tableReference));

  p.run();

}

JAVA
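The shape of the extended pipeline, one input feeding two independent branches, can be mimicked with plain Python lists. This is a sketch only: the talk does not show what `DoFilterAndProcess` filters on, so the language check below is a hypothetical stand-in, and `fake_score` replaces the real sentiment call with fixed toy values:

```python
import json
from typing import Callable, Dict, List, Tuple

Scorer = Callable[[str], Tuple[float, float]]  # text -> (polarity, magnitude)


def run_branches(tweets: List[str], score: Scorer) -> Tuple[List[Dict], List[Dict]]:
    """Return (all_rows, sentiment_rows), mirroring the two branches of the pipeline."""
    # Branch 1: every tweet is formatted and kept.
    all_rows = [json.loads(t) for t in tweets]

    # Branch 2: keep a subset of tweets and attach a sentiment score to each.
    sentiment_rows = []
    for row in all_rows:
        if row.get("lang") != "en":  # hypothetical filter; the talk does not specify one
            continue
        polarity, magnitude = score(row["text"])
        sentiment_rows.append({**row, "sentiment": polarity * magnitude})
    return all_rows, sentiment_rows


def fake_score(text: str) -> Tuple[float, float]:
    # Stand-in for a call to a sentiment service; returns fixed toy values.
    return (-1.0, 2.0) if "bad" in text else (1.0, 0.5)


tweets = [
    '{"text": "bad deals this year", "lang": "en"}',
    '{"text": "ottime offerte", "lang": "it"}',
]
all_rows, sentiment_rows = run_branches(tweets, fake_score)
```

Passing the scorer in as a parameter keeps the branch logic testable without network access.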


● The update process preserves all in-flight data.

$ mvn compile exec:java -Dexec.mainClass=it.noovle.dataflow.TwitterProcessor -Dexec.args="--streaming --update --jobName=twitterprocessor-lorenzo-1107222550"

[...]

INFO: To cancel the job using the 'gcloud' tool, run:
> gcloud alpha dataflow jobs --project=codemotion-2016-demo cancel 2016-11-19_15_49_53-5264074060979116717
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 18.131s
[INFO] Finished at: Sun Nov 20 00:49:54 CET 2016
[INFO] Final Memory: 28M/362M
[INFO] ------------------------------------------------------------------------

SHELL

ingest (Pub/Sub + Kubernetes) → process (Dataflow) + enrich (Natural Language API) → store (BigQuery) → analyze (DataStudio)

We did it!

“To serve and protect”

Live demo

Polarity: -1.0, Magnitude: 1.5

Polarity: -1.0, Magnitude: 2.1

Open on Thursday? Bad idea..

Beta test your deals!

Thank you!

[email protected]
