streaming analytics on google cloud platform, by javier ramirez, teowaki


Upload: javier-ramirez

Posted on 22-Jan-2018


Category: Data & Analytics



TRANSCRIPT

Page 1: Streaming analytics on Google Cloud Platform, by Javier Ramirez, teowaki

End-to-end streaming analytics on Google Cloud Platform

From event capture to dashboard to monitoring

Javier Ramirez @supercoco9

Page 2: Streaming analytics on Google Cloud Platform, by Javier Ramirez, teowaki

Hard problems and easy problems

And hard problems that look easy


Page 3: Streaming analytics on Google Cloud Platform, by Javier Ramirez, teowaki

Calculate the average of several numbers.

An easy problem

Page 4: Streaming analytics on Google Cloud Platform, by Javier Ramirez, teowaki

Calculate the average of several numbers. By the way, they might be MANY numbers. They will probably not fit in memory. They might not even fit in one file or on a single hard drive.

An easy big data problem
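A minimal sketch of why this version of the problem is still easy: an average needs only a count and a sum, so the numbers never have to fit in memory or in one file. The function names `running_mean` and `merge_partials` are illustrative and do not come from the talk.

```python
from typing import Iterable, Tuple


def running_mean(values: Iterable[float]) -> float:
    """Average a stream in a single pass, keeping only a count and a sum."""
    count = 0
    total = 0.0
    for value in values:
        count += 1
        total += value
    if count == 0:
        raise ValueError("cannot average an empty stream")
    return total / count


def merge_partials(partials: Iterable[Tuple[int, float]]) -> float:
    """Combine (count, sum) pairs computed separately, e.g. one per file or disk."""
    count = 0
    total = 0.0
    for partial_count, partial_sum in partials:
        count += partial_count
        total += partial_sum
    if count == 0:
        raise ValueError("no data in any partial result")
    return total / count
```

Because (count, sum) pairs can be merged in any order, each file or disk can be processed independently and the partial results combined at the end.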

Page 5: Streaming analytics on Google Cloud Platform, by Javier Ramirez, teowaki

Calculate the average of several numbers. By the way, they might be MANY numbers. They will probably not fit in memory. They might not even fit in one file or on a single hard drive.

Truth is they will not be in one file, but they will be streamed live from different sensors…

An easy big data and streaming problem

Page 6: Streaming analytics on Google Cloud Platform, by Javier Ramirez, teowaki

Calculate the average of several numbers. By the way, they might be MANY numbers. They will probably not fit in memory. They might not even fit in one file or on a single hard drive.

Truth is they will not be in one file, but they will be streamed live from different sensors… In different parts of the world

A not so easy streaming data problem

Page 7: Streaming analytics on Google Cloud Platform, by Javier Ramirez, teowaki

Calculate the average of several numbers. By the way, they might be MANY numbers. They will probably not fit in memory. They might not even fit in one file or on a single hard drive.

Truth is they will not be in one file, but they will be streamed live from different sensors… In different parts of the world

Some sensors might send a few events per hour, some a few thousands per second…

An autoscaling streaming problem

Page 8: Streaming analytics on Google Cloud Platform, by Javier Ramirez, teowaki

Calculate the average of several numbers. By the way, they might be MANY numbers. They will probably not fit in memory. They might not even fit in one file or on a single hard drive.

Truth is they will not be in one file, but they will be streamed live from different sensors… In different parts of the world

Some sensors might send a few events per hour, some a few thousands per second… We want not just the total average of all the points, but the moving average every 30 seconds, for every sensor. And the hourly, daily, and monthly averages

A hard streaming analytics problem
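To make the per-sensor moving average concrete, here is a small single-process sketch in plain Python. It assumes readings for each sensor arrive in timestamp order, which is exactly the assumption a real streaming system cannot make. The class name `MovingAverage` and its parameters are hypothetical.

```python
from collections import defaultdict, deque


class MovingAverage:
    """Mean of the readings seen in the last `window_seconds`, per sensor."""

    def __init__(self, window_seconds: float = 30.0) -> None:
        self.window_seconds = window_seconds
        # sensor_id -> deque of (timestamp, value), oldest first
        self._readings = defaultdict(deque)
        # sensor_id -> sum of the values currently inside the window
        self._sums = defaultdict(float)

    def add(self, sensor_id: str, timestamp: float, value: float) -> float:
        """Record a reading and return the sensor's current moving average."""
        readings = self._readings[sensor_id]
        readings.append((timestamp, value))
        self._sums[sensor_id] += value
        # Evict readings that fell out of the window ending at `timestamp`.
        while readings[0][0] <= timestamp - self.window_seconds:
            _, old_value = readings.popleft()
            self._sums[sensor_id] -= old_value
        return self._sums[sensor_id] / len(readings)
```

This keeps every reading of the last 30 seconds in memory on one machine, so it illustrates the calculation rather than a way to run it at scale.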

Page 9: Streaming analytics on Google Cloud Platform, by Javier Ramirez, teowaki

Calculate the average of several numbers. By the way, they might be MANY numbers. They will probably not fit in memory. They might not even fit in one file or on a single hard drive.

Truth is they will not be in one file, but they will be streamed live from different sensors… In different parts of the world

Some sensors might send a few events per hour, some a few thousands per second… We want not just the total average of all the points, but the moving average every 30 seconds, for every sensor. And the hourly, daily, and monthly averages

Sometimes the sensors will have connectivity issues and will not send their data until later, but of course I want the calculations to still be correct

A real life analytics problem
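One way to keep the calculations correct when data arrives late is to group readings by the time the event happened rather than the time it was received, and to re-emit a corrected average when a late reading lands in an old window. The sketch below assumes fixed one-minute windows and a naive watermark (the largest event time seen so far); the class name and the `allowed_lateness` parameter are illustrative, and production systems track watermarks in a more sophisticated way.

```python
from typing import Dict, Optional, Tuple


class EventTimeAverager:
    """Per-window averages keyed by event time, tolerant of late readings."""

    def __init__(self, window_seconds: int = 60, allowed_lateness: int = 300) -> None:
        self.window_seconds = window_seconds
        self.allowed_lateness = allowed_lateness
        # Naive watermark: the largest event time seen so far.
        self.watermark = float("-inf")
        # window start -> (count, sum)
        self._windows: Dict[int, Tuple[int, float]] = {}

    def add(self, event_time: int, value: float) -> Optional[Tuple[int, float]]:
        """Return (window_start, updated_mean), or None if the reading is too late."""
        self.watermark = max(self.watermark, event_time)
        window_start = event_time - event_time % self.window_seconds
        window_end = window_start + self.window_seconds
        if window_end + self.allowed_lateness <= self.watermark:
            return None  # The window is closed for good; the reading is dropped.
        count, total = self._windows.get(window_start, (0, 0.0))
        count, total = count + 1, total + value
        self._windows[window_start] = (count, total)
        return window_start, total / count
```

A reading that arrives out of order but within the allowed lateness updates its original window, so the average for that minute is corrected instead of being attributed to the wrong one.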

Page 10: Streaming analytics on Google Cloud Platform, by Javier Ramirez, teowaki

All of the above, plus monitoring, alerts, self-healing, a way to query the data efficiently, and a pretty dashboard on top

What your client/boss will expect

Calculate the average of several numbers. By the way, they might be MANY numbers. They will probably not fit in memory. They might not even fit in one file or on a single hard drive.

Truth is they will not be in one file, but they will be streamed live from different sensors… In different parts of the world

Some sensors might send a few events per hour, some a few thousands per second… We want not just the total average of all the points, but the moving average every 30 seconds, for every sensor. And the hourly, daily, and monthly averages

Sometimes the sensors will have connectivity issues and will not send their data until later, but of course I want the calculations to still be correct

Page 11: Streaming analytics on Google Cloud Platform, by Javier Ramirez, teowaki

… is easier said than done

Page 12: Streaming analytics on Google Cloud Platform, by Javier Ramirez, teowaki

Our complete system in 100 lines of Java, of which 90 are mostly boilerplate and configuration

<= Don’t try to read that. I’ll zoom in later

Page 13: Streaming analytics on Google Cloud Platform, by Javier Ramirez, teowaki


Page 14: Streaming analytics on Google Cloud Platform, by Javier Ramirez, teowaki

What we need

The components of a streaming data pipeline

Data acquisition

Data validation

Transformation / Aggregation

Storage / Analytics

Visualization

Monitoring and alerts for all the components

Page 15: Streaming analytics on Google Cloud Platform, by Javier Ramirez, teowaki

Data Acquisition

Sending and receiving data at scale


Page 16: Streaming analytics on Google Cloud Platform, by Javier Ramirez, teowaki

Google Cloud Pub/Sub

Google Cloud Pub/Sub brings the scalability, flexibility, and reliability of enterprise message-oriented middleware to the cloud. By providing many-to-many, asynchronous messaging that decouples senders and receivers, it allows for secure and highly available communication between independently written applications.

Google Cloud Pub/Sub delivers low-latency, durable messaging that helps developers quickly integrate systems hosted on the Google Cloud Platform and externally.

Ingest event streams from anywhere, at any scale, for simple, reliable, real-time stream analytics
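The many-to-many, decoupled model described above can be illustrated with a toy in-memory version. This is not the Cloud Pub/Sub client API; the `Topic` class and its methods are hypothetical and only show the idea that publishers write to a topic while each named subscription receives its own copy of every message.

```python
from collections import deque
from typing import Deque, Dict, List


class Topic:
    """Toy in-memory topic: every subscription gets every published message."""

    def __init__(self) -> None:
        self._subscriptions: Dict[str, Deque[str]] = {}

    def subscribe(self, name: str) -> None:
        """Create a named subscription with its own queue, if it does not exist."""
        self._subscriptions.setdefault(name, deque())

    def publish(self, message: str) -> None:
        """Publishers know nothing about receivers; they only write to the topic."""
        for queue in self._subscriptions.values():
            queue.append(message)

    def pull(self, name: str, max_messages: int = 10) -> List[str]:
        """Each subscription is consumed at its own pace, oldest message first."""
        queue = self._subscriptions[name]
        return [queue.popleft() for _ in range(min(max_messages, len(queue)))]
```

A dashboard and an archiver, for example, could each hold their own subscription and read the same events independently, without the sensors knowing that either exists.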

Page 17: Streaming analytics on Google Cloud Platform, by Javier Ramirez, teowaki

The Spotify proof of concept

Currently our production load peaks at around 700K events per second. To account for the future growth and possible disaster recovery scenarios, we settled on a test load of 2M events per second.

To make it extra hard for Pub/Sub, we wanted to publish this amount of traffic from a single data center, so that all the requests were hitting the Pub/Sub machines in the same zone. We made the assumption that Google plans zones as independent failure domains and that each zone can handle equal amounts of traffic.

In theory, if we’re able to push 2M messages to a single zone, we should be able to push number_of_zones * 2M messages across all zones.

Our hope was that the system would be able to handle this traffic on both the producing and consuming side for a long time without the service degrading.

https://labs.spotify.com/2016/03/03/spotifys-event-delivery-the-road-to-the-cloud-part-ii/

Page 18: Streaming analytics on Google Cloud Platform, by Javier Ramirez, teowaki

The Spotify proof of concept

https://labs.spotify.com/2016/03/03/spotifys-event-delivery-the-road-to-the-cloud-part-ii/

They pushed 2 million events per second (to two topics) from 29 servers, non-stop, for five days.

“We did not observe any lost messages whatsoever during the test period.”

Page 19: Streaming analytics on Google Cloud Platform, by Javier Ramirez, teowaki

The no operations advantage

https://labs.spotify.com/2016/03/10/spotifys-event-delivery-the-road-to-the-cloud-part-iii/

Event Delivery System In Cloud

We’re actively working on bringing the new system to production. The preliminary numbers we obtained from running the new system in the experimental phase look very promising. The worst end-to-end latency observed with the new system is four times lower than the end-to-end latency of old system.

But boosting performance isn’t the only thing we want to get from the new system. Our bet is that by using cloud-managed products we will have a much lower operational overhead. That in turn means we will have much more time to make Spotify’s products better.

Page 20: Streaming analytics on Google Cloud Platform, by Javier Ramirez, teowaki

A truly global network

Page 21: Streaming analytics on Google Cloud Platform, by Javier Ramirez, teowaki

Pub/Sub now works with Cloud IoT Core

Device Manager

The device manager allows individual devices to be configured and managed securely in a coarse-grained way; management can be done through a console or programmatically. The device manager establishes the identity of a device, and provides the mechanism for authenticating a device when connecting. It also maintains a logical configuration of each device and can be used to remotely control the device from the cloud.

Protocol Bridge

The protocol bridge provides connection endpoints for protocols with automatic load balancing for all device connections. The protocol bridge has native support for secure connection over MQTT, an industry-standard IoT protocol. The protocol bridge publishes all device telemetry to Cloud Pub/Sub, which can then be consumed by downstream analytic systems.

Which is very cool if you are into Arduino, Raspberry Pi, Android, or embedded systems

Page 22: Streaming analytics on Google Cloud Platform, by Javier Ramirez, teowaki

Storage & Analytics

Reliable, fast, scalable, and flexible. And no-ops


Page 23: Streaming analytics on Google Cloud Platform, by Javier Ramirez, teowaki

BigQuery

A database where you can send as much (or as little) data as you want, either batch or streaming, and run any SQL you want, no matter how big your data is.

Even if you have petabytes of data.

Even if you want to join data from different projects or from public data sources.

Even if you want to query external data in Google Sheets or Cloud Storage.

Even if you want to create your own User Defined Functions in JavaScript.

Page 24: Streaming analytics on Google Cloud Platform, by Javier Ramirez, teowaki

BigQuery also...

… is serverless and zero configuration. You never have to worry about memory, CPU, network, or disk. You send your data, you send your queries, you get results.

Behind the scenes BigQuery will use up to 2000 CPUs in parallel for your queries, and a huge amount of networked storage. But you don’t care.

You pay for how much data you send and how much data you query. If you are not using the database, you are not paying anything. But it’s always available.

Page 25: Streaming analytics on Google Cloud Platform, by Javier Ramirez, teowaki

Hope you are not easily impressed

How long would it take to read 4 terabytes from a hard drive at 100 MB/s?

And to filter 100 billion data points using a regular expression for each?

And to move 278 GB across a 1 Gbps network?

Page 26: Streaming analytics on Google Cloud Platform, by Javier Ramirez, teowaki

Hope you are not easily impressed

How long would it take to read 4 terabytes from a hard drive at 100 MB/s?

About 11 hours

And to filter 100 billion data points using a regular expression for each?

About 27 hours

And to move 278 GB across a 1 Gbps network?

About 40 minutes
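The figures above can be checked with back-of-the-envelope arithmetic. The sketch below assumes decimal units (1 TB = 1,000,000 MB, 1 GB = 8 gigabits) and ignores protocol and seek overhead; the function names are illustrative.

```python
def hours_to_read(terabytes: float, mb_per_second: float) -> float:
    """Hours needed to read `terabytes` sequentially at `mb_per_second`."""
    megabytes = terabytes * 1_000_000
    return megabytes / mb_per_second / 3600


def minutes_to_transfer(gigabytes: float, gigabits_per_second: float) -> float:
    """Minutes needed to move `gigabytes` over a link of `gigabits_per_second`."""
    gigabits = gigabytes * 8
    return gigabits / gigabits_per_second / 60


def implied_checks_per_second(data_points: float, hours: float) -> float:
    """Regex evaluations per second implied by filtering `data_points` in `hours`."""
    return data_points / (hours * 3600)


read_hours = hours_to_read(4, 100)                          # a little over 11 hours
transfer_minutes = minutes_to_transfer(278, 1)              # about 37 minutes
regex_rate = implied_checks_per_second(100_000_000_000, 27) # roughly a million per second
```

The disk read comes out at just over 11 hours and the network transfer at about 37 minutes, consistent with the rounded figures on the slide. The 27-hour estimate corresponds to roughly one million regular expression checks per second.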

Page 27: Streaming analytics on Google Cloud Platform, by Javier Ramirez, teowaki

Hope you are not easily impressed

Page 28: Streaming analytics on Google Cloud Platform, by Javier Ramirez, teowaki

Hope you are not easily impressed

Page 29: Streaming analytics on Google Cloud Platform, by Javier Ramirez, teowaki
Page 30: Streaming analytics on Google Cloud Platform, by Javier Ramirez, teowaki

We will use a simple table for our system

Page 31: Streaming analytics on Google Cloud Platform, by Javier Ramirez, teowaki

Data Validation

Apparently simple, but always a pain


Page 32: Streaming analytics on Google Cloud Platform, by Javier Ramirez, teowaki
Page 33: Streaming analytics on Google Cloud Platform, by Javier Ramirez, teowaki

Google Cloud Dataprep

An intelligent cloud data service to visually explore, clean, and prepare data for analysis

Page 34: Streaming analytics on Google Cloud Platform, by Javier Ramirez, teowaki

Cloud Dataprep: Explorer & Suggestions

Page 35: Streaming analytics on Google Cloud Platform, by Javier Ramirez, teowaki

Cloud Dataprep: Transforms & Scripts

Page 36: Streaming analytics on Google Cloud Platform, by Javier Ramirez, teowaki

Cloud Dataprep: No-ops execution & reports

Page 37: Streaming analytics on Google Cloud Platform, by Javier Ramirez, teowaki

Transformation & Aggregation

Batch and streaming ETL jobs, and data pipelines

Page 38: Streaming analytics on Google Cloud Platform, by Javier Ramirez, teowaki

Apache BEAM: An advanced unified programming model

Apache Beam is an open source, unified model for defining both batch and streaming data-parallel processing pipelines. Using one of the open source Beam SDKs, you build a program that defines the pipeline. The pipeline is then executed by one of Beam’s supported distributed processing back-ends, which include Apache Apex, Apache Flink, Apache Spark, and Google Cloud Dataflow.

Beam is particularly useful for Embarrassingly Parallel data processing tasks, in which the problem can be decomposed into many smaller bundles of data that can be processed independently and in parallel. You can also use Beam for Extract, Transform, and Load (ETL) tasks and pure data integration. These tasks are useful for moving data between different storage media and data sources, transforming data into a more desirable format, or loading data onto a new system.
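The idea of decomposing a job into bundles that are processed independently can be sketched without any Beam dependency. The example below uses only the Python standard library and a thread pool on one machine; `make_bundles` and `parallel_map` are hypothetical helpers, not part of any Beam SDK, and a real runner would distribute the bundles across workers.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")
U = TypeVar("U")


def make_bundles(items: Sequence[T], bundle_size: int) -> List[Sequence[T]]:
    """Split the input into consecutive bundles of at most `bundle_size` items."""
    return [items[i:i + bundle_size] for i in range(0, len(items), bundle_size)]


def parallel_map(
    fn: Callable[[T], U],
    items: Sequence[T],
    bundle_size: int = 1000,
    workers: int = 4,
) -> List[U]:
    """Apply `fn` to every item, one bundle per task, preserving input order."""
    bundles = make_bundles(items, bundle_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # No bundle depends on another, so they can run in any order.
        processed = pool.map(lambda bundle: [fn(x) for x in bundle], bundles)
        return [result for bundle in processed for result in bundle]
```

Since no bundle needs the result of another, adding workers shortens the job without changing the code that transforms each element.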

Page 39: Streaming analytics on Google Cloud Platform, by Javier Ramirez, teowaki

Apache BEAM: A basic pipeline

Page 40: Streaming analytics on Google Cloud Platform, by Javier Ramirez, teowaki

Apache BEAM: Streaming is hard

Page 41: Streaming analytics on Google Cloud Platform, by Javier Ramirez, teowaki

Apache BEAM: Streaming is hard

Page 42: Streaming analytics on Google Cloud Platform, by Javier Ramirez, teowaki

Averages with BEAM: Overview

Boilerplate and configuration

Writing the output to BigQuery

This is the code that actually processes and aggregates the data

Start the pipeline

Page 43: Streaming analytics on Google Cloud Platform, by Javier Ramirez, teowaki

Averages with BEAM: Config

Page 44: Streaming analytics on Google Cloud Platform, by Javier Ramirez, teowaki

Averages with BEAM: Output to BigQuery

Page 45: Streaming analytics on Google Cloud Platform, by Javier Ramirez, teowaki

Averages with BEAM: The processing itself

Transform/Filter. We are just parsing a line of text into multiple fields

Aggregate. We are outputting the mean speed of the last minute per sensor, every 30 seconds
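The two steps can be approximated in plain Python to show what the windowing does. This is not the Java Beam code from the slide: the input format `timestamp,sensor_id,speed` is an assumption, and `parse_line`, `window_starts`, and `mean_speed_per_window` are hypothetical names. Windows are 60 seconds long and start every 30 seconds, so each reading falls into two overlapping windows.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

WINDOW_SIZE = 60    # seconds covered by each window
WINDOW_PERIOD = 30  # a new window starts this often


def parse_line(line: str) -> Tuple[int, str, float]:
    """Transform step: split an assumed 'timestamp,sensor_id,speed' line into fields."""
    timestamp, sensor_id, speed = line.strip().split(",")
    return int(timestamp), sensor_id, float(speed)


def window_starts(timestamp: int) -> List[int]:
    """Starts of every window [start, start + WINDOW_SIZE) that contains `timestamp`."""
    last_start = timestamp - timestamp % WINDOW_PERIOD
    return list(range(last_start, timestamp - WINDOW_SIZE, -WINDOW_PERIOD))


def mean_speed_per_window(lines: Iterable[str]) -> Dict[Tuple[str, int], float]:
    """Aggregate step: mean speed per (sensor_id, window_start)."""
    sums = defaultdict(lambda: [0, 0.0])  # (sensor_id, start) -> [count, total]
    for line in lines:
        timestamp, sensor_id, speed = parse_line(line)
        for start in window_starts(timestamp):
            accumulator = sums[(sensor_id, start)]
            accumulator[0] += 1
            accumulator[1] += speed
    return {key: total / count for key, (count, total) in sums.items()}
```

For example, a reading at second 70 is counted in the windows starting at seconds 30 and 60, which is why a new one-minute average can be emitted every 30 seconds.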

Page 46: Streaming analytics on Google Cloud Platform, by Javier Ramirez, teowaki

Google Cloud Dataflow: BEAM with no-operations

Google developed BEAM internally as a closed-source product. Then they realised it would make sense to open-source it, and they donated it to the Apache community.

Anyone can use BEAM completely for free, and choose the runner on which to execute their pipeline.

Google Cloud Dataflow is a BEAM runner to execute your pipelines with no-operations, with logging, monitoring, auto-scaling, shuffling, and dynamic re-balancing.

It’s like BEAM, but as a managed service.

Page 47: Streaming analytics on Google Cloud Platform, by Javier Ramirez, teowaki

Demo time


Page 48: Streaming analytics on Google Cloud Platform, by Javier Ramirez, teowaki

Three instances in three continents

Page 49: Streaming analytics on Google Cloud Platform, by Javier Ramirez, teowaki

Our dataflow pipeline ready to accept data

Page 50: Streaming analytics on Google Cloud Platform, by Javier Ramirez, teowaki

Let’s start sending some data

Page 51: Streaming analytics on Google Cloud Platform, by Javier Ramirez, teowaki

Our dataflow pipeline seems to be working fine.

64 elements per second should be easy.

Page 52: Streaming analytics on Google Cloud Platform, by Javier Ramirez, teowaki

Let’s send much more data from all over!

Page 53: Streaming analytics on Google Cloud Platform, by Javier Ramirez, teowaki

Our dataflow pipeline starts feeling the heat.

Receiving 2440 elements per second now.

Page 54: Streaming analytics on Google Cloud Platform, by Javier Ramirez, teowaki

And now we are processing over 20,000 elements per second at the 1st step.

But the lag starts to increase

Page 55: Streaming analytics on Google Cloud Platform, by Javier Ramirez, teowaki

Auto scaling to the rescue. Three workers now

Page 56: Streaming analytics on Google Cloud Platform, by Javier Ramirez, teowaki

And lag goes back to normal

Page 57: Streaming analytics on Google Cloud Platform, by Javier Ramirez, teowaki

And back to just one worker

Page 58: Streaming analytics on Google Cloud Platform, by Javier Ramirez, teowaki

Monitoring

What I can’t see doesn’t exist


Page 59: Streaming analytics on Google Cloud Platform, by Javier Ramirez, teowaki

Google Stackdriver monitoring

Page 60: Streaming analytics on Google Cloud Platform, by Javier Ramirez, teowaki

Google Stackdriver monitoring

Page 61: Streaming analytics on Google Cloud Platform, by Javier Ramirez, teowaki

Google Stackdriver alerts

Page 62: Streaming analytics on Google Cloud Platform, by Javier Ramirez, teowaki

Visualization

Because an image is worth a thousand logs


Page 63: Streaming analytics on Google Cloud Platform, by Javier Ramirez, teowaki

Google Data Studio

Page 64: Streaming analytics on Google Cloud Platform, by Javier Ramirez, teowaki

Google Data Studio: my dashboard

Page 65: Streaming analytics on Google Cloud Platform, by Javier Ramirez, teowaki

Google Data Studio: data sources

Page 66: Streaming analytics on Google Cloud Platform, by Javier Ramirez, teowaki

Google Data Studio: data sources

Page 67: Streaming analytics on Google Cloud Platform, by Javier Ramirez, teowaki

Google Data Studio: drag and drop

Page 68: Streaming analytics on Google Cloud Platform, by Javier Ramirez, teowaki

Google Data Studio: drag and drop

Page 69: Streaming analytics on Google Cloud Platform, by Javier Ramirez, teowaki

Almost there

Just one more slide


Page 70: Streaming analytics on Google Cloud Platform, by Javier Ramirez, teowaki

Cloud IoT Core

Cloud Dataprep

Stackdriver: Monitoring, Logging, Error Reporting

All together now!

Page 71: Streaming analytics on Google Cloud Platform, by Javier Ramirez, teowaki

CHEERS!

I’m happy to answer any questions you may have at lunchtime or during the coffee breaks.

Or ping me at @supercoco9 on Twitter. You’ve got 280 chars now.

Demo source code available at: https://github.com/GoogleCloudPlatform/training-data-analyst/tree/master/courses/streaming/process

Javier Ramirez

Page 72: Streaming analytics on Google Cloud Platform, by Javier Ramirez, teowaki

End-to-end streaming analytics on Google Cloud Platform

From event capture to dashboard to monitoring

Javier Ramirez @supercoco9
