The Journey of Moving from AWS ELK to a GCP Data Pipeline

Post on 14-Apr-2017

Category:

Engineering

Build DMP on top of GCP

VMFive - Randy Huang

Agenda

• Migrated Pipeline to GCP

• Cost Comparison

• Business Use Case

• Fluentd Demo

ELK + AWS EMR

Kinesis Lambda

Pros & Cons

• Pros:

• Well supported.

• Well documented.

• Easy to find references.

• Cons:

• High cost.

• Not open source.

• Capacity has to be sized up front.

Pipeline on GCP

Dataflow

BigQuery

Machine Learning

Data Visualization

Compute Engine

Global Load Balancing

Data Studio

The Products and Services logos may be used to accurately reference Google's technology and tools, for instance in architecture diagrams.

Batch

BI Analysis

Storage Cloud Storage

Processing Cloud Dataflow

Streaming

Time Series Streaming Cloud Pub/Sub

Storage BigQuery

Targeting Engines

Data Sources

Machine Learning Applications

API Backend Compute Engine

Spark MLlib Cloud Dataproc

App Engine

Transform Data

Hosted Models Cloud Machine Learning

Real-Time Prediction API

Device Related Cloud Pub/Sub

Behavior Related Cloud Pub/Sub

3rd Party Data Cloud Pub/Sub

Redis Compute Engine

Pros & Cons

• Pros:

• Cost-effective.

• Operation-effective.

• Google has your back.

• Cons:

• APIs/SDKs change every day.

• Some services are still in beta.

• Docs are scattered everywhere.

Workflow Monitoring

• Digdag (alternatives: Airflow/Oozie/Luigi)

• Native support for Python & Ruby

• Multi-cloud

• Modular

• Workflow as code

• Docker support

• Alerting to Slack

Digdag Sample

Digdag

Cost Comparison

• $2000 per month on AWS

• About $200 per month on GCP for production

• About another $200 per month for dev

• 50M events per month
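To make the comparison easier to read, the figures above can be turned into a cost per million events. This is only a rough sketch: it assumes the 50M events per month applies equally to the AWS and GCP setups, which the slide does not state explicitly, and the names `MONTHLY_COST_USD` and `cost_per_million_events` are made up for illustration.

```python
# Figures taken from the slide. Applying the same event volume to both
# platforms is an assumption made here for the sake of comparison.
MONTHLY_EVENTS = 50_000_000
MONTHLY_COST_USD = {
    "aws": 2000,
    "gcp_prod": 200,
    "gcp_prod_and_dev": 200 + 200,
}


def cost_per_million_events(monthly_cost_usd, monthly_events=MONTHLY_EVENTS):
    """Return the USD spent per one million events."""
    return monthly_cost_usd / (monthly_events / 1_000_000)


if __name__ == "__main__":
    for name, cost in MONTHLY_COST_USD.items():
        print(f"{name}: ${cost_per_million_events(cost):.2f} per million events")
```

Under that assumption, AWS comes to roughly $40 per million events, against roughly $4 for GCP production alone, or $8 with the dev environment included.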

Business Use Case

• Digital Ads Targeting

• User Behavior Tagging

• BI

• GEO Reporting

• KPI Reporting

• User Demographic

Some Tips

• BigQuery

• https://status.cloud.google.com/incident/bigquery/18022

• Solved by Fluentd’s Retry and HA

• Dataflow's SDK and docs are not in sync

• Dataflow side inputs have a bug in streaming mode

• Compute Engine SLB - TCP/UDP setup for forwarding
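The "Retry" tip above, riding out a BigQuery incident by retrying failed writes, can be sketched as a retry loop with exponential backoff. This is not Fluentd's code: Fluentd buffers events and retries failed flushes itself. The function `send_with_retry` and its parameters are hypothetical and only show the idea.

```python
import time


def send_with_retry(send, payload, max_retries=5, base_wait=1.0, sleep=time.sleep):
    """Call send(payload), retrying with exponentially growing waits.

    `send` is any callable that raises on failure (for example a
    hypothetical wrapper around a BigQuery insert). After the last
    allowed retry the original exception is re-raised.
    """
    for attempt in range(max_retries + 1):
        try:
            return send(payload)
        except Exception:
            if attempt == max_retries:
                raise
            # Wait 1x, 2x, 4x, ... the base interval before the next try.
            sleep(base_wait * 2 ** attempt)
```

The `sleep` parameter is injectable so the backoff schedule can be checked without actually waiting.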

Fluentd Update

• Release notes for v0.14

• Sub-second event flush

• New plugin APIs support formatting configuration values dynamically

(e.g., path /my/dest/${tag}/mydata.%Y-%m-%d.log)

• Secure Forward
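As an illustration of the dynamic path example above, the following sketch shows what expanding `/my/dest/${tag}/mydata.%Y-%m-%d.log` for a single event might look like. It only mimics the idea in plain Python and is not Fluentd's implementation; `expand_path` and the tag `nginx.access` are hypothetical.

```python
from datetime import datetime


def expand_path(template, tag, event_time):
    """Expand strftime-style time placeholders, then substitute ${tag}."""
    # Time placeholders are expanded first so that a '%' inside the tag
    # is never interpreted as a format directive.
    return event_time.strftime(template).replace("${tag}", tag)


if __name__ == "__main__":
    path = expand_path(
        "/my/dest/${tag}/mydata.%Y-%m-%d.log",
        tag="nginx.access",
        event_time=datetime(2017, 4, 14),
    )
    print(path)  # /my/dest/nginx.access/mydata.2017-04-14.log
```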

Demo

• Nginx -> Fluentd -> BigQuery -> Data Studio

• MySQL -> Fluentd -> BigQuery
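In the first demo flow, the step between Nginx and BigQuery amounts to turning each access-log line into a structured row. In the demo Fluentd does this; the sketch below only illustrates the transformation in plain Python, assuming Nginx's default combined log format. The field names and the sample line are hypothetical.

```python
import re

# Matches the leading fields of an access-log line in combined format.
LINE_RE = re.compile(
    r'(?P<remote>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<code>\d{3}) (?P<size>\d+|-)'
)


def parse_access_line(line):
    """Return a dict suitable for loading as one row, or None if unparsable."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    row = match.groupdict()
    row["code"] = int(row["code"])
    # Nginx may log '-' when no body was sent; store that as 0 bytes.
    row["size"] = 0 if row["size"] == "-" else int(row["size"])
    return row


if __name__ == "__main__":
    sample = (
        '203.0.113.7 - - [14/Apr/2017:10:00:00 +0800] '
        '"GET /ads?id=1 HTTP/1.1" 200 512 "-" "curl/7.50"'
    )
    print(parse_access_line(sample))
```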
