data processing · big data in google cloud platform • machine learning platform(alpha) • fast,...

Jelena Pjesivac-Grbovic, Staff software engineer, Cloud Big Data

Data Processing with Apache Beam (incubating) and Google Cloud Dataflow

XLDB’16 - May 2016

In collaboration with Frances Perry, Tyler Akidau, and the Dataflow team

Agenda

1 Infinite, Out-of-Order Data Sets

2 What, Where, When, How

3 Apache Beam (incubating)

4 Google Cloud Dataflow

1 Infinite, Out-of-Order Data Sets

Data...

...can be big...

...really, really big...

[Figure: data grouped by day: Tuesday, Wednesday, Thursday]

… maybe infinitely big...

[Figure: timeline from 8:00 to 14:00]

… with unknown delays.

[Figure: timeline from 8:00 to 14:00, with elements labeled 8:00]

Element-wise transformations

[Figure: axis: Processing Time, 8:00 to 14:00]

Aggregating via Processing-Time Windows

[Figure: axis: Processing Time, 8:00 to 14:00]

Aggregating via Event-Time Windows

[Figure: panels: Input, Output; axes: Event Time and Processing Time, 10:00 to 15:00]

Formalizing Event-Time Skew

[Figure: Processing Time vs. Event Time; labels: Ideal, Reality, Skew]

Formalizing Event-Time Skew

Watermarks describe event time progress.

"No timestamp earlier than the watermark will be seen"

[Figure: Processing Time vs. Event Time; labels: Ideal, ~Watermark, Skew]

Often heuristic-based.

Too Slow? Results are delayed.

Too Fast? Some data is late.
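The trade-off above can be sketched in plain Java. This is only an illustration under stated assumptions, not how any Beam runner computes its watermark: the class name `HeuristicWatermark` and the `allowedSkewMillis` knob are hypothetical, and the heuristic simply trails the newest event time seen so far by a fixed amount. A larger skew delays results; a smaller one marks more elements as late.

```java
import java.util.List;

/** Illustrative only: a heuristic watermark that trails the newest event time seen. */
final class HeuristicWatermark {
    private final long allowedSkewMillis;             // hypothetical tuning knob
    private long maxEventTimeMillis = Long.MIN_VALUE; // nothing observed yet

    HeuristicWatermark(long allowedSkewMillis) {
        this.allowedSkewMillis = allowedSkewMillis;
    }

    /** Current watermark estimate: newest event time minus the allowed skew. */
    long current() {
        return maxEventTimeMillis == Long.MIN_VALUE
            ? Long.MIN_VALUE
            : maxEventTimeMillis - allowedSkewMillis;
    }

    /** Records an element and reports whether it arrived behind the watermark (late). */
    boolean observe(long eventTimeMillis) {
        boolean late = eventTimeMillis < current();
        maxEventTimeMillis = Math.max(maxEventTimeMillis, eventTimeMillis);
        return late;
    }

    public static void main(String[] args) {
        HeuristicWatermark wm = new HeuristicWatermark(60_000);
        for (long t : List.of(100_000L, 130_000L, 90_000L, 200_000L, 120_000L)) {
            boolean late = wm.observe(t);
            System.out.println("eventTime=" + t + " late=" + late + " watermark=" + wm.current());
        }
    }
}
```

With a 60-second skew, the element stamped 90,000 is still accepted, while the one stamped 120,000 arrives after the watermark has advanced to 140,000 and is reported as late.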

2 What, Where, When, How

What are you computing?

Where in event time?

When in processing time?

How do refinements relate?

What are you computing?

What Where When How

[Figure: three kinds of transforms: Element-Wise, Aggregating, Composite]

What: Computing Integer Sums

// Collection of raw log lines
PCollection<String> raw = IO.read(...);

// Element-wise transformation into team/score pairs
PCollection<KV<String, Integer>> input =
    raw.apply(ParDo.of(new ParseFn()));

// Composite transformation containing an aggregation
PCollection<KV<String, Integer>> scores =
    input.apply(Sum.integersPerKey());
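The same two steps can be mimicked without any Beam dependency to show what the pipeline computes. This is a sketch in plain Java streams, not the Beam API: the class `TeamScores` and the `team,score` line format are assumptions made for illustration, since the slides do not show what `ParseFn` parses.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

/** Illustrative only: element-wise parse followed by a per-key integer sum. */
final class TeamScores {
    /** Element-wise step: parse one line of the assumed form "team,score". */
    static Map.Entry<String, Integer> parse(String line) {
        String[] parts = line.split(",");
        return Map.entry(parts[0].trim(), Integer.parseInt(parts[1].trim()));
    }

    /** Aggregating step: sum the scores for each team. */
    static Map<String, Integer> sumPerKey(List<String> rawLines) {
        return rawLines.stream()
            .map(TeamScores::parse)
            .collect(Collectors.groupingBy(
                Map.Entry::getKey,
                TreeMap::new,
                Collectors.summingInt(Map.Entry::getValue)));
    }

    public static void main(String[] args) {
        // Prints {blue=5, red=7}
        System.out.println(sumPerKey(List.of("red,3", "blue,5", "red,4")));
    }
}
```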


Where in event time?

Windowing divides data into event-time-based finite chunks.

Often required when doing aggregations over unbounded data.


[Figure: Fixed, Sliding, and Sessions windows along Time, shown for Key 1, Key 2, Key 3]
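How an element lands in sliding or session windows can be sketched as follows. This is a minimal illustration, not Beam's implementation: `WindowSketch` and its method names are hypothetical, timestamps are plain `long` values, windows are half-open `[start, end)`, and sessions are built for a single key by merging per-element windows of length `gap` that overlap.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Illustrative only: assigning timestamps to sliding and session windows. */
final class WindowSketch {
    /** Starts of every sliding window [start, start + size) that contains ts. */
    static List<Long> slidingWindowStarts(long ts, long size, long period) {
        List<Long> starts = new ArrayList<>();
        long lastStart = ts - Math.floorMod(ts, period);
        for (long start = lastStart; start > ts - size; start -= period) {
            starts.add(start);
        }
        return starts;
    }

    /** Sessions for one key: each element opens [ts, ts + gap); overlapping windows merge. */
    static List<long[]> sessions(List<Long> timestamps, long gap) {
        List<Long> sorted = new ArrayList<>(timestamps);
        Collections.sort(sorted);
        List<long[]> result = new ArrayList<>();
        for (long ts : sorted) {
            long[] last = result.isEmpty() ? null : result.get(result.size() - 1);
            if (last != null && ts < last[1]) {
                last[1] = Math.max(last[1], ts + gap); // extend the open session
            } else {
                result.add(new long[] {ts, ts + gap}); // start a new session
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(slidingWindowStarts(130, 120, 60));
        for (long[] s : sessions(List.of(10L, 12L, 30L, 11L), 5)) {
            System.out.println("[" + s[0] + ", " + s[1] + ")");
        }
    }
}
```

A fixed window is the special case where the period equals the size, so each timestamp falls into exactly one window.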

Where: Fixed 2-minute Windows


PCollection<KV<String, Integer>> scores = input
    .apply(Window.into(FixedWindows.of(Minutes(2))))
    .apply(Sum.integersPerKey());
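What the windowed sum produces can be shown with a small stand-alone sketch. It is not Beam code: `FixedWindowSums` and its `Event` class are hypothetical, event times are milliseconds, and each window is identified by its start time. The result depends only on event time, so arrival order does not matter.

```java
import java.time.Duration;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative only: per-key integer sums inside fixed event-time windows. */
final class FixedWindowSums {
    /** Hypothetical input element: a team's score stamped with its event time. */
    static final class Event {
        final String team;
        final int score;
        final long eventTimeMillis;

        Event(String team, int score, long eventTimeMillis) {
            this.team = team;
            this.score = score;
            this.eventTimeMillis = eventTimeMillis;
        }
    }

    /** Maps window start (millis) to the per-team sums for that window. */
    static Map<Long, Map<String, Integer>> sumPerKeyAndWindow(List<Event> events, Duration size) {
        long sizeMillis = size.toMillis();
        Map<Long, Map<String, Integer>> result = new TreeMap<>();
        for (Event e : events) {
            long windowStart = e.eventTimeMillis - Math.floorMod(e.eventTimeMillis, sizeMillis);
            result.computeIfAbsent(windowStart, w -> new TreeMap<>())
                  .merge(e.team, e.score, Integer::sum);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Event> events = List.of(
            new Event("red", 2, 130_000),  // arrives first, belongs to the second window
            new Event("red", 3, 30_000),
            new Event("blue", 5, 100_000),
            new Event("red", 4, 90_000));
        // Prints {0={blue=5, red=7}, 120000={red=2}}
        System.out.println(sumPerKeyAndWindow(events, Duration.ofMinutes(2)));
    }
}
```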


When in processing time?


• Triggers control when results are emitted.

• Triggers are often relative to the watermark.

[Figure: Processing Time vs. Event Time; labels: Ideal, ~Watermark, Skew]

When: Triggering at the Watermark



.apply(Window.into(FixedWindows.of(Minutes(2)))

.triggering(AtWatermark()))



When: Early and Late Firings



.apply(Window.into(FixedWindows.of(Minutes(2)))

.triggering(AtWatermark()

.withEarlyFirings(AtPeriod(Minutes(1)))

.withLateFirings(AtCount(1))))
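The firing pattern this trigger describes can be traced for a single window with a toy simulation. It is a simplification under stated assumptions, not Beam's trigger machinery: `EarlyLateFirings`, `TICK`, and `WATERMARK` are hypothetical names, a tick stands in for the early-firing period elapsing, scores are assumed non-negative so the negative sentinels cannot collide with them, and each pane reports the running sum.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: early, on-time, and late panes for one window. */
final class EarlyLateFirings {
    enum Timing { EARLY, ON_TIME, LATE }

    static final int TICK = -1;      // sentinel: the early-firing period elapsed
    static final int WATERMARK = -2; // sentinel: the watermark passed the end of the window

    /** Steps are in processing-time order: non-negative scores or one of the sentinels. */
    static List<String> run(int... steps) {
        List<String> panes = new ArrayList<>();
        int sum = 0;
        boolean newData = false;         // data seen since the last pane
        boolean watermarkPassed = false;
        for (int step : steps) {
            if (step == TICK) {
                if (!watermarkPassed && newData) {
                    panes.add(Timing.EARLY + ":" + sum); // speculative result
                    newData = false;
                }
            } else if (step == WATERMARK) {
                watermarkPassed = true;
                panes.add(Timing.ON_TIME + ":" + sum);   // input believed complete
                newData = false;
            } else {
                sum += step;
                if (watermarkPassed) {
                    panes.add(Timing.LATE + ":" + sum);  // fire on every late element
                } else {
                    newData = true;
                }
            }
        }
        return panes;
    }

    public static void main(String[] args) {
        // Prints [EARLY:3, ON_TIME:9, LATE:11]
        System.out.println(run(3, TICK, 5, 1, WATERMARK, 2));
    }
}
```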



How do refinements relate?


• How should multiple outputs per window accumulate?

• Appropriate choice depends on consumer.

Firing          Elements   Discarding   Accumulating   Acc. & Retracting
Speculative     [3]        3            3              3
Watermark       [5, 1]     6            9              9, -3
Late            [2]        2            11             11, -9
Last Observed              2            11             11
Total Observed             11           23             11

(Accumulating & Retracting not yet implemented.)
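The three modes can be reproduced with a few lines of plain Java. This is a sketch for checking the numbers, not a Beam API: `AccumulationModes` and its method names are hypothetical, and each inner list holds the elements that arrived before one firing.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: pane contents under the three accumulation modes. */
final class AccumulationModes {
    /** Each pane holds only the sum of elements that arrived since the last firing. */
    static List<Integer> discarding(List<List<Integer>> firings) {
        List<Integer> panes = new ArrayList<>();
        for (List<Integer> firing : firings) {
            panes.add(sum(firing));
        }
        return panes;
    }

    /** Each pane holds the running total over all elements seen so far. */
    static List<Integer> accumulating(List<List<Integer>> firings) {
        List<Integer> panes = new ArrayList<>();
        int total = 0;
        for (List<Integer> firing : firings) {
            total += sum(firing);
            panes.add(total);
        }
        return panes;
    }

    /** Each pane holds the new total plus a retraction of the previous total. */
    static List<List<Integer>> accumulatingAndRetracting(List<List<Integer>> firings) {
        List<List<Integer>> panes = new ArrayList<>();
        int total = 0;
        boolean first = true;
        for (List<Integer> firing : firings) {
            int previous = total;
            total += sum(firing);
            panes.add(first ? List.of(total) : List.of(total, -previous));
            first = false;
        }
        return panes;
    }

    private static int sum(List<Integer> values) {
        int s = 0;
        for (int v : values) {
            s += v;
        }
        return s;
    }

    public static void main(String[] args) {
        List<List<Integer>> firings = List.of(List.of(3), List.of(5, 1), List.of(2));
        System.out.println(discarding(firings));                // [3, 6, 2]
        System.out.println(accumulating(firings));              // [3, 9, 11]
        System.out.println(accumulatingAndRetracting(firings)); // [[3], [9, -3], [11, -9]]
    }
}
```

Summing every emitted value gives 11 for discarding and for accumulating and retracting, but 23 for plain accumulating, which is why the right choice depends on whether the consumer overwrites or adds up the panes it receives.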

How: Add Newest, Remove Previous


1. Classic Batch

2. Batch with Fixed Windows

3. Streaming

4. Streaming with Speculative + Late Data

5. Streaming With Retractions

6. Sessions

What / Where / When / How

3 Apache Beam (incubating)

The Evolution of Beam

[Figure: MapReduce, BigTable, Dremel, Colossus, Flume, Megastore, Spanner, PubSub, MillWheel, leading to Apache Beam]

What is Part of Apache Beam?

1. The Beam Model: What / Where / When / How

2. SDKs for writing Beam pipelines -- starting with Java

3. Runners for Existing Distributed Processing Backends
• Apache Flink (thanks to data Artisans)
• Apache Spark (thanks to Cloudera)
• Google Cloud Dataflow (fully managed service)
• Local (in-process) runner for testing

Apache Beam Technical Vision

1. End users: who want to write pipelines or transform libraries in a language that’s familiar.

2. SDK writers: who want to make Beam concepts available in new languages.

3. Runner writers: who have a distributed processing environment and want to support Beam pipelines.

[Figure: Beam Java, Beam Python, and Other Languages feed Beam Model: Pipeline Construction; Beam Model: Fn Runners connects to Runner A, Runner B, and Cloud Dataflow, each with its own Execution]

Growing the Beam Community

Collaborate - Beam is becoming a community-driven effort with participation from many organizations and contributors

Grow - We want to grow the Beam ecosystem and community with active, open involvement so Beam is a part of the larger OSS ecosystem

4 Google Cloud Dataflow

• Fully managed service for running Beam pipelines
• Dynamically provisioned, on-demand resources
  • VMs, temporary storage
• No tuning required
  • Autoscaling + Dynamic Work Rebalancing
• Built from the experience with Google internal products


No Tuning Required: Dynamic Work Rebalancing

• Advanced straggler mitigation technique
• Ensures all tasks finish at the same time

[Figure: Workers vs. Time, without DWR and with DWR]

• For more info google: “No shard left behind: dynamic work rebalancing in Google Cloud Dataflow”

https://cloud.google.com/blog/big-data/2016/05/no-shard-left-behind-dynamic-work-rebalancing-in-google-cloud-dataflow.html



No Tuning Required: Autoscaling

• Dynamically adjusts the number of workers to match the load
• Both for streaming and batch

[Figure: Workers vs. Time, two panels]

• For more info google: “Comparing Cloud Dataflow autoscaling to Spark and Hadoop”

https://cloud.google.com/blog/big-data/2016/03/comparing-cloud-dataflow-autoscaling-to-spark-and-hadoop



Integrations

• Apache Beam connectors
  • Google Cloud
    • Storage, BigQuery, BigTable, Datastore, Pub/Sub
  • External / Custom IO
    • Kafka, HDFS, many in flight
• Part of Google Cloud Platform
  • Monitoring UI
  • Cloud Logging
  • Cloud Debugger and Profiler
  • Stackdriver integration

Big Data in Google Cloud Platform

• BigQuery
  • A fast, economical, and fully managed data warehouse solution
• Dataflow
  • Fully managed, real-time data processing service for batch and streaming
• Dataproc
  • Fast, easy to use managed Spark and Hadoop service
• Datalab (beta)
  • Interactive large scale data analysis, exploration and visualization
• Pub/Sub
  • Reliable, many-to-many, asynchronous messaging service
• Genomics
  • Empowers scientists to organize the world’s genomics information

Machine Learning in Google Cloud Platform

• Machine Learning Platform (alpha)
  • Fast, large scale, easy to use Machine Learning service
• Vision API
  • Enables insights based on our powerful Vision APIs
• Speech API
  • Speech to text conversion powered by Machine Learning
• Translate API
  • Enables multilingual apps and programmatic translation

Learn More! Follow @GCPBigData + @ApacheBeam

Apache Beam (incubating): http://beam.incubator.apache.org

Google Cloud Dataflow: http://cloud.google.com/dataflow

Google Cloud Platform: http://cloud.google.com

Thank you!
