Data Processing with Apache Beam (incubating) and Google Cloud Dataflow

Jelena Pjesivac-Grbovic, Staff software engineer, Cloud Big Data

XLDB’16 - May 2016

In collaboration with Frances Perry, Tyler Akidau, and the Dataflow team


Page 1: Data Processing · Big Data in Google Cloud Platform • Machine Learning Platform(alpha) • Fast, large scale, easy to use Machine Learning service • Vision API • Enables insights

Jelena Pjesivac-Grbovic
Staff software engineer
Cloud Big Data

Data Processing with Apache Beam (incubating) and Google Cloud Dataflow

XLDB’16 - May 2016

In collaboration with Frances Perry, Tyler Akidau, and the Dataflow team

Page 2:

Agenda

1. Infinite, Out-of-Order Data Sets

2. What, Where, When, How

3. Apache Beam (incubating)

4. Google Cloud Dataflow

Page 3:

1. Infinite, Out-of-Order Data Sets

Page 4:

Data...

Page 5:

...can be big...

Page 6:

...really, really big...

[Figure: data split by day: Tuesday, Wednesday, Thursday]

Page 7:

… maybe infinitely big...

[Figure: timeline from 8:00 to 14:00]

Page 8:

… with unknown delays.

[Figure: timeline from 8:00 to 14:00, with several elements labeled 8:00]

Page 9:

Element-wise transformations

[Figure: elements along a Processing Time axis, 8:00 to 14:00]

Page 10:

Aggregating via Processing-Time Windows

[Figure: elements along a Processing Time axis, 8:00 to 14:00]

Page 11:

Aggregating via Event-Time Windows

[Figure: Input and Output timelines, 10:00 to 15:00, with axes labeled Event Time and Processing Time]
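The difference between the two kinds of windows can be shown with a small plain-Java sketch. It is not Beam code: the class name, the `{eventTime, processingTime, value}` row layout, and the sample numbers are assumptions made for illustration.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: the same elements summed per window, once by event
// time and once by processing (arrival) time.
class EventVsProcessingTime {
    // Each row is {eventTime, processingTime, value}; timeColumn selects
    // which of the two timestamps (0 or 1) decides the window.
    static Map<Long, Integer> sumByWindow(long[][] events, int timeColumn, long windowSize) {
        Map<Long, Integer> sums = new TreeMap<>();
        for (long[] e : events) {
            long t = e[timeColumn];
            long windowStart = t - Math.floorMod(t, windowSize);
            sums.merge(windowStart, (int) e[2], Integer::sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        // The second element happened at 20 but only arrived at 70.
        long[][] events = { {10, 15, 3}, {20, 70, 5}, {65, 75, 1} };
        System.out.println("by event time:      " + sumByWindow(events, 0, 60));
        System.out.println("by processing time: " + sumByWindow(events, 1, 60));
    }
}
```

With the sample data, the delayed element is counted in the first window when grouping by event time, but in the second window when grouping by processing time.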

Page 12:

Formalizing Event-Time Skew

[Figure: Processing Time plotted against Event Time, with curves labeled Ideal and Reality and the difference labeled Skew]

Page 13:

Formalizing Event-Time Skew

Watermarks describe event time progress.

"No timestamp earlier than the watermark will be seen"

Often heuristic-based.

Too Slow? Results are delayed.

Too Fast? Some data is late.

[Figure: Processing Time plotted against Event Time, with curves labeled Ideal and ~Watermark and the difference labeled Skew]
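One possible heuristic can be sketched in plain Java. This is an assumption for illustration, not how any particular runner computes its watermark: the class name, the fixed-lag policy, and the millisecond timestamps are all hypothetical. The watermark trails the largest event time seen so far by a fixed lag, and an element whose timestamp is already behind the watermark is reported as late.

```java
// Hypothetical sketch of a heuristic watermark: it trails the largest event
// time observed so far by a fixed lag. The policy is an assumption.
class HeuristicWatermark {
    private final long allowedLagMillis;
    private long maxEventTime = Long.MIN_VALUE;

    HeuristicWatermark(long allowedLagMillis) {
        this.allowedLagMillis = allowedLagMillis;
    }

    // Current estimate: no element older than this is expected any more.
    long current() {
        return maxEventTime == Long.MIN_VALUE
            ? Long.MIN_VALUE
            : maxEventTime - allowedLagMillis;
    }

    // Observes one element and returns true if it is behind the watermark (late).
    boolean observe(long eventTimeMillis) {
        boolean late = eventTimeMillis < current();
        maxEventTime = Math.max(maxEventTime, eventTimeMillis);
        return late;
    }

    public static void main(String[] args) {
        HeuristicWatermark wm = new HeuristicWatermark(60_000); // one minute of lag
        long[] eventTimes = {0, 30_000, 120_000, 90_000, 10_000};
        for (long t : eventTimes) {
            boolean late = wm.observe(t);
            System.out.println("event=" + t + " late=" + late + " watermark=" + wm.current());
        }
    }
}
```

A smaller lag moves the watermark faster, so results arrive sooner but more elements are classified as late; a larger lag does the opposite.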

Page 14:

2. What, Where, When, How

Page 15:

What are you computing?

Where in event time?

When in processing time?

How do refinements relate?

Page 16:

What are you computing?

What Where When How

[Figure: three kinds of transformation: Element-Wise, Aggregating, Composite]

Page 17:

What: Computing Integer Sums

// Collection of raw log lines
PCollection<String> raw = IO.read(...);

// Element-wise transformation into team/score pairs
PCollection<KV<String, Integer>> input =
    raw.apply(ParDo.of(new ParseFn()));

// Composite transformation containing an aggregation
PCollection<KV<String, Integer>> scores =
    input.apply(Sum.integersPerKey());
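The same What can be tried out without a pipeline runner. The following is a minimal plain-Java sketch, not Beam code: the class name and the `team,score` line format are assumptions, since the slides do not show what ParseFn parses.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of "parse, then sum per key" on an in-memory list.
class TeamScores {
    // Element-wise step: parse one "team,score" log line (format is assumed).
    static Map.Entry<String, Integer> parse(String line) {
        String[] parts = line.split(",");
        return Map.entry(parts[0].trim(), Integer.parseInt(parts[1].trim()));
    }

    // Aggregating step: sum the integer values for each key.
    static Map<String, Integer> sumPerKey(List<String> rawLines) {
        Map<String, Integer> totals = new TreeMap<>();
        for (String line : rawLines) {
            Map.Entry<String, Integer> kv = parse(line);
            totals.merge(kv.getKey(), kv.getValue(), Integer::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        System.out.println(sumPerKey(List.of("red,3", "blue,5", "red,4")));
    }
}
```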

Page 18:

[Figure: What: Computing Integer Sums]

Page 19:

[Figure: What: Computing Integer Sums]

Page 20:

Where in event time?

Windowing divides data into event-time-based finite chunks.

Often required when doing aggregations over unbounded data.

[Figure: Fixed, Sliding, and Sessions windows drawn along a Time axis for Key 1, Key 2, and Key 3, with the windows numbered within each strategy]
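How an element ends up in a window can be sketched in plain Java. This is an illustration under stated assumptions, not the Beam implementation: the class and method names are hypothetical, timestamps are plain longs, windows are half-open intervals, and a session is closed once the gap to the next element reaches the gap duration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the three windowing strategies named on the slide.
class WindowAssignment {
    // Fixed: every timestamp falls into exactly one window [start, start + size).
    static long fixedWindowStart(long timestamp, long size) {
        return timestamp - Math.floorMod(timestamp, size);
    }

    // Sliding: a timestamp falls into every window of length `size` that
    // starts on a multiple of `period` and still covers it.
    static List<Long> slidingWindowStarts(long timestamp, long size, long period) {
        List<Long> starts = new ArrayList<>();
        long lastStart = timestamp - Math.floorMod(timestamp, period);
        for (long start = lastStart; start > timestamp - size; start -= period) {
            starts.add(start);
        }
        return starts;
    }

    // Sessions (for one key): each element opens [t, t + gap); overlapping
    // windows are merged. Input must be sorted. Returns {start, end} pairs.
    static List<long[]> sessions(long[] sortedTimestamps, long gap) {
        List<long[]> out = new ArrayList<>();
        for (long t : sortedTimestamps) {
            long[] last = out.isEmpty() ? null : out.get(out.size() - 1);
            if (last != null && t < last[1]) {
                last[1] = t + gap;                  // extend the open session
            } else {
                out.add(new long[] {t, t + gap});   // start a new session
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(fixedWindowStart(130, 120));
        System.out.println(slidingWindowStarts(130, 120, 60));
        System.out.println(sessions(new long[] {1, 2, 10, 11, 30}, 5).size());
    }
}
```

Fixed windows give each element exactly one window, sliding windows can give it several, and session boundaries depend on the data itself, which is why sessions are computed per key.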

Page 21:

Where: Fixed 2-minute Windows

PCollection<KV<String, Integer>> scores = input
    .apply(Window.into(FixedWindows.of(Minutes(2))))
    .apply(Sum.integersPerKey());

Page 22:

[Figure: Where: Fixed 2-minute Windows]

Page 23:

When in processing time?

• Triggers control when results are emitted.

• Triggers are often relative to the watermark.

[Figure: Processing Time plotted against Event Time, with curves labeled Ideal and ~Watermark and the difference labeled Skew]

Page 24:

When: Triggering at the Watermark

PCollection<KV<String, Integer>> scores = input
    .apply(Window.into(FixedWindows.of(Minutes(2)))
        .triggering(AtWatermark()))
    .apply(Sum.integersPerKey());

Page 25:

[Figure: When: Triggering at the Watermark]

Page 26:

When: Early and Late Firings

PCollection<KV<String, Integer>> scores = input
    .apply(Window.into(FixedWindows.of(Minutes(2)))
        .triggering(AtWatermark()
            .withEarlyFirings(AtPeriod(Minutes(1)))
            .withLateFirings(AtCount(1))))
    .apply(Sum.integersPerKey());
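What such a trigger does for a single window can be simulated in plain Java. The sketch below is an illustration, not Beam code, and rests on several assumptions: the class name and the EARLY / ON_TIME / LATE labels are hypothetical, times are processing-time seconds, the early timer is set one period after the first element since the last firing, and each pane reports the running sum.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simulation of early, on-time, and late firings for one window.
class TriggerSketch {
    private static final long NO_TIMER = Long.MAX_VALUE;

    // arrivals: processing times of the elements, sorted ascending.
    // watermarkAt: processing time at which the watermark passes the window end.
    static List<String> firings(long[] arrivals, int[] values,
                                long watermarkAt, long earlyPeriod) {
        List<String> panes = new ArrayList<>();
        int sum = 0;
        long timer = NO_TIMER;
        int i = 0;
        // Elements that arrive before the watermark passes the end of the window.
        for (; i < arrivals.length && arrivals[i] < watermarkAt; i++) {
            if (timer <= arrivals[i]) {             // early timer came due first
                panes.add("EARLY@" + timer + "=" + sum);
                timer = NO_TIMER;
            }
            sum += values[i];
            if (timer == NO_TIMER) {
                timer = arrivals[i] + earlyPeriod;  // one period after this element
            }
        }
        if (timer < watermarkAt) {                  // pending early timer still due
            panes.add("EARLY@" + timer + "=" + sum);
        }
        panes.add("ON_TIME@" + watermarkAt + "=" + sum);
        // Late firings: one pane per late element, in the spirit of AtCount(1).
        for (; i < arrivals.length; i++) {
            sum += values[i];
            panes.add("LATE@" + arrivals[i] + "=" + sum);
        }
        return panes;
    }

    public static void main(String[] args) {
        long[] arrivals = {10, 100, 110, 200};
        int[] values = {3, 5, 1, 2};
        System.out.println(firings(arrivals, values, 130, 60));
    }
}
```

In the sample run the first element produces a speculative result before the watermark, the on-time pane covers everything seen by the time the watermark passes, and the element arriving afterwards triggers a late refinement.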

Page 27:

[Figure: When: Early and Late Firings]

Page 28:

How do refinements relate?

• How should multiple outputs per window accumulate?
• Appropriate choice depends on consumer.

Firing           Elements   Discarding   Accumulating   Acc. & Retracting
Speculative      [3]        3            3              3
Watermark        [5, 1]     6            9              9, -3
Late             [2]        2            11             11, -9
Last Observed               2            11             11
Total Observed              11           23             11

(Accumulating & Retracting not yet implemented.)
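The three accumulation modes can be reproduced with a short plain-Java sketch. It is an illustration only: the class and method names are hypothetical, and each inner array holds the elements that arrived since the previous firing, using the values from the slide ([3], [5, 1], [2]).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: what each accumulation mode emits per firing.
class AccumulationModes {
    private static int sum(int[] xs) {
        int s = 0;
        for (int x : xs) s += x;
        return s;
    }

    // Discarding: each pane contains only what arrived since the last firing.
    static List<List<Integer>> discarding(int[][] firings) {
        List<List<Integer>> panes = new ArrayList<>();
        for (int[] f : firings) {
            panes.add(List.of(sum(f)));
        }
        return panes;
    }

    // Accumulating: each pane contains the running total for the window.
    static List<List<Integer>> accumulating(int[][] firings) {
        List<List<Integer>> panes = new ArrayList<>();
        int running = 0;
        for (int[] f : firings) {
            running += sum(f);
            panes.add(List.of(running));
        }
        return panes;
    }

    // Accumulating & retracting: the running total plus a retraction
    // (negated value) of the previously emitted total.
    static List<List<Integer>> accumulatingAndRetracting(int[][] firings) {
        List<List<Integer>> panes = new ArrayList<>();
        int running = 0;
        for (int i = 0; i < firings.length; i++) {
            int previous = running;
            running += sum(firings[i]);
            panes.add(i == 0 ? List.of(running) : List.of(running, -previous));
        }
        return panes;
    }

    // What a consumer sees if it simply adds up everything it receives.
    static int total(List<List<Integer>> panes) {
        int t = 0;
        for (List<Integer> pane : panes) {
            for (int v : pane) t += v;
        }
        return t;
    }

    public static void main(String[] args) {
        int[][] firings = { {3}, {5, 1}, {2} };
        System.out.println(discarding(firings) + " total " + total(discarding(firings)));
        System.out.println(accumulating(firings) + " total " + total(accumulating(firings)));
        System.out.println(accumulatingAndRetracting(firings)
            + " total " + total(accumulatingAndRetracting(firings)));
    }
}
```

A consumer that sums every value it receives gets the correct answer with discarding or with retractions, but over-counts in plain accumulating mode; a consumer that keeps only the latest value needs accumulating mode. This is why the appropriate choice depends on the consumer.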

Page 29:

[Figure: How: Add Newest, Remove Previous]

Page 30:

What / Where / When / How

1. Classic Batch

2. Batch with Fixed Windows

3. Streaming

4. Streaming with Speculative + Late Data

5. Streaming With Retractions

6. Sessions

Page 31:

3. Apache Beam (incubating)

Page 32:

The Evolution of Beam

[Figure: lineage diagram of Google systems (MapReduce, BigTable, Dremel, Colossus, Flume, Megastore, Spanner, PubSub, Millwheel) leading to Google Cloud Dataflow and Apache Beam]

Page 33:

What is Part of Apache Beam?

1. The Beam Model: What / Where / When / How

2. SDKs for writing Beam pipelines -- starting with Java

3. Runners for Existing Distributed Processing Backends
    • Apache Flink (thanks to data Artisans)
    • Apache Spark (thanks to Cloudera)
    • Google Cloud Dataflow (fully managed service)
    • Local (in-process) runner for testing

Page 34:

Apache Beam Technical Vision

1. End users: who want to write pipelines or transform libraries in a language that’s familiar.

2. SDK writers: who want to make Beam concepts available in new languages.

3. Runner writers: who have a distributed processing environment and want to support Beam pipelines.

[Figure: layered diagram with SDKs (Beam Java, Beam Python, Other Languages), Beam Model: Pipeline Construction, runners (Runner A, Runner B, Cloud Dataflow), Beam Model: Fn Runners, and an Execution layer under each runner]

Page 35:

Growing the Beam Community

Collaborate - Beam is becoming a community-driven effort with participation from many organizations and contributors

Grow - We want to grow the Beam ecosystem and community with active, open involvement so Beam is a part of the larger OSS ecosystem

Page 36:

4. Google Cloud Dataflow

Page 37:

Google Cloud Dataflow

• Fully managed service for running Beam pipelines
• Dynamically provisioned, on-demand resources
    • VMs, temporary storage
• No tuning required
    • Autoscaling + Dynamic Work Rebalancing
• Built from the experience with Google internal products

Page 38:

No Tuning Required: Dynamic Work Rebalancing

• Advanced straggler mitigation technique
• Ensures all tasks finish at the same time
• For more info, google: “No shard left behind: dynamic work rebalancing in Google Cloud Dataflow”

[Figure: two charts of Workers against Time, one labeled Without DWR and one labeled With DWR]

Page 39:

No Tuning Required: Autoscaling

• Dynamically adjusts the number of workers to match the load
• Both for streaming and batch
• For more info, google: “Comparing Cloud Dataflow autoscaling to Spark and Hadoop”

[Figure: charts of Workers against Time]

Page 40:

Integrations

• Apache Beam connectors
    • Google Cloud
        • Storage, BigQuery, BigTable, Datastore, Pub/Sub
    • External / Custom IO
        • Kafka, HDFS, many in flight
• Part of Google Cloud Platform
    • Monitoring UI
    • Cloud Logging
    • Cloud Debugger and Profiler
    • Stackdriver integration

Page 41:

Big Data in Google Cloud Platform

• BigQuery
    • A fast, economical, and fully managed data warehouse solution
• Dataflow
    • Fully managed, real-time data processing service for batch and streaming
• Dataproc
    • Fast, easy to use managed Spark and Hadoop service
• Datalab (beta)
    • Interactive large scale data analysis, exploration and visualization
• Pub/Sub
    • Reliable, many-to-many, asynchronous messaging service
• Genomics
    • Empowers scientists to organize the world’s genomics information

Page 42:

Machine Learning in Google Cloud Platform

• Machine Learning Platform (alpha)
    • Fast, large scale, easy to use Machine Learning service
• Vision API
    • Enables insights based on our powerful Vision APIs
• Speech API
    • Speech to text conversion powered by Machine Learning
• Translate API
    • Enables multilingual apps and programmatic translation

Page 43:

Learn More! Follow @GCPBigData + @ApacheBeam

Apache Beam (incubating): http://beam.incubator.apache.org

Google Cloud Dataflow: http://cloud.google.com/dataflow

Google Cloud Platform: http://cloud.google.com

Page 44:

Thank you!