Jelena Pjesivac-Grbovic, Staff Software Engineer, Cloud Big Data
Data Processing with Apache Beam (incubating) and
Google Cloud Dataflow
XLDB’16 - May 2016
In collaboration with Frances Perry, Tyler Akidau, and the Dataflow team
Agenda
1. Infinite, Out-of-Order Data Sets
2. What, Where, When, How
3. Apache Beam (incubating)
4. Google Cloud Dataflow
1. Infinite, Out-of-Order Data Sets
Data...
...can be big...
...really, really big...
[Figure: data arriving across Tuesday, Wednesday, and Thursday]
… maybe infinitely big...
[Figure: timeline from 8:00 to 14:00]
… with unknown delays.
[Figure: timeline from 8:00 to 14:00, with elements labeled 8:00]
Element-wise transformations
[Figure: elements along a Processing Time axis, 8:00 to 14:00]
Aggregating via Processing-Time Windows
[Figure: windows along a Processing Time axis, 8:00 to 14:00]
Aggregating via Event-Time Windows
[Figure: Input and Output rows against Event Time and Processing Time axes, 10:00 to 15:00]
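The difference between the two aggregation styles can be sketched in plain Java. This is not Beam API: the `WindowingComparison` class, the `Event` type, and the minute-based timestamps are all assumptions made for illustration. The same elements are summed twice, once bucketed by when they were processed and once by when they happened.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration, not Beam API.
class WindowingComparison {
    // An element with the time it happened and the time the pipeline saw it.
    static final class Event {
        final long eventTimeMin;
        final long processingTimeMin;
        final int value;

        Event(long eventTimeMin, long processingTimeMin, int value) {
            this.eventTimeMin = eventTimeMin;
            this.processingTimeMin = processingTimeMin;
            this.value = value;
        }
    }

    // Sums values per fixed window, keyed by window start (in minutes).
    // `useEventTime` selects which timestamp decides the window.
    static Map<Long, Integer> sumByWindow(List<Event> events, long sizeMin, boolean useEventTime) {
        Map<Long, Integer> sums = new TreeMap<>();
        for (Event e : events) {
            long t = useEventTime ? e.eventTimeMin : e.processingTimeMin;
            long windowStart = t - Math.floorMod(t, sizeMin);
            sums.merge(windowStart, e.value, Integer::sum);
        }
        return sums;
    }
}
```

An element that happened at 8:05 but arrived at 9:05 lands in the 8:00 window under event-time windowing and in the 9:00 window under processing-time windowing, so the two calls can return different sums for the same input.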
Formalizing Event-Time Skew
[Figure: Processing Time vs. Event Time, showing Ideal, Reality, and the Skew between them]
Watermarks describe event time progress.
"No timestamp earlier than the watermark will be seen"
[Figure: Processing Time vs. Event Time, showing ~Watermark, Ideal, and Skew]
Often heuristic-based.
Too slow? Results are delayed.
Too fast? Some data is late.
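One very simple heuristic, sketched here in plain Java, is to trail the largest event time seen so far by a fixed allowance. This is only an illustration of the trade-off and is not how Dataflow or any Beam runner computes watermarks; the `HeuristicWatermark` class and its `allowedSkewMillis` parameter are hypothetical.

```java
// Hypothetical sketch of a heuristic watermark, not Beam or Dataflow API.
class HeuristicWatermark {
    private final long allowedSkewMillis;
    private long maxEventTimeSeen = Long.MIN_VALUE;

    HeuristicWatermark(long allowedSkewMillis) {
        this.allowedSkewMillis = allowedSkewMillis;
    }

    // Current estimate: no element older than this is expected to arrive.
    long current() {
        return maxEventTimeSeen == Long.MIN_VALUE
                ? Long.MIN_VALUE
                : maxEventTimeSeen - allowedSkewMillis;
    }

    // Records an element and reports whether it arrived behind the
    // watermark, i.e. whether it counts as late data.
    boolean observe(long eventTimeMillis) {
        boolean late = eventTimeMillis < current();
        maxEventTimeSeen = Math.max(maxEventTimeSeen, eventTimeMillis);
        return late;
    }
}
```

A large allowance makes the watermark slow, so results wait longer. A small allowance makes it fast, so more elements are flagged as late.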
2. What, Where, When, How
What are you computing?
Where in event time?
When in processing time?
How do refinements relate?
What are you computing?
What Where When How
[Figure: three kinds of transformations: Element-Wise, Aggregating, Composite]
What: Computing Integer Sums
// Collection of raw log lines
PCollection<String> raw = IO.read(...);

// Element-wise transformation into team/score pairs
PCollection<KV<String, Integer>> input =
    raw.apply(ParDo.of(new ParseFn()));

// Composite transformation containing an aggregation
PCollection<KV<String, Integer>> scores =
    input.apply(Sum.integersPerKey());
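The same two steps, an element-wise parse followed by a per-key sum, can be mimicked in plain Java to show what the pipeline computes. This is a sketch without any Beam dependency. The `IntegerSums` class and the comma-separated `team,score` line format are assumptions, since the slides do not show what `ParseFn` does.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java analogue of the pipeline above; not Beam API.
class IntegerSums {
    // Element-wise step: one "team,score" line becomes one (team, score) pair.
    // The line format is an assumption made for illustration.
    static Map.Entry<String, Integer> parse(String line) {
        String[] parts = line.split(",");
        return new SimpleEntry<>(parts[0].trim(), Integer.parseInt(parts[1].trim()));
    }

    // Aggregating step: sum the scores of all pairs that share a key.
    static Map<String, Integer> sumPerKey(List<String> rawLines) {
        Map<String, Integer> sums = new TreeMap<>();
        for (String line : rawLines) {
            Map.Entry<String, Integer> kv = parse(line);
            sums.merge(kv.getKey(), kv.getValue(), Integer::sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        // prints {blue=5, red=7}
        System.out.println(sumPerKey(Arrays.asList("red,3", "blue,5", "red,4")));
    }
}
```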
Where in event time?
Windowing divides data into event-time-based finite chunks.
Often required when doing aggregations over unbounded data.
[Figure: Fixed, Sliding, and Sessions windows for Key 1, Key 2, and Key 3 along a Time axis]
Where: Fixed 2-minute Windows
PCollection<KV<String, Integer>> scores = input
    .apply(Window.into(FixedWindows.of(Minutes(2))))
    .apply(Sum.integersPerKey());
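What fixed windowing does to the per-key sum can be sketched without Beam: each element is assigned to the 2-minute window containing its event time, and sums are kept per key and per window. The `FixedWindowSums` class and the `key@windowStart` map keys are hypothetical and chosen only to keep the example short.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of fixed 2-minute windows; not Beam API.
class FixedWindowSums {
    static final long TWO_MINUTES_MS = 2 * 60 * 1000L;

    // Start of the fixed window of size `sizeMs` that contains `eventTimeMs`.
    static long windowStart(long eventTimeMs, long sizeMs) {
        return eventTimeMs - Math.floorMod(eventTimeMs, sizeMs);
    }

    // Adds one score to the running sum for its key and event-time window.
    static void add(Map<String, Integer> sums, String key, long eventTimeMs, int score) {
        long start = windowStart(eventTimeMs, TWO_MINUTES_MS);
        sums.merge(key + "@" + start, score, Integer::sum);
    }

    public static void main(String[] args) {
        Map<String, Integer> sums = new TreeMap<>();
        add(sums, "red", 30_000L, 3);   // window starting at 0
        add(sums, "red", 110_000L, 4);  // same window
        add(sums, "red", 130_000L, 5);  // window starting at 120000
        System.out.println(sums);
    }
}
```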
When in processing time?
• Triggers control when results are emitted.
• Triggers are often relative to the watermark.
[Figure: Processing Time vs. Event Time, showing ~Watermark, Ideal, and Skew]
When: Triggering at the Watermark
PCollection<KV<String, Integer>> scores = input
    .apply(Window.into(FixedWindows.of(Minutes(2)))
        .triggering(AtWatermark()))
    .apply(Sum.integersPerKey());
When: Early and Late Firings
PCollection<KV<String, Integer>> scores = input
    .apply(Window.into(FixedWindows.of(Minutes(2)))
        .triggering(AtWatermark()
            .withEarlyFirings(AtPeriod(Minutes(1)))
            .withLateFirings(AtCount(1))))
    .apply(Sum.integersPerKey());
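How a firing relates to the watermark can be sketched as a small classification: a firing before the watermark passes the end of the window is early (speculative), the first firing once it has passed is on time, and any firing after that is late. The `PaneTiming` class below is a hypothetical plain-Java illustration of that rule, not the Beam trigger implementation.

```java
// Hypothetical sketch of early / on-time / late firings; not Beam API.
class PaneTiming {
    enum Timing { EARLY, ON_TIME, LATE }

    // Classifies a firing for a window ending at `windowEnd`, given the
    // watermark at the moment of firing and whether the on-time result
    // for this window has already been emitted.
    static Timing classify(long windowEnd, long watermark, boolean onTimeAlreadyFired) {
        if (watermark < windowEnd) {
            return Timing.EARLY;
        }
        return onTimeAlreadyFired ? Timing.LATE : Timing.ON_TIME;
    }
}
```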
How do refinements relate?
• How should multiple outputs per window accumulate?
• Appropriate choice depends on consumer.

Firing          Elements  Discarding  Accumulating  Acc. & Retracting
Speculative     [3]       3           3             3
Watermark       [5, 1]    6           9             9, -3
Late            [2]       2           11            11, -9
Last Observed             2           11            11
Total Observed            11          23            11

(Accumulating & Retracting not yet implemented.)
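The three accumulation modes can be reproduced with a few lines of plain Java using the firings from the slide: [3], then [5, 1], then [2]. The `AccumulationModes` class is a hypothetical sketch, not Beam API. Each method returns what a consumer would see per firing.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the three accumulation modes; not Beam API.
// Each inner list holds the elements that arrived since the previous firing.
class AccumulationModes {
    static int sum(List<Integer> pane) {
        int total = 0;
        for (int v : pane) total += v;
        return total;
    }

    // Discarding: each firing reports only the new elements.
    static List<String> discarding(List<List<Integer>> panes) {
        List<String> out = new ArrayList<>();
        for (List<Integer> pane : panes) out.add(String.valueOf(sum(pane)));
        return out;
    }

    // Accumulating: each firing reports the running total so far.
    static List<String> accumulating(List<List<Integer>> panes) {
        List<String> out = new ArrayList<>();
        int running = 0;
        for (List<Integer> pane : panes) {
            running += sum(pane);
            out.add(String.valueOf(running));
        }
        return out;
    }

    // Accumulating & retracting: the running total plus a retraction
    // (negation) of the previously reported total.
    static List<String> accumulatingAndRetracting(List<List<Integer>> panes) {
        List<String> out = new ArrayList<>();
        int running = 0;
        boolean first = true;
        for (List<Integer> pane : panes) {
            int previous = running;
            running += sum(pane);
            out.add(first ? String.valueOf(running) : running + ", " + (-previous));
            first = false;
        }
        return out;
    }
}
```

Summing everything a consumer observes gives 11 for discarding, 23 for accumulating, and 11 again once retractions cancel the earlier totals, which is why the right choice depends on whether the consumer adds results up or overwrites them.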
How: Add Newest, Remove Previous
What / Where / When / How
1. Classic Batch
2. Batch with Fixed Windows
3. Streaming
4. Streaming with Speculative + Late Data
5. Streaming with Retractions
6. Sessions
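Session windows differ from fixed windows in that their boundaries come from the data: elements for a key are grouped as long as each one arrives within a gap of the previous activity. A minimal sketch of that merging, assuming each element opens a window of [t, t + gap) and overlapping windows are merged, follows. The `SessionWindows` class is hypothetical and handles a single key.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of session merging for one key; not Beam API.
class SessionWindows {
    // Returns merged sessions as {start, end} pairs, end exclusive.
    static List<long[]> sessions(long[] timestamps, long gap) {
        long[] sorted = timestamps.clone();
        Arrays.sort(sorted);
        List<long[]> out = new ArrayList<>();
        for (long t : sorted) {
            long[] last = out.isEmpty() ? null : out.get(out.size() - 1);
            if (last != null && t < last[1]) {
                // Falls inside the open session: extend its end.
                last[1] = Math.max(last[1], t + gap);
            } else {
                // Gap exceeded: start a new session.
                out.add(new long[] {t, t + gap});
            }
        }
        return out;
    }
}
```

With timestamps 1, 3, 4, and 10 and a gap of 3, the first three elements merge into one session covering [1, 7) and the last forms its own session covering [10, 13).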
3. Apache Beam (incubating)
The Evolution of Beam
[Figure: lineage from Google technologies (MapReduce, BigTable, Dremel, Colossus, Flume, Megastore, Spanner, PubSub, MillWheel) to Google Cloud Dataflow and Apache Beam]
What is Part of Apache Beam?
1. The Beam Model: What / Where / When / How
2. SDKs for writing Beam pipelines -- starting with Java
3. Runners for Existing Distributed Processing Backends
• Apache Flink (thanks to data Artisans)
• Apache Spark (thanks to Cloudera)
• Google Cloud Dataflow (fully managed service)
• Local (in-process) runner for testing
Apache Beam Technical Vision
1. End users: who want to write pipelines or transform libraries in a language that’s familiar.
2. SDK writers: who want to make Beam concepts available in new languages.
3. Runner writers: who have a distributed processing environment and want to support Beam pipelines.
[Figure: layered architecture. SDKs (Beam Java, Beam Python, Other Languages) feed Beam Model: Pipeline Construction; Beam Model: Fn Runners connects to Runner A, Runner B, and Cloud Dataflow, each with its own Execution]
Growing the Beam Community
Collaborate - Beam is becoming a community-driven effort with participation from many organizations and contributors
Grow - We want to grow the Beam ecosystem and community with active, open involvement so Beam is a part of the larger OSS ecosystem
4. Google Cloud Dataflow
Google Cloud Dataflow
• Fully managed service for running Beam pipelines
• Dynamically provisioned, on-demand resources
  • VMs, temporary storage
• No tuning required
  • Autoscaling + Dynamic Work Rebalancing
• Built from the experience with Google internal products
No Tuning Required: Dynamic Work Rebalancing
• Advanced straggler mitigation technique
• Ensures all tasks finish at the same time
[Figure: Workers vs. Time, without DWR and with DWR]
• For more info google: “No shard left behind: dynamic work rebalancing in Google Cloud Dataflow”
No Tuning Required: Autoscaling
• Dynamically adjusts the number of workers to match the load
• Both for streaming and batch
[Figure: Workers vs. Time, two panels]
• For more info google: “Comparing Cloud Dataflow autoscaling to Spark and Hadoop”
Integrations
• Apache Beam connectors
  • Google Cloud
    • Storage, BigQuery, BigTable, Datastore, Pub/Sub
  • External / Custom IO
    • Kafka, HDFS, many in flight
• Part of Google Cloud Platform
  • Monitoring UI
  • Cloud Logging
  • Cloud Debugger and Profiler
  • Stackdriver integration
Big Data in Google Cloud Platform
• BigQuery
  • A fast, economical, and fully managed data warehouse solution
• Dataflow
  • Fully managed, real-time data processing service for batch and streaming
• Dataproc
  • Fast, easy to use managed Spark and Hadoop service
• Datalab (beta)
  • Interactive large scale data analysis, exploration and visualization
• Pub/Sub
  • Reliable, many-to-many, asynchronous messaging service
• Genomics
  • Empowers scientists to organize the world’s genomics information
Machine Learning in Google Cloud Platform
• Machine Learning Platform (alpha)
  • Fast, large scale, easy to use Machine Learning service
• Vision API
  • Enables insights based on our powerful Vision APIs
• Speech API
  • Speech to text conversion powered by Machine Learning
• Translate API
  • Enables multilingual apps and programmatic translation
Learn More! Follow @GCPBigData + @ApacheBeam
Apache Beam (incubating): http://beam.incubator.apache.org
Google Cloud Dataflow: http://cloud.google.com/dataflow
Google Cloud Platform: http://cloud.google.com
Thank you!