no shard left behind€¦ · google cloud platform 17 these kinda work. but not really. manual...

55
Eugene Kirpichov Senior Software Engineer No Shard Left Behind Straggler-free data processing in Cloud Dataflow

Upload: others

Post on 17-Jul-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: No Shard Left Behind€¦ · Google Cloud Platform 17 These kinda work. But not really. Manual tuning = Sisyphean task Time-consuming, uninformed, obsoleted by data drift ⇒ Almost

Eugene KirpichovSenior Software Engineer

No Shard Left BehindStraggler-free data processing in Cloud Dataflow

Page 2: No Shard Left Behind€¦ · Google Cloud Platform 17 These kinda work. But not really. Manual tuning = Sisyphean task Time-consuming, uninformed, obsoleted by data drift ⇒ Almost

Google Cloud Platform 2

Wor

kers

Time

Page 3: No Shard Left Behind€¦ · Google Cloud Platform 17 These kinda work. But not really. Manual tuning = Sisyphean task Time-consuming, uninformed, obsoleted by data drift ⇒ Almost

Google Cloud Platform 3

Page 4: No Shard Left Behind€¦ · Google Cloud Platform 17 These kinda work. But not really. Manual tuning = Sisyphean task Time-consuming, uninformed, obsoleted by data drift ⇒ Almost

Plan

AutoscalingWhy dynamic rebalancing really matters

If you remember two thingsPhilosophy of everything above

01

02

03

IntroSetting the stage

StragglersWhere they come from andhow people fight them

Dynamic rebalancing1 How it works 2 Why is it hard

04

05

Page 5: No Shard Left Behind€¦ · Google Cloud Platform 17 These kinda work. But not really. Manual tuning = Sisyphean task Time-consuming, uninformed, obsoleted by data drift ⇒ Almost

01 IntroSetting the stage

Page 6: No Shard Left Behind€¦ · Google Cloud Platform 17 These kinda work. But not really. Manual tuning = Sisyphean task Time-consuming, uninformed, obsoleted by data drift ⇒ Almost

Google Cloud Platform 6

Google’s data processing timeline

20122002 2004 2006 2008 2010

MapReduce

GFS Big Table

Dremel

Pregel

FlumeJava

Colossus

Spanner

2014

MillWheel

Dataflow

2016

Apache Beam

Page 7: No Shard Left Behind€¦ · Google Cloud Platform 17 These kinda work. But not really. Manual tuning = Sisyphean task Time-consuming, uninformed, obsoleted by data drift ⇒ Almost

Google Cloud Platform 7

Pipeline p = Pipeline.create(options);

p.apply(TextIO.Read.from("gs://dataflow-samples/shakespeare/*"))

.apply(FlatMapElements.via(

word → Arrays.asList(word.split("[^a-zA-Z']+"))))

.apply(Filter.byPredicate(word → !word.isEmpty()))

.apply(Count.perElement())

.apply(MapElements.via(

count → count.getKey() + ": " + count.getValue())

.apply(TextIO.Write.to("gs://.../..."));

p.run();

WordCount

Page 8: No Shard Left Behind€¦ · Google Cloud Platform 17 These kinda work. But not really. Manual tuning = Sisyphean task Time-consuming, uninformed, obsoleted by data drift ⇒ Almost

Google Cloud Platform 8

A

K, V

B

K, [V]

DoFn: A → [B]ParDo

GBKGroupByKey

MapReduce = ParDo + GroupByKey + ParDo

Page 9: No Shard Left Behind€¦ · Google Cloud Platform 17 These kinda work. But not really. Manual tuning = Sisyphean task Time-consuming, uninformed, obsoleted by data drift ⇒ Almost

Google Cloud Platform 9

DoFn

DoFn

DoFn

Running a ParDo

shard 1

shard 2

shard N

DoFn

Page 10: No Shard Left Behind€¦ · Google Cloud Platform 17 These kinda work. But not really. Manual tuning = Sisyphean task Time-consuming, uninformed, obsoleted by data drift ⇒ Almost

Google Cloud Platform 10

Gantt charts

shard N

Wor

kers

Time

Page 11: No Shard Left Behind€¦ · Google Cloud Platform 17 These kinda work. But not really. Manual tuning = Sisyphean task Time-consuming, uninformed, obsoleted by data drift ⇒ Almost

Google Cloud Platform 11

Large WordCount:Read files, GroupByKey, Write files.

400

wor

kers

20 minutes

Page 12: No Shard Left Behind€¦ · Google Cloud Platform 17 These kinda work. But not really. Manual tuning = Sisyphean task Time-consuming, uninformed, obsoleted by data drift ⇒ Almost

02 StragglersWhere they come from, and how people fight them

Page 13: No Shard Left Behind€¦ · Google Cloud Platform 17 These kinda work. But not really. Manual tuning = Sisyphean task Time-consuming, uninformed, obsoleted by data drift ⇒ Almost

Google Cloud Platform 13

StragglersW

orke

rs

Time

Page 14: No Shard Left Behind€¦ · Google Cloud Platform 17 These kinda work. But not really. Manual tuning = Sisyphean task Time-consuming, uninformed, obsoleted by data drift ⇒ Almost

Google Cloud Platform 14

Amdahl’s law: it gets worse at scale

Higher scale ⇒ More bottlenecked by serial parts.

#workers

serial fraction

Page 15: No Shard Left Behind€¦ · Google Cloud Platform 17 These kinda work. But not really. Manual tuning = Sisyphean task Time-consuming, uninformed, obsoleted by data drift ⇒ Almost

Google Cloud Platform 15

Process dictionaryin parallel by first letter:

⅙ words start with ‘t’ ⇒ < 6x speedup

Where do stragglers come from?

Spuriously slow external RPCs

Bugs

Join Foos / Bars,in parallel by Foos.

Some Foos have ≫ Bars than others.

Bad machines

Bad network

Resource contention

Uneven resources NoiseUneven partitioning Uneven complexity

Page 16: No Shard Left Behind€¦ · Google Cloud Platform 17 These kinda work. But not really. Manual tuning = Sisyphean task Time-consuming, uninformed, obsoleted by data drift ⇒ Almost

Google Cloud Platform 16

Oversplit

Hand-tune

Use data statistics

What would you do?

Uneven resources

Backups

Restarts

NoiseUneven partitioning Uneven complexity

Predictive ⇒ Unreliable Weak

Page 17: No Shard Left Behind€¦ · Google Cloud Platform 17 These kinda work. But not really. Manual tuning = Sisyphean task Time-consuming, uninformed, obsoleted by data drift ⇒ Almost

Google Cloud Platform 17

These kinda work. But not really.

Manual tuning = Sisyphean taskTime-consuming, uninformed, obsoleted by data drift⇒ Almost always tuned wrong

Statistics often missing / wrongDoesn’t exist for intermediate data

Size != complexity

Backups/restarts only address slow workers

Page 18: No Shard Left Behind€¦ · Google Cloud Platform 17 These kinda work. But not really. Manual tuning = Sisyphean task Time-consuming, uninformed, obsoleted by data drift ⇒ Almost

Confidential & ProprietaryGoogle Cloud Platform 18

Upfront heuristics don’t work: will predict wrong.Higher scale → more likely.

Page 19: No Shard Left Behind€¦ · Google Cloud Platform 17 These kinda work. But not really. Manual tuning = Sisyphean task Time-consuming, uninformed, obsoleted by data drift ⇒ Almost

Confidential & ProprietaryGoogle Cloud Platform 19

High scale triggers worst-case behavior.

Corollary: If you’re bottlenecked by worst-case behavior, you won’t scale.

Page 20: No Shard Left Behind€¦ · Google Cloud Platform 17 These kinda work. But not really. Manual tuning = Sisyphean task Time-consuming, uninformed, obsoleted by data drift ⇒ Almost

03.1 Dynamic rebalancingHow it works

Page 21: No Shard Left Behind€¦ · Google Cloud Platform 17 These kinda work. But not really. Manual tuning = Sisyphean task Time-consuming, uninformed, obsoleted by data drift ⇒ Almost

Google Cloud Platform 21

Detect and fight stragglersW

orke

rs

Time

Page 22: No Shard Left Behind€¦ · Google Cloud Platform 17 These kinda work. But not really. Manual tuning = Sisyphean task Time-consuming, uninformed, obsoleted by data drift ⇒ Almost

Google Cloud Platform 22

What is a straggler, really?W

orke

rs

Time

Slower than perfectly-parallel:

tend > sum(tend) / N

Page 23: No Shard Left Behind€¦ · Google Cloud Platform 17 These kinda work. But not really. Manual tuning = Sisyphean task Time-consuming, uninformed, obsoleted by data drift ⇒ Almost

Google Cloud Platform 23

Split stragglers, return residuals into pool of workW

orke

rs

Time

Now Avg completion time

100 130 200foo.txt

170

100 200170 170

keep running schedule

(cheap, atomic)

Page 24: No Shard Left Behind€¦ · Google Cloud Platform 17 These kinda work. But not really. Manual tuning = Sisyphean task Time-consuming, uninformed, obsoleted by data drift ⇒ Almost

Google Cloud Platform 24

Rinse, repeat (“liquid sharding”)W

orke

rs

Time

Now Avg completion time

Page 25: No Shard Left Behind€¦ · Google Cloud Platform 17 These kinda work. But not really. Manual tuning = Sisyphean task Time-consuming, uninformed, obsoleted by data drift ⇒ Almost

Google Cloud Platform 25

1 ParDo

Skewed

24 workers

ParDo/GBK/ParDo

Uniform

400 workers

50% 25%

Page 26: No Shard Left Behind€¦ · Google Cloud Platform 17 These kinda work. But not really. Manual tuning = Sisyphean task Time-consuming, uninformed, obsoleted by data drift ⇒ Almost

Confidential & ProprietaryGoogle Cloud Platform 26

Get out of trouble > avoid trouble

Adaptive > Predictive

Page 27: No Shard Left Behind€¦ · Google Cloud Platform 17 These kinda work. But not really. Manual tuning = Sisyphean task Time-consuming, uninformed, obsoleted by data drift ⇒ Almost

03.2 Dynamic rebalancingWhy is it hard?

Page 28: No Shard Left Behind€¦ · Google Cloud Platform 17 These kinda work. But not really. Manual tuning = Sisyphean task Time-consuming, uninformed, obsoleted by data drift ⇒ Almost

Google Cloud Platform 28

And that’s it? What’s so hard?

Wait-free

Perfect granularity

What can be split?

Data consistency

Not just files

APIs

Non-uniform density

Stuckness

“Dark matter”

Making predictions

Testing consistency

Debugging

Measuring quality

Being sure it worksSemantics Quality

Page 29: No Shard Left Behind€¦ · Google Cloud Platform 17 These kinda work. But not really. Manual tuning = Sisyphean task Time-consuming, uninformed, obsoleted by data drift ⇒ Almost

Google Cloud Platform 29

What is splitting

foo.txt [100, 200)

foo.txt [100, 170) foo.txt [170, 200)

split at 170

Page 30: No Shard Left Behind€¦ · Google Cloud Platform 17 These kinda work. But not really. Manual tuning = Sisyphean task Time-consuming, uninformed, obsoleted by data drift ⇒ Almost

Google Cloud Platform 30

What is splitting: Associativity

[A, B) + [B, C) = [A, C)

Page 31: No Shard Left Behind€¦ · Google Cloud Platform 17 These kinda work. But not really. Manual tuning = Sisyphean task Time-consuming, uninformed, obsoleted by data drift ⇒ Almost

Google Cloud Platform 31

What is splitting: Rounding up

[A, B) = records starting in [A, B)

Random access

⇒ Can split without scanning data!

Page 32: No Shard Left Behind€¦ · Google Cloud Platform 17 These kinda work. But not really. Manual tuning = Sisyphean task Time-consuming, uninformed, obsoleted by data drift ⇒ Almost

Google Cloud Platform 32

What is splitting: Rounding up

[A, B) = records starting in [A, B)

Random access

⇒ Can split without scanning data!

appl

ebe

etfig gr

ape

kiw

ilim

epe

arro

se

squa

sh

vani

lla

[a, h) [h, s) [s, $)

appl

ebe

etfig gr

ape

kiw

ilim

epe

arro

se

squa

sh

vani

lla

Page 33: No Shard Left Behind€¦ · Google Cloud Platform 17 These kinda work. But not really. Manual tuning = Sisyphean task Time-consuming, uninformed, obsoleted by data drift ⇒ Almost

Google Cloud Platform 33

What is splitting: Blocks

[A, B) = records in blocks starting in [A, B)

Page 34: No Shard Left Behind€¦ · Google Cloud Platform 17 These kinda work. But not really. Manual tuning = Sisyphean task Time-consuming, uninformed, obsoleted by data drift ⇒ Almost

Google Cloud Platform 34

What is splitting: Readers

foo.txt [100, 200)

foo.txt [100, 170) foo.txt [170, 200)

split at 170

“Reader”

Re-reading consistency:continue until EOF = re-read shard

Page 35: No Shard Left Behind€¦ · Google Cloud Platform 17 These kinda work. But not really. Manual tuning = Sisyphean task Time-consuming, uninformed, obsoleted by data drift ⇒ Almost

Google Cloud Platform 35

Dynamic splitting: readers

read not yet read

not ok ok

e.g. can’t split an arbitrary SQL query

X = last record read:Exact, Increasing

Page 36: No Shard Left Behind€¦ · Google Cloud Platform 17 These kinda work. But not really. Manual tuning = Sisyphean task Time-consuming, uninformed, obsoleted by data drift ⇒ Almost

Confidential & ProprietaryGoogle Cloud Platform 36

[A, B) = blocks of records starting in [A, B)[A, B) + [B, C) = [A, C)Random access⇒ No scanning needed to split

Reading repeatable, ordered by position, positions exact

Page 37: No Shard Left Behind€¦ · Google Cloud Platform 17 These kinda work. But not really. Manual tuning = Sisyphean task Time-consuming, uninformed, obsoleted by data drift ⇒ Almost

Google Cloud Platform 37

Concurrency when splitting

Read Process ...

time

should I split? ? ? ?

While we wait, 1000s of workers idle.Per-element processingin O(hours) is common!

Page 38: No Shard Left Behind€¦ · Google Cloud Platform 17 These kinda work. But not really. Manual tuning = Sisyphean task Time-consuming, uninformed, obsoleted by data drift ⇒ Almost

Google Cloud Platform 38

Concurrency when splitting

Read Process ...

Read Process ...

split! ok.

Split wait-free (but race-free), while processing/reading.see code: RangeTracker

Per-element processingin O(hours) is common!

split! ok.

should I split? ? ? ?

While we wait, 1000s of workers idle.

Page 39: No Shard Left Behind€¦ · Google Cloud Platform 17 These kinda work. But not really. Manual tuning = Sisyphean task Time-consuming, uninformed, obsoleted by data drift ⇒ Almost

Google Cloud Platform 39

Perfectly granular splitting

“Few records, heavy processing” is common.

⇒ Perfect parallelism required

Page 40: No Shard Left Behind€¦ · Google Cloud Platform 17 These kinda work. But not really. Manual tuning = Sisyphean task Time-consuming, uninformed, obsoleted by data drift ⇒ Almost

Confidential & ProprietaryGoogle Cloud Platform 40

Separation:ParDo { record → sleep(∞) } parallelized

perfectly(requires wait-free + perfectly granular)

Page 41: No Shard Left Behind€¦ · Google Cloud Platform 17 These kinda work. But not really. Manual tuning = Sisyphean task Time-consuming, uninformed, obsoleted by data drift ⇒ Almost

Google Cloud Platform 41

Separation is a qualitative improvement

/path/to/foo*.txtParDo: expand

glob

foo5.txt

foo42.txt

foo8.txt

foo100.txt

foo91.txt

ParDo:read

records

perfectly parallelover files

perfectly parallelover records

infinite scalability(no “shard per file”)

foo26.txt

foo87.txt

foo56.txt

See also: Splittable DoFnhttp://s.apache.org/splittable-do-fn

Page 42: No Shard Left Behind€¦ · Google Cloud Platform 17 These kinda work. But not really. Manual tuning = Sisyphean task Time-consuming, uninformed, obsoleted by data drift ⇒ Almost

Confidential & ProprietaryGoogle Cloud Platform 42

“Practical” solutions improve performance

“No compromise” solutions reduce dimensionof the problem space

Page 43: No Shard Left Behind€¦ · Google Cloud Platform 17 These kinda work. But not really. Manual tuning = Sisyphean task Time-consuming, uninformed, obsoleted by data drift ⇒ Almost

Google Cloud Platform 43

Page 44: No Shard Left Behind€¦ · Google Cloud Platform 17 These kinda work. But not really. Manual tuning = Sisyphean task Time-consuming, uninformed, obsoleted by data drift ⇒ Almost

Google Cloud Platform 44

Making predictions: easy, right?

100 130 200~30% complete: 130 / [100, 200) = 0.3Split at 70%: 0.7 [100, 200) = 170

appl

ebe

etfig gr

ape

kiw

i

~50% complete: k / [a, z) ≈ 0.5Split at 70%: 0.7 [a, z) ≈ t

t

70%

Page 45: No Shard Left Behind€¦ · Google Cloud Platform 17 These kinda work. But not really. Manual tuning = Sisyphean task Time-consuming, uninformed, obsoleted by data drift ⇒ Almost

Google Cloud Platform 45

100%

t

Progress100%

t

Progress

100%

t

Progress100%

t

Progress100%

Progress

t

Easy; usually too good to be true.

100%

t100%t

Progress

tx

px

Page 46: No Shard Left Behind€¦ · Google Cloud Platform 17 These kinda work. But not really. Manual tuning = Sisyphean task Time-consuming, uninformed, obsoleted by data drift ⇒ Almost

Confidential & ProprietaryGoogle Cloud Platform 46

Accurate predictions = wrong goal, infeasible.Wildly off ⇒ System should still workOptimize for emergent behavior (separation)Better goal: detect stuckness

Page 47: No Shard Left Behind€¦ · Google Cloud Platform 17 These kinda work. But not really. Manual tuning = Sisyphean task Time-consuming, uninformed, obsoleted by data drift ⇒ Almost

Google Cloud Platform 47

Heavy work that you don’t know exists, until you hit it.

Goal: discover and distribute dark matter as quickly as possible.

(Image credit: NASA)

Dark matter

47

Page 48: No Shard Left Behind€¦ · Google Cloud Platform 17 These kinda work. But not really. Manual tuning = Sisyphean task Time-consuming, uninformed, obsoleted by data drift ⇒ Almost

04 AutoscalingWhy dynamic rebalancing really matters

Page 49: No Shard Left Behind€¦ · Google Cloud Platform 17 These kinda work. But not really. Manual tuning = Sisyphean task Time-consuming, uninformed, obsoleted by data drift ⇒ Almost

49

How much work will there be?Can’t predict: data size, complexity, etc.

What should you do?Adaptive > Predictive.Keep re-estimating total work; scale up/down

(Image credit: Wikipedia)

A lot of work ⇒ A lot of workers

49

Page 50: No Shard Left Behind€¦ · Google Cloud Platform 17 These kinda work. But not really. Manual tuning = Sisyphean task Time-consuming, uninformed, obsoleted by data drift ⇒ Almost

50

Start off with 3 workers,things are looking okay

10m

3 days

Re-estimation ⇒ orders of magnitude more work:need 100 workers!

92 workers idle

100 workers uselesswithout 100 pieces of work!

Page 51: No Shard Left Behind€¦ · Google Cloud Platform 17 These kinda work. But not really. Manual tuning = Sisyphean task Time-consuming, uninformed, obsoleted by data drift ⇒ Almost

Google Cloud Platform 51

Now scaling up is no big deal!Add workersWork distributes itself

Job smoothly scales 3 → 1000 workers.

Autoscaling + dynamic rebalancing

Waves of splitting

Upscaling &VM startup

Page 52: No Shard Left Behind€¦ · Google Cloud Platform 17 These kinda work. But not really. Manual tuning = Sisyphean task Time-consuming, uninformed, obsoleted by data drift ⇒ Almost

05 If you remember two thingsPhilosophy of everything above

Page 53: No Shard Left Behind€¦ · Google Cloud Platform 17 These kinda work. But not really. Manual tuning = Sisyphean task Time-consuming, uninformed, obsoleted by data drift ⇒ Almost

Google Cloud Platform 53

Reducing dimension > Incremental improvement

“Corner cases” are clues that you’re still compromising

If you remember two things

Adaptive > Predictive

“No compromise” solutions matter

Fighting stragglers > Preventing stragglers

Emergent behavior > Local precision

wait-free

perfectly granularseparation

heavy records

reading-as-ParDo

rebalancing autoscaling reusability

Page 54: No Shard Left Behind€¦ · Google Cloud Platform 17 These kinda work. But not really. Manual tuning = Sisyphean task Time-consuming, uninformed, obsoleted by data drift ⇒ Almost

Confidential & ProprietaryGoogle Cloud Platform 54

Thank youQ&A