
Page 1: Distributed Systems Fall 2018

Daniel S. Berger 15-440 Fall 2018 Carnegie Mellon University

Distributed Systems

15-440/640

Fall 2018

17 – Fault-tolerant in-memory computation

Readings: “Resilient Distributed Datasets” Paper, Optional: “Immutability Changes Everything”

Page 2: Distributed Systems Fall 2018


Announcements

Continued midsemester feedback (Yuvraj & Daniel)

Thursday 4-5pm in GHC 4124

Please come and talk to us

if lettergrade < C-

if midterm < 60

if you just want more feedback


Daniel’s OH today: noon to 1pm

HW3 released yesterday

Page 3: Distributed Systems Fall 2018


Cluster Computing (Review)

MapReduce (Hadoop) Framework:


Key features: fault tolerance and high throughput

⇒ Simplified data analysis on large, unreliable clusters

[Figure: MapReduce pipeline. Input data is read from HDFS, Map tasks write intermediate results to local disk, and Reduce tasks write output data back to HDFS.]

Can you think of limitations of the MapReduce framework?

Page 4: Distributed Systems Fall 2018


Limitations of MapReduce I


[Figure: the MapReduce pipeline with I/O at every step: reads from HDFS (3x replicated), writes of Map output to local disk, reads by Reduce, and writes of the final output back to HDFS (3x replicated).]

I/O penalty makes interactive data analysis impossible

Implementation: store input/output after every step on disk

Effect on response time? From Jeff Dean's "Latency Numbers Every Programmer Should Know".

Page 5: Distributed Systems Fall 2018


Limitations of MapReduce II

Real-world applications require iterating MapReduce steps


Each iteration step is small.

But: we need many iterations

⇒ 90% spent on I/O to disks and over network

⇒ 10% spent computing actual results

Does not work for iterative applications (⇒ distributed machine learning)

Page 6: Distributed Systems Fall 2018


Limitations of MapReduce III

MapReduce abstraction not expressive enough


Explosion of specialized analytics systems

● Streaming analytics:

● Iterative ML algorithms:

● Graph/social data:

Learn all of them? Share data between them?

Page 7: Distributed Systems Fall 2018


Topics Today

Motivation

Going beyond MapReduce

In-memory computation (Spark)

Data sharing and fault tolerance

Distributed programming model

Distributed Machine Learning

The scalability challenge


Page 8: Distributed Systems Fall 2018


In-Memory Computation

Berkeley Extensions to Hadoop (⇒ Apache Spark)

Key idea: keep and share data sets in main memory


Why didn’t we do that in the first place? Problems?

Much faster response time (in practice: 10x-100x)

Page 9: Distributed Systems Fall 2018


In-memory computation and data-sharing

How to build a fault-tolerant and efficient system?


Traditional fault-tolerance approaches

● Logging to persistent storage

● Replicating data across nodes (ideally: also to persistent storage)

● Checkpointing (checkpoints need to be stored persistently)

⇒ Expensive (10-100x slowdown)

Fault tolerance techniques from lectures so far?

Idea: do we need fine-grained updates to data?

Page 10: Distributed Systems Fall 2018


Spark Approach: RDDs and Lineage

Resilient Distributed Datasets

● Limit update interface to coarse-grained operations

○ Map, group-by, filter, sample, ...

● Efficient fault recovery using lineage

○ Data is partitioned and each operation is applied to every partition

○ Individual operations are cheap

○ Recompute lost partitions on failure


Zaharia et al. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. NSDI 2012.

Page 11: Distributed Systems Fall 2018


RDDs are Immutable Objects

Examples: e.g., lists/vectors in SML, strings in Java (remember 15-150/210/15-214?)


What's immutability?
⇒ Object's state can't be modified after creation

Why immutability?
● Enables lineage

○ Recreate any RDD any time

○ More strictly: RDDs need to be deterministic functions of input

● Simplifies consistency
○ Caching and sharing RDDs across Spark nodes

● Compatibility with storage interface (HDFS)
○ HDFS chunks are append only

How to use RDDs?

Page 12: Distributed Systems Fall 2018


RDD Operations in Spark

1) Transformations: create new RDD from existing one

Map, filter, sample, groupByKey, sortByKey, union, join, cross

2) Actions: return value to caller

Count, sum, reduce, save, collect

3) Persist RDD to memory

⇒ Transformations are lazy: evaluation triggered by Action


val lines = spark.textFile("hdfs://...")
val lineLengths = lines.map(s => s.length)
val totalLength = lineLengths.reduce((a, b) => a + b)
lineLengths.persist()
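A small illustration (not from the slides; it assumes a SparkContext named sc and uses only standard RDD operations): transformations merely record lineage, and nothing runs until an action is invoked.

val nums = sc.parallelize(1 to 1000000)     // base RDD from an in-memory range
val evens = nums.filter(_ % 2 == 0)         // transformation: only lineage is recorded
val squares = evens.map(n => n.toLong * n)  // still lazy, still no computation
val total = squares.reduce(_ + _)           // action: the whole chain executes now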

Page 13: Distributed Systems Fall 2018


Example 1: Log Mining

Parse error messages from logs, filter, and query interactively.


lines = spark.textFile("hdfs://...")
errors = lines.filter(_.startsWith("ERROR"))
messages = errors.map(_.split('\t')(2))
cachedMsgs = messages.persist()
cachedMsgs.filter(_.contains("foo")).count
cachedMsgs.filter(_.contains("bar")).count

Full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data).

Recall: fault recovery via lineage
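A hedged aside (standard RDD API, not shown on the slide): the recorded lineage behind this recovery can be inspected directly on the RDD from the example above.

println(cachedMsgs.toDebugString)  // prints the dependency chain, e.g. map <- filter <- textFile
// If a cached partition of cachedMsgs is lost, Spark re-runs filter and map on the
// corresponding partition of the HDFS input instead of restoring a replica.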

Page 14: Distributed Systems Fall 2018


Example 2: Regression Algorithms

Find a line (plane) that separates some data points


var w = Vector.random(D-1)
for (i <- 1 to ITERATIONS) {
  val gradient = data.map("math equation").reduce(_ + _)
  w -= gradient
}
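The slide leaves the gradient as a placeholder. For concreteness, here is a sketch in the spirit of the logistic regression example from the assigned RDD paper; it assumes a case class Point(x: Vector, y: Double), a parsePoint helper, and a Vector type with random, dot, and arithmetic operators, none of which are defined on the slide.

import scala.math.exp

val points = spark.textFile("hdfs://...").map(parsePoint).persist()  // reused every iteration
var w = Vector.random(D - 1)                    // random initial separating plane
for (i <- 1 to ITERATIONS) {
  val gradient = points.map { p =>
    p.x * (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y
  }.reduce(_ + _)                               // one distributed pass per iteration
  w -= gradient                                 // small local update on the driver
}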

Page 15: Distributed Systems Fall 2018


Apache Spark Deployment

Master server (“driver”)

● Lineage and scheduling

Cluster manager (not part of Spark)

● Resource allocation

● Mesos, YARN, K8S

Worker nodes

● Executors isolate concurrent tasks

● Caches persist RDDs


https://spark.apache.org/docs/latest/cluster-overview.html
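A minimal sketch of how a driver attaches to a cluster manager (the SparkConf/SparkContext calls are standard, but the app name and master URL are illustrative placeholders):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("log-mining")      // shown in the cluster manager's UI
  .setMaster("yarn")             // cluster manager; could also be "mesos://...", "spark://...", or "local[*]"
val sc = new SparkContext(conf)  // the driver: holds lineage and schedules tasks onto executors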

Page 16: Distributed Systems Fall 2018


Spark Pipeline and Scheduler

Support for directed graphs of RDD operations

Automatic pipelining of functions within a stage

Partitioning/cache-aware scheduling to minimize shuffles


Page 17: Distributed Systems Fall 2018


Spark Real World Challenges

RDD Lineage
● What if the lineage grows really large?
○ Manual checkpointing on HDFS

RDD Immutability
● Deterministic functions of input
○ How to incorporate randomness?

Other design implications?

● Needs lots of memory (might not be able to run your workload)

● High overhead: copying data (no mutate-in-place)
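A minimal sketch of the manual checkpointing mentioned above (setCheckpointDir and checkpoint are standard calls; initialRanks and step are hypothetical stand-ins for an iterative computation):

sc.setCheckpointDir("hdfs://.../checkpoints")  // reliable storage for checkpoints
var ranks = initialRanks                       // hypothetical starting RDD
for (i <- 1 to ITERATIONS) {
  ranks = step(ranks)                          // each iteration extends the lineage
  if (i % 10 == 0) {
    ranks.checkpoint()                         // cut the lineage at this RDD
    ranks.count()                              // an action forces the checkpoint to be written
  }
}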

Page 18: Distributed Systems Fall 2018


Towards a New Unified Framework

Two Goals

1. In-memory computation and data-sharing

● 10-100x faster than disks or network

● Key problem: fault tolerance

2. Unified computation abstraction

● Power of iterations ("local work + message passing")

● Key problem: ease-of-use and generality

Page 19: Distributed Systems Fall 2018


BSP computation abstraction

● Surprising power of iterations

○ (e.g., iterative Map/Reduce)

● Explained by the theory of the bulk synchronous parallel (BSP) model

Theorem (Leslie Valiant, 1990): "Any distributed system can be emulated as local work + message passing" (= BSP).

[Figure: BSP supersteps, with communication phases between rounds of local work.]

Spark implements BSP approximately

Page 20: Distributed Systems Fall 2018


Spark as a Uniform Framework

Pregel on Spark (Bagel)

⇒ “200 lines of Spark code”

Iterative MapReduce

⇒ “200 lines of Spark code“

Hive on Spark (Shark)

⇒ “500 lines of code”

MLlib and other distributed ML implementations


Page 21: Distributed Systems Fall 2018


Should You Always Use Spark?

Spark is not a good fit for

● Non-batch workloads

○ Applications with fine-grained updates to shared state

● Datasets that don’t fit into memory


“Apache Spark does not provide out-of-the-box GPU integration.” (databricks.com)

● Workloads that need high efficiency via SIMD/GPUs

Page 22: Distributed Systems Fall 2018


Topics Today

Motivation

Going beyond MapReduce

In-memory computation (Spark)

Data sharing and fault tolerance

Distributed programming model

Distributed Machine Learning

Scaling out DML

Two challenges and current solution approaches

Open challenges


Page 23: Distributed Systems Fall 2018


Machine Learning


YOUR COMPANY NEEDS

MACHINE LEARNING

[Image credits: https://www.rockpapershotgun.com/2015/02/26/sli-good-or-bad/ and http://cryptomining-blog.com/tag/multi-gpu-mining-rig/]

The ML hype

Enabled by huge leap in parallelization

Often: fast enough to build a single powerful machine with lots of GPUs

Page 24: Distributed Systems Fall 2018


Is There a Case for Distributed ML?

Some ML systems drive significant revenue

Some ML systems benefit from humongous amount of data

Some ML systems outscale even powerful machines (GPUs et al)


Ads make one case for Distributed Machine Learning.

From: Li et al., Scaling Distributed Machine Learning with the Parameter Server, OSDI 2014.

Page 25: Distributed Systems Fall 2018


What Do ML Algorithms look like?


We’ve already seen some examples

Regression, PageRank. Other examples: Bayes, K-means, NNs...

Common feature when computing these algorithms?
⇒ ML algorithms are iterative in nature

Three key challenges: 1) lots of data, 2) lots of parameters, 3) lots of iterations

(From: Eric Xing, Strategies & Principles for Distributed Machine Learning, Allen AI, 2016.)

Page 26: Distributed Systems Fall 2018


Scaling Out Distributed Machine Learning

10-100s of nodes are enough for data/model

Scale out for throughput. Goal: more iterations / sec

[Figure: iterations/sec vs. machine count (1 to 10k), with ideal speedup, good speedup, and pathetic speedup curves.]

Best case: 100x speedup from 1000 machines

Worst case: 50% slowdown from 1000 machines

Can you think of reasons?

Page 27: Distributed Systems Fall 2018


Challenge of Communication Overhead

Communication overhead scales badly with #machines (N)

e.g., for Netflix-like recommender systems

[Figure: naive communication pattern between the master node and worker nodes. From: Chowdhury et al., Managing data transfers in computer clusters with Orchestra, SIGCOMM 2011.]

Page 28: Distributed Systems Fall 2018


Challenge of Communication Overhead

Communication overhead scales badly with #machines (N)

e.g., for Netflix-like recommender systems

Solution approaches:

● Centralized server

○ 2N overhead (instead of N!)

○ Limited scalability of central server

● P2P file sharing (BitTorrent-like)

○ Higher network overhead

○ Better scalability

○ Needs to be data center aware

[Figure: naive vs. P2P communication patterns. From: Chowdhury et al., Managing data transfers in computer clusters with Orchestra, SIGCOMM 2011.]

Page 29: Distributed Systems Fall 2018


Challenge of Synchronization Overhead

BSP model:
● No computation during barrier
● No communication during computation

Fundamental limitation in BSP model: constantly waiting for stragglers

[Figure: BSP execution with repeated synchronization barriers.]

Do we need a new programming model?

Page 30: Distributed Systems Fall 2018


Relaxing BSP Consistency

Idea: nodes can accept slightly stale state

From: Eric Xing, Strategies & Principles for Distributed Machine Learning, Allen AI, 2016.

How can we incorporate stale state into the BSP model?

ML algorithms are robust

⇒ converge even with some stale state

Page 31: Distributed Systems Fall 2018


Opposite Extreme: No Synchronization

What if we fully remove BSP's synchronization barriers?

Asynchronous communication:
● no communication at all, or
● communication at any time

Observation through experiments: iterative algorithms won't converge

How can we get diverging algorithms under control?

Page 32: Distributed Systems Fall 2018


Bounded-delay BSP for Distributed ML

Bound stale state by N steps:

⇒ N-bounded delay BSP


[Figure: executions with 1-bounded delay and 2-bounded delay. From: Li et al., Scaling Distributed Machine Learning with the Parameter Server, OSDI 2014.]

What happens here?
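A toy sketch of the bounded-delay rule (this is not the Parameter Server API; the clock-tracking helpers are hypothetical): a worker may run at most bound iterations ahead of the slowest worker before it must wait for the stragglers' updates.

// myClock: this worker's iteration counter; minClock(): the slowest worker's counter.
def maybeWait(myClock: Int, minClock: () => Int, bound: Int): Unit = {
  while (myClock - minClock() > bound) {
    Thread.sleep(10)  // block until stragglers catch up; bound = 0 recovers plain BSP
  }
}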

Page 33: Distributed Systems Fall 2018


Many Challenges Remain

Trade-off: stale state -> throughput (iter / sec)

Misleading design decisions:
● Higher throughput
● Less progress / iteration

Many open challenges

Automatic model partitioning

How to schedule many parallel jobs on ML clusters

How to build a framework for interactive ML applications

⇒ Very active field of research

[Figure: iterations/sec vs. machine count, showing ideal speedup and good speedup curves.]

Page 34: Distributed Systems Fall 2018


Many open challenges

Automatic model partitioning
e.g., how to partition natural graphs

[Figure: iterations/sec vs. machine count for three partitioning types (type 1, type 2, type 3).]

How to schedule many parallel jobs on ML clusters
e.g., how many nodes should be assigned to each analysis job, given a set of speed-up functions?

How to build a framework for interactive ML applications
e.g., how to get the latency of queries to distributed ML systems below 10s? 1s? 1ms?

Ray: A Distributed Framework for Emerging AI Applications. OSDI 2018.

Page 35: Distributed Systems Fall 2018


Summary: Spark and Distributed ML

Spark and in-memory computing

● Motivation: overcome the disk I/O bottleneck

● Challenges: fault tolerance and generality/expressiveness

● Key ideas: RDDs, the BSP model, P2P architecture

Distributed Machine Learning

● Distributed variants highly complex and not always faster!

● Challenges: communication overhead and stragglers

● Key ideas: P2P+selective communication, bounded-delay BSP

● Many more challenges remain


Page 36: Distributed Systems Fall 2018


Overview of Cache Update Propagation Mechanisms

(Please read before next lecture if not covered on 10/30.)

Page 37: Distributed Systems Fall 2018


Cache Update Propagation Techniques

[Figure: Ideal World: One-Copy Semantics vs. Caching Reality. In the ideal world, a write X updates the single copy A. In reality, a master copy A has "cached" copies at servers S1, S2, S3, so a write X to the master copy raises the question of update propagation.]

Page 38: Distributed Systems Fall 2018


Cache Update Propagation Techniques

1. Enforce Read-Only (Immutable Objects)

2. Broadcast Invalidations

3. Check on Use

4. Callbacks

5. TTLs (“Faith-based Caching”)

[Figure: a master copy A with "cached" copies at S1, S2, S3; a write X is applied to the master copy.]

All of these approximate one-copy-semantics

● how little can you give up, and still remain scalable?

● how complex is the implementation?

Page 39: Distributed Systems Fall 2018


2. Broadcast Invalidations

Every potential caching site notified on every update

● No check to verify caching site actually contains object

● Notification includes specific object being invalidated

● Effectively broadcast of address being modified

● At each cache site, next reference to object will cause a miss


+ Simple to implement
+ No race conditions (with blocking writes)
- Wasted traffic if no readers
- Limited scalability (in blocking implementation)

Usage: e.g., in CDNs (next lecture)

Page 40: Distributed Systems Fall 2018


3. Check On Use

Reader checks master copy before each use

● conditional fetch, if cache copy stale

● has to be done at coarse granularity (e.g. entire file)

● otherwise every read is slowed down excessively


+ Simple to implement
+ No server state (no need to know caching nodes)
- Wasted traffic if no updates
- Very slow if high latency
- High load

Usage: e.g., AFS-1, HTTP (Cache-control: must-revalidate)
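A toy sketch of check-on-use with version numbers (illustrative only, not a protocol from the slides; masterVersion and fetchFromMaster stand in for RPCs to the server holding the master copy):

def read[V](key: String, cached: Option[(V, Long)],
            masterVersion: String => Long,
            fetchFromMaster: String => (V, Long)): (V, Long) =
  cached match {
    case Some((value, version)) if masterVersion(key) == version =>
      (value, version)      // cached copy still current: no data transfer
    case _ =>
      fetchFromMaster(key)  // stale or missing: conditional fetch of a fresh copy
  }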

Page 41: Distributed Systems Fall 2018


5. TTLs (“Faith-based Caching”)

Assume cached data is valid for a while

● Check after timer expires: Time-to-Live (TTL) field

● No communication during trust (TTL) period


+ Simple to implement
+ No server state (no need to know caching nodes)
- User-visible inconsistency
- Less efficient than callback-based schemes

Usage: e.g., CDNs, DNS, HTTP (Cache-control: max-age=30)
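A toy sketch of TTL-based caching (illustrative only; the cache map and fetchFromMaster function are hypothetical): trust the cached copy during its TTL, then revalidate against the master copy.

import scala.collection.mutable

case class Entry[V](value: V, fetchedAtMs: Long)

def get[V](key: String, ttlMs: Long,
           cache: mutable.Map[String, Entry[V]],
           fetchFromMaster: String => V): V = {
  val now = System.currentTimeMillis()
  cache.get(key) match {
    case Some(e) if now - e.fetchedAtMs < ttlMs => e.value  // within trust period: no communication
    case _ =>
      val v = fetchFromMaster(key)                          // TTL expired or miss: refetch
      cache(key) = Entry(v, now)
      v
  }
}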