


PBSE: A Robust Path-Based Speculative Execution for Degraded-Network Tail Tolerance in Data-Parallel Frameworks

Riza O. Suminto, Cesar A. Stuardo, Alexandra Clark (University of Chicago)
Huan Ke, Tanakorn Leesatapornwongsa, Bo Fu1 (University of Chicago)
Daniar H. Kurniawan (Bandung Institute of Technology)
Vincentius Martin2 (Surya University)
Uma Maheswara Rao G. (Intel Corp.)
Haryadi S. Gunawi (University of Chicago)

ABSTRACT

We reveal loopholes of Speculative Execution (SE) implementations under a unique fault model: node-level network throughput degradation. This problem appears in many data-parallel frameworks such as Hadoop MapReduce and Spark. To address this, we present PBSE, a robust, path-based speculative execution that employs three key ingredients: path progress, path diversity, and path-straggler detection and speculation. We show how PBSE is superior to other approaches such as cloning and aggressive speculation under the aforementioned fault model. PBSE is a general solution, applicable to many data-parallel frameworks such as Hadoop/HDFS+QFS, Spark and Flume.

CCS CONCEPTS

• Computer systems organization → Cloud computing; Distributed architectures; Dependable and fault-tolerant systems and networks;

KEYWORDS

Dependability, distributed systems, tail tolerance, speculative execution, Hadoop MapReduce, HDFS, Apache Spark

ACM Reference Format:

Riza O. Suminto, Cesar A. Stuardo, Alexandra Clark, Huan Ke, Tanakorn Leesatapornwongsa, Bo Fu1, Daniar H. Kurniawan, Vincentius Martin2, Uma Maheswara Rao G., and Haryadi S. Gunawi. 2017. PBSE: A Robust Path-Based Speculative Execution for Degraded-Network Tail Tolerance in Data-Parallel Frameworks. In Proceedings of SoCC ’17, Santa Clara, CA, USA, September 24–27, 2017, 14 pages. https://doi.org/10.1145/3127479.3131622

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

SoCC ’17, September 24–27, 2017, Santa Clara, CA, USA

© 2017 Copyright held by the owner/author(s). Publication rights licensed to Association for Computing Machinery. ACM ISBN 978-1-4503-5028-0/17/09 ... $15.00. https://doi.org/10.1145/3127479.3131622

1 INTRODUCTION

Data-parallel frameworks have become a necessity in large-scale computing. To finish jobs on time, such parallel frameworks must address the “tail latency” problem. One popular solution is speculative execution (SE); with SE, if a task runs slower than other tasks in the same job (a “straggler”), the straggling task is speculated (via a “backup task”). With a rich literature of SE algorithms (§3.4, §7), existing SE implementations such as those in Hadoop and Spark are considered quite robust. Hadoop’s SE, for example, has architecturally remained the same for the last six years, based on the LATE algorithm [83]; it can effectively handle stragglers caused by common sources of tail latencies such as resource contention and heterogeneous resources.

However, we found an important source of tail latencies that current SE implementations cannot handle gracefully: node-level network throughput degradation (e.g., a 1Gbps NIC bandwidth drops to tens of Mbps or even below 1 Mbps). Such a fault model can be caused by severe network contention (e.g., from VM over-packing) or by degraded network devices (e.g., NICs and network switches whose bandwidth drops by orders of magnitude to Mbps/Kbps levels in production systems; more in §2.2). The uniqueness of this fault model is that only the network device performs poorly, while other hardware components such as CPU and storage work normally; it is significantly different from typical fault models for heterogeneous resources or CPU/storage contention.

To understand the impact of this fault model, we tested Hadoop [11] as well as other systems including Spark [82], Flume [9], and S4 [12], on a cluster of machines with one slow-NIC node. We discovered that many tasks transfer data through the slow-NIC node but cannot escape from it, resulting in long tail latencies. Our analysis then uncovered many surprising loopholes in existing SE implementations (§3), which we bucket into two categories: the “no” straggler problem, where all tasks of a job involve the slow-NIC node (since all tasks are slow, there is “no” straggler detected), and the straggling backup problem, where the backup task involves the slow-NIC node again (hence, both the original and the backup tasks are straggling at the same time).

1 Now a PhD student at Purdue University.
2 Now a PhD student at Duke University.


Overall, we found that a network-degraded node is worse than a dead node, as the node can create a cascading performance problem. One network-degraded node can make the performance of the entire cluster collapse (e.g., after several hours the whole-cluster job throughput can drop from hundreds of jobs per hour to 1 job/hour). This cascading effect can happen as unspeculated slow tasks lock up the task slots for a long period of time.

Given the maturity of SE implementations (e.g., Hadoop is 10 years old), we investigate further the underlying design flaws that lead to the loopholes. We believe there are two flaws. First, node-level network degradation (without CPU/storage contention) is not considered a fault model. Yet, this real fault model affects data-transfer paths, not just tasks per se. This leads us to the second flaw: existing SE approaches only report task progress but do not expose path progress. Sub-task progresses such as data transfer rates are simply lumped into one progress score, hiding slow paths from being detected by the SE algorithm.

In this paper, we present a robust solution to the problem, path-based speculative execution (PBSE), which contains three important ingredients: path progress, path diversity, and path-straggler detection and speculation.

First, in PBSE, path progresses are exposed by tasks to their job/application manager (AM). More specifically, tasks report the data transfer progresses of their Input→Map, Map→Reduce (shuffling), and Reduce→Output paths, allowing the AM to detect straggling paths. Unlike the simplified task progress score, our approach does not hide the complex dataflows and their progresses, which are needed for a more accurate SE. Paths were not typically exposed due to limited transparency between the computing (e.g., Hadoop) and storage (e.g., HDFS) layers; previously, only data locality was exposed. PBSE advocates the need for more cross-layer collaboration, in a non-intrusive way.

Second, before straggling paths can be observed, PBSE must enforce path diversity. Let’s consider an initial job (data-transfer) topology X→B and X→C, where node X can potentially experience a degraded NIC. In such a topology, X is a single point of tail-latency failure (“tail-SPOF” for short). Path diversity ensures that no potential tail-SPOF exists in the initial job topology. Each MapReduce stage will now involve multiple distinct paths, enough for revealing straggling paths.

Finally, we develop path-straggler detection and speculation, which can accurately pinpoint slow-NIC nodes and create speculative backup tasks that avoid the problematic nodes. When a path A→B is straggling, the culprit can be A or B. For a more effective speculation, our techniques employ the concept of failure groups and a combination of heuristics such as greedy, deduction, and dynamic-retry approaches, which we uniquely personalize for MapReduce stages.

We have implemented PBSE in the Hadoop/HDFS stack in a total of 6003 LOC. PBSE runs side by side with the base SE; it does not replace but rather enhances the current SE algorithm. Beyond Hadoop/HDFS, we also show that other data-parallel frameworks suffer from the same problem. To show PBSE’s generality, we have also successfully performed initial integrations with the Hadoop/QFS stack [67], Spark [82], and Flume [9].

Symbols      Descriptions
AM           Application/Job Manager
Ii           Node for an HDFS input block to task i
I′i, I′′i    2nd and 3rd replica nodes of an input block
Mi           Node for map task i
M′i          Node for speculated (′) map i
Ri           Node for reduce task i
Oi           Node for output block
O′i, O′′i    2nd and 3rd replica nodes of an output block

A sample of a job topology (2 maps, 2 reduces):
Map phase:      I1→M1, I2→M2
Shuffle phase:  M1→R1, M1→R2, M2→R1, M2→R2
Reduce phase:   R1→O1→O′1→O′′1, R2→O2→O′2→O′′2
Speculated M1:  I′1→M′1
Speculated R2:  M1→R′2, M2→R′2

Table 1: Symbols. The table above describes the symbols that we use to represent a job topology, as discussed in Section 2.1 and illustrated in Figures 1, 3, and 5.

We performed an extensive evaluation of PBSE with a variety of NIC bandwidth degradations (60 to 0.1 Mbps), real-world production traces (Facebook and Cloudera), cluster sizes (15 to 60 nodes), scheduling policies (Capacity, FIFO, and Fair), and other tail-tolerance strategies (aggressive SE, cloning, and hedged read). Overall, we show that under our fault model, PBSE is superior to other strategies (2–70× speed-ups above the 90th percentile under various severities of NIC degradation), which is possible because PBSE is able to escape from the network-degraded nodes (as it fixes the limitation of existing SE strategies).

In summary, PBSE is robust; it handles many loopholes (tail-SPOF topologies) caused by node-level network degradation, ensuring that no single job is “locked” in degraded mode. Under this fault model, PBSE performs faster than other strategies such as advanced schedulers, aggressive SE, and cloning. PBSE is accurate in detecting nodes with a degraded network; a prime example is accurately detecting a map’s slow NIC within the reduce/shuffling stage, while Hadoop always blames the reducer side. PBSE is informed; slow paths that used to be a silent failure in Hadoop SE are now exposed and avoided in speculative paths. PBSE is simple; it runs side by side with the base SE and its integration does not require major architectural changes. Finally, PBSE is general; it can be integrated into many data-parallel frameworks.

In the following sections, we present an extended motivation (§2), Hadoop SE loopholes (§3), PBSE design and evaluation (§4-5), further integrations (§6), and related work and conclusion (§7-8).

2 BACKGROUND AND MOTIVATION

In this section, we first describe some background material (§2.1) and present real cases of degraded network devices (§2.2), which motivate our unique fault model (§2.3). We then highlight the impact of this fault model on Hadoop cluster performance (§2.4) and discuss why existing monitoring tools are not sufficient and why Hadoop SE must be modified to handle the fault model (§2.5).



[Figure 1: A Hadoop job and a successful SE. Figures (a) and (b) are explained in the “Symbols” and “Successful SE” discussions in Section 2.1, respectively. Legend: node, slow NIC, slow task, data locality, network transfer.]

2.1 Background

Yarn/Hadoop 2.0: A Hadoop node contains task containers/slots. When a job is scheduled, Hadoop creates an Application Manager (AM) and deploys the job’s parallel tasks on allocated containers. Each task sends a periodic progress score to the AM (via heartbeat). When a task reads/writes a file, it asks the HDFS namenode to retrieve the file’s datanode locations. Hadoop and HDFS nodes are colocated, thus a task can access data remotely (via NIC) or locally (via disk; aka “data locality”). A file is composed of 64MB blocks, each 3-way replicated.

Symbols: Table 1 describes the symbols we use throughout the paper to represent a job topology. For example, Figure 1a illustrates a Hadoop job reading two input blocks (I1 and I2); each input block can have 3 replicas (e.g., I2, I′2, I′′2). The job runs 2 map tasks (M1, M2); reduce tasks (R1, R2) are not shown yet. The first map achieves data locality (I1→M1 is local) while the second map reads data remotely (I2→M2 is via NICs). A complete job will have three stages: Input→Map (e.g., I1→M1), Map→Reduce shuffle (e.g., M1→R1, M1→R2), and the Reduce→Output 3-node write pipeline (e.g., R2→O2→O′2→O′′2).

Successful SE: The Hadoop SE algorithm (or “base SE” for short), which is based on LATE [83, §4], runs in the AM of every job. Figure 1b depicts a successful SE: I2’s node has a degraded NIC (bold circle), thus M2 runs slower than M1 and is marked as a straggler; the AM then spawns a new speculative/backup task (M′2) on a new node that coincidentally reads from another fast input replica (I′2→M′2). For every task, the AM by default allows only one backup task.

2.2 Degraded Network Devices

Beyond fail-stop, network devices can exhibit “unexpected” forms of failures. Below, we retell real cases of limping network devices in the field [1–8, 44, 48, 49, 57].

In many cases, NIC cards exhibit a high degree of packet loss (from 10% up to 40%), which then causes spikes of TCP retries, dropping throughput by orders of magnitude. An unexpected auto-negotiation between a NIC and a TOR switch reduced the bandwidth between them (an auto-configuration issue). A clogged air filter in a switch fan caused overheating, and subsequently heavy re-transmission (e.g., 10% packet loss). Some optical transceivers collapsed from Gbps to Kbps rates (but only in one direction). A non-deterministic Linux driver bug degraded a Gbps NIC’s performance to Kbps rates. Worn-out cables reportedly can also drop network performance. A worn-out Fibre Channel pass-through module in a high-end server blade added 200-3000 ms of delay.

As an additional note, we also attempted to find (or perform) large-scale statistical studies of this problem, but to no avail. As alluded to elsewhere, stories of “unexpected” failures are unfortunately “only passed by operators over beers” [36]. For performance-degraded devices, one of the issues is that most hardware vendors do not log performance faults at such a low level (unlike hard errors [37]). Some companies log low-level performance metrics but aggregate the results (e.g., hourly disk average latency [51]), preventing a detailed study. Thus, the problem of performance-degraded devices is still understudied and requires further investigation.

2.3 Fault Model

Given the cases above, our fault model is a severe network bandwidth degradation experienced by one or more machines. For example, the bandwidth of a NIC can drop to low Mbps or Kbps levels, which can be caused by many hardware and software faults such as bit errors, extreme packet loss, overheating, clogged air filters, defects, buggy auto-negotiations, and buggy firmware and drivers, as discussed above.

A severe node-level bandwidth degradation can also happen in public multi-tenant clouds where extreme outliers are occasionally observed [66, Fig. 1]. For instance, if all the tenants of a 32-core machine run network-intensive processes, each process might only observe ∼30 Mbps, given a 1Gbps NIC. With higher-bandwidth 10-100Gbps NICs and future 1000-core processors [26, 79], the same problem will apply. Furthermore, over-allocation of VMs beyond the available CPUs can reduce the bandwidth obtained by each VM due to heavy context switching [76]. Such a problem of “uneven congestion” across datanodes is relatively common [16].

2.4 Impacts

Slow tasks are not speculated: Under the fault model above, Hadoop SE fails to speculate slow tasks. Figure 2a shows the CDF of job duration times of a Facebook workload running on a 15-node Hadoop cluster without and with one 1-Mbps slow node (With-0-Slow vs. With-1-Slow). The 1-Mbps slow node represents a degraded NIC. As shown, in a healthy cluster, all jobs finish in less than 3 minutes. But with a slow-NIC node, many tasks are not speculated and cannot escape the degraded NIC, resulting in long job tail latencies, with 10% of the jobs (y=0.9) finishing in more than 1 hour.

“One degraded device to slow them all:” Figure 2b shows the impact of a slow NIC on the entire cluster over time. Without a slow NIC, the cluster’s throughput (#jobs finished) increases steadily (around 172 jobs/hour). But with a slow NIC, after about 4 hours (x=250 min) the cluster throughput collapses to 1 job/hour.

The two figures show that existing speculative execution fails to cut tail latencies induced by our fault model. Later, in Section 3, we dissect the root causes.



[Figure 2: Impact of a degraded NIC. Panel (a), CDF of job duration (job latency in minutes), shows the CDF of job duration times of a Facebook workload on a 15-node Hadoop cluster without and with a slow node (With-0-Slow vs. With-1-Slow lines); the slow node has a 1-Mbps degraded NIC. Panel (b), job throughput (#jobs finished, x100, vs. time in minutes), is a replica of Figure 2 in our prior work [44], showing that after several hours the problem cascades to the entire cluster, making the cluster throughput drop to 1 job/hour.]

2.5 The Need for a More Robust SE and Why Monitoring Tools Do Not Help

We initially believed that the problem could be easily solved with some cluster monitoring tools (hence why we did not invent PBSE directly after our earlier work [44]). However, the stories above point to the fact that monitoring tools do not always help operators detect and remove the problem quickly. The problems above took days to weeks to be fully resolved, and meanwhile, the problems continuously affected the users and sometimes cascaded to the entire cluster. One engineer called this a “costly debugging tail.” That is, while more-frequent failures can be solved quickly, less-frequent but complex failures (that cannot be mitigated by the systems) can significantly cost the engineers’ time. In one story, an entire group of developers was pulled in to debug the problem, costing the company thousands of dollars.

Monitoring tools are not sufficient for the following reasons: (1) Monitoring tools are important, but they are “passive/offline” solutions; they help signal slow components, but shutting down a slow machine is still the job of human operators (an automated shutdown algorithm can have false positives that incorrectly shut down healthy machines). (2) To perform diagnoses, (data) nodes cannot be abruptly taken offline, as that can cause excessive re-replication load [68]. The operator’s decision to decommission degraded nodes must be well grounded; many diagnoses must be performed before a node is taken offline. (3) Degraded modes can be complex (asymmetrical and/or transient), making diagnosis harder. In one worst-case story, several months were needed to pinpoint worn-out cables as the root cause. (4) As a result, there is a significant window of time between when the degradation starts and when the faulty device is fully diagnosed. Worse, due to economies of scale, 100-1000 servers are managed by a single operator [34], hence a longer diagnosis time.

In summary, the discussion above makes the case for a more robust speculative execution as a proper “online” solution to the problem. In fact, we believe that the path-straggler detection in PBSE can be used as an alarm or a debugging aid for existing monitoring tools.

[Figure 3: Tail-SPOF and “No” Stragglers. Panels: (a) same slow input source, (b) same slow map source, (c) same slow output destination. Figures (a)-(c) are described in Sections 3.1a-c, respectively. I1/I2 in Figure (a), M2 in (b), and O1/O2 in (c) are a tail-SPOF that makes all affected tasks slow at the same time, hence “no” straggler. Please see Figure 1 for the legend description.]

3 SE LOOPHOLES

This section presents the many cases of failed speculative execution (i.e., “SE loopholes”). For simplicity of discussion, we only inject one degraded NIC.

Benchmarking: To test Hadoop SE’s robustness to the fault model above, we ran real-world production traces on 15-60 nodes with one slow NIC (more details in §5). To uncover SE loopholes, we collect all task topologies where the job latencies are significantly longer than the expected latency (as if the jobs ran on a healthy cluster without any slow NIC). We ran long experiments, more than 850 hours, because every run leads to non-deterministic task placements that could reveal new loopholes.

SE Loopholes: An SE loophole is a unique topology where a job cannot escape from the slow NIC. That is, the job’s latency follows the rate of the degraded bandwidth. In such a topology, the slow NIC becomes a single point of tail-latency failure (a “tail-SPOF”). By showing the tail-SPOF, we can reveal not just the straggling tasks, but also the straggling paths. Below we describe some of the representative loopholes we found (all have been confirmed by Hadoop developers). For each, we use a minimum topology for simplicity of illustration. The loopholes are categorized into no-straggler-detected (§3.1) and straggling-backup (§3.2) problems.

We note that our prior work only reported three “limplock” topologies [44, §5.1.2] from only four simple microbenchmarks. In this subsequent work, hundreds of hours of deployment allow us to debug more job topologies and uncover more loopholes.

3.1 No Straggler Detected

Hadoop SE is only triggered when at least one task is straggling. We discovered several topologies where all tasks (of a job) are slow, hence “no” straggler.

(a) Same slow input source: Figure 3a shows two map tasks (M1, M2) reading data remotely. Coincidentally (due to HDFS’s selection randomness when locality is not met), both tasks retrieve their input blocks (I1, I2) from the same slow-NIC node. Because all the tasks are slow, there is “no” straggler. Ideally, a notion of “path diversity” should be enforced to ensure that the tail-SPOF (I1/I2’s node) is detected. For example, M2 should choose another input source (I′2 or I′′2).

(b) Same slow map source: Figure 3b shows a similar problem, but now during shuffle. Here, the map tasks (M1, M2) have already completed normally; note that M2 is also fast due to data locality (I2→M2 does not use the slow NIC). However, when the shuffle starts, all the reducers (R1, R2) fetch M2’s intermediate data through the slow NIC, hence “no” straggling reducers. Ideally, if we monitored path progresses (the four arrows), the two straggling paths (M2→R1, M2→R2) and the culprit node (M2’s node) could be easily detected.

[Figure 4: SE Algorithm “Bug”. Panels: (a) reducer progress scores (R1-R3) over elapsed time in seconds; (b) estimated complete vs. replace time (EstComplete, EstReplace) in minutes over elapsed time in seconds. The two figures are explained in Section 3.1d.]

While case (a) above might only happen in small-parallel jobs, case (b) can easily happen in large-parallel jobs, as it only needs one slow map task with data locality to hit the slow NIC.

(c) Same slow output intersection: Figure 3c shows another case in the write phase. Here, the reducers’ write pipelines (R1→O1→O′1, R2→O2→O′2) intersect the same slow NIC (O1/O2’s node), hence “no” straggling reducer.

(d) SE algorithm “bug”: Figure 4a shows progress scores of three reducers; all observe a fast shuffle (the quick increase of progress scores to 0.8), but one reducer (R1) gets a slow NIC in its output pipeline (flat progress score). Ideally, R1 (which is much slower than R2 and R3) should be speculated, but SE is never triggered. Figure 4b reveals the root cause of why R1 never gets speculated. With a fast shuffle but a slow output, the estimated R1 replacement time (EstReplace) is always slightly higher (0.1-0.3 minutes) than the estimated R1 completion time (EstComplete). Hence, speculation is incorrectly deemed not beneficial.
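To make this loophole concrete, the following is a minimal sketch of the kind of benefit check described above (a simplified, LATE-style calculation with hypothetical names, not Hadoop's actual code): because the lumped progress score is inflated by the fast shuffle, the completion estimate stays just below the replacement estimate and the backup is never launched.

    // Hypothetical sketch of the "is a backup worthwhile?" check from 3.1(d).
    // Not Hadoop's actual speculator; names and numbers are illustrative only.
    class SpeculationBenefitCheck {
        // Naive completion estimate extrapolated from one lumped progress score.
        static double estCompleteSecs(double progress, double elapsedSecs) {
            return elapsedSecs / Math.max(progress, 1e-6) - elapsedSecs;
        }

        // Speculate only if a fresh backup is estimated to beat the current task.
        static boolean shouldSpeculate(double progress, double elapsedSecs,
                                       double estReplaceSecs) {
            return estCompleteSecs(progress, elapsedSecs) > estReplaceSecs;
        }

        public static void main(String[] args) {
            // R1: a fast shuffle pushed the lumped progress to 0.8 in 100 seconds,
            // but the output pipeline is stuck behind a degraded NIC. Extrapolation
            // says ~25 seconds remain, slightly below a ~35-second replacement
            // estimate, so speculation is never triggered (the loophole).
            System.out.println(shouldSpeculate(0.8, 100, 35));  // prints false
        }
    }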

3.2 Straggling Backup Tasks

Let’s suppose a straggler is detected; the backup task is then expected to finish faster. By default, Hadoop limits one backup per speculated task (e.g., M′1 for M1, but no M′′1). We found loopholes where this “one-shot” task speculation is also “unlucky” (it involves the slow NIC again).

(a) Same slow input source: Figure 5a shows a map (M2) reading from a slow remote node (I2); M2 is correctly marked as a straggler (slower than M1). However, when the backup task (M′2) started, HDFS gave it the same input source (I2). As a result, both the original and backup tasks (M2, M′2) are slow.

(b) Same slow node in original and backup paths: Figure 5b reveals a similar case but with a slightly different topology. Here, the backup (M′2) chooses a different source (I′2), but it is placed on the slow node. As a result, both the original and backup paths (I2→M2 and I′2→M′2) involve a tail-SPOF (M′2/I2’s node).

(c) Same slow map source: Figure 5c depicts a similar shuffle topology as Figure 3b, but now the shuffle pattern is not all-to-all (the pattern is job/data specific). Since M2’s NIC is slow, R2 becomes a straggler. The backup R′2, however, cannot choose a map other than M2, so it must read through the slow NIC again. If the initial paths (the first three arrows) are exposed, a proper recovery can be done (e.g., pinpoint M2 as the culprit and run M′2).

[Figure 5: Tail-SPOF and Straggling Backups. Panels: (a) same slow input source, (b) same slow node in original & backup paths, (c) same slow map source. Figures (a)-(c) are discussed in Sections 3.2a-c, respectively. I2 in Figure (a), I2/M′2 in (b), and M2 in (c) are a tail-SPOF; the slow NIC is coincidentally involved again in the backup task. Please see Figure 1 for the legend description.]

3.3 The Cascading Impact

In Section 2.4 and Figure 2b, we showed that the performance of the entire cluster can eventually collapse. As explained in our previous work [44, §5.1.4], the reason is that the slow and unspeculated tasks occupy the task containers/slots in healthy nodes for a long time. For example, in Figures 3 and 5, although only one node is degraded (bold edge), other nodes are affected (striped nodes). Newly arriving jobs may interact with the degraded node. Eventually, all tasks in all nodes are “locked up” by the degraded NIC and there are not enough free containers/slots for new jobs. In summary, failing to speculate tasks slowed by a degraded network can be fatal.

3.4 The Flaws

Hadoop is a decade-old, mature software system, and many SE algorithms are derived from Hadoop/MapReduce [43, 83]. Thus, we believe there are some fundamental flaws that lead to the existence of SE loopholes; we believe there are two:

(1) Node-level network degradation is not incorporated as a fault model. Yet, such failures occur in production. This fault model is different from “node contentions,” where CPU and local storage are also contended (in which case the base SE is sufficient). In our model, only the NIC is degraded, not the CPU or storage.

(2) Task ≠ Path. When the fault model above is not incorporated, the concept of a path is not considered. Fatally, the path progresses of a task are lumped into one progress score, yet a task can observe differing path progresses. Due to the lack of path information, slow paths are hidden. Worse, Hadoop can blame the straggling task even though the culprit is another node (e.g., in Figure 5c, R2 is blamed even though M2 is the culprit).

In other works, task sub-progresses are also lumped into one progress score (e.g., in LATE [83, §4.4], GRASS [32, §5.1], Mantri [33, §5.5], Wrangler [77, §4.2.1], Dolly [30, §2.1.1], ParaTimer [63, §2.5], Parallax [64, §3]). While these novel methods are superb in optimizing SE for other root causes (e.g., node contentions, heterogeneous resources), they do not specifically tackle network-only degradation (at individual nodes). Some of these works also try to monitor data transfer progresses [33], but these are still lumped into a single progress score.

4 PBSE

We now present the three important elements of PBSE: path progress (§4.1), path diversity (§4.2), and path-straggler detection and speculation (§4.3); at the end, we conclude with the advantages of PBSE (§8). This section details our design in the context of the Hadoop/HDFS stack with 3-way replication and one slow NIC (F=1, where F denotes the tolerable number of failures).

4.1 Paths

The heart of PBSE is the exposure of path progresses to the SE algorithm. A path progress P is a tuple {Src, Dst, Bytes, T, BW}, sent by tasks to the job’s manager (AM); Bytes denotes the amount of bytes transferred within the elapsed time T, and BW denotes the path bandwidth (derived from Bytes/T) between the source-destination (Src, Dst) pair. Path progresses are piggybacked along with existing task heartbeats to the AM. In PBSE, tasks expose the following paths to the AM:

• Input→Map (I→M): In the Hadoop/HDFS stack, this is typically a one-to-one path (e.g., I2→M2), as a map task usually reads one 64/128-MB block. Inputs of multiple blocks are usually split across multiple map tasks.

• Map→Reduce (M→R): This is typically an all-to-all shuffling communication between a set of map and reduce tasks; many-to-many or one-to-one communication is possible, depending on the user-defined jobs and data content. The AM can now compare the path progress of every M→R path in the shuffle stage.

• Reduce→Output (R→O): Unlike the earlier paths above, an output path is a pipeline of sub-paths (e.g., R1→O1→O′1→O′′1). A single slow node in the pipeline will become a downstream bottleneck. To allow fine-grained detection, we expose the individual sub-path progresses. For example, if R1→O1 is fast but O1→O′1 and O′1→O′′1 are slow, O′1 can be the culprit.

The key to our implementation is more information exposure from the storage (HDFS) layer to the compute (Hadoop) layer. Without more transparency, important information about paths is hidden. Fortunately, the concept of transparency already exists in Hadoop/HDFS (e.g., data locality exposure), hence the feasibility of our extension. The core responsibility of HDFS does not change (i.e., read/write files); it now simply exports more information to support more SE intelligence in the Hadoop layer.
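To make the path-progress report concrete, here is a minimal sketch of how the {Src, Dst, Bytes, T, BW} tuple could be carried on a task heartbeat; the class and field names are ours (hypothetical), not PBSE's actual implementation.

    // Hypothetical sketch of a path-progress report piggybacked on a task heartbeat.
    // The fields mirror the {Src, Dst, Bytes, T, BW} tuple described in Section 4.1.
    class PathProgress {
        final String srcNode;        // source node of the transfer (e.g., I2's datanode)
        final String dstNode;        // destination node (e.g., M2's node)
        final long bytesTransferred; // Bytes: amount of data moved so far
        final double elapsedSeconds; // T: elapsed transfer time

        PathProgress(String src, String dst, long bytes, double seconds) {
            this.srcNode = src;
            this.dstNode = dst;
            this.bytesTransferred = bytes;
            this.elapsedSeconds = seconds;
        }

        // BW: observed path bandwidth, derived from Bytes/T (bytes per second).
        double bandwidth() {
            return bytesTransferred / Math.max(elapsedSeconds, 1e-6);
        }
    }

    // A heartbeat to the AM carries the usual lumped progress score plus the
    // per-path progresses observed in the current stage.
    class TaskHeartbeat {
        double progressScore;
        java.util.List<PathProgress> paths;
    }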

4.2 Path Diversity

Straggler detection is only effective if independent progresses are comparable. However, patterns such as X→M1 and X→M2, with X as the tail-SPOF, are possible, in which case potential stragglers are undetectable. To address this, path diversity prevents a potential tail-SPOF by enforcing independent, comparable paths. While the idea is simple, the challenge lies in efficiently removing potential input-SPOF, map-SPOF, reduce-SPOF, and output-SPOF in every MapReduce stage:

(a) No input-SPOF in I→M paths: It is possible that map tasks on different nodes read inputs from the same node (I1→M1, I2→M2, and I1=I2, where A=B means A and B are on the same node). To enforce path diversity, map tasks must ask HDFS to diversify input nodes, at least to two (F+1) source nodes.

There are two possible designs, proactive and reactive. The proactive design forces all tasks of a job to synchronize with each other to verify the receipt of at least two input nodes. This early synchronization is not practical because tasks do not always start at the same time (it depends on container availability). Furthermore, considering that in the common case not all jobs receive an input-SPOF, this approach imposes an unnecessary overhead.

We take the reactive approach (sketched below). We let map tasks run independently in parallel, but when map tasks send their first heartbeats to the AM, they report their input nodes. If the AM detects a potential input-SPOF, it will reactively inform one (as F=1) of the tasks to ask the HDFS namenode to re-pick another input node (e.g., I′2→M2 and I′2≠I1, where A≠B means A and B are on different nodes). After the switch (I2 to I′2), the task continues reading from the last read offset (no restart overhead).
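A minimal sketch of this reactive check, assuming the AM has already received each map task's current input node from its first heartbeat (the class and method names are hypothetical):

    import java.util.*;

    // Hypothetical AM-side check for an input-SPOF (F=1): if every map task of a
    // job reads its input block from the same datanode, ask one task to re-pick.
    class InputSpofDetector {
        // inputNodeOfMap: map-task id -> datanode currently serving its input block.
        // Returns the id of one map task that should ask HDFS for another replica,
        // or empty if the job already reads from at least two distinct input nodes.
        Optional<String> taskToRedirect(Map<String, String> inputNodeOfMap) {
            Set<String> distinctInputNodes = new HashSet<>(inputNodeOfMap.values());
            if (inputNodeOfMap.size() < 2 || distinctInputNodes.size() >= 2) {
                return Optional.empty();   // no potential input-SPOF
            }
            // All maps share one input node: redirect any one of them (F=1) and
            // blacklist the shared node so the namenode picks a different replica.
            return inputNodeOfMap.keySet().stream().findAny();
        }
    }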

(b) No map-SPOF in I→M and M→R paths: It is possible that map tasks are assigned to the same node (I1→M1, I2→M2, M1=M2, and M1/M2’s node is a potential tail-SPOF); note that Hadoop only disallows a backup and its original task from running on the same node (e.g., M1≠M′1, M2≠M′2). Thus, to prevent one map-SPOF (F=1), we enforce that at least two nodes (F+1) are chosen for all the map tasks of a job. One caveat is when a job deploys only one map (reads only one input block), in which case we split it into two map tasks, each reading half of the input. This case, however, is very rare.

As for the implementation, when a job manager (AM) requests C containers from the resource manager (RM), the AM also supplies the rule. The RM will then return C−1 containers to the AM first, which is important so that most tasks can start. For the last container, if the rule is not satisfied and no other node is currently available, the RM must wait. To prevent starvation, if the other tasks have already finished halfway, the RM can break the rule (a sketch of this rule follows below).
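A simplified sketch of this allocation rule, with hypothetical names (the real AM/RM protocol is more involved): the last container is granted only if it keeps the job's map tasks on at least two distinct nodes, unless the job is already half done.

    import java.util.*;

    // Hypothetical sketch of the "no map-SPOF" allocation rule: C-1 containers are
    // granted immediately; the last one is held until the maps span >= 2 nodes
    // (F+1), unless other tasks are already half done (starvation guard).
    class DiverseContainerRule {
        boolean canGrantLastContainer(Set<String> nodesAlreadyGranted,
                                      String candidateNode,
                                      double jobProgress) {
            Set<String> nodes = new HashSet<>(nodesAlreadyGranted);
            nodes.add(candidateNode);
            boolean diverseEnough = nodes.size() >= 2;    // at least F+1 distinct nodes
            boolean starvationRisk = jobProgress >= 0.5;  // other tasks half done
            return diverseEnough || starvationRisk;
        }
    }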

(c) No reduce-SPOF in M→R and R→O paths: In a similar way, we enforce that each job has reducers on at least two different nodes. Since the number of reducers is defined by the user, not the runtime, the only way to prevent a potential reduce-SPOF is by cloning the single reducer. This is reasonable, as a single reducer implies a small job and cloning small tasks is not costly [30].

(d) No output-SPOF in R→O paths: Output pipelines of all reducers can intersect the same node (e.g., R1→O1→O′1, R2→O2→O′2, and O1=O2). Handling this output-SPOF is similar to Rule (a). However, since a write is different from a read, the pipeline re-picking overhead can be significant if not designed carefully.

Through a few design iterations, we modify the reduce stage to pre-allocate write pipelines during shuffling and keep re-picking until all the pipelines are free from an output-SPOF. In vanilla Hadoop, write pipelines are created after shuffling (after the reducers are ready to write the output). In contrast, in our design, when shuffling finishes, the no-SPOF write pipelines are ready to use.

We now explain why pre-allocating pipelines removes a significant overhead. Unlike the read switch in Rule (a), switching nodes in the middle of a write is not possible. In our strawman design, after pipeline creation (e.g., R2→X→O′2→...), reducers report paths to the AM and begin writing, similar to Rule (a). Imagine that an output-SPOF X is found but R2 has already written 5 MB. A simple switch (e.g., R2→Y→...) is impossible because R2 no longer has the data (R2’s HDFS client layer only buffers 4 MB of output). Filling Y with the already-transferred data from O′2 would require complex changes in the storage layer (HDFS) and alter its append-only nature. Another workaround is to create a backup reducer (R′2) with a new no-SPOF pipeline (e.g., R′2→Y→...), which unfortunately incurs a high overhead as R′2 must repeat the shuffle phase. For these reasons, we employ background pre-allocation, which obviates pipeline switching in the middle of writes.

Another intricacy of output pipelines is that an output intersection does not always imply an output-SPOF. Let us consider R1→A→X and R2→B→X. Although X is an output intersection, there is enough sub-path diversity to detect a tail-SPOF. Specifically, we can still compare the upper-stream R1→A and the lower-stream A→X to detect whether X is slow. Thus, as long as the intersection node is not the first node in all the write pipelines, pipeline re-picking is unnecessary. As an additional note, we collapse local-transfer edges; for example, if R1→A and R2→B are local disk writes, A and B are removed, resulting in R1→X and R2→X, which will activate path diversity as X is a potential tail-SPOF (see the sketch below).
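A minimal sketch of this check, under our own simplified representation of a write pipeline (the ordered list of datanodes a reducer writes to, with the reducer node and local-transfer edges already collapsed); the names are hypothetical:

    import java.util.*;

    // Hypothetical check: an intersection node is treated as a potential output-SPOF
    // (requiring pipeline re-picking) only if it is the FIRST datanode of every write
    // pipeline; otherwise enough sub-path diversity remains to compare upper-stream
    // and lower-stream progresses. Assumes >= 2 non-empty pipelines (see Rule (c)).
    class OutputSpofDetector {
        boolean needsRepick(List<List<String>> pipelines) {
            String firstNode = pipelines.get(0).get(0);
            for (List<String> pipeline : pipelines) {
                if (!pipeline.get(0).equals(firstNode)) {
                    return false;  // pipelines start at different nodes: diverse enough
                }
            }
            return true;           // all pipelines start at the same node: tail-SPOF risk
        }
    }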

Finally, we would like to note that, by default, PBSE follows the original task placement (including data locality) from the Hadoop scheduler and the original input-source selection from HDFS. Only in rare conditions will PBSE break data locality. For example, suppose I1→M1 and I2→M2 achieve data locality and both data transfers happen in the same node. PBSE will try to move M2 to another node (the “no map-SPOF” rule), ideally to one of the two other nodes that contain I2’s replicas (I′2 or I′′2). But if the nodes of I′2 and I′′2 do not have a free container, then M2 must be placed somewhere else and will read its input (I2/I′2/I′′2) remotely.

4.3 Detection and Speculation

As path diversity ensures no potential tail-SPOF, we can then compare paths, detect path stragglers, and pinpoint the faulty node/NIC. Similar to the base SE, the PBSE detection algorithm is per-job (in the AM) and runs for every MapReduce stage (input, shuffle, output). As an important note, PBSE runs side by side with the base SE; the latter handles task stragglers, while PBSE handles path stragglers. The PBSE detection algorithm runs in three phases:

(1) Detecting path stragglers: In every MapReduce stage, the AM collects a set of paths (SP) and labels a path P as a potential straggling path if its BW (§4.1) is less than β × the average bandwidth, where 0 ≤ β ≤ 1.0 (configurable). The straggling path will be speculated only if its estimated end time is longer than the estimated path replacement time (plus the standard deviation). (We omit the algorithm details because it is similar to task-level time estimation and speculation [23]; the difference is that we run the speculation algorithm on path progresses, not just task-level progresses.) If a straggling path (e.g., A→B) is to be speculated, we execute the phases below, in order to pinpoint which node (A or B) is the culprit.

(2) Detecting the slow-NIC node with failure groups: We categorize every P in SP into failure/risk groups [84]. A failure group GN is created for every source/destination node N in SP. If a path P involves a node N, P is put in GN. For example, a path A→B will be in the GA and GB groups. In every group G, we take the total bandwidth. If there is a group whose bandwidth is smaller than β × the average of all group bandwidths, then a slow-NIC node is detected.

Let’s consider the shuffling topology in Figure 3b with 1000-Mbps normal links and a slow M2 NIC at 5 Mbps. The AM receives four paths (M1→R1, M1→R2, M2→R1, and M2→R2) along with their bandwidths (e.g., 490, 450, 3, and 2 Mbps, respectively). Four groups are created and the path bandwidths are grouped (M1:{490, 450}, M2:{3, 2}, R1:{490, 3}, and R2:{450, 2}). After the sums are computed, M2’s node (5 Mbps total) will be marked as the culprit, implying that M2 must be speculated, even though the job is already in the reduce stage (in PBSE, the stage does not define the straggler).
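A minimal sketch of this grouping, using the bandwidth numbers from the example above (the class and method names are hypothetical):

    import java.util.*;

    // Hypothetical sketch of failure-group detection (phase 2): sum per-node path
    // bandwidths and flag the node whose group total falls below beta times the
    // average group total.
    class FailureGroupDetector {
        // Minimal path record: source node, destination node, observed bandwidth (Mbps).
        record Path(String src, String dst, double bwMbps) {}

        static Optional<String> slowNicNode(List<Path> paths, double beta) {
            Map<String, Double> groupBw = new HashMap<>();
            for (Path p : paths) {
                groupBw.merge(p.src(), p.bwMbps(), Double::sum);  // group G_src
                groupBw.merge(p.dst(), p.bwMbps(), Double::sum);  // group G_dst
            }
            double avg = groupBw.values().stream()
                    .mapToDouble(Double::doubleValue).average().orElse(0.0);
            return groupBw.entrySet().stream()
                    .filter(e -> e.getValue() < beta * avg)
                    .min(Map.Entry.comparingByValue())
                    .map(Map.Entry::getKey);
        }

        public static void main(String[] args) {
            // Shuffle topology of Figure 3b with M2's NIC degraded to ~5 Mbps total.
            List<Path> paths = List.of(
                    new Path("M1", "R1", 490), new Path("M1", "R2", 450),
                    new Path("M2", "R1", 3),   new Path("M2", "R2", 2));
            System.out.println(slowNicNode(paths, 0.1));  // prints Optional[M2]
        }
    }

With β = 0.1 (the setting used in §5), M2’s group total (5 Mbps) falls far below one tenth of the average group bandwidth (about 473 Mbps), so M2’s node is blamed rather than the straggling reducers.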

(3) Detecting the slow-NIC node with heuristics: Failure groups work effectively in cases with many paths (e.g., many-to-many communications). In some cases, not enough paths exist to pinpoint the culprit. For example, given only a fast A→B and a straggling C→D, we cannot pinpoint the faulty node (C or D). Fortunately, given three factors (the path diversity rules, the nature of MapReduce stages, and existing rules in Hadoop SE), we can employ the following effective heuristics:

(a) Greedy approach: Let’s consider a fast I1→M1 and a straggling I2→M2; the latter must be speculated, but the fault could be in I2 or M2. Fortunately, Hadoop SE by default prohibits M′2 from running on the same node as M2. Thus, we could speculate with I2→M′2. However, we take a greedy approach and speculate a completely new pair I′2→M′2 (avoiding both the I2 and M2 nodes). To implement this, when Hadoop spawns a task (M′2), it can provide a blacklisted input source (I2) to HDFS.

Arguably, I′2’s node could be busier than I2’s node, and hence our greedy algorithm is sub-optimal. However, we find that HDFS does not employ a fine-grained load-balancing policy; it only tries to achieve rack/node data locality (otherwise, it uses a random selection). This simplicity is reasonable because Hadoop tasks are evenly spread out, hence a balanced cluster-wide read load.

(b) Deduction approach: While the greedy approach works well in the Input→Map stage, other stages need to employ a deduction approach. Let’s consider a one-to-one shuffling phase (a fast M1→R1 and a slow M2→R2). By deduction, since M2 already “passed the check” in the I→M2 stage (it was not detected as a slow-NIC node), the culprit is likely to be R2. Thus, the M2→R′2 backup path will start. Compared to the deduction approach, employing a greedy approach in the shuffling stage is more expensive (e.g., speculating M′2→R′2 requires spawning M′2).

(c) Dynamic retries: Using the same example as above (a slow M2→R2), caution must be taken if M2 reads locally. That is, if I→M2 only involves a local transfer, M2 is not yet proven to be fault-free, and blaming M2 or R2 is only 50/50-chance correct. We initially do not blame the map side because speculating M2 with M′2 is more expensive. We instead take the less expensive gamble; we first speculate the reducer (with R′2), but if the M2→R′2 path is also slow, we perform a second retry (M′2→R′2). Put simply, it can sometimes take one or two retries to pinpoint one faulty node. We call this dynamic retries, which is different from the limited retry in the base SE (default of 1).

The above examples only cover the I→M and M→R stages, but the techniques are also adopted for the R→O stage.
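A rough sketch of how these heuristics could combine for a straggling shuffle path M→R (the names are hypothetical, and the actual PBSE decision logic is richer than this):

    // Hypothetical blame assignment for a straggling shuffle path M -> R.
    // Deduction: if M already passed the I->M check, blame the reducer side.
    // Dynamic retry: if M read locally (its NIC is unchecked), gamble on the
    // cheaper reducer speculation first, then blame the map on a second retry.
    class ShuffleBlameHeuristic {
        enum Blame { REDUCER_SIDE, MAP_SIDE }

        Blame whoToSpeculate(boolean mapPassedInputStageCheck,
                             boolean firstRetryAlreadyFailed) {
            if (mapPassedInputStageCheck) {
                return Blame.REDUCER_SIDE;   // deduction: M is proven fault-free
            }
            return firstRetryAlreadyFailed ? Blame.MAP_SIDE : Blame.REDUCER_SIDE;
        }
    }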

5 EVALUATION

We implemented PBSE in Hadoop/HDFS v2.7.1 in 6003 LOC (3270 in the AM, 1351 in task management, and 1382 in HDFS). We now evaluate our implementation.

Setup: We use Emulab nodes [15], each running a (dual-thread) 2×8-core Intel Xeon CPU E5-2630v3 @ 2.40GHz with 64GB DRAM and a 1Gbps NIC. We use 15-60 nodes, 12 task slots per Hadoop node (the other 4 cores are for HDFS), and a 64MB HDFS block size. (Today, the HDFS default block size is 128 MB, which would actually show better PBSE results because of the longer data transfers; we use 64 MB to be consistent with all of our initial experiments.) We set β=0.1 (§4.3), a non-aggressive path speculation setting.

Slowdown injection: We use Linux tc to throttle one NIC to 60, 30, 10, 1, and 0.1 Mbps; 60-30 Mbps represent a NIC contended by 16-32 other network-intensive tenants, and 10-0.1 Mbps represent a realistically degraded NIC; real cases of 10%-40% packet loss were observed in production systems (§2.2), which translate to 2 Mbps to 0.1 Mbps (as throughput exhibits an exponentially decaying pattern with respect to the packet loss rate).

Workload: We use real-world production workloads from Facebook (FB2009 and FB2010) and Cloudera (CC-b and CC-e) [41]. In each, we pick a sequence of 150 jobs with the lowest inter-arrival times (i.e., a busy cluster). (150 jobs are chosen so that every normal run takes about 15 minutes; the experiments with severe delay injections, e.g., 1 Mbps, can run for hours on base Hadoop, and longer runs would prevent us from completing many experiments.) We use SWIM to replay and rescale the traces properly to our cluster sizes as instructed [24]. Figure 6 shows the distribution of job sizes and inter-arrival times.

Metrics: We use two primary metrics: job duration (T) and speed-up (= TBase/TPBSE).

5.1 PBSE vs. Hadoop (Base) SE

Figure 7 shows the CDF of latencies of 150 jobs from FB2010 on 15 nodes with five different setups, from right (worse) to left (better): base Hadoop SE with one 1-Mbps slow NIC (BaseSE-1Slow), PBSE with the same slow NIC (PBSE-1Slow), PBSE without any bad NIC (PBSE-0Slow), base SE without any bad NIC (BaseSE-0Slow), and base SE with one dead node (BaseSE-1Dead).

We make the following observations from Figure 7. First, as alluded to in §3, Hadoop SE cannot escape a tail-SPOF caused by the degraded NIC, resulting in long job tail latencies, with the longest job finishing after 6004 seconds (BaseSE-1Slow line). Second, PBSE is much more effective than Hadoop SE; it successfully cuts the tail latencies induced by the degraded NIC (PBSE-1Slow vs. BaseSE-1Slow). Third, PBSE cannot reach the “perfect” scenario (BaseSE-0Slow); we dissect this further later (§5.2). Fourth, with Hadoop SE, a slow NIC is worse than a dead node (BaseSE-1Slow vs. BaseSE-1Dead); put simply, Hadoop is robust against fail-stop failures but not against a degraded network. Finally, in the normal case, PBSE does not exhibit any overhead; the resulting job latencies in PBSE and Hadoop SE under no failure are similar (PBSE-0Slow vs. Base-0Slow).

[Figure 6: Distribution of job sizes and inter-arrival times. The left figure shows the CDF of the number of (map) tasks per job within the chosen 150 jobs from each of the production traces (FB2010, FB2009, CC-b, CC-e); the number of reduce tasks is mostly 1 in all the jobs. The right figure shows the CDF of job inter-arrival times.]

[Figure 7: PBSE vs. Hadoop (Base) SE (§5.1). CDF of job duration (job latency in seconds) for 150 FB2010 jobs running on 15 nodes with one 1-Mbps degraded NIC (1Slow), no degraded NIC (0Slow), and one dead node (1Dead). The horizontal gap between the PBSE-1Slow and BaseSE-1Slow lines represents the PBSE speed-up; the BaseSE-1Slow line uses a different x-scale, with a maximum of 6004 seconds.]

We now perform further experiments by varying the degraded NIC bandwidth (Figure 8a), the workload (8b), and the cluster size (8c). To compress the resulting figures, we only show the speed-up of PBSE over Hadoop SE (a more readable metric), as explained in the figure caption.

Varying NIC degradation: Figure 8a shows PBSE speed-ups when we vary the NIC bandwidth of the slow node to 60, 30, 10, 1, and 0.1 Mbps (the FB2010 and 15-node setups are kept the same). We make two observations from this figure. First, PBSE has higher speed-ups at higher percentiles. In Hadoop SE, if a large job is “locked” by a tail-SPOF, the job’s duration becomes extremely long. PBSE, on the other hand, can quickly detect and fail over from the straggling paths. With a 60-Mbps congested NIC, PBSE delivers some speed-ups (1.5-1.7×) above P98. With a more congested NIC (30 Mbps), the PBSE benefits start to become apparent, showing 1.5-2× speed-ups above P90. Second, the PBSE speed-up increases (2-70×) when the NIC degradation is more severe (e.g., the speed-ups under 1 Mbps are relatively higher than under 10 Mbps). However, under a very severe NIC degradation (0.1 Mbps), our speed-up is still positive but slightly reduced. The reason is that at 0.1 Mbps, the degraded node becomes highly congested, causing timeouts and triggering fail-stop failover. Again, in Hadoop SE, a dead node is better than a slow NIC (Figure 7). The dangerous point is when a degraded NIC slows down at a rate that does not trigger any timeout.

[Figure 8: PBSE speed-ups (vs. Hadoop Base SE) with varying (a) bandwidth of the degraded NIC (0.1, 1, 10, 30, 60 Mbps), (b) workload (FB2009, FB2010, CC-b, CC-e), and (c) cluster size (15, 30, 45, 60 nodes) (§5.1). Speed-up is plotted on a log2 scale at percentiles P60-P100. The x-axis represents the percentiles (y-axis) in Figure 7 (e.g., “P80” denotes the 80th percentile). For example, the bold (1Mbps) line in Figure 8a plots the PBSE speed-up at every percentile from Figure 7 (i.e., the horizontal difference between the PBSE-1Slow and Base-1Slow lines). As an example point, in Figure 7, at the 80th percentile (y=0.8) our speed-up is 1.7× (TBase/TPBSE = 54 sec/32 sec); in Figure 8a, the axes are reversed for readability (at x=P80, the PBSE speed-up is y=1.7).]

Features          In stage:  #Tasks  #Jobs
Path Diversity    I→M        66      59
(§4.2)            M→R        125     125
                  R→O        0       0
Path Speculation  I→M        62      54
(§4.3)            M→R        26      6
                  R→O        28      12

Table 2: Activated PBSE features (§5.2). The table shows how many times each feature is activated, by task and job counts, in the PBSE-1Slow experiment in Figure 7.

Varying workload and cluster size: Figure 8b shows PBSE speed-ups when we vary the workloads: FB2009, FB2010, CC-b, and CC-e (the 1Mbps injection and 15-node setups are kept the same). As shown, PBSE works well across these workloads. Finally, Figure 8c shows PBSE speed-ups when we vary the cluster size from 15 to 60 nodes (the 1Mbps injection and FB2010 setups are kept the same). The figure shows that regardless of the cluster size, a degraded NIC can affect many jobs. With a larger cluster, the tail-SPOF probability is reduced, but tail-SPOF still appears at a significant rate (above P90).

5.2 Detailed Analysis

We now dissect our first experiment’s results (Figure 7).

[Figure 9 diagrams omitted: (a) a backup task M2' is launched on a good node after the straggling path I2→M2 is detected, incurring a 1-3 sec start-up; (b) a straggling start-up/localization (.jar→M2) adds 1-9 secs; (c) a straggling clean-up (AM→O→O'→O'' after M1..Mn, R1..Rn) adds more than 10 secs. Legend: slow task, fast task, slow NIC.]

Figure 9: Residual sources of tail latencies (§5.2).

Activated features: Table 2 shows how many times PBSE features (§4.2-4.3) are activated in the PBSE-1Slow experiment in Figure 7. First, in terms of path diversity, 66 tasks (59 jobs) require I→M diversity. In the FB2010 workload, 125 jobs have only one reducer, thus requiring M→R diversity (reducer cloning). R→O diversity is rarely needed (0 cases), mainly because there are enough upstream paths to compare. Second, in terms of path speculation, all of the I→M, M→R, and R→O speculations are needed (in 54, 6, and 12 jobs, respectively) to help jobs escape the many SE loopholes we discussed before (§3).

Residual overhead: We next discuss interesting findings on why PBSE cannot reach the "perfect" case (Base-0Slow vs. PBSE-1Slow lines in Figure 7). Below are the sources of residual overhead that we discovered:

Start-up overhead: Figure 9a shows that even when a straggling path (I2→M2) is detected early, running the backup task (M2') requires 1-3 seconds of start-up overhead, which includes JVM warm-up (class loading, interpretation) and "localization" [20] (including transferring the application's .jar files from HDFS to the task's node).


[Figure 10 plots omitted: (a) CDF of job duration, PBSE vs. other schedulers (PBSE+Cap, BaseSE+Cap, BaseSE+FIFO, BaseSE+Fair); (b) CDF of job duration, PBSE vs. other SE strategies (PBSE, Cloning, Aggr-SE, HRead-0.0, HRead-0.5). Job latency shown from 20 to 80 sec.]

Figure 10: PBSE vs. other strategies (§5.3).

At this point, this overhead is not removable and remains an ongoing problem [60].

"Straggling" start-up/localization: Figure 9b shows two maps, one of which (M2) runs on a slow-NIC node. This causes the transfer of the application's .jar files from HDFS to M2's node to be slower than the transfer to M1. In our experience, this "straggling" start-up can consume 1-9 seconds (depending on the job size). What we find interesting is that start-up time is not accounted for in task progress and SE decision making. In other words, the start-up durations of M1 and M2 are not compared, hence no straggling start-up is detected, delaying the task-straggler detection.

"Straggling" clean-up: Figure 9c shows that after a job finishes (but before returning to the user), the AM must write a JobHistory file to HDFS (part of the clean-up operation). It is possible that one of the output nodes is slow (e.g., AM→O→O'→O''), in which case the AM will be stuck in this "straggling" clean-up, especially with a large JobHistory file from a large job (M1..Mn, R1..Rn, n≫1). This case is also outside the scope of SE.

We believe the last two findings reveal more flaws of existing tail-tolerance strategies: they only focus on "task progress," but do not include "operational progress" (start-up/clean-up) as part of SE decision making, which results in irremovable tail latencies. Again, these flaws surface when our unique fault model (§2.3) is considered. Fortunately, all the problems above are solvable; start-up overhead is being solved elsewhere [60], and straggling localization/clean-up can be unearthed by incorporating start-up/clean-up paths and their latencies as part of SE decision making.
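To make this idea concrete, below is a minimal, hypothetical sketch (not the paper's actual algorithm): a straggler check that compares per-task start-up and clean-up latencies against those of peer tasks, so that an operational phase stuck behind a slow NIC is flagged even when "task progress" looks normal. The function name, the input layout, and the 2x slowdown threshold are our own illustrative assumptions.

    from statistics import median

    def operational_stragglers(tasks, slowdown=2.0):
        """Flag tasks whose start-up or clean-up latency is far above the
        median of their peers. `tasks` maps task-id -> {"startup_s": float,
        "cleanup_s": float}."""
        flagged = {}
        for phase in ("startup_s", "cleanup_s"):
            med = median(t[phase] for t in tasks.values())
            for tid, t in tasks.items():
                if med > 0 and t[phase] > slowdown * med:
                    flagged.setdefault(tid, []).append(phase)
        return flagged

    # Example: M2's .jar localization is stuck behind a slow NIC.
    tasks = {
        "M1": {"startup_s": 2.0, "cleanup_s": 1.0},
        "M2": {"startup_s": 9.0, "cleanup_s": 1.1},  # straggling localization
        "M3": {"startup_s": 2.2, "cleanup_s": 0.9},
    }
    print(operational_stragglers(tasks))  # {'M2': ['startup_s']}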

5.3 PBSE vs. Other Strategies

In this section, we compare PBSE against other scheduling and tail-tolerant approaches.

Figure 10a shows the same setup as in Figure 7, but now we vary the scheduling configurations: capacity (default), FIFO, and fair scheduling. The figure essentially confirms that the SE loopholes (§3) are not scheduling problems; changing the scheduler does not eliminate tail latencies induced by the degraded NIC.

Figure 10b shows the same setup, but now we vary the tail-tolerant strategies: hedged read (0.5s), hedged read (0s), aggressive SE, and task cloning. The first one, hedged read (HRead-0.5), is a newer HDFS feature [25] that enables a map task (e.g., I2→M2) to automatically read from another replica (I'2→M2) if the first 64KB packet is not received within 0.5 sec; the map uses the data from whichever read finishes first.

[Figure 11 plots omitted: (a) Multiple Failures (1Slow, 2Slow), speed-up on a log2 scale (1-128) vs. percentile (P60-P100); (b) Heterogeneous Bandwidths (B=0.1, 0.2, 0.3, 0.8), speed-up (0-2) vs. percentile (P60-P100).]

Figure 11: PBSE speed-ups with multiple failures and heterogeneous network (§5.4 and §5.5).

The second one (HRead-0.0) does not wait at all; hedged reads can thus unnecessarily consume network resources. As shown, HRead-0.0 does not eliminate all the tail-SPOF latencies (with a maximum job latency of 7708 seconds). This is because hedged read only solves the tail-SPOF scenarios in the input stage, not across all the MapReduce stages.
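The hedged-read behavior described above is an instance of the general hedged-request pattern [42]. Below is a minimal, illustrative sketch of that pattern (not HDFS's actual implementation): issue the read to a primary replica and, if it has not returned within a threshold, also issue it to a second replica and take whichever finishes first. The 0.5-second threshold mirrors HRead-0.5; setting it to zero mirrors HRead-0.0. All names are our own.

    import concurrent.futures as cf

    def hedged_read(read_primary, read_backup, threshold_s=0.5):
        """Return the result of whichever replica read completes first.
        `read_primary` and `read_backup` are zero-argument callables."""
        pool = cf.ThreadPoolExecutor(max_workers=2)
        futures = [pool.submit(read_primary)]
        done, _ = cf.wait(futures, timeout=threshold_s)
        if not done:                       # primary has not answered: hedge
            futures.append(pool.submit(read_backup))
        done, _ = cf.wait(futures, return_when=cf.FIRST_COMPLETED)
        pool.shutdown(wait=False)          # do not wait for the losing read
        return next(iter(done)).result()

    # Usage: the two callables could wrap reads of the same block from
    # two different replicas (e.g., I2 and I'2 in the example above).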

The third one (Aggr-SE) is the base SE with the most aggressive SE configuration,7 and the last one (Cloning) represents task-level cloning8 [30] (a.k.a. hedged requests [42]). Aggressive SE speculates more intensively in all the MapReduce stages, but long tail latencies still appear (with a maximum of 6801 seconds). Cloning also exhibits a long tail (3663 seconds at the end of the tail) as it still inherits the flaws of base SE. In this scenario, PBSE is the most effective (a maximum of only 251 seconds), as it solves the fundamental limitations of base SE.
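For reference, footnote 7 lists the aggressive settings we used. A hedged sketch of those overrides is shown below as a plain dictionary; the short key names follow the footnote, while the mapping onto full Hadoop property names (likely under the mapreduce.job.speculative.* prefix in MRv2) and the exact Hadoop versions that expose all of them are assumptions on our part.

    # Aggressive-SE overrides from footnote 7 (defaults noted in comments).
    AGGR_SE_OVERRIDES = {
        "speculative-cap-running-tasks": 1.0,   # default 0.1
        "speculative-cap-total-tasks": 1.0,     # default 0.01
        "minimum-allowed-tasks": 100,           # default 10
        "retry-after-speculate": 1000,          # ms, default 15000
        "slowtaskthreshold": 0.1,               # default 1.0
    }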

In summary, all the other approaches above only reduce, but do not eliminate, the possibility of a tail-SPOF. We did not empirically evaluate PBSE against other strategies in the literature [30, 31, 33, 63, 64, 77] because they are either proprietary or integrated into non-Hadoop frameworks (e.g., Dryad [54], SCOPE [39], or Spark [13]); they were discussed earlier (§3.4).

5.4 PBSE with Multiple Failures

We also evaluate PBSE against two NIC failures (both 1 Mbps). This is done by changing the value of F (§4). To handle two slow-NIC nodes (F=2), at least three nodes (F+1) are required for path diversity. F=3 is currently not possible as the replication factor is only 3x. Figure 11a shows that PBSE also performs effectively under the two-failure scenario.
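As a small illustration of the F+1 requirement above (a hypothetical helper, not PBSE's actual code): given the nodes that a task's candidate data paths traverse, path diversity against F degraded NICs is only possible if the paths span at least F+1 distinct nodes.

    def has_path_diversity(path_nodes, f):
        """path_nodes: iterable of node ids, one per candidate path endpoint.
        Returns True if the paths span at least F+1 distinct nodes, i.e.,
        up to `f` slow-NIC nodes can still be avoided."""
        return len(set(path_nodes)) >= f + 1

    # With 3-way replication, F=2 is the most that can be tolerated:
    print(has_path_diversity(["n1", "n2", "n3"], f=2))  # True
    print(has_path_diversity(["n1", "n2", "n3"], f=3))  # False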

5.5 PBSE on Heterogeneous Resources

To show that PBSE does not break the performance of Hadoop SE under heterogeneous resources [83], we run PBSE on a stable-state network throughput distribution in Amazon EC2 that is popularly cited (ranging from 200 to 920 Mbps; see Figure 3 in [71] for the detailed distribution).

7 Aggressive SE configuration: speculative-cap-running-tasks=1.0 (default=0.1); speculative-cap-total-tasks=1.0 (default=0.01); minimum-allowed-tasks=100 (default=10); retry-after-speculate=1000ms (default=15000ms); and slowtaskthreshold=0.1 (default=1.0).
8 We slightly modified Hadoop to always speculate every task at the moment the task starts (Hadoop does not support cloning by default).


[Figure 12 plot omitted: latency (sec, 0-600) of microbenchmarks on [a] Hadoop/QFS, [b] Spark, [c] Flume, and [d] S4, with bars for Base-0Slow, Base-1Slow, and PBSE-1Slow.]

Figure 12: Beyond Hadoop/HDFS (§6). The figure shows latencies of microbenchmarks running on four different systems (Hadoop/QFS, Spark, Flume, and S4) with three different setups: baseline without degraded NIC (Base-0Slow), baseline with one 1Mbps degraded NIC (Base-1Slow), and with initial PBSE integration (PBSE-1Slow). Baseline (Base) implies the vanilla versions. The Hadoop/QFS microbenchmark and topology are shown in Figure 13c. The Spark microbenchmark is a 2-stage, 4-task, all-to-all communication as similarly depicted in Figure 3b. The Flume and S4 microbenchmarks have the same topology. We did not integrate PBSE into S4 as its development is discontinued.

Figure 11b shows that the PBSE speed-up is consistently around one, with small fluctuations at high percentiles from large jobs. The figure also shows the different path-straggler thresholds we use (β in §4.3). With a higher threshold, sensitivity is higher and more paths are speculated. However, because the heterogeneity is not severe (>100 Mbps), the original tasks always complete faster than the backup tasks.
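The sketch below illustrates one plausible reading of the β threshold (our assumption; §4.3 defines the actual rule): a path is flagged as a straggler when its throughput falls below a β fraction of the best peer path, so a larger β flags more paths. Path names and numbers are illustrative only.

    def path_stragglers(path_throughput_mbps, beta=0.3):
        """Flag paths whose throughput is below beta * best peer throughput.
        `path_throughput_mbps` maps path-id -> observed throughput (Mbps)."""
        best = max(path_throughput_mbps.values())
        return [p for p, tput in path_throughput_mbps.items()
                if tput < beta * best]

    # On EC2-like heterogeneity (200-920 Mbps), a low beta flags nothing,
    # so no backup paths are launched and PBSE behaves like base SE:
    paths = {"I1->M1": 920, "I2->M2": 480, "I3->M3": 300}
    print(path_stragglers(paths, beta=0.3))  # []
    print(path_stragglers(paths, beta=0.8))  # ['I2->M2', 'I3->M3'] (more sensitive)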

5.6 Limitations

PBSE can fail in extreme corner cases: for example, if a file currently has only one surviving replica (path diversity is impossible); if a large batch of devices degrades simultaneously beyond the tolerable number of failures; or if there is not enough node availability. Note that the base Hadoop SE also fails in such cases. When these cases happen, PBSE can log warning messages to allow operators to query the log and correlate the warnings with slow jobs (if any).

Another limitation of PBSE is that in a virtualized environment (e.g., EC2), if nodes (as VMs) are packed onto the same machine, PBSE's path diversity will not work. PBSE works if the VMs are deployed across many machines and they expose the machine number.

6 BEYOND HADOOP AND HDFS

To show PBSE's generality to other data-parallel frameworks beyond Hadoop/HDFS, we analyzed Apache Spark [13, 82], Flume [9], S4 [12], and the Quantcast File System (QFS) [67]. We found that all of them suffer from the tail-SPOF problem, as shown in Figure 12 (Base-0Slow vs. Base-1Slow bars). We have performed an initial integration of PBSE into the Spark, Flume, and Hadoop/QFS stacks and showed that it speculates effectively, avoids the degraded network, and cuts tail latencies (PBSE-1Slow bars). We did not integrate PBSE into S4 as its development is discontinued.

We briefly describe the tail-tolerance flaws we found in these other systems. Spark (Figure 12b) has a built-in SE similar to Hadoop SE, hence it is prone to the same tail-SPOF issues (§3). Flume (Figure 12c) uses a static timeout to detect slow channels (no path comparisons): if a channel's throughput falls below 1000 events/sec, a failover is triggered, but if a NIC degrades to slightly above the threshold, the failover is not triggered. S4 (Figure 12d) has a static timeout similar to Flume's. Worse, it has a shared-queue design in its fan-out protocol; thus, a slow recipient will cripple the sender in transferring data to the other recipients (a head-of-line blocking problem).
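To illustrate why a static threshold misses this case while a path comparison catches it, here is a minimal sketch (illustrative only; the 1000 events/sec constant comes from the Flume description above, while the relative slowdown factor and function names are our own assumptions):

    def static_failover(events_per_sec, threshold=1000):
        """Flume-style check: fail over only if throughput drops below a fixed floor."""
        return events_per_sec < threshold

    def relative_straggler(events_per_sec, peer_rates, slowdown=5.0):
        """PBSE-style check: compare against peer channels instead of a fixed floor."""
        return events_per_sec * slowdown < max(peer_rates)

    # A NIC degraded to ~1200 events/sec evades the static timeout but is
    # clearly a straggler relative to its ~50000 events/sec peers.
    print(static_failover(1200))                      # False (no failover)
    print(relative_straggler(1200, [50000, 48000]))   # True  (speculate)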

We also analyzed the Hadoop/QFS stack due to its differences from the Hadoop/HDFS (3-way replication) stack.9 Hadoop/QFS represents computation on erasure-coded (EC) storage. Many EC storage systems [53, 67, 74] embed tail-tolerant mechanisms in their client layer. EC-level SE with m parities can tolerate up to m slow NICs. In addition to tolerating slow NICs, EC-level SE can also tolerate rack-level slowdowns (which can be caused by a degraded TOR switch or a malfunctioning power supply in the rack). For example, in Figure 13a, M1 reads chunks of a file (Ia, Ib). As reading from Ib is slower than reading from Ia (due to the slow Rack-2), the EC client layer triggers its own EC-level SE, creating a backup speculative read from Ip to reconstruct the late Ib.

Unfortunately, EC-level SE also has loopholes. Figure 13b shows a similar case but with a slow Rack-1. Here, the EC-level SE is not triggered as all reads (Ia→M1, Ib→M1) are slow. Let us suppose another map (M2) completes fast in Rack-3, as in Figure 13c. Here, Hadoop declares M1 a straggler; however, it is possible that the backup M1' will run in Rack-1, which means it must also read through the slow rack. As a result, both the original and backup tasks (M1 and M1') are straggling.

To address this, with PBSE, the EC layer (QFS) exposes the individual read paths to Hadoop. In Figure 13c, if we expose the Ia→M1 and Ib→M1 paths, Hadoop can try placing the backup M1' in another rack (Rack-2/3) and furthermore inform the EC layer to have M1' directly read from Ib and Ip (instead of Ia). Overall, the key principle is the same: when path progresses are exposed, the compute and storage layers can make a more informed decision. Figure 12a shows that in a topology like Figure 13c, without PBSE, the job follows the slow Rack-1 performance (Base-1Slow bar), but with PBSE, the job can escape from the slow rack (PBSE-1Slow).
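A hypothetical sketch of this path-exposure idea follows (names and structure are ours, not QFS or Hadoop APIs): given the slow rack flagged by path comparison and the rack locations of the data and parity chunks, the compute layer places the backup task outside the slow rack and asks the EC layer to substitute a parity for any chunk stored in that rack.

    def plan_backup(slow_rack, chunk_racks, parity_chunks, candidate_racks):
        """slow_rack: rack flagged by path comparison (e.g., "Rack-1").
        chunk_racks: data chunk -> rack, e.g., {"Ia": "Rack-1", "Ib": "Rack-2"}.
        parity_chunks: parity chunk -> rack, e.g., {"Ip": "Rack-3"}.
        Returns (backup_rack, chunks_to_read). Assumes enough parities exist
        to cover the chunks stored in the slow rack."""
        backup_rack = next(r for r in candidate_racks if r != slow_rack)
        reads = [c for c, r in chunk_racks.items() if r != slow_rack]
        parities = iter(parity_chunks)
        # substitute one parity for each chunk that lives in the slow rack
        reads += [next(parities) for c, r in chunk_racks.items() if r == slow_rack]
        return backup_rack, reads

    # Figure 13c with RS(2,1): Ia in slow Rack-1, Ib in Rack-2, parity Ip in Rack-3.
    print(plan_backup("Rack-1",
                      {"Ia": "Rack-1", "Ib": "Rack-2"},
                      {"Ip": "Rack-3"},
                      ["Rack-1", "Rack-2", "Rack-3"]))
    # -> ('Rack-2', ['Ib', 'Ip'])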

In summary, we have performed successful initial integrations of PBSE into multiple systems, which we believe demonstrates its generality to data-parallel frameworks that need robust tail tolerance against node-level network degradation. We leave full testing of these additional integrations as future work.

7 RELATED WORK

We now discuss related work.

9 Quantcast File System (QFS) [19, 67] is a Reed-Solomon (RS) erasure-coded (EC) distributed file system. Although HDFS-RAID supports RS [22], when we started the project (in 2015/2016), RS was only executed in the background. In HDFS-RAID, files are still triple-replicated initially. Moreover, because the stripe size is the same as the block size (64 MB), only large files are erasure-coded while small files are still triple-replicated (e.g., with RS(10,4), only files of 10x64MB size are erasure-coded). HDFS with foreground EC (HDFS-EC) was still an ongoing, non-stable development [17]. In contrast, QFS erasure-codes data in the foreground with a 64KB stripe size, hence our usage of QFS.


[Figure 13 diagrams omitted: (a) EC-level SE: M1 reads Ia and Ib, and the EC client layer speculatively reads Ip when Ib (in a slow rack) lags; (b) No EC-level SE: all of M1's reads go through the slow rack; (c) Ineffective Hadoop SE: M2 finishes fast, M1 is declared a straggler, but the backup M1' lands in the slow rack.]

Figure 13: Rack slowdown and Erasure-Coded (EC) storage (§6). For simplicity, the figure shows RS(2,1), a Reed-Solomon code where an input file I is striped across two chunks (Ia, Ib) with one parity (Ip) and a 64KB stripe size (see Figure 2 in [67] for more detail).

Paths: The concept of "paths" is prevalent in the context of modeling (e.g., Magpie [38]), fine-grained tracing (e.g., X-Trace [45], Pivot Tracing [62]), diagnosis (e.g., black-box debugging [28], path-based failure [40]), and availability auditing (e.g., INDaaS [84]), among many others. This body of work is mainly about monitoring and diagnosing paths. In PBSE, we actively "control" paths for better online tail tolerance.

Tail tolerance: Earlier (§3.4), we discussed a subtle limitation of existing SE implementations that hide path progresses [30, 32, 33, 63, 64, 83]. While SE is considered a reactive form of tail tolerance, proactive ones have also been proposed, for example by cloning (e.g., Dolly [30] and hedged requests [42]), launching a few extra tasks (e.g., KMN [70]), or placing tasks more intelligently (e.g., Wrangler [77]). This novel set of work also uses the classical definition of progress score (§3.4). Probabilistically, cloning or launching extra tasks can reduce tail-SPOF, but fundamentally, as paths are not exposed and controlled, tail-SPOF is still possible (§5.3). Tail tolerance is also deployed in the storage layer (e.g., RobuStore [74], CosTLO [73], C3 [69], Azure [53]). PBSE shows that the storage and compute layers need to collaborate for more robust tail tolerance.

Task placement/scheduling: There is a large body of work on task placement and scheduling (e.g., PACMan [31], delay scheduling [81], Quincy [55], Retro [61]). These efforts attempt to cut tail latencies in the initial task placements by achieving better data locality, load balancing, and resource utilization. However, they do not modify SE algorithms, and thus SE (and PBSE) is orthogonal to this line of work.

Tail root causes: A large body of literature has discovered many root causes of tail latencies, including contention of shared resources [33, 35], hardware performance variability [59], workload imbalance [46], data and compute skew [29], background processes [59], heterogeneous resources [83], degraded disks or SSDs [51, 78], and buggy machine configurations (e.g., disabled processor caches) [43]. For degraded disks and processors, as we mentioned before (§1), existing (task-based) speculations are sufficient to detect such problems. PBSE highlights that degraded network is an important fault model to address.

Iterative graph frameworks: Other types of data-parallel systems include iterative graph processing frameworks [10, 27, 65]. As reported, they do not employ speculative execution within the frameworks, but rather deal with stragglers within the running algorithms [52]. Our initial evaluation of Apache Giraph [10] shows that a slow NIC can hamper the entire graph computation, mainly because Giraph workers must occasionally checkpoint their states to HDFS (plus the use of barrier synchronization), thus experiencing a tail-SPOF (as in Figure 3c). This suggests that part of PBSE may be applicable to graph processing frameworks as well.

Distributed system bugs: Distributed systems are hard to get right. A plethora of related work combats a variety of bugs in distributed systems, including concurrency [50, 57], configuration [75], dependency [84], error/crash-handling [56, 80], performance [62], and scalability [58] bugs. In this paper, we highlight performance issues caused by speculative-execution bugs/loopholes.

8 CONCLUSION

Performance-degraded mode is dangerous; software systems tend to continue using the device without explicit failure warnings (hence, no failover). Such intricate problems can take hours or days to diagnose manually, usually after whole-cluster performance is affected (a cascading failure) [2, 3, 8]. In this paper, we show that node-level network degradation combined with SE loopholes is dangerous. We believe it is the responsibility of software tail-tolerance strategies, not just monitoring tools, to properly handle performance-degraded network devices. To this end, we have presented PBSE as a novel, online solution to the problem.

9 ACKNOWLEDGMENTS

We thank Walfredo Cirne, our shepherd, and the anonymous reviewers for their tremendous feedback and comments. We would also like to thank Andrew Wang and Aaron T. Myers for their help with the Cloudera Hadoop traces, Achmad Imam Kistijantoro for his supervision of Martin's undergraduate thesis, Yohanes Surya and Nina Sugiarto for their support, and Bob Bartlett, Kevin Harms, Robert Ricci, Kirk Webb, Gary Grider, Parks Fields, H. Birali Runesha, Andree Jacobson, Xing Lin, and Dhruba Borthakur for their stories on degraded network devices. This material was supported by funding from NSF (grant Nos. CCF-1336580, CNS-1350499, CNS-1526304, CNS-1405959, and CNS-1563956) as well as generous donations from Dell EMC, Huawei, a Google Faculty Research Award, a NetApp Faculty Fellowship, and the CERES Center for Unstoppable Computing. The experiments in this paper were performed on the Utah Emulab [15, 72], the NMC PRObE [18, 47], the University of Chicago RIVER [21], and Chameleon [14] clusters.

Any opinions, findings, and conclusions or recommendations expressed herein are those of the authors and do not necessarily reflect the views of the NSF or other institutions.


REFERENCES
[1] Personal Communication from datacenter operators of University of Chicago IT Services.
[2] Personal Communication from Kevin Harms of Argonne National Laboratory.
[3] Personal Communication from Robert Ricci of University of Utah.
[4] Personal Communication from Gary Grider and Parks Fields of Los Alamos National Laboratory.
[5] Personal Communication from Xing Lin of NetApp.
[6] Personal Communication from H. Birali Runesha (Director of Research Computing Center, University of Chicago).
[7] Personal Communication from Andree Jacobson (Chief Information Officer at New Mexico Consortium).
[8] Personal Communication from Dhruba Borthakur of Facebook.
[9] Apache Flume. http://flume.apache.org/.
[10] Apache Giraph. http://giraph.apache.org/.
[11] Apache Hadoop. http://hadoop.apache.org.
[12] Apache S4. http://incubator.apache.org/s4/.
[13] Apache Spark. http://spark.apache.org/.
[14] Chameleon. https://www.chameleoncloud.org.
[15] Emulab Network Emulation Testbed. http://www.emulab.net.
[16] HDFS-8009: Signal congestion on the DataNode. https://issues.apache.org/jira/browse/HDFS-8009.
[17] Introduction to HDFS Erasure Coding in Apache Hadoop. http://blog.cloudera.com/blog/2015/09/introduction-to-hdfs-erasure-coding-in-apache-hadoop/.
[18] Parallel Reconfigurable Observational Environment (PRObE). http://www.nmc-probe.org.
[19] QFS. https://quantcast.github.io/qfs/.
[20] Resource Localization in Yarn: Deep dive. http://hortonworks.com/blog/resource-localization-in-yarn-deep-dive/.
[21] RIVER: A Research Infrastructure to Explore Volatility, Energy-Efficiency, and Resilience. http://river.cs.uchicago.edu.
[22] Saving capacity with HDFS RAID. https://code.facebook.com/posts/536638663113101/saving-capacity-with-hdfs-raid/.
[23] Speculative tasks in Hadoop. http://stackoverflow.com/questions/34342546/speculative-tasks-in-hadoop.
[24] Statistical Workload Injector for MapReduce (SWIM). https://github.com/SWIMProjectUCB/SWIM/wiki.
[25] Support 'hedged' reads in DFSClient. https://issues.apache.org/jira/browse/HDFS%2D5776.
[26] World's First 1,000-Processor Chip. https://www.ucdavis.edu/news/worlds-first-1000-processor-chip/.
[27] Martin Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: A System for Large-Scale Machine Learning. In Proceedings of the 12th Symposium on Operating Systems Design and Implementation (OSDI), 2016.
[28] Marcos K. Aguilera, Jeffrey C. Mogul, Janet L. Wiener, Patrick Reynolds, and Athicha Muthitacharoen. Performance Debugging for Distributed Systems of Black Boxes. In Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP), 2003.
[29] Ganesh Ananthanarayanan, Sameer Agarwal, Srikanth Kandula, Albert Greenberg, Ion Stoica, Duke Harlan, and Ed Harris. Scarlett: Coping with Skewed Content Popularity in MapReduce Clusters. In Proceedings of the 2011 EuroSys Conference (EuroSys), 2011.
[30] Ganesh Ananthanarayanan, Ali Ghodsi, Scott Shenker, and Ion Stoica. Effective Straggler Mitigation: Attack of the Clones. In Proceedings of the 10th Symposium on Networked Systems Design and Implementation (NSDI), 2013.
[31] Ganesh Ananthanarayanan, Ali Ghodsi, Andrew Wang, Dhruba Borthakur, Srikanth Kandula, Scott Shenker, and Ion Stoica. PACMan: Coordinated Memory Caching for Parallel Jobs. In Proceedings of the 9th Symposium on Networked Systems Design and Implementation (NSDI), 2012.
[32] Ganesh Ananthanarayanan, Michael Chien-Chun Hung, Xiaoqi Ren, Ion Stoica, Adam Wierman, and Minlan Yu. GRASS: Trimming Stragglers in Approximation Analytics. In Proceedings of the 11th Symposium on Networked Systems Design and Implementation (NSDI), 2014.
[33] Ganesh Ananthanarayanan, Srikanth Kandula, Albert Greenberg, Ion Stoica, Yi Lu, Bikas Saha, and Edward Harris. Reining in the Outliers in Map-Reduce Clusters using Mantri. In Proceedings of the 9th Symposium on Operating Systems Design and Implementation (OSDI), 2010.
[34] Michael Armbrust, Armando Fox, Rean Griffith, Anthony D. Joseph, Randy H. Katz, Andrew Konwinski, Gunho Lee, David A. Patterson, Ariel Rabkin, Ion Stoica, and Matei Zaharia. Above the Clouds: A Berkeley View of Cloud Computing. http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-28.pdf.
[35] Remzi H. Arpaci-Dusseau, Eric Anderson, Noah Treuhaft, David E. Culler, Joseph M. Hellerstein, Dave Patterson, and Kathy Yelick. Cluster I/O with River: Making the Fast Case Common. In The 1999 Workshop on Input/Output in Parallel and Distributed Systems (IOPADS), 1999.
[36] Peter Bailis and Kyle Kingsbury. The Network is Reliable. An informal survey of real-world communications failures. ACM Queue, 12(7), July 2014.
[37] Lakshmi N. Bairavasundaram, Garth R. Goodson, Shankar Pasupathy, and Jiri Schindler. An Analysis of Latent Sector Errors in Disk Drives. In Proceedings of the 2007 ACM Conference on Measurement and Modeling of Computer Systems (SIGMETRICS), 2007.
[38] Paul Barham, Austin Donnelly, Rebecca Isaacs, and Richard Mortier. Using Magpie for request extraction and workload modelling. In Proceedings of the 6th Symposium on Operating Systems Design and Implementation (OSDI), 2004.
[39] Ronnie Chaiken, Bob Jenkins, Paul Larson, Bill Ramsey, Darren Shakib, Simon Weaver, and Jingren Zhou. SCOPE: Easy and Efficient Parallel Processing of Massive Data Sets. In Proceedings of the 34th International Conference on Very Large Data Bases (VLDB), 2008.
[40] Mike Y. Chen, Anthony Accardi, Emre Kiciman, Dave Patterson, Armando Fox, and Eric Brewer. Path-Based Failure and Evolution Management. In Proceedings of the 1st Symposium on Networked Systems Design and Implementation (NSDI), 2004.
[41] Yanpei Chen, Sara Alspaugh, and Randy H. Katz. Interactive Analytical Processing in Big Data Systems: A Cross-Industry Study of MapReduce Workloads. In Proceedings of the 38th International Conference on Very Large Data Bases (VLDB), 2012.
[42] Jeffrey Dean and Luiz André Barroso. The Tail at Scale. Communications of the ACM, 56(2), February 2013.
[43] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In Proceedings of the 6th Symposium on Operating Systems Design and Implementation (OSDI), 2004.
[44] Thanh Do, Mingzhe Hao, Tanakorn Leesatapornwongsa, Tiratat Patana-anake, and Haryadi S. Gunawi. Limplock: Understanding the Impact of Limpware on Scale-Out Cloud Systems. In Proceedings of the 4th ACM Symposium on Cloud Computing (SoCC), 2013.
[45] Rodrigo Fonseca, George Porter, Randy H. Katz, Scott Shenker, and Ion Stoica. X-Trace: A Pervasive Network Tracing Framework. In Proceedings of the 4th Symposium on Networked Systems Design and Implementation (NSDI), 2007.
[46] Rohan Gandhi, Di Xie, and Y. Charlie Hu. PIKACHU: How to Rebalance Load in Optimizing MapReduce On Heterogeneous Clusters. In Proceedings of the 2013 USENIX Annual Technical Conference (ATC), 2013.
[47] Garth Gibson, Gary Grider, Andree Jacobson, and Wyatt Lloyd. PRObE: A Thousand-Node Experimental Cluster for Computer Systems Research. USENIX ;login:, 38(3), June 2013.
[48] Haryadi S. Gunawi, Mingzhe Hao, Tanakorn Leesatapornwongsa, Tiratat Patana-anake, Thanh Do, Jeffry Adityatama, Kurnia J. Eliazar, Agung Laksono, Jeffrey F. Lukman, Vincentius Martin, and Anang D. Satria. What Bugs Live in the Cloud? A Study of 3000+ Issues in Cloud Systems. In Proceedings of the 5th ACM Symposium on Cloud Computing (SoCC), 2014.
[49] Haryadi S. Gunawi, Mingzhe Hao, Riza O. Suminto, Agung Laksono, Anang D. Satria, Jeffry Adityatama, and Kurnia J. Eliazar. Why Does the Cloud Stop Computing? Lessons from Hundreds of Service Outages. In Proceedings of the 7th ACM Symposium on Cloud Computing (SoCC), 2016.
[50] Huayang Guo, Ming Wu, Lidong Zhou, Gang Hu, Junfeng Yang, and Lintao Zhang. Practical Software Model Checking via Dynamic Interface Reduction. In Proceedings of the 23rd ACM Symposium on Operating Systems Principles (SOSP), 2011.
[51] Mingzhe Hao, Gokul Soundararajan, Deepak Kenchammana-Hosekote, Andrew A. Chien, and Haryadi S. Gunawi. The Tail at Store: A Revelation from Millions of Hours of Disk and SSD Deployments. In Proceedings of the 14th USENIX Symposium on File and Storage Technologies (FAST), 2016.
[52] Aaron Harlap, Henggang Cui, Wei Dai, Jinliang Wei, Gregory R. Ganger, Phillip B. Gibbons, Garth A. Gibson, and Eric P. Xing. Addressing the straggler problem for iterative convergent parallel ML. In Proceedings of the 7th ACM Symposium on Cloud Computing (SoCC), 2016.
[53] Cheng Huang, Huseyin Simitci, Yikang Xu, Aaron Ogus, Brad Calder, Parikshit Gopalan, Jin Li, and Sergey Yekhanin. Erasure Coding in Windows Azure Storage. In Proceedings of the 2012 USENIX Annual Technical Conference (ATC), 2012.
[54] Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly. Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks. In Proceedings of the 2007 EuroSys Conference (EuroSys), 2007.
[55] Michael Isard, Vijayan Prabhakaran, Jon Currey, Udi Wieder, Kunal Talwar, and Andrew Goldberg. Quincy: Fair Scheduling for Distributed Computing Clusters. In Proceedings of the 22nd ACM Symposium on Operating Systems Principles (SOSP), 2009.


[56] Tanakorn Leesatapornwongsa, Mingzhe Hao, Pallavi Joshi, Jeffrey F. Lukman, and Haryadi S. Gunawi. SAMC: Semantic-Aware Model Checking for Fast Discovery of Deep Bugs in Cloud Systems. In Proceedings of the 11th Symposium on Operating Systems Design and Implementation (OSDI), 2014.
[57] Tanakorn Leesatapornwongsa, Jeffrey F. Lukman, Shan Lu, and Haryadi S. Gunawi. TaxDC: A Taxonomy of Non-Deterministic Concurrency Bugs in Datacenter Distributed Systems. In Proceedings of the 21st International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2016.
[58] Tanakorn Leesatapornwongsa, Cesar A. Stuardo, Riza O. Suminto, Huan Ke, Jeffrey F. Lukman, and Haryadi S. Gunawi. Scalability Bugs: When 100-Node Testing is Not Enough. In The 16th Workshop on Hot Topics in Operating Systems (HotOS XVII), 2017.
[59] Jialin Li, Naveen Kr. Sharma, Dan R. K. Ports, and Steven D. Gribble. Tales of the Tail: Hardware, OS, and Application-level Sources of Tail Latency. In Proceedings of the 5th ACM Symposium on Cloud Computing (SoCC), 2014.
[60] David Lion, Adrian Chiu, Hailong Sun, Xin Zhuang, Nikola Grcevski, and Ding Yuan. Don't Get Caught in the Cold, Warm-up Your JVM: Understand and Eliminate JVM Warm-up Overhead in Data-Parallel Systems. In Proceedings of the 12th Symposium on Operating Systems Design and Implementation (OSDI), 2016.
[61] Jonathan Mace, Peter Bodik, Rodrigo Fonseca, and Madanlal Musuvathi. Retro: Targeted Resource Management in Multi-tenant Distributed Systems. In Proceedings of the 12th Symposium on Networked Systems Design and Implementation (NSDI), 2015.
[62] Jonathan Mace, Ryan Roelke, and Rodrigo Fonseca. Pivot Tracing: Dynamic Causal Monitoring for Distributed Systems. In Proceedings of the 25th ACM Symposium on Operating Systems Principles (SOSP), 2015.
[63] Kristi Morton, Magdalena Balazinska, and Dan Grossman. ParaTimer: A Progress Indicator for MapReduce DAGs. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data (SIGMOD), 2010.
[64] Kristi Morton, Abram L. Friesen, Magdalena Balazinska, and Dan Grossman. Estimating the Progress of MapReduce Pipelines. In Proceedings of the 26th International Conference on Data Engineering (ICDE), 2010.
[65] Derek G. Murray, Frank McSherry, Rebecca Isaacs, Michael Isard, Paul Barham, and Martin Abadi. Naiad: A Timely Dataflow System. In Proceedings of the 24th ACM Symposium on Operating Systems Principles (SOSP), 2013.
[66] Neda Nasiriani, Cheng Wang, George Kesidis, and Bhuvan Urgaonkar. Using Burstable Instances in the Public Cloud: When and How? 2016.
[67] Michael Ovsiannikov, Silvius Rus, Damian Reeves, Paul Sutter, Sriram Rao, and Jim Kelly. The Quantcast File System. In Proceedings of the 39th International Conference on Very Large Data Bases (VLDB), 2013.
[68] Maheswaran Sathiamoorthy, Megasthenis Asteris, Dimitris Papailiopoulos, Alexandros G. Dimakis, Ramkumar Vadali, Scott Chen, and Dhruba Borthakur. XORing Elephants: Novel Erasure Codes for Big Data. In Proceedings of the 39th International Conference on Very Large Data Bases (VLDB), 2013.
[69] Lalith Suresh, Marco Canini, Stefan Schmid, and Anja Feldmann. C3: Cutting Tail Latency in Cloud Data Stores via Adaptive Replica Selection. In Proceedings of the 12th Symposium on Networked Systems Design and Implementation (NSDI), 2015.
[70] Shivaram Venkataraman, Aurojit Panda, Ganesh Ananthanarayanan, Michael J. Franklin, and Ion Stoica. The Power of Choice in Data-Aware Cluster Scheduling. In Proceedings of the 11th Symposium on Operating Systems Design and Implementation (OSDI), 2014.
[71] Guohui Wang and T. S. Eugene Ng. The Impact of Virtualization on Network Performance of Amazon EC2 Data Center. In The 29th IEEE International Conference on Computer Communications (INFOCOM), 2010.
[72] Brian White, Jay Lepreau, Leigh Stoller, Robert Ricci, Shashi Guruprasad, Mac Newbold, Mike Hibler, Chad Barb, and Abhijeet Joglekar. An Integrated Experimental Environment for Distributed Systems and Networks. In Proceedings of the 5th Symposium on Operating Systems Design and Implementation (OSDI), 2002.
[73] Zhe Wu, Curtis Yu, and Harsha V. Madhyastha. CosTLO: Cost-Effective Redundancy for Lower Latency Variance on Cloud Storage Services. In Proceedings of the 12th Symposium on Networked Systems Design and Implementation (NSDI), 2015.
[74] Huaxia Xia and Andrew A. Chien. RobuSTore: Robust Performance for Distributed Storage Systems. In Proceedings of the 2007 Conference on High Performance Networking and Computing (SC), 2007.
[75] Tianyin Xu, Jiaqi Zhang, Peng Huang, Jing Zheng, Tianwei Sheng, Ding Yuan, Yuanyuan Zhou, and Shankar Pasupathy. Do Not Blame Users for Misconfigurations. In Proceedings of the 24th ACM Symposium on Operating Systems Principles (SOSP), 2013.
[76] Yunjing Xu, Zachary Musgrave, Brian Noble, and Michael Bailey. Bobtail: Avoiding Long Tails in the Cloud. In Proceedings of the 10th Symposium on Networked Systems Design and Implementation (NSDI), 2013.
[77] Neeraja J. Yadwadkar, Ganesh Ananthanarayanan, and Randy Katz. Wrangler: Predictable and Faster Jobs using Fewer Resources. In Proceedings of the 5th ACM Symposium on Cloud Computing (SoCC), 2014.
[78] Shiqin Yan, Huaicheng Li, Mingzhe Hao, Michael Hao Tong, Swaminathan Sundararaman, Andrew A. Chien, and Haryadi S. Gunawi. Tiny-Tail Flash: Near-Perfect Elimination of Garbage Collection Tail Latencies in NAND SSDs. In Proceedings of the 15th USENIX Symposium on File and Storage Technologies (FAST), 2017.
[79] Xiangyao Yu, George Bezerra, Andrew Pavlo, Srinivas Devadas, and Michael Stonebraker. Staring into the Abyss: An Evaluation of Concurrency Control with One Thousand Cores. In Proceedings of the 40th International Conference on Very Large Data Bases (VLDB), 2014.
[80] Ding Yuan, Yu Luo, Xin Zhuang, Guilherme Renna Rodrigues, Xu Zhao, Yongle Zhang, Pranay U. Jain, and Michael Stumm. Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-Intensive Systems. In Proceedings of the 11th Symposium on Operating Systems Design and Implementation (OSDI), 2014.
[81] Matei Zaharia, Dhruba Borthakur, Joydeep Sen Sarma, Khaled Elmeleegy, Scott Shenker, and Ion Stoica. Delay Scheduling: A Simple Technique for Achieving Locality and Fairness in Cluster Scheduling. In Proceedings of the 2010 EuroSys Conference (EuroSys), 2010.
[82] Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, and Ion Stoica. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. In Proceedings of the 9th Symposium on Networked Systems Design and Implementation (NSDI), 2012.
[83] Matei Zaharia, Andy Konwinski, Anthony D. Joseph, Randy Katz, and Ion Stoica. Improving MapReduce Performance in Heterogeneous Environments. In Proceedings of the 8th Symposium on Operating Systems Design and Implementation (OSDI), 2008.
[84] Ennan Zhai, Ruichuan Chen, David Isaac Wolinsky, and Bryan Ford. Heading Off Correlated Failures through Independence-as-a-Service. In Proceedings of the 11th Symposium on Operating Systems Design and Implementation (OSDI), 2014.
