
Modeling and Adaptive Scheduling of Large-Scale Wide-Area Data Transfers

Raj Kettimuthu

Advisors: Gagan Agrawal, P. Sadayappan

Exploding data volumes

100,000 TB

MACHO et al.: 1 TB
Palomar: 3 TB
2MASS: 10 TB
GALEX: 30 TB
Sloan: 40 TB

Pan-STARRS: 40,000 TB

2004: 36 TB
2014: 3,300 TB

10^5 increase in data volumes in 6 years

Astronomy Climate

Genomics

Data movement

[Figure: data moves between two storage systems through a data transfer node at each end]

Current work

Understand characteristics, control and optimize transfers

Efficient scheduling of wide-area transfers
Model – predict and control throughput
– Characterize, identify key features
– Data-driven modeling using experimental data
Adaptive scheduling
– Algorithm to minimize slowdown
– Experimental evaluation using real transfer logs

High-performance, secure data transfer protocol optimized for high-bandwidth wide-area networks
Parallel TCP streams, PKI security for authentication, integrity and encryption, checkpointing for transfer restarts
Based on FTP protocol - defines extensions for high-performance operation and security
Globus implementation of GridFTP is widely used
Globus GridFTP servers support usage statistics collection
– Transfer type, size in bytes, start time of the transfer, transfer duration etc. are collected for each transfer

GridFTP


GridFTP usage log

Parallelism vs concurrency in GridFTP

[Figure: a GridFTP client holds a control channel (port 2811) to the GridFTP daemon on the data transfer node at each of Site A and Site B; each node is backed by a parallel file system. Concurrency = 2: two GridFTP server processes per node. Parallelism = 3: three TCP connections between each pair of servers.]

Parallelism vs concurrency

Objective - control bandwidth allocation for transfer(s) from a source to the destination(s)
Most large transfers between supercomputers
– Ability to both store and process large amounts of data
Site heavily loaded, most bandwidth consumed by small number of sites
Goal – develop simple model for GridFTP
– Source concurrency - total number of ongoing transfers between the endpoint A and all its major transfer endpoints
– Destination concurrency - total number of ongoing transfers between the endpoint A and the endpoint B
– External load - All other activities on the endpoints including transfers to other sites

Model throughput and control bandwidth allocation

Modeling throughput

Linear models: Y’ = a1X1 + a2X2 + … + akXk + b
Model dest throughput (DT) using source & destination CC
– DT = a1*DC + a2*SC + b1
– DT = a3*DC/SC + b2
Data to train, validate models – load variation experiments
Errors >15% for most cases
Log models
– log(DT) = a4*log(SC) + a5*log(DC) + b3
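As a rough illustration of how the linear and log models above could be fitted to training data, here is a minimal least-squares sketch. The function names, the use of NumPy, the base-2 logarithm, and the sample values are all assumptions made for this example; the slides do not say which fitting procedure or tooling was used.

```python
import numpy as np

def fit_linear(dc, sc, dt):
    """Least-squares fit of DT = a1*DC + a2*SC + b1. Returns (a1, a2, b1)."""
    dc, sc, dt = (np.asarray(v, dtype=float) for v in (dc, sc, dt))
    X = np.column_stack([dc, sc, np.ones(len(dc))])
    coef, *_ = np.linalg.lstsq(X, dt, rcond=None)
    return tuple(coef)

def fit_log(dc, sc, dt):
    """Least-squares fit of log(DT) = a4*log(SC) + a5*log(DC) + b3.

    Base-2 logarithms are an assumption. Returns (a4, a5, b3).
    """
    dc, sc, dt = (np.asarray(v, dtype=float) for v in (dc, sc, dt))
    X = np.column_stack([np.log2(sc), np.log2(dc), np.ones(len(dc))])
    coef, *_ = np.linalg.lstsq(X, np.log2(dt), rcond=None)
    return tuple(coef)

# Made-up samples (not measurements from the slides).
demo_dc = [1, 2, 4, 8, 2, 4]
demo_sc = [2, 4, 8, 16, 8, 4]
demo_dt_linear = [2.0 * d - 0.5 * s + 7.0 for d, s in zip(demo_dc, demo_sc)]
demo_dt_log = [3.0 * d ** 0.5 * s ** -0.25 for d, s in zip(demo_dc, demo_sc)]

linear_coef = fit_linear(demo_dc, demo_sc, demo_dt_linear)
log_coef = fit_log(demo_dc, demo_sc, demo_dt_log)
```

Because the demo data is generated exactly from each model form, the fits recover the generating coefficients; real transfer data would leave residual error.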

Modeling throughput

Log model better than linear models, still high errors
Model based on just SC and DC too simplistic
Incorporate external load
– External load - network, disk, and CPU activities outside transfers
– How to measure the external load?
– How to include external load in model(s)?

External load

Multiple training data – same SC, DC - different days & times
EL - Throughput differences for same SC, DC
Three different functions for external load (EL)
– EL1 = T − AT, T - throughput for transfer t, AT - average throughput of all transfers with same SC, DC as t
– EL2 = T − MT, MT - max throughput with same SC, DC as t
– EL3 = T/MT

Linear: DT = a6*DC + a7*SC + a8*EL + b4

Log: $DT = SC^{a_9} \cdot DC^{a_{10}} \cdot AEL\{a_{11}\} \cdot 2^{b_5}$, where

$$AEL\{a_{11}\} = \begin{cases} EL^{a_{11}} & \text{if } EL > 0 \\ |EL|^{-a_{11}} & \text{otherwise} \end{cases}$$
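The three external-load functions and the piecewise AEL term can be sketched as follows. The tuple layout for a transfer record and the function names are assumptions for illustration only. Note that EL2 is zero for the fastest transfer in a group, and the piecewise form is undefined there when a11 is positive; the sketch does not try to resolve that case.

```python
from collections import defaultdict

def external_loads(transfers):
    """Compute (EL1, EL2, EL3) for each transfer.

    transfers: list of (sc, dc, throughput) tuples (hypothetical layout).
    AT and MT are the average and max throughput among transfers
    sharing the same (SC, DC) pair.
    """
    groups = defaultdict(list)
    for sc, dc, t in transfers:
        groups[(sc, dc)].append(t)
    result = []
    for sc, dc, t in transfers:
        same = groups[(sc, dc)]
        at = sum(same) / len(same)
        mt = max(same)
        result.append((t - at, t - mt, t / mt))
    return result

def ael(el, a11):
    """Piecewise term: EL**a11 if EL > 0, otherwise |EL|**(-a11).

    EL == 0 with a11 > 0 raises ZeroDivisionError; it is left unhandled here.
    """
    return el ** a11 if el > 0 else abs(el) ** (-a11)

# Made-up throughputs for three transfers with the same SC and DC.
demo = [(4, 2, 100.0), (4, 2, 200.0), (4, 2, 300.0)]
demo_el = external_loads(demo)
```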

Models with external load

DT = a6*DC + a7*SC + a8*EL + b4

Predict: DT; controllable: DC, SC; uncontrollable: EL

Unlike SC and DC, external load is uncontrollable
Train models – multiple data points with same SC, DC
In practice, some recent transfers possible but all combinations of SC, DC unlikely

Calculating external load in practice

DT = a6*DC + a7*SC + a8*EL + b4

Known: DT, DC, SC; compute: EL

Transfers in past 30 minutes

DT = a6*DC + a7*SC + a8*EL + b4 + e

Historic transfers

Previous Transfer (PT) Method

Recent Transfers (RT) Method

Recent Transfers with Error Correction (RTEC)
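One plausible reading of the slide is that, for a finished transfer, DT, DC, and SC are known, so the linear model can be solved for EL. The sketch below does that and averages the result over transfers that ended in the past 30 minutes. The averaging step, the record layout, and the function names are assumptions; the slides do not spell out how the PT, RT, or RTEC methods combine past transfers, and the error term e is not modeled here.

```python
def el_from_transfer(dt, dc, sc, a6, a7, a8, b4):
    """Solve DT = a6*DC + a7*SC + a8*EL + b4 for EL."""
    return (dt - a6 * dc - a7 * sc - b4) / a8

def recent_el(history, now, coeffs, window_s=1800):
    """Average the back-computed EL over transfers ending within window_s seconds.

    history: list of (end_time_s, dt, dc, sc) tuples (hypothetical layout).
    coeffs: (a6, a7, a8, b4). Returns None when no transfer is recent enough.
    """
    a6, a7, a8, b4 = coeffs
    values = [
        el_from_transfer(dt, dc, sc, a6, a7, a8, b4)
        for end, dt, dc, sc in history
        if 0 <= now - end <= window_s
    ]
    if not values:
        return None
    return sum(values) / len(values)

# Made-up coefficients and history for illustration.
demo_coeffs = (2.0, -1.0, 0.5, 10.0)
demo_history = [(100.0, 13.0, 4, 8), (5000.0, 15.0, 4, 8)]
demo_recent = recent_el(demo_history, now=5100.0, coeffs=demo_coeffs)
```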

Applying models to control bandwidth

Find DC, SC to achieve target throughput
Limit DC to 20 to narrow search space
– Even then, large number of possible DC combinations (20^n)
SCmax (max source concurrency allowed) is the number of possible values for SC
– Heuristics to limit search space to SCmax * #destinations

DT = a6*DC + a7*SC + a8*EL + b4

Predict: DT; given: DC, SC; known: EL (compute w/ PT, RT or RTEC)

DT = a6*DC + a7*SC + a8*EL + b4

Given: DT; compute: DC, SC; known: EL (compute w/ PT, RT or RTEC)
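To make the inverse use of the model concrete, here is a simplified single-destination sketch: given a target DT and a known EL, it scans DC from 1 to 20 and keeps the value whose predicted throughput is closest to the target. Treating SC as the other destinations' concurrency plus DC, and the exhaustive scan itself, are assumptions for illustration; the multi-destination heuristics mentioned on the slide are not reproduced.

```python
def predict_dt(dc, sc, el, coeffs):
    """Linear model: DT = a6*DC + a7*SC + a8*EL + b4."""
    a6, a7, a8, b4 = coeffs
    return a6 * dc + a7 * sc + a8 * el + b4

def pick_dc(target_dt, other_sc, el, coeffs, dc_max=20):
    """Return the DC in 1..dc_max whose predicted DT is closest to target_dt.

    other_sc: concurrency already used by other destinations (assumed fixed),
    so the total source concurrency is other_sc + dc.
    """
    best_dc, best_err = None, None
    for dc in range(1, dc_max + 1):
        err = abs(predict_dt(dc, other_sc + dc, el, coeffs) - target_dt)
        if best_err is None or err < best_err:
            best_dc, best_err = dc, err
    return best_dc

# Made-up coefficients: predicted DT is 90*DC when other_sc = 5 and EL = 0.
demo_coeffs = (100.0, -10.0, 0.0, 50.0)
demo_dc = pick_dc(target_dt=450.0, other_sc=5, el=0.0, coeffs=demo_coeffs)
```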

Experimental setup

[Figure: sites used in the experiments: TACC, NCAR, SDSC, Indiana, NICS, PSC]

Experiments

Ratio experiments – allocate available bandwidth at source to destinations using predefined ratio
Available bandwidth at Stampede is 9 Gbps
2:1:2:3:3 for Kraken, Mason, Blacklight, Gordon, Yellowstone

Kraken = 2*9Gbps/(2+1+2+3+3) = 2*9Gbps/9 = 2Gbps

Mason=1Gbps, Blacklight=2Gbps, Gordon=3Gbps, Yellowstone=3Gbps

Kraken=2Gbps, Mason=1Gbps, Blacklight=2Gbps, Gordon=3Gbps, Yellowstone=3Gbps

Factoring experiments – increase destination’s throughput by a factor when source is saturated

Kraken=3Gbps, Mason=X1Gbps, Blacklight=X2Gbps, Gordon=X3Gbps, Yellowstone=X4Gbps

Results – Ratio experiments

Ratios are 4:5:6:8:9 for Kraken, Mason, Blacklight, Gordon, and Yellowstone. Concurrencies picked by Algorithm were {1,3,3,1,1}. Model: log with EL1. Method: RTEC

Ratios are 4:5:6:8:9 for Kraken, Mason, Blacklight, Gordon, and Yellowstone. Concurrencies picked by Algorithm were {1,4,3,1,1}. Model: log with EL3. Method: RT

Results – Factoring experiments

Increasing Gordon’s baseline throughput by 2x. Concurrency picked by the algorithm for Gordon was 5

Increasing Yellowstone’s baseline throughput by 1.5x. Concurrency picked by the algorithm for Yellowstone was 3

Adaptive scheduling of data transfers

[Figure: data moves between two storage systems through a data transfer node at each end]


Bursty transfers – opportunity for adaptive scheduling
Goals - optimize throughput, improve response times
Challenge – adaptive concurrency
– Low load – increase CC (unsaturated destinations) to max. utilization
– New requests – queue or adjust ongoing transfer concurrency
Data transfer scheduling analogous to parallel job scheduling?
– Data transfers ≅ compute jobs, wide-area bandwidth ≅ compute resources, transfer concurrency ≅ job parallelism
CPU, storage, network different at source, destination
Shared wide area network
Scheduling wide-area data transfers challenging
– Heterogeneous resources, shared network, dynamic nature of load
– Scheduling decisions not based on resource availability at one site

Metrics

Turnaround time – time a job spends in the system: completion time - arrival time
Job slowdown – factor slowed relative to the time on an unloaded system: turnaround time / processing time
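The two metric definitions above translate directly into code. This is a minimal sketch; the function names and the choice of seconds as the time unit are assumptions.

```python
def turnaround_time(arrival, completion):
    """Time a job spends in the system: completion time - arrival time."""
    return completion - arrival

def slowdown(arrival, completion, processing_time):
    """Turnaround time divided by the processing time on an unloaded system."""
    return turnaround_time(arrival, completion) / processing_time

# A transfer that needs 2 s unloaded but finishes 3 s after it arrives.
demo_slowdown = slowdown(arrival=0.0, completion=3.0, processing_time=2.0)
```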

Bounded slowdown in parallel job scheduling

Bounded slowdown for wide-area transfers

Job priority for wide-area transfers

Scheduling algorithm

Maximize resource utilization and reduce slowdown
– Adaptively queue and adjust concurrency based on load
Preemption/restart
– State required is missing block information & No migration
– Still overhead (auth, checkpoint restart), p-factor limits preemption
Four key decision-making points
– Upon task arrival – schedule or queue
– If scheduled, what concurrency value?
– When to preempt (and schedule a waiting job)
– When to change concurrency of a running job
Use both models and recent observed behavior
– Models to predict throughput and determine concurrency value
– 5-second averages of observed throughput to determine saturation
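The 5-second throughput average mentioned above could be computed from periodic byte-counter samples as sketched below. The sample layout, the function names, and in particular the saturation test (throughput failing to rise by a minimum fraction after a concurrency increase) are assumptions for illustration; the slides do not state the actual saturation criterion.

```python
def window_throughput(samples, window_s=5.0):
    """Average bytes/s over the most recent window_s seconds.

    samples: list of (timestamp_s, cumulative_bytes), sorted by time
    (hypothetical layout). Uses the earliest sample inside the window.
    """
    t_end, b_end = samples[-1]
    t0, b0 = next((s for s in samples if t_end - s[0] <= window_s), samples[-1])
    if t_end == t0:
        return 0.0
    return (b_end - b0) / (t_end - t0)

def looks_saturated(before, after, min_gain=0.05):
    """Assumed test: raising concurrency gained less than min_gain (5%)."""
    return after < before * (1.0 + min_gain)

# Made-up counter samples, one per second.
demo_samples = [(0, 0), (1, 100), (2, 200), (3, 300), (4, 400),
                (5, 500), (6, 600), (7, 1000)]
demo_rate = window_throughput(demo_samples)
```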

Illustrative example

Average turnaround time is 10.92

Average turnaround time for baseline is 12.04

Workload traces

Traces from actual executions
– Anonymized GridFTP usage statistics
Busiest day from a 1 month period
Busiest server log on that day
Limit length of logs due to production environment
Three 15-minute logs - 25%, 45%, and 60% load traces
– “load” is total bytes transferred / max. that can be transferred
Destination anonymized in logs
– Weighted random split based on capacities
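The load definition and the capacity-weighted destination assignment above can be sketched as follows. The function names, the seeded generator, and the example capacities are assumptions; the slides do not describe how the split was implemented.

```python
import random

def trace_load(sizes_bytes, duration_s, max_rate_bytes_per_s):
    """Load = total bytes transferred / max bytes transferable in the trace."""
    return sum(sizes_bytes) / (duration_s * max_rate_bytes_per_s)

def assign_destinations(n_transfers, capacities, seed=0):
    """Pick a destination for each transfer with probability proportional
    to that destination's capacity (weighted random split)."""
    rng = random.Random(seed)
    names = list(capacities)
    weights = [capacities[name] for name in names]
    return rng.choices(names, weights=weights, k=n_transfers)

# Made-up numbers: a 15-minute (900 s) trace on a link that moves 2 bytes/s.
demo_load = trace_load([450, 450], duration_s=900, max_rate_bytes_per_s=2)
demo_dests = assign_destinations(5, {"d1": 10.0, "d2": 5.0})
```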

Experimental results – turnaround 60% load

Experimental results – worst case 60% load

Experimental results – 60% load improved baseline

Related work

Several models for predicting behavior & finding optimal parallel TCP streams
– Uncongested networks, simulations

Many studies on bandwidth allocation at router
– Our focus is application-level control

Adaptive replica selection, algorithms to utilize multiple paths
– Ability to control network path
– Overlay networks

Workflow schedulers - dependencies between computation and data movement

Adaptive file transfer scheduling w/preemption in production environments not studied

Summary of current work

Models for wide-area data transfer throughput in terms of few key parameters

Log models that combine total source CC, destination CC, and a measure of external load are effective

Methods that utilize both recent and historical experimental data better at estimating external load

Adaptive scheduling algorithm to improve the overall user experience

Evaluated it using real traces on a production system

Significant improvements over the current state-of-the-art

Proposed work

File transfers have different time constraints
– Near real time to highly flexible

Objective – account for time requirements to improve overall user experience

Consider 2 job types – batch and interactive
– First, exploit relaxed deadlines of batch jobs
– Next, exploit knowledge about future arrival times

Finally, maximize utility value for jobs
– Each job has a utility function

Batch jobs

If the deadline is close, batch jobs get highest priority
– Scheduled with a concurrency of 2, no preemption

Otherwise, batch jobs get lowest priority

Interactive jobs measured by turnaround and slowdown, batch jobs measured by deadline satisfaction rate
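The batch-job rule above can be sketched as a small policy function. How close a deadline must be to count as "close" is not given on the slide, so the slack threshold, the estimated transfer time input, and the returned dictionary layout are all hypothetical.

```python
def batch_policy(now, deadline, est_transfer_s, slack_threshold_s=60.0):
    """Classify a batch job by how much slack remains before its deadline.

    slack = time left until the deadline minus the estimated transfer time.
    slack_threshold_s is a hypothetical tuning parameter.
    """
    slack = deadline - now - est_transfer_s
    if slack <= slack_threshold_s:
        # Deadline is close: highest priority, concurrency 2, no preemption.
        return {"priority": "highest", "concurrency": 2, "preemptible": False}
    # Otherwise lowest priority; concurrency is left to the regular scheduler,
    # and treating the job as preemptible is an assumption.
    return {"priority": "lowest", "concurrency": None, "preemptible": True}

def deadline_satisfaction_rate(jobs):
    """Fraction of batch jobs finished by their deadline.

    jobs: list of (completion_time, deadline) pairs.
    """
    met = sum(1 for completion, deadline in jobs if completion <= deadline)
    return met / len(jobs)

demo_urgent = batch_policy(now=0.0, deadline=100.0, est_transfer_s=50.0)
demo_relaxed = batch_policy(now=0.0, deadline=1000.0, est_transfer_s=50.0)
```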

Knowledge about future jobs

[Figure: two timelines plotting throughput in GB/s (0.5, 1.0) against time in seconds for transfers T1(d2), T2(d1), and T3(d2), with a wait queue. Schedule A – no knowledge of future jobs. Schedule B – w/ knowledge of future jobs.]

T1 – 1GB, T2 – 1GB, T3 – 0.5GB
Source – 1GB/s
Destination d1 – 1GB/s
Destination d2 – 0.5GB/s

Average Slowdown is (1.5+1+2)/3 = 1.5

Average Slowdown is (1+2+1)/3 = 1.33
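The arithmetic in this example can be checked with a few lines. The sketch derives each transfer's unloaded processing time from its size and the slower of the source and destination rates (an assumption about how the bottleneck is chosen), then recomputes the two averages from the per-transfer slowdowns listed above. The variable names are illustrative.

```python
# Sizes in GB and destinations, taken from the example.
sizes_gb = {"T1": 1.0, "T2": 1.0, "T3": 0.5}
dest = {"T1": "d2", "T2": "d1", "T3": "d2"}
dest_rate = {"d1": 1.0, "d2": 0.5}  # GB/s
source_rate = 1.0  # GB/s

# Unloaded processing time: size / bottleneck rate (assumed to be the minimum).
processing_s = {
    t: sizes_gb[t] / min(source_rate, dest_rate[dest[t]]) for t in sizes_gb
}

def average(values):
    return sum(values) / len(values)

# Per-transfer slowdowns as listed in the two averages above.
avg_first = average([1.5, 1, 2])
avg_second = average([1, 2, 1])
```

With these inputs, T1 needs 2 s unloaded while T2 and T3 need 1 s each, and the second average works out to 4/3, which the slide rounds to 1.33.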

Utility based scheduling

Both interactive and batch jobs have deadline
Associated utility function
– Impact of missing the deadline
Decay – linear, exponential, step, or a combination
Each transfer request R defined by tuple, R = (d,A,S,D,U)
– d = destination,
– A = arrival time of R,
– S = size of the file to be transferred,
– D = deadline of R, and
– U = utility function of R.

Objective – maximize aggregate utility value of jobs
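A request tuple R = (d,A,S,D,U) and the three decay shapes could be represented as sketched below. The specific functional forms, treating D as an absolute time, and expressing U as a function of lateness (completion time minus deadline) are all assumptions; the slides name the decay types but do not define them.

```python
import math
from dataclasses import dataclass
from typing import Callable

@dataclass
class Request:
    d: str                       # destination
    A: float                     # arrival time
    S: float                     # size of the file to be transferred
    D: float                     # deadline (assumed absolute time)
    U: Callable[[float], float]  # utility as a function of lateness

def linear_decay(max_utility, rate):
    """Full utility until the deadline, then drops by rate per second, floor 0."""
    return lambda late: max(0.0, max_utility - rate * max(0.0, late))

def exponential_decay(max_utility, k):
    """Full utility until the deadline, then decays as exp(-k * lateness)."""
    return lambda late: max_utility * math.exp(-k * max(0.0, late))

def step_decay(max_utility):
    """Full utility if the deadline is met, zero otherwise."""
    return lambda late: max_utility if late <= 0 else 0.0

def aggregate_utility(completions):
    """Sum of utilities; completions is a list of (Request, completion_time)."""
    return sum(r.U(done - r.D) for r, done in completions)

r1 = Request("d1", A=0.0, S=1e9, D=100.0, U=linear_decay(10.0, 2.0))
r2 = Request("d2", A=5.0, S=5e8, D=50.0, U=step_decay(5.0))
demo_total = aggregate_utility([(r1, 103.0), (r2, 40.0)])
```

In this example r1 finishes 3 s late and keeps 4 of its 10 utility units, while r2 meets its deadline and keeps all 5.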

Utility based scheduling

Inverse of instantaneous utility value as priority

Instantaneous utility value calculated as follows

Questions
