TRANSCRIPT

A Flexible GridFTP Client for Implementation of Intelligent Cloud Data Scheduling Services
Esma Yildirim
Department of Computer Engineering, Fatih University
Istanbul, Turkey
DataCloud 2013
Outline
Data Scheduling Services in the Cloud
File Transfer Scheduling Problem History
Implementation Details of the Client
Example Algorithms
Amazon EC2 Experiments
Conclusions
Cloud Data Scheduling Services
Data clouds strive for novel services for the management, analysis, access and scheduling of Big Data
Application-level protocols providing high performance on high-speed networks are an integral part of data scheduling services
GridFTP and UDP-based protocols are used frequently in modern-day schedulers (e.g. Globus Online, StorkCloud)
Bottlenecks in Data Scheduling Services
Data is large, diverse and complex
Transferring large datasets faces many bottlenecks:
Transport protocol's underutilization of the network
End-system limitations (e.g. CPU, NIC and disk speed)
Dataset characteristics: many short-duration transfers incur connection startup and teardown overhead
Optimizations in the GridFTP protocol: pipelining, parallelism and concurrency
Application in Data Scheduling Services
Setting optimal parameters for different datasets is a challenging task
Data scheduling services set static values based on experience
The provided tools do not support dynamic, intelligent algorithms that might change settings on the fly
Goals of the Flexible Client
Flexibility for scalable data scheduling algorithms
On-the-fly changes to the optimization parameters
Reshaping the dataset characteristics
File Transfer Scheduling Problem
Lies at the origin of data scheduling services
Dates back to the 1980s
Earliest approaches:
List scheduling: sort the transfers based on size, bandwidth of the path or duration of the transfer; gives a near-optimal solution
Integer programming: not feasible to implement
File Transfer Scheduling Problem
Scalable approaches:
Transferring from multiple replicas
Divided datasets sent over different paths to make use of additional network bandwidth
Adaptive approaches:
Divide files into multiple portions to send over parallel streams
Divide the dataset into multiple portions and send them at the same time
Adaptively change the level of concurrency or parallelism based on network throughput
Optimization algorithms:
Find optimal settings via modeling and set the optimal parameters once and for all
File Transfer Scheduling Problem
Modern-day data scheduling service examples:
Globus Online
Hosted SaaS
Statically set pipelining, concurrency and parallelism
Stork
Multi-protocol support
Finds the optimal parallelism level based on modeling
Static job concurrency
Ideal Client Interface
Allow dataset transfers to be:
Enqueued, dequeued
Sorted based on a property
Divided and combined into chunks
Grouped by source-destination paths
Done from multiple replicas
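The interface above can be sketched as a small queue-like class. This is a minimal illustration, not the actual client's API; the class and method names are assumptions, and files are represented as simple `(name, size, src, dst)` tuples.

```python
from collections import deque, defaultdict

class DatasetQueue:
    """Illustrative sketch of the ideal client interface: a dataset
    that can be enqueued/dequeued, sorted, divided into chunks and
    grouped by source-destination path. Names are hypothetical.
    """
    def __init__(self, files=()):
        self.files = deque(files)  # each file: (name, size, src, dst)

    def enqueue(self, f):
        self.files.append(f)

    def dequeue(self):
        return self.files.popleft()

    def sort_by(self, key):
        """Sort the dataset by a property, e.g. key=lambda f: f[1] for size."""
        self.files = deque(sorted(self.files, key=key))

    def divide(self, n_files):
        """Divide the dataset into chunks of at most n_files files each."""
        fs = list(self.files)
        return [fs[i:i + n_files] for i in range(0, len(fs), n_files)]

    def group_by_path(self):
        """Group files by their (source, destination) pair."""
        groups = defaultdict(list)
        for f in self.files:
            groups[(f[2], f[3])].append(f)
        return dict(groups)
```

Transfers from multiple replicas would then amount to rewriting the source element of a group before handing it to the transfer function.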
Implementation Details
Shortcomings of globus-url-copy:
Does not allow even a static setting of the pipelining level; it uses its own default value, invisible to the user
globus-url-copy -pp -p 5 -cc 4 src_url dest_url
A directory of files cannot be divided and given different optimization parameters
The -f filelist option helps, but pipelining cannot be applied to the list, as the developers indicate
globus-url-copy -pp -p 5 -cc 4 -f filelist.txt
Implementation Details
File data structure properties:
File size: used to construct data chunks based on total size, and for throughput and transfer duration calculations
Source and destination paths: necessary for combining and dividing datasets, and for changing the source path based on the replica location
File name: necessary to reconstruct full paths
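The file data structure described above could look roughly like the following Python class. Field and method names are illustrative assumptions, not the client's actual definitions.

```python
from dataclasses import dataclass

@dataclass
class FileEntry:
    """One file in a dataset, as kept in the client's file list.
    Field names are illustrative; the actual structure may differ.
    """
    name: str      # file name, used to reconstruct full paths
    size: int      # bytes; used for chunking and throughput/duration math
    src_dir: str   # source path; can be swapped for a replica location
    dst_dir: str   # destination path

    def src_url(self) -> str:
        # Reconstruct the full source path from directory + name.
        return f"{self.src_dir.rstrip('/')}/{self.name}"

    def dst_url(self) -> str:
        # Reconstruct the full destination path from directory + name.
        return f"{self.dst_dir.rstrip('/')}/{self.name}"
```

Keeping the directory and the name separate is what makes replica switching cheap: only `src_dir` changes, and the full path is rebuilt on demand.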
Implementation Details
Listing the files for a given path:
Contacts the GridFTP server
Pulls information about the files in the given path
Provides a list of file data structures, including the number of files
Makes it easier to divide, combine, sort, enqueue and dequeue on a list of files
Implementation Details
Performing the actual transfer:
Sets the optimization parameters on a list of files returned by the list function and manipulated by different algorithms
For a data chunk, it sets the parallel stream, concurrency and pipelining values
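To make the per-chunk parameter setting concrete, the mapping can be sketched as a command builder mirroring the globus-url-copy examples shown earlier. The flexible client sets these values programmatically per chunk rather than shelling out; this sketch only illustrates which parameters travel with each chunk.

```python
def transfer_command(filelist_path, parallelism, concurrency, pipelining_on=True):
    """Build a globus-url-copy invocation for one data chunk,
    mirroring the CLI examples above (-pp enables pipelining,
    -p sets parallel streams, -cc sets concurrency, -f names
    the chunk's file list). Sketch only; the real client sets
    these parameters through its own interface.
    """
    cmd = ["globus-url-copy"]
    if pipelining_on:
        cmd.append("-pp")
    cmd += ["-p", str(parallelism), "-cc", str(concurrency), "-f", filelist_path]
    return cmd
```

An algorithm would write each chunk's files to its own file list and call this once per chunk, varying the parameters between calls.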
Example Algorithm 1: Adaptive Concurrency
Takes a file list structure returned by the list function as input
Divides the file list into chunks based on the number of files in a chunk
Starting with a concurrency level of 1, transfers each chunk with an exponentially increasing concurrency level as long as the throughput increases with each chunk transfer
If the throughput drops, the concurrency level is adaptively decreased for the subsequent chunk transfer
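The adaptive loop above can be sketched as follows. The transcript gives the algorithm only in prose, so the doubling/halving step shape and the `transfer(chunk, cc)` callback signature are assumptions.

```python
def adaptive_concurrency(chunks, transfer):
    """Adaptive concurrency sketch.

    transfer(chunk, cc) transfers one chunk at concurrency level cc
    and returns the achieved throughput. Concurrency is increased
    exponentially while throughput keeps rising, and decreased after
    a drop (exact step sizes are assumptions, not from the paper).
    """
    cc = 1
    last_throughput = 0.0
    for chunk in chunks:
        throughput = transfer(chunk, cc)
        if throughput > last_throughput:
            cc *= 2               # exponential increase while improving
        else:
            cc = max(1, cc // 2)  # back off after a throughput drop
        last_throughput = throughput
    return cc
```

Because every chunk is a real transfer, the probing itself moves useful data; nothing is wasted on synthetic measurements.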
Example Algorithm 2: Optimal Pipelining
A mean-based algorithm to construct clusters of files with different optimal pipelining levels
Calculates the optimal pipelining level by dividing the BDP (bandwidth-delay product) by the mean file size of the chunk
The dataset is recursively divided at the mean file size index as long as the following conditions are met:
A chunk can only be divided further if its pipelining level is different from its parent chunk's
A chunk cannot be smaller than a preset minimum chunk size
The optimal pipelining level for a chunk cannot be greater than a preset maximum pipelining level
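The mean-based clustering can be sketched like this. The rounding of the pipelining level, the exact split rule and the termination details are assumptions filled in around the prose description; file sizes are assumed to arrive sorted.

```python
import math

def opt_pp(sizes, bdp, max_pp):
    """Optimal pipelining level: BDP divided by the chunk's mean file
    size, capped at max_pp (ceiling rounding is an assumption)."""
    mean = sum(sizes) / len(sizes)
    return min(max_pp, max(1, math.ceil(bdp / mean)))

def cluster(sizes, bdp, min_chunk, max_pp, parent_pp=None):
    """Mean-based recursive clustering sketch.

    sizes is a sorted list of file sizes. The list is split at the
    mean-size index; a chunk is divided further only while its
    pipelining level differs from its parent's and both halves hold
    at least min_chunk files. Returns a list of (sizes, pp) pairs.
    """
    pp = opt_pp(sizes, bdp, max_pp)
    if pp == parent_pp or len(sizes) < 2 * min_chunk:
        return [(sizes, pp)]
    mean = sum(sizes) / len(sizes)
    # Index of the first file at least as large as the mean.
    split = next(i for i, s in enumerate(sizes) if s >= mean)
    left, right = sizes[:split], sizes[split:]
    if len(left) < min_chunk or len(right) < min_chunk:
        return [(sizes, pp)]
    return (cluster(left, bdp, min_chunk, max_pp, pp)
            + cluster(right, bdp, min_chunk, max_pp, pp))
```

The intuition: small files need deep pipelining to keep the control channel busy, large files need little, and a mixed dataset only benefits from splitting where the two levels actually differ.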
Example Algorithm 2-b: Optimal Pipelining and Concurrency
After the recursive division of chunks, ppopt is set for each chunk
Chunks go through a revision phase in which smaller chunks are combined and larger chunks are divided further
Starting with cc = 1, each chunk is transferred with exponentially increasing cc levels until the throughput drops
The remaining chunks are transferred at the optimal cc level
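The revision phase is described only in prose, so the following sketch fills in details that are assumptions: the size thresholds, merging an undersized chunk into the next one (keeping the next chunk's pipelining level), and splitting oversized chunks in half-sized pieces.

```python
def revise(chunks, min_files, max_files):
    """Revision-phase sketch: each chunk is (files, pp).
    Chunks with fewer than min_files files are merged into the
    following chunk (its pp wins; a trailing undersized chunk is
    left as-is in this sketch); chunks with more than max_files
    files are split. All thresholds and rules are assumptions.
    """
    # Merge undersized chunks into their successor.
    merged = []
    for files, pp in chunks:
        if merged and len(merged[-1][0]) < min_files:
            prev_files, _prev_pp = merged.pop()
            files = prev_files + files  # successor's pp is kept
        merged.append((files, pp))
    # Split oversized chunks, keeping the chunk's pp level.
    revised = []
    for files, pp in merged:
        while len(files) > max_files:
            revised.append((files[:max_files], pp))
            files = files[max_files:]
        revised.append((files, pp))
    return revised
```

After revision, the exponential concurrency search from Algorithm 1 runs over the first few chunks, and the remaining chunks are sent at the concurrency level it settles on.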
Amazon EC2 Experiments
Large nodes with 2 vCPUs, 8 GB storage, 7.5 GB memory and moderate network performance
50 ms artificial delay
Globus Provision is used for automatic setup of the servers
Datasets comprise many small files (the most difficult optimization case):
5000 1 MB files
1000 random-size files ranging from 1 byte to 10 MB
Amazon EC2 Experiments: 5000 1 MB files
Baseline performance: default pipelining + data channel caching
The throughput achieved is higher than the baseline in the majority of cases
Amazon EC2 Experiments: 1000 random-size files
Conclusions
The flexible GridFTP client is able to accommodate data scheduling algorithms of different natures
Adaptive and optimization algorithms can easily sort, divide and combine datasets
It makes it possible to implement intelligent cloud scheduling services more easily
Questions?