
Data Replication and Power Consumption in Data Grids

Susan V. Vrbsky, Ming Lei, Karl Smith and Jeff Byrd
Department of Computer Science, The University of Alabama
IEEE 2010 Cloud Computing Technology and Science

March 16, 2011

Taikyoung Kim

SNU IDB Lab.


Outline

Introduction
Data Replication
Performance Results
Conclusion and Future Work


Introduction

Data grid features
– Millions of files are generated, and thousands of clients access the files
– Need to manage an extremely large number of data sets

Present systems support scalability but are extremely energy inefficient
– Power and cooling of the data center are inefficient
– The power demanded by data centers is predicted to double from 2006 to 2011

Storing, managing and moving massive amounts of data are also a significant bottleneck


Introduction

Our approach
– Save energy through efficient CPU usage
– Consider strategies to minimize disk storage and data transmission

We propose to minimize the amount of data stored by utilizing smart replication strategies
– Consider replicating the data only when necessary

Goal
– Design data-aware strategies for data-intensive computing
    Shorter running times
    Decreased amount of data transmitted
    Smaller storage space
– Reduce power needed


Outline

Introduction
Data Replication
– Data Grid Architecture
– Sliding Window Strategy
Performance Results
Conclusion and Future Work


Data Replication

Utilize data replication
– High probability of accessing data that is not at the local site
– Remote data file access can be a very expensive operation
    Network bandwidth, network congestion
– Replication reduces the access time and avoids remote file access

Limit the size of the storage
– To decrease the amount of energy needed to store the data

Use smart data replication to reduce the cost of accessing and storing data


Data Replication

Data Grid Architecture

We consider only single-tier grids
– We expect that the strategies developed for single-tier grids can be used within the multi-tier structure

It is common for a job in a data grid to list all the files needed to complete its task
– We utilize this aspect in designing a data replication scheme


Data Replication

Sliding Window Strategy

SWIN [Sliding Window replica scheme]
– Considers the future file access times and the size of the local site's Storage Element
– Builds a “sliding window”: a set of distinct files that will be used in the immediate future
    Includes all the files the current job will access and the distinct files from the next arriving jobs
    The sum of the sizes of the files in the sliding window will be at most the size of the local Storage Element
– Slides forward one more file each time the system finishes processing one file
    The window keeps changing in this way
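The window construction described on this slide can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name `build_sliding_window`, the file names, the sizes and the 300 MB capacity are all hypothetical, and the sketch assumes that the current job's files fit in the Storage Element and that the window stops growing at the first distinct file that does not fit.

```python
from typing import Dict, List


def build_sliding_window(
    upcoming: List[str], sizes: Dict[str, int], se_capacity: int
) -> List[str]:
    """Collect distinct upcoming files, in access order, until the SE is full."""
    window: List[str] = []
    used = 0
    for name in upcoming:
        if name in window:
            continue  # the window holds distinct files only
        if used + sizes[name] > se_capacity:
            break  # the next distinct file would exceed the Storage Element
        window.append(name)
        used += sizes[name]
    return window


# Hypothetical access sequence: the current job's files, then later jobs' files.
accesses = ["a", "b", "c", "a", "d", "b", "e"]
sizes = {name: 100 for name in accesses}  # equal-sized files in MB (made up)

# Rebuild the window after each processed file to mimic it sliding forward.
windows = [build_sliding_window(accesses[i:], sizes, 300) for i in range(3)]
```

In this toy run the window still holds `a` after the first access is processed, because `a` is needed again soon; `d` only enters once `b` has been processed and dropped from the head of the sequence.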


Data Replication

Sliding Window Strategy

Q = <J1, J2, …> : a set of jobs

FAS(Ji) = <fi1, fi2, …, fik> : file accessing sequence (fin ≠ fim)

G_FAS = <FAS(J1), FAS(J2), …, FAS(Jn)> : global file accessing sequence

POS(fx, G_FAS) : returns the first position of fx in G_FAS

Sliding Window rules

1. The sum of the sizes of all the files in the sliding window ≤ Size(SE)
2. No duplicated files exist in the sliding window
3. Any files in the sliding window will not be in a position before the POS(fK, G_FAS)
4. Any files not in the sliding window will be in a position after POS(fm, G_FAS)
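The notation above can be made concrete with a short sketch. It assumes G_FAS is the plain concatenation of each job's FAS in queue order and that positions are 0-based; neither detail is stated on the slide. The helper names (`global_fas`, `pos`, `satisfies_rules_1_and_2`) and all data are hypothetical. Only rules 1 and 2 are checked, since the slide does not define fK and fm for rules 3 and 4.

```python
from typing import Dict, List


def global_fas(job_queue: List[List[str]]) -> List[str]:
    """Concatenate each job's file accessing sequence in queue order."""
    return [name for fas in job_queue for name in fas]


def pos(name: str, g_fas: List[str]) -> int:
    """Return the first (0-based) position of a file in G_FAS.

    Raises ValueError if the file never appears in G_FAS.
    """
    return g_fas.index(name)


def satisfies_rules_1_and_2(
    window: List[str], sizes: Dict[str, int], se_size: int
) -> bool:
    """Rule 1: total size fits in the SE. Rule 2: no duplicated files."""
    fits = sum(sizes[name] for name in window) <= se_size
    distinct = len(set(window)) == len(window)
    return fits and distinct


# Hypothetical queue of two jobs; each FAS lists distinct files.
queue = [["a", "b", "c"], ["b", "d"]]
g_fas = global_fas(queue)
sizes = {"a": 100, "b": 100, "c": 100, "d": 100}  # made-up sizes in MB
```

For example, `pos("b", g_fas)` reports the first occurrence of `b` even though a later job accesses it again.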


Outline

Introduction
Data Replication
Performance Results
– Performance Environment
– Number of Nodes Powered On
– File Availability
Conclusion and Future Work


Performance Results

Evaluate the performance of the SWIN replica strategy using Sage, built at the University of Alabama

Sage nodes
– Intel D201GLY2 mainboard with 1.2 GHz Celeron CPU
    On-board 10/100 Megabit LAN
– 1 GB 533 MHz RAM
– 80 GB SATA 3 hard drive

Energy usage rates
– Booting and peak: 430 Watts
– Idle: 335 Watts (cooling fans turned on), 315 Watts (cooling fans turned off)


Performance Results

Performance Environment

The client nodes are responsible for
– Processing the request
– Maintaining replica copies
– Notifying the server when a job is completed

Default experiment parameters

Metric
– Total running time
– Average number of watts required to process a job
    Sampled every 1 minute

(400MB)
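The watts metric, sampled once per minute, can be illustrated with a small calculation. This is a sketch under stated assumptions: the readings below are hypothetical (they reuse the 430 W peak and 335 W idle rates as example values, not measured results), and the watt-hour conversion is an added illustration rather than a metric reported in the slides.

```python
from typing import List


def average_watts(samples: List[float]) -> float:
    """Mean of the power readings taken over a run."""
    return sum(samples) / len(samples)


def energy_watt_hours(samples: List[float], interval_minutes: float = 1.0) -> float:
    """Approximate energy by treating each reading as constant for one interval."""
    return sum(samples) * interval_minutes / 60.0


# Hypothetical one-hour run: 30 readings at the peak rate, 30 at the idle rate.
readings = [430.0] * 30 + [335.0] * 30
avg = average_watts(readings)
energy = energy_watt_hours(readings)
```

Because the hypothetical run lasts exactly one hour, the average in watts and the energy in watt-hours come out to the same number; for shorter or longer runs the two diverge, which is one reason running time and power need to be reported separately.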


Performance Results

Number of Nodes Powered On

The power consumed is affected by whether or not all of the nodes are powered on
– Regardless of whether they are being used in the computation of the jobs

LFU - Least Frequently Used
LRU - Least Recently Used
MRU - Most Recently Used
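LFU, LRU and MRU are the existing replacement strategies that SWIN is compared against. A minimal sketch of how each one picks a file to evict is shown below; the function `choose_victim`, the access history and the cached set are hypothetical, and the sketch uses past accesses only, without modeling the grid, the resource broker or file sizes.

```python
from collections import Counter
from typing import Dict, List


def choose_victim(policy: str, cached: List[str], history: List[str]) -> str:
    """Pick the cached file to evict, given the past access history."""
    # Index of each file's most recent access (later index = more recent).
    last_use: Dict[str, int] = {name: i for i, name in enumerate(history)}
    counts = Counter(history)  # number of past accesses per file
    if policy == "LRU":
        return min(cached, key=lambda name: last_use.get(name, -1))
    if policy == "MRU":
        return max(cached, key=lambda name: last_use.get(name, -1))
    if policy == "LFU":
        return min(cached, key=lambda name: counts[name])
    raise ValueError(f"unknown policy: {policy}")


# Hypothetical history of accesses, oldest first, and the files currently cached.
history = ["a", "b", "a", "c", "a", "b", "c", "c"]
cached = ["a", "b", "c"]
```

All three policies look backward at accesses that already happened, whereas the sliding window looks forward at the files that queued jobs have declared they will need.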


Performance Results

Number of Client Nodes

Measured the total running time for 100 jobs with all nodes powered on


Performance Results

Number of Client Nodes

While LRU requires the most watts, it has a shorter running time overall than LFU and MRU
– Does not require the highest number of watts

The jobs with only 1 or 2 client nodes take longer to run than those utilizing 8 client nodes

The watts required for computation are a smaller percentage of the total watts


Performance Results

File Availability

The files are only available at the server
– (a) The jobs are able to run in a shorter amount of time as the number of clients increases
– (b) The bottleneck increases as the number of client nodes increases
    Assume all file requests must go through the resource broker at the server

The amount of power consumed is not always strictly related to the running time of the jobs

Lastly, we have shown that the window size can be decreased without increasing the running time or power consumed


Outline

Introduction
Data Replication
Performance Results
Conclusion and Future Work


Conclusion and Future Work

Propose smart strategies for replicating files
– One way to minimize the energy consumed in a data grid

SWIN strategy
– Minimizes the amount of data transmitted and storage needed
– Performs better than existing strategies, such as LRU, MRU and LFU
– Particularly beneficial in power saving when resource contention is high
– Decreases running time and watts required

Smaller storage can be used to lower the amount of power

Future work
– Study the performance of SWIN when the files are of different sizes
– Explore more efficient implementations for transferring files
– Design and test additional replica schemes by utilizing the CPU
– Consider ways to schedule the jobs

Thank you

Question?
