Dache: A Data Aware Caching for Big-Data Applications Using the MapReduce Framework

Guided By: Asst. Prof. Ms. Mary Mareena
Submitted By: Muhammed Safir O P

Uploaded by safir-shah on 21-Aug-2015


Page 1: Dache: A Data Aware Caching for Big-Data using Map Reduce framework

Dache: A Data Aware Caching for Big-Data Applications Using the MapReduce Framework

Guided By: Asst. Prof. Ms. Mary Mareena

Submitted By: Muhammed Safir O P

Page 2: Dache: A Data Aware Caching for Big-Data using Map Reduce framework

Contents:

INTRODUCTION

ABSTRACT

EXISTING SYSTEM

PROPOSED SYSTEM

SYSTEM ARCHITECTURE

RESULT AND DISCUSSION

CONCLUSION

Page 3: Dache: A Data Aware Caching for Big-Data using Map Reduce framework

INTRODUCTION:

Google MapReduce: a software framework for large-scale distributed computing on large amounts of data.

Hadoop: an open-source implementation of the Google MapReduce programming model.

Two phases: the Map phase and the Reduce phase.

Dache provisions a cache layer for efficiently identifying and accessing cache items.
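The two phases above can be illustrated with a minimal pure-Python word-count sketch. This is only an illustration of the programming model, not the actual Hadoop API; function names are my own.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit an intermediate (word, 1) pair for every word in the split."""
    return [(word, 1) for word in document.split()]

def reduce_phase(pairs):
    """Reduce: sum the counts for each key."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

pairs = map_phase("big data big cache")
print(reduce_phase(pairs))  # {'big': 2, 'data': 1, 'cache': 1}
```

In real MapReduce the map and reduce calls run on many workers in parallel; the framework handles distribution, so the programmer supplies only these two functions.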

Page 4: Dache: A Data Aware Caching for Big-Data using Map Reduce framework

EXISTING SYSTEM:

MapReduce provides a standardized framework for distributed processing.

Intermediate data is thrown away, because MapReduce is unable to utilize it.

LIMITATIONS:

Inefficiency in incremental processing.

Duplicate computations are performed.

There is no mechanism to identify duplicate computations and accelerate job execution.

Page 5: Dache: A Data Aware Caching for Big-Data using Map Reduce framework

EXISTING SYSTEM(Cont.):

The input is split and fed to workers in the map phase.

Intermediate files generated in the map phase are shuffled and sorted by the system and fed to workers in the reduce phase.

Final results are computed by multiple reducers and written to disk.
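The data flow just described (split, map, shuffle/sort, reduce) can be sketched as a single-process Python simulation. This is illustrative only; real Hadoop distributes each step across worker nodes, and all names here are assumptions.

```python
from collections import defaultdict
from itertools import chain

def split_input(text, n_workers):
    """Splitter: divide the input into roughly equal splits."""
    words = text.split()
    size = max(1, len(words) // n_workers)
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def run_job(text, n_workers=2):
    splits = split_input(text, n_workers)
    # Map phase: each "worker" emits intermediate (key, value) pairs.
    intermediate = chain.from_iterable(
        [(w, 1) for w in split.split()] for split in splits)
    # Shuffle/sort: the system groups intermediate values by key.
    grouped = defaultdict(list)
    for key, value in intermediate:
        grouped[key].append(value)
    # Reduce phase: each reducer sums the values for one key.
    return {key: sum(values) for key, values in sorted(grouped.items())}

print(run_job("big data big cache"))
```

Note that nothing in this flow retains the intermediate files after the job finishes; that is exactly the waste Dache targets.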

Page 6: Dache: A Data Aware Caching for Big-Data using Map Reduce framework

PROPOSED SYSTEM

Dache: a data-aware cache system for big-data applications using the MapReduce framework.

Dache's aim: extending the MapReduce framework and provisioning a cache layer for efficiently identifying and accessing cache items in a MapReduce job.

Page 7: Dache: A Data Aware Caching for Big-Data using Map Reduce framework

PROPOSED SYSTEM (Cont.):

Identifies the source input from which a cache item is obtained, and the operations applied on that input, so that a cache item produced by the workers in the map phase is indexed properly.

A partition operation is applied in the map phase.

Page 8: Dache: A Data Aware Caching for Big-Data using Map Reduce framework

MAP REDUCE ARCHITECTURE

Job Client: submits jobs.

Job Tracker: coordinates jobs.

Task Tracker: executes job tasks.

Page 9: Dache: A Data Aware Caching for Big-Data using Map Reduce framework

MAP REDUCE ARCHITECTURE

1. Clients submit jobs to the Job Tracker.
2. The Job Tracker talks to the name node.
3. The Job Tracker creates an execution plan.
4. The Job Tracker submits work to Task Trackers.
5. Task Trackers report progress via heartbeats.
6. The Job Tracker manages the phases.
7. The Job Tracker updates the status.

Page 10: Dache: A Data Aware Caching for Big-Data using Map Reduce framework

MAP PHASE CACHE DESCRIPTION:

A cache item is a piece of cached data stored in the Distributed File System (DFS).

The content of a cache item is described by the original data and the operations applied to it, as a 2-tuple: {Origin, Operation}.

Origin: the name of a file in the DFS.
Operation: a linear list of operations performed on the origin file.
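The {Origin, Operation} description could be modeled as follows. This is a sketch of the idea only; the field names and the key format are my assumptions, not Dache's actual on-disk format.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CacheDescription:
    origin: str                    # name of the source file in the DFS
    operations: tuple = ()         # linear list of operations applied, in order

    def key(self):
        """Index key: the origin plus the ordered operation list, so cache
        items produced in the map phase can be looked up unambiguously."""
        return self.origin + "|" + ",".join(self.operations)

# Hypothetical example: a split that was filtered and then counted.
item = CacheDescription("/dfs/input/part-0001", ("filter:a", "count"))
print(item.key())  # /dfs/input/part-0001|filter:a,count
```

Because the key encodes both the input file and the exact operation sequence, two jobs that apply the same operations to the same origin resolve to the same cache item.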

Page 11: Dache: A Data Aware Caching for Big-Data using Map Reduce framework

REDUCE PHASE CACHE DESCRIPTION:

The input for the reduce phase is also a list of key-value pairs, where the value could be a list of values.

The original input and the applied operations are required.

The original input is obtained by storing the intermediate results of the map phase in the DFS.

Page 12: Dache: A Data Aware Caching for Big-Data using Map Reduce framework

Job Types and Cache Organization Relation:

When processing each file split, the cache manager reports the previous file splitting scheme used in its cache item.

Page 13: Dache: A Data Aware Caching for Big-Data using Map Reduce framework

Job Types and Cache Organization Relation:

To find words starting with 'ab', we use the cached results for words starting with 'a', and also add the new result to the cache.

Find the best match among overlapped results (choose 'ab' instead of 'a').
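The best-match rule above can be sketched as choosing the longest cached prefix that the query extends. A minimal illustration, with names of my own choosing:

```python
def best_cached_prefix(query_prefix, cached_prefixes):
    """Return the longest cached prefix that the query extends, or None.

    A longer matching prefix means less leftover work: reusing the
    cache item for 'ab' beats re-filtering the larger result for 'a'.
    """
    matches = [p for p in cached_prefixes if query_prefix.startswith(p)]
    return max(matches, key=len) if matches else None

cache = {"a", "ab", "b"}
print(best_cached_prefix("ab", cache))   # chooses 'ab' instead of 'a'
print(best_cached_prefix("abc", cache))  # 'ab' is still the best match
```

After the 'ab' job runs, its result would itself be added to the cache, so a later query for 'abc' starts from the even smaller 'ab' result set.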

Page 14: Dache: A Data Aware Caching for Big-Data using Map Reduce framework

CACHE REQUEST AND REPLY:

MAP CACHE:

Cache requests must be sent out before the file-splitting phase.

The Job Tracker issues cache requests to the cache manager.

The cache manager replies with a list of cache descriptions.

Page 15: Dache: A Data Aware Caching for Big-Data using Map Reduce framework

CACHE REQUEST AND REPLY:

REDUCE CACHE:

First, the requested cache item is compared with the cached items in the cache manager's database.

The cache manager identifies the overlaps between the original input files of the requested cache item and the stored cache items.

A linear scan is used here.
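The linear scan just described might look like the following. This is a sketch under assumed data layouts (cache items as name/input-file pairs), not Dache's actual implementation.

```python
def find_overlaps(requested_inputs, stored_items):
    """Linear scan over every stored cache item, reporting the original
    input files each one shares with the requested cache item."""
    requested = set(requested_inputs)
    overlaps = []
    for name, inputs in stored_items:   # one pass; no index is consulted
        common = requested & set(inputs)
        if common:
            overlaps.append((name, sorted(common)))
    return overlaps

stored = [
    ("item-1", ["split-1", "split-2"]),
    ("item-2", ["split-3"]),
    ("item-3", ["split-4"]),
]
print(find_overlaps(["split-2", "split-3"], stored))
```

A linear scan costs time proportional to the number of stored items, which is acceptable while the cache manager's database stays modest in size.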

Page 16: Dache: A Data Aware Caching for Big-Data using Map Reduce framework

LIFETIME MANAGEMENT OF CACHE ITEMS

Cache Manager: determines how long a cache item can be kept in the DFS.

Two types of policies:

Fixed storage quota: Least Recently Used (LRU) eviction is employed.

Optimal utility: estimates the computation time t_s saved by caching a cache item for a given amount of time t_a.

Expense = P_storage × S_cache × t_a

Save = P_computation × R_duplicate × t_s

Here P_storage and P_computation are the prices of storage and computation, S_cache is the size of the cache item, and R_duplicate is the rate of duplicate computations; the item is worth keeping while Save exceeds Expense.
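The optimal-utility comparison above can be worked through numerically. The prices, sizes, and rates below are made-up illustrative values, not figures from the Dache evaluation.

```python
def expense(p_storage, s_cache, t_a):
    """Storage cost of keeping a cache item of size s_cache for time t_a."""
    return p_storage * s_cache * t_a

def save(p_computation, r_duplicate, t_s):
    """Value of the computation time t_s saved, weighted by the rate of
    duplicate computations r_duplicate."""
    return p_computation * r_duplicate * t_s

def should_keep(p_storage, s_cache, t_a, p_computation, r_duplicate, t_s):
    """Keep the item while the saved computation outweighs storage cost."""
    return save(p_computation, r_duplicate, t_s) > expense(p_storage, s_cache, t_a)

# A small, frequently reused item: save = 24.0 vs. expense = 2.0 -> keep.
print(should_keep(0.1, 2.0, 10.0, 1.0, 0.8, 30.0))  # True
```

Under this policy a large item that is rarely reused (low R_duplicate) is evicted early, while a small item that short-circuits many duplicate tasks stays cached.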

Page 17: Dache: A Data Aware Caching for Big-Data using Map Reduce framework

RESULT AND DISCUSSION:

Graph: CPU utilization of Hadoop and Dache in the two programs, the Tera-sort program and the Word-count program.

Page 18: Dache: A Data Aware Caching for Big-Data using Map Reduce framework

RESULT AND DISCUSSION (Cont.):

Graph: completion time for the two programs using Dache and Hadoop.

Page 19: Dache: A Data Aware Caching for Big-Data using Map Reduce framework

RESULT AND DISCUSSION (Cont.):

Graph: total cache size in GB for the two programs using Dache and Hadoop.

Page 20: Dache: A Data Aware Caching for Big-Data using Map Reduce framework

CONCLUSION:

Requires minimal change to the original MapReduce programming model.

Application code requires only slight changes in order to utilize Dache.

Dache is implemented in Hadoop by extending the relevant components.

Testbed experiments show that it can eliminate all the duplicate tasks in incremental MapReduce jobs.

Reduced execution time and CPU utilization.

Page 21: Dache: A Data Aware Caching for Big-Data using Map Reduce framework

REFERENCES:

J. Dean and S. Ghemawat, MapReduce: Simplified data processing on large clusters, Communications of the ACM, vol. 51, no. 1, pp. 107-113, 2008.

Hadoop, http://Hadoop.apache.org, 2013.

Cache algorithms, http://en.wikipedia.org/wiki/Cache_algorithms, 2013.

Java programming language, http://www.java.com/, 2013.

Google Compute Engine, http://cloud.google.com/products/computeengine.html, 2013.

Page 22: Dache: A Data Aware Caching for Big-Data using Map Reduce framework

Any Questions???