Dache: A Data Aware Caching for Big-Data Applications Using the MapReduce Framework

Guided By: Asst. Prof. Ms. Mary Mareena
Submitted By: Muhammed Safir O P

Uploaded by safir-shah on 21-Aug-2015


Page 1: Dache: A Data Aware Caching for Big-Data using Map Reduce framework

Dache: A Data Aware Caching for Big-Data Applications Using the MapReduce Framework

Guided By: Asst. Prof. Ms. Mary Mareena

Submitted By: Muhammed Safir O P

Page 2: Dache: A Data Aware Caching for Big-Data using Map Reduce framework

Contents:

INTRODUCTION

ABSTRACT

EXISTING SYSTEM

PROPOSED SYSTEM

SYSTEM ARCHITECTURE

RESULT AND DISCUSSION

CONCLUSION

Page 3: Dache: A Data Aware Caching for Big-Data using Map Reduce framework

INTRODUCTION:

Google MapReduce: a software framework for large-scale distributed computing on large amounts of data.

Hadoop: an open-source implementation of the Google MapReduce programming model.

Two phases: the Map phase and the Reduce phase.

Dache provisions a cache layer for efficiently identifying and accessing cache items.
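The two phases above can be illustrated with a minimal pure-Python word-count sketch. This is only an illustration of the programming model, not the actual Hadoop API; function names are my own.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit an intermediate (word, 1) pair for every word in the split."""
    return [(word, 1) for word in document.split()]

def reduce_phase(pairs):
    """Reduce: sum the counts for each key."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

pairs = map_phase("big data big cache")
print(reduce_phase(pairs))  # {'big': 2, 'data': 1, 'cache': 1}
```

In real MapReduce the map and reduce calls run on many workers in parallel; the framework handles distribution, so the programmer supplies only these two functions.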

Page 4: Dache: A Data Aware Caching for Big-Data using Map Reduce framework

EXISTING SYSTEM:

MapReduce provides a standardized framework for distributed processing.

Intermediate data is thrown away, because MapReduce is unable to utilize it.

LIMITATIONS:

Inefficiency in incremental processing.

Duplicate computations are performed.

There is no mechanism to identify duplicate computations and accelerate job execution.

Page 5: Dache: A Data Aware Caching for Big-Data using Map Reduce framework

EXISTING SYSTEM(Cont.):

The input is split and fed to workers in the map phase.

Intermediate files generated in the map phase are shuffled and sorted by the system and fed to workers in the reduce phase.

Final results are computed by multiple reducers and written to disk.
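The data flow just described (split, map, shuffle/sort, reduce) can be sketched as a single-process Python simulation. This is illustrative only; real Hadoop distributes each step across worker nodes, and all names here are assumptions.

```python
from collections import defaultdict
from itertools import chain

def split_input(text, n_workers):
    """Splitter: divide the input into roughly equal splits."""
    words = text.split()
    size = max(1, len(words) // n_workers)
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def run_job(text, n_workers=2):
    splits = split_input(text, n_workers)
    # Map phase: each "worker" emits intermediate (key, value) pairs.
    intermediate = chain.from_iterable(
        [(w, 1) for w in split.split()] for split in splits)
    # Shuffle/sort: the system groups intermediate values by key.
    grouped = defaultdict(list)
    for key, value in intermediate:
        grouped[key].append(value)
    # Reduce phase: each reducer sums the values for one key.
    return {key: sum(values) for key, values in sorted(grouped.items())}

print(run_job("big data big cache"))
```

Note that nothing in this flow retains the intermediate files after the job finishes; that is exactly the waste Dache targets.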

Page 6: Dache: A Data Aware Caching for Big-Data using Map Reduce framework

PROPOSED SYSTEM

Dache: a data-aware cache system for big-data applications using the MapReduce framework.

Dache's aim: extending the MapReduce framework and provisioning a cache layer for efficiently identifying and accessing cache items in a MapReduce job.

Page 7: Dache: A Data Aware Caching for Big-Data using Map Reduce framework

PROPOSED SYSTEM (Cont.):

Identifies the source input from which a cache item is obtained, and the operations applied on that input, so that a cache item produced by the workers in the map phase is indexed properly.

A partition operation is applied in the map phase.

Page 8: Dache: A Data Aware Caching for Big-Data using Map Reduce framework

MAP REDUCE ARCHITECTURE

Job Client: submits jobs.

Job Tracker: coordinates jobs.

Task Tracker: executes job tasks.

Page 9: Dache: A Data Aware Caching for Big-Data using Map Reduce framework

MAP REDUCE ARCHITECTURE

1. Clients submit jobs to the Job Tracker.
2. The Job Tracker talks to the name node.
3. The Job Tracker creates an execution plan.
4. The Job Tracker submits work to Task Trackers.
5. Task Trackers report progress via heartbeats.
6. The Job Tracker manages the phases.
7. The Job Tracker updates the status.

Page 10: Dache: A Data Aware Caching for Big-Data using Map Reduce framework

MAP PHASE CACHE DESCRIPTION:

A cache item is a piece of cached data stored in the Distributed File System (DFS).

The content of a cache item is described by the original data and the operations applied to it, as a 2-tuple: {Origin, Operation}.

Origin: the name of a file in the DFS.
Operation: a linear list of operations performed on the origin file.
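The {Origin, Operation} description could be modeled as follows. This is a sketch of the idea only; the field names and the key format are my assumptions, not Dache's actual on-disk format.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CacheDescription:
    origin: str                    # name of the source file in the DFS
    operations: tuple = ()         # linear list of operations applied, in order

    def key(self):
        """Index key: the origin plus the ordered operation list, so cache
        items produced in the map phase can be looked up unambiguously."""
        return self.origin + "|" + ",".join(self.operations)

# Hypothetical example: a split that was filtered and then counted.
item = CacheDescription("/dfs/input/part-0001", ("filter:a", "count"))
print(item.key())  # /dfs/input/part-0001|filter:a,count
```

Because the key encodes both the input file and the exact operation sequence, two jobs that apply the same operations to the same origin resolve to the same cache item.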

Page 11: Dache: A Data Aware Caching for Big-Data using Map Reduce framework

REDUCE PHASE CACHE DESCRIPTION:

The input for the reduce phase is also a list of key-value pairs, where the value could be a list of values.

The original input and the applied operations are required.

The original input is obtained by storing the intermediate results of the map phase in the DFS.

Page 12: Dache: A Data Aware Caching for Big-Data using Map Reduce framework

Job Types and Cache Organization Relation:

When processing each file split, the cache manager reports the previous file splitting scheme used in its cache item.

Page 13: Dache: A Data Aware Caching for Big-Data using Map Reduce framework

Job Types and Cache Organization Relation:

To find words starting with 'ab', we use the cached results for words starting with 'a', and also add the new result to the cache.

Find the best match among overlapped results (choose 'ab' instead of 'a').
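The best-match rule above can be sketched as choosing the longest cached prefix that the query extends. A minimal illustration, with names of my own choosing:

```python
def best_cached_prefix(query_prefix, cached_prefixes):
    """Return the longest cached prefix that the query extends, or None.

    A longer matching prefix means less leftover work: reusing the
    cache item for 'ab' beats re-filtering the larger result for 'a'.
    """
    matches = [p for p in cached_prefixes if query_prefix.startswith(p)]
    return max(matches, key=len) if matches else None

cache = {"a", "ab", "b"}
print(best_cached_prefix("ab", cache))   # chooses 'ab' instead of 'a'
print(best_cached_prefix("abc", cache))  # 'ab' is still the best match
```

After the 'ab' job runs, its result would itself be added to the cache, so a later query for 'abc' starts from the even smaller 'ab' result set.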

Page 14: Dache: A Data Aware Caching for Big-Data using Map Reduce framework

CACHE REQUEST AND REPLY:

MAP CACHE:

Cache requests must be sent out before the file-splitting phase.

The Job Tracker issues cache requests to the cache manager.

The cache manager replies with a list of cache descriptions.

Page 15: Dache: A Data Aware Caching for Big-Data using Map Reduce framework

CACHE REQUEST AND REPLY:

REDUCE CACHE:

First, the requested cache item is compared with the cached items in the cache manager's database.

The cache manager identifies the overlaps between the original input files of the requested cache item and the stored cache items.

A linear scan is used here.
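The linear scan just described might look like the following. This is a sketch under assumed data layouts (cache items as name/input-file pairs), not Dache's actual implementation.

```python
def find_overlaps(requested_inputs, stored_items):
    """Linear scan over every stored cache item, reporting the original
    input files each one shares with the requested cache item."""
    requested = set(requested_inputs)
    overlaps = []
    for name, inputs in stored_items:   # one pass; no index is consulted
        common = requested & set(inputs)
        if common:
            overlaps.append((name, sorted(common)))
    return overlaps

stored = [
    ("item-1", ["split-1", "split-2"]),
    ("item-2", ["split-3"]),
    ("item-3", ["split-4"]),
]
print(find_overlaps(["split-2", "split-3"], stored))
```

A linear scan costs time proportional to the number of stored items, which is acceptable while the cache manager's database stays modest in size.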

Page 16: Dache: A Data Aware Caching for Big-Data using Map Reduce framework

LIFETIME MANAGEMENT OF CACHE ITEMS

Cache Manager: determines how long a cache item can be kept in the DFS.

Two types of policies:

Fixed storage quota: Least Recently Used (LRU) eviction is employed.

Optimal utility: estimates the computation time t_s saved by caching a cache item for a given amount of time t_a.

Expense = P_storage × S_cache × t_a

Save = P_computation × R_duplicate × t_s

Here P_storage and P_computation are the prices of storage and computation, S_cache is the size of the cache item, and R_duplicate is the rate of duplicate computations; the item is worth keeping while Save exceeds Expense.
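The optimal-utility comparison above can be worked through numerically. The prices, sizes, and rates below are made-up illustrative values, not figures from the Dache evaluation.

```python
def expense(p_storage, s_cache, t_a):
    """Storage cost of keeping a cache item of size s_cache for time t_a."""
    return p_storage * s_cache * t_a

def save(p_computation, r_duplicate, t_s):
    """Value of the computation time t_s saved, weighted by the rate of
    duplicate computations r_duplicate."""
    return p_computation * r_duplicate * t_s

def should_keep(p_storage, s_cache, t_a, p_computation, r_duplicate, t_s):
    """Keep the item while the saved computation outweighs storage cost."""
    return save(p_computation, r_duplicate, t_s) > expense(p_storage, s_cache, t_a)

# A small, frequently reused item: save = 24.0 vs. expense = 2.0 -> keep.
print(should_keep(0.1, 2.0, 10.0, 1.0, 0.8, 30.0))  # True
```

Under this policy a large item that is rarely reused (low R_duplicate) is evicted early, while a small item that short-circuits many duplicate tasks stays cached.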

Page 17: Dache: A Data Aware Caching for Big-Data using Map Reduce framework

RESULT AND DISCUSSION:

Graph: CPU utilization of Hadoop and Dache in the two programs, the Tera-sort program and the Word-count program.

Page 18: Dache: A Data Aware Caching for Big-Data using Map Reduce framework

RESULT AND DISCUSSION (Cont.):

Graph: completion time for the two programs using Dache and Hadoop.

Page 19: Dache: A Data Aware Caching for Big-Data using Map Reduce framework

RESULT AND DISCUSSION (Cont.):

Graph: total cache size in GB for the two programs using Dache and Hadoop.

Page 20: Dache: A Data Aware Caching for Big-Data using Map Reduce framework

CONCLUSION:

Requires minimal change to the original MapReduce programming model.

Application code requires only slight changes in order to utilize Dache.

Dache is implemented in Hadoop by extending the relevant components.

Testbed experiments show that it can eliminate all the duplicate tasks in incremental MapReduce jobs.

Reduced execution time and CPU utilization.

Page 21: Dache: A Data Aware Caching for Big-Data using Map Reduce framework

REFERENCES:

J. Dean and S. Ghemawat, MapReduce: Simplified data processing on large clusters, Communications of the ACM, vol. 51, no. 1, pp. 107-113, 2008.

Hadoop, http://Hadoop.apache.org, 2013.

Cache algorithms, http://en.wikipedia.org/wiki/Cache_algorithms, 2013.

Java programming language, http://www.java.com/, 2013.

Google Compute Engine, http://cloud.google.com/products/computeengine.html, 2013.

Page 22: Dache: A Data Aware Caching for Big-Data using Map Reduce framework

Any Questions???