Dache: A Data Aware Caching for Big-Data Applications Using the MapReduce Framework

TRANSCRIPT

Page 1: Dache: A Data Aware Caching for Big-Data Applications Using the MapReduce Framework

Presented by:

Jithin Raveendran, S7 IT, Roll No : 31

Guided by:

Prof. Remesh Babu

1

Page 2: Dache: A Data Aware Caching for Big-Data Applications Using the MapReduce Framework

Buzzword "big data": large-scale distributed data processing applications that operate on exceptionally large amounts of data.

Around 2.5 exabytes of data are created every day, so much that 90% of the data in the world today has been created in the last two years alone.

Big Data???

2

Page 3: Dache: A Data Aware Caching for Big-Data Applications Using the MapReduce Framework

3

Page 4: Dache: A Data Aware Caching for Big-Data Applications Using the MapReduce Framework

Hadoop: an open-source software framework from Apache for distributed processing of large data sets across clusters of commodity servers.

Designed to scale up from a single server to thousands of machines, with a very high degree of fault tolerance.

Inspired by

Google MapReduce

GFS (Google File System)

Case study with Hadoop MapReduce

Hadoop:

HDFS

Map/Reduce

4

Page 5: Dache: A Data Aware Caching for Big-Data Applications Using the MapReduce Framework

Apache Hadoop has two pillars:

• HDFS
• Self-healing
• High-bandwidth clustered storage

• MapReduce
• Retrieval system
• The Mapper function tells the cluster which data points we want to retrieve
• The Reducer function then takes all that data and aggregates it (see the word-count sketch after this slide)

5
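To make the Mapper/Reducer split above concrete, here is a minimal word-count example written against the standard Hadoop org.apache.hadoop.mapreduce API. This is the canonical Hadoop tutorial pattern, not code taken from the Dache paper; the class names are illustrative.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: tells the cluster which data points to retrieve (here: every word).
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);   // emit (word, 1) as intermediate data
        }
    }
}

// Reducer: takes all values for a key and aggregates them (here: a sum).
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
```

Each map task emits (word, 1) pairs as intermediate data; in Dache, intermediate results of exactly this kind become the candidates for caching.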

Page 6: Dache: A Data Aware Caching for Big-Data Applications Using the MapReduce Framework

HDFS - Architecture

6

Page 7: Dache: A Data Aware Caching for Big-Data Applications Using the MapReduce Framework

NameNode: the centerpiece of an HDFS file system.

Client applications talk to the NameNode whenever they wish to locate a file, or when they want to add/copy/move/delete a file.

It responds to successful requests by returning a list of relevant DataNode servers where the data lives.

DataNode:

Stores data in the Hadoop File System.

A functional file system has more than one DataNode, with data replicated across them.

HDFS - Architecture

7
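As a small illustration of the client/NameNode dialogue described above, the sketch below uses the standard Hadoop FileSystem API to ask where the blocks of a file live; the file path is a placeholder, and the snippet is not part of Dache itself.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocator {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);      // client handle that talks to the NameNode

        Path file = new Path("/data/input.txt");   // placeholder HDFS path
        FileStatus status = fs.getFileStatus(file);

        // The NameNode answers with the DataNodes that hold each block of the file.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset " + block.getOffset()
                    + " hosts " + String.join(",", block.getHosts()));
        }
    }
}
```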

Page 8: Dache: A Data Aware Caching for Big-Data Applications Using the MapReduce Framework

Secondary Name node:

Acts as a checkpoint for the NameNode.

Takes snapshots of the NameNode's state, which are used whenever a backup is needed.

HDFS Features:

Rack awareness

Reliable storage

High throughput

HDFS - Architecture

8

Page 9: Dache: A Data Aware Caching for Big-Data Applications Using the MapReduce Framework

MapReduce Architecture

• Job Client: submits jobs

• Job Tracker: coordinates jobs

• Task Tracker: executes job tasks

9

Page 10: Dache: A Data Aware Caching for Big-Data Applications Using the MapReduce Framework

1. Clients submit jobs to the Job Tracker

2. The Job Tracker talks to the NameNode

3. The Job Tracker creates an execution plan

4. The Job Tracker submits work to the Task Trackers

5. Task Trackers report progress via heartbeats

6. The Job Tracker manages the phases

7. The Job Tracker updates the status

(A minimal driver for step 1 is sketched after this slide.)

MapReduce Architecture

10
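A minimal driver corresponding to step 1 of the flow above, using the standard Hadoop Job API. It assumes the illustrative WordCountMapper and WordCountReducer classes from the earlier sketch and takes the HDFS input and output paths as command-line arguments.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");   // step 1: the client creates a job
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input data in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory in HDFS

        // Submits the job to the framework (steps 2-7 happen inside Hadoop)
        // and waits for completion, printing progress along the way.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```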

Page 11: Dache: A Data Aware Caching for Big-Data Applications Using the MapReduce Framework

MapReduce provides a standardized framework for large-scale distributed data processing.

Limitation

Inefficiency in incremental processing.

Current System :

11

Page 12: Dache: A Data Aware Caching for Big-Data Applications Using the MapReduce Framework

Dache - a data aware cache system for big-data applications using the MapReduce framework.

Dache's aim: extending the MapReduce framework and provisioning a cache layer for efficiently identifying and accessing cache items in a MapReduce job.

Proposed System

12

Page 13: Dache: A Data Aware Caching for Big-Data Applications Using the MapReduce Framework

Google Bigtable - handles incremental processing

Google Percolator - incremental processing platform

RAMCloud - distributed computing platform that keeps data in RAM

Related Work

13

Page 14: Dache: A Data Aware Caching for Big-Data Applications Using the MapReduce Framework

Cache description scheme:

Data-aware caching requires each data object to be indexed by its content.

Providing a customizable indexing mechanism that enables applications to describe their operations and the content of their generated partial results is a nontrivial task.

Cache request and reply protocol:

The size of the aggregated intermediate data can be very large. When such data is requested by other worker nodes, determining how to transport it becomes complex.

Technical challenges that need to be addressed

14

Page 15: Dache: A Data Aware Caching for Big-Data Applications Using the MapReduce Framework

Map phase cache description scheme

Cache refers to the intermediate data produced by worker nodes/processes during the execution of a MapReduce task.

A piece of cached data is stored in the Distributed File System (DFS).

Content of a cache item is described by the original data and the operations applied.

2-tuple: {Origin, Operation}

Origin : Name of a file in the DFS.

Operation : Linear list of available operations performed on the Origin file

Cache Description

15
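The {Origin, Operation} 2-tuple can be pictured as a small value class. The sketch below is only an illustration of the description scheme; the class and method names are hypothetical and are not taken from Dache's implementation.

```java
import java.util.List;
import java.util.Objects;

// Hypothetical sketch of the {Origin, Operation} 2-tuple described above.
public final class CacheDescription {
    private final String origin;            // name of the source file in the DFS
    private final List<String> operations;  // linear list of operations applied to it

    public CacheDescription(String origin, List<String> operations) {
        this.origin = origin;
        this.operations = List.copyOf(operations);
    }

    public String getOrigin()           { return origin; }
    public List<String> getOperations() { return operations; }

    // Two cache items describe the same partial result only when both the
    // origin file and the full operation list match.
    @Override
    public boolean equals(Object o) {
        if (!(o instanceof CacheDescription)) return false;
        CacheDescription other = (CacheDescription) o;
        return origin.equals(other.origin) && operations.equals(other.operations);
    }

    @Override
    public int hashCode() { return Objects.hash(origin, operations); }
}
```

Making equality depend on both fields is what lets a cache manager detect exact duplicate work before a task is re-executed.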

Page 16: Dache: A Data Aware Caching for Big-Data Applications Using the MapReduce Framework

Reduce phase cache description scheme

The input for the reduce phase is also a list of key-value pairs, where the value could be a list of values.

The original input and the applied operations are required.

The original input is obtained by storing the intermediate results of the map phase in the DFS.

Cache Description

16

Page 17: Dache: A Data Aware Caching for Big-Data Applications Using the MapReduce Framework

Protocol: Relationship between job types and cache organization

• When processing each file split, the cache manager reports the previous file splitting scheme used in its cache item.

17

Page 18: Dache: A Data Aware Caching for Big-Data Applications Using the MapReduce Framework

Protocol: Relationship between job types and cache organization

To find words starting with 'ab', we use the cached results for words starting with 'a', and also add the new result to the cache.

Find the best match among overlapping results [choose 'ab' instead of 'a'].

18
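The rule of choosing 'ab' over 'a' amounts to picking, among the reusable cached operations, the one that covers the most work. Below is a hypothetical longest-prefix selection, written only to illustrate the idea; it is not Dache's actual matching code.

```java
import java.util.List;
import java.util.Optional;

public final class BestCacheMatch {
    // Among cached operations, pick the one that is the longest prefix of the
    // requested operation, e.g. prefer a cached "ab" scan over a cached "a"
    // scan when results for "ab" are requested.
    public static Optional<String> bestMatch(String requested, List<String> cachedOps) {
        return cachedOps.stream()
                .filter(requested::startsWith)  // cached result is reusable for the request
                .max((x, y) -> Integer.compare(x.length(), y.length()));
    }

    public static void main(String[] args) {
        List<String> cached = List.of("a", "ab");
        System.out.println(bestMatch("ab", cached));   // Optional[ab]
        System.out.println(bestMatch("abc", cached));  // Optional[ab]: refine it, then re-cache
    }
}
```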

Page 19: Dache: A Data Aware Caching for Big-Data Applications Using the MapReduce Framework

Cache item submission

Mapper and reducer nodes/processes record cache items into their local storage space.

A cache item should be put on the same machine as the worker process that generates it.

A worker node/process contacts the cache manager each time before it begins processing an input data file.

The worker process receives a tentative description and fetches the cache item.

Protocol

19
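The submission and lookup steps above suggest a simple worker-side loop: ask the cache manager first, reuse a hit, otherwise compute and publish the new item. The sketch below is hypothetical; the interface, method names, and the reuse of the CacheDescription class from the earlier sketch are all assumptions for illustration, not the paper's actual RPC protocol.

```java
import java.util.Optional;

// Hypothetical RPC-style view of the cache manager.
interface CacheManagerClient {
    // Ask whether a cache item exists for the given description; the manager
    // answers with a tentative description of the best matching item, if any.
    Optional<CacheDescription> lookup(CacheDescription requested);

    // Report a newly produced cache item; the data itself stays in the DFS,
    // preferably on the same machine as the worker that generated it.
    void submit(CacheDescription produced, String dfsPathOfResults);
}

final class CachingWorker {
    private final CacheManagerClient cacheManager;

    CachingWorker(CacheManagerClient cacheManager) {
        this.cacheManager = cacheManager;
    }

    // Called before the worker starts processing an input data file.
    String process(CacheDescription wanted, String inputSplitPath) {
        Optional<CacheDescription> hit = cacheManager.lookup(wanted);
        if (hit.isPresent()) {
            return fetchFromDfs(hit.get());           // reuse the cached partial result
        }
        String resultPath = runTask(inputSplitPath);  // no hit: compute as usual
        cacheManager.submit(wanted, resultPath);      // and publish the new cache item
        return resultPath;
    }

    // Stubs standing in for DFS access and task execution in this sketch.
    private String fetchFromDfs(CacheDescription d) { return "dfs://" + d.getOrigin(); }
    private String runTask(String split)            { return split + ".out"; }
}
```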

Page 20: Dache: A Data Aware Caching for Big-Data Applications Using the MapReduce Framework

The cache manager determines how long a cache item can be kept in the DFS.

There are two types of policies for determining the lifetime of a cache item:

Lifetime management of cache item

1. Fixed storage quota
• Least Recently Used (LRU) eviction is employed.

2. Optimal utility
• Estimates the saved computation time, ts, obtained by caching a cache item for a given amount of time, ta.
• ts and ta are used to derive the monetary gain and cost (see the sketch after this slide).

20
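A hedged reading of the optimal-utility policy is a simple gain-versus-cost comparison. The price parameters and the exact formulas below are assumptions made for illustration; the paper derives the monetary gain and cost from ts and ta, but its concrete pricing model is not reproduced here.

```java
// Hypothetical utility check for keeping a cache item in the DFS.
public final class CacheLifetimePolicy {
    private final double computePricePerSecond;      // value of CPU time saved ($/s)
    private final double storagePricePerByteSecond;  // cost of keeping data ($/(B*s))

    public CacheLifetimePolicy(double computePricePerSecond,
                               double storagePricePerByteSecond) {
        this.computePricePerSecond = computePricePerSecond;
        this.storagePricePerByteSecond = storagePricePerByteSecond;
    }

    // ts: estimated computation time saved by the item, ta: how long the item
    // would be kept, sizeBytes: its size in the DFS.
    public boolean worthKeeping(double ts, double ta, long sizeBytes) {
        double monetaryGain = computePricePerSecond * ts;
        double monetaryCost = storagePricePerByteSecond * sizeBytes * ta;
        return monetaryGain > monetaryCost;  // keep only while the gain exceeds the cost
    }
}
```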

Page 21: Dache: A Data Aware Caching for Big-Data Applications Using the MapReduce Framework

Map Cache: Cache requests must be sent out before the file splitting phase.

Job tracker issues cache requests to the cache manager.

The cache manager replies with a list of cache descriptions.

Cache request and reply

Reduce Cache:
• First, compare the requested cache item with the cached items in the cache manager's database.
• The cache manager identifies the overlaps between the original input files of the requested cache item and the stored cache items.
• A linear scan is used here (see the sketch after this slide).

21
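The linear scan described above can be pictured as follows, modelling each cache item's origin as a set of input file names. This is an illustrative sketch with assumed names, not Dache's actual matching code.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the linear scan: for a reduce-side request, find the
// stored cache item whose original input files overlap most with the request.
public final class ReduceCacheIndex {
    public static Set<String> bestOverlap(Set<String> requestedInputs,
                                          List<Set<String>> storedInputSets) {
        Set<String> best = Set.of();
        for (Set<String> stored : storedInputSets) {  // linear scan over the database
            Set<String> overlap = new HashSet<>(stored);
            overlap.retainAll(requestedInputs);       // input files shared by both items
            if (overlap.size() > best.size()) {
                best = overlap;
            }
        }
        return best;  // an empty set means no reusable reduce-phase cache item
    }
}
```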

Page 22: Dache: A Data Aware Caching for Big-Data Applications Using the MapReduce Framework

Implementation

Extend Hadoop to implement Dache by changing the components that are open to application developers.

The cache manager is implemented as an independent server.

Performance Evaluation

22

Page 23: Dache: A Data Aware Caching for Big-Data Applications Using the MapReduce Framework

Hadoop is run in pseudo-distributed mode on a server with:

an 8-core CPU, each core running at 3 GHz,

16 GB of memory, and

a SATA disk.

Two applications are used to benchmark the speedup of Dache over Hadoop:

word-count and tera-sort.

Experiment settings

23

Page 24: Dache: A Data Aware Caching for Big-Data Applications Using the MapReduce Framework

Results

24

Page 25: Dache: A Data Aware Caching for Big-Data Applications Using the MapReduce Framework

Results

25

Page 26: Dache: A Data Aware Caching for Big-Data Applications Using the MapReduce Framework

Results

26

Page 27: Dache: A Data Aware Caching for Big-Data Applications Using the MapReduce Framework

Dache requires minimal change to the original MapReduce programming model.

Application code only requires slight changes in order to utilize Dache.

Dache is implemented in Hadoop by extending the relevant components.

Testbed experiments show that it can eliminate all the duplicate tasks in incremental MapReduce jobs.

Execution time and CPU utilization are reduced.

Conclusion

27

Page 28: Dache: A Data Aware Caching for Big-Data Applications Using the MapReduce Framework

This scheme uses a large amount of cache space.

A better cache management system will be needed.

Future Work

28

Page 29: Dache: A Data Aware Caching for Big-Data Applications Using the MapReduce Framework

29