Dache: A Data Aware Caching for Big-Data Applications Using the MapReduce Framework

TRANSCRIPT

Page 1: Dache: A Data Aware Caching for Big-Data Applications Using the MapReduce Framework

Presented by:

Jithin Raveendran, S7 IT, Roll No : 31

Guided by:

Prof. Remesh Babu

1

Page 2: Dache: A Data Aware Caching for Big-Data Applications Using the MapReduce Framework

Buzzword "big data": large-scale distributed data processing applications that operate on exceptionally large amounts of data.

Around 2.5 exabytes of data are created every day, so much that 90% of the data in the world today has been created in the last two years alone.

Big Data???

2

Page 3: Dache: A Data Aware Caching for Big-Data Applications Using the MapReduce Framework

3

Page 4: Dache: A Data Aware Caching for Big-Data Applications Using the MapReduce Framework

Hadoop: an open-source software framework from Apache for distributed processing of large data sets across clusters of commodity servers.

Designed to scale up from a single server to thousands of machines, with a very high degree of fault tolerance.

Inspired by

Google MapReduce

GFS (Google File System)

Case study with Hadoop MapReduce

Hadoop:

HDFS

Map/Reduce

4

Page 5: Dache: A Data Aware Caching for Big-Data Applications Using the MapReduce Framework

Apache Hadoop has two pillars:

• HDFS
• Self-healing
• High-bandwidth clustered storage

• MapReduce
• Retrieval system
• The Mapper function tells the cluster which data points we want to retrieve
• The Reducer function then takes all that data and aggregates it (see the word-count sketch after this slide)

5
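To make the Mapper/Reducer split above concrete, here is a minimal word-count example written against the standard Hadoop org.apache.hadoop.mapreduce API. This is the canonical Hadoop tutorial pattern, not code taken from the Dache paper; the class names are illustrative.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: tells the cluster which data points to retrieve (here: every word).
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);   // emit (word, 1) as intermediate data
        }
    }
}

// Reducer: takes all values for a key and aggregates them (here: a sum).
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
```

Each map task emits (word, 1) pairs as intermediate data; in Dache, intermediate results of exactly this kind become the candidates for caching.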

Page 6: Dache: A Data Aware Caching for Big-Data Applications Using the MapReduce Framework

HDFS - Architecture

6

Page 7: Dache: A Data Aware Caching for Big-Data Applications Using the MapReduce Framework

NameNode: the centerpiece of an HDFS file system.

Client applications talk to the NameNode whenever they wish to locate a file, or when they want to add/copy/move/delete a file.

It responds to successful requests by returning a list of relevant DataNode servers where the data lives.

DataNode:

Stores data in the Hadoop File System.

A functional file system has more than one DataNode, with data replicated across them.

HDFS - Architecture

7
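As a small illustration of the client/NameNode dialogue described above, the sketch below uses the standard Hadoop FileSystem API to ask where the blocks of a file live; the file path is a placeholder, and the snippet is not part of Dache itself.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocator {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);      // client handle that talks to the NameNode

        Path file = new Path("/data/input.txt");   // placeholder HDFS path
        FileStatus status = fs.getFileStatus(file);

        // The NameNode answers with the DataNodes that hold each block of the file.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset " + block.getOffset()
                    + " hosts " + String.join(",", block.getHosts()));
        }
    }
}
```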

Page 8: Dache: A Data Aware Caching for Big-Data Applications Using the MapReduce Framework

Secondary Name node:

Acts as a checkpoint for the NameNode.

Takes snapshots of the NameNode's state, which are used whenever a backup is needed.

HDFS Features:

Rack awareness

Reliable storage

High throughput

HDFS - Architecture

8

Page 9: Dache: A Data Aware Caching for Big-Data Applications Using the MapReduce Framework

MapReduce Architecture

• Job Client: submits jobs

• Job Tracker: coordinates jobs

• Task Tracker: executes job tasks

9

Page 10: Dache: A Data Aware Caching for Big-Data Applications Using the MapReduce Framework

1. Clients submit jobs to the Job Tracker

2. The Job Tracker talks to the NameNode

3. The Job Tracker creates an execution plan

4. The Job Tracker submits work to the Task Trackers

5. Task Trackers report progress via heartbeats

6. The Job Tracker manages the phases

7. The Job Tracker updates the status

(A minimal driver for step 1 is sketched after this slide.)

MapReduce Architecture

10
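A minimal driver corresponding to step 1 of the flow above, using the standard Hadoop Job API. It assumes the illustrative WordCountMapper and WordCountReducer classes from the earlier sketch and takes the HDFS input and output paths as command-line arguments.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");   // step 1: the client creates a job
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input data in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory in HDFS

        // Submits the job to the framework (steps 2-7 happen inside Hadoop)
        // and waits for completion, printing progress along the way.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```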

Page 11: Dache: A Data Aware Caching for Big-Data Applications Using the MapReduce Framework

MapReduce provides a standardized framework for large-scale distributed data processing.

Limitation

Inefficiency in incremental processing.

Current System :

11

Page 12: Dache: A Data Aware Caching for Big-Data Applications Using the MapReduce Framework

Dache - a data aware cache system for big-data applications using the MapReduce framework.

Dache's aim: extending the MapReduce framework and provisioning a cache layer for efficiently identifying and accessing cache items in a MapReduce job.

Proposed System

12

Page 13: Dache: A Data Aware Caching for Big-Data Applications Using the MapReduce Framework

Google Bigtable - handles incremental processing

Google Percolator - incremental processing platform

RAMCloud - distributed computing platform that keeps data in RAM

Related Work

13

Page 14: Dache: A Data Aware Caching for Big-Data Applications Using the MapReduce Framework

Cache description scheme:

Data-aware caching requires each data object to be indexed by its content.

Providing a customizable indexing mechanism that enables applications to describe their operations and the content of their generated partial results is a nontrivial task.

Cache request and reply protocol:

The size of the aggregated intermediate data can be very large. When such data is requested by other worker nodes, determining how to transport it becomes complex.

Technical challenges that need to be addressed

14

Page 15: Dache: A Data Aware Caching for Big-Data Applications Using the MapReduce Framework

Map phase cache description scheme

Cache refers to the intermediate data produced by worker nodes/processes during the execution of a MapReduce task.

A piece of cached data is stored in the Distributed File System (DFS).

Content of a cache item is described by the original data and the operations applied.

2-tuple: {Origin, Operation}

Origin : Name of a file in the DFS.

Operation : Linear list of available operations performed on the Origin file

Cache Description

15
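The {Origin, Operation} 2-tuple can be pictured as a small value class. The sketch below is only an illustration of the description scheme; the class and method names are hypothetical and are not taken from Dache's implementation.

```java
import java.util.List;
import java.util.Objects;

// Hypothetical sketch of the {Origin, Operation} 2-tuple described above.
public final class CacheDescription {
    private final String origin;            // name of the source file in the DFS
    private final List<String> operations;  // linear list of operations applied to it

    public CacheDescription(String origin, List<String> operations) {
        this.origin = origin;
        this.operations = List.copyOf(operations);
    }

    public String getOrigin()           { return origin; }
    public List<String> getOperations() { return operations; }

    // Two cache items describe the same partial result only when both the
    // origin file and the full operation list match.
    @Override
    public boolean equals(Object o) {
        if (!(o instanceof CacheDescription)) return false;
        CacheDescription other = (CacheDescription) o;
        return origin.equals(other.origin) && operations.equals(other.operations);
    }

    @Override
    public int hashCode() { return Objects.hash(origin, operations); }
}
```

Making equality depend on both fields is what lets a cache manager detect exact duplicate work before a task is re-executed.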

Page 16: Dache: A Data Aware Caching for Big-Data Applications Using the MapReduce Framework

Reduce phase cache description scheme

The input for the reduce phase is also a list of key-value pairs, where the value could be a list of values.

The original input and the applied operations are required.

The original input is obtained by storing the intermediate results of the map phase in the DFS.

Cache Description

16

Page 17: Dache: A Data Aware Caching for Big-Data Applications Using the MapReduce Framework

Protocol: Relationship between job types and cache organization

• When processing each file split, the cache manager reports the previous file splitting scheme used in its cache item.

17

Page 18: Dache: A Data Aware Caching for Big-Data Applications Using the MapReduce Framework

Protocol: Relationship between job types and cache organization

To find words starting with 'ab', we use the cached results for words starting with 'a', and also add the new result to the cache.

Find the best match among overlapping results [choose 'ab' instead of 'a'].

18
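The rule of choosing 'ab' over 'a' amounts to picking, among the reusable cached operations, the one that covers the most work. Below is a hypothetical longest-prefix selection, written only to illustrate the idea; it is not Dache's actual matching code.

```java
import java.util.List;
import java.util.Optional;

public final class BestCacheMatch {
    // Among cached operations, pick the one that is the longest prefix of the
    // requested operation, e.g. prefer a cached "ab" scan over a cached "a"
    // scan when results for "ab" are requested.
    public static Optional<String> bestMatch(String requested, List<String> cachedOps) {
        return cachedOps.stream()
                .filter(requested::startsWith)  // cached result is reusable for the request
                .max((x, y) -> Integer.compare(x.length(), y.length()));
    }

    public static void main(String[] args) {
        List<String> cached = List.of("a", "ab");
        System.out.println(bestMatch("ab", cached));   // Optional[ab]
        System.out.println(bestMatch("abc", cached));  // Optional[ab]: refine it, then re-cache
    }
}
```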

Page 19: Dache: A Data Aware Caching for Big-Data Applications Using the MapReduce Framework

Cache item submission

Mapper and reducer nodes/processes record cache items into their local storage space.

A cache item should be put on the same machine as the worker process that generates it.

A worker node/process contacts the cache manager each time before it begins processing an input data file.

The worker process receives a tentative description and fetches the cache item.

Protocol

19
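The submission and lookup steps above suggest a simple worker-side loop: ask the cache manager first, reuse a hit, otherwise compute and publish the new item. The sketch below is hypothetical; the interface, method names, and the reuse of the CacheDescription class from the earlier sketch are all assumptions for illustration, not the paper's actual RPC protocol.

```java
import java.util.Optional;

// Hypothetical RPC-style view of the cache manager.
interface CacheManagerClient {
    // Ask whether a cache item exists for the given description; the manager
    // answers with a tentative description of the best matching item, if any.
    Optional<CacheDescription> lookup(CacheDescription requested);

    // Report a newly produced cache item; the data itself stays in the DFS,
    // preferably on the same machine as the worker that generated it.
    void submit(CacheDescription produced, String dfsPathOfResults);
}

final class CachingWorker {
    private final CacheManagerClient cacheManager;

    CachingWorker(CacheManagerClient cacheManager) {
        this.cacheManager = cacheManager;
    }

    // Called before the worker starts processing an input data file.
    String process(CacheDescription wanted, String inputSplitPath) {
        Optional<CacheDescription> hit = cacheManager.lookup(wanted);
        if (hit.isPresent()) {
            return fetchFromDfs(hit.get());           // reuse the cached partial result
        }
        String resultPath = runTask(inputSplitPath);  // no hit: compute as usual
        cacheManager.submit(wanted, resultPath);      // and publish the new cache item
        return resultPath;
    }

    // Stubs standing in for DFS access and task execution in this sketch.
    private String fetchFromDfs(CacheDescription d) { return "dfs://" + d.getOrigin(); }
    private String runTask(String split)            { return split + ".out"; }
}
```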

Page 20: Dache: A Data Aware Caching for Big-Data Applications Using the MapReduce Framework

The cache manager determines how long a cache item can be kept in the DFS.

There are two types of policies for determining the lifetime of a cache item:

Lifetime management of cache item

1. Fixed storage quota
• Least Recently Used (LRU) eviction is employed.

2. Optimal utility
• Estimates the saved computation time, ts, obtained by caching a cache item for a given amount of time, ta.
• ts and ta are used to derive the monetary gain and cost (see the sketch after this slide).

20
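A hedged reading of the optimal-utility policy is a simple gain-versus-cost comparison. The price parameters and the exact formulas below are assumptions made for illustration; the paper derives the monetary gain and cost from ts and ta, but its concrete pricing model is not reproduced here.

```java
// Hypothetical utility check for keeping a cache item in the DFS.
public final class CacheLifetimePolicy {
    private final double computePricePerSecond;      // value of CPU time saved ($/s)
    private final double storagePricePerByteSecond;  // cost of keeping data ($/(B*s))

    public CacheLifetimePolicy(double computePricePerSecond,
                               double storagePricePerByteSecond) {
        this.computePricePerSecond = computePricePerSecond;
        this.storagePricePerByteSecond = storagePricePerByteSecond;
    }

    // ts: estimated computation time saved by the item, ta: how long the item
    // would be kept, sizeBytes: its size in the DFS.
    public boolean worthKeeping(double ts, double ta, long sizeBytes) {
        double monetaryGain = computePricePerSecond * ts;
        double monetaryCost = storagePricePerByteSecond * sizeBytes * ta;
        return monetaryGain > monetaryCost;  // keep only while the gain exceeds the cost
    }
}
```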

Page 21: Dache: A Data Aware Caching for Big-Data Applications Using the MapReduce Framework

Map Cache: Cache requests must be sent out before the file splitting phase.

Job tracker issues cache requests to the cache manager.

The cache manager replies with a list of cache descriptions.

Cache request and reply

Reduce Cache:
• First, compare the requested cache item with the cached items in the cache manager's database.
• The cache manager identifies the overlaps between the original input files of the requested cache item and the stored cache items.
• A linear scan is used here (see the sketch after this slide).

21
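The linear scan described above can be pictured as follows, modelling each cache item's origin as a set of input file names. This is an illustrative sketch with assumed names, not Dache's actual matching code.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the linear scan: for a reduce-side request, find the
// stored cache item whose original input files overlap most with the request.
public final class ReduceCacheIndex {
    public static Set<String> bestOverlap(Set<String> requestedInputs,
                                          List<Set<String>> storedInputSets) {
        Set<String> best = Set.of();
        for (Set<String> stored : storedInputSets) {  // linear scan over the database
            Set<String> overlap = new HashSet<>(stored);
            overlap.retainAll(requestedInputs);       // input files shared by both items
            if (overlap.size() > best.size()) {
                best = overlap;
            }
        }
        return best;  // an empty set means no reusable reduce-phase cache item
    }
}
```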

Page 22: Dache: A Data Aware Caching for Big-Data Applications Using the MapReduce Framework

Implementation

Extend Hadoop to implement Dache by changing the components that are open to application developers.

The cache manager is implemented as an independent server.

Performance Evaluation

22

Page 23: Dache: A Data Aware Caching for Big-Data Applications Using the MapReduce Framework

Hadoop is run in pseudo-distributed mode on a server with:

an 8-core CPU, each core running at 3 GHz,

16 GB of memory, and

a SATA disk.

Two applications are used to benchmark the speedup of Dache over Hadoop:

word-count and tera-sort.

Experiment settings

23

Page 24: Dache: A Data Aware Caching for Big-Data Applications Using the MapReduce Framework

Results

24

Page 25: Dache: A Data Aware Caching for Big-Data Applications Using the MapReduce Framework

Results

25

Page 26: Dache: A Data Aware Caching for Big-Data Applications Using the MapReduce Framework

Results

26

Page 27: Dache: A Data Aware Caching for Big-Data Applications Using the MapReduce Framework

Dache requires minimal change to the original MapReduce programming model.

Application code only requires slight changes in order to utilize Dache.

Dache is implemented in Hadoop by extending the relevant components.

Testbed experiments show that it can eliminate all the duplicate tasks in incremental MapReduce jobs.

Execution time and CPU utilization are reduced.

Conclusion

27

Page 28: Dache: A Data Aware Caching for Big-Data Applications Using the MapReduce Framework

This scheme uses a large amount of cache space.

A better cache management system will be needed.

Future Work

28

Page 29: Dache: A Data Aware Caching for Big-Data Applications Using the MapReduce Framework

29