Investigating Distributed Caching Mechanisms for Hadoop Gurmeet Singh Puneet Chandra Rashid Tahir

Upload: brayan-drury

Post on 31-Mar-2015

Page 1: Investigating Distributed Caching Mechanisms for Hadoop Gurmeet Singh Puneet Chandra Rashid Tahir

Investigating Distributed Caching Mechanisms for Hadoop

Gurmeet Singh

Puneet Chandra

Rashid Tahir

GOAL

• Explore the feasibility of a distributed caching mechanism inside Hadoop

Presentation Overview

• Motivation

• Design

• Experimental Results

• Future Work

Motivation

• Disk Access Times are a bottleneck in cluster computing

• A large amount of data is read from disk

• DARE

• RAMClouds

• PACMan – Coordinated Cache Replacement

We want to strike a balance between RAM and Disk Storage

Our Approach

• Integrate Memcached with Hadoop

• Used Quickcached and Spymemcached

• Reserve a portion of the main memory at each node to serve as local cache

• Local caches aggregate to abstract a distributed caching mechanism governed by Memcached

• Greedy caching strategy

• Least Recently Used (LRU) cache eviction policy
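The greedy strategy and LRU eviction listed above can be illustrated with a minimal sketch of one node's local cache. The class name `LocalBlockCache`, the block IDs, and a capacity counted in blocks are assumptions made for illustration; the slides do not show the actual implementation, which relied on Memcached.

```python
from collections import OrderedDict


class LocalBlockCache:
    """Hypothetical per-node cache: greedy insertion, LRU eviction."""

    def __init__(self, capacity_blocks):
        self.capacity = capacity_blocks
        self._blocks = OrderedDict()  # block_id -> data, least recent first

    def get(self, block_id):
        """Return cached data or None; a hit marks the block as recently used."""
        if block_id not in self._blocks:
            return None
        self._blocks.move_to_end(block_id)
        return self._blocks[block_id]

    def put(self, block_id, data):
        """Greedy strategy: cache every block that is read, evicting the LRU one."""
        self._blocks[block_id] = data
        self._blocks.move_to_end(block_id)
        if len(self._blocks) > self.capacity:
            self._blocks.popitem(last=False)  # drop the least recently used block


cache = LocalBlockCache(capacity_blocks=2)
cache.put("blk_1", b"a")
cache.put("blk_2", b"b")
cache.get("blk_1")        # blk_1 becomes the most recently used block
cache.put("blk_3", b"c")  # cache is full, so blk_2 is evicted
```

Here the cache admits every block it sees and only decides what to drop, which is what makes the policy greedy.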

Design Overview

Memcached

Design Choice 1

• Simultaneous requests to Namenode and Memcached

Minimizes access latency at the cost of additional network overhead
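A rough sketch of this first option, assuming two in-memory dictionaries stand in for the cache and the Namenode. The function names and the block and node identifiers are hypothetical and are not Hadoop or Memcached APIs.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the two services (assumptions, not Hadoop APIs).
CACHE = {"blk_1": "node-a"}                  # block -> node caching it
NAMENODE = {"blk_1": ["node-b", "node-c"],   # block -> replica locations
            "blk_2": ["node-c", "node-d"]}


def query_cache(block_id):
    return CACHE.get(block_id)


def query_namenode(block_id):
    return NAMENODE[block_id]


def locate_block(block_id):
    """Issue both lookups at once; prefer the cached copy when there is one."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        cache_future = pool.submit(query_cache, block_id)
        namenode_future = pool.submit(query_namenode, block_id)
        cached_at = cache_future.result()
        replicas = namenode_future.result()
    if cached_at is not None:
        return ("cache", cached_at)
    return ("disk", replicas[0])
```

The Namenode request is sent even when the cache answers, which is the extra network traffic this option accepts in exchange for not waiting on a miss.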

Design Choice 2

• Send request to Namenode only in the case of a cache miss

Minimizes network overhead at the cost of increased latency
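The sequential variant might look like the sketch below. The dictionaries, identifiers, and the `namenode_requests` counter are illustrative assumptions, used only to make the saved traffic visible.

```python
# Hypothetical stand-ins (assumptions, not Hadoop APIs).
CACHE = {"blk_1": "node-a"}
NAMENODE = {"blk_1": ["node-b"], "blk_2": ["node-c"]}
namenode_requests = 0


def locate_block(block_id):
    """Ask the cache first; contact the Namenode only on a miss."""
    global namenode_requests
    cached_at = CACHE.get(block_id)
    if cached_at is not None:
        return ("cache", cached_at)
    namenode_requests += 1  # second, sequential round trip
    return ("disk", NAMENODE[block_id][0])


locate_block("blk_1")  # hit: no Namenode traffic
locate_block("blk_2")  # miss: one Namenode request
```

A miss now pays for two round trips in a row, which is where the added latency comes from.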

Design Choice 3

• Datanodes send requests only to Memcached

• Memcached checks for cached blocks

• If a cache miss occurs, it contacts the Namenode and returns the replicas’ addresses to the Datanodes
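In this third option the caching layer is the only service the Datanodes talk to. A minimal sketch follows, assuming a hypothetical `CacheFrontEnd` class and a callable that stands in for the Namenode lookup; neither is part of Memcached or Hadoop.

```python
class CacheFrontEnd:
    """Hypothetical caching layer that also fronts the Namenode on a miss."""

    def __init__(self, namenode_lookup):
        self._cached = {}  # block_id -> node holding a cached copy
        self._namenode_lookup = namenode_lookup

    def mark_cached(self, block_id, node):
        self._cached[block_id] = node

    def locate(self, block_id):
        """Single entry point for Datanodes: returns a list of addresses."""
        if block_id in self._cached:
            return [self._cached[block_id]]
        return self._namenode_lookup(block_id)  # miss: forward to the Namenode


replica_table = {"blk_7": ["node-b", "node-c"]}  # made-up replica locations
front_end = CacheFrontEnd(replica_table.__getitem__)
```

Calling `front_end.locate("blk_7")` returns the replica list until the block is marked as cached, after which it returns the caching node instead.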

Global Cache Replacement

• LRU based Global Cache Eviction Scheme
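The slide gives no detail beyond the policy name, so the following is only one possible reading: a coordinator that tracks recency across all local caches and picks the eviction victim cluster-wide rather than per node. The `GlobalLRU` class and its `touch` method are hypothetical.

```python
from collections import OrderedDict


class GlobalLRU:
    """Hypothetical coordinator tracking recency across all local caches."""

    def __init__(self, total_capacity_blocks):
        self.capacity = total_capacity_blocks
        self._entries = OrderedDict()  # block_id -> node, least recent first

    def touch(self, block_id, node):
        """Record an access; return the (block_id, node) evicted, if any."""
        self._entries[block_id] = node
        self._entries.move_to_end(block_id)
        if len(self._entries) > self.capacity:
            return self._entries.popitem(last=False)
        return None


lru = GlobalLRU(total_capacity_blocks=2)
lru.touch("blk_1", "node-a")
lru.touch("blk_2", "node-b")
lru.touch("blk_1", "node-a")            # refresh blk_1
evicted = lru.touch("blk_3", "node-a")  # victim is chosen across all nodes
```

In this run the victim lives on `node-b` even though the new block arrives on `node-a`, which is the difference from purely local eviction.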

Prefetching

Simulation Results

• Test data ranging from 2 GB to 24 GB

• Word Count and Grep

Word Count

[Figure: Network Overhead vs Cache Size. x-axis: Cache Size (GB); y-axis: % Overhead]

Word Count

[Figure: Hit Ratio vs Cache Size. x-axis: Cache Size (GB); y-axis: Cache Hit Ratio]
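For readers who want to see how a hit ratio depends on cache size, the sketch below replays an access trace through an LRU cache and reports hits divided by total accesses. The function name and the trace are made up for illustration and do not reproduce the measurements in the figure.

```python
from collections import OrderedDict


def lru_hit_ratio(trace, capacity_blocks):
    """Replay a block-access trace through an LRU cache; return hits / accesses."""
    cache = OrderedDict()
    hits = 0
    for block_id in trace:
        if block_id in cache:
            hits += 1
            cache.move_to_end(block_id)
        else:
            cache[block_id] = True
            if len(cache) > capacity_blocks:
                cache.popitem(last=False)  # evict the least recently used block
    return hits / len(trace) if trace else 0.0


# Made-up trace: three blocks scanned twice, as a repeated job might do.
trace = ["b1", "b2", "b3", "b1", "b2", "b3"]
print(lru_hit_ratio(trace, 2))  # cache smaller than the working set
print(lru_hit_ratio(trace, 3))  # cache fits the working set
```

With this trace a two-block cache never hits, because each block is evicted just before it is needed again, while a three-block cache hits on the entire second scan.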

Grep

[Figure: Network Overhead vs Cache Size. x-axis: Cache Size (GB); y-axis: % Overhead]

Grep

[Figure: Hit Ratio vs Cache Size. x-axis: Cache Size (GB); y-axis: Hit Ratio]

Future Work

• Implement a pre-fetching mechanism

• Customized caching policies based on access patterns

• Compare and contrast caching with locality-aware scheduling

Conclusion

• Caching can improve the performance of cluster-based systems, depending on the access patterns of the workload being executed