
Page 1: Juniper Innovation Contest

MAP REDUCE WORKLOAD ON A SWITCH

Page 2: Juniper Innovation Contest


Team Determined Cheetahs

• Hitesh Padekar - Computer Engineering
• Sambu Gopan - Software Engineering
• Sanket Desai - Software Engineering
• Gokul Chand - Computer Engineering
• Amit Borude - Software Engineering

Page 3: Juniper Innovation Contest

Background

● Hadoop clusters are deployed in data centers to handle Big Data. Hadoop is fully rack aware and manages the data on its nodes in a topology-aware fashion, but it is not aware of the overall network workload.

● Hadoop servers perform two basic functions:

○ Distributed Data Processing (Map Reduce)

○ Distributed Data Storage (HDFS)

● Hadoop consists of the following Master nodes:

○ Name Node: Oversees and coordinates the data storage function (HDFS)

○ Job Tracker: Oversees and coordinates the parallel processing of data using MapReduce functions

Page 4: Juniper Innovation Contest

Background

● The following slave nodes are deployed in large numbers:

○ Data Node: a slave daemon to the Name Node, responsible for the actual storage of data

○ Task Tracker: a slave daemon to the Job Tracker, responsible for managing local tasks

● Client machines have Hadoop installed and are responsible for loading data into the Hadoop cluster and submitting MapReduce jobs that describe how that data should be processed. The Hadoop cluster is completely unaware of the network and of the load it puts on the network in terms of bandwidth requirements and latencies.

Page 5: Juniper Innovation Contest

Problem Statement

● Because the network is unaware of the bandwidth requirements of a given Hadoop cluster, it is designed to handle peak load. This is done by over-provisioning the network, which means the network operates underutilized most of the time (a waste of network resources) and is over-provisioned only to handle occasional peak loads.

● If the Hadoop load could be measured on the networking devices, a Hadoop administrator could run the network at optimum efficiency by deploying the client machines and the jobs in a more network-efficient way.

Page 6: Juniper Innovation Contest

Problem to Solve

● Develop a program to gather information about the MapReduce workload on a switch and communicate with the Hadoop JobTracker using Hadoop RPC.

● A program running on the Top-of-Rack switches (the point where the servers connect to the network) will keep track of all the jobs that have been spawned. This is done by opening a TCP connection to the JobTracker node. Once job information is collected, the amount of data being requested or moved around can be calculated for each task, along with its load on the network.

● The program is written in Python/Java and requires a basic understanding of socket connections and of storing information in an in-memory database (see the sketch below).
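As a rough illustration of the RPC client described on this slide (a sketch, not the team's actual program): the Hadoop 1.x JobClient class opens the TCP/RPC connection to the JobTracker and exposes query methods. The host name "master" and port 54311 are assumed values; in practice they come from the cluster's mapred.job.tracker setting.

import java.net.InetSocketAddress;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapred.ClusterStatus;
import org.apache.hadoop.mapred.JobClient;

public class JobTrackerProbe {
    public static void main(String[] args) throws Exception {
        // Open the RPC (TCP) connection to the JobTracker.
        // "master:54311" is an assumed address; use the cluster's
        // mapred.job.tracker value from mapred-site.xml.
        JobClient client = new JobClient(
                new InetSocketAddress("master", 54311), new Configuration());

        // Cluster-wide view of the workload passing through the switch.
        ClusterStatus cluster = client.getClusterStatus();
        System.out.println("TaskTrackers   : " + cluster.getTaskTrackers());
        System.out.println("Running maps   : " + cluster.getMapTasks());
        System.out.println("Running reduces: " + cluster.getReduceTasks());
    }
}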

Page 7: Juniper Innovation Contest

Deliverables

● Demonstrate a list of jobs running on the TaskTracker with associated data like start time, map count, reduce count, etc.

● PowerPoint with high-level proposed solution due November 7 (for preliminary review by judges). You will receive a template closer to the deadline.

● Live or recorded demo to showcase the solution, due November 14

● Source code used for this solution (preferably on GitHub)

● References

● http://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html

Page 8: Juniper Innovation Contest


Approach

Datanodes are interconnected through a switch. The switch also connects the Datanodes with the JobTracker.

The program we have developed runs on the switch.

The switch acts as an RPC client (JobClient).

Hadoop runs the RPC server (JobTracker).

RPC clients can leverage the JobTracker's methods.

The RPC client collects information such as start time, map count, reduce count, number of jobs spawned, etc. from the JobTracker.

It stores the relevant information in an in-memory Redis DB.

A JSP/Servlet-based UI displays the results (a sketch of the collection step follows).
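A minimal sketch of that collection step, assuming the Hadoop 1.x JobClient API and the Jedis client for Redis; the "job:<id>" key pattern and field names are illustrative, not necessarily the team's actual schema.

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobStatus;

import redis.clients.jedis.Jedis;

public class WorkloadCollector {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf();
        conf.set("mapred.job.tracker", "master:54311"); // assumed JobTracker address

        JobClient client = new JobClient(conf);         // RPC client (JobClient) on the switch
        Jedis redis = new Jedis("localhost", 6379);     // in-memory Redis DB on the switch

        for (JobStatus status : client.getAllJobs()) {
            String jobId = status.getJobID().toString();

            // Map/reduce counts come from the per-job task reports.
            int mapCount = client.getMapTaskReports(status.getJobID()).length;
            int reduceCount = client.getReduceTaskReports(status.getJobID()).length;

            // One Redis hash per job; the JSP/Servlet UI reads these keys.
            String key = "job:" + jobId;
            redis.hset(key, "startTime", String.valueOf(status.getStartTime()));
            redis.hset(key, "mapCount", String.valueOf(mapCount));
            redis.hset(key, "reduceCount", String.valueOf(reduceCount));
        }
        redis.close();
    }
}

The number of jobs spawned can then be read back as the count of job:* keys in Redis.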

Page 9: Juniper Innovation Contest


Estimation of Workload On Switch

[Diagram: the Job Client on the switch opens an RPC connection to the JobTracker on the Master node (JobTracker, Namenode); Slave 1 and Slave 2 each run a TaskTracker and a DataNode.]

Page 10: Juniper Innovation Contest


Approach: Multinode Cluster Topology and Communication setup

[Diagram: multinode cluster topology: the Master node (JobTracker, Namenode) and the switch, with Slave 1 and Slave 2 each running a TaskTracker and a DataNode.]

Page 11: Juniper Innovation Contest


Approach: cont.

The program running on the switch has access to the information in the Redis DB and can provide critical information for estimating further load and making decisions.

The Hadoop administrator can also access this information through the UI.

A polling mechanism gathers data every 5 seconds (a sketch of the loop follows).
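A minimal sketch of that polling loop; collectAndStore() is a hypothetical helper standing in for the JobClient/Jedis calls sketched earlier, not code from the project.

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class WorkloadPoller {
    public static void main(String[] args) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

        // Refresh the Redis snapshot every 5 seconds, per the slide above.
        scheduler.scheduleAtFixedRate(WorkloadPoller::collectAndStore, 0, 5, TimeUnit.SECONDS);
    }

    private static void collectAndStore() {
        // Query the JobTracker over RPC and write the results to Redis
        // (see the WorkloadCollector sketch earlier).
    }
}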

Page 12: Juniper Innovation Contest


Resources Used:

• Ubuntu OS
• Apache Hadoop 1.2.1
• Redis in-memory DB
• Eclipse Luna SE
• Eclipse EE
• Cluster with 2 Datanodes and a Masternode
• GitHub

Page 13: Juniper Innovation Contest


Difficulties encountered

• Multinode cluster setup on Apache Hadoop
• Compatibility issues consumed a lot of time
• Configuring Java libraries for Hadoop
• Compiling and debugging the Java source code to be run on the switch

Page 14: Juniper Innovation Contest


Recommendation

• With the current solution we can only get cluster-level information; a better solution would be to implement and expose a new RPC method that communicates JobTracker- and TaskTracker-level info to the JobClient.

• Alternatively, a daemon/agent running in the JobTracker that triggers notifications to the program running on the switch would also be more efficient (a hypothetical sketch follows).
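A purely hypothetical sketch of the second idea: an agent running alongside the JobTracker pushes a one-line notification to the program on the switch whenever a job event occurs, instead of the switch polling. The host "switch" and port 9090 are invented for illustration.

import java.io.PrintWriter;
import java.net.Socket;

public class JobTrackerAgent {
    // Called by the (hypothetical) agent whenever a job event occurs.
    public static void notifySwitch(String jobId, String event) throws Exception {
        try (Socket socket = new Socket("switch", 9090);
             PrintWriter out = new PrintWriter(socket.getOutputStream(), true)) {
            out.println(jobId + " " + event);   // e.g. "job_201704150001_0001 STARTED"
        }
    }
}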

Page 15: Juniper Innovation Contest

Demo Screenshots

[Screenshots: on the Master node / on each Slave node]

Page 16: Juniper Innovation Contest

Start the load: sample wordcount example
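For reference, the wordcount job used here to generate load is essentially the WordCount example from the Hadoop 1.2.1 MapReduce tutorial linked in the deliverables (old org.apache.hadoop.mapred API); a sketch of it is reproduced below.

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCount {

    // Mapper: emit (word, 1) for every token in the input line.
    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            StringTokenizer tokenizer = new StringTokenizer(value.toString());
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, one);
            }
        }
    }

    // Reducer: sum the counts for each word.
    public static class Reduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(Map.class);
        conf.setReducerClass(Reduce.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);  // submits the job to the JobTracker over RPC
    }
}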

Page 17: Juniper Innovation Contest

The web service running on the switch reads info from the in-memory database, and the web client displays it on the web page (a possible servlet shape is sketched below).
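One possible shape for the page behind this screenshot, assuming the job:* Redis keys and field names used in the collector sketch above (a sketch, not the team's actual UI code).

import java.io.IOException;
import java.io.PrintWriter;
import java.util.Map;

import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

import redis.clients.jedis.Jedis;

public class WorkloadServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        resp.setContentType("text/html");
        PrintWriter out = resp.getWriter();
        try (Jedis redis = new Jedis("localhost", 6379)) {
            out.println("<table><tr><th>Job</th><th>Start</th><th>Maps</th><th>Reduces</th></tr>");
            // Render one row per job hash written by the collector.
            for (String key : redis.keys("job:*")) {
                Map<String, String> job = redis.hgetAll(key);
                out.printf("<tr><td>%s</td><td>%s</td><td>%s</td><td>%s</td></tr>%n",
                        key, job.get("startTime"), job.get("mapCount"), job.get("reduceCount"));
            }
            out.println("</table>");
        }
    }
}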

Page 18: Juniper Innovation Contest
Page 19: Juniper Innovation Contest

Thank You

Page 20: Juniper Innovation Contest