Juniper Innovation Contest
TRANSCRIPT
MapReduce Workload on a Switch
SJSU Washington Square
Team Determined Cheetahs
• Hitesh Padekar - Computer Engineering
• Sambu Gopan - Software Engineering
• Sanket Desai - Software Engineering
• Gokul Chand - Computer Engineering
• Amit Borude - Software Engineering
Background
● Hadoop clusters are deployed in data centers for handling Big Data. Hadoop is fully rack aware and manages the data on its nodes in a topology-aware fashion, but it is not aware of the overall network workload.
● Hadoop servers perform two basic functions:
○ Distributed Data Processing (Map Reduce)
○ Distributed Data Storage (HDFS)
● Hadoop consists of the following Master nodes:
○ Name Node: Oversees and coordinates the data store function
○ Job Tracker: Oversees and coordinates the parallel processing of data using MapReduce functions
Background: cont.
● The following slave nodes are deployed in large numbers:
○ Data Node: a slave daemon to the Name Node daemon, responsible for the actual storage of data
○ Task Tracker: a slave daemon to the Job Tracker, responsible for managing local tasks
● Client machines have Hadoop installed and are responsible for loading data into the Hadoop cluster and submitting MapReduce jobs that describe how the data should be processed. The Hadoop cluster is completely unaware of the network and of the load it puts on the network in terms of bandwidth requirements and latencies.
Problem Statement
● Since networks are unaware of the bandwidth requirements of a given Hadoop cluster, they are designed to handle peak load by over-provisioning. This means the network operates underutilized most of the time (a waste of network resources) and is over-provisioned to handle occasional peak loads.
● If the Hadoop load could be measured on the networking devices, a Hadoop administrator could run the network at optimum efficiency by deploying client machines and jobs in a more network-efficient way.
Problem to Solve
● Develop a program to gather information about MapReduce workload on a switch and communicate with the Hadoop JobTracker using Hadoop RPC.
● A program running on top-of-rack switches (the point where servers connect to the network) will keep track of all jobs that have been spawned. This is done by opening a TCP connection to the JobTracker node. Once job information is collected, the amount of data being requested or moved around can be calculated for each task, along with its load on the network.
● The program is written in Python/Java and requires a basic understanding of socket connections and of storing information in in-memory databases.
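As a sketch of the idea above: a minimal, hypothetical switch-side collector that turns per-job counters (start time, map count, reduce count, input bytes) into a rough network-load estimate. `fetch_job_stats()` and the shuffle heuristic are stand-ins for illustration, not the actual Hadoop RPC interface.

```python
# Hypothetical sketch: estimating per-job network load from the kind of
# counters the JobTracker exposes. fetch_job_stats() stands in for the
# real Hadoop RPC call and returns canned data.

def fetch_job_stats():
    """Stand-in for an RPC call to the JobTracker (hypothetical)."""
    return [
        {"job_id": "job_201411070001_0001", "start_time": 1415318400,
         "map_count": 4, "reduce_count": 2,
         "input_bytes": 256 * 1024 * 1024},
    ]

def estimate_shuffle_bytes(job):
    # Rough heuristic: map output crosses the network to the reducers
    # during the shuffle phase (assumes map output ~= map input).
    if job["reduce_count"] == 0:
        return 0
    return job["input_bytes"]

for job in fetch_job_stats():
    print(job["job_id"], "estimated shuffle bytes:",
          estimate_shuffle_bytes(job))
```

The heuristic is deliberately crude; the point is only that once job counters are available on the switch, a per-task load figure can be derived from them.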
Deliverables
● Demonstrate a list of jobs running on the TaskTracker with associated data such as start time, map count, reduce count, etc.
● PowerPoint with high-level proposed solution due November 7 (for preliminary review by judges). You will receive a template closer to the deadline.
● Live or recorded demo to showcase the solution due November 14
● Source code used for this solution (preferably on GitHub)
References
● http://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html
Approach
Datanodes are interconnected through a switch, which also connects them to the JobTracker.
The program we have developed runs on the switch.
The switch acts as an RPC client (JobClient).
Hadoop runs the RPC server (JobTracker).
RPC clients can leverage JobTracker methods.
The RPC client collects information such as start time, map count, reduce count, and number of jobs spawned from the JobTracker.
It stores the relevant information in an in-memory Redis DB.
A JSP/Servlet-based UI displays the results.
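A minimal sketch of the caching step above, assuming a Redis-hash-style layout (one hash per job id). A plain dict stands in for the real Redis connection (redis-py `hset`/`hgetall`) so the example is self-contained; field names are illustrative.

```python
# Sketch: caching the stats the RPC client collects, keyed by job id.
# A dict-backed class mimics the Redis hash operations the real program
# would use against the in-memory Redis DB.

class JobStore:
    def __init__(self):
        self._hashes = {}              # job_id -> {field: value}

    def hset(self, job_id, mapping):
        # Like Redis HSET: create or update fields of the job's hash.
        self._hashes.setdefault(job_id, {}).update(mapping)

    def hgetall(self, job_id):
        # Like Redis HGETALL: return a copy of all fields for the job.
        return dict(self._hashes.get(job_id, {}))

store = JobStore()
store.hset("job_0001", {"start_time": "1415318400",
                        "map_count": "4", "reduce_count": "2"})
print(store.hgetall("job_0001"))
```

Keeping one hash per job makes each 5-second poll a cheap overwrite, and lets the UI fetch a single job's stats without scanning the whole store.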
Estimation of Workload on Switch
[Diagram: the switch, running the JobClient, opens an RPC connection to the JobTracker on the Master node (which also runs the Namenode); Slave 1 and Slave 2 each run a TaskTracker and a DataNode.]
Approach: Multinode Cluster Topology and Communication Setup
[Diagram: the Master node (JobTracker, Namenode) and Slave 1 and Slave 2 (each running a TaskTracker and a DataNode) are interconnected through the switch.]
Approach: cont.
The program running on the switch has access to the information in the Redis DB and can provide critical information for estimating further load and making decisions.
The Hadoop administrator can also access this information through the UI.
A polling mechanism gathers data every 5 seconds.
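The polling loop can be sketched as follows. `poll_once()` is a hypothetical hook for the RPC-and-store step; the loop is bounded and the sleep is injectable so the sketch terminates quickly.

```python
# Sketch of the 5-second polling loop that refreshes the store from the
# JobTracker. poll_once() stands in for the real RPC + Redis write.

import time

POLL_INTERVAL_S = 5

def poll_once():
    # In the real program: issue Hadoop RPC calls, write results to
    # Redis. Here it just returns a timestamp marker.
    return time.time()

def run_poller(cycles, interval=POLL_INTERVAL_S, sleep=time.sleep):
    samples = []
    for _ in range(cycles):
        samples.append(poll_once())
        sleep(interval)            # wait between JobTracker polls
    return samples

# The demo runs indefinitely; 3 cycles with a no-op sleep keep this
# example fast:
print(len(run_poller(3, sleep=lambda s: None)))
```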
Resources Used:
• OS: Ubuntu
• Apache Hadoop 1.2.1
• Redis in-memory DB
• Eclipse Luna SE
• Eclipse EE
• Cluster with 2 Datanodes and a Masternode
• GitHub
Difficulties encountered
• Multinode cluster setup on Apache Hadoop
• Compatibility issues consumed a lot of time
• Configuring Java libraries for Hadoop
• Compiling and debugging the Java source code to be run on the switch
Recommendation
• With the current solution we can only get cluster-level information; a better solution would be to implement and expose a new RPC method that communicates JobTracker- and TaskTracker-level info to the JobClient.
• Alternatively, if a daemon/agent running on the JobTracker triggered notifications to the program running on the switch, that would also be more efficient.
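The push-based alternative can be sketched as a simple subscriber pattern. The agent class, event names, and callback shape are all hypothetical; no such hook exists in Hadoop 1.2.1 out of the box.

```python
# Hypothetical sketch of the push alternative: a JobTracker-side agent
# invokes registered callbacks on job events, instead of the switch
# polling every 5 seconds.

class JobTrackerAgent:
    def __init__(self):
        self._subscribers = []

    def subscribe(self, callback):
        # The switch-side program registers to be notified.
        self._subscribers.append(callback)

    def on_job_event(self, job_id, event):
        # Called by (hypothetical) JobTracker hooks; fans out to all
        # registered subscribers.
        for cb in self._subscribers:
            cb(job_id, event)

received = []
agent = JobTrackerAgent()
agent.subscribe(lambda job_id, event: received.append((job_id, event)))
agent.on_job_event("job_0001", "MAP_FINISHED")
print(received)
```

Compared with polling, the switch would learn of state changes immediately and spend no cycles on empty polls, at the cost of modifying the JobTracker.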
Demo Screenshots
[Screenshots: Hadoop daemons started on the Master node and on each Slave node]
Start the load: sample wordcount example
The web service running on the switch reads info from the in-memory database, and the web client displays it in the web page.
Thank You