Juniper Innovation Contest
TRANSCRIPT
MapReduce Workload on a Switch
SJSU Washington Square
Team Determined Cheetahs
• Hitesh Padekar - Computer Engineering
• Sambu Gopan - Software Engineering
• Sanket Desai - Software Engineering
• Gokul Chand - Computer Engineering
• Amit Borude - Software Engineering
Background
● Hadoop clusters are deployed in data centers for handling Big Data. Hadoop is fully rack aware and manages the data on its nodes in a topology-aware fashion, but it is not aware of the overall network workload.
● Hadoop servers perform two basic functions:
○ Distributed Data Processing (Map Reduce)
○ Distributed Data Storage (HDFS)
● Hadoop consists of the following Master nodes:
○ Name Node: Oversees and coordinates the data store function
○ Job Tracker: Oversees and coordinates the parallel processing of data using MapReduce functions
Background: cont.
● The following slave nodes are deployed in large numbers:
○ Data Node: a slave daemon to the Name Node daemon, responsible for the actual storage of data
○ Task Tracker: a slave daemon to the Job Tracker, responsible for managing local tasks
● Client machines have Hadoop installed and are responsible for loading data into the Hadoop cluster and submitting MapReduce jobs that describe how the data should be processed. The Hadoop cluster is completely unaware of the network and of the load it puts on the network in terms of bandwidth requirements and latencies.
Problem Statement
● Since networks are unaware of the bandwidth requirements of a given Hadoop cluster, they are designed to handle peak load by over-provisioning. This means the network operates underutilized most of the time (a waste of network resources) and is over-provisioned to handle occasional peak loads.
● If the Hadoop load could be measured on the networking devices, a Hadoop administrator could run the network at optimum efficiency by deploying client machines and jobs in a more network-efficient way.
Problem to Solve
● Develop a program to gather information about MapReduce workload on a switch and communicate with the Hadoop JobTracker using Hadoop RPC.
● A program running on top-of-rack switches (the point where servers connect to the network) will keep track of all jobs that have been spawned. This is done by opening a TCP connection to the JobTracker node. Once job information is collected, the amount of data being requested or moved around can be calculated for each task, along with its load on the network.
● The program is written in Python/Java and requires a basic understanding of socket connections and of storing information in in-memory databases.
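As a sketch of the idea above: a minimal, hypothetical switch-side collector that turns per-job counters (start time, map count, reduce count, input bytes) into a rough network-load estimate. `fetch_job_stats()` and the shuffle heuristic are stand-ins for illustration, not the actual Hadoop RPC interface.

```python
# Hypothetical sketch: estimating per-job network load from the kind of
# counters the JobTracker exposes. fetch_job_stats() stands in for the
# real Hadoop RPC call and returns canned data.

def fetch_job_stats():
    """Stand-in for an RPC call to the JobTracker (hypothetical)."""
    return [
        {"job_id": "job_201411070001_0001", "start_time": 1415318400,
         "map_count": 4, "reduce_count": 2,
         "input_bytes": 256 * 1024 * 1024},
    ]

def estimate_shuffle_bytes(job):
    # Rough heuristic: map output crosses the network to the reducers
    # during the shuffle phase (assumes map output ~= map input).
    if job["reduce_count"] == 0:
        return 0
    return job["input_bytes"]

for job in fetch_job_stats():
    print(job["job_id"], "estimated shuffle bytes:",
          estimate_shuffle_bytes(job))
```

The heuristic is deliberately crude; the point is only that once job counters are available on the switch, a per-task load figure can be derived from them.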
Deliverables
● Demonstrate a list of jobs running on the TaskTracker with associated data such as start time, map count, reduce count, etc.
● PowerPoint with high-level proposed solution due November 7 (for preliminary review by judges). You will receive a template closer to the deadline.
● Live or recorded demo to showcase the solution due November 14
● Source code used for this solution (preferably on GitHub)
References
● http://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html
Approach
Datanodes are interconnected through a switch, which also connects them to the JobTracker.
The program we have developed runs on the switch.
The switch acts as an RPC client (JobClient).
Hadoop runs the RPC server (JobTracker).
RPC clients can leverage JobTracker methods.
The RPC client collects information such as start time, map count, reduce count, and number of jobs spawned from the JobTracker.
It stores the relevant information in an in-memory Redis DB.
A JSP/Servlet-based UI displays the results.
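A minimal sketch of the caching step above, assuming a Redis-hash-style layout (one hash per job id). A plain dict stands in for the real Redis connection (redis-py `hset`/`hgetall`) so the example is self-contained; field names are illustrative.

```python
# Sketch: caching the stats the RPC client collects, keyed by job id.
# A dict-backed class mimics the Redis hash operations the real program
# would use against the in-memory Redis DB.

class JobStore:
    def __init__(self):
        self._hashes = {}              # job_id -> {field: value}

    def hset(self, job_id, mapping):
        # Like Redis HSET: create or update fields of the job's hash.
        self._hashes.setdefault(job_id, {}).update(mapping)

    def hgetall(self, job_id):
        # Like Redis HGETALL: return a copy of all fields for the job.
        return dict(self._hashes.get(job_id, {}))

store = JobStore()
store.hset("job_0001", {"start_time": "1415318400",
                        "map_count": "4", "reduce_count": "2"})
print(store.hgetall("job_0001"))
```

Keeping one hash per job makes each 5-second poll a cheap overwrite, and lets the UI fetch a single job's stats without scanning the whole store.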
Estimation of Workload on Switch
[Diagram: the switch, running the JobClient, opens an RPC connection to the JobTracker on the Master node (which also runs the Namenode); Slave 1 and Slave 2 each run a TaskTracker and a DataNode.]
Approach: Multinode Cluster Topology and Communication Setup
[Diagram: the Master node (JobTracker, Namenode) and Slave 1 and Slave 2 (each running a TaskTracker and a DataNode) are interconnected through the switch.]
Approach: cont.
The program running on the switch has access to the information in the Redis DB and can provide critical information for estimating further load and making decisions.
The Hadoop administrator can also access this information through the UI.
A polling mechanism gathers data every 5 seconds.
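The polling loop can be sketched as follows. `poll_once()` is a hypothetical hook for the RPC-and-store step; the loop is bounded and the sleep is injectable so the sketch terminates quickly.

```python
# Sketch of the 5-second polling loop that refreshes the store from the
# JobTracker. poll_once() stands in for the real RPC + Redis write.

import time

POLL_INTERVAL_S = 5

def poll_once():
    # In the real program: issue Hadoop RPC calls, write results to
    # Redis. Here it just returns a timestamp marker.
    return time.time()

def run_poller(cycles, interval=POLL_INTERVAL_S, sleep=time.sleep):
    samples = []
    for _ in range(cycles):
        samples.append(poll_once())
        sleep(interval)            # wait between JobTracker polls
    return samples

# The demo runs indefinitely; 3 cycles with a no-op sleep keep this
# example fast:
print(len(run_poller(3, sleep=lambda s: None)))
```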
Resources Used:
• OS: Ubuntu
• Apache Hadoop 1.2.1
• Redis in-memory DB
• Eclipse Luna SE
• Eclipse EE
• Cluster with 2 Datanodes and a Masternode
• GitHub
Difficulties encountered
• Multinode cluster setup on Apache Hadoop
• Compatibility issues consumed a lot of time
• Configuring Java libraries for Hadoop
• Compiling and debugging the Java source code to be run on the switch
Recommendation
• With the current solution we can only get cluster-level information; a better solution would be to implement and expose a new RPC method that communicates JobTracker- and TaskTracker-level info to the JobClient.
• Alternatively, if a daemon/agent running on the JobTracker triggered notifications to the program running on the switch, that would also be more efficient.
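The push-based alternative can be sketched as a simple subscriber pattern. The agent class, event names, and callback shape are all hypothetical; no such hook exists in Hadoop 1.2.1 out of the box.

```python
# Hypothetical sketch of the push alternative: a JobTracker-side agent
# invokes registered callbacks on job events, instead of the switch
# polling every 5 seconds.

class JobTrackerAgent:
    def __init__(self):
        self._subscribers = []

    def subscribe(self, callback):
        # The switch-side program registers to be notified.
        self._subscribers.append(callback)

    def on_job_event(self, job_id, event):
        # Called by (hypothetical) JobTracker hooks; fans out to all
        # registered subscribers.
        for cb in self._subscribers:
            cb(job_id, event)

received = []
agent = JobTrackerAgent()
agent.subscribe(lambda job_id, event: received.append((job_id, event)))
agent.on_job_event("job_0001", "MAP_FINISHED")
print(received)
```

Compared with polling, the switch would learn of state changes immediately and spend no cycles on empty polls, at the cost of modifying the JobTracker.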
Demo Screenshots
[Screenshots: Hadoop daemons started on the Master node and on each Slave node]
Start the load: sample wordcount example
The web service running on the switch reads info from the in-memory database, and the web client displays it in the web page.
Thank You