
Power the Hadoop Cluster with AWS Cloud

View Hadoop Administration Course at www.edureka.co/hadoop-admin


Objectives

At the end of this module, you will be able to discuss:

» Hadoop cluster introduction
» Recommended configuration for a cluster
» Hadoop cluster running modes
» Hadoop configuration files
» Hadoop Admin responsibilities
» Hadoop cluster set-up on AWS (Demo)


Hadoop 2.x Core Components

» HDFS (Storage)
  » Master: NameNode (plus a SecondaryNameNode for checkpointing)
  » Slave: DataNode
» YARN (Processing)
  » Master: ResourceManager
  » Slave: NodeManager


Hadoop Cluster: A Typical Use Case

» Active NameNode: RAM 64 GB; hard disk 1 TB; Xeon processor with 8 cores; Ethernet 3 × 10 Gb/s; 64-bit CentOS; redundant power supply
» Standby NameNode (optional): same specification as the Active NameNode
» Secondary NameNode: RAM 32 GB; hard disk 1 TB; Xeon processor with 4 cores; Ethernet 3 × 10 Gb/s; 64-bit CentOS; redundant power supply
» DataNodes (several): RAM 16 GB; hard disks 6 × 2 TB; Xeon processor with 2 cores; Ethernet 3 × 10 Gb/s; 64-bit CentOS


Cluster Growth Based On Storage Capacity

Planning cluster growth around storage capacity is often a good approach:

» Data grows by approximately 5 TB per week
» HDFS is set up to replicate each block three times
» Thus, 15 TB of extra storage space is required per week
» Assuming machines with 5 × 3 TB hard drives, this equates to roughly one new machine required each week
» Assume overhead of about 30% (worked through below)
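A minimal worked version of this arithmetic, treating the 30% overhead figure and the one-machine-per-week rounding as the slide's assumptions:

\[
5~\text{TB/week} \times 3~\text{(replication)} = 15~\text{TB/week of raw storage},
\qquad
5 \times 3~\text{TB} = 15~\text{TB raw capacity per node}
\]

Reserving roughly 30% of each node's disk for overhead (intermediate MapReduce output, logs, the operating system) leaves about 10.5 TB usable per node, so strictly the growth rate is closer to 1.4 new nodes per week; the slide rounds this to one machine per week on raw capacity.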


Slave Nodes: Recommended Configuration

Higher-performance versus lower-performance components is the key trade-off. Save the money, buy more nodes!

General 'base' configuration for a slave node (depends on requirements):

» 4 × 1 TB or 2 TB hard drives, in a JBOD (Just a Bunch Of Disks) configuration; do not use RAID!
» 2 × quad-core CPUs
» 24-32 GB RAM
» Gigabit Ethernet

Rule of thumb: multiples of (1 hard drive + 2 cores + 6-8 GB RAM) generally work well for many types of applications. Note that the base configuration above is exactly four such multiples: 4 drives, 8 cores and 24-32 GB RAM.

"A cluster with more nodes performs better than one with fewer, slightly faster nodes."


Slave Nodes: More Details (RAM)

» Generally, each Map or Reduce task will take 1 GB to 2 GB of RAM
» Slave nodes should not be using virtual memory (swapping badly hurts performance)
» Rule of thumb: total number of tasks = 1.5 × number of processor cores
» Ensure enough RAM is present to run all tasks, plus the DataNode and TaskTracker daemons (NodeManager in Hadoop 2.x), plus the operating system; a worked example follows below
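A quick sizing check, assuming the base slave configuration from the previous slide (2 × quad-core = 8 cores) and roughly 1 GB of heap per Hadoop daemon, which matches the usual default:

\[
\text{tasks} = 1.5 \times 8 = 12,
\qquad
\text{task RAM} = 12 \times (1~\text{to}~2~\text{GB}) = 12~\text{to}~24~\text{GB}
\]

Adding about 1 GB each for the DataNode and TaskTracker/NodeManager daemons and a couple of GB for the operating system gives roughly 16-28 GB in total, consistent with the 24-32 GB recommendation above.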


Master Node Hardware Recommendations

The master node requires carrier-class hardware (not commodity hardware):

» Dual power supplies
» Dual Ethernet cards (bonded to provide failover)
» RAIDed hard drives
» At least 32 GB of RAM


Hadoop Cluster Modes

Hadoop can run in any of the following three modes:

» Standalone (or Local) Mode: no daemons; everything runs in a single JVM; suitable for running MapReduce programs during development; has no DFS
» Pseudo-Distributed Mode: Hadoop daemons run on the local machine (a minimal configuration is sketched below)
» Fully-Distributed Mode: Hadoop daemons run on a cluster of machines
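As an illustration, a minimal sketch of the two settings that distinguish a pseudo-distributed set-up from the standalone default; the hdfs://localhost:9000 address follows the convention in the Apache Hadoop single-node guide:

core-site.xml:

<configuration>
  <!-- use a local HDFS instance instead of the local filesystem -->
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

hdfs-site.xml:

<configuration>
  <!-- a single node cannot hold three replicas of each block -->
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>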


Configuration Files

» hadoop-env.sh, yarn-env.sh: settings for the Hadoop daemons' process environment
» core-site.xml: configuration settings for Hadoop Core, such as I/O settings that are common to both HDFS and YARN
» hdfs-site.xml: configuration settings for the HDFS daemons, the NameNode and the DataNodes (example below)
» yarn-site.xml: configuration settings for the ResourceManager and NodeManager
» mapred-site.xml: configuration settings for MapReduce applications
» slaves: a list of machines (one per line) that each run a DataNode and a NodeManager
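As an example, a sketch of an hdfs-site.xml for a cluster node; the /data/... paths are placeholders for the mount points of your actual drives:

<configuration>
  <!-- where the NameNode stores the filesystem image and edit log -->
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/data/1/dfs/nn</value>
  </property>
  <!-- comma-separated list of a slave node's JBOD data directories -->
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/data/1/dfs/dn,/data/2/dfs/dn,/data/3/dfs/dn,/data/4/dfs/dn</value>
  </property>
  <!-- block replication factor (the HDFS default is 3) -->
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>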


Configuration Files (Contd.)

The core functionality and usage of these configuration files are the same in Hadoop 2.0 and 1.0, but many new properties have been added and many have been deprecated. For example, 'fs.default.name' has been deprecated and replaced with 'fs.defaultFS' for YARN in core-site.xml, and 'dfs.nameservices' has been added to enable NameNode High Availability in hdfs-site.xml (sketched below).

» Deprecated: dfs.data.dir → New: dfs.datanode.data.dir
» Deprecated: dfs.http.address → New: dfs.namenode.http-address
» Deprecated: fs.default.name → New: fs.defaultFS

http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/DeprecatedProperties.html

As of the Hadoop 2.2.0 release, you can use either the old or the new properties: the old property names are deprecated, but still work!
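A sketch of how dfs.nameservices enables NameNode HA in hdfs-site.xml; the nameservice ID 'mycluster' and the hostnames are placeholders, and a real HA set-up also needs shared-edits and failover settings not shown here:

<configuration>
  <!-- logical name for this pair of NameNodes -->
  <property>
    <name>dfs.nameservices</name>
    <value>mycluster</value>
  </property>
  <!-- IDs of the individual NameNodes in the nameservice -->
  <property>
    <name>dfs.ha.namenodes.mycluster</name>
    <value>nn1,nn2</value>
  </property>
  <!-- RPC address of each NameNode -->
  <property>
    <name>dfs.namenode.rpc-address.mycluster.nn1</name>
    <value>master1.example.com:8020</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.mycluster.nn2</name>
    <value>master2.example.com:8020</value>
  </property>
</configuration>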

Hadoop 2.x Configuration Files – Apache Hadoop

» Core → core-site.xml
» HDFS → hdfs-site.xml
» YARN → yarn-site.xml
» MapReduce → mapred-site.xml

Hadoop Daemons

» NameNode daemon
  » Runs on the master node of the Hadoop Distributed File System (HDFS)
  » Directs the DataNodes to perform their low-level I/O tasks
» DataNode daemon
  » Runs on each slave machine in the HDFS
  » Does the low-level I/O work
» ResourceManager
  » Runs on the master node of the data-processing system (MapReduce)
  » Global resource scheduler (see the configuration sketch below)
» NodeManager
  » Runs on each slave node of the data-processing system
  » Platform for the data-processing tasks
» JobHistoryServer
  » Responsible for servicing all job-history-related requests from clients
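To tie the master daemons back to the configuration files above, a sketch of where their addresses are typically set; the hostnames are placeholders:

<!-- yarn-site.xml: where NodeManagers and clients find the ResourceManager -->
<property>
  <name>yarn.resourcemanager.hostname</name>
  <value>master1.example.com</value>
</property>

<!-- mapred-site.xml: where clients reach the JobHistoryServer -->
<property>
  <name>mapreduce.jobhistory.address</name>
  <value>master1.example.com:10020</value>
</property>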


Why Cloud?

Challenges in the current trend:

» Arranging a large common storage area
» Providing secure access to the shared data


Amazon EC2

A cloud web host that lets you dynamically add and remove server resources as you need them, so you pay only for the capacity you actually use.

It is a good fit for Hadoop cluster set-up: we can bring up an enormous cluster within minutes and then spin it down when we have finished, to reduce costs.


Hadoop on AWS



DEMO


Hadoop Admin Responsibilities

» Implementation and administration of the Hadoop infrastructure
» Testing HDFS, Hive, Pig and MapReduce access for applications
» Cluster maintenance tasks such as backup, recovery, upgrades and patching
» Performance tuning and capacity planning for clusters
» Monitoring the Hadoop cluster and deploying security

How it Works?

» LIVE Online Class
» Class Recording in LMS
» 24/7 Post Class Support
» Module Wise Quiz
» Project Work
» Verifiable Certificate

Questions


Course Topics

» Module 1: Hadoop Cluster Administration
» Module 2: Hadoop Architecture and Cluster Setup
» Module 3: Hadoop Cluster: Planning and Managing
» Module 4: Backup, Recovery and Maintenance
» Module 5: Hadoop 2.0 and High Availability
» Module 6: Advanced Topics: QJM, HDFS Federation and Security
» Module 7: Oozie, HCatalog/Hive and HBase Administration
» Module 8: Project: Hadoop Implementation