Hadoop Summit San Jose 2014 - Apache Hadoop YARN: Best Practices

© Hortonworks Inc. 2014

Apache Hadoop YARN

Best Practices

Zhijie Shen

zshen [at] hortonworks.com

Varun Vasudev

vvasudev [at] hortonworks.com


Who we are

• Zhijie Shen
– Software engineer at Hortonworks
– Apache Hadoop Committer
– Apache Samza Committer and PPMC member
– PhD from National University of Singapore

• Varun Vasudev
– Software engineer at Hortonworks, working on YARN
– Worked on image and web search at Yahoo!

Architecting the Future of Big Data


Agenda

• Talking about what we have learnt from our experiences working with YARN users

• Best practices for
– Administrators
– Application developers



For Administrators



Sub-Agenda

• Overview of YARN configuration
• ResourceManager
• Schedulers
• NodeManagers
• Others
– Log aggregation
– Metrics



Overview of YARN configuration

• Almost everything YARN related is in yarn-site.xml
• Granular – individual variables documented
• Nearly 150 configuration properties
– Required: very small set – hostnames etc.
– Common: client and server
– Advanced: RPC retries etc.
– yarn.resourcemanager.* and yarn.nodemanager.* are usually server configs
  – Admins can mark them ‘final’ to make clear to users that they cannot be overridden
– yarn.client.* are client configs
• Security, ResourceManager, NodeManager, TimelineServer, Scheduler – all in one file
• Topology scripts on RM, NM and all nodes
– BUG: MR AM has to read the same script. Work in progress to send it from RM to AMs
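A minimal sketch of what marking a property ‘final’ looks like once it lands in yarn-site.xml. It renders a Hadoop-style configuration document from a dictionary; the helper name render_site and the hostname are hypothetical, chosen only for illustration.

```python
import xml.etree.ElementTree as ET


def render_site(properties, final_names=()):
    """Render a dict of properties as a Hadoop-style <configuration> document."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
        if name in final_names:
            # <final>true</final> marks the value as not overridable by users.
            ET.SubElement(prop, "final").text = "true"
    return ET.tostring(root, encoding="unicode")


site_xml = render_site(
    {
        "yarn.resourcemanager.hostname": "rm.example.com",  # hypothetical host
        "yarn.nodemanager.resource.memory-mb": 8192,
    },
    final_names={"yarn.nodemanager.resource.memory-mb"},
)
print(site_xml)
```

Only the properties listed in final_names get the extra element, so the same helper can emit both overridable and locked-down settings.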



ResourceManager

• Hardware requirements
– The ResourceManager needs CPU
– Doesn’t require as much memory as the JobTracker
  – 4 to 8 GB should be fine
• JobHistoryServer
– Needs memory, at least 8 GB



Enable RM HA

• Enable RM HA for availability
• Only supported using ZooKeeper
– Leader election used
– Fencing support
• Automatic failover enabled by default
– Uses ZooKeeper again
– Embedded zkfc, no need to explicitly start a separate process
• You can start multiple ResourceManagers
• Specify rm-ids using yarn.resourcemanager.ha.rm-ids
– e.g. yarn.resourcemanager.ha.rm-ids = rm1,rm2
• Associate hostnames with rm-ids using yarn.resourcemanager.hostname.rm1, yarn.resourcemanager.hostname.rm2
– No need to change any other configs – scheduler and resource-tracker addresses are automatically taken care of
• Web UIs automatically redirect to the active ResourceManager
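The HA settings above can be collected in one place. The sketch below builds the corresponding yarn-site.xml properties as a dictionary. The rm-ids and hostname property names come from the slide; yarn.resourcemanager.ha.enabled, yarn.resourcemanager.cluster-id and yarn.resourcemanager.zk-address are the companion properties as I understand the HA setup, and the hostnames, cluster id and ZooKeeper quorum are hypothetical.

```python
def rm_ha_properties(cluster_id, zk_quorum, rm_hosts):
    """Build yarn-site.xml properties for ResourceManager HA.

    rm_hosts maps an rm-id (e.g. "rm1") to the hostname of that ResourceManager.
    """
    if len(rm_hosts) < 2:
        raise ValueError("HA needs at least two ResourceManagers")
    props = {
        "yarn.resourcemanager.ha.enabled": "true",
        "yarn.resourcemanager.cluster-id": cluster_id,
        "yarn.resourcemanager.ha.rm-ids": ",".join(rm_hosts),
        "yarn.resourcemanager.zk-address": zk_quorum,
    }
    for rm_id, host in rm_hosts.items():
        # One hostname entry per rm-id; other addresses are derived from it.
        props[f"yarn.resourcemanager.hostname.{rm_id}"] = host
    return props


ha_props = rm_ha_properties(
    cluster_id="demo-cluster",  # hypothetical
    zk_quorum="zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181",
    rm_hosts={"rm1": "master1.example.com", "rm2": "master2.example.com"},
)
for name, value in ha_props.items():
    print(f"{name} = {value}")
```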



YARN schedulers

• Two main schedulers
– Capacity
– Fair
• Capacity Scheduler allows you to set up queues to split resources – useful for multi-tenant clusters where you want to guarantee resources
• Fair Scheduler allows you to split resources ‘fairly’ across applications
• Both have admin files which can be used to dynamically change the setup
• If you have enabled HA, queue configuration files are on local disk
– Make sure queue files are consistent across nodes
– Feature to centralize configs in progress



Capacity Scheduler


[Figure: guaranteed resources split across queue-1, queue-2 and queue-3 in shares of 50%, 30% and 20%, with each queue running its own apps]


YARN Capacity scheduler

• Configuration in capacity-scheduler.xml
• Take some time to set up your queues!
• Queues have per-queue ACLs to restrict queue access
– Access can be dynamically changed
• Elasticity can be limited on a per-queue basis – use yarn.scheduler.capacity.<queue-path>.maximum-capacity
• Use yarn.scheduler.capacity.<queue-path>.state to drain queues
– ‘Decommissioning’ a queue
• Run yarn rmadmin -refreshQueues to make runtime changes
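A sketch of how the queue settings above might be assembled for capacity-scheduler.xml. It assumes three child queues under root with the 50/30/20 split from the Capacity Scheduler figure (which queue gets which share is my assumption), caps one queue's elasticity, and drains another by setting its state to STOPPED. The helper name is hypothetical.

```python
def capacity_scheduler_properties(parent, queues):
    """Build capacity-scheduler.xml properties for the child queues of `parent`.

    queues maps a queue name to a dict with "capacity" (guaranteed %), and
    optionally "maximum-capacity" (elasticity cap, %) and "state".
    """
    total = sum(q["capacity"] for q in queues.values())
    if total != 100:
        raise ValueError(f"child capacities of {parent} must sum to 100, got {total}")
    prefix = "yarn.scheduler.capacity"
    props = {f"{prefix}.{parent}.queues": ",".join(queues)}
    for name, q in queues.items():
        path = f"{prefix}.{parent}.{name}"
        props[f"{path}.capacity"] = str(q["capacity"])
        if "maximum-capacity" in q:
            if q["maximum-capacity"] < q["capacity"]:
                raise ValueError(f"{name}: maximum-capacity is below capacity")
            props[f"{path}.maximum-capacity"] = str(q["maximum-capacity"])
        # RUNNING accepts new apps; STOPPED drains the queue.
        props[f"{path}.state"] = q.get("state", "RUNNING")
    return props


queue_props = capacity_scheduler_properties(
    "root",
    {
        "queue-1": {"capacity": 50},
        "queue-2": {"capacity": 30, "maximum-capacity": 60},
        "queue-3": {"capacity": 20, "state": "STOPPED"},
    },
)
for name, value in queue_props.items():
    print(f"{name} = {value}")
```

After editing the file with values like these, yarn rmadmin -refreshQueues applies them without restarting the ResourceManager.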



YARN Fair Scheduler

• Apps get equal share of resources, on average, over time
• No worry about starvation
• Support for queues – meant to be used so that you can prevent users from flooding the system with apps
• Has support for fairness policy which can be modified at runtime
• Good if you have lots of small jobs



Size your containers

• Memory and cores – minimum and maximum allocation, affects containers per node
• yarn.scheduler.*-allocation-*
• Defaults are 1 GB, 8 GB, 1 core and 32 cores
• CPU scheduling needs a bit more stabilization
– Historically – translate to memory calculations
• Similarly disk scheduling
– Translate disk limits to memory/CPU


[Figure: number of containers per node (y-axis, 0 to 70) plotted against memory for NodeManager in GB (x-axis, 4 to 64)]
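The relationship between NodeManager memory and containers per node can be estimated with simple arithmetic. This sketch considers memory only and assumes each request is rounded up to a multiple of the minimum allocation (my understanding of the Capacity Scheduler's behaviour) and that requests above the maximum allocation are rejected. The function names are hypothetical.

```python
import math


def normalized_container_mb(requested_mb, min_alloc_mb=1024, max_alloc_mb=8192):
    """Round a memory request up to a multiple of the minimum allocation."""
    if requested_mb > max_alloc_mb:
        raise ValueError("request exceeds the maximum allocation")
    rounded = math.ceil(requested_mb / min_alloc_mb) * min_alloc_mb
    return max(min_alloc_mb, rounded)


def containers_per_node(nm_memory_mb, requested_mb, min_alloc_mb=1024, max_alloc_mb=8192):
    """Estimate how many equally sized containers fit on one NodeManager."""
    size = normalized_container_mb(requested_mb, min_alloc_mb, max_alloc_mb)
    return nm_memory_mb // size


for nm_gb in (8, 32, 64):
    for request_mb in (1024, 1500, 4096):
        count = containers_per_node(nm_gb * 1024, request_mb)
        print(f"{nm_gb} GB NodeManager, {request_mb} MB request: {count} containers")
```

A 1500 MB request is charged as 2048 MB under these assumptions, which is why a minimum allocation that is poorly matched to typical requests reduces the number of containers per node.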


NodeManagers

• Set resource-memory – variable is yarn.nodemanager.resource.memory-mb
– Sets how much memory YARN can use for containers
– Default is 8 GB
• Set up a health-checker script!
– Check disk
– Check network
– Check any external resources required for job completion
– Test it on your OS
– Weed out bad nodes automatically!
• Figure out if the physical and virtual memory monitors make sense; both are enabled by default
– Default virtual-to-physical memory ratio is 2.1
• Multiple disks for containers on NodeManagers
– HDFS accesses them too
– If bottlenecked on disks, separate them. Haven’t seen it in the wild though
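A minimal health-checker sketch covering the disk and network checks listed above. It relies on one convention: the NodeManager treats a node as unhealthy when the script prints a line beginning with ERROR. The threshold, the directory being checked and the hostname being resolved are hypothetical placeholders; a real script would check the NodeManager's configured local directories and whatever external services jobs depend on.

```python
import shutil
import socket
import tempfile


def check_disk(path, min_free_fraction=0.10):
    """Return a problem description if `path` is unreadable or nearly full, else None."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        return f"cannot stat {path}: {exc}"
    if usage.total == 0:
        return f"{path} reports zero capacity"
    free_fraction = usage.free / usage.total
    if free_fraction < min_free_fraction:
        return f"{path} has only {free_fraction:.0%} free"
    return None


def check_dns(hostname):
    """Return a problem description if `hostname` cannot be resolved, else None."""
    try:
        socket.gethostbyname(hostname)
    except OSError as exc:
        return f"cannot resolve {hostname}: {exc}"
    return None


def report(problems):
    """Format problems as lines starting with ERROR, which the NodeManager looks for."""
    return [f"ERROR {problem}" for problem in problems]


def main(local_dirs, hostname):
    problems = [check_disk(d) for d in local_dirs] + [check_dns(hostname)]
    for line in report([p for p in problems if p]):
        print(line)


if __name__ == "__main__":
    # Placeholder inputs: the temp dir stands in for the NodeManager local dirs.
    main([tempfile.gettempdir()], "localhost")
```

The script prints nothing when every check passes, so a healthy node produces no ERROR lines. Point yarn.nodemanager.health-checker.script.path at the script and test it on the target OS before rolling it out.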



YARN log aggregation

• Log aggregation can be enabled using yarn.log-aggregation-enable
• Can control how long you keep the logs by setting parameters for purging
• App logs can be obtained using the “yarn logs” command
• Creates lots of small files, can affect HDFS performance
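A small sketch of the two settings involved and of the command used to fetch logs afterwards. yarn.log-aggregation-enable is named on the slide; yarn.log-aggregation.retain-seconds is, to my knowledge, the purging parameter it alludes to. The seven-day retention and the application id are hypothetical.

```python
def log_aggregation_properties(retain_days):
    """Build yarn-site.xml properties that enable log aggregation with a retention period."""
    return {
        "yarn.log-aggregation-enable": "true",
        # Retention is expressed in seconds.
        "yarn.log-aggregation.retain-seconds": str(int(retain_days * 24 * 60 * 60)),
    }


def yarn_logs_command(application_id):
    """Return the argv for fetching the aggregated logs of one application."""
    return ["yarn", "logs", "-applicationId", application_id]


print(log_aggregation_properties(retain_days=7))
print(" ".join(yarn_logs_command("application_1400000000000_0001")))  # hypothetical id
```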



YARN Metrics

• JMX – http://<rm address>:<port>/jmx, http://<nm address>:<port>/jmx
– Cluster metrics – apps running, successful, failed, etc.
– Scheduler metrics – queue usage
– RPC metrics
• Web UI – http://<rm address>:<port>/cluster
– Cluster metrics
– Scheduler metrics – easier to digest, especially queue usage
– Healthy, failed nodes
• Can be emitted to Ganglia directly using the metrics sink
– Metrics configuration file
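A sketch of reading the /jmx endpoint from a script. It assumes the endpoint returns JSON with a top-level "beans" list and that ResourceManager cluster metrics live in a bean named Hadoop:service=ResourceManager,name=ClusterMetrics; both reflect my understanding and are worth checking against a live cluster. The sample payload is made up so the parsing can be exercised offline.

```python
import json
from urllib.request import urlopen

# Assumed bean name for ResourceManager cluster metrics.
CLUSTER_METRICS_BEAN = "Hadoop:service=ResourceManager,name=ClusterMetrics"


def fetch_jmx(base_url, timeout=10):
    """Fetch and decode the JMX JSON from a daemon, e.g. base_url='http://rm:8088'."""
    with urlopen(f"{base_url}/jmx", timeout=timeout) as response:
        return json.load(response)


def find_bean(jmx_payload, bean_name):
    """Return the bean with the given name, or None if it is absent."""
    for bean in jmx_payload.get("beans", []):
        if bean.get("name") == bean_name:
            return bean
    return None


# Offline sample shaped like a /jmx response; the numbers are invented.
sample = {
    "beans": [
        {"name": "java.lang:type=Runtime", "Uptime": 1000},
        {"name": CLUSTER_METRICS_BEAN, "NumActiveNMs": 12, "NumUnhealthyNMs": 1},
    ]
}
bean = find_bean(sample, CLUSTER_METRICS_BEAN)
print(bean["NumActiveNMs"], bean["NumUnhealthyNMs"])
```

Against a real cluster, find_bean(fetch_jmx("http://<rm address>:<port>"), CLUSTER_METRICS_BEAN) would replace the sample.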



For Application Developers



Sub-Agenda

• Framework or a native application?
• Understanding YARN basics
• Writing a YARN client
• Writing an ApplicationMaster
• Misc lessons



Framework or a native app?

• Two choices
– Write applications on top of existing frameworks
  – Battle tested
  – Already work
  – APIs
– Roll your own native YARN application
• Existing frameworks
– Scalable batch processing: MapReduce
– Stream processing: Storm/Samza
– Interactive processing, iterations: Tez/Spark
– SQL: Hive
– Data pipelines: Pig
– Graph processing: Giraph
– Existing app: Slider
• Apache: Your App Store



Ease of development

• Check out the other development and deployment tools

[Figure: development options placed along a complexity axis: Frameworks, Slider, Twill/REEF, Native]


Understanding YARN Components


• ResourceManager
– Master of a cluster
• NodeManager
– Slave to take care of one host
• ApplicationMaster
– Master of an application
• Container
– Resource abstraction, process to complete a task


User code: Client and AM

• Client
– Client to ResourceManager
• ApplicationMaster
– ApplicationMaster to scheduler
  – Allocate resources
– ApplicationMaster to NodeManager
  – Manage containers



Client: Rule of Thumb

• Use the client libraries
– YarnClient
  – Submit an application
– AMRMClient(Async)
  – Negotiate resources
– NMClient(Async)
  – Manage containers
– TimelineClient
  – Monitor an application



Writing Client

1. Get the application Id from RM

2. Construct ApplicationSubmissionContext
   1. Shell command to run the AM
   2. Environment (class path, env-variable)
   3. LocalResources (Job jars downloaded from HDFS)

3. Submit the request to RM
   1. submitApplication



Tips for Writing Client

• Cluster dependencies
– Try to make zero assumptions about the cluster
  – Cluster location
  – Cluster sizes
– The same applies to the ApplicationMaster
• Your application bundle should deploy everything required using YARN’s local resources.



Writing ApplicationMaster

1. AM registers with RM (registerApplicationMaster)

2. Heartbeats (allocate) with RM (asynchronously)
   1. Sends the request
      1. Request new containers
      2. Release containers
   2. Receives containers and sends requests to NM to start them
      1. Construct ContainerLaunchContext
         – commands
         – env
         – jars

3. Unregisters with RM (finishApplicationMaster)



Tips for writing ApplicationMaster

• RM assigns containers asynchronously
– Containers are likely not returned immediately by the current call
– User needs to send empty requests until it gets the containers it requested
– ResourceRequest is incremental
• Locality requests may not always be met
– Relaxed locality
• AMs can fail
– They run on cluster nodes which can fail
– RM restarts AMs automatically
– Write AMs to handle failures on restarts – recovery
– Maybe continue your work when the AM restarts
• Optionally talk to your containers directly through the AM
– To get progress, give work, kill it, etc.
– YARN doesn’t do anything for you
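The asynchronous allocation pattern in the first tip can be illustrated with a toy simulation. FakeResourceManager below is a made-up stand-in, not the YARN API: it grants containers only for asks received on earlier heartbeats, and at most a fixed number per heartbeat. The AM loop sends its ask once and then keeps heartbeating with empty requests until everything has arrived.

```python
class FakeResourceManager:
    """Toy stand-in for the RM scheduler; NOT the YARN API."""

    def __init__(self, grants_per_heartbeat=2):
        self.pending = 0
        self.grants_per_heartbeat = grants_per_heartbeat
        self.next_id = 1

    def allocate(self, new_requests=0):
        """Handle one heartbeat: grant from earlier asks, then record new asks."""
        granted = min(self.pending, self.grants_per_heartbeat)
        self.pending -= granted
        containers = [f"container_{self.next_id + i}" for i in range(granted)]
        self.next_id += granted
        # New asks are queued; they are not satisfied by this same call.
        self.pending += new_requests
        return containers


def run_am(rm, wanted, max_heartbeats=100):
    """Ask for `wanted` containers once, then heartbeat until all have arrived."""
    received = []
    heartbeats = 0
    to_ask = wanted
    while len(received) < wanted:
        if heartbeats >= max_heartbeats:
            raise TimeoutError("containers were not allocated in time")
        received.extend(rm.allocate(to_ask))
        to_ask = 0  # later heartbeats carry an empty request
        heartbeats += 1
    return received, heartbeats


containers, heartbeats = run_am(FakeResourceManager(grants_per_heartbeat=2), wanted=5)
print(containers, heartbeats)
```

A real ApplicationMaster would sleep between heartbeats and would also handle completed or lost containers in the same loop; the toy leaves both out to keep the ask-once, poll-until-filled shape visible.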



Using the Timeline Service

• Metadata/Metrics
• Put application specific information
– TimelineClient
– POJO objects
• Query the information
– Get all entities of an entity type
– Get one specific entity
– Get all events of an entity type
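The three queries above map naturally onto URLs. The sketch below builds them, assuming the Timeline Server exposes a REST interface under /ws/v1/timeline with an entityId query parameter for events; that layout is my understanding and should be verified against the server's documentation. The base URL, entity type and entity ids are hypothetical.

```python
from urllib.parse import quote, urlencode

TIMELINE_ROOT = "/ws/v1/timeline"  # assumed REST root of the Timeline Server


def entities_url(base_url, entity_type):
    """URL listing all entities of one entity type."""
    return f"{base_url}{TIMELINE_ROOT}/{quote(entity_type, safe='')}"


def entity_url(base_url, entity_type, entity_id):
    """URL for one specific entity."""
    return f"{entities_url(base_url, entity_type)}/{quote(entity_id, safe='')}"


def events_url(base_url, entity_type, entity_ids):
    """URL listing events for the given entities of one entity type."""
    query = urlencode({"entityId": ",".join(entity_ids)})
    return f"{entities_url(base_url, entity_type)}/events?{query}"


base = "http://timeline.example.com:8188"  # hypothetical host and port
print(entities_url(base, "MY_APP_TASK"))
print(entity_url(base, "MY_APP_TASK", "task_0001"))
print(events_url(base, "MY_APP_TASK", ["task_0001", "task_0002"]))
```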




Summary: Application Workflow

• Execution sequence
1. Client submits an application
2. RM allocates a container to start AM
3. AM registers with RM
4. AM requests containers from RM
5. AM notifies NM to launch containers
6. Application code is executed in container
7. Client contacts RM/AM to monitor application’s status
8. AM unregisters with RM

[Figure: steps 1 to 8 of the execution sequence drawn between Client, RM, NM and AM]


Misc Lessons: Taking What YARN offers

• Monitor your application
– RM
– NM
– Timeline server



Misc Lessons: Debugging/Testing

• MiniYARNCluster
– In-JVM YARN cluster!
– Regression tests for your applications
• Unmanaged AM
– Support to run the AM outside of a YARN cluster for development and testing
– AM logs on your console!
• Logs
– RM/NM logs
– App log aggregation
– Accessible via CLI, web UI



Thank you! Questions?

