Writing YARN Applications, Hadoop Summit 2012


Upload: hortonworks

Post on 27-Jan-2015


DESCRIPTION

Hadoop YARN is the next-generation computing platform in Apache Hadoop, with support for programming paradigms besides MapReduce. In the world of Big Data, not every problem can be solved using the MapReduce programming model alone. Typical installations run separate programming models such as MR, MPI, and graph-processing frameworks on individual clusters. Running fewer, larger clusters is cheaper than running many small ones. Therefore, leveraging YARN to allow both MR and non-MR applications to run on a common cluster becomes more important from an economic and operational point of view. This talk will cover the different APIs and RPC protocols that are available for developers to implement new application frameworks on top of YARN. We will also go through a simple application that demonstrates how to implement your own Application Master, send resource requests to the YARN ResourceManager, and then use the allocated resources to run user code on the NodeManagers.

TRANSCRIPT

Page 1: Writing Yarn Applications Hadoop Summit 2012

© Hortonworks Inc. 2011

Writing Application Frameworks on Apache Hadoop YARN

Hitesh Shah

[email protected]


Page 2: Writing Yarn Applications Hadoop Summit 2012


Hitesh Shah - Background

• Member of Technical Staff at Hortonworks Inc.

• Committer for Apache MapReduce and Ambari

• Previously spent 8+ years at Yahoo! building infrastructure ranging from data storage platforms to high-throughput online ad-serving systems.

Architecting the Future of Big Data

Page 3: Writing Yarn Applications Hadoop Summit 2012

Agenda

• YARN Architecture and Concepts

• Writing a New Framework

Page 4: Writing Yarn Applications Hadoop Summit 2012

YARN Architecture

• Resource Manager
  – Global resource scheduler
  – Hierarchical queues

• Node Manager
  – Per-machine agent
  – Manages the life cycle of containers
  – Container resource monitoring

• Application Master
  – Per-application
  – Manages application scheduling and task execution
  – E.g. MapReduce Application Master

Page 5: Writing Yarn Applications Hadoop Summit 2012

YARN Architecture (diagram)

Page 6: Writing Yarn Applications Hadoop Summit 2012

YARN Concepts

• Application ID
  – Application Attempt IDs

• Container
  – ContainerLaunchContext

• ResourceRequest
  – Host/Rack/Any match
  – Priority
  – Resource constraints

• Local Resource
  – File/Archive
  – Visibility: public/private/application
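The ResourceRequest bullet can be illustrated with a small model: each request names a location (a host, a rack, or the wildcard), a priority, and a resource size. The sketch below is plain Java with no Hadoop dependency. The `Ask` class and the `expand` helper are hypothetical names for illustration, not part of the YARN API, and the sketch assumes the common convention that a host-level request is accompanied by matching rack-level and any-host (`*`) requests so that the scheduler can relax locality.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a YARN ResourceRequest (not the real class).
final class Ask {
    final String location;   // host name, rack name, or "*" for any node
    final int priority;
    final int memoryMb;
    final int numContainers;

    Ask(String location, int priority, int memoryMb, int numContainers) {
        this.location = location;
        this.priority = priority;
        this.memoryMb = memoryMb;
        this.numContainers = numContainers;
    }
}

final class LocalityAsksDemo {
    static final String ANY = "*";

    // Expand one preferred host into host-, rack-, and any-level asks.
    static List<Ask> expand(String host, String rack, int priority, int memoryMb, int n) {
        List<Ask> asks = new ArrayList<>();
        asks.add(new Ask(host, priority, memoryMb, n));
        asks.add(new Ask(rack, priority, memoryMb, n));
        asks.add(new Ask(ANY, priority, memoryMb, n));
        return asks;
    }

    public static void main(String[] args) {
        // Host and rack names here are made up for the example.
        for (Ask a : expand("node17.example.com", "/rack3", 0, 1024, 2)) {
            System.out.println(a.location + " priority=" + a.priority
                    + " mem=" + a.memoryMb + "MB x" + a.numContainers);
        }
    }
}
```

An application with no locality preference would send only the `*` entry.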

Page 7: Writing Yarn Applications Hadoop Summit 2012

What you need for a new Framework

• Application Submission Client
  – For example, the MR Job Client

• Application Master
  – The core framework library

• Application History (optional)
  – History of all previously run instances

• Auxiliary Services (optional)
  – Long-running application-specific services running on the NodeManager

Page 8: Writing Yarn Applications Hadoop Summit 2012

Use Case: Distributed Shell

• Take a user-provided script or application and run it on a set of nodes in the cluster

• Input:
  – User script to execute
  – Number of containers to run on
  – Variable arguments for each different container
  – Memory requirements for the shell script
  – Output location/dir

[Diagram: NodeManagers hosting the shell script containers and the DS AppMaster]

Page 9: Writing Yarn Applications Hadoop Summit 2012

Client: RPC calls

• Uses the ClientRMProtocol

• Get a new Application ID from the RM

• Application Submission

• Application Monitoring

• Kill the Application?

Page 10: Writing Yarn Applications Hadoop Summit 2012

Client

• Registration with the RM
  – New Application ID

• Application Submission
  – User information
  – Scheduler queue
  – Define the container for the Distributed Shell App Master via the ContainerLaunchContext

• Application Monitoring
  – AppMaster host details (with tokens if needed), tracking URL
  – Application status (submitted/running/finished)
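The monitoring step amounts to polling the RM for the application's status until it reaches a terminal state. The following is a minimal sketch of such a loop in plain Java, assuming a simplified state set. `AppState` and `waitForCompletion` are hypothetical names, and the `Supplier` stands in for whatever call fetches the application report; none of these are YARN classes.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.function.Supplier;

final class AppMonitorDemo {
    // Hypothetical, simplified states; the real YARN state enum has more values.
    enum AppState { SUBMITTED, RUNNING, FINISHED, FAILED, KILLED }

    static boolean isTerminal(AppState s) {
        return s == AppState.FINISHED || s == AppState.FAILED || s == AppState.KILLED;
    }

    // Poll the report source until a terminal state is seen, maxPolls is
    // exhausted, or the thread is interrupted. Returns the last state observed
    // (null if maxPolls is zero).
    static AppState waitForCompletion(Supplier<AppState> reportSource, int maxPolls, long sleepMillis) {
        AppState last = null;
        for (int i = 0; i < maxPolls; i++) {
            last = reportSource.get();
            if (isTerminal(last)) {
                return last;
            }
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                return last;
            }
        }
        return last;
    }

    public static void main(String[] args) {
        // Canned sequence of reports standing in for real RM responses.
        Iterator<AppState> canned = Arrays.asList(
                AppState.SUBMITTED, AppState.RUNNING, AppState.FINISHED).iterator();
        System.out.println(waitForCompletion(canned::next, 10, 0L));
    }
}
```

A real client would typically also enforce an overall timeout and kill the application if it is exceeded.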

Page 11: Writing Yarn Applications Hadoop Summit 2012

Defining a Container

• ContainerLaunchContext class
  – Can run a shell script, a Java process or launch a VM

• Command(s) to run

• Local resources needed for the process to run
  – Dependent jars, native libs, data files/archives

• Environment to set up
  – Java classpath

• Security-related data
  – Container tokens
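The command and environment parts of a container definition can be sketched without any Hadoop classes. The helper names below (`buildCommand`, `buildEnvironment`) are hypothetical. The sketch also assumes that the NodeManager expands a `<LOG_DIR>` placeholder in the command to the container's log directory, so that stdout and stderr can be redirected there.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class LaunchSpecDemo {
    // Assumed placeholder that the NodeManager replaces with the container log directory.
    static final String LOG_DIR = "<LOG_DIR>";

    // Build the single command line used to start a Java process in the container.
    static String buildCommand(String mainClass, int heapMb, List<String> args) {
        List<String> parts = new ArrayList<>();
        parts.add("${JAVA_HOME}/bin/java");
        parts.add("-Xmx" + heapMb + "m");
        parts.add(mainClass);
        parts.addAll(args);
        parts.add("1>" + LOG_DIR + "/stdout");
        parts.add("2>" + LOG_DIR + "/stderr");
        return String.join(" ", parts);
    }

    // Build the environment map; only CLASSPATH is set in this sketch.
    static Map<String, String> buildEnvironment(List<String> classpathEntries) {
        Map<String, String> env = new LinkedHashMap<>();
        env.put("CLASSPATH", String.join(":", classpathEntries));
        return env;
    }

    public static void main(String[] args) {
        System.out.println(buildCommand("MyAppMaster", 256, Arrays.asList("arg1", "arg2")));
        System.out.println(buildEnvironment(Arrays.asList("./*", "app.jar")));
    }
}
```

The resulting string and map correspond to what would be passed to the launch context's command list and environment.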

Page 12: Writing Yarn Applications Hadoop Summit 2012

Application Master: RPC calls

• Uses the AMRM (AMRMProtocol) and CM (ContainerManager) protocols

• Register AM with RM

• Ask RM to allocate resources

• Launch tasks on allocated containers

• Manage tasks to final completion

• Inform RM of completion

Page 13: Writing Yarn Applications Hadoop Summit 2012

Application Master

• Set up RPC to handle requests from the Client and/or tasks launched on Containers

• Register with the RM and send it regular heartbeats

• Request resources from the RM

• Launch the user shell script on containers as and when they are allocated

• Monitor the status of the user script on remote containers and manage failures by retrying if needed

• Inform the RM of completion when the application is done
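The "manage failures by retrying if needed" step needs some bookkeeping in the Application Master. A minimal sketch of one possible policy follows: retry a failed task until a fixed attempt limit is reached. `TaskRetryTracker` and its methods are hypothetical and not part of YARN, and the sketch assumes that an exit status of 0 means success.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical bookkeeping for an Application Master; not part of the YARN API.
final class TaskRetryTracker {
    private final int maxAttempts;
    private final Map<String, Integer> attempts = new HashMap<>();
    private int succeeded;
    private int abandoned;

    TaskRetryTracker(int maxAttempts) {
        this.maxAttempts = maxAttempts;
    }

    // Record one finished attempt of a task. Returns true if a new container
    // should be requested for it, i.e. it failed and still has attempts left.
    boolean onAttemptCompleted(String taskId, int exitStatus) {
        int used = attempts.merge(taskId, 1, Integer::sum);
        if (exitStatus == 0) {          // assumption: 0 means success
            succeeded++;
            return false;
        }
        if (used < maxAttempts) {
            return true;
        }
        abandoned++;
        return false;
    }

    int succeeded() { return succeeded; }

    int abandoned() { return abandoned; }
}

final class TaskRetryDemo {
    public static void main(String[] args) {
        TaskRetryTracker tracker = new TaskRetryTracker(2);
        System.out.println("retry task-0: " + tracker.onAttemptCompleted("task-0", 1));
        System.out.println("retry task-0: " + tracker.onAttemptCompleted("task-0", 0));
        System.out.println("succeeded=" + tracker.succeeded() + " abandoned=" + tracker.abandoned());
    }
}
```

The application is done once every task has either succeeded or been abandoned, at which point the AM can report a final status to the RM.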

Page 14: Writing Yarn Applications Hadoop Summit 2012

AMRM#allocate

• Request:
  – Containers needed
      – Not a delta protocol
      – Locality constraints: Host/Rack/Any
      – Resource constraints: memory
      – Priority-based assignments
  – Containers to release (extra/unwanted?)
      – Only non-launched containers

• Response:
  – Allocated Containers
      – Launch or release
  – Completed Containers
      – Status of completion
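"Not a delta protocol" means that each ask states the total number of containers currently wanted for a given priority and location, rather than an increment over the previous ask. A minimal sketch of that bookkeeping in plain Java follows; `AskTableDemo` and its methods are hypothetical names, not YARN classes.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical model of an AM's outstanding-ask table; names are illustrative only.
final class AskTableDemo {
    // Key: priority + "@" + location; value: total containers currently wanted.
    private final Map<String, Integer> outstanding = new LinkedHashMap<>();

    private static String key(int priority, String location) {
        return priority + "@" + location;
    }

    // Absolute semantics: a new ask for the same key replaces the old count.
    void ask(int priority, String location, int totalWanted) {
        if (totalWanted <= 0) {
            outstanding.remove(key(priority, location));
        } else {
            outstanding.put(key(priority, location), totalWanted);
        }
    }

    // After an allocation, one fewer container is outstanding for that key.
    void onAllocated(int priority, String location) {
        Integer current = outstanding.get(key(priority, location));
        if (current != null) {
            ask(priority, location, current - 1);
        }
    }

    int wanted(int priority, String location) {
        return outstanding.getOrDefault(key(priority, location), 0);
    }

    public static void main(String[] args) {
        AskTableDemo table = new AskTableDemo();
        table.ask(0, "*", 5);
        table.ask(0, "*", 3);      // replaces 5; it does not add to it
        table.onAllocated(0, "*");
        System.out.println(table.wanted(0, "*"));
    }
}
```

Under delta semantics the two asks above would have summed to 8; under the absolute semantics sketched here the second ask simply overrides the first.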

Page 15: Writing Yarn Applications Hadoop Summit 2012

YARN Applications

• Data Processing:
  – OpenMPI on Hadoop
  – Spark (UC Berkeley)
      – Shark (Hive-on-Spark)
  – Real-time data processing
      – Storm (Twitter)
      – Apache S4
  – Graph processing
      – Apache Giraph

• Beyond data:
  – Deploying Apache HBase via YARN (HBASE-4329)
  – HBase Co-processors via YARN (HBASE-4047)

Page 16: Writing Yarn Applications Hadoop Summit 2012

References

• Doc on writing new applications:
  – WritingYarnApplications.html (available at http://hadoop.apache.org/common/docs/r2.0.0-alpha/)

Page 17: Writing Yarn Applications Hadoop Summit 2012

Questions?

Thank You!

Hitesh [email protected]

Page 18: Writing Yarn Applications Hadoop Summit 2012

Appendix: Code Examples

Page 19: Writing Yarn Applications Hadoop Summit 2012

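Client: Registration

import java.net.InetSocketAddress;
import org.apache.hadoop.net.NetUtils;
import org.apache.hadoop.yarn.api.ClientRMProtocol;
import org.apache.hadoop.yarn.api.protocolrecords.GetNewApplicationRequest;
import org.apache.hadoop.yarn.api.protocolrecords.GetNewApplicationResponse;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.ipc.YarnRPC;
import org.apache.hadoop.yarn.util.Records;

// Connect to the ResourceManager; conf is the client's Hadoop Configuration.
YarnConfiguration yarnConf = new YarnConfiguration(conf);
YarnRPC rpc = YarnRPC.create(yarnConf);
InetSocketAddress rmAddress = NetUtils.createSocketAddr(
    yarnConf.get(YarnConfiguration.RM_ADDRESS, YarnConfiguration.DEFAULT_RM_ADDRESS));
ClientRMProtocol applicationsManager = (ClientRMProtocol)
    rpc.getProxy(ClientRMProtocol.class, rmAddress, yarnConf);

// Ask the RM for a new application ID.
GetNewApplicationRequest request = Records.newRecord(GetNewApplicationRequest.class);
GetNewApplicationResponse response = applicationsManager.getNewApplication(request);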

Page 20: Writing Yarn Applications Hadoop Summit 2012

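Client: App Submission

// Describe the container that will run the Application Master.
ContainerLaunchContext amContainer = Records.newRecord(ContainerLaunchContext.class);
amContainer.setLocalResources(localResources);  // Map<String, LocalResource>: jars, scripts, ...
amContainer.setEnvironment(env);                // Map<String, String>: e.g. CLASSPATH
String command = "${JAVA_HOME}" + "/bin/java" + " MyAppMaster" + " arg1 arg2";
amContainer.setCommands(Collections.singletonList(command));

Resource capability = Records.newRecord(Resource.class);
capability.setMemory(amMemory);                 // memory for the AM container, in MB
amContainer.setResource(capability);

// Submit the application to the ResourceManager.
ApplicationSubmissionContext appContext = Records.newRecord(ApplicationSubmissionContext.class);
appContext.setApplicationId(appId);             // ID obtained during registration
appContext.setAMContainerSpec(amContainer);
SubmitApplicationRequest appRequest = Records.newRecord(SubmitApplicationRequest.class);
appRequest.setApplicationSubmissionContext(appContext);
applicationsManager.submitApplication(appRequest);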

Page 21: Writing Yarn Applications Hadoop Summit 2012

Client: App Monitoring

• Get Application Status

GetApplicationReportRequest reportRequest = Records.newRecord(GetApplicationReportRequest.class);
reportRequest.setApplicationId(appId);
GetApplicationReportResponse reportResponse = applicationsManager.getApplicationReport(reportRequest);
ApplicationReport report = reportResponse.getApplicationReport();

• Kill the application

KillApplicationRequest killRequest = Records.newRecord(KillApplicationRequest.class);
killRequest.setApplicationId(appId);
applicationsManager.forceKillApplication(killRequest);

Page 22: Writing Yarn Applications Hadoop Summit 2012

AM: Ask RM for Containers

// One request per (location, priority, capability) combination.
ResourceRequest rsrcRequest = Records.newRecord(ResourceRequest.class);
rsrcRequest.setHostName("*");  // hostname, rack, or "*" wildcard
rsrcRequest.setPriority(pri);
Resource capability = Records.newRecord(Resource.class);
capability.setMemory(containerMemory);
rsrcRequest.setCapability(capability);
rsrcRequest.setNumContainers(numContainers);

List<ResourceRequest> requestedContainers = new ArrayList<ResourceRequest>();
requestedContainers.add(rsrcRequest);
List<ContainerId> releasedContainers = new ArrayList<ContainerId>();

// Send the asks and releases; the call also acts as the heartbeat to the RM.
AllocateRequest req = Records.newRecord(AllocateRequest.class);
req.setResponseId(rmRequestID);
req.addAllAsks(requestedContainers);
req.addAllReleases(releasedContainers);
req.setProgress(currentProgress);
AllocateResponse allocateResponse = resourceManager.allocate(req);

Page 23: Writing Yarn Applications Hadoop Summit 2012

AM: Launch Containers

AMResponse amResp = allocateResponse.getAMResponse();
List<Container> allocatedContainers = amResp.getAllocatedContainers();
for (Container allocatedContainer : allocatedContainers) {
  // Each container lives on a specific node; connect to that node's ContainerManager.
  NodeId nodeId = allocatedContainer.getNodeId();
  InetSocketAddress cmAddress = NetUtils.createSocketAddr(nodeId.getHost() + ":" + nodeId.getPort());
  ContainerManager cm = (ContainerManager) rpc.getProxy(ContainerManager.class, cmAddress, conf);

  ContainerLaunchContext ctx = Records.newRecord(ContainerLaunchContext.class);
  ctx.setContainerId(allocatedContainer.getId());
  ctx.setResource(allocatedContainer.getResource());
  // set env, command, local resources, ...

  StartContainerRequest startReq = Records.newRecord(StartContainerRequest.class);
  startReq.setContainerLaunchContext(ctx);
  cm.startContainer(startReq);
}

Page 24: Writing Yarn Applications Hadoop Summit 2012

AM: Monitoring Containers

• Running Containers

GetContainerStatusRequest statusReq = Records.newRecord(GetContainerStatusRequest.class);
statusReq.setContainerId(containerId);
GetContainerStatusResponse statusResp = cm.getContainerStatus(statusReq);

• Completed Containers

AMResponse amResp = allocateResponse.getAMResponse();
List<ContainerStatus> completedContainers = amResp.getCompletedContainersStatuses();
for (ContainerStatus containerStatus : completedContainers) {
  // containerStatus.getContainerId()
  // containerStatus.getExitStatus()
  // containerStatus.getDiagnostics()
}

Page 25: Writing Yarn Applications Hadoop Summit 2012

AM: I am done

FinishApplicationMasterRequest finishReq = Records.newRecord(FinishApplicationMasterRequest.class);
finishReq.setAppAttemptId(appAttemptID);
finishReq.setFinishApplicationStatus(FinalApplicationStatus.SUCCEEDED);  // or FAILED
finishReq.setDiagnostics(diagnostics);
resourceManager.finishApplicationMaster(finishReq);