october 2013 hug: oozie 4.x

Oozie – Now and Beyond PRESENTED BY Mona Chitnis Hadoop User Group, Yahoo Sunnyvale, October 16, 2013

Upload: yahoo-developer-network

Post on 26-Jan-2015


DESCRIPTION

Apache Oozie has come a long way and now accounts for over 2.8 Million jobs per month on Yahoo's grid infrastructure. If you are running Hadoop jobs repeatedly and thinking of a smarter way of doing it, Apache Oozie is the answer. Be it running complex data transformation jobs chained one after another or simple daily data copy, Oozie workflows will help you to manage these tasks efficiently. Mona will cover the new features introduced in Apache Oozie 4.x, in particular, Apache HCatalog Integration, Job Notifications and SLA Monitoring for building large-scale and efficient data processing pipelines.

TRANSCRIPT

Page 1: October 2013 HUG: Oozie 4.x

Oozie – Now and Beyond

Presented by Mona Chitnis | Hadoop User Group, Yahoo Sunnyvale, October 16, 2013

Page 2: October 2013 HUG: Oozie 4.x

Team In Action

2 Yahoo Confidential & Proprietary

§  Alejandro Abdelnur
§  Mohammad Islam
§  Rohini Palaniswamy
§  Robert Kanter
§  Virag Kothari
§  Mona Chitnis
§  Ryota Egashira
§  Michelle Chiang
§  Bowen Zhang

Page 3: October 2013 HUG: Oozie 4.x

OVERVIEW

Page 4: October 2013 HUG: Oozie 4.x


Why Oozie?

The Problem

§  Doing something on the grid often required multiple steps
    §  MapReduce job
    §  Pig job
    §  Streaming job
    §  HDFS operation (mkdir, chmod, etc.)…
§  Multiple ad-hoc solutions existed
    §  custom job control
    §  shell scripts
    §  cron…
§  Cost of building and running apps was high
    §  development and applications engineering
    §  support, operations, and hardware

The Need

§  Workflow scheduler with better support for grid jobs (native integration with Hadoop)
    §  orchestrate dependency between jobs
    §  execute at specific time or on data availability
    §  retry jobs in the event of failures (reliable)
§  Common framework for communication and execution of production process
    §  sync (clocked dataset) awareness
    §  async (unspecified freq) data awareness
§  Horizontally scalable and extensible system
    §  Open-source
    §  Workflows to couple resources instead of having a monolithic code base

A server-based workflow scheduling system to manage Hadoop jobs

Page 5: October 2013 HUG: Oozie 4.x

Oozie – A Workflow Engine

§  Oozie executes workflow defined as DAG of jobs
§  The job type includes MapReduce, Pig, Hive, shell script, custom Java code etc.
§  Introduced in Oozie 1.x

[Figure: example workflow DAG. Nodes: start, M/R job, M/R job, decision (branches MORE and ENOUGH), fork, Pig job, M/R job, join, Java, FS job, end, kill. Transitions are labeled OK and ERROR.]

Control-flow nodes (start, kill, end | fork, join, decision)

Action nodes (map reduce, pig, hive, distcp, java, fs, sub-workflow, shell, ssh, email)
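The "workflow as a DAG" idea can be sketched in a few lines of Python. This is only an illustration, not Oozie code: the node names and the `TRANSITIONS` dictionary are hypothetical, loosely following the node types on the slide, and the error transitions to `kill` are left out for brevity.

```python
from graphlib import CycleError, TopologicalSorter  # Python 3.9+

# Hypothetical model of a workflow: each node maps to the nodes it may
# transition to. Names are illustrative only; this is not an Oozie API.
TRANSITIONS = {
    "start": ["mr-job-1"],
    "mr-job-1": ["mr-job-2"],
    "mr-job-2": ["decision"],
    "decision": ["fork", "java-job"],   # e.g. MORE -> fork, ENOUGH -> java-job
    "fork": ["pig-job", "mr-job-3"],    # both branches run in parallel
    "pig-job": ["join"],
    "mr-job-3": ["join"],
    "join": ["java-job"],
    "java-job": ["fs-job"],
    "fs-job": ["end"],
    "end": [],
}


def execution_order(transitions):
    """Return one valid execution order, or raise ValueError on a cycle."""
    predecessors = {node: set() for node in transitions}
    for node, targets in transitions.items():
        for target in targets:
            predecessors.setdefault(target, set()).add(node)
    try:
        return list(TopologicalSorter(predecessors).static_order())
    except CycleError as exc:
        raise ValueError(f"workflow is not a DAG: {exc.args[1]}") from exc


order = execution_order(TRANSITIONS)
print(" -> ".join(order))
```

A definition with a loop (for example a node that transitions back to an earlier one) makes `execution_order` raise, which is the property a DAG-based engine relies on.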


Page 6: October 2013 HUG: Oozie 4.x

Example M/R Action

[Figure: workflow XML for a map-reduce action, with callouts marking the JT and NN, Mapper, Reducer, Queue Name, Input Directory, and Output Directory settings.]
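The slide's screenshot did not survive transcription, so here is a hedged sketch of what such an action definition typically looks like, wrapped in a small Python script that pulls out the six settings named in the callouts. The host names, class names, queue, and paths are placeholders, not values from the talk. The property names are the classic Hadoop 1 `mapred.*` keys, and the fragment omits the enclosing `<workflow-app>` element and its namespace.

```python
import xml.etree.ElementTree as ET

# Illustrative action fragment; every value below is a placeholder.
ACTION_XML = """
<action name="mr-node">
  <map-reduce>
    <job-tracker>jt.example.com:8021</job-tracker>
    <name-node>hdfs://nn.example.com:8020</name-node>
    <configuration>
      <property><name>mapred.mapper.class</name><value>org.example.MyMapper</value></property>
      <property><name>mapred.reducer.class</name><value>org.example.MyReducer</value></property>
      <property><name>mapred.job.queue.name</name><value>default</value></property>
      <property><name>mapred.input.dir</name><value>/user/joe/input</value></property>
      <property><name>mapred.output.dir</name><value>/user/joe/output</value></property>
    </configuration>
  </map-reduce>
  <ok to="end"/>
  <error to="kill"/>
</action>
"""


def summarize_action(xml_text):
    """Extract the settings called out on the slide from a map-reduce action."""
    action = ET.fromstring(xml_text)
    mr = action.find("map-reduce")
    props = {p.findtext("name"): p.findtext("value") for p in mr.iter("property")}
    return {
        "job_tracker": mr.findtext("job-tracker"),
        "name_node": mr.findtext("name-node"),
        "mapper": props.get("mapred.mapper.class"),
        "reducer": props.get("mapred.reducer.class"),
        "queue": props.get("mapred.job.queue.name"),
        "input_dir": props.get("mapred.input.dir"),
        "output_dir": props.get("mapred.output.dir"),
    }


summary = summarize_action(ACTION_XML)
for key, value in summary.items():
    print(f"{key:12} {value}")
```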

Page 7: October 2013 HUG: Oozie 4.x


Workflow State Transitions

Source: Chicago HUG, Dec 2012

Page 8: October 2013 HUG: Oozie 4.x

Oozie (Coordinator) – A Scheduler

§  Oozie executes workflow based on
    §  time dependency (frequency)
    §  data dependency
§  Introduced in 2.x

[Figure: components Oozie Client, WS API, Oozie Server (containing Oozie Coordinator and Oozie Workflow), and HDFS/HCat; the link to HDFS/HCat is labeled "Check Data Availability".]

Page 9: October 2013 HUG: Oozie 4.x


Oozie (Bundle) – A Pipeline Framework

§  Users can define and execute a "bundle" of coordinator apps
    §  large scale data processing (inter-related coordinators)
    §  operability and manageability of pipelines
§  User can start/stop/suspend/resume/rerun at the bundle level
§  Introduced in 3.x, bundles are optional

[Figure: same components as the coordinator diagram (Oozie Client, WS API, Oozie Server with Oozie Coordinator and Oozie Workflow, HDFS/HCat, "Check Data Availability"), with a Bundle added in the Oozie Server.]

Page 10: October 2013 HUG: Oozie 4.x


Layers of Abstraction in Oozie

[Figure: three nested layers.]

1. Bundle: groups Coord Jobs
2. Coordinator: each Coord Job creates Coord Actions, and each Coord Action runs a WF Job
3. Workflow: each WF Job runs individual jobs (M/R Job, PIG Job)

Page 11: October 2013 HUG: Oozie 4.x


Architectural Overview

[Figure: Oozie (Java Web-App) architecture. Components shown:]

§  Security
§  Web Services (JSON/REST API): WS API and WS Callback
§  DAG Engine
§  Oracle DB, WF store, WF lib
§  Instrumentation
§  Commands, executed asynchronously via the Command Queue: submit, start, rerun, resume, kill, suspend, info, start action, end action, check action, callback, signal job, notification
§  Command Executor Thread Pool
§  Recovery Daemon Thread
§  Action Executors (M/R, fs, Pig, sub-wf): pluggable, to support additional action types

Page 12: October 2013 HUG: Oozie 4.x


Oozie Security, Multi-tenancy and Scalability

[Figure: the Oozie Server interacting with the Hadoop Cluster (YARN RM, Launcher Mapper, Actual M/R Job). Numbered steps:]

1. Auth.: End User (Kerberos, Y! specific)
2. Create Launcher Job (super-user)
3. Execute User Job (doAs)
4. Response
5. Async Callback

Page 13: October 2013 HUG: Oozie 4.x

USE CASES

Page 14: October 2013 HUG: Oozie 4.x


Use Case 1: Time Triggers

Execute your workflow every 15 minutes

00:15 00:30 00:45 01:00
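As a rough sketch of what a pure time trigger means, the snippet below lists the nominal times at which a 15-minute schedule would fire. The function name and the dates are made up for illustration, and it assumes the start of the range is inclusive and the end exclusive.

```python
from datetime import datetime, timedelta


def nominal_times(start, end, frequency_minutes):
    """Yield nominal times from start (inclusive) to end (exclusive)."""
    step = timedelta(minutes=frequency_minutes)
    current = start
    while current < end:
        yield current
        current += step


times = list(
    nominal_times(
        datetime(2013, 10, 16, 0, 15),
        datetime(2013, 10, 16, 1, 15),
        frequency_minutes=15,
    )
)
print([t.strftime("%H:%M") for t in times])
```

For the range above this gives the four slots on the slide's timeline: 00:15, 00:30, 00:45, and 01:00.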

Use Cases and Common Patterns

Page 15: October 2013 HUG: Oozie 4.x


Use Case 2: Time and Data Triggers

Materialize your workflow every hour, but only run it when the input data (which is loaded onto the grid every hour) is ready.

[Figure: timeline of hourly runs at 01:00, 02:00, 03:00, 04:00; each run first checks whether its input data exists on Hadoop.]

Page 16: October 2013 HUG: Oozie 4.x

Use Case 2: Time and Data Triggers

<coordinator-app name="coord1" frequency="${1*HOURS}"…>
  <datasets>
    <dataset name="logs" frequency="${1*HOURS}" initial-instance="2009-01-01T23:59Z">
      <uri-template>hdfs://bar:9000/app/logs/${YEAR}/${MONTH}/${DAY}/${HOUR}</uri-template>
    </dataset>
  </datasets>
  <input-events>
    <data-in name="inputLogs" dataset="logs">
      <instance>${current(0)}</instance>
    </data-in>
  </input-events>
  <action>
    <workflow>
      <app-path>hdfs://bar:9000/usr/abc/logsprocessor-wf</app-path>
      <configuration>
        <property>
          <name>inputData</name><value>${dataIn('inputLogs')}</value>
        </property>
      </configuration>
    </workflow>
  </action>
</coordinator-app>

Callouts on the slide:

§  <datasets>: Dataset Definition
§  <input-events>: Input Events Definition, with time of coordinator action materialized (created)
§  <action>: Action Definition
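To make the data trigger concrete, the sketch below resolves the slide's `uri-template` for one materialized action and checks whether that path is present. It only mimics the idea; `resolve_uri`, `input_ready`, and the set of existing paths are hypothetical helpers, not how Oozie itself is implemented.

```python
from datetime import datetime

# Template copied from the coordinator definition on the slide.
URI_TEMPLATE = "hdfs://bar:9000/app/logs/${YEAR}/${MONTH}/${DAY}/${HOUR}"


def resolve_uri(template, instance_time):
    """Substitute the date tokens in a dataset URI template."""
    replacements = {
        "${YEAR}": f"{instance_time:%Y}",
        "${MONTH}": f"{instance_time:%m}",
        "${DAY}": f"{instance_time:%d}",
        "${HOUR}": f"{instance_time:%H}",
    }
    for token, value in replacements.items():
        template = template.replace(token, value)
    return template


def input_ready(uri, existing_paths):
    """Stand-in for the 'does the data exist?' check (here: set membership)."""
    return uri in existing_paths


uri = resolve_uri(URI_TEMPLATE, datetime(2013, 10, 16, 3, 0))
print(uri)
print(input_ready(uri, {"hdfs://bar:9000/app/logs/2013/10/16/02"}))
```

With only the 02:00 directory present, the 03:00 action stays waiting; once the 03:00 directory appears, the same check succeeds and the workflow can run.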

Page 17: October 2013 HUG: Oozie 4.x


Use Case 3: Rolling Window

Access 15 minute datasets and roll them up into hourly datasets

[Figure: the 15-minute instances 00:15, 00:30, 00:45, 01:00 roll up into the 01:00 hourly dataset; 01:15, 01:30, 01:45, 02:00 roll up into the 02:00 hourly dataset.]
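A minimal sketch of the rolling-window arithmetic: for an hourly action, the inputs are the four 15-minute instances ending at the action's nominal time. In a coordinator this range would normally be written with a start instance of `${coord:current(-3)}` and an end instance of `${coord:current(0)}`. The helper below is hypothetical and assumes the nominal time is aligned with a dataset instance.

```python
from datetime import datetime, timedelta


def window_instances(nominal_time, dataset_minutes, start_offset, end_offset):
    """Return dataset instances from current(start_offset) to current(end_offset)."""
    step = timedelta(minutes=dataset_minutes)
    return [nominal_time + i * step for i in range(start_offset, end_offset + 1)]


rollup_inputs = window_instances(
    datetime(2013, 10, 16, 1, 0), dataset_minutes=15, start_offset=-3, end_offset=0
)
print([t.strftime("%H:%M") for t in rollup_inputs])
```

Each hourly run consumes a disjoint set of instances, which is what distinguishes a rolling window from a sliding one.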

Page 18: October 2013 HUG: Oozie 4.x


Use Case 4: Sliding Window

Access last 24 hours of data, and roll them up every hour

[Figure: three consecutive hourly runs. The first reads 01:00 through 24:00 and produces 24:00; the second reads 02:00 through +1 day 01:00 and produces +1 day 01:00; the third reads 03:00 through +1 day 02:00 and produces +1 day 02:00.]
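The sliding window differs from the rolling one in that consecutive runs share most of their inputs. The sketch below (a hypothetical helper with made-up dates) computes the last 24 hourly instances for two consecutive runs and counts the overlap.

```python
from datetime import datetime, timedelta


def last_n_hours(nominal_time, n=24):
    """Return the n hourly instances ending at nominal_time, oldest first."""
    return [nominal_time - timedelta(hours=k) for k in range(n - 1, -1, -1)]


run_1 = last_n_hours(datetime(2013, 10, 17, 0, 0))  # 01:00 .. 24:00
run_2 = last_n_hours(datetime(2013, 10, 17, 1, 0))  # 02:00 .. +1 day 01:00
shared = sorted(set(run_1) & set(run_2))

print(run_1[0], "to", run_1[-1])
print(run_2[0], "to", run_2[-1])
print(len(shared), "instances shared between consecutive runs")
```

Because each run drops only the oldest hour and adds the newest, 23 of the 24 inputs are reused from one run to the next.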

Page 19: October 2013 HUG: Oozie 4.x

Where are We Today

Proven Scale and Multi-tenancy

§  17 clusters
§  13,000 jobs/server/day
§  2.8 M jobs/month
§  16% of all Hadoop jobs
§  75 products
§  2,000+ projects
§  255 monthly users
§  5.4 M compute hrs/month
§  770,000 workflows
§  Between 1-8 actions
§  Avg. 4 actions/workflow
§  250 coordinator jobs/day
§  67% of Oozie jobs kicked off through coordinators

Page 20: October 2013 HUG: Oozie 4.x


Mix Of Job Types For Workflows

[Figure: stacked bar chart of workflow jobs by type, on a 0% to 100% axis. Legend: Pig, MapReduce, Java, Other. Segment labels: 39%, 29%, 28%, 4%.]

SAMPLE USE OF JOB TYPES

§  Pig
    §  Data processing/filtering
    §  Aggregation
§  MapReduce
    §  Publishing data (HDFS/HCat)
§  Java
    §  Legacy code and logic
§  Others
    §  Distcp and shell
    §  Data copy/transfer

Page 21: October 2013 HUG: Oozie 4.x

FEATURE DEEP-DIVE

Page 22: October 2013 HUG: Oozie 4.x

Existing Features (Oozie 3.x)

§  HBase access through Oozie, via credentials

§  HCatalog access through Oozie, via credentials

§  Email action

§  DistCp action (intra as well as inter-cluster copy)

§  Shell action (run any script e.g. perl, python, hadoop CLI)

§  Workflow dry-run & Fork-Join validation

§  Bulk monitoring (REST API)

§  Coordinator EL functions for parameterized workflows

§  Job DAG

What’s New in Oozie

Page 23: October 2013 HUG: Oozie 4.x

HBase Credentials

§  Add in workflow.xml
    §  Add a "credentials" section. The type is "hbase".
    §  Specify the action (e.g. java or map-reduce) that uses the credentials, via its cred attribute.
§  Put hbase-site.xml in the Oozie application path, and use <file> in workflow.xml to put hbase-site.xml in the distributed cache. A copy of hbase-site.xml can be found in gateway:/home/gs/conf/hbase/hbase-site.xml.
§  Put jars "guava-*.jar, zookeeper-*.jar, hbase-*.jar, protobuf-java-*.jar" in the workflow "lib" dir
§  Make sure you are using Oozie XSD version 0.3 or above for the <credentials> tag.

<workflow-app name="foo-wf" xmlns="uri:oozie:workflow:0.3">
    <credentials>
        <credential name="hbase.cert" type="hbase">
            <!-- optional properties: zookeeper.znode.parent, hbase.zookeeper.quorum -->
        </credential>
    </credentials>
    <start to="map-reduce-action"/>
    <action name="map-reduce-action" cred="hbase.cert">
        <map-reduce>
            <configuration>
                <property>
                    <name>mapred.mapper.class</name>
                    <value>SampleMapperHBase</value>
                </property>
                <property>
                    <name>mapred.reducer.class</name>
                    <value>org.apache.oozie.example.DemoReducer</value>
                </property>
            </configuration>
            <file>hbase-site.xml#hbase-site.xml</file>
        </map-reduce>
        ...
    </action>
    ...
</workflow-app>

§  Refer to http://twiki.corp.yahoo.com/view/CCDI/UseHbaseCred


Page 24: October 2013 HUG: Oozie 4.x

Oozie 4.0

1. HCatalog Integration
2. Job Notifications
3. SLA Monitoring

Page 25: October 2013 HUG: Oozie 4.x

HCatalog Integration

§  Oozie now supports HCatalog datasets, in addition to HDFS
    §  Query HCat server directly -OR-
    §  Receive 'partition created' notifications
§  With HDFS datasets, poll NameNode to check data availability
    §  Delay
    §  Single source

[Figure: Oozie repeatedly asks the NameNode "data exists?" for HDFS paths /data/click/2013/03/10, /data/click/2013/03/11, /data/click/2013/03/12, …]

Page 26: October 2013 HUG: Oozie 4.x

Latest Oozie 4.0 Features: HCatalog Integration

›  HCat metastore has info about HDFS datasets, locations and file formats.
›  Using HCat loader and storer, a dataset can be consumed uniformly using Pig, Hive and Map/Reduce in Oozie, using the "database, table, partition" abstraction.
›  Oozie is notified of partition availability via JMS messages, to trigger workflows immediately.
›  Use JARs hcatalog-core.jar, webhcat-java-client.jar, hive-common.jar, hive-exec.jar, hive-metastore.jar, hive-serde.jar and libfb303.jar in workflow 'lib'.

§  Docs - http://oozie.apache.org/docs/4.0.0/DG_HCatalogIntegration.html

<coordinator-app name="hcat-coord" … >
  <datasets>
    <dataset name="inp-logs" frequency="${coord:hours(1)}">
      <uri-template>${hcatNode}/${db}/${table}/ds=${YEAR}-${MONTH}-${DAY};region=${region}</uri-template>
      <done-flag></done-flag>
    </dataset>
    <dataset name="out-logs" frequency="${coord:days(1)}">
      <uri-template>${hcatNode}/${db}/${outputtable}/ds=${dataOut};region=${region}</uri-template>
      <done-flag></done-flag>
    </dataset>
  ...
  <property>
    <name>FILTER</name>
    <value>${coord:dataInPartitionFilter('input', 'pig')}</value>
  </property>

Pig action script:

A = load '$DB.$TABLE' using org.apache.hcatalog.pig.HCatLoader();
B = FILTER A BY $FILTER;
C = foreach B generate foo, bar;
store C into '$OUTPUT_DB.$OUTPUT_TABLE' USING org.apache.hcatalog.pig.HCatStorer('$OUTPUT_PARTITION');
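The "database, table, partition" abstraction is visible in the dataset URI itself. As a rough sketch, the snippet below splits a resolved HCatalog URI of the shape used in the `uri-template` above into its parts and builds a Pig-style filter from the partition keys. The server name, database, and table are placeholders, both helpers are hypothetical, and the exact string that `coord:dataInPartitionFilter` generates may differ from this approximation.

```python
from urllib.parse import urlsplit


def parse_hcat_uri(uri):
    """Split hcat://server:port/db/table/key=value;key=value into its parts."""
    parts = urlsplit(uri)
    db, table, partition_spec = parts.path.lstrip("/").split("/", 2)
    partition = dict(pair.split("=", 1) for pair in partition_spec.split(";"))
    return {"server": parts.netloc, "db": db, "table": table, "partition": partition}


def pig_filter(partition):
    """Approximate a Pig filter expression over the partition keys."""
    return " AND ".join(f"{key}=='{value}'" for key, value in partition.items())


# Placeholder URI: host, database, and table are made up for illustration.
info = parse_hcat_uri("hcat://hcat.example.com:9083/logsdb/clicks/ds=2013-03-12;region=us")
print(info)
print(pig_filter(info["partition"]))
```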

Page 27: October 2013 HUG: Oozie 4.x

With HCatalog + Notifications High-level Diagram

[Figure, first build: components HCatalog, Data Producer, and HDFS. The Data Producer produces data (distcp, pig, M/R..) at /data/click/2013/03/12 on HDFS, then updates the HCatalog metadata: ALTER TABLE click ADD PARTITION(data='2013/03/12') location 'hdfs://data/click/2013/03/12'.]

Page 28: October 2013 HUG: Oozie 4.x

With HCatalog + Notifications High-level Diagram

[Figure, second build: Oozie and a Message Bus (e.g., ActiveMQ) are added next to HCatalog, Data Producer, and HDFS.]

1. Query/Poll Partition
2. Register Topic

Page 29: October 2013 HUG: Oozie 4.x

With HCatalog + Notifications High-level Diagram

[Figure, final build: components Oozie, Message Bus (e.g., ActiveMQ), HCatalog, Data Producer, and HDFS.]

1. Query/Poll Partition
2. Register Topic

The Data Producer produces data (distcp, pig, M/R..) at /data/click/2013/03/12 and updates the metadata: ALTER TABLE click ADD PARTITION(data='2013/03/12') location 'hdfs://data/click/2013/03/12'.

3. Push notification <New Partition>
4. Notify New Partition

Oozie then starts the workflow.

Page 30: October 2013 HUG: Oozie 4.x

Latest Oozie 4.0 Features: Job Notifications

§  Notification event sent on jobs' status change
§  Messages sent on the configured JMS-compliant message broker
§  Users should write message listeners to listen on select topics (e.g. username)
§  To filter more, apply JMS selectors on messages.
    §  E.g. user, jobid, app-type, status, msg-type (JOB or SLA).
§  Docs - http://oozie.apache.org/docs/4.0.0/DG_JMSNotifications.html

Filter desired app-types for notification:

<property>
  <name>oozie.service.EventHandlerService.filter.app.types</name>
  <value>workflow_job, workflow_action, coordinator_job, coordinator_action</value>
</property>

Notification Msg Example: Coordinator Action Failure Event

›  Header (Selectors)
    •  AppType: Coordinator_Action
    •  Status: FAILURE
    •  User
    •  App-Name
›  Message Body (JSON)
    •  ID (coord action id)
    •  Parent ID (coord Job ID)
    •  NominalTime
    •  StartTime
    •  EndTime
    •  Status: FAILED, KILLED, SUSPENDED, TIMEDOUT
    •  Error-Code, Error-Message (if KILLED or FAILED)
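The split between selector headers and a JSON body can be illustrated without a broker. In the sketch below a message is just a dictionary of headers plus a JSON string. The header keys (`appType`, `eventStatus`, `user`), the JSON field names, and the sample IDs are assumptions made for this example and should be checked against the JMS notifications documentation before use.

```python
import json


def wants_alert(headers, user):
    """Mimic a selector: failed coordinator actions belonging to one user."""
    return (
        headers.get("appType") == "COORDINATOR_ACTION"
        and headers.get("eventStatus") == "FAILURE"
        and headers.get("user") == user
    )


def format_alert(body_json):
    """Turn the JSON body into a one-line alert."""
    body = json.loads(body_json)
    return (
        f"{body['id']} (parent {body['parentId']}) "
        f"{body['status']}: {body.get('errorCode', '-')}"
    )


# Hypothetical message; all keys and values are illustrative.
headers = {
    "appType": "COORDINATOR_ACTION",
    "eventStatus": "FAILURE",
    "user": "joe",
    "appName": "hcat-coord",
}
body = json.dumps(
    {
        "id": "0000001-coord@1",
        "parentId": "0000001-coord",
        "status": "KILLED",
        "errorCode": "E0000",
        "errorMessage": "example failure",
    }
)

if wants_alert(headers, user="joe"):
    print(format_alert(body))
```

With a real broker, the header test would be pushed down into a JMS selector string so that unwanted messages never reach the listener.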

Page 31: October 2013 HUG: Oozie 4.x

Latest Oozie 4.0 Features: SLA Monitoring

§  Oozie can actively track SLAs on jobs'
    §  Start-time, End-time, Duration
§  Event Status
    §  START_MET, START_MISS
    §  END_MET, END_MISS
    §  DURATION_MET, DURATION_MISS
§  At any time, the SLA processing stage will reflect:
    §  Not_Started <-- Job not yet begun
    §  In_Process <-- Job started and is running, and SLAs are being tracked
    §  Met <-- caused by an END_MET
    §  Miss <-- caused by an END_MISS
§  Access/Filter SLA info via
    §  Web-console dashboard
    §  REST API
    §  JMS Messages
    §  Email alert
§  Docs - http://oozie.apache.org/docs/4.0.0/DG_SLAMonitoring.html

<workflow-app xmlns="uri:oozie:workflow:0.5" xmlns:sla="uri:oozie:sla:0.2" name="sla-wf">
  ...
  <end name="end"/>
  <sla:info>
    <sla:nominal-time>${nominalTime}</sla:nominal-time>
    <sla:should-start>${shouldStart}</sla:should-start>
    <sla:should-end>${shouldEnd}</sla:should-end>
    <sla:max-duration>${duration}</sla:max-duration>
    <sla:alert-events>start_miss,end_miss</sla:alert-events>
    <sla:alert-contact>joe@yahoo</sla:alert-contact>
  </sla:info>
</workflow-app>
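How the three event statuses and the processing stage relate can be sketched as plain comparisons. This is a simplified model, not Oozie's implementation: the function names and the sample times are invented, and a real tracker can also report a miss while the job is still running once a deadline has passed.

```python
from datetime import datetime, timedelta


def sla_events(expected_start, expected_end, max_duration, actual_start, actual_end):
    """Compare actual times against the SLA and return the three event statuses."""
    duration = actual_end - actual_start
    return [
        "START_MET" if actual_start <= expected_start else "START_MISS",
        "END_MET" if actual_end <= expected_end else "END_MISS",
        "DURATION_MET" if duration <= max_duration else "DURATION_MISS",
    ]


def sla_stage(actual_start, actual_end, expected_end):
    """Return the processing stage (simplified: Met/Miss is decided at job end)."""
    if actual_start is None:
        return "Not_Started"
    if actual_end is None:
        return "In_Process"
    return "Met" if actual_end <= expected_end else "Miss"


# Made-up example: start within 5 min, end within 30 min, run for at most 20 min.
nominal = datetime(2013, 10, 16, 1, 0)
expected_start = nominal + timedelta(minutes=5)
expected_end = nominal + timedelta(minutes=30)
actual_start = nominal + timedelta(minutes=10)
actual_end = nominal + timedelta(minutes=25)

events = sla_events(
    expected_start, expected_end, timedelta(minutes=20), actual_start, actual_end
)
print(events)
print(sla_stage(actual_start, actual_end, expected_end))
```

In this example the job starts late but finishes on time and within its duration budget, so only a start miss would be alerted under the `start_miss,end_miss` setting shown above.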

Page 32: October 2013 HUG: Oozie 4.x

SLA Monitoring Dashboard


Page 33: October 2013 HUG: Oozie 4.x

Checking Oozie Job


1. CLI (yoozie_client)

$ oozie job -oozie http://localhost:11000/oozie -info 14-20090525161321-oozie-joe
------------------------------------------------------------------------------
Workflow Name : map-reduce-wf
App Path      : hdfs://localhost:8020/user/joe/workflows/map-reduce
Status        : SUCCEEDED
Run           : 0
User          : joe
Group         : users
Created       : 2009-05-26 05:01
Started       : 2009-05-26 05:01
Ended         : 2009-05-26 05:01

Actions
------------------------------------------------------------------------------
Action Name  Type        Status  Transition  External Id            External Status  Error Code  Start             End
------------------------------------------------------------------------------
hadoop1      map-reduce  OK      end         job_200904281535_0254  SUCCEEDED        -           2009-05-26 05:01  2009-05-26 05:01
------------------------------------------------------------------------------
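The same job information is available over the web-services API. As a hedged sketch, the snippet below builds the `v1/job/<job-id>?show=info` URL for the job in the CLI example and includes an optional fetch helper. The helper assumes an unsecured server and that the JSON response carries `appName` and `status` fields. A Kerberos-protected server would need an authenticated client instead.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen


def job_info_url(oozie_url, job_id):
    """Build the REST URL that corresponds to 'oozie job -info <job-id>'."""
    query = urlencode({"show": "info"})
    return f"{oozie_url.rstrip('/')}/v1/job/{quote(job_id)}?{query}"


def fetch_job_status(oozie_url, job_id, timeout=10):
    """Fetch job info and return (appName, status); assumes no authentication."""
    with urlopen(job_info_url(oozie_url, job_id), timeout=timeout) as response:
        info = json.load(response)
    return info.get("appName"), info.get("status")


url = job_info_url("http://localhost:11000/oozie", "14-20090525161321-oozie-joe")
print(url)
```

Only the URL is built here; `fetch_job_status` is defined but not called, since it needs a running Oozie server.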

Demo

Page 34: October 2013 HUG: Oozie 4.x

Checking / Debugging Oozie Jobs


2. Web-Console e.g. http://my-oozie-server:4080/oozie

Docs - https://cwiki.apache.org/confluence/display/OOZIE/Map+Reduce+Cookbook


Page 35: October 2013 HUG: Oozie 4.x

What else is out there?

Page 36: October 2013 HUG: Oozie 4.x

Oozie vs. Other Workflow Systems

Attribute | Oozie | System from LinkedIn | System from Spotify
----------|-------|----------------------|--------------------
Champion | Yahoo! (now ASF) | LinkedIn | Spotify
Apache Affiliation | TLP | License only | License only
Language | Java | Java | Python
Adoption | High, part of all standard Hadoop distributions | Low | Low
Code Complexity | High (>100K lines) | Medium (<50K lines) | Low (<10K lines)
Hadoop Job Support | Extensive built-in support | Limited job types | Limited job types
Docs & Support | Excellent | Limited | Limited
Auth. | Kerberos, custom | xml-based, custom | Linux-based
Reruns | Yes (recovery, retries at all levels) | Partial | After removing output, idempotent
UI | Average | Good | -

Oozie at ASF

Page 37: October 2013 HUG: Oozie 4.x


The Next Release

§  Scalability and performance improvements to handle higher loads

§  More 1 and 5 min frequency jobs

§  High Availability with Load Balancing

§  Flexible Cron-Based Scheduling

§  Handling cluster Rolling upgrades for Hadoop 2.0

Roadmap

Page 38: October 2013 HUG: Oozie 4.x

Q & A

Page 39: October 2013 HUG: Oozie 4.x
