

    IBM Information Platform and Solutions

    Center of Excellence

    IBM IPS Parallel Framework Standard Practices

    Administration and Management:

    DataStage EE Administration and Production Automation

    Prepared by IBM Information Platform and Solutions Center of Excellence

    October 29, 2007

CONFIDENTIAL, PROPRIETARY, AND TRADE SECRET NATURE OF ATTACHED DOCUMENTS

This document is Confidential, Proprietary and Trade Secret Information (Confidential Information) of IBM, Inc. and is provided solely for the purpose of evaluating IBM products with the understanding that such Confidential Information will be disclosed only to those who have a need to know. The attached documents constitute Confidential Information as they include information relating to the business and/or products of IBM (including, without limitation, trade secrets, technical, business, and financial information) and are trade secret under the laws of the State of Massachusetts and the United States.

Copyrights

2007 IBM Information Platform and Solutions. All rights reserved. No part of this publication may be reproduced, transmitted, transcribed, stored in a retrieval system, or translated into any language in any form by any means without the written permission of IBM. While every precaution has been taken in the preparation of this document to reflect current information, IBM assumes no responsibility for errors or omissions or for damages resulting from the use of information contained herein.


Document Goals

Intended Use: This document presents a set of standard practices, methodologies, and an example Toolkit for administering and integrating IBM WebSphere DataStage Enterprise Edition (DSEE) with a production infrastructure. Except where noted, this document is intended to supplement, not replace, the installation documentation.

Target Audience: The primary audience for this document is DataStage Administrators and Developers who have been trained in Enterprise Edition. Information in certain sections may also be relevant for Technical Architects and System Administrators.

Product Version: This document is intended for the following product releases:
- WebSphere DataStage Enterprise Edition 7.5.2 (UNIX, Linux, USS)
- WebSphere DataStage Enterprise Edition 7.5x2 (Windows)

Document Author and Contributors

Author:
Mike Carney, Advanced Consulting Engineer, [email protected]
Paul Christensen, Global Technical Architect, [email protected]
Bob Johnston, Advanced Consulting Engineer, [email protected]
Owen, Advanced Consulting Engineer, [email protected]
Mike Ruland, Global Technical Architect, [email protected]

Contributing Authors:
Jim Tsimis, Advanced Support Engineer, [email protected]

Document Revision History

Date                Rev.   Description
April 27, 2006      1.0    Initial release.
July 17, 2006       1.1    Updated ETL and Project_Plus directory hierarchies for consistency across DSEE Standards. Added Staging directory hierarchy.
August 15, 2006     1.2    Updated styles and formatting.
October 5, 2006     1.3    Updated directory and Project_Plus naming standards for consistency across deliverables. Updated terminology and Naming Standards for consistency. Expanded discussion of Environment Variables and Parameters. Added Environment Variable Reference Appendix. Added Document Author and Contributors, and Package Contents.
October 17, 2006    1.4    Added Feedback section and IIS Services Offerings. Corrected Data Set and Scratch file system naming. Expanded backup discussion for Data Sets.
February 8, 2007    2.0    Updated positioning, naming (IIS to IPS), Services Offerings.
October 29, 2007    3.0    First public reference release; complements the Administration and Production Automation Services Workshop.

Document Conventions

This document uses the following conventions:

Convention: Usage

Bold: In syntax, bold indicates commands, function names, keywords, and options that must be input exactly as shown. In text, bold indicates keys to press, function names, and menu selections.

Italic: In syntax, italic indicates information that you supply. In text, italic also indicates UNIX commands and options, file names, and pathnames.

Plain: In text, plain indicates Windows NT commands and options, file names, and pathnames.

Bold Italic: Indicates important information.

    mailto:[email protected]:[email protected]:[email protected]:[email protected]:[email protected]:[email protected]:[email protected]:[email protected]:[email protected]:[email protected]:[email protected]:[email protected]
  • 8/9/2019 DSEE Administration SP 20071029 Ptc[1]

    3/72

    IBM IPS Parallel Framework: Admini stration and Production Automation October 29, 2007 3 of 72 2007 IBM Information Integration Solutions. All rights reserved. No part of this publication may be reproduced, transmitted, transcribed, stored in aretrieval system, or translated into any language in any form by any means without the written permission of IBM.

    IBM Information Platform and Solutions

    Center of Excellence

Lucida Console: Lucida Console text indicates examples of source code and system output.

Lucida Console Bold: In examples, Lucida Console bold indicates characters that the user types or keys the user presses.

Lucida Blue: In examples, Lucida Blue is used to illustrate the operating system command-line prompt.

A right arrow between menu commands indicates that you should choose each command in sequence. For example, "Choose File → Exit" means you should choose File from the menu bar, and then choose Exit from the File pull-down menu.

This line continues: The continuation character is used in source code examples to indicate a line that is too long to fit on the page, but must be entered as a single line on screen.

The following are also used:

- Syntax definitions and examples are indented for ease in reading.
- All punctuation marks included in the syntax (for example, commas, parentheses, or quotation marks) are required unless otherwise indicated.
- Syntax lines that do not fit on one line in this manual are continued on subsequent lines. The continuation lines are indented. When entering syntax, type the entire syntax entry, including the continuation lines, on the same input line.
- Text enclosed in parentheses and underlined (like this) following the first use of a proper term will be used instead of the proper term.
- Interaction with our example system will usually include the system prompt (in blue) and the command, most often on two or more lines.

If appropriate, the system prompt will include the user name and directory for context. For example:

%etl_node%:dsadm /usr/dsadm/Ascential/DataStage >
/bin/tar cvf /dev/rmt0 /usr/dsadm/Ascential/DataStage/Projects

Feedback

We value your input and suggestions for continuous improvement of this content. Direct any questions, comments, corrections, or suggested additions to: [email protected]

    mailto:[email protected]:[email protected]
  • 8/9/2019 DSEE Administration SP 20071029 Ptc[1]

    4/72

    IBM IPS Parallel Framework: Admini stration and Production Automation October 29, 2007 4 of 72 2007 IBM Information Integration Solutions. All rights reserved. No part of this publication may be reproduced, transmitted, transcribed, stored in aretrieval system, or translated into any language in any form by any means without the written permission of IBM.

    IBM Information Platform and Solutions

    Center of Excellence

Table of Contents

1 IBM INFORMATION PLATFORM AND SOLUTIONS SERVICES ..... 5

2 DATASTAGE ADMINISTRATION ..... 9
  2.1 CONFIGURING DATASTAGE ENVIRONMENTS FOR A SYSTEM LIFE CYCLE ..... 9
  2.2 CONFIGURING DATASTAGE FILE SYSTEMS AND DIRECTORIES ..... 11
  2.3 ADMINISTRATOR TIPS ..... 16
  2.4 PERFORMANCE MONITORING ..... 18
  2.5 SECURITY, ROLES, DATASTAGE USER ACCOUNTS ..... 24
  2.6 THE DATASTAGE ADMINISTRATOR PROJECT CONFIGURATION ..... 26

3 JOB MONITOR ..... 30
  3.1 CONFIGURATION ..... 30
  3.2 JOB MONITOR ENVIRONMENT VARIABLES ..... 30
  3.3 STARTING & STOPPING THE MONITOR ..... 30
  3.4 MONITORING JOBMON ..... 31

4 BACKUP / RECOVERY / REPLICATION / FAILOVER PROCEDURES ..... 32
  4.1 DATASTAGE CONDUCTOR BACKUP ..... 32
  4.2 DATASTAGE PROJECT BACKUPS ..... 32
  4.3 DATASTAGE EXPORTS FOR PARTIAL BACKUP ..... 33
  4.4 DATA SETS, LOOKUP FILE SETS AND FILE SETS ..... 34
  4.5 EXTERNAL ENTITIES SCRIPTS, ROUTINES, STAGING FILES ..... 34
  4.6 REPLICATING THE DATASTAGE ENVIRONMENT ..... 34
  4.7 IMPORTANT PROJECT FILE SYSTEM CONSIDERATIONS ..... 37

5 OVERVIEW OF PRODUCTION AUTOMATION AND INFRASTRUCTURE INTEGRATION FOR DATASTAGE ..... 39
  5.1 DATASTAGE JOB CONTROL DEVELOPMENT KIT ..... 40
  5.2 JOB SEQUENCER ..... 41
  5.3 EXCEPTION HANDLING ..... 41
  5.4 CHECKPOINT RESTART ..... 42

6 JOB PARAMETER AND ENVIRONMENT VARIABLE MANAGEMENT ..... 46
  6.1 DATASTAGE ENVIRONMENT VARIABLES ..... 46
  6.2 DATASTAGE JOB PARAMETERS ..... 49
  6.3 AUDIT AND METRICS REPORTING IN AN AUTOMATED PRODUCTION ENVIRONMENT ..... 54
  6.4 INTEGRATING WITH EXTERNAL SCHEDULERS ..... 54
  6.5 INTEGRATING WITH ENTERPRISE MANAGEMENT CONSOLES ..... 55

7 CHANGE MANAGEMENT ..... 56
  7.1 SOURCE CONTROL ..... 56
  7.2 PRODUCTION MIGRATION LIFE CYCLE ..... 57
  7.3 SECURITY ..... 58
  7.4 UPGRADE PROCEDURE (INCLUDING FALLBACK EMERGENCY PATCH) ..... 58

APPENDIX A: PROCESSES CREATED AT RUNTIME BY DATASTAGE EE ..... 60
APPENDIX B: ENVIRONMENT VARIABLE REFERENCE ..... 67


1 IBM Information Platform and Solutions Services

IBM Information Platform and Solutions (IPS) Professional Services offers a broad range of workshops and services designed to help you achieve success in the design, implementation, and rollout of critical information integration projects.

Figure 1: IBM IPS Services Overview (services shown: Iterations 2 Methodology, Standard Practices, Architecture and Design, Education and Mentoring, Virtual Services, Certification)

Services Offerings

Staff Augmentation and Mentoring: Whether through workshop delivery, project leadership, or mentored augmentation, the Professional Services staff of IBM Information Platform and Solutions leverages IBM's methodologies, Standard Practices, and experience developed throughout thousands of successful engagements in a wide range of industries and government entities.

Learning Services: IBM offers a variety of courses covering the IPS product portfolio. IBM's blended learning approach is based on the principle that people learn best when provided with a variety of learning methods that build upon and complement each other. With that in mind, courses are delivered through a variety of mechanisms: classroom, on-site, and Web-enabled FlexLearning.

Certification: IBM offers a number of Professional Certifications offered through independent testing centers worldwide. These certification exams provide a reliable, valid, and fair method of assessing product skills and knowledge gained through classroom and real-world experience.

Client Support Services: IBM is committed to providing our customers with reliable technical support worldwide. All Client Support services are available to customers who are covered under an active IBM IPS maintenance agreement. Our worldwide support organization is dedicated to assuring your continued success with IPS products and solutions.

Virtual Services: The low-cost Virtual Services offering is designed to supplement the global IBM IPS delivery team, as needed, by providing real-time, remote consulting services. Virtual Services has a large pool of experienced resources that can provide IT consulting, development, Migration, and Training services to customers for WebSphere DataStage Enterprise Edition (DSEE).


Center of Excellence for Data Integration (CEDI)

Establishing a CEDI within your enterprise can help increase efficiency and drive down the cost of implementing data integration projects. A CEDI can be responsible for Competency, Readiness, Accelerated Mentored Learning, Common Business Rules, Standard Practices, Repeatable Processes, and the development of custom methods and components tailored to your business.

IBM IPS Professional Services offerings can be delivered as part of a strategic CEDI initiative, or on an as-needed basis across a project lifecycle:

Figure 2: IPS Services Offerings within an Information Integration Project Lifecycle (project phases: Identify, Strategic Planning, Startup, Analysis & Design, Build, Test & Implement, Monitor & Refine; offerings range from Information Exchange and Discovery through Grid Computing Installation and Deployment)

Project Startup Workshops

Information Exchange and Discovery Workshop: Targeted for clients new to the IBM IPS product portfolio, this workshop provides IBM's high-level recommendations on how to solve a customer's particular problem. IBM analyzes the data integration challenges outlined by the client, and develops a strategic approach for addressing those challenges.

Requirements Definition, Architecture, and Project Planning Workshop: Guiding clients through the critical process of establishing a framework for a successful future project implementation, this workshop delivers a detailed project plan, as well as a Project Blueprint. These deliverables document project parameters, current and conceptual end states, network topology, data architecture, and hardware and software specifications; outline a communication plan; define scope; and capture identified project risk.

Iterations 2: IBM's Iterations 2 is a framework for managing enterprise data integration projects that integrates with existing customer methodologies. Iterations 2 is a comprehensive, iterative, step-by-step approach that leads project teams from initial planning and strategy through to tactical implementation. This workshop includes the Iterations 2 software, along with customized mentoring.


Standard Practices Workshops

Installation and Configuration Workshop: Establishes a documented, repeatable process for installation and configuration of DSEE server and client components. This may involve review and validation of one or more existing DSEE environments, or planning, performing, and documenting a new installation.

Information Analysis Workshop: Provides clients with a set of Standard Practices and a repeatable methodology for analyzing the content, structure, and quality of data sources using the combination of WebSphere ProfileStage, QualityStage, and AuditStage.

Data Flow and Job Design Standard Practices Workshop: Helps clients establish standards and templates for the design and development of parallel jobs using DSEE through practitioner-led application of IBM Standard Practices to a client's environment, business, and technical requirements. The delivery includes a customized Standards document as well as custom job designs and templates for a focused subject area.

Data Quality Management Standard Practices Workshop: Provides clients with a set of standard processes for the design and development of data standardization, matching, and survivorship processes using WebSphere QualityStage. The data quality strategy formulates an auditing and monitoring program that helps ensure ongoing confidence in data accuracy, consistency, and identification through client mentoring and sharing of IBM Standard Practices.

Administration, Management, and Production Automation Workshop: This workshop provides customers with a customized Toolkit and set of proven Standard Practices for integrating DSEE into a client's existing production infrastructure (monitoring, scheduling, auditing/logging, change management) and for administering, managing, and operating DSEE environments.

Advanced Deployment Workshops

Health Check Evaluation: This workshop is targeted for clients currently engaged in IPS development efforts that are not progressing according to plan, or for clients seeking validation of proposed plans prior to the commencement of new projects. It provides review of and recommendations for core ETL development and operational environments by an IBM expert practitioner.

Sizing and Capacity Planning Workshop: Provides clients with an action plan and set of recommendations for meeting current and future capacity requirements for data integration. This strategy is based on analysis of business and technical requirements, data volumes and growth projections, existing standards and technical architecture, and existing and future data integration projects.

Performance Tuning Workshop: Guides a client's technical staff through IBM Standard Practices and methodologies for review, analysis, and performance optimization using a targeted sample of client jobs and environments. This workshop can identify potential areas of improvement, demonstrate IBM's processes and techniques, and provide a final report that contains recommended performance modifications and IBM performance tuning guidelines.

High-Availability Architecture Workshop: Using IBM's IPS Standard Practices for high availability, this workshop presents a plan for meeting a customer's high-availability requirements using the parallel framework of DSEE. It then implements the architectural modifications necessary for high-availability computing.

Grid Computing Discovery, Architecture and Planning Workshop: Provides the planning and readiness efforts required to support a future deployment of the parallel framework of IPS on Grid computing platforms. This workshop prepares the foundation on which a follow-on Grid installation and deployment will be executed, and includes hardware and software recommendations and estimated scope.

Grid Computing Installation and Deployment Workshop: Installs, configures, and deploys the IBM IPS Grid Enabled Toolkit in a client's Grid environments and provides integration with Grid Resource Managers, and configuration of DSEE, QualityStage/EE, and/or ProfileStage/EE.

For more details on any of these IBM IPS Professional Services offerings, and to find a local IBM Information Integration Services contact, visit: http://www.ibm.com/software/data/services/ii.html

Administration, Management and Production Automation Workshop

The following flowchart illustrates the various IPS Services workshops around the parallel framework of DSEE.

The Administration, Management and Production Automation Workshop is intended to provide a set of proven Standard Practices and a customized toolkit for integrating DSEE into a customer's existing production infrastructure (monitoring, scheduling, auditing/logging, change management). It also provides expert practitioner recommendations for administering, managing, and operating DSEE environments.

Figure 3: Services Workshops for the Parallel Framework of DSEE



2 DataStage Administration

This section of the document discusses DataStage Administration and Automation. It endeavors to join these disciplines in a cohesive manner by defining standard practices that are complementary. The standard practices are based on a foundation: the operating environment and a simple life cycle methodology.

2.1 Configuring DataStage Environments for a System Life Cycle

There are many ways to configure a DataStage ETL environment. DataStage EE is a flexible piece of software, presenting the DataStage team with many facets of Configuration, Administration, Design, Development, and Operations to consider. After the software is installed, many customers wonder what to do next. Seeing the end picture of your environment, how users will interact with it, and how it will function will help you decide how to configure, administer, manage, develop, and operate all aspects of the DataStage environment. It will also help you in related areas, such as planning hardware resources and setting up users and security.

You should be familiar with many aspects of a DataStage project, in particular these:

- Projects are both the logical and physical means for storing work performed in DataStage.
- Projects are metadata repositories for DataStage objects, such as jobs, stages, and shared containers.
- Projects also store configuration metadata, like environment variables.
- It is possible to create many projects. Projects are independent of each other.
- DataStage object metadata can be exported to a file as well as imported.

2.1.1 A Simple DataStage Application Life Cycle

It is common to refer to a collection of related DataStage objects, such as jobs and shared containers, as an application. And just as for all software applications, applications developed with DataStage need to be developed and maintained under a life cycle methodology to ensure quality.

The life cycle that is advocated in this Standard Practice only takes into account a subset of larger, more comprehensive methodologies. It is primarily concerned with the Development, Testing, Release, and Maintenance of the DataStage application. It describes how to physically implement the environment for the life cycle and activities related to operation and maintenance. It does not consider aspects of a broader life cycle such as Design and Documentation.

The DataStage application life cycle has at least three phases:
1. Development/Maintenance
2. Testing
3. Production



A more robust life cycle will utilize more testing phases:
1. Development/Maintenance
2. Integration Testing
3. Quality Assurance
4. User Acceptance Testing
5. Production

2.1.2 DataStage Installation and Configuration Considerations

In support of the phases of the life cycle, DataStage project environments should be configured for each phase of the project. That is, per DataStage application, create projects for dev, test, production, etc. In addition to configuring project environments, you will need to consider the system requirements as well. It is a recognized standard practice, and common among customers, to utilize completely different systems for each phase of the life cycle. Consider other resources that will be used by the project, such as disk space, CPU capacity, and memory size, and whether or not the system(s) can support the anticipated or actual workload. Application performance depends on sufficient hardware resources to support the workload that parallel execution puts on a system.

If you plan to execute your DataStage jobs in a distributed fashion on a loosely coupled cluster (MPP, Grid, Cluster) or employ a failover strategy, the DataStage EE environment will need to be replicated on all physical processing nodes. See Replicating the DataStage Environment, section 4.6.

Installation of the DataStage environment requires some careful planning. Choosing hardware resources is particularly important for a DataStage EE environment. Consider separate physical environments for each phase of a life cycle to ensure adequate performance; for example, designate a separate machine for dev, test, and production.

Figure 4: DataStage EE Physical Environments for a Life Cycle (options range from a single mixed development, test, and production machine; to a mixed environment with a combined development and test machine plus a production machine; to the standard practice of separate development, test, and production machines)



Configuration of DataStage user accounts for all phases of the ETL test cycle should be set up to allow for separate developer accounts and a separate account for each phase of the life cycle. Each of these accounts should be configured according to the section on configuring a DataStage user (below).

2.2 Configuring DataStage File Systems and Directories

DataStage Enterprise Edition requires file systems to be available for:

- Software Install Directory: DataStage Enterprise Edition executables, libraries, and pre-built components
- DataStage Project (Repository) Directory
- Data Storage:
  - DataStage temporary storage: scratch, temp, buffer
  - DataStage parallel Data Set segment files
  - Staging and archival storage for any source file(s)

By default, each of these directories (except for file staging) is created during installation as a subdirectory under the base DataStage installation directory.

IMPORTANT: Each storage class should be isolated in a separate file system to accommodate their different performance and capacity characteristics and backup requirements. The default installation is generally acceptable for small prototype environments.
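As an illustration only (all paths here are hypothetical and site-specific), a layout that satisfies this isolation might dedicate one file system to each storage class:

    /opt/IBM/Ascential/DataStage     software install directory
    /ds/projects                     DataStage Projects (repository)
    /ds/datasets                     parallel Data Set segment files
    /ds/scratch                      scratch, temp, and buffer storage
    /ds/staging                      staging and archival storage for source files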

2.2.1 Software Install Directory

The software install directory is created by the installation process, and contains the DSEE software file tree. The install directory grows very little over the life of a major software release, so the default location ($HOME for dsadm, e.g. /home/dsadm) may be adequate.

The system administrator may choose to install DataStage in a subdirectory within an overall install file system. You should verify that the install file system has at least 1GB of space for the software directory (2GB if you are installing RTI or other optional components).

For cluster or Grid implementations, it is generally best to share the Install file system across servers (at the same mount point).

NOTE: The DataStage installer will attempt to rename the installation directory to support later upgrades; if you install directly to a mount point, this rename will fail and several error messages will be displayed. Installation will succeed, but the messages may be confusing.


2.2.2 DataStage Projects (Repository) Directory

The DataStage Projects subdirectory contains the repository (Universe database files) of job designs, design and runtime metadata, logs, and components. Project directories can grow to contain thousands of files and subdirectories depending on the number of projects, the number of jobs, and the volume of logging information retained about each job.

During the installation process, the Projects subdirectory is created in the DataStage install directory. By default, the DataStage Administrator client creates its projects in this Projects subdirectory.

For cluster or Grid implementations, it is generally best to share the Projects file system across servers (at the same mount point).

IMPORTANT: It is a bad practice to create DataStage projects in the default directory within the install file system, as disk space is typically limited. Projects should be created in their own file system.

2.2.2.1 Creating the Projects File System

On most operating systems, it is possible to create separate file systems at non-root levels, allowing a separate file system for the Projects subdirectory within the DataStage installation. Use the following guidelines:

- It is recommended that a separate file system be created and mounted over the default location for projects, the $DSROOT/Projects directory. Mount this directory after installing DSEE but before projects are created.
- The Projects directory should be a mirrored file system with sufficient space (minimum 100MB per project).
- For cluster or Grid implementations, it is generally best to share the Projects file system across servers (at the same mount point).

IMPORTANT: The Projects file system should be monitored to ensure adequate free space remains. If the Projects file system runs out of free space during DataStage activity, the repository may become corrupted, requiring a restore from backup.

Effective management of space is important to the health and performance of a project. As jobs are added to a project, new directories are created in this file tree, and as jobs are run, their log entries multiply. These activities cause file-system stress (for example, more time to insert or delete DataStage components, longer update times for logs). Failure to perform routine project maintenance (for example, removing obsolete jobs and managing log entries) can cause project obesity and performance issues.
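A minimal sketch of the free-space monitoring recommended above, assuming the Projects file system is mounted at /ds/projects and the script runs from cron; the threshold, path, and alert recipient are illustrative:

    #!/bin/sh
    # Warn when the Projects file system exceeds 90% used (illustrative threshold).
    FS=/ds/projects
    PCT=`df -k $FS | tail -1 | awk '{ gsub("%",""); print $(NF-1) }'`
    if [ "$PCT" -gt 90 ]; then
        echo "WARNING: $FS is ${PCT}% full" | mail -s "DataStage Projects file system" dsadm
    fi

Note that df column positions vary slightly by platform; verify the awk field selection on your system before relying on the script.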



The name of a DataStage project is limited to a maximum of 18 characters. The project name can contain alphanumeric characters and underscores.

2.2.2.2 Project Recovery Considerations

Devising a backup scheme for project directories is based on three core issues:

1. Will there be valuable data stored in Server Edition hash files¹? DataStage Server Edition files located in the DataStage file tree may require archiving from a data perspective.

2. How often will the UNIX file system containing the ENTIRE DataStage file tree be backed up? When can DataStage be shut down to enable a cold snapshot of the Universe database as well as the project files? A complete file system backup while DataStage is shut down accomplishes this.

3. How often will the projects be backed up? Keep in mind that the grain of project backups will represent the ability to recover lost work should a project or a job become corrupted.

At a minimum, a UNIX file system backup of the entire DataStage file tree should be performed at least weekly with the DataStage engine shut down, and each project should be backed up with the Manager at least nightly with all users logged out of DataStage. This is the equivalent of a cold database backup and 6 updates.

If your installation has valuable information in Server hash files, you should increase the frequency of your UNIX backup OR write jobs to unload the Server files to external media.
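A minimal sketch of the weekly cold backup, assuming the dsadm account and the tape device from the earlier prompt example; the engine stop/start commands are the procedure referenced in Starting / Stopping the DataStage Engine (section 2.3.3), so verify them against your release:

    # Run as dsadm, with all users logged out of DataStage.
    cd `cat /.dshome`                # $DSHOME
    . ./dsenv
    bin/uv -admin -stop              # cold: stop the engine first
    /bin/tar cvf /dev/rmt0 /usr/dsadm/Ascential/DataStage
    bin/uv -admin -start             # restart the engine when the tar completes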

2.2.3 Data Set and Sort Directories

The DataStage installer creates the following two subdirectories within the DataStage install directory:

Datasets/ : stores individual segment files of DataStage parallel Data Sets
Scratch/  : used by the parallel framework for temporary files such as sort and buffer overflow

Try not to use these directories, and consider deleting them to ensure they are never used. This is best done immediately after installation; be sure to coordinate this standard with the rest of the team.

DataStage parallel Configuration files are used to assign resources (such as processing nodes, disk and scratch file systems) at runtime when a job is executed.

¹Note that the use of Server Edition components in an Enterprise Edition environment is discouraged for performance and maintenance reasons. However, if legacy Server Edition applications exist, their corresponding objects may need to be taken into consideration.


The DataStage installer creates a default parallel Configuration file (Configurations/default.apt) which references the Datasets and Scratch subdirectories within the install directory. The DataStage Administrator should consider removing the default.apt file altogether, or at a minimum updating this file to reference the file systems you define (below).

2.2.3.1 Data and Scratch File Systems

It is a bad practice to share the DataStage install and Projects file systems with volatile files like scratch files and parallel Data Set segment files. Resource, scratch, and sort disks service very different kinds of data with completely opposite persistence characteristics. Furthermore, they compete directly with each other for I/O bandwidth and service time if they share the same path.

Optimally, these file systems should not have any physical disks in common and should not share any physical disks with databases. While it is often impossible to allocate contention-free storage, it must be noted that at large data volumes and/or in highly active job environments, disk arm contention can and usually does significantly constrain performance.

NOTE: For optimal performance, file systems should be created in high-performance, low-contention storage. The file systems should be expandable without requiring destruction and re-creation.

2.2.3.2 Data Sets

Parallel Data Sets are used for persistent data storage in parallel, in native DSEE format. The DataStage developer specifies the location of the Data Set header file, which is a very small pointer to the actual data segment files that are created by the DSEE engine in the directories specified by the disk resources assigned to each node in the parallel Configuration file. Over time, the Data Set segment file directory(-ies) will grow to contain dozens to thousands of files, depending on the number of DataStage Data Sets used by DSEE jobs.

The need to archive Data Set segment files depends on the recovery strategy chosen by the DataStage developer, the ability to recreate these files if the data sources remain, and the business requirements. Whatever archive policy is chosen should be coordinated with the DataStage Administrator and Developers. If Data Set segment files are archived, careful attention should be paid to also archiving the corresponding Data Set header files.
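As one illustration, the orchadmin utility that ships with the parallel framework can report which segment files belong to a Data Set header before it is archived. This is a sketch only: the option letters vary by release (check orchadmin help), and the configuration file and Data Set paths are hypothetical.

    cd `cat /.dshome`
    . ./dsenv                                        # DataStage environment
    export APT_CONFIG_FILE=/ds/configs/prod.apt      # same config file the job used
    $APT_ORCHHOME/bin/orchadmin describe -f /ds/datasets/customer.ds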

    2.2.3.3 Sort Space

As discussed, it is a recommended practice to isolate DataStage scratch and sort space from Data Sets and flat files, in that temporary files exist only while a job is running² and are "warm" files (that is, being read and written at above-average rates). Note that sort space must accommodate only the files being sorted simultaneously and, assuming that jobs are scheduled non-concurrently, only the maximum of said sorts.

²Some files created by database stages persist after job completion. For example, the Oracle .log, .ctl and .bad files will remain in the first Scratch resource pool after a load completes.

    There is no persistence to these temporary sort files so they need not be archived.

Sizing DataStage scratch space is somewhat difficult. Objects in this space include lookups and intra-process buffers. Intra-process buffers absorb rows at runtime when a stage (or stages) in a partition (or all partitions) cannot process rows as fast as they are supplied. In general, there are as many buffers as there are stages on the canvas for each partition. As a practical matter, assume that scratch space must accommodate the largest volume of data in one job (see the previous formula for Data Sets and flat files). There are advanced ways to isolate buffer storage from sort storage, but this is a performance tuning exercise, not a general requirement.

    2.2.3.4 Maintaining Parallel Configuration Files

DataStage parallel Configuration files are used to assign resources (such as processing nodes, disk and scratch file systems) at runtime when a job is executed. Parallel Configuration files are discussed in detail in the DataStage Parallel Job Advanced Developer's Guide.

Parallel configuration files can be located within any directory that has suitable access permissions, and are selected at runtime through the environment variable $APT_CONFIG_FILE. However, the graphical Configurations tool within the DataStage clients expects these files to be stored within the Configurations subdirectory of the DataStage install. For this reason, it is recommended that all parallel configuration files be stored in the Configurations subdirectory, with naming conventions to associate them with a particular project or application.

The default.apt file is created when DataStage is installed, and references the Datasets and Scratch subdirectories of the DataStage install directory. To manage system resources and disk allocation, the DataStage administrator should consider removing this file and creating separate configuration files that are referenced by the $APT_CONFIG_FILE setting in each DataStage Project.

At a minimum, the DataStage administrator should edit the default.apt configuration file to reference the newly created Data and Scratch file systems, and ensure that these directories are used by any other parallel configuration files.
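For illustration, a minimal single-node configuration file that points the engine at dedicated Data and Scratch file systems might look like the following; the node name, fastname, and paths are hypothetical and must match your environment:

    {
        node "node1"
        {
            fastname "etl_node"
            pools ""
            resource disk "/ds/datasets" {pools ""}
            resource scratchdisk "/ds/scratch" {pools ""}
        }
    }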

2.2.4 Extending the DataStage Project for External Entities

It is recommended that another directory structure be created to integrate all aspects of a DataStage application that are managed outside of the DataStage Projects repository. This hierarchy should include directories for secured parameter files, Data Set header files, custom components, Orchestrate schemas, SQL, and shell scripts. It may also be useful to support custom job logs and reports.



2.2.5 File Staging

It is recommended that a separate Staging file system and directory structure be used for storing, managing, and archiving various source data files.
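For example (illustrative only; adapt the states and retention to your requirements), organized by project and processing state:

    /ds/staging/<project>/
        incoming/     source files as delivered
        archive/      compressed copies retained per the archival policy
        reject/       files or records that failed validation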

2.3 Administrator Tips

2.3.1 Shell Environment

Establish a convenient environment variable pointing to the main DataStage directory, and automatically source the DataStage environment, by adding the following three lines to the user's shell profile (.profile, .bashrc, etc.):

dsroot=`cat /.dshome`/..
export dsroot
. $dsroot/DSEngine/dsenv

Note: The /.dshome file is only created with a standard (non-itag) install of DataStage. If you have installed multiple DataStage engines on a single server (using an itag install), then you will need to source the appropriate dsenv file for the DataStage environment you are managing.


2.3.2 Standard DSParams

The DSParams file comes from a template. Configure one project as the standard configuration for environment variables, sequencer settings, etc. Then copy the DSParams from the model project directory to the Template directory. Every time a new project is created, it will inherit the settings from the DSParams file in the template.
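A hedged sketch, assuming the model project is named standard_proj and was created in the default Projects directory; the Template directory location shown here is an assumption that varies by release, so verify it on your installation before copying:

    # Preserve the shipped template, then promote the model project's settings.
    dsroot=`cat /.dshome`/..
    cp $dsroot/Template/DSParams $dsroot/Template/DSParams.orig
    cp $dsroot/Projects/standard_proj/DSParams $dsroot/Template/DSParams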

2.3.3 Starting / Stopping the DataStage Engine

The DataStage installation will configure your system to start the DataStage server main processes (dsrpcd and the EE job monitor, JobMonApp) automatically when the system starts. For UNIX systems, S99ds.rs is installed in /etc/rc2.d and the DataStage services are set to start automatically. One exception is a non-root installation; in this case, scripts should be executed by the root user to set up impersonation and autostart.

To manually stop and start the service on Windows, invoke the DataStage Control Panel application from the Windows Control Panel.

To manually stop or start the DataStage engine, refer to the Administrator Guide (dsadmgde.pdf), Stopping and Restarting the Server Engine.
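On UNIX, the engine can also be stopped and started from the command line; the following is a sketch using the uv -admin utility in $DSHOME (confirm the exact options against the Administrator Guide for your release):

    su - dsadm
    cd `cat /.dshome`          # $DSHOME, e.g. .../DataStage/DSEngine
    . ./dsenv
    bin/uv -admin -stop        # stop dsrpcd and the engine processes
    bin/uv -admin -info        # report current engine status
    bin/uv -admin -start       # restart the engine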

2.3.4 Server Will Not Start Because Port Is Busy

This usually occurs when the server is brought down before all clients have exited.



A feature of UNIX related to TCP sockets that become disconnected will hold the port (in this case, the dsrpcd listener port) in FIN_WAIT state for the length of the FIN_WAIT interval. While the port is in this state, the DataStage dsrpcd server will not start.

You can either wait for the FIN_WAIT interval to expire (usually 10 minutes) or, in an emergency, as root change the setting to something like 1 minute. This is a dynamic network parameter and can be set temporarily to a lower value; reset it back to the original value once the server starts. Use the following utilities: ndd (Solaris, HP-UX), no (AIX).
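A hedged example for Solaris using ndd; the parameter name and units differ by platform and OS release, so treat these lines as illustrations rather than prescriptions:

    # Solaris: FIN_WAIT_2 flush interval, in milliseconds
    ndd -get /dev/tcp tcp_fin_wait_2_flush_interval        # note the original value
    ndd -set /dev/tcp tcp_fin_wait_2_flush_interval 60000  # roughly 1 minute
    # ...start the DataStage engine, then restore the original value.
    # AIX uses no, e.g.: no -o tcp_finwait2=120            # value in half-seconds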

2.3.5 Universe Shell

The DataStage server engine is based on Universe. It is a complete application environment containing a shell, file types, a programming language, and many facilities for application operations like lock management. To invoke the Universe shell, the DataStage environment variables must be set. This is easily done by sourcing/executing the dsenv file in $DSHOME. To invoke the Universe shell, use these commands:

cd $DSHOME
bin/uvsh

2.3.6 Resource Locks

If a developer is working on a job in the Designer and there is a network failure or client machine failure, the job will remain locked according to DataStage. When a job is locked, it must be cleared before it can be accessed by any DataStage component. Clearing locks can be done from the DataStage Director pull-down menu Job → Cleanup Resources. Choosing this option will open the Job Resources interface.



2.4 Performance Monitoring

Once the processing bottleneck is discovered, action can then be taken to improve performance. For example, a job may appear to be running slowly with no indication of a CPU, I/O, or memory bottleneck; performance of the job could be improved by creating more logical processing nodes in the DataStage EE configuration file, or it may need to be redesigned. As parallelism is increased, more system resources will be utilized, and one will find that the system may become the gating factor of performance. The remedy to this problem may be to increase system resources, like adding more CPUs or spreading I/O to other physical devices and controllers.

2.4.1 DataStage EE Job Monitor

The DataStage EE job monitor (JobMonApp) provides a useful snapshot of the job's performance at that moment of execution, but does not provide thorough performance metrics. That is, a JobMonApp snapshot should not be used in place of a full run of the job, or a run with a sample set of data. Due to buffering and to some jobs' semantics, a snapshot image of the flow may not be a representative sample of the performance over the course of the entire job.

The CPU Summary information provided by JobMonApp is useful as a first approximation of where time is being spent in the flow. However, it will not show operators that were inserted by the parallel framework. Such operators include sorts that were not explicitly included, and sub-operators of composites.

2.4.2 Performance Metrics with DataStage EE Environment Variables

There are a number of environment variables that direct DataStage parallel jobs to report detailed runtime information, enabling you to determine where time is being spent, how many rows were processed, and how much memory each instance of a stage utilized during a run. Setting these environment variables also allows you to report on operators that were inserted by the parallel framework, such as sorts that were not explicitly included, buffer operators, and sub-operators of composites.
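The variables described below can be enabled per run. For example, once they have been added to a job as environment-variable job parameters, a single run can switch them on from the dsjob command line (a sketch; the project and job names are hypothetical):

    dsjob -run -jobstatus \
          -param \$APT_PM_PLAYER_MEMORY=1 \
          -param \$APT_PM_PLAYER_TIMING=1 \
          -param \$APT_RECORD_COUNTS=1 \
          myproject myjob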

    APT_PM_PLAYER_MEMORY

Setting this variable causes each player process to report the process heap memory allocation in the job log when the operator instance completes execution.

    Example of player memory:

    APT_CombinedOperatorController,0: Heap growth during runLocally(): 1773568 bytes

    APT_PM_PLAYER_TIMING

Setting this variable causes each player process to report its call and return in the job log. The message with the return is annotated with CPU times for the player process.

Example of player timings, showing the elapsed time of the operator and the amount of user and system time, as well as total CPU:

    APT_CombinedOperatorController,0: Operator completed. status: APT_StatusOk elapsed: 0.30 user: 0.02 sys: 0.02 (totalCPU: 0.04)


    APT_RECORD_COUNTS

Setting this variable causes DataStage to print to the job log, for each operator player, the number of records input and output. Abandoned input records are not necessarily accounted for. Buffer operators do not print this information.

Example of record counts, showing the number of rows processed for the input link and output link of partition 0 of the Sort_3 stage:

Sort_3,0: Input 0 consumed 5000 records.
Sort_3,0: Output 0 produced 5000 records.

APT_PERFORMANCE_DATA

APT_PERFORMANCE_DATA, or the osh -pdd advanced runtime option, allows you to capture raw performance data for every underlying job process at runtime.

Within a job parameter, set $APT_PERFORMANCE_DATA = dirpath, where dirpath is a directory on the DataStage server in which to capture performance statistics. This will create an XML document named performance.<id> in the specified directory. You can influence the name of the file by specifying the osh -jobid <id> advanced runtime option, in which case the performance XML document will be named performance.<id>.
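A minimal sketch of capturing raw performance data for one run via a job parameter; the capture directory, project, and job names are hypothetical, and the directory must exist and be writable by the job's user:

    mkdir -p /ds/perfdata            # hypothetical capture directory
    dsjob -run -jobstatus \
          -param \$APT_PERFORMANCE_DATA=/ds/perfdata \
          myproject myjob
    # one performance.<id> XML document is written to /ds/perfdata,
    # ready for performance_convert (described below)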

The XML header of the performance document describes the detailed performance data captured in each record. Note that this information is more detailed than the higher-level information captured by DSMakeJobReport, and includes information on all of the processes (including buffer operators and framework-inserted sorts).



Starting with release 7.5, the Perl script performance_convert, located in the directory $APT_ORCHHOME/bin, can be used to convert the raw performance data into other usable formats, including:
- CSV text files
- detail Data Sets
- summary Data Sets (summarizes the total time and maximum heap memory usage per operator)

    The syntax is:

perl $APT_ORCHHOME/bin/performance_convert inputfile output_base [-schema|-dataset|-summary] [-help]

where
inputfile - location of the performance data to convert
output_base - location and file prefix for all files being generated
(e.g., /mydir/jobid -> /mydir/jobid.CSV)
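For example, a summary data set could be produced from a captured performance file as follows (the file and prefix paths are hypothetical):

perl $APT_ORCHHOME/bin/performance_convert /mydir/performance.myjob /mydir/myjob -summary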

2.4.3 iostat

iostat is useful for examining the throughput of various disk resources. If one or more disks have high throughput, understanding where that throughput is coming from is vital. If there are spare CPU cycles, I/O is often the culprit. iostat can also help a user determine if there is excessive I/O for a specific job.

The specifics of iostat output vary slightly from system to system. Here is an example from a Linux machine which shows a relatively light load (the first set of output is cumulative data since the machine was booted):

$ iostat 10
Device:    tps   Blk_read/s   Blk_wrtn/s    Blk_read    Blk_wrtn
dev8-0   13.50       144.09       122.33   346233038   293951288

Every N seconds (10 in the command line example), iostat outputs:

Device:    tps   Blk_read/s   Blk_wrtn/s    Blk_read    Blk_wrtn
dev8-0    4.00         0.00        96.00           0          96

2.4.4 vmstat

vmstat is useful for examining system paging. Ideally, an EE flow, once it begins running, should never be paging to disk (si and so should be zero). Paging suggests EE is consuming too much total memory.

$ vmstat 1
 procs                 memory            swap         io     system        cpu
 r  b  w   swpd   free   buff   cache   si  so   bi  bo   in  cs   us  sy  id
 0  0  0  10692  24648  51872  228836    0   0    0   1    2   2    1   1   0

vmstat produces the following every N seconds:



 0  0  0  10692  24648  51872  228836    0   0    0   0  328  41    1   0  99

    mpstat will produce a similar report based on each processor of an SMP.

2.4.5 Load Average

Ideally, each flow should be consuming as much CPU as is available. The load average on the machine should be 2-3x the number of processors on the machine (an 8-way SMP should have a load average of roughly 16-24). Some operating systems, such as HP-UX, show per-processor load average; in this case, the load average should be 2-3, regardless of the number of CPUs on the machine.

If the machine isn't CPU-saturated, a bottleneck may exist elsewhere in the flow. Over-partitioning may be a useful strategy in these cases.

If the flow pegs the machine, then the flow is likely CPU limited, and if performance isn't adequate some determination needs to be made as to where the CPU time is being spent. See the next section (2.4.6) to monitor individual processes.

    The commands top or uptime can provide the load average.

    xload can provide a histogram of the load average over time.

Tools such as top, topas, and nmon give you a real-time view of the system and are extremely useful for evaluating a system's performance.

2.4.6 How to Monitor DataStage EE Processes

Refer to Appendix A: Processes Created at Runtime by DataStage EE for diagrams of processes created by DataStage.

The player process identifiers (PIDs) of a job can be identified by setting the environment variable APT_PM_PLAYER_PID=TRUE. This will produce messages in the job log correlating each instance of an operator with its PID.

You can also identify the processes without using APT_PM_PLAYER_PID, by looking for processes that are running the osh or phantom programs. osh is the Orchestrate shell, the main program of the parallel framework. All parallel job execution, that is, section leaders and players, is spawned from this program. osh processes will be started on all physical processing nodes participating in a job's execution. Phantom is the name of the process spawned by DataStage for job control, that is, Job Sequencers. Phantom processes only run on the conductor node. When you invoke a job from DataStage, it will first start a phantom process, which controls and monitors the overall execution of the job. The phantom will then invoke osh. Phantom processes can also spawn other child phantoms if your job control invokes child Job Sequencers.


    2.4.7 Engine Processes and System Resources

Refer to Appendix A: Processes Created at Runtime by DataStage EE for diagrams of processes created by DataStage.

The DataStage server engine program is called dsrpcd, a daemon that manages connections to DataStage projects. dsrpcd utilizes semaphores and shared memory segments when it is operating; the semaphores and shared memory segments used by dsrpcd are prefixed with the string 0xade. The UNIX command ipcs will produce a list of semaphores and shared memory segments used by DataStage.

When a user logs into DataStage, dsrpcd will spawn two processes for each session, the dsapi_client and dsapi_slave processes. These manage all of your interactions with the DataStage project. One way to force users to log off from DataStage is to kill the dsapi_slave process. Note that on UNIX the dsapi_slave process is identified as dscs.
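A minimal sketch of inspecting and terminating these resources from a standard UNIX shell (the PID is whatever ps reports on your system):

ipcs | grep ade            # semaphores and shared memory segments used by dsrpcd
ps -ef | grep dscs         # client session slaves (dsapi_slave, shown as dscs on UNIX)
kill <pid>                 # force a session off by killing its dsapi_slave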

    2.4.8 Disk Space Used by DataStage

    DataStage utilizes disk space in a number of places.

Within a DataStage project, the following will grow over time and need to be purged on a regular basis:

- Job Logs - Purge by setting up a purge policy through the DataStage Administrator. This standard practice emphatically recommends setting a purge policy to avoid filling the project file system.

- &PH& - This is a directory in each project that is used for the stderr and stdout of phantom processes. Each job execution creates a file in this directory; over time the directory will grow and should therefore be cleaned on a regular basis to avoid filling up the project file system (see the cleanup sketch after this list). Typical file size is less than 1K; files larger than 1K are an indication of a problem with a job. In the event of a hard crash of a job, examining the DSD.RUN* files may provide useful information in explaining the problem.

- $TMPDIR - This environment variable tells DataStage where to write temporary files created by the parallel framework, such as the job score and temporary files for the Lookup stage. This directory is automatically cleaned up by the parallel framework; however, hard crashes may leave files stranded in this directory. You can identify DataStage EE temp files by looking for files that begin with APT*. The default $TMPDIR is /tmp; performance improvements can be achieved by setting $TMPDIR to a faster file system.

- Scratch - Identify scratch space by examining the APT_CONFIG_FILE. Scratch is used for sort and buffer overflow files. These files are temporary and are managed by the framework. One can judge how a job is performing by examining the number of files that are



created in the scratch area. For example, if there is a bottleneck in a process that fork joins, buffer overflow files will be written to scratch. The more files, the more buffering.

- DataSets - Identify the directories used by data sets by examining the APT_CONFIG_FILE, by using the orchadmin command line tool, or via Tools -> Data Set Management from the DataStage GUIs.
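A minimal cleanup sketch for the &PH& directory, assuming a project at the hypothetical path /ds/projects/myproject and a seven-day retention; run it only when no jobs are executing:

# Remove phantom stdout/stderr files older than 7 days from the project's &PH& directory
find '/ds/projects/myproject/&PH&' -type f -mtime +7 -exec rm {} \;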

2.5 Security, Roles, DataStage User Accounts

In general, initial access to DataStage projects is enforced by the operating system's security, such as logging into a project through DataStage Designer, as well as read, write, execute, and delete permissions on a project directory.

As a first level of security, Administrators should leverage operating system groups to grant and deny access to a DataStage project. That is, for each project create an operating system group (the group name should be the same as the project), assign the group to the project directory (chown), and grant users access to that project by making them members of the project's group. This will give users the authorization to log into and manage objects in the project.
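A minimal sketch of this setup, assuming a hypothetical project named dstage1 located at /ds/projects/dstage1 and a hypothetical user etldev (exact account-management commands vary by UNIX flavor):

groupadd dstage1                        # create a group named after the project
chgrp -R dstage1 /ds/projects/dstage1   # assign the group to the project directory
chmod -R 775 /ds/projects/dstage1       # grant the group read/write/execute
usermod -a -G dstage1 etldev            # make the user a member of the project's group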

As a second level of control, Administrators should assign DataStage roles (see below) to the groups that have access to the project. This will limit what users can do within DataStage, such as creating jobs, compiling jobs, and running jobs.

2.5.1 DataStage Roles

DataStage security is based on operating system groups. When creating a DataStage project, consider limiting access to the project by creating an operating system group and assigning that group as the owner of the DataStage project directory. Then make operating system user IDs members of the group, and grant roles to users from the DataStage Administrator as described below.

The following was copied from the DataStage Administrator Guide (dsadmgde.pdf):

To prevent unauthorized access to DataStage projects, you must assign the users on your system to the appropriate DataStage user category. To do this, you must have administrator status. You can do many of the administration tasks described in this section if you have been defined as a DataStage Developer or a DataStage Production Manager; you do not need to have specific administration rights. However, to do some tasks you must be logged on to DataStage using a user name that gives you administrator status:
- For Windows servers: You must be logged on as a member of the Windows Administrators group.
- For UNIX servers: You must be logged in as root or the DataStage administrative user (dsadm by default).
You require administrator status, for example, to change license details, add and delete projects, or to set user group assignments.

There are four categories of DataStage user:
- DataStage Developer, who has full access to all areas of a DataStage project



- DataStage Production Manager, who has full access to all areas of a DataStage project, and can also create and manipulate protected projects. (Currently on UNIX systems the Production Manager must be root or the administrative user in order to protect or unprotect projects.)
- DataStage Operator, who has permission to run and manage DataStage jobs
- <None>, who does not have permission to log on to DataStage.

You cannot assign individual users to these categories. You have to assign the operating system user group to which the user belongs. For example, a user with the user ID peter belongs to a user group called clerks. To give DataStage Operator status to user peter, you must assign the clerks user group to the DataStage Operator category.

Note: When you first install DataStage, the Everyone group is assigned to the category DataStage Developer. This group contains all users, meaning that every user has full access to DataStage. When you change the user group assignments, remember that these changes are meaningful only if you also change the category to which the Everyone group is assigned.

2.5.2 User Environment

It is common for DataStage developers and administrators to utilize the UNIX or Windows command line. For this reason, the DataStage user's account should be configured with the proper environment variables.

All users should have these lines added to their login profile (as noted earlier, the /.dshome file is only created on default, non-itag, installs):

dsroot="`cat /.dshome`/.."
export dsroot
. $dsroot/DSEngine/dsenv

Add these lines to the end of $DSHOME/dsenv:

APT_ORCHHOME=$DSHOME/../PXEngine
export APT_ORCHHOME
APT_CONFIG_FILE=$DSHOME/../Configurations/default.apt
export APT_CONFIG_FILE
PATH=$APT_ORCHHOME/bin:$PATH
export PATH
LD_LIBRARY_PATH=$APT_ORCHHOME/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH

    To Configure an Orchestrate User:

The following steps explain in detail how to configure a DataStage user's environment. The steps described above in User Environment should be sufficient.

1 In your .profile, .kshrc, or .cshrc, set the APT_ORCHHOME environment variable to the directory in which Orchestrate is installed. This is either the default, /ascential/apt, or the directory you have defined as part of the installation procedure.




2 Add $APT_ORCHHOME/bin to your PATH environment variable. This is required for access to all scripts, executable files, and maintenance commands.
3 Add $APT_ORCHHOME/osh_wrappers and $APT_ORCHHOME/user_osh_wrappers to your PATH environment variable. This is required for access to the osh operators.
4 Make sure LIBPATH has been set to /usr/lib:/lib:$APT_ORCHHOME/lib:$APT_ORCHHOME/user_lib, followed by any additional libraries you need.
5 Optionally, add the path to the C++ compiler to your PATH environment variable. Orchestrate requires that the compiler be included in PATH if you will use the buildop utility or develop and run programs using the Orchestrate C++ interface.
6 Add the path to the dbx debugger to your PATH variable to facilitate error reporting. If an internal execution error occurs, Orchestrate attempts to invoke a debugger in order to obtain a stack traceback to include in the error report; if no debugger is available, no traceback will be generated.
7 By default, Orchestrate uses the directory /tmp for some temporary file storage. If you do not want to use this directory, assign the path name of a different directory through the environment variable TMPDIR. You can additionally assign this location through the Orchestrate environment variable APT_PM_SCOREDIR.
8 Make sure you have write access to the directories $APT_ORCHHOME/user_lib and $APT_ORCHHOME/user_osh_wrappers on all processing nodes.
9 If your system connects multiple processing nodes by means of a switch network in an MPP, set APT_IO_MAXIMUM_OUTSTANDING, which sets the amount of memory in bytes reserved for Orchestrate on every node communicating over the network. The default setting is 2 MB. Ascential Software suggests setting APT_IO_MAXIMUM_OUTSTANDING to no more than 64 MB (67,108,864 bytes). If your job fails with messages about broken pipes or broken TCP connections, reduce the value to 16 MB (16,777,216 bytes). In general, if TCP throughput is so low that there is idle CPU time, increment this variable (by doubling) until performance improves. If the system is paging, the setting is probably too high.

2.6 The DataStage Administrator - Project Configuration

    This section describes Standard practices for configuring a project.


By default, DataStage grants the Developer role to all groups. You should restrict the DataStage Developer and Production Manager roles to only trusted users.

This standard practice recommends always checking "Automatically Handle Activities that fail". The other options are optional. "Add checkpoints so sequence is restartable on failure" should be configured only if this is an acceptable approach to checkpoint restart.


The "Generated OSH visible for Parallel jobs in ALL projects" option should be checked.



3 Job Monitor

The DataStage job monitor provides the capability for collecting and reporting performance metrics. It must be running in order for the Audit & Metrics system (below) to function. The job monitor may impact system performance and can be tuned or shut off, configurable with the environment variables below.

3.1 Configuration

The job monitor uses two TCP ports which are chosen during installation. These should be entered in /etc/services as a manual step.

Entries should be made in the /etc/services file to protect the sockets used by the job monitor. The default socket numbers are 13400 and 13401, and entries in this file may look like this:

13400 tcp dsjobmon
13401 tcp dsjobmon

3.2 Job Monitor Environment Variables

The job monitor is controlled using the following environment variables. Standard practice in large-volume data environments is to use a size of about 10000 and to turn off APT_MONITOR_TIME with $UNSET.

For an explanation of time-based versus row-based monitoring, see "Job Monitor" on page 31 of the Parallel Job Advanced Developer's Guide (advpx.pdf).

    APT_MONITOR_SIZE

Determines the minimum number of records the DataStage Job Monitor reports. The default is 5000 records.

    APT_MONITOR_TIME

Determines the minimum time interval, in seconds, for generating monitor information at runtime. The default is 5 seconds. This variable takes precedence over APT_MONITOR_SIZE.

APT_NO_JOBMON

Turns off job monitoring entirely.

3.3 Starting & Stopping the Monitor

The monitor is normally started and stopped with the DataStage server engine. The root user has permission to stop and start the job monitor using these commands:

$DSHOME/../PXHOME/java/jobmoninit stop
$DSHOME/../PXHOME/java/jobmoninit start


    3.4 Monitoring jobmon

The existence of the job monitor process can be detected by looking for the JobMonApp string in the output of the ps command.

For example: ps -ef | grep JobMonApp

This will produce rather long output, but you will be able to identify the process number:

root 6700 1 0 Mar24 ? 00:00:01 /var/dsadm/Ascential/DataStage/DSEngine/../PXEngine/java/jre/bin/java -classpath /var/dsadm/Ascential/DataStage/DSEngine/../PXEngine/java/JobMonApp.jar:/var/dsadm/Ascential/DataStage/DSEngine/../PXEngine/java/xerces/xercesImpl.jar:/var/dsadm/Ascential/DataStage/DSEngine/.



4 Backup / Recovery / Replication / Failover Procedures

The DataStage environment should be backed up using your site's system backup tools. It is also possible to back up DataStage using archive tools such as tar and zip. In general, you should protect the DataStage environment with a combination of full and incremental backups, with a frequency that is sufficient to minimize loss of work (disk crash) and to minimize recovery time and effort.

In order to properly back up and promptly recover a DataStage installation and the applications developed with DataStage, you must identify the files and file systems that are required by the DataStage application. Minimal backup protection requires that the DataStage Conductor and projects be backed up by system backup on a regular basis.

It is likely that external entities will be closely integrated with applications developed with DataStage and will need to be backed up as well. If your site has standardized the directory structure for external entities, then identifying them for backup is straightforward. Otherwise, identification is a cumbersome ad-hoc exercise.

4.1 DataStage Conductor Backup

Also known as the DataStage installation directory, the conductor directory contains the DataStage core product software and configuration. It is critical that it be protected by regularly scheduled full and incremental backups.

    Location Path ../Ascential/DataStage

Events that result in changes to the DataStage conductor files and directories include creating and deleting projects, installing patches to the Engine, and manual modifications to files or subdirectories.

The DataStage installation creates the following subdirectories under ../Ascential/DataStage: Scratch, Datasets, and Projects. These directories are used to store volatile files and warrant special considerations. The project file system may be a separate file system, as recommended by the Install & Upgrade standard practices. See the section below for details on backing up the Projects directory. Consider not backing up the Scratch and Datasets directories.
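As an illustration only (GNU tar syntax and the paths shown are assumptions; your site's backup tooling remains the primary mechanism), a full archive of the conductor directory that skips the volatile subdirectories might look like:

cd /ds/Ascential
tar --exclude='DataStage/Scratch' --exclude='DataStage/Datasets' \
    -czf /backup/ds_conductor_$(date +%Y%m%d).tar.gz DataStage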

4.2 DataStage Project Backups

The location of a DataStage project can be determined when the project is created, by specifying a path. The default location is $DSHOME/Ascential/DataStage/Projects. The DataStage Projects directory will contain a subdirectory for each project. It is a useful practice to utilize the default project location, or to standardize on one location for all projects created on the system, because it will simplify identifying the location of projects for backup.

One can determine the path of a project through the DataStage Administrator.



The DataStage repository file UV.ACCOUNT contains the directory paths for each project. This file can be queried with the command:

    echo "SELECT * FROM UV.ACCOUNT;" | bin/uvsh

DataStage projects should be protected by both full and incremental system backups, performed at regular intervals (daily, hourly) that minimize exposure to a crash.

Special consideration should be given to development projects, since these are where developers will be saving work throughout the day. Developers and administrators should be aware that, in the event of a catastrophic storage system failure, work saved between backups could be lost.

It is best to back up the system, especially projects, when jobs are not running or when developers are not on the system. Due to the dynamic nature of a DataStage repository and its multi-file structure, there is a potential for a hot backup to contain an inconsistent view of the repository. This situation exists in almost all modern databases (except single-file databases): because the database is made up of many files that are updated at different times, getting a consistent view of all these files with a hot backup is difficult without complex solutions like breaking volume mirrors.

Avoid storing volatile files in a DataStage project, to prevent wasting the time and space required for the project backup.

Consider locating non-volatile external entities in the project, to provide a convenient method for backing up external entities that are related to the project.

Consider the DataStage job log purge policy. In order to maximize backup efficiency, set a log retention policy to purge shortly after a backup, without erasing entries before they are backed up. For example, if you incrementally back up a project daily, then set the purge policy to every two days. This will ensure all log entries are backed up, with minimal overlap.

4.3 DataStage Exports for Partial Backup

Some customers may choose to rely on DataStage exports for backups. This is not a comprehensive solution and should only be used in conjunction with full and incremental backups of the DataStage installation, DataStage projects, and external entities.

DataStage developers can use exports to save their work between backups, reducing exposure to gaps in system backup coverage.

    You cannot export locked jobs.

Export is a DataStage client-based Win32 application. It can be run from the DataStage Manager or using the command line tools. You need to be at the console because Windows pop-up dialog boxes sometimes appear.
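A sketch of a command-line export using the dscmdexport client tool (run from a Windows DataStage client; the host, credentials, project, and file names are hypothetical, and the option syntax should be verified against your client release):

dscmdexport /H=dshost /U=dsadm /P=secret myproject C:\exports\myproject.dsx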



4.4 Datasets, Lookup File Sets and File Sets

Before you begin backing up directories full of Datasets and File Sets, consider their volatility. These files are often temporary files that do not justify the time and expense of backing them up.

The parallel framework of DSEE supports three proprietary file types:

1. Persistent Data Sets: .ds, native EE data types, partitioned into multiple part files
2. Lookup File Sets: .fs, native EE data types, lookup key structure, one or more partitioned part files
3. External File Sets: .fs, external data types, one or more data files per processing node

All three are multipart files, consisting of a descriptor file and one or more data part files. The descriptor and all part files need to be backed up together. Data Sets can be backed up using any UNIX backup method so long as BOTH the control file portion and data file portion(s) of the Data Sets are backed up at the same time (and no process is writing, or waiting to write, to them). Restoration requires that the data segment files return to the EXACT location from which they came, while the control file portion (filename.ds) can be restored anywhere.

Following the standard practice, the descriptor file should be located in a Datasets directory for each project, $PROJECT_PLUS/datasets, and the part files will be located on processing nodes, as specified by the EE configuration file (APT_CONFIG_FILE). It is also important to know that the nodes holding the part files must be reflected in the APT_CONFIG_FILE used by any job that reads the data set or file set. Thus, administrators should ensure that the APT configuration files are backed up.

The orchadmin utility allows you to manage persistent data sets and lookup file sets. The utility can be accessed from the DataStage Manager, Designer, and Director by choosing Tools -> Data Set Management, or orchadmin can be invoked from the command line. Note that when using orchadmin from the command line, the user's environment must be configured as described in the section on setting up the command line environment for Orchestrate users.
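For example (a sketch; myds.ds is a hypothetical data set, and the command line environment must be configured as described in User Environment above):

orchadmin describe myds.ds   # show the descriptor and the part files behind it
orchadmin rm myds.ds         # remove the descriptor AND all part files consistently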

4.5 External Entities - Scripts, Routines, Staging Files

    Account for all scripts that are related to