

    IBM Information Platform and Solutions

    Center of Excellence

    IBM IPS Parallel Framework Standard Practices

    Administration and Management:

    DataStage EE Administration and Production Automation

    Prepared by IBM Information Platform and Solutions Center of Excellence

    October 29, 2007

CONFIDENTIAL, PROPRIETARY, AND TRADE SECRET NATURE OF ATTACHED DOCUMENTS

This document is Confidential, Proprietary and Trade Secret Information (Confidential Information) of IBM, Inc. and is provided solely for the purpose of evaluating IBM products with the understanding that such Confidential Information will be disclosed only to those who have a need to know. The attached documents constitute Confidential Information as they include information relating to the business and/or products of IBM (including, without limitation, trade secrets, technical, business, and financial information) and are trade secret under the laws of the State of Massachusetts and the United States.

Copyrights

2007 IBM Information Platform and Solutions. All rights reserved. No part of this publication may be reproduced, transmitted, transcribed, stored in a retrieval system, or translated into any language in any form by any means without the written permission of IBM. While every precaution has been taken in the preparation of this document to reflect current information, IBM assumes no responsibility for errors or omissions or for damages resulting from the use of information contained herein.


Document Goals

Intended Use: This document presents a set of standard practices, methodologies, and an example Toolkit for administering and integrating IBM WebSphere DataStage Enterprise Edition (DSEE) with a production infrastructure. Except where noted, this document is intended to supplement, not replace, the installation documentation.

Target Audience: The primary audience for this document is DataStage Administrators and Developers who have been trained in Enterprise Edition. Information in certain sections may also be relevant for Technical Architects and System Administrators.

Product Version: This document is intended for the following product releases:
- WebSphere DataStage Enterprise Edition 7.5.2 (UNIX, Linux, USS)
- WebSphere DataStage Enterprise Edition 7.5x2 (Windows)

Document Author and Contributors

Author:
Mike Carney, Advanced Consulting Engineer, [email protected]
Paul Christensen, Global Technical Architect, [email protected]
Bob Johnston, Advanced Consulting Engineer, [email protected]
Owen, Advanced Consulting Engineer, [email protected]
Mike Ruland, Global Technical Architect, [email protected]

Contributing Authors:
Jim Tsimis, Advanced Support Engineer, [email protected]

Document Revision History

Date                Rev.   Description
April 27, 2006      1.0    Initial release.
July 17, 2006       1.1    Updated ETL and Project_Plus directory hierarchies for consistency across DSEE Standards. Added Staging directory hierarchy.
August 15, 2006     1.2    Updated styles and formatting.
October 5, 2006     1.3    Updated directory and Project_Plus naming standards for consistency across deliverables. Updated terminology and Naming Standards for consistency. Expanded discussion of Environment Variables and Parameters. Added Environment Variable Reference Appendix. Added Document Author and Contributors, and Package Contents.
October 17, 2006    1.4    Added Feedback section and IIS Services Offerings. Corrected Data Set and Scratch file system naming. Expanded backup discussion for Data Sets.
February 8, 2007    2.0    Updated positioning, naming (IIS to IPS), Services Offerings.
October 29, 2007    3.0    First public reference release; complements the Administration and Production Automation Services Workshop.

Document Conventions

This document uses the following conventions:

Convention: Usage

Bold: In syntax, bold indicates commands, function names, keywords, and options that must be input exactly as shown. In text, bold indicates keys to press, function names, and menu selections.

Italic: In syntax, italic indicates information that you supply. In text, italic also indicates UNIX commands and options, file names, and pathnames.

Plain: In text, plain indicates Windows NT commands and options, file names, and pathnames.

Bold Italic: Indicates important information.

    mailto:[email protected]:[email protected]:[email protected]:[email protected]:[email protected]:[email protected]:[email protected]:[email protected]:[email protected]:[email protected]:[email protected]:[email protected]
  • 8/9/2019 DSEE Administration SP 20071029 Ptc[1]

    3/72

    IBM IPS Parallel Framework: Admini stration and Production Automation October 29, 2007 3 of 72 2007 IBM Information Integration Solutions. All rights reserved. No part of this publication may be reproduced, transmitted, transcribed, stored in aretrieval system, or translated into any language in any form by any means without the written permission of IBM.

    IBM Information Platform and Solutions

    Center of Excellence

Lucida Console: Lucida Console text indicates examples of source code and system output.

Lucida Console Bold: In examples, Lucida Console bold indicates characters that the user types or keys the user presses.

Lucida Blue: In examples, Lucida Blue is used to illustrate the operating system command-line prompt.

A right arrow between menu commands indicates that you should choose each command in sequence. For example, "Choose File → Exit" means you should choose File from the menu bar, and then choose Exit from the File pull-down menu.

This line continues: The continuation character is used in source code examples to indicate a line that is too long to fit on the page, but must be entered as a single line on screen.

The following are also used:

- Syntax definitions and examples are indented for ease in reading.
- All punctuation marks included in the syntax (for example, commas, parentheses, or quotation marks) are required unless otherwise indicated.
- Syntax lines that do not fit on one line in this manual are continued on subsequent lines. The continuation lines are indented. When entering syntax, type the entire syntax entry, including the continuation lines, on the same input line.
- Text enclosed in parentheses and underlined (like this) following the first use of a proper term will be used instead of the proper term.
- Interaction with our example system will usually include the system prompt (in blue) and the command, most often on two or more lines.

If appropriate, the system prompt will include the user name and directory for context. For example:

%etl_node%:dsadm /usr/dsadm/Ascential/DataStage >
/bin/tar cvf /dev/rmt0 /usr/dsadm/Ascential/DataStage/Projects

Feedback

We value your input and suggestions for continuous improvement of this content. Direct any questions, comments, corrections, or suggested additions to: [email protected]

    mailto:[email protected]:[email protected]
  • 8/9/2019 DSEE Administration SP 20071029 Ptc[1]

    4/72

    IBM IPS Parallel Framework: Admini stration and Production Automation October 29, 2007 4 of 72 2007 IBM Information Integration Solutions. All rights reserved. No part of this publication may be reproduced, transmitted, transcribed, stored in aretrieval system, or translated into any language in any form by any means without the written permission of IBM.

    IBM Information Platform and Solutions

    Center of Excellence

Table of Contents

1 IBM INFORMATION PLATFORM AND SOLUTIONS SERVICES ..... 5

2 DATASTAGE ADMINISTRATION ..... 9
  2.1 CONFIGURING DATASTAGE ENVIRONMENTS FOR A SYSTEM LIFE CYCLE ..... 9
  2.2 CONFIGURING DATASTAGE FILE SYSTEMS AND DIRECTORIES ..... 11
  2.3 ADMINISTRATOR TIPS ..... 16
  2.4 PERFORMANCE MONITORING ..... 18
  2.5 SECURITY, ROLES, DATASTAGE USER ACCOUNTS ..... 24
  2.6 THE DATASTAGE ADMINISTRATOR PROJECT CONFIGURATION ..... 26

3 JOB MONITOR ..... 30
  3.1 CONFIGURATION ..... 30
  3.2 JOB MONITOR ENVIRONMENT VARIABLES ..... 30
  3.3 STARTING & STOPPING THE MONITOR ..... 30
  3.4 MONITORING JOBMON ..... 31

4 BACKUP / RECOVERY / REPLICATION / FAILOVER PROCEDURES ..... 32
  4.1 DATASTAGE CONDUCTOR BACKUP ..... 32
  4.2 DATASTAGE PROJECT BACKUPS ..... 32
  4.3 DATASTAGE EXPORTS FOR PARTIAL BACKUP ..... 33
  4.4 DATA SETS, LOOKUP FILE SETS AND FILE SETS ..... 34
  4.5 EXTERNAL ENTITIES SCRIPTS, ROUTINES, STAGING FILES ..... 34
  4.6 REPLICATING THE DATASTAGE ENVIRONMENT ..... 34
  4.7 IMPORTANT PROJECT FILE SYSTEM CONSIDERATIONS ..... 37

5 OVERVIEW OF PRODUCTION AUTOMATION AND INFRASTRUCTURE INTEGRATION FOR DATASTAGE ..... 39
  5.1 DATASTAGE JOB CONTROL DEVELOPMENT KIT ..... 40
  5.2 JOB SEQUENCER ..... 41
  5.3 EXCEPTION HANDLING ..... 41
  5.4 CHECKPOINT RESTART ..... 42

6 JOB PARAMETER AND ENVIRONMENT VARIABLE MANAGEMENT ..... 46
  6.1 DATASTAGE ENVIRONMENT VARIABLES ..... 46
  6.2 DATASTAGE JOB PARAMETERS ..... 49
  6.3 AUDIT AND METRICS REPORTING IN AN AUTOMATED PRODUCTION ENVIRONMENT ..... 54
  6.4 INTEGRATING WITH EXTERNAL SCHEDULERS ..... 54
  6.5 INTEGRATING WITH ENTERPRISE MANAGEMENT CONSOLES ..... 55

7 CHANGE MANAGEMENT ..... 56
  7.1 SOURCE CONTROL ..... 56
  7.2 PRODUCTION MIGRATION LIFE CYCLE ..... 57
  7.3 SECURITY ..... 58
  7.4 UPGRADE PROCEDURE (INCLUDING FALLBACK EMERGENCY PATCH) ..... 58

APPENDIX A: PROCESSES CREATED AT RUNTIME BY DATASTAGE EE ..... 60
APPENDIX B: ENVIRONMENT VARIABLE REFERENCE ..... 67


1 IBM Information Platform and Solutions Services

IBM Information Platform and Solutions (IPS) Professional Services offers a broad range of workshops and services designed to help you achieve success in the design, implementation, and rollout of critical information integration projects.

Figure 1: IBM IPS Services Overview (services shown: Iterations 2 Methodology, Standard Practices, Architecture and Design, Education and Mentoring, Virtual Services, Certification)

Services Offerings

Staff Augmentation and Mentoring: Whether through workshop delivery, project leadership, or mentored augmentation, the Professional Services staff of IBM Information Platform and Solutions leverages IBM's methodologies, Standard Practices, and experience developed throughout thousands of successful engagements in a wide range of industries and government entities.

Learning Services: IBM offers a variety of courses covering the IPS product portfolio. IBM's blended learning approach is based on the principle that people learn best when provided with a variety of learning methods that build upon and complement each other. With that in mind, courses are delivered through a variety of mechanisms: classroom, on-site, and Web-enabled FlexLearning.

Certification: IBM offers a number of Professional Certifications offered through independent testing centers worldwide. These certification exams provide a reliable, valid, and fair method of assessing product skills and knowledge gained through classroom and real-world experience.

Client Support Services: IBM is committed to providing our customers with reliable technical support worldwide. All Client Support services are available to customers who are covered under an active IBM IPS maintenance agreement. Our worldwide support organization is dedicated to assuring your continued success with IPS products and solutions.

Virtual Services: The low-cost Virtual Services offering is designed to supplement the global IBM IPS delivery team, as needed, by providing real-time, remote consulting services. Virtual Services has a large pool of experienced resources that can provide IT consulting, development, Migration, and Training services to customers for WebSphere DataStage Enterprise Edition (DSEE).


Center of Excellence for Data Integration (CEDI)

Establishing a CEDI within your enterprise can help increase efficiency and drive down the cost of implementing data integration projects. A CEDI can be responsible for Competency, Readiness, Accelerated Mentored Learning, Common Business Rules, Standard Practices, Repeatable Processes, and the development of custom methods and components tailored to your business.

IBM IPS Professional Services offerings can be delivered as part of a strategic CEDI initiative, or on an as-needed basis across a project lifecycle:

Figure 2: IPS Services Offerings within an Information Integration Project Lifecycle (project phases: Identify, Strategic Planning, Startup, Analysis & Design, Build, Test & Implement, Monitor & Refine; offerings range from Information Exchange and Discovery through Grid Computing Installation and Deployment)

Project Startup Workshops

Information Exchange and Discovery Workshop: Targeted for clients new to the IBM IPS product portfolio, this workshop provides IBM's high-level recommendations on how to solve a customer's particular problem. IBM analyzes the data integration challenges outlined by the client, and develops a strategic approach for addressing those challenges.

Requirements Definition, Architecture, and Project Planning Workshop: Guiding clients through the critical process of establishing a framework for a successful future project implementation, this workshop delivers a detailed project plan, as well as a Project Blueprint. These deliverables document project parameters, current and conceptual end states, network topology, data architecture, and hardware and software specifications; outline a communication plan; define scope; and capture identified project risk.

Iterations 2: IBM's Iterations 2 is a framework for managing enterprise data integration projects that integrates with existing customer methodologies. Iterations 2 is a comprehensive, iterative, step-by-step approach that leads project teams from initial planning and strategy through to tactical implementation. This workshop includes the Iterations 2 software, along with customized mentoring.


Standard Practices Workshops

Installation and Configuration Workshop: Establishes a documented, repeatable process for installation and configuration of DSEE server and client components. This may involve review and validation of one or more existing DSEE environments, or planning, performing, and documenting a new installation.

Information Analysis Workshop: Provides clients with a set of Standard Practices and a repeatable methodology for analyzing the content, structure, and quality of data sources using the combination of WebSphere ProfileStage, QualityStage, and AuditStage.

Data Flow and Job Design Standard Practices Workshop: Helps clients establish standards and templates for the design and development of parallel jobs using DSEE through practitioner-led application of IBM Standard Practices to a client's environment, business, and technical requirements. The delivery includes a customized Standards document as well as custom job designs and templates for a focused subject area.

Data Quality Management Standard Practices Workshop: Provides clients with a set of standard processes for the design and development of data standardization, matching, and survivorship processes using WebSphere QualityStage. The data quality strategy formulates an auditing and monitoring program that helps ensure ongoing confidence in data accuracy, consistency, and identification through client mentoring and sharing of IBM Standard Practices.

Administration, Management, and Production Automation Workshop: This workshop provides customers with a customized Toolkit and set of proven Standard Practices for integrating DSEE into a client's existing production infrastructure (monitoring, scheduling, auditing/logging, change management) and for administering, managing, and operating DSEE environments.

Advanced Deployment Workshops

Health Check Evaluation: This workshop is targeted for clients currently engaged in IPS development efforts that are not progressing according to plan, or for clients seeking validation of proposed plans prior to the commencement of new projects. It provides review of and recommendations for core ETL development and operational environments by an IBM expert practitioner.

Sizing and Capacity Planning Workshop: Provides clients with an action plan and set of recommendations for meeting current and future capacity requirements for data integration. This strategy is based on analysis of business and technical requirements, data volumes and growth projections, existing standards and technical architecture, and existing and future data integration projects.

Performance Tuning Workshop: Guides a client's technical staff through IBM Standard Practices and methodologies for review, analysis, and performance optimization using a targeted sample of client jobs and environments. This workshop can identify potential areas of improvement, demonstrate IBM's processes and techniques, and provide a final report that contains recommended performance modifications and IBM performance tuning guidelines.

High-Availability Architecture Workshop: Using IBM's IPS Standard Practices for high availability, this workshop presents a plan for meeting a customer's high-availability requirements using the parallel framework of DSEE. It then implements the architectural modifications necessary for high-availability computing.

Grid Computing Discovery, Architecture and Planning Workshop: Provides the planning and readiness efforts required to support a future deployment of the parallel framework of IPS on Grid computing platforms. This workshop prepares the foundation on which a follow-on Grid installation and deployment will be executed, and includes hardware and software recommendations and estimated scope.

Grid Computing Installation and Deployment Workshop: Installs, configures, and deploys the IBM IPS Grid Enabled Toolkit in a client's Grid environments and provides integration with Grid Resource Managers, and configuration of DSEE, QualityStage/EE, and/or ProfileStage/EE.

For more details on any of these IBM IPS Professional Services offerings, and to find a local IBM Information Integration Services contact, visit: http://www.ibm.com/software/data/services/ii.html

Administration, Management and Production Automation Workshop

The following flowchart illustrates the various IPS Services workshops around the parallel framework of DSEE.

The Administration, Management and Production Automation Workshop is intended to provide a set of proven Standard Practices and a customized toolkit for integrating DSEE into a customer's existing production infrastructure (monitoring, scheduling, auditing/logging, change management). It also provides expert practitioner recommendations for administering, managing, and operating DSEE environments.

Figure 3: Services Workshops for the Parallel Framework of DSEE



2 DataStage Administration

This section of the document discusses DataStage Administration and Automation. It endeavors to join these disciplines in a cohesive manner by defining standard practices that are complementary. The standard practices are based on a foundation: the operating environment and a simple life cycle methodology.

2.1 Configuring DataStage Environments for a System Life Cycle

There are many ways to configure a DataStage ETL environment. DataStage EE is a flexible piece of software, presenting the DataStage team with many facets of Configuration, Administration, Design, Development, and Operations to consider. After the software is installed, many customers wonder what to do next. Seeing the end picture of your environment, how users will interact with it, and how it will function will help you decide how to configure, administer, manage, develop, and operate all aspects of the DataStage environment. It will also help you in related areas, such as planning hardware resources and setting up users and security.

You should be familiar with many aspects of a DataStage project, in particular these:

- Projects are both the logical and physical means for storing work performed in DataStage.
- Projects are metadata repositories for DataStage objects, such as jobs, stages, and shared containers.
- Projects also store configuration metadata, like environment variables.
- It is possible to create many projects. Projects are independent of each other.
- DataStage object metadata can be exported to a file as well as imported.

2.1.1 A Simple DataStage Application Life Cycle

It is common to refer to a collection of related DataStage objects, such as jobs and shared containers, as an application. And just as for all software applications, applications developed with DataStage need to be developed and maintained under a life cycle methodology to ensure quality.

The life cycle that is advocated in this Standard Practice only takes into account a subset of larger, more comprehensive methodologies. It is primarily concerned with the Development, Testing, Release, and Maintenance of the DataStage application. It describes how to physically implement the environment for the life cycle and activities related to operation and maintenance. It does not consider aspects of a broader life cycle such as Design and Documentation.

The DataStage application life cycle has at least three phases:
1. Development/Maintenance
2. Testing
3. Production



A more robust life cycle will utilize more testing phases:
1. Development/Maintenance
2. Integration Testing
3. Quality Assurance
4. User Acceptance Testing
5. Production

2.1.2 DataStage Installation and Configuration Considerations

In support of the phases of the life cycle, DataStage project environments should be configured for each phase of the project. That is, per DataStage application, create projects for dev, test, production, etc. In addition to configuring project environments, you will need to consider the system requirements as well. It is a recognized standard practice, and common among customers, to utilize completely different systems for each phase of the life cycle. Consider other resources that will be used by the project, such as disk space, CPU capacity, and memory size, and whether or not the system(s) can support the anticipated or actual workload. Application performance depends on sufficient hardware resources to support the workload that parallel execution puts on a system.

If you plan to execute your DataStage jobs in a distributed fashion on a loosely coupled cluster (MPP, Grid, Cluster) or employ a failover strategy, the DataStage EE environment will need to be replicated on all physical processing nodes. See Replicating the DataStage Environment, section 4.6.

Installation of the DataStage environment requires some careful planning. Choosing hardware resources is particularly important for a DataStage EE environment. Consider separate physical environments for each phase of a life cycle to ensure adequate performance; for example, designate a separate machine for dev, test, and production.

Figure 4: DataStage EE Physical Environments for a Life Cycle (options range from a single mixed development, test, and production machine; to a mixed environment with a combined development and test machine plus a production machine; to the standard practice of separate development, test, and production machines)



Configuration of DataStage user accounts for all phases of the ETL test cycle should be set up to allow for separate developer accounts and a separate account for each phase of the life cycle. Each of these accounts should be configured according to the section on configuring a DataStage user (below).

2.2 Configuring DataStage File Systems and Directories

DataStage Enterprise Edition requires file systems to be available for:

- Software Install Directory: DataStage Enterprise Edition executables, libraries, and pre-built components
- DataStage Project (Repository) Directory
- Data Storage:
  - DataStage temporary storage: scratch, temp, buffer
  - DataStage parallel Data Set segment files
  - Staging and archival storage for any source file(s)

By default, each of these directories (except for file staging) is created during installation as a subdirectory under the base DataStage installation directory.

IMPORTANT: Each storage class should be isolated in a separate file system to accommodate their different performance and capacity characteristics and backup requirements. The default installation is generally acceptable for small prototype environments.
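As an illustration only (all paths here are hypothetical and site-specific), a layout that satisfies this isolation might dedicate one file system to each storage class:

    /opt/IBM/Ascential/DataStage     software install directory
    /ds/projects                     DataStage Projects (repository)
    /ds/datasets                     parallel Data Set segment files
    /ds/scratch                      scratch, temp, and buffer storage
    /ds/staging                      staging and archival storage for source files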

2.2.1 Software Install Directory

The software install directory is created by the installation process, and contains the DSEE software file tree. The install directory grows very little over the life of a major software release, so the default location ($HOME for dsadm, e.g. /home/dsadm) may be adequate.

The system administrator may choose to install DataStage in a subdirectory within an overall install file system. You should verify that the install file system has at least 1GB of space for the software directory (2GB if you are installing RTI or other optional components).

For cluster or Grid implementations, it is generally best to share the Install file system across servers (at the same mount point).

NOTE: The DataStage installer will attempt to rename the installation directory to support later upgrades; if you install directly to a mount point, this rename will fail and several error messages will be displayed. Installation will succeed, but the messages may be confusing.


2.2.2 DataStage Projects (Repository) Directory

The DataStage Projects subdirectory contains the repository (Universe database files) of job designs, design and runtime metadata, logs, and components. Project directories can grow to contain thousands of files and subdirectories depending on the number of projects, the number of jobs, and the volume of logging information retained about each job.

During the installation process, the Projects subdirectory is created in the DataStage install directory. By default, the DataStage Administrator client creates its projects in this Projects subdirectory.

For cluster or Grid implementations, it is generally best to share the Projects file system across servers (at the same mount point).

IMPORTANT: It is a bad practice to create DataStage projects in the default directory within the install file system, as disk space is typically limited. Projects should be created in their own file system.

2.2.2.1 Creating the Projects File System

On most operating systems, it is possible to create separate file systems at non-root levels, allowing a separate file system for the Projects subdirectory within the DataStage installation. Use the following guidelines:

- It is recommended that a separate file system be created and mounted over the default location for projects, the $DSROOT/Projects directory. Mount this directory after installing DSEE but before projects are created.
- The Projects directory should be a mirrored file system with sufficient space (minimum 100MB per project).
- For cluster or Grid implementations, it is generally best to share the Projects file system across servers (at the same mount point).

IMPORTANT: The Projects file system should be monitored to ensure adequate free space remains. If the Projects file system runs out of free space during DataStage activity, the repository may become corrupted, requiring a restore from backup.

Effective management of space is important to the health and performance of a project. As jobs are added to a project, new directories are created in this file tree, and as jobs are run, their log entries multiply. These activities cause file-system stress (for example, more time to insert or delete DataStage components, longer update times for logs). Failure to perform routine project maintenance (for example, removing obsolete jobs and managing log entries) can cause project obesity and performance issues.
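A minimal sketch of the free-space monitoring recommended above, assuming the Projects file system is mounted at /ds/projects and the script runs from cron; the threshold, path, and alert recipient are illustrative:

    #!/bin/sh
    # Warn when the Projects file system exceeds 90% used (illustrative threshold).
    FS=/ds/projects
    PCT=`df -k $FS | tail -1 | awk '{ gsub("%",""); print $(NF-1) }'`
    if [ "$PCT" -gt 90 ]; then
        echo "WARNING: $FS is ${PCT}% full" | mail -s "DataStage Projects file system" dsadm
    fi

Note that df column positions vary slightly by platform; verify the awk field selection on your system before relying on the script.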



The name of a DataStage project is limited to a maximum of 18 characters. The project name can contain alphanumeric characters and underscores.

2.2.2.2 Project Recovery Considerations

Devising a backup scheme for project directories is based on three core issues:

1. Will there be valuable data stored in Server Edition hash files¹? DataStage Server Edition files located in the DataStage file tree may require archiving from a data perspective.

2. How often will the UNIX file system containing the ENTIRE DataStage file tree be backed up? When can DataStage be shut down to enable a cold snapshot of the Universe database as well as the project files? A complete file system backup while DataStage is shut down accomplishes this.

3. How often will the projects be backed up? Keep in mind that the grain of project backups will represent the ability to recover lost work should a project or a job become corrupted.

At a minimum, a UNIX file system backup of the entire DataStage file tree should be performed at least weekly with the DataStage engine shut down, and each project should be backed up with the Manager at least nightly with all users logged out of DataStage. This is the equivalent of a cold database backup and 6 updates.

If your installation has valuable information in Server hash files, you should increase the frequency of your UNIX backup OR write jobs to unload the Server files to external media.
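A minimal sketch of the weekly cold backup, assuming the dsadm account and the tape device from the earlier prompt example; the engine stop/start commands are the procedure referenced in Starting / Stopping the DataStage Engine (section 2.3.3), so verify them against your release:

    # Run as dsadm, with all users logged out of DataStage.
    cd `cat /.dshome`                # $DSHOME
    . ./dsenv
    bin/uv -admin -stop              # cold: stop the engine first
    /bin/tar cvf /dev/rmt0 /usr/dsadm/Ascential/DataStage
    bin/uv -admin -start             # restart the engine when the tar completes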

2.2.3 Data Set and Sort Directories

The DataStage installer creates the following two subdirectories within the DataStage install directory:

Datasets/ : stores individual segment files of DataStage parallel Data Sets
Scratch/  : used by the parallel framework for temporary files such as sort and buffer overflow

Try not to use these directories, and consider deleting them to ensure they are never used. This is best done immediately after installation; be sure to coordinate this standard with the rest of the team.

DataStage parallel Configuration files are used to assign resources (such as processing nodes, disk and scratch file systems) at runtime when a job is executed.

¹Note that the use of Server Edition components in an Enterprise Edition environment is discouraged for performance and maintenance reasons. However, if legacy Server Edition applications exist, their corresponding objects may need to be taken into consideration.


The DataStage installer creates a default parallel Configuration file (Configurations/default.apt) which references the Datasets and Scratch subdirectories within the install directory. The DataStage Administrator should consider removing the default.apt file altogether, or at a minimum updating this file to reference the file systems you define (below).

2.2.3.1 Data and Scratch File Systems

It is a bad practice to share the DataStage install and Projects file systems with volatile files like scratch files and parallel Data Set segment files. Resource, scratch, and sort disks service very different kinds of data with completely opposite persistence characteristics. Furthermore, they compete directly with each other for I/O bandwidth and service time if they share the same path.

Optimally, these file systems should not have any physical disks in common and should not share any physical disks with databases. While it is often impossible to allocate contention-free storage, it must be noted that at large data volumes and/or in highly active job environments, disk arm contention can and usually does significantly constrain performance.

NOTE: For optimal performance, file systems should be created in high-performance, low-contention storage. The file systems should be expandable without requiring destruction and re-creation.

2.2.3.2 Data Sets

Parallel Data Sets are used for persistent data storage in parallel, in native DSEE format. The DataStage developer specifies the location of the Data Set header file, which is a very small pointer to the actual data segment files that are created by the DSEE engine in the directories specified by the disk resources assigned to each node in the parallel Configuration file. Over time, the Data Set segment file directory(-ies) will grow to contain dozens to thousands of files, depending on the number of DataStage Data Sets used by DSEE jobs.

The need to archive Data Set segment files depends on the recovery strategy chosen by the DataStage developer, the ability to recreate these files if the data sources remain, and the business requirements. Whatever archive policy is chosen should be coordinated with the DataStage Administrator and Developers. If Data Set segment files are archived, careful attention should be paid to also archiving the corresponding Data Set header files.
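As one illustration, the orchadmin utility that ships with the parallel framework can report which segment files belong to a Data Set header before it is archived. This is a sketch only: the option letters vary by release (check orchadmin help), and the configuration file and Data Set paths are hypothetical.

    cd `cat /.dshome`
    . ./dsenv                                        # DataStage environment
    export APT_CONFIG_FILE=/ds/configs/prod.apt      # same config file the job used
    $APT_ORCHHOME/bin/orchadmin describe -f /ds/datasets/customer.ds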

    2.2.3.3 Sort Space

As discussed, it is a recommended practice to isolate DataStage scratch and sort space from Data Sets and flat files, in that temporary files exist only while a job is running² and are "warm" files (that is, being read and written at above-average rates). Note that sort space must accommodate only the files being sorted simultaneously and, assuming that jobs are scheduled non-concurrently, only the maximum of said sorts.

²Some files created by database stages persist after job completion. For example, the Oracle .log, .ctl and .bad files will remain in the first Scratch resource pool after a load completes.

    There is no persistence to these temporary sort files so they need not be archived.

Sizing DataStage scratch space is somewhat difficult. Objects in this space include lookups and intra-process buffers. Intra-process buffers absorb rows at runtime when a stage (or stages) in a partition (or all partitions) cannot process rows as fast as they are supplied. In general, there are as many buffers as there are stages on the canvas for each partition. As a practical matter, assume that scratch space must accommodate the largest volume of data in one job (see the previous formula for Data Sets and flat files). There are advanced ways to isolate buffer storage from sort storage, but this is a performance tuning exercise, not a general requirement.

    2.2.3.4 Maintaining Parallel Configuration Files

DataStage parallel Configuration files are used to assign resources (such as processing nodes, disk and scratch file systems) at runtime when a job is executed. Parallel Configuration files are discussed in detail in the DataStage Parallel Job Advanced Developer's Guide.

Parallel configuration files can be located within any directory that has suitable access permissions, and are selected at runtime through the environment variable $APT_CONFIG_FILE. However, the graphical Configurations tool within the DataStage clients expects these files to be stored within the Configurations subdirectory of the DataStage install. For this reason, it is recommended that all parallel configuration files be stored in the Configurations subdirectory, with naming conventions to associate them with a particular project or application.

The default.apt file is created when DataStage is installed, and references the Datasets and Scratch subdirectories of the DataStage install directory. To manage system resources and disk allocation, the DataStage administrator should consider removing this file and creating separate configuration files that are referenced by the $APT_CONFIG_FILE setting in each DataStage Project.

At a minimum, the DataStage administrator should edit the default.apt configuration file to reference the newly created Data and Scratch file systems, and ensure that these directories are used by any other parallel configuration files.
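For illustration, a minimal single-node configuration file that points the engine at dedicated Data and Scratch file systems might look like the following; the node name, fastname, and paths are hypothetical and must match your environment:

    {
        node "node1"
        {
            fastname "etl_node"
            pools ""
            resource disk "/ds/datasets" {pools ""}
            resource scratchdisk "/ds/scratch" {pools ""}
        }
    }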

2.2.4 Extending the DataStage Project for External Entities

It is recommended that another directory structure be created to integrate all aspects of a DataStage application that are managed outside of the DataStage Projects repository. This hierarchy should include directories for secured parameter files, Data Set header files, custom components, Orchestrate schemas, SQL, and shell scripts. It may also be useful to support custom job logs and reports.



2.2.5 File Staging

It is recommended that a separate Staging file system and directory structure be used for storing, managing, and archiving various source data files.
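For example (illustrative only; adapt the states and retention to your requirements), organized by project and processing state:

    /ds/staging/<project>/
        incoming/     source files as delivered
        archive/      compressed copies retained per the archival policy
        reject/       files or records that failed validation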

2.3 Administrator Tips

2.3.1 Shell Environment

Establish a convenient environment variable pointing to the main DataStage directory, and automatically source the DataStage environment, by adding the following three lines to the user's shell profile (.profile, .bashrc, etc.):

dsroot=`cat /.dshome`/..
export dsroot
. $dsroot/DSEngine/dsenv

Note: The /.dshome file is only created with a standard (non-itag) install of DataStage. If you have installed multiple DataStage engines on a single server (using an itag install), then you will need to source the appropriate dsenv file for the DataStage environment you are managing.


2.3.2 Standard DSParams

The DSParams file comes from a template. Configure one project as the standard configuration for environment variables, sequencer settings, etc. Then copy the DSParams from the model project directory to the Template directory. Every time a new project is created, it will inherit the settings from the DSParams file in the template.
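A hedged sketch, assuming the model project is named standard_proj and was created in the default Projects directory; the Template directory location shown here is an assumption that varies by release, so verify it on your installation before copying:

    # Preserve the shipped template, then promote the model project's settings.
    dsroot=`cat /.dshome`/..
    cp $dsroot/Template/DSParams $dsroot/Template/DSParams.orig
    cp $dsroot/Projects/standard_proj/DSParams $dsroot/Template/DSParams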

2.3.3 Starting / Stopping the DataStage Engine

The DataStage installation will configure your system to start the DataStage server main processes (dsrpcd and the EE job monitor, JobMonApp) automatically when the system starts. For UNIX systems, S99ds.rs is installed in /etc/rc2.d and the DataStage services are set to start automatically. One exception is a non-root installation; in this case, scripts should be executed by the root user to set up impersonation and autostart.

To manually stop and start the service on Windows, invoke the DataStage Control Panel application from the Windows Control Panel.

To manually stop or start the DataStage engine, refer to the Administrator Guide (dsadmgde.pdf), Stopping and Restarting the Server Engine.
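On UNIX, the engine can also be stopped and started from the command line; the following is a sketch using the uv -admin utility in $DSHOME (confirm the exact options against the Administrator Guide for your release):

    su - dsadm
    cd `cat /.dshome`          # $DSHOME, e.g. .../DataStage/DSEngine
    . ./dsenv
    bin/uv -admin -stop        # stop dsrpcd and the engine processes
    bin/uv -admin -info        # report current engine status
    bin/uv -admin -start       # restart the engine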

2.3.4 Server Will Not Start Because Port Is Busy

This usually occurs when the server is brought down before all clients have exited.



A feature of UNIX related to TCP sockets that become disconnected will hold the port (in this case, the dsrpcd listener port) in FIN_WAIT state for the length of the FIN_WAIT interval. While the port is in this state, the DataStage dsrpcd server will not start.

You can either wait for the FIN_WAIT interval to expire (usually 10 minutes) or, in an emergency, as root change the setting to something like 1 minute. This is a dynamic network parameter and can be set temporarily to a lower value; reset it back to the original value once the server starts. Use the following utilities: ndd (Solaris, HP-UX), no (AIX).
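A hedged example for Solaris using ndd; the parameter name and units differ by platform and OS release, so treat these lines as illustrations rather than prescriptions:

    # Solaris: FIN_WAIT_2 flush interval, in milliseconds
    ndd -get /dev/tcp tcp_fin_wait_2_flush_interval        # note the original value
    ndd -set /dev/tcp tcp_fin_wait_2_flush_interval 60000  # roughly 1 minute
    # ...start the DataStage engine, then restore the original value.
    # AIX uses no, e.g.: no -o tcp_finwait2=120            # value in half-seconds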

2.3.5 Universe Shell

The DataStage server engine is based on Universe. It is a complete application environment containing a shell, file types, a programming language, and many facilities for application operations like lock management. To invoke the Universe shell, the DataStage environment variables must be set. This is easily done by sourcing/executing the dsenv file in $DSHOME. To invoke the Universe shell, use these commands:

cd $DSHOME
bin/uvsh

2.3.6 Resource Locks

If a developer is working on a job in the Designer and there is a network failure or client machine failure, the job will remain locked according to DataStage. When a job is locked, it must be cleared before it can be accessed by any DataStage component. Clearing locks can be done from the DataStage Director pull-down menu Job → Cleanup Resources. Choosing this option will open the Job Resources interface.



2.4 Performance Monitoring

Once the processing bottleneck is discovered, action can then be taken to improve performance. For example, a job may appear to be running slowly with no indication of a CPU, I/O, or memory bottleneck; performance of the job could be improved by creating more logical processing nodes in the DataStage EE configuration file, or it may need to be redesigned. As parallelism is increased, more system resources will be utilized, and one will find that the system may become the gating factor of performance. The remedy to this problem may be to increase system resources, like adding more CPUs or spreading I/O to other physical devices and controllers.

2.4.1 DataStage EE Job Monitor

The DataStage EE job monitor (JobMonApp) provides a useful snapshot of the job's performance at that moment of execution, but does not provide thorough performance metrics. That is, a JobMonApp snapshot should not be used in place of a full run of the job, or a run with a sample set of data. Due to buffering and to some jobs' semantics, a snapshot image of the flow may not be a representative sample of the performance over the course of the entire job.

The CPU Summary information provided by JobMonApp is useful as a first approximation of where time is being spent in the flow. However, it will not show operators that were inserted by the parallel framework. Such operators include sorts that were not explicitly included, and sub-operators of composites.

2.4.2 Performance Metrics with DataStage EE Environment Variables

There are a number of environment variables that direct DataStage parallel jobs to report detailed runtime information, enabling you to determine where time is being spent, how many rows were processed, and how much memory each instance of a stage utilized during a run. Setting these environment variables also allows you to report on operators that were inserted by the parallel framework, such as sorts that were not explicitly included, buffer operators, and sub-operators of composites.
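The variables described below can be enabled per run. For example, once they have been added to a job as environment-variable job parameters, a single run can switch them on from the dsjob command line (a sketch; the project and job names are hypothetical):

    dsjob -run -jobstatus \
          -param \$APT_PM_PLAYER_MEMORY=1 \
          -param \$APT_PM_PLAYER_TIMING=1 \
          -param \$APT_RECORD_COUNTS=1 \
          myproject myjob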

    APT_PM_PLAYER_MEMORY

Setting this variable causes each player process to report the process heap memory allocation in the job log when the operator instance completes execution.

    Example of player memory:

    APT_CombinedOperatorController,0: Heap growth during runLocally(): 1773568 bytes

    APT_PM_PLAYER_TIMING

Setting this variable causes each player process to report its call and return in the job log. The message with the return is annotated with CPU times for the player process.

Example of player timings, showing the elapsed time of the operator and the amount of user and system time, as well as total CPU:

    APT_CombinedOperatorController,0: Operator completed. status: APT_StatusOk elapsed: 0.30 user: 0.02 sys: 0.02 (totalCPU: 0.04)


    APT_RECORD_COUNTS

Setting this variable causes DataStage to print to the job log, for each operator player, the number of records input and output. Abandoned input records are not necessarily accounted for. Buffer operators do not print this information.

Example of record counts, showing the number of rows processed for the input link and output link of partition 0 of the Sort_3 stage:

Sort_3,0: Input 0 consumed 5000 records.
Sort_3,0: Output 0 produced 5000 records.

APT_PERFORMANCE_DATA

APT_PERFORMANCE_DATA, or the osh -pdd advanced runtime option, allows you to capture raw performance data for every underlying job process at runtime.

Within a job parameter, set $APT_PERFORMANCE_DATA = dirpath, where dirpath is a directory on the DataStage server in which to capture performance statistics. This will create an XML document named performance.<id> in the specified directory. You can influence the name of the file by specifying the osh -jobid <id> advanced runtime option, in which case the performance XML document will be named performance.<id>.
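A minimal sketch of capturing raw performance data for one run via a job parameter; the capture directory, project, and job names are hypothetical, and the directory must exist and be writable by the job's user:

    mkdir -p /ds/perfdata            # hypothetical capture directory
    dsjob -run -jobstatus \
          -param \$APT_PERFORMANCE_DATA=/ds/perfdata \
          myproject myjob
    # one performance.<id> XML document is written to /ds/perfdata,
    # ready for performance_convert (described below)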

The XML header of the performance document describes the detailed performance data captured in each record. Note that this information is more detailed than the higher-level information captured by DSMakeJobReport, and includes information on all of the processes (including buffer operators and framework-inserted sorts).



Starting with release 7.5, the Perl script performance_convert, located in the directory $APT_ORCHHOME/bin, can be used to convert the raw performance data into other usable formats, including:
- CSV text files
- detail Data Sets
- summary Data Sets (summarizes the total time and maximum heap memory usage per operator)

    The syntax is:

perl $APT_ORCHHOME/bin/performance_convert inputfile output_base [-schema|-dataset|-summary] [-help]

where
inputfile - location of the performance data to convert
output_base - location and file prefix for all files being generated
(e.g., /mydir/jobid -> /mydir/jobid.CSV)
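For example, a summary data set could be produced from a captured performance file as follows (the file and prefix paths are hypothetical):

perl $APT_ORCHHOME/bin/performance_convert /mydir/performance.myjob /mydir/myjob -summary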

2.4.3 iostat

iostat is useful for examining the throughput of various disk resources. If one or more disks have high throughput, understanding where that throughput is coming from is vital. If there are spare CPU cycles, I/O is often the culprit. iostat can also help a user determine if there is excessive I/O for a specific job.

The specifics of iostat output vary slightly from system to system. Here is an example from a Linux machine which shows a relatively light load (the first set of output is cumulative data since the machine was booted):

$ iostat 10
Device:    tps   Blk_read/s   Blk_wrtn/s    Blk_read    Blk_wrtn
dev8-0   13.50       144.09       122.33   346233038   293951288

Every N seconds (10 in the command line example), iostat outputs:

Device:    tps   Blk_read/s   Blk_wrtn/s    Blk_read    Blk_wrtn
dev8-0    4.00         0.00        96.00           0          96

2.4.4 vmstat

vmstat is useful for examining system paging. Ideally, an EE flow, once it begins running, should never be paging to disk (si and so should be zero). Paging suggests EE is consuming too much total memory.

$ vmstat 1
 procs                 memory            swap         io     system        cpu
 r  b  w   swpd   free   buff   cache   si  so   bi  bo   in  cs   us  sy  id
 0  0  0  10692  24648  51872  228836    0   0    0   1    2   2    1   1   0

vmstat produces the following every N seconds:



 0  0  0  10692  24648  51872  228836    0   0    0   0  328  41    1   0  99

    mpstat will produce a similar report based on each processor of an SMP.

2.4.5 Load Average

Ideally, each flow should be consuming as much CPU as is available. The load average on the machine should be 2-3x the number of processors on the machine (an 8-way SMP should have a load average of roughly 16-24). Some operating systems, such as HP-UX, show per-processor load average; in this case, the load average should be 2-3, regardless of the number of CPUs on the machine.

If the machine isn't CPU-saturated, a bottleneck may exist elsewhere in the flow. Over-partitioning may be a useful strategy in these cases.

If the flow pegs the machine, then the flow is likely CPU limited, and if performance isn't adequate some determination needs to be made as to where the CPU time is being spent. See the next section (2.4.6) to monitor individual processes.

    The commands top or uptime can provide the load average.

    xload can provide a histogram of the load average over time.

Tools such as top, topas, and nmon give you a real-time view of the system and are extremely useful for evaluating a system's performance.

2.4.6 How to Monitor DataStage EE Processes

Refer to Appendix A: Processes Created at Runtime by DataStage EE for diagrams of processes created by DataStage.

The player process identifiers (PIDs) of a job can be identified by setting the environment variable APT_PM_PLAYER_PID=TRUE. This will produce messages in the job log correlating each instance of an operator with its PID.

You can also identify the processes without using APT_PM_PLAYER_PID, by looking for processes that are running the osh or phantom programs. osh is the Orchestrate shell, the main program of the parallel framework. All parallel job execution, that is, section leaders and players, is spawned from this program. osh processes will be started on all physical processing nodes participating in a job's execution. Phantom is the name of the process spawned by DataStage for job control, that is, Job Sequencers. Phantom processes only run on the conductor node. When you invoke a job from DataStage, it will first start a phantom process, which controls and monitors the overall execution of the job. The phantom will then invoke osh. Phantom processes can also spawn other child phantoms if your job control invokes child Job Sequencers.


    2.4.7 Engine Processes and System Resources

Refer to Appendix A: Processes Created at Runtime by DataStage EE for diagrams of processes created by DataStage.

The DataStage server engine program is called dsrpcd, a daemon that manages connections to DataStage projects. dsrpcd utilizes semaphores and shared memory segments when it is operating; the semaphores and shared memory segments used by dsrpcd are prefixed with the string 0xade. The UNIX command ipcs will produce a list of semaphores and shared memory segments used by DataStage.

When a user logs into DataStage, dsrpcd will spawn two processes for each session, the dsapi_client and dsapi_slave processes. These manage all of your interactions with the DataStage project. One way to force users to log off from DataStage is to kill the dsapi_slave process. Note that on UNIX the dsapi_slave process is identified as dscs.
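A minimal sketch of inspecting and terminating these resources from a standard UNIX shell (the PID is whatever ps reports on your system):

ipcs | grep ade            # semaphores and shared memory segments used by dsrpcd
ps -ef | grep dscs         # client session slaves (dsapi_slave, shown as dscs on UNIX)
kill <pid>                 # force a session off by killing its dsapi_slave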

    2.4.8 Disk Space Used by DataStage

    DataStage utilizes disk space in a number of places.

Within a DataStage project, the following will grow over time and need to be purged on a regular basis:

- Job Logs - Purge by setting up a purge policy through the DataStage Administrator. This standard practice emphatically recommends setting a purge policy to avoid filling the project file system.

- &PH& - This is a directory in each project that is used for the stderr and stdout of phantom processes. Each job execution creates a file in this directory; over time the directory will grow and should therefore be cleaned on a regular basis to avoid filling up the project file system (see the cleanup sketch after this list). Typical file size is less than 1K; files larger than 1K are an indication of a problem with a job. In the event of a hard crash of a job, examining the DSD.RUN* files may provide useful information in explaining the problem.

- $TMPDIR - This environment variable tells DataStage where to write temporary files created by the parallel framework, such as the job score and temporary files for the Lookup stage. This directory is automatically cleaned up by the parallel framework; however, hard crashes may leave files stranded in this directory. You can identify DataStage EE temp files by looking for files that begin with APT*. The default $TMPDIR is /tmp; performance improvements can be achieved by setting $TMPDIR to a faster file system.

- Scratch - Identify scratch space by examining the APT_CONFIG_FILE. Scratch is used for sort and buffer overflow files. These files are temporary and are managed by the framework. One can judge how a job is performing by examining the number of files that are



created in the scratch area. For example, if there is a bottleneck in a process that fork joins, buffer overflow files will be written to scratch. The more files, the more buffering.

- DataSets - Identify the directories used by data sets by examining the APT_CONFIG_FILE, by using the orchadmin command line tool, or via Tools -> Data Set Management from the DataStage GUIs.
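A minimal cleanup sketch for the &PH& directory, assuming a project at the hypothetical path /ds/projects/myproject and a seven-day retention; run it only when no jobs are executing:

# Remove phantom stdout/stderr files older than 7 days from the project's &PH& directory
find '/ds/projects/myproject/&PH&' -type f -mtime +7 -exec rm {} \;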

2.5 Security, Roles, DataStage User Accounts

In general, initial access to DataStage projects is enforced by the operating system's security, such as logging into a project through DataStage Designer, as well as read, write, execute, and delete permissions on a project directory.

As a first level of security, Administrators should leverage operating system groups to grant and deny access to a DataStage project. That is, for each project create an operating system group (the group name should be the same as the project), assign the group to the project directory (chown), and grant users access to that project by making them members of the project's group. This will give users the authorization to log into and manage objects in the project.
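A minimal sketch of this setup, assuming a hypothetical project named dstage1 located at /ds/projects/dstage1 and a hypothetical user etldev (exact account-management commands vary by UNIX flavor):

groupadd dstage1                        # create a group named after the project
chgrp -R dstage1 /ds/projects/dstage1   # assign the group to the project directory
chmod -R 775 /ds/projects/dstage1       # grant the group read/write/execute
usermod -a -G dstage1 etldev            # make the user a member of the project's group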

As a second level of control, Administrators should assign DataStage roles (see below) to the groups that have access to the project. This will limit what users can do within DataStage, such as creating jobs, compiling jobs, and running jobs.

2.5.1 DataStage Roles

DataStage security is based on operating system groups. When creating a DataStage project, consider limiting access to the project by creating an operating system group and assigning that group as the owner of the DataStage project directory. Then make operating system user IDs members of the group, and grant roles to users from the DataStage Administrator as described below.

The following was copied from the DataStage Administrator Guide (dsadmgde.pdf):

To prevent unauthorized access to DataStage projects, you must assign the users on your system to the appropriate DataStage user category. To do this, you must have administrator status. You can do many of the administration tasks described in this section if you have been defined as a DataStage Developer or a DataStage Production Manager; you do not need to have specific administration rights. However, to do some tasks you must be logged on to DataStage using a user name that gives you administrator status:
- For Windows servers: You must be logged on as a member of the Windows Administrators group.
- For UNIX servers: You must be logged in as root or the DataStage administrative user (dsadm by default).
You require administrator status, for example, to change license details, add and delete projects, or to set user group assignments.

There are four categories of DataStage user:
- DataStage Developer, who has full access to all areas of a DataStage project



- DataStage Production Manager, who has full access to all areas of a DataStage project, and can also create and manipulate protected projects. (Currently on UNIX systems the Production Manager must be root or the administrative user in order to protect or unprotect projects.)
- DataStage Operator, who has permission to run and manage DataStage jobs
- <None>, who does not have permission to log on to DataStage.

You cannot assign individual users to these categories. You have to assign the operating system user group to which the user belongs. For example, a user with the user ID peter belongs to a user group called clerks. To give DataStage Operator status to user peter, you must assign the clerks user group to the DataStage Operator category.

Note: When you first install DataStage, the Everyone group is assigned to the category DataStage Developer. This group contains all users, meaning that every user has full access to DataStage. When you change the user group assignments, remember that these changes are meaningful only if you also change the category to which the Everyone group is assigned.

2.5.2 User Environment

It is common for DataStage developers and administrators to utilize the UNIX or Windows command line. For this reason, the DataStage user's account should be configured with the proper environment variables.

All users should have these lines added to their login profile (as noted earlier, the /.dshome file is only created on default, non-itag, installs):

dsroot="`cat /.dshome`/.."
export dsroot
. $dsroot/DSEngine/dsenv

Add these lines to the end of $DSHOME/dsenv:

APT_ORCHHOME=$DSHOME/../PXEngine
export APT_ORCHHOME
APT_CONFIG_FILE=$DSHOME/../Configurations/default.apt
export APT_CONFIG_FILE
PATH=$APT_ORCHHOME/bin:$PATH
export PATH
LD_LIBRARY_PATH=$APT_ORCHHOME/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH

    To Configure an Orchestrate User:

The following steps explain in detail how to configure a DataStage user's environment. The steps described above in User Environment should be sufficient.

1 In your .profile, .kshrc, or .cshrc, set the APT_ORCHHOME environment variable to the directory in which Orchestrate is installed. This is either the default, /ascential/apt, or the directory you have defined as part of the installation procedure.




2 Add $APT_ORCHHOME/bin to your PATH environment variable. This is required for access to all scripts, executable files, and maintenance commands.
3 Add $APT_ORCHHOME/osh_wrappers and $APT_ORCHHOME/user_osh_wrappers to your PATH environment variable. This is required for access to the osh operators.
4 Make sure LIBPATH has been set to /usr/lib:/lib:$APT_ORCHHOME/lib:$APT_ORCHHOME/user_lib, followed by any additional libraries you need.
5 Optionally, add the path to the C++ compiler to your PATH environment variable. Orchestrate requires that the compiler be included in PATH if you will use the buildop utility or develop and run programs using the Orchestrate C++ interface.
6 Add the path to the dbx debugger to your PATH variable to facilitate error reporting. If an internal execution error occurs, Orchestrate attempts to invoke a debugger in order to obtain a stack traceback to include in the error report; if no debugger is available, no traceback will be generated.
7 By default, Orchestrate uses the directory /tmp for some temporary file storage. If you do not want to use this directory, assign the path name of a different directory through the environment variable TMPDIR. You can additionally assign this location through the Orchestrate environment variable APT_PM_SCOREDIR.
8 Make sure you have write access to the directories $APT_ORCHHOME/user_lib and $APT_ORCHHOME/user_osh_wrappers on all processing nodes.
9 If your system connects multiple processing nodes by means of a switch network in an MPP, set APT_IO_MAXIMUM_OUTSTANDING, which sets the amount of memory in bytes reserved for Orchestrate on every node communicating over the network. The default setting is 2 MB. Ascential Software suggests setting APT_IO_MAXIMUM_OUTSTANDING to no more than 64 MB (67,108,864 bytes). If your job fails with messages about broken pipes or broken TCP connections, reduce the value to 16 MB (16,777,216 bytes). In general, if TCP throughput is so low that there is idle CPU time, increment this variable (by doubling) until performance improves. If the system is paging, the setting is probably too high.

2.6 The DataStage Administrator - Project Configuration

    This section describes Standard practices for configuring a project.


By default, DataStage grants the Developer role to all groups. You should restrict the DataStage Developer and Production Manager roles to only trusted users.

This standard practice recommends always checking "Automatically Handle Activities that fail". The other options are optional. "Add checkpoints so sequence is restartable on failure" should be configured only if this is an acceptable approach to checkpoint restart.


The "Generated OSH visible for Parallel jobs in ALL projects" option should be checked.



3 Job Monitor

The DataStage job monitor provides the capability for collecting and reporting performance metrics. It must be running in order for the Audit & Metrics system (below) to function. The job monitor may impact system performance and can be tuned or shut off, configurable with the environment variables below.

3.1 Configuration

The job monitor uses two TCP ports which are chosen during installation. These should be entered in /etc/services as a manual step.

Entries should be made in the /etc/services file to protect the sockets used by the job monitor. The default socket numbers are 13400 and 13401, and entries in this file may look like this:

13400 tcp dsjobmon
13401 tcp dsjobmon

3.2 Job Monitor Environment Variables

The job monitor is controlled using the following environment variables. Standard practice in large-volume data environments is to use a size of about 10000 and to turn off APT_MONITOR_TIME with $UNSET.

For an explanation of time-based versus row-based monitoring, see "Job Monitor" on page 31 of the Parallel Job Advanced Developer's Guide (advpx.pdf).

    APT_MONITOR_SIZE

Determines the minimum number of records the DataStage Job Monitor reports. The default is 5000 records.

    APT_MONITOR_TIME

Determines the minimum time interval, in seconds, for generating monitor information at runtime. The default is 5 seconds. This variable takes precedence over APT_MONITOR_SIZE.

APT_NO_JOBMON

Turns off job monitoring entirely.

3.3 Starting & Stopping the Monitor

The monitor is normally started and stopped with the DataStage server engine. The root user has permission to stop and start the job monitor using these commands:

$DSHOME/../PXHOME/java/jobmoninit stop
$DSHOME/../PXHOME/java/jobmoninit start


    3.4 Monitoring jobmon

The existence of the job monitor process can be detected by looking for the JobMonApp string in the output of the ps command.

For example: ps -ef | grep JobMonApp

This will produce rather long output, but you will be able to identify the process number:

root 6700 1 0 Mar24 ? 00:00:01 /var/dsadm/Ascential/DataStage/DSEngine/../PXEngine/java/jre/bin/java -classpath /var/dsadm/Ascential/DataStage/DSEngine/../PXEngine/java/JobMonApp.jar:/var/dsadm/Ascential/DataStage/DSEngine/../PXEngine/java/xerces/xercesImpl.jar:/var/dsadm/Ascential/DataStage/DSEngine/.



4 Backup / Recovery / Replication / Failover Procedures

The DataStage environment should be backed up using your site's system backup tools. It is also possible to back up DataStage using archive tools such as tar and zip. In general, you should protect the DataStage environment with a combination of full and incremental backups, with a frequency that is sufficient to minimize loss of work (disk crash) and to minimize recovery time and effort.

In order to properly back up and promptly recover a DataStage installation and the applications developed with DataStage, you must identify the files and file systems that are required by the DataStage application. Minimal backup protection requires that the DataStage Conductor and projects be backed up by system backup on a regular basis.

It is likely that external entities will be closely integrated with applications developed with DataStage and will need to be backed up as well. If your site has standardized the directory structure for external entities, then identifying them for backup is straightforward. Otherwise, identification is a cumbersome ad-hoc exercise.

4.1 DataStage Conductor Backup

Also known as the DataStage installation directory, the conductor directory contains the DataStage core product software and configuration. It is critical that it be protected by regularly scheduled full and incremental backups.

    Location Path ../Ascential/DataStage

Events that result in changes to the DataStage conductor files and directories include creating and deleting projects, installing patches to the Engine, and manual modifications to files or subdirectories.

The DataStage installation creates the following subdirectories under ../Ascential/DataStage: Scratch, Datasets, and Projects. These directories are used to store volatile files and warrant special considerations. The project file system may be a separate file system, as recommended by the Install & Upgrade standard practices. See the section below for details on backing up the Projects directory. Consider not backing up the Scratch and Datasets directories.
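As an illustration only (GNU tar syntax and the paths shown are assumptions; your site's backup tooling remains the primary mechanism), a full archive of the conductor directory that skips the volatile subdirectories might look like:

cd /ds/Ascential
tar --exclude='DataStage/Scratch' --exclude='DataStage/Datasets' \
    -czf /backup/ds_conductor_$(date +%Y%m%d).tar.gz DataStage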

4.2 DataStage Project Backups

The location of a DataStage project can be determined when the project is created, by specifying a path. The default location is $DSHOME/Ascential/DataStage/Projects. The DataStage Projects directory will contain a subdirectory for each project. It is a useful practice to utilize the default project location, or to standardize on one location for all projects created on the system, because it will simplify identifying the location of projects for backup.

One can determine the path of a project through the DataStage Administrator.



The DataStage repository file UV.ACCOUNT contains the directory paths for each project. This file can be queried with the command:

    echo "SELECT * FROM UV.ACCOUNT;" | bin/uvsh

DataStage projects should be protected by both full and incremental system backups, performed at regular intervals (daily, hourly) that minimize exposure to a crash.

Special consideration should be given to development projects, since these are where developers will be saving work throughout the day. Developers and administrators should be aware that, in the event of a catastrophic storage system failure, work saved between backups could be lost.

It is best to back up the system, especially projects, when jobs are not running or when developers are not on the system. Due to the dynamic nature of a DataStage repository and its multi-file structure, there is a potential for a hot backup to contain an inconsistent view of the repository. This situation exists in almost all modern databases (except single-file databases): because the database is made up of many files that are updated at different times, getting a consistent view of all these files with a hot backup is difficult without complex solutions like breaking volume mirrors.

Avoid storing volatile files in a DataStage project, to prevent wasting the time and space required for the project backup.

Consider locating non-volatile external entities in the project, to provide a convenient method for backing up external entities that are related to the project.

Consider the DataStage job log purge policy. In order to maximize backup efficiency, set a log retention policy to purge shortly after a backup, without erasing entries before they are backed up. For example, if you incrementally back up a project daily, then set the purge policy to every two days. This will ensure all log entries are backed up, with minimal overlap.

4.3 DataStage Exports for Partial Backup

Some customers may choose to rely on DataStage exports for backups. This is not a comprehensive solution and should only be used in conjunction with full and incremental backups of the DataStage installation, DataStage projects, and external entities.

DataStage developers can use exports to save their work between backups, reducing exposure to gaps in system backup coverage.

    You cannot export locked jobs.

Export is a DataStage client-based Win32 application. It can be run from the DataStage Manager or using the command line tools. You need to be at the console because Windows pop-up dialog boxes sometimes appear.
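A sketch of a command-line export using the dscmdexport client tool (run from a Windows DataStage client; the host, credentials, project, and file names are hypothetical, and the option syntax should be verified against your client release):

dscmdexport /H=dshost /U=dsadm /P=secret myproject C:\exports\myproject.dsx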



4.4 Datasets, Lookup File Sets and File Sets

Before you begin backing up directories full of Datasets and File Sets, consider their volatility. These files are often temporary files that do not justify the time and expense of backing them up.

The parallel framework of DSEE supports three proprietary file types:

1. Persistent Data Sets: .ds, native EE data types, partitioned into multiple part files
2. Lookup File Sets: .fs, native EE data types, lookup key structure, one or more partitioned part files
3. External File Sets: .fs, external data types, one or more data files per processing node

All three are multipart files, consisting of a descriptor file and one or more data part files. The descriptor and all part files need to be backed up together. Data Sets can be backed up using any UNIX backup method so long as BOTH the control file portion and data file portion(s) of the Data Sets are backed up at the same time (and no process is writing, or waiting to write, to them). Restoration requires that the data segment files return to the EXACT location from which they came, while the control file portion (filename.ds) can be restored anywhere.

Following the standard practice, the descriptor file should be located in a Datasets directory for each project, $PROJECT_PLUS/datasets, and the part files will be located on processing nodes, as specified by the EE configuration file (APT_CONFIG_FILE). It is also important to know that the nodes holding the part files must be reflected in the APT_CONFIG_FILE used by any job that reads the data set or file set. Thus, administrators should ensure that the APT configuration files are backed up.

The orchadmin utility allows you to manage persistent data sets and lookup file sets. The utility can be accessed from the DataStage Manager, Designer, and Director by choosing Tools -> Data Set Management, or orchadmin can be invoked from the command line. Note that when using orchadmin from the command line, the user's environment must be configured as described in the section on setting up the command line environment for Orchestrate users.
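For example (a sketch; myds.ds is a hypothetical data set, and the command line environment must be configured as described in User Environment above):

orchadmin describe myds.ds   # show the descriptor and the part files behind it
orchadmin rm myds.ds         # remove the descriptor AND all part files consistently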

4.5 External Entities - Scripts, Routines, Staging Files

    Account for all scripts that are related to