
Ascential DataStage Parallel Job Developer’s Guide

Version 6.0
September 2002
Part No. 00D-023DS60


Published by Ascential Software

© 1997–2002 Ascential Software Corporation. All rights reserved.

Ascential, DataStage and MetaStage are trademarks of Ascential Software Corporation or its affiliates and may be registered in other jurisdictions.

Documentation Team: Mandy deBelin, Gretchen Wang

GOVERNMENT LICENSE RIGHTS

Software and documentation acquired by or for the US Government are provided with rights as follows: (1) if for civilian agency use, with rights as restricted by vendor’s standard license, as prescribed in FAR 12.212; (2) if for Dept. of Defense use, with rights as restricted by vendor’s standard license, unless superseded by a negotiated vendor license, as prescribed in DFARS 227.7202. Any whole or partial reproduction of software or documentation marked with this legend must reproduce this legend.


Table of Contents

Preface
    Documentation Conventions
    User Interface Conventions
    DataStage Documentation

Chapter 1. Introduction
    DataStage Parallel Jobs

Chapter 2. Designing Parallel Extender Jobs
    Parallel Processing
        Pipeline Parallelism
        Partition Parallelism
        Combining Pipeline and Partition Parallelism
    Parallel Processing Environments
    The Configuration File
    Partitioning and Collecting Data
        Partitioning
        Collecting
        The Mechanics of Partitioning and Collecting
    Meta Data
        Runtime Column Propagation
        Table Definitions
        Schema Files and Partial Schemas
        Data Types
        Complex Data Types
    Incorporating Server Job Functionality

Chapter 3. Stage Editors
    The Stage Page
        General Tab
        Properties Tab
        Advanced Tab
        Link Ordering Tab
    Inputs Page
        General Tab
        Properties Tab
        Partitioning Tab
        Columns Tab
        Format Tab
    Outputs Page
        General Tab
        Properties Page
        Columns Tab
        Format Tab
        Mapping Tab

Chapter 4. Sequential File Stage
    Stage Page
        Advanced Tab
    Inputs Page
        Input Link Properties
        Partitioning on Input Links
        Format of Sequential Files
    Outputs Page
        Output Link Properties
        Reject Link Properties
        Format of Sequential Files
    Using RCP With Sequential Stages

Chapter 5. File Set Stage
    Stage Page
        Advanced Tab
    Inputs Page
        Input Link Properties
        Partitioning on Input Links
        Format of File Set Files
    Outputs Page
        Output Link Properties
        Reject Link Properties
        Format of File Set Files
    Using RCP With File Set Stages

Chapter 6. Data Set Stage
    Stage Page
        Advanced Tab
    Inputs Page
        Input Link Properties
    Outputs Page
        Output Link Properties

Chapter 7. Lookup File Set Stage
    Stage Page
        Advanced Tab
    Inputs Page
        Input Link Properties
        Partitioning on Input Links
    Outputs Page
        Output Link Properties

Chapter 8. External Source Stage
    Stage Page
        Advanced Tab
    Outputs Page
        Output Link Properties
        Reject Link Properties
        Format of Data Being Read
    Using RCP With External Source Stages

Chapter 9. External Target Stage
    Stage Page
        Advanced Tab
    Inputs Page
        Input Link Properties
        Partitioning on Input Links
        Format of File Set Files
    Outputs Page
    Using RCP With External Target Stages

Chapter 10. Write Range Map Stage
    Stage Page
        Advanced Tab
    Inputs Page
        Input Link Properties
        Partitioning on Input Links

Chapter 11. SAS Data Set Stage
    Stage Page
        Advanced Tab
    Inputs Page
        Input Link Properties
        Partitioning on Input Links
    Outputs Page
        Output Link Properties

Chapter 12. DB2 Stage
    Stage Page
        Advanced Tab
    Inputs Page
        Input Link Properties
        Partitioning on Input Links
    Outputs Page
        Output Link Properties

Chapter 13. Oracle Stage
    Stage Page
        Advanced Tab
    Inputs Page
        Input Link Properties
        Partitioning on Input Links
    Outputs Page
        Output Link Properties

Chapter 14. Teradata Stage
    Stage Page
        Advanced Tab
    Inputs Page
        Input Link Properties
        Partitioning on Input Links
    Outputs Page
        Output Link Properties

Chapter 15. Informix XPS Stage
    Stage Page
        Advanced Tab
    Inputs Page
        Input Link Properties
        Partitioning on Input Links
    Outputs Page
        Output Link Properties

Chapter 16. Transformer Stage
    Transformer Editor Components
        Toolbar
        Link Area
        Meta Data Area
        Shortcut Menus
    Transformer Stage Basic Concepts
        Input Link
        Output Links
    Editing Transformer Stages
        Using Drag and Drop
        Find and Replace Facilities
        Creating and Deleting Columns
        Moving Columns Within a Link
        Editing Column Meta Data
        Defining Output Column Derivations
        Defining Constraints and Handling Rejects
        Specifying Link Order
        Defining Local Stage Variables
    The DataStage Expression Editor
        Entering Expressions
        Completing Variable Names
        Validating the Expression
        Exiting the Expression Editor
        Configuring the Expression Editor
    Transformer Stage Properties
        Stage Page
        Inputs Page
        Outputs Page

Chapter 17. Aggregator Stage
    Stage Page
        Properties
        Advanced Tab
    Inputs Page
        Partitioning on Input Links
    Outputs Page
        Mapping Tab

Chapter 18. Join Stage
    Stage Page
        Properties
        Advanced Tab
        Link Ordering
    Inputs Page
        Partitioning on Input Links
    Outputs Page
        Mapping Tab

Chapter 19. Funnel Stage
    Stage Page
        Properties
        Advanced Tab
        Link Ordering
    Inputs Page
        Partitioning on Input Links
    Outputs Page
        Mapping Tab

Chapter 20. Lookup Stage
    Stage Page
        Properties
        Advanced Tab
        Link Ordering
    Inputs Page
        Input Link Properties
        Partitioning on Input Links
    Outputs Page
        Reject Link Properties
        Mapping Tab

Chapter 21. Sort Stage
    Stage Page
        Properties
        Advanced Tab
    Inputs Page
        Partitioning on Input Links
    Outputs Page
        Mapping Tab

Chapter 22. Merge Stage
    Stage Page
        Properties
        Advanced Tab
        Link Ordering
    Inputs Page
        Partitioning on Input Links
    Outputs Page
        Reject Link Properties
        Mapping Tab

Chapter 23. Remove Duplicates Stage
    Stage Page
        Properties
        Advanced Tab
    Inputs Page
        Partitioning on Input Links
    Output Page
        Mapping Tab

Chapter 24. Compress Stage
    Stage Page
        Properties
        Advanced Tab
    Input Page
        Partitioning on Input Links
    Output Page

Chapter 25. Expand Stage
    Stage Page
        Properties
        Advanced Tab
    Input Page
        Partitioning on Input Links
    Output Page

Chapter 26. Sample Stage
    Stage Page
        Properties
        Advanced Tab
        Link Ordering
    Input Page
        Partitioning on Input Links
    Outputs Page
        Mapping Tab

Chapter 27. Row Generator Stage
    Stage Page
        Advanced Tab
    Outputs Page
        Properties

Chapter 28. Column Generator Stage
    Stage Page
        Properties
        Advanced Tab
    Input Page
        Partitioning on Input Links
    Outputs Page
        Mapping Tab

Chapter 29. Copy Stage
    Stage Page
        Properties
        Advanced Tab
    Input Page
        Partitioning on Input Links
    Outputs Page
        Mapping Tab

Chapter 30. External Filter Stage
    Stage Page
        Properties
        Advanced Tab
    Input Page
        Partitioning on Input Links
    Outputs Page

Chapter 31. Change Capture Stage
    Stage Page
        Properties
        Advanced Tab
        Link Ordering
    Inputs Page
        Partitioning on Input Links
    Outputs Page
        Mapping Tab

Chapter 32. Change Apply Stage
    Stage Page
        Properties
        Advanced Tab
        Link Ordering
    Inputs Page
        Partitioning on Input Links
    Outputs Page
        Mapping Tab

Chapter 33. Encode Stage
    Stage Page
        Properties
        Advanced Tab
    Inputs Page
        Partitioning on Input Links
    Outputs Page

Chapter 34. Decode Stage
    Stage Page
        Properties
        Advanced Tab
    Inputs Page
        Partitioning on Input Links
    Outputs Page

Chapter 35. Difference Stage
    Stage Page
        Properties
        Advanced Tab
        Link Ordering
    Inputs Page
        Partitioning on Input Links
    Outputs Page
        Mapping Tab

Chapter 36. Column Import Stage
    Stage Page
        Properties
        Advanced Tab
    Inputs Page
        Partitioning on Input Links
    Outputs Page
        Format Tab
        Mapping Tab
        Reject Link

Chapter 37. Column Export Stage
    Stage Page
        Properties
        Advanced Tab
    Inputs Page
        Partitioning on Input Links
        Format Tab
    Outputs Page
        Mapping Tab
        Reject Link

Chapter 38. Make Subrecord Stage
    Stage Page
        Properties
        Advanced Tab
    Inputs Page
        Partitioning on Input Links
    Outputs Page

Chapter 39. Split Subrecord Stage
    Stage Page
        Properties Tab
        Advanced Tab
    Inputs Page
        Partitioning on Input Links
    Outputs Page

Chapter 40. Promote Subrecord Stage
    Stage Page
        Properties
        Advanced Tab
    Inputs Page
        Partitioning on Input Links
    Outputs Page

Chapter 41. Combine Records Stage
    Stage Page
        Properties
        Advanced Tab
    Inputs Page
        Partitioning on Input Links
    Outputs Page

Chapter 42. Make Vector Stage
    Stage Page
        Properties
        Advanced Tab
    Inputs Page
        Partitioning on Input Links
    Outputs Page

Chapter 43. Split Vector Stage
    Stage Page
        Properties
        Advanced Tab
    Inputs Page
        Partitioning on Input Links
    Outputs Page

Chapter 44. Head Stage
    Stage Page
        Properties
        Advanced Tab
    Inputs Page
        Partitioning on Input Links
    Outputs Page
        Mapping Tab

Chapter 45. Tail Stage
    Stage Page
        Properties
        Advanced Tab
    Inputs Page
        Partitioning on Input Links
    Outputs Page
        Mapping Tab

Chapter 46. Compare Stage
    Stage Page
        Properties
        Advanced Tab
        Link Ordering Tab
    Inputs Page
        Partitioning on Input Links
    Outputs Page

Chapter 47. Peek Stage
    Stage Page
        Properties
        Advanced Tab
        Link Ordering
    Inputs Page
        Partitioning on Input Links
    Outputs Page
        Mapping Tab

Chapter 48. SAS Stage
    Stage Page
        Properties
        Advanced Tab
        Link Ordering
    Inputs Page
        Partitioning on Input Links
    Outputs Page
        Mapping Tab

Chapter 49. Specifying Custom Parallel Stages
    Defining Custom Stages
    Defining Build Stages
    Build Stage Macros
        How Your Code is Executed
        Inputs and Outputs
        Example Build Stage
    Defining Wrapped Stages
        Example Wrapped Stage

Chapter 50. Managing Data Sets
    Structure of Data Sets
    Starting the Data Set Manager
    Data Set Viewer
        Viewing the Schema
        Viewing the Data
        Copying Data Sets
        Deleting Data Sets

Chapter 51. DataStage Development Kit (Job Control Interfaces)
    DataStage Development Kit
        The dsapi.h Header File
        Data Structures, Result Data, and Threads
        Writing DataStage API Programs
        Building a DataStage API Application
        Redistributing Applications
        API Functions
    Data Structures
    Error Codes
    DataStage BASIC Interface
    Job Status Macros
    Command Line Interface
        The Logon Clause
        Starting a Job
        Stopping a Job
        Listing Projects, Jobs, Stages, Links, and Parameters
        Retrieving Information
        Accessing Log Files

Appendix A. Schemas
    Schema Format
        Date Columns
        Decimal Columns
        Floating-Point Columns
        Integer Columns
        Raw Columns
        String Columns
        Time Columns
        Timestamp Columns
        Vectors
        Subrecords
        Tagged Columns
    Partial Schemas

Appendix B. Functions
    Date and Time Functions
    Logical Functions
    Mathematical Functions
    Null Handling Functions
    Number Functions
    Raw Functions
    String Functions
    Type Conversion Functions
    Utility Functions

Appendix C. Header Files
    C++ Classes – Sorted By Header File
    C++ Macros – Sorted By Header File

Index

xviii Ascential DataStage Parallel Job Developer’s Guide

Page 19: DataStage Parallel Job Developer’s Guide

Preface

This manual describes the features of the DataStage Manager and DataStage Designer. It is intended for application developers and system administrators who want to use DataStage to design and develop data warehousing applications using parallel jobs.

If you are new to DataStage, you should read the DataStage Designer Guide and the DataStage Manager Guide. These provide general descriptions of the DataStage Manager and DataStage Designer, and give you enough information to get you up and running.

This manual contains more specific information and is intended to be used as a reference guide. It gives detailed information about parallel job design and stage editors.

Documentation Conventions

This manual uses the following conventions:

Convention        Usage

Bold              In syntax, bold indicates commands, function names, keywords, and options that must be input exactly as shown. In text, bold indicates keys to press, function names, and menu selections.

UPPERCASE         In syntax, uppercase indicates BASIC statements and functions and SQL statements and keywords.

Italic            In syntax, italic indicates information that you supply. In text, italic also indicates UNIX commands and options, file names, and pathnames.

Plain             In text, plain indicates Windows NT commands and options, file names, and path names.

Courier           Courier indicates examples of source code and system output.

Courier Bold      In examples, courier bold indicates characters that the user types or keys the user presses (for example, <Return>).

[ ]               Brackets enclose optional items. Do not type the brackets unless indicated.

{ }               Braces enclose nonoptional items from which you must select at least one. Do not type the braces.

itemA | itemB     A vertical bar separating items indicates that you can choose only one item. Do not type the vertical bar.

...               Three periods indicate that more of the same type of item can optionally follow.

➤                 A right arrow between menu commands indicates you should choose each command in sequence. For example, “Choose File ➤ Exit” means you should choose File from the menu bar, then choose Exit from the File pull-down menu.

This line ➥ continues
                  The continuation character is used in source code examples to indicate a line that is too long to fit on the page, but must be entered as a single line on screen.

The following conventions are also used:

• Syntax definitions and examples are indented for ease in reading.

• All punctuation marks included in the syntax—for example, commas, parentheses, or quotation marks—are required unless otherwise indicated.

• Syntax lines that do not fit on one line in this manual are continued on subsequent lines. The continuation lines are indented. When entering syntax, type the entire syntax entry, including the continuation lines, on the same input line.

User Interface Conventions

The following picture of a typical DataStage dialog box illustrates the terminology used in describing user interface elements:

[Figure: a typical DataStage dialog box, with callouts identifying the option button, button, check box, browse button, drop-down list, field, the Inputs page, and the General tab.]

The DataStage user interface makes extensive use of tabbed pages, sometimes nesting them to enable you to reach the controls you need from within a single dialog box. At the top level, these are called “pages”, at the inner level these are called “tabs”. In the example above, we are looking at the General tab of the Inputs page. When using context-sensitive online help you will find that each page has a separate help topic, but each tab uses the help topic for the parent page. You can jump to the help pages for the separate tabs from within the online help.

DataStage Documentation

DataStage documentation includes the following:

DataStage Parallel Job Developer Guide: This guide describes the tools that are used in building a parallel job, and it supplies programmer’s reference information.


DataStage Install and Upgrade Guide: This guide describes how to install DataStage on Windows and UNIX systems, and how to upgrade existing installations.

DataStage Server Job Developer Guide: This guide describes the tools that are used in building a server job, and it supplies programmer’s reference information.

DataStage Designer Guide: This guide describes the DataStage Manager and Designer, and gives a general description of how to create, design, and develop a DataStage application.

DataStage Manager Guide: This guide describes the DataStage Manager and how to use and maintain the DataStage Repository.

XE/390 Job Developer Guide: This guide describes the tools that are used in building a mainframe job, and it supplies programmer’s reference information.

DataStage Director Guide: This guide describes the DataStage Director and how to validate, schedule, run, and monitor DataStage server jobs.

DataStage Administrator Guide: This guide describes DataStage setup, routine housekeeping, and administration.

These guides are also available online in PDF format. You can read them using the Adobe Acrobat Reader supplied with DataStage.

Extensive online help is also supplied. This is especially useful when you have become familiar with using DataStage and need to look up particular pieces of information.


Chapter 1. Introduction

This chapter gives an overview of parallel jobs. Parallel jobs are compiled and run on the DataStage server. Such jobs connect to a data source, extract and transform data, and write it to a data warehouse.

DataStage also supports server jobs and mainframe jobs. Server jobs are also compiled and run on the server. These are for use on non-parallel systems and SMP systems with up to 64 processors. Server jobs are described in DataStage Server Job Developer’s Guide. Mainframe jobs are available if you have XE/390 installed. These are loaded onto a mainframe and compiled and run there. Mainframe jobs are described in XE/390 Job Developer’s Guide.

DataStage Parallel Jobs

DataStage jobs consist of individual stages. Each stage describes a particular database or process. For example, one stage may extract data from a data source, while another transforms it. Stages are added to a job and linked together using the Designer.

The following diagram represents one of the simplest jobs you could have: a data source, a Transformer (conversion) stage, and the final database.

[Figure: a data source stage linked to a Transformer stage, linked in turn to a data warehouse stage.]

The links between the stages represent the flow of data into or out of a stage.

You must specify the data you want at each stage, and how it is handled. For example, do you want all the columns in the source data, or only a select few? Should the data be aggregated or converted before being passed on to the next stage?

General information on how to construct your job and define the required meta data using the DataStage Designer and the DataStage Manager is in the DataStage Designer Guide and DataStage Manager Guide. Chapter 4 onwards of this manual describe the individual stage editors that you may use when developing parallel jobs.

Chapter 2. Designing Parallel Extender Jobs

The DataStage Parallel Extender brings the power of parallel processing to your data extraction and transformation applications.

This chapter gives a basic introduction to parallel processing, and describes some of the key concepts in designing parallel jobs for DataStage. If you are new to DataStage, you should read the introductory chapters of the DataStage Designer Guide first so that you are familiar with the DataStage Designer interface and the way jobs are built from stages and links.

Parallel Processing

There are two basic types of parallel processing: pipeline and partitioning. DataStage allows you to use both of these methods. The following sections illustrate these methods using a simple DataStage job which extracts data from a data source, transforms it in some way, then writes it to another data source. In all cases this job would appear the same on your Designer canvas, but you can configure it to behave in different ways (which are shown diagrammatically).

Pipeline Parallelism

If you implemented the example job using the parallel extender and ran it sequentially, each stage would process a single row of data then pass it to the next process, which would run and process this row then pass it on, etc. If you ran it in parallel, on a system with at least three processing nodes, the stage reading would start on one node and start filling a pipeline with the data it had read. The transformer stage would start running on another node as soon as there was data in the pipeline, process it and start filling another pipeline. The stage writing the transformed data to the target database would similarly start writing as soon as there was data available. Thus all three stages are operating simultaneously.

[Figure: the job running sequentially, compared with a conceptual representation of the same job using pipeline parallelism; the diagrams contrast the time taken by each approach.]

Partition Parallelism

Imagine you have the same simple job as described above, but that it is handling very large quantities of data. In this scenario you could use the power of parallel processing to your best advantage by partitioning the data into a number of separate sets, with each partition being handled by a separate processing node.


Using partition parallelism the same job would effectively be run simultaneously by several processing nodes, each handling a separate subset of the total data.

At the end of the job the data partitions can be collected back together again and written to a single data source.

[Figure: conceptual representation of the job using partition parallelism.]

Combining Pipeline and Partition Parallelism

If your system has enough processors, you can combine pipeline and partition parallel processing to achieve even greater performance gains. In this scenario you would have stages processing partitioned data and filling pipelines so the next one could start on that partition before the previous one had finished.

[Figure: conceptual representation of the job using pipeline and partitioning together.]


Parallel Processing Environments

The environment in which you run your DataStage jobs is defined by your system’s architecture and hardware resources. All parallel-processing environments are categorized as one of:

• SMP (symmetric multiprocessing), in which some hardware resources may be shared among processors.

• Cluster or MPP (massively parallel processing), also known as shared-nothing, in which each processor has exclusive access to hardware resources.

SMP systems allow you to scale up the number of CPUs, which may improve performance of your jobs. The improvement gained depends on how your job is limited:

• CPU-limited jobs. In these jobs the memory, memory bus, and disk I/O spend a disproportionate amount of time waiting for the CPU to finish its work. Running a CPU-limited application on more processing nodes can shorten this waiting time and so speed up overall performance.

• Memory-limited jobs. In these jobs CPU and disk I/O wait for the memory or the memory bus. SMP systems share memory resources, so it may be harder to improve performance on SMP systems without a hardware upgrade.

• Disk I/O limited jobs. In these jobs CPU, memory, and memory bus wait for disk I/O operations to complete. Some SMP systems allow scalability of disk I/O, so that throughput improves as the number of processors increases. A number of factors contribute to the I/O scalability of an SMP, including the number of disk spindles, the presence or absence of RAID, and the number of I/O controllers.

In a cluster or MPP environment, you can use the multiple CPUs and their associated memory and disk resources in concert to tackle a single job. In this environment, each CPU has its own dedicated memory, memory bus, disk, and disk access. In a shared-nothing environment, parallelization of your job is likely to improve the performance of CPU-limited, memory-limited, or disk I/O-limited applications.


The Configuration File

One of the great strengths of the DataStage parallel extender is that, when designing jobs, you don’t have to worry too much about the underlying structure of your system, beyond appreciating its parallel processing capabilities. If your system changes, is upgraded or improved, or if you develop a job on one platform and implement it on another, you don’t necessarily have to change your job design.

DataStage learns about the shape and size of the system from the configuration file. It organizes the resources needed for a job according to what is defined in the configuration file. When your system changes, you change the file, not the jobs.

Every MPP, cluster, or SMP environment has characteristics that define the system overall as well as the individual processing nodes. These characteristics include node names, disk storage locations, and other distinguishing attributes. For example, certain processing nodes might have a direct connection to a mainframe for performing high-speed data transfers, while other nodes have access to a tape drive, and still others are dedicated to running an RDBMS application.

The configuration file describes every processing node that DataStage will use to run your application. When you run a DataStage job, DataStage first reads the configuration file to determine the available system resources.

When you modify your system by adding or removing processing nodes or by reconfiguring nodes, you do not need to alter or even recompile your DataStage job. Just edit the configuration file.

The configuration file also gives you control over parallelization of your job during the development cycle. For example, by editing the configuration file, you can first run your job on a single processing node, then on two nodes, then four, then eight, and so on. The configuration file lets you measure system performance and scalability without actually modifying your job.

You can define and edit the configuration file using the DataStage Manager. This is described in the DataStage Manager Guide, which also gives detailed information on how you might set up the file for different systems.
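As an illustration, here is a minimal sketch of the kind of thing a configuration file contains, in this case for a two-node system (the node names, server name, and directory paths are invented for the example; see the DataStage Manager Guide for the full syntax):

    {
        node "node1" {
            fastname "dev_server"
            pools ""
            resource disk "/ds/data" {pools ""}
            resource scratchdisk "/ds/scratch" {pools ""}
        }
        node "node2" {
            fastname "dev_server"
            pools ""
            resource disk "/ds/data" {pools ""}
            resource scratchdisk "/ds/scratch" {pools ""}
        }
    }

Adding or removing node entries in a file like this changes how many ways parallel your jobs run, without any change to the jobs themselves.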


Partitioning and Collecting Data

We have already described how you can use partitioning of data to implement parallel processing in your job (see “Partition Parallelism” on page 2-3). This section takes a closer look at how you can partition data in your jobs, and collect it together again.

Partitioning

In the simplest scenario you probably won’t be bothered how your data is partitioned. It is enough that it is partitioned and that the job runs faster. In these circumstances you can safely delegate responsibility for partitioning to DataStage. Once you have identified where you want to partition data, DataStage will work out the best method for doing it and implement it.

The aim of most partitioning operations is to end up with a set of partitions that are as near equal size as possible, ensuring an even load across your processing nodes.

When performing some operations however, you will need to take control of partitioning to ensure that you get consistent results. A good example of this would be where you are using an aggregator stage to summarize your data. To get the answers you want (and need) you must ensure that related data is grouped together in the same partition before the summary operation is performed on that partition. DataStage lets you do this.

There are a number of different partitioning methods available:

Round robin. The first record goes to the first processing node, the second to the second processing node, and so on. When DataStage reaches the last processing node in the system, it starts over. This method is useful for resizing partitions of an input data set that are not equal in size. The round robin method always creates approximately equal-sized partitions.

Random. Records are randomly distributed across all processing nodes. Like round robin, random partitioning can rebalance the partitions of an input data set to guarantee that each processing node receives an approximately equal-sized partition. Random partitioning has a slightly higher overhead than round robin because of the extra processing required to calculate a random value for each record.


Same. The operator using the data set as input performs no repartitioning and takes as input the partitions output by the preceding stage. With this partitioning method, records stay on the same processing node; that is, they are not redistributed. Same is the fastest partitioning method.

Entire. Every instance of a stage on every processing node receives the complete data set as input. It is useful when you want the benefits of parallel execution, but every instance of the operator needs access to the entire input data set. You are most likely to use this partitioning method with stages that create lookup tables from their input.

Hash by field. Partitioning is based on a function of one or more columns (the hash partitioning keys) in each record. This method is useful for ensuring that related records are in the same partition. It does not necessarily result in an even distribution of data between partitions.

Modulus. Partitioning is based on a key column modulo the number of partitions. This method is similar to hash by field, but involves simpler computation.

Range. Divides a data set into approximately equal-sized partitions, each of which contains records with key columns within a specified range. This method is also useful for ensuring that related records are in the same partition.

DB2. Partitions an input data set in the same way that DB2 would partition it. For example, if you use this method to partition an input data set containing update information for an existing DB2 table, records are assigned to the processing node containing the corresponding DB2 record. Then, during the execution of the parallel operator, both the input record and the DB2 table record are local to the processing node. Any reads and writes of the DB2 table would entail no network activity.

The most common method you will see on the DataStage stages is Auto. This just means that you are leaving it to DataStage to determine the best partitioning method to use depending on the type of stage, and what the previous stage in the job has done.
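As a simple illustration of how the methods differ, suppose nine records arrive at a stage running on three processing nodes (the record numbers and key values are invented for the example):

    Round robin:    node 1 receives records 1, 4, 7
                    node 2 receives records 2, 5, 8
                    node 3 receives records 3, 6, 9

    Hash by field:  every record with key "London" goes to one node,
                    every record with key "Paris" to another, and so on;
                    partition sizes depend on how the key values are spread

This is why hash by field is the natural choice before a summary operation such as an Aggregator stage: it keeps related records together, whereas round robin guarantees approximately equal-sized partitions but splits key groups across nodes.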

Collecting

Collecting is the process of joining your partitions back together again into a single data set. There are various situations where you may want to do this. There may be a stage in your job that you want to run sequentially rather than in parallel, in which case you will need to collect all your partitioned data at this stage to make sure it is operating on the whole data set.


Similarly, at the end of a job, you may want to write all your data to a single database, in which case you need to collect it before you write it.

There may be other cases where you don’t want to collect the data at all. For example, you may want to write each partition to a separate flat file.

Just as for partitioning, in many situations you can leave DataStage to work out the best collecting method to use. There are situations, however, where you will want to explicitly specify the collection method. The following methods are available:

Round robin. Read a record from the first input partition, then from the second partition, and so on. After reaching the last partition, start over. After reaching the final record in any partition, skip that partition in the remaining rounds.

Ordered. Read all records from the first partition, then all records from the second partition, and so on. This collection method preserves the order of totally sorted input data sets. In a totally sorted data set, both the records in each partition and the partitions themselves are ordered.

Sorted merge. Read records in an order based on one or more columns of the record. The columns used to define record order are called collecting keys.

The most common method you will see on the DataStage stages is Auto. This just means that you are leaving it to DataStage to determine the best collecting method to use depending on the type of stage, and what the previous stage in the job has done.

The Mechanics of Partitioning and Collecting

This section gives a quick guide to how partitioning and collecting are represented in a DataStage job.

Partitioning Icons

Each parallel stage in a job can partition or repartition incoming data before it operates on it. Equally it can just accept the partitions that the data arrives in. There is an icon on the input link to a stage which shows how the stage handles partitioning.


In most cases, if you just lay down a series of parallel stages in a DataStage job and join them together, the auto method will determine partitioning. This is shown on the canvas by the auto partitioning icon:

In some cases, stages have a specific partitioning method associated with them that cannot be overridden. It always uses this method to organize incoming data before it processes it. In this case an icon on the input link tells you that the stage is repartitioning data:

If you specifically select a partitioning method for a stage, rather than just leaving it to default to Auto, the following icon is shown:

You can specify that you want to accept the existing data partitions by choosing a partitioning method of same. This is shown by the following icon on the input link:

Partitioning methods are set on the Partitioning tab of the Inputs pages on a stage editor (see page 3-11).

Preserve Partitioning Flag

A stage can also request that the next stage in the job preserves whatever partitioning it has implemented. It does this by setting the Preserve Partitioning flag for its output link. Note, however, that the next stage may ignore this request. It will only preserve partitioning as requested if it is using the Auto partition method.

If the Preserve Partitioning flag is cleared, this means that the current stage doesn’t care what the next stage in the job does about partitioning.

On some stages, the Preserve Partitioning flag can be set to Propagate. In this case the stage sets the flag on its output link according to what the previous stage in the job has set. If the previous stage is also set to Propagate, the setting from the stage before that is used, and so on, until a Set or Clear flag is encountered earlier in the job. If the stage has multiple inputs and has a flag set to Propagate, its Preserve Partitioning flag is set if it is set on any of the inputs, or cleared if all the inputs are clear.

Collecting Icons

A stage in the job which is set to run sequentially will need to collect partitioned data before it operates on it. There is an icon on the input link to a stage which shows that it is collecting data:

Meta Data

Meta data is information about data. It describes the data flowing through your job in terms of column definitions, which describe each of the fields making up a data record.

DataStage has two alternative ways of handling meta data, through Table definitions or through Schema files. By default, parallel stages derive their meta data from the columns defined on the Outputs or Inputs page Columns tab of your stage editor. Additional formatting information is supplied, where needed, by a Formats tab on the Outputs or Inputs page. You can also specify that the stage uses a schema file instead by explicitly setting a property on the stage editor and specifying the name and location of the schema file.


Runtime Column Propagation

DataStage is also flexible about meta data. It can cope with the situation where meta data isn’t fully defined. You can define part of your schema and specify that, if your job encounters extra columns that are not defined in the meta data when it actually runs, it will adopt these extra columns and propagate them through the rest of the job. This is known as runtime column propagation (RCP). This can be enabled for a project via the DataStage Administrator (see DataStage Administrator Guide), and set for individual links via the Outputs page Columns tab (see “Columns Tab” on page 3-28).

Table Definitions

A Table Definition is a set of related column definitions that are stored in the DataStage Repository. These can be loaded into stages as and when required.

You can import a table definition from a data source via the DataStage Manager or Designer. You can also edit and define new Table Definitions in the Manager or Designer (see DataStage Manager Guide and DataStage Designer Guide). If you want, you can edit individual column definitions once you have loaded them into your stage.

You can also simply type in your own column definition from scratch on the Outputs or Inputs page Columns tab of your stage editor (see page 3-16 and page 3-28). When you have entered a set of column definitions you can save them as a new Table Definition in the Repository for subsequent reuse in another job.

Schema Files and Partial Schemas

You can also specify the meta data for a stage in a plain text file known as a schema file. This is not stored in the DataStage Repository but you could, for example, keep it in a document management or source code control system, or publish it on an intranet site.

The format of schema files is described in Appendix A of this manual.
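As a brief illustration, a schema file defines a record as a set of typed columns. A minimal sketch might look like this (the column names are invented; Appendix A gives the full syntax):

    record
      (
        name:string[255];
        address:nullable string;
        value:int32;
        date:date;
      )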

Some parallel job stages allow you to use a partial schema. This means that you only need define column definitions for those columns that you are actually going to operate on. Partial schemas are also described in Appendix A.

Data Types

When you work with parallel job column definitions, you will see that they have an SQL type associated with them. This maps onto an underlying data type which you use when specifying a schema via a file, and which you can view in the Parallel tab of the Edit Column Meta Data dialog box (see page 3-16 for details). The following table summarizes the underlying data types that column definitions can have:

SQL Type                                Underlying Data Type   Size                                 Description

Date                                    date                   4 bytes                              Date with month, day, and year
Decimal, Numeric                        decimal                (Roundup(p)+1)/2                     Packed decimal, compatible with IBM packed decimal format
Float, Real                             sfloat                 4 bytes                              IEEE single-precision (32-bit) floating point value
Double                                  dfloat                 8 bytes                              IEEE double-precision (64-bit) floating point value
TinyInt                                 int8, uint8            1 byte                               Signed or unsigned integer of 8 bits
SmallInt                                int16, uint16          2 bytes                              Signed or unsigned integer of 16 bits
Integer                                 int32, uint32          4 bytes                              Signed or unsigned integer of 32 bits
BigInt                                  int64, uint64          8 bytes                              Signed or unsigned integer of 64 bits
Binary, Bit, LongVarBinary, VarBinary   raw                    1 byte per character                 Untyped collection, consisting of a fixed or variable number of contiguous bytes and an optional alignment value
Unknown, Char, LongNVarChar,
LongVarChar, NChar, NVarChar, VarChar   string                 1 byte per character                 ASCII character string of fixed or variable length
Char                                    subrec                 sum of lengths of subrecord fields   Complex data type comprising nested columns
Char                                    tagged                 sum of lengths of subrecord fields   Complex data type comprising tagged columns, of which one can be referenced when the column is used
Time                                    time                   5 bytes                              Time of day, with resolution of seconds or microseconds
Timestamp                               timestamp              9 bytes                              Single field containing both date and time value

Complex Data Types

Parallel jobs support three complex data types:

• Subrecords
• Tagged subrecords
• Vectors

Subrecords

A subrecord is a nested data structure. The column with type subrecord does not itself define any storage, but the columns it contains do. These columns can have any data type, and you can nest subrecords one within another. The LEVEL property is used to specify the structure of subrecords. The following diagram gives an example of a subrecord structure.

    Parent (subrecord)
        Child1 (string)
        Child2 (string)
        Child3 (integer)
        Child4 (date)
        Child5 (subrecord)
            Grandchild1 (string)
            Grandchild2 (time)
            Grandchild3 (sfloat)

(Child1 through Child5 are defined at LEVEL 01; Grandchild1 through Grandchild3 at LEVEL 02.)

Tagged Subrecord

This is a special type of subrecord structure: it comprises a number of columns of different types, and the actual column is ONE of these, as indicated by the value of a tag at run time. The columns can be of any type except subrecord or tagged. The following diagram illustrates a tagged subrecord.

    Parent (tagged)
        Child1 (string)
        Child2 (int8)
        Child3 (raw)

(Here Tag = Child1, so the column has a data type of string.)

Vector

A vector is a one dimensional array of any type except tagged. All the elements of a vector are of the same type, and are numbered from 0. The vector can be of fixed or variable length. For fixed length vectors the length is explicitly stated; for variable length ones a property defines a link field which gives the length at run time. The following diagram illustrates a vector of fixed length and one of variable length.

    Fixed length:

    int32  int32  int32  int32  int32  int32  int32  int32  int32
      0      1      2      3      4      5      6      7      8

    Variable length:

    int32  int32  int32  int32  int32   ...   int32
      0      1      2      3      4             N

    link field = N

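Tying these complex types back to schema files, the following sketch suggests how a subrecord, a tagged column, and fixed- and variable-length vectors might be declared (the column names are invented; see Appendix A for the authoritative syntax):

    record
      (
        Parent:subrec
          (
            Child1:string[8];
            Child2:int8;
          );
        Choice:tagged
          (
            AsString:string;
            AsNumber:int32;
          );
        Readings[9]:int32;
        Samples[]:int32;
      )

Here Readings is a fixed-length vector of nine int32 elements (numbered 0 to 8), and Samples is a variable-length vector whose length is supplied at run time.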

Incorporating Server Job Functionality

You can incorporate Server job functionality in your Parallel jobs by the use of Shared Container stages. This allows you to, for example, use Server job plug-in stages to access data sources that are not directly supported by Parallel jobs.

You create a new shared container in the DataStage Designer, add Server job stages as required, and then add the shared container to your Parallel job and connect it to the Parallel stages. Shared container stages used in Parallel jobs have extra pages in their Properties dialog box, which enable you to specify details about parallel processing and partitioning and collecting data.

You can only use Shared Containers in this way on SMP systems (not MPP or cluster systems).

The following limitations apply to the contents of such shared containers:

• There must be zero or one container inputs, zero or more container outputs, and at least one of either.

• There can be no disconnected flows – all stages must be linked to the input or an output of the container directly or via an active stage. When the container has an input and one or more outputs, each stage must connect to the input and at least one of the outputs.

• There can be no synchronization by having a passive stage with both input and output links.

For details on how to use Shared Containers, see DataStage Designer Guide.


Chapter 3. Stage Editors

The Parallel job stage editors all use a generic user interface (with the exception of the Transformer stage and Shared Container stages). This chapter describes the generic editor and gives a guide to using it.

Parallel jobs have a large number of stages available. You can remove the ones you don’t intend to use regularly using the View ➤ Customize Palette feature.

The stage editors are divided into the following basic types:

• Active. These are stages that perform some processing on the data that is passing through them. Examples of active stages are the Aggregator and Sort stages.

• File. These are stages that read or write data contained in a file or set of files. Examples of file stages are the Sequential File and Data Set stages.

• Database. These are stages that read or write data contained in a database. Examples of database stages are the Oracle and DB2 stages.

All of the stage types use the same basic stage editor, but the pages that actually appear when you edit the stage depend on the exact type of stage you are editing. The following sections describe all the page types and sub tabs that are available. The individual descriptions of stage editors in the following chapters tell you exactly which features of the generic editor each stage type uses.


The Stage Page

All stage editors have a Stage page. This contains a number of subsidiary tabs depending on the stage type. The only field the Stage page itself contains gives the name of the stage being edited.

General Tab

All stage editors have a General tab. This allows you to enter an optional description of the stage. Specifying a description here enhances job maintainability.

Properties Tab

A Properties tab appears on the General page where there are general properties that need setting for the particular stage you are editing. Properties tabs can also occur under Input and Output pages where there are link-specific properties that need to be set.


All the properties for active stages are set under the General page.

The available properties are displayed in a tree structure. They are divided into categories to help you find your way around them. All the mandatory properties are included in the tree by default and cannot be removed. Properties that you must set a value for (i.e. which have not got a default value) are shown in the warning color (red by default), but change to black when you have set a value. You can change the warning color by opening the Options dialog box (select Tools ➤ Options … from the DataStage Designer main menu) and choosing the Transformer item from the tree. Reset the Invalid column color by clicking on the color bar and choosing a new color from the palette.

To set a property, select it in the list and specify the required property value in the property value field. The title of this field and the method for entering a value changes according to the property you have selected. In the example above, the Key property is selected, so the Property Value field is called Key and you set its value by choosing one of the available input columns from a drop-down list. Key is shown in red because you must select a key for the stage to work properly. The Information field contains details about the property you currently have selected in the tree. Where you can browse for a property value, or insert a job parameter whose value is provided at run time, a right arrow appears next to the field. Click on this and a menu gives access to the Browse Files dialog box and/or a list of available job parameters (job parameters are defined in the Job Properties dialog box – see DataStage Designer Guide).

Some properties have default values, and you can always return to the default by selecting it in the tree and choosing Set to default from the shortcut menu.

Some properties are optional. These appear in the Available properties to add field. Click on an optional property to add it to the tree, or choose to add it from the shortcut menu. You can remove it again by selecting it in the tree and selecting Remove from the shortcut menu.

Some properties can be repeated. In the example above you can add multiple key properties. The Key property appears in the Available properties to add list when you select the tree top level Properties node. Click on the Key item to add multiple key properties to the tree.

Some properties have dependents. These are properties which somehow relate to or modify the parent property. They appear under the parent in a tree structure.

For some properties you can supply a job parameter as their value. At runtime the value of this parameter will be used for the property. Such properties are identified by an arrow next to their Property Value box (as shown for the example Sort stage Key property above). Click the arrow to get a list of currently defined job parameters to choose from (see DataStage Designer Guide for information about job parameters).

You can switch to a multiline editor for entering property values for some properties. Do this by clicking on the arrow next to their Property Value box and choosing Switch to multiline editor from the menu.

The property capabilities are indicated by different icons in the tree as follows:

non-repeating property with no dependents

non-repeating property with dependents

repeating property with no dependents

repeating property with dependents


The properties for individual stage types are described in the chapter about the stage.


Advanced Tab

All stage editors have an Advanced tab. This allows you to:

• Specify the execution mode of the stage. This allows you to choose between Parallel and Sequential operation. If the execution mode for a particular type of stage cannot be changed, then this drop down list is disabled. Selecting Sequential operation forces the stage to be executed on a single node. If you have intermixed sequential and parallel stages this has implications for partitioning and collecting data between the stages. You can also let DataStage decide by choosing the default setting for the stage (the drop down list tells you whether this is parallel or sequential).

• Set or clear the preserve partitioning flag. This indicates whether the stage wants to preserve partitioning at the next stage of the job. You choose between Set, Clear and Propagate. For some stage types, Propagate is not available. The operation of each option is as follows:

– Set. Sets the preserve partitioning flag. This indicates to the next stage in the job that it should preserve existing partitioning if possible.

– Clear. Clears the preserve partitioning flag. Indicates that this stage does not care which partitioning method the next stage uses.

– Propagate. Sets the flag to Set or Clear depending on what the previous stage in the job has set (or if that is set to Propagate the stage before that and so on until a preserve partitioning flag setting is encountered).

You can also let DataStage decide by choosing the default setting for the stage (the drop down list tells you whether this is set, clear, or propagate).

• Specify node map or node pool or resource pool constraints. This enables you to limit where the stage can be executed as follows:

– Node pool and resource constraints. This allows you to specify constraints in a grid. Select Node pool or Resource pool from the Constraint drop-down list. Select a Type for a resource pool and, finally, select the name of the pool you are limiting execution to. You can select multiple node or resource pools.

– Node map constraints. Select the option box and type in the nodes to which execution will be limited in the text box. You can also browse through the available nodes to add to the text box. Using this feature conceptually sets up an additional node pool which doesn’t appear in the configuration file.

The lists of available nodes, available node pools, and available resource pools are derived from the configuration file.

The Data Set stage only allows you to select disk pool constraints.

Link Ordering Tab

This tab allows you to order the links for stages that have more than one link and where ordering of the links is required.


The tab allows you to order input links and/or output links as needed. Where link ordering is not important or is not possible, the tab does not appear.

The link label gives further information about the links being ordered. In the example we are looking at the Link Ordering tab for a Join stage. The join operates in terms of having a left link and a right link, and this tab tells you which actual link the stage regards as being left and which right. If you use the arrow keys to change the link order, the link name changes but not the link label. In our example, if you pressed the down arrow button, DSLink27 would become the left link, and DSLink26 the right.

A Join stage can only have one output link, so in the example the Order the following output links section is disabled.

The following example shows the Link Ordering tab from a Merge stage. In this case you can order both input links and output links. The Merge stage handles reject links as well as a stream link, and the tab allows you to order these, although you cannot move them to the stream link position. Again the link labels give the sense of how the links are being used.

The individual stage descriptions tell you whether link ordering is possible and what options are available.


Inputs Page

The Inputs page gives information about links going into a stage. In the case of a file or database stage an input link carries data being written to the file or database. In the case of an active stage it carries data that the stage will process before outputting to another stage. Where there are no input links the stage editor has no Inputs page.

Where it is present, the Inputs page contains various tabs depending on stage type. The only field the Inputs page itself contains is Input name, which gives the name of the link being edited. Where a stage has more than one input link, you can select the link you are editing from the Input name drop-down list.

The Inputs page also has a Columns… button. Click this to open a window showing column names from the meta data defined for this link. You can drag these columns to various fields in the Inputs page tabs as required.

Certain stage types will also have a View Data… button. Press this to view the actual data associated with the specified data source or data target. The button is available if you have defined meta data for the link.


General Tab

The Inputs page always has a General tab. This allows you to enter an optional description of the link. Specifying a description for each link enhances job maintainability.

Properties Tab

Some types of file and database stages can have properties that are particular to specific input links. In this case the Inputs page has a Properties tab. This has the same format as the Stage page Properties tab (see “Properties Tab” on page 3-2).

Partitioning Tab

Most parallel stages have a default partitioning or collecting method associated with them. This is used depending on the execution mode of the stage (i.e., parallel or sequential), whether Preserve Partitioning on the Stage page Advanced tab is Set, Clear, or Propagate, and the execution mode of the immediately preceding stage in the job. For example, if the preceding stage is processing data sequentially and the current stage is processing in parallel, the data will be partitioned as it enters the current stage. Conversely if the preceding stage is processing data in parallel and the current stage is sequential, the data will be collected as it enters the current stage.

You can, if required, override the default partitioning or collecting method on the Partitioning tab. The selected method is applied to the incoming data as it enters the stage on a particular link, and so the Partitioning tab appears on the Inputs page. You can also use the tab to repartition data between two parallel stages. If both stages are executing sequentially, you cannot select a partition or collection method and the fields are disabled. The fields are also disabled if the particular stage does not permit selection of partitioning or collection methods. The following table shows what can be set from the Partitioning tab in what circumstances:

Preceding Stage   Current Stage   Partition Tab Mode

Parallel          Parallel        Partition
Parallel          Sequential      Collect
Sequential        Parallel        Partition
Sequential        Sequential      None (disabled)

The Partitioning tab also allows you to specify that the data should be sorted as it enters.

The Partitioning tab has the following fields:

• Partition type. Choose the partitioning (or collecting) type from the drop-down list. The following partitioning types are available:

– (Auto). DataStage attempts to work out the best partitioning method depending on execution modes of current and preceding stages, whether the Preserve Partitioning flag has been set on the previous stage in the job, and how many nodes are specified in the Configuration file. This is the default method for many stages.

– Entire. Every processing node receives the entire data set. No further information is required.

– Hash. The records are hashed into partitions based on the value of a key column or columns selected from the Available list.

– Modulus. The records are partitioned using a modulus function on the key column selected from the Available list. This is commonly used to partition on tag fields.

– Random. The records are partitioned randomly, based on the output of a random number generator. No further information is required.

– Round Robin. The records are partitioned on a round robin basis as they enter the stage. No further information is required.

– Same. Preserves the partitioning already in place. No further information is required.

– DB2. Replicates the DB2 partitioning method of a specific DB2 table. Requires extra properties to be set. Access these properties by clicking the properties button.

– Range. Divides a data set into approximately equal size partitions based on one or more partitioning keys. Range partitioning is often a preprocessing step to performing a total sort on a data set. Requires extra properties to be set. Access these properties by clicking the properties button.

The following collection types are available:

– (Auto). DataStage attempts to work out the best collection method depending on execution modes of current and preceding stages, and how many nodes are specified in the Configuration file. This is the default collection method for many stages.

– Ordered. Reads all records from the first partition, then all records from the second partition, and so on. Requires no further information.

– Round Robin. Reads a record from the first input partition, then from the second partition, and so on. After reaching the last partition, the operator starts over.

– Sort Merge. Reads records in an order based on one or more columns of the record. This requires you to select a collecting key column from the Available list.

• Available. This lists the input columns for the input link. Key columns are identified by a key icon. For partitioning or collecting methods that require you to select columns, you click on the required column in the list and it appears in the Selected list to the right. This list is also used to select columns to sort on.

• Selected. This list shows which columns have been selected for partitioning on, collecting on, or sorting on and displays information about them. The available information is whether a sort is being performed (indicated by an arrow), if so the order of the sort (ascending or descending) and collating sequence (ASCII or EBCDIC), and whether an alphanumeric key is case sensitive or not. You can select sort order, case sensitivity, and collating sequence from the shortcut menu. If applicable, the Usage field indicates whether a particular key column is being used for sorting, partitioning, or both.

• Sorting. The check boxes in the section allow you to specify sort details.

– Sort. Select this to specify that data coming in on the link should be sorted. Select the column or columns to sort on from the Available list.

– Stable. Select this if you want to preserve previously sorted data sets. The default is stable.

– Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is also set, the first record is retained.

You can also specify sort direction, case sensitivity, and collating sequence for each column in the Selected list by selecting it and right-clicking to invoke the shortcut menu. The availability of sorting depends on the partitioning method chosen.

If you require a more complex sort operation, you should use the Sort stage.


DB2 Partition Properties

This dialog box appears when you select a Partition type of DB2 and click the properties button. It allows you to specify the DB2 table whose partitioning method is to be replicated.

Range Partition Properties

This dialog box appears when you select a Partition type of Range and click the properties button. It allows you to specify the range map that is to be used to determine the partitioning. Type in a pathname or browse for a file.


Columns Tab

The Inputs page always has a Columns tab. This displays the column meta data for the selected input link in a grid.

There are various ways of populating the grid:

• If the other end of the link has meta data specified for it, this will be displayed in the Columns tab (meta data is associated with, and travels with a link).

• You can type the required meta data into the grid. When you have done this, you can click the Save… button to save the meta data as a table definition in the Repository for subsequent reuse.

• You can load an existing table definition from the Repository. Click the Load… button to be offered a choice of table definitions to load. Note that when you load in this way you bring in the column definitions, not any formatting information associated with them (to load that, go to the Format tab).

• You can drag a table definition from the Repository Window on the Designer onto a link on the canvas. This transfers both the column definitions and the associated format information.

If you click in a row and select Edit Row… from the shortcut menu, the Edit Column Meta Data dialog box appears, which allows you to edit the row details in a dialog box format. It also has a Parallel tab which allows you to specify properties that are peculiar to parallel job column definitions. The dialog box only shows those properties that are relevant for the current link.

The Parallel tab enables you to specify properties that give more detail about each column, and properties that are specific to the data type.

Field Format

This has the following properties (a schema file sketch follows the list):

• Bytes to Skip. Skip the specified number of bytes from the end of the previous column to the beginning of this column.


• Delimiter. Specifies the trailing delimiter of the column. Type an ASCII character or select one of whitespace, end, none, or null.


– whitespace. A whitespace character is used.

– end. Specifies that the last column in the record is composed of all remaining bytes until the end of the record.

– none. No delimiter.

– null. Null character is used.

• Delimiter string. Specify a string to be written at the end of the column. Enter one or more ASCII characters.

• Generate on output. Creates a column and sets it to the default value.

• Prefix bytes. Specifies that each column in the data file is prefixed by 1, 2, or 4 bytes containing, as a binary value, either the column’s length or the tag value for a tagged column.

• Quote. Specifies that variable length columns are enclosed in single quotes, double quotes, or another ASCII character or pair of ASCII characters. Choose Single or Double, or enter an ASCII character.

• Start position. Specifies the starting position of a column in the record. The starting position can be either an absolute byte offset from the first record position (0) or the starting position of another column.

• Tag case value. Explicitly specifies the tag value corresponding to a subfield in a tagged subrecord. By default the fields are numbered 0 to N-1, where N is the number of fields. (A tagged subrecord is a column whose type can vary. The subfields of the tagged subrecord are the possible types. The tag case value of the tagged subrecord selects which of those types is used to interpret the column’s value for the record.)

• User defined. Allows free format entry of any properties not defined elsewhere. Specify in a comma-separated list.
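Many of these properties correspond to format properties that can be attached to columns in a schema file. The following sketch is illustrative only (the column names and values are invented, the property names follow the import/export conventions described in Appendix A, and the exact set a given stage accepts varies):

    record {delim=',', quote=double}
      (
        name:string {prefix=2};
        income:int32 {delim=none};
        zip:string[5];
      )

Here the record-level properties request comma-delimited, double-quoted columns; the column-level properties then override them, giving name a 2-byte length prefix instead of a delimiter and income no trailing delimiter.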

String Type

This has the following properties:

• Default. The value to substitute for a column that causes an error.

• Export EBCDIC as ASCII. Select this to specify that EBCDIC characters are written as ASCII characters.

• Is link field. Selected to indicate that a column holds the length of another, variable-length column of the record or of the tag value of a tagged record field.

• Layout max width. The maximum number of bytes in a column represented as a string. Enter a number.

• Layout width. The number of bytes in a column represented as a string. Enter a number.

• Pad char. Specifies the pad character used when strings or numeric values are exported to an external string representation. Enter an ASCII character or choose null.

Date Type

• Byte order. Specifies how multiple byte data types are ordered. Choose from:

– little-endian. The high byte is on the right.
– big-endian. The high byte is on the left.
– native-endian. As defined by the native format of the machine.

• Days since. Dates are written as a signed integer containing the number of days since the specified date. Enter a date in the form %yyyy-%mm-%dd.

• Format. Specifies the data representation format of a column. Choose from:

– binary

– text

• Format string. The string format of a date. By default this is %yyyy-%mm-%dd.

• Is Julian. Select this to specify that dates are written as a numeric value containing the Julian day. A Julian day specifies the date as the number of days from 4713 BCE January 1, 12:00 hours (noon) GMT.
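For example, with Days since set to 1900-01-01, a date of January 31, 1900 is written as the integer 30. Similarly, because Julian day 2451545 corresponds to noon GMT on January 1, 2000, that date is written as 2451545 when Is Julian is selected.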

Time Type

• Byte order. Specifies how multiple byte data types are ordered.

Stage Editors 3-19

Choose from:

– little-endian. The high byte is on the right.

– big-endian. The high byte is on the left.

– native-endian. As defined by the native format of the machine.

• Format. Specifies the data representation format of a column. Choose from:

– binary

– text

• Format string. Specifies the format of columns representing time as a string. By default this is %hh-%mm-%ss.

• Is midnight seconds. Select this to specify that times are written as a binary 32-bit integer containing the number of seconds elapsed from the previous midnight.

Timestamp Type

• Byte order. Specifies how multiple byte data types are ordered. Choose from:

– little-endian. The high byte is on the right.

– big-endian. The high byte is on the left.

– native-endian. As defined by the native format of the machine.

• Format. Specifies the data representation format of a column. Choose from:

– binary

– text

• Format string. Specifies the format of a column representing a timestamp as a string. Defaults to %yyyy-%mm-%dd %hh:%nn:%ss.
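For example, with the default format string, 2:25:30 pm on September 30, 2002 is written as 2002-09-30 14:25:30 (%nn represents minutes).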

Integer Type

• Byte order. Specifies how multiple byte data types are ordered. Choose from:

– little-endian. The high byte is on the right.

– big-endian. The high byte is on the left.

– native-endian. As defined by the native format of the machine.

• C_format. Perform non-default conversion of data from integer or floating-point data to a string. This property specifies a C-language format string used for writing integer or floating point strings. This is passed to sprintf().

• Default. The value to substitute for a column that causes an error.


• Format. Specifies the data representation format of a column. Choose from:

– binary

– text

• Is link field. Selected to indicate that a column holds the length of another, variable-length column of the record or of the tag value of a tagged record field.

• Layout max width. The maximum number of bytes in a column represented as a string. Enter a number.

• Layout width. The number of bytes in a column represented as a string. Enter a number.

• Out_format. Format string used for conversion of data from integer or floating-point data to a string. This is passed to sprintf().

• Pad char. Specifies the pad character used when strings or numeric values are exported to an external string representation. Enter an ASCII character or choose null.

Decimal Type

• Allow all zeros. Specifies whether to treat a packed decimal column containing all zeros (which is normally illegal) as a valid representation of zero. Select Yes or No.

• Default. The value to substitute for a column that causes an error.

• Format. Specifies the data representation format of a column. Choose from:

– binary

– text

• Layout max width. The maximum number of bytes in a column represented as a string. Enter a number.

• Layout width. The number of bytes in a column represented as a string. Enter a number.

• Packed. Select Yes to specify that the decimal columns contain data in packed decimal format or No to specify that they contain unpacked decimal with a separate sign byte. This property has two dependent properties as follows:

– Check. Select Yes to verify that data is packed, or No to not verify.


– Signed. Select Yes to use the existing sign when writing decimal columns. Select No to write a positive sign (0xf) regardless of the column's actual sign value.


• Precision. Specifies the precision where a decimal column is written in text format. Enter a number.

• Rounding. Specifies how to round a decimal column when writing it. Choose from:

– up (ceiling). Truncate source column towards positive infinity.

– down (floor). Truncate source column towards negative infinity.

– nearest value. Round the source column towards the nearest representable value.

– truncate towards zero. This is the default. Discard fractional digits to the right of the right-most fractional digit supported by the destination, regardless of sign.

• Scale. Specifies how to round a source decimal when its precision and scale are greater than those of the destination.

Float Type

• C_format. Perform non-default conversion of data from integer or floating-point data to a string. This property specifies a C-language format string used for writing integer or floating point strings. This is passed to sprintf().

• Default. The value to substitute for a column that causes an error.

• Format. Specifies the data representation format of a column. Choose from:

– binary

– text

• Is link field. Selected to indicate that a column holds the length of another, variable-length column of the record or of the tag value of a tagged record field.

• Layout max width. The maximum number of bytes in a column represented as a string. Enter a number.

• Layout width. The number of bytes in a column represented as a string. Enter a number.

• Out_format. Format string used for conversion of data from integer or floating-point data to a string. This is passed to sprintf().


• Pad char. Specifies the pad character used when strings or numeric values are exported to an external string representation. Enter an ASCII character or choose null.

Vectors

If the row you are editing represents a column which is a variable length vector, tick the Variable check box. The Vector properties appear; these give the size of the vector in one of two ways:

• Link Field Reference. The name of a column containing the number of elements in the variable length vector. This should have an integer or float type, and have its Is Link field property set.

• Vector prefix. Specifies 1-, 2-, or 4-byte prefix containing the number of elements in the vector.

If the row you are editing represents a column which is a vector of known length, enter the number of elements in the Vector Occurs box.
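To illustrate, the following record schema fragment (in the schema notation outlined in Appendix A) sketches both kinds of vector; the field names are invented for the example, so check Appendix A for the exact syntax your installation supports:

    record (
      count: int32;             // the column named by Link Field Reference; Is link field is set on it
      readings[count]: dfloat;  // variable-length vector sized by the count column
      samples[4]: int16;        // fixed-length vector (Vector Occurs = 4)
    )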

Subrecords

If the row you are editing represents a column which is part of a subrecord the Level Number column indicates the level of the column within the subrecord structure.

If you specify Level numbers for columns, the column immediately preceding will be identified as a subrecord. Subrecords can be nested, so can contain further subrecords with higher level numbers (i.e., level 06 is nested within level 05). Subrecord fields have a Tagged check box to indicate that this is a tagged subrecord.
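A comparable schema-style sketch of a nested subrecord and a tagged subrecord follows; the names are again illustrative only:

    record (
      purchase: subrec (       // level 05 subrecord
        price: decimal[6,2];
        details: subrec (      // level 06 subrecord nested within level 05
          code: int8;
        );
      );
      payload: tagged (        // tagged subrecord; the tag selects one subfield per record
        amount: int32;
        label: string[16];
      );
    )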


Format Tab

Certain types of file stage (i.e., the Sequential File stage) also have a Format tab which allows you to specify the format of the flat file or files being written to.

The Format tab is similar in structure to the Properties tab. A flat file has a number of properties that you can set different attributes for. Select the property in the tree and select the attributes you want to set from the Available properties to add window; it will then appear as a dependent property in the property tree and you can set its value as required.

If you click the Load button you can load the format information from a table definition in the Repository.

The short-cut menu from the property tree gives access to the following functions:

• Format as. This applies a predefined template of properties. Choose from the following:

– Delimited/quoted

– Fixed-width records

– UNIX line terminator

– DOS line terminator

– No terminator (fixed width)

– Mainframe (COBOL)

• Add sub-property. Gives access to a list of dependent properties for the currently selected property (visible only if the property has dependents).

• Set to default. Appears if the currently selected property has been set to a non-default value, allowing you to re-select the default.

• Remove. Removes the currently selected property. This is disabled if the current property is mandatory.

• Remove all. Removes all the non-mandatory properties.

Details of the properties you can set are given in the chapter describing the individual stage editors.

Outputs Page

The Outputs page gives information about links going out of a stage. In the case of a file or database stage an output link carries data being read from the file or database. In the case of an active stage it carries data that the stage has processed. Where there are no output links the stage editor has no Outputs page.

Where it is present, the Outputs page contains various tabs depending on stage type. The only field the Outputs page itself contains is Output name, which gives the name of the link being edited. Where a stage has more than one output link, you can select the link you are editing from the Output name drop-down list.

The Outputs page also has a Columns… button. Click this to open a window showing column names from the meta data defined for this link. You can drag these columns to various fields in the Outputs page tabs as required.


General Tab

The Outputs page always has a General tab. This allows you to enter an optional description of the link. Specifying a description for each link enhances job maintainability.


Properties Tab

Some types of file and database stages can have properties that are particular to specific output links. In this case the Outputs page has a Properties tab. This has the same format as the Stage page Properties tab (see “Properties Tab” on page 3-2).


Columns Tab

The Outputs page always has a Columns tab. This displays the column meta data for the selected output link in a grid.

There are various ways of populating the grid:

• If the other end of the link has meta data specified for it, this will be displayed in the Columns tab (meta data is associated with, and travels with, a link).

• You can type the required meta data into the grid. When you have done this, you can click the Save… button to save the meta data as a table definition in the Repository for subsequent reuse.

• You can load an existing table definition from the Repository. Click the Load… button to be offered a choice of table definitions to load.

If runtime column propagation is enabled in the DataStage Administrator, you can select the Runtime column propagation check box to specify that columns encountered by the stage can be used even if they are not explicitly defined in the meta data. There are some special considerations when using runtime column propagation with certain stage types:


• Sequential File

• File Set

• External Source

• External Target

See the individual stage descriptions for details of these.

If you click in a row and select Edit Row… from the shortcut menu, the Edit Column meta data dialog box appears, which allows you to edit the row details in a dialog box format. It also has a Parallel tab which allows you to specify properties that are peculiar to parallel job column definitions. (See page 3-17 for details.)

If the selected output link is a reject link, the column meta data grid is read only and cannot be modified.

Format Tab

Certain types of file stage (i.e., the Sequential File stage) also have a Format tab which allows you to specify the format of the flat file or files being read from.


The Format tab is similar in structure to the Properties tab. A flat file has a number of properties that you can set different attributes for. Select the property in the tree and select the attributes you want to set from the Available properties to add window; it will then appear as a dependent property in the property tree and you can set its value as required.

Format details are also stored with table definitions, and you can use the Load… button to load a format from a table definition stored in the DataStage Repository.

Details of the properties you can set are given in the chapter describing the individual stage editors.

Mapping Tab

For active stages the Mapping tab allows you to specify how the output columns are derived, i.e., what input columns map onto them or how they are generated.

The left pane shows the input columns and/or the generated columns. These are read only and cannot be modified on this tab. These columns represent the data that the stage has produced after it has processed the input data.


The right pane shows the output columns for each link. This has a Derivations field where you can specify how the column is derived. You can fill it in by dragging input columns over, or by using the Auto-match facility. If you have not yet defined any output column definitions, this will define them for you. If you have already defined output column definitions, the stage performs the mapping for you as far as possible.

In the above example the left pane represents the data after it has been joined. The Expression field shows how the column has been derived, and the Column Name shows the column after it has been renamed by the join operation (preceded by leftRec_ or rightRec_). The right pane represents the data being output by the stage after the join. In this example the data has been mapped straight across.

More details about mapping operations for the different stages are given in the individual stage descriptions.

A shortcut menu can be invoked from the right pane that allows you to:

• Find and replace column names.

• Validate a derivation you have entered.

• Clear an existing derivation.

• Append a new column.

• Select all columns.

• Insert a new column at the current position.

• Delete the selected column or columns.

• Cut and copy columns.

• Paste a whole column.

• Paste just the derivation from a column.

The Find button opens a dialog box which allows you to search for particular output columns.


The Auto-Match button opens a dialog box which will automatically map left pane columns onto right pane columns according to the specified criteria.

Select Location match to map input columns onto the output ones occupying the equivalent position. Select Name match to match by names. You can specify that all columns are to be mapped by name, or only the ones you have selected. You can also specify that prefixes and suffixes are ignored for input and output columns, and that case can be ignored.


Chapter 4. Sequential File Stage

The Sequential File stage is a file stage. It allows you to read data from or write data to one or more flat files. The stage can have a single input link or a single output link, and a single rejects link. It usually executes in parallel mode but can be configured to execute sequentially if it is only reading one file with a single reader.

When you edit a Sequential File stage, the Sequential File stage editor appears. This is based on the generic stage editor described in Chapter 3, “Stage Editors.”

The stage editor has up to three pages, depending on whether you are reading or writing a file:

• Stage page. This is always present and is used to specify general information about the stage.

• Inputs page. This is present when you are writing to a flat file. This is where you specify details about the file or files being written to.

• Outputs page. This is present when you are reading from a flat file. This is where you specify details about the file or files being read from.

There are one or two special points to note about using runtime column propagation (RCP) with Sequential stages. See “Using RCP With Sequential Stages” on page 4-20 for details.

Stage Page

The General tab allows you to specify an optional description of the stage. The Advanced page allows you to specify how the stage executes.


Advanced Tab

This tab allows you to specify the following:

• Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the contents of the file are processed by the available nodes as specified in the Configuration file, and by any node constraints specified on the Advanced tab. In Sequential mode the entire contents of the file are processed by the conductor node. When a stage is reading a single file the Execution Mode is sequential and you cannot change it. When a stage is reading multiple files, the Execution Mode is parallel and you cannot change it.

• Preserve partitioning. You can select Set or Clear. If you select Set, file read operations will request that the next stage preserves the partitioning as is (it is ignored for file write operations). If you set the Keep File Partitions output property this will automatically set the preserve partitioning flag.

• Node pool and resource constraints. Select this option to constrain parallel execution to the node pool or pools and/or resource pools or pools specified in the grid. The grid allows you to make choices from drop down lists populated from the Configuration file.

• Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node map. You can define a node map by typing node numbers into the text box or by clicking the browse button to open the Available Nodes dialog box and selecting nodes from there. You are effectively defining a new node pool for this stage (in addition to any node pools defined in the Configuration file).

Inputs Page

The Inputs page allows you to specify details about how the Sequential File stage writes data to one or more flat files. The Sequential File stage can have only one input link, but this can write to multiple files.

The General tab allows you to specify an optional description of the input link. The Properties tab allows you to specify details of exactly what the link does. The Partitioning tab allows you to specify how incoming data is partitioned before being written to the file or files. The Format tab gives information about the format of the files being written. The Columns tab specifies the column definitions of incoming data.

Details about Sequential File stage properties, partitioning, and formatting are given in the following sections. See Chapter 3, “Stage Editors,” for a general description of the other tabs.

Input Link Properties

The Properties tab allows you to specify properties for the input link. These dictate how incoming data is written and to what files. Some of the properties are mandatory, although many have default settings. Properties without default settings appear in the warning color (red by default) and turn black when you supply a value for them.

The following table gives a quick reference list of the properties and their attributes. A more detailed description of each property follows.

Category/Property           Values                   Default   Mandatory?  Repeats?  Dependent of
Target/File                 pathname                 N/A       Y           Y         N/A
Target/File Update Mode     Append/Create/Overwrite  Create    Y           N         N/A
Options/Cleanup On Failure  True/False               True      Y           N         N/A
Options/Reject Mode         Continue/Fail/Save       Continue  Y           N         N/A
Options/Filter              command                  N/A       N           N         N/A
Options/Schema File         pathname                 N/A       N           N         N/A

Target Category

File. This property defines the flat file that the incoming data will be written to. You can type in a pathname, or browse for a file. You can specify multiple files by repeating the File property. Do this by selecting the Properties item at the top of the tree, and clicking on File in the Available properties to add window. Do this for each extra file you want to specify.

You must specify at least one file to be written to, which must exist unless you specify a File Update Mode of Create or Overwrite.


File Update Mode. This property defines how the specified file or files are updated. The same method applies to all files being written to. Choose from Append to append to existing files, Overwrite to overwrite existing files, or Create to create a new file. If you specify the Create property for a file that already exists you will get an error at runtime.

By default this property is set to Overwrite.

Options Category

Cleanup On Failure. This is set to True by default and specifies that the stage will delete any partially written files if the stage fails for any reason. Set this to False to specify that partially written files should be left.

Reject Mode. This specifies what happens to any data records that are not written to a file for some reason. Choose from Continue to continue operation and discard any rejected rows, Fail to cease writing if any rows are rejected, or Save to send rejected rows down a reject link.

Continue is set by default.

Filter. This is an optional property. You can use this to specify that the data is passed through a filter program before being written to the file or files. Specify the filter command, and any required arguments, in the Property Value box.

Schema File. This is an optional property. By default the Sequential File stage will use the column definitions defined on the Columns and Format tabs as a schema for writing to the file. You can, however, override this by specifying a file containing a schema. Type in a pathname or browse for a file.
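For example, a schema file is a plain-text file holding a record definition along the lines of the sketch below. The column names and property settings are illustrative only; the full schema syntax is documented in Appendix A:

    record
      {final_delim=end, delim=',', quote=double}
    (
      CustomerID: int32;
      Name: string[max=30];
      Balance: decimal[8,2];
    )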

Partitioning on Input Links

The Partitioning tab allows you to specify details about how the incoming data is partitioned or collected before it is written to the file or files. It also allows you to specify that the data should be sorted before being written.

By default the stage partitions in Auto mode. This attempts to work out the best partitioning method depending on execution modes of current and preceding stages, whether the Preserve Partitioning option has been set, and how many nodes are specified in the Configuration file. If the Preserve Partitioning option has been set on the Stage page Advanced tab (see page 4-2) the stage will attempt to preserve the partitioning of the incoming data.

If the Sequential File stage is operating in sequential mode, it will first collect the data before writing it to the file using the default Auto collection method.

The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends on:

• Whether the Sequential File stage is set to execute in parallel or sequential mode.

• Whether the preceding stage in the job is set to execute in parallel or sequential mode.

If the Sequential File stage is set to execute in parallel, then you can set a partitioning method by selecting from the Partitioning mode drop-down list. This will override any current partitioning (even if the Preserve Partitioning option has been set on the Stage page Advanced tab).

If the Sequential File stage is set to execute in sequential mode, but the preceding stage is executing in parallel, then you can set a collection method from the Collection type drop-down list. This will override the default auto collection method.

The following partitioning methods are available:

• (Auto). DataStage attempts to work out the best partitioning method depending on execution modes of current and preceding stages, whether the Preserve Partitioning flag has been set on the previous stage in the job, and how many nodes are specified in the Configuration file. This is the default collection method for the Sequential File stage.

• Entire. Each file written to receives the entire data set.

• Hash. The records are hashed into partitions based on the value of a key column or columns selected from the Available list.

• Modulus. The records are partitioned using a modulus function on the key column selected from the Available list. This is commonly used to partition on tag fields.

• Random. The records are partitioned randomly, based on the output of a random number generator.


• Round Robin. The records are partitioned on a round robin basis as they enter the stage.

• Same. Preserves the partitioning already in place.

• DB2. Replicates the DB2 partitioning method of a specific DB2 table. Requires extra properties to be set. Access these properties by clicking the properties button.

• Range. Divides a data set into approximately equal size partitions based on one or more partitioning keys. Range partitioning is often a preprocessing step to performing a total sort on a data set. Requires extra properties to be set. Access these properties by clicking the properties button.
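For example, with Hash partitioning on a CustomerID key, all records sharing a CustomerID value are sent to the same partition; with Modulus partitioning across four partitions, a record whose key column holds 10 goes to partition 10 mod 4 = 2.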

The following Collection methods are available:

• (Auto). DataStage attempts to work out the best collection method depending on execution modes of current and preceding stages, and how many nodes are specified in the Configuration file. This is the default collection method for Sequential File stages.

• Ordered. Reads all records from the first partition, then all records from the second partition, and so on.

• Round Robin. Reads a record from the first input partition, then from the second partition, and so on. After reaching the last partition, the operator starts over.

• Sort Merge. Reads records in an order based on one or more columns of the record. This requires you to select a collecting key column from the Available list.

The Partitioning tab also allows you to specify that data arriving on the input link should be sorted before being written to the file or files. The sort is always carried out within data partitions. If the stage is partitioning incoming data the sort occurs after the partitioning. If the stage is collecting data, the sort occurs before the collection. The availability of sorting depends on the partitioning method chosen.

Select the check boxes as follows:

• Sort. Select this to specify that data coming in on the link should be sorted. Select the column or columns to sort on from the Available list.

• Stable. Select this if you want to preserve previously sorted data sets. This is the default.


• Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is also set, the first record is retained.

You can also specify sort direction, case sensitivity, and collating sequence for each column in the Selected list by selecting it and right-clicking to invoke the shortcut menu.

Format of Sequential Files

The Format tab allows you to supply information about the format of the flat file or files to which you are writing. The tab has a similar format to the Properties tab and is described on page 3-24.

Select a property type from the main tree then add the properties you want to set to the tree structure by clicking on them in the Available properties to set window. You can then set a value for that property in the Property Value box. Pop-up help for each of the available properties appears if you hover the mouse pointer over it.

The following sections list the Property types and properties available for each type.

Record level. These properties define details about how data records are formatted in the flat file. The available properties are:

• Fill char. Specify an ASCII character or a value in the range 0 to 255. This character is used to fill any gaps in an exported record caused by column positioning properties. Set to 0 by default.

• Final delimiter string. Specify a string to be written after the last column of a record in place of the column delimiter. Enter one or more ASCII characters (precedes the record delimiter if one is used).

• Final delimiter. Specify a single character to be written after the last column of a record in place of the column delimiter. Type an ASCII character or select one of whitespace, end, none, or null.

– whitespace. A whitespace character is used.

– end. Record delimiter is used (defaults to newline).

– none. No delimiter (column length is used).

– null. Null character is used.

• Intact. Allows you to define a partial record schema. See “Partial Schemas” in Appendix A for details on complete versus partial schemas. (The dependent property Check Intact is only relevant for output links.)

• Record delimiter string. Specify a string to be written at the end of each record. Enter one or more ASCII characters.

• Record delimiter. Specify a single character to be written at the end of each record. Type an ASCII character or select one of the following:

– ‘\n’. Newline (the default).– null. Null character.

This is mutually exclusive with Record delimiter string, although the dialog box does not enforce this.

• Record length. Select Fixed where the fixed length columns are being written. DataStage calculates the appropriate length for the record. Alternatively specify the length of fixed records as number of bytes.

• Record Prefix. Specifies that a variable-length record is prefixed by a 1-, 2-, or 4-byte length prefix. 1 byte is the default.

• Record type. Specifies that data consists of variable-length blocked records (varying) or implicit records (implicit). If you choose the implicit property, data is written as a stream with no explicit record boundaries. The end of the record is inferred when all of the columns defined by the schema have been parsed. The varying property allows you to specify one of the following IBM blocked or spanned formats: V, VB, VS, or VBS.

This property is mutually exclusive with Record length, Record delimiter, Record delimiter string, and Record prefix.

• User defined. Allows free format entry of any properties not defined elsewhere. Specify in a comma-separated list.

Field Defaults. Defines default properties for columns written to the file or files. These are applied to all columns written. The available properties are:

• Delimiter. Specifies the trailing delimiter of all columns in the record. Type an ASCII character or select one of whitespace, end, none, or null.

– whitespace. A whitespace character is used.


– end. Specifies that the last column in the record is composed of all remaining bytes until the end of the record.

– none. No delimiter.

– null. Null character is used.

• Delimiter string. Specify a string to be written at the end of each column. Enter one or more ASCII characters.

• Prefix bytes. Specifies that each column in the data file is prefixed by 1, 2, or 4 bytes containing, as a binary value, either the column’s length or the tag value for a tagged field.

• Print field. This property is not relevant for input links.

• Quote. Specifies that variable length columns are enclosed in single quotes, double quotes, or another ASCII character or pair of ASCII characters. Choose Single or Double, or enter an ASCII character.

• Vector prefix. For columns that are variable length vectors, specifies a 1-, 2-, or 4-byte prefix containing the number of elements in the vector.

Type Defaults. These are properties that apply to all columns of a specific data type unless specifically overridden at the column level. They are divided into a number of subgroups according to data type.

General. These properties apply to several data types (unless overridden at column level):

• Byte order. Specifies how multiple byte data types (except string and raw data types) are ordered. Choose from:

– little-endian. The high byte is on the right.

– big-endian. The high byte is on the left.

– native-endian. As defined by the native format of the machine.

• Format. Specifies the data representation format of a column. Choose from:

– binary

– text


• Layout max width. The maximum number of bytes in a column represented as a string. Enter a number.


• Layout width. The number of bytes in a column represented as a string. Enter a number.

• Pad char. Specifies the pad character used when strings or numeric values are exported to an external string representation. Enter an ASCII character or choose null.

String. These properties are applied to columns with a string data type, unless overridden at column level.

• Export EBCDIC as ASCII. Select this to specify that EBCDIC characters are written as ASCII characters.

• Import ASCII as EBCDIC. Not relevant for input links.

Decimal. These properties are applied to columns with a decimal data type unless overridden at column level.

• Allow all zeros. Specifies whether to treat a packed decimal column containing all zeros (which is normally illegal) as a valid representation of zero. Select Yes or No.

• Packed. Select Yes to specify that the decimal columns contain data in packed decimal format or No to specify that they contain unpacked decimal with a separate sign byte. This property has two dependent properties as follows:

– Check. Select Yes to verify that data is packed, or No to not verify.

– Signed. Select Yes to use the existing sign when writing decimal columns. Select No to write a positive sign (0xf) regardless of the column's actual sign value.

• Precision. Specifies the precision where a decimal column is written in text format. Enter a number.

• Rounding. Specifies how to round a decimal column when writing it. Choose from:

– up (ceiling). Truncate source column towards positive infinity.

– down (floor). Truncate source column towards negative infinity.

– nearest value. Round the source column towards the nearest representable value.


– truncate towards zero. This is the default. Discard fractional digits to the right of the right-most fractional digit supported by the destination, regardless of sign.

• Scale. Specifies how to round a source decimal when its precision and scale are greater than those of the destination.

Numeric. These properties are applied to columns with an integer or float data type unless overridden at column level.

• C_format. Perform non-default conversion of data from integer or floating-point data to a string. This property specifies a C-language format string used for writing integer or floating point strings. This is passed to sprintf().

• In_format. Not relevant for input links.

• Out_format. Format string used for conversion of data from integer or floating-point data to a string. This is passed to sprintf().

Date. These properties are applied to columns with a date data type unless overridden at column level.

• Days since. Dates are written as a signed integer containing the number of days since the specified date. Enter a date in the form %yyyy-%mm-%dd.

• Format string. The string format of a date. By default this is %yyyy-%mm-%dd.

• Is Julian. Select this to specify that dates are written as a numeric value containing the Julian day. A Julian day specifies the date as the number of days from 4713 BCE January 1, 12:00 hours (noon) GMT.

Time. These properties are applied to columns with a time data type unless overridden at column level.

• Format string. Specifies the format of columns representing time as a string. By default this is %hh-%mm-%ss.

• Is midnight seconds. Select this to specify that times are written as a binary 32-bit integer containing the number of seconds elapsed from the previous midnight.

Timestamp. These properties are applied to columns with a timestamp data type unless overridden at column level.


• Format string. Specifies the format of a column representing a timestamp as a string. Defaults to %yyyy-%mm-%dd %hh:%nn:%ss.

Outputs Page

The Outputs page allows you to specify details about how the Sequential File stage reads data from one or more flat files. The Sequential File stage can have only one output link, but this can read from multiple files.

It can also have a single reject link. This is typically used when you are writing to a file and provides a location where records that have failed to be written to a file for some reason can be sent.

The Output name drop-down list allows you to choose whether you are looking at details of the main output link (the stream link) or the reject link.

The General tab allows you to specify an optional description of the output link. The Properties tab allows you to specify details of exactly what the link does. The Format tab gives information about the format of the files being read. The Columns tab specifies the column definitions of the data.

Details about Sequential File stage properties and formatting are given in the following sections. See Chapter 3, “Stage Editors,” for a general description of the other tabs.

Output Link Properties

The Properties tab allows you to specify properties for the output link. These dictate how data is read and from what files. Some of the properties are mandatory, although many have default settings. Properties without default settings appear in the warning color (red by default) and turn black when you supply a value for them.

The following table gives a quick reference list of the properties and their attributes. A more detailed description of each property follows.

Category/Property                   Values                         Default           Mandatory?                            Repeats?  Dependent of
Source/File                         pathname                       N/A               Y if Read Method = Specific File(s)  Y         N/A
Source/File Pattern                 pathname                       N/A               Y if Read Method = File Pattern      N         N/A
Source/Read Method                  Specific File(s)/File Pattern  Specific File(s)  Y                                    N         N/A
Options/Missing File Mode           Error/OK/Depends               Depends           Y if File used                       N         N/A
Options/Keep file Partitions        True/False                     False             Y                                    N         N/A
Options/Reject Mode                 Continue/Fail/Save             Continue          Y                                    N         N/A
Options/Report Progress             Yes/No                         Yes               Y                                    N         N/A
Options/Filter                      command                        N/A               N                                    N         N/A
Options/Number Of Readers Per Node  number                         1                 N                                    N         N/A
Options/Schema File                 pathname                       N/A               N                                    N         N/A

Source Category

File. This property defines the flat file that data will be read from. You can type in a pathname, or browse for a file. You can specify multiple files by repeating the File property. Do this by selecting the Properties item at the top of the tree, and clicking on File in the Available properties to add window. Do this for each extra file you want to specify.

File Pattern. Specifies a group of files to import. Specify a file containing a list of files or a job parameter representing the file. The file could also contain any valid shell expression, in Bourne shell syntax, that generates a list of file names.
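For example, the file named by the File Pattern property might contain an explicit list of pathnames (these paths are illustrative only):

    /data/source/cust_01.txt
    /data/source/cust_02.txt

or, equivalently, a Bourne shell expression such as /data/source/cust_*.txt.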

Read Method. This property specifies whether you are reading from a specific file or files or using a file pattern to select files.


Options Category

Missing File Mode. Specifies the action to take if one of your File properties has specified a file that does not exist. Choose from Error to stop the job, OK to skip the file, or Depends, which means the default is Error, unless the file has a node name prefix of *: in which case it is OK. The default is Depends.

Keep file Partitions. Set this to True to partition the imported data set according to the organization of the input file(s). So, for example, if you are reading three files you will have three partitions. Defaults to False.

Reject Mode. Allows you to specify behavior if a record fails to be read for some reason. Choose from Continue to continue operation and discard any rejected rows, Fail to cease reading if any rows are rejected, or Save to send rejected rows down a reject link. Defaults to Continue.

Report Progress. Choose Yes or No to enable or disable reporting. By default the stage displays a progress report at each 10% interval when it can ascertain file size. Reporting occurs only if the file is greater than 100 KB, records are fixed length, and there is no filter on the file.

Filter. Specifies a UNIX command through which the data is passed as it is read from the file or files.

Number Of Readers Per Node. This is an optional property. Specifies the number of instances of the file read operator on each processing node. The default is one operator per node per input data file. If numReaders is greater than one, each instance of the file read operator reads a contiguous range of records from the input file. The starting record location in the file for each operator, or seek location, is determined by the data file size, the record length, and the number of instances of the operator, as specified by numReaders.

The resulting data set contains one partition per instance of the file read operator, as determined by numReaders. The data file(s) being read must contain fixed-length records.
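For example, if a single fixed-length file holds 40,000 records and Number Of Readers Per Node is set to 4 on a one-node system, each of the four operator instances reads a contiguous range of roughly 10,000 records, and the resulting data set has four partitions.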

Schema File. This is an optional property. By default the Sequential File stage will use the column definitions defined on the Columns and Format tabs as a schema for reading the file. You can, however, override this by specifying a file containing a schema. Type in a pathname or browse for a file.


Reject Link Properties

You cannot change the properties of a Reject link. The Properties page for a reject link is blank.

Similarly, you cannot edit the column definitions for a reject link. The link uses the column definitions for the link rejecting the data records.

Format of Sequential Files

The Format tab allows you to supply information about the format of the flat file or files which you are reading. The tab has a similar format to the Properties tab and is described on page 3-24.

Select a property type from the main tree then add the properties you want to set to the tree structure by clicking on them in the Available properties to set window. You can then set a value for that property in the Property Value box. Pop-up help for each of the available properties appears if you hover the mouse pointer over it.

The following sections list the Property types and properties available for each type.

Record level. These properties define details about how data records are formatted in the flat file. The available properties are:

• Fill char. Not relevant for Output links.

• Final delimiter string. Specify the string that appears after the last column of a record in place of the column delimiter. Enter one or more ASCII characters (precedes the record delimiter if one is used).

• Final delimiter. Specify a single character that appears after the last column of a record in place of the column delimiter. Type an ASCII character or select one of whitespace, end, none, or null.

– whitespace. A whitespace character is used.

– end. Record delimiter is used (defaults to newline).

– none. No delimiter (column length is used).

– null. Null character is used.

• Intact. Allows you to define that this is a partial record schema. See Appendix A for details on complete versus partial schemas. This property has a dependent property:


– Check Intact. Select this to force validation of the partial schema as the file or files are imported. Note that this can degrade performance.

• Record delimiter string. Specifies the string at the end of each record. Enter one or more ASCII characters.

• Record delimiter. Specifies the single character at the end of each record. Type an ASCII character or select one of the following:

– ‘\n’. Newline (the default).– null. Null character.

Mutually exclusive with Record delimiter string.

• Record length. Select Fixed where the fixed length columns are being read. DataStage calculates the appropriate length for the record. Alternatively specify the length of fixed records as number of bytes.

• Record Prefix. Specifies that a variable-length record is prefixed by a 1-, 2-, or 4-byte length prefix. 1 byte is the default.

• Record type. Specifies that data consists of variable-length blocked records (varying) or implicit records (implicit). If you choose the implicit property, data is read as a stream with no explicit record boundaries. The end of the record is inferred when all of the columns defined by the schema have been parsed. The varying property allows you to specify one of the following IBM blocked or spanned formats: V, VB, VS, or VBS.

This property is mutually exclusive with Record length, Record delimiter, Record delimiter string, and Record prefix.

• User defined. Allows free format entry of any properties not defined elsewhere. Specify in a comma-separated list.

Field Defaults. Defines default properties for columns read from the file or files. These are applied to all columns read. The available properties are:

• Delimiter. Specifies the trailing delimiter of all columns in the record. This is skipped when the file is read. Type an ASCII character or select one of whitespace, end, none, or null.

– whitespace. A whitespace character is used. By default all whitespace characters are skipped when the file is read.

– end. Specifies that the last column in the record is composed of all remaining bytes until the end of the record.


– none. No delimiter.

– null. Null character is used.

• Delimiter string. Specify the string used as the trailing delimiter at the end of each column. Enter one or more ASCII characters.

• Prefix bytes. Specifies that each column in the data file is prefixed by 1, 2, or 4 bytes containing, as a binary value, either the column’s length or the tag value for a tagged field.

• Print field. Select this to specify the stage writes a message for each column that it reads of the format:

Importing columnname value

• Quote. Specifies that variable length columns are enclosed in single quotes, double quotes, or another ASCII character or pair of ASCII characters. Choose Single or Double, or enter an ASCII character.

• Vector prefix. For columns that are variable length vectors, specifies a 1-, 2-, or 4-byte prefix containing the number of elements in the vector.

Type Defaults. These are properties that apply to all columns of a specific data type unless specifically overridden at the column level. They are divided into a number of subgroups according to data type.

General. These properties apply to several data types (unless overridden at column level):

• Byte order. Specifies how multiple byte data types (except string and raw data types) are ordered. Choose from:

– little-endian. The high byte is on the right.

– big-endian. The high byte is on the left.

– native-endian. As defined by the native format of the machine.

• Format. Specifies the data representation format of a column. Choose from:

– binary

– text


• Layout max width. The maximum number of bytes in a column represented as a string. Enter a number.


• Layout width. The number of bytes in a column represented as a string. Enter a number.

• Pad char. Specifies the pad character used when strings or numeric values are exported to an external string representation. Enter an ASCII character or choose null.

String. These properties are applied to columns with a string data type, unless overridden at column level.

• Export EBCDIC as ASCII. Not relevant for output links.

• Import ASCII as EBCDIC. Select this to specify that ASCII characters are read as EBCDIC characters.

Decimal. These properties are applied to columns with a decimal data type unless overridden at column level.

• Allow all zeros. Specifies whether to treat a packed decimal column containing all zeros (which is normally illegal) as a valid representation of zero. Select Yes or No.

• Packed. Select Yes to specify that the decimal columns contain data in packed decimal format, No (separate) to specify that they contain unpacked decimal with a separate sign byte, or No (zoned) to specify that they contain an unpacked decimal in either ASCII or EBCDIC text. This property has two dependent properties as follows:

– Check. Select Yes to verify that data is packed, or No to not verify.

– Signed. Select Yes to use the existing sign when reading decimal columns. Select No to use a positive sign (0xf) regardless of the column’s actual sign value.

• Precision. Specifies the precision where a decimal column is represented in text format. Enter a number.

• Rounding. Specifies how to round a decimal column when reading it. Choose from:

– up (ceiling). Truncate source column towards positive infinity.

– down (floor). Truncate source column towards negative infinity.


– nearest value. Round the source column towards the nearest representable value.


– truncate towards zero. This is the default. Discard fractional digits to the right of the right-most fractional digit supported by the destination, regardless of sign.

• Scale. Specifies how to round a source decimal when its precision and scale are greater than those of the destination.

Numeric. These properties are applied to columns with an integer or float data type unless overridden at column level.

• C_format. Perform non-default conversion of data from integer or floating-point data to a string. This property specifies a C-language format string used for reading integer or floating point strings. This is passed to sscanf().

• In_format. Format string used for conversion of data from a string to integer or floating-point data. This is passed to sscanf().

• Out_format. Not relevant for output links.

Date. These properties are applied to columns with a date data type unless overridden at column level.

• Days since. Dates are read as a signed integer containing the number of days since the specified date. Enter a date in the form %yyyy-%mm-%dd.

• Format string. The string format of a date. By default this is %yyyy-%mm-%dd.

• Is Julian. Select this to specify that dates are read as a numeric value containing the Julian day. A Julian day specifies the date as the number of days from 4713 BCE January 1, 12:00 hours (noon) GMT.

Time. These properties are applied to columns with a time data type unless overridden at column level.

• Format string. Specifies the format of columns representing time as a string. By default this is %hh-%mm-%ss.

• Is midnight seconds. Select this to specify that times are read as a binary 32-bit integer containing the number of seconds elapsed from the previous midnight.

Timestamp. These properties are applied to columns with a timestamp data type unless overridden at column level.


• Format string. Specifies the format of a column representing a timestamp as a string. Defaults to %yyyy-%mm-%dd %hh:%nn:%ss.

Using RCP With Sequential Stages

Runtime column propagation (RCP) allows DataStage to be flexible about the columns you define in a job. If RCP is enabled for a project, you can just define the columns you are interested in using in a job, but ask DataStage to propagate the other columns through the various stages. So such columns can be extracted from the data source and end up on your data target without explicitly being operated on in between.

Sequential files, unlike most other data sources, do not have inherent column definitions, and so DataStage cannot always tell where there are extra columns that need propagating. You can only use RCP on sequential files if you have used the Schema File property (see “Schema File” on page 4-4 and on page 4-14) to specify a schema which describes all the columns in the sequential file. You need to specify the same schema file for any similar stages in the job where you want to propagate columns. Stages that will require a schema file are:

• Sequential File

• File Set

• External Source

• External Target
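For example, if the job's Columns tab defines only CustomerID but the file also holds Name and Balance columns, the schema file must still describe all three columns (as in the sketch under “Schema File” on page 4-4) for the extra two to be propagated.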


Chapter 5. File Set Stage

The File Set stage is a file stage. It allows you to read data from or write data to a file set. The stage can have a single input link, a single output link, and a single rejects link. It only executes in parallel mode.

What is a file set? DataStage can generate and name exported files, write them to their destination, and list the files it has generated in a file whose extension is, by convention, .fs. The data files and the file that lists them are called a file set. This capability is useful because some operating systems impose a 2 GB limit on the size of a file and you need to distribute files among nodes to prevent overruns.
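As an illustration, a job writing to a file set might name the descriptor file /data/out/customers.fs; the data files that DataStage generates, and lists in that descriptor, are spread across the nodes and disks named in the Configuration file. The pathname is purely illustrative.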

The amount of data that can be stored in each destination data file is limited by the characteristics of the file system and the amount of free disk space available. The number of files created by a file set depends on:

• The number of processing nodes in the default node pool

• The number of disks in the export or default disk pool connected to each processing node in the default node pool

• The size of the partitions of the data set

The File Set stage enables you to create and write to file sets, and to read data back from file sets.

When you edit a File Set stage, the File Set stage editor appears. This is based on the generic stage editor described in Chapter 3, “Stage Editors.”

The stage editor has up to three pages, depending on whether you are reading or writing a file set:


• Stage page. This is always present and is used to specify general information about the stage.


• Inputs page. This is present when you are writing to a file set. This is where you specify details about the file set being written to.

• Outputs page. This is present when you are reading from a file set. This is where you specify details about the file set being read from.

There are one or two special points to note about using runtime column propagation (RCP) with File Set stages. See “Using RCP With File Set Stages” on page 5-20 for details.

Stage Page

The General tab allows you to specify an optional description of the stage. The Advanced page allows you to specify how the stage executes.

Advanced Tab

This tab allows you to specify the following:

• Execution Mode. This is set to parallel and cannot be changed.

• Preserve partitioning. You can select Set or Clear. If you select Set, file set read operations will request that the next stage preserves the partitioning as is (it is ignored for file set write operations).

• Node pool and resource constraints. Select this option to constrain parallel execution to the node pool or pools and/or resource pools or pools specified in the grid. The grid allows you to make choices from drop down lists populated from the Configuration file.

• Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node map. You can define a node map by typing node numbers into the text box or by clicking the browse button to open the Available Nodes dialog box and selecting nodes from there. You are effectively defining a new node pool for this stage (in addition to any node pools defined in the Configuration file).

Inputs Page

The Inputs page allows you to specify details about how the File Set stage writes data to a file set. The File Set stage can have only one input link.


The General tab allows you to specify an optional description of the input link. The Properties tab allows you to specify details of exactly what the link does. The Partitioning tab allows you to specify how incoming data is partitioned before being written to the file set. The Format tab gives information about the format of the files being written. The Columns tab specifies the column definitions of the data.

Details about File Set stage properties, partitioning, and formatting are given in the following sections. See Chapter 3, “Stage Editors,” for a general description of the other tabs.

Input Link Properties

The Properties tab allows you to specify properties for the input link. These dictate how incoming data is written and to what file set. Some of the properties are mandatory, although many have default settings. Properties without default settings appear in the warning color (red by default) and turn black when you supply a value for them.

The following table gives a quick reference list of the properties and their attributes. A more detailed description of each property follows.

Category/Property              Values                            Default          Mandatory?  Repeats?  Dependent of
Target/File Set                pathname                          N/A              Y           N         N/A
Target/File Set Update Policy  Create (Error if exists)/         Error if exists  Y           N         N/A
                               Overwrite/Use Existing (Discard
                               records)/Use Existing (Discard
                               schema & records)
Target/File Set Schema policy  Write/Omit                        Write            Y           N         N/A
Options/Cleanup on Failure     True/False                        True             Y           N         N/A
Options/Single File Per        True/False                        False            Y           N         N/A
  Partition
Options/Reject Mode            Continue/Fail/Save                Continue         Y           N         N/A
Options/Diskpool               string                            N/A              N           N         N/A
Options/File Prefix            string                            export.username  N           N         N/A
Options/File Suffix            string                            none             N           N         N/A
Options/Maximum File Size      number MB                         N/A              N           N         N/A
Options/Schema File            pathname                          N/A              N           N         N/A


Target Category

File Set. This property defines the file set that the incoming data will be written to. You can type in a pathname of, or browse for, a file set descriptor file (by convention ending in .fs).

File Set Update Policy. Specifies what action will be taken if the file set you are writing to already exists. Choose from:

• Create (Error if exists)
• Overwrite
• Use Existing (Discard records)
• Use Existing (Discard schema & records)

The default is Overwrite.

File Set Schema policy. Specifies whether the schema should be written to the file set. Choose from Write or Omit. The default is Write.

Options Category

Cleanup on Failure. This is set to True by default and specifies that the stage will delete any partially written files if the stage fails for any reason. Set this to False to specify that partially written files should be left.


Single File Per Partition. Set this to True to specify that one file is written for each partition. The default is False.


Reject Mode. Allows you to specify behavior if a record fails to be written for some reason. Choose from Continue to continue operation and discard any rejected rows, Fail to cease writing if any rows are rejected, or Save to send rejected rows down a reject link. Defaults to Continue.

Diskpool. This is an optional property. Specify the name of the disk pool into which to write the file set. You can also specify a job parameter.

File Prefix. This is an optional property. Specify a prefix for the name of the file set components. If you do not specify a prefix, the system writes the following: export.username, where username is your login. You can also specify a job parameter.

File Suffix. This is an optional property. Specify a suffix for the name of the file set components. The suffix is omitted by default.
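Taken together, the prefix and suffix bracket the names of the generated data files. The listing below is purely hypothetical — the numbering in the middle of each name is chosen by the engine and varies with nodes and partitions — but it shows the general shape for a user jdoe who set File Suffix to .dat:

    export.jdoe.0000.dat
    export.jdoe.0001.dat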

Maximum File Size. This is an optional property. Specify the maximum file size in MB. The value of numMB must be equal to or greater than 1.

Schema File. This is an optional property. By default the File Set stage will use the column definitions defined on the Columns tab as a schema for writing the file. You can, however, override this by specifying a file containing a schema. Type in a pathname or browse for a file.

Partitioning on Input Links

The Partitioning tab allows you to specify details about how the incoming data is partitioned or collected before it is written to the file or files. It also allows you to specify that the data should be sorted before being written.

By default the stage partitions in Auto mode. This attempts to work out the best partitioning method depending on execution modes of current and preceding stages, whether the Preserve Partitioning option has been set, and how many nodes are specified in the Configuration file. If the Preserve Partitioning option has been set on the Stage page Advanced tab (see page 5-2) the stage will attempt to preserve the partitioning of the incoming data.

If the File Set stage is operating in sequential mode, it will first collect the data before writing it to the file using the default Auto collection method.


The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends on:


• Whether the File Set stage is set to execute in parallel or sequential mode.

• Whether the preceding stage in the job is set to execute in parallel or sequential mode.

If the File Set stage is set to execute in parallel, then you can set a partitioning method by selecting from the Partitioning mode drop-down list. This will override any current partitioning (even if the Preserve Partitioning option has been set on the Stage page Advanced tab).

If the File Set stage is set to execute in sequential mode, but the preceding stage is executing in parallel, then you can set a collection method from the Collection type drop-down list. This will override the default collection method.

The following partitioning methods are available:

• (Auto). DataStage attempts to work out the best partitioning method depending on execution modes of current and preceding stages, whether the Preserve Partitioning flag has been set on the previous stage in the job, and how many nodes are specified in the Configuration file. This is the default method for the File Set stage.

• Entire. Each file written to receives the entire data set.

• Hash. The records are hashed into partitions based on the value of a key column or columns selected from the Available list.

• Modulus. The records are partitioned using a modulus function on the key column selected from the Available list. This is commonly used to partition on tag fields.

• Random. The records are partitioned randomly, based on the output of a random number generator.

• Round Robin. The records are partitioned on a round robin basis as they enter the stage.

• Same. Preserves the partitioning already in place.

• DB2. Replicates the DB2 partitioning method of a specific DB2 table. Requires extra properties to be set. Access these properties by clicking the properties button.


• Range. Divides a data set into approximately equal size partitions based on one or more partitioning keys. Range partitioning is often a preprocessing step to performing a total sort on a data set. Requires extra properties to be set. Access these properties by clicking the properties button.
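The following Python sketch is not DataStage code; it simply illustrates the arithmetic behind three of the methods above — Hash, Modulus, and Round Robin — under the assumption of a fixed partition count:

    import itertools
    import zlib

    NUM_PARTITIONS = 4  # assumed partition count, e.g. one per node

    def hash_partition(key: str) -> int:
        # Hash: a stable hash of the key value, reduced modulo the
        # partition count, so equal keys land in the same partition.
        return zlib.crc32(key.encode()) % NUM_PARTITIONS

    def modulus_partition(tag: int) -> int:
        # Modulus: applied directly to a numeric key such as a tag field.
        return tag % NUM_PARTITIONS

    # Round Robin: records are dealt out in rotation as they arrive.
    round_robin = itertools.cycle(range(NUM_PARTITIONS))

    print(hash_partition("SMITH"), modulus_partition(17), next(round_robin))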

The following Collection methods are available:

• (Auto). DataStage attempts to work out the best collection method depending on execution modes of current and preceding stages, and how many nodes are specified in the Configuration file. This is the default method for the File Set stage.

• Ordered. Reads all records from the first partition, then all records from the second partition, and so on.

• Round Robin. Reads a record from the first input partition, then from the second partition, and so on. After reaching the last partition, the operator starts over.

• Sort Merge. Reads records in an order based on one or more columns of the record. This requires you to select a collecting key column from the Available list.

The Partitioning tab also allows you to specify that data arriving on the input link should be sorted before being written to the file or files. The sort is always carried out within data partitions. If the stage is partitioning incoming data the sort occurs after the partitioning. If the stage is collecting data, the sort occurs before the collection. The availability of sorting depends on the partitioning method chosen.

Select the check boxes as follows:

• Sort. Select this to specify that data coming in on the link should be sorted. Select the column or columns to sort on from the Available list.

• Stable. Select this if you want to preserve previously sorted data sets. This is the default.

• Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is also set, the first record is retained.

You can also specify sort direction, case sensitivity, and collating sequence for each column in the Selected list by selecting it and right-clicking to invoke the shortcut menu.


Format of File Set Files

The Format tab allows you to supply information about the format of the files in the file set to which you are writing. The tab has a similar format to the Properties tab and is described on page 3-24.

Select a property type from the main tree then add the properties you want to set to the tree structure by clicking on them in the Available properties to set window. You can then set a value for that property in the Property Value box. Pop-up help for each of the available properties appears if you hover the mouse pointer over it.

The following sections list the Property types and properties available for each type.

Record level. These properties define details about how data records are formatted in a file. The available properties are:

• Fill char. Specify an ASCII character or a value in the range 0 to 255. This character is used to fill any gaps in an exported record caused by column positioning properties. Set to 0 by default.

• Final delimiter string. Specify a string to be written after the last column of a record in place of the column delimiter. Enter one or more ASCII characters (precedes the record delimiter if one is used).

• Final delimiter. Specify a single character to be written after the last column of a record in place of the column delimiter. Type an ASCII character or select one of whitespace, end, none, or null.

– whitespace. A whitespace character is used.
– end. Record delimiter is used (defaults to newline).
– none. No delimiter (column length is used).
– null. Null character is used.

• Intact. Allows you to define that this is a partial record schema. See “Partial Schemas” in Appendix A for details on complete versus partial schemas. (The dependent property Check Intact is only relevant for output links.)

• Record delimiter string. Specify a string to be written at the end of each record. Enter one or more ASCII characters.


• Record delimiter. Specify a single character to be written at the end of each record. Type an ASCII character or select one of the following:


– ‘\n’. Newline (the default).
– null. Null character.

Mutually exclusive with Record delimiter string.

• Record length. Select Fixed where the fixed length columns are being written. DataStage calculates the appropriate length for the record. Alternatively specify the length of fixed records as number of bytes.

• Record Prefix. Specifies that a variable-length record is prefixed by a 1-, 2-, or 4-byte length prefix. 1 byte is the default.

• Record type. Specifies that data consists of variable-length blocked records (varying) or implicit records (implicit). If you choose the implicit property, data is written as a stream with no explicit record boundaries. The end of the record is inferred when all of the columns defined by the schema have been parsed. The varying property allows you to specify one of the following IBM blocked or spanned formats: V, VB, VS, or VBS.

This property is mutually exclusive with Record length, Record delimiter, Record delimiter string, and Record prefix.

• User defined. Allows free format entry of any properties not defined elsewhere. Specify in a comma-separated list.

Field Defaults. Defines default properties for columns written to the files. These are applied to all columns written. The available properties are:

• Delimiter. Specifies the trailing delimiter of all columns in the record. Type an ASCII character or select one of whitespace, end, none, or null.

– whitespace. A whitespace character is used.

– end. Specifies that the last column in the record is composed of all remaining bytes until the end of the record.

– none. No delimiter.

– null. Null character is used.

• Delimiter string. Specify a string to be written at the end of each column. Enter one or more ASCII characters.


• Prefix bytes. Specifies that each column in the data file is prefixed by 1, 2, or 4 bytes containing, as a binary value, either the column’s length or the tag value for a tagged field.

• Print field. This property is not relevant for input links.

• Quote. Specifies that variable length columns are enclosed in single quotes, double quotes, or another ASCII character or pair of ASCII characters. Choose Single or Double, or enter an ASCII character.

• Vector prefix. For columns that are variable length vectors, specifies a 1-, 2-, or 4-byte prefix containing the number of elements in the vector.
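As a purely illustrative example of how the delimiter properties above shape the output: with Delimiter set to the comma character, Final delimiter set to end, and the default newline record delimiter, a three-column record would be written as

    0001,Smith,275.50

with no trailing comma after the last column.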

Type Defaults. These are properties that apply to all columns of a specific data type unless specifically overridden at the column level. They are divided into a number of subgroups according to data type.

General. These properties apply to several data types (unless overridden at column level):

• Byte order. Specifies how multiple byte data types (except string and raw data types) are ordered. Choose from:

– little-endian. The high byte is on the right.
– big-endian. The high byte is on the left.
– native-endian. As defined by the native format of the machine.

• Format. Specifies the data representation format of a column. Choose from:

– binary
– text

• Layout max width. The maximum number of bytes in a column represented as a string. Enter a number.

• Layout width. The number of bytes in a column represented as a string. Enter a number.

• Pad char. Specifies the pad character used when strings or numeric values are exported to an external string representation. Enter an ASCII character or choose null.


String. These properties are applied to columns with a string data type, unless overridden at column level.


• Export EBCDIC as ASCII. Select this to specify that EBCDIC characters are written as ASCII characters.

• Import ASCII as EBCDIC. Not relevant for input links.

Decimal. These properties are applied to columns with a decimal data type unless overridden at column level.

• Allow all zeros. Specifies whether to treat a packed decimal column containing all zeros (which is normally illegal) as a valid representation of zero. Select Yes or No.

• Packed. Select Yes to specify that the decimal columns contain data in packed decimal format or No to specify that they contain unpacked decimal with a separate sign byte. (The packed layout is illustrated in the sketch at the end of this section.) This property has two dependent properties as follows:

– Check. Select Yes to verify that data is packed, or No to not verify.

– Signed. Select Yes to use the existing sign when writing decimal columns. Select No to write a positive sign (0xf) regardless of the column’s actual sign value.

• Precision. Specifies the precision where a decimal column is written in text format. Enter a number.

• Rounding. Specifies how to round a decimal column when writing it. Choose from:

– up (ceiling). Truncate source column towards positive infinity.

– down (floor). Truncate source column towards negative infinity.

– nearest value. Round the source column towards the nearest representable value.

– truncate towards zero. This is the default. Discard fractional digits to the right of the right-most fractional digit supported by the destination, regardless of sign.

• Scale. Specifies how to round a source decimal when its precision and scale are greater than those of the destination.
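The following Python sketch is not DataStage code; it decodes a packed decimal value by hand to illustrate the layout described under Packed above: two 4-bit digits per byte, with the final nibble holding the sign.

    from decimal import Decimal

    def unpack_decimal(data: bytes, scale: int) -> Decimal:
        # Split each byte into two 4-bit digits.
        nibbles = []
        for b in data:
            nibbles.append((b >> 4) & 0xF)
            nibbles.append(b & 0xF)
        sign = nibbles.pop()     # the last nibble is the sign
        value = 0
        for digit in nibbles:
            value = value * 10 + digit
        if sign == 0xD:          # 0xD marks negative; 0xC and 0xF mark positive
            value = -value
        return Decimal(value).scaleb(-scale)

    print(unpack_decimal(b"\x12\x3c", scale=2))   # prints 1.23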

Numeric. These properties are applied to columns with an integer or float data type unless overridden at column level.


• C_format. Perform non-default conversion of data from integer or floating-point data to a string. This property specifies a C-language format string used for writing integer or floating point strings. This is passed to sprintf().

• In_format. Not relevant for input links.

• Out_format. Format string used for conversion of data from integer or floating-point data to a string. This is passed to sprintf().

Date. These properties are applied to columns with a date data type unless overridden at column level.

• Days since. Dates are written as a signed integer containing the number of days since the specified date. Enter a date in the form %yyyy-%mm-%dd.

• Format string. The string format of a date. By default this is %yyyy-%mm-%dd.

• Is Julian. Select this to specify that dates are written as a numeric value containing the Julian day. A Julian day specifies the date as the number of days from 4713 BCE January 1, 12:00 hours (noon) GMT.
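As an informal illustration of the two date representations above (this is ordinary Python, not engine code, and the base date is an arbitrary example):

    from datetime import date

    BASE = date(1990, 1, 1)   # example "Days since" base date

    def days_since(d: date) -> int:
        return (d - BASE).days

    def julian_day_number(d: date) -> int:
        # date.toordinal() counts days from 0001-01-01, whose Julian
        # day number is 1721426, hence the fixed offset.
        return d.toordinal() + 1721425

    print(days_since(date(2002, 9, 1)))         # 4626
    print(julian_day_number(date(2000, 1, 1)))  # 2451545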

Time. These properties are applied to columns with a time data type unless overridden at column level.

• Format string. Specifies the format of columns representing time as a string. By default this is %hh-%mm-%ss.

• Is midnight seconds. Select this to specify that times are written as a binary 32-bit integer containing the number of seconds elapsed from the previous midnight.
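A one-line sketch of the midnight-seconds representation (plain Python, for illustration only):

    from datetime import time

    def midnight_seconds(t: time) -> int:
        # Whole seconds elapsed since the previous midnight.
        return t.hour * 3600 + t.minute * 60 + t.second

    print(midnight_seconds(time(14, 7, 55)))   # 50875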

Timestamp. These properties are applied to columns with a timestamp data type unless overridden at column level.

Format string. Specifies the format of a column representing a timestamp as a string. Defaults to %yyyy-%mm-%dd %hh:%nn:%ss.
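For example, a timestamp written with the default format string (where %nn is minutes) appears as:

    2002-09-30 14:07:55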

Outputs Page

The Outputs page allows you to specify details about how the File Set stage reads data from a file set. The File Set stage can have only one output link. It can also have a single reject link, where records that have failed to be written or read for some reason can be sent. The Output name drop-down list allows you to choose whether you are looking at details of the main output link (the stream link) or the reject link.

The General tab allows you to specify an optional description of the output link. The Properties tab allows you to specify details of exactly what the link does. The Format tab gives information about the format of the files being read. The Columns tab specifies the column definitions of incoming data.

Details about File Set stage properties and formatting are given in the following sections. See Chapter 3, “Stage Editors,” for a general description of the other tabs.

Output Link Properties

The Properties tab allows you to specify properties for the output link. These dictate how incoming data is read from files in the file set. Some of the properties are mandatory, although many have default settings. Properties without default settings appear in the warning color (red by default) and turn black when you supply a value for them.

The following table gives a quick reference list of the properties and their attributes. A more detailed description of each property follows.

Category/Property                       Values              Default   Mandatory?  Repeats?  Dependent of
Source/File Set                         pathname            N/A       Y           N         N/A
Options/Keep file Partitions            True/False          False     Y           N         N/A
Options/Reject Mode                     Continue/Fail/Save  Continue  Y           N         N/A
Options/Report Progress                 Yes/No              Yes       Y           N         N/A
Options/Filter                          command             N/A       N           N         N/A
Options/Number Of Readers Per Node      number              1         N           N         N/A
Options/Schema File                     pathname            N/A       N           N         N/A
Options/Use Schema Defined in File Set  True/False          False     Y           N         N/A


Source Category

File Set. This property defines the file set that the data will be read from. You can type in a pathname of, or browse for, a file set descriptor file (by convention ending in .fs).

Options Category

Keep file Partitions. Set this to True to partition the read data set according to the organization of the input file(s). So, for example, if you are reading three files you will have three partitions. Defaults to False.

Reject Mode. Allows you to specify behavior if a record fails to be read for some reason. Choose from Continue to continue operation and discard any rejected rows, Fail to cease reading if any rows are rejected, or Save to send rejected rows down a reject link. Defaults to Continue.

Report Progress. Choose Yes or No to enable or disable reporting. By default the stage displays a progress report at each 10% interval when it can ascertain file size. Reporting occurs only if the file is greater than 100 KB, records are fixed length, and there is no filter on the file.

Filter. This is an optional property. You can use this to specify that the data is passed through a filter program after being read from the files. Specify the filter command, and any required arguments, in the Property Value box.

Number Of Readers Per Node. This is an optional property. Specifies the number of instances of the file read operator on each processing node. The default is one operator per node per input data file. If numReaders is greater than one, each instance of the file read operator reads a contiguous range of records from the input file. The starting record location in the file for each operator, or seek location, is determined by the data file size, the record length, and the number of instances of the operator, as specified by numReaders.

The resulting data set contains one partition per instance of the file read operator, as determined by numReaders. The data file(s) being read must contain fixed-length records.
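The following Python sketch is illustrative only — the engine's actual division of records among readers is an implementation detail — but it shows how seek locations fall out of the file size, record length, and reader count:

    def reader_offsets(file_size: int, record_len: int, num_readers: int) -> list[int]:
        # Each reader takes a contiguous run of fixed-length records;
        # its seek location is the byte offset of its first record.
        total_records = file_size // record_len
        per_reader = total_records // num_readers
        return [i * per_reader * record_len for i in range(num_readers)]

    print(reader_offsets(1_000_000, 100, 4))   # [0, 250000, 500000, 750000]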


Schema File. This is an optional property. By default the File Set stage will use the column definitions defined on the Columns and Format tabs as a schema for reading the file. You can, however, override this by specifying a file containing a schema. Type in a pathname or browse for a file.

Use Schema Defined in File Set. When you create a file set you have an option to save the schema along with it. When you read the file set you can use this schema in preference to the column definitions or a schema file by setting this property to True.

Reject Link Properties

You cannot change the properties of a Reject link. The Properties tab for a reject link is blank.

Similarly, you cannot edit the column definitions for a reject link. The link uses the column definitions for the link rejecting the data records.

Format of File Set Files

The Format tab allows you to supply information about the format of the files in the file set which you are reading. The tab has a similar format to the Properties tab and is described on page 3-24.

Select a property type from the main tree then add the properties you want to set to the tree structure by clicking on them in the Available properties to set window. You can then set a value for that property in the Property Value box. Pop-up help for each of the available properties appears if you hover the mouse pointer over it.

The following sections list the Property types and properties available for each type.

Record level. These properties define details about how data records are formatted in the flat file. The available properties are:

• Fill char. Not relevant for Output links.

• Final delimiter string. Specify the string that appears after the last column of a record in place of the column delimiter. Enter one or more ASCII characters (precedes the record delimiter if one is used).

• Final delimiter. Specify a single character that appears after the last column of a record in place of the column delimiter. Type an ASCII character or select one of whitespace, end, none, or null.

– whitespace. A whitespace character is used.


– end. Record delimiter is used (defaults to newline).
– none. No delimiter (column length is used).
– null. Null character is used.

• Intact. Allows you to define that this is a partial record schema. See “Partial Schemas” in Appendix A for details on complete versus partial schemas. This property has a dependent property:

– Check Intact. Select this to force validation of the partial schema as the file or files are imported. Note that this can degrade performance.

• Record delimiter string. Specifies the string at the end of each record. Enter one or more ASCII characters.

• Record delimiter. Specifies the single character at the end of each record. Type an ASCII character or select one of the following:

– ‘\n’. Newline (the default).
– null. Null character.

Mutually exclusive with Record delimiter string.

• Record length. Select Fixed where the fixed length columns are being read. DataStage calculates the appropriate length for the record. Alternatively specify the length of fixed records as number of bytes.

• Record Prefix. Specifies that a variable-length record is prefixed by a 1-, 2-, or 4-byte length prefix. 1 byte is the default.

• Record type. Specifies that data consists of variable-length blocked records (varying) or implicit records (implicit). If you choose the implicit property, data is read as a stream with no explicit record boundaries. The end of the record is inferred when all of the columns defined by the schema have been parsed. The varying property allows you to specify one of the following IBM blocked or spanned formats: V, VB, VS, or VBS.

This property is mutually exclusive with Record length, Record delimiter, Record delimiter string, and Record prefix.

• User defined. Allows free format entry of any properties not defined elsewhere. Specify in a comma-separated list.


Field Defaults. Defines default properties for columns read from the files. These are applied to all columns read. The available properties are:


• Delimiter. Specifies the trailing delimiter of all columns in the record. This is skipped when the file is read. Type an ASCII character or select one of whitespace, end, none, or null.

– whitespace. A whitespace character is used. By default all whitespace characters are skipped when the file is read.

– end. Specifies that the last column in the record is composed of all remaining bytes until the end of the record.

– none. No delimiter.

– null. Null character is used.

• Delimiter string. Specify the string used as the trailing delimiter at the end of each column. Enter one or more ASCII characters.

• Prefix bytes. Specifies that each column in the data file is prefixed by 1, 2, or 4 bytes containing, as a binary value, either the column’s length or the tag value for a tagged field.

• Print field. Select this to specify that the stage writes a message for each column that it reads of the format:

Importing columnname value

• Quote. Specifies that variable length columns are enclosed in single quotes, double quotes, or another ASCII character or pair of ASCII characters. Choose Single or Double, or enter an ASCII character.

• Vector prefix. For columns that are variable length vectors, specifies a 1-, 2-, or 4-byte prefix containing the number of elements in the vector.

Type Defaults. These are properties that apply to all columns of a specific data type unless specifically overridden at the column level. They are divided into a number of subgroups according to data type.

General. These properties apply to several data types (unless overridden at column level):

• Byte order. Specifies how multiple byte data types (except string and raw data types) are ordered. Choose from:


– little-endian. The high byte is on the right.
– big-endian. The high byte is on the left.
– native-endian. As defined by the native format of the machine.


• Format. Specifies the data representation format of a column. Choose from:

– binary
– text

• Layout max width. The maximum number of bytes in a column represented as a string. Enter a number.

• Layout width. The number of bytes in a column represented as a string. Enter a number.

• Pad char. Specifies the pad character used when strings or numeric values are exported to an external string representation. Enter an ASCII character or choose null.

String. These properties are applied to columns with a string data type, unless overridden at column level.

• Export EBCDIC as ASCII. Not relevant for output links.

• Import ASCII as EBCDIC. Select this to specify that ASCII characters are read as EBCDIC characters.

Decimal. These properties are applied to columns with a decimal data type unless overridden at column level.

• Allow all zeros. Specifies whether to treat a packed decimal column containing all zeros (which is normally illegal) as a valid representation of zero. Select Yes or No.

• Packed. Select Yes to specify that the decimal columns contain data in packed decimal format, No (separate) to specify that they contain unpacked decimal with a separate sign byte, or No (zoned) to specify that they contain an unpacked decimal in either ASCII or EBCDIC text. This property has two dependent properties as follows:

– Check. Select Yes to verify that data is packed, or No to not verify.

– Signed. Select Yes to use the existing sign when reading decimal columns. Select No to use a positive sign (0xf) regardless of the column’s actual sign value.


• Precision. Specifies the precision where a decimal column is represented in text format. Enter a number.


• Rounding. Specifies how to round a decimal column when reading it. Choose from:

– up (ceiling). Truncate source column towards positive infinity.

– down (floor). Truncate source column towards negative infinity.

– nearest value. Round the source column towards the nearest representable value.

– truncate towards zero. This is the default. Discard fractional digits to the right of the right-most fractional digit supported by the destination, regardless of sign.

• Scale. Specifies how to round a source decimal when its precision and scale are greater than those of the destination.

Numeric. These properties are applied to columns with an integer or float data type unless overridden at column level.

• C_format. Perform non-default conversion of data from a string to integer or floating-point data. This property specifies a C-language format string used for reading integer or floating point strings. This is passed to sscanf().

• In_format. Format string used for conversion of data from a string to integer or floating-point data. This is passed to sscanf().

• Out_format. Not relevant for output links.

Date. These properties are applied to columns with a date data type unless overridden at column level.

• Days since. Dates are read as a signed integer containing the number of days since the specified date. Enter a date in the form %yyyy-%mm-%dd.

• Format string. The string format of a date. By default this is %yyyy-%mm-%dd.

• Is Julian. Select this to specify that dates are read as a numeric value containing the Julian day. A Julian day specifies the date as the number of days from 4713 BCE January 1, 12:00 hours (noon) GMT.


Time. These properties are applied to columns with a time data type unless overridden at column level.


• Format string. Specifies the format of columns representing time as a string. By default this is %hh-%mm-%ss.

• Is midnight seconds. Select this to specify that times are read as a binary 32-bit integer containing the number of seconds elapsed from the previous midnight.

Timestamp. These properties are applied to columns with a timestamp data type unless overridden at column level.

• Format string. Specifies the format of a column representing a timestamp as a string. Defaults to %yyyy-%mm-%dd %hh:%nn:%ss.

Using RCP With File Set Stages

Runtime column propagation (RCP) allows DataStage to be flexible about the columns you define in a job. If RCP is enabled for a project, you can just define the columns you are interested in using in a job, but ask DataStage to propagate the other columns through the various stages. So such columns can be extracted from the data source and end up on your data target without explicitly being operated on in between.

File Set stages handle sets of sequential files. Sequential files, unlike most other data sources, do not have inherent column definitions, and so DataStage cannot always tell where there are extra columns that need propagating. You can only use RCP on File Set stages if you have used the Schema File property (see “Schema File” on page 5-5 and on page 5-14) to specify a schema which describes all the columns in the sequential files referenced by the stage. You need to specify the same schema file for any similar stages in the job where you want to propagate columns. Stages that will require a schema file are:

• Sequential File
• File Set
• External Source
• External Target


6. Data Set Stage

The Data Set stage is a file stage. It allows you to read data from or write data to a data set. The stage can have a single input link or a single output link. It can be configured to execute in parallel or sequential mode.

What is a data set? DataStage parallel extender jobs use data sets to store data being operated on in a persistent form. Data sets are operating system files, each referred to by a control file, which by convention has the suffix .ds. Using data sets wisely can be key to good performance in a set of linked jobs. You can also manage data sets independently of a job using the Data Set Management utility, available from the DataStage Designer, Manager, or Director; see Chapter 50.

The stage editor has up to three pages, depending on whether you are reading or writing a data set:

• Stage page. This is always present and is used to specify general information about the stage.

• Inputs page. This is present when you are writing to a data set. This is where you specify details about the data set being written to.

• Outputs page. This is present when you are reading from a data set. This is where you specify details about the data set being read from.

Stage Page


The General tab allows you to specify an optional description of the stage. The Advanced tab allows you to specify how the stage executes.


Advanced Tab

This tab allows you to specify the following:

• Execution Mode. This is not relevant for a data set and so is disabled.

• Preserve partitioning. A data set stores the setting of the preserve partitioning flag with the data. It cannot be changed on this stage and so the field is disabled (it does not appear if your stage only has an input link).

• Node pool and resource constraints. You can specify resource constraints to limit execution to the resource pool or pools specified in the grid. The grid allows you to make choices from drop-down lists populated from the Configuration file.

• Node map constraint. This is not relevant to a Data Set stage.

Inputs Page

The Inputs page allows you to specify details about how the Data Set stage writes data to a data set. The Data Set stage can have only one input link.

The General tab allows you to specify an optional description of the input link. The Properties tab allows you to specify details of exactly what the link does. The Columns tab specifies the column definitions of the data.

Details about Data Set stage properties are given in the following sections. See Chapter 3, “Stage Editors,” for a general description of the other tabs.

Input Link Properties

The Properties tab allows you to specify properties for the input link. These dictate how incoming data is written and to what data set. Some of the properties are mandatory, although many have default settings. Properties without default settings appear in the warning color (red by default) and turn black when you supply a value for them.


The following table gives a quick reference list of the properties and their attributes. A more detailed description of each property follows.

Category/Property     Values                            Default                    Mandatory?  Repeats?  Dependent of
Target/File           pathname                          N/A                        Y           N         N/A
Target/Update Policy  Append/Create (Error if exists)/  Create (Error if exists)   Y           N         N/A
                      Overwrite/Use existing (Discard
                      records)/Use existing (Discard
                      records and schema)

Target Category

File. The name of the control file for the data set. You can browse for the file or enter a job parameter. By convention, the file has the suffix .ds.

Update Policy. Specifies what action will be taken if the data set you are writing to already exists. Choose from:

• Append. Append any new data to the existing data.

• Create (Error if exists). DataStage reports an error if the data set already exists.

• Overwrite. Overwrites any existing data with new data.

• Use existing (Discard records). Keeps the existing data and discards any new data.

• Use existing (Discard records and schema). Keeps the existing data and discards any new data and its associated schema.

The default is Overwrite.



Outputs Page

The Outputs page allows you to specify details about how the Data Set stage reads data from a data set. The Data Set stage can have only one output link.

The General tab allows you to specify an optional description of the output link. The Properties tab allows you to specify details of exactly what the link does. The Columns tab specifies the column definitions of incoming data.

Details about Data Set stage properties and formatting are given in the following sections. See Chapter 3, “Stage Editors,” for a general description of the other tabs.

Output Link Properties

The Properties tab allows you to specify properties for the output link. These dictate how incoming data is read from the data set. Some of the properties are mandatory, although many have default settings. Properties without default settings appear in the warning color (red by default) and turn black when you supply a value for them.

The following table gives a quick reference list of the properties and their attributes. A more detailed description of each property follows.

Category/Property  Values    Default  Mandatory?  Repeats?  Dependent of
Source/File        pathname  N/A      Y           N         N/A

Source Category

File. The name of the control file for the data set. You can browse for the file or enter a job parameter. By convention the file has the suffix .ds.



7. Lookup File Set Stage

The Lookup File Set stage is a file stage. It allows you to create a lookup file set or reference one for a lookup. The stage can have a single input link or a single output link. The output link must be a reference link. The stage can be configured to execute in parallel or sequential mode when used with an input link.

For more information about lookup operations, see Chapter 20, “Lookup Stage.”

When you edit a Lookup File Set stage, the Lookup File Set stage editor appears. This is based on the generic stage editor described in Chapter 3, “Stage Editors.”

The stage editor has up to three pages, depending on whether you are creating or referencing a file set:

• Stage page. This is always present and is used to specify general information about the stage.

• Inputs page. This is present when you are creating a lookup table. This is where you specify details about the file set being created and written to.

• Outputs page. This is present when you are reading from a lookup file set, i.e., where the stage is providing a reference link to a Lookup stage. This is where you specify details about the file set being read from.

Stage Page


The General tab allows you to specify an optional description of the stage. The Advanced tab allows you to specify how the stage executes.


Advanced Tab

This tab only appears when you are using the stage to create a reference file set (i.e., where the stage has an input link). It allows you to specify the following:

• Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the contents of the table are processed by the available nodes as specified in the Configuration file, and by any node constraints specified on the Advanced tab. In Sequential mode the entire contents of the table are processed by the conductor node.

• Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node map. You can define a node map by typing node numbers into the text box or by clicking the browse button to open the Available Nodes dialog box and selecting nodes from there. You are effectively defining a new node pool for this stage (in addition to any node pools defined in the Configuration file).

• Node pool and resource constraints. Select this option to constrain parallel execution to the node pool or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from drop-down lists populated from the Configuration file.

Inputs PageThe Inputs page allows you to specify details about how the Lookup File Set stage writes data to a table or file set. The Lookup File Set stage can have only one input link.

The General tab allows you to specify an optional description of the input link. The Properties tab allows you to specify details of exactly what the link does. The Partitioning tab allows you to specify how incoming data is partitioned before being written to the table or file set. The Columns tab specifies the column definitions of the data.

Details about Lookup File Set stage properties and partitioning are given in the following sections. See Chapter 3, “Stage Editors,” for a general description of the other tabs.


Input Link Properties

The Properties tab allows you to specify properties for the input link. These dictate how incoming data is written and to what file set. Some of the properties are mandatory, although many have default settings. Properties without default settings appear in the warning color (red by default) and turn black when you supply a value for them.

The following table gives a quick reference list of the properties and their attributes. A more detailed description of each property follows.

Category/Property           Values        Default  Mandatory?  Repeats?  Dependent of
Lookup Keys/Key             Input column  N/A      Y           Y         N/A
Lookup Keys/Case Sensitive  True/False    True     N           N         Key
Target/Lookup File Set      pathname      N/A      Y           N         N/A
Options/Allow Duplicates    True/False    False    Y           N         N/A
Options/Diskpool            string        N/A      N           N         N/A

Lookup Keys Category

Key. Specifies the name of a lookup key column. The Key property must be repeated if there are multiple key columns. The property has a dependent property, Case Sensitive.

Case Sensitive. This is a dependent property of Key and specifies whether the parent key is case sensitive or not. Set to true by default.

Target Category

Lookup File Set. This property defines the file set that the incoming data will be written to. You can type in a pathname of, or browse for, a file set descriptor file (by convention ending in .fs).

Options Category


Allow Duplicates. Set this to cause multiple copies of duplicate records to be saved in the lookup table without a warning being issued. Two lookup records are duplicates when all lookup key columns have the same value in the two records. If you do not specify this option, DataStage issues a warning message when it encounters duplicate records and discards all but the first of the matching records.

Diskpool. This is an optional property. Specify the name of the disk pool into which to write the file set. You can also specify a job parameter.

Partitioning on Input Links

The Partitioning tab allows you to specify details about how the incoming data is partitioned or collected before it is written to the file set. It also allows you to specify that the data should be sorted before being written.

By default the stage will write to the file set in entire mode. The complete data set is written to the file set.

If the Lookup File Set stage is operating in sequential mode, it will first collect the data before writing it to the file using the default (auto) collection method.

The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends on:

• Whether the Lookup File Set stage is set to execute in parallel or sequential mode.

• Whether the preceding stage in the job is set to execute in parallel or sequential mode.

If the Lookup File Set stage is set to execute in parallel, then you can set a partitioning method by selecting from the Partitioning mode drop-down list. This will override any current partitioning (even if the Preserve Partitioning option has been set on the previous stage).

If the Lookup File Set stage is set to execute in sequential mode, but the preceding stage is executing in parallel, then you can set a collection method from the Collection type drop-down list. This will override the default auto collection method.

The following partitioning methods are available:

• Entire. Each file written to receives the entire data set. This is the default partitioning method for the Lookup File Set stage.


• Hash. The records are hashed into partitions based on the value of a key column or columns selected from the Available list.


• Modulus. The records are partitioned using a modulus function on the key column selected from the Available list. This is commonly used to partition on tag fields.

• Random. The records are partitioned randomly, based on the output of a random number generator.

• Round Robin. The records are partitioned on a round robin basis as they enter the stage.

• Same. Preserves the partitioning already in place.

• DB2. Replicates the DB2 partitioning method of a specific DB2 table. Requires extra properties to be set. Access these properties by clicking the properties button.

• Range. Divides a data set into approximately equal size partitions based on one or more partitioning keys. Range partitioning is often a preprocessing step to performing a total sort on a data set. Requires extra properties to be set. Access these properties by clicking the properties button.

The following Collection methods are available:

• (Auto). DataStage attempts to work out the best collection method depending on execution modes of current and preceding stages, and how many nodes are specified in the Configuration file. This is the default method for the Lookup File Set stage.

• Ordered. Reads all records from the first partition, then all records from the second partition, and so on.

• Round Robin. Reads a record from the first input partition, then from the second partition, and so on. After reaching the last partition, the operator starts over. This is the default method for the Lookup File Set stage.

• Sort Merge. Reads records in an order based on one or more columns of the record. This requires you to select a collecting key column from the Available list.

The Partitioning tab normally allows you to specify that data arriving on the input link should be sorted before being written to the lookup table. Availability depends on the partitioning method chosen.


Select the check boxes as follows:


• Sort. Select this to specify that data coming in on the link should be sorted. Select the column or columns to sort on from the Available list.

• Stable. Select this if you want to preserve previously sorted data sets. This is the default.

• Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is also set, the first record is retained.

You can also specify sort direction, case sensitivity, and collating sequence for each column in the Selected list by selecting it and right-clicking to invoke the shortcut menu.


Outputs Page

The Outputs page allows you to specify details about how the Lookup File Set stage references a file set. The Lookup File Set stage can have only one output link, which is a reference link.

The General tab allows you to specify an optional description of the output link. The Properties tab allows you to specify details of exactly what the link does. The Columns tab specifies the column definitions of incoming data.

Details about Lookup File Set stage properties are given in the following sections. See Chapter 3, “Stage Editors,” for a general description of the other tabs.

Output Link Properties

The Properties tab allows you to specify properties for the output link. These dictate how incoming data is read from the lookup table. Some of the properties are mandatory, although many have default settings. Properties without default settings appear in the warning color (red by default) and turn black when you supply a value for them.

The following table gives a quick reference list of the properties and their attributes. A more detailed description of each property follows.

Category/Property              Values    Default  Mandatory?  Repeats?  Dependent of
Lookup Source/Lookup File Set  pathname  N/A      Y           N         N/A

Lookup Source Category

Lookup File Set. This property defines the file set that the data will be referenced from. You can type in a pathname of, or browse for a file set descriptor file (by convention ending in .fs).



8. External Source Stage

The External Source stage is a file stage. It allows you to read data that is output from one or more source programs. The stage can have a single output link, and a single rejects link. It can be configured to execute in parallel or sequential mode.

The External Source stage allows you to perform actions such as interfacing with databases not currently supported by the DataStage Parallel Extender.

When you edit an External Source stage, the External Source stage editor appears. This is based on the generic stage editor described in Chapter 3, “Stage Editors.”

The stage editor has two pages:

• Stage page. This is always present and is used to specify general information about the stage.

• Outputs page. This is where you specify details about the program or programs whose output data you are reading.

There are one or two special points to note about using runtime column propagation (RCP) with External Source stages. See “Using RCP With External Source Stages” on page 8-10 for details.

Stage Page

The General tab allows you to specify an optional description of the stage. The Advanced tab allows you to specify how the stage executes.


Advanced Tab

This tab allows you to specify the following:

• Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the data input from external programs is processed by the available nodes as specified in the Configuration file, and by any node constraints specified on the Advanced tab. In Sequential mode all the data from the source program is processed by the conductor node.

• Preserve partitioning. You can select Set or Clear. If you select Set, it will request that the next stage preserves the partitioning as is. Clear is the default.

• Node pool and resource constraints. Select this option to constrain parallel execution to the node pool or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from drop-down lists populated from the Configuration file.

• Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node map. You can define a node map by typing node numbers into the text box or by clicking the browse button to open the Available Nodes dialog box and selecting nodes from there. You are effectively defining a new node pool for this stage (in addition to any node pools defined in the Configuration file).

Outputs PageThe Outputs page allows you to specify details about how the External Source stage reads data from an external program. The External Source stage can have only one output link. It can also have a single reject link, where records that have failed to be read for some reason can be sent. The Output name drop-down list allows you to choose whether you are looking at details of the main output link (the stream link) or the reject link.

The General tab allows you to specify an optional description of the output link. The Properties tab allows you to specify details of exactly what the link does. The Format tab gives information about the format of the files being read. The Columns tab specifies the column definitions of incoming data.


Details about External Source stage properties and formatting are given in the following sections. See Chapter 3, “Stage Editors,” for a general description of the other tabs.

Output Link Properties

The Properties tab allows you to specify properties for the output link. These dictate how data is read from the external program or programs. Some of the properties are mandatory, although many have default settings. Properties without default settings appear in the warning color (red by default) and turn black when you supply a value for them.

The following table gives a quick reference list of the properties and their attributes. A more detailed description of each property follows.

Category/Property             Values                Default              Mandatory?            Repeats?  Dependent of
Source/Source Program         string                N/A                  Y if Source Method =  Y         N/A
                                                                         Specific Program(s)
Source/Source Programs File   pathname              N/A                  Y if Source Method =  Y         N/A
                                                                         Program File(s)
Source/Source Method          Specific Program(s)/  Specific Program(s)  Y                     N         N/A
                              Program File(s)
Options/Keep File Partitions  True/False            False                Y                     N         N/A
Options/Reject Mode           Continue/Fail/Save    Continue             Y                     N         N/A
Options/Report Progress       Yes/No                Yes                  Y                     N         N/A
Options/Schema File           pathname              N/A                  N                     N         N/A


Source Category

Source Program. Specifies the name of a program providing the source data. DataStage calls the specified program and passes to it any arguments specified. You can repeat this property to specify multiple program instances with different arguments. You can use a job parameter to supply program name and arguments.

Source Programs File. Specifies a file containing a list of program names and arguments. You can browse for the file or specify a job parameter. You can repeat this property to specify multiple files.
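For illustration, such a file is plain text with one program invocation per line. The program names and arguments below are hypothetical:

    /usr/local/bin/extract_orders -since 2002-01-01
    /usr/local/bin/extract_customers -region EMEA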

Source Method. This property specifies whether you are specifying a program directly (using the Source Program property) or using a file to specify a program (using the Source Programs File property).

Options Category

Keep File Partitions. Set this to True to maintain the partitioning of the read data. Defaults to False.

Reject Mode. Allows you to specify behavior if a record fails to be read for some reason. Choose from Continue to continue operation and discard any rejected rows, Fail to cease reading if any rows are rejected, or Save to send rejected rows down a reject link. Defaults to Continue.

Report Progress. Choose Yes or No to enable or disable reporting. By default the stage displays a progress report at each 10% interval when it can ascertain input data size. Reporting occurs only if the input data size is greater than 100 KB, records are fixed length, and there is no filter specified.

Schema File. This is an optional property. By default the External Source stage will use the column definitions defined on the Columns tab and Schema tab as a schema for reading the file. You can, however, override this by specifying a file containing a schema. Type in a pathname or browse for a file.
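As a sketch of what a schema file might contain (the column names and format properties here are hypothetical; see “Schema Files and Partial Schemas” in Chapter 2 for the full syntax):

    record {final_delim=end, delim=',', quote=double}
    (
      order_id: int32;
      customer: string[max=30];
      amount: decimal[8,2];
    )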

Reject Link Properties


You cannot change the properties of a Reject link. The Properties tab for a reject link is blank.


Similarly, you cannot edit the column definitions for a reject link. The link uses the column definitions for the link rejecting the data records.

Format of Data Being Read

The Format tab allows you to supply information about the format of the data which you are reading. The tab has a similar format to the Properties tab and is described on page 3-24.

Select a property type from the main tree, then add the properties you want to set to the tree structure by clicking on them in the Available properties to set window. You can then set a value for that property in the Property Value box. Pop-up help for each of the available properties appears if you hover the mouse pointer over it.

The following sections list the Property types and properties available for each type.

Record level. These properties define details about how data records are formatted in the data being read. The available properties are:

• Fill char. Not relevant for Output links.

• Final delimiter string. Specify the string that appears after the last column of a record in place of the column delimiter. Enter one or more ASCII characters (precedes the record delimiter if one is used).

• Final delimiter. Specify a single character that appears after the last column of a record in place of the column delimiter. Type an ASCII character or select one of whitespace, end, none, or null.

– whitespace. A whitespace character is used.
– end. Record delimiter is used (defaults to newline).
– none. No delimiter (column length is used).
– null. Null character is used.

• Intact. Allows you to define that this is a partial record schema. See “Partial Schemas” in Appendix A for details on complete versus partial schemas. This property has a dependent property:

– Check Intact. Select this to force validation of the partial schema as the file or files are read. Note that this can degrade performance.


• Record delimiter string. Specifies the string at the end of each record. Enter one or more ASCII characters.


• Record delimiter. Specifies the single character at the end of each record. Type an ASCII character or select one of the following:

– ‘\n’. Newline (the default).
– null. Null character.

Mutually exclusive with Record delimiter string.

• Record length. Select Fixed where fixed-length columns are being read. DataStage calculates the appropriate length for the record. Alternatively specify the length of fixed records as a number of bytes.

• Record Prefix. Specifies that a variable-length record is prefixed by a 1-, 2-, or 4-byte length prefix. 1 byte is the default.

• Record type. Specifies that data consists of variable-length blocked records (varying) or implicit records (implicit). If you choose the implicit property, data is read as a stream with no explicit record boundaries. The end of the record is inferred when all of the columns defined by the schema have been parsed. The varying property allows you to specify one of the following IBM blocked or spanned formats: V, VB, VS, or VBS.

This property is mutually exclusive with Record length, Record delimiter, Record delimiter string, and Record prefix.

• User defined. Allows free format entry of any properties not defined elsewhere. Specify in a comma-separated list.

Field Defaults. Defines default properties for columns read from the files. These are applied to all columns read. The available properties are:

• Delimiter. Specifies the trailing delimiter of all columns in the record. This is skipped when the file is read. Type an ASCII character or select one of whitespace, end, none, or null.

– whitespace. A whitespace character is used. By default all whitespace characters are skipped when the file is read.

– end. Specifies that the last column in the record is composed of all remaining bytes until the end of the record.

– none. No delimiter.


– null. Null character is used.


• Delimiter string. Specify the string used as the trailing delimiter at the end of each column. Enter one or more ASCII characters.

• Prefix bytes. Specifies that each column in the data file is prefixed by 1, 2, or 4 bytes containing, as a binary value, either the column’s length or the tag value for a tagged column.

• Print field. Select this to specify that the stage writes a message for each column that it reads, of the format:

Importing columnname value

• Quote. Specifies that variable length columns are enclosed in single quotes, double quotes, or another ASCII character or pair of ASCII characters. Choose Single or Double, or enter an ASCII character.

• Vector prefix. For columns that are variable length vectors, specifies a 1-, 2-, or 4-byte prefix containing the number of elements in the vector.
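These field defaults broadly correspond to per-column import properties in a schema file. As a hypothetical sketch, a comma delimiter and double quoting applied as defaults, with a per-column override, might appear as:

    record {delim=',', quote=double}
    (
      name: string[max=20];
      price: decimal[6,2] {delim='|'};
    )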

Type Defaults. These are properties that apply to all columns of a specific data type unless specifically overridden at the column level. They are divided into a number of subgroups according to data type.

General. These properties apply to several data types (unless overridden at column level):

• Byte order. Specifies how multiple byte data types (except string and raw data types) are ordered. Choose from:

– little-endian. The high byte is on the right.
– big-endian. The high byte is on the left.
– native-endian. As defined by the native format of the machine.

• Format. Specifies the data representation format of a column. Choose from:

– binary
– text

• Layout max width. The maximum number of bytes in a column represented as a string. Enter a number.


• Layout width. The number of bytes in a column represented as a string. Enter a number.


• Pad char. Specifies the pad character used when strings or numeric values are exported to an external string representation. Enter an ASCII character or choose null.

String. These properties are applied to columns with a string data type, unless overridden at column level.

• Export EBCDIC as ASCII. Not relevant for output links.

• Import ASCII as EBCDIC. Select this to specify that ASCII characters are read as EBCDIC characters.

Decimal. These properties are applied to columns with a decimal data type unless overridden at column level.

• Allow all zeros. Specifies whether to treat a packed decimal column containing all zeros (which is normally illegal) as a valid representation of zero. Select Yes or No.

• Packed. Select Yes to specify that the decimal columns contain data in packed decimal format, No (separate) to specify that they contain unpacked decimal with a separate sign byte, or No (zoned) to specify that they contain an unpacked decimal in either ASCII or EBCDIC text. This property has two dependent properties as follows:

– Check. Select Yes to verify that data is packed, or No to not verify.

– Signed. Select Yes to use the existing sign when reading decimal columns. Select No to use a positive sign (0xf) regardless of the column’s actual sign value.

• Precision. Specifies the precision where a decimal column is in text format. Enter a number.

• Rounding. Specifies how to round a decimal column when reading it. Choose from:

– up (ceiling). Truncate source column towards positive infinity.

– down (floor). Truncate source column towards negative infinity.

– nearest value. Round the source column towards the nearest representable value.


– truncate towards zero. This is the default. Discard fractional digits to the right of the right-most fractional digit supported by the destination, regardless of sign.


• Scale. Specifies how to round a source decimal when its precision and scale are greater than those of the destination.

Numeric. These properties are applied to columns with an integer or float data type unless overridden at column level.

• C_format. Perform non-default conversion of data from string data to integer or floating-point data. This property specifies a C-language format string used for reading integer or floating point strings. This is passed to sscanf().

• In_format. Format string used for conversion of data from string to integer or floating-point data. This is passed to sscanf() (see the example after this list).

• Out_format. Not relevant for output links.
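For example, if integer columns arrive as hexadecimal text, setting In_format to the following (a hypothetical value) would read them via sscanf():

    %x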

Date. These properties are applied to columns with a date data type unless overridden at column level.

• Days since. Dates are read as a signed integer containing the number of days since the specified date. Enter a date in the form %yyyy-%mm-%dd.

• Format string. The string format of a date. By default this is %yyyy-%mm-%dd.

• Is Julian. Select this to specify that dates are read as a numeric value containing the Julian day. A Julian day specifies the date as the number of days from 4713 BCE January 1, 12:00 hours (noon) GMT.

Time. These properties are applied to columns with a time data type unless overridden at column level.

• Format string. Specifies the format of columns representing time as a string. By default this is %hh-%mm-%ss.

• Is midnight seconds. Select this to specify that times are read as a binary 32-bit integer containing the number of seconds elapsed from the previous midnight.

Timestamp. These properties are applied to columns with a timestamp data type unless overridden at column level.


• Format string. Specifies the format of a column representing a timestamp as a string. Defaults to %yyyy-%mm-%dd %hh:%nn:%ss.
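For example, assuming %nn denotes minutes, a timestamp in the default format appears in the data as:

    2002-09-30 14:22:05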


Using RCP With External Source Stages

Runtime column propagation (RCP) allows DataStage to be flexible about the columns you define in a job. If RCP is enabled for a project, you can just define the columns you are interested in using in a job, and ask DataStage to propagate the other columns through the various stages. Such columns can be extracted from the data source and end up on your data target without explicitly being operated on in between.

External Source stages, unlike most other data sources, do not have inherent column definitions, and so DataStage cannot always tell whether there are extra columns that need propagating. You can only use RCP on External Source stages if you have used the Schema File property (see “Schema File” on page 8-4) to specify a schema which describes all the columns in the sequential files referenced by the stage. You need to specify the same schema file for any similar stages in the job where you want to propagate columns. Stages that will require a schema file are:

• Sequential File
• File Set
• External Source
• External Target


9  External Target Stage

The External Target stage is a file stage. It allows you to write data to one or more external programs. The stage can have a single input link and a single rejects link. It can be configured to execute in parallel or sequential mode.

The External Target stage allows you to perform actions such as interfacing with databases not currently supported by the DataStage Parallel Extender.

When you edit an External Target stage, the External Target stage editor appears. This is based on the generic stage editor described in Chapter 3, “Stage Editors.”

The stage editor has up to three pages:

• Stage page. This is always present and is used to specify general information about the stage.

• Inputs page. This is where you specify details about the program or programs you are writing data to.

• Outputs Page. This appears if the stage has a rejects link.

There are one or two special points to note about using runtime column propagation (RCP) with External Target stages. See “Using RCP With External Target Stages” on page 9-12 for details.

Stage Page

The General tab allows you to specify an optional description of the stage. The Advanced page allows you to specify how the stage executes.


Advanced Tab

This tab allows you to specify the following:

• Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the data output to external programs is processed by the available nodes as specified in the Configuration file, and by any node constraints specified on the Advanced tab. In Sequential mode all the data is processed by the conductor node.

• Preserve partitioning. You can select Set or Clear. If you select Set, it will request that the next stage preserves the partitioning as is. Clear is the default.

• Node pool and resource constraints. Select this option to constrain parallel execution to the node pool or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from drop-down lists populated from the Configuration file.

• Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node map. You can define a node map by typing node numbers into the text box or by clicking the browse button to open the Available Nodes dialog box and selecting nodes from there. You are effectively defining a new node pool for this stage (in addition to any node pools defined in the Configuration file).

Inputs Page

The Inputs page allows you to specify details about how the External Target stage writes data to an external program. The External Target stage can have only one input link.

The General tab allows you to specify an optional description of the input link. The Properties tab allows you to specify details of exactly what the link does. The Partitioning tab allows you to specify how incoming data is partitioned before being written to the external program. The Formats tab gives information about the format of the data being written. The Columns tab specifies the column definitions of the data.

Details about External Target stage properties, partitioning, and formatting are given in the following sections. See Chapter 3, “Stage Editors,” for a general description of the other tabs.


Input Link Properties

The Properties tab allows you to specify properties for the input link. These dictate how incoming data is written and to what program. Some of the properties are mandatory, although many have default settings. Properties without default settings appear in the warning color (red by default) and turn black when you supply a value for them.

The following table gives a quick reference list of the properties and their attributes. A more detailed description of each property follows.

Category/Property                 Values                               Default              Mandatory?                                 Repeats?  Dependent of
Target/Destination Program        string                               N/A                  Y if Target Method = Specific Program(s)  Y         N/A
Target/Destination Programs File  pathname                             N/A                  Y if Target Method = Program File(s)      Y         N/A
Target/Target Method              Specific Program(s)/Program File(s)  Specific Program(s)  Y                                          N         N/A
Options/Cleanup on Failure        True/False                           True                 Y                                          N         N/A
Options/Reject Mode               Continue/Fail/Save                   Continue             N                                          N         N/A
Options/Schema File               pathname                             N/A                  N                                          N         N/A

Target Category

Destination Program. This is an optional property. Specifies the name of a program receiving data. DataStage calls the specified program and passes to it any arguments specified. You can repeat this property to specify multiple program instances with different arguments. You can use a job parameter to supply program name and arguments.

Destination Programs File. This is an optional property. Specifies a file containing a list of program names and arguments. You can browse for the file or specify a job parameter. You can repeat this property to specify multiple files.

Target Method. This property specifies whether you are specifying a program directly (using the Destination Program property) or using a file to specify a program (using the Destination Programs File property).

Cleanup on Failure. This is set to True by default and specifies that the stage will delete any partially written data if the stage fails for any reason. Set this to False to specify that partially written data should be left.

Reject Mode. This is an optional property. Allows you to specify behavior if a record fails to be written for some reason. Choose from Continue to continue operation and discard any rejected rows, Fail to cease writing if any rows are rejected, or Save to send rejected rows down a reject link. Defaults to Continue.

Schema File. This is an optional property. By default the External Target stage will use the column definitions defined on the Columns tab as a schema for writing the data. You can, however, override this by specifying a file containing a schema. Type in a pathname or browse for a file.

Partitioning on Input Links

The Partitioning tab allows you to specify details about how the incoming data is partitioned or collected before it is written to the target program. It also allows you to specify that the data should be sorted before being written.

By default the stage writes data in Auto partition mode. If the Preserve Partitioning option has been set on the previous stage in the job, this stage will attempt to preserve the partitioning of the incoming data.

If the External Target stage is operating in sequential mode, it will first collect the data before writing it to the target program using the default round robin collection method.

The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends on:


• Whether the External Target stage is set to execute in parallel or sequential mode.


• Whether the preceding stage in the job is set to execute in parallel or sequential mode.

If the External Target stage is set to execute in parallel, then you can set a partitioning method by selecting from the Partitioning type drop-down list. This will override any current partitioning (even if the Preserve Partitioning option has been set on the previous stage in the job).

If the External Target stage is set to execute in sequential mode, but the preceding stage is executing in parallel, then you can set a collection method from the Collection type drop-down list. This will override the default Auto collection method.

The following partitioning methods are available:

• (Auto). DataStage attempts to work out the best partitioning method depending on execution modes of current and preceding stages, whether the Preserve Partitioning flag has been set on the previous stage in the job, and how many nodes are specified in the Configuration file. This is the default partitioning method for the External Target stage.

• Entire. Each file written to receives the entire data set.

• Hash. The records are hashed into partitions based on the value of a key column or columns selected from the Available list.

• Modulus. The records are partitioned using a modulus function on the key column selected from the Available list. This is commonly used to partition on tag columns.

• Random. The records are partitioned randomly, based on the output of a random number generator.

• Round Robin. The records are partitioned on a round robin basis as they enter the stage.

• Same. Preserves the partitioning already in place.

• DB2. Replicates the DB2 partitioning method of a specific DB2 table. Requires extra properties to be set. Access these properties by clicking the properties button.

• Range. Divides a data set into approximately equal size partitions based on one or more partitioning keys. Range partitioning is often a preprocessing step to performing a total sort on a data set. Requires extra properties to be set. Access these properties by clicking the properties button.


The following Collection methods are available:

• (Auto). DataStage attempts to work out the best collection method depending on execution modes of current and preceding stages, and how many nodes are specified in the Configuration file. This is the default method for the External Target stage.

• Ordered. Reads all records from the first partition, then all records from the second partition, and so on.

• Round Robin. Reads a record from the first input partition, then from the second partition, and so on. After reaching the last partition, the operator starts over.

• Sort Merge. Reads records in an order based on one or more columns of the record. This requires you to select a collecting key column from the Available list.

The Partitioning tab also allows you to specify that data arriving on the input link should be sorted before being written to the target program. The sort is always carried out within data partitions. If the stage is partitioning incoming data the sort occurs after the partitioning. If the stage is collecting data, the sort occurs before the collection. The availability of sorting depends on the partitioning method chosen.

Select the check boxes as follows:

• Sort. Select this to specify that data coming in on the link should be sorted. Select the column or columns to sort on from the Available list.

• Stable. Select this if you want to preserve previously sorted data sets. This is the default.

• Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is also set, the first record is retained.

You can also specify sort direction, case sensitivity, and collating sequence for each column in the Selected list by selecting it and right-clicking to invoke the shortcut menu.

Format of Data Being Written

The Format tab allows you to supply information about the format of the data being written. The tab has a similar format to the Properties tab and is described on page 3-24.


Select a property type from the main tree, then add the properties you want to set to the tree structure by clicking on them in the Available properties to set window. You can then set a value for that property in the Property Value box. Pop-up help for each of the available properties appears if you hover the mouse pointer over it.

The following sections list the Property types and properties available for each type.

Record level. These properties define details about how data records are formatted in a file. The available properties are:

• Fill char. Specify an ASCII character or a value in the range 0 to 255. This character is used to fill any gaps in an exported record caused by column positioning properties. Set to 0 by default.

• Final delimiter string. Specify a string to be written after the last column of a record in place of the column delimiter. Enter one or more ASCII characters (precedes the record delimiter if one is used).

• Final delimiter. Specify a single character to be written after the last column of a record in place of the column delimiter. Type an ASCII character or select one of whitespace, end, none, or null.

– whitespace. A whitespace character is used.
– end. Record delimiter is used (defaults to newline).
– none. No delimiter (column length is used).
– null. Null character is used.

• Intact. Allows you to define that this is a partial record schema. See “Partial Schemas” in Appendix A for details on complete versus partial schemas. (The dependent property Check Intact is only relevant for output links.)

• Record delimiter string. Specify a string to be written at the end of each record. Enter one or more ASCII characters.

• Record delimiter. Specify a single character to be written at the end of each record. Type an ASCII character or select one of the following:

– ‘\n’. Newline (the default).


– null. Null character.

Mutually exclusive with Record delimiter string.


• Record length. Select Fixed where fixed-length columns are being written. DataStage calculates the appropriate length for the record. Alternatively specify the length of fixed records as a number of bytes.

• Record Prefix. Specifies that a variable-length record is prefixed by a 1-, 2-, or 4-byte length prefix. 1 byte is the default.

• Record type. Specifies that data consists of variable-length blocked records (varying) or implicit records (implicit). If you choose the implicit property, data is written as a stream with no explicit record boundaries. The end of the record is inferred when all of the columns defined by the schema have been parsed. The varying property allows you to specify one of the following IBM blocked or spanned formats: V, VB, VS, or VBS.

This property is mutually exclusive with Record length, Record delimiter, Record delimiter string, and Record prefix.

• User defined. Allows free format entry of any properties not defined elsewhere. Specify in a comma-separated list.

Field Defaults. Defines default properties for columns written to the files. These are applied to all columns written. The available properties are:

• Delimiter. Specifies the trailing delimiter of all columns in the record. Type an ASCII character or select one of whitespace, end, none, or null.

– whitespace. A whitespace character is used.

– end. Specifies that the last column in the record is composed of all remaining bytes until the end of the record.

– none. No delimiter.

– null. Null character is used.

• Delimiter string. Specify a string to be written at the end of each column. Enter one or more ASCII characters.

• Prefix bytes. Specifies that each column in the data file is prefixed by 1, 2, or 4 bytes containing, as a binary value, either the column’s length or the tag value for a tagged column.


• Print field. This property is not relevant for input links.


• Quote. Specifies that variable length columns are enclosed in single quotes, double quotes, or another ASCII character or pair of ASCII characters. Choose Single or Double, or enter an ASCII character.

• Vector prefix. For columns that are variable length vectors, specifies a 1-, 2-, or 4-byte prefix containing the number of elements in the vector.

Type Defaults. These are properties that apply to all columns of a specific data type unless specifically overridden at the column level. They are divided into a number of subgroups according to data type.

General. These properties apply to several data types (unless overridden at column level):

• Byte order. Specifies how multiple byte data types (except string and raw data types) are ordered. Choose from:

– little-endian. The high byte is on the right.
– big-endian. The high byte is on the left.
– native-endian. As defined by the native format of the machine.

• Format. Specifies the data representation format of a column. Choose from:

– binary
– text

• Layout max width. The maximum number of bytes in a column represented as a string. Enter a number.

• Layout width. The number of bytes in a column represented as a string. Enter a number.

• Pad char. Specifies the pad character used when strings or numeric values are exported to an external string representation. Enter an ASCII character or choose null.

String. These properties are applied to columns with a string data type, unless overridden at column level.

• Export EBCDIC as ASCII. Select this to specify that EBCDIC characters are written as ASCII characters.

• Import ASCII as EBCDIC. Not relevant for input links.


Decimal. These properties are applied to columns with a decimal data type unless overridden at column level.

• Allow all zeros. Specifies whether to treat a packed decimal column containing all zeros (which is normally illegal) as a valid representation of zero. Select Yes or No.

• Packed. Select Yes to specify that the decimal columns contain data in packed decimal format or No to specify that they contain unpacked decimal with a separate sign byte. This property has two dependent properties as follows:

– Check. Select Yes to verify that data is packed, or No to not verify.

– Signed. Select Yes to use the existing sign when writing decimal columns. Select No to write a positive sign (0xf) regardless of the column’s actual sign value.

• Precision. Specifies the precision where a decimal column is written in text format. Enter a number.

• Rounding. Specifies how to round a decimal column when writing it. Choose from:

– up (ceiling). Truncate source column towards positive infinity.

– down (floor). Truncate source column towards negative infinity.

– nearest value. Round the source column towards the nearest representable value.

– truncate towards zero. This is the default. Discard fractional digits to the right of the right-most fractional digit supported by the destination, regardless of sign.

• Scale. Specifies how to round a source decimal when its precision and scale are greater than those of the destination.

Numeric. These properties are applied to columns with an integer or float data type unless overridden at column level.

• C_format. Perform non-default conversion of data from integer or floating-point data to a string. This property specifies a C-language format string used for writing integer or floating point strings. This is passed to sprintf().


• In_format. Not relevant for input links.


• Out_format. Format string used for conversion of data from integer or floating-point data to a string. This is passed to sprintf().

Date. These properties are applied to columns with a date data type unless overridden at column level.

• Days since. Dates are written as a signed integer containing the number of days since the specified date. Enter a date in the form %yyyy-%mm-%dd.

• Format string. The string format of a date. By default this is %yyyy-%mm-%dd.

• Is Julian. Select this to specify that dates are written as a numeric value containing the Julian day. A Julian day specifies the date as the number of days from 4713 BCE January 1, 12:00 hours (noon) GMT.

Time. These properties are applied to columns with a time data type unless overridden at column level.

• Format string. Specifies the format of columns representing time as a string. By default this is %hh-%mm-%ss.

• Is midnight seconds. Select this to specify that times are written as a binary 32-bit integer containing the number of seconds elapsed from the previous midnight.

Timestamp. These properties are applied to columns with a timestamp data type unless overridden at column level.

• Format string. Specifies the format of a column representing a timestamp as a string. Defaults to %yyyy-%mm-%dd %hh:%nn:%ss.


Outputs Page

The Outputs page appears if the stage has a Reject link.

The General tab allows you to specify an optional description of the output link.

You cannot change the properties of a Reject link. The Properties tab for a reject link is blank.

Similarly, you cannot edit the column definitions for a reject link. The link uses the column definitions for the link rejecting the data records.

Using RCP With External Target Stages

Runtime column propagation (RCP) allows DataStage to be flexible about the columns you define in a job. If RCP is enabled for a project, you can just define the columns you are interested in using in a job, and ask DataStage to propagate the other columns through the various stages. Such columns can be extracted from the data source and end up on your data target without explicitly being operated on in between.

External Target stages, unlike most other data targets, do not have inherent column definitions, and so DataStage cannot always tell whether there are extra columns that need propagating. You can only use RCP on External Target stages if you have used the Schema File property (see “Schema File” on page 9-4) to specify a schema which describes all the columns in the sequential files referenced by the stage. You need to specify the same schema file for any similar stages in the job where you want to propagate columns. Stages that will require a schema file are:

• Sequential File
• File Set
• External Source
• External Target


10  Write Range Map Stage

The Write Range Map stage allows you to write data to a range map. The stage can have a single input link. It can only run in parallel mode.

The Write Range Map stage takes an input data set produced by sampling and sorting a data set and writes it to a file in a form usable by the range partitioning method. The range partitioning method uses the sampled and sorted data set to determine partition boundaries. See “Partitioning and Collecting Data” on page 2-7 for a description of the range partitioning method.

A typical use for the Write Range Map stage would be in a job which used the Sample stage to sample a data set, the Sort stage to sort it and the Write Range Map stage to write the resulting data set to a file.

The Write Range Map stage editor has two pages:

• Stage page. This is always present and is used to specify general information about the stage.

• Inputs page. This is present when you are writing a range map. This is where you specify details about the file being written to.

Stage Page

The General tab allows you to specify an optional description of the stage. The Advanced page allows you to specify how the stage executes.


Advanced Tab

This tab allows you to specify the following:

• Execution Mode. The stage always executes in parallel mode.

• Preserve partitioning. This is Set by default. The partitioning mode is range and cannot be overridden.

• Node pool and resource constraints. Select this option to constrain parallel execution to the node pool or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from drop-down lists populated from the Configuration file.

• Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node map. You can define a node map by typing node numbers into the text box or by clicking the browse button to open the Available Nodes dialog box and selecting nodes from there. You are effectively defining a new node pool for this stage (in addition to any node pools defined in the Configuration file).

Inputs Page

The Inputs page allows you to specify details about how the Write Range Map stage writes the range map to a file. The Write Range Map stage can have only one input link.

The General tab allows you to specify an optional description of the input link. The Properties tab allows you to specify details of exactly what the link does. The Partitioning tab allows you to specify sorting details. The Columns tab specifies the column definitions of the data.

Details about Write Range Map stage properties and partitioning are given in the following sections. See Chapter 3, “Stage Editors,” for a general description of the other tabs.

Input Link Properties

The Properties tab allows you to specify properties for the input link. These dictate how incoming data is written to the range map file.

Some of the properties are mandatory, although many have default settings. Properties without default settings appear in the warning color (red by default) and turn black when you supply a value for them.

The following table gives a quick reference list of the properties and their attributes. A more detailed description of each property follows.

Category/Property         Values            Default  Mandatory?  Repeats?  Dependent of
Options/File Update Mode  Create/Overwrite  Create   Y           N         N/A
Options/Key               input column      N/A      Y           Y         N/A
Options/Range Map File    pathname          N/A      Y           N         N/A

Options Category

File Update Mode. This is set to Create by default. If the file you specify already exists this will cause an error. Choose Overwrite to overwrite existing files.

Key. This allows you to specify the key for the range map. Choose an input column from the drop-down list. You can specify a composite key by specifying multiple key properties.

Range Map File. Specify the file that is to hold the range map. You can browse for a file or specify a job parameter.

Partitioning on Input Links

The Partitioning tab normally allows you to specify details about how the incoming data is partitioned or collected before it is written to the file or files. In the case of the Write Range Map stage execution is always parallel, so there is never a need to set a collection method. The partition method is set to Range and cannot be overridden.

Because the partition mode is set and cannot be overridden, you cannot use the stage sort facilities, so these are disabled.


11  SAS Data Set Stage

The Parallel SAS Data Set stage is a file stage. It allows you to read data from or write data to a parallel SAS data set in conjunction with an SAS stage. The stage can have a single input link or a single output link. It can be configured to execute in parallel or sequential mode.

DataStage uses a parallel SAS data set to store data being operated on by an SAS stage in a persistent form. A parallel SAS data set is a set of one or more sequential SAS data sets, with a header file specifying the names and locations of all the component files. By convention, the header file has the suffix .psds.

The stage editor has up to three pages, depending on whether you are reading or writing a data set:

• Stage page. This is always present and is used to specify general information about the stage.

• Inputs page. This is present when you are writing to a data set. This is where you specify details about the data set being written to.

• Outputs page. This is present when you are reading from a data set. This is where you specify details about the data set being read from.

Stage Page

The General tab allows you to specify an optional description of the stage. The Advanced page allows you to specify how the stage executes.


Advanced Tab

This tab allows you to specify the following:

• Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the input data is processed by the available nodes as specified in the Configuration file, and by any node constraints specified on the Advanced tab. In Sequential mode the entire data set is processed by the conductor node.

• Preserve partitioning. This is Propagate by default. It adopts Set or Clear from the previous stage. You can explicitly select Set or Clear. Select Set to request that the next stage in the job should attempt to maintain the partitioning.

• Node pool and resource constraints. Select this option to constrain parallel execution to the node pool or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from drop-down lists populated from the Configuration file.

• Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node map. You can define a node map by typing node numbers into the text box or by clicking the browse button to open the Available Nodes dialog box and selecting nodes from there. You are effectively defining a new node pool for this stage (in addition to any node pools defined in the Configuration file).

Inputs Page

The Inputs page allows you to specify details about how the SAS Data Set stage writes data to a data set. The SAS Data Set stage can have only one input link.

The General tab allows you to specify an optional description of the input link. The Properties tab allows you to specify details of exactly what the link does. The Partitioning tab allows you to specify how incoming data is partitioned before being written to the data set. The Columns tab specifies the column definitions of the data.


Details about SAS Data Set stage properties are given in the following sections. See Chapter 3, “Stage Editors,” for a general description of the other tabs.


Input Link Properties

The Properties tab allows you to specify properties for the input link. These dictate how incoming data is written and to what data set. Some of the properties are mandatory, although many have default settings. Properties without default settings appear in the warning color (red by default) and turn black when you supply a value for them.

The following table gives a quick reference list of the properties and their attributes. A more detailed description of each property follows:

Category/Property     Values                                     Default                    Mandatory?  Repeats?  Dependent of
Target/File           pathname                                   N/A                        Y           N         N/A
Target/Update Policy  Append/Create (Error if exists)/Overwrite  Create (Error if exists)   Y           N         N/A

Target Category

File. The name of the control file for the data set. You can browse for the file or enter a job parameter. By convention the file has the suffix .psds.

Update Policy. Specifies what action will be taken if the data set you are writing to already exists. Choose from:

• Append. Append to the existing data set.

• Create (Error if exists). DataStage reports an error if the data set already exists.

• Overwrite. Overwrite any existing data set.

The default is Overwrite.


Partitioning on Input Links

The Partitioning tab allows you to specify details about how the incoming data is partitioned or collected before it is written to the data set. It also allows you to specify that the data should be sorted before being written.

By default the stage partitions in Auto mode. This attempts to work out the best partitioning method depending on execution modes of current and preceding stages, whether the Preserve Partitioning option has been set, and how many nodes are specified in the Configuration file. If the Preserve Partitioning option has been set on the Stage page Advanced tab (see page 11-2) the stage will attempt to preserve the partitioning of the incoming data.

If the SAS Data Set stage is operating in sequential mode, it will first collect the data before writing it to the file using the default Auto collection method.

The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends on:

• Whether the SAS Data Set stage is set to execute in parallel or sequential mode.

• Whether the preceding stage in the job is set to execute in parallel or sequential mode.

If the SAS Data Set stage is set to execute in parallel, then you can set a partitioning method by selecting from the Partitioning mode drop-down list. This will override any current partitioning (even if the Preserve Partitioning option has been set on the Stage page Advanced tab).

If the SAS Data Set stage is set to execute in sequential mode, but the preceding stage is executing in parallel, then you can set a collection method from the Collection type drop-down list. This will override the default auto collection method.

The following partitioning methods are available:

• (Auto). DataStage attempts to work out the best partitioning method depending on execution modes of current and preceding stages, whether the Preserve Partitioning flag has been set on the previous stage in the job, and how many nodes are specified in the Configuration file. This is the default partitioning method for the Parallel SAS Data Set stage.


• Entire. Each file written to receives the entire data set.

• Hash. The records are hashed into partitions based on the value of a key column or columns selected from the Available list.

• Modulus. The records are partitioned using a modulus function on the key column selected from the Available list. This is commonly used to partition on tag fields.

• Random. The records are partitioned randomly, based on the output of a random number generator.

• Round Robin. The records are partitioned on a round robin basis as they enter the stage.

• Same. Preserves the partitioning already in place.

• DB2. Replicates the DB2 partitioning method of a specific DB2 table. Requires extra properties to be set. Access these properties by clicking the properties button.

• Range. Divides a data set into approximately equal size partitions based on one or more partitioning keys. Range partitioning is often a preprocessing step to performing a total sort on a data set. Requires extra properties to be set. Access these properties by clicking the properties button.

The following Collection methods are available:

• (Auto). DataStage attempts to work out the best collection method depending on execution modes of current and preceding stages, and how many nodes are specified in the Configuration file. This is the default collection method for Parallel SAS Data Set stages.

• Ordered. Reads all records from the first partition, then all records from the second partition, and so on.

• Round Robin. Reads a record from the first input partition, then from the second partition, and so on. After reaching the last partition, the operator starts over.

• Sort Merge. Reads records in an order based on one or more columns of the record. This requires you to select a collecting key column from the Available list.


The Partitioning tab also allows you to specify that data arriving on the input link should be sorted before being written to the data set. The sort is always carried out within data partitions. If the stage is partitioning incoming data the sort occurs after the partitioning. If the stage is collecting data, the sort occurs before the collection. The availability of sorting depends on the partitioning method chosen.

Select the check boxes as follows:

• Sort. Select this to specify that data coming in on the link should be sorted. Select the column or columns to sort on from the Available list.

• Stable. Select this if you want to preserve previously sorted data sets. This is the default.

• Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is also set, the first record is retained.

You can also specify sort direction, case sensitivity, and collating sequence for each column in the Selected list by selecting it and right-clicking to invoke the shortcut menu.

Outputs Page

The Outputs page allows you to specify details about how the Parallel SAS Data Set stage reads data from a data set. The Parallel SAS Data Set stage can have only one output link.

The General tab allows you to specify an optional description of the output link. The Properties tab allows you to specify details of exactly what the link does. The Columns tab specifies the column definitions of incoming data.

Details about Data Set stage properties and formatting are given in the following sections. See Chapter 3, “Stage Editors,” for a general description of the other tabs.

Output Link Properties

The Properties tab allows you to specify properties for the output link. These dictate how data is read from the data set. Some of the properties are mandatory, although many have default settings. Properties without default settings appear in the warning color (red by default) and turn black when you supply a value for them.

The following table gives a quick reference list of the properties and their attributes. A more detailed description of each property follows.

Category/Property  Values    Default  Mandatory?  Repeats?  Dependent of
Source/File        pathname  N/A      Y           N         N/A

Source Category

File. The name of the control file for the parallel SAS data set. You can browse for the file or enter a job parameter. The file has the suffix .psds.


12  DB2 Stage

The DB2 stage is a database stage. It allows you to read data from and write data to a DB2 database. It can also be used in conjunction with a Lookup stage to access a lookup table hosted by a DB2 database (see Chapter 20, “Lookup Stage.”)

The DB2 stage can have a single input link and a single output reject link, or a single output link or output reference link.

When you edit a DB2 stage, the DB2 stage editor appears. This is based on the generic stage editor described in Chapter 3, “Stage Editors.”

The stage editor has up to three pages, depending on whether you are reading or writing a database:

• Stage page. This is always present and is used to specify general information about the stage.

• Inputs page. This is present when you are writing to a DB2 database. This is where you specify details about the data being written.

• Outputs page. This is present when you are reading from a DB2 database, or performing a lookup on a DB2 database. This is where you specify details about the data being read.

To use DB2 stages you must have valid accounts and appropriate privileges on the databases to which they connect. The required DB2 privileges are as follows:

• SELECT on any tables to be read.

• INSERT on any existing tables to be updated.

• TABLE CREATE to create any new tables.


• INSERT and TABLE CREATE on any existing tables to be replaced.


• DBADM on any database written by LOAD method.

You can grant this privilege in several ways in DB2. One is to start DB2, connect to a database, and grant DBADM privilege to a user, as shown below:

db2> CONNECT TO db_name
db2> GRANT DBADM ON DATABASE TO USER user_name

where db_name is the name of the DB2 database and user_name is the login name of the DataStage user. If you specify the message file property, the database instance must have read/write privilege on that file.

The user’s PATH should include $DB2_HOME/bin (e.g., /opt/IBMdb2/V7.1/bin). The LD_LIBRARY_PATH should include $DB2_HOME/lib before any other lib statements (e.g., /opt/IBMdb2/V7.1/lib).
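For example, assuming DB2 is installed under /opt/IBMdb2/V7.1, a typical shell setup would be:

    export DB2_HOME=/opt/IBMdb2/V7.1
    export PATH=$DB2_HOME/bin:$PATH
    export LD_LIBRARY_PATH=$DB2_HOME/lib:$LD_LIBRARY_PATH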

The following DB2 environment variables set the run-time characteristics of your system:

• DB2INSTANCE specifies the user name of the owner of the DB2 instance. DB2 uses DB2INSTANCE to determine the location of db2nodes.cfg. For example, if you set DB2INSTANCE to "Mary", the location of db2nodes.cfg is ~Mary/sqllib/db2nodes.cfg.

• DB2DBDFT specifies the name of the DB2 database that you want to access from your DB2 stage.

There are two other methods of specifying the DB2 database:

1. The override database property of the DB2 stage Inputs or Outputs link.

2. The APT_DBNAME environment variable (this takes precedence over DB2DBDFT).

The environment variable APT_RDBMS_COMMIT_ROWS specifies the number of records to insert into a data set between commits. You can set this environment variable to any value between 1 and (2^31 - 1) to specify the number of records.

The default value is 2048. You may find that you can increase your system performance by decreasing the frequency of these commits using the environment variable APT_RDBMS_COMMIT_ROWS.
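For instance, to commit after every 5,000 inserted records (an illustrative value):

    export APT_RDBMS_COMMIT_ROWS=5000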


If you set APT_RDBMS_COMMIT_ROWS to 0, a negative number, or an invalid value, a warning is issued and each partition commits only once after the last insertion.


If you set APT_RDBMS_COMMIT_ROWS to a small value, you force DB2 to perform frequent commits. Therefore, if your program terminates unexpectedly, your data set can still contain partial results that you can use. However, you may pay a performance penalty because of the high frequency of the commits. If you set a large value for APT_RDBMS_COMMIT_ROWS, DB2 must log a correspondingly large amount of rollback information. This, too, may slow your application.

Stage Page

The General tab allows you to specify an optional description of the stage. The Advanced page allows you to specify how the stage executes.

Advanced Tab

This tab allows you to specify the following:

• Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the data is processed by the available nodes as specified in the Configuration file, and by any node constraints specified on the Advanced tab. In Sequential mode the entire write is processed by the conductor node.

• Preserve partitioning. You can select Set or Clear. If you select Set, file read operations will request that the next stage preserves the partitioning as is (the option does not appear if your stage only has an input link).

• Node pool and resource constraints. Select this option to constrain parallel execution to the node pool or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from drop-down lists populated from the Configuration file.

• Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node map. You can define a node map by typing node numbers into the text box or by clicking the browse button to open the Available Nodes dialog box and selecting nodes from there. You are effectively defining a new node pool for this stage (in addition to any node pools defined in the Configuration file).

Inputs Page

The Inputs page allows you to specify details about how the DB2 stage writes data to a DB2 database. The DB2 stage can have only one input link writing to one table.

The General tab allows you to specify an optional description of the input link. The Properties tab allows you to specify details of exactly what the link does. The Partitioning tab allows you to specify how incoming data is partitioned before being written to the database. The Columns tab specifies the column definitions of incoming data.

Details about DB2 stage properties, partitioning, and formatting are given in the following sections. See Chapter 3, “Stage Editors,” for a general description of the other tabs.

Input Link Properties

The Properties tab allows you to specify properties for the input link. These dictate how incoming data is written and where. Some of the properties are mandatory, although many have default settings. Properties without default settings appear in the warning color (red by default) and turn black when you supply a value for them.

The following table gives a quick reference list of the properties and their attributes; each entry shows Values; Default; Mandatory?; Repeats?; Dependent of. A more detailed description of each property follows.

• Target/Table: string; N/A; Y; N; N/A
• Target/Upsert Mode: Auto-generated Update & Insert/Auto-generated Update Only/User-defined Update & Insert/User-defined Update Only; Auto-generated Update & Insert; Y (if Write Method = Upsert); N; N/A
• Target/Insert SQL: string; N/A; Y (if Write Method = Upsert and Upd…); N; N/A
• Target/Update SQL: string; N/A; Y (if Write Method = Upsert); N; N/A
• Target/Write Method: Write/Load/Upsert; Load; Y; N; N/A
• Target/Write Mode: Append/Create/Replace/Truncate; Append; Y; N; N/A
• Connection/Use Database Environment Variable: True/False; True; Y; N; N/A
• Connection/Use Server Environment Variable: True/False; True; Y; N; N/A
• Connection/Override Database: string; N/A; Y (if Use Database Environment Variable = False); N; N/A
• Connection/Override Server: string; N/A; Y (if Use Server Environment Variable = False); N; N/A
• Options/Truncate Column Names: True/False; False; Y; N; N/A
• Options/Silently Drop Columns Not in Table: True/False; False; Y; N; N/A
• Options/Truncation Length: number; 18; N; N; Truncate Column Names
• Options/Close Command: string; N/A; N; N; N/A
• Options/Default String Length: number; 32; N; N; N/A
• Options/Open Command: string; N/A; N; N; N/A
• Options/Use ASCII Delimited Format: True/False; False; Y (if Write Method = Load); N; N/A
• Options/Cleanup on Failure: True/False; False; Y (if Write Method = Load); N; N/A
• Options/Message File: pathname; N/A; N; N; N/A

Target Category

Table. Specify the name of the table to write to. You can specify a job parameter if required.

Upsert Mode. This only appears for the Upsert write method. Allows you to specify how the insert and update statements are to be derived. Choose from:


• Auto-generated Update & Insert. DataStage generates update and insert statements for you, based on the values you have supplied for table name and on column details. The statements can be viewed by selecting the Insert SQL or Update SQL properties.

• Auto-generated Update Only. DataStage generates an update statement for you, based on the values you have supplied for table name and on column details. The statement can be viewed by selecting the Update SQL property.

• User-defined Update & Insert. Select this to enter your own update and insert statements. Then select the Insert SQL and Update SQL properties and edit the statement proformas.

• User-defined Update Only. Select this to enter your own update statement. Then select the Update SQL property and edit the statement proforma.
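For illustration, user-defined statements for a hypothetical table might take the following shape (the table and column names are invented, and the ORCHESTRATE.column placeholder syntax for referring to input columns is an assumption here, not something this chapter confirms):

INSERT INTO customers (custid, custname) VALUES (ORCHESTRATE.custid, ORCHESTRATE.custname)

UPDATE customers SET custname = ORCHESTRATE.custname WHERE custid = ORCHESTRATE.custid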

Insert SQL. Only appears for the Upsert write method. This property allows you to view an auto-generated Insert statement, or to specify your own (depending on the setting of the Upsert Mode property).

Update SQL. Only appears for the Upsert write method. This property allows you to view an auto-generated Update statement, or to specify your own (depending on the setting of the Upsert Mode property).

Write Method. Choose from Write, Upsert, or Load (the default). Load takes advantage of fast DB2 loader technology for writing data to the database. Upsert uses Insert and Update SQL statements to write to the database.

Write Mode. Select from the following:

• Append. This is the default. New records are appended to an existing table.

• Create. Create a new table. If the DB2 table already exists, an error occurs and the job terminates. You must specify this mode if the DB2 table does not exist.


• Replace. The existing table is first dropped and an entirely new table is created in its place. DB2 uses the default partitioning method for the new table.


• Truncate. The existing table attributes (including schema) and the DB2 partitioning keys are retained, but any existing records are discarded. New records are then appended to the table.

Connection Category

Use Server Environment Variable. This is set to True by default, which causes the stage to use the setting of the DB2INSTANCE environment variable to derive the server. If you set this to False, you must specify a value for the Override Server property.

Use Database Environment Variable. This is set to True by default, which causes the stage to use the setting of the environment variable APT_DBNAME, if defined, and DB2DBDFT otherwise to derive the database. If you set this to False, you must specify a value for the Override Database property.

Override Server. Optionally specifies the DB2 instance name for the table. This property appears if you set the Use Server Environment Variable property to False.

Override Database. Optionally specifies the name of the DB2 database to access. This property appears if you set the Use Database Environment Variable property to False.

Options Category

Silently Drop Columns Not in Table. This is False by default. Set to True to silently drop all input columns that do not correspond to columns in an existing DB2 table. Otherwise the stage reports an error and terminates the job.

Truncate Column Names. Select this option to truncate column names to 18 characters. To specify a length other than 18, use the Truncation Length dependent property:

• Truncation Length

This is set to 18 by default. Change it to specify a different truncation length.


Close Command. This is an optional property. Use it to specify any command to be parsed and executed by the DB2 database on all processing nodes after the stage finishes processing the DB2 table. You can specify a job parameter if required.

Default String Length. This is an optional property and is set to 32 by default. Sets the default string length of variable-length strings written to a DB2 table. Variable-length strings longer than the set length cause an error.

The maximum length you can set is 4000 bytes. Note that the stage always allocates the specified number of bytes for a variable-length string, so setting a value of 4000 allocates 4000 bytes for every string. Therefore, you should set this property to the expected maximum length of your largest string and no larger.

Open Command. This is an optional property. Use it to specify any command to be parsed and executed by the DB2 database on all processing nodes before the DB2 table is opened. You can specify a job parameter if required.
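For illustration only, an open command might take a lock and a close command might record an audit row (the table names are invented; any statements your DB2 version accepts could be used instead):

Open command:  LOCK TABLE sales IN EXCLUSIVE MODE
Close command: INSERT INTO load_audit VALUES ('SALES', CURRENT TIMESTAMP)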

Use ASCII Delimited Format. This property only appears if Write Method is set to Load. Specify this option to configure DB2 to use the ASCII-delimited format for loading binary numeric data instead of the default ASCII-fixed format.

This option can be useful when you have variable-length columns, because the database will not have to allocate the maximum amount of storage for each variable-length column. However, all numeric columns are converted to an ASCII format by DB2, which is a CPU-intensive operation. See the DB2 reference manuals for more information.

Cleanup on Failure. This property only appears if Write Method is set to Load. Specify this option to deal with failures during stage execution that leave the tablespace being loaded in an inaccessible state.

The cleanup procedure neither inserts data into the table nor deletes data from it. You must delete rows that were inserted by the failed execution, either through the DB2 command-level interpreter or by running the stage again using the Replace or Truncate write modes.


Message File. This property only appears if Write Method is set to Load. Specifies the file where the DB2 loader writes diagnostic messages. The database instance must have read/write privilege to the file.

Partitioning on Input Links

The Partitioning tab allows you to specify details about how the incoming data is partitioned or collected before it is written to the DB2 database. It also allows you to specify that the data should be sorted before being written.

By default the stage partitions in DB2 mode.

If the DB2 stage is operating in sequential mode, it will first collect the data before writing it to the database using the default Auto collection method.

The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends on:

• Whether the DB2 stage is set to execute in parallel or sequential mode.

• Whether the preceding stage in the job is set to execute in parallel or sequential mode.

If the DB2 stage is set to execute in parallel, then you can set a parti-tioning method by selecting from the Partitioning mode drop-down list. This will override any current partitioning (even if the Preserve Partitioning option has been set on the previous stage in the job).

If the DB2 stage is set to execute in sequential mode, but the preceding stage is executing in parallel, then you can set a collection method from the Collection type drop-down list. This will override the default Auto collection method.

The following partitioning methods are available:

• Entire. Each partition receives the entire data set.

• Hash. The records are hashed into partitions based on the value of a key column or columns selected from the Available list.

• Modulus. The records are partitioned using a modulus function on the key column selected from the Available list. This is commonly used to partition on tag columns.


• Random. The records are partitioned randomly, based on the output of a random number generator.

• Round Robin. The records are partitioned on a round robin basis as they enter the stage.

• Same. Preserves the partitioning already in place.

• DB2. Replicates the DB2 partitioning method of the specified DB2 table. This is the default method for the DB2 stage.

• Range. Divides a data set into approximately equal-size partitions based on one or more partitioning keys. Range partitioning is often a preprocessing step to performing a total sort on a data set. Requires extra properties to be set. Access these properties by clicking the properties button.

The following Collection methods are available:

• (Auto). DataStage attempts to work out the best collection method depending on execution modes of current and preceding stages, and how many nodes are specified in the Configuration file. This is the default collection method for DB2 stages.

• Ordered. Reads all records from the first partition, then all records from the second partition, and so on.

• Round Robin. Reads a record from the first input partition, then from the second partition, and so on. After reaching the last partition, the operator starts over.

• Sort Merge. Reads records in an order based on one or more columns of the record. This requires you to select a collecting key column from the Available list.

The Partitioning tab also allows you to specify that data arriving on the input link should be sorted before being written to the database. The sort is always carried out within data partitions. If the stage is partitioning incoming data the sort occurs after the partitioning. If the stage is collecting data, the sort occurs before the collection. The availability of sorting depends on the partitioning method chosen.

Select the check boxes as follows:


• Sort. Select this to specify that data coming in on the link should be sorted. Select the column or columns to sort on from the Available list.


• Stable. Select this if you want to preserve previously sorted data sets. This is the default.

• Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is also set, the first record is retained.

You can also specify sort direction, case sensitivity, and collating sequence for each column in the Selected list by selecting it and right-clicking to invoke the shortcut menu.

Outputs Page

The Outputs page allows you to specify details about how the DB2 stage reads data from a DB2 database. The DB2 stage can have only one output link. Alternatively it can have a reference output link, which is used by the Lookup stage when referring to a DB2 lookup table. It can also have a reject link where rejected records are routed (used in conjunction with an input link).

The General tab allows you to specify an optional description of the output link. The Properties tab allows you to specify details of exactly what the link does. The Columns tab specifies the column definitions of incoming data.

Details about DB2 stage properties are given in the following sections. See Chapter 3, “Stage Editors,” for a general description of the other tabs.

Output Link Properties

The Properties tab allows you to specify properties for the output link. These dictate how data is read and from which table. Some of the properties are mandatory, although many have default settings. Properties without default settings appear in the warning color (red by default) and turn black when you supply a value for them.



The following table gives a quick reference list of the properties and their attributes; each entry shows Values; Default; Mandatory?; Repeats?; Dependent of. A more detailed description of each property follows.

• Source/Lookup Type: Normal/Sparse; Normal; Y (if output is reference link connected to Lookup stage); N; N/A
• Source/Read Method: Table/Auto-generated SQL/User-defined SQL; Table; Y; N; N/A
• Source/Table: string; N/A; Y (if Read Method = Table); N; N/A
• Source/Where clause: string; N/A; N; N; Table
• Source/Select List: string; N/A; N; N; Table
• Source/Query: string; N/A; Y (if Read Method = Query); N; N/A
• Source/Partition Table: string; N/A; N; N; Query
• Connection/Use Database Environment Variable: True/False; True; Y; N; N/A
• Connection/Use Server Environment Variable: True/False; True; Y; N; N/A
• Connection/Override Server: string; N/A; Y (if Use Server Environment Variable = False); N; N/A
• Connection/Override Database: string; N/A; Y (if Use Database Environment Variable = False); N; N/A
• Options/Close Command: string; N/A; N; N; N/A
• Options/Open Command: string; N/A; N; N; N/A
• Options/Make Combinable: True/False; False; Y (if link is reference and Lookup Type = Sparse); N; N/A

Source Category

Lookup Type. Where the DB2 stage is connected to a Lookup stage via a reference link, this property specifies whether the DB2 stage will provide data for an in-memory lookup (Lookup Type = Normal) or whether the lookup will access the database directly (Lookup Type = Sparse). If the Lookup Type is Normal, the Lookup stage can have multiple reference links. If the Lookup Type is Sparse, the Lookup stage can only have one reference link.

Read Method. Select Table to use the Table property to specify the read (this is the default). Select Auto-generated SQL to have DataStage automatically generate an SQL query based on the columns you have defined and the table you specify in the Table property. Select User-defined SQL to define your own query.

Query. This property is used to contain the SQL query when you choose a Read Method of User-defined SQL or Auto-generated SQL. If you are using Auto-generated SQL you must select a table and specify some column definitions. Any SQL statement can contain joins, views, database links, synonyms, and so on. It has the following dependent option:


• Partition Table

Specifies execution of the query in parallel on the processing nodes containing a partition derived from the named table. If you do not specify this, the stage executes the query sequentially on a single node.

Table. Specifies the name of the DB2 table. The table must exist and you must have SELECT privileges on the table. If your DB2 user name does not correspond to the owner of the specified table, you can prefix it with a table owner in the form:

table_owner.table_name

If you have specified a Read Method of Table, the Table property has two dependent properties:

• Where clause

Allows you to specify a WHERE clause of the SELECT statement to specify the rows of the table to include in or exclude from the read operation. If you do not supply a WHERE clause, all rows are read.

• Select List

Allows you to specify an SQL select list of column names.
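Taken together, a Table read with these properties amounts to a SELECT of the following shape (the owner, table, columns, and WHERE clause are invented for illustration):

SELECT custid, custname FROM sales.customers WHERE region = 'EMEA'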

Connection Category

Use Server Environment Variable. This is set to True by default, which causes the stage to use the setting of the DB2INSTANCE environment variable to derive the server. If you set this to False, you must specify a value for the Override Server property.

Use Database Environment Variable. This is set to True by default, which causes the stage to use the setting of the environment variable APT_DBNAME, if defined, and DB2DBDFT otherwise to derive the database. If you set this to False, you must specify a value for the Override Database property.

Override Server. Optionally specifies the DB2 instance name for the table. This property appears if you set the Use Server Environment Variable property to False.


Override Database. Optionally specifies the name of the DB2 database to access. This property appears if you set the Use Database Environment Variable property to False.

Options Category

Close Command. This is an optional property. Use it to specify any command to be parsed and executed by the DB2 database on all processing nodes after the stage finishes processing the DB2 table. You can specify a job parameter if required.

Open Command. This is an optional property. Use it to specify any command to be parsed and executed by the DB2 database on all processing nodes before the DB2 table is opened. You can specify a job parameter if required.

Make Combinable. Only applies to reference links where the Lookup Type property has been set to sparse. Set to True to specify that the lookup can be combined with its preceding and/or following process.



13
Oracle Stage

The Oracle stage is a database stage. It allows you to read data from and write data to an Oracle database. It can also be used in conjunction with a Lookup stage to access a lookup table hosted by an Oracle database (see Chapter 20, "Lookup Stage.")

The Oracle stage can have a single input link and a single reject link, or a single output link or output reference link.

When you edit an Oracle stage, the Oracle stage editor appears. This is based on the generic stage editor described in Chapter 3, "Stage Editors."

The stage editor has up to three pages, depending on whether you are reading or writing a database:

• Stage page. This is always present and is used to specify general information about the stage.

• Inputs page. This is present when you are writing to an Oracle database. This is where you specify details about the data being written.

• Outputs page. This is present when you are reading from an Oracle database, or performing a lookup on an Oracle database. This is where you specify details about the data being read.

You need to be running Oracle 8 or later, Enterprise Edition, in order to use the Oracle stage.

You must also do the following:

1. Create the user-defined environment variable ORACLE_HOME and set this to the $ORACLE_HOME path (e.g., /disk3/oracle9i).

2. Create the user-defined environment variable ORACLE_SID and set this to the correct service name (e.g., ODBCSOL).

3. Add ORACLE_HOME/bin to your PATH and ORACLE_HOME/lib to your LIBPATH, LD_LIBRARY_PATH, or SHLIB_PATH.

4. Have login privileges to Oracle using a valid Oracle user name and corresponding password. These must be recognized by Oracle before you attempt to access it.

5. Have SELECT privilege on:

• DBA_EXTENTS
• DBA_DATA_FILES
• DBA_TAB_PARTITIONS
• DBA_OBJECTS
• ALL_PART_INDEXES
• ALL_PART_TABLES
• ALL_INDEXES
• SYS.GV_$INSTANCE (only if Oracle Parallel Server is used)

Note: APT_ORCHHOME/bin must appear before ORACLE_HOME/bin in your PATH.
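As a sketch, the setup described in steps 1 to 3 might look like this in a Bourne-style shell profile (the paths repeat the examples above; whether you set LD_LIBRARY_PATH, LIBPATH, or SHLIB_PATH depends on your platform):

ORACLE_HOME=/disk3/oracle9i; export ORACLE_HOME
ORACLE_SID=ODBCSOL; export ORACLE_SID
PATH=$APT_ORCHHOME/bin:$ORACLE_HOME/bin:$PATH; export PATH
LD_LIBRARY_PATH=$ORACLE_HOME/lib:$LD_LIBRARY_PATH; export LD_LIBRARY_PATH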

We suggest that you create a role that has the appropriate SELECT privileges, as follows:

CREATE ROLE DSXE;
GRANT SELECT on sys.dba_extents to DSXE;
GRANT SELECT on sys.dba_data_files to DSXE;
GRANT SELECT on sys.dba_tab_partitions to DSXE;
GRANT SELECT on sys.dba_objects to DSXE;
GRANT SELECT on sys.all_part_indexes to DSXE;
GRANT SELECT on sys.all_part_tables to DSXE;
GRANT SELECT on sys.all_indexes to DSXE;

Once the role is created, grant it to users who will run DataStage jobs, as follows:

GRANT DSXE to <oracle userid>;

Stage Page

The General tab allows you to specify an optional description of the stage. The Advanced page allows you to specify how the stage executes.


Advanced Tab

This tab allows you to specify the following:

• Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the data is processed by the available nodes as specified in the Configuration file, and by any node constraints specified on the Advanced tab. In Sequential mode the entire write is processed by the conductor node.

• Preserve partitioning. You can select Set or Clear. If you select Set, read operations will request that the next stage preserves the partitioning as is (it is ignored for write operations).

• Node pool and resource constraints. Select this option to constrain parallel execution to the node pool or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from drop-down lists populated from the Configuration file.

• Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node map. You can define a node map by typing node numbers into the text box or by clicking the browse button to open the Available Nodes dialog box and selecting nodes from there. You are effectively defining a new node pool for this stage (in addition to any node pools defined in the Configuration file).

Inputs Page

The Inputs page allows you to specify details about how the Oracle stage writes data to an Oracle database. The Oracle stage can have only one input link writing to one table.

The General tab allows you to specify an optional description of the input link. The Properties tab allows you to specify details of exactly what the link does. The Partitioning tab allows you to specify how incoming data is partitioned before being written to the database. The Columns tab specifies the column definitions of incoming data.

Details about Oracle stage properties, partitioning, and formatting are given in the following sections. See Chapter 3, "Stage Editors," for a general description of the other tabs.


Input Link Properties

The Properties tab allows you to specify properties for the input link. These dictate how incoming data is written and where. Some of the properties are mandatory, although many have default settings. Properties without default settings appear in the warning color (red by default) and turn black when you supply a value for them.

The following table gives a quick reference list of the properties and their attributes; each entry shows Values; Default; Mandatory?; Repeats?; Dependent of. A more detailed description of each property follows.

• Target/Table: string; N/A; Y (if Write Method = Load); N; N/A
• Target/Upsert method: Auto-generated Update & Insert/Auto-generated Update Only/User-defined Update & Insert/User-defined Update Only; Auto-generated Update & Insert; Y (if Write Method = Upsert); N; N/A
• Target/Insert SQL: string; N/A; N; N; N/A
• Target/Insert Array Size: number; 500; N; N; Insert SQL
• Target/Update SQL: string; N/A; Y (if Write Method = Upsert); N; N/A
• Target/Write Method: Upsert/Load; Load; Y; N; N/A
• Target/Write Mode: Append/Create/Replace/Truncate; Append; Y (if Write Method = Load); N; N/A
• Connection/DB Options: string; N/A; Y; N; N/A
• Connection/DB Options Mode: Auto-generate/User-defined; Auto-generate; Y; N; N/A
• Connection/User: string; N/A; Y (if DB Options Mode = Auto-generate); N; DB Options Mode
• Connection/Password: string; N/A; Y (if DB Options Mode = Auto-generate); N; DB Options Mode
• Connection/Remote Server: string; N/A; N; N; N/A
• Options/Output Reject Records: True/False; False; Y (if Write Method = Upsert); N; N/A
• Options/Silently Drop Columns Not in Table: True/False; False; Y (if Write Method = Load); N; N/A
• Options/Truncate Column Names: True/False; False; Y (if Write Method = Load); N; N/A
• Options/Close Command: string; N/A; N; N; N/A
• Options/Default String Length: number; 32; N; N; N/A
• Options/Index Mode: Maintenance/Rebuild; N/A; N; N; N/A
• Options/Add NOLOGGING clause to Index rebuild: True/False; False; N; N; Index Mode
• Options/Add COMPUTE STATISTICS clause to Index rebuild: True/False; False; N; N; Index Mode
• Options/Open Command: string; N/A; N; N; N/A
• Options/Oracle 8 Partition: string; N/A; N; N; N/A

Target Category

Table. This only appears for the Load Write Method. Specify the name of the table to write to. You can specify a job parameter if required.

Upsert method. This only appears for the Upsert write method. Allows you to specify how the insert and update statements are to be derived. Choose from:

• Auto-generated Update & Insert. DataStage generates update and insert statements for you, based on the values you have supplied for table name and on column details. The statements can be viewed by selecting the Insert SQL or Update SQL properties.

• Auto-generated Update Only. DataStage generates an update statement for you, based on the values you have supplied for table name and on column details. The statement can be viewed by selecting the Update SQL property.


• User-defined Update & Insert. Select this to enter your own update and insert statements. Then select the Insert SQL and Update SQL properties and edit the statement proformas.

• User-defined Update Only. Select this to enter your own update statement. Then select the Update SQL property and edit the state-ment proforma.

Insert SQL. Only appears for the Upsert write method. This property allows you to view an auto-generated Insert statement, or to specify your own (depending on the setting of the Upsert method property). It has a dependent property:

• Insert Array Size

Specify the size of the insert host array. The default size is 500 records. If you want each insert statement to be executed individually, specify 1 for this property.

Update SQL. Only appears for the Upsert write method. This property allows you to view an auto-generated Update statement, or to specify your own (depending on the setting of the Upsert method property).

Write Method. Choose from Upsert or Load (the default). Upsert allows you to provide the insert and update SQL statements and uses Oracle host-array processing to optimize the performance of inserting records. Load sets up a connection to Oracle and inserts records into a table, taking a single input data set. The Write Mode property determines how the records of a data set are inserted into the table.

Write Mode. This only appears for the Load Write Method. Select from the following:

• Append. This is the default. New records are appended to an existing table.

• Create. Create a new table. If the Oracle table already exists, an error occurs and the job terminates. You must specify this mode if the Oracle table does not exist.

• Replace. The existing table is first dropped and an entirely new table is created in its place. Oracle uses the default partitioning method for the new table.


• Truncate. The existing table attributes (including schema) and the Oracle partitioning keys are retained, but any existing records are discarded. New records are then appended to the table.

Connection Category

DB Options. Specify a user name and password for connecting to Oracle in the form:

user=<user>,password=<password>[,arraysize=<num_records>]

Arraysize is only relevant to the Upsert Write Method.
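For example, with invented credentials:

user=scott,password=tiger,arraysize=500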

DB Options Mode. If you select Auto-generate for this property, DataStage will create a DB Options string for you. If you select User-defined, you have to edit the DB Options property yourself. When Auto-generate is selected, there are two dependent properties:

• User

The user name to use in the auto-generated DB options string.

• Password

The password to use in the auto-generated DB options string.

Remote Server. This is an optional property. Allows you to specify a remote server name.

Options Category

Output Reject Records. This only appears for the Upsert write method. It is False by default; set it to True to send rejected records to the reject link.

Silently Drop Columns Not in Table. This only appears for the Load Write Method. It is False by default. Set to True to silently drop all input columns that do not correspond to columns in an existing Oracle table. Otherwise the stage reports an error and terminates the job.

Truncate Column Names. This only appears for the Load Write Method. Set this property to True to truncate column names to 30 characters.


Close Command. This is an optional property and only appears for the Load Write Method. Use it to specify any command, in single quotes, to be parsed and executed by the Oracle database on all processing nodes after the stage finishes processing the Oracle table. You can specify a job parameter if required.

Default String Length. This is an optional property and only appears for the Load Write Method. It is set to 32 by default. Sets the default string length of variable-length strings written to an Oracle table. Variable-length strings longer than the set length cause an error.

The maximum length you can set is 2000 bytes. Note that the stage always allocates the specified number of bytes for a variable-length string. In this case, setting a value of 2000 allocates 2000 bytes for every string. Therefore, you should set the expected maximum length of your largest string and no larger.

Index Mode. This is an optional property and only appears for the Load Write Method. Lets you perform a direct parallel load on an indexed table without first dropping the index. You can choose either Maintenance or Rebuild mode. The Index Mode property only applies to the Append and Truncate Write Modes.

Rebuild skips index updates during table load and instead rebuilds the indexes after the load is complete using the Oracle alter index rebuild command. The table must contain an index, and the indexes on the table must not be partitioned. The Rebuild option has two dependent properties:

• Add NOLOGGING clause to Index rebuild

This is False by default. Set True to add a NOLOGGING clause.

• Add COMPUTE STATISTICS clause to Index rebuild

This is False by default. Set True to add a COMPUTE STATISTICS clause.

Maintenance results in each table partition's being loaded sequentially. Because of the sequential load, the table index that exists before the table is loaded is maintained after the table is loaded. The table must contain an index and be partitioned, and the index on the table must be a local range-partitioned index that is partitioned according to the same range values that were used to partition the table. Note that in this case sequential means sequential per partition; that is, the degree of parallelism is equal to the number of partitions.


Open Command. This is an optional property and only appears for the Load Write Method. Use it to specify any command, in single quotes, to be parsed and executed by the Oracle database on all processing nodes before the Oracle table is opened. You can specify a job parameter if required.

Oracle 8 Partition. This is an optional property and only appears for the Load Write Method. Name of the Oracle 8 table partition that records will be written to. The stage assumes that the data provided is for the partition specified.

Partitioning on Input Links

The Partitioning tab allows you to specify details about how the incoming data is partitioned or collected before it is written to the Oracle database. It also allows you to specify that the data should be sorted before being written.

By default the stage partitions in Auto mode. This attempts to work out the best partitioning method depending on execution modes of current and preceding stages, whether the Preserve Partitioning option has been set, and how many nodes are specified in the Configuration file. If the Preserve Partitioning option has been set on the Stage page Advanced tab (see page 13-3) the stage will attempt to preserve the partitioning of the incoming data.

If the Oracle stage is operating in sequential mode, it will first collect the data before writing it to the database using the default Auto collection method.

The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends on:

• Whether the Oracle stage is set to execute in parallel or sequential mode.

• Whether the preceding stage in the job is set to execute in parallel or sequential mode.

If the Oracle stage is set to execute in parallel, then you can set a partitioning method by selecting from the Partitioning mode drop-down list. This will override any current partitioning (even if the Preserve Partitioning option has been set on the Stage page Advanced tab).


If the Oracle stage is set to execute in sequential mode, but the preceding stage is executing in parallel, then you can set a collection method from the Collection type drop-down list. This will override the default Auto collection method.

The following partitioning methods are available:

• (Auto). DataStage attempts to work out the best partitioning method depending on execution modes of current and preceding stages, whether the Preserve Partitioning option has been set, and how many nodes are specified in the Configuration file. This is the default partitioning method for the Oracle stage.

• Entire. Each partition receives the entire data set.

• Hash. The records are hashed into partitions based on the value of a key column or columns selected from the Available list.

• Modulus. The records are partitioned using a modulus function on the key column selected from the Available list. This is commonly used to partition on tag fields.

• Random. The records are partitioned randomly, based on the output of a random number generator.

• Round Robin. The records are partitioned on a round robin basis as they enter the stage.

• Same. Preserves the partitioning already in place. This is the default for Oracle stages.

• DB2. Replicates the DB2 partitioning method of the specified DB2 table.

• Range. Divides a data set into approximately equal-size partitions based on one or more partitioning keys. Range partitioning is often a preprocessing step to performing a total sort on a data set. Requires extra properties to be set. Access these properties by clicking the properties button.

The following Collection methods are available:

• (Auto). DataStage attempts to work out the best collection method depending on execution modes of current and preceding stages, and how many nodes are specified in the Configuration file. This is the default collection method for Oracle stages.


• Ordered. Reads all records from the first partition, then all records from the second partition, and so on.


• Round Robin. Reads a record from the first input partition, then from the second partition, and so on. After reaching the last partition, the operator starts over.

• Sort Merge. Reads records in an order based on one or more columns of the record. This requires you to select a collecting key column from the Available list.

The Partitioning tab also allows you to specify that data arriving on the input link should be sorted before being written to the database. The sort is always carried out within data partitions. If the stage is partitioning incoming data the sort occurs after the partitioning. If the stage is collecting data, the sort occurs before the collection. The availability of sorting depends on the partitioning method chosen.

Select the check boxes as follows:

• Sort. Select this to specify that data coming in on the link should be sorted. Select the column or columns to sort on from the Available list.

• Stable. Select this if you want to preserve previously sorted data sets. This is the default.

• Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is also set, the first record is retained.

You can also specify sort direction, case sensitivity, and collating sequence for each column in the Selected list by selecting it and right-clicking to invoke the shortcut menu.

Outputs Page

The Outputs page allows you to specify details about how the Oracle stage reads data from an Oracle database. The Oracle stage can have only one output link. Alternatively it can have a reference output link, which is used by the Lookup stage when referring to an Oracle lookup table. It can also have a reject link where rejected records are routed (used in conjunction with an input link). The Output Name drop-down list allows you to choose whether you are looking at details of the main output link or the reject link.


The General tab allows you to specify an optional description of the output link. The Properties tab allows you to specify details of exactly what the link does. The Columns tab specifies the column definitions of incoming data.

Details about Oracle stage properties are given in the following sections. See Chapter 3, “Stage Editors,” for a general description of the other tabs.

Output Link Properties

The Properties tab allows you to specify properties for the output link. These dictate how data is read and from which table. Some of the properties are mandatory, although many have default settings. Properties without default settings appear in the warning color (red by default) and turn black when you supply a value for them.

The following table gives a quick reference list of the properties and their attributes; each entry shows Values; Default; Mandatory?; Repeats?; Dependent of. A more detailed description of each property follows.

• Source/Lookup Type: Normal/Sparse; Normal; Y (if output is reference link connected to Lookup stage); N; N/A
• Source/Read Method: Table/Query; Table; Y; N; N/A
• Source/Table: string; N/A; N; N; N/A
• Source/Where: string; N/A; N; N; Table
• Source/Select List: string; N/A; N; N; Table
• Source/Query: string; N/A; N; N; N/A
• Source/Partition Table: string; N/A; N; N; Query
• Connection/DB Options: string; N/A; Y; N; N/A
• Connection/DB Options Mode: Auto-generate/User-defined; Auto-generate; Y; N; N/A
• Connection/User: string; N/A; Y (if DB Options Mode = Auto-generate); N; DB Options Mode
• Connection/Password: string; N/A; Y (if DB Options Mode = Auto-generate); N; DB Options Mode
• Connection/Remote Server: string; N/A; N; N; N/A
• Options/Close Command: True/false; False; Y (for reference links); N; N/A
• Options/Close Command: string; N/A; N; N; N/A
• Options/Open Command: string; N/A; N; N; N/A
• Options/Make Combinable: True/False; False; Y (if link is reference and Lookup Type = Sparse); N; N/A

Source Category

Lookup Type. Where the Oracle stage is connected to a Lookup stage via a reference link, this property specifies whether the Oracle stage will provide data for an in-memory lookup (Lookup Type = Normal) or whether the lookup will access the database directly (Lookup Type = Sparse). If the Lookup Type is Normal, the Lookup stage can have multiple reference links. If the Lookup Type is Sparse, the Lookup stage can only have one reference link.

Read Method. This property specifies whether you are specifying a table or a query when reading the Oracle database.


Query. Optionally allows you to specify an SQL query to read a table. The query specifies the table and the processing that you want to perform on the table as it is read by the stage. This statement can contain joins, views, database links, synonyms, and so on. It has the following dependent option:

Partition Table. This only appears for stream links. Specifies execution of the SELECT in parallel on the processing nodes containing a partition derived from the named table. If you do not specify this, the stage executes the query sequentially on a single node.

Table. Specifies the name of the Oracle table. The table must exist and you must have SELECT privileges on the table. If your Oracle user name does not correspond to the owner of the specified table, you can prefix it with a table owner in the form:

table_owner.table_name

Table has dependent properties:

• Where

Stream links only. Specifies a WHERE clause of the SELECT statement to specify the rows of the table to include in or exclude from the read operation. If you do not supply a WHERE clause, all rows are read.

• Select List

Optionally specifies an SQL select list, enclosed in single quotes, that can be used to determine which columns are read. You must specify the columns in the list in the same order as the columns are defined in the record schema of the input table.
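As an illustration, the SELECT resulting from a Table read with a select list and WHERE clause, or an equivalent user-defined query, might look like this (the owner, table, columns, and clause are invented):

SELECT custid, custname FROM scott.customers WHERE region = 'EMEA'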

Connection Category

DB Options. Specify a user name and password for connecting to Oracle in the form:

user=<user>,password=<password>[,arraysize=<num_records>]

Arraysize only applies to stream links. The default arraysize is 1000.


DB Options Mode. If you select Auto-generate for this property, DataStage will create a DB Options string for you. If you select User-defined, you have to edit the DB Options property yourself. When Auto-generate is selected, there are two dependent properties:

• User

The user name to use in the auto-generated DB options string.

• Password

The password to use in the auto-generated DB options string.

Remote Server. This is an optional property. Allows you to specify a remote server name.

Options Category

Close Command. This is an optional property and only appears for stream links. Use it to specify any command to be parsed and executed by the Oracle database on all processing nodes after the stage finishes processing the Oracle table. You can specify a job parameter if required.

Open Command. This is an optional property and only appears for stream links. Use it to specify any command to be parsed and executed by the Oracle database on all processing nodes before the Oracle table is opened. You can specify a job parameter if required.

Make Combinable. Only applies to reference links where the Lookup Type property has been set to Sparse. Set to True to specify that the lookup can be combined with its preceding and/or following process.



14
Teradata Stage

The Teradata stage is a database stage. It allows you to read data from and write data to a Teradata database.

The Teradata stage can have a single input link or a single output link.

When you edit a Teradata stage, the Teradata stage editor appears. This is based on the generic stage editor described in Chapter 3, "Stage Editors."

The stage editor has up to three pages, depending on whether you are reading or writing a database:

• Stage page. This is always present and is used to specify general information about the stage.

• Inputs page. This is present when you are writing to a Teradata database. This is where you specify details about the data being written.

• Outputs page. This is present when you are reading from a Teradata database. This is where you specify details about the data being read.

There are no special steps you need in order to ensure that the Teradata stage can communicate with Teradata, other than ensuring that you have /usr/lib in your path.

Stage Page

The General tab allows you to specify an optional description of the stage. The Advanced page allows you to specify how the stage executes.


Advanced Tab

This tab allows you to specify the following:


• Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the data is processed by the available nodes as specified in the Configuration file, and by any node constraints specified on the Advanced tab. In Sequential mode the entire write is processed by the conductor node.

• Preserve partitioning. You can select Set or Clear. If you select Set, read operations will request that the next stage preserves the partitioning as is (the Preserve partitioning field is not visible unless the stage has an output link).

• Node pool and resource constraints. Select this option to constrain parallel execution to the node pool or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from drop-down lists populated from the Configuration file.

• Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node map. You can define a node map by typing node numbers into the text box or by clicking the browse button to open the Available Nodes dialog box and selecting nodes from there. You are effectively defining a new node pool for this stage (in addition to any node pools defined in the Configuration file).

Inputs Page

The Inputs page allows you to specify details about how the Teradata stage writes data to a Teradata database. The Teradata stage can have only one input link writing to one table.

The General tab allows you to specify an optional description of the input link. The Properties tab allows you to specify details of exactly what the link does. The Partitioning tab allows you to specify how incoming data is partitioned before being written to the database. The Columns tab specifies the column definitions of incoming data.

Details about Teradata stage properties, partitioning, and formatting are given in the following sections. See Chapter 3, “Stage Editors,” for a general description of the other tabs.



Input Link Properties

The Properties tab allows you to specify properties for the input link. These dictate how incoming data is written and where. Some of the properties are mandatory, although many have default settings. Properties without default settings appear in the warning color (red by default) and turn black when you supply a value for them.

The following table gives a quick reference list of the properties and their attributes; each entry shows Values; Default; Mandatory?; Repeats?; Dependent of. A more detailed description of each property follows.

• Target/Table: table name; N/A; Y; N; N/A
• Target/Primary Index: columns list; N/A; N; N; Table
• Target/Select List: list; N/A; N; N; Table
• Target/Write Mode: Append/Create/Replace/Truncate; Append; Y; N; N/A
• Connection/DB Options: string; N/A; Y; N; N/A
• Connection/Database: database name; N/A; N; N; N/A
• Connection/Server: server name; N/A; Y; N; N/A
• Options/Close Command: close command; N/A; N; N; N/A
• Options/Open Command: open command; N/A; N; N; N/A
• Options/Silently Drop Columns Not in Table: True/False; False; Y; N; N/A
• Options/Default String Length: string length; 32; N; N; N/A
• Options/Truncate Column Names: True/False; False; Y; N; N/A
• Options/Progress Interval: number; 100000; N; N; N/A


Target Category

Table. Specify the name of the table to write to. The table name must be a valid Teradata table name. Table has two dependent properties:

• Select List

Specifies a list that determines which columns are written. If you do not supply the list, the Teradata stage writes to all columns. Do not include formatting characters in the list.

• Primary Index

Specify a comma-separated list of column names that will become the primary index for tables. Format the list according to Teradata standards and enclose it in single quotes.

For performance reasons, the data set should not be sorted on the primary index. The primary index should not be a smallint, or a column with a small number of values, or a high proportion of null values. If no primary index is specified, the first column is used. All the considerations noted above apply to this case as well.
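For example, for a hypothetical orders table you might set the two dependent properties as follows (the column names are invented; note the single quotes around the index list):

Select List:   custid, orderdate, amount
Primary Index: 'custid'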

Connection Category

DB Options. Specify a user name and password for connecting to Teradata in the form:

user=<user>,password=<password>[,arraysize=<num_records>]

DB Options Mode. If you select Auto-generate for this property, DataStage will create a DB Options string for you. If you select User-defined, you have to edit the DB Options property yourself. When Auto-generate is selected, there are two dependent properties:

• User

The user name to use in the auto-generated DB options string.

• Password

The password to use in the auto-generated DB options string.


Database. By default, the write operation is carried out in the default database of the Teradata user whose profile is used. If no default database is specified in that user's Teradata profile, the user name is the default database. If you supply the database name, the database to which it refers must exist and you must have the necessary privileges.

Server. Specify the name of a Teradata server.

Options Category

Close Command. Specify a Teradata command, in single quotes, to be parsed and executed by Teradata on all processing nodes after the table has been populated.

Open Command. Specify a Teradata command to be parsed and executed by Teradata on all processing nodes before the table is populated.

Silently Drop Columns Not in Table. Specifying True causes the stage to silently drop all unmatched input columns; otherwise the job fails.

Write Mode. Select from the following:

• Append. Appends new records to the table. The database user must have TABLE CREATE privileges and INSERT privileges on the table being written to. This is the default.

• Create. Creates a new table. The database user must have TABLE CREATE privileges. If a table exists of the same name as the one you want to create, the data flow that contains the Teradata stage terminates in error.

• Replace. Drops the existing table and creates a new one in its place; the database user must have TABLE CREATE and TABLE DELETE privileges. If a table exists of the same name as the one you want to create, it is overwritten.

• Truncate. Retains the table attributes, including the table definition, but discards existing records and appends new ones. The database user must have DELETE and INSERT privileges on the table.

Default String Length. Specify the maximum length of variable-length raw or string columns. The default length is 32 bytes. The upper bound is slightly less than 32 KB.


Truncate Column Names. Specify whether the column names should be truncated to 30 characters or not.


Progress Interval. By default, the stage displays a progress message for every 100,000 records per partition it processes. Specify this option either to change the interval or to disable the message. To change the interval, specify a new number of records per partition. To disable the messages, specify 0.

Partitioning on Input Links

The Partitioning tab allows you to specify details about how the incoming data is partitioned or collected before it is written to the Teradata database. It also allows you to specify that the data should be sorted before being written.

By default the stage partitions in Auto mode. This attempts to work out the best partitioning method depending on execution modes of current and preceding stages, whether the Preserve Partitioning option has been set, and how many nodes are specified in the Configuration file. If the Preserve Partitioning option has been set on the Stage page Advanced tab (see page 14-1) the stage will attempt to preserve the partitioning of the incoming data.

If the Teradata stage is operating in sequential mode, it will first collect the data before writing it to the database using the default Auto collection method.

The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends on:

• Whether the Teradata stage is set to execute in parallel or sequential mode.

• Whether the preceding stage in the job is set to execute in parallel or sequential mode.

If the Teradata stage is set to execute in parallel, then you can set a partitioning method by selecting from the Partitioning mode drop-down list. This will override any current partitioning (even if the Preserve Partitioning option has been set on the Stage page Advanced tab).

If the Teradata stage is set to execute in sequential mode, but the preceding stage is executing in parallel, then you can set a collection method from the Collection type drop-down list. This will override the default collection method.


The following partitioning methods are available:


• (Auto). DataStage attempts to work out the best partitioning method depending on execution modes of current and preceding stages, whether the Preserve Partitioning option has been set, and how many nodes are specified in the Configuration file. This is the default partitioning method for the Teradata stage.

• Entire. Each partition receives the entire data set.

• Hash. The records are hashed into partitions based on the value of a key column or columns selected from the Available list.

• Modulus. The records are partitioned using a modulus function on the key column selected from the Available list. This is commonly used to partition on tag columns.

• Random. The records are partitioned randomly, based on the output of a random number generator.

• Round Robin. The records are partitioned on a round robin basis as they enter the stage.

• Same. Preserves the partitioning already in place.

• Range. Divides a data set into approximately equal size partitions based on one or more partitioning keys. Range partitioning is often a preprocessing step to performing a total sort on a data set. Requires extra properties to be set. Access these properties by clicking the properties button.
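The hash, modulus, and round robin methods just listed can be pictured with a short sketch. This is a minimal Python illustration of the general techniques, assuming four partitions and made-up key values; it is not DataStage's actual implementation:

    # Minimal sketch of three partitioning schemes, assuming 4 partitions.
    NUM_PARTITIONS = 4

    def hash_partition(key):
        # Hash partitioning: hash the key and map the hash to a partition,
        # so all rows with the same key land in the same partition.
        # (Python salts string hashes, so the value varies between runs.)
        return hash(key) % NUM_PARTITIONS

    def modulus_partition(key):
        # Modulus partitioning assumes a numeric key (e.g. a tag column):
        # the partition is simply key mod number-of-partitions.
        return key % NUM_PARTITIONS

    def round_robin_assign(records):
        # Round robin ignores record content; rows are dealt out in turn.
        return [(i % NUM_PARTITIONS, rec) for i, rec in enumerate(records)]

    print(modulus_partition(27))                   # 27 mod 4 = partition 3
    print(round_robin_assign(["r1", "r2", "r3"]))  # partitions 0, 1, 2

Note that hash and modulus keep all rows with a given key value together, which matters for later key-based processing; round robin balances row counts across partitions but gives no such guarantee.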

The following Collection methods are available:

• (Auto). DataStage attempts to work out the best collection method depending on execution modes of current and preceding stages, and how many nodes are specified in the Configuration file. This is the default collection method for Teradata stages.

• Ordered. Reads all records from the first partition, then all records from the second partition, and so on.

• Round Robin. Reads a record from the first input partition, then from the second partition, and so on. After reaching the last partition, the operator starts over.

• Sort Merge. Reads records in an order based on one or more columns of the record. This requires you to select a collecting key column from the Available list.
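As an illustration of the sort merge method, the following minimal Python sketch (made-up partitions and key values, not DataStage code) merges partitions that are each already sorted on the collecting key into one sorted stream:

    import heapq

    # Each partition is assumed to be already sorted on the collecting key.
    partitions = [
        [("ADAMS", 100), ("JONES", 200)],
        [("BROWN", 150), ("SMITH", 300)],
        [("CLARK", 250)],
    ]

    # Sort merge collection: repeatedly take the record with the smallest
    # key across the partition heads, preserving overall sorted order.
    for record in heapq.merge(*partitions, key=lambda rec: rec[0]):
        print(record)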


The Partitioning tab also allows you to specify that data arriving on the input link should be sorted before being written to the database. The sort is always carried out within data partitions. If the stage is partitioning incoming data the sort occurs after the partitioning. If the stage is collecting data, the sort occurs before the collection. The availability of sorting depends on the partitioning method chosen.

Select the check boxes as follows:

• Sort. Select this to specify that data coming in on the link should be sorted. Select the column or columns to sort on from the Available list.

• Stable. Select this if you want to preserve previously sorted data sets. This is the default.

• Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is also set, the first record is retained.

You can also specify sort direction, case sensitivity, and collating sequence for each column in the Selected list by selecting it and right-clicking to invoke the shortcut menu.
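The interaction of the Stable and Unique options can be pictured with a small sketch (hypothetical rows; DataStage performs the equivalent within each partition):

    # Hypothetical rows of (sort_key, payload). Python's sort is stable,
    # so rows with equal keys keep their original relative order -- the
    # effect of the Stable option.
    rows = [(10, "a"), (5, "b"), (10, "c"), (5, "d")]
    rows.sort(key=lambda r: r[0])   # (5,"b"), (5,"d"), (10,"a"), (10,"c")

    # Unique: keep only the first record seen for each distinct key value.
    seen, unique_rows = set(), []
    for key, payload in rows:
        if key not in seen:
            seen.add(key)
            unique_rows.append((key, payload))
    print(unique_rows)              # [(5, "b"), (10, "a")]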

Outputs Page

The Outputs page allows you to specify details about how the Teradata stage reads data from a Teradata database. The Teradata stage can have only one output link.

The General tab allows you to specify an optional description of the output link. The Properties tab allows you to specify details of exactly what the link does. The Columns tab specifies the column definitions of incoming data.

Details about Teradata stage properties are given in the following sections. See Chapter 3, “Stage Editors,” for a general description of the other tabs.

Output Link Properties

The Properties tab allows you to specify properties for the output link. These dictate how data is read, and from which table. Some of the properties are mandatory, although many have default settings. Properties without default settings appear in the warning color (red by default) and turn black when you supply a value for them.


The following table gives a quick reference list of the properties and their attributes. A more detailed description of each property follows.

Category/Property | Values | Default | Mandatory? | Repeats? | Dependent of
Source/Read Method | Table/Auto-generated SQL/User-defined SQL | Table | Y | N | N/A
Source/Table | Table Name | N/A | Y (if Read Method = Table or Auto-generated SQL) | N | N/A
Source/Select List | List | N/A | N | N | Table
Source/Where Clause | Filter | N/A | N | N | Table
Source/Query | SQL query | N/A | Y (if Read Method = User-defined SQL or Auto-generated SQL) | N | N/A
Connection/DB Options | String | N/A | Y | N | N/A
Connection/Database | Database Name | N/A | N | N | N/A
Connection/Server | Server Name | N/A | Y | N | N/A
Options/Close Command | String | N/A | N | N | N/A
Options/Open Command | String | N/A | N | N | N/A
Options/Progress Interval | Number | 100000 | N | N | N/A


Source Category

Read Method. Select Table to use the Table property to specify the read (this is the default). Select Auto-generated SQL to have DataStage automatically generate an SQL query based on the columns you have defined and the table you specify in the Table property. You must select the Query property and select Generate from the right-arrow menu to actually generate the statement. Select User-defined SQL to define your own query.

Table. Specifies the name of the Teradata table to read from. The table must exist, and the user must have the necessary privileges to read it.

The Teradata stage reads the entire table, unless you limit its scope by means of the Select List and/or Where suboptions:

• Select List

Specifies a list of columns to read. The items of the list must appear in the same order as the columns of the table.

• Where Clause

Specifies selection criteria to be used as part of an SQL statement's WHERE clause. Do not include formatting characters in the query.

These dependent properties are only available when you have specified a Read Method of Table rather than Auto-generated SQL.

Query. This property is used to contain the SQL query when you choose a Read Method of User-defined SQL or Auto-generated SQL. If you are using Auto-generated SQL you must select a table and specify some column definitions, then select Generate from the right-arrow menu to have DataStage generate the query.
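For example, given a hypothetical table accounts, a Select List of custid, balance, and a Where Clause of balance > 0, the generated statement would take roughly this form:

    SELECT custid, balance FROM accounts WHERE balance > 0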

Connection Category

DB Options. Specify a user name and password for connecting to Teradata in the form:

user = <user>, password = <password> [, arraysize = <num_records>]


The default arraysize is 1000.
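For example, a complete DB options string for a hypothetical user (names and values are illustrative only) might be:

    user = dsuser, password = dspass, arraysize = 2000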


DB Options Mode. If you select Auto-generate for this property, DataStage will create a DB Options string for you. If you select User-defined, you have to edit the DB Options property yourself. When Auto-generate is selected, there are two dependent properties:

• User

The user name to use in the auto-generated DB options string.

• Password

The password to use in the auto-generated DB options string.

Database. By default, the read operation is carried out in the default database of the Teradata user whose profile is used. If no default database is specified in that user's Teradata profile, the user name is the default database. This option overrides the default.

If you supply the database name, the database to which it refers must exist and you must have the necessary privileges.

Server. Specify the name of a Teradata server.

Options Category

Close Command. Optionally specifies a Teradata command to be run once by Teradata on the conductor node after the query has completed.

Open Command. Optionally specifies a Teradata command run once by Teradata on the conductor node before the query is initiated.
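For example, a hypothetical pairing (table names and statements are illustrative only, not taken from this guide) might clear a work table before the read and record completion afterwards:

    Open command:   DELETE FROM work_tbl;
    Close command:  INSERT INTO load_audit VALUES ('extract complete');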

Progress Interval. By default, the stage displays a progress message for every 100,000 records per partition it processes. Specify this option either to change the interval or to disable the message. To change the interval, specify a new number of records per partition. To disable the messages, specify 0.


Chapter 15. Informix XPS Stage

The Informix XPS stage is a database stage. It allows you to read data from and write data to an Informix XPS database.

The Informix XPS stage can have a single input link or a single output link.

When you edit an Informix XPS stage, the Informix XPS stage editor appears. This is based on the generic stage editor described in Chapter 3, “Stage Editors.”

The stage editor has up to three pages, depending on whether you are reading or writing a database:

• Stage page. This is always present and is used to specify general information about the stage.

• Inputs page. This is present when you are writing to an Informix XPS database. This is where you specify details about the data being written.

• Outputs page. This is present when you are reading from an Informix XPS database. This is where you specify details about the data being read.

You must have the correct privileges and settings in order to use the Informix XPS stage. You must have a valid account and appropriate privileges on the databases to which you connect.

You require read and write privileges on any table to which you connect, and Resource privileges for using the Partition Table property on an output link or using create and replace modes on an input link.

To configure access to Informix XPS:

1. Make sure that Informix XPS is running.


2. Make sure that INFORMIXSERVER is set in your environment. This corresponds to a server name in sqlhosts and is set to the coserver name of coserver 1. The coserver must be accessible from the node on which you invoke your DataStage job.

3. Make sure that INFORMIXDIR points to the installation directory of your INFORMIX server.

4. Make sure that INFORMIXSQLHOSTS points to the sqlhosts file path (e.g., /disk6/informix/informix_runtime/etc/sqlhosts).
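For example, in a Bourne-style shell the settings from steps 2 through 4 might look like this (the server name and paths are site-specific examples only):

    export INFORMIXSERVER=xps_coserver1
    export INFORMIXDIR=/disk6/informix
    export INFORMIXSQLHOSTS=/disk6/informix/informix_runtime/etc/sqlhosts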

Stage Page

The General tab allows you to specify an optional description of the stage. The Advanced tab allows you to specify how the stage executes.

Advanced Tab

This tab allows you to specify the following:

• Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the contents of the file are processed by the available nodes as specified in the Configuration file, and by any node constraints specified on the Advanced tab. In Sequential mode the entire write is processed by the conductor node.

• Preserve partitioning. You can select Set or Clear. If you select Set, read operations will request that the next stage preserves the partitioning as is (it is ignored for write operations).

• Node pool and resource constraints. Select this option to constrain parallel execution to the node pool or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from drop-down lists populated from the Configuration file.

• Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node map. You can define a node map by typing node numbers into the text box or by clicking the browse button to open the Available Nodes dialog box and selecting nodes from there. You are effectively defining a new node pool for this stage (in addition to any node pools defined in the Configuration file).


Inputs Page

The Inputs page allows you to specify details about how the Informix XPS stage writes data to an Informix XPS database. The stage can have only one input link writing to one table.

The General tab allows you to specify an optional description of the input link. The Properties tab allows you to specify details of exactly what the link does. The Partitioning tab allows you to specify how incoming data is partitioned before being written to the database. The Columns tab specifies the column definitions of incoming data.

Details about stage properties, partitioning, and formatting are given in the following sections. See Chapter 3, “Stage Editors,” for a general description of the other tabs.

Input Link Properties

The Properties tab allows you to specify properties for the input link. These dictate how incoming data is written and where. Some of the properties are mandatory, although many have default settings. Properties without default settings appear in the warning color (red by default) and turn black when you supply a value for them.

The following table gives a quick reference list of the properties and their attributes. A more detailed description of each property follows.

Category/Property | Values | Default | Mandatory? | Repeats? | Dependent of
Target/Write Mode | Append/Create/Replace/Truncate | Append | Y | N | N/A
Target/Table | Table Name | N/A | Y | N | N/A
Target/Select List | List | N/A | N | N | Table
Connection/Database | Database Name | N/A | Y | N | N/A
Connection/Server | Server Name | N/A | Y | N | N/A
Options/Close Command | Close Command | N/A | N | N | N/A
Options/Open Command | Open Command | N/A | N | N | N/A
Options/Silently Drop Columns Not in Table | True/False | False | Y | N | N/A
Options/Default String Length | String Length | 32 | Y | N | N/A


Target Category

Write Mode. Select from the following:

• Append. Appends new records to the table. The database user who writes in this mode must have Resource privileges. This is the default mode.

• Create. Creates a new table. The database user who writes in this mode must have Resource privileges. The stage returns an error if the table already exists.

• Replace. Deletes the existing table and creates a new one in its place. The database user who writes in this mode must have Resource privileges.

• Truncate. Retains the table attributes but discards existing records and appends new ones. The stage will run more slowly in this mode if the user does not have Resource privileges.

Table. Specify the name of the Informix XPS table to write to. It has a dependent property:

• Select List

Specifies a list that determines which columns are written. If you do not supply the list, the stage writes to all columns.

Connection Category


Database. Specify the name of the Informix XPS database containing the table specified by the Table property.


Server. Specify the name of an Informix XPS server.

Options Category

Close Command. Specify an INFORMIX SQL statement to be parsed and executed by Informix XPS on all processing nodes after the table has been populated.

Open Command. Specify an INFORMIX SQL statement to be parsed and executed by Informix XPS on all processing nodes before opening the table.

Silently Drop Columns Not in Table. Use this property to cause the stage to drop, with a warning, all input columns that do not correspond to the columns of an existing table. If you do not specify drop, an unmatched column generates an error and the associated step terminates.

Default String Length. Set the default length of string columns. If you do not specify a length, the default is 32 bytes. You can specify a length up to 255 bytes.

Partitioning on Input Links

The Partitioning tab allows you to specify details about how the incoming data is partitioned or collected before it is written to the Informix XPS database. It also allows you to specify that the data should be sorted before being written.

By default the stage partitions in Auto mode. This attempts to work out the best partitioning method depending on execution modes of current and preceding stages, whether the Preserve Partitioning option has been set, and how many nodes are specified in the Configuration file. If the Preserve Partitioning option has been set on the Stage page Advanced tab (see page 15-2) the stage will attempt to preserve the partitioning of the incoming data.

If the stage is operating in sequential mode, it will first collect the data using the default Auto collection method before writing it to the database.

The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends on:


• Whether the stage is set to execute in parallel or sequential mode.


• Whether the preceding stage in the job is set to execute in parallel or sequential mode.

If the stage is set to execute in parallel, then you can set a partitioning method by selecting from the Partitioning mode drop-down list. This will override any current partitioning (even if the Preserve Partitioning option has been set on the Stage page Advanced tab).

If the stage is set to execute in sequential mode, but the preceding stage is executing in parallel, then you can set a collection method from the Collection type drop-down list. This will override the default collection method.

The following partitioning methods are available:

• (Auto). DataStage attempts to work out the best partitioning method depending on execution modes of current and preceding stages, whether the Preserve Partitioning option has been set, and how many nodes are specified in the Configuration file. This is the default partitioning method for the Informix XPS stage.

• Entire. Each partition receives the entire data set.

• Hash. The records are hashed into partitions based on the value of a key column or columns selected from the Available list.

• Modulus. The records are partitioned using a modulus function on the key column selected from the Available list. This is commonly used to partition on tag columns.

• Random. The records are partitioned randomly, based on the output of a random number generator.

• Round Robin. The records are partitioned on a round robin basis as they enter the stage.

• Same. Preserves the partitioning already in place.

• Range. Divides a data set into approximately equal size partitions based on one or more partitioning keys. Range partitioning is often a preprocessing step to performing a total sort on a data set. Requires extra properties to be set. Access these properties by clicking the properties button.

The following Collection methods are available:


• (Auto). DataStage attempts to work out the best collection method depending on execution modes of current and preceding stages, and how many nodes are specified in the Configuration file. This is the default collection method for Informix XPS stages.

• Ordered. Reads all records from the first partition, then all records from the second partition, and so on.

• Round Robin. Reads a record from the first input partition, then from the second partition, and so on. After reaching the last partition, the operator starts over.

• Sort Merge. Reads records in an order based on one or more columns of the record. This requires you to select a collecting key column from the Available list.

The Partitioning tab also allows you to specify that data arriving on the input link should be sorted before being written to the database. The sort is always carried out within data partitions. If the stage is partitioning incoming data the sort occurs after the partitioning. If the stage is collecting data, the sort occurs before the collection. The availability of sorting depends on the partitioning method chosen.

Select the check boxes as follows:

• Sort. Select this to specify that data coming in on the link should be sorted. Select the column or columns to sort on from the Available list.

• Stable. Select this if you want to preserve previously sorted data sets. This is the default.

• Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is also set, the first record is retained.

You can also specify sort direction, case sensitivity, and collating sequence for each column in the Selected list by selecting it and right-clicking to invoke the shortcut menu.

Outputs Page

The Outputs page allows you to specify details about how the Informix XPS stage reads data from an Informix XPS database. The stage can have only one output link.


The General tab allows you to specify an optional description of the output link. The Properties tab allows you to specify details of exactly what the link does. The Columns tab specifies the column definitions of incoming data.

Details about Informix XPS stage properties are given in the following sections. See Chapter 3, “Stage Editors,” for a general description of the other tabs.

Output Link Properties

The Properties tab allows you to specify properties for the output link. These dictate how data is read, and from which table. Some of the properties are mandatory, although many have default settings. Properties without default settings appear in the warning color (red by default) and turn black when you supply a value for them.

The following table gives a quick reference list of the properties and their attributes. A more detailed description of each property follows.

Category/Property | Values | Default | Mandatory? | Repeats? | Dependent of
Source/Read Method | Table/Auto-generated SQL/User-defined SQL | Table | Y | N | N/A
Source/Table | Table Name | N/A | Y (if Read Method = Table or Auto-generated SQL) | N | N/A
Source/Select List | List | N/A | N | N | Table
Source/Where Clause | Filter | N/A | N | N | Table
Source/Partition Table | Table | N/A | N | N | Table
Source/Query | SQL query | N/A | Y (if Read Method = User-defined SQL or Auto-generated SQL) | N | N/A
Connection/Database | Database Name | N/A | N | N | N/A
Connection/Server | Server Name | N/A | Y | N | N/A
Options/Close Command | String | N/A | N | N | N/A
Options/Open Command | String | N/A | N | N | N/A


Source Category

Read Method. Select Table to use the Table property to specify the read (this is the default). Select Auto-generated SQL to have DataStage automatically generate an SQL query based on the columns you have defined and the table you specify in the Table property. Select User-defined SQL to define your own query.

Table. Specify the name of the Informix XPS table to read from. The table must exist. You can prefix the table name with a table owner in the form: table_owner.table_name.

• Where Clause

Specify selection criteria to be used as part of an SQL statement’s WHERE clause, to specify the rows of the table to include in or exclude from the data set.

• Select List

Specifies a list that determines which columns are read. If you do not supply the list, the stage reads all columns. Do not include formatting characters in the list.

• Partition Table

Specify this property if the table is fragmented to improve performance by creating one instance of the stage per table fragment. If the table is fragmented across nodes, this property creates one instance of the stage per fragment per node. If the table is fragmented and you do not specify this option, the stage nonetheless functions successfully, if more slowly. You must have Resource privilege to invoke this property.


These dependent properties are only available when you have specified a Read Method of Table rather than Auto-generated SQL.

Query. This property is used to contain the SQL query when you choose a Read Method of User-defined SQL or Auto-generated SQL. If you are using Auto-generated SQL you must select a table and specify some column definitions to have DataStage generate the query.

Connection Category

Database. The name of the Informix XPS database.

Server. The name of the Informix XPS server.

Options Category

Close Command. Optionally specify an INFORMIX SQL statement to be parsed and executed on all processing nodes after the table selection or query is completed.

Open Command. Optionally specify an INFORMIX SQL statement to be parsed and executed by the database on all processing nodes before the read query is prepared and executed.


Chapter 16. Transformer Stage

The Transformer stage is an active stage. Transformer stages do not extract data or write data to a target database. They are used to handle extracted data, perform any conversions required, and pass data to another active stage or a stage that writes data to a target database or file.

A Transformer stage can have a single input and any number of outputs. It can also have two types of reject link:

• Constraint reject. This is a link defined inside the Transformer stage which takes any rows that have failed the constraint on all other output links.

• Failure reject. This link is defined outside the Transformer stage and takes any rows which have not been written to any of the output links by reason of a write failure.

Unlike most of the other stages in a Parallel Extender job, the Transformer stage has its own user interface. It does not use the generic interface as described in Chapter 3.


When you edit a Transformer stage, the Transformer Editor appears. An example Transformer stage is shown below. In this example, meta data has been defined for the input and the output links.


Transformer Editor Components

The Transformer Editor has the following components.

Toolbar

The Transformer toolbar contains the following buttons: stage properties, constraints, show all or selected relations, cut, copy, paste, load column definition, save column definition, find/replace, column auto-match, show/hide stage variables, input link execution order, and output link execution order.

Link Area

The top area displays links to and from the Transformer stage, showing their columns and the relationships between them.

The link area is where all column definitions and stage variables are defined.

The link area is divided into two panes; you can drag the splitter bar between them to resize the panes relative to one another. There is also a horizontal scroll bar, allowing you to scroll the view left or right.

The left pane shows the input link, the right pane shows output links. Output columns that have no derivation defined are shown in red.

Within the Transformer Editor, a single link may be selected at any one time. When selected, the link’s title bar is highlighted, and arrowheads indicate any selected columns within that link.

Meta Data Area

The bottom area shows the column meta data for input and output links. Again this area is divided into two panes: the left showing input link meta data and the right showing output link meta data.


The meta data for each link is shown in a grid contained within a tabbed page. Click the tab to bring the required link to the front. That link is also selected in the link area.


If you select a link in the link area, its meta data tab is brought to the front automatically.

You can edit the grids to change the column meta data on any of the links. You can also add and delete meta data.

Shortcut Menus

The Transformer Editor shortcut menus are displayed by right-clicking the links in the links area.

There are slightly different menus, depending on whether you right-click an input link, an output link, or a stage variable. The input link menu offers you operations on input columns, the output link menu offers you operations on output columns and their derivations, and the stage variable menu offers you operations on stage variables.

The shortcut menu enables you to:

• Open the Constraints dialog box to specify a constraint (only available for output links).

• Open the Column Auto Match dialog box.

• Display the Find/Replace dialog box.

• Edit, validate, or clear a derivation, or stage variable.

• Append a new column or stage variable to the selected link.

• Select all columns on a link.

• Insert or delete columns or stage variables.

• Cut, copy, and paste a column or a key expression or a derivation or stage variable.

If you display the menu from the links area background, you can:

• Open the Stage Properties dialog box in order to specify stage or link properties.

• Open the Constraints dialog box in order to specify a constraint for the selected output link.

• Open the Link Execution Order dialog box in order to specify the order in which links should be processed.

• Toggle between viewing link relations for all links, or for the selected link only.


• Toggle between displaying stage variables and hiding them.

Right-clicking in the meta data area of the Transformer Editor opens the standard grid editing shortcut menus.

Transformer Stage Basic Concepts

When you first edit a Transformer stage, it is likely that you will have already defined what data is input to the stage on the input link. You will use the Transformer Editor to define the data that will be output by the stage and how it will be transformed. (You can define input data using the Transformer Editor if required.)

This section explains some of the basic concepts of using a Transformer stage.

Input Link

The input data source is joined to the Transformer stage via the input link.

Output Links

You can have any number of output links from your Transformer stage.

You may want to pass some data straight through the Transformer stage unaltered, but it’s likely that you’ll want to transform data from some input columns before outputting it from the Transformer stage.

You can specify such an operation by entering a transform expression. The source of an output link column is defined in that column’s Derivation cell within the Transformer Editor. You can use the Expression Editor to enter expressions in this cell. You can also simply drag an input column to an output column’s Derivation cell, to pass the data straight through the Transformer stage.

In addition to specifying derivation details for individual output columns, you can also specify constraints that operate on entire output links. A constraint is an expression that specifies criteria that data must meet before it can be passed to the output link. You can also specify a reject link, which is an output link that carries all the data not output on other links, that is, columns that have not met the criteria.


Each output link is processed in turn. If the constraint expression evaluates to TRUE for an input row, the data row is output on that link. Conversely, if a constraint expression evaluates to FALSE for an input row, the data row is not output on that link.

Constraint expressions on different links are independent. If you have more than one output link, an input row may result in a data row being output from some, none, or all of the output links.

For example, if you consider the data that comes from a paint shop, it could include information about any number of different colors. If you want to separate the colors into different files, you would set up different constraints. You could output the information about green and blue paint on LinkA, red and yellow paint on LinkB, and black paint on LinkC.

When an input row contains information about yellow paint, the LinkA constraint expression evaluates to FALSE and the row is not output on LinkA. However, the input data does satisfy the constraint criterion for LinkB and the rows are output on LinkB.

If the input data contains information about white paint, this does not satisfy any constraint and the data row is not output on Links A, B or C, but will be output on the reject link. The reject link is used to route data to a table or file that is a “catch-all” for rows that are not output on any other link. The table or file containing these rejects is represented by another stage in the job design.
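As a hedged illustration, the three constraints in this example might be entered in the Expression Editor roughly as follows, assuming a hypothetical input link named PaintIn carrying a column named color (the link and column names are illustrative only):

    LinkA:  PaintIn.color = "green" Or PaintIn.color = "blue"
    LinkB:  PaintIn.color = "red" Or PaintIn.color = "yellow"
    LinkC:  PaintIn.color = "black"

A row describing white paint satisfies none of these expressions and is therefore routed to the reject link.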

You can also specify another output link which takes rows that have not been written to any other links because of write failure. This is specified outside the stage by adding a link and converting it to a reject link using the shortcut menu. This link is not shown in the Transformer meta data grid, and derives its meta data from the input link. Its column values are those in the input row that failed to be written.

Editing Transformer Stages

The Transformer Editor enables you to perform the following operations on a Transformer stage:

• Create new columns on a link
• Delete columns from within a link
• Move columns within a link
• Edit column meta data
• Define output column derivations
• Define link constraints and handle rejects
• Specify the order in which links are processed
• Define local stage variables

Using Drag and Drop

Many of the Transformer stage edits can be made simpler by using the Transformer Editor’s drag and drop functionality. You can drag columns from any link to any other link. Common uses are:

• Copying input columns to output links
• Moving columns within a link
• Copying derivations in output links

To use drag and drop:

1. Click the source cell to select it.

2. Click the selected cell again and, without releasing the mouse button, drag the mouse pointer to the desired location within the target link. An insert point appears on the target link to indicate where the new cell will go.

3. Release the mouse button to drop the selected cell.

You can drag and drop multiple columns, key expressions, or derivations. Use the standard Explorer keys when selecting the source column cells, then proceed as for a single cell.

You can drag and drop the full column set by dragging the link title.

You can add a column to the end of an existing derivation by holding down the Ctrl key as you drag the column.


Find and Replace Facilities

If you are working on a complex job where several links, each containing several columns, go in and out of the Transformer stage, you can use the find/replace column facility to help locate a particular column or expression and change it.

The find/replace facility enables you to:

• Find and replace a column name
• Find and replace expression text
• Find the next empty expression
• Find the next expression that contains an error

To use the find/replace facilities, do one of the following:

• Click the find/replace button on the toolbar
• Choose find/replace from the link shortcut menu
• Type Ctrl-F

The Find and Replace dialog box appears. It has three tabs:

• Expression Text. Allows you to locate the occurrence of a particular string within an expression, and replace it if required. You can search up or down, and choose to match case, match whole words, or neither. You can also choose to replace all occurrences of the string within an expression.

• Column Names. Allows you to find a particular column and rename it if required. You can search up or down, and choose to match case, match the whole word, or neither.

• Expression Types. Allows you to find the next empty expression or the next expression that contains an error. You can also press Ctrl-M to find the next empty expression or Ctrl-N to find the next erroneous expression.

Note: The find and replace results are shown in the color specified in Tools ➤ Options.

Press F3 to repeat the last search you made without opening the Find and Replace dialog box.


Creating and Deleting Columns

You can create columns on links to the Transformer stage using any of the following methods:

• Select the link, then click the load column definition button in the toolbar to open the standard load columns dialog box.

• Use drag and drop or copy and paste functionality to create a new column by copying from an existing column on another link.

• Use the shortcut menus to create a new column definition.

• Edit the grids in the link’s meta data tab to insert a new column.

When copying columns, a new column is created with the same meta data as the column it was copied from.

To delete a column from within the Transformer Editor, select the column you want to delete and click the cut button or choose Delete Column from the shortcut menu.

Moving Columns Within a Link

You can move columns within a link using either drag and drop or cut and paste. Select the required column, then drag it to its new location, or cut it and paste it in its new location.

Editing Column Meta Data

You can edit column meta data from within the grid in the bottom of the Transformer Editor. Select the tab for the link meta data that you want to edit, then use the standard DataStage edit grid controls.

The meta data shown does not include column derivations since these are edited in the links area.

Defining Output Column Derivations

You can define the derivation of output columns from within the Transformer Editor in five ways:

• If you require a new output column to be directly derived from an input column, with no transformations performed, then you can use drag and drop or copy and paste to copy an input column to an output link. The output columns will have the same names as the input columns from which they were derived.

• If the output column already exists, you can drag or copy an input column to the output column’s Derivation field. This specifies that the column is directly derived from an input column, with no transformations performed.

• You can use the column auto-match facility to automatically set that output columns are derived from their matching input columns.

• You may need one output link column derivation to be the same as another output link column derivation. In this case you can use drag and drop or copy and paste to copy the derivation cell from one column to another.

• In many cases you will need to transform data before deriving an output column from it. For these purposes you can use the Expression Editor. To display the Expression Editor, double-click on the required output link column Derivation cell. (You can also invoke the Expression Editor using the shortcut menu or the shortcut keys.)

If a derivation is displayed in red (or the color defined in Tools ➤ Options), it means that the Transformer Editor considers it incorrect.

Once an output link column has a derivation defined that contains any input link columns, then a relationship line is drawn between the input column and the output column, as shown in the following example. This is a simple example; there can be multiple relationship lines either in or out of columns. You can choose whether to view the relationships for all links, or just the relationships for the selected links, using the button in the toolbar.


Column Auto-Match Facility

This time-saving feature allows you to automatically set columns on an output link to be derived from matching columns on an input link. Using this feature you can fill in all the output link derivations to route data from corresponding input columns, then go back and edit individual output link columns where you want a different derivation.

To use this facility:

1. Do one of the following:

• Click the Auto-match button in the Transformer Editor toolbar.

• Choose Auto-match from the input link header or output link header shortcut menu.

The Column Auto-Match dialog box appears:

2. Choose the output link that you want to match columns with the input link from the drop-down list.

3. Click Location match or Name match from the Match type area.

If you choose Location match, this will set output column derivations to the input link columns in the equivalent positions. It starts with the first input link column going to the first output link column, and works its way down until there are no more input columns left.


If you choose Name match, you need to specify further information for the input and output columns as follows:

• Input columns:

– Match all columns or Match selected columns. Choose one of these to specify whether all input link columns should be matched, or only those currently selected on the input link.

– Ignore prefix. Allows you to optionally specify characters at the front of the column name that should be ignored during the matching procedure.

– Ignore suffix. Allows you to optionally specify characters at the end of the column name that should be ignored during the matching procedure.

• Output columns:

– Ignore prefix. Allows you to optionally specify characters at the front of the column name that should be ignored during the matching procedure.

– Ignore suffix. Allows you to optionally specify characters at the end of the column name that should be ignored during the matching procedure.

• Ignore case. Select this check box to specify that case should be ignored when matching names. The setting of this also affects the Ignore prefix and Ignore suffix settings. For example, if you specify that the prefix IP will be ignored, and turn Ignore case on, then both IP and ip will be ignored.

4. Click OK to proceed with the auto-matching.

Note: Auto-matching does not take into account any data type incompatibility between matched columns; the derivations are set regardless.
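The effect of the Ignore prefix, Ignore suffix, and Ignore case settings on name matching can be sketched as follows (a general Python illustration with hypothetical column names, not the editor’s actual code):

    def normalize(name, prefix="", suffix="", ignore_case=True):
        # Strip an ignored prefix/suffix, then optionally fold case,
        # mirroring the Column Auto-Match name-matching options.
        if ignore_case:
            name, prefix, suffix = name.lower(), prefix.lower(), suffix.lower()
        if prefix and name.startswith(prefix):
            name = name[len(prefix):]
        if suffix and name.endswith(suffix):
            name = name[:-len(suffix)]
        return name

    # With prefix "IP" and Ignore case on, "IPAddress" and "Address"
    # both normalize to "address" and therefore match.
    print(normalize("IPAddress", prefix="IP") == normalize("Address"))  # True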

Defining Constraints and Handling Rejects

You can define limits for output data by specifying a constraint. Constraints are expressions and you can specify a constraint for each output link from a Transformer stage. You can also specify that a particular link is to act as a reject link and catch those rows that have failed to satisfy the constraints on all other output links.


To define a constraint or specify a reject link, do one of the following:

• Select an output link and click the constraints button.

• Double-click the output link’s constraint entry field.

• Choose Constraints from the background or header shortcut menus.

A dialog box appears which allows you either to define constraints for any of the Transformer output links or to define a link as a reject link.

Define a constraint by entering an expression in the Constraint field for that link. Once you have done this, any constraints will appear below the link’s title bar in the Transformer Editor. This constraint expression will then be checked against the row data at runtime. If the data does not satisfy the constraint, the row will not be written to that link. It is also possible to define a link which can be used to catch these rows which have been "rejected" from a previous link.

A reject link can be defined by:

• Clicking on the Reject Row field so a tick appears and leaving the Constraint fields blank. This will catch any rows that have failed to meet constraints on all the previous output links.

• Setting the constraint to REJECTED. This will be set whenever a row is rejected on a link because the row fails to match a constraint. REJECTED is cleared by any output link that accepts the row. Provided the reject link occurs after the other output links, it will catch rows that have failed to meet the constraints of all the output links.

• Clicking on the Reject Row field so a tick appears and defining a Constraint. This will result in the number of rows written to that link (i.e. rows which satisfy the constraint) being recorded in the job log as a warning message indicating "rejected rows".

Note: You can also specify another reject link which will catch rows that have not been written on any output links due to a write error. Define this outside the Transformer stage by adding a link and using the shortcut menu to convert it to a reject link.


Specifying Link Order

You can specify the order in which output links process a row.

The initial order of the links is the order in which they are added to the stage.

To reorder the links:

1. Do one of the following:

• Click the output link execution order button on the Transformer Editor toolbar.

• Choose output link reorder from the background shortcut menu.


The Transformer Stage Properties dialog box appears with the Link Ordering tab of the Stage page uppermost:

2. Use the arrow buttons to rearrange the list of links in the execution order required.

3. When you are happy with the order, click OK.

Defining Local Stage Variables

You can declare and use your own variables within a Transformer stage. Such variables are accessible only from the Transformer stage in which they are declared. They can be used as follows:

• They can be assigned values by expressions.

• They can be used in expressions which define an output column derivation.

• Expressions evaluating a variable can include other variables or the variable being evaluated itself.
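As a hypothetical example of these rules, a stage variable can reference itself in its own derivation to maintain state across rows. Assuming a variable RowCount with an initial value of 0, and a second variable evaluated after it (the names and expressions are illustrative only):

    RowCount derivation:    RowCount + 1
    IsFirstRow derivation:  If RowCount = 1 Then 1 Else 0

Because stage variables are evaluated in order for each row, IsFirstRow can safely reference the value RowCount has just been given.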


Any stage variables you declare are shown in a table in the right pane of the links area. The table looks similar to an output link. You can display or hide the table by clicking the Stage Variable button in the Transformer toolbar or choosing Stage Variable from the background shortcut menu.

Note: Stage variables are not shown in the output link meta data area at the bottom of the right pane.

The table lists the stage variables together with the expressions used to derive their values. Link lines join the stage variables with input columns used in the expressions. Links from the right side of the table link the variables to the output columns that use them.

To declare a stage variable:

1. Do one of the following:


• Select Insert New Stage Variable from the stage variable shortcut menu. A new variable is added to the stage variables table in the links pane. The variable is given the default name StageVar and default data type VarChar (255). You can edit these properties using the Transformer Stage Properties dialog box, as described in the next step.

• Click the Stage Properties button on the Transformer toolbar.

• Select Stage Properties from the background shortcut menu.

• Select Stage Variable Properties from the stage variable shortcut menu.

The Transformer Stage Properties dialog box appears:

2. Using the grid on the Variables page, enter the variable name, initial value, SQL type, precision, scale, and an optional description. Variable names must begin with an alphabetic character (a–z, A–Z) and can only contain alphanumeric characters (a–z, A–Z, 0–9).

3. Click OK. The new variable appears in the stage variable table in the links pane.

You perform most of the same operations on a stage variable as you can on an output column (see page 16-9). A shortcut menu offers the same commands. You cannot, however, paste a stage variable as a new column, or a column as a new stage variable.


The DataStage Expression Editor

The DataStage Expression Editor helps you to enter correct expressions when you edit Transformer stages. The Expression Editor can:

• Facilitate the entry of expression elements
• Complete the names of frequently used variables
• Validate the expression

The Expression Editor can be opened from:

• Output link Derivation cells
• Stage variable Derivation cells
• Constraint dialog box

Entering Expressions

Whenever the insertion point is in an expression box, you can use the Expression Editor to suggest the next element in your expression. Do this by right-clicking the box, or by clicking the Suggest button to the right of the box. This opens the Suggest Operand or Suggest Operator menu. Which menu appears depends on context, i.e., whether you should be entering an operand or an operator as the next expression element. (The Functions available from this menu are described in Appendix B.)

Suggest Operand Menu:

Suggest Operator Menu:


Completing Variable Names

The Expression Editor stores variable names. When you enter a variable name you have used before, you can type the first few characters, then press F5. The Expression Editor completes the variable name for you.

If you enter the name of the input link followed by a period, for example, DailySales., the Expression Editor displays a list of the column names of the link. If you continue typing, the list selection changes to match what you type. You can also select a column name using the mouse. Enter a selected column name into the expression by pressing Tab or Enter. Press Esc to dismiss the list without selecting a column name.

Validating the Expression

When you have entered an expression in the Transformer Editor, press Enter to validate it. The Expression Editor checks that the syntax is correct and that any variable names used are acceptable to the compiler.

If there is an error, a message appears and the element causing the error is highlighted in the expression box. You can either correct the expression or close the Transformer Editor or Transform dialog box.

Exiting the Expression Editor

You can exit the Expression Editor in the following ways:

• Press Esc (which discards changes).
• Press Return (which accepts changes).
• Click outside the Expression Editor box (which accepts changes).


Configuring the Expression Editor

The Expression Editor is switched on by default. If you prefer not to use it, you can switch it off or use selected features only. The Expression Editor is configured by editing the Designer options. For more information, see the DataStage Designer Guide.

Transformer Stage Properties

The Transformer stage has a Properties dialog box which allows you to specify details about how the stage operates.

The Transformer Stage Properties dialog box has three pages:

• Stage page. This is used to specify general information about the stage.

• Inputs page. This is where you specify details about the data input to the Transformer stage.

• Outputs page. This is where you specify details about the output links from the Transformer stage.

Stage Page

The Stage page has four tabs:

• General. Allows you to enter an optional description of the stage.

• Variables. Allows you to set up stage variables for use in the stage.

• Advanced. Allows you to specify how the stage executes.

• Link Ordering. Allows you to specify the order in which the output links will be processed.

The Variables tab is described in “Defining Local Stage Variables” on page 16-15. The Link Ordering tab is described in “Specifying Link Order” on page 16-14.

Advanced Tab

The Advanced tab is the same as the Advanced tab of the generic stage editor as described in “Advanced Tab” on page 3-5. This tab allows you to specify the following:


• Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the contents of the file are processed by the available nodes as specified in the Configuration file, and by any node constraints specified on the Advanced tab. In sequential mode the entire contents of the file are processed by the conductor node.

• Preserve partitioning. This is set to Propagate by default; this sets or clears the partitioning in accordance with what the previous stage has set. You can also select Set or Clear. If you select Set, the stage will request that the next stage preserves the partitioning as is.

• Node pool and resource constraints. Select this option to constrain parallel execution to the node pool or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from drop-down lists populated from the Configuration file.

• Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node map. You can define a node map by typing node numbers into the text box or by clicking the browse button to open the Available Nodes dialog box and selecting nodes from there. You are effectively defining a new node pool for this stage (in addition to any node pools defined in the Configuration file).

Inputs Page

The Inputs page allows you to specify details about data coming into the Transformer stage. The Transformer stage can have only one input link.

The General tab allows you to specify an optional description of the input link. The Partitioning tab allows you to specify how incoming data is partitioned. This is the same as the Partitioning tab in the generic stage editor described in “Partitioning Tab” on page 3-11.

Partitioning on the Input Link

The Partitioning tab allows you to specify details about how the incoming data is partitioned or collected when input to the Transformer stage. It also allows you to specify that the data should be sorted on input.


By default the Transformer stage will attempt to preserve partitioning of incoming data, or use its own partitioning method according to what the previous stage in the job dictates.


If the Transformer stage is operating in sequential mode, it will first collect the data using the default collection method before processing it.

The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends on:

• Whether the stage is set to execute in parallel or sequential mode.

• Whether the preceding stage in the job is set to execute in parallel or sequential mode.

If the Transformer stage is set to execute in parallel, then you can set a partitioning method by selecting from the Partitioning type drop-down list. This will override any current partitioning (even if the Preserve Parti-tioning option has been set on the Stage page Advanced tab).

If the Transformer stage is set to execute in sequential mode, but the preceding stage is executing in parallel, then you can set a collection method from the Collection type drop-down list. This will override the default collection method.

The following partitioning methods are available:

• (Auto). DataStage attempts to work out the best partitioning method depending on execution modes of current and preceding stages, whether the Preserve Partitioning flag has been set on the previous stage in the job, and how many nodes are specified in the Configuration file. This is the default method for the Transformer stage.

• Entire. Each partition receives the entire data set.

• Hash. The records are hashed into partitions based on the value of a key column or columns selected from the Available list.

• Modulus. The records are partitioned using a modulus function on the key column selected from the Available list. This is commonly used to partition on tag fields.

• Random. The records are partitioned randomly, based on the output of a random number generator.

• Round Robin. The records are partitioned on a round robin basis as they enter the stage.


• Same. Preserves the partitioning already in place.


• DB2. Replicates the DB2 partitioning method of a specific DB2 table. Requires extra properties to be set. Access these properties by clicking the properties button.

• Range. Divides a data set into approximately equal size partitions based on one or more partitioning keys. Range partitioning is often a preprocessing step to performing a total sort on a data set. Requires extra properties to be set. Access these properties by clicking the properties button.

The following Collection methods are available:

• (Auto). DataStage attempts to work out the best collection method depending on execution modes of current and preceding stages, and how many nodes are specified in the Configuration file. This is the default method for the Transformer stage.

• Ordered. Reads all records from the first partition, then all records from the second partition, and so on.

• Round Robin. Reads a record from the first input partition, then from the second partition, and so on. After reaching the last partition, the operator starts over.

• Sort Merge. Reads records in an order based on one or more columns of the record. This requires you to select a collecting key column from the Available list.

The Partitioning tab also allows you to specify that data arriving on the input link should be sorted. The sort is always carried out within data partitions. If the stage is partitioning incoming data the sort occurs after the partitioning. If the stage is collecting data, the sort occurs before the collection. The availability of sorting depends on the partitioning method chosen.

Select the check boxes as follows:

• Sort. Select this to specify that data coming in on the link should be sorted. Select the column or columns to sort on from the Available list.

• Stable. Select this if you want to preserve previously sorted data sets. This is the default.


• Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is also set, the first record is retained (see the sketch below).

You can also specify sort direction, case sensitivity, and collating sequence for each column in the Selected list by selecting it and right-clicking to invoke the shortcut menu.
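As an aside, the interaction of the Stable and Unique options can be pictured in a few lines of Python (an illustrative sketch with invented data, not a verbatim model of the stage):

    from itertools import groupby

    rows = [("smith", 3), ("jones", 1), ("smith", 1), ("jones", 2)]

    # Python's sorted() is stable: records with equal keys keep their
    # original relative order, i.e., previously sorted data is preserved.
    ordered = sorted(rows, key=lambda r: r[0])

    # Unique: with stable sort also set, the first record per key is retained.
    deduped = [next(grp) for _, grp in groupby(ordered, key=lambda r: r[0])]

    print(ordered)  # [('jones', 1), ('jones', 2), ('smith', 3), ('smith', 1)]
    print(deduped)  # [('jones', 1), ('smith', 3)]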

Outputs Page

The Outputs page has a General tab which allows you to enter an optional description for each of the output links on the Transformer stage.


17. Aggregator Stage

The Aggregator stage is an active stage. It classifies data rows from a single input link into groups and computes totals or other aggregate functions for each group. The summed totals for each group are output from the stage via an output link.

When you edit an Aggregator stage, the Aggregator stage editor appears. This is based on the generic stage editor described in Chapter 3, “Stage Editors.”

The stage editor has three pages:

• Stage page. This is always present and is used to specify general information about the stage.

• Inputs page. This is where you specify details about the data being grouped and/or aggregated.

• Outputs page. This is where you specify details about the groups being output from the stage.

The Aggregator stage gives you access to grouping and summary operations. One of the easiest ways to expose patterns in a collection of records is to group records with similar characteristics, then compute statistics on all records in the group. You can then use these statistics to compare properties of the different groups. For example, records containing cash register transactions might be grouped by the day of the week to see which day had the largest number of transactions, the largest amount of revenue, etc.

Records can be grouped by one or more characteristics, where record characteristics correspond to column values. In other words, a group is a set of records with the same value for one or more columns. For example, transaction records might be grouped by both day of the week and by month. These groupings might show that the busiest day of the week varies by season.

In addition to revealing patterns in your data, grouping can also reduce the volume of data by summarizing the records in each group, making it easier to manage. If you group a large volume of data on the basis of one or more characteristics of the data, the resulting data set is generally much smaller than the original and is therefore easier to analyze using standard workstation or PC-based tools.
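As a sketch of the idea (plain Python with synthetic data; the column names are invented), grouping the cash register transactions by day of the week collapses many input rows into one summary row per group:

    from collections import defaultdict

    transactions = [("Mon", 12.50), ("Tue", 3.99), ("Mon", 7.25),
                    ("Sat", 45.00), ("Sat", 19.99), ("Tue", 8.50)]

    summary = defaultdict(lambda: {"count": 0, "revenue": 0.0})
    for day, amount in transactions:
        summary[day]["count"] += 1         # transactions per day
        summary[day]["revenue"] += amount  # revenue per day

    for day, stats in sorted(summary.items()):
        print(day, stats["count"], round(stats["revenue"], 2))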

At a practical level, you should be aware that, in a parallel environment, the way that you partition data before grouping and summarizing it can affect the results. For example, if you partitioned using the round robin method, records with identical values in the column you are grouping on would end up in different partitions. If you then performed a sum operation within these partitions, you would not be operating on all the relevant records. In such circumstances you may want to hash partition the data on one or more of the grouping keys to ensure that your groups are entire.

It is important that you bear these facts in mind and take any steps you need to prepare your data set before presenting it to the Aggregator stage. In practice this could mean you use Sort stages or additional Aggregator stages in the job.
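The pitfall can be demonstrated with a small Python sketch (invented data; partition assignment is simulated, not DataStage's actual mechanics): with round robin partitioning a group's records are split across partitions, so per-partition sums are only partial, whereas hashing on the grouping key keeps each group whole:

    from collections import Counter

    records = [("a", 1), ("a", 3), ("b", 2), ("a", 5), ("b", 4)]
    NUM_PARTITIONS = 2

    def per_partition_sums(assign):
        parts = [Counter() for _ in range(NUM_PARTITIONS)]
        for i, (key, val) in enumerate(records):
            parts[assign(i, key)][key] += val
        return parts

    # Round robin: group "a" is split across partitions, so each partition
    # holds only a partial sum for it.
    print(per_partition_sums(lambda i, key: i % NUM_PARTITIONS))

    # Hash on the grouping key: every record of a group lands in one
    # partition, so the per-partition sums are the true group totals.
    print(per_partition_sums(lambda i, key: hash(key) % NUM_PARTITIONS))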

Stage Page

The General tab allows you to specify an optional description of the stage. The Properties tab lets you specify what the stage does. The Advanced tab allows you to specify how the stage executes.

Properties

The Properties tab allows you to specify properties which determine what the stage actually does. Some of the properties are mandatory, although many have default settings. Properties without default settings appear in the warning color (red by default) and turn black when you supply a value for them.


The following table gives a quick reference list of the properties and their attributes. A more detailed description of each property follows.

Category/Property | Values | Default | Mandatory? | Repeats? | Dependent of
Grouping Keys/Group | Input column | N/A | Y | Y | N/A
Grouping Keys/Case Sensitive | True/False | True | N | N | Group
Aggregations/Aggregation Type | Calculate/Recalculate/Count Rows | Calculate | Y | N | N/A
Aggregations/Column for Calculation | Input column | N/A | Y (if Aggregation Type = Calculate) | Y | N/A
Aggregations/Count Output Column | Output column | N/A | Y (if Aggregation Type = Count Rows) | N | N/A
Aggregations/Summary Column for Recalculation | Input column | N/A | Y (if Aggregation Type = Recalculate) | Y | N/A
Aggregations/Corrected Sum of Squares | Output column | N/A | N | N | Column for Calculation & Summary Column for Recalculation
Aggregations/Maximum Value | Output column | N/A | N | N | Column for Calculation & Summary Column for Recalculation
Aggregations/Mean Value | Output column | N/A | N | N | Column for Calculation & Summary Column for Recalculation
Aggregations/Minimum Value | Output column | N/A | N | N | Column for Calculation & Summary Column for Recalculation
Aggregations/Missing Value | Output column | N/A | N | Y | Column for Calculation
Aggregations/Missing Values Count | Output column | N/A | N | N | Column for Calculation & Summary Column for Recalculation
Aggregations/Non-missing Values Count | Output column | N/A | N | N | Column for Calculation & Summary Column for Recalculation
Aggregations/Percent Coefficient of Variation | Output column | N/A | N | N | Column for Calculation & Summary Column for Recalculation
Aggregations/Range | Output column | N/A | N | N | Column for Calculation & Summary Column for Recalculation
Aggregations/Standard Deviation | Output column | N/A | N | N | Column for Calculation & Summary Column for Recalculation
Aggregations/Standard Error | Output column | N/A | N | N | Column for Calculation & Summary Column for Recalculation
Aggregations/Sum of Weights | Output column | N/A | N | N | Column for Calculation & Summary Column for Recalculation
Aggregations/Sum | Output column | N/A | N | N | Column for Calculation & Summary Column for Recalculation
Aggregations/Summary | Output column | N/A | N | N | Column for Calculation & Summary Column for Recalculation
Aggregations/Uncorrected Sum of Squares | Output column | N/A | N | N | Column for Calculation & Summary Column for Recalculation
Aggregations/Variance | Output column | N/A | N | N | Column for Calculation & Summary Column for Recalculation
Aggregations/Variance divisor | Default/NRecs | Default | N | N | Variance
Aggregations/Weighting column | Input column | N/A | N | N | Column for Calculation or Count Output Column
Options/Method | hash/sort | hash | Y | N | N/A
Options/Ignore Null Values | True/False | False | Y | N | N/A

Grouping Keys Category

Group. Specifies the input columns you are using as group keys. Repeat the property to select multiple columns as group keys. This property has a dependent property:

• Case Sensitive

Use this to specify whether each group key is case sensitive or not. It is set to True by default; that is, the values “CASE” and “case” would end up in different groups.

Aggregations Category

Aggregation Type. This property allows you to specify the type of aggregation operation your stage is performing. Choose from Calculate (the default), Recalculate, and Count Rows.

Column for Calculation. The Calculate aggregate type allows you to summarize the contents of a particular column or columns in your input data set by applying one or more aggregate functions to it. Select the column to be aggregated, then select dependent properties to specify the operation to perform on it, and the output column to carry the result.

Count Output Column. The Count Rows aggregate type performs a count of the number of records within each group. Specify the column on which the count is output.

Summary Column for Recalculation. This aggregate type allows you to apply aggregate functions to a column that has already been summarized. This is like reduce but performs the specified aggregate operation on a set of data that has already been summarized. In practice this means you should have performed a calculate (or recalculate) operation in a previous Aggregator stage with the Summary property set to produce a subrecord containing the summary data that is then included with the data set. Select the column to be aggregated, then select dependent properties to specify the operation to perform on it, and the output column to carry the result.

Options Category

Method. The Aggregator stage has two modes of operation: hash and sort. Your choice of mode depends primarily on the number of groupings in the input data set, taking into account the amount of memory available. You typically use hash mode for a relatively small number of groups; generally, fewer than about 1000 groups per megabyte of memory to be used.

When using hash mode, you should hash partition the input data set by one or more of the grouping key columns so that all the records in the same group are in the same partition (this happens automatically if (auto) is set in the Partitioning tab). However, hash partitioning is not mandatory; you can use any partitioning method you choose if keeping groups together in a single partition is not important. For example, if you’re summing records in each partition and later you’ll add the sums across all partitions, you don’t need all records in a group to be in the same partition to do this. Note, though, that there will be multiple output records for each group.

If the number of groups is large, which can happen if you specify many grouping keys, or if some grouping keys can take on many values, you would normally use sort mode. However, sort mode requires the input data set to have been partition sorted with all of the grouping keys specified as hashing and sorting keys (this happens automatically if (auto) is set in the Partitioning tab). Sorting is, in effect, a pregrouping operation: after sorting, all records in a given group in the same partition are consecutive.


The method property is set to hash by default.

You may want to try both modes with your particular data and application to determine which gives the better performance. You may find that when calculating statistics on large numbers of groups, sort mode performs better than hash mode, assuming the input data set can be efficiently sorted before it is passed to the stage.
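The two mechanisms can be pictured with a short Python sketch (illustrative only; the stage's real implementation is the parallel engine's grouping operator, not this code). Hash mode keeps one in-memory accumulator per group, while sort mode sorts first so that each group's records are consecutive and can be summed in a single streaming pass:

    from itertools import groupby

    rows = [("x", 2), ("y", 5), ("x", 1), ("y", 7), ("z", 4)]

    # Hash mode: one in-memory accumulator per distinct group key.
    sums = {}
    for key, val in rows:
        sums[key] = sums.get(key, 0) + val
    print(sums)

    # Sort mode: sort on the grouping key first, then stream through the
    # consecutive runs; no per-group table is needed.
    ordered = sorted(rows, key=lambda r: r[0])
    for key, grp in groupby(ordered, key=lambda r: r[0]):
        print(key, sum(v for _, v in grp))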

Ignore Null Values. Set this to True to indicate that null values will not be counted as part of the total column count when calculating minimum value, maximum value, mean value, standard deviation, standard error, sum, sum of weights, and variance. If False, the null value will have 0 substituted and so will be counted as a valid value. It is False by default.

Weighting column. Configures the stage to increment the count for the group by the contents of the weight column for each record in the group, instead of by 1. Not available for Summary Column for Recalculation. Setting this option affects only the following options:

• Percent Coefficient of Variation

• Mean Value

• Sum

• Sum of Weights

• Uncorrected Sum of Squares

Calculation and Recalculation Dependent Properties

The following properties are dependents of both Column for Calculation and Summary Column for Recalculation. These specify the various aggregate functions and the output columns to carry the results.

• Corrected Sum of Squares

Produces a corrected sum of squares for data in the aggregate column and outputs it to the specified output column.

• Maximum Value

Gives the maximum value in the aggregate column and outputs it to the specified output column.


• Mean Value

Gives the mean value in the aggregate column and outputs it to the specified output column.

• Minimum Value

Gives the minimum value in the aggregate column and outputs it to the specified output column.

• Missing Value

This specifies what constitutes a ‘missing’ value, for example -1 or NULL. Enter the value as a floating point number. Not available for Summary Column for Recalculation.

• Missing Values Count

Counts the number of aggregate columns with missing values in them and outputs the count to the specified output column. Not available for Summary Column for Recalculation.

• Non-missing Values Count

Counts the number of aggregate columns with values in them and outputs the count to the specified output column.

• Percent Coefficient of Variation

Calculates the percent coefficient of variation for the aggregate column and outputs it to the specified output column.

• Range

Calculates the range of values in the aggregate column and outputs it to the specified output column.

• Standard Deviation

Calculates the standard deviation of values in the aggregate column and outputs it to the specified output column.

• Standard Error

Calculates the standard error of values in the aggregate column and outputs it to the specified output column.


• Sum of Weights

Calculates the sum of values in the weight column specified by the Weighting column property and outputs it to the specified output column.

• Sum

Sums the values in the aggregate column and outputs the sum to the specified output column.

• Summary

Specifies a subrecord to write the results of the reduce or rereduce operation to.

• Uncorrected Sum of Squares

Produces an uncorrected sum of squares for data in the aggregate column and outputs it to the specified output column.

• Variance

Calculates the variance for the aggregate column and outputs it to the specified output column. This has a dependent property:

– Variance divisor

Specifies the variance divisor. By default, DataStage uses a value of the number of records in the group minus the number of records with missing values minus 1 to calculate the variance. This corresponds to a vardiv setting of Default. If you specify NRecs, DataStage uses the number of records in the group minus the number of records with missing values instead.
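In other words, if a group contains n non-missing values, Default divides the corrected sum of squares by n - 1 and NRecs divides by n. A small sketch of the arithmetic (plain Python, synthetic values):

    values = [4.0, 6.0, None, 10.0]  # None stands for a missing value

    present = [v for v in values if v is not None]
    n = len(present)                 # records minus records with missing values
    mean = sum(present) / n
    css = sum((v - mean) ** 2 for v in present)  # corrected sum of squares

    print(css / (n - 1))  # Default divisor
    print(css / n)        # NRecs divisor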

Advanced Tab

This tab allows you to specify the following:

• Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the input data set is processed by the available nodes as specified in the Configuration file, and by any node constraints specified on the Advanced tab. In Sequential mode the entire data set is processed by the conductor node.


• Preserve partitioning. This is Set by default. You can select Set or Clear. If you select Set the stage will request that the next stage in the job attempt to maintain the partitioning.

• Node pool and resource constraints. Select this option to constrain parallel execution to the node pool or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from drop down lists populated from the Configuration file.

• Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node map. You can define a node map by typing node numbers into the text box or by clicking the browse button to open the Available Nodes dialog box and selecting nodes from there. You are effectively defining a new node pool for this stage (in addition to any node pools defined in the Configuration file).


Inputs Page

The Inputs page allows you to specify details about the incoming data set.

The General tab allows you to specify an optional description of the input link. The Partitioning tab allows you to specify how incoming data is partitioned before being grouped and/or summarized. The Columns tab specifies the column definitions of incoming data.

Details about Aggregator stage partitioning are given in the following section. See Chapter 3, “Stage Editors,” for a general description of the other tabs.

Partitioning on Input Links

The Partitioning tab allows you to specify details about how the incoming data is partitioned or collected before it is grouped and/or summarized. It also allows you to specify that the data should be sorted before being operated on.

By default the stage partitions in Auto mode. This attempts to work out the best partitioning method depending on execution modes of current and preceding stages, whether the Preserve Partitioning option has been set, and how many nodes are specified in the Configuration file. If the Preserve Partitioning option has been set on the previous stage in the job, this stage will attempt to preserve the partitioning of the incoming data.

If the Aggregator stage is operating in sequential mode, it will first collect the data using the default Auto collection method before processing it.

The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends on:

• Whether the Aggregator stage is set to execute in parallel or sequential mode.

• Whether the preceding stage in the job is set to execute in parallel or sequential mode.

If the Aggregator stage is set to execute in parallel, then you can set a partitioning method by selecting from the Partitioning mode drop-down list. This will override any current partitioning (even if the Preserve Partitioning option has been set on the previous stage).


If the Aggregator stage is set to execute in sequential mode, but the preceding stage is executing in parallel, then you can set a collection method from the Collection type drop-down list. This will override the default collection method.

The following partitioning methods are available:

• (Auto). DataStage attempts to work out the best partitioning method depending on execution modes of current and preceding stages, whether the Preserve Partitioning option has been set, and how many nodes are specified in the Configuration file. This is the default partitioning method for the Aggregator stage.

• Entire. Each partition receives the entire data set.

• Hash. The records are hashed into partitions based on the value of a key column or columns selected from the Available list.

• Modulus. The records are partitioned using a modulus function on the key column selected from the Available list. This is commonly used to partition on tag fields.

• Random. The records are partitioned randomly, based on the output of a random number generator.

• Round Robin. The records are partitioned on a round robin basis as they enter the stage.

• Same. Preserves the partitioning already in place.

• DB2. Replicates the DB2 partitioning method of a specific DB2 table. Requires extra properties to be set. Access these properties by clicking the properties button.

• Range. Divides a data set into approximately equal size partitions based on one or more partitioning keys. Range partitioning is often a preprocessing step to performing a total sort on a data set. Requires extra properties to be set. Access these properties by clicking the properties button.

The following Collection methods are available:

• (Auto). DataStage attempts to work out the best collection method depending on execution modes of current and preceding stages, and how many nodes are specified in the Configuration file. This is the default collection method for Aggregator stages.


• Ordered. Reads all records from the first partition, then all records from the second partition, and so on.

• Round Robin. Reads a record from the first input partition, then from the second partition, and so on. After reaching the last partition, the operator starts over.

• Sort Merge. Reads records in an order based on one or more columns of the record. This requires you to select a collecting key column from the Available list.

The Partitioning tab also allows you to specify that data arriving on the input link should be sorted before being grouped and/or summarized. The sort is always carried out within data partitions. If the stage is partitioning incoming data the sort occurs after the partitioning. If the stage is collecting data, the sort occurs before the collection. The availability of sorting depends on the partitioning method chosen.

Select the check boxes as follows:

• Sort. Select this to specify that data coming in on the link should be sorted. Select the column or columns to sort on from the Available list.

• Stable. Select this if you want to preserve previously sorted data sets. This is the default.

• Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is also set, the first record is retained.

You can also specify sort direction, case sensitivity, and collating sequence for each column in the Selected list by selecting it and right-clicking to invoke the shortcut menu.

Outputs Page

The Outputs page allows you to specify details about data output from the Aggregator stage. The Aggregator stage can have only one output link.

The General tab allows you to specify an optional description of the output link. The Columns tab specifies the column definitions of the output data. The Mapping tab allows you to specify the relationship between the processed data being produced by the Aggregator stage and the Output columns.

Details about Aggregator stage mapping are given in the following section. See Chapter 3, “Stage Editors,” for a general description of the other tabs.

Mapping Tab

For the Aggregator stage the Mapping tab allows you to specify how the output columns are derived, i.e., what input columns map onto them or how they are generated.

The left pane shows the input columns and/or the generated columns. These are read only and cannot be modified on this tab.

The right pane shows the output columns for each link. This has a Derivations field where you can specify how the column is derived. You can fill it in by dragging columns over from the left pane, or by using the Auto-match facility.

In the above example the left pane represents the data after it has been grouped and summarized. The Expression field shows how the column has been derived. The right pane represents the data being output by the stage after the grouping and summarizing. In this example ocol1 carries the value of the key field on which the data was grouped (for example, if you were grouping by date it would contain each date grouped on). Column ocol2 carries the mean of all the col2 values in the group, ocol4 the minimum value, and ocol3 the sum.


18. Join Stage

The Join stage is an active stage. It performs join operations on two or more data sets input to the stage and then outputs the resulting data set. The input data sets are notionally identified as the “left” set, the “right” set, and any “intermediate” sets; you can specify which is which. The stage can have any number of input links and a single output link.

The stage can perform one of four join operations:

• Inner transfers records from input data sets whose key columns contain equal values to the output data set. Records whose key columns do not contain equal values are dropped.

• Left outer transfers all values from the left data set but transfers values from the right data set and intermediate data sets only where key columns match. The operator drops the key column from the right data set.

• Right outer transfers all values from the right data set and transfers values from the left data set and intermediate data sets only where key columns match. The operator drops the key column from the left data set.

• Full outer transfers records in which the contents of the key columns are equal from the left and right input data sets to the output data set. It also transfers records whose key columns contain unequal values from both input data sets to the output data set. (Full outer joins do not support more than two input links.)
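The four semantics can be pictured with a minimal Python sketch (invented single-key data; a toy equality join, not the stage's implementation):

    left = {1: "L1", 2: "L2", 3: "L3"}   # key -> left row payload
    right = {2: "R2", 3: "R3", 4: "R4"}  # key -> right row payload

    inner = {k: (left[k], right[k]) for k in left.keys() & right.keys()}
    left_outer = {k: (left[k], right.get(k)) for k in left}
    right_outer = {k: (left.get(k), right[k]) for k in right}
    full_outer = {k: (left.get(k), right.get(k))
                  for k in left.keys() | right.keys()}

    print(inner)        # keys 2 and 3 only; unmatched records are dropped
    print(left_outer)   # every left key; a missing right side appears as None
    print(right_outer)  # every right key; a missing left side appears as None
    print(full_outer)   # the union of keys from both sides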

The stage editor has three pages:


• Stage page. This is always present and is used to specify general information about the stage.


• Inputs page. This is where you specify details about the data sets being joined.

• Outputs page. This is where you specify details about the joined data being output from the stage.

Stage Page

The General tab allows you to specify an optional description of the stage. The Properties tab lets you specify what the stage does. The Advanced tab allows you to specify how the stage executes. The Link Ordering tab allows you to specify which of the input links is the left link, which is the right link, and which are intermediate.

Properties

The Properties tab allows you to specify properties which determine what the stage actually does. Some of the properties are mandatory, although many have default settings. Properties without default settings appear in the warning color (red by default) and turn black when you supply a value for them.

The following table gives a quick reference list of the properties and their attributes. A more detailed description of each property follows.

Category/Property | Values | Default | Mandatory? | Repeats? | Dependent of
Join Keys/Key | Input column | N/A | Y | Y | N/A
Join Keys/Case Sensitive | True/False | True | N | N | Key
Options/Join Type | Full Outer/Inner/Left Outer/Right Outer | Inner | Y | N | N/A

Join Keys Category


Key. Choose the input column you want to join on. You are offered a choice of input columns common to all links. For a join to work you must join on a column that appears in all input data sets, i.e., has the same name and compatible data types in each. If, for example, you select a column called “name” from the left link, the stage will expect there to be an equivalent column called “name” on the right link.

You can join on multiple key columns. To do so, repeat the Key property.

Key has a dependent property:

• Case Sensitive

Use this to specify whether each key is case sensitive or not. It is set to True by default; that is, the values “CASE” and “case” would not be judged equivalent.

Options Category

Join Type. Specify the type of join operation you want to perform. Choose one of:

• Full Outer
• Inner
• Left Outer
• Right Outer

The default is Inner.

Advanced Tab

This tab allows you to specify the following:

• Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the input data is processed by the available nodes as specified in the Configuration file, and by any node constraints specified on the Advanced tab. In Sequential mode the entire data set is processed by the conductor node.

• Preserve partitioning. This is Propagate by default. It adopts the setting which results from ORing the settings of the input stages, i.e., if either of the input stages uses Set then this stage will use Set. You can explicitly select Set or Clear. Select Set to request that the next stage in the job attempts to maintain the partitioning.

• Node pool and resource constraints. Select this option to constrain parallel execution to the node pool or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from drop down lists populated from the Configuration file.


• Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node map. You can define a node map by typing node numbers into the text box or by clicking the browse button to open the Available Nodes dialog box and selecting nodes from there. You are effectively defining a new node pool for this stage (in addition to any node pools defined in the Configuration file).

Link Ordering

This tab allows you to specify which input link is regarded as the left link and which link is regarded as the right link, and which links are regarded as intermediate. By default the first link you add is regarded as the left link, and the last one as the right link, with all other links labelled as Intermediate N. You can use this tab to override the default order.

In the example, DSLink4 is the left link; to convert it into the right link, click it to select it, then click the down arrow.


Inputs Page

The Inputs page allows you to specify details about the incoming data sets. Choose an input link from the Input name drop down list to specify which link you want to work on.

The General tab allows you to specify an optional description of the input link. The Partitioning tab allows you to specify how incoming data is partitioned before being joined. The Columns tab specifies the column definitions of incoming data.

Details about Join stage partitioning are given in the following section. See Chapter 3, “Stage Editors,” for a general description of the other tabs.

Partitioning on Input Links

The Partitioning tab allows you to specify details about how the data on each of the incoming links is partitioned or collected before it is joined. It also allows you to specify that the data should be sorted before being operated on.

By default the stage partitions in Auto mode. This attempts to work out the best partitioning method depending on execution modes of current and preceding stages, whether the Preserve Partitioning option has been set, and how many nodes are specified in the Configuration file.

If the Join stage is operating in sequential mode, it will first collect the data using the default Auto collection method before processing it.

The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends on:

• Whether the Join stage is set to execute in parallel or sequential mode.

• Whether the preceding stage in the job is set to execute in parallel or sequential mode.

If the Join stage is set to execute in parallel, then you can set a partitioning method by selecting from the Partitioning mode drop-down list. This will override any current partitioning (even if the Preserve Partitioning option has been set on the previous stage in the job).

If the Join stage is set to execute in sequential mode, but the preceding stage is executing in parallel, then you can set a collection method from the Collection type drop-down list. This will override the default collection method.

The following partitioning methods are available:

• (Auto). DataStage attempts to work out the best partitioning method depending on execution modes of current and preceding stages, whether the Preserve Partitioning flag has been set on the previous stage in the job, and how many nodes are specified in the Configuration file. This is the default partitioning method for the Join stage.

• Entire. Each partition receives the entire data set.

• Hash. The records are hashed into partitions based on the value of a key column or columns selected from the Available list.

• Modulus. The records are partitioned using a modulus function on the key column selected from the Available list. This is commonly used to partition on tag fields.

• Random. The records are partitioned randomly, based on the output of a random number generator.

• Round Robin. The records are partitioned on a round robin basis as they enter the stage.

• Same. Preserves the partitioning already in place.

• DB2. Replicates the DB2 partitioning method of a specific DB2 table. Requires extra properties to be set. Access these properties by clicking the properties button.

• Range. Divides a data set into approximately equal size partitions based on one or more partitioning keys. Range partitioning is often a preprocessing step to performing a total sort on a data set. Requires extra properties to be set. Access these properties by clicking the properties button.

The following Collection methods are available:

• (Auto). DataStage attempts to work out the best collection method depending on execution modes of current and preceding stages, and how many nodes are specified in the Configuration file. This is the default collection method for Join stages.


• Ordered. Reads all records from the first partition, then all records from the second partition, and so on.


• Round Robin. Reads a record from the first input partition, then from the second partition, and so on. After reaching the last partition, the operator starts over.

• Sort Merge. Reads records in an order based on one or more columns of the record. This requires you to select a collecting key column from the Available list.

The Partitioning tab also allows you to specify that data arriving on the input link should be sorted before being joined. The sort is always carried out within data partitions. If the stage is partitioning incoming data the sort occurs after the partitioning. If the stage is collecting data, the sort occurs before the collection. The availability of sorting depends on the partitioning method chosen.

Select the check boxes as follows:

• Sort. Select this to specify that data coming in on the link should be sorted. Select the column or columns to sort on from the Available list.

• Stable. Select this if you want to preserve previously sorted data sets. This is the default.

• Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is also set, the first record is retained.

You can also specify sort direction, case sensitivity, and collating sequence for each column in the Selected list by selecting it and right-clicking to invoke the shortcut menu.

Outputs Page

The Outputs page allows you to specify details about data output from the Join stage. The Join stage can have only one output link.

The General tab allows you to specify an optional description of the output link. The Columns tab specifies the column definitions of the output data. The Mapping tab allows you to specify the relationship between the columns being input to the Join stage and the Output columns.

Details about Join stage mapping are given in the following section. See Chapter 3, “Stage Editors,” for a general description of the other tabs.


Mapping Tab

For Join stages the Mapping tab allows you to specify how the output columns are derived, i.e., what input columns map onto them or how they are generated.

The left pane shows the input columns and/or the generated columns. These are read only and cannot be modified on this tab.

The right pane shows the output columns for each link. This has a Derivations field where you can specify how the column is derived. You can fill it in by dragging input columns over, or by using the Auto-match facility.

In the above example the left pane represents the data after it has been joined. The Expression field shows how the column has been derived, the Column Name shows the column after it has been renamed by the join operation. The right pane represents the data being output by the stage after the join. In this example the data has been mapped straight across.


19. Funnel Stage

The Funnel stage is an active stage. It copies multiple input data sets to a single output data set. This operation is useful for combining separate data sets into a single large data set. The stage can have any number of input links and a single output link.

The Funnel stage can operate in one of three modes:

• Funnel combines the records of the input data in no guaranteed order. It uses a round robin method to transfer data from input links to the output link, i.e., it takes one record from each input link in turn.

• Sort Funnel combines the input records in the order defined by the value(s) of one or more key columns and the order of the output records is determined by these sorting keys.

• Sequence copies all records from the first input data set to the output data set, then all the records from the second input data set, and so on.

For all methods the meta data of all input data sets must be identical.

The sort funnel method has some particular requirements about its input data. All input data sets must be sorted by the same key columns as used by the Funnel operation.

Typically all input data sets for a sort funnel operation are hash-partitioned before they’re sorted (choosing the (auto) partitioning method will ensure that this is done). Hash partitioning guarantees that all records with the same key column values are located in the same partition and so are processed on the same node. If sorting and partitioning are carried out on separate stages before the Funnel stage, this partitioning must be preserved.


The sort funnel operation allows you to set one primary key and multiple secondary keys. The Funnel stage first examines the primary key in each input record. For multiple records with the same primary key value, it then examines secondary keys to determine the order of records it will output.
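The three modes, and the primary/secondary key ordering, can be pictured in a short Python sketch (illustrative only, with invented record tuples; DataStage's funnel is a parallel operator, not this code). The sort funnel is modeled as a merge of already-sorted inputs; extending the key to a tuple such as (primary, secondary) gives the secondary-key behavior described above:

    from heapq import merge
    from itertools import chain

    in1 = [(1, "a"), (3, "c"), (5, "e")]  # already sorted on the key column
    in2 = [(2, "b"), (4, "d"), (6, "f")]

    def round_robin_funnel(*links):
        # Plain funnel: take one record from each input link in turn.
        iters = [iter(link) for link in links]
        while iters:
            for it in list(iters):
                try:
                    yield next(it)
                except StopIteration:
                    iters.remove(it)

    print(list(round_robin_funnel(in1, in2)))         # Funnel (interleaved)
    print(list(merge(in1, in2, key=lambda r: r[0])))  # Sort Funnel
    print(list(chain(in1, in2)))                      # Sequence (link order)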

The stage editor has three pages:

• Stage page. This is always present and is used to specify general information about the stage.

• Inputs page. This is where you specify details about the data sets being combined.

• Outputs page. This is where you specify details about the combined data being output from the stage.

Stage Page

The General tab allows you to specify an optional description of the stage. The Properties tab lets you specify what the stage does. The Advanced tab allows you to specify how the stage executes. The Link Ordering tab allows you to specify the order in which the input links are processed.

Properties

The Properties tab allows you to specify properties which determine what the stage actually does. Some of the properties are mandatory, although many have default settings. Properties without default settings appear in the warning color (red by default) and turn black when you supply a value for them.

The following table gives a quick reference list of the properties and their attributes. A more detailed description of each property follows.

Category/Property | Values | Default | Mandatory? | Repeats? | Dependent of
Options/Funnel Type | Funnel/Sequence/Sort Funnel | Funnel | Y | N | N/A
Sorting Keys/Key | Input column | N/A | Y (if Funnel Type = Sort Funnel) | Y | N/A
Sorting Keys/Sort Order | Ascending/Descending | Ascending | Y (if Funnel Type = Sort Funnel) | N | Key
Sorting Keys/Nulls position | First/Last | First | Y (if Funnel Type = Sort Funnel) | N | Key
Sorting Keys/Case Sensitive | True/False | True | N | N | Key
Sorting Keys/Character Set | ASCII/EBCDIC | ASCII | N | N | Key


Options Category

Funnel Type. Specifies the type of Funnel operation. Choose from:

• Funnel
• Sequence
• Sort Funnel

The default is Funnel.

Sorting Keys Category

Key. This property is only required for Sort Funnel operations. Specify the key column that the sort will be carried out on. The first column you specify is the primary key; you can add multiple secondary keys by repeating the Key property.

Key has the following dependent properties:

• Sort Order

Choose Ascending or Descending. The default is Ascending.


• Nulls position

By default columns containing null values appear first in the funneled data set. To override this default so that columns containing null values appear last in the funneled data set, select Last.

• Character Set

By default data is represented in the ASCII character set. To represent data in the EBCDIC character set, choose EBCDIC.

• Case Sensitive

Use this to specify whether each key is case sensitive or not. It is set to True by default; that is, the values “CASE” and “case” would not be judged equivalent.

Advanced Tab

This tab allows you to specify the following:

• Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the input data is processed by the available nodes as specified in the Configuration file, and by any node constraints specified on the Advanced tab. In Sequential mode the entire data set is processed by the conductor node.

• Preserve partitioning. This is Propagate by default. It adopts the setting which results from ORing the settings of the input stages, i.e., if any of the input stages uses Set then this stage will use Set. You can explicitly select Set or Clear. Select Set to request that the next stage in the job attempts to maintain the partitioning.

• Node pool and resource constraints. Select this option to constrain parallel execution to the node pool or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from drop down lists populated from the Configuration file.

• Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node map. You can define a node map by typing node numbers into the text box or by clicking the browse button to open the Available Nodes dialog box and selecting nodes from there. You are effectively defining a new node pool for this stage (in addition to any node pools defined in the Configuration file).


Link Ordering

This tab allows you to specify the order in which links input to the Funnel stage are processed. This is only relevant if you have chosen the Sequence Funnel Type.

By default the input links will be processed in the order they were added. To rearrange them, choose an input link and click the up arrow button or the down arrow button.

Inputs Page

The Inputs page allows you to specify details about the incoming data sets. Choose an input link from the Input name drop down list to specify which link you want to work on.

The General tab allows you to specify an optional description of the input link. The Partitioning tab allows you to specify how incoming data is partitioned before being funneled. The Columns tab specifies the column definitions of incoming data.


Details about Funnel stage partitioning are given in the following section. See Chapter 3, “Stage Editors,” for a general description of the other tabs.


Partitioning on Input Links

The Partitioning tab allows you to specify details about how the data on each of the incoming links is partitioned or collected before it is funneled. It also allows you to specify that the data should be sorted before being operated on.

By default the stage partitions in Auto mode. This attempts to work out the best partitioning method depending on execution modes of current and preceding stages, whether the Preserve Partitioning option has been set, and how many nodes are specified in the Configuration file.

If the Funnel stage is operating in sequential mode, it will first collect the data using the default Auto collection method before processing it.

The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends on:

• Whether the Funnel stage is set to execute in parallel or sequential mode.

• Whether the preceding stage in the job is set to execute in parallel or sequential mode.

If the Funnel stage is set to execute in parallel, then you can set a partitioning method by selecting from the Partitioning mode drop-down list. This will override any current partitioning (even if the Preserve Partitioning option has been set on the previous stage in the job).

If you are using the Sort Funnel method, and haven’t partitioned the data in a previous stage, you should hash partition it by choosing the Hash partition method on this tab.

If the Funnel stage is set to execute in sequential mode, but the preceding stage is executing in parallel, then you can set a collection method from the Collection type drop-down list. This will override the default collection method.

The following partitioning methods are available:

• (Auto). DataStage attempts to work out the best partitioning method depending on execution modes of current and preceding stages, whether the Preserve Partitioning flag has been set on the previous stage in the job, and how many nodes are specified in the Configuration file. This is the default partitioning method for the Funnel stage.

• Entire. Each partition receives the entire data set.


• Hash. The records are hashed into partitions based on the value of a key column or columns selected from the Available list.

• Modulus. The records are partitioned using a modulus function on the key column selected from the Available list. This is commonly used to partition on tag fields.

• Random. The records are partitioned randomly, based on the output of a random number generator.

• Round Robin. The records are partitioned on a round robin basis as they enter the stage.

• Same. Preserves the partitioning already in place.

• DB2. Replicates the DB2 partitioning method of a specific DB2 table. Requires extra properties to be set. Access these properties by clicking the properties button.

• Range. Divides a data set into approximately equal size partitions based on one or more partitioning keys. Range partitioning is often a preprocessing step to performing a total sort on a data set. Requires extra properties to be set. Access these properties by clicking the properties button.

The following Collection methods are available:

• (Auto). DataStage attempts to work out the best collection method depending on execution modes of current and preceding stages, and how many nodes are specified in the Configuration file. This is the default collection method for Funnel stages.

• Ordered. Reads all records from the first partition, then all records from the second partition, and so on.

• Round Robin. Reads a record from the first input partition, then from the second partition, and so on. After reaching the last partition, the operator starts over.

• Sort Merge. Reads records in an order based on one or more columns of the record. This requires you to select a collecting key column from the Available list.

The Partitioning tab also allows you to specify that data arriving on the input link should be sorted before being funneled. The sort is always carried out within data partitions. If the stage is partitioning incoming data the sort occurs after the partitioning. If the stage is collecting data, the sort occurs before the collection.


If you are using the Sort Funnel method, and haven’t sorted the data in a previous stage, you should sort it here using the same keys that the data is hash partitioned on and funneled on. The availability of sorting depends on the partitioning method chosen.

Select the check boxes as follows:

• Sort. Select this to specify that data coming in on the link should be sorted. Select the column or columns to sort on from the Available list.

• Stable. Select this if you want to preserve previously sorted data sets. This is the default.

• Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is also set, the first record is retained.

You can also specify sort direction, case sensitivity, and collating sequence for each column in the Selected list by selecting it and right-clicking to invoke the shortcut menu.

Outputs Page

The Outputs page allows you to specify details about data output from the Funnel stage. The Funnel stage can have only one output link.

The General tab allows you to specify an optional description of the output link. The Columns tab specifies the column definitions of the output data. The Mapping tab allows you to specify the relationship between the columns being input to the Funnel stage and the Output columns.

Details about Funnel stage mapping are given in the following section. See Chapter 3, “Stage Editors,” for a general description of the other tabs.


Mapping Tab

For Funnel stages the Mapping tab allows you to specify how the output columns are derived, i.e., what input columns map onto them or how they are generated.

The left pane shows the input columns. These are read only and cannot be modified on this tab. It is a requirement of the Funnel stage that all input links have identical meta data, so only one set of column definitions is shown.

The right pane shows the output columns for each link. This has a Derivations field where you can specify how the column is derived. You can fill it in by dragging input columns over, or by using the Auto-match facility.

In the above example the left pane represents the incoming data after it has been funneled. The right pane represents the data being output by the stage after the funnel operation. In this example the data has been mapped straight across.


20. Lookup Stage

The Lookup stage is an active stage. It is used to perform lookup operations on a lookup table contained in a Lookup File Set stage (see Chapter 7, “Lookup File Set Stage”) or provided by one of the database stages that support reference output links (see Chapter 12 and Chapter 13). It can also perform a lookup on a data set read into memory from any other parallel job stage that can output data.

The Lookup stage can have a reference link, a single input link, a single output link, and a single rejects link. Depending upon the type and setting of the stage(s) providing the lookup information, it can have multiple reference links (where it is directly looking up a DB2 table or Oracle table, it can have only a single reference link).

The input link carries the data from the source data set and is known as the primary link.

For each record of the source data set from the input link, the Lookup stage performs a table lookup on each of the lookup tables attached by reference links. The table lookup is based on the values of a set of lookup key columns, one set for each table. For in-memory lookups, the keys are defined on the Lookup stage (in the Inputs page Properties tab). For lookups of data accessed through other stages, the keys are defined in that stage (i.e., the Lookup File Set stage, or the Oracle and DB2 stages in sparse lookup mode).

Each record of the output data set contains all of the columns from a source record plus columns from all the corresponding lookup records where corresponding source and lookup records have the same value for the lookup key columns.


The optional reject link carries source records that do not have a corresponding entry in the input lookup tables.


For example, you could have an input data set carrying names and addresses of your U.S. customers. The data as presented identifies state as a two-letter U.S. state postal code, but you want the data to carry the full name of the state. You could define a lookup table that carries a list of codes matched to states, defining the code as the key column. As the Lookup stage reads each line, it uses the key to look up the state in the lookup table. It adds the state to a new column defined for the output link, and so the full state name is added to each address. If any state codes have been incorrectly entered in the data set, the code will not be found in the lookup table, and so that record will be rejected.
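A minimal Python sketch of this example (synthetic data and invented names; If Not Found = Reject is modeled by routing unmatched records to a second list):

    states = {"MA": "Massachusetts", "CA": "California"}  # lookup table

    customers = [("Alice", "MA"), ("Bob", "CA"), ("Carol", "ZZ")]

    output, rejects = [], []
    for name, code in customers:
        full_name = states.get(code)      # table lookup on the key column
        if full_name is None:
            rejects.append((name, code))  # If Not Found = Reject
        else:
            output.append((name, code, full_name))  # source plus lookup column

    print(output)   # [('Alice', 'MA', 'Massachusetts'), ('Bob', 'CA', 'California')]
    print(rejects)  # [('Carol', 'ZZ')]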

The stage editor has three pages:

• Stage page. This is always present and is used to specify general information about the stage.

• Inputs page. This is where you specify details about the incoming data and the reference links.

• Outputs page. This is where you specify details about the data being output from the stage.

Stage Page

The General tab allows you to specify an optional description of the stage. The Properties tab lets you specify what the stage does. The Advanced tab allows you to specify how the stage executes. The Link Ordering tab allows you to specify the order in which the input links are processed.

Properties

The Properties tab allows you to specify properties which determine what the stage actually does. Some of the properties are mandatory, although many have default settings. Properties without default settings appear in the warning color (red by default) and turn black when you supply a value for them.


The following table gives a quick reference list of the properties and their attributes. A more detailed description of each property follows.

Category/Property | Values | Default | Mandatory? | Repeats? | Dependent of
Options/If Not Found | Fail/Continue/Drop/Reject | Fail | Y | N | N/A

Options Category

If Not Found. This property specifies the action to take if the lookup value is not found in the lookup table. Choose from:

• Fail. This is the default; failure to find a value in the lookup table or tables causes the job to fail.

• Continue. The stage adds the offending record to its output and continues.

• Drop. The stage drops the offending record and continues.

• Reject. The offending record is sent to the reject link.

Advanced Tab

This tab allows you to specify the following:

• Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the input data is processed by the available nodes as specified in the Configuration file, and by any node constraints specified on the Advanced tab. In Sequential mode the entire data set is processed by the conductor node.

• Preserve partitioning. This is Propagate by default. It adopts the setting of the previous stage on the stream link. You can explicitly select Set or Clear. Select Set to request that the next stage in the job should attempt to maintain the partitioning.

• Node pool and resource constraints. Select this option to constrain parallel execution to the node pool or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from drop down lists populated from the Configuration file.

• Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node map. You can define a node map by typing node numbers into the text box or by clicking the browse button to open the Available Nodes dialog box and selecting nodes from there. You are effectively defining a new node pool for this stage (in addition to any node pools defined in the Configuration file).

Link Ordering

This tab allows you to specify which input link is the primary link and the order in which the reference links are processed.

By default the input links will be processed in the order they were added. To rearrange them, choose an input link and click the up arrow button or the down arrow button.

Inputs Page

The Inputs page allows you to specify details about the incoming data set and the reference links. Choose a link from the Input name drop down list to specify which link you want to work on.


The General tab allows you to specify an optional description of the link. The Partitioning tab allows you to specify how incoming data on the source data set link is partitioned. The Columns tab specifies the column definitions of incoming data.

Details about Lookup stage partitioning are given in the following section. See Chapter 3, “Stage Editors,” for a general description of the other tabs.

Input Link Properties

Where the Lookup stage is performing in-memory lookups, the Inputs page has a Properties tab. At a minimum this allows you to define the lookup keys. Depending on the source of the reference link, other properties may be specified on this link.

The properties most commonly set on this tab are as follows:

Category/Property                Values        Default  Mandatory?  Repeats?  Dependent of
Lookup Keys/Key                  Input column  N/A      Y           Y         N/A
Lookup Keys/Case Sensitive       True/False    True     N           N         Key
Options/Allow Duplicates         True/False    False    Y           N         N/A
Options/Diskpool                 string        N/A      N           N         N/A
Options/Save to Lookup File Set  pathname      N/A      N           N         N/A

Lookup Keys Category

Key. Specifies the name of a lookup key column. The Key property must be repeated if there are multiple key columns. The property has a dependent property, Case Sensitive.

Case Sensitive. This is a dependent property of Key and specifies whether the parent key is case sensitive or not. Set to true by default.

Options Category

Allow Duplicates. Set this to True to cause multiple copies of duplicate records to be saved in the lookup table without a warning being issued. Two lookup records are duplicates when all lookup key columns have the same value in the two records. If you do not specify this option, DataStage issues a warning message when it encounters duplicate records and discards all but the first of the matching records.
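
As a rough sketch of this behavior (a hypothetical helper, not DataStage code):

    def build_table(lookup_records, key_cols, allow_duplicates=False):
        """Build an in-memory lookup table keyed on the lookup key columns."""
        table = {}
        for rec in lookup_records:
            k = tuple(rec[c] for c in key_cols)
            if k not in table:
                table[k] = [rec]              # first record with this key
            elif allow_duplicates:
                table[k].append(rec)          # keep every duplicate copy
            else:
                print(f"warning: duplicate key {k}, record discarded")
        return table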

Diskpool. This is an optional property. Specify the name of the disk pool into which to write the table or file set. You can also specify a job parameter.

Save to Lookup File Set. Allows you to specify a lookup file set to save the lookup data.

Partitioning on Input Links

The Partitioning tab allows you to specify details about how the incoming data is partitioned or collected before the lookup is performed. It also allows you to specify that the data should be sorted before the lookup. Note that you cannot specify partitioning or sorting on the reference links; this is specified in their source stage.

By default the stage uses the auto partitioning method. If the Preserve Partitioning option has been set on the previous stage in the job the stage will attempt to preserve the partitioning of the incoming data.

If the Lookup stage is operating in sequential mode, it will first collect the data using the default auto collection method before performing the lookup.

The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends on:

• Whether the Lookup stage is set to execute in parallel or sequential mode.

• Whether the preceding stage in the job is set to execute in parallel or sequential mode.

If the Lookup stage is set to execute in parallel, then you can set a partitioning method by selecting from the Partitioning mode drop-down list. This will override any current partitioning (even if the Preserve Partitioning option has been set by the previous stage in the job).

If the Lookup stage is set to execute in sequential mode, but the preceding stage is executing in parallel, then you can set a collection method from the Collection type drop-down list. This will override the default auto collection method.

The following partitioning methods are available:

• (Auto). DataStage attempts to work out the best partitioning method depending on execution modes of current and preceding stages, whether the Preserve Partitioning flag has been set on the previous stage in the job, and how many nodes are specified in the Configuration file. This is the default method for the Lookup stage.

• Entire. Each file written to receives the entire data set.

• Hash. The records are hashed into partitions based on the value of a key column or columns selected from the Available list (this method and Modulus are illustrated in the sketch after this list).

• Modulus. The records are partitioned using a modulus function on the key column selected from the Available list. This is commonly used to partition on tag fields.

• Random. The records are partitioned randomly, based on the output of a random number generator.

• Round Robin. The records are partitioned on a round robin basis as they enter the stage.

• Same. Preserves the partitioning already in place.

• DB2. Replicates the DB2 partitioning method of a specific DB2 table. Requires extra properties to be set. Access these properties by clicking the properties button.

• Range. Divides a data set into approximately equal size partitions based on one or more partitioning keys. Range partitioning is often a preprocessing step to performing a total sort on a data set. Requires extra properties to be set. Access these properties by clicking the properties button.
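
The following sketch suggests how the Hash and Modulus methods might assign a record to one of N partitions. The hash function chosen here is illustrative; it is not the hashing DataStage uses internally:

    import zlib

    def hash_partition(rec, key_cols, n_partitions):
        """Hash method: hash the key column values, then take modulo N."""
        key = "|".join(str(rec[c]) for c in key_cols).encode()
        return zlib.crc32(key) % n_partitions

    def modulus_partition(rec, key_col, n_partitions):
        """Modulus method: integer key value modulo the partition count."""
        return rec[key_col] % n_partitions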

The following Collection methods are available:

• (Auto). DataStage attempts to work out the best collection method depending on execution modes of current and preceding stages, and how many nodes are specified in the Configuration file. This is the default collection method for the Lookup stage.

• Ordered. Reads all records from the first partition, then all records from the second partition, and so on.

• Round Robin. Reads a record from the first input partition, then from the second partition, and so on. After reaching the last partition, the operator starts over.

• Sort Merge. Reads records in an order based on one or more columns of the record. This requires you to select a collecting key column from the Available list.
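
Sort Merge collection behaves like a k-way merge of already-sorted partitions, roughly as in this sketch (the record data is illustrative):

    import heapq
    from operator import itemgetter

    # One sorted list of records per partition.
    partitions = [
        [{"id": 1}, {"id": 4}],
        [{"id": 2}, {"id": 3}],
    ]

    # Sort Merge reads records so the combined stream stays ordered on the key.
    collected = list(heapq.merge(*partitions, key=itemgetter("id")))
    # Ordered collection would instead read partitions[0], then partitions[1].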

The Partitioning tab also allows you to specify that data arriving on the input link should be sorted before the lookup is performed. The sort is always carried out within data partitions. If the stage is partitioning incoming data the sort occurs after the partitioning. If the stage is collecting data, the sort occurs before the collection. The availability of sorting depends on the partitioning method chosen.

Select the check boxes as follows:

• Sort. Select this to specify that data coming in on the link should be sorted. Select the column or columns to sort on from the Available list.

• Stable. Select this if you want to preserve previously sorted data sets. This is the default.

• Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is also set, the first record is retained.

You can also specify sort direction, case sensitivity, and collating sequence for each column in the Selected list by selecting it and right-clicking to invoke the shortcut menu.

Outputs Page

The Outputs page allows you to specify details about data output from the Lookup stage. The Lookup stage can have only one output link. It can also have a single reject link, where records can be sent if the lookup fails. The Output Link drop-down list allows you to choose whether you are looking at details of the main output link (the stream link) or the reject link.


The General tab allows you to specify an optional description of the output link. The Columns tab specifies the column definitions of incoming data. The Mapping tab allows you to specify the relationship between the columns being input to the Lookup stage and the Output columns.

Details about Lookup stage mapping are given in the following section. See Chapter 3, “Stage Editors,” for a general description of the other tabs.

Reject Link Properties

You cannot change the properties of a Reject link. You cannot edit the column definitions for a reject link. The link uses the column definitions for the primary input link.

Mapping Tab

For Lookup stages the Mapping tab allows you to specify how the output columns are derived, i.e., what input columns map onto them or how they are generated.

The left pane shows the lookup columns. These are read only and cannot be modified on this tab. This shows the meta data from the primary input link and the reference input links. If a given lookup column appears in more than one lookup table, only one occurrence of the column will appear in the left pane.


The right pane shows the output columns for the output link. This has a Derivations field where you can specify how the column is derived. You can fill it in by dragging input columns over, or by using the Auto-match facility.

In the above example the left pane represents the data after the lookup has been performed. The right pane represents the data being output by the stage after the lookup operation. In this example the data has been mapped straight across.


21. Sort Stage

The Sort stage is an active stage. It is used to perform more complex sort operations than can be provided for on the Input page Partitioning tab of parallel job stage editors. You can also use it to insert a more explicit simple sort operation where you want to make your job easier to understand. The Sort stage has a single input link which carries the data to be sorted, and a single output link carrying the sorted data.

The stage editor has three pages:

• Stage page. This is always present and is used to specify general information about the stage.

• Inputs page. This is where you specify details about the data sets being sorted.

• Outputs page. This is where you specify details about the sorted data being output from the stage.

Stage Page

The General tab allows you to specify an optional description of the stage. The Properties tab lets you specify what the stage does. The Advanced tab allows you to specify how the stage executes.

Properties

The Properties tab allows you to specify properties which determine what the stage actually does. Some of the properties are mandatory, although many have default settings. Properties without default settings appear in the warning color (red by default) and turn black when you supply a value for them.


The following table gives a quick reference list of the properties and their attributes. A more detailed description of each property follows.

Category/Property                          Values                                 Default    Mandatory?  Repeats?  Dependent of
Sorting Keys/Key                           Input Column                           N/A        Y           Y         N/A
Sorting Keys/Sort Order                    Ascending/Descending                   Ascending  Y           N         Key
Sorting Keys/Nulls position (only          First/Last                             First      N           N         Key
  available for Sort Utility = DataStage)
Sorting Keys/Collating Sequence            ASCII/EBCDIC                           ASCII      Y           N         Key
Sorting Keys/Case Sensitive                True/False                             True       N           N         Key
Sorting Keys/Sort Key Mode (only           Sort/Don’t Sort (Previously Grouped)/  Sort       Y           N         Key
  available for Sort Utility = DataStage)    Don’t Sort (Previously Sorted)
Options/Sort Utility                       DataStage/SyncSort/UNIX                DataStage  Y           N         N/A
Options/Stable Sort                        True/False                             True for Sort Utility = DataStage,
                                                                                  False otherwise
                                                                                             Y           N         N/A
Options/Allow Duplicates (not              True/False                             True       Y           N         N/A
  available for Sort Utility = UNIX)
Options/Output Statistics                  True/False                             False      Y           N         N/A
Options/Create Cluster Key Change          True/False                             False      N           N         N/A
  Column (only available for Sort
  Utility = DataStage)
Options/Create Key Change Column           True/False                             False      N           N         N/A
Options/Restrict Memory Usage              number MB                              20         N           N         N/A
Options/SyncSort Extra Options             string                                 N/A        N           N         N/A
Options/Workspace                          string                                 N/A        N           N         N/A

Sorting Keys Category

Key. Specifies the key column for sorting. This property can be repeated to specify multiple key columns. Key has dependent properties depending on the Sort Utility chosen:

• Sort Order

All sort types. Choose Ascending or Descending. The default is Ascending.

• Nulls position

This property appears for sort type DataStage and is optional. By default columns containing null values appear first in the sorted data set. To override this default so that columns containing null values appear last in the sorted data set, select Last.

• Collating Sequence

All sort types. By default data is set to ASCII. You can also choose EBCDIC.

• Case Sensitive

All sort types. This property is optional. Use this to specify whether each group key is case sensitive or not. This is set to True by default; i.e., the values “CASE” and “case” would not be judged equivalent.

• Sort Key Mode

This property appears for sort type DataStage. It is set to Sort by default and this sorts on all the specified key columns.

Set to Don’t Sort (Previously Sorted) to specify that input records are already sorted by this column. The Sort stage will then sort on secondary key columns, if any. This option can increase the speed of the sort and reduce the amount of temporary disk space when your records are already sorted by the primary key column(s) because you only need to sort your data on the secondary key column(s).

Set to Don’t Sort (Previously Grouped) to specify that input records are already grouped by this column, but not sorted. The operator will then sort on any secondary key columns. This option is useful when your records are already grouped by the primary key column(s), but not necessarily sorted, and you want to sort your data only on the secondary key column(s) within each group.
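
A sketch of what the two Don’t Sort modes amount to (record layout and names are illustrative): the stage skips the primary sort and orders records on the secondary key within each run of equal primary-key values:

    from itertools import groupby
    from operator import itemgetter

    def sort_within_groups(records, primary, secondary):
        """records are already sorted (or grouped) on `primary`;
        sort each adjacent group on `secondary` only."""
        out = []
        for _, group in groupby(records, key=itemgetter(primary)):
            out.extend(sorted(group, key=itemgetter(secondary)))
        return out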

Options Category

Sort Utility. The type of sort the stage will carry out. Choose from:

• DataStage. The default. This uses the built-in DataStage sorter; you do not require any additional software to use this option.

• SyncSort. This specifies that the SyncSort utility (UNIX version, Release 1) is used to perform the sort.

• UNIX. This specifies that the UNIX sort command is used to perform the sort.

Stable Sort. Applies to a Sort Utility of DataStage or SyncSort; the default is True. Set to True to guarantee that this sort operation will not rearrange records that are already in a properly sorted data set. If set to False, no prior ordering of records is guaranteed to be preserved by the sorting operation.


Allow Duplicates. Set to True by default. If False, specifies that, if multiple records have identical sorting key values, only one record is retained. If Stable Sort is True, then the first record is retained. This property is not available for the UNIX sort type.

Output Statistics. Set to False by default. If set to True, the sort operation outputs statistics. This property is not available for the UNIX sort type.

Create Cluster Key Change Column. This property appears for sort type DataStage and is optional. It is set False by default. If set True it tells the Sort stage to create the column clusterKeyChange in each output record. The clusterKeyChange column is set to 1 for the first record in each group where groups are defined by using a Sort Key Mode of Don’t Sort (Previously Sorted) or Don’t Sort (Previously Grouped). Subsequent records in the group have the clusterKeyChange column set to 0.

Create Key Change Column. This property appears for sort type DataStage and is optional. It is set False by default. If set True it tells the Sort stage to create the column KeyChange in each output record. The KeyChange column is set to 1 for the first record in each group where the value of the sort key changes. Subsequent records in the group have the KeyChange column set to 0.
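
A sketch of the key-change computation (the helper name is illustrative):

    def add_key_change(records, key_col, out_col="KeyChange"):
        """Set out_col to 1 on the first record of each group where the
        sort key value changes, and to 0 on the rest of the group."""
        prev = object()                   # sentinel that equals no real key
        for rec in records:
            rec[out_col] = 1 if rec[key_col] != prev else 0
            prev = rec[key_col]
        return records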

Restrict Memory Usage. This is set to 20 by default. It causes the Sort stage to restrict itself to the specified number of megabytes of virtual memory on a processing node.

We recommend that the number of megabytes specified is smaller than the amount of physical memory on a processing node.

Workspace. This property appears for sort type SyncSort and UNIX only. Optionally specifies the workspace used by the stage.

SyncSort Extra Options. This property appears for sort type SyncSort and is optional. It allows you to specify arguments that are passed on the command line to SyncSort. You can use a job parameter if required.


Advanced Tab

This tab allows you to specify the following:

• Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the input data is processed by the available nodes as specified in the Configuration file, and by any node constraints specified on the Advanced tab. In Sequential mode the entire data set is processed by the conductor node.

• Preserve partitioning. This is Set by default. You can explicitly select Set or Clear. Select Set to request that the next stage in the job should attempt to maintain the partitioning.

• Node pool and resource constraints. Select this option to constrain parallel execution to the node pool or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from drop down lists populated from the Configuration file.

• Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node map. You can define a node map by typing node numbers into the text box or by clicking the browse button to open the Available Nodes dialog box and selecting nodes from there. You are effectively defining a new node pool for this stage (in addition to any node pools defined in the Configuration file).

Inputs Page

The Inputs page allows you to specify details about the data coming in to be sorted. The Sort stage can have only one input link.

The General tab allows you to specify an optional description of the link. The Partitioning tab allows you to specify how incoming data on the source data set link is partitioned. The Columns tab specifies the column definitions of incoming data.

Details about Sort stage partitioning are given in the following section. See Chapter 3, “Stage Editors,” for a general description of the other tabs.

Partitioning on Input Links

The Partitioning tab allows you to specify details about how the incoming data is partitioned or collected before the sort is performed.


By default the stage uses the auto partitioning method. If the Preserve Partitioning option has been set on the previous stage in the job the stage will attempt to preserve the partitioning of the incoming data.

If the Sort stage is operating in sequential mode, it will first collect the data using the default auto collection method before performing the sort.

The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends on:

• Whether the Sort stage is set to execute in parallel or sequential mode.

• Whether the preceding stage in the job is set to execute in parallel or sequential mode.

If the Sort stage is set to execute in parallel, then you can set a partitioning method by selecting from the Partitioning mode drop-down list. This will override any current partitioning (even if the Preserve Partitioning option has been set by the previous stage in the job).

If the Sort stage is set to execute in sequential mode, but the preceding stage is executing in parallel, then you can set a collection method from the Collection type drop-down list. This will override the default auto collection method.

The following partitioning methods are available:

• (Auto). DataStage attempts to work out the best partitioning method depending on execution modes of current and preceding stages, whether the Preserve Partitioning flag has been set on the previous stage in the job, and how many nodes are specified in the Configuration file. This is the default method for the Sort stage.

• Entire. Each file written to receives the entire data set.

• Hash. The records are hashed into partitions based on the value of a key column or columns selected from the Available list.

• Modulus. The records are partitioned using a modulus function on the key column selected from the Available list. This is commonly used to partition on tag fields.

• Random. The records are partitioned randomly, based on the output of a random number generator.


• Round Robin. The records are partitioned on a round robin basis as they enter the stage.


• Same. Preserves the partitioning already in place.

• DB2. Replicates the DB2 partitioning method of a specific DB2 table. Requires extra properties to be set. Access these properties by clicking the properties button.

• Range. Divides a data set into approximately equal size partitions based on one or more partitioning keys. Range partitioning is often a preprocessing step to performing a total sort on a data set. Requires extra properties to be set. Access these properties by clicking the properties button.

The following Collection methods are available:

• (Auto). DataStage attempts to work out the best collection method depending on execution modes of current and preceding stages, and how many nodes are specified in the Configuration file. This is the default collection method for the Sort stage.

• Ordered. Reads all records from the first partition, then all records from the second partition, and so on.

• Round Robin. Reads a record from the first input partition, then from the second partition, and so on. After reaching the last partition, the operator starts over.

• Sort Merge. Reads records in an order based on one or more columns of the record. This requires you to select a collecting key column from the Available list.

The Partitioning tab also allows you to specify that data arriving on the input link should be sorted before the Sort is performed. This is a standard feature of the stage editors; if you make use of it you will be running a simple sort before the main Sort operation that the stage provides. The sort is always carried out within data partitions. If the stage is partitioning incoming data the sort occurs after the partitioning. If the stage is collecting data, the sort occurs before the collection. The availability of sorting depends on the partitioning method chosen.

Select the check boxes as follows:

• Sort. Select this to specify that data coming in on the link should be sorted. Select the column or columns to sort on from the Available list.


• Stable. Select this if you want to preserve previously sorted data sets. This is the default.


• Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is also set, the first record is retained.

You can also specify sort direction, case sensitivity, and collating sequence for each column in the Selected list by selecting it and right-clicking to invoke the shortcut menu.

Outputs Page

The Outputs page allows you to specify details about data output from the Sort stage. The Sort stage can have only one output link.

The General tab allows you to specify an optional description of the output link. The Columns tab specifies the column definitions of incoming data. The Mapping tab allows you to specify the relationship between the columns being input to the Sort stage and the Output columns.

Details about Sort stage mapping are given in the following section. See Chapter 3, “Stage Editors,” for a general description of the other tabs.

Mapping Tab

For Sort stages the Mapping tab allows you to specify how the output columns are derived, i.e., what input columns map onto them.


The left pane shows the columns of the sorted data. These are read only and cannot be modified on this tab. This shows the meta data from the input link.

The right pane shows the output columns for the output link. This has a Derivations field where you can specify how the column is derived. You can fill it in by dragging input columns over, or by using the Auto-match facility.

In the above example the left pane represents the incoming data after the sort has been performed. The right pane represents the data being output by the stage after the sort operation. In this example the data has been mapped straight across.


22. Merge Stage

The Merge stage is an active stage. It can have any number of input links, a single output link, and the same number of reject links as there are input links.

The Merge stage combines a sorted master data set with one or more sorted update data sets. The columns from the records in the master and update data sets are merged so that the output record contains all the columns from the master record plus any additional columns from each update record.

A master record and an update record are merged only if both of them have the same values for the merge key column(s) that you specify. Merge key columns are one or more columns that exist in both the master and update records. As part of preprocessing your data for the Merge stage, you first sort the input data sets and remove duplicate records from the master data set. If you have more than one update data set, you must remove duplicate records from the update data sets as well. This chapter describes how to use the Merge stage. See Chapter 21 for information about the Sort stage and Chapter 23 for information about the Remove Duplicates stage.
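
As a rough sketch of these semantics, assuming the preprocessing above has been done (names and data structures are illustrative, not DataStage code):

    def merge(master, updates, key):
        """master and each update data set are sorted on `key` and have had
        duplicates removed. Unmatched update records go to one reject list
        per update link; unmatched masters are kept here (Reject Masters
        Mode = Keep)."""
        update_maps = [{u[key]: u for u in upd} for upd in updates]
        merged = []
        for m in master:
            rec = dict(m)
            for umap in update_maps:
                rec.update(umap.get(m[key], {}))  # add update columns on a match
            merged.append(rec)
        master_keys = {m[key] for m in master}
        rejects = [[u for u in upd if u[key] not in master_keys]
                   for upd in updates]
        return merged, rejects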

The stage editor has three pages:

• Stage page. This is always present and is used to specify general information about the stage.

• Inputs page. This is where you specify details about the data sets being merged.

• Outputs page. This is where you specify details about the merged data being output from the stage and about the reject links.


Stage Page

The General tab allows you to specify an optional description of the stage. The Properties tab lets you specify what the stage does. The Advanced tab allows you to specify how the stage executes.

Properties

The Properties tab allows you to specify properties which determine what the stage actually does. Some of the properties are mandatory, although many have default settings. Properties without default settings appear in the warning color (red by default) and turn black when you supply a value for them.

The following table gives a quick reference list of the properties and their attributes. A more detailed description of each property follows.

Category/Property               Values                Default    Mandatory?  Repeats?  Dependent of
Merge Keys/Key                  Input Column          N/A        Y           Y         N/A
Merge Keys/Sort Order           Ascending/Descending  Ascending  Y           N         Key
Merge Keys/Nulls position       First/Last            First      N           N         Key
Merge Keys/Character Set        ASCII/EBCDIC          ASCII      Y           N         Key
Merge Keys/Case Sensitive       True/False            True       N           N         Key
Options/Reject Masters Mode     Keep/Drop             Keep       Y           N         N/A
Options/Warn On Reject Masters  True/False            True       Y           N         N/A
Options/Warn On Reject Updates  True/False            True       Y           N         N/A

Merge Keys Category

Key. This specifies the key column you are merging on. Repeat the property to specify multiple keys. Key has the following dependent properties:

• Sort Order

Choose Ascending or Descending. The default is Ascending.

• Nulls position

By default columns containing null values appear first in the merged data set. To override this default so that columns containing null values appear last in the merged data set, select Last.

• Character Set

By default data is represented in the ASCII character set. To represent data in the EBCDIC character set, choose EBCDIC.

• Case Sensitive

Use this to specify whether each merge key is case sensitive or not. This is set to True by default; i.e., the values “CASE” and “case” would not be judged equivalent.

Options Category

Reject Masters Mode. Set to Keep by default. It specifies that rejected rows from the master link are output to the merged data set. Set to Drop to specify that rejected records are dropped instead.

Warn On Reject Masters. Set to True by default. This will warn you when bad records from the master link are rejected. Set it to False to receive no warnings.

Warn On Reject Updates. Set to True by default. This will warn you when bad records from any update links are rejected. Set it to False to receive no warnings.

Advanced Tab

This tab allows you to specify the following:

• Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the input data is processed by the available nodes as specified in the Configuration file, and by any node constraints specified on the Advanced tab. In Sequential mode the entire data set is processed by the conductor node.


• Preserve partitioning. This is Propagate by default. It adopts the setting which results from ORing the settings of the input stages, i.e., if any of the input stages uses Set then this stage will use Set. You can explicitly select Set or Clear. Select Set to request that the next stage in the job attempts to maintain the partitioning.

• Node pool and resource constraints. Select this option to constrain parallel execution to the node pool or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from drop down lists populated from the Configuration file.

• Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node map. You can define a node map by typing node numbers into the text box or by clicking the browse button to open the Available Nodes dialog box and selecting nodes from there. You are effectively defining a new node pool for this stage (in addition to any node pools defined in the Configuration file).

Link Ordering

This tab allows you to specify which of the input links is the master link and the order in which links input to the Merge stage are processed. You can also specify which of the output links is the master link, and which of the reject links corresponds to which of the incoming update links.

By default the links will be processed in the order they were added. To rearrange them, choose an input link and click the up arrow button or the down arrow button.

Inputs Page

The Inputs page allows you to specify details about the data coming in to be merged. Choose an input link from the Input name drop down list to specify which link you want to work on.

The General tab allows you to specify an optional description of the link. The Partitioning tab allows you to specify how incoming data on the source data set link is partitioned. The Columns tab specifies the column definitions of incoming data.

Details about Merge stage partitioning are given in the following section. See Chapter 3, “Stage Editors,” for a general description of the other tabs.


Partitioning on Input Links

The Partitioning tab allows you to specify details about how the incoming data is partitioned or collected before the merge is performed.

By default the stage uses the auto partitioning method. If the Preserve Partitioning option has been set on the previous stage in the job, this stage will attempt to preserve the partitioning of the incoming data.

If the Merge stage is operating in sequential mode, it will first collect the data using the default auto collection method before performing the merge.

The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends on:

• Whether the Merge stage is set to execute in parallel or sequential mode.

• Whether the preceding stage in the job is set to execute in parallel or sequential mode.

If the Merge stage is set to execute in parallel, then you can set a partitioning method by selecting from the Partitioning mode drop-down list. This will override any current partitioning (even if the Preserve Partitioning option has been set on the previous stage).

If the Merge stage is set to execute in sequential mode, but the preceding stage is executing in parallel, then you can set a collection method from the Collection type drop-down list. This will override the default auto collection method.

The following partitioning methods are available:

• (Auto). DataStage attempts to work out the best partitioning method depending on execution modes of current and preceding stages, whether the Preserve Partitioning flag has been set on the previous stage in the job, and how many nodes are specified in the Configuration file. This is the default method for the Merge stage.

• Entire. Each file written to receives the entire data set.

• Hash. The records are hashed into partitions based on the value of a key column or columns selected from the Available list.

• Modulus. The records are partitioned using a modulus function on the key column selected from the Available list. This is commonly used to partition on tag fields.


• Random. The records are partitioned randomly, based on the output of a random number generator.

• Round Robin. The records are partitioned on a round robin basis as they enter the stage.

• Same. Preserves the partitioning already in place.

• DB2. Replicates the DB2 partitioning method of a specific DB2 table. Requires extra properties to be set. Access these properties by clicking the properties button.

• Range. Divides a data set into approximately equal size partitions based on one or more partitioning keys. Range partitioning is often a preprocessing step to performing a total sort on a data set. Requires extra properties to be set. Access these properties by clicking the properties button.

The following Collection methods are available:

• (Auto). DataStage attempts to work out the best collection method depending on execution modes of current and preceding stages, and how many nodes are specified in the Configuration file. This is the default collection method for the Merge stage.

• Ordered. Reads all records from the first partition, then all records from the second partition, and so on.

• Round Robin. Reads a record from the first input partition, then from the second partition, and so on. After reaching the last partition, the operator starts over.

• Sort Merge. Reads records in an order based on one or more columns of the record. This requires you to select a collecting key column from the Available list.

The Partitioning tab also allows you to specify that data arriving on the input link should be sorted before the merge is performed. The sort is always carried out within data partitions. If the stage is partitioning incoming data the sort occurs after the partitioning. If the stage is collecting data, the sort occurs before the collection. The availability of sorting depends on the partitioning method chosen.

Select the check boxes as follows:


• Sort. Select this to specify that data coming in on the link should be sorted. Select the column or columns to sort on from the Available list.


• Stable. Select this if you want to preserve previously sorted data sets. This is the default.

• Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is also set, the first record is retained.

You can also specify sort direction, case sensitivity, and collating sequence for each column in the Selected list by selecting it and right-clicking to invoke the shortcut menu.

Outputs Page

The Outputs page allows you to specify details about data output from the Merge stage. The Merge stage can have only one master output link carrying the merged data and a number of reject links, each carrying rejected records from one of the update links. Choose a link from the Output name drop down list to specify which link you want to work on.

The General tab allows you to specify an optional description of the output link. The Columns tab specifies the column definitions of incoming data. The Mapping tab allows you to specify the relationship between the columns being input to the Merge stage and the Output columns.

Details about Merge stage mapping are given in the following section. See Chapter 3, “Stage Editors,” for a general description of the other tabs.

Reject Link Properties

You cannot change the properties of a Reject link. They have the meta data of the corresponding incoming update link and this cannot be altered.


Mapping Tab

For Merge stages the Mapping tab allows you to specify how the output columns are derived, i.e., what input columns map onto them.

The left pane shows the columns of the merged data. These are read only and cannot be modified on this tab. This shows the meta data from the master input link and any additional columns carried on the update links.

The right pane shows the output columns for the master output link. This has a Derivations field where you can specify how the column is derived. You can fill it in by dragging input columns over, or by using the Auto-match facility.

In the above example the left pane represents the incoming data after the merge has been performed. The right pane represents the data being output by the stage after the merge operation. In this example the data has been mapped straight across.


23. Remove Duplicates Stage

The Remove Duplicates stage is an active stage. It can have a single input link and a single output link.

The Remove Duplicates stage takes a single sorted data set as input, removes all duplicate records, and writes the results to an output data set.

Removing duplicate records is a common way of cleansing a data set before you perform further processing. Two records are considered duplicates if they are adjacent in the input data set and have identical values for the key column(s). A key column is any column you designate to be used in determining whether two records are identical.

The input data set to the remove duplicates operator must be sorted so that all records with identical key values are adjacent. You can either achieve this using the in-stage sort facilities available on the Inputs page Partitioning tab, or have an explicit Sort stage feeding the Remove Duplicates stage.
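
A sketch of the operation, including the Duplicate to retain option described later in this chapter (names are illustrative, not DataStage code):

    def remove_duplicates(records, key_cols, retain="First"):
        """records must be sorted so that equal keys are adjacent. Keep the
        First (default) or Last record of each run of duplicates."""
        out = []
        for rec in records:
            k = tuple(rec[c] for c in key_cols)
            if out and tuple(out[-1][c] for c in key_cols) == k:
                if retain == "Last":
                    out[-1] = rec         # replace with the later duplicate
            else:
                out.append(rec)
        return out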

The stage editor has three pages:

• Stage page. This is always present and is used to specify general information about the stage.

• Inputs page. This is where you specify details about the data set having its duplicates removed.

• Outputs page. This is where you specify details about the processed data being output from the stage.

Stage Page

The General tab allows you to specify an optional description of the stage. The Properties tab lets you specify what the stage does. The Advanced tab allows you to specify how the stage executes.

Properties

The Properties tab allows you to specify properties which determine what the stage actually does. Some of the properties are mandatory, although many have default settings. Properties without default settings appear in the warning color (red by default) and turn black when you supply a value for them.

The following table gives a quick reference list of the properties and their attributes. A more detailed description of each property follows.

Category/Property                           Values        Default  Mandatory?  Repeats?  Dependent of
Keys that Define Duplicates/Key             Input Column  N/A      Y           Y         N/A
Keys that Define Duplicates/Character Set   ASCII/EBCDIC  ASCII    Y           N         Key
Keys that Define Duplicates/Case Sensitive  True/False    True     N           N         Key
Options/Duplicate to retain                 First/Last    First    Y           N         N/A

Keys that Define Duplicates Category

Key. Specifies the key column for the operation. This property can be repeated to specify multiple key columns. Key has dependent properties as follows:

• Character Set

By default data is represented in the ASCII character set. To represent data in the EBCDIC character set, choose EBCDIC.

• Case Sensitive

Use this to specify whether each key is case sensitive or not. This is set to True by default; i.e., the values “CASE” and “case” would not be judged equivalent.

Options Category

Duplicate to retain. Specifies which of the duplicate records encountered to retain. Choose between First and Last. It is set to First by default.

Advanced Tab

This tab allows you to specify the following:

• Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the input data is processed by the available nodes as specified in the Configuration file, and by any node constraints specified on the Advanced tab. In Sequential mode the entire data set is processed by the conductor node.

• Preserve partitioning. This is Propagate by default. It adopts Set or Clear from the previous stage. You can explicitly select Set or Clear. Select Set to request that the next stage in the job should attempt to maintain the partitioning.

• Node pool and resource constraints. Select this option to constrain parallel execution to the node pool or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from drop down lists populated from the Configuration file.

• Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node map. You can define a node map by typing node numbers into the text box or by clicking the browse button to open the Available Nodes dialog box and selecting nodes from there. You are effectively defining a new node pool for this stage (in addition to any node pools defined in the Configuration file).

Inputs Page

The Inputs page allows you to specify details about the data coming in to have its duplicates removed. Choose an input link from the Input name drop down list to specify which link you want to work on.


The General tab allows you to specify an optional description of the link. The Partitioning tab allows you to specify how incoming data on the source data set link is partitioned. The Columns tab specifies the column definitions of incoming data.

Details about Remove Duplicates stage partitioning are given in the following section. See Chapter 3, “Stage Editors,” for a general description of the other tabs.

Partitioning on Input Links

The Partitioning tab allows you to specify details about how the incoming data is partitioned or collected before the operation is performed.

By default the stage uses the auto partitioning method. If the Preserve Partitioning option has been set on the previous stage in the job this stage will attempt to preserve the partitioning of the incoming data.

If the Remove Duplicates stage is operating in sequential mode, it will first collect the data using the default auto collection method before performing the remove duplicates operation.

The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends on:

• Whether the Remove Duplicates stage is set to execute in parallel or sequential mode.

• Whether the preceding stage in the job is set to execute in parallel or sequential mode.

If the Remove Duplicates stage is set to execute in parallel, then you can set a partitioning method by selecting from the Partitioning mode drop-down list. This will override any current partitioning (even if the Preserve Partitioning option has been set on the previous stage).

If the Remove Duplicates stage is set to execute in sequential mode, but the preceding stage is executing in parallel, then you can set a collection method from the Collection type drop-down list. This will override the default auto collection method.

The following partitioning methods are available:

• (Auto). DataStage attempts to work out the best partitioning method depending on execution modes of current and preceding stages, whether the Preserve Partitioning flag has been set on the previous stage in the job, and how many nodes are specified in the Configuration file. This is the default method for the Remove Duplicates stage.

• Entire. Each file written to receives the entire data set.

• Hash. The records are hashed into partitions based on the value of a key column or columns selected from the Available list.

• Modulus. The records are partitioned using a modulus function on the key column selected from the Available list. This is commonly used to partition on tag fields.

• Random. The records are partitioned randomly, based on the output of a random number generator.

• Round Robin. The records are partitioned on a round robin basis as they enter the stage.

• Same. Preserves the partitioning already in place.

• DB2. Replicates the DB2 partitioning method of a specific DB2 table. Requires extra properties to be set. Access these properties by clicking the properties button.

• Range. Divides a data set into approximately equal size partitions based on one or more partitioning keys. Range partitioning is often a preprocessing step to performing a total sort on a data set. Requires extra properties to be set. Access these properties by clicking the properties button.

The following Collection methods are available:

• (Auto). DataStage attempts to work out the best collection method depending on execution modes of current and preceding stages, and how many nodes are specified in the Configuration file. This is the default collection method for the Remove Duplicates stage.

• Ordered. Reads all records from the first partition, then all records from the second partition, and so on.

• Round Robin. Reads a record from the first input partition, then from the second partition, and so on. After reaching the last parti-tion, the operator starts over.

• Sort Merge. Reads records in an order based on one or more columns of the record. This requires you to select a collecting key column from the Available list.

The Partitioning tab also allows you to specify that data arriving on the input link should be sorted before the remove duplicates operation is performed. The sort is always carried out within data partitions. If the stage is partitioning incoming data the sort occurs after the partitioning. If the stage is collecting data, the sort occurs before the collection. The availability of sorting depends on the partitioning method chosen.

Select the check boxes as follows:

• Sort. Select this to specify that data coming in on the link should be sorted. Select the column or columns to sort on from the Available list.

• Stable. Select this if you want to preserve previously sorted data sets. This is the default.

• Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is also set, the first record is retained.

You can also specify sort direction, case sensitivity, and collating sequence for each column in the Selected list by selecting it and right-clicking to invoke the shortcut menu.

Output Page

The Outputs page allows you to specify details about data output from the Remove Duplicates stage. The stage only has one output link.

The General tab allows you to specify an optional description of the output link. The Columns tab specifies the column definitions of incoming data. The Mapping tab allows you to specify the relationship between the columns being input to the Remove Duplicates stage and the output columns.

Details about Remove Duplicates stage mapping are given in the following section. See Chapter 3, “Stage Editors,” for a general description of the other tabs.


Mapping Tab

For Remove Duplicates stages the Mapping tab allows you to specify how the output columns are derived, i.e., what input columns map onto them.

The left pane shows the columns of the input data. These are read only and cannot be modified on this tab. This shows the meta data from the incoming link.

The right pane shows the output columns for the master output link. This has a Derivations field where you can specify how the column is derived. You can fill it in by dragging input columns over, or by using the Auto-match facility.

In the above example the left pane represents the incoming data after the remove duplicates operation has been performed. The right pane represents the data being output by the stage after the remove duplicates operation. In this example the data has been mapped straight across.


24. Compress Stage

The Compress stage is an active stage. It can have a single input link and a single output link.

The Compress stage uses the UNIX compress or GZIP utility to compress a data set. It converts a data set from a sequence of records into a stream of raw binary data.
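
Conceptually, the stage turns a record stream into compressed binary data, roughly as in this sketch of the GZIP option (the serialization format used here is illustrative only):

    import gzip
    import json

    def compress_records(records):
        """Serialize a record stream and gzip it into raw binary data,
        much as the stage converts a data set into a compressed stream."""
        payload = "\n".join(json.dumps(r) for r in records).encode()
        return gzip.compress(payload)

    blob = compress_records([{"id": 1}, {"id": 2}])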

The stage editor has three pages:

• Stage page. This is always present and is used to specify general information about the stage.

• Inputs page. This is where you specify details about the data set being compressed.

• Outputs page. This is where you specify details about the compressed data being output from the stage.

Stage Page

The General tab allows you to specify an optional description of the stage. The Properties tab lets you specify what the stage does. The Advanced tab allows you to specify how the stage executes.


Properties

The Properties tab allows you to specify properties which determine what the stage actually does. The stage only has a single property, which determines whether the stage uses compress or GZIP.

Category/Property  Values         Default   Mandatory?  Repeats?  Dependent of
Options/Command    compress/gzip  compress  Y           N         N/A

Options Category

Command. Specifies whether the stage will use compress (the default) or GZIP.

Advanced Tab

This tab allows you to specify the following:

• Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the input data is processed by the available nodes as specified in the Configuration file, and by any node constraints specified on the Advanced tab. In Sequential mode the entire data set is processed by the conductor node.

• Preserve partitioning. This is Set by default. You can explicitly select Set or Clear. Select Set to request that the next stage should attempt to maintain the partitioning.

• Node pool and resource constraints. Select this option to constrain parallel execution to the node pool or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from drop down lists populated from the Configuration file.

• Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node map. You can define a node map by typing node numbers into the text box or by clicking the browse button to open the Available Nodes dialog box and selecting nodes from there. You are effectively defining a new node pool for this stage (in addition to any node pools defined in the Configuration file).

Input Page

The Inputs page allows you to specify details about the data set being compressed. There is only one input link.

The General tab allows you to specify an optional description of the link. The Partitioning tab allows you to specify how incoming data on the source data set link is partitioned. The Columns tab specifies the column definitions of incoming data.

Details about Compress stage partitioning are given in the following section. See Chapter 3, “Stage Editors,” for a general description of the other tabs.

Partitioning on Input Links

The Partitioning tab allows you to specify details about how the incoming data is partitioned or collected before the compress is performed.

By default the stage uses the auto partitioning method. If the Preserve Partitioning option has been set on the previous stage in the job, this stage will attempt to preserve the partitioning of the incoming data.

If the Compress stage is operating in sequential mode, it will first collect the data using the default auto collection method before performing the compression.

The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends on:

• Whether the Compress stage is set to execute in parallel or sequential mode.

• Whether the preceding stage in the job is set to execute in parallel or sequential mode.

If the Compress stage is set to execute in parallel, then you can set a partitioning method by selecting from the Partitioning mode drop-down list. This will override any current partitioning (even if the Preserve Partitioning option has been set on the Stage page Advanced tab).

If the Compress stage is set to execute in sequential mode, but the preceding stage is executing in parallel, then you can set a collection method from the Collection type drop-down list. This will override the default auto collection method.

The following partitioning methods are available:


• (Auto). DataStage attempts to work out the best partitioning method depending on execution modes of current and preceding stages, whether the Preserve Partitioning flag has been set on the previous stage in the job, and how many nodes are specified in the Configuration file. This is the default method for the Compress stage.

• Entire. Each file written to receives the entire data set.

• Hash. The records are hashed into partitions based on the value of a key column or columns selected from the Available list.

• Modulus. The records are partitioned using a modulus function on the key column selected from the Available list. This is commonly used to partition on tag fields (see the sketch after this list).

• Random. The records are partitioned randomly, based on the output of a random number generator.

• Round Robin. The records are partitioned on a round robin basis as they enter the stage.

• Same. Preserves the partitioning already in place.

• DB2. Replicates the DB2 partitioning method of a specific DB2 table. Requires extra properties to be set. Access these properties by clicking the properties button.

• Range. Divides a data set into approximately equal size partitions based on one or more partitioning keys. Range partitioning is often a preprocessing step to performing a total sort on a data set. Requires extra properties to be set. Access these properties by clicking the properties button.
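To make the hash and modulus methods concrete, here is a minimal Python sketch of how records could be assigned to partitions; the partition count, column names, and CRC-based hash function are illustrative assumptions rather than the engine's actual implementation:

    import zlib

    num_partitions = 4  # for example, one partition per node

    def hash_partition(value):
        # Hash: identical key values always land in the same partition.
        return zlib.crc32(str(value).encode()) % num_partitions

    def modulus_partition(value):
        # Modulus: a numeric key (such as a tag field) taken modulo
        # the number of partitions.
        return int(value) % num_partitions

    rows = [{"tag": 7, "city": "Boston"}, {"tag": 8, "city": "Paris"}]
    for row in rows:
        print(hash_partition(row["city"]), modulus_partition(row["tag"]))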

The following Collection methods are available:

• (Auto). DataStage attempts to work out the best collection method depending on execution modes of current and preceding stages, and how many nodes are specified in the Configuration file. This is the default collection method for the Compress stage.

• Ordered. Reads all records from the first partition, then all records from the second partition, and so on.

• Round Robin. Reads a record from the first input partition, then from the second partition, and so on. After reaching the last partition, the operator starts over.


• Sort Merge. Reads records in an order based on one or more columns of the record. This requires you to select a collecting key column from the Available list.
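The difference between the Round Robin and Sort Merge collection orders can be sketched in a few lines of Python (the partition contents are invented, and heapq.merge stands in for the engine's sorted merge):

    import heapq
    import itertools

    # Three input partitions, each already sorted on the collecting key.
    partitions = [[1, 10], [2, 3], [4, 5]]

    # Round Robin: one record from each partition in turn.
    round_robin = [r for batch in itertools.zip_longest(*partitions)
                   for r in batch if r is not None]
    print(round_robin)  # [1, 2, 4, 10, 3, 5] - not globally ordered

    # Sort Merge: a single stream ordered on the collecting key.
    print(list(heapq.merge(*partitions)))  # [1, 2, 3, 4, 5, 10]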

The Partitioning tab also allows you to specify that data arriving on the input link should be sorted before the compression is performed. The sort is always carried out within data partitions. If the stage is partitioning incoming data the sort occurs after the partitioning. If the stage is collecting data, the sort occurs before the collection. The availability of sorting depends on the partitioning method chosen.

Select the check boxes as follows:

• Sort. Select this to specify that data coming in on the link should be sorted. Select the column or columns to sort on from the Available list.

• Stable. Select this if you want to preserve previously sorted data sets. This is the default.

• Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is also set, the first record is retained (see the sketch below).

You can also specify sort direction, case sensitivity, and collating sequence for each column in the Selected list by selecting it and right-clicking to invoke the shortcut menu.
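The interaction of the Stable and Unique options can be illustrated with a short Python sketch; the records and key column are invented. A stable sort keeps equal-keyed records in their arrival order, and Unique then retains only the first record of each group:

    rows = [{"key": "b", "val": 1},
            {"key": "a", "val": 2},
            {"key": "a", "val": 3}]

    # Stable: records with equal keys keep their original relative order.
    rows.sort(key=lambda r: r["key"])  # Python's sort is stable

    # Unique: keep only the first record of each run of identical keys.
    unique, seen = [], set()
    for r in rows:
        if r["key"] not in seen:
            seen.add(r["key"])
            unique.append(r)

    print(unique)  # [{'key': 'a', 'val': 2}, {'key': 'b', 'val': 1}]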

Output Page

The Outputs page allows you to specify details about data output from the Compress stage. The stage only has one output link.

The General tab allows you to specify an optional description of the output link. The Columns tab specifies the column definitions of outgoing data. See Chapter 3, “Stage Editors,” for a general description of the tabs.


25. Expand Stage

The Expand stage is an active stage. It can have a single input link and a single output link.

The Expand stage uses the UNIX uncompress or GZIP utility to expand a data set. It converts a previously compressed data set back into a sequence of records from a stream of raw binary data.

The stage editor has three pages:

• Stage page. This is always present and is used to specify general information about the stage.

• Inputs page. This is where you specify details about the data set being expanded.

• Outputs page. This is where you specify details about the expanded data being output from the stage.

Stage Page

The General tab allows you to specify an optional description of the stage. The Properties tab lets you specify what the stage does. The Advanced tab allows you to specify how the stage executes.


Properties

The Properties tab allows you to specify properties which determine what the stage actually does. The stage has only a single property, which determines whether the stage uses uncompress or GZIP.

Options Category

Command. Specifies whether the stage will use uncompress (the default) or GZIP.

Category/Property   Values           Default     Mandatory?  Repeats?  Dependent of
Options/Command     uncompress/gzip  uncompress  Y           N         N/A

Advanced Tab

This tab allows you to specify the following:

• Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the input data is processed by the available nodes as specified in the Configuration file, and by any node constraints specified on the Advanced tab. In Sequential mode the entire data set is processed by the conductor node.

• Preserve partitioning. This is Propagate by default. The stage has a mandatory partitioning method of Same; this overrides the preserve-partitioning flag, so the partitioning of the incoming data is always preserved.

• Node pool and resource constraints. Select this option to constrain parallel execution to the node pool or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from drop-down lists populated from the Configuration file.

• Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node map. You can define a node map by typing node numbers into the text box or by clicking the browse button to open the Available Nodes dialog box and selecting nodes from there. You are effectively defining a new node pool for this stage (in addition to any node pools defined in the Configuration file).


Input Page

The Inputs page allows you to specify details about the data set being expanded. There is only one input link.

The General tab allows you to specify an optional description of the link. The Partitioning tab allows you to specify how incoming data on the source data set link is partitioned. The Columns tab specifies the column definitions of incoming data.

Details about Expand stage partitioning are given in the following section. See Chapter 3, “Stage Editors,” for a general description of the other tabs.

Partitioning on Input Links

The Partitioning tab allows you to specify details about how the incoming data is partitioned or collected before the expansion is performed.

By default the stage uses the Same partitioning method and this cannot be altered. This preserves the partitioning already in place.

If the Expand stage is set to execute in sequential mode, but the preceding stage is executing in parallel, then you can set a collection method from the Collection type drop-down list. This will override the default auto collection method.

The following Collection methods are available:

• (Auto). DataStage attempts to work out the best collection method depending on execution modes of current and preceding stages, and how many nodes are specified in the Configuration file. This is the default collection method for the Expand stage.

• Ordered. Reads all records from the first partition, then all records from the second partition, and so on.

• Round Robin. Reads a record from the first input partition, then from the second partition, and so on. After reaching the last partition, the operator starts over.

• Sort Merge. Reads records in an order based on one or more columns of the record. This requires you to select a collecting key column from the Available list.

The Partitioning tab also allows you to specify that data arriving on the input link should be sorted before the expansion is performed. The sort is always carried out within data partitions. If the stage is partitioning incoming data the sort occurs after the partitioning. If the stage is collecting data, the sort occurs before the collection. The availability of sorting depends on the partitioning method chosen.

Select the check boxes as follows:

• Sort. Select this to specify that data coming in on the link should be sorted. Select the column or columns to sort on from the Available list.

• Stable. Select this if you want to preserve previously sorted data sets. This is the default.

• Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is also set, the first record is retained.

You can also specify sort direction, case sensitivity, and collating sequence for each column in the Selected list by selecting it and right-clicking to invoke the shortcut menu.

Output Page

The Outputs page allows you to specify details about data output from the Expand stage. The stage only has one output link.

The General tab allows you to specify an optional description of the output link. The Columns tab specifies the column definitions of outgoing data.

See Chapter 3, “Stage Editors,” for a general description of the tabs.


26. Sample Stage

The Sample stage is an active stage. It can have a single input link and any number of output links.

The Sample stage samples an input data set. It operates in two modes. In Percent mode, it extracts records, selecting them by means of a random number generator, and writes a given percentage of these to each output data set. You specify the number of output data sets, the percentage written to each, and a seed value to start the random number generator. You can reproduce a given distribution by repeating the same number of outputs, the percentage, and the seed value.

In Period mode, it extracts every Nth row from each partition, where N is the period, which you supply. In this case all rows will be output to a single data set.

For both modes you can specify the maximum number of rows that you want to sample from each partition.
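The semantics of the two modes can be sketched in Python as follows; the row values, percentages, period, and seed are invented for illustration, and the sketch does not reproduce the stage's actual random number stream:

    import random

    rows = list(range(100))

    # Percent mode: a seeded generator decides, row by row, which
    # output data set (if any) receives the record.
    random.seed(42)                 # the Seed property
    percents = [30.0, 10.0]         # one percentage per output link
    outputs = [[] for _ in percents]
    for row in rows:
        draw = random.uniform(0, 100)
        cumulative = 0.0
        for link, pct in enumerate(percents):
            cumulative += pct
            if draw < cumulative:
                outputs[link].append(row)
                break

    # Period mode: every Nth row of each partition, to a single output.
    period = 7                      # the Period (Per Partition) property
    periodic = rows[period - 1::period]

Re-running the sketch with the same seed and percentages reproduces the same distribution, which is the reproducibility property described above.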

The stage editor has three pages:

• Stage page. This is always present and is used to specify general information about the stage.

• Input page. This is where you specify details about the data set being sampled.

• Outputs page. This is where you specify details about the sampled data being output from the stage.


Stage Page

The General tab allows you to specify an optional description of the stage. The Properties tab lets you specify what the stage does. The Advanced tab allows you to specify how the stage executes. The Link Ordering tab allows you to specify which output links are which.

Properties

The Properties tab allows you to specify properties which determine what the stage actually does. Some of the properties are mandatory, although many have default settings. Properties without default settings appear in the warning color (red by default) and turn black when you supply a value for them.

The following table gives a quick reference list of the properties and their attributes. A more detailed description of each property follows.

Category/Property               Values          Default  Mandatory?                    Repeats?  Dependent of
Options/Sample Mode             percent/period  percent  Y                             N         N/A
Options/Percent                 number          N/A      Y (if Sample Mode = Percent)  Y         N/A
Options/Output Link Number      number          N/A      Y                             N         Percent
Options/Seed                    number          N/A      N                             N         N/A
Options/Period (Per Partition)  number          N/A      Y (if Sample Mode = Period)   N         N/A
Options/Max Rows Per Partition  number          N/A      N                             N         N/A

Options Category

Sample Mode. Specifies the type of sample operation. You can sample on a percentage of input rows (percent), or you can sample the Nth row of every partition (period).

Percent. Specifies the sampling percentage for each output data set when you use a Sample Mode of Percent. You can repeat this property to specify different percentages for each output data set. The sum of the percentages specified for all output data sets cannot exceed 100%. You can specify a job parameter if required.

Percent has a dependent property:

• Output Link Number

This specifies the output link to which the percentage corresponds. You can specify a job parameter if required.

Seed. This is the number used to initialize the random number generator. You can specify a job parameter if required. This property is only available if Sample Mode is set to percent.

Period (Per Partition). Specifies the period when using a Sample Mode of Period.

Max Rows Per Partition. This specifies the maximum number of rows that will be sampled from each partition.

Advanced Tab

This tab allows you to specify the following:

• Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the input data is processed by the available nodes as specified in the Configuration file, and by any node constraints specified on the Advanced tab. In Sequential mode the entire data set is processed by the conductor node.

• Preserve partitioning. This is Propagate by default. It adopts Set or Clear from the previous stage. You can explicitly select Set or Clear. Select Set to request that the next stage attempt to maintain the partitioning.

• Node pool and resource constraints. Select this option to constrain parallel execution to the node pool or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from drop-down lists populated from the Configuration file.

• Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node map. You can define a node map by typing node numbers into the text box or by clicking the browse button to open the Available Nodes dialog box and selecting nodes from there. You are effectively defining a new node pool for this stage (in addition to any node pools defined in the Configuration file).

Link Ordering

This tab allows you to specify the order in which the output links are processed.

By default the output links will be processed in the order they were added. To rearrange them, choose an output link and click the up arrow button or the down arrow button.

Input Page

The Input page allows you to specify details about the data set being sampled. There is only one input link.

The General tab allows you to specify an optional description of the link. The Partitioning tab allows you to specify how incoming data on the source data set link is partitioned. The Columns tab specifies the column definitions of incoming data.


Details about Sample stage partitioning are given in the following section. See Chapter 3, “Stage Editors,” for a general description of the other tabs.

Partitioning on Input Links

The Partitioning tab allows you to specify details about how the incoming data is partitioned or collected before the sample is performed.

By default the stage uses the auto partitioning method. If the Preserve Partitioning option has been set on the previous stage in the job, the stage will attempt to preserve the partitioning of the incoming data.

If the Sample stage is operating in sequential mode, it will first collect the data using the default auto collection method before sampling it.

The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends on:

• Whether the Sample stage is set to execute in parallel or sequential mode.

• Whether the preceding stage in the job is set to execute in parallel or sequential mode.

If the Sample stage is set to execute in parallel, then you can set a partitioning method by selecting from the Partitioning mode drop-down list. This will override any current partitioning (even if the Preserve Partitioning option has been set on the previous stage).

If the Sample stage is set to execute in sequential mode, but the preceding stage is executing in parallel, then you can set a collection method from the Collection type drop-down list. This will override the default auto collection method.

The following partitioning methods are available:

• (Auto). DataStage attempts to work out the best partitioning method depending on execution modes of current and preceding stages, whether the Preserve Partitioning flag has been set on the previous stage in the job, and how many nodes are specified in the Configuration file. This is the default method for the Sample stage.

• Entire. Each file written to receives the entire data set.


• Hash. The records are hashed into partitions based on the value of a key column or columns selected from the Available list.


• Modulus. The records are partitioned using a modulus function on the key column selected from the Available list. This is commonly used to partition on tag fields.

• Random. The records are partitioned randomly, based on the output of a random number generator.

• Round Robin. The records are partitioned on a round robin basis as they enter the stage.

• Same. Preserves the partitioning already in place.

• DB2. Replicates the DB2 partitioning method of a specific DB2 table. Requires extra properties to be set. Access these properties by clicking the properties button.

• Range. Divides a data set into approximately equal size partitions based on one or more partitioning keys. Range partitioning is often a preprocessing step to performing a total sort on a data set. Requires extra properties to be set. Access these properties by clicking the properties button.

The following Collection methods are available:

• (Auto). DataStage attempts to work out the best collection method depending on execution modes of current and preceding stages, and how many nodes are specified in the Configuration file. This is the default collection method for the Sample stage.

• Ordered. Reads all records from the first partition, then all records from the second partition, and so on.

• Round Robin. Reads a record from the first input partition, then from the second partition, and so on. After reaching the last partition, the operator starts over.

• Sort Merge. Reads records in an order based on one or more columns of the record. This requires you to select a collecting key column from the Available list.

The Partitioning tab also allows you to specify that data arriving on the input link should be sorted before the sample is performed. The sort is always carried out within data partitions. If the stage is partitioning incoming data the sort occurs after the partitioning. If the stage is collecting data, the sort occurs before the collection. The availability of sorting depends on the partitioning method chosen.

Select the check boxes as follows:


• Sort. Select this to specify that data coming in on the link should be sorted. Select the column or columns to sort on from the Available list.

• Stable. Select this if you want to preserve previously sorted data sets. This is the default.

• Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is also set, the first record is retained.

You can also specify sort direction, case sensitivity, and collating sequence for each column in the Selected list by selecting it and right-clicking to invoke the shortcut menu.

Outputs Page

The Outputs page allows you to specify details about data output from the Sample stage. The stage can have any number of output links; choose the one you want to work on from the Output Link drop-down list.

The General tab allows you to specify an optional description of the output link. The Columns tab specifies the column definitions of outgoing data. The Mapping tab allows you to specify the relationship between the columns being input to the Sample stage and the output columns.

Details about Sample stage mapping are given in the following section. See Chapter 3, “Stage Editors,” for a general description of the other tabs.


Mapping Tab

For Sample stages the Mapping tab allows you to specify how the output columns are derived, i.e., what input columns map onto them.

The left pane shows the columns of the sampled data. These are read-only and cannot be modified on this tab. This pane shows the meta data from the incoming link.

The right pane shows the output columns for the output link. This has a Derivations field where you can specify how the column is derived. You can fill it in by dragging input columns over, or by using the Auto-match facility.

In the above example the left pane represents the incoming data after the Sample operation has been performed. The right pane represents the data being output by the stage after the Sample operation. In this example the data has been mapped straight across.


27. Row Generator Stage

The Row Generator stage is a file stage. It can have any number of output links.

The Row Generator stage produces a set of mock data fitting the specified meta data. This is useful where you want to test your job but have no real data available to process. (See also the Column Generator stage, which allows you to add extra columns to existing data sets.)

The meta data you specify on the output link determines the columns you are generating. Most of the properties are specified using the Edit Column Meta Data dialog box to provide format details for each column (the Edit Column Meta Data dialog box is accessible from the shortcut menu of the Outputs Page Columns tab - select Edit Row…).
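As a rough illustration of what the stage does, the following Python sketch fabricates rows from a simple list of column definitions; the column names, types, and per-type generators are invented, and the real stage derives all of this from the output link's meta data or a schema file:

    columns = [("id", "int32"), ("name", "string")]
    generators = {"int32": lambda i: i,
                  "string": lambda i: "row_%d" % i}

    def generate(num_records):
        # One mock value per column for each generated record.
        return [{name: generators[ctype](i) for name, ctype in columns}
                for i in range(num_records)]

    print(generate(3))  # three records matching the column definitions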

The stage editor has two pages:

• Stage page. This is always present and is used to specify general information about the stage.

• Outputs page. This is where you specify details about the generated data being output from the stage.

Stage Page

The General tab allows you to specify an optional description of the stage. The Advanced tab allows you to specify how the stage executes.


Advanced Tab

This tab allows you to specify the following:

• Execution Mode. The Generate stage executes in Sequential mode by default. You can select Parallel mode to generate data sets in separate partitions.

• Preserve partitioning. This is Propagate by default. If you have an input data set, it adopts Set or Clear from the previous stage. You can explicitly select Set or Clear. Select Set to request that the next stage attempt to maintain the partitioning.

• Node pool and resource constraints. Select this option to constrain parallel execution to the node pool or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from drop-down lists populated from the Configuration file.

• Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node map. You can define a node map by typing node numbers into the text box or by clicking the browse button to open the Available Nodes dialog box and selecting nodes from there. You are effectively defining a new node pool for this stage (in addition to any node pools defined in the Configuration file).

Outputs Page

The Outputs page allows you to specify details about data output from the Row Generator stage. The stage can have any number of output links; choose the one you want to work on from the Output Link drop-down list.

The General tab allows you to specify an optional description of the output link. The Properties tab lets you specify what the stage does. The Columns tab specifies the column definitions of outgoing data.

Properties

The Properties tab allows you to specify properties which determine what the stage actually does. Some of the properties are mandatory, although many have default settings. Properties without default settings appear in the warning color (red by default) and turn black when you supply a value for them. The following table gives a quick reference list of the properties and their attributes. A more detailed description of each property follows.

Category/Property          Values    Default  Mandatory?  Repeats?  Dependent of
Options/Number of Records  number    10       Y           N         N/A
Options/Schema File        pathname  N/A      N           N         N/A

Options Category

Number of Records. The number of records you want your generated data set to contain.

The default number is 10.

Schema File. By default the stage bases the mock data set on the meta data defined on the output link, but you can specify the column definitions in a schema file instead, if required. You can browse for the schema file or specify a job parameter.



28. Column Generator Stage

The Column Generator stage is an active stage. It can have a single input link and a single output link.

The Column Generator stage adds columns to incoming data and generates mock data for these columns for each data row processed. The new data set is then output. (See also the Row Generator stage, which allows you to generate complete sets of mock data.)
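A minimal Python sketch of the idea (the incoming rows, generated column names, and fill values are all invented for illustration):

    incoming = [{"id": 1}, {"id": 2}]

    # Columns the stage has been asked to generate, with mock values.
    generated = {"batch": lambda row: "B-0001",
                 "flag": lambda row: False}

    # Each output row is the input row plus the generated columns.
    outgoing = [dict(row, **{c: f(row) for c, f in generated.items()})
                for row in incoming]

    print(outgoing)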

The stage editor has three pages:

• Stage page. This is always present and is used to specify general information about the stage.

• Input page. This is where you specify details about the input link.

• Outputs page. This is where you specify details about the generated data being output from the stage.

Stage Page

The General tab allows you to specify an optional description of the stage. The Properties tab lets you specify what the stage does. The Advanced tab allows you to specify how the stage executes.

Properties

The Properties tab allows you to specify properties which determine what the stage actually does. Some of the properties are mandatory, although many have default settings. Properties without default settings appear in the warning color (red by default) and turn black when you supply a value for them.

The following table gives a quick reference list of the properties and their attributes. A more detailed description of each property follows.

Category/Property           Values                Default   Mandatory?                          Repeats?  Dependent of
Options/Column Method       Explicit/Schema File  Explicit  Y                                   N         N/A
Options/Column to Generate  output column         N/A       Y (if Column Method = Explicit)     Y         N/A
Options/Schema File         pathname              N/A       Y (if Column Method = Schema File)  N         N/A

Options Category

Column Method. Select Explicit if you are going to specify the column or columns you want the stage to generate data for. Select Schema File if you are supplying a schema file containing the column definitions.

Column to Generate. When you have chosen a column method of Explicit, this property allows you to specify which output columns the stage is generating data for. Repeat the property to specify multiple columns. You can specify the properties for each column using the Parallel tab of the Edit Column Meta Data dialog box (accessible from the shortcut menu on the columns grid of the output Columns tab).

Schema File. When you have chosen a column method of schema file, this property allows you to specify the column definitions in a schema file. You can browse for the schema file or specify a job parameter.


Advanced Tab

This tab allows you to specify the following:

• Execution Mode. The Generate stage executes in Sequential mode by default. You can select Parallel mode to generate data sets in separate partitions.

• Preserve partitioning. This is Propagate by default. If you have an input data set, it adopts Set or Clear from the previous stage. You can explicitly select Set or Clear. Select Set to request that the next stage attempt to maintain the partitioning.

• Node pool and resource constraints. Select this option to constrain parallel execution to the node pool or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from drop-down lists populated from the Configuration file.

• Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node map. You can define a node map by typing node numbers into the text box or by clicking the browse button to open the Available Nodes dialog box and selecting nodes from there. You are effectively defining a new node pool for this stage (in addition to any node pools defined in the Configuration file).

Input Page

The Inputs page allows you to specify details about the incoming data set you are adding generated columns to. There is only one input link and this is optional.

The General tab allows you to specify an optional description of the link. The Partitioning tab allows you to specify how incoming data on the source data set link is partitioned. The Columns tab specifies the column definitions of incoming data.

Details about Generate stage partitioning are given in the following section. See Chapter 3, “Stage Editors,” for a general description of the other tabs.


Partitioning on Input Links

The Partitioning tab allows you to specify details about how the incoming data is partitioned or collected before the generate is performed.


By default the stage uses the auto partitioning method. If the Preserve Partitioning option has been set on the previous stage in the job, the stage will attempt to preserve the partitioning of the incoming data.

If the Column Generator stage is operating in sequential mode, it will first collect the data using the default auto collection method before generating the new columns.

The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends on:

• Whether the Column Generator stage is set to execute in parallel or sequential mode.

• Whether the preceding stage in the job is set to execute in parallel or sequential mode.

If the Column Generator stage is set to execute in parallel, then you can set a partitioning method by selecting from the Partitioning mode drop-down list. This will override any current partitioning (even if the Preserve Partitioning option has been set on the Stage page Advanced tab).

If the Column Generator stage is set to execute in sequential mode, but the preceding stage is executing in parallel, then you can set a collection method from the Collection type drop-down list. This will override the default auto collection method.

The following partitioning methods are available:

• (Auto). DataStage attempts to work out the best partitioning method depending on execution modes of current and preceding stages, whether the Preserve Partitioning flag has been set on the previous stage in the job, and how many nodes are specified in the Configuration file. This is the default method for the Column Generator stage.

• Entire. Each file written to receives the entire data set.

• Hash. The records are hashed into partitions based on the value of a key column or columns selected from the Available list.

• Modulus. The records are partitioned using a modulus function on the key column selected from the Available list. This is commonly used to partition on tag fields.


• Random. The records are partitioned randomly, based on the output of a random number generator.


• Round Robin. The records are partitioned on a round robin basis as they enter the stage.

• Same. Preserves the partitioning already in place.

• DB2. Replicates the DB2 partitioning method of a specific DB2 table. Requires extra properties to be set. Access these properties by clicking the properties button.

• Range. Divides a data set into approximately equal size partitions based on one or more partitioning keys. Range partitioning is often a preprocessing step to performing a total sort on a data set. Requires extra properties to be set. Access these properties by clicking the properties button.

The following Collection methods are available:

• (Auto). DataStage attempts to work out the best collection method depending on execution modes of current and preceding stages, and how many nodes are specified in the Configuration file. This is the default collection method for the Column Generator stage.

• Ordered. Reads all records from the first partition, then all records from the second partition, and so on.

• Round Robin. Reads a record from the first input partition, then from the second partition, and so on. After reaching the last partition, the operator starts over.

• Sort Merge. Reads records in an order based on one or more columns of the record. This requires you to select a collecting key column from the Available list.

The Partitioning tab also allows you to specify that data arriving on the input link should be sorted before the column generate operation is performed. The sort is always carried out within data partitions. If the stage is partitioning incoming data the sort occurs after the partitioning. If the stage is collecting data, the sort occurs before the collection. The availability of sorting depends on the partitioning method chosen.

Select the check boxes as follows:

• Sort. Select this to specify that data coming in on the link should be sorted. Select the column or columns to sort on from the Available list.

• Stable. Select this if you want to preserve previously sorted data sets. This is the default.


• Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is also set, the first record is retained.

You can also specify sort direction, case sensitivity, and collating sequence for each column in the Selected list by selecting it and right-clicking to invoke the shortcut menu.

Outputs Page

Details about Column Generator stage mapping are given in the following section. See Chapter 3, “Stage Editors,” for a general description of the other tabs.

Mapping Tab

For Column Generator stages the Mapping tab allows you to specify how the output columns are derived, i.e., how the generated data maps onto them.

The left pane shows the generated columns. These are read-only and cannot be modified on this tab. These columns are automatically mapped onto the equivalent output columns.


The right pane shows the output columns for the output link. This has a Derivations field where you can specify how the column is derived. You can fill it in by dragging input columns over, or by using the Auto-match facility.

The right pane represents the data being output by the stage after the generate operation. In the above example, two columns belong to the incoming data and have automatically been mapped through, and the two generated columns have been mapped straight across.


29. Copy Stage

The Copy stage is an active stage. It can have a single input link and any number of output links.

The Copy stage copies a single input data set to a number of output data sets. Each record of the input data set is copied to every output data set without modification. This lets you make a backup copy of a data set on disk while performing an operation on another copy, for example.

The stage editor has three pages:

• Stage page. This is always present and is used to specify general information about the stage.

• Input page. This is where you specify details about the input link carrying the data to be copied.

• Outputs page. This is where you specify details about the copied data being output from the stage.

Stage Page

The General tab allows you to specify an optional description of the stage. The Properties tab lets you specify what the stage does. The Advanced tab allows you to specify how the stage executes.

Properties

The Properties tab allows you to specify properties which determine what the stage actually does. Some of the properties are mandatory, although many have default settings. Properties without default settings appear in the warning color (red by default) and turn black when you supply a value for them.


The following table gives a quick reference list of the properties and their attributes. A more detailed description of each property follows.

Category/Property  Values      Default  Mandatory?  Repeats?  Dependent of
Options/Force      True/False  False    N           N         N/A

Options Category

Force. Set True to specify that DataStage should not try to optimize the job by removing a Copy operation where there is one input and one output. Set False by default.

Advanced Tab

This tab allows you to specify the following:

• Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the input data is processed by the available nodes as specified in the Configuration file, and by any node constraints specified on the Advanced tab. In Sequential mode the entire data set is processed by the conductor node.

• Preserve partitioning. This is Propagate by default. It adopts the setting of the previous stage. You can explicitly select Set or Clear. Select Set to request that the next stage attempt to maintain the partitioning.

• Node pool and resource constraints. Select this option to constrain parallel execution to the node pool or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from drop-down lists populated from the Configuration file.

• Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node map. You can define a node map by typing node numbers into the text box or by clicking the browse button to open the Available Nodes dialog box and selecting nodes from there. You are effectively defining a new node pool for this stage (in addition to any node pools defined in the Configuration file).


Input Page

The Inputs page allows you to specify details about the data set being copied. There is only one input link.

The General tab allows you to specify an optional description of the link. The Partitioning tab allows you to specify how incoming data on the source data set link is partitioned. The Columns tab specifies the column definitions of incoming data.

Details about Copy stage partitioning are given in the following section. See Chapter 3, “Stage Editors,” for a general description of the other tabs.

Partitioning on Input Links

The Partitioning tab allows you to specify details about how the incoming data is partitioned or collected before the copy is performed.

By default the stage uses the auto partitioning method. If the Preserve Partitioning option has been set on the previous stage in the job, the stage will attempt to preserve the partitioning of the incoming data.

If the Copy stage is operating in sequential mode, it will first collect the data using the default auto collection method before copying it.

The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends on:

• Whether the Copy stage is set to execute in parallel or sequential mode.

• Whether the preceding stage in the job is set to execute in parallel or sequential mode.

If the Copy stage is set to execute in parallel, then you can set a partitioning method by selecting from the Partitioning mode drop-down list. This will override any current partitioning (even if the Preserve Partitioning option has been set on the previous stage).

If the Copy stage is set to execute in sequential mode, but the preceding stage is executing in parallel, then you can set a collection method from the Collection type drop-down list. This will override the default auto collection method.


The following partitioning methods are available:


• (Auto). DataStage attempts to work out the best partitioning method depending on execution modes of current and preceding stages, whether the Preserve Partitioning flag has been set on the previous stage in the job, and how many nodes are specified in the Configuration file. This is the default method for the Copy stage.

• Entire. Each file written to receives the entire data set.

• Hash. The records are hashed into partitions based on the value of a key column or columns selected from the Available list.

• Modulus. The records are partitioned using a modulus function on the key column selected from the Available list. This is commonly used to partition on tag fields.

• Random. The records are partitioned randomly, based on the output of a random number generator.

• Round Robin. The records are partitioned on a round robin basis as they enter the stage.

• Same. Preserves the partitioning already in place.

• DB2. Replicates the DB2 partitioning method of a specific DB2 table. Requires extra properties to be set. Access these properties by clicking the properties button.

• Range. Divides a data set into approximately equal size partitions based on one or more partitioning keys. Range partitioning is often a preprocessing step to performing a total sort on a data set. Requires extra properties to be set. Access these properties by clicking the properties button.

The following Collection methods are available:

• (Auto). DataStage attempts to work out the best collection method depending on execution modes of current and preceding stages, and how many nodes are specified in the Configuration file. This is the default collection method for the Copy stage.

• Ordered. Reads all records from the first partition, then all records from the second partition, and so on.

• Round Robin. Reads a record from the first input partition, then from the second partition, and so on. After reaching the last partition, the operator starts over.


• Sort Merge. Reads records in an order based on one or more columns of the record. This requires you to select a collecting key column from the Available list.

The Partitioning tab also allows you to specify that data arriving on the input link should be sorted before the copy is performed. The sort is always carried out within data partitions. If the stage is partitioning incoming data the sort occurs after the partitioning. If the stage is collecting data, the sort occurs before the collection. The availability of sorting depends on the partitioning method chosen.

Select the check boxes as follows:

• Sort. Select this to specify that data coming in on the link should be sorted. Select the column or columns to sort on from the Available list.

• Stable. Select this if you want to preserve previously sorted data sets. This is the default.

• Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is also set, the first record is retained.

You can also specify sort direction, case sensitivity, and collating sequence for each column in the Selected list by selecting it and right-clicking to invoke the shortcut menu.

Outputs Page

The Outputs page allows you to specify details about data output from the Copy stage. The stage can have any number of output links; choose the one you want to work on from the Output name drop-down list.

The General tab allows you to specify an optional description of the output link. The Columns tab specifies the column definitions of outgoing data. The Mapping tab allows you to specify the relationship between the columns being input to the Copy stage and the output columns.

Details about Copy stage mapping are given in the following section. See Chapter 3, “Stage Editors,” for a general description of the other tabs.


Mapping Tab

For Copy stages the Mapping tab allows you to specify how the output columns are derived, i.e., what copied columns map onto them.

The left pane shows the copied columns. These are read-only and cannot be modified on this tab.

The right pane shows the output columns for the output link. This has a Derivations field where you can specify how the column is derived. You can fill it in by dragging copied columns over, or by using the Auto-match facility.

In the above example the left pane represents the incoming data after the copy has been performed. The right pane represents the data being output by the stage after the copy operation. In this example the data has been mapped straight across.


30. External Filter Stage

The External Filter stage is an active stage. It can have a single input link and a single output link.

The External Filter stage allows you to specify a UNIX command that acts as a filter on the data you are processing. An example would be to use the stage to grep a data set for a certain string or pattern and discard records which did not contain a match. This can be a quick and efficient way of filtering data.

The stage editor has three pages:

• Stage page. This is always present and is used to specify general information about the stage.

• Input page. This is where you specify details about the input link carrying the data to be filtered.

• Outputs page. This is where you specify details about the filtered data being output from the stage.

Stage Page

The General tab allows you to specify an optional description of the stage. The Properties tab lets you specify what the stage does. The Advanced tab allows you to specify how the stage executes.

Properties

The Properties tab allows you to specify properties which determine what the stage actually does. Some of the properties are mandatory, although many have default settings. Properties without default settings appear in the warning color (red by default) and turn black when you supply a value for them.

The following table gives a quick reference list of the properties and their attributes. A more detailed description of each property follows.

Category/Property       Values  Default  Mandatory?  Repeats?  Dependent of
Options/Filter Command  string  N/A      Y           N         N/A
Options/Arguments       string  N/A      N           N         N/A

Options Category

Filter Command. Specifies the filter command line to be executed and any command line options it requires. For example:

grep

Arguments. Allows you to specify any arguments that the command line requires. For example:

\(cancel\).*\1

Together with the grep command, this would extract all records that contain the string “cancel” twice and discard other records.
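As a hedged sketch of what the stage does with these two properties, the following Python fragment pipes some invented records through the same UNIX command using the subprocess module; the real stage streams each partition of the data set through the command:

    import subprocess

    records = b"we cancel, then cancel again\nnothing here\n"

    # Filter Command plus Arguments form the UNIX command line;
    # only the records the filter writes back out survive.
    result = subprocess.run(["grep", r"\(cancel\).*\1"],
                            input=records, capture_output=True)
    print(result.stdout)  # b'we cancel, then cancel again\n'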

Advanced Tab

This tab allows you to specify the following:

• Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the input data is processed by the available nodes as specified in the Configuration file, and by any node constraints specified on the Advanced tab. In Sequential mode the entire data set is processed by the conductor node.

• Preserve partitioning. This is Propagate by default. It adopts the setting of the previous stage. You can explicitly select Set or Clear. Select Set to request that the next stage attempt to maintain the partitioning.


• Node pool and resource constraints. Select this option to constrain parallel execution to the node pool or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from drop-down lists populated from the Configuration file.

• Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node map. You can define a node map by typing node numbers into the text box or by clicking the browse button to open the Available Nodes dialog box and selecting nodes from there. You are effectively defining a new node pool for this stage (in addition to any node pools defined in the Configuration file).

Input Page

The Inputs page allows you to specify details about the data set being filtered. There is only one input link.

The General tab allows you to specify an optional description of the link. The Partitioning tab allows you to specify how incoming data on the source data set link is partitioned. The Columns tab specifies the column definitions of incoming data.

Details about External Filter stage partitioning are given in the following section. See Chapter 3, “Stage Editors,” for a general description of the other tabs.

Partitioning on Input Links

The Partitioning tab allows you to specify details about how the incoming data is partitioned or collected before the filter is executed.

By default the stage uses the auto partitioning method. If the Preserve Partitioning option has been set on the previous stage in the job, the stage will attempt to preserve the partitioning of the incoming data.

If the External Filter stage is operating in sequential mode, it will first collect the data using the default auto collection method before filtering it.

The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends on:

• Whether the External Filter stage is set to execute in parallel or sequential mode.


• Whether the preceding stage in the job is set to execute in parallel or sequential mode.

If the External Filter stage is set to execute in parallel, then you can set a partitioning method by selecting from the Partitioning mode drop-down list. This will override any current partitioning (even if the Preserve Partitioning option has been set on the Stage page Advanced tab).

If the External Filter stage is set to execute in sequential mode, but the preceding stage is executing in parallel, then you can set a collection method from the Collection type drop-down list. This will override the default auto collection method.

The following partitioning methods are available:

• (Auto). DataStage attempts to work out the best partitioning method depending on execution modes of current and preceding stages, whether the Preserve Partitioning flag has been set on the previous stage in the job, and how many nodes are specified in the Configuration file. This is the default method for the External Filter stage.

• Entire. Each file written to receives the entire data set.

• Hash. The records are hashed into partitions based on the value of a key column or columns selected from the Available list.

• Modulus. The records are partitioned using a modulus function on the key column selected from the Available list. This is commonly used to partition on tag fields.

• Random. The records are partitioned randomly, based on the output of a random number generator.

• Round Robin. The records are partitioned on a round robin basis as they enter the stage.

• Same. Preserves the partitioning already in place.

• DB2. Replicates the DB2 partitioning method of a specific DB2 table. Requires extra properties to be set. Access these properties by clicking the properties button.

• Range. Divides a data set into approximately equal size partitions based on one or more partitioning keys. Range partitioning is often a preprocessing step to performing a total sort on a data set. Requires extra properties to be set. Access these properties by clicking the properties button.


The following Collection methods are available:

• (Auto). DataStage attempts to work out the best collection method depending on execution modes of current and preceding stages, and how many nodes are specified in the Configuration file. This is the default collection method for the External Filter stage.

• Ordered. Reads all records from the first partition, then all records from the second partition, and so on.

• Round Robin. Reads a record from the first input partition, then from the second partition, and so on. After reaching the last partition, the operator starts over.

• Sort Merge. Reads records in an order based on one or more columns of the record. This requires you to select a collecting key column from the Available list.

The Partitioning tab also allows you to specify that data arriving on the input link should be sorted before the filter is executed. The sort is always carried out within data partitions. If the stage is partitioning incoming data the sort occurs after the partitioning. If the stage is collecting data, the sort occurs before the collection. The availability of sorting depends on the partitioning method chosen.

Select the check boxes as follows:

• Sort. Select this to specify that data coming in on the link should be sorted. Select the column or columns to sort on from the Available list.

• Stable. Select this if you want to preserve previously sorted data sets. This is the default.

• Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is also set, the first record is retained.

You can also specify sort direction, case sensitivity, and collating sequence for each column in the Selected list by selecting it and right-clicking to invoke the shortcut menu.

Outputs Page

The Outputs page allows you to specify details about data output from the External Filter stage. The stage can only have one output link.


The General tab allows you to specify an optional description of the output link. The Columns tab specifies the column definitions of outgoing data. See Chapter 3, “Stage Editors,” for a general description of these tabs.


31. Change Capture Stage

The Change Capture stage is an active stage. The stage compares two data sets and makes a record of the differences.

The Change Capture stage takes two input data sets, denoted before and after, and outputs a single data set whose records represent the changes made to the before data set to obtain the after data set. The stage produces a change data set, whose table definition is transferred from the after data set’s table definition with the addition of one column: a change code with values encoding the four actions: insert, delete, copy, and edit. The preserve-partitioning flag is set on the change data set.

The comparison is based on a set of key columns; records from the two data sets are assumed to be copies of one another if they have the same values in these key columns. You can also optionally specify change values. If two records have identical key columns, you can compare the value columns to see if one is an edited copy of the other.
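For example, with hypothetical data, a key column cust_id, a value column balance, and the default change codes described later in this chapter:

   before data set             after data set
   cust_id=1, balance=100      cust_id=1, balance=100
   cust_id=2, balance=250      cust_id=2, balance=300
   cust_id=3, balance=50       cust_id=4, balance=75

The change data set would contain cust_id=2, balance=300 with change_code 3 (edit), cust_id=3 with change_code 2 (delete), and cust_id=4, balance=75 with change_code 1 (insert). The copy record for cust_id=1 (change_code 0) would be dropped, because the Drop Output for Copy property defaults to True.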

The stage assumes that the incoming data is hash-partitioned and sorted in ascending order. The columns the data is hashed on should be the key columns used for the data compare. You can achieve the sorting and partitioning using the Sort stage or by using the built-in sorting and partitioning abilities of the Change Capture stage.

You can use the companion Change Apply stage to combine the changes from the Change Capture stage with the original before data set to repro-duce the after data set.

The Change Capture stage is very similar to the Difference stage described in Chapter 35, “Difference Stage.”


The stage editor has three pages:


• Stage page. This is always present and is used to specify general information about the stage.

• Inputs page. This is where you specify details about the before and after data sets being compared.

• Outputs page. This is where you specify details about the processed data being output from the stage.

Stage Page

The General tab allows you to specify an optional description of the stage. The Properties tab lets you specify what the stage does. The Advanced tab allows you to specify how the stage executes. The Link Ordering tab allows you to specify which input link carries the before data set and which the after data set.

Properties

The Properties tab allows you to specify properties which determine what the stage actually does. Some of the properties are mandatory, although many have default settings. Properties without default settings appear in the warning color (red by default) and turn black when you supply a value for them.

The following table gives a quick reference list of the properties and their attributes. A more detailed description of each property follows.

Category/Property               Values                        Default                 Mandatory?  Repeats?  Dependent of
Change Keys/Key                 Input Column                  N/A                     Y           Y         N/A
Change Keys/Case Sensitive      True/False                    True                    N           N         Key
Change Keys/Sort Order          Ascending/Descending          Ascending               N           N         Key
Change Keys/Nulls Position      First/Last                    First                   N           N         Key
Change Values/Value             Input Column                  N/A                     N           Y         N/A
Change Values/Case Sensitive    True/False                    True                    N           N         Value
Options/Change Mode             Explicit Keys & Values/       Explicit Keys & Values  Y           N         N/A
                                All keys, Explicit values/
                                Explicit Keys, All Values
Options/Log Statistics          True/False                    False                   N           N         N/A
Options/Drop Output for Insert  True/False                    False                   N           N         N/A
Options/Drop Output for Delete  True/False                    False                   N           N         N/A
Options/Drop Output for Edit    True/False                    False                   N           N         N/A
Options/Drop Output for Copy    True/False                    True                    N           N         N/A
Options/Code Column Name        string                        change_code             N           N         N/A
Options/Copy Code               number                        0                       N           N         N/A
Options/Deleted Code            number                        2                       N           N         N/A
Options/Edit Code               number                        3                       N           N         N/A
Options/Insert Code             number                        1                       N           N         N/A

Change Keys Category

Key. Specifies the name of a difference key input column (see page 31-1 for an explanation of how Key columns are used). This property can be repeated to specify multiple difference key input columns. Key has the following dependent properties:

• Case Sensitive

Use this property to specify whether each key is case sensitive. It is set to True by default; for example, the values “CASE” and “case” would not be judged equivalent.

• Sort Order

Specify ascending or descending sort order.

• Nulls Position

Specify whether null values should be placed first or last.

Change Values Category

Value. Specifies the name of a value input column (see page 31-1 for an explanation of how Value columns are used). Value has the following dependent properties:

• Case Sensitive

Use this property to specify whether each value is case sensitive. It is set to True by default; for example, the values “CASE” and “case” would not be judged equivalent.

Options Category

Change Mode. This mode determines how keys and values are specified. Choose Explicit Keys & Values to specify the keys and values yourself. Choose All keys, Explicit values to specify that value columns must be defined, but all other columns are key columns unless excluded. Choose Explicit Keys, All Values to specify that key columns must be defined but all other columns are value columns unless they are excluded.

Log Statistics. This property configures the stage to display result information containing the number of input records and the number of copy, delete, edit, and insert records.

Drop Output for Insert. Specifies to drop (not generate) an output record for an insert result. By default, an output record is always created by the stage.


Drop Output for Delete. Specifies to drop (not generate) the output record for a delete result. By default, an output record is always created by the stage.

Drop Output for Edit. Specifies to drop (not generate) the output record for an edit result. By default, an output record is always created by the stage.

Drop Output for Copy. Specifies to drop (not generate) the output record for a copy result. By default, an output record is always created by the stage.

Code Column Name. Allows you to specify a different name for the output column carrying the change code generated for each record by the stage. By default the column is called change_code.

Copy Code. Allows you to specify an alternative value for the code that indicates the after record is a copy of the before record. By default this code is 0.

Deleted Code. Allows you to specify an alternative value for the code that indicates that a record in the before set has been deleted from the after set. By default this code is 2.

Edit Code. Allows you to specify an alternative value for the code that indicates the after record is an edited version of the before record. By default this code is 3.

Insert Code. Allows you to specify an alternative value for the code that indicates a new record has been inserted in the after set that did not exist in the before set. By default this code is 1.

Advanced Tab

This tab allows you to specify the following:

• Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the input data is processed by the available nodes as specified in the Configuration file, and by any node constraints specified on the Advanced tab. In Sequential mode the entire data set is processed by the conductor node.


• Preserve partitioning. This is Propagate by default. It adopts Set or Clear from the previous stage. You can explicitly select Set or Clear. Select Set to request that the next stage in the job should attempt to maintain the partitioning.

• Node pool and resource constraints. Select this option to constrain parallel execution to the node pool or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from drop down lists populated from the Configuration file.

• Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node map. You can define a node map by typing node numbers into the text box or by clicking the browse button to open the Available Nodes dialog box and selecting nodes from there. You are effectively defining a new node pool for this stage (in addition to any node pools defined in the Configuration file).

Link Ordering

This tab allows you to specify which input link carries the before data set and which carries the after data set.


By default the first link added will represent the before set. To rearrange the links, choose an input link and click the up arrow button or the down arrow button.

Inputs Page

The Inputs page allows you to specify details about the incoming data sets. The Change Capture stage expects two incoming data sets: a before data set and an after data set.

The General tab allows you to specify an optional description of the input link. The Partitioning tab allows you to specify how incoming data is partitioned before being compared. The Columns tab specifies the column definitions of incoming data.

Details about Change Capture stage partitioning are given in the following section. See Chapter 3, “Stage Editors,” for a general description of the other tabs.

Partitioning on Input Links

The Partitioning tab allows you to specify details about how the incoming data is partitioned or collected before it is compared. It also allows you to specify that the data should be sorted before being operated on.

By default the stage partitions in Auto mode. This attempts to work out the best partitioning method depending on execution modes of current and preceding stages, whether the Preserve Partitioning option has been set, and how many nodes are specified in the Configuration file. If the Preserve Partitioning option has been set on the previous stage in the job, this stage will attempt to preserve the partitioning of the incoming data.

If the Change Capture stage is operating in sequential mode, it will first collect the data using the default Auto collection method.

The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends on:

• Whether the Change Capture stage is set to execute in parallel or sequential mode.

• Whether the preceding stage in the job is set to execute in parallel or sequential mode.


If the Change Capture stage is set to execute in parallel, then you can set a partitioning method by selecting from the Partitioning mode drop-down list. This will override any current partitioning (even if the Preserve Partitioning option has been set on the previous stage).

If the Change Capture stage is set to execute in sequential mode, but the preceding stage is executing in parallel, then you can set a collection method from the Collection type drop-down list. This will override the default collection method.

The following partitioning methods are available (a short sketch contrasting the hash and round robin methods follows this list):

• (Auto). DataStage attempts to work out the best partitioning method depending on execution modes of current and preceding stages, whether the Preserve Partitioning option has been set, and how many nodes are specified in the Configuration file. This is the default partitioning method for the Change Capture stage.

• Entire. Each file written to receives the entire data set.

• Hash. The records are hashed into partitions based on the value of a key column or columns selected from the Available list.

• Modulus. The records are partitioned using a modulus function on the key column selected from the Available list. This is commonly used to partition on tag fields.

• Random. The records are partitioned randomly, based on the output of a random number generator.

• Round Robin. The records are partitioned on a round robin basis as they enter the stage.

• Same. Preserves the partitioning already in place.

• DB2. Replicates the DB2 partitioning method of a specific DB2 table. Requires extra properties to be set. Access these properties by clicking the properties button.

• Range. Divides a data set into approximately equal size partitions based on one or more partitioning keys. Range partitioning is often a preprocessing step to performing a total sort on a data set. Requires extra properties to be set. Access these properties by clicking the properties button.
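As promised above, here is a minimal sketch contrasting the hash and round robin methods. It is ordinary Python rather than DataStage code, and Python's built-in hash simply stands in for whatever hash function the engine uses; the point is that hash partitioning keeps records with equal key values together, while round robin only balances partition sizes.

def hash_partition(records, key, n_partitions):
    # Hash: records with the same key value always land in the same
    # partition, which key-based operations such as the data compare
    # described in this chapter require.
    parts = [[] for _ in range(n_partitions)]
    for rec in records:
        parts[hash(rec[key]) % n_partitions].append(rec)
    return parts

def round_robin_partition(records, n_partitions):
    # Round robin: records are dealt out in turn as they arrive,
    # giving evenly sized partitions but no key grouping.
    parts = [[] for _ in range(n_partitions)]
    for i, rec in enumerate(records):
        parts[i % n_partitions].append(rec)
    return parts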


The following Collection methods are available:


• (Auto). DataStage attempts to work out the best collection method depending on execution modes of current and preceding stages, and how many nodes are specified in the Configuration file. This is the default collection method for Change Capture stages.

• Ordered. Reads all records from the first partition, then all records from the second partition, and so on.

• Round Robin. Reads a record from the first input partition, then from the second partition, and so on. After reaching the last partition, the operator starts over.

• Sort Merge. Reads records in an order based on one or more columns of the record. This requires you to select a collecting key column from the Available list.

The Partitioning tab also allows you to specify that data arriving on the input link should be sorted before being compared. The sort is always carried out within data partitions. If the stage is partitioning incoming data the sort occurs after the partitioning. If the stage is collecting data, the sort occurs before the collection. The availability of sorting depends on the partitioning method chosen.

Select the check boxes as follows:

• Sort. Select this to specify that data coming in on the link should be sorted. Select the column or columns to sort on from the Available list.

• Stable. Select this if you want to preserve previously sorted data sets. This is the default.

• Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is also set, the first record is retained.

You can also specify sort direction, case sensitivity, and collating sequence for each column in the Selected list by selecting it and right-clicking to invoke the shortcut menu.

Outputs Page

The Outputs page allows you to specify details about data output from the Change Capture stage. The Change Capture stage can have only one output link.


The General tab allows you to specify an optional description of the output link. The Columns tab specifies the column definitions of outgoing data. The Mapping tab allows you to specify the relationship between the columns being input to the Change Capture stage and the Output columns.

Details about Change Capture stage mapping are given in the following section. See Chapter 3, “Stage Editors,” for a general description of the other tabs.

Mapping Tab

For the Change Capture stage the Mapping tab allows you to specify how the output columns are derived, i.e., what input columns map onto them and which column carries the change code data.


The left pane shows the columns from the before/after data sets plus the change code column. These are read only and cannot be modified on this tab.


The right pane shows the output columns for each link. This has a Derivations field where you can specify how the column is derived. You can fill it in by dragging input columns over, or by using the Auto-match facility. By default the data set columns are mapped automatically. You need to ensure that there is an output column to carry the change code and that this is mapped to the change_code column.


32 Change Apply Stage

The Change Apply stage is an active stage. It takes the change data set, which contains the changes in the before and after data sets, from the Change Capture stage and applies the encoded change operations to a before data set to compute an after data set. (See Chapter 31 for a description of the Change Capture stage.)

The before input to Change Apply must have the same columns as the before input that was input to Change Capture, and an automatic conversion must exist between the types of corresponding columns. In addition, results are only guaranteed if the contents of the before input to Change Apply are identical (in value and record order in each partition) to the before input that was fed to Change Capture, and if the keys are unique.

The change input to Change Apply must have been output from Change Capture without modification. Because preserve-partitioning is set on the change output of Change Capture, the Change Apply stage has the same number of partitions as the Change Capture stage. Additionally, both inputs of Change Apply are designated as partitioned using the Same partitioning method.

The Change Apply stage reads a record from the change data set and a record from the before data set, compares their key column values, and acts accordingly (a simplified sketch of this merge appears after the following list):

• If the before keys come before the change keys in the specified sort order, the before record is copied to the output. The change record is retained for the next comparison.

• If the before keys are equal to the change keys, the behavior depends on the code in the change_code column of the change record:


– Insert: The change record is copied to the output; the stage retains the same before record for the next comparison. If key columns are not unique, and there is more than one consecutive insert with the same key, then Change Apply applies all the consecutive inserts before existing records. This record order may be different from the after data set given to Change Capture.

– Delete: The value columns of the before and change records are compared. If the value columns are the same or if the Check Value Columns on Delete is specified as False, the change and before records are both discarded; no record is transferred to the output. If the value columns are not the same, the before record is copied to the output and the stage retains the same change record for the next comparison.

If key columns are not unique, the value columns ensure that the correct record is deleted. If more than one record with the same keys has matching value columns, the first-encountered record is deleted. This may cause different record ordering than in the after data set given to the Change Capture stage. A warning is issued and both the change record and the before record are discarded, i.e., no output record results.

– Edit: The change record is copied to the output; the before record is discarded. If key columns are not unique, then the first before record encountered with matching keys will be edited. This may be a different record from the one that was edited in the after data set given to the Change Capture stage. A warning is issued and the change record is copied to the output, but the stage retains the same before record for the next comparison.

– Copy: The change record is discarded. The before record is copied to the output.

• If the before keys come after the change keys, behavior also depends on the change_code column:

– Insert. The change record is copied to the output; the stage retains the same before record for the next comparison. (The same as when the keys are equal.)

– Delete. A warning is issued and the change record is discarded, while the before record is retained for the next comparison.

– Edit or Copy. A warning is issued and the change record is copied to the output, while the before record is retained for the next comparison.
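The following minimal sketch applies these rules to a single sorted partition. It is ordinary Python rather than anything DataStage executes; the function name, the dict-based records, and the omission of the value-column check on deletes and of the warnings are all simplifying assumptions.

COPY, INSERT, DELETE, EDIT = 0, 1, 2, 3  # default change code values

def strip_code(rec):
    # Remove the change_code column before writing a record out.
    return {k: v for k, v in rec.items() if k != "change_code"}

def change_apply(before, change, key):
    # before and change: lists of dicts, each sorted ascending on key.
    out = []
    b, c = iter(before), iter(change)
    brec, crec = next(b, None), next(c, None)
    while brec is not None or crec is not None:
        if crec is None or (brec is not None and brec[key] < crec[key]):
            out.append(brec)              # before keys first: copy through
            brec = next(b, None)
        elif brec is None or crec[key] < brec[key]:
            # Change keys first: insert is the normal case; the real
            # stage also issues warnings for delete, edit, and copy here.
            if crec["change_code"] != DELETE:
                out.append(strip_code(crec))
            crec = next(c, None)
        else:                             # keys are equal
            code = crec["change_code"]
            if code == INSERT:
                out.append(strip_code(crec))
                crec = next(c, None)      # before record retained
            elif code == DELETE:          # both records discarded
                brec, crec = next(b, None), next(c, None)
            elif code == EDIT:            # edited change record output
                out.append(strip_code(crec))
                brec, crec = next(b, None), next(c, None)
            else:                         # COPY: before record output
                out.append(brec)
                brec, crec = next(b, None), next(c, None)
    return out

With a before data set identical to the one given to Change Capture and unique keys, this merge reproduces the after data set, which is the identity property described in the note below.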


Note: If the before input of Change Apply is identical to the before input of Change Capture and either the keys are unique or copy records are used, then the output of Change Apply is identical to the after input of Change Capture. However, if the before input of Change Apply is not the same (different record contents or ordering), or the keys are not unique and copy records are not used, this is not detected and the rules described above are applied anyway, producing a result that might or might not be useful.

The stage editor has three pages:

• Stage page. This is always present and is used to specify general information about the stage.

• Inputs page. This is where you specify details about the before and change data sets.

• Outputs page. This is where you specify details about the processed data being output from the stage.

Stage Page

The General tab allows you to specify an optional description of the stage. The Properties tab lets you specify what the stage does. The Advanced tab allows you to specify how the stage executes. The Link Ordering tab allows you to specify which input link carries the before data set and which the change data set.

Properties

The Properties tab allows you to specify properties which determine what the stage actually does. Some of the properties are mandatory, although many have default settings. Properties without default settings appear in the warning color (red by default) and turn black when you supply a value for them.

The following table gives a quick reference list of the properties and their attributes. A more detailed description of each property follows.

Category/Property                      Values                        Default                 Mandatory?  Repeats?  Dependent of
Change Keys/Key                        Input Column                  N/A                     Y           Y         N/A
Change Keys/Case Sensitive             True/False                    True                    N           N         Key
Change Keys/Sort Order                 Ascending/Descending          Ascending               N           N         Key
Change Keys/Nulls Position             First/Last                    First                   N           N         Key
Change Values/Value                    Input Column                  N/A                     N           Y         N/A
Change Values/Case Sensitive           True/False                    True                    N           N         Value
Options/Change Mode                    Explicit Keys & Values/       Explicit Keys & Values  Y           N         N/A
                                       All keys, Explicit values/
                                       Explicit Keys, All Values
Options/Log Statistics                 True/False                    False                   N           N         N/A
Options/Check Value Columns on Delete  True/False                    True                    Y           N         N/A
Options/Code Column Name               string                        change_code             N           N         N/A
Options/Copy Code                      number                        0                       N           N         N/A
Options/Deleted Code                   number                        2                       N           N         N/A
Options/Edit Code                      number                        3                       N           N         N/A
Options/Insert Code                    number                        1                       N           N         N/A

Change Keys Category

Key. Specifies the name of a difference key input column. This property can be repeated to specify multiple difference key input columns. Key has the following dependent properties:

• Case Sensitive

Use this property to specify whether each key is case sensitive. It is set to True by default; for example, the values “CASE” and “case” would not be judged equivalent.

• Sort Order

Specify ascending or descending sort order.

• Nulls Position

Specify whether null values should be placed first or last.

Change Values Category

Value. Specifies the name of a value input column (see page 32-1 for an explanation of how Value columns are used). Value has the following dependent properties:

• Case Sensitive

Use this property to specify whether each value is case sensitive. It is set to True by default; for example, the values “CASE” and “case” would not be judged equivalent.

Options Category

Change Mode. This mode determines how keys and values are specified. Choose Explicit Keys & Values to specify the keys and values yourself. Choose All keys, Explicit values to specify that value columns must be defined, but all other columns are key columns unless excluded. Choose Explicit Keys, All Values to specify that key columns must be defined but all other columns are value columns unless they are excluded.

Log Statistics. This property configures the stage to display result information containing the number of input records and the number of copy, delete, edit, and insert records.

Check Value Columns on Delete. Set this property to False to specify that DataStage should not check value columns on deletes. Normally (the default, True), Change Apply compares the value columns of delete change records to those in the before record to ensure that it is deleting the correct record.

Code Column Name. Allows you to specify that a different name has been used for the change data set column carrying the change code generated for each record by the stage. By default the column is called change_code.

Copy Code. Allows you to specify an alternative value for the code that indicates a record copy. By default this code is 0.


Deleted Code. Allows you to specify an alternative value for the code that indicates a record delete. By default this code is 2.

Edit Code. Allows you to specify an alternative value for the code that indicates a record edit. By default this code is 3.

Insert Code. Allows you to specify an alternative value for the code that indicates a record insert. By default this code is 1.

Advanced Tab

This tab allows you to specify the following:

• Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the input data is processed by the available nodes as specified in the Configuration file, and by any node constraints specified on the Advanced tab. In Sequential mode the entire data set is processed by the conductor node.

• Preserve partitioning. This is Propagate by default. It adopts Set or Clear from the previous stage. You can explicitly select Set or Clear. Select Set to request that the next stage in the job should attempt to maintain the partitioning.

• Node pool and resource constraints. Select this option to constrain parallel execution to the node pool or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from drop down lists populated from the Configuration file.

• Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node map. You can define a node map by typing node numbers into the text box or by clicking the browse button to open the Available Nodes dialog box and selecting nodes from there. You are effectively defining a new node pool for this stage (in addition to any node pools defined in the Configuration file).


Link Ordering

This tab allows you to specify which input link carries the before data set and which carries the change data set.

By default the first link added will represent the before set. To rearrange the links, choose an input link and click the up arrow button or the down arrow button.

Inputs Page

The Inputs page allows you to specify details about the incoming data sets. The Change Apply stage expects two incoming data sets: a before data set and a change data set.

The General tab allows you to specify an optional description of the input link. The Partitioning tab allows you to specify how incoming data is partitioned before being compared. The Columns tab specifies the column definitions of incoming data.

Details about Change Apply stage partitioning are given in the following section. See Chapter 3, “Stage Editors,” for a general description of the other tabs.


Partitioning on Input Links

The change input to Change Apply should have been output from the Change Capture stage without modification. Because preserve-partitioning is set on the change output of Change Capture, the Change Apply stage has the same number of partitions as the Change Capture stage. Additionally, both inputs of Change Apply are automatically designated as partitioned using the Same partitioning method.

The standard partitioning and collecting controls are available on the Change Apply stage, however, so you can override this behavior.

If the Change Apply stage is operating in sequential mode, it will first collect the data using the default Auto collection method.

The Partitioning tab allows you to override the default behavior. The exact operation of this tab depends on:

• Whether the Change Apply stage is set to execute in parallel or sequential mode.

• Whether the preceding stage in the job is set to execute in parallel or sequential mode.

If the Change Apply stage is set to execute in parallel, then you can set a partitioning method by selecting from the Partitioning mode drop-down list. This will override any current partitioning (even if the Preserve Partitioning option has been set on the Stage page Advanced tab).

If the Change Apply stage is set to execute in sequential mode, but the preceding stage is executing in parallel, then you can set a collection method from the Collection type drop-down list. This will override the default auto collection method.

The following partitioning methods are available:

• (Auto). DataStage attempts to work out the best partitioning method depending on execution modes of current and preceding stages, whether the Preserve Partitioning flag has been set on the previous stage in the job, and how many nodes are specified in the Configuration file. This is the default method for the Change Apply stage.


• Entire. Each file written to receives the entire data set.

• Hash. The records are hashed into partitions based on the value of a key column or columns selected from the Available list.


• Modulus. The records are partitioned using a modulus function on the key column selected from the Available list. This is commonly used to partition on tag fields.

• Random. The records are partitioned randomly, based on the output of a random number generator.

• Round Robin. The records are partitioned on a round robin basis as they enter the stage.

• Same. Preserves the partitioning already in place.

• DB2. Replicates the DB2 partitioning method of a specific DB2 table. Requires extra properties to be set. Access these properties by clicking the properties button.

• Range. Divides a data set into approximately equal size partitions based on one or more partitioning keys. Range partitioning is often a preprocessing step to performing a total sort on a data set. Requires extra properties to be set. Access these properties by clicking the properties button.

The following Collection methods are available:

• (Auto). DataStage attempts to work out the best collection method depending on execution modes of current and preceding stages, and how many nodes are specified in the Configuration file. This is the default collection method for the Change Apply stage.

• Ordered. Reads all records from the first partition, then all records from the second partition, and so on.

• Round Robin. Reads a record from the first input partition, then from the second partition, and so on. After reaching the last partition, the operator starts over.

• Sort Merge. Reads records in an order based on one or more columns of the record. This requires you to select a collecting key column from the Available list.

The Partitioning tab also allows you to specify that data arriving on the input link should be sorted before the operation is performed. The sort is always carried out within data partitions. If the stage is partitioning incoming data the sort occurs after the partitioning. If the stage is collecting data, the sort occurs before the collection. The availability of sorting depends on the partitioning method chosen.

Select the check boxes as follows:


• Sort. Select this to specify that data coming in on the link should be sorted. Select the column or columns to sort on from the Available list.

• Stable. Select this if you want to preserve previously sorted data sets. This is the default.

• Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is also set, the first record is retained.

You can also specify sort direction, case sensitivity, and collating sequence for each column in the Selected list by selecting it and right-clicking to invoke the shortcut menu.

Outputs Page

The Outputs page allows you to specify details about data output from the Change Apply stage. The Change Apply stage can have only one output link.

The General tab allows you to specify an optional description of the output link. The Columns tab specifies the column definitions of outgoing data. The Mapping tab allows you to specify the relationship between the columns being input to the Change Apply stage and the Output columns.

Details about Change Apply stage mapping are given in the following section. See Chapter 3, “Stage Editors,” for a general description of the other tabs.


Mapping Tab

For the Change Apply stage the Mapping tab allows you to specify how the output columns are derived, i.e., what input columns map onto them or how they are generated.

The left pane shows the common columns of the before and change data sets. These are read only and cannot be modified on this tab.

The right pane shows the output columns for the output link. This has a Derivations field where you can specify how the column is derived. You can fill it in by dragging input columns over, or by using the Auto-match facility. By default the columns are mapped straight across.


33 Encode Stage

The Encode stage is an active stage. It encodes a data set using a UNIX encoding command that you supply. The stage converts a data set from a sequence of records into a stream of raw binary data. The companion Decode stage reconverts the data stream to a data set.

The stage editor has three pages:

• Stage page. This is always present and is used to specify general information about the stage.

• Inputs page. This is where you specify details about the data set being encoded.

• Outputs page. This is where you specify details about the processed data being output from the stage.

Stage Page

The General tab allows you to specify an optional description of the stage. The Properties tab lets you specify what the stage does. The Advanced tab allows you to specify how the stage executes.

PropertiesThe Properties tab allows you to specify properties which determine what the stage actually does. This stage only has one property and you must supply a value for this. The property appears in the warning color (red by default) until you supply a value.


The following table gives a quick reference list of the properties and their attributes. A more detailed description of each property follows.

Category/Property     Values        Default  Mandatory?  Repeats?  Dependent of
Options/Command Line  Command Line  N/A      Y           N         N/A

Options Category

Command Line. Specifies the command line used for encoding the data set. The command line must configure the UNIX command to accept input from standard input and write its results to standard output. The command must be located in your search path and be accessible by every processing node on which the Encode stage executes.
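For example, assuming the gzip utility is installed and on the search path of every processing node (an assumption about your environment, not a requirement of the stage), you could enter:

   gzip

Called with no file arguments, gzip reads records from standard input and writes compressed binary data to standard output, which is exactly the contract the stage requires.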

Advanced Tab

This tab allows you to specify the following:

• Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the input data is processed by the available nodes as specified in the Configuration file, and by any node constraints specified on the Advanced tab. In Sequential mode the entire data set is processed by the conductor node.

• Preserve partitioning. This is Set by default to request that the next stage in the job should attempt to maintain the partitioning.

• Node pool and resource constraints. Select this option to constrain parallel execution to the node pool or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from drop down lists populated from the Configuration file.

• Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node map. You can define a node map by typing node numbers into the text box or by clicking the browse button to open the Available Nodes dialog box and selecting nodes from there. You are effectively defining a new node pool for this stage (in addition to any node pools defined in the Configuration file).


Inputs Page

The Inputs page allows you to specify details about the incoming data set. The Encode stage can only have one input link.

The General tab allows you to specify an optional description of the input link. The Partitioning tab allows you to specify how incoming data is partitioned before being encoded. The Columns tab specifies the column definitions of incoming data.

Details about Encode stage partitioning are given in the following section. See Chapter 3, “Stage Editors,” for a general description of the other tabs.

Partitioning on Input Links

The Partitioning tab allows you to specify details about how the incoming data is partitioned or collected before it is encoded. It also allows you to specify that the data should be sorted before being operated on.

By default the stage partitions in Auto mode. This attempts to work out the best partitioning method depending on execution modes of current and preceding stages, whether the Preserve Partitioning option has been set, and how many nodes are specified in the Configuration file. If the Preserve Partitioning option has been set on the previous stage in the job, this stage will attempt to preserve the partitioning of the incoming data.

If the Encode stage is operating in sequential mode, it will first collect the data using the default Auto collection method.

The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends on:

• Whether the Encode stage is set to execute in parallel or sequential mode.

• Whether the preceding stage in the job is set to execute in parallel or sequential mode.

If the Encode stage is set to execute in parallel, then you can set a partitioning method by selecting from the Partitioning mode drop-down list. This will override any current partitioning (even if the Preserve Partitioning option has been set on the previous stage).


If the Encode stage is set to execute in sequential mode, but the preceding stage is executing in parallel, then you can set a collection method from the Collection type drop-down list. This will override the default collection method.

The following partitioning methods are available:

• (Auto). DataStage attempts to work out the best partitioning method depending on execution modes of current and preceding stages, whether the Preserve Partitioning option has been set, and how many nodes are specified in the Configuration file. This is the default partitioning method for the Encode stage.

• Entire. Each file written to receives the entire data set.

• Hash. The records are hashed into partitions based on the value of a key column or columns selected from the Available list.

• Modulus. The records are partitioned using a modulus function on the key column selected from the Available list. This is commonly used to partition on tag fields.

• Random. The records are partitioned randomly, based on the output of a random number generator.

• Round Robin. The records are partitioned on a round robin basis as they enter the stage.

• Same. Preserves the partitioning already in place.

• DB2. Replicates the DB2 partitioning method of a specific DB2 table. Requires extra properties to be set. Access these properties by clicking the properties button.

• Range. Divides a data set into approximately equal size partitions based on one or more partitioning keys. Range partitioning is often a preprocessing step to performing a total sort on a data set. Requires extra properties to be set. Access these properties by clicking the properties button.

The following Collection methods are available:

• (Auto). DataStage attempts to work out the best collection method depending on execution modes of current and preceding stages, and how many nodes are specified in the Configuration file. This is the default collection method for Encode stages.


• Ordered. Reads all records from the first partition, then all records from the second partition, and so on.


• Round Robin. Reads a record from the first input partition, then from the second partition, and so on. After reaching the last partition, the operator starts over.

• Sort Merge. Reads records in an order based on one or more columns of the record. This requires you to select a collecting key column from the Available list.

The Partitioning tab also allows you to specify that data arriving on the input link should be sorted before being encoded. The sort is always carried out within data partitions. If the stage is partitioning incoming data the sort occurs after the partitioning. If the stage is collecting data, the sort occurs before the collection. The availability of sorting depends on the partitioning method chosen.

Select the check boxes as follows:

• Sort. Select this to specify that data coming in on the link should be sorted. Select the column or columns to sort on from the Available list.

• Stable. Select this if you want to preserve previously sorted data sets. This is the default.

• Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is also set, the first record is retained.

You can also specify sort direction, case sensitivity, and collating sequence for each column in the Selected list by selecting it and right-clicking to invoke the shortcut menu.

Outputs Page

The Outputs page allows you to specify details about data output from the Encode stage. The Encode stage can have only one output link.

The General tab allows you to specify an optional description of the output link. The Columns tab specifies the column definitions of outgoing data.

See Chapter 3, “Stage Editors,” for a general description of these tabs.


34 Decode Stage

The Decode stage is an active stage. It decodes a data set using a UNIX decoding command that you supply. It converts a data stream of raw binary data into a data set. Its companion stage Encode converts a data set from a sequence of records to a stream of raw binary data.

The stage editor has three pages:

• Stage page. This is always present and is used to specify general information about the stage.

• Inputs page. This is where you specify details about the data stream being decoded.

• Outputs page. This is where you specify details about the processed data being output from the stage.

Stage Page

The General tab allows you to specify an optional description of the stage. The Properties tab lets you specify what the stage does. The Advanced tab allows you to specify how the stage executes.

Properties

The Properties tab allows you to specify properties which determine what the stage actually does. This stage only has one property and you must supply a value for this. The property appears in the warning color (red by default) until you supply a value.

Category/Property     Values        Default  Mandatory?  Repeats?  Dependent of
Options/Command Line  Command Line  N/A      Y           N         N/A

Options Category

Command Line. Specifies the command line used for decoding the data set. The command line must configure the UNIX command to accept input from standard input and write its results to standard output. The command must be located in the search path of your application and be accessible by every processing node on which the Decode stage executes.
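For example, if the data set was encoded with gzip, as in the example in Chapter 33, the matching decode command would be:

   gzip -d

Again this assumes gzip is available on every processing node; gzip -d reads the compressed stream from standard input and writes the decoded records to standard output.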

Advanced Tab

This tab allows you to specify the following:

• Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the input data is processed by the available nodes as specified in the Configuration file, and by any node constraints specified on the Advanced tab. In Sequential mode the entire data set is processed by the conductor node.

• Preserve partitioning. This is Propagate by default. It adopts Set or Clear from the previous stage. You can explicitly select Set or Clear. Select Set to request that the next stage in the job should attempt to maintain the partitioning.

• Node pool and resource constraints. Select this option to constrain parallel execution to the node pool or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from drop down lists populated from the Configuration file.

• Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node map. You can define a node map by typing node numbers into the text box or by clicking the browse button to open the Available Nodes dialog box and selecting nodes from there. You are effectively defining a new node pool for this stage (in addition to any node pools defined in the Configuration file).


Inputs Page

The Inputs page allows you to specify details about the incoming data set. The Decode stage can only have one input link.

The General tab allows you to specify an optional description of the input link. The Partitioning tab allows you to specify how incoming data is partitioned before being decoded. The Columns tab specifies the column definitions of incoming data.

Details about Decode stage partitioning are given in the following section. See Chapter 3, “Stage Editors,” for a general description of the other tabs.

Partitioning on Input Links

The Partitioning tab allows you to specify details about how the incoming data is partitioned or collected before it is decoded. It also allows you to specify that the data should be sorted before being operated on.

The Decode stage partitions in Same mode and this cannot be overridden.

If the Decode stage is set to execute in sequential mode, but the preceding stage is executing in parallel, then you can set a collection method from the Collection type drop-down list. This will override the default collection method.

The following Collection methods are available:

• (Auto). DataStage attempts to work out the best collection method depending on execution modes of current and preceding stages, and how many nodes are specified in the Configuration file. This is the default collection method for Decode stages.

• Ordered. Reads all records from the first partition, then all records from the second partition, and so on.

• Round Robin. Reads a record from the first input partition, then from the second partition, and so on. After reaching the last partition, the operator starts over.

• Sort Merge. Reads records in an order based on one or more columns of the record. This requires you to select a collecting key column from the Available list.

The Partitioning tab also allows you to specify that data arriving on the input link should be sorted before being decoded. The sort is always carried out within data partitions. If the stage is partitioning incoming data the sort occurs after the partitioning. If the stage is collecting data, the sort occurs before the collection. The availability of sorting depends on the partitioning method chosen.

Select the check boxes as follows:

• Sort. Select this to specify that data coming in on the link should be sorted. Select the column or columns to sort on from the Available list.

• Stable. Select this if you want to preserve previously sorted data sets. This is the default.

• Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is also set, the first record is retained.

You can also specify sort direction, case sensitivity, and collating sequence for each column in the Selected list by selecting it and right-clicking to invoke the shortcut menu.

Outputs Page

The Outputs page allows you to specify details about data output from the Decode stage. The Decode stage can have only one output link.

The General tab allows you to specify an optional description of the output link. The Columns tab specifies the column definitions of outgoing data.

See Chapter 3, “Stage Editors,” for a general description of the tabs.


35 Difference Stage

The Difference stage is an active stage. It performs a record-by-record comparison of two input data sets, which are different versions of the same data set designated the before and after data sets. The Difference stage outputs a single data set whose records represent the difference between them. The stage assumes that the input data sets have been hash-partitioned and sorted in ascending order on the key columns you specify for the Difference stage comparison. You can achieve this by using the Sort stage or by using the built-in sorting and partitioning abilities of the Difference stage.

The comparison is performed based on a set of difference key columns. Two records are copies of one another if they have the same value for all difference keys. You can also optionally specify change values. If two records have identical key columns, you can compare the value columns to see if one is an edited copy of the other.

The stage generates an extra column, DiffCode, which indicates the result of each record comparison.
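For example, with hypothetical data, cust_id as the sole difference key, and balance as a value column: a record whose key appears in both data sets with an unchanged balance receives DiffCode 0 (copy); one whose balance differs receives 3 (edit); a key present only in the before data set receives 2 (delete); and a key present only in the after data set receives 1 (insert). These are the default codes, and they can be changed with the Copy Code, Deleted Code, Edit Code, and Insert Code properties described below.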

The stage editor has three pages:

• Stage page. This is always present and is used to specify general information about the stage.

• Inputs page. This is where you specify details about the before and after data sets being compared.

• Outputs page. This is where you specify details about the processed data being output from the stage.


Stage Page

The General tab allows you to specify an optional description of the stage. The Properties tab lets you specify what the stage does. The Advanced tab allows you to specify how the stage executes. The Link Ordering tab allows you to specify which input link carries the before data set and which the after data set.

Properties

The Properties tab allows you to specify properties which determine what the stage actually does. Some of the properties are mandatory, although many have default settings. Properties without default settings appear in the warning color (red by default) and turn black when you supply a value for them.

The following table gives a quick reference list of the properties and their attributes. A more detailed description of each property follows.

Category/Property                                 Values        Default  Mandatory?  Repeats?  Dependent of
Difference Keys/Key                               Input Column  N/A      Y           Y         N/A
Difference Keys/Case Sensitive                    True/False    True     N           N         Key
Difference Values/All non-Key Columns are Values  True/False    False    Y           N         N/A
Difference Values/Case Sensitive                  True/False    True     N           N         All non-Key Columns are Values
Options/Tolerate Unsorted Inputs                  True/False    False    N           N         N/A
Options/Log Statistics                            True/False    False    N           N         N/A
Options/Drop Output for Insert                    True/False    False    N           N         N/A
Options/Drop Output for Delete                    True/False    False    N           N         N/A
Options/Drop Output for Edit                      True/False    False    N           N         N/A
Options/Drop Output for Copy                      True/False    False    N           N         N/A
Options/Copy Code                                 number        0        N           N         N/A
Options/Deleted Code                              number        2        N           N         N/A
Options/Edit Code                                 number        3        N           N         N/A
Options/Insert Code                               number        1        N           N         N/A

Difference Keys Category

Key. Specifies the name of a difference key input column. This property can be repeated to specify multiple difference key input columns. Key has this dependent property:

• Case Sensitive

Use this property to specify whether each key is case sensitive. It is set to True by default; for example, the values “CASE” and “case” would not be judged equivalent.

Difference Values Category

All non-Key Columns are Values. Set this to True to indicate that any columns not designated as difference key columns are value columns (see page 35-1 for a description of value columns). It is False by default. The property has this dependent property:

• Case Sensitive

Use this property to specify whether each value is case sensitive. It is set to True by default; for example, the values “CASE” and “case” would not be judged equivalent.


Options Category

Tolerate Unsorted Inputs. Specifies that the input data sets are not sorted. This property allows you to process groups of records that may be arranged by the difference key columns but not sorted. The stage processes the input records in the order in which they appear on its input. It is False by default.

Log Statistics. This property configures the stage to display result information containing the number of input records and the number of copy, delete, edit, and insert records. It is False by default.

Drop Output for Insert. Specifies to drop (not generate) an output record for an insert result. By default, an output record is always created by the stage.

Drop Output for Delete. Specifies to drop (not generate) the output record for a delete result. By default, an output record is always created by the stage.

Drop Output for Edit. Specifies to drop (not generate) the output record for an edit result. By default, an output record is always created by the stage.

Drop Output for Copy. Specifies to drop (not generate) the output record for a copy result. By default, an output record is always created by the stage.

Copy Code. Allows you to specify an alternative value for the code that indicates the after record is a copy of the before record. By default this code is 0.

Deleted Code. Allows you to specify an alternative value for the code that indicates that a record in the before set has been deleted from the after set. By default this code is 2.

Edit Code. Allows you to specify an alternative value for the code that indicates the after record is an edited version of the before record. By default this code is 3.


Insert Code. Allows you to specify an alternative value for the code that indicates a new record has been inserted in the after set that did not exist in the before set. By default this code is 1.

Advanced Tab

This tab allows you to specify the following:

• Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the input data is processed by the available nodes as specified in the Configuration file, and by any node constraints specified on the Advanced tab. In Sequential mode the entire data set is processed by the conductor node.

• Preserve partitioning. This is Propagate by default. It adopts Set or Clear from the previous stage. You can explicitly select Set or Clear. Select Set to request that the next stage in the job should attempt to maintain the partitioning.

• Node pool and resource constraints. Select this option to constrain parallel execution to the node pool or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from drop down lists populated from the Configuration file.

• Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node map. You can define a node map by typing node numbers into the text box or by clicking the browse button to open the Available Nodes dialog box and selecting nodes from there. You are effectively defining a new node pool for this stage (in addition to any node pools defined in the Configuration file).

Difference Stage 35-5


Link Ordering

This tab allows you to specify which input link carries the before data set and which carries the after data set.

By default the first link added will represent the before set. To rearrange the links, choose an input link and click the up arrow button or the down arrow button.

Inputs Page

The Inputs page allows you to specify details about the incoming data sets. The Difference stage expects two incoming data sets: a before data set and an after data set.

The General tab allows you to specify an optional description of the input link. The Partitioning tab allows you to specify how incoming data is partitioned before being compared. The Columns tab specifies the column definitions of incoming data.


Details about Difference stage partitioning are given in the following section. See Chapter 3, “Stage Editors,” for a general description of the other tabs.


Partitioning on Input Links

The Partitioning tab allows you to specify details about how the incoming data is partitioned or collected before the operation is performed. It also allows you to specify that the data should be sorted before being operated on.

By default the stage partitions in Auto mode. This attempts to work out the best partitioning method depending on execution modes of current and preceding stages, whether the Preserve Partitioning option has been set, and how many nodes are specified in the Configuration file. If the Preserve Partitioning option has been set on the previous stage in the job, this stage will attempt to preserve the partitioning of the incoming data.

If the Difference stage is operating in sequential mode, it will first collect the data using the default Auto collection method.

The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends on:

• Whether the Difference stage is set to execute in parallel or sequential mode.

• Whether the preceding stage in the job is set to execute in parallel or sequential mode.

If the Difference stage is set to execute in parallel, then you can set a partitioning method by selecting from the Partitioning mode drop-down list. This will override any current partitioning (even if the Preserve Partitioning option has been set on the previous stage).

If the Difference stage is set to execute in sequential mode, but the preceding stage is executing in parallel, then you can set a collection method from the Collection type drop-down list. This will override the default collection method.

The following partitioning methods are available:

• (Auto). DataStage attempts to work out the best partitioning method depending on execution modes of current and preceding stages, whether the Preserve Partitioning option has been set, and how many nodes are specified in the Configuration file. This is the default partitioning method for the Difference stage.


• Entire. Each partition receives the entire data set.

• Hash. The records are hashed into partitions based on the value of a key column or columns selected from the Available list.


• Modulus. The records are partitioned using a modulus function on the key column selected from the Available list. This is commonly used to partition on tag fields.

• Random. The records are partitioned randomly, based on the output of a random number generator.

• Round Robin. The records are partitioned on a round robin basis as they enter the stage.

• Same. Preserves the partitioning already in place.

• DB2. Replicates the DB2 partitioning method of a specific DB2 table. Requires extra properties to be set. Access these properties by clicking the properties button.

• Range. Divides a data set into approximately equal size partitions based on one or more partitioning keys. Range partitioning is often a preprocessing step to performing a total sort on a data set. Requires extra properties to be set. Access these properties by clicking the properties button.
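
As an illustration of how the key-based methods above differ, the following Python sketch (hypothetical; the column names are invented) shows hash and modulus partitioning for a job running on a given number of nodes:

    # Hash partitioning: records with the same key values always land
    # in the same partition, whatever the key's data type.
    def hash_partition(record, key_columns, num_partitions):
        key = tuple(record[c] for c in key_columns)
        return hash(key) % num_partitions

    # Modulus partitioning: for a single numeric key, the value itself
    # is used, which is why it suits tag fields.
    def modulus_partition(record, key_column, num_partitions):
        return record[key_column] % num_partitions

    row = {"tag": 7, "name": "x"}
    print(hash_partition(row, ["name"], 4), modulus_partition(row, "tag", 4))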

The following Collection methods are available:

• (Auto). DataStage attempts to work out the best collection method depending on execution modes of current and preceding stages, and how many nodes are specified in the Configuration file. This is the default collection method for Difference stages.

• Ordered. Reads all records from the first partition, then all records from the second partition, and so on.

• Round Robin. Reads a record from the first input partition, then from the second partition, and so on. After reaching the last partition, the operator starts over.

• Sort Merge. Reads records in an order based on one or more columns of the record. This requires you to select a collecting key column from the Available list.
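
A small sketch of the collection methods, assuming each input partition is a Python list (and, for Sort Merge, already sorted on the collecting key):

    import heapq

    partitions = [[(1, "a"), (4, "d")], [(2, "b"), (3, "c")]]

    # Ordered: partition 0 in full, then partition 1, and so on.
    ordered = [rec for part in partitions for rec in part]

    # Sort Merge: one sorted stream produced by merging sorted partitions.
    sort_merged = list(heapq.merge(*partitions, key=lambda rec: rec[0]))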

The Partitioning tab also allows you to specify that data arriving on the input link should be sorted before the operation is performed. The sort is always carried out within data partitions. If the stage is partitioning incoming data the sort occurs after the partitioning. If the stage is collecting data, the sort occurs before the collection. The availability of sorting depends on the partitioning method chosen.

Select the check boxes as follows:


• Sort. Select this to specify that data coming in on the link should be sorted. Select the column or columns to sort on from the Available list.

• Stable. Select this if you want to preserve previously sorted data sets. This is the default.

• Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is also set, the first record is retained.

You can also specify sort direction, case sensitivity, and collating sequence for each column in the Selected list by selecting it and right-clicking to invoke the shortcut menu.
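
The interaction of Stable and Unique can be sketched as follows (Python's sort is itself stable, which mirrors the Stable behavior of preserving the previous order of records with equal keys):

    records = [("smith", 2), ("jones", 1), ("smith", 1)]

    # Stable sort on the key column: equal keys keep their input order.
    records.sort(key=lambda rec: rec[0])

    # Unique: retain only the first record for each sorting key value.
    seen, unique = set(), []
    for rec in records:
        if rec[0] not in seen:
            seen.add(rec[0])
            unique.append(rec)
    # unique -> [('jones', 1), ('smith', 2)]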

Outputs Page

The Outputs page allows you to specify details about data output from the Difference stage. The Difference stage can have only one output link.

The General tab allows you to specify an optional description of the output link. The Columns tab specifies the column definitions of the data. The Mapping tab allows you to specify the relationship between the columns being input to the Difference stage and the Output columns.

Details about Difference stage mapping are given in the following section. See Chapter 3, “Stage Editors,” for a general description of the other tabs.



Mapping Tab

For the Difference stage the Mapping tab allows you to specify how the output columns are derived, i.e., what input columns map onto them or how they are generated.

The left pane shows the columns from the before/after data sets plus the DiffCode column. These are read only and cannot be modified on this tab.

The right pane shows the output columns for each link. This has a Derivations field where you can specify how the column is derived. You can fill it in by dragging input columns over, or by using the Auto-match facility. By default the data set columns are mapped automatically. You need to ensure that there is an output column to carry the change code and that this is mapped to the DiffCode column.



36. Column Import Stage

The Column Import stage is an active stage. It can have a single input link, a single output link and a single rejects link.

The Column Import stage imports data from a single column and outputs it to one or more columns. You would typically use it to divide data arriving in a single column into multiple columns. The data would be delimited in some way to tell the Column Import stage where to make the divisions. The input column must contain string or binary data; the output columns can be of any data type.

You supply an import table definition to specify the target columns and their types. This also determines the order in which data from the import column is written to output columns. Information about the format of the incoming column (e.g., how it is delimited) is given in the Format tab of the Outputs page. You can optionally save reject records, that is, records whose import was rejected, and write them to a rejects link.

In addition to importing a column you can also pass other columns straight through the stage. So, for example, you could pass a key column straight through.
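
Conceptually the stage behaves like the following Python sketch (hypothetical; the column names, delimiter, and target types stand in for whatever the import table definition and Format tab specify):

    row = {"key": 42, "comp_col": "1997-06-01,130.50,widget"}

    # Divide the single delimited column into the target columns, in the
    # order given by the import table definition.
    date_s, price_s, product = row["comp_col"].split(",")
    out = {
        "key": row["key"],        # a column passed straight through
        "order_date": date_s,
        "price": float(price_s),  # converted to the target column's type
        "product": product,
    }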

The stage editor has three pages:

• Stage page. This is always present and is used to specify general information about the stage.

• Inputs page. This is where you specify the details about the single input set from which you are selecting records.

• Outputs page. This is where you specify details about the processed data being output from the stage.


Stage Page

The General tab allows you to specify an optional description of the stage. The Properties tab lets you specify what the stage does. The Advanced tab allows you to specify how the stage executes.

Properties

The Properties tab allows you to specify properties which determine what the stage actually does. Some of the properties are mandatory, although many have default settings. Properties without default settings appear in the warning color (red by default) and turn black when you supply a value for them.

The following table gives a quick reference list of the properties and their attributes. A more detailed description of each property follows.

Category/Property          Values                       Default   Mandatory?                          Repeats?  Dependent of
Input/Import Input Column  Input Column                 N/A       Y                                   N         N/A
Output/Column Method       Explicit/Schema File         Explicit  Y                                   N         N/A
Output/Column to Import    Output Column                N/A       Y (if Column Method = Explicit)     Y         N/A
Output/Schema File         Pathname                     N/A       Y (if Column Method = Schema File)  N         N/A
Options/Keep Input Column  True/False                   False     N                                   N         N/A
Options/Reject Mode        Continue (warn)/Output/Fail  Continue  N                                   N         N/A



Input Category

Import Input Column. Specifies the name of the column containing the string or binary data to import.

Output Category

Column Method. Specifies whether the columns to import should be derived from column definitions on the Output page Columns tab (Explicit) or from a schema file (Schema File).

Column to Import. Specifies an output column. The meta data for this column determines the type that the import column will be converted to. Repeat the property to specify multiple columns. You can specify the properties for each column using the Parallel tab of the Edit Column Meta Data dialog box (accessible from the shortcut menu on the columns grid of the output Columns tab).

Schema File. Instead of specifying the target data type details via output column definitions, you can use a schema file. You can type in the schema file name or browse for it.

Options Category

Keep Input Column. Specifies whether the original input column should be transferred to the output data set unchanged in addition to being imported and converted. Defaults to False.

Reject Mode. The values of this property specify the following actions:

• Fail. The stage fails when it encounters a record whose import is rejected.

• Output. The stage continues when it encounters a reject record and writes the record to the reject link.

• Continue. The stage continues but reports failures to the log file.
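
A sketch of the three behaviors (a hypothetical helper, not stage code):

    def handle_reject(record, mode, reject_link, log):
        # Called when a record's import fails.
        if mode == "Fail":
            raise RuntimeError("import rejected: %r" % (record,))
        elif mode == "Output":
            reject_link.append(record)  # record travels down the reject link
        else:  # "Continue"
            log.append("import rejected: %r" % (record,))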

Advanced Tab

This tab allows you to specify the following:


• Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the input data is processed by the available nodes as specified in the Configuration file, and by any node constraints specified on the Advanced tab. In Sequential mode the entire data set is processed by the conductor node.

• Preserve partitioning. This is Propagate by default. It adopts Set or Clear from the previous stage. You can explicitly select Set or Clear. Select Set to request that the next stage in the job should attempt to maintain the partitioning.

• Node pool and resource constraints. Select this option to constrain parallel execution to the node pool or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from drop-down lists populated from the Configuration file.

• Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node map. You can define a node map by typing node numbers into the text box or by clicking the browse button to open the Available Nodes dialog box and selecting nodes from there. You are effectively defining a new node pool for this stage (in addition to any node pools defined in the Configuration file).

Inputs Page

The Inputs page allows you to specify details about the incoming data sets. The Column Import stage expects one incoming data set.

The General tab allows you to specify an optional description of the input link. The Partitioning tab allows you to specify how incoming data is partitioned before being imported. The Columns tab specifies the column definitions of incoming data.

Details about Column Import stage partitioning are given in the following section. See Chapter 3, “Stage Editors,” for a general description of the other tabs.

Partitioning on Input Links

The Partitioning tab allows you to specify details about how the incoming data is partitioned or collected before it is imported. It also allows you to specify that the data should be sorted before being operated on.


By default the stage partitions in Auto mode. This attempts to work out the best partitioning method depending on execution modes of current and preceding stages, whether the Preserve Partitioning option has been set, and how many nodes are specified in the Configuration file. If the Preserve Partitioning option has been set on the previous stage in the job, this stage will attempt to preserve the partitioning of the incoming data.

If the Column Import stage is operating in sequential mode, it will first collect the data using the default Auto collection method.

The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends on:

• Whether the Column Import stage is set to execute in parallel or sequential mode.

• Whether the preceding stage in the job is set to execute in parallel or sequential mode.

If the Column Import stage is set to execute in parallel, then you can set a partitioning method by selecting from the Partitioning mode drop-down list. This will override any current partitioning (even if the Preserve Partitioning option has been set on the previous stage).

If the Column Import stage is set to execute in sequential mode, but the preceding stage is executing in parallel, then you can set a collection method from the Collection type drop-down list. This will override the default collection method.

The following partitioning methods are available:

• (Auto). DataStage attempts to work out the best partitioning method depending on execution modes of current and preceding stages, whether the Preserve Partitioning option has been set, and how many nodes are specified in the Configuration file. This is the default partitioning method for the Column Import stage.

• Entire. Each partition receives the entire data set.

• Hash. The records are hashed into partitions based on the value of a key column or columns selected from the Available list.

• Modulus. The records are partitioned using a modulus function on the key column selected from the Available list. This is commonly used to partition on tag fields.

• Random. The records are partitioned randomly, based on the output of a random number generator.


• Round Robin. The records are partitioned on a round robin basis as they enter the stage.

• Same. Preserves the partitioning already in place.

• DB2. Replicates the DB2 partitioning method of a specific DB2 table. Requires extra properties to be set. Access these properties by clicking the properties button.

• Range. Divides a data set into approximately equal size partitions based on one or more partitioning keys. Range partitioning is often a preprocessing step to performing a total sort on a data set. Requires extra properties to be set. Access these properties by clicking the properties button.

The following Collection methods are available:

• (Auto). DataStage attempts to work out the best collection method depending on execution modes of current and preceding stages, and how many nodes are specified in the Configuration file. This is the default collection method for Column Import stages.

• Ordered. Reads all records from the first partition, then all records from the second partition, and so on.

• Round Robin. Reads a record from the first input partition, then from the second partition, and so on. After reaching the last partition, the operator starts over.

• Sort Merge. Reads records in an order based on one or more columns of the record. This requires you to select a collecting key column from the Available list.

The Partitioning tab also allows you to specify that data arriving on the input link should be sorted before being imported. The sort is always carried out within data partitions. If the stage is partitioning incoming data the sort occurs after the partitioning. If the stage is collecting data, the sort occurs before the collection. The availability of sorting depends on the partitioning method chosen.

Select the check boxes as follows:

• Sort. Select this to specify that data coming in on the link should be sorted. Select the column or columns to sort on from the Available list.

• Stable. Select this if you want to preserve previously sorted data sets. This is the default.


• Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is also set, the first record is retained.

You can also specify sort direction, case sensitivity, and collating sequence for each column in the Selected list by selecting it and right-clicking to invoke the shortcut menu.

Outputs Page

The Outputs page allows you to specify details about data output from the Column Import stage. The Column Import stage can have only one output link, but can also have a reject link carrying records that have been rejected.

The General tab allows you to specify an optional description of the output link. The Format tab allows you to specify details about how data in the column you are importing is formatted so the stage can divide it into separate columns. The Columns tab specifies the column definitions of the data. The Mapping tab allows you to specify the relationship between the columns being input to the Column Import stage and the Output columns.

Details about Column Import stage mapping are given in the following section. See Chapter 3, “Stage Editors,” for a general description of the other tabs.

Format Tab

The Format tab allows you to supply information about the format of the column you are importing. You use it in the same way as you would to describe the format of a flat file you were reading. The tab has a similar format to the Properties tab and is described in detail on page 3-24.

Select a property type from the main tree then add the properties you want to set to the tree structure by clicking on them in the Available properties to add window. You can then set a value for that property in the Property Value box. Pop-up help for each of the available properties appears if you hover the mouse pointer over it.

The following sections list the Property types and properties available for each type.


Record level. These properties define details about how data records are formatted in the column. The available properties are:

• Fill char. Specify an ASCII character or a value in the range 0 to 255. This character is used to fill any gaps in an exported record caused by column positioning properties. Set to 0 by default.

• Final delimiter string. Specify a string to be written after the last column of a record in place of the column delimiter. Enter one or more ASCII characters (precedes the record delimiter if one is used).

• Final delimiter. Specify a single character to be written after the last column of a record in place of the column delimiter. Type an ASCII character or select one of whitespace, end, none, or null.

– whitespace. A whitespace character is used.
– end. Record delimiter is used (defaults to newline).
– none. No delimiter (column length is used).
– null. Null character is used.

• Intact. Allows you to define a partial record schema. See “Partial Schemas” in Appendix A for details on complete versus partial schemas. (The dependent property Check Intact is only relevant for output links.)

• Record delimiter string. Specify a string to be written at the end of each record. Enter one or more ASCII characters.

• Record delimiter. Specify a single character to be written at the end of each record. Type an ASCII character or select one of the following:

– ‘\n’. Newline (the default).
– null. Null character.

This is mutually exclusive with Record delimiter string, although the dialog box does not enforce this.

• Record length. Select Fixed where the fixed length columns are being written. DataStage calculates the appropriate length for the record. Alternatively specify the length of fixed records as number of bytes.


• Record Prefix. Specifies that a variable-length record is prefixed by a 1-, 2-, or 4-byte length prefix. 1 byte is the default.


• Record type. Specifies that data consists of variable-length blocked records (varying) or implicit records (implicit). If you choose the implicit property, data is written as a stream with no explicit record boundaries. The end of the record is inferred when all of the columns defined by the schema have been parsed. The varying property allows you to specify one of the following IBM blocked or spanned formats: V, VB, VS, or VBS.

This property is mutually exclusive with Record length, Record delimiter, Record delimiter string, and Record prefix.

• User defined. Allows free format entry of any properties not defined elsewhere. Specify in a comma-separated list.
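
Two of these record-level formats can be sketched in Python (assumptions: a newline record delimiter in the first case, a 2-byte big-endian length prefix in the second):

    import struct

    # Record delimiter = '\n': records are split on the delimiter.
    data = b"rec1\nrec2\nrec3\n"
    records = data.split(b"\n")[:-1]

    # Record Prefix = 2: each record carries a 2-byte length prefix.
    data = struct.pack(">H4s", 4, b"rec1") + struct.pack(">H5s", 5, b"rec22")
    pos, records = 0, []
    while pos < len(data):
        (length,) = struct.unpack_from(">H", data, pos)
        records.append(bytes(data[pos + 2 : pos + 2 + length]))
        pos += 2 + length
    # records -> [b'rec1', b'rec22']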

Field Defaults. Defines default properties for columns written to the file or files. These are applied to all columns written. The available properties are:

• Delimiter. Specifies the trailing delimiter of all columns in the record. Type an ASCII character or select one of whitespace, end, none, or null.

– whitespace. A whitespace character is used.

– end. Specifies that the last column in the record is composed of all remaining bytes until the end of the record.

– none. No delimiter.

– null. Null character is used.

• Delimiter string. Specify a string to be written at the end of each column. Enter one or more ASCII characters.

• Prefix bytes. Specifies that each column is prefixed by 1, 2, or 4 bytes containing, as a binary value, either the column’s length or the tag value for a tagged column.

• Print field. This property is not relevant for input links.

• Quote. Specifies that variable length columns are enclosed in single quotes, double quotes, or another ASCII character or pair of ASCII characters. Choose Single or Double, or enter an ASCII character.


• Vector prefix. For columns that are variable length vectors, specifies a 1-, 2-, or 4-byte prefix containing the number of elements in the vector.
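
For example, a length-prefixed column (Prefix bytes = 1) and a vector with a 2-byte Vector prefix could be decoded like this (a sketch; big-endian byte order is assumed):

    import struct

    buf = struct.pack("B5s", 5, b"hello")     # 1-byte length, then the column
    (flen,) = struct.unpack_from("B", buf, 0)
    field = buf[1 : 1 + flen]                 # -> b'hello'

    vec = struct.pack(">H3h", 3, 10, 20, 30)  # element count, then elements
    (n,) = struct.unpack_from(">H", vec, 0)
    elements = struct.unpack_from(">%dh" % n, vec, 2)  # -> (10, 20, 30)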


Type Defaults. These are properties that apply to all columns of a specific data type unless specifically overridden at the column level. They are divided into a number of subgroups according to data type.

General. These properties apply to several data types (unless overridden at column level):

• Byte order. Specifies how multiple byte data types (except string and raw data types) are ordered. Choose from:

– little-endian. The high byte is on the right.
– big-endian. The high byte is on the left.
– native-endian. As defined by the native format of the machine.

• Format. Specifies the data representation format of a column. Choose from:

– binary
– text

• Layout max width. The maximum number of bytes in a column represented as a string. Enter a number.

• Layout width. The number of bytes in a column represented as a string. Enter a number.

• Pad char. Specifies the pad character used when strings or numeric values are exported to an external string representation. Enter an ASCII character or choose null.

String. These properties are applied to columns with a string data type, unless overridden at column level.

• Export EBCDIC as ASCII. Select this to specify that EBCDIC characters are written as ASCII characters.

• Import ASCII as EBCDIC. Not relevant for input links.

Decimal. These properties are applied to columns with a decimal data type unless overridden at column level.

• Allow all zeros. Specifies whether to treat a packed decimal column containing all zeros (which is normally illegal) as a valid representation of zero. Select Yes or No.


• Packed. Select Yes to specify that the decimal columns contain data in packed decimal format or No to specify that they contain unpacked decimal with a separate sign byte. This property has two dependent properties as follows:

– Check. Select Yes to verify that data is packed, or No to not verify.

– Signed. Select Yes to use the existing sign when writing decimal columns. Select No to write a positive sign (0xf) regardless of the column’s actual sign value.

• Precision. Specifies the precision where a decimal column is written in text format. Enter a number.

• Rounding. Specifies how to round a decimal column when writing it. Choose from:

– up (ceiling). Truncate source column towards positive infinity.

– down (floor). Truncate source column towards negative infinity.

– nearest value. Round the source column towards the nearest representable value.

– truncate towards zero. This is the default. Discard fractional digits to the right of the right-most fractional digit supported by the destination, regardless of sign.

• Scale. Specifies how to round a source decimal when its precision and scale are greater than those of the destination.
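
Packed decimal itself stores two digits per byte with the sign in the final nibble. A minimal Python decoder (illustrative only, not the stage's implementation) makes the Packed, Signed, and scale notions concrete:

    def unpack_decimal(data, scale):
        """Decode packed decimal bytes into a Python number."""
        nibbles = []
        for byte in data:
            nibbles += [byte >> 4, byte & 0x0F]
        sign_nibble = nibbles.pop()            # last nibble carries the sign
        sign = -1 if sign_nibble in (0x0B, 0x0D) else 1
        value = 0
        for digit in nibbles:
            value = value * 10 + digit
        return sign * value / 10 ** scale

    print(unpack_decimal(b"\x12\x34\x5C", 2))  # 0x12345C -> 123.45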

Numeric. These properties are applied to columns with an integer or float data type unless overridden at column level.

• C_format. Perform non-default conversion of data from integer or floating-point data to a string. This property specifies a C-language format string used for writing integer or floating point strings. This is passed to sprintf().

• In_format. Not relevant for input links.

• Out_format. Format string used for conversion of data from integer or floating-point data to a string. This is passed to sprintf().
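
Python's % operator mirrors sprintf(), so the effect of a C-language format string such as "%08.2f" can be previewed directly:

    value = 3.14159
    print("%08.2f" % value)  # '00003.14'
    print("%d" % 42)         # '42'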

Date. These properties are applied to columns with a date data type unless overridden at column level.


• Days since. Dates are written as a signed integer containing the number of days since the specified date. Enter a date in the form %yyyy-%mm-%dd.


• Format string. The string format of a date. By default this is %yyyy-%mm-%dd.

• Is Julian. Select this to specify that dates are written as a numeric value containing the Julian day. A Julian day specifies the date as the number of days from 4713 BCE January 1, 12:00 hours (noon) GMT.

Time. These properties are applied to columns with a time data type unless overridden at column level.

• Format string. Specifies the format of columns representing time as a string. By default this is %hh-%mm-%ss.

• Is midnight seconds. Select this to specify that times are written as a binary 32-bit integer containing the number of seconds elapsed from the previous midnight.

Timestamp. These properties are applied to columns with a timestamp data type unless overridden at column level.

• Format string. Specifies the format of a column representing a timestamp as a string. By default this is %yyyy-%mm-%dd %hh:%nn:%ss.
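
The date and time conventions above can be checked with a short Python sketch (the Julian-day offset shown is the standard one for Python's proleptic Gregorian ordinal):

    from datetime import date, datetime

    # Is Julian: days from 4713 BCE January 1, noon GMT.
    julian_day = date(2002, 9, 1).toordinal() + 1721425

    # Is midnight seconds: seconds elapsed since the previous midnight.
    t = datetime(2002, 9, 1, 14, 30, 5)
    midnight_seconds = t.hour * 3600 + t.minute * 60 + t.second  # 52205

    # The default timestamp format %yyyy-%mm-%dd %hh:%nn:%ss corresponds to:
    print(t.strftime("%Y-%m-%d %H:%M:%S"))  # '2002-09-01 14:30:05'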



Mapping Tab

For the Column Import stage the Mapping tab allows you to specify how the output columns are derived.

The left pane shows the columns the stage is deriving from the single imported column. These are read only and cannot be modified on this tab.

The right pane shows the output columns for each link.

In the example the stage has automatically mapped the specified Columns to Import onto the output columns. The Key column is an extra input column and is automatically passed through the stage. Because the Keep Input Column property was set to True, the original column (comp_col in this example) is available to map onto an output column.

We recommend that you maintain the automatic mappings of the generated columns when using this stage.

Reject Link

You cannot change the details of a Reject link. The link uses the column definitions for the link rejecting the data records.





37. Column Export Stage

The Column Export stage is an active stage. It can have a single input link, a single output link and a single rejects link.

The Column Export stage exports data from a number of columns of different data types into a single column of data type string or binary. It is the complementary stage to Column Import (see Chapter 36).

The input data column definitions determine the order in which the columns are exported to the single output column. Information about how the single column being exported is delimited is given in the Format tab of the Inputs page. You can optionally save reject records, that is, records whose export was rejected.

In addition to exporting columns you can also pass other columns straight through the stage. So, for example, you could pass a key column straight through.
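
Conceptually the stage is the inverse of the sketch shown for the Column Import stage; here is the export direction (hypothetical column names and delimiter):

    row = {"key": 42, "order_date": "1997-06-01", "price": 130.5, "product": "widget"}

    out = {
        "key": row["key"],   # a column passed straight through
        # The specified input columns, concatenated into the single
        # export column in input-column order.
        "comp_col": ",".join([row["order_date"], "%.2f" % row["price"], row["product"]]),
    }
    # out["comp_col"] -> '1997-06-01,130.50,widget'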

The stage editor has three pages:

• Stage page. This is always present and is used to specify general information about the stage.

• Inputs page. This is where you specify the details about the single input set from which you are selecting records.

• Outputs page. This is where you specify details about the processed data being output from the stage.

Stage Page


The General tab allows you to specify an optional description of the stage. The Properties tab lets you specify what the stage does. The Advanced tab allows you to specify how the stage executes.


Properties

The Properties tab allows you to specify properties which determine what the stage actually does. Some of the properties are mandatory, although many have default settings. Properties without default settings appear in the warning color (red by default) and turn black when you supply a value for them.

The following table gives a quick reference list of the properties and their attributes. A more detailed description of each property follows.

Category/Property             Values                  Default   Mandatory?  Repeats?  Dependent of
Options/Export Output Column  Output Column           N/A       Y           N         N/A
Options/Export Column Type    Binary/VarChar          Binary    N           N         N/A
Options/Reject Mode           Continue (warn)/Output  Continue  N           N         N/A
Options/Column to Export      Input Column            N/A       N           Y         N/A
Options/Schema File           Pathname                N/A       N           N         N/A

Options Category

Export Output Column. Specifies the name of the single column to which the input column or columns are exported.

Export Column Type. Specify either binary or VarChar (string).

Reject Mode. The values of this property specify the following actions:

• Output. The stage continues when it encounters a reject record and writes the record to the rejects link.

• Continue (warn). The stage continues but reports failures to the log file.


Column to Export. Specifies an input column the stage extracts data from. The format properties for this column can be set on the Format tab of the Inputs page. Repeat the property to specify multiple input columns.


Schema File. Instead of specifying the source data details via input column definitions, you can use a schema file. You can type in the schema file name or browse for it.

Advanced Tab

This tab allows you to specify the following:

• Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the input data is processed by the available nodes as specified in the Configuration file, and by any node constraints specified on the Advanced tab. In Sequential mode the entire data set is processed by the conductor node.

• Preserve partitioning. This is Propagate by default. It adopts Set or Clear from the previous stage. You can explicitly select Set or Clear. Select Set to request that the next stage in the job should attempt to maintain the partitioning.

• Node pool and resource constraints. Select this option to constrain parallel execution to the node pool or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from drop-down lists populated from the Configuration file.

• Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node map. You can define a node map by typing node numbers into the text box or by clicking the browse button to open the Available Nodes dialog box and selecting nodes from there. You are effectively defining a new node pool for this stage (in addition to any node pools defined in the Configuration file).

Inputs Page

The Inputs page allows you to specify details about the incoming data sets. The Column Export stage expects one incoming data set.

The General tab allows you to specify an optional description of the input link. The Partitioning tab allows you to specify how incoming data is partitioned before being exported. The Format tab allows you to specify details about how data in the column you are exporting will be formatted. The Columns tab specifies the column definitions of incoming data.


Details about Column Export stage partitioning are given in the following section. See Chapter 3, “Stage Editors,” for a general description of the other tabs.

Partitioning on Input Links

The Partitioning tab allows you to specify details about how the incoming data is partitioned or collected before it is exported. It also allows you to specify that the data should be sorted before being operated on.

By default the stage partitions in Auto mode. This attempts to work out the best partitioning method depending on execution modes of current and preceding stages, whether the Preserve Partitioning option has been set, and how many nodes are specified in the Configuration file. If the Preserve Partitioning option has been set on the previous stage in the job, this stage will attempt to preserve the partitioning of the incoming data.

If the Column Export stage is operating in sequential mode, it will first collect the data using the default Auto collection method.

The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends on:

• Whether the Column Export stage is set to execute in parallel or sequential mode.

• Whether the preceding stage in the job is set to execute in parallel or sequential mode.

If the Column Export stage is set to execute in parallel, then you can set a partitioning method by selecting from the Partitioning mode drop-down list. This will override any current partitioning (even if the Preserve Partitioning option has been set on the previous stage).

If the Column Export stage is set to execute in sequential mode, but the preceding stage is executing in parallel, then you can set a collection method from the Collection type drop-down list. This will override the default collection method.

The following partitioning methods are available:

• (Auto). DataStage attempts to work out the best partitioning method depending on execution modes of current and preceding stages, whether the Preserve Partitioning option has been set, and how many nodes are specified in the Configuration file. This is the default partitioning method for the Column Export stage.


• Entire. Each partition receives the entire data set.

• Hash. The records are hashed into partitions based on the value of a key column or columns selected from the Available list.

• Modulus. The records are partitioned using a modulus function on the key column selected from the Available list. This is commonly used to partition on tag columns.

• Random. The records are partitioned randomly, based on the output of a random number generator.

• Round Robin. The records are partitioned on a round robin basis as they enter the stage.

• Same. Preserves the partitioning already in place.

• DB2. Replicates the DB2 partitioning method of a specific DB2 table. Requires extra properties to be set. Access these properties by clicking the properties button.

• Range. Divides a data set into approximately equal size partitions based on one or more partitioning keys. Range partitioning is often a preprocessing step to performing a total sort on a data set. Requires extra properties to be set. Access these properties by clicking the properties button.

The following Collection methods are available:

• (Auto). DataStage attempts to work out the best collection method depending on execution modes of current and preceding stages, and how many nodes are specified in the Configuration file. This is the default collection method for Column Export stages.

• Ordered. Reads all records from the first partition, then all records from the second partition, and so on.

• Round Robin. Reads a record from the first input partition, then from the second partition, and so on. After reaching the last partition, the operator starts over.

• Sort Merge. Reads records in an order based on one or more columns of the record. This requires you to select a collecting key column from the Available list.


The Partitioning tab also allows you to specify that data arriving on the input link should be sorted before being exported. The sort is always carried out within data partitions. If the stage is partitioning incoming data the sort occurs after the partitioning. If the stage is collecting data, the sort occurs before the collection. The availability of sorting depends on the partitioning method chosen.

Select the check boxes as follows:

• Sort. Select this to specify that data coming in on the link should be sorted. Select the column or columns to sort on from the Available list.

• Stable. Select this if you want to preserve previously sorted data sets. This is the default.

• Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is also set, the first record is retained.

You can also specify sort direction, case sensitivity, and collating sequence for each column in the Selected list by selecting it and right-clicking to invoke the shortcut menu.

Format Tab

The Format tab allows you to supply information about the format of the column you are exporting. You use it in the same way as you would to describe the format of a flat file you were writing. The tab has a similar format to the Properties tab and is described in detail on page 3-24.

Select a property type from the main tree then add the properties you want to set to the tree structure by clicking on them in the Available properties to add window. You can then set a value for that property in the Property Value box. Pop-up help for each of the available properties appears if you hover the mouse pointer over it.

The following sections list the Property types and properties available for each type.

Record level. These properties define details about how data records are formatted in the column. The available properties are:

• Fill char. Specify an ASCII character or a value in the range 0 to 255. This character is used to fill any gaps in an exported record caused by column positioning properties. Set to 0 by default.


• Final delimiter string. Specify a string to be written after the last column of a record in place of the column delimiter. Enter one or more ASCII characters (precedes the record delimiter if one is used).

• Final delimiter. Specify a single character to be written after the last column of a record in place of the column delimiter. Type an ASCII character or select one of whitespace, end, none, or null.

– whitespace. A whitespace character is used.
– end. Record delimiter is used (defaults to newline).
– none. No delimiter (column length is used).
– null. Null character is used.

• Intact. Allows you to define a partial record schema. See “Partial Schemas” in Appendix A for details on complete versus partial schemas. (The dependent property Check Intact is only relevant for output links.)

• Record delimiter string. Specify a string to be written at the end of each record. Enter one or more ASCII characters.

• Record delimiter. Specify a single character to be written at the end of each record. Type an ASCII character or select one of the following:

– ‘\n’. Newline (the default).
– null. Null character.

This is mutually exclusive with Record delimiter string, although the dialog box does not enforce this.

• Record length. Select Fixed where the fixed length columns are being written. DataStage calculates the appropriate length for the record. Alternatively specify the length of fixed records as number of bytes.

• Record Prefix. Specifies that a variable-length record is prefixed by a 1-, 2-, or 4-byte length prefix. 1 byte is the default.

• Record type. Specifies that data consists of variable-length blocked records (varying) or implicit records (implicit). If you choose the implicit property, data is written as a stream with no explicit record boundaries. The end of the record is inferred when all of the columns defined by the schema have been parsed. The varying property allows you to specify one of the following IBM blocked or spanned formats: V, VB, VS, or VBS.


This property is mutually exclusive with Record length, Record delimiter, Record delimiter string, and Record prefix.

• User defined. Allows free format entry of any properties not defined elsewhere. Specify in a comma-separated list.

Field Defaults. Defines default properties for columns written to the file or files. These are applied to all columns written. The available properties are:

• Delimiter. Specifies the trailing delimiter of all columns in the record. Type an ASCII character or select one of whitespace, end, none, or null.

– whitespace. A whitespace character is used.

– end. Specifies that the last column in the record is composed of all remaining bytes until the end of the record.

– none. No delimiter.

– null. Null character is used.

• Delimiter string. Specify a string to be written at the end of each column. Enter one or more ASCII characters.

• Prefix bytes. Specifies that each column is prefixed by 1, 2, or 4 bytes containing, as a binary value, either the column’s length or the tag value for a tagged column.

• Print field. This property is not relevant for input links.

• Quote. Specifies that variable length columns are enclosed in single quotes, double quotes, or another ASCII character or pair of ASCII characters. Choose Single or Double, or enter an ASCII character.

• Vector prefix. For columns that are variable length vectors, specifies a 1-, 2-, or 4-byte prefix containing the number of elements in the vector.

Type Defaults. These are properties that apply to all columns of a specific data type unless specifically overridden at the column level. They are divided into a number of subgroups according to data type.


General. These properties apply to several data types (unless overridden at column level):


• Byte order. Specifies how multiple byte data types (except string and raw data types) are ordered. Choose from:

– little-endian. The high byte is on the right.
– big-endian. The high byte is on the left.
– native-endian. As defined by the native format of the machine.

• Format. Specifies the data representation format of a column. Choose from:

– binary
– text

• Layout max width. The maximum number of bytes in a column represented as a string. Enter a number.

• Layout width. The number of bytes in a column represented as a string. Enter a number.

• Pad char. Specifies the pad character used when strings or numeric values are exported to an external string representation. Enter an ASCII character or choose null.

String. These properties are applied to columns with a string data type, unless overridden at column level.

• Export EBCDIC as ASCII. Select this to specify that EBCDIC characters are written as ASCII characters.

• Import ASCII as EBCDIC. Not relevant for input links.

Decimal. These properties are applied to columns with a decimal data type unless overridden at column level.

• Allow all zeros. Specifies whether to treat a packed decimal column containing all zeros (which is normally illegal) as a valid representation of zero. Select Yes or No.

• Packed. Select Yes to specify that the decimal columns contain data in packed decimal format or No to specify that they contain unpacked decimal with a separate sign byte. This property has two dependent properties as follows:

– Check. Select Yes to verify that data is packed, or No to not verify.


– Signed. Select Yes to use the existing sign when writing decimal columns. Select No to write a positive sign (0xf) regardless of the column’s actual sign value.


• Precision. Specifies the precision where a decimal column is written in text format. Enter a number.

• Rounding. Specifies how to round a decimal column when writing it. Choose from:

– up (ceiling). Truncate source column towards positive infinity.

– down (floor). Truncate source column towards negative infinity.

– nearest value. Round the source column towards the nearest representable value.

– truncate towards zero. This is the default. Discard fractional digits to the right of the right-most fractional digit supported by the destination, regardless of sign.

• Scale. Specifies how to round a source decimal when its precision and scale are greater than those of the destination.

Numeric. These properties are applied to columns with an integer or float data type unless overridden at column level.

• C_format. Perform non-default conversion of data from integer or floating-point data to a string. This property specifies a C-language format string used for writing integer or floating point strings. This is passed to sprintf().

• In_format. Not relevant for input links.

• Out_format. Format string used for conversion of data from integer or floating-point data to a string. This is passed to sprintf().

Date. These properties are applied to columns with a date data type unless overridden at column level.

• Days since. Dates are written as a signed integer containing the number of days since the specified date. Enter a date in the form %yyyy-%mm-%dd.

• Format string. The string format of a date. By default this is %yyyy-%mm-%dd.

• Is Julian. Select this to specify that dates are written as a numeric value containing the Julian day. A Julian day specifies the date as the number of days from 4713 BCE January 1, 12:00 hours (noon) GMT.


Time. These properties are applied to columns with a time data type unless overridden at column level.

• Format string. Specifies the format of columns representing time as a string. By default this is %hh-%mm-%ss.

• Is midnight seconds. Select this to specify that times are written as a binary 32-bit integer containing the number of seconds elapsed from the previous midnight.

Timestamp. These properties are applied to columns with a timestamp data type unless overridden at column level.

• Format string. Specifies the format of a column representing a timestamp as a string. By default this is %yyyy-%mm-%dd %hh:%nn:%ss.

Outputs Page

The Outputs page allows you to specify details about data output from the Column Export stage. The Column Export stage can have only one output link, but can also have a reject link carrying records that have been rejected.

The General tab allows you to specify an optional description of the output link. The Columns tab specifies the column definitions of the data. The Mapping tab allows you to specify the relationship between the columns being input to the Column Export stage and the Output columns.

Details about Column Export stage mapping are given in the following section. See Chapter 3, “Stage Editors,” for a general description of the other tabs.



Mapping Tab

For the Column Export stage the Mapping tab allows you to specify how the output columns are derived, i.e., what input columns map onto them or how they are generated.

The left pane shows the input columns plus the composite column that the stage exports the specified input columns to. These are read only and cannot be modified on this tab.

The right pane shows the output columns for each link. This has a Derivations field where you can specify how the column is derived. You can fill it in by dragging input columns over, or by using the Auto-match facility.

In the example, the Key column is being passed straight through (it has not been defined as a Column to Export in the stage properties). The remaining columns are all being exported to comp_col, which is the specified Export Column. You could also pass the original columns through the stage, if required.



Reject Link

You cannot change the details of a Reject link. The link uses the column definitions for the link rejecting the data records.





38. Make Subrecord Stage

The Make Subrecord stage is an active stage. It can have a single input link and a single output link.

The Make Subrecord stage combines specified vectors in an input data set into a vector of subrecords whose columns have the names and data types of the original vectors. You specify the vector columns to be made into a vector of subrecords and the name of the new subrecord. See “Complex Data Types” on page 2-14 for an explanation of vectors and subrecords.

The Split Subrecord stage performs the inverse operation. See Chapter 39, “Split Subrecord Stage.”

The length of the subrecord vector created by this operator equals the length of the longest vector column from which it is created. If a variable-length vector column was used in subrecord creation, the subrecord vector is also of variable length.

Vectors that are smaller than the largest combined vector are padded with default values: NULL for nullable columns and the corresponding type-dependent value for non-nullable columns. When the Make Subrecord stage encounters mismatched vector lengths, it warns you by writing to the job log.
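
The combining and padding behavior can be sketched in Python (None stands in for the type-dependent default pad value; the column names are invented):

    from itertools import zip_longest

    acct = [100, 200, 300]      # vector column of length 3
    name = ["ann", "bob"]       # shorter vector column is padded

    subrec_vector = [
        {"acct": a, "name": n}  # one subrecord per vector element
        for a, n in zip_longest(acct, name, fillvalue=None)
    ]
    # -> [{'acct': 100, 'name': 'ann'}, {'acct': 200, 'name': 'bob'},
    #     {'acct': 300, 'name': None}]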

The stage editor has three pages:

• Stage page. This is always present and is used to specify general information about the stage.

• Inputs page. This is where you specify the details about the single input set from which you are selecting records.


• Outputs page. This is where you specify details about the processed data being output from the stage.


Stage Page

The General tab allows you to specify an optional description of the stage. The Properties tab lets you specify what the stage does. The Advanced tab allows you to specify how the stage executes.

Properties

The Properties tab allows you to specify properties which determine what the stage actually does. Some of the properties are mandatory, although many have default settings. Properties without default settings appear in the warning color (red by default) and turn black when you supply a value for them.

The following table gives a quick reference list of the properties and their attributes. A more detailed description of each property follows.

Category/Property                          Values         Default  Mandatory?  Repeats?  Dependent of
Options/Subrecord Output Column            Output Column  N/A      Y           N         N/A
Options/Vector Column for Subrecord        Input Column   N/A      N           Y         Key
Options/Disable Warning of Column Padding  True/False     False    N           N         N/A

Input Category

Subrecord Output Column. Specify the name of the subrecord into which you want to combine the columns specified by the Vector Column for Subrecord property.

Output Category

Vector Column for Subrecord. Specify the name of the column to include in the subrecord. You can specify multiple columns to be combined into a subrecord. For each column, specify the property followed by the name of the column to include.



Options Category

Disable Warning of Column Padding. When the operator combines vectors of unequal length, it pads columns and displays a message to this effect. Optionally specify this property to disable display of the message.

Advanced Tab

This tab allows you to specify the following:

• Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the input data is processed by the available nodes as specified in the Configuration file, and by any node constraints specified on the Advanced tab. In Sequential mode the entire data set is processed by the conductor node.

• Preserve partitioning. This is Propagate by default. It adopts Set or Clear from the previous stage. You can explicitly select Set or Clear. Select Set to request that the next stage in the job should attempt to maintain the partitioning.

• Node pool and resource constraints. Select this option to constrain parallel execution to the node pool or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from drop-down lists populated from the Configuration file.

• Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node map. You can define a node map by typing node numbers into the text box or by clicking the browse button to open the Available Nodes dialog box and selecting nodes from there. You are effectively defining a new node pool for this stage (in addition to any node pools defined in the Configuration file).

Inputs Page

The Inputs page allows you to specify details about the incoming data sets. The Make Subrecord stage expects one incoming data set.

The General tab allows you to specify an optional description of the input link. The Partitioning tab allows you to specify how incoming data is partitioned before being converted. The Columns tab specifies the column definitions of incoming data.


Details about Make Subrecord stage partitioning are given in the following section. See Chapter 3, “Stage Editors,” for a general description of the other tabs.

Partitioning on Input Links

The Partitioning tab allows you to specify details about how the incoming data is partitioned or collected before it is converted. It also allows you to specify that the data should be sorted before being operated on.

By default the stage partitions in Auto mode. If the Preserve Partitioning option has been set on the previous stage in the job, this stage will attempt to preserve the partitioning of the incoming data.

If the Make Subrecord stage is operating in sequential mode, it will first collect the data using the default Auto collection method.

The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends on:

• Whether the Make Subrecord stage is set to execute in parallel or sequential mode.

• Whether the preceding stage in the job is set to execute in parallel or sequential mode.

If the Make Subrecord stage is set to execute in parallel, then you can set a partitioning method by selecting from the Partitioning mode drop-down list. This will override any current partitioning (even if the Preserve Partitioning option has been set on the previous stage).

If the Make Subrecord stage is set to execute in sequential mode, but the preceding stage is executing in parallel, then you can set a collection method from the Collection type drop-down list. This will override the default collection method.

The following partitioning methods are available (the placement logic of hash, modulus, and round robin is sketched in the example after this list):

• (Auto). DataStage attempts to work out the best partitioning method depending on execution modes of current and preceding stages, whether the Preserve Partitioning option has been set, and how many nodes are specified in the Configuration file. This is the default method of the Make Subrecord stage.


• Entire. Each file written to receives the entire data set.


• Hash. The records are hashed into partitions based on the value of a key column or columns selected from the Available list.

• Modulus. The records are partitioned using a modulus function on the key column selected from the Available list. This is commonly used to partition on tag columns.

• Random. The records are partitioned randomly, based on the output of a random number generator.

• Round Robin. The records are partitioned on a round robin basis as they enter the stage.

• Same. Preserves the partitioning already in place.

• DB2. Replicates the DB2 partitioning method of a specific DB2 table. Requires extra properties to be set. Access these properties by clicking the properties button.

• Range. Divides a data set into approximately equal size partitions based on one or more partitioning keys. Range partitioning is often a preprocessing step to performing a total sort on a data set. Requires extra properties to be set. Access these properties by clicking the properties button.
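As a rough analogy for how some of these methods place records, here is a minimal Python sketch (an illustration only, not the actual parallel framework; the partition count and key names are invented for the example):

from itertools import count

NUM_PARTITIONS = 4  # in practice this is determined by the Configuration file

def hash_partition(key):
    # Hash: fold a hash of the key value into the partition range.
    return hash(key) % NUM_PARTITIONS

def modulus_partition(numeric_key):
    # Modulus: apply a modulus function directly to a numeric key
    # (for example a tag column).
    return numeric_key % NUM_PARTITIONS

_next = count()
def round_robin_partition(record):
    # Round robin: each arriving record goes to the next partition in turn.
    return next(_next) % NUM_PARTITIONS

for tag in (7, 8, 9, 10):
    print(tag, modulus_partition(tag), round_robin_partition({"tag": tag}))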

The following Collection methods are available (ordered, round robin, and sort merge are sketched in the example after this list):

• (Auto). DataStage attempts to work out the best collection method depending on execution modes of current and preceding stages, and how many nodes are specified in the Configuration file. This is the default collection method for Make Subrecord stages.

• Ordered. Reads all records from the first partition, then all records from the second partition, and so on.

• Round Robin. Reads a record from the first input partition, then from the second partition, and so on. After reaching the last partition, the operator starts over.

• Sort Merge. Reads records in an order based on one or more columns of the record. This requires you to select a collecting key column from the Available list.
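A corresponding Python analogy for the collection orders (again illustrative only, with invented data): given per-partition lists, ordered collection concatenates them, round robin interleaves them, and sort merge, assuming each partition is already sorted on the collecting key, merges them into a single sorted stream.

import heapq
from itertools import chain, zip_longest

partitions = [[1, 4, 7], [2, 5, 8], [3, 6]]  # each already sorted on the key

# Ordered: everything from partition 0, then partition 1, and so on.
ordered = list(chain.from_iterable(partitions))

# Round robin: one record from each partition in turn, skipping
# partitions that have been exhausted.
PAD = object()
round_robin = [r for group in zip_longest(*partitions, fillvalue=PAD)
               for r in group if r is not PAD]

# Sort merge: merges partitions that are each sorted on the collecting key.
sort_merge = list(heapq.merge(*partitions))

print(ordered)      # [1, 4, 7, 2, 5, 8, 3, 6]
print(round_robin)  # [1, 2, 3, 4, 5, 6, 7, 8]
print(sort_merge)   # [1, 2, 3, 4, 5, 6, 7, 8]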

The Partitioning tab also allows you to specify that data arriving on the input link should be sorted before being converted. The sort is always carried out within data partitions. If the stage is partitioning incoming data the sort occurs after the partitioning. If the stage is collecting data, the sort occurs before the collection. The availability of sorting depends on the partitioning method chosen.

Select the check boxes as follows:

• Sort. Select this to specify that data coming in on the link should be sorted. Select the column or columns to sort on from the Available list.

• Stable. Select this if you want to preserve previously sorted data sets. This is the default.

• Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is also set, the first record is retained.

You can also specify sort direction, case sensitivity, and collating sequence for each column in the Selected list by selecting it and right-clicking to invoke the shortcut menu.
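The interaction of the Stable and Unique options can be seen in a small Python sketch (illustrative only): a stable sort preserves the original relative order of records with equal keys, so the unique option then retains the first record of each key group.

# Illustrative only: a stable sort followed by unique retention on the key.
rows = [("b", 1), ("a", 2), ("b", 3), ("a", 4)]

# Python's sort is stable, so records with equal keys keep their
# original relative order, as with the Stable option.
stable_sorted = sorted(rows, key=lambda r: r[0])
# [('a', 2), ('a', 4), ('b', 1), ('b', 3)]

seen, unique = set(), []
for r in stable_sorted:
    if r[0] not in seen:        # Unique: keep only the first record per key
        seen.add(r[0])
        unique.append(r)
print(unique)                   # [('a', 2), ('b', 1)]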

Outputs Page

The Outputs page allows you to specify details about data output from the Make Subrecord stage. The Make Subrecord stage can have only one output link.

The General tab allows you to specify an optional description of the output link. The Columns tab specifies the column definitions of the data.

See Chapter 3, “Stage Editors,” for a general description of the tabs.


39. Split Subrecord Stage

The Split Subrecord stage separates an input subrecord field into a set of top-level vector columns. It can have a single input link and a single output link.

The stage creates one new vector column for each element of the original subrecord. That is, each top-level vector column that is created has the same number of elements as the subrecord from which it was created. The stage outputs columns of the same name and data type as those of the columns that comprise the subrecord.
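As a conceptual illustration (plain Python, not DataStage code; names are invented for the example), the split can be pictured as follows, with each column of the subrecord becoming a top-level vector of the same name and length:

# Illustrative model of Split Subrecord semantics (plain Python, not DataStage code).
def split_subrecord(record, subrec_name):
    elements = record.pop(subrec_name)
    # Each column of the subrecord becomes a top-level vector with the
    # same name and the same number of elements as the subrecord.
    for col in elements[0]:
        record[col] = [elem[col] for elem in elements]
    return record

row = {"id": 1, "sub": [{"x": 10, "y": "a"}, {"x": 20, "y": "b"}]}
print(split_subrecord(row, "sub"))
# {'id': 1, 'x': [10, 20], 'y': ['a', 'b']}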

The Make Subrecord stage performs the inverse operation (see Chapter 38, “Make Subrecord Stage”).

The stage editor has three pages:

• Stage page. This is always present and is used to specify general information about the stage.

• Inputs page. This is where you specify the details about the single input set from which you are selecting records.

• Outputs page. This is where you specify details about the processed data being output from the stage.

Stage Page

The General tab allows you to specify an optional description of the stage. The Properties tab lets you specify what the stage does. The Advanced tab allows you to specify how the stage executes.


Properties Tab

The Properties tab allows you to specify properties which determine what the stage actually does. Some of the properties are mandatory, although many have default settings. Properties without default settings appear in the warning color (red by default) and turn black when you supply a value for them.

The following table gives a quick reference list of the properties and their attributes. A more detailed description of each property follows.

Category/Property           Values         Default   Mandatory?   Repeats?   Dependent of
Options/Subrecord Column    Input Column   N/A       Y            N          N/A

Options Category

Subrecord Column. Specifies the name of the vector whose elements you want to promote to a set of similarly named top-level columns.

Advanced Tab

This tab allows you to specify the following:

• Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the input data is processed by the available nodes as specified in the Configuration file, and by any node constraints specified on the Advanced tab. In Sequential mode the entire data set is processed by the conductor node.

• Preserve partitioning. This is Propagate by default, which means the stage adopts the Set or Clear setting of the previous stage. You can explicitly select Set or Clear. Select Set to request that the next stage in the job attempt to maintain the partitioning.

• Node pool and resource constraints. Select this option to constrain parallel execution to the node pool or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from drop-down lists populated from the Configuration file.

• Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node map. You can define a node map by typing node numbers into the text box or by clicking the browse button to open the Available Nodes dialog box and selecting nodes from there. You are effectively defining a new node pool for this stage (in addition to any node pools defined in the Configuration file).

Inputs Page

The Inputs page allows you to specify details about the incoming data sets. There can be only one input to the Split Subrecord stage.

The General tab allows you to specify an optional description of the input link. The Partitioning tab allows you to specify how incoming data is partitioned before being converted. The Columns tab specifies the column definitions of incoming data.

Details about Split Subrecord stage partitioning are given in the following section. See Chapter 3, “Stage Editors,” for a general description of the other tabs.

Partitioning on Input Links

The Partitioning tab allows you to specify details about how the incoming data is partitioned or collected before it is converted. It also allows you to specify that the data should be sorted before being operated on.

By default the stage partitions in Auto mode. This attempts to work out the best partitioning method depending on execution modes of current and preceding stages, whether the Preserve Partitioning option has been set, and how many nodes are specified in the Configuration file. You can use any partitioning method except Modulus. If the Preserve Partitioning option has been set on the previous stage in the job, this stage will attempt to preserve the partitioning of the incoming data.

If the Split Subrecord stage is operating in sequential mode, it will first collect the data using the default Auto collection method.

The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends on:

• Whether the Split Subrecord stage is set to execute in parallel or sequential mode.


• Whether the preceding stage in the job is set to execute in parallel or sequential mode.


If the Split Subrecord stage is set to execute in parallel, then you can set a partitioning method by selecting from the Partitioning mode drop-down list. This will override any current partitioning (even if the Preserve Partitioning option has been set on the previous stage).

If the Split Subrecord stage is set to execute in sequential mode, but the preceding stage is executing in parallel, then you can set a collection method from the Collection type drop-down list. This will override the default collection method.

The following partitioning methods are available:

• (Auto). DataStage attempts to work out the best partitioning method depending on execution modes of current and preceding stages, whether the Preserve Partitioning option has been set, and how many nodes are specified in the Configuration file. This is the default partitioning method for the Split Subrecord stage.

• Entire. Each file written to receives the entire data set.

• Hash. The records are hashed into partitions based on the value of a key column or columns selected from the Available list.

• Modulus. The records are partitioned using a modulus function on the key column selected from the Available list. This is commonly used to partition on tag fields.

• Random. The records are partitioned randomly, based on the output of a random number generator.

• Round Robin. The records are partitioned on a round robin basis as they enter the stage.

• Same. Preserves the partitioning already in place.

• DB2. Replicates the DB2 partitioning method of a specific DB2 table. Requires extra properties to be set. Access these properties by clicking the properties button.

• Range. Divides a data set into approximately equal size partitions based on one or more partitioning keys. Range partitioning is often a preprocessing step to performing a total sort on a data set. Requires extra properties to be set. Access these properties by clicking the properties button.


The following Collection methods are available:


• (Auto). DataStage attempts to work out the best collection method depending on execution modes of current and preceding stages, and how many nodes are specified in the Configuration file. This is the default collection method for Split Subrecord stages.

• Ordered. Reads all records from the first partition, then all records from the second partition, and so on.

• Round Robin. Reads a record from the first input partition, then from the second partition, and so on. After reaching the last partition, the operator starts over.

• Sort Merge. Reads records in an order based on one or more columns of the record. This requires you to select a collecting key column from the Available list.

The Partitioning tab also allows you to specify that data arriving on the input link should be sorted before being converted. The sort is always carried out within data partitions. If the stage is partitioning incoming data the sort occurs after the partitioning. If the stage is collecting data, the sort occurs before the collection. The availability of sorting depends on the partitioning method chosen.

Select the check boxes as follows:

• Sort. Select this to specify that data coming in on the link should be sorted. Select the column or columns to sort on from the Available list.

• Stable. Select this if you want to preserve previously sorted data sets. This is the default.

• Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is also set, the first record is retained.

You can also specify sort direction, case sensitivity, and collating sequence for each column in the Selected list by selecting it and right-clicking to invoke the shortcut menu.

Outputs Page

The Outputs page allows you to specify details about data output from the Split Subrecord stage. The Split Subrecord stage can have only one output link.

The General tab allows you to specify an optional description of the output link. The Columns tab specifies the column definitions of the data. See Chapter 3, “Stage Editors,” for a general description of these tabs.


40. Promote Subrecord Stage

The Promote Subrecord stage is an active stage. It can have a single input link and a single output link.

The Promote Subrecord stage promotes the columns of an input subrecord to top-level columns. The number of output records equals the number of subrecord elements. The data types of the input subrecord columns deter-mine those of the corresponding top-level columns.
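A minimal Python sketch of the promotion (illustrative only, not DataStage code; only the subrecord's own columns are modeled here): one input record whose subrecord vector has n elements yields n output records.

# Illustrative model of Promote Subrecord semantics (plain Python, not DataStage code).
def promote_subrecord(record, subrec_name):
    # One output record per element of the subrecord vector.
    return [dict(elem) for elem in record[subrec_name]]

row = {"sub": [{"x": 10, "y": "a"}, {"x": 20, "y": "b"}, {"x": 30, "y": "c"}]}
for out in promote_subrecord(row, "sub"):
    print(out)
# {'x': 10, 'y': 'a'}
# {'x': 20, 'y': 'b'}
# {'x': 30, 'y': 'c'}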

The Combine Records stage performs the inverse operation (see Chapter 41, “Combine Records Stage”).

The stage editor has three pages:

• Stage page. This is always present and is used to specify general information about the stage.

• Inputs page. This is where you specify the details about the single input set from which you are selecting records.

• Outputs page. This is where you specify details about the processed data being output from the stage.

Stage Page

The General tab allows you to specify an optional description of the stage. The Properties tab lets you specify what the stage does. The Advanced tab allows you to specify how the stage executes.

Properties

The Promote Subrecord stage has one property:

Category/Property           Values         Default   Mandatory?   Repeats?   Dependent of
Options/Subrecord Column    Input Column   N/A       Y            N          N/A

Options Category

Subrecord Column. Specifies the name of the subrecord whose elements will be promoted to top-level records.

Advanced Tab

This tab allows you to specify the following:

• Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the input data is processed by the available nodes as specified in the Configuration file, and by any node constraints specified on the Advanced tab. In Sequential mode the entire data set is processed by the conductor node.

• Preserve partitioning. This is Propagate by default, which means the stage adopts the Set or Clear setting of the previous stage. You can explicitly select Set or Clear. Select Set to request that the next stage in the job attempt to maintain the partitioning.

• Node pool and resource constraints. Select this option to constrain parallel execution to the node pool or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from drop-down lists populated from the Configuration file.

• Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node map. You can define a node map by typing node numbers into the text box or by clicking the browse button to open the Available Nodes dialog box and selecting nodes from there. You are effectively defining a new node pool for this stage (in addition to any node pools defined in the Configuration file).

Inputs Page

The Inputs page allows you to specify details about the incoming data sets. The Promote Subrecord stage expects one incoming data set.

The General tab allows you to specify an optional description of the input link. The Partitioning tab allows you to specify how incoming data is partitioned before being converted. The Columns tab specifies the column definitions of incoming data.

Details about Promote Subrecord stage partitioning are given in the following section. See Chapter 3, “Stage Editors,” for a general description of the other tabs.

Partitioning on Input Links

The Partitioning tab allows you to specify details about how the incoming data is partitioned or collected before it is converted. It also allows you to specify that the data should be sorted before being operated on.

By default the stage partitions in Auto mode. This attempts to work out the best partitioning method depending on execution modes of current and preceding stages, whether the Preserve Partitioning option has been set, and how many nodes are specified in the Configuration file. If the Preserve Partitioning option has been set on the previous stage in the job, this stage will attempt to preserve the partitioning of the incoming data.

If the Promote Subrecord stage is operating in sequential mode, it will first collect the data using the default Auto collection method.

The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends on:

• Whether the Promote Subrecord stage is set to execute in parallel or sequential mode.

• Whether the preceding stage in the job is set to execute in parallel or sequential mode.

If the Promote Subrecord stage is set to execute in parallel, then you can set a partitioning method by selecting from the Partitioning mode drop-down list. This will override any current partitioning (even if the Preserve Partitioning option has been set on the previous stage).

If the Promote Subrecord stage is set to execute in sequential mode, but the preceding stage is executing in parallel, then you can set a collection method from the Collection type drop-down list. This will override the default collection method.

The following partitioning methods are available:

• (Auto). DataStage attempts to work out the best partitioning method depending on execution modes of current and preceding stages, whether the Preserve Partitioning option has been set, and how many nodes are specified in the Configuration file. This is the default method for the Promote Subrecord stage.

• Entire. Each file written to receives the entire data set.

• Hash. The records are hashed into partitions based on the value of a key column or columns selected from the Available list.

• Modulus. The records are partitioned using a modulus function on the key column selected from the Available list. This is commonly used to partition on tag fields.

• Random. The records are partitioned randomly, based on the output of a random number generator.

• Round Robin. The records are partitioned on a round robin basis as they enter the stage.

• Same. Preserves the partitioning already in place.

• DB2. Replicates the DB2 partitioning method of a specific DB2 table. Requires extra properties to be set. Access these properties by clicking the properties button.

• Range. Divides a data set into approximately equal size partitions based on one or more partitioning keys. Range partitioning is often a preprocessing step to performing a total sort on a data set. Requires extra properties to be set. Access these properties by clicking the properties button.

The following Collection methods are available:

• (Auto). DataStage attempts to work out the best collection method depending on execution modes of current and preceding stages, and how many nodes are specified in the Configuration file. This is the default collection method for Promote Subrecord stages.


• Ordered. Reads all records from the first partition, then all records from the second partition, and so on.


• Round Robin. Reads a record from the first input partition, then from the second partition, and so on. After reaching the last partition, the operator starts over.

• Sort Merge. Reads records in an order based on one or more columns of the record. This requires you to select a collecting key column from the Available list.

The Partitioning tab also allows you to specify that data arriving on the input link should be sorted before being converted. The sort is always carried out within data partitions. If the stage is partitioning incoming data the sort occurs after the partitioning. If the stage is collecting data, the sort occurs before the collection. The availability of sorting depends on the partitioning method chosen.

Select the check boxes as follows:

• Sort. Select this to specify that data coming in on the link should be sorted. Select the column or columns to sort on from the Available list.

• Stable. Select this if you want to preserve previously sorted data sets. This is the default.

• Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is also set, the first record is retained.

You can also specify sort direction, case sensitivity, and collating sequence for each column in the Selected list by selecting it and right-clicking to invoke the shortcut menu.

Outputs Page

The Outputs page allows you to specify details about data output from the Promote Subrecord stage. The Promote Subrecord stage can have only one output link.

The General tab allows you to specify an optional description of the output link. The Columns tab specifies the column definitions of the data.

See Chapter 3, “Stage Editors,” for a general description of the tabs.


41. Combine Records Stage

The Combine Records stage is an active stage. It can have a single input link and a single output link.

The Combine Records stage combines records, in which particular key-column values are identical, into vectors of subrecords. As input, the stage takes a data set in which one or more columns are chosen as keys. All adja-cent records whose key columns contain the same value are gathered into the same record as subrecords.
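The gathering of adjacent same-key records can be sketched in plain Python (an illustration only, not DataStage code; record and column names are invented). Note that groupby combines adjacent runs only, mirroring the stage's requirement that same-key records be adjacent; the example keeps the key as a top-level column, which corresponds to setting the Top Level Keys property described below to True.

from itertools import groupby

# Illustrative model of Combine Records semantics (plain Python, not DataStage code).
def combine_records(records, keys, subrec_name):
    key_of = lambda r: tuple(r[k] for k in keys)
    combined = []
    for key, group in groupby(records, key=key_of):   # adjacent runs only
        combined.append({
            **dict(zip(keys, key)),  # key kept top level (Top Level Keys)
            subrec_name: [{k: v for k, v in r.items() if k not in keys}
                          for r in group],
        })
    return combined

rows = [{"id": 1, "v": "a"}, {"id": 1, "v": "b"}, {"id": 2, "v": "c"}]
print(combine_records(rows, ["id"], "sub"))
# [{'id': 1, 'sub': [{'v': 'a'}, {'v': 'b'}]},
#  {'id': 2, 'sub': [{'v': 'c'}]}]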

The stage editor has three pages:

• Stage page. This is always present and is used to specify general information about the stage.

• Inputs page. This is where you specify the details about the single input set from which you are selecting records.

• Outputs page. This is where you specify details about the processed data being output from the stage.

Stage Page

The General tab allows you to specify an optional description of the stage. The Properties tab lets you specify what the stage does. The Advanced tab allows you to specify how the stage executes.


Properties


The Properties tab allows you to specify properties which determine what the stage actually does. Some of the properties are mandatory, although many have default settings. Properties without default settings appear in the warning color (red by default) and turn black when you supply a value for them.

The following table gives a quick reference list of the properties and their attributes. A more detailed description of each property follows.

Category/Property                  Values          Default   Mandatory?   Repeats?   Dependent of
Options/Subrecord Output Column    Output Column   N/A       Y            N          N/A
Options/Key                        Input Column    N/A       Y            Y          N/A
Options/Case Sensitive             True/False      True      N            N          Key
Options/Top Level Keys             True/False      False     N            N          N/A

Outputs Category

Subrecord Output Column. Specify the name of the subrecord that the Combine Records stage creates.

Combine Keys Category

Key. Specify one or more columns. All records whose key columns contain identical values are gathered into the same record as subrecords. If the Top Level Keys property is set to False, each column becomes the element of a subrecord.

If the Top Level Keys property is set to True, the key column appears as a top-level column in the output record as opposed to in the subrecord. All non-key columns belonging to input records with that key column appear as elements of a subrecord in that key column’s output record. Key has the following dependent property:

• Case Sensitive. Use this property to specify whether each key is case sensitive or not. It is set to True by default; for example, the values “CASE” and “case” would not be judged equivalent.


Options Category

Top Level Keys. Specify whether to leave keys as top-level columns or have them put into the subrecord. False by default.

Advanced Tab

This tab allows you to specify the following:

• Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the input data is processed by the available nodes as specified in the Configuration file, and by any node constraints specified on the Advanced tab. In Sequential mode the entire data set is processed by the conductor node.

• Preserve partitioning. This is Propagate by default, which means the stage adopts the Set or Clear setting of the previous stage. You can explicitly select Set or Clear. Select Set to request that the next stage in the job attempt to maintain the partitioning.

• Node pool and resource constraints. Select this option to constrain parallel execution to the node pool or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from drop-down lists populated from the Configuration file.

• Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node map. You can define a node map by typing node numbers into the text box or by clicking the browse button to open the Available Nodes dialog box and selecting nodes from there. You are effectively defining a new node pool for this stage (in addition to any node pools defined in the Configuration file).

Inputs Page

The Inputs page allows you to specify details about the incoming data sets. The Combine Records stage expects one incoming data set.

The General tab allows you to specify an optional description of the input link. The Partitioning tab allows you to specify how incoming data is partitioned before being converted. The Columns tab specifies the column definitions of incoming data.

Details about Combine Records stage partitioning are given in the following section. See Chapter 3, “Stage Editors,” for a general description of the other tabs.

Partitioning on Input Links

The Partitioning tab allows you to specify details about how the incoming data is partitioned or collected before it is converted. It also allows you to specify that the data should be sorted before being operated on.

By default the stage partitions in Auto mode. This attempts to work out the best partitioning method depending on execution modes of current and preceding stages, whether the Preserve Partitioning option has been set, and how many nodes are specified in the Configuration file. If the Preserve Partitioning option has been set on the previous stage in the job, this stage will attempt to preserve the partitioning of the incoming data.

If the Combine Records stage is operating in sequential mode, it will first collect the data using the default Auto collection method.

The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends on:

• Whether the Combine Records stage is set to execute in parallel or sequential mode.

• Whether the preceding stage in the job is set to execute in parallel or sequential mode.

If the Combine Records stage is set to execute in parallel, then you can set a partitioning method by selecting from the Partitioning mode drop-down list. This will override any current partitioning (even if the Preserve Partitioning option has been set on the previous stage).

If the Combine Records stage is set to execute in sequential mode, but the preceding stage is executing in parallel, then you can set a collection method from the Collection type drop-down list. This will override the default collection method.

The following partitioning methods are available:

• (Auto). DataStage attempts to work out the best partitioning method depending on execution modes of current and preceding stages, whether the Preserve Partitioning option has been set, and how many nodes are specified in the Configuration file. This is the default partitioning method for the Combine Records stage.

• Entire. Each file written to receives the entire data set.

• Hash. The records are hashed into partitions based on the value of a key column or columns selected from the Available list.

• Modulus. The records are partitioned using a modulus function on the key column selected from the Available list. This is commonly used to partition on tag fields.

• Random. The records are partitioned randomly, based on the output of a random number generator.

• Round Robin. The records are partitioned on a round robin basis as they enter the stage.

• Same. Preserves the partitioning already in place.

• DB2. Replicates the DB2 partitioning method of a specific DB2 table. Requires extra properties to be set. Access these properties by clicking the properties button.

• Range. Divides a data set into approximately equal size partitions based on one or more partitioning keys. Range partitioning is often a preprocessing step to performing a total sort on a data set. Requires extra properties to be set. Access these properties by clicking the properties button.

The following Collection methods are available:

• (Auto). DataStage attempts to work out the best collection method depending on execution modes of current and preceding stages, and how many nodes are specified in the Configuration file. This is the default collection method for Combine Records stages.

• Ordered. Reads all records from the first partition, then all records from the second partition, and so on.

• Round Robin. Reads a record from the first input partition, then from the second partition, and so on. After reaching the last partition, the operator starts over.

• Sort Merge. Reads records in an order based on one or more columns of the record. This requires you to select a collecting key column from the Available list.

The Partitioning tab also allows you to specify that data arriving on the input link should be sorted before being converted. The sort is always carried out within data partitions. If the stage is partitioning incoming data the sort occurs after the partitioning. If the stage is collecting data, the sort occurs before the collection. The availability of sorting depends on the partitioning method chosen.

Select the check boxes as follows:

• Sort. Select this to specify that data coming in on the link should be sorted. Select the column or columns to sort on from the Available list.

• Stable. Select this if you want to preserve previously sorted data sets. This is the default.

• Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is also set, the first record is retained.

You can also specify sort direction, case sensitivity, and collating sequence for each column in the Selected list by selecting it and right-clicking to invoke the shortcut menu.

Outputs Page

The Outputs page allows you to specify details about data output from the Combine Records stage. The Combine Records stage can have only one output link.

The General tab allows you to specify an optional description of the output link. The Columns tab specifies the column definitions of the data.

See Chapter 3, “Stage Editors,” for a general description of the tabs.


42. Make Vector Stage

The Make Vector stage is an active stage. It can have a single input link and a single output link.

The Make Vector stage combines specified columns of an input data record into a vector of columns of the same type. The input columns must be consecutive and numbered in ascending order. The numbers must increase by one. The columns must be named column_name0 to column_namen, where column_name starts the name of a column and 0 and n are the first and last of its consecutive numbers. All these columns are combined into a vector of the same length as the number of columns (n+1). The vector is called column_name.
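A small Python analogy of the combination (illustrative only, not DataStage code; the column names are invented for the example):

import re

# Illustrative model of Make Vector semantics (plain Python, not DataStage code).
def make_vector(record, name):
    pattern = re.compile(re.escape(name) + r"(\d+)")
    numbered = sorted((int(m.group(1)), col)
                      for col in list(record)
                      if (m := pattern.fullmatch(col)))
    # The columns must be numbered 0..n consecutively, with no gaps.
    assert [i for i, _ in numbered] == list(range(len(numbered)))
    record[name] = [record.pop(col) for _, col in numbered]
    return record

row = {"acct0": 12.0, "acct1": 7.5, "acct2": 3.2, "other": "x"}
print(make_vector(row, "acct"))
# {'other': 'x', 'acct': [12.0, 7.5, 3.2]}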

The Split Vector stage performs the inverse operation. See Chapter 43, “Split Vector Stage.”

The stage editor has three pages:

• Stage page. This is always present and is used to specify general information about the stage.

• Inputs page. This is where you specify the details about the single input set from which you are selecting records.

• Outputs page. This is where you specify details about the processed data being output from the stage.

Stage Page

The General tab allows you to specify an optional description of the stage. The Properties tab lets you specify what the stage does. The Advanced tab allows you to specify how the stage executes.

Properties

The Make Vector stage has one property:

Category/Property                       Values   Default   Mandatory?   Repeats?   Dependent of
Options/Column’s Common Partial Name    Name     N/A       Y            N          N/A

Options Category

Column’s Common Partial Name. Specifies the beginning column_name of the series of consecutively numbered columns column_name0 to column_namen to be combined into a vector called column_name.

Advanced Tab

This tab allows you to specify the following:

• Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the input data is processed by the available nodes as specified in the Configuration file, and by any node constraints specified on the Advanced tab. In Sequential mode the entire data set is processed by the conductor node.

• Preserve partitioning. This is Propagate by default, which means the stage adopts the Set or Clear setting of the previous stage. You can explicitly select Set or Clear. Select Set to request that the next stage in the job attempt to maintain the partitioning.

• Node pool and resource constraints. Select this option to constrain parallel execution to the node pool or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from drop-down lists populated from the Configuration file.

• Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node map. You can define a node map by typing node numbers into the text box or by clicking the browse button to open the Available Nodes dialog box and selecting nodes from there. You are effectively defining a new node pool for this stage (in addition to any node pools defined in the Configuration file).

Inputs Page

The Inputs page allows you to specify details about the incoming data sets. The Make Vector stage expects one incoming data set.

The General tab allows you to specify an optional description of the input link. The Partitioning tab allows you to specify how incoming data is partitioned before being converted. The Columns tab specifies the column definitions of incoming data.

Details about Make Vector stage partitioning are given in the following section. See Chapter 3, “Stage Editors,” for a general description of the other tabs.

Partitioning on Input Links

The Partitioning tab allows you to specify details about how the incoming data is partitioned or collected before it is converted. It also allows you to specify that the data should be sorted before being operated on.

By default the stage partitions in Same mode. If the Preserve Partitioning option has been set on the previous stage in the job, this stage will attempt to preserve the partitioning of the incoming data.

If the Make Vector stage is operating in sequential mode, it will first collect the data using the default Auto collection method.

The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends on:

• Whether the Make Vector stage is set to execute in parallel or sequential mode.

• Whether the preceding stage in the job is set to execute in parallel or sequential mode.

If the Make Vector stage is set to execute in parallel, then you can set a partitioning method by selecting from the Partitioning mode drop-down list. This will override any current partitioning (even if the Preserve Partitioning option has been set on the previous stage).

If the Make Vector stage is set to execute in sequential mode, but the preceding stage is executing in parallel, then you can set a collection method from the Collection type drop-down list. This will override the default collection method.

The following partitioning methods are available:

• (Auto). DataStage attempts to work out the best partitioning method depending on execution modes of current and preceding stages, whether the Preserve Partitioning option has been set, and how many nodes are specified in the Configuration file.

• Entire. Each file written to receives the entire data set.

• Hash. The records are hashed into partitions based on the value of a key column or columns selected from the Available list.

• Modulus. The records are partitioned using a modulus function on the key column selected from the Available list. This is commonly used to partition on tag fields.

• Random. The records are partitioned randomly, based on the output of a random number generator.

• Round Robin. The records are partitioned on a round robin basis as they enter the stage.

• Same. Preserves the partitioning already in place. This is the default partitioning method for the Make Vector stage.

• DB2. Replicates the DB2 partitioning method of a specific DB2 table. Requires extra properties to be set. Access these properties by clicking the properties button.

• Range. Divides a data set into approximately equal size partitions based on one or more partitioning keys. Range partitioning is often a preprocessing step to performing a total sort on a data set. Requires extra properties to be set. Access these properties by clicking the properties button.

The following Collection methods are available:

• (Auto). DataStage attempts to work out the best collection method depending on execution modes of current and preceding stages, and how many nodes are specified in the Configuration file. This is the default collection method for Make Vector stages.

• Ordered. Reads all records from the first partition, then all records from the second partition, and so on.

• Round Robin. Reads a record from the first input partition, then from the second partition, and so on. After reaching the last partition, the operator starts over.

• Sort Merge. Reads records in an order based on one or more columns of the record. This requires you to select a collecting key column from the Available list.

The Partitioning tab also allows you to specify that data arriving on the input link should be sorted before being converted. The sort is always carried out within data partitions. If the stage is partitioning incoming data the sort occurs after the partitioning. If the stage is collecting data, the sort occurs before the collection. The availability of sorting depends on the partitioning method chosen.

Select the check boxes as follows:

• Sort. Select this to specify that data coming in on the link should be sorted. Select the column or columns to sort on from the Available list.

• Stable. Select this if you want to preserve previously sorted data sets. This is the default.

• Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is also set, the first record is retained.

You can also specify sort direction, case sensitivity, and collating sequence for each column in the Selected list by selecting it and right-clicking to invoke the shortcut menu.

Outputs Page

The Outputs page allows you to specify details about data output from the Make Vector stage. The Make Vector stage can have only one output link.

The General tab allows you to specify an optional description of the output link. The Columns tab specifies the column definitions of the data.

See Chapter 3, “Stage Editors,” for a general description of the tabs.


43. Split Vector Stage

The Split Vector stage is an active stage. It can have a single input link and a single output link.

The Split Vector stage promotes the elements of a fixed-length vector to a set of similarly named top-level columns. The stage creates columns of the format name0 to namen, where name is the original vector’s name and 0 and n are the first and last elements of the vector.
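Conceptually (plain Python, illustrative only, with invented names), the operation is the reverse of the Make Vector sketch in the previous chapter:

# Illustrative model of Split Vector semantics (plain Python, not DataStage code).
def split_vector(record, name):
    # Element i of the vector becomes a top-level column name<i>.
    for i, value in enumerate(record.pop(name)):
        record["%s%d" % (name, i)] = value
    return record

row = {"other": "x", "acct": [12.0, 7.5, 3.2]}
print(split_vector(row, "acct"))
# {'other': 'x', 'acct0': 12.0, 'acct1': 7.5, 'acct2': 3.2}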

The Make Vector stage performs the inverse operation.

The stage editor has three pages:

• Stage page. This is always present and is used to specify general information about the stage.

• Inputs page. This is where you specify the details about the single input set from which you are selecting records.

• Outputs page. This is where you specify details about the processed data being output from the stage.

Stage Page

The General tab allows you to specify an optional description of the stage. The Properties tab lets you specify what the stage does. The Advanced tab allows you to specify how the stage executes.


Properties

The Split Vector stage has one property:

Category/Property        Values   Default   Mandatory?   Repeats?   Dependent of
Options/Vector Column    Name     N/A       Y            N          N/A

Options Category

Vector Column. Specifies the name of the vector whose elements you want to promote to a set of similarly named top-level columns.

Advanced Tab

This tab allows you to specify the following:

• Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the input data is processed by the available nodes as specified in the Configuration file, and by any node constraints specified on the Advanced tab. In Sequential mode the entire data set is processed by the conductor node.

• Preserve partitioning. This is Propagate by default, which means the stage adopts the Set or Clear setting of the previous stage. You can explicitly select Set or Clear. Select Set to request that the next stage in the job attempt to maintain the partitioning.

• Node pool and resource constraints. Select this option to constrain parallel execution to the node pool or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from drop-down lists populated from the Configuration file.

• Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node map. You can define a node map by typing node numbers into the text box or by clicking the browse button to open the Available Nodes dialog box and selecting nodes from there. You are effectively defining a new node pool for this stage (in addition to any node pools defined in the Configuration file).

Inputs Page

The Inputs page allows you to specify details about the incoming data sets. There can be only one input to the Split Vector stage.

The General tab allows you to specify an optional description of the input link. The Partitioning tab allows you to specify how incoming data is partitioned before being converted. The Columns tab specifies the column definitions of incoming data.

Details about Split Vector stage partitioning are given in the following section. See Chapter 3, “Stage Editors,” for a general description of the other tabs.

Partitioning on Input Links

The Partitioning tab allows you to specify details about how the incoming data is partitioned or collected before it is converted. It also allows you to specify that the data should be sorted before being operated on.

By default the stage partitions in Auto mode. This attempts to work out the best partitioning method depending on execution modes of current and preceding stages, whether the Preserve Partitioning option has been set, and how many nodes are specified in the Configuration file. You can use any partitioning method except Modulus. If the Preserve Partitioning option has been set on the previous stage in the job, this stage will attempt to preserve the partitioning of the incoming data.

If the Split Vector stage is operating in sequential mode, it will first collect the data using the default Auto collection method.

The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends on:

• Whether the Split Vector stage is set to execute in parallel or sequential mode.

• Whether the preceding stage in the job is set to execute in parallel or sequential mode.

If the Split Vector stage is set to execute in parallel, then you can set a partitioning method by selecting from the Partitioning mode drop-down list. This will override any current partitioning (even if the Preserve Partitioning option has been set on the previous stage).

If the Split Vector stage is set to execute in sequential mode, but the preceding stage is executing in parallel, then you can set a collection method from the Collection type drop-down list. This will override the default collection method.

The following partitioning methods are available:

• (Auto). DataStage attempts to work out the best partitioning method depending on execution modes of current and preceding stages, whether the Preserve Partitioning option has been set, and how many nodes are specified in the Configuration file. This is the default partitioning method for the Split Vector stage.

• Entire. Each file written to receives the entire data set.

• Hash. The records are hashed into partitions based on the value of a key column or columns selected from the Available list.

• Modulus. The records are partitioned using a modulus function on the key column selected from the Available list. This is commonly used to partition on tag fields.

• Random. The records are partitioned randomly, based on the output of a random number generator.

• Round Robin. The records are partitioned on a round robin basis as they enter the stage.

• Same. Preserves the partitioning already in place.

• DB2. Replicates the DB2 partitioning method of a specific DB2 table. Requires extra properties to be set. Access these properties by clicking the properties button.

• Range. Divides a data set into approximately equal size partitions based on one or more partitioning keys. Range partitioning is often a preprocessing step to performing a total sort on a data set. Requires extra properties to be set. Access these properties by clicking the properties button.

The following Collection methods are available:

• (Auto). DataStage attempts to work out the best collection method depending on execution modes of current and preceding stages, and how many nodes are specified in the Configuration file. This is the default collection method for Split Vector stages.

• Ordered. Reads all records from the first partition, then all records from the second partition, and so on.

• Round Robin. Reads a record from the first input partition, then from the second partition, and so on. After reaching the last partition, the operator starts over.

• Sort Merge. Reads records in an order based on one or more columns of the record. This requires you to select a collecting key column from the Available list.

The Partitioning tab also allows you to specify that data arriving on the input link should be sorted before being converted. The sort is always carried out within data partitions. If the stage is partitioning incoming data the sort occurs after the partitioning. If the stage is collecting data, the sort occurs before the collection. The availability of sorting depends on the partitioning method chosen.

Select the check boxes as follows:

• Sort. Select this to specify that data coming in on the link should be sorted. Select the column or columns to sort on from the Available list.

• Stable. Select this if you want to preserve previously sorted data sets. This is the default.

• Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is also set, the first record is retained.

You can also specify sort direction, case sensitivity, and collating sequence for each column in the Selected list by selecting it and right-clicking to invoke the shortcut menu.

Outputs Page

The Outputs page allows you to specify details about data output from the Split Vector stage. The Split Vector stage can have only one output link.

The General tab allows you to specify an optional description of the output link. The Columns tab specifies the column definitions of the data.


Details about Split Vector stage mapping are given in the following section. See Chapter 3, “Stage Editors,” for a general description of the other tabs.


44. Head Stage

The Head Stage is an active stage. It can have a single input link and a single output link.

The Head Stage selects the first N records from each partition of an input data set and copies the selected records to an output data set. You determine which records are copied by setting properties which allow you to specify:

• The number of records to copy

• The partition from which the records are copied

• The location of the records to copy

• The number of records to skip before the copying operation begins

This stage is helpful in testing and debugging applications with large data sets. For example, the Partition property lets you see data from a single partition to determine if the data is being partitioned as you want it to be. The Skip property lets you access a certain portion of a data set.

The stage editor has three pages:

• Stage page. This is always present and is used to specify general information about the stage.

• Inputs page. This is where you specify the details about the single input set from which you are selecting records.

• Outputs page. This is where you specify details about the processed data being output from the stage.


Stage Page

The General tab allows you to specify an optional description of the stage. The Properties tab lets you specify what the stage does. The Advanced tab allows you to specify how the stage executes.

Properties

The Properties tab allows you to specify properties which determine what the stage actually does. Some of the properties are mandatory, although many have default settings. Properties without default settings appear in the warning color (red by default) and turn black when you supply a value for them.

The following table gives a quick reference list of the properties and their attributes. A more detailed description of each property follows.

Category/Property                     Values             Default   Mandatory?                       Repeats?   Dependent of
Rows/All Rows                         True/False         False     N                                N          N/A
Rows/Number of Rows (per Partition)   Count              10        N                                N          N/A
Rows/Period (per Partition)           Number             N/A       N                                N          N/A
Rows/Skip (per Partition)             Number             N/A       N                                N          N/A
Partitions/All Partitions             Partition Number   N/A       N                                Y          N/A
Partitions/Partition Number           Number             N/A       Y (if All Partitions = False)    Y          N/A

Rows Category

All Rows. Copy all input rows to the output data set. You can skip rows before Head performs its copy operation by using the Skip property. The Number of Rows property is not needed if All Rows is true.

Number of Rows (per Partition). Specify the number of rows to copy from each partition of the input data set to the output data set. The default value is 10. The Number of Rows property is not needed if All Rows is true.

Period (per Partition). Copy every Pth record in a partition, where P is the period. You can start the copy operation after records have been skipped by using the Skip property. P must be greater than or equal to 1.

Skip (per Partition). Ignore the first number of rows of each partition of the input data set, where number is the number of rows to skip. The default skip count is 0.

Partitions Category

All Partitions. If False, copy records only from the indicated partition, specified by number. By default, the operator copies rows from all partitions.

Partition Number. Specifies particular partitions to perform the Head operation on. You can specify the Partition Number property multiple times to specify multiple partition numbers.
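Taken together, the row properties select records per partition roughly as in the following Python sketch (an illustration only, not DataStage code; the function and parameter names are invented):

# Illustrative model of the Head stage's per-partition row selection
# (plain Python, not DataStage code).
def head_partition(rows, count=10, skip=0, period=1, all_rows=False):
    selected = []
    for i, row in enumerate(rows):
        if i < skip:                     # Skip: ignore the leading rows
            continue
        if (i - skip) % period != 0:     # Period: copy every Pth row
            continue
        if not all_rows and len(selected) >= count:   # Number of Rows limit
            break
        selected.append(row)
    return selected

print(head_partition(range(20), count=3, skip=4, period=2))
# [4, 6, 8]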

Advanced Tab

This tab allows you to specify the following:

• Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the input data is processed by the available nodes as specified in the Configuration file, and by any node constraints specified on the Advanced tab. In Sequential mode the entire data set is processed by the conductor node.

• Preserve partitioning. This is Propagate by default, which means the stage adopts the Set or Clear setting of the previous stage. You can explicitly select Set or Clear. Select Set to request that the next stage in the job attempt to maintain the partitioning.

• Node pool and resource constraints. Select this option to constrain parallel execution to the node pool or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from drop-down lists populated from the Configuration file.

• Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node map. You can define a node map by typing node numbers into the text box or by clicking the browse button to open the Available Nodes dialog box and selecting nodes from there. You are effectively defining a new node pool for this stage (in addition to any node pools defined in the Configuration file).

Inputs Page

The Inputs page allows you to specify details about the incoming data sets. The Head stage expects one input.

The General tab allows you to specify an optional description of the input link. The Partitioning tab allows you to specify how incoming data is partitioned before being headed. The Columns tab specifies the column definitions of incoming data.

Details about Head stage partitioning are given in the following section. See Chapter 3, “Stage Editors,” for a general description of the other tabs.

Partitioning on Input Links

The Partitioning tab allows you to specify details about how the incoming data is partitioned or collected before it is headed. It also allows you to specify that the data should be sorted before being operated on.

By default the stage partitions in Auto mode. This attempts to work out the best partitioning method depending on execution modes of current and preceding stages, whether the Preserve Partitioning option has been set, and how many nodes are specified in the Configuration file. If the Preserve Partitioning option has been set on the previous stage in the job, this stage will attempt to preserve the partitioning of the incoming data.

If the Head stage is operating in sequential mode, it will first collect the data using the default Auto collection method.

The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends on:

• Whether the Head stage is set to execute in parallel or sequential mode.

• Whether the preceding stage in the job is set to execute in parallel or sequential mode.

If the Head stage is set to execute in parallel, then you can set a partitioning method by selecting from the Partitioning mode drop-down list. This will override any current partitioning (even if the Preserve Partitioning option has been set on the previous stage).

If the Head stage is set to execute in sequential mode, but the preceding stage is executing in parallel, then you can set a collection method from the Collection type drop-down list. This will override the default collection method.

The following partitioning methods are available:

• (Auto). DataStage attempts to work out the best partitioning method depending on execution modes of current and preceding stages, whether the Preserve Partitioning option has been set, and how many nodes are specified in the Configuration file. This is the default partitioning method for the Head stage.

• Entire. Each partition receives the entire data set.

• Hash. The records are hashed into partitions based on the value of a key column or columns selected from the Available list.

• Modulus. The records are partitioned using a modulus function on the key column selected from the Available list. This is commonly used to partition on tag fields.

• Random. The records are partitioned randomly, based on the output of a random number generator.

• Round Robin. The records are partitioned on a round robin basis as they enter the stage.

• Same. Preserves the partitioning already in place.

• DB2. Replicates the DB2 partitioning method of a specific DB2 table. Requires extra properties to be set. Access these properties by clicking the properties button.

• Range. Divides a data set into approximately equal size partitions based on one or more partitioning keys. Range partitioning is often a preprocessing step to performing a total sort on a data set. Requires extra properties to be set. Access these properties by clicking the properties button.
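To make the difference between some of these methods concrete, here is a rough Python sketch of how hash, modulus, and round robin assign records to partitions (illustrative only; the engine's actual implementations are internal to DataStage):

```python
from itertools import cycle

records = [{"id": i} for i in range(8)]
num_partitions = 3

# Hash: records with equal key values always land in the same partition.
hash_assign = [hash(r["id"]) % num_partitions for r in records]

# Modulus: the integer key column is taken modulo the partition count.
mod_assign = [r["id"] % num_partitions for r in records]

# Round robin: records are dealt out in turn, regardless of their content.
turns = cycle(range(num_partitions))
rr_assign = [next(turns) for _ in records]

print(mod_assign)  # [0, 1, 2, 0, 1, 2, 0, 1]
print(rr_assign)   # [0, 1, 2, 0, 1, 2, 0, 1]
```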

The following Collection methods are available:

• (Auto). DataStage attempts to work out the best collection method depending on execution modes of current and preceding stages, and how many nodes are specified in the Configuration file. This is the default collection method for Head stages.

• Ordered. Reads all records from the first partition, then all records from the second partition, and so on.

• Round Robin. Reads a record from the first input partition, then from the second partition, and so on. After reaching the last partition, the operator starts over.

• Sort Merge. Reads records in an order based on one or more columns of the record. This requires you to select a collecting key column from the Available list.
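The difference between the Ordered and Round Robin collection methods, for example, can be pictured with a small Python sketch (illustrative only, not DataStage code):

```python
from itertools import chain, zip_longest

partitions = [[1, 4], [2, 5], [3, 6]]

# Ordered: all of partition 0, then all of partition 1, and so on.
ordered = list(chain.from_iterable(partitions))

# Round robin: one record from each partition in turn, repeating.
round_robin = [r for batch in zip_longest(*partitions)
               for r in batch if r is not None]

print(ordered)      # [1, 4, 2, 5, 3, 6]
print(round_robin)  # [1, 2, 3, 4, 5, 6]
```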

The Partitioning tab also allows you to specify that data arriving on the input link should be sorted before being headed. The sort is always carried out within data partitions. If the stage is partitioning incoming data the sort occurs after the partitioning. If the stage is collecting data, the sort occurs before the collection. The availability of sorting depends on the partitioning method chosen.

Select the check boxes as follows:

• Sort. Select this to specify that data coming in on the link should be sorted. Select the column or columns to sort on from the Available list.

• Stable. Select this if you want to preserve previously sorted data sets. This is the default.

• Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is also set, the first record is retained.

You can also specify sort direction, case sensitivity, and collating sequence for each column in the Selected list by selecting it and right-clicking to invoke the shortcut menu.
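For example, the combined effect of Stable and Unique can be modeled like this (a Python sketch of the behavior described above, not DataStage code):

```python
rows = [("smith", 1), ("jones", 2), ("smith", 3)]

# A stable sort preserves the original order of rows with equal keys.
ordered = sorted(rows, key=lambda r: r[0])  # Python's sort is stable

# With Unique set, only the first row of each key value is retained,
# so stability guarantees ("smith", 1) survives rather than ("smith", 3).
seen, unique = set(), []
for key, payload in ordered:
    if key not in seen:
        seen.add(key)
        unique.append((key, payload))

print(unique)  # [('jones', 2), ('smith', 1)]
```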

Outputs Page

The Outputs page allows you to specify details about data output from the Head stage. The Head stage can have only one output link.

The General tab allows you to specify an optional description of the output link. The Columns tab specifies the column definitions of incoming data. The Mapping tab allows you to specify the relationship between the columns being input to the Head stage and the Output columns.

Details about Head stage mapping are given in the following section. See Chapter 3, “Stage Editors,” for a general description of the other tabs.

Mapping Tab

For the Head stage the Mapping tab allows you to specify how the output columns are derived, i.e., what input columns map onto them or how they are generated.

The left pane shows the input columns and/or the generated columns. These are read only and cannot be modified on this tab.

The right pane shows the output columns for each link. This has a Derivations field where you can specify how the column is derived. You can fill it in by dragging input columns over, or by using the Auto-match facility.


Chapter 45. Tail Stage

The Tail Stage is an active stage. It can have a single input link and a single output link.

The Tail Stage selects the last N records from each partition of an input data set and copies the selected records to an output data set. You determine which records are copied by setting properties which allow you to specify:

• The number of records to copy

• The partition from which the records are copied

This stage is helpful in testing and debugging applications with large data sets. For example, the Partition property lets you see data from a single partition to determine if the data is being partitioned as you want it to be. The Number of Rows property lets you control how much of each partition you see.

The stage editor has three pages:

• Stage page. This is always present and is used to specify general information about the stage.

• Inputs page. This is where you specify the details about the single input set from which you are selecting records.

• Outputs page. This is where you specify details about the processed data being output from the stage.

Stage Page

The General tab allows you to specify an optional description of the stage. The Properties tab lets you specify what the stage does. The Advanced tab allows you to specify how the stage executes.

Page 470: DataStage Parallel Job Developer’s Guide

Properties

The Properties tab allows you to specify properties which determine what the stage actually does. Some of the properties are mandatory, although many have default settings. Properties without default settings appear in the warning color (red by default) and turn black when you supply a value for them.

The following table gives a quick reference list of the properties and their attributes. A more detailed description of each property follows.

| Category/Property | Values | Default | Mandatory? | Repeats? | Dependent of |
|---|---|---|---|---|---|
| Rows/Number of Rows (per Partition) | Count | 10 | N | N | N/A |
| Partitions/All Partitions | Partition Number | N/A | N | Y | N/A |
| Partitions/Partition Number | Number | N/A | Y (if All Partitions = False) | Y | N/A |

Rows Category

Number of Rows (per Partition). Specify the number of rows to copy from each partition of the input data set to the output data set. The default value is 10.

Partitions Category

All Partitions. If False, copy records only from the indicated partition, specified by number. By default, the operator copies records from all partitions.

Partition Number. Specifies particular partitions to perform the Tail operation on. You can specify the Partition Number property multiple times to specify multiple partition numbers.
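The selection itself is straightforward: the last N rows of each partition. A minimal Python model of the documented behavior (the function name and data layout are illustrative assumptions):

```python
def tail_partition(rows, number_of_rows=10):
    """Model of the Tail stage: keep the last N rows of a partition."""
    return rows[-number_of_rows:]

# Each partition is processed independently:
partitions = [[1, 2, 3, 4, 5], [6, 7, 8]]
print([tail_partition(p, number_of_rows=2) for p in partitions])
# prints [[4, 5], [7, 8]]
```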


Advanced Tab

This tab allows you to specify the following:

• Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the input data is processed by the available nodes as specified in the Configuration file, and by any node constraints specified on the Advanced tab. In Sequential mode the entire data set is processed by the conductor node.

• Preserve partitioning. This is Propagate by default. It adopts Set or Clear from the previous stage. You can explicitly select Set or Clear. Select Set to request that the next stage in the job should attempt to maintain the partitioning.

• Node pool and resource constraints. Select this option to constrain parallel execution to the node pool or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from drop-down lists populated from the Configuration file.

• Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node map. You can define a node map by typing node numbers into the text box or by clicking the browse button to open the Available Nodes dialog box and selecting nodes from there. You are effectively defining a new node pool for this stage (in addition to any node pools defined in the Configuration file).

Inputs Page

The Inputs page allows you to specify details about the incoming data sets. The Tail stage expects one input.

The General tab allows you to specify an optional description of the input link. The Partitioning tab allows you to specify how incoming data is partitioned before being tailed. The Columns tab specifies the column definitions of incoming data.

Details about Tail stage partitioning are given in the following section. See Chapter 3, “Stage Editors,” for a general description of the other tabs.


Partitioning on Input Links

The Partitioning tab allows you to specify details about how the incoming data is partitioned or collected before it is tailed. It also allows you to specify that the data should be sorted before being operated on.

By default the stage partitions in Auto mode. This attempts to work out the best partitioning method depending on execution modes of current and preceding stages, whether the Preserve Partitioning option has been set, and how many nodes are specified in the Configuration file. If the Preserve Partitioning option has been set on the previous stage in the job, this stage will attempt to preserve the partitioning of the incoming data.

If the Tail stage is operating in sequential mode, it will first collect the data using the default Auto collection method.

The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends on:

• Whether the Tail stage is set to execute in parallel or sequential mode.

• Whether the preceding stage in the job is set to execute in parallel or sequential mode.

If the Tail stage is set to execute in parallel, then you can set a partitioning method by selecting from the Partitioning mode drop-down list. This will override any current partitioning (even if the Preserve Partitioning option has been set on the previous stage).

If the Tail stage is set to execute in sequential mode, but the preceding stage is executing in parallel, then you can set a collection method from the Collection type drop-down list. This will override the default collection method.

The following partitioning methods are available:

• (Auto). DataStage attempts to work out the best partitioning method depending on execution modes of current and preceding stages, whether the Preserve Partitioning option has been set, and how many nodes are specified in the Configuration file. This is the default partitioning method for the Tail stage.

• Entire. Each partition receives the entire data set.

• Hash. The records are hashed into partitions based on the value of a key column or columns selected from the Available list.

• Modulus. The records are partitioned using a modulus function on the key column selected from the Available list. This is commonly used to partition on tag fields.

• Random. The records are partitioned randomly, based on the output of a random number generator.

• Round Robin. The records are partitioned on a round robin basis as they enter the stage.

• Same. Preserves the partitioning already in place.

• DB2. Replicates the DB2 partitioning method of a specific DB2 table. Requires extra properties to be set. Access these properties by clicking the properties button.

• Range. Divides a data set into approximately equal size partitions based on one or more partitioning keys. Range partitioning is often a preprocessing step to performing a total sort on a data set. Requires extra properties to be set. Access these properties by clicking the properties button.

The following Collection methods are available:

• (Auto). DataStage attempts to work out the best collection method depending on execution modes of current and preceding stages, and how many nodes are specified in the Configuration file. This is the default collection method for Tail stages.

• Ordered. Reads all records from the first partition, then all records from the second partition, and so on.

• Round Robin. Reads a record from the first input partition, then from the second partition, and so on. After reaching the last partition, the operator starts over.

• Sort Merge. Reads records in an order based on one or more columns of the record. This requires you to select a collecting key column from the Available list.

The Partitioning tab also allows you to specify that data arriving on the input link should be sorted before being tailed. The sort is always carried out within data partitions. If the stage is partitioning incoming data the sort occurs after the partitioning. If the stage is collecting data, the sort occurs before the collection. The availability of sorting depends on the partitioning method chosen.

Select the check boxes as follows:


• Sort. Select this to specify that data coming in on the link should be sorted. Select the column or columns to sort on from the Available list.

• Stable. Select this if you want to preserve previously sorted data sets. This is the default.

• Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is also set, the first record is retained.

You can also specify sort direction, case sensitivity, and collating sequence for each column in the Selected list by selecting it and right-clicking to invoke the shortcut menu.

Outputs Page

The Outputs page allows you to specify details about data output from the Tail stage. The Tail stage can have only one output link.

The General tab allows you to specify an optional description of the output link. The Columns tab specifies the column definitions of incoming data. The Mapping tab allows you to specify the relationship between the columns being input to the Tail stage and the Output columns.

Details about Tail stage mapping are given in the following section. See Chapter 3, “Stage Editors,” for a general description of the other tabs.


Mapping Tab

For the Tail stage the Mapping tab allows you to specify how the output columns are derived, i.e., what input columns map onto them or how they are generated.

The left pane shows the input columns and/or the generated columns. These are read only and cannot be modified on this tab.

The right pane shows the output columns for each link. This has a Derivations field where you can specify how the column is derived. You can fill it in by dragging input columns over, or by using the Auto-match facility.


Chapter 46. Compare Stage

The Compare stage is an active stage. It can have two input links and a single output link.

The Compare stage performs a column-by-column comparison of records in two presorted input data sets. You can restrict the comparison to specified key columns.

The Compare stage does not change the table definition, partitioning, or content of the records in either input data set. It transfers both data sets intact to a single output data set generated by the stage. The comparison results are also recorded in the output data set.

The stage editor has three pages:

• Stage page. This is always present and is used to specify general information about the stage.

• Inputs page. This is where you specify the details about the two input sets being compared.

• Outputs page. This is where you specify details about the processed data being output from the stage.

Stage Page

The General tab allows you to specify an optional description of the stage. The Properties tab lets you specify what the stage does. The Advanced tab allows you to specify how the stage executes.


Properties

The Properties tab allows you to specify properties which determine what the stage actually does. Some of the properties are mandatory, although many have default settings. Properties without default settings appear in the warning color (red by default) and turn black when you supply a value for them.

The following table gives a quick reference list of the properties and their attributes. A more detailed description of each property follows.

| Category/Property | Values | Default | Mandatory? | Repeats? | Dependent of |
|---|---|---|---|---|---|
| Options/Abort On Difference | True/False | False | Y | N | N/A |
| Options/Warn on Record Count Mismatch | True/False | False | Y | N | N/A |
| Options/‘Equals’ Value | number | 0 | N | N | N/A |
| Options/‘First is Empty’ Value | number | 1 | N | N | N/A |
| Options/‘Greater Than’ Value | number | 2 | N | N | N/A |
| Options/‘Less Than’ Value | number | -1 | N | N | N/A |
| Options/‘Second is Empty’ Value | number | -2 | N | N | N/A |
| Options/Key | Input Column | N/A | N | Y | N/A |
| Options/Case Sensitive | True/False | True | N | N | Key |

Options Category

Abort On Difference. This property forces the stage to abort its operation each time a difference is encountered between two corresponding columns in any record of the two input data sets. This is False by default; if you set it to True, you cannot set Warn on Record Count Mismatch.

Warn on Record Count Mismatch. This property directs the stage to output a warning message when a comparison is aborted due to a mismatch in the number of records in the two input data sets. This is False by default; if you set it to True, you cannot set Abort On Difference.

‘Equals’ Value. Allows you to set an alternative value for the code which the stage outputs to indicate two compared records are equal. This is 0 by default.

‘First is Empty’ Value. Allows you to set an alternative value for the code which the stage outputs to indicate the first record is empty. This is 1 by default.

‘Greater Than’ Value. Allows you to set an alternative value for the code which the stage outputs to indicate the first record is greater than the other. This is 2 by default.

‘Less Than’ Value. Allows you to set an alternative value for the code which the stage outputs to indicate the second record is greater than the other. This is -1 by default.

‘Second is Empty’ Value. Allows you to set an alternative value for the code which the stage outputs to indicate the second record is empty. This is -2 by default.

Key. Allows you to specify one or more key columns. Only these columns will be compared. Repeat the property to specify multiple columns. The Key property has a dependent property:

• Case Sensitive

Use this to specify whether each key is case sensitive or not. This is set to True by default; i.e., the values “CASE” and “case” would not be treated as equal.
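Putting the codes together, the comparison of one record pair behaves roughly as in this Python sketch (the function and record layout are illustrative assumptions; the code values shown are the documented defaults):

```python
def compare_records(first, second, keys,
                    equals=0, first_is_empty=1, greater_than=2,
                    less_than=-1, second_is_empty=-2):
    """Return the result code for one pair of records."""
    if first is None:
        return first_is_empty
    if second is None:
        return second_is_empty
    a = tuple(first[k] for k in keys)    # compare only the key columns
    b = tuple(second[k] for k in keys)
    if a == b:
        return equals
    return greater_than if a > b else less_than

row_a = {"cust": 100, "name": "Ann"}
row_b = {"cust": 101, "name": "Ann"}
print(compare_records(row_a, row_b, keys=["cust"]))  # -1: second is greater
```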

Advanced Tab

This tab allows you to specify the following:

• Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the input data is processed by the available nodes as specified in the Configuration file, and by any node constraints specified on the Advanced tab. In Sequential mode the entire data set is processed by the conductor node.

Page 480: DataStage Parallel Job Developer’s Guide

• Preserve partitioning. This is Propagate by default. It adopts Set or Clear from the previous stage. You can explicitly select Set or Clear. Select Set to request that the next stage in the job should attempt to maintain the partitioning.

• Node pool and resource constraints. Select this option to constrain parallel execution to the node pool or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from drop-down lists populated from the Configuration file.

• Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node map. You can define a node map by typing node numbers into the text box or by clicking the browse button to open the Available Nodes dialog box and selecting nodes from there. You are effectively defining a new node pool for this stage (in addition to any node pools defined in the Configuration file).

Link Ordering Tab

This tab allows you to specify which input link carries the First data set and which carries the Second data set. Which data set is treated as first and which as second affects the comparison code that the stage outputs.


By default the first link added will represent the First set. To rearrange the links, choose an input link and click the up arrow button or the down arrow button.

Inputs Page

The Inputs page allows you to specify details about the incoming data sets. The Compare stage expects two incoming data sets.

The General tab allows you to specify an optional description of the input link. The Partitioning tab allows you to specify how incoming data is partitioned before being compared. The Columns tab specifies the column definitions of incoming data.

Details about Compare stage partitioning are given in the following section. See Chapter 3, “Stage Editors,” for a general description of the other tabs.

Partitioning on Input Links

The Partitioning tab allows you to specify details about how the incoming data is partitioned or collected before it is compared. It also allows you to specify that the data should be sorted before being operated on.

If the Compare stage is set to execute in sequential mode, but the preceding stage is executing in parallel, then you can set a collection method from the Collection type drop-down list. This will override the default collection method.

The following Collection methods are available:

• (Auto). DataStage attempts to work out the best collection method depending on execution modes of current and preceding stages, and how many nodes are specified in the Configuration file. This is the default collection method for Compare stages.

• Ordered. Reads all records from the first partition, then all records from the second partition, and so on.

• Round Robin. Reads a record from the first input partition, then from the second partition, and so on. After reaching the last partition, the operator starts over.

• Sort Merge. Reads records in an order based on one or more columns of the record. This requires you to select a collecting key column from the Available list.

If you are collecting data, the Partitioning tab also allows you to specify that data arriving on the input link should be sorted before being collected and compared. The sort is always carried out within data partitions. The sort occurs before the collection.

Select the check boxes as follows:

• Sort. Select this to specify that data coming in on the link should be sorted. Select the column or columns to sort on from the Available list.

• Stable. Select this if you want to preserve previously sorted data sets. This is the default.

• Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is also set, the first record is retained.

You can also specify sort direction, case sensitivity, and collating sequence for each column in the Selected list by selecting it and right-clicking to invoke the shortcut menu.

Outputs Page

The Outputs page allows you to specify details about data output from the Compare stage. The Compare stage can have only one output link.

The General tab allows you to specify an optional description of the output link. The Columns tab specifies the column definitions of incoming data.

See Chapter 3, “Stage Editors,” for a general description of the tabs.


Chapter 47. Peek Stage

The Peek stage is an active stage. It has a single input link and any number of output links.

The Peek stage lets you print record column values either to the job log or to a separate output link as the stage copies records from its input data set to one or more output data sets. This can be helpful for monitoring the progress of your application or to diagnose a bug in your application.

The stage editor has three pages:

• Stage page. This is always present and is used to specify general information about the stage.

• Inputs page. This is where you specify the details about the single input set from which you are selecting records.

• Outputs page. This is where you specify details about the processed data being output from the stage.

Stage Page

The General tab allows you to specify an optional description of the stage. The Properties tab lets you specify what the stage does. The Advanced tab allows you to specify how the stage executes.

Properties

The Properties tab allows you to specify properties which determine what the stage actually does. Some of the properties are mandatory, although many have default settings. Properties without default settings appear in the warning color (red by default) and turn black when you supply a value for them.


The following table gives a quick reference list of the properties and their attributes. A more detailed description of each property follows.

| Category/Property | Values | Default | Mandatory? | Repeats? | Dependent of |
|---|---|---|---|---|---|
| Rows/All Records (After Skip) | True/False | False | N | N | N/A |
| Rows/Number of Records (Per Partition) | number | 10 | Y | N | N/A |
| Rows/Period (per Partition) | Number | N/A | N | N | N/A |
| Rows/Skip (per Partition) | Number | N/A | N | N | N/A |
| Columns/Peek All Input Columns | True/False | True | Y | N | N/A |
| Columns/Input Column to Peek | Input Column | N/A | Y (if Peek All Input Columns = False) | Y | N/A |
| Partitions/All Partitions | True/False | True | Y | N | N/A |
| Partitions/Partition Number | number | N/A | Y (if All Partitions = False) | Y | N/A |
| Options/Peek Records Output Mode | Job Log/Output | Job Log | N | N | N/A |
| Options/Show Column Names | True/False | False | N | N | N/A |
| Options/Delimiter String | space/nl/tab | space | N | N | N/A |

Rows Category

All Records (After Skip). True to print all records from each partition. Set to False by default.

Page 485: DataStage Parallel Job Developer’s Guide

Number of Records (Per Partition). Specifies the number of records to print from each partition. The default is 10.

Period (per Partition). Print every Pth record in a partition, where P is the period. You can start the copy operation after records have been skipped by using the Skip property. P must be greater than or equal to 1.

Skip (per Partition). Ignore the first number of rows of each partition of the input data set, where number is the number of rows to skip. The default skip count is 0.

Columns Category

Peek All Input Columns. True by default and prints all the input columns. Set to False to specify that only selected columns will be printed and specify these columns using the Input Column to Peek property.

Input Column to Peek. If you have set Peek All Input Columns to False, use this property to specify a column to be printed. Repeat the property to specify multiple columns.

Partitions Category

All Partitions. Set to True by default. Set to False to specify that only certain partitions should have columns printed, and specify which parti-tions using the Partition Number property.

Partition Number. If you have set All Partitions to False, use this property to specify which partition you want to print columns from. Repeat the property to specify multiple partitions.

Options Category

Peek Records Output Mode. Specifies whether the output should go to an output column (the Peek Records column) or to the job log.

Show Column Names. If True, causes the stage to print the column name, followed by a colon, followed by the column value. By default, the stage prints only the column value, followed by a space.


Delimiter String. The string to use as a delimiter on columns. Can be space, tab or newline. The default is space.
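The Options properties combine as in the following Python sketch of the printed output (illustrative only; the exact job-log layout is not specified here):

```python
def format_peek(record, show_column_names=False, delimiter=" "):
    """Format one record the way the Peek options describe."""
    if show_column_names:
        # Show Column Names: "name:value" for each column.
        fields = [f"{name}:{value}" for name, value in record.items()]
    else:
        # Default: just the column values.
        fields = [str(value) for value in record.values()]
    return delimiter.join(fields)

rec = {"id": 7, "amount": 12.5}
print(format_peek(rec))                          # 7 12.5
print(format_peek(rec, show_column_names=True))  # id:7 amount:12.5
```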


Advanced Tab

This tab allows you to specify the following:

• Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the input data is processed by the available nodes as specified in the Configuration file, and by any node constraints specified on the Advanced tab. In Sequential mode the entire data set is processed by the conductor node.

• Preserve partitioning. This is Propagate by default. It adopts Set or Clear from the previous stage. You can explicitly select Set or Clear. Select Set to request that the next stage in the job should attempt to maintain the partitioning.

• Node pool and resource constraints. Select this option to constrain parallel execution to the node pool or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from drop-down lists populated from the Configuration file.

• Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node map. You can define a node map by typing node numbers into the text box or by clicking the browse button to open the Available Nodes dialog box and selecting nodes from there. You are effectively defining a new node pool for this stage (in addition to any node pools defined in the Configuration file).


Link Ordering

This tab allows you to specify which output link carries the peek records data set if you have chosen to output the records to a link rather than the job log.

By default the last link added will represent the peek data set. To rearrange the links, choose an output link and click the up arrow button or the down arrow button.

Inputs Page

The Inputs page allows you to specify details about the incoming data sets. The Peek stage expects one incoming data set.

The General tab allows you to specify an optional description of the input link. The Partitioning tab allows you to specify how incoming data is partitioned before being peeked. The Columns tab specifies the column definitions of incoming data.

Details about Peek stage partitioning are given in the following section. See Chapter 3, “Stage Editors,” for a general description of the other tabs.


Partitioning on Input Links

The Partitioning tab allows you to specify details about how the incoming data is partitioned or collected before it is peeked. It also allows you to specify that the data should be sorted before being operated on.

By default the stage partitions in Auto mode. This attempts to work out the best partitioning method depending on execution modes of current and preceding stages, whether the Preserve Partitioning option has been set, and how many nodes are specified in the Configuration file. If the Preserve Partitioning option has been set on the previous stage in the job, this stage will attempt to preserve the partitioning of the incoming data.

If the Peek stage is operating in sequential mode, it will first collect the data using the default Auto collection method.

The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends on:

• Whether the Peek stage is set to execute in parallel or sequential mode.

• Whether the preceding stage in the job is set to execute in parallel or sequential mode.

If the Peek stage is set to execute in parallel, then you can set a partitioning method by selecting from the Partitioning mode drop-down list. This will override any current partitioning (even if the Preserve Partitioning option has been set on the previous stage).

If the Peek stage is set to execute in sequential mode, but the preceding stage is executing in parallel, then you can set a collection method from the Collection type drop-down list. This will override the default collection method.

The following partitioning methods are available:

• (Auto). DataStage attempts to work out the best partitioning method depending on execution modes of current and preceding stages, whether the Preserve Partitioning option has been set, and how many nodes are specified in the Configuration file. This is the default method of the Peek stage.

• Entire. Each partition receives the entire data set.

• Hash. The records are hashed into partitions based on the value of a key column or columns selected from the Available list.

• Modulus. The records are partitioned using a modulus function on the key column selected from the Available list. This is commonly used to partition on tag fields.

• Random. The records are partitioned randomly, based on the output of a random number generator.

• Round Robin. The records are partitioned on a round robin basis as they enter the stage.

• Same. Preserves the partitioning already in place.

• DB2. Replicates the DB2 partitioning method of a specific DB2 table. Requires extra properties to be set. Access these properties by clicking the properties button.

• Range. Divides a data set into approximately equal size partitions based on one or more partitioning keys. Range partitioning is often a preprocessing step to performing a total sort on a data set. Requires extra properties to be set. Access these properties by clicking the properties button.

The following Collection methods are available:

• (Auto). DataStage attempts to work out the best collection method depending on execution modes of current and preceding stages, and how many nodes are specified in the Configuration file. This is the default collection method for Peek stages.

• Ordered. Reads all records from the first partition, then all records from the second partition, and so on.

• Round Robin. Reads a record from the first input partition, then from the second partition, and so on. After reaching the last partition, the operator starts over.

• Sort Merge. Reads records in an order based on one or more columns of the record. This requires you to select a collecting key column from the Available list.

The Partitioning tab also allows you to specify that data arriving on the input link should be sorted before being peeked. The sort is always carried out within data partitions. If the stage is partitioning incoming data the sort occurs after the partitioning. If the stage is collecting data, the sort occurs before the collection. The availability of sorting depends on the partitioning method chosen.


Select the check boxes as follows:

• Sort. Select this to specify that data coming in on the link should be sorted. Select the column or columns to sort on from the Available list.

• Stable. Select this if you want to preserve previously sorted data sets. This is the default.

• Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is also set, the first record is retained.

You can also specify sort direction, case sensitivity, and collating sequence for each column in the Selected list by selecting it and right-clicking to invoke the shortcut menu.

Outputs Page

The Outputs page allows you to specify details about data output from the Peek stage. The Peek stage can have any number of output links. Select the link whose details you are looking at from the Output name drop-down list.

The General tab allows you to specify an optional description of the output link. The Columns tab specifies the column definitions of incoming data. The Mapping tab allows you to specify the relationship between the columns being input to the Peek stage and the Output columns.

Details about Peek stage mapping are given in the following section. See Chapter 3, “Stage Editors,” for a general description of the other tabs.


Mapping Tab

For the Peek stage the Mapping tab allows you to specify how the output columns are derived, i.e., what input columns map onto them or how they are generated.

The left pane shows the columns being peeked. These are read only and cannot be modified on this tab.

The right pane shows the output columns for each link. This has a Derivations field where you can specify how the column is derived. You can fill it in by dragging input columns over, or by using the Auto-match facility.


Chapter 48. SAS Stage

The SAS stage is an active stage. It can have multiple input links and multiple output links.

The SAS stage allows you to execute part or all of an SAS application in parallel. It reduces or eliminates the performance bottlenecks that might otherwise occur when SAS is run on a parallel computer.

DataStage enables SAS users to:

• Access, for reading or writing, large volumes of data in parallel from parallel relational databases, with much higher throughput than is possible using PROC SQL.

• Process parallel streams of data with parallel instances of SAS DATA and PROC steps, enabling scoring or other data transformations to be done in parallel with minimal changes to existing SAS code.

• Store large data sets in parallel, eliminating restrictions on data set size imposed by your file system or physical disk-size limitations. Parallel data sets are accessed from SAS programs in the same way as conventional SAS data sets, but at much higher data I/O rates.

• Realize the benefits of pipeline parallelism, in which some number of SAS stages run at the same time, each receiving data from the previous process as it becomes available.

The stage editor has three pages:

• Stage page. This is always present and is used to specify general information about the stage.


• Inputs page. This is where you specify the details about the data sets being input to the SAS code.

• Outputs page. This is where you specify details about the processed data being output from the stage.

Stage Page

The General tab allows you to specify an optional description of the stage. The Properties tab lets you specify what the stage does. The Advanced tab allows you to specify how the stage executes.

Properties

The Properties tab allows you to specify properties which determine what the stage actually does. Some of the properties are mandatory, although many have default settings. Properties without default settings appear in the warning color (red by default) and turn black when you supply a value for them.

The following table gives a quick reference list of the properties and their attributes. A more detailed description of each property follows.

| Category/Property | Values | Default | Mandatory? | Repeats? | Dependent of |
|---|---|---|---|---|---|
| SAS Source/Source Method | Explicit/Source File | Explicit | Y | N | N/A |
| SAS Source/Source | code | N/A | Y (if Source Method = Explicit) | N | N/A |
| SAS Source/Source File | pathname | N/A | Y (if Source Method = Source File) | N | N/A |
| Inputs/Input Link Number | number | N/A | N | Y | N/A |
| Inputs/Input SAS Data Set Name | string | N/A | Y (if input link number specified) | N | N/A |
| Outputs/Output Link Number | number | N/A | N | Y | N/A |
| Outputs/Output SAS Data Set Name | string | N/A | Y (if output link number specified) | N | N/A |
| Options/Disable Working Directory Warning | True/False | False | Y | N | N/A |
| Options/Convert Local | True/False | False | Y | N | N/A |
| Options/Debug Program | No/Verbose/Yes | No | Y | N | N/A |
| Options/SAS List File Location Type | File/Job Log/None/Output | Job Log | Y | N | N/A |
| Options/SAS Log File Location Type | File/Job Log/None/Output | Job Log | Y | N | N/A |
| Options/SAS Options | string | N/A | N | N | N/A |
| Options/Working Directory | pathname | N/A | N | N | N/A |

SAS Source Category

Source Method. Choose from Explicit (the default) or Source File. You then have to set either the Source property or the Source File property to specify the actual source.

Source. Specify the SAS code to be executed. This can contain both PROC and DATA steps.

Source File. Specify a file containing the SAS code to be executed by the stage.

Inputs Category

Input Link Number. Specifies inputs to the SAS code in terms of input link numbers. Repeat the property to specify multiple links. This has a dependent property:

• Input SAS Data Set Name.

The name of the SAS data set receiving its input from the specified input link.

Outputs Category

Output Link Number. Specifies an output link to connect to the output of the SAS code. Repeat the property to specify multiple links. This has a dependent property:

• Output SAS Data Set Name.

The name of the SAS data set sending its output to the specified output link.

Options Category

Disable Working Directory Warning. Disables the warning message generated by the stage when you omit the Working Directory property. By default, if you omit the Working Directory property, the SAS working directory is indeterminate and the stage generates a warning message.

Convert Local. Specify that the conversion phase of the SAS stage (from the input data set format to the stage SAS data set format) should run on the same nodes as the SAS stage. If this option is not set, the conversion runs by default with the previous stage’s degree of parallelism and, if possible, on the same nodes as the previous stage.

Debug Program. A setting of Yes causes the stage to ignore errors in the SAS program and continue execution of the application. This allows your application to generate output even if an SAS step has an error. By default, the setting is No, which causes the stage to abort when it detects an error in the SAS program.


Setting the property to Verbose is the same as Yes, but in addition it causes the operator to echo the SAS source code that it executes.


SAS List File Location Type. Specifying File for this property causes the stage to write the SAS list file generated by the executed SAS code to a plain text file. The list is sorted before being written out. The name of the list file, which cannot be modified, is dsident.lst, where ident is the name of the stage, including an index in parentheses if there are more than one with the same name. For example, dssas(1).lst is the list file from the second SAS stage in a data flow.

Specifying Job Log causes the list to be written to the DataStage job log.

Specifying Output causes the list file to be written to an output data set of the stage. The data set from a parallel SAS stage containing the list infor-mation will not be sorted.

If you specify None no list will be generated.

SAS Log File Location Type. Specifying File for this property causes the stage to write the SAS log generated by the executed SAS code to a plain text file. The name of the log file, which cannot be modified, is dsident.log, where ident is the name of the stage, including an index in parentheses if there are more than one with the same name. For example, dssas(1).log is the log file from the second SAS stage in a data flow.

Specifying Job Log causes the log to be written to the DataStage job log.

Specifying Output causes the log to be written to an output data set of the stage. The data set from a parallel SAS stage containing the log information will not be sorted.

If you specify None no log will be generated.

SAS Options. Specify any options for the SAS code in a quoted string. These are the options that you would specify to an SAS OPTIONS directive.

Working Directory. Name of the working directory on all the processing nodes executing the SAS application. All relative pathnames in the SAS code are relative to this pathname.


Advanced Tab

This tab allows you to specify the following:

• Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the input data is processed by the available nodes as specified in the Configuration file, and by any node constraints specified on the Advanced tab. In Sequential mode the entire data set is processed by the conductor node.

• Preserve partitioning. This is Propagate by default. It adopts Set or Clear from the previous stage. You can explicitly select Set or Clear. Select Set to request that the next stage in the job should attempt to maintain the partitioning.

• Node pool and resource constraints. Select this option to constrain parallel execution to the node pool or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices from drop-down lists populated from the Configuration file.

• Node map constraint. Select this option to constrain parallel execution to the nodes in a defined node map. You can define a node map by typing node numbers into the text box or by clicking the browse button to open the Available Nodes dialog box and selecting nodes from there. You are effectively defining a new node pool for this stage (in addition to any node pools defined in the Configuration file).


Link Ordering

This tab allows you to specify how input links and output links are numbered. This is important when you are specifying Input Link Number and Output Link Number properties.

By default the first link added will be link 1, the second link 2 and so on. Select a link and use the arrow buttons to change its position.

Inputs Page

The Inputs page allows you to specify details about the incoming data sets. There can be multiple inputs to the SAS stage.

The General tab allows you to specify an optional description of the input link. The Partitioning tab allows you to specify how incoming data is partitioned before being passed to the SAS code. The Columns tab specifies the column definitions of incoming data.

Details about SAS stage partitioning are given in the following section. See Chapter 3, “Stage Editors,” for a general description of the other tabs.


Partitioning on Input Links

The Partitioning tab allows you to specify details about how the incoming data is partitioned or collected before being passed to the SAS code. It also allows you to specify that the data should be sorted before being operated on.

By default the stage partitions in Auto mode. This attempts to work out the best partitioning method depending on execution modes of current and preceding stages, whether the Preserve Partitioning option has been set, and how many nodes are specified in the Configuration file. You can use any partitioning method except Modulus. If the Preserve Partitioning option has been set on the previous stage in the job, this stage will attempt to preserve the partitioning of the incoming data.

If the SAS stage is operating in sequential mode, it will first collect the data using the default Auto collection method.

The Partitioning tab allows you to override this default behavior. The exact operation of this tab depends on:

• Whether the SAS stage is set to execute in parallel or sequential mode.

• Whether the preceding stage in the job is set to execute in parallel or sequential mode.

If the SAS stage is set to execute in parallel, then you can set a partitioning method by selecting from the Partitioning mode drop-down list. This will override any current partitioning (even if the Preserve Partitioning option has been set on the previous stage).

If the SAS stage is set to execute in sequential mode, but the preceding stage is executing in parallel, then you can set a collection method from the Collection type drop-down list. This will override the default collection method.

The following partitioning methods are available:

• (Auto). DataStage attempts to work out the best partitioning method depending on execution modes of current and preceding stages, whether the Preserve Partitioning option has been set, and how many nodes are specified in the Configuration file. This is the default partitioning method for the SAS stage.

• Entire. Each partition receives the entire data set.

• Hash. The records are hashed into partitions based on the value of a key column or columns selected from the Available list.

• Modulus. The records are partitioned using a modulus function on the key column selected from the Available list. This is commonly used to partition on tag fields.

• Random. The records are partitioned randomly, based on the output of a random number generator.

• Round Robin. The records are partitioned on a round robin basis as they enter the stage.

• Same. Preserves the partitioning already in place.

• DB2. Replicates the DB2 partitioning method of a specific DB2 table. Requires extra properties to be set. Access these properties by clicking the properties button.

• Range. Divides a data set into approximately equal size partitions based on one or more partitioning keys. Range partitioning is often a preprocessing step to performing a total sort on a data set. Requires extra properties to be set. Access these properties by clicking the properties button.

The following Collection methods are available:

• (Auto). DataStage attempts to work out the best collection method depending on execution modes of current and preceding stages, and how many nodes are specified in the Configuration file. This is the default collection method for SAS stages.

• Ordered. Reads all records from the first partition, then all records from the second partition, and so on.

• Round Robin. Reads a record from the first input partition, then from the second partition, and so on. After reaching the last partition, the operator starts over.

• Sort Merge. Reads records in an order based on one or more columns of the record. This requires you to select a collecting key column from the Available list.

The Partitioning tab also allows you to specify that data arriving on the input link should be sorted before being passed to the SAS code. The sort is always carried out within data partitions. If the stage is partitioning incoming data the sort occurs after the partitioning. If the stage is collecting data, the sort occurs before the collection. The availability of sorting depends on the partitioning method chosen.

Select the check boxes as follows:

• Sort. Select this to specify that data coming in on the link should be sorted. Select the column or columns to sort on from the Available list.

• Stable. Select this if you want to preserve previously sorted data sets. This is the default.

• Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is also set, the first record is retained.

You can also specify sort direction, case sensitivity, and collating sequence for each column in the Selected list by selecting it and right-clicking to invoke the shortcut menu.

Outputs Page

The Outputs page allows you to specify details about data output from the SAS stage. The SAS stage can have multiple output links. Choose the link whose details you are viewing from the Output Name drop-down list.

The General tab allows you to specify an optional description of the output link. The Columns tab specifies the column definitions of incoming data. The Mapping tab allows you to specify the relationship between the columns being input to the SAS stage and the Output columns.

Details about SAS stage mapping are given in the following section. See Chapter 3, “Stage Editors,” for a general description of the other tabs.


Mapping Tab

For the SAS stage the Mapping tab allows you to specify how the output columns are derived, i.e., how SAS data maps onto them.

The left pane shows the data output from the SAS code. These are read only and cannot be modified on this tab.

The right pane shows the output columns for each link. This has a Derivations field where you can specify how the column is derived. You can fill it in by dragging input columns over, or by using the Auto-match facility.


Chapter 49. Specifying Custom Parallel Stages

In addition to the wide range of parallel stage types available, DataStage allows you to define your own stage types, which you can then use in parallel jobs.

There are three different types of stage that you can define:

• Custom. This allows knowledgeable Orchestrate users to specify an Orchestrate operator as a DataStage stage. This is then available to use in DataStage Parallel jobs.

• Build. This allows you to design and build your own bespoke operator as a stage to be included in DataStage Parallel Jobs.

• Wrapped. This allows you to specify a UNIX command to be executed by a DataStage stage. You define a wrapper file that in turn defines arguments for the UNIX command and inputs and outputs.

The DataStage Manager provides an interface that allows you to define a new DataStage Parallel job stage of any of these types. This interface is also available from the Repository window of the DataStage Designer.


Defining Custom Stages

You can define a custom stage in order to include an Orchestrate operator in a DataStage stage which you can then include in a DataStage job. The stage will be available to all jobs in the project in which the stage was defined. You can make it available to other projects using the DataStage Manager Export/Import facilities. The stage is automatically added to the job palette.

To define a custom stage type from the DataStage Manager:

1. Select the Stage Types category in the Repository tree.

2. Choose File ➤ New Parallel Stage ➤ Custom from the main menu or New Parallel Stage ➤ Custom from the shortcut menu. The Stage Type dialog box appears.


3. Fill in the fields on the General page as follows:

• Stage type name. This is the name that the stage will be known by to DataStage. Avoid using the same name as existing stages.


• Category. The category that the new stage will be stored in under the stage types branch of the Repository tree view. Type in or browse for an existing category or type in the name of a new one.

• Parallel Stage type. This indicates the type of new Parallel job stage you are defining (Custom, Build, or Wrapped). You cannot change this setting.

• Execution Mode. Choose the execution mode. This is the mode that will appear in the Advanced tab on the stage editor. You can override this mode for individual instances of the stage as required, unless you select Parallel only or Sequential only. See “Advanced Tab” on page 3-5 for a description of the execution mode.

• Output Mapping. Choose whether the stage has a Mapping tab or not. A Mapping tab enables the user of the stage to specify how output columns are derived from the data produced by the stage. Choose None to specify that output mapping is not performed; choose Default to accept the default setting that DataStage uses.

• Preserve Partitioning. Choose the default setting of the Preserve Partitioning flag. This is the setting that will appear in the Advanced tab on the stage editor. You can override this setting for individual instances of the stage as required. See “Advanced Tab” on page 3-5 for a description of the preserve partitioning flag.

• Partitioning. Choose the default partitioning method for the stage. This is the method that will appear in the Inputs Page Partitioning tab of the stage editor. You can override this method for individual instances of the stage as required. See “Partitioning Tab” on page 3-11 for a description of the partitioning methods.

• Collecting. Choose the default collection method for the stage. This is the method that will appear in the Inputs Page Partitioning tab of the stage editor. You can override this method for individual instances of the stage as required. See “Partitioning Tab” on page 3-11 for a description of the collection methods.

• Operator. Enter the name of the Orchestrate operator that you want the stage to invoke.

• Short Description. Optionally enter a short description of the stage.

• Long Description. Optionally enter a long description of the stage.


4. Go to the Links page and specify information about the links allowed to and from the stage you are defining.

Use this to specify the minimum and maximum number of input and output links that your custom stage can have.

5. Go to the Creator page and optionally specify information about the stage you are creating. We recommend that you assign a version number to the stage so you can keep track of any subsequent changes.

6. Go to the Properties page. This allows you to specify the options that the Orchestrate operator requires as properties that appear in the Stage Properties tab. For custom stages the Properties tab always appears under the Stage page.

Fill in the fields as follows:

• Property name. The name of the property. This will be passed to the Orchestrate operator as an option, prefixed with ‘-’ and followed by the value selected in the Properties tab of the stage editor (see the example after this list).

• Data type. The data type of the property. Choose from:

– Boolean
– Float
– Integer
– String
– Pathname
– Input Column
– Output Column

If you choose Input Column or Output Column, when the stage is included in a job a drop-down list will offer a choice of the defined input or output columns.

• Prompt. The name of the property that will be displayed on the Properties tab of the stage editor.

• Default Value. The value the option will take if no other is specified.

• Repeating. Set this to True if the property repeats (i.e. you can have multiple instances of it).

• Required. Set this to True if the property is mandatory.
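
For example, if you define a property named minimum and a user of the stage sets it to 5 in the Properties tab, DataStage invokes the Orchestrate operator with the option -minimum 5; with a hypothetical operator name the invocation would look like:

   myoperator -minimum 5

Here myoperator and minimum are illustrative names only.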

Defining Build Stages

You define a Build stage to enable you to provide a bespoke operator that can be executed from a DataStage Parallel job stage. The stage will be available to all jobs in the project in which the stage was defined. You can make it available to other projects using the DataStage Manager Export facilities. The stage is automatically added to the job palette.

When defining a Build stage you provide the following information:

• Description of the data that will be input to the stage.

• Description of the data that will be output from the stage.

• Whether records are transferred from input to output. A transfer copies the input record to the output buffer. If you specify auto transfer, the operator transfers the input record to the output record immediately after execution of the per record code. The code can still access data in the output buffer until it is actually written.

• Any definitions and header file information that needs to be included.

• Code that is executed at the beginning of the stage (before any records are processed).

• Code that is executed at the end of the stage (after all records have been processed).


• Code that is executed every time the stage processes a record.

• Compilation and build details for actually building the stage.


The code for the Build stage is specified in C++. There are a number of macros available to make the job of coding simpler (see “Build Stage Macros” on page 49-16). There are also a number of header files available containing many useful functions; see Appendix C.

When you have specified the information, and request that the stage is generated, DataStage generates a number of files and then compiles these to build an operator which the stage executes. The generated files include:

• Header files (ending in .h)

• Source files (ending in .C)

• Object files (ending in .so)

To define a Build stage from the DataStage Manager:

1. Select the Stage Types category in the Repository tree.

2. Choose File ➤ New Parallel Stage ➤ Build from the main menu or New Parallel Stage ➤ Build from the shortcut menu. The Stage Type dialog box appears:


3. Fill in the fields on the General page as follows:


• Stage type name. This is the name that the stage will be known by to DataStage. Avoid using the same name as existing stages.

• Category. The category that the new stage will be stored in under the stage types branch. Type in or browse for an existing category or type in the name of a new one.

• Class Name. The name of the C++ class. By default this takes the name of the stage type.

• Parallel Stage type. This indicates the type of new Parallel job stage you are defining (Custom, Build, or Wrapped). You cannot change this setting.

• Execution mode. Choose the default execution mode. This is the mode that will appear in the Advanced tab on the stage editor. You can override this mode for individual instances of the stage as required, unless you select Parallel only or Sequential only. See “Advanced Tab” on page 3-5 for a description of the execution mode.

• Preserve Partitioning. Choose the default setting of the Preserve Partitioning flag. This is the setting that will appear in the Advanced tab on the stage editor. You can override this setting for individual instances of the stage as required. See “Advanced Tab” on page 3-5 for a description of the preserve partitioning flag.

• Partitioning. Choose the default partitioning method for the stage. This is the method that will appear in the Inputs Page Partitioning tab of the stage editor. You can override this method for individual instances of the stage as required. See “Partitioning Tab” on page 3-11 for a description of the partitioning methods.

• Collecting. Choose the default collection method for the stage. This is the method that will appear in the Inputs Page Partitioning tab of the stage editor. You can override this method for individual instances of the stage as required. See “Partitioning Tab” on page 3-11 for a description of the collection methods.

• Operator. The name of the operator that your code is defining and which will be executed by the DataStage stage. By default this takes the name of the stage type.


• Short Description. Optionally enter a short description of the stage.

• Long Description. Optionally enter a long description of the stage.


4. Go to the Creator page and optionally specify information about the stage you are creating. We recommend that you assign a release number to the stage so you can keep track of any subsequent changes.

5. Go to the Properties page. This allows you to specify the options that the Build stage requires as properties that appear in the Stage Properties tab. For custom stages the Properties tab always appears under the Stage page.

Fill in the fields as follows:

• Property name. The name of the property. This will be passed to the operator you are defining as an option, prefixed with ‘-’ and followed by the value selected in the Properties tab of the stage editor.

• Data type. The data type of the property. Choose from:

– Boolean
– Float
– Integer
– String
– Pathname
– Input Column
– Output Column

If you choose Input Column or Output Column, when the stage is included in a job a drop-down list will offer a choice of the defined input or output columns.

• Prompt. The name of the property that will be displayed on the Properties tab of the stage editor.

• Default Value. The value the option will take if no other is specified.

• Required. Set this to True if the property is mandatory.

6. Click on the Build page. The tabs here allow you to define the actual operation that the stage will perform.

The Interfaces tab enables you to specify details about inputs to and outputs from the stage, and about automatic transfer of records from input to output. You specify port details, a port being where a link connects to the stage. You need a port for each possible input link to the stage, and a port for each possible output link from the stage.


You provide the following information on the Input sub-tab:

• Port Name. Optional name for the port. The default names for the ports are in0, in1, in2 … . You can refer to them in the code using either the default name or the name you have specified.

• AutoRead. This defaults to True, which means the stage will automatically read records from the port. Otherwise you explicitly control read operations in the code.

• Table Name. Specify a table definition in the DataStage Repository which describes the meta data for the port. You can browse for a table definition by choosing Select Table from the menu that appears when you click the browse button. You can also view the schema corresponding to this table definition by choosing View Schema from the same menu. You do not have to supply a Table Name.

• RCP. Choose True if runtime column propagation is allowed for inputs to this port. Defaults to False. You do not need to set this if you are using the automatic transfer facility.

You provide the following information on the Output sub-tab:

• Port Name. Optional name for the port. The default names for the ports are out0, out1, out2 … . You can refer to them in the code using either the default name or the name you have specified.

• AutoWrite. This defaults to True, which means the stage will automatically write records to the port. Otherwise you explicitly control write operations in the code. Once records are written, the code can no longer access them.

• Table Name. Specify a table definition in the DataStage Repository which describes the meta data for the port. You can browse for a table definition. You do not have to supply a Table Name.

• RCP. Choose True if runtime column propagation is allowed for outputs from this port. Defaults to False. You do not need to set this if you are using the automatic transfer facility.

The Transfer sub-tab allows you to connect an input port to an output port such that records will be automatically transferred from input to output. You can also disable automatic transfer, in which case you have to explicitly transfer data in the code. Transferred data sits in an output buffer and can still be accessed and altered by the code until it is actually written to the port.

You provide the following information on the Transfer sub-tab:

• Input. Select the input port to connect from the drop-down list.

• Output. Select the output port to transfer input records to from the drop-down list.

• Auto Transfer. This defaults to False, which means that you have to include code which manages the transfer. Set to True to have the transfer carried out automatically.

• Separate. This is False by default, which means this transfer will be combined with other transfers to the same port. Set to True to specify that the transfer should be separate from other transfers.

The Logic tab is where you specify the actual code that the stage executes.


The Definitions sub-tab allows you to specify variables, include header files, and otherwise initialize the stage before processing any records.

The Pre-Loop sub-tab allows you to specify code which is executed at the beginning of the stage, before any records are processed.

The Per-Record sub-tab allows you to specify the code which is executed once for every record processed.

The Post-Loop sub-tab allows you to specify code that is executed after all the records have been processed.

You can type straight into these pages or cut and paste from another editor. The shortcut menu on the Pre-Loop, Per-Record, and Post-Loop pages gives access to the macros that are available for use in the code.

The Advanced tab allows you to specify details about how the stage is compiled and built. Fill in the page as follows:

• Compile and Link Flags. Allows you to specify flags which are passed to the C++ compiler.

• Verbose. Select this check box to specify that the compile and build is done in verbose mode.

• Debug. Select this check box to specify that the compile and build is done in debug mode. Otherwise, it is done in optimize mode.

• Suppress Compile. Select this check box to generate files without compiling, and without deleting the generated files. This option is useful for fault finding.

• Base File Name. The base filename for generated files. All generated files will have this name followed by the appropriate suffix. This defaults to the name specified under Operator on the General page.

• Source Directory. The directory where generated .C files are placed. This defaults to the buildop folder in the current project directory. You can also set it using the DS_OPERATOR_BUILDOP_DIR environment variable in the DataStage Administrator (see DataStage Administrator Guide).


• Header Directory. The directory where generated .h files are placed. This defaults to the buildop folder in the current project directory. You can also set it using the DS_OPERATOR_BUILDOP_DIR environment variable in the DataStage Administrator (see DataStage Administrator Guide).

• Object Directory. The directory where generated .so files are placed. This defaults to the buildop folder in the current project directory. You can also set it using the DS_OPERATOR_BUILDOP_DIR environment variable in the DataStage Administrator (see DataStage Administrator Guide).

• Wrapper directory. The directory where generated .op files are placed. This defaults to the buildop folder in the current project directory. You can also set it using the DS_OPERATOR_BUILDOP_DIR environment variable in the DataStage Administrator (see DataStage Administrator Guide).

7. When you have filled in the details in all the pages, click Generate to generate the stage. A window appears showing you the result of the build.

Build Stage Macros

There are a number of macros you can use when specifying Pre-Loop, Per-Record, and Post-Loop code. Insert a macro by selecting it from the shortcut menu. They are grouped into the following categories:

• Informational
• Flow-control
• Input and output
• Transfer

Informational Macros

Use these macros in your code to determine the number of inputs, outputs, and transfers, as follows (a short usage sketch follows the list):

• inputs(). Returns the number of inputs to the stage.

• outputs(). Returns the number of outputs from the stage.

• transfers(). Returns the number of transfers in the stage.
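
For example, Pre-Loop code could use these macros to check that the stage has been wired with the expected number of links. This is a sketch only; the two-input, one-output shape is an assumption for illustration, and failStep() is one of the flow-control macros described next:

   // Pre-Loop sketch: fail early if the stage does not have the
   // expected number of input and output links.
   if (inputs() != 2 || outputs() != 1)
       failStep();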

Flow-Control Macros


Use these macros to override the default behavior of the Per-Record loop in your stage definition:


• endLoop(). Causes the operator to stop looping, following completion of the current loop and after writing any auto outputs for this loop.

• nextLoop(). Causes the operator to immediately skip to the start of the next loop, without writing any outputs.

• failStep(). Causes the operator to return a failed status and terminate the job.

Input and Output Macros

These macros allow you to explicitly control the reading, writing, and transfer of individual records.

Each of the macros takes an argument as follows:

• input is the index of the input (0 to n). If you have defined a name for the input port you can use this in place of the index in the form portname.portid_.

• output is the index of the output (0 to n). If you have defined a name for the output port you can use this in place of the index in the form portname.portid_.

• index is the index of the transfer (0 to n).

The following macros are available:

• readRecord(input). Immediately reads the next record from input, if there is one. If there is no record, the next call to inputDone() will return false.

• writeRecord(output). Immediately writes a record to output.

• inputDone(input). Returns true if the last call to readRecord() for the specified input failed to read a new record, because the input has no more records.

• holdRecord(input). Causes auto input to be suspended for the current record, so that the operator does not automatically read a new record at the start of the next loop. If auto is not set for the input, holdRecord() has no effect.

• discardRecord(output). Causes auto output to be suspended for the current record, so that the operator does not output the record at the end of the current loop. If auto is not set for the output, discardRecord() has no effect.


• discardTransfer(index). Causes auto transfer to be suspended, so that the operator does not perform the transfer at the end of the current loop. If auto is not set for the transfer, discardTransfer() has no effect.

Transfer Macros

These macros allow you to explicitly control the transfer of individual records.

Each of the macros takes an argument as follows:

• input is the index of the input (0 to n). If you have defined a name for the input port you can use this in place of the index in the form portname.portid_.

• output is the index of the output (0 to n). If you have defined a name for the output port you can use this in place of the index in the form portname.portid_.

• index is the index of the transfer (0 to n).

The following macros are available (a short example follows the list):

• doTransfer(index). Performs the transfer specified by index.

• doTransfersFrom(input). Performs all transfers from input.

• doTransfersTo(output). Performs all transfers to output.

• transferAndWriteRecord(output). Performs all transfers and writes a record for the specified output. Calling this macro is equivalent to calling the macros doTransfersTo() and writeRecord().
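
As an illustration, the following Per-Record fragment either rejects the current record to output 1 or transfers and writes it to output 0. This is a sketch only; it assumes a transfer has been defined to each output, auto write is disabled on both, and reject is a flag computed earlier in the code:

   if (reject)
   {
       // Equivalent to doTransfersTo(1) followed by writeRecord(1).
       transferAndWriteRecord(1);
   }
   else
   {
       doTransfersTo(0);   // copy the input record into output 0's buffer
       writeRecord(0);     // write the buffered record to output 0
   }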

How Your Code is Executed

This section describes how the code that you define when specifying a Build stage executes when the stage is run in a DataStage job.

The sequence is as follows (a pseudocode outline follows these steps):

1. Handles any definitions that you specified in the Definitions sub-tab when you entered the stage details.

2. Executes any code that was entered in the Pre-Loop sub-tab.


3. Loops repeatedly until either all inputs have run out of records, or the Per-Record code has explicitly invoked endLoop(). In the loop, performs the following steps:


a. Reads one record for each input, except where any of the following is true:

– The input has no more records left.

– The input has Auto Read set to false.

– The holdRecord() macro was called for the input last time around the loop.

b. Executes the Per-Record code, which can explicitly read and write records, perform transfers, and invoke loop-control macros such as endLoop().

c. Performs each specified transfer, except where any of the following is true:

– The input of the transfer has no more records.

– The transfer has Auto Transfer set to False.

– The discardTransfer() macro was called for the transfer during the current loop iteration.

d. Writes one record for each output, except where any of the following is true:

– The output has Auto Write set to false.

– The discardRecord() macro was called for the output during the current loop iteration.

4. If you have specified code in the Post-Loop sub-tab, executes it.

5. Returns a status, which is written to the DataStage Job Log.
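
In outline, the generated operator therefore behaves like the following pseudocode (a sketch of the sequence above, not the actual generated source):

   // definitions from the Definitions sub-tab
   // ... Pre-Loop code ...
   while (some input still has records, and endLoop() has not been called)
   {
       // auto reads, unless the input is exhausted, Auto Read is false,
       // or holdRecord() was called last time around the loop
       // ... Per-Record code ...
       // auto transfers, unless Auto Transfer is false or suppressed
       // by discardTransfer()
       // auto writes, unless Auto Write is false or suppressed
       // by discardRecord()
   }
   // ... Post-Loop code ...
   // return a status, written to the DataStage Job Log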

Inputs and Outputs

The input and output ports that you defined for your Build stage are where input and output links attach to the stage. By default links are connected to ports in the order they are connected to the stage, but where your stage allows multiple input or output links you can change the link order using the Link Order tab on the stage editor.

When you specify details about the input and output ports for your Build stage, you need to define the meta data for the ports. You do this by loading a table definition from the DataStage Repository.


When you actually use your stage in a job, you have to specify meta data for the links that attach to these ports. For the job to run successfully the meta data specified for the port and that specified for the link should match. An exception to this is where you have runtime column propagation enabled for the job. In this case the input link meta data can be a superset of the port meta data and the extra columns will be automatically propagated.

Using Multiple Inputs

Where you require your stage to handle multiple inputs, there are some special considerations. Your code needs to ensure the following:

• The stage only tries to access a column when there are records available. It should not try to access a column after all records have been read (using the inputDone() macro to check), and should not attempt to access a column unless either Auto Read is enabled on the link or an explicit read record has been performed.

• The reading of records is terminated immediately after all the required records have been read. In the case of a port with Auto Read disabled, the code must determine when all required records have been read and call the endLoop() macro.

In most cases we recommend that you keep auto read enabled when you are using multiple inputs; this minimizes the need for explicit control in your code. But there are circumstances when this is not appropriate. The following paragraphs describe some common scenarios:

Using Auto Read for all Inputs. All ports have Auto Read enabled and so all record reads are handled automatically. You need to code the Per-Record loop so that, each time it accesses a column on any input, it first uses the inputDone() macro to determine whether there are any more records.

This method is fine if you want your stage to read a record from every link, every time round the loop.
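
A Per-Record sketch of this pattern, assuming two auto read inputs (indexes 0 and 1) whose columns are processed elsewhere in the code:

   // Guard every column access with inputDone(): an exhausted
   // input has no current record to read columns from.
   if (!inputDone(0))
   {
       // ... safe to access input 0's columns here ...
   }
   if (!inputDone(1))
   {
       // ... safe to access input 1's columns here ...
   }
   // The loop ends automatically once every input is exhausted.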

Using Inputs with Auto Read Enabled for Some and Disabled for Others. You define one (or possibly more) inputs as auto read, and the rest with auto read disabled. You code the stage in such a way that the processing of records from the auto read input drives the processing of the other inputs. Each time round the loop, your code should call inputDone() on the auto read input and, when it indicates there are no more records, call endLoop() to complete the actions of the stage.


This method is fine where you process a record from the auto read input every time around the loop, and then process records from one or more of the other inputs depending on the results of processing the auto read record.

Using Inputs with Auto Read Disabled. Your code must explicitly perform all record reads. You should define Pre-Loop code which calls readRecord() once for each input to start processing. Your Per-Record code should call inputDone() for every input each time round the loop to determine whether a record was read on the most recent readRecord(), and, if one was, call readRecord() again for that input. When all inputs run out of records, the Per-Record code should exit the loop by calling endLoop().

This method is intended where you want explicit control over how each input is treated.
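
A sketch of this pattern with two inputs, both with Auto Read disabled (port indexes 0 and 1 are assumed). The Pre-Loop code primes each input:

   readRecord(0);
   readRecord(1);

The Per-Record code then processes whatever arrived and immediately requests the next record for that input:

   if (!inputDone(0))
   {
       // ... process the current record from input 0 ...
       readRecord(0);   // fetch the next record for the next iteration
   }
   if (!inputDone(1))
   {
       // ... process the current record from input 1 ...
       readRecord(1);
   }
   if (inputDone(0) && inputDone(1))
       endLoop();       // nothing left on any input; finish the stage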

Example Build Stage

This section shows you how to define a Build stage called Divide, which divides one number by another and writes the result and any remainder to an output link. The stage also checks whether you are trying to divide by zero and, if you are, sends the input record down a reject link.

To demonstrate the use of properties, the stage also lets you define a minimum divisor. If the number you are dividing by is smaller than the minimum divisor you specify when adding the stage to a job, then the record is also rejected.

The input to the stage is defined as auto read, while the two outputs have auto write disabled. The code has to explicitly write the data to one or other of the output links. In the case of a successful division the data written is the original record plus the result of the division and any remainder. In the case of a rejected record, only the original record is written.

The input record has two columns: dividend and divisor. Output 0 has four columns: dividend, divisor, result, and remainder. Output 1 (the reject link) has two columns: dividend and divisor.

If the divisor column of an input record contains zero or is less than the specified minimum divisor, the record is rejected, and the code uses the macro transferAndWriteRecord(1) to transfer the data to port 1 and write it. If the divisor is not zero, the code uses doTransfersTo(0) to transfer the input record to Output 0, assigns the division results to result and remainder, and finally calls writeRecord(0) to write the record to output 0.

The following screen shots show how this stage is defined in DataStage using the Stage Type dialog box:

1. First, general details are supplied on the General tab.


2. Details about the stage’s creation are supplied on the Creator page.

3. The optional property of the stage is defined on the Properties page.

4. Details of the inputs and outputs are defined on the Interfaces tab of the Build page.


Details about the single input to Divide are given on the Input sub-tab of the Interfaces tab. A table definition for the input link is available to be loaded from the DataStage Repository.


Details about the outputs are given on the Output sub-tab of the Interfaces tab.

Note: When you use the stage in a job, make sure that you use table definitions compatible with the tables defined in the input and output sub-tabs.

Details about the transfers carried out by the stage are defined on the Transfer sub-tab of the Interfaces tab.


5. The code itself is defined on the Logic tab. In this case all the processing is done in the Per-Record loop and so is entered on the Per-Record sub-tab (a sketch of this code follows these steps).

6. As this example uses all the compile and build defaults, all that remains is to click Generate to build the stage.
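
The Per-Record code itself appears only in the product’s screen shots, but it can be reconstructed from the description above. In this sketch the minimum divisor property is assumed to be exposed to the code as mindivisor, and the columns are referenced directly by name:

   if (divisor == 0 || divisor < mindivisor)
   {
       // Reject: transfer the input record to output 1 (the reject
       // link) and write it in one step.
       transferAndWriteRecord(1);
   }
   else
   {
       doTransfersTo(0);                 // copy dividend and divisor to output 0
       result    = dividend / divisor;   // result of the division
       remainder = dividend % divisor;   // and any remainder
       writeRecord(0);                   // output 0 has auto write disabled
   }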


Defining Wrapped Stages

You define a Wrapped stage to enable you to specify a UNIX command to be executed by a DataStage stage. You define a wrapper file that handles arguments for the UNIX command and inputs and outputs. The DataStage Manager provides an interface that helps you define the wrapper. The stage will be available to all jobs in the project in which the stage was defined. You can make it available to other projects using the DataStage Manager Export facilities. You can add the stage to your job palette using palette customization features in the DataStage Designer.

When defining a Wrapped stage you provide the following information:

• Details of the UNIX command that the stage will execute.

• Description of the data that will be input to the stage.

• Description of the data that will be output from the stage.

• Definition of the environment in which the command will execute.

The UNIX command that you wrap can be a built-in command, such as grep, a utility, such as SyncSort, or your own UNIX application. The only limitation is that the command must be ‘pipe-safe’ (to be pipe-safe a UNIX command reads its input sequentially, from beginning to end).

You need to define meta data for the data being input to and output from the stage. You also need to define the way in which the data will be input or output. UNIX commands can take their inputs from standard in, or another stream, a file, or from the output of another command via a pipe. Similarly data is output to standard out, or another stream, to a file or to a pipe to be input to another command. You specify what the command expects.

DataStage handles data being input to the Wrapped stage and will present it in the specified form. If you specify a command that expects input on standard in, or another stream, DataStage will present the input data from the job’s data flow as if it were on standard in. Similarly it will intercept data output on standard out, or another stream, and integrate it into the job’s data flow.

You also specify the environment in which the UNIX command will be executed when you define the wrapped stage.


To define a Wrapped stage from the DataStage Manager:

1. Select the Stage Types category in the Repository tree.


2. Choose File ➤ New Parallel Stage ➤ Wrapped from the main menu or New Parallel Stage ➤ Wrapped from the shortcut menu. The Stage Type dialog box appears:

3. Fill in the fields on the General page as follows:

• Stage type name. This is the name that the stage will be known by to DataStage. Avoid using the same name as existing stages or the name of the actual UNIX command you are wrapping.

• Category. The category that the new stage will be stored in under the stage types branch. Type in or browse for an existing category or type in the name of a new one.

• Parallel Stage type. This indicates the type of new Parallel job stage you are defining (Custom, Build, or Wrapped). You cannot change this setting.


• Wrapper Name. The name of the wrapper file DataStage will generate to call the command. By default this will take the same name as the Stage type name.

• Execution mode. Choose the default execution mode. This is the mode that will appear in the Advanced tab on the stage editor. You can override this mode for individual instances of the stage as required, unless you select Parallel only or Sequential only. See “Advanced Tab” on page 3-5 for a description of the execution mode.

• Preserve Partitioning. Choose the default setting of the Preserve Partitioning flag. This is the setting that will appear in the Advanced tab on the stage editor. You can override this setting for individual instances of the stage as required. See “Advanced Tab” on page 3-5 for a description of the preserve partitioning flag.

• Partitioning. Choose the default partitioning method for the stage. This is the method that will appear in the Inputs Page Partitioning tab of the stage editor. You can override this method for individual instances of the stage as required. See “Partitioning Tab” on page 3-11 for a description of the partitioning methods.

• Collecting. Choose the default collection method for the stage. This is the method that will appear in the Inputs Page Partitioning tab of the stage editor. You can override this method for individual instances of the stage as required. See “Partitioning Tab” on page 3-11 for a description of the collection methods.

• Command. The name of the UNIX command to be wrapped, plus any required arguments. The arguments that you enter here are ones that do not change with different invocations of the command. Arguments that need to be specified when the Wrapped stage is included in a job are defined as properties for the stage.

• Short Description. Optionally enter a short description of the stage.

• Long Description. Optionally enter a long description of the stage.

4. Go to the Creator page and optionally specify information about the stage you are creating. We recommend that you assign a release number to the stage so you can keep track of any subsequent changes.

5. Go to the Properties page. This allows you to specify the arguments that the UNIX command requires as properties that appear in the Stage Properties tab. For custom stages the Properties tab always appears under the Stage page.

Fill in the fields as follows:

• Property name. The name of the property that will be displayed on the Properties tab of the stage editor.

• Data type. The data type of the property. Choose from:

– Boolean
– Float
– Integer
– String
– Pathname
– Input Column
– Output Column


If you choose Input Column or Output Column, when the stage is included in a job a drop-down list will offer a choice of the defined input or output columns.

• Prompt. The name of the property that will be displayed on the Properties tab of the stage editor.

• Default Value. The value the option will take if no other is specified.

• Required. Set this to True if the property is mandatory.

6. Go to the Wrapped page. This allows you to specify information about the command to be executed by the stage and how it will be handled.

The Interfaces tab is used to describe the inputs to and outputs from the stage, specifying the interfaces that the stage will need to function.

Details about inputs to the stage are defined on the Inputs sub-tab:


• Link. The link number. This is assigned for you and is read-only. When you actually use your stage, links will be assigned in the order in which you add them. In our example, the first link will be taken as link 0, the second as link 1, and so on. You can reassign the links using the stage editor’s Link Ordering tab on the General page.

• Table Name. The meta data for the link. You define this by loading a table definition from the Repository. Type in the name, or browse for a table definition. Alternatively, you can specify an argument to the UNIX command which specifies a table definition. In this case, when the wrapped stage is used in a job design, the designer will be prompted for an actual table definition to use.

• Stream. Here you can specify whether the UNIX command expects its input on standard in, or another stream, or whether it expects it in a file. Click on the browse button to open the Wrapped Stream dialog box.

In the case of a file, you should also specify whether the file to be read is specified in a command line argument, or by an environment variable.

Details about outputs from the stage are defined on the Outputs sub-tab:


• Link. The link number. This is assigned for you and is read-only. When you actually use your stage, links will be assigned in the order in which you add them. In our example, the first link will be taken as link 0, the second as link 1, and so on. You can reassign the links using the stage editor’s Link Ordering tab on the General page.

• Table Name. The meta data for the link. You define this by loading a table definition from the Repository. Type in the name, or browse for a table definition.

• Stream. Here you can specify whether the UNIX command will write its output to standard out, or another stream, or whether it outputs to a file. Click on the browse button to open the Wrapped Stream dialog box.

In the case of a file, you should also specify whether the file to be written is specified in a command line argument, or by an environment variable.

The Environment tab gives information about the environment in which the command will execute.


Set the following on the Environment tab:


• All Exit Codes Successful. By default DataStage treats an exit code of 0 as successful and all others as errors. Select this check box to specify that all exit codes should be treated as successful other than those specified in the Failure codes grid.

• Exit Codes. The use of this depends on the setting of the All Exit Codes Successful check box.

If All Exit Codes Successful is not selected, enter the codes in the Success Codes grid which will be taken as indicating successful completion. All others will be taken as indicating failure.

If All Exit Codes Successful is selected, enter the exit codes in the Failure Codes grid which will be taken as indicating failure. All others will be taken as indicating success.

• Environment. Specify environment variables and settings that the UNIX command requires in order to run.

7. When you have filled in the details in all the pages, click Generate to generate the stage.

Example Wrapped Stage

This section shows you how to define a Wrapped stage called ex_sort which runs the UNIX sort command in parallel. The stage sorts data in two files and outputs the results to a file. The incoming data has two columns, order number and code. The sort command sorts the data on the second field, code. You can optionally specify that the sort is run in reverse order.

Wrapping the sort command in this way would be useful if you had a situation where you had a fixed sort operation that was likely to be needed in several jobs. Having it as an easily reusable stage would save having to configure a built-in sort stage every time you needed it.

When included in a job and run, the stage will effectively call the sort command as follows:

sort -r -o outfile -k 2 infile1 infile2

The following screen shots show how this stage is defined in DataStage using the Stage Type dialog box:


1. First, general details are supplied on the General tab. The argument defining the second column as the key (-k 2) is included in the command because this does not vary.

2. The reverse order argument (-r) is included as a property because it is optional and may or may not be included when the stage is incorporated into a job.


3. The fact that the sort command expects two files as input is defined on the Input sub-tab on the Interfaces tab of the Wrapped page.

4. The fact that the sort command outputs to a file is defined on the Output sub-tab on the Interfaces tab of the Wrapped page.

Note: When you use the stage in a job, make sure that you use table definitions compatible with the tables defined in the input and output sub-tabs.

5. Because all exit codes other than 0 are treated as errors, and because there are no special environment requirements for this command, you do not need to alter anything on the Environment tab of the Wrapped page. All that remains is to click Generate to build the stage.


50. Managing Data Sets

DataStage parallel extender jobs use data sets to store data being operated on in a persistent form. Data sets are operating system files, each referred to by a descriptor file, usually with the suffix .ds.

You can create and read data sets using the Data Set stage, which is described in Chapter 6. DataStage also provides a utility for managing data sets from outside a job. This utility is available from the DataStage Designer, Manager, and Director clients.

Structure of Data Sets

A data set comprises a descriptor file and a number of other files that are added as the data set grows. These files are stored on multiple disks in your system. A data set is organized in terms of partitions and segments. Each partition of a data set is stored on a single processing node. Each data segment contains all the records written by a single DataStage job. So a segment can contain files from many partitions, and a partition has files from many segments.


The descriptor file for a data set contains the following information:

• Data set header information.

• Creation time and date of the data set.

• The schema of the data set.

• A copy of the configuration file used when the data set was created.

For each segment, the descriptor file contains:

• The time and date the segment was added to the data set.

• A flag marking the segment as valid or invalid.

• Statistical information such as the number of records in the segment and the number of bytes.

• Path names of all data files, on all processing nodes.

This information can be accessed through the Data Set Manager.

[Figure: the structure of a data set, shown as four partitions (Partition 1 through Partition 4) crossed with three segments (Segment 1 through Segment 3), with one or more data files at each intersection.]


Starting the Data Set Manager

To start the Data Set Manager from the DataStage Designer, Manager, or Director:

1. Choose Tools ➤ Data Set Management. A Browse Files dialog box appears.

2. Navigate to the directory containing the data set you want to manage. By convention, data set files have the suffix .ds.

3. Select the data set you want to manage and click OK. The Data Set Viewer appears. From here you can copy or delete the chosen data set. You can also view its schema (column definitions) or the data it contains.

Data Set Viewer

The Data Set Viewer displays information about the data set you are viewing:

Partitions. The partition grid shows the partitions the data set contains and describes their properties:

• #. The partition number.

• Node. The processing node that the partition is currently assigned to.

• Records. The number of records the partition contains.


• Blocks. The number of blocks the partition contains.

• Bytes. The number of bytes the partition contains.

Segments. Click on an individual partition to display the associated segment details. This contains the following information:

• #. The segment number.

• Created. Date and time of creation.

• Bytes. The number of bytes in the segment.

• Pathname. The name and path of the file containing the segment in the selected partition.

Click the Refresh button to reread and refresh all the displayed information.

Click the Output button to view a text version of the information displayed in the Data Set Viewer.

You can open a different data set from the viewer by clicking the Open icon on the tool bar. The browse dialog box opens again and lets you browse for a data set.

Viewing the Schema

Click the Schema icon from the tool bar to view the record schema of the current data set. This is presented in text form in the Record Schema window.


Viewing the Data

Click the Data icon from the tool bar to view the data held by the current data set. This opens the Data Viewer Options dialog box, which allows you to select a subset of the data to view.

• Rows to display. Specify the number of rows of data you want the data browser to display.

• Skip count. Skip the specified number of rows before viewing data.

• Period. Display every Pth record where P is the period. You can start after records have been skipped by using the Skip property. P must be equal to or greater than 1.

• Partitions. Choose between viewing the data in All partitions or the data in the partition selected from the drop-down list.


Click OK to view the selected data. The Data Viewer window appears.

Copying Data Sets

Click the Copy icon on the tool bar to copy the selected data set. The Copy data set dialog box appears, allowing you to specify a path where the new data set will be stored.


The new data set will have the same record schema, number of partitions and contents as the original data set.


Note: You cannot use the UNIX cp command to copy a data set because DataStage represents a single data set with multiple files.

Deleting Data Sets

Click the Delete icon on the tool bar to delete the current data set. You will be asked to confirm the deletion.

Note: You cannot use the UNIX rm command to delete a data set because DataStage represents a single data set with multiple files. Using rm simply removes the descriptor file, leaving the much larger data files behind.


51. DataStage Development Kit (Job Control Interfaces)

DataStage provides a range of methods that enable you to run DataStage server or parallel jobs directly on the server, without using the DataStage Director. The methods are:

• C/C++ API (the DataStage Development kit)

• DataStage BASIC calls

• Command line Interface commands (CLI)

• DataStage macros

These methods can be used in different situations as follows:

• API. Using the API you can build a self-contained program that can run anywhere on your system, provided that it can connect to a DataStage server across the network.

• BASIC. Programs built using the DataStage BASIC interface can be run from any DataStage server on the network. You can use this interface to define jobs that run and control other jobs. The controlling job can be run from the Director client like any other job, or directly on the server machine from the TCL prompt. (Job sequences provide another way of producing control jobs – see DataStage Designer Guide for details.)


• CLI. The CLI can be used from the command line of any DataStage server on the network. Using this method, you can run jobs on other servers too.

• Macros. A set of macros can be used in job designs or in BASIC programs. These are mostly used to retrieve information about other jobs.

DataStage Development Kit

The DataStage Development Kit provides the DataStage API, a C or C++ application programming interface.

This section gives general information about using the DataStage API. Specific information about API functions is in “API Functions” on page 51-5.

A listing for an example program which uses the API is in Appendix A.

The dsapi.h Header File

DataStage API provides a header file that should be included with all API programs. The header file includes prototypes for all DataStage API functions. Their format depends on which tokens you have defined:

• If the __STDC__ or WIN32 tokens are defined, the prototypes are in ANSI C style.

• If the __cplusplus token is defined, the prototypes are in C++ format with the declarations surrounded by:

extern "C" {…}

• Otherwise the prototypes are in Kernighan and Ritchie format.

Data Structures, Result Data, and Threads

DataStage API functions return information about objects as pointers to data items. This is either done directly, or indirectly by setting pointers in the elements of a data structure that is provided by the caller.

Each thread within a calling application is allocated a separate storage area. Each call to a DataStage API routine overwrites any existing contents of this data area with the results of the call, and returns a pointer into the area for the requested data.


For example, the DSGetProjectList function obtains a list of DataStage projects, and the DSGetProjectInfo function obtains a list of jobs within a project. When the DSGetProjectList function is called it retrieves the list of projects, stores it in the thread’s data area, and returns a pointer to this area. If the same thread then calls DSGetProjectInfo, the job list is retrieved and stored in the thread’s data area, overwriting the project list. The job list pointer in the supplied data structure references the thread data area.

This means that if the results of a DataStage API function need to be reused later, the application should make its own copy of the data before making a new DataStage API call. Alternatively, the calls can be used in multiple threads.
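
For example, a program that needs the project list after making further calls should copy it first. This is a sketch only; it treats the start of the data returned by DSGetProjectList as a null-terminated string, which is enough to illustrate the copy-before-reuse rule:

   #include <string.h>
   #include "dsapi.h"

   char *projects = DSGetProjectList(); /* points into this thread's data area */
   char first[256];

   /* Copy what is needed immediately: the next DataStage API call made
      by this thread overwrites the same storage area. */
   strncpy(first, projects, sizeof(first) - 1);
   first[sizeof(first) - 1] = '\0';

   /* ... later calls such as DSGetProjectInfo() are now safe ... */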

DataStage API stores errors for each thread: a call to the DSGetLastError function returns the last error generated within the calling thread.

Writing DataStage API Programs

Your application should use the DataStage API functions in a logical order to ensure that connections are opened and closed correctly, and jobs are run effectively. The following procedure suggests an outline for the program logic to follow, and which functions to use at each step (a skeleton program following this outline appears after the list):

1. If required, set the server name, user name, and password to use for connecting to DataStage (DSSetServerParams).

2. Obtain the list of valid projects (DSGetProjectList).

3. Open a project (DSOpenProject).

4. Obtain a list of jobs (DSGetProjectInfo).

5. Open one or more jobs (DSOpenJob).

6. List the job parameters (DSGetParamInfo).

7. Lock the job (DSLockJob).

8. Set the job’s parameters and limits (DSSetJobLimit, DSSetParam).

9. Start the job running (DSRunJob).

10. Poll for the job or wait for job completion (DSWaitForJob, DSStopJob, DSGetJobInfo).

11. Unlock the job (DSUnlockJob).


12. Display a summary of the job’s log entries (DSFindFirstLogEntry, DSFindNextLogEntry).


13. Display details of specific log events (DSGetNewestLogId, DSGetLogEntry).

14. Examine and display details of job stages (DSGetJobInfo – stage list, DSGetStageInfo).

15. Examine and display details of links within active stages (DSGet-StageInfo – link list, DSGetLinkInfo).

16. Close all open jobs (DSCloseJob).

17. Detach from the project (DSCloseProject).
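
A minimal skeleton following this outline might look like the following. This is a sketch only: most error handling is reduced to early returns, steps 2, 6, 8, and 10 through 15 are abbreviated to comments, the server, project, and job names are placeholders, and the run mode constant DSJ_RUNNORMAL is assumed from the API’s run options:

   #include "dsapi.h"

   int main(void)
   {
       DSPROJECT hProject;
       DSJOB     hJob;

       /* Step 1: identify the server and account to use. */
       DSSetServerParams("myserver", "myuser", "mypassword");

       /* Step 3: open a project (step 2, DSGetProjectList, omitted). */
       hProject = DSOpenProject("MyProject");
       if (!hProject) return 1;

       /* Steps 5 and 7: open a job and lock it. */
       hJob = DSOpenJob(hProject, "MyJob");
       if (!hJob) { DSCloseProject(hProject); return 1; }
       DSLockJob(hJob);

       /* Steps 8-9: set parameters and limits here (DSSetParam,
          DSSetJobLimit), then start the job. */
       DSRunJob(hJob, DSJ_RUNNORMAL);

       /* Steps 10-11: wait for completion, then unlock the job. */
       DSWaitForJob(hJob);
       DSUnlockJob(hJob);

       /* Steps 16-17: close the job and detach from the project. */
       DSCloseJob(hJob);
       DSCloseProject(hProject);
       return 0;
   }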

Building a DataStage API Application

Everything you need to create an application that uses the DataStage API is in a subdirectory called dsdk (DataStage Development Kit) in the Ascential\DataStage installation directory on the server machine.

To build an application that uses the DataStage API:

1. Write the program, including the dsapi.h header file in all source modules that use the DataStage API.

2. Compile the code. Ensure that the WIN32 token is defined. (This happens automatically in the Microsoft Visual C/C++ compiler environment.)

3. Link the application, including vmdsapi.lib, in the list of libraries to be included.

Redistributing Applications

If you intend to run your DataStage API application on a computer where DataStage Server is installed, you do not need to include DataStage API DLLs or libraries as these are installed as part of DataStage Server.

If you want to run the application from a computer used as a DataStage client, you should redistribute the following library with your application:

vmdsapi.dll

If you intend to run the program from a computer that has neither DataStage Server nor any DataStage client installed, in addition to the library mentioned above, you should also redistribute the following:


uvclnt32.dll
unirpc32.dll

You should locate these files where they will be in the search path of any user who uses the application, for example, in the %SystemRoot%\System32 directory.

API Functions

This section details the functions provided in the DataStage API. These functions are described in alphabetical order. The following list briefly describes the functions, categorized by usage:

Accessing projects:

• DSCloseProject. Closes a project that was opened with DSOpenProject.
• DSGetProjectList. Retrieves a list of all projects on the server.
• DSGetProjectInfo. Retrieves a list of jobs in a project.
• DSOpenProject. Opens a project.
• DSSetServerParams. Sets the server name, user name, and password to use for a job.

Accessing jobs:

• DSCloseJob. Closes a job that was opened with DSOpenJob.
• DSGetJobInfo. Retrieves information about a job, such as the date and time of the last run, parameter names, and so on.
• DSLockJob. Locks a job prior to setting job parameters or starting a job run.
• DSOpenJob. Opens a job.
• DSRunJob. Runs a job.
• DSStopJob. Aborts a running job.
• DSUnlockJob. Unlocks a job, enabling other processes to use it.
• DSWaitForJob. Waits until a job has completed.

Accessing job parameters:

• DSGetParamInfo. Retrieves information about a job parameter.
• DSSetJobLimit. Sets row processing and warning limits for a job.
• DSSetParam. Sets job parameter values.

Accessing stages:

• DSGetStageInfo. Retrieves information about a stage within a job.

Accessing links:

• DSGetLinkInfo. Retrieves information about a link of an active stage within a job.

Accessing log entries:

• DSFindFirstLogEntry. Retrieves entries in a log that meet the specified criteria.
• DSFindNextLogEntry. Finds the next log entry that meets the criteria specified in DSFindFirstLogEntry.
• DSGetLogEntry. Retrieves the specified log entry.
• DSGetNewestLogId. Retrieves the newest entry in the log.
• DSLogEvent. Adds a new entry to the log.

Handling errors:

• DSGetLastError. Retrieves the last error code value generated by the calling thread.
• DSGetLastErrorMsg. Retrieves the text of the last reported error.

Page 557: DataStage Parallel Job Developer’s Guide

DSCloseJob

Closes a job that was opened using DSOpenJob.

Syntax

int DSCloseJob(
    DSJOB JobHandle
);

Parameter

JobHandle is the value returned from DSOpenJob.

Return Values

If the function succeeds, the return value is DSJE_NOERROR.

If the function fails, the return value is:

DSJE_BADHANDLE Invalid JobHandle.

Remarks

If the job is locked when DSCloseJob is called, it is unlocked.

If the job is running when DSCloseJob is called, the job is allowed to finish, and the function returns a value of DSJE_NOERROR immediately.



DSCloseProject

Closes a project that was opened using the DSOpenProject function.

Syntax

int DSCloseProject(
    DSPROJECT ProjectHandle
);

Parameter

ProjectHandle is the value returned from DSOpenProject.

Return Value

This function always returns a value of DSJE_NOERROR.

Remarks

Any open jobs in the project are closed, running jobs are allowed to finish, and the function returns immediately.



DSFindFirstLogEntry

Retrieves all the log entries that meet the specified criteria, and writes the first entry to a data structure. Subsequent log entries can then be read using the DSFindNextLogEntry function.

Syntax

int DSFindFirstLogEntry(
    DSJOB JobHandle,
    int EventType,
    time_t StartTime,
    time_t EndTime,
    int MaxNumber,
    DSLOGEVENT *Event
);

Parameters

JobHandle is the value returned from DSOpenJob.

EventType is one of the following keys:

This key…           Retrieves this type of message…
DSJ_LOGINFO         Information
DSJ_LOGWARNING      Warning
DSJ_LOGFATAL        Fatal
DSJ_LOGREJECT       Transformer row rejection
DSJ_LOGSTARTED      Job started
DSJ_LOGRESET        Job reset
DSJ_LOGBATCH        Batch control
DSJ_LOGOTHER        All other log types
DSJ_LOGANY          Any type of event

StartTime limits the returned log events to those that occurred on or after the specified date and time. Set this value to 0 to return the earliest event.



EndTime limits the returned log events to those that occurred before the specified date and time. Set this value to 0 to return all entries up to the most recent.

MaxNumber specifies the maximum number of log entries to retrieve, starting from the latest.

Event is a pointer to a data structure to use to hold the first retrieved log entry.

Return Values

If the function succeeds, the return value is DSJE_NOERROR, and summary details of the first log entry are written to Event.

If the function fails, the return value is one of the following:

Token               Description
DSJE_NOMORE         There are no events matching the filter criteria.
DSJE_NO_MEMORY      Failed to allocate memory for results from server.
DSJE_BADHANDLE      Invalid JobHandle.
DSJE_BADTYPE        Invalid EventType value.
DSJE_BADTIME        Invalid StartTime or EndTime value.
DSJE_BADVALUE       Invalid MaxNumber value.

Remarks

The retrieved log entries are cached for retrieval by subsequent calls to DSFindNextLogEntry. Any cached log entries that are not processed by a call to DSFindNextLogEntry are discarded at the next DSFindFirstLogEntry call (for any job), or when the project is closed.

Note: The log entries are cached by project handle. Multiple threads using the same open project handle must coordinate access to DSFindFirstLogEntry and DSFindNextLogEntry.


DSFindNextLogEntry

Retrieves the next log entry from the cache.

Syntax

int DSFindNextLogEntry(
    DSJOB JobHandle,
    DSLOGEVENT *Event
);

Parameters

JobHandle is the value returned from DSOpenJob.

Event is a pointer to a data structure to use to hold the next log entry.

Return Values

If the function succeeds, the return value is DSJE_NOERROR and summary details of the next available log entry are written to Event.

If the function fails, the return value is one of the following:

Token               Description
DSJE_NOMORE         All events matching the filter criteria have been returned.
DSJE_SERVER_ERROR   Internal error. The DataStage Server returned invalid data.

Remarks

This function retrieves the next log entry from the cache of entries produced by a call to DSFindFirstLogEntry.

Note: The log entries are cached by project handle. Multiple threads using the same open project handle must coordinate access to DSFindFirstLogEntry and DSFindNextLogEntry.
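As an illustration, the following sketch lists up to 20 of the most recent warning messages for a job; hJob is assumed to be a handle returned by DSOpenJob:

#include <stdio.h>
#include "dsapi.h"

/* Print up to 20 of the most recent warning messages for an open job. */
void printRecentWarnings(DSJOB hJob)
{
    DSLOGEVENT event;
    int status = DSFindFirstLogEntry(hJob, DSJ_LOGWARNING, 0, 0, 20, &event);
    while (status == DSJE_NOERROR) {
        printf("%d: %s\n", event.eventId, event.message);
        status = DSFindNextLogEntry(hJob, &event);
    }
    /* The loop ends with DSJE_NOMORE when the cache is exhausted. */
}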



DSGetJobInfo

Retrieves information about the status of a job.

Syntax

int DSGetJobInfo(
    DSJOB JobHandle,
    int InfoType,
    DSJOBINFO *ReturnInfo
);

Parameters

JobHandle is the value returned from DSOpenJob.

InfoType is a key indicating the information to be returned and can have any of the following values:

This key…                  Returns this information…
DSJ_JOBSTATUS              The current status of the job.
DSJ_JOBNAME                The name of the job referenced by JobHandle.
DSJ_JOBCONTROLLER          The name of the job controlling the job referenced by JobHandle.
DSJ_JOBSTARTTIMESTAMP      The date and time when the job started.
DSJ_JOBWAVENO              The wave number of the last or current run.
DSJ_USERSTATUS             The value, if any, set as the user status by the job.
DSJ_STAGELIST              A list of active stages in the job. Separated by nulls.
DSJ_JOBINTERIMSTATUS       The status of a job after it has run all stages and controlled jobs, but before it has attempted to run an after-job subroutine. (Designed to be used by an after-job subroutine to get the status of the current job.)
DSJ_PARAMLIST              A list of job parameter names. Separated by nulls.
DSJ_JOBCONTROL             Whether a stop request has been issued for the job referenced by JobHandle.
DSJ_JOBPID                 Process id of the DSD.RUN process.
DSJ_JOBLASTTIMESTAMP       The date and time when the job last finished.
DSJ_JOBINVOCATIONS         List of job invocation ids. The ids are separated by nulls.
DSJ_JOBINVOCATIONID        Invocation name of the job referenced by JobHandle.

ReturnInfo is a pointer to a DSJOBINFO data structure where the requested information is stored. The DSJOBINFO data structure contains a union with an element for each of the possible return values from the call to DSGetJobInfo. For more information, see “Data Structures” on page 51-46.

Return Values

If the function succeeds, the return value is DSJE_NOERROR.

If the function fails, the return value is one of the following:

Token               Description
DSJE_NOT_AVAILABLE  There are no instances of the requested information in the job.
DSJE_BADHANDLE      Invalid JobHandle.
DSJE_BADTYPE        Invalid InfoType.

Remarks

For controlled jobs, this function can be used either before or after a call to DSRunJob.
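For example, the following sketch reports the finishing status of the last run; hJob is assumed to be a handle returned by DSOpenJob, and the DSJS_* status keys are described under the DSJOBINFO structure:

#include <stdio.h>
#include "dsapi.h"

/* Report whether the last run of the job finished cleanly. */
void reportJobStatus(DSJOB hJob)
{
    DSJOBINFO info;
    if (DSGetJobInfo(hJob, DSJ_JOBSTATUS, &info) != DSJE_NOERROR)
        return;
    if (info.info.jobStatus == DSJS_RUNOK)
        printf("Job finished a normal run with no warnings.\n");
    else
        printf("Job status key: %d\n", info.info.jobStatus);
}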


DSGetLastError

Returns the calling thread’s last error code value.

Syntax

int DSGetLastError(void);

Return Values

The return value is the last error code value. The “Return Values” section of each reference page notes the conditions under which the function sets the last error code.

Remarks

Use DSGetLastError immediately after any function whose return value on failure might contain useful data, otherwise a later, successful function might reset the value back to 0 (DSJE_NOERROR).

Note: Multiple threads do not overwrite each other’s error codes.



DSGetLastErrorMsg

Retrieves the text of the last reported error from the DataStage server.

Syntax

char *DSGetLastErrorMsg(
    DSPROJECT ProjectHandle
);

Parameter

ProjectHandle is either the value returned from DSOpenProject or NULL.

Return Values

The return value is a pointer to a series of null-terminated strings, one for each line of the error message associated with the last error generated by the DataStage Server in response to a DataStage API function call. Use DSGetLastError to determine what the error number is.

The following example shows the buffer contents with <null> repre-senting the terminating NULL character:

line1<null>line2<null>line3<null><null>

The DSGetLastErrorMsg function returns NULL if there is no error message.

Remarks

If ProjectHandle is NULL, this function retrieves the error message associated with the last call to DSOpenProject or DSGetProjectList, otherwise it returns the last message associated with the specified project.

The error text is cleared following a call to DSGetLastErrorMsg.

Note: The text retrieved by a call to DSGetLastErrorMsg relates to the last error generated by the server and not necessarily the last error reported back to a thread using DataStage API. Multiple threads using DataStage API must cooperate in order to obtain the correct error message text.
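The following sketch prints the last error code and each line of the associated message for a project handle; error handling is abbreviated:

#include <stdio.h>
#include <string.h>
#include "dsapi.h"

/* Print the last error code and the full multi-line error text. */
void printLastError(DSPROJECT hProject)
{
    char *line = DSGetLastErrorMsg(hProject);
    printf("Error code: %d\n", DSGetLastError());
    /* The text is a series of null-terminated lines ending with an
       empty string. */
    while (line != NULL && *line != '\0') {
        printf("%s\n", line);
        line += strlen(line) + 1;
    }
}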



DSGetLinkInfo

Retrieves information relating to a specific link of the specified active stage of a job.

Syntax

int DSGetLinkInfo(
    DSJOB JobHandle,
    char *StageName,
    char *LinkName,
    int InfoType,
    DSLINKINFO *ReturnInfo
);

Parameters

JobHandle is the value returned from DSOpenJob.

StageName is a pointer to a null-terminated character string specifying the name of the active stage to be interrogated.

LinkName is a pointer to a null-terminated character string specifying the name of a link (input or output) attached to the stage.

InfoType is a key indicating the information to be returned and is one of the following values:

Value               Description
DSJ_LINKLASTERR     Last error message reported by the link.
DSJ_LINKROWCOUNT    Number of rows that have passed down the link.
DSJ_LINKNAME        Name of the link.
DSJ_LINKSQLSTATE    SQLSTATE value from last error message.
DSJ_LINKDBMSCODE    DBMSCODE value from last error message.

ReturnInfo is a pointer to a DSLINKINFO data structure where the requested information is stored. The DSLINKINFO data structure contains a union with an element for each of the possible return values from the call to DSGetLinkInfo. For more information, see “Data Structures” on page 51-46.

Return Value

If the function succeeds, the return value is DSJE_NOERROR.

If the function fails, the return value is one of the following:

Remarks

This function can be used either before or after a call to DSRunJob.

Token Description

DSJE_NOT_AVAILABLE There is no instance of the requested information available.

DSJE_BADHANDLE JobHandle was invalid.

DSJE_BADTYPE InfoType was unrecognized.

DSJE_BADSTAGE StageName does not refer to a known stage in the job.

DSJE_BADLINK LinkName does not refer to a known link for the stage in question.



DSGetLogEntry

Retrieves detailed information about a specific entry in a job log.

Syntax

int DSGetLogEntry(
    DSJOB JobHandle,
    int EventId,
    DSLOGDETAIL *Event
);

Parameters

JobHandle is the value returned from DSOpenJob.

EventId is the identifier for the event to be retrieved; see “Remarks.”

Event is a pointer to a data structure to hold details of the log entry.

Return Values

If the function succeeds, the return value is DSJE_NOERROR and the event structure contains the details of the requested event.

If the function fails, the return value is one of the following:

Remarks

Entries in the log file are numbered sequentially starting from 0. The latest event ID can be obtained through a call to DSGetNewestLogId. When a log is cleared, there always remains a single entry saying when the log was cleared.

Token Description

DSJE_BADHANDLE Invalid JobHandle.

DSJE_SERVER_ERROR Internal error. DataStage server returned invalid data.

DSJE_BADEVENTID     Invalid event ID for the specified job.



DSGetNewestLogId

Obtains the identifier of the newest entry in the job’s log.

Syntax

int DSGetNewestLogId(
    DSJOB JobHandle,
    int EventType
);

Parameters

JobHandle is the value returned from DSOpenJob.

EventType is a key specifying the type of log entry whose identifier you want to retrieve and can be one of the following:

This key…           Retrieves this type of log entry…
DSJ_LOGINFO         Information
DSJ_LOGWARNING      Warning
DSJ_LOGFATAL        Fatal
DSJ_LOGREJECT       Transformer row rejection
DSJ_LOGSTARTED      Job started
DSJ_LOGRESET        Job reset
DSJ_LOGOTHER        Any other log event type
DSJ_LOGBATCH        Batch control
DSJ_LOGANY          Any type of event

Return Values

If the function succeeds, the return value is the positive identifier of the most recent entry of the requested type in the job log file.

If the function fails, the return value is –1. Use DSGetLastError to retrieve one of the following error codes:

Token               Description
DSJE_BADHANDLE      Invalid JobHandle.
DSJE_BADTYPE        Invalid EventType value.

Remarks

Use this function to determine the ID of the latest entry in a log file before starting a job run. Once the job has started or finished, it is then possible to determine which entries have been added by the job run.
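For example, the following sketch retrieves and prints the most recent fatal-error entry, if there is one; hJob is assumed to be a handle returned by DSOpenJob:

#include <stdio.h>
#include "dsapi.h"

/* Print the newest fatal-error entry in the job log, if any. */
void printNewestFatal(DSJOB hJob)
{
    DSLOGDETAIL detail;
    int id = DSGetNewestLogId(hJob, DSJ_LOGFATAL);
    if (id >= 0 && DSGetLogEntry(hJob, id, &detail) == DSJE_NOERROR)
        printf("Event %d: %s\n", detail.eventId, detail.fullMessage);
}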


DSGetParamInfo

Retrieves information about a particular parameter within a job.

Syntax

int DSGetParamInfo(
    DSJOB JobHandle,
    char *ParamName,
    DSPARAMINFO *ReturnInfo
);

Parameters

JobHandle is the value returned from DSOpenJob.

ParamName is a pointer to a null-terminated string specifying the name of the parameter to be interrogated.

ReturnInfo is a pointer to a DSPARAMINFO data structure where the requested information is stored. For more information, see “Data Structures” on page 51-46.

Return Values

If the function succeeds, the return value is DSJE_NOERROR.

If the function fails, the return value is one of the following:

Token               Description
DSJE_SERVER_ERROR   Internal error. DataStage Server returned invalid data.
DSJE_BADHANDLE      Invalid JobHandle.

Remarks

Unlike the other information retrieval functions, DSGetParamInfo returns all the information relating to the specified item in a single call. The DSPARAMINFO data structure contains all the information required to request a new parameter value from a user and partially validate it. See “Data Structures” on page 51-46.



This function can be used either before or after a DSRunJob call has been issued:

• If called after a successful call to DSRunJob, the information retrieved refers to that run of the job.

• If called before a call to DSRunJob, the information retrieved refers to any previous run of the job, and not to any call to DSSetParam that may have been issued.



DSGetProjectInfo

Obtains a list of jobs in a project.

Syntax

int DSGetProjectInfo(
    DSPROJECT ProjectHandle,
    int InfoType,
    DSPROJECTINFO *ReturnInfo
);

Parameters

ProjectHandle is the value returned from DSOpenProject.

InfoType is a key indicating the information to be returned and can be one of the following:

This key…           Returns this information…
DSJ_JOBLIST         Lists all jobs within the project.
DSJ_PROJECTNAME     Name of current project.
DSJ_HOSTNAME        Host name of the server.

ReturnInfo is a pointer to a DSPROJECTINFO data structure where the requested information is stored.

Return Values

If the function succeeds, the return value is DSJE_NOERROR.

If the function fails, the return value is one of the following:

Token               Description
DSJE_NOT_AVAILABLE  There are no compiled jobs defined within the project.
DSJE_BADTYPE        Invalid InfoType.



Remarks

The DSPROJECTINFO data structure contains a union with an element for each of the possible return values from a call to DSGetProjectInfo.

Note: The returned list contains the names of all jobs known to the project, whether they can be opened or not.



DSGetProjectList

Obtains a list of all projects on the host system.

Syntax

char* DSGetProjectList(void);

Return Values

If the function succeeds, the return value is a pointer to a series of null-terminated strings, one for each project on the host system, ending with a second null character. The following example shows the buffer contents with <null> representing the terminating null character:

project1<null>project2<null><null>

If the function fails, the return value is NULL, and the DSGetLastError function retrieves the following error code:

DSJE_SERVER_ERROR Unexpected/unknown server error occurred.

Remarks

This function can be called before any other DataStage API function.

Note: DSGetProjectList opens, uses, and closes its own communications link with the server, so it may take some time to retrieve the project list.
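The following sketch steps through the returned buffer and prints each project name:

#include <stdio.h>
#include <string.h>
#include "dsapi.h"

/* Print every project name in the double-null-terminated list. */
void printProjects(void)
{
    char *name = DSGetProjectList();
    if (name == NULL)
        return;                     /* use DSGetLastError for the cause */
    while (*name != '\0') {
        printf("%s\n", name);
        name += strlen(name) + 1;   /* step past this name and its null */
    }
}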



DSGetStageInfo

Obtains information about a particular stage within a job.

Syntax

int DSGetStageInfo(
    DSJOB JobHandle,
    char *StageName,
    int InfoType,
    DSSTAGEINFO *ReturnInfo
);

Parameters

JobHandle is the value returned from DSOpenJob.

StageName is a pointer to a null-terminated string specifying the name of the stage to be interrogated.

InfoType is one of the following keys:

This key…                  Returns this information…
DSJ_STAGELASTERR           Last error message reported from any link of the stage.
DSJ_STAGENAME              Stage name.
DSJ_STAGETYPE              Stage type name.
DSJ_STAGEINROWNUM          Primary link’s input row number.
DSJ_LINKLIST               List of names of links in stage.
DSJ_VARLIST                List of stage variable names in the stage.
DSJ_STAGESTARTTIMESTAMP    Date and time when stage started.
DSJ_STAGEENDTIMESTAMP      Date and time when stage finished.

ReturnInfo is a pointer to a DSSTAGEINFO data structure where the requested information is stored. See “Data Structures” on page 51-46.



Return Values

If the function succeeds, the return value is DSJE_NOERROR.

If the function fails, the return value is one of the following:

Remarks

This function can be used either before or after a DSRunJob function has been issued.

The DSSTAGEINFO data structure contains a union with an element for each of the possible return values from the call to DSGetStageInfo.

Token Description

DSJE_NOT_AVAILABLE There are no instances of the requested information in the stage.

DSJE_BADHANDLE Invalid JobHandle.

DSJE_BADSTAGE StageName does not refer to a known stage in the job.

DSJE_BADTYPE Invalid InfoType.



DSLockJob

Locks a job. This function must be called before setting a job’s run parameters or starting a job run.

Syntax

int DSLockJob(
    DSJOB JobHandle
);

Parameter

JobHandle is the value returned from DSOpenJob.

Return Values

If the function succeeds, the return value is DSJE_NOERROR.

If the function fails, the return value is one of the following:

Remarks

Locking a job prevents any other process from modifying the job details or status. This function must be called before any call of DSSetJobLimit, DSSetParam, or DSRunJob.

If you try to lock a job you already have locked, the call succeeds. If you have the same job open on several DataStage API handles, locking the job on one handle locks the job on all the handles.

Token Description

DSJE_BADHANDLE Invalid JobHandle.

DSJE_INUSE Job is locked by another process.



DSLogEvent

Adds a new entry to a job log file.

Syntax

int DSLogEvent(
    DSJOB JobHandle,
    int EventType,
    char *Reserved,
    char *Message
);

Parameters

JobHandle is the value returned from DSOpenJob.

EventType is one of the following keys specifying the type of event to be logged:

This key…           Specifies this type of event…
DSJ_LOGINFO         Information
DSJ_LOGWARNING      Warning

Reserved is reserved for future use, and should be specified as null.

Message points to a null-terminated character string specifying the text of the message to be logged.

Return Values

If the function succeeds, the return value is DSJE_NOERROR.

If the function fails, the return value is one of the following:

Token               Description
DSJE_BADHANDLE      Invalid JobHandle.
DSJE_SERVER_ERROR   Internal error. DataStage Server returned invalid data.
DSJE_BADTYPE        Invalid EventType value.



Remarks

Messages that contain more than one line of text should contain a newline character (\n) to indicate the end of a line.
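For example, the following sketch adds a two-line informational message to the log of an open job; the message text is a placeholder:

#include "dsapi.h"

/* Log a two-line informational message; Reserved is passed as NULL. */
int logStartupMessage(DSJOB hJob)
{
    return DSLogEvent(hJob, DSJ_LOGINFO, NULL,
                      "Nightly load started\nSource: warehouse extract");
}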



DSOpenJob

Opens a job. This function must be called before any other function that manipulates the job.

Syntax

DSJOB DSOpenJob(
    DSPROJECT ProjectHandle,
    char *JobName
);

Parameters

ProjectHandle is the value returned from DSOpenProject.

JobName is a pointer to a null-terminated string that specifies the name of the job that is to be opened. This may be in either of the following formats:

job            Finds the latest version of the job.
job%Reln.n.n   Finds a particular release of the job on a development system.

Return Values

If the function succeeds, the return value is a handle to the job.

If the function fails, the return value is NULL. Use DSGetLastError to retrieve one of the following:

Token               Description
DSJE_OPENFAIL       Server failed to open job.
DSJE_NO_MEMORY      Memory allocation failure.

Remarks

The DSOpenJob function must be used to return a job handle before a job can be addressed by any of the DataStage API functions. You can gain exclusive access to the job by locking it with DSLockJob.



The same job may be opened more than once and each call to DSOpenJob will return a unique job handle. Each handle must be separately closed.



DSOpenProject

Opens a project. It must be called before any other DataStage API function, except DSGetProjectList or DSGetLastError.

Syntax

DSPROJECT DSOpenProject(
    char *ProjectName
);

Parameter

ProjectName is a pointer to a null-terminated string that specifies the name of the project to open.

Return Values

If the function succeeds, the return value is a handle to the project.

If the function fails, the return value is NULL. Use DSGetLastError to retrieve one of the following:

Token                       Description
DSJE_BAD_VERSION            The DataStage server is an older version than the DataStage API.
DSJE_INCOMPATIBLE_SERVER    The DataStage Server is either older or newer than that supported by this version of DataStage API.
DSJE_SERVER_ERROR           Internal error. DataStage Server returned invalid data.
DSJE_BADPROJECT             Invalid project name.
DSJE_NO_DATASTAGE           DataStage is not correctly installed on the server system.
DSJE_NOLICENSE              No DataStage license is available for the project.




Remarks

The DSGetProjectList function can return the name of a project that does not contain valid DataStage jobs, but this is detected when DSOpenProject is called. A process can only have one project open at a time.



DSRunJob

Starts a job run.

Syntax

int DSRunJob(
    DSJOB JobHandle,
    int RunMode
);

Parameters

JobHandle is a value returned from DSOpenJob.

RunMode is a key determining the run mode and should be one of the following values:

This key…           Indicates this action…
DSJ_RUNNORMAL       Start a job run.
DSJ_RUNRESET        Reset the job.
DSJ_RUNVALIDATE     Validate the job.

Return Values

If the function succeeds, the return value is DSJE_NOERROR.

If the function fails, the return value is one of the following:

Token               Description
DSJE_BADHANDLE      Invalid JobHandle.
DSJE_BADSTATE       Job is not in the right state (must be compiled and not running).
DSJE_BADTYPE        RunMode is not recognized.
DSJE_SERVER_ERROR   Internal error. DataStage Server returned invalid data.
DSJE_NOTLOCKED      Job has not been locked.



Remarks

The job specified by JobHandle must be locked, using DSLockJob, before the DSRunJob function is called.

If no limits were set by calling DSSetJobLimit, the default limits are used.



DSSetJobLimit

Sets row or warning limits for a job.

Syntax

int DSSetJobLimit(
    DSJOB JobHandle,
    int LimitType,
    int LimitValue
);

Parameters

JobHandle is a value returned from DSOpenJob.

LimitType is one of the following keys specifying the type of limit:

This key…           Specifies this type of limit…
DSJ_LIMITWARN       Job to be stopped after LimitValue warning events.
DSJ_LIMITROWS       Stages to be limited to LimitValue rows.

LimitValue is the value to set the limit to.

Return Values

If the function succeeds, the return value is DSJE_NOERROR.

If the function fails, the return value is one of the following:

Token               Description
DSJE_BADHANDLE      Invalid JobHandle.
DSJE_BADSTATE       Job is not in the right state (compiled, not running).
DSJE_BADTYPE        LimitType is not the name of a known limiting condition.
DSJE_BADVALUE       LimitValue is not appropriate for the limiting condition type.
DSJE_SERVER_ERROR   Internal error. DataStage Server returned invalid data.
DSJE_NOTLOCKED      Job has not been locked.

Remarks

The job specified by JobHandle must be locked, using DSLockJob, before the DSSetJobLimit function is called.

Any job limits that are not set explicitly before a run will use the default values. Make two calls to DSSetJobLimit in order to set both types of limit.

Set the value to 0 to indicate that there should be no limit for the job.
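For example, the following sketch sets both types of limit with two calls; hJob is assumed to be a job handle that has already been locked with DSLockJob:

#include "dsapi.h"

/* Limit stages to 1000 rows and stop the job after 50 warnings. */
int applyLimits(DSJOB hJob)
{
    int status = DSSetJobLimit(hJob, DSJ_LIMITROWS, 1000);
    if (status == DSJE_NOERROR)
        status = DSSetJobLimit(hJob, DSJ_LIMITWARN, 50);
    return status;
}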


DSSetParam

Sets job parameter values before running a job. Any parameter that is not explicitly set uses the default value.

Syntax

int DSSetParam(
    DSJOB JobHandle,
    char *ParamName,
    DSPARAM *Param
);

Parameters

JobHandle is the value returned from DSOpenJob.

ParamName is a pointer to a null-terminated string that specifies the name of the parameter to set.

Param is a pointer to a structure that specifies the name, type, and value of the parameter to set.

Note: The type specified in Param need not match the type specified for the parameter in the job definition, but it must be possible to convert it. For example, if the job defines the parameter as a string, it can be set by specifying it as an integer. However, it will cause an error with unpredictable results if the parameter is defined in the job as an integer and a nonnumeric string is passed by DSSetParam.

Return Values

If the function succeeds, the return value is DSJE_NOERROR.

If the function fails, the return value is one of the following:

Token               Description
DSJE_BADHANDLE      Invalid JobHandle.
DSJE_BADSTATE       Job is not in the right state (compiled, not running).
DSJE_BADPARAM       Param does not reference a known parameter of the job.
DSJE_BADTYPE        Param does not specify a valid parameter type.
DSJE_BADVALUE       Param does not specify a value that is appropriate for the parameter type as specified in the job definition.
DSJE_SERVER_ERROR   Internal error. DataStage Server returned invalid data.
DSJE_NOTLOCKED      Job has not been locked.

Remarks

The job specified by JobHandle must be locked, using DSLockJob, before the DSSetParam function is called.
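For example, the following sketch sets a string parameter on a locked job; the parameter name SourceDir is a placeholder:

#include "dsapi.h"

/* Set a string-valued job parameter before running the job. */
int setSourceDir(DSJOB hJob, char *dir)
{
    DSPARAM param;
    param.paramType = DSJ_PARAMTYPE_STRING;
    param.paramValue.pString = dir;
    return DSSetParam(hJob, "SourceDir", &param);
}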


DSSetServerParams

Sets the logon parameters to use for opening a project or retrieving a project list.

Syntax

void DSSetServerParams(
    char *ServerName,
    char *UserName,
    char *Password
);

Parameters

ServerName is a pointer to either a null-terminated character string specifying the name of the server to connect to, or NULL.

UserName is a pointer to either a null-terminated character string specifying the user name to use for the server session, or NULL.

Password is a pointer to either a null-terminated character string specifying the password for the user specified in UserName, or NULL.

Return Values

This function has no return value.

Remarks

By default, DSOpenProject and DSGetProjectList attempt to connect to a DataStage Server on the same computer as the client process, then create a server process that runs with the same user identification and access rights as the client process. DSSetServerParams overrides this behavior and allows you to specify a different server, user name, and password.

Calls to DSSetServerParams are not cumulative. All parameter values, including NULL pointers, are used to set the parameters to be used on the subsequent DSOpenProject or DSGetProjectList call.



DSStopJob

Aborts a running job.

Syntax

int DSStopJob(
    DSJOB JobHandle
);

Parameter

JobHandle is the value returned from DSOpenJob.

Return Values

If the function succeeds, the return value is DSJE_NOERROR.

If the function fails, the return value is:

DSJE_BADHANDLE Invalid JobHandle.

Remarks

The DSStopJob function should be used only after a DSRunJob function has been issued. The stop request is sent regardless of the job’s current status. To ascertain if the job has stopped, use the DSWaitForJob function or the DSJobStatus macro.



DSUnlockJob

Unlocks a job, preventing any further manipulation of the job’s run state and freeing it for other processes to use.

Syntax

int DSUnlockJob(
    DSJOB JobHandle
);

Parameter

JobHandle is the value returned from DSOpenJob.

Return Values

If the function succeeds, the return value is DSJE_NOERROR.

If the function fails, the return value is:

DSJE_BADHANDLE Invalid JobHandle.

Remarks

The DSUnlockJob function returns immediately without waiting for the job to finish. Attempting to unlock a job that is not locked does not cause an error. If you have the same job open on several handles, unlocking the job on one handle unlocks it on all handles.



DSWaitForJob

Waits for the completion of a job run.

Syntax

int DSWaitForJob(
    DSJOB JobHandle
);

Parameter

JobHandle is the value returned from DSOpenJob.

Return Values

If the function succeeds, the return value is DSJE_NOERROR.

If the function fails, the return value is one of the following:

Remarks

This function is only valid if the current job has issued a DSRunJob call on the given JobHandle. It returns when the job started by that call has finished. The finishing status can be found by calling DSGetJobInfo.

Token Description

DSJE_BADHANDLE Invalid JobHandle.

DSJE_WRONGJOB Job for this JobHandle was not started from a call to DSRunJob by the current process.

DSJE_TIMEOUT Job appears not to have started after waiting for a reasonable length of time. (About 30 minutes.)




Data Structures

The DataStage API uses the data structures described in this section to hold data passed to, or returned from, functions. (See “Data Structures, Result Data, and Threads” on page 51-2.) The data structures are summarized below, with full descriptions in the following sections:

DSJOBINFO       Information about a DataStage job. Used by DSGetJobInfo.
DSLINKINFO      Information about a link to or from an active stage in a job, that is, a stage that is not a data source or destination. Used by DSGetLinkInfo.
DSLOGDETAIL     Full details of an entry in a job log file. Used by DSGetLogEntry.
DSLOGEVENT      Details of an entry in a job log file. Used by DSLogEvent, DSFindFirstLogEntry, and DSFindNextLogEntry.
DSPARAM         The type and value of a job parameter. Used by DSSetParam.
DSPARAMINFO     Further information about a job parameter, such as its default value and a description. Used by DSGetParamInfo.
DSPROJECTINFO   A list of jobs in the project. Used by DSGetProjectInfo.
DSSTAGEINFO     Information about an active stage in a job. Used by DSGetStageInfo.



DSJOBINFO

The DSJOBINFO structure represents information values about a DataStage job.

Syntax

typedef struct _DSJOBINFO {
    int infoType;
    union {
        int jobStatus;
        char *jobController;
        time_t jobStartTime;
        int jobWaveNumber;
        char *userStatus;
        char *paramList;
        char *stageList;
        char *jobname;
        int jobcontrol;
        int jobPid;
        time_t jobLastTime;
        char *jobInvocations;
        int jobInterimStatus;
        char *jobInvocationid;
    } info;
} DSJOBINFO;

Members

infoType is one of the following keys indicating the type of information:

This key…                  Indicates this information…
DSJ_JOBSTATUS              The current status of the job.
DSJ_JOBNAME                Name of job referenced by JobHandle.
DSJ_JOBCONTROLLER          The name of the controlling job.
DSJ_JOBSTARTTIMESTAMP      The date and time when the job started.
DSJ_JOBWAVENO              Wave number of the current (or last) job run.
DSJ_USERSTATUS             The status reported by the job itself as defined in the job’s design.
DSJ_PARAMLIST              A list of the names of the job’s parameters. Separated by nulls.
DSJ_STAGELIST              A list of stages in the job. Separated by nulls.
DSJ_JOBCONTROL             Whether a stop request has been issued for the job.
DSJ_JOBPID                 Process id of the DSD.RUN process.
DSJ_JOBLASTTIMESTAMP       The date and time on the server when the job last finished.
DSJ_JOBINVOCATIONS         List of job invocation ids. Separated by nulls.
DSJ_JOBINTERIMSTATUS       Current interim status of the job.
DSJ_JOBINVOCATIONID        Invocation name of the job referenced.

jobStatus is returned when infoType is set to DSJ_JOBSTATUS. Its value is one of the following keys:

This key…           Indicates this status…
DSJS_RUNNING        Job running.
DSJS_RUNOK          Job finished a normal run with no warnings.
DSJS_RUNWARN        Job finished a normal run with warnings.
DSJS_RUNFAILED      Job finished a normal run with a fatal error.
DSJS_VALOK          Job finished a validation run with no warnings.
DSJS_VALWARN        Job finished a validation run with warnings.
DSJS_VALFAILED      Job failed a validation run.
DSJS_RESET          Job finished a reset run.
DSJS_CRASHED        Job was stopped by some indeterminate action.
DSJS_STOPPED        Job was stopped by operator intervention (can’t tell run type).
DSJS_NOTRUNNABLE    Job has not been compiled.
DSJS_NOTRUNNING     Any other status.

jobController is the name of the job controlling the job reference and is returned when infoType is set to DSJ_JOBCONTROLLER. Note that this may be several job names, separated by periods, if the job is controlled by a job which is itself controlled, and so on.

jobStartTime is the date and time when the last or current job run started and is returned when infoType is set to DSJ_JOBSTARTTIMESTAMP.

jobWaveNumber is the wave number of the last or current job run and is returned when infoType is set to DSJ_JOBWAVENO.

userStatus is the value, if any, set by the job as its user defined status, and is returned when infoType is set to DSJ_USERSTATUS.

paramList is a pointer to a buffer that contains a series of null-terminated strings, one for each job parameter name, that ends with a second null character. It is returned when infoType is set to DSJ_PARAMLIST. The following example shows the buffer contents with <null> representing the terminating null character:

first<null>second<null><null>

stageList is a pointer to a buffer that contains a series of null-terminated strings, one for each stage in the job, that ends with a second null character. It is returned when infoType is set to DSJ_STAGELIST. The following example shows the buffer contents with <null> representing the terminating null character:

first<null>second<null><null>


DSLINKINFO

The DSLINKINFO structure represents various information values about a link to or from an active stage within a DataStage job.

Syntax

typedef struct _DSLINKINFO {
    int infoType;
    union {
        DSLOGDETAIL lastError;
        int rowCount;
        char *linkName;
        char *linkSQLState;
        char *linkDBMSCode;
    } info;
} DSLINKINFO;

Members

infoType is a key indicating the type of information and is one of the following values:

This key…           Indicates this information…
DSJ_LINKLASTERR     The last error message reported from a link.
DSJ_LINKNAME        Actual name of link.
DSJ_LINKROWCOUNT    The number of rows that have been passed down a link.
DSJ_LINKSQLSTATE    SQLSTATE value from last error message.
DSJ_LINKDBMSCODE    DBMSCODE value from last error message.

lastError is a data structure containing the error log entry for the last error message reported from a link and is returned when infoType is set to DSJ_LINKLASTERR.

rowCount is the number of rows that have been passed down a link so far and is returned when infoType is set to DSJ_LINKROWCOUNT.


DSLOGDETAIL

The DSLOGDETAIL structure represents detailed information for a single entry from a job log file.

Syntax

typedef struct _DSLOGDETAIL {
    int eventId;
    time_t timestamp;
    int type;
    char *reserved;
    char *fullMessage;
} DSLOGDETAIL;

Members

eventId is a number, 0 or greater, that uniquely identifies the log entry for the job.

timestamp is the date and time at which the entry was added to the job log file.

type is a key indicating the type of the event, and is one of the following values:

This key… Indicates this type of log entry…

DSJ_LOGINFO Information

DSJ_LOGWARNING Warning

DSJ_LOGFATAL Fatal error

DSJ_LOGREJECT Transformer row rejection

DSJ_LOGSTARTED Job started

DSJ_LOGRESET Job reset

DSJ_LOGBATCH Batch control

DSJ_LOGOTHER Any other type of log entry


reserved is reserved for future use with a later release of DataStage.

fullMessage is the full description of the log entry.


DSLOGEVENT

The DSLOGEVENT structure represents the summary information for a single entry from a job’s event log.

Syntax

typedef struct _DSLOGEVENT {
    int eventId;
    time_t timestamp;
    int type;
    char *message;
} DSLOGEVENT;

Members

eventId is a number, 0 or greater, that uniquely identifies the log entry for the job.

timestamp is the date and time at which the entry was added to the job log file.

type is a key indicating the type of the event, and is one of the following values:

message is the first line of the description of the log entry.

This key… Indicates this type of log entry…

DSJ_LOGINFO Information

DSJ_LOGWARNING Warning

DSJ_LOGFATAL Fatal error

DSJ_LOGREJECT Transformer row rejection

DSJ_LOGSTARTED Job started

DSJ_LOGRESET Job reset

DSJ_LOGBATCH Batch control

DSJ_LOGOTHER Any other type of log entry



DSPARAM

The DSPARAM structure represents information about the type and value of a DataStage job parameter.

Syntax

typedef struct _DSPARAM {
    int paramType;
    union {
        char *pString;
        char *pEncrypt;
        int pInt;
        float pFloat;
        char *pPath;
        char *pListValue;
        char *pDate;
        char *pTime;
    } paramValue;
} DSPARAM;

Members

paramType is a key specifying the type of the job parameter. Possible values are as follows:

This key…                    Indicates this type of parameter…
DSJ_PARAMTYPE_STRING         A character string.
DSJ_PARAMTYPE_ENCRYPTED      An encrypted character string (for example, a password).
DSJ_PARAMTYPE_INTEGER        An integer.
DSJ_PARAMTYPE_FLOAT          A floating-point number.
DSJ_PARAMTYPE_PATHNAME       A file system pathname.
DSJ_PARAMTYPE_LIST           A character string specifying one of the values from an enumerated list.
DSJ_PARAMTYPE_DATE           A date in the format YYYY-MM-DD.
DSJ_PARAMTYPE_TIME           A time in the format HH:MM:SS.

pString is a null-terminated character string that is returned when paramType is set to DSJ_PARAMTYPE_STRING.

pEncrypt is a null-terminated character string that is returned when paramType is set to DSJ_PARAMTYPE_ENCRYPTED. The string should be in plain text form when passed to or from DataStage API where it is encrypted. The application using the DataStage API should present this type of parameter in a suitable display format, for example, an asterisk for each character of the string rather than the character itself.

pInt is an integer and is returned when paramType is set to DSJ_PARAMTYPE_INTEGER.

pFloat is a floating-point number and is returned when paramType is set to DSJ_PARAMTYPE_FLOAT.

pPath is a null-terminated character string specifying a file system pathname and is returned when paramType is set to DSJ_PARAMTYPE_PATHNAME.

Note: This parameter does not need to specify a valid pathname on the server. Interpretation and validation of the pathname is performed by the job.

pListValue is a null-terminated character string specifying one of the possible values from an enumerated list and is returned when paramType is set to DSJ_PARAMTYPE_LIST.

pDate is a null-terminated character string specifying a date in the format YYYY-MM-DD and is returned when paramType is set to DSJ_PARAMTYPE_DATE.

pTime is a null-terminated character string specifying a time in the format HH:MM:SS and is returned when paramType is set to DSJ_PARAMTYPE_TIME.


DSPARAMINFO

The DSPARAMINFO structure represents information values about a parameter of a DataStage job.

Syntax

typedef struct _DSPARAMINFO {
    DSPARAM defaultValue;
    char *helpText;
    char *paramPrompt;
    int paramType;
    DSPARAM desDefaultValue;
    char *listValues;
    char *desListValues;
    int promptAtRun;
} DSPARAMINFO;

Members

defaultValue is the default value, if any, for the parameter.

helpText is a description, if any, for the parameter.

paramPrompt is the prompt, if any, for the parameter.

paramType is a key specifying the type of the job parameter. Possible values are as follows:

This key…                    Indicates this type of parameter…
DSJ_PARAMTYPE_STRING         A character string.
DSJ_PARAMTYPE_ENCRYPTED      An encrypted character string (for example, a password).
DSJ_PARAMTYPE_INTEGER        An integer.
DSJ_PARAMTYPE_FLOAT          A floating-point number.
DSJ_PARAMTYPE_PATHNAME       A file system pathname.
DSJ_PARAMTYPE_LIST           A character string specifying one of the values from an enumerated list.
DSJ_PARAMTYPE_DATE           A date in the format YYYY-MM-DD.
DSJ_PARAMTYPE_TIME           A time in the format HH:MM:SS.

desDefaultValue is the default value set for the parameter by the job’s designer.

Note: Default values can be changed by the DataStage administrator, so a value may not be the current value for the job.

listValues is a pointer to a buffer that receives a series of null-terminated strings, one for each valid string that can be used as the parameter value, ending with a second null character as shown in the following example (<null> represents the terminating null character):

first<null>second<null><null>

desListValues is a pointer to a buffer containing the default list of values set for the parameter by the job’s designer. The buffer contains a series of null-terminated strings, one for each valid string that can be used as the parameter value, that ends with a second null character. The following example shows the buffer contents with <null> representing the terminating null character:

first<null>second<null><null>

Note: Default values can be changed by the DataStage administrator, so a value may not be the current value for the job.

promptAtRun is either 0 (False) or 1 (True). 1 indicates that the operator is prompted for a value for this parameter whenever the job is run; 0 indicates that there is no prompting.


DSPROJECTINFO

The DSPROJECTINFO structure represents information values for a DataStage project.

Syntax

typedef struct _DSPROJECTINFO {
    int infoType;
    union {
        char *jobList;
    } info;
} DSPROJECTINFO;

Members

infoType is a key value indicating the type of information to retrieve. Possible values are as follows:

This key…           Indicates this information…
DSJ_JOBLIST         List of jobs in project.
DSJ_PROJECTNAME     Name of current project.
DSJ_HOSTNAME        Host name of the server.

jobList is a pointer to a buffer that contains a series of null-terminated strings, one for each job in the project, and ending with a second null character, as shown in the following example (<null> represents the terminating null character):

first<null>second<null><null>



DSSTAGEINFO

The DSSTAGEINFO structure represents various information values about an active stage within a DataStage job.

Syntax

typedef struct _DSSTAGEINFO {
    int infoType;
    union {
        DSLOGDETAIL lastError;
        char *typeName;
        int inRowNum;
        char *linkList;
        char *stagename;
        char *varlist;
        char *stageStartTime;
        char *stageEndTime;
    } info;
} DSSTAGEINFO;

Members

infoType is a key indicating the information to be returned and is one of the following:

This key…                  Indicates this information…
DSJ_STAGELASTERR           The last error message generated from any link in the stage.
DSJ_STAGENAME              Name of stage.
DSJ_STAGETYPE              The stage type name, for example, Transformer or BeforeJob.
DSJ_STAGEINROWNUM          The primary link’s input row number.
DSJ_LINKLIST               A list of link names.
DSJ_VARLIST                List of stage variable names.
DSJ_STAGESTARTTIMESTAMP    Date and time when stage started.
DSJ_STAGEENDTIMESTAMP      Date and time when stage finished.


lastError is a data structure containing the error message for the last error (if any) reported from any link of the stage. It is returned when infoType is set to DSJ_STAGELASTERR.

typeName is the stage type name and is returned when infoType is set to DSJ_STAGETYPE.

inRowNum is the primary link’s input row number and is returned when infoType is set to DSJ_STAGEINROWNUM.

linkList is a pointer to a buffer that contains a series of null-terminated strings, one for each link in the stage, ending with a second null character, as shown in the following example (<null> represents the terminating null character):

first<null>second<null><null>

Error Codes

The following table lists DataStage API error codes in alphabetical order:

Error Token                  Code    Description
DSJE_BADHANDLE               –1      Invalid JobHandle.
DSJE_BADLINK                 –9      LinkName does not refer to a known link for the stage in question.
DSJE_BADNAME                 –12     Invalid project name.
DSJE_BADPARAM                –3      ParamName is not a parameter name in the job.
DSJE_BADPROJECT              –1002   ProjectName is not a known DataStage project.
DSJE_BADSTAGE                –7      StageName does not refer to a known stage in the job.
DSJE_BADSTATE                –2      Job is not in the right state (compiled, not running).
DSJE_BADTIME                 –13     Invalid StartTime or EndTime value.
DSJE_BADTYPE                 –5      Information or event type was unrecognized.
DSJE_BAD_VERSION             –1008   The DataStage server does not support this version of the DataStage API.
DSJE_BADVALUE                –4      Invalid MaxNumber value.
DSJE_DECRYPTERR              –15     Failed to decrypt encrypted values.
DSJE_INCOMPATIBLE_SERVER     –1009   The server version is incompatible with this version of the DataStage API.
DSJE_JOBDELETED              –11     The job has been deleted.
DSJE_JOBLOCKED               –10     The job is locked by another process.
DSJE_NOACCESS                –16     Cannot get values, default values or design default values for any job except the current job.
DSJE_NO_DATASTAGE            –1003   DataStage is not installed on the server system.
DSJE_NOERROR                 0       No DataStage API error has occurred.
DSJE_NO_MEMORY               –1005   Failed to allocate dynamic memory.
DSJE_NOMORE                  –1001   All events matching the filter criteria have been returned.
DSJE_NOT_AVAILABLE           –1007   The requested information was not found.
DSJE_NOTINSTAGE              –8      Internal server error.
DSJE_OPENFAIL                –1004   The attempt to open the job failed – perhaps it has not been compiled.
DSJE_REPERROR                –99     General server error.
DSJE_SERVER_ERROR            –1006   An unexpected or unknown error occurred in the DataStage server engine.
DSJE_TIMEOUT                 –14     The job appears not to have started after waiting for a reasonable length of time. (About 30 minutes.)
DSJE_WRONGJOB                –6      Job for this JobHandle was not started from a call to DSRunJob by the current process.

The following table lists DataStage API error codes in numerical order:

Code    Error Token                  Description
0       DSJE_NOERROR                 No DataStage API error has occurred.
–1      DSJE_BADHANDLE               Invalid JobHandle.
–2      DSJE_BADSTATE                Job is not in the right state (compiled, not running).
–3      DSJE_BADPARAM                ParamName is not a parameter name in the job.
–4      DSJE_BADVALUE                Invalid MaxNumber value.
–5      DSJE_BADTYPE                 Information or event type was unrecognized.
–6      DSJE_WRONGJOB                Job for this JobHandle was not started from a call to DSRunJob by the current process.
–7      DSJE_BADSTAGE                StageName does not refer to a known stage in the job.
–8      DSJE_NOTINSTAGE              Internal server error.
–9      DSJE_BADLINK                 LinkName does not refer to a known link for the stage in question.
–10     DSJE_JOBLOCKED               The job is locked by another process.
–11     DSJE_JOBDELETED              The job has been deleted.
–12     DSJE_BADNAME                 Invalid project name.
–13     DSJE_BADTIME                 Invalid StartTime or EndTime value.
–14     DSJE_TIMEOUT                 The job appears not to have started after waiting for a reasonable length of time. (About 30 minutes.)
–15     DSJE_DECRYPTERR              Failed to decrypt encrypted values.
–16     DSJE_NOACCESS                Cannot get values, default values or design default values for any job except the current job.
–99     DSJE_REPERROR                General server error.
–1001   DSJE_NOMORE                  All events matching the filter criteria have been returned.
–1002   DSJE_BADPROJECT              ProjectName is not a known DataStage project.
–1003   DSJE_NO_DATASTAGE            DataStage is not installed on the server system.
–1004   DSJE_OPENFAIL                The attempt to open the job failed – perhaps it has not been compiled.
–1005   DSJE_NO_MEMORY               Failed to allocate dynamic memory.
–1006   DSJE_SERVER_ERROR            An unexpected or unknown error occurred in the DataStage server engine.
–1007   DSJE_NOT_AVAILABLE           The requested information was not found.
–1008   DSJE_BAD_VERSION             The DataStage server does not support this version of the DataStage API.
–1009   DSJE_INCOMPATIBLE_SERVER     The server version is incompatible with this version of the DataStage API.

The following table lists some common errors that may be returned from the lower-level communication layers:

Error Number    Description
39121           The DataStage server license has expired.
39134           The DataStage server user limit has been reached.
80011           Incorrect system name or invalid user name or password provided.
80019           Password has expired.


DataStage BASIC InterfaceThese functions can be used in a job control routine, which is defined as part of a job’s properties and allows other jobs to be run and be controlled from the first job. Some of the functions can also be used for getting status information on the current job; these are useful in active stage expressions and before- and after-stage subroutines.

These functions are also described in Chapter 13,“BASIC Programming.”

To do this… Use this…

Specify the job you want to control

DSAttachJob, page 51-66

Set parameters for the job you want to control

DSSetParam, page 51-98

Set limits for the job you want to control

DSSetJobLimit, page 51-97

Request that a job is run DSRunJob, page 51-93

Wait for a called job to finish DSWaitForJob, page 51-105

Get information about the cur-rent project

DSGetProjectInfo, page 51-82

Get information about the con-trolled job or current job

DSGetJobInfo, page 51-70

Get information about a stage in the controlled job or current job

DSGetStageInfo, page 51-83

Get information about a link in a controlled job or current job

DSGetLinkInfo, page 51-73

Get information about a con-trolled job’s parameters

DSGetParamInfo, page 51-79

Get the log event from the job log

DSGetLogEntry, page 51-75

Get a number of log events on the specified subject from the job log

DSGetLogSummary, page 51-76

DataStage Development Kit (Job Control Interfaces) 51-63

Get the newest log event, of a specified type, from the job log

DSGetNewestLogId, page 51-78

Page 614: DataStage Parallel Job Developer’s Guide

Log an event to the job log of a different job

DSLogEvent, page 51-85

Stop a controlled job DSStopJob, page 51-100

Return a job handle previously obtained from DSAttachJob

DSDetachJob, page 51-68

Log a fatal error message in a job’s log file and aborts the job.

DSLogFatal, page 19-86

Log an information message in a job’s log file.

DSLogInfo, page 19-87

Put an info message in the job log of a job controlling current job.

DSLogToController, page 19-88

Log a warning message in a job’s log file.

DSLogWarn, page 19-89

Generate a string describing the complete status of a valid attached job.

DSMakeJobReport, page 19-90

Insert arguments into the message template.

DSMakeMsg, page 19-91

Ensure a job is in the correct state to be run or validated.

DSPrepareJob, page 19-92

Interface to system send mail facility.

DSSendMail, page 19-95

Log a warning message to a job log file.

DSTransformError, page 19-101

Convert a job control status or error code into an explanatory text message.

DSTranslateCode, page 19-102

Suspend a job until a named file either exists or does not exist.

DSWaitForFile, page 19-103

Checks if a BASIC routine is cataloged, either in VOC as a callable item, or in the catalog space.

DSCheckRoutine, page 19-67

Execute a DOS or DataStage DSExecute, page 19-69

To do this… Use this…

51-64 Ascential DataStage Parallel Job Developer’s Guide

Engine command from a befor/after subroutine.

Page 615: DataStage Parallel Job Developer’s Guide

Set a status message for a job to return as a termination message when it finishes

DSSetUserStatus, page 51-99

To do this… Use this…

DataStage Development Kit (Job Control Interfaces) 51-65
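As an illustration of how these functions fit together, the following sketch of a job control routine (the job, parameter, and routine names are illustrative) attaches a job, sets a parameter and a warning limit, runs it, waits for it to finish, and checks the result:

* Attach to the job and prepare the run.
hJob = DSAttachJob("Qsales", DSJ.ERRFATAL)
ErrCode = DSSetParam(hJob, "quarter", "1")
ErrCode = DSSetJobLimit(hJob, DSJ.LIMITWARN, 10)
* Start the run and wait for it to complete.
ErrCode = DSRunJob(hJob, DSJ.RUNNORMAL)
ErrCode = DSWaitForJob(hJob)
* Check how the run finished before detaching.
Status = DSGetJobInfo(hJob, DSJ.JOBSTATUS)
If Status = DSJS.RUNFAILED Then
   Call DSLogWarn("Qsales finished with a fatal error", "JobControl")
End
ErrCode = DSDetachJob(hJob)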


DSAttachJob Function

Attaches to a job in order to run it in a job control sequence. A handle is returned which is used for addressing the job. There can only be one handle open for a particular job at any one time.

Syntax

JobHandle = DSAttachJob (JobName, ErrorMode)

JobHandle is the name of a variable to hold the return value which is subsequently used by any other function or routine when referring to the job. Do not assume that this value is an integer.

JobName is a string giving the name of the job to be attached to.

ErrorMode is a value specifying how other routines using the handle should report errors. It is one of:

DSJ.ERRFATAL     Log a fatal message and abort the controlling job (default).

DSJ.ERRWARNING   Log a warning message but carry on.

DSJ.ERRNONE      No message logged – caller takes full responsibility (failure of DSAttachJob itself will be logged, however).

Remarks

A job cannot attach to itself.

The JobName parameter can specify either an exact version of the job in the form job%Reln.n.n, or the latest version of the job in the form job. If a controlling job is itself released, you will get the latest released version of job. If the controlling job is a development version, you will get the latest development version of job.

Example

This is an example of attaching to Release 11 of the job Qsales:

Qsales_handle = DSAttachJob("Qsales%Rel1", DSJ.ERRWARN)


DSCheckRoutine Function

Checks if a BASIC routine is cataloged, either in the VOC as a callable item, or in the catalog space.

Syntax

Found = DSCheckRoutine(RoutineName)

RoutineName is the name of the BASIC routine to check.

Found is a Boolean: @False if RoutineName cannot be found, otherwise @True.

Example

rtn$ok = DSCheckRoutine("DSU.DSSendMail")
If (NOT(rtn$ok)) Then
   * error handling here
End



DSDetachJob Function

Gives back a JobHandle acquired by DSAttachJob if no further control of a job is required (allowing another job to become its controller). It is not necessary to call this function; any attached jobs are detached automatically when the controlling job finishes.

Syntax

ErrCode = DSDetachJob (JobHandle)

JobHandle is the handle for the job as derived from DSAttachJob.

ErrCode is 0 if DSDetachJob is successful, otherwise it may be the following:

DSJE.BADHANDLE   Invalid JobHandle.

The only possible error is an attempt to close DSJ.ME. Otherwise, the call always succeeds.

Example

The following command detaches the handle for the job qsales:

Deterr = DSDetachJob (qsales_handle)



DSExecute Subroutine

Executes a DOS or DataStage Engine command from a before/after subroutine.

Syntax

Call DSExecute (ShellType, Command, Output, SystemReturnCode)

ShellType (input) specifies the type of command you want to execute and is either NT or UV (for DataStage Engine).

Command (input) is the command to execute. Command should not prompt for input when it is executed.

Output (output) is any output from the command. Each line of output is separated by a field mark, @FM. Output is added to the job log file as an information message.

SystemReturnCode (output) is a code indicating the success of the command. A value of 0 means the command executed successfully. A value of 1 (for a DOS command) indicates that the command was not found. Any other value is a specific exit code from the command.

Remarks

Do not use DSExecute from a transform; the overhead of running a command for each row processed by a stage will degrade performance of the job.
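The original page gives no example for this routine; as a minimal sketch (the command and message text here are illustrative), a before/after subroutine might run a DataStage Engine command and check the return code:

Call DSExecute("UV", "COUNT VOC", Output, SysRet)
If SysRet <> 0 Then
   Call DSLogWarn("Command failed, return code ":SysRet, "MyBeforeSub")
End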



DSGetJobInfo Function

Provides a method of obtaining information about a job, which can be used generally as well as for job control. It can refer to the current job or a controlled job, depending on the value of JobHandle.

Syntax

Result = DSGetJobInfo (JobHandle, InfoType)

JobHandle is the handle for the job as derived from DSAttachJob, or it may be DSJ.ME to refer to the current job.

InfoType specifies the information required and can be one of:

DSJ.JOBCONTROLLER

DSJ.JOBINVOCATIONS

DSJ.JOBINVOCATIONID

DSJ.JOBNAME

DSJ.JOBSTARTTIMESTAMP

DSJ.JOBSTATUS

DSJ.JOBWAVENO

DSJ.PARAMLIST

DSJ.STAGELIST

DSJ.USERSTATUS

DSJ.JOBINTERIMSTATUS

Result depends on the specified InfoType, as follows:

• DSJ.JOBSTATUS Integer. Current status of job overall. Possible statuses that can be returned are currently divided into two categories:

Firstly, a job that is in progress is identified by:

DSJS.RUNNING     Job running – this is the only status that means the job is actually running.

Secondly, jobs that are not running may have the following statuses:

DSJS.RUNOK       Job finished a normal run with no warnings.

DSJS.RUNWARN     Job finished a normal run with warnings.

DSJS.RUNFAILED   Job finished a normal run with a fatal error.

DSJS.VALOK       Job finished a validation run with no warnings.

DSJS.VALWARN     Job finished a validation run with warnings.

DSJS.VALFAILED   Job failed a validation run.

DSJS.RESET       Job finished a reset run.

DSJS.STOPPED     Job was stopped by operator intervention (cannot tell run type).

• DSJ.JOBCONTROLLER String. Name of the job controlling the job referenced by the job handle. Note that this may be several job names separated by periods if the job is controlled by a job which is itself controlled, etc.

• DSJ.JOBINVOCATIONS. Returns a comma-separated list of Invocation IDs.

• DSJ.JOBINVOCATIONID. Returns the invocation ID of the specified job (used in the DSJobInvocationId macro in a job design to access the invocation ID by which the job is invoked).

• DSJ.JOBNAME String. Actual name of the job referenced by the job handle.

• DSJ.JOBSTARTTIMESTAMP String. Date and time when the job started on the server in the form YYYY-MM-DD HH:MM:SS.

• DSJ.JOBWAVENO Integer. Wave number of last or current run.

• DSJ.PARAMLIST. Returns a comma-separated list of parameter names.

• DSJ.STAGELIST. Returns a comma-separated list of active stage names.

• DSJ.USERSTATUS String. Whatever the job's last call of DSSetUserStatus recorded, else the empty string.

• DSJ.JOBINTERIMSTATUS. Returns the status of a job after it has run all stages and controlled jobs, but before it has attempted to run an after-job subroutine. (Designed to be used by an after-job subroutine to get the status of the current job).

Result may also return error conditions as follows:

DSJE.BADHANDLE JobHandle was invalid.

DSJE.BADTYPE InfoType was unrecognized.

Remarks

When referring to a controlled job, DSGetJobInfo can be used either before or after a DSRunJob has been issued. Any status returned following a successful call to DSRunJob is guaranteed to relate to that run of the job.

Examples

The following command requests the job status of the job qsales:

q_status = DSGetJobInfo(qsales_handle, DSJ.JOBSTATUS)

The following command requests the actual name of the current job:

whatname = DSGetJobInfo (DSJ.ME, DSJ.JOBNAME)



DSGetLinkInfo Function

Provides a method of obtaining information about a link on an active stage, which can be used generally as well as for job control. This routine may reference either a controlled job or the current job, depending on the value of JobHandle.

Syntax

Result = DSGetLinkInfo (JobHandle, StageName, LinkName, InfoType)

JobHandle is the handle for the job as derived from DSAttachJob, or it can be DSJ.ME to refer to the current job.

StageName is the name of the active stage to be interrogated. May also be DSJ.ME to refer to the current stage if necessary.

LinkName is the name of a link (input or output) attached to the stage. May also be DSJ.ME to refer to current link (e.g. when used in a Transformer expression or transform function called from link code).

InfoType specifies the information required and can be one of:

DSJ.LINKLASTERR

DSJ.LINKNAME

DSJ.LINKROWCOUNT

Result depends on the specified InfoType, as follows:

• DSJ.LINKLASTERR String - last error message (if any) reported from the link in question.

• DSJ.LINKNAME String - returns the name of the link, most useful when used with JobHandle = DSJ.ME and StageName = DSJ.ME and LinkName = DSJ.ME to discover your own name.

• DSJ.LINKROWCOUNT Integer - number of rows that have passed down a link so far.

Result may also return error conditions as follows:

DSJE.BADHANDLE    JobHandle was invalid.

DSJE.BADTYPE      InfoType was unrecognized.

DSJE.BADSTAGE     StageName does not refer to a known stage in the job.

DSJE.NOTINSTAGE   StageName was DSJ.ME and the caller is not running within a stage.

DSJE.BADLINK      LinkName does not refer to a known link for the stage in question.

Remarks

When referring to a controlled job, DSGetLinkInfo can be used either before or after a DSRunJob has been issued. Any status returned following a successful call to DSRunJob is guaranteed to relate to that run of the job.

Example

The following command requests the number of rows that have passed down the order_feed link in the loader stage of the job qsales:

link_status = DSGetLinkInfo(qsales_handle, "loader", "order_feed", DSJ.LINKROWCOUNT)



DSGetLogEntry Function

Reads the full event details given in EventId.

Syntax

EventDetail = DSGetLogEntry (JobHandle, EventId)

JobHandle is the handle for the job as derived from DSAttachJob.

EventId is an integer that identifies the specific log event for which details are required. This is obtained using the DSGetNewestLogId function.

EventDetail is a string containing substrings separated by \. The substrings are as follows:

Substring1 Timestamp in form YYYY-MM-DD HH:MM:SS

Substring2 User information

Substring3 EventType – see DSGetNewestLogId

Substring4 – n Event message

If any of the following errors are found, they are reported via a fatal log event:

DSJE.BADHANDLE Invalid JobHandle.

DSJE.BADVALUE Error accessing EventId.

Example

The following command reads full event details of the log event identified by latestlogid into the string LatestEventString:

LatestEventString = DSGetLogEntry(qsales_handle, latestlogid)



DSGetLogSummary Function

Returns a list of short log event details. The details returned are determined by the setting of some filters. (Care should be taken with the setting of the filters, otherwise a large amount of information can be returned.)

Syntax

SummaryArray = DSGetLogSummary (JobHandle, EventType, StartTime, EndTime, MaxNumber)

JobHandle is the handle for the job as derived from DSAttachJob.

EventType is the type of event logged and is one of:

DSJ.LOGINFO Information message

DSJ.LOGWARNING Warning message

DSJ.LOGFATAL Fatal error

DSJ.LOGREJECT Reject link was active

DSJ.LOGSTARTED Job started

DSJ.LOGRESET Log was reset

DSJ.LOGANY Any category (the default)

StartTime is a string in the form YYYY-MM-DD HH:MM:SS or YYYY-MM-DD.

EndTime is a string in the form YYYY-MM-DD HH:MM:SS or YYYY-MM-DD.

MaxNumber is an integer that restricts the number of events to return. 0 means no restriction. Use this setting with caution.

SummaryArray is a dynamic array of fields separated by @FM. Each field comprises a number of substrings separated by \, where each field represents a separate event, with the substrings as follows:

Substring1       EventId as per DSGetLogEntry

Substring2       Timestamp in form YYYY-MM-DD HH:MM:SS

Substring3       EventType – see DSGetNewestLogId

Substring4 – n Event message

If any of the following errors are found, they are reported via a fatal log event:

DSJE.BADHANDLE Invalid JobHandle.

DSJE.BADTYPE Invalid EventType.

DSJE.BADTIME Invalid StartTime or EndTime.

DSJE.BADVALUE Invalid MaxNumber.

Example

The following command produces an array of reject link active events recorded for the qsales job between 18th August 1998 and 18th September 1998, up to a maximum of MAXREJ entries:

RejEntries = DSGetLogSummary(qsales_handle, DSJ.LOGREJECT, "1998-08-18 00:00:00", "1998-09-18 00:00:00", MAXREJ)



DSGetNewestLogId Function

Gets the ID of the most recent log event in a particular category, or in any category.

Syntax

EventId = DSGetNewestLogId (JobHandle, EventType)

JobHandle is the handle for the job as derived from DSAttachJob.

EventType is the type of event logged and is one of:

DSJ.LOGINFO Information message

DSJ.LOGWARNING Warning message

DSJ.LOGFATAL Fatal error

DSJ.LOGREJECT Reject link was active

DSJ.LOGSTARTED Job started

DSJ.LOGRESET Log was reset

DSJ.LOGANY Any category (the default)

EventId is an integer that identifies the specific log event. EventId can also be returned as a negative integer, in which case it contains an error code as follows:

DSJE.BADHANDLE Invalid JobHandle.

DSJE.BADTYPE Invalid EventType.

Example

The following command obtains an ID for the most recent warning message in the log for the qsales job:

Warnid = DSGetNewestLogId(qsales_handle, DSJ.LOGWARNING)



DSGetParamInfo Function

Provides a method of obtaining information about a parameter, which can be used generally as well as for job control. This routine may reference either a controlled job or the current job, depending on the value of JobHandle.

Syntax

Result = DSGetParamInfo (JobHandle, ParamName, InfoType)

JobHandle is the handle for the job as derived from DSAttachJob, or it may be DSJ.ME to refer to the current job.

ParamName is the name of the parameter to be interrogated.

InfoType specifies the information required and may be one of:

DSJ.PARAMDEFAULT

DSJ.PARAMHELPTEXT

DSJ.PARAMPROMPT

DSJ.PARAMTYPE

DSJ.PARAMVALUE

DSJ.PARAMDES.DEFAULT

DSJ.PARAMLISTVALUES

DSJ.PARAMDES.LISTVALUES

DSJ.PARAMPROMPT.AT.RUN

Result depends on the specified InfoType, as follows:

• DSJ.PARAMDEFAULT String – Current default value for the parameter in question. See also DSJ.PARAMDES.DEFAULT.

• DSJ.PARAMHELPTEXT String – Help text (if any) for the parameter in question.

• DSJ.PARAMPROMPT String – Prompt (if any) for the parameter in question.

• DSJ.PARAMTYPE Integer – Describes the type of validation test that should be performed on any value being set for this parameter. Is one of:

DSJ.PARAMTYPE.STRING

DSJ.PARAMTYPE.ENCRYPTED

DSJ.PARAMTYPE.INTEGER

DSJ.PARAMTYPE.FLOAT (the parameter may contain periods and E)

DSJ.PARAMTYPE.PATHNAME

DSJ.PARAMTYPE.LIST (should be a string of Tab-separated strings)

DSJ.PARAMTYPE.DATE (should be a string in form YYYY-MM-DD)

DSJ.PARAMTYPE.TIME (should be a string in form HH:MM)

• DSJ.PARAMVALUE String – Current value of the parameter for the running job or the last job run if the job is finished.

• DSJ.PARAMDES.DEFAULT String – Original default value of the parameter - may differ from DSJ.PARAMDEFAULT if the latter has been changed by an administrator since the job was installed.

• DSJ.PARAMLISTVALUES String – Tab-separated list of allowed values for the parameter. See also DSJ.PARAMDES.LISTVALUES.

• DSJ.PARAMDES.LISTVALUES String – Original Tab-separated list of allowed values for the parameter – may differ from DSJ.PARAMLISTVALUES if the latter has been changed by an administrator since the job was installed.

• DSJ.PARAMPROMPT.AT.RUN String – 1 means the parameter is to be prompted for when the job is run; anything else means it is not (DSJ.PARAMDEFAULT String to be used directly).

Result may also return error conditions as follows:

DSJE.BADHANDLE   JobHandle was invalid.

DSJE.BADPARAM    ParamName is not a parameter name in the job.



DSJE.BADTYPE InfoType was unrecognized.

Remarks

When referring to a controlled job, DSGetParamInfo can be used either before or after a DSRunJob has been issued. Any status returned following a successful call to DSRunJob is guaranteed to relate to that run of the job.

Example

The following command requests the default value of the quarter parameter for the qsales job:

Qs_quarter = DSGetParamInfo(qsales_handle, "quarter", DSJ.PARAMDEFAULT)



DSGetProjectInfo Function

Provides a method of obtaining information about the current project.

Syntax

Result = DSGetProjectInfo (InfoType)

InfoType specifies the information required and can be one of:

DSJ.JOBLIST

DSJ.PROJECTNAME

DSJ.HOSTNAME

Result depends on the specified InfoType, as follows:

• DSJ.JOBLIST String - comma-separated list of names of all jobs known to the project (whether the jobs are currently attached or not).

• DSJ.PROJECTNAME String - name of the current project.

• DSJ.HOSTNAME String - the host name of the server holding the current project.

Result may also return an error condition as follows:

DSJE.BADTYPE InfoType was unrecognized.
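No example appears in the original page; for instance, to obtain the comma-separated list of all jobs in the current project:

JobList = DSGetProjectInfo(DSJ.JOBLIST)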



DSGetStageInfo Function

Provides a method of obtaining information about a stage, which can be used generally as well as for job control. It can refer to the current job, or a controlled job, depending on the value of JobHandle.

Syntax

Result = DSGetStageInfo (JobHandle, StageName, InfoType)

JobHandle is the handle for the job as derived from DSAttachJob, or it may be DSJ.ME to refer to the current job.

StageName is the name of the stage to be interrogated. It may also be DSJ.ME to refer to the current stage if necessary.

InfoType specifies the information required and may be one of:

DSJ.STAGELASTERR

DSJ.STAGENAME

DSJ.STAGETYPE

DSJ.STAGEINROWNUM

DSJ.LINKLIST

DSJ.STAGEVARLIST

Result depends on the specified InfoType, as follows:

• DSJ.STAGELASTERR String - last error message (if any) reported from any link of the stage in question.

• DSJ.STAGENAME String - most useful when used with JobHandle = DSJ.ME and StageName = DSJ.ME to discover your own name.

• DSJ.STAGETYPE String - the stage type name (e.g. "Transformer", "BeforeJob").

• DSJ.STAGEINROWNUM Integer - the primary link's input row number.

• DSJ.LINKLIST - comma-separated list of link names in the stage.

• DSJ.STAGEVARLIST - comma-separated list of stage variable names.

Result may also return error conditions as follows:

DSJE.BADHANDLE    JobHandle was invalid.

DSJE.BADTYPE      InfoType was unrecognized.

DSJE.NOTINSTAGE   StageName was DSJ.ME and the caller is not running within a stage.

DSJE.BADSTAGE     StageName does not refer to a known stage in the job.

Remarks

When referring to a controlled job, DSGetStageInfo can be used either before or after a DSRunJob has been issued. Any status returned following a successful call to DSRunJob is guaranteed to relate to that run of the job.

Example

The following command requests the last error message for the loader stage of the job qsales:

stage_status = DSGetStageInfo(qsales_handle, "loader", DSJ.STAGELASTERR)



DSLogEvent Function

Logs an event message to a job other than the current one. (Use DSLogInfo, DSLogFatal, or DSLogWarn to log an event to the current job.)

Syntax

ErrCode = DSLogEvent (JobHandle, EventType, EventMsg)

JobHandle is the handle for the job as derived from DSAttachJob.

EventType is the type of event logged and is one of:

DSJ.LOGINFO Information message

DSJ.LOGWARNING Warning message

EventMsg is a string containing the event message.

ErrCode is 0 if there is no error. Otherwise it contains one of the following errors:

DSJE.BADHANDLE Invalid JobHandle.

DSJE.BADTYPE Invalid EventType (particularly note that you cannot place a fatal message in another job’s log).

Example

The following command, when included in the msales job, adds the message "monthly sales complete" to the log for the qsales job:

Logerror = DSLogEvent(qsales_handle, DSJ.LOGINFO, "monthly sales complete")



DSLogFatal Function

Logs a fatal error message in a job's log file and aborts the job.

Syntax

Call DSLogFatal (Message, CallingProgName)

Message (input) is the fatal error message you want to log. Message is automatically prefixed with the name of the current stage and the calling before/after subroutine.

CallingProgName (input) is the name of the before/after subroutine that calls the DSLogFatal subroutine.

Remarks

DSLogFatal writes the fatal error message to the job log file and aborts the job. DSLogFatal never returns to the calling before/after subroutine, so it should be used with caution. If a job stops with a fatal error, it must be reset using the DataStage Director before it can be rerun.

In a before/after subroutine, it is better to log a warning message (using DSLogWarn) and exit with a nonzero error code, which allows DataStage to stop the job cleanly.

DSLogFatal should not be used in a transform. Use DSTransformError instead.

Example

Call DSLogFatal("Cannot open file", "MyRoutine")



DSLogInfo Function

Logs an information message in a job's log file.

Syntax

Call DSLogInfo (Message, CallingProgName)

Message (input) is the information message you want to log. Message is automatically prefixed with the name of the current stage and the calling program.

CallingProgName (input) is the name of the transform or before/after subroutine that calls the DSLogInfo subroutine.

Remarks

DSLogInfo writes the message text to the job log file as an information message and returns to the calling routine or transform. If DSLogInfo is called during the test phase for a newly created routine in the DataStage Manager, the two arguments are displayed in the results window.

Unlimited information messages can be written to the job log file. However, if a lot of messages are produced the job may run slowly and the DataStage Director may take some time to display the job log file.

Example

Call DSLogInfo("Transforming: ":Arg1, "MyTransform")



DSLogToController Function

This routine may be used to put an info message in the log file of the job controlling this job, if any. If there isn’t one, the call is just ignored.

Syntax

Call DSLogToController(MsgString)

MsgString is the text to be logged. The log event is of type Information.

Remarks

If the current job is not under control, a silent exit is performed.

Example

Call DSLogToController("This is logged to parent")



DSLogWarn Function

Logs a warning message in a job's log file.

Syntax

Call DSLogWarn (Message, CallingProgName)

Message (input) is the warning message you want to log. Message is automatically prefixed with the name of the current stage and the calling before/after subroutine.

CallingProgName (input) is the name of the before/after subroutine that calls the DSLogWarn subroutine.

Remarks

DSLogWarn writes the message to the job log file as a warning and returns to the calling before/after subroutine. If the job has a warning limit defined for it, when the number of warnings reaches that limit, the call does not return and the job is aborted.

DSLogWarn should not be used in a transform. Use DSTransformError instead.

Example

If InputArg > 100 Then
   Call DSLogWarn("Input must be =< 100; received ":InputArg, "MyRoutine")
End Else
   * Carry on processing unless the job aborts
End



DSMakeJobReport Function

Generates a string describing the complete status of a valid attached job.

Syntax

ReportText = DSMakeJobReport(JobHandle, ReportLevel, LineSeparator)

JobHandle is the string as returned from DSAttachJob.

ReportLevel is a number: 0 = basic report, 1 = more detailed report.

LineSeparator is the string used to separate lines of the report. Special values recognised are:

"CRLF" => CHAR(13):CHAR(10)

"LF" => CHAR(10)

"CR" => CHAR(13)

The default is CRLF if on Windows NT, else LF.

Remarks

If a bad job handle is given, or any other error is encountered, information is added to the ReportText.

Example

h$ = DSAttachJob("MyJob", DSJ.ERRNONE)
rpt$ = DSMakeJobReport(h$, 0, "CRLF")



DSMakeMsg Function

Inserts arguments into a message template. Optionally, it will look up a template ID in the standard DataStage messages file, and use any returned message template instead of that given to the routine.

Syntax

FullText = DSMakeMsg(Template, ArgList)

FullText is the message with parameters substituted.

Template is the message template, in which %1, %2 etc. are to be substituted with values from the equivalent position in ArgList. If the template string starts with a number followed by "\", that is assumed to be part of a message id to be looked up in the DataStage message file.

Note: If an argument token is followed by "[E]", the value of that argument is assumed to be a job control error code, and an explanation of it will be inserted in place of "[E]". (See the DSTranslateCode function.)

ArgList is the dynamic array, one field per argument to be substituted.

Remarks

This routine is called from job control code created by the Job Sequence Generator. It is basically an interlude to call DSRMessage, which hides any runtime includes.

It will also perform local job parameter substitution in the message text. That is, if called from within a job, it looks for substrings such as "#xyz#" and replaces them with the value of the job parameter named "xyz".

Example

t$ = DSMakeMsg("Error calling DSAttachJob(%1)<L>%2", jb$:@FM:DSGetLastErrorMsg())



DSPrepareJob Function

Used to ensure that a compiled job is in the correct state to be run or validated.

Syntax

JobHandle = DSPrepareJob(JobHandle)

JobHandle is the handle, as returned from DSAttachJob(), of the job to be prepared.

The returned JobHandle is either the original handle or a new one. If it is returned as 0, an error occurred and a message is logged.

Example

h$ = DSPrepareJob(h$)



DSRunJob Function

Starts a job running. Note that this call is asynchronous; the request is passed to the run-time engine, but you are not informed of its progress.

Syntax

ErrCode = DSRunJob (JobHandle, RunMode)

JobHandle is the handle for the job as derived from DSAttachJob.

RunMode is the name of the mode the job is to be run in and is one of:

DSJ.RUNNORMAL (Default) Standard job run.

DSJ.RUNRESET Job is to be reset.

DSJ.RUNVALIDATE Job is to be validated only.

ErrCode is 0 if DSRunJob is successful, otherwise it is one of the following negative integers:

DSJE.BADHANDLE Invalid JobHandle.

DSJE.BADSTATE Job is not in the right state (compiled, not running).

DSJE.BADTYPE RunMode is not a known mode.

Remarks

If the controlling job is running in validate mode, then any calls of DSRunJob will act as if RunMode was DSJ.RUNVALIDATE, regardless of the actual setting.

A job in validate mode will run its JobControl routine (if any) rather than just check for its existence, as is the case for before/after routines. This allows you to examine the log of what jobs it started up in validate mode.

After a call of DSRunJob, the controlled job's handle is unloaded. If you want to run the same job again, you must use DSDetachJob and DSAttachJob to set a new handle. Note that you will also need to use DSWaitForJob, as you cannot attach to a job while it is running.




Example

The following command starts the job qsales in standard mode:

RunErr = DSRunJob(qsales_handle, DSJ.RUNNORMAL)



DSSendMail Function

This routine is an interface to a sendmail program that is assumed to exist somewhere in the search path of the current user (on the server). It hides the different call interfaces to various sendmail programs, and provides a simple interface for sending text.

Syntax

Reply = DSSendMail(Parameters)

Parameters is a set of name:value parameters, separated by either a mark character or "\n".

Currently recognized names (case-insensitive) are:

"From" Mail address of sender, e.g. [email protected]

Can only be left blank if the local template file does not contain a "%from%" token.

"To" Mail address of recipient, e.g. [email protected]

Can only be left blank if the local template file does not contain a "%to%" token.

"Subject" Something to put in the subject line of the message.

Refers to the "%subject%" token. If left as "", a standard subject line will be created, along the lines of "From DataStage job: jobname"

"Server" Name of host through which the mail should be sent.

May be omitted on systems (such as Unix) where the SMTP host name can be and is set up externally, in which case the local template file presumably will not contain a "%server%" token.

"Body" Message body.

Can be omitted. An empty message will be sent. If used, it must be the last parameter, to allow for getting multiple lines into the message, using "\n" for line breaks. Refers to the "%body%" token.

Note: The text of the body may contain the tokens "%report%" or "%fullreport%" anywhere within it, which will cause a report on the current job status to be inserted at that point. A full report contains stage and link information as well as job status.

Reply. Possible replies are:

DSJE.NOERROR (0) OK

DSJE.NOPARAM   Parameter name missing – field does not look like 'name:value'

DSJE.NOTEMPLATE Cannot find template file

DSJE.BADTEMPLATE Error in template file

Remarks

The routine looks for a local file, in the current project directory, with a well-known name. That is, a template to describe exactly how to run the local sendmail command.

Example

code = DSSendMail("From:me@here\nTo:You@there\nSubject:Hi ya\nBody:Line1\nLine2")



DSSetJobLimit Function

By default a controlled job inherits any row or warning limits from the controlling job. These can, however, be overridden using the DSSetJobLimit function.

Syntax

ErrCode = DSSetJobLimit (JobHandle, LimitType, LimitValue)

JobHandle is the handle for the job as derived from DSAttachJob.

LimitType is the name of the limit to be applied to the running job and is one of:

DSJ.LIMITWARN   Job to be stopped after LimitValue warning events.

DSJ.LIMITROWS Stages to be limited to LimitValue rows.

LimitValue is an integer specifying the value to set the limit to.

ErrCode is 0 if DSSetJobLimit is successful, otherwise it is one of the following negative integers:

DSJE.BADHANDLE   Invalid JobHandle.

DSJE.BADSTATE    Job is not in the right state (compiled, not running).

DSJE.BADTYPE     LimitType is not a known limiting condition.

DSJE.BADVALUE    LimitValue is not appropriate for the limiting condition type.

Example

The following command sets a limit of 10 warnings on the qsales job before it is stopped:

LimitErr = DSSetJobLimit(qsales_handle, DSJ.LIMITWARN, 10)



DSSetParam Function

Specifies job parameter values before running a job. Any parameter not set will be defaulted.

Syntax

ErrCode = DSSetParam (JobHandle, ParamName, ParamValue)

JobHandle is the handle for the job as derived from DSAttachJob.

ParamName is a string giving the name of the parameter.

ParamValue is a string giving the value for the parameter.

ErrCode is 0 if DSSetParam is successful, otherwise it is one of the following negative integers:

DSJE.BADHANDLE   Invalid JobHandle.

DSJE.BADSTATE    Job is not in the right state (compiled, not running).

DSJE.BADPARAM    ParamName is not a known parameter of the job.

DSJE.BADVALUE    ParamValue is not appropriate for that parameter type.

Example

The following commands set the quarter parameter to 1 and the startdate parameter to 1997-01-01 for the qsales job:

paramerr = DSSetParam(qsales_handle, "quarter", "1")
paramerr = DSSetParam(qsales_handle, "startdate", "1997-01-01")



DSSetUserStatus Subroutine

Applies only to the current job, and does not take a JobHandle parameter. It can be used by any job in either a JobControl or After routine to set a termination code for interrogation by another job. In fact, the code may be set at any point in the job, and the last setting is the one that will be picked up at any time. So to be certain of getting the actual termination code for a job the caller should use DSWaitForJob and DSGetJobInfo first, checking for a successful finishing status.

Note: This routine is defined as a subroutine not a function because there are no possible errors.


Syntax

Call DSSetUserStatus (UserStatus)

UserStatus is a string containing any user-defined termination message. The string will be logged as part of a suitable "Control" event in the calling job's log, and stored for retrieval by DSGetJobInfo, overwriting any previous stored string.

This string should not be a negative integer, otherwise it may be indistinguishable from an internal error in DSGetJobInfo calls.

Example

The following command sets a termination code of "sales job done":

Call DSSetUserStatus("sales job done")



DSStopJob Function

This routine should only be used after a DSRunJob has been issued. It immediately sends a stop request to the run-time engine. The call is asynchronous. If you need to know that the job has actually stopped, you must call DSWaitForJob or use the Sleep statement and poll for DSGetJobStatus. Note that the stop request gets sent regardless of the job's current status.

Syntax

ErrCode = DSStopJob (JobHandle)

JobHandle is the handle for the job as derived from DSAttachJob.

ErrCode is 0 if DSStopJob is successful, otherwise it may be the following:

DSJE.BADHANDLE Invalid JobHandle.

Example

The following command requests that the qsales job is stopped:

stoperr = DSStopJob(qsales_handle)



DSTransformError Function

Logs a warning message to a job log file. This function is called from transforms only.

Syntax

Call DSTransformError (Message, TransformName)

Message (input) is the warning message you want to log. Message is automatically prefixed with the name of the current stage and the calling transform.

TransformName (input) is the name of the transform that calls the DSTransformError subroutine.

Remarks

DSTransformError writes the message (and other information) to the job log file as a warning and returns to the transform. If the job has a warning limit defined for it, when the number of warnings reaches that limit, the call does not return and the job is aborted.

In addition to the warning message, DSTransformError logs the values of all columns in the current row for all input and output links connected to the current stage.

Example

Function MySqrt(Arg1)
If Arg1 < 0 Then
   Call DSTransformError("Negative value:":Arg1, "MySqrt")
   Return("0") ;* transform produces 0 in this case
End
Result = Sqrt(Arg1) ;* else return the square root
Return(Result)



DSTranslateCode Function

Converts a job control status or error code into an explanatory text message.

Syntax

Ans = DSTranslateCode(Code)

Code is:

• If Code > 0, it's assumed to be a job status.

• If Code < 0, it's assumed to be an error code.

(0 should never be passed in, and will return "no error".)

Ans is the message associated with the code.

Remarks

If Code is not recognized, then Ans will report it.

Example

code$ = DSGetLastErrorMsg()

ans$ = DSTranslateCode(code$)



DSWaitForFile Function

Suspends a job until a named file either exists or does not exist.

Syntax

Reply = DSWaitForFile(Parameters)

Parameters is the full path of the file to wait on. No check is made as to whether this is a reasonable path (for example, whether all directories in the path exist). A path name starting with "-" indicates a flag to check the non-existence of the path; the "-" is not part of the path name.

Parameters may also end in the form " timeout:NNNN" (or "timeout=NNNN"). This indicates a non-default time to wait before giving up. There are several possible formats, case-insensitive:

nnn number of seconds to wait (from now)

nnnS ditto

nnnM number of minutes to wait (from now)

nnnH number of hours to wait (from now)

nn:nn:nn   wait until this time, in 24-hour HH:MM:SS (or nn:nn HH:MM) format. If this time has already passed, waits until the next day.

The default timeout is the same as "12H".

The format may optionally end with "/nn", indicating a poll delay time in seconds. If omitted, a default poll time is used.

Reply may be:

DSJE.NOERROR (0) OK - file now exists or does not exist, depending on flag.

DSJE.BADTIME Unrecognized Timeout format

DSJE.NOFILEPATH File path missing

DSJE.TIMEOUT Waited too long

Examples

Reply = DSWaitForFile("C:\ftp\incoming.txt timeout:2H")



(wait 7200 seconds for file on C: to exist before it gives up.)

Reply = DSWaitForFile("-incoming.txt timeout=15:00")

(wait until 3 pm for file in local directory to NOT exist.)

Reply = DSWaitForFile("incoming.txt timeout:3600/60")

(wait 1 hour for a local file to exist, looking once a minute.)



DSWaitForJob Function

This function is only valid if the current job has issued a DSRunJob on the given JobHandle(s). It returns when the job (or jobs) started by the last DSRunJob call have finished.

Syntax

ErrCode = DSWaitForJob (JobHandle)

JobHandle is the string returned from DSAttachJob. If commas are contained, it’s a comma-delimited set of job handles, representing a list of jobs that are all to be waited for.

ErrCode is 0 if no error, else possible error values (<0) are:

DSJE.BADHANDLE Invalid JobHandle.

DSJE.WRONGJOB Job for this JobHandle was not run from within this job.

If ErrCode is > 0, it is the handle of the job that finished from a multi-job wait.

Remarks

DSWaitForJob will wait for either a single job or multiple jobs.

Example

To wait for the return of the qsales job:

WaitErr = DSWaitForJob(qsales_handle)




Job Status Macros

A number of macros are provided in the JOBCONTROL.H file to facilitate getting information about the current job, and links and stages belonging to the current job. These macros provide the functionality of using the DataStage BASIC DSGetProjectInfo, DSGetJobInfo, DSGetStageInfo, and DSGetLinkInfo functions with the DSJ.ME token as the JobHandle and can be used in all active stages and before/after subroutines. The macros provide the functionality for all the possible InfoType arguments for the DSGet…Info functions.

The available macros are:

• DSHostName
• DSProjectName
• DSJobStatus
• DSJobName
• DSJobController
• DSJobStartDate
• DSJobStartTime
• DSJobWaveNo
• DSJobInvocations
• DSJobInvocationID
• DSStageName
• DSStageLastErr
• DSStageType
• DSStageInRowNum
• DSStageVarList
• DSLinkRowCount
• DSLinkLastErr
• DSLinkName

For example, to obtain the name of the current job:

MyName = DSJobName

To obtain the full current stage name:


MyName = DSJobName : "." : DSStageName

In addition, the following macros are provided to manipulate Transformer stage variables:



• DSGetVar(VarName) returns the current value of the named stage variable. If the current stage does not have a stage variable called VarName, then "" is returned and an error message is logged. If the named stage variable is defined but has not been initialized, then "" is returned and an error message is logged.

• DSSetVar(VarName, VarValue) sets the value of the named stage variable. If the current stage does not have a stage variable called VarName, then an error message is logged.
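For example, a before/after subroutine could log its context using the macros above, and an expression could read a stage variable with DSGetVar (the routine and variable names here are illustrative):

Call DSLogInfo("Running ":DSJobName:", stage ":DSStageName, "MyRoutine")
CurrentTotal = DSGetVar("RunningTotal")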

Command Line Interface

The DataStage CLI gives you access to the same functionality as the DataStage API functions described on page 51-5 or the BASIC functions described on page 51-63. There is a single command, dsjob, with a large range of options. These options are described in the following topics:

• The logon clause
• Starting a job
• Stopping a job
• Listing projects, jobs, stages, links, and parameters
• Retrieving information
• Accessing log files

All output from the dsjob command is in plain text without column headings on lists of things, or any other sort of description. This enables the command to be used in shell or batch scripts without extra processing.

The DataStage CLI returns a completion code of 0 to the operating system upon successful execution, or one of the DataStage API error codes on failure. See “Error Codes” on page 51-59.

The Logon Clause

By default, the DataStage CLI connects to the DataStage server engine on the local system using the user name and password of the user invoking the command. You can specify a different server, user name, or password using the logon clause, which is equivalent to the API DSSetServerParams function. Its syntax is as follows:

[ –server servername ][ –user username ][ –password password ]

servername specifies a different server to log on to.

username specifies a different user name to use when logging on.

password specifies a different password to use when logging on.

You can also specify these details in a file using the following syntax:

[ –file filename servername ]

servername specifies the server for which the file contains login details.

filename is the name of the file containing login details. The file should contain the following information:

servername, username, password

You can use the logon clause with any dsjob command.
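For example (the server name and credentials here are illustrative), to list the projects on a remote server:

dsjob –server r101 –user dsadm –password secret –lprojects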

Starting a Job

You can start, stop, validate, and reset jobs using the –run option.

dsjob –run
[ –mode [ NORMAL | RESET | VALIDATE ] ]
[ –param name=value ]
[ –warn n ]
[ –rows n ]
[ –wait ]
[ –stop ]
[ –jobstatus ]
[ –userstatus ]
[ –local ]
project job



–mode specifies the type of job run. NORMAL starts a job run, RESET resets the job and VALIDATE validates the job. If –mode is not specified, a normal job run is started.

–param specifies a parameter value to pass to the job. The value is in the format name=value, where name is the parameter name, and value is the value to be set. If you use this to pass a value of an environment variable for a job (as you may do for parallel jobs), you need to quote the environment variable and its value, for example -param '$APT_CONFIG_FILE=chris.apt', otherwise the current value of the environment variable will be used.

–warn n sets warning limits to the value specified by n (equivalent to the DSSetJobLimit function used with DSJ_LIMITWARN specified as the LimitType parameter).

–rows n sets row limits to the value specified by n (equivalent to the DSSetJobLimit function used with DSJ_LIMITROWS specified as the LimitType parameter).

–wait waits for the job to complete (equivalent to the DSWaitForJob function).

–stop terminates a running job (equivalent to the DSStopJob function).

–jobstatus waits for the job to complete, then returns an exit code derived from the job status.

–userstatus waits for the job to complete, then returns an exit code derived from the user status if that status is defined. The user status is a string, and it is converted to an integer exit code. The exit code 0 indicates that the job completed without an error, but that the user status string could not be converted. If a job returns a negative user status value, it is interpreted as an error.

–local use this when running a DataStage job from within a shell script on a UNIX server. Provided the script is run in the project directory, the job will pick up the settings for any environment variables set in the script and any settings specific to the user environment.

project is the name of the project containing the job.

job is the name of the job.
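Putting these options together, a typical run request might look like this (the project, job, and parameter names are illustrative):

dsjob –run –mode NORMAL –param quarter=1 –warn 10 –wait –jobstatus dstage Qsales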



Stopping a Job

You can stop a job using the –stop option.

dsjob –stop project job

–stop terminates a running job (equivalent to the DSStopJob function).

project is the name of the project containing the job.

job is the name of the job.

Listing Projects, Jobs, Stages, Links, and Parameters

You can list projects, jobs, stages, links, and job parameters using the dsjob command. The different versions of the syntax are described in the following sections.

Listing Projects

The following syntax displays a list of all known projects on the server:

dsjob –lprojects

This syntax is equivalent to the DSGetProjectList function.

Listing Jobs

The following syntax displays a list of all jobs in the specified project:

dsjob –ljobs project

project is the name of the project containing the jobs to list.

This syntax is equivalent to the DSGetProjectInfo function.

Listing Stages

The following syntax displays a list of all stages in a job:

dsjob –lstages project job

project is the name of the project containing job.

job is the name of the job containing the stages to list.

This syntax is equivalent to the DSGetJobInfo function with DSJ_STAGELIST specified as the InfoType parameter.


Listing Links

The following syntax displays a list of all the links to or from a stage:

dsjob –llinks project job stage

project is the name of the project containing job.

job is the name of the job containing stage.

stage is the name of the stage containing the links to list.

This syntax is equivalent to the DSGetStageInfo function with DSJ_LINKLIST specified as the InfoType parameter.

Listing Parameters

The following syntax displays a list of all the parameters in a job and their values:

dsjob –lparams project job

project is the name of the project containing job.

job is the name of the job whose parameters are to be listed.

This syntax is equivalent to the DSGetJobInfo function with DSJ_PARAMLIST specified as the InfoType parameter.
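For example (the project, job, and stage names here are illustrative):

dsjob –lprojects
dsjob –ljobs dstage
dsjob –lstages dstage Qsales
dsjob –llinks dstage Qsales loader
dsjob –lparams dstage Qsales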

Retrieving Information

The dsjob command can be used to retrieve and display the available information about specific projects, jobs, stages, or links. The different versions of the syntax are described in the following sections.

Displaying Job Information

The following syntax displays the available information about a specified job:

dsjob –jobinfo project job

project is the name of the project containing job.

job is the name of the job.

The following information is displayed:


• The current status of the job

• The name of any controlling job for the job


• The date and time when the job started

• The wave number of the last or current run (internal DataStage reference number)

• User status

This syntax is equivalent to the DSGetJobInfo function.

Displaying Stage Information

The following syntax displays all the available information about a stage:

dsjob –stageinfo project job stage

project is the name of the project containing job.

job is the name of the job containing stage.

stage is the name of the stage.

The following information is displayed:

• The last error message reported from any link to or from the stage
• The stage type name, for example, Transformer or Aggregator
• The primary link's input row number

This syntax is equivalent to the DSGetStageInfo function.

Displaying Link Information

The following syntax displays information about a specified link to or from a stage:

dsjob –linkinfo project job stage link

project is the name of the project containing job.

job is the name of the job containing stage.

stage is the name of the stage containing link.

link is the name of the link.

The following information is displayed:

• The last error message reported by the link
• The number of rows that have passed down a link


This syntax is equivalent to the DSGetLinkInfo function.


Displaying Parameter Information

This syntax displays information about the specified parameter:

dsjob –paraminfo project job param

project is the name of the project containing job.

job is the name of the job containing parameter.

param is the name of the parameter.

The following information is displayed:

• The parameter type
• The parameter value
• Help text for the parameter that was provided by the job's designer
• Whether the value should be prompted for
• The default value that was specified by the job's designer
• Any list of values
• The list of values provided by the job's designer

This syntax is equivalent to the DSGetParamInfo function.

Accessing Log Files

The dsjob command can be used to add entries to a job's log file, or retrieve and display specific log entries. The different versions of the syntax are described in the following sections.

Adding a Log Entry

The following syntax adds an entry to the specified log file. The text for the entry is taken from standard input to the terminal, ending with Ctrl-D.

dsjob –log [ –info | –warn ] project job

–info specifies an information message. This is the default if no log entry type is specified.

–warn specifies a warning message.

project is the name of the project containing job.

job is the name of the job that the log entry refers to.


This syntax is equivalent to the DSLogEvent function.
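For example (the project and job names here are illustrative), the following adds a warning entry to the qsales log, taking the message text from standard input:

dsjob –log –warn dstage Qsales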


Displaying a Short Log Entry

The following syntax displays a summary of entries in a job log file:

dsjob –logsum [–type type] [ –max n ] project job

–type type specifies the type of log entry to retrieve. If –type type is not specified, all the entries are retrieved. type can be one of the following options:

–max n limits the number of entries retrieved to n.

project is the project containing job.

job is the job whose log entries are to be retrieved.

Displaying a Specific Log Entry

The following syntax displays the specified entry in a job log file:

dsjob –logdetail project job entry

project is the project containing job.

job is the job whose log entries are to be retrieved.

entry is the event number assigned to the entry. The first entry in the file is 0.

This syntax is equivalent to the DSGetLogEntry function.

This option… Retrieves this type of log entry…

INFO Information.

WARNING Warning.

FATAL Fatal error.

REJECT Rejected rows from a Transformer stage.

STARTED Job started.

RESET Job reset.

BATCH Batch control.

ANY All entries of any type. This is the default if type is not specified.



Identifying the Newest Entry

The following syntax displays the ID of the newest log entry of the specified type:

dsjob –lognewest project job type

project is the project containing job.

job is the job whose log entries are to be retrieved.

type can be one of the following options:

This option…   Retrieves this type of log entry…
INFO           Information
WARNING        Warning
FATAL          Fatal error
REJECT         Rejected rows from a Transformer stage
STARTED        Job started
RESET          Job reset
BATCH          Batch

This syntax is equivalent to the DSGetNewestLogId function.
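For example, the following command displays the ID of the newest fatal error entry in the log of job myjob (the project and job names are illustrative):

dsjob –lognewest dstage myjob FATAL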






Appendix A. Schemas

Schemas are an alternative way for you to specify column definitions for the data used by parallel jobs. By default, most parallel job stages take their meta data from the Columns tab, which contains table definitions, supplemented where necessary by format information from the Format tab. For some stages, you can specify a property that causes the stage to take its meta data from the specified schema file instead. Some stages also allow you to specify a partial schema. This allows you to describe only those columns that a particular stage is processing and ignore the rest.

The schema file is a plain text file; this appendix describes its format. A partial schema has the same format.

Schema Format

A schema contains a record (or row) definition. This describes each column (or field) that will be encountered within the record, giving column name and data type. The following is an example record schema:

record (name:string[255];
        address:nullable string[255];
        value1:int32;
        value2:int32;
        date:date)

(The line breaks are there for ease of reading; you would omit these if you were defining a partial schema. For example, record(name:string[255];value1:int32;date:date) is a valid schema.)



The format of each line describing a column is:

column_name:[nullability]datatype;


• column_name. This is the name that identifies the column. Names must start with a letter or an underscore (_), and can contain only alphanumeric or underscore characters. The name is not case sensitive. The name can be of any length.

• nullability. You can optionally specify whether a column is allowed to contain a null value, or whether this would be viewed as invalid. If the column can be null, insert the word ’nullable’. By default columns are not nullable.

You can also include ’nullable’ at record level to specify that all columns are nullable, then override the setting for individual columns by specifying ‘not nullable’. For example:

record nullable (name:not nullable string[255];value1:int32;date:date)

• datatype. This is the data type of the column. This uses the internal data types as described on page 2-13, not SQL data types as used on Columns tabs in stage editors.

You can include comments in schema definition files. A comment is started by a double slash //, and ended by a newline.
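For example (the column names and comment text are illustrative):

record ( // customer details
  name:string[255]; // customer name
  value1:int32; // account balance
)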



The example schema corresponds to the table definition as specified on the Columns tab of a stage editor.

The following sections give special consideration to representing various data types in a schema file.

Date Columns

The following examples show various different data definitions:

record (dateField1:date; )           // single date
record (dateField2[10]:date; )       // 10-element date vector
record (dateField3[]:date; )         // variable-length date vector
record (dateField4:nullable date;)   // nullable date

(See “Complex Data Types” on page 2-14 for information about vectors.)

Decimal Columns

To define a record field with data type decimal, you must specify the column’s precision, and you may optionally specify its scale, as follows:

column_name:decimal[precision, scale];


where precision is greater than or equal to 1, and scale is greater than or equal to 0 and less than precision.

If the scale is not specified, it defaults to zero, indicating an integer value.

The following examples show different decimal column definitions:

record (dField1:decimal[12]; )            // 12-digit integer
record (dField2[10]:decimal[15,3]; )      // 10-element decimal vector
record (dField3:nullable decimal[15,3];)  // nullable decimal

Floating-Point Columns

To define floating-point fields, you use the sfloat (single-precision) or dfloat (double-precision) data type, as in the following examples:

record (aSingle:sfloat; aDouble:dfloat; )  // float definitions
record (aSingle: nullable sfloat;)         // nullable sfloat
record (doubles[5]:dfloat;)                // fixed-length vector of dfloats
record (singles[]:sfloat;)                 // variable-length vector of sfloats

Integer Columns

To define integer fields, you use an 8-, 16-, 32-, or 64-bit integer data type (signed or unsigned), as shown in the following examples:

record (n:int32;)           // 32-bit signed integer
record (n:nullable int64;)  // nullable, 64-bit signed integer
record (n[10]:int16;)       // fixed-length vector of 16-bit signed integers
record (n[]:uint8;)         // variable-length vector of 8-bit unsigned ints

Raw Columns

You can define a record field that is a collection of untyped bytes, of fixed or variable length. You give the field data type raw. The definition for a raw field is similar to that of a string field, as shown in the following examples:

record (var1:raw[];)       // variable-length raw field
record (var2:raw;)         // variable-length raw field; same as raw[]
record (var3:raw[40];)     // fixed-length raw field
record (var4[5]:raw[40];)  // fixed-length vector of raw fields


You can specify the maximum number of bytes allowed in the raw field with the optional property max, as shown in the example below:


record (var7:raw[max=80];)

The length of a fixed-length raw field must be at least 1.

String Columns

You can define string fields of fixed or variable length. For variable-length strings, the string length is stored as part of the string as a hidden integer. The storage used to hold the string length is not included in the length of the string.

The following examples show string field definitions:

record (var1:string[];)             // variable-length string
record (var2:string;)               // variable-length string; same as string[]
record (var3:string[80];)           // fixed-length string of 80 bytes
record (var4:nullable string[80];)  // nullable string
record (var5[10]:string;)           // fixed-length vector of strings
record (var6[]:string[80];)         // variable-length vector of strings

You can specify the maximum length of a string with the optional property max, as shown in the example below:

record (var7:string[max=80];)

The length of a fixed-length string must be at least 1.

Time Columns

By default, the smallest unit of measure for a time value is seconds, but you can instead use microseconds with the [microseconds] option. The following are examples of time field definitions:

record (tField1:time; )               // single time field in seconds
record (tField2:time[microseconds];)  // time field in microseconds
record (tField3[]:time; )             // variable-length time vector
record (tField4:nullable time;)       // nullable time

Timestamp Columns

Timestamp fields contain both time and date information. In the time portion, you can use seconds (the default) or microseconds for the smallest unit of measure. For example:


record (tsField1:timestamp;)                // single timestamp field in seconds
record (tsField2:timestamp[microseconds];)  // timestamp in microseconds
record (tsField3[15]:timestamp; )           // fixed-length timestamp vector
record (tsField4:nullable timestamp;)       // nullable timestamp

Vectors

Many of the previous examples show how to define a vector of a particular data type. You define a vector field by following the column name with brackets []. For a variable-length vector, you leave the brackets empty, and for a fixed-length vector you put the number of vector elements in the brackets. For example, to define a variable-length vector of int32, you would use a field definition such as the following one:

intVec[]:int32;

To define a fixed-length vector of 10 elements of type sfloat, you would use a definition such as:

sfloatVec[10]:sfloat;

You can define a vector of any data type, including string and raw. You cannot define a vector of a vector or tagged type. You can, however, define a vector of type subrecord, and that subrecord can itself include a tagged column or a vector.

You can make vector elements nullable, as shown in the following record definition:

record (vInt[]:nullable int32;
        vDate[6]:nullable date; )

In the example above, every element of the variable-length vector vInt will be nullable, as will every element of fixed-length vector vDate. To test whether a vector of nullable elements contains no data, you must check each element for null.

Subrecords

Record schemas let you define nested field definitions, or subrecords, by specifying the type subrec. A subrecord itself does not define any storage; instead, the fields of the subrecord define storage. The fields in a subrecord can be of any data type, including tagged.

The following example defines a record that contains a subrecord:


record ( intField:int16;
         aSubrec:subrec (
           aField:int16;
           bField:sfloat; );
)

In this example, the record contains a 16-bit integer field, intField, and a subrecord field, aSubrec. The subrecord includes two fields: a 16-bit integer and a single-precision float.

Subrecord columns of value data types (including string and raw) can be nullable, and subrecord columns of subrec or vector types can have nullable elements. A subrecord itself cannot be nullable.

You can define vectors (fixed-length or variable-length) of subrecords. The following example shows a definition of a fixed-length vector of subrecords:

record (aSubrec[10]:subrec (
          aField:int16;
          bField:sfloat; );
)

You can also nest subrecords and vectors of subrecords, to any depth of nesting. The following example defines a fixed-length vector of subrecords, aSubrec, that contains a nested variable-length vector of subrecords, cSubrec:

record (aSubrec[10]:subrec (
          aField:int16;
          bField:sfloat;
          cSubrec[]:subrec (
            cAField:uint8;
            cBField:dfloat; );
        );
)

Subrecords can include tagged aggregate fields, as shown in the following sample definition:

record (aSubrec:subrec (
          aField:string;
          bField:int32;
          cField:tagged (
            dField:int16;
            eField:sfloat; );
        );
)


In this example, aSubrec has a string field, an int32 field, and a tagged aggregate field. The tagged aggregate field cField can have either of two data types, int16 or sfloat.


Tagged Columns

You can use schemas to define tagged columns (similar to C unions), with the data type tagged. Defining a record with a tagged type allows each record of a data set to have a different data type for the tagged column. When your application writes to a field in a tagged column, DataStage updates the tag, which identifies it as having the type of the column that is referenced.

The data type of a tagged column can be of any data type except tagged or subrec. For example, the following record defines a tagged subrecord field:

record ( tagField:tagged (
           aField:string;
           bField:int32;
           cField:sfloat;
         ) ;
)

In the example above, the data type of tagField can be one of the following: a variable-length string, an int32, or an sfloat.



Partial Schemas

Some parallel job stages allow you to use a partial schema. This means that you need only define column definitions for those columns that you are actually going to operate on. The stages that allow you to do this are file stages that have a Format tab. These are:

• Sequential File stage
• File Set stage
• External Source stage
• External Target stage
• Column Import stage

You specify a partial schema using the Intact property on the Format tab of the stage. Positional information for the columns in a partial schema is derived from other settings in the Format tab.

Note: If you wish to use runtime column propagation to propagate those columns you have not defined through the job, then you will also have to define the full schema using the schema file property on the input or output page of the stage.

You need to describe the record and the individual columns. Describe the record as follows:

• intact. This property specifies that the schema being defined is a partial one. You can optionally specify a name for the intact schema here as well.

• record_length. The length of the record, including record delimiter characters.

• record_delim_string. String giving the record delimiter as an ASCII string in single quotes. (For a single character delimiter, use record_delim and supply a single ASCII character in single quotes).

Describe the columns as follows:

• position. The position of the starting character within the record.

• delim. The column trailing delimiter. It can be any of the following:


– ws to skip all standard whitespace characters (space, tab, and newline) trailing after a field.


– end to specify that the last field in the record is composed of all remaining bytes until the end of the record.

– none to specify that fields have no delimiter.

– null to specify that the delimiter is the ASCII null character.

– ASCII_char specifies a single ASCII delimiter. Enclose ASCII_char in single quotation marks. (To specify multiple ASCII characters, use delim_string followed by the string in single quotes.)

• text specifies the data representation type of a field as being text rather than binary. Data is formatted as text by default. (Specify binary if data is binary.)
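Putting these together, a partial schema might look something like the following sketch (the intact name, column names, positions, and lengths are illustrative, and the exact property placement should be checked against your stage’s Format tab settings):

record { intact=details, record_delim_string='\r\n' }
  ( name:string[20] { position=0, delim=',' };
    income:string[10] { position=40, delim=',', text };
  )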



Appendix B. Functions

This appendix describes the functions that are available from the expression editor under the Function… menu item. You would typically use these functions when defining a column derivation in a Transformer stage. The functions are described by category.

Date and Time Functions

The following list describes the functions available in the Date & Time category (square brackets indicate an argument is optional):

DateFromDaysSince
    Returns a date by adding an integer to a baseline date.
    Arguments: number (int32), [baseline date]. Output: date.

DateFromJulianDay
    Returns a date from the given julian date.
    Arguments: juliandate (uint32). Output: date.

DaysSinceFromDate
    Returns the number of days from the source date to the given date.
    Arguments: source_date, given_date. Output: days since (int32).

HoursFromTime
    Returns the hour portion of a time.
    Arguments: time. Output: hours (int8).

JulianDayFromDate
    Returns the julian day from the given date.
    Arguments: date. Output: julian date (int32).

MicroSecondsFromTime
    Returns the microsecond portion from a time.
    Arguments: time. Output: microseconds (int32).

MinutesFromTime
    Returns the minute portion from a time.
    Arguments: time. Output: minutes (int8).

MonthDayFromDate
    Returns the day of the month given the date.
    Arguments: date. Output: day (int8).

MonthFromDate
    Returns the month number given the date.
    Arguments: date. Output: month number (int8).

NextWeekdayFromDate
    Returns the date of the specified day of the week soonest after the source date.
    Arguments: source date, day of week (string). Output: date.

PreviousWeekdayFromDate
    Returns the date of the specified day of the week most recent before the source date.
    Arguments: source date, day of week (string). Output: date.

SecondsFromTime
    Returns the second portion from a time.
    Arguments: time. Output: seconds (dfloat).

SecondsSinceFromTimestamp
    Returns the number of seconds between two timestamps.
    Arguments: timestamp, base timestamp. Output: seconds (dfloat).

TimeDate
    Returns the system time and date as a formatted string.
    Arguments: none. Output: system time and date (string).

TimeFromMidnightSeconds
    Returns the time given the number of seconds since midnight.
    Arguments: seconds (dfloat). Output: time.

TimestampFromDateTime
    Returns a timestamp from the given date and time.
    Arguments: date, time. Output: timestamp.

TimestampFromSecondsSince
    Returns the timestamp from the number of seconds from the base timestamp.
    Arguments: seconds (dfloat), [base timestamp]. Output: timestamp.

TimestampFromTimet
    Returns a timestamp from the given unix time_t value.
    Arguments: timet (int32). Output: timestamp.

TimetFromTimestamp
    Returns a unix time_t value from the given timestamp.
    Arguments: timestamp. Output: timet (int32).

WeekdayFromDate
    Returns the day number of the week from the given date.
    Arguments: date. Output: day (int8).

YeardayFromDate
    Returns the day number in the year from the given date.
    Arguments: date. Output: day (int16).

YearFromDate
    Returns the year from the given date.
    Arguments: date. Output: year (int16).

YearweekFromDate
    Returns the week number in the year from the given date.
    Arguments: date. Output: week (int16).

Date, Time, and Timestamp functions that take a format string (for example, TimeToString(time, format)) need specific formats.

For a date, the format components are:

%dd             two digit day
%mm             two digit month
%yy             two digit year (from 1900)
%year_cutoffyy  two digit year from year_cutoff (e.g. %2000yy)
%yyyy           four digit year
%ddd            three digit day of the year

The default format is %yyyy-%mm-%dd.

For a time, the format components are:

%hh  two digit hour
%nn  two digit minutes
%ss  two digit seconds

The default is %hh:%nn:%ss.

A timestamp can include the components for date and time above. The default format is %yyyy-%mm-%dd %hh:%nn:%ss.
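For example, the following derivation uses StringToDate (described under “Type Conversion Functions” later in this appendix) with a format string to convert a string column to a date (the link and column names are illustrative):

StringToDate(DSLink1.orderdate, "%dd/%mm/%yyyy")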


Logical Functions

The following list describes the functions available in the Logical category (square brackets indicate an argument is optional):

Not
    Returns the complement of the logical value of an expression.
    Arguments: expression. Output: complement (int8).

Mathematical Functions

The following list describes the functions available in the Mathematical category (square brackets indicate an argument is optional):

Abs
    Absolute value of any numeric expression.
    Arguments: number (int32). Output: result (dfloat).

Acos
    Calculates the trigonometric arc-cosine of an expression.
    Arguments: number (dfloat). Output: result (dfloat).

Asin
    Calculates the trigonometric arc-sine of an expression.
    Arguments: number (dfloat). Output: result (dfloat).

Atan
    Calculates the trigonometric arc-tangent of an expression.
    Arguments: number (dfloat). Output: result (dfloat).

Ceil
    Calculates the smallest integer value greater than or equal to the given decimal value.
    Arguments: number (decimal). Output: result (int32).

Cos
    Calculates the trigonometric cosine of an expression.
    Arguments: number (dfloat). Output: result (dfloat).

Cosh
    Calculates the hyperbolic cosine of an expression.
    Arguments: number (dfloat). Output: result (dfloat).

Div
    Outputs the whole part of the real division of two real numbers (dividend, divisor).
    Arguments: dividend (dfloat), divisor (dfloat). Output: result (dfloat).

Exp
    Calculates the result of base ’e’ raised to the power designated by the value of the expression.
    Arguments: number (dfloat). Output: result (dfloat).

Fabs
    Calculates the absolute value of the given value.
    Arguments: number (dfloat). Output: result (dfloat).

Floor
    Calculates the largest integer value less than or equal to the given decimal value.
    Arguments: number (decimal). Output: result (int32).

Ldexp
    Calculates a number from an exponent and mantissa.
    Arguments: mantissa (dfloat), exponent (int32). Output: result (dfloat).

Llabs
    Returns the absolute value of the given integer.
    Arguments: number (uint64). Output: result (int64).

Ln
    Calculates the natural logarithm of an expression in base ’e’.
    Arguments: number (dfloat). Output: result (dfloat).

Log10
    Returns the log to the base 10 of the given value.
    Arguments: number (dfloat). Output: result (dfloat).

Max
    Returns the greater of the two argument values.
    Arguments: number 1 (int32), number 2 (int32). Output: result (int32).

Min
    Returns the lower of the two argument values.
    Arguments: number 1 (int32), number 2 (int32). Output: result (int32).

Mod
    Calculates the modulo (the remainder) of two expressions (dividend, divisor).
    Arguments: dividend (int32), divisor (int32). Output: result (int32).

Neg
    Negates a number.
    Arguments: number (dfloat). Output: result (dfloat).

Pwr
    Calculates the value of an expression when raised to a specified power (expression, power).
    Arguments: expression (dfloat), power (dfloat). Output: result (dfloat).

Rand
    Returns a pseudo-random integer between 0 and 2^32-1.
    Arguments: none. Output: result (uint32).

Random
    Returns a random number between 0 and 2^32-1.
    Arguments: none. Output: result (uint32).

Sin
    Calculates the trigonometric sine of an angle.
    Arguments: number (dfloat). Output: result (dfloat).

Sinh
    Calculates the hyperbolic sine of an expression.
    Arguments: number (dfloat). Output: result (dfloat).

Sqrt
    Calculates the square root of a number.
    Arguments: number (dfloat). Output: result (dfloat).

Tan
    Calculates the trigonometric tangent of an angle.
    Arguments: number (dfloat). Output: result (dfloat).

Tanh
    Calculates the hyperbolic tangent of an expression.
    Arguments: number (dfloat). Output: result (dfloat).

Null Handling Functions

The following list describes the functions available in the Null Handling category (square brackets indicate an argument is optional):

HandleNull
    Sets a column’s in-band null value.
    Arguments: any (column), string (string). Output: none.

IsDFloatInBandNull
    Returns whether the given dfloat is an in-band null.
    Arguments: number (dfloat). Output: true/false (int8).

IsInt16InBandNull
    Returns whether the given integer is an in-band null.
    Arguments: number (int16). Output: true/false (int8).

IsInt32InBandNull
    Returns whether the given integer is an in-band null.
    Arguments: number (int32). Output: true/false (int8).

IsInt64InBandNull
    Returns whether the given integer is an in-band null.
    Arguments: number (int64). Output: true/false (int8).

IsNotNull
    Returns true when an expression does not evaluate to the null value.
    Arguments: any. Output: true/false (int8).

IsNull
    Returns true when an expression evaluates to the null value.
    Arguments: any. Output: true/false (int8).

IsSFloatInBandNull
    Returns whether the given sfloat is an in-band null.
    Arguments: number (dfloat). Output: true/false (int8).

IsStringInBandNull
    Returns whether the given string is an in-band null.
    Arguments: string (string). Output: true/false (int8).

MakeNull
    Changes an in-band null to an out-of-band null.
    Arguments: any (column), string (string). Output: none.

SetNull
    Assigns a null value to the target column.
    Arguments: none. Output: none.
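For example, a Transformer derivation might use IsNull to substitute a default value for a null column (the link and column names are illustrative):

If IsNull(DSLink2.price) Then 0 Else DSLink2.price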

Number Functions

The following list describes the functions available in the Number category (square brackets indicate an argument is optional):

MantissaFromDecimal
    Returns the mantissa from the given decimal.
    Arguments: number (decimal), [fixzero (int8)]. Output: result (dfloat).

MantissaFromDFloat
    Returns the mantissa from the given dfloat.
    Arguments: number (dfloat). Output: result (dfloat).

Raw Functions

The following list describes the functions available in the Raw category (square brackets indicate an argument is optional):

RawLength
    Returns the length of a raw string.
    Arguments: input string (raw). Output: result (int32).

String Functions

The following list describes the functions available in the String category (square brackets indicate an argument is optional):

AlNum
    Returns whether the given string consists of alphanumeric characters.
    Arguments: string (string). Output: true/false (int8).

Alpha
    Returns 1 if the string is purely alphabetic.
    Arguments: string (string). Output: result (int8).

CompactWhiteSpace
    Returns the string after reducing all consecutive whitespace to a single space.
    Arguments: string (string). Output: result (string).

Compare
    Compares two strings for sorting.
    Arguments: string1 (string), string2 (string), [justification (L or R)]. Output: result (int8).

CompareNoCase
    Case-insensitive comparison of two strings.
    Arguments: string1 (string), string2 (string). Output: result (int8).

CompareNum
    Compares the first n characters of the two strings.
    Arguments: string1 (string), string2 (string), length (int16). Output: result (int8).

CompareNumNoCase
    Caseless comparison of the first n characters of the two strings.
    Arguments: string1 (string), string2 (string), length (int16). Output: result (int8).

Convert
    Converts specified characters in a string to designated replacement characters.
    Arguments: fromlist (string), tolist (string), expression (string). Output: result (string).

Count
    Counts the number of times a substring occurs in a string.
    Arguments: string (string), substring (string). Output: result (int32).

Dcount
    Counts the number of delimited fields in a string.
    Arguments: string (string), delimiter (string). Output: result (int32).

DownCase
    Changes all uppercase letters in a string to lowercase.
    Arguments: string (string). Output: result (string).

DQuote
    Encloses a string in double quotation marks.
    Arguments: string (string). Output: result (string).

Field
    Returns 1 or more delimited substrings.
    Arguments: string (string), delimiter (string), occurrence (int32), [number (int32)]. Output: result (string).

Index
    Finds the starting character position of a substring.
    Arguments: string (string), substring (string), occurrence (int32). Output: result (int32).

Left
    Leftmost n characters of a string.
    Arguments: string (string), number (int32). Output: result (string).

Len
    Length of a string in characters.
    Arguments: string (string). Output: result (int32).

Num
    Returns 1 if the string can be converted to a number.
    Arguments: string (string). Output: result (int8).

PadString
    Returns the string padded with the optional pad character and optional length.
    Arguments: string (string), padlength (int32). Output: result (string).

Right
    Rightmost n characters of a string.
    Arguments: string (string), number (int32). Output: result (string).

Space
    Returns a string of N space characters.
    Arguments: length (int32). Output: result (string).

Squote
    Encloses a string in single quotation marks.
    Arguments: string (string). Output: result (string).

Str
    Repeats a string.
    Arguments: string (string), repeats (int32). Output: result (string).

StripWhiteSpace
    Returns the string after stripping all whitespace from it.
    Arguments: string (string). Output: result (string).

Trim
    Removes all leading and trailing spaces and tabs, plus reduces internal occurrences to one.
    Arguments: string (string), [stripchar (string)], [options (string)]. Output: result (string).

TrimB
    Removes all trailing spaces and tabs.
    Arguments: string (string). Output: result (string).

TrimF
    Removes all leading spaces and tabs.
    Arguments: string (string). Output: result (string).

TrimLeadingTrailing
    Returns a string with leading and trailing whitespace removed.
    Arguments: string (string). Output: result (string).

Upcase
    Changes all lowercase letters in a string to uppercase.
    Arguments: string (string). Output: result (string).
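For example, the following derivation combines two of these functions to strip whitespace from a column and convert it to uppercase (the link and column names are illustrative):

Upcase(StripWhiteSpace(DSLink1.custname))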

Type Conversion Functions

The following list describes the functions available in the Type Conversion category (square brackets indicate an argument is optional):

DateToString
    Returns the string representation of the given date.
    Arguments: date, [format (string)]. Output: result (string).

DecimalToDecimal
    Returns the given decimal in decimal representation with specified precision and scale.
    Arguments: decimal (decimal), [type (string)]*, [packedflag (int8)]. Output: result (decimal).

DecimalToDFloat
    Returns the given decimal in dfloat representation.
    Arguments: number (decimal), [fixzero (int8)]. Output: result (dfloat).

DecimalToString
    Returns the string representation of the given decimal.
    Arguments: number (decimal), [fixzero (int8)], [prefix (string)], [prefixlen (int32)], [suffix (string)], [suffixlen (int32)]. Output: result (string).

DfloatToDecimal
    Returns the given dfloat in decimal representation.
    Arguments: number (dfloat), [rtype (string)]. Output: result (decimal).

IsValid
    Returns whether the given string is valid for the given type.
    Arguments: type (string), format (string). Output: result (int8).

StringToDate
    Returns a date from the given string in the given format.
    Arguments: date (string), format (string). Output: date.

StringToDecimal
    Returns the given string in decimal representation.
    Arguments: string (string), [rtype (string)]. Output: result (decimal).

StringToRaw
    Returns a string in raw representation.
    Arguments: string (string). Output: result (raw).

StringToTime
    Returns a time representation of the given string.
    Arguments: string (string), [format (string)]. Output: time.

StringToTimestamp
    Returns a timestamp representation of the given string.
    Arguments: string (string), [format (string)]. Output: timestamp.

TimestampToDate
    Returns a date from the given timestamp.
    Arguments: timestamp. Output: date.

TimestampToString
    Returns the string representation of the given timestamp.
    Arguments: timestamp, [format (string)]. Output: result (string).

TimestampToTime
    Returns the time from a given timestamp.
    Arguments: timestamp. Output: time.

TimeToString
    Returns the string representation of the given time.
    Arguments: time, [format (string)]. Output: result (string).

* The type argument of DecimalToDecimal() is a string, and should contain one of the following:


• ceil. Round the source field toward positive infinity. E.g., 1.4 -> 2, -1.6 -> -1.

• floor. Round the source field toward negative infinity. E.g., 1.6 -> 1, -1.4 -> -2.

• round_inf. Round or truncate the source field toward the nearest representable value, breaking ties by rounding positive values toward positive infinity and negative values toward negative infinity. E.g., 1.4 -> 1, 1.5 -> 2, -1.4 -> -1, -1.5 -> -2.

• trunc_zero. Discard any fractional digits to the right of the rightmost fractional digit supported in the destination, regardless of sign. For example, if the destination is an integer, all fractional digits are truncated. If the destination is another decimal with a smaller scale, round or truncate to the scale size of the destination decimal. E.g., 1.6 -> 1, -1.6 -> -1.

The default is trunc_zero.
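For example, the following derivation converts a decimal column, rounding toward negative infinity (the link and column names are illustrative):

DecimalToDecimal(DSLink1.amount, "floor")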

Utility Functions

The following list describes the functions available in the Utility category (square brackets indicate an argument is optional):

GetEnvironment
    Returns the value of the given environment variable.
    Arguments: environment variable (string). Output: result (string).
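For example, the following derivation reads the value of an environment variable (the variable name here is illustrative):

GetEnvironment("APT_CONFIG_FILE")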



Appendix C. Header Files

DataStage comes with a range of header files that you can include in code when you are defining a Build stage. The following sections list the header files and the classes and macros that they contain. See the header files themselves for more details about available functionality.
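For example, Build stage code might include one of these headers to use its macros. The following is a minimal sketch only; the helper function and its check are hypothetical, not part of the supplied headers:

#include <apt_util/assert.h>   // APT_ASSERT() macro, listed below

// Hypothetical helper: assert that a row count is never negative.
static void checkRowCount(int rows)
{
    APT_ASSERT(rows >= 0);
}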

C++ Classes – Sorted By Header File

apt_framework/accessorbase.h
    APT_AccessorBase
    APT_AccessorTarget
    APT_InputAccessorBase
    APT_InputAccessorInterface
    APT_OutputAccessorBase
    APT_OutputAccessorInterface

apt_framework/adapter.h
    APT_AdapterBase
    APT_ModifyAdapter
    APT_TransferAdapter
    APT_ViewAdapter

apt_framework/collector.h
    APT_Collector

apt_framework/composite.h
    APT_CompositeOperator

apt_framework/config.h
    APT_Config
    APT_Node
    APT_NodeResource
    APT_NodeSet

apt_framework/cursor.h
    APT_CursorBase
    APT_InputCursor
    APT_OutputCursor

apt_framework/dataset.h
    APT_DataSet

apt_framework/fieldsel.h
    APT_FieldSelector

apt_framework/fifocon.h
    APT_FifoConnection

apt_framework/gsubproc.h
    APT_GeneralSubprocessConnection
    APT_GeneralSubprocessOperator

apt_framework/impexp/impexp_function.h
    APT_GFImportExport

apt_framework/operator.h
    APT_Operator

apt_framework/partitioner.h
    APT_Partitioner
    APT_RawField

apt_framework/schema.h
    APT_Schema
    APT_SchemaAggregate
    APT_SchemaField
    APT_SchemaFieldList
    APT_SchemaLengthSpec

apt_framework/step.h
    APT_Step


apt_framework/subcursor.h
    APT_InputSubCursor
    APT_OutputSubCursor
    APT_SubCursorBase

apt_framework/tagaccessor.h
    APT_InputTagAccessor
    APT_OutputTagAccessor
    APT_ScopeAccessorTarget
    APT_TagAccessor

apt_framework/type/basic/float.h
    APT_InputAccessorToDFloat
    APT_InputAccessorToSFloat
    APT_OutputAccessorToDFloat
    APT_OutputAccessorToSFloat

apt_framework/type/basic/integer.h
    APT_InputAccessorToInt16
    APT_InputAccessorToInt32
    APT_InputAccessorToInt64
    APT_InputAccessorToInt8
    APT_InputAccessorToUInt16
    APT_InputAccessorToUInt32
    APT_InputAccessorToUInt64
    APT_InputAccessorToUInt8
    APT_OutputAccessorToInt16
    APT_OutputAccessorToInt32
    APT_OutputAccessorToInt64
    APT_OutputAccessorToInt8
    APT_OutputAccessorToUInt16
    APT_OutputAccessorToUInt32
    APT_OutputAccessorToUInt64
    APT_OutputAccessorToUInt8

apt_framework/type/basic/raw.h
    APT_InputAccessorToRawField
    APT_OutputAccessorToRawField
    APT_RawFieldDescriptor

apt_framework/type/conversion.h
    APT_FieldConversion
    APT_FieldConversionRegistry

apt_framework/type/date/date.h
    APT_DateDescriptor
    APT_InputAccessorToDate
    APT_OutputAccessorToDate

apt_framework/type/decimal/decimal.h
    APT_DecimalDescriptor
    APT_InputAccessorToDecimal
    APT_OutputAccessorToDecimal

apt_framework/type/descriptor.h
    APT_FieldTypeDescriptor
    APT_FieldTypeRegistry

apt_framework/type/function.h
    APT_GenericFunction
    APT_GenericFunctionRegistry
    APT_GFComparison
    APT_GFEquality
    APT_GFPrint

apt_framework/type/protocol.h
    APT_BaseOffsetFieldProtocol
    APT_EmbeddedFieldProtocol
    APT_FieldProtocol
    APT_PrefixedFieldProtocol
    APT_TraversedFieldProtocol

apt_framework/type/time/time.h
    APT_TimeDescriptor
    APT_InputAccessorToTime
    APT_OutputAccessorToTime

apt_framework/type/timestamp/timestamp.h
    APT_TimeStampDescriptor
    APT_InputAccessorToTimeStamp
    APT_OutputAccessorToTimeStamp

apt_framework/utils/fieldlist.h
    APT_FieldList

apt_util/archive.h
    APT_Archive
    APT_FileArchive
    APT_MemoryArchive

apt_util/argvcheck.h
    APT_ArgvProcessor

apt_util/date.h
    APT_Date

apt_util/dbinterface.h
    APT_DataBaseDriver
    APT_DataBaseSource
    APT_DBColumnDescriptor

apt_util/decimal.h
    APT_Decimal

apt_util/endian.h
    APT_ByteOrder

apt_util/env_flag.h
    APT_EnvironmentFlag

apt_util/errind.h
    APT_Error

apt_util/errlog.h
    APT_ErrorLog

apt_util/errorconfig.h
    APT_ErrorConfiguration

apt_util/fast_alloc.h
    APT_FixedSizeAllocator
    APT_VariableSizeAllocator

apt_util/fileset.h
    APT_FileSet


apt_util/identifier.h
    APT_Identifier

apt_util/keygroup.h
    APT_KeyGroup

apt_util/locator.h
    APT_Locator

apt_util/persist.h
    APT_Persistent

apt_util/proplist.h
    APT_Property
    APT_PropertyList

apt_util/random.h
    APT_RandomNumberGenerator

apt_util/rtti.h
    APT_TypeInfo

apt_util/string.h
    APT_String
    APT_StringAccum

apt_util/time.h
    APT_Time
    APT_TimeStamp

C++ Macros – Sorted By Header File

apt_framework/accessorbase.h
    APT_DECLARE_ACCESSORS()
    APT_IMPLEMENT_ACCESSORS()

apt_framework/osh_name.h
    APT_DEFINE_OSH_NAME()
    APT_REGISTER_OPERATOR()

apt_framework/type/basic/conversions_default.h
    APT_DECLARE_DEFAULT_CONVERSION()
    APT_DECLARE_DEFAULT_CONVERSION_WARN()

apt_framework/type/protocol.h
    APT_OFFSET_OF()

apt_util/archive.h
    APT_DIRECTIONAL_SERIALIZATION()

apt_util/assert.h
    APT_ASSERT()
    APT_DETAIL_FATAL()
    APT_DETAIL_FATAL_LONG()
    APT_MSG_ASSERT()
    APT_USER_REQUIRE()
    APT_USER_REQUIRE_LONG()

apt_util/condition.h
    CONST_CAST()
    REINTERPRET_CAST()

apt_util/errlog.h
    APT_APPEND_LOG()
    APT_DUMP_LOG()
    APT_PREPEND_LOG()

apt_util/exception.h
    APT_DECLARE_EXCEPTION()
    APT_IMPLEMENT_EXCEPTION()

apt_util/fast_alloc.h
    APT_DECLARE_NEW_AND_DELETE()

apt_util/logmsg.h
    APT_DETAIL_LOGMSG()
    APT_DETAIL_LOGMSG_LONG()
    APT_DETAIL_LOGMSG_VERYLONG()

apt_util/persist.h
    APT_DECLARE_ABSTRACT_PERSISTENT()
    APT_DECLARE_PERSISTENT()
    APT_DIRECTIONAL_POINTER_SERIALIZATION()
    APT_IMPLEMENT_ABSTRACT_PERSISTENT()
    APT_IMPLEMENT_ABSTRACT_PERSISTENT_V()
    APT_IMPLEMENT_NESTED_PERSISTENT()
    APT_IMPLEMENT_PERSISTENT()
    APT_IMPLEMENT_PERSISTENT_V()

apt_util/rtti.h
    APT_DECLARE_RTTI()
    APT_DYNAMIC_TYPE()
    APT_IMPLEMENT_RTTI_BASE()
    APT_IMPLEMENT_RTTI_BEGIN()
    APT_IMPLEMENT_RTTI_END()
    APT_IMPLEMENT_RTTI_NOBASE()
    APT_IMPLEMENT_RTTI_ONEBASE()
    APT_NAME_FROM_TYPE()
    APT_PTR_CAST()
    APT_STATIC_TYPE()
    APT_TYPE_INFO()

