
  • IBM Information Server 8.x DataStage / QualityStage Fundamentals

    ValueCap Systems

    1

    ValueCap Systems - Proprietary

  • ValueCap Systems - Proprietary

    Introduction

    What is Information Server (IIS) 8.x?

    Suite of applications that share a common repository
    Common set of Application Services (hosted by the WebSphere application server)
    Data Integration toolset (ETL, profiling, and data quality)
    Employs a scalable parallel processing Engine
    Supports an N-tier layered architecture
    Newer version of the data integration/ETL tool set offered by IBM
    Web browser interface to manage security and authentication

    2

  • ValueCap Systems - Proprietary

    Product Suite

    IIS is organized into 4 layers:

    Client: Administration, Analysis, Development, and User Interface.

    Metadata Repository: Single repository for each install. Can reside in a DB2, Oracle, or SQL Server database. Stores configuration, design, and runtime metadata. DB2 is the supplied database.

    Domain: Common Services. Requires WebSphere Application Server. Single domain for each install.

    Engine: Core engine that runs all ETL jobs. The engine install includes connectors, packs, job monitors, performance monitors, the log service, etc.

    Note: The Metadata Repository, Domain, and Engine can reside on the same server or on separate servers. Multiple engines can exist in a single Information Server install.

    3

  • ValueCap Systems - Proprietary

    Detailed IS Architecture

    [Architecture diagram: the Client layer (DataStage & QualityStage client, Admin Console, Reporting Console) connects to the Domain layer (WebSphere Application Server hosting Metadata Services, Information Analyzer, WebSphere Business Glossary, Metadata Workbench, FastTrack, and the Import/Export Manager of the IBM WebSphere Metadata Server), which sits on the Metadata Repository layer (Metadata DB and the IADB profiling database) and the Engine layer (DataStage & QualityStage Engine). External data sources such as Erwin and Cognos feed metadata into the Metadata Server.]

    4

  • ValueCap Systems - Proprietary

    Information Server 8.1 Components

    Core Components:

    Information Analyzer: profiles and establishes an understanding of source systems, and monitors data rules on an ongoing basis to eliminate the risk of proliferating incorrect and inaccurate data.

    QualityStage: standardizes and matches information across heterogeneous sources.

    DataStage: extracts, transforms, and loads data between multiple sources and targets.

    Metadata Server: provides unified management, analysis, and interchange of metadata through a shared repository and services infrastructure.

    Business Glossary: defines data stewards, creates and manages business terms and definitions, and relates these to physical data assets.

    Metadata Workbench: provides unified management, analysis, and interchange of metadata through a shared repository and services infrastructure.

    FastTrack: easy-to-use import/export features allow business users to take advantage of a familiar Microsoft Excel interface and create new specifications.

    Federation Server: defines integrated views across diverse and distributed information sources, including cost-based query optimization and integrated caching.

    Information Services Director: enables information access and integration processes to be published as reusable services in an SOA.

    5

  • ValueCap Systems - Proprietary

    Information Server 8.1 Components

    Optional Components:

    Rational Data Architect: provides enterprise data modeling and information integration design capabilities.

    Replication Server: provides high-speed, event-based replication between databases for high availability, disaster recovery, data synchronization, and data distribution.

    Data Event Publisher: detects and responds to data changes in source systems, publishing changes to subscribed systems or feeding changed data into other modules for event-based processing.

    InfoSphere Change Data Capture: log-based Change Data Capture (CDC) technology, acquired in the DataMirror acquisition, that detects and delivers changed data across heterogeneous data sources such as DB2, Oracle, SQL Server, and Sybase. Supports service-oriented architectures (SOAs) by packaging real-time data transactions into XML documents and delivering them to and from messaging middleware such as WebSphere MQ.

    DataStage Pack for SAP BW (DataStage BW Pack): a companion product of IBM Information Server. The pack was originally developed to support SAP BW and currently supports both SAP BW and SAP BI. The GUIs of the DataStage BW Pack are installed on the DataStage Client; the runtime part of the pack is installed on the DataStage Server.

    6

  • IBM Information Server 8.x DataStage / QualityStage Fundamentals

    ValueCap Systems

    7

    ValueCap Systems - Proprietary

  • ValueCap Systems - Proprietary

    Course Objectives

    Upon completion of this course, you will be able to:
    Understand principles of parallel processing and scalability
    Understand how to create and manage a scalable job using DataStage
    Implement your business logic as a DataStage job
    Build, compile, and execute DataStage jobs
    Execute your DataStage jobs in parallel
    Enhance DataStage functionality by creating your own stages
    Import and export DataStage jobs

    8

  • ValueCap Systems - Proprietary

    Agenda

    1. DataStage Overview (Page 10)
    2. Parallel Framework Overview (Page 73)
    3. Data Import and Export (Page 116)
    4. Data Partitioning, Sorting, and Collection (Page 252)
    5. Data Transformation and Manipulation (Page 309)
    6. Data Combination (Page 364)
    7. Custom Components: Wrappers (Page 420)
    8. Custom Components: Buildops (Page 450)
    9. Additional Topics (Page 477)
    10. Glossary (Page 526)

    9

  • ValueCap Systems - Proprietary

    IS DataStage Overview

    In this section we will discuss:
    Product History
    Product Architecture
    Project setup and configuration
    Job Design
    Job Execution
    Managing Jobs and Job Metadata

    10

  • ValueCap Systems - Proprietary

    Product History

    Prior to IBM's acquisition of Ascential Software, Ascential had performed a series of its own acquisitions:

    Ascential started off as VMark before it became Ardent Software and introduced DataStage as an ETL solution.

    Ardent was then acquired by Informix and, through a reversal of fortune, Ardent management took over Informix.

    Informix was then sold to IBM, and Ascential Software was spun out with approximately $1 billion in the bank as a result.

    Ascential Software kept DataStage as its cash-cow product, but started focusing on a bigger picture: Data Integration for the Enterprise.

    11

  • ValueCap Systems - Proprietary

    Product History (Continued)

    With plenty of money in the bank and a weakening economy, Ascential embarked upon a phase of acquisitions to fulfill its vision of becoming the leading Data Integration software provider.

    DataStage Standard Edition was the original DataStage product and is also known as DataStage Server Edition. Server Edition will be going away with the Hawk release later in 2006.

    DataStage Enterprise Edition was originally Orchestrate, which had been renamed to Parallel Extender after the Torrent acquisition.

    Vality's Integrity was renamed to QualityStage.

    DataStage TX was originally known as Mercator and was renamed when purchased by Ascential.

    ProfileStage was once Metagenix's MetaRecon software.

    12

  • ValueCap Systems - Proprietary

    Product History (Continued)

    By 2004, Ascential had completed its acquisitions and turned its focus onto completely integrating the acquired technologies. Ascential's Data Integration Suite:

    [Diagram: DISCOVER (discover data content and structure) with ProfileStage; PREPARE (standardize, match, and correct data) with QualityStage; TRANSFORM and DELIVER (transform, enrich, and deliver data) with DataStage; all built on the Parallel Execution Engine and sharing Meta Data Management, Real-Time Integration Services, Enterprise Connectivity and Event Management, and a Service-Oriented Architecture.]

    13

  • ValueCap Systems - Proprietary

    Product History (Continued)

    In 2005, IBM acquired Ascential. In November of 2006, IBM released Information Server version 8, which included WebSphere Application Server, DataStage, QualityStage, and other tools, some of which are part of the standard install and some of which are optional: FastTrack, Metadata Workbench, Information Analyzer (formerly ProfileStage), WebSphere Federation Server, and others.

    14

  • ValueCap Systems - Proprietary

    Old DataStage Client/Server Architecture

    4 Clients, 1 Server

    Clients (Microsoft Windows NT/2K/XP/2003): Designer, Director, Manager, Administrator

    Server (Windows or UNIX: AIX, Solaris, TRU64, HP-UX, USS): DataStage Enterprise Edition Framework and DataStage Repository

    15

  • ValueCap Systems - Proprietary

    New DataStage Client/Server Architecture

    3 Clients, 1 (or more) Server(s)

    Clients (Microsoft Windows XP/2003/Vista): Designer, Director, Administrator

    Server (Windows or UNIX: Linux, AIX, Solaris, HP-UX, USS): DataStage Enterprise Edition Framework, Common Repository, Application Server

    The Common Repository can be on a separate server.

    The default J2EE-compliant Application Server is WebSphere Application Server.

    Clients now handle both DataStage and QualityStage.

    There is no more Manager client.

    16

  • ValueCap Systems - Proprietary

    DataStage Clients: Administrator

    DataStage Administrator:
    Manage licensing details
    Create, update, and administer projects and users
    Manage environment variable settings for the entire project

    17

  • ValueCap Systems - Proprietary

    DataStage Administrator Logon

    When first connecting to the Administrator, you will need to provide the following:
    Server address where the DataStage repository was installed
    Your userid
    Your password
    Assigned project

    18

  • ValueCap Systems - Proprietary

    DataStage Administrator Projects

    Next, click on Add to create a new DataStage project.

    In this course, each student will create his/her own project. In typical development environments, many developers can work on the same project.

    Project paths / locations can be customized:

    C:\IBM\InformationServer\Projects\ANALYZEPROJECT

    C:\IBM\InformationServer\Projects\

    19

  • ValueCap Systems - Proprietary

    DataStage Administrator Projects

    Once a project has been created, it is populated with default settings. To change these defaults, click on the Properties button to bring up the Project Properties window.

    Next, click on the Environment button.

    C:\IBM\InformationServer\Projects\Sample

    20

  • ValueCap Systems - Proprietary

    DataStage Administrator Environment

    This window displays all of the default environment variable settings, as well as the user-defined environment variables.

    Click here when done.

    Do not change any values for now.

    21

  • ValueCap Systems - Proprietary

    DataStage Administrator Other Options

    Useful options to set for all projects include:

    Enable job administration in Director: allows various administrative actions to be performed on jobs via the Director interface.

    Enable Runtime Column Propagation for Parallel Jobs (aka RCP): a feature which allows column metadata to be automatically propagated at runtime. More on this later.

    22

  • ValueCap Systems - Proprietary

    DataStage Clients: Designer

    DataStage Designer:
    Develop DataStage jobs or modify existing jobs
    Compile jobs
    Execute jobs
    Monitor job performance
    Manage table definitions
    Import table definitions
    Manage job metadata
    Generate job reports

    23

  • ValueCap Systems - Proprietary

    DataStage Designer Login

    After logging in, you should see a screen similar to this:

    24

  • ValueCap Systems - Proprietary

    DataStage Designer Where to Start?

    Select to create a new DataStage job

    Open any existing DataStage job

    Open a job that you were recently working on

    For the majority of lab exercises, you will be selecting Parallel Job or using the Existing and Recent tabs.

    25

  • ValueCap Systems - Proprietary

    DataStage Designer Elements

    Indicates the Parallel canvas (i.e. a Parallel DataStage job)

    These boxes can be docked in various locations within this interface. Just click and drag them around.

    Icons can be made larger by right-clicking inside to access the menu. Categories can be edited and customized as well.

    The DataStage Designer user interface can be customized to your preferences. Here are just a few of the options.

    26

  • ValueCap Systems - Proprietary

    DataStage Designer Toolbar

    Toolbar icons (left to right): New Job, Open Existing Job, Save / Save All, Job Properties, Job Compile, Run Job, Grid Lines, Snap to Grid, Link Markers, Zoom In / Out.

    These are some of the useful icons you will become very familiar with as you get to know DataStage. Note that if you let the mouse pointer hover over any icon, a tooltip will appear.

    27

  • ValueCap Systems - Proprietary

    DataStage Designer Paradigm

    Left-click and drag the stage(s) onto the canvas.

    You can also left-click on the stage once, then position your mouse cursor on the canvas and left-click again to place the chosen stage there.

    28

  • ValueCap Systems - Proprietary

    DataStage Designer Paradigm

    To create the link, you can right-click on the upstream stage and drag the mouse pointer to the downstream stage. This will create a link as shown here.

    Alternatively, you can select the link icon from the General category in your Palette by left-clicking on it.

    29

  • ValueCap Systems - Proprietary

    DataStage Designer Design Feedback

    When Show stage validation errors under the Diagram menu is selected (the default), DataStage Designer uses visual cues to alert users that there's something wrong.

    Placing the mouse cursor over an exclamation mark on a stage will display a message indicating what the problem is.

    A red link indicates that the link cannot be left dangling; it must have a source and/or target attached to it.

    30

  • ValueCap Systems - Proprietary

    DataStage Designer Labels

    You may notice that the default labels created on the stages and links are not very intuitive.

    You can easily change them by left-clicking once on the stage or link and then typing a more appropriate label. This is considered a best practice; you will understand why shortly.

    Labels can also be changed by right-clicking on the stage or link and selecting the Rename option.

    31

  • ValueCap Systems - Proprietary

    DataStage Designer Stage PropertiesDouble-clicking on any stage on the canvas or right-clicking and selecting Properties will bring up the options dialogue for that particular stage.

    Almost all stages will require you to open and edit their properties and set them to appropriate values.

    However, almost all property dialogues follow the same paradigm.

    32

  • ValueCap Systems - Proprietary

    DataStage Designer Stage Properties

    Here's an example of a fairly common stage properties dialogue box.

    The Properties tab will always contain the stage specific options. Mandatory entries will be highlighted red.

    The Input tab allows you to view the incoming data layout as well as define data partitioning (we will cover this in detail later).

    The Output tab allows you to view and map the outgoing data layout.

    33

  • ValueCap Systems - Proprietary

    DataStage Designer Stage Input

    Once you've changed the link label to something more appropriate, it will be easier to track your metadata. This is especially true if there are multiple inputs or outputs.

    We will discuss partitioning in detail later

    Another useful feature on the Input properties tab is the fact that you can see what the incoming data layout looks like.

    34

  • ValueCap Systems - Proprietary

    DataStage Designer Stage Output

    On the Output tab, there is a Mapping tab and another Columns tab.

    Note that the columns are missing on the Output side. Where did they go? We saw them on the Input, right?

    The answer lies in the Mapping tab. This is the Source to Target mapping paradigm you will find throughout DataStage. It is a means of propagating design-time metadata from source to target

    35

  • ValueCap Systems - Proprietary

    DataStage Designer Field Mapping

    Source to Target mapping is achieved by 2 methods in DataStage:

    Left-clicking and dragging a field or collection of fields from the Source side (left) to the Target side (right).

    Left-clicking on the Columns bar on the Source side and dragging it into the Target side. This is illustrated above.

    When performed correctly, you will see the Target side populated with some or all of the fields from the Source side, depending on your selection.

    36

  • ValueCap Systems - Proprietary

    DataStage Designer Field Mapping

    Once the mapping is complete, you can go back into the Output Columns tab and you will notice that all of the fields you've mapped from Source to Target now appear under the Columns tab.

    You may have also noticed the Runtime column propagation option below the columns. This is here because we enabled it in the Administrator. If you do not see this option, it is likely because it did not get enabled.

    37

  • ValueCap Systems - Proprietary

    DataStage Designer RCP

    What is Runtime Column Propagation?

    A powerful feature which allows you to bypass Source to Target mapping.

    At runtime (not design time), it will automatically propagate all source columns to the target for all stages in your job.

    What this means: if you are reading in a database table with 200 columns/fields, and your business logic only affects 2 of those columns, then you only need to specify those 2 out of 200 columns and enable RCP to handle the rest.

    38

  • ValueCap Systems - Proprietary

    DataStage Designer Mapping vs RCP

    So, why Map when you can RCP?

    Design time vs runtime consideration: when working on a job flow that affects many fields, it is easier to have the metadata there to work with.

    Mapping also provides explicit documentation of what is happening.

    Note that RCP can be combined with Mapping: enable RCP by default, and then turn it off when you only want to propagate a subset of fields. Do this by mapping only the fields you need.

    It is often better to keep RCP enabled at all times, but be careful when you only want to keep certain columns and not others!

    39

  • ValueCap Systems - Proprietary

    DataStage Designer Table Definitions

    Table Definitions in DataStage are the same as a table layout or schema.

    You can manually enter everything, and these definitions can be saved for re-use later. Specify the location where the table definition is to be saved. Once saved, the table definition can be accessed from the repository view.

    40

  • ValueCap Systems - Proprietary

    DataStage Designer Metadata Import

    Table Definitions can also be automatically generated by translating definitions stored in various formats. Popular options include COBOL copybooks and RDBMS table layouts.

    RDBMS layouts can be accessed via a couple of options:
    ODBC Table Definitions
    Orchestrate Schema Definitions (via the orchdbutil option)
    Plug-in Meta Data Definitions

    41

  • ValueCap Systems - Proprietary

    DataStage Designer Job Properties

    The Parameters tab allows users to add environment variables, both pre-defined and user-defined.

    Once selected, a variable will show up in the Job Properties window. The default value can be altered to a different value.

    Parameters can be used to control job behavior as well as be referenced within stages, allowing simple adjustment of properties without having to modify the job itself.
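    As a brief, hedged sketch of how that referencing looks in practice (the parameter name and path below are hypothetical, not from the course material), a stage property such as a file path can embed a job parameter using the #parameter# notation, so the value is resolved at run time from the Job Run Options dialogue:

        File = #SourceDir#/customers.txt

    where SourceDir is defined on the Parameters tab with a default value such as /data/input.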

    42

  • ValueCap Systems - Proprietary

    DataStage Designer Job Compile/Run

    Before a job can be executed, it must first be saved and compiled. Compilation will validate that all necessary options are set and defined within each of the stages in the job.

    Toolbar buttons: Compile, Run

    To run the job, just click on the run button on the Designer. Alternatively, you can also click on the run button from within the Director.

    The Director will contain the job run log, which provides much more detail than the Designer will.

    43

  • ValueCap Systems - Proprietary

    DataStage Designer Job Statistics

    As a job is executing, you can right-click on the canvas and select Show performance statistics to monitor your job's performance.

    Note that the link colors signify job status. Blue means it is running and green means it has completed. If a link is red, then the job has aborted due to an error.

    44

  • ValueCap Systems - Proprietary

    DataStage Designer Export

    The Designer is also used for exporting and importing DataStage jobs, table definitions, routines, containers, etc. Items can be exported in 1 of 2 formats: DSX or XML. DSX is DataStage's internal format. Both formats can be opened and viewed in a standard text editor.

    We do not recommend altering the contents unless you really know what you are doing!

    45

  • ValueCap Systems - Proprietary

    DataStage Designer Export

    You can export the contents of the entire project, or individual components. You can also export items into an existing file by selecting the Append to existing file option.

    Exported projects, depending on the total number of jobs, can grow to be several megabytes. However, these files can be easily compressed.

    46

  • ValueCap Systems - Proprietary

    DataStage Designer Import

    Previously exported items can be imported via the Designer. You can choose to import everything or only selected content.

    DSX files from previous versions of DataStage can also be imported. The upgrade to the current version will occur on the fly as the content is being imported into the repository.

    47

  • ValueCap Systems - Proprietary

    DataStage Clients: Director

    DataStage Director:
    Execute DataStage jobs
    Compile jobs
    Reset jobs
    Schedule jobs
    Monitor job performance
    Review job logs

    48

  • ValueCap Systems - Proprietary

    DataStage Director Access

    The easiest way to access the Director is from within the Designer. This bypasses the need to log in again.

    Alternatively, you can double-click on the Director icon to bring up the Director interface.

    49

  • ValueCap Systems - Proprietary

    DataStage Director Interface

    The Directors default interface shows a list of Jobs along with their status.

    You will be able to see whether jobs are compiled, how long they took to run, and when they were last run.

    50

  • ValueCap Systems - Proprietary

    DataStage Director Toolbar

    Toolbar icons (left to right): Open Project, Job Status View, Job Scheduler, Job Log, Reset Job, Run Job.

    These are some of the useful icons you will become very familiar with as you get to know DataStage. Note that if you let the mouse pointer hover over any icon, a tooltip will appear.

    51

  • ValueCap Systems - Proprietary

    DataStage Director Interface

    Whenever a job runs, you can view the job log in the Director.

    Current entries are in black, whereas previous runs will show up in blue.

    Double-click on any entry to access more details. What you see here is often just a summary view.

    52

  • ValueCap Systems - Proprietary

    DataStage Director Monitor

    To enable job monitoring from within the Director, go to the Tools menu and select New Monitor.

    You can set the update intervals as well as specify which statistics you would like to see.

    Colors correspond to status. Blue means a job is running, green means it has finished, and red indicates a failure.

    53

  • IBM Information Server DataStage / QualityStage Fundamentals Labs

    ValueCap Systems

    54

    ValueCap Systems - Proprietary

  • ValueCap Systems - Proprietary

    Lab 1A: Project Setup & Configuration

    55

  • ValueCap Systems - Proprietary

    Lab 1A Objective

    Learn to set up and configure a simple project for IBM Information Server DataStage / QualityStage.

    56

  • ValueCap Systems - Proprietary

    Creating a New Project

    Log into the DataStage Administrator using the userid and password provided to you by the instructor. Steps are outlined in the course material.

    Click on the Add button to create a new project. Your instructor may advise you on a project name; do not change the default project directory.

    Click OK when finished.

    57

  • ValueCap Systems - Proprietary

    Project Setup

    Click on the new project you have just created and select the Properties button.

    Under the General tab, check the boxes next to:
    Enable job administration in the Director
    Enable Runtime Column Propagation for Parallel Jobs

    Next, click on the Environment button to bring up the Environment Variables editor.

    58

  • ValueCap Systems - Proprietary

    Environment Variable Settings

    The Environment Variables editor should be similar to the screen shot shown here. We only need to change a couple of values:

    APT_CONFIG_FILE: the instructor will provide the value.

    Click on Reporting and set APT_DUMP_SCORE to TRUE.

    The instructor will provide details if any other environment variable needs to be defined.

    59

  • ValueCap Systems - Proprietary

    Setting APT_CONFIG_FILE defines the default configuration file used by jobs in the project.

    Setting APT_DUMP_SCORE will enable additional diagnostic information to appear in the Director log.

    Click the OK button when finished editing Environment Variables.

    Click OK and then Close to exit the Administrator. You have now finished configuring your project.
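    For reference, the same two variables can also be set in a server-side shell session for engine-level work; a minimal sketch, assuming a typical engine install path, which may differ on your system:

        export APT_CONFIG_FILE=/opt/IBM/InformationServer/Server/Configurations/default.apt
        export APT_DUMP_SCORE=TRUE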

    60

  • ValueCap Systems - Proprietary

    Lab 1B: Designer Walkthrough

    61

  • ValueCap Systems - Proprietary

    Lab 1B Objective

    Become familiar with DataStage Designer.

    62

  • ValueCap Systems - Proprietary

    Getting Into the Designer

    Log into the DataStage Designer using the userid and password provided to you by the instructor. Be sure to select the project you just created when logging in.

    Once connected, select the Parallel Job option and click on OK.

    You should see a blank canvas with the Parallel label in the upper left hand corner.

    63

  • ValueCap Systems - Proprietary

    Create a Simple Job

    We will construct the following simple job:

    Use the techniques covered in the lecture material to build the job.

    The job consists of a Row Generator stage and a Peek stage. For the Row Generator, you will need to enter the following table definition:

    Alter the stage and link labels to match the diagram above.

    64

  • ValueCap Systems - Proprietary

    Compile and Run the Job

    Save the job as lab1b. Click on the Compile button.

    Did the job compile successfully? If not, can you determine why not? Try to correct the problem(s) in order to get the job to compile.

    Once the job has compiled successfully, right-click on the canvas and select Show performance statistics.

    Click on the Job Run button. Once your job finishes executing, you should see the following output:

    65

  • ValueCap Systems - Proprietary

    Lab 1C: Director Walkthrough

    66

  • ValueCap Systems - Proprietary

    Lab 1C Objective

    Become familiar with DataStage Director.

    67

  • ValueCap Systems - Proprietary

    Getting Into the Director

    Log into the DataStage Director using the userid and password provided to you by the instructor. You can also use the shortcut shown in the course materials.

    Once connected, you should see the status of lab1b, which was just executed from within the Designer:

    68

  • ValueCap Systems - Proprietary

    Viewing the Job Log

    Click on the Job Log button on the toolbar to access the log for lab1b.

    The log should be very similar to the screenshot here:

    There should not be any red (error) icons.

    69

  • ValueCap Systems - Proprietary

    Director Job Log

    Take a closer look at some of the entries in the log. Double-click on the following highlighted selections:

    The first one shows the configuration file being used. The next few entries show the output of the Peek stage.

    Also note the Job Status.

    70

  • ValueCap Systems - Proprietary

    Stage Output

    The Peek stage output in the Director log should be similar to the following:

    Peek stage is similar to inserting a Print statement into the middle of a program.

    Where did this data come from? The data was generated by the Row Generator stage! You will learn more about this powerful stage in later sections & labs.

    71

  • ValueCap Systems - Proprietary

    Agenda

    1. DataStage Overview (Page 10)
    2. Parallel Framework Overview (Page 73)
    3. Data Import and Export (Page 116)
    4. Data Partitioning, Sorting, and Collection (Page 252)
    5. Data Transformation and Manipulation (Page 309)
    6. Data Combination (Page 364)
    7. Custom Components: Wrappers (Page 420)
    8. Custom Components: Buildops (Page 450)
    9. Additional Topics (Page 477)
    10. Glossary (Page 526)

    72

  • ValueCap Systems - Proprietary

    Parallel Framework Overview

    In this section we will discuss:
    Hardware and Software Scalability
    Traditional processing
    Parallel processing
    Configuration File overview
    Parallel Datasets

    73

  • ValueCap Systems - Proprietary

    Scalability

    Scalability is a term often used in product marketing but seldom well defined:

    Hardware vendors claim their products are highly scalable: computers, storage, network.

    Software vendors claim their products are highly scalable: RDBMS, middleware.

    74

  • ValueCap Systems - Proprietary

    Scalability Defined

    How should scalability be defined? Well, that depends on the product. For Parallel DataStage:

    The ability to process a fixed amount of data in decreasing amounts of time as hardware resources (CPU, memory, storage) are increased.

    It could also be defined as the ability to process growing amounts of data by increasing hardware resources accordingly.

    75

  • ValueCap Systems - Proprietary

    Scalability Illustrated

    [Chart: runtime versus hardware resources (CPU, memory, etc.), comparing linear, poor, and super-linear scalability curves.]

    Linear Scalability: runtime decreases as the amount of hardware resources is increased. For example, a job that takes 8 hours to run on 1 CPU will take 4 hours on 2 CPUs, 2 hours on 4 CPUs, and 1 hour on 8 CPUs.

    Poor Scalability: results when running time no longer improves as additional hardware resources are added.

    Super-linear Scalability: occurs when the job performs better than linear as the amount of hardware resources is increased.

    (Assumes that data volumes remain constant.)

    76

  • ValueCap Systems - Proprietary

    Hardware Scalability

    Hardware vendors achieve scalability by:
    Using multiple processors
    Having large amounts of memory
    Installing fast storage mechanisms
    Leveraging a fast backplane
    Using very high bandwidth, high speed networking solutions

    77

  • ValueCap Systems - Proprietary

    Examples of Scalable Hardware

    SMP: 1 physical machine with 2 or more processors and shared memory.

    MPP: 2 or more SMPs interconnected by a high bandwidth, high speed switch. Memory between nodes of an MPP is not shared.

    Cluster: more than 2 computers connected together by a network. Similar to MPP.

    Grid: several computers networked together. Computers can be dynamically assigned to run jobs.

    78

  • ValueCap Systems - Proprietary

    Software Scalability

    Software scalability can occur via:
    Executing on scalable hardware
    Effective memory utilization
    Minimizing disk I/O
    Data partitioning
    Multi-threading
    Multi-processing

    79

  • ValueCap Systems - Proprietary

    Software Scalability DS EE

    Parallel DataStage achieves scalability in a variety of ways:
    Data pipelining
    Data partitioning
    Minimizing disk I/O
    In-memory processing

    We will explore these concepts in detail!

    80

  • ValueCap Systems - Proprietary

    The Parallel Framework

    The Engine layer consists, in large part, of the Parallel Framework (aka Orchestrate).

    The Framework is written in C++ and has a published and documented API.

    DS/QS jobs run on top of the Framework via OSH. OSH is a scripting language much like Korn shell. The Designer client will generate OSH automatically.

    The Framework relies on a configuration file to determine the level of parallelism during job execution.
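    As an illustrative sketch only (hand-written rather than Designer-generated, with operator options simplified and from memory), an OSH data flow reads much like a shell pipeline of parallel operators:

        osh "generator -schema record(id:int32;) -records 10 | peek"

    Here the generator operator produces 10 rows with a single int32 column and pipes them to the peek operator, which echoes them to the log.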

    81

  • ValueCap Systems - Proprietary

    Parallel Framework

    [Diagram: a DataStage job executes on the Parallel Framework at runtime; the Framework references a Configuration File.]

    The Configuration File contains a virtual map of the available system resources.

    The Framework will reference the Configuration File to determine the degree of parallelism for the job at runtime.

    82

  • ValueCap Systems - Proprietary

    Traditional Processing

    Suppose we are interested in implementing the following business logic, where A, B, and C represent specific data transformation processes:

        file -> A -> B -> C -> RDBMS

    Manual implementation of the business logic typically results in the following:

        file -> A -> disk -> B -> disk -> C -> disk (staging areas) -> invoke loader -> RDBMS

    While the above solution works and eventually delivers the correct results, problems will occur when data volumes increase and/or batch windows decrease! Disk I/O is the slowest link in the chain. Sequential processing prohibits scalability.

    83

  • ValueCap Systems - Proprietary

    Data Pipelining

    What if, instead of persisting data to disk between processes, we could move the data between processes in memory?

        Before: file -> A -> disk -> B -> disk -> C -> disk (staging areas) -> invoke loader -> RDBMS
        After:  file -> A -> B -> C -> RDBMS

    The application will certainly run faster simply because we are now avoiding the disk I/O that was previously present.

    This concept is called data pipelining. Data continuously flows from Source to Target, through the individual transformation processes. The downstream process no longer has to wait for all of the data to be written to disk; it can begin processing as soon as the upstream process is finished with the first record!

    84

  • ValueCap Systems - Proprietary

    Data Partitioning

    Parallel processing would not be possible without data partitioning. We will devote an entire lecture to this subject later in this course. For now, think of partitioning as the act of distributing records into separate partitions for the purpose of dividing the processing burden from one processor among many.

    [Diagram: a Data File feeds a Partitioner, which splits the stream into four partitions: records 1-1000, 1001-2000, 2001-3000, and 3001-4000.]

    85

  • ValueCap Systems - Proprietary

    Parallel Processing

    By combining data pipelining and partitioning, you can achieve what people typically envision as being parallel processing:

    [Diagram: Input file -> A -> B -> C -> RDBMS, with each stage running across multiple partitions.]

    In this model, data flows from source to target, upstream stage to downstream stage, while remaining in the same partition throughout the entire job. This is often referred to as partitioned parallelism.

    86

  • ValueCap Systems - Proprietary

    Pipeline Parallel Processing

    There is, however, a more powerful way to perform parallel processing. We call this spaghetti pipeline parallelism.

    [Diagram: Input file -> A -> B -> C -> RDBMS, with records free to move between partitions at each stage.]

    What makes pipeline parallelism powerful is the following:
    Records are not bound to any given partition
    Records can flow down any partition
    This prevents backups and hotspots from occurring in any given partition
    The parallel framework does this by default!

    87

  • ValueCap Systems - Proprietary

    Pipeline Parallelism Example

    Suppose you are traveling from point A to point B along a 6-lane toll-way. Between the start and end points, there are 3 toll stations your car must pass through and pay a toll.

    During your journey, you will most likely change lanes. These lanes are just like partitions.

    During your journey, you will likely use the toll station with the fewest cars waiting. Think about the fact that other cars are doing the same!

    Each car is like a record, and toll stations are processes.

    What would happen if you were stuck in a single lane during the entire journey?

    This is a simple real-world example of pipeline parallelism!

    88

  • ValueCap Systems - Proprietary

    Configuration Files

    Configuration files are used by the Parallel Framework to determine the degree of parallelism for a given job.

    Configuration files are plain text files which reside on the server side.

    Several configuration files can co-exist; however, only one can be referenced at a time by a job.

    Configuration files have a minimum of one processing node defined and no maximum.

    They can be edited through the Designer, or with vi or other text editors.

    The syntax is pretty simple and highly repetitive.

    89

  • ValueCap Systems - Proprietary

    Configuration File Example

    Here is a sample configuration file which will allow a job to run 4-way parallel. The paths will be different for Windows installations.

    {
      node "node_1" {
        fastname "dev_server"
        pools ""
        resource disk "/data/work" {}
        resource scratchdisk "/data/scratch" {}
      }
      node "node_2" {
        fastname "dev_server"
        pools ""
        resource disk "/data/work" {}
        resource scratchdisk "/data/scratch" {}
      }
      node "node_3" {
        fastname "dev_server"
        pools ""
        resource disk "/data/work" {}
        resource scratchdisk "/data/scratch" {}
      }
      node "node_4" {
        fastname "dev_server"
        pools ""
        resource disk "/data/work" {}
        resource scratchdisk "/data/scratch" {}
      }
    }

    fastname: hostname of the ETL server; an IP address can also be used.
    node: label for each node; can be anything, but needs to be different for each node.
    resource disk: location for parallel dataset storage; used to spread I/O; can have multiple entries per node.
    resource scratchdisk: location for temporary scratch file storage; used to spread I/O; can have multiple entries per node.

    90

  • ValueCap Systems - Proprietary

    Reading & Writing Parallel Datasets

    Suppose that in each scenario illustrated below, we are reading in or writing out 4,000 records. Which performs better?

    [Diagram, reading: a single Data File feeding a Partitioner that splits records 1-4000 across four partitions, versus four Data Files each read in parallel directly into its own partition.]

    [Diagram, writing: four partitions funneled through a Collector into a single Data File, versus four partitions each written in parallel to its own Data File.]

    91

  • ValueCap Systems - Proprietary

    Parallel Dataset Advantage

    [Diagram: records 1-4000 read from and written to four Data Files in parallel.]

    Being able to read and write data in parallel will almost always be faster and more scalable than reading or writing data sequentially.

    Parallel Datasets perform better because:
    Data I/O is distributed instead of sequential, thus removing a bottleneck.
    Data is stored using a format native to the Parallel Framework, thus eliminating the need for the Framework to re-interpret data contents.
    Data can be stored and read back in a pre-partitioned and sorted manner.

    92

  • ValueCap Systems - Proprietary

    Parallel Dataset Mechanics

    Datasets are made up of several small fragments, or data files.

    Fragments are stored per the resource disk entries in the configuration file. This is where distributing the I/O becomes important!

    Datasets are very much dependent on configuration files. It's a good practice to read a dataset using the same configuration file that was originally used to create it.

    93

  • ValueCap Systems - Proprietary

    Using Parallel Datasets

    Parallel datasets should use a .ds extension. The .ds file is only a descriptor file containing metadata and the location of the actual data files.

    When writing data to a parallel dataset, be sure to specify whether to create, overwrite, append, or insert.

    94

  • ValueCap Systems - Proprietary

    Browsing Datasets

    The Dataset viewer can be accessed from the Tools menu in the Designer. Use the Dataset viewer to see all metadata as well as the records stored within the dataset.

    Alternatively, if all you want to do is browse the records in the dataset, you can use the View Data button in the properties window for the Data Set stage.

    The viewer can also be used for deleting datasets, which removes the underlying data file fragments along with the .ds descriptor.
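    On the server side there is also a command-line utility, orchadmin, that understands the descriptor/fragment layout; a hedged sketch (subcommand names are from memory and the path is hypothetical):

        orchadmin describe /data/output/customers.ds
        orchadmin rm /data/output/customers.ds

    The first command reports the dataset's metadata; the second removes the descriptor and all of its fragments, which a plain file delete of the .ds descriptor would not do.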

    95

  • ValueCap Systems - Proprietary

    Lab 2A: Simple Configuration File

    96

  • ValueCap Systems - Proprietary

    Lab 2A Objectives

    Learn to create a simple configuration file and validate its contents.

    Note: You will need to leverage skills learned during previous labs to complete subsequent labs.

    97

  • ValueCap Systems - Proprietary

    Creating a Configuration File

    Log into the DataStage Designer using your assigned userid and password.

    Click on the Tools menu to select Configurations

    98

  • ValueCap Systems - Proprietary

    Configuration File Editor

    The Configuration File editor should pop up, similar to the one you see here.

    Click on New and select default. We will use this as our starting point to create another config file.

    99

  • ValueCap Systems - Proprietary

    Checking the Configuration File

    Once you have opened the default configuration file, click on the Check button at the bottom. This action will validate the contents of the configuration file.

    Always do this after you have created a configuration file. If it fails this simple test, then there is no way any job will run using this configuration file!

    What is in your configuration file will depend on the hardware environment you are using (i.e. the number of CPUs). For example, on a 4-CPU system, you will likely see a configuration file with 4 node entries defined.

    100

  • ValueCap Systems - Proprietary

    Editing the Configuration File

    At this point, how many nodes do you see defined in your default configuration file? Remember, this dictates how many ways parallel your job will run. If you see 8 node entries, then your job will run 8-way parallel.

    Regardless of how many CPUs your system has, edit the configuration file and create as many node entries as you have CPUs. The default may already have the nodes defined.

    Copy and paste is the fastest way to do this if you need to add nodes. Keep in mind that node names need to be unique, while everything else can stay the same! Pay attention to the { }s!!!

    Your instructor may choose to provide you with alternate resource disk and resource scratchdisk locations to use.

    101

  • ValueCap Systems - Proprietary

    Save and Check the Config File

    Once you have finished editing the configuration file, click on the Save button and save it as something other than default.

    Suggestions include using your initials along with the number of nodes defined. This helps prevent other students from accidentally using the wrong configuration file. For example: JD_Config_4node.apt

    Once you have saved your configuration file, click on the Check button again at the bottom. This action will validate the contents of your configuration file. Again, always do this after you have created a configuration file. If it fails this simple test, then there is no way any job will run using this configuration file!

    If the validation fails, use the error message to determine what the problem is. Correct the problem and repeat the above step.

    102

  • ValueCap Systems - Proprietary

    Save and Check the Configuration File

    Next, re-edit the configuration file you just created (and validated) and remove all node entries except for the first one.

    Check it again and, if no errors are returned, save it as a 1-node configuration using the same nomenclature you applied to the multi-node configuration file you previously created. For example: JD_Config_1node.apt

    Note: when you check the configuration, it may prompt you to save it first. You can check the configuration without saving it, but always remember to save it once it passes the validation test.
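    For reference, the resulting 1-node configuration is simply the first node entry from the earlier 4-node example wrapped in the outer braces (same fastname and resource paths):

    {
      node "node_1" {
        fastname "dev_server"
        pools ""
        resource disk "/data/work" {}
        resource scratchdisk "/data/scratch" {}
      }
    }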

    103

  • ValueCap Systems - Proprietary

    Checking the Configuration File

    What does Parallel DataStage do when it is checking the config file?

    Validates syntax:
    Correct placement of all { }, quotes, etc.
    Correct spelling and use of keywords such as node, fastname, resource disk, resource scratchdisk, pools, etc.

    Validates information:
    The fastname entry should match the hostname or IP address.
    rsh permissions, if necessary, are in place.
    Read and write permissions exist for all of your resource disk and scratchdisk entries.

    104

  • ValueCap Systems - Proprietary

    Changing Default Settings

    Exit the Designer and go into the Administrator; be sure to select your project and not someone else's. Enter the Environment editor.

    Find and set APT_CONFIG_FILE to the 1-node configuration file you just created. This makes it the default for your project.

    Find and set APT_DUMP_SCORE to TRUE. This will enable additional diagnostic information to appear in the Director log.

    Click the OK button when finished editing Environment Variables.

    Click OK and then Close to exit the Administrator. You have now finished configuring your project.

    105

  • ValueCap Systems - Proprietary

    Lab 2B: Applying the Configuration File to a Simple DataStage Job

    106

  • ValueCap Systems - Proprietary

    Lab 2B Objective

    Use your newly created configuration files to test a simple DataStage application.

    107

  • ValueCap Systems - Proprietary

    Create Lab2B Using Lab1B

    Open the job you created in Lab 1B; it should be called lab1b.

    Save the job again using Save As; use the name lab2b.

    Next, find the Job Properties icon. Click on it to bring up the Job Properties window.

    108

  • ValueCap Systems - Proprietary

    Editing Job Parameters

    Click on the Parameters tab.

    Find and click on the Add Environment Variable button.

    You will see the big (and sometimes confusing) list of environment variables. Take some time to browse through these.

    Find and select APT_CONFIG_FILE.

    109

  • ValueCap Systems - Proprietary

    Defining APT_CONFIG_FILE

    Once selected, you will return to the Job Properties window.

    Verify that the value for APT_CONFIG_FILE is the same as the 1-node configuration file you defined previously in Lab 2A.

    Save, Compile, and Run your job.

    110

  • ValueCap Systems - Proprietary

    Running Using Parameters

    When you run your job, you should see the following Job Run Options dialogue:

    Note that it shows you the default configuration file being used, which happens to be the one defined previously in the Administrator.

    Keep this value for now, and just click on Run. Go to the Director to view the job run log.

    111

  • ValueCap Systems - Proprietary

    Director Log Output

    Look for a similar entry in the job log for lab2b:

    Double-click on it. You should see the contents of the 1-node configuration file used.

    Click on Close to exit from the dialogue.

    Click on Run again and this time change the APT_CONFIG_FILE parameter to the multiple-node configuration file you defined in Lab 2A.

    Click the Run button.

    112

  • ValueCap Systems - Proprietary

    Director Log Output

    Again, look for a similar entry in the job log for lab2b:

    Double-click on it. You should see the contents of the multiple-node configuration file used.

    Click on Close to exit from the dialogue.

    You have just successfully run your job sequentially and in parallel by simply changing the configuration file!

    113

  • ValueCap Systems - Proprietary

    Using APT_DUMP_SCORE

    Another way to verify the degree of parallelism is to look at the following output in your job log:

    The entries Peek,0 and Peek,1 show up as a result of you having set APT_DUMP_SCORE to TRUE.

    The numbers 0 and 1 signify partition numbers. So if you have a job running 4-way parallel, you should see numbers 0 through 3.
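    As a rough, hedged illustration (the stage label and column names are placeholders, since the lab's actual table definition appears only in a screenshot), the per-partition Peek entries in the Director log look something like this, with the number after the comma identifying the partition:

        Peek,0: a:1 b:abcdefghij c:1960-01-01
        Peek,1: a:2 b:bcdefghijk c:1960-01-02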

    114

  • ValueCap Systems - Proprietary

    Agenda

    1. DataStage Overview (Page 10)
    2. Parallel Framework Overview (Page 73)
    3. Data Import and Export (Page 116)
    4. Data Partitioning, Sorting, and Collection (Page 252)
    5. Data Transformation and Manipulation (Page 309)
    6. Data Combination (Page 364)
    7. Custom Components: Wrappers (Page 420)
    8. Custom Components: Buildops (Page 450)
    9. Additional Topics (Page 477)
    10. Glossary (Page 526)

    115

  • ValueCap Systems - Proprietary

    Data Import and Export

    In this section we will discuss:
    Data Generation, Copy, and Peek
    Data Sources and Targets:
        Flat Files
        Parallel Datasets vs Filesets
        RDBMS
        Other
    Related Stages

    116

  • ValueCap Systems - Proprietary

    Generating Columns and Rows

    DataStage allows you to easily test any job you develop by providing an easy way to generate data.

    Row Generator: generates as many records as you want.

    Column Generator: generates extra fields within existing records. It must first have input records.

    To use either stage, you will need to have a table or column definition.

    You can generate as little as 1 record with 1 column. Columns can be of any supported data type: Integer, Float, Double, Decimal, Character, Varchar, Date, and Timestamp.

    117

  • ValueCap Systems - Proprietary

    Row Generator

    The Row Generator is an excellent stage to use when building jobs in DataStage. It allows you to test the behavior of various stages within the product.

    To configure the Row Generator, you must define at least 1 column. Looking at what we did for the job in Lab 1B, we see that 3 columns were defined:

    We could have also loaded an existing table definition instead of entering our own.

    118

  • ValueCap Systems - Proprietary

    Row Generator

    Suppose we want to stick with the 3-column table definition we created. As you saw in Lab 2B, the Row Generator will produce records with miscellaneous 10-byte character, integer, and date values.

    There is, however, a way to specify the values to be generated. To do so, double-click on the number next to the column name.

    119

  • ValueCap Systems - Proprietary

    Column Metadata Editor

    The Column Metadata editor allows you to provide specific data generation instructions for each and every field.

    Options vary by data type.

    Frequent options include cycling through user-defined values, random values, incremental values, and an alphabetic algorithm.

    120

  • ValueCap Systems - Proprietary

    Character Generator Options

    For a Character or Varchar type, when you click on Algorithm you will have 2 options:

    cycle: cycle through only the specific values you specify.

    alphabet: methodically cycle through the characters of the alphabet. This is the default behavior.

    121

  • ValueCap Systems - Proprietary

    Number Generator Options

    For an Integer, Decimal, or Float type, your 2 options are:

    cycle: cycle through numbers beginning at the initial value and incrementing by the increment value. You can also define an upper limit.

    random: randomly generate numerical values. You can define an upper limit and a seed for the random number generator. You can also use the signed option to generate negative numbers.

    Note: In addition, with Decimal types, you also have the option of defining percent zero and percent invalid.
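    A small worked example under those rules (values chosen only for illustration): with cycle, an initial value of 1, an increment of 2, and a limit of 7, an integer column is generated as 1, 3, 5, 7, 1, 3, 5, 7, ... for successive rows.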

    122

  • ValueCap Systems - Proprietary

    Other Data Type Generator Options

    Date, Time, and Timestamp data types have some useful options:

    Epoch: earliest date to use. For example, the default value is 1960-01-01.

    Scale Factor: specifies a multiplier to the increment value for time. For example, a scale factor of 60 and an increment of 1 means the field increments by 60 seconds.

    Use Current Date: the generator will insert the current date value for all rows. This cannot be used with other options.

    123

  • ValueCap Systems - Proprietary

    Column Generator

    The Column Generator is an excellent stage to use when you need to insert a new column or set of columns into a record layout.

    The Column Generator requires you to specify the name of the column first; then, in the output Mapping tab, you will need to map source to target.

    In the output Columns tab, you will need to customize the added column(s) the same way as is done in the Row Generator.

    For example, if you are generating a dummy key, you would want to make it an Integer type with an initial value of 0 and an increment of 1. When running this in parallel, you can start with an initial value of part and an increment of partcount. part is defined in the Framework as the partition number and partcount is the number of partitions.
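    As a brief worked example of that last point (degree of parallelism chosen only for illustration): in a 4-way parallel run, partcount is 4, so partition 0 generates 0, 4, 8, ..., partition 1 generates 1, 5, 9, ..., and so on, which keeps the generated surrogate keys unique across all partitions.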

    124

  • ValueCap Systems - Proprietary

    Making Copies of Data

    Copy Stage (in the Processing palette):

    Incredibly flexible with little or no overhead at runtime.

    Often used to create a duplicate of the incoming data.

    Can also be used to terminate a flow: records get written out to /dev/null. Useful when you don't care about the target or just want to test part of the flow.

    125

  • ValueCap Systems - Proprietary

    Take a Look at the Data

    Peek Stage (in the Development / Debug palette):

    Often used to help debug a job flow.

    Can be inserted virtually anywhere in a job flow, but must have an input data source.

    Outputs a fixed number of records into the job log (for example, in Lab 2B). The output volume can be controlled.

    Can also be used to terminate any job flow.

    Similar in behavior to inserting a print statement into your source code.

    126

  • ValueCap Systems - Proprietary

    Importing Data from Outside DataStage

    If you have access to real data, then you probably will not have a lot of use for the Row Generator! DataStage can read in or import data from a large variety of data sources:

    Flat Files, Complex Files, RDBMSs, SAS datasets, Queues, Parallel Datasets & Filesets, FTP, Named Pipes, Compressed Files, etc.

    127

  • ValueCap Systems - Proprietary

    Importing Data

    There are 2 primary means of importing data from external sources:

    Automatically: DataStage automatically reads the table definition and applies it to the incoming data. Examples include RDBMSs, SAS datasets, and parallel datasets.

    Manually: the user must define the table definition that corresponds to the data to be imported. These table definitions can be entered manually or imported from an existing copybook or schema file. Examples include flat files and complex files.

    128

  • ValueCap Systems - Proprietary

    Manual Data Import

    When DataStage reads in data from an external source, there are 2 steps that will always take place:

    Recordization: DataStage carves out the entire record based on the table definition being used. The record delimiter is defined within the table definition.

    Columnization: DataStage parses through the record it just carved out and separates out the columns, again based on the table definition provided. Column delimiters are also defined within the table definition.

    This can become very troublesome if you don't know the correct layout of your data!

    129

  • ValueCap Systems - Proprietary

    DataStage Data Types

    In order to properly set up a table definition, you must first understand the internal data types used within DataStage:

    Integer: Signed or unsigned 8-, 16-, 32-, or 64-bit integer. In the Designer you will see TinyInt, SmallInt, Integer, and BigInt instead.

    Floating Point: Single- (32 bits) or double-precision (64 bits); IEEE. In the Designer you will see Float and Double instead.

    String: Character string of fixed or variable length. In the Designer you will see Char and VarChar instead.

    Decimal: Numeric representation compatible with the IBM packed decimal format. Decimal numbers consist of a precision (number of decimal digits) greater than 1 with no maximum, and a scale (fixed position of the decimal point) between 0 and the precision.

    130

  • ValueCap Systems - Proprietary

    DataStage Data Types (continued)

    Date: Numeric representation compatible with the RDBMS notion of date (year, month, and day). The default format is month/day/year, represented by the default format string %mm/%dd/%yyyy.

    Time: Time of day with either one-second or one-microsecond resolution. Time values range from 00:00:00 to 23:59:59.999999.

    Timestamp: Single field containing both a date and a time.

    Raw: Untyped collection of contiguous bytes of a fixed or variable length, optionally aligned. In the Designer you will see Binary.
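    As an analogy only (Python is not what DataStage uses internally), the %mm/%dd/%yyyy format string plays roughly the same role as a strptime pattern:

        # Rough analogy: %mm/%dd/%yyyy behaves like strptime's "%m/%d/%Y".
        from datetime import datetime

        parsed = datetime.strptime("02/27/2004", "%m/%d/%Y").date()
        print(parsed)   # 2004-02-27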

    131

  • ValueCap Systems - Proprietary

    DataStage Data Types (continued)

    Subrecord (subrec): Nested form of field definition that consists of multiple nested fields, similar to COBOL record levels or C structs. A subrecord itself does not define any storage; instead, the fields of the subrecord define storage. The fields in a subrecord can be of any data type, including tagged. In addition, you can also nest subrecords and vectors of subrecords, to any depth of nesting.

    Tagged Subrecord (tagged): Any one of a mutually exclusive list of possible data types, including subrecord and tagged fields. Similar to COBOL REDEFINES or C unions, but more type-safe. Defining a record with a tagged type allows each record of a data set to have a different data type for the tagged column.

    132

  • ValueCap Systems - Proprietary

    Null Handling

    All DataStage data types are nullable. Tagged and subrecord containers are not themselves nullable, but their fields are.

    Null fields do not have a value; a DataStage null is represented by an out-of-band indicator. Nulls can be detected by a stage, and can be converted to or from a value. Null fields can be ignored by a stage, can trigger an error, or can trigger some other action.

    Exporting a nullable field to a flat file without first defining how to handle the null will cause an error.
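    That last point matters in practice. A minimal sketch of the idea, in Python rather than DataStage (the "NULL" string below is just an assumed representation):

        # A flat file has no out-of-band null indicator, so a nullable field
        # must be mapped to an explicit value before export.
        def export_field(value, null_value=None):
            if value is None:
                if null_value is None:
                    raise ValueError("nullable field exported without a null representation")
                return null_value
            return str(value)

        print(export_field(42))                        # "42"
        print(export_field(None, null_value="NULL"))   # "NULL"
        # export_field(None) with no null_value raises, mirroring the error above.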

    133

  • ValueCap Systems - Proprietary

    Data Import Example

    Suppose you have the following data:

    Last, First, Purchase_DT, Item, Amount, Total
    Smith,John,2004-02-27,widget #2,21,185.20
    Doe,Jane,2005-07-03,widget #1,7,92.87
    Adams,Sam,2006-01-15,widget #9,43,492.93

    What would your table definition look like for this data?
    You need column names, which are provided for you.
    You need data types for each column.
    You need to specify "," as the column delimiter.
    You need to specify newline as the record delimiter.
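    One plausible table definition for this data, sketched in Python for illustration (the specific types and the date pattern are reasonable choices, not the only valid ones):

        # Column names come from the header row; types are chosen to match the
        # sample values; "," is the column delimiter, newline the record delimiter.
        import csv, io
        from datetime import datetime
        from decimal import Decimal

        sample = ("Smith,John,2004-02-27,widget #2,21,185.20\n"
                  "Doe,Jane,2005-07-03,widget #1,7,92.87\n"
                  "Adams,Sam,2006-01-15,widget #9,43,492.93\n")

        table_definition = [
            ("Last",        str),                                               # VarChar
            ("First",       str),                                               # VarChar
            ("Purchase_DT", lambda s: datetime.strptime(s, "%Y-%m-%d").date()), # Date
            ("Item",        str),                                               # VarChar
            ("Amount",      int),                                               # Integer
            ("Total",       Decimal),                                           # Decimal
        ]

        for row in csv.reader(io.StringIO(sample)):
            print({name: conv(val) for (name, conv), val in zip(table_definition, row)})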

    134

  • ValueCap Systems - Proprietary

    Data Import Example (continued)

    It is critical that you fill out the Format options correctly; otherwise, DataStage will not be able to perform the necessary recordization and columnization!

    Data types must also match the data itself; otherwise, the columnization step will fail.

    Sequential File Stage

    135

  • ValueCap Systems - Proprietary

    Data Import Example (continued)

    Once all of the information is properly filled out, you can press the View Data button to see a sample of your data and, at the same time, validate that your table definition is correct.

    If your table definition is not correct, then the View Data operation will fail.

    136

  • ValueCap Systems - Proprietary

    Data Import Example (continued)

    The table definition we used above worked for the data we were given. Was this the only table definition that would have worked? No, but it was the best one.

    VarChar is perhaps the most flexible data type, so we could have defined all columns as VarChars.

    All numeric and date/time types can be imported as Char or VarChar as well, but the reverse is rarely true.

    Decimal types can typically be imported as Float or Double and vice versa, but be careful with precision - you may lose data!

    Integer types can also be imported as Decimal, Float, or Double.
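    The precision caveat is easy to demonstrate. A quick Python illustration of why a decimal value imported as Float or Double may not round-trip exactly:

        # A decimal value is not always exactly representable as a binary float.
        from decimal import Decimal

        print(Decimal("492.93"))         # 492.93 (exact)
        print(Decimal(float("492.93")))  # 492.9300000000000068... (not exact)
        print(f"{float('492.93'):.2f}")  # 492.93 only after rounding back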

    137

  • ValueCap Systems - Proprietary

    Data Import Reject Handling

    Data is not always clean. When unexpected or invalid values come up, you can:

    Continue - the default option. It will discard any records where a field does not import correctly.

    Fail - abort the job as soon as an invalid field value is encountered.

    Output - send reject records down a reject link to a Dataset. They can also be passed on to other stages for further processing.

    138

  • ValueCap Systems - Proprietary

    Exporting Data to Disk

    Once the data has been read into DataStage and processed, it is typically written out somewhere. These targets can be the same as the sources which originally produced the data or a completely different target.

    Exporting data to a flat file is easier than importing it from a flat file, simply because DataStage will use the table definition that has been propagated downstream to define the data layout within the output target file.

    You can easily edit the formatting properties within the Sequential File stage for items such as null handling, delimiters, quotes, etc.

    Consider using a parallel dataset instead of a flat file to stage data on disk! It is much faster and easier if there is another DSEE application which will consume the data downstream.

    139

  • ValueCap Systems - Proprietary

    Data Export to Flat File Example

    Here's an example of what it takes to set up the Sequential File stage to export data to a flat file.

    140

  • ValueCap Systems - Proprietary

    Data Export to Parallel Dataset Example

    With DataStage parallel datasets, regardless of whether the dataset is a source or a target, all you need to specify is its name and location! There is no need to worry about data types, handling nulls, or delimiters.

    141

  • ValueCap Systems - Proprietary

    Automatic Data Import

    Besides flat files and other manual sources, DataStage can also import data from a parallel dataset or an RDBMS without the need to first define a table definition!

    Parallel datasets are self-describing datasets native to DataStage, and are the easiest way to read and write data.

    RDBMSs often store table definitions internally. For example, the DESCRIBE or DESCRIBE TABLE command often returns the table definition associated with the given table.

    DataStage has the ability to automatically extract the table definition during design time, or to automatically extract the table definition to match the data at runtime and propagate that table definition downstream using RCP.

    142

  • ValueCap Systems - Proprietary

    Parallel Datasets vs Parallel Filesets

    Parallel Dataset vs Parallel Fileset - the primary difference is format:

    Parallel datasets are stored in a native DataStage format, while parallel filesets are stored as ASCII.

    Parallel filesets use a .fs extension vs .ds for parallel datasets. The .fs file is also a descriptor file; however, it is ASCII and only contains the location of each fragment and the layout.

    Parallel datasets are faster than parallel filesets, because they avoid the recordization and columnization process - the data is already stored in a native format.

    143

  • ValueCap Systems - Proprietary

    Parallel Datasets vs RDBMS

    Parallel Dataset vs RDBMS - logically and functionally very similar:

    Parallel datasets have data that is partitioned and stored across several disks.

    The table definition (aka schema) is stored and associated with the table.

    Parallel datasets can sometimes be faster than loading/extracting from an RDBMS. Some conditions that can make this happen: non-partitioned RDBMS tables, a remote RDBMS location, or a sequential RDBMS access mechanism.

    144

  • ValueCap Systems - Proprietary

    Importing RDBMS Table Definitions

    Select from DB2, Oracle, or Informix

    There are a couple of options you can choose from for importing an RDBMS table definition for use during design time. Import Orchestrate Schema is one option.

    Once you enter all the necessary parameters, you can click on the Next button to import the table definition.

    Once imported, the table definition can be used at design time

    145

  • ValueCap Systems - Proprietary

    Importing RDBMS Table Definitions

    Other options for importing an RDBMS table definition include using ODBC or Plug-In Metadata access.

    The ODBC option requires that the correct ODBC driver be set up.

    The Plug-In Metadata option requires that it be set up during install.

    Once set up, each option guides you through a simple process to import the table definition and save it for future re-use.

    146

  • ValueCap Systems - Proprietary

    Using Saved Table Definitions

    Table definition icon shows up on the link

    There are 2 ways to reference a saved table definition in a job. The first is to select it from the repository tree view on the left side, and then drag and drop it onto the link.

    The presence of the icon on the link signifies that a table definition is present, or that metadata is present on the link. Why do this when DataStage can do this automatically at runtime? Sometimes it is easier or more straightforward to have the metadata available at design time.

    147

  • ValueCap Systems - Proprietary

    Using Saved Table Definitions

    Another way to access saved table definitions is to use the Load button on the Output tab of any given stage. Note that you can also do this on the Input tab, but that is the same as loading it on the Output tab of the upstream (preceding) stage.

    148

  • ValueCap Systems - Proprietary

    Loading Table Definitions

    When loading a previously saved table definition, the column selection dialogue will appear. This allows you to optionally eliminate certain columns which you do not want to carry over.

    This is useful when you are only reading in some columns or your select clause only has some columns.

    149

  • ValueCap Systems - Proprietary

    RDBMS Connectivity

    DataStage offers an array of options for RDBMS connectivity, ranging from ODBC to highly-scalable native interfaces. For handling large data volumes, DataStage's highly-scalable native database interfaces are the best way to go. While the icons may appear similar, always look for the _enterprise label.

    DB2 - parallel extract, load, upsert, and lookup
    Oracle - parallel extract, load, upsert, and lookup
    Teradata - parallel extract and load
    Sybase - sequential extract, parallel load, upsert, and lookup
    Informix - parallel extract and load

    150

  • ValueCap Systems - Proprietary

    Parallel RDBMS Interface

    Query or Application

    Usually a query is submitted to a database sequentially, and the database then distributes the query to execute it in parallel. The output, however, is returned sequentially. Similarly, when loading data, data is loaded sequentially first, before being distributed by the database.

    DataStage avoids this bottleneck by establishing parallel connections into the database to execute queries, extract data, and load data in parallel. The degree of parallelism changes depending on the database configuration (i.e. the number of partitions that are set up).

    Parallel DataStage

    151

  • ValueCap Systems - Proprietary

    DataStage and RDBMS Scalability

    Query or Application

    While the database itself may be highly scalable, the overall solution, which includes the application accessing the database, is not. Any sequential bottleneck in an end-to-end solution will limit its ability to scale!

    DataStage's native parallel connectivity into the database is the key enabler for a truly scalable end-to-end solution.

    Parallel DataStage

    152

  • ValueCap Systems - Proprietary

    Extracting from the RDBMS

    Extracting data from DB2, Oracle, Teradata, and Sybase is pretty straightforward. The stage properties dialogue is very much the same for each database, despite its different behavior under the covers.

    For all database stages, to extract data you will need to provide the following:

    Read Method - full table scan or user-defined query
    Table - table name, if using Table as the Read Method
    User - user id (optional with DB2)
    Password - password (optional with DB2)
    Server/Database - used for some databases for establishing connectivity
    Options - database-specific options

    153

  • ValueCap Systems - Proprietary

    Loading to the RDBMS

    Loading data into DB2, Oracle, Teradata, and Sybase is also pretty straightforward. The stage properties dialogue is very much the same for each database, despite its different behavior under the covers.

    For all database stages, to load data you will need to provide the following:

    Table - name of the table to be loaded
    Write Method - write, load, upsert (details will be discussed shortly)
    Write Mode - Append, Create, Replace, and Truncate
    User - user id (optional with DB2)
    Password - password (optional with DB2)
    Server/Database - used for some databases for establishing connectivity
    Options - database-specific options

    154

  • ValueCap Systems - Proprietary

    Write Methods Explained

    Write/Load often the default option. Used to append data into an existing table, create and load data into a target table, or drop an existing table, re-create it, and load data into it. The mechanics of the load itself depend on the database.

    Upsert update or insert data into the database. There is also an option to delete data from the target table.

    Lookup perform a lookup against a table inside the database. This is useful when the lookup table is much larger than the input data.
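    Logically, an upsert amounts to "update; if nothing matched, insert". The SQL a database stage actually generates varies by database and stage options, but a hedged Python/DB-API sketch of the idea looks like this (table and column names are placeholders):

        # Sketch of upsert logic against a DB-API cursor ('?' placeholders,
        # e.g. sqlite3); real stages generate database-specific SQL.
        def upsert(cursor, table, key_col, key, columns):
            set_clause = ", ".join(f"{c} = ?" for c in columns)
            cursor.execute(f"UPDATE {table} SET {set_clause} WHERE {key_col} = ?",
                           [*columns.values(), key])
            if cursor.rowcount == 0:                 # no row updated -> insert
                names = ", ".join([key_col, *columns])
                marks = ", ".join("?" * (len(columns) + 1))
                cursor.execute(f"INSERT INTO {table} ({names}) VALUES ({marks})",
                               [key, *columns.values()])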

    155

  • ValueCap Systems - Proprietary

    Write Modes Explained

    Append default option. Append data into an existing table.

    Create creates a table using the table definition provided by the stage. If table already exists, then job will fail. Insert table into the created table.

    Replace if target table exists, drop the table first. If table does not exist, create it. Insert data into the created table.

    Truncate delete all records from the target table, but do not drop the table. Insert data into the empty table.
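    As a rough guide to what each mode implies before any rows are written, here is an illustrative Python sketch (the statements shown are generic; the stage issues database-specific SQL):

        # Table-preparation step implied by each write mode (illustrative only).
        def prepare_target(mode, table, create_ddl):
            if mode == "Append":
                return []                                   # table must already exist
            if mode == "Create":
                return [create_ddl]                         # fails if table exists
            if mode == "Replace":
                return [f"DROP TABLE {table}", create_ddl]  # drop if present, re-create
            if mode == "Truncate":
                return [f"DELETE FROM {table}"]             # keep table, remove all rows
            raise ValueError(mode)

        print(prepare_target("Replace", "SALES",
                             "CREATE TABLE SALES (ID INTEGER, AMT FLOAT)"))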

    156

  • ValueCap Systems - Proprietary

    Connectivity - DataStage Oracle Enterprise Stage

    157

  • ValueCap Systems - Proprietary

    Configuration for Oracle

    To establish connectivity to Oracle, certain environment variables and stage options need to be defined:

    Environment variables (defined via DataStage Administrator):
    ORACLE_SID - name of the Oracle database to access
    ORACLE_HOME - location of the Oracle home
    PATH - append $ORACLE_HOME/bin
    LIBPATH or LD_LIBRARY_PATH - append $ORACLE_HOME/lib32 or $ORACLE_HOME/lib64, depending on the operating system. The path must be spelled out.

    Stage options:
    User - Oracle user id
    Password - Oracle user password
    DB Options - can also accept SQL*Loader parameters such as DIRECT = TRUE, PARALLEL = TRUE

    158

  • ValueCap Systems - Proprietary

    Specifics for Extracting Oracle

    Extracts from Oracle:

    The default option (depending on the version used) is to use the SQL Builder interface, which allows you to use a graphical interface to create a custom query. Note: the query generated will run sequentially by default.

    The User-Defined Query option allows you to enter your own query or copy and paste an existing query. Note: the custom query will run sequentially by default.

    Running SQL queries in parallel requires the use of the following option:

    Partition Table - enter the name of the table containing the partitioning strategy you are looking to match.

    159

  • ValueCap Systems - Proprietary

    Oracle Parallel Extract

    Both sets of options above will yield identical results. Leaving out the Partition Table option would cause the extract to execute sequentially.

    160

  • ValueCap Systems - Proprietary

    Specifics for Loading Oracle

    There are 2 ways to put data into Oracle:

    Load (default option) - leverages the Oracle SQL*Loader technology to load data into Oracle in parallel. Load uses the Direct Path load method by default, which is the fastest way to load data into Oracle. Select Append, Create, Replace, or Truncate mode.

    Upsert - update or insert data in an Oracle table. Runs in parallel, uses standard SQL INSERT and UPDATE statements, and can use auto-generated or user-defined SQL. Can also use the DELETE option to remove data from the target Oracle table.

    161

  • ValueCap Systems - Proprietary

    Oracle Index Maintenance

    Loading to a range/hash partitioned table in parallel is supported; however, if the table is indexed:

    Rebuild can be used to rebuild global indexes. You can specify NOLOGGING (speeds up the rebuild by eliminating the log during index rebuild) and COMPUTE STATISTICS to provide stats on the index.

    Maintenance is supported for local indexes partitioned the same way the table is partitioned.

    Don't use both the rebuild and maintenance options in the same stage - either the global or local index must be dropped prior to the load.

    Using DB Options DIRECT=TRUE,PARALLEL=TRUE,SKIP_INDEX_MAINTENANCE=YES allows the Oracle stage to run in parallel using direct path mode, but indexes on the table will be unusable after the load.

    162

  • ValueCap Systems - Proprietary

    Relevant Stages

    Column Import - imports only a subset of the columns in a record, leaving the rest as raw or string. This is useful when you have a very wide record and only plan on referencing a few columns.

    Column Export - combines 2 or more columns into a single column.

    Combine Records - combines records in which particular key-column values are identical into vectors of subrecords. As input, the stage takes a data set in which one or more columns are chosen as keys. All adjacent records whose key columns contain the same value are gathered into the same record as subrecords.

    Make Subrecord - combines specified vectors in an input data set into a vector of subrecords whose columns have the names and data types of the original vectors. Specify the vector columns to be made into a vector of subrecords and the name of the new subrecord.

    163

  • ValueCap Systems - Proprietary

    Relevant Stages (Continued)

    Split Subrecord - the inverse of Make Subrecord. Creates one new vector column for each element of the original subrecord. Each top-level vector column that is created has the same number of elements as the subrecord from which it was created. The stage outputs columns of the same name and data type as those of the columns that comprise the subrecord.

    Make Vector - combines specified columns of an input data record into a vector of columns. The stage has the following requirements: the input columns must form a numeric sequence, and must all be of the same type. The numbers must increase by one. The columns must be named column_name0 to column_namen, where column_name starts the name of a column and 0 and n are the first and last of its consecutive numbers. The columns do not have to be in consecutive order. All these columns are combined into a vector of the same length as the number of columns (n+1). The vector is called column_name. Any input columns that do not have a name of that form will not be included in the vector but will be output as top-level columns, as sketched below.
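    A hedged Python sketch of the naming rule (purely conceptual; the stage itself is configured in the Designer, not coded):

        # Columns named name0..nameN (consecutive, starting at 0) collapse into
        # a vector called "name"; all other columns pass through unchanged.
        import re

        def make_vector(record, name):
            pattern = re.compile(rf"^{re.escape(name)}(\d+)$")
            numbered = {int(m.group(1)): v
                        for k, v in record.items() if (m := pattern.match(k))}
            assert sorted(numbered) == list(range(len(numbered))), \
                "numbers must increase by one"
            rest = {k: v for k, v in record.items() if not pattern.match(k)}
            return {**rest, name: [numbered[i] for i in range(len(numbered))]}

        print(make_vector({"acct": "A1", "bal0": 10, "bal1": 20, "bal2": 30}, "bal"))
        # {'acct': 'A1', 'bal': [10, 20, 30]}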

    164

  • ValueCap Systems - Proprietary

    Relevant Stages (Continued)

    Split Vector - promotes the elements of a fixed-length vector to a set of similarly named top-level columns. The stage creates columns of the format name0 to nameN, where name is the original vector's name and 0 and N are the first and last elements of the vector.

    Promote Subrecord - promotes the columns of an input subrecord to top-level columns. The number of output records equals the number of subrecord elements. The data types of the input subrecord columns determine those of the corresponding top-level columns.

    DRS - Dynamic Relational Stage. DRS reads data from any DataStage stage and writes it to one of the supported relational databases. It also reads data from any of the supported relational databases and writes it to any DataStage stage. It supports the following relational databases: DB2/UDB, Informix, Microsoft SQL Server, Oracle, and Sybase. It also supports generic ODBC.

    165

  • ValueCap Systems - Proprietary

    Relevant Stages (Continued)

    ODBC - access or write data to remote sources via an ODBC interface.

    Stored Procedure - allows a stored procedure to be used as: a source, returning a rowset; a target, passing a row to a stored procedure to write; or a transform, invoking logic within the database. The Stored Procedure stage supports input and output parameters or arguments. It can process the returned value after the stored procedure is run. It also provides status codes indicating whether the stored procedure completed successfully and, if not, allowing for error handling. Currently supports DB2, Oracle, and Sybase.

    Complex Flat File - as a source stage, it imports data from one or more complex flat files, including MVS datasets with QSAM and VSAM files. A complex flat file may contain one or more GROUP, REDEFINES, OCCURS, or OCCURS DEPENDING ON clauses. When used as a target, the stage exports data to one or more complex flat files. It does not write to MVS datasets.

    166

  • ValueCap Systems - Proprietary

    Lab 3A: Flat File Import

    167

  • ValueCap Systems - Proprietary

    Lab 3A Objectives

    Learn to create a table definition to match the contents of the flat file

    Read in the flat file using the Sequential File stage and the table definition just created.

    168

  • ValueCap Systems - Proprietary

    The Data Files

    There are 4 data files you will be importing. You will be using these files for future labs. The files contain Major League Baseball data.

    Batting.csv - player hitting statistics
    Pitching.csv - pitcher statistics
    Salaries.csv - player salaries
    Master.csv - player details

    The files all have the following format:
    The 1st row in each file contains the column names.
    Data is in ASCII format.
    Records are newline delimited.
    Columns are comma separated.

    169

  • ValueCap Systems - Proprietary

    Batters File

    The layout of the Batting.csv file is shown below. Open the file using vi or any other text editor to view its contents - note the contents and data types.

    Create a table definition for this data, save it as batting.

    Column Name   Description
    playerID      Player ID code
    yearID        Year
    teamID        Team
    lgID          League
    G             Games
    AB            At Bats
    R             Runs
    H             Hits
    DB            Doubles
    TP            Triples
    HR            Homeruns
    RBI           Runs Batted In
    SB            Stolen Bases
    IBB           Intentional walks

    Tips:
    1. Use a data type that most closely matches the data. For example, for the Games column, use Integer instead of Char or VarChar!
    2. When using a VarChar type, always fill in a maximum length by filling in a number in the length column.
    3. When defining numerical types such as Integer or Float, there's no need to fill in length or scale values. You only do this for Decimal types.
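    Following these tips, one plausible set of choices for the batting table definition looks like the sketch below (written as Python data for compactness; the VarChar maximum lengths are illustrative guesses, not prescribed values):

        # A reasonable, but not the only valid, mapping of Batting.csv columns to types.
        batting_table_definition = [
            ("playerID", "VarChar(10)"),
            ("yearID",   "Integer"),
            ("teamID",   "VarChar(5)"),
            ("lgID",     "VarChar(2)"),
            ("G",        "Integer"),
            ("AB",       "Integer"),
            ("R",        "Integer"),
            ("H",        "Integer"),
            ("DB",       "Integer"),
            ("TP",       "Integer"),
            ("HR",       "Integer"),
            ("RBI",      "Integer"),
            ("SB",       "Integer"),
            ("IBB",      "Integer"),
        ]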

    170

  • ValueCap Systems - Proprietary

    Pitchers File

    The layout of the Pitching.csv file is shown below. Open the file using vi or any other text editor to view its contents - note the contents and data types.

    Create a table definition for this data, save it as pitching.

    Column Name   Description
    playerID      Player ID code
    yearID        Year
    teamID        Team
    lgID          League
    W             Wins
    L             Losses
    SHO           Shutouts
    SV            Saves
    SO            Strikeouts
    ERA           Earned Run Average

    Tips:
    1. Be careful to choose the right data type for the ERA column. Your choices should boil down to Float vs Decimal.

    171

  • ValueCap Systems - Proprietary

    Salary File

    The layout of the Salaries.csv file is shown below. Open the file using vi or any other text editor to view its contents - note the contents and data types.

    Create a table definition for this data, save it as salaries.

    Column Name   Description
    yearID        Year
    teamID        Team
    lgID          League
    playerID      Player ID code
    salary        Salary

    Tips:
    1. The salary value is in whole dollars. Again, be sure to select the best data type. While it may be tempting to use Decimal, the Framework is more efficient at processing Integer and Float types, which are considered native to the Framework.

    172

  • ValueCap Systems - Proprietary

    Master File

    The layout of the Master.csv file is shown below. Open the file using vi or any other text editor to view its contents - note the contents and data types.

    Create a table definition for this data, save it as master.

    Tips:
    1. Treat birthYear, birthMonth, & birthDay as Integer types for now.
    2. Be sure to specify the correct Date format string: %mm/%dd/%yyyy

    Column Name   Description
    playerID      A unique code assigned to each player
    birthYear     Year player was born
    birthMonth    Month player was born
    birthDay      Day player was born
    nameFirst     Player's first name
    nameLast      Player's last name
    debut         Date player made first major league appearance
    finalGame     Date player made last major league appearance

    173

  • ValueCap Systems - Proprietary

    Testing the Table Definitions

    Create the following flow by linking a Sequential File stage to a Peek stage:

    Next, find the batting table definition you created, then click and drag the table onto the link.

    On the link, look for the icon that signifies the presence of a table definition.

    174

  • ValueCap Systems - Proprietary

    Testing the Table Definition

    In the Sequential File stage properties:

    Fill in the File option with the correct path and filename. For example: C:\student01\training\data\Batting.csv

    Click on the Format tab and review the settings. Are these consistent with what you see in the Batting.csv data file?

    In the Columns tab, you will note that the table definition you previously selected and dragged onto the link is now present. Alternatively, you could have used the Load button to bring it in, or typed it all in over again!

    Next, click on the View Data button to see if you got everything correct. Click OK to view the data!

    175

  • ValueCap Systems - Proprietary

    Viewing Data

    If everything went well, you should see the View Data window pop up:

    If you get an error instead, take a look at the error message to determine the location and nature of the error. Make the necessary corrections and try again.

    176

  • ValueCap Systems - Proprietary

    Testing lab3a

    Save the job as lab3a_batting. Compile the job and then click on the run button. Go into the Director and take a look at the job log.

    Look out for Warnings and Errors!!! Errors are fatal and must be resolved. Warnings can be an issue; in this case, a warning could be telling you that certain records failed to import. This is a bad thing!

    Typical mistakes include formatting and data type mismatches:
    Verify that the column delimiter is correct. Everything should be comma separated.
    Are you using the correct data types?

    177

  • ValueCap Systems - Proprietary

    lab3a_batting Results

    For your lab3a_batting job:

    You should see "Import complete. 25076 records imported successfully, 0 rejected."

    There should be no rejected records! Find the Peek output line in the Director's log and double-click on it. It should look like the following:

    178

  • ValueCap Systems - Proprietary

    Importing Rest of the Files

    Repeat the process for the Pitching,Salaries, and Master files. Save the jobs as lab3a_pitching,

    lab3a_salaries, and lab3a_masteraccordingly

    When finished, your job shouldresemble one of the diagrams onthe right. Be sure to rename the stages accordingly.

    Make sure that View Data works foreach and every input file.

    179

  • ValueCap Systems - Proprietary

    Validating Results

    For your lab3a_pitching job: you should see "Import complete. 11917 records imported successfully, 0 rejected." There should be no rejected records!

    For your lab3a_salaries job: you should see "Import complete. 17277 records imported successfully, 0 rejected." There should be no rejected records!

    For your lab3a_master job: you should see "Import complete. 3817 records imported successfully, 0 rejected." There should be no rejected records!

    180

  • ValueCap Systems - Proprietary

    Lab 3B: Exporting to a Flat File

    181

  • ValueCap Systems - Proprietary

    Lab 3B Objective

    Write out the imported data files to ASCII flat files and parallel datasets

    Use different formatting properties

    182

  • ValueCap Systems - Proprietary

    Create Lab 3B Using Lab 3A

    Open the jobs you created in Lab 3A: lab3a_batting, lab3a_pitching, lab3a_salaries, and lab3a_master.

    Save each job again using Save As - use the names lab3b_batting, lab3b_pitching, lab3b_salaries, and lab3b_master accordingly.

    183

  • ValueCap Systems - Proprietary

    Edit lab3b_batting

    Go to lab3b_batting and edit the job to look like the following:

    To do so, perform the following steps:
    Click on the Peek stage and delete it.
    Attach the Copy stage in its place.
    Place a Sequential File stage and a Dataset stage after the Copy.
    Draw a link between the Copy and the 2 output stages.
    Update the link and stage names accordingly.

    184

  • ValueCap Systems - Proprietary

    Edit lab3b_batting

    In the Copy stage's Output Mapping tab, map the source columns to the target columns for both output links:

    185

  • ValueCap Systems - Proprietary

    Source to Target Mapping

    right-click

    Once the mapping is co