
Velocity 2007

Best Practices

Best Practices

● Configuration Management and Security

❍ Configuring Security

❍ Data Analyzer Security

❍ Database Sizing

❍ Deployment Groups

❍ Migration Procedures - PowerCenter

❍ Migration Procedures - PowerExchange

❍ Running Sessions in Recovery Mode

❍ Using PowerCenter Labels

● Data Analyzer Configuration

❍ Deploying Data Analyzer Objects

❍ Installing Data Analyzer

● Data Connectivity

❍ Data Connectivity using PowerCenter Connect for BW Integration Server

❍ Data Connectivity using PowerCenter Connect for MQSeries

❍ Data Connectivity using PowerCenter Connect for SAP

❍ Data Connectivity using PowerCenter Connect for Web Services

● Data Migration

❍ Data Migration Principles

❍ Data Migration Project Challenges

❍ Data Migration Velocity Approach

● Data Quality and Profiling

❍ Build Data Audit/Balancing Processes

❍ Data Cleansing

❍ Data Profiling


❍ Data Quality Mapping Rules

❍ Data Quality Project Estimation and Scheduling Factors

❍ Effective Data Matching Techniques

❍ Effective Data Standardizing Techniques

❍ Managing Internal and External Reference Data

❍ Testing Data Quality Plans

❍ Tuning Data Quality Plans

❍ Using Data Explorer for Data Discovery and Analysis

❍ Working with Pre-Built Plans in Data Cleanse and Match

● Development Techniques

❍ Designing Data Integration Architectures

❍ Development FAQs

❍ Event Based Scheduling

❍ Key Management in Data Warehousing Solutions

❍ Mapping Design

❍ Mapping Templates

❍ Naming Conventions

❍ Performing Incremental Loads

❍ Real-Time Integration with PowerCenter

❍ Session and Data Partitioning

❍ Using Parameters, Variables and Parameter Files

❍ Using PowerCenter with UDB

❍ Using Shortcut Keys in PowerCenter Designer

❍ Working with JAVA Transformation Object

● Error Handling

❍ Error Handling Process

❍ Error Handling Strategies - Data Warehousing

❍ Error Handling Strategies - General

❍ Error Handling Techniques - PowerCenter Mappings

❍ Error Handling Techniques - PowerCenter Workflows and Data Analyzer


● Integration Competency Centers and Enterprise Architecture

❍ Planning the ICC Implementation

❍ Selecting the Right ICC Model

● Metadata and Object Management

❍ Creating Inventories of Reusable Objects & Mappings

❍ Metadata Reporting and Sharing

❍ Repository Tables & Metadata Management

❍ Using Metadata Extensions

❍ Using PowerCenter Metadata Manager and Metadata Exchange Views for Quality Assurance

● Metadata Manager Configuration

❍ Configuring Standard XConnects

❍ Custom XConnect Implementation

❍ Customizing the Metadata Manager Interface

❍ Estimating Metadata Manager Volume Requirements

❍ Metadata Manager Load Validation

❍ Metadata Manager Migration Procedures

❍ Metadata Manager Repository Administration

❍ Upgrading Metadata Manager

● Operations

❍ Daily Operations

❍ Data Integration Load Traceability

❍ High Availability

❍ Load Validation

❍ Repository Administration

❍ Third Party Scheduler

❍ Updating Repository Statistics

● Performance and Tuning

❍ Determining Bottlenecks

❍ Performance Tuning Databases (Oracle)

❍ Performance Tuning Databases (SQL Server)


❍ Performance Tuning Databases (Teradata)

❍ Performance Tuning UNIX Systems

❍ Performance Tuning Windows 2000/2003 Systems

❍ Recommended Performance Tuning Procedures

❍ Tuning and Configuring Data Analyzer and Data Analyzer Reports

❍ Tuning Mappings for Better Performance

❍ Tuning Sessions for Better Performance

❍ Tuning SQL Overrides and Environment for Better Performance

❍ Using Metadata Manager Console to Tune the XConnects

● PowerCenter Configuration

❍ Advanced Client Configuration Options

❍ Advanced Server Configuration Options

❍ Causes and Analysis of UNIX Core Files

❍ Domain Configuration

❍ Managing Repository Size

❍ Organizing and Maintaining Parameter Files & Variables

❍ Platform Sizing

❍ PowerCenter Admin Console

❍ Understanding and Setting UNIX Resources for PowerCenter Installations

● PowerExchange Configuration

❍ PowerExchange CDC for Oracle

❍ PowerExchange Installation (for Mainframe)

● Project Management

❍ Assessing the Business Case

❍ Defining and Prioritizing Requirements

❍ Developing a Work Breakdown Structure (WBS)

❍ Developing and Maintaining the Project Plan

❍ Developing the Business Case

❍ Managing the Project Lifecycle


❍ Using Interviews to Determine Corporate Data Integration Requirements

● Upgrades

❍ Upgrading Data Analyzer

❍ Upgrading PowerCenter

❍ Upgrading PowerExchange


Configuring Security

Challenge

Configuring a PowerCenter security scheme to prevent unauthorized access to mappings, folders, sessions, workflows, repositories, and data in order to ensure system integrity and data confidentiality.

Description

Security is an often overlooked area within the Informatica ETL domain. However, without paying close attention to repository security, you ignore a crucial component of ETL code management. Determining an optimal security configuration for a PowerCenter environment requires a thorough understanding of business requirements, data content, and end-user access requirements. Knowledge of PowerCenter's security functionality and facilities is also a prerequisite to security design.

Implement security with the goals of easy maintenance and scalability. When establishing repository security, keep it simple. Although PowerCenter includes the utilities for a complex web of security, the simpler the configuration, the easier it is to maintain. Securing the PowerCenter environment involves the following basic principles:

● Create users and groups
● Define access requirements
● Grant privileges and permissions

Before implementing security measures, ask and answer the following questions:

● Who will administer the repository?
● How many projects need to be administered? Will the administrator be able to manage security for all PowerCenter projects or just a select few?
● How many environments will be supported in the repository?
● Who needs access to the repository? What do they need the ability to do?
● How will the metadata be organized in the repository? How many folders will be required?
● Where can we limit repository privileges by granting folder permissions instead?
● Who will need Administrator or Super User-type access?

After you evaluate the needs of the repository users, you can create appropriate user groups, assign repository privileges and folder permissions. In most implementations, the administrator takes care of maintaining the repository. Limit the number of administrator accounts for PowerCenter. While this concept is important in a development/unit test environment, it is critical for protecting the production environment.

Repository Security Overview

A security system needs to properly control access to all sources, targets, mappings, reusable transformations, tasks, and workflows in both the test and production repositories. A successful security model needs to support all groups in the project lifecycle and also consider the repository structure.


Informatica offers multiple layers of security, which enables you to customize the security within your data warehouse environment. Metadata level security controls access to PowerCenter repositories, which contain objects grouped by folders. Access to metadata is determined by the privileges granted to the user or to a group of users and the access permissions granted on each folder. Some privileges do not apply by folder, as they are granted by privilege alone (i.e., repository-level tasks).

Just beyond PowerCenter authentication is the connection to the repository database. All client connectivity to the repository is handled by the PowerCenter Repository Service over a TCP/IP connection. The particular database account and password is specified at installation and during the configuration of the Repository Service. Developers need not have knowledge of this database account and password; they should only use their individual repository user ids and passwords. This information should be restricted to the administrator.

Other forms of security available in PowerCenter include permissions for connections. Connections include database, FTP, and external loader connections. These permissions are useful when you want to limit access to schemas in a relational database and can be set up in the Workflow Manager when source and target connections are defined.

Occasionally, you may want to restrict changes to source and target definitions in the repository. A common way to approach this security issue is to use shared folders, which are owned by an Administrator or Super User. Granting read access to developers on these folders allows them to create read-only copies in their work folders.

Informatica Security Architecture

The following diagram, Informatica PowerCenter Security, depicts PowerCenter security, including access to the repository, Repository Service, Integration Service and the command-line utilities pmrep and pmcmd.

As shown in the diagram, the repository service is the central component when using default security. It sits between the PowerCenter repository and all client applications, including GUI tools, command line tools, and the Integration Service. Each application must be authenticated against metadata stored in several tables within the repository. Each Repository Service manages a single repository database where all security data is stored as part of its metadata; this is a second layer of security. Only the Repository Service has access to this database; it authenticates all client applications against this metadata.


Repository Service Security

Connection to the PowerCenter repository database is one level of security. The Repository Service uses native drivers to communicate with the repository database. PowerCenter Client tools and the Integration Service communicate with the Repository Service over TCP/IP. When a client application connects to the repository, it connects directly to the Repository Service process. You can configure a Repository Service to run on multiple machines, or nodes, in the domain. Each instance running on a node is called a Repository Service process. This process accesses the database tables and performs most repository-related tasks.

When the Repository Service is installed, the database connection information is entered for the metadata repository. At this time you need to know the database user id and password to access the metadata repository. The database user id must be able to read and write to all tables in the database. As developers create, modify, and execute mappings and sessions, the metadata in the repository is continuously updated. Actual database security should be controlled by the DBA responsible for that database, in conjunction with the PowerCenter Repository Administrator. After the Repository Service is installed and started, all subsequent client connectivity is automatic. The database id and password are transparent at this point.


Integration Service Security

Like the Repository Service, the Integration Service communicates with the metadata repository when it executes workflows or when users are using Workflow Monitor. During configuration of the Integration Service, the repository database is identified with the appropriate user id and password. Connectivity to the repository is made using native drivers supplied by Informatica.

Certain permissions are also required to use the pmrep and pmcmd command line utilities.

Encrypting Repository Passwords

You can encrypt passwords and create an environment variable to use with pmcmd and pmrep. For example, you can encrypt the repository and database passwords for pmrep to maintain security when using pmrep in scripts. In addition, you can create an environment variable to store the encrypted password.

Use the following steps as a guideline to use an encrypted password as an environment variable:

1. Use the command line program pmpasswd to encrypt the repository password.

2. Configure the password environment variable to set the encrypted value.

To configure a password as an environment variable on UNIX:

1. At the command line, type:

pmpasswd <repository password>

pmpasswd returns the encrypted password.

2. In a UNIX C shell environment, type:

setenv <Password_Environment_Variable> <encrypted password>

In a UNIX Bourne shell environment, type:

<Password_Environment_Variable>=<encrypted password>

export <Password_Environment_Variable>

You can assign the environment variable any valid UNIX name.

To configure a password as an environment variable on Windows:

1. At the command line, type:

pmpasswd <repository password>

pmpasswd returns the encrypted password.


2. Enter the password environment variable in the Variable field. Enter the encrypted password in the Value field.

Setting the Repository User Name

For pmcmd and pmrep, you can create an environment variable to store the repository user name.

To configure a user name as an environment variable on UNIX:

1. In a UNIX C shell environment, type:

setenv <User_Name_Environment_Variable> <user name>

2. In a UNIX Bourne shell environment, type:

<User_Name_Environment_Variable>=<user name>

export <User_Name_Environment_Variable>

You can assign the environment variable any valid UNIX name.

To configure a user name as an environment variable on Windows:

1. Enter the user name environment variable in the Variable field.

2. Enter the repository user name in the Value field.
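For illustration only, the two environment variables can then be passed to pmcmd by name rather than by value. In the sketch below, PMUSER and PMPASSWD are arbitrary variable names, the service, domain, folder, and workflow names are placeholders, and the -uv and -pv option names should be verified against the pmcmd reference for your PowerCenter version; this is a sketch under those assumptions, not a definitive invocation.

# Encrypt the repository password and store both credentials in environment variables (C shell shown)
pmpasswd <repository password>
setenv PMUSER <repository user name>
setenv PMPASSWD <encrypted password returned by pmpasswd>

# Reference the variable names (not their values) when starting a workflow
pmcmd startworkflow -sv <Integration Service name> -d <domain name> -uv PMUSER -pv PMPASSWD -f <folder name> <workflow name>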

Connection Object Permissions

Within Workflow Manager, you can grant read, write, and execute permissions to groups and/or users for all types of connection objects. This controls who can create, view, change, and execute workflow tasks that use those specific connections, providing another level of security for these global repository objects.

Users with the ‘Use Workflow Manager’ privilege can create and modify connection objects. Connection objects allow the PowerCenter server to read and write to source and target databases. Any database the server can access requires a connection definition. As shown below, connection information is stored in the repository. Users executing workflows need execution permission on all connections used by the workflow. The PowerCenter server looks up the connection information in the repository, and verifies permission for the required action. If permissions are properly granted, the server reads and writes to the defined databases, as specified by the workflow.


Users

Users are the fundamental objects of security in a PowerCenter environment. Each individual logging into the PowerCenter repository should have a unique user account. Informatica does not recommend creating shared accounts; unique accounts should be created for each user. Each repository user needs a user name and password, provided by the PowerCenter Repository Administrator, to access the repository.

Users are created and managed through Repository Manager. Users should change their passwords from the default immediately after receiving the initial user id from the Administrator. Passwords can be reset by the user if they are granted the privilege ‘Use Repository Manager’.

When you create the repository, the repository automatically creates two default users:

● Administrator - The default password for Administrator is Administrator.
● Database user - The username and password used when you created the repository.

These default users are in the Administrators user group, with full privileges within the repository. They cannot be deleted from the repository, nor have their group affiliation changed.

To administer repository users, you must have one of the following privileges:

● Administer Repository
● Super User

LDAP (Lightweight Directory Access Protocol)

In addition to default repository user authentication, LDAP can be used to authenticate users. Using LDAP authentication, the repository maintains an association between the repository user and the external login name. When a user logs into the repository, the security module authenticates the user name and password against the external directory. The repository maintains a status for each user. Users can be enabled or disabled by modifying this status.

Prior to implementing LDAP, the administrator must know:

● Repository server username and password
● An administrator or superuser user name and password for the repository
● An external login name and password

To configure LDAP, follow these steps:

1. Edit ldap_authen.xml, modify the following attributes:

❍ NAME – the .dll that implements the authentication
❍ OSTYPE – Host operating system

2. Register ldap_authen.xml in the Repository Server Administration Console.
3. In the Repository Server Administration Console, configure the authentication module.

User Groups

When you create a repository, the Repository Manager creates two repository user groups. These two groups exist so you can immediately create users and begin developing repository objects. These groups cannot be deleted from the repository nor have their configured privileges changed. The default repository user groups are:

● Administrators - which has super-user access
● Public - which has a subset of default repository privileges

You should create custom user groups to manage users and repository privileges effectively. The number and types of groups that you create should reflect the needs of your development teams, administrators, and operations group. Informatica recommends minimizing the number of custom user groups that you create in order to facilitate the maintenance process.

A starting point is to create a group for each type of combination of privileges needed to support the development cycle and production process. This is the recommended method for assigning privileges. After creating a user group, you assign a set of privileges for that group. Each repository user must be assigned to at least one user group. When you assign a user to a group, the user:

● Receives all group privileges.
● Inherits any changes to group privileges.
● Loses and gains privileges if you change the user group membership.

You can also assign users to multiple groups, which grants the user the privileges of each group. Use the Repository Manager to create and edit repository user groups.

Folder Permissions


When you create or edit a folder, you define permissions for the folder. The permissions can be set at three different levels:

● owner
● owner's group
● repository - remainder of users within the repository

First, choose an owner (i.e., user) and group for the folder. If the owner belongs to more than one group, you must select one of the groups listed. Once the folder is defined and the owner is selected, determine what level of permissions you would like to grant to the users within the group. Then determine the permission level for the remainder of the repository users. The permissions that can be set include: read, write, and execute. Any combination of these can be granted to the owner, group or repository.

Be sure to consider folder permissions very carefully. They offer the easiest way to restrict user and/or group access to folders. The following table gives some examples of folders, their type, and recommended ownership.

Folder Name | Folder Type | Proposed Owner
DEVELOPER_1 | Initial development, temporary work area, unit test | Individual developer
DEVELOPMENT | Integrated development | Development lead, Administrator or Super User
UAT | Integrated User Acceptance Test | UAT lead, Administrator or Super User
PRODUCTION | Production | Administrator or Super User
PRODUCTION SUPPORT | Production fixes and upgrades | Production support lead, Administrator or Super User

Repository Privileges

Repository privileges work in conjunction with folder permissions to give a user or group authority to perform tasks. Repository privileges are the most granular way of controlling a user’s activity. Consider the privileges that each user group requires, as well as folder permissions, when determining the breakdown of users into groups. Informatica recommends creating one group for each distinct combination of folder permissions and privileges.

When you assign a user to a user group, the user receives all privileges granted to the group. You can also assign privileges to users individually. When you grant a privilege to an individual user, the user retains that privilege, even if his or her user group affiliation changes. For example, you have a user in a Developer group who has limited group privileges, and you want this user to act as a backup administrator when you are not available. For the user to perform every task in every folder in the repository, and to administer the Integration Service, the user must have the Super User privilege. For tighter security, grant the Super User privilege to the individual user, not the entire Developer group. This limits the number of users with the Super User privilege, and ensures that the user retains the privilege even if you remove the user from the Developer group.

The Repository Manager grants a default set of privileges to each new user and group for working within the repository. You can add or remove privileges from any user or group except:

● Administrators and Public (the default read-only repository groups)
● Administrator and the database user who created the repository (the users automatically created in the Administrators group)

The Repository Manager automatically grants each new user and new group the default privileges. These privileges allow you to perform basic tasks in Designer, Repository Manager, Workflow Manager, and Workflow Monitor. The following table lists the default repository privileges:

Default Repository Privileges

Default Privilege | Folder Permission | Connection Object Permission | Grants the Ability to
Use Designer | N/A | N/A | Connect to the repository using the Designer. Configure connection information.
Use Designer | Read | N/A | View objects in the folder. Change folder versions. Create shortcuts to objects in the folder. Copy objects from the folder. Export objects.
Use Designer | Read/Write | N/A | Create or edit metadata. Create shortcuts from shared folders. Copy objects into the folder. Import objects.
Browse Repository | N/A | N/A | Connect to the repository using the Repository Manager. Add and remove reports. Import, export, or remove the registry. Search by keywords. Change your user password.
Browse Repository | Read | N/A | View dependencies. Unlock objects, versions, and folders locked by your username. Edit folder properties for folders you own. Copy a version (you must also have Administer Repository or Super User privilege in the target repository and write permission on the target folder). Copy a folder (you must also have Administer Repository or Super User privilege in the target repository).
Use Workflow Manager | N/A | N/A | Connect to the repository using the Workflow Manager. Create database, FTP, and external loader connections in the Workflow Manager. Run the Workflow Monitor.
Use Workflow Manager | N/A | Read/Write | Edit database, FTP, and external loader connections in the Workflow Manager.
Use Workflow Manager | Read | N/A | Export sessions. View workflows. View sessions. View tasks. View session details and session performance details.
Use Workflow Manager | Read/Write | N/A | Create and edit workflows and tasks. Import sessions. Validate workflows and tasks.
Use Workflow Manager | Read/Write | Read | Create and edit sessions.
Use Workflow Manager | Read/Execute | N/A | View session log.
Use Workflow Manager | Read/Execute | Execute | Schedule or unschedule workflows. Start workflows immediately.
Use Workflow Manager | Execute | N/A | Restart workflow. Stop workflow. Abort workflow. Resume workflow.
Use Repository Manager | N/A | N/A | Remove label references.
Use Repository Manager | Write (deployment group) | | Delete from deployment group.
Use Repository Manager | Write | | Change object version comments if not owner. Change status of object. Check in. Check out/undo check-out. Delete objects from folder. Mass validation (needs write permission if options selected). Recover after delete.
Use Repository Manager | Read | | Export objects.
Use Repository Manager | Read/Write (folder and deployment group) | | Add to deployment group.
Use Repository Manager | Read/Write (original and target folders) | | Copy objects. Import objects.
Use Repository Manager | Read/Write/Execute (folder and label) | | Apply label.

Extended Privileges

In addition to the default privileges listed above, Repository Manager provides extended privileges that you can assign to users and groups. These privileges are granted to the Administrator group by default. The following table lists the extended repository privileges:


Extended Repository Privileges

Extended Privilege | Folder Permission | Connection Object Permission | Grants the Ability to
Admin Repository | N/A | N/A | Create, upgrade, backup, delete, and restore the repository. Manage passwords, users, groups, and privileges. Start, stop, enable, disable, and check the status of the repository.
Admin Repository | Write | | Check in or undo check out for other users. Purge (in version-enabled repository).
Admin Integration Service | N/A | N/A | Disable the Integration Service using the infacmd program. Connect to the Integration Service from PowerCenter client applications when running the Integration Service in safe mode.
Super User | N/A | N/A | Perform all tasks, across all folders in the repository. Manage connection object permissions. Manage global object permissions. Perform mass validate.
Workflow Operator | N/A | N/A | Connect to the Integration Service.
Workflow Operator | Read | | View the session log. View the workflow log. View session details and performance details.
Workflow Operator | Execute | | Abort workflow. Restart workflow. Resume workflow. Stop workflow. Schedule and unschedule workflows.
Workflow Operator | Read | Execute | Start workflows immediately.
Workflow Operator | Execute | Execute | Use pmcmd to start workflows in folders for which you have execute permission.
Manage Connection | N/A | N/A | Create and edit connection objects. Delete connection objects. Manage connection object permissions.
Manage Label | N/A | N/A | Create labels. Delete labels.

Extended privileges allow you to perform more tasks and expand the access you have to repository objects. Informatica recommends that you reserve extended privileges for individual users and grant default privileges to groups.

Audit Trails

You can track changes to Repository users, groups, privileges, and permissions by selecting the SecurityAuditTrail configuration option in the Repository Service properties in the PowerCenter Administration Console. When you enable the audit trail, the Repository Service logs security changes to the Repository Service log.

The audit trail logs the following operations:

● Changing the owner, owner's group, or permissions for a folder.
● Changing the password of another user.
● Adding or removing a user.
● Adding or removing a group.
● Adding or removing users from a group.
● Changing global object permissions.
● Adding or removing user and group privileges.

Sample Security Implementation

The following steps provide an example of how to establish users, groups, permissions, and privileges in your environment. Again, the requirements of your projects and production systems should dictate how security is established.

1. Identify users and the environments they will support (e.g., Development, UAT, QA, Production, Production Support, etc.).

2. Identify the PowerCenter repositories in your environment (this may be similar to the basic groups listed in Step 1; for example, Development, UAT, QA, Production, etc.).

3. Identify which users need to exist in each repository.
4. Define the groups that will exist in each PowerCenter Repository.
5. Assign users to groups.
6. Define privileges for each group.

The following table provides an example of groups and privileges that may exist in the PowerCenter repository. This example assumes one PowerCenter project with three environments co-existing in one PowerCenter repository.

GROUP NAME | FOLDER | FOLDER PERMISSIONS | PRIVILEGES
ADMINISTRATORS | All | All | Super User (all privileges)
DEVELOPERS | Individual development folder; integrated development folder | Read, Write, Execute | Use Designer, Browse Repository, Use Workflow Manager
DEVELOPERS | UAT | Read | Use Designer, Browse Repository, Use Workflow Manager
UAT | UAT working folder | Read, Write, Execute | Use Designer, Browse Repository, Use Workflow Manager
UAT | Production | Read | Use Designer, Browse Repository, Use Workflow Manager
OPERATIONS | Production | Read, Execute | Browse Repository, Workflow Operator
PRODUCTION SUPPORT | Production maintenance folders | Read, Write, Execute | Use Designer, Browse Repository, Use Workflow Manager
PRODUCTION SUPPORT | Production | Read | Browse Repository

Informatica PowerCenter Security Administration

As mentioned earlier, one individual should be identified as the Informatica Administrator. This individual is responsible for a number of tasks in the Informatica environment, including security. To summarize, here are the security-related tasks an administrator is responsible for:

● Creating user accounts.
● Defining and creating groups.
● Defining and granting folder permissions.
● Defining and granting repository privileges.
● Enforcing changes in passwords.
● Controlling requests for changes in privileges.
● Creating and maintaining database, FTP, and external loader connections in conjunction with the database administrator.
● Working with the operations group to ensure tight security in the production environment.

Remember, you must have one of the following privileges to administer repository users:

● Administer Repository
● Super User

Summary of Recommendations

When implementing your security model, keep the following recommendations in mind:

● Create groups with limited privileges.
● Do not use shared accounts.
● Limit user and group access to multiple repositories.
● Customize user privileges.
● Limit the Super User privilege.
● Limit the Administer Repository privilege.
● Restrict the Workflow Operator privilege.
● Follow a naming convention for user accounts and group names.
● For more secure environments, turn Audit Trail logging on.

Last updated: 05-Feb-07 15:33


Data Analyzer Security

Challenge

Using Data Analyzer's sophisticated security architecture to establish a robust security system that safeguards valuable business information across a range of technologies and security models. Ensuring that Data Analyzer security provides appropriate mechanisms to support and augment the security infrastructure of a Business Intelligence environment at every level.

Description

Four main architectural layers must be completely secure: user layer, transmission layer, application layer and data layer.

Users must be authenticated and authorized to access data. Data Analyzer integrates seamlessly with the following LDAP-compliant directory servers:

SunOne/iPlanet Directory Server 4.1

Sun Java System Directory Server 5.2


Novell eDirectory Server 8.7

IBM SecureWay Directory 3.2

IBM SecureWay Directory 4.1

IBM Tivoli Directory Server 5.2

Microsoft Active Directory 2000

Microsoft Active Directory 2003

In addition to the directory server, Data Analyzer supports Netegrity SiteMinder for centralizing authentication and access control for the various web applications in the organization.

Transmission Layer

The data transmission must be secure and hacker-proof. Data Analyzer supports the standard security protocol Secure Sockets Layer (SSL) to provide a secure environment.

Application Layer

Only appropriate application functionality should be provided to users with associated privileges. Data Analyzer provides three basic types of application-level security:

● Report, Folder and Dashboard Security. Restricts access for users or groups to specific reports, folders, and/or dashboards.
● Column-level Security. Restricts users and groups to particular metric and attribute columns.
● Row-level Security. Restricts users to specific attribute values within an attribute column of a table.

Components for Managing Application Layer Security

Data Analyzer users can perform a variety of tasks based on the privileges that you grant them. Data Analyzer provides the following components for managing application layer security:

● Roles. A role can consist of one or more privileges. You can use system roles or create custom roles. You can grant roles to groups and/or individual users. When you edit a custom role, all groups and users with the role automatically inherit the change.

● Groups. A group can consist of users and/or groups. You can assign one or more roles to a group. Groups are created to organize logical sets of users and roles. After you create groups, you can assign users to the groups. You can also assign groups to other groups to organize privileges for related users. When you edit a group, all users and groups within the edited group inherit the change.

● Users. A user has a user name and password. Each person accessing Data Analyzer must have a unique user name. To set the tasks a user can perform, you can assign roles to the user or assign the user to a group with predefined roles.

Types of Roles

System roles - Data Analyzer provides a set of roles when the repository is created. Each role has sets of privileges assigned to it.

Custom roles - The end user can create and assign privileges to these roles.

Managing Groups

Groups allow you to classify users according to a particular function. You may organize users into groups based on their departments or management level. When you assign roles to a group, you grant the same privileges to all members of the group. When you change the roles assigned to a group, all users in the group inherit the changes. If a user belongs to more than one group, the user has the privileges from all groups. To organize related users into related groups, you can create group hierarchies. With hierarchical groups, each subgroup automatically receives the roles assigned to the group it belongs to. When you edit a group, all subgroups contained within it inherit the changes.

For example, you may create a Lead group and assign it the Advanced Consumer role. Within the Lead group, you create a Manager group with a custom role Manage Data Analyzer. Because the Manager group is a subgroup of the Lead group, it has both the Manage Data Analyzer and Advanced Consumer role privileges.

Belonging to multiple groups has an inclusive effect. For example, if group 1 has access to something but group 2 is excluded from that object, a user belonging to both groups 1 and 2 will have access to the object.


Preventing Data Analyzer from Updating Group Information

If you use Windows Domain or LDAP authentication, you typically maintain the users and groups in the Windows Domain or LDAP directory service rather than in Data Analyzer. However, some organizations keep only user accounts in the Windows Domain or LDAP directory service, but set up groups in Data Analyzer to organize the Data Analyzer users. Data Analyzer provides a way for you to keep user accounts in the authentication server and still keep the groups in Data Analyzer.

Ordinarily, when Data Analyzer synchronizes the repository with the Windows Domain or LDAP directory service, it updates the users and groups in the repository and deletes users and groups that are not found in the Windows Domain or LDAP directory service.

To prevent Data Analyzer from deleting or updating groups in the repository, you can set a property in the web.xml file so that Data Analyzer updates only user accounts, not groups. You can then create and manage groups in Data Analyzer for users in the Windows Domain or LDAP directory service.

The web.xml file is stored in the Data Analyzer EAR file. To access the files in the Data Analyzer EAR file, use the EAR Repackager utility provided with Data Analyzer.

Note: Be sure to back up the web.xml file before you modify it.

To prevent Data Analyzer from updating group information in the repository:

1. In the directory where you extracted the Data Analyzer EAR file, locate the web.xml file in the following directory:

/custom/properties

2. Open the web.xml file with a text editor and locate the line containing the following property:

enableGroupSynchronization

The enableGroupSynchronization property determines whether Data Analyzer updates the groups in the repository.


3. To prevent Data Analyzer from updating group information in the Data Analyzer repository, change the value of the enableGroupSynchronization property to false:

<init-param>
    <param-name>InfSchedulerStartup.com.informatica.ias.scheduler.enableGroupSynchronization</param-name>
    <param-value>false</param-value>
</init-param>

When the value of enableGroupSynchronization property is false, Data Analyzer does not synchronize the groups in the repository with the groups in the Windows Domain or LDAP directory service.

4. Save the web.xml file and add it back to the Data Analyzer EAR file.

5. Restart Data Analyzer.

When the enableGroupSynchronization property in the web.xml file is set to false, Data Analyzer updates only the user accounts in Data Analyzer the next time it synchronizes with the Windows Domain or LDAP authentication server. You must create and manage groups, and assign users to groups in Data Analyzer.

Managing Users

Each user must have a unique user name to access Data Analyzer. To perform Data Analyzer tasks, a user must have the appropriate privileges. You can assign privileges to a user with roles or groups.

Data Analyzer creates a System Administrator user account when you create the repository. The default user name for the System Administrator user account is admin. The system daemon, ias_scheduler/padaemon, runs the updates for all time-based schedules. System daemons must have a unique user name and password in order to perform Data Analyzer system functions and tasks. You can change the password for a system daemon, but you cannot change the system daemon user name via the GUI. Data Analyzer permanently assigns the daemon role to system daemons. You cannot assign new roles to system daemons or assign them to groups.

To change the password for a system daemon, complete the following steps:

1. Change the password in the Administration tab in Data Analyzer.
2. Change the password in the web.xml file in the Data Analyzer folder.
3. Restart Data Analyzer.

Access LDAP Directory Contacts


To access contacts in the LDAP directory service, you can add the LDAP server on the LDAP Settings page. After you set up the connection to the LDAP directory service, users can email reports and shared documents to LDAP directory contacts.

When you add an LDAP server, you must provide a value for the BaseDN (distinguished name) property. In the BaseDN property, enter the Base DN entries for your LDAP directory. The Base distinguished name entries define the type of information that is stored in the LDAP directory. If you do not know the value for BaseDN, contact your LDAP system administrator.

Customizing User Access

You can customize Data Analyzer user access with the following security options:

● Access permissions. Restrict user and/or group access to folders, reports, dashboards, attributes, metrics, template dimensions, or schedules. Use access permissions to restrict access to a particular folder or object in the repository.

● Data restrictions. Restrict user and/or group access to information in fact and dimension tables and operational schemas. Use data restrictions to prevent certain users or groups from accessing specific values when they create reports.

● Password restrictions. Restrict users from changing their passwords. Use password restrictions when you do not want users to alter their passwords.

When you create an object in the repository, every user has default read and write permissions for that object. By customizing access permissions for an object, you determine which users and/or groups can read, write, delete, or change access permissions for that object.

When you set data restrictions, you determine which users and groups can view particular attribute values. If a user with a data restriction runs a report, Data Analyzer does not display the restricted data to that user.

Types of Access Permissions

Access permissions determine the tasks that you can perform for a specific repository object. When you set access permissions, you determine which users and groups have access to the folders and repository objects. You can assign the following types of access permissions to repository objects:

● Read. Allows you to view a folder or object.
● Write. Allows you to edit an object. Also allows you to create and edit folders and objects within a folder.
● Delete. Allows you to delete a folder or an object from the repository.
● Change permission. Allows you to change the access permissions on a folder or object.

By default, Data Analyzer grants read and write access permissions to every user in the repository. You can use the General Permissions area to modify default access permissions for an object, or turn off default access permissions.


Data Restrictions

You can restrict access to data based on the values of related attributes. Data restrictions are set to keep sensitive data from appearing in reports. For example, you may want to restrict data related to the performance of a new store from outside vendors. You can set a data restriction that excludes the store ID from their reports.

You can set data restrictions using one of the following methods:

● Set data restrictions by object. Restrict access to attribute values in a fact table, operational schema, real-time connector, and real-time message stream. You can apply the data restriction to users and groups in the repository. Use this method to apply the same data restrictions to more than one user or group.

● Set data restrictions for one user at a time. Edit a user account or group to restrict user or group access to specified data. You can set one or more data restrictions for each user or group. Use this method to set custom data restrictions for different users or groups

Types of Data Restrictions

You can set two kinds of data restrictions:

● Inclusive. Use the IN option to allow users to access data related to the attributes you select. For example, to allow users to view only data from the year 2001, create an “IN 2001” rule.

● Exclusive. Use the NOT IN option to restrict users from accessing data related to the attributes you select. For example, to allow users to view all data except from the year 2001, create a “NOT IN 2001” rule.

Restricting Data Access by User or Group

You can edit a user or group profile to restrict the data the user or group can access in reports. When you edit a user profile, you can set data restrictions for any schema in the repository, including operational schemas and fact tables.

You can set a data restriction to limit user or group access to data in a single schema based on the attributes you select. If the attributes apply to more than one schema in the repository, you can also restrict the user or group access from related data across all schemas in the repository. For example, you may have a Sales fact table and Salary fact table. Both tables use the Region attribute. You can set one data restriction that applies to both the Sales and Salary fact tables based on the region you select.

To set data restrictions for a user or group, you need the following role or privilege:

● System Administrator role
● Access Management privilege

When Data Analyzer runs scheduled reports that have provider-based security, it runs reports against the data restrictions for the report owner. However, if the reports have consumer-based security, the Data Analyzer Server creates a separate report for each unique security profile.


The following information applies only to the steps required to change the admin user on WebLogic.

To change the Data Analyzer system administrator user name on WebLogic 8.1 (Data Analyzer 8.1):

● Repository authentication. You must use the Update System Accounts utility to change the system administrator account name in the repository.

● LDAP or Windows Domain Authentication. Set up the new system administrator account in Windows Domain or LDAP directory service. Then use the Update System Accounts utility to change the system administrator account name in the repository.

To change the Data Analyzer default users (admin, ias_scheduler/padaemon):

1. Back up the repository.

2. Go to the WebLogic library directory: .\bea\wlserver6.1\lib

3. Open the file ias.jar and locate the file entry called InfChangeSystemUserNames.class

4. Extract the file "InfChangeSystemUserNames.class" into a temporary directory (example: d:\temp)

5. This extracts the file as 'd:\temp\Repository Utils\Refresh\InfChangeSystemUserNames.class'

6. Create a batch file (change_sys_user.bat) with the following commands in the directory D:\Temp\Repository Utils\Refresh\

REM To change the system user name and password
REM *******************************************
REM Change the BEA home here
REM ************************
set JAVA_HOME=E:\bea\wlserver6.1\jdk131_06
set WL_HOME=E:\bea\wlserver6.1
set CLASSPATH=%WL_HOME%\sql
set CLASSPATH=%CLASSPATH%;%WL_HOME%\lib\jconn2.jar
set CLASSPATH=%CLASSPATH%;%WL_HOME%\lib\classes12.zip
set CLASSPATH=%CLASSPATH%;%WL_HOME%\lib\weblogic.jar
set CLASSPATH=%CLASSPATH%;%WL_HOME%\lib\ias.jar
set CLASSPATH=%CLASSPATH%;%WL_HOME%\lib\ias_securityadapter.jar
set CLASSPATH=%CLASSPATH%;%WL_HOME%\infalicense
REM Change the DB information here and also
REM the user -Dias_scheduler and -Dadmin to values of your choice
REM *************************************************************
%JAVA_HOME%\bin\java -Ddriver=com.informatica.jdbc.sqlserver.SQLServerDriver -Durl=jdbc:informatica:sqlserver://host_name:port;SelectMethod=cursor;DatabaseName=database_name -Duser=userName -Dpassword=userPassword -Dias_scheduler=pa_scheduler -Dadmin=paadmin repositoryutil.refresh.InfChangeSystemUserNames
REM END OF BATCH FILE

7. Make changes in the batch file as directed in the remarks [REM lines]

8. Save the file and open up a command prompt window and navigate to D:\Temp\Repository Utils\Refresh\

9. At the prompt, type change_sys_user.bat and press Enter.

The users "ias_scheduler" and "admin" will be changed to "pa_scheduler" and "paadmin", respectively.

10. Modify web.xml, and weblogic.xml (located at .\bea\wlserver6.1\config\informatica\applications\ias\WEB-INF) by replacing ias_scheduler with 'pa_scheduler'

11. Replace ias_scheduler with pa_scheduler in the xml file weblogic-ejb-jar.xml

This file is in the iasEjb.jar file, located in the directory .\bea\wlserver6.1\config\informatica\applications\

To edit the file

Make a copy of the iasEjb.jar:

● mkdir \tmp
● cd \tmp
● jar xvf \bea\wlserver6.1\config\informatica\applications\iasEjb.jar META-INF
● cd META-INF
● Update META-INF/weblogic-ejb-jar.xml, replacing ias_scheduler with pa_scheduler
● cd \
● jar uvf \bea\wlserver6.1\config\informatica\applications\iasEjb.jar -C \tmp .

Note: There is a trailing period at the end of the command above.

12. Restart the server.

Last updated: 05-Feb-07 15:39


Database Sizing

Challenge

Database sizing involves estimating the types and sizes of the components of a data architecture. This is important for determining the optimal configuration for the database servers in order to support the operational workloads. Individuals involved in a sizing exercise may be data architects, database administrators, and/or business analysts.

Description

The first step in database sizing is to review system requirements to define such things as:

● Expected data architecture elements (will there be staging areas? operational data stores? centralized data warehouse and/or master data? data marts?)

Each additional database element requires more space. This is especially true where data is replicated across multiple systems, such as a data warehouse environment that also maintains an operational data store. The same data in the ODS will be present in the warehouse as well, albeit in a different format.

● Expected source data volume

It is useful to analyze how each row in the source system translates into the target system. In most situations the row count in the target system can be calculated by following the data flows from the source to the target. For example, say a sales order table is being built by denormalizing a source table. The source table holds sales data for 12 months in a single row (one column for each month). Each row in the source translates to 12 rows in the target. So a source table with one million rows ends up as a 12 million row table.

● Data granularity and periodicity

Granularity refers to the lowest level of information that is going to be stored in a fact table. Granularity affects the size of a database to a great extent, especially for aggregate tables. The level at which a table has been aggregated increases or decreases a table's row count. For example, a sales order fact table's size is likely to be greatly affected by whether the table is being aggregated at a monthly level or at a quarterly level. The granularity of fact tables is determined by the dimensions linked to that table. The number of dimensions that are connected to the fact tables affects the granularity of the table and hence the size of the table.

● Load frequency and method (full refresh? incremental updates?)

Load frequency affects the space requirements for the staging areas. A load plan that updates a target less frequently is likely to load more data at one go. Therefore, more space is required by the staging areas. A full refresh requires more space for the same reason.

● Estimated growth rates over time and retained history

Determining Growth Projections

One way to estimate projections of data growth over time is to use scenario analysis. As an example, for scenario analysis of a sales tracking data mart you can use the number of sales transactions to be stored as the basis for the sizing estimate. In the first year, 10 million sales transactions are expected; this equates to 10 million fact-table records.

Next, use the sales growth forecasts for the upcoming years for database growth calculations. That is, an annual sales growth rate of 10 percent translates into 11 million fact table records for the next year. At the end of five years, the fact table is likely to contain about 60 million records. You may want to calculate other estimates based on five-percent annual sales growth (case 1) and 20-percent annual sales growth (case 2). Multiple projections for best and worst case scenarios can be very helpful.
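To make the arithmetic concrete, the short sketch below tabulates the cumulative fact-table row counts for the three hypothetical growth cases described above (5, 10, and 20 percent annual growth from a 10-million-row first year). It is an illustration only, not part of any Informatica utility.

# Cumulative fact-table rows after 5 years for 5%, 10%, and 20% annual growth,
# starting from 10 million transactions in year 1.
for rate in 0.05 0.10 0.20; do
  awk -v r="$rate" 'BEGIN {
    rows = 10000000; total = 0;
    for (y = 1; y <= 5; y++) { total += rows; rows *= (1 + r) }
    printf "growth %2.0f%%: ~%.1f million rows after 5 years\n", r * 100, total / 1000000
  }'
done

For the 10-percent case this prints roughly 61 million rows, matching the estimate above.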

Oracle Table Space Prediction Model

Oracle (10g and onwards) provides a mechanism to predict the growth of a database. This feature can be useful in predicting table space requirements.

Oracle incorporates a table space prediction model in the database engine that provides projected statistics for space used by a table. The following Oracle 10g query returns projected space usage statistics:

SELECT *
FROM TABLE(DBMS_SPACE.object_growth_trend('schema', 'tablename', 'TABLE'))
ORDER BY timepoint;

The results of this query are shown below:

TIMEPOINT                      SPACE_USAGE SPACE_ALLOC QUALITY
------------------------------ ----------- ----------- --------------------
11-APR-04 02.55.14.116000 PM          6372       65536 INTERPOLATED
12-APR-04 02.55.14.116000 PM          6372       65536 INTERPOLATED
13-APR-04 02.55.14.116000 PM          6372       65536 INTERPOLATED
13-MAY-04 02.55.14.116000 PM          6372       65536 PROJECTED
14-MAY-04 02.55.14.116000 PM          6372       65536 PROJECTED
15-MAY-04 02.55.14.116000 PM          6372       65536 PROJECTED
16-MAY-04 02.55.14.116000 PM          6372       65536 PROJECTED

The QUALITY column indicates the quality of the output as follows:

● GOOD - The data for the timepoint relates to data within the AWR repository with a timestamp within 10 percent of the interval.

● INTERPOLATED - The data for this timepoint did not meet the GOOD criteria but was based on data gathered before and after the timepoint.

● PROJECTED - The timepoint is in the future, so the data is estimated based on previous growth statistics.

Baseline Volumetric

Next, use the physical data models for the sources and the target architecture to develop a baseline sizing estimate. The administration guides for most DBMSs contain sizing guidelines for the various database structures such as tables, indexes, sort space, data files, log files, and database cache.

Develop a detailed sizing using a worksheet inventory of the tables and indexes from the physical data model, along with field data types and field sizes. Various database products use different storage methods for data types. For this reason, be sure to use the database manuals to determine the size of each data type. Add up the field sizes to determine row size. Then use the data volume projections to determine the number of rows to multiply by the table size.

The default estimate for index size is to assume same size as the table size. Also estimate the temporary space for sort operations. For data warehouse applications where summarizations are common, plan on large temporary spaces. The temporary space can be as much as 1.5 times larger than the largest table in the database.
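The worksheet arithmetic can be sketched as follows. The table names, average row sizes, and row counts below are hypothetical placeholders that should be replaced with figures from your physical data model and the DBMS storage manuals; the index and temporary-space ratios are the rules of thumb described above.

# Rough volumetric worksheet: table size = row size x row count,
# index estimate = table size, temp space = 1.5 x largest table.
awk 'BEGIN {
  # Format: table_name:avg_row_bytes:projected_rows (hypothetical figures)
  n = split("sales_fact:120:60000000 customer_dim:400:2000000 product_dim:300:500000", t, " ");
  largest = 0;
  for (i = 1; i <= n; i++) {
    split(t[i], f, ":");
    gb = f[2] * f[3] / (1024^3);
    total += gb * 2;                      # table plus same-size index estimate
    if (gb > largest) largest = gb;
    printf "%-14s table %6.2f GB, index %6.2f GB\n", f[1], gb, gb;
  }
  printf "Temp/sort space (1.5 x largest table): %6.2f GB\n", largest * 1.5;
  printf "Estimated total:                       %6.2f GB\n", total + largest * 1.5;
}'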

Another approach that is sometimes useful is to load the data architecture with representative data and determine the resulting database sizes. This test load can be a fraction of the actual data and is used only to gather basic sizing statistics. You then need to apply growth projections to these statistics. For example, after loading ten thousand sample records to the fact table, you determine the size to be 10MB. Based on the scenario analysis, you can expect this fact table to contain 60 million records after five years. So, the estimated size for the fact table is about 60GB [i.e., 10 MB * (60,000,000/10,000)]. Don't forget to add indexes and summary tables to the calculations.

Guesstimating

When there is not enough information to calculate an estimate as described above, use educated guesses and “rules of thumb” to develop as reasonable an estimate as possible.

● If you don’t have the source data model, use what you do know of the source data to estimate average field size and average number of fields in a row to determine table size. Based on your understanding of transaction volume over time, determine your growth metrics for each type of data and calculate out your source data volume (SDV) from table size and growth metrics.

● If your target data architecture is not complete enough for you to determine table sizes, base your estimates on multiples of the SDV (a small worked sketch follows this list):

❍ If it includes staging areas: add another SDV for any source subject area that you will stage multiplied by the number of loads you’ll retain in staging.

❍ If you intend to consolidate data into an operational data store, add the SDV multiplied by the number of loads to be retained in the ODS for historical purposes (e.g., keeping one year’s worth of monthly loads = 12 x SDV)

❍ Data warehouse architectures are based on the periodicity and granularity of the warehouse; this may be another SDV + (.3n x SDV where n = number of time periods loaded in the warehouse over time)

❍ If your data architecture includes aggregates, add a percentage of the warehouse volumetrics based on how much of the warehouse data will be aggregated and to what level (e.g., if the rollup level represents 10 percent of the dimensions at the details level, use 10 percent).

❍ Similarly, for data marts add a percentage of the data warehouse based on how much of the warehouse data is moved into the data mart.

❍ Be sure to consider the growth projections over time and the history to be retained in all of your calculations.


And finally, remember that there is always much more data than you expect so you may want to add a reasonable fudge-factor to the calculations for a margin of safety.
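
To make the arithmetic concrete, the following is a purely illustrative example that applies the multiples above to an assumed source data volume (SDV) of 100GB:

● Staging, retaining 3 loads: 3 x 100GB = 300GB
● ODS, retaining 12 monthly loads: 12 x 100GB = 1,200GB
● Data warehouse, 36 monthly periods: 100GB + (.3 x 36 x 100GB) = 1,180GB
● Aggregates at 10 percent of the warehouse: approximately 118GB
● Data marts at 25 percent of the warehouse: approximately 295GB

This totals roughly 3.1TB before indexes and temporary space; applying a 20 percent fudge-factor raises the planning figure to approximately 3.7TB.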

Last updated: 01-Feb-07 18:52


Deployment Groups

Challenge

In selectively migrating objects from one repository folder to another, there is a need for a versatile and flexible mechanism that can overcome such limitations as confinement to a single source folder.

Description

Deployment Groups are containers that hold references to objects that need to be migrated. This includes objects such as mappings, mapplets, reusable transformations, sources, targets, workflows, sessions and tasks, as well as the object holders (i.e., the repository folders). Deployment groups are faster and more flexible than folder moves for incremental changes. In addition, they allow for migration “rollbacks” if necessary. Migrating a deployment group allows you to copy objects in a single copy operation from across multiple folders in the source repository into multiple folders in the target repository. Copying a deployment group also allows you to specify individual objects to copy, rather than the entire contents of a folder.

There are two types of deployment groups: static and dynamic.

● Static deployment groups contain direct references to versions of objects that need to be moved. Users explicitly add the version of the object to be migrated to the deployment group. Create a static deployment group if you do not expect the set of deployment objects to change between the deployments.

● Dynamic deployment groups contain a query that is executed at the time of deployment. The results of the query (i.e., object versions in the repository) are then selected and copied to the target repository/folder. Create a dynamic deployment group if you expect the deployment objects to change frequently between deployments.

Dynamic deployment groups are generated from a query. While any available criteria can be used, it is advisable to have developers use labels to simplify the query. See the Best Practice on Using PowerCenter Labels, Strategies for Labels section, for further information. When generating a query for deployment groups with mappings and mapplets that contain non-reusable objects, you must use a query condition in addition to specific selection criteria. The query must include a condition for Is Reusable with a qualifier that covers both Reusable and Non-Reusable objects. Without this qualifier, the deployment may encounter errors if there are non-reusable objects held within the mapping or mapplet.

A deployment group exists in a specific repository. It can be used to move items to any other accessible repository/folder. A deployment group maintains a history of all migrations it has performed. It tracks what versions of objects were moved from which folders in which source repositories, and into which folders in which target repositories those versions were copied (i.e., it provides a complete audit trail of all migrations performed). Given that the deployment group knows what it moved and to where, then if necessary, an administrator can have the deployment group “undo” the most recent deployment, reverting the target repository to its pre-deployment state. Using labels (as described in the Using PowerCenter Labels Best Practice) allows objects in the subsequent repository to be tracked back to a specific deployment.

It is important to note that the deployment group only migrates the objects it contains to the target repository/folder. It does not, itself, move to the target repository. It still resides in the source repository.

Deploying via the GUI

You can perform migrations via the GUI or the command line (pmrep). To migrate objects via the GUI, simply drag a deployment group from the repository it resides in onto the target repository where the objects it references are to be moved. The Deployment Wizard appears to step you through the deployment process. You can match folders in the source and target repositories so objects are moved into the proper target folders, reset sequence generator values, etc. Once the wizard is complete, the migration occurs, and the deployment history is created.

Deploying via the Command Line

Alternatively, you can use the PowerCenter pmrep command to automate both Folder Level deployments (e.g., in a non-versioned repository) and deployments using Deployment Groups. The commands DeployFolder and DeployDeploymentGroup in pmrep are used respectively for these purposes. Whereas deployment via the GUI requires you to step through a wizard and answer a series of questions to deploy, command-line deployment requires you to provide an XML control file, which contains the same information that the wizard requests. This file must be present before the deployment is executed.
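
As a sketch (the group, file, and repository names here are illustrative, and the exact options should be confirmed against the pmrep Command Line Reference for your PowerCenter version), a deployment group copy might be invoked as follows after connecting to the source repository with pmrep connect:

pmrep deploydeploymentgroup -p RELEASE_DG -c deploy_control.xml -r PROD_REPOSITORY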

Considerations for Deployment and Deployment Groups


Simultaneous Multi-Phase Projects

If multiple phases of a project are being developed simultaneously in separate folders, it is possible to consolidate them by mapping folders appropriately through the deployment group migration wizard. When migrating with deployment groups in this way, the override buttons in the migration wizard are used to select specific folder mapping.

Rolling Back a Deployment

Deployment groups help to ensure that you have a back-out methodology. You can roll back the latest version of a deployment. To do this:

In the target repository (where the objects were migrated to), go to Versioning>>Deployment>>History>>View History>>Rollback.

The rollback purges all objects (of the latest version) that were in the deployment group. You can initiate a rollback on a deployment as long as you roll back only the latest versions of the objects. The rollback ensures that the check-in time for the repository objects is the same as the deploy time.

Managing Repository Size

As you check in objects and deploy objects to target repositories, the number of object versions in those repositories increases, and thus, the size of the repositories also increases.

In order to manage repository size, use a combination of Check-in Date and Latest Status (both are query parameters) to purge the desired versions from the repository and retain only the very latest version. You may also choose to purge all the deleted versions of the objects, which reduces the size of the repository.

If you want to keep more than the latest version, you can also include labels in your query. These labels are ones that you have applied to the repository for the specific purpose of identifying objects for purging.

Off-Shore, On-Shore Migration

When migrating from an off-shore development environment to an on-shore environment, other aspects of the computing environment may make it desirable to generate a dynamic deployment group. Instead of migrating the group itself to the next repository, you can use a query to select the objects for migration and save them to a single XML file, which can then be transmitted to the on-shore environment through alternative methods. If the on-shore repository is versioned, it activates the import wizard as if a deployment group was being received.

Migrating to a Non-Versioned Repository

In some instances, it may be desirable to migrate to a non-versioned repository from a versioned repository. Note that this changes the wizards used when migrating in this manner, and that the export from the versioned repository must take place using XML export. Also be aware that certain repository objects (e.g., connections) cannot be automatically migrated, which may invalidate objects such as sessions. To resolve this issue, first set up the objects/connections in the receiving repository; the XML import wizard will advise of any invalidations that occur.

Last updated: 01-Feb-07 18:52


Migration Procedures - PowerCenter

Challenge

Develop a migration strategy that ensures clean migration between development, test, quality assurance (QA), and production environments, thereby protecting the integrity of each of these environments as the system evolves.

Description

Ensuring that an application has a smooth migration process between development, QA, and production environments is essential for the deployment of an application. Deciding which migration strategy works best for a project depends on two primary factors.

● How is the PowerCenter repository environment designed? Are there individual repositories for development, QA, and production, or are there just one or two environments that share one or all of these phases?

● How has the folder architecture been defined?

Each of these factors plays a role in determining the migration procedure that is most beneficial to the project.

PowerCenter offers flexible migration options that can be adapted to fit the need of each application. PowerCenter migration options include repository migration, folder migration, object migration, and XML import/export. In versioned PowerCenter repositories, users can also use static or dynamic deployment groups for migration, which provides the capability to migrate any combination of objects within the repository with a single command.

This Best Practice is intended to help the development team decide which technique is most appropriate for the project. The following sections discuss various options that are available, based on the environment and architecture selected. Each section describes the major advantages of its use, as well as its disadvantages.

Repository Environments

The following section outlines the migration procedures for standalone and distributed repository environments. The distributed environment section touches on several migration architectures, outlining the pros and cons of each. Also, please note that any methods described in the Standalone section may also be used in a Distributed environment.

Standalone Repository Environment

In a standalone environment, all work is performed in a single PowerCenter repository that serves as the metadata store. Separate folders are used to represent the development, QA, and production workspaces and segregate work. This type of architecture within a single repository ensures seamless migration from development to QA, and from QA to production.

The following example shows a typical architecture. In this example, the company has chosen to create separate development folders for each individual developer for development and unit-test purposes. A single shared or common development folder, SHARED_MARKETING_DEV, holds all of the common objects, such as sources, targets, and reusable mapplets. In addition, two test folders are created for QA purposes. The first contains all of the unit-tested mappings from the development folder. The second is a common or shared folder that contains all of the tested shared objects. Eventually, as the following paragraphs explain, two production folders will also be built.

Proposed Migration Process – Single Repository

DEV to TEST – Object Level Migration

Now that we've described the repository architecture for this organization, let's discuss how it will migrate mappings to test, and then eventually to production.

After all mappings have completed their unit testing, the process for migration to test can begin. The first step in this process is to copy all of the shared or common objects from the SHARED_MARKETING_DEV folder to the SHARED_MARKETING_TEST folder. This can be done using one of two methods:

● The first, and most common method, is object migration via an object copy. In this case, a user opens the SHARED_MARKETING_TEST folder and drags the object from the SHARED_MARKETING_DEV into the appropriate workspace (i.e., Source Analyzer, Warehouse Designer, etc.). This is similar to dragging a file from one folder to another using Windows Explorer.

● The second approach is object migration via object XML import/export. A user can export each of the objects in the SHARED_MARKETING_DEV folder to XML, and then re-import each object into the SHARED_MARKETING_TEST via XML import. With the XML import/export, the XML files can be uploaded to a third-party versioning tool, if the organization has standardized on such a tool. Otherwise, versioning can be enabled in PowerCenter. Migrations with versioned PowerCenter repositories is covered later in this document.

After you've copied all common or shared objects, the next step is to copy the individual mappings from each development folder into the MARKETING_TEST folder. Again, you can use either of the two object-level migration methods described above to copy the mappings to the folder, although the XML import/export method is the most intuitive method for resolving shared object conflicts. However, the migration method is slightly different here when you're copying the mappings because you must ensure that the shortcuts in the mapping are associated with the SHARED_MARKETING_TEST folder. Designer prompts you to choose the correct shortcut folder created in the previous step, which points to SHARED_MARKETING_TEST (see image below). You can then continue the migration process until all mappings have been successfully migrated. In PowerCenter 7 and later versions, you can export multiple objects into a single XML file, and then import them at the same time.

The final step in the process is to migrate the workflows that use those mappings. Again, the object-level migration can be completed either through drag-and-drop or by using XML import/export. In either case, this process is very similar to the steps described above for migrating mappings, but differs in that the Workflow Manager provides a Workflow Copy Wizard to guide you through the process. The following steps outline the full process for successfully copying a workflow and all of its associated tasks.

1. The Wizard prompts for the name of the new workflow. If a workflow with the same name exists in the destination folder, the Wizard prompts you to rename it or replace it. If no such workflow exists, a default name is used. Then click “Next” to continue the copy process.

2. The next step is to check whether each task already exists in the destination folder (as shown below). If the task is present, you can rename or replace the current one. If it does not exist, the default name is used. Then click “Next.”


3. Next, the Wizard prompts you to select the mapping associated with each session task in the workflow. Select the mapping and continue by clicking “Next".

4. If connections exist in the target repository, the Wizard prompts you to select the connection to use for the source and target. If no connections exist, the default settings are used. When this step is completed, click "Finish" and save the work.

Initial Migration – New Folders Created

The move to production is very different for the initial move than for subsequent changes to mappings and workflows. Since the repository only contains folders for development and test, we need to create two new folders to house the production-ready objects. Create these folders after testing of the objects in SHARED_MARKETING_TEST and MARKETING_TEST has been approved.

The following steps outline the creation of the production folders and, at the same time, address the initial test to production migration.

1. Open the PowerCenter Repository Manager client tool and log into the repository.
2. To make a shared folder for the production environment, highlight the SHARED_MARKETING_TEST folder, drag it, and drop it on the repository name.
3. The Copy Folder Wizard appears to guide you through the copying process.

4. The first Wizard screen asks if you want to use the typical folder copy options or the advanced options. In this example, we'll use the advanced options.

5. The second Wizard screen prompts you to enter a folder name. By default, the name that appears on this screen is the original folder name followed by the date. In this case, enter the name as “SHARED_MARKETING_PROD.”

6. The third Wizard screen prompts you to select a folder to override. Because this is the first time you are transporting the folder, you won’t need to select anything.

7. The final screen begins the actual copy process. Click "Finish" when the process is complete.


Repeat this process to create the MARKETING_PROD folder. Use the MARKETING_TEST folder as the original to copy and associate the shared objects with the SHARED_MARKETING_PROD folder that you just created.

At the end of the migration, you should have two additional folders in the repository environment for production: SHARED_MARKETING_PROD and MARKETING_PROD (as shown below). These folders contain the initially migrated objects. Before you can actually run the workflow in these production folders, you need to modify the session source and target connections to point to the production environment.

Incremental Migration – Object Copy Example

Now that the initial production migration is complete, let's take a look at how future changes will be migrated into the folder.

Any time an object is modified, it must be re-tested and migrated into production for the actual change to occur. These types of changes in production take place on a case-by-case or periodically-scheduled basis. The following steps outline the process of moving these objects individually.

1. Log into PowerCenter Designer. Open the destination folder and expand the source folder. Click on the object to copy and drag-and-drop it into the appropriate workspace window.

2. Because this is a modification to an object that already exists in the destination folder, Designer prompts you to choose whether to Rename or Replace the object (as shown below). Choose the option to Replace the object.

3. In PowerCenter 7 and later versions, you can choose to compare conflicts whenever migrating any object in Designer or Workflow Manager. By comparing the objects, you can ensure that the changes that you are making are what you intend. See below for an example of the mapping compare window.


4. After the object has been successfully copied, save the folder so the changes can take place.
5. The newly copied mapping is now tied to any sessions that the replaced mapping was tied to.
6. Log into Workflow Manager and make the appropriate changes to the session or workflow so it can update itself with the changes.

Standalone Repository Example

In this example, we look at moving development work to QA and then from QA to production, using multiple development folders for each developer, with the test and production folders divided into the data mart they represent. For this example, we focus solely on the MARKETING_DEV data mart, first explaining how to move objects and mappings from each individual folder to the test folder and then how to move tasks, worklets, and workflows to the new area.

Follow these steps to copy a mapping from Development to QA:

1. If using shortcuts, first follow these steps; if not using shortcuts, skip to step 2.

❍ Copy the tested objects from the SHARED_MARKETING_DEV folder to the SHARED_MARKETING_TEST folder.
❍ Drag all of the newly copied objects from the SHARED_MARKETING_TEST folder to MARKETING_TEST.
❍ Save your changes.

2. Copy the mapping from Development into Test.

❍ In the PowerCenter Designer, open the MARKETING_TEST folder, and drag and drop the mapping from each development folder into the MARKETING_TEST folder.
❍ When copying each mapping in PowerCenter, Designer prompts you to either Replace, Rename, or Reuse the object, or Skip for each reusable object, such as source and target definitions. Choose to Reuse the object for all shared objects in the mappings copied into the MARKETING_TEST folder.
❍ Save your changes.

3. If a reusable session task is being used, follow these steps. Otherwise, skip to step 4.

❍ In the PowerCenter Workflow Manager, open the MARKETING_TEST folder and drag and drop each reusable session from the developers’ folders into the MARKETING_TEST folder. A Copy Session Wizard guides you through the copying process.
❍ Open each newly copied session and click on the Source tab. Change the source to point to the source database for the Test environment.
❍ Click the Target tab. Change each connection to point to the target database for the Test environment. Be sure to double-check the workspace from within the Target tab to ensure that the load options are correct.
❍ Save your changes.

4. While the MARKETING_TEST folder is still open, copy each workflow from Development to Test.

❍ Drag each workflow from the development folders into the MARKETING_TEST folder. The Copy Workflow Wizard appears. Follow the same steps listed above to copy the workflow to the new folder.
❍ As mentioned earlier, in PowerCenter 7 and later versions, the Copy Wizard allows you to compare conflicts from within Workflow Manager to ensure that the correct migrations are being made.
❍ Save your changes.

5. Implement the appropriate security.

❍ In Development, the owner of the folders should be a user(s) in the development group.
❍ In Test, change the owner of the test folder to a user(s) in the test group.
❍ In Production, change the owner of the folders to a user in the production group.
❍ Revoke all rights to Public other than Read for the production folders.

Disadvantages of a Single Repository Environment

The most significant disadvantage of a single repository environment is performance. Having a development, QA, and production environment within a single repository can cause degradation in production performance as the production environment shares CPU and memory resources with the development and test environments. Although these environments are stored in separate folders, they all reside within the same database table space and on the same server.

For example, if development or test loads are running simultaneously with production loads, the server machine may reach 100 percent utilization and production performance is likely to suffer.

A single repository structure can also create confusion as the same users and groups exist in all environments and the number of folders can increase exponentially.

Distributed Repository Environment

A distributed repository environment maintains separate, independent repositories, hardware, and software for development, test, and production environments. Separating repository environments is preferable for handling development to production migrations. Because the environments are segregated from one another, work performed in development cannot impact QA or production.

With a fully distributed approach, separate repositories function much like the separate folders in a standalone environment. Each repository has a similar name, like the folders in the standalone environment. For instance, in our Marketing example we would have three repositories, INFADEV, INFATEST, and INFAPROD. In the following example, we discuss a distributed repository architecture.

There are four techniques for migrating from development to production in a distributed repository architecture, with each involving some advantages and disadvantages.

● Repository Copy
● Folder Copy
● Object Copy
● Deployment Groups

Repository Copy

So far, this document has covered object-level migrations and folder migrations through drag-and-drop object copying and object XML import/export. This section discusses migrations in a distributed repository environment through repository copies.

The main advantages of this approach are:

● The ability to copy all objects (i.e., mappings, workflows, mapplets, reusable transformation, etc.) at once from one environment to another.

● The ability to automate this process using pmrep commands, thereby eliminating many of the manual processes that users typically perform.

● The ability to move everything without breaking or corrupting any of the objects.

This approach also involves a few disadvantages.


● The first is that everything is moved at once (which is also an advantage). The problem with this is that everything is moved -- ready or not. For example, we may have 50 mappings in QA, but only 40 of them are production-ready. The 10 untested mappings are moved into production along with the 40 production-ready mappings, which leads to the second disadvantage.

● Significant maintenance is required to remove any unwanted or excess objects.

● There is also a need to adjust server variables, sequences, parameters/variables, database connections, etc. Everything must be set up correctly before the actual production runs can take place.

● Lastly, the repository copy process requires that the existing Production repository be deleted, and then the Test repository can be copied. This results in a loss of production environment operational metadata such as load statuses, session run times, etc. High-performance organizations leverage the value of operational metadata to track trends over time related to load success/failure and duration. This metadata can be a competitive advantage for organizations that use this information to plan for future growth.

Now that we've discussed the advantages and disadvantages, we'll look at three ways to accomplish the Repository Copy method:

● Copying the Repository
● Repository Backup and Restore
● PMREP

Copying the Repository

Copying the Test repository to Production through the GUI client tools is the easiest of all the migration methods. First, ensure that all users are logged out of the destination repository, then connect to the PowerCenter Repository Administration Console (as shown below).


If the Production repository already exists, you must delete it before you can copy the Test repository. Before you can delete the repository, the Repository Service must be running in exclusive mode.

1. Click on the INFA_PROD repository in the left pane to select it, then change the running mode to exclusive by clicking the Edit button in the right pane under the Properties tab.


2. Delete the Production repository by selecting it and choosing “Delete” from the context menu.


3. Click on the Action drop-down list and choose “Copy contents from.”


4. In the new window, choose the domain name and the repository service “INFA_TEST” from the drop-down menus. Enter the username and password of the Test repository.


5. Click OK to begin the copy process.
6. When you've successfully copied the repository to the new location, exit from the PowerCenter Administration Console.
7. In the Repository Manager, double-click on the newly copied repository and log in with a valid username and password.
8. Verify connectivity, then highlight each folder individually and rename them. For example, rename the MARKETING_TEST folder to MARKETING_PROD, and SHARED_MARKETING_TEST to SHARED_MARKETING_PROD.

9. Be sure to remove all objects that are not pertinent to the Production environment from the folders before beginning the actual testing process.

10. When this cleanup is finished, you can log into the repository through the Workflow Manager. Modify the server information and all connections so they are updated to point to the new Production locations for all existing tasks and workflows.

Repository Backup and Restore

Backup and Restore Repository is another simple method of copying an entire repository. This process backs up the repository to a binary file that can be restored to any new location. This method is preferable to the repository copy process because the repository is preserved in a binary file on the repository server, which can be restored if any type of error occurs.

The following steps outline the process of backing up and restoring the repository for migration.

1. Launch the PowerCenter Administration Console, and highlight the INFA_TEST repository service. Select Action -> Backup Contents from the drop-down menu.


2. A screen appears and prompts you to supply a name for the backup file as well as the Administrator username and password. The file is saved to the Backup directory within the repository server’s home directory.

3. After you've selected the location and file name, click OK to begin the backup process.


4. The backup process creates a .rep file containing all repository information. Stay logged into the Manage Repositories screen. When the backup is complete, select the repository connection to which the backup will be restored (i.e., the Production repository).

5. The system will prompt you to supply a username, password, and the name of the file to be restored. Enter the appropriate information and click OK.

When the restoration process is complete, you must repeat the steps listed in the copy repository option in order to delete all of the unused objects and rename the folders.

PMREP

Using the PMREP commands is essentially the same as the Backup and Restore Repository method except that it is run from the command line rather than through the GUI client tools. PMREP utilities can be used from the Informatica Server or from any client machine connected to the server.

Refer to the Repository Manager Guide for a list of PMREP commands.

The following is a sample of the command syntax used within a Windows batch file to connect to and backup a repository. Using this code example as a model, you can write scripts to be run on a daily basis to perform functions such as connect, backup, restore, etc:

backupproduction.bat

REM This batch file uses pmrep to connect to and back up the repository Production on the server Central
REM Adjust the pmrep path below to match your PowerCenter installation directory

@echo off

echo Connecting to Production repository...

"C:\Program Files\Informatica PowerCenter\RepositoryServer\bin\pmrep" connect -r INFAPROD -n Administrator -x Adminpwd -h infarepserver -o 7001

echo Backing up Production repository...

"C:\Program Files\Informatica PowerCenter\RepositoryServer\bin\pmrep" backup -o c:\backup\Production_backup.rep

Post-Repository Migration Cleanup

After you have used one of the repository migration procedures to migrate into Production, follow these steps to convert the repository to Production:

1. Disable workflows that are not ready for Production or simply delete the mappings, tasks, and workflows.

❍ Disable the workflows not being used in the Workflow Manager by opening the workflow properties, then checking the Disabled checkbox under the General tab.

❍ Delete the tasks not being used in the Workflow Manager and the mappings in the Designer

2. Modify the database connection strings to point to the production sources and targets.

❍ In the Workflow Manager, select Relational connections from the Connections menu.
❍ Edit each relational connection by changing the connect string to point to the production sources and targets.
❍ If you are using lookup transformations in the mappings and the connect string is anything other than $SOURCE or $TARGET, you will need to modify the connect strings appropriately.

3. Modify the pre- and post-session commands and SQL as necessary.

❍ In the Workflow Manager, open the session task properties, and from the Components tab make the required changes to the pre- and post-session scripts.

4. Implement appropriate security, such as:

❍ In Development, ensure that the owner of the folders is a user in the development group.
❍ In Test, change the owner of the test folders to a user in the test group.
❍ In Production, change the owner of the folders to a user in the production group.
❍ Revoke all rights to Public other than Read for the Production folders.

Folder Copy


Although deployment groups are becoming a very popular migration method, the folder copy method has historically been the most popular way to migrate in a distributed environment. Copying an entire folder allows you to quickly promote all of the objects located within that folder. All source and target objects, reusable transformations, mapplets, mappings, tasks, worklets and workflows are promoted at once. Because of this, however, everything in the folder must be ready to migrate forward. If some mappings or workflows are not valid, then developers (or the Repository Administrator) must manually delete these mappings or workflows from the new folder after the folder is copied.

The three advantages of using the folder copy method are:

● The Repository Managers Folder Copy Wizard makes it almost seamless to copy an entire folder and all the objects located within it.

● If the project uses a common or shared folder and this folder is copied first, then all shortcut relationships are automatically converted to point to this newly copied common or shared folder.

● All connections, sequences, mapping variables, and workflow variables are copied automatically.

The primary disadvantage of the folder copy method is that the repository is locked while the folder copy is being performed. Therefore, it is necessary to schedule this migration task during a time when the repository is least utilized. Remember that a locked repository means that no jobs can be launched during this process. This can be a serious consideration in real-time or near real-time environments.

The following example steps through the process of copying folders from each of the different environments. The first example uses three separate repositories for development, test, and production.

1. If using shortcuts, follow these sub-steps; otherwise skip to step 2:

● Open the Repository Manager client tool.
● Connect to both the Development and Test repositories.
● Highlight the folder to copy and drag it to the Test repository.
● The Copy Folder Wizard appears to step you through the copy process.
● When the folder copy process is complete, open the newly copied folder in both the Repository Manager and Designer to ensure that the objects were copied properly.

2. Copy the Development folder to Test. If you skipped step 1, follow these sub-steps:

● Open the Repository Manager client tool.
● Connect to both the Development and Test repositories.
● Highlight the folder to copy and drag it to the Test repository.

The Copy Folder Wizard will appear.


3. Follow these steps to ensure that all shortcuts are reconnected.

● Use the advanced options when copying the folder across.
● Select Next to use the default name of the folder.

4. If the folder already exists in the destination repository, choose to replace the folder.

The following screen appears to prompt you to select the folder where the new shortcuts are located.


In a situation where the folder names do not match, a folder compare will take place. The Copy Folder Wizard then completes the folder copy process. Rename the folder as appropriate and implement the security.

5. When testing is complete, repeat the steps above to migrate to the Production repository.

When the folder copy process is complete, log onto the Workflow Manager and change the connections to point to the appropriate target location. Ensure that all tasks were updated correctly and that folder and repository security is modified for test and production.

Object Copy

Copying mappings into the next stage in a networked environment involves many of the same advantages and disadvantages as in the standalone environment, but the process of handling shortcuts is simplified in the networked environment. For additional information, see the earlier description of Object Copy for the standalone environment.

One advantage of Object Copy in a distributed environment is that it provides more granular control over objects.

Two distinct disadvantages of Object Copy in a distributed environment are:

● Much more work to deploy an entire group of objects
● Shortcuts must exist prior to importing/copying mappings

Below are the steps to complete an object copy in a distributed repository environment:

1. If using shortcuts, follow these sub-steps, otherwise skip to step 2:


● In each of the distributed repositories, create a common folder with the exact same name and case.
● Copy the shortcuts into the common folder in Production, making sure the shortcut has the exact same name.

2. Copy the mapping from the Test environment into Production.

● In the Designer, connect to both the Test and Production repositories and open the appropriate folders in each.

● Drag-and-drop the mapping from Test into Production.
● During the mapping copy process, PowerCenter 7 and later versions allow a comparison of this mapping to an existing copy of the mapping already in Production. Note that the ability to compare objects is not limited to mappings, but is available for all repository objects including workflows, sessions, and tasks.

3. Create or copy a workflow with the corresponding session task in the Workflow Manager to run the mapping (first ensure that the mapping exists in the current repository).

● If copying the workflow, follow the Copy Wizard.
● If creating the workflow, add a session task that points to the mapping and enter all the appropriate information.

4. Implement appropriate security.

● In Development, ensure the owner of the folders is a user in the development group.
● In Test, change the owner of the test folders to a user in the test group.
● In Production, change the owner of the folders to a user in the production group.
● Revoke all rights to Public other than Read for the Production folders.

Deployment Groups

For versioned repositories, the use of Deployment Groups for migrations between distributed environments allows the most flexibility and convenience. With Deployment Groups, you can migrate individual objects as you would in an object copy migration, while retaining the convenience of a repository- or folder-level migration because all objects are deployed at once. The objects included in a deployment group have no restrictions and can come from one or multiple folders. For added convenience, you can set up a dynamic deployment group that allows the objects in the deployment group to be defined by a repository query, rather than being added to the deployment group manually. Lastly, because deployment groups are available on versioned repositories, a deployment can be rolled back, reverting to the previous versions of the objects, when necessary.

Advantages of Using Deployment Groups

● Backup and restore of the Repository needs to be performed only once.
● Copying a Folder replaces the previous copy.
● Copying a Mapping allows for different names to be used for the same object.
● Uses for Deployment Groups

❍ Deployment Groups are containers that hold references to objects that need to be migrated.
❍ Allows for version-based object migration.
❍ Faster and more flexible than folder moves for incremental changes.
❍ Allows for migration “rollbacks.”
❍ Allows specifying individual objects to copy, rather than the entire contents of a folder.

Types of Deployment Groups

● Static

❍ Contain direct references to versions of objects that need to be moved.
❍ Users explicitly add the version of the object to be migrated to the deployment group.

● Dynamic

❍ Contain a query that is executed at the time of deployment.
❍ The results of the query (i.e., object versions in the repository) are then selected and copied to the target repository.

Pre-Requisites

Create required folders in the Target Repository

Creating Labels

A label is a versioning object that you can associate with any versioned object or group of versioned objects in a repository.

● Advantages

❍ Tracks versioned objects during development.
❍ Improves query results.
❍ Associates groups of objects for deployment.
❍ Associates groups of objects for import and export.

● Create label

❍ Create labels through the Repository Manager.
❍ After creating the labels, go to edit mode and lock them.
❍ The "Lock" option is used to prevent other users from editing or applying the label.
❍ This option can be enabled only when the label is edited.
❍ Some standard label examples are:

■ Development
■ Deploy_Test
■ Test
■ Deploy_Production
■ Production

● Apply Label

❍ Create a query to identify the objects that need to be labeled.
❍ Run the query and apply the labels.

Note: By default, the latest version of the object gets labeled.

Queries

A query is an object used to search for versioned objects in the repository that meet specific conditions.

● Advantages

❍ Tracks objects during development
❍ Associates a query with a deployment group
❍ Finds deleted objects you want to recover
❍ Finds groups of invalidated objects you want to validate

● Create a query

❍ The Query Browser allows you to create, edit, run, or delete object queries

● Execute a query

❍ Execute through the Query Browser
❍ Execute through pmrep (see the example below): ExecuteQuery -q query_name -t query_type -u persistent_output_file_name -a append -c column_separator -r end-of-record_separator -l end-of-listing_indicator -b verbose
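
For example, assuming a shared query named RELEASE_20050130_QUERY has been created (the names here are illustrative), the following call writes the matching object versions to a text file after connecting to the repository with pmrep connect:

ExecuteQuery -q RELEASE_20050130_QUERY -t shared -u release_20050130_objects.txt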

Creating a Deployment Group

Follow these steps to create a deployment group:

1. Launch the Repository Manager client tool and log in to the source repository.

2. Expand the repository, right-click on “Deployment Groups” and choose “New Group.”


3. In the dialog window, give the deployment group a name, and choose whether it should be static or dynamic. In this example, we are creating a static deployment group. Click OK.

Adding Objects to a Static Deployment Group


Follow these steps to add objects to a static deployment group:

1. In Designer, Workflow Manager, or Repository Manger, right-click an object that you want to add to the deployment group and choose “Versioning” -> “View History.” The “View History” window appears.

2. In the “View History” window, right-click the object and choose “Add to Deployment Group.”


3. In the Deployment Group dialog window, choose the deployment group that you want to add the object to, and click OK.

4. In the final dialog window, choose whether you want to add dependent objects. In most cases, you will want to add dependent objects to the deployment group so that they will be migrated as well. Click OK.


NOTE: The “All Dependencies” option should be used for any new code that is migrating forward. However, this option can cause issues when moving existing code forward because “All Dependencies” also flags shortcuts. During the deployment, PowerCenter tries to re-insert or replace the shortcuts. This does not work, and causes the deployment to fail.

The object will be added to the deployment group at this time.

Although the deployment group allows the most flexibility, the task of adding each object to the deployment group is similar to the effort required for an object copy migration. To make deployment groups easier to use, PowerCenter allows the capability to create dynamic deployment groups.

Adding Objects to a Dynamic Deployment Group

Dynamic deployment groups are similar in function to static deployment groups, but differ in the way that objects are added. In a static deployment group, objects are manually added one by one. In a dynamic deployment group, the contents of the deployment group are defined by a repository query. Don’t worry about the complexity of writing a repository query; it is quite simple and aided by the PowerCenter GUI interface.

Follow these steps to add objects to a dynamic deployment group:

1. First, create a deployment group, just as you did for a static deployment group, but in this case, choose the dynamic option. Also, select the “Queries” button.


2. The “Query Browser” window appears. Choose “New” to create a query for the dynamic deployment group.

3. In the Query Editor window, provide a name and query type (Shared). Define criteria for the objects that should be migrated. The drop-down list of parameters lets you choose from 23 predefined metadata categories. In this case, the developers have assigned the “RELEASE_20050130” label to all objects that need to be migrated, so the query is defined as “Label Is Equal To ‘RELEASE_20050130’”. The creation and application of labels are discussed in Using PowerCenter Labels.


4. Save the Query and exit the Query Editor. Click OK on the Query Browser window, and close the Deployment Group editor window.

Executing a Deployment Group Migration

A Deployment Group migration can be executed through the Repository Manager client tool, or through the pmrep command line utility. With the client tool, you simply drag the deployment group from the source repository and drop it on the destination repository. This opens the Copy Deployment Group Wizard, which guides you through the step-by-step options for executing the deployment group.

Rolling Back a Deployment

To roll back a deployment, you must first locate the Deployment via the TARGET Repositories menu bar (i.e., Deployments -> History -> View History -> Rollback).

Automated Deployments

For the optimal migration method, you can set up a UNIX shell or Windows batch script that calls the pmrep DeployDeploymentGroup command, which can execute a deployment group migration without human intervention. This is ideal because the deployment group allows ultimate flexibility and convenience; the script can be scheduled to run overnight, thereby causing minimal impact on developers and the PowerCenter administrator. You can also use the pmrep utility to automate importing objects via XML.
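
The following sketch shows what such a script might look like as a Windows batch file, in the same style as the backup example earlier in this Best Practice. The deployment group, control file, and repository names are illustrative, and the DeployDeploymentGroup options shown should be verified against the pmrep Command Line Reference for your PowerCenter version:

deployrelease.bat

@echo off
REM Connect to the source repository that holds the deployment group
pmrep connect -r INFATEST -n Administrator -x Adminpwd -h infarepserver -o 7001
REM Copy the deployment group to the target repository; the XML control file
REM supplies the answers that the Copy Deployment Group Wizard would normally request
pmrep deploydeploymentgroup -p RELEASE_20050130_DG -c deploy_control.xml -r INFAPROD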


Recommendations

Informatica recommends using the following process when running in a three-tiered environment with development, test, and production servers.

Non-Versioned Repositories

For migrating from development into test, Informatica recommends using the Object Copy method. This method gives you total granular control over the objects that are being moved. It also ensures that the latest development mappings can be moved over manually as they are completed. For recommendations on performing this copy procedure correctly, see the steps listed in the Object Copy section.

Versioned Repositories

For versioned repositories, Informatica recommends using the Deployment Groups method for repository migration in a distributed repository environment. This method provides the greatest flexibility in that you can promote any object from within a development repository (even across folders) into any destination repository. Also, by using labels, dynamic deployment groups, and the enhanced pmrep command line utility, the use of the deployment group migration method results in automated migrations that can be executed without manual intervention.

Third-Party Versioning

Some organizations have standardized on third-party version control software. PowerCenter’s XML import/export functionality offers integration with such software and provides a means to migrate objects. This method is most useful in a distributed environment because objects can be exported into an XML file from one repository and imported into the destination repository.

The XML Object Copy Process allows you to copy nearly all repository objects, including sources, targets, reusable transformations, mappings, mapplets, workflows, worklets, and tasks. Beginning with PowerCenter 7 and later versions, the export/import functionality allows the export/import of multiple objects to a single XML file. This can significantly cut down on the work associated with object level XML import/export.


The following steps outline the process of exporting the objects from source repository and importing them into the destination repository:

Exporting

1. From Designer or Workflow Manager, login to the source repository. Open the folder and highlight the object to be exported.

2. Select Repository -> Export Objects.
3. The system prompts you to select a directory location on the local workstation. Choose the directory to save the file. Using the default name for the XML file is generally recommended.
4. Open Windows Explorer and go to the Client directory of the PowerCenter client installation (for example, C:\Program Files\Informatica PowerCenter <version>\Client; this may vary depending on where you installed the client tools).
5. Find the powrmart.dtd file, make a copy of it, and paste the copy into the directory where you saved the XML file.
6. Together, these files are now ready to be added to the version control software.

Importing

Log into Designer or the Workflow Manager client tool and connect to the destination repository. Open the folder where the object is to be imported.

1. Select Repository -> Import Objects.
2. The system prompts you to select a directory location and file to import into the repository.
3. The following screen appears with the steps for importing the object.

4. Select the mapping and add it to the Objects to Import list.


5. Click "Next", and then click "Import". Since the shortcuts have been added to the folder, the mapping will now point to the new shortcuts and their parent folder.

6. It is important to note that the pmrep command line utility was greatly enhanced in PowerCenter 7 and later versions, allowing the activities associated with XML import/export to be automated through pmrep (a sketch follows this list).

7. Click on the destination repository service in the left pane and choose the Action drop-down list -> “Restore.” Remember, if the destination repository has content, it must be deleted prior to restoring.
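
As noted in step 6, the pmrep ObjectExport and ObjectImport commands provide a scripted equivalent of the export and import steps above. The following is a minimal sketch only; the folder, object, and file names are illustrative, and the option lists (including the contents of the import control file) should be taken from the pmrep Command Line Reference for your PowerCenter version:

pmrep connect -r INFATEST -n Administrator -x Adminpwd -h infarepserver -o 7001
pmrep objectexport -n m_load_customers -o mapping -f MARKETING_TEST -u m_load_customers.xml

pmrep connect -r INFAPROD -n Administrator -x Adminpwd -h infarepserver -o 7001
pmrep objectimport -i m_load_customers.xml -c import_control.xml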

Last updated: 05-Feb-07 17:22


Migration Procedures - PowerExchange

Challenge

To facilitate the migration of PowerExchange definitions from one environment to another.

Description

There are two approaches to perform a migration.

● Using the DTLURDMO utility
● Using the PowerExchange Client tool (Detail Navigator)

DTLURDMO Utility

Step 1: Validate connectivity between the client and listeners

● Test communication between clients and all listeners in the production environment with:

dtlrexe prog=ping loc=<nodename>

● Run selected jobs to exercise data access through PowerExchange data maps.

Step 2: Run DTLURDMO to copy PowerExchange objects.

At this stage, if PowerExchange is to run against new versions of the PowerExchange objects rather than existing libraries, you need to copy the datamaps. To do this, use the PowerExchange Copy Utility DTLURDMO. The following section assumes that the entire datamap set is to be copied. DTLURDMO does have the ability to copy selectively, however, and the full functionality of the utility is documented in the PowerExchange Utilities Guide.

The types of definitions that can be managed with this utility are:

● PowerExchange data maps
● PowerExchange capture registrations
● PowerExchange capture extraction data maps

On MVS, the input statements for this utility are taken from SYSIN.

On non-MVS platforms, the input argument points to a file containing the input definition. If no input argument is provided, the utility looks for a file dtlurdmo.ini in the current path.

The utility runs on all capture platforms.

Windows and UNIX Command Line

Syntax: DTLURDMO <dtlurdmo definition file>

For example: DTLURDMO e:\powerexchange\bin\dtlurdmo.ini

● DTLURDMO Definition file specification - This file is used to specify how the DTLURDMO utility operates. If no definition file is specified, it looks for a file dtlurdmo.ini in the current path.

MVS DTLURDMO job utility

Run the utility by submitting the DTLURDMO job, which can be found in the RUNLIB library.

● DTLURDMO Definition file specification - This file is used to specify how the DTLURDMO utility operates and is read from the SYSIN card.

AS/400 utility

Syntax: CALL PGM(<location and name of DTLURDMO executable file>)

For example: CALL PGM(dtllib/DTLURDMO)

● DTLURDMO Definition file specification - This file is used to specify how the DTLURDMO utility operates. By default, the definition is in the member CFG/DTLURDMO in the current datalib library.

If you want to create a separate DTLURDMO definition file rather than use the default location, you must give the library and filename of the definition file as a parameter. For example: CALL PGM(dtllib/DTLURDMO) parm ('datalib/deffile(dtlurdmo)')

Running DTLURDMO

The utility should be run extracting information from the files locally, then writing out the datamaps through the new PowerExchange V8.x.x Listener. This causes the datamaps to be written out in the format required for the upgraded PowerExchange. DTLURDMO must be run once for the datamaps, then again for the registrations, and then the extract maps if this is a capture environment. Commands for mixed datamaps, registrations, and extract maps cannot be run together.


If only a subset of the PowerExchange datamaps, registrations, and extract maps are required, then selective copies can be carried out. Details of performing selective copies are documented fully in the PowerExchange Utilities Guide. This document assumes that everything is going to be migrated from the existing environment to the new V8.x.x format.

Definition File Example

The following example shows a definition file to copy all datamaps from the existing local datamaps (the local datamaps are defined in the DATAMAP DD card in the MVS JCL or by the path on Windows or UNIX) to the V8.x.x listener (defined by the TARGET location node1):

USER DTLUSR;
EPWD A3156A3623298FDC;
SOURCE LOCAL;
TARGET NODE1;
DETAIL;
REPLACE;
DM_COPY;
SELECT schema=*;

Note: The encrypted password (EPWD) is generated from the FILE, ENCRYPT PASSWORD option from the PowerExchange Navigator.
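
As noted above, registrations and extraction maps require separate DTLURDMO runs. A registration copy uses the same overall structure; the sketch below simply substitutes REG_COPY for DM_COPY (the user, password, and node names are carried over from the example above and are illustrative, and the SELECT keywords for registrations differ from those for data maps, so take the exact selection syntax from the PowerExchange Utilities Guide):

USER DTLUSR;
EPWD A3156A3623298FDC;
SOURCE LOCAL;
TARGET NODE1;
DETAIL;
REPLACE;
REG_COPY;
SELECT <registration selection criteria>;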

Power Exchange Client tool (Detail Navigator)

Step 1: Validate connectivity between the client and listeners

● Test communication between clients and all listeners in the production environment with:

dtlrexe prog=ping loc=<nodename>


● Run selected jobs to exercise data access through PowerExchange data maps.

Step 2: Start the Power Exchange Navigator

● Select the datamap that is going to be promoted to production.
● On the menu bar, select the option to send the file to the remote node.
● On the drop-down list box, choose the appropriate location (in this case, mvs_prod).
● Supply the user name and password and click OK.

A confirmation message for successful migration is displayed.


Last updated: 06-Feb-07 11:39


Running Sessions in Recovery Mode

Challenge

Use the Load Manager architecture for manual error recovery by suspending and resuming the workflows and worklets when an error is encountered.

Description

When a task in the workflow fails at any point, one option is to truncate the target and run the workflow again from the beginning. The Load Manager architecture offers an alternative: the workflow can be suspended so that the user can fix the error, rather than reprocessing the portion of the workflow that completed without errors. This option, "Suspend on Error", results in accurate and complete target data, as if the session had completed successfully in a single run.

Configure Mapping for Recovery

For consistent recovery, the mapping needs to produce the same result, in the same order, in the recovery execution as in the failed execution. This can be achieved by sorting the input data using either the sorted ports option in the Source Qualifier (or Application Source Qualifier) or a Sorter transformation with the distinct rows option immediately after the source qualifier transformation. Additionally, ensure that all targets receive data from transformations that produce repeatable data.

Configure Session for Recovery

Enable the session for recovery by selecting one of the following three Recovery Strategies:

● Resume from the last checkpoint

❍ The Integration Service saves session recovery information and updates recovery tables for the target database.

❍ If the session is interrupted, the Integration Service uses the saved recovery information to recover it.

● Restart task


❍ The Integration Service does not save session recovery information.

❍ If the session is interrupted, the Integration Service reruns the session during recovery.

● Fail task and continue workflow

❍ Session will not be recovered (default).

Configure Workflow for Recovery

The Suspend on Error option directs the PowerCenter Server to suspend the workflow while the user fixes the error; the user then resumes the workflow.

The server suspends the workflow when any of the following tasks fail:

● Session

● Command

● Worklet

● Email

When a task fails in the workflow, the Integration Service stops running tasks in the path. The Integration Service does not evaluate the output link of the failed task. If no other task is running in the workflow, the Workflow Monitor displays the status of the workflow as "Suspended."

If one or more tasks are still running in the workflow when a task fails, the Integration Service stops running the failed task and continues running tasks in other paths. The Workflow Monitor displays the status of the workflow as "Suspending."

When the status of the workflow is "Suspended" or "Suspending," you can fix the error, such as a target database error, and recover the workflow in the Workflow Monitor. When you recover a workflow, the Integration Service restarts the failed tasks and continues evaluating the rest of the tasks in the workflow. The Integration Service does not run any task that already completed successfully.

Note: You can no longer recover individual sessions in a workflow. To recover a session, you recover the workflow.

Truncate Target Table


If the truncate table option is enabled in a recovery-enabled session, the target table is not truncated during the recovery process.

Session Logs

In a suspended workflow scenario, the Integration Service uses the existing session log when it resumes the workflow from the point of suspension. However, the earlier runs that caused the suspension are recorded in the historical run information in the repository.

Suspension Email

The workflow can be configured to send an email when the Integration Service suspends the workflow. When a task fails, the server suspends the workflow and sends the suspension email. The user can then fix the error and resume the workflow. If another task fails while the Integration Service is suspending the workflow, the server does not send another suspension email. The Integration Service only sends out another suspension email if another task fails after the workflow resumes. Check the "Browse Emails" button on the General tab of the Workflow Designer Edit sheet to configure the suspension email.

Suspending Worklets

When the "Suspend On Error" option is enabled for the parent workflow, the Integration Service also suspends the worklet if a task within the worklet fails. When a task in the worklet fails, the server stops executing the failed task and other tasks in its path. If no other task is running in the worklet, the status of the worklet is "Suspended". If other tasks are still running in the worklet, the status of the worklet is "Suspending". The parent workflow is also suspended when the worklet is "Suspended" or "Suspending".

Starting Recovery

The recovery process can be started using Workflow Manager Client tool or Workflow Monitor client tool. Alternatively, the recovery process can be started using pmcmd in command line mode or using a script.
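
A minimal sketch of starting recovery from the command line with pmcmd is shown below. The service, domain, user, folder, and workflow names are placeholders, and the exact command and options should be verified against the Command Line Reference for your PowerCenter release:

pmcmd recoverworkflow -sv <IntegrationServiceName> -d <DomainName> -u <UserName> -p <Password> -f <FolderName> <WorkflowName>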

Recovery Tables and Recovery Process

When the Integration Service runs a session that has a resume recovery strategy, it writes to recovery tables on the target database system. When the Integration Service recovers the session, it uses information in the recovery tables to determine where to begin loading data to target tables.

If you want the Integration Service to create the recovery tables, grant table creation privilege to the database user name for the target database connection. If you do not want the Integration Service to create the recovery tables, create the recovery tables manually.

The Integration Service creates the following recovery tables in the target database:

● PM_RECOVERY. Contains target load information for the session run. The Integration Service removes the information from this table after each successful session and initializes the information at the beginning of subsequent sessions.

● PM_TGT_RUN_ID. Contains information the Integration Service uses to identify each target on the database. The information remains in the table between session runs. If you manually create this table, you must create a row and enter a value other than zero for LAST_TGT_RUN_ID to ensure that the session recovers successfully.

Do not edit or drop the recovery tables before you recover a session. If you disable recovery, the Integration Service does not remove the recovery tables from the target database. You must manually remove the recovery tables.

Unrecoverable Sessions

The following options affect whether the session is incrementally recoverable:

● Output is deterministic. A property that determines if the transformation generates the same set of data for each session run. You can set this property for SDK sources and Custom transformations.

● Output is repeatable. A property that determines if the transformation generates the data in the same order for each session run. You can set this property for Custom transformations.

● Lookup source is static. A Lookup transformation property that determines if the lookup source is the same between the session and recovery. The Integration Service uses this property to determine if the output is deterministic.

Inconsistent Data During Recovery Process


For recovery to be effective, the recovery session must produce the same set of rows, in the same order, as the original run. Any change made after the initial failure (to the mapping, the session, or the server) that affects the ability to produce repeatable data results in inconsistent data during the recovery process.

The following cases may produce inconsistent data during a recovery session:

● Session performs incremental aggregation and the server stops unexpectedly.

● Mapping uses a Sequence Generator transformation.

● Mapping uses a Normalizer transformation.

● Source and/or target changes after the initial session failure.

● Data movement mode changes after the initial session failure.

● Code page (server, source, or target) changes after the initial session failure.

● Mapping changes in a way that causes the server to distribute, filter, or aggregate rows differently.

● Session configurations are not supported by PowerCenter for session recovery.

● Mapping uses a lookup table and the data in the lookup table changes between session runs.

● Session sort order changes while the server is running in Unicode mode.

HA Recovery

Highly-available recovery allows the workflow to resume automatically if the Integration Service fails over. The following options are available on the Properties tab of the workflow:

● Enable HA recovery. Allows the workflow to be configured for high availability recovery.

● Automatically recover terminated tasks. Recovers terminated Session or Command tasks without user intervention. You must have high availability and the workflow must still be running.

● Maximum automatic recovery attempts. When you automatically recover terminated tasks, you can choose the number of times the Integration Service attempts to recover the task. Default is 5.

Note: To run a workflow in HA recovery, you must have an HA license for the Repository Service.

Complex Mappings and Recovery

In the case of complex mappings that load more than one related target (i.e., targets with a primary key/foreign key relationship), a session failure and subsequent recovery may lead to data integrity issues. In such cases, the integrity of the target tables must be checked and fixed prior to starting the recovery process.

Last updated: 01-Feb-07 18:52


Using PowerCenter Labels

Challenge

Using labels effectively in a data warehouse or data integration project to assist with administration and migration.

Description

A label is a versioning object that can be associated with any versioned object or group of versioned objects in a repository. Labels provide a way to tag a number of object versions with a name for later identification. Therefore, a label is a named object in the repository, whose purpose is to be a “pointer” or reference to a group of versioned objects. For example, a label called “Project X version X” can be applied to all object versions that are part of that project and release.

Labels can be used for many purposes:

● Track versioned objects during development.

● Improve object query results.

● Create logical groups of objects for future deployment.

● Associate groups of objects for import and export.

Note that labels apply to individual object versions, and not objects as a whole. So if a mapping has ten versions checked in, and a label is applied to version 9, then only version 9 has that label. The other versions of that mapping do not automatically inherit that label. However, multiple labels can point to the same object for greater flexibility.

The “Use Repository Manager” privilege is required in order to create or edit labels. To create a label, choose Versioning >> Labels from the Repository Manager.


When creating a new label, choose a name that is as descriptive as possible. For example, a suggested naming convention for labels is: Project_Version_Action. Include comments for further meaningful description.

Locking the label is also advisable. This prevents anyone from accidentally associating additional objects with the label or removing object references for the label.

Labels, like other global objects such as Queries and Deployment Groups, can have user and group privileges attached to them. This allows an administrator to create a label that can only be used by specific individuals or groups. Only those people working on a specific project should be given read/write/execute permissions for labels that are assigned to that project.


Once a label is created, it should be applied to related objects. To apply the label to objects, invoke the “Apply Label” wizard from the Versioning >> Apply Label menu option from the menu bar in the Repository Manager (as shown in the following figure).

Applying Labels

Labels can be applied to any object and cascaded upwards and downwards to parent and/or child objects. For example, to group dependencies for a workflow, apply a label to all children objects. The Repository Server applies labels to sources, targets, mappings, and tasks associated with the workflow. Use the “Move label” property to point the label to the latest version of the object(s).

Note: Labels can be applied to any object version in the repository except checked-out versions. Execute permission is required for applying labels.

After the label has been applied to related objects, it can be used in queries and deployment groups (see the Best Practice on Deployment Groups). Labels can also be used to manage the size of the repository (i.e., to purge object versions).
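
For scripted or repeatable migrations, labels can also be created and applied from the pmrep command line. The following is a minimal sketch; the repository, label, folder, and object names are placeholders, and the option letters should be verified against the pmrep Command Line Reference for your PowerCenter release:

pmrep connect -r <RepositoryName> -d <DomainName> -n <UserName> -x <Password>

pmrep createlabel -a ProjectX_V1_ReadyForMigration -c "Objects ready for migration"

pmrep applylabel -a ProjectX_V1_ReadyForMigration -n <MappingName> -o mapping -f <FolderName>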

Using Labels in Deployment

An object query can be created using the existing labels (as shown below). Labels can be associated only with a dynamic deployment group. Based on the object query, objects associated with that label can be used in the deployment.


Strategies for Labels

Repository Administrators and other individuals in charge of migrations should develop their own label strategies and naming conventions in the early stages of a data integration project. Be sure that developers are aware of the uses of these labels and when they should apply labels.

For each planned migration between repositories, choose three labels for the development and subsequent repositories:

● The first is to identify the objects that developers can mark as ready for migration.

● The second should apply to migrated objects, thus developing a migration audit trail.

● The third is to apply to objects as they are migrated into the receiving repository, completing the migration audit trail.

When preparing for the migration, use the first label to construct a query to build a dynamic deployment group. The second and third labels in the process are optionally applied by the migration wizard when copying folders between versioned repositories. Developers and administrators do not need to apply the second and third labels manually.

Additional labels can be created with developers to allow the progress of mappings to be tracked if desired. For example, when an object is successfully unit-tested by the developer, it can be marked as such. Developers can also label the object with a migration label at a later time if necessary. Using labels in this fashion along with the query feature allows complete or incomplete objects to be identified quickly and easily, thereby providing an object-based view of progress.

Last updated: 12-Feb-07 15:17


Deploying Data Analyzer Objects

Challenge

To understand the methods for deploying Data Analyzer objects among repositories and the limitations of such deployment.

Description

Data Analyzer repository objects can be exported to and imported from Extensible Markup Language (XML) files. Export/import facilitates archiving the Data Analyzer repository and deploying Data Analyzer Dashboards and reports from development to production.

The following repository objects in Data Analyzer can be exported and imported:

● Schemas

● Reports

● Time Dimensions

● Global Variables

● Dashboards

● Security profiles

● Schedules

● Users

● Groups

● Roles

The XML file created after exporting objects should not be modified. Any change might invalidate the XML file and cause the import of the objects into a Data Analyzer repository to fail.

For more information on exporting objects from the Data Analyzer repository, refer to the Data Analyzer Administration Guide.

Exporting Schema(s)


To export the definition of a star schema or an operational schema, you need to select a metric or folder from the Metrics system folder in the Schema Directory. When you export a folder, you export the schema associated with the definitions of the metrics in that folder and its subfolders. If the folder you select for export does not contain any objects, Data Analyzer does not export any schema definition and displays the following message:

There is no content to be exported.

There are two ways to export metrics or folders containing metrics:

● Select the “Export Metric Definitions and All Associated Schema Table and Attribute Definitions” option. If you select to export a metric and its associated schema objects, Data Analyzer exports the definitions of the metric and the schema objects associated with that metric. If you select to export an entire metric folder and its associated objects, Data Analyzer exports the definitions of all metrics in the folder, as well as schema objects associated with every metric in the folder.

● Alternatively, select the “Export Metric Definitions Only” option. When you choose to export only the definition of the selected metric, Data Analyzer does not export the definition of the schema table from which the metric is derived or any other associated schema object.

1. Login to Data Analyzer as a System Administrator.
2. Click on the Administration tab » XML Export/Import » Export Schemas.
3. All the metric folders in the schema directory are displayed. Click “Refresh Schema” to display the latest list of folders and metrics in the schema directory.
4. Select the check box for the folder or metric to be exported and click the “Export as XML” option.
5. Enter the XML filename and click “Save” to save the XML file.
6. The XML file will be stored locally on the client machine.

Exporting Report(s)

To export the definitions of more than one report, select multiple reports or folders. Data Analyzer exports only report definitions. It does not export the data or the schedule for cached reports. As part of the Report Definition export, Data Analyzer exports the report table, report chart, filters, indicators (i.e., gauge, chart, and table indicators), custom metrics, links to similar reports, and all reports in an analytic workflow, including links to similar reports.


Reports can have public or personal indicators associated with them. By default, Data Analyzer exports only public indicators associated with a report. To export the personal indicators as well, select the Export Personal Indicators check box.

To export an analytic workflow, you need to export only the originating report. When you export the originating report of an analytic workflow, Data Analyzer exports the definitions of all the workflow reports. If a report in the analytic workflow has similar reports associated with it, Data Analyzer exports the links to the similar reports.

Data Analyzer does not export the alerts, schedules, or global variables associated with the report. Although Data Analyzer does not export global variables, it lists all global variables it finds in the report filter. You can, however, export these global variables separately.

1. Login to Data Analyzer as a System Administrator.
2. Click Administration » XML Export/Import » Export Reports.
3. Select the folder or report to be exported.
4. Click “Export as XML”.
5. Enter the XML filename and click “Save” to save the XML file.
6. The XML file will be stored locally on the client machine.

Exporting Global Variables

1. Login to Data Analyzer as a System Administrator.
2. Click Administration » XML Export/Import » Export Global Variables.
3. Select the global variable to be exported.
4. Click “Export as XML”.
5. Enter the XML filename and click “Save” to save the XML file.
6. The XML file will be stored locally on the client machine.

Exporting a Dashboard

Whenever a dashboard is exported, Data Analyzer exports the reports, indicators, shared documents, and gauges associated with the dashboard. Data Analyzer does not, however, export the alerts, access permissions, attributes or metrics in the report(s), or real-time objects. You can export any of the public dashboards defined in the repository, and can export more than one dashboard at one time.

1. Login to Data Analyzer as a System Administrator.
2. Click Administration » XML Export/Import » Export Dashboards.
3. Select the dashboard to be exported.
4. Click “Export as XML”.
5. Enter the XML filename and click “Save” to save the XML file.
6. The XML file will be stored locally on the client machine.

Exporting a User Security Profile

Data Analyzer maintains a security profile for each user or group in the repository. A security profile consists of the access permissions and data restrictions that the system administrator sets for a user or group.

When exporting a security profile, Data Analyzer exports access permissions for objects under the Schema Directory, which include folders, metrics, and attributes. Data Analyzer does not export access permissions for filtersets, reports, or shared documents.

Data Analyzer allows you to export only one security profile at a time. If a user or group security profile you export does not have any access permissions or data restrictions, Data Analyzer does not export any object definitions and displays the following message:

There is no content to be exported.

1. Login to Data Analyzer as a System Administrator.
2. Click Administration » XML Export/Import » Export Security Profile.
3. Click “Export from users” and select the user whose security profile is to be exported.
4. Click “Export as XML”.
5. Enter the XML filename and click “Save” to save the XML file.
6. The XML file will be stored locally on the client machine.

Exporting a Schedule

You can export a time-based or event-based schedule to an XML file. Data Analyzer runs a report with a time-based schedule on a configured schedule. Data Analyzer runs a report with an event-based schedule when a PowerCenter session completes. When you export a schedule, Data Analyzer does not export the history of the schedule.

1. Login to Data Analyzer as a System Administrator.
2. Click Administration » XML Export/Import » Export Schedules.
3. Select the schedule to be exported.
4. Click “Export as XML”.
5. Enter the XML filename and click “Save” to save the XML file.
6. The XML file will be stored locally on the client machine.

Exporting Users, Groups, or Roles

Exporting Users

You can export the definition of any user defined in the repository. However, you cannot export the definitions of system users defined by Data Analyzer. If you have more than one thousand users defined in the repository, Data Analyzer allows you to search for the users that you want to export. You can use the asterisk (*) or the percent symbol (%) as wildcard characters to search for users to export.

You can export the definitions of more than one user, including the following information:

● Login name

● Description

● First, middle, and last name

● Title

● Password

● Change password privilege

● Password never expires indicator

● Account status

● Groups to which the user belongs

● Roles assigned to the user

● Query governing settings

Data Analyzer does not export the email address, reply-to address, department, or color scheme assignment associated with the exported user(s).

1. Login to Data Analyzer as a System Administrator.
2. Click Administration » XML Export/Import » Export User/Group/Role.
3. Click “Export Users/Group(s)/Role(s)”.
4. Select the user(s) to be exported.
5. Click “Export as XML”.
6. Enter the XML filename and click “Save” to save the XML file.
7. The XML file will be stored locally on the client machine.


Exporting Groups

You can export any group defined in the repository, and can export the definitions of multiple groups. You can also export the definitions of all the users within a selected group. Use the asterisk (*) or percent symbol (%) as wildcard characters to search for groups to export. Each group definition includes the following information:

● Name

● Description

● Department

● Color scheme assignment

● Group hierarchy

● Roles assigned to the group

● Users assigned to the group

● Query governing settings

Data Analyzer does not export the color scheme associated with an exported group.

1. Login to Data Analyzer as a System Administrator.
2. Click Administration » XML Export/Import » Export User/Group/Role.
3. Click “Export Users/Group(s)/Role(s)”.
4. Select the group to be exported.
5. Click “Export as XML”.
6. Enter the XML filename and click “Save” to save the XML file.
7. The XML file will be stored locally on the client machine.

Exporting Roles

You can export the definitions of the custom roles defined in the repository. However, you cannot export the definitions of system roles defined by Data Analyzer. You can export the definitions of more than one role. Each role definition includes the name and description of the role and the permissions assigned to each role.

1. Login to Data Analyzer as a System Administrator.
2. Click Administration » XML Export/Import » Export User/Group/Role.
3. Click “Export Users/Group(s)/Role(s)”.
4. Select the role to be exported.
5. Click “Export as XML”.
6. Enter the XML filename and click “Save” to save the XML file.
7. The XML file will be stored locally on the client machine.

Importing Objects

You can import objects into the same repository or a different repository. If you import objects that already exist in the repository, you can choose to overwrite the existing objects. However, you can import only global variables that do not already exist in the repository.

When you import objects, you can validate the XML file against the DTD provided by Data Analyzer. Informatica recommends that you do not modify the XML files after you export from Data Analyzer. Ordinarily, you do not need to validate an XML file that you create by exporting from Data Analyzer. However, if you are not sure of the validity of an XML file, you can validate it against the Data Analyzer DTD file when you start the import process.

To import repository objects, you must have the System Administrator role or the Access XML Export/Import privilege.

When you import a repository object, you become the owner of the object as if you created it. However, other system administrators can also access imported repository objects. You can limit access to reports for users who are not system administrators. If you select to publish imported reports to everyone, all users in Data Analyzer have read and write access to them. You can change the access permissions to reports after you import them.

Importing Schemas

When importing schemas, if the XML file contains only the metric definition, you must make sure that the fact table for the metric exists in the target repository. You can import a metric only if its associated fact table exists in the target repository or the definition of its associated fact table is also in the XML file.

When you import a schema, Data Analyzer displays a list of all the definitions contained in the XML file. It then displays a list of all the object definitions in the XML file that already exist in the repository. You can choose to overwrite objects in the repository. If you import a schema that contains time keys, you must import or create a time dimension.

1. Login to Data Analyzer as a System Administrator.
2. Click Administration » XML Export/Import » Import Schema.
3. Click “Browse” to choose an XML file to import.
4. Select “Validate XML against DTD”.
5. Click “Import XML”.
6. Verify all attributes on the summary page, and choose “Continue”.

Importing Reports

A valid XML file of exported report objects can contain definitions of cached or on-demand reports, including prompted reports. When you import a report, you must make sure that all the metrics and attributes used in the report are defined in the target repository. If you import a report that contains attributes and metrics not defined in the target repository, you can cancel the import process. If you choose to continue the import process, you may not be able to run the report correctly. To run the report, you must import or add the attribute and metric definitions to the target repository.

You are the owner of all the reports you import, including the personal or public indicators associated with the reports. You can publish the imported reports to all Data Analyzer users. If you publish reports to everyone, Data Analyzer provides read-access to the reports to all users. However, it does not provide access to the folder that contains the imported reports. If you want another user to access an imported report, you can put the imported report in a public folder and have the user save or move the imported report to his or her personal folder. Any public indicator associated with the report also becomes accessible to the user.

If you import a report and its corresponding analytic workflow, the XML file contains all workflow reports. If you choose to overwrite the report, Data Analyzer also overwrites the workflow reports. Also, when importing multiple workflows, note that Data Analyzer does not import analytic workflows containing the same workflow report names. Thus, ensure that all imported analytic workflows have unique report names prior to being imported.

1. Login to Data Analyzer as a System Administrator.
2. Click Administration » XML Export/Import » Import Report.
3. Click “Browse” to choose an XML file to import.
4. Select “Validate XML against DTD”.
5. Click “Import XML”.
6. Verify all attributes on the summary page, and choose “Continue”.

Importing Global Variables


You can import global variables that are not defined in the target repository. If the XML file contains global variables already in the repository, you can cancel the process. If you continue the import process, Data Analyzer imports only the global variables not in the target repository.

1. Login to Data Analyzer as a System Administrator.
2. Click Administration » XML Export/Import » Import Global Variables.
3. Click “Browse” to choose an XML file to import.
4. Select “Validate XML against DTD”.
5. Click “Import XML”.
6. Verify all attributes on the summary page, and choose “Continue”.

Importing Dashboards

Dashboards display links to reports, shared documents, alerts, and indicators. When you import a dashboard, Data Analyzer imports the following objects associated with the dashboard:

● Reports

● Indicators

● Shared documents

● Gauges

Data Analyzer does not import the following objects associated with the dashboard:

● Alerts

● Access permissions

● Attributes and metrics in the report

● Real-time objects

If an object already exists in the repository, Data Analyzer provides an option to overwrite it. Data Analyzer does not import the attributes and metrics in the reports associated with the dashboard. If the attributes or metrics in a report associated with the dashboard do not exist, the report does not display on the imported dashboard.

1. Login to Data Analyzer as a System Administrator.
2. Click Administration » XML Export/Import » Import Dashboard.
3. Click “Browse” to choose an XML file to import.
4. Select “Validate XML against DTD”.
5. Click “Import XML”.
6. Verify all attributes on the summary page, and choose “Continue”.

Importing Security Profile(s)

To import a security profile, you must begin by selecting the user or group to which you want to assign the security profile. You can assign the same security profile to more than one user or group.

When you import a security profile and associate it with a user or group, you can either overwrite the current security profile or add to it. When you overwrite a security profile, you assign the user or group only the access permissions and data restrictions found in the new security profile. Data Analyzer removes the old restrictions associated with the user or group. When you append a security profile, you assign the user or group the new access permissions and data restrictions in addition to the old permissions and restrictions.

When exporting a security profile, Data Analyzer exports the security profile for objects in Schema Directory, including folders, attributes, and metrics. However, it does not include the security profile for filtersets.

1. Login to Data Analyzer as a System Administrator.
2. Click Administration » XML Export/Import » Import Security Profile.
3. Click “Import to Users”.
4. Select the user with which you want to associate the security profile you import.

❍ To associate the imported security profiles with all the users on the page, select the "Users" check box at the top of the list.

❍ To associate the imported security profiles with all the users in the repository, select “Import to All”.

❍ To overwrite the selected user’s current security profile with the imported security profile, select “Overwrite”.

❍ To append the imported security profile to the selected user’s current security profile, select “Append”.

5. Click “Browse” to choose an XML file to import.
6. Select “Validate XML against DTD”.
7. Click “Import XML”.
8. Verify all attributes on the summary page, and choose “Continue”.


Importing Schedule(s)

A time-based schedule runs reports based on a configured schedule. An event-based schedule runs reports when a PowerCenter session completes. You can import time-based or event-based schedules from an XML file. When you import a schedule, Data Analyzer does not attach the schedule to any reports.

1. Login to Data Analyzer as a System Administrator.
2. Click Administration » XML Export/Import » Import Schedule.
3. Click “Browse” to choose an XML file to import.
4. Select “Validate XML against DTD”.
5. Click “Import XML”.
6. Verify all attributes on the summary page, and choose “Continue”.

Importing Users, Groups, or Roles

When you import a user, group, or role, you import all the information associated with each user, group, or role. The XML file includes definitions of roles assigned to users or groups, and definitions of users within groups. For this reason, you can import the definition of a user, group, or role in the same import process.

When importing a user, you import the definitions of roles assigned to the user and the groups to which the user belongs. When you import a user or group, you import the user or group definitions only. The XML file does not contain the color scheme assignments, access permissions, or data restrictions for the user or group. To import the access permissions and data restrictions, you must import the security profile for the user or group.

1. Login to Data Analyzer as a System Administrator.
2. Click Administration » XML Export/Import » Import User/Group/Role.
3. Click “Browse” to choose an XML file to import.
4. Select “Validate XML against DTD”.
5. Click “Import XML”.
6. Verify all attributes on the summary page, and choose “Continue”.

Tips for Importing/Exporting

● Schedule importing/exporting of repository objects for a time of minimal Data Analyzer activity, when most of the users are not accessing the Data Analyzer repository. This should help to prevent users from experiencing timeout errors or degraded response time. Only the System Administrator should perform import/export operations.

● Take a backup of the Data Analyzer repository prior to performing an import/export operation. This backup should be completed using the Repository Backup Utility provided with Data Analyzer.

● Manually add user/group permissions for the report. These permissions are not exported as part of exporting reports and should be added manually after the report is imported into the target server.

● Use a version control tool. Prior to importing objects into a new environment, it is advisable to check the XML documents with a version-control tool such as Microsoft's Visual Source Safe, or PVCS. This facilitates the versioning of repository objects and provides a means for rollback to a prior version of an object, if necessary.

● Attach cached reports to schedules. Data Analyzer does not import the schedule with a cached report. When you import cached reports, you must attach them to schedules in the target repository. You can attach multiple imported reports to schedules in the target repository in one process immediately after you import them.

● Ensure that global variables exist in the target repository. If you import a report that uses global variables in the attribute filter, ensure that the global variables already exist in the target repository. If they are not in the target repository, you must either import the global variables from the source repository or recreate them in the target repository.

● Manually add indicators to the dashboard. When you import a dashboard, Data Analyzer imports all indicators for the originating report and workflow reports in a workflow. However, indicators for workflow reports do not display on the imported dashboard until they are added manually.

● Check with your System Administrator to understand what level of LDAP integration has been configured (if any). Users, groups, and roles need to be exported and imported during deployment when using repository authentication. If Data Analyzer has been integrated with an LDAP (Lightweight Directory Access Protocol) tool, then users, groups, and/or roles may not require deployment.

When you import users into a Microsoft SQL Server or IBM DB2 repository, Data Analyzer blocks all user authentication requests until the import process is complete.


Installing Data Analyzer

Challenge

Installing Data Analyzer on new or existing hardware, either as a dedicated application on a physical machine (as Informatica recommends) or co-existing with other applications on the same physical server or with other Web applications on the same application server.

Description

Consider the following questions when determining what type of hardware to use for Data Analyzer:

If the hardware already exists:

1. Is the processor, operating system, and database software supported by Data Analyzer?
2. Are the necessary operating system and database patches applied?
3. How many CPUs does the machine currently have? Can the CPU capacity be expanded?
4. How much memory does the machine have? How much is available to the Data Analyzer application?
5. Will Data Analyzer share the machine with other applications? If yes, what are the CPU and memory requirements of the other applications?

If the hardware does not already exist:

1. Has the organization standardized on a hardware or operating system vendor?
2. What type of operating system is preferred and supported? (e.g., Solaris, Windows, AIX, HP-UX, Redhat AS, SuSE)
3. What database and version is preferred and supported for the Data Analyzer repository?

Regardless of the hardware vendor chosen, the hardware must be configured and sized appropriately to support the reporting response time requirements for Data Analyzer. The following questions should be answered in order to estimate the size of a Data Analyzer server:

1. How many users are predicted for concurrent access?
2. On average, how many rows will be returned in each report?
3. On average, how many charts will there be for each report?
4. Do the business requirements mandate an SSL Web server?

The hardware requirements for the Data Analyzer environment depend on the number of concurrent users, types of reports being used (i.e., interactive vs. static), average number of records in a report, application server and operating system used, among other factors. The following table should be used as a general guide for hardware recommendations for a Data Analyzer installation. Actual results may vary depending upon exact hardware configuration and user volume. For exact sizing recommendations, contact Informatica Professional Services for a Data Analyzer Sizing and Baseline Architecture engagement.

Windows

# of Concurrent Users | Average Number of Rows per Report | Average # of Charts per Report | Estimated # of CPUs for Peak Usage | Estimated Total RAM (For Data Analyzer alone) | Estimated # of App servers in a Clustered Environment
50  | 1000  | 2  | 2     | 1 GB   | 1
100 | 1000  | 2  | 3     | 2 GB   | 1 - 2
200 | 1000  | 2  | 6     | 3.5 GB | 3
400 | 1000  | 2  | 12    | 6.5 GB | 6
100 | 1000  | 2  | 3     | 2 GB   | 1 - 2
100 | 2000  | 2  | 3     | 2.5 GB | 1 - 2
100 | 5000  | 2  | 4     | 3 GB   | 2
100 | 10000 | 2  | 5     | 4 GB   | 2 - 3
100 | 1000  | 2  | 3     | 2 GB   | 1 - 2
100 | 1000  | 5  | 3     | 2 GB   | 1 - 2
100 | 1000  | 7  | 3     | 2.5 GB | 1 - 2
100 | 1000  | 10 | 3 - 4 | 3 GB   | 1 - 2

Notes:

1. This estimating guide is based on experiments conducted in the Informatica lab.
2. The sizing estimates are based on PowerAnalyzer 5 running BEA WebLogic 8.1 SP3 and Windows 2000 on a 4-CPU 2.5 GHz Xeon processor. This estimate may not be accurate for other, different environments.
3. The number of concurrent users under peak volume can be estimated by multiplying the total number of users by the percentage of concurrent users. In practice, typically 10 percent of the user base is concurrent. However, this percentage can be as high as 50 percent or as low as 5 percent in some organizations. (A worked example follows these notes.)
4. For every two CPUs on the server, Informatica recommends one managed server (instance) of the application server. For servers with at least four CPUs, clustering multiple logical instances of the application server on one physical server can result in increased performance.
5. There will be an increase in overhead for an SSL Web server architecture, depending on the strength of encryption.
6. CPU utilization can be reduced by 10 to 25 percent by using SVG charts, otherwise known as interactive charting, rather than the default PNG charting.
7. Clustering is recommended for instances with more than 50 concurrent users. (Clustering does not have to be across multiple boxes if the server has >= 4 CPUs.)
8. Informatica Professional Services should be engaged for a thorough and accurate sizing estimate.
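
As a worked example of note 3: an organization with 1,000 total Data Analyzer users and a typical 10 percent concurrency rate would plan for roughly 1,000 x 0.10 = 100 concurrent users. Reading the 100-user rows of the Windows table above (about 1,000 rows and 2 charts per report), this corresponds to an estimate of roughly 3 CPUs, 2 GB of RAM for Data Analyzer, and 1 to 2 application server instances.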

IBM AIX

# of Concurrent Users | Average Number of Rows per Report | Average # of Charts per Report | Estimated # of CPUs for Peak Usage | Estimated Total RAM (For Data Analyzer alone) | Estimated # of App servers in a Clustered Environment
50  | 1000  | 2  | 2      | 1 GB   | 1
100 | 1000  | 2  | 2 - 3  | 2 GB   | 1
200 | 1000  | 2  | 4 - 5  | 3.5 GB | 2 - 3
400 | 1000  | 2  | 9 - 10 | 6 GB   | 4 - 5
100 | 1000  | 2  | 2 - 3  | 2 GB   | 1
100 | 2000  | 2  | 2 - 3  | 2 GB   | 1 - 2
100 | 5000  | 2  | 2 - 3  | 3 GB   | 1 - 2
100 | 10000 | 2  | 4      | 4 GB   | 2
100 | 1000  | 2  | 2 - 3  | 2 GB   | 1
100 | 1000  | 5  | 2 - 3  | 2 GB   | 1
100 | 1000  | 7  | 2 - 3  | 2 GB   | 1 - 2
100 | 1000  | 10 | 2 - 3  | 2.5 GB | 1 - 2

Notes:

1. This estimating guide is based on experiments conducted in the Informatica lab.
2. The sizing estimates are based on PowerAnalyzer 5 running IBM WebSphere 5.1.1.1 and AIX 5.2.02 on a 4-CPU 2.4 GHz IBM p630. This estimate may not be accurate for other, different environments.
3. The number of concurrent users under peak volume can be estimated by multiplying the total number of users by the percentage of concurrent users. In practice, typically 10 percent of the user base is concurrent. However, this percentage can be as high as 50 percent or as low as 5 percent in some organizations.
4. For every two CPUs on the server, Informatica recommends one managed server (instance) of the application server. For servers with at least four CPUs, clustering multiple logical instances of the application server on one physical server can result in increased performance.
5. Add 30 to 50 percent overhead for an SSL Web server architecture, depending on the strength of encryption.
6. CPU utilization can be reduced by 10 to 25 percent by using SVG charts, otherwise known as interactive charting, rather than the default PNG charting.
7. Clustering is recommended for instances with more than 50 concurrent users. (Clustering does not have to be across multiple boxes if the server has >= 4 CPUs.)
8. Informatica Professional Services should be engaged for a thorough and accurate sizing estimate.

Data Analyzer Installation

The Data Analyzer installation process involves two main components: the Data Analyzer Repository and the Data Analyzer Server, which is an application deployed on an application server. A Web server is necessary to support these components and is included with the installation of the application servers. This section discusses the installation process for JBOSS, BEA WebLogic and IBM WebSphere. The installation tips apply to both Windows and UNIX environments. This section is intended to serve as a supplement to the Data Analyzer Installation Guide.

Before installing Data Analyzer, be sure to complete the following steps:

● Verify that the hardware meets the minimum system requirements for Data Analyzer. Ensure that the combination of hardware, operating system, application server, repository database, and, optionally, authentication software is supported by Data Analyzer. Ensure that sufficient space has been allocated to the Data Analyzer repository.

● Apply all necessary patches to the operating system and database software.

● Verify connectivity to the data warehouse database (or other reporting source) and the repository database.

● If LDAP or NT Domain is used for Data Analyzer authentication, verify connectivity to the LDAP directory server or the NT primary domain controller.

● Obtain the Data Analyzer license file from technical support.

● On UNIX/Linux installations, ensure that the OS user running Data Analyzer has execute privileges on all Data Analyzer installation executables.


In addition to the standard Data Analyzer components that are installed by default, you can also install Metadata Manager. With Version 8.0, the Data Analyzer SDK and Portal Integration Kit are now installed with Data Analyzer. Refer to the Data Analyzer documentation for detailed information for these components.

Changes to Installation Process

Beginning with Data Analyzer version 7.1.4, Data Analyzer is packaged with PowerCenter Advanced Edition. To install only the Data Analyzer portion, choose the Custom Installation option during the installation process. On the following screen, uncheck all of the check boxes except the Data Analyzer check box and then click Next.

Repository Configuration

To properly install Data Analyzer you need to have connectivity information for the database server where the repository is going to reside. This information includes:

● Database URL

● Repository username

● Password for the repository username
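
For example, for an Oracle repository database, the database URL typically takes the JDBC thin-driver form shown below; the host, port, and SID values are placeholders, and the exact URL format depends on the repository database type and the JDBC driver used by the application server:

jdbc:oracle:thin:@<database_host>:<port>:<SID>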

Installation Steps: JBOSS

The following are the basic installation steps for Data Analyzer on JBOSS:

1. Set up the Data Analyzer repository database. The Data Analyzer Server installation process will create the repository tables, but an empty database schema needs to exist and be able to be connected to via JDBC prior to installation.

2. Install Data Analyzer. The Data Analyzer installation process will install JBOSS if a version does not already exist, or an existing instance can be selected.

3. Apply the Data Analyzer license key.
4. Install the Data Analyzer Online Help.


Installation Tips: JBOSS

The following are the basic installation tips for Data Analyzer on JBOSS:

● Beginning with PowerAnalyzer 5, multiple Data Analyzer instances can be installed on a single instance of JBOSS. Also, other applications can coexist with Data Analyzer on a single instance of JBOSS. Although this architecture should be considered during hardware sizing estimates, it allows greater flexibility during installation.

● For JBOSS installations on UNIX, the JBOSS Server installation program requires an X-Windows server. If JBOSS Server is installed on a machine where an X-Windows server is not installed, an X-Windows server must be installed on another machine in order to render graphics for the GUI-based installation program. For more information on installing on UNIX, please see the “UNIX Servers” section of the installation and configuration tips below.

● If the Data Analyzer installation files are transferred to the Data Analyzer Server, they must be FTP’d in binary format.

● To enable an installation error log, read the Knowledgebase article, HOW TO: Debug PowerAnalyzer Installations. You can reach this article through My Informatica (http://my.informatica.com).

● During the Data Analyzer installation process, the user will be prompted to choose an authentication method for Data Analyzer, such as repository, NT Domain, or LDAP. If LDAP or NT Domain authentication is used, have the configuration parameters available during installation as the installer will configure all properties files at installation.

● The Data Analyzer license file must be applied prior to starting Data Analyzer.

Configuration Screen

Installation Steps: BEA WebLogic

The following are the basic installation steps for Data Analyzer on BEA WebLogic:


1. Set up the Data Analyzer repository database. The Data Analyzer Server installation process will create the repository tables, but an empty database schema needs to exist and be able to be connected to via JDBC prior to installation.

2. Install BEA WebLogic and apply the BEA license.
3. Install Data Analyzer.
4. Apply the Data Analyzer license key.
5. Install the Data Analyzer Online Help.

TIP

When creating a repository in an Oracle database, make sure the storage parameters specified for the tablespace that contains the repository are not set too large. Since many target tablespaces are initially set for very large INITIAL and NEXT values, large storage parameters cause the repository to use excessive amounts of space. Also verify that the default tablespace for the user that owns the repository tables is set correctly.

The following example shows how to set the recommended storage parameters, assuming the repository is stored in the “REPOSITORY” tablespace:

ALTER TABLESPACE "REPOSITORY" DEFAULT STORAGE (INITIAL 10K NEXT 10K MAXEXTENTS UNLIMITED PCTINCREASE 50);
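
The tip above also calls for verifying the default tablespace of the user that owns the repository tables. A minimal sketch of setting it, assuming a hypothetical repository owner named DA_REPO and the same REPOSITORY tablespace:

ALTER USER DA_REPO DEFAULT TABLESPACE "REPOSITORY";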

Installation Tips: BEA WebLogic

The following are the basic installation tips for Data Analyzer on BEA WebLogic:

● Beginning with PowerAnalyzer 5, multiple Data Analyzer instances can be installed on a single instance of WebLogic. Also, other applications can coexist with Data Analyzer on a single instance of WebLogic. Although this architecture should be factored in during hardware sizing estimates, it allows greater flexibility during installation.

● With Data Analyzer 8, there is a console version of the installation available. X-Windows is no longer required for WebLogic installations.

● If the Data Analyzer installation files are transferred to the Data Analyzer Server, they must be FTP’d in binary format.

● To enable an installation error log, read the Knowledgebase article, HOW TO: Debug PowerAnalyzer Installations. You can reach this article through My Informatica (http://my.informatica.com).

● During the Data Analyzer installation process, the user will be prompted to choose an authentication method for Data Analyzer, such as repository, NT Domain, or LDAP. If LDAP or NT Domain authentication is used, have the configuration parameters available during installation since the installer will configure all properties files at installation.

● The Data Analyzer license file and BEA WebLogic license must be applied prior to starting Data Analyzer.

Configuration Screen


Installation Steps: IBM WebSphere

The following are the basic installation steps for Data Analyzer on IBM WebSphere:

1. Set up the Data Analyzer repository database. The Data Analyzer Server installation process will create the repository tables, but the empty database schema needs to exist and be able to be connected to via JDBC prior to installation.

2. Install IBM WebSphere and apply the WebSphere patches. WebSphere can be installed in its “Base” configuration or “Network Deployment” configuration if clustering will be utilized. In both cases, patchsets will need to be applied.

3. Install Data Analyzer.
4. Apply the Data Analyzer license key.
5. Install the Data Analyzer Online Help.
6. Configure the PowerCenter Integration Utility. See the section "Configuring the PowerCenter Integration Utility for WebSphere" in the PowerCenter Installation and Configuration Guide.

Installation Tips: IBM WebSphere

● Starting in Data Analyzer 5, multiple Data Analyzer instances can be installed on a single instance of WebSphere. Also, other applications can coexist with Data Analyzer on a single instance of WebSphere. Although this architecture should be considered during sizing estimates, it allows greater flexibility during installation.

● With Data Analyzer 8 there is a console version of the installation available. X-Windows is no longer required for WebSphere installations.

● For WebSphere on UNIX installations, Data Analyzer must be installed using the root user or system administrator account. Two groups (mqm and mqbrkrs) must be created prior to the installation and the root account should be added to both of these groups.

● For WebSphere on Windows installations, ensure that Data Analyzer is installed under the “padaemon” local Windows user ID that is in the Administrative group and has the advanced user rights: "Act as part of the operating system" and "Log on as a service." During the installation, the padaemon account will need to be added to the mqm group.


● If the Data Analyzer installation files are transferred to the Data Analyzer Server, they must be FTP’d in binary format.

● To enable an installation error log, read the Knowledgebase article, HOW TO: Debug PowerAnalyzer Installations. You can reach this article through My Informatica (http://my.informatica.com).

● During the WebSphere installation process, the user will be prompted to enter a directory for the application server and the HTTP (web) server. In both instances, it is advisable to keep the default installation directory. Directory names for the application server and HTTP server that include spaces may result in errors.

● During the Data Analyzer installation process, the user will be prompted to choose an authentication method for Data Analyzer, such as repository, NT Domain, or LDAP. If LDAP or NT Domain authentication is utilized, have the configuration parameters available during installation as the installer will configure all properties files at installation.

● The Data Analyzer license file must be applied prior to starting Data Analyzer.

Configuration Screen

Installation and Configuration Tips: UNIX Servers

With Data Analyzer 8 there is a console version of the installation available. For previous versions of Data Analyzer, a graphics display server is required for a Data Analyzer installation on UNIX.

On UNIX, the graphics display server is typically an X-Windows server, although an X-Window Virtual Frame Buffer (XVFB) or personal computer X-Windows software such as WRQ Reflection-X can also be used. In any case, the X-Windows server does not need to exist on the local machine where Data Analyzer is being installed, but does need to be accessible. A remote X-Windows, XVFB, or PC-X Server can be used by setting the DISPLAY to the appropriate IP address, as discussed below.

If the X-Windows server is not installed on the machine where Data Analyzer will be installed, Data Analyzer can be installed using an X-Windows server installed on another machine. Simply redirect the DISPLAY variable to use the X-Windows server on another UNIX machine.

To redirect the host output, define the environment variable DISPLAY. On the command line, type the following command and press Enter:


C shell:

setenv DISPLAY <TCP/IP node of X-Windows server>:0

Bourne/Korn shell:

export DISPLAY="<TCP/IP node of X-Windows server>:0"

Configuration

● Data Analyzer requires a means to render graphics for charting and indicators. When graphics rendering is not configured properly, charts and indicators do not display properly on dashboards or reports. For Data Analyzer installations using an application server with JDK 1.4 or later, the "java.awt.headless=true" setting can be added to the application server startup options to facilitate graphics rendering for Data Analyzer. If the application server does not use JDK 1.4 or later, use an X-Windows server or XVFB to render graphics; the DISPLAY environment variable should be set to the IP address of the X-Windows or XVFB server prior to starting Data Analyzer.

● The application server heap size is the memory allocation for the JVM. The recommended heap size depends on the memory available on the machine hosting the application server and on the server load, but the recommended starting point is 512MB. This is the first setting to examine when tuning a Data Analyzer instance. A combined example of both settings is shown below.
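As a rough sketch of how the two settings above can be combined (the exact location depends on the application server; in WebSphere they are typically added to the Generic JVM arguments, in WebLogic to the server startup script), both are supplied as standard JVM arguments:

-Djava.awt.headless=true -Xms512m -Xmx512m

The -Xms and -Xmx values set the initial and maximum heap sizes; raise them from the 512MB starting point as the available memory and the server load allow.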

Last updated: 06-Feb-07 11:55


Data Connectivity using PowerCenter Connect for BW Integration Server

Challenge

Understanding how to use PowerCenter Connect for SAP NetWeaver - BW Option to load data into the SAP BW (Business Information Warehouse).

Description

The PowerCenter Connect for SAP NetWeaver - BW Option supports the SAP Business Information Warehouse as both a source and target.

Extracting Data from BW

PowerCenter Connect for SAP NetWeaver - BW Option lets you extract data from SAP BW to use as a source in a PowerCenter session. PowerCenter Connect for SAP NetWeaver - BW Option integrates with the Open Hub Service (OHS), SAP’s framework for extracting data from BW. OHS uses data from multiple BW data sources, including SAP's InfoSources and InfoCubes. The OHS framework includes InfoSpoke programs, which extract data from BW and write the output to SAP transparent tables.

Loading Data into BW

PowerCenter Connect for SAP NetWeaver - BW Option lets you import BW target definitions into the Designer and use the target in a mapping to load data into BW. PowerCenter Connect for SAP NetWeaver - BW Option uses the Business Application Programming Interface (BAPI) to exchange metadata and load data into BW.

PowerCenter can use SAP’s business content framework to provide a high-volume data warehousing solution or SAP’s Business Application Program Interface (BAPI), SAP’s strategic technology for linking components into the Business Framework, to exchange metadata with BW.

PowerCenter extracts and transforms data from multiple sources and uses SAP’s high-speed bulk BAPIs to load the data into BW, where it is integrated with industry-specific models for analysis through the SAP Business Explorer tool.


Using PowerCenter with PowerCenter Connect to Populate BW

The following paragraphs summarize some of the key differences in using PowerCenter with the PowerCenter Connect to populate a SAP BW rather than working with standard RDBMS sources and targets.

● BW uses a pull model. The BW must request data from a source system before the source system can send data to the BW. PowerCenter must first register with the BW using SAP’s Remote Function Call (RFC) protocol.

● The native interface to communicate with BW is the Staging BAPI, an API published and supported by SAP. Three products in the PowerCenter suite use this API. PowerCenter Designer uses the Staging BAPI to import metadata for the target transfer structures; PowerCenter Integration Server for BW uses the Staging BAPI to register with BW and receive requests to run sessions; and the PowerCenter Server uses the Staging BAPI to perform metadata verification and load data into BW.

● Programs communicating with BW use the SAP standard saprfc.ini file to communicate with BW. The saprfc.ini file is similar to the tnsnames file in Oracle or the interface file in Sybase. The PowerCenter Designer reads metadata from BW and the PowerCenter Server writes data to BW.

● BW requires that all metadata extensions be defined in the BW Administrator Workbench. The definition must be imported to Designer. An active structure is the target for PowerCenter mappings loading BW.

● Because of the pull model, BW must control all scheduling. BW invokes the PowerCenter session when the InfoPackage is scheduled to run in BW.

● BW only supports insertion of data into BW. There is no concept of update or deletes through the staging BAPI.

Steps for Extracting Data from BW

The process of extracting data from SAP BW is quite similar to extracting data from SAP. Similar transports are used on the SAP side, and data type support is the same as that supported for SAP PowerCenter Connect.

The steps required for extracting data are:

1. Create an InfoSpoke. Create an InfoSpoke in the BW to extract the data from the BW database and write it to either a database table or a file output target.

2. Import the ABAP program. Import the Informatica-provided ABAP program, which calls the workflow created in the Workflow Manager.

3. Create a mapping. Create a mapping in the Designer that uses the database table or file output target as a source.

4. Create a workflow to extract data from BW. Create a workflow and session task to automate data extraction from BW.

5. Create a Process Chain. A BW Process Chain links programs together to run in sequence. Create a Process Chain to link the InfoSpoke and ABAP programs together.

6. Schedule the data extraction from BW. Set up a schedule in BW to automate data extraction.

Steps To Load Data into BW

1. Install and Configure PowerCenter Components.

The installation of the PowerCenter Connect for SAP NetWeaver - BW Option includes both a client and a server component. The Connect server must be installed in the same directory as the PowerCenter Server. Informatica recommends installing the Connect client tools in the same directory as the PowerCenter Client. For more details on installation and configuration refer to the PowerCenter and the PowerCenter Connect installation guides.

Note: For PowerCenter Connect version 8.1 and above, it is crucial to install or upgrade the PowerCenter 8.1 transports on the appropriate SAP system when installing or upgrading PowerCenter Connect for SAP NetWeaver - BW Option. If you are extracting data from BW using OHS, you must also configure the mySAP option. If the BW system is separate from the SAP system, install the designated transports on the BW system. It is also important to note that there are now three categories of transports (as compared to two in previous versions). These are as follows:

● Transports for SAP versions 3.1H and 3.1I.

● Transports for SAP versions 4.0B to 4.6B, 4.6C, and non-Unicode versions 4.7 and above.

● Transports for SAP Unicode versions 4.7 and above; this category has been added for Unicode extraction support, which was not previously available in SAP versions 4.6 and earlier.


2. Build the BW Components.

To load data into BW, you must build components in both BW and PowerCenter. You must first build the BW components in the Administrator Workbench:

● Define PowerCenter as a source system to BW. BW requires an external source definition for all non-R/3 sources.

● Create the InfoObjects in BW (this is similar to a database table).

● The InfoSource represents a provider structure. Create the InfoSource in the BW Administrator Workbench and import the definition into the PowerCenter Warehouse Designer.

● Assign the InfoSource to the PowerCenter source system. After you create an InfoSource, assign it to the PowerCenter source system.

● Activate the InfoSource. When you activate the InfoSource, you activate the InfoObjects and the transfer rules.

3. Configure the saprfc.ini file.

Required for PowerCenter and Connect to connect to BW.

PowerCenter uses two types of entries to connect to BW through the saprfc.ini file:

● Type A. Used by PowerCenter Client and PowerCenter Server. Specifies the BW application server.

● Type R. Used by the PowerCenter Connect for SAP NetWeaver - BW Option. Specifies the external program, which is registered at the SAP gateway.

Note: Do not use Notepad to edit the saprfc.ini file because Notepad can corrupt it. Set the RFC_INI environment variable on all Windows NT, Windows 2000, and Windows 95/98 machines that use a saprfc.ini file; RFC_INI is used to locate the saprfc.ini file. A sample saprfc.ini is sketched below.
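The following is a minimal sketch of a saprfc.ini file with one Type A entry (used by the PowerCenter Client and Server) and one Type R entry (used by the Connect for BW Server). The DEST names, host name, program ID, and system number are placeholders and must match your BW landscape and the connections configured in PowerCenter:

DEST=BWSRC
TYPE=A
ASHOST=bwappserver01
SYSNR=00
RFC_TRACE=0

DEST=PMBW
TYPE=R
PROGID=PID_PMBW
GWHOST=bwappserver01
GWSERV=sapgw00

On Windows, point the RFC_INI environment variable at the full path of this file, for example RFC_INI=c:\informatica\saprfc.ini.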

4. Start the Connect for BW server

Start the Connect for BW Server after you start the PowerCenter Server and before you create the InfoPackage in BW.


5. Build mappings

Import the InfoSource into the PowerCenter repository and build a mapping using the InfoSource as a target.

The following restrictions apply to building mappings with a BW InfoSource target:

● You cannot use BW as a lookup table.
● You can use only one transfer structure for each mapping.
● You cannot execute a stored procedure in a BW target.
● You cannot partition pipelines with a BW target.
● You cannot copy fields that are prefaced with /BIC/ from the InfoSource definition into other transformations.
● You cannot build an update strategy in a mapping. BW supports only inserts; it does not support updates or deletes. You can use an Update Strategy transformation in a mapping, but the Connect for BW Server attempts to insert all records, even those marked for update or delete.

6. Load data

To load data into BW from PowerCenter, both PowerCenter and the BW system must be configured.

Use the following steps to load data into BW:

● Configure a workflow to load data into BW. Create a session in a workflow that uses a mapping with an InfoSource target definition.

● Create and schedule an InfoPackage. The InfoPackage associates the PowerCenter session with the InfoSource.

When the Connect for BW Server starts, it communicates with the BW to register itself as a server. The Connect for BW Server waits for a request from the BW to start the workflow. When the InfoPackage starts, the BW communicates with the registered Connect for BW Server and sends the workflow name to be scheduled with the PowerCenter Server. The Connect for BW Server reads information about the workflow and sends a request to the PowerCenter Server to run the workflow.

The PowerCenter Server validates the workflow name in the repository and the workflow name in the InfoPackage. The PowerCenter Server executes the session and loads the data into BW. You must start the Connect for BW Server after you restart the PowerCenter Server.

Supported Datatypes

The PowerCenter Server transforms data based on the Informatica transformation datatypes. BW can only receive data in packets of 250 bytes. The PowerCenter Server converts all data to a CHAR datatype and puts it into packets of 250 bytes, plus one byte for a continuation flag.

BW receives data until it reads the continuation flag set to zero. Within the transfer structure, BW then converts the data to the BW datatype. Currently, BW only supports the following datatypes in transfer structures assigned to BAPI source systems (such as PowerCenter): CHAR, CUKY, CURR, DATS, NUMC, TIMS, UNIT.

All other datatypes result in the following error in BW:

Invalid data type (data type name) for source system of type BAPI.

Date/Time Datatypes

The transformation date/time datatype supports dates with precision to the second. If you import a date/time value that includes milliseconds, the PowerCenter Server truncates to seconds. If you write a date/time value to a target column that supports milliseconds, the PowerCenter Server inserts zeros for the millisecond portion of the date.

Binary Datatypes

BW does not allow you to build a transfer structure with binary datatypes. Therefore, you cannot load binary data from PowerCenter into BW.

Numeric Datatypes

PowerCenter does not support the INT1 datatype.

Performance Enhancement for Loading into SAP BW


If you see a performance slowdown for sessions that load into SAP BW, set the default buffer block size to between 15MB and 20MB to enhance performance. You can put 5,000 to 10,000 rows per block, so you can calculate the buffer block size needed with the following formula:

Row size x Rows per block = Default buffer block size

For example, if your target row size is 2KB: 2 KB x 10,000 = 20MB.

Last updated: 01-Feb-07 18:52


Data Connectivity using PowerCenter Connect for MQSeries

Challenge

Understanding how to use IBM MQSeries applications in PowerCenter mappings.

Description

MQSeries applications communicate by sending messages asynchronously rather than by calling each other directly. Applications can also request data using a "request message" on a message queue. Because no open connection is required between systems, they can run independently of one another. MQSeries enforces no structure on the content or format of the message; this is defined by the application.

With more and more requirements for “on-demand” or real-time data integration, as well as the development of Enterprise Application Integration (EAI) capabilities, MQ Series has become an important vehicle for providing information to data warehouses in a real-time mode.

PowerCenter provides data integration for transactional data generated by online, continuously running messaging systems (such as MQSeries). For these types of messaging systems, PowerCenter’s Zero Latency (ZL) Engine provides immediate processing of trickle-feed data, allowing real-time data flow to be processed in both a uni-directional and a bi-directional manner.

TIP In order to enable PowerCenter’s ZL engine to process MQ messages in real-time, the workflow must be configured to run continuously and a real-time MQ filter needs to be applied to the MQ source qualifier (such as idle time, reader time limit, or message count).

MQSeries Architecture

IBM MQSeries is a messaging and queuing application that permits programs to communicate with one another across heterogeneous platforms and network protocols using a consistent application-programming interface.

MQSeries architecture has three parts:

1. Queue Manager
2. Message Queue, which is a destination to which messages can be sent
3. MQSeries Message, which incorporates a header and a data component

Queue Manager

● PowerCenter connects to Queue Manager to send and receive messages.
● A Queue Manager may publish one or more MQ queues.
● Every message queue belongs to a Queue Manager.
● Queue Manager administers queues, creates queues, and controls queue operation.

Message Queue

● PowerCenter connects to Queue Manager to send and receive messages to one or more message queues.
● PowerCenter is responsible for deleting the message from the queue after processing it.


TIP There are several ways to maintain transactional consistency (i.e., clean up the queue after reading). Refer to the Informatica Webzine article on Transactional Consistency for details on the various ways to delete messages from the queue.

MQSeries Message

An MQSeries message is composed of two distinct sections:

● MQSeries header. This section contains data about the queue message itself. Message header data includes the message identification number, message format, and other message descriptor data. In PowerCenter, MQSeries sources and dynamic MQSeries targets automatically incorporate MQSeries message header fields.

● MQSeries message data block. A single data element that contains the application data (sometimes referred to as the "message body"). The content and format of the message data is defined by the application that puts the message on the queue.

Extracting Data from a Queue

Reading Messages from a Queue

In order for PowerCenter to extract from the message data block, the source system must define the data in one of the following formats:

● Flat file (fixed width or delimited)
● XML
● COBOL
● Binary

When reading a message from a queue, the PowerCenter mapping must contain an MQ Source Qualifier (MQSQ). If the mapping also needs to read the message data block, then an Associated Source Qualifier (ASQ) is also needed. When developing an MQ Series mapping, the MESSAGE_DATA block is re-defined by the ASQ. Based on the format of the source data, PowerCenter will generate the appropriate transformation for parsing the MESSAGE_DATA. Once associated, the MSG_ID field is linked within the associated source qualifier transformation.

Applying Filters to Limit Messages Returned

Filters can be applied to the MQ Source Qualifier to reduce the number of messages read.

Filters can also be added to control the length of time PowerCenter reads the MQ queue.

If no filters are applied, PowerCenter reads all messages in the queue and then stops reading.

Example:

PutDate >= “20040901” && PutDate <= “20040930”

TIP In order to leverage reading a single MQ queue to process multiple record types, have the source application populate an MQ header field and then filter the value set in this field (Example: ApplIdentityData = ‘TRM’).


Using MQ Functions

PowerCenter provides built-in functions that can also be used to filter message data.

● Functions can be used to control the end-of-file of the MQSeries queue.
● Functions can be used to enable PowerCenter real-time data extraction.

Available Functions:

Function Description

Idle(n) Time RT remains idle before stopping.

MsgCount(n) Number of messages read from the queue before stopping.

StartTime(time) GMT time when RT begins reading queue.

EndTime(time) GMT time when RT stops reading queue.

FlushLatency(n) Time period RT waits before committing messages read from the queue.

ForcedEOQ(n) Time period RT reads messages from the queue before stopping.

RemoveMsg(TRUE) Removes messages from the queue.

TIP In order to enable real-time message processing, use the FlushLatency() or ForcedEOQ() MQ functions as part of the filter expression in the MQSQ.
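For example (a sketch only; the values are placeholders and should be tuned to the latency and volume requirements of the session), the filter condition in the MQ Source Qualifier for a continuously running session might be:

FlushLatency(5) && RemoveMsg(TRUE)

This commits the messages read from the queue every five seconds and removes each message from the queue after it has been processed.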

Loading Message to a Queue

PowerCenter supports two types of MQ targeting: Static and Dynamic.

● Static MQ Targets. Used for loading message data (instead of header data) to the target. A Static target does not load data to the message header fields. Use the target definition specific to the format of the message data (i.e., flat file, XML, or COBOL). Design the mapping as if it were not using MQ Series, then configure the target connection to point to a MQ message queue in the session when using MQSeries.

● Dynamic. Used for binary targets only, and when loading data to a message header. Note that certain message headers in an MQSeries message require a predefined set of values assigned by IBM.

Dynamic MQSeries Targets

Use this type of target if message header fields need to be populated from the ETL pipeline.

MESSAGE_DATA field data type is binary only.

Certain fields cannot be populated by the pipeline (i.e., set by the target MQ environment):

● UserIdentifier
● AccountingToken
● ApplIdentityData
● PutApplType
● PutApplName
● PutDate
● PutTime
● ApplOriginData

Static MQSeries Targets

Unlike dynamic targets, where an MQ target transformation exists in the mapping, static targets use existing target transformations.

● Flat file
● XML
● COBOL
● RT can only write to one MQ queue per target definition.
● XML targets with multiple hierarchies can generate one or more MQ messages (configurable).

Creating and Configuring MQSeries Sessions

After you create mappings in the Designer, you can create and configure sessions in the Workflow Manager.

Configuring MQSeries Sources

The MQSeries source definition represents the metadata for the MQSeries source in the repository. Unlike other source definitions, you do not create an MQSeries source definition by importing the metadata from the MQSeries source. Since all MQSeries messages contain the same message header and message data fields, the Designer provides an MQSeries source definition with predefined column names.

MQSeries Mappings

MQSeries mappings cannot be partitioned if an associated source qualifier is used.

For MQ Series sources, set the Source Type to the following:

● Heterogeneous - when there is an associated source definition in the mapping. This indicates that the source data is coming from an MQ source, and the message data is in flat file, COBOL or XML format.

● Message Queue - when there is no associated source definition in the mapping.

Note that there are two pages on the Source Options dialog: XML and MQSeries. You can alternate between the two pages to set configurations for each.


Configuring MQSeries Targets

For Static MQSeries targets, select File Target type from the list. When the target is an XML file or XML message data for a target message queue, the target type is automatically set to XML.

● If you load data to a dynamic MQ target, the target type is automatically set to Message Queue.
● On the MQSeries page, select the MQ connection to use for the source message queue, and click OK.
● Be sure to select the MQ checkbox in Target Options for the Associated file type. Then click Edit Object Properties and type:

❍ the connection name of the target message queue.
❍ the format of the message data in the target queue (ex. MQSTR).
❍ the number of rows per message (only applies to flat file MQ targets).

Considerations when Working with MQSeries

The following features and functions are not available to PowerCenter when using MQSeries:

● Lookup transformations can be used in an MQSeries mapping, but lookups on MQSeries sources are not allowed.
● No Debug "Sessions". You must run an actual session to debug a queue mapping.
● Certain considerations are necessary when using AEPs, Aggregators, Joiners, Sorters, Rank, or Transaction Control transformations because they can only be performed on one queue, as opposed to a full data set.
● The MQSeries mapping cannot contain a flat file target definition if you are trying to target an MQSeries queue.
● PowerCenter version 6 and earlier performs a browse of the MQ queue. PowerCenter version 7 provides the ability to perform a destructive read of the MQ queue (instead of a browse).
● PowerCenter version 7 also provides support for active transformations (i.e., Aggregators) in an MQ source mapping.
● PowerCenter version 7 provides MQ message recovery on restart of failed sessions.
● PowerCenter version 7 offers enhanced XML capabilities for mid-stream XML parsing.

Appendix Information

PowerCenter uses the following datatypes in MQSeries mappings:

● IBM MQSeries datatypes. IBM MQSeries datatypes appear in the MQSeries source and target definitions in a mapping.

● Native datatypes. Flat file, XML, or COBOL datatypes associated with MQSeries message data. Native datatypes appear in flat file, XML, and COBOL source definitions. Native datatypes also appear in flat file and XML target definitions in the mapping.

● Transformation datatypes. Transformation datatypes are generic datatypes that PowerCenter uses during the transformation process. They appear in all the transformations in the mapping.

IBM MQSeries Datatypes

MQSeries Datatypes Transformation Datatypes

MQBYTE BINARY

MQCHAR STRING

MQLONG INTEGER


MQHEX

Values for Message Header Fields in MQSeries Target Messages

MQSeries Message Header Description

StrucId Structure identifier

Version Structure version number

Report Options for report messages

MsgType Message type

Expiry Message lifetime

Feedback Feedback or reason code

Encoding Data encoding

CodedCharSetId Coded character set identifier

Format Format name

Priority Message priority

Persistence Message persistence

MsgId Message identifier

CorrelId Correlation identifier

BackoutCount Backout counter

ReplytoQ Name of reply queue

ReplytoQMgr Name of reply queue manager

UserIdentifier Defined by the environment. If the MQSeries server cannot determine this value, the value for the field is null.

AccountingToken Defined by the environment. If the MQSeries server cannot determine this value, the value for the field is MQACT_NONE.

ApplIdentityData Application data relating to identity. The value for ApplIdentityData is null.

PutApplType Type of application that put the message on queue. Defined by the environment.

PutApplName Name of application that put the message on queue. Defined by the environment. If the MQSeries server cannot determine this value, the value for the field is null.

PutDate Date when the message arrives in the queue.


PutTime Time when the message arrives in queue.

ApplOriginData Application data relating to origin. Value for ApplOriginData is null.

GroupId Group identifier

MsgSeqNumber Sequence number of logical messages within group.

Offset Offset of data in physical message from start of logical message.

MsgFlags Message flags

OriginalLength Length of original message

Last updated: 01-Feb-07 18:52


Data Connectivity using PowerCenter Connect for SAP

Challenge

Understanding how to install PowerCenter Connect for SAP R/3, extract data from SAP R/3, and load data into SAP R/3.

Description

SAP R/3 is ERP software that provides multiple business applications/modules, such as financial accounting, materials management, sales and distribution, human resources, CRM, and SRM. The core R/3 system (BASIS layer) is programmed in Advanced Business Application Programming/4th Generation (ABAP/4, or ABAP), a language proprietary to SAP.

PowerCenter Connect for SAP R/3 can write/read/change data in R/3 via BAPI/RFC and IDoc interfaces. The ABAP interface of PowerCenter Connect can only read data from SAP R/3.

PowerCenter Connect for SAP R/3 provides the ability to extract SAP R/3 data into data warehouses, data integration applications, and other third-party applications. All of this is accomplished without writing complex ABAP code. PowerCenter Connect for SAP R/3 generates ABAP programs and is capable of extracting data from transparent tables, pool tables, and cluster tables.

When integrated with R/3 using ALE (Application Link Enabling), PowerCenter Connect for SAP R/3 can also extract data from R/3 using outbound IDocs (Intermediate Documents) in near real-time. The ALE concept available in R/3 Release 3.0 supports the construction and operation of distributed applications. It incorporates controlled exchange of business data messages while ensuring data consistency across loosely-coupled SAP applications. The integration of various applications is achieved by using synchronous and asynchronous communication, rather than by means of a central database.

The database server stores the physical tables in the R/3 system, while the application server stores the logical tables. A transparent table definition on the application server is represented by a single physical table on the database server. Pool and cluster tables are logical definitions on the application server that do not have a one-to-one relationship with a physical table on the database server.

Communication Interfaces

TCP/IP is the native communication interface between PowerCenter and SAP R/3. Other interfaces between the two include:

Common Program Interface-Communications (CPI-C). CPI-C communication protocol enables online data exchange and data conversion between R/3 and PowerCenter. To initialize CPI-C communication with PowerCenter, SAP R/3 requires information such as the host name of the application server and the SAP gateway. This information is stored on the PowerCenter Server in a configuration file named sideinfo. The PowerCenter Server uses parameters in the sideinfo file to execute ABAP stream mode sessions.

Remote Function Call (RFC). RFC is the remote communication protocol used by SAP and is based on RPC (Remote Procedure Call). To execute remote calls from PowerCenter, SAP R/3 requires information such as the connection type and the service name and gateway on the application server. This information is stored on the PowerCenter Client and PowerCenter Server in a configuration file named saprfc.ini. PowerCenter makes remote function calls when importing source definitions, installing ABAP programs and running ABAP file mode sessions.

Transport system. The transport system in SAP is a mechanism to transfer objects developed on one system to another system. Transport system is primarily used to migrate code and configuration from development to QA and production systems. It can be used in the following cases:

● PowerCenter Connect for SAP R/3 installation transports
● PowerCenter Connect generated ABAP programs


Note: If the ABAP programs are installed in the $TMP development class, they cannot be transported from development to production. Ensure you have a transportable development class/package for the ABAP mappings.

Security

You must have proper authorizations on the R/3 system to perform integration tasks. The R/3 administrator needs to create authorizations, profiles, and users for PowerCenter users.

Integration Feature | Authorization Object | Activity

Import Definitions, Install Programs | S_DEVELOP | All activities. Also need to set Development Object ID to PROG.
Extract Data | S_TABU_DIS | READ
Run File Mode Sessions | S_DATASET | WRITE
Submit Background Job | S_PROGRAM | BTCSUBMIT, SUBMIT
Release Background Job | S_BTCH_JOB | DELE, LIST, PLAN, SHOW. Also need to set Job Operation to RELE.
Run Stream Mode Sessions | S_CPIC | All activities
Authorize RFC privileges | S_RFC | All activities

You also need access to the SAP GUI, as described in the following SAP GUI parameters table:

Parameter | Variable | Comments

User ID | $SAP_USERID | The username that connects to the SAP GUI; must be authorized for read-only access to transactions SE12, SE15, SE16, and SPRO.
Password | $SAP_PASSWORD | The password for the above user.
System Number | $SAP_SYSTEM_NUMBER | The SAP system number.
Client Number | $SAP_CLIENT_NUMBER | The SAP client number.
Server | $SAP_SERVER | The server on which this instance of SAP is running.

Key Capabilities of PowerCenter Connect for SAP R/3


Some key capabilities of PowerCenter Connect for SAP R/3 include:

● Extract data from SAP R/3 using the ABAP, BAPI/RFC, and IDoc interfaces.
● Migrate/load data from any source into R/3 using the IDoc, BAPI/RFC, and DMI interfaces.
● Generate DMI files ready to be loaded into SAP via SXDA TOOLS, LSMW, or SAP standard delivered programs.
● Support calling BAPI and RFC functions dynamically from PowerCenter for data integration. PowerCenter Connect for SAP R/3 can make BAPI and RFC function calls dynamically from mappings to extract or load.
● Capture changes to the master and transactional data in SAP R/3 using ALE. PowerCenter Connect for SAP R/3 can receive outbound IDocs from SAP R/3 in real time and load into SAP R/3 using inbound IDocs. To receive IDocs in real time using ALE, install PowerCenter Connect for SAP R/3 on PowerCenterRT.
● Provide rapid development of the data warehouse based on R/3 data using Analytic Business Components for SAP R/3 (ABC). ABC is a set of business content that includes mappings, mapplets, source objects, targets, and transformations.
● Set partition points in a pipeline for outbound/inbound IDoc sessions; sessions that fail when reading outbound IDocs from an SAP R/3 source can be configured for recovery. You can also receive data from outbound IDoc files and write data to inbound IDoc files.
● Insert ABAP code blocks to add functionality to the ABAP program flow and use static/dynamic filters to reduce returned rows.
● Customize the ABAP program flow with joins, filters, SAP functions, and code blocks. For example: qualifying table = table1-field1 = table2-field2, where the qualifying table is the last table in the condition based on the join order, including outer joins.
● Create ABAP program variables to represent SAP R/3 structures, structure fields, or values in the ABAP program.
● Remove ABAP program information from SAP R/3 and the repository when a folder is deleted.
● Provide enhanced platform support by running on 64-bit AIX and HP-UX (Itanium). You can install PowerCenter Connect for SAP R/3 for the PowerCenter Server and Repository Server on SuSE Linux or on Red Hat Linux.

Installation and Configuration Steps

PowerCenter Connect for SAP R/3 setup programs install components for PowerCenter Server, Client, and repository server. These programs install drivers, connection files, and a repository plug-in XML file that enables integration between PowerCenter and SAP R/3. Setup programs can also install PowerCenter Connect for SAP R/3 Analytic Business Components, and PowerCenter Connect for SAP R/3 Metadata Exchange.

The PowerCenter Connect for SAP R/3 repository plug-in is called sapplg.xml. After the plug-in is installed, it needs to be registered in the PowerCenter repository.

For SAP R/3

Informatica provides a group of customized objects required for R/3 integration in the form of transport files. These objects include tables, programs, structures, and functions that PowerCenter Connect for SAP exports to data files. The R/3 system administrator must use the transport control program, tp import, to transport these object files on the R/3 system. The transport process creates a development class called ZERP. The SAPTRANS directory contains “data” and “co” files. The “data” files are the actual transport objects. The “co” files are control files containing information about the transport request.
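The exact tp syntax varies by SAP release and transport profile; as a hedged sketch only (the transport request ID, target system ID, client number, and profile path are placeholders), a command-line import typically looks like:

tp addtobuffer <transport request ID> <target SID> pf=/usr/sap/trans/bin/TP_DOMAIN_<SID>.PFL
tp import <transport request ID> <target SID> client=<client number> U16 pf=/usr/sap/trans/bin/TP_DOMAIN_<SID>.PFL

Many BASIS administrators perform the same import through transaction STMS rather than from the command line.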

The R/3 system needs development objects and user profiles established to communicate with PowerCenter. Preparing R/3 for integration involves the following tasks:

● Transport the development objects on the PowerCenter CD to R/3. PowerCenter calls these objects each time it makes a request to the R/3 system.

● Run the transport program that generates unique IDs.
● Establish profiles in the R/3 system for PowerCenter users.
● Create a development class for the ABAP programs that PowerCenter installs on the SAP R/3 system.

For PowerCenter


The PowerCenter server and client need drivers and connection files to communicate with SAP R/3. Preparing PowerCenter for integration involves the following tasks:

● Run installation programs on PowerCenter Server and Client machines.
● Configure the connection files:

❍ The sideinfo file on the PowerCenter Server allows PowerCenter to initiate CPI-C with the R/3 system. The required parameters for sideinfo are:

DEST: logical name of the R/3 system
LU: host name of the SAP application server machine
TP: set to sapdp<system number>
GWHOST: host name of the SAP gateway machine
GWSERV: set to sapgw<system number>
PROTOCOL: set to I for a TCP/IP connection

❍ The saprfc.ini file on the PowerCenter Client and Server allows PowerCenter to connect to the R/3 system as an RFC client. The required parameters for saprfc.ini are:

DEST: logical name of the R/3 system
TYPE: set to A to indicate a connection to a specific R/3 system
ASHOST: host name of the SAP R/3 application server
SYSNR: system number of the SAP R/3 application server
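A hedged sketch of the two files follows (DEST names, host name, and system number 00 are placeholders; they must match your R/3 landscape and the connections defined in the Workflow Manager):

sideinfo entry:

DEST=SAPR3DEV
LU=sapr3app01
TP=sapdp00
GWHOST=sapr3app01
GWSERV=sapgw00
PROTOCOL=I

saprfc.ini entry:

DEST=SAPR3DEV
TYPE=A
ASHOST=sapr3app01
SYSNR=00
RFC_TRACE=0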

Following is a summary of the required steps:

1. Install PowerCenter Connect for SAP R/3 on PowerCenter.
2. Configure the sideinfo file.
3. Configure the saprfc.ini file.
4. Set the RFC_INI environment variable.
5. Configure an application connection for SAP R/3 sources in the Workflow Manager.
6. Configure an SAP/ALE IDoc connection in the Workflow Manager to receive IDocs generated by the SAP R/3 system.
7. Configure the FTP connection to access staging files through FTP.
8. Install the repository plug-in in the PowerCenter repository.

Configuring the Services File

Windows

If SAPGUI is not installed, you must make entries in the Services file to run stream mode sessions. This is found in the \WINNT\SYSTEM32\drivers\etc directory. Entries should be similar to the following:

sapdp<system number> <port number of dispatcher service>/tcp

sapgw<system number> <port number of gateway service>/tcp

Note: SAPGUI is not technically required, but experience has shown that evaluators typically want to log into the R/3 system to use the ABAP workbench and to view table contents.

UNIX

The services file is located in /etc:


sapdp<system number> <port number of dispatcher service>/tcp
sapgw<system number> <port number of gateway service>/tcp

The system number and port numbers are provided by the BASIS administrator.
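For example, assuming system number 00 and the SAP default port numbering (32NN for the dispatcher, 33NN for the gateway), the entries would typically be:

sapdp00 3200/tcp
sapgw00 3300/tcp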

Configure Connections to Run Sessions

Informatica supports two methods of communication between the SAP R/3 system and the PowerCenter Server.

Streaming Mode does not create any intermediate files on the R/3 system. This method is faster, but uses more CPU cycles on the R/3 system.

File Mode creates an intermediate file on the SAP R/3 system, which is then transferred to the machine running the PowerCenter Server.

If you want to run file mode sessions, you must provide either FTP access or NFS access from the machine running the PowerCenter Server to the machine running SAP R/3. This, of course, assumes that PowerCenter and SAP R/3 are not running on the same machine; it is possible to run PowerCenter and R/3 on the same system, but highly unlikely.

If you want to use File mode sessions and your R/3 system is on a UNIX system, you need to do one of the following:

● Provide the login and password for the UNIX account used to run the SAP R/3 system.

● Provide a login and password for a UNIX account belonging to the same group as the UNIX account used to run the SAP R/3 system.

● Create a directory on the machine running SAP R/3, and run "chmod g+s" on that directory. Provide the login and password for the account used to create this directory.

Configure database connections in the Server Manager to access the SAP R/3 system when running a session, then configure an FTP connection to access staging files through FTP.

Extraction Process

R/3 source definitions can be imported from the logical tables using RFC protocol. Extracting data from R/3 is a four-step process:

Import source definitions. The PowerCenter Designer connects to the R/3 application server using RFC. The Designer calls a function in the R/3 system to import source definitions.

Note: If you plan to join two or more tables in SAP, be sure you have the optimized join conditions. Make sure you have identified your driving table (e.g., if you plan to extract data from bkpf and bseg accounting tables, be sure to drive your extracts from bkpf table). There is a significant difference in performance if the joins are properly defined.

Create a mapping. When creating a mapping using an R/3 source definition, you must use an ERP source qualifier. In the ERP source qualifier, you can customize properties of the ABAP program that the R/3 server uses to extract source data. You can also use joins, filters, ABAP program variables, ABAP code blocks, and SAP functions to customize the ABAP program.

Generate and install ABAP program. You can install two types of ABAP programs for each mapping:


File mode. Extract data to file. The PowerCenter Server accesses the file through FTP or NFS mount. This mode is used for large extracts as there are timeouts set in SAP for long running queries.

Stream Mode. Extract data to buffers. The PowerCenter Server accesses the buffers through CPI-C, the SAP protocol for program-to-program communication. This mode is preferred for short running extracts.

You can modify the ABAP program block and customize according to your requirements (e.g., if you want to get data incrementally, create a mapping variable/parameter and use it in the ABAP program).

Create Session and Run Workflow

● Stream Mode. In stream mode, the installed ABAP program creates buffers on the application server. The program extracts source data and loads it into the buffers. When a buffer fills, the program streams the data to the PowerCenter Server using CPI-C. With this method, the PowerCenter Server can process data when it is received.

● File Mode. When running a session in file mode, the session must be configured to access the file through NFS mount or FTP. When the session runs, the installed ABAP program creates a file on the application server. The program extracts source data and loads it into the file. When the file is complete, the PowerCenter Server accesses the file through FTP or NFS mount and continues processing the session.

Data Integration Using RFC/BAPI Functions

PowerCenter Connect for SAP R/3 can generate RFC/BAPI function mappings in the Designer to extract data from SAP R/3, change data in R/3, or load data into R/3. When it uses an RFC/BAPI function mapping in a workflow, the PowerCenter Server makes the RFC function calls on R/3 directly to process the R/3 data. It doesn’t have to generate and install the ABAP program for data extraction.

Data Integration Using ALE

PowerCenter Connect for SAP R/3 can integrate PowerCenter with SAP R/3 using ALE. With PowerCenter Connect for SAP R/3, PowerCenter can generate mappings in the Designer to receive outbound IDocs from SAP R/3 in real time. It can also generate mappings to send inbound IDocs to SAP for data integration. When PowerCenter uses an inbound or outbound mapping in a workflow to process data in SAP R/3 using ALE, it doesn’t have to generate and install the ABAP program for data extraction.

Analytical Business Components

Analytic Business Components for SAP R/3 (ABC) allows you to use predefined business logic to extract and transform R/3 data. It works in conjunction with PowerCenter and PowerCenter Connect for SAP R/3 to extract master data, perform lookups, provide documents, and other fact and dimension data from the following R/3 modules:

● Financial Accounting
● Controlling
● Materials Management
● Personnel Administration and Payroll Accounting
● Personnel Planning and Development
● Sales and Distribution

Refer to the ABC Guide for complete installation and configuration information.

Last updated: 01-Feb-07 18:52


Data Connectivity using PowerCenter Connect for Web Services

Challenge

Understanding PowerCenter Connect for Web Services and configuring PowerCenter to access a secure web service.

Description

PowerCenter Connect for Web Services (WebServices Consumer) allows PowerCenter to act as a web services client to consume external web services. PowerCenter Connect for Web Services uses the Simple Object Access Protocol (SOAP) to communicate with the external web service provider. An external web service can be invoked from PowerCenter in three ways:

● Web Service source
● Web Service transformation
● Web Service target

Web Service Source Usage

PowerCenter supports a request-response type of operation using Web Services source. You can use the web service as a source if the input in the SOAP request remains fairly constant since input values for a web service source can only be provided at the source transformation level.

The following steps serve as an example for invoking a temperature web service to retrieve the current temperature for a given zip code:

1. In Source Analyzer, click Import from WSDL(Consumer).
2. Specify URL http://www.xmethods.net/sd/2001/TemperatureService.wsdl and pick operation getTemp.
3. Open the Web Services Consumer Properties tab, click Populate SOAP request, and populate the desired zip code value.
4. Connect the output port of the web services source to the target.

Web Service Transformation Usage


PowerCenter also supports a request-response type of operation using Web Services transformation. You can use the web service as a transformation if your input data is available midstream and you want to capture the response values from the web service.

The following steps serve as an example for invoking a Stock Quote web service to learn the price for each of the ticker symbols available in a flat file:

1. In the Transformation Developer, create a Web Services Consumer transformation.
2. Specify URL http://services.xmethods.net/soap/urn:xmethods-delayed-quotes.wsdl and pick operation getQuote.
3. Connect the input port of this transformation to the field containing the ticker symbols.
4. To invoke the web service for each input row, change to source-based commit and set the commit interval to 1. Also change the Transaction Scope to Transaction in the Web Services Consumer transformation.

Web Service Target Usage

PowerCenter supports a one-way type of operation using Web Services target. You can use the web service as a target if you only need to send a message (i.e., and do not need a response). PowerCenter only waits for the web server to start processing the message; it does not wait for the web server to finish processing the web service operation.

The following provides an example for invoking a sendmail web service:

1. In Warehouse Designer, click Import from WSDL(Consumer).
2. Specify URL http://webservices.matlus.com/scripts/emailwebservice.dll/wsdl/IEmailService and pick operation SendMail.
3. In the mapping, connect the input ports of the web services target to the ports containing appropriate values.

PowerCenter Connect for Web Services and Web Services Provider

Informatica also offers a product called Web Services Provider that differs from PowerCenter Connect for Web Services. The advantage of this feature is that it decouples the web service that needs to be consumed from the client. Using Informatica PowerCenter as the glue, you can make changes that are transparent to the client.


This is helpful because Informatica Professional Services will most likely not have access to the client code or the Web Service.

● In Web Services Provider, PowerCenter acts as a Service Provider and exposes many key functionalities as web services.

● In PowerCenter Connect for Web Services, PowerCenter acts as a web service client and consumes external web services.

● It is not necessary to install or configure Web Services Provider in order to use PowerCenter Connect for Web Services.

● Web Services exposed through PowerCenter have two formats:

❍ Real-Time: In real-time mode, web-enabled workflows are exposed. The Web Services Provider must be used and must point to the web service that the mapping is going to consume. Workflows can be started and protected.

❍ Batch: In batch mode, a preset group of services is exposed to run and monitor workflows in your system. This is useful for reporting engines and similar clients.

Configuring PowerCenter to Invoke a Secure Web Service

Secure Sockets Layer (SSL) is used to provide such security features as authentication and encryption to web services applications. The authentication certificates follow the Public Key Infrastructure (PKI) standard, a system of digital certificates provided by certificate authorities to verify and authenticate parties of Internet communications or transactions. These certificates are managed in the following two keystore files:

● Truststore. Truststore holds the public keys for the entities it can trust. PowerCenter uses the entries in the Truststore file to authenticate the external web services servers.

● Keystore (Clientstore). Clientstore holds both the entity’s public and private keys. PowerCenter sends the entries in the Clientstore file to the web services server so that the web services server can authenticate the PowerCenter server.

By default, the keystore files jssecacerts and cacerts in the $(JAVA_HOME)/lib/security directory are used for Truststores. You can also create new keystore files and configure the TrustStore and ClientStore parameters in the PowerCenter Server setup to point to these files. Keystore files can contain multiple certificates and are managed using utilities like keytool.

SSL authentication can be performed in three ways:


● Server authentication
● Client authentication
● Mutual authentication

Server authentication:

When establishing an SSL session in server authentication, the web services server sends its certificate to PowerCenter and PowerCenter verifies whether the server certificate can be trusted. Only the truststore file needs to be configured in this case.

Assumptions:

Web Services Server certificate is stored in server.cer file

PowerCenter Server (client) public/private key pair is available in the keystore client.jks (one way to create such a keystore is sketched below)
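If the client.jks keystore does not exist yet, one hedged way to create it with keytool (the alias and passwords are placeholders; keytool prompts for the distinguished-name values) is:

keytool -genkey -alias pmclient -keyalg RSA -keystore client.jks -storepass changeit -keypass changeit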

Steps:

1. Import the server's certificate into the PowerCenter Server's truststore file. You can use either the default keystores (jssecacerts, cacerts) or create your own keystore file.

2. keytool -import -file server.cer -alias wserver -keystore trust.jks -trustcacerts -storepass changeit

3. At the prompt for trusting this certificate, type "yes".

4. Configure PowerCenter to use this truststore file. Open the PowerCenter Server setup -> JVM Options tab and, in the value for Truststore, give the full path and name of the keystore file (i.e., c:\trust.jks).
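To confirm that the certificate was imported (an optional check, not part of the documented steps), list the truststore contents:

keytool -list -v -keystore trust.jks -storepass changeit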

Client authentication:

When establishing an SSL session in client authentication, PowerCenter sends its certificate to the web services server. The web services server then verifies whether the PowerCenter Server can be trusted. In this case, you need only the clientstore file.

Steps:

1. The keystore containing the private/public key pair is called client.jks. Be sure the client private key password and the keystore password are the same (e.g., "changeit").

2. Configure PowerCenter to use this clientstore file. Open the PowerCenter Server setup -> JVM Options tab and, in the value for Clientstore, type the full path and name of the keystore file (i.e., c:\client.jks).

3. Add an additional JVM parameter in the PowerCenter Server setup with the value -Djavax.net.ssl.keyStorePassword=changeit.

Mutual authentication:

When establishing an SSL session in mutual authentication, both PowerCenter Server and the Web Services server send their certificates to each other and both verify if the other one can be trusted. You need to configure both the clientstore and the truststore files.

Steps:

1. Import the server's certificate into the PowerCenter Server's truststore file.

2. keytool -import -file server.cer -alias wserver -keystore trust.jks -trustcacerts -storepass changeit

3. Configure PowerCenter to use this truststore file. Open the PowerCenter Server setup -> JVM Options tab and, in the value for Truststore, type the full path and name of the keystore file (i.e., c:\trust.jks).

4. The keystore containing the client public/private key pair is called client.jks. Be sure the client private key password and the keystore password are the same (e.g., "changeit").

5. Configure PowerCenter to use this clientstore file. Open the PowerCenter Server setup -> JVM Options tab and, in the value for Clientstore, type the full path and name of the keystore file (i.e., c:\client.jks).

6. Add an additional JVM parameter in the PowerCenter Server setup with the value -Djavax.net.ssl.keyStorePassword=changeit.

Note: If your client private key is not already present in the keystore file, you cannot use keytool command to import it. Keytool can only generate a private key; it cannot import a private key into a keystore. In this case, use an external java utility such as utils.ImportPrivateKey(weblogic), KeystoreMove (to convert PKCS#12 format to JKS) to move it into the JKS keystore.

Converting Other Formats of Certificate Files

There are a number of formats of certificate files available: DER format (.cer and .der extensions); PEM format (.pem extension); and PKCS#12 format (.pfx or .P12 extension). You can convert from one format of certificate to another using openssl.


Refer to the openssl documentation for complete information on such conversions. A few examples are given below:

To convert from PEM to DER: assuming that you have a PEM file called server.pem

● openssl x509 -in server.pem -inform PEM -out server.der -outform DER

To convert a PKCS12 file, you must first convert to PEM, and then from PEM to DER:

Assuming that your PKCS12 file is called server.pfx, the two commands are:

● openssl pkcs12 -in server.pfx -out server.pem
● openssl x509 -in server.pem -inform PEM -out server.der -outform DER

Last updated: 01-Feb-07 18:52


Data Migration Principles

Challenge

A successful Data Migration effort is often critical to a system implementation. These implementations can be a new or upgraded ERP package, integration due to merger and acquisition activity, or the development of a new operational system. The effort and criticality of the Data Migration as part of the larger system implementation project is often overlooked, underestimated, or given a lower priority in the scope of the full implementation project. As a result, implementations are often delayed at great cost to the organization while Data Migration issues are addressed. Informatica's suite of products provides functionality and processes to minimize the cost of the migration, lower risk, and increase the probability of success (i.e., completing the project on-time and on-budget).

In this Best Practice we discuss basic principles for data migration that lower project time, reduce staff development time, lower risk, and lower the total cost of ownership of the project. These principles include:

1. Leverage Staging Strategies
2. Utilize Table Driven Approaches
3. Develop Via Modular Design
4. Focus On Re-Use
5. Common Exception Handling Processes
6. Multiple Simple Processes versus Few Complex Processes
7. Take Advantage of Metadata

Description

Leverage Staging Strategies

As discussed elsewhere in Velocity, in data migration it is recommended to employ both a legacy staging area and a pre-load staging area. The reason for this is simple: it provides the ability to pull data from the production system and use it for data cleaning and harmonization activities without interfering with the production systems. By leveraging this type of strategy you are able to see real production data sooner and follow the guiding principle of 'Convert Early, Convert Often, and with Real Production Data'.

Utilize Table Driven Approaches


Developers frequently find themselves in positions where they need to perform a large amount of cross-referencing, hard-coding of values, or other repeatable transformations during a Data Migration. These transformations are often likely to change over time. Without a table-driven approach this causes code changes, bug fixes, re-testing, and re-deployments during the development effort. Much of this work is unnecessary and can be avoided with the use of configuration or reference data tables. It is recommended to use table-driven approaches such as these whenever possible. Some common table-driven approaches include:

● Default Values – hard-coded values for a given column, stored in a table where the values could be changed whenever a requirement changes. For example, if you have a hard coded value of NA for any value not populated and then want to change that value to NV you could simply change the value in a default value table rather then change numerous hard-coded values.

● Cross-Reference Values – frequently in data migration projects there is a need to take values from the source system and convert them to the value of the target system. These values are usually identified up-front, but as the source system changes additional values are also needed. In a typical mapping development situation this would require adding additional values to a series of IIF or Decode statements. With a table driven situation, new data could be added to a cross-reference table and no coding, testing, or deployment would be required.

● Parameter Values – by using a table driven parameter file you can reduce the need for scripting and accelerate the development process.

● Code-Driven Table – in some instances a set of understood rules are known. By taking those rules and building code against them, a table-driven/code solution can be very productive. For example, if you had a rules table that was keyed by table/column/rule id, then whenever that combination was found a pre-set piece of code would be executed. If at a later date the rules change to a different set of pre-determined rules, the rule table could change for the column and no additional coding would be required.
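
To make the table-driven idea concrete, the following is a minimal SQL sketch of a default-value table and a cross-reference table. The table and column names (DM_DEFAULT_VALUE, DM_XREF, STG_CUSTOMER, and so on) are hypothetical examples, not objects shipped with any Informatica product; in a PowerCenter implementation the lookups would normally be performed with Lookup transformations against tables like these rather than with SQL joins.

-- Default values: one row per target table/column whose hard-coded value may change.
CREATE TABLE DM_DEFAULT_VALUE (
    TARGET_TABLE   VARCHAR2(50),
    TARGET_COLUMN  VARCHAR2(50),
    DEFAULT_VALUE  VARCHAR2(50),
    CONSTRAINT PK_DM_DEFAULT_VALUE PRIMARY KEY (TARGET_TABLE, TARGET_COLUMN)
);

-- Changing NA to NV later is then a data change, not a code change.
INSERT INTO DM_DEFAULT_VALUE VALUES ('CUSTOMER', 'REGION_CODE', 'NA');

-- Cross-reference values: map source-system codes to target-system codes.
CREATE TABLE DM_XREF (
    SOURCE_SYSTEM  VARCHAR2(20),
    DOMAIN_NAME    VARCHAR2(30),
    SOURCE_VALUE   VARCHAR2(50),
    TARGET_VALUE   VARCHAR2(50),
    CONSTRAINT PK_DM_XREF PRIMARY KEY (SOURCE_SYSTEM, DOMAIN_NAME, SOURCE_VALUE)
);

INSERT INTO DM_XREF VALUES ('LEGACY_A', 'COUNTRY', 'USA',  'US');
INSERT INTO DM_XREF VALUES ('LEGACY_A', 'COUNTRY', 'U.S.', 'US');

-- Resolving a code at load time: fall back to the default when no mapping exists.
SELECT s.customer_id,
       COALESCE(x.TARGET_VALUE,
                (SELECT d.DEFAULT_VALUE
                   FROM DM_DEFAULT_VALUE d
                  WHERE d.TARGET_TABLE  = 'CUSTOMER'
                    AND d.TARGET_COLUMN = 'REGION_CODE')) AS country_code
  FROM STG_CUSTOMER s
  LEFT JOIN DM_XREF x
    ON x.SOURCE_SYSTEM = 'LEGACY_A'
   AND x.DOMAIN_NAME   = 'COUNTRY'
   AND x.SOURCE_VALUE  = s.country_code;

When a new source code appears, only a row in DM_XREF changes; no mapping logic is touched, re-tested, or re-deployed.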

Develop Via Modular Design

As part of the migration methodology, modular design is encouraged. Modular design is the practice of defining a standard way for similar mappings to function. These standards are then published as templates, and developers are required to build similar mappings in that same manner. This provides rapid development, increases efficiency in testing, and increases ease of maintenance, resulting in a dramatically lower total cost of ownership.


Focus On Re-Use

Re-use should always be considered during Informatica development. However, because data migration projects involve such a high degree of repeatability, re-use is paramount to their success. There is often tremendous opportunity to re-use mappings, strategies, processes, scripts, and testing documents, which reduces staff time for migration projects and lowers project costs.

Common Exception Handling Processes

The Velocity Data Migration Methodology is iterative: new data quality rules are added as problems are found in the data. It is therefore critical to find data exceptions and write appropriate rules to correct them throughout the data migration effort. It is highly recommended to build a common method for capturing and recording these exceptions, and to deploy that common method for all data migration processes.

Multiple Simple Processes versus Few Complex Processes

For data migration projects it is possible to build one process to pull all data for a given entity from all systems to the target system. While this may seem ideal, these types of complex processes take much longer to design and develop, are challenging to test, and are very difficult to maintain over time. Due to these drawbacks, it is recommended to develop as many simple processes as needed to complete the effort rather than a few complex processes.

Take Advantage of Metadata

The Informatica data integration platform is highly metadata driven. Take advantage of those capabilities on data migration projects. This can be done via a host of reports against the data integration repository such as:

1. Illustrate how the data is being transformed (i.e., lineage reports)
2. Illustrate who has access to what data (i.e., security group reports)
3. Illustrate what source or target objects exist in the repository
4. Identify how many mappings each developer has created
5. Identify how many sessions each developer has run during a given time period
6. Identify how many successful/failed sessions have been executed

In summary, these design principles provide significant benefits to data migration projects and add to the large set of typical best practice items that are available in Velocity. The key to Data Migration projects is to architect well, design better, and execute best.

Last updated: 01-Feb-07 18:52


Data Migration Project Challenges

Challenge

A successful Data Migration effort is often critical to a system implementation. These implementations can be a new or upgraded ERP package, integration due to merger and acquisition activity, or the development of a new operational system. The effort and criticality of the Data Migration within the larger system implementation project is often overlooked, underestimated, or given a lower priority in the scope of the full implementation project. As a result, implementations are often delayed at great cost to the organization while Data Migration issues are addressed. Informatica’s suite of products provides functionality and processes that minimize the cost of the migration, lower risk, and increase the probability of success (i.e., completing the project on time and on budget).

This Best Practice discusses the three main data migration project challenges:

1. Specifications that are incomplete, inaccurate, or not completed on time.
2. Data quality problems impacting project timelines.
3. Difficulties in project management when executing the data migration project.

Description

Unlike other Velocity Best Practices, this one does not specify the full solution to each challenge. Rather, it is more important to understand these three challenges and take action to address them throughout the implementation.

Migration Specifications

A challenge that data migration projects consistently encounter is problems with migration specifications. Projects require completed functional specifications that identify what is required of each migration interface.

Definitions:

● A migration interface is defined as one or more mappings, sessions, workflows, or scripts used to migrate a data entity from one source system to one target system.

● A Functional Requirements Specification is normally comprised of a document covering details including security, database join needs, audit needs, and primary contact details. These details are normally at the interface level rather than at the column level. It also includes a Target-Source Matrix, which identifies details at the column level such as how source tables/columns map to target tables/columns, business rules, data cleansing rules, validation rules, and other column-level specifics.

Many projects attempt to complete these migrations without these types of specifications. Such projects often have little to no chance of completing on time or on budget. Time and subject matter expertise are needed to complete this analysis; it is the baseline for project success.

Projects are disadvantaged when functional specifications are not completed on-time. Developers can often be in a wait mode for extended periods of time when these specs are not completed at the time specified by the project plan.

Another project risk occurs when the wrong individuals write these specs, or when an inappropriately low level of importance is applied to the exercise. These situations cause inaccurate or incomplete specifications, which prevent data integration developers from successfully building the migration processes.

To address the spec challenge for migration projects, projects must have specifications that are completed with accuracy and delivered on time.

Data Quality

Most projects are affected by data quality due to the need to address problems in the source data that fit into the six dimensions of data quality (simple SQL profiling queries for two of these dimensions are sketched after the table):

Data Quality Dimension – Description
Completeness – What data is missing or unusable?
Conformity – What data is stored in a non-standard format?
Consistency – What data values give conflicting information?
Accuracy – What data is incorrect or out of date?
Duplicates – What data records or attributes are repeated?
Integrity – What data is missing or not referenced?
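
As a simple illustration of how two of these dimensions can be measured, the sketch below shows plain SQL checks for completeness and duplicates against a staging table. The table and column names (STG_CUSTOMER, CUSTOMER_NAME, POSTAL_CODE) are hypothetical, and Informatica Data Quality and Data Explorer provide this kind of measurement without hand-written queries; treat this only as a conceptual sketch.

-- Completeness: what percentage of rows are missing a postal code?
SELECT COUNT(*)                                             AS total_rows,
       SUM(CASE WHEN POSTAL_CODE IS NULL THEN 1 ELSE 0 END) AS missing_postal_code,
       ROUND(100 * SUM(CASE WHEN POSTAL_CODE IS NULL THEN 1 ELSE 0 END) / COUNT(*), 2)
                                                            AS pct_missing
  FROM STG_CUSTOMER;

-- Duplicates: which name/postal-code combinations appear more than once?
SELECT CUSTOMER_NAME, POSTAL_CODE, COUNT(*) AS occurrences
  FROM STG_CUSTOMER
 GROUP BY CUSTOMER_NAME, POSTAL_CODE
HAVING COUNT(*) > 1
 ORDER BY occurrences DESC;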

Data migration data quality problems are typically worse than planned for. Projects need to allow enough time to identify and fix data quality problems BEFORE loading the data into the new target system.

Informatica’s data integration platform provides data quality capabilities that can help to identify data quality problems in an efficient manner, but subject-matter experts are required to determine how these data problems should be addressed within the business context and process.

Project Management

Project managers are often disadvantaged on these types of projects, which are typically much larger, more expensive, and more complex than any prior project they have been involved with. They need to understand early in the project the importance of correctly completed specs and of addressing data quality, and to establish a set of tools to plan the project accurately and objectively, with the ability to evaluate progress.

Informatica’s Velocity Migration Methodology, its tool sets, and the metadata reporting capabilities are key to addressing these project challenges.

The key is to understand the pitfalls early in the project, how PowerCenter and Informatica Data Quality can address these challenges, and how metadata reporting can provide objective information about project status.

In summary, data migration projects are challenged by specification issues, data quality issues, and project management difficulties. By understanding the Velocity Methodology’s focus on data migration and how Informatica’s products can address these challenges, their impact can be minimized.

Last updated: 01-Feb-07 18:52


Data Migration Velocity Approach

Challenge

A successful Data Migration effort is often critical to a system implementation. These implementations can be a new or upgraded ERP package, integration due to merger and acquisition activity, or the development of a new operational system. The effort and criticality of the Data Migration within the larger system implementation project is often overlooked, underestimated, or given a lower priority in the scope of the full implementation project. As a result, implementations are often delayed at great cost to the organization while Data Migration issues are addressed. Informatica’s suite of products provides functionality and processes that minimize the cost of the migration, lower risk, and increase the probability of success (i.e., completing the project on time and on budget).

To meet these objectives, Velocity provides a set of best practices focused on Data Migration. This Best Practice provides an overview of how to use Informatica’s products in an iterative methodology to expedite a data migration project. The keys to the methodology are further discussed in the Best Practice Data Migration Principles.

Description

The Velocity approach to data migration is organized into the phases described below. While it is possible to migrate data in one step, it is more productive to break these processes up into two or three simpler steps. The goal for data migration is to get the data into the target application as early as possible for large-scale implementations. Typical implementations have three to four trial cutovers, or mock runs, before the final implementation at ‘Go-Live’. The mantra for the Informatica-based migration is to ‘Convert Early, Convert Often, and Convert with Real Production Data.’ To do this, the following approach is encouraged:

Analysis

In the analysis phase, the functional specifications are completed; these include both the functional specs and the target-source matrix.

See the Best Practice Data Migration Project Challenges for related information.

Acquire


In the acquire phase, the target-source matrix is reviewed and all source systems/tables are identified. These tables are used to develop one mapping per source table to populate a mirrored structure in a legacy staging schema. For example, if 50 source tables were identified across all the Target-Source Matrix documents, 50 legacy staging tables would be created and 50 mappings would be developed, one for each table.

It is recommended to perform the initial development against test data but, once complete, to run a single extract of the current production data. This assists in addressing data quality problems without impacting production systems. It is recommended to run these extracts in low-use time periods and with the cooperation of the operations group responsible for these systems.
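
A minimal SQL sketch of the mirroring idea is shown below, assuming a source table SRC.CUSTOMER and a legacy staging schema named LGCY; the schema and table names are illustrative only, and in practice the extract itself would normally be a simple one-to-one PowerCenter mapping rather than a database-to-database insert.

-- Create a legacy staging table that mirrors the source table structure
-- (no rows are copied because the predicate is always false).
CREATE TABLE LGCY.CUSTOMER AS
SELECT * FROM SRC.CUSTOMER WHERE 1 = 0;

-- The one-time production extract is then a straight, untransformed copy.
INSERT INTO LGCY.CUSTOMER
SELECT * FROM SRC.CUSTOMER;

Keeping the staging structure identical to the source keeps these mappings trivial, which is what makes them good candidates for auto-generation.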

It is also recommended to take advantage of the Visio Generation Option if available. These mappings are very straightforward, and the use of autogeneration can increase consistency and lower the required staff time for the project.

Convert

In this phase, data is extracted from the legacy staging tables (merged, transformed, and cleansed) to populate a mirror of the target application. As part of this process, a standard exception process should be developed to detect exceptions and expedite data cleansing activities. The results of this convert process should be profiled, and the appropriate data quality scorecards should be reviewed.

During the convert phase the basic set of exception tests should be executed, with exception details collected for future reporting and correction. The basic exception tests include (a SQL sketch of two of these checks follows the table below):

1. Data Type
2. Data Size
3. Data Length
4. Valid Values
5. Range of Values

Exception Type – Description
Data Type – Will the source data value load correctly to the target data type, such as a numeric date loading into an Oracle date type?
Data Size – Will a numeric value from the source load correctly to the target column, or will a numeric overflow occur?
Data Length – Is the input value too long for the target column? (This applies to all data types but is of particular interest for string data types. For example, in one system a field could be char(256) but most of the values are char(10). In the target the new field is varchar(20), so any value over char(20) should raise an exception.)
Range of Values – Is the input value within a tolerable range for the new system? (For example, does the birth date for an insurance subscriber fall between Jan 1, 1900 and Jan 1, 2006? If this test fails, the date is unreasonable and should be addressed.)
Valid Values – Is the input value in the list of tolerated values in the target system? (For example, does the state code for an input record match the list of states in the new target system? If not, the data should be corrected prior to entry to the new system.)
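
The following is a hedged SQL sketch of how two of these tests (Valid Values and Range of Values) might populate a common exception table during the convert phase. The table and column names (DM_EXCEPTION, STG_SUBSCRIBER, STATE_REF, and so on) are hypothetical; in a PowerCenter implementation the same logic would typically live in Lookup, Expression, and Router transformations feeding a shared error target.

-- Common exception table shared by all convert processes.
CREATE TABLE DM_EXCEPTION (
    EXCEPTION_TYPE   VARCHAR2(30),    -- e.g. VALID VALUES, RANGE OF VALUES
    SOURCE_TABLE     VARCHAR2(50),
    SOURCE_KEY       VARCHAR2(50),
    COLUMN_NAME      VARCHAR2(50),
    COLUMN_VALUE     VARCHAR2(255),
    LOAD_TIMESTAMP   TIMESTAMP
);

-- Valid Values: state codes that do not exist in the target reference list.
INSERT INTO DM_EXCEPTION
SELECT 'VALID VALUES', 'STG_SUBSCRIBER', s.SUBSCRIBER_ID, 'STATE_CODE',
       s.STATE_CODE, CURRENT_TIMESTAMP
  FROM STG_SUBSCRIBER s
 WHERE NOT EXISTS (SELECT 1 FROM STATE_REF r WHERE r.STATE_CODE = s.STATE_CODE);

-- Range of Values: birth dates outside the tolerated range.
INSERT INTO DM_EXCEPTION
SELECT 'RANGE OF VALUES', 'STG_SUBSCRIBER', s.SUBSCRIBER_ID, 'BIRTH_DATE',
       TO_CHAR(s.BIRTH_DATE, 'YYYY-MM-DD'), CURRENT_TIMESTAMP
  FROM STG_SUBSCRIBER s
 WHERE s.BIRTH_DATE NOT BETWEEN DATE '1900-01-01' AND DATE '2006-01-01';

Because every test writes to the same structure, exception reporting and correction can be built once and reused by all migration interfaces.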

Once the profiling exercises, exception reports, and data quality scorecards are complete, a list of data quality issues should be created.

This list should then be reviewed with the functional business owners to generate new data quality rules to correct the data. These details should be added to the spec, and the original convert process should be modified with the new data quality rules.

The convert process, along with the profiling, exception reporting, and data scorecarding, should then be re-executed until the data is correct and ready to load to the target application.

Migrate

In the migrate phase the data from the convert phase should be loaded to the target application.

The expectation is that there should be no failures on these loads. The data should be corrected in the convert phase prior to loading the target application.

Once the migrate phase is complete, validation should occur. It is recommended to complete an audit/balancing step prior to validation. This is discussed in the Best Practice Build Data Audit/Balancing Processes.

Additional detail about these steps is provided in the Best Practice Data Migration Principles.

Last updated: 06-Feb-07 12:08


Build Data Audit/Balancing Processes

Challenge

Data Migration and Data Integration projects are often challenged to verify that the data in an application is complete; more specifically, to verify that all the appropriate data was extracted from a source system and propagated to its final target. This Best Practice illustrates how to do this in an efficient and repeatable fashion for increased productivity and reliability. This is particularly important in businesses that are highly regulated internally and externally, or that have to comply with a host of government compliance regulations such as Sarbanes-Oxley, Basel II, HIPAA, the Patriot Act, and many others.

Description

The common practice for audit and balancing solutions is to produce a set of common tables that can hold various control metrics regarding the data integration process. Ultimately, business intelligence reports provide insight at a glance to verify that the correct data has been pulled from the source and completely loaded to the target. Each control measure that is being tracked will require development of a corresponding PowerCenter process to load the metrics to the Audit/Balancing Detail table.

To implement this type of solution, execute the following tasks:

1. Work with business users to identify what audit/balancing processes are needed. Some examples may be:
   a. Customers (Number of Customers or Number of Customers by Country)
   b. Orders (Qty of Units Sold or Net Sales Amount)
   c. Deliveries (Number of Shipments, Qty of Units Shipped, or Value of All Shipments)
   d. Accounts Receivable (Number of Accounts Receivable Shipments or Total Accounts Receivable Outstanding)
2. For each process defined in step 1, define which columns should be used for tracking purposes for both the source and target systems.
3. Develop a data integration process that reads from the source system and populates the detail audit/balancing table with the control totals.
4. Develop a data integration process that reads from the target system and populates the detail audit/balancing table with the control totals.
5. Develop a reporting mechanism that queries the audit/balancing table and identifies whether the source and target entries match or there is a discrepancy (see the reconciliation query sketch after the table definitions below).

An example audit/balancing table definition looks like this:

Audit/Balancing Details

Column Name         Data Type      Size
AUDIT_KEY           NUMBER         10
CONTROL_AREA        VARCHAR2       50
CONTROL_SUB_AREA    VARCHAR2       50
CONTROL_COUNT_1     NUMBER         10
CONTROL_COUNT_2     NUMBER         10
CONTROL_COUNT_3     NUMBER         10
CONTROL_COUNT_4     NUMBER         10
CONTROL_COUNT_5     NUMBER         10
CONTROL_SUM_1       NUMBER (p,s)   10,2
CONTROL_SUM_2       NUMBER (p,s)   10,2
CONTROL_SUM_3       NUMBER (p,s)   10,2
CONTROL_SUM_4       NUMBER (p,s)   10,2
CONTROL_SUM_5       NUMBER (p,s)   10,2
UPDATE_TIMESTAMP    TIMESTAMP
UPDATE_PROCESS      VARCHAR2       50

Control Column Definition by Control Area/Control Sub Area

Column Name         Data Type      Size
CONTROL_AREA        VARCHAR2       50
CONTROL_SUB_AREA    VARCHAR2       50
CONTROL_COUNT_1     VARCHAR2       50
CONTROL_COUNT_2     VARCHAR2       50
CONTROL_COUNT_3     VARCHAR2       50
CONTROL_COUNT_4     VARCHAR2       50
CONTROL_COUNT_5     VARCHAR2       50
CONTROL_SUM_1       VARCHAR2       50
CONTROL_SUM_2       VARCHAR2       50
CONTROL_SUM_3       VARCHAR2       50
CONTROL_SUM_4       VARCHAR2       50
CONTROL_SUM_5       VARCHAR2       50
UPDATE_TIMESTAMP    TIMESTAMP
UPDATE_PROCESS      VARCHAR2       50
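
The reconciliation report in step 5 can be produced with a query along the lines of the sketch below. It assumes one possible convention, not a prescribed one: the source and target control rows share the same CONTROL_AREA and are distinguished by a CONTROL_SUB_AREA value of 'SOURCE' or 'TARGET', and the physical table is named AUDIT_BALANCING_DETAIL; adjust the names and the convention to match your own implementation.

-- Compare the first count and sum control columns for each control area.
SELECT src.CONTROL_AREA,
       src.CONTROL_COUNT_1                        AS source_count,
       tgt.CONTROL_COUNT_1                        AS target_count,
       src.CONTROL_COUNT_1 - tgt.CONTROL_COUNT_1  AS count_diff,
       src.CONTROL_SUM_1                          AS source_amount,
       tgt.CONTROL_SUM_1                          AS target_amount,
       src.CONTROL_SUM_1 - tgt.CONTROL_SUM_1      AS amount_diff
  FROM AUDIT_BALANCING_DETAIL src
  JOIN AUDIT_BALANCING_DETAIL tgt
    ON tgt.CONTROL_AREA = src.CONTROL_AREA
 WHERE src.CONTROL_SUB_AREA = 'SOURCE'
   AND tgt.CONTROL_SUB_AREA = 'TARGET';

Any row where count_diff or amount_diff is non-zero indicates a discrepancy that should be investigated before the load is accepted.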

The following is a screenshot of a single mapping that populates both the source and target control values:

The following two screenshots show how two mappings could be used to provide the same results:


Note: One key challenge is how to capture the appropriate control values from the source system if it is continually being updated. The first example with one mapping will not work due to the changes that occur in the time between the extraction of the data from the source and the completion of the load to the target application. In those cases you may want to take advantage of an aggregator transformation to collect the appropriate control totals as illustrated in this screenshot:

The following are two straw-man examples of an Audit/Balancing Report, which is the end result of this type of process:

Data area    Leg count   TT count   Diff   Leg amt     TT amt      Diff
Customer     11000       10099      1      0
Orders       9827        9827       0      11230.21    11230.21    0
Deliveries   1298        1288       0      21294.22    21011.21    283.01

In summary, there are two big challenges in building audit/balancing processes:


1. Identifying what the control totals should be
2. Building processes that will collect the correct information at the correct granularity

There is also a set of basic tasks that can be leveraged and shared across any audit/balancing needs. By building a common model for meeting audit/balancing needs, projects can lower the time needed to develop these solutions and still reduce risk by having this type of solution in place.

Last updated: 06-Feb-07 12:37


Data Cleansing

Challenge

Poor data quality is one of the biggest obstacles to the success of many data integration projects. A 2005 study by the Gartner Group stated that the majority of currently planned data warehouse projects will suffer limited acceptance or fail outright. Gartner declared that the main cause of project problems was a lack of attention to data quality.

Moreover, once in the system, poor data quality can cost organizations vast sums in lost revenues. Defective data leads to breakdowns in the supply chain, poor business decisions, and inferior customer relationship management. It is essential that data quality issues are tackled during any large-scale data project to enable project success and future organizational success.

Therefore, the challenge is twofold: to cleanse project data, so that the project succeeds, and to ensure that all data entering the organizational data stores provides for consistent and reliable decision-making.

Description

A significant portion of time in the project development process should be dedicated to data quality, including the implementation of data cleansing processes. In a production environment, data quality reports should be generated after each data warehouse implementation or when new source systems are integrated into the environment. There should also be provision for rolling back if data quality testing indicates that the data is unacceptable.

Informatica offers two application suites for tackling data quality issues: Informatica Data Explorer (IDE) and Informatica Data Quality (IDQ). IDE focuses on data profiling, and its results can feed into the data integration process. However, its unique strength is its metadata profiling and discovery capability. IDQ has been developed as a data analysis, cleansing, correction, and de-duplication tool, one that provides a complete solution for identifying and resolving all types of data quality problems and preparing data for the consolidation and load processes.

Concepts

Following are some key concepts in the field of data quality. These data quality concepts provide a foundation that helps to develop a clear picture of the subject data, which can improve both efficiency and effectiveness. The list of concepts can be read as a process, leading from profiling and analysis to consolidation.

Profiling and Analysis - whereas data profiling and data analysis are often synonymous terms, in Informatica terminology these tasks are assigned to IDE and IDQ respectively. Thus, profiling is primarily concerned with metadata discovery and definition, and IDE is ideally suited to these tasks. IDQ can discover data quality issues at a record and field level, and Velocity best practices recommends the use of IDQ for such purposes.

Note: The remaining items in this document therefore focus on the context of IDQ usage.


Parsing - the process of extracting individual elements within the records, files, or data entry forms in order to check the structure and content of each field and to create discrete fields devoted to specific information types. Examples may include: name, title, company name, phone number, and SSN.

Cleansing and Standardization - refers to arranging information in a consistent manner or preferred format. Examples include the removal of dashes from phone numbers or SSNs. For more information, see the Best Practice Effective Data Standardizing Techniques.

Enhancement - refers to adding useful, but optional, information to existing data or complete data. Examples may include: sales volume, number of employees for a given business, and zip+4 codes.

Validation - the process of correcting data using algorithmic components and secondary reference data sources, to check and validate information. Example: validating addresses with postal directories.

Matching and de-duplication - refers to removing, or flagging for removal, redundant or poor-quality records where high-quality records of the same information exist. Use matching components and business rules to identify records that may refer, for example, to the same customer. For more information, see the Best Practice Effective Data Matching Techniques.

Consolidation - using the data sets defined during the matching process to combine all cleansed or approved data into a single, consolidated view. Examples are building best record, master record, or house-holding.

Informatica Applications

The Informatica Data Quality software suite has been developed to resolve a wide range of data quality issues, including data cleansing. The suite comprises the following elements:

● IDQ Workbench - a stand-alone desktop tool that provides a complete set of data quality functionality on a single computer (Windows only).


● IDQ Server - a set of processes that enables the deployment and management of data quality procedures and resources across a network of any size through TCP/IP.

● IDQ Integration - a plug-in component that integrates Workbench with PowerCenter, enabling PowerCenter users to embed data quality procedures defined in IDQ in their mappings.

● IDQ stores all its processes as XML in the Data Quality Repository (MySQL). IDQ Server enables the creation and management of multiple repositories.

Using IDQ in Data Projects

IDQ can be used effectively alongside PowerCenter in data projects, to run data quality procedures in its own applications or to provide them for addition to PowerCenter transformations.

Through its Workbench user-interface tool, IDQ tackles data quality in a modular fashion. That is, Workbench enables you to build discrete procedures (called plans in Workbench) which contain data input components, output components, and operational components. Plans can perform analysis, parsing, standardization, enhancement, validation, matching, and consolidation operations on the specified data. Plans are saved into projects that can provide a structure and sequence to your data quality endeavors.

The following figure illustrates how data quality processes can function in a project setting:


In stage 1, you analyze the quality of the project data according to several metrics, in consultation with the business or project sponsor. This stage is performed in Workbench, which enables the creation of versatile and easy to use dashboards to communicate data quality metrics to all interested parties.

In stage 2, you verify the target levels of quality for the business according to the data quality measurements taken in stage 1, and in accordance with project resourcing and scheduling.

In stage 3, you use Workbench to design the data quality plans and projects to achieve the targets. Capturing business rules and testing the plans are also covered in this stage.

In stage 4, you deploy the data quality plans. If you are using IDQ Workbench and Server, you can deploy plans and resources to remote repositories and file systems through the user interface. If you are running Workbench alone on remote computers, you can export your plans as XML. Stage 4 is the phase in which data cleansing and other data quality tasks are performed on the project data.

In stage 5, you test and measure the results of the plans and compare them to the initial data quality assessment to verify that targets have been met. If targets have not been met, this information feeds into another iteration of data quality operations in which the plans are tuned and optimized.

In a large data project, you may find that data quality processes of varying sizes and impact are necessary at many points in the project plan. At a high level, stages 1 and 2 ideally occur very early in the project, at a point defined as the Manage Phase within Velocity. Stages 3 and 4 typically occur during the Design Phase of Velocity. Stage 5 can occur during the Design and/or Build Phase of Velocity, depending on the level of unit testing required.

Using the IDQ Integration


Data Quality Integration is a plug-in component that enables PowerCenter to connect to the Data Quality repository and import data quality plans to a PowerCenter transformation. With the Integration component, you can apply IDQ plans to your data without necessarily interacting with or being aware of IDQ Workbench or Server.

The Integration interacts with PowerCenter in two ways:

● On the PowerCenter client side, it enables you to browse the Data Quality repository and add data quality plans to custom transformations. The data quality plans’ functional details are saved as XML in the PowerCenter repository.

● On the PowerCenter server side, it enables the PowerCenter Server (or Integration service) to send data quality plan XML to the Data Quality engine for execution.

The Integration requires that at least the following IDQ components are available to PowerCenter:

● Client side: PowerCenter needs to access a Data Quality repository from which to import plans.
● Server side: PowerCenter needs an instance of the Data Quality engine to execute the plan instructions.

An IDQ-trained consultant can build the data quality plans, or you can use the pre-built plans provided by Informatica. Currently, Informatica provides a set of plans dedicated to cleansing and de-duplicating North American name and postal address records.

The Integration component enables the following process:

● Data quality plans are built in Data Quality Workbench and saved from there to the Data Quality repository.

● The PowerCenter Designer user opens a Data Quality Integration transformation and configures it to read from the Data Quality repository. Next, the user selects a plan from the Data Quality repository and adds it to the transformation.

● The PowerCenter Designer user saves the transformation and the mapping containing it to the PowerCenter repository. The plan information is saved with the transformation as XML.

The PowerCenter Integration service can then run a workflow containing the saved mapping. The relevant source data and plan information will be sent to the Data Quality engine, which processes the data (in conjunction with any reference data files used by the plan) and returns the results to PowerCenter.

Last updated: 06-Feb-07 12:43


Data Profiling

Challenge

Data profiling is an option in PowerCenter version 7.0 and later that leverages existing PowerCenter functionality and a data profiling GUI front-end to provide a wizard-driven approach to creating data profiling mappings, sessions, and workflows. This Best Practice is intended to provide an introduction to its usage for new users.

Bear in mind that Informatica’s Data Quality (IDQ) applications also provide data profiling capabilities. Consult the following Velocity Best Practice documents for more information:

● Data Cleansing

● Using Data Explorer for Data Discovery and Analysis

Description

Creating a Custom or Auto Profile

The data profiling option provides visibility into the data contained in source systems and enables users to measure changes in the source data over time. This information can help to improve the quality of the source data.

An auto profile is particularly valuable when you are data profiling a source for the first time, since auto profiling offers a good overall perspective of a source. It provides a row count, candidate key evaluation, and redundancy evaluation at the source level, and domain inference, distinct value and null value count, and min, max, and average (if numeric) at the column level. Creating and running an auto profile is quick and helps to gain a reasonably thorough understanding of a source in a short amount of time.

A custom data profile is useful when there is a specific question about a source, for example when you have a business rule that you want to validate or when you want to test whether data matches a particular pattern.

Setting Up the Profile Wizard

To customize the profile wizard for your preferences:

● Open the Profile Manager and choose Tools > Options.
● If you are profiling data using a database user that is not the owner of the tables to be sourced, check the “Use source owner name during profile mapping generation” option.
● If you are in the analysis phase of your project, choose “Always run profile interactively” since most of your data-profiling tasks will be interactive. (In later phases of the project, uncheck this option because more permanent data profiles are useful in these phases.)

Running and Monitoring Profiles

Profiles are run in one of two modes: interactive or batch. Choose the appropriate mode by checking or unchecking “Configure Session” on the "Function-Level Operations” tab of the wizard.

● Use Interactive to create quick, single-use data profiles. The sessions are created with default configuration parameters.

● For data-profiling tasks that are likely to be reused on a regular basis, create the sessions manually in Workflow Manager and configure and schedule them appropriately.

Generating and Viewing Profile Reports

Use Profile Manager to view profile reports. Right-click on a profile and choose View Report.


For greater flexibility, you can also use Data Analyzer to view reports. Each PowerCenter client includes a Data Analyzer schema and reports xml file. The xml files are located in the \Extensions\DataProfile\IPAReports subdirectory of the client installation.

You can create additional metrics, attributes, and reports in Data Analyzer to meet specific business requirements. You can also schedule Data Analyzer reports and alerts to send notifications in cases where data does not meet preset quality limits.

Sampling Techniques

Four types of sampling techniques are available with the PowerCenter data profiling option:

Technique – Description – Usage
No sampling – Uses all source data – Relatively small data sources
Automatic random sampling – PowerCenter determines the appropriate percentage to sample, then samples random rows – Larger data sources where you want a statistically significant data analysis
Manual random sampling – PowerCenter samples random rows of the source data based on a user-specified percentage – Samples more or fewer rows than the automatic option chooses
Sample first N rows – Samples the number of user-selected rows – Provides a quick readout of a source (e.g., first 200 rows)

Profile Warehouse Administration

Updating Data Profiling Repository Statistics

The Data Profiling repository contains nearly 30 tables with more than 80 indexes. To ensure that queries run optimally, be sure to keep database statistics up to date. Run the query below as appropriate for your database type, then capture the script that is generated and run it.

ORACLE

select 'analyze table ' || table_name || ' compute statistics;' from user_tables where table_name like 'PMDP%';

select 'analyze index ' || index_name || ' compute statistics;' from user_indexes where index_name like 'DP%';

Microsoft SQL Server

select 'update statistics ' + name from sysobjects where name like 'PMDP%'

SYBASE

select 'update statistics ' + name from sysobjects where name like 'PMDP%'

INFORMIX

select 'update statistics low for table ', tabname, ' ; ' from systables where tabname like 'PMDP%'

IBM DB2


select 'runstats on table ' || rtrim(tabschema) || '. ' || tabname || ' and indexes all; ' from syscat.tables where tabname like 'PMDP%'

TERADATA

select 'collect statistics on ', tablename, ' index ', indexname from dbc.indices where tablename like 'PMDP%' and databasename = 'database_name'

where database_name is the name of the repository database.

Purging Old Data Profiles

Use the Profile Manager to purge old profile data from the Profile Warehouse. Choose Target Warehouse>Connect and connect to the profiling warehouse. Choose Target Warehouse>Purge to open the purging tool.

Last updated: 01-Feb-07 18:52


Data Quality Mapping Rules

Challenge

Use PowerCenter to create data quality mapping rules to enhance the usability of the data in your system.

Description

The issue of poor data quality is one that frequently hinders the success of data integration projects. It can produce inconsistent or faulty results and ruin the credibility of the system with the business users.

This Best Practice focuses on techniques for use with PowerCenter and third-party or add-on software. Comments that are specific to the use of PowerCenter are enclosed in brackets.

Bear in mind that you can augment or supplant the data quality handling capabilities of PowerCenter with Informatica Data Quality (IDQ), the Informatica application suite dedicated to data quality issues. Data analysis and data enhancement processes, or plans, defined in IDQ can deliver significant data quality improvements to your project data. A data project that has built-in data quality steps, such as those described in the Analyze and Design phases of Velocity, enjoys a significant advantage over a project that has not audited and resolved issues of poor data quality. If you have added these data quality steps to your project, you are likely to avoid the issues described below.

A description of the range of IDQ capabilities is beyond the scope of this document. For a summary of Informatica’s data quality methodology, as embodied in IDQ, consult the Best Practice Data Cleansing.

Common Questions to Consider

Data integration/warehousing projects often encounter general data problems that may not merit a full-blown data quality project, but which nonetheless must be addressed. This document discusses some methods to ensure a base level of data quality; much of the content discusses specific strategies to use with PowerCenter.

The quality of data is important in all types of projects, whether it be data warehousing, data synchronization, or data migration. Certain questions need to be considered for all of these projects, with the answers driven by the project’s requirements and the business users that are being serviced. Ideally, these questions should be addressed during the Design and Analyze Phases of the project because they can require a significant amount of re-coding if identified later.

Some of the areas to consider are:

Text Formatting

The most common hurdle here is capitalization and trimming of spaces. Often, users want to see data in its “raw” format without any capitalization, trimming, or formatting applied to it. This is easily achievable as it is the default behavior, but there is danger in taking this requirement literally since it can lead to duplicate records when some of these fields are used to identify uniqueness and the system is combining data from various source systems.

One solution to this issue is to create additional fields that act as a unique key to a given table, but which are formatted in a standard way. Since the “raw” data is stored in the table, users can still see it in this format, but the additional columns mitigate the risk of duplication.
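
A simple way to implement this is to carry both the raw column and a standardized companion column that is used only for matching and uniqueness checks. The SQL sketch below illustrates the idea with hypothetical table and column names; in PowerCenter the same result is usually produced in an Expression transformation using UPPER with LTRIM/RTRIM.

-- Keep the raw value for display, and a trimmed, upper-cased copy for matching.
SELECT customer_name                        AS customer_name_raw,
       UPPER(RTRIM(LTRIM(customer_name)))   AS customer_name_match_key
  FROM stg_customer;

Users still see the untouched value, while joins and duplicate checks are performed against the standardized key.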

Another possibility is to explain to the users that “raw” data in unique, identifying fields is not as clean and consistent as data in a common format. In other words, push back on this requirement.

This issue can be particularly troublesome in data migration projects where matching the source data is a high priority. Failing to trim leading/trailing spaces from data can often lead to mismatched results since the spaces are stored as part of the data value. The project team must understand how spaces are handled from the source systems to determine the amount of coding required to correct this. (When using PowerCenter and sourcing flat files, the options provided while configuring the File Properties may be sufficient.) Remember that certain RDBMS products use the data type CHAR, which then stores the data with trailing blanks. These blanks need to be trimmed before matching can occur. It is usually only advisable to use CHAR for 1-character flag fields.


Note that many fixed-width files do not use a null character; instead, empty fields are padded with spaces. Therefore, developers must put one space beside the text radio button, and also tell the product that the space is repeating to fill out the rest of the precision of the column. The strip trailing blanks facility then strips off any remaining spaces from the end of the data value. In PowerCenter, avoid embedding database text manipulation functions in lookup transformations: the SQL override forces the developer to cache the lookup table, and on very large tables caching is not always realistic or feasible.

Datatype Conversions

It is advisable to use explicit tool functions when converting the data type of a particular data value.

[In PowerCenter, if the TO_CHAR function is not used, an implicit conversion is performed, and 15 digits are carried forward, even when they are not needed or desired. PowerCenter can handle some conversions without function calls (these are detailed in the product documentation), but this may cause subsequent support or maintenance headaches.]

Dates

Dates can cause many problems when moving and transforming data from one place to another because an assumption must be made that all data values are in a designated format.

[Informatica recommends first checking a piece of data to ensure it is in the proper format before trying to convert it to a Date data type. If the check is not performed first, then a developer increases the risk of transformation errors, which can cause data to be lost].

An example piece of code would be:

IIF(IS_DATE(in_RECORD_CREATE_DT, 'YYYYMMDD'), TO_DATE(in_RECORD_CREATE_DT, 'YYYYMMDD'), NULL)

If the majority of the dates coming from a source system arrive in the same format, then it is often wise to create a reusable expression that handles dates, so that the proper checks are made. It is also advisable to determine if any default dates should be defined, such as a low date or high date. These should then be used throughout the system for consistency. However, do not fall into the trap of always using default dates as some are meant to be NULL until the appropriate time (e.g., birth date or death date).

The NULL in the example above could be changed to one of the standard default dates described here.

Decimal Precision

With numeric data columns, developers must determine the expected or required precisions of the columns. (By default, to increase performance, PowerCenter treats all numeric columns as 15 digit floating point decimals, regardless of how they are defined in the transformations. The maximum numeric precision in PowerCenter is 28 digits.)

If it is determined that a column realistically needs a higher precision, then the Enable Decimal Arithmetic option in the Session Properties needs to be checked. However, be aware that enabling this option can slow performance by as much as 15 percent. The Enable Decimal Arithmetic option must also be enabled when comparing two numbers for equality.

Trapping Poor Data Quality Techniques


The most important technique for ensuring good data quality is to prevent incorrect, inconsistent, or incomplete data from ever reaching the target system. This goal may be difficult to achieve in a data synchronization or data migration project, but it is very relevant when discussing data warehousing or ODS. This section discusses techniques that you can use to prevent bad data from reaching the system.

Checking Data for Completeness Before Loading

When requesting a data feed from an upstream system, be sure to request an audit file or report that contains a summary of what to expect within the feed. Common requests here are record counts or summaries of numeric data fields. If you have performed a data quality audit, as specified in the Analyze Phase, these metrics and others should be readily available.

Assuming that the metrics can be obtained from the source system, it is advisable to then create a pre-process step that ensures your input source matches the audit file. If the values do not match, stop the overall process from loading into your target system. The source system can then be alerted to verify where the problem exists in its feed.

Enforcing Rules During Mapping

Another method of filtering bad data is to have a set of clearly defined data rules built into the load job. The records are then evaluated against these rules and routed to an Error or Bad Table for further re-processing accordingly. An example of this is to check all incoming Country Codes against a Valid Values table. If the code is not found, then the record is flagged as an Error record and written to the Error table.
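
As a sketch of this routing idea in SQL terms (the PowerCenter equivalent would be a Lookup on the valid-values table feeding a Router transformation), an Oracle multi-table insert can send each incoming row either to the target or to an error table in a single pass. The table names below (STG_ORDER, VALID_COUNTRY, ORDER_TARGET, ORDER_ERROR) are hypothetical.

-- Route each staged row to the target when its country code is valid,
-- otherwise to the error table for later review and reprocessing.
INSERT ALL
  WHEN country_ok = 1 THEN
    INTO ORDER_TARGET (ORDER_ID, COUNTRY_CODE) VALUES (order_id, country_code)
  WHEN country_ok = 0 THEN
    INTO ORDER_ERROR  (ORDER_ID, COUNTRY_CODE, ERROR_REASON)
    VALUES (order_id, country_code, 'INVALID COUNTRY CODE')
SELECT o.order_id,
       o.country_code,
       CASE WHEN v.country_code IS NULL THEN 0 ELSE 1 END AS country_ok
  FROM STG_ORDER o
  LEFT JOIN VALID_COUNTRY v
    ON v.country_code = o.country_code;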

A pitfall of this method is that you must determine what happens to the record once it has been loaded to the Error table. If the record is pushed back to the source system to be fixed, then a delay may occur until the record can be successfully loaded to the target system. In fact, if the proper governance is not in place, the source system may refuse to fix the record at all. In this case, a decision must be made to either: 1) fix the data manually and risk not matching with the source system; or 2) relax the business rule to allow the record to be loaded.

Oftentimes, in the absence of an enterprise data steward, it is a good idea to assign a team member the role of data steward. It is this person’s responsibility to patrol these tables and push back to the appropriate systems as necessary, as well as to help make decisions about fixing or filtering bad data. A data steward should have a good command of the metadata, and he/she should also understand the consequences of data decisions for the user community.


Another solution applicable in cases with a small number of code values is to try to anticipate any mistyped error codes and translate them back to the correct codes. The cross-reference translation data can be accumulated over time. Each time an error is corrected, both the incorrect and correct values should be put into the table and used to correct future errors automatically.

Dimension Not Found While Loading Fact

The majority of current data warehouses are built using a dimensional model. A dimensional model relies on dimension records existing before the fact tables are loaded. This can usually be accomplished by loading the dimension tables before loading the fact tables. However, there are some cases where a corresponding dimension record is not present at the time of the fact load. When this occurs, consistent rules are needed to handle it so that data is not improperly exposed to, or hidden from, the users.

One solution is to continue to load the data to the fact table, but assign the foreign key a value that represents Not Found or Not Available in the dimension. These keys must also exist in the dimension tables to satisfy referential integrity, but they provide a clear and easy way to identify records that may need to be reprocessed at a later date.
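
The sketch below shows this approach in SQL, assuming a reserved surrogate key of -1 for the 'Not Found' member; the key value and the table and column names (DIM_PRODUCT, FACT_SALES, STG_SALES) are illustrative only. In a PowerCenter mapping the same effect is achieved by defaulting the key when the dimension lookup returns no match.

-- Reserved dimension member representing 'Not Found / Not Available'.
INSERT INTO DIM_PRODUCT (PRODUCT_KEY, PRODUCT_CODE, PRODUCT_NAME)
VALUES (-1, 'N/A', 'Not Found');

-- Fact load: substitute the reserved key when no matching dimension row exists,
-- so the fact row is still loaded and can be identified and reprocessed later.
INSERT INTO FACT_SALES (PRODUCT_KEY, SALE_DATE, SALE_AMOUNT)
SELECT COALESCE(d.PRODUCT_KEY, -1),
       s.SALE_DATE,
       s.SALE_AMOUNT
  FROM STG_SALES s
  LEFT JOIN DIM_PRODUCT d
    ON d.PRODUCT_CODE = s.PRODUCT_CODE;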

Another solution is to filter the record from processing since it may no longer be relevant to the fact table. The team will most likely want to flag the row through the use of either error tables or process codes so that it can be reprocessed at a later time.

A third solution is to use dynamic caches and load the dimensions when a record is not found there, even while loading the fact table. This should be done very carefully since it may add unwanted or junk values to the dimension table. One occasion when this may be advisable is in cases where dimensions are simply made up of the distinct combination values in a data set. Thus, this dimension may require a new record if a new combination occurs.

It is imperative that all of these solutions be discussed with the users before making any decisions since they will eventually be the ones making decisions based on the reports.

Last updated: 01-Feb-07 18:52


Data Quality Project Estimation and Scheduling Factors

Challenge

This Best Practice is intended to assist project managers who must estimate the time and resources necessary to address data quality issues within data integration or other data-dependent projects.

Its primary concerns are the project estimation issues that arise when you add a discrete data quality stage to your data project. However, it also examines the factors that determine when, or whether, you need to build a larger data quality element into your project.

Description

At a high level, there are three ways to add data quality to your project:

● Add a discrete and self-contained data quality stage, such as that enabled by using pre-built Informatica Data Quality (IDQ) processes, or plans, in conjunction with Informatica Data Cleanse and Match.

● Add an expanded but finite set of data quality actions to the project, for example in cases where pre-built plans do not fit the project parameters.

● Incorporate data quality actions throughout the project.

This document should help you decide which of these methods best suits your project and assist in estimating the time and resources needed for the first and second methods.

Using Pre-Built Plans with Informatica Data Cleanse and Match

Informatica Data Cleanse and Match is a cross-application solution that enables PowerCenter users to add data quality processes defined in IDQ to custom transformations in PowerCenter. It incorporates the following components:

● Data Quality Workbench, a user-interface application for building and executing data quality processes, or plans.

● Data Quality Integration, a plug-in component for PowerCenter that integrates PowerCenter and IDQ.

● At least one set of reference data files that can be read by data quality plans to validate and enrich certain types of project data. For example, Data Cleanse and Match can be used with the North America Content Pack, which includes pre-built data quality plans and complete address reference datasets for the United States and Canada.

Data Quality Engagement Scenarios

Data Cleanse and Match delivers its data quality capabilities “out of the box”; a PowerCenter user can select data quality plans and add them to a Data Quality transformation without leaving PowerCenter. In this way, Data Cleanse and Match capabilities can be added into a project plan as a relatively short and discrete stage.

In a more complex scenario, a Data Quality Developer may wish to modify the underlying data quality plans or create new plans to focus on quality analysis or enhancements in particular areas. This expansion of the data quality operations beyond the pre-built plans can also be handled within a discrete data quality stage.

The Project Manager may decide to implement a more thorough approach to data quality and integrate data quality actions throughout the project plan. In many cases, a convincing case can be made for enlarging the data quality aspect to encompass the full data project. (Velocity contains several tasks and subtasks concerned with such an endeavor.) This is well worth considering. Often, businesses do not realize the extent to which their business and project goals depend on the quality of their data.

The project impact of these three types of data quality activity can be summarized as follows:

DQ approach – Estimated project impact
Simple stage – 10 days, 1-2 Data Quality Developers
Expanded data quality stage – 15-20 days, 2 Data Quality Developers, high visibility to business
Data quality integrated with data project – Duration of data project, 2 or more project roles, impact on business and project objectives

Note: The actual time that should be allotted to the data quality stages noted above depends on the factors discussed in the remainder of this document.

Factors Influencing Project Estimation

The factors influencing project estimation for a data quality stage range from high-level project parameters to lower-level data characteristics. The main factors are listed below and explained in detail later in this document.

● Base and target levels of data quality
● Overall project duration/budget
● Overlap of sources/complexity of data joins
● Quantity of data sources
● Matching requirements
● Data volumes
● Complexity and quantity of data rules
● Geography

Determine which scenario (out of the box with Data Cleanse and Match, expanded Data Cleanse and Match, or a thorough data quality integration) best fits your data project by considering the project’s overall objectives and its mix of factors.

The Simple Data Quality Stage

Project managers can consider the use of pre-built plans with Data Cleanse and Match as a simple scenario with a predictable number of function points that can be added to the project plan as a single package.

You can add the North America Content Pack plans to your project if the project meets most of the following criteria. Similar metrics apply to other types of pre-built plans:

● Baseline functionality of the pre-built data quality plans meets 80 percent of the project needs.
● Complexity of data rules is relatively low.
● Business rules present in pre-built plans need minimum fine-tuning.
● Target data quality level is achievable (i.e., <100 percent).
● Quantity of data sources is relatively low.
● Overlap of data sources/complexity of database table joins is relatively low.
● Matching requirements and targets are straightforward.
● Overall project duration is relatively short.
● The project relates to a single country.

Note that the source data quality level is not a major concern.

Implementing the Simple Data Quality Stage

The out-of-the-box scenario is designed to deliver significant increases in data quality in those areas for which the plans were designed (i.e., North American name and address data) in a short time frame. As indicated above, it does not anticipate major changes to the underlying data quality plans. It involves the following three steps:

1. Run pre-built plans.

2. Review plan results.

3. Transfer data to the next stage in the project and (optionally) add data quality plans to PowerCenter transformations.


While every project is different, a single iteration of the simple model may take approximately five days, as indicated below:

● Run pre-built plans (2 days)
● Review plan results (1 day)
● Pass data to the next stage in the project and add plans to PowerCenter transformations (2 days)

Note that these estimates fit neatly into a five-day week but may be conservative in some cases. Note also that a Data Quality Developer can tune plans on an ad-hoc basis to suit the project. Therefore, you should plan for a two-week “simple” data quality stage.

Step - Simple Stage – Days
Run pre-built plans – 2 days (week 1)
Review plan results; fine-tune pre-built plans if necessary – 1 day (week 1)
Re-run pre-built plans – 2 days (week 1)
Review plan results with stakeholders; add plans to PowerCenter transformations and define mappings – 2 days (week 2)
Run PowerCenter workflows – 1 day (week 2)
Review results/obtain approval from stakeholders – 1 day (week 2)
Approve and pass all files to the next project stage – 1 day (week 2)

Expanding the Simple Data Quality Stage

Although the simple scenario above treats the data quality components as a “black box,” it still allows for modifications to the data quality plans. The types of plan tuning that developers can undertake in this time frame include changing the reference dictionaries used by the plans, editing these dictionaries, and re-selecting the data fields used by the plans as keys to identify data matches. The above time frame does not guarantee that a developer can build or re-build a plan from scratch.

The gap between base and target levels of data quality is an important area to consider when expanding the data quality stage. The Developer and Project Manager may decide to add a data analysis step in this stage, or even decide to split these activities across the project plan by conducting a data quality audit early in the project, so that issues can be revealed to the business in advance of the formal data quality stage. The schedule should allow for sufficient time for testing the data quality plans and for contact with the business managers in order to define data quality expectations and targets.

In addition:

● If a data quality audit is added early in the project, the data quality stage grows into a project-length endeavor.

● If the data quality audit is included in the discrete data quality stage, the expanded, three-week Data Quality stage may look like this:


Step - Enhanced DQ Stage | Days, week 1 | Days, week 2 | Days, week 3
Set up and run data analysis plans; review plan results | 1-2 | - | -
Conduct advance tuning of pre-built plans; run pre-built plans | 2 | - | -
Review plan results with stakeholders | 1 | - | -
Modify pre-built plans or build new plans from scratch | - | 2 | -
Re-run the plans | - | 2 | -
Review plan results/obtain approval from stakeholders | - | 1 | -
Add approved plans to PowerCenter transformations, define mappings | - | - | 2
Run PowerCenter workflows | - | - | 1
Review results/obtain approval from stakeholders | - | - | 1
Approve and pass all files to the next project stage | - | - | 1

Sizing Your Data Quality Initiatives

The following section describes the factors that affect the estimated time that the data quality endeavors may add to a project. Estimating the specific impact that a single factor is likely to have on a project plan is difficult, as a single data factor rarely exists in isolation from others. If one or two of these factors apply to your data, you may be able to treat them within the scope of a discrete DQ stage. If several factors apply, you are moving into a complex scenario and must design your project plan accordingly.

Base and Target Levels of Data Quality

The rigor of your data quality stage depends in large part on the current (i.e., “base”) levels of data quality in your dataset and the target levels that you want to achieve. As part of your data project, you should run a set of data analysis plans and determine the strengths and weaknesses of the proposed project data. If your data is already of a high quality relative to project and business goals, then your data quality stage is likely to be a short one!

If possible, you should conduct this analysis at an early stage in the data project (i.e., well in advance of the data quality stage). Depending on your overall project parameters, you may have already scoped a Data Quality Audit into your project. However, if your overall project is short in duration, you may have to tailor your data quality analysis actions to the time available.

Action: If there is a wide gap between base and target data quality levels, determine whether a short data quality stage can bridge the gap. If a data quality audit is conducted early in the project, you have latitude to discuss this with the business managers in the context of the overall project timeline. In general, it is good practice to agree with the business to incorporate time into the project plan for a dedicated Data Quality Audit. (See Task 2.8 in the Velocity Work Breakdown Structure.)

If the aggregated data quality percentage for your project’s source data is greater than 60 percent, and your target percentage level for the data quality stage is less than 95 percent, then you are in the zone of effectiveness for Data Cleanse and Match.

Note: You can assess data quality according to at least six criteria. Your business may need to improve data quality levels with respect to one criterion but not another. See the Best Practice document Data Cleansing.

Overall Project Duration/Budget

A data project with a short duration may not have the means to accommodate a complex data quality stage, regardless of the potential or need to enhance the quality of the data involved. In such a case, you may have to incorporate a finite data quality stage.

Conversely, a data project with a long time line may have scope for a larger data quality initiative. In large data projects with major business and IT targets, good data quality may be a significant issue. For example, poor data quality can affect the ability to cleanly and quickly load data into target systems. Major data projects typically have a genuine need for high-quality data if they are to avoid unforeseen problems.

Action: Evaluate the project schedule parameters and expectations put forward by the business and evaluate how data quality fits into these parameters.

You must also determine if there are any data quality issues that may jeopardize project success, such as a poor understanding of the data structure. These issues may already be visible to the business community. If not, they should be raised with the management. Bear in mind that data quality is not simply concerned with the accuracy of the data values — it can encompass the project metadata also.

Overlap of Sources/Complexity of Data Joins

When data sources overlap, data quality issues can be spread across several sources. The relationships among the variables within the sources can be complex, difficult to join together, and difficult to resolve, all adding to project time.

If the joins between the data are simple, then this task may be straightforward. However, if the data joins use complex keys or exist over many hierarchies, then the data modeling stage can be time-consuming, and the process of resolving the indices may be prolonged.

Action: You can tackle complexity in data sources and in required database joins within a data quality stage, but in doing so, you step outside the scope of the simple data quality stage.

Quantity of Data Sources

This issue is similar to that of data source overlap and complexity (above). The greater the quantity of sources, the greater the opportunity for data quality issues to arise. The number of data sources has a particular impact on the time required to set up the data quality solution. (The source data setup in PowerCenter can facilitate the data setup in the data quality stage.)

Action: You may find that the number of data sources correlates with the number of data sites covered by the project. If your project includes data from multiple geographies, you step outside the scope of a simple data quality stage.

Matching Requirements

Data matching plans are the most performance-intensive type of data quality plan. Moreover, matching plans are often coupled to a type of data standardization plan (i.e., grouping plan) that prepares the data for match analysis.

Matching plans are not necessarily more complex to design than other types of plans, although they may contain sophisticated business rules. However, the time taken to execute a matching plan grows geometrically with the volume of data records passed through the plan, since each record in a group must be compared with every other record in that group. (Specifically, the time taken is proportional to the size and number of the data groups created in the grouping plans.)

Action: Consult the Best Practice on Effective Data Matching Techniques and determine how long your matching plans may take to run.

Data Volumes

Data matching requirements and data volumes are closely related. As stated above, the time taken to execute a matching plan grows geometrically with the volume of data records passed through it. Other types of plans do not exhibit this relationship. However, the general rule applies: the larger your data volumes, the longer it takes for plans to execute.

Action: Although IDQ can handle data volumes measurable in eight figures, a dataset of more than 1.5 million records is considered larger than average. If your dataset is measurable in millions of records, and high levels of matching/de-duplication are required, consult the Best Practice on Effective Data Matching Techniques.

Complexity and Quantity of Data Rules

This is a key factor in determining the complexity of your data quality stage. If the Data Quality Developer is likely to write a large number of business rules for the data quality plans — as may be the case if data quality target levels are very high or relate to precise data objectives — then the project is, in effect, moving beyond the capabilities of Data Cleanse and Match, and you need to add rule-creation and rule-review elements to the data quality effort.

Action: If the business requires multiple complex rules, you must scope additional time for rule creation and for multiple iterations of the data quality stage. Bear in mind that, in addition to being written and added to the data quality plans, these rules must be tested and approved by the business.

Geography

Geography affects the project plan in two ways:

● First, the geographical spread of data sites is likely to affect the time needed to run plans, collate data, and engage with key business personnel. Working hours in different time zones can mean that one site is starting its business day while others are ending theirs, and this can affect the tight scheduling of the simple data quality stage.

● Second, project data that is sourced from several countries typically means multiple data sources, with opportunities for data quality issues to arise that may be specific to the country or the division of the organization providing the data source.

There is also a high correlation between the scale of the data project and the scale of the enterprise in which the project will take place. For multi-national corporations, there is rarely such a thing as a small data project!


Action: Consider the geographical spread of your source data. If the data sites are spread across several time zones or countries, you may need to factor in time lags to your data quality planning.

Last updated: 01-Feb-07 18:52


Effective Data Matching Techniques

Challenge

Identifying and eliminating duplicates is a cornerstone of effective marketing efforts and customer relationship management initiatives, and is an increasingly important driver of cost-efficient compliance with regulatory initiatives such as KYC (Know Your Customer).

Once duplicate records are identified, you can remove them from your dataset, and can better recognize key relationships among data records (such as customer records from a common household). You can also match records or values against reference data to ensure data accuracy and validity.

This Best Practice is targeted toward Informatica Data Quality (IDQ) users familiar with Informatica's matching approach. It has two high-level objectives:

● To identify the key performance variables that affect the design and execution of IDQ matching plans.
● To describe plan design and plan execution actions that will optimize plan performance and results.

To optimize your data matching operations in IDQ, you must be aware of the factors that are discussed below.

Description

All too often, an organization's datasets contain duplicate data in spite of numerous attempts to cleanse the data or prevent duplicates from occurring. In other scenarios, the datasets may lack common keys (such as customer numbers or product ID fields) that, if present, would allow clear ‘joins’ between the datasets and improve business knowledge.

Identifying and eliminating duplicates in datasets can serve several purposes. It enables the creation of a single view of customers; it can help control costs associated with mailing lists by preventing multiple pieces of mail from being sent to the same person or household; and it can assist marketing efforts by identifying households or individuals who are heavy users of a product or service.

Data can be enriched by matching across production data and reference data sources. Business intelligence operations can be improved by identifying links between two or more systems to provide a more complete picture of how customers interact with a business.

IDQ’s matching capabilities can help to resolve dataset duplications and deliver business results. However, a user’s ability to design and execute a matching plan that meets the key requirements of performance and match quality depends on understanding the best-practice approaches described in this document.

An integrated approach to data matching involves several steps that prepare the data for matching and improve the overall quality of the matches. The following table outlines the processes in each step.

Step | Description
Profiling | Typically the first stage of the data quality process, profiling generates a picture of the data and indicates the data elements that can comprise effective group keys. It also highlights the data elements that require standardizing to improve match scores.
Standardization | Removes noise, excess punctuation, variant spellings, and other extraneous data elements. Standardization reduces the likelihood that match quality will be affected by data elements that are not relevant to match determination.
Grouping | A post-standardization function in which the group key fields identified in the profiling stage are used to segment data into logical groups that facilitate matching plan performance.
Matching | The process whereby the data values in the created groups are compared against one another and record matches are identified according to user-defined criteria.
Consolidation | The process whereby duplicate records are cleansed. It identifies the master record in a duplicate cluster and permits the creation of a new dataset or the elimination of subordinate records. Any child data associated with subordinate records is linked to the master record.

The sections below identify the key factors that affect the performance (or speed) of a matching plan and the quality of the matches identified. They also outline the best practices that ensure that each matching plan is implemented with the highest probability of success. (This document does not make any recommendations on profiling, standardization or consolidation strategies. Its focus is grouping and matching.)

The following table identifies the key variables that affect matching plan performance and the quality of matches identified.

Factor | Impact | Impact summary
Group size | Plan performance | The number and size of groups have a significant impact on plan execution speed.
Group keys | Quality of matches | The proper selection of group keys ensures that the maximum number of possible matches are identified in the plan.
Hardware resources | Plan performance | Processors, disk performance, and memory require consideration.
Size of dataset(s) | Plan performance | This is not a high-priority issue. However, it should be considered when designing the plan.
Informatica Data Quality components | Plan performance | The plan designer must weigh file-based versus database matching approaches when considering plan requirements.
Time window and frequency of execution | Plan performance | The time taken for a matching plan to complete execution depends on its scale. Timing requirements must be understood up-front.
Match identification | Quality of matches | The plan designer must weigh deterministic versus probabilistic approaches.

Group Size

Grouping breaks large datasets down into smaller ones to reduce the number of record-to-record comparisons performed in the plan, which directly impacts the speed of plan execution. When matching on grouped data, a matching plan compares the records within each group with one another. When grouping is implemented properly, plan execution speed is increased significantly, with no meaningful effect on match quality.

The most important determinant of plan execution speed is the size of the groups to be processed — that is, the number of data records in each group.

For example, consider a dataset of 1,000,000 records, for which a grouping strategy generates 10,000 groups. If 9,999 of these groups have an average of 50 records each, the remaining group will contain more than 500,000 records; based on this one large group, the matching plan would require 87 days to complete, processing 1,000,000 comparisons a minute! In comparison, the remaining 9,999 groups could be matched in about 12 minutes if the group sizes were evenly distributed.
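
The arithmetic behind this example is easy to reproduce. The sketch below is a minimal Python illustration (not IDQ functionality) that counts the pairwise comparisons implied by a set of group sizes and converts them to an estimated run time at the one-million-comparisons-per-minute average cited later in this document.

    # A minimal sketch of the arithmetic above: pairwise comparisons within a
    # group grow with the square of the group size. The rate of 1,000,000
    # comparisons per minute is the average figure quoted later in this
    # Best Practice.

    def comparisons(group_size: int) -> int:
        """Record-to-record comparisons required within one group."""
        return group_size * (group_size - 1) // 2

    def matching_days(group_sizes, comparisons_per_minute=1_000_000) -> float:
        """Estimated matching time, in days, for a list of group sizes."""
        total = sum(comparisons(n) for n in group_sizes)
        return total / comparisons_per_minute / (60 * 24)

    # One skewed group of roughly 500,000 records: about 87 days of matching.
    print(round(matching_days([500_050]), 1))
    # The other 9,999 groups of about 50 records each: roughly 12 minutes.
    print(round(matching_days([50] * 9_999) * 24 * 60, 1))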

Group size can also have an impact on the quality of the matches returned in the matching plan. Large groups perform more record comparisons, so more likely matches are potentially identified. The reverse is true for small groups: as groups get smaller, fewer comparisons are possible, and the potential for missing good matches is increased. Therefore, groups must be defined intelligently through the use of group keys.

Group Keys

Group keys determine which records are assigned to which groups. Group key selection, therefore, has a significant effect on the success of matching operations.

Grouping splits data into logical chunks and thereby reduces the total number of comparisons performed by the plan. The selection of group keys, based on key data fields, is critical to ensuring that relevant records are compared against one another.

When selecting a group key, two main criteria apply:

● Candidate group keys should represent a logical separation of the data into distinct units where there is a low probability that matches exist between records in different units. This can be determined by profiling the data and uncovering the structure and quality of the content prior to grouping.

● Candidate group keys should also have high scores in three key areas of data quality: completeness, conformity, and accuracy. Problems in these data areas can be improved by standardizing the data prior to grouping.

For example, geography is a logical separation criterion when comparing name and address data. A record for a person living in Canada is unlikely to match someone living in Ireland. Thus, the country-identifier field can provide a useful group key. However, if you are working with national data (e.g. Swiss data), duplicate data may exist for an individual living in Geneva, who may also be recorded as living in Genf or Geneve. If the group key in this case is based on city name, records for Geneva, Genf, and Geneve will be written to different groups and never compared — unless variant city names are standardized.
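
The point can be made concrete with a minimal Python sketch (illustrative only; the field names and the tiny in-memory dictionary are invented for this example). Variant city spellings are standardized against a reference dictionary before the group key is derived, so that records for Geneva, Genf, and Geneve fall into the same group.

    # Illustrative only: standardize variant city spellings against a small
    # reference dictionary before using the city as a group key.
    CITY_DICTIONARY = {
        "GENF": "GENEVA",
        "GENEVE": "GENEVA",
        "GENEVA": "GENEVA",
    }

    def group_key(record: dict) -> str:
        """Derive a group key from the standardized city name."""
        city = record.get("city", "").strip().upper()
        return CITY_DICTIONARY.get(city, city)

    records = [
        {"name": "A. Mueller", "city": "Genf"},
        {"name": "A. Muller", "city": "Geneve"},
    ]
    # Both records receive the group key "GENEVA" and so will be compared.
    print({group_key(r) for r in records})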

Size of Dataset

In matching, the size of the dataset typically does not have as significant an impact on plan performance as the definition of the groups within the plan. However, in general terms, the larger the dataset, the more time required to produce a matching plan — both in terms of the preparation of the data and the plan execution.

IDQ Components

All IDQ components serve specific purposes, and very little functionality is duplicated across the components. However, there are performance implications for certain component types, combinations of components, and the quantity of components used in the plan.

Several tests have been conducted on IDQ (version 2.11) to test source/sink combinations and various operational components. In tests comparing file-based matching against database matching, file-based matching outperformed database matching in UNIX and Windows environments for plans containing up to 100,000 groups. Also, matching plans that wrote output to a CSV Sink outperformed plans with a DB Sink or Match Key Sink. Plans with a Mixed Field Matcher component performed more slowly than plans without a Mixed Field Matcher.

Raw performance should not be the only consideration when selecting the components to use in a matching plan. Different components serve different needs and may offer advantages in a given scenario.

Time Window

IDQ can perform millions or billions of comparison operations in a single matching plan. The time available for the completion of a matching plan can have a significant impact on whether the plan is perceived to be running correctly.

Knowing the time window for plan completion helps to determine the hardware configuration choices, grouping strategy, and the IDQ components to employ.

Frequency of Execution

The frequency with which plans are executed is linked to the time window available. Matching plans may need to be tuned to fit within the cycle in which they are run. The more frequently a matching plan is run, the more important its execution time becomes.

Match Identification

The method used by IDQ to identify good matches has a significant effect on the success of the plan. Two key methods for assessing matches are:

● deterministic matching
● probabilistic matching

Deterministic matching applies a series of checks to determine if a match can be found between two records. IDQ’s fuzzy matching algorithms can be combined with this method. For example, a deterministic check may first check if the last name comparison score was greater than 85 percent. If this is true, it next checks the address. If an 80 percent match is found, it then checks the first name. If a 90 percent match is found on the first name, then the entire record is considered successfully matched.

The advantages of deterministic matching are: (1) it follows a logical path that can be easily communicated to others, and (2) it is similar to the methods employed when manually checking for matches. The disadvantages to this method are its rigidity and its requirement that each dependency be true. This can result in matches being missed, or can require several different rule checks to cover all likely combinations.
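
A minimal Python sketch of such a deterministic cascade is shown below (illustrative only; the thresholds follow the example above, and fuzzy_scores stands in for whatever per-field similarity scores the plan's fuzzy matching components produce).

    # Illustrative deterministic matching: a fixed cascade of threshold checks.
    # fuzzy_scores holds per-field similarity scores in the range 0.0 to 1.0.

    def is_deterministic_match(fuzzy_scores: dict) -> bool:
        """Cascade from the example: last name, then address, then first name."""
        if fuzzy_scores.get("last_name", 0.0) <= 0.85:
            return False
        if fuzzy_scores.get("address", 0.0) < 0.80:
            return False
        return fuzzy_scores.get("first_name", 0.0) >= 0.90

    print(is_deterministic_match({"last_name": 0.92, "address": 0.84, "first_name": 0.95}))  # True
    print(is_deterministic_match({"last_name": 0.92, "address": 0.70, "first_name": 0.99}))  # False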

Probabilistic matching takes the match scores from fuzzy matching components and assigns weights to them in order to calculate a weighted average that indicates the degree of similarity between two pieces of information.

The advantage of probabilistic matching is that it is less rigid than deterministic matching. There are no dependencies on certain data elements matching in order for a full match to be found. Weights assigned to individual components can place emphasis on different fields or areas in a record. However, even if a heavily-weighted score falls below a defined threshold, match scores from less heavily-weighted components may still produce a match.

The main disadvantage of this method is the higher degree of tweaking required on the user’s part to get the right balance of weights in order to optimize successful matches. The weighting scheme can also be difficult for users to understand and communicate to one another.

Also, the cut-off mark for good matches versus bad matches can be difficult to assess. For example, matches scoring 95 to 100 percent may all be good matches, but matches scoring between 90 and 94 percent may map to only 85 percent genuine matches. Matches scoring between 85 and 89 percent may correspond to only 65 percent genuine matches, and so on. The following table illustrates this principle.


Close analysis of the match results is required because of this relationship between match quality and the match threshold scores assigned: there may not be a one-to-one mapping between the plan’s weighted score and the number of records that can be considered genuine matches.
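
For contrast with the deterministic cascade sketched earlier, the weighted-average idea behind probabilistic matching can be illustrated as follows (Python, illustrative only; the weights and the cut-off value are invented and would need tuning against real data).

    # Illustrative probabilistic matching: a weighted average of per-field
    # fuzzy scores compared against a single cut-off.
    WEIGHTS = {"last_name": 0.40, "address": 0.35, "first_name": 0.25}
    MATCH_THRESHOLD = 0.88

    def weighted_score(fuzzy_scores: dict) -> float:
        return sum(w * fuzzy_scores.get(field, 0.0) for field, w in WEIGHTS.items())

    def is_probabilistic_match(fuzzy_scores: dict) -> bool:
        return weighted_score(fuzzy_scores) >= MATCH_THRESHOLD

    # A weak first-name score can still yield a match if other fields score well.
    print(is_probabilistic_match({"last_name": 0.98, "address": 0.95, "first_name": 0.70}))  # True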

Best Practice Operations

The following section outlines best practices for matching with IDQ.

Capturing Client Requirements

Capturing client requirements is key to understanding how successful and relevant your matching plans are likely to be. As a best practice, be sure to answer the following questions, as a minimum, before designing and implementing a matching plan:

● How large is the dataset to be matched?
● How often will the matching plans be executed?
● When will the match process need to be completed?
● Are there any other dependent processes?
● What are the rules for determining a match?
● What process is required to sign-off on the quality of match results?
● What processes exist for merging records?

Test Results

Performance tests demonstrate the following:

● IDQ has near-linear scalability in a multi-processor environment.
● Scalability in standard installations, as achieved in the allocation of matching plans to multiple processors, will eventually level off.

Performance is the key to success in high-volume matching solutions. IDQ’s architecture supports massive scalability by allowing large jobs to be subdivided and executed across several processors. This scalability greatly enhances IDQ’s ability to meet the service levels required by users without sacrificing quality or requiring an overly complex solution.


Managing Group Sizes

As stated earlier, group sizes have a significant effect on the speed of matching plan execution. Also, the quantity of small groups should be minimized to ensure that the greatest number of comparisons is captured. Keep the following parameters in mind when designing a grouping plan.

Condition | Best practice | Exceptions
Maximum group size | 5,000 records | Large datasets (over 2M records) with uniform data. Minimize the number of groups containing more than 5,000 records.
Minimum number of single-record groups | 1,000 groups per one million record dataset | -
Optimum number of comparisons | 500,000,000 comparisons per 1 million records (+/- 20 percent) | -

In cases where the datasets are large, multiple group keys may be required to segment the data to ensure that best practice guidelines are followed. Informatica Corporation can provide sample grouping plans that automate these requirements as far as is practicable.
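
A grouping strategy can be checked against these guidelines before any matching is attempted. The sketch below is a minimal Python illustration (not an IDQ plan; group_keys stands for whatever key values your grouping plan assigns to each record) that flags oversized groups, counts single-record groups, and estimates the total number of comparisons.

    # Illustrative audit of a grouping strategy against the guidelines above.
    from collections import Counter

    def audit_groups(group_keys):
        """group_keys: iterable of the group key assigned to each record."""
        sizes = Counter(group_keys)
        return {
            "groups": len(sizes),
            "oversized_groups": {k: n for k, n in sizes.items() if n > 5_000},
            "single_record_groups": sum(1 for n in sizes.values() if n == 1),
            "total_comparisons": sum(n * (n - 1) // 2 for n in sizes.values()),
        }

    # Example: two modest groups, one oversized group, and one single-record group.
    print(audit_groups(["A"] * 10 + ["B"] * 25 + ["C"] * 6_000 + ["D"]))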

Group Key Identification

Identifying appropriate group keys is essential to the success of a matching plan. Ideally, any dataset that is about to be matched has been profiled and standardized to identify candidate keys.

Group keys act as a “first pass” or high-level summary of the shape of the dataset(s). Remember that only data records within a given group are compared with one another. Therefore, it is vital to select group keys that have high data quality scores for completeness, conformity, consistency, and accuracy.

Group key selection depends on the type of data in the dataset, for example whether it contains name and address data or other data types such as product codes.

Hardware Specifications

Matching is a resource-intensive operation, especially in terms of processor capability. Three key variables determine the effect of hardware on a matching plan: processor speed, disk performance, and memory.

The majority of the activity required in matching is tied to the processor. Therefore, the speed of the processor has a significant effect on how fast a matching plan completes. Although the average computational speed for IDQ is one million comparisons per minute, the speed can range from as low as 250,000 to as high as 6.5 million comparisons per minute, depending on the hardware specification, background processes running, and components used. As a best practice, higher-specification processors (e.g., 1.5 GHz minimum) should be used for high-volume matching plans.

Hard disk capacity and available memory can also determine how fast a plan completes. The hard disk reads and writes data required by IDQ sources and sinks. The speed of the disk and the level of defragmentation affect how quickly data can be read from, and written to, the hard disk. Information that cannot be stored in memory during plan execution must be temporarily written to the hard disk. This increases the time required to retrieve information that otherwise could be stored in memory, and also increases the load on the hard disk. A RAID drive may be appropriate for datasets of 3 to 4 million records and a minimum of 512MB of memory should be available.

The following table is a rough guide for hardware estimates based on IDQ Runtime on Windows platforms. Specifications for UNIX-based systems vary.

Match volumes | Suggested hardware specification
< 1,500,000 records | 1.5 GHz computer, 512MB RAM
1,500,000 to 3 million records | Multi-processor server, 1GB RAM
> 3 million records | Multi-processor server, 2GB RAM, RAID 5 hard disk

Single Processor vs. Multi-Processor

With IDQ Runtime, it is possible to run multiple processes in parallel. Matching plans, whether they are file-based or database-based, can be split into multiple plans to take advantage of multiple processors on a server. Be aware, however, that this requires additional effort to create the groups and consolidate the match output. Also, matching plans split across four processors do not run four times faster than a single-processor matching plan. As a result, multi-processor matching may not significantly improve performance in every case.

The following table can help you to estimate the relative execution times of single-processor and multi-processor match plans.

Plan Type | Single Processor | Multiprocessor
Standardization/grouping | Depends on operations and size of data set (time equals Y) | Single-processor time plus 20 percent (time equals Y * 1.20)
Matching | Estimated 1 million comparisons a minute (time equals X) | Time for single-processor matching divided by the number of processors (NP), plus 25 percent (time equals (X / NP) * 1.25)

For example, if a single-processor plan takes one hour to group and standardize the data and eight hours to match, a four-processor match plan should require approximately one hour and 12 minutes to group and standardize and two and a half hours to match. The time difference between a single- and multi-processor plan in this case would be more than five hours (i.e., nine hours for the single-processor plan versus roughly three hours and 42 minutes for the quad-processor plan).
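
The rules of thumb in the table are simple enough to capture in a few lines. The sketch below (Python, illustrative only) reproduces the worked example.

    # Illustrative estimate using the rules of thumb above: grouping and
    # standardization take 20 percent longer when split across processors,
    # and matching takes (single-processor time / processors) * 1.25.

    def multiprocessor_estimate(group_hours: float, match_hours: float, processors: int):
        group_mp = group_hours * 1.20
        match_mp = (match_hours / processors) * 1.25
        return group_mp, match_mp, group_mp + match_mp

    # Worked example: 1 hour to group/standardize, 8 hours to match, 4 processors.
    group_mp, match_mp, total = multiprocessor_estimate(1.0, 8.0, 4)
    print(f"grouping: {group_mp:.2f} h, matching: {match_mp:.2f} h, total: {total:.2f} h")
    # grouping: 1.20 h, matching: 2.50 h, total: 3.70 h (about 3 hours 42 minutes)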

Deterministic vs. Probabilistic Comparisons

No best-practice research has yet been completed on which type of comparison is most effective at determining a match. Each method has strengths and weaknesses. A 2006 article by Forrester Research stated a preference for deterministic comparisons since they remove the burden of identifying a universal match threshold from the user.

Bear in mind that IDQ supports deterministic matching operations only. However, IDQ’s Weight Based Analyzer component lets plan designers calculate weighted match scores for matched fields.

Database vs. File-Based Matching

File-based matching and database matching perform essentially the same operations. The major differences between the two methods revolve around how data is stored and how the outputs can be manipulated after matching is complete. With regard to selecting one method or the other, there are no best-practice recommendations, since the choice is largely defined by requirements.

The following table outlines the strengths and weakness of each method:

 | File-Based Method | Database Method
Ease of implementation | Easy to implement | Requires SQL knowledge
Performance | Fastest method | Slower than file-based method
Space utilization | Requires more hard-disk space | Lower hard-disk space requirement
Operating system restrictions | Possible limit to number of groups that can be created | None
Ability to control/manipulate output | Low | High

High-Volume Data Matching Techniques


This section discusses the challenges facing IDQ matching plan designers in optimizing their plans for speed of execution and quality of results. It highlights the key factors affecting matching performance and discusses the results of IDQ performance testing in single and multi-processor environments.

Checking for duplicate records where no clear connection exists among data elements is a resource-intensive activity. In order to detect matching information, a record must be compared against every other record in a dataset. For a single data source, the quantity of comparisons required to check an entire dataset increases geometrically as the volume of data increases. A similar situation arises when matching between two datasets, where the number of comparisons required is a multiple of the volumes of data in each dataset.

When the volume of data increases into the tens of millions, the number of comparisons required to identify matches — and consequently, the amount of time required to check for matches — reaches impractical levels.

Approaches to High-Volume Matching

Two key factors control the time it takes to match a dataset:

● The number of comparisons required to check the data. ● The number of comparisons that can be performed per minute.

The first factor can be controlled in IDQ through grouping, which involves logically segmenting the dataset into distinct elements, or groups, so that there is a high probability that records within a group are not duplicates of records outside of the group. Grouping data greatly reduces the total number of required comparisons without affecting match accuracy.

IDQ affects the number of comparisons per minute in two ways:

● Its matching components maximize the comparison activities assigned to the computer processor. This reduces the amount of disk I/O communication in the system and increases the number of comparisons per minute. Therefore, hardware with higher processor speeds has higher match throughputs.

● IDQ architecture also allows matching tasks to be broken into smaller tasks and shared across multiple processors. The use of multiple processors to handle matching operations greatly enhances IDQ scalability with regard to high-volume matching problems.

The following section outlines how a multi-processor matching solution can be implemented and illustrates the results obtained in Informatica Corporation testing.

Multi-Processor Matching: Solution Overview

IDQ does not automatically distribute its load across multiple processors. To scale a matching plan to take advantage of a multi-processor environment, the plan designer must develop multiple plans for execution in parallel.

To develop this solution, the plan designer first groups the data to prevent the plan from running low-probability comparisons. Groups are then subdivided into one or more subgroups (the number of subgroups depends on the plan being run and the number of processors in use). Each subgroup is assigned to a discrete matching plan, and the plans are executed in parallel.

The following diagram outlines how multi-processor matching can be implemented in a database model. Source data is first grouped and then subgrouped according to the number of processors available to the job. Each subgroup of data is loaded into a separate staging area, and the discrete match plans are run in parallel against each table. Results from each plan are consolidated to generate a single match result for the original source data.
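
The subgrouping step can be approximated outside of IDQ with a simple split that keeps whole groups together, as in the Python sketch below (illustrative only; in a real deployment each subgroup would be staged and matched by its own discrete IDQ plan rather than processed in Python).

    # Illustrative assignment of groups to N parallel match plans. Whole
    # groups are kept together so that records in the same group are always
    # compared by the same plan; larger groups are assigned first to keep
    # the workloads roughly even.

    def split_groups(groups: dict, processors: int):
        """groups: {group_key: [records]}. Returns one dict of groups per processor."""
        subsets = [dict() for _ in range(processors)]
        ordered = sorted(groups.items(), key=lambda kv: len(kv[1]), reverse=True)
        for i, (key, records) in enumerate(ordered):
            subsets[i % processors][key] = records
        return subsets

    groups = {"A": list(range(400)), "B": list(range(300)), "C": list(range(50)), "D": list(range(45))}
    for n, subset in enumerate(split_groups(groups, 2)):
        print(f"plan {n}: groups {sorted(subset)}, {sum(len(v) for v in subset.values())} records")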


Informatica Corporation Match Plan Tests

Informatica Corporation performed match plan tests on a 2GHz Intel Xeon dual-processor server running Windows 2003 (Server edition). Two gigabytes of RAM were available. The hyper-threading ability of the Xeon processors effectively provided four CPUs on which to run the tests.

Several tests were performed using file-based and database-based matching methods and single and multiple processor methods. The tests were performed on one million rows of data. Grouping of the data limited the total number of comparisons to approximately 500,000,000.

Test results using file-based and database-based methods showed near-linear scalability as the number of available processors increased. As the number of processors increased, so too did the demand on disk I/O resources. As the processor capacity began to scale upward, disk I/O in this configuration eventually limited the benefits of adding additional processor capacity. This is demonstrated in the graph below.

Execution times for multiple processors were based on the longest execution time of the jobs run in parallel. Therefore, having an even distribution of records across all processors was important to maintaining scalability. When the data was not evenly distributed, some match plans ran longer than others, and the benefits of scaling over multiple processors were not as evident.

Last updated: 07-Feb-07 17:24


Effective Data Standardizing Techniques

Challenge

To enable users to streamline their data cleansing and standardization processes (or plans) with Informatica Data Quality (IDQ). The intent is to shorten development timelines and ensure a consistent and methodological approach to cleansing and standardizing project data.

Description

Data cleansing refers to operations that remove non-relevant information and “noise” from the content of the data. Examples of cleansing operations include the removal of person names, “care of” information, excess character spaces, or punctuation from postal addresses.

Data standardization refers to operations that modify the appearance of the data so that it takes on a more uniform structure, and that enrich the data by deriving additional details from existing content.

Cleansing and Standardization Operations

Data can be transformed into a “standard” format appropriate for its business type. This is typically performed on complex data types such as name and address or product data. A data standardization operation typically profiles data by type (e.g., word, number, code) and parses data strings into discrete components. This reveals the content of the elements within the data as well as standardizing the data itself.

For best results, the Data Quality Developer should carry out these steps in consultation with a member of the business. Often, this individual is the data steward, the person who best understands the nature of the data within the business scenario.

● Within IDQ, the Profile Standardizer is a powerful tool for parsing unsorted data into the correct fields. However, when using the Profile Standardizer, be aware that there is a finite number of profiles (500) that can be contained within a cleansing plan. Users can extend the number of profiles by using the first 500 profiles within one component and then feeding the data overflow into a second Profile Standardizer via the Token Parser component.

After the data is parsed and labeled, it should be evident if reference dictionaries will be needed to further standardize the data. It may take several iterations of dictionary construction and review before the data is standardized to an acceptable level. Once acceptable standardization has been achieved, data quality scorecard or dashboard reporting can be introduced. For information on dashboard reporting, see the Report Viewer chapter of the Informatica Data Quality 3.1 User Guide.


Discovering Business Rules

At this point, the business user may discover and define business rules applicable to the data. These rules should be documented and converted to logic that can be contained within a data quality plan. When building a data quality plan, be sure to group related business rules together in a single rules component whenever possible; otherwise the plan may become very difficult to read. If there are rules that do not lend themselves easily to regular IDQ components (e.g., when standardizing product data), it may be necessary to perform some custom scripting using IDQ’s scripting component. This requirement may arise when a string or an element within a string needs to be treated as an array.

Standard and Third-Party Reference Data

Reference data can be a useful tool when standardizing data. Terms with variant formats or spellings can be standardized to a single form. IDQ installs with several reference dictionary files that cover common name and address and business terms. The illustration below shows part of a dictionary of street address suffixes.

Common Issues when Cleansing and Standardizing Data

If the customer has expectations of a bureau-style service, it may be advisable to re-emphasize the score-carding and graded-data approach to cleansing and standardizing. This helps to ensure that the customer develops reasonable expectations of what can be achieved with the data set within an agreed-upon timeframe.


Standardizing Ambiguous Data

Data values can often appear ambiguous, particularly in name and address data where name, address, and premise values can be interchangeable. For example, Hill, Park, and Church are all common surnames. In some cases, the position of the value is important. “ST” can be a suffix for street or a prefix for Saint, and sometimes they can both occur in the same string.

The address string “St Patrick’s Church, Main St” can reasonably be interpreted as “Saint Patrick’s Church, Main Street.” In this case, if the delimiter is a space (thus ignoring any commas and periods), the string has five tokens. You may need to write business rules using the IDQ Scripting component, as you are treating the string as an array: “St” at position 1 within the string would be standardized to meaning_1 (Saint), whereas “St” at position 5 would be standardized to meaning_2 (Street). Each data value can then be compared to discrete prefix and suffix dictionaries.
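
The position-based logic can be sketched outside of IDQ in a few lines of Python (illustrative only; the IDQ Scripting component would express this differently, and the dictionaries here are trimmed to the example).

    # Illustrative position-based standardization of the "St" ambiguity: the
    # first token is checked against a prefix dictionary and the last token
    # against a suffix dictionary, mirroring the approach described above.
    PREFIX_DICT = {"ST": "Saint"}
    SUFFIX_DICT = {"ST": "Street", "AVE": "Avenue"}

    def standardize_address(address: str) -> str:
        tokens = address.replace(",", " ").replace(".", " ").split()
        if tokens and tokens[0].upper() in PREFIX_DICT:
            tokens[0] = PREFIX_DICT[tokens[0].upper()]
        if tokens and tokens[-1].upper() in SUFFIX_DICT:
            tokens[-1] = SUFFIX_DICT[tokens[-1].upper()]
        return " ".join(tokens)

    print(standardize_address("St Patrick's Church, Main St"))
    # Saint Patrick's Church Main Street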

Conclusion

Using the data cleansing and standardization techniques described in this Best Practice can help an organization to recognize the value of incorporating IDQ into their development methodology. Because data quality is an iterative process, the business rules initially developed may require ongoing modification, as the results produced by IDQ will be affected by the starting condition of the data and the requirements of the business users.

When data arrives in multiple languages, it is worth creating similar IDQ plans for each country and applying the same rules across these plans. The data would typically be staged in a database, and the plans developed using a SQL statement as input, with a “where country_code= ‘DE’” clause, for example. Country dictionaries are identifiable by country code to facilitate such statements. Remember that IDQ installs with a large set of reference dictionaries and additional dictionaries are available from Informatica.

IDQ provides several components that focus on verifying and correcting the accuracy of name and postal address data. These components leverage address reference data that originates from national postal carriers such as the United States Postal Service. Such datasets enable IDQ to validate an address to premise level. Please note, the reference datasets are licensed and installed as discrete Informatica products, and thus it is important to discuss their inclusion in the project with the business in advance so as to avoid budget and installation issues. Several types of reference data, with differing levels of address granularity, are available from Informatica. Pricing for the licensing of these components may vary and should be discussed with the Informatica Account Manager.

Last updated: 01-Feb-07 18:52


Managing Internal and External Reference Data

Challenge

To provide guidelines for the development and management of the reference data sources that can be used with data quality plans in Informatica Data Quality (IDQ). The goal is to ensure the smooth transition from development to production for reference data files and the plans with which they are associated.

Description

Reference data files can be used by a plan to verify or enhance the accuracy of the data inputs to the plan. A reference data file is a list of verified-correct terms and, where appropriate, acceptable variants on those terms. It may be a list of employees, package measurements, or valid postal addresses — any data set that provides an objective reference against which project data sources can be checked or corrected. Reference files are essential to some, but not all data quality processes.

Reference data can be internal or external in origin.

Internal data is specific to a particular project or client. Such data is typically generated from internal company information. It may be custom-built for the project.

External data has been sourced or purchased from outside the organization. External data is used when authoritative, independently-verified data is needed to provide the desired level of data quality to a particular aspect of the source data. Examples include the dictionary files that install with IDQ, postal address data sets that have been verified as current and complete by a national postal carrier, such as United States Postal Service, or company registration and identification information from an industry-standard source such as Dun & Bradstreet.

Reference data can be stored in a file format recognizable to Informatica Data Quality or in a format that requires intermediary (third-party) software in order to be read by Informatica applications.

Internal data files, as they are often created specifically for data quality projects, are typically saved in the dictionary file format or as delimited text files, which are easily portable into dictionary format. Databases can also be used as a source for internal data.

External files are more likely to remain in their original format. For example, external data may be contained in a database or in a library whose files cannot be edited or opened on the desktop to reveal discrete data values.

Working with Internal Data

Obtaining Reference Data

Most organizations already possess much information that can be used as reference data — for example, employee tax numbers or customer names. These forms of data may or may not be part of the project source data, and they may be stored in different parts of the organization.


The question arises: are internal data sources sufficiently reliable for use as reference data? Bear in mind that in some cases the reference data does not need to be 100 percent accurate. It can be good enough to compare project data against reference data and to flag inconsistencies between them, particularly in cases where both sets of data are highly unlikely to share common errors.

Saving the Data in .DIC File Format

IDQ installs with a set of reference dictionaries that have been created to handle many types of business data. These dictionaries are created using a proprietary .DIC file name extension. DIC is abbreviated from dictionary, and dictionary files are essentially comma delimited text files.

You can create a new dictionary in three ways:

● You can save an appropriately formatted delimited file as a .DIC file into the Dictionaries folders of your IDQ (client or server) installation.

● You can use the Dictionary Manager within Data Quality Workbench. This method allows you to create text and database dictionaries.

● You can write from plan files directly to a dictionary using the IDQ Report Viewer (see below).

The figure below shows a dictionary file open in IDQ Workbench and its underlying .DIC file open in a text editor. Note that the dictionary file has at least two columns of data. The Label column contains the correct or standardized form of each datum from the dictionary’s perspective. The Item columns contain versions of each datum that the dictionary recognizes as identical to or coterminous with the Label entry. Therefore, each datum in the dictionary must have at least two entries in the DIC file (see the text editor illustration below). A dictionary can have multiple Item columns.

To edit a dictionary value, open the DIC file and make your changes. You can make changes either through a text editor or by opening the dictionary in the Dictionary Manager.


To add a value to a dictionary, open the DIC file in Dictionary Manager, place the cursor in an empty row, and add a Label string and at least one Item string. You can also add values in a text editor by placing the cursor on a new line and typing Label and Item values separated by commas.

Once saved, the dictionary is ready for use in IDQ.
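
As a purely illustrative example (the values are invented, but the layout follows the Label-plus-Item structure described above), a small street-suffix dictionary saved as a .DIC file might contain rows such as the following, where the first value on each line is the Label and the remaining values are Items:

    STREET,ST,STR
    AVENUE,AVE,AV
    BOULEVARD,BLVD,BOUL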

Note: IDQ users with database expertise can create and specify dictionaries that are linked to database tables, and that thus can be updated dynamically when the underlying data is updated. Database dictionaries are useful when the reference data has been originated for other purposes and is likely to change independently of data quality. By making use of a dynamic connection, data quality plans can always point to the current version of the reference data.

Sharing Reference Data Across the Organization

As you can publish or export plans from a local Data Quality repository to server repositories, so you can copy dictionaries across the network. The File Manager within IDQ Workbench provides an Explorer-like mechanism for moving files to other machines across the network.

Bear in mind that Data Quality looks for .DIC files in pre-set locations within the IDQ installation when running a plan. By default, Data Quality relies on dictionaries being located in the following locations:

● The Dictionaries folders installed with Workbench and Server.
● The user’s file space in the Data Quality service domain.

IDQ does not recognize a dictionary file that is not in such a location, even if you can browse to the file when designing the data quality plan. Thus, any plan that uses a dictionary in a non-standard location will fail.

This is most relevant when you publish or export a plan to another machine on the network. You must ensure that copies of any dictionary files used in the local plan are available in a suitable location on the service domain — in the user space on the server, or at a location in the server’s Dictionaries folders that corresponds to the dictionaries’ location on Workbench — when the plan is copied to the server-side repository.

Note: You can change the locations in which IDQ looks for plan dictionaries by editing the config.xml file. However, this is the master configuration file for the product and you should not edit it without consulting Informatica Support. Bear in mind that Data Quality looks only in the locations set in the config.xml file.

Version Controlling Updates and Managing Rollout from Development to Production

Plans can be version-controlled during development in Workbench and when published to a domain repository. You can create and annotate multiple versions of a plan, and review/roll back to earlier versions when necessary.

Dictionary files are not version controlled by IDQ, however. You should define a process to log changes and back up your dictionaries, using version control software if possible or a manual method otherwise. If modifications are to be made to the versions of dictionary files installed by the software, it is recommended that these modifications be made to a copy of the original file, renamed or relocated as desired. This approach avoids the risk that a subsequent installation might overwrite changes.


Database reference data can also be version controlled, although this presents difficulties if the database is very large in size. Bear in mind that third-party reference data, such as postal address data, should not ordinarily be changed, and so the need for a versioning strategy for these files is debatable.

Working with External Data

Formatting Data into Dictionary Format

External data may or may not permit the copying of data into text format — for example, external data contained in a database or in library files. Currently, third-party postal address validation data is provided to Informatica users in this manner, and IDQ leverages software from the vendor to read these files. (The third-party software has a very small footprint.) However, some externally supplied files are amenable to having their data extracted to file format.

Obtaining Updates for External Reference Data

External data vendors produce regular data updates, and it’s vital to refresh your external reference data when updates become available. The key advantage of external data — its reliability — is lost if you do not apply the latest files from the vendor. If you obtained third-party data through Informatica, you will be kept up to date with the latest data as it becomes available for as long as your data subscription warrants. You can check that you possess the latest versions of third-party data by contacting your Informatica Account Manager.

Managing Reference Updates and Rolling Out Across the Organization

If your organization has a reference data subscription, you will receive either regular data files on compact disc or regular information on how to download data from Informatica or vendor web sites. You must develop a strategy for distributing these updates to all parties who run plans with the external data. This may involve installing the data on machines in a service domain.

Bear in mind that postal address data vendors update their offerings every two or three months, and that a significant percentage of postal addresses can change in such time periods.

You should plan for the task of obtaining and distributing updates in your organization at frequent intervals. Depending on the number of IDQ installations that must be updated, updating your organization with third-party reference data can be a sizable task.

Strategies for Managing Internal and External Reference Data

Experience working with reference data leads to a series of best practice tips for creating and managing reference data files.

Using Workbench to Build Dictionaries

With IDQ Workbench, you can select data fields or columns from a dataset and save them in a dictionary-compatible format.


Let’s say you have designed a data quality plan that identifies invalid or anomalous records in a customer database. Using IDQ, you can create an exception file of these bad records, and subsequently use this file to create a dictionary-compatible file.

For example, let’s say you have an exception file containing suspect or invalid customer account records. Using a very simple data quality plan, you can quickly parse the account numbers from this file to create a new text file containing the account serial numbers only. This file effectively constitutes the labels column of your dictionary.

By opening this file in Microsoft Excel or a comparable program and copying the contents of Column A into Column B, and then saving the spreadsheet as a CSV file, you create a file with Label and Item1 columns. Rename the file with a .DIC suffix and add it to the Dictionaries folder of your IDQ installation: the dictionary is now visible to the IDQ Dictionary Manager. You now have a dictionary file of bad account numbers that you can use in any plans checking the validity of the organization's account records.
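
The same steps can also be scripted. The Python sketch below is illustrative only: the file names are hypothetical, and it assumes the exception file holds one account number per line in its first column.

    # Illustrative script: build a comma-delimited .DIC file (Label plus Item1)
    # from a one-column exception file of account numbers. File names and the
    # input layout are assumptions for this example.
    import csv

    def exceptions_to_dic(exception_file: str, dic_file: str) -> None:
        with open(exception_file, newline="") as src, open(dic_file, "w", newline="") as dst:
            writer = csv.writer(dst)
            for row in csv.reader(src):
                if not row or not row[0].strip():
                    continue
                account_number = row[0].strip()
                # Label and Item1 are identical: the account number itself.
                writer.writerow([account_number, account_number])

    exceptions_to_dic("bad_account_numbers.csv", "bad_account_numbers.dic")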

Using Report Viewer to Build Dictionaries

The IDQ Report Viewer allows you to create exception files and dictionaries on-the-fly from report data. The figure below illustrates how you can drill-down into report data, right-click on a column, and save the column data as a dictionary file. This file will be populated with Label and Item1 entries corresponding to the column data.

In this case, the dictionary created is a list of serial numbers from invalid customer records (specifically, records containing bad zip codes). The plan designer can now create plans to check customer databases against these serial numbers. You can also append data to an existing dictionary file in this manner.

As a general rule, it is a best practice to follow the dictionary organization structure installed by the application, adding to that structure as necessary to accommodate specialized and supplemental dictionaries. Subsequent users are then relieved of the need to examine the config.xml file for possible modifications, thereby lowering the risk of accidental errors during migration. When following the original dictionary organization structure is not practical or contravenes other requirements, take care to document the customizations.

Since external data may be obtained from third parties and may not be in file format, the most efficient way to share its content across the organization is to locate it on the Data Quality Server machine. (Specifically, this is the machine that hosts the Execution Service.)

Moving Dictionary Files After IDQ Plans are Built

This is a similar issue to that of sharing reference data across the organization. If you must move or relocate your reference data files post-plan development, you have three options:

● You can reset the location to which IDQ looks by default for dictionary files.

● You can reconfigure the plan components that employ the dictionaries to point to the new location. Depending on the complexity of the plan concerned, this can be very labor-intensive.

● If deploying plans in a batch or scheduled task, you can append the new location to the plan execution command. You can do this by appending a parameter file to the plan execution instructions on the command line. The parameter file is an XML file that can contain a simple command to use one file path instead of another.

Last updated: 08-Feb-07 17:09


Testing Data Quality Plans

Challenge

To provide a guide to testing data quality processes or plans created using Informatica Data Quality (IDQ), and to manage some of the unique complexities associated with data quality plans.

Description

Testing data quality plans is an iterative process that occurs as part of the Design Phase of Velocity. That is, plan testing often precedes the project’s main testing activities, as the tested plan outputs will be used as inputs in the Build Phase. It is not necessary to formally test the plans used in the Analyze Phase of Velocity.

The development of data quality plans typically follows a prototyping methodology of create, execute, analyze. Testing is performed as part of the third step, in order to determine that the plans are being developed in accordance with design and project requirements. This method of iterative testing helps support rapid identification and resolution of bugs.

Bear in mind that data quality plans are designed to analyze and resolve data content issues. These are not typically cut-and-dried problems; more often they represent a continuum of data improvement issues, where every data instance may be unique and where there is a target level of data quality rather than a “right or wrong answer”. Data quality plans tend to resolve problems in terms of percentages and probabilities that a problem is fixed. For example, the project may set a target of 95 percent accuracy in its customer addresses.

Common Questions in Data Quality Plan Testing

● What dataset will you use to test the plans? While the ideal situation is to use a data set that exactly mimics the project production data, you may not gain access to this data. If you obtain a full cloned set of the project data for testing purposes, bear in mind that some plans (specifically, some data matching plans) can take several hours to complete. Consider testing data matching plans overnight.

● Are the plans using reference dictionaries? Reference dictionary management is an important factor since it is possible to make changes to a reference dictionary independently of IDQ and without making any changes to the plan itself. When you pass an IDQ plan as tested, you must ensure that no additional work is carried out on any dictionaries referenced in the plan. Moreover, you must ensure that the dictionary files reside in locations that are valid for IDQ.

● How will the plans be executed? Will they be executed on a remote IDQ Server, and/or via a scheduler? In cases like these, it’s vital to ensure that your plan resources, including source data files and reference data files, are in valid locations for use by the Data Quality engine. For details on the local and remote locations to which IDQ looks for source and reference data files, refer to the Informatica Data Quality 3.1 User Guide.

● Will the plans be integrated into a PowerCenter transformation? If so, the plans must have realtime-enabled data source and sink components.

Strategies for Testing Data Quality Plans

The best practice steps for testing plans can be grouped under two headings.

Testing to Validate Rules

1. Identify a small, representative sample of source data.

2. Manually process the data, based on the rules for profiling, standardization or matching that the plans will apply, to determine the results expected when the plans are run.

3. Execute the plans on the test dataset, and validate the plan results against the manually-derived results (a minimal comparison sketch follows this list).
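The comparison in step 3 can be automated once the manually-derived results are available. The sketch below is a minimal illustration, assuming both the expected results and the plan output are CSV files keyed by a record identifier column; the file names and the record_id column are placeholders, not IDQ conventions.

```python
import csv
from pathlib import Path

def load_results(path: Path, key_column: str) -> dict:
    """Read a CSV file into a dict keyed by the record identifier."""
    with path.open(newline="") as handle:
        return {row[key_column]: row for row in csv.DictReader(handle)}

def compare_results(expected_file: Path, actual_file: Path, key_column: str = "record_id") -> None:
    """Report records where the plan output differs from the manually-derived output."""
    expected = load_results(expected_file, key_column)
    actual = load_results(actual_file, key_column)
    mismatches = 0
    for key, expected_row in expected.items():
        actual_row = actual.get(key)
        if actual_row != expected_row:
            mismatches += 1
            print(f"Mismatch for record {key}: expected {expected_row}, got {actual_row}")
    print(f"{mismatches} mismatching record(s) out of {len(expected)}")

if __name__ == "__main__":
    compare_results(Path("expected_results.csv"), Path("plan_output.csv"))
```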

Testing to Validate Plan Effectiveness

This process is concerned with establishing that a data enhancement plan has been properly designed; that is, that the plan delivers the required improvements in data quality.

This is largely a matter of comparing the business and project requirements for data quality and establishing if the plans are on course to deliver these. If not, the plans may need a thorough redesign – or the business and project targets may need to be revised.

Last updated: 01-Feb-07 18:52


Tuning Data Quality Plans

Challenge

This document gives an insight into the type of considerations and issues a user needs to be aware of when making changes to data quality processes defined in Informatica Data Quality (IDQ). In IDQ, data quality processes are called plans.

The principal focus of this best practice is how to tune your plans without adversely affecting the plan logic. It is not intended to replace training materials, but to serve as a guide for decision making in the areas of adding, removing, or changing the operational components that comprise a data quality plan.

Description

You should consider the following questions prior to making changes to a data quality plan:

● What is the purpose of changing the plan? You should consider changing a plan if you believe it is not optimally configured, if it is not functioning properly and there is a problem at execution time, or if it is not delivering the results expected under its design principles.

● Are you trained to change the plan? Data quality plans can be complex. You should not alter a plan unless you have been trained or are highly experienced with IDQ methodology.

● Is the plan properly documented? You should ensure all plan documentation on the data flow and the data components are up-to-date. For guidelines on documenting IDQ plans, see the Sample Deliverable Data Quality Plan Design.

● Have you backed up the plan before editing? If you are using IDQ in a client-server environment, you can create a baseline version of the plan using IDQ version control functionality. In addition, you should copy the plan to a new project folder (viz., Work_Folder) in the Workbench for changing and testing, and leave the original plan untouched during testing.

● Is the plan operating directly on production data? This applies especially to standardization plans. When editing a plan, always work on staged data (database or flat-file). You can later migrate the plan to the production environment after complete and thorough testing.


You should have a clear goal whenever you plan to change an existing plan. An event may prompt the change: for example, input data changing (in format or content), or changes in business rules or business/project targets. You should take into account all current change-management procedures, and the updated plans should be thoroughly tested before production processes are updated. This includes integration and regression testing too. (See also Testing Data Quality Plans.)

Bear in mind that at a high level there are two types of data quality plans: data analysis and data enhancement plans.

● Data analysis plans produce reports on data patterns and data quality across the input data. The key objective in data analysis is to determine the levels of completeness, conformity, and consistency in the dataset. In pursuing these objectives, data analysis plans can also identify cases of missing, inaccurate or “noisy” data.

● Data enhancement plans correct completeness, conformity, and consistency problems; they can also identify duplicate data entries and fix accuracy issues through the use of reference data.

Your goal in a data analysis plan is to discover the quality and usability of your data. It is not necessarily your goal to obtain the best scores for your data. Your goal in a data enhancement plan is to resolve the data quality issues discovered in the data analysis.

Adding Components

In general, simply adding a component to a plan is not likely to directly affect results if no further changes are made to the plan. However, once the outputs from the new component are integrated into existing components, the data process flow is changed and the plan must be re-tested and its results reviewed in detail before migrating the plan into production.

Bear in mind, particularly in data analysis plans, that improved plan statistics do not always mean that the plan is performing better. It is possible to configure a plan that moves “beyond the point of truth” by focusing on certain data elements and excluding others.

When added to existing plans, some components have a larger impact than others. For example, adding a “To Upper” component to convert text into upper case may not cause the plan results to change meaningfully, although the presentation of the output data will change. However, adding and integrating a Rule Based Analyzer component (designed to apply business rules) may cause a severe impact, as the rules are likely to change the plan logic.

As well as adding a new component — that is, a new icon — to the plan, you can add a new instance to an existing component. This can have the same effect as adding and integrating a new component icon. To avoid overloading a plan with too many components, it is a good practice to add multiple instances to a single component, within reason. Good plan design suggests that instances within a single component should be logically similar and work on the selected inputs in similar ways. If you add a new instance to a component, and that instance behaves very differently to the other instances in that component — for example, if it acts on an unrelated set of outputs or performs an unrelated type of action on the data — you should probably add a new component for this instance. This will also help you keep track of your changes onscreen.

To avoid making plans over-complicated, it is often a good practice to split tasks into multiple plans where a large number of data quality measures need to be checked. This makes plans and business rules easier to maintain and provides a good framework for future development. For example, in an environment where a large number of attributes must be evaluated against the six standard data quality criteria (i.e., completeness, conformity, consistency, accuracy, duplication, and consolidation), using one plan per data quality criterion may be a good way to move forward. Alternatively, splitting plans up by data entity may be advantageous. Similarly, during standardization, you can create plans for specific functional areas (e.g., address, product, or name) as opposed to adding all standardization tasks to a single large plan.

For more information on the six standard data quality criteria, see Data Cleansing.

Removing Components

Removing a component from a plan is likely to have a major impact since, in most cases, data flow in the plan will be broken. If you remove an integrated component, configuration changes will be required to all components that use the outputs from the component. The plan cannot run without these configuration changes being completed.

The only exceptions to this case are when the output(s) of the removed component are used solely by a CSV Sink component or by a frequency component. However, even in these cases, you must note that the plan output changes, since the column(s) no longer appear in the result set.

Editing Component Configurations


Changing the configuration of a component can have an impact on the overall plan comparable to adding or removing a component – the plan’s logic changes, and therefore, so do the results that it produces. However, although adding or removing a component may make a plan non-executable, changing the configuration of a component can impact the results in more subtle ways. For example, changing the reference dictionary used by a parsing component does not “break” a plan, but may have a major impact on the resulting output.

Similarly, changing the name of a component instance output does not break a plan. By default, component output names “cascade” through the other components in the plan, so when you change an output name, all subsequent components automatically update with the new output name. It is not necessary to change the configuration of dependent components.

Last updated: 01-Feb-07 18:52


Using Data Explorer for Data Discovery and Analysis

Challenge

To understand and make full use of Informatica Data Explorer’s potential to profile and define mappings for your project data.

Data profiling and mapping provide a firm foundation for virtually any project involving data movement, migration, consolidation or integration, from data warehouse/data mart development, ERP migrations, and enterprise application integration to CRM initiatives and B2B integration. These types of projects rely on an accurate understanding of the true structure of the source data in order to correctly transform the data for a given target database design. However, the data’s actual form rarely coincides with its documented or supposed form.

The key to success for data-related projects is to fully understand the data as it actually is, before attempting to cleanse, transform, integrate, mine, or otherwise operate on it. Informatica Data Explorer is a key tool for this purpose.

This Best Practice describes how to use Informatica Data Explorer (IDE) in data profiling and mapping scenarios.

Description

Data profiling and data mapping involve a combination of automated and human analyses to reveal the quality, content and structure of project data sources. Data profiling analyzes several aspects of data structure and content, including characteristics of each column or field, the relationships between fields, and the commonality of data values between fields— often an indicator of redundant data.

Data Profiling

Data profiling involves the explicit analysis of source data and the comparison of observed data characteristics against data quality standards. Data quality and integrity issues include invalid values, multiple formats within a field, non-atomic fields (such as long address strings), duplicate entities, cryptic field names, and others. Quality standards may either be the native rules expressed in the source data’s metadata, or an external standard (e.g., corporate, industry, or government) to which the source data must be mapped in order to be assessed.


Data profiling in IDE is based on two main processes:

● Inference of characteristics from the data

● Comparison of those characteristics with specified standards, as an assessment of data quality

Data mapping involves establishing relationships among data elements in various data structures or sources, in terms of how the same information is expressed or stored in different ways in different sources. By performing these processes early in a data project, IT organizations can preempt the “code/load/explode” syndrome, wherein a project fails at the load stage because the data is not in the anticipated form.

Data profiling and mapping are fundamental techniques applicable to virtually any project. The following figure summarizes and abstracts these scenarios into a single depiction of the IDE solution.

The overall process flow for the IDE Solution is as follows:


1. Data and metadata are prepared and imported into IDE.

2. IDE profiles the data, generates accurate metadata (including a normalized schema), and documents cleansing and transformation requirements based on the source and normalized schemas.

3. The resultant metadata are exported to and managed in the IDE Repository.

4. In a derived-target scenario, the project team designs the target database by modeling the existing data sources and then modifying the model as required to meet current business and performance requirements. In this scenario, IDE is used to develop the normalized schema into a target database. The normalized and target schemas are then exported to IDE’s FTM/XML tool, which documents transformation requirements between fields in the source, normalized, and target schemas. OR

5. In a fixed-target scenario, the design of the target database is a given (i.e., because another organization is responsible for developing it, or because an off-the-shelf package or industry standard is to be used). In this scenario, the schema development process is bypassed. Instead, FTM/XML is used to map the source data fields to the corresponding fields in an externally-specified target schema, and to document transformation requirements between fields in the normalized and target schemas. FTM is used for SQL-based metadata structures, and FTM/XML is used to map SQL and/or XML-based metadata structures. Externally specified targets are typical for ERP package migrations, business-to-business integration projects, or situations where a data modeling team is independently designing the target schema.

6. The IDE Repository is used to export or generate reports documenting the cleansing, transformation, and loading or formatting specs developed with IDE applications.

IDE's Methods of Data Profiling

IDE employs three methods of data profiling:

Column profiling - infers metadata from the data for a column or set of columns. IDE infers both the most likely metadata and alternate metadata which is consistent with the data.


Table Structural profiling - uses the sample data to infer relationships among the columns in a table. This process can discover primary and foreign keys, functional dependencies, and sub-tables.

Cross-Table profiling - determines the overlap of values across a set of columns, which may come from multiple tables.
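As an illustration of what column profiling infers, the following simplified sketch reports a few basic characteristics (row and null counts, distinct values, maximum length, and whether all values are numeric) for each column of a sample CSV file. It is a stand-alone teaching example, not IDE's implementation, and the file name is a placeholder.

```python
import csv
from pathlib import Path

def profile_columns(csv_path: Path) -> None:
    """Infer simple per-column characteristics from sample data."""
    with csv_path.open(newline="") as handle:
        rows = list(csv.DictReader(handle))
    if not rows:
        print("No data to profile")
        return
    for column in rows[0]:
        values = [row[column] for row in rows]
        non_null = [v for v in values if v not in ("", None)]
        print(f"Column: {column}")
        print(f"  rows={len(values)}  nulls={len(values) - len(non_null)}")
        print(f"  distinct values={len(set(non_null))}")
        print(f"  max length={max((len(v) for v in non_null), default=0)}")
        print(f"  all numeric={all(v.replace('.', '', 1).isdigit() for v in non_null)}")

if __name__ == "__main__":
    profile_columns(Path("sample_source.csv"))
```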


Profiling against external standards requires that the data source be mapped to the standard before being assessed (as shown in the following figure). Note that the mapping is performed by IDE’s Fixed Target Mapping tool (FTM). IDE can also be used in the development and application of corporate standards, making them relevant to existing systems as well as to new systems.

Data profiling projects may involve iterative profiling and cleansing as well since data cleansing may improve the quality of the results obtained through dependency and redundancy profiling. Note that Informatica Data Quality should be considered as an alternative tool for data cleansing.

IDE and Fixed-Target Migration

Fixed-target migration projects involve the conversion and migration of data from one or more sources to an externally defined or fixed target. IDE is used to profile the data and develop a normalized schema representing the data source(s), while IDE’s Fixed Target Mapping tool (FTM) is used to map from the normalized schema to the fixed target.

The general sequence of activities for a fixed-target migration project, as shown in the figure below, is as follows:

1. Data is prepared for IDE. Metadata is imported into IDE.

2. IDE profiles the data, generates accurate metadata (including a normalized schema), and documents cleansing and transformation requirements based on the source and normalized schemas. The cleansing requirements can be reviewed and modified by the Data Quality team.

3. The resultant metadata are exported to and managed by the IDE Repository.

4. FTM maps the source data fields to the corresponding fields in an externally specified target schema, and documents transformation requirements between fields in the normalized and target schemas. Externally-specified targets are typical for ERP migrations or projects where a data modeling team is independently designing the target schema.

5. The IDE Repository is used to export or generate reports documenting the cleansing, transformation, and loading or formatting specs developed with IDE and FTM.

6. The cleansing, transformation, and formatting specs can be used by the application development or Data Quality team to cleanse the data, implement any required edits and integrity management functions, and develop the transforms or configure an ETL product to perform the data conversion and migration.

The following screen shot shows how IDE can be used to generate a suggested normalized schema, which may discover ‘hidden’ tables within tables.


Depending on the staging architecture used, IDE can generate the data definition language (DDL) needed to establish several of the staging databases between the sources and target, as shown below:

Derived-Target Migration

Derived-target migration projects involve the conversion and migration of data from one or more sources to a target database defined by the migration team. IDE is used to profile the data and develop a normalized schema representing the data source(s), and to further develop the normalized schema into a target schema by adding tables and/or fields, eliminating unused tables and/or fields, changing the relational structure, and/or denormalizing the schema to enhance performance. When the target schema is developed from the normalized schema within IDE, the product automatically maintains the mappings from the source to normalized schema, and from the normalized to target schemas.

The figure below shows that the general sequence of activities for a derived-target migration project is as follows:


1. Data is prepared for IDE. Metadata is imported into IDE.

2. IDE is used to profile the data, generate accurate metadata (including a normalized schema), and document cleansing and transformation requirements based on the source and normalized schemas. The cleansing requirements can be reviewed and modified by the Data Quality team.

3. IDE is used to modify and develop the normalized schema into a target schema. This generally involves removing obsolete or spurious data elements, incorporating new business requirements and data elements, adapting to corporate data standards, and denormalizing to enhance performance.

4. The resultant metadata are exported to and managed by the IDE Repository.

5. FTM is used to develop and document transformation requirements between the normalized and target schemas. The mappings between the data elements are automatically carried over from the IDE-based schema development process.

6. The IDE Repository is used to export an XSLT document containing the transformation and formatting specs developed with IDE and FTM/XML.

7. The cleansing, transformation, and formatting specs are used by the application development or Data Quality team to cleanse the data, implement any required edits and integrity management functions, and develop the transforms or configure an ETL product to perform the data conversion and migration.

Last updated: 09-Feb-07 12:55


Working with Pre-Built Plans in Data Cleanse and Match

Challenge

To provide a set of best practices for users of the pre-built data quality processes designed for use with the Informatica Data Cleanse and Match (DC&M) product offering.

Informatica Data Cleanse and Match is a cross-application data quality solution that installs two components to the PowerCenter system:

● Data Cleanse and Match Workbench, the desktop application in which data quality processes - or plans - can be designed, tested, and executed. Workbench installs with its own Data Quality repository, where plans are stored until needed.

● Data Quality Integration, a plug-in component that integrates Informatica Data Quality and PowerCenter. The plug-in adds a transformation to PowerCenter, called the Data Quality Integration transformation; PowerCenter Designer users can connect to the Data Quality repository and read data quality plan information into this transformation.

Informatica Data Cleanse and Match has been developed to work with Content Packs developed by Informatica. This document focuses on the plans that install with the North America Content Pack, which was developed in conjunction with the components of Data Cleanse and Match. The North America Content Pack delivers data parsing, cleansing, standardization, and de-duplication functionality to United States and Canadian name and address data through a series of pre-built data quality plans and address reference data files.

This document focuses on the following areas:

● when to use one plan vs. another for data cleansing.

● what behavior to expect from the plans.

● how best to manage exception data.

Description

The North America Content Pack installs several plans to the Data Quality Repository:

● Plans 01-04 are designed to parse, standardize, and validate United States name and address data.

● Plans 05-07 are designed to enable single-source matching operations (identifying duplicates within a data set) or dual-source matching operations (identifying matching records between two datasets).

The processing logic for data matching is split between PowerCenter and Informatica Data Quality (IDQ) applications.

Plans 01-04: Parsing, Cleansing, and Validation

These plans provide modular solutions for name and address data. The plans can operate on highly unstructured and well-structured data sources. The level of structure contained in a given data set determines the plan to be used.

The following diagram demonstrates how the level of structure in address data maps to the plans required to standardize and validate an address.


In cases where the address is well structured and specific data elements (i.e., city, state, and zip) are mapped to specific fields, only the address validation plan may be required. Where the city, state, and zip are mapped to address fields, but not specifically labeled as such (e.g., as Address1 through Address5), a combination of the address standardization and validation plans is required. In extreme cases, where the data is not mapped to any address columns, a combination of the general parser, address standardization, and validation plans may be required to obtain meaning from the data.

The purpose of making the plans modular is twofold:

● It is possible to apply these plans on an individual basis to the data. There is no requirement that the plans be run in sequence with each other. For example, the address validation plan (plan 03) can be run successfully to validate input addresses discretely from the other plans. In fact, the Data Quality Developer will not run all seven plans consecutively on the same dataset. Plans 01 and 02 are not designed to operate in sequence, nor are plans 06 and 07.

● Modular plans facilitate faster performance. Designing a single plan to perform all the processing tasks contained in the seven plans, even if it were desirable from a functional point of view, would result in significant performance degradation and extremely complex plan logic that would be difficult to modify and maintain.

01 General Parser

The General Parser plan was developed to handle highly unstructured data and to parse it into type-specific fields. For example, consider data stored in the following format:

Field1           | Field2           | Field3           | Field4              | Field5
100 Cardinal Way | Informatica Corp | CA 94063         | [email protected]   | Redwood City
Redwood City     | 38725            | 100 Cardinal Way | CA 94063            | [email protected]

While it is unusual to see data fragmented and spread across a number of fields in this way, it can and does happen. In cases such as this, data is not stored in any specific fields. Street addresses, email addresses, company names, and dates are scattered throughout the data. Using a combination of dictionaries and pattern recognition, the General Parser plan sorts such data into type-specific fields of address, names, company names, Social Security Numbers, dates, telephone numbers, and email addresses, depending on the profile of the content. As a result, the above data will be parsed into the following format:


Address1         | Address2         | Address3     | E-mail            | Date       | Company
100 Cardinal Way | CA 94063         | Redwood City | [email protected] |            | Informatica Corp
Redwood City     | 100 Cardinal Way | CA 94063     | [email protected] | 08/01/2006 |

The General Parser does not attempt to apply any structure or meaning to the data. Its purpose is to identify and sort data by information type. As the above example demonstrates, the address fields are labeled as addresses, but their contents are not arranged in a standard address format; they are flagged as addresses in the order in which they were processed in the file.

The General Parser does not attempt to validate the correctness of a field. For example, the dates are accepted as valid because they have a structure of symbols and numbers that represents a date. A value of 99/99/9999 would also be parsed as a date.

The General Parser does not attempt to handle multiple information types in a single field. For example, if a person name and address element are contained in the same field, the General Parser would label the entire field either a name or an address - or leave it unparsed - depending on the elements in the field it can identify first (if any).

While the General Parser does not make any assumption about the data prior to parsing, it parses based on the elements of data that it can make sense of first. In cases where no elements of information can be labeled, the field is left in a pipe-delimited form containing unparsed data.
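The dictionary-plus-pattern approach can be pictured with the simplified sketch below. It is not the plan's actual logic: it classifies values by pattern shape only (email, date, Social Security Number, telephone), falls back to an unparsed category, and omits the dictionary lookups the General Parser uses for names, companies, and addresses. It also shows why a value such as 99/99/9999 is still flagged as a date.

```python
import re

# Pattern-only checks in the spirit of the General Parser's type sorting.
# Dictionary-based recognition of names, companies, and addresses is omitted.
PATTERNS = {
    "EMAIL": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "DATE": re.compile(r"^\d{1,2}/\d{1,2}/\d{4}$"),   # shape only: 99/99/9999 still matches
    "SSN": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    "PHONE": re.compile(r"^\(?\d{3}\)?[ -]?\d{3}-\d{4}$"),
}

def classify(value: str) -> str:
    """Return the first information type whose pattern the value matches."""
    for info_type, pattern in PATTERNS.items():
        if pattern.match(value.strip()):
            return info_type
    return "UNPARSED"

if __name__ == "__main__":
    for sample in ["jdoe@example.com", "99/99/9999", "650-385-5000", "Informatica Corp"]:
        print(sample, "->", classify(sample))
```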

The General Parser’s ability to recognize various information types is a function of the dictionaries used to identify that data and the rules used to sort it. Adding or deleting dictionary entries can greatly affect the effectiveness of this plan.

Overall, the General Parser is likely to be used only in limited cases where certain types of information may be mixed together (e.g., telephone and email in the same contact field), or where the data has been badly managed, such as when several files of differing structures have been merged into a single file.

02 Name Standardization

The Name Standardization plan is designed to take in person name or company name information and apply parsing and standardization logic to it. Name Standardization follows two different tracks: one for person names and one for company names.

The plan input fields include two inputs for company names. Data that is entered in these fields is assumed to be a valid company name, and no additional tests are performed to validate that the data is an existing company name. Any combination of letters, numbers, and symbols can represent a company; therefore, in the absence of an external reference data source, further tests to validate a company name are not likely to yield usable results.

Any data entered into the company name fields is subjected to two processes. First, the company name is standardized using the Word Manager component, standardizing any company suffixes included in the field. Second, the standardized company name is matched against the company_names.dic dictionary, which returns the standardized Dun & Bradstreet company name, if found.

The second track for name standardization is person names standardization. While this track is dedicated to standardizing person names, it does not necessarily assume that all data entered here is a person name. Person names in North America tend to follow a set structure and typically do not contain company suffixes or digits. Therefore, values entered in this field that contain a company suffix or a company name are taken out of the person name track and moved to the company name track. Additional logic is applied to identify people whose last name is similar (or equal) to a valid company name (for example John Sears); inputs that contain an identified first name and a company name are treated as a person name.

If the company name track inputs are already fully populated for the record in question, then any company name detected in a person name column is moved to a field for unparsed company name output. If the name is not recognized as a company name (e.g., by the presence of a company suffix) but contains digits, the data is parsed into the non-name data output field. Any remaining data is accepted as being a valid person name and parsed as such.

North American person names are typically entered in one of two different styles: either in a “firstname middlename surname” format or “surname, firstname middlename” format. Name parsing algorithms have been built using this assumption.

Name parsing occurs in two passes. The first pass applies a series of dictionaries to the name fields, attempting to parse out name prefixes, name suffixes, firstnames, and any extraneous data (“noise”) present. Any remaining details are assumed to be middle name or surname details. A rule is applied to the parsed details to check if the name has been parsed correctly. If not, “best guess” parsing is applied to the field based on the possible assumed formats.

When name details have been parsed into first, last, and middle name formats, the first name is used to derive additional details including gender and the name prefix. Finally, using all parsed and derived name elements, salutations are generated.
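As a simplified illustration of this parsing approach (not the plan's actual algorithm), the sketch below handles the two common formats, strips a generational suffix, and derives gender and a salutation from a small stand-in first-name dictionary; the dictionary contents and the salutation mapping are placeholders.

```python
# Minimal stand-in dictionaries -- the real plan uses much larger reference data.
FIRST_NAMES = {"steven": "M", "shannon": "F", "roy": "M"}
SUFFIXES = {"jr", "sr", "ii", "iii"}

def parse_person_name(raw: str) -> dict:
    """Parse 'first middle last [suffix]' or 'last, first middle' style names."""
    if "," in raw:
        last, rest = [part.strip() for part in raw.split(",", 1)]
        tokens = rest.split()
    else:
        tokens = raw.split()
        last = ""
    suffix = ""
    if tokens and tokens[-1].rstrip(".").lower() in SUFFIXES:
        suffix = tokens.pop()
    if not last and tokens:
        last = tokens.pop()  # in the unreversed format, the final token is the surname
    first = tokens[0] if tokens else ""
    middle = " ".join(tokens[1:])
    gender = FIRST_NAMES.get(first.lower(), "")           # blank when indeterminate
    salutation = {"M": "Mr.", "F": "Ms."}.get(gender, "")
    return {"first": first, "middle": middle, "last": last,
            "suffix": suffix, "gender": gender, "salutation": salutation}

if __name__ == "__main__":
    for name in ["Shannon C. Prince", "Roy Jones Jr.", "King, Steven"]:
        print(parse_person_name(name))
```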

In cases where no clear gender can be generated from the first name, the gender field is typically left blank or indeterminate.

The salutation field is generated according to the derived gender information. This can be easily replicated outside the data quality plan if the salutation is not immediately needed as an output from the process (assuming the gender field is an output).

Depending on the data entered in the person name fields, certain companies may be treated as person names and parsed according to person name processing rules. Likewise, some person names may be identified as companies and standardized according to company name processing logic. This is typically a result of the dictionary content. If this is a significant problem when working with name data, some adjustments to the dictionaries and the rule logic for the plan may be required.

Non-name data encountered in the name standardization plan may be standardized as names depending on the contents of the fields. For example, an address datum such as “Corporate Parkway” may be standardized as a business name, as “Corporate” is also a business suffix. Any text data that is entered in a person name field is always treated as a person or company, depending on whether or not the field contains a recognizable company suffix in the text.

To ensure that the name standardization plan is delivering adequate results, Informatica strongly recommends pre- and post-execution analysis of the data.

Based on the following input:

ROW ID | IN NAME1
1      | Steven King
2      | Chris Pope Jr.
3      | Shannon C. Prince
4      | Dean Jones
5      | Mike Judge
6      | Thomas Staples
7      | Eugene F. Sears
8      | Roy Jones Jr.
9      | Thomas Smith, Sr
10     | Eddie Martin III
11     | Martin Luther King, Jr.
12     | Staples Corner
13     | Sears Chicago
14     | Robert Tyre
15     | Chris News

The following outputs are produced by the Name Standardization plan:


The last entry (Chris News) is identified as a company in the current plan configuration – such results can be refined by changing the underlying dictionary entries used to identify company and person names.

03 US Canada Standardization

This plan is designed to apply basic standardization processes to city, state/province, and zip/postal code information for United States and Canadian postal address data. The purpose of the plan is to deliver basic standardization to address elements where processing time is critical and one hundred percent validation is not possible due to time constraints. The plan also organizes key search elements into discrete fields, thereby speeding up the validation process.

The plan accepts up to six generic address fields and attempts to parse out city, state/province, and zip/postal code information. All remaining information is assumed to be address information and is absorbed into the address line 1-3 fields. Any information that cannot be parsed into the remaining fields is merged into the non-address data field.

The plan makes a number of assumptions that may or may not suit your data:

● When parsing city, state, and zip details, the address standardization dictionaries assume that these data elements are spelled correctly. Variation in town/city names is very limited, and in cases where punctuation differences exist or where town names are commonly misspelled, the standardization plan may not correctly parse the information.

● Zip codes are all assumed to be five-digit. In some files, zip codes that begin with “0” may lack this first number and so appear as four-digit codes, and these may be missed during parsing. Adding four-digit zips to the dictionary is not recommended, as these will conflict with the “Plus 4” element of a zip code. Zip codes may also be confused with other five-digit numbers in an address line, such as street numbers.

● City names are also commonly found in street names and other address elements. For example, “United” is part of a country (United States of America) and is also a town name in the U.S. Bear in mind that the dictionary parsing operates from right to left across the data, so that country name and zip code fields are analyzed before city names and street addresses. Therefore, the word “United” may be parsed and written as the town name for a given address before the actual town name datum is reached.

● The plan appends a country code to the end of a parsed address if it can identify it as U.S. or Canadian. Therefore, there is no need to include any country code field in the address inputs when configuring the plan.

Most of these issues can be dealt with, if necessary, by minor adjustments to the plan logic or to the dictionaries, or by adding some pre-processing logic to a workflow prior to passing the data into the plan.

The plan assumes that all data entered into it are valid address elements. Therefore, once city, state, and zip details have been parsed out, the plan assumes all remaining elements are street address lines and parses them in the order they occurred as address lines 1-3.
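A much-simplified picture of this right-to-left parsing is sketched below. It is not the plan's logic: it uses tiny placeholder dictionaries for states and cities, pulls a five-digit zip, then a state, then a city off the right-hand end of the concatenated address fields, and treats whatever remains as street address data.

```python
import re

# Placeholder reference data -- the real plan uses much larger dictionaries.
STATES = {"CA", "NY", "TX", "ON", "BC"}
CITIES = {"REDWOOD CITY", "TORONTO", "AUSTIN"}

def parse_address(fields: list) -> dict:
    """Pull zip, state, and city off the right-hand end of the address fields."""
    tokens = " ".join(fields).upper().split()
    zip_code = state = city = ""
    # Work from right to left: zip, then state, then city.
    if tokens and re.fullmatch(r"\d{5}(-\d{4})?", tokens[-1]):
        zip_code = tokens.pop()
    if tokens and tokens[-1] in STATES:
        state = tokens.pop()
    for size in (2, 1):  # try two-word city names before one-word names
        if len(tokens) >= size and " ".join(tokens[-size:]) in CITIES:
            city = " ".join(tokens[-size:])
            tokens = tokens[:-size]
            break
    return {"address": " ".join(tokens), "city": city, "state": state, "zip": zip_code}

if __name__ == "__main__":
    print(parse_address(["100 Cardinal Way", "Redwood City", "CA 94063"]))
```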

04 NA Address Validation

The purposes of the North America Address Validation plan are:

● To match input addresses against known valid addresses in an address database, and

● To parse, standardize, and enrich the input addresses.


Performing these operations is a resource-intensive process. Using the US Canada Standardization plan before the NA Address Validation plan helps to improve validation plan results in cases where city, state, and zip code information are not already in discrete fields. City, state, and zip are key search criteria for the address validation engine, and they need to be mapped into discrete fields. Not having these fields correctly mapped prior to plan execution leads to poor results and slow execution times.

The address validation APIs store specific area information in memory and continue to use that information from one record to the next, when applicable. Therefore, when running validation plans, it is advisable to sort address data by zip/postal code in order to maximize the usage of data in memory.

In cases where status codes, error codes, or invalid results are generated as plan outputs, refer to the Informatica Data Quality 3.1 User Guide for information on how to interpret them.

Plans 05-07: Pre-Match Standardization, Grouping, and Matching

These plans take advantage of PowerCenter and IDQ capabilities and are commonly used in pairs: either plans 05 and 06, or plans 05 and 07. The plans work as follows:

● 05 Match Standardization and Grouping. This plan is used to perform basic standardization and grouping operations on the data prior to matching.

● 06 Single Source Matching. Single source matching seeks to identify duplicate records within a single data set.

● 07 Dual Source Matching. Dual source matching seeks to identify duplicate records between two datasets.

Note that the matching plans are designed for use within a PowerCenter mapping and do not deliver optimal results when executed directly from IDQ Workbench. Note also that the Standardization and Matching plans are geared towards North American English data. Although they work with datasets in other languages, the results may be sub-optimal.

Matching Concepts

To ensure the best possible matching results and performance, match plans usually use a pre-processing step to standardize and group the data.

The aim of standardization here is different from that of a classic standardization plan – the intent is to ensure that different spellings, abbreviations, and the like are as similar to each other as possible in order to return a better match set. For example, without standardization, 123 Main Rd. and 123 Main Road will obtain an imperfect match score, although they clearly refer to the same street address.

Grouping, in a matching context, means sorting input records based on identical values in one or more user-selected fields. When a matching plan is run on grouped data, serial matching operations are performed on a group-by-group basis, so that data records within a group are matched but records across groups are not. A well-designed grouping plan can dramatically cut plan processing time while minimizing the likelihood of missed matches in the dataset.

Grouping performs two functions. It sorts the records in a dataset to increase matching plan performance, and it creates new data columns to provide group key options for the matching plan. (In PowerCenter, the Sorter transformation can organize the data to facilitate matching performance. Therefore, the main function of grouping in a PowerCenter context is to create candidate group keys. In both Data Quality and PowerCenter, grouping operations do not affect the source dataset itself.)

Matching on un-grouped data involves a large number of comparisons that realistically will not generate a meaningful quantity of additional matches. For example, when looking for duplicates in a customer list, there is little value in comparing the record for John Smith with the record for Angela Murphy as they are obviously not going to be considered as duplicate entries. The type of grouping used depends on the type of information being matched; in general, productive fields for grouping name and address data are location-based (e.g. city name, zip codes) or person/company based (surname and company name composites). For more information on grouping strategies for best result/performance relationship, see the Best Practice Effective Data Matching Techniques.

Plan 05 (Match Standardization and Grouping) performs cleansing and standardization operations on the data before group keys are generated. It offers a number of grouping options. The plan generates the following group keys:

● OUT_ZIP_GROUP: first 5 digits of ZIP code

● OUT_ZIP_NAME3_GROUP: first 5 digits of ZIP code and the first 3 characters of the last name

● OUT_ZIP_NAME5_GROUP: first 5 digits of ZIP code and the first 5 characters of the last name

● OUT_ZIP_COMPANY3_GROUP: first 5 digits of ZIP code and the first 3 characters of the cleansed company name

● OUT_ZIP_COMPANY5_GROUP: first 5 digits of ZIP code and the first 5 characters of the cleansed company name

The grouping output used depends on the data contents and data volume.
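For illustration, the sketch below derives comparable group keys outside the plan. It is a simplified stand-in rather than the plan's implementation; it assumes each record already carries cleansed zip, last-name, and company-name values, and that each key is the simple concatenation of the zip prefix and the name prefix.

```python
def build_group_keys(record: dict) -> dict:
    """Derive candidate group keys comparable to the plan 05 outputs listed above."""
    zip5 = (record.get("zip") or "").strip()[:5]
    last = (record.get("last_name") or "").strip().upper()
    company = (record.get("company") or "").strip().upper()
    return {
        "OUT_ZIP_GROUP": zip5,
        "OUT_ZIP_NAME3_GROUP": zip5 + last[:3],
        "OUT_ZIP_NAME5_GROUP": zip5 + last[:5],
        "OUT_ZIP_COMPANY3_GROUP": zip5 + company[:3],
        "OUT_ZIP_COMPANY5_GROUP": zip5 + company[:5],
    }

if __name__ == "__main__":
    sample = {"zip": "94063-1234", "last_name": "Prince", "company": "Informatica Corp"}
    print(build_group_keys(sample))
```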

Plans 06 Single Source Matching and 07 Dual Source Matching

Plans 06 and 07 are set up in similar ways and assume that person name, company name, and address data inputs will be used. However, in PowerCenter, plan 07 requires the additional input of a Source tag, typically generated by an Expression transform upstream in the PowerCenter mapping.

A number of matching algorithms are applied to the address and name elements. To ensure the best possible result, a weight-based component and a custom rule are applied to the outputs from the matching components. For further information on IDQ matching components, consult the Informatica Data Quality 3.1 User Guide.

By default, the plans are configured to write as output all records that match with an 85 percent or higher degree of certainty. The Data Quality Developer can easily adjust this figure in each plan.
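The weighting-and-threshold idea can be illustrated with the simplified sketch below. It is not the plan's matching logic: it scores candidate pairs within one group using a generic string-similarity measure, combines name and address scores with placeholder weights, and keeps pairs that score 85 percent or higher.

```python
from difflib import SequenceMatcher
from itertools import combinations

MATCH_THRESHOLD = 0.85                     # mirrors the default 85 percent output threshold
WEIGHTS = {"name": 0.6, "address": 0.4}    # placeholder weights

def similarity(a: str, b: str) -> float:
    """Generic string similarity in the range 0.0 - 1.0."""
    return SequenceMatcher(None, a.upper(), b.upper()).ratio()

def match_group(records: list) -> list:
    """Return pairs in one group whose weighted score meets the threshold."""
    matches = []
    for left, right in combinations(records, 2):
        score = (WEIGHTS["name"] * similarity(left["name"], right["name"])
                 + WEIGHTS["address"] * similarity(left["address"], right["address"]))
        if score >= MATCH_THRESHOLD:
            matches.append((left["id"], right["id"], round(score, 3)))
    return matches

if __name__ == "__main__":
    group = [
        {"id": 1, "name": "Shannon Prince", "address": "100 Cardinal Way Redwood City"},
        {"id": 2, "name": "Shannon C Prince", "address": "100 Cardinal Way, Redwood City"},
        {"id": 3, "name": "Dean Jones", "address": "55 Main Road Austin"},
    ]
    print(match_group(group))
```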

PowerCenter Mappings

When configuring the Data Quality Integration transformation for the matching plan, the Developer must select a valid grouping field.

To ensure best matching results, the PowerCenter mapping that contains plan 05 should include a Sorter transformation that sorts data according to the group key to be used during matching. This transformation should follow standardization and grouping operations. Note that a single mapping can contain multiple Data Quality Integration transformations, so that the Data Quality Developer or Data Integration Developer can add plan 05 to one Integration transformation and plan 06 or 07 to another in the same mapping. The standardization plan requires a passive transformation, whereas the matching plan requires an active transformation.


The developer can add a Sequencer transformation to the mapping to generate a unique identifier for each input record if these are not present in the source data. (Note that a unique identifier is not required for matching processes.)

When working with the dual source matching plan, additional PowerCenter transformations are required to pre-process the data for the Integration transformation. Expression transformations are used to label each input with a source tag of A and B respectively. The data from the two sources is then joined together using a Union transformation, before being passed to the Integration transformation containing the standardization and grouping plan. From here on, the mapping has the same design as the single source version.

Last updated: 09-Feb-07 13:18


Designing Data Integration Architectures

Challenge

Develop a sound data architecture that can serve as a foundation for a data integration solution that may evolve over many years.

Description

Historically, organizations have approached the development of a "data warehouse" or "data mart" as a departmental effort, without considering an enterprise perspective. The result has been silos of corporate data and analysis, which very often conflict with each other in terms of both detailed data and the business conclusions implied by it.

Taking an enterprise-wide, architectural stance in developing data integration solutions provides many advantages, including:

● A sound architectural foundation ensures the solution can evolve and scale with the business over time.

● Proper architecture can isolate the application component (business context) of the data integration solution from the technology.

● Lastly, architectures allow for reuse - reuse of skills, design objects, and knowledge.

As the evolution of data integration solutions (and the corresponding nomenclature) has progressed, the necessity of building these solutions on a solid architectural framework has become more and more clear. To understand why, a brief review of the history of data integration solutions and their predecessors is warranted.

Historical Perspective

Online Transaction Processing Systems (OLTPs) have always provided a very detailed, transaction-oriented view of an organization's data. While this view was indispensable for the day-to-day operation of a business, its ability to provide a "big picture" view of the operation, critical for management decision-making, was severely limited. Initial attempts to address this problem took several directions:

● Reporting directly against the production system. This approach minimized the effort associated with developing management reports, but introduced a number of significant issues:

❍ The nature of OLTP data is, by definition, "point-in-time." Thus, reports run at different times of the year, month, or even the day, were inconsistent with each other.

❍ Ad hoc queries against the production database introduced uncontrolled performance issues, resulting in slow reporting results and degradation of OLTP system performance.

❍ Trending and aggregate analysis was difficult (or impossible) with the detailed data available in the OLTP systems.

● Mirroring the production system in a reporting database. While this approach alleviated the performance degradation of the OLTP system, it did nothing to address the other issues noted above.

● Reporting databases. To address the fundamental issues associated with reporting against the OLTP schema, organizations began to move toward dedicated reporting databases. These databases were optimized for the types of queries typically run by analysts, rather than those used by systems supporting data entry clerks or customer service representatives. These databases may or may not have included pre-aggregated data, and took several forms, including traditional RDBMS as well as newer technology Online Analytical Processing (OLAP) solutions.

The initial attempts at reporting solutions were typically point solutions; they were developed internally to provide very targeted data to a particular department within the enterprise. For example, the Marketing department might extract sales and demographic data in order to infer customer purchasing habits. Concurrently, the Sales department was also extracting sales data for the purpose of awarding commissions to the sales force. Over time, these isolated silos of information became irreconcilable, since the extracts and business rules applied to the data during the extract process differed from one department to another.

The result of this evolution was that the Sales and Marketing departments might report completely different sales figures to executive management, resulting in a lack of confidence in both departments' "data marts." From a technical perspective, the uncoordinated extracts of the same data from the source systems multiple times placed undue strain on system resources.

The solution seemed to be the "centralized" or "galactic" data warehouse. This warehouse would be supported by a single set of periodic extracts of all relevant data into the data warehouse (or Operational Data Store), with the data being cleansed and made consistent as part of the extract process. The problem with this solution was its enormous complexity, typically resulting in project failure. The scale of these failures led many organizations to abandon the concept of the enterprise data warehouse in favor of the isolated, "stovepipe" data marts described earlier. While these solutions still had all of the issues discussed previously, they had the clear advantage of providing individual departments with the data they needed without the unmanageability of the enterprise solution.

As individual departments pursued their own data and data integration needs, they not only created data stovepipes, they also created technical islands. The approaches to populating the data marts and performing the data integration tasks varied widely, resulting in a single enterprise evaluating, purchasing, and being trained on multiple tools and adopting multiple methods for performing these tasks. If, at any point, the organization did attempt to undertake an enterprise effort, it was likely to face the daunting challenge of integrating the disparate data as well as the widely varying technologies. To deal with these issues, organizations began developing approaches that considered the enterprise-level requirements of a data integration solution.


Centralized Data Warehouse

The first approach to gain popularity was the centralized data warehouse. Designed to solve the decision support needs for the entire enterprise at one time, with one effort, the data integration process extracts the data directly from the operational systems. It transforms the data according to the business rules and loads it into a single target database serving as the enterprise-wide data warehouse.

Advantages

The centralized model offers a number of benefits to the overall architecture, including:

● Centralized control. Since a single project drives the entire process, there is centralized control over everything occurring in the data warehouse. This makes it easier to manage a production system while concurrently integrating new components of the warehouse.

● Consistent metadata. Because the warehouse environment is contained in a single database and the metadata is stored in a single repository, the entire enterprise can be queried whether you are looking at data from Finance, Customers, or Human Resources.

● Enterprise view. Developing the entire project at one time provides a global view of how data from one workgroup coordinates with data from others. Since the warehouse is highly integrated, different workgroups often share common tables such as customer, employee, and item lists.

● High data integrity. A single, integrated data repository for the entire enterprise would naturally avoid all data integrity issues that result from duplicate copies and versions of the same business data.

Disadvantages

Of course, the centralized data warehouse also involves a number of drawbacks, including:

● Lengthy implementation cycle. With the complete warehouse environment developed simultaneously, many components of the warehouse become daunting tasks, such as analyzing all of the source systems and developing the target data model. Even minor tasks, such as defining how to measure profit and establishing naming conventions, snowball into major issues.

● Substantial up-front costs. Many analysts who have studied the costs of this approach agree that this type of effort nearly always runs into the millions. While this level of investment is often justified, the problem lies in the delay between the investment and the delivery of value back to the business.

● Scope too broad. The centralized data warehouse requires a single database to satisfy the needs of the entire organization. Attempts to develop an enterprise-wide warehouse using this approach have rarely succeeded, since the goal is simply too ambitious. As a result, this wide scope has been a strong contributor to project failure.

● Impact on the operational systems. Different tables within the warehouse often read data from the same source tables, but manipulate it differently before loading it into the targets. Since the centralized approach extracts data directly from the operational systems, a source table that feeds into three different target tables is queried three times to load the appropriate target tables in the warehouse. When combined with all the other loads for the warehouse, this can create an unacceptable performance hit on the operational systems.

Independent Data Mart

The second warehousing approach is the independent data mart, which gained popularity in 1996 when DBMS magazine ran a cover story featuring this strategy. This architecture is based on the same principles as the centralized approach, but it scales down the scope from solving the warehousing needs of the entire company to the needs of a single department or workgroup.

Much like the centralized data warehouse, an independent data mart extracts data directly from the operational sources, manipulates the data according to the business rules, and loads a single target database serving as the independent data mart. In some cases, the operational data may be staged in an Operational Data Store (ODS) and then moved to the mart.


Advantages

The independent data mart is the logical opposite of the centralized data warehouse. The disadvantages of the centralized approach are the strengths of the independent data mart:

● Impact on operational databases localized. Because the independent data mart is trying to solve the DSS needs of a single department or workgroup, only the few operational databases containing the information required need to be analyzed.

● Reduced scope of the data model. The target data modeling effort is vastly reduced since it only needs to serve a single department or workgroup, rather than the entire company.

● Lower up-front costs. The data mart is serving only a single department or workgroup; thus hardware and software costs are reduced.

● Fast implementation. The project can be completed in months, not years. The process of defining business terms and naming conventions is simplified since "players from the same team" are working on the project.

Disadvantages

Of course, independent data marts also have some significant disadvantages:

● Lack of centralized control. Because several independent data marts are needed to solve the decision support needs of an organization, there is no centralized control. Each data mart or project controls itself; no single location governs the environment as a whole.

● Redundant data. After several data marts are in production throughout the organization, all of the problems associated with data redundancy surface, such as inconsistent definitions of the same data object or timing differences that make reconciliation impossible.

● Metadata integration. Due to their independence, the opportunity to share metadata - for example, the definition and business rules associated with the Invoice data object - is lost. Subsequent projects must repeat the development and deployment of common data objects.

● Manageability. The independent data marts control their own scheduling routines and store and report their metadata differently, with a negative impact on the manageability of the data warehouse. There is no centralized scheduler to coordinate the individual loads appropriately, nor a metadata browser to maintain the global metadata and share development work among related projects.

Dependent Data Marts (Federated Data Warehouses)

The third warehouse architecture is the dependent data mart approach supported by the hub-and-spoke architecture of PowerCenter and PowerMart. After studying more than one hundred different warehousing projects, Informatica introduced this approach in 1998, leveraging the benefits of the centralized data warehouse and independent data mart.

The more general term being adopted to describe this approach is the "federated data warehouse." Industry analysts have recognized that, in many cases, there is no "one size fits all" solution. Although the goal of true enterprise architecture, with conformed dimensions and strict standards, is laudable, it is often impractical, particularly for early efforts. Thus, the concept of the federated data warehouse was born. It allows for the relatively independent development of data marts, but leverages a centralized PowerCenter repository for sharing transformations, source and target objects, business rules, etc.

Recent literature describes the federated architecture approach as a way to get closer to the goal of a truly centralized architecture while allowing for the practical realities of most organizations. The centralized warehouse concept is sacrificed in favor of a more pragmatic approach, whereby the organization can develop semi-autonomous data marts, so long as they subscribe to a common view of the business. This common business model is the fundamental, underlying basis of the federated architecture, since it ensures consistent use of business terms and meanings throughout the enterprise.

With the exception of the rare case of a truly independent data mart, where no future growth is planned or anticipated, and where no opportunities for integration with other business areas exist, the federated data warehouse architecture provides the best framework for building a data integration solution.

Informatica's PowerCenter and PowerMart products provide an essential capability for supporting the federated architecture: the shared Global Repository. When used in conjunction with one or more Local Repositories, the Global Repository serves as a sort of "federal" governing body, providing a common understanding of core business concepts that can be shared across the semi-autonomous data marts. These data marts each have their own Local Repository, which typically includes a combination of purely local metadata and shared metadata by way of links to the Global Repository.

This environment allows for relatively independent development of individual data marts, but also supports metadata sharing without obstacles. The common business model and names described above can be captured in metadata terms and stored in the Global Repository. The data marts use the common business model as a basis, but extend the model by developing departmental metadata and storing it locally.

A typical characteristic of the federated architecture is the existence of an Operational Data Store (ODS). Although this component is optional, it can be found in many implementations that extract data from multiple source systems and load multiple targets. The ODS was originally designed to extract and hold operational data that would be sent to a centralized data warehouse, working as a time-variant database to support end-user reporting directly from operational systems. A typical ODS had to be organized by data subject area because it did not retain the data model from the operational system.

Informatica's approach to the ODS, by contrast, has virtually no change in data model from the operational system, so it need not be organized by subject area. The ODS does not permit direct end-user reporting, and its refresh policies are more closely aligned with the refresh schedules of the enterprise data marts it may be feeding. It can also perform more sophisticated consolidation functions than a traditional ODS.


Advantages

The Federated architecture brings together the best features of the centralized data warehouse and independent data mart:

● Room for expansion. While the architecture is designed to quickly deploy the initial data mart, it is also easy to share project deliverables across subsequent data marts by migrating local metadata to the Global Repository. Reuse is built in.

● Centralized control. A single platform controls the environment from development to test to production. Mechanisms to control and monitor the data movement from operational databases into the data integration environment are applied across the data marts, easing the system management task.

● Consistent metadata. A Global Repository spans all the data marts, providing a consistent view of metadata.

● Enterprise view. Viewing all the metadata from a central location also provides an enterprise view, easing the maintenance burden for the warehouse administrators. Business users can also access the entire environment when necessary (assuming that security privileges are granted).

● High data integrity. Using a set of integrated metadata repositories for the entire enterprise removes data integrity issues that result from duplicate copies of data.

● Minimized impact on operational systems. Frequently accessed source data, such as customer, product, or invoice records, is moved into the decision support environment once, leaving the operational systems unaffected by the number of target data marts.

Disadvantages

Disadvantages of the federated approach include:

● Data propagation. This approach moves data twice: first to the ODS, then into the individual data mart. This requires extra database space to store the staged data, as well as extra time to move the data. However, the disadvantage can be mitigated by not saving the data permanently in the ODS. After the warehouse is refreshed, the ODS can be truncated, or a rolling three months of data can be saved.

● Increased development effort during initial installations. For each table in the target, there needs to be one load developed from the ODS to the target, in addition to all the loads from the sources into the ODS.

Operational Data Store

Using a staging area or ODS differs from a centralized data warehouse approach since the ODS is not organized by subject area and is not customized for viewing by end users or even for reporting. The primary focus of the ODS is to provide a clean, consistent set of operational data for creating and refreshing data marts. Separating out this function allows the ODS to provide more reliable and flexible support.


Data from the various operational sources is staged for subsequent extraction by target systems in the ODS. In the ODS, data is cleaned and remains normalized, tables from different databases are joined, and a refresh policy is carried out (a change/capture facility may be used to schedule ODS refreshes, for instance).

The ODS and the data marts may reside in a single database or be distributed across several physical databases and servers.

Characteristics of the Operational Data Store are:

● Normalized
● Detailed (not summarized)
● Integrated
● Cleansed
● Consistent

Within an enterprise data mart, the ODS can consolidate data from disparate systems in a number of ways:

● Normalizes data where necessary (such as non-relational mainframe data), preparing it for storage in a relational system.

● Cleans data by enforcing commonalties in dates, names and other data types that appear across multiple systems.

● Maintains reference data to help standardize other formats; references might range from zip codes and currency conversion rates to product-code-to-product-name translations. The ODS may apply fundamental transformations to some database tables in order to reconcile common definitions, but the ODS is not intended to be a transformation processor for end-user reporting requirements.

Its role is to consolidate detailed data within common formats. This enables users to create a wide variety of data integration reports, with confidence that those reports will be based on the same detailed data, using common definitions and formats.
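As a minimal sketch of this standardization role (the ods_order_line and ref_product_code tables and their columns are invented for illustration and are not part of any Informatica schema), the consolidation step might resolve source product codes to standard names like this:

-- Illustrative only: resolve source product codes to the standard product
-- name held in a maintained reference table within the ODS.
UPDATE ods_order_line
SET    product_name = (SELECT r.product_name
                       FROM   ref_product_code r
                       WHERE  r.product_code = ods_order_line.product_code)
WHERE  product_name IS NULL;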

The following table compares the key differences in the three architectures:

Architecture            Centralized Data Warehouse    Independent Data Mart    Federated Data Warehouse

Centralized Control     Yes                           No                       Yes
Consistent Metadata     Yes                           No                       Yes
Cost effective          No                            Yes                      Yes
Enterprise View         Yes                           No                       Yes
Fast Implementation     No                            Yes                      Yes
High Data Integrity     Yes                           No                       Yes
Immediate ROI           No                            Yes                      Yes
Repeatable Process      No                            Yes                      Yes

The Role of Enterprise Architecture

The federated architecture approach allows for the planning and implementation of an enterprise architecture framework that addresses not only short-term departmental needs, but also the long-term enterprise requirements of the business. This does not mean that the entire architectural investment must be made in advance of any application development. However, it does mean that development is approached within the guidelines of the framework, allowing for future growth without significant technological change. The remainder of this chapter will focus on the process of designing and developing a data integration solution architecture using PowerCenter as the platform.

Fitting Into the Corporate Architecture

Very few organizations have the luxury of creating a "green field" architecture to support their decision support needs. Rather, the architecture must fit within an existing set of corporate guidelines regarding preferred hardware, operating systems, databases, and other software. The Technical Architect, if not already an employee of the organization, should ensure that he/she has a thorough understanding of the existing technical infrastructure and the organization's future vision for it. Doing so eliminates the possibility of developing an elegant technical solution that will never be implemented because it defies corporate standards.

Last updated: 12-Feb-07 15:22


Development FAQs

Challenge

Using the PowerCenter product suite to effectively develop, name, and document components of the data integration solution. While the most effective use of PowerCenter depends on the specific situation, this Best Practice addresses some questions that are commonly raised by project teams. It provides answers in a number of areas, including Logs, Scheduling, Backup Strategies, Server Administration, Custom Transformations, and Metadata. Refer to the product guides supplied with PowerCenter for additional information.

Description

The following pages summarize some of the questions that typically arise during development and suggest potential resolutions.

Mapping Design

Q: How does source format affect performance? (i.e., is it more efficient to source from a flat file rather than a database?)

In general, a flat file that is located on the server machine loads faster than a database located on the server machine. Fixed-width files are faster than delimited files because delimited files require extra parsing. However, if there is an intent to perform intricate transformations before loading to target, it may be advisable to first load the flat file into a relational database, which allows the PowerCenter mappings to access the data in an optimized fashion by using filters, custom transformations, and custom SQL SELECTs where appropriate.

Q: What are some considerations when designing the mapping? (i.e., what is the impact of having multiple targets populated by a single map?)

With PowerCenter, it is possible to design a mapping with multiple targets. If each target has a separate source qualifier, you can then load the targets in a specific order using Target Load Ordering. However, the recommendation is to limit the amount of complex logic in a mapping. Not only is it easier to debug a mapping with a limited number of objects, but such mappings can also be run concurrently and make use of more system resources. When using multiple output files (targets), consider writing to multiple disks or file systems simultaneously. This minimizes disk writing contention and applies to a session writing to multiple targets, and to multiple sessions running simultaneously.

Q: What are some considerations for determining how many objects and transformations to include in a single mapping?

The business requirement is always the first consideration, regardless of the number of objects it takes to fulfill the requirement. Beyond this, consideration should be given to having objects that stage data at certain points to allow both easier debugging and better understandability, as well as to create potential partition points. This should be balanced against the fact that more objects means more overhead for the DTM process.

It should also be noted that the most expensive use of the DTM is passing unnecessary data through the mapping. It is best to use filters as early as possible in the mapping to remove rows of data that are not needed. This is the SQL equivalent of the WHERE clause. Using the filter condition in the Source Qualifier to filter out the rows at the database level is a good way to increase the performance of the mapping. If this is not possible, a filter or router transformation can be used instead.
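As a rough illustration, with invented table and column names, filtering in the Source Qualifier means the generated SELECT already carries the restriction, so unneeded rows never leave the database:

-- Illustrative Source Qualifier source filter / SQL override: only the
-- rows the mapping actually needs are read from the source.
SELECT order_id,
       customer_id,
       order_amount
FROM   src_orders
WHERE  order_status = 'OPEN'

If the source cannot be filtered this way, the same condition belongs in a Filter or Router transformation placed as close to the Source Qualifier as possible.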

Log File Organization

Q: How does PowerCenter handle logs?

The Service Manager provides accumulated log events from each service in the domain and for sessions and workflows. To perform the logging function, the Service Manager runs a Log Manager and a Log Agent.

The Log Manager runs on the master gateway node. It collects and processes log events for Service Manager domain operations and application services. The log events contain operational and error messages for a domain. The Service Manager and the application services send log events to the Log Manager. When the Log Manager receives log events, it generates log event files, which can be viewed in the Administration Console.

The Log Agent runs on the nodes to collect and process log events for session and workflows. Log events for workflows include information about tasks performed by the Integration Service, workflow processing, and workflow errors. Log events for sessions include information about the tasks performed by the Integration Service, session errors, and load summary and transformation statistics for the session. You can view log events for the last workflow run with the Log Events window in the Workflow Monitor.

Log event files are binary files that the Administration Console Log Viewer uses to display log events. When you view log events in the Administration Console, the Log Manager uses the log event files to display the log events for the domain or application service. For more information, please see Chapter 16: Managing Logs in the Administrator Guide.

Q: Where can I view the logs?

Logs can be viewed in two locations: the Administration Console or the Workflow Monitor. The Administration Console displays domain-level operational and error messages. The Workflow Monitor displays session and workflow level processing and error messages.

Q: Where is the best place to maintain Session Logs?

One often-recommended approach is to use a shared directory that is accessible to the gateway node. If you have more than one gateway node, store the logs on a shared disk. This keeps all the logs in the same directory. The location can be changed in the Administration Console.

If you have more than one PowerCenter domain, you must configure a different directory path for each domain’s Log Manager. Multiple domains cannot use the same shared directory path.

For more information, please refer to Chapter 16: Managing Logs of the Administrator Guide.

Q: What documentation is available for the error codes that appear within the error log files?

Log file errors and descriptions appear in Chapter 39: LGS Messages of the PowerCenter Troubleshooting Guide. Error information also appears in the PowerCenter Help File within the PowerCenter client applications. For other database-specific errors, consult your Database User Guide.

Scheduling Techniques

Q: What are the benefits of using workflows with multiple tasks rather than a workflow with a stand-alone session?

Using a workflow to group logical sessions minimizes the number of objects that must be managed to successfully load the warehouse. For example, a hundred individual sessions can be logically grouped into twenty workflows. The Operations group can then work with twenty workflows to load the warehouse, which simplifies the operations tasks associated with loading the targets.

Workflows can be created to run tasks sequentially or concurrently, or have tasks in different paths doing either.

● A sequential workflow runs sessions and tasks one at a time, in a linear sequence. Sequential workflows help ensure that dependencies are met as needed. For example, a sequential workflow ensures that session1 runs before session2 when session2 is dependent on the load of session1, and so on. It's also possible to set up conditions to run the next session only if the previous session was successful, or to stop on errors, etc.

● A concurrent workflow groups logical sessions and tasks together, like a sequential workflow, but runs all the tasks at one time. This can reduce the load times into the warehouse, taking advantage of hardware platforms' symmetric multi-processing (SMP) architecture.

Other workflow options, such as nesting worklets within workflows, can further reduce the complexity of loading the warehouse. This capability allows for the creation of very complex and flexible workflow streams without the use of a third-party scheduler.


Q: Assuming a workflow failure, does PowerCenter allow restart from the point of failure?

No. When a workflow fails, you can choose to start a workflow from a particular task but not from the point of failure. It is possible, however, to create tasks and flows based on error handling assumptions. If a previously running real-time workflow fails, first recover and then restart that workflow from the Workflow Monitor.

Q: How can a failed workflow be recovered if it is not visible from the Workflow Monitor?

Start the Workflow Manager and open the corresponding workflow. Find the failed task and right click to "Recover Workflow From Task."

Q: What guidelines exist regarding the execution of multiple concurrent sessions / workflows within or across applications?

Workflow Execution needs to be planned around two main constraints:

● Available system resources
● Memory and processors

The number of sessions that can run efficiently at one time depends on the number of processors available on the server. The load manager is always running as a process. If bottlenecks with regard to I/O and network are addressed, a session will be compute-bound, meaning its throughput is limited by the availability of CPU cycles. Most sessions are transformation intensive, so the DTM always runs. However, some sessions require more I/O, so they use less processor time. A general rule is that a session needs about 120 percent of a processor for the DTM, reader, and writer in total.

For concurrent sessions:

One session per processor is about right; you can run more, but that requires a "trial and error" approach to determine what number of sessions starts to affect session performance and possibly adversely affect other executing tasks on the server.

If possible, sessions should run at "off-peak" hours to have as many available resources as possible.

Even after available processors are determined, it is necessary to look at overall system resource usage. Determining memory usage is more difficult than the processors calculation; it tends to vary according to system load and number of PowerCenter sessions running.

The first step is to estimate memory usage, accounting for:

● Operating system kernel and miscellaneous processes
● Database engine
● Informatica Load Manager

Next, each session being run needs to be examined with regard to the memory usage, including the DTM buffer size and any cache/memory allocations for transformations such as lookups, aggregators, ranks, sorters and joiners.

At this point, you should have a good idea of what memory is utilized during concurrent sessions. It is important to arrange the production run to maximize use of this memory. Remember to account for sessions with large memory requirements; you may be able to run only one large session, or several small sessions concurrently.

Load-order dependencies are also an important consideration because they often create additional constraints. For example, load the dimensions first, then facts. Also, some sources may only be available at specific times; some network links may become saturated if overloaded; and some target tables may need to be available to end users earlier than others.

Q: Is it possible to perform two "levels" of event notification? At the application level and the PowerCenter Server level to notify the Server Administrator?

The application level of event notification can be accomplished through post-session email. Post-session email allows you to create two different messages: one to be sent upon successful completion of the session, the other to be sent if the session fails. Messages can be a simple notification of session completion or failure, or a more complex notification containing specifics about the session. You can use the following variables in the text of your post-session email:

Email Variable   Description
%s               Session name
%l               Total records loaded
%r               Total records rejected
%e               Session status
%t               Table details, including read throughput in bytes/second and write throughput in rows/second
%b               Session start time
%c               Session completion time
%i               Session elapsed time (session completion time - session start time)
%g               Attaches the session log to the message
%m               Name and version of the mapping used in the session
%d               Name of the folder containing the session
%n               Name of the repository containing the session
%a<filename>     Attaches the named file. The file must be local to the Informatica Server. The following are valid filenames: %a<c:\data\sales.txt> or %a</users/john/data/sales.txt>

On Windows NT, you can attach a file of any type. On UNIX, you can only attach text files. If you attach a non-text file, the send may fail.

Note: The filename cannot include the Greater Than character (>) or a line break.

The PowerCenter Server on UNIX uses rmail to send post-session email. The repository user who starts the PowerCenter server must have the rmail tool installed in the path in order to send email.

To verify the rmail tool is accessible:

1. Login to the UNIX system as the PowerCenter user who starts the PowerCenter Server.
2. Type rmail <fully qualified email address> at the prompt and press Enter.
3. Type '.' to indicate the end of the message and press Enter.
4. You should receive a blank email from the PowerCenter user's email account. If not, locate the directory where rmail resides and add that directory to the path.
5. When you have verified that rmail is installed correctly, you are ready to send post-session email.

The output should look like the following:

Session complete.
Session name: sInstrTest
Total Rows Loaded = 1
Total Rows Rejected = 0
Completed

Rows Loaded   Rows Rejected   ReadThroughput (bytes/sec)   WriteThroughput (rows/sec)   Table Name   Status
1             0               30                           1                            t_Q3_sales   No errors encountered.

Start Time: Tue Sep 14 12:26:31 1999
Completion Time: Tue Sep 14 12:26:41 1999
Elapsed time: 0:00:10 (h:m:s)

This information, or a subset, can also be sent to any text pager that accepts email.

Backup Strategy Recommendation

Q: Can individual objects within a repository be restored from the backup or from a prior version?

At the present time, individual objects cannot be restored from a backup using the PowerCenter Repository Manager (i.e., you can only restore the entire repository). But, it is possible to restore the backup repository into a different database and then manually copy the individual objects back into the main repository.

It should be noted that PowerCenter does not restore repository backup files created in previous versions of PowerCenter. To correctly restore a repository, the version of PowerCenter used to create the backup file must be used for the restore as well.

An option for the backup of individual objects is to export them to XML files. This allows for the granular re-importation of individual objects, mappings, tasks, workflows, etc.

Refer to Migration Procedures - PowerCenter for details on promoting new or changed objects between development, test, QA, and production environments.

Server Administration

Q: What built-in functions does PowerCenter provide to notify someone in the event that the server goes down, or some other significant event occurs?

The Repository Service can be used to send messages notifying users that the server will be shut down. Additionally, the Repository Service can be used to send notification messages about repository objects that are created, modified, or deleted by another user. Notification messages are received through the PowerCenter Client tools.

Q: What system resources should be monitored? What should be considered normal or acceptable server performance levels?

The pmprocs utility, which is available for UNIX systems only, shows the currently executing PowerCenter processes.

Pmprocs is a script that combines the ps and ipcs commands. It is available through Informatica Technical Support. The utility provides the following information:

● CPID - Creator PID (process ID)
● LPID - Last PID that accessed the resource
● Semaphores - used to sync the reader and writer
● 0 or 1 - shows slot in LM shared memory


A variety of UNIX and Windows NT commands and utilities are also available. Consult your UNIX and/or Windows NT documentation.

Q: What cleanup (if any) should be performed after a UNIX server crash? Or after an Oracle instance crash?

If the UNIX server crashes, you should first check to see if the repository database is able to come back up successfully. If this is the case, then you should try to start the PowerCenter server. Use the pmserver.err log to check if the server has started correctly. You can also use ps -ef | grep pmserver to see if the server process (the Load Manager) is running.

Custom Transformations

Q: What is the relationship between the Java or SQL transformation and the Custom transformation?

Many advanced transformations, including Java and SQL, were built using the Custom transformation. Custom transformations operate in conjunction with procedures you create outside of the Designer interface to extend PowerCenter functionality.

Other transformations that were built using Custom transformations include HTTP, SQL, Union, XML Parser, XML Generator, and many others. Below is a summary of noticeable differences.

Transformation   # of Input Groups   # of Output Groups   Type
Custom           Multiple            Multiple             Active/Passive
HTTP             One                 One                  Passive
Java             One                 One                  Active/Passive
SQL              One                 One                  Active/Passive
Union            Multiple            One                  Active
XML Parser       One                 Multiple             Active
XML Generator    Multiple            One                  Active

For further details, please see the Transformation Guide.

Q: What is the main benefit of a Custom transformation over an External Procedure transformation?

A Custom transformation allows for the separation of input and output functions, whereas an External Procedure transformation handles both the input and output simultaneously. Additionally, an External Procedure transformation’s parameters consist of all the ports of the transformation.

The ability to separate input and output functions is especially useful for sorting and aggregation, which require all input rows to be processed before any output rows are generated.

Q: How do I change a Custom transformation from Active to Passive, or vice versa?

After the creation of the Custom transformation, the transformation type cannot be changed. In order to set the appropriate type, delete and recreate the transformation.

Q: What is the difference between active and passive Java transformations? When should one be used over the other?

An active Java transformation allows for the generation of more than one output row for each input row. Conversely, a passive Java transformation only allows for the generation of one output row per input row.


Use active if you need to generate multiple rows with each input. For example, a Java transformation contains two input ports that represent a start date and an end date. You can generate an output row for each date between the start and end date. Use passive when you need one output row for each input.

Q: What are the advantages of a SQL transformation over a Source Qualifier?

A SQL transformation allows for the processing of SQL queries in the middle of a mapping. It allows you to insert, delete, update, and retrieve rows from a database. For example, you might need to create database tables before adding new transactions. The SQL transformation allows for the creation of these tables from within the workflow.

Q: What is the difference between the SQL transformation’s Script and Query modes?

Script mode allows for the execution of externally located ANSI SQL scripts. Query mode executes a query that you define in a query editor. You can pass strings or parameters to the query to define dynamic queries or change the selection parameters.

For more information, please see Chapter 22: SQL Transformation in the Transformation Guide.
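As a sketch of Query mode, a dynamic query bound to input ports might look like the following; the ?port? parameter notation is used here illustratively and the table and port names are invented, so check the Transformation Guide for the exact binding syntax in your release:

-- Illustrative Query-mode statement for a SQL transformation; values from
-- the input ports replace the ?...? placeholders at run time.
SELECT order_id,
       order_status
FROM   orders
WHERE  customer_id = ?customer_id?
  AND  order_date >= ?start_date?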

Metadata

Q: What recommendations or considerations exist as to naming standards or repository administration for metadata that may be extracted from the PowerCenter repository and used in others?

With PowerCenter, you can enter description information for all repository objects, sources, targets, transformations, etc., but the amount of metadata that you enter should be determined by the business requirements. You can also drill down to the column level and give descriptions of the columns in a table if necessary. All information about column size and scale, data types, and primary keys is stored in the repository.

The decision on how much metadata to create is often driven by project timelines. While it may be beneficial for a developer to enter detailed descriptions of each column, expression, variable, etc., it is also very time-consuming to do so. Therefore, this decision should be made on the basis of how much metadata is likely to be required by the systems that use the metadata.

There are some time-saving tools that are available to better manage a metadata strategy and content, such as third-party metadata software and, for sources and targets, data modeling tools.

Q: What procedures exist for extracting metadata from the repository?

Informatica offers an extremely rich suite of metadata-driven tools for data warehousing applications. All of these tools store, retrieve, and manage their metadata in Informatica's PowerCenter repository. The motivation behind the original Metadata Exchange (MX) architecture was to provide an effective and easy-to-use interface to the repository.

Today, Informatica and several key Business Intelligence (BI) vendors, including Brio, Business Objects, Cognos, and MicroStrategy, are effectively using the MX views to report and query the Informatica metadata.

Informatica strongly discourages accessing the repository tables directly, even for SELECT access, because the underlying repository tables can change between PowerCenter releases, resulting in a maintenance task for you. Rather, views have been created to provide access to the metadata stored in the repository.

Additionally, Informatica's Metadata Manager and Data Analyzer allow for more robust reporting against the repository database and are able to present reports to the end-user and/or management.
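As a simple, hedged example of the MX approach, a query such as the following lists mappings by folder; the view and column names (REP_ALL_MAPPINGS, SUBJECT_AREA, MAPPING_NAME) vary by PowerCenter release, so verify them against the Repository Guide for your version before use:

-- Assumed MX view and column names; confirm against your release before use.
SELECT subject_area,      -- repository folder
       mapping_name
FROM   rep_all_mappings
ORDER  BY subject_area, mapping_name;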

Versioning

Q: How can I keep multiple copies of the same object within PowerCenter?

A: With PowerCenter, you can use version control to maintain previous copies of every changed object.


You can enable version control after you create a repository. Version control allows you to maintain multiple versions of an object, control development of the object, and track changes. You can configure a repository for versioning when you create it, or you can upgrade an existing repository to support versioned objects.

When you enable version control for a repository, the repository assigns all versioned objects version number 1 and each object has an active status.

You can perform the following tasks when you work with a versioned object:

● View object version properties. Each versioned object has a set of version properties and a status. You can also configure the status of a folder to freeze all objects it contains or make them active for editing.

● Track changes to an object. You can view a history that includes all versions of a given object, and compare any version of the object in the history to any other version. This allows you to determine changes made to an object over time.

● Check the object version in and out. You can check out an object to reserve it while you edit the object. When you check in an object, the repository saves a new version of the object and allows you to add comments to the version. You can also find objects checked out by yourself and other users.

● Delete or purge the object version. You can delete an object from view and continue to store it in the repository. You can recover, or undelete, deleted objects. If you want to permanently remove an object version, you can purge it from the repository.

Q: Is there a way to migrate only the changed objects from Development to Production without having to spend too much time on making a list of all changed/affected objects?

A: Yes, there is.

You can create Deployment Groups that allow you to group versioned objects for migration to a different repository. You can create the following types of deployment groups:

● Static. You populate the deployment group by manually selecting objects.
● Dynamic. You use the result set from an object query to populate the deployment group.

To make a smooth transition/migration to Production, you need to have a query associated with your Dynamic deployment group. When you associate an object query with the deployment group, the Repository Agent runs the query at the time of deployment. You can associate an object query with a deployment group when you edit or create a deployment group.

If the repository is enabled for versioning, you may also copy the objects in a deployment group from one repository to another. Copying a deployment group allows you to copy objects in a single copy operation from across multiple folders in the source repository into multiple folders in the target repository. Copying a deployment group also allows you to specify individual objects to copy, rather than the entire contents of a folder.

Performance

Q: Can PowerCenter sessions be load balanced?

A: Yes, if the grid option is available. The Load Balancer is a component of the Integration Service that dispatches tasks to Integration Service processes running on nodes in a grid. It matches task requirements with resource availability to identify the best Integration Service process to run a task. It can dispatch tasks on a single node or across nodes.

Tasks can be dispatched in three ways: Round-robin, Metric-based, and Adaptive. Additionally, you can set the Service Levels to change the priority of each task waiting to be dispatched. This can be changed in the Administration Console’s domain properties.

For more information, please refer to Chapter 11: Configuring the Load Balancer in the Administrator Guide.

Web Services


Q: How does Web Services Hub work in PowerCenter?

A: The Web Services Hub is a web service gateway for external clients. It processes SOAP requests from web service clients that want to access PowerCenter functionality through web services. Web service clients access the Integration Service and Repository Service through the Web Services Hub.

The Web Services Hub hosts Batch and Real-time Web Services. When you install PowerCenter Services, the PowerCenter installer installs the Web Services Hub. Use the Administration Console to configure and manage the Web Services Hub. For more information, please refer to Creating and Configuring the Web Services Hub in the Administrator Guide.

The Web Services Hub connects to the Repository Service and the Integration Service through TCP/IP. Web service clients log in to the Web Services Hub through HTTP(S). The Web Services Hub authenticates the client based on repository user name and password. You can use the Web Services Hub console to view service information and download Web Services Description Language (WSDL) files necessary for running services and workflows.

Last updated: 01-Feb-07 18:53


Event Based Scheduling

Challenge

In an operational environment, the beginning of a task often needs to be triggered by some event, either internal or external to the Informatica environment. In versions of PowerCenter prior to version 6.0, this was achieved through the use of indicator files. In PowerCenter 6.0 and forward, it is achieved through the use of the Event-Raise and Event-Wait Workflow and Worklet tasks, as well as indicator files.

Description

Event-based scheduling with versions of PowerCenter prior to 6.0 was achieved through the use of indicator files. Users specified the indicator file configuration in the session configuration under advanced options. When the session started, the PowerCenter Server looked for the specified file name; if it wasn’t there, it waited until it appeared, then deleted it, and triggered the session.

In PowerCenter 6.0 and above, event-based scheduling is triggered by Event-Wait and Event-Raise tasks. These tasks can be used to define task execution order within a workflow or worklet. They can even be used to control sessions across workflows.

● An Event-Raise task represents a user-defined event; when it runs, it triggers that event.
● An Event-Wait task waits for an event to occur within a workflow. After the event triggers, the PowerCenter Server continues executing the workflow from the Event-Wait task forward.

The following paragraphs describe the events that an Event-Wait task can wait for.

Waiting for Pre-Defined Events

To use a pre-defined event, you need a session, shell command, script, or batch file to create an indicator file. You must create the file locally or send it to a directory local to the PowerCenter Server. The file can be any format recognized by the PowerCenter Server operating system. You can choose to have the PowerCenter Server delete the indicator file after it detects the file, or you can manually delete the indicator file. The PowerCenter Server marks the status of the Event-Wait task as "failed" if it cannot delete the indicator file.


When you specify the indicator file in the Event-Wait task, specify the directory in which the file will appear and the name of the indicator file. Do not use either a source or target file name as the indicator file name. You must also provide the absolute path for the file and the directory must be local to the PowerCenter Server. If you only specify the file name, and not the directory, Workflow Manager looks for the indicator file in the system directory. For example, on Windows NT, the system directory is C:/winnt/system32. You can enter the actual name of the file or use server variables to specify the location of the files. The PowerCenter Server writes the time the file appears in the workflow log.

Follow these steps to set up a pre-defined event in the workflow:

1. Create an Event-Wait task and double-click the Event-Wait task to open the Edit Tasks dialog box.
2. In the Events tab of the Edit Task dialog box, select Pre-defined.
3. Enter the path of the indicator file.
4. If you want the PowerCenter Server to delete the indicator file after it detects the file, select the Delete Indicator File option in the Properties tab.
5. Click OK.

Pre-defined Event

A pre-defined event is a file-watch event. For pre-defined events, use an Event-Wait task to instruct the PowerCenter Server to wait for the specified indicator file to appear before continuing with the rest of the workflow. When the PowerCenter Server locates the indicator file, it starts the task downstream of the Event-Wait.

User-defined Event

A user-defined event is defined at the workflow or worklet level and the Event-Raise task triggers the event at one point of the workflow/worklet. If an Event-Wait task is configured in the same workflow/worklet to listen for that event, then execution will continue from the Event-Wait task forward.

The following is an example of using user-defined events:

Assume that you have four sessions that you want to execute in a workflow. You want P1_session and P2_session to execute concurrently to save time. You also want to execute Q3_session after P1_session completes. You want to execute Q4_session only when P1_session, P2_session, and Q3_session complete. Follow these steps:


1. Link P1_session and P2_session concurrently.
2. Add Q3_session after P1_session.
3. Declare an event called P1Q3_Complete in the Events tab of the workflow properties.
4. In the workspace, add an Event-Raise task after Q3_session.
5. Specify the P1Q3_Complete event in the Event-Raise task properties. This allows the Event-Raise task to trigger the event when P1_session and Q3_session complete.
6. Add an Event-Wait task after P2_session.
7. Specify the P1Q3_Complete event for the Event-Wait task.
8. Add Q4_session after the Event-Wait task. When the PowerCenter Server processes the Event-Wait task, it waits until the Event-Raise task triggers P1Q3_Complete before it executes Q4_session.

The PowerCenter Server executes the workflow in the following order:

1. The PowerCenter Server executes P1_session and P2_session concurrently.
2. When P1_session completes, the PowerCenter Server executes Q3_session.
3. The PowerCenter Server finishes executing P2_session.
4. The Event-Wait task waits for the Event-Raise task to trigger the event.
5. The PowerCenter Server completes Q3_session.
6. The Event-Raise task triggers the event, P1Q3_Complete.
7. The PowerCenter Server executes Q4_session because the event, P1Q3_Complete, has been triggered.

Be sure to take care in setting the links, though. If they are left as the default and Q3_session fails, the Event-Raise will never happen. Then the Event-Wait will wait forever and the workflow will run until it is stopped. To avoid this, check the workflow option ‘suspend on error’. With this option, if a session fails, the whole workflow goes into suspended mode and can send an email to notify developers.

Last updated: 01-Feb-07 18:53


Key Management in Data Warehousing Solutions

Challenge

Key management refers to the technique that manages key allocation in a decision support RDBMS to create a single view of reference data from multiple sources. Informatica recommends a key management approach that ensures everything extracted from a source system is loaded into the data warehouse.

This Best Practice provides some tips for employing the Informatica-recommended approach to key management, an approach that deviates from many traditional data warehouse solutions, which apply logical and data warehouse (surrogate) key strategies where errors are logged and transactions rejected because of referential integrity issues.

Description

Key management in a decision support RDBMS comprises three techniques for handling the following common situations:

● Key merging/matching
● Missing keys
● Unknown keys

All three methods are applicable to a Reference Data Store, whereas only the missing and unknown keys are relevant for an Operational Data Store (ODS). Key management should be handled at the data integration level, thereby making it transparent to the Business Intelligence layer.

Key Merging/Matching

When companies source data from more than one transaction system of a similar type, the same object may have different, non-unique legacy keys. Additionally, a single key may have several descriptions or attributes in each of the source systems. The independence of these systems can result in incongruent coding, which poses a greater problem than records being sourced from multiple systems.


A business can resolve this inconsistency by undertaking a complete code standardization initiative (often as part of a larger metadata management effort) or by applying a Universal Reference Data Store (URDS). Standardizing code requires an object to be uniquely represented in the new system. Alternatively, a URDS contains universal codes for common reference values. Most companies adopt this pragmatic approach, while embarking on the longer-term solution of code standardization.

The bottom line is that nearly every data warehouse project encounters this issue and needs to find a solution in the short term.
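A common short-term implementation is a cross-reference table that maps each source system's legacy key to a single universal key. The sketch below uses invented table and column names purely to illustrate the idea:

-- Illustrative cross-reference lookup: each (source system, legacy key)
-- pair resolves to one universal customer key.
SELECT x.universal_customer_key,
       s.customer_name,
       s.source_system
FROM   stg_customer s
JOIN   customer_key_xref x
  ON   x.source_system = s.source_system
 AND   x.legacy_key    = s.customer_key;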

Missing Keys

A problem arises when a transaction is sent through without a value in a column where a foreign key should exist (i.e., a reference to a key in a reference table). This normally occurs during the loading of transactional data, although it can also occur when loading reference data into hierarchy structures. In many older data warehouse solutions, this condition would be identified as an error and the transaction row would be rejected. The row would have to be processed through some other mechanism to find the correct code and loaded at a later date. This is often a slow and cumbersome process that leaves the data warehouse incomplete until the issue is resolved.

The more practical way to resolve this situation is to allocate a special key in place of the missing key, which links it with a dummy 'missing key' row in the related table. This enables the transaction to continue through the loading process and end up in the warehouse without further processing. Furthermore, the row ID of the bad transaction can be recorded in an error log, allowing the addition of the correct key value at a later time.

The major advantage of this approach is that any aggregate values derived from the transaction table will be correct because the transaction exists in the data warehouse rather than being in some external error processing file waiting to be fixed.

Simple Example:

PRODUCT     CUSTOMER   SALES REP   QUANTITY   UNIT PRICE
Audi TT18   Doe10224               1          35,000

In the transaction above, there is no code in the SALES REP column. As this row is processed, a dummy sales rep key (9999999) is added to the record to link it to a 'missing rep' placeholder row in the SALES REP table. A data warehouse key (8888888) is also added to the transaction.

PRODUCT     CUSTOMER   SALES REP   QUANTITY   UNIT PRICE   DWKEY
Audi TT18   Doe10224   9999999     1          35,000       8888888

The related sales rep record may look like this:

REP CODE   REP NAME      REP MANAGER
1234567    David Jones   Mark Smith
7654321    Mark Smith
9999999    Missing Rep

An error log entry to identify the missing key on this transaction may look like:

ERROR CODE   TABLE NAME   KEY NAME    KEY
MSGKEY       ORDERS       SALES REP   8888888

This type of error reporting is not usually necessary because the transactions with missing keys can be identified using standard end-user reporting tools against the data warehouse.
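In a PowerCenter mapping this substitution is typically an expression or a lookup default value; expressed as SQL against a hypothetical staging table, the logic of the example above amounts to:

-- Sketch of the missing-key substitution using the example values above;
-- the staging table name and the dummy key 9999999 are illustrative.
SELECT product,
       customer,
       COALESCE(sales_rep, '9999999') AS sales_rep,
       quantity,
       unit_price
FROM   stg_orders;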

Unknown Keys

Unknown keys need to be treated much like missing keys, except that the load process has to add the unknown key value to the referenced table to maintain integrity, rather than explicitly allocating a dummy key to the transaction. The process also needs to make two error log entries: the first to log the fact that a new and unknown key has been added to the reference table, and the second to record the transaction in which the unknown key was found.

Simple example:

The sales rep reference data record might look like the following:


DWKEY     REP NAME      REP MANAGER
1234567   David Jones   Mark Smith
7654321   Mark Smith
9999999   Missing Rep

A transaction comes into ODS with the record below:

PRODUCT     CUSTOMER   SALES REP   QUANTITY   UNIT PRICE
Audi TT18   Doe10224   2424242     1          35,000

In the transaction above, the code 2424242 appears in the SALES REP column. As this row is processed, a new row has to be added to the Sales Rep reference table. This allows the transaction to be loaded successfully.

DWKEY     REP NAME   REP MANAGER
2424242   Unknown

A data warehouse key (8888889) is also added to the transaction.

PRODUCT     CUSTOMER   SALES REP   QUANTITY   UNIT PRICE   DWKEY
Audi TT18   Doe10224   2424242     1          35,000       8888889

Some warehouse administrators like to have an error log entry generated to identify the addition of a new reference table entry. This can be achieved simply by adding the following entries to an error log.

ERROR CODE   TABLE NAME   KEY NAME    KEY
NEWROW       SALES REP    SALES REP   2424242

A second log entry can be added with the data warehouse key of the transaction in which the unknown key was found.


ERROR CODE   TABLE NAME   KEY NAME    KEY
UNKNKEY      ORDERS       SALES REP   8888889

As with missing keys, error reporting is not essential because the unknown status is clearly visible through the standard end-user reporting.

Moreover, regardless of the error logging, the system is self-healing because the newly added reference data entry will be updated with full details as soon as these changes appear in a reference data feed.

This would result in the reference data entry looking complete.

DWKEY     REP NAME      REP MANAGER
2424242   David Digby   Mark Smith
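Expressed as SQL against hypothetical staging and reference tables, the unknown-key handling in this example amounts to inserting a placeholder row before the transaction load runs:

-- Sketch only: add an 'Unknown' placeholder row for any incoming rep code
-- that does not yet exist in the reference table. Names are illustrative.
INSERT INTO sales_rep (dwkey, rep_name)
SELECT DISTINCT s.sales_rep, 'Unknown'
FROM   stg_orders s
WHERE  s.sales_rep IS NOT NULL
  AND  NOT EXISTS (SELECT 1
                   FROM   sales_rep r
                   WHERE  r.dwkey = s.sales_rep);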

Employing the Informatica recommended key management strategy produces the following benefits:

● All rows can be loaded into the data warehouse
● All objects are allocated a unique key
● Referential integrity is maintained
● Load dependencies are removed

Last updated: 01-Feb-07 18:53


Mapping Design

Challenge

Optimizing PowerCenter to create an efficient execution environment.

Description

Although PowerCenter environments vary widely, most sessions and/or mappings can benefit from the implementation of common objects and optimization procedures. Follow these procedures and rules of thumb when creating mappings to help ensure optimization.

General Suggestions for Optimizing

1. Reduce the number of transformations. There is always overhead involved in moving data between transformations.

2. Consider more shared memory for a large number of transformations. Session shared memory between 12MB and 40MB should suffice.

3. Calculate once, use many times.

❍ Avoid calculating or testing the same value over and over.
❍ Calculate it once in an expression, and set a True/False flag.
❍ Within an expression, use variable ports to calculate a value that can be used multiple times within that transformation.

4. Only connect what is used.

❍ Delete unnecessary links between transformations to minimize the amount of data moved, particularly in the Source Qualifier.
❍ This is also helpful for maintenance. If a transformation needs to be reconnected, it is best to only have necessary ports set as input and output to reconnect.
❍ In lookup transformations, change unused ports to be neither input nor output. This makes the transformations cleaner looking. It also makes the generated SQL override as small as possible, which cuts down on the amount of cache necessary and thereby improves performance.

5. Watch the data types.

❍ The engine automatically converts compatible types.
❍ Sometimes data conversion is excessive. Data types are automatically converted when types differ between connected ports. Minimize data type changes between transformations by planning data flow prior to developing the mapping.

6. Facilitate reuse.

❍ Plan for reusable transformations upfront.
❍ Use variables. Use both mapping variables and ports that are variables. Variable ports are especially beneficial when they can be used to calculate a complex expression or perform a disconnected lookup call only once instead of multiple times.
❍ Use mapplets to encapsulate multiple reusable transformations.
❍ Use mapplets to leverage the work of critical developers and minimize mistakes when performing similar functions.

7. Only manipulate data that needs to be moved and transformed.

❍ Reduce the number of non-essential records that are passed through the entire mapping.
❍ Use active transformations that reduce the number of records as early in the mapping as possible (i.e., placing filters and aggregators as close to the source as possible).
❍ Select the appropriate driving/master table while using joins. The table with the lesser number of rows should be the driving/master table for a faster join.

8. Utilize single-pass reads.

❍ Redesign mappings to utilize one Source Qualifier to populate multiple targets. This way the server reads this source only once. If you have different Source Qualifiers for the same source (e.g., one for delete and one for update/insert), the server reads the source for each Source Qualifier.
❍ Remove or reduce field-level stored procedures.

9. Utilize Pushdown Optimization.

❍ Design mappings so they can take advantage of the Pushdown Optimization feature. This improves performance by allowing the source and/or target database to perform the mapping logic (see the conceptual sketch after this list).
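To illustrate what pushdown optimization does conceptually (the SQL is actually generated by the Integration Service; the tables and logic below are invented for the example), filter and expression logic that would otherwise run in the DTM is executed inside the database:

-- Conceptual sketch of full pushdown: the filter and the derived column
-- are evaluated by the database rather than by the DTM.
INSERT INTO tgt_orders (order_id, customer_id, order_amount)
SELECT o.order_id,
       o.customer_id,
       o.quantity * o.unit_price
FROM   src_orders o
WHERE  o.order_status = 'SHIPPED';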

Lookup Transformation Optimizing Tips


1. When your source is large, cache lookup table columns for those lookup tables of 500,000 rows or less. This typically improves performance by 10 to 20 percent.

2. The rule of thumb is not to cache any table over 500,000 rows. This is only true if the standard row byte count is 1,024 or less. If the row byte count is more than 1,024, then you need to adjust the 500K-row standard down as the number of bytes increases (i.e., a 2,048-byte row can drop the cache row count to between 250K and 300K, so the lookup table should not be cached in this case). This is just a general rule, though. Try running the session with the large lookup cached and not cached. Caching is often faster on very large lookup tables.

3. When using a Lookup Table Transformation, improve lookup performance by placing all conditions that use the equality operator = first in the list of conditions under the condition tab.

4. Cache lookup tables only if the number of lookup calls is more than 10 to 20 percent of the lookup table rows. For a smaller number of lookup calls, do not cache if the number of lookup table rows is large. For small lookup tables (i.e., less than 5,000 rows), cache if there are more than 5 to 10 lookup calls.

5. Replace a lookup with DECODE or IIF for small sets of values (see the sketch after this list).

6. If caching lookups and performance is poor, consider replacing with an unconnected, uncached lookup.

7. For overly large lookup tables, use dynamic caching along with a persistent cache. Cache the entire table to a persistent file on the first run, enable the "update else insert" option on the dynamic cache, and the engine never has to go back to the database to read data from this table. You can also partition this persistent cache at run time for further performance gains.

8. When handling multiple matches, use the "Return any matching value" setting whenever possible. Also use this setting if the lookup is being performed to determine that a match exists, but the value returned is irrelevant. The lookup creates an index based on the key ports rather than all lookup transformation ports. This simplified indexing process can improve performance.

9. Review complex expressions.

❍ Examine mappings via Repository Reporting and Dependency Reporting within the mapping.

❍ Minimize aggregate function calls.
❍ Replace the Aggregator Transformation object with an Expression Transformation object and an Update Strategy Transformation for certain types of aggregations.
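As an illustration of tip 5 above (a sketch with hypothetical column names and values), a small, static lookup such as a country-code table can often be replaced with a DECODE in an Expression transformation:

    o_COUNTRY_NAME = DECODE(in_COUNTRY_CODE,
                            'US', 'United States',
                            'CA', 'Canada',
                            'MX', 'Mexico',
                            'Unknown')

This avoids building a lookup cache for a handful of known values; IIF can be used the same way when only two or three values are involved.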

Operations and Expression Optimizing Tips

1. Numeric operations are faster than string operations.

2. Optimize char-varchar comparisons (i.e., trim spaces before comparing).

3. Operators are faster than functions (i.e., || vs. CONCAT).

4. Optimize IIF expressions (see the sketches after this list).

5. Avoid date comparisons in lookups; replace them with string comparisons.

6. Test expression timing by replacing the expression with a constant.

7. Use flat files.

❍ Flat files located on the server machine load faster than a database located on the server machine.
❍ Fixed-width files are faster to load than delimited files because delimited files require extra parsing.
❍ If processing intricate transformations, consider loading the source flat file into a relational database first; this allows the PowerCenter mappings to access the data in an optimized fashion by using filters and custom SQL SELECTs where appropriate.

8. If working with data that cannot be returned sorted (e.g., Web logs), consider using the Sorter Advanced External Procedure.

9. Use a Router Transformation to separate data flows instead of multiple Filter Transformations.

10. Use a Sorter Transformation or hash-auto keys partitioning before an Aggregator Transformation to optimize the aggregate. With a Sorter Transformation, the Sorted Ports option can be used even if the original source cannot be ordered.

11. Use a Normalizer Transformation to pivot rows rather than multiple instances of the same target.

12. Rejected rows from an update strategy are logged to the bad file. Consider filtering before the update strategy if retaining these rows is not critical because logging causes extra overhead on the engine. Choose the option in the update strategy to discard rejected rows.

13. When using a Joiner Transformation, be sure to make the source with the smallest amount of data the Master source.

14. If an update override is necessary in a load, consider using a Lookup transformation just in front of the target to retrieve the primary key. The primary key update is much faster than the non-indexed lookup override.
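The following sketches illustrate tips 2 through 4 above (the port names are hypothetical, not taken from this Best Practice):

    -- Operator instead of nested CONCAT calls
    o_FULL_NAME = in_FIRST_NAME || ' ' || in_LAST_NAME

    -- Trim before comparing char and varchar ports
    o_MATCH_FLAG = IIF(RTRIM(in_CODE_CHAR) = RTRIM(in_CODE_VARCHAR), 'Y', 'N')

    -- Factor a repeated test out of nested IIFs using a variable port
    v_IS_DOMESTIC = IIF(in_COUNTRY_CODE = 'US', 1, 0)
    o_SHIP_RATE   = IIF(v_IS_DOMESTIC = 1, 5.00, 25.00)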

Suggestions for Using Mapplets

A mapplet is a reusable object that represents a set of transformations. It allows you to reuse transformation logic and can contain as many transformations as necessary. Use the Mapplet Designer to create mapplets.


Mapping Templates

Challenge

Mapping Templates demonstrate proven solutions for tackling challenges that commonly occur during data integration development efforts. Mapping Templates can be used to make the development phase of a project more efficient. Mapping Templates can also serve as a medium to introduce development standards into the mapping development process that developers need to follow.

A wide array of Mapping Template examples can be obtained for the most current PowerCenter version from the Informatica Customer Portal. As "templates," each of the objects in Informatica's Mapping Template Inventory illustrates the transformation logic and steps required to solve specific data integration requirements. These sample templates, however, are meant to be used as examples, not as means to implement development standards.

Description

Reuse Transformation Logic

Templates can be heavily used in a data integration and warehouse environment, when loading information from multiple source providers into the same target structure, or when similar source system structures are employed to load different target instances. Using templates guarantees that any transformation logic that is developed and tested correctly, once, can be successfully applied across multiple mappings as needed. In some instances, the process can be further simplified if the source/target structures have the same attributes, by simply creating multiple instances of the session, each with its own connection/execution attributes, instead of duplicating the mapping.

Implementing Development Techniques

When the process is not simple enough to allow reuse by merely duplicating transformation logic to load the same target, Mapping Templates can help to reproduce transformation techniques. In this case, the implementation process requires more than just replacing source/target transformations. This scenario is most useful when certain logic (i.e., a logical group of transformations) is employed across mappings. In many instances this can be further simplified by making use of mapplets. Additionally, user-defined functions can be utilized for expression-logic reuse and to build complex expressions using the transformation language.

Transport mechanism

Once Mapping Templates have been developed, they can be distributed by any of the following procedures:

● Copy the mapping from the development area to the desired repository/folder
● Export the mapping template into XML and import it into the desired repository/folder

Mapping template examples

The following Mapping Templates can be downloaded from the Informatica Customer Portal and are listed by subject area:

Common Data Warehousing Techniques

● Aggregation using Sorted Input
● Tracking Dimension History
● Constraint-Based Loading
● Loading Incremental Updates
● Tracking History and Current
● Inserts or Updates

Transformation Techniques

● Error Handling Strategy
● Flat File Creation with Headers and Footers
● Removing Duplicate Source Records
● Transforming One Record into Multiple Records
● Dynamic Caching
● Sequence Generator Alternative
● Streamline a Mapping with a Mapplet
● Reusable Transformations (Customers)
● Using a Sorter
● Pipeline Partitioning Mapping Template
● Using Update Strategy to Delete Rows
● Loading Heterogenous Targets
● Load Using External Procedure

Advanced Mapping Concepts

● Aggregation Using Expression Transformation
● Building a Parameter File
● Best Build Logic
● Comparing Values Between Records
● Transaction Control Transformation

Source-Specific Requirements

● Processing VSAM Source Files
● Processing Data from an XML Source
● Joining a Flat File with a Relational Table

Industry-Specific Requirements

● Loading SWIFT 942 Messages
● Loading SWIFT 950 Messages

Last updated: 01-Feb-07 18:53


Naming Conventions

Challenge

Choosing a good naming standard for use in the repository and adhering to it.

Description

Although naming conventions are important for all repository and database objects, the suggestions in this Best Practice focus on the former. Choosing a convention and sticking with it is the key.

Having a good naming convention facilitates smooth migration and improves readability for anyone reviewing or carrying out maintenance on the repository objects by helping them understand the processes being affected. If consistent names and descriptions are not used, significant time may be needed to understand the working of mappings and transformation objects. If no description is provided, a developer is likely to spend considerable time going through an object or mapping to understand its objective.

The following pages offer some suggestions for naming conventions for various repository objects. Whatever convention is chosen, it is important to make the selection very early in the development cycle and communicate the convention to project staff working on the repository. The policy can be enforced by peer review and at test phases by adding process to check conventions to test plans and test execution documents.

Suggested Naming Conventions

Suggested naming conventions for Designer objects:

Mapping: m_{PROCESS}_{SOURCE_SYSTEM}_{TARGET_NAME}, or suffix with _{descriptor} if there are multiple mappings for that single target table.

Mapplet: mplt_{DESCRIPTION}

Target: {update_type(s)}_{TARGET_NAME}. This naming convention should only occur within a mapping, as the actual target name object affects the actual table that PowerCenter will access.

Aggregator Transformation: AGG_{FUNCTION} that leverages the expression and/or a name that describes the processing being done.

Application Source Qualifier Transformation: ASQ_{TRANSFORMATION}_{SOURCE_TABLE1}_{SOURCE_TABLE2}. Represents data from an application source.

Custom Transformation: CT_{TRANSFORMATION} name that describes the processing being done.

Expression Transformation: EXP_{FUNCTION} that leverages the expression and/or a name that describes the processing being done.

External Procedure Transformation: EXT_{PROCEDURE_NAME}

Filter Transformation: FIL_ or FILT_{FUNCTION} that leverages the expression or a name that describes the processing being done.

Java Transformation: JV_{FUNCTION} that leverages the expression or a name that describes the processing being done.

Joiner Transformation: JNR_{DESCRIPTION}

Lookup Transformation: LKP_{TABLE_NAME}, or suffix with _{descriptor} if there are multiple lookups on a single table. For unconnected lookups, use ULKP in place of LKP.

Mapplet Input Transformation: MPLTI_{DESCRIPTOR} indicating the data going into the mapplet.

Mapplet Output Transformation: MPLTO_{DESCRIPTOR} indicating the data coming out of the mapplet.

MQ Source Qualifier Transformation: MQSQ_{DESCRIPTOR} defines the messaging being selected.

Normalizer Transformation: NRM_{FUNCTION} that leverages the expression or a name that describes the processing being done.

Rank Transformation: RNK_{FUNCTION} that leverages the expression or a name that describes the processing being done.

Router Transformation: RTR_{DESCRIPTOR}

Sequence Generator Transformation: SEQ_{DESCRIPTOR}; if generating keys for a target table entity, then refer to that entity.

Sorter Transformation: SRT_{DESCRIPTOR}

Source Qualifier Transformation: SQ_{SOURCE_TABLE1}_{SOURCE_TABLE2}. Using all source tables can be impractical if there are a lot of tables in a source qualifier, so refer to the type of information being obtained, for example a certain type of product – SQ_SALES_INSURANCE_PRODUCTS.

Stored Procedure Transformation: SP_{STORED_PROCEDURE_NAME}

Transaction Control Transformation: TCT_ or TRANS_{DESCRIPTOR} indicating the function of the transaction control.

Union Transformation: UN_{DESCRIPTOR}

Update Strategy Transformation: UPD_{UPDATE_TYPE(S)}, or UPD_{UPDATE_TYPE(S)}_{TARGET_NAME} if there are multiple targets in the mapping. E.g., UPD_UPDATE_EXISTING_EMPLOYEES.

XML Generator Transformation: XMG_{DESCRIPTOR} defines the target message.

XML Parser Transformation: XMP_{DESCRIPTOR} defines the messaging being selected.

XML Source Qualifier Transformation: XMSQ_{DESCRIPTOR} defines the data being selected.

Port Names

Ports names should remain the same as the source unless some other action is performed on the port. In that case, the port should be prefixed with the appropriate name.


When the developer brings a source port into a lookup, the port should be prefixed with ‘in_’. This helps the user immediately identify the ports that are being input without having to line up the ports with the input checkbox. In any other transformation, if the input port is transformed in an output port with the same name, prefix the input port with ‘in_’.

Generated output ports can also be prefixed. This helps trace the port value throughout the mapping as it may travel through many other transformations. If it is intended to be able to use the autolink feature based on names, then outputs may be better left as the name of the target port in the next transformation. For variables inside a transformation, the developer can use the prefix ‘v’, 'var_’ or ‘v_' plus a meaningful name.

With some exceptions, port standards apply when creating a transformation object. The exceptions are the Source Definition, the Source Qualifier, the Lookup, and the Target Definition ports, which must not change since the port names are used to retrieve data from the database.

Other transformations that are not applicable to the port standards are:

● Normalizer - The ports created in the Normalizer are automatically formatted when the developer configures it.
● Sequence Generator - The ports are reserved words.
● Router - Because output ports are created automatically, prefixing the input ports with an I_ prefixes the output ports with I_ as well. Port names should not have any prefix.
● Sorter, Update Strategy, Transaction Control, and Filter - These ports are always input and output. There is no need to rename them unless they are prefixed. Prefixed port names should be removed.
● Union - The group ports are automatically assigned to the input and output; therefore prefixing with anything is reflected in both the input and output. The port names should not have any prefix.

All other transformation object ports can be prefixed or suffixed with:

● ‘in_’ or ‘i_’ for Input ports
● ‘o_’ or ‘_out’ for Output ports
● ‘io_’ for Input/Output ports
● ‘v’, ‘v_’ or ‘var_’ for variable ports
● ‘lkp_’ for returns from lookups
● ‘mplt_’ for returns from mapplets

(An example using several of these prefixes appears at the end of this section.)

Prefixes are preferable because they are generally easier to see; developers do not need to expand the columns to see the suffix for longer port names.

Transformation object ports can also:

● Have the Source Qualifier port name.
● Be unique.
● Be meaningful.
● Be given the target port name.
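As a brief illustration of these prefixes (a sketch only; the port names and the unconnected lookup are hypothetical), an Expression transformation might contain:

    in_CUSTOMER_ID    (input)    passed through from the Source Qualifier
    lkp_CUST_TYPE     (variable) = :LKP.ULKP_CUSTOMER_TYPE(in_CUSTOMER_ID)
    v_IS_PREFERRED    (variable) = IIF(lkp_CUST_TYPE = 'P', 1, 0)
    o_DISCOUNT_PCT    (output)   = IIF(v_IS_PREFERRED = 1, 10, 0)

The prefixes make the direction and role of each port obvious without expanding the port columns in the Designer.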

Transformation Descriptions

This section defines the standards to be used for transformation descriptions in the Designer.

● Source Qualifier Descriptions. Should include the aim of the source qualifier and the data it is intended to select.

Should also indicate if any overrides are used. If so, it should describe the filters or settings used. Some projects prefer items such as the SQL statement to be included in the description as well.

● Lookup Transformation Descriptions. Describe the lookup along the lines of the [lookup attribute] obtained from [lookup table name] to retrieve the [lookup attribute name].

Where:


❍ Lookup attribute is the name of the column being passed into the lookup and is used as the lookup criteria.
❍ Lookup table name is the table on which the lookup is being performed.
❍ Lookup attribute name is the name of the attribute being returned from the lookup. If appropriate, specify the condition when the lookup is actually executed.

It is also important to note lookup features such as persistent cache or dynamic lookup.

● Expression Transformation Descriptions. Must adhere to the following format:

“This expression … [explanation of what transformation does].”

Expressions can be distinctly different depending on the situation; therefore the explanation should be specific to the actions being performed.

Within each Expression, transformation ports have their own description in the format:

“This port … [explanation of what the port is used for].”

● Aggregator Transformation Descriptions. Must adhere to the following format:

“This Aggregator … [explanation of what transformation does].”

Aggregators can be distinctly different, depending on the situation; therefore the explanation should be specific to the actions being performed.

Within each Aggregator, transformation ports have their own description in the format:

“This port … [explanation of what the port is used for].”

● Sequence Generators Transformation Descriptions. Must adhere to the following format:

“This Sequence Generator provides the next value for the [column name] on the [table name].”

Where:

❍ Table name is the table being populated by the sequence number.
❍ Column name is the column within that table being populated.

● Joiner Transformation Descriptions. Must adhere to the following format:

“This Joiner uses … [joining field names] from [joining table names].”

Where:

❍ Joining field names are the names of the columns on which the join is done.
❍ Joining table names are the tables being joined.

● Normalizer Transformation Descriptions. Must adhere to the following format:

“This Normalizer … [explanation].”

Where:

❍ explanation describes what the Normalizer does.


● Filter Transformation Descriptions. Must adhere to the following format:

“This Filter processes … [explanation].”

Where:

❍ explanation describes what the filter criteria are and what they do.

● Stored Procedure Transformation Descriptions. Explain the stored procedure’s functionality within the mapping (i.e., what does it return in relation to the input ports?).

● Mapplet Input Transformation Descriptions. Describe the input values and their intended use in the mapplet.

● Mapplet Output Transformation Descriptions. Describe the output ports and the subsequent use of those values. As an example, for an exchange rate mapplet, describe what currency the output value will be in. Answer questions such as: Is the currency fixed or based on other data? What kind of rate is used: a fixed inter-company rate, an inter-bank rate, a business rate, or a tourist rate? Has the conversion gone through an intermediate currency?

● Update Strategies Transformation Descriptions. Describe the Update Strategy and whether it is fixed in its function or determined by a calculation.

● Sorter Transformation Descriptions. Explanation of the port(s) that are being sorted and their sort direction.

● Router Transformation Descriptions. Describes the groups and their functions.

● Union Transformation Descriptions. Describe the source inputs and indicate what further processing on those inputs (if any) is expected to take place in later transformations in the mapping.

● Transaction Control Transformation Descriptions. Describe the process behind the transaction control and the function of the control to commit or rollback.

● Custom Transformation Descriptions. Describe the function that the custom transformation accomplishes and what data is expected as input and what data will be generated as output. Also indicate the module name (and location) and the procedure which is used.

● External Procedure Transformation Descriptions. Describe the function of the external procedure and what data is expected as input and what data will be generated as output. Also indicate the module name (and location) and the procedure that is used.

● Java Transformation Descriptions. Describe the function of the Java code, what data is expected as input, and what data is generated as output. Also indicate whether the Java code determines the object to be an Active or Passive transformation.

● Rank Transformation Descriptions. Indicate the columns being used in the rank, the number of records returned from the rank, the rank direction, and the purpose of the transformation.

● XML Generator Transformation Descriptions. Describe the data expected for the generation of the XML and indicate the purpose of the XML being generated.

● XML Parser Transformation Descriptions. Describe the input XML expected and the output from the parser and indicate the purpose of the transformation.

Mapping Comments

These comments describe the source data obtained and the structure file, table or facts and dimensions that it populates. Remember to use business terms along with such technical details as table names. This is beneficial when maintenance is required or if issues arise that need to be discussed with business analysts.

Mapplet Comments

These comments are used to explain the process that the mapplet carries out. Always be sure to see the notes regarding descriptions for the input and output transformation.


Repository Objects

Repositories, as well as repository-level objects, should also have meaningful names. Repositories should be prefixed with either ‘L_’ for local or ‘G_’ for global, plus a descriptor. Descriptors usually include information about the project and/or level of the environment (e.g., PROD, TEST, DEV).

Folders and Groups

Working folder names should be meaningful and include project name and, if there are multiple folders for that one project, a descriptor. User groups should also include project name and descriptors, as necessary. For example, folder DW_SALES_US and DW_SALES_UK could both have TEAM_SALES as their user group. Individual developer folders or non-production folders should prefix with ‘z_’ so that they are grouped together and not confused with working production folders.

Shared Objects and Folders

Any object within a folder can be shared across folders and maintained in one central location. These objects are sources, targets, mappings, transformations, and mapplets. To share objects in a folder, the folder must be designated as shared. In addition to facilitating maintenance, shared folders help reduce the size of the repository since shortcuts are used to link to the original, instead of copies.

Only users with the proper permissions can access these shared folders. These users are responsible for migrating the folders across the repositories and, with help from the developers, for maintaining the objects within the folders. For example, if an object is created by a developer and is to be shared, the developer should provide details of the object and the level at which the object is to be shared before the Administrator accepts it as a valid entry into the shared folder. The developers, not necessarily the creator, control the maintenance of the object, since they must ensure that a subsequent change does not negatively impact other objects.

If the developer has an object that he or she wants to use in several mappings or across multiple folders, like an Expression transformation that calculates sales tax, the developer can place the object in a shared folder. Then use the object in other folders by creating a shortcut to the object. In this case, the naming convention is ‘sc_’ (e.g., sc_EXP_CALC_SALES_TAX). The folder should prefix with ‘SC_’ to identify it as a shared folder and keep all shared folders grouped together in the repository.

Workflow Manager Objects

Suggested naming conventions for Workflow Manager objects:

Session: s_{MappingName}
Command Object: cmd_{DESCRIPTOR}
Worklet: wk or wklt_{DESCRIPTOR}
Workflow: wkf or wf_{DESCRIPTOR}
Email Task: email_ or eml_{DESCRIPTOR}
Decision Task: dcn_ or dt_{DESCRIPTOR}
Assign Task: asgn_{DESCRIPTOR}
Timer Task: timer_ or tmr_{DESCRIPTOR}
Control Task: ctl_{DESCRIPTOR}. Specify when and how the PowerCenter Server is to stop or abort a workflow by using the Control task in the workflow.
Event Wait Task: wait_ or ew_{DESCRIPTOR}. Waits for an event to occur; once the event triggers, the PowerCenter Server continues executing the rest of the workflow.
Event Raise Task: raise_ or er_{DESCRIPTOR}. Represents a user-defined event. When the PowerCenter Server runs the Event-Raise task, the Event-Raise task triggers the event. Use the Event-Raise task with the Event-Wait task to define events.

ODBC Data Source Names

All Open Database Connectivity (ODBC) data source names (DSNs) should be set up in the same way on all client machines. PowerCenter uniquely identifies a source by its Database Data Source (DBDS) and its name. The DBDS is the same name as the ODBC DSN since the PowerCenter Client talks to all databases through ODBC.

Also be sure to set up the ODBC DSNs as system DSNs so that all users of a machine can see the DSN. This approach reduces the chance of discrepancies occurring when users work on different (i.e., colleagues') machines and would otherwise have to recreate the DSN on each one.

If ODBC DSNs are different across multiple machines, there is a risk of analyzing the same table using different names. For example, machine1 has ODBC DSN Name0 that points to database1. TableA gets analyzed on machine1 and is uniquely identified as Name0.TableA in the repository. Machine2 has ODBC DSN Name1 that points to database1. TableA gets analyzed on machine2 and is uniquely identified as Name1.TableA in the repository. The result is that the repository may refer to the same object by multiple names, creating confusion for developers, testers, and potentially end users.

Also, refrain from using environment tokens in the ODBC DSN. For example, do not call it dev_db01. When migrating objects from dev, to test, to prod, PowerCenter can wind up with source objects called dev_db01 in the production repository. ODBC database names should clearly describe the database they reference to ensure that users do not incorrectly point sessions to the wrong databases.

Database Connection Information

Security considerations may dictate using the company name of the database or project instead of {user}_{database name}, except for developer scratch schemas, which are not found in test or production environments. Be careful not to include machine names or environment tokens in the database connection name. Database connection names must be very generic to be understandable and ensure a smooth migration.

The naming convention should be applied across all development, test, and production environments. This allows seamless migration of sessions when migrating between environments. If an administrator uses the Copy Folder function for migration, session information is also copied. If the Database Connection information does not already exist in the folder the administrator is copying to, it is also copied. So, if the developer uses connections with names like Dev_DW in the development repository, they are likely to eventually wind up in the test, and even the production repositories as the folders are migrated. Manual intervention is then necessary to change connection names, user names, passwords, and possibly even connect strings.

Instead, if the developer just has a DW connection in each of the three environments, when the administrator copies a folder from the development environment to the test environment, the sessions automatically use the existing connection in the test repository. With the right naming convention, you can migrate sessions from the test to production repository without manual intervention.

TIP At the beginning of a project, have the Repository Administrator or DBA setup all connections in all environments based on the issues discussed in this Best Practice. Then use permission options to protect these connections so that only specified individuals can modify them. Whenever possible, avoid having developers create their own connections using different conventions and possibly duplicating connections.

Administration Console Objects

Administration console objects such as domains, nodes, and services should also have meaningful names.

Suggested naming conventions for Administration Console objects, with examples:

Domain: DOM_ or DMN_[PROJECT]_[ENVIRONMENT]. Example: DOM_PROCURE_DEV
Node: NODE[#]_[SERVER_NAME]_[optional_descriptor]. Example: NODE02_SERVER_rs_b (backup node for the repository service)
Integration Service: INT_SVC_[ENVIRONMENT]_[optional descriptor]. Example: INT_SVC_DEV_primary
Repository Service: REPO_SVC_[ENVIRONMENT]_[optional descriptor]. Example: REPO_SVC_TEST
Web Services Hub: WEB_SVC_[ENVIRONMENT]_[optional descriptor]. Example: WEB_SVC_PROD

PowerCenter PowerExchange Application/Relational Connections

Before the PowerCenter Server can access a source or target in a session, you must configure connections in the Workflow Manager. When you create or modify a session that reads from, or writes to, a database, you can select only configured source and target databases. Connections are saved in the repository.

For PowerExchange Client for PowerCenter, you configure relational database and/or application connections. The connection you configure depends on the type of source data you want to extract and the extraction mode (e.g., PWX[MODE_INITIAL]_[SOURCE]_[Instance_Name]). The following table shows some examples.

Source type and extraction mode, with connection category, connection type, and recommended naming convention:

DB2/390 Bulk Mode: Relational connection, type PWX DB2390. Naming: PWXB_DB2_Instance_Name
DB2/390 Change Mode: Application connection, type PWX DB2390 CDC Change. Naming: PWXC_DB2_Instance_Name
DB2/390 Real Time Mode: Application connection, type PWX DB2390 CDC Real Time. Naming: PWXR_DB2_Instance_Name
IMS Batch Mode: Application connection, type PWX NRDB Batch. Naming: PWXB_IMS_Instance_Name
IMS Change Mode: Application connection, type PWX NRDB CDC Change. Naming: PWXC_IMS_Instance_Name
IMS Real Time Mode: Application connection, type PWX NRDB CDC Real Time. Naming: PWXR_IMS_Instance_Name
Oracle Change Mode: Application connection, type PWX Oracle CDC Change. Naming: PWXC_ORA_Instance_Name
Oracle Real Time Mode: Application connection, type PWX Oracle CDC Real Time. Naming: PWXR_ORA_Instance_Name

PowerCenter PowerExchange Target Connections

The connection you configure depends on the type of target data you want to load.

Target type, with connection type and recommended naming convention:

DB2/390: PWX DB2390 relational database connection. Naming: PWXT_DB2_Instance_Name
DB2/400: PWX DB2400 relational database connection. Naming: PWXT_DB2_Instance_Name

Last updated: 01-Feb-07 18:53


Performing Incremental Loads

Challenge

Data warehousing incorporates very large volumes of data. The process of loading the warehouse in a reasonable timescale without compromising its functionality is extremely difficult. The goal is to create a load strategy that can minimize downtime for the warehouse and allow quick and robust data management.

Description

As time windows shrink and data volumes increase, it is important to understand the impact of a suitable incremental load strategy. The design should allow data to be incrementally added to the data warehouse with minimal impact on the overall system. This Best Practice describes several possible load strategies.

Incremental Aggregation

Incremental aggregation is useful for applying incrementally-captured changes in the source to aggregate calculations in a session.

If the source changes only incrementally, and you can capture those changes, you can configure the session to process only those changes with each run. This allows the PowerCenter Integration Service to update the target incrementally, rather than forcing it to process the entire source and recalculate the same calculations each time you run the session.

If the session performs incremental aggregation, the PowerCenter Integration Service saves index and data cache information to disk when the session finishes. The next time the session runs, the PowerCenter Integration Service uses this historical information to perform the incremental aggregation. To utilize this functionality set the “Incremental Aggregation” Session attribute. For details see Chapter 24 in the Workflow Administration Guide.

Use incremental aggregation under the following conditions:

● Your mapping includes an aggregate function.
● The source changes only incrementally.

● You can capture incremental changes (i.e., by filtering source data by timestamp).

● You get only delta records (i.e., you may have implemented the CDC (Change Data Capture) feature of PowerExchange).

Do not use incremental aggregation in the following circumstances:

● You cannot capture new source data.
● Processing the incrementally-changed source significantly changes the target.

If processing the incrementally-changed source alters more than half the existing target, the session may not benefit from using incremental aggregation.

● Your mapping contains percentile or median functions.

Some conditions that may help in making a decision on an incremental strategy include:

● Error handling, loading and unloading strategies for recovering, reloading, and unloading data.

● History tracking requirements for keeping track of what has been loaded and when

● Slowly-changing dimensions. Informatica Mapping Wizards are a good start to an incremental load strategy. The Wizards generate generic mappings as a starting point (refer to Chapter 15 in the Designer Guide)

Source Analysis

Data sources typically fall into the following possible scenarios:

● Delta records. Records supplied by the source system include only new or changed records. In this scenario, all records are generally inserted or updated into the data warehouse.

● Record indicator or flags. Records that include columns that specify the intention of the record to be populated into the warehouse. Records can be selected based upon this flag for all inserts, updates, and deletes.

● Date stamped data. Data is organized by timestamps, and loaded into the warehouse based upon the last processing date or the effective date range.

● Key values are present. When only key values are present, data must be checked against what has already been entered into the warehouse. All values must be checked before entering the warehouse.


● No key values present. When no key values are present, surrogate keys are created and all data is inserted into the warehouse based upon validity of the records.

Identify Records for Comparison

After the sources are identified, you need to determine which records need to be entered into the warehouse and how. Here are some considerations:

● Compare with the target table. When source delta loads are received, determine if the record exists in the target table. The timestamps and natural keys of the record are the starting point for identifying whether the record is new, modified, or should be archived. If the record does not exist in the target, insert the record as a new row. If it does exist, determine if the record needs to be updated, inserted as a new record, or removed (deleted from target) or filtered out and not added to the target.

● Record indicators. Record indicators can be beneficial when lookups into the target are not necessary. Take care to ensure that the record exists for update or delete scenarios, or does not exist for successful inserts. Some design effort may be needed to manage errors in these situations.

Determine Method of Comparison

There are four main strategies in mapping design that can be used as a method of comparison:

● Joins of sources to targets. Records are directly joined to the target using Source Qualifier join conditions or using Joiner transformations after the Source Qualifiers (for heterogeneous sources). When using Joiner transformations, take care to ensure the data volumes are manageable and that the smaller of the two datasets is configured as the Master side of the join.

● Lookup on target. Using the Lookup transformation, lookup the keys or critical columns in the target relational database. Consider the caches and indexing possibilities.

● Load table log. Generate a log table of records that have already been inserted into the target system. You can use this table for comparison with lookups or joins, depending on the need and volume. For example, store keys in a separate table and compare source records against this log table to determine load strategy. Another example is to store the dates associated with the data already loaded into a log table.

● MD5 checksum function. Generate a unique value for each row of data and then compare previous and current checksum values to determine whether the record has changed (as illustrated in the sketch below).
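As a minimal sketch of the checksum approach (the port names are hypothetical, and the MD5() string function is assumed to be available in the PowerCenter version in use), an Expression transformation can build a row fingerprint from the non-key columns and compare it with the checksum stored for the target row:

    v_ROW_CHECKSUM = MD5(in_CUST_NAME || '|' || in_CUST_ADDRESS || '|' || TO_CHAR(in_CREDIT_LIMIT))
    o_CHANGED_FLAG = IIF(v_ROW_CHECKSUM != lkp_PREV_CHECKSUM, 'Y', 'N')

Only rows flagged 'Y' (or rows with no previous checksum at all) need to be written to the target.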

Source-Based Load Strategies

Complete Incremental Loads in a Single File/Table

The simplest method for incremental loads is from flat files or a database in which all records are going to be loaded. This strategy requires bulk loads into the warehouse with no overhead on processing of the sources or sorting the source records.

Data can be loaded directly from the source locations into the data warehouse. There is no additional overhead produced in moving these sources into the warehouse.

Date-Stamped Data

This method involves data that has been stamped using effective dates or sequences. The incremental load can be determined by dates greater than the previous load date or data that has an effective key greater than the last key processed.

With the use of relational sources, the records can be selected based on this effective date and only those records past a certain date are loaded into the warehouse. Views can also be created to perform the selection criteria. This way, the processing does not have to be incorporated into the mappings but is kept on the source component.

Placing the load strategy into the other mapping components is more flexible and controllable by the Data Integration developers and the associated metadata.

To compare the effective dates, you can use mapping variables to provide the previous date processed (see the description below). An alternative to Repository-maintained mapping variables is the use of control tables to store the dates and update the control table after each load.
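For example, a minimal sketch of such a view (the table, column, and control-table names here are illustrative, not prescribed by this Best Practice) might be:

    CREATE VIEW V_ORDERS_DELTA AS
    SELECT *
    FROM   ORDERS
    WHERE  LAST_UPDATE_DATE > (SELECT LAST_LOAD_DATE
                               FROM   ETL_CONTROL
                               WHERE  SUBJECT_AREA = 'ORDERS');

The Source Qualifier then selects from V_ORDERS_DELTA, and the control table (or a mapping variable, as described below) is updated after each successful load.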

Non-relational data can be filtered as records are loaded based upon the effective dates or sequenced keys. A Router transformation or filter can be placed after the Source Qualifier to remove old records.

Changed Data Based on Keys or Record Information

Data that is uniquely identified by keys can be sourced according to selection criteria. For example, records that contain primary keys or alternate keys can be used to determine if they have already been entered into the data warehouse. If they exist, you


can also check to see if you need to update these records or discard the source record.

It may be possible to perform a join with the target tables in which new data can be selected and loaded into the target. It may also be feasible to lookup in the target to see if the data exists.

Target-Based Load Strategies

● Loading directly into the target. Loading directly into the target is possible when the data is going to be bulk loaded. The mapping is then responsible for error control, recovery, and update strategy.

● Load into flat files and bulk load using an external loader. The mapping loads data directly into flat files. You can then invoke an external loader to bulk load the data into the target. This method reduces the load times (with less downtime for the data warehouse) and provides a means of maintaining a history of data being loaded into the target. Typically, this method is only used for updates into the warehouse.

● Load into a mirror database. The data is loaded into a mirror database to avoid downtime of the active data warehouse. After data has been loaded, the databases are switched, making the mirror the active database and the active the mirror.

Using Mapping Variables

You can use a mapping variable to perform incremental loading. By referencing a date-based mapping variable in the Source Qualifier or join condition, it is possible to select only those rows with greater than the previously captured date (i.e., the newly inserted source data). However, the source system must have a reliable date to use.

The steps involved in this method are:

Step 1: Create mapping variable

In the Mapping Designer, choose Mappings > Parameters > Variables. Or, to create variables for a mapplet, choose Mapplet > Parameters > Variables in the Mapplet Designer.

Click Add and enter the name of the variable (i.e., $$INCREMENT_DATE). In this case, make your variable a date/time. For the Aggregation option, select MAX.

In the same screen, state your initial value. This date is used during the initial run of the session and as such should represent a date earlier than the earliest desired data. The date can use any one of these formats:

● MM/DD/RR
● MM/DD/RR HH24:MI:SS
● MM/DD/YYYY
● MM/DD/YYYY HH24:MI:SS

Step 2: Reference the mapping variable in the Source Qualifier

The select statement should look like the following:

SELECT * FROM table_A WHERE CREATE_DATE > TO_DATE('$$INCREMENT_DATE', 'MM-DD-YYYY HH24:MI:SS')

Step 3: Refresh the mapping variable for the next session run using an Expression Transformation

Use an Expression transformation and the pre-defined variable functions to set and use the mapping variable.

In the expression transformation, create a variable port and use the SETMAXVARIABLE variable function to capture the maximum source date selected during each run.

SETMAXVARIABLE($$INCREMENT_DATE,CREATE_DATE)

CREATE_DATE in this example is the date field from the source that should be used to identify incremental rows.

You can use the variables in the following transformations:

● Expression
● Filter
● Router
● Update Strategy

As the session runs, the variable is refreshed with the max date value encountered between the source and variable. So, if one row comes through with 9/1/2004, then the variable gets that value. If all subsequent rows are LESS than that, then 9/1/2004 is preserved.

Note: This behavior has no effect on the date used in the source qualifier. The initial select always contains the maximum date value encountered during the previous, successful session run.

When the mapping completes, the PERSISTENT value of the mapping variable is stored in the repository for the next run of your session. You can view the value of the mapping variable in the session log file.

The advantage of the mapping variable and incremental loading is that it allows the session to use only the new rows of data. No table is needed to store the max(date) since the variable takes care of it.

After a successful session run, the PowerCenter Integration Service saves the final value of each variable in the repository. So when you run your session the next time, only new data from the source system is captured. If necessary, you can override the value saved in the repository with a value saved in a parameter file.

Using PowerExchange Change Data Capture

PowerExchange (PWX) Change Data Capture (CDC) greatly simplifies the identification, extraction, and loading of change records. It supports all key mainframe and midrange database systems, requires no changes to the user application, uses vendor-supplied technology where possible to capture changes, and eliminates the need for programming or the use of triggers. Once PWX CDC collects changes, it places them in a “change stream” for delivery to PowerCenter. Included in the change data is useful control information, such as the transaction type (insert/update/delete) and the transaction timestamp. In addition, the change data can be made available immediately (i.e., in real time) or periodically (i.e., where changes are condensed).

The native interface between PowerCenter and PowerExchange is PowerExchange Client for PowerCenter (PWXPC). PWXPC enables PowerCenter to pull the change data from the PWX change stream if real-time consumption is needed or from PWX condense files if periodic consumption is required. The changes are applied directly. So if the action flag is “I”, the record is inserted. If the action flag is “U’, the record is updated. If the action flag is “D”, the record is deleted. There is no need for change detection logic in the PowerCenter mapping.


In addition, by leveraging “group source” processing, where multiple sources are placed in a single mapping, the PowerCenter session reads the committed changes for multiple sources in a single efficient pass, and in the order they occurred. The changes are then propagated to the targets, and upon session completion, restart tokens (markers) are written out to a PowerCenter file so that the next session run knows the point to extract from.

Tips for Using PWX CDC

● After installing PWX, ensure the PWX Listener is up and running and that connectivity is established to the Listener. For best performance, the Listener should be co-located with the source system.

● In the PWX Navigator client tool, use metadata to configure data access. This means creating data maps for the non-relational to relational view of mainframe sources (such as IMS and VSAM) and capture registrations for all sources (mainframe, Oracle, DB2, etc). Registrations define the specific tables and columns desired for change capture. There should be one registration per source. Group the registrations logically, for example, by source database.

● For an initial test, make changes in the source system to the registered sources. Ensure that the changes are committed.

● Still working in PWX Navigator (and before using PowerCenter), perform Row Tests to verify the returned change records, including the transaction action flag (the DTL__CAPXACTION column) and the timestamp. Set the required access mode: CAPX for change and CAPXRT for real time. Also, if desired, edit the PWX extraction maps to add the Change Indicator (CI) column. This CI flag (Y or N) allows for field level capture and can be filtered in the PowerCenter mapping.

● Use PowerCenter to materialize the targets (i.e., to ensure that sources and targets are in sync prior to starting the change capture process). This can be accomplished with a simple pass-through “batch” mapping. This same bulk mapping can be reused for CDC purposes, but only if specific CDC columns are not included, and by changing the session connection/mode.

● Import the PWX extraction maps into Designer. This requires the PWXPC component. Specify the CDC Datamaps option during the import.

● Use “group sourcing” to create the CDC mapping by including multiple sources in the mapping. This enhances performance because only one read/connection is made to the PWX Listener and all changes (for the sources in the mapping) are pulled at one time.

● Keep the CDC mappings simple. There are some limitations; for instance, you cannot use active transformations. In addition, if loading to a staging area, store the transaction types (i.e., insert/update/delete) and the timestamp for subsequent processing downstream. Also, if loading to a staging area, include an Update Strategy transformation in the mapping with DD_INSERT or DD_UPDATE in order to override the default behavior and store the action flags (see the expression sketch after these tips).

● Set up the Application Connection in Workflow Manager to be used by the CDC session. This requires the PWXPC component. There should be one connection and token file per CDC mapping/session. Set the UOW (unit of work) to a low value for faster commits to the target for real-time sessions. Specify the restart token location and file on the PowerCenter Integration Service (within the infa_shared directory) and specify the location of the PWX Listener.

● In the CDC session properties, enable session recovery (i.e., set the Recovery Strategy to “Resume from last checkpoint”).

● Use post-session commands to archive the restart token files for restart/recovery purposes. Also, archive the session logs.
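As a minimal sketch of the staging-area tip above (the output port name is hypothetical; DTL__CAPXACTION is the PWX transaction action column mentioned earlier), the Update Strategy expression and a pass-through port might look like this:

    Update Strategy expression:  DD_INSERT
    o_CHANGE_ACTION (output)  =  DTL__CAPXACTION

Flagging every row as DD_INSERT overrides the default apply behavior so that deletes and updates land in the staging table as rows, with the action flag preserved for downstream processing.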

Last updated: 01-Feb-07 18:53


Real-Time Integration with PowerCenter

Challenge

Configure PowerCenter to work with PowerCenter Connect to process real-time data. This Best Practice discusses guidelines for establishing a connection with PowerCenter and setting up a real-time session to work with PowerCenter.

Description

PowerCenter with real-time option can be used to process data from real-time data sources. PowerCenter supports the following types of real-time data:

● Messages and message queues. PowerCenter with the real-time option can be used to integrate third-party messaging applications using a specific version of PowerCenter Connect. Each PowerCenter Connect version supports a specific industry-standard messaging application, such as IBM MQSeries, JMS, MSMQ, SAP NetWeaver mySAP Option, TIBCO, and webMethods. You can read from messages and message queues and write to messages, messaging applications, and message queues. IBM MQ Series uses a queue to store and exchange data. Other applications, such as TIBCO and JMS, use a publish/subscribe model. In this case, the message exchange is identified using a topic.

● Web service messages. PowerCenter can receive a web service message from a web service client through the Web Services Hub, transform the data, and load the data to a target or send a message back to a web service client. A web service message is a SOAP request from a web service client or a SOAP response from the Web Services Hub. The Integration Service processes real-time data from a web service client by receiving a message request through the Web Services Hub and processing the request. The Integration Service can send a reply back to the web service client through the Web Services Hub or write the data to a target.

● Changed source data. PowerCenter can extract changed data in real time from a source table using the PowerExchange Listener and write data to a target. Real-time sources supported by PowerExchange are ADABAS, DATACOM, DB2/390, DB2/400, DB2/UDB, IDMS, IMS, MS SQL Server, Oracle and VSAM.

Connection Setup

PowerCenter uses some attribute values in order to correctly connect and identify the third-party messaging application and message itself. Each version of PowerCenter Connect supplies its own connection attributes that need to be configured properly before running a real-time session.

Setting Up Real-Time Session in PowerCenter


The PowerCenter real-time option uses a zero latency engine to process data from the messaging system. Depending on the messaging systems and the application that sends and receives messages, there may be a period when there are many messages and, conversely, there may be a period when there are no messages. PowerCenter uses the attribute ‘Flush Latency’ to determine how often the messages are being flushed to the target. PowerCenter also provides various attributes to control when the session ends.

The following reader attributes determine when a PowerCenter session should end:

● Message Count - Controls the number of messages the PowerCenter Server reads from the source before the session stops reading from the source.

● Idle Time - Indicates how long the PowerCenter Server waits when no messages arrive before it stops reading from the source.

● Time Slice Mode - Indicates a specific range of time that the server read messages from the source. Only PowerCenter Connect for MQSeries uses this option.

● Reader Time Limit - Indicates the number of seconds the PowerCenter Server spends reading messages from the source.

The specific filter conditions and options available to you depend on which Real-Time source is being used.

For example: Attributes for PowerExchange real-time CDC for DB2/400


Set the attributes that control how the reader ends. One or more attributes can be used to control the end of session.

For example: set the Reader Time Limit attribute to 3600. The reader will end after 3600 seconds. The idle time limit is set to 500 seconds. The reader will end if it doesn’t process any changes for 500 seconds (i.e., it remains idle for 500 seconds).

If more than one attribute is selected, the first attribute that satisfies the condition is used to control the end of session.

Note: The real-time attributes can be found in the Reader Properties for PowerCenter Connect for JMS, TIBCO, webMethods, and SAP IDoc. For PowerCenter Connect for MQSeries, the real-time attributes must be specified as a filter condition.


The next step is to set the Real-time Flush Latency attribute. The Flush Latency defines how often PowerCenter should flush messages, expressed in milli-seconds.

For example, if the Real-time Flush Latency is set to 2000, PowerCenter flushes messages every two seconds. The messages will also be flushed from the reader buffer if the Source Based Commit condition is reached. The Source Based Commit condition is defined in the Properties tab of the session.

The message recovery option can be enabled to ensure that no messages are lost if a session fails as a result of unpredictable error, such as power loss. This is especially important for real-time sessions because some messaging applications do not store the messages after the messages are consumed by another application.

A unit of work (UOW) is a collection of changes within a single commit scope made by a transaction on the source system from an external application. Each UOW may consist of a different number of rows depending on the transaction to the source system. When you use the UOW Count Session condition, the Integration Service commits source data to the target when it reaches the number of UOWs specified in the session condition.

For example, if the value for UOW Count is 10, the Integration Service commits all data read from the source after the 10th UOW enters the source. The lower you set the value, the faster the Integration Service commits data to the target. The lower value also causes the system to consume more resources.

Executing a Real-Time Session

A real-time session often has to be up and running continuously to listen to the messaging application and to process messages immediately after the messages arrive. Set the reader attribute Idle Time to -1 and Flush Latency to a specific time interval. This is applicable for all PowerExchange and PowerCenter Connect versions except for PowerConnect for MQSeries where the session continues to run and flush the messages to the target using the specific flush latency interval.

Another scenario is the ability to read data from another source system and immediately send it to a real-time target. For example, reading data from a relational source and writing it to MQ Series. In this case, set the session to run continuously so that every change in the source system can be immediately reflected in the target.

A real-time session may run continuously until a condition is met to end the session. In some situations it may be required to periodically stop the session and restart it. This is sometimes necessary to execute a post-session command or run some other process that is not part of the session. To stop the session and restart it, it is useful to deploy continuously running workflows. The Integration Service starts the next run of a continuous workflow as soon as it completes the first.

To set a workflow to run continuously, edit the workflow and select the ‘Scheduler’ tab. Edit the ‘Scheduler’ and select ‘Run Continuously’ from ‘Run Options’. A continuous workflow starts automatically when the Integration Service initializes. When the workflow stops, it restarts immediately.

Real-Time Sessions and Active Transformations

Some of the transformations in PowerCenter are ‘active transformations’, which means that the number of input rows and the number of output rows of the transformation are not the same. In most cases, an active transformation requires all of the input rows to be processed before it can pass output rows to the next transformation or target. For a real-time session, the flush latency is ignored if the DTM needs to wait for all the rows to be processed.

Depending on user needs, active transformations such as Aggregator, Rank, and Sorter can be used in a real-time session by setting the Transformation Scope property of the active transformation to ‘Transaction’. This signals the session to process the data in the transformation once per transaction. For example, if a real-time session uses an Aggregator that sums a field of the input, the summation is done per transaction, as opposed to across all rows. The result may or may not be correct depending on the requirement. Use an active transformation in a real-time session only if you want to process the data per transaction.

Custom transformations can also be defined to handle data per transaction so that they can be used in a real-time session.

PowerExchange Real Time Connections

PowerExchange NRDB CDC Real Time connections can be used to extract changes from ADABAS, DATACOM, IDMS, IMS and VSAM sources in real time.

The DB2/390 connection can be used to extract changes for DB2 on OS/390 and the DB2/400 connection to extract from AS/400. There is a separate connection to read from DB2 UDB in real time.

The NRDB CDC connection requires the application name and the restart token file name to be overridden for every session. When the PowerCenter session completes, the PowerCenter Server writes the last restart token to a physical file called the RestartToken File. The next time the session starts, the PowerCenter Server reads the restart token from the file and then starts reading changes from the point where it last left off. Every PowerCenter session needs to have a unique restart token file name.

Informatica recommends archiving the file periodically. The reader timeout or the idle timeout can be used to stop a real-time session. A post-session command can be used to archive the RestartToken file.

The encryption mode for this connection can slow down the read performance and increase resource consumption. Compression mode can help in situations where the network is a bottleneck; using compression also increases the CPU and memory usage on the source system.

Archiving PowerExchange Tokens

When the PowerCenter session completes, the Integration Service writes the last restart token to a physical file called the RestartToken File. The token in the file indicates the end point where the read job ended. The next time the session starts, the PowerCenter Server reads the restart token from the file and then starts reading changes from the point where it left off. The token file is overwritten each time the session writes a token out. PowerCenter does not implicitly maintain an archive of these tokens.

If, for some reason, the changes from a particular point in time have to be “replayed”, we need the PowerExchange token from that point in time.

To enable such a process, it is a good practice to periodically copy the token file to a backup folder. This procedure is necessary to maintain an archive of the PowerExchange tokens. A real-time PowerExchange session may be stopped periodically, using either the reader time limit or the idle time limit. A post-session command is used to copy the restart token file to an archive folder. The session will be part of a continuously running workflow, so when the session completes after the post-session command, it automatically restarts. From a data processing standpoint very little changes; the process pauses for a moment, archives the token, and starts again.

The following are examples of post-session commands that can be used to copy a restart token file (session.token) and append the current system date/time to the file name for archive purposes:

UNIX:

cp session.token session`date '+%m%d%H%M'`.token

Windows:

copy session.token session-%date:~4,2%-%date:~7,2%-%date:~10,4%-%time:~0,2%-%time:~3,2%.token
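If the tokens should accumulate in a dedicated archive directory rather than alongside the session, a UNIX post-session command along the following lines can be used. This is a minimal sketch; the archive directory path is a placeholder and must be adjusted to the actual environment:

# Archive the restart token into a dedicated folder (directory path is hypothetical)
ARCHIVE_DIR=/opt/infa/token_archive
mkdir -p "$ARCHIVE_DIR" && cp session.token "$ARCHIVE_DIR/session_`date '+%Y%m%d%H%M%S'`.token"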

PowerCenter Connect for MQ

1. In the Workflow Manager, connect to a repository and choose Connection > Queue.
2. The Queue Connection Browser appears. Select New > Message Queue.
3. The Connection Object Definition dialog box appears.

You need to specify three attributes in the Connection Object Definition dialog box:

● Name - the name for the connection. (Use <queue_name>_<QM_name> to uniquely identify the connection.)
● Queue Manager - the Queue Manager name for the message queue. (In Windows, the default Queue Manager name is QM_<machine name>.)
● Queue Name - the Message Queue name.

To obtain the Queue Manager and Message Queue names:

● Open the MQ Series Administration Console. The Queue Manager should appear on the left panel

● Expand the Queue Manager icon. A list of the queues for the queue manager appears on the left panel

Note that the Queue Manager’s name and Queue Name are case-sensitive

PowerCenter Connect for JMS

PowerCenter Connect for JMS can be used to read messages from or write messages to various JMS providers, such as IBM MQ Series JMS, BEA WebLogic Server, and IBM WebSphere.

There are two types of JMS application connections:

● JNDI Application Connection, which is used to connect to a JNDI server during a session run.

● JMS Application Connection, which is used to connect to a JMS provider during a session run.

JNDI Application Connection Attributes are:

● Name
● JNDI Context Factory
● JNDI Provider URL
● JNDI UserName
● JNDI Password
● JMS Application Connection

JMS Application Connection Attributes are:

● Name
● JMS Destination Type
● JMS Connection Factory Name
● JMS Destination
● JMS UserName
● JMS Password

Configuring the JNDI Connection for IBM MQ Series

The JNDI settings for MQ Series JMS can be configured using a file system service or LDAP (Lightweight Directory Access Protocol).

The JNDI setting is stored in a file named JMSAdmin.config. The file should be installed in the MQSeries Java installation/bin directory.

If you are using a file system service provider to store your JNDI settings, remove the number sign (#) before the following context factory setting:

INITIAL_CONTEXT_FACTORY=com.sun.jndi.fscontext.RefFSContextFactory

Or, if you are using the LDAP service provider to store your JNDI settings, remove the number sign (#) before the following context factory setting:

INITIAL_CONTEXT_FACTORY=com.sun.jndi.ldap.LdapCtxFactory

Find the PROVIDER_URL settings.

If you are using a file system service provider to store your JNDI settings, remove the number sign (#) before the following provider URL setting and provide a value for the JNDI directory.

PROVIDER_URL=file:/<JNDI directory>

<JNDI directory> is the directory where you want JNDI to store the .binding file.
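Taken together, the relevant lines of a file-system-based JMSAdmin.config would look similar to the following; the /var/mqm/jndi directory is only an illustrative placeholder:

INITIAL_CONTEXT_FACTORY=com.sun.jndi.fscontext.RefFSContextFactory
PROVIDER_URL=file:/var/mqm/jndi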

Or, if you are using the LDAP service provider to store your JNDI settings, remove the number sign (#) before the provider URL setting and specify a hostname.

#PROVIDER_URL=ldap://<hostname>/context_name

For example, you can specify:

PROVIDER_URL=ldap://<localhost>/o=infa,c=rc

If you want to provide a user DN and password for connecting to JNDI, you can remove the # from the following settings and enter a user DN and password:


PROVIDER_USERDN=cn=myname,o=infa,c=rc
PROVIDER_PASSWORD=test

The following table shows the JMSAdmin.config settings and the corresponding attributes in the JNDI application connection in the Workflow Manager:

JMSAdmin.config Setting | JNDI Application Connection Attribute

INITIAL_CONTEXT_FACTORY | JNDI Context Factory
PROVIDER_URL | JNDI Provider URL
PROVIDER_USERDN | JNDI UserName
PROVIDER_PASSWORD | JNDI Password

Configuring the JMS Connection for IBM MQ Series

The JMS connection is defined using a tool in JMS called jmsadmin, which is available in MQ Series Java installation/bin directory. Use this tool to configure the JMS Connection Factory.

The JMS Connection Factory can be a Queue Connection Factory or Topic Connection Factory.

● When a Queue Connection Factory is used, define a JMS queue as the destination.
● When a Topic Connection Factory is used, define a JMS topic as the destination.

The command to define a queue connection factory (qcf) is:

def qcf(<qcf_name>) qmgr(queue_manager_name) hostname (QM_machine_hostname) port (QM_machine_port)

The command to define JMS queue is:

def q(<JMS_queue_name>) qmgr(queue_manager_name) qu(queue_manager_queue_name)

The command to define JMS topic connection factory (tcf) is:

def tcf(<tcf_name>) qmgr(queue_manager_name) hostname (QM_machine_hostname) port (QM_machine_port)


The command to define the JMS topic is:

def t(<JMS_topic_name>) topic(pub/sub_topic_name)

The topic name must be unique. For example: topic (application/infa)
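For illustration, the following jmsadmin commands define a queue connection factory and the corresponding JMS queue, following the templates above; every object, queue manager, host, port, and queue name shown here is hypothetical:

def qcf(infaQCF) qmgr(QM_orders) hostname(mqhost01) port(1414)
def q(infaOrdersQueue) qmgr(QM_orders) qu(ORDERS.IN)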

The following table shows the JMS object types and the corresponding attributes in the JMS application connection in the Workflow Manager:

JMS Object Type | JMS Application Connection Attribute

QueueConnectionFactory or TopicConnectionFactory | JMS Connection Factory Name
JMS Queue Name or JMS Topic Name | JMS Destination

Configuring the JNDI and JMS Connection for IBM WebSphere

Configure the JNDI settings for IBM WebSphere to use IBM WebSphere as a provider for JMS sources or targets in a PowerCenterRT session.

JNDI Connection

Add the following option to the file JMSAdmin.bat to configure JMS properly:

-Djava.ext.dirs=<WebSphere Application Server>\bin

For example: -Djava.ext.dirs=WebSphere\AppServer\bin

The JNDI connection resides in the JMSAdmin.config file, which is located in the MQ Series Java/bin directory.

INITIAL_CONTEXT_FACTORY=com.ibm.websphere.naming.wsInitialContextFactory

PROVIDER_URL=iiop://<hostname>/

For example:

PROVIDER_URL=iiop://localhost/


PROVIDER_USERDN=cn=informatica,o=infa,c=rc
PROVIDER_PASSWORD=test

JMS Connection

The JMS configuration is similar to the JMS Connection for IBM MQ Series.

Configuring the JNDI and JMS Connection for BEA WebLogic

Configure the JNDI settings for BEA Weblogic to use BEA Weblogic as a provider for JMS sources or targets in a PowerCenterRT session.

PowerCenter Connect for JMS and the JMS hosting WebLogic server do not need to be on the same server. PowerCenter Connect for JMS just needs a URL, as long as the URL points to the right place.

JNDI Connection

The Weblogic Server automatically provides a context factory and URL during the JNDI set-up configuration for WebLogic Server. Enter these values to configure the JNDI connection for JMS sources and targets in the Workflow Manager.

Enter the following value for JNDI Context Factory in the JNDI Application Connection in the Workflow Manager:

weblogic.jndi.WLInitialContextFactory

Enter the following value for JNDI Provider URL in the JNDI Application Connection in the Workflow Manager:

t3://<WebLogic_Server_hostname>:<port>

where WebLogic Server hostname is the hostname or IP address of the WebLogic Server and port is the port number for the WebLogic Server.
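For example, assuming a WebLogic Server listening on a host named wlhost01 and port 7001 (both placeholder values), the two JNDI attributes would be entered as:

JNDI Context Factory: weblogic.jndi.WLInitialContextFactory
JNDI Provider URL: t3://wlhost01:7001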

JMS Connection

The JMS connection is configured from the BEA WebLogic Server console. Select JMS -> Connection Factory.

The JMS Destination is also configured from the BEA Weblogic Server console.

From the Console pane, select Services > JMS > Servers > <JMS Server name> > Destinations under your domain.

Click Configure a New JMSQueue or Configure a New JMSTopic.

The following table shows the JMS object types and the corresponding attributes in the JMS application connection in the Workflow Manager:

WebLogic Server JMS Object | JMS Application Connection Attribute

Connection Factory Settings: JNDIName | JMS Connection Factory Name
Destination Settings: JNDIName | JMS Destination

In addition to the JNDI and JMS settings, BEA WebLogic also offers a function called JMS Store, which can be used for persistent messaging when reading and writing JMS messages. The JMS Stores configuration is available from the Console pane: select Services > JMS > Stores under your domain.

Configuring the JNDI and JMS Connection for TIBCO

TIBCO Rendezvous Server does not adhere to the JMS specification. As a result, PowerCenter Connect for JMS cannot connect directly to the Rendezvous Server. TIBCO Enterprise Server, which is JMS-compliant, acts as a bridge between PowerCenter Connect for JMS and TIBCO Rendezvous Server. Configure a connection-bridge between TIBCO Rendezvous Server and TIBCO Enterprise Server so that PowerCenter Connect for JMS can read messages from and write messages to TIBCO Rendezvous Server.

To create a connection-bridge between PowerCenter Connect for JMS and TIBCO Rendezvous Server, follow these steps:

1. Configure PowerCenter Connect for JMS to communicate with TIBCO Enterprise Server.

2. Configure TIBCO Enterprise Server to communicate with TIBCO Rendezvous Server.

Configure the following information in your JNDI application connection:

● JNDI Context Factory: com.tibco.tibjms.naming.TibjmsInitialContextFactory
● Provider URL: tibjmsnaming://<host>:<port>, where host and port are the host name and port number of the Enterprise Server.


To make a connection-bridge between TIBCO Rendezvous Server and TIBCO Enterprise Server:

1. In the file tibjmsd.conf, enable the tibrv transport configuration parameter as in the example below, so that TIBCO Enterprise Server can communicate with TIBCO Rendezvous messaging systems:

tibrv_transports = enabled

2. Enter the following transports in the transports.conf file:

[RV]
type = tibrv                        // type of external messaging system
topic_import_dm = TIBJMS_RELIABLE   // only reliable/certified messages can transfer
daemon = tcp:localhost:7500         // default daemon for the Rendezvous server

The transports in the transports.conf configuration file specify the communication protocol between TIBCO Enterprise for JMS and the TIBCO Rendezvous system. The import and export properties on a destination can list one or more transports to use to communicate with the TIBCO Rendezvous system.

3. Optionally, specify the name of one or more transports for reliable and certified message delivery in the export property in the file topics.conf, as in the following example:

topicname export="RV"

The export property allows messages published to a topic by a JMS client to be exported to the external systems with configured transports. Currently, you can configure transports for TIBCO Rendezvous reliable and certified messaging protocols.

PowerCenter Connect for webMethods

When importing webMethods sources into the Designer, be sure the host name used for the connection does not contain the ‘.’ character. You cannot use fully-qualified names for the connection when importing webMethods sources. You can use fully-qualified names for the connection when importing webMethods targets because PowerCenter does not use the same grouping method for importing sources and targets. To work around this, modify the hosts file to resolve the short name to the IP address.

For example:

Host File:


crpc23232.crp.informatica.com crpc23232
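Note that a complete hosts-file entry normally starts with the machine's IP address; the address below is purely a placeholder:

10.1.2.3   crpc23232.crp.informatica.com   crpc23232   # placeholder IP address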

Use crpc23232 instead of crpc23232.crp.informatica.com as the host name when importing webMethods source definition. This step is only required for importing PowerCenter Connect for webMethods sources into the Designer.

If you are using the request/reply model in webMethods, PowerCenter needs to send an appropriate document back to the broker for every document it receives. PowerCenter populates some of the envelope fields of the webMethods target to enable webMethods broker to recognize that the published document is a reply from PowerCenter. The envelope fields ‘destid’ and ‘tag’ are populated for the request/reply model. ‘Destid’ should be populated from the ‘pubid’ of the source document and ‘tag’ should be populated from ‘tag’ of the source document. Use the option ‘Create Default Envelope Fields’ when importing webMethods sources and targets into the Designer in order to make the envelope fields available in PowerCenter.

Configuring the PowerCenter Connect for webMethods Connection

To create or edit a PowerCenter Connect for webMethods connection, select Connections > Application > webMethods Broker from the Workflow Manager.

PowerCenter Connect for webMethods connection attributes are:

● Name
● Broker Host
● Broker Name
● Client ID
● Client Group
● Application Name
● Automatic Reconnect
● Preserve Client State

Enter the connection to the Broker Host in the format <hostname>:<port>.

If you are using the request/reply method in webMethods, you have to specify a client ID in the connection. Be sure that the client ID used in the request connection is the same as the client ID used in the reply connection. Note that if you are using multiple request/reply document pairs, you need to set up a different webMethods connection for each pair because they cannot share a client ID.

Last updated: 01-Feb-07 18:53


Session and Data Partitioning

Challenge

Improving performance by identifying strategies for partitioning relational tables, XML, COBOL and standard flat files, and by coordinating the interaction between sessions, partitions, and CPUs. These strategies take advantage of the enhanced partitioning capabilities in PowerCenter.

Description

On hardware systems that are under-utilized, you may be able to improve performance by processing partitioned data sets in parallel in multiple threads of the same session instance running on the PowerCenter Server engine. However, parallel execution may impair performance on over-utilized systems or systems with smaller I/O capacity.

In addition to hardware, consider these other factors when determining if a session is an ideal candidate for partitioning: source and target database setup, target type, mapping design, and certain assumptions that are explained in the following paragraphs. Use the Workflow Manager client tool to implement session partitioning.

Assumptions

The following assumptions pertain to the source and target systems of a session that is a candidate for partitioning. These factors can help to maximize the benefits that can be achieved through partitioning.

● Indexing has been implemented on the partition key when using a relational source.

● Source files are located on the same physical machine as the PowerCenter Server process when partitioning flat files, COBOL, and XML, to reduce network overhead and delay.

● All possible constraints are dropped or disabled on relational targets.
● All possible indexes are dropped or disabled on relational targets.
● Table spaces and database partitions are properly managed on the target system.
● Target files are written to the same physical machine that hosts the PowerCenter process in order to reduce network overhead and delay.
● Oracle External Loaders are utilized whenever possible.

First, determine if you should partition your session. Parallel execution benefits systems that have the characteristics described below.

Also check the idle time and busy percentage for each thread; this gives high-level information about the bottleneck point(s). To do this, open the session log and look for messages starting with “PETL_” under the “RUN INFO FOR TGT LOAD ORDER GROUP” section. These PETL messages give the following details for the reader, transformation, and writer threads:

● Total Run Time
● Total Idle Time
● Busy Percentage

Under-utilized or intermittently-used CPUs. To determine if this is the case, check the CPU usage of your machine. On UNIX, the id column of the vmstat output displays the percentage of time the CPU was idle during the specified interval, excluding I/O wait. If there are CPU cycles available (i.e., twenty percent or more idle time), then this session's performance may be improved by adding a partition.

● Windows 2000/2003 - check the Task Manager Performance tab.
● UNIX - type vmstat 1 10 on the command line.

Sufficient I/O. To determine the I/O statistics:

● Windows 2000/2003 - check the Task Manager Performance tab.
● UNIX - type iostat on the command line. The %iowait column displays the percentage of CPU time spent idling while waiting for I/O requests. The %idle column displays the total percentage of the time that the CPU spends idling (i.e., the unused capacity of the CPU).

Sufficient memory. If too much memory is allocated to your session, you will receive a memory allocation error. Check to see that you're using as much memory as you can. If the session is paging, increase the memory. To determine if the session is paging:

● Windows 2000/2003 - check the Task Manager Performance tab.
● UNIX - type vmstat 1 10 on the command line. The pi column displays the number of pages swapped in from the page space during the specified interval, and the po column displays the number of pages swapped out to the page space during the specified interval. If these values indicate that paging is occurring, it may be necessary to allocate more memory, if possible. (A combined command sketch for these UNIX checks follows below.)
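The UNIX checks above can be strung together in a small monitoring script run while the session executes. This is only a sketch; exact column names and positions vary between UNIX variants:

#!/bin/sh
# Snapshot CPU, paging, and I/O activity while the session is running.
echo "=== CPU and paging (watch the id, pi, and po columns) ==="
vmstat 1 10
echo "=== I/O wait (watch the %iowait and %idle columns) ==="
iostat 5 3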

If you determine that partitioning is practical, you can begin setting up the partition.

Partition Types

PowerCenter provides increased control of the pipeline threads. Session performance can be improved by adding partitions at various pipeline partition points. When you configure the partitioning information for a pipeline, you must specify a partition type. The partition type determines how the PowerCenter Server redistributes data across partition points. The Workflow Manager allows you to specify the following partition types:

Round-robin Partitioning

The PowerCenter Server distributes data evenly among all partitions. Use round-robin partitioning when you need to distribute rows evenly and do not need to group data among partitions.

In a pipeline that reads data from file sources of different sizes, use round-robin partitioning. For example, consider a session based on a mapping that reads data from three flat files of different sizes.

● Source file 1: 100,000 rows
● Source file 2: 5,000 rows
● Source file 3: 20,000 rows

In this scenario, the recommended best practice is to set a partition point after the Source Qualifier and set the partition type to round-robin. The PowerCenter Server distributes the data so that each partition processes approximately one third of the data.

Hash Partitioning

The PowerCenter Server applies a hash function to a partition key to group data among partitions.

Use hash partitioning when you want to ensure that the PowerCenter Server processes groups of rows with the same partition key in the same partition. For example, consider a scenario where you need to sort items by item ID but do not know how many items have a particular ID number. If you select hash auto-keys, the PowerCenter Server uses all grouped or sorted ports as the partition key. If you select hash user keys, you specify a number of ports to form the partition key.

An example of this type of partitioning is when you are using Aggregators and need to ensure that groups of data based on a primary key are processed in the same partition.

Key Range Partitioning

With this type of partitioning, you specify one or more ports to form a compound partition key for a source or target. The PowerCenter Server then passes data to each partition depending on the ranges you specify for each port.

Use key range partitioning where the sources or targets in the pipeline are partitioned by key range. Refer to Workflow Administration Guide for further directions on setting up Key range partitions.

For example, with key range partitioning set at End range = 2020, the PowerCenter Server passes in data where values are less than 2020. Similarly, for Start range = 2020, the PowerCenter Server passes in data where values are equal to or greater than 2020. Null values or values that do not fall in either partition are passed through the first partition.

Pass-through Partitioning

In this type of partitioning, the PowerCenter Server passes all rows at one partition point to the next partition point without redistributing them.

Use pass-through partitioning where you want to create an additional pipeline stage to improve performance, but do not want to (or cannot) change the distribution of data across partitions. The Data Transformation Manager spawns a master thread on each session run, which in turn creates three threads (reader, transformation, and writer threads) by default. Each of these threads can, at the most, process one data set at a time and hence, three data sets simultaneously. If there are complex transformations in the mapping, the transformation thread may take a longer time than the other threads, which can slow data throughput.


It is advisable to define partition points at these transformations. This creates another pipeline stage and reduces the overhead of a single transformation thread.

When you have considered all of these factors and selected a partitioning strategy, you can begin the iterative process of adding partitions. Continue adding partitions to the session until you meet the desired performance threshold or observe degradation in performance.

Tips for Efficient Session and Data Partitioning

● Add one partition at a time. To best monitor performance, add one partition at a time, and note your session settings before adding additional partitions. Refer to the Workflow Administration Guide for more information on restrictions on the number of partitions.

● Set DTM buffer memory. For a session with n partitions, set this value to at least n times the original value for the non-partitioned session.

● Set cached values for sequence generator. For a session with n partitions, there is generally no need to use the Number of Cached Values property of the sequence generator. If you must set this value to a value greater than zero, make sure it is at least n times the original value for the non-partitioned session.

● Partition the source data evenly. The source data should be partitioned into equal sized chunks for each partition.

● Partition tables. A notable increase in performance can also be realized when the actual source and target tables are partitioned. Work with the DBA to discuss the partitioning of source and target tables, and the setup of tablespaces.

● Consider using external loader. As with any session, using an external loader may increase session performance. You can only use Oracle external loaders for partitioning. Refer to the Session and Server Guide for more information on using and setting up the Oracle external loader for partitioning.

● Write throughput. Check the session statistics to see if you have increased the write throughput.

● Paging. Check to see if the session is now causing the system to page. When you partition a session and there are cached lookups, you must make sure that DTM memory is increased to handle the lookup caches. When you partition a source that uses a static lookup cache, the PowerCenter Server creates one memory cache for each partition and one disk cache for each transformation. Thus, memory requirements grow for each partition. If the memory is not bumped up, the system may start paging to disk, causing degradation in performance.

When you finish partitioning, monitor the session to see if the partition is degrading or improving session performance. If session performance improves and the session meets your requirements, add another partition.

Session on Grid and Partitioning Across Nodes

Session on Grid provides the ability to run a session on a multi-node Integration Service. This is most suitable for large sessions. For small and medium-size sessions, it is more practical to distribute whole sessions to different nodes using Workflow on Grid. Session on Grid leverages the existing partitions of a session by executing threads in multiple DTMs. The Log service can be used to get the cumulative log.

Dynamic Partitioning

Dynamic partitioning is also called parameterized partitioning because a single parameter can determine the number of partitions. With the Session on Grid option, more partitions can be added when more resources are available. The number of partitions in a session can also be tied to partitions in the database, so that PowerCenter partitioning remains easy to maintain and can leverage database partitioning.

Last updated: 01-Feb-07 18:53


Using Parameters, Variables and Parameter Files

Challenge

Understanding how parameters, variables, and parameter files work and using them for maximum efficiency.

Description

Prior to the release of PowerCenter 5, the only variables inherent to the product were those defined within specific transformations and the server variables that were global in nature. Transformation variables were defined as variable ports in a transformation and could only be used in that specific transformation object (e.g., Expression, Aggregator, and Rank transformations). Similarly, global parameters defined within Server Manager would affect the subdirectories for source files, target files, log files, and so forth.

More current versions of PowerCenter made variables and parameters available across the entire mapping rather than for a specific transformation object. In addition, they provide built-in parameters for use within Workflow Manager. Using parameter files, these values can change from session-run to session-run. With the addition of workflows, parameters can now be passed to every session contained in the workflow, providing more flexibility and reducing parameter file maintenance. Other important functionality that has been added in recent releases is the ability to dynamically create parameter files that can be used in the next session in a workflow or in other workflows.

Parameters and Variables

Use a parameter file to define the values for parameters and variables used in a workflow, worklet, mapping, or session. A parameter file can be created using a text editor such as WordPad or Notepad. List the parameters or variables and their values in the parameter file. Parameter files can contain the following types of parameters and variables:

● Workflow variables
● Worklet variables
● Session parameters
● Mapping parameters and variables

When using parameters or variables in a workflow, worklet, mapping, or session, the PowerCenter Server checks the parameter file to determine the start value of the parameter or variable. Use a parameter file to initialize workflow variables, worklet variables, mapping parameters, and mapping variables. If not defining start values for these parameters and variables, the PowerCenter Server checks for the start value of the parameter or variable in other places.

Session parameters must be defined in a parameter file. Because session parameters do not have default values, if the PowerCenter Server cannot locate the value of a session parameter in the parameter file, it fails to initialize the session. To include parameter or variable information for more than one workflow, worklet, or session in a single parameter file, create separate sections for each object within the parameter file.

Also, create multiple parameter files for a single workflow, worklet, or session and change the file that these tasks use, as necessary. To specify the parameter file that the PowerCenter Server uses with a workflow, worklet, or session, do either of the following:


● Enter the parameter file name and directory in the workflow, worklet, or session properties.
● Start the workflow, worklet, or session using pmcmd and enter the parameter file name and directory in the command line.

If entering a parameter file name and directory in the workflow, worklet, or session properties and in the pmcmd command line, the PowerCenter Server uses the information entered in the pmcmd command line.

Parameter File Format

The format for parameter files changed with the addition of the Workflow Manager. When entering values in a parameter file, precede the entries with a heading that identifies the workflow, worklet, or session whose parameters and variables are to be assigned. Assign individual parameters and variables directly below this heading, entering each parameter or variable on a new line. List parameters and variables in any order for each task.

The following heading formats can be defined:

● Workflow variables - [folder name.WF:workflow name]
● Worklet variables - [folder name.WF:workflow name.WT:worklet name]
● Worklet variables in nested worklets - [folder name.WF:workflow name.WT:worklet name.WT:worklet name...]
● Session parameters, plus mapping parameters and variables - [folder name.WF:workflow name.ST:session name] or [folder name.session name] or [session name]

Below each heading, define parameter and variable values as follows:

● parameter name=value
● parameter2 name=value
● variable name=value
● variable2 name=value

For example, a session in the production folder, s_MonthlyCalculations, uses a string mapping parameter, $$State, that needs to be set to MA, and a datetime mapping variable, $$Time. $$Time already has an initial value of 9/30/2000 00:00:00 saved in the repository, but this value needs to be overridden to 10/1/2000 00:00:00. The session also uses session parameters to connect to source files and target databases, as well as to write session log to the appropriate session log file.

The following table shows the parameters and variables that can be defined in the parameter file:

Parameter and Variable Type | Parameter and Variable Name | Desired Definition

String Mapping Parameter | $$State | MA
Datetime Mapping Variable | $$Time | 10/1/2000 00:00:00
Source File (Session Parameter) | $InputFile1 | Sales.txt
Database Connection (Session Parameter) | $DBConnection_Target | Sales (database connection)
Session Log File (Session Parameter) | $PMSessionLogFile | d:/session logs/firstrun.txt


The parameter file for the session includes the folder and session name, as well as each parameter and variable:

[Production.s_MonthlyCalculations]
$$State=MA
$$Time=10/1/2000 00:00:00
$InputFile1=sales.txt
$DBConnection_target=sales
$PMSessionLogFile=D:/session logs/firstrun.txt

The next time the session runs, edit the parameter file to change the state to MD and delete the $$Time variable. This allows the PowerCenter Server to use the value for the variable that was saved from the previous session run.

Mapping Variables

Declare mapping variables in PowerCenter Designer using the menu option Mappings -> Parameters and Variables (See the first figure, below). After selecting mapping variables, use the pop-up window to create a variable by specifying its name, data type, initial value, aggregation type, precision, and scale. This is similar to creating a port in most transformations (See the second figure, below).


Variables, by definition, are objects that can change value dynamically. PowerCenter has four functions to affect change to mapping variables:

● SetVariable
● SetMaxVariable
● SetMinVariable
● SetCountVariable

A mapping variable can store the last value from a session run in the repository to be used as the starting value for the next session run.

● Name. The name of the variable should be descriptive and be preceded by $$ (so that it is easily identifiable as a variable). A typical variable name is: $$Procedure_Start_Date.

● Aggregation type. This entry creates specific functionality for the variable and determines how it stores data. For example, with an aggregation type of Max, the value stored in the repository at the end of each session run would be the maximum value across ALL records until the value is deleted.

● Initial value. This value is used during the first session run when there is no corresponding and overriding parameter file. This value is also used if the stored repository value is deleted. If no initial value is identified, then a data-type specific default value is used.

Variable values are not stored in the repository when the session:

● Fails to complete.
● Is configured for a test load.
● Is a debug session.
● Runs in debug mode and is configured to discard session output.

Order of Evaluation

The start value is the value of the variable at the start of the session. The start value can be a value defined in the parameter file for the variable, a value saved in the repository from the previous run of the session, a user-defined initial value for the variable, or the default value based on the variable data type.

The PowerCenter Server looks for the start value in the following order:

1. Value in session parameter file
2. Value saved in the repository
3. Initial value
4. Default value

Mapping Parameters and Variables

Since parameter values do not change over the course of the session run, the value used is based on:

● Value in session parameter file
● Initial value
● Default value

Once defined, mapping parameters and variables can be used in the Expression Editor section of the following transformations:

● Expression
● Filter
● Router
● Update Strategy
● Aggregator

Mapping parameters and variables also can be used within the Source Qualifier in the SQL query, user-defined join, and source filter sections, as well as in a SQL override in the lookup transformation.

The lookup SQL override is similar to entering a custom query in a Source Qualifier transformation. When entering a lookup SQL override, enter the entire override, or generate and edit the default SQL statement. When the Designer generates the default SQL statement for the lookup SQL override, it includes the lookup/output ports in the lookup condition and the lookup/return port.

Note: Although you can use mapping parameters and variables when entering a lookup SQL override, the Designer cannot expand mapping parameters and variables in the query override and does not validate the lookup SQL override. When running a session with a mapping parameter or variable in the lookup SQL override, the PowerCenter Integration Service expands mapping parameters and variables and connects to the lookup database to validate the query override.

Also note that the Workflow Manager does not recognize variable connection parameters (such as $DBConnection parameters) with Lookup transformations. At this time, Lookups can use $Source, $Target, or an exact database connection.


Guidelines for Creating Parameter Files

Use the following guidelines when creating parameter files:

● Capitalize folder and session names as necessary. Folder and session names are case-sensitive in the parameter file.

● Enter folder names for non-unique session names. When a session name exists more than once in a repository, enter the folder name to indicate the location of the session.

● Create one or more parameter files. Assign parameter files to workflows, worklets, and sessions individually. Specify the same parameter file for all of these tasks or create several parameter files.

● If including parameter and variable information for more than one session in the file, create a new section for each session as follows. The folder name is optional.

[folder_name.session_name]

parameter_name=value

variable_name=value

mapplet_name.parameter_name=value

[folder2_name.session_name]

parameter_name=value

variable_name=value

mapplet_name.parameter_name=value

● Specify headings in any order. Place headings in any order in the parameter file. However, if defining the same parameter or variable more than once in the file, the PowerCenter Server assigns the parameter or variable value using the first instance of the parameter or variable.

● Specify parameters and variables in any order. Below each heading, the parameters and variables can be specified in any order.

● When defining parameter values, do not use unnecessary line breaks or spaces. The PowerCenter Server may interpret additional spaces as part of the value.

● List all necessary mapping parameters and variables. Values entered for mapping parameters and variables become the start value for parameters and variables in a mapping. Mapping parameter and variable names are not case sensitive.

● List all session parameters. Session parameters do not have default values. An undefined session parameter can cause the session to fail. Session parameter names are not case sensitive.

● Use correct date formats for datetime values. When entering datetime values, use the following date formats:

MM/DD/RR

MM/DD/RR HH24:MI:SS

MM/DD/YYYY


MM/DD/YYYY HH24:MI:SS

● Do not enclose parameters or variables in quotes. The PowerCenter Server interprets everything after the equal sign as part of the value.

● Do enclose parameters in single quotes in a Source Qualifier SQL Override if the parameter represents a string or date/time value to be used in the SQL Override.

● Precede parameters and variables created in mapplets with the mapplet name as follows:

mapplet_name.parameter_name=value

mapplet2_name.variable_name=value

Sample: Parameter Files and Session Parameters

Parameter files, along with session parameters, allow you to change certain values between sessions. A commonly-used feature is the ability to create user-defined database connection session parameters to reuse sessions for different relational sources or targets. Use session parameters in the session properties, and then define the parameters in a parameter file. To do this, name all database connection session parameters with the prefix $DBConnection, followed by any alphanumeric and underscore characters. Session parameters and parameter files help reduce the overhead of creating multiple mappings when only certain attributes of a mapping need to be changed.
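As a sketch (all folder, session, and connection names here are hypothetical), the session would reference $DBConnection_Source and $DBConnection_Target in its connection settings, and the parameter file would supply the actual relational connections for that run:

[Finance.s_Load_Customers]
$DBConnection_Source=DEV_ORACLE
$DBConnection_Target=DEV_DW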

Using Parameters in Source Qualifiers

Another commonly used feature is the ability to create parameters in the Source Qualifier, which allows you to reuse the same mapping, with different sessions, to extract the data specified by the parameter file that each session references.

There may also be a case where one mapping creates a parameter file and a second mapping uses it. The second mapping pulls the data using a parameter in the Source Qualifier transformation, which reads its value from the parameter file created by the first mapping. In this approach, the first mapping's job is simply to build a flat file that serves as the parameter file for another session to use.

Note: Server variables cannot be modified by entries in the parameter file. For example, there is no way to set the Workflow log directory in a parameter file. The Workflow Log File Directory can only accept an actual directory or the $PMWorkflowLogDir variable as a valid entry. The $PMWorkflowLogDir variable is a server variable that is set at the server configuration level, not in the Workflow parameter file.

Sample: Variables and Parameters in an Incremental Strategy

Variables and parameters can enhance incremental strategies. The following example uses a mapping variable, an expression transformation object, and a parameter file for restarting.

Scenario

Company X wants to start with an initial load of all data but wants subsequent process runs to select only new information. The source data carries an inherent post date, stored in a column named Date_Entered, that can be used for this purpose. The process will run once every twenty-four hours.


Sample Solution

Create a mapping with source and target objects. From the menu create a new mapping variable named $$Post_Date with the following attributes:

● TYPE: Variable
● DATATYPE: Date/Time
● AGGREGATION TYPE: MAX
● INITIAL VALUE: 01/01/1900

Note that there is no need to encapsulate the INITIAL VALUE with quotation marks. However, if this value is used within the Source Qualifier SQL, it may be necessary to use native RDBMS functions to convert it (e.g., TO_DATE(--,--)). Within the Source Qualifier transformation, use the following in the Source Filter attribute: DATE_ENTERED > TO_DATE('$$Post_Date','MM/DD/YYYY HH24:MI:SS') [this sample assumes Oracle as the source RDBMS]. Also note that the initial value 01/01/1900 will be expanded by the PowerCenter Server to 01/01/1900 00:00:00, hence the need to convert the parameter to a datetime.

The next step is to forward $$Post_Date and Date_Entered to an Expression transformation. This is where the function for setting the variable will reside. An output port named Post_Date is created with a data type of date/time. In the expression code section, place the following function:

SETMAXVARIABLE($$Post_Date,DATE_ENTERED)

The function evaluates each value for DATE_ENTERED and updates the variable with the Max value to be passed forward. For example:

DATE_ENTERED Resultant POST_DATE

9/1/2000 9/1/2000

10/30/2001 10/30/2001

9/2/2000 10/30/2001

Consider the following with regard to the functionality:

1. In order for the function to assign a value, and ultimately store it in the repository, the port must be connected to a downstream object. It need not go to the target, but it must go to another Expression Transformation. The reason is that the memory will not be instantiated unless it is used in a downstream transformation object.

2. In order for the function to work correctly, the rows have to be marked for insert. If the mapping is an update-only mapping (i.e., Treat Rows As is set to Update in the session properties) the function will not work. In this case, make the session Data Driven and add an Update Strategy after the transformation containing the SETMAXVARIABLE function, but before the Target.

3. If the intent is to store the original Date_Entered per row and not the evaluated date value, then add an ORDER BY clause to the Source Qualifier. This way, the dates are processed and set in order and data is preserved.


The first time this mapping is run, the SQL will select from the source where Date_Entered is > 01/01/1900 providing an initial load. As data flows through the mapping, the variable gets updated to the Max Date_Entered it encounters. Upon successful completion of the session, the variable is updated in the repository for use in the next session run. To view the current value for a particular variable associated with the session, right-click on the session in the Workflow Monitor and choose View Persistent Values.

The following graphic shows that after the initial run, the Max Date_Entered was 02/03/1998. The next time this session is run, based on the variable in the Source Qualifier Filter, only sources where Date_Entered > 02/03/1998 will be processed.

Resetting or Overriding Persistent Values

To reset the persistent value to the initial value declared in the mapping, view the persistent value from Workflow Manager (see graphic above) and press Delete Values. This deletes the stored value from the repository, causing the Order of Evaluation to use the Initial Value declared from the mapping.

If a session run is needed for a specific date, use a parameter file. There are two basic ways to accomplish this:

● Create a generic parameter file, place it on the server, and point all sessions to that parameter file. A session may (or may not) have a variable, and the parameter file need not have variables and parameters defined for every session using the parameter file. To override the variable, either change, uncomment, or delete the variable in the parameter file.

● Run pmcmd for that session, but declare the specific parameter file within the pmcmd command.


Configuring the Parameter File Location

Specify the parameter filename and directory in the workflow or session properties. To enter a parameter file in the workflow or session properties:

● Select either the Workflow or Session, choose Edit, and click the Properties tab.
● Enter the parameter directory and name in the Parameter Filename field.
● Enter either a direct path or a server variable directory. Use the appropriate delimiter for the PowerCenter Server operating system.

The following graphic shows the parameter filename and location specified in the session task.

The next graphic shows the parameter filename and location specified in the Workflow.

In this example, after the initial session is run, the parameter file contents may look like:

[Test.s_Incremental]

;$$Post_Date=


By using the semicolon, the variable override is ignored and the Initial Value or Stored Value is used. If, in the subsequent run, the data processing date needs to be set to a specific date (for example: 04/21/2001), then a simple Perl script or manual change can update the parameter file to:

[Test.s_Incremental]

$$Post_Date=04/21/2001

Upon running the sessions, the order of evaluation looks to the parameter file first, sees a valid variable and value and uses that value for the session run. After successful completion, run another script to reset the parameter file.
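For example, a minimal post-run reset could simply rewrite the parameter file with the variable commented out again. This is only a sketch; the file location is a placeholder, while the folder and session names are the ones used in this example:

#!/bin/sh
# Reset the parameter file so the next run falls back to the stored or initial value.
PARMFILE=/path/to/parmfiles/incremental.txt   # placeholder location
echo '[Test.s_Incremental]' >  "$PARMFILE"
echo ';$$Post_Date='        >> "$PARMFILE"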

Sample: Using Session and Mapping Parameters in Multiple Database Environments

Reusable mappings that can source a common table definition across multiple databases, regardless of differing environmental definitions (e.g., instances, schemas, user/logins), are required in a multiple database environment.

Scenario

Company X maintains five Oracle database instances. All instances have a common table definition for sales orders, but each instance has a unique instance name, schema, and login.

DB Instance | Schema | Table | User | Password

ORC1 | aardso | orders | Sam | max
ORC99 | environ | orders | Help | me
HALC | hitme | order_done | Hi | Lois
UGLY | snakepit | orders | Punch | Judy
GORF | gmer | orders | Brer | Rabbit

Each sales order table has a different name, but the same definition:

ORDER_ID NUMBER (28) NOT NULL,

DATE_ENTERED DATE NOT NULL,

DATE_PROMISED DATE NOT NULL,

DATE_SHIPPED DATE NOT NULL,

EMPLOYEE_ID NUMBER (28) NOT NULL,

CUSTOMER_ID NUMBER (28) NOT NULL,

SALES_TAX_RATE NUMBER (5,4) NOT NULL,

STORE_ID NUMBER (28) NOT NULL

Sample Solution

Using Workflow Manager, create multiple relational connections. In this example, the strings are named according to the DB Instance name. Using Designer, create the mapping that sources the commonly defined table. Then create a Mapping Parameter named $$Source_Schema_Table with the following attributes:


Note that the parameter attributes vary based on the specific environment. Also, the initial value is not required since this solution uses parameter files.

Open the Source Qualifier and use the mapping parameter in the SQL Override as shown in the following graphic.

Open the Expression Editor and select Generate SQL. The generated SQL statement shows the columns.


Override the table names in the SQL statement with the mapping parameter.

Using Workflow Manager, create a session based on this mapping. Within the Source Database connection drop-down box, choose the following parameter:

$DBConnection_Source.

Point the target to the corresponding target and finish.

Now create the parameter files. In this example, there are five separate parameter files.

Parmfile1.txt

[Test.s_Incremental_SOURCE_CHANGES]

$$Source_Schema_Table=aardso.orders

$DBConnection_Source=ORC1

Parmfile2.txt

[Test.s_Incremental_SOURCE_CHANGES]

$$Source_Schema_Table=environ.orders

$DBConnection_Source=ORC99

Parmfile3.txt

[Test.s_Incremental_SOURCE_CHANGES]

$$Source_Schema_Table=hitme.order_done

$DBConnection_Source=HALC

Parmfile4.txt

[Test.s_Incremental_SOURCE_CHANGES]

$$Source_Schema_Table=snakepit.orders

$DBConnection_Source=UGLY

Parmfile5.txt

[Test.s_Incremental_SOURCE_CHANGES]

$$Source_Schema_Table=gmer.orders

$DBConnection_Source=GORF


Use pmcmd to run the five sessions in parallel. The syntax for pmcmd for starting sessions with a particular parameter file is as follows:

pmcmd startworkflow -s serveraddress:portno -u Username -p Password -paramfile parmfilename s_Incremental

You may also use "-pv pwdvariable" if the named environment variable contains the encrypted form of the actual password.

Notes on Using Parameter Files with Startworkflow

When starting a workflow, you can optionally enter the directory and name of a parameter file. The PowerCenter Integration Service runs the workflow using the parameters in the file specified.

For UNIX shell users, enclose the parameter file name in single quotes:

-paramfile '$PMRootDir/myfile.txt'

For Windows command prompt users, the parameter file name cannot have beginning or trailing spaces. If the name includes spaces, enclose the file name in double quotes:

-paramfile "$PMRootDir\my file.txt"

Note: When writing a pmcmd command that includes a parameter file located on another machine, use the backslash (\) with the dollar sign ($). This ensures that the machine where the variable is defined expands the server variable.

pmcmd startworkflow -uv USERNAME -pv PASSWORD -s SALES:6258 -f east -w wSalesAvg -paramfile '\$PMRootDir/myfile.txt'

In the event that it is necessary to run the same workflow with different parameter files, use the following five separate commands:

pmcmd startworkflow -u tech_user -p pwd -s 127.0.0.1:4001 -f Test s_Incremental_SOURCE_CHANGES -paramfile \$PMRootDir\ParmFiles\Parmfile1.txt 1 1

pmcmd startworkflow -u tech_user -p pwd -s 127.0.0.1:4001 -f Test s_Incremental_SOURCE_CHANGES -paramfile \$PMRootDir\ParmFiles\Parmfile2.txt 1 1

pmcmd startworkflow -u tech_user -p pwd -s 127.0.0.1:4001 -f Test s_Incremental_SOURCE_CHANGES -paramfile \$PMRootDir\ParmFiles\Parmfile3.txt 1 1

pmcmd startworkflow -u tech_user -p pwd -s 127.0.0.1:4001 -f Test s_Incremental_SOURCE_CHANGES -paramfile \$PMRootDir\ParmFiles\Parmfile4.txt 1 1

pmcmd startworkflow -u tech_user -p pwd -s 127.0.0.1:4001 -f Test s_Incremental_SOURCE_CHANGES -paramfile \$PMRootDir\ParmFiles\Parmfile5.txt 1 1

Alternatively, run the sessions in sequence with one parameter file. In this case, a pre- or post-session script can change the parameter file for the next session.
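As an illustration of that alternative, a post-session shell command could copy the next file in the sequence over the single parameter file that the session reads. The directory layout and file names below are hypothetical; the session's Parameter Filename would point at current_parmfile.txt:

#!/bin/sh
# Post-session command: stage the next parameter file for the following session run.
PARM_DIR=/opt/infa/ParmFiles   # placeholder directory
NEXT_FILE=$1                   # e.g., Parmfile2.txt
cp "$PARM_DIR/$NEXT_FILE" "$PARM_DIR/current_parmfile.txt"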


Using PowerCenter with UDB

Challenge

Universal Database (UDB) is a database platform that can host PowerCenter repositories and act as a source or target database for PowerCenter mappings. Like any software, it has its own way of doing things, and it is important to understand these behaviors in order to configure the environment correctly when implementing PowerCenter and other Informatica products with this database platform. This Best Practice offers a number of tips for using UDB with PowerCenter.

Description

UDB Overview

UDB is used for a variety of purposes and with various environments. UDB servers run on Windows, OS/2, AS/400 and UNIX-based systems like AIX, Solaris, and HP-UX. UDB supports two independent types of parallelism: symmetric multi-processing (SMP) and massively parallel processing (MPP).

Enterprise-Extended Edition (EEE) is the most common UDB edition used in conjunction with the Informatica product suite. UDB EEE introduces a dimension of parallelism that can be scaled to very high performance. A UDB EEE database can be partitioned across multiple machines that are connected by a network or a high-speed switch. Additional machines can be added to an EEE system as application requirements grow. The individual machines participating in an EEE installation can be either uniprocessors or symmetric multiprocessors.

Connection Setup

You must set up a remote database connection to connect to DB2 UDB via PowerCenter. This is necessary because DB2 UDB sets a very small limit on the number of attachments per user to the shared memory segments when the user is using the local (or indirect) connection/protocol. The PowerCenter server runs into this limit when it is acting as the database agent or user. This is especially apparent when the repository is installed on DB2 and the target data source is on the same DB2 database.

The local protocol limit will definitely be reached when using the same connection node for the repository via the PowerCenter Server and for the targets. This occurs when the session is executed and the server sends requests for multiple agents to be launched. Whenever the limit on the number of database agents is reached, the following error occurs:

CMN_1022 [[IBM][CLI Driver] SQL1224N A database agent could not be started to service a request, or was terminated as a result of a database system shutdown or a force command. SQLSTATE=55032]

The following recommendations may resolve this problem:

● Increase the number of connections permitted by DB2.
● Catalog the database as if it were remote. (For information on how to catalog a database with a remote node, refer to Knowledge Base article 14745 at the my.Informatica.com support Knowledge Base.)
● Be sure to close connections when programming exceptions occur.
● Verify that connections obtained in one method are returned to the pool via close(). (The PowerCenter Server is very likely already doing this.)
● Verify that your application does not try to access pre-empted connections (i.e., idle connections that are now used by other resources).

DB2 Timestamp

DB2 has a timestamp data type that is precise to the microsecond and uses a 26-character format, as follows:

YYYY-MM-DD-HH.MI.SS.MICROS (where MICROS, after the last period, represents six decimal places of seconds)

The PowerCenter Date/Time datatype only supports precision to the second (using a 19 character format), so under normal circumstances when a timestamp source is read into PowerCenter, the six decimal places after the second are lost. This is sufficient for most data warehousing applications but can cause significant problems where this timestamp is used as part of a key.

If the MICROS need to be retained, this can be accomplished by changing the column from a timestamp data type to a 26-character string in the source and target definitions. When the timestamp is read from DB2, it is read in and converted to character in the ‘YYYY-MM-DD-HH.MI.SS.MICROS’ format. Likewise, when writing to a timestamp, pass the date as a character in the ‘YYYY-MM-DD-HH.MI.SS.MICROS’ format. If this format is not retained, the records are likely to be rejected due to an invalid date format error.

It is also possible to maintain the timestamp correctly using the timestamp data type itself. Setting a flag at the PowerCenter Server level does this; the technique is described in Knowledge Base article 10220 at my.Informatica.com.

Importing Sources or Targets

If the value of the DB2 system variable APPLHEAPSZ is too small when you use the Designer to import sources/targets from a DB2 database, the Designer reports an error accessing the repository. The Designer status bar displays the following message:

SQL Error:[IBM][CLI Driver][DB2]SQL0954C: Not enough storage is available in the application heap to process the statement.

If you receive this error, increase the value of the APPLHEAPSZ variable for your DB2 operating system. APPLHEAPSZ is the application heap size (in 4KB pages) for each process using the database.

Unsupported Datatypes

PowerMart and PowerCenter do not support the following DB2 datatypes:

● Dbclob ● Blob ● Clob ● Real

DB2 External Loaders

The DB2 EE and DB2 EEE external loaders can both perform insert and replace operations on targets. Both can also restart or terminate load operations.

● The DB2 EE external loader invokes the db2load executable located in the PowerCenter Server installation directory. The DB2 EE external loader can load data to a DB2 server on a machine that is remote to the PowerCenter Server.


● The DB2 EEE external loader invokes the IBM DB2 Autoloader program to load data. The Autoloader program uses the db2atld executable. The DB2 EEE external loader can partition data and load the partitioned data simultaneously to the corresponding database partitions. When you use the DB2 EEE external loader, the PowerCenter Server and the DB2 EEE server must be on the same machine.

The DB2 external loaders load from a delimited flat file. Be sure that the target table columns are wide enough to store all of the data. If you configure multiple targets in the same pipeline to use DB2 external loaders, each loader must load to a different tablespace on the target database. For information on selecting external loaders, see "Configuring External Loading in a Session" in the PowerCenter User Guide.

Setting DB2 External Loader Operation Modes

DB2 operation modes specify the type of load the external loader runs. You can configure the DB2 EE or DB2 EEE external loader to run in any one of the following operation modes:

● Insert. Adds loaded data to the table without changing existing table data.
● Replace. Deletes all existing data from the table, and inserts the loaded data. The table and index definitions do not change.
● Restart. Restarts a previously interrupted load operation.
● Terminate. Terminates a previously interrupted load operation and rolls back the operation to the starting point, even if consistency points were passed. The tablespaces return to normal state, and all table objects are made consistent.

Configuring Authorities, Privileges, and Permissions

When you load data to a DB2 database using either the DB2 EE or DB2 EEE external loader, you must have the correct authority levels and privileges to load data into the database tables.

DB2 privileges allow you to create or access database resources. Authority levels provide a method of grouping privileges and higher-level database manager maintenance and utility operations. Together, these functions control access to the database manager and its database objects. You can access only those objects for which you have the required privilege or authority.

To load data into a table, you must have one of the following authorities:


● SYSADM authority
● DBADM authority
● LOAD authority on the database, with INSERT privilege

In addition, you must have proper read access and read/write permissions:

● The database instance owner must have read access to the external loader input files.

● If you run DB2 as a service on Windows, you must configure the service start account with a user account that has read/write permissions to use LAN resources, including drives, directories, and files.

● If you load to DB2 EEE, the database instance owner must have write access to the load dump file and the load temporary file.

Remember, the target file must be delimited when using the DB2 AutoLoader.

Guidelines for Performance Tuning

You can achieve numerous performance improvements by properly configuring the database manager, database, and tablespace container and parameter settings. For example, MAXFILOP is one of the database configuration parameters that you can tune. The default value for MAXFILOP is far too small for most databases. When this value is too small, UDB spends a lot of extra CPU processing time closing and opening files. To resolve this problem, increase MAXFILOP value until UDB stops closing files.

You must also have enough DB2 agents available to process the workload based on the number of users accessing the database. Incrementally increase the value of MAXAGENTS until agents are not stolen from another application. Moreover, sufficient memory allocated to the CATALOGCACHE_SZ database configuration parameter also benefits the database. If the value of catalog cache heap is greater than zero, both DBHEAP and CATALOGCACHE_SZ should be proportionally increased.

In UDB, the LOCKTIMEOUT default value is -1 (wait indefinitely). In a data warehouse database, set this value to 60 seconds. Remember to define TEMPSPACE tablespaces so that they have at least 3 or 4 containers across different disks, and set the PREFETCHSIZE to a multiple of EXTENTSIZE, where the multiplier is equal to the number of containers. Doing so enables parallel I/O for larger sorts, joins, and other database functions requiring substantial TEMPSPACE space.


In UDB, a LOGBUFSZ value of 8 is too small; try setting it to 128. Also, set INTRA_PARALLEL to YES for CPU parallelism. The database configuration parameter DFT_DEGREE should be set according to the number of CPUs available and the number of processes that will be running simultaneously. Setting DFT_DEGREE to ANY can monopolize the CPUs, since a single process can then take up all of the processing power, while setting it to 1 disables intra-query parallelism altogether.

Note: DFT_DEGREE and INTRA_PARALLEL are applicable only for EEE DB.

Data warehouse databases perform numerous sorts, many of which can be very large. SORTHEAP memory is also used for hash joins, which a surprising number of DB2 users fail to enable. To do so, use the db2set command to set environment variable DB2_HASH_JOIN=ON.

For a data warehouse database, at a minimum, double or triple the SHEAPTHRES (to between 40,000 and 60,000) and set the SORTHEAP size between 4,096 and 8,192. If real memory is available, some clients use even larger values for these configuration parameters.

SQL is very complex in a data warehouse environment and often consumes large quantities of CPU and I/O resources. Therefore, set DFT_QUERYOPT to 7 or 9.

UDB uses NUM_IO_CLEANERS for writing to TEMPSPACE, temporary intermediate tables, index creations, and more. SET NUM_IO_CLEANERS equal to the number of CPUs on the UDB server and focus on your disk layout strategy instead.

Lastly, for RAID devices where several disks appear as one to the operating system, be sure to do the following:

1. db2set DB2_STRIPED_CONTAINERS=YES (do this before creating tablespaces or before a redirected restore)

2. db2set DB2_PARALLEL_IO=* (or use TablespaceID numbers for tablespaces residing on the RAID devices for example DB2_PARALLEL_IO=4,5,6,7,8,10,12,13)

3. Alter the tablespace PREFETCHSIZE for each tablespace residing on RAID devices such that the PREFETCHSIZE is a multiple of the EXTENTSIZE.

Database Locks and Performance Problems

When working in an environment with many users that target a DB2 UDB database, you may experience slow and erratic behavior resulting from the way UDB handles database locks. Out of the box, DB2 UDB database and client connections are configured on the assumption that they will be part of an OLTP system and place several locks on records and tables. Because PowerCenter typically works with OLAP systems where it is the only process writing to the database and users are primarily reading from the database, this default locking behavior can have a significant impact on performance.

Connections to DB2 UDB databases are set up using the DB2 Client Configuration utility. To minimize problems with the default settings, make the following changes to all remote clients accessing the database for read-only purposes. To help replicate these settings, you can export the settings from one client and then import the resulting file into all the other clients.

● Enable Cursor Hold is the default setting for the Cursor Hold option. Edit the configuration settings and make sure the Enable Cursor Hold option is not checked.

● Connection Mode should be Shared, not Exclusive.
● Isolation Level should be Read Uncommitted (the minimum level) or Read Committed (if updates by other applications are possible and dirty reads must be avoided).

To set the isolation level to dirty read at the PowerCenter Server level, you can set a flag in the PowerCenter configuration file. For details on this process, refer to Knowledge Base article 13575 in the my.Informatica.com support Knowledge Base.

If you are not sure how to adjust these settings, launch the IBM DB2 Client Configuration utility, then highlight the database connection you use and select Properties. In Properties, select Settings and then select Advanced. You will see these options and their settings on the Transaction tab.

To export the settings from the main screen of the IBM DB2 client configuration utility, highlight the database connection you use, then select Export and all. Use the same process to import the settings on another client.

If users run hand-coded queries against the target table using DB2's Command Center, be sure they know to use script mode and avoid interactive mode (by choosing the script tab instead of the interactive tab when writing queries). Interactive mode can lock returned records while script mode merely returns the result and does not hold them.


If your target DB2 table is partitioned and resides across different nodes in DB2, you can use a target partition type “DB Partitioning” in PowerCenter session properties. When DB partitioning is selected, separate connections are opened directly to each node and the load starts in parallel. This improves performance and scalability.

Last updated: 13-Feb-07 17:14


Using Shortcut Keys in PowerCenter Designer

Challenge

Using shortcuts and work-arounds to work as efficiently as possible in PowerCenter Mapping Designer and Workflow Manager.

Description

After you are familiar with the normal operation of PowerCenter Mapping Designer and Workflow Manager, you can use a variety of shortcuts to speed up their operation.

PowerCenter provides two types of shortcuts:

● keyboard shortcuts to edit repository objects and maneuver through the Mapping Designer and Workflow Manager as efficiently as possible, and

● shortcuts that simplify the maintenance of repository objects.

General Suggestions

Maneuvering the Navigator Window

Follow these steps to open a folder with workspace open as well:

1. While highlighting the folder, click the Open folder icon.

Note: Double-clicking the folder name only opens the folder if it has not yet been opened or connected to.

2. Alternatively, right-click the folder name, then click on Open.

Working with the Toolbar and Menubar

The toolbar contains commonly used features and functions within the various client tools. Using the toolbar is often faster than selecting commands from within the menubar.

● To add more toolbars, select Tools | Customize.
● Select the Toolbar tab to add or remove toolbars.

Follow these steps to use drop-down menus without the mouse:

1. Press and hold the <Alt> key. You will see an underline under one letter of each of the menu titles.
2. Press the underlined letter for the desired drop-down menu. For example, press 'r' for the 'Repository' drop-down menu.
3. Press the underlined letter to select the command/operation you want. For example, press 't' for 'Close All Tools'.
4. Alternatively, after you have pressed the <Alt> key, use the right/left arrows to navigate across the menubar, and the up/down arrows to expand and navigate through the drop-down menu. Press Enter when the desired command is highlighted.

● To create a customized toolbar for the functions you frequently use, press <Alt> <T> (expands the ‘Tools’ drop-down menu) then <C> (for ‘Customize’).

● To delete customized icons, select Tools | Customize, and then remove the icons by dragging them directly off the toolbar.
● To add an icon to an existing (or new) toolbar, select Tools | Customize and navigate to the ‘Commands’ tab. Find your desired command, then "drag and drop" the icon onto your toolbar.
● To rearrange the toolbars, click and drag the toolbar to the new location. You can insert more than one toolbar at the top of the designer tool to avoid having the buttons go off the edge of the screen. Alternatively, you can position the toolbars at the bottom, side, or between the workspace and the message windows.
● To dock or undock a window (e.g., Repository Navigator), double-click on the window's title bar. If you are having trouble docking the window again, right-click somewhere in the white space of the runaway window (not the title bar) and make sure that the "Allow Docking" option is checked. When it is checked, drag the window to its proper place and, when an outline of where the window used to be appears, release the window.

Keyboard Shortcuts

Use the following keyboard shortcuts to perform various operations in Mapping Designer and Workflow Manager.

To: Press:

Cancel editing in an object Esc

Check and uncheck a check box Space Bar

Copy text from an object onto a clipboard Ctrl+C

Cut text from an object onto the clipboard Ctrl+X.

Edit the text of an object F2. Then move the cursor to the desired location


Find all combination and list boxes Type the first letter of the list

Find tables or fields in the workspace Ctrl+F

Move around objects in a dialog box (When no objects are selected, this will pan within the workspace)

Ctrl+directional arrows

Paste copied or cut text from the clipboard into an object Ctrl+V

Select the text of an object F2

To start help F1

Mapping Designer

Navigating the Workspace

When using the "drag & drop" approach to create Foreign Key/Primary Key relationships between tables, be sure to start in the Foreign Key table and drag the key/field to the Primary Key table. Set the Key Type value to "NOT A KEY" prior to dragging.

Follow these steps to quickly select multiple transformations:

1. Hold the mouse down and drag to view a box.
2. Be sure the box touches every object you want to select. The selected items will have a distinctive outline around them.
3. If you miss one or have an extra, you can hold down the <Shift> or <Ctrl> key and click the offending transformations one at a time. They will alternate between being selected and deselected each time you click on them.

Follow these steps to copy and link fields between transformations:

1. You can select multiple ports when you are trying to link to the next transformation.
2. When you are linking multiple ports, they are linked in the same order as they are in the source transformation. You need to highlight the fields you want in the source transformation and hold the mouse button over the port name in the target transformation that corresponds to the source transformation port.

3. Use the Autolink function whenever possible. It is located under the Layout menu (or accessible by right-clicking somewhere in the workspace) of the Mapping Designer.

4. Autolink can link by name or position. PowerCenter version 6 and later gives you the option of entering prefixes or suffixes (when you click the 'More' button). This is especially helpful when you are trying to autolink from a Router transformation to some target transformation. For example, each group created in a Router has a distinct suffix number added to the port/field name. To autolink, you need to choose the proper Router and Router group in the 'From Transformation' space. You also need to click the 'More' button and enter the appropriate suffix value. You must do both to create a link.

5. Autolink does not work if any of the fields in the 'To' transformation are already linked to another group or another stream. No error appears; the links are simply not created.

Sometimes, a shared object is very close to (but not exactly) what you need. In this case, you may want to make a copy of the object with some minor alterations to suit your purposes. If you try to simply click and drag the object, it will ask you if you want to make a shortcut or it will be reusable every time. Follow these steps to make a non-reusable copy of a reusable object:

1. Open the target folder.
2. Select the object that you want to make a copy of, either in the source or target folder.
3. Drag the object over the workspace.
4. Press and hold the <Ctrl> key (the crosshairs symbol '+' will appear in a white box).
5. Release the mouse button, then release the <Ctrl> key.
6. A copy confirmation window and a copy wizard window appear.
7. The newly created transformation no longer says that it is reusable, and you are free to make changes without affecting the original reusable object.

Editing Tables/Transformations

Follow these steps to move one port in a transformation:


1. Double-click the transformation and make sure you are in the "Ports" tab. (You go directly to the “Ports” tab if you double-click a port instead of the colored title bar.)

2. Highlight the port and click the up/down arrow button to reposition the port.
3. Or, highlight the port and then press <Alt><w> to move the port down or <Alt><u> to move the port up.

Note: You can hold down the <Alt> and hit the <w> or <u> multiple times to reposition the currently highlighted port downwards or upwards, respectively.

Alternatively, you can accomplish the same thing by following these steps:

1. Highlight the port you want to move by clicking the number beside the port.
2. Grab onto the port by its number and continue holding down the left mouse button.
3. Drag the port to the desired location (the list of ports scrolls when you reach the end). A red line indicates the new location.
4. When the red line is pointing to the desired location, release the mouse button.

Note: You cannot move more than one port at a time with this method. See below for instructions on moving more than one port at a time.

If you are using PowerCenter version 6.x, 7.x, or 8.x and the ports you are moving are adjacent, you can follow these steps to move more than one port at a time:

1. Highlight the ports you want to move by clicking the number beside the port while holding down the <Ctrl> key.
2. Use the up/down arrow buttons to move the ports to the desired location.

● To add a new field or port, first highlight an existing field or port, then press <Alt><f> to insert the new field/port below it.
● To validate a defined default value, first highlight the port you want to validate, and then press <Alt><v>. A message box will confirm the validity of the default value.
● After creating a new port, simply begin typing the name you wish to call the port. There is no need to remove the default "NEWFIELD" text prior to labelling the new port. This method can also be applied when modifying existing port names: highlight the existing port by clicking on the port number, and begin typing the modified name of the port. To prefix a port name, press <Home> to bring the cursor to the beginning of the port name. To add a suffix to a port name, press <End> to bring the cursor to the end of the port name.

● Checkboxes can be checked (or unchecked) by highlighting the desired checkbox, and pressing SPACE bar to toggle the checkmark on and off.

Follow either of these steps to quickly open the Expression Editor of an output or variable port:

1. Highlight the expression so that there is a box around the cell and press <F2> followed by <F3>.
2. Or, highlight the expression so that there is a cursor somewhere in the expression, then press <F2>.

● To cancel an edit in the grid, press <Esc> so the changes are not saved.
● For all combo/drop-down list boxes, type the first letter on the list to select the item you want. For example, you can highlight a port's Data type box without displaying the drop-down. To change it to 'binary', type <b>. Then use the arrow keys to go down to the next port. This is very handy if you want to change all fields to string, for example, because using the up and down arrows and hitting a letter is much faster than opening the drop-down menu and making a choice each time.

● To copy a selected item in the grid, press <Ctrl><c>.
● To paste a selected item from the Clipboard to the grid, press <Ctrl><v>.
● To delete a selected field or port from the grid, press <Alt><c>.
● To copy a selected row from the grid, press <Alt><o>.
● To paste a selected row from the grid, press <Alt><p>.

You can use either of the following methods to delete more than one port at a time.

● You can repeatedly hit the cut button; or
● You can highlight several records and then click the cut button. Use <Shift> to highlight many items in a row or <Ctrl> to highlight multiple non-contiguous items. Be sure to click on the number beside the port, not the port name, while you are holding <Shift> or <Ctrl>.

Editing Expressions

Follow either of these steps to expedite validation of a newly created expression:

● Click on the <Validate> button or press <Alt> and <v>.

Note: This validates and leaves the Expression Editor open.

● Or, press <OK> to initiate parsing/validating of the expression. The system closes the Expression Editor if the validation is successful. If you click ‘OK’ once again in the "Expression parsed successfully" pop-up, the Expression Editor remains open.

There is little need to type in the Expression Editor. The tabs list all functions, ports, and variables that are currently available. If you want an item to appear in the Formula box, just double-click on it in the appropriate list on the left. This helps to avoid typographical errors and mistakes (such as including an output-only port name in an expression formula).

In version 6.x and later, if you change a port name, PowerCenter automatically updates any expression that uses that port with the new name.

Be careful about changing data types. Any expression using the port with the new data type may remain valid, but not perform as expected. If the change invalidates the expression, it will be detected when the object is saved or if the Expression Editor is active for that expression.

The following table summarizes additional shortcut keys that are applicable only when working with Mapping Designer:

To: Press

Add a new field or port Alt + F

Copy a row Alt + O

Cut a row Alt + C

Move current row down Alt + W

Move current row up Alt + U

Paste a row Alt + P


Validate the default value in a transformation Alt + V

Open the Expression Editor from the expression field F2, then press F3

To start the debugger F9

Repository Object Shortcuts

A repository object defined in a shared folder can be reused across folders by creating a shortcut (i.e., a dynamic link to the referenced object).

Whenever possible, reuse source definitions, target definitions, reusable transformations, mapplets, and mappings. Reusing objects allows sharing complex mappings, mapplets or reusable transformations across folders, saves space in the repository, and reduces maintenance.

Follow these steps to create a repository object shortcut:

1. Expand the shared folder.
2. Click and drag the object definition into the mapping that is open in the workspace.
3. As the cursor enters the workspace, the object icon appears along with a small curve.

4. A dialog box appears to confirm that you want to create a shortcut.

If you want to copy an object from a shared folder instead of creating a shortcut, hold down the <Ctrl> key before dropping the object into the workspace.

Workflow Manager

Navigating the Workspace

When editing a repository object or maneuvering around the Workflow Manager, use the following shortcuts to speed up the operation you are performing:

To: Press:

Create links Press Ctrl+F2 to select first task you want to link.

Press Tab to select the rest of the tasks you want to link

Press Ctrl+F2 again to link all the tasks you selected

Edit tasks name in the workspace F2

Expand a selected node and all its children SHIFT + * (use asterisk on numeric keypad)

Move across to select tasks in the workspace Tab

Select multiple tasks Ctrl + Mouseclick


Repository Object Shortcuts

Mappings that reside in a “shared folder” can be reused within workflows by creating shortcut mappings.

A set of workflow logic can be reused within workflows by creating a reusable worklet.

Last updated: 13-Feb-07 17:25


Working with JAVA Transformation Object

Challenge

Occasionally, special processing of data is required that is not easy to accomplish using existing PowerCenter transformation objects. Transformation tasks such as looping through data 1 to x times are not native to the existing PowerCenter transformation objects. For these situations, the Java Transformation provides the ability to develop Java code with virtually unlimited transformation capabilities. This Best Practice addresses questions that are commonly raised about using the JTX and how to make effective use of it, and supplements the existing PowerCenter documentation on the JTX.

Description

The “Java Transformation” (JTX) introduced in PowerCenter 8.0 provides a uniform means of entering and maintaining program code written in Java to be executed for every record being processed during a session run. The Java code is maintained, entered, and viewed within the PowerCenter Designer tool.

Below is a summary of some typical questions about the JTX.

Is a JTX a passive or an active transformation?

A JTX can be either passive or active. When defining a JTX you must choose one or the other type. Once you make this choice you will not be able to change it without deleting the JTX, saving the repository and recreating the object.

Hint: If you are working with a versioned repository, you will have to purge the deleted JTX from the repository before you can recreate it with the same name.

What parts of a typical Java class can be used in a JTX?

The following standard features can be used in a JTX:

● “static” initialization blocks can be defined on the tab “Helper Code”.
● “import” statements can be listed on the tab “Import Packages”.


● “static” variables of the Java class as a whole (i.e., counters for instances of this class) as well as non-static member variables (for every single instance) can be defined on the tab “Helper Code”.

● Auxiliary member functions or “static” functions may be declared and defined on the tab “Helper Code”.

● “static final” variables may be defined on the tab “Helper Code”. However, they are private by nature; no object of any other Java class will be able to utilize these.

● Auxiliary functions (static and dynamic) can be defined on the tab “Helper Code”.

Important Note:

Before trying to start a session utilizing additional “import” clauses in the Java code, make sure that the environment variable CLASSPATH contains the necessary .jar files or directories before the PowerCenter Integration Service has been started.

All non-static member variables declared on the tab “Helper Code” are automatically available to every partition of a partitioned session without any precautions. In other words, one object of the respective Java class that is generated by PowerCenter will be instantiated for every single instance of the JTX and for every session partition. For example, if you utilize two instances of the same reusable JTX and have set the session to run with three partitions, then six individual objects of that Java class will be instantiated for this session run.

What parts of a typical Java class cannot be utilized in a JTX?

The following standard features of Java are not available in a JTX:

● Standard and user-defined constructors
● Standard and user-defined destructors
● Any kind of direct user interface, be it a Swing GUI or a console-based user interface

What else cannot be done in a JTX?

One important note for a JTX is that you cannot retrieve, change, or utilize an existing DB connection in a JTX (such as a source connection, a target connection, or a relational connection to a LKP). If you would like to establish a database connection, use JDBC in the JTX. Make sure in this case that you provide the necessary parameters by other means.
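
As an illustration only, the following sketch shows how a JDBC connection might be opened and reused from the code tabs of a JTX. The driver, connection URL, credentials, table, and method names are placeholder assumptions, not part of the original text; the import statements belong on the “Import Packages” tab, the helper members on the “Helper Code” tab, and the JDBC driver .jar must be on the CLASSPATH as noted above.

// "Import Packages" tab:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// "Helper Code" tab (all connection details below are placeholders):
private Connection conn = null;

private Connection getConnection() throws SQLException
{
    if (conn == null || conn.isClosed())
    {
        // In practice, pass the URL and credentials in as mapping parameters or input ports.
        conn = DriverManager.getConnection(
            "jdbc:db2://dbhost:50000/SAMPLEDB", "db_user", "db_password");
    }
    return conn;
}

private String lookupDescription(String code) throws SQLException
{
    // Look up a description for a code value from a hypothetical reference table.
    PreparedStatement stmt = getConnection().prepareStatement(
        "SELECT descr FROM ref_codes WHERE code = ?");
    stmt.setString(1, code);
    ResultSet rs = stmt.executeQuery();
    String descr = rs.next() ? rs.getString(1) : null;
    rs.close();
    stmt.close();
    return descr;
}

Opening one connection per JTX instance (rather than per input row) keeps the JDBC overhead manageable; the connection can then be closed on the “On End Of Data” tab.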

How can I substitute constructors and the like in a JTX?

User-defined constructors are mainly used to pass certain initialization values to a Java class that you want to process only once. The only way in a JTX to get this work done is to pass those parameters into the JTX as a normal port; then you define a boolean variable (initial value is “true”). For example, the name might be “constructMissing” on the Helper Code tab. The very first block in the On Input Row block will then look like this:

if (constructMissing)
{
    // do whatever you would do in the constructor
    …

    constructMissing = false;
}

Interaction with users is mainly done to provide input values to some member functions of a class. This usually is not appropriate in a JTX because all input values should be provided by means of input records.

If there is a need to enable immediate interaction with a user for one, several, or all input records, use an inter-process communication mechanism (i.e., IPC) to establish communication between the Java class associated with the JTX and an environment available to a user. For example, if the actual check to be performed can only be determined at runtime, you might want to establish a JavaBeans communication between the JTX and the classes performing the actual checks. Beware, however, that this sort of mechanism causes great overhead and subsequently may decrease performance dramatically. In many cases, moreover, such requirements indicate that the analysis process and the mapping design process have not been executed optimally.

How do I choose between an active and a passive JTX?

Use the following guidelines to identify whether you need an active or a passive JTX in your mapping:


As a general rule of thumb, a passive JTX will usually execute faster than an active JTX.

If one input record equals one output record of the JTX, you will probably want to use a passive JTX.

If you have to produce a varying number of output records per input record (i.e., for some input values the JTX will generate one output record, for some values it will generate no output records, for some values it will generate two or even more output records) you will have to utilize an active JTX. There is no other choice.

If you have to accumulate one or more input records before generating one or more output records, you will have to utilize an active JTX. There is no other choice.

If you have to do some initialization work before processing the first input record, this fact in no way determines whether to utilize an active or a passive JTX.

If you have to do some cleanup work after having processed the last input record, this fact in no way determines whether to utilize an active or a passive JTX.

If you have to generate one or more output records after the last input record has been processed, then you have to use an active JTX. There is no other choice except changing the mapping accordingly to produce these additional records by other means.
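
To make the active-JTX cases above more concrete, here is a minimal sketch of an “On Input Row” block that produces a varying number of output rows per input row. The port names (in_List, out_Item) are illustrative assumptions; the sketch relies on the isNull() function described later in this Best Practice and on the Java Transformation API method generateRow(), which emits one output row each time it is called in an active JTX.

// "On Input Row" tab of an active JTX; port names are examples only.
if (!isNull("in_List"))
{
    // Split a comma-delimited input value into individual items.
    String[] items = in_List.split(",");
    for (int i = 0; i < items.length; i++)
    {
        out_Item = items[i].trim();
        generateRow();   // emit one output row per item; no rows are emitted for a NULL input
    }
}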

How do I set up a JTX and use it in a mapping?

As with most standard transformations you can either define a reusable JTX or an instance directly within a mapping. The following example will describe how to define a JTX in a mapping. For this example assume that the JTX has one input port of data type String and three output ports of type String, Integer, and Smallint.

Note: As of version 8.1.1 the PowerCenter Designer is extremely sensitive regarding the port structure of a JTX; make sure you read and understand the Notes section below before designing your first JTX, otherwise you will encounter issues when trying to run a session associated to your mapping.

1. Click the button showing the java icon, then click on the background in the main window of the Mapping Designer. Choose whether to generate a passive or an active JTX (see “How do I choose between an active and a passive JTX” above). Remember, you cannot change this setting later.

2. Rename the JTX accordingly (i.e., rename it to “JTX_SplitString”).

3. Go to the Ports tab; define all input-only ports in the Input Group, define all output-only and input-output ports in the Output Group. Make sure that every output-only and every input-output port is defined correctly.

4. Make sure you define the port structure correctly from the onset as changing data types of ports after the JTX has been saved to the repository will not always work.

5. Click Apply.

6. On the Properties tab you may want to change certain properties. For example, the setting "Is Partitionable" is mandatory if this session will be partitioned. Follow the hints in the lower part of the screen form that explain the selection lists in detail.

7. Activate the tab Java Code. Enter code pieces where necessary. Be aware that all ports marked as input-output ports on the Ports tab are automatically processed as pass-through ports by the Integration Service. You do not have to (and should not) enter any code referring to pass-through ports. See the Notes section below for more details.

8. Click the Compile link near the lower right corner of the screen form to compile the Java code you have entered. Check the output window at the lower border of the screen form for all compilation errors and work through each error message encountered; then click Compile again. Repeat this step as often as necessary until you can compile the Java code without any error messages.

9. Click OK.

10. Only connect ports of the same data type to every input-only or input-output port of the JTX. Connect output-only and input-output ports of the JTX only to ports of the same data type in transformations downstream. If any downstream transformation expects a different data type than the type of the respective output port of the JTX, insert an EXP to convert data types. Refer to the Notes below for more detail.

11. Save the mapping.


Notes:

The primitive Java data types available in a JTX that can be used for ports of the JTX to connect to other transformations are Integer, Double, and Date/Time. Date/time values are delivered to or by a JTX by means of a Java “long” value which indicates the difference of the respective date/time value to midnight, Jan 1st, 1970 (the so-called Epoch) in milliseconds; to interpret this value, utilize the appropriate methods of the Java class GregorianCalendar. Smallint values cannot be delivered to or by a JTX.
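
As a sketch of the date/time handling described above (the port name in_LoadDate is an illustrative assumption, and java.util.GregorianCalendar would be listed on the “Import Packages” tab), the millisecond offset can be decomposed like this:

// in_LoadDate is an example date/time input port, delivered as milliseconds since Jan 1st, 1970.
GregorianCalendar cal = new GregorianCalendar();
cal.setTimeInMillis(in_LoadDate);

int year  = cal.get(GregorianCalendar.YEAR);
int month = cal.get(GregorianCalendar.MONTH) + 1;   // Calendar months are zero-based
int day   = cal.get(GregorianCalendar.DAY_OF_MONTH);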

The Java object data types available in a JTX that can be used for ports are String, byte arrays (for Binary ports), and BigDecimal (for Decimal values of arbitrary precision).

In a JTX you check whether an input port has a NULL value by calling the function isNull("name_of_input_port"). If an input value is NULL, then you should explicitly set all depending output ports to NULL by calling setNull("name_of_output_port"). Both functions take the name of the respective input / output port as a string.

You retrieve the value of an input port (provided this port is not NULL, see previous paragraph) simply by referring to the name of this port in your Java source code. For example, if you have two input ports i_1 and i_2 of type Integer and one output port o_1 of type String, then you might set the output value with a statement like this one: o_1 = "First value = " + i_1 + ", second value = " + i_2;
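
Putting the two previous points together, a minimal “On Input Row” fragment for the example ports i_1, i_2, and o_1 might look like the following sketch:

// Example ports from the text above: integer inputs i_1 and i_2, string output o_1.
if (isNull("i_1") || isNull("i_2"))
{
    // Propagate NULL instead of concatenating with a missing value.
    setNull("o_1");
}
else
{
    o_1 = "First value = " + i_1 + ", second value = " + i_2;
}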

In contrast to a Custom Transformation, it is not possible to retrieve the names, data types, and/or values of pass-through ports except if these pass-through ports have been defined on the Ports tab in advance. In other words, it is impossible for a JTX to adapt to its port structure at runtime (which would be necessary, for example, for something like a Sorter JTX).

If you have to transfer 64-bit values into a JTX, deliver them to the JTX by means of a string representing the 64-bit number and convert this string into a Java “long” variable using the static method Long.parseLong(). Likewise, to deliver a 64-bit integer from a JTX to downstream transformations, convert the “long” variable to a string which will be an output port of the JTX (e.g., using the statement o_Int64 = "" + myLongVariable).
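
A short sketch of this round trip, reusing the o_Int64 output port mentioned above and assuming a string input port named in_Int64:

// in_Int64 is an example string input port carrying a 64-bit integer in text form.
long myLongVariable = Long.parseLong(in_Int64);

// ... perform whatever 64-bit processing is required ...

// Deliver the result downstream through the string output port.
o_Int64 = "" + myLongVariable;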

As of version 8.1.1, the PowerCenter Designer is very sensitive regarding data types of ports connected to a JTX. Supplying a JTX with not exactly the expected data types or connecting output ports to other transformations expecting other data types (i.e., a string instead of an integer) may cause the Designer to invalidate the mapping such that the only remedy is to delete the JTX, save the mapping, and re-create the JTX.

Initialization Properties and Metadata Extensions can neither be defined nor retrieved in a JTX.

The code entered on the Java Code sub-tab “On Input Row” is inserted into some other code; only this complete code constitutes the method “execute()” of the resulting Java class associated to the JTX (see output of the link "View Code" near the lower-right corner of the Java Code screen form). The same holds true for the code entered on the tabs “On End Of Data” and “On Receiving Transactions” with regard to the methods. This fact has a couple of implications which will be explained in more detail below.

If you connect input and/or output ports to transformations with differing data types, you might get error messages during mapping validation. One such error message occurring quite often indicates that the byte code of the class cannot be retrieved from the repository. In this case, rectify port connections to all input and/or output ports of the JTX and edit the Java code (inserting one blank comment line usually suffices) and recompile the Java code again.

The JTX (Java Transformation) doesn't currently allow pass-through ports. Thus they have to be simulated by splitting them up into one input port and one output port, then the values of all input ports have to be assigned to the respective output port. The key here is the input port of every pair of ports has to be in the Input Group while the respective output port has to be in the Output Group. If you do not do this, there is no warning in designer but it will not function correctly.

Where and how to insert what pieces of Java code into a JTX?

A JTX always contains a code skeleton that is generated by the Designer. Every piece of code written by a mapping designer is inserted into this skeleton at designated places. Because all these code pieces do not constitute the sole content of the respective functions, there are certain rules and recommendations as to how to write such code.

As mentioned previously, a mapping designer can neither write his or her own constructor nor insert any code into the default constructor or the default destructor generated by the Designer. Initialization and cleanup work can instead be done in the following ways:

● as part of the “static{}” initialization block,
● by inserting code that in a standalone class would be part of the destructor into the tab On End Of Data,
● by inserting code that in a standalone class would be part of the constructor into the tab On Input Row.

The last case (constructor code being part of the On Input Row code) requires a little trick: constructor code is supposed to be executed once only, namely before the first method is called. In order to resemble this behavior, follow these steps:

1. On the tab Helper Code, define a boolean variable (i.e., “constructorMissing”) and initialize it to “true”.

2. At the beginning of the On Input Row code, insert code that looks like the following:

if (constructorMissing)
{
    // do whatever the constructor should have done
    …

    constructorMissing = false;
}


This will ensure that this piece of code is executed only once, namely directly before the very first input row is processed.

The code pieces on the tabs “On Input Row”, “On End Of Data”, and “On Receiving Transaction” are embedded in other code. There is code that runs before the code entered here will execute, and there is more code to follow; for example, exceptions raised within code written by a developer will be caught here. As a mapping developer you cannot change this order, so you need to be aware of the following important implication.

Suppose you are writing a Java class that performs some checks on an input record and, if the checks fail, issues an error message and then skips processing to the next record. Such a piece of code might look like this:

if (firstCheckPerformed( inputRecord) &&
    secondCheckPerformed( inputRecord))
{
    logMessage( “ERROR: one of the two checks failed!”);
    return;
}
// else
insertIntoTarget( inputRecord);
countOfSucceededRows ++;

This code will not compile in a JTX because it would lead to unreachable code. Why? Because the “return” at the end of the “if” statement might enable the respective function (in this case, the method will have the name “execute()”) to “ignore” the subsequent code that is part of the framework created by the Designer.

In order to make this code work in a JTX, change it to look like this:

if (firstCheckPerformed( inputRecord) &&
    secondCheckPerformed( inputRecord))
{
    logMessage( “ERROR: one of the two checks failed!”);
}
else
{
    insertIntoTarget( inputRecord);
    countOfSucceededRows ++;
}

The same principle (never use “return” in these code pieces) applies to all three tabs On Input Row, On End Of Data, and On Receiving Transaction.

Another important point is that the code entered on the On Input Row tab is embedded in a try-catch block, so never include any try-catch code of your own on this tab.

How fast does a JTX perform?

A JTX communicates with PowerCenter by means of JNI (Java Native Invocation). This mechanism has been defined by Sun Micro-systems in order to allow Java code to interact with dynamically linkable libraries. Though JNI has been designed to perform fast, it still creates some overhead to a session due to:

● the additional process switches between the PowerCenter Integration Service and the Java Virtual Machine (JVM), which executes as another operating system process;
● Java not being compiled to machine code but to portable byte code (although this has been largely remedied in the past years due to the introduction of Just-In-Time compilers), which is interpreted by the JVM;
● the inherent complexity of the genuine object model in Java (except for most sorts of number types and characters, everything in Java is an object that occupies space and execution time).


So it is obvious that a JTX cannot perform as fast as, for example, a carefully written Custom Transformation.

The rule of thumb is for simple JTX to require approximately 50% more total running time than an EXP of comparable functionality. It can also be assumed that Java code utilizing several of the fairly complex standard classes will need even more total runtime when compared to an EXP performing the same tasks.

When should I use a JTX and when not?

As with any other standard transformation, a JTX has its advantages as well as disadvantages. The most significant disadvantages are:

● The Designer is very sensitive with regard to the data types of ports that are connected to the ports of a JTX. However, most of the troubles arising from this sensitivity can be remedied rather easily by simply recompiling the Java code.

● Working with “long” values representing days and time within, for example, the GregorianCalendar can be extremely difficult and demanding in terms of runtime resources (memory, execution time). Date/time ports in PowerCenter are far easier to use, so it is advisable to split up date/time ports into their individual components, such as year, month, and day, and to process these singular attributes within a JTX if needed.

● In general, a JTX can reduce performance simply by the nature of the architecture. Only use a JTX when necessary.

● A JTX always has one input group and one output group. For example, it is impossible to write a Joiner as a JTX.

Significant advantages to using a JTX are:

● Java knowledge and experience are generally easier to find than comparable skills in other languages.

● Prototyping with a JTX can be very fast. For example, setting up a simple JTX that calculates the calendar week and calendar year for a given date takes approximately 10-20 minutes. Writing Custom Transformations (even for easy tasks) can take several hours.

● Not every data integration environment has access to a C compiler used to compile Custom Transformations in C. Because PowerCenter is installed with its own JDK, this problem will not arise with a JTX.

In Summary

If you need a transformation that adapts its processing behavior to its ports, a JTX is not the way to go. In such a case, write a Custom Transformation in C, C++, or Java to perform the necessary tasks. The CT API is considerably more complex than the JTX API, but it is also far more flexible.

Use a JTX for development whenever a task cannot be easily completed using other standard options in PowerCenter (as long as performance requirements do not dictate otherwise).

If performance measurements are slightly below expectations, try optimizing the Java code and the remainder of the mapping in order to increase processing speed.

Last updated: 01-Feb-07 18:53


Error Handling Process

Challenge

For an error handling strategy to be implemented successfully, it must be integral to the load process as a whole. The method of implementation for the strategy will vary depending on the data integration requirements for each project.

The resulting error handling process should however, always involve the following three steps:

1. Error identification
2. Error retrieval
3. Error correction

This Best Practice describes how each of these steps can be facilitated within the PowerCenter environment.

Description

A typical error handling process leverages the best-of-breed error management technology available in PowerCenter, such as:

● Relational database error logging
● Email notification of workflow failures
● Session error thresholds
● The reporting capabilities of PowerCenter Data Analyzer
● Data profiling

These capabilities can be integrated to facilitate error identification, retrieval, and correction as described in the flow chart below:


Error Identification

The first step in the error handling process is error identification. Error identification is often achieved through the use of the ERROR() function within mappings, enablement of relational error logging in PowerCenter, and referential integrity constraints at the database.

This approach ensures that row-level issues such as database errors (e.g., referential integrity failures), transformation errors, and business rule exceptions for which the ERROR() function was called are captured in relational error logging tables.

Enabling the relational error logging functionality automatically writes row-level data to a set of four error handling tables (PMERR_MSG, PMERR_DATA, PMERR_TRANS, and PMERR_SESS). These tables can be centralized in the PowerCenter repository and store information such as error messages, error data, and source row data. Row-level errors trapped in this manner include any database errors, transformation errors, and business rule exceptions for which the ERROR() function was called within the mapping.

Error Retrieval

The second step in the error handling process is error retrieval. After errors have been captured in the PowerCenter repository, it is important to make their retrieval simple and automated so that the process is as efficient as possible. Data Analyzer can be customized to create error retrieval reports from the information stored in the PowerCenter repository. A typical error report prompts a user for the folder and workflow name, and returns a report with information such as the session, error message, and data that caused the error. In this way, the error is successfully captured in the repository and can be easily retrieved through a Data Analyzer report, or an email alert that identifies a user when a certain threshold is crossed (such as “number of errors is greater than zero”).

Error Correction

The final step in the error handling process is error correction. As PowerCenter automates the process of error identification, and Data Analyzer can be used to simplify error retrieval, error correction is straightforward. After retrieving an error through Data Analyzer, the error report (which contains information such as workflow name, session name, error date, error message, error data, and source row data) can be exported to various file formats including Microsoft Excel, Adobe PDF, CSV, and others. Upon retrieval of an error, the error report can be extracted into a supported format and emailed to a developer or DBA to resolve the issue, or it can be entered into a defect management tracking tool. The Data Analyzer interface supports emailing a report directly through the web-based interface to make the process even easier.

For further automation, a report broadcasting rule that emails the error report to a developer’s inbox can be set up to run on a pre-defined schedule. After the developer or DBA identifies the condition that caused the error, a fix for the error can be implemented. The exact method of data correction depends on various factors such as the number of records with errors, data availability requirements per SLA, the level of data criticality to the business unit(s), and the type of error that occurred. Considerations made during error correction include:

● The ‘owner’ of the data should always fix the data errors. For example, if the source data is coming from an external system, then the errors should be sent back to the source system to be fixed.
● In some situations, a simple re-execution of the session will reprocess the data. Consider whether partial data that has already been loaded into the target systems needs to be backed out in order to avoid duplicate processing of rows.
● Lastly, errors can also be corrected through a manual SQL load of the data. If the volume of errors is low, the rejected data can be easily exported to Microsoft Excel or CSV format from the Data Analyzer error reports and corrected in a spreadsheet. The corrected data can then be manually inserted into the target table using a SQL statement.
● Any approach to correct erroneous data should be precisely documented and followed as a standard.


If the data errors occur frequently, reprocessing can be automated by designing a special mapping or session to correct the errors and load the corrected data into the ODS or staging area.

Data Profiling Option

For organizations that want to identify data irregularities post-load but do not want to reject such rows at load time, the PowerCenter Data Profiling option can be an important part of the error management solution. The PowerCenter Data Profiling option enables users to create data profiles through a wizard-driven GUI that provides profile reporting such as orphan record identification, business rule violation, and data irregularity identification (such as NULL or default values). The Data Profiling option comes with a license to use Data Analyzer reports that source the data profile warehouse to deliver data profiling information through an intuitive BI tool. This is a recommended best practice since error handling reports and data profile reports can be delivered to users through the same easy-to-use application.

Integrating Error Handling, Load Management, and Metadata

Error handling forms only one part of a data integration application. By necessity, it is tightly coupled to the load management process and the load metadata; it is the integration of all these approaches that ensures the system is sufficiently robust for successful operation and management. The flow chart below illustrates this in the end-to-end load process.


Error handling underpins the data integration system from end-to-end. Each of the load components performs validation checks, the results of which must be reported to the operational team. These components are not just PowerCenter processes such as business rule and field validation, but cover the entire data integration architecture, for example:

● Process Validation. Are all the resources in place for the processing to begin (e.g., connectivity to source systems)?

● Source File Validation. Is the source file datestamp later than the previous load?

● File Check. Does the number of rows successfully loaded match the source rows read?
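Checks such as these can be scripted around the PowerCenter load itself. The sketch below is illustrative only; the file name, previous load time, and row counts would normally come from the load management metadata.

# Minimal sketch of pre- and post-load checks: source file datestamp and row-count
# reconciliation. File names and the counts shown are hypothetical placeholders.
import os
from datetime import datetime

def source_file_is_newer(path, last_load_time):
    """Source File Validation: is the file datestamp later than the previous load?"""
    return datetime.fromtimestamp(os.path.getmtime(path)) > last_load_time

def rows_reconcile(source_rows_read, rows_loaded, rows_rejected):
    """File Check: do loaded plus rejected rows account for every source row read?"""
    return source_rows_read == rows_loaded + rows_rejected

if __name__ == "__main__":
    print("datestamp check:", source_file_is_newer("customer_20070201.dat", datetime(2007, 1, 31)))
    print("row check:", rows_reconcile(10000, 9950, 50))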

Last updated: 09-Feb-07 13:42


Error Handling Strategies - Data Warehousing

Challenge

A key requirement for any successful data warehouse or data integration project is that it attains credibility within the user community. At the same time, it is imperative that the warehouse be as up-to-date as possible since the more recent the information derived from it is, the more relevant it is to the business operations of the organization, thereby providing the best opportunity to gain an advantage over the competition.

Transactional systems can manage to function even with a certain amount of error since the impact of an individual transaction (in error) has a limited effect on the business figures as a whole, and corrections can be applied to erroneous data after the event (i.e., after the error has been identified). In data warehouse systems, however, any systematic error (e.g., for a particular load instance) not only affects a larger number of data items, but may potentially distort key reporting metrics. Such data cannot be left in the warehouse "until someone notices" because business decisions may be driven by such information.

Therefore, it is important to proactively manage errors, identifying them before, or as, they occur. If errors occur, it is equally important either to prevent them from getting to the warehouse at all, or to remove them from the warehouse immediately (i.e., before the business tries to use the information in error).

The types of error to consider include:

• Source data structures
• Sources presented out-of-sequence
• ‘Old’ sources represented in error
• Incomplete source files
• Data-type errors for individual fields
• Unrealistic values (e.g., impossible dates)
• Business rule breaches
• Missing mandatory data
• O/S errors
• RDBMS errors

These cover both high-level (i.e., related to the process or a load as a whole) and low-level (i.e., field or column-related errors) concerns.

Description

In an ideal world, when an analysis is complete, you have a precise definition of source and target data; you can be sure that every source element was populated correctly, with meaningful values, never missing a value, and fulfilling all relational constraints. At the same time, source data sets always have a fixed structure, are always available on time (and in the correct order), and are never corrupted during transfer to the data warehouse. In addition, the OS and RDBMS never run out of resources, or have permissions and privileges change.

Realistically, however, the operational applications are rarely able to cope with every possible business scenario or combination of events; operational systems crash, networks fall over, and users may not use the transactional systems in quite the way they were designed. The operational systems also typically need some flexibility to allow non-fixed data to be stored (typically as free-text comments). In every case, there is a risk that the source data does not match what the data warehouse expects.

Because of the credibility issue, in-error data must not be propagated to the metrics and measures used by the business managers. If erroneous data does reach the warehouse, it must be identified and removed immediately (before the current version of the warehouse can be published). Preferably, error data should


be identified during the load process and prevented from reaching the warehouse at all. Ideally, erroneous source data should be identified before a load even begins, so that no resources are wasted trying to load it.

As a principle, data errors should be corrected at the source. As soon as any attempt is made to correct errors within the warehouse, there is a risk that the lineage and provenance of the data will be lost. From that point on, it becomes impossible to guarantee that a metric or data item came from a specific source via a specific chain of processes. As a by-product, adopting this principle also helps to tie both the end-users and those responsible for the source data into the warehouse process; source data staff understand that their professionalism directly affects the quality of the reports, and end-users become owners of their data.

As a final consideration, error management (the implementation of an error handling strategy) complements and overlaps load management, data quality and key management, and operational processes and procedures.

Load management processes record at a high-level if a load is unsuccessful; error management records the details of why the failure occurred.

Quality management defines the criteria whereby data can be identified as in error; and error management identifies the specific error(s), thereby allowing the source data to be corrected.

Operational reporting shows a picture of loads over time, and error management allows analysis to identify systematic errors, perhaps indicating a failure in operational procedure.

Error management must therefore be tightly integrated within the data warehouse load process. This is shown in the high level flow chart below:


Error Management Considerations

High-Level Issues

From previous discussion of load management, a number of checks can be performed before any attempt is made to load a source data set. Without load management in place, it is unlikely that the warehouse process will be robust enough to satisfy any end-user requirements, and error correction processing becomes moot (in so far as nearly all maintenance and development resources will be working full time to manually correct bad data in the warehouse). The following assumes that you have implemented load management processes similar to Informatica’s best practices.

• Process Dependency checks in the load management can identify when a source data set is missing, duplicates a previous version, or has been presented out of sequence, and where the previous load failed but has not yet been corrected.

• Load management prevents this source data from being loaded. At the same time, error management processes should record the details of the failed load; noting the source instance, the load affected, and when and why the load was aborted.

• Source file structures can be compared to expected structures stored as metadata, either from header information or by attempting to read the first data row.

• Source table structures can be compared to expectations; typically this can be done by interrogating the RDBMS catalogue directly (and comparing to the expected structure held in metadata), or by simply running a ‘describe’ command against the table (again comparing to a pre-stored version in metadata).

• Control file totals (for file sources) and row number counts (table sources) are also used to determine if files have been corrupted or truncated during transfer, or if tables have no new data in them (suggesting a fault in an operational application).

• In every case, information should be recorded to identify where and when an error occurred, what sort of error it was, and any other relevant process-level details.
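As an illustration of the structure and control-total checks above, a pre-load routine might compare a source file header to the expected structure held in metadata and verify the control file count. This is a minimal sketch; the delimiter, column list, and control file layout are assumptions.

# Minimal sketch: compare a delimited source file header to the expected structure
# held in metadata, and check the control-file row count. All names are hypothetical.
EXPECTED_COLUMNS = ["CUST_ID", "NAME", "DOB", "ADDRESS"]  # from metadata

def structure_matches(data_path, expected_columns, delimiter="|"):
    with open(data_path) as handle:
        header = handle.readline().rstrip("\n").split(delimiter)
    return header == expected_columns

def control_total_matches(data_path, control_path):
    with open(control_path) as handle:
        expected_rows = int(handle.readline().strip())
    with open(data_path) as handle:
        actual_rows = sum(1 for _ in handle) - 1  # exclude header row
    return actual_rows == expected_rows

if __name__ == "__main__":
    print(structure_matches("customer.dat", EXPECTED_COLUMNS))
    print(control_total_matches("customer.dat", "customer.ctl"))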

Low-Level Issues

Assuming that the load is to be processed normally (i.e., that the high-level checks have not caused the load to abort), further error management processes need to be applied to the individual source rows and fields.

• Individual source fields can be compared to expected data-types against standard metadata within the repository, or additional information added by the development team. In some instances, this is enough to abort the rest of the load; if the field structure is incorrect, it is much more likely that the source data set as a whole either cannot be processed at all or (more worryingly) is likely to be processed unpredictably.

• Data conversion errors can be identified on a field-by-field basis within the body of a mapping. Built-in error handling can be used to spot failed date conversions, conversions of string to numbers, or missing required data. In rare cases, stored procedures can be called if a specific conversion fails; however this cannot be generally recommended because of the potentially crushing impact on performance if a particularly error-filled load occurs.

• Business rule breaches can then be picked up. It is possible to define allowable values, or acceptable value ranges, within PowerCenter mappings (if the rules are few, and it is clear from the mapping metadata that the business rules are included in the mapping itself). A more flexible approach is to use external tables to codify the business rules. In this way, only the rules tables need to be amended if a new business rule needs to be applied. Informatica has suggested different methods to implement such a process.

• Missing Key/Unknown Key issues have already been defined, with suggested management techniques for identifying and handling them, in their own best practice document, Key Management in Data Warehousing Solutions. However, from an error handling perspective, such errors must still be identified and recorded, even when key management techniques do not formally fail source rows with key errors. Unless a record is kept of the frequency with which particular source data fails, it is difficult to realize when there is a systematic problem in the source systems.

• Inter-row errors may also have to be considered. These may occur when a business process expects a certain hierarchy of events (e.g., a customer query, followed by a booking request, followed by a confirmation, followed by a payment). If the events arrive from the source system in the wrong order, or where key events are missing, it may indicate a major problem with the source system, or the way in which the source system is being used.

• An important principle to follow is to try to identify all of the errors on a particular row before halting processing, rather than rejecting the row at the first instance. This seems to break the rule of not wasting resources trying to load a source data set if we already know it is in error; however, since the row needs to be corrected at source, then reprocessed subsequently, it is sensible to identify all the corrections that need to be made before reloading, rather than fixing the first, re-running, and then identifying a second error (which halts the load for a second time).

OS and RDBMS Issues

Since best practice means that referential integrity (RI) issues are proactively managed within the loads, instances where the RDBMS rejects data for referential reasons should be very rare (i.e., the load should already have identified that reference information is missing).

However, there is little that can be done to identify the more generic RDBMS problems that are likely to occur: changes to schema permissions, running out of temporary disk space, dropping of tables and schemas, invalid indexes, no further table space extents available, missing partitions and the like.

Similarly, interaction with the OS means that changes in directory structures, file permissions, disk space, command syntax, and authentication may occur outside of the data warehouse. Often such changes are driven by Systems Administrators who, from an operational perspective, are not aware that there is likely to be an impact on the data warehouse, or are not aware that the data warehouse managers need to be kept up to speed.

In both of the instances above, the nature of the errors may be such that not only will they cause a load to fail, but it may be impossible to record the nature of the error at that point in time. For example, if RDBMS user ids are revoked, it may be impossible to write a row to an error table if the error process depends on the revoked id; if disk space runs out during a write to a target table, this may affect all other tables (including the error tables); if file permissions on a UNIX host are amended, bad files themselves (or even the log files) may not be accessible.

Most of these types of issues can be managed by a proper load management process, however. Since setting the status of a load to ‘complete’ should be absolutely the last step in a given process, any failure before, or including, that point leaves the load in an ‘incomplete’ state. Subsequent runs should note this, and enforce correction of the last load before beginning the new one.

The best practice to manage such OS and RDBMS errors is, therefore, to ensure that the Operational Administrators and DBAs have proper and working communication with the data warehouse management to allow proactive control of changes. Administrators and DBAs should also be available to the data warehouse operators to rapidly explain and resolve such errors if they occur.

Auto-Correction vs. Manual Correction

Load management and key management best practices (Key Management in Data Warehousing Solutions) have already defined auto-correcting processes; the former to allow loads themselves to launch, roll back, and reload without manual intervention, and the latter to allow RI errors to be managed so that the quantitative quality of the warehouse data is preserved, and incorrect key values are corrected as soon as the source system provides the missing data.

We cannot conclude from these two specific techniques, however, that the warehouse should attempt to change source data as a general principle. Even if this were possible (which is debatable), such functionality would mean that the absolute link between the source data and its eventual incorporation into the data warehouse would be lost. As soon as one of the warehouse metrics was identified as incorrect, unpicking the error would be impossible, potentially requiring a whole section of the warehouse to be reloaded entirely from scratch.


In addition, such automatic correction of data might hide the fact that one or other of the source systems had a generic fault, or more importantly, had acquired a fault because of on-going development of the transactional applications, or a failure in user training.

The principle to apply here is to identify the errors in the load, and then alert the source system users that data should be corrected in the source system itself, ready for the next load to pick up the right data. This maintains the data lineage, allows source system errors to be identified and ameliorated in good time, and permits extra training needs to be identified and managed.

Error Management Techniques

Simple Error Handling Structure

The following data structure is an example of the error metadata that should be captured as a minimum within the error handling strategy.

The example defines three main sets of information:

• The ERROR_DEFINITION table, which stores descriptions for the various types of errors, including:

o process-level (e.g., incorrect source file, load started out-of-sequence),
o row-level (e.g., missing foreign key, incorrect data-type, conversion errors), and
o reconciliation (e.g., incorrect row numbers, incorrect file total, etc.).

• The ERROR_HEADER table provides a high-level view on the process, allowing a quick identification of the frequency of error for particular loads and of the distribution of error types. It is linked to the load management processes via the SRC_INST_ID and PROC_INST_ID, from which other process-level information can be gathered.

• The ERROR_DETAIL table stores information about actual rows with errors, including how to identify the specific row that was in error (using the source natural keys and row number) together with a string of field identifier/value pairs concatenated together. It is not expected that this information will be deconstructed as part of an automatic correction load, but if necessary this can be pivoted (e.g., using simple UNIX scripts) to separate out the field/value pairs for subsequent reporting.
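A minimal relational sketch of these three tables is shown below. The column names and types are illustrative only and should be adapted to the project's load management metadata; they are not a prescribed Informatica layout.

# Minimal sketch of the three error tables described above.
# Column names and types are illustrative, not a prescribed layout.
import sqlite3  # stand-in for the warehouse RDBMS

DDL = """
CREATE TABLE ERROR_DEFINITION (
    ERROR_TYPE_ID   INTEGER PRIMARY KEY,
    ERROR_CATEGORY  TEXT,      -- process-level, row-level or reconciliation
    ERROR_DESC      TEXT
);
CREATE TABLE ERROR_HEADER (
    ERROR_HDR_ID    INTEGER PRIMARY KEY,
    SRC_INST_ID     INTEGER,   -- link to the load management source instance
    PROC_INST_ID    INTEGER,   -- link to the load management process instance
    ERROR_TYPE_ID   INTEGER REFERENCES ERROR_DEFINITION (ERROR_TYPE_ID),
    ERROR_DATE      TEXT
);
CREATE TABLE ERROR_DETAIL (
    ERROR_HDR_ID    INTEGER REFERENCES ERROR_HEADER (ERROR_HDR_ID),
    SOURCE_KEY      TEXT,      -- natural key of the failing source row
    SOURCE_ROW_NUM  INTEGER,
    FIELD_VALUES    TEXT       -- concatenated field identifier/value pairs
);
"""

if __name__ == "__main__":
    conn = sqlite3.connect("error_metadata.db")
    conn.executescript(DDL)

Keeping ERROR_DETAIL generic (keys plus concatenated field/value pairs) avoids needing one error table per source and keeps the reporting queries simple.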

Last updated: 01-Feb-07 18:53


Error Handling Strategies - General

Challenge

The challenge is to accurately and efficiently load data into the target data architecture. This Best Practice describes various loading scenarios, the use of data profiles, an alternate method for identifying data errors, methods for handling data errors, and alternatives for addressing the most common types of problems. For the most part, these strategies are relevant whether your data integration project is loading an operational data structure (as with data migrations, consolidations, or loading various sorts of operational data stores) or loading a data warehousing structure.

Description

Regardless of target data structure, your loading process must validate that the data conforms to known rules of the business. When the source system data does not meet these rules, the process needs to handle the exceptions in an appropriate manner. The business needs to be aware of the consequences of either permitting invalid data to enter the target or rejecting it until it is fixed. Both approaches present complex issues. The business must decide what is acceptable and prioritize two conflicting goals:

● The need for accurate information.
● The ability to analyze or process the most complete information available with the understanding that errors can exist.

Data Integration Process Validation

In general, there are three methods for handling data errors detected in the loading process:

● Reject All. This is the simplest to implement since all errors are rejected from entering the target when they are detected. This provides a very reliable target that the users can count on as being correct, although it may not be complete. Both dimensional and factual data can be rejected when any errors are encountered. Reports indicate what the errors are and how they affect the completeness of the data.

Dimensional or Master Data errors can cause valid factual data to be rejected because a foreign key relationship cannot be created. These errors need to be fixed in the source systems and reloaded on a subsequent load. Once the corrected rows have been loaded, the factual data will be reprocessed and loaded, assuming that all errors have been fixed. This delay may cause some user dissatisfaction since the users need to take into account that the data they are looking at may not be a complete picture of the operational systems until the errors are fixed. For an operational system, this delay may affect downstream transactions.

The development effort required to fix a Reject All scenario is minimal, since the rejected data can be processed through existing mappings once it has been fixed. Minimal additional code may need to be written since the data will only enter the target if it is correct, and it would then be loaded into the data mart using the normal process.

● Reject None. This approach gives users a complete picture of the available data without having to consider data that was not available due to it being rejected during the load process. The problem is that the data may not be complete or accurate. All of the target data structures may contain incorrect information that can lead to incorrect decisions or faulty transactions.

With Reject None, the complete set of data is loaded, but the data may not support correct transactions or aggregations. Factual data can be allocated to dummy or incorrect dimension rows, resulting in grand total numbers that are correct, but incorrect detail numbers. After the data is fixed, reports may change, with detail information being redistributed along different hierarchies. The development effort to fix this scenario is significant. After the errors are corrected, a new loading process needs to correct all of the target data structures, which can be a time-consuming effort based on the delay between an error being detected and fixed. The development strategy may include removing information from the target, restoring backup tapes for each night’s load, and reprocessing the data. Once the target is fixed, these changes need to be propagated to all downstream data structures or data marts.

● Reject Critical. This method provides a balance between missing information and incorrect information. It involves examining each row of data and determining the particular data elements to be rejected. All changes that are valid are processed into the target to allow for the most complete picture. Rejected elements are reported as errors so that they can be fixed in the source systems and loaded on a subsequent run of the ETL process.

This approach requires categorizing the data in two ways: 1) as key elements or attributes, and 2) as inserts or updates. Key elements are required fields that maintain the data integrity of the target and allow for hierarchies to be summarized at various levels in the organization. Attributes provide additional descriptive information per key element. Inserts are important for dimensions or master data because subsequent factual data may rely on the existence of the dimension data row in order to load properly. Updates do not affect the data integrity as much because the factual data can usually be loaded with the existing dimensional data unless the update is to a key element. The development effort for this method is more extensive than Reject All since it involves classifying fields as critical or non-critical, and developing logic to update the target and flag the fields that are in error. The effort also incorporates some tasks from the Reject None approach, in that processes must be developed to fix incorrect data in the entire target data architecture. Informatica generally recommends using the Reject Critical strategy to maintain the accuracy of the target. By providing the most fine-grained analysis of errors, this method allows the greatest amount of valid data to enter the target on each run of the ETL process, while at the same time screening out the unverifiable data fields. However, business management needs to understand that some information may be held out of the target, and also that some of the information in the target data structures may be at least temporarily allocated to the wrong hierarchies.

Handling Errors in Dimension Profiles

Profiles are tables used to track history changes to the source data. As the source systems change, profile records are created with date stamps that indicate when the change took place. This allows power users to review the target data using either current (As-Is) or past (As-Was) views of the data.

A profile record should occur for each change in the source data. Problems occur when two fields change in the source system and one of those fields results in an error. The first value passes validation, which produces a new profile record, while the second value is rejected and is not included in the new profile. When this error is fixed, it would be desirable to update the existing profile rather than creating a new one, but the logic needed to perform this UPDATE instead of an INSERT is complicated. If a third field is changed in the source before the error is fixed, the correction process is complicated further.

The following example represents three field values in a source system. The first row on 1/1/2000 shows the original values. On 1/5/2000, Field 1 changes from Closed to Open, and Field 2 changes from Black to BRed, which is invalid. On 1/10/2000, Field 3 changes from Open 9-5 to Open 24hrs, but Field 2 is still invalid. On 1/15/2000, Field 2 is finally fixed to Red.

Date      | Field 1 Value | Field 2 Value | Field 3 Value
1/1/2000  | Closed Sunday | Black         | Open 9 – 5
1/5/2000  | Open Sunday   | BRed          | Open 9 – 5
1/10/2000 | Open Sunday   | BRed          | Open 24hrs
1/15/2000 | Open Sunday   | Red           | Open 24hrs

Three methods exist for handling the creation and update of profiles:

1. The first method produces a new profile record each time a change is detected in the source. If a field value was invalid, then the original field value is maintained.

Date      | Profile Date | Field 1 Value | Field 2 Value | Field 3 Value
1/1/2000  | 1/1/2000     | Closed Sunday | Black         | Open 9 – 5
1/5/2000  | 1/5/2000     | Open Sunday   | Black         | Open 9 – 5
1/10/2000 | 1/10/2000    | Open Sunday   | Black         | Open 24hrs
1/15/2000 | 1/15/2000    | Open Sunday   | Red           | Open 24hrs

By applying all corrections as new profiles in this method, we simplify the process by applying all changes in the source system directly to the target. Each change -- regardless of whether it is a fix to a previous error -- is applied as a new change that creates a new profile. This incorrectly shows in the target that two changes occurred to the source information when, in reality, a mistake was entered on the first change and should be reflected in the first profile. The second profile should not have been created.

2. The second method updates the first profile created on 1/5/2000 until all fields are corrected on 1/15/2000, which loses the profile record for the change to Field 3.

If we try to apply changes to the existing profile, as in this method, we run the risk of losing profile information. If the third field changes before the second field is fixed, we show the third field changed at the same time as the first. When the second field was fixed, it would also be added to the existing profile, which incorrectly reflects the changes in the source system.

3. The third method creates only two new profiles, but then causes an update to the profile records on 1/15/2000 to fix the Field 2 value in both.

Date      | Profile Date       | Field 1 Value | Field 2 Value | Field 3 Value
1/1/2000  | 1/1/2000           | Closed Sunday | Black         | Open 9 – 5
1/5/2000  | 1/5/2000           | Open Sunday   | Black         | Open 9 – 5
1/10/2000 | 1/10/2000          | Open Sunday   | Black         | Open 24hrs
1/15/2000 | 1/5/2000 (Update)  | Open Sunday   | Red           | Open 9 – 5
1/15/2000 | 1/10/2000 (Update) | Open Sunday   | Red           | Open 24hrs

If we try to implement a method that updates old profiles when errors are fixed, as in this option, we need to create complex algorithms that handle the process correctly. It involves being able to determine when an error occurred and examining all profiles generated since then and updating them appropriately. And, even if we create the algorithms to handle these methods, we still have an issue of determining if a value is a correction or a new value. If an error is never fixed in the source system, but a new value is entered, we would identify it as a previous error, causing an automated process to update old profile records, when in reality a new profile record should have been entered.

Recommended Method

A method exists to track old errors so that we know when a value was rejected. Then, when the process encounters a new, correct value it flags it as part of the load strategy as a potential fix that should be applied to old Profile records. In this way, the corrected data enters the target as a new Profile record, but the process of fixing old Profile records, and potentially deleting the newly inserted record, is delayed until the data is examined and an action is decided. Once an action is decided, another process examines the existing Profile records and corrects them as necessary. This method only delays the As-Was analysis of the data until the correction method is determined because the current information is reflected in the new Profile.

Data Quality Edits

Quality indicators can be used to record definitive statements regarding the quality of the data received and stored in the target. The indicators can be appended to existing data tables or stored in a separate table linked by the primary key. Quality indicators can be used to:

● Show the record and field level quality associated with a given record at the time of extract.
● Identify data sources and errors encountered in specific records.
● Support the resolution of specific record error types via an update and resubmission process.

Quality indicators can be used to record several types of errors – e.g., fatal errors (missing primary key value), missing data in a required field, wrong data type/format, or invalid data value. If a record contains even one error, data quality (DQ) fields will be appended to the end of the record, one field for every field in the record. A data quality indicator code is included in the DQ fields corresponding to the original fields in the record where the errors were encountered. Records containing a fatal error are stored in a Rejected Record Table and associated to the original file name and record number. These records cannot be loaded to the target because they lack a primary key field to be used as a unique record identifier in the target.

The following types of errors cannot be processed:

● A source record does not contain a valid key. This record would be sent to a reject queue. Metadata will be saved and used to generate a notice to the sending system indicating that x number of invalid records were received and could not be processed. However, in the absence of a primary key, no tracking is possible to determine whether the invalid record has been replaced or not.

● The source file or record is illegible. The file or record would be sent to a reject queue. Metadata indicating that x number of invalid records were received and could not be processed may or may not be available for a general notice to be sent to the sending system. In this case, due to the nature of the error, no tracking is possible to determine whether the invalid record has been replaced or not. If the file or record is illegible, it is likely that individual unique records within the file are not identifiable. While information can be provided to the source system site indicating there are file errors for x number of records, specific problems may not be identifiable on a record-by-record basis.

In these error types, the records can be processed, but they contain errors:

● A required (non-key) field is missing.
● The value in a numeric or date field is non-numeric.
● The value in a field does not fall within the range of acceptable values identified for the field. Typically, a reference table is used for this validation.

When an error is detected during ingest and cleansing, the identified error type is recorded.

Quality Indicators (Quality Code Table)

The requirement to validate virtually every data element received from the source data systems mandates the development, implementation, capture and maintenance of quality indicators. These are used to indicate the quality of incoming data at an elemental level. Aggregated and analyzed over time, these indicators provide the information necessary to identify acute data quality problems, systemic issues, business process problems and information technology breakdowns.

The quality indicators: “0”-No Error, “1”-Fatal Error, “2”-Missing Data from a Required Field, “3”-Wrong Data Type/Format, “4”-Invalid Data Value and “5”-Outdated Reference Table in Use, apply a concise indication of the quality of the data within specific fields for every data type. These indicators provide the opportunity for operations staff, data quality analysts and users to readily identify issues potentially impacting the quality of the data. At the same time, these indicators provide the level of detail necessary for acute quality problems to be remedied in a timely manner.
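One way to apply this scheme is to evaluate each field against its rules and append one indicator code per field to the record, as described earlier. The sketch below is illustrative; the field rules, reference values, and record layout are hypothetical.

# Minimal sketch: assign one data quality indicator per field using the codes above.
# The field rules and the record layout shown are hypothetical.
NO_ERROR, FATAL, MISSING_REQUIRED, WRONG_TYPE, INVALID_VALUE, OUTDATED_REF = range(6)

VALID_STATUS = {"OFFICE", "STORE", "WAREHSE"}  # hypothetical reference values

def quality_code(field_name, value):
    if field_name == "CUST_ID" and not value:
        return FATAL                      # missing primary key
    if field_name == "NAME" and not value:
        return MISSING_REQUIRED
    if field_name == "DOB":
        try:
            int(value.replace("-", ""))   # crude date-format check
        except (AttributeError, ValueError):
            return WRONG_TYPE
    if field_name == "STATUS" and value not in VALID_STATUS:
        return INVALID_VALUE
    return NO_ERROR

def append_dq_fields(record):
    """Return the record with one DQ indicator appended per original field."""
    return dict(record, **{f"DQ_{k}": quality_code(k, v) for k, v in record.items()})

if __name__ == "__main__":
    rec = {"CUST_ID": "42", "NAME": "", "DOB": "19-XX-02", "STATUS": "SHOP"}
    print(append_dq_fields(rec))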

Handling Data Errors

The need to periodically correct data in the target is inevitable. But how often should these corrections be performed?

The correction process can be as simple as updating field information to reflect actual values, or as complex as deleting data from the target, restoring previous loads from tape, and then reloading the information correctly. Although we try to avoid performing a complete database restore and reload from a previous point in time, we cannot rule this out as a possible solution.

Reject Tables vs. Source System

As errors are encountered, they are written to a reject file so that business analysts can examine reports of the data and the related error messages indicating the causes of error. The business needs to decide whether analysts should be allowed to fix data in the reject tables, or whether data fixes will be restricted to source systems. If errors are fixed in the reject tables, the target will not be synchronized with the source systems. This can present credibility problems when trying to track the history of changes in the target data architecture. If all fixes occur in the source systems, then these fixes must be applied correctly to the target data.

Attribute Errors and Default Values


Attributes provide additional descriptive information about a dimension concept. Attributes include things like the color of a product or the address of a store. Attribute errors are typically things like an invalid color or inappropriate characters in the address. These types of errors do not generally affect the aggregated facts and statistics in the target data; the attributes are most useful as qualifiers and filtering criteria for drilling into the data, (e.g. to find specific patterns for market research). Attribute errors can be fixed by waiting for the source system to be corrected and reapplied to the data in the target.

When attribute errors are encountered for a new dimensional value, default values can be assigned to let the new record enter the target. Some rules that have been proposed for handling defaults are as follows:

Value Types      | Description                                       | Default
Reference Values | Attributes that are foreign keys to other tables  | Unknown
Small Value Sets | Y/N indicator fields                              | No
Other            | Any other type of attribute                       | Null or business-provided value

Reference tables are used to normalize the target model to prevent the duplication of data. When a source value does not translate into a reference table value, we use the ‘Unknown’ value. (All reference tables contain a value of ‘Unknown’ for this purpose.)

The business should provide default values for each identified attribute. Fields that are restricted to a limited domain of values (e.g., On/Off or Yes/No indicators), are referred to as small-value sets. When errors are encountered in translating these values, we use the value that represents off or ‘No’ as the default. Other values, like numbers, are handled on a case-by-case basis. In many cases, the data integration process is set to populate ‘Null’ into these fields, which means “undefined” in the target. After a source system value is corrected and passes validation, it is corrected in the target.
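These default rules can be captured as a small lookup keyed on the attribute's value type. The following sketch is illustrative only; the attribute classifications and the business-provided defaults are assumptions.

# Minimal sketch: assign defaults for attributes that fail validation,
# following the rules above. Attribute classifications are hypothetical.
ATTRIBUTE_TYPES = {
    "STORE_TYPE_CODE": "reference",   # foreign key to a reference table
    "OPEN_SUNDAY_IND": "small_set",   # Y/N indicator
    "SQUARE_FOOTAGE":  "other",
}

DEFAULTS = {
    "reference": "Unknown",   # all reference tables carry an 'Unknown' row
    "small_set": "N",         # the value representing off/'No'
    "other":     None,        # Null, or a business-provided value
}

def apply_default(attribute_name, value, is_valid):
    if is_valid:
        return value
    return DEFAULTS[ATTRIBUTE_TYPES.get(attribute_name, "other")]

if __name__ == "__main__":
    print(apply_default("STORE_TYPE_CODE", "??", is_valid=False))     # -> Unknown
    print(apply_default("OPEN_SUNDAY_IND", "Maybe", is_valid=False))  # -> N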

Primary Key Errors

The business also needs to decide how to handle new dimensional values such as locations. Problems occur when the new key is actually an update to an old key in the source system. For example, a location number is assigned and the new location is transferred to the target using the normal process; then the location number is changed due to some source business rule such as: all Warehouses should be in the 5000 range. The process assumes that the change in the primary key is actually a new warehouse and that the old warehouse was deleted. This type of error causes a separation of fact data, with some data being attributed to the old primary key and some to the new. An analyst would be unable to get a complete picture.

Fixing this type of error involves integrating the two records in the target data, along with the related facts. Integrating the two rows involves combining the profile information, taking care to coordinate the effective dates of the profiles to sequence properly. If two profile records exist for the same day, then a manual decision is required as to which is correct. If facts were loaded using both primary keys, then the related fact rows must be added together and the originals deleted in order to correct the data.

The situation is more complicated when the opposite condition occurs (i.e., two primary keys mapped to the same target data ID really represent two different IDs). In this case, it is necessary to restore the source information for both dimensions and facts from the point in time at which the error was introduced, deleting affected records from the target and reloading from the restore to correct the errors.

DM Facts Calculated from EDW Dimensions


If information is captured as dimensional data from the source, but used as measures residing on the fact records in the target, we must decide how to handle the facts. From a data accuracy view, we would like to reject the fact until the value is corrected. If we load the facts with the incorrect data, the process to fix the target can be time consuming and difficult to implement.

If we let the facts enter downstream target structures, we need to create processes that update them after the dimensional data is fixed. If we reject the facts when these types of errors are encountered, the fix process becomes simpler. After the errors are fixed, the affected rows can simply be loaded and applied to the target data.

Fact Errors

If there are no business rules that reject fact records except for relationship errors to dimensional data, then when we encounter errors that would cause a fact to be rejected, we save these rows to a reject table for reprocessing the following night. This nightly reprocessing continues until the data successfully enters the target data structures. Initial and periodic analyses should be performed on the errors to determine why they are not being loaded.

Data Stewards

Data Stewards are generally responsible for maintaining reference tables and translation tables, creating new entities in dimensional data, and designating one primary data source when multiple sources exist. Reference data and translation tables enable the target data architecture to maintain consistent descriptions across multiple source systems, regardless of how the source system stores the data. New entities in dimensional data include new locations, products, hierarchies, etc. Multiple source data occurs when two source systems can contain different data for the same dimensional entity.

Reference Tables

The target data architecture may use reference tables to maintain consistent descriptions. Each table contains a short code value as a primary key and a long description for reporting purposes. A translation table is associated with each reference table to map the codes to the source system values. Using both of these tables, the ETL process can load data from the source systems into the target structures.

The translation tables contain one or more rows for each source value and map the value to a matching row in the reference table. For example, the SOURCE column in FILE X on System X can contain ‘O’, ‘S’ or ‘W’. The data steward would be responsible for entering in the translation table the following values:

Source Value Code Translation

O OFFICE

S STORE

W WAREHSE

These values are used by the data integration process to correctly load the target. Other source systems that maintain a similar field may use a two-letter abbreviation like ‘OF’, ‘ST’ and ‘WH’. The data steward would make the following entries into the translation table to maintain consistency across systems:

Source Value Code Translation


OF OFFICE

ST STORE

WH WAREHSE

The data stewards are also responsible for maintaining the reference table that translates the codes into descriptions. The ETL process uses the reference table to populate the following values into the target:

Code Translation Code Description

OFFICE Office

STORE Retail Store

WAREHSE Distribution Warehouse
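Conceptually, the load resolves each source code in two steps: translation table first, then reference table. The sketch below illustrates the idea using the values above held in in-memory dictionaries; in practice both tables live in the target database and are maintained by the data stewards, and the source system names shown are hypothetical.

# Minimal sketch: resolve source system values through the translation and
# reference tables described above. In-memory dictionaries stand in for the tables.
TRANSLATION = {  # (source system, source value) -> code
    ("SYSTEM_X", "O"): "OFFICE", ("SYSTEM_X", "S"): "STORE", ("SYSTEM_X", "W"): "WAREHSE",
    ("SYSTEM_Y", "OF"): "OFFICE", ("SYSTEM_Y", "ST"): "STORE", ("SYSTEM_Y", "WH"): "WAREHSE",
}

REFERENCE = {  # code -> description
    "OFFICE": "Office", "STORE": "Retail Store", "WAREHSE": "Distribution Warehouse",
    "UNKNOWN": "Unknown",
}

def resolve(source_system, source_value):
    code = TRANSLATION.get((source_system, source_value), "UNKNOWN")
    return code, REFERENCE[code]

if __name__ == "__main__":
    print(resolve("SYSTEM_Y", "WH"))   # ('WAREHSE', 'Distribution Warehouse')
    print(resolve("SYSTEM_X", "Z"))    # ('UNKNOWN', 'Unknown')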

Error handling is required when the data steward enters incorrect information for these mappings and needs to correct them after data has been loaded. Correcting the above example could be complex (e.g., if the data steward entered ST as translating to OFFICE by mistake). The only way to determine which rows should be changed is to restore and reload source data from the first time the mistake was entered. Processes should be built to handle these types of situations, including correction of the entire target data architecture.

Dimensional Data

New entities in dimensional data present a more complex issue. New entities in the target may include Locations and Products, at a minimum. Dimensional data uses the same concept of translation as reference tables. These translation tables map the source system value to the target value. For location, this is straightforward, but over time, products may have multiple source system values that map to the same product in the target. (Other similar translation issues may also exist, but Products serves as a good example for error handling.)

There are two possible methods for loading new dimensional entities. Either require the data steward to enter the translation data before allowing the dimensional data into the target, or create the translation data through the ETL process and force the data steward to review it. The first option requires the data steward to create the translation for new entities, while the second lets the ETL process create the translation, but marks the record as ‘Pending Verification’ until the data steward reviews it and changes the status to ‘Verified’ before any facts that reference it can be loaded.

When the dimensional value is left as ‘Pending Verification’ however, facts may be rejected or allocated to dummy values. This requires the data stewards to review the status of new values on a daily basis. A potential solution to this issue is to generate an email each night if there are any translation table entries pending verification. The data steward then opens a report that lists them.
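Such a nightly check can be a very small job. The sketch below is illustrative, assuming a hypothetical PRODUCT_XLATE translation table with a STATUS column and placeholder mail settings.

# Minimal sketch: nightly notification of translation entries pending verification.
# PRODUCT_XLATE, its STATUS column, and the mail settings are hypothetical.
import smtplib
import sqlite3
from email.message import EmailMessage

def pending_count(conn):
    cur = conn.execute(
        "SELECT COUNT(*) FROM PRODUCT_XLATE WHERE STATUS = 'Pending Verification'"
    )
    return cur.fetchone()[0]

if __name__ == "__main__":
    pending = pending_count(sqlite3.connect("target.db"))
    if pending:
        msg = EmailMessage()
        msg["Subject"] = f"{pending} translation entries pending verification"
        msg["From"] = "etl-monitor@example.com"
        msg["To"] = "data-stewards@example.com"
        msg.set_content("Please review the pending entries in the steward report.")
        with smtplib.SMTP("mailhost.example.com") as smtp:
            smtp.send_message(msg)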

A problem specific to Product is that when it is created as new, it is really just a changed SKU number. This causes additional fact rows to be created, which produces an inaccurate view of the product when reporting. When this is fixed, the fact rows for the various SKU numbers need to be merged and the original rows deleted. Profiles would also have to be merged, requiring manual intervention.

The situation is more complicated when the opposite condition occurs (i.e., two products are mapped to the same product, but really represent two different products). In this case, it is necessary to restore the source information for all loads since the error was introduced. Affected records from the target should be deleted and then reloaded from the restore to correctly split the data. Facts should be split to allocate the information correctly and dimensions split to generate correct profile information.

Manual Updates

Over time, any system is likely to encounter errors that are not correctable using source systems. A method needs to be established for manually entering fixed data and applying it correctly to the entire target data architecture, including beginning and ending effective dates. These dates are useful for both profile and date event fixes. Further, a log of these fixes should be maintained to enable identifying the source of the fixes as manual rather than part of the normal load process.

Multiple Sources

The data stewards are also involved when multiple sources exist for the same data. This occurs when two sources contain subsets of the required information. For example, one system may contain Warehouse and Store information while another contains Store and Hub information. Because they share Store information, it is difficult to decide which source contains the correct information.

When this happens, both sources have the ability to update the same row in the target. If both sources are allowed to update the shared information, data accuracy and profile problems are likely to occur. If we update the shared information on only one source system, the two systems then contain different information. If the changed system is loaded into the target, it creates a new profile indicating the information changed. When the second system is loaded, it compares its old unchanged value to the new profile, assumes a change occurred and creates another new profile with the old, unchanged value. If the two systems remain different, the process causes two profiles to be loaded every day until the two source systems are synchronized with the same information.

To avoid this type of situation, the business analysts and developers need to designate, at a field level, the source that should be considered primary for the field. Then, only if the field changes on the primary source would it be changed. While this sounds simple, it requires complex logic when creating Profiles, because multiple sources can provide information toward the one profile record created for that day.

One solution to this problem is to develop a system of record for all sources. This allows developers to pull the information from the system of record, knowing that there are no conflicts for multiple sources. Another solution is to indicate, at the field level, a primary source where information can be shared from multiple sources. Developers can use the field level information to update only the fields that are marked as primary. However, this requires additional effort by the data stewards to mark the correct source fields as primary and by the data integration team to customize the load process.

Last updated: 01-Feb-07 18:53


Error Handling Techniques - PowerCenter Mappings

Challenge

Identifying and capturing data errors using a mapping approach, and making such errors available for further processing or correction.

Description

Identifying errors and creating an error handling strategy is an essential part of a data integration project. In the production environment, data must be checked and validated prior to entry into the target system. One strategy for catching data errors is to use PowerCenter mappings and error logging capabilities to catch specific data validation errors and unexpected transformation or database constraint errors.

Data Validation Errors

The first step in using a mapping to trap data validation errors is to understand and identify the error handling requirements.

Consider the following questions:

• What types of data errors are likely to be encountered?
• Of these errors, which ones should be captured?
• What process can capture the possible errors?
• Should errors be captured before they have a chance to be written to the target database?
• Will any of these errors need to be reloaded or corrected?
• How will the users know if errors are encountered?
• How will the errors be stored?
• Should descriptions be assigned for individual errors?
• Can a table be designed to store captured errors and the error descriptions?

Capturing data errors within a mapping and re-routing these errors to an error table facilitates analysis by end users and improves performance. One practical application of the mapping approach is to capture foreign key constraint errors (e.g., executing a lookup on a dimension table prior to loading a fact table). Referential integrity is assured by including this sort of functionality in a mapping. While the database still enforces the foreign key constraints, erroneous data is not written to the target table; constraint errors are captured within the mapping so that the PowerCenter server does not have to write them to the session log and the reject/bad file, thus improving performance.

Data content errors can also be captured in a mapping. Mapping logic can identify content errors and attach descriptions to them. This approach can be effective for many types of data content error, including: date conversion, null values intended for not null target fields, and incorrect data formats or data types.

Sample Mapping Approach for Data Validation Errors

In the following example, customer data is to be checked to ensure that invalid null values are intercepted before being written to not null columns in a target CUSTOMER table. Once a null value is identified, the row containing the error is to be separated from the data flow and logged in an error table. One solution is to implement a mapping similar to the one shown below:

An expression transformation can be employed to validate the source data, applying rules and flagging records with one or more errors.

A router transformation can then separate valid rows from those containing the errors. It is good practice to append error rows with a unique key; this can be a composite consisting of a MAPPING_ID and ROW_ID, for example. The MAPPING_ID would refer to the mapping name and the ROW_ID would be created by a sequence generator.

The composite key is designed to allow developers to trace rows written to the error tables that store information useful for error reporting and investigation. In this example, two error tables are suggested, namely: CUSTOMER_ERR and ERR_DESC_TBL.

The table ERR_DESC_TBL, is designed to hold information about the error, such as the mapping name, the ROW_ID, and the error description. This table can be used to hold all data validation error descriptions for all mappings, giving a single point of reference for reporting.

The CUSTOMER_ERR table can be an exact copy of the target CUSTOMER table appended with two additional columns: ROW_ID and MAPPING_ID. These columns allow the two error tables to be joined. The CUSTOMER_ERR table stores the entire row that was rejected, enabling the user to trace the error rows back to the source and potentially build mappings to reprocess them.

The mapping logic must assign a unique description for each error in the rejected row. In this example, any null value intended for a not null target field could generate an error message such as ‘NAME is NULL’ or ‘DOB is NULL’. This step can be done in an expression transformation (e.g., EXP_VALIDATION in the sample mapping).

After the field descriptions are assigned, the error row can be split into several rows, one for each possible error, using a normalizer transformation. After a single source row is normalized, the resulting rows can be filtered to leave only errors that are present (i.e., each record can have zero to many errors). For example, if a row has three errors, three error rows would be generated with appropriate error descriptions (ERROR_DESC) in the table ERR_DESC_TBL.

The following table shows how the error data produced may look.

Table Name: CUSTOMER_ERR

NAME | DOB  | ADDRESS | ROW_ID | MAPPING_ID
NULL | NULL | NULL    | 1      | DIM_LOAD

Table Name: ERR_DESC_TBL

FOLDER_NAME | MAPPING_ID | ROW_ID | ERROR_DESC      | LOAD_DATE  | SOURCE      | TARGET
CUST        | DIM_LOAD   | 1      | Name is NULL    | 10/11/2006 | CUSTOMER_FF | CUSTOMER
CUST        | DIM_LOAD   | 1      | DOB is NULL     | 10/11/2006 | CUSTOMER_FF | CUSTOMER
CUST        | DIM_LOAD   | 1      | Address is NULL | 10/11/2006 | CUSTOMER_FF | CUSTOMER

The efficiency of a mapping approach can be increased by employing reusable objects. Common logic should be placed in mapplets, which can be shared by multiple mappings. This improves productivity in implementing and managing the capture of data validation errors.

Data validation error handling can be extended by including mapping logic to grade error severity. For example, flagging data validation errors as ‘soft’ or ‘hard’.

• A ‘hard’ error can be defined as one that would fail when being written to the database, such as a constraint error.

• A ‘soft’ error can be defined as a data content error.

A record flagged as ‘hard’ can be filtered from the target and written to the error tables, while a record flagged as ‘soft’ can be written to both the target system and the error tables. This gives business analysts an opportunity to evaluate and correct data imperfections while still allowing the records to be processed for end-user reporting.
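The flagging, grading, and normalizing logic described above can be prototyped outside PowerCenter to agree the rules with the business before the mapping is built. The sketch below is illustrative only: it flags null violations, assigns error descriptions and a soft/hard severity, and emits one error row per failed field. The field rules and severities shown are assumptions.

# Minimal sketch of the validation pattern described above: flag errors per field,
# grade them as soft or hard, and emit one error row per failure.
# Field rules, severities, MAPPING_ID and ROW_ID handling are illustrative placeholders.
from itertools import count

NOT_NULL_FIELDS = {"NAME": "hard", "DOB": "soft", "ADDRESS": "soft"}
row_id_seq = count(1)  # stands in for the sequence generator transformation

def validate(record, mapping_id="DIM_LOAD"):
    errors = [
        {"MAPPING_ID": mapping_id, "FIELD": field,
         "ERROR_DESC": f"{field} is NULL", "SEVERITY": severity}
        for field, severity in NOT_NULL_FIELDS.items()
        if record.get(field) in (None, "")
    ]
    if errors:
        row_id = next(row_id_seq)
        for err in errors:
            err["ROW_ID"] = row_id
    load_to_target = not any(e["SEVERITY"] == "hard" for e in errors)
    return load_to_target, errors

if __name__ == "__main__":
    ok, errs = validate({"NAME": None, "DOB": "", "ADDRESS": "1 Main St"})
    print(ok)      # False: a 'hard' error keeps the row out of the target
    print(errs)    # one error row per failed field, as in ERR_DESC_TBL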

Ultimately, business organizations need to decide if the analysts should fix the data in the reject table or in the source systems. The advantage of the mapping approach is that all errors are identified as either data errors or constraint errors and can be properly addressed. The mapping approach also reports errors based on projects or categories by identifying the mappings that contain errors. The most important aspect of the mapping approach however, is its flexibility. Once an error type is identified, the error handling logic can be placed anywhere within a mapping. By using the mapping approach to capture identified errors, the operations team can effectively communicate data quality issues to the business users.

Constraint and Transformation Errors

Perfect data can never be guaranteed. In implementing the mapping approach described above to detect errors and log them to an error table, how can we handle unexpected errors that arise in the load? For example, PowerCenter may apply the validated data to the database; however the relational database management system (RDBMS) may reject it for some unexpected reason. An RDBMS may, for example, reject data if constraints are violated. Ideally, we would like to detect these database-level errors automatically and send them to the same error table used to store the soft errors caught by the mapping approach described above.

In some cases, the ‘stop on errors’ session property can be set to ‘1’ to stop source data for which unhandled errors were encountered from being loaded. In this case, the process will stop with a failure, the data must be corrected, and the entire source may need to be reloaded or recovered. This is not always an acceptable approach.

An alternative might be to have the load process continue in the event of records being rejected, and then reprocess only the records that were found to be in error. This can be achieved by configuring the ‘stop on errors’ property to 0 and switching on relational error logging for a session. By default, the error-messages from the RDBMS and any un-caught transformation errors are sent to the session log. Switching on relational error logging redirects these messages to a selected database in which four tables are automatically created: PMERR_MSG, PMERR_DATA, PMERR_TRANS and PMERR_SESS.

The PowerCenter Workflow Administration Guide contains detailed information on the structure of these tables. However, the PMERR_MSG table stores the error messages that were encountered in a session. The following four columns of this table allow us to retrieve any RDBMS errors:

• SESS_INST_ID: A unique identifier for the session. Joining this table with the Metadata Exchange (MX) View REP_LOAD_SESSIONS in the repository allows the MAPPING_ID to be retrieved.

• TRANS_NAME: Name of the transformation where an error occurred. When a RDBMS error occurs, this is the name of the target transformation.

• TRANS_ROW_ID: Specifies the row ID generated by the last active source. This field contains the row number at the target when the error occurred.

• ERROR_MSG: Error message generated by the RDBMS

With this information, all RDBMS errors can be extracted and stored in an applicable error table. A post-load session (i.e., an additional PowerCenter session) can be implemented to read the PMERR_MSG table, join it with the MX View REP_LOAD_SESSION in the repository, and insert the error details into ERR_DESC_TBL. When the post process ends, ERR_DESC_TBL will contain both ‘soft’ errors and ‘hard’ errors.
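In essence, the post-load session joins PMERR_MSG to the REP_LOAD_SESSIONS view and inserts the result into ERR_DESC_TBL. The following is a minimal sketch of that idea only; the join columns and view column names shown are assumptions that should be verified against the repository version in use.

# Minimal sketch of the post-load step: pull RDBMS/transformation errors from the
# relational error log and store them alongside the 'soft' errors.
# Join columns and the exact view layout are assumptions to verify per repository.
import sqlite3  # stand-in for the repository/error database connection

POST_LOAD_SQL = """
INSERT INTO ERR_DESC_TBL (MAPPING_ID, ROW_ID, ERROR_DESC, LOAD_DATE)
SELECT ls.MAPPING_NAME,
       em.TRANS_ROW_ID,
       em.ERROR_MSG,
       DATE('now')
FROM   PMERR_MSG em
JOIN   REP_LOAD_SESSIONS ls
       ON em.SESS_INST_ID = ls.SESSION_INSTANCE_ID
"""

if __name__ == "__main__":
    conn = sqlite3.connect("error_metadata.db")
    conn.execute(POST_LOAD_SQL)
    conn.commit()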

One problem with capturing RDBMS errors in this way is mapping them to the relevant source key to provide lineage. This can be difficult when the source and target rows are not directly related (i.e., one source row can actually result in zero or more rows at the target). In this case, the mapping that loads the source must write translation data to a staging table (including the source key and target row number). The translation table can then be used by the post-load session to identify the source key by the target row number retrieved from the error log. The source key stored in the translation table could be a row number in the case of a flat file, or a primary key in the case of a relational data source.
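As a hedged illustration of that lookup, the following sketch resolves source keys through a hypothetical ROW_TRANSLATION staging table written by the load mapping. All table and column names here are assumptions chosen for the example, not PowerCenter-defined structures.

import pyodbc

# Assumed ODBC DSN for the database holding both the error log and the staging table.
conn = pyodbc.connect("DSN=err_log_db")
cur = conn.cursor()

# ROW_TRANSLATION is the hypothetical staging table holding the source key and the
# row number the mapping produced at the target for that source row.
cur.execute("""
    SELECT t.SOURCE_KEY,
           e.ERROR_MSG
    FROM   PMERR_MSG e
    JOIN   ROW_TRANSLATION t
           ON  t.TARGET_ROW_NO = e.TRANS_ROW_ID
           AND t.SESS_INST_ID  = e.SESS_INST_ID
""")

for source_key, error_msg in cur.fetchall():
    print(source_key, error_msg)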

Reprocessing

After the load and post-load sessions are complete, the error table (e.g., ERR_DESC_TBL) can be analyzed by members of the business or operational teams. The rows listed in this table have not been loaded into the target database. The operations team can, therefore, fix the data in the source that resulted in ‘soft’ errors and may be able to explain and remediate the ‘hard’ errors.

Once the errors have been fixed, the source data can be reloaded. Ideally, only the rows resulting in errors during the first run should be reprocessed in the reload. This can be achieved by including a filter and a lookup in the original load mapping and using a parameter to configure the mapping for an initial load or for a reprocess load. If the mapping is reprocessing, the lookup searches for each source row number in the error table, while the filter removes source rows for which the lookup has not found errors. If initial loading, all rows are passed through the filter, validated, and loaded.

With this approach, the same mapping can be used for initial and reprocess loads. During a reprocess run, the records successfully loaded should be deleted (or marked for deletion) from the error table, while any new errors encountered should be inserted as if an initial run. On completion, the post-load process is executed to capture any new RDBMS errors. This ensures that reprocessing loads are repeatable and result in reducing numbers of records in the error table over time.
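To make the initial-versus-reprocess decision concrete, here is a minimal Python sketch of the filter logic described above. The load-type parameter name and the error-key set are assumptions for illustration; in PowerCenter this would be implemented with a mapping parameter, a Lookup transformation, and a Filter transformation rather than Python.

def should_process(source_key, load_type, error_keys):
    # Initial load: every source row is validated and loaded.
    if load_type == "INITIAL":
        return True
    # Reprocess load: only rows previously recorded in the error table pass through.
    return source_key in error_keys

# error_keys would come from a lookup on the error table (e.g., ERR_DESC_TBL),
# keyed on the source row number (flat file) or primary key (relational source).
print(should_process(1002, "REPROCESS", {1001, 1002, 1003}))  # True
print(should_process(2005, "REPROCESS", {1001, 1002, 1003}))  # False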

Last updated: 01-Feb-07 18:53


Error Handling Techniques - PowerCenter Workflows and Data Analyzer

Challenge

Implementing an efficient strategy to identify different types of errors in the ETL process, correct the errors, and reprocess the corrected data.

Description

Identifying errors and creating an error handling strategy is an essential part of a data warehousing project. The errors in an ETL process can be broadly categorized into two types: data errors in the load process, which are defined by the standards of acceptable data quality; and process errors, which are driven by the stability of the process itself.

The first step in implementing an error handling strategy is to understand and define the error handling requirement. Consider the following questions:

● What tools and methods can help in detecting all the possible errors?
● What tools and methods can help in correcting the errors?
● What is the best way to reconcile data across multiple systems?
● Where and how will the errors be stored? (i.e., relational tables or flat files)

A robust error handling strategy can be implemented using PowerCenter’s built-in error handling capabilities along with Data Analyzer as follows:

● Process Errors: Configure an email task to notify the PowerCenter Administrator immediately of any process failures.

● Data Errors: Set up the ETL process to:

❍ Use the Row Error Logging feature in PowerCenter to capture data errors in the PowerCenter error tables for analysis, correction, and reprocessing.

❍ Set up Data Analyzer alerts to notify the PowerCenter Administrator in the event of any rejected rows.

❍ Set up customized Data Analyzer reports and dashboards at the project level to provide information on failed sessions, sessions with failed rows, load time, etc.

Configuring an Email Task to Handle Process Failures

Configure all workflows to send an email to the PowerCenter Administrator, or any other designated recipient, in the event of a session failure. Create a reusable email task and use it in the “On Failure Email” property settings in the Components tab of the session, as shown in the following figure.


When you configure the subject and body of a post-session email, use email variables to include information about the session run, such as session name, mapping name, status, total number of records loaded, and total number of records rejected. The following table lists all the available email variables:

Email Variables for Post-Session Email

Email Variable Description

%s Session name.

%e Session status.

%b Session start time.

%c Session completion time.

%i Session elapsed time (session completion time-session start time).

%l Total rows loaded.

%r Total rows rejected.

%t Source and target table details, including read throughput in bytes per second and write throughput in rows per second. The PowerCenter Server includes all information displayed in the session detail dialog box.

%m Name of the mapping used in the session.


%n Name of the folder containing the session.

%d Name of the repository containing the session.

%g Attach the session log to the message.

%a<filename>

Attach the named file. The file must be local to the PowerCenter Server. The following are valid file names: %a<c:\data\sales.txt> or %a</users/john/data/sales.txt>.

Note: The file name cannot include the greater than character (>) or a line break.

Note: The PowerCenter Server ignores %a, %g, or %t when you include them in the email subject. Include these variables in the email message only.
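For example, a failure-notification email might be configured along these lines (the wording is illustrative; note that %g and %t belong in the body, not the subject):

Subject: Session %s failed in folder %n
Body:
Session %s finished with status %e.
Started: %b   Completed: %c   Elapsed: %i
Rows loaded: %l   Rows rejected: %r
Mapping: %m   Repository: %d
%g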

Configuring Row Error Logging in PowerCenter

PowerCenter provides you with a set of four centralized error tables into which all data errors can be logged. Using these tables to capture data errors greatly reduces the time and effort required to implement an error handling strategy when compared with a custom error handling solution.

When you configure a session, you can choose to log row errors in this central location. When a row error occurs, the PowerCenter Server logs error information that allows you to determine the cause and source of the error. The PowerCenter Server logs information such as source name, row ID, current row data, transformation, timestamp, error code, error message, repository name, folder name, session name, and mapping information. This error metadata is logged for all row-level errors, including database errors, transformation errors, and errors raised through the ERROR() function, such as business rule violations.

Logging row errors into relational tables rather than flat files enables you to report on and fix the errors easily. When you enable error logging and choose the ‘Relational Database’ Error Log Type, the PowerCenter Server offers you the following features:

● Generates the following tables to help you track row errors:

❍ PMERR_DATA. Stores data and metadata about a transformation row error and its corresponding source row.

❍ PMERR_MSG. Stores metadata about an error and the error message.
❍ PMERR_SESS. Stores metadata about the session.
❍ PMERR_TRANS. Stores metadata about the source and transformation ports, such as name and datatype, when a transformation error occurs.

● Appends error data to the same tables cumulatively, if they already exist, for further runs of the session.

● Allows you to specify a prefix for the error tables. For instance, if you want all your EDW session errors to go to one set of error tables, you can specify the prefix as ‘EDW_’.

● Allows you to collect row errors from multiple sessions in a centralized set of four error tables. To do this, you specify the same error log table name prefix for all sessions.

Example:

In the following figure, the session ‘s_m_Load_Customer’ loads Customer Data into the EDW Customer table. The Customer Table in EDW has the following structure:

CUSTOMER_ID NOT NULL NUMBER (PRIMARY KEY)


CUSTOMER_NAME NULL VARCHAR2(30)

CUSTOMER_STATUS NULL VARCHAR2(10)

There is a primary key constraint on the column CUSTOMER_ID.

To take advantage of PowerCenter’s built-in error handling features, you would set the session properties as shown below:

The session property ‘Error Log Type’ is set to ‘Relational Database’, and ‘Error Log DB Connection’ and ‘Table name Prefix’ values are given accordingly.

When the PowerCenter server detects any rejected rows because of Primary Key Constraint violation, it writes information into the Error Tables as shown below:

EDW_PMERR_DATA

Columns: WORKFLOW_RUN_ID, WORKLET_RUN_ID, SESS_INST_ID, TRANS_NAME, TRANS_ROW_ID, TRANS_ROW_DATA, SOURCE_ROW_ID, SOURCE_ROW_TYPE, SOURCE_ROW_DATA, LINE_NO

Row 1: 8, 0, 3, Customer_Table, 1, D:1001:000000000000|D:Elvis Pres|D:Valid, -1, -1, N/A, 1
Row 2: 8, 0, 3, Customer_Table, 2, D:1002:000000000000|D:James Bond|D:Valid, -1, -1, N/A, 1
Row 3: 8, 0, 3, Customer_Table, 3, D:1003:000000000000|D:Michael Ja|D:Valid, -1, -1, N/A, 1

EDW_PMERR_MSG

Columns: WORKFLOW_RUN_ID, SESS_INST_ID, SESS_START_TIME, REPOSITORY_NAME, FOLDER_NAME, WORKFLOW_NAME, TASK_INST_PATH, MAPPING_NAME, LINE_NO

Row 1: 6, 3, 9/15/2004 18:31, pc711, Folder1, wf_test1, s_m_test1, m_test1, 1
Row 2: 7, 3, 9/15/2004 18:33, pc711, Folder1, wf_test1, s_m_test1, m_test1, 1
Row 3: 8, 3, 9/15/2004 18:34, pc711, Folder1, wf_test1, s_m_test1, m_test1, 1

EDW_PMERR_SESS

Columns: WORKFLOW_RUN_ID, SESS_INST_ID, SESS_START_TIME, REPOSITORY_NAME, FOLDER_NAME, WORKFLOW_NAME, TASK_INST_PATH, MAPPING_NAME, LINE_NO

Row 1: 6, 3, 9/15/2004 18:31, pc711, Folder1, wf_test1, s_m_test1, m_test1, 1
Row 2: 7, 3, 9/15/2004 18:33, pc711, Folder1, wf_test1, s_m_test1, m_test1, 1
Row 3: 8, 3, 9/15/2004 18:34, pc711, Folder1, wf_test1, s_m_test1, m_test1, 1

EDW_PMERR_TRANS

Columns: WORKFLOW_RUN_ID, SESS_INST_ID, TRANS_NAME, TRANS_GROUP, TRANS_ATTR, LINE_NO

Row 1: 8, 3, Customer_Table, Input, "Customer_Id:3, Customer_Name:12, Customer_Status:12", 1

By looking at the workflow run ID and the other fields, you can analyze the rejected rows, fix the underlying problems, and reprocess the data.
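For instance, a quick way to review the rejected rows for a given workflow run is to join the data and session error tables on the run and session identifiers, as in this hedged pyodbc sketch (the ‘EDW_’ prefix matches the session setting above; the DSN is an assumption, so adjust both to your environment):

import pyodbc

conn = pyodbc.connect("DSN=err_log_db")  # assumed DSN for the error-log database
cur = conn.cursor()

# Pull rejected row data together with its session context for one workflow run.
cur.execute("""
    SELECT s.FOLDER_NAME,
           s.WORKFLOW_NAME,
           s.MAPPING_NAME,
           d.TRANS_NAME,
           d.TRANS_ROW_ID,
           d.TRANS_ROW_DATA
    FROM   EDW_PMERR_DATA d
    JOIN   EDW_PMERR_SESS s
           ON  d.WORKFLOW_RUN_ID = s.WORKFLOW_RUN_ID
           AND d.SESS_INST_ID    = s.SESS_INST_ID
    WHERE  d.WORKFLOW_RUN_ID = ?
""", 8)

for row in cur.fetchall():
    print(row)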

Error Detection and Notification using Data Analyzer

Informatica provides Data Analyzer for PowerCenter Repository Reports with every PowerCenter license. Data Analyzer is Informatica’s powerful business intelligence tool that is used to provide insight into the PowerCenter repository metadata.


You can use the Operations Dashboard provided with the repository reports as one central location to gain insight into production environment ETL activities. In addition, the following capabilities of Data Analyzer are recommended best practices:

● Configure alerts to send an email or a pager message to the PowerCenter Administrator whenever there is an entry made into the error tables PMERR_DATA or PMERR_TRANS.

● Configure reports and dashboards to provide detailed session run information grouped by projects/PowerCenter folders for easy analysis.

● Configure reports to provide detailed information on the row-level errors for each session. This can be accomplished by using the four error tables as sources of data for the reports.

Data Reconciliation Using Data Analyzer

Business users often like to see certain metrics matching from one system to another (e.g., source system to ODS, ODS to targets, etc.) to ascertain that the data has been processed accurately. This is frequently accomplished by writing tedious queries, comparing two separately produced reports, or using constructs such as DBLinks.

Upgrading the Data Analyzer license from Repository Reports to a full license enables Data Analyzer to source your company’s data (e.g., source systems, staging areas, ODS, data warehouse, and data marts) and provide a reliable and reusable way to accomplish data reconciliation. Using Data Analyzer’s reporting capabilities, you can select data from various data sources such as ODS, data marts, and data warehouses to compare key reconciliation metrics and numbers through aggregate reports. You can further schedule the reports to run automatically every time the relevant PowerCenter sessions complete, and set up alerts to notify the appropriate business or technical users in case of any discrepancies.

For example, a report can be created to ensure that the same number of customers exists in the ODS as in the data warehouse and/or any downstream data marts. The reconciliation reports should be relevant to a business user by comparing key metrics (e.g., customer counts, aggregated financial metrics, etc.) across data silos. Such reconciliation reports can be run automatically after PowerCenter loads the data, or they can be run on demand by technical or business users. This process allows users to verify the accuracy of data and builds confidence in the data warehouse solution.
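A trivial sketch of such a reconciliation check is shown below. The DSNs and table names (CUSTOMERS, DIM_CUSTOMER) are assumptions standing in for your ODS and warehouse, and in practice the comparison would be surfaced as a Data Analyzer report or alert rather than a script.

import pyodbc

# Hypothetical read-only connections to the ODS and the warehouse.
ods = pyodbc.connect("DSN=ods").cursor()
edw = pyodbc.connect("DSN=edw").cursor()

ods_count = ods.execute("SELECT COUNT(*) FROM CUSTOMERS").fetchone()[0]
edw_count = edw.execute("SELECT COUNT(*) FROM DIM_CUSTOMER").fetchone()[0]

# Flag any discrepancy so the appropriate business or technical users can be notified.
if ods_count != edw_count:
    print(f"Customer count mismatch: ODS={ods_count}, EDW={edw_count}")
else:
    print(f"Customer counts reconcile at {ods_count}")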

Last updated: 09-Feb-07 14:22


Planning the ICC Implementation

Challenge

The first stage in the creation of an Integration Competency Center (ICC) is the selection of the appropriate organizational model for the services to be provided; this process is documented in the Best Practice Selecting the Right ICC Model.

This Best Practice is concerned with planning the construction of the ICC itself. There are several challenges in describing this process in a single document; the resources required obviously vary according to the organizational model selected.

Description

At the end of the ICC selection process, one of the ICC organizational models described in that Best Practice will have been selected for implementation.

The following stages need to be undertaken to implement the ICC:

● Select the initial project
● Identify the resource needs
● Establish the 30/60/90-day plan


Choosing Projects for the ICC

The first step is the considered selection of a pilot project for the ICC; it may be advisable to choose a moderately challenging project so that the ICC can build on success. However, the project that is selected should be representative of the projects to be undertaken in the first year of the ICC’s existence.

Another criterion for the selection of the first project is the contribution to the ICC resource pool. The deliverables created by the first projects can serve as examples and templates for the processes to be adopted by the ICC as standards. Documents that fall into this category include naming standards, sizing guides, performance tuning guidelines, deployment processes, low-level design documents, project plans, etc. Supplementary resources can be found in the Velocity sample deliverables on Informatica’s customer portal at http://my.informatica.com.

Identify the Resource Needs

The resources required by an ICC fall into two broad categories.

First, a resource is needed to implement an ICC infrastructure and drive organizational change. Typically, one FTE is required to architect, design, and build the technical infrastructure that will support the ICC. This involves architecting the physical hardware (i.e., servers, SAN, network, etc) as well as the software (i.e., PowerCenter, PowerExchange, options, etc) to support any projects that are likely to use the shared resources of the ICC. The resource selected should have the required technical skills for this complex task.

The second type of resource is required to provide whatever development/operational support services are within the remit of the ICC organizational model selected. Once again, the calibre of the staff selected should reflect the importance of achieving success in the ICC’s first projects.

Establishing the 30/60/90/120 Day Plan

The purpose of the 30/60/90 day plan is to ensure that incremental deliverables are accomplished with respect to the implementation of the ICC. This also provides opportunities to enjoy successes at each step of the way and to communicate those successes to the ICC executive sponsor as each milestone is achieved during the 30/60/90 day plan.

It is also important to note that since the central services ICC model is estimated at 6+ months to fully implement, Informatica has provided a category of “120+ Day Plan” for deliverables that may be associated with a central services model, since it is outside of the scope of the 30/60/90 day plan.

30 Day Plan

The following plan outlines the people, process, and technology steps that should occur during the first 30 days of the ICC rollout:


People

● Identify, assemble, and budget for the human resources necessary to support the ICC rollout; typically, one technical FTE is a good place to start.

● Identify, estimate, and budget for the necessary technical resources (e.g., hardware, software). Note: To encourage projects to utilize the ICC model, it can often be effective to provide hardware and software resources without any internal chargeback for the first year of the ICC conception. Alternatively, the hardware and software costs can be funded by the projects that are likely to leverage the ICC.

Processes

● Identify and start planning for the initial projects that will utilize the ICC shared services.

Technology

● Implement a short-term technical infrastructure for the ICC. This includes implementing the hardware and software required to support the initial five projects (or projects within the scope of the first year of the ICC) in both a development and production capacity. Typically, this technical infrastructure is not the end-goal configuration, but it should include a hardware and software configuration that can easily meld into the end-goal configuration. The hardware and software requirements of the short-term technical infrastructure are generally limited to the components required for the projects that will leverage the infrastructure during the first year.

60 Day Plan

People

● Provide the shared resources to support the ongoing projects that are utilizing the ICC shared services in development. These resources need to support the deployment of objects between environments (dev/test/production) and support the monitoring of ongoing production processes (a.k.a. production support).

Processes

● Start building, establishing, and communicating processes that are going to be required to support the ICC. These include:

❍ Naming standards
❍ Code/mapping reviews
❍ Deployment processes
❍ Performance tuning techniques


❍ Detailed design documents

Technology

● Build out additional features into the short-term technical infrastructure that can improve service levels of the ICC and reduce costs. Examples include:

❍ PowerCenter Metadata Reporter
❍ PowerCenter Team Based Development Model
❍ Metadata Manager
❍ Data Profiling and Cleansing options
❍ Various connectivity options, including PowerExchange and PowerCenter Connects

90 Day Plan

People

● Continue to provide production support shared services to projects leveraging the ICC infrastructure.

● Provide training to project teams on additional ICC capabilities available to them (i.e., implemented during the 60 day plan).

Processes

● Finalize and fully communicate all ICC processes (i.e., the processes listed in the 60-day plan).

● Develop a governance plan such that all objects/code developed for projects leveraging the ICC are reviewed by a governing board of architects and senior developers before being migrated into production. This governance ensures that only high-quality projects are placed in production.

● Establish SLAs between the projects leveraging the ICC shared services and the ICC itself.

● Begin work on a chargeback model such that projects that join the ICC after the first year provide an internal transfer of funds to support the ICC based on their usage of the ICC shared services. Typically, chargeback models are based upon CPU utilization used in production by the project on a monthly basis.

❍ PowerCenter 8.1 logs metadata in the repository regarding the amount of CPU used by a particular process. For this reason, PowerCenter 8.1 is a key technology that should be leveraged for ICC implementations.

❍ The ICC chargeback model should be flexible in that the project manager can choose between a number of options for levels of support. For example, projects have different SLA requirements, and a project that requires 24/7 high availability and dedicated hardware should have a different, more expensive, chargeback than a similar project that does not require high availability or dedicated hardware.

❍ The ICC chargeback model should reflect costs that are lower than the costs that the project would otherwise have to pay to various hardware, software, and services vendors if they were to go down the path of a project silo approach. (A rough illustration of a CPU-based chargeback calculation follows this list.)

Technology

● As projects join the ICC that have disaster recovery/failover needs, the appropriate implementation of DR/Failover should be completed for the ICC infrastructure. This usually happens in the first 90 days of the ICC.

120+ Day Plan

Assuming a central services ICC model is chosen, the following plan outlines the people, process, and technology steps that should occur during the first six to nine months of the ICC rollout.

● Implement a long-term technical infrastructure, including both hardware and software. This long-term technical infrastructure can generally provide cost-effective options for horizontal scaling – such as leveraging Informatica’s Enterprise Grid capabilities with a relatively inexpensive hardware platform, such as Linux or Windows.

● Proactively implement additional software components that can be leveraged by ICC customers/projects. Examples include:

❍ High Availability ❍ Enterprise Grid ❍ Unstructured Data Option

● After the initial project successes leveraging the ICC shared services model, establish the ICC as the enterprise standard for all data integration project needs.

● Provide additional chargeback models offering greater flexibility to the ICC customers/projects

● The ICC should expand its services offerings beyond simple development and production support to include shared services resources that can be shared across projects during the development and testing phases of the project. Examples of such resources include Data Architects, Data Modelers, Development resources, and/or QA resources.

● Establish an ICC “Help Desk” that provides 24x7 production support – similar to an “operator” in the mainframe environment.

● Consider negotiating with hardware vendors for more flexible offerings.

Last updated: 09-Feb-07 14:09


Selecting the Right ICC Model

Challenge

With increased pressure on IT productivity, many companies are rethinking the "independence" of data integration projects that has resulted in an inefficient, piecemeal, or silo-based approach to each new project. Furthermore, as each group within a business attempts to integrate its data, it unknowingly duplicates effort the company has already invested -- not just in the data integration itself, but also the effort spent on developing practices, processes, code, and personnel expertise.

An alternative to this expensive redundancy is to create some type of "integration competency center" (ICC). An ICC is an IT approach that provides teams throughout an organization with best practices in integration skills, processes, and technology so that they can complete data integration projects consistently, rapidly, and cost-efficiently.

What types of services should an ICC offer? This Best Practice provides an overview to help you consider the appropriate structure for your ICC.

More information is available in the following publication: Integration Competency Center: An Implementation Methodology by John Schmidt and David Lyle, Copyright 2005 Informatica Corporation.

Description

Objectives

Typical ICC objectives include:

● Promoting data integration as a formal discipline.
● Developing a set of experts with data integration skills and processes, and leveraging their knowledge across the organization.
● Building and developing skills, capabilities, and best practices for integration processes and operations.
● Monitoring, assessing, and selecting integration technology and tools.
● Managing integration pilots.
● Leading and supporting integration projects with the cooperation of subject matter experts.
● Reusing development work such as source definitions, application interfaces, and codified business rules.

Benefits

Although a successful project that shares its lessons with other teams can be a great way to begin developing organizational awareness of the value of an ICC, setting up a more formal ICC requires upper management buy-in and funding. Here are some of the typical benefits that can be realized from doing so:

● Rapid development of in-house expertise through coordinated training and shared knowledge.

● Leverage shared resources and "best practice" methods and solutions.
● More rapid project deployments.
● Higher quality/reduced risk data integration projects.
● Reduced costs of project development and maintenance.

When examining the move toward an ICC model that optimizes and (in certain situations) centralizes integration functions, consider two things: the problems, costs and risks associated with a project silo-based approach, and the potential benefits of an ICC environment.

What Services Should be in an ICC?

The common services provided by ICCs can be divided into four major categories:

● Knowledge Management
● Environment
● Development Support
● Production Support

Having considered the service categories, the appropriate ICC Organizational Model can be selected.

Knowledge Management

Training

● Standards Training (Training Coordinator) - Training of best practices, including but not limited to, naming conventions, unit test plans, configuration management strategy, and project methodology.

● Product Training (Training Coordinator) - Co-ordination of vendor-offered or internally-sponsored training of specific technology products.

Standards

● Standards Development (Knowledge Coordinator) - Creating best practices, including but not limited to, naming conventions, unit test plans, and coding standards.


● Standards Enforcement (Knowledge Coordinator) - Ensuring development teams use documented best practices through formal development reviews, metadata reports, project audits, or other means.

● Methodology (Knowledge Coordinator) - Creating methodologies to support development initiatives. Examples include methodologies for rolling out data warehouses and data integration projects. Typical topics in a methodology include, but are not limited to:

❍ Project Management
❍ Project Estimation
❍ Development Standards
❍ Operational Support

● Mapping Patterns (Knowledge Coordinator) - Developing and maintaining mapping patterns (templates) to speed up development time and promote mapping standards across projects.

Technology

● Emerging Technologies (Technology Leader) - Assessing emerging technologies and determining if/where they fit in the organization and policies around their adoption/use.

● Benchmarking (Technology Leader) - Conducting and documenting tests on hardware and software in the organization to establish performance benchmarks.

Metadata

● Metadata Standards (Metadata Administrator) - Creating standards for capturing and maintaining metadata. For example, database column descriptions can be captured in ErWin and pushed to PowerCenter via Metadata Exchange.

● Metadata Enforcement (Metadata Administrator) - Ensuring development teams conform to documented metadata standards.

● Data Integration Catalog (Metadata Administrator) - Tracking the list of systems involved in data integration efforts, the integration between systems, and the use of/subscription to data integration feeds. This information is critical to managing the interconnections in the environment in order to avoid duplication of integration efforts. The Catalog can also assist in understanding when particular integration feeds are no longer needed.

Environment

Hardware

● Vendor Selection and Management (Vendor Manager) - Selecting vendors for the hardware tools needed for integration efforts that may span Servers, Storage and network facilities.


● Hardware Procurement (Vendor Manager) - Responsible for the purchasing process for hardware items that may include receiving and cataloging the physical hardware items.

● Hardware Architecture (Technical Architect) - Developing and maintaining the physical layout and details of the hardware used to support the Integration Competency Center.

● Hardware Installation (Product Specialist) - Setting up and activating new hardware as it becomes part of the physical architecture supporting the Integration Competency Center.

● Hardware Upgrades (Product Specialist) - Managing the upgrade of hardware, including operating system patches, additional CPU/memory upgrades, replacing old technology, etc.

Software

● Vendor Selection and Management (Vendor Manager) - Selecting vendors for the software tools needed for integration efforts. Activities may include formal RFPs, vendor presentation reviews, software selection criteria, maintenance renewal negotiations and all activities related to managing the software vendor relationship.

● Software Procurement (Vendor Manager) - Responsible for the purchasing process for software packages and licenses.

● Software Architecture (Technical Architect) - Developing and maintaining the architecture of the software package(s) used in the competency center. This may include flowcharts and decision trees of what software to select for specific tasks.

● Software Installation (Product Specialist) - Setting up and installing new software as it becomes part of the physical architecture supporting the Integration Competency Center.

● Software Upgrade (Product Specialist) - Managing the upgrade of software including patches and new releases. Depending on the nature of the upgrade, significant planning and rollout efforts may be required during upgrades. (Training, testing, physical installation on client machines etc.)

● Compliance (Licensing) (Vendor Manager) - Monitoring and ensuring proper licensing compliance across development teams. Formal audits or reviews may be scheduled. Physical documentation should be kept matching installed software with purchased licenses.

Professional Services

● Vendor Selection and Management (Vendor Manager) - Selecting vendors for professional services efforts related to integration efforts. Activities may include managing vendor rates and bulk discount negotiations, payment of vendors, reviewing past vendor work efforts, managing list of "preferred" vendors etc.

● Vendor Qualification (Vendor Manager) - Conducting formal vendor interviews as consultants/ contracts are proposed for projects, checking vendor references and certifications, formally qualifying selected vendors for specific work tasks (i.e., Vendor A is qualified for Java development while Vendor B is qualified for ETL and EAI work.)

Security


● Security Administration (Security Administrator) - Providing access to the tools and technology needed to complete data integration development efforts including software user id's, source system user id/passwords, and overall data security of the integration efforts. Ensures enterprise security processes are followed.

● Disaster Recovery (Technical Architect) - Performing risk analysis in order to develop and execute a plan for disaster recovery including repository backups, off-site backups, failover hardware, notification procedures and other tasks related to a catastrophic failure (i.e., server room fire destroys dev/prod servers).

Financial

● Budget (ICC Manager) - Yearly budget management for the Integration Competency Center. Responsible for managing outlays for services, support, hardware, software and other costs.

● Departmental Cost Allocation (ICC Manager) - For clients where shared services costs are to be spread across departments/business units for cost purposes. Activities include defining the metrics used for cost allocation, reporting on the metrics, and applying cost factors for billing on a weekly, monthly, or quarterly basis as dictated.

Scalability/Availability

● High Availability (Technical Architect) - Designing and implementing hardware, software and procedures to ensure high availability of the data integration environment.

● Capacity Planning (Technical Architect) - Designing and planning for additional integration capacity to address the growth in size and volume of data integration in the future for the organization.

Development Support

Performance

● Performance and Tuning (Product Specialist) - Providing targeted performance and tuning assistance for integration efforts. Providing on-going assessments of load windows and schedules to ensure service level agreements are being met.

Shared Objects

● Shared Object Quality Assurance (Quality Assurance) - Providing quality assurance services for shared objects so that objects conform to standards and do not adversely affect the various projects that may be using them.

● Shared Object Change Management (Change Control Coordinator) - Managing the migration to production of shared objects which may impact multiple project teams. Activities include defining the schedule for production moves, notifying teams of changes, and coordinating the migration of the object to production.


● Shared Object Acceptance (Change Control Coordinator) - Defining and documenting the criteria for a shared object and officially certifying an object as one that will be shared across project teams.

● Shared Object Documentation (Change Control Coordinator) - Defining the standards for documentation of shared objects and maintaining a catalog of all shared objects and their functions.

Project Support

● Development Helpdesk (Data Integration Developer) - Providing a helpdesk of expert product personnel to support project teams. This will provide project teams new to developing data integration routines with a place to turn to for experienced guidance.

● Software/Method Selection (Technical Architect) - Providing a workflow or decision tree to use when deciding which data integration technology to use for a given technology request.

● Requirements Definition (Business/Technical Analyst) - Developing the process to gather and document integration requirements. Depending on the level of service, activity may include assisting or even fully gathering the requirements for the project.

● Project Estimation (Project Manager) - Developing project estimation models and provide estimation assistance for data integration efforts.

● Project Management (Project Manager) - Providing full time management resources experienced in data integration to ensure successful projects.

● Project Architecture Review (Data Integration Architect) - Providing project level architecture review as part of the design process for data integration projects. Helping ensure standards are met and the project architecture fits within the enterprise architecture vision.

● Detailed Design Review (Data Integration Developer) - Reviewing design specifications in detail to ensure conformance to standards and identifying any issues upfront before development work is begun.

● Development Resources (Data Integration Developer) - Providing product-skilled resources for completion of the development efforts.

● Data Profiling (Data Integration Developer) - Providing data profiling services to identify data quality issues. Develop plans for addressing issues found in data profiling.

● Data Quality (Data Integration Developer) - Defining and meeting data quality levels and thresholds for data integration efforts.

Testing

● Unit Testing (Quality Assurance) - Defining and executing unit testing of data integration processes. Deliverables include documented test plans, test cases, and verification against end-user acceptance criteria.

● System Testing (Quality Assurance) - Defining and performing system testing to ensure that data integration efforts work seamlessly across multiple projects and teams.


Cross Project Integration

● Schedule Management/Planning (Data Integration Developer) - Providing a single point for managing load schedules across the physical architecture to make best use of available resources and appropriately handle integration dependencies.

● Impact Analysis (Data Integration Developer) - Providing impact analysis on proposed and scheduled changes that may impact the integration environment. Changes include, but are not limited to, system enhancements, new systems, retirement of old systems, data volume changes, shared object changes, hardware migration and system outages.

Production Support

Issue Resolution

● Operations Helpdesk (Production Operator) - First line of support for operations issues, providing high-level issue resolution. The helpdesk should provide field support for cases and issues related to scheduled jobs, system availability, and other production support tasks.

● Data Validation (Quality Assurance) - Providing data validation on integration load tasks. Data may be "held" from end-user access until some level of data validation has been performed. This may range from manual review of load statistics to automated review of record counts, including grand total comparisons, expected size thresholds, or any other metric an organization may define to catch potential data inconsistencies before they reach end users.

Production Monitoring

● Schedule Monitoring (Production Operator) - Nightly/daily monitoring of the data integration load jobs: ensuring jobs are properly initiated, are not being delayed, and complete successfully. May provide first-level support to the load schedule while escalating issues to the appropriate support teams.

● Operations Metadata Delivery (Production Operator) - Responsible for providing metadata to system owners and end users regarding the production load process including load times, completion status, known issues and other pertinent information regarding the current state of the integration job stream.

Change Management

● Object Migration (Change Control Coordinator) - Coordinating movement of development objects and processes to production. May even physically control migration such that all migration is scheduled, managed, and performed by the ICC.

● Change Control Review (Change Control Coordinator) - Conducting formal and informal reviews of production changes before migration is approved. At this time, standards may be enforced, system tuning reviewed, production schedules updated, and formal sign-off to production changes issued.

● Process Definition (Change Control Coordinator) - Developing and documenting the change management process such that development objects are efficiently and flawlessly migrated into the production environment. This may include notification rules, schedule migration plans, emergency fix procedures, etc.

Choosing an ICC Model

The organizational options for developing multiple integration applications are shown below:

The higher the degree of centralization, the greater the potential cost savings. Some organizations have the flexibility to easily move toward central services, while others don’t – either due to organizational or regulatory constraints. There is no ideal model, just one that is appropriate to the environment in which it operates.

To assist the selection of the appropriate ICC model, the Services described above are mapped to the Organizational Models below:


The adoption of the Central Services model does not necessarily mandate the inclusion of all applications within the orbit of the ICC. Some projects require very specific SLAs (Service Level Agreements) that are much more stringent than other projects, and as such they may require a less stringent ICC model.

Last updated: 09-Feb-07 14:51


Creating Inventories of Reusable Objects & Mappings

Challenge

Successfully identify the need and scope of reusability. Create inventories of reusable objects within a folder, shortcuts across folders (local shortcuts), or shortcuts across repositories (global shortcuts).

Successfully identify and create inventories of mappings based on business rules.

Description

Reusable Objects

Prior to creating an inventory of reusable objects or shortcut objects, be sure to review the business requirements and look for any common routines and/or modules that may appear in more than one data movement. These common routines are excellent candidates for reusable objects or shortcut objects. In PowerCenter, these objects can be created as:

● single transformations (i.e., lookups, filters, etc.)
● a reusable mapping component (i.e., a group of transformations - mapplets)
● single tasks in workflow manager (i.e., command, email, or session)
● a reusable workflow component (i.e., a group of tasks in workflow manager - worklets).

Please note that shortcuts are not supported for workflow level objects (Tasks).

Identify the need for reusable objects based on the following criteria:

● Is there enough usage and complexity to warrant the development of a common object?
● Are the data types of the information passing through the reusable object the same from case to case, or is it simply the same high-level steps with different fields and data?

Identify the scope based on the following criteria:

● Do these objects need to be shared within the same folder? If so, then create reusable objects within the folder.
● Do these objects need to be shared in several other PowerCenter repository folders? If so, then create local shortcuts.
● Do these objects need to be shared across repositories? If so, then create a global repository and maintain these reusable objects in the global repository. Create global shortcuts to these reusable objects from the local repositories.

Note: Shortcuts cannot be created for workflow objects.

PowerCenter Designer Objects:

Creating and testing common objects does not always save development time or facilitate future maintenance. For example, if a simple calculation like subtracting a current rate from a budget rate is going to be used in two different mappings, carefully consider whether the effort to create, test, and document the common object is worthwhile. Often, it is simpler to add the calculation to both mappings. However, if the calculation were to be performed in a number of mappings, if it was very difficult, and if all occurrences would be updated following any change or fix, then the calculation would be an ideal case for a reusable object. When you add instances of a reusable transformation to mappings, be careful that the changes do not invalidate the mapping or generate unexpected data. The Designer stores each reusable transformation as metadata, separate from any mapping that uses the transformation.

The second criterion for a reusable object concerns the data that will pass through the reusable object. Developers often encounter situations where they may perform a certain type of high-level process (i.e., a filter, expression, or update strategy) in two or more mappings. For example, if you have several fact tables that require a series of dimension keys, you can create a mapplet containing a series of lookup transformations to find each dimension key. You can then use the mapplet in each fact table mapping, rather than recreating the same lookup logic in each mapping. This seems like a great candidate for a mapplet. However, after performing half of the mapplet work, the developers may realize that the actual data or ports passing through the high-level logic are totally different from case to case, thus making the use of a mapplet impractical. Consider whether there is a practical way to generalize the common logic so that it can be successfully applied to multiple cases. Remember, when creating a reusable object, the actual object will be replicated in one to many mappings. Thus, in each mapping using the mapplet or reusable transformation object, the same size and number of ports must pass into and out of the mapping/reusable object.

Document the list of the reusable objects that pass this criteria test, providing a high-level description of what each object will accomplish. The detailed design will occur in a future subtask, but at this point the intent is to identify the number and functionality of reusable objects that will be built for the project. Keep in mind that it will be impossible to identify one hundred percent of the reusable objects at this point; the goal here is to create an inventory of as many as possible, and hopefully the most difficult ones. The remainder will be discovered while building the data integration processes.

PowerCenter Workflow Manager Objects:

In some cases, we may have to read data from different sources, pass it through the same transformation logic, and write the data to either one destination database or multiple destination databases. Also, depending on the availability of the source, these loads sometimes have to be scheduled at different times. This is the ideal case for creating a reusable session and doing session overrides at the session instance level for the database connections and pre-session/post-session commands.

Logging load statistics, failure criteria, and success criteria are usually common pieces of code that are executed for multiple loads in most projects. Some of these common tasks include:

● Notification when the number of rows loaded is less than expected
● Notification when there are any reject rows, using email tasks and link conditions
● Successful completion notification based on success criteria, such as the number of rows loaded, using email tasks and link conditions
● Failing the load based on failure criteria, such as load statistics or the status of a critical session, using a control task
● Stopping/aborting a workflow based on failure criteria using a control task
● Based on previous session completion times, calculating the amount of time a downstream session has to wait before it can start, using worklet variables, a timer task, and an assignment task

Reusable worklets can be developed to encapsulate the above-mentioned tasks and can be used in multiple loads. By passing workflow variable values to the worklets and assigning them to worklet variables, one can easily encapsulate common workflow logic.

Mappings

A mapping is a set of source and target definitions linked by transformation objects that define the rules for data transformation. Mappings represent the data flow between sources and targets. In a simple world, a single source table would populate a single target table. However, in practice, this is usually not the case. Sometimes multiple sources of data need to be combined to create a target table, and sometimes a single source of data creates many target tables. The latter is especially true for mainframe data sources where COBOL OCCURS statements litter the landscape. In a typical warehouse or data mart model, each OCCURS statement decomposes to a separate table.

The goal here is to create an inventory of the mappings needed for the project. For this exercise, the challenge is to think in individual components of data movement. While the business may consider a fact table and its three related dimensions as a single ‘object’ in the data mart or warehouse, five mappings may be needed to populate the corresponding star schema with data (i.e., one for each of the dimension tables and two for the fact table, each from a different source system).


Typically, when creating an inventory of mappings, the focus is on the target tables, with an assumption that each target table has its own mapping, or sometimes multiple mappings. While often true, if a single source of data populates multiple tables, this approach yields multiple mappings. Efficiencies can sometimes be realized by loading multiple tables from a single source. By simply focusing on the target tables, however, these efficiencies can be overlooked.

A more comprehensive approach to creating the inventory of mappings is to create a spreadsheet listing all of the target tables. Create a column with a number next to each target table. For each of the target tables, in another column, list the source file or table that will be used to populate the table. In the case of multiple source tables per target, create two rows for the target, each with the same number, and list the additional source(s) of data.

The table would look similar to the following:

Number | Target Table | Source
1 | Customers | Cust_File
2 | Products | Items
3 | Customer_Type | Cust_File
4 | Orders_Item | Tickets
4 | Orders_Item | Ticket_Items

When completed, the spreadsheet can be sorted either by target table or source table. Sorting by source table can help determine potential mappings that create multiple targets. When using a source to populate multiple tables at once for efficiency, be sure to keep restartability and reloadability in mind. The mapping will always load two or more target tables from the source, so there will be no easy way to rerun a single table. In this example, potentially the Customers table and the Customer_Type table can be loaded in the same mapping. When merging targets into one mapping in this manner, give both targets the same number. Then, re-sort the spreadsheet by number. For the mappings with multiple sources or targets, merge the data back into a single row to generate the inventory of mappings, with each number representing a separate mapping. (A small scripted sketch of this source-based grouping follows the resulting inventory below.)

The resulting inventory would look similar to the following:

Number | Target Table | Source
1 | Customers, Customer_Type | Cust_File
2 | Products | Items
4 | Orders_Item | Tickets, Ticket_Items
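A small sketch of the source-based grouping mentioned above, using the example inventory; this is purely illustrative, and in practice the grouping is normally done in the spreadsheet itself.

from collections import defaultdict

# (target table, source) pairs from the example inventory above.
inventory = [
    ("Customers",     "Cust_File"),
    ("Products",      "Items"),
    ("Customer_Type", "Cust_File"),
    ("Orders_Item",   "Tickets"),
    ("Orders_Item",   "Ticket_Items"),
]

targets_by_source = defaultdict(set)
for target, source in inventory:
    targets_by_source[source].add(target)

# Sources feeding more than one target are candidates for a single multi-target mapping.
for source, targets in sorted(targets_by_source.items()):
    if len(targets) > 1:
        print(f"{source} -> {sorted(targets)}")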

At this point, it is often helpful to record some additional information about each mapping to help with planning and maintenance. First, give each mapping a name. Apply the naming standards generated in 3.2 Design Development Architecture. These names can then be used to distinguish mappings from one other and also can be put on the project plan as individual tasks.


Next, determine for the project a threshold for a high, medium, or low number of target rows. For example, in a warehouse where dimension tables are likely to number in the thousands and fact tables in the hundred thousands, the following thresholds might apply:

● Low – 1 to 10,000 rows
● Medium – 10,000 to 100,000 rows
● High – 100,000+ rows

Assign a likely row volume (high, medium, or low) to each of the mappings based on the expected volume of data to pass through the mapping. These high-level estimates will help to determine how many mappings are of ‘high’ volume; these mappings will be the first candidates for performance tuning. Add any other columns of information that might be useful to capture about each mapping, such as a high-level description of the mapping functionality, resource (developer) assigned, initial estimate, actual completion time, or complexity rating.

Last updated: 01-Feb-07 18:53


Metadata Reporting and Sharing

Challenge

Using Informatica's suite of metadata tools effectively in the design of the end-user analysis application.

Description

The Informatica tool suite can capture extensive levels of metadata, but the amount of metadata that is entered depends on the metadata strategy. Detailed information or metadata comments can be entered for all repository objects (e.g., mappings, sources, targets, transformations, ports, etc.). Also, all information about column size and scale, data types, and primary keys is stored in the repository. The decision on how much metadata to create is often driven by project timelines. While it may be beneficial for a developer to enter detailed descriptions of each column, expression, variable, etc., it also requires extra time and effort to do so. But once that information is fed to the Informatica repository, the same information can be retrieved using Metadata Reporter at any time. There are several out-of-box reports, and customized reports can also be created to view that information. There are several options available to export these reports (e.g., Excel spreadsheet, Adobe .pdf file, etc.).

Informatica offers two ways to access the repository metadata:

• Metadata Reporter, which is a web-based application that allows you to run reports against the repository metadata. This is a very comprehensive tool that is powered by the functionality of Informatica’s BI reporting tool, Data Analyzer. It is included on the PowerCenter CD.

• Because Informatica does not support or recommend direct reporting access to the repository, even for Select Only queries, the second way of repository metadata reporting is through the use of views written using Metadata Exchange (MX).
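As a hedged illustration of the MX-view route, the sketch below reads session run statistics through a read-only connection. The DSN is an assumption, and the MX view and column names (REP_SESS_LOG, SUBJECT_AREA, SUCCESSFUL_ROWS, FAILED_ROWS) should be verified against the Repository Guide for your PowerCenter version.

import pyodbc

# Assumed read-only ODBC DSN pointing at the PowerCenter repository database.
repo = pyodbc.connect("DSN=pc_repo;UID=report_user;PWD=secret").cursor()

# Query an MX view rather than the underlying repository tables directly.
repo.execute("""
    SELECT SUBJECT_AREA, SESSION_NAME, SUCCESSFUL_ROWS, FAILED_ROWS
    FROM   REP_SESS_LOG
""")

for folder, session_name, ok_rows, bad_rows in repo.fetchall():
    print(folder, session_name, ok_rows, bad_rows)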

Metadata Reporter

The need for the Informatica Metadata Reporter arose from the number of clients requesting custom and complete metadata reports from their repositories. Metadata Reporter is based on the Data Analyzer and PowerCenter products. It provides Data Analyzer dashboards and metadata reports to help you administer your day-to-day PowerCenter operations, reports to access to every Informatica object stored in the repository, and even reports to access objects in the Data Analyzer repository. The architecture of the Metadata Reporter is web-based, with an Internet browser front end. Because Metadata Reporter runs on Data Analyzer, you must have Data Analyzer installed and running before you proceed with Metadata Reporter setup.

Metadata Reporter setup includes the following .XML files to be imported from the PowerCenter CD in the same sequence as they are listed below:

• Schemas.xml
• Schedule.xml
• GlobalVariables_Oracle.xml (This file is database-specific; Informatica provides GlobalVariable files for DB2, SQL Server, Sybase, and Teradata. You need to select the appropriate file based on your PowerCenter repository environment.)
• Reports.xml
• Dashboards.xml

Note: If you have set up a new instance of Data Analyzer exclusively for Metadata Reporter, you should have no problem importing these files. However, if you are using an existing instance of Data Analyzer which you currently use for some other reporting purpose, be careful while importing these files. Some of the files (e.g., global variables, schedules, etc.) may already exist with the same name. You can rename the conflicting objects.


The following are the folders that are created in Data Analyzer when you import the above-listed files:

• Data Analyzer Metadata Reporting - contains reports for the Data Analyzer repository itself (e.g., Today’s Logins, Reports Accessed by Users Today, etc.).

• PowerCenter Metadata Reports - contains reports for the PowerCenter repository. To better organize reports based on their functionality, these reports are further grouped into the following subfolders:

❍ Configuration Management - contains a set of reports that provide detailed information on configuration management, including deployment and label details. This folder contains the following subfolders: Deployment, Label, and Object Version.

❍ Operations - contains a set of reports that enable users to analyze operational statistics, including server load, connection usage, run times, load times, number of runtime errors, etc. for workflows, worklets, and sessions. This folder contains the following subfolders: Session Execution and Workflow Execution.

❍ PowerCenter Objects - contains a set of reports that enable users to identify all types of PowerCenter objects, their properties, and interdependencies on other objects within the repository. This folder contains the following subfolders: Mappings, Mapplets, Metadata Extensions, Server Grids, Sessions, Sources, Targets, Transformations, Workflows, and Worklets.

❍ Security - contains a set of reports that provide detailed information on the users, groups, and their associations within the repository.

Informatica recommends retaining this folder organization, adding new folders if necessary.

The Metadata Reporter provides 44 standard reports which can be customized with the use of parameters and wildcards. Metadata Reporter is accessible from any computer with a browser that has access to the web server where the Metadata Reporter is installed, even without the other Informatica client tools being installed on that computer. The Metadata Reporter connects to the PowerCenter repository using JDBC drivers. Be sure the proper JDBC drivers are installed for your database platform. (Note: You can also use the JDBC-to-ODBC bridge to connect to the repository, e.g., syntax: jdbc:odbc:<data_source_name>)

The Metadata Reporter offers the following advantages:

• Metadata Reporter is comprehensive. You can run reports on any repository. The reports provide information about all types of metadata objects.

• Metadata Reporter is easily accessible. Because the Metadata Reporter is web-based, you can generate reports from any machine that has access to the web server.

• The reports in the Metadata Reporter are customizable. The Metadata Reporter allows you to set parameters for the metadata objects to include in the report.

• The Metadata Reporter allows you to go easily from one report to another. The name of any metadata object that displays on a report links to an associated report. As you view a report, you can generate reports for objects on which you need more information.


The following table shows the list of reports provided by the Metadata Reporter, along with their location and a brief description:

Reports For PowerCenter Repository
Sr No   Name   Folder   Description

1 Deployment Group
Public Folders>PowerCenter Metadata Reports>Configuration Management>Deployment>Deployment Group
Displays deployment groups by repository

2 Deployment Group History

Public Folders>PowerCenter Metadata Reports>Configuration Management>Deployment>Deployment Group History

Displays, by group, deployment groups and the dates they were deployed. It also displays the source and target repository names of the deployment group for all deployment dates. This is a primary report in an analytic workflow.

3 Labels Public Folders>PowerCenter Metadata Reports>Configuration Management>Labels>Labels

Displays labels created in the repository for any versioned object by repository.

4 All Object Version History

Public Folders>PowerCenter Metadata Reports>Configuration Management>Object Version>All Object Version History

Displays all versions of an object by the date the object is saved in the repository. This is a standalone report.

5 Server Load by Day of Week

Public Folders>PowerCenter Metadata Reports>Operations>Session Execution>Server Load by Day of Week

Displays the total number of sessions that ran, and the total session run duration for any day of week in any given month of the year by server by repository. For example, all Mondays in September are represented in one row if that month had 4 Mondays

6 Session Run Details Public Folders>PowerCenter Metadata Reports>Operations>Session Execution>Session Run Details

Displays session run details for any start date by repository by folder. This is a primary report in an analytic workflow.

7 Target Table Load Analysis (Last Month)

Public Folders>PowerCenter Metadata Reports>Operations>Session Execution>Target Table Load Analysis (Last Month)

Displays the load statistics for each table for last month by repository by folder. This is a primary report in an analytic workflow.

8 Workflow Run Details

Public Folders>PowerCenter Metadata Reports>Operations>Workflow Execution>Workflow Run Details

Displays the run statistics of all workflows by repository by folder. This is a primary report in an analytic workflow.

9 Worklet Run Details Public Folders>PowerCenter Metadata Reports>Operations>Workflow Execution>Worklet Run Details

Displays the run statistics of all worklets by repository by folder. This is a primary report in an analytic workflow.

10 Mapping List Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Mappings>Mapping List

Displays mappings by repository and folder. It also displays properties of the mapping such as the number of sources used in a mapping, the number of transformations, and the number of targets. This is a primary report in an analytic workflow.

11 Mapping Lookup Transformations
Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Mappings>Mapping Lookup Transformations
Displays Lookup transformations used in a mapping by repository and folder. This report is a standalone report and also the first node in the analytic workflow associated with the Mapping List primary report.

12 Mapping Shortcuts Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Mappings>Mapping Shortcuts

Displays mappings defined as a shortcut by repository and folder.

13 Source to Target Dependency

Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Mappings>Source to Target Dependency

Displays the data flow from the source to the target by repository and folder. The report lists all the source and target ports, the mappings in which the ports are connected, and the transformation expression that shows how data for the target port is derived.

14 Mapplet List Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Mapplets>Mapplet List

Displays mapplets available by repository and folder. It displays properties of the mapplet such as the number of sources used in a mapplet, the number of transformations, or the number of targets. This is a primary report in an analytic workflow.

15 Mapplet Lookup Transformations

Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Mapplets>Mapplet Lookup Transformations

Displays all Lookup transformations used in a mapplet by folder and repository. This report is a standalone report and also the first node in the analytic workflow associated with the Mapplet List primary report.

16 Mapplet Shortcuts Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Mapplets>Mapplet Shortcuts

Displays mapplets defined as a shortcut by repository and folder.

17 Unused Mapplets in Mappings

Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Mapplets>Unused Mapplets in Mappings

Displays mapplets defined in a folder but not used in any mapping in that folder.

18 Metadata Extensions Usage

Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Metadata Extensions>Metadata Extensions Usage

Displays, by repository by folder, reusable metadata extensions used by any object. Also displays the counts of all objects using that metadata extension.

19 Server Grid List Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Server Grid>Server Grid List

Displays all server grids and servers associated with each grid. Information includes host name, port number, and internet protocol address of the servers.

20 Session List Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Sessions>Session List

Displays all sessions and their properties by repository by folder. This is a primary report in an analytic workflow.

21 Source List
Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Sources>Source List
Displays relational and non-relational sources by repository and folder. It also shows the source properties. This report is a primary report in an analytic workflow.

22 Source Shortcuts Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Sources>Source Shortcuts

Displays sources that are defined as shortcuts by repository and folder

23 Target List Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Targets>Target List

Displays relational and non-relational targets available by repository and folder. It also displays the target properties. This is a primary report in an analytic workflow.

24 Target Shortcuts Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Targets>Target Shortcuts

Displays targets that are defined as shortcuts by repository and folder.

25 Transformation List Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Transformations>Transformation List

Displays transformations defined by repository and folder. This is a primary report in an analytic workflow.

26 Transformation Shortcuts

Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Transformations>Transformation Shortcuts

Displays transformations that are defined as shortcuts by repository and folder.

27 Scheduler (Reusable) List

Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Workflows>Scheduler (Reusable) List

Displays all the reusable schedulers defined in the repository and their description and properties by repository by folder. This is a primary report in an analytic workflow.

28 Workflow List Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Workflows>Workflow List

Displays workflows and workflow properties by repository by folder. This report is a primary report in an analytic workflow.

29 Worklet List Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Worklets>Worklet List

Displays worklets and worklet properties by repository by folder. This is a primary report in an analytic workflow.

30 Users By Group Public Folders>PowerCenter Metadata Reports>Security>Users By Group

Displays users by repository and group.

Reports For Data Analyzer Repository
Sr No   Name   Folder   Description

1 Bottom 10 Least Accessed Reports this Year
Public Folders>Data Analyzer Metadata Reporting>Bottom 10 Least Accessed Reports this Year
Displays the ten least accessed reports for the current year. It has an analytic workflow that provides access details such as user name and access time.

2 Report Activity Details

Public Folders>Data Analyzer Metadata Reporting>Report Activity Details

Part of the analytic workflows "Top 10 Most Accessed Reports This Year", "Bottom 10 Least Accessed Reports this Year" and "Usage by Login (Month To Date)".

3 Report Activity Details for Current Month

Public Folders>Data Analyzer Metadata Reporting>Report Activity Details for Current Month

Provides information about reports accessed in the current month until current date.

4 Report Refresh Schedule

Public Folders>Data Analyzer Metadata Reporting>Report Refresh Schedule

Provides information about the next scheduled update for scheduled reports. It can be used to decide schedule timing for various reports for optimum system performance.

5 Reports Accessed by Users Today

Public Folders>Data Analyzer Metadata Reporting>Reports Accessed by Users Today

Part of the analytic workflow for "Today's Logins". It provides detailed information on the reports accessed by users today. This can be used independently to get comprehensive information about today's report activity details.

6 Todays Logins Public Folders>Data Analyzer Metadata Reporting>Todays Logins

Provides the login count and average login duration for users who logged in today.

7 Todays Report Usage by Hour

Public Folders>Data Analyzer Metadata Reporting>Todays Report Usage by Hour

Provides information about the number of reports accessed today for each hour. The analytic workflow attached to it provides more details on the reports accessed and users who accessed them during the selected hour.

8 Top 10 Most Accessed Reports this Year

Public Folders>Data Analyzer Metadata Reporting>Top 10 Most Accessed Reports this Year

Shows the ten most accessed reports for the current year. It has an analytic workflow that provides access details such as user name and access time.

9 Top 5 Logins (Month To Date)

Public Folders>Data Analyzer Metadata Reporting>Top 5 Logins (Month To Date)

Provides information about users and their corresponding login count for the current month to date. The analytic workflow attached to it provides more details about the reports accessed by a selected user.

10 Top 5 Longest Running On-Demand Reports (Month To Date)

Public Folders>Data Analyzer Metadata Reporting>Top 5 Longest Running On-Demand Reports (Month To Date)

Shows the five longest running on-demand reports for the current month to date. It displays the average total response time, average DB response time, and the average Data Analyzer response time (all in seconds) for each report shown.

11 Top 5 Longest Running Scheduled Reports (Month To Date)

Public Folders>Data Analyzer Metadata Reporting>Top 5 Longest Running Scheduled Reports (Month To Date)

Shows the five longest running scheduled reports for the current month to date. It displays the average response time (in seconds) for each report shown.

12 Total Schedule Errors for Today

Public Folders>Data Analyzer Metadata Reporting>Total Schedule Errors for Today

Provides the number of errors encountered during execution of reports attached to schedules. The analytic workflow "Scheduled Report Error Details for Today" is attached to it.

13 User Logins (Month To Date)

Public Folders>Data Analyzer Metadata Reporting>User Logins (Month To Date)

Provides information about users and their corresponding login count for the current month to date. The analytic workflow attached to it provides more details about the reports accessed by a selected user.

14 Users Who Have Never Logged On

Public Folders>Data Analyzer Metadata Reporting>Users Who Have Never Logged On

Provides information about users who exist in the repository but have never logged in. This information can be used to make administrative decisions about disabling accounts.

Customizing a Report or Creating New Reports

Once you select the report, you can customize it by setting the parameter values and/or creating new attributes or metrics. Data Analyzer includes simple steps to create new reports or modify existing ones. Adding or modifying filters offers tremendous reporting flexibility. Additionally, you can set up report templates and export them as Excel files, which can be refreshed as necessary. For more information on the attributes, metrics, and schemas included with the Metadata Reporter, consult the product documentation.

Wildcards

The Metadata Reporter supports two wildcard characters:

• Percent symbol (%) - represents any number of characters and spaces.
• Underscore (_) - represents one character or space.

You can use wildcards in any number and combination in the same parameter. Leaving a parameter blank returns all values and is the same as using %. The following examples show how you can use the wildcards to set parameters.

Suppose you have the following values available to select:

items, items_in_promotions, order_items, promotions


The following list shows the return values for some wildcard combinations you can use:

Wildcard Combination   Return Values
%                      items, items_in_promotions, order_items, promotions
<blank>                items, items_in_promotions, order_items, promotions
%items                 items, order_items
item_                  items
item%                  items, items_in_promotions
___m%                  items, items_in_promotions, promotions
%pr_mo%                items_in_promotions, promotions

A printout of the mapping object flow is also useful for clarifying how objects are connected. To produce such a printout, arrange the mapping in Designer so the full mapping appears on the screen, and then use Alt+PrtSc to copy the active window to the clipboard. Use Ctrl+V to paste the copy into a Word document.

For a detailed description of how to run these reports, consult the Metadata Reporter Guide included in the PowerCenter documentation.

Security Awareness for Metadata Reporter

Metadata Reporter uses Data Analyzer for reporting out of the PowerCenter/Data Analyzer repository. Data Analyzer has a robust security mechanism that is inherited by Metadata Reporter. You can establish groups, roles, and/or privileges for users based on their profiles. Since the information in the PowerCenter repository does not change often after it goes to production, the Administrator can create some reports and export them to files that can be distributed to the user community. If the number of users for Metadata Reporter is limited, you can implement security using report filters or the data restriction feature. For example, if a user in the PowerCenter repository has access to certain folders, you can create a filter for those folders and apply it to the user's profile. For more information on the ways in which you can implement security in Data Analyzer, refer to the Data Analyzer documentation.

Metadata Exchange: the Second Generation (MX2)

The MX architecture was intended primarily for BI vendors who wanted to create a PowerCenter-based data warehouse and display the warehouse metadata through their own products. The result was a set of relational views that encapsulated the underlying repository tables while exposing the metadata in several categories that were more suitable for external parties. Today, Informatica and several key vendors, including Brio, Business Objects, Cognos, and MicroStrategy are effectively using the MX views to report and query the Informatica metadata.

Informatica currently supports the second generation of Metadata Exchange called MX2. Although the overall motivation for creating the second generation of MX remains consistent with the original intent, the requirements and objectives of MX2 supersede those of MX.

The primary requirements and features of MX2 are:

Incorporation of object technology in a COM-based API. Although SQL provides a powerful mechanism for accessing and manipulating records of data in a relational paradigm, it is not suitable for procedural programming tasks that can be achieved by C, C++, Java, or Visual Basic. Furthermore, the increasing popularity and use of object-oriented software tools require interfaces that can fully take advantage of the object technology. MX2 is implemented in C++ and offers an advanced object-based API for accessing and manipulating the PowerCenter Repository from various programming languages.


Self-contained Software Development Kit (SDK). One of the key advantages of MX views is that they are part of the repository database and thus can be used independent of any of the Informatica software products. The same requirement also holds for MX2, thus leading to the development of a self-contained API Software Development Kit that can be used independently of the client or server products.

Extensive metadata content, especially multidimensional models for OLAP. A number of BI tools and upstream data warehouse modeling tools require complex multidimensional metadata, such as hierarchies, levels, and various relationships. This type of metadata was specifically designed and implemented in the repository to accommodate the needs of the Informatica partners by means of the new MX2 interfaces.

Ability to write (push) metadata into the repository. Because of the limitations associated with relational views, MX could not be used for writing or updating metadata in the Informatica repository. As a result, such tasks could only be accomplished by directly manipulating the repository's relational tables. The MX2 interfaces provide metadata write capabilities along with the appropriate verification and validation features to ensure the integrity of the metadata in the repository.

Complete encapsulation of the underlying repository organization by means of an API. One of the main challenges with MX views and the interfaces that access the repository tables is that they are directly exposed to any schema changes of the underlying repository database. As a result, maintaining the MX views and direct interfaces requires a major effort with every major upgrade of the repository. MX2 alleviates this problem by offering a set of object-based APIs that are abstracted away from the details of the underlying relational tables, thus providing an easier mechanism for managing schema evolution.

Integration with third-party tools. MX2 offers the object-based interfaces needed to develop more sophisticated procedural programs that can tightly integrate the repository with the third-party data warehouse modeling and query/reporting tools.

Synchronization of metadata based on changes from up-stream and down-stream tools. Given that metadata is likely to reside in various databases and files in a distributed software environment, synchronizing changes and updates ensures the validity and integrity of the metadata. The object-based technology used in MX2 provides the infrastructure needed to implement automatic metadata synchronization and change propagation across different tools that access the PowerCenter Repository.

Interoperability with other COM-based programs and repository interfaces. MX2 interfaces comply with Microsoft's Component Object Model (COM) interoperability protocol. Therefore, any existing or future program that is COM-compliant can seamlessly interface with the PowerCenter Repository by means of MX2.

Last updated: 01-Feb-07 18:53


Repository Tables & Metadata Management

Challenge

Maintaining the repository for regular backup, quick response, and querying metadata for metadata reports.

Description

Regular actions such as taking backups, testing backup and restore procedures, and deleting unwanted information from the repository keep the repository performing well.

Managing Repository

The PowerCenter Administrator plays a vital role in managing and maintaining the repository and metadata. The role involves tasks such as securing the repository, managing the users and roles, maintaining backups, and managing the repository through such activities as removing unwanted metadata, analyzing tables, and updating statistics.

Repository backup

Repository backup can be performed using the client tool Repository Server Admin Console or the command line program pmrep. Backup using pmrep can be automated and scheduled for regular backups.

A shell script that calls pmrep to back up the repository can be scheduled to run as a cron job. Alternatively, the script can be called from PowerCenter via a command task. The command task can be placed in a workflow and scheduled to run daily.


The following paragraphs describe some useful practices for maintaining backups:

Frequency: Backup frequency depends on the activity in the repository. For production repositories, a backup is recommended once a month or prior to a major release. For development repositories, a backup is recommended once a week or once a day, depending upon the team size.

Backup file sizes: Because backup files can be very large, Informatica recommends compressing them using a utility such as winzip or gzip.

Storage: For security reasons, Informatica recommends maintaining backups on a different physical device than the repository itself.

Move backups offline: Review the backups on a regular basis to determine how long they need to remain online. Any that are not required online should be moved offline, to tape, as soon as possible.

Restore repository

Although the Repository restore function is used primarily as part of disaster recovery, it can also be useful for testing the validity of the backup files and for testing the recovery process on a regular basis. Informatica recommends testing the backup files and recovery process at least once each quarter. The repository can be restored using the client tool, Repository Server Administration Console, or the command line program pmrepagent.

Restore folders

There is no easy way to restore only one particular folder from a backup. First, the backup must be restored into a new repository; then you can use the Repository Manager client tool to copy the entire folder from the restored repository into the target repository.

Remove older versions

Use the purge command to remove older versions of objects from repository. To purge a specific version of an object, view the history of the object, select the version, and purge it.

Finding deleted objects and removing them from repository

If a PowerCenter repository is enabled for versioning through the use of the Team Based Development option, objects that have been deleted from the repository are not visible in the client tools. To list or view deleted objects, use the Find Checkouts command in the client tools, a query generated in the Repository Manager, or a specific query such as the sketch below.
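
The sketch below illustrates one such query. It assumes an Oracle repository and that the MX view REP_VERSION_PROPS exposes the object name, version number, and a visibility flag; the column names shown are assumptions to verify against the MX Views documentation for your PowerCenter version.

-- Illustrative sketch only: list versioned objects that are no longer visible
-- (i.e., deleted). View and column names are assumptions to verify.
SELECT object_name,
       object_type,
       version_number,
       last_saved
FROM   rep_version_props
WHERE  is_visible = 0
ORDER  BY object_name, version_number;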

After an object has been deleted from the repository, you cannot create another object with the same name unless the deleted object has been completely removed from the repository. Use the purge command to completely remove deleted objects from the repository. Keep in mind, however, that you must remove all versions of a deleted object to completely remove it from the repository.

Truncating Logs

You can truncate the log information (for sessions and workflows) stored in the repository either by using repository manager or the pmrep command line program. Logs can be truncated for the entire repository or for a particular folder.

Options allow truncating all log entries or selected entries based on date and time.


Repository Performance

Analyzing (updating the statistics of) the repository tables can help improve repository performance. Because this process should be carried out for all tables in the repository, a script offers the most efficient means. You can then schedule the script to run using either an external scheduler or a PowerCenter workflow with a command task to call the script.
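
As an illustration only (not a script shipped by Informatica), the following Oracle sketch generates the ANALYZE statements for the repository tables; it assumes the repository is on Oracle, the query is run as the repository database user, and the repository tables carry the standard OPB prefix. Spool the output and execute it, or wrap it in your scheduler of choice.

-- Sketch: generate statistics-gathering statements for all repository tables.
-- Assumes Oracle and the standard OPB_ table prefix; spool and run the output.
SELECT 'ANALYZE TABLE ' || table_name || ' COMPUTE STATISTICS;' AS stmt
FROM   user_tables
WHERE  table_name LIKE 'OPB%'
ORDER  BY table_name;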

Repository Agent and Repository Server performance

Factors such as team size, network, number of objects involved in a specific operation, number of old locks (on repository objects), etc. may reduce the efficiency of the repository server (or agent). In such cases, the various causes should be analyzed and the repository server (or agent) configuration file modified to improve performance.

Managing Metadata

The following paragraphs list the queries that are most often used to report on PowerCenter metadata. The queries are written for PowerCenter repositories on Oracle and are based on PowerCenter 6 and PowerCenter 7. Minor changes in the queries may be required for PowerCenter repositories residing on other databases.

Failed Sessions

The following query lists the failed sessions in the last day. To make it work for the last ‘n’ days, replace SYSDATE-1 with SYSDATE - n

SELECT Subject_Area AS Folder,

Session_Name,

Last_Error AS Error_Message,

DECODE (Run_Status_Code,3,'Failed',4,'Stopped',5,'Aborted') AS Status,

Actual_Start AS Start_Time,


Session_TimeStamp

FROM rep_sess_log

WHERE run_status_code != 1

AND TRUNC(Actual_Start) BETWEEN TRUNC(SYSDATE -1) AND TRUNC(SYSDATE)

Long running Sessions

The following query lists long running sessions in the last day. To make it work for the last ‘n’ days, replace SYSDATE-1 with SYSDATE - n

SELECT Subject_Area AS Folder,

Session_Name,

Successful_Source_Rows AS Source_Rows,

Successful_Rows AS Target_Rows,

Actual_Start AS Start_Time,

Session_TimeStamp

FROM rep_sess_log

WHERE run_status_code = 1

AND TRUNC(Actual_Start) BETWEEN TRUNC(SYSDATE -1) AND TRUNC(SYSDATE)

AND (Session_TimeStamp - Actual_Start) > (10/(24*60)) -- sessions running longer than 10 minutes

ORDER BY Session_timeStamp

Invalid Tasks

The following query lists folder names and task name, version number, and last saved for all invalid tasks.

SELECT SUBJECT_AREA AS FOLDER_NAME,

DECODE(IS_REUSABLE,1,'Reusable',' ') || ' ' ||TASK_TYPE_NAME AS TASK_TYPE,

TASK_NAME AS OBJECT_NAME,

VERSION_NUMBER, -- comment out for V6

LAST_SAVED


FROM REP_ALL_TASKS

WHERE IS_VALID=0

AND IS_ENABLED=1

--AND CHECKOUT_USER_ID = 0 -- Comment out for V6

--AND is_visible=1 -- Comment out for V6

ORDER BY SUBJECT_AREA,TASK_NAME

Load Counts

The following query lists the load counts (number of rows loaded) for the successful sessions.

SELECT

subject_area,

workflow_name,

session_name,

DECODE (Run_Status_Code,1,'Succeeded',3,'Failed',4,'Stopped',5,'Aborted') AS Session_Status,

successful_rows,

failed_rows,

actual_start

FROM

REP_SESS_LOG

WHERE

TRUNC(Actual_Start) BETWEEN TRUNC(SYSDATE -1) AND TRUNC(SYSDATE)

ORDER BY

subject_area,

workflow_name,

session_name,

Session_status


Using Metadata Extensions

Challenge

To provide for efficient documentation and achieve extended metadata reporting through the use of metadata extensions in repository objects.

Description

Metadata Extensions, as the name implies, help you to extend the metadata stored in the repository by associating information with individual objects in the repository.

Informatica Client applications can contain two types of metadata extensions: vendor-defined and user-defined.

● Vendor-defined. Third-party application vendors create vendor-defined metadata extensions. You can view and change the values of vendor-defined metadata extensions, but you cannot create, delete, or redefine them.

● User-defined. You create user-defined metadata extensions using PowerCenter clients. You can create, edit, delete, and view user-defined metadata extensions. You can also change the values of user-defined extensions.

You can create reusable or non-reusable metadata extensions. You associate reusable metadata extensions with all repository objects of a certain type. So, when you create a reusable extension for a mapping, it is available for all mappings. Vendor-defined metadata extensions are always reusable.

Non-reusable extensions are associated with a single repository object. Therefore, if you edit a target and create a non-reusable extension for it, that extension is available only for the target you edit. It is not available for other targets. You can promote a non-reusable metadata extension to reusable, but you cannot change a reusable metadata extension to non-reusable.

Metadata extensions can be created for the following repository objects:

● Source definitions
● Target definitions
● Transformations (Expressions, Filters, etc.)
● Mappings
● Mapplets
● Sessions
● Tasks
● Workflows
● Worklets

Metadata Extensions offer a very easy and efficient method of documenting important information associated with repository objects. For example, when you create a mapping, you can store the mapping owner's name and contact information with the mapping, or when you create a source definition, you can enter the name of the person who created or imported the source.

The power of metadata extensions is most evident in the reusable type. When you create a reusable metadata extension for any type of repository object, that metadata extension becomes part of the properties of that type of object. For example, suppose you create a reusable metadata extension for source definitions called SourceCreator. When you create or edit any source definition in the Designer, the SourceCreator extension appears on the Metadata Extensions tab. Anyone who creates or edits a source can enter the name of the person that created the source into this field.

You can create, edit, and delete non-reusable metadata extensions for sources, targets, transformations, mappings, and mapplets in the Designer. You can create, edit, and delete non-reusable metadata extensions for sessions, workflows, and worklets in the Workflow Manager. You can also promote non-reusable metadata extensions to reusable extensions using the Designer or the Workflow Manager. You can also create reusable metadata extensions in the Workflow Manager or Designer.

You can create, edit, and delete reusable metadata extensions for all types of repository objects using the Repository Manager. If you want to create, edit, or delete metadata extensions for multiple objects at one time, use the Repository Manager. When you edit a reusable metadata extension, you can modify the properties Default Value, Permissions and Description.

Note: You cannot create non-reusable metadata extensions in the Repository Manager. All metadata extensions created in the Repository Manager are reusable. Reusable metadata extensions are repository wide.

You can also migrate Metadata Extensions from one environment to another. When


you do a copy folder operation, the Copy Folder Wizard copies the metadata extension values associated with those objects to the target repository. A non-reusable metadata extension will be copied as a non-reusable metadata extension in the target repository. A reusable metadata extension is copied as reusable in the target repository, and the object retains the individual values. You can edit and delete those extensions, as well as modify the values.

Metadata Extensions provide extended metadata reporting capabilities. Using the Informatica MX2 API, you can create useful reports on metadata extensions. For example, you can create and view a report on all the mappings owned by a specific team member. You can use various programming environments such as Visual Basic, Visual C++, C++, and the Java SDK to write API modules. The Informatica Metadata Exchange SDK 6.0 installation CD includes sample Visual Basic and Visual C++ applications.
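
Where programmatic access through MX2 is not required, a similar report can often be produced directly from the MX views. The following is a heavily hedged sketch: it assumes an Oracle repository, an MX view named REP_METADATA_EXTNS, and a reusable extension named Owner; the view and column names are assumptions that should be checked against the MX Views documentation for your version.

-- Hypothetical sketch: find objects whose 'Owner' metadata extension is set to
-- a given team member. View and column names are assumptions to verify.
SELECT metadata_extn_name,
       metadata_extn_value,
       metadata_extn_object_type,
       metadata_extn_object_id
FROM   rep_metadata_extns
WHERE  metadata_extn_name  = 'Owner'
AND    metadata_extn_value = 'JSmith';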

Additionally, Metadata Extensions can also be populated via data modeling tools such as ERWin, Oracle Designer, and PowerDesigner via Informatica Metadata Exchange for Data Models. With the Informatica Metadata Exchange for Data Models, the Informatica Repository interface can retrieve and update the extended properties of source and target definitions in PowerCenter repositories. Extended Properties are the descriptive, user defined, and other properties derived from your Data Modeling tool and you can map any of these properties to the metadata extensions that are already defined in the source or target object in the Informatica repository.

Last updated: 01-Feb-07 18:53


Using PowerCenter Metadata Manager and Metadata Exchange Views for Quality Assurance

Challenge

The role that the PowerCenter repository can play in an automated QA strategy is often overlooked and under-appreciated. This repository is essentially a database about the transformation process and the software developed to implement it; the challenge is to devise a method to exploit this resource for QA purposes.

To address the above challenge, Informatica PowerCenter provides several pre-packaged reports (PowerCenter Repository Reports) that can be installed on a Data Analyzer or Metadata Manager installation. These reports provide a wealth of useful information about PowerCenter object metadata and operational metadata that can be used for quality assurance.

Description

Before considering the mechanics of an automated QA strategy, it is worth emphasizing that quality should be built in from the outset. If the project involves multiple mappings repeating the same basic transformation pattern(s), it is probably worth constructing a virtual production line. This is essentially a template-driven approach to accelerate development and enforce consistency through the use of the following aids:

● Shared template for each type of mapping.
● Checklists to guide the developer through the process of adapting the template to the mapping requirements.
● Macros/scripts to generate productivity aids such as SQL overrides, etc.

It is easier to ensure quality from a standardized base rather than relying on developers to repeat accurately the same basic keystrokes.

Underpinning the exploitation of the repository for QA purposes is the adoption of naming standards which categorize components. By running the appropriate query on the repository, it is possible to identify those components whose attributes differ from those predicted for the category. Thus, it is quite possible to automate some aspects of QA. Clearly, the function of naming conventions is not just to standardize, but also to provide logical access paths into the information in the repository; names can be used to identify patterns and/or categories and thus allow assumptions to be made about object attributes. Along with the facilities provided to query the repository, such as the Metadata Exchange (MX) Views and the PowerCenter Metadata Manager, this opens the door to an automated QA strategy

For example, consider the following situation: it is possible that the EXTRACT mapping/session should always truncate the target table before loading; conversely, the TRANSFORM and LOAD phases should never truncate a target.

Possible code errors in this respect can be identified as follows:

● Define a mapping/session naming standard to indicate EXTRACT, TRANSFORM, or LOAD.
● Develop a query on the repository to search for sessions named EXTRACT, which do not have the truncate target option set.
● Develop a query on the repository to search for sessions named TRANSFORM or LOAD, which do have the truncate target option set.
● Provide a facility to allow developers to run both queries before releasing code to the test environment.

Alternatively, a standard may have been defined to prohibit unconnected output ports from transformations (such as expressions) in a mapping. These can be very easily identified from the MX View REP_MAPPING_UNCONN_PORTS:
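
For example, a query along the following lines flags the offending ports (a sketch; the column names are assumptions and should be confirmed against the MX Views documentation for your PowerCenter version):

-- Sketch: list unconnected ports per mapping from the MX view named above.
-- Column names (SUBJECT_AREA, MAPPING_NAME, etc.) are assumptions to verify.
SELECT subject_area,
       mapping_name,
       object_instance_name,
       field_name
FROM   rep_mapping_unconn_ports
ORDER  BY subject_area, mapping_name;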

The following bullets represent a high-level overview of the steps involved in automating QA:

● Review the transformations/mappings/sessions/workflows and allocate to broadly representative categories.

● Identify the key attributes of each category.
● Define naming standards to identify the category for transformations/mappings/sessions/workflows.
● Analyze the MX Views to source the key attributes.
● Develop the query to compare actual and expected attributes for each category.

After you have completed these steps, it is possible to develop a utility that compares actual and expected attributes for developers to run before releasing code into any test environment. Such a utility may incorporate the following processing stages:

● Execute a profile to assign environment variables (e.g., repository schema user, password, etc.).
● Select the folder to be reviewed.
● Execute the query to find exceptions.
● Report the exceptions in an accessible format.
● Exit with failure if exceptions are found.

TIP: Remember that any queries on the repository that bypass the MX views will require modification whenever PowerCenter is upgraded, and as such are not recommended by Informatica.

The principal objective of any QA strategy is to ensure that developed components adhere to standards and to identify defects before incurring overhead during the migration from development to test/production environments. Qualitative, peer-based reviews of PowerCenter objects due for release obviously have their part to play in this process.

Using Metadata Manager and PowerCenter Repository Reports for Quality Assurance

The need for the Informatica Metadata Reporter was identified from a number of clients requesting custom and complete metadata reports from their repositories. Metadata Reporter provides Data Analyzer dashboards and metadata reports to help you administer your day-to-day PowerCenter operations. In this section, we focus primarily on how these reports and custom reports can help ease the QA process.

The following reports can help identify regressions in load performance:

● Session Run Details
● Workflow Run Details
● Worklet Run Details
● Server Load by Day of the Week can help determine the load on the server before and after QA migrations and may help balance the loads through the week by modifying the schedules.

The Target Table Load Analysis can help identify any data regressions with the number of records loaded in each target (if a baseline was established before the migration/upgrade).

The Failed Session report lists failed sessions at a glance, which is very helpful after a major QA migration or the QA of an Informatica upgrade process.

During large deployments to QA, the code review team can look at the following reports to determine whether standards (i.e., naming standards, comments for repository objects, metadata extensions usage, etc.) were followed. Accessing this information from PowerCenter Repository Reports typically reduces the time required for review because the reviewer doesn't need to open each mapping and check for these details. All of the following are out-of-the-box reports provided by Informatica:

● Label report
● Mappings list
● Mapping shortcuts
● Mapping lookup transformation
● Mapplet list
● Mapplet shortcuts
● Mapplet lookup transformation
● Metadata extensions usage
● Sessions list
● Worklets list
● Workflows list
● Source list
● Target list
● Custom reports based on the review requirements

In addition, note that the following reports are also useful during migration and upgrade processes:

● Invalid object reports and deployment group report in the QA repository help to determine which deployments caused the invalidations.

● Invalid object report against Development repository helps to identify the invalid objects that are part of deployment before QA migration.


● Invalid object report helps in QA of an Informatica upgrade process.

The following table summarizes some of the reports that Informatica ships with a PowerCenter Repository Reports installation:

Report Name Description

1 Deployment Group Displays deployment groups by repository

2 Deployment Group History Displays, by group, deployment groups and the dates they were deployed. It also displays the source and target repository names of the deployment group for all deployment dates.

3 Labels Displays labels created in the repository for any versioned object by repository.

4 All Object Version History Displays all versions of an object by the date the object is saved in the repository.

5 Server Load by Day of Week

Displays the total number of sessions that ran, and the total session run duration for any day of week in any given month of the year by server by repository. For example, all Mondays in September are represented in one row if that month had 4 Mondays

6 Session Run Details Displays session run details for any start date by repository by folder.

7 Target Table Load Analysis (Last Month)

Displays the load statistics for each table for last month by repository by folder

8 Workflow Run Details Displays the run statistics of all workflows by repository by folder.

9 Worklet Run Details Displays the run statistics of all worklets by repository by folder.

10 Mapping List Displays mappings by repository and folder. It also displays properties of the mapping such as the number of sources used in a mapping, the number of transformations, and the number of targets.

11 Mapping Lookup Transformations

Displays Lookup transformations used in a mapping by repository and folder.

12 Mapping Shortcuts Displays mappings defined as a shortcut by repository and folder.

13 Source to Target Dependency

Displays the data flow from the source to the target by repository and folder. The report lists all the source and target ports, the mappings in which the ports are connected, and the transformation expression that shows how data for the target port is derived.

14 Mapplet List Displays mapplets available by repository and folder. It displays properties of the mapplet such as the number of sources used in a mapplet, the number of transformations, or the number of targets.

15 Mapplet Lookup Transformations

Displays all Lookup transformations used in a mapplet by folder and repository.

16 Mapplet Shortcuts Displays mapplets defined as a shortcut by repository and folder.

17 Unused Mapplets in Mappings

Displays mapplets defined in a folder but not used in any mapping in that folder.


18 Metadata Extensions Usage

Displays, by repository by folder, reusable metadata extensions used by any object. Also displays the counts of all objects using that metadata extension.

19 Server Grid List Displays all server grids and servers associated with each grid. Information includes host name, port number, and internet protocol address of the servers.

20 Session List Displays all sessions and their properties by repository by folder. This is a primary report in a data integration workflow.

21 Source List Displays relational and non-relational sources by repository and folder. It also shows the source properties. This report is a primary report in a data integration workflow.

22 Source Shortcuts Displays sources that are defined as shortcuts by repository and folder

23 Target List Displays relational and non-relational targets available by repository and folder. It also displays the target properties. This is a primary report in a data integration workflow.

24 Target Shortcuts Displays targets that are defined as shortcuts by repository and folder.

25 Transformation List Displays transformations defined by repository and folder. This is a primary report in a data integration workflow.

26 Transformation Shortcuts Displays transformations that are defined as shortcuts by repository and folder.

27 Scheduler (Reusable) List Displays all the reusable schedulers defined in the repository and their description and properties by repository by folder.

28 Workflow List Displays workflows and workflow properties by repository by folder.

29 Worklet List Displays worklets and worklet properties by repository by folder.

Last updated: 01-Feb-07 18:53


Configuring Standard XConnects

Challenge

Metadata that is derived from a variety of sources and tools is often disparate and fragmented. To be of value, metadata needs to be consolidated into a central repository. Informatica's Metadata Manager provides a central repository for the capture and analysis of critical metadata.

Description

Metadata Manager Console Settings

Logging into the Metadata Manager Warehouse

You can use the Metadata Manager Console to access one Metadata Manager Warehouse repository at a time. When logging in to the Metadata Manager Console for the first time, you need to set up the connection information along with the data source for the Integration Repository. In subsequent logins, you need to enter only the Metadata Manager Warehouse database password.

Setting up Connections to the PowerCenter Components

Before you run any XConnects, be sure that the Metadata Manager Console has valid connections to the following PowerCenter components for Metadata Manager:

● Integration Repository Service
● Domain
● Integration Service

To verify the PowerCenter settings, click the Administration tab.

Specifying the PowerCenter Source Files Directory

Metadata Manager stores the following files in the PowerCenter source files directory:

1. IME files - Some XConnects extract the source repository metadata and reformat it into an IME-based format. The reformatted metadata is stored in new files, referred to as IME files. The workflows extract the transformed metadata from the IME files and then load the metadata into the Metadata Manager Warehouse.

2. Parameter files - The integration workflows use parameters to control the sessions, worklets, and workflows.

3. Date files - The integration workflows use date files to load dates into the Metadata Manager Warehouse

To configure the Source file directory, click the Administration tab, then the File Transfer Configuration tab.


For Windows:

● \\Informatica Server Name\PM SrcFiles directory.
● Click Save when you are finished.

For UNIX:

● Select ftp option
● FTP Server Name: Integration Service Host Name
● Port Number: 21 (default)
● User Name: UNIX login name to Integration Server
● ftp directory: /Integration service Home directory/SrcFiles

Note: Metadata Manager 8.1 does not support secure ftp connections to the Integration server.

Configuring Standard XConnects

SQL Server XConnect

Specify a user name and password to access the SQL Server database metadata. Be sure that the user has access to all system tables. One XConnect is needed per SQL Server database.

To extract metadata from SQL Server, perform the following steps:

● Add a new SQL repository from Metadata Manager (Web interface).
● Log in to the Metadata Manager Console. Click the Source Repository Management tab. The new SQL Server XConnect added above should show up in the console. Select the SQL Server XConnect and click the Configuration Properties tab. Enter the following information related to the XConnect:

Properties Description

User Name/Password Database user name and password to access SQL Server data dictionary

Data Source Name ODBC connection name to connect to SQL Server data dictionary

Database Type Microsoft SQL Server


Connection String For default instance: SQL Server Name@Database Name

For named instance: Server Name\Instance Name@Database Name

● Click Save when you have finished entering this information.
● To configure the list of user schemas to load, click the Parameter Setup tab and select the list of schemas to load (these are listed in the Included Schemas). Click Save when you are finished.
● The XConnect is ready to be loaded.
● After a successful load of the SQL Server metadata, you can see the metadata in the Web interface.

To configure SQL Server out-of-the-box XConnects to run on the PowerCenter server in a UNIX environment, follow these steps:

● Install the DataDirect ODBC drivers on the PowerCenter server location.
● Configure .odbc.ini just like any other ODBC setup.
● Create a repository of type Microsoft SQL Server using the Metadata Manager browser.
● When configuring the repository in the configuration console, specify a connect string as <SQLserverhost>@DBname and save the configuration.
● Using Workflow Manager, delete the connection it created, R<RepoUID>, and create an ODBC connection with the same name as R<RepoUID> (specify the same connect string as the one configured in .odbc.ini).

Oracle XConnect

Specify a user name and password to access the Oracle database metadata. Be sure that the user has the Select Any Table privilege and Select permissions on the following objects in the specified schemas: tables, views, indexes, packages, procedures, functions, sequences, triggers, and synonyms. Also ensure the user has Select permissions on the SYS.v_$instance view. One XConnect is needed for each Oracle instance.
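
A minimal sketch of the corresponding grants, assuming a dedicated Metadata Manager database user named MM_USER (a placeholder name), run as a DBA:

-- Sketch: privileges for the user that runs the Oracle XConnect.
-- MM_USER is a placeholder; adjust to your environment.
GRANT SELECT ANY TABLE TO mm_user;
GRANT SELECT ON sys.v_$instance TO mm_user;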

To extract metadata from Oracle, perform the following steps:

● Add a new SQL repository from Metadata Manager (Web interface).
● Log in to the Metadata Manager Console. Click the Source Repository Management tab. The Oracle XConnect added above should show up in the console. Select the Oracle XConnect and click the Configuration Properties tab. Enter the following information related to the XConnect:


Properties Description

User Name/Password Database user name and password to access the Oracle instance data dictionary

Data Source Name ODBC connection name to connect to the Oracle instance data dictionary

Database Type Oracle

Connect String Oracle instance name

● Click Save when you have finished entering this information.
● To configure the list of schemas to load, click the Parameter Setup tab and select the list of schemas to load (these are listed in the Included Schemas). Click Save when finished.
● The XConnect is ready to be loaded.
● After a successful load of the Oracle metadata, you can see the metadata in the Web interface.

Teradata XConnect

Specify a user name and password to access the Teradata metadata. Be sure that the user has access to all the system “DBC” tables.
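
A quick way to confirm that the user can read the DBC dictionary is to run a simple query against it before configuring the XConnect, for example (a sketch, assuming the standard DBC.Tables dictionary view):

-- Sketch: verify read access to the Teradata data dictionary (DBC) views.
SELECT DatabaseName,
       TableName,
       TableKind
FROM   DBC.Tables
WHERE  DatabaseName = 'DBC';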

To extract metadata from Teradata Server, perform the following steps:

● Add a new SQL repository from Metadata Manager (Web interface).
● Log in to the Metadata Manager Console and click the Source Repository Management tab. The new Teradata XConnect added above should show up in the console. Select the Teradata XConnect and click the Configuration Properties tab. Enter the following information related to the XConnect:

Properties Description

User Name/Password Database user name and password to access the Teradata data dictionary

Data Source Name ODBC connection name to connect to the Teradata data dictionary

Database Type Teradata


Connection String ODBC connection name in PowerCenter repository to Teradata

● Click Save when you have finished entering this information.
● To configure the list of user databases to load, click the Parameter Setup tab. Select the list of databases to load (these are listed in the Included Schemas). Click Save when you are finished.
● The XConnect is ready to be loaded.
● After a successful load of the Teradata metadata, you can see the metadata in the Web interface.

ERwin XConnect

The following format is required to extract metadata from Erwin:

● For ERwin 3.5, save the data model in ER1 format.
● For ERwin 4.x, save the ERwin model that you want to load into Metadata Manager as XML.

To extract metadata from ERwin, perform the following steps:

● Log in to Metadata Manager (Web interface) and select the Administration tab. Under Repository Management, select Repositories. Click Add to add a new repository. Enter all the information related to the ERwin XConnect. (Repository Type and Name are mandatory fields.)

● Log into the Metadata Manager Console and click the Source Repository Management tab. The ERwin XConnect added above should show up in the console. Select the ERwin XConnect and click the Configuration Properties tab.

❍ Each XConnect allows you to add multiple files.
❍ Source System Version = Select the appropriate option.
❍ Click Add to add the ERwin file. Browse to the location of the ERwin file. The directory path of the file is stored locally. To load a new ERwin file, select the current file, then click Delete and add the new file.

❍ Select the Refresh? checkbox to refresh the metadata from the file. If you do not want to update the metadata from a particular metadata file (i.e., if the file does not contain any changes since the last metadata load), uncheck this box.

❍ Click Save when you are finished.

● If you select Edit/assigned Connections for Lineage Report, set the connection assignments between the ERwin model and the underlying database schemas. Click OK when you are finished.

● The XConnect is ready to be loaded. After a successful load of the ERwin metadata, you can see the metadata in the Web interface.

ER-Studio XConnect


The following format is required to extract metadata from ER-Studio:

● ER-Studio DM1 format

To extract metadata from ERStudio, perform the following steps:

● Log in to Metadata Manager (Web interface) and select the Administration tab. Under Repository Management, select Repositories and click Add to add a new repository. Enter all the information related to the ERStudio XConnect. (Repository Type and Name are mandatory fields.)

● Log in to the Metadata Manager Console and click the Source Repository Management tab. The ERStudio XConnect added above should show up in the console. Select the ERStudio XConnect and click the Configuration Properties tab.

❍ Each XConnect allows you to add multiple files.
❍ Source System Version = Select the appropriate option.
❍ Click Add to add the ERStudio file. Browse to the location of the ERStudio file. The directory path of the file is stored locally. To load a new ERStudio file, select the current file, click Delete, then add the new file.

❍ Select the Refresh? checkbox to refresh the metadata from the file. If you do not want to update the metadata from a particular metadata file (i.e., if the file does not contain any changes since the last metadata load), uncheck this box.

❍ Click Save when you are finished.

● If you select Edit Assigned Connections for Lineage Report, set the connection assignments between the ER-Studio model and the underlying database schemas. Click OK when you are finished.

● The XConnect is ready to be loaded. After a successful load of the ER-Studio metadata, you can see the metadata in the Web interface.

PowerCenter XConnect

Specify a user name and password to access the PowerCenter repository metadata. Be sure that the user has the Select Any Table privilege and the ability to drop and create views. If the Oracle user that pulls PowerCenter metadata into the metadata warehouse is different from the user that PowerCenter uses to create the metadata, create synonyms in the new user's schema for all tables and views in the PowerCenter user's schema so that, when the XConnect runs, it can create the views it needs in the new user's schema.

To extract metadata from PowerCenter, perform the following steps:

● Log into Metadata Manager (Web interface) and select Add to add a new repository. Enter all the information related to the PowerCenter XConnect.

● Log into the Metadata Manager Console and click the Source Repository Management tab. The PowerCenter XConnect added above should show up in the console. Select the PowerCenter XConnect and click the Configuration Properties tab.


● Enter the following information related to the XConnect:

● User Name/Password - Database user name and password to access the PowerCenter repository tables.
● Data Source Name - ODBC connection name used to connect to the database containing the source repository.
● Database Type - Database type of the PowerCenter repository database.
● Connect String - Refer to the appropriate RDBMS XConnect, based on the database type.

● Click Save when you have finished entering this information.
● To configure the list of folders to load, click the Parameter Setup tab and select the folders to load (these are listed in the Included Folders).
● Select Enable Operational Metadata Extraction to extract operational metadata (e.g., run details, including times and statuses for workflow, worklet, and session runs).
● Leave the Source Incremental Extract Window (in days) at its default value of 4000. (To ensure a full extract during the initial workflow run, the workflow is configured to extract records that have been inserted or updated within the past 4000 days.)
● Click Save when you are finished.

Configure Parameterized Connection

Use the Assign Source Parameter Files button located under the Enable Operational Metadata text-box to assign connection parameters to a PowerCenter XConnect.

● Browse to the Parameter File Directory. Click the Add button to select the appropriate parameter file for each workflow that is being used. Click Save when you are finished selecting parameter files.

● The XConnect is ready to be loaded.
● After a successful load of the PowerCenter metadata, you can see the metadata in the Web interface.

Business Objects XConnect

The Business Objects XConnect requires you to install Business Objects Designer on the machine hosting the Metadata Manager Console and to provide a user name and password to access the Business Objects repository.


To extract metadata from Business Objects, perform the following steps:

● Add a new Business Objects repository from Metadata Manager (Web interface).
● Log in to the Metadata Manager Console and click the Source Repository Management tab. The new Business Objects XConnect added above should show up in the console. Select the Business Objects XConnect and click the Configuration Properties tab.

To configure the Business Objects repository connection setup for the first time:

● Click Configure to set up the Business Objects configuration file. The Metadata Administrator needs to define the Business Objects configuration the first time.
● Select the Business Objects repository, then enter the Business Objects login name and password to connect to it.
● Select the list of universes you need to extract.
● Select the list of documents.
● Click Save to create the Business Objects configuration file used to extract metadata from Business Objects.
● Browse to select the path and file name for the Business Objects connection configuration file.
● Click Save when you are finished.
● The XConnect is now ready to be loaded.
● After a successful load of the Business Objects metadata, you can see the metadata in the Web interface.

Last updated: 01-Feb-07 18:53


Custom XConnect Implementation

Challenge

Metadata Manager uses XConnects to extract source repository metadata and load it into the Metadata Manager Warehouse. The Metadata Manager Configuration Console is used to run each XConnect. A custom XConnect is needed to load metadata from a source repository for which Metadata Manager does not prepackage an out-of-the-box XConnect.

Description

This document organizes all steps into phases, where each phase and step must be performed in the order presented. To integrate custom metadata, complete tasks for the following phases:

● Design the Metamodel.
● Implement the Metamodel Design.
● Set up and run the custom XConnect.
● Configure the reports and schema.

Prerequisites for Integrating Custom Metadata

To integrate custom metadata, install Metadata Manager and the other required applications. The custom metadata integration process assumes knowledge of the following topics:

● Common Warehouse Metamodel (CWM) and Informatica-Defined Metamodels. The CWM metamodel includes industry-standard packages, classes, and class associations. The Informatica metamodel components supplement the CWM metamodel by providing repository-specific packages, classes, and class associations. For more information about CWM, see http://www.omg.org/cwm. For more information about the Informatica-defined metamodel components, run and review the metamodel reports.

● PowerCenter Functionality. During the metadata integration process, XConnects are configured and run. The XConnects run PowerCenter workflows that extract custom metadata and load it into the Metadata Manager Warehouse.


● Data Analyzer Functionality. Metadata Manager embeds Data Analyzer functionality to create, run, and maintain a metadata reporting environment. Knowledge of creating, modifying, and deleting reports, dashboards, and analytic workflows in Data Analyzer is required. Knowledge of creating, modifying, and deleting table definitions, metrics, and attributes is required to update the schema with new or changed objects.

Design the Metamodel

During this planning phase, the metamodel is designed; the metamodel will be implemented in the next phase.

A metamodel is the logical structure that classifies the metadata from a particular repository type. Metamodels consist of classes, class associations, and packages, which group related classes and class associations.

An XConnect loads metadata into the Metadata Manager Warehouse based on classes and class associations. This task consists of the following steps:

1. Identify Custom Classes. To identify custom classes, determine the various types of metadata in the source repository that need to be loaded into the Metadata Manager Warehouse. Each type of metadata corresponds to one class.

2. Identify Custom Class Properties. After identifying the custom classes, each custom class must be populated with properties (i.e., attributes) in order for Metadata Manager to track and report values belonging to class instances.

3. Map Custom Classes to CWM Classes. Metadata Manager prepackages all CWM classes, class properties, and class associations. To quickly develop a custom metamodel and reduce redundancy, reuse the predefined class properties and associations instead of recreating them. To determine which custom classes can inherit properties from CWM classes, map custom classes to the packaged CWM classes. For all properties that cannot be inherited, define them in Metadata Manager.

4. Determine the Metadata Tree Structure. Configure the way the metadata tree displays objects. Determine the groups of metadata objects in the metadata tree, then determine the hierarchy of the objects in the tree. Assign the TreeElement class as a base class to each custom class.

5. Identify Custom Class Associations. The metadata browser uses class associations to display metadata. For each identified class association, determine if you can reuse a predefined association from a CWM base class or if you need to manually define an association in Metadata Manager.

6. Identify Custom Packages. A package contains related classes and class associations. Multiple packages can be assigned to a repository type to define the structure of the metadata contained in the source repositories of the given repository type. Create packages to group related classes and class associations.

To see an example of sample metamodel design specifications, see Appendix A in the Metadata Manager Custom Metadata Integration Guide.

Implement the Metamodel Design

Using the metamodel design specifications from the previous task, implement the metamodel in Metadata Manager. This task includes the following steps:

1. Create the originator (aka owner) of the metamodel. When creating a new metamodel, specify the originator of each metamodel. An originator is the organization that creates and owns the metamodel. When defining a new custom originator in Metadata Manager, select ‘Customer’ as the originator type.

● Go to the Administration tab.
● Click Originators under Metamodel Management.
● Click Add to add a new originator.
● Fill out the requested information (Note: Domain Name, Name, and Type are mandatory fields).
● Click OK when you are finished.

2. Create the packages that contain the classes and associations of the subject metamodel. Define the packages to which custom classes and associations are assigned. Packages contain classes and their class associations. Packages have a hierarchical structure, where one package can be the parent of another package. Parent packages are generally used to group child packages together.

● Go to the Administration tab.
● Click Packages under Metamodel Management.
● Click Add to add a new package.
● Fill out the requested information (Note: Name and Originator are mandatory fields). Choose the originator created above.
● Click OK when you are finished.

3. Create Custom Classes. In this step, create custom classes identified in the metamodel design task.

● Go to the Administration tab.
● Click Classes under Metamodel Management.
● From the drop-down menu, select the package that you created in the step above.
● Click Add to create a new class.
● Fill out the requested information (Note: Name, Package, and Class Label are mandatory fields).
● Base Classes: In order to see the metadata in the Metadata Manager metadata browser, you need to add at least the base class TreeElement. To do this:
   a. Click Add under Base Classes.
   b. Select the package.
   c. Under Classes, select TreeElement.
   d. Click OK. (You should now see the class properties in the properties section.)
● To add custom properties to your class, click Add. Fill out the property information (Name, Data Type, and Display Label are mandatory fields). Click OK when you are done.
● Click OK at the top of the page to create the class. Repeat the above steps for additional classes.

4. Create Custom Class Associations. In this step, implement the custom class associations identified in the metamodel design phase. In the previous step, CWM classes are added as base classes. Any of the class associations from the CWM base classes can be reused. Define those custom class associations that cannot be reused. If you only need the ElementOwnership association, skip this step.

● Go to the Administration tab.
● Click Associations under Metamodel Management.
● Click Add to add a new association.
● Fill out the requested information (all bold fields are required).
● Click OK when you are finished.

5. Create the Repository Type. Each type of repository contains unique metadata. For example, a PowerCenter data integration repository type contains workflows and mappings, but a Data Analyzer business intelligence repository type does not. Repository types maintain the uniqueness of each repository.

● Go to the Administration tab.
● Click Repository Types under Metamodel Management.
● Click Add to add a new repository type.
● Fill out the requested information (Note: Name and Product Type are mandatory fields).
● Click OK when you are finished.

6. Configure a Repository Type Root Class. Root classes display under the source repository in the metadata tree. All other objects appear under the root class. To configure a repository root class:

● Go to the Administration tab.
● Click Custom Repository Type Root Classes under Metamodel Management.
● Select the custom repository type.
● Optionally, select a package to limit the number of classes that display.
● Select the Root Class option for all applicable classes.
● Click Apply to apply the changes.

Set Up and Run the XConnect

The objective of this task is to set up and run the custom XConnect. Custom XConnects involve a set of mappings that transform source metadata into the required format specified in the Informatica Metadata Extraction (IME) files. The custom XConnect extracts the metadata from the IME files and loads it into the Metadata Manager Warehouse. This task includes the following steps:

1. Determine which Metadata Manager Warehouse tables to load. Although you do not have to load all Metadata Manager Warehouse tables, you must load the following Metadata Manager Warehouse tables:

IMW_ELEMENT: The IME_ELEMENT interface file loads the element names from the source repository into the IMW_ELEMENT table. Note that element is used generically to mean packages, classes, or properties.

IMW_ELMNT_ATTR: The IME_ELMNT_ATTR interface file loads the attributes belonging to elements from the source repository into the IMW_ELMNT_ATTR table.

IMW_ELMNT_ASSOC: The IME_ELMNT_ASSOC interface file loads the associations between elements of a source repository into the IMW_ELMNT_ASSOC table.

To stop the metadata load into particular Metadata Manager Warehouse tables, disable the worklets that load those tables.


2. Reformat the source metadata. In this step, reformat the source metadata so that it conforms to the format specified in each required IME interface file. (The IME files are packaged with the Metadata Manager documentation.) Present the reformatted metadata in a valid source type format. To extract the reformatted metadata, the integration workflows require that the reformatted metadata be in one or more of the following source type formats: database table, database view, or flat file. Note that you can load metadata into a Metadata Manager Warehouse table using more than one of the accepted source type formats.
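When the flat-file source type is chosen, the reformatting step often amounts to exporting the source system's inventory into delimited files. The sketch below is illustrative only: the column names are hypothetical placeholders, and the authoritative layout for each interface file must be taken from the IME files packaged with the Metadata Manager documentation.

    # Illustrative generation of a flat file feeding the element interface.
    # Column names below are placeholders, not the real IME layout; copy the
    # exact column order from the IME_ELEMENT file shipped with Metadata Manager.
    import csv

    # Hypothetical source inventory: one row per metadata object.
    source_elements = [
        {"element_name": "CUSTOMER_MASTER", "class_name": "CustomTable",  "parent": "CRM_SCHEMA"},
        {"element_name": "CUST_ID",         "class_name": "CustomColumn", "parent": "CUSTOMER_MASTER"},
    ]

    with open("ime_element_extract.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["element_name", "class_name", "parent"])
        writer.writeheader()
        for row in source_elements:
            writer.writerow(row)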

3. Register the Source Repository Instance in Metadata Manager. Before the custom XConnect can extract metadata, the source repository must be registered in Metadata Manager. When registering the source repository, Metadata Manager assigns a unique repository ID that identifies the source repository and adds an XConnect for the repository in the Configuration Console. To register the source repository, go to the Metadata Manager Web interface and register the repository under the custom repository type created above. All packages, classes, and class associations defined for the custom repository type apply to all repository instances registered to that repository type. When defining the repository, provide descriptive information about the repository instance.

Create the Repository that will hold the metadata extracted from the source system:

● Go to the Administration tab.
● Click Repositories under Repository Management.
● Click Add to add a new repository.
● Fill out the requested information (Note: Name and Repository Type are mandatory fields). Choose the repository type created above.
● Click OK when finished.

4. Configure the Custom Parameter Files. Custom XConnects require that the parameter file be updated by specifying the following information:

● The source type (database table, database view, or flat file).
● The names of the database views or tables used to load the Metadata Manager Warehouse, if applicable.
● The list of all flat files used to load a particular Metadata Manager Warehouse table, if applicable.
● The worklets you want to enable and disable.

Understanding Metadata Manager Workflows for Custom Metadata

● wf_Load_IME. Custom workflow to extract and transform metadata from the source repository into IME format. This is created by a developer.

Metadata Manager prepackages the following integration workflows for custom metadata. These workflows read the IME files mentioned above and load them into the Metadata Manager Warehouse.

❍ WF_STATUS: Extracts and transforms statuses from any source repository and loads them into the Metadata Manager Warehouse. To resolve status IDs correctly, the workflow is configured to run before the WF_CUSTOM workflow.

❍ WF_CUSTOM: Extracts and transforms custom metadata from IME files and loads that metadata into the Metadata Manager Warehouse.

5. Configure the Custom XConnect. The XConnect loads metadata into the Metadata Manager Warehouse based on classes and class associations specified in the custom metamodel.

When the custom repository type is defined, Metadata Manager registers the corresponding XConnect in the Configuration Console. The following information in the Configuration Console configures the XConnect:

Under the Administration Tab, select Custom Workflow Configuration and choose the repository type to which the custom repository belongs.

● Workflows to load the metadata:

❍ Custom XConnect - wf_Load_IME workflow
❍ Metadata Manager - WF_CUSTOM workflow (prepackages all worklets and sessions required to populate all Metadata Manager Warehouse tables, except the IMW_STATUS table)
❍ Metadata Manager - WF_STATUS workflow (populates the IMW_STATUS table)

Note: Metadata Manager Server does not load Metadata Manager Warehouse tables that have disabled worklets.

● Under the Administration Tab, select Custom Workflow Configuration and choose the parameter file used by the workflows to load the metadata (the parameter file name is assigned at first data load). This parameter file name has the form nnnnn.par, where nnnnn is a five digit integer assigned at the time of the first load of this source repository. The script promoting Metadata Manager from the development environment to test and from the test environment to production preserves this file name.

6. Reset the $$SRC_INCR_DATE Parameter. After completing the first metadata load, reset the $$SRC_INCR_DATE parameter to extract metadata in shorter intervals, such as every few days. The value depends on how often the Metadata Manager Warehouse needs to be updated. If the source does not provide the date when the records were last updated, records are extracted regardless of the $$SRC_INCR_DATE parameter setting.
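Resetting the parameter is a one-line change in the XConnect parameter file. The sketch below is a minimal illustration in Python, assuming the parameter file uses the standard name=value layout and that $$SRC_INCR_DATE holds the extract window in days, as described above; the file path in the example is a placeholder.

    # Illustrative update of $$SRC_INCR_DATE in an XConnect parameter file.
    # Assumes name=value lines; the path used in the example is a placeholder.
    from pathlib import Path

    def set_incr_window(param_file: str, days: int) -> None:
        """Rewrite the $$SRC_INCR_DATE line so only the last <days> days are extracted."""
        path = Path(param_file)
        lines = path.read_text().splitlines()
        for i, line in enumerate(lines):
            if line.strip().startswith("$$SRC_INCR_DATE"):
                lines[i] = f"$$SRC_INCR_DATE={days}"
        path.write_text("\n".join(lines) + "\n")

    # Example: extract only the last 7 days on subsequent loads.
    # set_incr_window(r"<PowerCenter_Home>\server\infa_shared\SrcFiles\<nnnnn>.par", 7)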

7. Run the Custom XConnect. Using the Configuration Console, Metadata Manager Administrators can run the custom XConnect and ensure that the metadata loads correctly.

Note: When loading metadata with Effective From and Effective To Dates, Metadata Manager does not validate whether the Effective From Date is less than the Effective To Date. Ensure that each Effective To Date is greater than the Effective From Date. If you do not supply Effective From and Effective To Dates, Metadata Manager sets the Effective From Date to 1/1/1899 and the Effective To Date to 1/1/3714.
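Because Metadata Manager does not enforce this date rule, it is worth checking the staged rows before running the load. A minimal sketch, assuming the effective dates are available as Python date values (the row structure and field names are hypothetical):

    # Illustrative pre-load check that Effective From precedes Effective To.
    # The row structure and field names are assumptions for this example.
    from datetime import date

    staged_rows = [
        {"element": "CUSTOMER_MASTER", "eff_from": date(2006, 1, 1), "eff_to": date(2007, 1, 1)},
        {"element": "ORDER_FACT",      "eff_from": date(2006, 6, 1), "eff_to": date(2006, 1, 1)},  # invalid
    ]

    bad_rows = [r for r in staged_rows if r["eff_from"] >= r["eff_to"]]
    for r in bad_rows:
        print(f"Invalid effective date range for {r['element']}: "
              f"{r['eff_from']} is not before {r['eff_to']}")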

To Run a Custom XConnect

● Log in to the Configuration Console.
● Click Source Repository Management.
● Click Load next to the custom XConnect you want to run.

Configure the Reports and Schema

The objective of this task is to set up the reporting environment, which is used to run reports on the metadata stored in the Metadata Manager Warehouse. The setup of the reporting environment depends on the reporting requirements. The following options are available for creating reports:

● Use the existing schema and reports. Metadata Manager contains prepackaged reports that can be used to analyze business intelligence metadata, data integration metadata, data modeling tool metadata, and database catalog metadata. Metadata Manager also provides impact analysis and lineage reports that provide information on any type of metadata.

● Create new reports using the existing schema. Build new reports using the existing Metadata Manager metrics and attributes.

● Create new Metadata Manager Warehouse tables and views to support the schema and reports. If the prepackaged Metadata Manager schema does not meet the reporting requirements, create new Metadata Manager Warehouse tables and views. Prefix the name of custom-built tables with Z_IMW_. Prefix custom-built views with Z_IMA_. If you build new Metadata Manager Warehouse tables or views, register the tables in the Metadata Manager schema and create new metrics/attributes in the Metadata Manager schema. Note that the Metadata Manager schema is built on the Metadata Manager views.

After the environment setup is complete, test all schema objects, such as dashboards, analytic workflows, reports, metrics, attributes, and alerts.

Last updated: 01-Feb-07 18:53


Customizing the Metadata Manager Interface

Challenge

Customizing the Metadata Manager Presentation layer to meet specific business needs.

Description

Configuring Metamodels

You may need to configure metamodels for a repository type in order to integrate additional metadata into a Metadata Manager Warehouse and/or to adapt to changes in metadata reporting and browsing requirements. For more information about creating a metamodel for a new repository type, see the Metadata Manager Custom Metadata Integration Guide.

Use Metadata Manager to define a metamodel, which consists of the following objects:

● Originator - the party that creates and owns the metamodel.
● Packages - contain related classes that model metadata for a particular application domain or specific application. Multiple packages can be defined under the newly defined originator. Each package stores classes and associations that represent the metamodel.

● Classes and Class Properties - define a type of object, with its properties, contained in a repository. Multiple classes can be defined under a single package. Each class has multiple properties associated to it. These properties can be inherited from one or many base classes already available. Additional properties can be defined directly under the new class.

● Associations - define the relationship among classes and their objects. Associations help define relationships across individual classes. The cardinality helps define 1-1, 1-n, or n-n relationships. These relationships mirror real life associations of logical, physical, or design-level building blocks of systems and processes.

For more information about metamodels, originators, packages, classes, and associations, see “Metadata Manager Concepts” in the Metadata Manager Administration Guide.

After you define the metamodel, you need to associate it with a repository type. When registering a repository under a repository type, all classes and associations assigned to the repository type through packages apply to the repository.

Repository Types

You can configure types of repositories for the metadata you want to store and manage in the Metadata Manager Warehouse.

You must configure a repository type when you develop an XConnect. You can modify some attributes for existing XConnects and XConnect repository types.

Displaying Objects of an Association in the Metadata Tree

Metadata Manager displays many objects in the metadata tree by default because of the predefined associations among metadata objects. Associations determine how objects display in the metadata tree.

To display an object that doesn't already display in the metadata tree, add an association between the objects in the IMM.properties file. For example, Object A displays in the metadata tree but Object B does not. To display Object B under Object A in the metadata tree, perform the following actions:

● Create an association from Object B to Object A. 'From Objects' in an association display as parent objects; 'To Objects' display as child objects. The 'To Object' displays in the metadata tree only if the 'From Object' in the association already displays in the metadata tree. For more information about adding associations, refer to “Adding Object Associations” in the Metadata Manager User Guide.

● Add the association to the IMM.properties file. Metadata Manager only displays objects in the metadata tree if the corresponding association between their classes is included in the IMM.properties file.

Note: Some associations are not explicitly defined among the classes of objects. Some objects reuse associations based on the ancestors of the classes. The metadata tree displays objects that have explicit or reused associations.

To Add an Association to the IMM.properties File

1. Open the IMM.properties file. The file is located in the following directory:

● For WebLogic: <WebLogic_home>\user_projects\domains\<domain>
● For WebSphere: <WebSphere_home>\DeploymentManager
● For JBoss: <JBoss_home>\bin

2. Add the association ID under the findtab.parentChildAssociations parameter:

To determine the ID of an association, click Administration > Metamodel Management > Associations, and then click the association on the Associations page.

3. Save and close the IMM.properties file.
4. Stop and then restart the Metadata Manager Server to apply the changes.
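For repeated deployments, the manual edit above can be scripted. The sketch below is a minimal illustration, assuming the findtab.parentChildAssociations value is a comma-separated list of association IDs; confirm the value format against an existing IMM.properties file before using it, and remember to restart the Metadata Manager Server afterward (step 4).

    # Illustrative append of an association ID to findtab.parentChildAssociations.
    # Assumes a comma-separated list of IDs; verify against your IMM.properties.
    from pathlib import Path

    def add_association(properties_path: str, association_id: str) -> None:
        """Add an association ID to the findtab.parentChildAssociations entry."""
        path = Path(properties_path)
        lines = path.read_text().splitlines()
        key = "findtab.parentChildAssociations"
        for i, line in enumerate(lines):
            if line.strip().startswith(key):
                current = line.split("=", 1)[1].strip() if "=" in line else ""
                ids = [v for v in current.split(",") if v.strip()]
                if association_id not in ids:
                    ids.append(association_id)
                lines[i] = f"{key}={','.join(ids)}"
                break
        else:
            lines.append(f"{key}={association_id}")
        path.write_text("\n".join(lines) + "\n")

    # Example (path and ID are placeholders):
    # add_association(r"<WebLogic_home>\user_projects\domains\mydomain\IMM.properties", "12345")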

Customizing Metadata Manager Metadata Browser

The Metadata Browser, on the Metadata Directory page, is used for browsing source repository metadata stored in the Metadata Manager Warehouse. The following figure shows a sample metadata directory page on the 'Find Tab' of Metadata Manager.


The Metadata Directory page consists of the following areas:

● Query Task Area - allows you to search for metadata objects stored in the Metadata Manager Warehouse.
● Metadata Tree Task Area - allows you to navigate to a metadata object in a particular repository.
● Results Task Area - displays metadata objects based on an object search in the Query Task area or on the object selected in the Metadata Tree Task area.
● Details Task Area - displays properties of the selected object. You can also view associations between the object and other objects, and run related reports from the Details Task area.

For more information about the Metadata Directory page on the Find tab, refer to the “Accessing Source Repository Metadata” chapter in the Metadata Manager User Guide.

You can perform the following customizations while browsing the source repository metadata:

Configure Display Properties

Metadata Manager displays a set of default properties for all items in the 'Results Task' area. The default properties are generic properties that apply to all metadata objects stored in the Metadata Manager Warehouse.

By default, Metadata Manager displays the following properties in the 'Results Task' area for each source repository object:

● Class - Displays an icon that represents the class of the selected object. The class name appears when you place the pointer over the icon.

● Label - Label of the object.
● Source Update Date - Date the object was last updated in the source repository.
● Repository Name - Name of the source repository from which the object originates.
● Description - Describes the object.

The default properties that appear in the 'Results Task' area can, however, be rearranged, added, and/or removed for a Metadata Manager user account. For example, you can remove the default Class and Source Update Date properties, move the Repository Name property to precede the Label property, and add a different property, such as the Warehouse Insertion Date, to the list.

Additionally, you can add other properties that are specific to the class of the selected object. With the exception of Label, all other default properties can be removed. You can select up to ten properties to display in the 'Results Task' area. Metadata Manager displays them in the order specified while configuring.

If there are more than ten properties to display, Metadata Manager displays the first ten, displaying common properties first in the order specified and then all remaining properties in alphabetical order based on the property display label.

Applying Favorite Properties for Multiple Classes of Objects

The modified property display settings can be applied to any class of objects displayed in the 'Results Task' area. When selecting an object in the metadata tree, multiple classes of objects can appear in the 'Results Task' area. The following figure shows how to apply the modified display settings for each class of objects in the 'Results Task' area:


The same settings can be applied to the other classes of objects that currently display in the 'Results Task' area.

If the settings are not applied to the other classes, then the settings apply to the objects of the same class as the object selected in the metadata tree.

Configuring Object Links

Object links are created to link related objects without navigating the metadata tree or searching for the object. Refer to the Metadata Manager User Guide to configure the object link.

Configuring Report Links

Report links can be created to run reports on a particular metadata object. When creating a report link, assign a Metadata Manager report to a specific object. While creating a report link, you can also create a run report button to run the associated report. The run report button appears in the top, right corner of the 'Details Task' area. When you create the run report button, you also have the option of applying it to all objects of the same class. You can create a maximum of three run report buttons per object.

Customizing Metadata Manager Packaged Reports, Dashboards, and Indicators

You can create new reporting elements and attributes under ‘Schema Design’. These elements can be used in new reports or existing report extensions. You can also extend or customize out-of-the-box reports, indicators, or dashboards. Informatica recommends using the ‘Save As’ new report option for such changes in order to avoid any conflicts during upgrades.

Further, you can use Data Analyzer's 1-2-3-4 report creation wizard to create new reports. Informatica recommends saving such reports in a new report folder to avoid conflict during upgrades.

Customizing Metadata Manager ODS Reports

Use the operational data store (ODS) report templates to analyze metadata stored in a particular repository. Although these reports can be used as is, they can also be customized to suit particular business requirements. Out-of-the-box reports can be used as a guideline for creating reports for other types of source repositories, such as a repository for which Metadata Manager does not prepackage an XConnect.

Last updated: 01-Feb-07 18:53


Estimating Metadata Manager Volume Requirements

Challenge

Understanding the relationship between various inputs for the Metadata Manager solution in order to estimate volumes for the Metadata Manager Warehouse.

Description

The size of the Metadata Manager warehouse is directly proportional to the size of metadata being loaded into it. The size is also dependent on the number of element attributes being captured in source metadata and the associations defined in the metamodel.

When estimating volume requirements for a Metadata Manager implementation, consider the following Metadata Manager components:

• Metadata Manager Server
• Metadata Manager Console
• Metadata Manager Integration Repository
• Metadata Manager Warehouse

Note: Refer to the Metadata Manager Installation Guide for complete information on minimum system requirements for server, console and integration repository.

Considerations

Volume estimation for Metadata Manager is an iterative process. Use the Metadata Manager development environment to get accurate size estimates for the Metadata Manager production environment. The required steps are as follows:

1. Identify the source metadata that needs to be loaded in the Metadata Manager production warehouse.

2. Size the Metadata Manager development warehouse based on the initial sizing estimates (as explained in the next section of this document).

3. Run the XConnects and monitor the disk usage. The development data loaded during the initial run of the XConnects should be used as a baseline for further sizing estimates.

4. If an XConnect fails due to lack of disk space, add disk space and restart the XConnect.

Repeat steps 1 through 4 until the XConnect run is successful.

Initial sizing estimates should be prepared for each component of a typical Metadata Manager implementation: the Metadata Manager Server, the Metadata Manager Console, the Metadata Manager Integration Repository, and the Metadata Manager Warehouse. (The figures illustrating these estimates are not reproduced here.)

The following estimation matrix should help in deriving a reasonable initial estimate. For larger input sizes, expect the Metadata Manager warehouse target size to increase in direct proportion.

• Metamodel and other tables - approximately 50MB, independent of source metadata size
• PowerCenter - 1MB of source metadata yields approximately 10MB in the Metadata Manager warehouse
• Data Analyzer - 1MB of source metadata yields approximately 4MB
• Database - 1MB of source metadata yields approximately 5MB
• Other XConnects - 1MB of source metadata yields approximately 4.5MB
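As a quick illustration of applying the matrix, the sketch below multiplies each source's metadata size by the ratio from the table and adds the fixed metamodel overhead. The input sizes in the example are hypothetical.

    # Rough warehouse sizing estimate based on the matrix above.
    # Input sizes (in MB) are hypothetical examples; replace them with
    # the measured size of your own source metadata.
    RATIOS_MB_PER_MB = {
        "PowerCenter": 10.0,
        "Data Analyzer": 4.0,
        "Database": 5.0,
        "Other XConnect": 4.5,
    }
    METAMODEL_OVERHEAD_MB = 50.0

    def estimate_warehouse_mb(input_sizes_mb: dict) -> float:
        """Return the estimated Metadata Manager warehouse size in MB."""
        total = METAMODEL_OVERHEAD_MB
        for source, size_mb in input_sizes_mb.items():
            total += RATIOS_MB_PER_MB[source] * size_mb
        return total

    # Example: 12MB of PowerCenter metadata, 3MB of Data Analyzer metadata,
    # and 20MB of database catalog metadata.
    print(estimate_warehouse_mb({"PowerCenter": 12, "Data Analyzer": 3, "Database": 20}))
    # -> 50 + 120 + 12 + 100 = 282 (MB)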

Last updated: 01-Feb-07 18:53


Metadata Manager Load Validation

Challenge

Just as it is essential to know that all data for the current load cycle has loaded correctly, it is important to ensure that all metadata extractions (XConnects) loaded correctly into the Metadata Manager warehouse. If metadata extractions do not execute successfully, the Metadata Manager warehouse will not be current with the most up-to-date metadata.

Description

The process for validating Metadata Manager metadata loads is very simple using the Metadata Manager Configuration Console. In the Metadata Manager Configuration Console, you can view the run history for each of the XConnects. For those who are familiar with PowerCenter, the “Run History” portion of the Metadata Manager Configuration Console is similar to the Workflow Monitor in PowerCenter.

To view XConnect run history, first log into the Metadata Manager Configuration Console.

After logging into the console, click Administration > Repositories. The XConnect Repositories are displayed with their last load date and status.


The XConnect run history is displayed on the “Source Repository Management” screen. A Metadata Manager Administrator should log into the Metadata Manager Configuration Console on a regular basis and verify that all XConnects that were scheduled ran to successful completion.


If any XConnect has a status of “Failed” in the Last Refresh Status column, the issue should be investigated and corrected, and the XConnect should be re-executed. XConnects can fail for a variety of reasons common in IT, such as unavailability of the database, network failure, or improper configuration.

More detailed error messages can be found in the activity log or in the workflow log files. By clicking the “Output” tab of the selected XConnect in the Metadata Manager Console, you can view the output for its most recent run. In most cases, logging is set up to write to the <PowerCenter installation directory>\client\Console\ActivityLog file.

After investigating and correcting the issue, the XConnect that failed should be re-executed at the next available time in order to load the most recent metadata.
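Where a daily manual check is impractical, the status review can be partially automated by scanning the activity log for failures. The sketch below is illustrative only: it assumes the activity log is plain text and that failed runs are recorded with the word "Failed"; verify both assumptions, and the log path, against your own environment.

    # Illustrative scan of the Metadata Manager activity log for failed loads.
    # Assumes a plain-text log in which failed XConnect runs contain "Failed";
    # adjust the path and the match string to your environment.
    from pathlib import Path

    ACTIVITY_LOG = Path(r"C:\Informatica\PowerCenter\client\Console\ActivityLog")  # placeholder path

    def find_failures(log_path: Path) -> list:
        """Return log lines that appear to record a failed XConnect run."""
        failures = []
        for line in log_path.read_text(errors="ignore").splitlines():
            if "Failed" in line:
                failures.append(line)
        return failures

    if __name__ == "__main__":
        for line in find_failures(ACTIVITY_LOG):
            print(line)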

Last updated: 01-Feb-07 18:53


Metadata Manager Migration Procedures

Challenge

This Best Practice describes the processes to follow, as part of deploying Metadata Manager in multiple environments, whenever out-of-the-box Metadata Manager components are customized or configured, or when new components are added to Metadata Manager. Because the Metadata Manager product consists of multiple components, the steps apply to individual product components. The deployment processes are divided into the following four categories:

Reports: Changes to the reporting schema and to the out-of-the-box reports, as well as any new reports or schema elements created to meet the custom reporting needs of the specific implementation.

Metamodel: Creation of new metamodel components to associate custom metadata with repository types and domains that are not covered by the out-of-the-box Metadata Manager repository types.

Metadata: Creation of new metadata objects, their properties, or their associations against repository instances configured within Metadata Manager. These repository instances can belong either to the repository types supported out of the box by Metadata Manager or to new repository types configured through custom additions to the metamodels.

Integration Repository: Changes to the out-of-the-box PowerCenter workflows or mappings, as well as any new PowerCenter objects (mappings, transformations, etc.) and their associated workflows.

Description

Report Changes

The following summarizes the various scenarios related to the reporting area and the actions to take when deploying the changed components. It is always advisable to create new schema elements (metrics, attributes, etc.) or new reports in a new Data Analyzer folder to facilitate exporting and importing the Data Analyzer objects across development, test, and production.

Nature of Report Change: Modify schema component (metric, attribute, etc.)

● Development: Perform the change in development, test it, and certify it for deployment. Do an XML export of the changed components.
● Test: Import the XML exported in the development environment. Answer ‘Yes’ to overriding the definitions that already exist for the changed schema components.
● Production: Import the XML exported in the development environment. Answer ‘Yes’ to overriding the definitions that already exist for the changed schema components.

Nature of Report Change: Modify an existing report (add or delete metrics, attributes, or filters, change formatting, etc.)

● Development: Perform the change in development, test it, and certify it for deployment. Do an XML export of the changed report.
● Test: Import the XML exported in the development environment. Answer ‘Yes’ to overriding the definitions that already exist for the changed report.
● Production: Import the XML exported in the development environment. Answer ‘Yes’ to overriding the definitions that already exist for the changed report.

Nature of Report Change: Add new schema component (metric, attribute, etc.)

● Development: Perform the change in development, test it, and certify it for deployment. Do an XML export of the new schema components.
● Test: Import the XML exported in the development environment.
● Production: Import the XML exported in the development environment.

Nature of Report Change: Add new report

● Development: Perform the change in development, test it, and certify it for deployment. Do an XML export of the new report.
● Test: Import the XML exported in the development environment.
● Production: Import the XML exported in the development environment.

Metamodel Changes

The following summarizes the various scenarios related to the metamodel area and the actions to take when deploying the changed components.

Nature of the Change: Add new metamodel component

● Development: Perform the change in development, test it, and certify it for deployment. Do an XML export of the new metamodel components (export can be done at three levels: Originators, Repository Types, and Entry Points) using the “Export Metamodel” option.
● Test: Import the XML exported in the development environment using the “Import Metamodel” option.
● Production: Import the XML exported in the development environment using the “Import Metamodel” option.

Integration Repository Changes

The following summarizes the various scenarios related to the integration repository area and the actions to take when deploying the changed components. It is always advisable to create new mappings, transformations, workflows, etc. in a new PowerCenter folder so that it is easy to export the ETL objects across development, test, and production.

Nature of the Change: Modify an existing mapping, transformation, and/or the associated workflows

● Development: Perform the change in development, test it, and certify it for deployment. Do an XML export of the changed objects.
● Test: Import the XML exported in the development environment. Answer ‘Yes’ to overriding the definitions that already exist for the changed object.
● Production: Import the XML exported in the development environment. Answer ‘Yes’ to overriding the definitions that already exist for the changed object.

Nature of the Change: Add new ETL object (mapping, transformation, etc.) and create an associated workflow

● Development: Perform the change in development, test it, and certify it for deployment. Do an XML export of the new objects.
● Test: Import the XML exported in the development environment.
● Production: Import the XML exported in the development environment.

Last updated: 01-Feb-07 18:53


Metadata Manager Repository Administration

Challenge

The task of administering the Metadata Manager repository involves taking care of both the integration repository and the Metadata Manager warehouse. This requires knowledge of both the PowerCenter administrative features (for the integration repository used by Metadata Manager) and the Metadata Manager administration features.

Description

A Metadata Manager administrator needs to be involved in the following areas to ensure that the Metadata Manager metadata warehouse is fulfilling the end-user needs:

● Migration of Metadata Manager objects created in the Development environment to QA or the Production environment

● Creation and maintenance of access and privileges of Metadata Manager objects

● Repository backups
● Job monitoring
● Metamodel creation

Migration from Development to QA or Production

In cases where a client has modified out-of-the-box objects provided in Metadata Manager or created a custom metamodel for custom metadata, the objects must be tested in the Development environment prior to being migrated to the QA or Production environments. The Metadata Manager Administrator needs to do the following to ensure that the objects are in sync between the two environments:

● Install a new Metadata Manager instance for the QA/Production environment. This involves creating a new integration repository and Metadata Manager warehouse

● Export the metamodel from the Development environment and import it to QA or Production via the XML Import/Export functionality (in the Metadata Manager Administration tab) or via the Metadata Manager command line utility.

● Export the custom or modified reports created or configured in the Development environment and import them to QA or Production via the XML Import/Export functionality in the Administration tab. This functionality is identical to the function in Data Analyzer; refer to the Data Analyzer Administration Guide for details on the import/export function.

Providing Access and Privileges

Users can perform a variety of Metadata Manager tasks based on their privileges. The Metadata Manager Administrator can assign privileges to users by assigning them roles. Each role has a set of privileges that allow the associated users to perform specific tasks. The Administrator can also create groups of users so that all users in a particular group have the same functions. When an Administrator assigns a role to a group, all users of that group receive the privileges assigned to the role. For more information about privileges, users, and groups, see the Data Analyzer Administrator Guide.

The Metadata Manager Administrator can assign privileges to users to enable them to perform any of the following tasks in Metadata Manager:

● Configure reports. Users can view particular reports, create reports, and/or modify the reporting schema.

● Configure the Metadata Manager Warehouse. Users can add, edit, and delete repository objects using Metadata Manager.

● Configure metamodels. Users can add, edit, and delete metamodels.

Metadata Manager also allows the Administrator to create access permissions on specific source repository objects for specific users. Users can be restricted to reading, writing, or deleting source repository objects that appear in Metadata Manager.

Similarly, the Administrator can establish access permissions for source repository objects in the Metadata Manager warehouse. Access permissions determine the tasks that users can perform on specific objects. When the Administrator sets access permissions, he or she determines which users have access to the source repository objects that appear in Metadata Manager. The Administrator can assign the following types of access permissions to objects:


● Read - Grants permission to view the details of an object and the names of any objects it contains.

● Write - Grants permission to edit an object and create new repository objects in the Metadata Manager warehouse.

● Delete - Grants permission to delete an object from a repository.
● Change Permission - Grants permission to change the access permissions for an object.

When a repository is first loaded into the Metadata Manager warehouse, Metadata Manager provides all permissions to users with the System Administrator role. All other users receive read permissions. The Administrator can then set inclusive and exclusive access permissions.

Metamodel Creation

In cases where a client needs to create custom metamodels for sourcing custom metadata, the Metadata Manager Administrator needs to create new packages, originators, repository types and class associations. For details on how to create new metamodels for custom metadata loading and rendering in Metadata Manager, refer to the Metadata Manager Installation and Administration Guide.

Job Monitoring

When Metadata Manager XConnects are running in the Production environment, Informatica recommends monitoring loads through the Metadata Manager Console. The Configuration Console Activity Log can identify the total time it takes for an XConnect to complete. The console maintains a history of all runs of an XConnect, enabling a Metadata Manager Administrator to ensure that load times are meeting the SLA agreed upon with end users and that load times are not increasing inordinately as data grows in the Metadata Manager warehouse.

The Activity Log provides the following details about each repository load:

● Repository Name - name of the source repository defined in Metadata Manager
● Run Start Date - day of week and date the XConnect run began
● Start Time - time the XConnect run started
● End Time - time the XConnect run completed
● Duration - number of seconds the XConnect run took to complete
● Ran From - machine hosting the source repository
● Last Refresh Status - status of the XConnect run, and whether it completed successfully or failed

Repository Backups

When Metadata Manager is running in either the Production or QA environment, Informatica recommends taking periodic backups of the following areas:

● Database backups of the Metadata Manager warehouse
● Backups of the integration repository; Informatica recommends either of two methods for this backup:

❍ The PowerCenter Repository Server Administration Console or the pmrep command line utility
❍ The traditional, native database backup method

The native PowerCenter backup is required but Informatica recommends using both methods because, if database corruption occurs, the native PowerCenter backup provides a clean backup that can be restored to a new database.
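For scheduled integration repository backups, pmrep can be scripted. The sketch below is a minimal illustration using Python; the repository name, domain, user, and output path are placeholders, and the pmrep options shown should be verified against the Command Line Reference for your PowerCenter version.

    # Scripted backup of the Metadata Manager integration repository via pmrep.
    # All connection values and paths are placeholders; verify the pmrep options
    # against the PowerCenter Command Line Reference for your version.
    import subprocess
    from datetime import date

    REPO = "MM_INTEGRATION_REPO"      # placeholder repository name
    DOMAIN = "Domain_MM"              # placeholder domain name
    USER = "backup_user"              # placeholder repository user
    PASSWORD = "changeme"             # better: read from a secured source
    OUTPUT = rf"E:\backups\{REPO}_{date.today():%Y%m%d}.rep"

    def run(args):
        """Run a pmrep command and fail loudly if it returns a non-zero code."""
        subprocess.run(args, check=True)

    run(["pmrep", "connect", "-r", REPO, "-d", DOMAIN, "-n", USER, "-x", PASSWORD])
    run(["pmrep", "backup", "-o", OUTPUT])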

Last updated: 01-Feb-07 18:53


Upgrading Metadata Manager

Challenge

This best practices document summarizes the instructions for a Metadata Manager upgrade and should not be used for upgrading a PowerCenter repository. Refer to the PowerCenter Installation and Configuration Guide for detailed instructions on the PowerCenter or Metadata Manager upgrade process.

Before you start the upgrade process, be sure to check the Informatica support information for the Metadata Manager upgrade path. For instance, Metadata Manager 2.1 or 2.2 should first be upgraded to Metadata Manager 8.1 and then to Metadata Manager 8.1.1.

Also verify the requirements for the following Metadata Manager 8.1.1 components:

● Metadata Manager and Metadata Manager Client
● Web browser
● Databases
● Third-party software
● Code pages
● Application server

For more information about requirements for each component, see Chapter 3 “PowerCenter Prerequisites” in the PowerCenter Installation and Configuration Guide.

Metadata Manager is made up of various components. Except for the Metadata Manager repository, all other Metadata Manager components (i.e., the Metadata Manager Server, PowerCenter repository, PowerCenter clients, and Metadata Manager clients) should be uninstalled and then reinstalled with the latest version of Metadata Manager as part of the upgrade process.

Keep in mind that all modifications and/or customizations to the standard version of Metadata Manager will be lost and will need to be re-created and re-tested after the upgrade process.


Description

Upgrade Steps

1. Set up new repository database and user account.

● Set up a new database/schema for the PowerCenter Metadata Manager repository. For Oracle, set the appropriate storage parameters. For IBM DB2, use a single-node tablespace to optimize PowerCenter performance, configure the system temporary table spaces, and update the heap sizes.

● Create a database user account for the PowerCenter Metadata Manager repository. The database user must have permissions to create and drop tables and indexes, and to select, insert, update, and delete data from tables. For more information, see the PowerCenter Installation and Configuration Guide.

2. Make a copy of the existing Metadata Manager repository.

● You can use any backup or copy utility provided with the database to make a copy of the working Metadata Manager repository prior to upgrading the Metadata Manager. Use the copy of the Metadata Manager repository for the new Metadata Manager installation.

3. Back up the existing parameter files.

● Make a copy of the existing parameter files. If you have custom XConnects and their parameter, attribute, and data files are in a different location, back those up as well. You may need to refer to these files when you later configure the parameters for the custom XConnects as part of the Metadata Manager client upgrade.

For PowerCenter 8.0, you can find the parameter files in the following directory:

PowerCenter_Home\server\infa_shared\SrcFiles

For Metadata Manager, you can find the parameter files in the following directory:

PowerCenter_Home\Server\SrcFiles


4. Export the Metadata Manager mappings that you customized or created for your environment.

● If you made any changes to the standard Metadata Manager mappings, or created new mappings within the Metadata Manager integration repository, export these mappings, workflows, and/or sessions.

● If you created any additional reports, export them as well.

5. Install Metadata Manager.

● Select the Custom installation set and install Metadata Manager. The installer creates a Repository Service and Integration Service in the PowerCenter domain and creates a PowerCenter repository for Metadata Manager. For more information about installing Metadata Manager, see the PowerCenter Installation and Configuration Guide.

6. Stop the Metadata Manager server.

● You must stop the Metadata Manager server before you upgrade the Metadata Manager repository contents. For more information about stopping Metadata Manager, see Appendix C “Starting and Stopping Application Servers” in the PowerCenter Installation and Configuration Guide.

7. Upgrade the Metadata Manager repository.

● Use the Metadata Manager upgrade utility shipped with the latest version of Metadata Manager to upgrade the Metadata Manager repository. For instructions on running the Metadata Manager upgrade utility, see the PowerCenter Installation and Configuration Guide.

8. Complete the Metadata Manager post-upgrade tasks.

After you upgrade the Metadata Manager repository, perform the following tasks:

● Update metamodels for Business Objects and Cognos ReportNet Content Manager.

● Delete obsolete Metadata Manager objects.
● Refresh Metadata Manager views.
● For a DB2 Metadata Manager repository, import metamodels.


For more information about the post-upgrade tasks, see the PowerCenter Installation and Configuration Guide.

9. Upgrade the Metadata Manager Client.

● For instructions on upgrading the Metadata Manager Client, see the PowerCenter Installation and Configuration Guide.

● After you complete the upgrade steps, verify that all dashboards and reports are working correctly in Metadata Manager. When you are sure that the new version is working properly, you can delete the old instance of Metadata Manager and switch to the new version.

10. Compare and redeploy the exported Metadata Manager mappings that were customized or created for your environment.

● If you had any modified Metadata Manager mappings in the previous release of Metadata Manager, check whether the modifications are still necessary. If they are, override or rebuild the changes in the new PowerCenter mappings.

● Import the customized reports into the new environment and check that they still work with the new Metadata Manager environment. If not, make the necessary modifications to make them compatible with the new structure.

11. Upgrade the Custom XConnects

● If you have any custom XConnects in your environment, you need to regenerate the XConnect mappings that were generated by the previous version of the custom XConnect configuration wizard. Before starting the regeneration process, ensure that the absolute paths to the .csv files are the same as the previous version. If all the paths are the same, no further actions are required after the regeneration of the workflows and mappings.

12. Uninstall the previous version of Metadata Manager.

● Verify that the browser and all reports are working correctly in Metadata Manager 8.1. If the upgrade is successful, you can uninstall the previous version of Metadata Manager.


Daily Operations

Challenge

Once the data warehouse has been moved to production, the most important task is keeping the system running and available for the end users.

Description

In most organizations, the day-to-day operation of the data warehouse is the responsibility of a Production Support team. This team is typically involved with the support of other systems and has expertise in database systems and various operating systems. The Data Warehouse Development team becomes, in effect, a customer to the Production Support team. To that end, the Production Support team needs two documents, a Service Level Agreement and an Operations Manual, to help in the support of the production data warehouse.

Monitoring the System

Monitoring the system is useful for identifying any problems or outages before the users notice. The Production Support team must know what failed, where it failed, when it failed, and who needs to be working on the solution. Identifying outages and/or bottlenecks can help to identify trends associated with various technologies. The goal of monitoring is to reduce downtime for the business user. Comparing the monitoring data against threshold violations, service level agreements, and other organizational requirements helps to determine the effectiveness of the data warehouse and any need for changes.

Service Level Agreement

The Service Level Agreement (SLA) outlines how the overall data warehouse system is to be maintained. This is a high-level document that discusses system maintenance and the components of the system, and identifies the groups responsible for monitoring the various components. The SLA should be measurable against key performance indicators. At a minimum, it should contain the following information:

• Times when the system should be available to users.
• Scheduled maintenance window.
• Who is expected to monitor the operating system.
• Who is expected to monitor the database.
• Who is expected to monitor the PowerCenter sessions.
• How quickly the support team is expected to respond to notifications of system failures.
• Escalation procedures that include data warehouse team contacts in the event that the support team cannot resolve the system failure.

Operations Manual

The Operations Manual is crucial to the Production Support team because it provides the information needed to perform the data warehouse system maintenance. This manual should be self-contained, providing all of the information necessary for a production support operator to maintain the system and resolve most problems that can arise. This manual should contain information on how to maintain all data warehouse system components. At a minimum, the Operations Manual should contain:

• Information on how to stop and re-start the various components of the system.
• Ids and passwords (or how to obtain passwords) for the system components.
• Information on how to re-start failed PowerCenter sessions and recovery procedures.
• A listing of all jobs that are run, their frequency (daily, weekly, monthly, etc.), and the average run times.
• Error handling strategies.
• Who to call in the event of a component failure that cannot be resolved by the Production Support team.

PowerExchange Operations Manual

The need to maintain archive logs and listener logs, use started tasks, perform recovery, and other operation functions on MVS are challenges that need to be addressed in the Operations Manual. If listener logs are not cleaned up on a regular basis, operations is likely to face space issues. Setting up archive logs on MVS requires datasets to be allocated and sized. Recovery after failure requires operations intervention to restart workflows and set the restart tokens. For Change Data Capture, operations are required to start the started tasks in a scheduler and/or after an IPL. There are certain commands that need to be executed by operations.

The PowerExchange Reference Guide (8.1.1) and the related Adapter Guide provide detailed information on the operation of PowerExchange Change Data Capture.

Archive/Listener Log Maintenance

The archive log should be controlled by using the Retention Period specified in the EDMUPARM ARCHIVE_OPTIONS in parameter ARCHIVE_RTPD=. The default supplied in the Install (in RUNLIB member SETUPCC2) is 9999. This is generally longer than most organizations need. To change it, just rerun the first step (and only the first step) in SETUPCC2 after making the appropriate changes. Any new archive log datasets will be created with the new retention period. This does not, however, fix the archive datasets created with the old retention period; to do that, use SMS to override the specification, removing the need to change the EDMUPARM.

The listener default logs are part of the joblog of the running listener. If the listener job runs continuously, there is a potential risk of the spool file reaching the maximum and causing issues with the listener. For example, if the listener started task is scheduled to restart every weekend, the log will be refreshed and a new spool file will be created.

If necessary, change the started task listener jobs from //DTLLOG DD SYSOUT=* to //DTLLOG DD DSN=&HLQ..LOG, which will log the file to the member LOG in the HLQ..RUNLIB.

Recovery After Failure

The last resort recovery procedure is to re-execute your initial extraction and load, and restart the CDC process from the new initial load start point. Fortunately there are other solutions. In any case, if you do need every change, re-initializing may not be an option.

Application ID

The PowerExchange documentation talks about “consuming” applications – the processes that extract changes, whether they are realtime or change (periodic batch extraction). Each “consuming” application must identify itself to PowerExchange. Realistically, this means that each session must have an application id parameter containing a unique “label”.

Restart Tokens

PowerExchange remembers each time that a consuming application successfully extracts changes. The end-point of the extraction (Address in the database Log – RBA or SCN) is stored in a file on the server hosting the Listener that reads the changed data. Each of these memorized end-points (i.e., Restart Tokens) is a potential restart point. It is possible, using the Navigator interface directly, or by updating the restart file,


to force the next extraction to restart from any of these points. If you’re using the ODBC interface for PowerExchange, this is the best solution to implement.

If you are running periodic extractions of changes and everything finishes cleanly, the restart token history is a good approach to recovery back to a previous extraction. You simply chose the recovery point from the list and re-use it.

There are more likely scenarios though. If you are running realtime extractions, potentially never-ending or until there’s a failure, there are no end-points to memorize for restarts. If your batch extraction fails, you may already have processed and committed many changes. You can’t afford to “miss” any changes and you don’t want to reapply the same changes you’ve just processed, but the previous restart token does not correspond to the reality of what you’ve processed.

If you are using the Power Exchange Client for PowerCenter (PWXPC), the best answer to the recovery problem lies with PowerCenter, which has historically been able to deal with restarting this type of process – Guaranteed Message Delivery. This functionality is applicable to both realtime and change CDC options.

The PowerExchange Client for PowerCenter stores the Restart Token of the last successful extraction run for each Application Id in files on the PowerCenter Server. The directory and file name are required parameters when configuring the PWXPC connection in the Workflow Manager. This functionality greatly simplifies recovery procedures compared to using the ODBC interface to PowerExchange.

To enable recovery, select the Enable Recovery option in the Error Handling settings of the Configuration tab in the session properties. During normal session execution, PowerCenter Server stores recovery information in cache files in the directory specified for $PMCacheDir.

Normal CDC Execution

If the session ends "cleanly" (i.e., zero return code), PowerCenter writes tokens to the restart file, and the GMD cache is purged.

If the session fails, you are left with unprocessed changes in the GMD cache and a Restart Token corresponding to the point in time of the last of the unprocessed changes. This information is useful for recovery.

Recovery

If a CDC session fails, and it was executed with recovery enabled, you can restart it in recovery mode – either from the PowerCenter Client interfaces or using the pmcmd command line instruction. Obviously, this assumes that you are able to identify that the session failed previously.

1. Start from the point in time specified by the Restart Token in the GMD cache.
2. PowerCenter reads the change records from the GMD cache.
3. PowerCenter processes and commits the records to the target system(s).
4. Once the records in the GMD cache have been processed and committed, PowerCenter purges the records from the GMD cache and writes a restart token to the restart file.
5. The PowerCenter session ends “cleanly”.

The CDC session is now ready for you to execute in normal mode again.
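Where operators script this restart rather than using the client tools, a thin wrapper around pmcmd can be used. The sketch below is an illustration only: the service, domain, folder, and workflow names are placeholders, and the recoverworkflow flags shown are assumptions that should be checked against the PowerCenter Command Line Reference for your version.

# Hypothetical wrapper around pmcmd to restart a failed CDC workflow in
# recovery mode. All names are placeholders, and the exact recoverworkflow
# flags should be verified against the Command Line Reference for your
# PowerCenter version.
import subprocess

def recover_workflow(service, domain, user, password, folder, workflow):
    cmd = [
        "pmcmd", "recoverworkflow",
        "-sv", service,        # Integration Service name
        "-d", domain,          # PowerCenter domain name
        "-u", user,
        "-p", password,
        "-f", folder,          # repository folder containing the workflow
        workflow,              # workflow name
    ]
    result = subprocess.run(cmd, capture_output=True, text=True)
    print(result.stdout or result.stderr)
    return result.returncode == 0

if __name__ == "__main__":
    recover_workflow("INT_SVC_CDC", "Domain_Example", "operator", "secret",
                     "CDC_FOLDER", "wf_cdc_orders")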

Recovery Using PWX ODBC Interface

You can, of course, successfully recover if you are using the ODBC connectivity to PowerCenter, but you have to build in some things yourself – coping with processing all the changes from the last restart token, even if you’ve already processed some of them.


When you re-execute a failed CDC session, you receive all the changed data since the last Power Exchange restart token. Your session has to cope with processing some of the same changes you already processed at the start of the failed execution – either using lookups/joins to the target to see if you’ve already applied the change you are processing, or simply ignoring database error messages such as trying to delete a record you already deleted.

If you run DTLUAPPL to generate a restart token periodically during the execution of your CDC extraction and save the results, you can use the generated restart token to force a recovery at a more recent point in time than the last session-end restart token. This is especially useful if you are running realtime extractions using ODBC, otherwise you may find yourself re-processing several days of changes you’ve already processed.

Finally, you can always re-initialize the target and the CDC processing:

• Take an image copy of the tablespace containing the table to be captured, with the QUIESCE option.
• Monitor the EDMMSG output from the PowerExchange Logger job.
• Look for message DTLEDM172774I, which identifies the PowerExchange Logger sequence number corresponding to the QUIESCE event.
• The Logger output shows detail in the following format:

DB2 QUIESCE of TABLESPACE TSNAME.TBNAME at DB2 RBA/LRSN 0000000000000849C56185
EDP Logger RBA . . . . . . . . . : D5D3D3D34040000000084E00000
Sequence number . . . . . . . . : 000000084E00000
Edition number . . . . . . . . . : B93C4F9C2A79B000
Source EDMNAME(s) . . . . . . . : DB2DSN1CAPTNAME1

• Take note of the log sequence number.
• Repeat for all tables that form part of the same PowerExchange Application.
• Run the DTLUAPPL utility specifying the application name and the registration name for each table in the application. Alter the SYSIN as follows:

MOD APPL REGDEMO DSN1 (where REGDEMO is the Registration name from Navigator)
 add RSTTKN CAPDEMO (where CAPDEMO is the Capture name from Navigator)
 SEQUENCE 000000084E0000000000000000084E0000000
 RESTART D5D3D3D34040000000084E0000000000
END APPL REGDEMO (where REGDEMO is the Registration name from Navigator)

Note how the SEQUENCE value is a repeated string built from the sequence number found in the Logger messages after the Copy/Quiesce.

Note that the RESTART parameter specified in the DTLUAPPL job is the EDP Logger RBA generated in the same message sequence. This sets the extraction start point on the PowerExchange Logger to the point at which the QUIESCE was done above.

The image copy obtained above can be used for the initial materialization of the target tables.


PowerExchange Tasks: MVS Start and Stop Command Summary

Listener
Start command: /S DTLLST
Stop commands, in order of preference:
  /F DTLLST,CLOSE (preferred method)
  /F DTLLST,CLOSE,FORCE (if CLOSE does not work)
  /P DTLLST (if FORCE does not work)
  /C DTLLST (if STOP does not work)
Description: The PowerExchange Listener is used for bulk data movement and for registering sources for Change Data Capture.

Agent
Start command: /S DTLA
Stop command: /DTLA SHUTDOWN
Notes: /DTLA DRAIN and SHUTDOWN COMPLETELY can be used only at the request of Informatica Support.
Description: The PowerExchange Agent is used to manage connections to the PowerExchange Logger and to handle repository and other tasks. It must be started before the Logger.

Logger
Start command: /S DTLL (if you are installing, you need to run setup2 here prior to starting the Logger)
Stop commands: /P DTLL or /F DTLL,STOP
Notes: /F DTLL,DISPLAY is also available.
Description: The PowerExchange Logger is used to manage the linear datasets and hiperspace that hold change capture data.

ECCR (DB2)
Start command: /S DTLDB2EC
Stop commands: /F DTLDB2EC,STOP or /F DTLDB2EC,QUIESCE or /P DTLDB2EC
Notes: The STOP command just cancels the ECCR; QUIESCE waits for open UOWs to complete. /F DTLDB2EC,DISPLAY will publish stats into the ECCR sysout. There must be registrations present prior to bringing up most adaptor ECCRs.

Condense
Start command: /S DTLC
Stop command: /F DTLC,SHUTDOWN
Description: The PowerExchange Condenser is used to run condense jobs against the PowerExchange Logger. This is used with PowerExchange CHANGE to organize the data by table, allow for interval-based extraction, and optionally fully condense multiple changes to a single row.

Apply
Start command: Submit JCL or /S DTLAPP
Stop commands: (1) F <Listener job>,D A to identify all tasks running through a certain listener, then (2) F DTLLST,STOPTASK name to stop the Apply, where name is the apply name (e.g., DBN2). If the CAPX access and apply is running locally, not through a listener, issue <Listener job>,CLOSE.
Description: The PowerExchange Apply process is used in situations where straight replication is required and the data is not moved through PowerCenter before landing in the target.

Notes:

1. /p is an MVS STOP command; /f is an MVS MODIFY command.
2. Remove the / if the command is done from the console, not SDSF.

If you attempt to shut down the Logger before the ECCR(s), a message indicates that there are still active ECCRs and that the Logger will come down after the ECCRs go away. The recommended approach is as follows.

You can shut the Listener and the ECCR(s) down at the same time.

The Listener:

1. F <Listener_job>,CLOSE
2. If this isn’t coming down fast enough for you, issue F <Listener_job>,CLOSE FORCE
3. If it still isn’t coming down fast enough, issue C <Listener_job>

Note that these commands are listed in the order of most to least desirable method for bringing the listener down.

The DB2 ECCR:

1. F <DB2 ECCR>,QUIESCE - this waits for all OPEN UOWs to finish, which can take a while if a long-running batch job is running.
2. F <DB2 ECCR>,STOP - this terminates immediately.
3. P <DB2 ECCR> - this also terminates immediately.


Once the ECCR(s) are down, you can then bring the Logger down.

The Logger: P <Logger job_name>

The Agent: CMDPREFIX SHUTDOWN

If you know that you are headed for an IPL, you can issue all these commands at the same time. The Listener and ECCR(s) should start coming down. If you are looking for speed, issue F <Listener_job>,CLOSE FORCE to shut down the Listener, then issue F <DB2 ECCR>,STOP to terminate the DB2 ECCR, and then shut down the Logger and the Agent.

Note: Bringing the Agent down before the ECCR(s) are down can result in a loss of captured data. If a new file/DB2 table/IMS database is being updated during this shutdown process and the Agent is not available, the call to see if the source is registered returns a “Not being captured” answer. The update, therefore, occurs without you capturing it, leaving your target in a broken state (which you won't know about until too late!).

Sizing the Logger

When you install PWX-CHANGE, up to two active log data sets are allocated with minimum size requirements. The information in this section can help to determine if you need to increase the size of the data sets, and if you should allocate additional log data sets. When you define your active log data sets, consider your system’s capacity and your changed data requirements, including archiving and performance issues.

After the PWX Logger is active, you can change the log data set configuration as necessary. In general, remember that you must balance the following variables:

• Data set size
• Number of data sets
• Amount of archiving

The choices you make depend on the following factors:

• Resource availability requirements
• Performance requirements
• Whether you are running near-realtime or batch replication
• Data recovery requirements

An inverse relationship exists between the size of the log data sets and the frequency of archiving required. Larger data sets need to be archived less often than smaller data sets.

Note: Although smaller data sets require more frequent archiving, the archiving process requires less time.

Use the following formulas to estimate the total space you need for each active log data set. For an example of the calculated data set size, refer to the PowerExchange Reference Guide.

• active log data set size in bytes = (average size of captured change record * number of changes captured per hour * desired number of hours between archives) * (1 + overhead rate)
• active log data set size in tracks = active log data set size in bytes / number of usable bytes per track
• active log data set size in cylinders = active log data set size in tracks / number of tracks per cylinder

When determining the average size of your captured change records, note the following information:


• PWX Change Capture captures the full object that is changed. For example, if one field in an IMS segment has changed, the product captures the entire segment.

• The PWX header adds overhead to the size of the change record. Per record, the overhead is approximately 300 bytes plus the key length.

• The type of change transaction affects whether PWX Change Capture includes a before-image, after-image, or both:

o DELETE includes a before-image.
o INSERT includes an after-image.
o UPDATE includes both.

Informatica suggests using an overhead rate of 5 to 10 percent, which includes the following factors:

• Overhead for control information.
• Overhead for writing recovery-related information, such as system checkpoints.

You have some control over the frequency of system checkpoints when you define your PWX Logger parameters. See CHKPT_FREQUENCY in the PowerExchange Reference Guide for more information about this parameter.

DASD Capacity Conversion Table

Space Information          Model 3390    Model 3380
usable bytes per track     49,152        40,960
tracks per cylinder        15            15

This example is based on the following assumptions:

• estimated average size of a changed record = 600 bytes
• estimated rate of captured changes = 40,000 changes per hour
• desired number of hours between archives = 12
• overhead rate = 5 percent
• DASD model = 3390

The estimated size of each active log data set in bytes is calculated as follows:

600 * 40,000 * 12 * 1.05 = 302,400,000

The number of cylinders to allocate is calculated as follows:

302,400,000 / 49,152 = approximately 6152 tracks

6152 / 15 = approximately 410 cylinders

The following example shows an IDCAMS DEFINE statement that uses the above calculations:

DEFINE CLUSTER -
  (NAME(HLQ.EDML.PRILOG.DS01) -
  LINEAR -
  VOLUMES(volser) -
  SHAREOPTIONS(2,3) -
  CYL(410) ) -
  DATA -
  (NAME(HLQ.EDML.PRILOG.DS01.DATA) )


The variable HLQ represents the high-level qualifier that you defined for the log data sets during installation.
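The sizing arithmetic above is easy to script for quick what-if estimates. The sketch below is a minimal Python calculator that assumes the Model 3390 geometry from the conversion table and reuses the illustrative figures from this example.

# Sizing sketch for a PowerExchange Logger active log data set, using the
# formulas and the Model 3390 geometry from the conversion table above.
def active_log_size(avg_record_bytes, changes_per_hour, hours_between_archives,
                    overhead_rate=0.05, bytes_per_track=49152, tracks_per_cylinder=15):
    size_bytes = (avg_record_bytes * changes_per_hour * hours_between_archives) * (1 + overhead_rate)
    tracks = size_bytes / bytes_per_track
    cylinders = round(tracks / tracks_per_cylinder)
    return size_bytes, tracks, cylinders

# The illustrative figures from the example: 600-byte records, 40,000 changes
# per hour, 12 hours between archives, 5 percent overhead.
size_bytes, tracks, cylinders = active_log_size(600, 40000, 12)
print(f"{size_bytes:,.0f} bytes, about {tracks:,.0f} tracks, about {cylinders} cylinders")
# Prints: 302,400,000 bytes, about 6,152 tracks, about 410 cylinders.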

Additional Logger Tips

The Logger format utility (EDMLUTL0) formats only the primary space allocation. This means that the Logger does not use secondary allocation. This includes Candidate Volumes and Space, such as that allocated by SMS when using a STORCLAS with the Guaranteed Space attribute. Logger active logs should be defined through IDCAMS with:

• No secondary allocation.
• A single VOLSER in the VOLUME parameter.
• An SMS STORCLAS, if used, without GUARANTEED SPACE=YES.

PowerExchange Agent Commands

You can use commands from the MVS system to control certain aspects of PowerExchange Agent processing. To issue a PowerExchange Agent command, enter the PowerExchange Agent command prefix (as specified by CmdPrefix in your configuration parameters), followed by the command. For example, if CmdPrefix=AG01, issue the following command to close the Agent's message log:

AG01 LOGCLOSE

The PowerExchange Agent intercepts agent commands issued on the MVS console and processes them in the agent address space. If the PowerExchange Agent address space is inactive, MVS rejects any PowerExchange Agent commands that you issue. If the PowerExchange Agent has not been started during the current IPL, or if you issue the command with the wrong prefix, MVS generates the following message:

IEE305I command COMMAND INVALID

See PowerExchange Reference Guide (8.1.1) for detailed information on Agent commands.

PowerExchange Logger Commands

The PowerExchange Logger uses two types of commands: interactive and batch

You run interactive commands from the MVS console when the PowerExchange logger is running. You can use PowerExchange Logger interactive commands to:

• Display PowerExchange Logger log data sets, units of work (UOWs), and reader/writer connections.

• Resolve in-doubt UOWs.
• Stop a PowerExchange Logger.
• Print the contents of the PowerExchange active log file (in hexadecimal format).

You use batch commands primarily in batch change utility jobs to make changes to parameters and configurations when the PowerExchange Logger is stopped. Use PowerExchange Logger batch commands to:

• Define PowerExchange Loggers and PowerExchange Logger options, including PowerExchange Logger names, archive log options, buffer options, and mode (single or dual).
• Add log definitions to the restart data set.
• Delete data set records from the restart data set.
• Display log data sets, UOWs, and reader/writer connections.


See PowerExchange Reference Guide (8.1.1) for detailed information on Logger Commands (Chapter 4, Page 59)

Last updated: 01-Feb-07 18:53


Data Integration Load Traceability

Challenge

Load management is one of the major difficulties facing a data integration or data warehouse operations team. This Best Practice tries to answer the following questions:

• How can the team keep track of what has been loaded?
• What order should the data be loaded in?
• What happens when there is a load failure?
• How can bad data be removed and replaced?
• How can the source of data be identified?
• When was it loaded?

Description

Load management provides an architecture to allow all of the above questions to be answered with minimal operational effort.

Benefits of a Load Management Architecture

Data Lineage

The term Data Lineage is used to describe the ability to track data from its final resting place in the target back to its original source. This requires the tagging of every row of data in the target with an ID from the load management metadata model. This serves as a direct link between the actual data in the target and the original source data.

To give an example of the usefulness of this ID, a data warehouse or integration competency center operations team, or possibly end users, can, on inspection of any row of data in the target schema, link back to see when it was loaded, where it came from, any other metadata about the set it was loaded with, validation check results, number of other rows loaded at the same time, and so forth.

It is also possible to use this ID to link one row of data with all of the other rows loaded at the same time. This can be useful when a data issue is detected in one row and the operations team needs to see if the same error exists in all of the other rows. More than this, it is the ability to easily identify the source data for a specific row in the target, enabling the operations team to quickly identify where a data issue may lie.

It is often assumed that data issues are produced by the transformation processes executed as part of the target schema load. Using the source ID to link back the source data makes it easy to identify whether the issues were in the source data when it was first encountered by the target schema load processes or if those load processes caused the issue. This ability can save a huge amount of time, expense, and frustration -- particularly in the initial launch of any new subject areas.

Process Lineage

Tracking the order that data was actually processed in is often the key to resolving processing and data issues. Because choices are often made during the processing of data based on business rules and logic, the order and path of processing differs from one run to the next. Only by actually tracking these processes as they act upon the data can issue resolution be simplified.


Process Dependency Management

Having a metadata structure in place provides an environment to facilitate the application and maintenance of business dependency rules. Once a structure is in place that identifies every process, it becomes very simple to add the necessary metadata and validation processes required to ensure enforcement of the dependencies among processes. Such enforcement resolves many of the scheduling issues that operations teams typically face.

Process dependency metadata needs to exist because it is often not possible to rely on the source systems to deliver the correct data at the correct time. Moreover, in some cases, transactions are split across multiple systems and must be loaded into the target schema in a specific order. This is usually difficult to manage because the various source systems have no way of coordinating the release of data to the target schema.

Robustness

Using load management metadata to control the loading process also offers two other big advantages, both of which fall under the heading of robustness because they allow for a degree of resilience to load failure.

Load Ordering

Load ordering is a set of processes that use the load management metadata to identify the order in which the source data should be loaded. This can be as simple as making sure the data is loaded in the sequence it arrives, or as complex as having a pre-defined load sequence planned in the metadata.

There are a number of techniques used to manage these processes. The most common is an automated process that generates a PowerCenter load list from flat files in a directory, then archives the files in that list after the load is complete. This process can use embedded data in file names or can read header records to identify the correct ordering of the data. Alternatively the correct order can be pre-defined in the load management metadata using load calendars.

Either way, load ordering should be employed in any data integration or data warehousing implementation because it allows the load process to be automatically paused when there is a load failure, and ensures that the data that has been put on hold is loaded in the correct order as soon as possible after a failure.

The essential part of the load management process is that it operates without human intervention, helping to make the system self healing!
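As an illustration of the file-based technique described in this subsection, the sketch below builds a PowerCenter file list (indirect source file) from whatever flat files have arrived, ordered by a date stamp embedded in the file names, and archives the files once the load completes. The directory paths and the file-name pattern are assumptions for the example, not part of any standard.

# Illustrative load-ordering sketch: build a PowerCenter file list (indirect
# source file) from incoming flat files, ordered by the date embedded in the
# file name (e.g. ff_customer_20070201.dat), then archive the files once the
# load completes. Directory paths and the naming pattern are assumptions.
import re
import shutil
from pathlib import Path

INCOMING = Path("/data/incoming")
ARCHIVE = Path("/data/archive")
FILELIST = Path("/data/lists/ff_customer.lst")     # referenced by the session as an indirect source file
PATTERN = re.compile(r"ff_customer_(\d{8})\.dat$")

def build_file_list():
    """Write the file list in load order (oldest extract first) and return the files."""
    dated = []
    for f in INCOMING.iterdir():
        m = PATTERN.match(f.name)
        if m:
            dated.append((m.group(1), f))           # sort key is the embedded YYYYMMDD stamp
    dated.sort()
    FILELIST.parent.mkdir(parents=True, exist_ok=True)
    FILELIST.write_text("".join(f"{f}\n" for _, f in dated))
    return [f for _, f in dated]

def archive_loaded_files(files):
    """Move files out of the incoming directory after the load has completed."""
    ARCHIVE.mkdir(parents=True, exist_ok=True)
    for f in files:
        shutil.move(str(f), str(ARCHIVE / f.name))

if __name__ == "__main__":
    files = build_file_list()
    print(f"{len(files)} file(s) queued for load in {FILELIST}")
    # archive_loaded_files(files)  # call after the PowerCenter workflow completes

Because the process runs without human intervention, a failed load simply leaves the unprocessed files in the incoming directory, and the next run picks them up in the correct order.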

Rollback

If there is a loading failure or a data issue in normal daily load operations, it is usually preferable to remove all of the data loaded as one set. Load management metadata allows the operations team to selectively roll back a specific set of source data, the data processed by a specific process, or a combination of both. This can be done using manual intervention or by a developed automated feature.


Simple Load Management Metadata Model

As you can see from the simple load management metadata model above, there are two sets of data linked to every transaction in the target tables. These represent the two major types of load management metadata:

• Source tracking
• Process tracking

Source Tracking

Source tracking looks at how the target schema validates and controls the loading of source data. The aim is to automate as much of the load processing as possible and track every load from the source through to the target schema.

Source Definitions

Most data integration projects use batch load operations for the majority of data loading. The sources for these come in a variety of forms, including flat file formats (ASCII, XML etc), relational databases, ERP systems, and legacy mainframe systems.

The first control point for the target schema is to maintain a definition of how each source is structured, as well as other validation parameters.

These definitions should be held in a Source Master table like the one shown in the data model above.

These definitions can and should be used to validate that the structure of the source data has not changed. A great example of this practice is the use of DTD files in the validation of XML feeds.


In the case of flat files, it is usual to hold details like:

• Header information (if any)
• How many columns
• Data types for each column
• Expected number of rows

For RDBMS sources, the Source Master record might hold the definition of the source tables or store the structure of the SQL statement used to extract the data (i.e., the SELECT, FROM and ORDER BY clauses).

These definitions can be used to manage and understand the initial validation of the source data structures. Quite simply, if the system is validating the source against a definition, there is an inherent control point at which problem notifications and recovery processes can be implemented. It’s better to catch a bad data structure than to start loading bad data.

Source Instances

A Source Instance table (as shown in the load management metadata model) is designed to hold one record for each separate set of data of a specific source type being loaded. It should have a direct key link back to the Source Master table which defines its type.

The various source types may need slightly different source instance metadata to enable optimal control over each individual load.

Unlike the source definitions, this metadata will change every time a new extract and load is performed. In the case of flat files, this would be a new file name and possibly date / time information from its header record. In the case of relational data, it would be the selection criteria (i.e., the SQL WHERE clause) used for each specific extract, and the date and time it was executed.

This metadata needs to be stored in the source tracking tables so that the operations team can identify a specific set of source data if the need arises. This need may arise if the data needs to be removed and reloaded after an error has been spotted in the target schema.

Process Tracking

Process tracking describes the use of load management metadata to track and control the loading processes rather than the specific data sets themselves. There can often be many load processes acting upon a single source instance set of data.

While it is not always necessary to be able to identify when each individual process completes, it is very beneficial to know when a set of sessions that move data from one stage to the next has completed. Not all sessions are tracked this way because, in most cases, the individual processes are simply storing data into temporary tables that will be flushed at a later date. Since load management process IDs are intended to track back from a record in the target schema to the process used to load it, it only makes sense to generate a new process ID if the data is being stored permanently in one of the major staging areas.

Process Definition

Process definition metadata is held in the Process Master table (as shown in the load management metadata model ). This, in its basic form, holds a description of the process and its overall status. It can also be extended, with the introduction of other tables, to reflect any dependencies among processes, as well as processing holidays.


Process Instances

A process instance is represented by an individual row in the load management metadata Process Instance table. This represents each instance of a load process that is actually run. This holds metadata about when the process started and stopped, as well as its current status. Most importantly, this table allocates a unique ID to each instance.

The unique ID allocated in the process instance table is used to tag every row of source data. This ID is then stored with each row of data in the target table.

Integrating Source and Process Tracking

Integrating source and process tracking can produce an extremely powerful investigative and control tool for the administrators of data warehouses and integrated schemas. This is achieved by simply linking every process ID with the source instance ID of the source it is processing. This requires that a write-back facility be built into every process to update its process instance record with the ID of the source instance being processed.

The effect is a one-to-many relationship between the source instance table and the process instance table, with several rows in the process instance table for each set of source data loaded into a target schema. For example, in a data warehousing project, there would be a row for loading the extract into a staging area, a row for the move from the staging area to an ODS, and a final row for the move from the ODS to the warehouse.
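A minimal sketch of the four tables and the write-back link described above is shown below. It uses SQLite purely for illustration; the table and column names are assumptions rather than a prescribed schema, and the final query demonstrates the data lineage trace from a target row back to its source file.

# Minimal sketch of the load management metadata model described above,
# using SQLite for illustration only. Table and column names are assumptions.
import sqlite3

ddl = """
CREATE TABLE source_master   (source_id   INTEGER PRIMARY KEY, source_name TEXT, structure_def TEXT);
CREATE TABLE source_instance (src_inst_id INTEGER PRIMARY KEY, source_id INTEGER REFERENCES source_master,
                              file_name TEXT, extracted_at TEXT);
CREATE TABLE process_master  (process_id  INTEGER PRIMARY KEY, description TEXT, status TEXT);
CREATE TABLE process_instance(proc_inst_id INTEGER PRIMARY KEY, process_id INTEGER REFERENCES process_master,
                              src_inst_id INTEGER REFERENCES source_instance,
                              started_at TEXT, ended_at TEXT, status TEXT);
-- every target row carries the process instance ID that loaded it
CREATE TABLE customer_stg    (customer_key INTEGER, customer_name TEXT,
                              proc_inst_id INTEGER REFERENCES process_instance);
"""

con = sqlite3.connect(":memory:")
con.executescript(ddl)

# Register one source file and one load process instance, then tag a target row.
con.execute("INSERT INTO source_master VALUES (1, 'ff_customer', 'flat file, 12 columns')")
con.execute("INSERT INTO source_instance VALUES (10, 1, 'ff_customer_20070201.dat', '2007-02-01 06:00')")
con.execute("INSERT INTO process_master VALUES (100, 'stage ff_customer', 'ACTIVE')")
con.execute("INSERT INTO process_instance VALUES (1000, 100, 10, '2007-02-01 06:05', NULL, 'RUNNING')")
con.execute("INSERT INTO customer_stg VALUES (42, 'ACME Corp', 1000)")

# Data lineage: from a target row, trace back to the source file that delivered it.
row = con.execute("""
    SELECT c.customer_key, si.file_name, pi.started_at
    FROM customer_stg c
    JOIN process_instance pi ON pi.proc_inst_id = c.proc_inst_id
    JOIN source_instance  si ON si.src_inst_id  = pi.src_inst_id
    WHERE c.customer_key = 42""").fetchone()
print(row)   # (42, 'ff_customer_20070201.dat', '2007-02-01 06:05')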

Integrated Load Management Flow Diagram


Tracking Transactions

This is the simplest data to track since it is loaded incrementally and not updated. This means that the process and source tracking discussed earlier in this document can be applied as is.

Tracking Reference Data

This task is complicated by the fact that reference data, by its nature, is not static. This means that if you simply update the data in a row any time there is a change, there is no way that the change can be backed out using the load management practice described earlier. Instead, Informatica recommends always using slowly changing dimension processing on every reference data and dimension table to accomplish source and process tracking. Updating the reference data as a ‘slowly changing table’ retains the previous versions of updated records, thus allowing any changes to be backed out.

Tracking Aggregations

Aggregation also causes additional complexity for load management because the resulting aggregate row very often contains the aggregation across many source data sets. As with reference data, this means that the aggregated row cannot be backed out in the same way as transactions.

This problem is managed by treating the source of the aggregate as if it was an original source. This means that rather than trying to track the original source, the load management metadata only tracks back to the transactions in the target that have been aggregated. So, the mechanism is the same as used for transactions but the resulting load management metadata only tracks back from the aggregate to the fact table in the target schema.

Last updated: 01-Feb-07 18:54


High Availability

Challenge

An increasing number of customers find their Data Integration implementation must be available 24x7 without interruption or failure. This Best Practice describes the High Availability (HA) capabilities incorporated in PowerCenter and explains why it is critical to address both architectural (i.e., systems, hardware, firmware) and procedural (i.e., application design, code implementation, session/workflow features) recovery with HA.

Description

When considering HA recovery, be sure to explore the following two components of HA that exist on all enterprise systems:

External Resilience

External resilience has to do with the integration and specification of domain name servers, database servers, FTP servers, network access servers in a defined, tested 24x7 configuration. The nature of Informatica’s DI setup places it at many interface points in system integration. Before placing and configuring PowerCenter within an infrastructure that has an HA expectation, the following questions should be answered:

● Is the pre-existing set of servers already in a sustained HA configuration? Is there a schematic with applicable settings to use for reference? If so, is there a unit test or system test to exercise before installing PowerCenter products? It is important to remember that the external systems must be HA before the PowerCenter architecture they support can be.

● What are the bottlenecks or perceived failure points of the existing system? Are these bottlenecks likely to be exposed or heightened by placing PowerCenter in the infrastructure? (e.g., five times the amount of Oracle traffic, ten times the amount of DB2 traffic, a UNIX server that always shows 10% idle may now have twice as many processes running).

● Finally, if a proprietary solution (such as IBM HACMP or Veritas Storage Foundation for Windows) has been implemented with success at a customer site, this sets a different expectation. The customer may merely want the grid capability of multiple PowerCenter nodes to splay/recover Informatica tasks, and expect their back-end system (such as those listed above) to provide file system or server bootstrap recovery upon a fundamental failure of those back-end systems. If these back-end systems have a script/command capability to, for example, restart a repository service, PowerCenter can be installed in this fashion. However, PowerCenter's HA capability extends as far as the PowerCenter components.

Internal Resilience


In an HA PowerCenter environment key elements to keep in mind are:

● Rapid and constant connectivity to the repository metadata.

● Rapid and constant network connectivity between all gateway and worker nodes in the PowerCenter domain.

● A common highly-available storage system accessible to all PowerCenter domain nodes with one service name and one file protocol. Only domain nodes on the same operating system can share gateway and log files (see Admin Console->Domain->Properties->Log and Gateway Configuration).

Internal resilience occurs within the PowerCenter environment among PowerCenter services, the PowerCenter Client tools, and other client applications such as pmrep and pmcmd. Internal resilience can be configured at the following levels:

● Domain. Configure service connection resilience at the domain level in the general properties for the domain. The domain resilience timeout determines how long services attempt to connect as clients to application services or the Service Manager. The domain resilience properties are the default values for all services in the domain.

● Service. It is possible to configure service connection resilience in the advanced properties for an application service. When configuring connection resilience for an application service, this overrides the resilience values from the domain settings.

● Gateway. The master gateway node maintains a connection to the domain configuration database. If the domain configuration database becomes unavailable, the master gateway node tries to reconnect. The resilience timeout period depends on user activity and whether the domain has one or multiple gateway nodes:

❍ Single gateway node. If the domain has one gateway node, the gateway node tries to reconnect until a user or service tries to perform a domain operation. When a user tries to perform a domain operation, the master gateway node shuts down.

❍ Multiple gateway nodes. If the domain has multiple gateway nodes and the master gateway node cannot reconnect, then the master gateway node shuts down. If a user tries to perform a domain operation while the master gateway node is trying to connect, the master gateway node shuts down. If another gateway node is available, the domain elects a new master gateway node. The domain tries to connect to the domain configuration database with each gateway node. If none of the gateway nodes can connect, the domain shuts down and all domain operations fail.

Process

Be aware that your implementation has a dependency on the installation environment. For example, you may want to combine multiple disparate ETL repositories onto a single upgraded PowerCenter platform. This has the benefit of:

● Single point of access/administration from the Admin Console.

● A group of repositories that now can become a repository domain.

● A group of repositories that can be shaped into common processing/backup/schedule patterns for optimal performance and administration.

HA items of concern are now:

● Single point of failure of one PowerCenter domain.

● One repository, possibly heavy in processing or poorly designed, degrading that entire PowerCenter domain.

Common Elements of Concern in an HA Configuration

Restart and Failover

Restart and failover concern the domain services (Integration and Repository). Obviously, if these services are not highly available, the scheduling, dependencies (e.g., touch files, FTP, etc.), and artifacts of your ETL cannot be highly available.

If a service process becomes unavailable, the Service Manager can restart the process or fail it over to a backup node based on the availability of the node. When a service process restarts or fails over, the service restores the state of operation and begins recovery from the point of interruption.

You can configure backup nodes for services if you have the high availability option. If you configure an application service to run on primary and backup nodes, one service process can run at a time. The following situations describe restart and failover for an application service:

● If the primary node running the service process becomes unavailable, the service fails over to a backup node. The primary node may be unavailable if it shuts down or if the connection to the node becomes unavailable.

● If the primary node running the service process is available, the domain tries to restart the process based on the restart options configured in the domain properties. If the process does not restart, the Service Manager can mark the process as failed. The service then fails over to a backup node and starts another process. If the Service Manager marks the process as failed, the administrator must enable the process after addressing any configuration problem.

If a service process fails over to a backup node, it does not fail back to the primary node when the node becomes available. You can disable the service process on the backup node to cause it to fail back to the primary node.

Recovery

Recovery is the completion of operations after an interrupted service is restored. When a service recovers, it restores the state of operation and continues processing the job from the point of interruption.


The state of operation for a service contains information about the service process. The PowerCenter services include the following states of operation:

● Service Manager. The Service Manager for each node in the domain maintains the state of service processes running on that node. If the master gateway shuts down, the newly elected master gateway collects the state information from each node to restore the state of the domain.

● Repository Service. The Repository Service maintains the state of operation in the repository. This includes information about repository locks, requests in progress, and connected clients.

● Integration Service. The Integration Service maintains the state of operation in the shared storage configured for the service. This includes information about scheduled, running, and completed tasks for the service. The Integration Service maintains session and workflow state of operation based on the recovery strategy you configure for the session and workflow.

When designing a system that has HA recovery as a core component, be sure to include architectural and procedural recovery.

Architectural recovery for a PowerCenter domain involves the three bulleted items above restarting in a complete, sustainable, and traceable manner. If the Service Manager and Repository Service recover but the Integration Service cannot, the restart is not successful and has little value to a production environment.

Field experience with PowerCenter has yielded these key items in planning a proper recovery upon a systemic failure:

● A PowerCenter domain cannot be established without at least one gateway node running. Even if you have established a domain with ten worker nodes and one gateway node, none of the worker nodes can run ETL jobs without a gateway node managing the domain.

● An Integration Service cannot run without its associated Repository Service being started and connected to its metadata repository.

● A Repository Service cannot run without its metadata repository DBMS being started and accepting database connections. Often database connections are established on periodic windows that expire – which puts the repository offline.

● If the installed domain configuration is running from Authentication Module Configuration and the LDAP Principal User account becomes corrupt or inactive, all PowerCenter repository access is lost. If the installation uses any additional authentication outside PowerCenter (such as LDAP), an additional recovery and restart plan is required.

Procedural recovery is supported with many features of PowerCenter 8. Consider the following very simple mapping that might run in production for many ETL applications:


Suppose there is a situation where the FTP server sending this ff_customer file is inconsistent. Many times the file is not there, but the processes depending on it must always run. The process is always insert only. You do not want the succession of ETL that follows this small process to fail; it can run to customer_stg with current records only. The following setting in the Workflow Manager (Session, Properties) fits this need:

Since it is not critical that the ff_customer records load each time, record the failure but continue the process.

Now say the situation has changed. Sessions are failing on a PowerCenter server due to target database timeouts. A requirement is given that the session must recover from this:

Resuming from last checkpoint restarts the process from its prior commit, allowing no loss of ETL work.

To finish this second case, consider three basic items on the workflow side when HA is incorporated in your environment:


An Integration Service in an HA environment can only recover those workflows marked with “Enable HA recovery”. For all critical workflows, this should be considered.

For a mature set of ETL code running in QA or Production, you may consider the following workflow property:


This would automatically recover tasks from where they failed in a workflow upon an application or system-wide failure. Consider carefully the use of this feature, however. Remember, automated restart of critical ETL processes without interaction can have vast unintended side effects. For instance, if a database alias or synonym was dropped, all ETL targets may now refer to different objects than originally intended. Only PowerCenter environments with HA, mature production support practices, and a complete operations manual per Velocity should expect complete recovery with this feature.

In an HA environment, certain components of the Domain can go offline while the Domain stays up to execute ETL jobs. This is a time to use the “Suspend On Error” feature from the General tab of the Workflow settings. The backup Integration Service would then pick up this workflow and resume processing based on the resume settings of this workflow:


Features

A variety of HA features exist in PowerCenter 8. Specifically, they include:

● Integration Service HA option
● Integration Service Grid option
● Repository Service HA option

First, proceed from an assumption that nodes have been provided to you such that a basic HA configuration of PowerCenter 8 can take place. A lab-tested version completed by Informatica is configured as below with an HP solution. Your solution can be completed with any reliable clustered file system. Your first step would always be implementing and thoroughly exercising a clustered file system:


Now, let’s address the options in order:

Integration Service HA Option

You must have the HA option on the license key for this to be available on install. Note that once the base PowerCenter 8 install is configured, all nodes are available from the Admin Console->Domain->Integration Services->Grid/Node Assignments. From the above example, you would see Node 1, Node 2, Node 3 as dropdown options on that browse page. With the HA (Primary/Backup) install complete, Integration Services are then displayed with both “P” and “B” in a configuration, with the current operating node highlighted:


If a failure were to occur on this HA configuration, the Integration Service INT_SVCS_DEV would poll the Domain: Domain_Corp_RD for another Gateway Node, then assign INT_SVCS_DEV over to that Node, in this case Node_Corp_RD02. Then the “B” button would highlight showing this Node as providing INT_SVCS_DEV.

A vital component of configuring the Integration Service for HA is making sure the Integration Service files are stored in a shared persistent environment. You must specify the paths for Integration Service files for each Integration Service process. Examples of Integration Service files include run-time files, state of operation files, and session log files.

Each Integration Service process uses run-time files to process workflows and sessions. If you configure an Integration Service to run on a grid or to run on backup nodes, the run-time files must be stored in a shared location. Each node must have access to the run-time files used to process a session or workflow. This includes files such as parameter files, cache files, input files, and output files.

State of operation files must be accessible by all Integration Service processes. When you enable an Integration Service, it creates files to store the state of operations for the service. The state of operations includes information such as the active service requests, scheduled tasks, and completed and running processes. If the service fails, the Integration Service can restore the state and recover operations from the point of interruption.

All Integration Service processes associated with an Integration Service must use the same shared location. However, each Integration Service can use a separate location.

By default, the installation program creates a set of Integration Service directories in the server\infa_shared directory. You can set the shared location for these directories by configuring the process variable $PMRootDir to point to the same location for each Integration Service process. The key HA concern is that $PMRootDir should be on the highly-available clustered file system mentioned above.

Integration Service Grid Option

You must have the Server Grid option on the license key for this to be available on install. In configuring the $PMRootDir files for the Integration Service, retain the methodology described above. Also, in Admin Console->Domain->Properties->Log and Gateway Configuration, the log and directory paths should be on the clustered file system mentioned above. A grid must be created before it can be used in a Power Center 8 domain. It is essential to remember that a grid can only be created from machines running the same operating system.

Be sure to remember these key points:

● PowerCenter supports nodes from heterogeneous operating systems, bit modes, and other configurations within the same domain. However, if a grid contains heterogeneous nodes, you can only run a workflow on the grid, not a session.

● A session on a grid does not support heterogeneous operating systems. This is because a session may have a shared cache file and other objects that may not be compatible with all of the operating systems. For a session on a grid, you need a homogeneous grid.

In short, scenarios such as a production failure are the worst possible time to find out that a multi-OS grid does not meet your needs.

If you have a large volume of disparate hardware, it is certainly possible to make perhaps two grids centered on two different operating systems. In either case, the performance of your clustered file system is going to affect the performance of your server grid, and should be considered as part of your performance/maintenance strategy.

Repository Service HA Option

You must have the HA option on the license key for this to be available on install. There are two ways to include the Repository Service HA capability when configuring PowerCenter 8:

● The first is during install. When the Install Program prompts for your nodes to do a Repository install (after answering “Yes” to Create Repository), you can enter a second node where the Install Program can create and invoke the PowerCenter service and Repository Service for a backup repository node. Keep in mind that all of the database, OS, and server preparation steps referred to in the PowerCenter Installation and Configuration Guide still hold true for this backup node. When the install is complete, the Repository Service displays a “P”/”B” link similar to that illustrated above for the INT_SVCS_DEV example Integration Service.

● A second method for configuring Repository Service HA allows for measured, incremental implementation of HA from a tested base configuration. After ensuring that your initial Repository Service settings (e.g., resilience timeout, codepage, connection timeout) and the DBMS repository containing the metadata are running and stable, you can add a second node and make it the Repository Backup. Install the PowerCenter Service on this second server following the PowerCenter Installation and Configuration Guide. In particular, skip creating Repository Content or an Integration Service on the node. Following this, go to Admin Console->Domain and select:

“Create->Node”. The server to contain this node should be of the exact same configuration/clustered file system/OS as the Primary Repository Service.

The following dialog should appear:

Assign a logical name to the node to describe its place, and select “Create”. The node should now be running as part of your domain, but if it isn't, refer to the PowerCenter Command Line Reference with the infaservice and infacmd commands to ensure the node is running on the domain. When it is running, go to Domain->Repository->Properties->Node Assignments->Edit and the browser window displays:

INFORMATICA CONFIDENTIAL BEST PRACTICE 471 of 702

Click “OK” and the Repository Service is now configured in a Primary/Backup setup for the domain. To verify the Primary/Backup configuration, test the following elements:

1. Be certain the same version of the DBMS client is installed on the server and can access the metadata.

2. Both nodes must be on the same clustered file system.

3. Log onto the OS for the Backup Repository Service and ping the Domain Master Gateway Node. Be sure a reasonable response time is being given at an OS level (i.e., less than 5 seconds).

4. Take the Primary Repository Service Node offline and validate that the polling, failover, and restart process takes place in a methodical, traceable manner for the Repository Service on the Domain. This should be clearly visible from the node logs on the Primary and Secondary Repository Service boxes [$INFA_HOME/server/tomcat/logs] or from Admin Console->Repository->Logs.

Note: Remember that when a node is taken offline, you cannot access Admin Console from it.
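As a hedged illustration of the infaservice and infacmd checks referred to above (the domain name is hypothetical, and option names can vary by release, so verify them against the PowerCenter Command Line Reference):

# On the backup node, start (or stop) the PowerCenter service:
$INFA_HOME/server/tomcat/bin/infaservice.sh startup
$INFA_HOME/server/tomcat/bin/infaservice.sh shutdown

# From any machine with the client utilities, confirm that the domain responds:
infacmd ping -dn Domain_Dev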

Last updated: 09-Feb-07 15:34

INFORMATICA CONFIDENTIAL BEST PRACTICE 472 of 702

Load Validation

Challenge

Knowing that all data for the current load cycle has loaded correctly is essential for effective data warehouse management. However, the need for load validation varies depending on the extent of error checking, data validation, and data cleansing functionalities inherent in your mappings. For large data integration projects with thousands of mappings, the task of reporting load statuses becomes overwhelming without a well-planned load validation process.

Description

Methods for validating the load process range from simple to complex. Use the following steps to plan a load validation process:

1. Determine what information you need for load validation (e.g., workflow names, session names, session start times, session completion times, successful rows and failed rows).

2. Determine the source of the information. All of this information is stored as metadata in the PowerCenter repository, but you must have a means of extracting it.

3. Determine how you want the information presented to you. Should the information be delivered in a report? Do you want it emailed to you? Do you want it available in a relational table so that history is easily preserved? Do you want it stored as a flat file?

Weigh all of these factors to find the correct solution for your project.

Below are descriptions of five possible load validation solutions, ranging from fairly simple to increasingly complex:

1. Post-session Emails on Success or Failure

One practical application of the post-session email functionality is the situation in which a key business user waits for completion of a session to run a report. Email is configured to notify the user when the session was successful so that the report can be run. Another practical application is the situation in which a production support analyst needs to be notified immediately of any failures. Configure the session to send an email to the analyst upon failure. For round-the-clock support, a pager number that has the ability to receive email can be used in place of an email address.

Post-session email is configured in the session, under the General tab and ‘Session Commands’.

A number of variables are available to simplify the text of the email:

● %s Session name
● %e Session status
● %b Session start time
● %c Session completion time
● %i Session elapsed time
● %l Total records loaded
● %r Total records rejected
● %t Target table details
● %m Name of the mapping used in the session
● %n Name of the folder containing the session
● %d Name of the repository containing the session
● %g Attach the session log to the message
● %a <file path> Attach a file to the message
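For example, a hypothetical notification body built from these variables might read:

Session %s finished with status %e.
Start: %b  Completion: %c  Elapsed: %i
Rows loaded: %l  Rows rejected: %r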

2. Other Workflow Manager Features

In addition to post-session email messages, there are other features available in the Workflow Manager to help validate loads. Control, Decision, Event, and Timer tasks are some of the features that can be used to place multiple controls on the behavior of loads. Another solution is to place conditions within links. Links are used to connect tasks within a workflow or worklet. Use the pre-defined or user-defined variables in the link conditions. In the example below, upon the ‘Successful’ completion of both sessions A and B, the PowerCenter Server executes session C.
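For instance, a link condition such as the following (the session names are hypothetical) starts the downstream session only when both upstream sessions succeed:

$s_Load_A.Status = SUCCEEDED AND $s_Load_B.Status = SUCCEEDED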

3. PowerCenter Reports (PCR)

The PowerCenter Reports (PCR) is a web-based business intelligence (BI) tool that is included with every PowerCenter license to provide visibility into metadata stored in the PowerCenter repository in a manner that is easy to comprehend and distribute. The PCR includes more than 130 pre-packaged metadata reports and dashboards delivered through Data Analyzer, Informatica’s BI offering. These pre-packaged reports enable PowerCenter customers to extract extensive business and technical metadata through easy-to-read reports including:


● Load statistics and operational metadata that enable load validation.
● Table dependencies and impact analysis that enable change management.
● PowerCenter object statistics to aid in development assistance.
● Historical load statistics that enable planning for growth.

In addition to the 130 pre-packaged reports and dashboards that come standard with PCR, you can develop additional custom reports and dashboards under the PCR limited-use license, which allows you to source reports from the PowerCenter repository. Examples of custom components that can be created include:

● Repository-wide reports and/or dashboards with indicators of daily load success/failure.
● Customized project-based dashboards with visual indicators of daily load success/failure.
● Detailed daily load statistics reports for each project that can be exported to Microsoft Excel or PDF.
● Error handling reports that deliver error messages and source data for row-level errors that may have occurred during a load.

Below is an example of a custom dashboard that gives instant insight into the load validation across an entire repository through four custom indicators.

INFORMATICA CONFIDENTIAL BEST PRACTICE 475 of 702

4. Query Informatica Metadata Exchange (MX) Views

Informatica Metadata Exchange (MX) provides a set of relational views that allow easy SQL access to the PowerCenter repository. The Repository Manager generates these views when you create or upgrade a repository. Almost any query can be put together to retrieve metadata related to the load execution from the repository. The MX view, REP_SESS_LOG, is a great place to start. This view is likely to contain all the information you need. The following sample query shows how to extract folder name, session name, session end time, successful rows, and session duration:

select subject_area, session_name, session_timestamp, successful_rows,
       (session_timestamp - actual_start) * 24 * 60 * 60
from   rep_sess_log a
where  session_timestamp = (select max(session_timestamp)
                            from rep_sess_log
                            where session_name = a.session_name)
order by subject_area, session_name

The sample output would look like this:

INFORMATICA CONFIDENTIAL BEST PRACTICE 476 of 702

TIP Informatica strongly advises against querying directly from the repository tables. Because future versions of PowerCenter are likely to alter the underlying repository tables, PowerCenter supports queries from the unaltered MX views, not the repository tables.

5. Mapping Approach

A more complex approach, and the most customizable, is to create a PowerCenter mapping to populate a table or a flat file with desired information. You can do this by sourcing the MX view REP_SESS_LOG and then performing lookups to other repository tables or views for additional information.

The following graphic illustrates a sample mapping:

This mapping selects data from REP_SESS_LOG and performs lookups to retrieve the absolute minimum and maximum run times for that particular session. This enables you to compare the current execution time with the minimum and maximum durations.
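As a hedged illustration, the lookup in such a mapping could be based on a query like the following, which reuses the columns and duration expression from the REP_SESS_LOG query shown earlier (adapt it to your repository version):

-- Historical minimum and maximum durations, in seconds, for each session.
select session_name,
       min((session_timestamp - actual_start) * 24 * 60 * 60) as min_duration_secs,
       max((session_timestamp - actual_start) * 24 * 60 * 60) as max_duration_secs
from   rep_sess_log
group by session_name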

INFORMATICA CONFIDENTIAL BEST PRACTICE 477 of 702

Note: Unless you have acquired additional licensing, a customized metadata data mart cannot be a source for a PCR report. However, you can use a business intelligence tool of your choice instead.

Last updated: 09-Feb-07 15:47

INFORMATICA CONFIDENTIAL BEST PRACTICE 478 of 702

Repository Administration

Challenge

Defining the role of the PowerCenter Administrator to understand the tasks required to properly manage the domain and repository.

Description

The PowerCenter Administrator has many responsibilities. In addition to regularly backing up the domain and repository, truncating logs, and updating the database statistics, he or she also typically performs the following functions:

● Determines metadata strategy
● Installs/configures client/server software
● Migrates development to test and production
● Maintains PowerCenter servers
● Upgrades software
● Administers security and folder organization
● Monitors and tunes environment

Note: The Administrator is also typically responsible for maintaining domain and repository passwords; changing them on a regular basis and keeping a record of them in a secure place.

Determine Metadata Strategy

The PowerCenter Administrator is responsible for developing the structure and standard for metadata in the PowerCenter Repository. This includes developing naming conventions for all objects in the repository, creating a folder organization, and maintaining the repository. The Administrator is also responsible for modifying the metadata strategies to suit changing business needs or to fit the needs of a particular project. Such changes may include new folder names and/or a different security setup.

Install/Configure Client/Server Software


This responsibility includes installing and configuring the application servers in all applicable environments (e.g., development, QA, production, etc.). The Administrator must have a thorough understanding of the working environment, along with access to resources such as a Windows 2000/2003 or UNIX Admin and a DBA.

The Administrator is also responsible for installing and configuring the client tools. Although end users can generally install the client software, the configuration of the client tool connections benefits from being consistent throughout the repository environment. The Administrator, therefore, needs to enforce this consistency in order to maintain an organized repository.

Migrate Development to Production

When the time comes for content in the development environment to be moved to the test and production environments, it is the responsibility of the Administrator to schedule, track, and copy folder changes. It is also crucial to keep track of the changes that have taken place; the Administrator should track them through a change control process. The Administrator should be the only individual able to physically move folders from one environment to another.

If a versioned repository is used, the Administrator should set up labels and instruct the developers on the labels that they must apply to their repository objects (i.e., reusable transformations, mappings, workflows and sessions). This task also requires close communication with project staff to review the status of items of work to ensure, for example, that only tested or approved work is migrated.

Maintain PowerCenter Servers

The Administrator must also be able to understand and troubleshoot the server environment. He or she should have a good understanding of PowerCenter’s Service Oriented Architecture and how the domain and application services interact with each other. The Administrator should also understand what the Integration Service does when a session is running and be able to identify those processes. Additionally, certain mappings may produce files in addition to the standard session and workflow logs. The Administrator should be familiar with these files and know how and where to maintain them.

Upgrade Software

If and when the time comes to upgrade software, the Administrator is responsible for overseeing the installation and upgrade process.

INFORMATICA CONFIDENTIAL BEST PRACTICE 480 of 702

Security and Folder Administration

Security administration covers both the PowerCenter domain and the repository. For the domain, it involves creating, maintaining, and updating all domain users and their associated rights and privileges to services and alerts. For the repository, it involves creating, maintaining, and updating all repository users, creating and assigning groups based on new and changing projects, and defining which folders are to be shared and at what level. Folder administration involves creating and maintaining the security of all folders. The Administrator should be the only user with privileges to edit folder properties.

Monitor and Tune Environment

Proactively monitoring the domain and user activity helps ensure a healthy functioning PowerCenter environment. The Administrator should review user activity for the domain to verify that the appropriate rights and privileges have been applied. Reviewing domain activity also helps verify correct CPU and license usage.

The Administrator should have sole responsibility for implementing performance changes to the server environment. He or she should observe server performance throughout development so as to identify any bottlenecks in the system. In the production environment, the Repository Administrator should monitor the jobs and any growth (e.g., increases in data or throughput time) and communicate such changes to the appropriate staff to address bottlenecks, accommodate growth, and ensure that the required data is loaded within the prescribed load window.

Last updated: 01-Feb-07 18:54

INFORMATICA CONFIDENTIAL BEST PRACTICE 481 of 702

Third Party Scheduler

Challenge

Successfully integrate a third-party scheduler with PowerCenter. This Best Practice describes various levels to integrate a third-party scheduler.

Description

Tasks such as getting server and session properties, session status, or starting or stopping a workflow or a task can be performed either through the Workflow Monitor or by integrating a third-party scheduler with PowerCenter. A third-party scheduler can be integrated with PowerCenter at any of several levels. The level of integration depends on the complexity of the workflow/schedule and the skill sets of production support personnel.

Many companies want to automate the scheduling process by using scripts or third-party schedulers. In some cases, they are using a standard scheduler and want to continue using it to drive the scheduling process.

A third-party scheduler can start or stop a workflow or task, obtain session statistics, and get server details using pmcmd commands. pmcmd is a command-line program used to communicate with the PowerCenter server.

Third Party Scheduler Integration Levels

In general, there are three levels of integration between a third-party scheduler and PowerCenter: Low, Medium, and High.

Low Level

Low-level integration refers to a third-party scheduler kicking off the initial PowerCenter workflow. This process subsequently kicks off the rest of the tasks or sessions. The PowerCenter scheduler handles all processes and dependencies after the third-party scheduler has kicked off the initial workflow. In this level of integration, nearly all control lies with the PowerCenter scheduler.

INFORMATICA CONFIDENTIAL BEST PRACTICE 482 of 702

This type of integration is very simple to implement because the third-party scheduler kicks off only one process; it is often used simply to satisfy a corporate mandate for a standard scheduler. This type of integration also takes advantage of the robust functionality offered by the Workflow Monitor.

Low-level integration requires production support personnel to have a thorough knowledge of PowerCenter. Because Production Support personnel in many companies are only knowledgeable about the company’s standard scheduler, one of the main disadvantages of this level of integration is that if a batch fails at some point, the Production Support personnel may not be able to determine the exact breakpoint. Thus, the majority of the production support burden falls back on the Project Development team.

Medium Level

With Medium-level integration, a third-party scheduler kicks off some, but not all, workflows or tasks. Within the tasks, many sessions may be defined with dependencies. PowerCenter controls the dependencies within the tasks.

With this level of integration, control is shared between PowerCenter and the third-party scheduler, which requires more integration between the third-party scheduler and PowerCenter. Medium-level integration requires Production Support personnel to have a fairly good knowledge of PowerCenter and also of the scheduling tool. If they do not have in-depth knowledge about the tool, they may be unable to fix problems that arise, so the production support burden is shared between the Project Development team and the Production Support team.

High Level

With High-level integration, the third-party scheduler has full control of scheduling and kicks off all PowerCenter sessions. In this case, the third-party scheduler is responsible for controlling all dependencies among the sessions. This type of integration is the most complex to implement because there are many more interactions between the third-party scheduler and PowerCenter.

Production Support personnel may have limited knowledge of PowerCenter but must have thorough knowledge of the scheduling tool. Because Production Support personnel in many companies are knowledgeable only about the company’s standard scheduler, one of the main advantages of this level of integration is that if the batch fails at some point, the Production Support personnel are usually able to determine the exact breakpoint. Thus, the production support burden lies with the Production Support team.

Sample Scheduler Script

There are many independent scheduling tools on the market. The following is an example of an AutoSys script that can be used to start tasks; it is included here simply as an illustration of how a scheduler can be implemented in the PowerCenter environment. This script can also capture the return codes and abort on error, returning a success or failure (with associated return codes) to the command line or the AutoSys GUI monitor.

# Name: jobname.job
# Author: Author Name
# Date: 01/03/2005
# Description:
# Schedule: Daily
#
# Modification History
# When Who Why
#
#------------------------------------------------------------------

. jobstart $0 $*

# set variables
ERR_DIR=/tmp

# Temporary file will be created to store all the Error Information
# The file format is TDDHHMISS<PROCESS-ID>.lst
CurDayTime=`date +%d%H%M%S`
FName=T$CurDayTime$$.lst

if [ $STEP -le 1 ]
then
  echo "Step 1: RUNNING wf_stg_tmp_product_xref_table..."

  cd /dbvol03/vendor/informatica/pmserver/
  #pmcmd startworkflow -s ah-hp9:4001 -u Administrator -p informat01 wf_stg_tmp_product_xref_table
  #pmcmd starttask -s ah-hp9:4001 -u Administrator -p informat01 -f FINDW_SRC_STG -w WF_STG_TMP_PRODUCT_XREF_TABLE -wait s_M_STG_TMP_PRODUCT_XREF_TABLE
  # The above lines need to be edited to include the name of the workflow or the task that you are attempting to start.

  # Checking whether to abort the Current Process or not
  RetVal=$?
  echo "Status = $RetVal"
  if [ $RetVal -ge 1 ]
  then
    jobend abnormal "Step 1: Failed wf_stg_tmp_product_xref_table...\n"
    exit 1
  fi
  echo "Step 1: Successful"
fi

jobend normal

exit 0

Last updated: 01-Feb-07 18:54

INFORMATICA CONFIDENTIAL BEST PRACTICE 485 of 702

Updating Repository Statistics

Challenge

The PowerCenter repository has more than 170 tables, and most have one or more indexes to speed up queries. Most databases use column distribution statistics to determine which index to use to optimize performance. It can be important, especially in large or high-use repositories, to update these statistics regularly to avoid performance degradation.

Description

For PowerCenter, statistics are updated during copy, backup, or restore operations. In addition, the pmrep command has an option to update statistics that can be scheduled as part of a regularly-run script.

For PowerCenter 6 and earlier there are specific strategies for Oracle, Sybase, SQL Server, DB2 and Informix discussed below. Each example shows how to extract the information out of the PowerCenter repository and incorporate it into a custom stored procedure.

Features in PowerCenter version 7 and later

Copy, Backup and Restore Repositories

PowerCenter automatically identifies and updates all statistics of all repository tables and indexes when a repository is copied, backed-up, or restored. If you follow a strategy of regular repository back-ups, the statistics will also be updated.

PMREP Command

PowerCenter also has a command line option to update statistics in the database, which allows the command to be placed in a Windows batch file or UNIX shell script. The format of the command is: pmrep updatestatistics {-s filelistfile}

The -s option lets you supply a file listing any tables whose statistics you do not want to update.

Example of Automating the Process

One approach to automating this is a UNIX shell script that calls the pmrep updatestatistics command and is incorporated into a special PowerCenter workflow run on a scheduled basis. Note: Workflow Manager supports command line tasks as well as scheduling.

Below listed is an example of the command line object.


In addition, this workflow can be scheduled to run on a daily, weekly, or monthly basis. This allows the statistics to be updated regularly so performance does not degrade.
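A minimal shell sketch of such a script, suitable for a Command task or a cron entry (the repository, domain, user, password, and file names are hypothetical and must be adjusted to your environment):

#!/usr/bin/ksh
# Connect to the repository, then refresh statistics on the repository tables.
pmrep connect -r REP_DEV -d Domain_Dev -n Administrator -x mypassword
pmrep updatestatistics

# Optionally skip selected tables listed in a file:
# pmrep updatestatistics -s skip_tables.txt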

Tuning Strategies for PowerCenter version 6 and earlier

The following are strategies for generating scripts to update distribution statistics. Note that all PowerCenter repository tables and index names begin with "OPB_" or "REP_".

Oracle

Run the following queries:

select 'analyze table ', table_name, ' compute statistics;' from user_tables where table_name like 'OPB_%'

select 'analyze index ', INDEX_NAME, ' compute statistics;' from user_indexes where INDEX_NAME like 'OPB_%'

This will produce output like:


'ANALYZETABLE' TABLE_NAME 'COMPUTESTATISTICS;'

analyze table OPB_ANALYZE_DEP compute statistics;

analyze table OPB_ATTR compute statistics;

analyze table OPB_BATCH_OBJECT compute statistics;

'ANALYZEINDEX' INDEX_NAME 'COMPUTESTATISTICS;'

analyze index OPB_DBD_IDX compute statistics;

analyze index OPB_DIM_LEVEL compute statistics;

analyze index OPB_EXPR_IDX compute statistics;

Save the output to a file. Then, edit the file and remove all the headers. (i.e., the lines that look like:

'ANALYZEINDEX' INDEX_NAME 'COMPUTESTATISTICS;'

Run this as a SQL script. This updates statistics for the repository tables.

MS SQL Server

Run the following query:

select 'update statistics ', name from sysobjects where name like 'OPB_%'

This will produce output like :

name

update statistics OPB_ANALYZE_DEP

update statistics OPB_ATTR

update statistics OPB_BATCH_OBJECT

Save the output to a file, then edit the file and remove the header information (i.e., the top two lines) and add a 'go' at the end of the file.

Run this as a SQL script. This updates statistics for the repository tables.

INFORMATICA CONFIDENTIAL BEST PRACTICE 488 of 702

Sybase

Run the following query:

select 'update statistics ', name from sysobjects where name like 'OPB_%'

This will produce output like

name

update statistics OPB_ANALYZE_DEP

update statistics OPB_ATTR

update statistics OPB_BATCH_OBJECT

Save the output to a file, then remove the header information (i.e., the top two lines), and add a 'go' at the end of the file.

Run this as a SQL script. This updates statistics for the repository tables.

Informix

Run the following query:

select 'update statistics low for table ', tabname, ' ;' from systables where tabname like 'opb_%' or tabname like 'OPB_%';

This will produce output like :

(constant) tabname (constant)

update statistics low for table OPB_ANALYZE_DEP ;

update statistics low for table OPB_ATTR ;

update statistics low for table OPB_BATCH_OBJECT ;

Save the output to a file, then edit the file and remove the header information (i.e., the top line that looks like:

(constant) tabname (constant)

Run this as a SQL script. This updates statistics for the repository tables.

INFORMATICA CONFIDENTIAL BEST PRACTICE 489 of 702

DB2

Run the following query :

select 'runstats on table ', (rtrim(tabschema)||'.')||tabname, ' and indexes all;'

from sysstat.tables where tabname like 'OPB_%'

This will produce output like:

runstats on table PARTH.OPB_ANALYZE_DEP

and indexes all;

runstats on table PARTH.OPB_ATTR

and indexes all;

runstats on table PARTH.OPB_BATCH_OBJECT

and indexes all;

Save the output to a file.

Run this as a SQL script to update statistics for the repository tables.

Last updated: 12-Feb-07 15:29

INFORMATICA CONFIDENTIAL BEST PRACTICE 490 of 702

Determining Bottlenecks

Challenge

Because there are many variables involved in identifying and rectifying performance bottlenecks, an efficient method for determining where bottlenecks exist is crucial to good data warehouse management.

Description

The first step in performance tuning is to identify performance bottlenecks. Carefully consider the following five areas to determine where bottlenecks exist; using a process of elimination, investigate each area in the order indicated:

1. Target
2. Source
3. Mapping
4. Session
5. System

Best Practice Considerations

Use Thread Statistics to Identify Target, Source, and Mapping Bottlenecks

Use thread statistics to identify source, target or mapping (transformation) bottlenecks. By default, an Integration Service uses one reader, one transformation, and one target thread to process a session. Within each session log, the following thread statistics are available:

● Run time – Amount of time the thread was running
● Idle time – Amount of time the thread was idle due to other threads within the application or Integration Service. This value does not include time the thread is blocked due to the operating system.
● Busy – Percentage of the overall run time the thread is not idle. This percentage is calculated using the following formula:


(run time – idle time) / run time x 100
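As a hypothetical illustration, a transformation thread that reports 600 seconds of run time and 450 seconds of idle time is (600 - 450) / 600 x 100 = 25 percent busy, which suggests the bottleneck lies elsewhere.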

By analyzing the thread statistics found in an Integration Service session log, it is possible to determine which thread is being used the most.

If a transformation thread is 100 percent busy and there are additional resources (e.g., CPU cycles and memory) available on the Integration Service server, add a partition point in the segment.

If the reader or writer thread is 100 percent busy, consider using string data types in the source or target ports, since non-string ports require more processing.

Use the Swap Method to Test Changes in Isolation

Attempt to isolate performance problems by running test sessions. You should be able to compare the session’s original performance with the tuned session’s performance.

The swap method is very useful for determining the most common bottlenecks. It involves the following five steps:

1. Make a temporary copy of the mapping, session and/or workflow that is to be tuned, then tune the copy before making changes to the original.

2. Implement only one change at a time and test for any performance improvements to gauge which tuning methods work most effectively in the environment.

3. Document the change made to the mapping, session and/or workflow and the performance metrics achieved as a result of the change. The actual execution time may be used as a performance metric.

4. Delete the temporary mapping, session and/or workflow upon completion of performance tuning.

5. Make appropriate tuning changes to mappings, sessions and/or workflows.

Evaluating the Five Areas of Consideration

Target Bottlenecks

Relational Targets

The most common performance bottleneck occurs when the Integration Service writes to a target database. This type of bottleneck can easily be identified with the following procedure:

1. Make a copy of the original workflow.
2. Configure the session in the test workflow to write to a flat file and run the session.
3. Read the thread statistics in the session log.

If session performance increases significantly when writing to a flat file, you have a write bottleneck. Consider performing the following tasks to improve performance:

● Drop indexes and key constraints
● Increase checkpoint intervals
● Use bulk loading
● Use external loading
● Minimize deadlocks
● Increase database network packet size
● Optimize target databases

Flat file targets

If the session targets a flat file, you probably do not have a write bottleneck. If the session is writing to a SAN or a non-local file system, performance may be slower than writing to a local file system. If possible, a session can be optimized by writing to a flat file target local to the Integration Service. If the local flat file is very large, you can optimize the write process by dividing it among several physical drives.

If the SAN or non-local file system is significantly slower than the local file system, work with the appropriate network/storage group to determine if there are configuration issues within the SAN.

Source Bottlenecks

Relational sources

If the session reads from a relational source, you can use a filter transformation, a read test mapping, or a database query to identify source bottlenecks.

Using a Filter Transformation.


Add a filter transformation in the mapping after each source qualifier. Set the filter condition to false so that no data is processed past the filter transformation. If the time it takes to run the new session remains about the same, then you have a source bottleneck.

Using a Read Test Session.

You can create a read test mapping to identify source bottlenecks. A read test mapping isolates the read query by removing any transformation logic from the mapping. Use the following steps to create a read test mapping:

1. Make a copy of the original mapping.
2. In the copied mapping, retain only the sources, source qualifiers, and any custom joins or queries.
3. Remove all transformations.
4. Connect the source qualifiers to a file target.

Use the read test mapping in a test session. If the test session performance is similar to the original session, you have a source bottleneck.

Using a Database Query

You can also identify source bottlenecks by executing a read query directly against the source database. To do so, perform the following steps:

● Copy the read query directly from the session log.
● Run the query against the source database with a query tool such as SQL*Plus.
● Measure the query execution time and the time it takes for the query to return the first row.

If there is a long delay between the two time measurements, you have a source bottleneck.
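As a hedged illustration in SQL*Plus (the table name is hypothetical; in practice, paste the read query copied from the session log):

SET TIMING ON
-- Execute the query and report elapsed time and statistics without displaying the rows.
SET AUTOTRACE TRACEONLY STATISTICS
SELECT * FROM mddb_dev.customer_sales_fact;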

If your session reads from a relational source and is constrained by a source bottleneck, review the following suggestions for improving performance:

● Optimize the query.
● Create tempdb as an in-memory database.
● Use conditional filters.
● Increase database network packet size.
● Connect to Oracle databases using the IPC protocol.

Flat file sources

If your session reads from a flat file source, you probably do not have a read bottleneck. Tuning the line sequential buffer length to a size large enough to hold approximately four to eight rows of data at a time (for flat files) may improve performance when reading flat file sources. Also, ensure the flat file source is local to the Integration Service.

Mapping Bottlenecks

If you have eliminated the reading and writing of data as bottlenecks, you may have a mapping bottleneck. Use the swap method to determine if the bottleneck is in the mapping.

Begin by adding a Filter transformation in the mapping immediately before each target definition. Set the filter condition to false so that no data is loaded into the target tables. If the time it takes to run the new session is the same as the original session, you have a mapping bottleneck. You can also use the performance details to identify mapping bottlenecks: high Rowsinlookupcache and High Errorrows counters indicate mapping bottlenecks.

Follow these steps to identify mapping bottlenecks:

Create a test mapping without transformations

1. Make a copy of the original mapping.
2. In the copied mapping, retain only the sources, source qualifiers, and any custom joins or queries.
3. Remove all transformations.
4. Connect the source qualifiers to the target.

Check for High Rowsinlookupcache counters

Multiple lookups can slow the session. You may improve session performance by locating the largest lookup tables and tuning those lookup expressions.

INFORMATICA CONFIDENTIAL BEST PRACTICE 495 of 702

Check for High Errorrows counters

Transformation errors affect session performance. If a session has large numbers in any of the Transformation_errorrows counters, you may improve performance by eliminating the errors.

For further details on eliminating mapping bottlenecks, refer to the Best Practice: Tuning Mappings for Better Performance

Session Bottlenecks

Session performance details can be used to flag other problem areas. Create performance details by selecting “Collect Performance Data” in the session properties before running the session.

View the performance details through the Workflow Monitor as the session runs, or view the resulting file. The performance details provide counters about each source qualifier, target definition, and individual transformation within the mapping to help you understand session and mapping efficiency.

To view the performance details during the session run:

● Right-click the session in the Workflow Monitor.
● Choose Properties.
● Click the Properties tab in the details dialog box.

To view the resulting performance data file, look for the file session_name.perf in the same directory as the session log and open the file in any text editor.

All transformations have basic counters that indicate the number of input rows, output rows, and error rows. Source qualifiers, normalizers, and targets have additional counters indicating the efficiency of data moving into and out of buffers. Some transformations have counters specific to their functionality. When reading performance details, the first column displays the transformation name as it appears in the mapping, the second column contains the counter name, and the third column holds the resulting number or efficiency percentage.

Low buffer input and buffer output counters


If the BufferInput_efficiency and BufferOutput_efficiency counters are low for all sources and targets, increasing the session DTM buffer pool size may improve performance.

Aggregator, Rank, and Joiner readfromdisk and writetodisk counters

If a session contains Aggregator, Rank, or Joiner transformations, examine each Transformation_readfromdisk and Transformation_writetodisk counter. If these counters display any number other than zero, you can improve session performance by increasing the index and data cache sizes.

If the session performs incremental aggregation, the Aggregator_readfromdisk and Aggregator_writetodisk counters display a number other than zero because the Integration Service reads historical aggregate data from the local disk during the session and writes to disk when saving historical data. Evaluate these counters during the session run. If the counters show any numbers other than zero during the session run, you can increase performance by tuning the index and data cache sizes.

Note: PowerCenter versions 6.x and above include the ability to assign memory allocation per object. In versions earlier than 6.x, aggregators, ranks, and joiners were assigned at a global/session level.

For further details on eliminating session bottlenecks, refer to the Best Practice: Tuning Sessions for Better Performance and Tuning SQL Overrides and Environment for Better Performance.

System Bottlenecks

After tuning the source, target, mapping, and session, you may also consider tuning the system hosting the Integration Service.

The Integration Service uses system resources to process transformations, session execution, and the reading and writing of data. The Integration Service also uses system memory for other data tasks such as creating aggregator, joiner, rank, and lookup table caches.

You can use system performance monitoring tools to monitor the amount of system resources the Server uses and identify system bottlenecks.

● Windows NT/2000. Use system tools such as the Performance and Processes tabs in the Task Manager to view CPU usage and total memory usage. You can also view more detailed performance information by using the Performance Monitor in the Administrative Tools on Windows.

● UNIX. Use the following system tools to monitor system performance and identify system bottlenecks (example invocations follow this list):

❍ lsattr -E -l sys0 - To view current system settings
❍ iostat - To monitor loading operation for every disk attached to the database server
❍ vmstat or sar -w - To monitor disk swapping actions
❍ sar -u - To monitor CPU loading.
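As a hedged sketch of such monitoring commands on UNIX (the 5-second interval and 12-sample count are arbitrary choices):

sar -u 5 12        # CPU utilization
vmstat 5 12        # memory, paging, and swap activity
iostat 5 12        # I/O load for each attached disk
lsattr -E -l sys0  # current system attribute settings (AIX)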

For further information regarding system tuning, refer to the Best Practices: Performance Tuning UNIX Systems and Performance Tuning Windows 2000/2003 Systems.

Last updated: 01-Feb-07 18:54

INFORMATICA CONFIDENTIAL BEST PRACTICE 498 of 702

Performance Tuning Databases (Oracle)

Challenge

Database tuning can result in a tremendous improvement in loading performance. This Best Practice covers tips on tuning Oracle.

Description

Performance Tuning Tools

Oracle offers many tools for tuning an Oracle instance. Most DBAs are already familiar with these tools, so we’ve included only a short description of some of the major ones here.

V$ Views

V$ views are dynamic performance views that provide real-time information on database activity, enabling the DBA to draw conclusions about database performance. Because SYS owns these views, only SYS can query them by default. Keep in mind that querying these views impacts database performance, with each query having an immediate hit. With this in mind, carefully consider which users should be granted the privilege to query these views. You can grant viewing privileges with either the ‘SELECT’ privilege, which allows a user to view individual V$ views, or the ‘SELECT ANY TABLE’ privilege, which allows the user to view all V$ views. Using the SELECT ANY TABLE option requires the ‘O7_DICTIONARY_ACCESSIBILITY’ parameter to be set to ‘TRUE’, which allows the ‘ANY’ keyword to apply to SYS owned objects.

Explain Plan

Explain Plan, SQL Trace, and TKPROF are powerful tools for revealing bottlenecks and developing a strategy to avoid them.

Explain Plan allows the DBA or developer to determine the execution path of a block of SQL code. The SQL in a source qualifier or in a lookup that is running for a long time should be generated and copied to SQL*PLUS or other SQL tool and tested to avoid inefficient execution of these statements. Review the PowerCenter session log for long initialization time (an indicator that the source qualifier may need tuning) and the time it takes to build a lookup cache to determine if the SQL for these transformations should be tested.
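A hedged sketch of testing such a statement (the table and column names are hypothetical; DBMS_XPLAN requires Oracle 9i or later and assumes a PLAN_TABLE created via utlxplan.sql, which older releases must query directly):

EXPLAIN PLAN FOR
SELECT customer_key, sales_amount
FROM   mddb_dev.customer_sales_fact
WHERE  customer_key = 1001;

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);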

SQL Trace

SQL Trace extends the functionality of Explain Plan by providing statistical information about the SQL statements executed in a session that has tracing enabled. This utility is run for a session with the ‘ALTER SESSION SET SQL_TRACE = TRUE’ statement.

TKPROF

The output of SQL Trace is provided in a dump file that is difficult to read. TKPROF formats this dump file into a more understandable report.
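As a hedged sketch of the trace-and-format cycle (the statement, table, and file names are hypothetical):

-- Trace the current session, run the statement under investigation, then stop tracing.
ALTER SESSION SET TIMED_STATISTICS = TRUE;
ALTER SESSION SET SQL_TRACE = TRUE;

SELECT customer_key, sales_amount FROM mddb_dev.customer_sales_fact;

ALTER SESSION SET SQL_TRACE = FALSE;

-- The resulting trace file in USER_DUMP_DEST can then be formatted at the OS prompt, e.g.:
-- tkprof ora_12345.trc ora_12345.prf sys=no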

UTLBSTAT & UTLESTAT

Executing ‘UTLBSTAT’ creates tables to store dynamic performance statistics and begins the statistics collection process. Run this utility after the database has been up and running (for hours or days). Accumulating statistics may take time, so you need to run this utility for a long while and through several operations (i.e., both loading and querying). ‘UTLESTAT’ ends the statistics collection process and generates an output file called ‘report.txt.’ This report should give the DBA a fairly complete idea about the level of usage the database experiences and reveal areas that should be addressed.

Disk I/O

Disk I/O at the database level provides the highest level of performance gain in most systems. Database files should be separated and identified. Rollback files should be separated onto their own disks because they have significant disk I/O. Co-locate tables that are heavily used with tables that are rarely used to help minimize disk contention. Separate indexes from their tables so that queries that access both are not fighting for the same resource. Also be sure to implement disk striping; this, or RAID technology, can help immensely in reducing disk contention. While this type of planning is time consuming, the payoff is well worth the effort in terms of performance gains.

Dynamic Sampling

Dynamic sampling enables the server to improve performance by:

● Estimating single-table predicate statistics where available statistics are missing or may lead to bad estimations.

● Estimating statistics for tables and indexes with missing statistics.
● Estimating statistics for tables and indexes with out-of-date statistics.

Dynamic sampling is controlled by the OPTIMIZER_DYNAMIC_SAMPLING parameter, which accepts values from "0" (off) to "10" (aggressive sampling) with a default value of "2". At compile-time, Oracle determines if dynamic sampling can improve query performance. If so, it issues recursive statements to estimate the necessary statistics. Dynamic sampling can be beneficial when:

● The sample time is small compared to the overall query execution time.
● Dynamic sampling results in a better performing query.
● The query can be executed multiple times.
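A hedged sketch of adjusting the sampling level (the level of 4, the table name, and the alias are arbitrary illustrations):

-- Raise the level for the current session:
ALTER SESSION SET optimizer_dynamic_sampling = 4;

-- Or request it for a single statement via a hint:
SELECT /*+ dynamic_sampling(f 4) */ COUNT(*)
FROM   mddb_dev.customer_sales_fact f;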

Automatic SQL Tuning in Oracle Database 10g

In its normal mode, the query optimizer needs to make decisions about execution plans in a very short time. As a result, it may not always be able to obtain enough information to make the best decision. Oracle 10g allows the optimizer to run in tuning mode, where it can gather additional information and make recommendations about how specific statements can be tuned further. This process may take several minutes for a single statement, so it is intended to be used on high-load, resource-intensive statements. In tuning mode, the optimizer performs the following analysis:

● Statistics Analysis. The optimizer recommends the gathering of statistics on objects with missing or stale statistics. Additional statistics for these objects are stored in an SQL profile.

● SQL Profiling. The optimizer may be able to improve performance by gathering additional statistics and altering session-specific parameters such as the OPTIMIZER_MODE. If such improvements are possible, the information is stored in an SQL profile. If accepted, this information can then be used by the optimizer when running in normal mode. Unlike a stored outline, which fixes the execution plan, an SQL profile may still be of benefit when the contents of the table alter drastically. Even so, it is sensible to update profiles periodically. SQL profiling is not performed when the tuning optimizer is run in limited mode.

● Access Path Analysis. The optimizer investigates the effect of new or modified indexes on the access path. Because its index recommendations relate to a specific statement, where practical it also suggests the use of the SQL Access Advisor to check the impact of these indexes on a representative SQL workload.

● SQL Structure Analysis. The optimizer suggests alternatives for SQL statements that contain structures that may affect performance. Be aware that implementing these suggestions requires human intervention to check their validity.
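As a hedged sketch of driving this analysis from SQL*Plus with the DBMS_SQLTUNE package (the task name and SQL text are hypothetical, and the session needs the ADVISOR privilege):

DECLARE
  l_task_name VARCHAR2(30);
BEGIN
  -- Create and run a tuning task for a single statement.
  l_task_name := DBMS_SQLTUNE.CREATE_TUNING_TASK(
                   sql_text  => 'SELECT COUNT(*) FROM mddb_dev.customer_sales_fact',
                   task_name => 'tune_fact_count');
  DBMS_SQLTUNE.EXECUTE_TUNING_TASK(task_name => 'tune_fact_count');
END;
/

-- Review the findings and recommendations.
SELECT DBMS_SQLTUNE.REPORT_TUNING_TASK('tune_fact_count') FROM dual;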

TIP The automatic SQL tuning features are accessible from Enterprise Manager on the "Advisor Central" page.

Useful Views

Useful views related to automatic SQL tuning include:

● DBA_ADVISOR_TASKS
● DBA_ADVISOR_FINDINGS
● DBA_ADVISOR_RECOMMENDATIONS
● DBA_ADVISOR_RATIONALE
● DBA_SQLTUNE_STATISTICS
● DBA_SQLTUNE_BINDS
● DBA_SQLTUNE_PLANS
● DBA_SQLSET
● DBA_SQLSET_BINDS
● DBA_SQLSET_STATEMENTS
● DBA_SQLSET_REFERENCES
● DBA_SQL_PROFILES
● V$SQL
● V$SQLAREA
● V$ACTIVE_SESSION_HISTORY

Memory and Processing

Memory and processing configuration is performed in the init.ora file. Because each database is different and requires an experienced DBA to analyze and tune it for optimal performance, a standard set of parameters to optimize PowerCenter is not practical and is not likely to ever exist.

INFORMATICA CONFIDENTIAL BEST PRACTICE 502 of 702

TIP Changes made in the init.ora file take effect after a restart of the instance. Use svrmgr to issue the commands “shutdown” and “startup” (or “shutdown immediate”) to the instance. Note that svrmgr is no longer available as of Oracle 9i because Oracle is moving to a web-based Server Manager in Oracle 10g. If you are using Oracle 9i, install the Oracle client tools and log onto Oracle Enterprise Manager. Some other tools, such as DBArtisan, also expose the initialization parameters.

The settings presented here are those used on a four-CPU AIX server running Oracle 7.3.4, configured to use the parallel query option for parallel processing of queries and indexes. We have also included the descriptions and documentation from Oracle for each setting to help DBAs of other (i.e., non-Oracle) systems determine what the commands do in the Oracle environment, so they can set their native database commands and settings in a similar fashion.

HASH_AREA_SIZE = 16777216

● Default value: 2 times the value of SORT_AREA_SIZE
● Range of values: any integer
● This parameter specifies the maximum amount of memory, in bytes, to be used for the hash join. If this parameter is not set, its value defaults to twice the value of the SORT_AREA_SIZE parameter.
● The value of this parameter can be changed without shutting down the Oracle instance by using the ALTER SESSION command. (Note: ALTER SESSION refers to the Database Administration command issued at the svrmgr command prompt.)

● HASH_JOIN_ENABLED

❍ In Oracle 7 and Oracle 8 the hash_join_enabled parameter must be set to true.

❍ In Oracle 8i and above hash_join_enabled=true is the default value

● HASH_MULTIBLOCK_IO_COUNT

❍ Allows multiblock reads against the TEMP tablespace
❍ It is advisable to set the NEXT extent size to greater than the value for hash_multiblock_io_count to reduce disk I/O
❍ This is the same behavior seen when setting the db_file_multiblock_read_count parameter for data tablespaces, except this one applies only to multiblock access of segments of the TEMP tablespace

● STAR_TRANSFORMATION_ENABLED

❍ Determines whether a cost-based query transformation will be applied to star queries
❍ When set to TRUE, the optimizer will consider performing a cost-based query transformation on the n-way join table

● OPTIMIZER_INDEX_COST_ADJ

❍ Numeric parameter set between 1 and 10000 (default 100)
❍ This parameter lets you tune the optimizer behavior for access path selection to be more or less index friendly

Optimizer_percent_parallel=33

This parameter defines the amount of parallelism that the optimizer uses in its cost functions. The default of 0 means that the optimizer chooses the best serial plan. A value of 100 means that the optimizer uses each object's degree of parallelism in computing the cost of a full-table scan operation. The value of this parameter can be changed without shutting down the Oracle instance by using the ALTER SESSION command. Low values favor indexes, while high values favor table scans.

Cost-based optimization is always used for queries that reference an object with a nonzero degree of parallelism. For such queries, a RULE hint or optimizer mode or goal is ignored. Use of a FIRST_ROWS hint or optimizer mode overrides a nonzero setting of OPTIMIZER_PERCENT_PARALLEL.

parallel_max_servers=40

● Used to enable parallel query.
● Initially not set on install.
● Maximum number of query servers or parallel recovery processes for an instance.

Parallel_min_servers=8

● Used to enable parallel query.
● Initially not set on install.
● Minimum number of query server processes for an instance. Also the number of query-server processes Oracle creates when the instance is started.

SORT_AREA_SIZE=8388608


● Default value: operating system-dependent
● Minimum value: the value equivalent to two database blocks
● This parameter specifies the maximum amount, in bytes, of program global area (PGA) memory to use for a sort. After the sort is complete, and all that remains is to fetch the rows out, the memory is released down to the size specified by SORT_AREA_RETAINED_SIZE. After the last row is fetched out, all memory is freed. The memory is released back to the PGA, not to the operating system.
● Increasing SORT_AREA_SIZE improves the efficiency of large sorts. Multiple allocations never exist; there is only one memory area of SORT_AREA_SIZE for each user process at any time.

● The default is usually adequate for most database operations. However, if very large indexes are created, this parameter may need to be adjusted. For example, if one process is doing all database access, as in a full database import, then an increased value for this parameter may speed the import, particularly the CREATE INDEX statements.

Automatic Shared Memory Management in Oracle 10g

Automatic Shared Memory Management puts Oracle in control of allocating memory within the SGA. The SGA_TARGET parameter sets the amount of memory available to the SGA. This parameter can be altered dynamically up to a maximum of the SGA_MAX_SIZE parameter value. Provided the STATISTICS_LEVEL is set to TYPICAL or ALL, and the SGA_TARGET is set to a value other than "0", Oracle will control the memory pools that would otherwise be controlled by the following parameters:

● DB_CACHE_SIZE (default block size)
● SHARED_POOL_SIZE
● LARGE_POOL_SIZE
● JAVA_POOL_SIZE

If these parameters are set to a non-zero value, they represent the minimum size for the pool. These minimum values may be necessary if you experience application errors when certain pool sizes drop below a specific threshold.
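A hedged example of setting the target (the 900M figure is arbitrary, must not exceed SGA_MAX_SIZE, and SCOPE=BOTH assumes the instance uses an spfile):

ALTER SYSTEM SET sga_target = 900M SCOPE = BOTH;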

The following parameters must be set manually and take memory from the quota allocated by the SGA_TARGET parameter:

● DB_KEEP_CACHE_SIZE
● DB_RECYCLE_CACHE_SIZE
● DB_nK_CACHE_SIZE (non-default block size)
● STREAMS_POOL_SIZE
● LOG_BUFFER

IPC as an Alternative to TCP/IP on UNIX

On an HP/UX server with Oracle as a target (i.e., PMServer and Oracle target on same box), using an IPC connection can significantly reduce the time it takes to build a lookup cache. In one case, a fact mapping that was using a lookup to get five columns (including a foreign key) and about 500,000 rows from a table was taking 19 minutes. Changing the connection type to IPC reduced this to 45 seconds. In another mapping, the total time decreased from 24 minutes to 8 minutes for ~120-130 bytes/row, 500,000 row write (array inserts), and primary key with unique index in place. Performance went from about 2Mb/min (280 rows/sec) to about 10Mb/min (1360 rows/sec).

A normal tcp (network tcp/ip) connection in tnsnames.ora would look like this:

DW.armafix =
  (DESCRIPTION =
    (ADDRESS_LIST =
      (ADDRESS =
        (PROTOCOL = TCP)
        (HOST = armafix)
        (PORT = 1526)
      )
    )
    (CONNECT_DATA = (SID = DW))
  )

Make a new entry in the tnsnames like this, and use it for connection to the local Oracle instance:

DWIPC.armafix =
  (DESCRIPTION =
    (ADDRESS =
      (PROTOCOL = ipc)
      (KEY = DW)
    )
    (CONNECT_DATA = (SID = DW))
  )

INFORMATICA CONFIDENTIAL BEST PRACTICE 506 of 702

Improving Data Load Performance

Alternative to Dropping and Reloading Indexes

Experts often recommend dropping and reloading indexes during very large loads to a data warehouse but there is no easy way to do this. For example, writing a SQL statement to drop each index, then writing another SQL statement to rebuild it, can be a very tedious process.

Oracle 7 (and above) offers an alternative to dropping and rebuilding indexes by allowing you to disable and re-enable existing indexes. Oracle stores the name of each index in a table that can be queried. With this in mind, it is an easy matter to write a SQL statement that queries this table, then generates SQL statements as output to disable and enable these indexes.

Run the following to generate output to disable the foreign keys in the data warehouse:

SELECT 'ALTER TABLE ' || OWNER || '.' || TABLE_NAME || ' DISABLE CONSTRAINT ' || CONSTRAINT_NAME || ' ;'

FROM USER_CONSTRAINTS

WHERE (TABLE_NAME LIKE '%DIM' OR TABLE_NAME LIKE '%FACT')

AND CONSTRAINT_TYPE = 'R'

This produces output that looks like:

ALTER TABLE MDDB_DEV.AGREEMENT_DIM DISABLE CONSTRAINT SYS_C0011077 ;

ALTER TABLE MDDB_DEV.AGREEMENT_DIM DISABLE CONSTRAINT SYS_C0011075 ;

ALTER TABLE MDDB_DEV.CUSTOMER_DIM DISABLE CONSTRAINT SYS_C0011060 ;

ALTER TABLE MDDB_DEV.CUSTOMER_DIM DISABLE CONSTRAINT SYS_C0011059 ;


ALTER TABLE MDDB_DEV.CUSTOMER_SALES_FACT DISABLE CONSTRAINT SYS_C0011133 ;

ALTER TABLE MDDB_DEV.CUSTOMER_SALES_FACT DISABLE CONSTRAINT SYS_C0011134 ;

ALTER TABLE MDDB_DEV.CUSTOMER_SALES_FACT DISABLE CONSTRAINT SYS_C0011131 ;

Dropping or disabling primary keys also speeds loads. Run the results of this SQL statement after disabling the foreign key constraints:

SELECT 'ALTER TABLE ' || OWNER || '.' || TABLE_NAME || ' DISABLE PRIMARY KEY ;'

FROM USER_CONSTRAINTS

WHERE (TABLE_NAME LIKE '%DIM' OR TABLE_NAME LIKE '%FACT')

AND CONSTRAINT_TYPE = 'P'

This produces output that looks like:

ALTER TABLE MDDB_DEV.AGREEMENT_DIM DISABLE PRIMARY KEY ;

ALTER TABLE MDDB_DEV.CUSTOMER_DIM DISABLE PRIMARY KEY ;

ALTER TABLE MDDB_DEV.CUSTOMER_SALES_FACT DISABLE PRIMARY KEY ;

Finally, disable any unique constraints with the following:

SELECT 'ALTER TABLE ' || OWNER || '.' || TABLE_NAME || ' DISABLE CONSTRAINT ' || CONSTRAINT_NAME || ' ;'

FROM USER_CONSTRAINTS

WHERE (TABLE_NAME LIKE '%DIM' OR TABLE_NAME LIKE '%FACT')


AND CONSTRAINT_TYPE = 'U'

This produces output that looks like:

ALTER TABLE MDDB_DEV.CUSTOMER_DIM DISABLE CONSTRAINT SYS_C0011070 ;

ALTER TABLE MDDB_DEV.CUSTOMER_SALES_FACT DISABLE CONSTRAINT SYS_C0011071 ;

Save the results in a single file and name it something like ‘DISABLE.SQL’

To re-enable the indexes, rerun these queries after replacing ‘DISABLE’ with ‘ENABLE.’ Save the results in another file with a name such as ‘ENABLE.SQL’ and run it as a post-session command.

Re-enable constraints in the reverse order that you disabled them. Re-enable the unique constraints first, and re-enable primary keys before foreign keys.

TIP Dropping or disabling foreign keys often boosts loading, but also slows queries (such as lookups) and updates. If you do not use lookups or updates on your target tables, you should get a boost by using this SQL statement to generate scripts. If you use lookups and updates (especially on large tables), you can exclude the index that will be used for the lookup from your script. You may want to experiment to determine which method is faster.

Optimizing Query Performance

Oracle Bitmap Indexing

With version 7.3.x, Oracle added bitmap indexing to supplement the traditional b-tree index. A b-tree index can greatly improve query performance on data that has high cardinality or contains mostly unique values, but is not much help for low cardinality/highly-duplicated data and may even increase query time. A typical example of a low cardinality field is gender – it is either male or female (or possibly unknown). This kind of data is an excellent candidate for a bitmap index, and can significantly improve query performance.

Keep in mind however, that b-tree indexing is still the Oracle default. If you don’t specify an index type when creating an index, Oracle defaults to b-tree. Also note that for certain columns, bitmaps are likely to be smaller and faster to create than a b-tree index on the same column.

Bitmap indexes are suited to data warehousing because of their performance, size, and ability to be created and dropped very quickly. Since most dimension tables in a warehouse have nearly every column indexed, the space savings is dramatic. But it is important to note that when a bitmap-indexed column is updated, every row associated with that bitmap entry is locked, making bitmap indexing a poor choice for OLTP database tables with constant insert and update traffic. Also, bitmap indexes are rebuilt after each DML statement (e.g., inserts and updates), which can make loads very slow. For this reason, it is a good idea to drop or disable bitmap indexes prior to the load and re-create or re-enable them after the load.

The relationship between Fact and Dimension keys is another example of low cardinality. With a b-tree index on the Fact table, a query is processed by joining all the Dimension tables in a Cartesian product based on the WHERE clause, then joining back to the Fact table. With a bitmapped index on the Fact table, a ‘star query’ may be created that accesses the Fact table first, followed by the Dimension table joins, avoiding a Cartesian product of all possible Dimension attributes. This ‘star query’ access method is only used if the STAR_TRANSFORMATION_ENABLED parameter is set to TRUE in the init.ora file and if there are single-column bitmapped indexes on the fact table foreign keys.

Creating bitmap indexes is similar to creating b-tree indexes. To specify a bitmap index, add the word ‘bitmap’ between ‘create’ and ‘index’. All other syntax is identical.

Bitmap Indexes

drop index emp_active_bit;

drop index emp_gender_bit;

create bitmap index emp_active_bit on emp (active_flag);

create bitmap index emp_gender_bit on emp (gender);

B-tree Indexes

drop index emp_active;

drop index emp_gender;

create index emp_active on emp (active_flag);

create index emp_gender on emp (gender);


Information for bitmap indexes is stored in the data dictionary in dba_indexes, all_indexes, and user_indexes with the word ‘BITMAP’ in the Uniqueness column rather than the word ‘UNIQUE.’ Bitmap indexes cannot be unique.

To enable bitmap indexes, you must set the following items in the instance initialization file:

● compatible = 7.3.2.0.0 # or higher
● event = "10111 trace name context forever"
● event = "10112 trace name context forever"
● event = "10114 trace name context forever"

Also note that the parallel query option must be installed in order to create bitmap indexes. If you try to create bitmap indexes without the parallel query option, a syntax error appears in the SQL statement; the keyword ‘bitmap’ won't be recognized.

TIP To check if the parallel query option is installed, start and log into SQL*Plus. If the parallel query option is installed, the word ‘parallel’ appears in the banner text.
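Separately, the ‘star query’ access path described earlier depends on the STAR_TRANSFORMATION_ENABLED parameter. A minimal, hedged sketch of the two common ways to set it (session-level control assumes a release that allows ALTER SESSION for this parameter):

star_transformation_enabled = true                       # init.ora entry
ALTER SESSION SET star_transformation_enabled = TRUE;    -- per-session alternative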

Index Statistics

Table method

Index statistics are used by Oracle to determine the best method to access tables and should be updated periodically as part of normal DBA procedures. The following should improve query performance on Fact and Dimension tables (including appending and updating records) by updating the table and index statistics for the data warehouse:

The following SQL statement can be used to analyze the tables in the database:

SELECT 'ANALYZE TABLE ' || TABLE_NAME || ' COMPUTE STATISTICS;'

FROM USER_TABLES

WHERE (TABLE_NAME LIKE '%DIM' OR TABLE_NAME LIKE '%FACT')

This generates the following result:


ANALYZE TABLE CUSTOMER_DIM COMPUTE STATISTICS;

ANALYZE TABLE MARKET_DIM COMPUTE STATISTICS;

ANALYZE TABLE VENDOR_DIM COMPUTE STATISTICS;

The following SQL statement can be used to analyze the indexes in the database:

SELECT 'ANALYZE INDEX ' || INDEX_NAME || ' COMPUTE STATISTICS;'

FROM USER_INDEXES

WHERE (TABLE_NAME LIKE '%DIM' OR TABLE_NAME LIKE '%FACT')

This generates the following results:

ANALYZE INDEX SYS_C0011125 COMPUTE STATISTICS;

ANALYZE INDEX SYS_C0011119 COMPUTE STATISTICS;

ANALYZE INDEX SYS_C0011105 COMPUTE STATISTICS;

Save these results as a SQL script to be executed before or after a load.

Schema method

Another way to update index statistics is to compute indexes by schema rather than by table. If data warehouse indexes are the only indexes located in a single schema, you can use the following command to update the statistics:

EXECUTE SYS.DBMS_UTILITY.Analyze_Schema ('BDB', 'compute');

In this example, BDB is the schema for which the statistics should be updated. Note that the DBA must grant the execution privilege for dbms_utility to the database user executing this command.
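A hedged sketch of that grant, run by a DBA (the user name ETL_USER is illustrative):

GRANT EXECUTE ON SYS.DBMS_UTILITY TO ETL_USER;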


TIP These SQL statements can be very resource intensive, especially for very large tables. For this reason, Informatica recommends running them at off-peak times when no other process is using the database. If you find the exact computation of the statistics consumes too much time, it is often acceptable to estimate the statistics rather than compute them. Use ‘estimate’ instead of ‘compute’ in the above examples.
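For example, the analysis statements above could be changed to estimate statistics instead; a hedged sketch (the sample percentage is illustrative, and the positional arguments assume the estimate_rows and estimate_percent parameters of Analyze_Schema):

ANALYZE TABLE CUSTOMER_DIM ESTIMATE STATISTICS SAMPLE 20 PERCENT;
EXECUTE SYS.DBMS_UTILITY.Analyze_Schema ('BDB', 'estimate', NULL, 20);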

Parallelism

Parallel execution can be implemented at the SQL statement, database object, or instance level for many SQL operations. The degree of parallelism should be identified based on the number of processors and disk drives on the server, with the number of processors being the minimum degree.

SQL Level Parallelism

Hints are used to define parallelism at the SQL statement level. The following examples demonstrate how to utilize four processors:

SELECT /*+ PARALLEL(order_fact,4) */ …;

SELECT /*+ PARALLEL_INDEX(order_fact, order_fact_ixl,4) */ …;

TIP When using a table alias in the SQL Statement, be sure to use this alias in the hint. Otherwise, the hint will not be used, and you will not receive an error message.

Example of improper use of alias:

SELECT /*+PARALLEL (EMP, 4) */ EMPNO, ENAME

FROM EMP A

Here, the parallel hint is not used because the alias “A” is defined for table EMP while the hint references the table name EMP. The correct way is:

SELECT /*+PARALLEL (A, 4) */ EMPNO, ENAME FROM EMP A

Table Level Parallelism

Parallelism can also be defined at the table and index level. The following example demonstrates how to set a table’s degree of parallelism to four for all eligible SQL statements on this table:

ALTER TABLE order_fact PARALLEL 4;
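A similar statement applies at the index level, and parallelism can be reverted once the heavy operation is complete; a hedged sketch (the index name is illustrative):

ALTER INDEX order_fact_ix1 PARALLEL 4;
ALTER TABLE order_fact NOPARALLEL;
ALTER INDEX order_fact_ix1 NOPARALLEL;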

Ensure that Oracle is not contending with other processes for these resources or you may end up with degraded performance due to resource contention.

Additional Tips

Executing Oracle SQL Scripts as Pre- and Post-Session Commands on UNIX

You can execute queries as both pre- and post-session commands. For a UNIX environment, the format of the command is:

sqlplus -s user_id/password@database @script_name.sql

For example, to execute the ENABLE.SQL file created earlier (assuming the data warehouse is on a database named ‘infadb’), you would execute the following as a post-session command:

sqlplus -s user_id/password@infadb @enable.sql

In some environments, this may be a security issue since both username and password are hard-coded and unencrypted. To avoid this, use the operating system’s authentication to log onto the database instance.

In the following example, the Informatica id “pmuser” is used to log onto the Oracle database. Create the Oracle user “pmuser” with the following SQL statement:

CREATE USER PMUSER IDENTIFIED EXTERNALLY DEFAULT TABLESPACE . . . TEMPORARY TABLESPACE . . .

In the following pre-session command, “pmuser” (the id Informatica is logged onto the operating system as) is automatically passed from the operating system to the database and used to execute the script:

sqlplus -s /@infadb @/informatica/powercenter/Scripts/ENABLE.SQL


You may want to use the init.ora parameter “os_authent_prefix” to distinguish between “normal” Oracle users and “externally-identified” ones.
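A hedged sketch of how that parameter interacts with the externally-identified user created above (values are illustrative; the Oracle default prefix is OPS$):

os_authent_prefix = ""       # init.ora: OS user pmuser maps directly to database user PMUSER
os_authent_prefix = "OPS$"   # alternative: OS user pmuser maps to database user OPS$PMUSER
CREATE USER OPS$PMUSER IDENTIFIED EXTERNALLY;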

DRIVING_SITE ‘Hint’

If the source and target are on separate instances, the Source Qualifier transformation should be executed on the target instance.

For example, you want to join two source tables (A and B) together, which may reduce the number of selected rows. However, Oracle fetches all of the data from both tables, moves the data across the network to the target instance, then processes everything on the target instance. If either data source is large, this causes a great deal of network traffic. To force the Oracle optimizer to process the join on the source instance, use the ‘Generate SQL’ option in the source qualifier and include the ‘driving_site’ hint in the SQL statement as:

SELECT /*+ DRIVING_SITE */ …;
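A fuller, hedged sketch, assuming source tables A and B are reached from the target instance through a database link named SOURCE_DB (all object names are illustrative); naming a remote table in the hint forces the join to execute on the source instance:

SELECT /*+ DRIVING_SITE(a) */ a.order_id, a.order_amt, b.customer_name
FROM orders@source_db a, customers@source_db b
WHERE a.customer_id = b.customer_id;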

Last updated: 01-Feb-07 18:54


Performance Tuning Databases (SQL Server)

Challenge

Database tuning can result in tremendous improvement in loading performance. This Best Practice offers tips on tuning SQL Server.

Description

Proper tuning of the source and target database is a very important consideration in the scalability and usability of a business data integration environment. Managing performance on an SQL Server involves the following points.

● Manage system memory usage (RAM caching).
● Create and maintain good indexes.
● Partition large data sets and indexes.
● Monitor disk I/O subsystem performance.
● Tune applications and queries.
● Optimize active data.

Taking advantage of grid computing is another option for improving the overall SQL Server performance. To set up a SQL Server cluster environment, you need to set up a cluster where the databases are split among the nodes. This provides the ability to distribute the load across multiple nodes. To achieve high performance, Informatica recommends using a fibre-attached SAN device for shared storage.

Manage RAM Caching

Managing RAM buffer cache is a major consideration in any database server environment. Accessing data in RAM cache is much faster than accessing the same information from disk. If database I/O can be reduced to the minimal required set of data and index pages, the pages stay in RAM longer. Too much unnecessary data and index information flowing into buffer cache quickly pushes out valuable pages. The primary goal of performance tuning is to reduce I/O so that buffer cache is used effectively.


Several settings in SQL Server can be adjusted to take advantage of SQL Server RAM usage:

● Max async I/O is used to specify the number of simultaneous disk I/O operations that SQL Server can submit to the operating system. Note that this setting is automated in SQL Server 2000.

● SQL Server allows several selectable models for database recovery; these include:

❍ Full Recovery
❍ Bulk-Logged Recovery
❍ Simple Recovery

Create and Maintain Good Indexes

Creating and maintaining good indexes is key to maintaining minimal I/O for all database queries.

Partition Large Data Sets and Indexes

To reduce overall I/O contention and improve parallel operations, consider partitioning table data and indexes. Multiple techniques for achieving and managing partitions using SQL Server 2000 are addressed in this document.

Tune Applications and Queries

Tuning applications and queries is especially important when a database server is likely to be servicing requests from hundreds or thousands of connections through a given application. Because applications typically determine the SQL queries that are executed on a database server, it is very important for application developers to understand SQL Server architectural basics and know how to take full advantage of SQL Server indexes to minimize I/O.

Partitioning for Performance

The simplest technique for creating disk I/O parallelism is to use hardware partitioning and create a single "pool of drives" that serves all SQL Server database files except transaction log files, which should always be stored on physically-separate disk drives dedicated to log files. (See Microsoft documentation for installation procedures.)


Objects For Partitioning Consideration

The following areas of SQL Server activity can be separated across different hard drives, RAID controllers, and PCI channels (or combinations of the three):

● Transaction logs
● Tempdb
● Database
● Tables
● Nonclustered indexes

Note: In SQL Server 2000, Microsoft introduced enhancements to distributed partitioned views that enable the creation of federated databases (commonly referred to as scale-out), which spread resource load and I/O activity across multiple servers. Federated databases are appropriate for some high-end online transaction processing (OLTP) applications, but this approach is not recommended for addressing the needs of a data warehouse.

Segregating the Transaction Log

Transaction log files should be maintained on a storage device that is physically separate from devices that contain data files. Depending on your database recovery model setting, most update activity generates both data device activity and log activity. If both are set up to share the same device, the operations to be performed compete for the same limited resources. Most installations benefit from separating these competing I/O activities.

Segregating tempdb

SQL Server creates a database, tempdb, on every server instance to be used by the server as a shared working area for various activities, including temporary tables, sorting, processing subqueries, building aggregates to support GROUP BY or ORDER BY clauses, queries using DISTINCT (temporary worktables have to be created to remove duplicate rows), cursors, and hash joins.

To move the tempdb database, use the ALTER DATABASE command to change the physical file location of the SQL Server logical file name associated with tempdb. For example, to move tempdb and its associated log to the new file locations E:\mssql7 and C:\temp, use the following commands:


ALTER DATABASE tempdb MODIFY FILE (NAME = 'tempdev', FILENAME = 'e:\mssql7\tempnew_location.mdf')

ALTER DATABASE tempdb MODIFY FILE (NAME = 'templog', FILENAME = 'c:\temp\tempnew_loglocation.ldf')

The master database, msdb, and model databases are not used much during production (as compared to user databases), so it is generally not necessary to consider them in I/O performance tuning considerations. The master database is usually used only for adding new logins, databases, devices, and other system objects.

Database Partitioning

Databases can be partitioned using files and/or filegroups. A filegroup is simply a named collection of individual files grouped together for administration purposes. A file cannot be a member of more than one filegroup. Tables, indexes, text, ntext, and image data can all be associated with a specific filegroup. This means that all their pages are allocated from the files in that filegroup. The three types of filegroups are:

● Primary filegroup. Contains the primary data file and any other files not placed into another filegroup. All pages for the system tables are allocated from the primary filegroup.

● User-defined filegroup. Any filegroup specified using the FILEGROUP keyword in a CREATE DATABASE or ALTER DATABASE statement, or on the Properties dialog box within SQL Server Enterprise Manager.

● Default filegroup. Contains the pages for all tables and indexes that do not have a filegroup specified when they are created. In each database, only one filegroup at a time can be the default filegroup. If no default filegroup is specified, the default is the primary filegroup.

Files and filegroups are useful for controlling the placement of data and indexes and eliminating device contention. Quite a few installations also leverage files and filegroups as a mechanism that is more granular than a database in order to exercise more control over their database backup/recovery strategy.
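A hedged sketch of placing a large fact table on its own filegroup (database, file, and table names are illustrative):

ALTER DATABASE SalesDW ADD FILEGROUP FG_FACT;
ALTER DATABASE SalesDW ADD FILE
  (NAME = 'fg_fact_1', FILENAME = 'F:\mssql\data\fg_fact_1.ndf', SIZE = 1024MB)
  TO FILEGROUP FG_FACT;
CREATE TABLE dbo.CUSTOMER_SALES_FACT
  (sale_id INT NOT NULL, customer_key INT NOT NULL, sale_amt MONEY)
  ON FG_FACT;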

Horizontal Partitioning (Table)

Horizontal partitioning segments a table into multiple tables, each containing the same number of columns but fewer rows. Determining how to partition tables horizontally depends on how data is analyzed. A general rule of thumb is to partition tables so queries reference as few tables as possible. Otherwise, excessive UNION queries, used to merge the tables logically at query time, can impair performance.

When you partition data across multiple tables or multiple servers, queries accessing only a fraction of the data can run faster because there is less data to scan. If the tables are located on different servers, or on a computer with multiple processors, each table involved in the query can also be scanned in parallel, thereby improving query performance. Additionally, maintenance tasks, such as rebuilding indexes or backing up a table, can execute more quickly.

By using a partitioned view, the data still appears as a single table and can be queried as such without having to reference the correct underlying table manually.
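A minimal, hedged sketch of a local partitioned view (table and column names are illustrative; the CHECK constraints on the partitioning column let the optimizer skip member tables that cannot satisfy a query):

CREATE TABLE SALES_2006 (sale_date DATETIME NOT NULL CHECK (sale_date >= '20060101' AND sale_date < '20070101'), amount MONEY);
CREATE TABLE SALES_2007 (sale_date DATETIME NOT NULL CHECK (sale_date >= '20070101' AND sale_date < '20080101'), amount MONEY);
GO
CREATE VIEW ALL_SALES AS
SELECT sale_date, amount FROM SALES_2006
UNION ALL
SELECT sale_date, amount FROM SALES_2007;
GO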

Cost Threshold for Parallelism Option

Use this option to specify the threshold where SQL Server creates and executes parallel plans. SQL Server creates and executes a parallel plan for a query only when the estimated cost to execute a serial plan for the same query is higher than the value set in cost threshold for parallelism. The cost refers to an estimated elapsed time in seconds required to execute the serial plan on a specific hardware configuration. Only set cost threshold for parallelism on symmetric multiprocessors (SMP).

Max Degree of Parallelism Option

Use this option to limit the number of processors (from a maximum of 32) to use in parallel plan execution. The default value is zero, which uses the actual number of available CPUs. Set this option to one to suppress parallel plan generation. Set the value to a number greater than one to restrict the maximum number of processors used by a single query execution.

Priority Boost Option

Use this option to specify whether SQL Server should run at a higher scheduling priority than other processes on the same computer. If you set this option to one, SQL Server runs at a priority base of 13. The default is zero, which is a priority base of seven.

Set Working Set Size Option


Use this option to reserve physical memory space for SQL Server that is equal to the server memory setting. The server memory setting is configured automatically by SQL Server based on workload and available resources. It can vary dynamically between minimum server memory and maximum server memory. Setting ‘set working set size’ means the operating system does not attempt to swap out SQL Server pages, even if they can be used more readily by another process when SQL Server is idle.
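These server-wide options are typically set with sp_configure; a hedged sketch (the values are illustrative, and most of these are advanced options):

EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'cost threshold for parallelism', 10;
EXEC sp_configure 'max degree of parallelism', 4;
EXEC sp_configure 'priority boost', 0;
EXEC sp_configure 'set working set size', 0;
RECONFIGURE;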

Optimizing Disk I/O Performance

When configuring a SQL Server that contains only a few gigabytes of data and does not sustain heavy read or write activity, you need not be particularly concerned with the subject of disk I/O and balancing of SQL Server I/O activity across hard drives for optimal performance. To build larger SQL Server databases however, which can contain hundreds of gigabytes or even terabytes of data and/or that sustain heavy read/write activity (as in a DSS application), it is necessary to drive configuration around maximizing SQL Server disk I/O performance by load-balancing across multiple hard drives.

Partitioning for Performance

For SQL Server databases that are stored on multiple disk drives, performance can be improved by partitioning the data to increase the amount of disk I/O parallelism.

Partitioning can be performed using a variety of techniques. Methods for creating and managing partitions include configuring the storage subsystem (i.e., disk, RAID partitioning) and applying various data configuration mechanisms in SQL Server such as files, file groups, tables and views. Some possible candidates for partitioning include:

● Transaction log
● Tempdb
● Database
● Tables
● Non-clustered indexes

Using bcp and BULK INSERT

Two mechanisms exist inside SQL Server to address the need for bulk movement of data: the bcp utility and the BULK INSERT statement.


● Bcp is a command prompt utility that copies data into or out of SQL Server.

● BULK INSERT is a Transact-SQL statement that can be executed from within the database environment. Unlike bcp, BULK INSERT can only pull data into SQL Server. An advantage of using BULK INSERT is that it can copy data into instances of SQL Server using a Transact-SQL statement, rather than having to shell out to the command prompt.

TIP Both of these mechanisms enable you to exercise control over the batch size. Unless you are working with small volumes of data, it is good to get in the habit of specifying a batch size for recoverability reasons. If none is specified, SQL Server commits all rows to be loaded as a single batch. For example, you attempt to load 1,000,000 rows of new data into a table. The server suddenly loses power just as it finishes processing row number 999,999. When the server recovers, those 999,999 rows will need to be rolled back out of the database before you attempt to reload the data. By specifying a batch size of 10,000, you could have saved significant recovery time, because SQL Server would only have had to roll back 9,999 rows instead of 999,999.
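A hedged sketch of a BULK INSERT with an explicit batch size (table name, file path, and delimiters are illustrative):

BULK INSERT dbo.CUSTOMER_SALES_FACT
FROM 'F:\loads\customer_sales.dat'
WITH (FIELDTERMINATOR = '|', ROWTERMINATOR = '\n', BATCHSIZE = 10000, TABLOCK);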

General Guidelines for Initial Data Loads

While loading data:

● Remove indexes.
● Use BULK INSERT or bcp.
● Parallel load using partitioned data files into partitioned tables.
● Run one load stream for each available CPU.
● Set Bulk-Logged or Simple Recovery model.
● Use the TABLOCK option.
● Create indexes.
● Switch to the appropriate recovery model.
● Perform backups.

General Guidelines for Incremental Data Loads

● Load data with indexes in place.

● Use performance and concurrency requirements to determine locking granularity (sp_indexoption).

● Change from Full to Bulk-Logged Recovery mode unless there is an overriding need to preserve point-in-time recovery, such as online users modifying the database during bulk loads. Read operations should not affect bulk loads.


Performance Tuning Databases (Teradata)

Challenge

Database tuning can result in tremendous improvement in loading performance. This Best Practice provides tips on tuning Teradata.

Description

Teradata offers several bulk load utilities including:

● MultiLoad, which supports inserts, updates, deletes, and “upserts” to any table.

● FastExport, which is a high-performance bulk export utility.

● BTEQ, which allows you to export data to a flat file but is suitable for smaller volumes than FastExport.

● FastLoad, which is used for loading inserts into an empty table.

● TPump, which is a light-weight utility that does not lock the table that is being loaded.

Tuning MultiLoad

There are many aspects to tuning a Teradata database. Several aspects of tuning can be controlled by setting MultiLoad parameters to maximize write throughput. Other areas to analyze when performing a MultiLoad job include estimating space requirements and monitoring MultiLoad performance.

MultiLoad parameters

Below are the MultiLoad-specific parameters that are available in PowerCenter:

● TDPID. A client based operand that is part of the logon string.

● Date Format. Ensure that the date format used in your target flat file is equivalent to the date format parameter in your MultiLoad script. Also validate that your date format is compatible with the date format specified in the Teradata database.

● Checkpoint. A checkpoint interval is similar to a commit interval for other databases. When you set the checkpoint value to less than 60, it represents the interval in minutes between checkpoint operations. If the checkpoint is set to a value greater than 60, it represents the number of records to write before performing a checkpoint operation. To maximize write speed to the database, try to limit the number of checkpoint operations that are performed.

● Tenacity. Interval in hours between MultiLoad attempts to log on to the database when the maximum number of sessions are already running.

● Load Mode. Available load methods include Insert, Update, Delete, and Upsert. Consider creating separate external loader connections for each method, selecting the one that will be most efficient for each target table.

● Drop Error Tables. Allows you to specify whether to drop or retain the three error tables for a MultiLoad session. Set this parameter to 1 to drop error tables or 0 to retain error tables.

● Max Sessions. This parameter specifies the maximum number of sessions that are allowed to log on to the database. This value should not exceed one per working amp (Access Module Process).

● Sleep. This parameter specifies the number of minutes that MultiLoad waits before retrying a logon operation.

Estimating Space Requirements for MultiLoad Jobs

Always estimate the final size of your MultiLoad target tables and make sure the destination has enough space to complete your MultiLoad job. In addition to the space that may be required by target tables, each MultiLoad job needs permanent space for:

● Work tables ● Error tables ● Restart Log table

Note: Spool space cannot be used for MultiLoad work tables, error tables, or the restart log table. Spool space is freed at each restart. By using permanent space for the MultiLoad tables, data is preserved for restart operations after a system failure. Work tables, in particular, require a lot of extra permanent space. Also remember to account for the size of error tables since error tables are generated for each target table.

Use the following formula to prepare the preliminary space estimate for one target table, assuming no fallback protection, no journals, and no non-unique secondary indexes:


PERM = (using data size + 38) x (number of rows processed) x (number of apply conditions satisfied) x (number of Teradata SQL statements within the applied DML)
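For illustration (the figures are hypothetical): a 100-byte USING data size, 1,000,000 rows processed, one satisfied apply condition, and one DML statement give

PERM = (100 + 38) x 1,000,000 x 1 x 1 = 138,000,000 bytes (roughly 132 MB)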

Make adjustments to your preliminary space estimates according to the requirements and expectations of your MultiLoad job.

Monitoring MultiLoad Performance

Below are tips for analyzing MultiLoad performance:

1. Determine which phase of the MultiLoad job is causing poor performance.

● If the performance bottleneck is during the acquisition phase, as data is acquired from the client system, then the issue may be with the client system. If it is during the application phase, as data is applied to the target tables, then the issue is not likely to be with the client system.

● The MultiLoad job output lists the job phases and other useful information. Save these listings for evaluation.

2. Use the Teradata RDBMS Query Session utility to monitor the progress of the MultiLoad job.

3. Check for locks on the MultiLoad target tables and error tables.

4. Check the DBC.Resusage table for problem areas, such as data bus or CPU capacities at or near 100 percent for one or more processors.

5. Determine whether the target tables have non-unique secondary indexes (NUSIs). NUSIs degrade MultiLoad performance because the utility builds a separate NUSI change row to be applied to each NUSI sub-table after all of the rows have been applied to the primary table.

6. Check the size of the error tables. Write operations to the fallback error tables are performed at normal SQL speed, which is much slower than normal MultiLoad tasks.

7. Verify that the primary index is unique. Non-unique primary indexes can cause severe MultiLoad performance problems.

8. Poor performance can occur when the input data is skewed with respect to the Primary Index. Teradata depends upon random, well-distributed data for data input and retrieval. For example, a file containing a million rows with a single value 'AAAAAA' for the Primary Index will take an extremely long time to load.

9. One common tool for diagnosing load issues, skewed data, and locks is Performance Monitor (PMON). PMON requires MONITOR access on the Teradata system. If you do not have MONITOR access, the DBA can help you to look at the system.

10. SQL against the system catalog can also be used to identify performance bottlenecks. Spool space (a type of work space) builds as data is transferred to the database, so if the load is going well, spool grows rapidly in the database. Use the following query to see if the load is inserting data into the system:

SELECT SUM(CurrentSpool) FROM DBC.DiskSpace WHERE DatabaseName = '<user ID performing the load>';

After the spool reaches its peak, it falls rapidly as data is inserted from spool into the table. If the spool grows slowly, the input data is probably skewed.

FastExport

FastExport is a bulk export Teradata utility. One way to pull data for lookups or sources is by using ODBC, since there is no native connectivity to Teradata. However, ODBC is slow. For higher performance, use FastExport if the number of rows to be pulled is on the order of a million rows. FastExport writes to a file; the lookup or source qualifier then reads this file. FastExport is integrated with PowerCenter.

BTEQ

BTEQ is a SQL executor utility similar to SQL*Plus. Like FastExport, BTEQ allows you to export data to a flat file, but it is suitable for smaller volumes of data. This provides faster performance than ODBC but doesn't tax Teradata system resources the way FastExport can. A possible use for BTEQ with PowerCenter is to export smaller volumes of data to a flat file (i.e., less than 1 million rows). The flat file is then read by PowerCenter. BTEQ is not integrated with PowerCenter but can be called from a pre-session script.
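A minimal, hedged sketch of such a pre-session export (logon string, file paths, and table names are illustrative):

bteq < /informatica/powercenter/Scripts/export_region.bteq > /informatica/powercenter/Scripts/export_region.log

where export_region.bteq contains something like:

.LOGON tdpid/etl_user,etl_password
.EXPORT REPORT FILE = /staging/region_lkp.dat
SELECT region_id, region_name FROM edw.region_dim;
.EXPORT RESET
.LOGOFF
.QUIT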

TPump

TPump is a load utility primarily intended for streaming data (think of loading bundles of messages arriving from MQ using PowerCenter Real Time). TPump can also load from a file or a named pipe.

While FastLoad and MultiLoad are bulk load utilities, TPump is a lightweight utility. Another important difference between MultiLoad and TPump is that TPump locks at the row-hash level instead of the table level, thus providing users read access to fresher data. Teradata claims to have improved the speed of TPump for loading files so that it is comparable to MultiLoad, so try a test load using TPump first. Also, be cautious with the use of TPump to load streaming data if the data throughput is large.

Push Down Optimization

PowerCenter embeds a powerful engine with its own memory management system and smart algorithms for transformation operations such as aggregation, sorting, joining, and lookups. This is typically referred to as an ETL architecture, where Extracts, Transformations, and Loads are performed. Data is extracted from the data source to the PowerCenter engine (which can be on the same machine as the source or on a separate machine), where all the transformations are applied, and is then pushed to the target. Some of the performance considerations for this type of architecture are:

● Is the network fast enough and tuned effectively to support the necessary data transfer?

● Is the hardware on which PowerCenter is running sufficiently robust, with high processing capability and high memory capacity?

ELT (Extract, Load, Transform) is a relatively new design or runtime paradigm that became popular with the advent of high-performance RDBMS platforms for DSS and OLTP workloads. Because Teradata typically runs on well-tuned operating systems and well-tuned hardware, the ELT paradigm tries to push as much of the transformation logic as possible onto the Teradata system.

The ELT design paradigm can be achieved through the Pushdown Optimization option offered with PowerCenter.

ETL or ELT

Because many database vendors and consultants advocate using ELT (Extract, Load and Transform) over ETL (Extract, Transform and Load), the use of Pushdown Optimization can be somewhat controversial. Informatica advocates using Pushdown Optimization as an option to solve specific performance situations rather than as the default design of a mapping.


The following scenarios can help in deciding on when to use ETL with PowerCenter and when to use ELT (i.e., Pushdown Optimization):

1. When the load needs to look up only dimension tables then there may be no need to use Pushdown Optimization. In this context, PowerCenter's ability to build dynamic, persistent caching is significant. If a daily load involves 10s or 100s of fact files to be loaded throughout the day, then dimension surrogate keys can be easily obtained from PowerCenter's cache in memory. Compare this with the cost of running the same dimension lookup queries on the database.

2. In many cases, loads into large Teradata systems involve only a small amount of data. In such cases there may be no need to push down.

3. When only simple filters or expressions need to be applied on the data then there may be no need to push down. The special case is that of applying filters or expression logic to non-unique columns in incoming data in PowerCenter. Compare this to loading the same data into the database and then applying a WHERE clause on a non-unique column, which is highly inefficient for a large table. The principle here is: Filter and resolve the data AS it gets loaded instead of loading it into a database, querying the RDBMS to filter/resolve and re-loading it into the database. In other words, ETL instead of ELT.

4. Pushdown Optimization needs to be considered only if a large set of data needs to be merged or queried to arrive at your final load set.

Maximizing Performance using Pushdown Optimization

You can push transformation logic to either the source or target database using pushdown optimization. The amount of work you can push to the database depends on the pushdown optimization configuration, the transformation logic, and the mapping and session configuration.

When you run a session configured for pushdown optimization, the Integration Service analyzes the mapping and writes one or more SQL statements based on the mapping transformation logic. The Integration Service analyzes the transformation logic, mapping, and session configuration to determine the transformation logic it can push to the database. At run time, the Integration Service executes any SQL statement generated against the source or target tables, and processes any transformation logic that it cannot push to the database.

Use the Pushdown Optimization Viewer to preview the SQL statements and mapping logic that the Integration Service can push to the source or target database. You can also use the Pushdown Optimization Viewer to view the messages related to Pushdown Optimization.

Known Issues with Teradata

You may encounter the following problems using ODBC drivers with a Teradata database:

● Teradata sessions fail if the session requires a conversion to a numeric data type and the precision is greater than 18.

● Teradata sessions fail when you use full pushdown optimization for a session containing a Sorter transformation.

● A sort on a distinct key may give inconsistent results if the sort is not case sensitive and one port is a character port.

● A session containing an Aggregator transformation may produce different results from PowerCenter if the group by port is a string data type and it is not case-sensitive.

● A session containing a Lookup transformation fails if it is configured for target-side pushdown optimization.

● A session that requires type casting fails if the casting is from x to date/time.
● A session that contains a date to string conversion fails.

Working with SQL Overrides

You can configure the Integration Service to perform an SQL override with Pushdown Optimization. To perform an SQL override, you configure the session to create a view. When you use a SQL override for a Source Qualifier transformation in a session configured for source or full Pushdown Optimization with a view, the Integration Service creates a view in the source database based on the override. After it creates the view in the database, the Integration Service generates a SQL query that it can push to the database. The Integration Service runs the SQL query against the view to perform Pushdown Optimization.

Note: To use an SQL override with pushdown optimization, you must configure the session for pushdown optimization with a view.

Running a Query

If the Integration Service did not successfully drop the view, you can run a query against the source database to search for the views generated by the Integration Service. When the Integration Service creates a view, it uses a prefix of PM_V. You can search for views with this prefix to locate the views created during pushdown optimization.

Teradata specific SQL:

SELECT TableName FROM DBC.Tables

WHERE CreatorName = USER

AND TableKind ='V'

AND TableName LIKE 'PM\_V%' ESCAPE '\'
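If orphaned views are found, a similar hedged sketch can generate the cleanup statements:

SELECT 'DROP VIEW ' || TRIM(DatabaseName) || '.' || TRIM(TableName) || ' ;'
FROM DBC.Tables
WHERE CreatorName = USER
AND TableKind = 'V'
AND TableName LIKE 'PM\_V%' ESCAPE '\'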

Rules and Guidelines for SQL OVERRIDE

Use the following rules and guidelines when you configure pushdown optimization for a session containing an SQL override:

Last updated: 01-Feb-07 18:54


Performance Tuning UNIX Systems

Challenge

Identify opportunities for performance improvement within the complexities of the UNIX operating environment.

Description

This section provides an overview of the subject area, followed by discussion of the use of specific tools.

Overview

All system performance issues are fundamentally resource contention issues. In any computer system, there are three essential resources: CPU, memory, and I/O (namely disk and network I/O). From this standpoint, performance tuning for PowerCenter means ensuring that PowerCenter and its sub-processes have adequate resources to execute in a timely and efficient manner.

Each resource has its own particular set of problems. Resource problems are complicated because all resources interact with each other. Performance tuning is about identifying bottlenecks and making trade-offs to improve the situation. Your best approach is to take a baseline measurement initially and obtain a good understanding of how the system behaves, then evaluate any bottleneck revealed for each system resource during your load window and remove whichever resource contention offers the greatest opportunity for performance enhancement.

Here is a summary of each system resource area and the problems it can have.

CPU

● On any multiprocessing and multi-user system, many processes want to use the CPUs at the same time. The UNIX kernel is responsible for allocation of a finite number of CPU cycles across all running processes. If the total demand on the CPU exceeds its finite capacity, then all processing is likely to reflect a negative impact on performance; the system scheduler puts each process in a queue to wait for CPU availability.


● An average of the count of active processes in the system for the last 1, 5, and 15 minutes is reported as the load average when you execute the uptime command. The load average provides a basic indicator of the number of contenders for CPU time. Likewise, the vmstat command provides the average usage of all the CPUs along with the number of processes contending for CPU (the value under the r column); see the example after this list.

● On SMP (symmetric multiprocessing) architecture servers, watch for even utilization of all the CPUs. How well all the CPUs are utilized depends on how well an application can be parallelized. If a process incurs a high degree of involuntary context switching by the kernel, binding the process to a specific CPU may improve performance.
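A minimal sketch of the commands referenced above (output columns and flags vary by UNIX flavor):

uptime            # load averages over the last 1, 5, and 15 minutes
vmstat 5 10       # the r column shows processes waiting for CPU; overall CPU usage is also reported
sar -u 5 10       # per-interval CPU utilization (%usr, %sys, %wio, %idle)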

Memory

● Memory contention arises when the memory requirements of the active processes exceed the physical memory available on the system; at this point, the system is out of memory. To handle this lack of memory, the system starts paging, or moving portions of active processes to disk in order to reclaim physical memory. When this happens, performance decreases dramatically. Paging is distinguished from swapping, which means moving entire processes to disk and reclaiming their space. Paging and excessive swapping indicate that the system can't provide enough memory for the processes that are currently running.

● Commands such as vmstat and pstat show whether the system is paging; ps, prstat and sar can report the memory requirements of each process.

Disk I/O

● The I/O subsystem is a common source of resource contention problems. A finite amount of I/O bandwidth must be shared by all the programs (including the UNIX kernel) that currently run. The system's I/O buses can transfer only so many megabytes per second; individual devices are even more limited. Each type of device has its own peculiarities and, therefore, its own problems.

● Tools are available to evaluate specific parts of the I/O subsystem:

❍ iostat can give you information about the transfer rates for each disk drive. ps and vmstat can give some information about how many processes are blocked waiting for I/O.
❍ sar can provide voluminous information about I/O efficiency.
❍ sadp can give detailed information about disk access patterns.


Network I/O

● The source data, the target data, or both the source and target data are likely to be connected through an Ethernet channel to the system where PowerCenter resides. Be sure to consider the number of Ethernet channels and bandwidth available to avoid congestion.

❍ netstat shows packet activity on a network; watch for a high collision rate of output packets on each interface.

❍ nfsstat monitors NFS traffic; execute nfsstat -c from a client machine (not from the NFS server); watch for a high timeout rate relative to total calls and for "not responding" messages.

Given that these issues all boil down to access to some computing resource, mitigation of each issue consists of making some adjustment to the environment to provide more (or preferential) access to the resource; for instance:

● Adjusting execution schedules to allow leverage of low usage times may improve availability of memory, disk, network bandwidth, CPU cycles, etc.

● Migrating other applications to other hardware is likely to reduce demand on the hardware hosting PowerCenter.

● For CPU intensive sessions, raising CPU priority (or lowering priority for competing processes) provides more CPU time to the PowerCenter sessions.

● Adding hardware resources, such as adding memory, can make more resource available to all processes.

● Re-configuring existing resources may provide for more efficient usage, such as assigning different disk devices for input and output, striping disk devices, or adjusting network packet sizes.

Detailed Usage

The following tips have proven useful in performance tuning UNIX-based machines. While some of these tips are likely to be more helpful than others in a particular environment, all are worthy of consideration.

Availability, syntax and format of each varies across UNIX versions.

Running ps -axu


Run ps -axu to check for the following items:

● Are there any processes waiting for disk access or for paging? If so check the I/O and memory subsystems.

● What processes are using most of the CPU? This may help to distribute the workload better.

● What processes are using most of the memory? This may help to distribute the workload better.

● Does ps show that your system is running many memory-intensive jobs? Look for jobs with a large resident set size (RSS) or a high storage integral.

Identifying and Resolving Memory Issues

Use vmstat or sar to check for paging/swapping actions. Check the system to ensure that excessive paging/swapping does not occur at any time during the session processing. By using sar 5 10 or vmstat 1 10, you can get a snapshot of paging/swapping. If paging or excessive swapping does occur at any time, increase memory to prevent it. Paging/swapping, on any database system, causes a major performance decrease and increased I/O. On a memory-starved and I/O-bound server, this can effectively shut down the PowerCenter process and any databases running on the server.

Some swapping may occur normally regardless of the tuning settings. This occurs because some processes use the swap space by their design. To check swap space availability, use pstat and swap. If the swap space is too small for the intended applications, it should be increased.
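A hedged sketch of checking swap space (command availability and flags vary by platform):

swap -s      # Solaris: summary of allocated, reserved, and available swap
swap -l      # Solaris: swap listed per device
pstat -s     # BSD-derived systems: swap usage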

Run vmstat 5 (sar -wpgr on SunOS) or vmstat -S 5 to detect and confirm memory problems, and check for the following:

● Are page-outs occurring consistently? If so, you are short of memory.

● Are there a high number of address translation faults? (System V only.) This suggests a memory shortage.

● Are swap-outs occurring consistently? If so, you are extremely short of memory. Occasional swap-outs are normal; BSD systems swap out inactive jobs. Long bursts of swap-outs mean that active jobs are probably falling victim and indicate extreme memory shortage. If you don't have vmstat -S, look at the w and de fields of vmstat. These should always be zero.

If memory seems to be the bottleneck, try the following remedial steps:


● Reduce the size of the buffer cache (if your system has one) by decreasing BUFPAGES.

● If you have statically allocated STREAMS buffers, reduce the number of large (e.g., 2048- and 4096-byte) buffers. This may reduce network performance, but netstat -m should give you an idea of how many buffers you really need.

● Reduce the size of your kernel's tables. This may limit the system's capacity (i.e., number of files, number of processes, etc.).

● Try running jobs requiring a lot of memory at night. This may not help the memory problems, but you may not care about them as much.

● Try running jobs requiring a lot of memory in a batch queue. If only one memory-intensive job is running at a time, your system may perform satisfactorily.

● Try to limit the time spent running sendmail, which is a memory hog.

● If you don't see any significant improvement, add more memory.

Identifying and Resolving Disk I/O Issues

Use iostat to check I/O load and utilization as well as CPU load. Iostat can be used to monitor the I/O load on the disks on the UNIX server. Using iostat permits monitoring the load on specific disks. Take notice of how evenly disk activity is distributed among the system disks. If it is not, are the most active disks also the fastest disks?

Run sadp to get a seek histogram of disk activity. Is activity concentrated in one area of the disk (good), spread evenly across the disk (tolerable), or in two well-defined peaks at opposite ends (bad)?

● Reorganize your file systems and disks to distribute I/O activity as evenly as possible.

● Using symbolic links helps to keep the directory structure the same throughout while still moving the data files that are causing I/O contention.

● Use your fastest disk drive and controller for your root file system; this almost certainly has the heaviest activity. Alternatively, if single-file throughput is important, put performance-critical files into one file system and use the fastest drive for that file system.

● Put performance-critical files on a file system with a large block size: 16KB or 32KB (BSD).

● Increase the size of the buffer cache by increasing BUFPAGES (BSD). This may hurt your system's memory performance.


● Rebuild your file systems periodically to eliminate fragmentation (i.e., backup, build a new file system, and restore).

● If you are using NFS and using remote files, look at your network situation. You don’t have local disk I/O problems.

● Check memory statistics again by running vmstat 5 (sar -rwpg). If your system is paging or swapping consistently, you have memory problems; fix the memory problem first. Swapping makes performance worse.

If your system has a disk capacity problem and is constantly running out of disk space, try the following actions:

● Write a find script that detects old core dumps, editor backup and auto-save files, and other trash and deletes it automatically. Run the script through cron.

● Use the disk quota system (if your system has one) to prevent individual users from gathering too much storage.

● Use a smaller block size on file systems that are mostly small files (e.g., source code files, object modules, and small data files).

Identifying and Resolving CPU Overload Issues

Use uptime or sar -u to check for CPU loading. Sar provides more detail, including %usr (user), %sys (system), %wio (waiting on I/O), and %idle (% of idle time). A target goal should be %usr + %sys = 80 and %wio = 10, leaving %idle at 10.

If %wio is higher, the disk and I/O contention should be investigated to eliminate the I/O bottleneck on the UNIX server. If the system shows a heavy %sys load and a high %idle with low %usr, this is indicative of memory contention and swapping/paging problems. In this case, it is necessary to make memory changes to reduce the load on the system server.

When you run iostat 5, also watch for CPU idle time. Is the idle time always 0, without letup? It is good for the CPU to be busy, but if it is always busy 100 percent of the time, work must be piling up somewhere. This points to CPU overload.

● Eliminate unnecessary daemon processes. rwhod and routed are particularly likely to be performance problems, but any savings will help.

● Get users to run jobs at night with at or any queuing system that's available. You may not care if the CPU (or the memory or I/O system) is overloaded at night, provided the work is done in the morning.


● Using nice to lower the priority of CPU-bound jobs improves interactive performance. Also, using nice to raise the priority of CPU-bound jobs expedites them but may hurt interactive performance. In general though, using nice is really only a temporary solution. If your workload grows, it will soon become insufficient. Consider upgrading your system, replacing it, or buying another system to share the load.

Identifying and Resolving Network I/O Issues

Suspect problems with network capacity or with data integrity if users experience slow performance when they are using rlogin or when they are accessing files via NFS.

Look at netstat -i. If the number of collisions is large, suspect an overloaded network. If the number of input or output errors is large, suspect hardware problems. A large number of input errors indicates problems somewhere on the network. A large number of output errors suggests problems with your system and its interface to the network.

If collisions and network hardware are not a problem, figure out which system appears to be slow. Use spray to send a large burst of packets to the slow system. If the number of dropped packets is large, the remote system most likely cannot respond to incoming data fast enough. Look to see if there are CPU, memory or disk I/O problems on the remote system. If not, the system may just not be able to tolerate heavy network workloads. Try to reorganize the network so that this system isn’t a file server.

A large number of dropped packets may also indicate data corruption. Run netstat -s on the remote system, then spray the remote system from the local system and run netstat -s again. If the increase in UDP socket full drops (as indicated by netstat) is equal to or greater than the number of dropped packets that spray reports, the remote system is most likely a slow network server. If the increase in socket full drops is less than the number of dropped packets, look for network errors.

Run nfsstat and look at the client RPC data. If the retrans field is more than 5 percent of calls, the network or an NFS server is overloaded. If timeout is high, at least one NFS server is overloaded, the network may be faulty, or one or more servers may have crashed. If badxid is roughly equal to timeout, at least one NFS server is overloaded. If timeout and retrans are high, but badxid is low, some part of the network between the NFS client and server is overloaded and dropping packets.

Try to prevent users from running I/O-intensive programs across the network. The grep utility is a good example of an I/O-intensive program. Instead, have users log into the remote system to do their work.


Reorganize the computers and disks on your network so that as many users as possible can do as much work as possible on a local system.

Use systems with good network performance as file servers.

lsattr -E -l sys0 is used to determine some current settings on some UNIX environments. (In Solaris, you execute prtenv.) Of particular interest is maxuproc, the setting that determines the maximum number of user background processes. On most UNIX environments this defaults to 40, but it should be increased to 250 on most systems.
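On AIX, for example, the setting can be inspected and raised as follows (a hedged sketch; other UNIX flavors use different kernel-tuning tools):

lsattr -E -l sys0 -a maxuproc     # show the current value
chdev -l sys0 -a maxuproc=250     # raise it to 250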

Choose a file system. Be sure to check the database vendor documentation to determine the best file system for the specific machine. Typical choices include: s5, the UNIX System V file system; ufs, the UNIX file system derived from Berkeley (BSD); vxfs, the Veritas file system; and lastly raw devices which, in reality, are not a file system at all. Additionally, for the PowerCenter Grid option, cluster file system (CFS) products such as GFS for Red Hat Linux, Veritas CFS, and GPFS for IBM AIX are some of the available choices.

Cluster File System Tuning

In order to take full advantage of the PowerCenter Grid option, cluster file system (CFS) is recommended. PowerCenter Grid option requires that the directories for each Integration Service to be shared with other servers. This allows Integration Services to share files such as cache files between different session runs. CFS performance is a result of tuning parameters and tuning the infrastructure. Therefore, using the parameters recommended by each CFS vendor is the best approach for CFS tuning.

PowerCenter Options

The Integration Service Monitor is available to display system resource usage information about associated nodes. The window displays resource usage information about the running tasks, including CPU%, memory, and swap usage.

The PowerCenter 64-bit option can allocate more memory to sessions and achieve higher throughputs compared to 32-bit version of PowerCenter.

Last updated: 01-Feb-07 18:54


Performance Tuning Windows 2000/2003 Systems

Challenge

Windows Server is designed as a self-tuning operating system. Standard installation of Windows Server provides good performance out-of-the-box, but optimal performance can be achieved by tuning.

Note: Tuning is essentially the same for both Windows 2000 and 2003-based systems.

Description

The following tips have proven useful in performance-tuning Windows Servers. While some are likely to be more helpful than others in any particular environment, all are worthy of consideration.

The two places to begin tuning a Windows server are:

● Performance Monitor.

● Performance tab (hit Ctrl+Alt+Del, choose Task Manager, and click on the Performance tab).

Although Performance Monitor can track counters in real time, creating a result set that represents a full day is more likely to render an accurate view of system performance.

Resolving Typical Windows Server Problems

The following paragraphs describe some common performance problems in a Windows Server environment and suggest tuning solutions.

Server Load: Assume that some software will not be well coded, and that some background processes (e.g., a mail server or web server) running on the same machine can potentially starve the machine's CPUs. In this situation, off-loading the CPU hogs may be the only recourse.


Device Drivers: The device drivers for some types of hardware are notorious for inefficient CPU clock cycles. Be sure to obtain the latest drivers from the hardware vendor to minimize this problem.

Memory and services: Although adding memory to Windows Server is always a good solution, it is also expensive and usually must be planned in advance. Before adding memory, check the Services in Control Panel because many background applications do not uninstall the old service when installing a new version. Thus, both the unused old service and the new service may be consuming valuable CPU and memory resources.

I/O Optimization: This is, by far, the best tuning option for database applications in the Windows Server environment. If necessary, level the load across the disk devices by moving files. In situations where there are multiple controllers, be sure to level the load across the controllers too.

Using electrostatic devices and fast-wide SCSI can also help to increase performance. Further, fragmentation can usually be eliminated by using a Windows Server disk defragmentation product.

Finally, on Windows Servers, be sure to implement disk striping to split single data files across multiple disk drives and take advantage of RAID (Redundant Arrays of Inexpensive Disks) technology. Also increase the priority of the disk devices on the Windows Server. Windows Server, by default, sets the disk device priority low.

Monitoring System Performance in Windows Server

In Windows Server, PowerCenter uses system resources to process transformations, session execution, and the reading and writing of data. The PowerCenter Integration Service also uses system memory for other data such as aggregate, joiner, rank, and cached lookup tables. With Windows Server, you can use the System Monitor in the Performance Console of the administrative tools, or the system tools in Task Manager, to monitor the amount of system resources used by PowerCenter and to identify system bottlenecks.

Windows Server provides the following tools (accessible under the Control Panel/Administration Tools/Performance) for monitoring resource usage on your computer:

● System Monitor
● Performance Logs and Alerts

These Windows Server monitoring tools enable you to analyze usage and detect bottlenecks at the disk, memory, processor, and network level.

System Monitor

The System Monitor displays a graph which is flexible and configurable. You can copy counter paths and settings from the System Monitor display to the Clipboard and paste counter paths from Web pages or other sources into the System Monitor display. Because the System Monitor is portable, it is useful in monitoring other systems that require administration.

Performance Logs and Alerts

The Performance Logs and Alerts tool provides two types of performance-related logs—counter logs and trace logs—and an alerting function.

Counter logs record sampled data about hardware resources and system services based on performance objects and counters in the same manner as System Monitor. They can, therefore, be viewed in System Monitor. Data in counter logs can be saved as comma-separated or tab-separated files that are easily viewed with Excel.

Trace logs collect event traces that measure performance statistics associated with events such as disk and file I/O, page faults, or thread activity. The alerting function allows you to define a counter value that will trigger actions such as sending a network message, running a program, or starting a log. Alerts are useful if you are not actively monitoring a particular counter threshold value but want to be notified when it exceeds or falls below a specified value so that you can investigate and determine the cause of the change. You may want to set alerts based on established performance baseline values for your system.

Note: You must have Full Control access to a subkey in the registry in order to create or modify a log configuration. (The subkey is HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\SysmonLog\Log Queries.)

The predefined log settings under Counter Logs (i.e., System Overview) are configured to create a binary log that, after manual start-up, updates every 15 seconds and logs continuously until it achieves a maximum size. If you start logging with the default settings, data is saved to the Perflogs folder on the root directory and includes the counters: Memory\ Pages/sec, PhysicalDisk(_Total)\Avg. Disk Queue Length, and Processor(_Total)\ % Processor Time.
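If a command-line capture is more convenient, the same three counters can be sampled with typeperf, which ships with Windows Server 2003 (the interval, sample count, and output file name below are arbitrary examples):

typeperf "\Memory\Pages/sec" "\PhysicalDisk(_Total)\Avg. Disk Queue Length" "\Processor(_Total)\% Processor Time" -si 15 -sc 240 -o systemoverview.csv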

If you want to create your own log setting, right-click one of the log types.

PowerCenter Options

The Integration Service Monitor is available to display system resource usage information about associated nodes. The window displays resource usage information about running tasks, including CPU%, memory, and swap usage.

PowerCenter's 64-bit option running on Intel Itanium processor-based machines and 64-bit Windows Server 2003 can allocate more memory to sessions and achieve higher throughputs than the 32-bit version of PowerCenter on Windows Server.

Using the PowerCenter Grid option on Windows Server enables distribution of a session or sessions in a workflow to multiple servers and reduces the load window. The PowerCenter Grid option requires that the directories for each Integration Service be shared with other servers. This allows Integration Services to share files such as cache files among various session runs. With a Cluster File System (CFS), Integration Services running on various servers can perform concurrent reads and writes to the same block of data.

Last updated: 01-Feb-07 18:54


Recommended Performance Tuning Procedures

Challenge

To optimize PowerCenter load times by employing a series of performance tuning procedures.

Description

When a PowerCenter session or workflow is not performing at the expected or desired speed, there is a methodology that can help to diagnose problems that may be adversely affecting various components of the data integration architecture. While PowerCenter has its own performance settings that can be tuned, you must consider the entire data integration architecture, including the UNIX/Windows servers, network, disk array, and the source and target databases to achieve optimal performance. More often than not, an issue external to PowerCenter is the cause of the performance problem. In order to correctly and scientifically determine the most logical cause of the performance problem, you need to execute the performance tuning steps in a specific order. This enables you to methodically rule out individual pieces and narrow down the specific areas on which to focus your tuning efforts.

1. Perform Benchmarking

You should always have a baseline of current load times for a given workflow or session with a similar row count. Perhaps you are not achieving your required load window, or you simply think your processes could run more efficiently based on comparison with similar tasks that run faster. Use the benchmark to estimate what your desired performance goal should be and tune to that goal. Begin with the problem mapping that you created, along with a session and workflow that use all default settings. This helps to identify which changes have a positive impact on performance.

2. Identify the Performance Bottleneck Area

This step helps to narrow down the areas on which to focus further. Follow the areas and sequence below when attempting to identify the bottleneck:

● Target
● Source
● Mapping
● Session/Workflow
● System

The methodology steps you through a series of tests using PowerCenter to identify trends that point to where to focus next. Remember to go through these tests in a scientific manner: run them multiple times before reaching any conclusion, and always keep in mind that fixing one bottleneck area may create a different bottleneck. For more information, see Determining Bottlenecks.

3. "Inside" or "Outside" PowerCenter

Depending on the results of the bottleneck tests, optimize “inside” or “outside” PowerCenter. Be sure to perform the bottleneck test in the order prescribed in Determining Bottlenecks, since this is also the order in which you should make any performance changes.

Problems “outside” PowerCenter refers to anything that indicates the source of the performance problem is external to PowerCenter. The most common performance problems “outside” PowerCenter are source or target database problems, network bottlenecks, and server or operating system problems.

● For source database related bottlenecks, refer to Tuning SQL Overrides and Environment for Better Performance

● For target database related problems, refer to Performance Tuning Databases - Oracle, SQL Server, or Teradata

● For operating system problems, refer to Performance Tuning UNIX Systems or Performance Tuning Windows 2000/2003 Systems for more information.

Problems “inside” PowerCenter refers to anything that PowerCenter controls, such as actual transformation logic, and PowerCenter Workflow/Session settings. The session settings contain quite a few memory settings and partitioning options that can greatly improve performance. Refer to the Tuning Sessions for Better Performance for more information.

Although there are certain procedures to follow to optimize mappings, keep in mind that, in most cases, the mapping design is dictated by business logic; there may be a more efficient way to perform the business logic within the mapping, but you cannot ignore the necessary business logic to improve performance. Refer to Tuning Mappings for Better Performance for more information.

4. Re-Execute the Problem Workflow or Session

After you have completed the recommended steps for each relevant performance bottleneck, re-run the problem workflow or session and compare its load performance against the baseline benchmark. This step is iterative and should be performed after any performance-based setting is changed. You are trying to answer the question, “Did the performance change have a positive impact?” If so, move on to the next bottleneck. Be sure to prepare detailed documentation at every step along the way so you have a clear record of what was and wasn't tried.

While it may seem like there are an enormous number of areas where a performance problem can arise, if you follow the steps for finding the bottleneck(s), and apply the tuning techniques specific to it, you are likely to improve performance and achieve your desired goals.

Last updated: 01-Feb-07 18:54


Tuning and Configuring Data Analyzer and Data Analyzer Reports

Challenge

A Data Analyzer report that is slow to return data means lag time to a manager or business analyst. It can be a crucial point of failure in the acceptance of a data warehouse. This Best Practice offers some suggestions for tuning Data Analyzer and Data Analyzer reports.

Description

Performance tuning reports occurs both at the environment level and the reporting level. Often report performance can be enhanced by looking closely at the objective of the report rather than the suggested appearance. The following guidelines should help with tuning the environment and the report itself.

1. Perform Benchmarking. Benchmark the reports to determine an expected rate of return. Perform benchmarks at various points throughout the day and evening hours to account for inconsistencies in network traffic, database server load, and application server load. This provides a baseline to measure changes against.

2. Review Report. Confirm that all data elements are required in the report. Eliminate any unnecessary data elements, filters, and calculations. Also be sure to remove any extraneous charts or graphs. Consider if the report can be broken into multiple reports or presented at a higher level. These are often ways to create more visually appealing reports and allow for linked detail reports or drill down to detail level.

3. Scheduling of Reports. If the report is on-demand but can be changed to a scheduled report, schedule the report to run during hours when the system use is minimized. Consider scheduling large numbers of reports to run overnight. If mid-day updates are required, test the performance at lunch hours and consider scheduling for that time period. Reports that require filters by users can often be copied and filters pre-created to allow for scheduling of the report.

4. Evaluate Database. Database tuning occurs on multiple levels. Begin by reviewing the tables used in the report. Ensure that indexes have been created on dimension keys. If filters are used on attributes, test the creation of secondary indices to improve the efficiency of the query. Next, execute reports while a DBA monitors the database environment. This provides the DBA the opportunity to tune the database for querying. Finally, look into changes in database settings. Increasing the database memory in the initialization file often improves Data Analyzer performance significantly.

5. Investigate Network. Reports are simply database queries, which can be found by clicking the "View SQL" button on the report. Run the query from the report, against the database using a client tool on the server that the database resides on. One caveat to this is that even the database tool on the server may contact the outside network. Work with the DBA during this test to use a local database connection, (e.g., Bequeath / IPC Oracle’s local database communication protocol) and monitor the database throughout this process. This test may pinpoint if the bottleneck is occurring on the network or in the database. If, for instance, the query performs well regardless of where it is executed, but the report continues to be slow, this indicates an application server bottleneck. Common locations for network bottlenecks include router tables, web server demand, and server input/output. Informatica does recommend installing Data Analyzer on a dedicated application server.
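For an Oracle warehouse, one way to time the report query directly on the database server is an interactive SQL*Plus session with timing turned on (the connect string and script name below are placeholders for the SQL copied from the "View SQL" button; other databases have equivalent client tools):

sqlplus report_user@DWPROD
SQL> set timing on
SQL> @report_query.sql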

6. Tune the Schema. Having tuned the environment and minimized the report requirements, the final level of tuning involves changes to the database tables. Review the underperforming reports.

Can any of these be generated from aggregate tables instead of from base tables? Data Analyzer makes efficient use of linked aggregate tables by determining on a report-by-report basis if the report can utilize an aggregate table. By studying the existing reports and future requirements, you can determine what key aggregates can be created in the ETL tool and stored in the database.

Calculated metrics can also be created in an ETL tool and stored in the database instead of created in Data Analyzer. Each time a calculation must be done in Data Analyzer, it is being performed as part of the query process. To determine if a query can be improved by building these elements in the database, try removing them from the report and comparing report performance. Consider if these elements are appearing in a multitude of reports or simply a few.

7. Database Queries. As a last resort for under-performing reports, you may want to edit the actual report query. To determine if the query is the bottleneck, select the View SQL button on the report. Next, copy the SQL into a query utility and execute. (DBA assistance may be beneficial here.) If the query appears to be the bottleneck, revisit Steps 2 and 6 above to ensure that no additional report changes are possible. Once you have confirmed that the report is as required, work to edit the query while continuing to re-test it in a query utility. Additional options include utilizing database views to cache data prior to report generation. Reports are then built based on the view.

Note: Editing the report query requires query editing for each report change and may require editing during migrations. Be aware that this is a time-consuming process and a difficult-to-maintain method of performance tuning.

The Data Analyzer repository database should be tuned for an OLTP workload.

Tuning Java Virtual Machine (JVM)

JVM Layout

The Java Virtual Machine (JVM) heap is the repository for all live objects, dead objects, and free memory. The JVM has the following primary jobs:

• Execute code
• Manage memory
• Remove garbage objects

The size of the JVM determines how often and how long garbage collection runs.

The JVM parameters can be set in the "startWebLogic.cmd" or "startWebLogic.sh" if using the Weblogic application server.

Parameters of the JVM

1. -Xms and -Xmx parameters define the minimum and maximum heap size; for large applications like Data Analyzer, the values should be set equal to each other.

2. Start with -Xms512m -Xmx512m; as needed, increase the JVM heap by 128m or 256m to reduce garbage collection (see the example after this list).

3. The permanent generation holds the JVM's class and method objects; the -XX:MaxPermSize command-line parameter controls the permanent generation's size.

4. "NewSize" and "MaxNewSize" parameters control the new generation's minimum and maximum size.

5. -XX:NewRatio=5 divides the heap old-to-new in the ratio 5:1 (i.e., the old generation occupies 5/6 of the heap while the new generation occupies 1/6 of the heap).


• When the new generation fills up, it triggers a minor collection, in which surviving objects are moved to the old generation.

• When the old generation fills up, it triggers a major collection, which involves the entire object heap. This is more expensive in terms of resources than a minor collection.

6. If you increase the new generation size, the old generation size decreases. Minor collections occur less often, but the frequency of major collections increases.

7. If you decrease the new generation size, the old generation size increases. Minor collections occur more often, but the frequency of major collections decreases.

8. As a general rule, keep the new generation smaller than half the heap size (i.e., 1/4 or 1/3 of the heap size).

9. Enable additional JVMs if you expect large numbers of users. Informatica typically recommends two to three CPUs per JVM.
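As a concrete illustration of items 1 through 5, the flags can be placed in the JAVA_OPTIONS variable that startWebLogic.sh passes to the JVM. The sizes shown are illustrative starting points only, and -XX:MaxPermSize=128m is an assumed value rather than an Informatica recommendation:

JAVA_OPTIONS="-Xms512m -Xmx512m -XX:MaxPermSize=128m -XX:NewRatio=5"   # equal min/max heap; old:new generations split 5:1
export JAVA_OPTIONS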

Other Areas to Tune

Execute Threads

• Threads available to process simultaneous operations in WebLogic.
• Too few threads means CPUs are under-utilized and jobs are waiting for threads to become available.
• Too many threads means the system is wasting resources in managing threads. The OS performs unnecessary context switching.
• The default is 15 threads. Informatica recommends using the default value, but you may need to experiment to determine the optimal value for your environment.

Connection Pooling

The application borrows a connection from the pool, uses it, and then returns it to the pool by closing it.

• Initial capacity = 15
• Maximum capacity = 15
• Sum of connections of all pools should be equal to the number of execution threads.

Connection pooling avoids the overhead of growing and shrinking the pool size dynamically by setting the initial and maximum pool size at the same level.

Performance packs use platform-optimized (i.e., native) sockets to improve server performance. They are available on: Windows NT/2000 (default installed), Solaris 2.6/2.7, AIX 4.3, HP/UX, and Linux.

• Check Enable Native I/O on the server attribute tab.
• This adds <NativeIOEnabled> to config.xml as true.

For WebSphere, use the Performance Tuner to modify the configurable parameters.

For optimal configuration, separate the application server, the data warehouse database, and the repository database onto separate dedicated machines.


Application Server-Specific Tuning Details

JBoss Application Server

Web Container. Tune the web container by modifying the following configuration file so that it accepts a reasonable number of HTTP requests as required by the Data Analyzer installation. Ensure that the web container has an optimal number of threads available so that it can accept and process more HTTP requests.

<JBOSS_HOME>/server/informatica/deploy/jbossweb-tomcat.sar/META-INF/jboss-service.xml

The following is a typical configuration:

<!-- A HTTP/1.1 Connector on port 8080 -->
<Connector className="org.apache.coyote.tomcat4.CoyoteConnector"
    port="8080" minProcessors="10" maxProcessors="100"
    enableLookups="true" acceptCount="20" debug="0"
    tcpNoDelay="true" bufferSize="2048"
    connectionLinger="-1" connectionTimeout="20000" />

The following parameters may need tuning:

• minProcessors. Number of threads created initially in the pool.
• maxProcessors. Maximum number of threads that can ever be created in the pool.
• acceptCount. Controls the length of the queue of waiting requests when no more threads are available from the pool to process the request.
• connectionTimeout. Amount of time to wait before a URI is received from the stream. Default is 20 seconds. This avoids problems where a client opens a connection and does not send any data.
• tcpNoDelay. Set to true when data should be sent to the client without waiting for the buffer to be full. This reduces latency at the cost of more packets being sent over the network. The default is true.

• enableLookups. Determines whether a reverse DNS lookup is performed. This can be enabled to prevent IP spoofing. Enabling this parameter can cause problems when a DNS is misbehaving. The enableLookups parameter can be turned off when you implicitly trust all clients.

• connectionLinger. How long connections should linger after they are closed. Informatica recommends using the default value: -1 (no linger).

In the Data Analyzer application, each web page can potentially have more than one request to the application server. Hence, maxProcessors should always be more than the actual number of concurrent users. For an installation with 20 concurrent users, a minProcessors of 5 and a maxProcessors of 100 are suitable values.

If the number of threads is too low, the following message may appear in the log files:

ERROR [ThreadPool] All threads are busy, waiting. Please increase maxThreads

JSP Optimization. To avoid having the application server compile JSP scripts when they are executed for the first time, Informatica ships Data Analyzer with pre-compiled JSPs.

The following is a typical configuration:

<JBOSS_HOME>/server/informatica/deploy/jbossweb-tomcat.sar/web.xml

<servlet>
    <servlet-name>jsp</servlet-name>
    <servlet-class>org.apache.jasper.servlet.JspServlet</servlet-class>
    <init-param>
        <param-name>logVerbosityLevel</param-name>
        <param-value>WARNING</param-value>
    </init-param>
    <init-param>
        <param-name>development</param-name>
        <param-value>false</param-value>
    </init-param>
    <load-on-startup>3</load-on-startup>
</servlet>

The following parameter may need tuning:

• Set the development parameter to false in a production installation.

Database Connection Pool. Data Analyzer accesses the repository database to retrieve metadata information. When it runs reports, it accesses the data sources to get the report information. Data Analyzer keeps a pool of database connections for the repository. It also keeps a separate database connection pool for each data source. To optimize Data Analyzer database connections, you can tune the database connection pools.

Repository Database Connection Pool. To optimize the repository database connection pool, modify the JBoss configuration file:

<JBOSS_HOME>/server/informatica/deploy/<DB_Type>_ds.xml

The name of the file includes the database type. <DB_Type> can be Oracle, DB2, or other databases. For example, for an Oracle repository, the configuration file name is oracle_ds.xml. With some versions of Data Analyzer, the configuration file may simply be named DataAnalyzer-ds.xml.

The following is a typical configuration:

<datasources>
    <local-tx-datasource>
        <jndi-name>jdbc/IASDataSource</jndi-name>
        <connection-url>jdbc:informatica:oracle://aries:1521;SID=prfbase8</connection-url>
        <driver-class>com.informatica.jdbc.oracle.OracleDriver</driver-class>
        <user-name>powera</user-name>
        <password>powera</password>
        <exception-sorter-class-name>org.jboss.resource.adapter.jdbc.vendor.OracleExceptionSorter</exception-sorter-class-name>
        <min-pool-size>5</min-pool-size>
        <max-pool-size>50</max-pool-size>
        <blocking-timeout-millis>5000</blocking-timeout-millis>
        <idle-timeout-minutes>1500</idle-timeout-minutes>
    </local-tx-datasource>
</datasources>

The following parameters may need tuning:

• min-pool-size. The minimum number of connections in the pool. (The pool is lazily constructed; that is, it will be empty until it is first accessed. Once used, it will always have at least min-pool-size connections.)
• max-pool-size. The strict maximum size of the connection pool.
• blocking-timeout-millis. The maximum time in milliseconds that a caller waits to get a connection when no more free connections are available in the pool.
• idle-timeout-minutes. The length of time an idle connection remains in the pool before it is closed.


The max-pool-size value is recommended to be at least five more than the maximum number of concurrent users because there may be several scheduled reports running in the background and each of them needs a database connection.

A higher value is recommended for idle-timeout-minutes. Because Data Analyzer accesses the repository very frequently, it is inefficient to spend resources on checking for idle connections and cleaning them out. Checking for idle connections may block other threads that require new connections.

Data Source Database Connection Pool. Similar to the repository database connection pools, the data source also has a pool of connections that Data Analyzer dynamically creates as soon as the first client requests a connection.

The tuning parameters for these dynamic pools are present in the following file:

<JBOSS_HOME>/bin/IAS.properties

The following is a typical configuration:

#
# Datasource definition
#
dynapool.initialCapacity=5
dynapool.maxCapacity=50
dynapool.capacityIncrement=2
dynapool.allowShrinking=true
dynapool.shrinkPeriodMins=20
dynapool.waitForConnection=true
dynapool.waitSec=1
dynapool.poolNamePrefix=IAS_
dynapool.refreshTestMinutes=60
datamart.defaultRowPrefetch=20

The following JBoss-specific parameters may need tuning:

• dynapool.initialCapacity. The minimum number of initial connections in the data source pool.
• dynapool.maxCapacity. The maximum number of connections that the data source pool may grow to.
• dynapool.poolNamePrefix. This parameter is a prefix added to the dynamic JDBC pool name for identification purposes.
• dynapool.waitSec. The maximum amount of time (in seconds) a client will wait to grab a connection from the pool if none is readily available.
• dynapool.refreshTestMinutes. This parameter determines the frequency at which a health check is performed on the idle connections in the pool. This should not be performed too frequently because it locks up the connection pool and may prevent other clients from grabbing connections from the pool.
• dynapool.shrinkPeriodMins. This parameter determines the amount of time (in minutes) an idle connection is allowed to be in the pool. After this period, the number of connections in the pool shrinks back to the value of its initialCapacity parameter. This is done only if the allowShrinking parameter is set to true.

EJB Container

Data Analyzer uses EJBs extensively. It has more than 50 stateless session beans (SLSB) and more than 60 entity beans (EB). In addition, there are six message-driven beans (MDBs) that are used for the scheduling and real-time functionalities.

Stateless Session Beans (SLSB). For SLSBs, the most important tuning parameter is the EJB pool. You can tune the EJB pool parameters in the following file:


<JBOSS_HOME>/server/informatica/conf/standardjboss.xml

The following is a typical configuration:

<container-configuration>
    <container-name>Standard Stateless SessionBean</container-name>
    <call-logging>false</call-logging>
    <invoker-proxy-binding-name>stateless-rmi-invoker</invoker-proxy-binding-name>
    <container-interceptors>
        <interceptor>org.jboss.ejb.plugins.ProxyFactoryFinderInterceptor</interceptor>
        <interceptor>org.jboss.ejb.plugins.LogInterceptor</interceptor>
        <interceptor>org.jboss.ejb.plugins.SecurityInterceptor</interceptor>
        <!-- CMT -->
        <interceptor transaction="Container">org.jboss.ejb.plugins.TxInterceptorCMT</interceptor>
        <interceptor transaction="Container" metricsEnabled="true">org.jboss.ejb.plugins.MetricsInterceptor</interceptor>
        <interceptor transaction="Container">org.jboss.ejb.plugins.StatelessSessionInstanceInterceptor</interceptor>
        <!-- BMT -->
        <interceptor transaction="Bean">org.jboss.ejb.plugins.StatelessSessionInstanceInterceptor</interceptor>
        <interceptor transaction="Bean">org.jboss.ejb.plugins.TxInterceptorBMT</interceptor>
        <interceptor transaction="Bean" metricsEnabled="true">org.jboss.ejb.plugins.MetricsInterceptor</interceptor>
        <interceptor>org.jboss.resource.connectionmanager.CachedConnectionInterceptor</interceptor>
    </container-interceptors>
    <instance-pool>org.jboss.ejb.plugins.StatelessSessionInstancePool</instance-pool>
    <instance-cache></instance-cache>
    <persistence-manager></persistence-manager>
    <container-pool-conf>
        <MaximumSize>100</MaximumSize>
    </container-pool-conf>
</container-configuration>

The following parameter may need tuning:

• MaximumSize. Represents the maximum number of objects in the pool. If <strictMaximumSize> is set to true, then <MaximumSize> is a strict upper limit for the number of objects that can be created. If <strictMaximumSize> is set to false, the number of active objects can exceed the <MaximumSize> if there are requests for more objects. However, only the <MaximumSize> number of objects can be returned to the pool.

Additionally, there are two other parameters that you can set to fine tune the EJB pool. These two parameters are not set by default in Data Analyzer. They can be tuned after you have performed proper iterative testing in Data Analyzer to increase the throughput for high-concurrency installations.

• strictMaximumSize. When the value is set to true, the <strictMaximumSize> parameter enforces a rule that only <MaximumSize> number of objects can be active. Any subsequent requests must wait for an object to be returned to the pool.
• strictTimeout. If you set <strictMaximumSize> to true, then <strictTimeout> is the amount of time that requests wait for an object to be made available in the pool.


Message-Driven Beans (MDB). MDB tuning parameters are very similar to stateless bean tuning parameters. The main difference is that MDBs are not invoked by clients. Instead, the messaging system delivers messages to the MDB when they are available.

To tune the MDB parameters, modify the following configuration file:

<JBOSS_HOME>/server/informatica/conf/standardjboss.xml

The following is a typical configuration:

<container-configuration>
    <container-name>Standard Message Driven Bean</container-name>
    <call-logging>false</call-logging>
    <invoker-proxy-binding-name>message-driven-bean</invoker-proxy-binding-name>
    <container-interceptors>
        <interceptor>org.jboss.ejb.plugins.ProxyFactoryFinderInterceptor</interceptor>
        <interceptor>org.jboss.ejb.plugins.LogInterceptor</interceptor>
        <interceptor>org.jboss.ejb.plugins.RunAsSecurityInterceptor</interceptor>
        <!-- CMT -->
        <interceptor transaction="Container">org.jboss.ejb.plugins.TxInterceptorCMT</interceptor>
        <interceptor transaction="Container" metricsEnabled="true">org.jboss.ejb.plugins.MetricsInterceptor</interceptor>
        <interceptor transaction="Container">org.jboss.ejb.plugins.MessageDrivenInstanceInterceptor</interceptor>
        <!-- BMT -->
        <interceptor transaction="Bean">org.jboss.ejb.plugins.MessageDrivenInstanceInterceptor</interceptor>
        <interceptor transaction="Bean">org.jboss.ejb.plugins.MessageDrivenTxInterceptorBMT</interceptor>
        <interceptor transaction="Bean" metricsEnabled="true">org.jboss.ejb.plugins.MetricsInterceptor</interceptor>
        <interceptor>org.jboss.resource.connectionmanager.CachedConnectionInterceptor</interceptor>
    </container-interceptors>
    <instance-pool>org.jboss.ejb.plugins.MessageDrivenInstancePool</instance-pool>
    <instance-cache></instance-cache>
    <persistence-manager></persistence-manager>
    <container-pool-conf>
        <MaximumSize>100</MaximumSize>
    </container-pool-conf>
</container-configuration>

The following parameter may need tuning:

• MaximumSize. Represents the maximum number of objects in the pool. If <strictMaximumSize> is set to true, then <MaximumSize> is a strict upper limit for the number of objects that can be created. Otherwise, if <strictMaximumSize> is set to false, the number of active objects can exceed the <MaximumSize> if there are requests for more objects. However, only the <MaximumSize> number of objects can be returned to the pool.


Additionally, there are two other parameters that you can set to fine tune the EJB pool. These two parameters are not set by default in Data Analyzer. They can be tuned after you have performed proper iterative testing in Data Analyzer to increase the throughput for high-concurrency installations.

• strictMaximumSize. When the value is set to true, the <strictMaximumSize> parameter enforces a rule that only <MaximumSize> number of objects will be active. Any subsequent requests must wait for an object to be returned to the pool.
• strictTimeout. If you set <strictMaximumSize> to true, then <strictTimeout> is the amount of time that requests wait for an object to be made available in the pool.

Enterprise Java Beans (EJB). Data Analyzer EJBs use BMP (bean-managed persistence) as opposed to CMP (container-managed persistence). The EJB tuning parameters are very similar to the stateless bean tuning parameters.

The EJB tuning parameters are in the following configuration file:

<JBOSS_HOME>/server/informatica/conf/standardjboss.xml

The following is a typical configuration:

<container-configuration>
    <container-name>Standard BMP EntityBean</container-name>
    <call-logging>false</call-logging>
    <invoker-proxy-binding-name>entity-rmi-invoker</invoker-proxy-binding-name>
    <sync-on-commit-only>false</sync-on-commit-only>
    <container-interceptors>
        <interceptor>org.jboss.ejb.plugins.ProxyFactoryFinderInterceptor</interceptor>
        <interceptor>org.jboss.ejb.plugins.LogInterceptor</interceptor>
        <interceptor>org.jboss.ejb.plugins.SecurityInterceptor</interceptor>
        <interceptor>org.jboss.ejb.plugins.TxInterceptorCMT</interceptor>
        <interceptor metricsEnabled="true">org.jboss.ejb.plugins.MetricsInterceptor</interceptor>
        <interceptor>org.jboss.ejb.plugins.EntityCreationInterceptor</interceptor>
        <interceptor>org.jboss.ejb.plugins.EntityLockInterceptor</interceptor>
        <interceptor>org.jboss.ejb.plugins.EntityInstanceInterceptor</interceptor>
        <interceptor>org.jboss.ejb.plugins.EntityReentranceInterceptor</interceptor>
        <interceptor>org.jboss.resource.connectionmanager.CachedConnectionInterceptor</interceptor>
        <interceptor>org.jboss.ejb.plugins.EntitySynchronizationInterceptor</interceptor>
    </container-interceptors>
    <instance-pool>org.jboss.ejb.plugins.EntityInstancePool</instance-pool>
    <instance-cache>org.jboss.ejb.plugins.EntityInstanceCache</instance-cache>
    <persistence-manager>org.jboss.ejb.plugins.BMPPersistenceManager</persistence-manager>
    <locking-policy>org.jboss.ejb.plugins.lock.QueuedPessimisticEJBLock</locking-policy>
    <container-cache-conf>
        <cache-policy>org.jboss.ejb.plugins.LRUEnterpriseContextCachePolicy</cache-policy>
        <cache-policy-conf>
            <min-capacity>50</min-capacity>
            <max-capacity>1000000</max-capacity>
            <overager-period>300</overager-period>
            <max-bean-age>600</max-bean-age>
            <resizer-period>400</resizer-period>
            <max-cache-miss-period>60</max-cache-miss-period>
            <min-cache-miss-period>1</min-cache-miss-period>
            <cache-load-factor>0.75</cache-load-factor>
        </cache-policy-conf>
    </container-cache-conf>
    <container-pool-conf>
        <MaximumSize>100</MaximumSize>
    </container-pool-conf>
    <commit-option>A</commit-option>
</container-configuration>


The following parameter may need tuning:

• MaximumSize. Represents the maximum number of objects in the pool. If <strictMaximumSize> is set to true, then <MaximumSize> is a strict upper limit for the number of objects that can be created. Otherwise, if <strictMaximumSize> is set to false, the number of active objects can exceed the <MaximumSize> if there are requests for more objects. However, only the <MaximumSize> number of objects are returned to the pool.

Additionally, there are two other parameters that you can set to fine tune the EJB pool. These two parameters are not set by default in Data Analyzer. They can be tuned after you have performed proper iterative testing in Data Analyzer to increase the throughput for high-concurrency installations.

• strictMaximumSize. When the value is set to true, the <strictMaximumSize> parameter enforces a rule that only <MaximumSize> number of objects can be active. Any subsequent requests must wait for an object to be returned to the pool.
• strictTimeout. If you set <strictMaximumSize> to true, then <strictTimeout> is the amount of time that requests will wait for an object to be made available in the pool.

RMI Pool

The JBoss Application Server can be configured to have a pool of threads to accept connections from clients for remote method invocation (RMI). If you use the Java RMI protocol to access the Data Analyzer API from other custom applications, you can optimize the RMI thread pool parameters.

To optimize the RMI pool, modify the following configuration file:

<JBOSS_HOME>/server/informatica/conf/jboss-service.xml

The following is a typical configuration:

<mbean code="org.jboss.invocation.pooled.server.PooledInvoker"
       name="jboss:service=invoker,type=pooled">
    <attribute name="NumAcceptThreads">1</attribute>
    <attribute name="MaxPoolSize">300</attribute>
    <attribute name="ClientMaxPoolSize">300</attribute>
    <attribute name="SocketTimeout">60000</attribute>
    <attribute name="ServerBindAddress"></attribute>
    <attribute name="ServerBindPort">0</attribute>
    <attribute name="ClientConnectAddress"></attribute>
    <attribute name="ClientConnectPort">0</attribute>
    <attribute name="EnableTcpNoDelay">false</attribute>
    <depends optional-attribute-name="TransactionManagerService">
        jboss:service=TransactionManager
    </depends>
</mbean>


The following parameters may need tuning:

• NumAcceptThreads. The controlling threads used to accept connections from the client.
• MaxPoolSize. A strict maximum size for the pool of threads to service requests on the server.
• ClientMaxPoolSize. A strict maximum size for the pool of threads to service requests on the client.
• Backlog. The number of requests in the queue when all the processing threads are in use.
• EnableTcpNoDelay. Indicates whether information should be sent before the buffer is full. Setting it to true may increase the network traffic because more packets will be sent across the network.

WebSphere Application Server 5.1

The Tivoli Performance Viewer can be used to observe the behavior of some of the parameters and arrive at good settings.

Web Container

Navigate to “Application Servers > [your_server_instance] > Web Container > Thread Pool” to tune the following parameters:

• Minimum Size. Specifies the minimum number of threads to allow in the pool. The default value of 10 is appropriate.
• Maximum Size. Specifies the maximum number of threads to allow in the pool. For a highly concurrent usage scenario (with a 3 VM load-balanced configuration), a value of 50-60 has been determined to be optimal.
• Thread Inactivity Timeout. Specifies the number of milliseconds of inactivity that should elapse before a thread is reclaimed. The default of 3500ms is considered optimal.
• Is Growable. Specifies whether the number of threads can increase beyond the maximum size configured for the thread pool. Be sure to leave this option unchecked. Also, the maximum threads should be hard-limited to the value given in “Maximum Size”.

Note: In a load-balanced environment, there is likely to be more than one server instance that may be spread across multiple machines. In such a scenario, be sure that the changes have been properly propagated to all of the server instances.

Transaction Services

Total transaction lifetime timeout: In certain circumstances (e.g., import of large XML files), the default value of 120 seconds may not be sufficient and should be increased. This parameter can be modified during runtime also.

Diagnostic Trace Services

• Disable the trace in a production environment.
• Navigate to “Application Servers > [your_server_instance] > Administration Services > Diagnostic Trace Service” and make sure “Enable Tracing” is not checked.


Debugging Services

Ensure that the tracing is disabled in a production environment.

Navigate to “Application Servers > [your_server_instance] > Logging and Tracing > Diagnostic Trace Service > Debugging Service “ and make sure “Startup” is not checked.

Performance Monitoring Services

This set of parameters is for monitoring the health of the Application Server. This monitoring service tries to ping the application server after a certain interval; if the server is found to be dead, it then tries to restart the server.

Navigate to “Application Servers > [your_server_instance] > Process Definition > MonitoringPolicy “ and tune the parameters according to a policy determined for each Data Analyzer installation.

Note: The parameter “Ping Timeout” determines the time after which a no-response from the server implies that it is faulty. The monitoring service then attempts to kill the server and restart it if “Automatic restart” is checked. Take care that “Ping Timeout” is not set to too small a value.

Process Definitions (JVM Parameters)

For a Data Analyzer installation with a high number of concurrent users, Informatica recommends that the minimum and the maximum heap size be set to the same values. This avoids the heap allocation-reallocation expense during a high-concurrency scenario. Also, for a high-concurrency scenario, Informatica recommends setting the values of minimum heap and maximum heap size to at least 1000MB. Further tuning of this heap-size is recommended after carefully studying the garbage collection behavior by turning on the verbosegc option.

The following is a list of java parameters (for IBM JVM 1.4.1) that should not be modified from the default values for Data Analyzer installation:

• -Xnocompactgc. This parameter switches off heap compaction altogether. Switching off heap compaction results in heap fragmentation. Since Data Analyzer frequently allocates large objects, heap fragmentation can result in OutOfMemory exceptions.

• -Xcompactgc. Using this parameter leads to each garbage collection cycle carrying out compaction, regardless of whether it's useful.

• -Xgcthreads. This controls the number of garbage collection helper threads created by the JVM during startup. The default is N-1 threads for an N-processor machine. These threads provide the parallelism in parallel mark and parallel sweep modes, which reduces the pause time during garbage collection.

• -Xclassnogc. This disables collection of class objects.
• -Xinitsh. This sets the initial size of the application-class system heap. The system heap is expanded as needed and is never garbage collected.

You may want to alter the following parameters after carefully examining the application server processes:

• Navigate to “Application Servers > [your_server_instance] > Process Definition > Java Virtual Machine"

• Verbose garbage collection. Check this option to turn on verbose garbage collection. This can help in understanding the behavior of the garbage collection for the application. It has a very low overhead on performance and can be turned on even in the production environment.

• Initial heap size. This is the -ms value. Only the numeric value (without MB) needs to be specified. For concurrent usage, the initial heap size should be started at 1000 and, depending on the garbage collection behavior, can be potentially increased up to 2000. A value beyond 2000 may actually reduce throughput because the garbage collection cycles will take more time to go through the large heap, even though the cycles may occur less frequently.
• Maximum heap size. This is the -mx value. It should be equal to the “Initial heap size” value.
• RunHProf. This should remain unchecked in production mode, because it slows down the VM considerably.
• Debug Mode. This should remain unchecked in production mode, because it slows down the VM considerably.
• Disable JIT. This should remain unchecked (i.e., JIT should never be disabled).

Performance Monitoring Services

Be sure that performance monitoring services are not enabled in a production environment.

Navigate to “Application Servers > [your_server_instance] > Performance Monitoring Services“ and be sure “Startup” is not checked.

Database Connection Pool

The repository database connection pool can be configured by navigating to “JDBC Providers > User-defined JDBC Provider > Data Sources > IASDataSource > Connection Pools”.

The various parameters that may need tuning are:

• Connection Timeout. The default value of 180 seconds should be good. This implies that after 180 seconds, the request to grab a connection from the pool will time out. After it times out, Data Analyzer will throw an exception. In that case, the pool size may need to be increased.
• Max Connections. The maximum number of connections in the pool. Informatica recommends a value of 50 for this.
• Min Connections. The minimum number of connections in the pool. Informatica recommends a value of 10 for this.
• Reap Time. This specifies the frequency of the pool maintenance thread. This should not be set very high because when the pool maintenance thread is running, it blocks the whole pool and no process can grab a new connection from the pool. If the database and the network are reliable, this should have a very high value (e.g., 1000).
• Unused Timeout. This specifies the time in seconds after which an unused connection will be discarded until the pool size reaches the minimum size. In a highly concurrent usage, this should be a high value. The default of 1800 seconds should be fine.
• Aged Timeout. Specifies the interval in seconds before a physical connection is discarded. If the database and the network are stable, there should not be a reason for age timeout. The default is 0 (i.e., connections do not age). If the database or the network connection to the repository database frequently comes down (compared to the life of the AppServer), this can be used to age out the stale connections.

Much like the repository database connection pools, the data source or data warehouse databases also have a pool of connections that are created dynamically by Data Analyzer as soon as the first client makes a request.

The tuning parameters for these dynamic pools are present in the <WebSphere_Home>/AppServer/IAS.properties file.


The following is a typical configuration:

#

# Datasource definition

#

dynapool.initialCapacity=5

dynapool.maxCapacity=50

dynapool.capacityIncrement=2

dynapool.allowShrinking=true

dynapool.shrinkPeriodMins=20

dynapool.waitForConnection=true

dynapool.waitSec=1

dynapool.poolNamePrefix=IAS_

dynapool.refreshTestMinutes=60

datamart.defaultRowPrefetch=20

The various parameters that may need tuning are:

• dynapool.initialCapacity - the minimum number of initial connections in the data-source pool.
• dynapool.maxCapacity - the maximum number of connections that the data-source pool may grow up to.
• dynapool.poolNamePrefix - a prefix added to the dynamic JDBC pool name for identification purposes.
• dynapool.waitSec - the maximum amount of time (in seconds) that a client will wait to grab a connection from the pool if none is readily available.
• dynapool.refreshTestMinutes - determines the frequency at which a health check on the idle connections in the pool is performed. Such checks should not be performed too frequently because they lock up the connection pool and may prevent other clients from grabbing connections from the pool.

• dynapool.shrinkPeriodMins - determines the amount of time (in minutes) an idle connection is allowed to be in the pool. After this period, the number of connections in the pool decreases (to its initialCapacity). This is done only if allowShrinking is set to true.

Message Listener Services

To process scheduled reports, Data Analyzer uses Message-Driven-Beans. It is possible to run multiple reports within one schedule in parallel by increasing the number of instances of the MDB catering to the Scheduler (InfScheduleMDB). Take care however, not to increase the value to some arbitrarily high value since each report consumes considerable resources (e.g., database connections, and CPU processing at both the application-server and database server levels) and setting this to a very high value may actually be detrimental to the whole system.


Navigate to “Application Servers > [your_server_instance] > Message Listener Service > Listener Ports > IAS_ScheduleMDB_ListenerPort” .

The parameters that can be tuned are:

• Maximum sessions. The default value is one. On a highly-concurrent user scenario, Informatica does not recommend going beyond five.

• Maximum messages. This should remain as one. This implies that each report in a schedule will be executed in a separate transaction instead of a batch. Setting it to more than one may have unwanted effects like transaction timeouts, and the failure of one report may cause all the reports in the batch to fail.

Plug-in Retry Intervals and Connect Timeouts

When Data Analyzer is set up in a clustered WebSphere environment, a plug-in is normally used to perform the load-balancing between each server in the cluster. The proxy http-server sends the request to the plug-in and the plug-in then routes the request to the proper application-server.

The plug-in file can be generated automatically by navigating to “Environment > Update web server plugin configuration”.

The default plug-in file contains ConnectTimeout=0, which means that it relies on the TCP timeout setting of the operating system. It is possible to have different timeout settings for different servers in the cluster. The timeout setting implies that if the server does not respond within the given number of seconds, it is marked as down and the request is sent over to the next available member of the cluster.

The RetryInterval parameter allows you to specify how long to wait before retrying a server that is marked as down. The default value is 10 seconds. This means if a cluster member is marked as down, the server does not try to send a request to the same member for 10 seconds.
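For illustration, both settings appear in the generated plugin-cfg.xml along the following lines; the cluster name, server name, host, and port are placeholders, the RetryInterval value restates the 10-second default described above, and ConnectTimeout is shown with a non-zero example value of 5 seconds:

<ServerCluster Name="DataAnalyzerCluster" RetryInterval="10">
    <Server Name="node1_server1" ConnectTimeout="5">
        <Transport Hostname="node1.example.com" Port="9080" Protocol="http"/>
    </Server>
</ServerCluster>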

Last updated: 13-Feb-07 17:59


Tuning Mappings for Better Performance

Challenge

In general, mapping-level optimization takes time to implement, but can significantly boost performance. Sometimes the mapping is the biggest bottleneck in the load process because business rules determine the number and complexity of transformations in a mapping.

Before deciding on the best route to optimize the mapping architecture, you need to resolve some basic issues. Tuning mappings is a grouped approach: the first group of techniques can be of assistance almost universally, bringing about a performance increase in nearly all scenarios. The second group of tuning processes may yield only a small performance increase, or can be of significant value, depending on the situation.

Some factors to consider when choosing tuning processes at the mapping level include the specific environment, software/ hardware limitations, and the number of rows going through a mapping. This Best Practice offers some guidelines for tuning mappings.

Description

Analyze mappings for tuning only after you have tuned the target and source for peak performance. To optimize mappings, you generally reduce the number of transformations in the mapping and delete unnecessary links between transformations.

For transformations that use data cache (such as Aggregator, Joiner, Rank, and Lookup transformations), limit connected input/output or output ports. Doing so can reduce the amount of data the transformations store in the data cache. Having too many Lookups and Aggregators can encumber performance because each requires index cache and data cache. Since both are fighting for memory space, decreasing the number of these transformations in a mapping can help improve speed. Splitting them up into different mappings is another option.

Limit the number of Aggregators in a mapping. A high number of Aggregators can increase I/O activity on the cache directory. Unless the seek/access time is fast on the directory itself, having too many Aggregators can cause a bottleneck. Similarly, too many Lookups in a mapping causes contention of disk and memory, which can lead to thrashing, leaving insufficient memory to run a mapping efficiently.

Consider Single-Pass Reading

If several mappings use the same data source, consider a single-pass reading. If you have several sessions that use the same sources, consolidate the separate mappings with either a single Source Qualifier Transformation or one set of Source Qualifier Transformations as the data source for the separate data flows.

Similarly, if a function is used in several mappings, a single-pass reading reduces the number of times that function is called in the session. For example, if you need to subtract percentage from the PRICE ports for both the Aggregator and Rank transformations, you can minimize work by subtracting the percentage before splitting the pipeline.


Optimize SQL Overrides

When SQL overrides are required in a Source Qualifier, Lookup Transformation, or in the update override of a target object, be sure the SQL statement is tuned. The extent to which and how SQL can be tuned depends on the underlying source or target database system. See Tuning SQL Overrides and Environment for Better Performance for more information .

Scrutinize Datatype Conversions

PowerCenter Server automatically makes conversions between compatible datatypes. When these conversions are performed unnecessarily, performance slows. For example, if a mapping moves data from an integer port to a decimal port, then back to an integer port, the conversion may be unnecessary.

In some instances however, datatype conversions can help improve performance. This is especially true when integer values are used in place of other datatypes for performing comparisons using Lookup and Filter transformations.

Eliminate Transformation Errors

Large numbers of evaluation errors significantly slow performance of the PowerCenter Server. During transformation errors, the PowerCenter Server engine pauses to determine the cause of the error, removes the row causing the error from the data flow, and logs the error in the session log.

Transformation errors can be caused by many things including: conversion errors, conflicting mapping logic, any condition that is specifically set up as an error, and so on. The session log can help point out the cause of these errors. If errors recur consistently for certain transformations, re-evaluate the constraints for these transformations. If you need to run a session that generates a large number of transformation errors, you might improve performance by setting a lower tracing level. However, this is not a long-term response to transformation errors. Any source of errors should be traced and eliminated.

Optimize Lookup Transformations

There are a several ways to optimize lookup transformations that are set up in a mapping.

When to Cache Lookups

Cache small lookup tables. When caching is enabled, the PowerCenter Server caches the lookup table and queries the lookup cache during the session. When this option is not enabled, the PowerCenter Server queries the lookup table on a row-by-row basis.

Note: All of the tuning options mentioned in this Best Practice assume that memory and cache sizing for lookups are sufficient to ensure that caches will not page to disk. Information regarding memory and cache sizing for Lookup transformations is covered in the Best Practice: Tuning Sessions for Better Performance.

A better rule of thumb than memory size is to determine the size of the potential lookup cache with regard to the number of rows expected to be processed, as the following example illustrates.


In Mapping X, the source and lookup contain the following number of records:

ITEMS (source): 5000 records

MANUFACTURER: 200 records

DIM_ITEMS: 100000 records

Number of Disk Reads

                           Cached Lookup    Un-cached Lookup

LKP_Manufacturer
  Build Cache                        200                   0
  Read Source Records               5000                5000
  Execute Lookup                       0                5000
  Total # of Disk Reads             5200               10000

LKP_DIM_ITEMS
  Build Cache                     100000                   0
  Read Source Records               5000                5000
  Execute Lookup                       0                5000
  Total # of Disk Reads           105000               10000

Consider the case where MANUFACTURER is the lookup table. If the lookup table is cached, it will take a total of 5200 disk reads to build the cache and execute the lookup. If the lookup table is not cached, then it will take a total of 10,000 disk reads to execute the lookup. In this case, the number of records in the lookup table is small in comparison with the number of times the lookup is executed. So this lookup should be cached. This is the more likely scenario.

Consider the case where DIM_ITEMS is the lookup table. If the lookup table is cached, it will result in 105,000 total disk reads to build and execute the lookup. If the lookup table is not cached, then the disk reads would total 10,000. In this case the number of records in the lookup table is not small in comparison with the number of times the lookup will be executed. Thus, the lookup should not be cached.

Use the following eight step method to determine if a lookup should be cached:


1. Code the lookup into the mapping.

2. Select a standard set of data from the source. For example, add a "where" clause on a relational source to load a sample 10,000 rows.

3. Run the mapping with caching turned off and save the log.

4. Run the mapping with caching turned on and save the log to a different name than the log created in step 3.

5. Look in the cached lookup log and determine how long it takes to cache the lookup object. Note this time in seconds: LOOKUP TIME IN SECONDS = LS.

6. In the non-cached log, take the time from the last lookup cache to the end of the load in seconds and divide it into the number of rows being processed: NON-CACHED ROWS PER SECOND = NRS.

7. In the cached log, take the time from the last lookup cache to the end of the load in seconds and divide it into the number of rows being processed: CACHED ROWS PER SECOND = CRS.

8. Use the following formula to find the breakeven row point:

(LS*NRS*CRS)/(CRS-NRS) = X

where X is the breakeven point. If the expected number of source records is less than X, it is better not to cache the lookup. If the expected number of source records is more than X, it is better to cache the lookup.

For example, assume the lookup takes 166 seconds to cache (LS=166), the load runs at 232 rows per second with a cached lookup (CRS=232), and at 147 rows per second with a non-cached lookup (NRS=147). The formula gives (166*147*232)/(232-147) = 66,603. Thus, if the source has fewer than 66,603 records, the lookup should not be cached. If it has more than 66,603 records, then the lookup should be cached.

Sharing Lookup Caches

There are a number of methods for sharing lookup caches:

● Within a specific session run for a mapping, if the same lookup is used multiple times in a mapping, the PowerCenter Server will re-use the cache for the multiple instances of the lookup. Using the same lookup multiple times in the mapping will be more resource intensive with each successive instance. If multiple cached lookups are from the same table but are expected to return different columns of data, it may be better to setup the multiple lookups to bring back the same columns even though not all return ports are used in all lookups. Bringing back a common set of columns may reduce the number of disk reads.

● Across sessions of the same mapping, the use of an unnamed persistent cache allows multiple runs to use an existing cache file stored on the PowerCenter Server. If the option of creating a persistent cache is set in the lookup properties, the memory cache created for the lookup during the initial run is saved to the PowerCenter Server. This can improve performance because the Server builds the memory cache from cache files instead of the database. This feature should only be used when the lookup table is not expected to change between session runs.

● Across different mappings and sessions, the use of a named persistent cache allows sharing an existing cache file.

Reducing the Number of Cached Rows


There is an option to use a SQL override in the creation of a lookup cache. Options can be added to the WHERE clause to reduce the set of records included in the resulting cache.

Note: If you use a SQL override in a lookup, the lookup must be cached.
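
For example, assuming a hypothetical dimension table and a lookup that only ever needs to match current rows, the override used to build the cache might be restricted as follows:

SELECT item_id,
       item_desc
FROM   dim_items
WHERE  current_flag = 'Y' -- cache only the rows the lookup can actually match

Every row excluded here is a row that never has to be read, cached, or searched.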

Optimizing the Lookup Condition

In the case where a lookup uses more than one lookup condition, set the conditions with an equal sign first in order to optimize lookup performance.

Indexing the Lookup Table

The PowerCenter Server must query, sort, and compare values in the lookup condition columns. As a result, indexes on the database table should include every column used in a lookup condition. This can improve performance for both cached and un-cached lookups.

In the case of a cached lookup, an ORDER BY condition is issued in the SQL statement used to create the cache. Columns used in the ORDER BY condition should be indexed. The session log will contain the ORDER BY statement.

In the case of an un-cached lookup, since a SQL statement is created for each row passing into the lookup transformation, performance can be helped by indexing columns in the lookup condition.
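
As a sketch, assuming a hypothetical lookup on DIM_ITEMS with a condition on item_id and source_system, a composite index covering the lookup condition columns supports both the ORDER BY issued for a cached lookup and the row-by-row queries of an un-cached lookup:

CREATE INDEX idx_dim_items_lkp
ON dim_items (item_id, source_system);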

Use a Persistent Lookup Cache for Static Lookups

If the lookup source does not change between sessions, configure the Lookup transformation to use a persistent lookup cache. The PowerCenter Server then saves and reuses cache files from session to session, eliminating the time required to read the lookup source.

Optimize Filter and Router Transformations

Filtering data as early as possible in the data flow improves the efficiency of a mapping. Instead of using a Filter Transformation to remove a sizeable number of rows in the middle or end of a mapping, use a filter on the Source Qualifier or a Filter Transformation immediately after the source qualifier to improve performance.

Avoid complex expressions when creating the filter condition. Filter transformations are most effective when a simple integer or TRUE/FALSE expression is used in the filter condition.

Filters or routers should also be used to drop rejected rows from an Update Strategy transformation if rejected rows do not need to be saved.

Replace multiple filter transformations with a router transformation. This reduces the number of transformations in the mapping and makes the mapping easier to follow.


Optimize Aggregator Transformations

Aggregator Transformations often slow performance because they must group data before processing it.

Use simple columns in the group by condition to make the Aggregator Transformation more efficient. When possible, use numbers instead of strings or dates in the GROUP BY columns. Also avoid complex expressions in the Aggregator expressions, especially in GROUP BY ports.

Use the Sorted Input option in the Aggregator. This option requires that data sent to the Aggregator be sorted in the order in which the ports are used in the Aggregator's group by. The Sorted Input option decreases the use of aggregate caches. When it is used, the PowerCenter Server assumes all data is sorted by group and, as a group is passed through an Aggregator, calculations can be performed and information passed on to the next transformation. Without sorted input, the Server must wait for all rows of data before processing aggregate calculations. Use of the Sorted Inputs option is usually accompanied by a Source Qualifier which uses the Number of Sorted Ports option.
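
For example, assuming a hypothetical Aggregator that groups by dept_id and emp_id, the Source Qualifier should deliver rows already ordered on those ports, which is the effect the Number of Sorted Ports option (or an equivalent SQL override) produces:

SELECT dept_id,
       emp_id,
       sales_amt
FROM   sales_detail
ORDER BY dept_id, emp_id -- sort order must match the Aggregator group-by ports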

Use an Expression and Update Strategy instead of an Aggregator Transformation. This technique can only be used if the source data can be sorted. Further, using this option assumes that a mapping is using an Aggregator with Sorted Input option. In the Expression Transformation, the use of variable ports is required to hold data from the previous row of data processed. The premise is to use the previous row of data to determine whether the current row is a part of the current group or is the beginning of a new group. Thus, if the row is a part of the current group, then its data would be used to continue calculating the current group function. An Update Strategy Transformation would follow the Expression Transformation and set the first row of a new group to insert, and the following rows to update.

Use incremental aggregation if the changes you can capture from the source affect less than half of the target. When using incremental aggregation, you apply captured changes in the source to aggregate calculations in a session. The PowerCenter Server updates your target incrementally, rather than processing the entire source and recalculating the same calculations every time you run the session.

Joiner Transformation

Joining Data from the Same Source

You can join data from the same source in the following ways:

● Join two branches of the same pipeline.
● Create two instances of the same source and join pipelines from these source instances.

You may want to join data from the same source if you want to perform a calculation on part of the data and join the transformed data with the original data. When you join the data using this method, you can maintain the original data and transform parts of that data within one mapping.

When you join data from the same source, you can create two branches of the pipeline. When you branch a pipeline, you must add a transformation between the Source Qualifier and the Joiner transformation in at least one branch of the pipeline. You must join sorted data and configure the Joiner transformation for sorted input.


If you want to join unsorted data, you must create two instances of the same source and join the pipelines.

For example, you may have a source with the following ports:

● Employee
● Department
● Total Sales

In the target table, you want to view the employees who generated sales that were greater than the average sales for their respective departments. To accomplish this, you create a mapping with the following transformations:

● Sorter transformation. Sort the data.

● Sorted Aggregator transformation. Average the sales data and group by department. When you perform this aggregation, you lose the data for individual employees. To maintain employee data, you must pass a branch of the pipeline to the Aggregator transformation and pass a branch with the same data to the Joiner transformation to maintain the original data. When you join both branches of the pipeline, you join the aggregated data with the original data.

● Sorted Joiner transformation. Use a sorted Joiner transformation to join the sorted aggregated data with the original data.

● Filter transformation. Compare the average sales data against the sales data for each employee and filter out employees whose sales are not above their department average.

Note: You can also join data from output groups of the same transformation, such as the Custom transformation or XML Source Qualifier transformations. Place a Sorter transformation between each output group and the Joiner transformation and configure the Joiner transformation to receive sorted input.

Joining two branches can affect performance if the Joiner transformation receives data from one branch much later than the other branch. The Joiner transformation caches all the data from the first branch, and writes the cache to disk if the cache fills. The Joiner transformation must then read the data from disk when it receives the data from the second branch. This can slow processing.

You can also join same source data by creating a second instance of the source. After you create the second source instance, you can join the pipelines from the two source instances.

Note: When you join data using this method, the PowerCenter Server reads the source data for each source instance, so performance can be slower than joining two branches of a pipeline.

Use the following guidelines when deciding whether to join branches of a pipeline or join two instances of a source:

● Join two branches of a pipeline when you have a large source or if you can read the source data only once. For example, you can only read source data from a message queue once.

● Join two branches of a pipeline when you use sorted data. If the source data is unsorted and you use a Sorter transformation to sort the data, branch the pipeline after you sort the data.

● Join two instances of a source when you need to add a blocking transformation to the pipeline between the source and the Joiner transformation.

● Join two instances of a source if one pipeline may process much more slowly than the other pipeline.

Performance Tips

Use the database to do the join when sourcing data from the same database schema. Database systems usually can perform the join more quickly than the PowerCenter Server, so a SQL override or a join condition should be used when joining multiple tables from the same database schema.
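
A minimal sketch of this approach, using hypothetical tables from the same schema, replaces a Joiner transformation with a join in the Source Qualifier SQL override:

SELECT o.order_id,
       o.order_date,
       c.cust_name
FROM   orders o,
       customers c
WHERE  o.cust_id = c.cust_id -- join performed by the database, not by a Joiner transformation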

Use Normal joins whenever possible. Normal joins are faster than outer joins and the resulting set of data is also smaller.

Join sorted data when possible. You can improve session performance by configuring the Joiner transformation to use sorted input. When you configure the Joiner transformation to use sorted data, the PowerCenter Server improves performance by minimizing disk input and output. You see the greatest performance improvement when you work with large data sets.

For an unsorted Joiner transformation, designate as the master source the source with fewer rows. For optimal performance and disk storage, designate the master source as the source with the fewer rows. During a session, the Joiner transformation compares each row of the master source against the detail source. The fewer unique rows in the master, the fewer iterations of the join comparison occur, which speeds the join process.

For a sorted Joiner transformation, designate as the master source the source with fewer duplicate key values. For optimal performance and disk storage, designate the master source as the source with fewer duplicate key values. When the PowerCenter Server processes a sorted Joiner transformation, it caches rows for one hundred keys at a time. If the master source contains many rows with the same key value, the PowerCenter Server must cache more rows, and performance can be slowed.

Optimizing sorted joiner transformations with partitions. When you use partitions with a sorted Joiner transformation, you may optimize performance by grouping data and using n:n partitions.

Add a hash auto-keys partition upstream of the sort origin

To obtain expected results and get best performance when partitioning a sorted Joiner transformation, you must group and sort data. To group data, ensure that rows with the same key value are routed to the same partition. The best way to ensure that data is grouped and distributed evenly among partitions is to add a hash auto-keys or key-range partition point before the sort origin. Placing the partition point before you sort the data ensures that you maintain grouping and sort the data within each group.

Use n:n partitions

You may be able to improve performance for a sorted Joiner transformation by using n:n partitions. When you use n:n partitions, the Joiner transformation reads master and detail rows concurrently and does not need to cache all of the master data. This reduces memory usage and speeds processing. When you use 1:n partitions, the Joiner transformation caches all the data from the master pipeline and writes the cache to disk if the memory cache fills. When the Joiner transformation receives the data from the detail pipeline, it must then read the data from disk to compare the master and detail pipelines.


Optimize Sequence Generator Transformations

Sequence Generator transformations need to determine the next available sequence number; thus, increasing the Number of Cached Values property can increase performance. This property determines the number of values the PowerCenter Server caches at one time. If it is set to cache no values, then the PowerCenter Server must query the repository each time to determine the next number to be used. You may consider configuring the Number of Cached Values to a value greater than 1000. Note that any cached values not used in the course of a session are lost, since the repository value is advanced by the full set of cached values each time a new set is reserved.

Avoid External Procedure Transformations

For the most part, making calls to external procedures slows a session. If possible, avoid the use of these Transformations, which include Stored Procedures, External Procedures, and Advanced External Procedures.

Field-Level Transformation Optimization

As a final step in the tuning process, you can tune expressions used in transformations. When examining expressions, focus on complex expressions and try to simplify them when possible.

To help isolate slow expressions, do the following:

1. Time the session with the original expression.
2. Copy the mapping and replace half the complex expressions with a constant.
3. Run and time the edited session.
4. Make another copy of the mapping and replace the other half of the complex expressions with a constant.
5. Run and time the edited session.

Processing field level transformations takes time. If the transformation expressions are complex, then processing is even slower. It’s often possible to get a 10 to 20 percent performance improvement by optimizing complex field level transformations. Use the target table mapping reports or the Metadata Reporter to examine the transformations. Likely candidates for optimization are the fields with the most complex expressions. Keep in mind that there may be more than one field causing performance problems.

Factoring Out Common Logic

Factoring out common logic can reduce the number of times a mapping performs the same logic. If a mapping performs the same logic multiple times, moving the task upstream in the mapping may allow the logic to be performed just once. For example, a mapping has five target tables. Each target requires a Social Security Number lookup. Instead of performing the lookup right before each target, move the lookup to a position before the data flow splits.

Minimize Function Calls

Anytime a function is called it takes resources to process. There are several common examples where function calls can be reduced or eliminated.


Aggregate function calls can sometimes be reduced. In the case of each aggregate function call, the PowerCenter Server must search and group the data. Thus, the following expression:

SUM(Column A) + SUM(Column B)

Can be optimized to:

SUM(Column A + Column B)

In general, operators are faster than functions, so operators should be used whenever possible. For example if you have an expression which involves a CONCAT function such as:

CONCAT(CONCAT(FIRST_NAME, ' '), LAST_NAME)

It can be optimized to:

FIRST_NAME || ' ' || LAST_NAME

Remember that IIF() is a function that returns a value, not just a logical test. This allows many logical statements to be written in a more compact fashion. For example:

IIF(FLG_A='Y' and FLG_B='Y' and FLG_C='Y', VAL_A+VAL_B+VAL_C,
IIF(FLG_A='Y' and FLG_B='Y' and FLG_C='N', VAL_A+VAL_B,
IIF(FLG_A='Y' and FLG_B='N' and FLG_C='Y', VAL_A+VAL_C,
IIF(FLG_A='Y' and FLG_B='N' and FLG_C='N', VAL_A,
IIF(FLG_A='N' and FLG_B='Y' and FLG_C='Y', VAL_B+VAL_C,
IIF(FLG_A='N' and FLG_B='Y' and FLG_C='N', VAL_B,
IIF(FLG_A='N' and FLG_B='N' and FLG_C='Y', VAL_C,
IIF(FLG_A='N' and FLG_B='N' and FLG_C='N', 0.0))))))))

Can be optimized to:

IIF(FLG_A='Y', VAL_A, 0.0) + IIF(FLG_B='Y', VAL_B, 0.0) + IIF(FLG_C='Y', VAL_C, 0.0)

The original expression had 8 IIFs, 16 ANDs and 24 comparisons. The optimized expression results in three IIFs, three comparisons, and two additions.

Be creative in making expressions more efficient. The following example reworks an expression to reduce three comparisons to one:


IIF(X=1 OR X=5 OR X=9, 'yes', 'no')

Can be optimized to:

IIF(MOD(X, 4) = 1, 'yes', 'no')

Calculate Once, Use Many Times

Avoid calculating or testing the same value multiple times. If the same sub-expression is used several times in a transformation, consider making the sub-expression a local variable. The local variable can be used only within the transformation in which it was created. Calculating the variable only once and then referencing the variable in following sub-expressions improves performance.

Choose Numeric vs. String Operations

The PowerCenter Server processes numeric operations faster than string operations. For example, if a lookup is performed on a large amount of data on two columns, EMPLOYEE_NAME and EMPLOYEE_ID, configuring the lookup around EMPLOYEE_ID improves performance.

Optimizing Char-Char and Char-Varchar Comparisons

When the PowerCenter Server performs comparisons between CHAR and VARCHAR columns, it slows each time it finds trailing blank spaces in the row. To resolve this, enable the Treat CHAR as CHAR On Read option in the PowerCenter Server setup so that the server does not trim trailing spaces from the end of CHAR source fields.

Use DECODE Instead of LOOKUP

When a LOOKUP function is used, the PowerCenter Server must look up a table in the database. When a DECODE function is used, the lookup values are incorporated into the expression itself, so the server does not need to query a separate table. Thus, when looking up a small set of unchanging values, using DECODE may improve performance.

Reduce the Number of Transformations in a Mapping

Because there is always overhead involved in moving data among transformations, try, whenever possible, to reduce the number of transformations. Also, resolve unnecessary links between transformations to minimize the amount of data moved. This is especially important with data being pulled from the Source Qualifier Transformation.

Use Pre- and Post-Session SQL Commands

You can specify pre- and post-session SQL commands in the Properties tab of the Source Qualifier transformation and in the Properties tab of the target instance in a mapping. To increase the load speed, use these commands to drop indexes on the target before the session runs, then recreate them when the session completes.
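
For instance, with a hypothetical target index (the exact DDL depends on the target database), the pre-session SQL drops the index before the load and the post-session SQL rebuilds it afterward:

-- Pre-session SQL on the target
DROP INDEX idx_sales_fact_cust;

-- Post-session SQL on the target
CREATE INDEX idx_sales_fact_cust ON sales_fact (cust_id);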

Apply the following guidelines when using SQL statements:


● You can use any command that is valid for the database type. However, the PowerCenter Server does not allow nested comments, even though the database may.

● You can use mapping parameters and variables in SQL executed against the source, but not against the target.

● Use a semi-colon (;) to separate multiple statements.
● The PowerCenter Server ignores semi-colons within single quotes, double quotes, or within /* ...*/ comments.
● If you need to use a semi-colon outside of quotes or comments, you can escape it with a backslash (\).
● The Workflow Manager does not validate the SQL.

Use Environmental SQL

For relational databases, you can execute SQL commands in the database environment when connecting to the database. You can use this for source, target, lookup, and stored procedure connections. For instance, you can set isolation levels on the source and target systems to avoid deadlocks. Follow the guidelines listed above for using the SQL statements.
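
As one example for Oracle (the appropriate statement and isolation level depend on your database and workload), the environment SQL for a connection might set the transaction isolation level before the session begins reading or writing:

ALTER SESSION SET ISOLATION_LEVEL = READ COMMITTED;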

Use Local Variables

You can use local variables in Aggregator, Expression, and Rank transformations.

Temporarily Store Data and Simplify Complex Expressions

Rather than parsing and validating the same expression each time, you can define these components as variables. This also allows you to simplify complex expressions. For example, the following expressions:

AVG( SALARY, ( ( JOB_STATUS = 'Full-time' ) AND (OFFICE_ID = 1000 ) ) )

SUM( SALARY, ( ( JOB_STATUS = 'Full-time' ) AND (OFFICE_ID = 1000 ) ) )

can use variables to simplify complex expressions and temporarily store data:

Port Value

V_CONDITION1 JOB_STATUS = 'Full-time'

V_CONDITION2 OFFICE_ID = 1000

AVG_SALARY AVG( SALARY, V_CONDITION1 AND V_CONDITION2 )

SUM_SALARY SUM( SALARY, V_CONDITION1 AND V_CONDITION2 )

Store Values Across Rows


You can use variables to store data from prior rows. This can help you perform procedural calculations. To compare the previous state to the state just read:

IIF( PREVIOUS_STATE = STATE, STATE_COUNTER + 1, 1 )

Capture Values from Stored Procedures

Variables also provide a way to capture multiple columns of return values from stored procedures.

Last updated: 13-Feb-07 17:43


Tuning Sessions for Better Performance

Challenge

Running sessions is where the pedal hits the metal. A common misconception is that this is the area where most tuning should occur. While it is true that various session options can be modified to improve performance, PowerCenter 8 also offers Grid and Pushdown Optimization options that can improve performance tremendously.

Description

Once you have optimized the source database, the target database, and the mapping, you can focus on optimizing the session. The greatest area for improvement at the session level usually involves tweaking memory cache settings. The Aggregator (without sorted ports), Joiner, Rank, Sorter, and Lookup transformations (with caching enabled) use caches.

The PowerCenter Server uses index and data caches for each of these transformations. If the allocated data or index cache is not large enough to store the data, the PowerCenter Server stores the data in a temporary disk file as it processes the session data. Each time the PowerCenter Server pages to the temporary file, performance slows.

You can see when the PowerCenter Server pages to the temporary file by examining the performance details. The transformation_readfromdisk or transformation_writetodisk counters for any Aggregator, Rank, Lookup, Sorter, or Joiner transformation indicate the number of times the PowerCenter Server must page to disk to process the transformation. Index and data caches should both be sized according to the requirements of the individual transformation. The sizing can be done using the estimation tools provided in the Transformation Guide, or through observation of actual cache file sizes in the session caching directory.

The PowerCenter Server creates the index and data cache files by default in the PowerCenter Server variable directory, $PMCacheDir. The naming convention used by the PowerCenter Server for these files is PM [type of transformation] [generated session instance id number] _ [transformation instance id number] _ [partition index].dat or .idx. For example, an aggregate data cache file would be named PMAGG31_19.dat. The cache directory may be changed, however, if disk space is a constraint. Informatica recommends that the cache directory be local to the PowerCenter Server. A RAID 0 arrangement that gives maximum performance with no redundancy is recommended for volatile cache file directories (i.e., no persistent caches).

If the PowerCenter Server requires more memory than the configured cache size, it stores the overflow values in these cache files. Since paging to disk can slow session performance, the RAM allocated needs to be available on the server. If the server doesn’t have available RAM and uses paged memory, your session is again accessing the hard disk. In this case, it is more efficient to allow PowerCenter to page the data rather than the operating system. Adding additional memory to the server is, of course, the best solution.

Refer to Session Caches in the Workflow Administration Guide for detailed information on determining cache sizes.

The PowerCenter Server writes to the index and data cache files during a session in the following cases:

● The mapping contains one or more Aggregator transformations, and the session is configured for incremental aggregation.

● The mapping contains a Lookup transformation that is configured to use a persistent lookup cache, and the PowerCenter Server runs the session for the first time.

● The mapping contains a Lookup transformation that is configured to initialize the persistent lookup cache.

● The Data Transformation Manager (DTM) process in a session runs out of cache memory and pages to the local cache files. The DTM may create multiple files when processing large amounts of data. The session fails if the local directory runs out of disk space.

When a session is running, the PowerCenter Server writes a message in the session log indicating the cache file name and the transformation name. When a session completes, the DTM generally deletes the overflow index and data cache files. However, index and data files may exist in the cache directory if the session is configured for either incremental aggregation or to use a persistent lookup cache. Cache files may also remain if the session does not complete successfully.

Configuring Automatic Memory Settings

PowerCenter 8 allows you to configure the amount of cache memory. Alternatively, you can configure the Integration Service to automatically calculate cache memory settings at run time. When you run a session, the Integration Service allocates buffer memory to the session to move the data from the source to the target. It also creates session caches in memory. Session caches include index and data caches for the Aggregator, Rank, Joiner, and Lookup transformations, as well as Sorter and XML target caches. The values stored in the data and index caches depend upon the requirements of the transformation. For example, the Aggregator index cache stores group values as configured in the group by ports, and the data cache stores calculations based on the group by ports. When the Integration Service processes a Sorter transformation or writes data to an XML target, it also creates a cache.

Configuring Session Cache Memory

The Integration Service can determine cache memory requirements for the Lookup, Aggregator, Rank, Joiner, and Sorter transformations, as well as for XML targets.

You can set the index and data cache sizes to Auto in the transformation properties or on the Mapping tab of the session properties.

Max Memory Limits

Configuring maximum memory limits allows you to ensure that you reserve a designated amount or percentage of memory for other processes. You can configure the memory limit as a numeric value and as a percent of total memory. Because available memory varies, the Integration Service bases the percentage value on the total memory on the Integration Service process machine.

For example, you configure automatic caching for three Lookup transformations in a session. Then, you configure a maximum memory limit of 500MB for the session. When you run the session, the Integration Service divides the 500MB of allocated memory among the index and data caches for the Lookup transformations.

When you configure a maximum memory value, the Integration Service divides memory among transformation caches based on the transformation type.

When you configure both a numeric value and a percentage, the Integration Service compares the two and uses the lower value as the maximum memory limit.

When you configure automatic memory settings, the Integration Service specifies a minimum memory allocation for the index and data caches. The Integration Service allocates 1,000,000 bytes to the index cache and 2,000,000 bytes to the data cache for each transformation instance. If you configure a maximum memory limit that is less than the minimum value for an index or data cache, the Integration Service overrides this value. For example, if you configure a maximum memory value of 500 bytes for a session containing a Lookup transformation, the Integration Service overrides or disables the automatic memory settings and uses the default values.

When you run a session on a grid and you configure Maximum Memory Allowed for Auto Memory Attributes, the Integration Service divides the allocated memory among all the nodes in the grid. When you configure Maximum Percentage of Total Memory Allowed for Auto Memory Attributes, the Integration Service allocates the specified percentage of memory on each node in the grid.

Aggregator Caches

Keep the following items in mind when configuring the aggregate memory cache sizes:

● Allocate at least enough space to hold at least one row in each aggregate group.

● Remember that you only need to configure cache memory for an Aggregator transformation that does not use sorted ports. The PowerCenter Server uses Session Process memory to process an Aggregator transformation with sorted ports, not cache memory.

● Incremental aggregation can improve session performance. When it is used, the PowerCenter Server saves index and data cache information to disk at the end of the session. The next time the session runs, the PowerCenter Server uses this historical information to perform the incremental aggregation. The PowerCenter Server names these files PMAGG*.dat and PMAGG*.idx and saves them to the cache directory. Mappings that have sessions which use incremental aggregation should be set up so that only new detail records are read with each subsequent run.

● When configuring Aggregate data cache size, remember that the data cache holds row data for variable ports and connected output ports only. As a result, the data cache is generally larger than the index cache. To reduce the data cache size, connect only the necessary output ports to subsequent transformations.

Joiner Caches

When a session is run with a Joiner transformation, the PowerCenter Server reads from master and detail sources concurrently and builds index and data caches based on the master rows. The PowerCenter Server then performs the join based on the detail source data and the cache data.

The number of rows the PowerCenter Server stores in the cache depends on the partitioning scheme, the data in the master source, and whether or not you use sorted input.

After the memory caches are built, the PowerCenter Server reads the rows from the detail source and performs the joins. The PowerCenter Server uses the index cache to test the join condition. When it finds source data and cache data that match, it retrieves row values from the data cache.

Lookup Caches

Several options can be explored when dealing with Lookup transformation caches.

● Persistent caches should be used when lookup data is not expected to change often. Lookup cache files are saved after a session with a persistent cache lookup is run for the first time. These files are reused for subsequent runs, bypassing the querying of the database for the lookup. If the lookup table changes, you must be sure to set the Recache from Database option to ensure that the lookup cache files are rebuilt. You can also delete the cache files before the session run to force the session to rebuild the caches.

● Lookup caching should be enabled for relatively small tables. Refer to the Best Practice Tuning Mappings for Better Performance to determine when lookups should be cached. When the Lookup transformation is not configured for caching, the PowerCenter Server queries the lookup table for each input row. The result of the lookup query and processing is the same, regardless of whether the lookup table is cached or not. However, when the transformation is configured to not cache, the PowerCenter Server queries the lookup table instead of the lookup cache. Using a lookup cache can usually increase session performance.

● Just as with a Joiner, the PowerCenter Server aligns all data for lookup caches on an eight-byte boundary, which helps increase the performance of the lookup.

Allocating Buffer Memory

The Integration Service can determine the memory requirements for the buffer memory:

● DTM Buffer Size
● Default Buffer Block Size

You can also configure DTM buffer size and the default buffer block size in the session properties. When the PowerCenter Server initializes a session, it allocates blocks of memory to hold source and target data. Sessions that use a large number of sources and targets may require additional memory blocks.

To configure these settings, first determine the number of memory blocks the PowerCenter Server requires to initialize the session. Then you can calculate the buffer size and/or the buffer block size based on the default settings, to create the required number of session blocks.

If there are XML sources or targets in the mappings, use the number of groups in the XML source or target in the total calculation for the total number of sources and targets.

Increasing the DTM Buffer Pool Size

The DTM Buffer Pool Size setting specifies the amount of memory the PowerCenter Server uses as DTM buffer memory. The PowerCenter Server uses DTM buffer memory to create the internal data structures and buffer blocks used to bring data into and out of the server. When the DTM buffer memory is increased, the PowerCenter Server creates more buffer blocks, which can improve performance during momentary slowdowns.

If a session's performance details show low numbers for your source and target BufferInput_efficiency and BufferOutput_efficiency counters, increasing the DTM buffer pool size may improve performance.

Using DTM buffer memory allocation generally causes performance to improve initially and then level off. (Conversely, it may have no impact on source or target-bottlenecked sessions at all and may not have an impact on DTM bottlenecked sessions). When the DTM buffer memory allocation is increased, you need to evaluate the total memory available on the PowerCenter Server. If a session is part of a concurrent batch, the combined DTM buffer memory allocated for the sessions or batches must not exceed the total memory for the PowerCenter Server system. You can increase the DTM buffer size in the Performance settings of the Properties tab.

Running Workflows and Sessions Concurrently

The PowerCenter Server can process multiple sessions in parallel and can also process multiple partitions of a pipeline within a session. If you have a symmetric multi-processing (SMP) platform, you can use multiple CPUs to concurrently process session data or partitions of data. This provides improved performance since true parallelism is achieved. On a single processor platform, these tasks share the CPU, so there is no parallelism.


To achieve better performance, you can create a workflow that runs several sessions in parallel on one PowerCenter Server. This technique should only be employed on servers with multiple CPUs available.

Partitioning Sessions

Performance can be improved by processing data in parallel in a single session by creating multiple partitions of the pipeline. If you have PowerCenter partitioning available, you can increase the number of partitions in a pipeline to improve session performance. Increasing the number of partitions allows the PowerCenter Server to create multiple connections to sources and process partitions of source data concurrently.

When you create or edit a session, you can change the partitioning information for each pipeline in a mapping. If the mapping contains multiple pipelines, you can specify multiple partitions in some pipelines and single partitions in others. Keep the following attributes in mind when specifying partitioning information for a pipeline:

● Location of partition points. The PowerCenter Server sets partition points at several transformations in a pipeline by default. If you have PowerCenter partitioning available, you can define other partition points. Select those transformations where you think redistributing the rows in a different way is likely to increase the performance considerably.

● Number of partitions. By default, the PowerCenter Server sets the number of partitions to one. You can generally define up to 64 partitions at any partition point. When you increase the number of partitions, you increase the number of processing threads, which can improve session performance. Increasing the number of partitions or partition points also increases the load on the server. If the server contains ample CPU bandwidth, processing rows of data in a session concurrently can increase session performance. However, if you create a large number of partitions or partition points in a session that processes large amounts of data, you can overload the system. You can also overload source and target systems, so that is another consideration.

● Partition types. The partition type determines how the PowerCenter Server redistributes data across partition points. The Workflow Manager allows you to specify the following partition types:

1. Round-robin partitioning. PowerCenter distributes rows of data evenly to all partitions. Each partition processes approximately the same number of rows. In a pipeline that reads data from file sources of different sizes, you can use round-robin partitioning to ensure that each partition receives approximately the same number of rows.

2. Hash keys. The PowerCenter Server uses a hash function to group rows of data among partitions. The Server groups the data based on a partition key. There are two types of hash partitioning:

❍ Hash auto-keys. The PowerCenter Server uses all grouped or sorted ports as a compound partition key. You can use hash auto-keys partitioning at or before Rank, Sorter, and unsorted Aggregator transformations to ensure that rows are grouped properly before they enter these transformations.

❍ Hash user keys. The PowerCenter Server uses a hash function to group rows of data among partitions based on a user-defined partition key. You choose the ports that define the partition key.

3. Key range. The PowerCenter Server distributes rows of data based on a port or set of ports that you specify as the partition key. For each port, you define a range of values. The PowerCenter Server uses the key and ranges to send rows to the appropriate partition. Choose key range partitioning where the sources or targets in the pipeline are partitioned by key range.

4. Pass-through partitioning. The PowerCenter Server processes data without redistributing rows among partitions. Therefore, all rows in a single partition stay in that partition after crossing a pass-through partition point.

5. Database partitioning. You can optimize session performance by using the database partitioning partition type instead of the pass-through partition type for IBM DB2 targets.

If you find that your system is under-utilized after you have tuned the application, databases, and system for maximum single-partition performance, you can reconfigure your session to have two or more partitions to make your session utilize more of the hardware. Use the following tips when you add partitions to a session:

● Add one partition at a time. To best monitor performance, add one partition at a time, and note your session settings before you add each partition.

● Set DTM buffer memory. For a session with n partitions, this value should be at least n times the value for the session with one partition.

● Set cached values for Sequence Generator. For a session with n partitions, there should be no need to use the number of cached values property of the Sequence Generator transformation. If you must set this value to a value greater than zero, make sure it is at least n times the original value for the session with one partition.

● Partition the source data evenly. Configure each partition to extract the same number of rows, or redistribute the data among partitions early using a partition point with round-robin. This is actually a good way to prevent hammering of the source system. You could have a session with multiple partitions where one partition returns all the data and the override SQL in the other partitions is set to return zero rows (where 1 = 2 in the WHERE clause prevents any rows from being returned). Some source systems react better to multiple concurrent SQL queries; others prefer smaller numbers of queries.

● Monitor the system while running the session. If there are CPU cycles available (twenty percent or more idle time), then performance may improve for this session by adding a partition.

● Monitor the system after adding a partition. If the CPU utilization does not go up, the wait for I/O time goes up, or the total data transformation rate goes down, then there is probably a hardware or software bottleneck. If the wait for I/O time goes up a significant amount, then check the system for hardware bottlenecks. Otherwise, check the database configuration.

● Tune databases and system. Make sure that your databases are tuned properly for parallel ETL and that your system has no bottlenecks.

Increasing the Target Commit Interval

One method of resolving target database bottlenecks is to increase the commit interval. Each time the target database commits, performance slows. If you increase the commit interval, the number of times the PowerCenter Server commits decreases and performance may improve.

When increasing the commit interval at the session level, you must remember to increase the size of the database rollback segments to accommodate the larger number of rows. One of the major reasons that Informatica set the default commit interval to 10,000 is to accommodate the default rollback segment / extent size of most databases. If you increase both the commit interval and the database rollback segments, you should see an increase in performance. In some cases though, just increasing the commit interval without making the appropriate database changes may cause the session to fail part way through (i.e., you may get a database error like "unable to extend rollback segments" in Oracle).
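
As a hedged, Oracle-specific illustration (the segment name is hypothetical, and databases using automatic undo management do not need this step), the accompanying database change might allow a rollback segment to grow to accommodate the larger commits:

ALTER ROLLBACK SEGMENT rbs01 STORAGE (MAXEXTENTS UNLIMITED);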

Disabling High Precision

If a session runs with high precision enabled, disabling high precision may improve session performance.


The Decimal datatype is a numeric datatype with a maximum precision of 28. To use a high-precision Decimal datatype in a session, you must configure it so that the PowerCenter Server recognizes this datatype by selecting Enable High Precision in the session property sheet. However, since reading and manipulating a high-precision datatype (i.e., those with a precision greater than 28) can slow the PowerCenter Server down, session performance may be improved by disabling decimal arithmetic. When you disable high precision, the PowerCenter Server reverts to using the Double datatype.

Reducing Error Tracking

If a session contains a large number of transformation errors, you may be able to improve performance by reducing the amount of data the PowerCenter Server writes to the session log.

To reduce the amount of time spent writing to the session log file, set the tracing level to Terse. At this tracing level, the PowerCenter Server does not write error messages or row-level information for reject data. However, if terse is not an acceptable level of detail, you may want to consider leaving the tracing level at Normal and focus your efforts on reducing the number of transformation errors. Note that the tracing level must be set to Normal in order to use the reject loading utility.

As an additional debug option (beyond the PowerCenter Debugger), you may set the tracing level to verbose initialization or verbose data.

● Verbose initialization logs initialization details in addition to normal tracing, including the names of index and data files used and detailed transformation statistics.

● Verbose data logs each row that passes into the mapping. It also notes where the PowerCenter Server truncates string data to fit the precision of a column and provides detailed transformation statistics. When you configure the tracing level to verbose data, the PowerCenter Server writes row data for all rows in a block when it processes a transformation.

However, the verbose initialization and verbose data logging options significantly affect the session performance. Do not use Verbose tracing options except when testing sessions. Always remember to switch tracing back to Normal after the testing is complete.

The session tracing level overrides any transformation-specific tracing levels within the mapping. Informatica does not recommend reducing error tracing as a long-term response to high levels of transformation errors. Because there are only a handful of reasons why transformation errors occur, it makes sense to fix and prevent any recurring transformation errors. PowerCenter uses the mapping tracing level when the session tracing level is set to none.

Pushdown Optimization

You can push transformation logic to the source or target database using pushdown optimization. The amount of work you can push to the database depends on the pushdown optimization configuration, the transformation logic, and the mapping and session configuration.

When you run a session configured for pushdown optimization, the Integration Service analyzes the mapping and writes one or more SQL statements based on the mapping transformation logic. The Integration Service analyzes the transformation logic, mapping, and session configuration to determine the transformation logic it can push to the database. At run time, the Integration Service executes any SQL statement generated against the source or target tables, and it processes any transformation logic that it cannot push to the database.

Use the Pushdown Optimization Viewer to preview the SQL statements and mapping logic that the Integration Service can push to the source or target database. You can also use the Pushdown Optimization Viewer to view the messages related to Pushdown Optimization.

Source-Side Pushdown Optimization Sessions

In source-side pushdown optimization, the Integration Service analyzes the mapping from the source to the target until it reaches a downstream transformation that cannot be pushed to the database.

The Integration Service generates a SELECT statement based on the transformation logic up to the last transformation it can push to the database. It pushes all valid transformation logic to the database by executing the generated SQL statement at run time, then reads the results of this SQL statement and continues to run the session. Similarly, when the Source Qualifier contains a SQL override, the Integration Service creates a view for the override, generates a SELECT statement against that view, and runs it. When the session completes, the Integration Service drops the view from the database.
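
As an illustration only (the actual SQL depends entirely on the mapping, and these table and column names are invented), a Filter and an Aggregator pushed to the source might collapse into a single generated statement along these lines:

SELECT dept_id,
       SUM(sales_amt)
FROM   sales_detail
WHERE  region = 'WEST' -- Filter transformation pushed to the database
GROUP BY dept_id -- Aggregator transformation pushed to the database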

Target-Side Pushdown Optimization Sessions

When you run a session configured for target-side pushdown optimization, the Integration Service analyzes the mapping from the target to the source or until it reaches an upstream transformation it cannot push to the database. It generates an INSERT, DELETE, or UPDATE statement based on the transformation logic for each transformation it can push to the database, starting with the first transformation in the pipeline it can push to the database. The Integration Service processes the transformation logic up to the point that it can push the transformation logic to the target database. Then, it executes the generated SQL.

Full Pushdown Optimization Sessions

To use full pushdown optimization, the source and target must be on the same database. When you run a session configured for full pushdown optimization, the Integration Service analyzes the mapping from the source to the target, analyzing each transformation in the pipeline until it reaches the target. It then generates and executes the SQL against the source and target databases.

When you run a session for full pushdown optimization, the database must run a long transaction if the session contains a large quantity of data. Consider the following database performance issues when you generate a long transaction:

● A long transaction uses more database resources.

● A long transaction locks the database for longer periods of time, and thereby reduces the database concurrency and increases the likelihood of deadlock.

● A long transaction can increase the likelihood that an unexpected event may occur.

The Rank transformation cannot be pushed to the database. If you configure the session for full pushdown optimization, the Integration Service pushes the Source Qualifier transformation and the Aggregator transformation to the source. It pushes the Expression transformation and target to the target database, and it processes the Rank transformation. The Integration Service does not fail the session if it can push only part of the transformation logic to the database and the session is configured for full optimization.

Using a Grid

You can use a grid to increase session and workflow performance. A grid is an alias assigned to a group of nodes that allows you to automate the distribution of workflows and sessions across nodes.


When you use a grid, the Integration Service distributes workflow tasks and session threads across multiple nodes. Running workflows and sessions on the nodes of a grid provides the following performance gains:

● Balances the Integration Service workload. ● Processes concurrent sessions faster. ● Processes partitions faster.

When you run a session on a grid, you improve scalability and performance by distributing session threads to multiple DTM processes running on nodes in the grid.

To run a workflow or session on a grid, you assign resources to nodes, create and configure the grid, and configure the Integration Service to run on a grid.

Running a Session on Grid

When you run a session on a grid, the master service process runs the workflow and workflow tasks, including the Scheduler. Because it runs on the master service process node, the Scheduler uses the date and time for the master service process node to start scheduled workflows. The Load Balancer distributes Command tasks as it does when you run a workflow on a grid. In addition, when the Load Balancer dispatches a Session task, it distributes the session threads to separate DTM processes.

The master service process starts a temporary preparer DTM process that fetches the session and prepares it to run. After the preparer DTM process prepares the session, it acts as the master DTM process, which monitors the DTM processes running on other nodes.

The worker service processes start the worker DTM processes on other nodes. The worker DTM runs the session. Multiple worker DTM processes running on a node might be running multiple sessions or multiple partition groups from a single session depending on the session configuration.

For example, you run a workflow on a grid that contains one Session task and one Command task. You also configure the session to run on the grid.

When the Integration Service process runs the session on a grid, it performs the following tasks:

● On Node 1, the master service process runs workflow tasks. It also starts a temporary preparer DTM process, which becomes the master DTM process. The Load Balancer dispatches the Command task and session threads to nodes in the grid.

● On Node 2, the worker service process runs the Command task and starts the worker DTM processes that run the session threads.

● On Node 3, the worker service process starts the worker DTM processes that run the session threads.

For information about configuring and managing a grid, refer to the PowerCenter Administrator Guide.

For information about how the DTM distributes session threads into partition groups, see "Running Workflows and Sessions on a Grid" in the Workflow Administration Guide.

Last updated: 01-Feb-07 18:54

Tuning SQL Overrides and Environment for Better Performance

Challenge

Tuning SQL overrides and SQL queries within Source Qualifier objects can improve performance when selecting data from source database tables, which positively impacts overall session performance. This Best Practice explores ways to optimize a SQL query within the Source Qualifier object; the tips can be applied to any PowerCenter mapping. While the SQL discussed here was executed in Oracle 8 and above (with DB2 equivalents shown where relevant), the techniques are generally applicable; specifics for other RDBMS products (e.g., SQL Server, Sybase) are not included.

Description

SQL Queries Performing Data Extractions

Optimizing SQL queries is perhaps the most complex portion of performance tuning. When tuning SQL, the developer must look at the type of execution being forced by hints, the execution plan, the indexes on the tables in the query, the logic of the SQL statement itself, and the SQL syntax. The following paragraphs discuss each of these areas in more detail.

DB2 Coalesce and Oracle NVL

When examining data with NULLs, it is often necessary to substitute a value to make comparisons and joins work. In Oracle, the NVL function is used, while in DB2, the COALESCE function is used.

Here is an example of the Oracle NVL function:

SELECT DISTINCT bio.experiment_group_id, bio.database_site_code

FROM exp.exp_bio_result bio, sar.sar_data_load_log log

WHERE bio.update_date BETWEEN log.start_time AND log.end_time

AND NVL(bio.species_type_code, 'X') IN ('mice', 'rats', 'X')

AND log.seq_no = (SELECT MAX(seq_no) FROM sar.sar_data_load_log

WHERE load_status = 'P')

Here is the same query in DB2:

SELECT DISTINCT bio.experiment_group_id, bio.database_site_code

FROM bio_result bio, data_load_log log

WHERE bio.update_date BETWEEN log.start_time AND log.end_time

AND COALESCE(bio.species_type_code, 'X') IN ('mice', 'rats', 'X')

AND log.seq_no = (SELECT MAX(seq_no) FROM data_load_log

WHERE load_status = 'P')

Surmounting the Single SQL Statement Limitation in Oracle or DB2: In-line Views

In source qualifiers and lookup objects, you are limited to a single SQL statement. There are several ways to get around this limitation.

You can create views in the database and use them as you would tables, either as source tables or in the FROM clause of the SELECT statement. This can simplify the SQL and make it easier to understand, but it also makes the logic harder to maintain, because it now lives in two places: in an Informatica mapping and in a database view.

You can use in-line views, which are SELECT statements in the FROM or WHERE clause. This can help focus the query on a subset of data in the table and work more efficiently than a traditional join. Here is an example of an in-line view in the FROM clause:

SELECT N.DOSE_REGIMEN_TEXT as DOSE_REGIMEN_TEXT,

N.DOSE_REGIMEN_COMMENT as DOSE_REGIMEN_COMMENT,

N.DOSE_VEHICLE_BATCH_NUMBER as DOSE_VEHICLE_BATCH_NUMBER,

N.DOSE_REGIMEN_ID as DOSE_REGIMEN_ID

FROM DOSE_REGIMEN N,

(SELECT DISTINCT R.DOSE_REGIMEN_ID as DOSE_REGIMEN_ID

FROM EXPERIMENT_PARAMETER R,

NEW_GROUP_TMP TMP

WHERE R.EXPERIMENT_PARAMETERS_ID = TMP.EXPERIMENT_PARAMETERS_ID

AND R.SCREEN_PROTOCOL_ID = TMP.BDS_PROTOCOL_ID

) X

WHERE N.DOSE_REGIMEN_ID = X.DOSE_REGIMEN_ID

ORDER BY N.DOSE_REGIMEN_ID

Surmounting the Single SQL Statement Limitation in DB2: Using Common Table Expressions and the WITH Clause

A Common Table Expression (CTE) stores data in a temporary result set during the execution of the SQL statement. The WITH clause lets you assign a name to a CTE block. You can then reference the CTE block in multiple places in the query by specifying the query name. For example:

WITH maxseq AS (SELECT MAX(seq_no) AS seq_no FROM data_load_log WHERE load_status = 'P')

SELECT DISTINCT bio.experiment_group_id, bio.database_site_code

FROM bio_result bio, data_load_log log, maxseq

WHERE bio.update_date BETWEEN log.start_time AND log.end_time

AND COALESCE(bio.species_type_code, 'X') IN ('mice', 'rats', 'X')

AND log.seq_no = maxseq.seq_no

Here is another example using a WITH clause that uses recursive SQL:

WITH PERSON_TEMP (PERSON_ID, NAME, PARENT_ID, LVL) AS

(SELECT PERSON_ID, NAME, PARENT_ID, 1

FROM PARENT_CHILD

WHERE NAME IN ('FRED', 'SALLY', 'JIM')

UNION ALL

SELECT C.PERSON_ID, C.NAME, C.PARENT_ID, RECURS.LVL + 1

FROM PARENT_CHILD C, PERSON_TEMP RECURS

WHERE C.PARENT_ID = RECURS.PERSON_ID

AND RECURS.LVL < 5)

SELECT * FROM PERSON_TEMP

The PARENT_ID in any particular row refers to the PERSON_ID of that row's parent. (This is a simplification, since everyone has two parents, but it illustrates the idea.) The LVL counter prevents infinite recursion.

CASE (DB2) vs. DECODE (Oracle)

The CASE syntax is allowed in Oracle, but you are much more likely to see DECODE logic, even for a single test, since it was the only legal way to test a condition in earlier versions. Note that DECODE tests only for equality, so range comparisons must be expressed with a function such as SIGN.

DECODE is not allowed in DB2.

In Oracle:

SELECT EMPLOYEE, FNAME, LNAME,

DECODE(SIGN(SALARY - 10000), -1, 'NEED RAISE',

DECODE(SIGN(SALARY - 1000000), 1, 'OVERPAID',

'THE REST OF US')) AS SALARY_COMMENT

FROM EMPLOYEE

In DB2:

SELECT EMPLOYEE, FNAME, LNAME,

CASE

WHEN SALARY < 10000 THEN 'NEED RAISE'

WHEN SALARY > 1000000 THEN 'OVERPAID'

ELSE 'THE REST OF US'

END AS SALARY_COMMENT

FROM EMPLOYEE

Debugging Tip: Obtaining a Sample Subset

It is often useful to get a small sample of the data from a long-running query that returns a large set of data. The sampling logic can be commented out or removed before the query is put into general use.

DB2 uses the FETCH FIRST n ROWS ONLY clause to do this as follows:

SELECT EMPLOYEE, FNAME, LNAME

FROM EMPLOYEE

WHERE JOB_TITLE = 'WORKERBEE'

FETCH FIRST 12 ROWS ONLY

Oracle does it this way, using the ROWNUM pseudocolumn:

SELECT EMPLOYEE, FNAME, LNAME

FROM EMPLOYEE

WHERE JOB_TITLE = 'WORKERBEE'

AND ROWNUM <= 12

INTERSECT, INTERSECT ALL, UNION, UNION ALL

Remember that both the UNION and INTERSECT operators return distinct rows, while UNION ALL and INTERSECT ALL return all rows.
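
For example, given the CUSTOMERS and EMPLOYEES tables used elsewhere in this Best Practice (hypothetical tables sharing a NAME_ID column), the first query below returns each NAME_ID once, while the second returns every qualifying row from both tables, duplicates included:

SELECT NAME_ID FROM CUSTOMERS
UNION
SELECT NAME_ID FROM EMPLOYEES

SELECT NAME_ID FROM CUSTOMERS
UNION ALL
SELECT NAME_ID FROM EMPLOYEES

UNION ALL is also cheaper, because it skips the sort or hash step needed to eliminate duplicates; prefer it whenever duplicates are impossible or acceptable.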

System Dates in Oracle and DB2

Oracle uses the system variable SYSDATE for the current date and time, and allows you to display the date and/or the time in whatever format you want using date functions.

Here is an example that returns yesterday's date in Oracle (the display format depends on the session's NLS date settings):

SELECT TRUNC(SYSDATE) - 1 FROM DUAL

DB2 uses system variables, called special registers: CURRENT DATE, CURRENT TIME, and CURRENT TIMESTAMP.

Here is an example for DB2:

SELECT FNAME, LNAME, CURRENT DATE AS TODAY

FROM EMPLOYEE

Oracle: Using Hints

Hints affect the way a query or sub-query is executed and can therefore provide a significant performance increase in queries. Hints cause the database engine to relinquish control over how a query is executed, thereby giving the developer control over the execution. Hints are always honored unless execution is not possible. Because the database engine does not evaluate whether the hint makes sense, developers must be careful in implementing hints. Oracle has many types of hints: optimizer hints, access method hints, join order hints, join operation hints, and parallel execution hints. Optimizer and access method hints are the most common.

In the latest versions of Oracle, cost-based query analysis is built in and rule-based analysis is no longer possible. It was in rule-based Oracle systems that hints naming specific indexes were most helpful. In Oracle version 9.2, however, the use of /*+ INDEX */ hints may actually decrease performance significantly in many cases. If you are using an older version of Oracle, however, the use of proper INDEX hints should help performance.

The optimizer hint allows the developer to change the optimizer's goals when creating the execution plan. The table below provides a partial list of optimizer hints and descriptions.

Optimizer hints: Choosing the best join method

Sort/merge and hash joins are in the same group, but nested loop joins are very different. Sort/merge involves two sorts while the nested loop involves no sorts. The hash join also requires memory to build the hash table.

Hash joins are most effective when the amount of data is large and one table is much larger than the other.

Here is an example of a select that performs best as a hash join:

SELECT COUNT(*) FROM CUSTOMERS C, MANAGERS M

WHERE C.CUST_ID = M.MANAGER_ID

Considerations                                             Join Type

Better throughput                                          Sort/Merge
Better response time                                       Nested loop
Large subsets of data                                      Sort/Merge
Index available to support the join                        Nested loop
Limited memory and CPU available for sorting               Nested loop
Parallel execution                                         Sort/Merge or Hash
Joining all or most of the rows of large tables            Sort/Merge or Hash
Joining small sub-sets of data with an index available     Nested loop

Hint Description

ALL_ROWS The database engine creates an execution plan that optimizes for throughput. Favors full table scans. Optimizer favors Sort/Merge

FIRST_ROWS The database engine creates an execution plan that optimizes for response time. It returns the first row of data as quickly as possible. Favors index lookups. Optimizer favors Nested-loops

CHOOSE The database engine creates an execution plan that uses cost-based execution if statistics have been run on the tables. If statistics have not been run, the engine uses rule-based execution. If statistics have been run on empty tables, the engine still uses cost-based execution, but performance is extremely poor.

RULE The database engine creates an execution plan based on a fixed set of rules.

USE_NL Use nested loops

USE_MERGE Use sort merge joins

HASH The database engine performs a hash scan of the table. This hint is ignored if the table is not clustered.

Access method hints

Access method hints control how data is accessed. These hints are used to force the database engine to use indexes, hash scans, or row id scans. The following table provides a partial list of access method hints.

Hint Description

ROWID The database engine performs a scan of the table based on ROWIDS.

INDEX DO NOT USE in Oracle 9.2 and above. The database engine performs an index scan of a specific table, but in 9.2 and above, the optimizer does not use any indexes other than those mentioned.

USE_CONCAT The database engine converts a query with an OR condition into two or more queries joined by a UNION ALL statement.

The syntax for using a hint in a SQL statement is as follows:

Select /*+ FIRST_ROWS */ empno, ename

From emp;

Select /*+ USE_CONCAT */ empno, ename

From emp;

SQL Execution and Explain Plan

The simplest change is forcing the SQL to use either rule-based or cost-based execution. This change can be accomplished without changing the logic of the SQL query. While cost-based execution is typically considered the best SQL execution, it relies upon well-tuned Oracle parameters and up-to-date database statistics. If these statistics are not maintained, cost-based query execution can suffer over time. When that happens, rule-based execution can actually provide better execution time.
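
If statistics have gone stale, refreshing them is usually worth trying before rewriting the query itself. Here is a minimal sketch using the DBMS_STATS package (available in Oracle 8i and later), run from SQL*Plus or a similar tool; the SALES schema and CUSTOMERS table names are hypothetical:

-- Gather fresh optimizer statistics for one table and its indexes
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(ownname => 'SALES', tabname => 'CUSTOMERS', cascade => TRUE);
END;
/

After gathering statistics, re-run the explain plan to confirm that the optimizer's choices have changed as expected.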

The developer can determine which type of execution is being used by running an explain plan on the SQL query in question. Note that the step in the explain plan that is indented the most is the statement that is executed first. The results of that statement are then used as input by the next level statement.

Typically, the developer should attempt to eliminate any full table scans and index range scans whenever possible. Full table scans cause degradation in performance.
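
To generate a plan for a statement, a sketch along these lines can be used (the ORDERS query is hypothetical; PLAN_TABLE must exist, and DBMS_XPLAN is available in Oracle 9i Release 2 and later; on older releases, query PLAN_TABLE directly):

EXPLAIN PLAN FOR
SELECT O.ORDER_ID, O.ORDER_DATE
FROM ORDERS O
WHERE O.CUSTOMER_ID = 12345;

-- Display the most recently explained plan
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);

Look in the resulting plan for TABLE ACCESS FULL operations against large tables; these are the full table scans referred to above.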

Information provided by the Explain Plan can be enhanced using the SQL Trace Utility, which provides additional information, including:

● The number of executions

● The elapsed time of the statement execution

● The CPU time used to execute the statement

The SQL Trace Utility adds value because it definitively shows the statements that are using the most resources, and can immediately show the change in resource consumption after the statement has been tuned and a new explain plan has been run.
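
Tracing can be turned on for just the current session while the problem statement is executed; the resulting trace file (written to the database's user dump destination) is then formatted with the tkprof utility. A minimal sketch, with a hypothetical statement being traced:

ALTER SESSION SET SQL_TRACE = TRUE;

SELECT COUNT(*) FROM ORDERS WHERE ORDER_DATE >= SYSDATE - 7;

ALTER SESSION SET SQL_TRACE = FALSE;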

Using Indexes

The explain plan also shows whether indexes are being used to facilitate execution. The data warehouse team should compare the indexes being used to those available. If necessary, the administrative staff should identify new indexes that are needed to improve execution and ask the database administration team to add them to the appropriate tables. Once implemented, the explain plan should be executed again to ensure that the indexes are being used. If an index is not being used, it is possible to force the query to use it by using an access method hint, as described earlier.

Reviewing SQL Logic

The final step in SQL optimization involves reviewing the SQL logic itself. The purpose of this review is to determine whether the logic is efficiently capturing the data needed for processing. Review of the logic may uncover the need for additional filters to select only certain data, as well as the need to restructure the where clause to use indexes. In extreme cases, the entire SQL statement may need to be re-written to become more efficient.

Reviewing SQL Syntax

SQL Syntax can also have a great impact on query performance. Certain operators can slow performance, for example:

● EXISTS clauses are almost always used in correlated sub-queries. They are executed for each row of the parent query and cannot take advantage of indexes, while the IN clause is executed once and does use indexes, and may be translated to a JOIN by the optimizer. If possible, replace EXISTS with an IN clause. For example:

SELECT * FROM DEPARTMENTS WHERE DEPT_ID IN

(SELECT DISTINCT DEPT_ID FROM MANAGERS) -- Faster

SELECT * FROM DEPARTMENTS D WHERE EXISTS

(SELECT * FROM MANAGERS M WHERE M.DEPT_ID = D.DEPT_ID)

Situation                                                  EXISTS                                  IN

Index supports the subquery                                Yes                                     Yes
No index to support the subquery                           No (table scan per parent row)          Yes (one table scan)
Sub-query returns many rows                                Probably not                            Yes
Sub-query returns one or a few rows                        Yes                                     Yes
Most sub-query rows are eliminated by the parent query     No                                      Yes
Index in the parent matches the sub-query columns          Possibly not (EXISTS cannot use it)     Yes (IN uses the index)

● Where possible, use the EXISTS clause instead of the INTERSECT clause. Simply modifying the query in this way can improve performance by more than 100 percent; see the sketch after this list.

● Where possible, limit the use of outer joins on tables. Remove the outer joins from the query and create lookup objects within the mapping to fill in the optional information.
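
The following sketch shows the kind of INTERSECT-to-EXISTS rewrite referred to above, reusing the hypothetical CUSTOMERS and EMPLOYEES tables from the anti-join examples later in this section. Note that INTERSECT also removes duplicates, so add DISTINCT to the EXISTS form if duplicate NAME_ID values are possible.

-- INTERSECT version: both result sets are built and compared
SELECT C.NAME_ID FROM CUSTOMERS C
INTERSECT
SELECT E.NAME_ID FROM EMPLOYEES E

-- EXISTS version: usually faster when EMPLOYEES.NAME_ID is indexed
SELECT C.NAME_ID FROM CUSTOMERS C
WHERE EXISTS
(SELECT 1 FROM EMPLOYEES E WHERE E.NAME_ID = C.NAME_ID)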

Choosing the Best Join Order

Place the smallest table first in the join order. This is often a staging table holding the IDs identifying the data in the incremental ETL load.

Always put the small table column on the right side of the join. Use the driving table first in the WHERE clause, and work from it outward. In other words, be consistent and orderly about placing columns in the WHERE clause.

Outer joins limit the join order that the optimizer can use. Don’t use them needlessly.
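
As an illustration of the driving-table idea, assume a small staging table NEW_ORDER_IDS_TMP that holds only the keys for the current incremental load, joined to a large ORDERS table (both table names are hypothetical):

SELECT O.ORDER_ID, O.ORDER_DATE, O.CUSTOMER_ID
FROM NEW_ORDER_IDS_TMP TMP, ORDERS O
WHERE TMP.ORDER_ID = O.ORDER_ID

Listing the small staging table first and driving the WHERE clause from it keeps the work against the large table limited to the keys in the incremental set.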

Anti-join with NOT IN, NOT EXISTS, MINUS or EXCEPT, OUTER JOIN

● Avoid use of the NOT IN clause. This clause causes the database engine to perform a full table scan. While this may not be a problem on small tables, it can become a performance drain on large tables.

SELECT NAME_ID FROM CUSTOMERS

WHERE NAME_ID NOT IN

(SELECT NAME_ID FROM EMPLOYEES)

● Avoid use of the NOT EXISTS clause. This clause is better than the NOT IN, but still may cause a full table scan.

SELECT C.NAME_ID FROM CUSTOMERS C

WHERE NOT EXISTS

(SELECT * FROM EMPLOYEES E

WHERE C.NAME_ID = E.NAME_ID)

● In Oracle, use the MINUS operator to do the anti-join, if possible. In DB2, use the equivalent EXCEPT operator.

SELECT C.NAME_ID FROM CUSTOMERS C

MINUS

SELECT E.NAME_ID FROM EMPLOYEES E

● Also consider using outer joins with IS NULL conditions for anti-joins.

SELECT C.NAME_ID FROM CUSTOMERS C, EMPLOYEES E

WHERE C.NAME_ID = E.NAME_ID (+)

AND E.NAME_ID IS NULL

Review the database SQL manuals to determine the cost benefits or liabilities of certain SQL clauses as they may change based on the database engine.

● In lookups from large tables, try to limit the rows returned to the set of rows matching the set in the source qualifier. Add the WHERE clause conditions to the lookup. For example, if the source qualifier selects sales orders entered into the system since the previous load of the database, then, in the product information lookup, only select the products that match the distinct product IDs in the incremental sales orders.

● Avoid range lookups. A range lookup is a SELECT that uses a BETWEEN in the WHERE clause with limits retrieved from another table. Here is an example:

SELECT

R.BATCH_TRACKING_NO,

R.SUPPLIER_DESC,

R.SUPPLIER_REG_NO,

R.SUPPLIER_REF_CODE,

R.GCW_LOAD_DATE

FROM CDS_SUPPLIER R,

(SELECT L.LOAD_DATE_PREV AS LOAD_DATE_PREV,

L.LOAD_DATE AS LOAD_DATE

FROM ETL_AUDIT_LOG L

WHERE L.LOAD_DATE_PREV IN

(SELECT MAX(Y.LOAD_DATE_PREV) AS LOAD_DATE_PREV

FROM ETL_AUDIT_LOG Y)

) Z

WHERE

R.LOAD_DATE BETWEEN Z.LOAD_DATE_PREV AND Z.LOAD_DATE

The work-around is to use an in-line view to get the lower range in the FROM clause and join it to the main query, which limits the higher date range in its WHERE clause. Use an ORDER BY on the lower limit in the in-line view. This is likely to reduce the throughput time from hours to seconds.

Here is the improved SQL:

SELECT

R.BATCH_TRACKING_NO,

R.SUPPLIER_DESC,

R.SUPPLIER_REG_NO,

R.SUPPLIER_REF_CODE,

R.LOAD_DATE

FROM

/* In-line view for lower limit */

(SELECT

R1.BATCH_TRACKING_NO,

R1.SUPPLIER_DESC,

R1.SUPPLIER_REG_NO,

R1.SUPPLIER_REF_CODE,

R1.LOAD_DATE

FROM CDS_SUPPLIER R1,

(SELECT MAX(Y.LOAD_DATE_PREV) AS LOAD_DATE_PREV

FROM ETL_AUDIT_LOG Y) Z

WHERE R1.LOAD_DATE >= Z.LOAD_DATE_PREV

ORDER BY R1.LOAD_DATE) R,

/* end in-line view for lower limit */

(SELECT MAX(D.LOAD_DATE) AS LOAD_DATE

FROM ETL_AUDIT_LOG D) A /* upper limit */

WHERE R.LOAD_DATE <= A.LOAD_DATE

Tuning System Architecture

Use the following steps to improve the performance of any system:

1. Establish performance boundaries (baseline).
2. Define performance objectives.
3. Develop a performance monitoring plan.
4. Execute the plan.
5. Analyze measurements to determine whether the results meet the objectives. If objectives are met, consider reducing the number of measurements because performance monitoring itself uses system resources. Otherwise, continue with Step 6.
6. Determine the major constraints in the system.
7. Decide where the team can afford to make trade-offs and which resources can bear additional load.
8. Adjust the configuration of the system. If it is feasible to change more than one tuning option, implement one at a time. If there are no options left at any level, this indicates that the system has reached its limits and hardware upgrades may be advisable.
9. Return to Step 4 and continue to monitor the system.
10. Return to Step 1.
11. Re-examine outlined objectives and indicators.
12. Refine monitoring and tuning strategy.

System Resources

The PowerCenter Server uses the following system resources:

● CPU

● Load Manager shared memory

● DTM buffer memory

● Cache memory

When tuning the system, evaluate the following considerations during the implementation process.

● Determine if the network is running at an optimal speed. Recommended best practice is to minimize the number of network hops between the PowerCenter Server and the databases.

● Use multiple PowerCenter Servers on separate systems to potentially improve session performance.

● When all character data processed by the PowerCenter Server is US-ASCII or EBCDIC, configure the PowerCenter Server for ASCII data movement mode. In ASCII mode, the PowerCenter Server uses one byte to store each character. In Unicode mode, the PowerCenter Server uses two bytes for each character, which can potentially slow session performance.

● Check hard disks on related machines. Slow disk access on source and target databases, source and target file systems, as well as the PowerCenter Server and repository machines, can slow session performance.

● When an operating system runs out of physical memory, it starts paging to disk to free physical memory. Configure the physical memory for the PowerCenter Server machine to minimize paging to disk. Increase system memory when sessions use large cached lookups or sessions have many partitions.

● In a multi-processor UNIX environment, the PowerCenter Server may use a large amount of system resources. Use processor binding to control processor usage by the PowerCenter Server.

● In a Sun Solaris environment, use the psrset command to create and manage a processor set. After creating a processor set, use the pbind command to bind the PowerCenter Server to the processor set so that the processor set only runs the PowerCenter Server. For details, see the project system administrator and Sun Solaris documentation.

● In an HP-UX environment, use the Process Resource Manager utility to control CPU usage in the system. The Process Resource Manager allocates minimum system resources and uses a maximum cap of resources. For details, see project system administrator and HP-UX documentation.

● In an AIX environment, use the Workload Manager in AIX 5L to manage system resources during peak demands. The Workload Manager can allocate resources and manage CPU, memory, and disk I/O bandwidth. For details, see project system administrator and AIX documentation.

Database Performance Features

Nearly everything is a trade-off in the physical database implementation. Work with the DBA in determining which of the many available alternatives is the best implementation choice for the particular database. The project team must have a thorough understanding of the data, database, and desired use of the database by the end-user community prior to beginning the physical implementation process. Evaluate the following considerations during the implementation process.

● Denormalization. The DBA can use denormalization to improve performance by eliminating the constraints and primary key to foreign key relationships, and also eliminating join tables.

● Indexes. Proper indexing can significantly improve query response time. The trade-off of heavy indexing is a degradation of the time required to load data rows into the target tables. Carefully written pre-session scripts are recommended to drop indexes before the load, with post-session scripts rebuilding them after the load (see the sketch after this list).

● Constraints. Avoid constraints if possible; instead, try to enforce integrity by incorporating that additional logic in the mappings.

● Rollback and Temporary Segments. Rollback and temporary segments are primarily used to store data for queries (temporary) and INSERTs and UPDATES (rollback). The rollback area must be large enough to hold all the data prior to a COMMIT. Proper sizing can be crucial to ensuring successful completion of load sessions, particularly on initial loads.

● OS Priority. The priority of background processes is an often-overlooked problem that can be difficult to determine after the fact. DBAs must work with the System Administrator to ensure all the database processes have the same priority.

● Striping. Database performance can be increased significantly by implementing either RAID 0 (striping) or RAID 5 (pooled disk sharing) to improve disk I/O throughput.

● Disk Controllers. Although expensive, striping and RAID 5 can be further enhanced by separating the disk controllers.
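
A minimal sketch of the drop-and-rebuild approach mentioned in the Indexes bullet, using a hypothetical index on a hypothetical SALES_FACT table; run the first statement from a pre-session command or pre-SQL, and the second from a post-session command or post-SQL:

-- Pre-session: remove the index so the bulk load does not have to maintain it row by row
DROP INDEX IDX_SALES_FACT_CUST;

-- Post-session: recreate the index once the load has completed
CREATE INDEX IDX_SALES_FACT_CUST ON SALES_FACT (CUST_ID);

On Oracle, marking the index unusable with ALTER INDEX ... UNUSABLE before the load and rebuilding it with ALTER INDEX ... REBUILD afterwards is an alternative that preserves the index definition.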

Last updated: 13-Feb-07 17:47

Using Metadata Manager Console to Tune the XConnects

Challenge

Improving the efficiency and reducing the run-time of your XConnects through the parameter settings of the Metadata Manager console.

Description

Remember that the minimum system requirements for a machine hosting the Metadata Manager console are:

● Windows operating system (2000, NT 4.0 SP 6a)

● 400MB disk space

● 128MB RAM (256MB recommended)

● 133 MHz processor

If the system meets or exceeds the minimal requirements, but an XConnect is still taking an inordinately long time to run, use the following steps to try to improve its performance.

To improve performance of your XConnect loads from database catalogs:

● Modify the inclusion/exclusion schema list (if the list of schemas to be loaded is longer than the list to be excluded, use the exclusion list).

● Carefully examine how many historical objects the project actually needs. Modify the default "sysdate - 5000" filter to a smaller value to reduce the result set.

To improve performance of your XConnect loads from the PowerCenter repository:

● Load only the production folders that are needed for a particular project.

● Run the XConnects with just one folder at a time, or select the list of folders for a particular run.

Advanced Client Configuration Options

Challenge

Setting the Registry to ensure consistent client installations, resolve potential missing or invalid license key issues, and change the Server Manager Session Log Editor to your preferred editor.

Description

Ensuring Consistent Data Source Names

To ensure the use of consistent data source names for the same data sources across the domain, the Administrator can create a single "official" set of data sources, then use the Repository Manager to export that connection information to a file. You can then distribute this file and import the connection information for each client machine.

Solution:

● From Repository Manager, choose Export Registry from the Tools drop-down menu.

● For all subsequent client installs, simply choose Import Registry from the Tools drop-down menu.

Resolving Missing or Invalid License Keys

The “missing or invalid license key” error occurs when attempting to install PowerCenter Client tools on NT 4.0 or Windows 2000 with a userid other than Administrator.

This problem also occurs when the client software tools are installed under the Administrator account, and a user with a non-administrator ID subsequently attempts to run the tools. The user who attempts to log in using the normal ‘non-administrator’ userid will be unable to start the PowerCenter Client tools. Instead, the software displays the message indicating that the license key is missing or invalid.

Solution:

● While logged in as the installation user with administrator authority, use regedt32 to edit the registry.

● Under HKEY_LOCAL_MACHINE, open Software\Informatica\PowerMart Client Tools\.

● From the menu bar, select Security/Permissions, and grant read access to the users that should be permitted to use the PowerMart Client. (Note that the registry entries for both PowerMart and PowerCenter Server and client tools are stored as PowerMart Server and PowerMart Client tools.)

Changing the Session Log Editor

In PowerCenter versions 6.0 to 7.1.2, the session and workflow log editor defaults to Wordpad within the workflow monitor client tool. To choose a different editor, just select Tools>Options in the workflow monitor. Then browse for the editor that you want on the General tab.

For PowerCenter versions earlier than 6.0, the editor does not default to Wordpad unless the wordpad.exe can be found in the path statement. Instead, a window appears the first time a session log is viewed from the PowerCenter Server Manager prompting the user to enter the full path name of the editor to be used to view the logs. Users often set this parameter incorrectly and must access the registry to change it.

Solution:

● While logged in as the installation user with administrator authority, use regedt32 to go into the registry.

● Move to the registry path location: HKEY_CURRENT_USER\Software\Informatica\PowerMart Client Tools\[CLIENT VERSION]\Server Manager\Session Files. From the menu bar, select View Tree and Data.

● Select the Log File Editor entry by double-clicking on it.

● Replace the entry with the appropriate editor entry (i.e., typically WordPad.exe or Write.exe).

● Select Registry --> Exit from the menu bar to save the entry.

For PowerCenter version 7.1 and above, you should set the log editor option in the Workflow Monitor.

The following figure shows the Workflow Monitor Options Dialog box to use for setting the editor for workflow and session logs.

Adding a New Command Under Tools Menu

Other tools, in addition to the PowerCenter client tools, are often needed during development and testing. For example, you may need a tool such as Enterprise manager (SQL Server) or Toad (Oracle) to query the database. You can add shortcuts to executable programs from any client tool’s ‘Tools’ drop-down menu to provide quick access to these programs.

Solution:

Choose ‘Customize’ under the Tools menu and add a new item. Once it is added, browse to find the executable it is going to call (as shown below).

Once this is done, you can easily call another program from your PowerCenter client tools.

In the following example, TOAD can be called quickly from the Repository Manager tool.

Changing Target Load Type

In PowerCenter versions 6.0 and earlier, each time a session was created, it defaulted to be of type ‘bulk’, although this was not necessarily what was desired and could cause the session to fail under certain conditions if not changed. In versions 7.0 and above, you can set a property in Workflow Manager to choose the default load type to be either 'bulk' or 'normal'.

Solution:

● In the Workflow Manager tool, choose Tools > Options and go to the Miscellaneous tab.

● Click the button for either 'normal' or 'bulk', as desired.

● Click OK, then close and open the Workflow Manager tool.

After this, every time a session is created, the target load type for all relational targets will default to your choice.

Resolving Undocked Explorer Windows

The Repository Navigator window sometimes becomes undocked. Docking it again can be frustrating because double clicking on the window header does not put it back in place.

Solution:

● To get the Window correctly docked, right-click in the white space of the Navigator window.

● Make sure that the 'Allow Docking' option is checked. If it is checked, double-click on the title bar of the Navigator window.

Resolving Client Tool Window Display Issues

If one of the windows (e.g., Navigator or Output) in a PowerCenter 7.x or later client tool (e.g., Designer) disappears, try the following solutions to recover it:

● Clicking View > Navigator

● Toggling the menu bar

● Uninstalling and reinstalling Client tools

Note: If none of the above solutions resolve the problem, you may want to try the following solution using the Registry Editor. Be aware, however, that using the Registry Editor incorrectly can cause serious problems that may require reinstalling the operating system. Informatica does not guarantee that any problems caused by using Registry Editor incorrectly can be resolved. Use the Registry Editor at your own risk.

Solution:

Starting with PowerCenter 7.x, the settings for the client tools are in the registry. Display issues can often be resolved as follows:

● Close the client tool.

● Go to Start > Run and type "regedit".

● Go to the key HKEY_CURRENT_USER\Software\Informatica\PowerMart Client Tools\x.y.z

Where x.y.z is the version and maintenance release level of the PowerCenter client as follows:

PowerCenter Version    Folder Name

7.1                    7.1
7.1.1                  7.1.1
7.1.2                  7.1.1
7.1.3                  7.1.1
7.1.4                  7.1.1
8.1                    8.1

● Open the key of the affected tool (for the Repository Manager open Repository Manager Options).

● Export all of the Toolbars sub-folders and rename them.

● Re-open the client tool.

Enhancing the Look of the Client Tools

The PowerCenter client tools allow you to customize the look and feel of the display. Here are a few examples of what you can do.

Designer

● From the Menu bar, select Tools > Options.

● In the dialog box, choose the Format tab.

● Select the feature that you want to modify (i.e., workspace colors, caption colors, or fonts).

Changing the background workspace colors can help identify which workspace is currently open. For example, changing the Source Analyzer workspace color to green or the Target Designer workspace to purple to match their respective metadata definitions helps to identify the workspace.

Alternatively, click the Select Theme button to choose a color theme, which displays background colors based on predefined themes.

Workflow Manager

You can modify the Workflow Manager using the same approach as the Designer tool.

From the Menu bar, select Tools > Options and click the Format tab. Select a color theme or customize each element individually.

Workflow Monitor

You can modify the colors in the Gantt Chart view to represent the various states of a task. You can also select two colors for one task to give it a dimensional appearance; this can be helpful in distinguishing between running tasks, succeeded tasks, etc.

To modify the Gantt chart appearance, go to the Menu bar and select Tools > Options and Gantt Chart.

Using Macros in Data Stencil

Data Stencil contains unsigned macros. Set the security level in Visio to Medium so you can enable macros when you start Data Stencil. If the security level for Visio is set to High or Very High, you cannot run the Data Stencil macros.

To set the security level in Visio, select Tools > Macros > Security from the menu. On the Security Level tab, select Medium.

When you start Data Stencil, Visio displays a security warning about viruses in macros. Click Enable Macros to enable the macros for Data Stencil.

Last updated: 09-Feb-07 15:58

Advanced Server Configuration Options

Challenge

Correctly configuring Advanced Integration Service properties, Integration Service process variables, and automatic memory settings; using custom properties to write service logs to files; and adjusting semaphore and shared memory settings in the UNIX environment.

Description

Configuring Advanced Integration Service Properties

Use the Administration Console to configure the advanced properties, such as the character set of the Integration Service logs. To edit the advanced properties, select the Integration Service in the Navigator, and click the Properties tab > Advanced Properties > Edit.

The following Advanced properties are included:

● Limit on Resilience Timeouts (optional). Maximum amount of time (in seconds) that the service holds on to resources for resilience purposes. This property places a restriction on clients that connect to the service. Any resilience timeouts that exceed the limit are cut off at the limit. If the value of this property is blank, the value is derived from the domain-level settings. Valid values are between 0 and 2592000, inclusive. Default is blank.

● Resilience Timeout (optional). Period of time (in seconds) that the service tries to establish or reestablish a connection to another service. If blank, the value is derived from the domain-level settings. Valid values are between 0 and 2592000, inclusive. Default is blank.

Configuring Integration Service Process Variables

One configuration best practice is to properly configure and leverage the Integration service (IS) process variables. The benefits include:

● Ease of deployment across environments (DEV > TEST > PRD).

● Ease of switching sessions from one IS to another without manually editing all the sessions to change directory paths.

● All the variables are related to directory paths used by a given Integration Service.

You must specify the paths for Integration Service files for each Integration Service process. Examples of Integration Service files include run-time files, state of operation files, and session log files.

Each Integration Service process uses run-time files to process workflows and sessions. If you configure an Integration Service to run on a grid or to run on backup nodes, the run-time files must be stored in a shared location. Each node must have access to the run-time files used to process a session or workflow. This includes files such as parameter files, cache files, input files, and output files.

State of operation files must be accessible by all Integration Service processes. When you enable an Integration Service, it creates files to store the state of operations for the service. The state of operations includes information such as the active service requests, scheduled tasks, and completed and running processes. If the service fails, the Integration Service can restore the state and recover operations from the point of interruption.

All Integration Service processes associated with an Integration Service must use the same shared location. However, each Integration Service can use a separate location.

By default, the installation program creates a set of Integration Service directories in the server\infa_shared directory. You can set the shared location for these directories by configuring the process variable $PMRootDir to point to the same location for each Integration Service process.

You must specify the directory path for each type of file. You specify the following directories using service process variables:

Each registered server has its own set of variables. The list is fixed, not user-extensible.

Service Process Variable Value

$PMRootDir (no default – user must insert a path)

$PMSessionLogDir $PMRootDir/SessLogs

$PMBadFileDir $PMRootDir/BadFiles

$PMCacheDir $PMRootDir/Cache

$PMTargetFileDir $PMRootDir/TargetFiles

$PMSourceFileDir $PMRootDir/SourceFiles

$PMExtProcDir $PMRootDir/ExtProc

$PMTempDir $PMRootDir/Temp

$PMSuccessEmailUser (no default – user must insert a path)

$PMFailureEmailUser (no default – user must insert a path)

$PMSessionLogCount 0

$PMSessionErrorThreshold 0

$PMWorkflowLogCount 0

$PMWorkflowLogDir $PMRootDir/WorkflowLogs

$PMLookupFileDir $PMRootDir/LkpFiles

$PMStorageDir $PMRootDir/Storage

Writing PowerCenter 8 Service Logs to Files

Starting with PowerCenter 8, all the logging for the services and sessions created use the log service and can only be viewed through the PowerCenter Administration Console. However, it is still possible to get this information logged into a file similar to the previous versions.

To write all Integration Service logs (session, workflow, server, etc.) to files:

1. Log in to the Admin Console.
2. Select the Integration Service.
3. Add a Custom property called UseFileLog and set its value to "Yes".
4. Add a Custom property called LogFileName and set its value to the desired file name.
5. Restart the service.

Integration Service Custom Properties (undocumented server parameters) can be entered here as well:

1. At the bottom of the list, enter the Name and Value of the custom property.
2. Click OK.

Adjusting Semaphore Settings on UNIX Platforms

When PowerCenter runs on a UNIX platform, it uses operating system semaphores to keep processes synchronized and to prevent collisions when accessing shared data structures. You may need to increase these semaphore settings before installing the server.

Seven semaphores are required to run a session. Most installations require between 64 and 128 available semaphores, depending on the number of sessions the server runs concurrently. This is in addition to any semaphores required by other software, such as database servers.

The total number of available operating system semaphores is an operating system configuration parameter, with a limit per user and system. The method used to change the parameter depends on the operating system:

● HP/UX: Use sam (1M) to change the parameters.

● Solaris: Use admintool or edit /etc/system to change the parameters.

● AIX: Use smit to change the parameters.

Setting Shared Memory and Semaphore Parameters on UNIX Platforms

Informatica recommends setting the following parameters as high as possible for the UNIX operating system. However, if you set these parameters too high, the machine may not boot. Always refer to the operating system documentation for parameter limits. Note that different UNIX operating systems set these variables in different ways or may be self tuning. Always reboot the system after configuring the UNIX kernel.

HP-UX

For HP-UX release 11i the CDLIMIT and NOFILES parameters are not implemented. In some versions, SEMMSL is hard-coded to 500. NCALL is referred to as NCALLOUT.

Use the HP System V IPC Shared-Memory Subsystem to update parameters.

To change a value, perform the following steps:

1. Enter the /usr/sbin/sam command to start the System Administration Manager (SAM) program.
2. Double-click the Kernel Configuration icon.
3. Double-click the Configurable Parameters icon.
4. Double-click the parameter you want to change and enter the new value in the Formula/Value field.
5. Click OK.
6. Repeat these steps for all kernel configuration parameters that you want to change.
7. When you are finished setting all of the kernel configuration parameters, select Process New Kernel from the Action menu.

The HP-UX operating system automatically reboots after you change the values for the kernel configuration parameters.

IBM AIX

None of the listed parameters requires tuning because each is dynamically adjusted as needed by the kernel.

SUN Solaris

Keep the following points in mind when configuring and tuning the SUN Solaris platform:

1. Edit the /etc/system file and add the following variables to increase shared memory segments:

set shmsys:shminfo_shmmax=value
set shmsys:shminfo_shmmin=value
set shmsys:shminfo_shmmni=value
set shmsys:shminfo_shmseg=value
set semsys:seminfo_semmap=value
set semsys:seminfo_semmni=value
set semsys:seminfo_semmns=value
set semsys:seminfo_semmsl=value
set semsys:seminfo_semmnu=value
set semsys:seminfo_semume=value

2. Verify the shared memory value changes:

# grep shmsys /etc/system

3. Restart the system:

# init 6

Red Hat Linux

The default shared memory limit (shmmax) on Linux platforms is 32MB. This value can be changed in the proc file system without a restart. For example, to allow 128MB, type the following command:

$ echo 134217728 >/proc/sys/kernel/shmmax

You can put this command into a script run at startup.

Alternatively, you can use sysctl(8), if available, to control this parameter. Look for a file called /etc/sysctl.conf and add a line similar to the following:

kernel.shmmax = 134217728

This file is usually processed at startup, but sysctl can also be called explicitly later.

To view the values of other parameters, look in the files /usr/src/linux/include/asm-xxx/shmparam.h and /usr/src/linux/include/linux/sem.h.

SuSE Linux

The default shared memory limits (shmmax and shmall) on SuSE Linux platforms can be changed in the proc file system without a restart. For example, to allow 512MB, type the following commands:

#sets shmall and shmmax shared memory

echo 536870912 >/proc/sys/kernel/shmall #Sets shmall to 512 MB

echo 536870912 >/proc/sys/kernel/shmmax #Sets shmmax to 512 MB

You can also put these commands into a script run at startup.

Also change the settings for the system memory user limits by modifying a file called /etc/profile. Add lines similar to the following:

#sets user limits (ulimit) for system memory resources

ulimit -v 512000 #set virtual (swap) memory to 512 MB

ulimit -m 512000 #set physical memory to 512 MB

Configuring Automatic Memory Settings

With Informatica PowerCenter 8, you can configure the Integration Service to determine buffer memory size and session cache size at runtime. When you run a session, the Integration Service allocates buffer memory to the session to move the data from the source to the target. It also creates session caches in memory. Session caches include index and data caches for the Aggregator, Rank, Joiner, and Lookup transformations, as well as Sorter and XML target caches.

Configure buffer memory and cache memory settings in the Transformation and Session Properties. When you configure buffer memory and cache memory settings, consider the overall memory usage for best performance.

Enable automatic memory settings by configuring a value for the Maximum Memory Allowed for Auto Memory Attributes or the Maximum Percentage of Total Memory Allowed for Auto Memory Attributes. If the value is set to zero for either of these attributes, the Integration Service disables automatic memory settings and uses default values.

Last updated: 01-Feb-07 18:54

Causes and Analysis of UNIX Core Files

Challenge

This Best Practice explains what UNIX core files are and why they are created, and offers some tips on analyzing them.

Description

Fatal run-time errors in UNIX programs usually result in the termination of the UNIX process by the operating system. Usually, when the operating system terminates a process, a "core dump" file is also created, which can be used to analyze the reason for the abnormal termination.

What is a Core File and What Causes it to be Created?

UNIX operating systems may terminate a process before its normal, expected exit for several reasons. These reasons are typically for bad behavior by the program, and include attempts to execute illegal or incorrect machine instructions, attempts to allocate memory outside the memory space allocated to the program, attempts to write to memory marked read-only by the operating system, and other similar incorrect low-level operations. Most of these bad behaviors are caused by errors in programming logic in the program.

UNIX may also terminate a process for some reasons that are not caused by programming errors. The main examples of this type of termination are when a process exceeds its CPU time limit, and when a process exceeds its memory limit.

When UNIX terminates a process in this way, it normally writes an image of the process's memory to disk in a single file. These files are called "core files" and are intended to be used by a programmer to help determine the cause of the failure. Depending on the UNIX version, the name of the file may be "core" or, in more recent UNIX versions, "core.nnnn" where nnnn is the UNIX process ID of the process that was terminated.

Core files are not created for "normal" runtime errors such as incorrect file permissions, lack of disk space, inability to open a file or network connection, and other errors that a program is expected to detect and handle. However, under certain error conditions a program may not handle the error correctly and may follow a path of execution that causes the OS to terminate it and produce a core dump.

Mixing incompatible versions of UNIX, vendor, and database libraries can often trigger behavior that causes unexpected core dumps. For example, using an odbc driver library from one vendor and an odbc driver manager from another vendor may result in a core dump if the libraries are not compatible. A similar situation can occur if a process is using libraries from different versions of a database client, such as a mixed installation of Oracle 8i and 9i. An installation like this should not exist, but if it does, core dumps are often the result.

Core File Locations and Size Limits

A core file is written to the current working directory of the process that was terminated. For PowerCenter, this is always the directory the services were started from. For other applications, this may not be true.

UNIX also implements a per-user resource limit on the maximum size of core files. This is controlled by the ulimit command. If the limit is 0, then core files are not created. If the limit is less than the total memory size of the process, a partial core file is written. Refer to the Best Practice Understanding and Setting UNIX Resources for PowerCenter Installations.

Analyzing Core Files

A core file provides valuable insight into the state and condition the process was in just before it was terminated. It also contains the history, or log, of routines that the process went through before that fateful function call; this log is known as the stack trace. There is little information in a core file that is relevant to an end user; most of its contents are only relevant to a developer, or someone who understands the internals of the program that generated it. However, there are a few things that an end user can do with a core file in the way of initial analysis. The most important aspect of analyzing a core file is extracting this stack trace out of the core dump. Debuggers are the tools that help retrieve the stack trace and other vital information from the core. Informatica recommends using the pmstack utility.

The first step is to save the core file under a new name so that it is not overwritten by a later crash of the same application. One option is to append a timestamp to the core, but it can be renamed to anything:

mv core core.ddmmyyhhmi

The second step is to log in with the same UNIX user id that started up the process that crashed. This sets the debugger's environment to be the same as that of the process at startup time. The third step is to go to the directory where the program is installed and run the "file" command on the core file. This returns the name of the process that created the core file.

file <fullpathtocorefile>/core.ddmmyyhhmi

Core files can be generated by the PowerCenter executables (i.e., pmserver, infaservices, and pmdtm) as well as by other UNIX commands executed by the Integration Service, typically from command tasks and pre- or post-session commands. If a PowerCenter process is terminated by the OS and a core is generated, the session or server log typically indicates 'Process terminating on Signal/Exception' as its last entry.

Using the pmstack Utility

Informatica provides a 'pmstack' utility that can automatically analyze a core file. If the core file is from PowerCenter, it generates a complete stack trace from the core file, which can be sent to Informatica Customer Support for further analysis. The trace contains everything necessary to further diagnose the problem. Core files themselves are normally not useful on a system other than the one where they were generated.

The pmstack utility can be downloaded from the Informatica Support knowledge base as article 13652, and from the support ftp server at tsftp.informatica.com. Once downloaded, run pmstack with the –c option, followed by the name of the core file:

$ pmstack -c core.21896
=================================
SSG pmstack ver 2.0 073004
=================================
Core info : -rw------- 1 pr_pc_d pr_pc_d 58806272 Mar 29 16:28 core.21896
core.21896: ELF 32-bit LSB core file Intel 80386, version 1 (SYSV), SVR4-style, from 'pmdtm'
Process name used for analyzing the core : pmdtm
Generating stack trace, please wait..
Pmstack completed successfully
Please send file core.21896.trace to Informatica Technical Support

You can then look at the generated trace file or send it to support.

Pmstack also supports a -p option, which can be used to extract a stack trace from a running process. This is sometimes useful for determining what a process is doing if it appears to be hung.

Last updated: 01-Feb-07 18:54

Domain Configuration

Challenge

The addition of the domain architecture in PowerCenter 8 simplifies administration of disparate PowerCenter services across the enterprise and allows application services and nodes that were previously administered separately to be grouped into logical folders within the domain based on administrative ownership. It is vital, when installing or upgrading PowerCenter, that the Application Administrator understand the terminology and architecture surrounding the Domain Configuration in order to effectively administer, upgrade, deploy, and maintain PowerCenter Services throughout the enterprise.

Description

The domain architecture allows PowerCenter to provide a service-oriented architecture where you can specify which services are running on which node or physical machine from one central location. The components in the domain are ‘aware’ of each other’s presence and continually monitor one another via ‘heartbeats’. The various services within the domain can move from one physical machine to another without any interruption to the PowerCenter environment. As long as clients can connect to the domain, the domain can route their needs to the appropriate physical machine.

From a monitoring perspective, the domain provides the ability to monitor all services in the domain from a central location. You no longer have to log into and ping multiple machines in a robust PowerCenter environment; instead, a single screen displays the current availability state of all services.

For more details on the individual components and detailed configuration of a domain, refer to the PowerCenter Administrator Guide.

Key Domain Components

There are several key domain components to consider during installation and setup:

● Master Gateway – The node designated as the master gateway or domain controller is the main 'entry point' to the domain. This server should be the most reliable and available machine in the architecture. It is the first point of entry for all clients wishing to connect to one of the PowerCenter services. If the master gateway is unavailable, the entire domain is unavailable. You may designate more than one node to run the gateway service. One gateway is always the master or primary, but by having the gateway service running on more than one node in a multi-node configuration, your domain can continue to function if the master gateway is no longer available. In a high-availability environment, it is critical to have one or more nodes running the gateway service as a backup to the master gateway.


● Shared File System – The PowerCenter domain architecture provides centralized logging capability and, when high availability is enabled, a highly available environment with automatic fail-over of workflows and sessions. In order to achieve this, the base PowerCenter server file directories must reside on a file system that is accessible by all nodes in the domain. When PowerCenter is initially installed, this directory is called infa_shared and is located under the server directory of the PowerCenter installation. It includes logs and checkpoint information that is shared among nodes of the domain. Ideally, this file system is both high-performance and highly available.

● Domain Metadata – As of PowerCenter 8, a store of metadata exists to hold all of the configuration for the domain. This domain repository is separate from the one or more PowerCenter repositories in your domain; it is a handful of tables that replace the older pmserver.cfg, pmrep.cfg, and other PowerCenter configuration files. Upon installation you will be prompted for the RDBMS location for the domain repository. Treat this information much like a PowerCenter repository, with regularly-scheduled backups and a disaster recovery plan; without this metadata, your domain cannot function. The RDBMS user provided to PowerCenter requires permissions to create and drop tables, as well as insert, update, and delete records. Ideally, if you are going to group multiple independent nodes within this domain, the domain configuration database should reside on a separate and independent server, so as to eliminate the single point of failure that would exist if the node hosting the domain configuration database fails.

Domain Architecture

Just as in other PowerCenter architectures, the premise of the architecture is to maintain flexibility and scalability across the environment. There is no single best way to deploy the architecture; rather, each environment should be assessed for external factors and PowerCenter then configured to function best in that particular environment. The advantage of the service-oriented architecture is that components in the architecture (i.e., repository services, integration services, and others) can be moved among nodes without making changes to the mappings or workflows. It is therefore very simple to alter architecture components if you find a suboptimal configuration and want to change it. The key here is that you are not tied to any choices you make at installation time and have the flexibility to change your architecture as your business needs change.

TIP While the architecture is very flexible and provides easy movement of services throughout the environment, one area to consider carefully at installation time is the name of the domain and its nodes. These are somewhat troublesome to change later because of how critical they are to the domain. It is not recommended that you embed server IP addresses or host names in the domain name or the node names; you never know when you may need to move to new hardware or move nodes to new locations. For example, instead of naming your domain ‘PowerCenter_11.5.8.20’, consider naming it ‘Enterprise_Dev_Test’. This makes it much more intuitive to understand what domain you are attaching to, and if you ever decide to move the main gateway to another server, you don’t need to change the domain or node name. While these names can be changed, the change is not easy and requires using command line programs to alter the domain metadata.

In the next sections, we look at a couple of sample domain configurations.

Single Node Domain

Even in a single server/single node installation, you must still create a domain. In this case, all domain components reside on a single physical machine (i.e., node). You can have any number of PowerCenter services running on this domain. It is important to note that with PowerCenter 8 and beyond, you can run multiple Integration Services at the same time on the same machine – even in an NT/Windows environment.

Naturally, this configuration exposes a single point of failure for every component in the domain, and high availability cannot be achieved in this situation.


Multiple Node Domains

Domains can continue to expand to meet the demands of true enterprise-wide data integration.

Domain Architecture for Production/Development/Quality Assurance Environments

The architecture picture becomes more complex when you consider a typical development environment, which usually includes some level of a Development, Quality Assurance, and Production environment. In most implementations, these are separate PowerCenter repositories and associated servers. It is possible to define a single domain to include one or more of these development environments. However, there are a few points to consider:

● If the domain gateway is unavailable for any reason, the entire domain is inaccessible. Keep in mind that if you place your development, quality assurance, and production services in a single domain, you have the possibility of affecting your production environment with development and quality assurance work. If you decide to restart the domain in development for some reason, you are effectively restarting development, quality assurance, and production at the same time. Also, if you experience some sort of failure that affects the domain in production, you have also brought down your development environment and have no place to test a fix, since your entire environment is compromised.

● The domain requires a common, shared, high-performance file system for the centralized logging and checkpoint files. If you have all three environments together on one domain, you are mixing production logs with development logs and other files on the same physical disk, and your production backups and disaster recovery files will contain more than just production information.

● For future upgrades, it is very likely that you will need to upgrade all components of the domain to the new version of PowerCenter at once. If you have placed development, quality assurance, and production in the same domain, you may need to upgrade all of them at the same time. This is an undesirable situation in most data integration environments.

For these reasons, Informatica generally recommends having at least two separate domains in any environment:

● Production Domain
● Development/Quality Assurance Domain


Some architects choose to deploy a separate domain for each environment to further isolate them and ensure that changes in the Development environment cannot disrupt the Quality Assurance environment. The tradeoff is an additional administration console to log into and maintain.

One thing to keep in mind is that while you may have separate domains with separate domain metadata repositories, there is no need to migrate any of this metadata between development, Quality Assurance, and production. The domain metadata repositories store information on the physical location and connectivity of the components, so it makes no sense to migrate it between environments. You do need to provide a separate database location for each domain, but the data within each one is specific to the environment it services and never needs to be migrated.

Administration

The domain administrator has permission to start and shut down all services within the domain, as well as the ability to create other users and delegate roles and responsibilities to them. Keep in mind that if the domain is shut down, it has to be restarted via the command line or the host operating system GUI.

PowerCenter's High Availability option provides the ability to create multiple gateway nodes to a domain, such that if the Master Gateway Node fails, another can assume its responsibilities, including authentication, logging, and service management.

Security and Folders

Much like the Repository Manager, security in the domain interface is set up on a “per-folder” basis, with owners being designated per logical grouping of objects/services in the domain. One of the major differences is that Domain security allows the creation of subfolders to segment your nodes and services in any way you like.

There are many considerations when deciding on a folder structure, keeping in mind that this is a purely administrative interface and does not affect the users and permissions associated with a developer role, which are designated at the Repository level. New legislation in the United States and Europe, such as Basel II and the Public Company Accounting Reform and Investor Protection Act of 2002 (also known as SOX, SarbOx, and Sarbanes-Oxley), has been widely interpreted to place many restrictions on the ability of persons in development roles to have direct write access to production systems; consequently, you may have to plan your administration roles accordingly. Your organization may simply need to use different folders to group objects in Development, Quality Assurance, and Production roles with separate administrators. In some instances, systems may need to be entirely separate, with different domains for the Development, Quality Assurance, and Production systems. Sharing of metadata remains simple between separate domains, with PowerCenter’s ability to “link” domains and copy data between linked domains.

For Data Migration projects, it is recommended to establish a standardized architecture that includes a set of folders, connections, and developer access in accordance with the needs of the project. Typically this includes folders for:

● Acquiring data
● Converting data to match the target system
● The final load to the target application
● Establishing reference data structures

Maintenance

As part of your regular backup of metadata, you should schedule a recurring backup of your PowerCenter domain configuration database metadata. This can be accomplished through PowerCenter by using the infasetup command, further explained in the Command Line Reference. You should also add the schema to your normal RDBMS backup schedule, providing two reliable backup methods for disaster recovery purposes.
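A domain backup can be scripted and scheduled alongside other maintenance jobs. The invocation below is only a sketch: the option names are assumptions that should be verified against the Command Line Reference for your release, and all connection values are placeholders.

infasetup BackupDomain -da dbhost:1521 -du domain_db_user -dp domain_db_password -dt Oracle -bf domain_backup_20070209.bak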

Licensing

As part of PowerCenter’s new Service-Oriented Architecture (SOA), licensing for PowerCenter services has been centralized within the domain. You receive your license key file(s) from Informatica at the same time the download location for your software is provided. Adding license object(s) and assigning individual PowerCenter Services to the license(s) is the method by which you enable a PowerCenter Service. You can do this during install, or add initial/incremental license keys after install via the Administration Console web-based utility, or the infacmd command line utility.

Last updated: 09-Feb-07 16:10


Managing Repository Size

Challenge

The PowerCenter repository grows over time as new development occurs and production runs accumulate. Eventually it can reach a size that starts to slow repository performance or makes backups increasingly difficult. This Best Practice discusses methods to manage the size of the repository.

The release of PowerCenter version 8.x added several features that aid in managing repository size. Although the repository is slightly larger in version 8.x than in previous versions, the client tools have added functionality that reduces the impact of repository size on day-to-day operations. PowerCenter versions earlier than 8.x require more administration to keep repository sizes manageable.

Description

Why should we manage the size of the repository?

Repository size affects the following:

● DB backups and restores. If database backups are being performed, the size required for the backup can be reduced. If PowerCenter backups are being used, you can limit what gets backed up.

● Overall query time of the repository, which slows performance of the repository over time. Analyzing tables on a regular basis can aid in repository table performance.

● Migrations (i.e., copying from one repository to the next). Limit data transfer between repositories to avoid locking up the repository for a long period of time. Options are available to avoid transferring all run statistics when migrating.

A typical repository starts off small (i.e., 50MB to 60MB for an empty repository) and can grow to upwards of 1GB for a large repository. The type of information stored in the repository includes:

❍ Versions
❍ Objects
❍ Run statistics
❍ Scheduling information
❍ Variables

Tips for Managing Repository Size

Versions and Objects

Delete old versions and purged objects from the repository. Use repository queries in the client tools to build reusable queries that identify out-of-date versions and objects for removal. Use the Query Browser to run object queries on both versioned and non-versioned repositories.

Old versions and objects not only increase the size of the repository, but also make it more difficult to manage further into the development cycle. Cleaning up the folders makes it easier to determine what is valid and what is not.

One way to keep the repository small is to use shortcuts: create shared folders when the same source/target definitions or reusable transformations are used in multiple folders.

Folders

Remove folders and objects that are no longer used or referenced. Unnecessary folders increase the size of the repository backups. These folders should not be a part of production but they may exist in development or test repositories.

Run Statistics

Remove old run statistics from the repository if you no longer need them. History is important for determining trending, scaling, and performance tuning needs, but you can always generate reports with the PowerCenter Metadata Reporter and save the data you need. To remove the run statistics, go to Repository Manager and truncate the logs based on dates.

Recommendations

Informatica strongly recommends upgrading to the latest version of PowerCenter, since the most recent release includes features such as the ability to skip workflow and session logs, skip deployment group history, skip MX data, and so forth. The repository in version 8.x and above is larger than in previous versions of PowerCenter, but the added size does not significantly affect repository performance. It is still advisable to analyze the tables or update database statistics to keep the tables optimized.

Informatica does not recommend directly querying the repository tables or performing deletes on them. Use the client tools unless otherwise advised by Informatica technical support personnel.

Last updated: 01-Feb-07 18:54


Organizing and Maintaining Parameter Files & Variables

Challenge

Organizing variables and parameters in Parameter files and maintaining Parameter files for ease of use.

Description

Parameter files provide run-time values for parameters and variables defined in a workflow, worklet, session, mapplet, or mapping. A parameter file can hold values for multiple workflows, sessions, and mappings, and can be created with a text editor such as Notepad or vi, or generated by a shell script or an Informatica mapping.

Variable values are stored in the repository and can be changed within mappings. However, variable values specified in parameter files supersede the values stored in the repository. The values stored in the repository can be cleared or reset using Workflow Manager.

Parameter File Contents

A parameter file contains the values for variables and parameters. Although a parameter file can contain values for more than one workflow (or session), it is advisable to build a parameter file that holds values for a single workflow or a logical group of workflows, for ease of administration. When using the command line mode to execute workflows, multiple parameter files can also be configured and used for a single workflow if the same workflow needs to be run with different parameters.

Types of Parameters and Variables

A parameter file contains the following types of parameters and variables:

● Service Variable. Defines a service variable for an Integration Service.
● Service Process Variable. Defines a service process variable for an Integration Service that runs on a specific node.
● Workflow Variable. References values and records information in a workflow. For example, use a workflow variable in a Decision task to determine whether the previous task ran properly.
● Worklet Variable. References values and records information in a worklet. You can use predefined worklet variables in a parent workflow, but you cannot use workflow variables from the parent workflow in a worklet.
● Session Parameter. Defines a value that can change from session to session, such as a database connection or file name.
● Mapping Parameter. Defines a value that remains constant throughout a session, such as a state sales tax rate.
● Mapping Variable. Defines a value that can change during the session. The Integration Service saves the value of a mapping variable to the repository at the end of each successful session run and uses that value the next time the session runs.

Configuring Resources with Parameter File

If a session uses a parameter file, it must run on a node that has access to the file. You create a resource for the parameter file and make it available to one or more nodes. When you configure the session, you assign the parameter file resource as a required resource. The Load Balancer dispatches the Session task to a node that has the parameter file resource. If no node has the parameter file resource available, the session fails.

Configuring Pushdown Optimization with Parameter File

Depending on the database workload, you may want to use source-side, target-side, or full pushdown optimization at different times. For example, you may want to use partial pushdown optimization during the database's peak hours and full pushdown optimization when activity is low. Use the $$PushdownConfig mapping parameter to apply different pushdown optimization configurations at different times. The parameter lets you run the same session using different types of pushdown optimization.

When you configure the session, choose $$PushdownConfig for the Pushdown Optimization attribute.

Define the parameter in the parameter file. Enter one of the following values for $$PushdownConfig in the parameter file:

● None. The Integration Service processes all transformation logic for the session.
● Source. The Integration Service pushes part of the transformation logic to the source database.
● Source with View. The Integration Service creates a view to represent the SQL override value, and runs an SQL statement against this view to push part of the transformation logic to the source database.
● Target. The Integration Service pushes part of the transformation logic to the target database.
● Full. The Integration Service pushes all transformation logic to the database.
● Full with View. The Integration Service creates a view to represent the SQL override value, and runs an SQL statement against this view to push part of the transformation logic to the source database. The Integration Service pushes any remaining transformation logic to the target database.
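For example, a parameter file entry such as the following switches a session to full pushdown for an off-peak load window; the folder, workflow, and session names are hypothetical and follow the heading format shown in the parameter file excerpt later in this Best Practice:

[FINANCE.WF:wf_nightly_load.ST:s_m_load_orders]
$$PushdownConfig=Full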

Parameter File Name

Informatica recommends giving the Parameter File the same name as the workflow with a suffix of “.par”. This helps in identifying and linking the parameter file to a workflow.

Parameter File: Order of Precedence

While it is possible to assign parameter files to both a session and a workflow, it is important to note that a file specified at the workflow level always supersedes files specified at the session level.

Parameter File Location


Each Integration Service process uses run-time files to process workflows and sessions. If you configure an Integration Service to run on a grid or to run on backup nodes, the run-time files must be stored in a shared location. Each node must have access to the run-time files used to process a session or workflow. This includes files such as parameter files, cache files, input files, and output files.

Place the parameter files in a directory that can be accessed using a server variable. This makes it possible to move sessions and workflows to a different server without modifying workflow or session properties. You can also override the location and name of the parameter file specified in the session or workflow when executing workflows via the pmcmd command.

The following points apply to both parameter and variable files; however, they are more relevant to parameters and parameter files, and are therefore detailed accordingly.

Multiple Parameter Files for a Workflow

To run a workflow with different sets of parameter values during every run:

1. Create multiple parameter files with unique names.
2. Change the parameter file name (to match the parameter file name defined in Session or Workflow properties). You can do this manually or by using a pre-session shell (or batch) script.
3. Run the workflow.

Alternatively, run the workflow using pmcmd with the -paramfile option in place of steps 2 and 3.
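A minimal sketch of such an invocation follows; the service, domain, user, folder, workflow, and file names are placeholders, and the option list should be confirmed against the pmcmd Command Line Reference for your release:

pmcmd startworkflow -sv Int_Svc_Dev -d Domain_Dev -u Administrator -p password -f FINANCE -paramfile /app/param/wf_nightly_load_run2.par wf_nightly_load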

Generating Parameter Files

Based on requirements, you can obtain the values for certain parameters from relational tables or generate them programmatically. In such cases, the parameter files can be generated dynamically using shell (or batch scripts) or using Informatica mappings and sessions.

Consider a case where a session has to be executed only on specific dates (e.g., the last working day of every month), which are listed in a table. You can create the parameter file containing the next run date (extracted from the table) in more than one way.

Method 1:

1. The workflow is configured to use a parameter file.
2. The workflow has a Decision task before running the session, comparing the current system date against the date in the parameter file.
3. Use a shell (or batch) script to create the parameter file. Use an SQL query to extract a single date, which is greater than the system date (today), from the table and write it to a file in the required format.
4. The shell script uses pmcmd to run the workflow.
5. The shell script is scheduled using cron or an external scheduler to run daily.

The following figure shows the use of a shell script to generate a parameter file.
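As an illustration, a script along the following lines could implement steps 3 and 4; the table, column, connection, folder, and workflow names are all hypothetical, and the pmcmd options should be verified for your release:

#!/bin/ksh
# Extract the next run date (the first date later than today) from a scheduling table
NEXT_DATE=$(sqlplus -s etl_user/etl_pwd@DWPROD <<EOF
set heading off feedback off pagesize 0
select to_char(min(run_date),'MM/DD/YYYY') from run_dates where run_date > sysdate;
exit;
EOF
)

# Write the parameter file in the format expected by the workflow
PARAM_FILE=/app/param/wf_monthly_load.par
echo "[FINANCE.WF:wf_monthly_load]" > $PARAM_FILE
echo "\$\$NEXT_RUN_DATE=$NEXT_DATE" >> $PARAM_FILE

# Run the workflow; its Decision task compares the current date with $$NEXT_RUN_DATE
pmcmd startworkflow -sv Int_Svc_Prod -d Domain_Prod -u Administrator -p password -f FINANCE -paramfile $PARAM_FILE wf_monthly_load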


The following figure shows a generated parameter file.
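A file generated in this way would contain little more than the next run date; for example (folder, workflow, and parameter names are again hypothetical):

[FINANCE.WF:wf_monthly_load]
$$NEXT_RUN_DATE=04/30/2007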

Method 2:

1. The workflow is configured to use a parameter file.
2. The initial value for the date parameter is the first date on which the workflow is to run.
3. The workflow has a Decision task before running the session, comparing the current system date against the date in the parameter file.
4. The last task in the workflow generates the parameter file for the next run of the workflow (using a Command task calling a shell script, or a Session task that uses a mapping). This task extracts a date that is greater than the system date (today) from the table and writes it into the parameter file in the required format.
5. Schedule the workflow using the Scheduler to run daily (as shown in the following figure).

Parameter File Templates

In some other cases, the parameter values change between runs, but the change can be incorporated into the parameter files programmatically. There is no need to maintain separate parameter files for each run.

Consider, for example, a service provider who gets the source data for each client from flat files located in client-specific directories and writes processed data into a global database. The source data structure, target data structure, and processing logic are all the same. The log file for each client run has to be preserved in a client-specific directory. The directory names have the client ID as part of the directory structure (e.g., /app/data/Client_ID/).

You can complete the work for all clients using a set of mappings, sessions, and a workflow, with one parameter file per client. However, the number of parameter files may become cumbersome to manage when the number of clients increases.


In such cases, a parameter file template (i.e., a parameter file containing values for some parameters and placeholders for others) may prove useful. Use a shell (or batch) script at run time to create the actual parameter file for a specific client, replacing the placeholders with actual values, and then execute the workflow using pmcmd.

[PROJ_DP.WF:Client_Data]

$InputFile_1=/app/data/Client_ID/input/client_info.dat

$LogFile=/app/data/Client_ID/logfile/wfl_client_data_curdate.log

Using a script, replace “Client_ID” and “curdate” with actual values before executing the workflow.
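For instance, a substitution along these lines could be used, assuming the template is saved as wf_client_data_template.par (the file names are illustrative):

sed -e "s/Client_ID/$CLIENT_ID/g" -e "s/curdate/`date +%Y%m%d`/g" wf_client_data_template.par > wf_client_data_$CLIENT_ID.par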

The following text is an excerpt from a parameter file that contains service variables for one Integration Service and parameters for four workflows:

[Service:IntSvs_01]

$PMSuccessEmailUser=pcadmin@mail.com

$PMFailureEmailUser=pcadmin@mail.com

[HET_TGTS.WF:wf_TCOMMIT_INST_ALIAS]

$$platform=unix

[HET_TGTS.WF:wf_TGTS_ASC_ORDR.ST:s_TGTS_ASC_ORDR]

$$platform=unix

$DBConnection_ora=qasrvrk2_hp817

[ORDERS.WF:wf_PARAM_FILE.WT:WL_PARAM_Lvl_1]

$$DT_WL_lvl_1=02/01/2005 01:05:11

$$Double_WL_lvl_1=2.2

[ORDERS.WF:wf_PARAM_FILE.WT:WL_PARAM_Lvl_1.WT:NWL_PARAM_Lvl_2]

$$DT_WL_lvl_2=03/01/2005 01:01:01

$$Int_WL_lvl_2=3

$$String_WL_lvl_2=ccccc


Use Case 1: Fiscal Calendar-Based Processing

Some financial and retail organizations use a fiscal calendar for accounting purposes. Use mapping parameters to process the correct fiscal period.

For example, create a calendar table in the database with the mapping between the Gregorian calendar and fiscal calendar. Create mapping parameters in the mappings for the starting and ending dates. Create another mapping with the logic to create a parameter file. Run the parameter file creation session before running the main session.

The calendar table can be joined directly with the main table, but performance may suffer in some databases depending upon how the indexes are defined. Using a parameter file avoids this join and can result in better performance.
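As an illustration, the generated parameter file for one fiscal period might contain nothing more than the period boundaries (all names here are hypothetical):

[FINANCE.WF:wf_gl_load.ST:s_m_gl_load]
$$FISCAL_PERIOD_START=01/29/2007
$$FISCAL_PERIOD_END=02/25/2007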

Use Case 2: Incremental Data Extraction

Mapping parameters and variables can be used to extract inserted/updated data since the previous extract. Use the mapping parameters or variables in the Source Qualifier to determine the beginning timestamp and the end timestamp for extraction.

For example, create a user-defined mapping variable $$PREVIOUS_RUN_DATE_TIME that saves the timestamp of the last row the Integration Service read in the previous session. Use this variable for the beginning timestamp and the built-in variable $$$SessStartTime for the end timestamp in the source filter.

Use the following filter to incrementally extract data from the database:

LOAN.record_update_timestamp > TO_DATE(‘$$PREVIOUS_RUN_DATE_TIME’) and

LOAN.record_update_timestamp <= TO_DATE(‘$$$SessStartTime’)

Use Case 3: Multi-Purpose Mapping

Mapping parameters can be used to extract data from different tables using a single mapping. In some cases the table name is the only difference between extracts.

For example, consider two similar extracts from tables FUTURE_ISSUER and EQUITY_ISSUER, where the column names and data types are the same. Use a mapping parameter $$TABLE_NAME in the Source Qualifier SQL override and create one parameter file for each table name. Run the workflow using the pmcmd command with the corresponding parameter file, or create two sessions, each with its corresponding parameter file.
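A sketch of this approach follows; the folder, workflow, session, and column names are hypothetical:

Source Qualifier SQL override:
SELECT issuer_id, issuer_name, issuer_rating FROM $$TABLE_NAME

future_issuer.par:
[SEC.WF:wf_issuer_extract.ST:s_m_issuer_extract]
$$TABLE_NAME=FUTURE_ISSUER

equity_issuer.par:
[SEC.WF:wf_issuer_extract.ST:s_m_issuer_extract]
$$TABLE_NAME=EQUITY_ISSUER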

Use Case 4: Using Workflow Variables

You can create variables within a workflow. When you create a variable in a workflow, it is valid only in that workflow. Use the variable in tasks within that workflow. You can edit and delete user-defined workflow variables.

Use user-defined variables when you need to make a workflow decision based on criteria you specify. For example, you create a workflow to load data to an orders database nightly. You also need to load a subset of this data to headquarters periodically, every tenth time you update the local orders database. Create separate sessions to update the local database and the one at headquarters. Use a user-defined variable to determine when to run the session that updates the orders database at headquarters.

To configure user-defined workflow variables, set up the workflow as follows:

1. Create a persistent workflow variable, $$WorkflowCount, to represent the number of times the workflow has run.
2. Add a Start task and both sessions to the workflow.
3. Place a Decision task after the session that updates the local orders database. Set up the decision condition to check whether the number of workflow runs is evenly divisible by 10. Use the modulus (MOD) function to do this.
4. Create an Assignment task to increment the $$WorkflowCount variable by one.

Link the Decision task to the session that updates the database at headquarters when the decision condition evaluates to true. Link it to the Assignment task when the decision condition evaluates to false.

When you configure workflow variables using conditions, the session that updates the local database runs every time the workflow runs. The session that updates the database at headquarters runs every 10th time the workflow runs.
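One way to express the two pieces of logic described above, using the MOD function mentioned earlier (exact expression syntax should be verified in Workflow Manager):

Decision task condition:    MOD($$WorkflowCount, 10) = 0
Assignment task expression: $$WorkflowCount = $$WorkflowCount + 1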

Last updated: 09-Feb-07 16:20


Platform Sizing

Challenge

Determining the appropriate platform size to support the PowerCenter environment based on customer environments and requirements.

Description

The required platform size to support PowerCenter depends on each customer’s unique environment and processing requirements. The Integration Service allocates resources for individual extraction, transformation, and load (ETL) jobs or sessions. Each session has its own resource requirements. The resources required for the Integration Service depend on the number of sessions, what each session does while moving data, and how many sessions run concurrently. This Best Practice discusses the relevant questions pertinent to estimating the platform requirements.

TIP An important concept regarding platform sizing is not to size your environment too soon in the project lifecycle. Too often, clients size their machines before any ETL is designed or developed, and in many cases these platforms are too small for the resultant system. Thus, it is better to analyze sizing requirements after the data transformation processes have been well defined during the design and development phases.

Environment Questions

To determine platform size, consider the following questions regarding your environment:

● What sources do you plan to access?
● How do you currently access those sources?
● Have you decided on the target environment (i.e., database, hardware, operating system)? If so, what is it?
● Have you decided on the PowerCenter environment (i.e., hardware, operating system)?
● Is it possible for the PowerCenter services to be on the same machine as the target?
● How do you plan to access your information (i.e., cube, ad-hoc query tool) and what tools will you use to do this?
● What other applications or services, if any, run on the PowerCenter server?
● What are the latency requirements for the PowerCenter loads?

Engine Sizing Questions

To determine engine size, consider the following questions:

● Is the overall ETL task currently being performed? If so, how is it being done, and how long does it take?
● What is the total volume of data to move?
● What is the largest table (i.e., bytes and rows)? Is there any key on this table that can be used to partition load sessions, if needed?
● How often does the refresh occur?
● Will the refresh be scheduled at a certain time, or driven by external events?
● Is there a "modified" timestamp on the source table rows?
● What is the batch window available for the load?
● Are you doing a load of detail data, aggregations, or both?
● If you are doing aggregations, what is the ratio of source to target rows for the largest result set? How large is the result set (bytes and rows)?

The answers to these questions provide an approximation guide to the factors that affect PowerCenter's resource requirements. To simplify the analysis, focus on large jobs that drive the resource requirement.

Engine Resource Consumption

The following sections summarize some recommendations on the PowerCenter engine resource consumption.

Processor

1 to 1.5 CPUs per concurrent non-partitioned session or transformation job.

Memory

● 20 to 30MB of memory for the main engine for session coordination.


● 20 to 30MB of memory per session, if there are no aggregations, lookups, or heterogeneous data joins. Note that 32-bit systems have an operating system limitation of 2GB per session.

● Caches for aggregation, lookups, or joins use additional memory:
❍ Lookup tables are cached in full; the memory consumed depends on the size of the tables.
❍ Aggregate caches store the individual groups; more memory is used if there are more groups.
❍ Sorting the input to aggregations greatly reduces the need for memory.
❍ Joins cache the master table in a join; memory consumed depends on the size of the master.

System Recommendations

PowerCenter has a service-oriented architecture that provides the ability to scale services and share resources across multiple machines. Below are the recommendations for the system.

Minimum server

● 1 Node, 4 CPUs and 8GB of memory (instead of the minimal requirement of 4GB RAM).

Disk Space

Disk space is not a factor if the machine is used only for PowerCenter services, unless the following conditions exist:

● Data is staged to flat files on the PowerCenter machine.
● Data is stored in incremental aggregation files for adding data to aggregates. The space consumed is about the size of the data aggregated.
● Temporary space is needed for paging for transformations that require large caches that cannot be entirely held in system memory.
● Session logs are saved by timestamp.

If any of these factors is true, Informatica recommends monitoring disk space on a regular basis or maintaining some type of script to purge unused files.

Sizing Analysis


The basic goal is to size the machine so that all jobs can complete within the specified load window. You should consider the answers to the questions in the "Environment" and "Engine Sizing" sections to estimate the required number of sessions, the volume of data that each session moves, and its lookup table, aggregation, and heterogeneous join caching requirements. Use these estimates with the recommendations in the "Engine Resource Consumption" section to determine the required number of processors, memory, and disk space to achieve the required performance to meet the load window.

Note that the deployment environment often creates performance constraints that hardware capacity cannot overcome. The engine throughput is usually constrained by one or more of the environmental factors addressed by the questions in the "Environment" section. For example, if the data sources and target are both remote from the PowerCenter machine, the network is often the constraining factor. At some point, additional sessions, processors, and memory may not yield faster execution because the network (not the PowerCenter services) imposes the performance limit. The hardware sizing analysis is highly dependent on the environment in which the server is deployed. You need to understand the performance characteristics of the environment before making any sizing conclusions.

It is also vitally important to remember that other applications (in addition to PowerCenter) are likely to use the platform. PowerCenter often runs on a server with a database engine and query/analysis tools. In fact, in an environment where PowerCenter, the target database, and query/analysis tools all run on the same machine, the query/analysis tool often drives the hardware requirements. However, if the loading is performed after business hours, the query/analysis tools requirements may not be a sizing limitation.

Last updated: 01-Feb-07 18:54


PowerCenter Admin Console

Challenge

Using the PowerCenter Administration Console to administer PowerCenter domain and services.

Description

PowerCenter has a service-oriented architecture that provides the ability to scale services and share resources across multiple machines. The PowerCenter domain is the fundamental administrative unit in PowerCenter. A domain is a collection of nodes and services that you can group in folders based on administration ownership.

The Administration Console consolidates administrative tasks for domain objects such as services, nodes, licenses, and grids. For more information on domain configuration, refer to the Best Practice on Domain Configuration.

Folders and Security

It is a good practice to create folders in the domain in order to organize objects and manage security. Folders can contain nodes, services, grids, licenses and other folders. Folders can be created based on functionality type, object type, or environment type.

● Functionality-type folders group services based on a functional area such as Sales or Marketing.

● Object-type folders group objects based on the service type; for example, a folder for Integration Services.

● Environment-type folders group objects based on the environment. For example, if you have development and testing on the same domain, group the services according to the environment.

Create user accounts in the Administration Console, then grant permissions and privileges on the folders the users need to access. It is a good practice for the administrator to monitor user activity in the domain periodically and save the reports for audit purposes.


Nodes, Services, and Grids

A node is the logical representation of a machine in a domain. One node in the domain acts as a gateway to receive service requests from clients and route them to the appropriate service and node. Node properties can be set and modified using the Administration Console. It is important to note that the property that limits the number of sessions/tasks that can run on a node is “Maximum Processes”. Set this threshold to a suitably high number; for example, 200 is a good threshold. If you are using Adaptive Dispatch mode, it is a good practice to recalculate the CPU profile when the node is idle, since the calculation uses 100 percent of the CPU.

The Administration Console also allows you to manage application services, and you can access the properties of all services in one window. For more information on configuring the properties, refer to the Best Practice on Advanced Server Configuration Options.

In addition, you can create grids and assign nodes to them using the Administration Console.

Last updated: 01-Feb-07 18:54


Understanding and Setting UNIX Resources for PowerCenter Installations

Challenge

This Best Practice explains what UNIX resource limits are, and how to control and manage them.

Description

UNIX systems impose per-process limits on resources such as processor usage, memory, and file handles. Understanding and setting these resources correctly is essential for PowerCenter installations.

Understanding UNIX Resource Limits

UNIX systems impose limits on several different resources. The resources that can be limited depend on the actual operating system (e.g., Solaris, AIX, Linux, or HPUX) and the version of the operating system. In general, all UNIX systems implement per-process limits on the following resources. There may be additional resource limits, depending on the operating system.

● Processor time – The maximum amount of processor time that can be used by a process, usually in seconds.
● Maximum file size – The size of the largest single file a process can create. Usually specified in blocks of 512 bytes.
● Process data – The maximum amount of data memory a process can allocate. Usually specified in KB.
● Process stack – The maximum amount of stack memory a process can allocate. Usually specified in KB.
● Number of open files – The maximum number of files that can be open simultaneously.
● Total virtual memory – The maximum amount of memory a process can use, including stack, instructions, and data. Usually specified in KB.
● Core file size – The maximum size of a core dump file. Usually specified in blocks of 512 bytes.

These limits are implemented on an individual process basis. The limits are also ‘inherited’ by child processes when they are created.

In practice, this means that the resource limits are typically set at log-on time, and apply to all processes started from the log-in shell. In the case of PowerCenter, any limits in effect before the Integration Service is started also apply to all sessions (pmdtm) started from that node. Any limits in effect when the Repository Service is started also apply to all pmrepagents started from that repository service (repository service process is an instance of the repository service running on a particular machine or node).

When a process exceeds its resource limit, UNIX fails the operation that caused the limit to be exceeded. Depending on the limit that is reached, memory allocations fail, files can’t be opened, and processes are terminated when they exceed their processor time.

Since PowerCenter sessions often use a large amount of processor time, open many files, and can use large amounts of memory, it is important to set resource limits correctly so that the operating system does not deny access to required resources, while still protecting the system from runaway processes.


Hard and Soft Limits

Each resource that can be limited actually allows two limits to be specified – a ‘soft’ limit and a ‘hard’ limit. Hard and soft limits can be confusing.

From a practical point of view, the difference between hard and soft limits doesn’t matter to PowerCenter or any other process; the lower value is enforced when it is reached, whether it is a hard or soft limit.

The difference between hard and soft limits really only matters when changing resource limits. The hard limits are the absolute maximums set by the System Administrator that can only be changed by the System Administrator. The soft limits are ‘recommended’ values set by the System Administrator, and can be increased by the user, up to the maximum limits.

UNIX Resource Limit Commands

The standard interface to UNIX resource limits is the ‘ulimit’ shell command. This command displays and sets resource limits. The C shell implements a variation of this command called ‘limit’, which has different syntax but the same functions.

● ulimit -a    Displays all soft limits
● ulimit -a -H    Displays all hard limits in effect

Recommended ulimit settings for a PowerCenter server:

● Processor time – Unlimited. This is needed for the pmserver and pmrepserver processes that run continuously.
● Maximum file size – Based on what’s needed for the specific application. This is an important parameter to keep a session from filling a whole filesystem, but it needs to be large enough not to affect normal production operations.
● Process data – 1GB to 2GB.
● Process stack – 32MB.
● Number of open files – At least 256. Each network connection counts as a ‘file’, so source, target, and repository connections, as well as cache files, all use file handles.
● Total virtual memory – The largest expected size of a session. 1GB should be adequate, unless sessions are expected to create large in-memory aggregate and lookup caches that require more memory. If you have sessions that are likely to require more than 1GB, set the total virtual memory appropriately. Remember that on a 32-bit OS, the maximum virtual memory for a session is 2GB.
● Core file size – Unlimited, unless disk space is very tight. The largest core files can be roughly 2 to 3GB, but they should be deleted after analysis, and there really shouldn’t be multiple core files lying around.

Setting Resource Limits

Resource limits are normally set in the log-in script, either .profile for the Korn shell or .bash_profile for the bash shell. One ulimit command is required for each resource being set, and usually the soft limit is set. A typical sequence is:

ulimit -S -c unlimited
ulimit -S -d 1232896
ulimit -S -s 32768
ulimit -S -t unlimited
ulimit -S -f 2097152
ulimit -S -n 1024
ulimit -S -v unlimited

After running these commands, the limits are changed:

% ulimit -S -a
core file size (blocks, -c) unlimited
data seg size (kbytes, -d) 1232896
file size (blocks, -f) 2097152
max memory size (kbytes, -m) unlimited
open files (-n) 1024
stack size (kbytes, -s) 32768
cpu time (seconds, -t) unlimited
virtual memory (kbytes, -v) unlimited

Setting or Changing Hard Resource Limits

Setting or changing hard resource limits varies across UNIX types. Most current UNIX systems set the initial hard limits in the file /etc/profile, which must be changed by a System Administrator. In some cases, it is necessary to run a system utility such as smit on AIX to change the global system limits.
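As a sketch only (the exact file and mechanism vary by platform, as noted above), a System Administrator could raise a hard limit system-wide by adding a line such as the following to /etc/profile; the chosen value is illustrative:

ulimit -H -n 4096    # raise the hard limit on open files for all log-in shells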

Last updated: 01-Feb-07 18:54


PowerExchange CDC for Oracle

Challenge

Configuration of the Oracle environment for optimal performance of PowerExchange Change Data Capture in production environments.

Description

The performance of PowerExchange CDC on Oracle databases depends on a variety of factors, including:

● The type of connection that PowerExchange has to the Oracle database being captured.
● The amount of data being written to the Oracle redo logs.
● The workload of the server where the Oracle database being captured resides.

Connection Type

Ensure that, wherever possible, PowerExchange has a Local mode connection to the source database. Connections over slow networks and via SQL*Net should be avoided.

Volume of Data

The volume of data that the Oracle Log Miner has to process in order to provide changed data to PowerExchange has a significant impact upon performance. Bear in mind that other processes may be writing large volumes of data to the Oracle redo logs, in addition to the changed data rows. These include, but are not restricted to:

● Oracle Catalog dumps.
● Oracle Workload monitor customizations.
● Other (non-Oracle) tools that use the redo logs to provide proprietary information.

In order to optimize PowerExchange CDC performance, the amount of data these processes write to the Oracle redo logs needs to be minimized, both in terms of volume and frequency. Review the processes that are actively writing data to the Oracle redo logs and tune them within the context of a production environment.

For example, is it strictly necessary to perform a Catalog dump every 30 minutes? In a production environment, schema changes are less frequent than in a development environment, where Catalog dumps may be needed at this frequency.

Server Workload

Optimize the performance of the Oracle database server by reducing the number of unnecessary tasks it performs concurrently with the PowerExchange CDC components. This may include a full review of the scheduling of backups and restores, Oracle import and export processing, and other application software utilized within the production server environment.

Last updated: 01-Feb-07 18:54


PowerExchange Installation (for Mainframe)

Challenge

Installing and configuring a PowerExchange listener on a mainframe, ensuring that the process is both efficient and effective.

Description

PowerExchange installation is very straight-forward and can generally be accomplished in a timely fashion. When considering a PowerExchange installation, be sure that the appropriate resources are available. These include, but are not limited to:

● MVS systems operator
● Appropriate database administrator; this depends on what (if any) databases are going to be sources and/or targets (e.g., IMS, IDMS, etc.)
● MVS Security resources

Be sure to adhere to the sequence of the following steps to successfully install PowerExchange. Note that in this very typical scenario, the mainframe source data is going to be “pulled” across to a server box.

1. Complete the PowerExchange pre-install checklist and obtain valid license keys.
2. Install PowerExchange on the mainframe.
3. Start the PowerExchange jobs/tasks on the mainframe.
4. Install the PowerExchange client (Navigator) on a workstation.
5. Test connectivity to the mainframe from the workstation.
6. Install PowerExchange on the UNIX/NT server.
7. Test connectivity to the mainframe from the server.

Complete the PowerExchange Pre-install Checklist and Obtain Valid License Keys

Reviewing the environment and recording the information in a detailed checklist facilitates the PowerExchange install. The checklist (which is a prerequisite) is installed in the Documentation Folder when the PowerExchange software is installed. It is also available within the client from the PowerExchange Program Group. Be sure to complete all relevant sections.

You will need a valid license key in order to run any of the PowerExchange components. This is a 44-byte key that uses hyphens every 4 bytes. For example:

1234-ABCD-1234-EF01-5678-A9B2-E1E2-E3E4-A5F1

The key is not case-sensitive and uses hexadecimal digits and letters (0-9 and A-F). Keys are valid for a specific time period and are also linked to an exact or generic TCP/IP address. They also control access to certain databases and determine if the PowerCenter Mover can be used. You cannot successfully install PowerExchange without a valid key for all required components.

Note: When copying software from one machine to another, you may encounter license key problems since the license key is IP specific. Be prepared to deal with this eventuality, especially if you are going to a backup site for disaster recovery testing.

Install PowerExchange on the Mainframe

Step 1: Create a folder c:\PWX on the workstation. Copy the file with a naming convention similar to PWXOS26.Vxxx.EXE from the PowerExchange CD to this directory. Double click the file to unzip its contents into this directory.

Step 2: Create the PDS “HLQ.PWXVxxx.RUNLIB” and “HLQ.PWXVxxx.BINLIB” on the mainframe in order to pre-allocate the needed libraries. Ensure sufficient space for the required jobs/tasks by setting the cylinders to 150 and directory blocks to 50.

Step 3: Run the “MVS_Install” file. This displays the MVS Install Assistant. Configure the IP Address, Logon ID, Password, HLQ, and Default volume setting on the display screen. Also, enter the license key.

Click the Custom buttons to configure the desired data sources.

Be sure that the HLQ on this screen matches the HLQ of the allocated RUNLIB (from step 2).

Save these settings and click Process. This creates the JCL libraries and opens the following screen to FTP these libraries to MVS. Click XMIT to complete the FTP process.


Step 4: Edit JOBCARD in RUNLIB and configure as per the environment (e.g., execution class, message class, etc.)

Step 5: Edit the SETUP member in RUNLIB. Copy in the JOBCARD and SUBMIT. This process can submit from 5 to 24 jobs. All jobs should end with return code 0 (success) or 1, and a list of the needed installation jobs can be found in the XJOBS member.

Start The PowerExchange Jobs/Tasks on the Mainframe

The installed PowerExchange Listener can be run as a normal batch job or as a started task. Informatica recommends that it initially be submitted as a batch job: RUNLIB(STARTLST).

It should return: DTL-00607 Listener VRM x.x.x Build Vxxx_P0x started.

If implementing change capture, start the PowerExchange Agent (as a started task):

/S DTLA

It should return: DTLEDMI1722561: EDM Agent DTLA has completed initialization.

Install The PowerExchange Client (Navigator) on a Workstation

Step 1: Run the Windows or UNIX installation file in the software folder on the installation CD and follow the prompts.

Step 2: Enter the license key.

Step 3: Follow the wizard to complete the install and reboot the machine.

Step 4: Add a node entry to the configuration file “\Program Files\Informatica\Informatica Power Exchange\dbmover.cfg” to point to the Listener on the mainframe.

node = (mainframe location name, TCPIP, mainframe IP address, 2480)


Test Connectivity to the Mainframe from the Workstation

Ensure communication to the PowerExchange Listener on the mainframe by entering the following in DOS on the workstation:

DTLREXE PROG=PING LOC=<mainframe location or node name in dbmover.cfg>

It should return: DLT-00755 DTLREXE Command OK!

Install PowerExchange on the UNIX Server

Step 1: Create a user for the PowerExchange installation on the UNIX box.

Step 2: Create a UNIX directory “/opt/inform/pwxvxxxp0x”.

Step 3: FTP the file “\software\Unix\dtlxxx_vxxx.tar” on the installation CD to the pwx installation directory on UNIX.

Step 4: Use the UNIX tar command to extract the files. The command is “tar -xvf pwxxxx_vxxx.tar”.

Step 5: Update the logon profile with the correct path, library path, and home environment variables.

Step 6: Update the license key file on the server.

Step 7: Update the configuration file on the server (dbmover.cfg) by adding a node entry to point to the Listener on the mainframe.

Step 8: If using an ETL tool in conjunction with PowerExchange, via ODBC, update the odbc.ini file on the server by adding data source entries that point to PowerExchange-accessed data:

[pwx_mvs_db2]

DRIVER=<install dir>/libdtlodbc.so

DESCRIPTION=MVS DB2


DBTYPE=db2

LOCATION=mvs1

DBQUAL1=DB2T

Test Connectivity to the Mainframe from the Server

Ensure communication to the PowerExchange Listener on the mainframe by entering the following on the UNIX server:

DTLREXE PROG=PING LOC=<mainframe location>

It should return: DLT-00755 DTLREXE Command OK!

Changed Data Capture

There is a separate manual for each type of change data capture adapter. Each manual contains the specifics for the following general steps; you will need to understand the appropriate adapter guide to ensure success.

Step 1: APF authorize the .LOAD and the .LOADLIB libraries. This is required for external security.

Step 2: Copy the Agent from the PowerExchange PROCLIB to the system site PROCLIB.

Step 3: After the Agent has been started, run job SETUP2.

Step 4: Create an active registration for a table/segment/record in Navigator that is setup for changes.

Step 5: Start the ECCR.

Step 6: Issue a change to the table/segment/record that you registered in Navigator.

Step 7: Perform an extraction map row test in Navigator


Assessing the Business Case

Challenge

Assessing the business case for a project must consider both the tangible and intangible potential benefits. The assessment should also validate the benefits and ensure that they are realistic to the Project Sponsor and Key Stakeholders, so that project funding can be secured.

Description

A Business Case should include both qualitative and quantitative measures of potential benefits.

The Qualitative Assessment portion of the Business Case is based on the Statement of Problem/Need and the Statement of Project Goals and Objectives (both generated in Subtask 1.1.1 Establish Business Project Scope) and focuses on discussions with the project beneficiaries regarding the expected benefits in terms of problem alleviation, cost savings or controls, and increased efficiencies and opportunities.

Many qualitative items are intangible, but you may be able to cite examples of the potential costs or risks if the system is not implemented. An example may be the cost of bad data quality resulting in the loss of a key customer or an invalid analysis resulting in bad business decisions. Risk factors may be classified as business, technical, or execution in nature. Examples of these risks are uncertainty of value or the unreliability of collected information, new technology employed, or a major change in business thinking for personnel executing change.

It is important to identify an estimated value added or cost eliminated to strengthen the business case. The better defined these factors are, the stronger the business case.

The Quantitative Assessment portion of the Business Case provides specific measurable details of the proposed project, such as the estimated ROI. This may involve the following calculations:

● Cash flow analysis - Projects positive and negative cash flows for the anticipated life of the project. Typically, ROI measurements use the cash flow formula to depict results.


● Net present value - Evaluates cash flow according to the long-term value of current investment. Net present value shows how much capital needs to be invested currently, at an assumed interest rate, in order to create a stream of payments over time. For instance, to generate an income stream of $500 per month over six months at an interest rate of eight percent would require an investment (i.e., a net present value) of $2,311.44 (see the worked example following this list).

● Return on investment - Calculates net present value of total incremental cost savings and revenue divided by the net present value of total costs multiplied by 100. This type of ROI calculation is frequently referred to as return-on-equity or return-on-capital.

● Payback Period - Determines how much time must pass before an initial capital investment is recovered.
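
As a worked illustration of the net present value example above (the quoted $2,311.44 corresponds to the eight percent rate being applied per payment period) and of the ROI ratio as defined in the text:

\[
PV = 500 \times \frac{1 - (1 + 0.08)^{-6}}{0.08} \approx 500 \times 4.6229 \approx \$2{,}311.44
\]

\[
ROI\ (\%) = \frac{NPV(\text{total incremental cost savings} + \text{revenue})}{NPV(\text{total costs})} \times 100
\]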

The following are steps to calculate the quantitative business case or ROI:

Step 1 – Develop Enterprise Deployment Map. This is a model of the project phases over a timeline, estimating as specifically as possible participants, requirements, and systems involved. A data integration or migration initiative or amendment may require estimating customer participation (e.g., by department and location), subject area and type of information/analysis, numbers of users, numbers and complexity of target data systems (data marts or operational databases, for example) and data sources, types of sources, and size of data set. A data migration project may require customer participation, legacy system migrations, and retirement procedures. The types of estimations vary by project types and goals. It is important to note that the more details you have for estimations, the more precise your phased solutions are likely to be. The scope of the project should also be made known in the deployment map.

Step 2 – Analyze Potential Benefits. Discussions with representative managers and users or the Project Sponsor should reveal the tangible and intangible benefits of the project. The most effective format for presenting this analysis is often a "before" and "after" format that compares the current situation to the project expectations. Include in this step any costs that can be avoided by the deployment of this project.

Step 3 – Calculate Net Present Value for all Benefits. Information gathered in this step should help the customer representatives to understand how the expected benefits are going to be allocated throughout the organization over time, using the enterprise deployment map as a guide.

Step 4 – Define Overall Costs. Customers need specific cost information in order to assess the dollar impact of the project. Cost estimates should address the following fundamental cost components:


● Hardware
● Networks
● RDBMS software
● Back-end tools
● Query/reporting tools
● Internal labor
● External labor
● Ongoing support
● Training

Step 5 – Calculate Net Present Value for all Costs. Use either actual cost estimates or percentage-of-cost values (based on cost allocation assumptions) to calculate costs for each cost component, projected over the timeline of the enterprise deployment map. Actual cost estimates are more accurate than percentage-of-cost allocations, but much more time-consuming. The percentage-of-cost allocation process may be valuable for initial ROI snapshots until costs can be more clearly predicted.

Step 6 – Assess Risk, Adjust Costs and Benefits Accordingly. Review potential risks to the project and make corresponding adjustments to the costs and/or benefits. Some of the major risks to consider are:

● Scope creep, which can be mitigated by thorough planning and tight project scope.

● Integration complexity, which may be reduced by standardizing on vendors with integrated product sets or open architectures.

● Architectural strategy that is inappropriate.
● Current support infrastructure may not meet the needs of the project.
● Conflicting priorities may impact resource availability.
● Other miscellaneous risks from management or end users who may withhold project support; from the entanglements of internal politics; and from technologies that don't function as promised.

● Unexpected data quality, complexity, or definition issues often are discovered late in the course of the project and can adversely affect effort, cost, and schedule. This can be somewhat mitigated by early source analysis.

Step 7 – Determine Overall ROI. When all other portions of the business case are complete, calculate the project's "bottom line". Determining the overall ROI is simply a matter of subtracting net present value of total costs from net present value of (total incremental revenue plus cost savings).
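
Expressed as a formula, the bottom-line figure described in Step 7 is:

\[
\text{Overall ROI} = NPV(\text{total incremental revenue} + \text{cost savings}) - NPV(\text{total costs})
\]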

Final Deliverable

The final deliverable of this phase of development is a complete business case that documents both the tangible (quantified) and the intangible (non-quantified, but estimated) benefits and risks, to be presented to the Project Sponsor and Key Stakeholders. This allows them to review the Business Case in order to justify the development effort.

If your organization has the concept of a Project Office, which provides governance for projects and priorities, much of this material is often part of the original Project Charter, which states items like scope, initial high-level requirements, and key project stakeholders. However, developing a full Business Case can validate any initial analysis and provide additional justification. Additionally, the Project Office should provide guidance in building and communicating the Business Case.

Once completed, the Project Manager is responsible for scheduling the review and socialization of the Business Case.

Last updated: 01-Feb-07 18:54


Defining and Prioritizing Requirements

Challenge

Defining and prioritizing business and functional requirements is often accomplished through a combination of interviews and facilitated meetings (i.e., workshops) between the Project Sponsor and beneficiaries and the Project Manager and Business Analyst.

Requirements need to be gathered from business users who currently use and/or have the potential to use the information being assessed. All input is important since the assessment should encompass an enterprise view of the data rather than a limited functional, departmental, or line-of-business view.

Types of specific detailed data requirements gathered include:

● Data names to be assessed
● Data definitions
● Data formats and physical attributes
● Required business rules, including allowed values
● Data usage
● Expected quality levels

By gathering and documenting some of the key detailed data requirements, a solid understanding of the business rules involved is reached. Certainly, not all elements can be analyzed in detail, but this helps in getting to the heart of the business system so you are better prepared when speaking with business and technical users.

Description

The following steps are key for successfully defining and prioritizing requirements:

Step 1: Discovery

Gathering business requirements is one of the most important stages of any data integration project. Business requirements affect virtually every aspect of the data integration project, from Project Planning and Management to End-User Application Specification. They are like a hub that sits in the middle and touches the various stages (spokes) of the data integration project. There are two basic techniques for gathering requirements and investigating the underlying operational data: interviews and facilitated sessions.

Data Profiling

Informatica Data Explorer (IDE) is an automated data profiling and analysis software product that can be extremely beneficial in defining and prioritizing requirements. It provides a detailed description of data content, structure, rules, and quality by profiling the actual data that is loaded into the product.

Some industry examples of why data profiling is crucial prior to beginning the development process are:

● Cost of poor data quality is 15 to 25 percent of operating profit.

● Poor data management is costing global business $1.4 billion a year.

● 37 percent of projects are cancelled; 50 percent are completed but with 20 percent overruns, leaving only 13 percent completed on time and within budget.

Using a data profiling tool can lower both the risk and the cost of the project and increase the chances of success.

Data profiling reports can be posted to a central location where all team members can review results and track accuracy.

IDE provides the ability to promote collaboration through tags, notes, action items, transformations and rules. By profiling the information, the framework is set to have an effective interview process with business and technical users.

Interviews

By conducting interview research before starting the requirements gathering process, you can categorize interviewees into functional business management and Information Technology (IT) management. This, in conjunction with effective data profiling, helps to establish a comprehensive set of business requirements.


Business Interviewees. Depending on the needs of the project, and even though you may be focused on a single primary business area, it is generally beneficial to interview horizontally to achieve a good cross-functional perspective of the enterprise. This also provides insight into how extensible your project is across the enterprise.

Before you interview, be sure to develop an interview questionnaire based upon profiling results, as well as business questions; schedule the interview time and place; and prepare the interviewees by sending a sample agenda. When interviewing business people, it is always important to start with the upper echelons of management so as to understand the overall vision, assuming you have the business background, confidence and credibility to converse at those levels.

If not adequately prepared, the safer approach is to interview middle management. If you are interviewing across multiple teams, you might want to scramble interviews among the teams; that way, if you hear different perspectives from finance and marketing, the mixed schedule lets you resolve the discrepancies in later interviews. Keep in mind that the business is sponsoring the data integration project and its members will be the end users of the application. They will decide the success criteria of your data integration project and determine future sponsorship. Questioning during these sessions should include the following:

● Who are the stakeholders for this milestone delivery (IT, field business analysts, executive management)?

● What are the target business functions, roles, and responsibilities?

● What are the key relevant business strategies, decisions, and processes (in brief)?

● What information is important to drive, support, and measure success for those strategies/processes? What key metrics? What dimensions for those metrics?

● What current reporting and analysis is applicable? Who provides it? How is it presented? How is it used? How can it be improved?

IT interviewees. The IT interviewees have a different flavor than the business user community. Interviewing the IT team is generally very beneficial because it is composed of data gurus who deal with the data on a daily basis. They can provide great insight into data quality issues, help in systematic exploration of legacy source systems, and help in understanding business user needs around critical reports. If you are developing a prototype, they can help get things done quickly and address important business reports. Questioning during these sessions should include the following:

● Request an overview of existing legacy source systems. How does data currently flow from these systems to the users?

● What day-to-day maintenance issues does the operations team encounter with these systems?

● Ask for their insight into data quality issues.

● What business users do they support? What reports are generated on a daily, weekly, or monthly basis? What are the current service level agreements for these reports?

● How can the DI project support the IS department needs?

● Review data profiling reports and analyze the anomalies in the data. Note and record each of the comments from the more detailed analysis. What are the key business rules involved in each item?

Facilitated Sessions

Facilitated sessions - known sometimes as JAD (Joint Application Development) or RAD (Rapid Application Development) - are ways to work as a group of business and technical users to capture the requirements. This can be very valuable in gathering comprehensive requirements and building the project team. The difficulty is the amount of preparation and planning required to make the session a pleasant and worthwhile experience.

Facilitated sessions provide quick feedback by gathering all the people from the various teams into a meeting and initiating the requirements process. You need a facilitator who is experienced in these meetings to ensure that all the participants get a chance to speak and provide feedback. During individual (or small group) interviews with high-level management, there is often a focus and clarity of vision that may be hindered in large meetings. Thus, it is extremely important to encourage all attendees to participate and to prevent a small number of them from dominating the requirements process.

A challenge of facilitated sessions is matching everyone’s busy schedules and actually getting them into a meeting room. However, this part of the process must be focused and brief or it can become unwieldy, with too much time expended just trying to coordinate calendars among worthy forum participants. Set a time period and target list of participants with the Project Sponsor, but avoid lengthening the process if some participants aren't available. Questions asked during facilitated sessions are similar to the questions asked of business and IS interviewees.

Step 2: Validation and Prioritization

The Business Analyst, with the help of the Project Architect, documents the findings of the discovery process after interviewing the business and IT management. The next step is to define the business requirements specification. The resulting Business Requirements Specification includes a matrix linking the specific business requirements to their functional requirements.

Defining the business requirements is a time-consuming process and should be facilitated by forming a working group. A working group usually consists of business users, business analysts, the project manager, and other individuals who can help to define the business requirements. The working group should meet weekly to define and finalize business requirements. The working group helps to:

● Design the current state and future state
● Identify supply format and transport mechanism
● Identify required message types
● Develop Service Level Agreement(s), including timings
● Identify supply management and control requirements
● Identify common verifications, validations, business validations and transformation rules
● Identify common reference data requirements
● Identify common exceptions
● Produce the physical message specification

At this time also, the Architect develops the Information Requirements Specification to clearly represent the structure of the information requirements. This document, based on the business requirements findings, can facilitate discussion of informational details and provide the starting point for the target model definition.

The detailed business requirements and information requirements should be reviewed with the project beneficiaries and prioritized based on business need and the stated project objectives and scope.

Step 3: The Incremental Roadmap

Concurrent with the validation of the business requirements, the Architect begins the Functional Requirements Specification providing details on the technical requirements for the project.

As general technical feasibility is compared to the prioritization from Step 2, the Project Manager, Business Analyst, and Architect develop consensus on a project "phasing" approach. Items of secondary priority and those with poor near-term feasibility are relegated to subsequent phases of the project. Thus, they develop a phased, or incremental, "roadmap" for the project (Project Roadmap).

Final Deliverable

The final deliverable of this phase of development is a complete list of business requirements, a diagram of the current and future state, and a list of the high-level business rules affected by the requirements that will effect the change from current to future. This provides the development team with much of the information needed to begin the design effort for the system modifications. Once completed, the Project Manager is responsible for scheduling the review and socialization of the requirements and plan to achieve sign-off on the deliverable.

This is presented to the Project Sponsor for approval and becomes the first "increment" or starting point for the Project Plan.

Last updated: 01-Feb-07 18:54


Developing a Work Breakdown Structure (WBS)

Challenge

Developing a comprehensive work breakdown structure (WBS) is crucial for capturing all the tasks required for a data integration project. Many times, underestimating or omitting items such as full analysis, testing, or even specification development can create a sense of false optimism for the project. The WBS clearly depicts all of the various tasks and subtasks required to complete a project. Most project time and resource estimates are supported by the WBS. A thorough, accurate WBS is critical for effective monitoring and also facilitates communication with project sponsors and key stakeholders.

Description

The WBS is a deliverable-oriented hierarchical tree that allows large tasks to be visualized as a group of related smaller, more manageable subtasks. These tasks and subtasks can then be assigned to various resources, which helps to identify accountability and is invaluable for tracking progress. The WBS serves as a starting point as well as a monitoring tool for the project.

One challenge in developing a thorough WBS is obtaining the correct balance between sufficient detail and too much detail. The WBS shouldn’t include every minor detail in the project, but it does need to break the tasks down to a manageable level of detail. One general guideline is to keep task detail to a duration of at least a day. It is also important to maintain a consistent level of detail across the project.

A well designed WBS can be extracted at a higher level to communicate overall project progress, as shown in the following sample. The actual WBS for the project manager may, for example, be a level of detail deeper than the overall project WBS to ensure that all steps are completed, but the communication can roll up a level or two to make things clearer.

Plan                                                   % Complete   Budget Hours   Actual Hours

Architecture - Set up of Informatica Environment           82%          167            137
   Develop analytic solution architecture                  46%           28             13
   Design development architecture                         59%           32             19
   Customize and implement Iterative Framework
      Data Profiling                                       100%           32             32
      Legacy Stage                                         150%           10             15
      Pre-Load Stage                                       150%           10             15
      Reference Data                                       128%           18             23
      Reusable Objects                                      56%           27             15
   Review and signoff of Architecture                       50%           10              5

Analysis - Target-to-Source Data Mapping                    48%         1000            479
   Customer (9 tables)                                      87%          135            117
   Product (7 tables)                                       98%          215            210
   Inventory (3 tables)                                      0%           60              0
   Shipping (3 tables)                                       0%           60              0
   Invoicing (7 tables)                                      0%          140              0
   Orders (13 tables)                                       37%          380            140
   Review and signoff of Functional Specification            0%           10              0

Total Architecture and Analysis                             52%         1167            602
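
In this sample, the percent-complete figures correspond to actual hours divided by budgeted hours, rolled up the task hierarchy; for example:

\[
\frac{137}{167} \approx 82\% \qquad\qquad \frac{602}{1167} \approx 52\%
\]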

A fundamental question is to whether to include “activities” as part of a WBS. The following statements are generally true for most projects, most of the time, and therefore are appropriate as the basis for resolving this question.

● The project manager should have the right to decompose the WBS to whatever level of detail he or she requires to effectively plan and manage the project. The WBS is a project management tool that can be used in different ways, depending upon the needs of the project manager.

● The lowest level of the WBS can be activities.

● The hierarchical structure should be organized by deliverables and milestones with process steps detailed within it. The WBS can be structured from a process or life cycle basis (i.e., the accepted concept of Phases), with non-deliverables detailed within it.

● At the lowest level in the WBS, an individual should be identified and held accountable for the result. This person should be an individual contributor, creating the deliverable personally, or a manager who will in turn create a set of tasks to plan and manage the results.

● The WBS is not necessarily a sequential document. Tasks in the hierarchy are often completed in parallel. In part, the goal is to list every task that must be completed; it is not necessary to determine the critical path for completing these tasks.

❍ For example, consider multiple subtasks under a task (e.g., subtasks 4.3.1 through 4.3.7 under task 4.3). Subtasks 4.3.1 through 4.3.4 may have sequential requirements that force them to be completed in order, while subtasks 4.3.5 through 4.3.7 can - and should - be completed in parallel if they do not have sequential requirements.

❍ It is important to remember that a task is not complete until all of its corresponding subtasks are completed - whether sequentially or in parallel. For example, the Build Phase is not complete until tasks 4.1 through 4.7 are complete, but some work can (and should) begin for the Deploy Phase long before the Build Phase is complete.

The Project Plan provides a starting point for further development of the project WBS. This sample is a Microsoft Project file that has been "pre-loaded" with the phases, tasks, and subtasks that make up the Informatica methodology. The Project Manager can use this WBS as a starting point, but should review it to ensure that it corresponds to the specific development effort, removing any steps that aren’t relevant or adding steps as necessary. Many projects require the addition of detailed steps to accurately represent the development effort.

If the Project Manager chooses not to use Microsoft Project, an Excel version of the Work Breakdown Structure is also available. The phases, tasks, and subtasks can be exported from Excel into many other project management tools, simplifying the effort of developing the WBS.

Sometimes it is best to build an initial task list and timeline with the project team using a facilitator. The project manager can act as the facilitator or can appoint one, freeing up the project manager and enabling team members to focus on determining the actual tasks and effort needed.

Depending on the size and scope of the project, sub-projects may be beneficial, with multiple project teams creating their own project plans. The overall project manager then brings the plans together into a master project plan. This group of projects can be defined as a program and the project manager and project architect manage the interaction among the various development teams.

Caution: Do not expect plans to be set in stone. Plans inevitably change as the project progresses; new information becomes available; scope, resources and priorities change; deliverables are (or are not) completed on time, etc. The process of estimating and modifying the plan should be repeated many times throughout the project. Even initial planning is likely to take several iterations to gather enough information. Significant changes to the project plan become the basis to communicate with the project sponsor(s) and/or key stakeholders with regard to decisions to be made and priorities rearranged. The goal of the project manager is to be non-biased toward any decision, but to place the responsibility with the sponsor to shape direction.

Approaches to Building WBS Structures: Waterfall vs. Iterative

Data integration projects differ somewhat from other types of development projects, although they also share some key attributes. The following list summarizes some unique aspects of data integration projects:

● Business requirements are less tangible and predictable than in OLTP (online transactional processing) projects.

● Database queries are very data intensive, involving few or many tables, but with many, many rows. In OLTP, transactions are data selective, involving few or many tables and comparatively few rows.


● Metadata is important, but in OLTP the meaning of fields is predetermined on a screen or report. In a data integration project (e.g., warehouse or common data management, etc.), metadata and traceability are much more critical.

Data integration projects, like all development projects, must be managed. To manage them, they must follow a clear plan. Data integration project managers often have a more difficult job than those managing OLTP projects because there are so many pieces and sources to manage.

Two purposes of the WBS are to manage work and ensure success. Although this is the same as any project, data integration projects are unlike typical waterfall projects in that they are based on an iterative approach. Three of the main principles of iteration are as follows:

● Iteration. Division of work into small “chunks” of effort using lessons learned from earlier iterations.

● Time boxing. Delivery of capability in short intervals, with the first release typically requiring from three to nine months (depending on complexity) and quarterly releases thereafter.

● Prototyping. Early delivery of a prototype, with a working database delivered approximately one-third of the way through.

Incidentally, most iterative projects follow an essentially waterfall process within a given increment. The danger is that projects can iterate or spiral out of control.

The three principles listed above are very important because even the best data integration plans are likely to invite failure if these principles are ignored. An example of a failure waiting to happen, even with a fully detailed plan, is a large common data management project that gathers all requirements upfront and delivers the application all-at-once after three years. It is not the "large" that is the problem, but the "all requirements upfront" and the "all-at-once in three years."

Even enterprise data warehouses are delivered piece-by-piece using these three (and other) principles. The feedback you can gather from increment to increment is critical to the success of the future increments. The benefit is that such incremental deliveries establish patterns for development that can be used and leveraged for future deliveries.

What is the Correct Development Approach?

The correct development approach is usually dictated by corporate standards and by departments such as the Project Management Office (PMO). Regardless of the development approach chosen, high-level phases typically include planning the project; gathering data requirements; developing data models; designing and developing the physical database(s); developing the source, profile, and map data; and extracting, transforming, and loading the data. Lower-level planning details are typically carried out by the project manager and project team leads.

Preparing the WBS

The WBS can be prepared using manual or automated techniques, or a combination of the two.


In many cases, a manual technique is used to identify and record the high-level phases and tasks, and then the information is transferred to project tracking software such as Microsoft Project. Project team members typically begin by identifying the high-level phases and tasks, writing the relevant information on large sticky notes or index cards, and then mounting the notes or cards on a wall or white board. Use one sticky note or card per phase or task so that you can easily rearrange them as the project order evolves. As the project plan progresses, you can add information to the cards or notes to flesh out the details, such as task owner, time estimates, and dependencies. This information can then be fed into the project tracking software.

Once you have a fairly detailed methodology, you can enter the phase and task information into your project tracking software. When the project team is assembled, you can enter additional tasks and details directly into the software. Be aware, however, that the project team can better understand a project and its various components if they actually participate in the high-level development activities, as they do in the manual approach. Using software alone, without input from relevant project team members, to designate phases, tasks, dependencies and time lines can be difficult and prone to errors and omissions.

Benefits of developing the project timeline manually, with input from team members, include:

● Tasks, effort and dependencies are visible to all team members.

● Team has a greater understanding of and commitment to the project.

● Team members have an opportunity to work with each other and set the foundation. This is particularly important if the team is geographically dispersed and cannot work face-to-face throughout much of the project.

How Much Descriptive Information is Needed?

The project plan should incorporate a thorough description of the project and its goals. Be sure to review the business objectives, constraints, and high-level phases, but keep the description as short and simple as possible. In many cases, a verb-noun form works well (e.g., interview users, document requirements, etc.). After you have described the project at a high level, identify the tasks needed to complete each phase. It is often helpful to use the notes section in the tracking software (e.g., Microsoft Project) to provide narrative for each task or subtask. In general, decompose the tasks until they have a rough duration of two to 20 days.

Remember to break down the tasks only to the level of detail that you are willing to track. Include key checkpoints or milestones as tasks to be completed. Again, a noun-verb form works well for milestones (e.g., requirements completed, data model completed, etc.).

Assigning and Delegating Responsibility

Identify a single owner for each task in the project plan. Although other resources may help to complete the task, the individual who is designated as the owner is ultimately responsible for ensuring that the task, and any associated deliverables, is completed on time.

After the WBS is loaded into the selected project tracking software and refined for the specific project requirements, the Project Manager can begin to estimate the level of effort involved in completing each of the steps. When the estimate is complete, the project manager can assign individual resources and prepare a project schedule. The end result is the Project Plan. Refer to Developing and Maintaining the Project Plan for further information about the project plan.

Use your project plan to track progress. Be sure to review and modify estimates and keep the project plan updated throughout the project.

Last updated: 09-Feb-07 16:29


Developing and Maintaining the Project Plan

Challenge

The challenge of developing and maintaining a project plan is to incorporate all of the necessary components while retaining the flexibility necessary to accommodate change.

A two-fold approach is required to meet the challenge:

1. A project that is clear in scope contains the following elements:

● A designated begin and end date.
● Well-defined business and technical requirements.
● Adequate resources assigned to the project.

Without these components, the project is subject to slippage and to incorrect expectations being set with the Project Sponsor.

2. Project Plans are subject to revision and change throughout the project. It is imperative to establish a communication plan with the Project Sponsor; such communication may involve a weekly status report of accomplishments, and/or a report on issues and plans for the following week. This type of forum is very helpful in involving the Project Sponsor in actively making decisions with regard to changes in scope or timeframes.

If your organization has the concept of a Project Office that provides governance for the project and priorities, look for a Project Charter that contains items like scope, initial high-level requirements, and key project stakeholders. Additionally, the Project Office should provide guidance in funding and resource allocation for key projects.

Informatica’s PowerCenter and Data Quality are not exempt from this project planning process. However, the purpose here is to provide some key elements that can be used to develop and maintain a data integration, data migration, or data quality project.

Description

Use the following steps as a guide for developing the initial project plan:

1. Define major milestones based on the project scope. (Be sure to list all key items such as analysis, design, development, and testing.)

2. Break the milestones down into major tasks and activities. The Project Plan should be helpful as a starting point or for recommending tasks for inclusion.

3. Continue the detail breakdown, if possible, to a level at which logical “chunks” of work can be completed and assigned to resources for accountability purposes. This level provides satisfactory detail to facilitate estimation, assignment of resources, and tracking of progress. If the detail tasks are too broad in scope, such as assigning multiple resources, estimates are much less likely to be accurate and resource accountability becomes difficult to maintain.

4. Confer with technical personnel to review the task definitions and effort estimates (or even to help define them, if applicable). This helps to build commitment for the project plan.

5. Establish the dependencies among tasks, where one task cannot be started until another is completed (or must start or complete concurrently with another).

6. Define the resources based on the role definitions and estimated number of resources needed for each role.

7. Assign resources to each task. If a resource will only be part-time on a task, indicate this in the plan.

8. Ensure that the project plan follows your organization’s system development methodology.

Note: Informatica Professional Services has found success in projects that blend the “waterfall” method with the “iterative” method. The “waterfall” method works well in the early stages of a project, such as analysis and initial design. The “iterative” method works well in accelerating development and testing, where feedback from extensive testing validates the design of the system.

At this point, especially when using Microsoft Project, it is advisable to create dependencies (i.e., predecessor relationships) between tasks assigned to the same resource in order to indicate the sequence of that person's activities. Set the constraint type to “As Soon As Possible” and avoid setting a constraint date. Use the Effort-Driven approach so that the Project Plan can be easily modified as adjustments are made.

By setting the initial definition of tasks and efforts, the resulting schedule should provide a realistic picture of the project, unfettered by concerns about ideal user-requested completion dates. In other words, be as realistic as possible in your initial estimations, even if the resulting scheduling is likely to miss Project Sponsor expectations. This helps to establish good communications with your Project Sponsor so you can begin to negotiate scope and resources in good faith.

This initial schedule becomes a starting point. Expect to review and rework it, perhaps several times. Look for opportunities for parallel activities, perhaps adding resources if necessary, to improve the schedule.

When a satisfactory initial plan is complete, review it with the Project Sponsor and discuss the assumptions, dependencies, assignments, milestone dates, etc. Expect to modify the plan as a result of this review.

Reviewing and Revising the Project Plan

Once the Project Sponsor and Key Stakeholders agree to the initial plan, it becomes the basis for assigning tasks and setting expectations regarding delivery dates. The planning activity then shifts to tracking tasks against the schedule and updating the plan based on status and changes to assumptions.

One of the key communication methods is building the concept of a weekly or bi-weekly Project Sponsor meeting. Attendance at this meeting should include the Project Sponsor, Key Stakeholders, Lead Developers, and the Project Manager.

Elements of a Project Sponsor meeting should include: a) Key Accomplishments (milestones, events at a high-level), b) Progress to Date against the initial plan, c) Actual Hours vs. Budgeted Hours, d) Key Issues and e) Plans for Next Period.

Key Accomplishments

Listing key accomplishments provides an audit trail of activities completed for comparison against the initial plan. This is an opportunity to bring in the lead developers and have them report to management on what they have accomplished; it also provides them with an opportunity to raise concerns, which is very good from a motivation perspective since they own the work and account for it to management.

Keep accomplishments at a high-level and coach the team members to be brief, keeping their presentation to a five to ten minute maximum during this portion of the meeting.

Progress against Initial Plan

The following matrix shows progress on relevant stages of the project. Roll up tasks to a management level so the matrix is readable to the Project Sponsor (see the sample below).

Plan                                                        Percent Complete   Budget Hours

Architecture - Set up of Informatica Migration Environment                          167
   Develop data integration solution architecture                  10%               28
   Design development architecture                                  28%               32
   Customize and implement Iterative Migration Framework
      Data Profiling                                                80%               32
      Legacy Stage                                                 100%               10
      Pre-Load Stage                                               100%               10
      Reference Data                                                83%               18
      Reusable Objects                                              19%               27
   Review and signoff of Architecture                                0%               10

Analysis - Target-to-Source Data Mapping                                            1000
   Customer (9 tables)                                              90%              135
   Product (6 tables)                                               90%              215
   Inventory (3 tables)                                              0%               60
   Shipping (3 tables)                                               0%               60
   Invoicing (7 tables)                                             57%              140
   Orders (19 tables)                                               40%              380
   Review and signoff of Functional Specification                    0%               10

Budget versus Actual

A key measure to be aware of is budgeted vs. actual cost of the project. The Project Sponsor needs to know if additional funding is required; forecasting actual hours against budgeted hours allows the Project Sponsor to determine when additional funding or a change in scope is required.

Many projects are cancelled because of cost overruns, so it is the Project Manager’s job to keep expenditures under control. The following example shows how a budgeted vs. actual report may look.

Week ending:       10-Apr  17-Apr  24-Apr  1-May  8-May  15-May  22-May  29-May

Resource A         28  40  24  40  40  40  40  32  (total 284)

Resource B         10  40  40  40  40  32  (total 202)

Resource C         40  36  40  40  32  (total 188)

Resource D         24  40  36  40  40  32  (total 212)

Project Manager    12  8  8  16  32  (total 76)

*462 962 110 160 97 160 160 160 160 160 1167 687
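
The total for each resource is simply the sum of that resource's weekly hours; for example, for Resource A:

\[
28 + 40 + 24 + 40 + 40 + 40 + 40 + 32 = 284
\]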

Key Issues

This is the most important part of the meeting. Presenting key issues such as resource commitment, user roadblocks, key design concerns, etc., to the Project Sponsor and Key Stakeholders as they occur allows them to make immediate decisions and minimizes the risk of impact to the project.

Plans for Next Period

This communicates back to the Project Sponsor where the resources are to be deployed. If key issues dictate a change, this is an opportunity to redirect the resources and use them correctly.

Be sure to evaluate any changes to scope (see 1.2.4 Manage Project and Scope Change Assessment Sample Deliverable), or changes in priority or approach, as they arise to determine if they affect the plan. It may be necessary to revise the plan if changes in scope or priority require rearranging task assignments or delivery sequences, or if they add new tasks or postpone existing ones.

Tracking Changes

One approach is to establish a baseline schedule (and budget, if applicable) and then track changes against it. With Microsoft Project, this involves creating a "Baseline" that remains static as changes are applied to the schedule. If company and project management do not require tracking against a baseline, simply maintain the plan through updates without a baseline. Maintain all records of Project Sponsor meetings and recap changes in scope after the meeting is completed.

Summary

Managing a data integration, data migration, or data quality project requires good project planning and communications. Many data integration projects fail because of issues such as poor data quality or the complexity of integration. However, good communication and expectation setting with the Project Sponsor can prevent such issues from causing a project to fail.

Last updated: 01-Feb-07 18:54


Developing the Business Case

Challenge

Identifying the departments and individuals that are likely to benefit directly from the project implementation. Understanding these individuals, and their business information requirements, is key to defining and scoping the project.

Description

The following four steps summarize business case development and lay a good foundation for proceeding into detailed business requirements for the project.

1. One of the first steps in establishing the business scope is identifying the project beneficiaries and understanding their business roles and project participation. In many cases, the Project Sponsor can help to identify the beneficiaries and the various departments they represent. This information can then be summarized in an organization chart that is useful for ensuring that all project team members understand the corporate/business organization.

● Activity - Interview project sponsor to identify beneficiaries, define their business roles and project participation.

● Deliverable - Organization chart of corporate beneficiaries and participants.

2. The next step in establishing the business scope is to understand the business problem or need that the project addresses. This information should be clearly defined in a Problem/Needs Statement, using business terms to describe the problem. For example, the problem may be expressed as "a lack of information" rather than "a lack of technology" and should detail the business decisions or analysis that is required to resolve the lack of information. The best way to gather this type of information is by interviewing the Project Sponsor and/or the project beneficiaries.

● Activity - Interview (individually or in forum) Project Sponsor and/or beneficiaries regarding problems and needs related to project.

● Deliverable - Problem/Need Statement

3. The next step in creating the project scope is defining the business goals and objectives for the project and detailing them in a comprehensive Statement of Project Goals and Objectives. This statement should be a high-level expression of the desired business solution (e.g., what strategic or tactical benefits does the business expect to gain from the project) and should avoid any technical considerations at this point. Again, the Project Sponsor and beneficiaries are the best sources for this type of information. It may be practical to combine information gathering for the needs assessment and goals definition, using individual interviews or general meetings to elicit the information.

● Activity - Interview (individually or in forum) Project Sponsor and/or beneficiaries regarding business goals and objectives for the project.

● Deliverable - Statement of Project Goals and Objectives

4. The final step is creating a Project Scope and Assumptions statement that clearly defines the boundaries of the project based on the Statement of Project Goals and Objectives and the associated project assumptions. This statement should focus on the type of information or analysis that will be included in the project rather than what will not.

The assumptions statements are optional and may include qualifiers on the scope, such as assumptions of feasibility, specific roles and responsibilities, or availability of resources or data.

● Activity - Business Analyst develops Project Scope and Assumptions statement for presentation to the Project Sponsor.

● Deliverable - Project Scope and Assumptions statement

Last updated: 01-Feb-07 18:54


Managing the Project Lifecycle

Challenge

To establish an effective communications plan that provides on-going management throughout the project lifecycle and keeps the Project Sponsor informed of the status of the project.

Description

The quality of a project can be directly correlated to the amount of review that occurs during its lifecycle and the involvement of the Project Sponsor and Key Stakeholders.

Project Status Reports

In addition to the initial project plan review with the Project Sponsor, it is critical to schedule regular status meetings with the sponsor and project team to review status, issues, scope changes and schedule updates. This is known as the project sponsor meeting.

Gather status, issues and schedule update information from the team one day before the status meeting in order to compile and distribute the Project Status Report. In addition, make sure lead developers of major assignments are present to report on the status and issues, if applicable.

Project Management Review

The Project Manager should coordinate, if not facilitate, reviews of requirements, plans and deliverables with company management, including business requirements reviews with business personnel and technical reviews with project technical personnel.

Set a process in place beforehand to ensure appropriate personnel are invited, any relevant documents are distributed at least 24 hours in advance, and that reviews focus on questions and issues (rather than a laborious "reading of the code").

Reviews may include:


● Project scope and business case review.
● Business requirements review.
● Source analysis and business rules reviews.
● Data architecture review.
● Technical infrastructure review (hardware and software capacity and configuration planning).
● Data integration logic review (source to target mappings, cleansing and transformation logic, etc.).
● Source extraction process review.
● Operations review (operations and maintenance of load sessions, etc.).
● Reviews of operations plan, QA plan, deployment and support plan.

Project Sponsor Meetings

A project sponsor meeting should be held weekly or bi-weekly to communicate progress to the Project Sponsor and Key Stakeholders. The purpose is to keep key user management involved and engaged in the process, to communicate any changes to the initial plan, and to have them weigh in on the decision process.

Elements of the meeting include:

● Key Accomplishments.
● Activities Next Week.
● Tracking of Progress to-Date (Budget vs. Actual).
● Key Issues / Roadblocks.

It is the Project Manager’s role to stay neutral to any issue and to effectively state facts and allow the Project Sponsor or other key executives to make decisions. Many times this process builds the partnership necessary for success.

Change in Scope

Directly address and evaluate any changes to the planned project activities, priorities, or staffing as they arise, or are proposed, in terms of their impact on the project plan.

The Project Manager should institute a change management process in response to any issue or request that appears to add or alter expected activities and has the potential to affect the plan.

● Use the Scope Change Assessment to record the background problem or requirement and the recommended resolution that constitutes the potential scope change. Note that such a change-in-scope document helps capture key documentation that is particularly useful if the project overruns or fails to deliver upon Project Sponsor expectations.

● Review each potential change with the technical team to assess its impact on the project, evaluating the effect in terms of schedule, budget, staffing requirements, and so forth.

● Present the Scope Change Assessment to the Project Sponsor for acceptance (with formal sign-off, if applicable). Discuss the assumptions involved in the impact estimate and any potential risks to the project.

Even if there is no evident effect on the schedule, it is important to document these changes because they may affect project direction and it may become necessary, later in the project cycle, to justify these changes to management.

Management of Issues

Any questions, problems, or issues that arise and are not immediately resolved should be tracked to ensure that someone is accountable for resolving them so that their effect can also be visible.

Use the Issues Tracking template, or something similar, to track issues, their owner, and dates of entry and resolution as well as the details of the issue and of its solution.

Significant or "showstopper" issues should also be mentioned on the status report and communicated through the weekly project sponsor meeting. This way, the Project Sponsor has the opportunity to resolve and cure a potential issue.

Project Acceptance and Close

A formal project acceptance and close helps document the final status of the project. Rather than simply walking away from a project when it seems complete, this explicit close procedure both documents and helps finalize the project with the Project Sponsor.

For most projects this involves a meeting where the Project Sponsor and/or department managers acknowledge completion or sign a statement of satisfactory completion.


● Even for relatively short projects, use the Project Close Report to finalize the project with a final status report detailing:

❍ What was accomplished.
❍ Any justification for tasks expected but not completed.
❍ Recommendations.

● Prepare for the close by considering what the project team has learned about the environments, procedures, data integration design, data architecture, and other project plans.

● Formulate the recommendations based on issues or problems that need to be addressed. Succinctly describe each problem or recommendation and if applicable, briefly describe a recommended approach.

Last updated: 01-Feb-07 18:54


Using Interviews to Determine Corporate Data Integration Requirements

Challenge

Data warehousing projects are usually initiated out of a business need for a certain type of report (i.e., “we need consistent reporting of revenue, bookings and backlog”). Except in the case of narrowly-focused, departmental data marts, however, this is not enough guidance to drive a full data integration solution. Further, a successful, single-purpose data mart can build a reputation such that, after a relatively brief period of proving its value to users, business management floods the technical group with requests for more data marts in other areas. The only way to avoid silos of data marts is to think bigger at the beginning and canvass the enterprise (or at least the department, if that’s your limit of scope) for a broad analysis of data integration requirements.

Description

Determining the data integration requirements in satisfactory detail and clarity is a difficult task, however, especially while ensuring that the requirements are representative of all the potential stakeholders. This Best Practice summarizes the recommended interview and prioritization process for this requirements analysis.

Process Steps

The first step in the process is to identify and interview “all” major sponsors and stakeholders. This typically includes the executive staff and CFO since they are likely to be the key decision makers who will depend on the data integration. At a minimum, figure on 10 to 20 interview sessions.

The next step in the process is to interview representative information providers. These individuals include the decision makers who provide the strategic perspective on what information to pursue, as well as details on that information, and how it is currently used (i.e., reported and/or analyzed). Be sure to provide feedback to all of the sponsors and stakeholders regarding the findings of the interviews and the recommended subject areas and information profiles. It is often helpful to facilitate a Prioritization Workshop with the major stakeholders, sponsors, and information providers in order to set priorities on the subject areas.


Conduct Interviews

The following paragraphs offer some tips on the actual interviewing process. Two sections at the end of this document provide sample interview outlines for the executive staff and information providers.

Remember to keep executive interviews brief (i.e., an hour or less) and to the point. A focused, consistent interview format is desirable. Don't feel bound to the script, however, since interviewees are likely to raise some interesting points that may not be included in the original interview format. Pursue these subjects as they come up, asking detailed questions. This approach often leads to “discoveries” of strategic uses for information that may be exciting to the client and provide sparkle and focus to the project.

Questions to the “executives” or decision-makers should focus on what business strategies and decisions need information to support or monitor them. (Refer to Outline for Executive Interviews at the end of this document). Coverage here is critical: if key managers are left out, you may miss a critical viewpoint and an important buy-in.

Interviews of information providers are secondary but can be very useful. These are the business analyst-types who report to decision-makers and currently use Excel, Lotus, or a database program to consolidate data from more than one source and provide regular and ad hoc reports or conduct sophisticated analysis. In subsequent phases of the project, you must identify all of these individuals, learn what information they access, and how they process it. At this stage, however, you should focus on the basics, building a foundation for the project and discovering what tools are currently in use and where gaps may exist in the analysis and reporting functions.

Be sure to take detailed notes throughout the interview process. If there are a lot of interviews, you may want the interviewer to partner with someone who can take good notes, perhaps on a laptop to save note transcription time later. It is important to take down the details of what each person says because, at this stage, it is difficult to know what is likely to be important. While some interviewees may want to see detailed notes from their interviews, this is not very efficient since it takes time to clean up the notes for review. The most efficient approach is to simply consolidate the interview notes into a summary format following the interviews.

Be sure to review previous interviews as you go through the interviewing process. You can often use information from earlier interviews to pursue topics in later interviews in more detail and with varying perspectives.


The executive interviews must be carried out in “business terms.” There can be no mention of the data warehouse or systems of record or particular source data entities or issues related to sourcing, cleansing or transformation. It is strictly forbidden to use any technical language. It can be valuable to have an industry expert prepare and even accompany the interviewer to provide business terminology and focus. If the interview falls into “technical details,” for example, into a discussion of whether certain information is currently available or could be integrated into the data warehouse, it is up to the interviewer to re-focus immediately on business needs. If this focus is not maintained, the opportunity for brainstorming is likely to be lost, which will reduce the quality and breadth of the business drivers.

Because of the above caution, it is rarely acceptable to have IS resources present at the executive interviews. These resources are likely to engage the executive (or vice versa) in a discussion of current reporting problems or technical issues and thereby destroy the interview opportunity.

Keep the interview groups small. One or two Professional Services personnel should suffice, with at most one client project person. Especially for executive interviews, there should be only one interviewee. There is sometimes a need to interview a group of middle managers together, but if there are more than two or three, you are likely to get much less input from the participants.

Distribute Interview Findings and Recommended Subject Areas

At the completion of the interviews, compile the interview notes and consolidate the content into a summary. This summary should help to break out the input into departments or other groupings significant to the client. Use this content and your interview experience along with “best practices” or industry experience to recommend specific, well-defined subject areas.

Remember that this is a critical opportunity to position the project to the decision-makers by accurately representing their interests while adding enough creativity to capture their imagination. Provide them with models or profiles of the sort of information that could be included in a subject area so they can visualize its utility. This sort of “visionary concept” of their strategic information needs is crucial to drive their awareness and is often suggested during interviews of the more strategic thinkers. Tie descriptions of the information directly to stated business drivers (e.g., key processes and decisions) to further accentuate the “business solution.”

A typical table of contents in the initial Findings and Recommendations document might look like this:


I. Introduction

II. Executive Summary

A. Objectives for the Data Warehouse

B. Summary of Requirements

C. High Priority Information Categories

D. Issues

III. Recommendations

A. Strategic Information Requirements

B. Issues Related to Availability of Data

C. Suggested Initial Increments

D. Data Warehouse Model

IV. Summary of Findings

A. Description of Process Used

B. Key Business Strategies (this includes descriptions of processes, decisions, and other drivers)

C. Key Departmental Strategies and Measurements

D. Existing Sources of Information

E. How Information is Used

F. Issues Related to Information Access

V. Appendices

A. Organizational structure, departmental roles

B. Departmental responsibilities and relationships


Conduct Prioritization Workshop

This is a critical workshop for consensus on the business drivers. Key executives and decision-makers should attend, along with some key information providers. It is advisable to schedule this workshop offsite to ensure attendance and attention, but the workshop must be efficient, typically confined to a half-day.

Be sure to announce the workshop well enough in advance to ensure that key attendees can put it on their schedules. Sending the announcement of the workshop may coincide with the initial distribution of the interview findings.

The workshop agenda should include the following items:

● Agenda and Introductions
● Project Background and Objectives
● Validate Interview Findings: Key Issues
● Validate Information Needs
● Reality Check: Feasibility
● Prioritize Information Needs
● Data Integration Plan
● Wrap-up and Next Steps

Keep the presentation as simple and concise as possible, and avoid technical discussions or detailed sidetracks.

Validate information needs

Key business drivers should be determined well in advance of the workshop, using information gathered during the interviewing process. Prior to the workshop, these business drivers should be written out, preferably in display format on flipcharts or similar presentation media, along with relevant comments or additions from the interviewees and/or workshop attendees.

During the validation segment of the workshop, attendees need to review and discuss the specific types of information that have been identified as important for triggering or monitoring the business drivers. At this point, it is advisable to compile as complete a list as possible; it can be refined and prioritized in subsequent phases of the project.


As much as possible, categorize the information needs by function, maybe even by specific driver (i.e., a strategic process or decision). Considering the information needs on a function by function basis fosters discussion of how the information is used and by whom.

Reality check: feasibility

With the results of brainstorming over business drivers and information needs listed (all over the walls, presumably), take a brief detour into reality before prioritizing and planning. You need to consider overall feasibility before establishing the first priority information area(s) and setting a plan to implement the data warehousing solution with initial increments to address those first priorities.

Briefly describe the current state of the likely information sources (SORs). What information is currently accessible with a reasonable likelihood of the quality and content necessary for the high priority information areas? If there is likely to be a high degree of complexity or technical difficulty in obtaining the source information, you may need to reduce the priority of that information area (i.e., tackle it after some successes in other areas).

Avoid getting into too much detail or technical issues. Describe the general types of information that will be needed (e.g., sales revenue, service costs, customer descriptive information, etc.), focusing on what you expect will be needed for the highest priority information needs.

Data Integration Plan

The project sponsors, stakeholders, and users should all understand that the process of implementing the data warehousing solution is incremental. Develop a high-level plan for implementing the project, focusing on increments that are both high-value and high-feasibility. Implementing these increments first provides an opportunity to build credibility for the project. The objective during this step is to obtain buy-in for your implementation plan and to begin to set expectations in terms of timing. Be practical though; don't establish too rigorous a timeline!

Wrap-up and next steps

At the close of the workshop, review the group's decisions (in 30 seconds or less), schedule the delivery of notes and findings to the attendees, and discuss the next steps of the data warehousing project.


Document the Roadmap

As soon as possible after the workshop, provide the attendees and other project stakeholders with the results:

● Definitions of each subject area, categorized by functional area
● Within each subject area, descriptions of the business drivers and information metrics
● Lists of the feasibility issues
● The subject area priorities and the implementation timeline.

Outline for Executive Interviews

I. Introductions

II. General description of information strategy process

A. Purpose and goals

B. Overview of steps and deliverables

● Interviews to understand business information strategies and expectations
● Document strategy findings
● Consensus-building meeting to prioritize information requirements and identify “quick hits”
● Model strategic subject areas
● Produce multi-phase Business Intelligence strategy

III. Goals for this meeting

A. Description of business vision, strategies

B. Perspective on strategic business issues and how they drive information needs

● Information needed to support or achieve business goals
● How success is measured

IV. Briefly describe your roles and responsibilities

● The interviewee may provide this information before the actual interview. In this case, simply review with the interviewee and ask if there is anything to add.


A. What are your key business strategies and objectives?

● How do corporate strategic initiatives impact your group?
● These may include “MBOs” (personal performance objectives) and workgroup objectives or strategies.

B. What do you see as the Critical Success Factors for an Enterprise Information Strategy?

● What are its potential obstacles or pitfalls?

C. What information do you need to achieve or support key decisions related to your business objectives?

D. How will your organization’s progress and final success be measured (e.g., metrics, critical success factors)?

E. What information or decisions from other groups affect your success?

F. What are other valuable information sources (i.e., computer reports, industry reports, email, key people, meetings, phone)?

G. Do you have regular strategy meetings? What information is shared as you develop your strategy?

H. If it is difficult for the interviewee to brainstorm about information needs, try asking the question this way: "When you return from a two-week vacation, what information do you want to know first?"

I. Of all the information you now receive, what is the most valuable?

J. What information do you need that is not now readily available?

K. How accurate is the information you are now getting?

L. To whom do you provide information?

M. Who provides information to you?

N. Who would you recommend be involved in the cross-functional Consensus Workshop?

Outline for Information Provider Interviews

I. Introductions

II. General description of information strategy process

A. Purpose and goals

B. Overview of steps and deliverables

● Interviews to understand business information strategies and expectations
● Document strategy findings and model the strategic subject areas
● Consensus-building meeting to prioritize information requirements and identify “quick hits”
● Produce multi-phase Business Intelligence strategy

III. Goals for this meeting

A. Understanding of how business issues drive information needs

B. High-level understanding of what information is currently provided to whom

● Where does it come from
● How is it processed
● What are its quality or access issues

IV. Briefly describe your roles and responsibilities?

● The interviewee may provide this information before the actual interview. In this case, simply review with the interviewee and ask if there is anything to add.

A. Who do you provide information to?

B. What information do you provide to help support or measure the progress/success of their key business decisions?

C. Of all the information you now provide, what is the most requested or most widely used?

D. What are your sources for the information (both in terms of systems and personnel)?

E. What types of analysis do you regularly perform (i.e., trends, investigating problems)? How do you provide these analyses (e.g., charts, graphs, spreadsheets)?

F. How do you change/add value to the information?

G. Are there quality or usability problems with the information you work with? How accurate is it?

Last updated: 01-Feb-07 18:54


Upgrading Data Analyzer

Challenge

Seamlessly upgrade Data Analyzer from one release to another while safeguarding the repository.

Description

Upgrading Data Analyzer involves two steps:

1. Upgrading the Data Analyzer application.
2. Upgrading the Data Analyzer repository.

Steps Before The Upgrade

1. Back up the repository. To ensure a clean backup, shut down Data Analyzer and create the backup, following the steps in the Data Analyzer manual.

2. Restore the backed up repository into an empty database or a new schema. This will ensure that you have a hot backup of the repository if, for some reason, the upgrade fails.
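
For example, if the repository is hosted on Oracle, the hot backup can be restored into a new schema created along the lines of the following sketch. The schema name, password, and tablespace shown are illustrative assumptions only; substitute values that follow your site's DBA standards.

    -- run in SQL*Plus as a DBA user (names are placeholders)
    CREATE USER da_upgrade IDENTIFIED BY da_upgrade_pwd
      DEFAULT TABLESPACE users QUOTA UNLIMITED ON users;
    GRANT CREATE SESSION, CREATE TABLE, CREATE VIEW, CREATE SEQUENCE TO da_upgrade;

Pointing the repository restore at this empty schema keeps the original repository untouched if the upgrade has to be rolled back.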

Steps for upgrading Data Analyzer application

The upgrade process varies depending on the application server on which Data Analyzer is hosted.

For WebLogic:

1. Install WebLogic 8.1 without uninstalling the existing application server (WebLogic 6.1).

2. Install the Data Analyzer application on the new WebLogic 8.1 application server, making sure to use a different port than the one used in the old installation. When prompted for a repository, choose the option of “existing repository” and give the connection details of the database that hosts the backed-up old Data Analyzer repository.

3. When the installation is complete, use the Upgrade utility to connect to the database that hosts the backed-up Data Analyzer repository and perform the upgrade.

For JBoss and WebSphere:

1. Uninstall Data Analyzer.
2. Install the new Data Analyzer version.
3. When prompted for a repository, choose the option of “existing repository” and give the connection details of the database that hosts the backed-up Data Analyzer repository.
4. Use the Upgrade utility and connect to the database that hosts the backed-up Data Analyzer repository and perform the upgrade.

When the repository upgrade is complete, start Data Analyzer and perform a simple acceptance test.

You can use the following test case (or a subset of it) as an acceptance test.

1. Open a simple report.
2. Open a cached report.
3. Open a report with filtersets.
4. Open a sectional report.
5. Open a workflow and also its nodes.
6. Open a report and drill through it.

When all the reports open without problems, the upgrade can be considered complete.

Once the upgrade is complete, repeat the above process on the actual repository.

Note: This upgrade process creates two instances of Data Analyzer. So when the upgrade is successful, uninstall the older version, following the steps in the Data Analyzer manual.

Last updated: 01-Feb-07 18:54


Upgrading PowerCenter

Challenge

Upgrading an existing installation of PowerCenter to a later version encompasses upgrading the repositories, implementing any necessary modifications, testing, and configuring new features. With PowerCenter 8.1, the expansion of the Service-Oriented Architecture with its domain and node concept brings additional challenges to the upgrade process. The challenge is for data integration administrators to approach the upgrade process in a structured fashion and minimize risk to the environment and ongoing project work.

Some of the challenges typically encountered during an upgrade include:

● Limiting development downtime.
● Ensuring that development work performed during the upgrade is accurately migrated to the upgraded environment.
● Testing the upgraded environment to ensure that data integration results are identical to the previous version.
● Ensuring that all elements of the various environments (e.g., Development, Test, and Production) are upgraded successfully.

Description

Some typical reasons for initiating a PowerCenter upgrade include:

● Additional features and capabilities in the new version of PowerCenter that enhance development productivity and administration.

● To keep pace with higher demands for data integration.
● To achieve process performance gains.
● To maintain an environment of fully supported software as older PowerCenter versions reach end-of-support status.

Upgrade Team

Assembling a team of knowledgeable individuals to carry out the PowerCenter upgrade is key to completing the process within schedule and budgetary guidelines. Typically, the upgrade team needs the following key players:

● PowerCenter Administrator
● Database Administrator
● System Administrator
● Informatica team - the business and technical users that "own" the various areas in the Informatica environment. These resources are required for knowledge transfer and testing during the upgrade process and after the upgrade is complete.

Upgrade Paths

The upgrade process details depend on which of the existing PowerCenter versions you are upgrading from and which version you are moving to. The following bullet items summarize the upgrade paths for the various PowerCenter versions:

● PowerCenter 8.1.1 (available since September 2006)

❍ Direct upgrade for PowerCenter 6.x to 8.1.1
❍ Direct upgrade for PowerCenter 7.x to 8.1.1
❍ Direct upgrade for PowerCenter 8.0 to 8.1.1

● Other versions:

❍ For version 4.6 or earlier - upgrade to 5.x, then to 7.x and to 8.1.1
❍ For version 4.7 or later - upgrade to 6.x and then to 8.1.1

Upgrade Tips

Some of the following items may seem obvious, but adhering to these tips should help to ensure that the upgrade process goes smoothly.

● Be sure to have sufficient memory and disk space (database) for the installed software.

● As new features are added into PowerCenter, the repository grows in size anywhere from 5 to 25 percent per release to accommodate the metadata for the new features. Plan for this increase in all of your PowerCenter repositories.

● Always read and save the upgrade log file.
● Back up Repository Server and PowerCenter Server configuration files prior to beginning the upgrade process.
● Test the AEP/EP (Advanced External Procedure/External Procedure) prior to beginning the upgrade. Recompiling may be necessary.
● PowerCenter 8.x and beyond require Domain Metadata in addition to the standard PowerCenter Repositories. Work with your DBA to create a location for the Domain Metadata Repository, which is created at install time.

● Ensure that all repositories for upgrade are backed up and that they can be restored successfully (see the command-line sketch after these tips). Repositories can be restored to the same database in a different schema to allow an upgrade to be carried out in parallel. This is especially useful if the PowerCenter test and development environments reside in a single repository.

● When naming your nodes and domains in PowerCenter 8, think carefully about the naming convention before the upgrade. While changing the name of a node or the domain later is possible, it is not an easy task since it is embedded in much of the general operation of the product. Avoid using IP addresses and machine names for the domain and node names since over time machine IP addresses and server names may change.

● With PowerCenter 8, a central location exists for shared files (i.e., log files, error files, checkpoint files, etc.) across the domain. If using the Grid option or High Availability option, it is important that this file structure is on a high-performance file system and viewable by all nodes in the domain. If High Availability is configured, this file system should also be highly available.
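
As a hedged illustration of the repository backup tip above, the following sketch shows how a pre-8 repository might be backed up with pmrep before the upgrade begins. The repository name, user, password, host, port, and output path are placeholders, and the connect options differ between PowerCenter versions, so verify the exact syntax against the Command Line Reference for the release you are running.

    pmrep connect -r DEV_REPO -n Administrator -x admin_password -h repo_host -o 5001
    pmrep backup -o /backups/DEV_REPO_pre_upgrade.rep
    pmrep cleanup

Keeping a dated backup file per repository makes it easy to restore into a separate schema for the parallel upgrade described above.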

Upgrading Multiple Projects

Be sure to consider the following items if the upgrade involves multiple projects:

● All projects sharing a repository must upgrade at the same time (test concurrently).
● Projects using multiple repositories must all upgrade at the same time.
● After the upgrade, each project should undergo full regression testing.

Upgrade Project Plan

The full upgrade process from version to version can be time-consuming, particularly around the testing and verification stages. Informatica strongly recommends developing a project plan to track progress and to inform managers and team members of the tasks that need to be completed, uncertainties, or missed steps.

Scheduling the Upgrade


When an upgrade is scheduled in conjunction with other development work, it is prudent to have it occur within a separate test environment that mimics (or at least closely resembles) production. This reduces the risk of unexpected errors and can decrease the effort spent on the upgrade. It may also allow the development work to continue in parallel with the upgrade effort, depending on the specific site setup.

Environmental Impact

With each new PowerCenter release, there is the potential for the upgrade to affect your data integration environment based on new components and features. The PowerCenter 8 upgrade changes the architecture from PowerCenter version 7, so you should spend time planning the upgrade strategy concerning domains, nodes, domain metadata, and the other architectural components with PowerCenter 8. Depending on the complexity of your data integration environment, this impact may be minor or major. Single integration server/single repository installations are not likely to notice much of a difference to the architecture, but customers striving for highly available systems with enterprise scalability may need to spend time understanding how to alter their physical architecture to take advantage of these new features in PowerCenter 8. For more information on these architecture changes, reference the PowerCenter documentation and the Best Practice on Domain Configuration.

Upgrade Process

Informatica recommends using the following approach to handle the challenges inherent in an upgrade effort.

Choosing an Appropriate Environment

It is always advisable to have at least three separate environments: one each for Development, Test, and Production.

The Test environment is generally the best place to start the upgrade process since it is likely to be the most similar to Production. If possible, select a test sandbox that parallels production as closely as possible. This enables you to carry out data comparisons between PowerCenter versions. An added benefit of starting the upgrade process in a test environment is that development can continue without interruption. Your corporate policies on development, test, and sandbox environments and the work that can or cannot be done in them will determine the precise order for the upgrade and any associated development changes. Note that if changes are required as a result of the upgrade, they need to be migrated to Production. Use the existing version to back up the PowerCenter repository, then ensure that the backup works by restoring it to a new schema in the repository database.


Alternatively, you can begin the upgrade process in the Development environment or create a parallel environment in which to start the effort. The decision to use or copy an existing platform depends on the state of project work across all environments. If it is not possible to set up a parallel environment, the upgrade may start in Development, then progress to the Test and Production systems. However, using a parallel environment is likely to minimize development downtime. The important thing is to understand the upgrade process and your own business and technical requirements, then adapt the approaches described in this document to one that suits your particular situation.

Organizing the Upgrade Effort

Begin by evaluating the entire upgrade effort in terms of resources, time, and environments. This includes training, availability of database, operating system, and PowerCenter administrator resources as well as time to perform the upgrade and carry out the necessary testing in all environments. Refer to the release notes to help identify mappings and other repository objects that may need changes as a result of the upgrade.

Provide detailed training for the Upgrade team to ensure that everyone directly involved in the upgrade process understands the new version and is capable of using it for their own development work and assisting others with the upgrade process.

Run regression tests for all components on the old version. If possible, store the results so that you can use them for comparison purposes after the upgrade is complete.

Before you begin the upgrade, be sure to back up the repository and server caches, scripts, logs, bad files, parameter files, source and target files, and external procedures. Also be sure to copy backed-up server files to the new directories as the upgrade progresses.

If you are working in a UNIX environment and have to use the same machine for the existing and upgraded versions, be sure to use separate users and directories. Be careful to ensure that profile path statements do not overlap between the new and old versions of PowerCenter. For additional information, refer to the installation guide for path statements and environment variables for your platform and operating system.
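
As an illustration of keeping the two versions apart on one machine, separate profile settings along the following lines can be maintained for the old and new installations. The directory paths and the use of INFA_HOME for the 8.x install are assumptions for this sketch; confirm the actual variables and locations required for your version and platform in the installation guide.

    # .profile for the user running the existing (pre-8) PowerCenter server
    export PATH=/opt/informatica/pc7/server/bin:$PATH
    export LD_LIBRARY_PATH=/opt/informatica/pc7/server/bin:$LD_LIBRARY_PATH

    # .profile for the user running the upgraded PowerCenter 8 services
    export INFA_HOME=/opt/informatica/pc8
    export PATH=$INFA_HOME/server/bin:$PATH
    export LD_LIBRARY_PATH=$INFA_HOME/server/bin:$LD_LIBRARY_PATH

Keeping the two profiles under separate UNIX users prevents one version's binaries and libraries from being picked up accidentally by the other.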

Installing and Configuring the Software

● Install the new version of the PowerCenter components on the server.


● Ensure that the PowerCenter client is installed on at least one workstation to be used for upgrade testing and that connections to repositories are updated if parallel repositories are being used.

● Re-compile any Advanced External Procedures/External Procedures if necessary, and test them.

● The PowerCenter license key is now in the form of a file. During the installation of PowerCenter, you’ll be asked for the location of this key file. The key should be saved on the server prior to beginning the installation process.

● When installing PowerCenter 8.x, you’ll configure the domain, node, repository service, and the integration service at the same time. Ensure that you have all necessary database connections ready before beginning the installation process.

● If upgrading to PowerCenter 8.x from PowerCenter 7.x (or earlier), you must gather all of your configuration files that are going to be used in the automated process to upgrade the Integration Services and Repositories. See the PowerCenter Upgrade Manual for more information on how to gather them and where to locate them for the upgrade process.

● Once the installation has been completed, use the Administration Console to perform the upgrade. Unlike previous versions of PowerCenter, in version 8 the Administration Console is a web application. The Administration Console URL is http://hostname:portnumber, where hostname is the name of the server where the PowerCenter services are installed and portnumber is the port identified during the installation process. The default port number is 6001 (see the example URL after this list).

● Re-register any plug-ins (such as PowerExchange) to the newly upgraded environment.

● You can start both the repository and integration services on the Admin Console.

● Analyze upgrade activity logs to identify areas where changes may be required, then rerun full regression tests on the upgraded repository.

● Execute test plans. Ensure that there are no failures and all the loads run successfully in the upgraded environment.

● Verify the data to ensure that there are no changes and no additional or missing records.
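
For instance, if the PowerCenter services were installed on a host named pcserver01 (a hypothetical name used here only for illustration) and the default port was kept, the Administration Console would be reached at:

    http://pcserver01:6001

Adjust the host name and port to match the values chosen during your installation.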

Implementing Changes and Testing

If changes are needed, decide where those changes are going to be made. It is generally advisable to migrate work back from test to an upgraded development environment. Complete the necessary changes, then migrate forward through test to production. Assess the changes when the results from the test runs are available. If you decide to deviate from best practice and make changes in test and migrate them forward to production, remember that you'll still need to implement the changes in development. Otherwise, these changes will be lost the next time work is migrated from development to the test environment.

When you are satisfied with the results of testing, upgrade the other environments by backing up and restoring the appropriate repositories. Be sure to closely monitor the production environment and check the results after the upgrade. Also remember to archive and remove old repositories from the previous version.

After the Upgrade

● If multiple nodes were configured and you own the PowerCenter Grid option, you can create a server grid to test performance gains.

● If you own the high-availability option, you should configure your environment for high availability including setting up failover gateway node(s) and designating primary and backup nodes for your various PowerCenter services. In addition, your shared file location for the domain should be located on a highly available, high-performance file server.

● Start measuring data quality by creating a sample data profile.
● If LDAP is in use, associate LDAP users with PowerCenter users.
● Install PowerCenter Reports and configure the built-in reports for the PowerCenter repository.

Repository Versioning

After upgrading to version 8.x, you can set the repository to versioned if you purchased the Team-Based Management option and enabled it via the license key.

Keep in mind that once the repository is set to versioned, it cannot be set back to non-versioned. You can invoke the team-based development option in the Administration Console.

Upgrading Folder Versions

After upgrading to version 8.x, you'll need to remember the following:

● There are no more folder versions in version 8.


● The folder with the highest version number becomes the current folder.
● Other versions of the folders are folder_<folder_version_number>.
● Shortcuts are created to mappings from the current folder.

Upgrading Pmrep and Pmcmd Scripts

● No more folder versions for pmrep and pmrepagent scripts.
● Ensure that the workflow/session folder names match the upgraded names.
● Note that pmcmd command structure changes significantly after version 5. Version 5 pmcmd commands can still run in version 8, but may not be backwards-compatible in future versions.
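
As a hedged sketch of the newer command structure, a version 8 style pmcmd call to start a workflow looks roughly like the following. The service, domain, user, folder, and workflow names are placeholders; check the Command Line Reference for your release before updating production scripts.

    pmcmd startworkflow -sv INT_SVC_DEV -d Domain_Dev -u Administrator -p admin_password -f DW_LOADS wf_daily_load

Scripts written against the pre-domain syntax (which addressed a single server by host and port) generally need to be reworked to name the Integration Service and domain instead.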

Advanced External Procedure Transformations

AEPs are upgraded to Custom Transformation, a non-blocking transformation. To use this feature, you need to recompile the procedure, but you can use the old DLL/library if recompilation is not required.

Upgrading XML Definitions

● Version 8 supports XML schema.
● The upgrade removes namespaces and prefixes for multiple namespaces.
● Circular reference definitions are read-only after the upgrade.
● Some datatypes are changed in XML definitions by the upgrade.

For more information on the specific changes to the PowerCenter software for your particular upgraded version, reference the release notes as well as the PowerCenter documentation.

Last updated: 01-Feb-07 18:54


Upgrading PowerExchange

Challenge

Upgrading and configuring PowerExchange on a mainframe to a new release and ensuring that there is minimum impact to the current PowerExchange schedule.

Description

The PowerExchange upgrade is essentially an installation with a few additional steps and some changes to the steps of a new installation. When planning for a PowerExchange upgrade, the same resources are required as for the initial implementation. These include, but are not limited to:

● MVS systems operator
● Appropriate database administrator; this depends on what (if any) databases are going to be sources and/or targets (e.g., IMS, IDMS, etc.).
● MVS Security resources

Since an upgrade is so similar to an initial implementation of PowerExchange, this document does not address the details of the installation. It addresses the steps that are not documented in the Best Practices installation document, as well as changes to existing steps in that document. For details on installing a new PowerExchange release, see the Best Practice PowerExchange Installation (for Mainframe).

Upgrading PowerExchange on the Mainframe

The following steps are modifications to the installation steps or additional steps required to upgrade PowerExchange on the mainframe. More detailed information for upgrades can also be found in the PWX Migration Guide that comes with each release.

1. Choose a new high-level qualifier when allocating the libraries, RUNLIB and BINLIB, on the mainframe. Consider using the version of PowerExchange as part of the dataset name. An example would be SYSB.PWX811.RUNLIB. These two libraries need to be APF authorized.

2. Back up the mainframe datasets and libraries. Also back up the PowerExchange paths on the client workstations and the PowerCenter server.
3. When executing the MVS Install Assistant and providing values on each screen, make sure the following parameters differ from those used in the existing version of PowerExchange. Specify new high-level qualifiers for the PowerExchange datasets, libraries, and VSAM files. The value needs to match the qualifier used for the RUNLIB and BINLIB datasets allocated earlier. Consider including the version of PowerExchange in the high-level nodes of the datasets. An example could be SYSB.PWX811. The PowerExchange Agent/Logger three-character prefix needs to be unique and differ from that used in the existing version of PowerExchange. Make sure the values on the Logger/Agent/Condenser Parameters screen reflect the new prefix. For DB2, the plan name specified should differ from that used in the existing release.
4. Run the jobs listed in the XJOBS member in the RUNLIB.
5. Before starting the Listener, rename the DBMOVER member in the new RUNLIB dataset.
6. Copy the DBMOVER member from the current PowerExchange RUNLIB to the corresponding library for the new release of PowerExchange. Update the port numbers to reflect the new ports. Update any dataset names specified in the NETPORT statements to reflect the new high-level qualifier. (A sample DBMOVER fragment follows this list.)

7. Start the Listener and make sure the PING works. See the Best Practice PowerExchange Installation (for Mainframe) or the Implementation Guide for more details.

8. The existing Datamaps must now be migrated to the new release using the DTLURDMO utility. Details and examples can be found in the PWX Utilities Guide and the PWX Migration Guide.
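
The following fragment is a hedged sketch of the kind of DBMOVER statement that typically needs attention when the member is copied forward to the new RUNLIB. The node name and port number are placeholders, and your member will contain additional site-specific statements that should be carried over unchanged.

    /* new release listens on its own port so both releases can run side by side */
    LISTENER=(node1,TCPIP,2482)

On the client and server DBMOVER.CFG files (see the later section on upgrading the workstations and server), the corresponding NODE statement would then point at the new port, for example NODE=(node1,TCPIP,mvs_host,2482). Leaving the old release's DBMOVER member untouched allows the existing PowerExchange schedule to keep running until cutover.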

At this point, the mainframe upgrade is complete for bulk processing.

For PowerExchange Change Data Capture or Change Data Capture Real-time, complete the additional steps in the installation manual. Also perform the following steps:

1. Use the DTLURDMO utility to migrate existing Capture Registrations and Capture Extractions to the new release.

2. Create a Registration Group for each source.
3. Open and save each Extraction Map in the new Extraction Groups.
4. Ensure the values for the CHKPT_BASENAME and EXT_CAPT_MASK parameters are correct before running a Condense.


Upgrade PowerExchange on a Client Workstation and the Server

The installation procedures on the client workstations and the server are the same as they are for an initial implementation, with a few exceptions. The differences are as follows:

1. New paths should be specified during the installation of the new release.
2. After the installation, copy the old DBMOVER.CFG configuration member to the new path and modify the ports to reflect those of the new release.
3. Make sure the PATHS reflects the path specified earlier for the new release.

Testing can begin now. When testing is complete, the new version can go live.

Go Live With New Release

1. Stop all workflows.
2. Stop all production updates to the existing sources.
3. Ensure all captured data has been processed.
4. Stop all tasks on the mainframe (Agent, Listener, etc.).
5. Start the new tasks on the mainframe.
6. Resume production updates to the sources and resume the workflow schedule.

After the Migration

Consider removing or de-installing the software for the old release on the workstations and server to avoid any conflicts.

Last updated: 01-Feb-07 18:54
