Working with Informatica-Teradata Parallel Transporter


Uploaded by anjaneyulu-gunti on 23-Jan-2018



2 White Paper|Working with Informatica-Teradata Parallel Transporter

1. Introduction

Today's growing data warehouses demand fast, reliable tools that help to acquire and manage data, and the flexibility to load large volumes of data from any source at any time. Challenges come from everywhere: more data sources, growing data volumes, dynamically changing business requirements, and user demands for fresher data.

Teradata is a global technology leader in enterprise data warehousing, business analytics, and data warehousing services. Teradata provides a powerful suite of software that includes the Teradata Database, data access and management tools, and data mining applications. PowerCenter works with the Teradata Database and Teradata tools to provide a data integration solution that allows integrating data from virtually any business system into Teradata, as well as leveraging Teradata data for use in other business systems.

PowerCenter uses the following techniques when extracting data from and loading data to the Teradata database:

· ETL (extract, transform, and load). This technique extracts data from the source systems, transforms the data within PowerCenter, and loads it to target tables.

· ELT (extract, load, and then transform). This technique extracts data from the source systems, loads it to user-defined staging tables in the target database, and transforms the data within the target system using generated SQL. The SQL queries include a final insert into the target tables.

· ETL-T (ETL and ELT hybrid). This technique extracts data from the source systems, transforms the data within PowerCenter, loads the data to user-defined staging tables in the target database, and further transforms the data within the target system using generated SQL. The SQL queries include a final insert into the target tables.

The ETL-T technique is optimized within PowerCenter so that the transformations that perform better within the database system can be performed there, while the Integration Service performs the other transformations.

To perform ETL operations, configure PowerCenter sessions to use a Teradata relational connection, a Teradata standalone load or unload utility, or Teradata Parallel Transporter. To use the ELT or ETL-T techniques, configure PowerCenter sessions to use pushdown optimization.


2. Teradata Standalone Load and Unload Utilities

Teradata standalone load and unload utilities are fast, reliable tools that help to export large amounts of data from Teradata databases and load session target files into Teradata databases. Use a standalone load or unload utility when PowerCenter sessions extract or load large amounts of data. Standalone load and unload utilities are faster than Teradata relational connections because they load or extract data directly from a file or pipe rather than running SQL commands to load or extract the data. PowerCenter works with the following Teradata standalone load and unload utilities:

2.1. Teradata FastLoad

Teradata FastLoad is a command-line utility that quickly loads large amounts of data to empty tables in a Teradata database. Use FastLoad for a high-volume initial load or for high-volume truncate and reload operations. FastLoad is the fastest load utility, but it has the following limitations:

· FastLoad uses multiple sessions to load data, but it can load data to only one table in a Teradata database per job.

· It locks tables while loading data, preventing other users and other instances of FastLoad from accessing the tables during data loading.

· FastLoad only works with empty tables with no secondary indexes.

· It can only insert data.
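Outside PowerCenter, FastLoad is driven by a control script. The following is an illustrative sketch only; the logon values, database, table, and file path are placeholder assumptions, not taken from this paper. It loads a pipe-delimited file into an empty table:

```sql
/* Hypothetical FastLoad script: loads a pipe-delimited file into an
   empty table. Logon values, names, and paths are placeholders. */
LOGON tdprod/loader_user,loader_password;
DATABASE sandbox_db;

BEGIN LOADING sandbox_db.customer
    ERRORFILES sandbox_db.customer_err1, sandbox_db.customer_err2;

SET RECORD VARTEXT "|";

DEFINE
    customer_id (VARCHAR(10)),
    customer_nm (VARCHAR(50))
FILE = /data/customer.dat;

INSERT INTO sandbox_db.customer (customer_id, customer_nm)
VALUES (:customer_id, :customer_nm);

END LOADING;
LOGOFF;
```

When a PowerCenter session uses a FastLoad loader connection, an equivalent control file is generated automatically from the connection and session attributes.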

2.2. Teradata MultiLoad

Teradata MultiLoad is a command-driven utility for fast, high-volume maintenance on multiple tables and views of a Teradata database. Each MultiLoad instance can perform multiple data insert, update, and delete operations on up to five different tables or views. MultiLoad optimizes operations that rapidly acquire, process, and apply data to Teradata tables. Use MultiLoad for large-volume, incremental data loads.

2.2.1. MultiLoad has the following advantages:

· MultiLoad is very fast. It can process millions of rows in a few minutes.

· MultiLoad supports inserts, updates, upserts, deletes, and data-driven operations in PowerCenter.

· We can use variables and embed conditional logic in MultiLoad control files.

· MultiLoad supports sophisticated error recovery. It allows load jobs to be restarted without having to redo all of the prior work.


2.2.2. MultiLoad has the following limitations:

· MultiLoad is designed for the highest possible throughput, so it can be very resource intensive.

· It locks tables while loading data, preventing other users and other instances of MultiLoad from accessing the tables during data loading.

· Because of its "phased" nature, there are potentially inconvenient windows of time when MultiLoad cannot be stopped without losing access to the target tables.
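A MultiLoad job is also driven by a control file. As an illustrative sketch (the logon values, table names, and paths below are placeholder assumptions), an upsert job might look like:

```sql
/* Hypothetical MultiLoad script: upserts a pipe-delimited file into a
   populated table. All names and paths are placeholders. */
.LOGTABLE sandbox_db.customer_ml_log;
.LOGON tdprod/loader_user,loader_password;

.BEGIN IMPORT MLOAD TABLES sandbox_db.customer
    WORKTABLES  sandbox_db.customer_wt
    ERRORTABLES sandbox_db.customer_et sandbox_db.customer_uv;

.LAYOUT customer_layout;
.FIELD customer_id * VARCHAR(10);
.FIELD customer_nm * VARCHAR(50);

.DML LABEL upsert_customer
    DO INSERT FOR MISSING UPDATE ROWS;  /* turns the UPDATE into an upsert */
UPDATE sandbox_db.customer
   SET customer_nm = :customer_nm
 WHERE customer_id = :customer_id;
INSERT INTO sandbox_db.customer (customer_id, customer_nm)
VALUES (:customer_id, :customer_nm);

.IMPORT INFILE /data/customer.dat
    FORMAT VARTEXT '|'
    LAYOUT customer_layout
    APPLY upsert_customer;

.END MLOAD;
.LOGOFF;
```

The log, work, and error tables named here are the "intermediate" tables referred to later in this paper when discussing restarts.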

2.3. Teradata TPump

Teradata TPump is a highly parallel utility that can continuously move data from data sources into Teradata tables without locking the affected tables. TPump supports inserts, updates, deletes, and data-driven updates. TPump acquires row-hash locks on a database table instead of table-level locks, so multiple TPump instances can load data simultaneously to the same table. TPump is often used to "trickle-load" a database table. Use TPump for low-volume, online data loads.

2.3.1. TPump has the following advantages:

· TPump can refresh database tables in near real time.

· TPump continuously loads data into Teradata tables without locking the affected tables, so users can run queries while TPump is running.

· TPump is less resource intensive than MultiLoad because it does not write to temporary tables.

· Users can control the rate at which statements are sent to the Teradata database, limiting resource consumption.

· It supports parallel processing.

· TPump can always be stopped, and all of its locks dropped, with no ill effect.

Note, however, that TPump is not as fast as the other standalone loaders for large-volume loads because it changes the same data block multiple times.
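As a hedged sketch of how the rate control mentioned above is expressed (all names, credentials, and tuning values are placeholder assumptions), a TPump script that trickle-inserts a delimited file might look like:

```sql
/* Hypothetical TPump script: trickle-inserts rows using row-hash locks.
   SESSIONS, PACK, and RATE throttle resource consumption. */
.LOGTABLE sandbox_db.customer_tp_log;
.LOGON tdprod/loader_user,loader_password;

.BEGIN LOAD
    SESSIONS 4
    PACK 20          /* statements packed into each request buffer */
    RATE 1000        /* maximum statements sent per minute */
    ERRORTABLE sandbox_db.customer_tp_err;

.LAYOUT customer_layout;
.FIELD customer_id * VARCHAR(10);
.FIELD customer_nm * VARCHAR(50);

.DML LABEL ins_customer;
INSERT INTO sandbox_db.customer (customer_id, customer_nm)
VALUES (:customer_id, :customer_nm);

.IMPORT INFILE /data/customer.dat
    FORMAT VARTEXT '|'
    LAYOUT customer_layout
    APPLY ins_customer;

.END LOAD;
.LOGOFF;
```

The PACK and RATE values shown are arbitrary; they are the knobs that let users limit how hard TPump drives the database.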

2.4. Teradata FastExport

Teradata FastExport is a command-driven utility that uses multiple sessions to quickly transfer large amounts of data from Teradata sources to PowerCenter. Use FastExport to quickly extract data from Teradata sources.

2.4.1. FastExport has the following advantages:

· It is faster than Teradata relational connections when extracting large amounts of data.

· FastExport can be run in streaming mode, which avoids the need to stage the data file.


· We can encrypt the data transfer between FastExport and the Teradata server.

· FastExport is available for sources and pipeline lookups.
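For reference, a standalone FastExport run is driven by a script like the following sketch (logon values, names, and paths are placeholder assumptions). It unloads a query result to a flat file using multiple parallel sessions:

```sql
/* Hypothetical FastExport script: exports a query result to a flat
   file. All names and paths are placeholders. */
.LOGTABLE sandbox_db.customer_fx_log;
.LOGON tdprod/export_user,export_password;

.BEGIN EXPORT SESSIONS 8;

.EXPORT OUTFILE /data/customer_export.dat
    MODE RECORD FORMAT TEXT;

SELECT customer_id, customer_nm
FROM sandbox_db.customer
ORDER BY customer_id;   /* pre-sorted output can feed a sorted flat-file lookup */

.END EXPORT;
.LOGOFF;
```

In PowerCenter, the equivalent script is generated from the FastExport connection and session attributes described below.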

When we create a FastExport connection, verify the settings of the following connection attributes:

· Data encryption. Enable this attribute to encrypt the data transfer between FastExport and the Teradata server so that unauthorized users cannot access the data being transferred across the network.

· Fractional seconds. This attribute specifies the precision of the fractional portion of timestamp data. To avoid session failure or possible data corruption, make sure this value matches the timestamp precision of the column in the Teradata database. For date columns selected from Teradata into Informatica, we need to match the Teradata date format with the Informatica date format.

To use FastExport in a session, configure the mapping to extract from a Teradata source, configure the session to read from FastExport instead of a relational database, and select the FastExport connection for the session. When a session transfers data between Teradata and PowerCenter, the following files are created:

· A staging file or pipe. PowerCenter creates a staging file or named pipe for data transfer, based on how we configure the connection. Named pipes are generally faster than staging files because data is transferred as soon as it appears in the pipe. If we use a staging file, data is not transferred until all data appears in the file.

· A control file. PowerCenter generates a control file that contains instructions for loading or extracting data. PowerCenter creates the control file based on the loader or FastExport attributes we configure for the connection and the session.

· A log file. The load or unload utility creates a log file and writes error messages to it. The PowerCenter session log indicates whether the session ran successfully, but it does not contain load or unload utility error messages. Use the log file to debug problems that occur during data loading or extraction.


By default, loader staging, control, and log files are created in the target file directory. The FastExport staging, control, and log files are created in the PowerCenter temporary files directory.

All of these load and unload utilities are included in the Teradata Tools and Utilities (TTU), available from Teradata. PowerCenter supports all of these standalone load and unload utilities. Before we can configure a session to use a load or unload utility, create a loader or FastExport (application) connection in the PowerCenter Workflow Manager and enter a value for the TDPID in the connection attributes.

For more information about creating connection objects in PowerCenter, see APPENDIX A, APPENDIX B, and the PowerCenter Workflow Basics Guide.

3. Teradata Parallel Transporter

Teradata Parallel Transporter (Teradata PT, or TPT) is a client application that provides scalable, high-speed, parallel data extraction, loading, and updating. It uses and expands upon the functions and capabilities of the standalone Teradata load and unload utilities. Teradata PT supports a single scripting environment with different system operators for extracting and loading data. It also supports massively parallel extraction and loading, so if we partition a Teradata PT session, multiple Teradata PT instances can extract or load large amounts of data in the same database tables at the same time.

To provide the functionality of the standalone load and unload utilities, Teradata PT extracts or loads data using one of the following system operators:

· Export. Exports large data sets from Teradata tables or views and imports the data to a file or Informatica pipe using the FastExport protocol.

· Load. Bulk loads large volumes of data into empty Teradata database tables using the FastLoad protocol.

· Update. Batch updates, inserts, upserts, and deletes data in Teradata database tables using the MultiLoad protocol.

· Stream. Continuously updates, inserts, upserts, and deletes data in near real time using the TPump protocol.

Fig1. TPT operators for Informatica job
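When Teradata PT is scripted directly (outside PowerCenter), these operators are declared in a tbuild job script. The following is an illustrative sketch using the Load operator; all names, credentials, and paths are placeholder assumptions:

```sql
/* Hypothetical Teradata PT job script (run with tbuild): a
   DataConnector producer reads a delimited file and the Load operator
   applies it to an empty table via the FastLoad protocol. */
DEFINE JOB load_customer
DESCRIPTION 'Bulk load a flat file into an empty table'
(
    DEFINE SCHEMA customer_schema
    (
        customer_id VARCHAR(10),
        customer_nm VARCHAR(50)
    );

    DEFINE OPERATOR file_reader
    TYPE DATACONNECTOR PRODUCER
    SCHEMA customer_schema
    ATTRIBUTES
    (
        VARCHAR DirectoryPath = '/data/',
        VARCHAR FileName      = 'customer.dat',
        VARCHAR Format        = 'Delimited',
        VARCHAR TextDelimiter = '|'
    );

    DEFINE OPERATOR load_customer_op
    TYPE LOAD
    SCHEMA *
    ATTRIBUTES
    (
        VARCHAR TdpId        = 'tdprod',
        VARCHAR UserName     = 'loader_user',
        VARCHAR UserPassword = 'loader_password',
        VARCHAR TargetTable  = 'sandbox_db.customer'
    );

    APPLY
        ('INSERT INTO sandbox_db.customer (customer_id, customer_nm)
          VALUES (:customer_id, :customer_nm);')
    TO OPERATOR (load_customer_op)
    SELECT * FROM OPERATOR (file_reader);
);
```

Swapping TYPE LOAD for TYPE UPDATE, TYPE STREAM, or TYPE EXPORT selects the MultiLoad, TPump, or FastExport protocol respectively. With PowerExchange for Teradata Parallel Transporter, no such script is needed, because the operators are invoked through the API.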


3.1. Teradata PT advantages:

Teradata PT is up to 20% faster than the standalone Teradata load and unload utilities, even though it uses the underlying protocols from the standalone utilities.

Teradata PT supports recovery for sessions that use the Stream operator when the source data is repeatable. This feature is especially useful when running real-time sessions and streaming the changes to Teradata.

Users can invoke Teradata PT through a set of open APIs that communicate with the database directly, eliminating the need for a staging file or pipe and a control file.

Teradata PT includes added features; some are enabled by users, and some are improvements made behind the scenes. They include:

3.1.1. Parallel Reading:

Allowing multiple instances of a single operator to read from the same input file takes advantage of the file I/O cache monitored and maintained by the operating system (OS). As long as all instances run at the same relative speed, the OS will read a block from the file into memory, and all instances will access the same memory. Dramatic improvements in performance have been seen when using up to five or six instances reading from a single (very large) file.

3.1.2. Exporting of Data:

The normal operating procedure for consumer operators is for the first instance to get as much work as possible, provided it can keep up with the rate at which data arrives through the data streams. The default behavior is to write to only a single output file.

3.1.3. Parallel Writing:

If multiple instances are specified for the DataConnector operator (as a file writer) and each instance is directed to write to a different file, then there can be some parallelism in data output. However, when exporting data to flat files, this results in an uneven distribution of data across the output files.

3.1.4. Round-Robin Data Distribution:

This parallel writing feature allows switching to a round-robin fashion of file writing. This can evenly distribute data across multiple files, thereby improving the performance of writing to disk even more. That performance improvement can be further enhanced by placing each output file on a separate disk.

PowerCenter communicates with Teradata PT using PowerExchange for Teradata Parallel Transporter, which is available through the Informatica-Teradata Enterprise Data Warehousing Solution. PowerExchange for Teradata Parallel Transporter provides integration between PowerCenter and Teradata databases for data extraction and loading.

PowerExchange for Teradata Parallel Transporter executes Teradata PT operators directly through API calls [Fig2]. This improves performance by eliminating the staging file or named pipe. It also improves security by eliminating the control file, so there is no need to overwrite or store passwords in a control file. PowerExchange for Teradata Parallel Transporter supports session and workflow recovery. It also captures Teradata PT error messages and displays them in the session log, so we do not need to check the utility log file when errors occur. Before we can configure a session to use Teradata PT, we must create a Teradata PT (relational) connection in the Workflow Manager and enter a value for the TDPID in the connection attributes.

For more information about creating Teradata PT connection objects in PowerCenter, see APPENDIX A and APPENDIX B.

Fig2. Informatica communicates with Teradata using TPT API Interface


3.2. Selecting the Load Utility Based on Record Counts

For loading data into a table, we use different Teradata load utilities based on data volume. The table below suggests the best load utility to use in Informatica based on data volume.

Task | First Choice | Second Choice
Insert a small number of rows into an empty table | TPT-Load, FastLoad | TPump, TPT-Stream, or Relational from Informatica
Insert a large number of rows into an empty table | TPT-Load, FastLoad | TPump, TPT-Stream, or Relational from Informatica
Insert a small number of rows into a small populated table | TPT-Update, TPT-Stream, Relational, MultiLoad, or TPump | -
Insert a small number of rows into a large populated table | TPump, TPT-Stream, or Relational | -
Insert a moderate number of rows into a large populated table | TPT-Update, MultiLoad | TPT-Stream, Relational, TPump
Insert a large number of rows into multiple populated tables | TPT-Update, MultiLoad | TPump, TPT-Stream, or Relational
Insert a large number of rows into a large populated table | TPT-Update, MultiLoad | -
Update/Delete a large number of rows | TPT-Update, MultiLoad | Relational
Update/Delete a small number of rows | Relational | TPump or TPT-Stream

3.3. Using a TPT connection with the Load and Update operators in a session:

To configure a session to extract data, configure the associated mapping to read from Teradata, change the reader type for the session to Teradata Parallel Transporter Reader, and select the Teradata PT connection.

1. Under 'Writers', select 'Teradata Parallel Transporter Writer'.

2. Once we select the above option, two types of connections appear under 'Connections':

i. The first connection defines the connection to the Teradata PT API, i.e., the TPT connection. The PowerCenter Integration Service uses the Teradata PT API connection to extract data from or load data to Teradata.

ii. The second connection defines an optional ODBC connection to the target database. The PowerCenter Integration Service uses the target ODBC connection to drop log, error, and work tables, truncate target tables, and create recovery tables in the target database. The PowerCenter Integration Service does not use the ODBC connection to extract data from or load data to Teradata.


3. Select the appropriate TPT and ODBC connections for the two types, as shown below.

4. Make sure the following session properties are set when using TPT connections:

i. Specify the database and table name for the work table.

ii. The truncate table option can be used with the Load, Stream, and Update system operators.

5. We can select any of the following values for the 'Mark Missing Rows' option as required; this option specifies how Teradata PT API handles rows that are not present in the target table.

· Mark Missing Rows:

i. None - If Teradata PT API receives a row marked for update or delete that is missing in the target table, it does not mark the row in the error table.

ii. For Update - If Teradata PT API receives a row marked for update that is missing in the target table, it marks the row as an error row.

iii. For Delete - If Teradata PT API receives a row marked for delete that is missing in the target table, it marks the row as an error row.

iv. Both - If Teradata PT API receives a row marked for update or delete that is missing in the target table, it marks the row as an error row.

6. Similarly, we can select from the following values for the 'Mark Duplicate Rows' option.

Fig3. Selecting TPT connection type and connection name


· Mark Duplicate Rows:

i. None - If Teradata PT API receives a row marked for insert or update that causes a duplicate row in the target table, it does not mark the row in the error table.

ii. For Insert - If Teradata PT API receives a row marked for insert that already exists in the target table, it marks the row as an error row.

iii. For Update - If Teradata PT API receives a row marked for update that causes a duplicate row in the target table, it marks the row as an error row.

iv. Both - If Teradata PT API receives a row marked for insert or update that causes a duplicate row in the target table, it marks the row as an error row.

7. Specify the database name and the table name for the log table to be created.

8. Specify the database name and the table name for the error tables to be created.

9. The drop work/log/error tables option is available. Check this option when we wish to drop the intermediate tables that are created during the session run.

10. The Pack, Pack Minimum, and Buffers options are used with the Stream system operator.

11. Tracing options can be selected as needed. Usually we use the default values.

Fig4. Selecting TPT (Load or update) connection Options


3.4. TPT connection with the Export operator:

A connection that uses the Export operator can be used only as a source connection. The image below shows how to select the TPT Export operator for Teradata source tables in an Informatica session.

3.5. TPT connection using the Stream system operator:

The Stream operator uses macros to modify tables, so a macro database has to be specified in the session properties when using this type of connection.

3.5.1. Macro Database: the name of the database that stores the macros Teradata PT API creates when we select the Stream system operator. The Stream system operator uses macros to modify tables. It creates the macros before Teradata PT API begins loading data and removes them from the database after Teradata PT API loads all rows to the target.

Fig5. Selecting TPT Export connection for Source


3.6. Restart in case of failure:

3.6.1. Restart for system operators Update, Export, or Stream

If a session that uses the Update, Export, or Stream operator fails, the session can be rerun successfully once all the intermediate tables are dropped.

3.6.2. Restart for system operator Load

If a session that uses the Load system operator fails, follow the steps below to rerun the session successfully. There is no need to drop and recreate the target table for the Load operator; instead, a simple checkout of the session and unchecking the following options in the session properties is sufficient to rerun the failed TPT job:

· Drop Error/Work/Log Tables

· Truncate Table option

Also, there is no need to drop the intermediate tables, and no manual intervention is required on the database side. Once the session reruns successfully after the failure, keep in mind that the session has to be 'Undo checkout' for future runs.

3.7. Comparison of loading a Teradata table using Informatica with a Teradata relational connection and a Teradata TPT connection

The Teradata RDBMS offers several tools to load external data into the database. The Teradata load utilities can be divided into two main groups: the first group makes use of the transient journal, so loading is slower, while the second group is designed to bring the data into the target tables as fast as possible by bypassing the transient journal.

The first group includes BTEQ, TPump, and relational/native connections. These insert the data record by record into the target table, allowing full transaction handling and rollback functionality. They are still quite useful for loading small amounts of data, but they are missing some important features, such as the use of several parallel load sessions.


The image in Fig6 shows the Informatica Workflow Monitor load statistics for a workflow in which the source and target are Teradata tables and relational connections are used. The relational connection uses the transient journal and loads the data record by record into the target table, so the target loading throughput is low: it took around 5 hours to load 1.7 million records.

The second group of utilities includes the Teradata standalone load and unload utilities. These are the fastest way of getting external data into Teradata. They bypass transaction handling, and data is loaded in big chunks, namely blocks of 64 kilobytes. Bypassing the transaction log (transient journal) brings several restrictions:

· They do not support target tables with unique secondary indexes (USI), join indexes (JI), referential integrity (RI), or triggers. FastLoad goes a step further and requires the target table to be empty.

· In the case of a SET target table, a duplicate-row check would be required. FastLoad by default removes record-level duplicates.

· MultiLoad, the second tool in the set of fast loading utilities, can take on additional tasks such as UPDATE, DELETE, and UPSERT, again all at the block level. MultiLoad uses temporary work tables for this purpose and does not reach the load performance of FastLoad, but it is still a very good choice for getting huge amounts of data in very fast.

TPT integrates with the infrastructure in a plug-in fashion to provide the functionality of the Teradata standalone load and unload utilities.

Fig6. Session load statistics with Teradata relational connection


The image in Fig7 shows the Informatica Workflow Monitor load statistics for the same workflow using a TPT connection. The TPT operators load data at the block level, so the target loading throughput is huge: it took only a few seconds to load the 1.7 million records.

Each group has its own advantages and disadvantages, depending on the load requirements. The decision about which tool to use for loading depends on the number of data records and the number of bytes each data record consumes.

3.8. Loading the Teradata target table using Informatica update-else-insert logic with a TPT connection

In an Informatica session, we can update the target table against the source table data. Generally, we use two flows for insert-else-update operations: one flow for bulk insert and another for update.

The TPT Update operator (MultiLoad) locks the target table to load the data. When we use two or more target instances in a single session, TPT locks the target table while loading the data through one target instance; when the session then tries to load data through another target instance, the second MultiLoad job tries to lock the same target table, which is already locked by the insert instance, so the session fails with the TPT error:

"Error Code 2652: Operation not allowed - Table is being mloaded"

To avoid this kind of error, use the Informatica Update else Insert option from the session properties and treat source rows as update. We update the record based on key columns, and we do not update some fields, such as Create_Timestamp. If we want to use a single instance to insert and update records, then we need to update the record with the previous (target) values for the fields we do not want to overwrite with new values, e.g., Create_Timestamp.

Fig7. Session load statistics with Teradata TPT connection


3.9. Loading a PPI-indexed Teradata table with Informatica using a TPT connection

Whenever we need to load and select incremental data from a huge table, we can use MultiLoad jobs to load the data and create an index to select the data. But when we create an index such as a secondary index or join index, the Teradata bulk load utilities do not work. We can overcome this issue by using a partitioned primary index (PPI) on the table; Teradata MultiLoad jobs work on partitioned primary indexed tables.

For a MultiLoad job, the table should contain the primary index for updating the table records, and we need another index on the table for fast selection of incremental data. We can use the PI for the MultiLoad job and the PPI for selecting incremental data. For partitioned primary index tables, Teradata MultiLoad tasks require all values of the primary index column set and all values of the partitioning column set for deletes and updates. The table creation statement will be:

CREATE TABLE <TABLE>
(
Primary_key_field INTEGER,
Field_1 INTEGER,
Field_2 VARCHAR(10),
CREATE_DT TIMESTAMP(0),
UPDATE_DT TIMESTAMP(0))
PRIMARY INDEX (Primary_key_field, UPDATE_DT)
PARTITION BY (
RANGE_N (UPDATE_DT BETWEEN DATE '2014-01-01' AND DATE '2020-12-31' EACH
INTERVAL '1' DAY, NO RANGE, UNKNOWN));

We will load data with CREATE_DT and UPDATE_DT set to current_date for inserted records; for updated records, UPDATE_DT will be updated with current_date. We will select the incremental data using:

SELECT * FROM <TABLE> WHERE UPDATE_DT = current_date

This will select both inserted and updated records for the day.
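To make the update/delete requirement concrete, a sketch of an update against such a table follows (the host variables are placeholders): both the full primary index column set and the partitioning column must appear in the WHERE clause.

```sql
/* Sketch: an update against a PPI table. Both the primary index
   column set and the partitioning column (UPDATE_DT) must be fully
   specified so MultiLoad can locate the row's partition. */
UPDATE <TABLE>
   SET Field_1 = :Field_1,
       Field_2 = :Field_2
 WHERE Primary_key_field = :Primary_key_field
   AND UPDATE_DT = :UPDATE_DT;
```

Without the partitioning column in the WHERE clause, the MultiLoad task cannot be applied to the partitioned table.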

4. Issues Affecting Loading to and Unloading from Teradata

This section describes issues we might encounter when moving data between PowerCenter and Teradata.

4.1. Increasing Lookup Performance

Applies to: Teradata relational connections, FastExport


Sessions that perform lookups on Teradata tables must use Teradata relational connections. If we experience performance problems when running a session that performs lookups against a Teradata database, we might be able to increase performance in the following ways:

· Use FastExport to extract the data to a flat file and perform the lookup on the flat file.

· Enable or disable the lookup cache.

4.2. Using FastExport to Extract Lookup Data

If a session performs a lookup on a large, static Teradata table, we might be able

to increase performance by using FastExport to extract the data to a flat file and

configuring the session to look up data in the flat file. To do this, redesign the

mapping as follows:

1. Create a simple, pass-through mapping to pass the lookup data to a flat file.

Configure the session to extract data to the flat file using FastExport.

2. Configure the original mapping to perform the lookup on the flat file.

Note: If we redesign the mapping using this procedure, we can further increase performance by specifying an ORDER BY clause in the FastExport SQL and enabling the Sorted Input property for the lookup file. This prevents PowerCenter from having to sort the file before populating the lookup cache.
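For example, the pass-through mapping's FastExport SQL might sort the extract by the lookup key. The table and column names here are illustrative assumptions, not part of any shipped schema.

```sql
-- Extract the lookup data sorted by the lookup key so the resulting
-- flat file can be used with the Sorted Input property enabled.
SELECT customer_id, customer_name, region
FROM   lookup_customer
ORDER BY customer_id;
```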

4.3. Restarting a Failed MultiLoad Job Manually

Applies to: MultiLoad

When loading data, MultiLoad puts the target table into the “MultiLoad” state and

creates a log table for the target table. After successfully loading the data, it

returns the target table to the normal (non-MultiLoad) state and deletes the log

table. When we load data using MultiLoad and the MultiLoad job fails for any reason, MultiLoad reports an error and leaves the target table in the MultiLoad state. Additionally, MultiLoad queries the log table to check for errors. If a target table is in the MultiLoad state or a log table exists for the target table, we cannot restart the job.

To recover from a failed MultiLoad job, we must release the target table from the MultiLoad state and drop the MultiLoad log, error, and work tables.

To release the target table from the MultiLoad state:

RELEASE MLOAD <target table>;
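A full manual recovery might look like the following sketch. The log, error, and work table names below are assumptions based on common MultiLoad naming conventions; substitute the names our job actually uses (they appear in the session log).

```sql
-- Return the target table to the normal (non-MultiLoad) state.
RELEASE MLOAD <target table>;

-- Drop the restart log, error tables, and work table left behind by the job.
DROP TABLE ML_<target table>;   -- restart log table (name is an assumption)
DROP TABLE ET_<target table>;   -- acquisition phase error table
DROP TABLE UV_<target table>;   -- application phase error table
DROP TABLE WT_<target table>;   -- work table
```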

4.4. Configuring Sessions that Load to the Same Table

Applies to: MultiLoad


While Teradata MultiLoad loads data to a database table, it locks the table.

MultiLoad requires that all instances handle wait events so they do not try to

access the same table simultaneously.

If we have multiple PowerCenter sessions that load to the same Teradata table

using MultiLoad, set the Tenacity attribute for the session to a value that is

greater than the expected run time of the session. The Tenacity attribute controls

the amount of time a MultiLoad instance waits for the table to become available.

Also configure each session to use unique log file names.

For more information about the Tenacity attribute, see the PowerCenter Advanced Workflow Guide.

4.5. Loading from Partitioned Sessions

Applies to: FastLoad, MultiLoad

When we configure multiple partitions in a session that uses staging files, the

Integration Service creates a separate flat file for each partition. Since FastLoad

and MultiLoad cannot load data from multiple files, use round-robin partitioning

to route the data to a single file. When we do this, the Integration Service writes

all data to the first partition and starts only one instance of FastLoad or

MultiLoad. It writes the following message in the session log:

MAPPING> DBG_21684 Target [TD_INVENTORY] does not support multiple partitions. All

data will be routed to the first partition.

If we do not route the data to a single file, the session fails with the following error:

WRITER_1_*_1> WRT_8240 Error: The external loader [Teradata Mload Loader] does not

support partitioned sessions.

WRITER_1_*_1> Thu Jun 16 11:58:21 2005

WRITER_1_*_1> WRT_8068 Writer initialization failed. Writer terminating.

For more information about loading from partitioned sessions, see the

PowerCenter Advanced Workflow Guide.

4.6. Using Error Tables to Identify Problems during Loading

Applies to: FastLoad, MultiLoad, TPump

When problems occur during loading data, the Teradata standalone load utilities

generate error tables. (FastExport generates an error log file.) The load utilities

generate different errors during the different phases of loading data.

FastLoad jobs run in two main phases: loading and end loading. During the

loading phase, FastLoad initiates the job, locks the target table, and loads the

data. During the end loading phase, the Teradata database distributes the rows


of data to the target table and unlocks it. FastLoad requires an exclusive lock on

the target table during the loading Phase.

MultiLoad also loads data during two main phases: acquisition and application.

In the acquisition phase, MultiLoad reads the input data and writes it to a

temporary work table. In the application phase, MultiLoad writes the data from

the work table to the actual target table. MultiLoad requires an exclusive lock on

the target table during the application phase.

TPump loads data in a single phase. It converts the SQL in the control file into a

database macro and applies the macro to the input data. TPump uses standard

SQL and standard table locking.

The following table lists the error tables we can check to troubleshoot load or unload utility errors:

FastLoad
· Loading phase, default error table INFA_ET1_<SEQ_no>: constraint violations, conversion errors, unavailable AMP conditions
· End loading phase, default error table INFA_ET2_<SEQ_no>: unique primary index violations

MultiLoad
· Acquisition phase, default error table INFA_ET1_<SEQ_no>: all acquisition phase errors, and application phase errors if the Teradata database cannot build a valid primary index
· Application phase, default error table INFA_ET2_<SEQ_no>: uniqueness violations, field overflow on columns other than primary index fields, constraint errors

TPump
· Single phase, default error table INFA_ET_<SEQ_no>: all TPump errors

When a load fails, check the "ET1_" error table first for specific information. The Error Field or Error Field Name column indicates the column in the target table that could not be loaded. The Error Code field provides details that explain why the column could not be loaded, for example:

· 2689: trying to load a null value into a non-nullable field

· 2665: invalid date format

In the MultiLoad "ET2_" error table, we can also check the DBC Error Field column and the DBC Error Code field. The DBC Error Field column is not initialized in the case of primary key uniqueness violations. The DBC Error Code that corresponds to a primary key uniqueness violation is 2794.
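A quick way to see what went wrong is to summarize the error table by error code and offending column. The error table name below is an illustrative assumption; substitute the actual INFA_ET1_<SEQ_no> name reported in the session log.

```sql
-- Summarize load errors by code and offending column.
SELECT ErrorCode, ErrorFieldName, COUNT(*) AS error_rows
FROM   INFA_ET1_0001          -- assumed error table name
GROUP BY ErrorCode, ErrorFieldName
ORDER BY error_rows DESC;
```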


5. Conclusion

Teradata Parallel Transporter provides flexible, high-performance load and unload

facilities that enable population of the data warehouse. It supports many traditional

functions of Teradata utilities, as well as many that are new and advanced. Through its

parallel execution design, Teradata Parallel Transporter improves ETL performance

while providing a uniform interface. It also facilitates active data warehousing by

allowing continuous and scalable access to data sources and targets, keeping the data fresh while maximizing throughput. In all, Teradata Parallel Transporter drives

business value by enabling real-time decision making as required by enterprise-wide

business intelligence (BI) applications.

APPENDIX-A

Before we run sessions that move data between PowerCenter and Teradata, we might

want to install Teradata client tools. We also need to locate the Teradata TDPID.

1. Teradata Client Tools

Teradata client tools help us communicate with the Teradata database and debug problems that occur when a session loads data to or extracts data from the Teradata database. We can install the following Teradata client tools:

· BTEQ. A general-purpose, command-line utility (similar to Oracle SQL*Plus) that enables us to communicate with one or more Teradata databases.

· Teradata SQL Assistant. A GUI-based tool that retrieves data from any ODBC-compliant database server and lets us manipulate and store the data in desktop applications. Teradata Queryman is the older version of this tool.

Install BTEQ or Teradata SQL Assistant to help us debug problems that occur when loading to and extracting from Teradata. Both tools are included in the Teradata Utility Pack, which is available from Teradata.

2. TDPID

The Teradata TDPID indicates the name of the Teradata instance and defines the name a client uses to connect to a server. When we use Teradata Parallel Transporter or a standalone load or unload utility with PowerCenter, we must specify the TDPID in the connection properties. The Teradata TDPID appears in the hosts file


on the machines on which the Integration Service and PowerCenter Client run. By

default, the hosts file appears in the following location:

UNIX: /etc/hosts

Windows: %SystemRoot%\system32\drivers\etc\hosts*

* The actual location is defined in the registry key HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\DataBasePath

The hosts file contains client configuration information for Teradata. In a hosts file entry, the TDPID precedes the string "cop1."

For example, the hosts file contains the following entries:

127.0.0.1       localhost demo1099cop1
192.168.80.113  td_1 custcop1
192.168.80.114  td_2 custcop2
192.168.80.115  td_3 custcop3
192.168.80.116  td_4 custcop4

The first entry has the TDPID "demo1099." This entry tells the Teradata database that when a client tool references the Teradata instance "demo1099," it should direct requests to "localhost" (IP address 127.0.0.1). The following entries have the same TDPID, "cust."

Multiple hosts file entries with the same TDPID indicate the Teradata instance is

configured for load balancing among nodes. When a client tool attempts to reference

Teradata instance “cust,” the Teradata database directs requests to the first node in the

entry list, “td_1.” If it takes too long for the node to respond, the database redirects the

request to the second node, and so on. This process prevents the first node, "td_1," from becoming overloaded.
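Since the TDPID is simply the token that precedes "cop1" in a hosts entry, we can list the candidate TDPIDs on a machine with a small shell sketch. The sample file below mirrors the entries shown above; in practice, point the pipeline at the real /etc/hosts.

```shell
# Write a sample hosts file (stand-in for /etc/hosts).
cat <<'EOF' > /tmp/hosts.sample
127.0.0.1 localhost demo1099cop1
192.168.80.113 td_1 custcop1
192.168.80.114 td_2 custcop2
EOF

# Extract each "<tdpid>copN" token and strip the copN suffix,
# leaving one unique TDPID per line.
grep -oE '[A-Za-z0-9_]+cop[0-9]+' /tmp/hosts.sample \
  | sed -E 's/cop[0-9]+$//' \
  | sort -u
```

Running this against the sample prints the TDPIDs `cust` and `demo1099`, one per line.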

APPENDIX-B

a. Teradata Connections

Teradata relational connections use ODBC to connect to Teradata. PowerCenter

uses the ODBC Driver for Teradata to retrieve metadata and read and write to

Teradata. To establish ODBC connectivity between Teradata and PowerCenter,

install the ODBC Driver for Teradata on each PowerCenter machine that

communicates with Teradata. The ODBC Driver for Teradata is included in the

Teradata Tools and Utilities (TTU). We can download the driver from the Teradata


web site. PowerCenter works with the ODBC Driver for Teradata available in the

following TTU versions:

Informatica Version   Teradata Version(s)
9.5                   TD 14.0, TD 13.10
9.1                   TD 13.10, TD 13.0, TD 12.0
9.0.1                 TD 13.10, TD 13.0, TD 12.0
8.6.1                 TD 13.0, TD 12.0
8.6.0                 TD 12.0
8.1.1 SP1-SP5         TD 12.0

To load or extract data using a Teradata relational connection on UNIX, we must

verify the configuration of environment variables and the odbc.ini file on the

machine on which the Integration Service runs. To verify the environment variable

configuration, ensure the Teradata ODBC path precedes the Data Direct driver

path information in the PATH and shared library path environment variables. Place

the Teradata path before the Data Direct path because both sets of ODBC software

use some of the same file names. To verify the odbc.ini file configuration, make

sure there is an entry for the Teradata ODBC driver in the “[ODBC Data Sources]”

section of odbc.ini. The following excerpt from an odbc.ini file shows a Teradata

ODBC driver (tdata.so) entry on Linux:

[ODBC Data Sources]
intdv14=tdata.so

[intdv14]
Driver=/usr/odbc/drivers/tdata.so
Description=running Teradata V14
DBCName=intdv14
SessionMode=Teradata
CharacterSet=UTF8
StCheckLevel=0
DateTimeFormat=AAA
LastUser=
Username=
Password=
Database=
DefaultDatabase=
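The path-ordering requirement above can be spot-checked with a small shell sketch. The directory names here are assumptions standing in for the actual Teradata ODBC and DataDirect install locations; in practice, substitute the real directories and inspect PATH or the shared library path instead of the sample string.

```shell
# Sample library path with an assumed Teradata ODBC dir and DataDirect dir.
LIB_PATH="/usr/odbc/lib:/opt/datadirect/lib:/usr/lib"

# Position (1-based) of each directory within the colon-separated path.
td_pos=$(echo "$LIB_PATH" | tr ':' '\n' | grep -n '/usr/odbc' | cut -d: -f1)
dd_pos=$(echo "$LIB_PATH" | tr ':' '\n' | grep -n 'datadirect' | cut -d: -f1)

# The Teradata entry must come first, because both ODBC packages
# ship files with the same names.
if [ "$td_pos" -lt "$dd_pos" ]; then
  echo "OK: Teradata ODBC precedes DataDirect"
else
  echo "WARNING: reorder the library path"
fi
```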


ODBC is a native interface for Teradata. Teradata provides 32- and 64-bit ODBC

drivers for Windows and UNIX platforms. The driver bit mode must be compatible with the bit mode of the platform on which the PowerCenter Integration Service runs. For example, 32-bit PowerCenter runs only with 32-bit drivers.

For more information about configuring odbc.ini, see the PowerCenter Configuration

Guide and the ODBC Driver for Teradata User Guide.

b. Creating a Teradata Relational Connection

When we create a Teradata (relational) connection object in the Workflow Manager, choose "Teradata," not "ODBC," as the connection type in the connection properties.

When we choose Teradata as the connection type, the Integration Service still uses

Teradata ODBC to connect to Teradata. Although both ODBC and Teradata connection

types might work, the Integration Service communicates with the Teradata database

more efficiently when we choose the Teradata connection type.

c. Creating a Teradata Parallel Transporter Connection

When we create a Teradata (TPT) connection object in the Workflow Manager, choose "Teradata PT Connection" as the connection type. The Teradata PT connection prompts for connection attributes.

1. In Connections → Relational, select 'Teradata PT Connection' as the Type, and click New.

Fig8. Teradata relational connection creation


2. Configure the connection as required, as shown in the screen shot below. Here we can specify the TDPID, the database name, and the system operator that we need for a specific connection.

3. The System Operator drop-down lists all four operators, as shown in the screen shot:

Fig9. Teradata TPT connection creation

Fig10. Selecting the TPT connection System operator