

Transformer Looping Functions for Pivoting the Data

Convert a single row into multiple rows using the Transformer Looping function (pivoting of data using the parallel Transformer in DataStage 8.5, 8.7 and 9.1).

Refer to this link for more details: Looping Concept in Datastage

Now you can argue that this is possible using a Pivot stage, but for the sake of this article let's try doing it using a Transformer!

Below is a screenshot of our input data. We are going to read the above data from a sequential file and transform it to look like this:


So let's get to the job design.

Step 1: Read the input data.

Step 2: Logic for Looping in Transformer Properties

In the adjacent image you can see a new box called Loop Condition. This is where we are going to control the loop variables.

Below is the screenshot when we expand the Loop Condition box.


The Loop While constraint is used to implement functionality similar to a WHILE statement in programming. So, as with a while statement, we need a condition that determines how many times the loop is supposed to execute.

To achieve this, the @ITERATION system variable was introduced.

In our example we need to loop over the data 3 times to get the column data onto subsequent rows, so let's set the condition to @ITERATION <= 3.

Now create a new loop variable with the name LoopName. The derivation for this loop variable should be:

If @ITERATION = 1 Then DSLink2.Name1
Else If @ITERATION = 2 Then DSLink2.Name2
Else DSLink2.Name3

Below is a screenshot illustrating the same. Now all we have to do is map this loop variable, LoopName, to our output column Name.
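For readers who want to see the logic outside DataStage, here is a minimal Python sketch of the same pivot. The column names (Dept, Name1, Name2, Name3) and the sample values are assumptions for illustration only, not the actual data from the screenshots.

# Minimal sketch of the Transformer looping logic in Python (illustrative only).
# Assumed input layout: one row per department with three name columns.
input_rows = [
    {"Dept": "IT", "Name1": "Adam", "Name2": "Ben", "Name3": "Carl"},   # hypothetical data
    {"Dept": "HR", "Name1": "Dina", "Name2": "Eva", "Name3": "Finn"},
]

output_rows = []
for row in input_rows:
    # Equivalent of the Loop While condition @ITERATION <= 3
    for iteration in (1, 2, 3):
        # Equivalent of the LoopName derivation:
        # If @ITERATION=1 Then Name1 Else If @ITERATION=2 Then Name2 Else Name3
        loop_name = row["Name%d" % iteration]
        output_rows.append({"Dept": row["Dept"], "Name": loop_name})

for out in output_rows:
    print(out)   # one output row per (Dept, Name) pair, i.e. three rows per input row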


Let's map the output to a sequential file stage and see if the output is as desired. After running the job, we did a View Data on the output stage, and here is the data, as desired.

Making some tweaks to the above design, we can implement things like:

1. Adding new rows to existing rows
2. Splitting data in a single column into multiple rows, and much more.

Posted by Devendra Kumar Yadav at 4:37 AM


Partitioning Considerations for Best Performance of Datastage Jobs

This blog gives you complete details on how we can improve the performance of DataStage parallel jobs using appropriate partitioning methods.

Refer to these links as well: 1. Datastage Partitioning Methods and Use 2. Datastage Jobs Performance Improvement Tips1 3. Datastage Performance Tuning Tips

1.0 Partitioning considerations:

• Choose a partition method that makes the number of rows per partition as close to equal as possible. This evens out the processing workload and thereby improves the overall run time.

• Any stage that processes a group of related records must be partitioned using a keyed partitioning technique (e.g. the Aggregator, Remove Duplicates, Change Capture, Change Apply, Join and Merge stages, as well as Transformers that process groups of related records).

• Minimize repartitioning, as it decreases performance unless the partition distribution is highly skewed. Repartitioning incurs network-transport overhead, and the even distribution of data among partitions is also disturbed.

• Specify hash partitioning for stages that require processing of groups of related records. Partitioning keys should include only those key columns that are necessary for proper grouping.

• If the grouping is on a single integer key column, go for modulus partitioning on the same key column.

• If the data is highly skewed and the key column values and distribution will not change significantly over time, use the range partitioning technique.

• Use round robin partitioning to distribute data evenly across all partitions (if grouping is not needed). This is strongly suggested when the input data is sequential or very skewed.

• Same partitioning requires minimum resources and can be used to optimize a job and to eliminate repartitioning of already partitioned data. (A rough sketch of the main partitioning rules follows this list.)
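As a rough illustration of how these partitioning methods differ, the Python sketch below assigns rows to partitions using hash, modulus and round robin rules. This is only a conceptual model of what the engine does, not DataStage code, and the column names are hypothetical.

from itertools import count

NUM_PARTITIONS = 2  # think of this as the number of nodes in the configuration file

def hash_partition(key_value):
    # Keyed partitioning: all rows with the same key land in the same partition
    # (consistent within a run, which is what grouping stages need).
    return hash(str(key_value)) % NUM_PARTITIONS

def modulus_partition(int_key):
    # Modulus partitioning: only applicable to a single integer key column.
    return int_key % NUM_PARTITIONS

_rr = count()
def round_robin_partition():
    # Round robin: even distribution, but no grouping guarantee.
    return next(_rr) % NUM_PARTITIONS

rows = [{"cust_id": 101, "region": "EAST"},
        {"cust_id": 102, "region": "WEST"},
        {"cust_id": 101, "region": "EAST"}]

for row in rows:
    print(row["cust_id"],
          "hash ->", hash_partition(row["region"]),
          "modulus ->", modulus_partition(row["cust_id"]),
          "round robin ->", round_robin_partition())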


• When the input data set is sorted in parallel, use the sort merge collector, which produces a single sorted stream of rows. (A small sketch of sort merge collection follows this list.)

• When the input data set is sorted in parallel and range partitioned, the ordered collector method is preferred for collection.

• For a round robin partitioned input data set, use the round robin collector to reconstruct the rows in input order, as long as the data set has not been repartitioned or reduced.

• Minimize the use of sorts in a job.
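A sort merge collector can be pictured as a k-way merge of partitions that are already sorted on the collecting key. This hedged Python sketch uses heapq.merge to show the idea; the partition contents are made up for illustration.

import heapq

# Each partition is assumed to be already sorted on the collecting key.
partition_0 = [1, 4, 7, 10]
partition_1 = [2, 5, 8]
partition_2 = [3, 6, 9]

# Sort merge collection: interleave the sorted partitions into one sorted stream
# without re-sorting the whole data set.
collected = list(heapq.merge(partition_0, partition_1, partition_2))
print(collected)   # [1, 2, 3, ..., 10]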

Figure: Partitioning tab in a Datastage stage properties


Posted by Devendra Kumar Yadav at 12:22 AM

Datastage Jobs Best Practices and Performance Tuning

This blog gives you complete details on how we can improve the performance of DataStage parallel jobs and the best practices we have to follow while creating DataStage jobs.

This blog will help you with the following topics.

1. Performance Tuning Guidelines

1.1 General Job Design

1.2 Transformer Stage

1.3 Data grouping Stages

1.4 ODBC Stages

Refer to this link as well: Parallel Job Performance Tuning Tips1

1.0 Performance Tuning Guidelines

1.1 General Job Design

• Develop jobs using a modular development approach. Large jobs can be broken down into smaller modules, which helps improve performance.

• In scenarios where the same data (a huge number of records) is to be shared among more than one job in the same project, use the Data Set stage approach instead of re-reading the same data again.

• Eliminate unused columns.


• Eliminate unused references.

• If the input file has a huge number of records and the business logic allows splitting up the data, run the job in parallel for a significant improvement in performance.

1.2 Transformer stage

• Use the parallel Transformer stage instead of the Filter/Switch stages. (Filter/Switch stages take more resources to execute; for example, in the case of the Filter stage the WHERE clause is evaluated at run time, which requires more resources and thereby degrades job performance.)


Figure: Example of using a Transformer stage instead of a Filter stage. The filter condition is given in the constraint section of the Transformer stage properties.

• Use the BuildOp stage only when the required logic cannot be implemented using the parallel Transformer stage.

• Avoid calling routines in derivations in the Transformer stage; implement the logic directly in the derivation. This avoids the overhead of a procedure call.


• Implement the logic using stage variables and reference these stage variables in the derivations. During processing, execution starts with the stage variables, then the constraints, and then the individual columns. If there is a prerequisite formula that is needed by both the constraints and several column derivations, define it in a stage variable so that it is evaluated once and reused, instead of being repeated in multiple places. If the formula has to be modified for each and every row, it is advisable to place the code at the record (derivation) level rather than the stage-variable level.

Figure: Example of defining stage variables and using them in the derivations.
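The idea can be sketched in Python: a "stage variable" is just a value computed once per row and reused by the constraint and by several column derivations, instead of repeating the expression. The names and the sample formula below are assumptions for illustration.

def process_row(row):
    # "Stage variable": evaluated once per row, reused below.
    net_amount = row["qty"] * row["price"] * (1 - row["discount"])

    # "Constraint": decides whether the row goes to the output link.
    if net_amount <= 0:
        return None

    # "Column derivations": both reuse the precomputed stage variable
    # instead of repeating qty * price * (1 - discount) in each derivation.
    return {
        "net_amount": net_amount,
        "tax": net_amount * 0.18,
    }

print(process_row({"qty": 3, "price": 100.0, "discount": 0.1}))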


1.3 Data grouping stages

When dealing with stages like the Aggregator, Filter, etc., always try to use sorted data for better performance.

Figure: Sorting the input data on the grouping keys in an aggregator stage

The example shown in the figure is the properties window for an Aggregator stage that computes the sum of a quantity column by grouping on the columns shown above. In such scenarios, we sort the input data on the same columns so that records with the same values for these grouping columns come together, thereby increasing performance. Also note that if we are using more than one node, the input data set should be properly partitioned so that similar records land on the same node.
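The benefit of sorted, key-partitioned input to an Aggregator can be sketched as follows: once the rows arrive sorted on the grouping keys, the sum can be produced group by group without holding the whole data set in memory. The column names below are hypothetical.

from itertools import groupby
from operator import itemgetter

rows = [
    {"store": "A", "item": "X", "qty": 5},
    {"store": "A", "item": "X", "qty": 2},
    {"store": "B", "item": "Y", "qty": 7},
]

# Sort on the grouping keys first (what the Sort stage / link sort does),
# then aggregate group by group (what a "sorted" Aggregator can do cheaply).
rows.sort(key=itemgetter("store", "item"))
for key, group in groupby(rows, key=itemgetter("store", "item")):
    print(key, "sum(qty) =", sum(r["qty"] for r in group))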


1.4 ODBC Stages

• If possible, sort the data in the ODBC stage itself; this reduces the overhead of DataStage sorting the data. Don't use a Sort stage when there is an ORDER BY clause in the ODBC SQL.

• Select only the required records, or remove unwanted rows as early as possible, so that the job does not have to deal with unnecessary records that degrade performance.

• Using a constraint to filter records is much slower than having a SELECT ... WHERE in the ODBC stage. Use the power of the database wherever possible and reduce the overhead for DataStage.


Figure: Using the User-defined SQL option in ODBC stages to reduce the overhead on DataStage by specifying the WHERE and ORDER BY clauses in the SQL used to get the data.
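The point about pushing WHERE and ORDER BY into the database can be illustrated with a small Python sketch, using the standard sqlite3 module as a stand-in for an ODBC connection; the table and column names are made up. The first query does the filtering and sorting in the database, while the second pulls everything and filters inside the job, which is the slower pattern described above.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, status TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, "OPEN", 10.0), (2, "CLOSED", 20.0), (3, "OPEN", 5.0)])

# Preferred: let the database filter and sort (user-defined SQL with WHERE / ORDER BY).
pushed_down = conn.execute(
    "SELECT order_id, amount FROM orders WHERE status = 'OPEN' ORDER BY order_id"
).fetchall()

# Slower pattern: read everything and filter/sort inside the job (constraint + Sort stage).
all_rows = conn.execute("SELECT order_id, status, amount FROM orders").fetchall()
in_job = sorted((oid, amt) for (oid, st, amt) in all_rows if st == "OPEN")

print(pushed_down == in_job)   # same result, but the first keeps the work in the database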

• Avoid using the LIKE operator in user-defined queries in ODBC stages. One thing to note, however, is that if our custom SQL genuinely has to filter on a string pattern, we will be forced to use LIKE to get the requirement done.

• Avoid using stored procedures unless the functionality cannot be implemented in DataStage jobs.

Posted by Devendra Kumar Yadav at 12:07 AM

TUESDAY, OCTOBER 22, 2013

Know about the Conductor Node, Section Leaders and Player Processes in Datastage

Refer to this link as well for more details: Job Run Time Architecture

Jobs developed with DataStage Enterprise Edition (EE) are independent of the actual hardware and degree of parallelism used to run the job. The parallel configuration file provides a mapping at runtime between the job and the actual runtime infrastructure and resources by defining logical processing nodes.

To facilitate scalability across the boundaries of a single server, and to maintain platform independence, the parallel framework uses a multi-process architecture.

The runtime architecture of the parallel framework uses a process-based architecture that enables scalability beyond server boundaries while avoiding platform-dependent threading calls. The actual runtime deployment for a given job design is composed of a hierarchical relationship of operating system processes, running on one or more physical servers.

Section Leaders (one per logical processing node): used to create and manage player processes which perform the actual job execution. The Section Leaders also manage communication between the individual player processes and the master Conductor Node.


Players: one or more logical groups of processes used to execute the data flow logic. All players are created as groups on the same server as their managing Section Leader process.

Conductor Node (one per job): the main process used to start up jobs, determine resource assignments, and create Section Leader processes on one or more processing nodes. It acts as the single coordinator for status and error messages, and manages orderly shutdown when processing completes or in the event of a fatal error. The conductor node is run from the primary server.

The conductor is the main process that:

1. Starts up jobs.
2. Determines resource assignments.
3. Creates the Section Leaders (which in turn create and manage the player processes that perform the actual job execution).
4. Acts as the single coordinator for status and error messages.
5. Manages orderly shutdown when processing completes or in the event of a fatal error.

When the job is initiated the primary process (called the “conductor”) reads the job design, which is a generated Orchestrate shell (osh) script. The conductor also reads the parallel execution configuration file specified by the current setting of the APT_CONFIG_FILE environment variable. Once the execution nodes are known (from the configuration file) the conductor causes a coordinating process called a “section leader” to be started on each; by forking a child process if the node is on the same machine as the conductor or by remote shell execution if the node is on a different machine from the conductor (things are a little more dynamic in a grid configuration, but essentially this is what happens). Communication between the conductor, section leaders and player processes in a parallel job is effected via TCP.

Scenario for calculating the processes:

Sample APT_CONFIG_FILE (the conductor node is the one whose pools include "conductor"):

{
  node "node1"
  {
    fastname "DevServer1"
    pools "conductor"
    resource disk "/datastage/Ascential/DataStage/Datasets/node1" {pools "conductor"}
    resource scratchdisk "/datastage/Ascential/DataStage/Scratch/node1" {pools ""}
  }
  node "node2"
  {
    fastname "DevServer1"
    pools ""
    resource disk "/datastage/Ascential/DataStage/Datasets/node2" {pools ""}
    resource scratchdisk "/datastage/Ascential/DataStage/Scratch/node2" {pools ""}
  }
}

Please find the breakdown below:

For every job that starts there will be one (1) conductor process (started on the conductor node), one (1) section leader for each node in the configuration file, and roughly one (1) player process (it may or may not be exactly one) for each stage in your job on each node.

So if you have a job that uses a two (2) node configuration file and has 3 stages, your job will have:

1 conductor process
2 section leaders (2 nodes * 1 section leader per node)
6 player processes (3 stages * 2 nodes)

Your dump score may show that your job runs 9 processes on 2 nodes.

This kind of information is very helpful when determining the impact that a particular job or process will have on the underlying operating system and system resources.
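The arithmetic above can be captured in a few lines of Python, should you want to estimate process counts for other jobs. This is only the rule of thumb stated above; actual counts can differ, for example when operators are combined.

def estimate_processes(num_nodes, num_stages):
    conductor = 1                      # one per job
    section_leaders = num_nodes        # one per logical node in the config file
    players = num_stages * num_nodes   # roughly one per stage per node
    return conductor + section_leaders + players

# The example from the text: 2-node configuration file, 3 stages.
print(estimate_processes(num_nodes=2, num_stages=3))   # 1 + 2 + 6 = 9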

Posted by Devendra Kumar Yadav at 11:53 PM

Situations to choose Parallel or Server Datastage Jobs

1. The choice of server or parallel depends upon time to implement, functionality and cost.

2. When we have lots of functionality to implement for a lower data volume, with less hardware and ease of implementation in mind, we can go for server jobs.

3. Parallel jobs are costly due to the higher scale of hardware and are more difficult to implement, but they provide extreme processing capability for very large volumes, with a vast array of operators for high-performance manipulation.

4. When the data volume is low it is better to go for a server job, as parallel jobs can have a longer start-up time.

5. When the data volume is high, it is better to choose a parallel job over a server job. A parallel job will be a lot faster than a server job even if it runs on a single node.


The obvious incentive for going parallel is data volume. Parallel jobs can remove bottlenecks and run across multiple nodes in a cluster for almost unlimited scalability; at that point parallel jobs become the faster and easier option. A parallel Sort stage is a lot faster than the server equivalent, and a Transformer stage in a parallel job is faster than the same transformations in a server job. Even on one node, with a compiled Transformer stage, the parallel version was three times faster. Even on a one-node configuration without much scope for parallel processing we can still get big performance improvements from an Enterprise Edition job, and the improvements multiply by ten or more if we work on 2-CPU machines with two nodes in most stages.

6. Parallel jobs take advantage of both pipeline parallelism and partitioning parallelism.

7. We can improve the performance of a server job by enabling inter-process row buffering. This helps stages exchange data as soon as it is available on the link. The IPC stage also helps one passive stage read data from another as soon as data is available; in other words, stages do not have to wait for the entire set of records to be read before it is passed to the next stage. The Link Partitioner and Link Collector stages can be used to achieve a certain degree of partitioning parallelism.

8. A lookup against a sequential file is possible in parallel jobs but not in server jobs.

9. DataStage EE jobs are compiled into OSH (Orchestrate Shell script language). OSH executes operators, which are instances of executable C++ classes: pre-built components representing the stages used in DataStage jobs. Server jobs are compiled into BASIC, which is interpreted pseudo-code. This is why parallel jobs run faster, even if processed on one CPU.

10. The major difference between InfoSphere DataStage Enterprise and Server editions is that Enterprise Edition (EE) introduces parallel jobs. Parallel jobs support a completely new set of stages, which implement scalable and parallel data processing mechanisms. In most cases parallel jobs and stages look similar to DataStage Server objects, however their capabilities are quite different.

In rough outline:

• Parallel jobs are executable DataStage programs, managed and controlled by the DataStage Server runtime environment.

• Parallel jobs have built-in mechanisms for pipelining, partitioning and parallelism; in most cases no manual intervention is needed to apply these techniques optimally.

• Parallel jobs are a lot faster in ETL tasks such as sorting, filtering and aggregating.

Refer to this link to know more about parallel job stages: Parallel Jobs Stages

Posted by Devendra Kumar Yadav at 11:02 PM


Surrogate Key Generator Implementation in Datastage 8.1, 8.5 & 9.1

The Surrogate Key Generator stage is a processing stage that generates surrogate key columns and maintains the key source.

A surrogate key is a unique primary key that is not derived from the data that it represents, therefore changes to the data will not change the primary key. In a star schema database, surrogate keys are used to join a fact table to a dimension table.

The Surrogate Key Generator stage can be used to:

1. Create or delete the key source before other jobs run

2. Update a state file with a range of key values

3. Generate surrogate key columns and pass them to the next stage in the job

4. View the contents of the state file

Generated keys are 64-bit integers, and the key source can be a state file or a database sequence.

Surrogate keys are used to join a dimension table to a fact table in a star schema database.

When the SCD stage performs a dimension lookup:

A) If a matching record is found, it retrieves the value of the existing surrogate key.

B) If a match is not found, the stage obtains a new surrogate key value by using the derivation of the Surrogate Key column on the Dim Update tab.

• If you want the SCD stage to generate new surrogate keys, use a key source that you created with a Surrogate Key Generator stage, as described in "Surrogate Key Generator".

• If you want to use your own method to handle surrogate keys, derive the Surrogate Key column from a source column.

You can replace the dimension information in the source data stream with the surrogate key value by mapping the Surrogate Key column to the output link.
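A hedged Python sketch of that lookup behaviour: if the business key already exists in the dimension we reuse its surrogate key, otherwise we draw a new one from the key source. The dictionary and the counter below stand in for the dimension table and the surrogate key source; none of these names come from DataStage.

from itertools import count

dimension = {"CUST-001": 1, "CUST-002": 2}   # business key -> surrogate key (existing rows)
next_key = count(start=3)                    # stand-in for the surrogate key source

def scd_lookup(business_key):
    # A) Match found: retrieve the existing surrogate key.
    if business_key in dimension:
        return dimension[business_key]
    # B) No match: obtain a new surrogate key value and add the dimension row.
    new_key = next(next_key)
    dimension[business_key] = new_key
    return new_key

print(scd_lookup("CUST-002"))   # existing key -> 2
print(scd_lookup("CUST-003"))   # new key      -> 3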

Creating the key source:

Drag the Surrogate Key Generator stage from the palette onto the parallel job canvas, with no input or output links. Double-click the stage and click on the Properties tab.


Properties:

Key Source Action = Create
Source Type = Flat File or Database sequence (in this case we are using a flat file)

When you run the job it will create an empty file. If you want to check its contents, set View State File = YES and check the job log for details.

skey_genstage,0: State file /tmp/skeycutomerdim.stat is empty.

If you try to create the same file again, the job will abort with the following error.

skey_genstage,0: Unable to create state file /tmp/skeycutomerdim.stat: File exists.

Deleting the key source:


Updating the state file:

To update the state file, add a Surrogate Key Generator stage to the job with a single input link from another stage. We use this process to update the state file if it is corrupted or deleted.

1. Open the Surrogate Key Generator stage editor and go to the Properties tab.


If the state file exists we can update it; otherwise we can create and then update it. We are using the SkeyValue parameter to update the state file via a Transformer stage.


Generating Surrogate Keys:

Now that we have created the state file, we will generate keys using it. Click on the Surrogate Key Generator stage, go to the Properties tab, and type a name for the surrogate key column in the Generated Output Column Name property.


Go to the Output tab and define the mapping as shown below.


In the Row Generator we are using 10 rows, so when we run the job we see 10 surrogate key values in the output. I have updated the state file with 100, and below is the output.


If you want to control where key generation begins, you can use the following properties in the Surrogate Key Generator stage.

A. If the key source is a flat file, specify how keys are generated:

1. To generate keys in sequence from the highest value that was last used, set the Generate Key from Last Highest Value property to Yes. Any gaps in the key range are ignored.

2. To specify a value to initialize the key source, add the File Initial Value property to the Options group, and specify the start value for key generation.

3. To control the block size for key ranges, add the File Block Size property to the Options group, set this property to User specified, and specify a value for the block size.

B. If there is no input link, add the Number of Records property to the Options group, and specify how many records to generate.
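To make the flat-file key source behaviour concrete, here is a small Python sketch of generating keys in sequence from the last highest value kept in a state file. The file path, format and function names are assumptions for illustration; the real state file written by the stage is managed internally by DataStage.

import os

STATE_FILE = "/tmp/skey_demo.state"   # hypothetical path, not the stage's real format

def next_keys(how_many, initial_value=1):
    """Generate `how_many` keys in sequence from the last highest value used."""
    last = initial_value - 1
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            last = int(f.read().strip() or last)

    keys = list(range(last + 1, last + 1 + how_many))

    # Persist the new highest value so the next run continues the sequence.
    with open(STATE_FILE, "w") as f:
        f.write(str(keys[-1]))
    return keys

print(next_keys(10))   # first run: 1..10
print(next_keys(10))   # next run : 11..20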