86043838 Datastage Interview

Page 1: 86043838 Datastage Interview

What is Data warehouse ?

What is Operational Databases ?

Data Extraction ?

Data Aggregation ?

Data Transformation ?

DataStage Designer

Page 2: 86043838 Datastage Interview

Advantages of Data warehouse ?

DataStage ?

Client Component ?

Server Component ?

DataStage Jobs ?

Page 3: 86043838 Datastage Interview

DataStage NLS ?

Page 4: 86043838 Datastage Interview

Stages

Passive Stage ?

Active Stage ?

Server Job Stages

Page 5: 86043838 Datastage Interview
Page 6: 86043838 Datastage Interview

Parallel Job Stage

Page 7: 86043838 Datastage Interview
Page 8: 86043838 Datastage Interview
Page 9: 86043838 Datastage Interview

Links ?

Parallel Processing

Page 10: 86043838 Datastage Interview

Types of Parallelism

Plug in Stage?

Page 11: 86043838 Datastage Interview

Difference Between Lookup and Join:

What is Staging Variable?

What are Routines?

what are the Job parameters?

What are Stage Variables,

Derivations and Constants?

why fact table is in normal form?

What are an Entity, Attribute and

Relationship?

What is Metastage?

Page 12: 86043838 Datastage Interview

How many places u can call Routines?

What about System variables?

Page 13: 86043838 Datastage Interview

What are all the third party tools

used in DataStage?

What is the difference between

change capture and change apply

stages

DataStage Engine Commands

What is the difference between

Transform and Routine in

DataStage?

Where can you output data using

the peek stage?

What is complex stage? In which

situation we are using this one?

Page 14: 86043838 Datastage Interview

What is Ad-hoc query?

What is Version Control?

How Version Control Works?

Benefits of Using Version Control

Page 15: 86043838 Datastage Interview

Lookup types in Datastage 8

Page 16: 86043838 Datastage Interview

A data warehouse is a central integrated database containing data from all

the operational sources and archive systems in an organization. It contains

a copy of transaction data specifically structured for query analysis.

This database can be accessed by all users, ensuring that each group in an organization

is accessing valuable, stable data.

Operational databases are usually accessed by many concurrent users. The

data in the database changes quickly and often. It is very difficult to obtain

an accurate picture of the contents of the database at any one time.

Because operational databases are task oriented, for example, stock inventory

systems, they are likely to contain “dirty” data. The high throughput

of data into operational databases makes it difficult to trap mistakes or

incomplete entries. However, you can cleanse data before loading it into a

data warehouse, ensuring that you store only “good” complete records.

Data extraction is the process used to obtain data from operational sources, archives, and

external data sources.

Data aggregation sums or otherwise summarizes detailed records, and the aggregated totals are stored in the data warehouse. Because

the number of records stored in the data warehouse is greatly reduced, it

is easier for the end user to browse and analyze the data.

Transformation is the process that converts data to a required definition and value.

Data is transformed using routines based on a transformation rule, for

example, product codes can be mapped to a common format using a transformation

rule that applies only to product codes.

After data has been transformed it can be loaded into the data warehouse

in a recognized and required format.

DataStage Designer

Page 17: 86043838 Datastage Interview

• Capitalizes on the potential value of the organization’s information

• Improves the quality and accessibility of data

• Combines valuable archive data with the latest data in operational sources

• Increases the amount of information available to users

• Reduces the requirement of users to access operational data

• Reduces the strain on IT departments, as they can produce one database to serve all user groups

• Allows new reports and studies to be introduced without disrupting operational systems

• Promotes users to be self-sufficient

DataStage is an ETL tool that handles the design and processing required to build a data warehouse. It:

• Extracts data from any number or type of database.

• Transforms data. DataStage has a set of predefined transforms and functions you can use to convert your data.

You can easily extend the functionality by defining your own transforms to use.

• Loads the data warehouse.

It consists of a number of client components and server components.

DataStage server and parallel jobs are compiled and run on the DataStage server. The job will connect to databases on other machines as necessary,

extract data, process it, then write the data to the target data warehouse.

DataStage mainframe jobs are compiled and run on a mainframe. Data extracted by such jobs is then loaded into the data warehouse.

DataStage Designer -> A design interface used to create DataStage applications (known as jobs).

DataStage Director -> A user interface used to validate, schedule, run, and monitor DataStage server jobs and parallel jobs.

DataStage Manager -> A user interface used to view and edit the contents of the Repository.

DataStage Administrator -> A user interface used to perform administration tasks such as setting up DataStage users, creating and moving projects, and setting up purging criteria.

Repository -> A central store that contains all the information required to build a data mart or data warehouse.

DataStage Server -> Runs executable jobs that extract, transform, and load data into a data warehouse.

DataStage Package Installer -> A user interface used to install packaged DataStage jobs and plug-ins.

Basic types of DataStage jobs

Page 18: 86043838 Datastage Interview

Server Jobs -> These are compiled and run on the DataStage server. A server job will connect to databases on other machines as necessary, extract data, process it, then write the data to the target data warehouse.

Parallel Jobs -> These are compiled and run on the DataStage server in a similar way to server jobs, but support parallel processing on SMP, MPP, and cluster systems.

Mainframe Jobs -> These are available only if you have Enterprise MVS Edition installed. A mainframe job is compiled and run on the mainframe. Data extracted by such jobs is then loaded into the data warehouse.

Shared Containers -> These are reusable job elements. They typically comprise a number of stages and links. Copies of shared containers can be used in any number of server jobs or parallel jobs and edited as required.

Job Sequences -> A job sequence allows you to specify a sequence of DataStage jobs to be executed, and actions to take depending on results.

Built-in Stages -> Supplied with DataStage and used for extracting, aggregating, transforming, or writing data. All types of job have these stages.

Plug-in Stages -> Additional stages that can be installed in DataStage to perform specialized tasks that the built-in stages do not support. Server jobs and parallel jobs can make use of these.

Job Sequence Stages -> Special built-in stages which allow you to define sequences of activities to run. Only job sequences have these.

DataStage has built-in National Language Support (NLS). With NLS installed,

DataStage can do the following:

• Process data in a wide range of languages

• Accept data in any character set into most DataStage fields

• Use local formats for dates, times, and money (server jobs)

Page 19: 86043838 Datastage Interview

• Sort data according to local rules

• Convert data between different encodings of the same language

(for example, for Japanese it can convert JIS to EUC)

A job consists of stages linked together which describe the flow of data

from a data source to a data target (for example, a final data warehouse).

The different types of job have different stage types. The stages that are

available in the DataStage Designer depend on the type of job that is

currently open in the Designer.

A passive stage handles access to databases for the extraction or writing of data.

Active stages model the flow of data and provide mechanisms for combining data streams, aggregating data, and converting data from one data type to another.

Database

ODBC. -> Extracts data from or loads data into databases that support the industry-standard Open Database Connectivity (ODBC) API. This stage is also used as an intermediate stage for aggregating data. This is a passive stage.

UniVerse. -> Extracts data from or loads data into UniVerse databases. This stage is also used as an intermediate stage for aggregating data. This is a passive stage.

UniData. -> Extracts data from or loads data into UniData databases. This is a passive stage.

Oracle 7 Load. -> Bulk loads an Oracle 7 database. Previously known as ORABULK.

Sybase BCP Load. -> Bulk loads a Sybase 6 database. Previously known as BCPLoad.

File

Hashed File. -> Extracts data from or loads data into databases that contain hashed files. Also acts as an intermediate stage for quick lookups. This is a passive stage.

Page 20: 86043838 Datastage Interview

Sequential File. -> Extracts data from, or loads data into, operating system text files. This is a passive stage.

Processing

Aggregator. -> Classifies incoming data into groups, computes totals and other summary functions for each group, and passes them to another stage in the job. This is an active stage.

BASIC Transformer. -> Receives incoming data, transforms it in a variety of ways, and outputs it to another stage in the job. This is an active stage.

Folder. -> Folder stages are used to read or write data as files in a directory located on the DataStage server.

Inter-process. -> Provides a communication channel between DataStage processes running simultaneously in the same job. This is a passive stage.

Link Partitioner. -> Allows you to partition a data set into up to 64 partitions. Enables server jobs to run in parallel on SMP systems. This is an active stage.

Link Collector. -> Collects partitioned data from up to 64 partitions. Enables server jobs to run in parallel on SMP systems. This is an active stage.

Real Time

RTI Source. -> Entry point for a job exposed as an RTI service. The table definition specified on the output link dictates the input arguments of the generated RTI service.

RTI Target. -> Exit point for a job exposed as an RTI service. The table definition on the input link dictates the output arguments of the generated RTI service.

Containers

Page 21: 86043838 Datastage Interview

Server Shared Container. -> Represents a group of stages and links. The group is replaced by a single Shared Container stage in the Diagram window.

Local Container. -> Represents a group of stages and links. The group is replaced by a single Container stage in the Diagram window.

Container Input and Output. -> Represent the interface that links a container stage to the rest of the job design.

Databases

DB2/UDB Enterprise. Allows you to read and write a

DB2 database.

Informix Enterprise. Allows you to read and write an

Informix XPS database.

Oracle Enterprise. Allows you to read and write an

Oracle database.

Teradata Enterprise. Allows you to read and write a

Teradata database.

Development/Debug Stages

Row Generator. -> Generates a dummy data set.

Column Generator. -> Adds extra columns to a data set.

Head. -> Copies the specified number of records from the beginning of a data partition.

Peek. -> Prints column values to the screen as records are copied from its input data set to one or more output data sets.

Sample. -> Samples a data set.

Tail. -> Copies the specified number of records from the

end of a data partition.

Write range map. -> Enables you to carry out range map

partitioning on a data set.

Page 22: 86043838 Datastage Interview

File Stages

Complex Flat File. -> Allows you to read or write complex flat files on a mainframe machine. This is intended for use on USS systems.

Data set. -> Stores a set of data.

External source. -> Allows a parallel job to read an external data source.

External target. -> Allows a parallel job to write to an external data source.

File set. -> A set of files used to store data.

Lookup file set. -> Provides storage for a lookup table.

SAS data set. -> Provides storage for SAS data sets.

Sequential file. -> Extracts data from, or writes data to, a text file.

Processing Stages

Transformer. -> Receives incoming data, transforms it in a variety of ways, and outputs it to another stage in the job.

Aggregator. -> Classifies incoming data into groups, computes totals and other summary functions for each group, and passes them to another stage in the job.

Change Apply. -> Applies a set of captured changes to a data set.

Change Capture. -> Compares two data sets and records the differences between them.

Compare. -> Performs a column-by-column compare of two pre-sorted data sets.

Compress. -> Compresses a data set.

Copy. -> Copies a data set.

Decode. -> Uses an operating system command to decode a previously encoded data set.

Difference. -> Compares two data sets and works out the difference between them.

Encode. -> Encodes a data set using an operating system command.

Expand. -> Expands a previously compressed data set.

External Filter. -> Uses an external program to filter a data set.

Filter. -> Transfers, unmodified, the records of the input data set which satisfy requirements that you specify, and filters out all other records.

Funnel. -> Copies multiple data sets to a single data set.

Generic. -> Allows Orchestrate experts to specify their own custom commands.

Lookup. -> Performs table lookups.

Merge. -> Combines data sets.

Modify. -> Alters the record schema of its input data set.

Remove Duplicates. -> Removes duplicate entries from a data set.

Page 23: 86043838 Datastage Interview

SAS (Statistical Analysis System). -> Allows you to run SAS applications from within a DataStage job.

Sort. -> Sorts input columns.

Switch. -> Takes a single data set as input and assigns each input record to an output data set based on the value of a selector field.

Surrogate Key. -> Generates one or more surrogate key columns and adds them to an existing data set.

Real Time

RTI Source. -> Entry point for a job exposed as an RTI service. The table definition specified on the output link dictates the input arguments of the generated RTI service.

RTI Target. -> Exit point for a job exposed as an RTI service. The table definition on the input link dictates the output arguments of the generated RTI service.

Restructure

Column Export. -> Exports a column of another type to a string or binary column.

Column Import. -> Imports a column from a string or binary column.

Combine Records. -> Combines several columns associated by a key field to build a vector.

Make Subrecord. -> Combines a number of vectors to form a subrecord.

Make Vector. -> Combines a number of fields to form a vector.

Promote Subrecord. -> Promotes the members of a subrecord to top-level fields.

Split Subrecord. -> Separates a number of subrecords into top-level fields.

Split Vector. -> Separates a number of vector members into separate columns.

Other Stages

Parallel Shared Container. -> Represents a group of stages and links. The group is replaced by a single Parallel Shared Container stage in the Diagram window. Parallel Shared Container stages are handled differently from other stage types; they do not appear on the palette.

Page 24: 86043838 Datastage Interview

Local Container. -> Represents a group of stages and links.

The group is replaced by a single Container stage in the

Diagram window

Container Input and Output. -> Represent the interface

that links a container stage to the rest of the job design.

Links join the various stages in a job together and are used to specify how

data flows when the job is run.

Linking Server Stages

Stream. -> A link representing the flow of data. This is the principal type of link, and is used by both active and passive stages.

Reference. -> A link representing a table lookup. Reference links are only used by active stages. They are used to provide information that might affect the way data is changed, but do not supply the data to be changed.

Linking Parallel Stages

Stream. -> A link representing the flow of data. This is the principal type of link, and is used by all stage types.

Reference. -> A link representing a table lookup. Reference links can only be input to Lookup stages, and they can only be output from certain types of stage.

Reject. -> Some parallel job stages allow you to output records that

have been rejected for some reason onto an output link.

Parallel processing is the ability to carry out multiple operations or tasks simultaneously.

Page 25: 86043838 Datastage Interview

Pipeline Parallelism

-> If we run a job on a system with at least three processors, the stage reading would start on one processor and start filling a pipeline with the data it had read.

-> The transformation stage would start running on a second processor as soon as there was data in the pipeline, process it, and start filling another pipeline.

-> The target stage would start running on a third processor as soon as there was data in that pipeline, writing it to the target.

Partitioning Parallelism

-> Using partitioning parallelism, the same job would effectively be run simultaneously by several processors, each handling a separate partition of the data.

BULK COPY PROGRAM: Microsoft SQL Server and Sybase have a utility called BCP (Bulk Copy Program). This command-line utility copies SQL Server data to or from an operating system file in a user-specified format. BCP uses the bulk copy API in the SQL Server client libraries.

By using BCP, you can load large volumes of data into a table without recording each insert in a log file. You can run BCP manually from a command line using command-line options (switches). A format (.fmt) file can be used to describe the layout of the data file.
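A minimal sketch of a BCP invocation; the database, table, and file names here are placeholders used only for illustration:

  # bulk-load rows from a delimited data file into a SQL Server table,
  # using a format (.fmt) file to describe the file layout
  bcp SalesDB.dbo.Customers in customers.dat -f customers.fmt -S dbserver -U loader -P secret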

The Orabulk stage is a plug-in stage supplied by Ascential. The Orabulk

plug-in is installed automatically when you install DataStage.

An Orabulk stage generates control and data files for bulk loading into a

single table on an Oracle target database. The files are suitable for loading

into the target database using the Oracle command sqlldr.

One input link provides a sequence of rows to load into an Oracle table.

The meta data for each input column determines how it is loaded. One

optional output link provides a copy of all input rows to allow easy

combination of this stage with other stages.
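A minimal sketch of loading the generated control and data files with sqlldr; the file names and userid/password are placeholders used only for illustration:

  # load the data file into the target Oracle table using the control file
  # produced by the Orabulk stage
  sqlldr userid=scott/tiger control=customers.ctl data=customers.dat log=customers.log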

Page 26: 86043838 Datastage Interview

Lookup and join perform equivalent operations: combining two or more input datasets based on one or more specified keys.

Lookup requires all but one (the first or primary) input to fit into physical memory. Join requires all inputs to be sorted.

When one unsorted input is very large or sorting isn't feasible, lookup is the preferred solution. When all inputs are of manageable size or are pre-sorted, join is the preferred solution.

These are temporary variables created in the Transformer stage for intermediate calculations.

Routines are functions that we develop in DataStage BASIC for required tasks that DataStage does not fully support out of the box (for example, complex logic).

Job parameters are used to provide administrative control and to change run-time values of a job. They are defined under Edit > Job Properties; in the Parameters tab we can define the name, prompt, type, and default value.

Stage Variable - An intermediate processing variable that retains its value during a read and does not pass the value into the target column.

Derivation - An expression that specifies the value to be passed on to the target column.

A fact table consists of measurements of business requirements and foreign

keys of dimensions tables as per business rules.

An entity represents a chunk of information. In relational databases, an entity

often maps to a table.

An attribute is a component of an entity and helps define the uniqueness of

the entity. In relational databases, an attribute maps to a column.

MetaStage is a persistent metadata directory that uniquely synchronizes metadata across multiple separate silos, eliminating re-keying and the manual establishment of cross-tool relationships. Based on patented technology, it provides seamless cross-tool integration throughout the entire business intelligence life cycle.

Page 27: 86043838 Datastage Interview

Routines can be called in four places:

(i) Transforms of a routine

(A) Date transformation

(B) Upstring transformation

(ii) Before and after subroutines

(iii) XML transformation

(iv) Web-based transformation

DataStage provides a set of variables containing useful system information

that you can access from a transform or routine. System variables are read-

only.

@DATE The internal date when the program started. See the Date function.

@DAY The day of the month extracted from the value in @DATE.

@FALSE The compiler replaces the value with 0.

@FM A field mark, Char(254).

@IM An item mark, Char(255).

@INROWNUM Input row counter. For use in constraints and derivations in

Transformer stages.

@OUTROWNUM Output row counter (per link). For use in derivations in

Transformer stages.

Page 28: 86043838 Datastage Interview

AutoSys, TNG, Event Coordinator, Maestro, and Control-M are third-party scheduling tools that are commonly used in DataStage projects.

The Change Capture stage is used to get the difference between two sources, a before data set and an after data set. The original source used as the reference is called the before data set; the source in which we are looking for changes is called the after data set. Change Capture adds a field called "change_code" to its output. From this change code one can recognize what kind of change each record represents: insert, update, or delete.

The following commands can be taken as DataStage Engine commands, used to start and stop the DS Engine:

$DSHOME/bin/uv -admin -start
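The corresponding stop command, shown as a sketch assuming $DSHOME points at the DataStage engine directory:

  $DSHOME/bin/uv -admin -stop    # stop the DataStage engine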

Routines can return values, whereas transforms cannot return values.

The Peek stage writes its output to the job log: look at the log in the DataStage Director (the Peek stage can also send rows to an output link).

A Complex Flat File stage can be used to read the data at the initial level. Using CFF, we can read ASCII or EBCDIC (Extended Binary Coded Decimal Interchange Code) data. We can select the required columns and omit the remaining ones. We can collect the rejects (badly formatted records) by setting the appropriate reject property of the stage.

Page 29: 86043838 Datastage Interview

Ad hoc querying is a term in information science. Many application software

systems have an underlying database which can be accessed by only a limited

number of queries and reports. Typically these are available via some sort of

menu, and will have been carefully designed, pre-programmed and optimized

for performance by expert programmers.

By contrast, "ad hoc" reporting systems allow the users themselves to create

specific, customized queries. Typically this would be via a user-friendly GUI-

based system without the need for the in-depth knowledge of SQL, or

database schema that a programmer would have.

Because such reporting has the potential to severely degrade the performance of the underlying database, it is typically run against a data warehouse rather than the operational system.

Version Control allows you to:

• Store different versions of DataStage jobs.

• Run different versions of the same job.

• Revert to a previous version of a job.

• View version histories.

• Ensure that everyone is using the same version of a job.

• Protect jobs by making them read-only.

• Store all changes in one centralized place.

Version Control utilizes the DataStage repository, and uses a specially

created DataStage project (normally called ‘VERSION’) to store its

information.

This special project stores all changes made to all the projects and components under its control.

Version Control is effective because it captures entire component releases,

making it possible to view all changes between release levels.

Version Control also provides these benefits:

• Version tracking

• Central code repository

• DataStage integration

• Team coordination

Page 30: 86043838 Datastage Interview

Two types of Lookup: Range Lookup and Caseless Lookup

Page 40: 86043838 Datastage Interview

Job Sequence?

Activity Stages?

JOB SEQUENCE

Page 41: 86043838 Datastage Interview

Triggers?

Job Sequence Properties?

Job Report

Page 42: 86043838 Datastage Interview

How do you generate Sequence number in

Datastage?

Sequencers are job control programs that

execute other jobs with preset Job parameters.

Page 43: 86043838 Datastage Interview

DataStage provides a graphical Job Sequencer which allows you to specify

a sequence of server jobs or parallel jobs to run. The sequence can also

contain control information; for example, you can specify different courses

of action to take depending on whether a job in the sequence succeeds or

fails. Once you have defined a job sequence, it can be scheduled and run

using the DataStage Director. It appears in the DataStage Repository and

in the DataStage Director client as a job.

• Job. Specifies a DataStage server or parallel job.

• Routine. Specifies a routine. This can be any routine in

the DataStage Repository (but not transforms).

• ExecCommand. Specifies an operating system command

to execute.

• Email Notification. Specifies that an email notification

should be sent at this point of the sequence (uses SMTP).

• Wait-for-file. Waits for a specified file to appear or disappear.

• Exception Handler. There can only be one of these in a

job sequence. It is executed if a job in the sequence fails to

run (other exceptions are handled by triggers) or if the

job aborts and the Automatically handle job runs that

fail option is set for that job.

• Nested Conditions. Allows you to further branch the

execution of a sequence depending on a condition.

• Sequencer. Allows you to synchronize the control flow

of multiple activities in a job sequence.

• Terminator. Allows you to specify that, if certain situations

occur, the jobs a sequence is running shut down

cleanly.

• Start Loop and End Loop. Together these two stages

allow you to implement a For…Next or For…Each loop

within your sequence.

• User Variable. Allows you to define variables within a

sequence. These variables can then be used later on in

the sequence, for example to set job parameters.

JOB SEQUENCE

Page 44: 86043838 Datastage Interview

The control flow in the sequence is dictated by how you interconnect

activity icons with triggers.

There are three types of trigger:

• Conditional. A conditional trigger fires the target activity if the

source activity fulfills the specified condition. The condition is

defined by an expression, and can be one of the following types:

– OK. Activity succeeds.

– Failed. Activity fails.

– Warnings. Activity produced warnings.

– ReturnValue. A routine or command has returned a value.

– Custom. Allows you to define a custom expression.

– User status. Allows you to define a custom status message to

write to the log.

• Unconditional. An unconditional trigger fires the target activity

once the source activity completes, regardless of what other triggers

are fired from the same activity.

• Otherwise. An otherwise trigger is used as a default where a

source activity has multiple output triggers, but none of the conditional

ones have fired.

General, Parameters, Job Control, Dependencies, NLS

The job reporting facility allows you to generate an HTML report of a

server, parallel, or mainframe job or shared containers. You can view this

report in a standard Internet browser (such as Microsoft Internet Explorer)

and print it from the browser.

The report contains an image of the job design followed by information

about the job or container and its stages. Hotlinks facilitate navigation

through the report. The following illustration shows the first page of an

example report, showing the job image and the contents list from which

you can link to more detailed job component descriptions. The report is not dynamic; if you change the job design, you will need to regenerate the report.

Page 45: 86043838 Datastage Interview

Sequence numbers can be generated using the routines KeyMgtGetNextVal and KeyMgtGetNextValConn. They can also be generated by an Oracle sequence.

A sequencer allows you to synchronize the control flow of multiple activities in a job sequence. It can have multiple input triggers as well as multiple output triggers. The sequencer operates in two modes: ALL mode, in which all of the inputs to the sequencer must be TRUE for any of the sequencer outputs to fire; and ANY mode, in which output triggers can be fired if any of the sequencer inputs are TRUE.

Page 46: 86043838 Datastage Interview

Suppose we have 3 jobs in a sequence; while running, if job 1 fails, we still have to run job 2 and job 3. How can we do that?

how do you remove duplicates using transformer

stage in datastage.

how you will call shell scripts in sequencers in

datastage

What are the Environmental variables in Datastage?

How to extract job parameters from a file?

Scenarios

Page 47: 86043838 Datastage Interview

How to get the unique records on multiple columns by

using sequential file stage only

if a column contains data like

abc,aaa,xyz,pwe,xok,abc,xyz,abc,pwe,abc,pwe,xok,xyz

,xxx,abc,

roy,pwe,aaa,xxx,xyz,roy,xok....

how to send the unique data to one source and

remaining data

to another source????

how do u reduce warnings?

Is there any possibility to generate alphanumeric

surrogate key?

How to lock\unlock the jobs as datastage admin?

How to enter a log in auditing table whenever a job

get finished?

what is Audit table? Have u use audit table in ur

project?

Page 48: 86043838 Datastage Interview

Can we use Round Robin for aggregator? Is there any

benefit underlying?

How many number of reject links merge stage can

have?

I have 3 jobs A,B and C , which are dependent each

other. I want to run A & C jobs daily and B job run only

on sunday. how can we do it?

How to generate surrogate key without using

surrogate key stage?

what is push and pull technique??? I want to two seq

files using push technique import in my desktop what i

will do?

what is .dsx files

how to capture rejected data by using join stage not

for lookup stage. please let me know?

What is APT_DUMP_SCORE?

Country, state 2 tables r there. in table 1 have

cid,cname

table2 have sid,sname,cid. i want based on cid which

country's

having more than 25 states i want to display?

Page 49: 86043838 Datastage Interview

what is the difference between 7.1,7.5.2,8.1 versions

in datastage?

what is normalization and denormalization?

What is diff between Junk dimensions and conform

dimension?

30 jobs are running in unix.i want to find out my

job.how to do this?Give me command?

How do u convert the columns to rows in DataStage?

What is environment variables?

Where the DataStage stored his repository?

How one source columns or rows to be loaded in to

two different tables?

How do you register plug-ins?

Page 50: 86043838 Datastage Interview

How many number of ways that you can implement

SCD2 ? Explain them

A sequential file has 8 records with one column, below

are the values in the column separated by space,1 1 2

2 3 4 5 6in a parallel job after reading the sequential

file 2 more sequential files should be created, one

with duplicate records and the other without

duplicates.File 1 records separated by...

how to perform left outer join and right outer join in

lookup stage

what are the ways to read multiiple files from

sequential file if the both files are different

What happens if the job fails at night?

Page 51: 86043838 Datastage Interview

If there are 10000 records and while loading, if the

session fails in between, how will you load the

remaining data?

Tell me one situation from your last project, where

you had faced problem and How did u solve it?

How to handle Date convertions in Datastage?

Convert a mm/dd/yyyy format to yyyy-dd-mm?

Page 52: 86043838 Datastage Interview

what is trouble shooting in server jobs ? what are the

diff kinds of errors encountered while running any

job?

what are validations you perform after creating jobs in

designer.what r the different type of errors u faced

during loading and how u solve them

If the size of the Hash file exceeds 2GB..What

happens? Does it overwrite the current rows?

What is the purpose of Debugging stages? In real time

Where we will use?

Page 53: 86043838 Datastage Interview

How do you you delete header and footer on the

source sequential file and how do you create header

and footer on target sequential file using datastage?

Using server job, how to transform data in XML file

into sequential file?? i have used XML input, XML

transformer and a sequential file.

How to develop the SCD using LOOKUP stage?

Page 54: 86043838 Datastage Interview

source has 10000 records, Job failed after 5000

records are loaded. This status of the job is abort ,

Instead of removing 5000 records from target , How

can i resume the load

if we using two sources having same meta data and

how to check the data in two sorces is same or

not?and if the data is not same i want to abort the job

?how we can do this?

Page 55: 86043838 Datastage Interview

Scenario based Question ........... Suppose that 4 job

control by the sequencer like (job 1, job 2, job 3, job 4

)if job 1 have 10,000 row ,after run the job only 5000

data has been loaded in target table remaining are not

loaded and your job going to be aborted then.. How

can short out the problem.

Tell me the environment in your last projects

Give the OS of the Server and the OS of the Client of

your recent most project

Where does UNIX script of datastage executes

weather in client machine or in server.Suppose if it

executes on server then it will execute ?

What are the Repository Tables in DataStage and

What are they?

Page 56: 86043838 Datastage Interview

How the hash file is doing lookup in serverjobs?How is

it comparing the key values?

how to extract data from more than 1 hetrogenious

Sources.

mean, example 1 sequenal file, Sybase , Oracle in a

single Job.

how can you do incremental load in datastage?

Job run reports generated by sequence jobs do not

show the final error message

Page 57: 86043838 Datastage Interview

To run a job even if its previous job in the sequence has failed, you need to go to the TRIGGER tab of that particular job activity in the sequence itself.

There you will find three fields:

Name: This is the name of the next link (the link going to the next job; e.g., for job activity 1 the link name will be the link going to job activity 2).

Expression Type: This allows you to trigger your next job activity based on the status you want. For example, if job 1 fails and you want to run job 2 and job 3, then go to the trigger properties of job 1 and select the expression type "Failed - (Conditional)". This way you can run job 2 even if job 1 is aborted. There are many other options available.

Expression: This is editable for some options. For expression type "Failed" you cannot change this field.

I think this will solve your problem.

In that case, double-click the Transformer stage and open the stage properties (the first icon in the header line). Go to Inputs -> Partitioning and select a partitioning technique (not Auto). Enable Perform Sort, then enable Unique, and select the required column name(s). The output will then contain only unique values, so the duplicates are removed.

Shell scripts can be called in sequences by using the Execute Command activity. In this activity, type the following command:

bash /path/to/your/script/scriptname.sh

The bash command is used to run the shell script.

Environment variables in DataStage are predefined values (often paths) that the system can use as shortcuts when running a program, instead of hard-coding them in every job. In most cases, environment variables are defined when the software is installed.

On Linux or UNIX, the dsjob command can be used to extract parameters from a job and to pass parameter values in when running a job.
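A hedged sketch of reading job parameters from a file and passing them to dsjob; the project name, job name, file path, and the NAME=VALUE file format are assumptions for illustration (values are assumed to contain no spaces):

  # build a -param argument for every NAME=VALUE line in the parameter file
  PARAMS=""
  while IFS='=' read -r name value; do
      PARAMS="$PARAMS -param $name=$value"
  done < /opt/etl/params/MyJob.params

  # run the job with those parameters and wait for its status
  $DSHOME/bin/dsjob -run $PARAMS -jobstatus MyProject MyJob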

Scenarios

Page 58: 86043838 Datastage Interview

In the Sequential File stage there is a Filter option where we can use whatever UNIX command we want. Go to the Sequential File stage Properties -> Output -> Options -> set Filter.
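A minimal sketch of a filter command for that option; the assumption that the key is in the first two comma-separated columns is purely for illustration:

  # keep only the first row for each distinct value of columns 1 and 2
  sort -t, -k1,2 -u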

By using the Sort stage: go to Properties -> set Sorting Keys (key = column name) and set the option Allow Duplicates = False.

In order to reduce the warnings you need to understand each particular warning; if you can fix it on the code or design side, do so. Otherwise, go to the Director, select the warning, right-click, choose "Add rule to message handler", and click OK. From the next run onward you should not see that warning.

It is not possible to generate an alphanumeric surrogate key in DataStage.

I think this answer might satisfy you:

1. Open the Administrator.
2. Go to the Projects tab.
3. Click on the Command button.
4. Run the LIST.READU command and press Execute (it shows the status of all jobs; note the PID of the jobs you want to unlock).
5. Close that and come back to the command window.
6. Now run the DS.TOOLS command and execute it.
7. Read the options given there and type option 4.
8. Then choose option 6 or 7 depending on your requirement.
9. Give the PID that you noted before.
10. Then answer "yes".
11. Generally it won't work the first time, but if you choose option 7 again and give the PID again, it will work.

Please get back to me if any further clarification is required.

Some companies use shell scripts to load logs into an audit table, while others load the logs into the audit table using DataStage jobs that we develop ourselves.

An audit table is essentially a log table; every job should have an audit table.

Page 59: 86043838 Datastage Interview

Yes, we can use Round Robin in the Aggregator; it is used for partitioning and collecting.

The Merge stage can have n-1 reject links (where n is the number of input links).

First, schedule jobs A and C in one sequence that runs Monday to Saturday. Then put all three jobs, in their dependency order, into a second sequence and schedule that sequence to run only on Sunday.

Surrogate keys can be generated in the Transformer using the system variables. The derivation is:

@PARTITIONNUM + (@INROWNUM - 1) * @NUMPARTITIONS

which yields a sequence number that is unique across partitions.
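A small illustration only (plain shell arithmetic, not DataStage code) of why this expression gives unique values; it assumes 2 partitions numbered 0 and 1:

  # partition 0 produces 0, 2, 4, ... and partition 1 produces 1, 3, 5, ...
  NUMPARTITIONS=2
  for INROWNUM in 1 2 3; do
    for PARTITIONNUM in 0 1; do
      echo "partition $PARTITIONNUM, row $INROWNUM -> key $(( PARTITIONNUM + (INROWNUM - 1) * NUMPARTITIONS ))"
    done
  done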

Push means the source team sends the data; pull means the developer extracts the data from the source.

A .dsx file is nothing but a DataStage project export (backup) file. When we want to load the project onto another system or server, we take this file and import it on the other system/server.

We cannot capture the reject data directly with the Join stage. For that we can use a Transformer stage after the Join stage.

APT_DUMP_SCORE is a reporting environment variable used to show how the data is processed and how processes (operators) are combined at run time.

Join the two tables on cid and pass all the columns to the output. Then, in an Aggregator stage, count the rows grouped by the key column cid. Finally, use a Filter or Transformer to keep the records with count > 25.

Page 60: 86043838 Datastage Interview

The main difference is that in 7.5 a job can be opened by only one user at a time, whereas in 8.1 other users can open the same job at the same time in read-only mode. Another difference is that 8.1 has a Slowly Changing Dimension stage and a common repository.

Normalization eliminates redundant data, whereas denormalization deliberately keeps redundant data (typically to improve query performance).

JUNK DIMENSION

A dimension which cannot be used to describe the facts is known as a junk dimension (a junk dimension provides additional information to the main dimension).

Example: customer address.

CONFORMED DIMENSION

A dimension table which can be shared by multiple fact tables is known as a conformed dimension.

Example: time dimension.

ps -ef|grep USER_ID|grep JOB_NAME

Using the Pivot stage.

Basically, an environment variable is a predefined variable that we can use while creating a DataStage job. We can set it either at project level or at job level. Once we set a specific variable, that variable is available within the project/job.

DataStage stores its repository in the IBM UniVerse database.

For columns - we can directly map a single source column to two different targets.

For rows - we have to put a constraint (condition) on each target link.

Using DataStage Manager. Tool-> Register Plugin -> Set Specific path and ->ok

Page 61: 86043838 Datastage Interview

There are 3 ways to construct SCD2 in DataStage 8.0.1:

1) using the SCD stage (a processing stage)
2) using the Change Capture and Change Apply stages
3) using a source file, Lookup, Transformer, Filter, and Surrogate Key Generator stages

Hi, we have the data 1 1 2 2 3 4 5 6. Using a Sort stage with the key change column enabled, we can identify the duplicates; then, using a Transformer, we can send the duplicates to one link and the non-duplicates to another link.

In the Lookup stage properties you have a Constraints option. If you click the Constraints button, you get the lookup-failure options Continue, Drop, Fail, and Reject.

If you select the option Continue, a left outer join operation is performed.

If you select the option Drop, an inner join operation is performed.

This can be achieved by selecting the File Pattern option and giving the path pattern of the files in the Sequential File stage.

You can define a job sequence that sends an email using the Notification (SMTP) activity if the job fails. Or log the failure to a log file using DSLogFatal/DSLogEvent from a controlling job or an after-job routine. Or use dsjob -log from the command line.
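A hedged sketch of those dsjob log options; the project and job names are placeholders:

  # add a warning entry (read from the file) to the job's log
  $DSHOME/bin/dsjob -log -warn MyProject MyJob < failure_note.txt

  # list a summary of the entries in the job's log
  $DSHOME/bin/dsjob -logsum MyProject MyJob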

Page 62: 86043838 Datastage Interview

Different companies use different strategies to recover the workflows.

1) You can use the session properties to Recover from last check point.

2) Use a temporary table before every target and load it with the keys. When a job fails, you can identify the rows that were not loaded from the source by using these keys in a SQL override.

3) You can delete the rows that were already loaded into the target by date, and restart the job from the beginning.

a) We had a big job with around 40 stages. The job was taking too long to compile and run. We broke the job into 3 smaller jobs; after this, we observed that the performance was slightly improved and maintenance of the jobs became easier.

b) We were facing problems deleting records using the OEE stage. We wrote a bulk delete statement instead of deleting record by record; it improved the performance of the job and reduced the deletion time to 5 minutes, whereas the same job had previously taken 25 minutes.

etc.

I will explain how to convert a mm/dd/yyyy format to yyyy-dd-mm. The expression is:

Oconv(Iconv(FieldName, "D/MDY[2,2,4]"), "D-YDM[4,2,2]")

Here Iconv(FieldName, "D/MDY[2,2,4]") first converts the given date into the internal date format, and Oconv(internal_date, "D-YDM[4,2,2]") then converts the internal date into the required yyyy-dd-mm format. For example, 04/28/2015 becomes 2015-28-04.

Page 63: 86043838 Datastage Interview

Troubleshooting in DataStage server jobs involves monitoring the job log for fatal errors and taking appropriate action to resolve them. Various errors can be encountered while running DataStage jobs, for example:

a) ORA-1400 error
b) Invalid userid or password; login denied (from the OCI stage)
c) Error - Dataset does not exist (parallel jobs)
d) A job may fail with a lookup failure saying "lookup failed on a key column" (if the Fail setting is used in the Lookup stage for lookup failures), etc.

I performed the following validations:
1) All letters should be in lowercase.
2) The email id field should not contain more than 255 characters.
3) It should not contain special characters except the underscore.

While loading, I sometimes came across the following errors:
1) "unknown field name ..." because the metadata was not loaded properly; I reloaded the metadata and it worked fine.
2) A data truncation warning, because the data type size in DataStage was less than the size of the data type in the database.

When you create a hashed file, by default the directory contains two files:
data.30
over.30
If the data exceeds the specified limit, the extra data is written into over.30 rather than overwriting existing rows. Beyond that, it depends on the available storage capacity.

The main use of the debugging stages (Row Generator, Peek, Tail, Head, etc.) is that they help monitor jobs and generate mock data when we don't have real data to test with.

Page 64: 86043838 Datastage Interview

In the Designer palette, under Development/Debug, we can find the Head and Tail stages; using these we can drop or keep the header and trailer records.

I will explain the stages used, in order:

FOLDER STAGE ---> XML INPUT STAGE ---> TRANSFORMER ---> SEQUENTIAL FILE

The Folder stage checks the folder that contains the XML file; give the wildcard as *.xml. In the XML Input stage, load the columns from the XML table definition and select only the values, then map them in the Transformer. That's it.

We can implement SCD using the Lookup stage, but only SCD Type 1, not Type 2.

Take the source (file or database) and a data set as the reference link for the lookup. In the Lookup stage, compare the source with the data set and set the lookup-failure condition to Continue. After that, in a Transformer, apply the condition and then use two targets, one for inserts and one for updates, where we manually write the SQL insert and update statements.

If you see the design, you can easily understand it.

Page 65: 86043838 Datastage Interview

But we keep the Extract, Transform, and Load processes separate. Generally the load job never fails unless there is a data issue, and all data issues are cleared earlier, in the transform step.

There are some DB tools that handle this automatically. If you want to do it manually, keep track of the number of records loaded in a hashed file or a text file and update the file as you insert each record. If the job fails in the middle, read the number from the file and process only the records after that point, ignoring the record numbers before it. Try the @INROWNUM variable for better results.

Use a Change Capture stage and output it into a Transformer. Write a routine that aborts the job and call it from the Transformer when @INROWNUM = 1. If the data does not match, a row is passed into the Transformer and the job is aborted.

Page 66: 86043838 Datastage Interview

Suppose a job sequencer synchronizes or controls 4 jobs, but job 1 has a problem. In this situation, go to the Director and check what type of problem is showing: a data type problem, a warning message, a job failure, or a job abort. If the job failed because of a data type problem or a missing column action, go to the Run window -> Tracing -> Performance, or in your target table go to General -> Action, where there are two options:
(i) On Fail -- Commit, Continue
(ii) On Skip -- Commit, Continue
First check how much data has already been loaded, then select Continue for the On Skip option, and for the remaining data that was not loaded select Continue for On Fail. Run the job again and you should get a success message.

The server is UNIX, and the client machine (i.e., your machine, where you design the jobs) is Windows XP Professional.

Datastage jobs are executed in the server machines only. There is nothing that is

stored in the client machine.

A datawarehouse is a repository(centralized as well as distributed) of Data, able

to answer any adhoc,analytical,historical or complex queries.Metadata is data

about data. Examples of metadata include data element descriptions, data type

descriptions, attribute/property descriptions, range/domain descriptions, and

process/method descriptions. The repository environment encompasses all

corporate metadata resources: database catalogs, data dictionaries, and

navigation services. Metadata includes things like the name, length, valid values,

and description of a data element. Metadata is stored in a data dictionary and

repository. It insulates the data warehouse from changes in the schema of

operational systems. In DataStage, under the stage's Interface tab you will find the Input, Output, and Transfer pages; there are 4 tabs in total and the last one is Build, under which you can find the TABLE NAME.

Page 67: 86043838 Datastage Interview

The DataStage client components are:

Administrator - Administers DataStage projects and conducts housekeeping on the server.
Designer - Creates DataStage jobs that are compiled into executable programs.
Director - Used to run and monitor the DataStage jobs.
Manager - Allows you to view and edit the contents of the repository.

A hashed file is used for two purposes: 1. to remove duplicate records, and 2. for reference lookups. Each record in a hashed file has three parts: the hashed key, the key header, and the data portion. By using the hashing algorithm and the key value, the lookup is faster.

You can convert all heterogeneous sources into sequential files and join them using a Merge stage, or you can write a user-defined query in the source itself to join them.

Incremental load means a daily load. Whenever you select data from the source, select only the records that were loaded or updated between the timestamp of the last successful load and the start date and time of today's load. For this you have to pass parameters for those two dates: store the last run date and time in a file, read it through a job parameter, and pass the current date and time as the second argument.
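A hedged sketch of the "store the last run date and time in a file" approach described above; the project, job, parameter names, and file path are assumptions for illustration:

  # read the previous run's timestamp and capture the current one
  LAST_RUN=$(cat /opt/etl/state/last_run.txt)
  NOW=$(date '+%Y-%m-%d %H:%M:%S')

  # run the job with both bounds as job parameters; on success, save the new timestamp
  $DSHOME/bin/dsjob -run -param LastRunTs="$LAST_RUN" -param CurrentRunTs="$NOW" \
      -jobstatus MyProject LoadCustomers \
    && echo "$NOW" > /opt/etl/state/last_run.txt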

IBM® InfoSphere™ DataStage®: A sequence job collects job run information after each job activity is run. This information can be written to the job log or sent by email using the Notification Activity stage. If any stages or links in a job activity produce warning or error messages during the job run, the last warning or error message is retrieved and added to the report.

Page 68: 86043838 Datastage Interview

What is DatawareHouse? Concept of Dataware

house?

What type of data available in Datawarehouse?

What is Node? What is Node Configuration?

What are the types of nodes in datastage?

DataStage Important Interview Question And Answer

Page 69: 86043838 Datastage Interview

What is the use of Nodes

Fork-join

Execution flow

Conductor

Section

Player

Page 70: 86043838 Datastage Interview

What are descriptor file and data file in Dataset.

What is Job Commit ( in Datastage).

What is Iconv and Oconv functions

How to Improve Performance of Datastage Jobs?

Page 71: 86043838 Datastage Interview

Difference between Server Jobs and Parallel Jobs

Page 72: 86043838 Datastage Interview

Difference between Datastage and Informatica.

What is a compiler? Compilation process in

datastage

What is Modelling Of Datastage?

Page 73: 86043838 Datastage Interview

Types Of Modelling ?

What is DataMart, Importance and Advantages?

Page 74: 86043838 Datastage Interview

Data Warehouse vs. Data Mart

What are different types of error in datastage?

Page 75: 86043838 Datastage Interview

What are the client components in DataStage 7.5x2

version?

Page 76: 86043838 Datastage Interview

Difference Between 7.5x2 And 8.0.1?

Page 77: 86043838 Datastage Interview

What is IBM Infosphere? And History

What is Datastage Project Contains?

Page 78: 86043838 Datastage Interview

What is Difference Between Hash And Modulus

Technique?

What are Features of Datastage?

Page 79: 86043838 Datastage Interview

ETL Project Phase?

What is RCP?

Page 80: 86043838 Datastage Interview

What is Roles And Responsibilties of Software

Engineer?

Server Component of DataStage 7.5x2 version?

Page 81: 86043838 Datastage Interview

How to create Group ID in Sort Stage?

What is Fastly Changing Dimension?

Force Compilation ?

Page 82: 86043838 Datastage Interview

how many rows sorted in sort stage by default in

server jobs

when we have to go for a sequential file stage &

for a

dataset in datastage?

what is the diff b/w switch and filter stage in

datastage?

specify data stage strength?

symmetric multiprocessing (SMP)

Page 83: 86043838 Datastage Interview

Briefly state different between data ware house &

data mart?

What are System variables?

What are Sequencers?

Whats difference betweeen operational data stage

(ODS) and data warehouse?

What is the difference between Hashfile and

Sequential File?

Page 84: 86043838 Datastage Interview

What is OCI?

Which algorithm you used for your hashfile?

how to perform left outer join and right outer join

in lookup stage

What is the difference between DataStage and

DataStage Scripting?

Orchestrate Vs Datastage Parallel Extender?

The above might raise another question: why do we have to load the dimension tables first, then the fact tables:

Page 85: 86043838 Datastage Interview

how to create batches in Datastage from command

prompt

How will the performance affect if we use more

number of Transformer stages in Datastage parallel

jobs?

Page 86: 86043838 Datastage Interview

What various validations do you perform on the

data after extraction?

What are the Profile Stage, Quality Stage and Audit Stage in DataStage? Please explain in detail.

How do you fix the error "OCI has fetched

truncated data" in DataStage

Why is a hash file faster than a sequential file and an ODBC stage?

Page 87: 86043838 Datastage Interview

how to fetch the last row from a particular column..

Input file may be sequential file...

What is project life cycle and how do you

implement it?

What is the alternative way where we can do job

control??

Page 88: 86043838 Datastage Interview

Is it possible for two users to access the same job at a time in DataStage?

How to kill the job in data stage?

What is Integrated & Unit testing in DataStage ?

How do you clean the DataStage repository?

Give one real-time situation where the Link Partitioner stage is used?

Page 89: 86043838 Datastage Interview

What are the transaction size and array size in the OCI stage? How can these be used?

How do you do Usage analysis in datastage ?

Page 90: 86043838 Datastage Interview

A data warehouse is a database which is used to store data from heterogeneous sources, with characteristics like:

a) Subject Oriented

b) Historical Information

c) Integrated

d) Non Volatile

e) Time Variant

The source will be an Online Transaction Processing (OLTP) system. The warehouse collects its data from OLTP, which maintains data for only 30 - 90 days and is time sensitive. If we would like to store the data for a long period, we need a permanent database; that is the Archive Database (AD).

Data in the data warehouse comes from the client systems. Data that you are using to manage your business is very important, and it is manipulated according to the client requirements.

A node is a logical CPU in DataStage.

Each node in a configuration file is distinguished by a virtual name and defines the number and speed of CPUs, memory availability, etc.

Node configuration is a technique of creating logical CPUs.

The degree of parallelism of parallel jobs depends on the number of nodes you define in your configuration file. Nodes are just processes created logically by the OS.

Basically two types of nodes exist:

a) Conductor node: the DataStage engine is loaded into the conductor node.

b) Processing nodes: one section leader is created per node. Section leaders fork the player processes.

DataStage Important Interview Question And Answer

Page 91: 86043838 Datastage Interview

In a Grid environment a node is the place where the jobs are executed.

Nodes are like processors: if we have more nodes when running the job, the performance will be better because the job runs in parallel and becomes more efficient.

A job is split into N sub-jobs which are served by each of the N servers. After service, each sub-job waits until all other sub-jobs have also been processed. The sub-jobs are then rejoined and leave the system.

Actual data flows from player to player — the conductor and section leader

are only used to control process execution through control and message

channels.

* Conductor is the initial framework process. It creates the Section Leader (SL)

processes (one per node), consolidates messages to the DataStage log, and

manages orderly shutdown. The Conductor node has the start-up process.

The Conductor also communicates with the players.

* Section Leader is a process that forks player processes (one per stage) and

manages up/down communications. SLs communicate between the

conductor and player processes only. For a given parallel configuration file,

one section leader will be started for each logical node.

* Players are the actual processes associated with the stages. It sends stderr

and stdout to the SL, establishes connections to other players for data flow,

and cleans up on completion. Each player has to be able to communicate

with every other player. There are separate communication channels

(pathways) for control, errors, messages and data. The data channel does

not go through the section leader/conductor as this would limit scalability.

Data flows directly from upstream operator to downstream operator.

Page 92: 86043838 Datastage Interview

Descriptor and Data files are the dataset files.

Descriptor file contains the Schema details and address of the data.

And Data file contains the data in the native format.

In the DRS stage we have a transaction isolation option, set to Read Committed.

And we set the Array Size and Transaction Size to 10 and 2000 respectively, so that it will commit after every 2000 records.

Iconv and Oconv functions are used to convert data formats such as dates.

Iconv() is used to convert string to Internal storage format.

Oconv() is used to convert expression to an output format.

Performance of the job is really important to maintain. Some precautions for getting good performance from jobs are as follows: avoid relying on a single flow when performance testing or tuning; work in increments; and isolate and solve one job at a time.

Page 93: 86043838 Datastage Interview

For that:

a) Avoid using the Transformer stage where it is not necessary. For example, if you are using a Transformer stage only to change column names or to drop columns, use a Copy stage instead; it will give better performance in the job.

b) Take care to choose the correct partitioning technique, according to the job and the requirement.

c) Use user-defined queries for extracting the data from databases.

d) If the data volume is small, use SQL join statements rather than a Lookup stage.

e) If you have a large number of stages in the job, divide the job into multiple jobs.

Server jobs work only if the DataStage server edition has been installed on your system. Server jobs do not support parallelism and partitioning techniques. Server jobs generate BASIC programs after job compilation.

Parallel jobs work if you have installed the Enterprise Edition. They run on DataStage servers that are SMP (Symmetric Multi-Processing), MPP (Massively Parallel Processing), etc. Parallel jobs generate OSH (Orchestrate Shell) programs after job compilation. Different stages are available, such as Dataset, Lookup, etc.

Server jobs work in a sequential way, while parallel jobs work in a parallel fashion (the Parallel Extender works on the principle of pipeline and partition parallelism) for input/output processing.

Page 94: 86043838 Datastage Interview

The difference between DataStage and Informatica is:

DataStage has partitioning, parallelism, Lookup, Merge, etc.

Informatica does not have this concept of partitioning and parallelism, and its file lookup is comparatively poor.

Compilation is the process of converting the GUI design into machine code, that is, a machine-understandable language.

In this process it checks all the link requirements and mandatory stage property values, and whether there are any logical errors.

The compiler produces OSH code.

Modeling is a logical and physical representation of the source system.

Modeling has two main modeling tools: ERwin and ER/Studio.

In the source system there will be an ER model, and in the target system there will be an ER model and a dimensional model.

Dimension: the table which is designed from the client's perspective. We can look at the data in many ways through the dimension tables.

Page 95: 86043838 Datastage Interview

And there are two types of Models.

They are

Forward Engineering (F.E)

Reverse Engineering (R.E)

F.E:- Forward Engineering is the process of starting the model from scratch, for example for the banking sector.

Ex: any bank which requires a data warehouse.

R.E:- Reverse Engineering is the process of altering an existing model for another bank.

A data mart is a repository of data gathered from operational data and other

sources that is designed to serve a particular community of knowledge

workers. In scope, the data may derive from an enterprise-wide database or

data warehouse or be more specialized. The emphasis of a data mart is on

meeting the specific demands of a particular group of knowledge users in

terms of analysis, content, presentation, and ease-of-use. Users of a data

mart can expect to have data presented in terms that are familiar. There are many reasons to create a data mart, and it has a lot of importance and advantages.

It is easy to access frequently needed data from the database when required by the client.

We can give access to a group of users to view the data mart when it is required, and of course performance will be good.

It is easy to create and maintain a data mart, since it relates to a specific business area.

And it is lower cost to create a data mart than to create a data warehouse with a huge amount of space.

Page 96: 86043838 Datastage Interview

A data warehouse tends to be a strategic but somewhat unfinished

concept. The design of a data warehouse tends to start from an analysis of

what data already exists and how it can be collected in such a way that the

data can later be used. A data warehouse is a central aggregation of data

(which can be distributed physically);

A data mart tends to be tactical and aimed at meeting an immediate

need. The design of a data mart tends to start from an analysis of user needs.

A data mart is a data repository that may derive from a data warehouse or

not and that emphasizes ease of access and usability for a particular designed

purpose.

You may get many errors in datastage while compiling the jobs or running the

jobs.

Some of the errors are as follows

a)Source file not found. If you are trying to read the file, which was not there

with that name.

b)Some times you may get Fatal Errors.

c) Data type mismatches. This will occur when data type mismatches occurs

in the jobs.

d) Field Size errors.

e) Meta data Mismatch

f) Data type size between source and target different

g) Column Mismatch

h) Process timeout. If the server is busy, this error can come sometimes.

Page 97: 86043838 Datastage Interview

In the DataStage 7.5x2 version, there are 4 client components. They are

1) Datastage Designer

2) Datastage Director

3) Datastage Manager

4) Datastage Admin

In Datastage Designer, We

Create the Jobs

Compile the Jobs

Run the Jobs

In Director, We can

View the Jobs

View the Logs

Batch Jobs

Unlock Jobs

Scheduling Jobs

Monitor the JOBS

Message Handling

Page 98: 86043838 Datastage Interview

1) In DataStage 7.5x2, there are 4 client components. They are

a) Datastage Designer

b) Datastage Director

c) Datastage Manager

d) Datastage Admin

And in

2) Datastage 8.0.1 Version, there are 5 components. They are

a) Datastage Designer

b) Datastage Director

c) Datastage Admin

d) Web Console

e) Information Analyzer

Here the Datastage Manager functionality is integrated into the Datastage Designer.

2) The DataStage 7.5x2 version is OS dependent; that is, OS users are DataStage users.

Page 99: 86043838 Datastage Interview

DataStage is a product owned by IBM.

DataStage is an ETL tool and it is platform independent.

ETL means Extraction, Transformation and Loading.

DataStage was introduced by a company called VMark under the name DataIntegrator in the UK in the year 1997.

Later it was acquired by other companies, and finally it reached IBM in 2005.

DataStage got parallel capabilities when it was integrated with Orchestrate, and got platform-independent capabilities when it was integrated with the MKS Toolkit.

DataStage is a comprehensive ETL tool. It is used to extract, transform and load data in jobs. A DataStage project is worked on within DataStage: we can log in through the DataStage Designer in order to enter a DataStage project for designing jobs, etc.

DataStage jobs are maintained according to the project standards.

Every project contains the DataStage jobs, built-in components, table definitions, the repository and the other components required for the project.

Page 100: 86043838 Datastage Interview

Hash and Modulus techniques are Key based partition techniques.

Hash and Modulus techniques are used for different purpose.

If Key column data type is textual then we use hash partition technique for

the job.

If Key column data type is numeric, we use modulus partition technique.

If one key column numeric and another text then also we use hash partition

technique.

if both the key columns are numeric data type then we use modulus partition

technique.

1)Any to Any

That means DataStage can extract the data from any source and can load the data into any target.

2) Platform Independent

A job developed on one platform can run on any other platform. That means if we design a job for uniprocessor-level processing, it can also be run on an SMP machine.

3 )Node Configuration

Node Configuration is a technique to create logical C.P.U

Node is a Logical C.P.U

4) Partition Parallelism

Partition parallelism is a technique of distributing the data across the nodes based on the partition techniques. The partition techniques are

a) Key based Techniques are

1 ) Hash 2)Modulus 3) Range 4) DB2

Page 101: 86043838 Datastage Interview

And four phases are

1) Data Profiling

2) Data Quality

3) Data Transformation

4) Meta data management

Data Profiling:-

Data profiling is performed in 5 steps. Data profiling will analyze whether the source data is good or dirty.

And these 5 steps are

a) Column Analysis

b) Primary Key Analysis

c) Foreign Key Analysis

d) Cross domain Analysis

e) Base Line analysis

After completing the analysis, if the data is good there is no problem. If your data is dirty, it will be sent for cleansing. This will be done in the second phase.

Data Quality:-

Data Quality, after getting the dirty data it will clean the data by using 5

RCP is nothing but Runtime Column Propagation. When we run the Datastage

Jobs, the columns may change from one stage to another stage. At that point

of time we will be loading the unnecessary columns in to the stage, which is

not required. If we want to load only the required columns into the target, we can do this by enabling RCP. If we enable RCP, we can send only the required columns to the target.

Page 102: 86043838 Datastage Interview

Roles and Responsibilities of Software Engineer are

1) Preparing Questions

2) Logical Designs ( i.e Flow Chart )

3) Physical Designs ( i.e Coding )

4) Unit Testing

5) Performance Tuning.

6) Peer Review

7) Design Turnover Document or Detailed Design Document or Technical

design Document

8) Doing Backups

9) Job Sequencing ( It is for Senior Developer )

There are three Architecture Components in datastage 7.5x2

They are

Repository:--

Repository is an environment where we create job, design, compile and run

etc.

Some Components it contains are

JOBS,TABLE DEFINITIONS,SHARED CONTAINERS, ROUTINES ETC

Server( engine):-- Here it runs executable jobs that extract , transform,

and

load data into a datawarehouse.

Datastage Package Installer:--

It is a user interface used to install packaged datastage jobs and plugins.

Page 103: 86043838 Datastage Interview

Group ids are created in two different ways. We can create group id's by

using

a) Key Change Column

b) Cluster Key change Column

Both of the options used to create group id's .

When we select any option and keep true. It will create the Group id's group

wise.

Data will be divided into the groups based on the key column and it will give

(1) for the first row of every group and (0) for rest of the rows in all groups.

Key change column and Cluster Key change column used, based on the data

we are getting from the source.

If the data we are getting is not sorted , then we use key change column to

create group id's

If the data we are getting is sorted data, then we use Cluster Key change

Column to create Group Id's .

The entities in a dimension which change rapidly are called a rapidly (fast) changing dimension. The best example is ATM machine transactions.

For parallel jobs there is also a force compile option. The compilation of

parallel jobs is by default optimized such that transformer stages only get

recompiled if they have changed since the last compilation. The force

compile option overrides this and causes all transformer stages in the job

to be compiled. To select this option:

• Choose File ➤ Force Compile

Page 104: 86043838 Datastage Interview

10,000

When the memory requirement is high, go for a Dataset; a sequential file does not support more than 2 GB.

Filter: 1) We can write multiple conditions on multiple fields. 2) It supports one input link and n number of output links.

Switch: 1) Multiple conditions on a single field (column). 2) It supports one input link and 128 output links.

The major strength of the datastage are :

Partitioning,

pipelining,

Node configuration,

handles Huge volume of data,

Platform independent.

symmetric multiprocessing (SMP) involves a multiprocessor computer

hardware architecture where two or more identical processors are connected

to a single shared main memory and are controlled by a single OS instance.

Most common multiprocessor systems today use an SMP architecture.

Page 105: 86043838 Datastage Interview

Data warehouse is made up of many datamarts. DWH contain many

subject areas. However, data mart focuses on one subject area generally. E.g.

If there is a DWH for a bank, then there can be one data mart for accounts, one for loans, etc. These are high-level definitions.

A data mart (DM) is the access layer of the data warehouse (DW)

environment that is used to get data out to the users. The DM is a subset of

the DW, usually oriented to a specific business line or team.

System variables comprise of a set of variables which are used to get system

information and they can be accessed from a transformer or a routine. They

are read only and start with an @.

A sequencer allows you to synchronize the control flow of multiple activities

in a job sequence. It can have multiple input triggers as well as multiple

output triggers.

A data warehouse is a decision-support database for organisational needs. It is a subject-oriented, non-volatile, integrated, time-variant collection of data.

An ODS (Operational Data Store) is an integrated collection of related information. It contains at most around 90 days of information.

The ODS is part of the transactional layer: this DB keeps integrated data from different transactional databases and allows common operations across the organisation, e.g. banking transactions.

In simple terms, ODS is dynamic data.

Hash file stores the data based on hash algorithm and on a key value.

A sequential file is just a file with no key column.

Hash file used as a reference for look up.

Sequential file cannot.

Page 106: 86043838 Datastage Interview

If you mean the Oracle Call Interface (OCI), it is a set of low-level APIs used to interact with Oracle databases. It allows one to perform operations like logon, execute, parse, etc. from a C or C++ program.

It uses GENERAL or SEQ.NUM. algorithm

In Lookup stage properties, you will have constraints option. If you click on

constraints button- you will get options like continue, drop, fail and reject

If you select the option continue: it means left outer join operation will be

performed.

If you select the option drop: it means inner join operation will be performed.
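In SQL terms the two options behave like the joins below (CUSTOMER and COUNTRY are hypothetical stream and reference tables, used only to illustrate):

-- Continue = left outer join: stream rows are kept even with no reference match
SELECT c.*, r.country_name
FROM CUSTOMER c
LEFT JOIN COUNTRY r ON c.country_code = r.country_code;

-- Drop = inner join: only stream rows with a reference match are kept
SELECT c.*, r.country_name
FROM CUSTOMER c
INNER JOIN COUNTRY r ON c.country_code = r.country_code;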

DataStage jobs, when compiled, generate OSH. OSH is the abbreviation of Orchestrate Shell, the Orchestrate scripting language. When a DataStage job is run, the generated OSH is executed in the backend.

Orchestrate itself is an ETL tool with extensive parallel processing capabilities

and running on UNIX platform. Datastage used Orchestrate with Datastage XE

(Beta version of 6.0) to incorporate the parallel processing capabilities. Now

Datastage has purchased Orchestrate and integrated it with Datastage XE and

released a new version Datastage 6.0 i.e Parallel Extender.

In a dimensional model, fact tables are dependent on the dimension tables. This means that the fact table contains foreign keys to the dimension tables. This is the reason the dimension tables are loaded first and then the fact tables.

Page 107: 86043838 Datastage Interview

From command prompt batches can be created in following way :

a) create a batch file say RunbatchJobs.bat

b) Open this file in notepad.

c) Now,write the command "dsjob" with proper syntax for each job you want

to run.

d) If there are four jobs to be run in a batch, use the dsjob command 4 times with different job names on each line.

e) Save the file and close it.

f) Next time whenever you want to run the jobs,just click on the batch file

"RunbatchJobs.bat".All jobs will run one by one by the batch file.

Traditionally batch programs are created in the following way :

A Batch Program is used to run a batch of jobs by writing the Server routine

code in the job control section.To generate a batch program,do following:

a) Open datasatge director.

b) Go to Tools->Batch->New.

c) A new window will open with the "Job Control" tab selected.

d)Write the routine code and save it.You may run multiple jobs in batch by

making use of this.

Transformer stages compile to C++ whereas other stages compile into OSH (Orchestrate Shell scripting language). If the number of Transformers is large, the first thing impacted is the compilation time: it will take more time to compile the Transformer stages.

Practically, the Transformer stage does not really have a performance impact on DataStage jobs.

If the number of stages in your jobs is large, the performance will be impacted (not necessarily by Transformer stages). Hence, try to implement the job logic using the minimum number of stages in your DataStage jobs.

Page 108: 86043838 Datastage Interview

NULL check

MetaData Check

Duplicate Check

Invalid Check
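As an illustration, some of these checks can be expressed as simple SQL against a staging table (STG_CUSTOMER and its columns are hypothetical names):

-- NULL check on a mandatory column
SELECT COUNT(*) FROM STG_CUSTOMER WHERE CUSTOMER_ID IS NULL;

-- Duplicate check on the business key
SELECT CUSTOMER_ID, COUNT(*)
FROM STG_CUSTOMER
GROUP BY CUSTOMER_ID
HAVING COUNT(*) > 1;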

Profile stage: is a profiling tool to investigate data sources to see inherent

structures, frequencies of phrases, identify datatypes, etc.. In addition it can,

based on the real data, rather than metadata, suggest a data model for the

union of your data sources. This datamodel would be in 3 NF.

Quality Stage: is now embedded in Information Server and provides

functionality for fuzzy matching records and for standardizing record fields

based on predefined rules.

Audit stage: is now a part of Information Analyzer. This part of IA can, based

on predefined rules, expose exceptions in your data from the required

format, contents and relationships.

This error occurs when the Oracle stage tries to fetch a column value like 34.55676776... when its data type is actually decimal(10,2). The solution here is to either truncate or round the data to 2 decimal positions.
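For example, the rounding can be done in the extraction query itself (Oracle syntax; the table and column names are only placeholders):

SELECT ROUND(amount, 2) AS amount
FROM source_table;
-- or TRUNC(amount, 2) to truncate instead of rounding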

A hash file is a DataStage internal file. Its data can be held in memory and it works on a key column, so retrieval will be faster compared to hitting the database.

Page 109: 86043838 Datastage Interview

Develop a job: source Sequential File stage --> Transformer --> output stage

In the transformer write a stage variable as rowcount with the following

derivation

Goto DSfunctions click on DSGetLinkInfo..

you will get "DSGetLinkInfo(DSJ.ME,%Arg2%,%Arg3%,%Arg4%)"

Arg 2 is your source stage name

Arg 3 is your source link name

Arg 4 --> Click DS Constant and select DSJ.LINKROWCOUNT.

Now your derivation is

"DSGetLinkInfo(DSJ.ME,"source","link", DSJ.LINKROWCOUNT)"

Create a constraint as @INROWNUM =rowcount

and map the required column to output link.

Project life cycle is related to SDLC

that is software development life cycle....which mean there are 4 stages

involved

that is

1)Analysis

2)development

3)Testing

4)Implementation

This covers the entire project life cycle !

Jobcontrol can be done using :

Datastage job Sequencers

Datastage Custom routines

Scripting

Scheduling tools like Autosys

Page 110: 86043838 Datastage Interview

No chance ..... you have to kill the job process.

You can also do it by using DataStage Director -> Cleanup Resources.

Unit Testing:

In the DataStage scenario, unit testing is the technique of testing an individual DataStage job for its functionality.

Integration Testing:

When two or more jobs are collectively tested for their functionality, that is called integration testing.

Remove log files periodically, for example by using the command CLEAR.FILE &PH&

If we want to send more data from the source to the targets quickly, we use the Link Partitioner stage in server jobs; we can make a maximum of 64 partitions. It is an active stage. Normally we cannot connect two active stages, but it is accepted for this stage to connect to a Transformer or Aggregator stage. The data sent from the Link Partitioner will be collected by the Link Collector, again with a maximum of 64 partitions. The Link Collector is also an active stage, so in order to avoid connecting an active stage from the Transformer to the Link Collector we use the Inter Process (IPC) stage. As the IPC stage is a passive stage, the data can then be collected by the Link Collector. But we can use inter-process communication only when the target is a passive stage.

Page 111: 86043838 Datastage Interview

Transaction Size - This field exists for backward compatibility, but it is ignored

for release 3.0 and later of the Plug-in. The transaction size for new jobs is

now handled by Rows per transaction on the Transaction Handling tab on the

Input page.

Rows per transaction - The number of rows written before a commit is

executed for the transaction. The default value is 0, that is, all the rows are

written before being committed to the data table.

Array Size - The number of rows written to or read from the database at a

time. The default value is 1, that is, each row is written in a separate

statement.

1. If you want to know whether some job is a part of a sequence, then in the Manager right-click the job and select Usage Analysis. It will show all the job's dependents.

2. To find how many jobs are using a particular table.

3. To find how many jobs are using a particular routine.

Like this, you can find all the dependents of a particular object.

It is nested: you can move forward and backward and see all the dependents.

Page 112: 86043838 Datastage Interview

SQL SELECT DISTINCT

SQL AND & OR Operators

SQL ORDER BY

SQL UPDATE


Page 113: 86043838 Datastage Interview

SQL DELETE

SQL SUBQUERY

SQL CASE

Page 114: 86043838 Datastage Interview

SQL TOP

SQL LIKE

SQL IN

SQL BETWEEN

Page 115: 86043838 Datastage Interview

SQL Alias

SQL Joins

SQL INNER JOIN

SQL LEFT JOIN

Page 116: 86043838 Datastage Interview

SQL RIGHT JOIN

SQL FULL JOIN

SQL UNION

SQL INTERSECT

Page 117: 86043838 Datastage Interview

SQL MINUS

SQL LIMIT

SQL CREATE DATABASE

SQL CREATE TABLE

Page 118: 86043838 Datastage Interview

SQL Constraints

SQL NOT NULL

SQL UNIQUE

Page 119: 86043838 Datastage Interview

SQL PRIMARY KEY

Page 120: 86043838 Datastage Interview

SQL FOREIGN KEY

SQL CHECK

Page 121: 86043838 Datastage Interview

SQL DEFAULT

Page 122: 86043838 Datastage Interview

SQL CREATE INDEX

SQL ALTER TABLE

SQL AUTO INCREMENT

Page 123: 86043838 Datastage Interview

SQL Views

SQL Date Functions

Page 124: 86043838 Datastage Interview

SQL NULL Values

SQL ISNULL VALUES

Page 125: 86043838 Datastage Interview

SQL COALESCE FUNCTION

SQL IFNULL VALUES

Page 126: 86043838 Datastage Interview

SQL NVL Function

SQL NULLIF FUNCTION

SQL RANK FUNCTION

Page 127: 86043838 Datastage Interview

SQL RUNNINNG TOTAL

SQL PERCENT TOTAL

SQL CUMULATIVE PERCENT TOTAL

Page 128: 86043838 Datastage Interview

SQL Functions

SQL AVG() Function

SQL COUNT() Function

Page 129: 86043838 Datastage Interview

SQL FIRST() Function

SQL MAX() Function

SQL MIN() Function

Page 130: 86043838 Datastage Interview

SQL SUM() Function

SQL GROUP BY Statement

SQL HAVING Clause

Page 131: 86043838 Datastage Interview

SQL Upper() Function/UCASE

SQL lower() Function/LCASE

SQL MID() Function

SQL LENGTH() Function

SQL ROUND() Function

Page 132: 86043838 Datastage Interview

SQL NOW() Function

Concatenate Function

Substring Function

STRING FUNCTION

Page 133: 86043838 Datastage Interview

INSTR Function

Trim Function

Page 134: 86043838 Datastage Interview

Length Function

Replace Function

DATEADD FUNCTION

DATEDIFF FUNCTION

DATEPART FUNCTION

DATE FUNCTION (SQL SERVER)

Page 135: 86043838 Datastage Interview

GETDATE FUNCTION

SYSDATE FUNCTION

Page 136: 86043838 Datastage Interview

In a table, some of the columns may contain duplicate values. This is not a

problem, however, sometimes you will want to list only the different (distinct)

values in a table.

The DISTINCT keyword can be used to return only distinct (different) values.

SELECT DISTINCT column_name(s)

FROM table_name

The AND operator displays a record if both the first condition and the second condition are true.

The OR operator displays a record if either the first condition or the second condition is true.

AND example:

SELECT * FROM Persons

WHERE FirstName='Tove'

AND LastName='Svendson'

OR

SELECT * FROM Persons

WHERE FirstName='Tove'

OR FirstName='Ola'

The ORDER BY keyword is used to sort the result-set by a specified column.

The ORDER BY keyword sorts the records in ascending order by default.

If you want to sort the records in a descending order, you can use the DESC

keyword.

SQL ORDER BY Syntax

SELECT column_name(s)

FROM table_name

ORDER BY column_name(s) ASC|DESC

The UPDATE statement is used to update records in a table.

UPDATE table_name

SET column1=value, column2=value2,...

WHERE some_column=some_value


Page 137: 86043838 Datastage Interview

The DELETE statement is used to delete records in a table.

DELETE FROM table_name

WHERE some_column=some_value

It is possible to embed a SQL statement within another. When this is done on the

WHERE or the HAVING statements, we have a subquery construct.

The syntax is as follows:

SELECT "column_name1"

FROM "table_name1"

WHERE "column_name2" [Comparison Operator]

(SELECT "column_name3"

FROM "table_name2"

WHERE [Condition])
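For example, using the Store_Information table from the CASE example below, this subquery returns the stores whose sales are above the average:

SELECT store_name, Sales
FROM Store_Information
WHERE Sales > (SELECT AVG(Sales) FROM Store_Information)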

Case is used to provide if-then-else type of logic to SQL. Its syntax is:

SELECT CASE ("column_name")

WHEN "condition1" THEN "result1"

WHEN "condition2" THEN "result2"

...

[ELSE "resultN"]

END

FROM "table_name"

"condition" can be a static value or an expression. The ELSE clause is optional.

Example :- SELECT store_name, CASE store_name

WHEN 'Los Angeles' THEN Sales * 2

WHEN 'San Diego' THEN Sales * 1.5

ELSE Sales

END

"New Sales",

Date

FROM Store_Information

Page 138: 86043838 Datastage Interview

The TOP clause is used to specify the number of records to return. TOP is SQL Server syntax; in Oracle the equivalent is to filter on ROWNUM:

SELECT column_name(s)
FROM table_name
WHERE ROWNUM <= number
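The SQL Server form uses the TOP keyword directly (MySQL uses LIMIT, covered later), for example:

SELECT TOP number column_name(s)
FROM table_name

SELECT TOP 2 * FROM Persons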

The LIKE operator is used in a WHERE clause to search for a specified pattern in a column.

Cities starting with the character 's':

SELECT * FROM Persons
WHERE City LIKE 's%'

Cities ending with the character 's':

SELECT * FROM Persons
WHERE City LIKE '%s'

Cities that do not contain 'tav':

SELECT * FROM Persons
WHERE City NOT LIKE '%tav%'

The IN operator allows you to specify multiple values in a WHERE clause.

SQL IN Syntax

SELECT column_name(s)

FROM table_name

WHERE column_name IN (value1,value2,...)
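For example, using the Persons table from the earlier examples (this returns the same rows as the OR example above):

SELECT * FROM Persons
WHERE FirstName IN ('Tove','Ola')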

The BETWEEN operator is used in a WHERE clause to select a range of data

between two values.

SQL BETWEEN Syntax

SELECT column_name(s)

FROM table_name

WHERE column_name

BETWEEN value1 AND value2
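For example, with the Persons table used elsewhere in this document:

SELECT * FROM Persons
WHERE P_Id BETWEEN 1 AND 3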

Page 139: 86043838 Datastage Interview

With SQL, an alias name can be given to a table or to a column.

SQL Alias Syntax for Tables

SELECT column_name(s)

FROM table_name

AS alias_name

SQL Alias Syntax for Columns

SELECT column_name AS alias_name

FROM table_name

SQL joins are used to query data from two or more tables, based on a

relationship between certain columns in these tables.

The INNER JOIN keyword return rows when there is at least one match in both

tables.

SQL INNER JOIN Syntax

SELECT column_name(s)

FROM table_name1

INNER JOIN table_name2

ON table_name1.column_name=table_name2.column_name

The LEFT JOIN keyword returns all rows from the left table (table_name1), even

if there are no matches in the right table (table_name2).

SQL LEFT JOIN Syntax

SELECT column_name(s)

FROM table_name1

LEFT JOIN table_name2

ON table_name1.column_name=table_name2.column_name
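A small worked example, using the Persons and Orders tables defined later in this document: the inner join returns only persons that have orders, while the left join also returns persons without any orders (OrderNo will be NULL for them).

SELECT p.LastName, o.OrderNo
FROM Persons p
INNER JOIN Orders o ON p.P_Id = o.P_Id

SELECT p.LastName, o.OrderNo
FROM Persons p
LEFT JOIN Orders o ON p.P_Id = o.P_Id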

Page 140: 86043838 Datastage Interview

The RIGHT JOIN keyword returns all the rows from the right table (table_name2),

even if there are no matches in the left table (table_name1).

SQL RIGHT JOIN Syntax

SELECT column_name(s)

FROM table_name1

RIGHT JOIN table_name2

ON table_name1.column_name=table_name2.column_name

The FULL JOIN keyword return rows when there is a match in one of the tables.

SQL FULL JOIN Syntax

SELECT column_name(s)

FROM table_name1

FULL JOIN table_name2

ON table_name1.column_name=table_name2.column_name

The UNION operator is used to combine the result-set of two or more SELECT

statements.

Notice that each SELECT statement within the UNION must have the same

number of columns. The columns must also have similar data types. Also, the

columns in each SELECT statement must be in the same order.

SQL UNION Syntax

SELECT column_name(s) FROM table_name1

UNION

SELECT column_name(s) FROM table_name2

Similar to the UNION command, INTERSECT also operates on two SQL

statements. The difference is that, while UNION essentially acts as an OR

operator (value is selected if it appears in either the first or the second

statement), the INTERSECT command acts as an AND operator (value is selected

only if it appears in both statements).

The syntax is as follows:

[SQL Statement 1]

INTERSECT

[SQL Statement 2]

Page 141: 86043838 Datastage Interview

The MINUS operator works on two SQL statements. It takes all the results from the first SQL statement, and then subtracts the ones that are present in the second

SQL statement to get the final answer. If the second SQL statement includes

results not present in the first SQL statement, such results are ignored.

The syntax is as follows:

[SQL Statement 1]

MINUS

[SQL Statement 2]

We may not want to retrieve all the records that satisfy the criteria specified in the WHERE or HAVING clauses.

In MySQL, this is accomplished using the LIMIT keyword. The syntax for LIMIT is

as follows:

[SQL Statement 1]

LIMIT [N]

The CREATE DATABASE statement is used to create a database.

SQL CREATE DATABASE Syntax

CREATE DATABASE database_name

The CREATE TABLE statement is used to create a table in a database.

SQL CREATE TABLE Syntax

CREATE TABLE table_name

(

column_name1 data_type,

column_name2 data_type,

column_name3 data_type,

....

)

Page 142: 86043838 Datastage Interview

Constraints are used to limit the type of data that can go into a table.

Constraints can be specified when a table is created (with the CREATE TABLE

statement) or after the table is created (with the ALTER TABLE statement).

We will focus on the following constraints:

NOT NULL

UNIQUE

PRIMARY KEY

FOREIGN KEY

CHECK

DEFAULT

The NOT NULL constraint enforces a column to NOT accept NULL values.

CREATE TABLE Persons

(

P_Id int NOT NULL,

LastName varchar(255) NOT NULL,

FirstName varchar(255),

Address varchar(255),

City varchar(255)

)

The UNIQUE constraint uniquely identifies each record in a database table.

The UNIQUE and PRIMARY KEY constraints both provide a guarantee for

uniqueness for a column or set of columns.

A PRIMARY KEY constraint automatically has a UNIQUE constraint defined on it.

Note: that you can have many UNIQUE constraints per table, but only one

PRIMARY KEY constraint per table.

Page 143: 86043838 Datastage Interview

CREATE TABLE Persons

(

P_Id int NOT NULL,

LastName varchar(255) NOT NULL,

FirstName varchar(255),

Address varchar(255),

City varchar(255),

CONSTRAINT uc_PersonID UNIQUE (P_Id,LastName)

)

SQL UNIQUE Constraint on ALTER TABLE

ALTER TABLE Persons

ADD CONSTRAINT uc_PersonID UNIQUE (P_Id,LastName)

To DROP a UNIQUE Constraint

ALTER TABLE Persons

DROP CONSTRAINT uc_PersonID

The PRIMARY KEY constraint uniquely identifies each record in a database table.

Primary keys must contain unique values.

A primary key column cannot contain NULL values.

Each table should have a primary key, and each table can have only ONE primary

key.

CREATE TABLE Persons

(

P_Id int NOT NULL PRIMARY KEY,

LastName varchar(255) NOT NULL,

FirstName varchar(255),

Address varchar(255),

City varchar(255)

)

SQL PRIMARY KEY Constraint on ALTER TABLE

ALTER TABLE Persons

ADD CONSTRAINT pk_PersonID PRIMARY KEY (P_Id,LastName)

To DROP a PRIMARY KEY Constraint

ALTER TABLE Persons

DROP CONSTRAINT pk_PersonID

Page 144: 86043838 Datastage Interview

A FOREIGN KEY in one table points to a PRIMARY KEY in another table.

CREATE TABLE Orders

(

O_Id int NOT NULL PRIMARY KEY,

OrderNo int NOT NULL,

P_Id int FOREIGN KEY REFERENCES Persons(P_Id)

)

SQL FOREIGN KEY Constraint on ALTER TABLE

To create a FOREIGN KEY constraint on the "P_Id" column when the "Orders"

table is already created, use the following SQL:

ALTER TABLE Orders

ADD CONSTRAINT fk_PerOrders

FOREIGN KEY (P_Id)

REFERENCES Persons(P_Id)

To DROP a FOREIGN KEY Constraint

ALTER TABLE Orders

DROP CONSTRAINT fk_PerOrders

The CHECK constraint is used to limit the value range that can be placed in a

column.

If you define a CHECK constraint on a single column it allows only certain values

for this column.

If you define a CHECK constraint on a table it can limit the values in certain

columns based on values in other columns in the row.

CREATE TABLE Persons

(

P_Id int NOT NULL CHECK (P_Id>0),

LastName varchar(255) NOT NULL,

FirstName varchar(255),

Address varchar(255),

City varchar(255)

)

Page 145: 86043838 Datastage Interview

SQL CHECK Constraint on ALTER TABLE

To create a CHECK constraint on the "P_Id" column when the table is already

created, use the following SQL:

ALTER TABLE Persons

ADD CONSTRAINT chk_Person CHECK (P_Id>0 AND City='Sandnes')

To DROP a CHECK Constraint

To drop a CHECK constraint, use the following SQL:

SQL Server / Oracle / MS Access:

ALTER TABLE Persons

DROP CONSTRAINT chk_Person

The DEFAULT constraint is used to insert a default value into a column.

The default value will be added to all new records, if no other value is specified.

CREATE TABLE Persons

(

P_Id int NOT NULL,

LastName varchar(255) NOT NULL,

FirstName varchar(255),

Address varchar(255),

City varchar(255) DEFAULT 'Sandnes'

)

SQL DEFAULT Constraint on ALTER TABLE

ALTER TABLE Persons

ALTER COLUMN City SET DEFAULT 'SANDNES'

To DROP a DEFAULT Constraint

ALTER TABLE Persons

ALTER COLUMN City DROP DEFAULT

Page 146: 86043838 Datastage Interview

An index can be created in a table to find data more quickly and efficiently.

The users cannot see the indexes, they are just used to speed up

searches/queries.

Note: Updating a table with indexes takes more time than updating a table

without (because the indexes also need an update). So you should only create

indexes on columns (and tables) that will be frequently searched against.

SQL CREATE INDEX Syntax

Creates an index on a table. Duplicate values are allowed:

CREATE INDEX index_name

ON table_name (column_name)

SQL CREATE UNIQUE INDEX Syntax

Creates a unique index on a table. Duplicate values are not allowed:

CREATE UNIQUE INDEX index_name

ON table_name (column_name)

The ALTER TABLE statement is used to add, delete, or modify columns in an

existing table.

SQL ALTER TABLE Syntax

To add a column in a table, use the following syntax:

ALTER TABLE table_name

ADD column_name datatype

Very often we would like the value of the primary key field to be created

automatically every time a new record is inserted.

We would like to create an auto-increment field in a table.

Use the following CREATE SEQUENCE syntax:

CREATE SEQUENCE seq_person

MINVALUE 1

START WITH 1

INCREMENT BY 1

CACHE 10
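The CREATE SEQUENCE statement above is Oracle syntax. To use the sequence as an auto-increment value, reference seq_person.nextval in the INSERT statement, for example with the Persons table used elsewhere:

INSERT INTO Persons (P_Id, LastName, FirstName)
VALUES (seq_person.nextval, 'Svendson', 'Ola')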

Page 147: 86043838 Datastage Interview

In SQL, a view is a virtual table based on the result-set of an SQL statement.

A view contains rows and columns, just like a real table. The fields in a view are

fields from one or more real tables in the database.

You can add SQL functions, WHERE, and JOIN statements to a view and present

the data as if the data were coming from one single table.

SQL CREATE VIEW Syntax

CREATE VIEW view_name AS

SELECT column_name(s)

FROM table_name

WHERE condition
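A small example, using the Persons table from earlier sections (the view name is chosen only for illustration):

CREATE VIEW Sandnes_Persons AS
SELECT LastName, FirstName
FROM Persons
WHERE City = 'Sandnes'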

SQL Updating a View

You can update a view by using the following syntax:

SQL CREATE OR REPLACE VIEW Syntax

CREATE OR REPLACE VIEW view_name AS

SELECT column_name(s)

FROM table_name

WHERE condition

SQL Dropping a View

You can delete a view with the DROP VIEW command.

SQL DROP VIEW Syntax

DROP VIEW view_name

The most difficult part when working with dates is to be sure that the format of

the date you are trying to insert, matches the format of the date column in the

database.

SQL Server comes with the following data types for storing a date or a date/time

value in the database:

DATE - format YYYY-MM-DD

DATETIME - format: YYYY-MM-DD HH:MM:SS

SMALLDATETIME - format: YYYY-MM-DD HH:MM:SS

TIMESTAMP - format: a unique number

Page 148: 86043838 Datastage Interview

NULL values represent missing unknown data.

By default, a table column can hold NULL values.

NULL means that data does not exist. NULL does not equal to 0 or an empty

string. Both 0 and empty string represent a value, while NULL has no value.

Any mathematical operations performed on NULL will result in NULL. For

example,

10 + NULL = NULL

SQL IS NULL

How do we select only the records with NULL values in the "Address" column?

We will have to use the IS NULL operator:

SELECT LastName,FirstName,Address FROM Persons

WHERE Address IS NULL

SQL IS NOT NULL

How do we select only the records with no NULL values in the "Address"

column?

We will have to use the IS NOT NULL operator:

SELECT LastName,FirstName,Address FROM Persons

WHERE Address IS NOT NULL

In SQL Server, the ISNULL() function is used to replace NULL value with another

value.

For example, if we have the following table,

Table Sales_Data

store_name, Sales

Store A, 300

Store B, NULL

EXAMPLE :-SELECT SUM(ISNULL(Sales,100)) FROM Sales_Data;

Page 149: 86043838 Datastage Interview

COALESCE function in SQL returns the first non-NULL expression among its

arguments.It is the same as the following CASE statement:

SELECT CASE ("column_name")

WHEN "expression 1 is not NULL" THEN "expression 1"

WHEN "expression 2 is not NULL" THEN "expression 2"

...

[ELSE "NULL"]

END

FROM "table_name"

EXAMPLE :-SELECT Name, COALESCE(Business_Phone, Cell_Phone,

Home_Phone) Contact_Phone

FROM Contact_Info;

This function takes two arguments. If the first argument is not NULL, the function

returns the first argument. Otherwise, the second argument is returned. This

function is commonly used to replace NULL value with another value. It is similar

to the NVL function in Oracle and the ISNULL Function in SQL Server.

For example, if we have the following table,

Table Sales_Data

store_name Sales

Store A 300

Store B NULL

EXAMPLE :- SELECT SUM(IFNULL(Sales,100)) FROM Sales_Data;

returns 400. This is because NULL has been replaced by 100 via the IFNULL function.

Page 150: 86043838 Datastage Interview

The NVL() function is available in Oracle, and not in MySQL or SQL Server. This function is used to replace a NULL value with another value. It is similar to the IFNULL function in MySQL and the ISNULL function in SQL Server.

For example, if we have the following table,

Table Sales_Data

store_name Sales

Store A 300

Store B NULL

Store C 150

EXAMPLE :- SELECT SUM(NVL(Sales,100)) FROM Sales_Data;

returns 550. This is because NULL has been replaced by 100 via the NVL function, hence the sum of the 3 rows is 300 + 100 + 150 = 550.

The NULLIF() function takes two arguments. If the two arguments are equal, then NULL is

returned. Otherwise, the first argument is returned.

It is the same as the following CASE statement:

SELECT CASE ("column_name")

WHEN "expression 1 = expression 2 " THEN "NULL"

[ELSE "expression 1"]

END

FROM "table_name"

EXAMPLE :- SELECT Store_name, NULLIF(Actual,Goal) FROM Sales_Data;

The rank associated with each row is a common request, and there is no

straightforward way to do so in SQL. To display rank in SQL, the idea is to do a

self-join, list out the results in order, and do a count on the number of records

that's listed ahead of (and including) the record of interest. Let's use an example

to illustrate. Say we have the following table,

EXAMPLE :- SELECT a1.Name, a1.Sales, COUNT(a2.sales) Sales_Rank

FROM Total_Sales a1, Total_Sales a2

WHERE a1.Sales <= a2.Sales or (a1.Sales=a2.Sales and a1.Name = a2.Name)

GROUP BY a1.Name, a1.Sales

ORDER BY a1.Sales DESC, a1.Name DESC;

Page 151: 86043838 Datastage Interview

running totals is a common request, and there is no straightforward way to do so

in SQL. The idea for using SQL to display running totals similar to that for

displaying rank: first do a self-join, then, list out the results in order. Where as

finding the rank requires doing a count on the number of records that's listed

ahead of (and including) the record of interest, finding the running total requires

summing the values for the records that's listed ahead of (and including) the

record of interest.

EXAMPLE :- SELECT a1.Name, a1.Sales, SUM(a2.Sales) Running_Total

FROM Total_Sales a1, Total_Sales a2

WHERE a1.Sales <= a2.sales or (a1.Sales=a2.Sales and a1.Name = a2.Name)

GROUP BY a1.Name, a1.Sales

ORDER BY a1.Sales DESC, a1.Name DESC;

To display percent to total in SQL, we want to leverage the ideas we used for rank/running

total plus subquery. Different from what we saw in the SQL Subquery section,

here we want to use the subquery as part of the SELECT.

EXAMPLE :- SELECT a1.Name, a1.Sales, a1.Sales/(SELECT SUM(Sales) FROM

Total_Sales) Pct_To_Total

FROM Total_Sales a1, Total_Sales a2

WHERE a1.Sales <= a2.sales or (a1.Sales=a2.Sales and a1.Name = a2.Name)

GROUP BY a1.Name, a1.Sales

ORDER BY a1.Sales DESC, a1.Name DESC;

To display cumulative percent to total in SQL, we use the same idea as we saw in the

Percent To Total section. The difference is that we want the cumulative percent

to total, not the percentage contribution of each individual row. EXAMPLE :-

SELECT a1.Name, a1.Sales, SUM(a2.Sales)/(SELECT SUM(Sales) FROM

Total_Sales) Pct_To_Total

FROM Total_Sales a1, Total_Sales a2

WHERE a1.Sales <= a2.sales or (a1.Sales=a2.Sales and a1.Name = a2.Name)

GROUP BY a1.Name, a1.Sales

ORDER BY a1.Sales DESC, a1.Name DESC;

Page 152: 86043838 Datastage Interview

SQL Aggregate Functions

SQL aggregate functions return a single value, calculated from values in a

column.

Useful aggregate functions:

AVG() - Returns the average value

COUNT() - Returns the number of rows

FIRST() - Returns the first value

LAST() - Returns the last value

MAX() - Returns the largest value

MIN() - Returns the smallest value

SUM() - Returns the sum

SQL Scalar functions

SQL scalar functions return a single value, based on the input value.

Useful scalar functions:

UCASE() - Converts a field to upper case

LCASE() - Converts a field to lower case

MID() - Extract characters from a text field

LEN() - Returns the length of a text field

ROUND() - Rounds a numeric field to the number of decimals specified

NOW() - Returns the current system date and time

FORMAT() - Formats how a field is to be displayed

The AVG() Function

The AVG() function returns the average value of a numeric column.

SELECT AVG(column_name) as (Alias_column_name)FROM table_name

Now we want to find the customers that have an OrderPrice value higher than

the average OrderPrice value.

We use the following SQL statement:

SELECT Customer FROM Orders

WHERE OrderPrice>(SELECT AVG(OrderPrice) FROM Orders)

The COUNT() function returns the number of rows that match specified criteria.

SQL COUNT(column_name) Syntax

The COUNT(column_name) function returns the number of values (NULL values will not be counted) of the specified column:

SELECT COUNT(column_name) FROM table_name

Page 153: 86043838 Datastage Interview

SQL COUNT(*) Syntax

The COUNT(*) function returns the number of records in a table:

SELECT COUNT(*) FROM table_name

SQL COUNT(DISTINCT column_name) Syntax

The COUNT(DISTINCT column_name) function returns the number of distinct

values of the specified column:

SELECT COUNT(DISTINCT column_name) FROM table_name

The FIRST() function returns the first value of the selected column.

SQL FIRST() Syntax

SELECT FIRST(OrderPrice) AS FirstOrderPrice FROM Orders

The MAX() Function

The MAX() function returns the largest value of the selected column.

SQL MAX() Syntax

SELECT MAX(column_name) as (Alias_Column_name) FROM table_name

The MIN() Function

The MIN() function returns the smallest value of the selected column.

SQL MIN() Syntax

SELECT MIN(column_name) as (Alias_Column_name) FROM table_name

Page 154: 86043838 Datastage Interview

The SUM() Function

The SUM() function returns the total sum of a numeric column.

SQL SUM() Syntax

SELECT SUM(column_name) as (Alias_Column_name) FROM table_name

The GROUP BY Statement

The GROUP BY statement is used in conjunction with the aggregate functions to

group the result-set by one or more columns.

SQL GROUP BY Syntax

SELECT column_name, aggregate_function(column_name)

FROM table_name

WHERE column_name operator value

GROUP BY column_name

The HAVING Clause

The HAVING clause was added to SQL because the WHERE keyword could not be

used with aggregate functions.

SQL HAVING Syntax

SELECT column_name, aggregate_function(column_name)

FROM table_name

WHERE column_name operator value

GROUP BY column_name

HAVING aggregate_function(column_name) operator value
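For example, with the Orders table used in the AVG() example above, the following lists each customer whose total order value exceeds 1000:

SELECT Customer, SUM(OrderPrice) AS TotalOrderPrice
FROM Orders
GROUP BY Customer
HAVING SUM(OrderPrice) > 1000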

Page 155: 86043838 Datastage Interview

The Upper() function converts the value of a field to uppercase.

Syntax for SQL Server

SELECT UPPER(column_name) FROM table_name

The lower() function converts the value of a field to lowercase.

Syntax for SQL Server

SELECT lower(column_name) FROM table_name

The MID() function is used to extract characters from a text field.

SQL MID() Syntax

SELECT MID(column_name,start[,length]) FROM table_name

Example

SELECT MID(City,1,4) as SmallCity FROM Persons

The LENGTH() Function

The LENGTH() function returns the length of the value in a text field.

SQL LENGTH() Syntax

SELECT LENGTH(column_name) FROM table_name

The ROUND() Function

The ROUND() function is used to round a numeric field to the number of

decimals specified.

SQL ROUND() Syntax

SELECT ROUND(column_name,decimals) FROM table_name

Page 156: 86043838 Datastage Interview

Sometimes it is necessary to combine (concatenate) the results from several different fields. Each database provides a way to do this:

MySQL: CONCAT()

Oracle: CONCAT(), ||

SQL Server: +

Example :- MySQL/Oracle:

SELECT CONCAT(Column1,Column2) FROM Geography

WHERE Column2 = 'Boston';

Oracle:

SELECT Column1 || ' ' || Column2 FROM Geography

WHERE Column2 = 'Boston';

SQL Server:

SELECT Column1 + ' ' + Column2 FROM Geography

WHERE Column2 = 'Boston';

The substring function is used to grab a portion of the stored data. This function is called differently for the different databases:

MySQL: SUBSTR(), SUBSTRING()

Oracle: SUBSTR()

SQL Server: SUBSTRING()

Example 1 :- SELECT SUBSTR(store_name, 3)

FROM Geography

WHERE store_name = 'Los Angeles';

Example 2 :- SELECT SUBSTR(store_name,2,4)

FROM Geography

WHERE store_name = 'San Diego';

STRING FUNCTION

Page 157: 86043838 Datastage Interview

The INSTR function is used to find the starting location of a pattern in a string. This function is available in MySQL and Oracle, though they have slightly different syntaxes. The syntax for the INSTR function is as follows:

MySQL: INSTR(str, pattern): finds the starting location of pattern in string str.

Oracle: INSTR(str, pattern, [starting position, [nth location]]):

Example 1 :-SELECT INSTR(store_name,'o')

FROM Geography

WHERE store_name = 'Los Angeles';

Example 2 :- SELECT INSTR(store_name,'p')

FROM Geography

WHERE store_name = 'Los Angeles';

Example 3 :- SELECT INSTR(store_name,'e', 1, 2)

FROM Geography

WHERE store_name = 'Los Angeles';

The TRIM function is used to remove a specified prefix or suffix from a string. The most common pattern being removed is white space. This function is called differently in different databases:

MySQL: TRIM(), RTRIM(), LTRIM()

Oracle: RTRIM(), LTRIM()

SQL Server: RTRIM(), LTRIM()

Example 1 :- SELECT TRIM(' Sample ');

Example 2 :- SELECT LTRIM(' Sample ');

Example 3 :- SELECT RTRIM(' Sample ');

Page 158: 86043838 Datastage Interview

The length function is used to get the length of a string. This function is called differently for the different databases:

MySQL: LENGTH()

Oracle: LENGTH()

SQL Server: LEN()

Example 1 :- SELECT Length(store_name)

FROM Geography

WHERE store_name = 'Los Angeles';

Example 2 :- SELECT region_name, Length(region_name)

FROM Geography;

The replace function is used to update the content of a string. The function call is REPLACE() for MySQL, Oracle, and SQL Server. The syntax of the Replace function is

Syntax : -

Replace(str1, str2, str3): In str1, find where str2 occurs, and replace it with str3.

Example :- SELECT REPLACE(region_name, 'ast', 'astern')

FROM Geography;

The DATEADD function is used to add an interval to a date. This function is available in SQL Server.

The usage for the DATEADD function is

DATEADD (datepart, number, expression)

Example :- SELECT DATEADD(day, 10,'2000-01-05 00:05:00.000');

The DATEDIFF function is used to calculate the difference between two dates, and is used in MySQL and

SQL Server.

Example :- SELECT DATEDIFF(day, '2000-01-10','2000-01-05');

DATEPART is a SQL Server function that extracts a specific part of the date/time value. Its

syntax is as follows:

DATEPART (part_of_day, expression)

DATE FUNCTION (SQL SERVER)

Page 159: 86043838 Datastage Interview

Example :- SELECT DATEPART (yyyy,'2000-01-20');

Example :- SELECT DATEPART(dy, '2000-02-10');

GETDATE() is used to retrieve the current database system time in SQL Server. Its syntax is GETDATE().

Example :- SELECT GETDATE();

SYSDATE is used to retrieve the current database system time in Oracle and MySQL.

Example :- SELECT SYSDATE FROM DUAL;

Page 160: 86043838 Datastage Interview

Installation log files

Troubleshooting

Page 161: 86043838 Datastage Interview

%TEMP%\ibm_is_logs

Troubleshooting